The asset index is often used as a measure of socioeconomic status in empirical research, either as an explanatory variable or to control for confounding. Principal component analysis (PCA) is frequently used to create the asset index. We conducted a simulation study to explore how accurately the principal component based asset index reflects the study subjects’ actual poverty level when the actual poverty level is generated by a simple factor analytic model. In the simulation study using the PC-based asset index, only 1% to 4% of subjects preserved their real position in a quintile scale of assets; between 44% and 82% of subjects were misclassified into the wrong asset quintile. If the PC-based asset index explained less than 30% of the total variance in the component variables, then we consistently observed more than 50% misclassification across quintiles of the index. The frequency of misclassification suggests that the PC-based asset index may not provide a valid measure of poverty level and should be used cautiously as a measure of socioeconomic status.

Socioeconomic status (SES) is commonly measured in social science and public health research by combining diverse factors including wealth, education level and occupation [

To derive the asset index, researchers commonly gather information on asset ownership, usually by administering a questionnaire, and then apply principal component analysis (PCA) as a data compression technique. PCA generates as many principal components as there are variables in the dataset. The first principal component (PC) is the weighted sum of the observed asset variables that accounts for more of the variability in the observed data than any other principal component. This first PC is taken as the asset index [
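The construction described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited study: the function name `first_pc_index`, the toy data, and the choice to standardize each asset variable (so that PCA runs on the correlation matrix) are all assumptions made for the example.

```python
import numpy as np

def first_pc_index(assets):
    """Return (scores, loadings, explained) for the first principal component.

    `assets` is an (n_subjects, n_assets) array. ASSUMPTION: each column is
    standardized first, so the PCA is effectively run on the correlation matrix.
    """
    x = np.asarray(assets, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.cov(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    loadings = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
    explained = eigvals[-1] / eigvals.sum()   # share of total variance on the first PC
    scores = z @ loadings                     # weighted sum of the standardized assets
    return scores, loadings, explained

# Hypothetical toy data: 6 households, 3 binary asset indicators.
toy = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
])
scores, loadings, explained = first_pc_index(toy)
```

The sign of an eigenvector is arbitrary, so the sign of `scores` carries no meaning until the loadings are inspected.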

Conceptually, there is a “true” measure of socioeconomic status that cannot be determined and is associated with various outcomes, for example, a specific health outcome. Since we cannot determine the true measure of socioeconomic status, we measure either related proxy variables, such as income, or manifest variables, such as presence of assets. Economic proxy and manifest variables are assumed to represent a person’s true economic status. When proxy variables are not available, researchers may use an asset index derived using PCA [

Several authors who have applied the PC-based asset index have attempted to validate its credibility in different ways [

Howe et al. compared four different methods to measure an asset index, including applying PCA on all categories of asset variables and applying PCA on binary coded asset variables. Using the data from the 2004-2005 Malawi Integrated Household Survey, they found that the PC-based indices had only modest agreement (kappa = 0.11 and 0.10) with consumption expenditure, an intensive measure of household wealth that economists use as the optimal measure of income and welfare [

Kolenikov and Angeles used simulations to assess the performance of a PCA-based asset index for ranking the subjects compared to simulated welfare. They reported that the PCA-based asset index misclassified subjects into the wrong asset quintiles when compared to welfare quintiles, but did not explore the reasons behind the misclassification [

Howe et al. performed a systematic review of 17 articles with 36 datasets to see how the PC-based asset index performed compared to consumption expenditure and found that most of the asset indices poorly reflected consumption expenditure. The study considered different measures of asset indices in addition to PC-based asset indices but did not focus on reasons for poor performance of the asset indices [

In the published literature on asset index measurement using PCA, the proportion of variance explained by the first PC was low, ranging from 12% to 34% [

The use of misclassified covariates to control confounding can bias the exposure-disease association estimates [

In this study, we performed a simulation experiment. In each simulation, we generated 100 random numbers from uniform distributions over five different non-overlapping ranges as a measure of the asset index. This simulated asset index was considered the true asset index of a group of 100 subjects. We then generated the asset variables from pre-specified loadings and the simulated index through a confirmatory factor model as described in Kolenikov and Angeles [

Data generating process in the simulation

● We generated artificial latent factor

● We considered four normalized loading vectors: vector 1 = (0.79, 0.54, 0.13, 0.01, 0.26), vector 2 = (0.73, 0.52, −0.20, 0.00, −0.4), vector 3 = (0.67, 0.4, −0.5, −0.01, −0.4) and vector 4 = (−0.02, 0.14, −0.57, −0.51, −0.63).

● We generated the data matrix

● We generated five-dimensional random variables using the loading vectors and standard normal errors

● We performed PCA on the generated asset variables.

Flowchart of the simulation.
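One replication of the data-generating process can be sketched as follows. The four loading vectors are those listed above. Everything else is an assumption made for illustration: the specific ranges of the uniform distributions, the equal split of 20 subjects per range, the one-factor form (asset = loading × latent index + standard normal error), PCA on the covariance matrix, and the names `LOADINGS` and `simulate_once`.

```python
import numpy as np

# Loading vectors as listed in the data-generating box.
LOADINGS = {
    1: np.array([0.79, 0.54, 0.13, 0.01, 0.26]),
    2: np.array([0.73, 0.52, -0.20, 0.00, -0.4]),
    3: np.array([0.67, 0.4, -0.5, -0.01, -0.4]),
    4: np.array([-0.02, 0.14, -0.57, -0.51, -0.63]),
}

def simulate_once(loading, rng, n_per_range=20):
    """Generate one true index, its asset variables, and the first PC score."""
    # ASSUMPTION: five non-overlapping ranges [0,1), [2,3), ..., [8,9),
    # with the same number of subjects drawn from each range.
    lows = np.arange(5) * 2.0
    true_index = np.concatenate(
        [rng.uniform(lo, lo + 1.0, n_per_range) for lo in lows]
    )
    # ASSUMPTION: one-factor model with standard normal errors.
    errors = rng.standard_normal((true_index.size, loading.size))
    assets = np.outer(true_index, loading) + errors
    # PCA via eigendecomposition of the sample covariance matrix.
    centered = assets - assets.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    pc_score = centered @ eigvecs[:, -1]      # sign of the eigenvector is arbitrary
    explained = eigvals[-1] / eigvals.sum()
    return true_index, pc_score, explained

rng = np.random.default_rng(0)
true_index, pc_score, explained = simulate_once(LOADINGS[1], rng)
```

Repeating `simulate_once` many times per loading vector and storing `explained` alongside the rank comparisons would reproduce the overall shape of the experiment, though not necessarily its numbers.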

We then tested the performance of a PC-based asset index without any specific distributional assumption by using a real measurement of expenditure data, collected from an intensive qualitative survey, that was a skewed proxy measure of economic status. We performed a similar experiment using the expenditure data instead of the simulated asset index. The only difference was that in the simulation experiment we generated a different asset index for each set of loadings, whereas with the expenditure data the asset index was fixed. We generated the artificial asset variables using the observed expenditure data and the same weights used with the simulated asset index. To make the results comparable with the units of the simulated asset index, we standardized the expenditure data using

We repeated the experiment 10,000 times for each of the four models, for a total of 40,000 replications for both the simulated index and the real expenditure data. If the PC score retained the order of the true index, each subject's position in the two orderings should be the same. We recorded the number of subjects whose position was the same in both orderings, which we defined as the frequency of unchanged positions.
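Counting unchanged positions amounts to comparing two rank vectors. The sketch below is one way to do this; the helper names `ranks` and `unchanged_positions` and the toy values are hypothetical, and ties are broken by order of appearance, which is an assumption.

```python
import numpy as np

def ranks(values):
    """0-based rank of each element (ties broken by order of appearance)."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    rank = np.empty(values.size, dtype=int)
    rank[order] = np.arange(values.size)
    return rank

def unchanged_positions(true_index, pc_score):
    """Number of subjects holding the same rank in both orderings."""
    return int(np.sum(ranks(true_index) == ranks(pc_score)))

# Hypothetical toy example: subjects 2 and 3 swap places in the PC ordering.
true_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
score_toy = [0.1, 0.5, 0.3, 0.9, 1.2]
n_same = unchanged_positions(true_toy, score_toy)
```

Here three of the five subjects keep their position. The comparison uses the score as given; if the first PC comes out with a flipped sign, the whole ordering reverses.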

We estimated the mean degree of misclassification in the PC-based asset index by classifying both the true index and the PC score into quintiles and counting the number of observations whose quintile membership differed. The probability of misclassification was estimated by dividing the total number of observations classified into different quintiles by the total number of observations in
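The quintile comparison can be illustrated with a short sketch. The rank-based quintile assignment and the names `quintile` and `misclassification_probability` are assumptions for this example; other quintile conventions (for instance, cut points from empirical quantiles) would handle ties differently.

```python
import numpy as np

def quintile(values):
    """Assign quintiles 0..4 by rank so that each holds an equal share."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    rank = np.empty(values.size, dtype=int)
    rank[order] = np.arange(values.size)
    return rank * 5 // values.size

def misclassification_probability(true_index, pc_score):
    """Share of subjects whose quintile differs between the two measures."""
    return float(np.mean(quintile(true_index) != quintile(pc_score)))

# Hypothetical toy example with 10 subjects (two per quintile).
true_toy = np.arange(10.0)
score_toy = true_toy.copy()
score_toy[[1, 2]] = score_toy[[2, 1]]   # swap two subjects across a quintile boundary
p_swap = misclassification_probability(true_toy, score_toy)       # 2 of 10 differ
p_reversed = misclassification_probability(true_toy, -true_toy)   # only the middle quintile matches
```

A fully reversed ordering still leaves the middle quintile intact, which is why the reversed case yields 0.8 rather than 1.0.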

We stored the proportion of variance explained by the first PC and the estimates of the loadings for each replication. We explored the dependency pattern among the frequency of unchanged positions, the probability of misclassification, and the proportion of explained variance using scatter plots. To assess how the five loading estimates affected the relationship between the proportion of explained variance and asset quintile misclassification, we constructed a parallel coordinate plot. In this plot, the five elements of each loading vector estimate were plotted on five parallel vertical axes (E1-E5) and the plotted points were connected horizontally. Each connected line corresponded to the loading vector estimate from one simulation. Finally, we used different colors for the two clusters and identified the characteristics of loading estimates between the two clusters.

Our simulation study showed a range of 0% to 98% misclassification in the PC-based asset quintile. The median probability of misclassification varied from 44% to 82% depending on the different loading vector used for data generation. (Table

Descriptive statistics of the number of unchanged order, and probability of misclassification into the wrong quintile for four different vectors in simulated data

| Statistic | Number of unchanged positions | | | | Dispersion of position | | | | Probability of misclassification | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Vector 1 | Vector 2 | Vector 3 | Vector 4 | Vector 1 | Vector 2 | Vector 3 | Vector 4 | Vector 1 | Vector 2 | Vector 3 | Vector 4 |
| Minimum | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 5 | 0 | 0 | 0 | .04 |
| First quartile | 2 | 1 | 0 | 0 | 20 | 24 | 38 | 42 | .25 | .30 | .46 | .49 |
| Median | 4 | 3 | 1 | 1 | 37 | 44 | 93 | 76 | .44 | .50 | .82 | .67 |
| Mean | 7 | 6 | 4 | 2 | 38 | 50 | 68 | 69 | .41 | .51 | .65 | .66 |
| Third quartile | 7 | 6 | 4 | 3 | 52 | 92 | 97 | 97 | .55 | .82 | .89 | .87 |
| Maximum | 98 | 98 | 98 | 26 | 99 | 99 | 99 | 99 | .97 | .98 | .98 | .97 |

The pattern of the probability of misclassification using real expenditure data was similar to that of the simulated asset index. The observed misclassification ranged from 0% to 96%. The median probability of misclassification varied from 68% to 79% (Table

Descriptive statistics of the number of unchanged order, and probability of misclassification into the wrong quintile for four different vectors for real expenditure data

| Statistic | Number of unchanged positions | | | | Dispersion of position* | | | | Probability of misclassification | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Vector 1 | Vector 2 | Vector 3 | Vector 4 | Vector 1 | Vector 2 | Vector 3 | Vector 4 | Vector 1 | Vector 2 | Vector 3 | Vector 4 |
| Minimum | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 13 | 0 | 0 | 0 | .10 |
| First quartile | 1 | 1 | 0 | 0 | 56 | 63 | 73 | 71 | .55 | .59 | .67 | .65 |
| Median | 2 | 2 | 1 | 1 | 76 | 81 | 88 | 86 | .68 | .72 | .79 | .76 |
| Mean | 4 | 3 | 2 | 2 | 70 | 74 | 80 | 80 | .64 | .68 | .74 | .73 |
| Third quartile | 5 | 4 | 3 | 3 | 89 | 93 | 96 | 95 | .79 | .84 | .87 | .86 |
| Maximum | 86 | 95 | 88 | 19 | 99 | 99 | 99 | 99 | .96 | .96 | .97 | .96 |

*The real expenditure data have 112 observations, so the dispersion of position was rescaled to 100.

The simulated data subjects were much more likely to retain their position when the first PC explained a large fraction (>90

In this article we evaluated whether PCA retained the order of subjects based on a true asset index using a simulation experiment. We also used expenditure data collected in a different study to address the distributional limitations of the simulated asset index.

We found that PCA does not reliably maintain the order of the true asset index. PCA changed the position of up to 98% of subjects, and the magnitude of the position change was usually enough to classify the subjects into the wrong asset quintiles. We observed a relatively higher probability of misclassification when we treated the observed expenditure data, which were positively skewed, as the true index. A skewed distribution of the underlying latent factor thus increased the probability of misclassification in a PC-based index. Our findings are supported by Kolenikov and Angeles [

In our simulations, the sign of the loading of the asset variables retained by the PCA was an important determinant of the probability of misclassification in a PC based asset quintile. A change in the sign means a change in the direction of contribution of an asset variable to the index. In the real world, an asset might positively contribute to relative wealth, but in the PC-based index, this might appear negatively. For example, the loading of agricultural land appeared with a negative sign in the PC-based asset index in Howe et al. [

A higher proportion of variance explained by the first PC increases the probability of generating an index that reflects the underlying economic status. For the first PC to explain a higher proportion of the variance in the dataset, the variables should be well correlated with each other. Based on their correlation structure, asset variables might fall into subgroups or might be redundant. When this occurs, the first PC represents the subgroup of variables that contains the major source of variability in the total dataset and may not account for the contribution of all variables [
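The link between correlation and explained variance can be made concrete with an idealized case that is not part of the simulation: assume $p$ standardized asset variables that all share the same pairwise correlation $\rho \ge 0$. The symbols $p$, $\rho$, $R$ and $\lambda_{\max}$ are introduced here only for this illustration.

```latex
R = (1-\rho)\, I_p + \rho\, \mathbf{1}\mathbf{1}^{\top},
\qquad
\lambda_{\max}(R) = 1 + (p-1)\rho,
\qquad
\frac{\lambda_{\max}(R)}{\operatorname{tr}(R)} = \frac{1 + (p-1)\rho}{p}.
```

With $p = 5$, a common correlation of $\rho = 0.125$ gives a first PC that explains $1.5/5 = 30\%$ of the variance, while $\rho = 0.5$ gives $3/5 = 60\%$. Under this simplification, a first PC explaining less than 30% corresponds to five assets that are only weakly correlated with each other.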

To use PCA for an asset index, the sign of the loadings should be examined in addition to the proportion of variance explained by the first PC in order to increase our confidence in the accuracy of the ranking of real wealth. The signs of the loadings should be consistent with our understanding of what constitutes wealth in the study population. Additionally, checking consistency between wealth groups with respect to their existing asset variables and checking the robustness of the asset index with regard to different asset variables could help measure the level of reliability, as was done by Filmer and Pritchett [
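The two checks recommended above can be sketched as a small screening function. This is an illustration only: the name `check_index` is hypothetical, the function assumes that every asset is expected to contribute positively to wealth, and the 30% cut-off is taken from the threshold reported in this study rather than from any general rule.

```python
import numpy as np

def check_index(loadings, explained, min_explained=0.30):
    """Flag a low explained-variance share and loadings with an unexpected sign.

    ASSUMPTION: every asset variable should load positively on wealth.
    """
    loadings = np.asarray(loadings, dtype=float)
    # The sign of a principal component is arbitrary, so orient the vector
    # so that most loadings are positive before looking for exceptions.
    if np.sum(loadings < 0) > np.sum(loadings > 0):
        loadings = -loadings
    negative = np.flatnonzero(loadings < 0).tolist()
    return {
        "low_explained_variance": bool(explained < min_explained),
        "negative_loadings": negative,
    }

# Loading vectors 2 and 4 from the simulation; the explained-variance
# values passed here are hypothetical.
report_2 = check_index([0.73, 0.52, -0.20, 0.00, -0.4], explained=0.25)
report_4 = check_index([-0.02, 0.14, -0.57, -0.51, -0.63], explained=0.50)
```

A flagged loading is a prompt to revisit how that asset was coded or whether it belongs in the index, not proof that the index is invalid.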

Although the PC-based asset index is a poor proxy for the standard measure, consumption expenditure, it continues to be used because it is much easier to deploy [

To even engage in an exploration of possible algorithms applied to proxies of economic status, and to examine those against a standard, implies accepting that the underlying data-generating distribution follows this model. Ideally, there would exist a measurable standard against which we could compare algorithms applied to proxies, and thus be able to argue for one approach over another based on estimates of risk (e.g., the probability of misclassifying the quintile to which a subject belongs). However, such a measurable standard does not exist for economic status. We have taken an approach that would identify which algorithms applied to proxies are best, with regard to some loss function, at predicting the latent variable under the best circumstances, where this sort of latent variable model is true. Thus the results should be interpreted knowing that the possible simulations (data-generating models) and possible methods for summarizing the manifest variables are but a tiny subset of the possible combinations. Our conclusions are meant to provide some intuition for problems that could arise, but of course cannot be seen as proof by simulation.

Kolenikov and Angeles [

Through repeated simulation experiments using artificial and real proxy data for latent variables, we showed that PCA does not retain the order of the true asset index and provides a high proportion of misclassification into the asset quintiles. Since the first PC score does not reliably maintain the original order of a latent construct, we should search for an alternative index that maintains the original order.

If investigators use PCA to create an asset index, they should report the proportion of variance explained and the loadings. Careful selection of asset variables, proper measurements and coding, and suitable correlation estimates of categorical asset variables are recommended to increase the variability explaining capacity of the first PC. If the proportion of explained variance is less than 30%, the risk of misclassification could be high (≥50

The authors declare that they have no competing interests.

YS developed the simulation study design and drafted the manuscript, MN provided theoretical support in developing study design, JA provided input in programming, and data presentation, SL supervised the process. All the authors critically reviewed, provided intellectual input to the manuscript and approved the final version of the manuscript.

The authors thank Alan E. Hubbard for his critical review and input to the manuscript, Nadia Ali Rimi for providing the data set, and Dorothy Southern and Diana DiazGranadoz for assistance in manuscript writing. icddr,b is thankful to the Governments of Australia, Bangladesh, Canada, Sweden and the UK for providing unrestricted support.