Annals of Advanced Biomedical Sciences, Research Article

Principal Component Analysis of Hierarchical and Multivariate Data

Adugna DG, Koge D, Teklezgi MG* and Ezgi T*
* Corresponding author
ISSN: 2641-9459  DOI: 10.23880/aabsc-16000209  Received: May 11, 2023  Published: May 22, 2023
Keywords
Principal Component; Variance-Covariance Structure
Abstract

Principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. To determine how many principal components should be retained, the eigenvalues are examined first. This study investigated whether the processing of numbers depends on the way the numbers are presented, and whether the data could be reduced. Four variables (WordDiff, WordSame, ArabicDiff and ArabicSame) were used. Both the covariance and the correlation matrix were used to obtain the principal components, and the results were compared. In addition, canonical correlation analysis was used to examine the correlation between the set of Word variables and the set of Arabic variables. To describe the correlation between the Word variables and the Arabic variables, the canonical correlation between the first Word canonical variable and the first Arabic canonical variable is sufficient. The data were reduced to a single principal component (PC1), as more than 80% of the total variability was explained by this principal component.

Introduction

Principal components are uncorrelated linear combinations of the original variables that depend on the correlation or covariance matrix. Principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Its general objectives are data reduction and interpretation [1]. Although p components are required to reproduce the total system variability, often much of this variability can be accounted for by a small number k of the principal components. If so, there is (almost) as much information in the k components as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measurements on p variables, is reduced to a data set consisting of n measurements on k principal components [2].

An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result. Analyses of principal components are more of a means to an end than an end in themselves, because they frequently serve as intermediate steps in much larger investigations. For example, principal components may be inputs to a multiple regression or cluster analysis. Moreover, (scaled) principal components are one “factoring” of the covariance matrix for the factor analysis model. To determine how many principal components should be retained, the eigenvalues are examined first. In this study, both the covariance and the correlation matrix were used to obtain the principal components, and the results were compared.

Canonical correlation analysis is concerned with the amount of linear relationship between two sets of variables. Researchers may measure two types of variables on each research unit and be interested in the relationship between the two sets, for example between a set of academic achievement variables and a set of job success variables. In such cases canonical correlation analysis is an appropriate method. If Y = (y1, y2, . . . , yg) and X = (x1, x2, x3, . . . , xg) are two sets of variables measured on the same sampling unit, the overall correlation between Y and X is obtained using canonical correlation analysis. Canonical correlation is an extension of multiple correlation, which is the correlation between one y and several x’s [3].

Data and Variable Description: Thirty-two subjects were required to make a series of quick numerical judgments about two numbers presented as either two number words (e.g. “two”, “four”) or two single Arabic digits (e.g. “2”, “4”). The subjects were asked to respond “same” if the two numbers had the same parity (both even or both odd) and “different” if the two numbers had a different parity (one even, one odd). Half of the subjects were assigned a block of Arabic digit trials, followed by a block of number word trials, and half of the subjects received the blocks of trials in the reverse order. Within each block, the order of “same” and “different” parity trials was randomized for each subject. For each of the four combinations of parity and format, the median reaction time for correct responses was recorded for each subject.

The variables available in the data set are given as:

  • WordDiff: is median reaction time for word format and different parity combination.
  • WordSame: is median reaction time for word format and same parity combination.
  • ArabicDiff: is median reaction time for Arabic format and different parity combination.
  • ArabicSame: is median reaction time for Arabic format and same parity combination.

Objectives of the study

As the data were collected to test psychological models of numerical cognition, the main objective of this study was to investigate whether the processing of numbers depends on the way the numbers are presented (words or Arabic digits). In addition, the specific objectives were to: (1) investigate whether it is possible to reduce the data, and (2) assess the correlation between the Word variables (WordDiff and WordSame) and the Arabic variables (ArabicDiff and ArabicSame). Some important formulas: The correlation between a PC (Yj) and an original variable (Xk) was calculated using the formulas given below.

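$$\rho_{Y_j, X_k} = \frac{e_{jk}\sqrt{\lambda_j}}{\sqrt{\sigma_{kk}}} \qquad (1)$$

or, using the standardized formula,

$$\rho_{Y_j, X_k} = e_{jk}\sqrt{\lambda_j} \qquad (2)$$

where $e_{jk}$ is the coefficient of $X_k$ in the $j$-th principal component, $\lambda_j$ is the $j$-th eigenvalue and $\sigma_{kk}$ is the variance of $X_k$.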
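As a brief sketch of where this correlation formula comes from, assume the usual notation (not spelled out in the text): $\Sigma$ is the covariance matrix of the original variables, $(\lambda_j, e_j)$ is its $j$-th eigenvalue-eigenvector pair, $u_k$ is the $k$-th unit vector, and $Y_j = e_j'X$. Then:

```latex
\begin{aligned}
\operatorname{Cov}(Y_j, X_k) &= e_j' \Sigma\, u_k = \lambda_j\, e_j' u_k = \lambda_j e_{jk},
\qquad
\operatorname{Var}(Y_j) = e_j' \Sigma\, e_j = \lambda_j, \\[4pt]
\rho_{Y_j, X_k} &= \frac{\lambda_j e_{jk}}{\sqrt{\lambda_j}\,\sqrt{\sigma_{kk}}}
 = \frac{e_{jk}\sqrt{\lambda_j}}{\sqrt{\sigma_{kk}}}.
\end{aligned}
```

For standardized variables $\sigma_{kk} = 1$, and the eigenpairs are taken from the correlation matrix, which gives the standardized form $e_{jk}\sqrt{\lambda_j}$. As a numerical illustration with the values reported in Table 5 for PC1 and WordDiff, $0.660421 \times \sqrt{76576.3296} / \sqrt{36178.35081} \approx 0.961$.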

Exploratory Data Analysis: Table 1 shows that the standard deviation is highest for WordDiff (190.206) and lowest for ArabicSame (114.024). The same ordering holds for the means: the mean of the median reaction time for the word format and different parity combination (WordDiff) is 967.562, the highest of all, whereas the mean for the Arabic format and same parity combination (ArabicSame) is 710.938, the lowest.

| | WordDiff | WordSame | ArabicDiff | ArabicSame |
|---|---|---|---|---|
| Mean | 967.562 | 875.609 | 825.312 | 710.938 |
| Std | 190.206 | 150.325 | 135.97 | 114.024 |

Table 1: Summary statistics of the variables.

Canonical Correlation Analysis: As shown in Table 2 (part 2.1), the largest correlation between a Word variable and an Arabic variable is 0.824, between WordSame and ArabicSame. The within-set correlations are 0.907 between WordDiff and WordSame, and 0.761 between ArabicSame and ArabicDiff.

Part 2.1 Correlation among the Word and Arabic variables

| Variables | WordDiff | WordSame | ArabicDiff | ArabicSame |
|---|---|---|---|---|
| WordDiff | 1 | 0.907 | 0.713 | 0.734 |
| WordSame | 0.907 | 1 | 0.698 | 0.824 |
| ArabicDiff | 0.713 | 0.698 | 1 | 0.761 |
| ArabicSame | 0.734 | 0.824 | 0.761 | 1 |

Part 2.2 Canonical correlation analysis

| | Canonical correlation | Adjusted canonical correlation | Approximate standard error | Squared canonical correlation |
|---|---|---|---|---|
| 1 | 0.831 | 0.822 | 0.056 | 0.69 |
| 2 | 0.333 | | 0.16 | 0.111 |

Table 2: Correlation among the Word and Arabic variables, and canonical correlation analysis.

Picking the largest correlation between a Word variable and an Arabic variable is not satisfactory, as we lose the information in the remaining variables. Instead, we form a linear combination of the Word variables and a linear combination of the Arabic variables and examine the correlation between the two linear combinations. Among all such pairs, we choose the linear combinations that maximize this correlation.

The first canonical correlation (0.831) is large whereas the second (0.333) is small (Table 2, part 2.2). These canonical correlations provide overall information on the degree of association between the Word and Arabic characteristics. The first canonical correlation, 0.831, is slightly larger than the largest original correlation between a Word variable and an Arabic variable, which is 0.824 (Table 2, part 2.1).
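The canonical correlations can be recomputed from the correlation matrix alone. The following is a minimal sketch, assuming the four-decimal correlations printed in Table 6 (part 6.1) and the standard result that the squared canonical correlations are the eigenvalues of R11^-1 R12 R22^-1 R21; the function name `canonical_correlations` is illustrative and is not taken from the software used in the study.

```python
import numpy as np

# Correlations from Table 6 (part 6.1).
# Variable order: WordDiff, WordSame, ArabicDiff, ArabicSame.
R = np.array([
    [1.0000, 0.9071, 0.7133, 0.7336],
    [0.9071, 1.0000, 0.6977, 0.8235],
    [0.7133, 0.6977, 1.0000, 0.7611],
    [0.7336, 0.8235, 0.7611, 1.0000],
])

def canonical_correlations(R, p):
    """Canonical correlations between the first p variables and the rest."""
    R11, R12 = R[:p, :p], R[:p, p:]
    R21, R22 = R[p:, :p], R[p:, p:]
    # Squared canonical correlations are the eigenvalues of
    # R11^-1 R12 R22^-1 R21 (solve() avoids forming explicit inverses).
    M = np.linalg.solve(R11, R12) @ np.linalg.solve(R22, R21)
    squared = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(squared)

can_corr = canonical_correlations(R, p=2)
print(np.round(can_corr, 3))
```

With these inputs the two values should come out close to the 0.831 and 0.333 reported in Table 2 (part 2.2); small differences are expected because the inputs are rounded.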

In Figure 1, there appears to be a direct linear relationship between the first Word canonical variable (Word1) and the first Arabic canonical variable (Arabic1), indicating that they are correlated. This suggests that the first canonical correlation between the Word variables and the Arabic variables is substantial.

Figure 1: First canonical pair plot.

The second canonical pair plot (Figure 2) indicates that there is little or no correlation between the second Word canonical variable (Word2) and the second Arabic canonical variable (Arabic2). This suggests that the second canonical correlation is not very important.

Figure 2: Second canonical pair plot.

Formal tests for the canonical correlations can also be carried out. These tests assume normality and are obtained by comparing the between (B) and within (W) correlations. The results are presented below (Table 3).

Part 3.1 Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 2.2294 | 2.1045 | 0.9469 | 0.9469 |
| 2 | 0.1249 | | 0.0531 | 1 |

Part 3.2 Multivariate Statistics and F Approximations

| Statistic | Value | F Value | Num df | Den df | Pr>F |
|---|---|---|---|---|---|
| Wilks’ Lambda | 0.28 | 12.68 | 4 | 56 | <0.0001* |
| Pillai’s Trace | 0.8 | 9.7 | 4 | 58 | <0.0001* |
| Hotelling-Lawley Trace | 2.35 | 16.3 | 4 | 59 | <0.0001* |
| Roy’s Greatest Root | 2.23 | 32.33 | 2 | 29 | <0.0001* |

Part 3.3 Test that the canonical correlations in the current row and all that follow are zero

| | Likelihood Ratio | Approximate F Value | Numerator DF | Denominator DF | Pr>F |
|---|---|---|---|---|---|
| 1 | 0.275 | 12.68 | 4 | 56 | <0.0001* |
| 2 | 0.889 | 3.62 | 1 | 29 | 0.067 |

Table 3: Formal tests for canonical correlation (CC).

* Significant at the 5% level of significance.

The eigenvalues of Inv(E)*H were obtained to better assess the importance of the canonical correlations between the Arabic and Word variables. As given in Table 3 (part 3.1), the first canonical pair explains 94.7% of the common structure variability, whereas the second canonical pair explains only about 5.3% and will probably not be taken into account.
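The relation in the heading of Table 3 (part 3.1), eigenvalue = CanRsq/(1-CanRsq), can be checked directly. The sketch below assumes only the rounded canonical correlations from Table 2 (part 2.2), so the results agree with Table 3 only up to rounding.

```python
# Canonical correlations as reported in Table 2 (part 2.2).
can_corr = [0.831, 0.333]

# Eigenvalues of Inv(E)*H: squared canonical correlation over its complement.
eigenvalues = [r**2 / (1 - r**2) for r in can_corr]

# Share of the common structure attributed to each canonical pair.
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

for k, (ev, prop) in enumerate(zip(eigenvalues, proportions), start=1):
    print(f"pair {k}: eigenvalue = {ev:.4f}, proportion = {prop:.4f}")
```

The first proportion is the 94.7% quoted above; the second is the remaining share of about 5.3%.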

In addition, formal tests were conducted to further investigate the importance of the canonical correlations. Based on the four test statistics shown in Table 3 (part 3.2), the p-values for all the tests are below the 5% level of significance, so we reject the null hypothesis that the canonical correlations are equal to zero. It can be concluded that the canonical correlation between the Word variables and the Arabic variables is important and that the two sets of variables are linearly related.

Next, we identify how many canonical correlations are needed, that is, how many are different from zero. Under the normality assumption, the null hypothesis is that the canonical correlation in the current row and all that follow are zero. From the likelihood ratio test in Table 3 (part 3.3), the p-value for the first row (<0.0001) is below the 5% significance level, so we reject H0 and conclude that the first canonical correlation is important. For the second canonical correlation in row 2, the p-value of 0.067 is greater than the 5% significance level, so we do not reject H0 and conclude that the second canonical correlation is not important. Thus, for this study, the first canonical correlation is sufficient to explain the correlation between the Word variables and the Arabic variables. Since no information is provided on the units of the Word and Arabic variables, standardized canonical coefficients are used; these are given in Table 4.
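The likelihood ratios in Table 3 (part 3.3) can be sketched as products of (1 - CanRsq) over the current and all following canonical correlations. The code below assumes the rounded canonical correlations from Table 2; for the F statistic it covers only the last row, where a single correlation remains, and it takes the degrees of freedom (1 and 29) from Table 3 rather than deriving them.

```python
# Canonical correlations as reported in Table 2 (part 2.2).
can_corr = [0.831, 0.333]

def likelihood_ratios(can_corr):
    """Wilks' lambda for H0: correlation k and all that follow are zero."""
    ratios = []
    for k in range(len(can_corr)):
        lam = 1.0
        for r in can_corr[k:]:
            lam *= 1 - r**2
        ratios.append(lam)
    return ratios

ratios = likelihood_ratios(can_corr)

# F statistic for the last row only. The degrees of freedom are copied
# from Table 3 (part 3.3), an assumption of this sketch.
num_df, den_df = 1, 29
f_last = (1 - ratios[-1]) / ratios[-1] * den_df / num_df

print([round(lam, 3) for lam in ratios], round(f_last, 2))
```

The two ratios correspond to the 0.275 and 0.889 in Table 3, and the F value for row 2 corresponds to the reported 3.62.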

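Part 4.1 Word variables: canonical coefficients and correlations

| | Coefficient Word1 | Coefficient Word2 | Correlation with Word1 | Correlation with Word2 | Correlation with Arabic1 | Correlation with Arabic2 |
|---|---|---|---|---|---|---|
| WordDiff | 0.041 | -2.376 | 0.914 | -0.405 | 0.76 | -0.135 |
| WordSame | 0.963 | 2.172 | 0.999 | 0.017 | 0.831 | 0.006 |

Part 4.2 Arabic variables: canonical coefficients and correlations

| | Coefficient Arabic1 | Coefficient Arabic2 | Correlation with Arabic1 | Correlation with Arabic2 | Correlation with Word1 | Correlation with Word2 |
|---|---|---|---|---|---|---|
| ArabicDiff | 0.214 | -1.527 | 0.844 | -0.537 | 0.701 | -0.179 |
| ArabicSame | 0.828 | 1.3 | 0.99 | 0.138 | 0.823 | 0.046 |

Part 4.3 Proportions for the Word canonical variables and the Arabic canonical variables

| | Word canonical variables: Proportion | Word canonical variables: Cumulative | Arabic canonical variables: Proportion | Arabic canonical variables: Cumulative |
|---|---|---|---|---|
| 1 | 0.9177 | 0.9177 | 0.6336 | 0.6336 |
| 2 | 0.0823 | 1 | 0.0091 | 0.6427 |

Part 4.4 Proportions for the Arabic canonical variables and the Word canonical variables

| | Arabic canonical variables: Proportion | Arabic canonical variables: Cumulative | Word canonical variables: Proportion | Word canonical variables: Cumulative |
|---|---|---|---|---|
| 1 | 0.8462 | 0.8462 | 0.5842 | 0.5842 |
| 2 | 0.1538 | 1 | 0.0171 | 0.6013 |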

Table 4: Standardized canonical coefficients for all the variables and correlations between the original variables and their cano

Table 4 (part 4.1) shows that WordSame is the most important variable in the first Word canonical variable: it has the largest standardized coefficient (0.963) and the highest correlation with that canonical variable (0.999). Similarly, Table 4 (part 4.2) shows that ArabicSame is the most important variable in the first Arabic canonical variable, as it has a larger canonical coefficient (0.828) than ArabicDiff (0.214), and its correlation with the first Arabic canonical variable is 0.99. Looking at the cross-correlations between the original Arabic variables and the Word canonical variables, the first Word canonical variable has its highest correlation with ArabicSame (0.823), the variable that was also found to be important in the first Arabic canonical variable.
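The correlations in Table 4 (parts 4.1 and 4.2) also determine the proportions listed in parts 4.3 and 4.4. As a sketch, assuming standardized variables, the share of a set's variance explained by its own canonical variable is the average squared correlation, and the redundancy with respect to the opposite canonical variable is that share times the squared canonical correlation. The helper names below are illustrative only.

```python
# Correlations of each variable with its own first canonical variable,
# from Table 4 (parts 4.1 and 4.2).
word_loadings = {"WordDiff": 0.914, "WordSame": 0.999}
arabic_loadings = {"ArabicDiff": 0.844, "ArabicSame": 0.990}
first_can_corr = 0.831  # Table 2 (part 2.2)

def explained_by_own(loadings):
    """Average squared correlation: share of standardized variance explained."""
    return sum(r**2 for r in loadings.values()) / len(loadings)

def redundancy(loadings, can_corr):
    """Share of the set's variance explained by the opposite canonical variable."""
    return explained_by_own(loadings) * can_corr**2

word_own = explained_by_own(word_loadings)
arabic_own = explained_by_own(arabic_loadings)
word_redundancy = redundancy(word_loadings, first_can_corr)
arabic_redundancy = redundancy(arabic_loadings, first_can_corr)

print(round(word_own, 3), round(word_redundancy, 3))
print(round(arabic_own, 3), round(arabic_redundancy, 3))
```

Because the inputs are rounded to three decimals, the results match the first rows of Table 4 (parts 4.3 and 4.4) only approximately.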

Moreover, Table 4 (part 4.3) shows that the first Word canonical variable explains 91.8% of the variability in the Word variables. The canonical redundancy analysis further shows that the first Arabic canonical variable is a good overall predictor of the opposite set of variables, the proportion of variance explained being 0.6336. Similarly, Table 4 (part 4.4) shows that the first Arabic canonical variable explains 84.6% of the variance in the Arabic variables, and the first Word canonical variable is a good overall predictor of the opposite set of variables, the proportion of variance explained being 0.5842.

Principal Component Analysis (PCA): Here the principal component analysis is done using both the covariance and the correlation matrix.

Principal component analysis (PCA) using covariance matrix:

Part 5.1 Covariance matrix

| | WordDiff | WordSame | ArabicDiff | ArabicSame |
|---|---|---|---|---|
| WordDiff | 36178.35081 | 25936.72681 | 18447.56855 | 15909.238 |
| WordSame | 25936.72681 | 22597.75378 | 14261.73085 | 14115.934 |
| ArabicDiff | 18447.56855 | 14261.73085 | 18487.81855 | 11799.73 |
| ArabicSame | 15909.2379 | 14115.93448 | 11799.72984 | 13001.399 |

Part 5.2 Eigenvalues and the proportion of variation explained by the PCs

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 76576.3296 | 68696.4479 | 0.8483 | 0.8483 |
| 2 | 7879.8818 | 3650.763 | 0.0873 | 0.9356 |
| 3 | 4229.1188 | 2649.1268 | 0.0469 | 0.9825 |
| 4 | 1579.9921 | | 0.0175 | 1 |

Part 5.3 Coefficients of principal components

| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| WordDiff | 0.660421 | -0.46792 | -0.438288 | 0.390894 |
| WordSame | 0.518743 | -0.246012 | 0.414609 | -0.706034 |
| ArabicDiff | 0.409502 | 0.784865 | -0.412483 | -0.214832 |
| ArabicSame | 0.356451 | 0.32329 | 0.68254 | 0.550059 |

Part 5.4 Correlations between final principal components and the original variables

| | WordDiff | WordSame | ArabicDiff | ArabicSame |
|---|---|---|---|---|
| PC1 | 0.96082 | 0.95492 | 0.83341 | 0.86507 |
| p-value | <.0001 | <.0001 | <.0001 | <.0001 |
| PC2 | -0.21838 | -0.14527 | 0.5124 | 0.25168 |
| p-value | 0.2299 | 0.4276 | 0.0027 | 0.1647 |

Table 5: Principal component analysis using covariance matrix.

The covariance matrix in Table 5 (part 5.1) shows that the total variance is 90265.322, and that WordDiff and ArabicSame have the highest and the lowest variances, respectively. The sum of the eigenvalues is also 90265.322, which equals the sum of the variances of the original variables on the diagonal of the covariance matrix. Table 5 (part 5.2) shows that the first principal component explains about 84.8% of the total variability, which meets the rule of thumb of explaining at least 80% to 90%. Therefore, the data can be represented by the first principal component with an acceptable loss of information. This choice is also supported by the scree plot and cumulative proportions given in Figure 3.
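The eigenvalues and proportions in Table 5 (part 5.2) can be reproduced from the covariance matrix in part 5.1. The sketch below assumes the matrix exactly as printed, filling the symmetric entries from the more precise of the two printed values, and uses NumPy's symmetric eigensolver; it is not necessarily the routine used in the study.

```python
import numpy as np

# Covariance matrix from Table 5 (part 5.1).
# Variable order: WordDiff, WordSame, ArabicDiff, ArabicSame.
S = np.array([
    [36178.35081, 25936.72681, 18447.56855, 15909.23790],
    [25936.72681, 22597.75378, 14261.73085, 14115.93448],
    [18447.56855, 14261.73085, 18487.81855, 11799.72984],
    [15909.23790, 14115.93448, 11799.72984, 13001.39900],
])

# eigh() returns eigenvalues of a symmetric matrix in ascending order,
# so re-sort them (and the eigenvector columns) to descending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

total_variance = np.trace(S)           # sum of the original variances
proportion = eigvals / total_variance  # share explained by each PC
cumulative = np.cumsum(proportion)

print(np.round(eigvals, 2))
print(np.round(proportion, 4), np.round(cumulative, 4))
```

The columns of `eigvecs` are the PC coefficient vectors; their signs are arbitrary, so a column may come out as the negative of the corresponding column in Table 5 (part 5.3).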

Figure 3: Scree plot for the four variables using covariance matrix.

As can be seen in Figure 3, the contributions of the second, third and fourth components are relatively small in comparison to the first principal component. Moreover, the elbow occurring at the second PC also suggests that only the first PC should be considered. The first PC can be interpreted as a weighted average of all four original variables. As can be seen from Table 5 (part 5.3), WordDiff and ArabicSame contribute the most and the least to the first PC, respectively. The contributions of the variables to the first two PCs can be written as the following equations.

  • PC1 = 0.660WordDiff + 0.519WordSame + 0.410ArabicDiff + 0.356ArabicSame
  • PC2 = -0.468WordDiff - 0.246WordSame + 0.785ArabicDiff + 0.323ArabicSame

In the second PC, the Word variables (WordDiff and WordSame) contrast with the Arabic variables. The correlations of the original variables with the respective PCs are given in Table 5 (part 5.4). The p-values for all of the variables in the first PC are very small (highly significant), which indicates that the original variables are highly correlated with the first PC. The first PC increases as the values of the variables increase, and vice versa. For the second PC, on the other hand, the p-values for WordDiff, WordSame and ArabicSame are large, so the correlations between these variables and the second PC are not significant; only ArabicDiff is significantly correlated with the second PC.
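The correlations in Table 5 (part 5.4) follow from the coefficients, eigenvalues and variances in the other parts of the table via the relation correlation = coefficient × sqrt(eigenvalue) / sqrt(variance). A small sketch, assuming the printed values and a helper name (`pc_variable_correlation`) chosen here for illustration:

```python
import math

# Variances (diagonal of Table 5, part 5.1) and eigenvalues (part 5.2).
variances = {"WordDiff": 36178.35081, "WordSame": 22597.75378,
             "ArabicDiff": 18487.81855, "ArabicSame": 13001.399}
eigenvalues = {"PC1": 76576.3296, "PC2": 7879.8818}

# PC coefficients from Table 5 (part 5.3).
coefficients = {
    "PC1": {"WordDiff": 0.660421, "WordSame": 0.518743,
            "ArabicDiff": 0.409502, "ArabicSame": 0.356451},
    "PC2": {"WordDiff": -0.46792, "WordSame": -0.246012,
            "ArabicDiff": 0.784865, "ArabicSame": 0.32329},
}

def pc_variable_correlation(pc, var):
    """Correlation between a principal component and an original variable."""
    return (coefficients[pc][var] * math.sqrt(eigenvalues[pc])
            / math.sqrt(variances[var]))

for pc in coefficients:
    row = {v: round(pc_variable_correlation(pc, v), 5) for v in variances}
    print(pc, row)
```

The printed rows can be compared against the PC1 and PC2 rows of Table 5 (part 5.4).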

Principal Component Analysis Using Correlation Matrix: Instead of the covariance matrix, it is recommended to use the correlation matrix when either of the following holds:

  • The original variables are in different scales or units.
  • The original variables have high difference in their variability and should be standardized using the correlation matrix.
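
Part 6.1 Correlation matrix

| Variables | WordDiff | WordSame | ArabicDiff | ArabicSame |
|---|---|---|---|---|
| WordDiff | 1 | 0.9071 | 0.7133 | 0.7336 |
| WordSame | 0.9071 | 1 | 0.6977 | 0.8235 |
| ArabicDiff | 0.7133 | 0.6977 | 1 | 0.7611 |
| ArabicSame | 0.7336 | 0.8235 | 0.7611 | 1 |

Part 6.2 Eigenvalues and the proportion of variation explained by the PCs

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 3.321 | 2.953 | 0.83 | 0.83 |
| 2 | 0.368 | 0.126 | 0.092 | 0.922 |
| 3 | 0.241 | 0.172 | 0.06 | 0.983 |
| 4 | 0.07 | | 0.017 | 1 |

Part 6.3 Coefficients of PCs using correlation matrix

| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| WordDiff | 0.506 | -0.47 | 0.423 | 0.586 |
| WordSame | 0.518 | -0.435 | -0.093 | -0.73 |
| ArabicDiff | 0.475 | 0.731 | 0.463 | -0.158 |
| ArabicSame | 0.499 | 0.233 | -0.773 | 0.314 |

Part 6.4 Correlation between the final PCs and the original variables

| | WordDiff | WordSame | ArabicDiff | ArabicSame |
|---|---|---|---|---|
| PC1 | 0.92273 | 0.94394 | 0.86586 | 0.91045 |
| p-value | <.0001 | <.0001 | <.0001 | <.0001 |
| PC2 | -0.28531 | -0.2642 | 0.4436 | 0.1412 |
| p-value | 0.1135 | 0.144 | 0.011 | 0.4408 |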

Table 6: Principal component analysis using correlation matrix.

The sample correlation matrix standardizes the variables and brings them to unit variance, as opposed to the covariance matrix, where the variables with larger variances tend to drive the analysis towards themselves. Table 1 also shows that the standard deviations differ considerably. This suggests that if the covariance matrix is used, the PCs will be pulled towards the variables with larger standard deviations.
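The standardization described here amounts to dividing each covariance by the product of the two standard deviations. A minimal sketch, assuming the covariance matrix as printed in Table 5 (part 5.1):

```python
import numpy as np

# Covariance matrix from Table 5 (part 5.1).
# Variable order: WordDiff, WordSame, ArabicDiff, ArabicSame.
S = np.array([
    [36178.35081, 25936.72681, 18447.56855, 15909.23790],
    [25936.72681, 22597.75378, 14261.73085, 14115.93448],
    [18447.56855, 14261.73085, 18487.81855, 11799.72984],
    [15909.23790, 14115.93448, 11799.72984, 13001.39900],
])

std = np.sqrt(np.diag(S))     # standard deviations, as in Table 1
R = S / np.outer(std, std)    # r_ik = s_ik / (s_i * s_k)

print(np.round(std, 3))
print(np.round(R, 4))
```

The resulting matrix has ones on the diagonal and should agree with the correlation matrix in Table 6 (part 6.1) to the printed precision.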

Results in Table 6 (part 6.2) show that 83% of the total variation is explained by the first PC. Although this is above the stated rule of thumb of explaining at least 80% of the total variability, it is lower than the variability explained by the first PC using the covariance matrix (Table 5, part 5.2). Furthermore, the first two PCs explain 92.2% of the total variability, and the first three PCs explain 98.3%. As suggested by both the scree plot in Figure 4 and the cumulative proportions in Table 6 (part 6.2), the first PC explains about 83% of the sample variability, leaving only 17% unexplained.
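The same calculation can be sketched for the correlation matrix. Because every standardized variable has unit variance, the total variance equals the number of variables (4), so each proportion is simply eigenvalue / 4. The code assumes the four-decimal correlations printed in Table 6 (part 6.1), so the eigenvalues agree with part 6.2 only up to rounding.

```python
import numpy as np

# Correlation matrix from Table 6 (part 6.1).
# Variable order: WordDiff, WordSame, ArabicDiff, ArabicSame.
R = np.array([
    [1.0000, 0.9071, 0.7133, 0.7336],
    [0.9071, 1.0000, 0.6977, 0.8235],
    [0.7133, 0.6977, 1.0000, 0.7611],
    [0.7336, 0.8235, 0.7611, 1.0000],
])

# eigvalsh() returns eigenvalues in ascending order; reverse to descending.
eigvals = np.linalg.eigvalsh(R)[::-1]

proportion = eigvals / R.shape[0]   # total variance = number of variables
cumulative = np.cumsum(proportion)

print(np.round(eigvals, 3))
print(np.round(proportion, 3), np.round(cumulative, 3))
```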

Figure 4: Scree plot for the four variables using correlation matrix.

As observed in Figure 4, the contributions of the second, third and fourth components are relatively small in comparison to the first PC. In addition, the elbow occurring at the second PC also suggests that only the first PC should be considered. The first PC can summarize the four dimensions of the original data in one dimension with an acceptable loss of information, which indicates that reducing the data from four dimensions to one is reasonable. The PCs can be written and interpreted as weighted averages of all four original variables as follows.

  • PC1 = 0.506WordDiff + 0.518WordSame + 0.475ArabicDiff + 0.499ArabicSame
  • PC2 = -0.470WordDiff - 0.435WordSame + 0.731ArabicDiff + 0.233ArabicSame

In PC1 the original variables contribute almost equally, whereas in PC2 their contributions differ. The contributions of the original variables to the PCs are interpreted using the correlation between each original variable and the PC. PC1 is strongly correlated with all four of the original variables and increases as all four variables increase. Since the loss of information is acceptable, i.e. more than 80% of the total variability is explained by PC1, the data can be reduced to a single PC (PC1).
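For standardized variables, the squared correlation between a variable and a PC is the share of that variable's variance captured by the PC, and these squared correlations sum to the PC's eigenvalue. The sketch below applies this to the PC1 correlations reported in Table 6 (part 6.4); it only re-uses those reported values and does not recompute the PCA.

```python
# Correlations between PC1 and the original variables, Table 6 (part 6.4).
pc1_correlations = {
    "WordDiff": 0.92273,
    "WordSame": 0.94394,
    "ArabicDiff": 0.86586,
    "ArabicSame": 0.91045,
}

# Share of each standardized variable's variance captured by PC1.
squared = {name: r**2 for name, r in pc1_correlations.items()}

# Their sum recovers the first eigenvalue; dividing by the number of
# variables gives the proportion of total variance explained by PC1.
eigenvalue_pc1 = sum(squared.values())
proportion_pc1 = eigenvalue_pc1 / len(squared)

for name, share in squared.items():
    print(f"{name}: {share:.3f}")
print(f"eigenvalue = {eigenvalue_pc1:.3f}, proportion = {proportion_pc1:.3f}")
```

Each variable has roughly 75% to 89% of its variance captured by PC1, which is consistent with the near-equal contributions noted above.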

The scatter plots of the first two PCs obtained using the covariance and the correlation matrix, presented in Figures 5 and 6 respectively, show that the two PCs are uncorrelated.
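The lack of correlation between PCs is a consequence of the coefficient vectors being orthonormal eigenvectors: the covariance between two different PCs is proportional to the dot product of their coefficient vectors, which is zero. A quick check, assuming the coefficients as printed in Table 5 (part 5.3):

```python
import numpy as np

# PC coefficients from Table 5 (part 5.3); columns are PC1..PC4,
# rows are WordDiff, WordSame, ArabicDiff, ArabicSame.
E = np.array([
    [0.660421, -0.467920, -0.438288,  0.390894],
    [0.518743, -0.246012,  0.414609, -0.706034],
    [0.409502,  0.784865, -0.412483, -0.214832],
    [0.356451,  0.323290,  0.682540,  0.550059],
])

# Gram matrix of the coefficient vectors: diagonal entries are squared
# lengths, off-diagonal entries are dot products between different PCs.
gram = E.T @ E

print(np.round(gram, 4))
```

Up to rounding of the printed coefficients, the result is the identity matrix: each coefficient vector has unit length and distinct vectors are orthogonal.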

Figure 5: Scatter plot of PC1 against PC2 obtained using covariance matrix.
Figure 6: Scatter plot of PC1 against PC2 obtained using correlation matrix.

Conclusion

Principal components are uncorrelated linear combinations of the original variables that depend on the correlation or covariance matrix. Principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables. Canonical correlation analysis is concerned with the amount of linear relationship between two sets of variables. It is useful when researchers measure two types of variables on each research unit and are interested in the relationship between the two sets. Data were collected from thirty-two subjects on four variables (WordDiff, WordSame, ArabicDiff and ArabicSame) as they made a series of quick numerical judgments about two numbers presented as either two number words (e.g. “two”, “four”) or two single Arabic digits (e.g. “2”, “4”). One objective of this study was to investigate whether it is possible to reduce the data.

Results indicated that the correlation between the Word variables and the Arabic variables is high: the largest between-set correlation is 0.824, between WordSame and ArabicSame, and the first canonical correlation is 0.831. The within-set correlations are 0.907 between WordDiff and WordSame and 0.761 between ArabicSame and ArabicDiff. Moreover, the first Word canonical variable explains 91.8% of the variability in the Word variables. The canonical redundancy analysis showed that the first Arabic canonical variable is a good overall predictor of the opposite set of variables, the proportion of variance explained being 0.6336. Similarly, the first Arabic canonical variable explains 84.6% of the variance in the Arabic variables, and the first Word canonical variable is a good overall predictor of the opposite set of variables, the proportion of variance explained being 0.5842. The contributions of the original variables to the PCs were interpreted using the correlation between each original variable and the PC. PC1 is strongly correlated with all four of the original variables and increases as all four variables increase. Since the loss of information is acceptable, i.e. more than 80% of the total variability is explained by PC1, the data can be reduced to a single PC (PC1).

In conclusion, to describe the correlation between the Word variables and the Arabic variables, it is adequate to consider the canonical correlation between the first Word canonical variable and the first Arabic canonical variable. The data can be reduced to a single principal component (PC1), since more than 80% of the total variability is explained by this single PC, so the loss of information is acceptable.

References

  1. Johnson RA, Wichern DW (2000) Applied Multivariate Statistical Analysis. 6th (Edn.), Englewood Cliffs: Prentice-Hall.
  2. Jackson JE (1980) Principal Components and Factor Analysis. Component Journal of Quality Technology.
  3. Rencher AC (2001) Methods of Multivariate Analysis. 2nd (Edn.), Wiley series in Probability and Statistics.

Cite this article

Adugna DG, Koge D, Teklezgi MG* and Ezgi T (2023). Principal Component Analysis of Hierarchical and Multivariate Data. Annals of Advanced Biomedical Sciences, 6(1). https://doi.org/10.23880/aabsc-16000209