=================================================================================
Principal Component Analysis (PCA) itself is not a predictive modeling algorithm like regression or classification. PCA does not make predictions in the sense of assigning outcomes or labels to new, unseen data points. Instead, it is a dimensionality reduction technique used for feature extraction and data visualization that retains as much of the variance in the data, and therefore as much of the important information, as possible:
- Purpose of PCA: PCA is used for feature extraction and data visualization by reducing the dimensionality of the dataset while retaining important information.
- Uncorrelated Variables: PCA aims to find a set of uncorrelated variables, called principal components, that capture the maximum variance in the data. These components are linear combinations of the original variables.
In such an analysis, PCA combines the original variables into a smaller set of principal components; these principal components are uncorrelated, meaning they are orthogonal to each other.
- Capturing Variance: The principal components are ordered such that the first component captures the most variance in the data, the second captures the second most, and so on. This allows for dimensionality reduction while retaining the most significant information.
- Correlation Matrix and Varimax Rotation: PCA can be performed on the covariance matrix or the correlation matrix. In this case, the PCA was performed on the correlation matrix. Additionally, a rotation method called Varimax rotation was applied to the principal components. Varimax rotation is used to simplify the interpretation of the components by maximizing the variance of the squared loadings for each component (a minimal code sketch is given after this list).
- Linear Combinations: PCA finds linear combinations of the original variables to create the principal components. These combinations are determined through mathematical computations and are not specific to any single dataset.
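As a rough illustration of the correlation-matrix and Varimax points above, the following sketch performs PCA on standardized data (equivalent to performing PCA on the correlation matrix) and then applies a Varimax rotation to the resulting loadings. The dataset, the choice of two components, and the varimax() helper are illustrative assumptions rather than part of any analysis discussed here.
```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Standard Varimax rotation of a loading matrix (rows = variables, cols = components)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    total = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # SVD-based update of the rotation matrix (Kaiser's varimax criterion)
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0)))
        )
        rotation = u @ vt
        new_total = s.sum()
        if new_total < total * (1 + tol):
            break
        total = new_total
    return loadings @ rotation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # hypothetical dataset: 100 samples, 5 variables

# PCA on standardized (z-scored) data is equivalent to PCA on the correlation matrix
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Loadings: eigenvectors scaled by the square root of the eigenvalues
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
rotated_loadings = varimax(loadings)
print(rotated_loadings)
```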
PCA is a statistical technique used to reduce the dimensionality of a dataset while preserving most of the variation in the data. It does this by transforming the original variables into a new set of variables called principal components (as shown in Figure 4501a), which are linear combinations of the original variables.
Figure 4501a. Comparison between original data and PCA transformed data (code).
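The code linked in the caption of Figure 4501a is not reproduced here; a minimal sketch of the same kind of comparison, assuming a small synthetic 2D dataset, is:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical correlated 2D data (e.g., two measurements on the same samples)
rng = np.random.default_rng(42)
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=200)])

# Re-express the data in the principal-component coordinate system
pca = PCA(n_components=2)
transformed = pca.fit_transform(data)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(data[:, 0], data[:, 1], s=10)
ax1.set_title("Original data")
ax2.scatter(transformed[:, 0], transformed[:, 1], s=10)
ax2.set_title("PCA-transformed data (PC1 vs PC2)")
plt.tight_layout()
plt.show()
```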
- Data Visualization: PCA is often used for visualizing high-dimensional data in a lower-dimensional space, making it easier to interpret and analyze complex datasets.
- Model Predictors: The principal components resulting from the PCA can be used as predictors in a model. Predictors are the variables used to make predictions in a statistical model.
- Candidate Predictors: The first few principal components resulting from the PCA can be selected as candidate predictors for a model if we believe these components capture the most important variation in the data and are therefore useful for predicting the outcome of interest.
It aims to find the directions (or principal components) of maximum variance in a dataset and projects the data onto these directions to create a new coordinate system. When a dataset is plotted as points in a plane, each point typically represents an observation with multiple features (e.g., height and weight). PCA reorients this coordinate system to align with the directions of maximum variance in the data. The new axes (principal components) are linear combinations of the original features. While the original axes might represent physical quantities like height and weight, the new axes are abstract constructs aimed at capturing the most significant sources of variation in the data. In essence, therefore, PCA does not change the data itself but rather re-expresses it in a new coordinate system whose axes are chosen to capture the most variation. These new axes do not directly correspond to physical quantities; that is, the axes themselves do not mean anything physical, but are mathematical constructs representing the directions of greatest variability in the data.
Some benefits of using PCA over the original data are:
- Dimensionality Reduction: PCA reduces the number of features (dimensions) in the data while retaining most of the variance. This can be particularly useful when dealing with high-dimensional data, as it can simplify the analysis and computation.
- Visualization: PCA allows for easier visualization of the data by reducing it to a lower-dimensional space (often 2D or 3D), making it easier to plot and interpret. PCA can be especially helpful when dealing with data in higher dimensions than 2D.
- Noise Reduction: By focusing on the directions of maximum variance, PCA tends to minimize the effects of noise in the data. This can lead to better generalization and performance of machine learning models trained on the transformed data.
- Feature Engineering: PCA can be used for feature engineering by creating new features (principal components) that are combinations of the original features. These new features may capture complex relationships in the data more effectively than the original features alone.
- Computational Efficiency: Working with fewer dimensions reduces the computational complexity of tasks such as clustering, classification, and regression, leading to faster computation times.
It works with a set of observed data points, often represented as vectors in a high-dimensional space, and aims to find a new set of uncorrelated variables, called principal components (PCs), that capture the maximum variance in the data. PCA identifies latent features in the data by finding orthogonal axes (the principal components) along which the data varies the most; these components are linear combinations of the original features and can capture latent patterns in the data.
Here's how it works:
- You start with a dataset containing multiple variables (features) for each data point. Each feature can be considered a random variable, so if you have, say, 10 features in your dataset, you have 10 random variables.
- PCA transforms these original variables into a new set of variables called principal components. These principal components are linear combinations of the original variables and are designed to be uncorrelated with each other.
- The first principal component (PC1) captures the most variance in the data, the second principal component (PC2) captures the second most, and so on. These principal components are denoted as X and Y for illustrative purposes, but they can be any linear combinations of the original variables.
When discussing PCA, X and Y represent abstract concepts denoting linear combinations of the original variables that capture the most important information in the data. These linear combinations are determined through PCA's mathematical computations and are not specific to any single dataset. The primary goal of PCA is to find these uncorrelated principal components to reduce the dimensionality of your data while retaining as much information as possible.
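A minimal sketch of these steps, assuming a synthetic dataset with 10 features, is shown below; it computes the principal components directly from the covariance matrix so that the ordering of the variances and the uncorrelated nature of the PC scores can be verified:
```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))            # 500 observations, 10 features (random variables)
X = X - X.mean(axis=0)                    # center each feature

# Eigendecomposition of the covariance matrix gives the principal directions
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort so that PC1 has the largest variance, PC2 the second largest, and so on
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Scores: each PC is a linear combination of the original 10 variables
scores = X @ eigenvectors
print("Variance captured by each PC:", np.round(eigenvalues, 3))
print("PCs are uncorrelated:", np.allclose(np.corrcoef(scores, rowvar=False),
                                           np.eye(10), atol=1e-8))
```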
In many cases, the first few principal components (PCs) obtained from PCA explain most of the variation in the data; for instance, the first three PCs explained 81.1% of the total cumulative variation in a Huanglongbing (HLB) dataset [1].
The mean, median, and standard deviation of principal components can provide insights into the distribution and variability of the data represented by these components.
Mean (Average):
- The mean of a principal component represents the average value of the component across the samples or observations in the dataset.
- A high mean suggests that, on average, the component has a relatively large positive value across the dataset.
- A low mean suggests that, on average, the component has a relatively small value or tends to be negative across the dataset.
- The mean can help understand the central tendency of the data represented by the principal component.
Median:
- The median of a principal component represents the middle value of the component when all the values are arranged in ascending order.
- The median can be less affected by extreme values (outliers) compared to the mean, making it a robust measure of central tendency.
- It provides a measure of the typical value or central tendency of the component.
Standard Deviation:
- The standard deviation of a principal component measures the spread or variability of the component's values around the mean.
- A high standard deviation suggests that the values of the component are spread out widely from the mean.
- A low standard deviation suggests that the values of the component are clustered closely around the mean.
- The standard deviation can provide information about the dispersion or variability of the data represented by the principal component.
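As a small illustration of these statistics, the sketch below computes the mean, median, and standard deviation of PC scores obtained with scikit-learn on a hypothetical dataset; note that because PCA centers the data, the mean of the training-set scores is close to zero by construction:
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))                  # hypothetical dataset
scores = PCA(n_components=3).fit_transform(X)  # PC scores, shape (200, 3)

for i in range(scores.shape[1]):
    pc = scores[:, i]
    print(f"PC{i + 1}: mean={pc.mean():.3f}, "
          f"median={np.median(pc):.3f}, std={pc.std(ddof=1):.3f}")
# Because sklearn's PCA centers the data, the mean of each PC score is ~0 here;
# on new data projected with a previously fitted PCA the mean can be informative.
```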
After applying PCA, other predictive models, such as regression or classification models, can be trained on the reduced-dimensional data to make predictions.
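For example, a minimal scikit-learn pipeline that uses the first few principal components as predictors for a classifier, on an assumed synthetic dataset, might look like this:
```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical high-dimensional classification problem
X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale -> reduce to a few principal components -> fit a predictive model on them
model = make_pipeline(StandardScaler(), PCA(n_components=5),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Test accuracy on PCA-reduced predictors:", model.score(X_test, y_test))
```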
In PCA, the maximum value that principal components (PCs) can take is not inherently constrained to a specific range like [0, 1] or [-1, 1]. Instead, the values of the principal components for a dataset are influenced by several factors:
- Scale of the Data: The range of the principal components depends on the variance of the data along the directions identified by these components. If the original data variables have large variances, the scores of the principal components can also be large.
- Unit Variance Scaling: Often, data is scaled to have unit variance before applying PCA. Even after scaling, the principal components themselves do not necessarily have a maximum value but are scaled such that the variance (or total inertia) they explain is maximized. Each principal component score can vary widely based on the data distribution.
- Variance Explained by Components: Each principal component captures a portion of the total variance of the dataset. The first principal component captures the most variance: in PCA, the components are derived such that the first has the highest possible variance, and each succeeding component, in turn, has the highest variance possible under the constraint that it is orthogonal to the preceding components. Each subsequent component therefore captures progressively less variance. There is no theoretical upper limit to the values of the components themselves; they are more a reflection of the data's structure and scaling. When interpreting PCA results, rather than focusing on the maximum possible values of the components, attention is often directed to:
- Loadings of Components: Indicate how much each original variable contributes to each principal component.
- Scree Plot: Helps in visualizing the proportion of total variance explained by each principal component.
- Cumulative Explained Variance: This can guide whether additional components are necessary for an adequate understanding of the data.
- Magnitude of Component Scores: The scores (or coordinates) of the data points projected onto the principal components can have large magnitudes, especially if the original data points are far from the mean and if the data spreads widely along the direction of the principal components.
In PCA, there also isn't a predefined "trustable" minimum value for the principal components (PCs), because PCA is fundamentally a technique that transforms the original variables into a new coordinate system whose axes (the principal components) are the directions of maximum variance in the data. The values of the principal components for each data point (often called scores) are simply the coordinates of that data point in this new coordinate system, and several considerations apply:
- Data Variance: PCs are derived to capture the maximum variance possible. If a dataset has large variations or is spread out, the corresponding PCs will have scores with large absolute values.
- Scaling of Data: The scale of the original data can significantly affect the PC scores. Data are often standardized (mean-subtracted and divided by the standard deviation) before PCA, making the data scale uniform. This doesn't necessarily limit the range of the principal components but ensures that no single variable dominates due to scale differences.
- Distribution and Spread of Data: The range of values that PCs take depends on how the data is distributed and its spread. For example, if the data is clustered closely around the mean, the PC scores might be closer to zero. Conversely, if the data points are far from the mean, the PC scores can be very large.
- "Trustable" in the context of PCA: Refers to understanding that the component scores accurately reflect the underlying structure and variability of the data. All values are inherently "trustable" as long as the PCA is appropriately applied.
- Explained Variance: A more pertinent question might be how many PCs are necessary to trustably represent the data's variability. This can be determined by looking at the explained variance ratio, which tells you how much variance each PC captures. A common rule is that the chosen components should cumulatively explain a substantial portion of the variance (common thresholds are 70-90%).
- Outliers: Extremely high or low PC scores might sometimes indicate outliers or anomalies in the data. Investigating these can provide insights into data quality or peculiar patterns that might not be immediately apparent.
- Component Loadings: Besides the scores, examining the component loadings (which variables contribute to each principal component and how strongly) can provide insights into what drives the variation captured by each PC.
- Deciding How Many PCs to Retain: This is typically done by examining a scree plot or the cumulative explained variance (see the sketch after this list). You want enough PCs to capture a significant portion of the total variance without retaining noise or redundant information.
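A minimal sketch of such a check, assuming a synthetic 12-feature dataset and an 80% cumulative-variance threshold, is:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 12))                    # hypothetical dataset, 12 features
pca = PCA().fit(StandardScaler().fit_transform(X))

ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Scree plot plus cumulative explained variance
plt.plot(range(1, len(ratios) + 1), ratios, "o-", label="per component")
plt.plot(range(1, len(ratios) + 1), cumulative, "s--", label="cumulative")
plt.axhline(0.8, color="gray", linestyle=":", label="80% threshold")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()

# Smallest number of PCs that explains at least 80% of the variance
n_keep = int(np.searchsorted(cumulative, 0.8) + 1)
print("Retain", n_keep, "components")
```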
One application example of a PCA model is studying the impact of fabrication conditions on semiconductor wafer fail rates. Assume we have a dataset containing the fail rates of semiconductor wafers under 40 different test bins (data) as shown below. The dataset includes five columns, each representing one of five wafers, with the fail rates measured for each. The wafers were fabricated using different combinations of 10 possible conditions. Specifically, Wafer1 was fabricated under Conditions 1 and 2; Wafer2 under Conditions 1, 2, 3, 6, and 9; Wafer3 under Conditions 1, 8, 9, and 10; Wafer4 under Conditions 1, 2, 3, 5, and 7; and Wafer5 under Conditions 1, 4, 5, and 8. We want to perform a fail analysis to understand the relationships between these varying fabrication conditions and the observed fail rates across the different bins. This involves identifying any patterns or correlations between the conditions and the fail rates, which could help in pinpointing specific conditions that lead to higher fail rates, thereby facilitating improvements in fabrication processes. Python code is used to analyze the data as described below:
To analyze the impacts of different fabrication conditions on the fail rates using PCA, we first reshape the data to effectively represent the association between fabrication conditions and fail rates for each wafer. Since each wafer is fabricated under a unique set of conditions, we treat these conditions as features (a hedged code sketch follows this list):
- Data Representation: Encode the fabrication conditions of each wafer as binary features. Each feature will indicate whether a specific condition is used (1) or not used (0).
- Data Preprocessing: Prepare the data by combining the fail rates and the conditions.
- PCA Execution: Apply PCA to analyze the main components that influence the fail rates based on the conditions.
- Results Interpretation: Evaluate the principal components to understand which conditions most strongly correlate with the fail rates.
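A sketch of such a script is given below. The original data file and code are not reproduced here, so the fail rates are random placeholder values, and the way the conditions are combined with the fail rates (each condition column holds the summed fail rate of the wafers fabricated with that condition) is an assumption made for illustration; consequently this sketch will not reproduce the exact loadings and variance figures quoted for Figure 4501b.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data (the original fail-rate table is not reproduced here)
rng = np.random.default_rng(0)
bins = [f"Bin{i + 1}" for i in range(40)]
wafers = ["Wafer1", "Wafer2", "Wafer3", "Wafer4", "Wafer5"]
fail_rates = pd.DataFrame(rng.uniform(0, 0.1, size=(40, 5)), index=bins, columns=wafers)

# Binary encoding of the fabrication conditions used for each wafer (1 = used, 0 = not used)
conditions = pd.DataFrame(0, index=wafers,
                          columns=[f"Condition{i + 1}" for i in range(10)])
conditions.loc["Wafer1", ["Condition1", "Condition2"]] = 1
conditions.loc["Wafer2", ["Condition1", "Condition2", "Condition3", "Condition6", "Condition9"]] = 1
conditions.loc["Wafer3", ["Condition1", "Condition8", "Condition9", "Condition10"]] = 1
conditions.loc["Wafer4", ["Condition1", "Condition2", "Condition3", "Condition5", "Condition7"]] = 1
conditions.loc["Wafer5", ["Condition1", "Condition4", "Condition5", "Condition8"]] = 1

# Combine fail rates and conditions: for each bin, a condition column holds the total
# fail rate of the wafers fabricated with that condition (an assumed encoding).
condition_features = fail_rates.values @ conditions.values          # shape (40, 10)
combined = pd.concat([fail_rates,
                      pd.DataFrame(condition_features, index=bins,
                                   columns=conditions.columns)], axis=1)

# PCA on the standardized combined data
pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(combined))

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("Bins in the space of the first two principal components")
plt.show()

# Loadings and explained variance used for the interpretation below
loadings = pd.DataFrame(pca.components_.T, index=combined.columns, columns=["PC1", "PC2"])
print(loadings)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```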
The output of the original script is shown below:
Figure 4501b. PCA output of the impact of fabrication conditions on semiconductor wafer fail rates.
The PCA plot above provides a visual representation of the data in the reduced dimensional space defined by the first two principal components:
- Principal Components: Each point on the plot represents one of the bins from the dataset in the space defined by the first two principal components. These components are linear combinations of the original variables (fabrication conditions and fail rates), chosen to capture the maximum variance in the dataset.
- Variance and Spread:
- X-axis (Principal Component 1): This is the direction of maximum variance. Points spread along this axis vary the most in their underlying properties (fail rates and conditions).
- Y-axis (Principal Component 2): This captures the second most variance and is orthogonal to the first component.
- Clusters and Outliers:
- The bins within each cluster share similar characteristics in terms of fail rates and the impact of fabrication conditions.
- Outliers or points that are far apart from others might indicate unusual or unique fail rate patterns under certain conditions.
- Interpreting Directions:
- If points that represent specific wafers or conditions group together, it suggests that those conditions have similar effects on fail rates.
- The relative distance between points can indicate the relative similarity or dissimilarity in the fail patterns influenced by the fabrication conditions.
- Actionable Insights:
- Identify which conditions contribute most to failure rates: By looking at which wafers or conditions cluster together or stand apart, we can potentially identify which manufacturing conditions are more likely to lead to higher fail rates.
- Optimization: These insights can help in adjusting manufacturing processes to mitigate conditions leading to higher failure rates.
- Loadings of Principal Components:
- Look at the loadings (coefficients) of the original variables on the principal components. This will tell you which conditions weigh more heavily in each principal component, helping us to identify which specific conditions are most influential. The loadings output shown in Figure 4501b indicates how each original variable (both fail rates and conditions) contributes to the principal components. Higher absolute values in loadings indicate a stronger influence on that component.
- Explained Variance:
- Check the amount of variance explained by each principal component to understand how much of the total variability in the dataset is being captured in the plot. This tells us the proportion of the dataset's total variance that is captured by each principal component. For example, if the first component explains 70% of the variance, it means that this component captures most of the information contained in the original dataset.
The results in Figure 4501b provide insightful information about how the different fabrication conditions and fail rates (represented as "Wafer1" through "Wafer5") influence the variability in the dataset:
- Loadings (PC Coefficients) Interpretation:
The loadings describe how each variable (fail rates and conditions) contributes to each principal component.
- Wafer Fail Rates:
- The contributions of each wafer to the principal components are relatively small compared to the fabrication conditions, suggesting that specific conditions might play a larger role in the variability than differences in the wafers themselves.
- Notably, Wafer5 has a higher loading on PC2, indicating its fail rates vary more along the direction captured by PC2 compared to the other wafers.
- Fabrication Conditions:
- Large absolute values in loadings for certain conditions suggest significant influence on the principal components.
- PC1 is strongly influenced by Condition2, Condition3, Condition8, and Condition10. The negative sign on Condition2 and Condition3, and positive on Condition8 and Condition10, suggest these conditions pull the component in opposite directions, indicating different effects on the fail rates.
- PC2 has significant contributions from Condition4, Condition5, Condition6, Condition7, Condition9, and Condition10. Conditions like Condition5 and Condition7 with negative signs indicate they affect PC2 differently than conditions like Condition6, Condition9, and Condition10 with positive signs.
- Explained Variance:
- The explained variance tells us how much information (variability) is captured by each principal component.
- PC1 explains about 27.6% of the variance in the dataset, while PC2 accounts for about 22.4%. Together, they capture roughly 50% of the total variance in the dataset.
- This implies that while these two components capture a significant portion of the variability, there are other factors (possibly captured in other principal components not shown here) that also contribute to the variation in fail rates and the impact of conditions.
- Overall Interpretation and Next Steps:
- Conditions vs. Wafers: The PCA suggests that specific fabrication conditions have a more profound impact on the variability of the fail rates than differences between the wafers themselves.
- Actionable Insights: Conditions like Condition2, Condition3, and Condition8 appear crucial due to their strong influence on the principal components. Adjusting these conditions might significantly impact fail rates.
- Further Analysis: Since the first two principal components account for only half of the total variance, we might consider examining more components or applying other analytical techniques (like cluster analysis) to understand other factors affecting fail rates.
This analysis above helps pinpoint which manufacturing processes and conditions should be the focus of further investigation and potential adjustment to optimize wafer production and reduce fail rates.
Table 4501. Applications of Principal Component Analysis.
Applications | Details
Decorrelating models | page3737
============================================
Apply training data to PCA: code:
Output:
============================================
The "X" samples and "Y" samples are clustered on the left and right sides, suggesting that they are correlated with each other in each group.The separation of the two clusters along the x-axis suggests that "X" samples are very different from "Y" samples.The loading scores for PC1 to determine which N had the largest influence on the separating the two clusters along the x-axis.
[1] Kaique S. Alves, Lisa A. Rothmann, Emerson M. Del Ponte, Linking climate variables to large-scale spatial pattern and risk of citrus Huanglongbing: a hierarchical Bayesian modeling approach, OSFPreprints, 2021, DOI: 10.31219/osf.io/32djg.