Pearson Correlation Coefficient/Pearson's r/Correlation Coefficient

- Python for Integrated Circuits -
- An Online Book -

Python for Integrated Circuits http://www.globalsino.com/ICs/


=================================================================================

The Pearson Correlation Coefficient, often denoted as Pearson's r or simply the correlation coefficient, is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables, that is, how well the relationship between them can be described by a straight line. The p-value that usually accompanies r is a separate quantity: it is used in hypothesis testing to measure the strength of evidence against a null hypothesis, which here is that the two variables are not linearly correlated.

Some key characteristics of the Pearson Correlation Coefficient are:

Range: The Pearson Correlation Coefficient values range from -1 to 1.
- A value of -1 indicates a perfect negative linear relationship (as one variable increases, the other decreases linearly).
- A value of 1 indicates a perfect positive linear relationship (both variables increase together linearly).
- A value of 0 indicates no linear relationship (the variables are not correlated linearly).
Interpretation:
- Values close to -1 or 1 imply a strong linear relationship.
- Values close to 0 imply a weak or no linear relationship.
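Assumption: Pearson's correlation assumes that the relationship between the variables is linear, so it may not capture non-linear relationships. The coefficient itself can be computed for any paired numeric data, but the usual significance test (the p-value) additionally assumes that the variables are approximately normally distributed.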
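
The range and the linearity caveat listed above can be checked numerically. The sketch below uses made-up arrays (illustrative only, not data from this book) and the helper name pearson_r is hypothetical. The quadratic case shows that r can be essentially zero even when y is fully determined by x.

```python
import numpy as np

def pearson_r(x, y):
    """Return Pearson's r for two equal-length 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Illustrative data only (assumed, not from the source)
x = np.linspace(-1.0, 1.0, 21)

r_positive = pearson_r(x, 3.0 * x + 2.0)   # points on a rising straight line
r_negative = pearson_r(x, -0.5 * x + 1.0)  # points on a falling straight line
r_parabola = pearson_r(x, x ** 2)          # exact but non-linear dependence

print(r_positive, r_negative, r_parabola)
```

Because x is symmetric about zero, the rising and falling halves of the parabola cancel, which is why a scatter plot is worth checking before trusting a small r.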

The formula for calculating Pearson's correlation coefficient between two variables X and Y is as follows:

r = Σ((X - μX) * (Y - μY)) / (n * σX * σY) ---------------------------- [3920a]

or, equivalently,

r = Σ((X - μX) * (Y - μY)) / sqrt(Σ(X - μX)² * Σ(Y - μY)²) ----------------------------- [3920b]

Where:

r is the Pearson Correlation Coefficient. A common rule of thumb for interpreting its value is:
- r = -1: Perfect negative linear relationship.
- -0.1 < r < 0.1: No correlation.
- 0.1 ≤ |r| < 0.3: Low correlation.
- 0.3 ≤ |r| < 0.5: Medium correlation.
- 0.5 ≤ |r| < 0.7: High correlation.
- 0.7 ≤ |r| ≤ 1.0: Very high correlation.
X and Y are the variables being compared.
μX and μY are the means of X and Y, respectively.
σX and σY are the population standard deviations of X and Y (computed with a divisor of n), respectively.
n is the number of paired data points (X and Y must have the same length).
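
As a numerical check of the definition, the sketch below computes r as the sum of products of deviations divided by n times the two standard deviations, and compares it with np.corrcoef. It assumes paired samples of equal length n and population standard deviations (NumPy's default ddof=0). The function name pearson_from_definition and the sample values are illustrative.

```python
import numpy as np

def pearson_from_definition(x, y):
    """Pearson's r from sum((x - mean_x) * (y - mean_y)) / (n * sigma_x * sigma_y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D and of equal length")
    n = x.size
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()  # population std (ddof=0)
    return float(np.sum((x - mu_x) * (y - mu_y)) / (n * sigma_x * sigma_y))

# Illustrative data only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_manual = pearson_from_definition(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])
print(r_manual, r_numpy)
```

The two values should agree to floating-point precision. If sample standard deviations (ddof=1) were used instead, the divisor n would have to become n - 1 to give the same result.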

The Pearson Correlation Coefficient is widely used in statistics, data analysis, and machine learning to understand relationships between variables, perform feature selection, and assess the strength and direction of associations between two continuous variables.

Python cheatsheet for calculating the Pearson Correlation Coefficient:

Importing Necessary Libraries

  import numpy as np
  import pandas as pd
  from scipy.stats import pearsonr

Creating Data
- Using NumPy Arrays
  
  x = np.array([1, 2, 3, 4, 5])
  y = np.array([2, 4, 6, 8, 10])

- Using Pandas DataFrame

  data = {'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]}
  df = pd.DataFrame(data)

Calculating Pearson Correlation
- Using NumPy
- Using Pandas

- Using SciPy

  # pearsonr returns the correlation coefficient and a two-sided p-value
  correlation, p_value = pearsonr(x, y)
  print(correlation)
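
For the NumPy and pandas routes named above, a minimal sketch might look like the following. It assumes np.corrcoef (which returns a 2x2 correlation matrix, so the off-diagonal element is r) and Series.corr, and reuses the same toy data so the three libraries can be compared side by side.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
df = pd.DataFrame({"x": x, "y": y})

# NumPy: take the off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

# pandas: Series.corr uses method='pearson' by default
r_pandas = df["x"].corr(df["y"])

# SciPy: also returns the p-value
r_scipy, p_value = pearsonr(x, y)

print(r_numpy, r_pandas, r_scipy)
```

Since y is exactly 2 * x here, all three values should be 1 up to floating-point rounding.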

Handling NaN Values

- Using Pandas

  # Series.corr excludes pairs with missing values automatically
  df['x'].corr(df['y'], method='pearson')

- Dropping NaN Values

  df = df.dropna()
  pearson_correlation = df['x'].corr(df['y'])
  print(pearson_correlation)
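
The two NaN-handling routes above can be compared on a small frame with gaps. This is a sketch on made-up values. It relies on Series.corr skipping incomplete pairs, while scipy.stats.pearsonr is given only the rows that survive dropna().

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Illustrative data with one missing value in each column
df_nan = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, 4.1, 6.0, np.nan, 9.8],
})

# pandas ignores rows where either value is missing
r_auto = df_nan["x"].corr(df_nan["y"])

# Explicit route: drop incomplete rows first, then use SciPy
clean = df_nan.dropna()
r_clean, p_value = pearsonr(clean["x"], clean["y"])

print(len(clean), r_auto, r_clean)
```

Both routes use the same three complete rows, so the coefficients should match. With so few points the p-value carries little weight.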

Visualizing Correlation
- Using Matplotlib
  
  import matplotlib.pyplot as plt
  plt.scatter(x, y)
  plt.title('Scatter plot of x and y')
  plt.xlabel('x')
  plt.ylabel('y')
  plt.show()

- Using Seaborn for Pairplots

  import seaborn as sns
  sns.pairplot(df)
  plt.show()

============================================

The script below loads data from multiple folders, calculates Pearson Correlation Coefficients between data in FolderOne and data in other folders for each file pair, finds the best linear correlation for each folder, and computes the overall correlation for each folder. Code:
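
A minimal sketch of such a script is given here under several assumptions that the text does not confirm: each folder holds single-column numeric .csv files, files are paired by identical file name, the "best" correlation is the one with the largest absolute r, and the "overall" correlation is computed on all paired values concatenated. The names load_column and correlate_folders are hypothetical.

```python
from pathlib import Path

import numpy as np
from scipy.stats import pearsonr

def load_column(path):
    """Read one numeric column from a text/CSV file (assumed file layout)."""
    return np.loadtxt(path, delimiter=",", ndmin=1)

def correlate_folders(base_dir, reference="FolderOne"):
    """Correlate each file in the reference folder with its namesake elsewhere."""
    base = Path(base_dir)
    ref_files = {p.name: p for p in sorted((base / reference).glob("*.csv"))}
    results = {}
    others = sorted(p for p in base.iterdir() if p.is_dir() and p.name != reference)
    for folder in others:
        per_file, ref_all, other_all = {}, [], []
        for name, ref_path in ref_files.items():
            other_path = folder / name
            if not other_path.exists():
                continue  # no matching file in this folder
            a, b = load_column(ref_path), load_column(other_path)
            n = min(len(a), len(b))  # assumption: truncate to the shorter file
            if n < 2:
                continue
            a, b = a[:n], b[:n]
            per_file[name] = float(pearsonr(a, b)[0])
            ref_all.append(a)
            other_all.append(b)
        if not per_file:
            continue
        best = max(per_file, key=lambda k: abs(per_file[k]))
        overall = float(pearsonr(np.concatenate(ref_all), np.concatenate(other_all))[0])
        results[folder.name] = {"per_file": per_file, "best_file": best, "overall": overall}
    return results
```

Calling correlate_folders("path/to/data") would return one dictionary per non-reference folder. The path is a placeholder, and a different file format would need a different load_column.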

============================================

Calculate the Pearson correlation coefficient and p-value after filtering out elements with missing data. Code:

Cases 1 to 4 each consist of an input and an output; Cases 2 to 4 also print the new (filtered) lists used for the analysis.

Note, this script creates a new list filtered_data by zipping list1 and list2 together and filtering out pairs where either x or y is an empty string (""). Then, it unpacks the filtered pairs back into separate lists filtered_list1 and filtered_list2. The Pearson correlation coefficient is then calculated using these filtered lists.
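
The filtering step described in the note can be sketched as follows. The variable names filtered_data, filtered_list1 and filtered_list2 follow the note. The wrapper function filter_pairs, the conversion to float, and the sample lists are assumptions added for illustration.

```python
from scipy.stats import pearsonr

def filter_pairs(list1, list2):
    """Drop every pair in which either element is an empty string."""
    filtered_data = [(x, y) for x, y in zip(list1, list2) if x != "" and y != ""]
    if not filtered_data:
        return [], []
    # Unpack the surviving pairs back into two separate sequences
    filtered_list1, filtered_list2 = zip(*filtered_data)
    return [float(v) for v in filtered_list1], [float(v) for v in filtered_list2]

# Illustrative input with missing entries marked by ""
list1 = [1, 2, "", 4, 5]
list2 = [2, "", 6, 8, 10]

filtered_list1, filtered_list2 = filter_pairs(list1, list2)
correlation, p_value = pearsonr(filtered_list1, filtered_list2)
print(filtered_list1, filtered_list2, correlation)
```

Filtering pairwise, rather than cleaning each list on its own, keeps the remaining values aligned, which pearsonr requires.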

============================================

A few outliers can significantly affect the Pearson analysis. Code:

Case 1: with a few outliers (there are 100 "normal" data points).

Case 2: without outliers.
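
The effect can be reproduced with synthetic data. In this sketch the 100 "normal" points are independent random draws, so their true correlation is zero, and three far-away points are appended. The seed, the distribution, and the outlier coordinates are arbitrary choices for illustration, not the data behind the cases above.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 "normal" points with no underlying linear relationship
x = rng.normal(size=100)
y = rng.normal(size=100)
r_without = np.corrcoef(x, y)[0, 1]

# Append three extreme points lying roughly on a rising line
x_out = np.concatenate([x, [20.0, 22.0, 25.0]])
y_out = np.concatenate([y, [20.0, 21.0, 24.0]])
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(r_without, r_with)
```

Three points out of 103 are enough to move r from near zero to a value that the rule of thumb would call very high, because the deviations from the mean enter the formula as products.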

============================================

Calculate Pearson correlation between the last column and all other columns (except the first column). Code:

In addition, a subset of the dataframe can be extracted under conditions. For instance, this script first calculates the Pearson correlation between the last column and all other columns (except the first column), and then creates a subset containing the columns whose Pearson correlation with the last column is greater than 0.5, along with the last column itself.
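
One way to sketch both steps with pandas is shown below. It assumes the first column is an identifier to be skipped and that every other column is numeric. The function names, the default threshold argument, and the sample frame are illustrative.

```python
import pandas as pd

def correlations_with_last(df):
    """Pearson r between the last column and every column except the first."""
    target = df.iloc[:, -1]
    features = df.iloc[:, 1:-1]  # skip the first column and the target itself
    return features.corrwith(target)  # Pearson is the default method

def select_correlated(df, threshold=0.5):
    """Keep columns whose r with the last column exceeds the threshold."""
    corr = correlations_with_last(df)
    keep = corr[corr > threshold].index.tolist()
    return df[keep + [df.columns[-1]]]

# Illustrative data only
df_demo = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "f1": [1, 2, 3, 4, 5],
    "f2": [5, 3, 4, 1, 2],
    "target": [2, 4, 6, 8, 10],
})

print(correlations_with_last(df_demo))
subset = select_correlated(df_demo)
print(subset.columns.tolist())
```

The condition corr > threshold keeps only positively correlated columns, as the text describes. Using corr.abs() > threshold instead would also keep strongly negative ones.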

============================================

=================================================================================