Variable Correlation Analysis: Pearson and Spearman
Correlation analysis is a commonly used method in data analysis and mining. By analyzing the relationships between the feature and target variables, it helps identify influential factors in business operations, enabling the prediction of future business trends.
The relationships between two variables can be described as follows:
(1) Completely positive linear correlation: A value increases as another value increases, and their relationship perfectly falls on a straight line with a slope greater than 0.
(2) Completely negative linear correlation: A value decreases as another value increases, and their relationship perfectly falls on a straight line with a slope less than 0.
(3) Nonlinear correlation: There is no obvious linear relationship between two variables, yet some form of nonlinear relationship exists, such as a curve, S-shape, or Z-shape.
(4) Positive linear correlation: A value increases as another value increases, and their relationship approximately falls on a straight line with a slope greater than 0.
(5) Negative linear correlation: A value decreases as another value increases, and their relationship approximately falls on a straight line with a slope less than 0.
(6) Uncorrelated: There is no correlation between two variables.
In practice, we can use correlation coefficients to measure the degree of correlation between two variables. The most commonly used are the Pearson correlation coefficient and the Spearman correlation coefficient.
The Pearson correlation, also known as the product-moment correlation, was developed by the British statistician Karl Pearson in the 1890s. It measures the linear correlation between two variables.
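Calculation formula:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $x_i$ and $y_i$ are the paired observations, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of pairs.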
The Pearson correlation coefficient has a value range of [-1, 1]. The sign gives the direction of the relationship. The closer the absolute value is to 1, the stronger the correlation; the closer the absolute value is to 0, the weaker the correlation. It measures linear correlation only and is best suited to paired observations of continuous variables that are approximately normally distributed.
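As a rough illustration of the calculation, the coefficient is the sum of the products of paired deviations from the mean, divided by the square root of the product of the two sums of squared deviations. The sketch below is plain Python rather than SPL, and the sample values are invented for illustration; it is not how SPL's `pearson()` is implemented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations (numerator) and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical living areas and prices, invented for illustration only.
area = [50, 80, 100, 120, 150]
price = [110, 170, 210, 250, 310]   # exactly 2 * area + 10

print(pearson(area, price))         # a perfect positive line, so close to 1
```

Because the invented prices lie exactly on a rising straight line, the result sits at the top of the [-1, 1] range; reversing one of the lists would flip the sign.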
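The Spearman correlation coefficient, also known as the rank correlation coefficient, is named after Charles Spearman. It is a nonparametric measure of association based on the ranks of the data values. The calculation formula is:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$, and $n$ is the number of pairs. This simplified form holds when there are no tied ranks; when ties occur, the coefficient is computed as the Pearson correlation of the two rank sequences.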
The Spearman correlation coefficient has a value range of [-1, 1]. The closer the absolute value is to 1, the stronger the correlation; the closer the absolute value is to 0, the weaker the correlation. This coefficient evaluates the **monotonic relationship** between two variables (whether linear or not). It is applicable to both continuous and ordinal (ordered categorical) variables, and makes no assumptions about the overall distribution of the variables or the sample size.
The Pearson correlation coefficient evaluates the linear relationship between two variables, while the Spearman correlation coefficient evaluates the monotonic relationship.
The relationship between the two can be explained with the following examples.
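One way to see the difference is a relationship that is monotonic but not linear. The sketch below is plain Python, not SPL, and the data is synthetic. It computes Spearman as the Pearson correlation of average ranks, which is a standard formulation but not necessarily how SPL's `spearman()` is implemented, and compares the two coefficients on y = x³.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1   # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the rank sequences."""
    return pearson(average_ranks(xs), average_ranks(ys))

x = list(range(1, 11))
y = [v ** 3 for v in x]   # monotonic but clearly nonlinear

print(pearson(x, y))      # below 1: the points do not lie on a straight line
print(spearman(x, y))     # 1 up to rounding: the ordering is preserved exactly
```

Since y always rises with x, the ranks of the two variables are identical and Spearman reaches its maximum, while Pearson is pulled below 1 by the curvature.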
Let’s take an example to see how to analyze the relationship between two variables.
We use the house price prediction data from Kaggle to analyze whether there is a correlation between ‘GrLivArea’ (living area) and ‘SalePrice’ (sale price).
Data analysis is usually exploratory: each calculation informs the next step. This makes it a good fit for SPL, which is highly interactive.
First, import the data into SPL, then generate a scatter plot to observe the relationship between the two variables.
SPL code:
| | A |
|---|---|
| 1 | =file("house_prices_train.csv").import@tc() |
| 2 | =A1.(GrLivArea) |
| 3 | =A1.(SalePrice) |
| 4 | =canvas() |
| 5 | =A4.plot("NumericAxis","name":"x") |
| 6 | =A4.plot("NumericAxis","name":"y","location":2) |
| 7 | =A4.plot("Dot","lineWeight":0,"lineColor":-16776961,"markerWeight":1,"axis1":"x","data1":A2,"axis2":"y","data2":A3) |
| 8 | =A4.draw(800,400) |
A1: Import the data.
A2: Living area variable.
A3: Sale price variable.
A4-A8: Generate a scatter plot where the x-axis represents the living area and the y-axis represents the sale price.
The scatter plot shows a positive, roughly linear relationship between the living area and the sale price. The Pearson and Spearman correlation coefficients can therefore be used to quantify the correlation.
Continue writing code:
| | A |
|---|---|
| … | …… |
| 9 | =pearson(A2,A3) |
| 10 | =spearman(A2,A3) |
A9: Calculate the Pearson correlation coefficient.
A10: Calculate the Spearman correlation coefficient.
The results indicate a strong linear relationship between the living area and the sale price.
Before regression analysis or linear fitting, data is sometimes preprocessed to strengthen the linear relationship. In this example, we can correct the skewness of the data and then re-evaluate the linear relationship between the variables.
SPL offers some automatic preprocessing functions, which are very convenient to use.
Continue writing code:
| | A |
|---|---|
| … | …… |
| 11 | =A2.skew() |
| 12 | =A3.skew() |
| 13 | =A2.corskew() |
| 14 | =A3.corskew() |
| 15 | =pearson(A13(1),A14(1)) |
A11-A12: Calculate the skewness of the two variables respectively.
Both variables have fairly high skewness, indicating that neither follows a normal distribution.
A13-A14: Automatic skewness handling.
A15: Calculate the Pearson correlation coefficient after skewness handling.
It can be seen that after skewness handling, the linear relationship between the variables is strengthened, simplifying subsequent fitting calculations.
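To illustrate the general idea of skewness handling, the sketch below measures skewness before and after a log transform. It is plain Python with invented values. The log transform is a common swapped-in technique chosen for illustration, not a description of what SPL's `corskew()` does internally, and the population skewness formula used here may differ from the one behind `skew()`.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed prices, invented for illustration only.
prices = [90, 100, 110, 120, 130, 150, 180, 250, 400, 750]

# log1p(p) = log(1 + p) compresses the long right tail.
logged = [math.log1p(p) for p in prices]

print(skewness(prices))   # strongly positive: long right tail
print(skewness(logged))   # smaller than before: the tail has been pulled in
```

A few very large values dominate the raw data; after the transform they sit closer to the rest, so the distribution is more symmetric and a straight-line fit is less influenced by the extremes.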