Welcome back for the final post in our series introducing basic statistics concepts for people in tech. In earlier posts we covered the basics of descriptive statistics, probability, and statistical inference. In this post I’m going to introduce correlation and regression, along with a brief discussion on inference and interpretation in regression modeling.
A programmer comes up with the theory that his caffeine consumption is directly related to his productivity; the more cups of coffee he drinks in a day, the more code he can churn out. (NB: This is a bad plan. Don’t try this at home.) Over the course of a month, he records his coffee consumption and the number of lines of code he wrote, then calculates a bivariate correlation to examine the relationship between the two continuous variables. As the amount of coffee consumed increases, does the amount of code written go up, down, or show no change? Correlation is most commonly measured with Pearson’s r, which has both magnitude (r = 0.0 means no linear correlation and |r| = 1.0 is a perfect correlation) and direction (positive or negative).
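To make the arithmetic behind Pearson’s r concrete, here is a minimal sketch in plain Python. The six days of coffee and code data are made up for illustration, and the helper name pearson_r is my own, not something from a particular library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, plus each variable's sum of squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: cups of coffee and lines of code on six days.
cups = [1, 2, 3, 4, 5, 6]
lines = [120, 150, 170, 210, 220, 260]

r = pearson_r(cups, lines)
print(f"r = {r:.3f}")
```

With this invented data the two variables rise together almost in lockstep, so r comes out strongly positive. Reversing the order of one list would flip the sign without changing the magnitude.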
You can visualize a correlation using a scatterplot, where every point on the graph represents a pair of values from your two variables. The more tightly clustered the points are along a single straight line, the higher the correlation. Scatterplots are useful in making general observations about two variables’ relationship at a glance, such as the strength and direction of the relationship. Scatterplots can also help determine whether a relationship is linear, i.e. if the points fall along a straight line or a curve; a very low correlation, near r = 0.0, might be due to a curvilinear relationship instead of a true zero correlation. Finally, scatterplots are helpful in identifying outliers, or points that do not appear to fit with the rest of the data and might incorrectly affect the estimated correlation.
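The two scatterplot warnings above (curved relationships and outliers) can also be seen in the numbers. This is a rough sketch with invented values: the first dataset follows a perfect U-shaped curve, and the second is a loose cloud of points plus one extreme point. The pearson_r helper is a hypothetical hand-rolled function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Case 1: Y is completely determined by X, but along a curve (Y = X squared).
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# Case 2: five loosely related points, then the same points plus one outlier.
base_x = [1, 2, 3, 4, 5]
base_y = [3, 1, 4, 2, 3]
r_base = pearson_r(base_x, base_y)
r_outlier = pearson_r(base_x + [20], base_y + [20])

print(f"curve: {r_curve:.2f}, base: {r_base:.2f}, with outlier: {r_outlier:.2f}")
```

The curved data produces a correlation of zero even though Y can be predicted perfectly from X, and the single added point turns a weak correlation into a very strong one. Neither problem is visible from r alone, which is why plotting the data first is worth the time.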
Regression is directly related to correlation but moves from describing the relationship between two variables to predicting the value of one variable (Y, a.k.a. the response or dependent variable) from another (X, a.k.a. the predictor or independent variable). We use the term simple regression when there is only a single predictor. A linear regression is represented by an equation in the form of Y = β0 + β1X, where β0 is the intercept (the point where the line crosses the Y axis, i.e. the value of Y when your X variable is 0) and β1 is the slope (the amount of change in Y when X changes by 1). The slope is usually the part of the regression equation that has the information we’re most interested in. It is also the statistic used in hypothesis testing for regression; if X is a significant predictor of Y, then its slope, β1, is significantly different from zero. The slope is also directly related to the bivariate correlation; if you standardize both variables in a simple regression, the intercept drops out (because β0 = 0) and β1 has the same value as the correlation between the two variables.
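As a sketch of how the intercept and slope are estimated, and of the link between the standardized slope and the correlation, the following fits a simple regression by least squares on made-up coffee data. The function names fit_simple and standardize are my own, chosen for illustration.

```python
from math import sqrt

def fit_simple(xs, ys):
    """Least-squares intercept (b0) and slope (b1) for Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def standardize(values):
    """Convert values to z-scores (mean 0, sample standard deviation 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical data: cups of coffee and lines of code on six days.
cups = [1, 2, 3, 4, 5, 6]
lines = [120, 150, 170, 210, 220, 260]

b0, b1 = fit_simple(cups, lines)
z0, z1 = fit_simple(standardize(cups), standardize(lines))

print(f"raw:          lines = {b0:.1f} + {b1:.1f} * cups")
print(f"standardized: intercept = {z0:.3f}, slope = {z1:.3f}")
```

In the raw fit, the slope reads as extra lines of code per extra cup. In the standardized fit, the intercept is zero (up to floating-point rounding) and the slope matches Pearson’s r for the same data.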
On a scatterplot, the regression line is the best-fit line: the line that passes through the plotted data points with the smallest overall distance between the points and the line (in ordinary least squares, the smallest sum of squared vertical distances). When there is a perfect correlation, X predicts Y without error, so all of the data points in a scatterplot fall exactly along the regression line. The differences between each actual Y value and the regression equation’s predicted Y value (Ŷ, pronounced ‘Y-hat’) are the model’s residuals, which are valuable in determining how well the model fits the data.
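Here is a small sketch of what residuals look like in practice, assuming an ordinary least-squares fit and the same kind of invented coffee data. It also checks the "best-fit" idea by comparing the fitted line against two deliberately worse lines. All names are hypothetical.

```python
def fit_simple(xs, ys):
    """Least-squares intercept (b0) and slope (b1) for Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def sum_squared_residuals(b0, b1, xs, ys):
    """Total squared vertical distance between the points and a given line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: cups of coffee and lines of code on six days.
cups = [1, 2, 3, 4, 5, 6]
lines = [120, 150, 170, 210, 220, 260]

b0, b1 = fit_simple(cups, lines)
predicted = [b0 + b1 * x for x in cups]                     # the Y-hat values
residuals = [y - y_hat for y, y_hat in zip(lines, predicted)]

best = sum_squared_residuals(b0, b1, cups, lines)
steeper = sum_squared_residuals(b0, b1 + 5, cups, lines)    # slope nudged up
shifted = sum_squared_residuals(b0 + 10, b1, cups, lines)   # line moved up

print([round(e, 1) for e in residuals])
print(f"best = {best:.1f}, steeper = {steeper:.1f}, shifted = {shifted:.1f}")
```

The residuals from the fitted line sum to zero apart from rounding, and either alternative line gives a larger sum of squared residuals than the fitted one. Plotting the residuals against X is a common next step for spotting curves or outliers the line did not capture.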
Multiple linear regression uses a set of X variables to predict a single outcome Y; maybe amount of coffee consumed, number of hours of sleep, and proximity to a deadline all combine to predict lines of code written better than any one of those variables by itself. This is a much more likely scenario than the simple regression case, since an outcome usually has multiple predictors. The multiple linear regression equation extends the simple regression equation by adding a regression coefficient β for each additional variable: Y = β0 + β1X1 + β2X2 + β3X3, and so on. Each regression coefficient is interpreted as the amount of change in Y when that particular X goes up by one unit and all other X values are held constant. Each β value is tested for significance separately via a t test.
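To show how the coefficients in a multiple regression can be estimated, here is a from-scratch sketch that solves the least-squares normal equations in plain Python. In real work you would normally reach for a statistics library instead; this version only exists to make the mechanics visible. The data is invented and deliberately noise-free (lines = 50 + 20 * cups + 10 * sleep - 5 * days to deadline), so the fit should recover those coefficients. It does not compute the t tests.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple(rows, ys):
    """Least-squares coefficients [b0, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical predictors for eight days: cups, hours of sleep, days to deadline.
cups = [1, 2, 3, 4, 5, 6, 2, 4]
sleep = [8, 6, 7, 5, 6, 4, 7, 8]
days = [10, 9, 7, 8, 3, 2, 5, 1]
rows = list(zip(cups, sleep, days))
lines = [50 + 20 * c + 10 * s - 5 * d for c, s, d in rows]

coefficients = fit_multiple(rows, lines)
print([round(b, 3) for b in coefficients])
```

Each recovered coefficient can be read the way the paragraph above describes: the cups coefficient is the change in lines of code for one more cup, with sleep and days to deadline held constant.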
There is a multiple correlation coefficient, R, but we’re usually more interested in R², which represents the proportion of variability in Y that can be predicted by that set of X predictors. The total variability in the outcome variable counts as 100% (which is the same as a proportion of 1). When R² = 1, the set of predictors accounts for 100% of the variability in Y and so we can perfectly predict Y just by knowing the value for each X in the model. R² is an indicator of the model’s overall fit, i.e. how well the set of X variables does in predicting the outcome. However, this does not mean you can just add extra variables to the model for the sole purpose of increasing the R² value; R² never decreases when a predictor is added, so this does not really improve the model’s fit, just the appearance of it. Building a good multiple regression model can be an involved process that compares competing models to maximize the model’s fit while keeping the model as simple as possible.
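The "proportion of variability" idea can be sketched as one minus the ratio of leftover (residual) variability to total variability. The example below uses a single predictor and made-up coffee data to keep things short, but the r_squared helper (a hypothetical name) works on the predictions from any model, including a multiple regression.

```python
def fit_simple(xs, ys):
    """Least-squares intercept (b0) and slope (b1) for Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def r_squared(actual, predicted):
    """Proportion of the variability in `actual` accounted for by `predicted`."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical data: cups of coffee and lines of code on six days.
cups = [1, 2, 3, 4, 5, 6]
lines = [120, 150, 170, 210, 220, 260]

b0, b1 = fit_simple(cups, lines)
model_predictions = [b0 + b1 * x for x in cups]
mean_only = [sum(lines) / len(lines)] * len(lines)  # ignores coffee entirely

r2_model = r_squared(lines, model_predictions)
r2_mean_only = r_squared(lines, mean_only)
r2_perfect = r_squared(lines, lines)

print(f"model: {r2_model:.3f}, mean only: {r2_mean_only:.3f}, perfect: {r2_perfect:.3f}")
```

Predicting the mean every time explains none of the variability, perfect predictions explain all of it, and the fitted model lands in between. With a single predictor, the model’s R² is simply the square of Pearson’s r.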
Interpretation and Inference
You’ve probably heard the phrase ‘correlation does not mean causation.’ Since regression modeling is built on correlation, this holds true for regression as well. A strong regression model (one that has balanced predictive power and parsimony, has met the assumptions of a regression analysis, and is hypothesis driven) can have immense predictive value. However, even the best model should not be used to infer causality if the data did not come from a well-designed experiment. A regression model inherently does not provide a way to account for potential predictors not included in the model. In a randomized experiment, any confounding variables should have equivalent distributions in all groups of interest and so should not be able to have a systematic influence on the outcome. The only variables left that could be causes are those that have been intentionally manipulated by the experiment. There are non-randomized methods for studying causality, but all have their own statistical controls and limits to what they can determine.
So How Do I Use This Stuff, Anyway?
– You are a game developer and want to estimate the amount of lag in a control based on a number of different factors.
– You are an IT sales rep who wants to show potential clients the relationship between investing in new custom software and increased revenue for their business.
– You are a DBA and want to improve the performance of a database by predicting how much slower a query runs per row being indexed.