Welcome back to our series on what every developer should understand about statistics. In the last post, I went through important foundational statistics concepts, including measures of central tendency, variability, and distributions. In this part, we’re going to discuss an additional area of statistics that will be important in understanding inferential statistics: probability.
Proportions, Percentages, and Probability (Oh My)
True or false: if you roll a D20 twenty times, each number on the die will come up exactly once. This is false, of course, but why? If every value has an equal chance of coming up on any individual roll, it seems to make sense that every value would be rolled an equal number of times. However, this only holds approximately, and only when the die is rolled a sufficiently large number of times. The expectation that each number on a D20 will be rolled 1 time in every 20 rolls describes its long-run probability: each number will come up an average of 1 time per 20 rolls, not exactly once in any particular set of 20.
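If you'd rather see this than take my word for it, here is a minimal simulation sketch in Python. The function name `roll_counts`, the seed, and the roll counts are my own choices for illustration; it assumes a fair die as modeled by Python's pseudo-random generator.

```python
import random
from collections import Counter


def roll_counts(num_rolls, sides=20, seed=None):
    """Roll a fair die num_rolls times and count how often each face appears."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, sides) for _ in range(num_rolls))
    # Include faces that never came up so every face has an entry.
    return {face: counts.get(face, 0) for face in range(1, sides + 1)}


# Compare a short run with progressively longer runs.
for n in (20, 2_000, 200_000):
    counts = roll_counts(n, seed=42)
    proportions = [count / n for count in counts.values()]
    print(f"{n:>7} rolls: lowest proportion {min(proportions):.4f}, "
          f"highest {max(proportions):.4f} (long-run value is 0.05)")
```

With only 20 rolls, some faces typically show up more than once and others not at all. As the number of rolls grows, the lowest and highest proportions should both close in on 0.05.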
You (hopefully) remember from a long-ago math class that a proportion is the number of things in a subgroup relative to the total number of things in the whole group, and that a percentage is what you get when you multiply a proportion by 100. Suppose there are 12 bags of coffee beans in our office break room. Three of the bags are a weird taco-flavored coffee that someone bought on a dare and which have been sitting in the back of the cupboard ever since. The proportion of bags of coffee that no one will touch is 3/12, or 0.25, which is 25% as a percentage. NB: Most of the time we do calculations with proportions, but speaking about them in terms of percentages flows better in English.
If you understand proportions and percentages, then you understand probability too. Probability is what you get when proportions are applied to inference and randomness. Early one Monday morning, you stumble into the office and blindly reach into the cabinet to get some coffee so you can finish waking up. The probability that you will grab a bag of taco coffee is 0.25; if you reach into the cabinet an infinite number of times*, you’ll grab a bag of flavored beans 25% of the time. All of the proportions in a set will add up to 1 (or 100%, if using percentages). If the only other coffee bags in the cabinet aren’t flavored, we could count them (9 out of 12) to determine that 75% of the bags are normal. Alternatively, we could just subtract the percentage of flavored bags (25%) from 100% to see that 75% of the coffee is actually drinkable.
*Assuming the bags are randomly distributed in the cabinet, all equally likely to be pulled, and that the bags are replaced after every pull, along with other requirements that are much simpler in silly examples than in the real world.
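The coffee arithmetic is small enough to do in your head, but here it is as a short Python sketch. The variable names are mine; the numbers are the 12 bags and 3 taco bags from the example, and `Fraction` is used only to keep the proportions exact.

```python
from fractions import Fraction

total_bags = 12
taco_bags = 3
normal_bags = total_bags - taco_bags

# A proportion is the subgroup count over the whole-group count.
p_taco = Fraction(taco_bags, total_bags)

# Two routes to the same answer: count the normal bags, or take the complement.
p_normal_by_counting = Fraction(normal_bags, total_bags)
p_normal_by_complement = 1 - p_taco

print(f"P(taco)   = {p_taco} = {float(p_taco):.2f} = {float(p_taco):.0%}")
print(f"P(normal) = {p_normal_by_complement} = {float(p_normal_by_complement):.0%}")
print("Proportions sum to 1:", p_taco + p_normal_by_counting == 1)
```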
Odds and Odds Ratios
‘Odds’ is a commonly misused term when talking about probability. When someone asks ‘What are the odds of that?’, they usually mean ‘What is the probability of that happening?’ Odds are directly related to a probability, but reframe it in terms of how likely something is to happen relative to it not happening. Mathematically, this is done by dividing the probability of an occurrence (e.g. the probability of rolling a 20) by the probability of its complement (the probability of not rolling a 20). The probability of getting non-taco coffee is 0.75, so the odds of this happening are 0.75/0.25, or 3: you are 3 times as likely to pull out normal coffee as taco coffee.
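As a rough sketch, converting between probability and odds looks like this in Python. The function names `odds` and `probability_from_odds` are hypothetical helpers I made up for this post, not part of any library.

```python
def odds(probability):
    """Convert a probability into odds: p / (1 - p)."""
    if not 0 <= probability < 1:
        raise ValueError("probability must be in [0, 1) for odds to be finite")
    return probability / (1 - probability)


def probability_from_odds(odds_value):
    """Convert odds back into a probability: o / (1 + o)."""
    if odds_value < 0:
        raise ValueError("odds cannot be negative")
    return odds_value / (1 + odds_value)


print(odds(0.75))                  # odds of grabbing normal coffee
print(odds(1 / 20))                # odds of rolling a 20 on a fair D20
print(probability_from_odds(3))    # back from odds to probability
```

Note that odds of 3 and a probability of 0.75 describe the same situation, which is exactly why the two terms get mixed up so often.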
A common application of odds is in calculating an odds ratio. As the name suggests, an odds ratio is the ratio of two odds and is a way to easily compare two related probabilities. For example, I got out a second D20, but I think it might be coming up 20 more often than my first D20. I roll each die many times and record the number of times each comes up 20 versus not-20 (i.e. any other number). The odds of rolling 20 for each die are the proportion of times 20 is rolled divided by the proportion of times any number other than 20 is rolled. The odds ratio, then, is the odds for one die divided by the odds for the other. If the dice have equal frequencies for rolling a 20, the odds ratio will be close to 1; the further the odds ratio moves from 1, the more different each die’s odds (and their underlying probabilities) are from each other.
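Here is a small sketch of the dice comparison in Python. The roll counts are invented purely for illustration (I did not actually roll two dice 1,000 times each), and the function names are hypothetical.

```python
def odds_from_counts(hits, misses):
    """Odds of an event from observed counts.

    (hits / total) / (misses / total) simplifies to hits / misses.
    """
    if misses == 0:
        raise ValueError("odds are undefined when the event always happens")
    return hits / misses


def odds_ratio(hits_a, misses_a, hits_b, misses_b):
    """Odds of the event for group A divided by the odds for group B."""
    return odds_from_counts(hits_a, misses_a) / odds_from_counts(hits_b, misses_b)


# Made-up data: each die rolled 1,000 times.
first_die = {"twenty": 50, "not_twenty": 950}
second_die = {"twenty": 80, "not_twenty": 920}

ratio = odds_ratio(second_die["twenty"], second_die["not_twenty"],
                   first_die["twenty"], first_die["not_twenty"])
print(f"Odds ratio (second die vs. first die): {ratio:.2f}")
```

With these invented counts the ratio comes out to about 1.65. Whether a ratio like that reflects a genuinely lopsided die or just random noise is a question for hypothesis testing, not something the ratio alone can answer.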
The last post in this series ended with a discussion of different statistical distributions that have expected properties, such as the proportion of values that will fall within a given number of standard deviations from the mean. Now combine that with the idea that a proportion is directly related to a probability, and you can see how distributions such as the standard normal distribution help show the probability of a particular data point relative to its distribution.
Picture a histogram. Each bar of the histogram represents the number of times (y axis) that a given value (x axis) occurred in the distribution. If you add up the frequency (i.e. the height of the bar) of each of these values, you get the total number of data points in your sample. Just like with the coffee bag example, the probability of getting any given value in the distribution is the frequency of that value divided by the total number of data points. Every value in the histogram has its own probability, and those probabilities will always add up to 1.
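To make the histogram idea concrete, here is a minimal Python sketch that turns raw frequencies into probabilities. The sample values are made up for the example, and `empirical_probabilities` is a name I chose, not a library function.

```python
import math
from collections import Counter


def empirical_probabilities(values):
    """Map each distinct value to its frequency divided by the sample size."""
    if not values:
        raise ValueError("need at least one data point")
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}


# Made-up sample of ten observations.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
probabilities = empirical_probabilities(sample)

for value, probability in probabilities.items():
    # The row of '#' characters is a crude text histogram of the frequencies.
    print(f"{value}: {'#' * sample.count(value):<4} p = {probability:.1f}")

print("Sums to 1:", math.isclose(sum(probabilities.values()), 1.0))
```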
If you smooth out the curve of your histogram to generate a probability density function, you can still use that distribution to reason about probability. The difference is that a distribution like the standard normal models an infinite number of possible x values instead of the discrete number of values shown in a histogram, so the probability of any single exact value is effectively zero. Instead, we usually talk about the probability of getting at least, or no more than, a given value. The probabilities still add up to 1, though, giving the standard normal curve (and all probability density functions) an area of 1 under the curve. This is why probability matters so much for statistics. Understanding how distributions and probability work is fundamental to hypothesis testing and inferential statistics, which are coming up in the next post.
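For the standard normal curve, the "at least" and "no more than" questions can be sketched with `NormalDist` from Python's standard library (available in Python 3.8 and later). The helper function names below are my own.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def p_at_most(z):
    """Area under the curve to the left of z."""
    return standard_normal.cdf(z)


def p_at_least(z):
    """Area under the curve to the right of z (the complement)."""
    return 1 - standard_normal.cdf(z)


def p_between(low, high):
    """Area under the curve between two values."""
    return standard_normal.cdf(high) - standard_normal.cdf(low)


print(f"P(Z <= 1)       = {p_at_most(1):.4f}")
print(f"P(Z >= 1)       = {p_at_least(1):.4f}")
print(f"P(-1 <= Z <= 1) = {p_between(-1, 1):.4f}")
```

The first two results add up to 1, which is the "area of 1 under the curve" property in action. The third should land near the familiar figure of roughly 68% of values falling within one standard deviation of the mean.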
So How Do I Use This Stuff, Anyway?
– You are a project manager who wants to see what proportion of your time each week is spent answering client emails.
– You are a mobile developer and are concerned that your app is more likely to crash when loaded on a data connection than when on wi-fi.
– You are a hiring manager who wants to visualize where a given candidate’s GPA falls relative to a distribution of all undergraduate GPAs to see what proportion of GPAs are below your candidate’s.
Coming up next in the series: hypothesis testing and inference!