Hello, and welcome to all! I've been promising this for quite a while (like, as soon as the new study design came out "a while"), and so here it is.
Throughout the rest of this post, I'm going to be exploring the part of probability and statistics that's slated to come into the new study design - this journey is ALSO going to explore a couple of things that /aren't/ in the new study design, but are a little necessary to understand what's going on. Because of this, the entire thing will be colour coded like so:
Black text means what I'm writing is relevant, and something you will be studying next year.
Blue text means it's a little less important - it's not assessable, but for a good and proper contextual understanding, you should know about this.
Red text means what I'm writing about is not necessary at all - I'm either teaching you about some cool maths, extending on the content a little bit, or providing anecdotes.
Finally, green text will be used to discuss examples. If it's in green text, there is NO new content, it's just showing an example of when this might be useful.
You'll notice that this colour code does not account for methods/specialist separation - this is because to do specialist statistics, you need to know about methods statistics. So, from this point on, everything I talk about will be only concerning methods statistics, and I'll let you know when we reach the end of that. From that point on, it's all specialist stats. If you did methods in 2015, and are doing specialist in 2016, I highly recommend reading the entire post, still.
Importantly - this is NOT designed to teach you how to do statistics for VCE. Yes, I go over the concepts, and you might actually be able to learn it all from this one guide. However, the MAIN purpose of this post is to teach you about what you're going to be getting yourself into, because this is an area of mathematics you've probably not actually seen before.
NOTE: I mentioned red text will include anecdotes. I'm not just a stats major, I'm also a chemistry major, and so statistical analysis is something I do a lot. Believe it or not, a lot of the things being introduced into VCE are actually highly useful! (erm, sorta... We'll get back to this) In fact, I've used these exact techniques in my own research which I was meant to present at a scientific conference. So, get excited! Finally, something you can USE!!
Inferential vs. Descriptive Statistics
Before now, you've probably thought of statistics as "drawing pie-charts and finding the mean, median, mode and range of things!!". If you did further, you might have expanded on this - not only talking about means and charts, but also standard deviations. You might have heard these words again in probability, at which point you thought, "so probability is just like statistics?"
The truth is, they are related - but not as you know it. So far, all you've dealt with is what is known as descriptive statistics. Descriptive statistics is all about trying to describe a set of data, and this type of statistics is very useful for getting information out to the public and analysing data. This includes things such as sports stats (for example, what's the score on the latest AFL match?), discovering where to allocate resources (for example, most teenagers spend 60% of their time on the internet. If you want to reach out to them, the internet is the best place to find them) and even analysing scientific data (for example, chemical A has an absorbance spectrum that looks like this. If we put chemical B's absorbance spectrum over this known spectrum for A, and they match, we know that chemical A and B are similar).
The type of statistics that has just been introduced into the study design is known as inferential statistics. This type of statistics is a bit more useful to society, because it is designed to put a number on how sure we are, and so tell us whether we should proceed or not. Descriptive statistics is very limited in this sense - it can give you a very good idea of what you're doing, but if you have to make a call, it's never black or white. With inferential statistics, it largely is - this is something we'll explore very soon, when I teach you about confidence intervals.
Inferential statistics is a very diverse field, unfortunately, and you won't see all of it at once. In methods and specialist, you explore the use of confidence intervals of means and proportions, and hypothesis testing of means. You also covered a little bit of this in further (if you took it), in the form of one of the more useful techniques, known as regression analysis. You didn't do it in the truest sense, unfortunately, but you did cover it when you went over lines of best fit (also known as least squares data fitting).
Probability vs. Statistics
We've already discussed the difference between probability and statistics a little bit, so let's go a little bit deeper - what is the difference between probability and statistics?
In reality, if you were to take a coin, and that coin was called "data", then one side of the coin would be probability, and other side would be statistics. A more telling example, however, is to think of that coin in a different way - let's call "data" what we write down about the toss of a coin. In this scenario, probability is the hand flipping the coin, and statistics is what the coin lands on. That is to say, probability is what generates the numbers, and statistics is what we can see at the end.
A handy way of thinking about that is by this little chart:
|Probability||Statistics|
|Generates data||Figures out the generator|
|Describes the distribution||Guesses the distribution|
|Concerns a population||Concerns a sample of a population|
The first two points mirror the idea that probability is a generator that makes data, and statistics takes data and tries to guess at the generator, but the last point is one you might not have heard of before. This follows an idea that all events can be described by a probability distribution of some sort (philosophical questions of "can something be random" aside), and so each population of things will follow this probability distribution. Populations are just as they sound - they concern everything. Populations are a little hard to describe, so seeing examples will help you get the idea of what is what. Samples are easy, though - they're some portion of a population. In fact, pretty much everything we deal with is a sample, because it's not exactly possible to GET a whole population to give you data.
An example of a population could include the human race, the colour of socks made by a particular supplier or the proportion of rock songs released on the radio in the 1990s. These are all populations because all data that is available, or ever will be available, is present.
An example of a sample, however, could be how many schools teach only to humans, the colour of socks made by all sock makers or the proportion of rock songs ever released. In these cases, not everything is known - in the first case, we can only talk about human schools, in the second case, we don't know all the sock makers in the world (only the commercial ones), and in the third case, we haven't reached the end of time. Not all data is yet available, and more testing can be done.
The idea of statistics is a simple one - given a sample, let's try and figure out what traits we can discern, and from that, figure out how closely they describe the population. However, how can we do that? The general gist of this is to define a population parameter, and then use sample statistics to estimate it. For example, consider the binomial random variable.
If you haven't done 3/4 probability yet, you won't know about the binomial random variable, but essentially it is a counting random variable. A binomial random variable defines one event as a "success" and another as a "failure", with some probability p of success. This event is then repeated some number of times, n, and the number of successes is the number that the random variable takes on.
In this situation, the population parameters are n and p - however, the sample statistics are the numbers that this random variable takes on at each instance. Let's say that we toss a fair coin 6 times, and we want to get "heads". In this case, the population parameters are n=6 and p=0.5. However, if this coin is not fair, then we wouldn't know what the probability is, but we do know what n is (since it's only how many times we toss the coin). If we get 2 heads out of our 6 tosses, we might guess that the population parameter is p=2/6=1/3. This is known as estimation.
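Estimation is not covered in detail in the methods course, so instead we'll just consider three of the more common estimators, shown below:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 \qquad \hat{P} = \frac{X}{n}$$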
The first estimator is the sample, or arithmetic, mean, and it's what you've been evaluating since year 7. Add all the numbers up, divide by the sample size, and you're done.
The next is the sample variance, which is built from the squared differences between what you've seen and the sample mean. You will notice that in this case, we divide by n-1 instead of n. This is known as a correction factor, and we use it so that S^2 is unbiased. A particular estimator, V, which estimates some parameter, U, is said to be unbiased if E(V) = U - that is, in the long term, we expect V and U to be the same quantity.
Finally, we have the proportion statistic. We'll come to this one a bit later, but it essentially describes the proportion of "desirable outcomes". Think of it this way - if X is a binomial random variable with n trials, then dividing X by n will give you an estimate of the probability of success, like we did just before.
If you're particularly switched on, you may have noticed the use of capital letters for these estimators. That's because the value they take on is, inherently, random as well, and so you can actually think of them as random variables! This might be confusing at first, but you can test it out just by flipping a coin. Take one coin, flip it 10 times, and record the "mean number of heads" (score heads as 1 and tails as 0, then take the mean). Do the experiment again with the exact same coin. You probably got a different mean each time, hence the sample mean is "random" in its own way.
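If you'd rather not flip a real coin, here's a rough Python sketch of the same experiment. The seed, the helper names and the choice of a fair coin are all just my assumptions for illustration - the point is only that the same estimators give different answers on different runs.

```python
import random
from statistics import mean, variance


def sample_proportion(successes, n):
    """Proportion estimator: number of successes divided by number of trials."""
    return successes / n


def flip_coin(n, p, rng):
    """Return n flips as a list of 1 (heads) and 0 (tails), with Pr(heads) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


rng = random.Random(2016)  # arbitrary seed, only so reruns give the same flips

# Two runs of the same experiment: 10 flips of the same (assumed fair) coin
first = flip_coin(10, 0.5, rng)
second = flip_coin(10, 0.5, rng)

for flips in (first, second):
    print("flips:", flips)
    print("  sample mean      :", mean(flips))
    print("  sample variance  :", variance(flips))  # statistics.variance divides by n - 1
    print("  sample proportion:", sample_proportion(sum(flips), len(flips)))

# The "2 heads out of 6 tosses" estimate from earlier
print("estimate of p:", sample_proportion(2, 6))
```

Because heads is scored as 1 and tails as 0, the sample mean and the sample proportion of heads are the same number here.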
The law of large numbers says that as the number of flips approaches infinity, the sample mean will actually stop being random, and settle on the population mean. This is where statistics and probability start to overlap - if we can stop the sample mean from being random, our sample will actually very closely describe the population parameters!
The Population Proportion
Now, this is something we just went over, in the proportion statistic. We used it before as if discussing a coin, but this can be generalised a bit. For example, how many people vote liberal? How many women have used a male toilet?
The easiest way of trying to figure this out is to define a "success" as the information you care about, and then use the sample proportion to figure out the probability of people that "fit the bill", so to speak. This seems almost trivial at first, however - this only teaches us about the sample, not the population. So, how do we find out about the population?
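Here's the good part - the proportion statistic is based off of a binomial random variable, and as the number of trials of the binomial random variable approaches infinity, it becomes a NORMAL random variable (once it's been standardised). That might be a bit weird to hear, so don't worry if it doesn't make sense - just know that for large n, this simply means that the binomial random variable is approximately normal. As a consequence, this means that the proportion statistic is ALSO approximately normal for large n. Also note that the variance and mean don't change - so, let's figure out what those are, assuming that $X \sim \text{Bi}(n, p)$ and $\hat{P} = X/n$:
$$E(\hat{P}) = \frac{E(X)}{n} = \frac{np}{n} = p, \qquad \text{Var}(\hat{P}) = \frac{\text{Var}(X)}{n^2} = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}$$
So, for large n:
$$\hat{P} \overset{\text{approx}}{\sim} N\left(p, \frac{p(1-p)}{n}\right)$$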
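If you want to see this for yourself, here's a small sketch that works out the mean and variance of the sample proportion straight from the binomial probabilities, and then compares one exact binomial probability with its normal approximation. The values n=100 and p=0.5 are made up for illustration, and no continuity correction is used, so expect the last two numbers to agree only roughly.

```python
from math import comb, erf, sqrt


def binomial_pmf(k, n, p):
    """Pr(X = k) for X ~ Bi(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_cdf(x, mu, sigma):
    """Pr(N <= x) for a normal N with mean mu and standard deviation sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


n, p = 100, 0.5  # hypothetical values, chosen only for illustration

# Exact mean and variance of P_hat = X / n, summed over every possible value of X
exact_mean = sum((k / n) * binomial_pmf(k, n, p) for k in range(n + 1))
exact_var = sum((k / n - exact_mean) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))
print("mean of P_hat    :", exact_mean, "vs p =", p)
print("variance of P_hat:", exact_var, "vs p(1-p)/n =", p * (1 - p) / n)

# Pr(0.40 <= P_hat <= 0.60): exact binomial sum vs the normal approximation
sigma = sqrt(p * (1 - p) / n)
exact_prob = sum(binomial_pmf(k, n, p) for k in range(40, 61))
approx_prob = normal_cdf(0.60, p, sigma) - normal_cdf(0.40, p, sigma)
print("exact binomial probability:", exact_prob)
print("normal approximation      :", approx_prob)
```

Try bumping n up and watch the two probabilities creep closer together.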
Now, this is where z-scores come to the rescue - now that we know what distribution the sample proportion has, we can use this to our advantage. This brings us into the realm of confidence intervals. These are a bit tricky to think about, so we start from the bare bones:
A confidence interval is an interval that the population parameter belongs to with some probability. So, let's say that the proportion of women who have used a male bathroom has a 95% confidence interval of [0.2,0.3]. This means that we can say with 95% confidence that the population proportion of women who have used a male bathroom is somewhere between 0.2 and 0.3. (Strictly speaking, the 95% describes the method: if we repeated the survey many times, about 95% of the intervals built this way would contain the true proportion.)
So, how does this z-score help us? Well, we know that to find some confidence interval, we can use:
$$\Pr(a < \hat{P} < b) = q$$
where this confidence interval is a q*100% confidence interval. Picking a and b might seem difficult at first, however - we know that this particular statistic is approximately normal. So, if we make it standard normal, we can use the symmetry of the normal distribution and the z-score to make picking a and b super easy - because a just becomes -z, and b becomes +z! This gives us:
$$\Pr\left(-z < \frac{\hat{P} - p}{\sqrt{p(1-p)/n}} < z\right) = q$$
Now, here's where things get tricky. We want to use this to find a confidence interval for p, the population proportion (which is also the mean of the statistic). The problem is that p is present in the numerator AND denominator. However, the standard deviation is usually very small - so small, that we can replace the population proportion in it with the sample proportion. This does turn the confidence interval into an approximation, but this is a beauty of real life mathematics - we usually don't care for more than 5 significant figures (or even for more than 2 decimal places), and so if the error from this approximation isn't very big, it won't affect the end result.
In particular, this approximation works very well for large n. A lot of statisticians like to use the npq rule - that is, if npq=np(1-p) is less than or equal to 5, then the approximation is too imprecise. This rule comes from the approximation of binomial distributions as normal distributions, so don't worry too much about the proof of it. I know some good books if you are interested, though.
So, after all that, we have decided that our equation has now turned into this:
$$\Pr\left(-z < \frac{\hat{P} - p}{\sqrt{\hat{P}(1-\hat{P})/n}} < z\right) \approx q$$
Rearranging for the one p that is left, we get:
$$\Pr\left(\hat{P} - z\sqrt{\frac{\hat{P}(1-\hat{P})}{n}} < p < \hat{P} + z\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}\right) \approx q$$
From this, we can say that a confidence interval for the population proportion is given by:
$$\left(\hat{p} - z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$$
where $\hat{p}$ is the observed value of the sample proportion, and z is chosen to suit whatever "confidence level" you want. Usually, we want a 95% confidence interval - for this, we pick z=1.96. You might remember the "68-95-99.7% rule", which is where this number comes from. This rule told us that about 95% of a normal distribution lies within z=2 standard deviations of the mean. However, that 2 is rounded to make the rule easy to remember. Using a few more decimal places gets us a little bit more accurate, giving the value of z=1.96.
So, let's look at the last federal election. All statistics from here on out come from this website. From this, we can see that 33% of all people voted for Labor - so, how did Liberal win with Labor on top at a third of the votes? Because of second preferences and such. This system means we can get a better idea of what Australians want. However, what if Labor should have scored higher? Let's put this to the test, and figure out how many electorates are truly "Labor" electorates.
So, we define our "success" as voting for Labor, and a "failure" as voting for any other party. This does not mean you should vote Labor - this is just so you can understand how this relates to binomial random variables. In this case, the sample statistic is 0.33, and with 150 electorates, that's a sample size of 150. Quick check - 150*0.33*0.67 is about 33, which is well above 5, and so all approximations are okay. This means a 95% confidence interval is given by:
$$0.33 \pm 1.96\sqrt{\frac{0.33 \times 0.67}{150}} \approx (0.255,\ 0.405)$$
So, the proportion of "Labor" electorates is approximately equal to what was observed. Given how close it is, and knowing that Liberal (the next highest) gave a very similar response, it really does make sense that we have the current system in place!
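If you'd like to check the Labor interval on something other than a CAS, here's a minimal Python sketch. The function name is my own invention, and the inputs (0.33, 150 and z=1.96) are just the numbers used in the example.

```python
from math import sqrt


def proportion_confidence_interval(p_hat, n, z=1.96):
    """Approximate confidence interval for a population proportion.

    Uses the normal approximation, with the sample proportion p_hat
    standing in for p inside the standard deviation.
    """
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


p_hat, n = 0.33, 150

# The npq rule of thumb: we want this to be bigger than 5
npq = n * p_hat * (1 - p_hat)
print("npq check:", npq)

low, high = proportion_confidence_interval(p_hat, n)
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```

Passing a different z (say 1.645 for 90%) gives a narrower or wider interval from the same data.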
And with that, we reach the end of the methods curriculum. There's not really much in there, now is there? So, let's move onto specialist! But first, an aside...
Combinations of Random Variables and the Central Limit Theorem
Note: the majority of this section is blue because most of it is not actually relevant. However, some of the equations you must know! So, look out for black text - I'll make sure that all black text in this section has its own paragraph.
With that out of the way, let's get started. Proportions are easy because they only concern one random variable, and that random variable is already approximately normal. Here's the problem, though - there are only a handful of distributions we can actually estimate, so if we want to make more confidence intervals and move into hypothesis testing, we need a way of relating the sample we have to one of these distributions. In particular, for VCE, you only ever cover one of these distributions - and that's the normal distribution. There are a couple of ways of doing this that we'll explore - but first, we need to know how to find the mean and variance of a combination of random variables.
Given two random variables X and Y, and constants a and b, we can relate the mean of sums of these variables like so:
$$E(aX + bY) = aE(X) + bE(Y)$$
We can also do this with their variance - however, for the next formula to work, we require that X and Y are independent.
$$\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y)$$
Note that if X and Y are NOT independent, then they have some amount of correlation which affects the variance of their sum.
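Here's a quick sketch of the standard rules for the mean and variance of aX + bY, checked against a simulation. The particular distributions (two independent normals) and the constants a=2, b=-1 are my own made-up choices. Notice that even though we subtract Y, its variance still gets added, because b is squared.

```python
import random
from statistics import fmean, variance


def combine(a, b, mean_x, var_x, mean_y, var_y):
    """Mean and variance of aX + bY, assuming X and Y are independent."""
    return a * mean_x + b * mean_y, a**2 * var_x + b**2 * var_y


# Hypothetical example: X has mean 10 and sd 2, Y has mean 4 and sd 3
a, b = 2, -1
theory_mean, theory_var = combine(a, b, 10, 2**2, 4, 3**2)
print("theory    :", theory_mean, theory_var)

# Simulate independent draws of X and Y and look at aX + bY directly
rng = random.Random(1)  # arbitrary seed so the run is repeatable
combos = [a * rng.gauss(10, 2) + b * rng.gauss(4, 3) for _ in range(20000)]
sim_mean, sim_var = fmean(combos), variance(combos)
print("simulation:", sim_mean, sim_var)
```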
Defining the distribution of these two random variables is a bit more difficult, however if they're both normal, this is easier. If X and Y are normal and independent, then aX + bY is also normal with the mean and variance as defined from above.
However, in most cases, X and Y aren't normal. More importantly, X and Y are usually a specific sampling. Because each sample - or incident, experiment, however you like to think of it - will happen some number of times, we use the notation $X_i$ for the ith experiment. From here, we can define the central limit theorem - remembering the sample mean, $\bar{X}$. The central limit theorem states that for a particular sequence of iid (independent and identically distributed) random variables $X_1, X_2, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$, then:
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \to N(0, 1) \quad \text{as } n \to \infty$$
Once again, we can make an approximation for this. If n is particularly large (not necessarily infinity), then we still apply the central limit theorem. The n you need could be anywhere from 5 to 5,000, and really depends on the data in question. For the purposes of VCE, 50 is usually a good number.
Confidence Intervals for Means
So, we know what a confidence interval is, so this won't take too long. Now, for this case, we'll need to assume that the variance of what we're measuring is known - once again, this usually isn't the case, but bear with me for a bit. Using the central limit theorem, we can define a confidence interval just as we did for proportions, so I'll skip through the process a bit. First, we find the mean and variance:
$$E(\bar{X}) = \mu, \qquad \text{Var}(\bar{X}) = \frac{\sigma^2}{n}$$
This actually gives some insight as to why the sample mean stops being random. Think about it - the variance measures the spread of the data, and as n goes to infinity, the variance of the sample mean goes to 0. This means that whatever number it pulls out won't be spread out at all - it'll only sit in one place, and that one place is the actual mean of the data.
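To watch the variance of the sample mean shrink, here's a rough simulation sketch. I'm assuming the data are uniform random numbers between 0 and 1 (which have variance 1/12) purely because they're easy to generate and clearly not normal. The sample sizes and the number of repeats are arbitrary too.

```python
import random
from statistics import fmean, variance

rng = random.Random(50)  # arbitrary seed so the run is repeatable
sigma2 = 1 / 12          # variance of a single Uniform(0, 1) draw


def simulated_variance_of_mean(n, repeats=2000):
    """Take `repeats` samples of size n and return the variance of their sample means."""
    means = [fmean([rng.random() for _ in range(n)]) for _ in range(repeats)]
    return variance(means)


results = {}
for n in (5, 50, 500):
    results[n] = simulated_variance_of_mean(n)
    print(f"n = {n:3d}: simulated variance = {results[n]:.6f}, sigma^2 / n = {sigma2 / n:.6f}")
```

Each time n goes up by a factor of 10, the spread of the sample mean should drop by roughly the same factor.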
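Now, the confidence interval part:
$$\Pr\left(-z < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z\right) \approx q$$
Once again, this gives us an interval:
$$\left(\bar{x} - z\frac{\sigma}{\sqrt{n}},\ \bar{x} + z\frac{\sigma}{\sqrt{n}}\right)$$
and the choice of z once again depends on how much confidence we want. Usually, we use z=1.96 for the 95% confidence interval.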
However, as I previously said, we usually DON'T know the variance of each particular random variable. In fact, if you ever do an experiment, you usually know absolutely nothing! So, once again, we make an approximation. Now, look at the variance for the sample mean - for very large n, this variance approaches 0. Because of this, swapping in an estimate of the variance will have little effect on the interval, and so we can actually use the sample variance (which we defined up above as an example of an estimator), giving us the following formula:
$$\left(\bar{x} - z\frac{s}{\sqrt{n}},\ \bar{x} + z\frac{s}{\sqrt{n}}\right)$$
So, let's say we're concerned with the number of times a particular person goes to the toilet each day. A survey of 50 random people told us that on average, each person went to the toilet 2.3 times a day, with a standard deviation of 1.2. From this, we get the following 95% confidence interval:
$$2.3 \pm 1.96 \times \frac{1.2}{\sqrt{50}} \approx (1.97,\ 2.63)$$
So, on average, each human uses the toilet somewhere between 2 and 2.6 times a day.
I'm not going to lie, this approximation is a bit of a weird one. However, it's useful for teaching students about how inferences can be made - in particular, this interval is very similar to one I commonly use myself. If the standard deviation is unknown, we may use a distribution that's not the normal distribution - this distribution is the Student's t-distribution, although selecting the value that replaces z isn't as easy in this case. Feel free to read up more about it if interested.
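Going back to the toilet survey for a second, here's a minimal Python sketch of the normal-based interval for a mean. The function name is hypothetical, and the inputs (2.3, 1.2, 50 and z=1.96) are the numbers from the survey example. It does not do the t-distribution version.

```python
from math import sqrt


def mean_confidence_interval(x_bar, s, n, z=1.96):
    """Approximate confidence interval for a population mean.

    Uses the normal approximation, with the sample standard deviation s
    standing in for the unknown population standard deviation.
    """
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin


low, high = mean_confidence_interval(x_bar=2.3, s=1.2, n=50)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```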
The t-distribution version is miles more useful than the one based on the normal distribution - in fact, for the conference I mentioned, I was looking at the use of different materials (known as MOFs) and their ability to adsorb CO2. In particular, I examined the use of carboxyl based MOFs and amine based MOFs, and I found that with 90% confidence amine based MOFs performed better than carboxyl based MOFs. I don't remember the numbers exactly, but it was something along the lines of carboxyl based MOFs had a mean adsorption factor in the interval [1,2] and amine based MOFs had a mean adsorption factor in the interval [2.5, 3]. Note, if I tried to raise the confidence level to 95%, then the intervals I got were [0.8,2.3] and [2.2,3.4]. You might be tempted to think that the amine based MOFs still perform better, but we don't know WHERE in those intervals the true mean actually lies - so, to say that one was higher than the other, I required that NO overlap of the intervals was present.
Hypothesis Testing of Means
We're actually nearing the end, this is the last topic left!! How sad.
Hypothesis testing makes one assumption - if you have some parameter, say a, and you want to test what value that parameter is, you simply assume that a takes on that value. That makes sense, right?
Probably not, so let's go through the process a bit more. We know that, for large n,
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
has (approximately) a standard normal distribution. So, if we were to get a number from that fraction that doesn't look like it belongs to a standard normal distribution, then maybe some part of the fraction is wrong.
From this, we need to develop a way of guessing outcomes - that is, making a "hypothesis". From this, we consider two different hypotheses - the first, known as $H_0$, is the "null hypothesis" or "no-change hypothesis". The second is known as $H_1$, and it's the "alternate hypothesis". You could also have $H_2$, $H_3$, etc., and each is simply the next "alternate hypothesis". Usually for education purposes, we don't really go further than $H_1$.
This is where things get tricky - hypothesis testing cannot prove something is true. It can only prove if something is false.
This is just by the nature of statistics - it's very difficult to say if something random is in a distribution. However, it's very easy to say if something random is NOT in a distribution.
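Because of this, we define the test statistic as
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where $\mu_0$ is the value of the mean we're assuming. This test statistic has the following possible hypotheses:
$$H_0: \mu = \mu_0 \qquad H_1: \mu \neq \mu_0$$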
There are, of course, more that you could cover - a google search can give you a whole list of different hypotheses! You may have noticed the use of the sample standard deviation (again) instead of the population standard deviation. This is once again a bit of a weirdy, but we can use the same correction with the student's t-distribution if we wish, using the appropriate t-value instead of the z-value we wish to use.
Now, let's say we want to say with 95% confidence that the mean is equal to $\mu_0$ - this means that the size of z should be less than 1.96. So, if you compute that test statistic and get a value over 1.96 (or under -1.96), then we can say with 95% confidence that the test-statistic probably doesn't belong to a standard normal distribution. However, the central limit theorem says that it MUST be normal - this means that either the variance or the mean isn't what we thought it was! Since only the mean wasn't measured (remember, we "guessed" that $\mu_0$ was the correct mean), it stands to reason that that isn't the mean. So, from that, we accept the alternate hypothesis - that the mean is something else.
These tests, of course, usually follow from changing something, and wanting to know if that change was a good one. So, let's look at the study designs! Over the past 5 years (not including 2015), the number of people enrolled in specialist maths has been 4489, 4224, 3877, 4056 and 4489. This gives us a sample mean of 4227 and standard deviation of 269.
Let's also say that under this study design, the mean number of people is 5000. This gives us the test statistic:
$$z = \frac{4227 - 5000}{269/\sqrt{5}} \approx -6.43$$
The size of this is greater than 1.96, so we can say with 95% confidence that the study design HAS improved! In fact, all we can really say is that the number of people taking specialist maths has changed. We can't even say it's gotten higher - although, that's a fair assumption given that the new mean of 5000 is higher than the sample mean! Usually, we would guess that it's due to the new study design, because that's the most substantial change that we've witnessed, but that might not be true.
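If you want to reproduce the enrolment numbers yourself, here's a short Python sketch. The enrolment figures and the 5000 are the ones from the example above, the variable names are mine, and the 1.96 cut-off assumes the two-sided 95% test described earlier.

```python
from math import sqrt
from statistics import mean, stdev

enrolments = [4489, 4224, 3877, 4056, 4489]  # the five years quoted above
mu_0 = 5000                                  # the mean we are assuming under the null hypothesis

x_bar = mean(enrolments)
s = stdev(enrolments)   # sample standard deviation (divides by n - 1)
n = len(enrolments)

# Test statistic: how many standard errors the sample mean sits from mu_0
z = (x_bar - mu_0) / (s / sqrt(n))
print("sample mean:", x_bar)
print("sample sd  :", round(s, 1))
print("z          :", round(z, 2))

# Reject the null hypothesis if z is further than 1.96 from 0, in either direction
reject_null = abs(z) > 1.96
print("reject the null hypothesis?", reject_null)
```

With only 5 data points the normal approximation is pretty rough, which is exactly where the t-distribution mentioned earlier would come in.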
Finally, we might like to define what is known as a "p-value". The p-value is super easy - it's a measure of how significant the result we got is! So before, we compared the test-statistic to a z-score to find out what the significance is. However, instead of doing that, let's just see at what probability the test-statistic comes out at. To test this, we define the p-value as shown below:
$$p = \Pr(|Z| > |z|)$$
That is, we define p as the probability that Z (a standard normal random variable) is more extreme than the value we found. If it's very likely that Z is more extreme than the test-statistic, then there's no reason to reject the null hypothesis. If it's very unlikely that Z is more extreme than the test statistic, then there's good reason to reject the null hypothesis.
Using our situation from before, we use our CAS (or in my case, Microsoft Excel) to calculate the p-value, which comes out at roughly $10^{-10}$. Since this value is SUPER tiny, it's pretty likely that the number of students participating in specialist has changed, and so we would reject the null hypothesis. Great work, team!
This might seem like a weird situation for you, but it's not exactly uncommon in forensics. For a chem unit I took this semester, we were examining the iron content of a sample of concrete to find out who made the concrete. We had reason to believe that this concrete only belonged to one of two different manufacturers, and so we analysed the iron content using AAS (a chemistry machine). From there, we used an extension of this hypothesis testing technique, and found that the iron content of the sample was different to one of the manufacturers with a p-value of 0.002 - that is, if the sample really had come from that manufacturer, we'd only see a difference this big 0.2% of the time. This meant that the concrete most likely came from the other manufacturer.
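Coming back to the enrolment example, here's a sketch of the p-value calculation without a CAS or Excel. I'm assuming a two-sided p-value (Z more extreme in either direction) and plugging in the rounded test statistic of -6.43. The function name is made up.

```python
from math import erfc, sqrt


def two_sided_p_value(z):
    """Pr(|Z| > |z|) for a standard normal Z, computed with the complementary error function."""
    return erfc(abs(z) / sqrt(2))


# Sanity check: z = 1.96 should give a p-value very close to 0.05
print("p-value at z = 1.96 :", two_sided_p_value(1.96))

# The enrolment example, using the rounded test statistic
p_value = two_sided_p_value(-6.43)
print("p-value at z = -6.43:", p_value)
print("reject at the 5% level?", p_value < 0.05)
```

Comparing the p-value with 0.05 gives the same decision as comparing the size of z with 1.96.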
Which brings us to the end of this little journey. I hope you enjoyed it, and even more, I hope that this rummaging around with statistics has gotten you excited - even if the stats seemed boring, hopefully my examples were funny/interesting enough to get you excited and realise that what you're about to learn actually has some real life applications that you can readily use!