In this problem, we're going to conduct a hypothesis test, and before we start, we should organize our data. The problem is about the birth months of American-born Major League Baseball players. There were 387 players with birthdays in January, 329 in February, 366 in March, 344 in April, 336 in May, 313 in June, 313 in July, 503 in August, 421 in September, 434 in October, 398 in November, and 371 in December. These are all observed values, and if we total them up, we are dealing with 4,515 American-born Major League Baseball players and their birth months.
The hypothesis test is based on an argument made by the author of the book Outliers. This author believes that more baseball players have birth dates in the months immediately following July 31st, meaning there are more in August, September, and October than in any other month, and he feels this is due to the age cutoff dates for nonschool baseball leagues. So we're going to run the hypothesis test to see whether birth months are distributed equally from month to month or whether there is an actual difference based on this data.
The first thing we're going to do is talk about what would be expected. If there were 4,515 players and there are 12 months, dividing 4,515 by 12 gives 376.25 as the count we would expect in each month if births were equally distributed. So in the expected column we put 376.25 for each of the months, all the way down the chart, and we'll come back to that shortly.
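As a quick check on the setup above, here is a minimal Python sketch of the totals and the expected count per month. The variable names are just for illustration, and the July count of 313 is taken to be the same as June's, which is what makes the counts add up to 4,515.

```python
# Observed birth-month counts as read out in the problem, January to December.
# July is taken as 313 (same as June), which makes the total come out to 4,515.
observed = [387, 329, 366, 344, 336, 313, 313, 503, 421, 434, 398, 371]

total = sum(observed)               # total number of players in the sample
expected = total / len(observed)    # equal share per month if births are uniform

print(total)      # 4515
print(expected)   # 376.25
```

Every month gets the same expected count here only because the test assumes each month is equally likely; it does not adjust for months having different numbers of days.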
In order to run the hypothesis test, the first thing we need to do is write our null hypothesis. Our null hypothesis is that Major League Baseball players are born in different months with the same frequency. If we reject this hypothesis, then there is evidence that birth months are not equally distributed, and that might give Malcolm Gladwell's statement some clout. The alternative, then, is that the birthdays of Major League Baseball players are not evenly distributed throughout the year.
Basically, what we want to do is run a goodness-of-fit test. We want to see how well the data we've collected on the American-born players fits the expected data. To run a goodness-of-fit test, we need to calculate a chi-square test statistic, and to calculate it we sum up (observed minus expected) squared, divided by expected, over all of the categories. So we come back to our chart and add a column to it: for each month we take the observed value minus its corresponding expected value, square that difference, and divide by the corresponding expected value.
It's easier to put all this data into the calculator to generate these values than to do each one individually. So I'm going to bring in my graphing calculator and go to STAT and EDIT. As you can see, I've already preloaded the data. In list one I've placed all the observed values, and in list two I've placed all the expected values, which again assume that each month is equally likely. So we go to the top of list three and tell it to take each observed value from list one and subtract each expected value from list two.
We square that difference and then divide by each expected value in list two. We get these values, and I'm going to round them to three decimal places just for the sake of recording them on your paper: 0.307, 5.934, 0.279, 2.764, 4.306, 10.633, 10.633 again, 42.699, 5.322, 8.864, 1.257, and 0.073. To find the test statistic, we need to add up all of these values, and the fastest way is to sum list three. So we quit, then hit 2nd STAT, scoot over to the MATH option, and sum everything in list three. When we do that, we get a test statistic of 93.0718, or about 93.072.
Now we want to find a P-value, which is the probability that chi-square would be greater than that test statistic, in this case about 93.072. I always like to draw a picture to summarize what's going on. We are running a chi-square goodness-of-fit test, so we want to look at the chi-square distribution. The chi-square distribution is usually drawn as a hump that is skewed to the right, and I say usually because with very few degrees of freedom, such as 2, the curve has no hump at all and just falls away from the vertical axis.
A chi-square graph depends on its degrees of freedom, and the degrees of freedom is found by taking k minus 1, where k is the number of categories you divided your data into. If we go back to our chart, we divided our data into 12 different months, so we have 12 categories. Our k is 12, and our degrees of freedom is 11. The degrees of freedom is also the mean of the chi-square distribution, and you will always find the mean slightly to the right of the peak of the curve, so we can put 11 on the chi-square axis.
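The list-three calculation and the degrees of freedom described above can be sketched in a few lines of Python. This mirrors the calculator steps rather than replacing them, and the variable names are illustrative only.

```python
# Observed counts (list one) and the equal expected count (list two).
observed = [387, 329, 366, 344, 336, 313, 313, 503, 421, 434, 398, 371]
expected = sum(observed) / len(observed)

# List three: (observed - expected)^2 / expected for each month.
components = [(o - expected) ** 2 / expected for o in observed]

# The test statistic is the sum of list three.
chi_square = sum(components)

# Degrees of freedom: number of categories minus one.
df = len(observed) - 1

print([round(c, 3) for c in components])
print(round(chi_square, 3))   # 93.072
print(df)                     # 11
```

August alone contributes roughly 42.7 of the total, which already hints at where the departure from an even spread is coming from.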
When we're trying to find our P-value, we are trying to find the probability that the chi-square value is greater than 93.072. If I extend this axis all the way out as far as my board will let me go, 93 is going to be somewhere out here, and I'm trying to find the area to the right of it between the green curve and the pink horizontal axis. I can do that with the cumulative distribution function for chi-square. When I use that function, I have to give it the lower boundary of the shaded area, the upper boundary of the shaded area, and the degrees of freedom of the chi-square curve I'm working with. For this problem our lower boundary is the test statistic, 93.072. For our upper boundary, keep in mind that the pink horizontal axis continues infinitely to the right, so we pick a very large number, 10 to the 99th power. And our degrees of freedom is 11.
Let me show you where to access that on your calculator. I bring the calculator in again, hit the 2nd button and the VARS button, and it's number eight in the menu you see right now. I enter my lower boundary, 93.072, my upper boundary, 10 to the 99th power, and my degrees of freedom, 11. The calculator gives a P-value of 4.663 times 10 to the negative 15th power, which is a super, super small number, so I may as well say it is very, very close to 0, meaning a result like this would be highly unlikely if the null hypothesis were true.
Now that I have a test statistic and a P-value, I can run my test one of two ways. One way is to find a critical chi-square value, which is the chi-square value that separates the curve into what we refer to as the fail-to-reject H0 region and the reject H0 region. That critical value is found by using the chart in the back of your book.
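For anyone who wants to check the right-tail area without the calculator, here is a rough Python sketch. It is not what the calculator does internally. It uses a closed-form expression for the chi-square right tail that only works for odd degrees of freedom, and the function name `chi2_right_tail` is made up for this example. The result should land on the same order of magnitude as the calculator's value, though the exact digits depend on the numerical method.

```python
import math

def chi2_right_tail(x, df):
    """Area to the right of x under a chi-square curve with ODD df (sketch only)."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this sketch only handles odd degrees of freedom")
    y = x / 2.0
    # With df = 1 the right-tail area is erfc(sqrt(x / 2)).
    tail = math.erfc(math.sqrt(y))
    # Each step up by 2 degrees of freedom adds y^a * e^(-y) / Gamma(a + 1).
    a = 0.5
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail

chi_square = 93.0718   # test statistic from summing the twelve components
df = 11                # 12 months minus 1

p_value = chi2_right_tail(chi_square, df)
print(p_value)         # a number on the order of 10 ** -15
```

Because the true upper boundary is infinity, this approach does not need the 10-to-the-99th stand-in that the calculator entry uses.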
When we're running this hypothesis test, we were told in the problem to use a level of significance of 0.05. For this problem, that means 0.05 sits in the right tail of the chi-square distribution. So here's our chi-square distribution. We put 0.05 in the right tail, which means 0.95 is over here in the bulk of the curve, and the value at the boundary line is the critical value that separates the two regions. If you look at the chart in the back of the book, you'll see a degrees of freedom column, and across the top you'll see the level of significance. You go down to your degrees of freedom, 11, and across to your level of significance, 0.05, and you find the critical value of 19.675.
Now we can either use the P-value to decide whether to reject the null hypothesis, or we can compare our test statistic with our critical value. I'm going to run the test both ways. The first way uses the P-value, which we found to be essentially zero. What that says is that there is really no area between the curve and the horizontal axis to the right of that test statistic. The decision rule is to reject the null hypothesis if alpha is greater than the P-value. In this case alpha is 0.05 and our P-value is really, really close to zero, so our decision is to reject the null hypothesis.
We could have run it the other way as well. With 0.05 in the right tail, we get a critical value of 19.675. Our test statistic was 93.072, and 93 would be way up here, to the right of 19.675, and this is the reject H0 region.
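The two decision rules just described can be written side by side as a small Python sketch. The numbers are simply the ones from the walkthrough: the test statistic of about 93.072 from summing the twelve components, the table critical value of 19.675, and the P-value as read off the calculator. The variable names are illustrative.

```python
alpha = 0.05             # level of significance given in the problem
chi_square = 93.072      # test statistic (sum of the twelve components)
critical_value = 19.675  # table value for 11 degrees of freedom, 0.05 in the right tail
p_value = 4.663e-15      # P-value as reported by the calculator in the walkthrough

# Rule 1: reject H0 when the P-value is smaller than alpha.
reject_by_p_value = p_value < alpha

# Rule 2: reject H0 when the test statistic lies beyond the critical value.
reject_by_critical_value = chi_square > critical_value

print(reject_by_p_value)         # True
print(reject_by_critical_value)  # True
```

For a right-tailed test like this one, the two rules always agree, since a statistic to the right of the critical value has less than alpha of the area beyond it.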
Because our test statistic landed in the reject H0 region, our decision is still the same: reject the null hypothesis. Let's go back to that hypothesis, all the way back here. Whether we compare the P-value with the level of significance or the test statistic with the critical value, we reject the null hypothesis. And if we reject the null hypothesis, then we're supporting the claim that birthdays of Major League Baseball players are not evenly distributed throughout the year.
The final part of this question asked: do the sample values appear to support Gladwell's claim? Keep in mind that our sample values came from American-born Major League Baseball players. So, yes, we can say that the sample values do appear to support Gladwell's claim that more Major League Baseball players have birthdays in the months immediately following July 31st, because we saw a heavier concentration in August, September, and October.
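To see that concentration in the numbers, a short Python sketch can rank the months by how far each observed count sits above the expected 376.25. This is only a descriptive look at the sample, not an additional formal test of Gladwell's specific claim, and the month labels and variable names are illustrative.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
observed = [387, 329, 366, 344, 336, 313, 313, 503, 421, 434, 398, 371]
expected = sum(observed) / len(observed)   # 376.25 per month

# Surplus (positive) or shortfall (negative) relative to the expected count.
surplus = {m: o - expected for m, o in zip(months, observed)}

# The three months with the largest surplus over the expected count.
top_three = sorted(surplus, key=surplus.get, reverse=True)[:3]

print(top_three)   # ['Aug', 'Oct', 'Sep']
for m in months:
    print(m, surplus[m])
```

The three largest surpluses fall in August, October, and September, while June and July, the two months just before the cutoff, show the largest shortfalls.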