Computer exercise number eight is of a somewhat different nature from what we've been solving so far, namely because we're not going to need any data set. Instead, we're going to generate our own variables. I will be using Stata, as always; it's very easy there to generate specific kinds of random variables. But of course, feel free to use R, Python, MATLAB, or anything else you want.
In part one, we need to start by generating 500 observations of the explanatory variable x from the uniform distribution with support 0 to 10. Now, as the author mentions, most statistical packages have a command for the uniform(0,1), the standard uniform, and then you just multiply the observations by 10. So even though I could do it in Stata another way, I will be doing it that way, because it's very probable that you'll have to do it that way too. So, first of all, we're going to set the number of observations equal to 500. And now we're going to generate a random variable called x1, equal to uniform, which is the command for the uniform(0,1), multiplied by 10. If you go to the data browser, you can see that indeed we have 500 observations of this random variable.
Now we're being asked to calculate the sample mean and the sample standard deviation. But before we do it, let's see what we expect them to be. In other words, what are the theoretical mean and standard deviation of the uniform(0,10) distribution? If you remember from your undergrad stats, take a uniform variable defined on an interval from a to b, where a denotes the lower bound of the support and b the upper bound. The expected value is a plus b, divided by 2. Here a is 0 and b is 10, so the expected value is 5. The variance, on the other hand, is b minus a, squared, divided by 12. If we plug in our numbers, we get 8.333.
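The generation step and the theoretical moments just described can be sketched in Python. This is an illustrative stand-in for the Stata commands used in the video, not a transcription of them: the function names and the seed are assumptions, and because the draws are random the sample figures will not match the ones on screen.

```python
import math
import random
import statistics

def draw_uniform_0_10(n, seed=0):
    """Draw n values from Uniform(0, 10) by scaling standard uniform draws by 10."""
    rng = random.Random(seed)  # the seed is an arbitrary choice, for reproducibility
    return [10 * rng.random() for _ in range(n)]

def uniform_moments(a, b):
    """Theoretical mean, variance and standard deviation of Uniform(a, b)."""
    mean = (a + b) / 2
    variance = (b - a) ** 2 / 12
    return mean, variance, math.sqrt(variance)

x1 = draw_uniform_0_10(500)
mean, variance, sd = uniform_moments(0, 10)
print(f"theory: mean={mean:.3f} var={variance:.3f} sd={sd:.3f}")
print(f"sample: mean={statistics.mean(x1):.3f} sd={statistics.stdev(x1):.3f}")
```

The first printed line gives the population values (5, 8.333, and a standard deviation of about 2.887); the second gives what one particular sample of 500 happens to produce.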
The standard deviation, of course, is the square root of that, so it's about 2.89. So we expect to find a sample mean close to 5 and a sample standard deviation close to 2.89. Let's see what happens when we summarize our x1 random variable. Well, you see that the sample mean is not too far away from 5. Of course, it's not exactly equal to 5, and I'm going to discuss why; it's slightly less. The sample standard deviation is also slightly less than 2.89. And what's very interesting here is that the min and the max are not 0 and 10, because of course each individual real number has probability zero of occurring. The whole idea is that you might be tempted to think: okay, we generated this random variable, so the population is the one we have here. But no, the population is every uniform(0,10) random variable ever generated. It's a population of infinite size. So this is just a subsample of 500 observations. It's a random sample, admittedly, but as always happens with random samples, we need a very large sample size to be able to say that our mean and standard deviation are very, very close to those of the underlying population. Indeed, we need to invoke the weak law of large numbers. But here the random sample is not that large, so that's why we have these deviations around the theoretical values.
All right. In part two, we now generate 500 errors u_i from the normal(0, 36). As we did before, we're going to generate a variable, called u1. We'll generate a normal variable, sorry, I should say a standard normal variable, and we're going to multiply it by 6 to get a normal(0, 36) random variable. 36 is the variance; 6 is the standard deviation. Now, of course, we expect to see the variable have a mean of zero and a standard deviation of 6. Let's see what happens when we summarize. Well, the mean is fairly close to zero this time.
The standard deviation, on the other hand, is quite a bit above 6: it's 6.54. Again, this is a small-sample problem. If the number of observations were 10,000, 100,000, or a billion, the sample moments would converge in probability to the theoretical values. The sample mean we observe would be even closer to zero, extremely close to zero, and the standard deviation would be extremely close to 6. And I promise that by the end of the video I will do this whole thing with a huge number of observations to see what happens. But just out of curiosity, let's produce the histogram of this u variable and superimpose a normal density to see how close it is. Well, you see, not too far, but again, we have gaps here, and there are some frequencies here that are higher than they should be. It's just approximately normal. But we'll see later on that if I generate enough random variables, it will be almost identical to the theoretical density.
Now, in part three, we need to generate the y variable as follows: y1 equals 1 plus 2 times x1 plus u1. Here it is: 500 values of y1 are generated. Now we're going to run a very simple regression to see what happens: regress y1 on x1. Remember, if I make a mistake deliberately and also regress on the error term, of course we're going to get an R-squared of one and the exact coefficients, because this is the actual data generating process. That's not what we want to do. We want to regress on just the deterministic part: x1 is the explanatory variable and u1 will be our error term. So what we expect to get is an estimate for the constant equal to 1, an estimate of the slope coefficient equal to 2, and an R-squared of around 50%, 0.5, because with only the deterministic part, about half of the variation is included. Let's see. Yeah, well, not too bad.
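As a rough Python counterpart to the regression of y1 on x1 (the video itself uses Stata), the sketch below simulates the data generating process and computes simple OLS by hand. The helper names `simulate` and `ols` and the seed are assumptions made for illustration. It also works out the population R-squared implied by the model, 4·Var(x) / (4·Var(x) + Var(u)) = 25/52, which is about 0.48 and so in line with the "around 50%" expectation.

```python
import random
import statistics

def simulate(n, seed):
    """Simulate x ~ Uniform(0, 10), u ~ Normal(0, 36) and y = 1 + 2x + u."""
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    x = [10 * rng.random() for _ in range(n)]
    u = [6 * rng.gauss(0, 1) for _ in range(n)]  # standard normal times 6: sd 6, variance 36
    y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]
    return x, u, y

def ols(x, y):
    """Simple OLS of y on x with a constant: returns (intercept, slope, r_squared)."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return intercept, slope, 1 - ssr / sst

# Population R-squared implied by the data generating process.
var_x, var_u = 100 / 12, 36
r2_population = 4 * var_x / (4 * var_x + var_u)

x1, u1, y1 = simulate(500, seed=1)
b0, b1, r2 = ols(x1, y1)
print(f"intercept={b0:.3f} slope={b1:.3f} R2={r2:.3f} (population R2={r2_population:.3f})")
```

With a different seed the estimates change, which is exactly the sampling variation the exercise is built to show.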
All right, first of all, the number of observations is right. The joint F test is extremely statistically significant. The R-squared is 0.463; again, not exactly 50% as we would theoretically expect to see. Now, our estimate for the constant is not 1 by any means. It's 0.321, and it's not statistically significant, meaning the underlying t test says it is not statistically significantly different from zero. So it's like estimating a zero here. In any case, even if it were statistically significant, we're talking about an estimate that is far from 1. And our slope coefficient is very statistically significant: look at the high t stat and the zero p-value. It's quite close to 2, but it's 2.13, not exactly 2, and the true value is not even within one standard error of it. So from this analysis we wouldn't conclude that the underlying population slope is 2, and we would definitely not conclude that the underlying population parameter for the constant is 1.
Now, why is this happening? The first thing you need to remember is that we're not dealing with the underlying population; we're dealing with a subsample. And in this subsample, u1 does not have mean zero and standard deviation 6, and x1 does not have mean 5 and standard deviation 2.9. So these discrepancies between the sample and the underlying population for the variables x1 and u1 can definitely account for the two discrepancies here. For x1, as we saw, the mean and standard deviation are lower, at least in this case. And for u1 the mean is lower and the standard deviation is higher; it's more dispersed. So these differences will be incorporated into our estimate for the constant, and this is why it will not be precise. It will actually be biased downwards in this case: not biased in the traditional sense, but from a computational point of view.
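A small piece of algebra makes this explanation concrete. It is an added sketch rather than something shown in the video; the true parameters are written as β0 = 1 and β1 = 2, and bars denote sample means:

```latex
\begin{aligned}
\hat\beta_0 &= \bar y - \hat\beta_1 \bar x,
\qquad \bar y = \beta_0 + \beta_1 \bar x + \bar u \\
\hat\beta_0 - \beta_0 &= -(\hat\beta_1 - \beta_1)\,\bar x + \bar u
\end{aligned}
```

So the error in the constant is the slope error scaled by the sample mean of x, plus the sample mean of the errors. With a sample mean of x close to 5, a slope overestimate of 0.13 alone pulls the constant down by roughly 0.65, which accounts for most of the gap between 1 and 0.321. Across repeated samples these errors average out, which is why the estimator is still unbiased in the usual sense.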
This is why we get this flawed estimate here, and this one, not exactly flawed, but a more inaccurate estimate. Again, if we did this whole analysis in exactly the same way but with a sample size in the billions, we would get estimates extremely close to 1 and 2; that I can almost guarantee you.
Now, in part four, we need to obtain the OLS residuals, u-hat, and verify that equation 2.60 holds, subject to rounding error. Let's remember that equation 2.60 is this one I copied from the book. It is about the OLS residuals. Remember, we're not talking about the equation errors or disturbances; we're talking about the estimated OLS residuals. It says their sum will be equal to zero and, likewise, the sum of the products of x_i and u-hat_i will be equal to zero. Again, this is not a restriction we imposed. This is something that comes from the optimization, from solving the OLS problem of minimizing the sum of squared residuals. It comes from the first-order conditions. So this will always hold, no matter what, as long as the regression includes a constant. Even if we run a nonsense regression, this will hold, because this is how the OLS estimates are designed to be obtained.
Now, if you think about it in more mathematical terms, in terms of linear algebra, what does this thing here say? It says that the inner product of x and u-hat is equal to zero. If the inner product is equal to zero, the two vectors are perpendicular to each other; they form a ninety-degree angle. And if we're talking about random variables, like here, it means that, since the residuals also sum to zero, x and the residuals are uncorrelated in the sample. Let's not delve into the difference between independent and uncorrelated; let's just say uncorrelated. So the sample correlation between x and the residuals is zero.
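The first-order conditions referred to here can be written out as follows. This is a standard derivation added for reference, using b0 and b1 for candidate values and hats for the minimizers:

```latex
\begin{aligned}
\min_{b_0,\,b_1}\; & \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial}{\partial b_0}:\quad & -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} \hat u_i = 0 \\
\frac{\partial}{\partial b_1}:\quad & -2\sum_{i=1}^{n} x_i\,(y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  \;\Longrightarrow\; \sum_{i=1}^{n} x_i \hat u_i = 0 \\
\text{hence}\quad & \sum_{i=1}^{n} (x_i - \bar x)\,\hat u_i
  = \sum_{i=1}^{n} x_i \hat u_i - \bar x \sum_{i=1}^{n} \hat u_i = 0
\end{aligned}
```

The last line is the statement that the sample covariance between x and the residuals is exactly zero, whatever the data look like.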
First of all, let's obtain the OLS residuals. We do it with predict, then a name, I'll call them res since they're the u-hats, then comma, residuals. Now the vector res holds the estimated residuals, u1-hat. Let's generate a new variable called sum1, equal to the sum of the residuals. This sum1 will be a variable, but it will actually be a single number, a one-by-one matrix, equal to the total of the residuals. Hit enter, and we can go to the data browser and see. Great. There's just one number here, repeated in every row, but it's one number, and it is practically zero. You see this e notation here: it means a decimal point, a string of zeros, and then 317 or whatever. This is practically zero. Okay, so we've verified this claim. For the next one, let's do it again and call it sum2. Now we define a variable which is the sum of x1 times res. That's what we want. Let's see if this is also zero. Yes, that's also zero, up to the fourth decimal point. Of course, this is practically zero; we're talking about rounding here.
All right, now compute the same quantities as in equation 2.60, but using the errors in place of the residuals. Well, let's think about it for a minute. We said that those two conditions hold by definition; it's an algebraic fact that comes from the first-order conditions of solving the OLS problem, the minimization of the squared residuals. Do those conditions need to hold for the errors, for the population? I mean, what would that mean? Let's see. Let's use egen again. First, the sum will be of u1, the errors, not the residuals. Does it have to be zero? Is there any reason it has to be zero? No, and it's not.
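The two residual checks, and the same two sums computed with the errors instead, can be reproduced with a short Python sketch. This is an assumed stand-in for the Stata session, with an arbitrary seed, so the values printed for the errors will differ from the ones in the video.

```python
import random
import statistics

rng = random.Random(2)  # arbitrary seed, so the numbers differ from the video
n = 500
x = [10 * rng.random() for _ in range(n)]       # x ~ Uniform(0, 10)
u = [rng.gauss(0, 6) for _ in range(n)]         # errors, u ~ Normal(0, 36)
y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]

# Closed-form simple OLS with a constant.
xbar, ybar = statistics.mean(x), statistics.mean(y)
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar
uhat = [yi - intercept - slope * xi for xi, yi in zip(x, y)]  # OLS residuals

# Equation 2.60 with residuals: zero up to rounding error.
sum_uhat = sum(uhat)
sum_x_uhat = sum(xi * ri for xi, ri in zip(x, uhat))

# The same sums with the true errors: no reason for these to be zero.
sum_u = sum(u)
sum_x_u = sum(xi * ui for xi, ui in zip(x, u))

print(f"residuals: sum={sum_uhat:.2e}  sum(x*uhat)={sum_x_uhat:.2e}")
print(f"errors:    sum={sum_u:.2f}  sum(x*u)={sum_x_u:.2f}")
```

The residual sums come out at floating-point noise level, while the error sums are whatever the particular random draw delivers.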
It's minus 48.17. In fact, there's no reason, absolutely no reason, that the sum of the errors should be zero. It doesn't come from anywhere; we haven't imposed it. It is completely random. It could be zero, but that is a probability-zero event. So, not necessarily. All right, now let's do the other sum; it shouldn't be called sum1 but sum2, let's say: the sum of x1 times u1. Does this have to be equal to zero? No, absolutely no reason. There's no reason this has to be equal to zero, but it could potentially be equal to zero if those variables were uncorrelated in the sample and u1 had a sample mean of zero. Because this thing right here, multiplied by one over n, minus the product of the means, where the mean of u is supposed to be zero anyway, is indicative of the covariance between the two variables, I mean between x1 and u1. Now, if we see that the covariance here is nonzero, then this quantity will be nonzero. Let's check it out. We'll compute the covariance matrix. And indeed the covariance is not zero. So there's no reason that this quantity should be zero. Remember, this only happens with the u-hats, with the estimated OLS residuals, by definition, by construction of the regression method.
Now, in part six, we need to repeat parts one, two, and three with a new sample of data, starting by generating the x's, exactly the same thing. So let's start from the beginning. And this is why I wrote the subscript 1 before, because now I'm going to do the same with x2. So I'm going to repeat what I did before, just changing 1 to 2. Okay, we'll summarize x2 and u2 here. But before that, let me also summarize x1 and u1, so we have the pictures next to each other. Okay: summarize x1, u1. And now summarize x2, u2, so we have the whole picture. Okay, look at that. Not the same, not at all. The mean is quite different.
One is below 5, the other is above 5. The standard deviation here is higher. The min and max are different. Everything is different. Isn't that crazy? Well, not really, because, as we said, the population is every random variable, either uniform or normal with these parameters, ever generated. And now we're extracting different subsamples of 500 observations each. They don't have to be equal; they're just random samples.
Okay, now we're going to define y2: generate y2 as 1 plus 2 times x2 plus u2. I'm going to run the regression of y2 on x2, but let me first rerun the previous regression, so we have the whole picture here. Okay, so I'm just rerunning the previous regression, and now I'm going to run the regression of y2 on x2. All right, look at that. Quite different again. Well, it's just entirely by chance that the estimates are close to each other, but they're different. Look at the different R-squared, 46% versus 53%, different standard errors, different values. Everything is different. And yet again, the only thing that's similar is the statistical significance of the slope coefficient; and the constant term, again, is not statistically significant. This is exactly because of what we said before: the sample mean does not correspond to the theoretical mean, and the same goes for the standard deviation. Why are those things different? Well, because we have different random samples. That is the point. If we do it 1,000 times, we're going to get 1,000 different estimates. They don't have to be dramatically different, but they have to be somewhat different, unless two random samples are the same, which is a probability-zero event.
And now, an extra bonus part. I'm going to do the same with a lot of observations, and I want you to see the magic of statistics and of convergence results. Now I will set the number of observations to 10,000 instead of 500.
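The remark that 1,000 repetitions would give 1,000 different estimates can be tried out directly. The following Python sketch is an addition, not part of the video: it redraws a sample of 500 a thousand times, with an arbitrary seed and a hypothetical helper `ols_estimates`, and records the intercept and slope each time.

```python
import random

def ols_estimates(n, rng):
    """Draw one sample of size n from y = 1 + 2x + u and return (intercept, slope)."""
    x = [10 * rng.random() for _ in range(n)]        # x ~ Uniform(0, 10)
    y = [1 + 2 * xi + rng.gauss(0, 6) for xi in x]   # u ~ Normal(0, 36)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def mean(values):
    return sum(values) / len(values)

def sd(values):
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5

rng = random.Random(3)  # arbitrary seed; every call below draws a fresh sample
draws = [ols_estimates(500, rng) for _ in range(1000)]
intercepts = [b0 for b0, _ in draws]
slopes = [b1 for _, b1 in draws]

print(f"distinct slope estimates: {len(set(slopes))} out of {len(slopes)}")
print(f"intercept: mean={mean(intercepts):.3f} sd={sd(intercepts):.3f}")
print(f"slope:     mean={mean(slopes):.3f} sd={sd(slopes):.3f}")
```

Under these assumptions the intercept estimates scatter around 1 with a spread of roughly half a unit, so a single estimate such as 0.321 from one sample of 500 is not unusual.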
So I'm going to do exactly what I did, just with a larger random sample. Now let's generate the x; this time it's going to be with a capital X, and a capital U, and we also generate a capital Y. So again, I'm doing exactly what I did before, exactly the same, but with a larger subsample. First of all, let's summarize X and U. Look at that. Those are much better than before. Again, we don't have a mean exactly equal to 5, because we would need an even bigger sample. I'm not going to do it with a billion observations; it would take some time to run. But look at the estimates we have here: the standard deviation is almost the correct one, the standard deviation for the normal variable is practically 6, and the means are very close to the theoretical quantities.
Now, I want you also to look at the histogram of, let's say, u1 from before. We're talking about the first one, u1. On the histogram I'm going to superimpose a normal density. Look at that. This is the first variable, with 500 observations. Not too bad, but as we said before, we did have gaps here and there. And now let me do the same with the other variable, from the larger sample. Look at that. Beautiful; it overlaps so much better. And if I do it with 10 million or a billion, it's going to be almost identical. You're not going to be able to tell the difference, especially if I reduce the width of the bins here.
And finally, let's just run the regression with the larger random sample, to see if we get estimates closer to the theoretical ones. Whoa, look at that. The R-squared is closer to 50% than ever before, and the estimate for the slope coefficient is much closer to 2 indeed; it's very close to 2 and statistically significant. And now the constant term is, well, not 1, it's 1.2, but the true value is within two standard errors, and it is statistically significant.
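The comparison between 500 and 10,000 observations can be sketched in Python as well. The function `run_experiment` and the seed are assumptions for illustration; the video does this in Stata, and its numbers come from different random draws.

```python
import random
import statistics

def run_experiment(n, seed):
    """Simulate y = 1 + 2x + u with n observations; return sample moments and OLS estimates."""
    rng = random.Random(seed)
    x = [10 * rng.random() for _ in range(n)]   # Uniform(0, 10): mean 5, sd about 2.89
    u = [rng.gauss(0, 6) for _ in range(n)]     # Normal(0, 36): mean 0, sd 6
    y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return {
        "mean_x": xbar, "sd_x": statistics.stdev(x),
        "mean_u": statistics.mean(u), "sd_u": statistics.stdev(u),
        "intercept": ybar - slope * xbar, "slope": slope,
    }

# Same recipe, two sample sizes (the seed is an arbitrary choice).
results = {n: run_experiment(n, seed=4) for n in (500, 10_000)}
for n, r in results.items():
    summary = "  ".join(f"{name}={value:.3f}" for name, value in r.items())
    print(f"n={n:>6}: {summary}")
```

The larger sample will typically, though not in every single draw, sit closer to the theoretical values of 5, 2.89, 0, 6, 1 and 2.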
You see the difference. The problems that arise here are small-sample problems, and the bad news is that in applied research we're usually not able to extract a huge random sample. And this is why, even if we know the true model, even if we're right about it, we can get weird estimates just from having a small sample. So remember: sample size matters. It is extremely, extremely important.