So we motivated the discussion of statistical inference and estimation by bringing up this so-called decline effect, where the effect size of scientific results seems to be going down over time and reproducibility is suffering. And we gave some reasons for this: publication bias, you know, mistakes and fraud, and this multiple hypothesis problem. And so we used these motivations, these scenarios, to bring up various topics and techniques. So we talked a little bit about basic statistical inference, where I just gave you an overview, and that's it. We talked about effect size. We brought up the specific term heteroskedasticity. For fraud detection, we brought up Benford's Law. And then multiple hypothesis testing, which is perhaps the most important part of the discussion: we talked about the familywise error rate and the false discovery rate, and gave correction procedures for both of these (there's a quick sketch of those two corrections below). Okay, and so this hopefully was a tour of not just some basic concepts, but also some, if not advanced, at least things that don't necessarily come up in a, you know, Stats 101 course, but that I think are pretty important for us data scientists to understand. In fact, there's a view amongst statisticians that these topics are not very well understood by data scientists. And they'll point to typical machine learning classes, where understanding the population, understanding the various biases, understanding how to correct for the problems that can arise is not taught at all, and it's more of a, you know, blind application of algorithms. So I think it's pretty important to go over this choice of topics.
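As a quick reference for those two corrections, here is a minimal sketch, my own illustration rather than code from the course; the function names and the use of NumPy are my own choices, and the example p-values are made up for demonstration.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni FWER control: compare each p-value against alpha / m."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / len(p)

def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg FDR control: sort the p-values, find the largest
    rank k with p_(k) <= (k/m) * q, and reject hypotheses 1..k."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passing = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size > 0:
        reject[order[: passing[-1] + 1]] = True
    return reject

# Hypothetical p-values: a couple of small ones plus a bunch of null-looking ones.
pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.34, 0.52, 0.74, 0.89, 0.99]
print(bonferroni_reject(pvals))          # stricter: controls the familywise error rate
print(benjamini_hochberg_reject(pvals))  # less strict: controls the false discovery rate
```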
So, what about big data? What changes? Well, Brad Efron, who's a world-renowned statistician, describes it this way. He says classical statistics was fashioned for small problems, a few hundred data points at most and just a few parameters, and the bottom line is that we've entered an era of massive scientific data collection, with a demand for answers to large-scale inference problems that lie beyond the scope of classical statistics. And so he suggests that something is changing in the era of big data. Now, what can go wrong here? Well, as we've talked about, you can find spurious relationships in big data. And so this is a picture that I got from a colleague recently, that was emailed to him, which is a plot that someone took the time to make, may or may not have been as a joke, but as you can see here, it says "Internet Explorer versus the murder rate," okay. And so this is the murders in the US in blue, along with the market share of Internet Explorer in green. And the corresponding discussion that went along with this plot was somewhat amusing, talking about various theories for why the murder rate might go up as Internet Explorer's market share goes up, or why the murder rate goes down as the market share also goes down. But the point here is that without some common sense, without applying an understanding of the scenario behind the problem, you can make, you know, "discoveries" of this form. Okay. Alright, and so other examples that have been talked about in the literature, again brought up as, you know, bad examples: the number of police officers and the number of crimes. So why might these two things be correlated? You know, maybe police officers cause crimes? Well, no, probably because in densely populated areas there are both more police officers and more crimes. By the way, just to point out again, these authors here are not the authors that made these claims; these are the authors that brought up the mistake. Okay, amount of ice cream sold and deaths by drowning. Why would these things be correlated? Well, there's a seasonality, right? In the summertime you sell more ice cream and more people go swimming.
And then one is, you know, stork sightings and population increase, used as, you know, evidence that storks do indeed bring newborns to families. Well, again, in more densely populated areas there are more people, so there are more births and more people around to actually see the storks, and so you get an increase in sightings. So these kinds of procedures to remove bias, these procedures to understand the population you are sampling from and to understand the possible sources of correlation, these things are taught in statistics programs, but are not typically taught in machine learning classes. Okay. So what does that have to do with big data? Well, there might be a view that there's, you know, a curse of big data. As Vincent Granville put it, the curse of big data is the fact that when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have predictive power. And so the example he gives is to consider stock prices for some large number of companies, say 500, over a one-month period, and then you check for correlations between all pairs. And it actually doesn't stop there, because that would be over the same exact one-month period, but you might want to account for lags. Maybe the stock price of Google a few days later affects the stock price of smaller companies that depend on it. So now you're not just making on the order of 500 squared comparisons, checking the pairwise correlations of these time series, but you're also checking the pairwise, slightly offset ones, okay, and so these are the cross-correlation procedures. So, very basic time series analysis: this is just to measure the correlation, and I just wanted to throw the formulas up here, where the covariance of two data sets is measured this way. Alright, so you take the data point x_i and subtract the mean of x, and multiply that by y_i minus the mean of y. Add all that up and that's the covariance. And then you divide the covariance by the standard deviations of the two data sets multiplied by each other, and that gives you the correlation.
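Written out, the formulas being described here, up to the usual 1/n versus 1/(n-1) convention, which the lecture doesn't pin down, are the sample covariance and the Pearson correlation; the lagged cross-correlation at lag k just offsets one of the two series:

$$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \qquad r = \frac{\operatorname{cov}(x, y)}{\sigma_x\,\sigma_y}$$

$$r_k = \frac{\frac{1}{n}\sum_{i}(x_i - \bar{x})(y_{i+k} - \bar{y})}{\sigma_x\,\sigma_y}$$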
Okay. So, what does this experiment look like? Well, I generated this plot by running random walks for stock prices that start at $10. They all start at the same point, and at each step, which is an hour of simulated time, I draw a sample from a normal distribution where the mean is the current stock price and the standard deviation is one percent of that current stock price. Okay. And this is not especially defensible as a model of stock prices, but you can see, just sort of visually, that it does generate stock-price-looking things, and you do get some variance here. Alright. So clearly this is random. This plot shows the number of correlations at a level of 0.9, alright, that's a pretty strong correlation, as a function of the number of stock prices tracked. So as I went up from 10 to 100, and I didn't go all the way up to 500, which is what Vincent Granville described in the thought experiment, this is the number of spurious correlations you find, okay. And this is also not doing the lagged cross-correlation; this is just directly measuring the correlation of each pair of time series across this month, and a month is a pretty long period. So what's the point? Well, more data gives more opportunities for spurious findings. Okay. Now, it's not all bad news. So, how is big data different? Well, there's a notion of big p versus big n, where big p is sort of the number of columns and big n is the number of rows. And in this experiment we just did with the time series, this was sort of a big p experiment: we looked at an increasing number of companies, and then we looked at all possible correlations between them, so the number of comparisons was growing sort of quadratically. Okay.
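Here is a rough sketch of the random-walk experiment just described, assuming hourly steps over a 30-day month and a correlation threshold of 0.9; the exact step count and the use of NumPy are my assumptions, not the lecturer's actual code.

```python
import numpy as np

def simulate_prices(n_stocks, n_steps, start=10.0, vol=0.01, seed=0):
    """Independent random-walk 'stock prices': each hourly step draws from a
    normal distribution centered at the current price, sd = 1% of the price."""
    rng = np.random.default_rng(seed)
    prices = np.empty((n_stocks, n_steps))
    prices[:, 0] = start
    for t in range(1, n_steps):
        prev = prices[:, t - 1]
        prices[:, t] = rng.normal(loc=prev, scale=vol * prev)
    return prices

def count_strong_correlations(prices, threshold=0.9):
    """Count pairs of (independent!) series whose correlation exceeds the threshold."""
    corr = np.corrcoef(prices)               # pairwise correlation matrix
    i, j = np.triu_indices(len(corr), k=1)   # each pair once, excluding self-pairs
    return int(np.sum(corr[i, j] > threshold))

# More 'companies' tracked -> more pairs -> more spurious strong correlations.
for n_stocks in (10, 25, 50, 100):
    prices = simulate_prices(n_stocks, n_steps=24 * 30)
    print(n_stocks, count_strong_correlations(prices))
```

The counts will vary from run to run, but the trend, more tracked series producing more spurious hits, is the point.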
So the thing about big data, though, is that the marginal cost of increasing the number of records is essentially zero. It's gotten cheaper and cheaper and cheaper to collect data. Okay, great. Now, that's very, very powerful, right? Increasing the number of records adds statistical power and helps us get lower and lower p-values, but it also amplifies bias. If you're collecting the wrong data, if you're looking at the wrong population, you're going to make, you know, so-called discoveries that are simply false. And so, for example, if you log all the clicks to your website, you have a very, very large data set and you can very precisely model user behavior. But that only models your current users, when perhaps the whole point of modeling user behavior is to try to attract new customers. For example, if your current user base is early adopters, you're only going to be modeling their behavior; you haven't actually sampled the population at large. You know, another example is mobile data, and this comes up in polling for, say, the presidential election. You're only sampling people that have cell phones, and this may or may not be the same population you want to be sampling; it may ignore lower income groups or different age groups, okay. You need to be careful with multiple hypothesis tests as well, as we pointed out. So there's a fantastic comic from XKCD that makes this point very, very clear, where they sort of demonstrate that green jelly beans cause acne. Right, and the story here is that there are 20 different colors of jelly beans, and at a p-value threshold of 0.05 we do 20 experiments, and sure enough we find that one of the colors does indeed appear to cause acne. But that would be expected purely by chance. And so I encourage you to look up that comic.
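To make the jelly bean example concrete, here is a small sketch, my own illustration assuming SciPy's two-sample t-test, of running 20 tests on pure noise at a 0.05 threshold, with and without a Bonferroni correction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_colors, n_subjects = 0.05, 20, 50

# Null data: acne scores are unrelated to jelly bean color in every group.
p_values = []
for _ in range(n_colors):
    eats_this_color = rng.normal(size=n_subjects)
    control = rng.normal(size=n_subjects)
    _, p = stats.ttest_ind(eats_this_color, control)
    p_values.append(p)
p_values = np.array(p_values)

print("uncorrected 'discoveries':", np.sum(p_values < alpha))            # ~1 expected by chance
print("Bonferroni 'discoveries':", np.sum(p_values < alpha / n_colors))  # controls the FWER
```

On average about one of the twenty uncorrected tests comes out "significant" even though nothing is going on, which is exactly the green-jelly-beans-cause-acne result.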
And the other comment I'll make, which we will probably come back to, is Nassim Taleb's Black Swan events. So these are things that are sort of inherently unpredictable, or whose distribution does not follow a normal distribution, sort of a bell curve distribution, where the tails of the bell curve mean that extreme values become exponentially more rare. That's sort of the definition of the normal distribution. But in some cases, extreme values are not exponentially less common; they happen, okay? And so the example that he uses in this case is that, you know, if a turkey were to model your behavior, it would get increasingly more confident that you mean it no harm, that you wish it, you know, goodwill. Every day you come and feed the turkey, and every day you take care of it and look out for its well-being. But then on, you know, the day before Thanksgiving it gets slaughtered, perhaps. And so that's Taleb's argument for a Black Swan event. A black swan itself refers to the fact that people didn't believe black swans existed, and then it turned out that they did, so it was an unexpected event, okay. Alright, we'll talk more about that in some detail.