Stats Lab 2

Question 1: Does the number of at-bats predict the number of runs a team will score?
Create a graph that shows the relationship between runs and at bats from the Batting11
Collection. What does the Graph say about the ability to predict the number of runs based
on knowing a team's at-bats?
Graph 1: Runs vs Bats for Batting11 Dataset.

I would say that the trend is clearly increasing: runs and at-bats have a positive association.
It is not definitive, however. For example, the New York Yankees, Cleveland Indians,
Los Angeles Angels, Chicago White Sox, and Florida Marlins all fall in the at-bats range
between 5500 and 5525, yet their runs span a tremendous range, with the Yankees at the top
and the Marlins at the bottom. So, for a given number of at-bats, the relationship is not
definitive, but it clearly hints in the right (linear) direction.
However, take a look at the curvature (non-linear behavior) across the range of runs. As
runs increase, one can trace a non-linear curve through the points that could be expressed
as a relationship. But overall, this is an upward, positive, fairly weak association.
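To put a number on "positive but fairly weak", one option is the correlation coefficient. The following is a minimal sketch, assuming made-up (at-bats, runs) pairs rather than the actual Batting11 values; the function name `pearson_r` is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, NOT taken from the Batting11 collection.
at_bats = [5400, 5450, 5500, 5510, 5520, 5600, 5650]
runs = [640, 700, 610, 860, 720, 760, 850]

r = pearson_r(at_bats, runs)
print(f"r = {r:.3f}")  # positive, but below 1 because of the scatter
```

A value near 1 would mean the points sit almost on a line; teams with nearly equal at-bats but very different runs pull the value down.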
Question 2: If you had to summarize your graph with a single line, where would you put
your line? Add a Movable Line to your graph and move it around to best fit the data. To
see how well your line summarizes the data, ask Fathom to show the squared errors on the
graph. Notice the Sum of Squares at the bottom of the graph. What happens to this
number as you move your line to better fit the data? Why?
Graph 2: Batting11 Dataset

I placed the line so that it covers the lower points of the graph (the lower values, if there were an
identity line dividing the data points). The correlation value is 0.919. However, as I move the
line, this value changes.
When prompted to show Sum of Squares the following is obtained:
Graph 3: Sum of Squares at Chosen Fit.

As we move the line (keeping its slope fixed), the sum of squares changes, because the side of each
square is the vertical distance from a data point to the chosen line. The closer the line sits to the
points, the smaller the squares and the smaller their sum.
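The effect of sliding the line can be sketched outside Fathom as well. This is only an illustration under stated assumptions: the data points are made up, and `sum_of_squares` is a hypothetical helper, not a Fathom function.

```python
def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Slide the line up and down with a fixed slope, as with the movable line.
slope = 2.0
for intercept in (-2.0, -1.0, 0.0, 1.0, 2.0):
    sse = sum_of_squares(xs, ys, slope, intercept)
    print(f"intercept {intercept:+.1f}: sum of squares = {sse:.2f}")
```

The sum is smallest for the intercept that puts the line through the middle of the points and grows as the line moves away in either direction.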
Question 3: Make a Duplicate of your Graph and replace the Movable Line with the Least
Squares Line. How do the Sum of Squares compare for this least squares line versus the line
you chose in Question 2?
Graphs 4 and 5: Least Squares in Scatter Plot of Dataset.

If we compare the line I chose with the line given by the least squares fit, we see that the sum of
squares is minimized accordingly, as are the squares drawn from each data point to the
fit. It is worth noting that the correlation factor is much lower, but the fit represents
the data points as a whole more accurately than a biased fit does. Thus, the r-squared value found in this fit is
much lower than the one reported in the previous question (0.822).
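The comparison between a hand-placed line and the least squares line can be sketched as follows. This is a minimal example assuming made-up data and a made-up hand-placed line; it uses the standard closed-form formulas for the least squares slope and intercept, and the function names are hypothetical.

```python
def least_squares(xs, ys):
    """Closed-form slope and intercept of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, NOT the Batting11 values.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ls_slope, ls_intercept = least_squares(xs, ys)
sse_ls = sum_of_squares(xs, ys, ls_slope, ls_intercept)

# A hand-placed line (assumed values, chosen by eye).
sse_hand = sum_of_squares(xs, ys, 1.8, 1.0)

print(f"least squares line: sum of squares = {sse_ls:.3f}")
print(f"hand-placed line:   sum of squares = {sse_hand:.3f}")
```

No hand-placed line can give a smaller sum of squares than the least squares line, since that line is defined as the one that minimizes it.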
Question 4: Linear regression requires that the relationship be linear. Make a residual plot
and use this to explain why you think the relationship between runs and at bats is, or is not,
linear. (Note: you will have to Remove the moveable line before Fathom will allow you to
add a residual plot to the graph.)
If you look at the residual plot, the residuals clearly behave randomly, so we can conclude
that a linear fit is appropriate for this dataset. Why is this the case? Because in residual analysis,
if the relationship is linear, you should not see any pattern at all.
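For contrast, here is a sketch of what a pattern in the residuals looks like when the relationship is not linear. The data are made up (they follow y = x squared exactly) and the helper name is hypothetical; nothing here comes from Batting11.

```python
def least_squares(xs, ys):
    """Closed-form slope and intercept of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical curved data: y = x**2.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

slope, intercept = least_squares(xs, ys)

# Residual = observed value minus the value predicted by the line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle, a U shape. That kind of organized pattern is what signals a non-linear relationship; a linear relationship leaves residuals scattered above and below zero with no shape.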
Question 6: Given your (perhaps limited) knowledge of baseball,
which variable in the Batting11 Collection do you think will have the
lowest sums of squares using the least squares line? Why? Open
another graph to show the relationship between runs and the
variable you chose. At first glance, does there seem to be a linear
relationship? How does this relationship compare to the relationship
between runs and at bats?
It is worth noting that I do not know anything about baseball. I just picked the walks parameter
because I watched Moneyball. However, in a way it makes sense, as it is an isolated parameter
that can influence the number of wins a team has, which is reflected in the runs parameter.
In fact, and oddly enough for me (rather amusing, actually), this graph shows a sum of
squares that is lower than the value obtained earlier in this exercise. However, the
correlation factor, r-squared, is a bit lower.
I supposed that, much like our previous graph, this one too would show a linear relationship. And it does:
The relationship is comparable, although the slope for my parameter is steeper and the pattern is a bit less
random and clearer at first glance, because the points are closer to each other, hinting at
less spread in this dataset.
Question 7: In your graph predicting runs with at bats, look at the
equation at the bottom. What is the slope? What does the slope tell
us in the context of predicting success for a team if we know its at-bats?
Interpret the slope for your second graph.
For the first graph, the slope is 0.6305, while for my choice, the slope is
0.931. That is, each additional at-bat raises the predicted number of runs by
about 0.63, whereas each additional walk raises it by about 0.93. It is worth
noting that the sign of the slope matters: both slopes here are positive, so in
both graphs the predicted runs increase as the variable increases.
That is, if we rely purely on this algebraic analysis.
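As a small worked example of reading a slope, the predicted change in runs is just the slope times the change in the predictor. The two slopes are the ones reported above; the change of 100 and the helper name `predicted_change` are assumptions made for illustration.

```python
# Slopes as reported in the text above.
SLOPE_AT_BATS = 0.6305
SLOPE_WALKS = 0.931

def predicted_change(slope, delta_x):
    """Change in predicted runs when the predictor changes by delta_x."""
    return slope * delta_x

# Hypothetical scenario: 100 more at-bats, or 100 more walks.
print(f"{predicted_change(SLOPE_AT_BATS, 100):.2f} more predicted runs")
print(f"{predicted_change(SLOPE_WALKS, 100):.2f} more predicted runs")
```

The intercept is not needed for this comparison, because it cancels out when two predictions from the same line are subtracted.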
I do not rely on the slope to measure success. I rely on the correlation factor
of each graph to see which variables hold a closer relationship. From
these plots, the former, at 0.37, provides the stronger correlation. Thus, I hold
that information to be more valuable in assessing success than my choice.
Never mind that I also do not know anything about baseball, so my choice is
random at best.
Question 8: Compare the r-squared values found at the bottom of
each graph. Based on your comparison of the two relationships from
Question 6, what do you think the r-squared value represents? Does
your variable seem to better predict runs than at bats? Why?
The r-squared value measures how well the fitted line accounts for the data: it is the
share of the variation in runs that the line explains.
From this we can say that runs vs at-bats is the better fit, as its value is 0.37, whereas
runs vs walks is at 0.36. So, they are not quite equivalent, but close to each
other.
What it means is that we can get better predictions of runs using at-bats instead of
walks. Thus, my choice renders less precise
predictions.
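The idea of r-squared as "share of variation explained" can be sketched directly from sums of squares. This is a minimal example assuming made-up data; `r_squared` is a hypothetical helper, not Fathom's own computation.

```python
def r_squared(xs, ys):
    """1 - SSE/SST for the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: variation left over around the fitted line.
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # SST: total variation of ys around their own mean.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical data with a visible but noisy upward trend.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 5, 4, 7, 6]

print(f"r-squared = {r_squared(xs, ys):.3f}")
```

A value of 0.37 would therefore mean the line accounts for about 37% of the variation in runs, with the rest left in the residuals.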
Question 9: Now that you can summarize the linear relationship between two variables,
find the variable (or variables) in the Batting11 Collection which you think best predicts
runs. Support your conclusion using the graphical and numerical methods we discussed
and describe the relationships you explore in context of the problem. How well does this
variable predict runs? Does this variable match with what you initially thought in Question
6? Are you surprised?
Given the choices, it is better to predict runs using at-bats than walks, as reflected by the r-squared
value, since it provides more precise predictions.
If you look at the residual plot for my choice of parameter and compare it to the residual
plot of the first graph, it is clear that the second plot shows less random (less chaotic, more
organized) behavior. Hence this relationship may be less linear than the first.
Am I surprised? No, because I know nothing about baseball!
Question 10: Now examine the newer variables in the NewBatting11 Collection. Were the
baseball researchers described in the background successful in finding better variables to
predict the total number of runs scored? Explain using appropriate graphical and
numerical evidence. Of all the variables we analyzed, which seems to be the best
predictor of runs? Using the information you know or have learned about these baseball
statistics, does your result make sense?
Clearly, by using ratios instead of at-bats, we can glean more information from the plot
above. The correlation is .90, a tremendous improvement over both variables previously
used (walks and at-bats), and thus a more reliable graph from which to obtain far more precise
predictions than the alternatives. Below we use another parameter, on-base-plus-slugging
percentage, to see whether it gives a better fit, which it does, at .93 (see below). This information
provides more insight into key fundamental variables that can better predict the outcome, as they
involve more aspects of the game (an intersection of what takes place with at-bats, walks, bases
captured, etc.).
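The comparison of several candidate predictors can be sketched by ranking them on r-squared. Everything in this example is an assumption made for illustration: the numbers are invented (they are not the Batting11 or NewBatting11 values), and the dictionary keys are hypothetical column names.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, equal to r-squared for a one-variable fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical team totals, one entry per team.
runs = [600, 650, 700, 750, 800]
predictors = {
    "at_bats": [5500, 5450, 5520, 5480, 5560],
    "walks": [500, 520, 470, 560, 510],
    "on_base_plus_slugging": [0.68, 0.70, 0.73, 0.75, 0.78],
}

# Sort predictors from best to worst by r-squared against runs.
ranked = sorted(
    ((name, r_squared(values, runs)) for name, values in predictors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, r2 in ranked:
    print(f"{name}: r-squared = {r2:.3f}")
```

With the real collections, the same ranking would be read off the r-squared value shown at the bottom of each Fathom graph.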
