19/6/2019 Data Scientists Versus Statisticians – ODSC - Open Data Science – Medium

Data Scientists Versus Statisticians


ODSC - Open Data Science
May 9 · 5 min read

Ever since the term “data scientist” came onto the tech scene, there’s
been a cross-generational debate raging over how to define and
distinguish newly branded data scientists and traditional statisticians. I
personally adopted the data scientist title around 2012, and I recall a
rather pithy definition floating across the Twittersphere around that time:

“A data scientist is someone who is better at statistics than any software
engineer and better at software engineering than any statistician.”

In a more serious light, data science is often defined as the confluence
of three areas: computer science, mathematics/statistics, and specific
domain knowledge. Implicit in this definition is the focus on solving
specific problems, in contrast with the type of deep understanding that
is typical in academic statistics.

In this article, we’ll take yet another look at the data
scientist/statistician kerfuffle to see if we can find some common
ground and maybe even a common endpoint.

Data Science or Statistics?

It seems that the designation “data scientist” has taken the world by
storm. It’s a title that conjures up the almost mystical ability to
garner information from deep data lakes with ease. It comes from a
belief that a data scientist can wave his or her hand like a 21st-century
Houdini and effortlessly extract insights from the data.

What’s intriguing about the field of data science is its perceived threat
to other disciplines, specifically statistics. I don’t see this threat as
real, however, as the two fields are quite distinct and complementary.
Over the past decade, it has become clear that although the two fields
can exist on their own, each is weaker without the other. Statisticians
need to understand the modeling and structure of data, while data
scientists need to understand applied statistics.

It’s no wonder that statisticians feel threatened by data scientists to a
certain degree. Statisticians deal with nebulous concepts like point
estimates, margins of error, confidence intervals, standard errors,
p-values, hypothesis testing, and the proverbial argument between the
“frequentists” and “Bayesians.” To the general public, statisticians can
seem confusing, and often they can’t even agree among themselves on
what is correct.

Data scientists, on the other hand, closely follow the more approachable
“data science process”: data ingest, data transformation, exploratory
data analysis, model selection, model evaluation, and data storytelling.
Sure, many of these steps rely on statistical methods behind the scenes,
but they’re sealed in a more engaging and understandable wrapper. Many
more people can embrace data science.

To be sure, there will always be a need for a solid foundation in
statistics. There are many cases where a data scientist would not have a
clue what to do with certain data sets without help from someone with
a background in statistics. At the same time, if a statistician were
handed a high-dimensional data set with 5 billion rows and 10,000
variables, they’d be hard pressed to set up the data for analysis without
consulting a data scientist.

Ultimately, the two disciplines need to find some common ground.
Teaching students how to work with real-world data should be part of
the curriculum of every statistics department. And those working in
data science need to have the appropriate training in statistics.

[Related article: What Will the Next Generation of Data Scientists Look
Like?]

Further Comparing and Contrasting

Although data scientists and statisticians tend to gather information for
similar purposes, their means of data collection are quite different. The
amount of data that data scientists handle is often massive, so they
spend a lot of time on tasks like large-scale data ingest, data cleansing,
and transformation. Statisticians, by contrast, still rely on more
traditional, smaller-scale methods of data collection, such as surveys,
polls, and experiments.

Typically, data science problems are formulated using a modeling
process that focuses on the predictive accuracy of the model. Data
scientists compare the predictive accuracy of different machine learning
algorithms and select the model with the best accuracy. Statisticians
take a different approach to building and testing their models. The
starting point in statistics is usually a simple model, such as linear
regression, and the data is checked to determine whether it is consistent
with the assumptions of the model. The model is then improved by
addressing any assumptions that are violated. The modeling process is
considered complete when all model assumptions have been verified and
none are violated.

While data scientists focus on comparing a number of different
methods to create the best machine learning model, statisticians instead
work to improve a single, simple model to best fit the data.
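
The data science side of this contrast can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the synthetic data, the three candidate models, and the helper names (fit_mean, fit_line, fit_knn, mse) are all invented for the example and use nothing beyond the standard library. Several candidates are fitted on training data, and the one with the lowest error on held-out data is kept.

```python
import random
import statistics


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mu = statistics.fmean(ys)
    return lambda x: mu


def fit_line(xs, ys):
    """Simple linear regression fitted by ordinary least squares."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=5):
    """k-nearest-neighbours regression: average the k closest training labels."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return statistics.fmean([y for _, y in nearest])

    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given examples."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 examples to estimate predictive accuracy.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# Fit every candidate on the training split, score it on the held-out split.
candidates = {"mean": fit_mean, "line": fit_line, "knn": fit_knn}
scores = {
    name: mse(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>4}: held-out MSE = {score:.3f}")
print("selected:", best)
```

A statistician working in the style described above would more likely stay with the linear model and inspect its residuals for violated assumptions, rather than swap in a different algorithm.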

Statisticians tend to focus more on quantifying uncertainty than data
scientists do. As part of the statistical model-building process, it’s
common to quantify the connection between the outcome being predicted
and each predictor. Any uncertainty about this connection is also
quantified. This process is not as common with the tools used by data
scientists, namely machine learning.
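
As a rough illustration of what quantifying this connection and its uncertainty can look like, the sketch below fits a one-predictor linear regression and reports the slope together with its standard error and an approximate 95% confidence interval. The synthetic data and the function name ols_with_uncertainty are assumptions made for the example. The interval uses a normal approximation to keep the code within the standard library; a textbook treatment would use the t distribution with n - 2 degrees of freedom, which gives nearly the same answer for a sample this large.

```python
import math
import random
import statistics


def ols_with_uncertainty(xs, ys, confidence=0.95):
    """Return (slope, standard error, confidence interval) for y = a + b*x."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx

    # Residual variance, with n - 2 degrees of freedom (two fitted parameters).
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)

    # Standard error of the slope estimate.
    se_slope = math.sqrt(sigma2 / sxx)

    # Normal approximation to the critical value (about 1.96 for 95%).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return slope, se_slope, (slope - z * se_slope, slope + z * se_slope)


# Synthetic data (an assumption for illustration): true slope is 2.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

slope, se, (low, high) = ols_with_uncertainty(xs, ys)
print(f"slope = {slope:.3f}, standard error = {se:.3f}")
print(f"approximate 95% confidence interval: ({low:.3f}, {high:.3f})")
```

The point estimate alone says how the outcome moves with the predictor; the standard error and interval say how far that estimate could plausibly be from the truth.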

The two fields also use somewhat different nomenclature to describe
the same concepts. Data scientists say “example” where statisticians
say “observation,” “feature” where statisticians say “predictor” or
“independent variable,” and “label” where statisticians say “response”
or “dependent variable.”

[Related article: The Difference Between Data Scientists and Data
Engineers]

Conclusion

At present, the fields of data science and statistics differ in a
number of ways: in modeling processes, the size of data consumed, the
types of problems studied, the academic background of the people in the
field, and the terminology used. At the same time, the fields are closely
related in the sense that both data science and statistics aim to extract
knowledge from data.

Given time, the fields of data science and statistics will likely converge
to a common endpoint. Statisticians have been gathering data and
applying analysis techniques like linear regression for several
centuries. As more statisticians pick up skills like implementing
algorithms that learn from data and provide predictions and actions,
and as more data scientists pick up statistical science (sampling,
experimental design, confidence intervals, p-values, etc.), the boundary
between data scientists and statisticians will blur.
