The textbook
Proposed title
Patterns, Causality and Prediction: Data Analysis for Business, Economics and Policy
Motivation
The ongoing data revolution has major consequences for businesses and policy-makers
alike: more and better data is available to support decision making. As a result there is
a growing need for professionals who can learn from available data and can collect
relevant data.
There is a need for analysts who can assess the effects of business and policy practices,
carry out predictions and work with real-life data, small and big. The ability to visualize
and interpret results is also becoming extremely important. Not only analysts but also users of
analyses need many of these skills, both to translate results into decisions and to commission
data analysis and data collection.
There is a need for analysts with a skill set that integrates traditional statistical analysis with
machine learning methods. There is a need for analysts with a deep and applicable
knowledge of the most reliable methods. There is a need for analysts who can write their
own code and work with real-life data that is often messy and complicated. There is a
need for analysts who can understand the business and policy context and tailor their
analysis to answer substantive questions.
The isolated and often formalistic textbooks of econometrics and machine learning
offer fragmented skills and knowledge, cover many more methods than needed, rarely
provide instructions or code for software implementation, ignore the messy and
complicated nature of real-life data, and often focus on academic applications. The
more practical textbooks of business statistics, survey statistics and other applied fields
do not cover many important data analysis methods, and when they do, they do not offer
a deep understanding of those methods.
Our textbook addresses all four needs: the need for analysts with an integrated
knowledge who understand and can apply the most robust methods, who can work
with real-life data, and who build their work to address real-life problems.
The textbook supports Microsoft Excel, R and Stata, with an emphasis on the latter two.
The needs of data professionals of all stripes extend beyond Excel. R and Stata are the
most widely used software for the methods covered in the textbook. Both include
powerful tools for data management and visualization as well.
Key topics
Patterns: regression analysis
Uncovering patterns in the data can be an important goal in itself, and it is the
prerequisite to establishing cause and effect and carrying out predictions. The textbook
starts with simple regression analysis, the method that compares expected y for
different values of x to learn the patterns of association between the two variables. It
discusses nonparametric regression and focuses on linear regression. It builds on
simple linear regression and goes on to enrich it by introducing nonlinear functional forms,
generalizing from a particular dataset to other data it represents, adding more
explanatory variables, etc. The textbook also covers regression analysis for time series
data, panel data, and binary dependent variables, including nonlinear models such as logit
and probit. Understanding the intuition behind the methods, their applicability in various
situations, and the correct interpretation of their results are the constant focus of the
textbook.
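The idea of regression as a comparison of expected y across values of x can be illustrated with a minimal sketch. The data below are made up for illustration, and the function names `conditional_means` and `simple_ols` are hypothetical, not taken from the textbook. The first function computes the mean of y at each distinct x value (a nonparametric regression for a discrete x); the second fits a least-squares line to the same data.

```python
from collections import defaultdict
from statistics import mean


def conditional_means(x, y):
    """Mean of y at each distinct value of x (nonparametric regression for discrete x)."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    return {xi: mean(ys) for xi, ys in sorted(groups.items())}


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up illustrative data: two observations of y at each of three x values.
x = [1, 1, 2, 2, 3, 3]
y = [2.0, 4.0, 4.0, 6.0, 7.0, 9.0]

print(conditional_means(x, y))  # {1: 3.0, 2: 5.0, 3: 8.0}
intercept, slope = simple_ols(x, y)
print(intercept, slope)  # slope is 2.5: expected y is 2.5 higher per unit of x
```

The conditional means rise by 2 and then by 3, while the linear fit summarizes the same pattern with a single average slope of 2.5.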
Causality: learning the effects of interventions
Decisions in business and policy are often centered on specific interventions, such as
changing monetary policy, modifying health care financing, changing the price or
other attributes of products, or changing the media mix in marketing. Learning the
effects of such interventions is an important purpose of data analysis. The textbook
incorporates the basic concepts and methods used by program evaluation (the
framework of potential outcomes, the benefits of randomized assignment, etc.). It also
covers related methods used in business, such as A/B testing.
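As a hedged illustration of why randomized assignment helps: when units are assigned to treatment at random, the difference in mean outcomes between the treated and control groups estimates the average effect of the intervention. The sketch below assumes made-up outcome data and a hypothetical helper `ab_test_effect`; the 95% interval uses the normal approximation (1.96 standard errors), which is crude for samples this small.

```python
from math import sqrt
from statistics import mean, variance


def ab_test_effect(control, treatment):
    """Difference in means (treatment minus control) with an approximate 95% CI."""
    diff = mean(treatment) - mean(control)
    # Standard error of a difference between two independent sample means.
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    return diff, (diff - 1.96 * se, diff + 1.96 * se)


# Made-up outcomes (e.g. spending per visitor) for two randomly assigned groups.
control = [10.0, 12.0, 11.0, 9.0, 13.0]
treatment = [12.0, 14.0, 13.0, 11.0, 15.0]

diff, (low, high) = ab_test_effect(control, treatment)
print(diff, low, high)
```

Without random assignment the same difference could also reflect pre-existing differences between the groups, which is why the comparison has a causal reading only under randomization.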
Prediction: carrying out predictions
Data analysis in business and policy applications is often aimed at prediction. The
textbook introduces tools to evaluate predictions, such as loss functions or the Brier score.
Online material
The textbook will be supported by a set of additional material available online.
Data and lab
Each chapter is accompanied by the data used in the illustration studies, with a full but
concise description, and by a description of how to implement data management,
cleaning, analysis, and visualization in Excel, R and Stata. We also provide the R and
Stata code that produces all results shown in the textbook, starting with raw
data. Students can learn coding by first understanding and then tinkering with code
that works. We plan to store these Data and lab sections online, with some elements
possibly turned into videos or interactive exercises similar to those on datacamp.com.
Outline
I. PATTERNS: REGRESSION ANALYSIS
1. How to approach and describe data
Key characteristics of a dataset. Types of observations (cross-section, time series,
other structures), types of variables. Describing data (source, types of
observations and variables, descriptive statistics, distributions).
Visualization of basic features of data: histograms, kernel densities, box plots.
Frequent data problems and cleaning data. Common issues with data (zero
values, missing values, errors, duplicates, dates, spelling). Suspicious values and
benchmarking. How to form realistic expectations.
2. Simple regression analysis
Regression as comparison of means. Definitions. Visualizing regression.
Nonparametric regression. Linear model. Learning simple linear regression
parameters. Predicted dependent variable and the residual. Goodness of fit.
Correlation and causality.
Graphical representation, scatterplots, visualizing nonparametric and linear
regression.
3. Uncovering nonlinear patterns in regression analysis
Transforming variables (taking logs, normalizing by size, standardizing variables),
piecewise linear splines, quadratic and other polynomials. When to worry and
when not to worry about nonlinear patterns.
4. Inference: Generalizing from our data
Repeated samples. The confidence interval as the prime tool for inference. Standard
errors (SE). Robust SE. Layers of external validity.
Presenting regression results in standard table format. Visualizing and interpreting
confidence bounds around regression lines.
5. Multiple linear regression analysis
Uses of multiple regression: multiple associations, controlling for confounders,
improving prediction. Categorical explanatory variables, interactions. Omitted
variable bias, bad controls. Modelling and interpretation.
Presenting and summarizing results: advice on tables and graphs.
6. Probability models
Linear regression with binary outcome: the linear probability model.
Nonlinear probability models: logit and probit. Coefficients and marginal
differences.
7. Analysis of time series data
Frequency, trends, seasonality. Taking differences. SE robust to serial correlation.
Time series graphs.
8. Messy data
Dealing with missing data, influential observations, weights, standardization of
variables.