CHAPTER 17
Importing Data into Excel

FINDING INFORMATION WITH DATA MINING
The types of data analysis we discuss in this and other chapters of this
book are crucial to the success of most companies in today's data-driven
business world. However, the sheer volume of available data often
defies traditional methods of data analysis. Therefore, a whole new set
of methods, and accompanying software, has recently been developed
under the name of data mining. Data mining attempts to discover
patterns, trends, and relationships among data, especially non-obvious and
unexpected patterns. For example, the analysis might discover that people
who purchase skim milk also tend to purchase whole wheat bread, or that
cars built on Mondays before 10 A.M. on production line #5 using parts
from suppliers ABC and XYZ have significantly more defects than average.
This new knowledge can then be used for more effective management of
a business.
Image copyright Gina Sanders 2010. Used under license from Shutterstock.com
Not For Sale
Cengage Learning. All rights reserved. No distribution allowed without express authorization.
The place to start is with a data warehouse. Typically, a data
warehouse is a huge database that is designed specifically to study patterns
in data. A data warehouse is not the same as the databases companies use for their day-
to-day operations. A data warehouse should (1) combine data from multiple sources to
discover as many interrelationships as possible, (2) contain accurate and consistent data,
(3) be structured to enable quick and accurate responses to a variety of queries, and
(4) allow follow-up responses to specific, newly relevant questions. In short, a data
warehouse represents a relatively new type of database, one that is specifically
structured to enable data mining. Another term you might hear is data mart. This is
essentially a scaled-down data warehouse (or part of an overall data warehouse) that is
structured specifically for one part of an organization, such as sales. Virtually all large
organizations, and many smaller ones, have developed data warehouses or data marts in
the past decade to enable them to better understand their business: their customers,
their suppliers, and their processes.
Once a data warehouse is in place, analysts can begin to mine the data with a
collection of methodologies, techniques, and accompanying software. Some of the
primary methodologies are classification analysis, prediction, cluster analysis, market
basket analysis, and forecasting. Each of these is a large topic in itself, but some brief
explanations follow.

Classification analysis attempts to find variables that are related to a categorical
(often binary) variable. For example, credit card customers can be categorized as
those who pay their balances in a reasonable amount of time and those who don't.
Classification analysis would attempt to find predictive variables that help explain
which of these two categories a customer is in. Some variables, such as salary, are
natural candidates for predictors, but the analysis might uncover less obvious
predictors.

Prediction is similar to classification analysis, except that it tries to find variables
that help explain a continuous variable, such as credit card balance, rather than a
categorical variable. Regression, the topic of Chapters 10 and 11, is one of the most
popular tools used for prediction, but there are others not covered in this book.

Cluster analysis tries to group observations into clusters so that observations
within a cluster are alike, and observations in different clusters are not alike. For
example, one cluster for an automobile dealer's customers might be middle-aged
men who are not married, make over $150,000, and favor high-priced sports cars.
Once natural clusters are found, a company can then tailor its marketing to the
individual clusters.

Market basket analysis tries to find products that customers purchase together
in the same market basket, or set of consumer goods. In a supermarket setting, this
knowledge can help a manager position or price various products in the store. In
banking and other settings, it can help managers to cross-sell (sell a product to a
customer already purchasing a related product) or up-sell (sell a more expensive
product than a customer originally intended to purchase).

Forecasting is used to predict values of a time series variable by extrapolating
patterns seen in historical data into the future. (This topic is covered in some detail
in Chapter 12.) This is clearly an important problem in all areas of business,
including the forecasting of future demand for products, forecasting future stock
prices and commodity prices, and many others.
Only a few years ago, data mining was considered a topic only for the experts. In
fact, most people had never heard of data mining. Also, the required software was
expensive and difficult to learn. Fortunately, this is changing. Many people in
organizations, not just the experts, have access to large amounts of data, and they have
to make sense of it right away, not a year from now. Therefore, they must have some
understanding of the data analysis techniques used in data mining, and they must have
software to implement these techniques. Microsoft, among others, has recognized
these needs and has recently provided a free data mining add-in for Excel, available on
the Web.¹ (Perform a Web search for Excel 2007 data mining add-in.) We will not
cover data mining or this add-in in this book (the book is already long enough), but
we encourage you to take a follow-up course in data mining if you have the chance.
It is a very valuable skill for the workplace that will only become more valuable in
the future.
17.1 INTRODUCTION
We introduced several numerical and graphical methods for analyzing data statistically in
Chapters 2 and 3, and we examined many more statistical methods in later chapters.
However, any statistical analysis, whether in Excel or any other software package,
presumes that you have the appropriate data. This is a big presumption. Indeed, the majority
of the time spent in many real-world data analysis projects is devoted to finding the data
and getting it into a format suitable for analysis. Unfortunately, this aspect of data analysis
is given very little, if any, attention in most statistics textbooks. We believe it is extremely
important, so we devote this chapter to methods for finding data and importing it into
Excel, the software package we are using for data analysis.
Our basic assumption throughout most of this chapter is that the appropriate data
exists somewhere. In particular, we do not cover methods for collecting data from scratch,
such as opinion polls. This is a large topic in itself and is better left to a
specialized textbook in sampling and survey methods. We assume that the data already
exists, either in an Excel file, in a text file, in a database file (such as a Microsoft Access
file), or on the Web. In the first case, where the data set already resides in an Excel file, you
might need to rearrange the data in some way to get it in the form of a rectangular data set,
as discussed in Chapters 2 and 3. You already have basic tools for doing this, such as
cutting and pasting, but we will illustrate some interesting possibilities for rearranging data in
the next section.
If the data is not already in Excel, one common possibility is that it is stored in a
text file. This is essentially any file that can be opened and read by humans in a text editor
such as Notepad. Text files, also called ASCII files,² are common because they don't
require any proprietary software, such as SPSS or SAS, to make them readable. In fact,
they are often called plain vanilla files because they represent a lowest common
denominator: anyone with a text editor can read them. We will show how they can be imported
fairly easily into Excel by using Excel's handy text import wizard.
Another possibility is that the data is stored in a relational database. Indeed, most
companies store at least some of their data in this format. Common database packages
include Microsoft Access, SQL Server, and Oracle. These packages were developed to
perform certain tasks very well, including data storage, querying, and report writing.
However, they are not nearly as good as Excel at statistical data analysis, that is, number
crunching. Therefore, we show how to import data from a typical database package into
Excel. The key here is to form a query, using the Microsoft Query package that ships with
Microsoft Office. The query specifies exactly which data you want to import. This package
not only presents a friendly user interface for creating the query, but it also finds the
appropriate data from the database file and automatically imports it into Excel. Again, the entire
process is surprisingly easy, even if you know practically nothing about database packages
and database design.
¹ Although the add-in is free, it requires a connection to a SQL Server Analysis Services server. This contains the
software that Microsoft uses to perform its data mining number crunching. Essentially, the add-in is an Excel-
based front end.
² ASCII (American Standard Code for Information Interchange) is a standard character-encoding scheme based
on alphabetical, numerical, and other characters.
Next, we briefly examine the possibility of importing data directly from the Web into
Excel. Given that the amount of data on the Web is already enormous and is constantly
growing, the ability to import it into Excel is extremely valuable. As with importing data
from a database file, you can import data from the Web by creating a query and then running
it in Excel. Unfortunately, different Web sites store data in many different ways, so a
method that imports data into Excel from one site might not work for another site.
Nevertheless, you will see that the current possibilities are powerful and relatively
straightforward. If you think that querying from a Web site is something only expert programmers
can do, we hope to change your mind.
Finally, you cannot always assume that the data you obtain, from the Web or elsewhere,
is clean. There can be (and often are) many instances of wrong values, which can
occur for all sorts of reasons, and unless you fix these at the beginning, the resulting
statistical analysis can be seriously flawed. Therefore, we conclude this chapter by discussing
a few techniques for cleansing data.
17.2 REARRANGING EXCEL DATA
The tools we discussed in Chapters 2 and 3 were always applied to a data set in Excel, a
rectangular array of data with observations in rows, variables in columns, and variable
names in the top row. Sometimes you have to use your Excel skills to get the data in this
format, even if the data already resides in Excel. In this section we illustrate two
possibilities, based on data we imported from the Web. As you will see from these two examples,
there is no single way to do it. Sometimes simple cutting and pasting works, and
sometimes advanced Excel functions are required. In all cases, it is best to map out a plan
and then decide how to implement it.
EXAMPLE 17.1 BASEBALL DATA FOR DIFFERENT TEAMS
We have already analyzed baseball salaries in Chapters 2 and 3. The files used in those
chapters were already in nice data set form inside Excel. However, they didn't start
this way. We found the data on the USA Today Web site, and the site allowed us to import
the data into Excel via a Web query (as we discuss later in the chapter). However, a
separate Web query was required for each of the 30 teams. The results for a typical team appear
as in Figure 17.1, with only a few of the Arizona players listed. (See the file Baseball
Salaries Original.xlsx, which also lists the Web links.) As this figure indicates, each
team's data starts with the team name, then a row of headings, and then the data. The data
for Atlanta is right below the data for Arizona, the data for Baltimore is right below the
data for Atlanta, and so on. If you want four long columns with all of the data, and these
columns are to have headings Player, Team, Salary, and Position, how can you rearrange
the data to achieve this?
Objective To rearrange the data from the baseball Web queries into a single data set.
Solution
First, we admit that this isn't really data analysis; it is Excel manipulation. But it will
certainly be valuable for you to know some Excel methods for rearranging data. This is a very
common task, especially in the business world. We also mention that this is a repetitive
task (the same thing must be done for all 30 teams), and repetitive tasks are handled best
with macros. However, because we do not presume that you can write macros at this stage,
we will manipulate the data with Excel tools only.
The key is to devise a strategy. Ours is quite simple, using the following steps:
1 Insert a blank column before column B, and enter the label Team in cell B2.
2 Cut (Ctrl-x) the Arizona Diamondbacks team name from cell A1 and paste it (Ctrl-v)
next to the first Arizona player in cell B3. Then copy it down for the other Arizona players.
3 Repeat step 2 for each of the other teams.
4 Delete unnecessary rows of labels for the other teams.
Try it out and see how quickly you can manipulate the data. Better yet, see if you can find
a strategy that is even quicker.
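For readers who eventually automate this kind of task, the four steps can also be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in this example: it assumes the stacked team blocks have already been read into a list of rows, that each block starts with a one-cell team-name row followed by a Player/Salary/Position heading row, and the function name stack_team_blocks and the last sample row are hypothetical.

```python
def stack_team_blocks(rows):
    """Turn stacked per-team blocks into (Player, Team, Salary, Position) records."""
    records = []
    team = None
    for row in rows:
        cells = [cell for cell in row if cell != ""]
        if not cells:
            continue                  # skip completely blank rows
        if len(cells) == 1:
            team = cells[0]           # a lone cell is a team name (step 2)
            continue
        if cells[0] == "Player":
            continue                  # drop the repeated heading rows (step 4)
        player, salary, position = cells
        records.append((player, team, salary, position))
    return records


sample = [
    ["Arizona Diamondbacks", "", ""],
    ["Player", "Salary", "Position"],
    ["Buckner, Billy", "$403,000", "Pitcher"],
    ["Byrnes, Eric", "$11,666,666", "Outfielder"],
    ["Atlanta Braves", "", ""],
    ["Player", "Salary", "Position"],
    ["Example, Player", "$1,000,000", "Catcher"],   # made-up row for illustration
]

records = stack_team_blocks(sample)
for record in records:
    print(record)
```

The sketch carries the most recent team name forward, which plays the same role as copying the team name down column B by hand.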
Figure 17.1 Imported Data for Arizona
      A                     B            C
 1    Arizona Diamondbacks
 2    Player                Salary       Position
 3    Buckner, Billy        $403,000     Pitcher
 4    Byrnes, Eric          $11,666,666  Outfielder
 5    Clark, Tony           $800,000     First Baseman
 6    Davis, Doug           $8,750,000   Pitcher
 7    Drew, Stephen         $1,500,000   Shortstop
 8    Garland, Jon          $6,250,000   Pitcher
 9    Gordon, Tom           $500,000     Pitcher
10    Gutierrez, Juan C.    $401,000     Pitcher
11    Haren, Danny          $7,500,000   Pitcher
The following example is typical for time series data found on the Web.
EXAMPLE 17.2 CPI MONTHLY DATA
The file CPI.xlsx contains monthly data on the Consumer Price Index (CPI) going back
to 1913. We imported this data from the Web site www.bls.gov/cpi/#tables, again using
a Web query. A few rows appear in Figure 17.2. This format is common on Web sites, where
there is a row for each year and a column for each month. For some data analysis purposes,
this format might be fine, but what if you want a long data set with just two variables,
Month-Year (like Jan-1913) and CPI? How can the data be rearranged to this format?
Objective To rearrange the monthly data into two long columns, one with month-year
and one with the CPI.
Solution
Repetitive tasks like this one are best handled by macros, but if you don't know how to write macros, you need to develop an efficient plan for performing the task manually.
This example comes under the category of "If you plan a bit and know some good
functions, you can save yourself a lot of work." Our solution appears in Figure 17.3. The
desired results are in columns D and E, but the values in columns A-C help to get these
results. Here are the steps:
1 Referring to Figure 17.2, create the range name Data (for all the CPI values, not the
headings in row 1 or column A). This makes the formula in step 7 easier to read.
2 Add a new worksheet for the rearranged data in Figure 17.3, create the column
headings in row 1, enter 1 in cells A2 and B2, and enter 1913 in cell C2.
3 To generate the recurring pattern of 1 to 12 in column B, enter the formula
=IF(B2<12,B2+1,1) in cell B3. Copy this down as far as necessary.
4 To generate the pattern in column A (12 1s, 12 2s, 12 3s, and so on), enter the formula
=IF(B3=1,A2+1,A2) in cell A3, and copy it down.
5 To generate the years in column C, enter the formula =IF(B3=1,C2+1,C2) in cell
C3 and copy it down.
6 To generate the month-year values in column D, enter the formula =DATE(C2,B2,1)
in cell D2 and copy it down. This creates dates in column D, which can then be formatted
in a custom format such as mmm-yyyy.
7 To generate the CPI values in column E, enter the formula =INDEX(Data,A2,B2) in
cell E2 and copy it down.
8 If you want only the data in columns D and E, copy these two columns and paste them
over themselves as values. Then they are no longer dependent on columns A-C, so
columns A-C can be deleted.
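The logic of these steps (a row index, a column index, and a lookup into the rectangular range) can also be sketched outside Excel. The following Python fragment is a hedged illustration only: the function name wide_to_long and the variable names are hypothetical, and the two rows of values are the 1913 and 1914 rows of Figure 17.2.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def wide_to_long(first_year, data):
    """Flatten a year-by-month table into (Month-Year, CPI) pairs.

    data[r][c] holds the CPI for year first_year + r and month c + 1,
    which is the value INDEX(Data, r + 1, c + 1) would return in Excel.
    """
    long_rows = []
    for row_index, year_values in enumerate(data):      # plays the role of column A
        for col_index, cpi in enumerate(year_values):   # plays the role of column B
            year = first_year + row_index               # plays the role of column C
            label = f"{MONTHS[col_index]}-{year}"       # like DATE(...) shown as mmm-yyyy
            long_rows.append((label, cpi))
    return long_rows


data = [
    [9.8, 9.8, 9.8, 9.8, 9.7, 9.8, 9.9, 9.9, 10.0, 10.0, 10.1, 10.0],    # 1913
    [10.0, 9.9, 9.9, 9.8, 9.9, 9.9, 10.0, 10.2, 10.2, 10.1, 10.2, 10.1],  # 1914
]

long_rows = wide_to_long(1913, data)
for label, cpi in long_rows[:3]:
    print(label, cpi)
```

The nested loop visits the cells in the same order as the copied-down formulas: all 12 months of one year, then the next year.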
Figure 17.2 CPI Data
     A     B     C     D     E     F     G     H     I     J     K     L     M
1    Year  Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
2    1913  9.8   9.8   9.8   9.8   9.7   9.8   9.9   9.9   10.0  10.0  10.1  10.0
3    1914  10.0  9.9   9.9   9.8   9.9   9.9   10.0  10.2  10.2  10.1  10.2  10.1
4    1915  10.1  10.0  9.9   10.0  10.1  10.1  10.1  10.1  10.1  10.2  10.3  10.3
5    1916  10.4  10.4  10.5  10.6  10.7  10.8  10.8  10.9  11.1  11.3  11.5  11.6
6    1917  11.7  12.0  12.0  12.6  12.8  13.0  12.8  13.0  13.3  13.5  13.5  13.7
7    1918  14.0  14.1  14.0  14.2  14.5  14.7  15.1  15.4  15.7  16.0  16.3  16.5
8    1919  16.5  16.2  16.4  16.7  16.9  16.9  17.4  17.7  17.8  18.1  18.5  18.9
9    1920  19.3  19.5  19.7  20.3  20.6  20.9  20.8  20.3  20.0  19.9  19.8  19.4
Figure 17.3 Rearranged CPI Data
      A          B             C     D           E
 1    Row index  Column index  Year  Month-Year  CPI
 2    1          1             1913  Jan-1913    9.8
 3    1          2             1913  Feb-1913    9.8
 4    1          3             1913  Mar-1913    9.8
 5    1          4             1913  Apr-1913    9.8
 6    1          5             1913  May-1913    9.7
 7    1          6             1913  Jun-1913    9.8
 8    1          7             1913  Jul-1913    9.9
 9    1          8             1913  Aug-1913    9.9
10    1          9             1913  Sep-1913    10
11    1          10            1913  Oct-1913    10
12    1          11            1913  Nov-1913    10.1
13    1          12            1913  Dec-1913    10
14    2          1             1914  Jan-1914    10
15    2          2             1914  Feb-1914    9.9
This is the type of example that can make you a hero at your job. With some planning and knowledge of useful Excel functions, you can save hours or even days of work. And the formula-based approach is the best way to avoid errors.
Study these steps carefully and try to understand our strategy. You should also look up the
INDEX and DATE functions, either in online help or in our Excel tutorial. Basically, the
INDEX function allows you to find a particular value from a rectangular range. The DATE
function allows you to specify a date from a year, month, and day. They can be real
lifesavers for working with tables and dates.
The Excel data that you obtain can be arranged in all sorts of ways, and you might
need strategies and tools we have not discussed here. However, the point of these two
examples is that if you spend some time devising a plan before you dive in, you can save
yourself a lot of work. The bottom line is that clever strategies can sometimes save your
company days of mind-numbing work, and make you a hero.
PROBLEMS
Note: Student solutions for problems whose numbers appear within
a colored box are available for purchase at www.cengagebrain.com.
Level A
1. The file P17_01.xlsx contains the counts of unique
visitors to popular Web sites from April 2008 to April
2009. There are actually two sheets, one for news sites
and one for sports sites. For each sheet, rearrange the
data or do whatever it takes to create the two charts
shown in the file.
2. The file P17_02.xlsx contains yearly data on nine of
the Big Ten universities. There are two pieces of data
for each university for each year: the count of full
professors, and the total amount paid to these
professors. Rearrange the data or do whatever it takes
to create the chart shown in the file of average salary
per full professor.
3. The file P17_03.xlsx is the result of importing a text
(.txt) file into Excel. The first column should be Year,
and this should be followed by four columns for each
state: the number and the percentage of females
without health insurance and similar data for males.
Starting in row 202, you can see that the numbers
were imported correctly. However, the variable
names that should all be in row 1 were imported
badly. Do whatever it takes to get the correct labels
in row 1, with the corresponding numbers below
them in rows 2-6. You can shorten the labels to
Virginia Male Pct, for example. Then create a time
series chart that contains a few time series variables
of your choice.
4. The file P17_04.xlsx contains monthly data since 1931
on the number of cooling degree-days (CDD, an index
of the amount of energy to cool a home or business)
in each of the United States and several combined
regions. (The codes for the locations appear on the
Locations sheet.) Some of the data is missing, as
indicated by -9999 in various cells. Rearrange the data
so that there are three long columns: Location (spelled
out, such as Virginia or South Atlantic), Month (such
as Jan-1941), and CDD. Replace all -9999 values by
blanks. Then show how a pivot table can be used to
create a time series graph for any selected location.
Level B
5. In analogy with the previous problem, the file
P17_05.xlsx contains monthly data since 1931 on
the number of heating degree days (HDD, an index
of the amount of energy to heat a home or business)
in each of the United States and several combined
regions. (The codes for the locations appear on the
Locations sheet.) However, each row really contains
the end of the previous year and the beginning of the
current year. For example, row 2 contains values
(all missing) for Jul-Dec of 1930 and the values
(non-missing) for Jan-Jun of 1931. As in the
previous problem, rearrange the data so that there
are three long columns: Location (spelled out, such
as Virginia or South Atlantic), Month (such as
Jan-1941), and HDD. Replace all -9999 values by
blanks, and make sure you get the correct data with
the correct months. Then show how a pivot table can
be used to create a time series graph for any selected
location.
6. The file P17_06.xlsx contains monthly data from 1920
to 2004 on average temperatures in the 48 contiguous
states and several regions of the U.S. (The codes for
locations are listed on the Locations sheet.) Rearrange
the data so that there is a Month column (with values
such as Jun-1945) and a column for each state or region
(spelled out, such as Virginia or Southeast). Each state or
region column should contain its monthly temperatures.
Then create a time series graph for a few states or
regions.
The INDEX function is a really useful function for finding a particular value in a rectangular range.
17.3 IMPORTING TEXT DATA
Most software packages, statistical and otherwise, store documents in a proprietary binary
format that is readable only by that package. For example, prior to the 2007 version, Excel
stored its files in .xls format.³ If you try to open an .xls file in a text editor such as Notepad,
all you will see is gibberish. Because such files are not readable from one package to the
next, it is often useful to store them in a more universal format, namely, as text files. Most
software packages allow you to save a file as a text file, usually with a .txt extension. Other
common extensions for text files are .dat, .csv (comma delimited) and .prn (created when
you print to a file).
Because of their universal nature, you will often receive data in text files, and then
you will often want to import the data into Excel for analysis. This section discusses
some of the issues involved. First, a text file can be fixed width or delimited. (Delimited
means separated.) For either, a line of data in a text file contains a single observation: a
list of variable values. With fixed width, each variable's value starts and stops at fixed
positions (columns) in the line. For example, columns 1-5 might contain the first
variable's value, columns 6-8 might contain the second variable's value, and so on.
Therefore, each line of data has the same length, and the columns line up when you view
the data in a text editor. In contrast, with delimited data there is a delimiter character,
usually a tab, space, or comma, that separates the values in a line. In this case, the lines
are typically of different lengths and do not line up nicely. For example, if there is a
Name variable, the names John Lee and Tom Schlussenberger will create two lines of
different lengths.
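The difference between the two layouts is easy to see in a small sketch. The Python fragment below is an illustration only: the function names are hypothetical, the ages are made up, and the fixed-width layout (name in columns 1-20, age in columns 21-22) is an assumption chosen for the example.

```python
def parse_fixed_width(line, breaks):
    """Split a line at fixed character positions.

    breaks is a list of (start, stop) pairs, zero-based with stop exclusive.
    """
    return [line[start:stop].strip() for start, stop in breaks]


def parse_delimited(line, delimiter=","):
    """Split a line wherever the delimiter character occurs."""
    return [value.strip() for value in line.split(delimiter)]


people = [("John Lee", 34), ("Tom Schlussenberger", 51)]   # ages are made up

# Fixed width: pad every name to 20 characters so the columns line up.
fixed_lines = [name.ljust(20) + str(age).rjust(2) for name, age in people]

# Delimited: a comma separates the values, so line lengths differ.
delimited_lines = [f"{name},{age}" for name, age in people]

for line in fixed_lines:
    print(repr(line), parse_fixed_width(line, [(0, 20), (20, 22)]))
for line in delimited_lines:
    print(repr(line), parse_delimited(line))
```

Both fixed-width lines have the same length, whereas the two delimited lines do not, which is the ragged appearance described above.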
Fixed-width files were originally associated with the old IBM punched cards, which
could be read much more easily if the reading program knew which column each
variable started in. Even though punched cards are thankfully long gone, fixed-width files
are still quite common and found on many Web sites. Fortunately, Excel can import
either fixed-width or delimited formats with its text import wizard, as the following
examples illustrate.
EXAMPLE 17.3 TEMPERATURE DATA
We found state, regional, and national (srn) annual data (1895-2005) on temperature,
precipitation, and drought at the Web site http://dss.ucar.edu/datasets/ds885.1/data/. Three
text files were available for download: srn_temp.txt (temperature), srn_pcp.txt
(precipitation), and srn_pdsi.txt (drought). A small portion of the temperature data, as shown in
Notepad, appears in Figure 17.4. There are no headings, just numbers. Fortunately, the Web
site also has a data dictionary file, srn_data.txt, that provides information about these
fixed-width data files. Figure 17.5 shows part of this data dictionary. It indicates what the variables
are and how they are stored in columns. How can the temperature data be imported into Excel?
Objective To import the fixed-width text file data into Excel by using Excel's text import
wizard.
³ Excel 2007 files are stored in an XML-based format (hence the final x in .xlsx). An .xlsx file is a compressed (zipped)
package of XML files, which are special forms of text files that can be read in an XML editor once they are extracted.
A text file is basically any file that can be read by humans when it is opened in a text editor such as Notepad.
The columns of fixed-width format files line up when you open them in Notepad. The columns of delimited files tend to have ragged edges.
Always look for a data dictionary on Web sites. Without it, you might get a text file full of numbers without knowing what the numbers represent.
Solution
The key is to open a file in the usual way within Excel, but to look for files of the type
Text Files (*.prn; *.txt; *.csv). When you select the srn_temp.txt file, you see the first
step of the text import wizard in Figure 17.6. It allows you to select either Delimited or
Fixed width. In this case, select Fixed width. It also allows you to start the import at a row
other than row 1 (because the first few rows often contain descriptive text), but row 1 for
this example starts directly with data.
The second step of the wizard, shown in Figure 17.7, allows you to separate (or parse)
the columns as listed in the data dictionary. Excel guesses where the breaks should be, but
you can override these guesses. Specifically, we clicked on the third, fourth, and fifth
positions on the ruler to create the three arrows to the left.
The last step of the wizard, not shown here, allows you to fine-tune the import, column
by column, but we usually bypass this step and simply click on Finish. The data is
imported beautifully into Excel, as shown in Figure 17.8. All you have to do is create
column headings in row 1, using the data dictionary as a guide. However, the file is still a
.txt file, so the last step is to use Save As to store it as an .xlsx (or .xls) file.
Figure 17.4 Temperature Data
Figure 17.5 Data Dictionary
To parse a line of data means to separate it into its individual pieces.
Figure 17.6 Step 1 of Text Import Wizard
Figure 17.7 Step 2 of Text Import Wizard for Fixed-Width Data
Figure 17.8 Imported Data with Column Headings
     A      B         C        D     E     F     G     H     I     J     K     L     M     N     O     P
1    State  Division  Element  Year  Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
2    1      0         2        1895  43.7  37.6  54.5  63.1  69.8  78.1  79.9  79.9  77.9  60.6  53.4  45.6
3    1      0         2        1896  44.1  47.9  52.5  67.9  75.7  77    80.8  82    75.5  63.5  57.5  46.2
4    1      0         2        1897  42.6  51.2  60.2  62    68.6  80.9  81    78.7  75.3  66.9  54.3  47.8
5    1      0         2        1898  49.4  45.9  59    58.1  73.4  80.3  79.8  78.6  75.1  61.1  49.9  44.1
6    1      0         2        1899  44.4  39.9  55.2  61.6  75.8  79.7  80.2  81.1  72.5  66    55.5  44.9
7    1      0         2        1900  44    44.1  52.6  63.5  71.2  76.3  79.7  81.5  77.6  68.7  54.9  46.8
8    1      0         2        1901  46.1  43.1  53    57.4  70    78.4  82.1  78.5  72    62.5  48.7  41.7
9    1      0         2        1902  43.2  40.6  54.8  61.6  75.3  80.8  82.6  82    73    63.3  57.6  45.3

If the data is in delimited format, you first have to be told (or figure out) what the
delimiter character is. If the values are clearly separated by a space or a comma, the choice
is obvious. If you see varying gaps between the values in a given row, the tab character is
probably the delimiter. Note that the delimiter character is sometimes part of a value.
For example, suppose the delimiter is a space and a Name variable has a value such as John
Brown. In this case, there is typically a text qualifier character, usually a double-quote,
that surrounds any value that contains the delimiter. So instead of seeing John Brown in the
text file, you will see "John Brown" in quotes. This simply indicates that the name John
Brown is a single value, not two values. Of course, the person who creates the text file is
the one who must include these text qualifiers, if they are indeed necessary.
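The effect of a text qualifier can be checked with a short sketch. The Python fragment below uses the standard csv module with a space delimiter and a double-quote qualifier; the sample lines and the ages in them are made up for illustration.

```python
import csv
import io

# A space-delimited file in which names are wrapped in double-quotes.
text = 'Name Age\n"John Brown" 41\n"Tom Schlussenberger" 52\n'

# Naive splitting on the delimiter breaks the name into two values.
naive = text.splitlines()[1].split(" ")

# A reader that knows the text qualifier keeps the name as one value.
reader = csv.reader(io.StringIO(text), delimiter=" ", quotechar='"')
rows = list(reader)

print(naive)     # the quoted name is split in two
print(rows[1])   # the quoted name stays together
```

The qualifier characters themselves are stripped on import, so the value arrives as John Brown without the quotes.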
EXAMPLE 17.4 MOBILE SUBSCRIBERS
The Web site http://stats.oecd.org/index.aspx contains a wide assortment of economic
data. (OECD stands for Organisation for Economic Co-operation and Development.)
Under the Information and Communication Technology group, we found annual data by
country for the number of mobile subscribers during 2002-2009.⁴ We downloaded the data
into a file named Mobile Subscriptions.txt. A portion of the file in Notepad appears in
Figure 17.9. This time there are column headings in row 1, but the ragged lines indicate
that this file must be delimited, not fixed width. How can the data be imported into Excel?
Figure 17.9 Text File Data on Mobile Subscribers
Objective To see how delimited text data can be imported into Excel with the text import
wizard.
Solution
Again, the key is to open a file of the type Text Files (*.prn; *.txt; *.csv). This time you
should check the Delimited option in step 1 of the wizard (see Figure 17.6 above). Then
the step 2 dialog box appears as in Figure 17.10. Excel guesses, correctly in this case, that
the file is tab-delimited, so all you need to do is click on Finish. Note the Text qualifier
option in this dialog box. It isn't relevant here (no tabs are in the middle of any data
values), but the option always appears, and you can change the character if necessary.
A text qualifier, usually a double-quote, is necessary if data values include the delimiter character.
⁴ Actually, the data on this site can be imported directly into Excel. However, we took the circuitous route of
downloading into a text file for illustration in this example.
Figure 17.10 Step 2 of Text Import Wizard for Delimited Data
Figure 17.11 Imported Excel Data
    A                   B          C     D         E
 1  Series              Country    Year  Value     Flags
 2  Mobile subscribers  Australia  2002  1.27E+07
 3  Mobile subscribers  Australia  2003  1.43E+07
 4  Mobile subscribers  Australia  2004  1.65E+07
 5  Mobile subscribers  Australia  2005  1.84E+07
 6  Mobile subscribers  Austria    2002  6.74E+06
 7  Mobile subscribers  Austria    2003  7.09E+06
 8  Mobile subscribers  Austria    2004  7.99E+06
 9  Mobile subscribers  Austria    2005  8.37E+06
10  Mobile subscribers  Belgium    2002  8.10E+06
11  Mobile subscribers  Belgium    2003  8.61E+06
Some of the imported data appear in Figure 17.11. Note that the Flags column contains no
data, so it can be deleted. Column A can also be deleted because it includes a constant
value, Mobile subscribers. Next, you might want to reformat the numbers in column D,
say, to Number with 0 decimals. Finally, as in the previous example, you should use Save
As to save the file in .xlsx (or .xls) format.
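For readers who want to see the same cleanup outside Excel, here is a minimal Python sketch. It assumes a tab-delimited layout like the one in Figure 17.11. The three data rows are typed in from that figure as a stand-in for the real file, and the helper name is_droppable is ours. The sketch drops any column with no variation (the constant Series column and the empty Flags column) and converts the scientific-notation values to whole numbers.

```python
import csv
import io

# A few rows typed from Figure 17.11, tab-delimited, standing in for the
# downloaded Mobile Subscriptions.txt file (which is not reproduced here).
raw = (
    "Series\tCountry\tYear\tValue\tFlags\n"
    "Mobile subscribers\tAustralia\t2002\t1.27E+07\t\n"
    "Mobile subscribers\tAustralia\t2003\t1.43E+07\t\n"
    "Mobile subscribers\tAustria\t2002\t6.74E+06\t\n"
)

records = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

def is_droppable(column):
    """A column with a single distinct value (constant or all blank) adds nothing."""
    return len({record[column] for record in records}) == 1

keep = [column for column in records[0] if not is_droppable(column)]

# Keep only the informative columns; show Value as a whole number.
cleaned = [
    {c: (int(float(r[c])) if c == "Value" else r[c]) for c in keep}
    for r in records
]

for row in cleaned:
    print(row)
```

With the full file, the path would be passed to open() in place of the io.StringIO stand-in. The rule of dropping single-valued columns is our own shortcut for what the example does by inspection.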
Before leaving this section, here are a few comments about importing text data.

If a text file is comma-delimited and is saved as a .csv file, you can open it directly
into Excel, without the import text wizard. Excel automatically parses the values
between the commas into separate columns.

Sometimes you will find a table of data on the Web where there is no option to save
it in some type of format, text or otherwise. One method that sometimes works is the
following. Highlight the data in your browser, press Ctrl-c to copy, put your cursor in
Excel, and press Ctrl-v to paste. It might work as you hoped, but it might not. It is
very possible that everything will be pasted into a single column. If this happens to
you, highlight the data in this column and click on the Text to Columns button on
Excel's Data ribbon. The purpose of this button is to parse delimited data in a single
column into several columns. It often works perfectly, but be aware of the caution
that follows.

Whenever you parse text data into Excel columns, there is always the chance that the
data won't line up properly; that is, data will get into the wrong columns. So make
sure you look closely at the parsed data before blindly proceeding. For example, if
the data are supposed to fill columns A–L but you see some dangling values in
column M, something didn't get parsed correctly and you need to fix it.
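The "dangling values" check can also be automated before the data ever reach Excel. The following Python sketch is one way to do it, assuming comma-delimited text. The sample lines are invented, and one of them contains an unquoted comma inside a value. The sketch reports every line whose number of fields differs from the header's.

```python
import csv
import io

# Hypothetical comma-delimited text. The third line has an unquoted comma
# inside the company name, so it parses into one field too many.
raw = (
    "Name,City,Sales\n"
    "Acme,Dayton,100\n"
    "Smith, Inc.,Toledo,250\n"
    "Zed,Akron,75\n"
)

rows = list(csv.reader(io.StringIO(raw)))
expected = len(rows[0])  # the header row defines how many columns there should be

# 1-based line numbers of rows that would spill into an extra column (or fall short).
misaligned = [n for n, row in enumerate(rows, start=1) if len(row) != expected]

print(misaligned)
```

Any line number reported here corresponds to a row that would not line up after Text to Columns, so it is the place to look for a missing text qualifier.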
PROBLEMS
Level A
7. The file P17_07.txt lists monthly values of the
consumer price index (CPI). Import this data into
Excel, excluding the rows before the actual data.
(Many text files have information in the first few lines
that doesn't import nicely.) Then rearrange the data as
was done in the previous section so that there are only
two columns: Month (such as Jun-1954) and CPI. You
can delete the columns to the right of the December
column. Save the results as an Excel (.xlsx) file.
8. The file P17_08.txt lists daily values of an air quality
index for Los Angeles. The first value in each row is a
date such as 20070322. This means March 22, 2007.
Import this data into Excel and save it as an .xlsx file.
Use the date functions DATE, YEAR, MONTH, and
DAY to transform the text values such as 20070322
into dates, and then format these as mm/dd/yyyy.
9. The files P17_09 AL.txt and P17_09 NL.txt contain
lists of all American League and National League
baseball batters who had at least 250 at-bats during the
1990 season. The heading at the top of each text file
describes the variables. Import these text files into
Excel, and then arrange the data into a single sheet in a
single file with five variables: League (AL or NL),
Last Name, First Name, At-Bats, and Handedness.
Finally, create a pivot table to find the percentages of
batters for either league or for both leagues combined
who are right-handed hitters, left-handed hitters, or
switch-hitters. Save the results as an Excel (.xlsx) file.
(This data came from a research project on batting
streaks performed by one of the authors in the 1990s.)
10. Continuing the previous problem, the files P17_10
1987.txt through P17_10 1990.txt contain data on
Barry Bonds's at-bats during each of the four seasons
1987–1990. The file P17_10 Variables.txt lists the
variables in each file. Import each year's data into
Excel. Then bring all of the data into a single Excel
(.xlsx) file with four sheets, one for each year. Enter
variable names in the top row of each sheet (you can
use the ones suggested in the P17_10 Variables.txt
file), and replace all missing values with blanks. (They
are currently denoted by periods.) Finally, for any year
of your choice, create a pivot table that allows you to
see how Bonds's batting success varied by possible
explanatory variables. For example, did he have more
success in home games than away games?
11. The files P17_11 Temp.txt, P17_11 Precip.txt,
and P17_11 Drought.txt contain monthly data on
temperature, precipitation, and drought for various
states and regions in the U.S. The file P17_11
Description.txt describes the data. In particular, it
indicates that each of the data files is fixed width, and
it indicates the precise file format (which data are in
which columns). Import each of the data files into
Excel. Then combine the three worksheets into a single
Excel (.xlsx) file, with one sheet for temperature, one
for precipitation, and one for drought.
12. The file P17_12.txt contains yearly data for the number
of licensed drivers (those under 18, those over 85,
and total) by gender and state. Import this data into
Excel and save it as an Excel (.xlsx) file. Use
appropriate text functions (unless you want to do it
manually) to shorten the variable names to something
like Arizona Females Young, Arizona Females Old,
and Arizona Females All. Explain briefly what the
presence of consecutive commas in the text files
indicates. How are these imported into Excel?
Level B
13. The file P17_13.txt contains yearly salary data for full
professors at several Big Ten universities. For each
university, there is data on the average salary for full
professors, the number of full professors, and the total
paid to all full professors. Import this data into Excel.
The variable names in row 1 will be quite long, so do
whatever it takes to shorten them to something like
Indiana Average Salary, Indiana Full Professors, and
Indiana Total Paid. Save the results as an Excel (.xlsx)
file.
14. Text data can sometimes be quite intimidating. As
an example, the file P17_14.txt has data on all
organizations convicted in federal courts in 2007. If
you open it in Notepad, it won't make a bit of sense.
Fortunately, there is a codebook in the file
P17_14.pdf. (It is common for government and other
organizations to publish such codebooks.) Explain in
some detail how you could import this data into Excel.
Optionally, try importing it.
Excel's Text to Columns tool is handy for parsing a single column into multiple columns. But check
the results carefully to make sure that they are correctly lined up.
17.4 IMPORTING RELATIONAL DATABASE DATA
In this section, we will discuss how data stored in a relational database format can be
imported into Excel. Specifically, we consider data in Microsoft Access format. (This is
the database package that is part of Microsoft Office.) Database packages such as Access,
SQL Server, Oracle, and many others are extremely complex and powerful packages. For
database creation, querying, manipulation, and reporting, they have many advantages over
spreadsheets. However, they are not nearly as powerful as spreadsheets for statistical
analysis. Therefore, it is often necessary to import data from a database package (either
all of it or just a subset of it, based on a query) into Excel, where the statistical analysis
can then be performed. Fortunately, Microsoft includes software called Microsoft Query in
its Office suite that makes the importing relatively easy.
17.4.1 A Brief Introduction to Relational Databases
First, we present some general concepts about database structure. The Excel "databases"
we have discussed so far in this book are often called flat files or, more simply, tables.
They are also called single-table databases, where table is the database term for a
rectangular range of data, with rows corresponding to records and columns corresponding to
fields.⁵ Actually, all of the data sets we analyze in Excel in this textbook are really
single-table databases. Flat files are fine for relatively simple database applications, but they are
not powerful enough for more complex applications. For the latter you need a relational
database, a set of related tables, where each table is a rectangular arrangement of fields
and records, and the tables are linked explicitly.
⁵Fortunately, Excel now uses the term table in exactly the same way as it has been used in database packages for
years. However, Excel has no practical way of dealing with the multitable databases discussed here. Also, when
talking about databases, it is more common to refer to observations (rows) as records and variables (columns) as
fields.
As a simple example, suppose you would like to keep track of information on all of
the books you own. Specifically, you would like to keep track of data on each book (title,
author, copyright date, whether you have read it, when you bought it, and so on), as well as
data on each author (name, birthdate, awards won, number of books written, and so on).
Now suppose you store all of these data in a flat file. Then if you own 10 books by Danielle
Steele, you must fill in the identical personal information on Ms. Steele for each of the 10
records associated with her books. This is not only a waste of time, but it increases the
chance of introducing errors as you enter the same information multiple times.
A better solution is to create a Books table and an Authors table. The Authors table
would include a record for each author, with fields for name, gender, date of birth, and so
on. It would also contain a unique identifier field, probably named AuthorID. For example,
Danielle Steele might have AuthorID 1, John Grisham might have AuthorID 2, and so on.
The values of AuthorID are not important, but they must be unique. The Books table would
have a record for each book, with fields like title, copyright date, genre, and so on. It would
also contain an AuthorID field that lists the same AuthorID as in the Authors table for the
author of this book. It could also contain a unique identifier field, such as ISBN. (For the
purpose of this discussion, we are assuming that each book has a single author. If multiple
authors are allowed, the situation is slightly more complex.)
The key to relating these two tables is the AuthorID field. In a database package such
as Access, there is a link between the AuthorID fields in the two tables.⁶ This link allows
you to find data from the two tables easily. For example, suppose you see in the Authors
table that John Updike's ID is 35. Then you can search through the Books table for all
records with AuthorID 35. These correspond to the books you own by John Updike. Going
the other way, if you see in the Books table that you own The World According to Garp by
John Irving, who happens to have AuthorID 21, you can look up the (unique) record in the
Authors table with AuthorID 21 to find personal information about John Irving.
The linked fields are called keys. Specifically, the AuthorID field in the Authors table
is called a primary key, and the AuthorID field in the Books table is called a foreign key.
A primary key must contain unique values, whereas a foreign key can contain duplicate
values. For example, there is only one Danielle Steele, but she has written many books.
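The Books and Authors design can be tried out in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Access, which is our choice and not part of the chapter's method. The AuthorID values 21 and 35 come from the discussion above. The Updike titles, the placeholder ISBN values, and the extra author are assumptions added only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked to
conn.executescript("""
    CREATE TABLE Authors (
        AuthorID INTEGER PRIMARY KEY,   -- primary key: one row per author
        Name     TEXT NOT NULL
    );
    CREATE TABLE Books (
        ISBN     TEXT PRIMARY KEY,      -- placeholder values below, not real ISBNs
        Title    TEXT NOT NULL,
        AuthorID INTEGER NOT NULL REFERENCES Authors(AuthorID)  -- foreign key: may repeat
    );
    INSERT INTO Authors VALUES (21, 'John Irving'), (35, 'John Updike');
    INSERT INTO Books VALUES
        ('isbn-a', 'The World According to Garp', 21),
        ('isbn-b', 'Rabbit, Run', 35),
        ('isbn-c', 'Rabbit Redux', 35);
""")

# From an author to that author's books: search the foreign key for 35.
books_by_updike = [
    row[0]
    for row in conn.execute(
        "SELECT Title FROM Books WHERE AuthorID = ? ORDER BY ISBN", (35,)
    )
]

# From a book to its (unique) author: follow the link the other way.
author_of_garp = conn.execute(
    "SELECT a.Name FROM Books AS b JOIN Authors AS a ON a.AuthorID = b.AuthorID "
    "WHERE b.Title = ?",
    ("The World According to Garp",),
).fetchone()[0]

# A primary key must be unique, so a second author with ID 35 is rejected.
try:
    conn.execute("INSERT INTO Authors VALUES (35, 'Someone Else')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(books_by_updike, author_of_garp, duplicate_rejected)
```

Note that AuthorID 35 appears twice in Books but only once in Authors, which is exactly the difference between a foreign key and a primary key.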
The theory and implementation of relational databases are both lengthy and complex.
Indeed, many books have been written about the topic. However, this brief introduction
suffices for our purposes. As you will see in examples, an Access database file (recognizable
by the .mdb extension, or the .accdb extension introduced in Access 2007) typically contains
several related tables. They are related in the same basic way as the Books and Authors tables
were related in the previous paragraphs: through links between primary and foreign keys.
These links will be apparent when you use Microsoft Query to import data from Access into Excel.
Keep in mind that we will not discuss how to create Access databases. This would take us too far
afield, given the goals of this book. In fact, you do not even need to own Access. We simply
assume that (1) an Access database exists, (2) you know the names of the tables and their fields,
and how they are linked through primary and foreign keys, and (3) you want to query the database
for information that you can import into Excel for eventual data analysis.
FUNDAMENTAL INSIGHT
Spreadsheets Versus Databases
As is illustrated throughout this book, Excel is an excellent tool for analyzing data. However,
Excel is not the best place to store complex data. In contrast, database packages such as Access,
SQL Server, Oracle, and others have been developed explicitly to store data, and much of
corporate data is stored in such database packages. These packages typically have some tools for
analyzing data, but these tools are neither as well known nor as easy to use as Excel.
Therefore, it is useful to know how to import data from a database into Excel for analysis.
Primary keys, which are unique, and foreign keys, which are usually not unique, relate the
tables in a relational database.
⁶They do not actually have to have the same field name, such as AuthorID, but the indexes must match. For example,
if 1 is Danielle Steele's index in one table, it must be her index in the other table as well.
17.4.2 Using Microsoft Query
There are two ways to import Access data into Excel 2007. They are both found
in the Get External Data group on the Data ribbon. (See Figure 17.12.) The first method
uses the From Access button. This seems very natural, but it is limited to importing whole
tables or saved queries. If you want to import only a single table, or if you have already
saved a query in Access, this is the method you should use because it is very easy.
However, if you want to create a query on the fly involving several Access tables, you need
to use the second method, which employs Microsoft Query.
Figure 17.12 Get External Data Group on the Data Ribbon
The Microsoft Query software allows you to import all or part of the data from many
database packages into Excel with very little work. You probably do not know you own
this software. For example, if you click on the Windows Start button and then choose
Programs, you will not find Microsoft Query on the list. However, it comes with Office,
and you can use it. (The only question is whether you installed it when you installed
Office. To check, open a blank spreadsheet in Excel and select From Microsoft Query from
the From Other Sources dropdown menu on the Data ribbon. If this doesn't work, then
Microsoft Query is not installed. You will have to go through the Add/Remove part of the
Office Setup program, with your Office CD-ROM, to install it.)
Once Microsoft Query is installed, importing data from Access (or any other sup-
ported database package) is essentially a three-step process:
1. Define the source, so that Excel knows what type of database the data is in and where
the data is located.
2. Use Microsoft Query to define a query.
3. Return the data to Excel.
We illustrate these three steps in the following example.
EXAMPLE 17.5 FINE SHIRT COMPANY'S RELATIONAL DATA
The Fine Shirt Company creates and sells shirts to its customers. These customers are
retailers who sell the shirts to consumers. The company has created an Access database
file Shirt Orders.mdb that has information on sales to its customers during the
period of 2005 through 2009.⁷ There are three related tables in this database: Customers,
Orders, and Products.
⁷In Office 2003 and earlier, Access files had an .mdb extension. In Office 2007, the extension changed to .accdb.
Old .mdb files can be converted to the new .accdb format, but Access 2007 (or 2010) can read .mdb files. Because
we see no advantage to converting .mdb files to .accdb files, we have not done so.

The Customers table has the following information on the company's seven customers:
CustomerID (an index from 1 to 7), Name, Street, City, State, Zip, and
Phone.

The Products table has the following information on the company's 10 products
(types of shirts): ProductID (an index from 1 to 10), Description, Gender (whether
the product is made for females, males, or both), and UnitPrice (the price to the
retailer).

Finally, the bulk of the data is in the Orders table. This table has a record for each
product ordered by each customer on each date during the five-year period. There
are 2245 records in this table. If a customer ordered more than one product on a
particular date, there is a separate record for each product ordered. The fields in the
Orders table are OrderID (an index from 1 to 2245), CustomerID (foreign key to
Customers table), ProductID (foreign key to Products table), OrderDate,
UnitsOrdered (number of shirts of this type ordered), and Discount (percentage
discount, if any, for this order).
The Access file has a link between the CustomerID fields in the Customers and Orders
tables, and a link between the ProductID fields in the Products and Orders tables. This
way, the detailed information on customers and products is entered only once. If you
need any of this information for a particular order, you can find it through the links. For
example, if a particular order shows that CustomerID and ProductID are 2 and 7, you can
look up information about customer 2 and product 7 in the Customers and Products
tables.
Access allows you to diagram the relationships between tables, as shown in Figure 17.13.
This diagram shows the primary keys (the key symbols) and the links involving the CustomerID
and ProductID fields. The 1 and ∞ signs on the links imply "many-to-one" relationships.
Specifically, a given customer is included only once in the Customers table, but this same
customer can be responsible for many orders in the Orders table. Similarly, a given product is
included only once in the Products table, but it can be included in many orders in the Orders
table.
FUNDAMENTAL INSIGHT
Relationships Diagram
The place to start with any relational database is a relationships diagram. It shows at a
glance how the data are stored logically (how they are separated out into tables and how the
tables are linked through keys), and it also shows the names of the various tables and fields.
Figure 17.13 Relationships Diagram
The company wants to perform data analysis on the order data within Excel. How can it
use Microsoft Query to import the data from Access into Excel?
Objective To illustrate how Microsoft Query can be used to import the results of queries
on the Shirt Orders database into Excel.
Solution
Before going into the details, it is important to realize that the entire procedure is done
within Excel and Microsoft Query, not Access. All you need is an Access database file, in
this case Shirt Orders.mdb.⁸
⁸This procedure can be done in the same way, with virtually no changes, with databases from other packages,
such as SQL Server and Oracle. This is fortunate; you need to learn only one basic method for all relational
databases.
The first step of the procedure is to tell Excel what type of data you have and where it
is located. In its terminology, you must define a data source. To do so, open a blank
spreadsheet in Excel and select From Microsoft Query from the From Other Sources dropdown
menu on the Data ribbon. This takes you to the Choose Data Source dialog box shown in
Figure 17.14. Note that the list you see might not be the same as the one shown here. Each
time you tell Excel about a new data source, it is added to the list shown. In any case, to add
a new data source, select the top <New Data Source> item. Then click on OK.
Figure 17.14 Choose Data Source Dialog Box
This takes you to the Create New Data Source dialog box. It should eventually be filled in
as shown in Figure 17.15. To do so, proceed as follows.
Figure 17.15 Create New Data Source Dialog Box
1 Enter a descriptive title in line 1. (This does not need to be the same name as the
Access file name.)
2 Use the dropdown list in line 2 to select the appropriate driver, in this case the
Microsoft Access Driver. (This is where you could specify another database package, such
as SQL Server or Oracle. If you did so, you would probably need to specify a server where
the database resides, along with authentication for permissions to this server.)
3 Click on the Connect button in line 3 to bring up the ODBC Microsoft Access Setup
dialog box shown in Figure 17.16, where you indicate which database file you want to use.
To choose it, click on its Select button and browse for the Shirt Orders.mdb file. (Your
file will almost certainly be in a different location than ours.)
4 Once you have located this file, click on OK a couple of times to see the completed
Create New Data Source dialog box, and click on OK once more to get back to the
Choose Data Source dialog box, with your data source, Shirt Orders, now on the list. (See
Figure 17.17.)
Figure 17.16 Dialog Box for Selecting Database File
Figure 17.17 Choose Data Source Dialog Box with the New Entry
This completes step 1 of the overall procedure. You have defined a data source that you can
now query. It is important to realize that once you have created this Shirt Orders source, you
will not have to create it again. Specifically, if you want to run another query on this database
at a later time, you can select the From Microsoft Query option, select the Shirt Orders source
from the list, and proceed directly to the query itself, bypassing the steps described above.
At this point, you should be looking at the Choose Data Source dialog box shown in
Figure 17.17, with Shirt Orders on the list. Make sure the Shirt Orders item is selected
and the bottom checkbox is unchecked, and click on OK.⁹ This brings up the Add Tables
dialog box shown in Figure 17.18, in front of the Microsoft Query screen shown in Figure
17.19. This begins the second step of the overall procedure, where you define the query.
Essentially, you need to specify which tables are relevant for the query, which fields you
want to return to Excel, and which records meet the criteria you spell out.
⁹If the bottom checkbox is checked, the Query Wizard will be launched when you click on OK. You can try this
if you like, but we prefer the method described here.
Figure 17.18 Add Tables Dialog Box
Figure 17.19 Microsoft Query Screen
To get started, let's try a relatively easy single-table query: Find all of the records from the
Orders table where the order date is during the years 2007 or 2008, the product number is
3 or 5, and the number of units ordered is at least 100, and return to Excel all fields in the
Orders table for these records. Proceed as follows.
1 If the Add Tables dialog box is still visible, select the Orders table, click on Add, and
then click on Close. (If the Add Tables dialog box is not visible, select the Add Tables
menu item from the Table menu to make it visible.) The labels from the table appear in the
top pane of the screen. (See Figure 17.20.)
Figure 17.20 Query Screen before Entering Criteria
2 Double-click on any of the fields in this table that you want to be returned by the
query. If you double-click on the top item (the asterisk), all fields will be returned. For this
query, double-click on the asterisk, and you should see a sampling of the data that will be
returned in the bottom pane of the screen.
3 Click on the Show/Hide Criteria button on the Microsoft Query toolbar (the button
with the glasses and the funnel). This opens a middle pane on the screen, where you can
enter criteria.
4 Enter the criteria for the query. Any conditions in a given row are "and" conditions,
and those in different rows are treated as "or" conditions. You can either type the
conditions directly into the middle pane (if you know the correct syntax) or you can select the
Add Criteria menu item from the Criteria menu. This latter option brings up the dialog box
shown in Figure 17.21. After a bit of experimenting, you will see how to enter conditions
in this dialog box. Then, when you click on the Add button, the condition appears in the
middle pane of the screen. By examining the syntax of the conditions that are entered, you
can quickly learn how to type in your own conditions directly. The final conditions for our
query appear in Figure 17.22. (Note how dates are enclosed in # signs, and how the
keywords Between and In are used.)
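The three criteria in this query amount to a single "and" condition on three fields. As a rough illustration of the logic, and not of the exact statement Microsoft Query generates, the following Python sketch runs an equivalent SQL query against an in-memory SQLite table. The five rows are invented for the purpose and are not taken from Shirt Orders.mdb. Note one difference in syntax: Access encloses dates in # signs, whereas this sketch stores dates as ISO-formatted text, which SQLite compares as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, "
    "ProductID INTEGER, OrderDate TEXT, UnitsOrdered INTEGER, Discount REAL)"
)

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 2, 3, "2007-03-15", 120, 0.00),  # satisfies all three criteria
        (2, 4, 5, "2008-11-02", 100, 0.05),  # satisfies all three (exactly 100 units)
        (3, 1, 3, "2006-07-09", 150, 0.00),  # fails: year is 2006
        (4, 6, 7, "2007-05-21", 200, 0.10),  # fails: product is neither 3 nor 5
        (5, 3, 5, "2008-01-30", 60, 0.00),   # fails: fewer than 100 units
    ],
)

# All conditions are joined with AND, like criteria entered in a single row.
query = """
    SELECT * FROM Orders
    WHERE OrderDate BETWEEN '2007-01-01' AND '2008-12-31'
      AND ProductID IN (3, 5)
      AND UnitsOrdered >= 100
    ORDER BY OrderID
"""
matching_ids = [row[0] for row in conn.execute(query)]

print(matching_ids)
```

The asterisk in SELECT * plays the same role as double-clicking on the asterisk in the table's field list: all fields are returned.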
FUNDAMENTAL INSIGHT
Queries
When you go through the steps indicated here in
Microsoft Query, you are asking for selected data to
be imported into Excel. You do this with a query. In
general, a query (for any database package) specifies
(1) which tables you are getting your data from,
(2) which fields from these tables you want to import,
and (3) which criteria (or filters) you want to impose.
The criteria determine which rows from the tables
will be imported. In short, a query specifies the subset
of the entire database that is of interest to you.
If you scroll down the records in the bottom pane of the screen, you will see that this
query returns 69 of the 2245 records in the Orders table. The final step in the overall
three-step process is to get these data back into Excel. This is easy. Simply select the
Return Data to Microsoft Excel menu item from the File menu. This takes you back to
Excel and brings up the dialog box in Figure 17.23, where you can specify the type of
report you want and where you want the results. For now, choose the Table option, so that
the Access data are imported into an Excel table, the same type of table discussed in
Chapter 2. When you click on OK, the results appear in a few seconds, and you can now
analyze them statistically using any tools we have discussed. However, it is important to
realize that these data are still linked to the query. This means that you can refresh the
data in Excel if the Access data change. To do so, make sure your cursor is inside the
Excel table, and click on the Refresh button on the Table Tools Design ribbon (or
the Refresh All button on the Data ribbon).
It is also possible to get back to Microsoft Query so that you can edit your query. Again,
make sure your cursor is inside the Excel table, and click on the Connections button on the
Data ribbon. This brings up the Workbook Connections dialog box in Figure 17.24. Click
on Properties, then select the Definition tab, and finally click on the Edit Query button.
(Note that if you have performed multiple imports into the same Excel workbook, you will
see several connections in the top pane of Figure 17.24. You can find the relevant connec-
tion by highlighting any of the connections and then clicking on the bottom pane.)
Figure 17.21 Add Criteria Dialog Box
Figure 17.22 Criteria for Query
One more possibility is to save the query itself. To do so, select Save from the File menu
in the Microsoft Query screen and type some suggestive file name such as Shirt Orders
Query 1. The extension .dqy (for database query) is added by default. This allows you to run
this query at any time from within Excel by clicking on the Existing Connections button on
the Data ribbon.
Let's now try a more ambitious query: Find all of the records in the Orders table that
correspond to orders for at least 80 units made by the customer Shirts R Us for the product
Long-sleeve Tunic, and return the dates and units ordered for these orders. The main
difference now is that the query must be based on all three tables in the database. The reason
is that the Orders table does not have Shirts R Us; it contains only customer IDs.
Similarly, the Orders table doesn't know about Long-sleeve Tunic. The trick is to use the
links between the tables.
Figure 17.23 Import Data Dialog Box
Figure 17.24 Workbook Connections Dialog Box
Starting in Excel (with the cursor not inside the data previously returned), proceed as
follows.
1 Select the From Microsoft Query option. (Remember that it is under From Other
Sources on the Data ribbon.) This time, however, simply click on the Shirt Orders data
source that is already there; you do not need to create it again. (As before, clear the Query
Wizard checkbox if you want to follow along with our instructions.)
2 This takes you directly into the Microsoft Query screen. Inside this screen, first add all
three tables to the top pane of the Query.
3 Double-click on the OrderDate and UnitsOrdered fields in the Orders table (because
you want data in these two fields to be returned to Excel).
4 Fill out the criteria as shown in Figure 17.25. Note that the field names for the three
criteria are from different tables. The Name field is from the Customers table, the
Description field is from the Products table, and the UnitsOrdered field is from the
Orders table. (A good exercise is to think through the logic that Microsoft Query uses.
From the Customers table, Microsoft Query finds that Shirts R Us corresponds to cus-
tomer number 3. From the Products table, it finds that Longsleeve Tunic corresponds
to product number 6. Therefore, it searches the Orders table for all records where
CustomerID is 3, ProductID is 6, and UnitsOrdered is at least 80.) This returns 17
records, as shown in Figure 17.26.
Figure 17.25 Query Based on All Three Tables
5 Select the Return Data to Microsoft Excel menu item, and select the Table option.
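The lookup logic described in step 4 can be sketched outside Excel. The following is a minimal illustration in Python using the built-in sqlite3 module, not a transcription of what Microsoft Query does internally. The table and field names come from the text, and the IDs 3 and 6 match the example; the OrderID column, the customer Threads having ID 7, and all order rows are invented stand-ins for the real Shirt Orders database.

```python
import sqlite3

# Toy stand-in for the Shirt Orders database. Table and field names follow
# the text; OrderID and every order row are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Products  (ProductID INTEGER PRIMARY KEY, Description TEXT);
    CREATE TABLE Orders    (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER,
                            ProductID INTEGER, OrderDate TEXT,
                            UnitsOrdered INTEGER);
    INSERT INTO Customers VALUES (3, 'Shirts R Us'), (7, 'Threads');
    INSERT INTO Products  VALUES (6, 'Long-sleeve Tunic'), (1, 'Short-sleeve Polo');
    INSERT INTO Orders VALUES
        (1, 3, 6, '2008-02-11', 95),
        (2, 3, 6, '2008-05-03', 60),
        (3, 3, 1, '2009-01-20', 120),
        (4, 7, 6, '2009-03-15', 85);
""")

# Step A: find the customer ID that corresponds to the customer name.
(customer_id,) = conn.execute(
    "SELECT CustomerID FROM Customers WHERE Name = ?", ("Shirts R Us",)
).fetchone()

# Step B: find the product ID that corresponds to the product description.
(product_id,) = conn.execute(
    "SELECT ProductID FROM Products WHERE Description = ?", ("Long-sleeve Tunic",)
).fetchone()

# Step C: search the Orders table using both IDs and the units threshold.
rows = conn.execute(
    "SELECT OrderDate, UnitsOrdered FROM Orders "
    "WHERE CustomerID = ? AND ProductID = ? AND UnitsOrdered >= ? "
    "ORDER BY OrderDate",
    (customer_id, product_id, 80),
).fetchall()

print(customer_id, product_id, rows)
```

Only the first toy order satisfies all three criteria; the others fail on units, product, or customer, respectively.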
One last possibility we will illustrate is returning calculated fields. Suppose you want
to return the revenues for all orders during 2008 and 2009 from Rags to Riches for shirts
sold to females, where revenue is calculated as units ordered times unit price times (one
minus the discount). You form the query in the usual way, but in the bottom pane, you type
the expression UnitsOrdered*UnitPrice*(1-Discount) as one of the field names. (Note
that unlike in Excel, there is no equals sign to the left of the expression.) The resulting
Query screen, assuming you want to return the fields Description, Gender, and OrderDate in
addition to revenue, should appear as in Figure 17.27. To give the calculated field a descrip-
tive name such as Revenue, double-click on the expression to obtain the dialog box in
Figure 17.28 and enter a column heading. Otherwise, the calculated field in Excel will be
something like Expr1001. (You can also create this calculated field and name it by selecting
Add Column from the Records menu. This leads directly to the dialog box in Figure 17.28.)
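As a rough illustration of a named calculated field, the sketch below runs the same expression through SQLite from Python. It assumes, purely for brevity, that UnitsOrdered, UnitPrice, and Discount all sit in one table (in the actual database they may be spread across tables), and the two rows are invented. The AS keyword plays the role of the column heading entered in Figure 17.28.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical single-table layout with invented rows.
    CREATE TABLE Orders (OrderDate TEXT, UnitsOrdered INTEGER,
                         UnitPrice REAL, Discount REAL);
    INSERT INTO Orders VALUES ('2008-03-01', 100, 20.0, 0.25),
                              ('2009-07-15',  40, 25.0, 0.50);
""")

# The expression has no leading equals sign; AS supplies the descriptive name.
cursor = conn.execute(
    "SELECT OrderDate, UnitsOrdered*UnitPrice*(1-Discount) AS Revenue "
    "FROM Orders ORDER BY OrderDate"
)
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

Without the alias, the returned column would carry a generated name, which is the same inconvenience as the Expr1001 heading mentioned above.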
Figure 17.26 Data Returned to Excel as a Table
Figure 17.27 Query with a Calculated Field
We reiterate that once the results of the query are returned to Excel, you can then
begin the statistical analysis of the data: creating summary measures, scatterplots, pivot
tables, and so on. However, if your ultimate goal is to create a pivot table based on the
database data, you can do this directly, as we discuss next.
So far, you have imported database data into Excel tables. It is also possible to import
the data directly into pivot tables, as illustrated in the following continuation of the Shirt
Company example.
Figure 17.28 Naming a Calculated Field
EXAMPLE 17.5 FINE SHIRT COMPANY'S RELATIONAL DATA (CONTINUED)
The Fine Shirt Company would like to break down revenue from its various customers
and products by using pivot tables. How should it proceed?
Objective To illustrate how Microsoft Query can be used to import data directly into a
pivot table.
Solution
If you want to base a pivot table on external data, you should go through Microsoft Query,
not through the usual PivotTable button on the Insert menu (see footnote 10). To do so, get
into Microsoft Query and define a query. We defined a sample query as shown in Figure 17.29
with no criteria, just a set of fields to return, one of which is calculated revenue (and is
renamed Revenue as in Figure 17.28), but you can impose criteria if you like. When you select
the Return Data to Microsoft Excel menu item from Microsoft Query's File menu, you see
the dialog box in Figure 17.23, where you can specify the type of report you want and
where you want it. At this point, select PivotTable Report (or PivotChart and PivotTable
Report).
Footnote 10: If you use the PivotTable button instead, you will see an option to base the
pivot table on external data and a Choose Connection button. However, this button leads only
to saved Query .dqy files. So unless you have already created such a file by saving a query,
this will lead to a dead end.
From here, you can create any pivot tables in the usual way. For example, one possibility
is to choose the settings in Figure 17.30, so that you can analyze the sum of revenue
for any customer/product combination for any date(s). The only trick here involves the
OrderDate field. The original pivot table contains a row for each date, which means over
1000 rows. If you want to group the data by quarter of year, right-click on any date in the
original pivot table, select Group, and select both Quarter and Year. The resulting pivot
table in Figure 17.31 (for long-sleeve products only) shows total revenue broken down by product,

customer (using the Report Filter area at the top), and quarter of year. This is a lot of use-
ful data with very little work. In addition, you have the option of obtaining corresponding
pivot charts automatically.
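To make the Quarter and Year grouping concrete, here is a small Python sketch of the same aggregation: each order date is mapped to a (year, quarter) pair and revenue is summed within each pair. The order dates and revenue figures are invented for illustration, and the function name quarter_of is hypothetical.

```python
from collections import defaultdict
from datetime import date

# Invented (order date, revenue) pairs standing in for the query results.
orders = [
    (date(2008, 1, 15), 1200.0),
    (date(2008, 3, 30), 800.0),
    (date(2008, 4, 2), 500.0),
    (date(2009, 11, 20), 950.0),
]


def quarter_of(d):
    """Return the calendar quarter (1 to 4) that contains date d."""
    return (d.month - 1) // 3 + 1


# Sum revenue within each (year, quarter) group, as the grouped pivot table does.
revenue_by_quarter = defaultdict(float)
for order_date, revenue in orders:
    revenue_by_quarter[(order_date.year, quarter_of(order_date))] += revenue

for (year, quarter), total in sorted(revenue_by_quarter.items()):
    print(f"{year} Qtr{quarter}: {total:.2f}")
```

Grouping on the pair rather than on the quarter alone keeps the first quarter of 2008 separate from the first quarter of 2009, which is why both Quarter and Year are selected in the Group dialog box.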
Figure 17.29 Query Specification
Figure 17.30 Pivot Table Fields
Like the query results discussed earlier, the pivot table is linked to the query. This means
that you can go back to Microsoft Query, edit the query, and return the data to Excel to
update the pivot table. It is an amazingly intuitive and powerful tool.
17.4.3 SQL Statements
Queries represent a large part of the power behind relational databases. Regardless of the
particular database package (Access, SQL Server, Oracle, or any of the others), the types
of queries you create are all basically the same. You typically select one or more tables,
select several fields from these tables, and impose certain conditions. To standardize
queries across packages, SQL (structured query language; pronounced "S-Q-L" or
"sequel") was developed several decades ago. (It is often called the language of
databases.) Sitting behind each query you develop in a user-friendly interface such as
Microsoft Query is an SQL statement. Although these statements are beyond the scope of
this book, you might like to take a look at them. This is easy to do. Once you have created
a query, click on the SQL button in the Query toolbar.
As an example, if you form the query shown in Figure 17.19 and click on the SQL but-
ton, you see the SQL statement in Figure 17.32. At first, this is probably intimidating.
However, if you break it down into its parts, it is fairly straightforward. SQL has a number
of keywords that are capitalized. This statement includes the keywords SELECT, FROM,
WHERE, and AND. The SELECT clause of the statement specifies the fields to return
(where, in the case of multiple tables, the table name and a period precede the field name).
The FROM clause specifies the tables to base the query on. Finally, the WHERE clause
spells out the criteria, separated by ANDs. If you want to learn more about SQL, the best
way is to create a query through the interface and then look at the corresponding SQL
statement. Once you get used to SQL statements, you can edit a query by editing its SQL
equivalent. If you get really proficient, you can even create a query from scratch by typing
the appropriate SQL statement directly. (If you are interested, there are numerous books
available for learning SQL.)
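For readers who want to see the SELECT, FROM, WHERE, and AND pieces in one place, here is a hedged sketch of an SQL statement for the earlier three-table query (orders of at least 80 units of Long-sleeve Tunic by Shirts R Us). It is not a copy of Figure 17.32 or of the text that Microsoft Query generates; the statement is written by hand, the linking conditions in the WHERE clause are one common way to express the table links, and the toy rows (including the OrderID column) are invented. It is run here through Python's sqlite3 module only so that it can be checked.

```python
import sqlite3

# Fields are written as TableName.FieldName because several tables are involved.
SQL = """
SELECT Orders.OrderDate, Orders.UnitsOrdered
FROM Customers, Orders, Products
WHERE Customers.CustomerID = Orders.CustomerID
  AND Products.ProductID = Orders.ProductID
  AND Customers.Name = 'Shirts R Us'
  AND Products.Description = 'Long-sleeve Tunic'
  AND Orders.UnitsOrdered >= 80
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical rows; only the table and field names come from the text.
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Products  (ProductID INTEGER PRIMARY KEY, Description TEXT);
    CREATE TABLE Orders    (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER,
                            ProductID INTEGER, OrderDate TEXT,
                            UnitsOrdered INTEGER);
    INSERT INTO Customers VALUES (3, 'Shirts R Us'), (7, 'Threads');
    INSERT INTO Products  VALUES (6, 'Long-sleeve Tunic'), (1, 'Short-sleeve Polo');
    INSERT INTO Orders VALUES
        (1, 3, 6, '2008-02-11', 95),
        (2, 3, 6, '2008-05-03', 60),
        (3, 7, 6, '2009-03-15', 85);
""")

result = conn.execute(SQL).fetchall()
print(result)
```

The first two conditions in the WHERE clause express the links between the tables, and the remaining three are the criteria from the bottom pane of the query.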
Figure 17.31 Pivot Table Results after Grouping by OrderDate

Figure 17.32 SQL Statement
PROBLEMS
Level A
15. Starting with the Shirt Orders.mdb file from
Example 17.5, find all of the records from the Orders
table where the order was placed in 2008 or 2009, the
product number is 1 or 10, the customer is not 7, and
the number of units ordered is at least 75. Return all
fields in the Orders table for each of these records to
Excel.
16. Starting with the Shirt Orders.mdb file from
Example 17.5, find all of the records from the Orders
table that correspond to orders for between 50 and 100
items (inclusive) made by the customer Rags to Riches
for the product Short-sleeve Polo. Return the dates,
units ordered, and discounts for each of these orders to
Excel.
17. Starting with the Shirt Orders.mdb file from
Example 17.5, find all of the records from the Orders
table that correspond to orders for more than 75 items
made by the customer Threads for products designed
to be worn by women. Return the dates, units ordered,
and product description for each of these orders to
Excel.
18. The Fine Shirt Company would like to know how
many units of its products designed for each gender
subset (i.e., men, women, and both genders) were sold
to each customer during each quarter of the past five
years (i.e., from the first quarter of 2005 through the
fourth quarter of 2009). Starting with the Shirt
Orders.mdb file from Example 17.5, perform an
appropriate query and bring the results back to Excel
as a pivot table to answer the company's question.
19. The Fine Shirt Company would like to know how
many units of each of its products were sold to each
customer during each year of the period 2005-2009.
Starting with the Shirt Orders.mdb file from
Example 17.5, perform an appropriate query and bring
the results back to Excel as a pivot table to answer the
company's question.
Level B
20. Starting with the Shirt Orders.mdb file from
Example 17.5, do the following.
a. Find all of the records from the Orders table that
correspond to orders placed in 2008 by the
customer The Shirt on Your Back for shirts
designed to be worn by both men and women.
Return to Excel the fields OrderDate, Description,
Gender, UnitsOrdered, UnitPrice, Discount, and a
calculated field Revenue. Revenue should be
calculated as UnitsOrdered*UnitPrice*
(1-Discount).
b. Analyze the distribution of revenues associated
with order records identified in part a. Be sure to
consider measures of central location, variability,
and skewness in characterizing this distribution.
c. Repeat parts a and b with the same criteria except
that the analysis should now focus on the orders
placed in 2009. Summarize the differences between
the revenue distributions for 2008 and 2009.
21. Write the SQL statement to perform the query given in
Problem 16.
22. Write the SQL statement to perform the query given in
Problem 17.
23. The Fine Shirt Company would like to know what
proportion of each customer's total dollar purchases in
2009 came from buying Short-sleeve Seersucker
shirts. Furthermore, the company would like to
compare this proportion to that of the most popular
product, as measured by 2009 total dollar purchases,
for each customer. Starting with the Shirt
Orders.mdb file from Example 17.5, perform an
appropriate query and bring the results back to Excel
as a pivot table to answer the company's questions.
17.5 WEB QUERIES
Most of us rely on the Web for so many things: shopping for books and electronics, making
plane and hotel reservations, reading the news, finding the latest sports scores, and
not least, finding virtually unlimited sources of data. It is hard to imagine that the Web did
not exist, at least in anything resembling its current form, when Bill Clinton was elected
president in 1992. But yes, the Web is relatively new, and it continues to evolve. There are
indeed Web standards, but in some ways, the Web is still like the wild, wild West, where
anything goes.
This is particularly true for Web sites that contain data sets. They are structured in all
sorts of ways, and the steps required to import the data into Excel for analysis vary greatly.
Many Web sites, such as http://stats.oecd.org/wbos/index.aspx, provide buttons that allow
you to download the data directly into Excel. Probably more sites will have this feature in
the future, but we doubt that they will ever all have it. At the other extreme, the only way
to get some Web data into Excel is to cut and paste and hope for the best.
In this section we discuss an Excel tool called a Web query that lies somewhere
between these two extremes. To understand how it is possible to query a Web site from
Excel, you first need to understand at least a little of how Web pages are constructed. They
are created with HTML (hypertext markup language), a text language that includes tags
for displaying the various items you see on a typical Web page. One tag that is particularly
useful for our purposes is the <table> tag. When this tag is used as part of an HTML
document, followed by data, it puts the data in a readable tabular form. Of course, the table
might be surrounded by a lot of text and graphics, but the chances are that when you query
a Web page from Excel, you are most interested in the table data and would like to ignore
the surrounding material. Web queries allow you to do exactly this. They search for <table>
tags, find the corresponding data, and bring them into Excel in the usual row and column
format.
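The idea of searching for table tags and ignoring everything else can be sketched in a few lines of Python with the standard html.parser module. This is only an illustration of the concept, not how Excel implements Web queries. The class name TableExtractor and the page markup are invented (the two data rows echo the scores example that follows), and the sketch does not handle nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row a list of cells
        self._row = None   # cells of the row currently being read, if any
        self._cell = None  # text fragments of the cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is kept only while inside a table cell; everything else is skipped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Hypothetical page: a heading and a paragraph surround one table.
page = """
<html><body>
<h1>Course Scores</h1>
<p>Text outside the table is ignored.</p>
<table>
  <tr><th>StudentID</th><th>Total Points</th><th>Course Grade</th></tr>
  <tr><td>1</td><td>880</td><td>B+</td></tr>
  <tr><td>2</td><td>935</td><td>A</td></tr>
</table>
</body></html>
"""

parser = TableExtractor()
parser.feed(page)
print(parser.tables[0])
```

The heading and paragraph never reach the output because no table cell is open when their text is encountered.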
We begin with a simple Web query. We (the authors) have a Web site called
www.kelley.iu.edu/albrightbooks that we control. (This means that unlike other ever-changing
Web sites, this one will continue to behave as we describe here, probably.) There is an HTML
page Scores.htm on this site, created just for this example, that contains a heading and a
table of course scores for students in a fictitious course. To get the data in this table into
Excel, use the following steps:
1. Make sure you have an active connection to the Web, and open a new workbook in
Excel.
2. Click on the From Web button on the Data ribbon.
3. Fill in this dialog box as shown in Figure 17.33. The most important part is the URL
(the address of the page) at the top, which is
http://www.kelley.iu.edu/albrightbooks/scores.htm
You have to know this URL or browse the Web for it. We find it easiest to browse to the
intended Web site on Internet Explorer (or any other browser), copy its URL, and paste it
into the dialog box in Figure 17.33. Once you enter the URL and click on Go, you will see
the Web page with yellow arrows next to all of the tables. (Some of these will probably not
look like tables, but they all have the HTML <table> tag.) You can click on any of these
yellow arrows to change them to green checkmarks. The selected tables will then be
imported into Excel.
4. After you click on Import, you will be asked where to place the results. We specified
cell A1 of the blank worksheet.
The results then appear as shown in Figure 17.34. The contents of the single checked table
have been imported into Excel and are formatted nicely. In addition, a link to the Web page
remains. This means that if the contents of the Web page change, as they often do, you can
refresh to obtain the latest data. To do so, put the cursor anywhere inside the Excel data and
click on the Refresh All button on the Data ribbon.
Margin note: Web queries import data that are surrounded by HTML <table> tags. This is
usually, but not always, where the data of interest is stored on a Web page.
Margin note: To run a Web query, browse for a promising URL and paste it into the Address
area in the New Web Query dialog box.
Figure 17.33 Web Query Dialog Box

Row  A          B             C
1    StudentID  Total Points  Course Grade
2    1          880           B+
3    2          935           A
4    3          830           B
5    4          890           B+
6    5          915           A-
7    6          840           B
8    7          785           C+
9    8          730           C
10   9          810           B-
11   10         905           A-
12   11         865           B
13   12         720           C-
14   13         895           B+
15   14         835           B
16   15         965           A
Figure 17.34 Results of Web Query
You can also save the definition of the query in an .iqy file. (You might want to save it so
that you can give it to a friend or use it on a different PC.) To save it, click on the Connections
button on the Data ribbon, then the Properties button, then the Definition tab, and finally the
Edit Query button to get back to the New Web Query dialog box in Figure 17.33. Now click
on the Save Query button at the top, which allows you to save the query under some
descriptive name, such as Scores Web Query.iqy. By default, Microsoft stores such queries in a
folder such as C:\Documents and Settings\username\Application Data\Microsoft\Queries,
depending on your operating system. In any case, we suggest that you accept the default.
Then you can run this query later on by clicking on the Existing Connections button on the
Data ribbon and selecting your saved query file.
In fact, Microsoft provides several ready-made Web queries that you can try. If you
click on the Existing Connections button on the Data ribbon, you will probably find three
items (and maybe more): MSN MoneyCentral Investor Currency Rates, MSN
MoneyCentral Investor Major Indices, and MSN MoneyCentral Investor Stock Quotes, as
shown in Figure 17.35. (These queries are defined in .iqy files stored in the folder in the
previous paragraph. If you are interested, you can open these small files in Notepad.
Basically, they just contain URLs.) To run any of these built-in Web queries, double-click
on the entry in the Existing Connections list; you don't have to go through the Web Query
button. It is important to remember that they return live results from the Web, and if you
refresh them tomorrow, they will update automatically with fresh data.
Figure 17.35 Existing Connections Dialog Box
The key to creating your own Web queries is finding good data sets on the Web and then
crossing your fingers. As we searched for data sets for this edition of the book, we found
many on the Web that imported beautifully with Web queries, but we also found some that
returned virtually nothing with Web queries. (We also found a number of "promising"
tables of data on the Web, but when we tried to import them with Web queries, there
were no yellow arrows next to the tables; hence, there was no way to do the import.)
Sometimes you will need to run several Web queries on the same basic site to get all
of the data you want. As an example, when we imported the baseball salary data discussed
in Chapters 2 and 3, we successfully ran Web queries on the USA Today site. Specifically,
we queried with URLs such as
http://content.usatoday.com/sports/baseball/salaries/teamdetail.aspx?team=1&year=2009
You have probably seen many Web addresses such as this, where there is a base address,
followed by a question mark, and then followed by a series of field-value pairs separated by
ampersand symbols. The part to the right of the question mark is called a query string, and it
specifies exactly which data you want, in this case the first team (Atlanta Braves) for the
year 2009. To get data for all 30 teams for a single year, you would have to run 30 Web
queries, each with a different value for team in the query string. To get data for all 30 teams
for 8 separate years, as we did, you would have to run 240 separate Web queries. Fortunately,
it is possible to automate this with a macro (which we did), but if you don't know how to
write macros, you are stuck with the tedious job of running many Web queries. Still, it beats
entering the data manually.
Margin note: The data from a Web query is live. If the data on the Web changes and you
refresh the Excel sheet, you will automatically get the newer data.
Margin note: Be on the watch for query strings in the URLs you visit. They are the part to
the right of the question mark (if any), and they specify exactly which page you want.
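The structure of a query string, and the count of 240 URLs, can be illustrated with Python's standard urllib.parse module. This is only a sketch of how the addresses could be generated, not the authors' macro, and it does not download anything. The base address and the field names team and year come from the example URL; the helper name salary_url, the assumption that teams are numbered 1 through 30, and the particular range of 8 years are all hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://content.usatoday.com/sports/baseball/salaries/teamdetail.aspx"


def salary_url(team, year):
    """Build base address + '?' + field=value pairs joined by ampersands."""
    return f"{BASE_URL}?{urlencode({'team': team, 'year': year})}"


# Build one URL, then split it apart again to inspect its query string.
first = salary_url(1, 2009)
fields = parse_qs(urlsplit(first).query)

# Assumption: teams are numbered 1..30 and the 8 years are 2002..2009.
years = range(2002, 2010)
urls = [salary_url(team, year) for year in years for team in range(1, 31)]

print(first)
print(fields)
print(len(urls))
```

Each generated address would still have to be pasted into a Web query (or fed to a macro) to retrieve its table.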
There is a virtually unlimited amount of interesting data freely available on the Web,
so we encourage you to try Web queries. They don't always work, and even when they do,
the imported data often requires cleaning (getting rid of unwanted labels, reformatting
numbers, deleting blank rows, and so on). However, Web queries are quick and easy, and
they can frequently save you hours or even days of work. Of course, if you are lucky
enough to find data sets on the Web with nice "Click here to download to Excel" buttons,
then you don't need a Web query.
FUNDAMENTAL INSIGHT
Importing Data from the Web
With the Web still less than 20 years old, it is
astounding how much data is available on it, and
much more will surely follow. At some point, you will
almost undoubtedly want to import Web data into
Excel for analysis. The Web queries discussed here
are quick and easy, but only when they work. Some
Web sites do not permit you to use Web queries to
import the data you want, in which case you have to
use any tricks you can develop, including simple
copying and pasting. We suspect that Web sites will
become Excel friendlier in the future.
PROBLEMS
Level A
For the following problems, there is no guarantee that the
Web sites listed will continue to exist, at least at the current
addresses. Also, there is no guarantee that you will need a Web
query to retrieve the data. If you are lucky, there might be a
button that allows you to import the data directly into Excel.
24. Import data of interest into Excel from
http://www.usatoday.com/sports/sagarin.htm. (Jeff Sagarin, the
publisher of these nationally syndicated ratings, is a
long-time best friend of Winston.)
25. Import data of interest into Excel from
http://www.census.gov/, the official site of the U.S. Census Bureau.
26. Import data of interest into Excel from
http://www.ed.gov/, the official site of the U.S. Department of
Education. For example, you ought to find some
interesting data from the Education Statistics link
toward the bottom of the page.
27. Import data of interest into Excel from
http://www.dot.gov/, the official site of the U.S. Department of
Transportation.
28. Import data of interest into Excel from
http://wonder.cdc.gov/, a Department of Health and Human
Services site.
29. Import data of interest into Excel from
http://www.standardandpoors.com/indices/sp-case-shiller-home-price-indices/en/us/?indexId=spusa-cashpidff--p-us----,
an S&P site that contains housing data.
30. Import data of interest into Excel from the Web site at
http://146.142.4.24/cgi-bin/surveymost?eb, a Bureau
of Labor Statistics site.
31. The site
http://content.usatoday.com/sports/baseball/salaries/teamdetail.aspx?team=1&year=2008
contains 2008 baseball salary data for the
Atlanta Braves. Use a Web query to import this data
into Excel. By modifying the URL slightly, import
salary data into Excel for other teams and/or other
years.
32. Yahoo sites such as
http://finance.yahoo.com/q/hp?s=AA&a=00&b=29&c=2001&d=00&e=29&f=2010&g=m
contain historical stock prices on public
companies and market indexes. Do some exploration
to determine what all of the information to the right of
"hp?" means. Can you import such data into Excel
with a Web query? (Check closely. Does the Web
query return all of the data you request?) Do you even
need a Web query?
17.6 CLEANSING DATA
When you study statistics in a course, the data sets you analyze have usually been carefully
prepared by the textbook author or your instructor. For that reason, they are usually in good
shape: they usually contain exactly the data you need, there are no missing data, and there are
no "bad" entries (caused by typographical errors, for example). Unfortunately, you cannot
count on real-world data sets to be so perfect. This is especially the case when you obtain data
from external sources such as the Web. There can be all sorts of problems with the data, and it
is your responsibility to correct these problems before you do any serious analysis. This initial
step, called cleansing data, can be tedious, but it can prevent totally misleading results later on.
In this section we examine one data set that has a number of errors, all of which could
very possibly occur in real data sets. We discuss methods for finding the problems and for
correcting them. However, you should be aware of two things. First, the errors in this
example are only a few of those that could occur. Cleansing data requires careful detective
work to uncover all possible errors that might be present. Second, once an error is found, it
is not always clear how to correct it. A case in point
is missing data. For example, some respondents to a
questionnaire, when asked for their annual income,
might leave this box blank. How should you treat
these questionnaires when you perform the eventual
data analysis? Should you delete them entirely,
should you replace their blank incomes with the
average income of all who responded to this ques-
tion, or should you use a more complex rule to esti-
mate the missing incomes? All three of these
options have been suggested by statisticians, and all
of them have their pros and cons. Perhaps the safest
method is to delete any questionnaires with missing
data, so that you don't have to guess at the missing
values, but this could mean throwing away a lot of
33. Public companies must make their financial data
available to the public. As an example, visit http://
finance.yahoo.com/q/is?sKFTIncomeStatement&
annual. This site shows several annual income state-
ments for Kraft Foods. Use a Web query to import this
data into Excel. Then modify the URL to retrieve the
income statements for a few other public companies.
Level B
34. Continuing the previous problem, financial data for
public companies are available at a number of Web sites.
Try importing the data from
http://finapps.forbes.com/finapps/jsp/finance/compinfo/IncomeStatement.jsp?tkr=kft&period=qtr
into Excel. This should bring in
quarterly financial reports for Kraft, but does it? (We had
no luck, but maybe you will.) Is there any other way to
get the data into Excel (without entering it manually)?
35. The site http://tonto.eia.doe.gov/oog/info/gdu/
gaspump.html contains a table of monthly prices of
gasoline. Try importing this into Excel with a Web
query. (We had no luck, but maybe you will.) Is there
any other way to get the data into Excel (without
entering it manually)?
36. The site http://graphicsweb.wsj.com/php/
CEOPAY09.html contains a table of CEO
compensation from a Wall Street Journal survey.
If you try to import this data into Excel with a
Web query, you will fail because there is no
yellow arrow next to the table. (At least, there
wasn't one when we tried it.) Is there any other
way to get the data into Excel (without entering it
manually)?
37. The site http://cdo.ncdc.noaa.gov/climatenormals/
hcs/HCS_51_seq.txt contains monthly data on heating
degree-days for states and regions in the U.S. Try
importing this data into Excel with a Web query. What
form do you get? Explain how you can then parse this
into individual columns.
FUNDAMENTAL INSIGHT
Cleansing Data
Textbook data tends to be clean (no missing or bad
data), but this is not the case with data in the real
world. Unfortunately, cleansing data is difficult and
time-consuming, but it must be done to avoid the
"garbage in, garbage out" effect. Fortunately, there are
a number of tools, both in Excel and in database
software, for cleansing data. Although data cleansing is
still time-consuming detective work, the tools make
the job easier.
Objective To find and fix errors in this company's data set.
Solution
We purposely constructed this data set to have a number of problems, all of which you
might encounter in real data sets. We begin with the Social Security Number (SSN).
Presumably, all 1500 customers are distinct people, so all 1500 SSNs should be different.
How can you tell if they are? One simple way is as follows.
1 Sort on the SSN column.
2 Once the SSNs are sorted, enter the formula =IF(B3=B2,1,0) in cell J3 and copy this
formula down column J. This formula checks whether two adjacent SSNs are equal.
3 Enter the formula =SUM(J3:J1501) in cell J2 to see if there are any duplicate SSNs.
(See Figure 17.37.) As you can see, there are two pairs of duplicate SSNs.
4 To find the duplicates, highlight the range from cell J3 down and select Find from the
Find & Select dropdown menu on the Home ribbon, with the resulting dialog box filled in
as shown in Figure 17.38. In particular, make sure the bottom box has Values selected.
5 Click on the Find Next button twice to find the offenders. Customers 369 and 618
each have SSN 283-42-4994, and customers 159 and 464 each have SSN 680-00-1375.
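The sort-and-compare procedure in steps 1 through 5 translates directly into a few lines of Python. The sketch below is an illustration of the same logic, not a replacement for the Excel steps. It uses only six of the 1500 records: the four offending customers named above plus customers 1 and 2 from Figure 17.36.

```python
# (customer number, SSN) pairs taken from the example; a tiny subset of the file.
records = [
    (369, "283-42-4994"),
    (618, "283-42-4994"),
    (159, "680-00-1375"),
    (464, "680-00-1375"),
    (1, "539-84-9599"),
    (2, "444-05-4079"),
]

# Step 1: sort on the SSN so that equal SSNs become adjacent.
by_ssn = sorted(records, key=lambda record: record[1])

# Step 2: flag a row with 1 if its SSN equals the SSN in the row above it,
# which is what the IF formula in column J does. The first row gets 0.
flags = [0] + [
    1 if by_ssn[i][1] == by_ssn[i - 1][1] else 0 for i in range(1, len(by_ssn))
]

# Step 3: the sum of the flags counts the duplicate pairs.
duplicate_count = sum(flags)

# Steps 4 and 5: list the offending customers for each flagged row.
duplicates = [
    (by_ssn[i - 1][0], by_ssn[i][0], by_ssn[i][1])
    for i, flag in enumerate(flags)
    if flag == 1
]

print(duplicate_count)
print(duplicates)
```

As in the worksheet, sorting is what makes the adjacent comparison sufficient: without it, equal SSNs in distant rows would go unnoticed.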
EXAMPLE 17.6 CUSTOMER DATA WITH ERRORS
The file Data Cleansing.xlsx has data on 1500 customers of a particular company. A
portion of these data appears in Figure 17.36, where many of the rows have been
hidden. How much of this data set is usable? How much needs to be cleansed?

Row   A         B            C          D    E       F             G       H          I
1     Customer  SSN          Birthdate  Age  Region  CredCardUser  Income  Purchases  AmountSpent
2     1         539-84-9599  10/26/44   62   East    0             62900   4          2080
3     2         444-05-4079  01/01/32   67   West    1             23300   0          0
4     3         418-18-5649  08/17/73   25   East    1             48700   8          3990
5     4         065-63-3311  08/02/47   51   West    1             137600  2          920
6     5         059-58-9566  10/03/48   50   East    0             101400  2          1000
7     6         443-13-8685  03/24/60   39   East    0             139700  1          550
8     7         638-89-7231  12/02/43   55   South   1             50900   3          1400
9     8         202-94-6453  11/08/74   24   South   1             50500   0          0
10    9         266-29-0308  09/28/67   31   North   0             151400  2          910
11    10        943-85-8301  07/05/65   33   West    0             88300   2          1080
12    11        047-07-5332  11/13/64   34   North   0             120300  3          1390
(rows 13 through 1495 are hidden)
1496  1495      632-29-6841  02/06/45   54   West    1             89700   2          1000
1497  1496      347-70-0762  09/28/65   33   West    0             71800   2          970
1498  1497      638-19-2849  07/31/30   68   South   0             121100  5          2540
1499  1498      670-57-4549  07/21/54   44   North   1             64000   4          2160
1500  1499      166-84-2698  10/30/66   32   South   0             91000   6          2910
1501  1500      366-03-5021  09/23/34   64   South   0             121400  1          530
Figure 17.36 Data Set with Bad Data
potentially useful data. The point is that some subjectivity and common sense must be used
when cleansing data sets.
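Two of the options for handling missing incomes, deleting the incomplete questionnaires or replacing blanks with the average of the reported incomes, can be compared in a short Python sketch. The income list below is illustrative only (None marks a blank response), and the sketch makes no claim about which option is preferable for a given analysis.

```python
from statistics import mean

# Illustrative responses to the income question; None marks a blank box.
incomes = [62900, None, 48700, 137600, None, 101400]

# Option 1: delete the questionnaires with a missing income.
complete_cases = [income for income in incomes if income is not None]

# Option 2: replace each blank with the average of the reported incomes.
fill_value = mean(complete_cases)
imputed = [fill_value if income is None else income for income in incomes]

print(len(complete_cases), "of", len(incomes), "questionnaires kept under deletion")
print("average reported income:", fill_value)
print("incomes after mean replacement:", imputed)
```

Deletion shrinks the data set from six questionnaires to four, whereas mean replacement keeps all six but leaves the average unchanged and understates the variability, which is one reason neither option is clearly best.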
At this point, the company should check the SSNs of these four customers, which are
hopefully available from another source, and enter them correctly here. (You can then
delete column J and sort on column A to bring the data set back to its original form.)
The Birthdate and Age columns present two interesting problems. When the birthdates
were entered, they were entered in exactly the form shown (10/26/44, for example).
Then the age was calculated by a somewhat complex formula, just as you would calculate
your own age (see footnote 11). Are there any problems? First, sort on Birthdate. You
will see that the first 18 customers all have birthdate 05/17/27, which is quite a
coincidence! (See Figure 17.39.) As you may know, Excel's dates are stored internally as
integers (serial numbers that count days, with January 1, 1900 as day 1), which you can
see by formatting dates as numbers. So highlight these 18 birthdates and format them
with the Number option (and zero decimals) to see which number they correspond to. It
turns out to be 9999, the code often used for a missing value. Therefore, it is likely
that these 18 customers were not born on 05/17/27 after all. Their birthdates were
probably missing and simply entered as 9999, which were then formatted as dates. If
birthdate is important for further analysis, these 18 customers should probably be
deleted from the data set or their birthdates should be changed to blanks (if the true
values cannot be found).
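To see why the missing-value code 9999 shows up as 05/17/27, the serial-number arithmetic can be reproduced outside Excel. The following is a minimal sketch, not Excel's own implementation; the name excel_serial_to_date and the choice of epoch are assumptions made for illustration.

```python
from datetime import date, timedelta

# Excel's 1900 date system treats January 1, 1900 as day 1, but it also counts a
# nonexistent February 29, 1900. Using December 30, 1899 as day 0 absorbs that
# extra day, so this conversion is valid for serial numbers from 61 onward.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert an Excel 1900-system serial number to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)

missing_code_as_date = excel_serial_to_date(9999)
print(missing_code_as_date.strftime("%m/%d/%y"))
```

The printed value matches the suspicious birthdate shared by the 18 customers.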
A B C D E F G H I J
Customer SSN Birthdate Age Region CredCardUser Income Purchases AmountSpent
681 001-05-3748 03/24/36 63 North 0 159700 1 530 2
685 001-43-2336 08/21/63 35 North 0 149300 4 1750 0
62 001-80-6937 12/27/54 44 West 1 44000 4 2020 0
787 002-23-4874 01/31/76 23 North 0 153000 3 1330 0
328 004-10-8303 10/19/76 22 West 1 49800 4 1940 0
870 004-39-9621 10/13/57 41 South 0 138900 2 1010 0
156 004-59-9799 06/12/38 60 North 0 79700 2 980 0
1481 005-06-4020 06/16/52 46 South 1 42700 6 2890 0
Figure 17.37 Checking for Duplicate SSNs
Figure 17.38 Dialog Box for Locating Duplicates
11. In case you are interested in some of Excel's date functions, we left the formula for age in cell D2. (We replaced
this formula by its values in the rest of column D; otherwise, Excel takes quite a while to recalculate it 1500
times.) This formula uses Excel's TODAY, YEAR, MONTH, and DAY functions. Check online help to learn
more about these functions.
It gets even more interesting if you sort on the Age variable. You will see that the first
12 customers after sorting have negative ages. (See Figure 17.40.) You have just run into a
Y2K (year 2000) problem. These 12 customers were all born before 1930. Excel guesses
that any two-digit year from 00 to 29 corresponds to the 21st century, whereas those from
30 to 99 correspond to the 20th century (see footnote 12). This guess was obviously a bad
one for these
A B C D E F G H I
Customer SSN Birthdate Age Region CredCardUser Income Purchases AmountSpent
64 205-84-3572 05/17/27 71 East 0 50500 1 490
429 279-23-7773 05/17/27 71 South 0 120300 4 2100
463 619-94-0553 05/17/27 71 East 0 62300 2 930
466 365-18-7407 05/17/27 71 East 0 155400 4 1900
486 364-94-9180 05/17/27 71 West 0 116500 2 1040
494 085-32-5438 05/17/27 71 East 0 103700 1 480
607 626-04-1182 05/17/27 71 South 1 75900 3 1540
645 086-39-4715 05/17/27 71 North 0 155300 5 2480
661 212-01-7062 05/17/27 71 West 0 147900 5 2450
730 142-06-2339 05/17/27 71 West 1 38200 1 510
754 891-12-9133 05/17/27 71 North 0 77300 4 1980
782 183-25-0406 05/17/27 71 West 0 51600 0 0
813 338-58-7652 05/17/27 71 East 1 47500 2 1020
1045 715-28-2884 05/17/27 71 South 0 82400 4 1850
1068 110-67-7322 05/17/27 71 North 0 138500 3 1400
1131 602-63-2343 05/17/27 71 North 1 67800 3 1520
1179 183-40-5102 05/17/27 71 East 0 44800 4 1940
1329 678-19-0332 05/17/27 71 West 0 83900 5 2710
174 240-78-9827 01/09/30 69 East 0 29900 2 960
Figure 17.39 Suspicious Duplicate Birthdates
12. To make matters even worse, a different rule was used in earlier versions of MS Office. There is no guarantee
that Microsoft will continue to use this same rule in future editions of Office. However, if you enter four-digit
years from now on, as you should, it won't make any difference.
A B C D E F G H I
Customer SSN Birthdate Age Region CredCardUser Income Purchases AmountSpent
148 237-88-3817 08/11/29 -31 South 0 63800 8 3960
324 133-99-5496 05/13/28 -30 North 0 142500 2 1000
426 968-16-0774 09/29/28 -30 North 0 68400 2 1100
440 618-84-1169 10/19/28 -30 West 1 113600 1 470
1195 806-70-0226 10/14/28 -30 West 0 40600 4 1960
1310 380-84-2860 10/17/28 -30 West 0 91800 2 980
589 776-44-8345 04/16/27 -29 West 1 59300 2 1030
824 376-25-7809 11/02/27 -29 North 1 9999 2 1070
922 329-51-3208 03/21/28 -29 East 1 35400 6 3000
229 964-27-4755 01/29/27 -28 East 0 26700 1 450
1089 808-29-7482 02/28/27 -28 South 0 90000 5 2580
1037 594-47-1955 08/10/25 -27 East 1 128300 3 1510
23 943-09-9693 12/08/76 22 North 1 150500 0 0
Figure 17.40 Negative Ages: A Y2K Problem
12 customers, and you should change their birthdates to the 20th century. An easy way to
do this is to highlight these 12 birthdates, select Replace from the Find & Select dropdown
list, fill out the resulting dialog box as shown in Figure 17.41, and click on the Replace All
button. This replaces any year that begins with 202, as in 2028, with a year that begins with
192. (Always be careful with the Replace All option. For example, if you enter /20 and /19
in the Find what: and Replace with: boxes, you will not only replace the years, but the
20th day of any month will also be replaced by the 19th day.) If you copy the formula for
Age that was originally in cell D2 to all of column D, the ages should recalculate automat-
ically as positive numbers.
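The century correction can also be expressed as a rule rather than a text replacement. The sketch below uses a different technique from the Replace dialog: it assumes that no customer can have a birthdate in the future and moves any such date back by 100 years. The function name fix_century and the reference date are hypothetical.

```python
from datetime import date

def fix_century(birthdate, today):
    """Move a birthdate that lies after `today` back by 100 years."""
    if birthdate <= today:
        return birthdate
    try:
        return birthdate.replace(year=birthdate.year - 100)
    except ValueError:
        # February 29 with no leap day 100 years earlier: fall back to February 28.
        return birthdate.replace(year=birthdate.year - 100, day=28)

today = date(2010, 1, 1)  # hypothetical reference date
fixed = fix_century(date(2028, 5, 13), today)
unchanged = fix_century(date(1976, 12, 8), today)
print(fixed, unchanged)
```

Unlike a blanket text replacement, this rule leaves days and months untouched, so there is no risk of changing the 20th day of a month by accident.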
Figure 17.41 Dialog Box for Correcting the Y2K Problem
The Region variable presents a problem that can be very hard to detect, because you
usually are not looking for it. There are four regions: North, South, East, and West. If
you sort on Region and scroll down, you will find a few East values, a few North values,
a few South values, and a few West values, and then the East values start again. Why
aren't the East values all together? If you look closely, you will see that a few of the
labels in these cells (those at the top after sorting) begin with a space. The person who
entered them inadvertently entered a space before the name. Does this matter? It
certainly can. Suppose you create a pivot table, for example, with Region in the row
area. You will get eight row categories, not four. (An example appears in Figure 17.42.)
Therefore, you should delete the extra spaces. The most straightforward way is to use
Replace from the Find & Select dropdown menu, replacing each label that begins with a
space by the same label without it. (Excel also has a handy TRIM function that removes
any leading or trailing spaces from a label.)
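The effect of a stray leading space on category counts is easy to demonstrate. The following sketch uses a small made-up list of region labels (not the actual column) and Python's str.strip, which, like TRIM, removes leading and trailing spaces.

```python
from collections import Counter

# Hypothetical sample: each region appears with and without a leading space.
regions = [" East", "East", "North", " North", "South", " South", "West", " West", "East"]

raw_counts = Counter(regions)                        # labels exactly as entered
clean_counts = Counter(r.strip() for r in regions)   # labels with outer spaces removed

print(len(raw_counts), "categories before cleaning")
print(len(clean_counts), "categories after cleaning")
```

Before cleaning, each region splits into two categories, which is the same symptom the pivot table shows.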
Figure 17.42 Pivot Table with Too Many Categories
A slightly different problem occurs in the CredCardUser column, where 1 corresponds to
credit card users and 0 corresponds to nonusers. A typical use of these numbers might be
to find the proportion of credit card users, which you can find by entering the formula
=AVERAGE(F2:F1501) in some blank cell. This should give the proportion of 1s, but
instead it gives an error (#DIV/0!). What is wrong? A clue is that the numbers in column F
are left-justified, whereas numbers in Excel are usually right-justified. Here is what might
have happened. Data on credit card users and nonusers might initially have been entered as
the labels Yes and No. Then to convert them to 1 and 0, someone might have entered the
formula =IF(F4="Yes","1","0"). The double quotes around 1 and 0 cause them to be
interpreted as text, not numbers, and no arithmetic can be done on them. (In addition, text
is typically left-justified, the telltale sign seen here.) Fortunately, Excel has a VALUE
function that converts text entries that look like numbers to numbers. So you should form a
new column that uses this VALUE function on the entries in column F to convert them to
numbers. (Specifically, you can create these VALUE formulas in a new column, then do a Copy
and Paste Special as Values to replace the formulas by their values, and finally cut and
paste these values over the original text in column F.)
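The same text-versus-number distinction exists in most programming languages. As a rough analogue of the VALUE step, the sketch below converts text entries to integers before averaging; the five sample values are the CredCardUser entries of the first five customers in Figure 17.36, stored here as text on purpose.

```python
from statistics import mean

# Text entries that merely look like numbers, as in column F.
cred_card_user = ["0", "1", "1", "1", "0"]

# Analogue of Excel's VALUE function: convert each text entry to a number.
numeric = [int(value) for value in cred_card_user]

proportion = mean(numeric)  # proportion of credit card users in the sample
print(proportion)
```

Averaging is only meaningful after the conversion, just as AVERAGE only works once column F holds true numbers.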
Next consider the Income column. If you sort on it, you will see that most incomes are
from $20,000 to $160,000. However, there are a few at the top that are much smaller, and
there are a few 9999s. (See Figure 17.43.) By this time, you can guess that the 9999s cor-
respond to missing values, so unless these true values can be found, these customers
should probably be deleted if Income is crucial to the analysis (or their incomes should be
changed to blanks). The small numbers at the top take some educated guesswork. Because
they range from 22 to 151, a reasonable guess (and hopefully one that can be confirmed) is
that the person who entered these incomes thought of them as thousands and simply
omitted the trailing three zeroes. If this is indeed correct, you can fix them by multiplying
each by 1000. (There is an easy way to do this. Enter the multiple 1000 in some blank cell,
and press Ctrl-c to copy it. Next, highlight the range G2:G12, click on the Paste dropdown
menu, select Paste Special, and check the Multiply option. This trick has become one of
our favorites.)
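Both income repairs can be summarized as one small rule. This is a sketch under stated assumptions: the cutoff of 1000 for "probably entered in thousands" is a guess that should be confirmed against the source of the data, and the names MISSING_CODE and clean_income are hypothetical.

```python
MISSING_CODE = 9999  # value used in this data set to mark a missing entry

def clean_income(income):
    """Return a repaired income, or None when the entry is the missing-value code."""
    if income == MISSING_CODE:
        return None
    if income < 1000:  # assumed cutoff: entry was probably typed in thousands
        return income * 1000
    return income

incomes = [22, 151, 9999, 62900]  # sample values from Figures 17.36 and 17.43
cleaned = [clean_income(x) for x in incomes]
print(cleaned)
```

Note that the missing-value test must come first, because 9999 would otherwise be treated as a legitimate income.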
A B C D E F G H I
Customer SSN Birthdate Age Region CredCardUser Income Purchases AmountSpent
439 390-77-9781 06/03/70 37 West 0 22 8 4160
593 744-30-0499 05/04/60 47 East 0 25 5 2460
1343 435-02-2521 08/24/42 65 West 1 43 5 2600
925 820-65-4438 11/12/32 74 North 0 55 6 2980
1144 211-02-9333 08/13/34 73 North 0 71 9999 9999
460 756-41-9393 05/14/71 36 East 0 81 3 1500
407 241-86-3823 07/03/59 48 East 1 88 4 2000
833 908-76-1846 09/17/60 47 West 0 104 4 1970
233 924-59-1581 05/12/31 76 South 0 138 6 2950
51 669-39-4544 10/05/33 74 West 0 149 2 1010
816 884-27-5089 03/05/62 45 North 1 151 2 900
47 601-10-4503 12/19/48 58 East 1 9999 2 1020
270 985-78-7861 08/17/40 67 South 0 9999 2 940
447 856-77-6560 01/06/40 67 South 1 9999 0 0
518 378-83-7998 11/02/74 32 West 1 9999 2 940
527 906-06-0341 03/26/52 55 South 0 9999 3 1590
Figure 17.43 Suspicious Incomes
Finally, consider the Purchases (number of separate purchases by a customer) and
AmountSpent (total spent on all purchases) columns. First, sort on Purchases. You will see
the familiar 9999s at the bottom. In fact, each 9999 for Purchases has a corresponding
9999 for AmountSpent. This makes sense. If the number of purchases is unknown, the
total amount spent is probably also unknown. You can effectively delete these 9999 rows
by inserting a blank row right above them. Excel then automatically senses the boundary of
the data. Essentially, a blank row or column imposes a separation from the active data.
(See Figure 17.44.)
A B C D E F G H I
1427 182-48-9138 05/18/40 67 East 0 105000 9 4450
1144 211-02-9333 08/13/34 73 North 0 71 9999 9999
287 133-53-5943 09/22/35 72 North 1 20000 9999 9999
1298 552-06-0509 10/12/37 70 North 0 23700 9999 9999
375 867-63-6238 09/17/71 36 West 0 29900 9999 9999
250 586-87-0627 06/24/52 55 East 1 53300 9999 9999
14 614-59-6703 08/01/72 35 South 1 54400 9999 9999
1106 102-74-2447 03/14/30 77 West 0 59300 9999 9999
1121 637-23-3846 06/14/54 53 South 0 64000 9999 9999
153 048-55-8930 09/05/34 73 West 1 64400 9999 9999
980 967-97-4228 07/04/63 44 South 1 76800 9999 9999
1061 377-29-0406 10/08/51 56 West 1 93000 9999 9999
858 819-34-4450 05/26/59 48 South 1 101300 9999 9999
432 572-79-9529 01/21/67 40 West 1 104500 9999 9999
1438 452-69-6883 01/16/74 33 South 0 116400 9999 9999
1125 394-20-9464 10/20/75 31 North 1 129400 9999 9999
469 797-55-3419 09/16/61 46 North 1 132800 9999 9999
443 087-21-2053 07/02/52 55 West 0 141200 9999 9999
317 865-85-3875 12/19/31 75 South 0 149900 9999 9999
Figure 17.44 Separating Rows with Missing Data from the Rest
Now we examine the remaining data for these two variables. Presumably, there is a rela-
tionship between these variables, where the amount spent increases with the number of
purchases. You can check this with a scatterplot of the (nonmissing) data, which is shown
in Figure 17.45. There is a clear upward trend for most of the points, but there are some
suspicious outliers at the bottom of the plot. Again, you might take an educated guess.
Perhaps the average spent per purchase, rather than the total amount spent, was entered for
a few of the customers. This would explain the abnormally small values. (It would also
explain why these outliers are all at about the same height in the plot.) If you can locate
these outliers in the data set, you can multiply each by the corresponding number of pur-
chases (if your educated guess is correct). How do you locate them in the data set? First,
sort on AmountSpent, then sort on Purchases. This will arrange the amounts spent in
increasing order for each value of Purchases. Then, using the scatterplot as a guide, scroll
through each value of Purchases (starting with 2) and locate the abnormally low values of
AmountSpent (which are all together). For example, Figure 17.46 indicates the suspicious
values for three purchases. This procedure is a bit tedious, but it is better than working with
invalid data.
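The manual search for low outliers can be approximated by comparing each customer's amount per purchase with a typical value. The sketch below is one possible heuristic, not the procedure used in the text: the function flag_low_totals and the 0.5 fraction are assumptions, and the sample rows come from Figures 17.36 and 17.46.

```python
from statistics import median

def flag_low_totals(rows, fraction=0.5):
    """Flag customers whose amount per purchase is far below the typical value.

    rows holds (customer, purchases, amount_spent) tuples. Customers with fewer
    than two purchases are never flagged, because for them the total and the
    per-purchase average cannot be told apart.
    """
    usable = [(c, p, a) for c, p, a in rows if p > 0]
    typical = median(a / p for _, p, a in usable)
    return [c for c, p, a in usable if p >= 2 and a / p < fraction * typical]

rows = [
    (1455, 2, 1170), (777, 2, 1180),
    (259, 3, 450), (121, 3, 450),
    (921, 3, 1260), (294, 3, 1300), (568, 3, 1310),
    (1, 4, 2080), (3, 8, 3990),
]
suspects = flag_low_totals(rows)
print(suspects)
```

A flagged row is only a candidate for correction. Whether to multiply it by the number of purchases still depends on confirming the educated guess.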
Again, cleansing data typically involves careful detective work and some common sense.
The bad news is that it is tedious yet often necessary. The good news is that you can use the
powerful Excel tools we have discussed to search for suspicious data values and then fix
them.
[Scatterplot: AmountSpent (vertical axis, 0 to 5000) versus Purchases (horizontal axis, 0 to 10)]
A B C D E F G H I
1455 169-31-5478 06/19/45 62 North 1 144600 2 1170
777 820-27-6346 07/04/36 71 West 0 155000 2 1180
259 731-52-6832 02/05/51 56 East 1 41700 3 450
121 345-16-5545 07/08/59 48 West 1 112700 3 450
109 280-07-3023 08/04/43 64 West 0 24300 3 460
1469 719-98-9028 03/15/69 38 North 1 91300 3 470
1331 745-63-6259 07/22/58 49 South 0 63700 3 480
1313 041-74-0192 12/04/59 47 East 0 25900 3 510
501 156-39-5201 08/15/38 69 East 0 111000 3 540
936 261-74-3204 10/01/37 70 West 0 65000 3 590
921 601-98-9218 05/06/38 69 South 1 131000 3 1260
294 728-06-3395 07/12/66 41 West 0 159800 3 1300
568 375-92-1009 01/13/59 48 North 1 73600 3 1310
Figure 17.46 Suspicious Values of AmountSpent
Figure 17.45 Scatterplot with Suspicious Outliers

PROBLEMS
Level A
38. The file P17_38.xlsx contains a data set that represents
30 responses from a questionnaire concerning the
president's environmental policies. Each observation
lists the person's age, gender, state of residence,
number of children, annual salary, and opinion of the
president's environmental policies. Check for bad or
suspicious values and change them appropriately.
39. The file P17_39.xlsx contains the following data on
movie stars: the name of each star, the star's gender,
domestic gross (average domestic gross of the star's
last few movies), foreign gross (average foreign gross
of the star's last few movies), and income (current
amount the star asks for a movie). Check for bad or
suspicious values (including names) and change them
appropriately.
40. The file P17_40.xlsx contains data on a bank's
employees. Check for bad or suspicious values and
change them appropriately.
41. The file P17_41.xlsx contains data on 500 randomly
selected households. Check for bad or suspicious
values and change them appropriately.
Level B
42. The file P17_42.xlsx contains data imported into
Excel from Microsoft's famous Northwind database.
There are worksheets for the company's customers,
products, product categories, and transactions. Each
transaction is for a product purchased by a customer,
but if a customer purchases multiple products at the
same time, there are several corresponding rows in the
Transactions table, one for each product purchased.
The ID columns allow you to look up names of
customers, products, and product categories. However,
some of the IDs in the Transactions sheet have
purposely been corrupted. There can be three reasons.
First, an ID in the Transactions sheet might not
correspond to any customer, product, or product
category. Second, because each order is by a single
customer, a given OrderID should correspond to only
one CustomerID. Third, a given product ID should
always correspond to the same product category ID.
Besides the corrupted IDs, there is one other potential
type of error, concerning dates. Shipping dates can be
blank (for orders that haven't yet shipped), but they
shouldn't be before the corresponding order dates.
Find all corrupted IDs and shipping dates in the
Transactions sheet. Highlight all bad data in yellow.
You don't need to change them (because in most cases
there is no way of knowing the correct values).
17.7 CONCLUSION
This chapter has covered some very powerful tools for importing data into Excel. As with
many other features of Excel, the tools we have discussed are fairly easy to use, once you
know that they exist. We believe that once you know that something can be done and have
a general idea of how to do it, you can figure out the details. Indeed, as the software
changes, you will be forced to learn the details on your own through experimenting and
consulting online help. Therefore, as you look back on this chapter, focus more on what
can be done, not the details. It is possible to import data into Excel from text files. It is
possible to create queries in Microsoft Query so that data from database packages can be
imported into Excel. It is even possible to import Web data into Excel by various methods.
Once you realize these possibilities, you will be able to accomplish tasks that many Excel
users have never even tried.
Summary of Key Terms
Term | Explanation | Excel | Pages
Data mining | Exploring large data sets to find interesting trends and patterns | | 17-1
Data warehouse | Type of database used to store large quantities of historical data for data mining | | 17-1
Data mart | Part of a data warehouse for use in a particular group in an organization | | 17-2
Text file (or ASCII file) | Data file that can be read by any text editor, such as Notepad | | 17-3
Query | Instruction to a database to return a subset of the data that satisfies specified conditions | | 17-3
Fixed-width file | Text file where each variable starts in a specified column | | 17-8
Delimited file | Text file where values are separated by a delimiter character (usually tab, comma, or space) | | 17-8
Flat file | A single-table database | | 17-14
Relational database | A database where the data are stored in related tables, linked by primary and foreign keys | | 17-14
Microsoft Query | Software packaged with Office to import database data into Excel | From Microsoft Query, under From Other Sources on the Data ribbon | 17-16
SQL | Structured query language, a concise language used to specify database queries | | 17-28
Web query | A method for importing tables from selected Web pages into Excel | From Web on the Data ribbon | 17-30
Cleansing data | Process of removing errors from a data set | | 17-34
PROBLEMS
Conceptual Questions
C.1. Consider a data set with monthly data on some
variable. The data are arranged so that there is a
column for each month, Jan to Dec, and a row for each
year. Is it possible to create a single time series chart
on month-year values when the data are arranged this
way? If not, what format is required?
C.2. Why is data stored so often in plain vanilla text
files? After all, a text editor like Notepad has virtually
no tools for analysis.
C.3. Suppose you have data such as in the Shirt Orders
database. How could this data be stored in an Excel
file? Would anything be lost? In general, what is the
advantage of storing such data in a database package
such as Access rather than in Excel?
C.4. Continuing the previous question, if there is an
advantage of storing this type of data in Access, why
should you bother importing it into Excel?
C.5. What is the advantage of using SQL for queries?
(Although you might not know SQL at this point,
virtually all database analysts know it and use it often.)
C.6. If you find data of interest on a Web site, discuss the
options for importing it into Excel for analysis. Why
isn't there a single method that always works? Do you
think there ever will be such a method?
C.7. Suppose that you are given a data set in Excel, but
because of the source, you suspect that it has some bad
data. What tools would you typically use, and how
would you use them, to find the bad data? Even if you
find the bad data, why might you not be able to change
the values appropriately?
Level A
43. The file P17_43.txt contains yearly data on the projected
population growth rate for several countries. Import this
data into Excel, change the labels in the first row so that
only the country name remains, and create a time series
chart for your choice of countries. Due to missing data,
delete all years before 1991. Make sure the data are in
chronological order, starting with the earliest date
(1991). Save the results in an Excel (.xlsx) file.
44. Sometimes the data you find on the Web is embedded
in a Word document. For example, the file
P17_44.docx contains an air traffic report from June
2009, including a number of tables. Table 3 of the
report lists data about on-time arrivals by airport and
time of day. Can you import this data into Excel?
45. The Organisation for Economic Co-Operation and
Development (OECD) Web site
http://stats.oecd.org/index.aspx has a wealth of
economic data. For example, under Monthly Economic
Indicators, find Index of Industrial Production. Choose
monthly data for Germany and import it into Excel.
(Note that the left column, Edition, has a red
information button next to it. Click on this button to
learn what each row means.) Save the results as an
Excel (.xlsx) file with a meaningful name. If you like,
experiment with some of the many options at this site
for filtering the data. The tools are pretty amazing.
Problems 46-49 (and 56 in the Level B section) are
based on the School Schedule.mdb file. Open this file
in Access, click on Database Tools and select
Relationships to see the relationships diagram.
Essentially, students at a university take sections of
courses. The sections are in various buildings, and the
courses are in various departments. The
Student_Schedules table is for the many-to-many
relationship between students and sections: one
student can take multiple sections, and a given section
contains multiple students.
46. Create an appropriate query that returns the course
code, course name, course description, category, and
department of all courses in the database, and then
return the data to Excel as a table.
47. Create an appropriate query that returns a list of all
students in all sections. Specifically, it should return
the student's first and last name, the course name, the
section number, and the student's grade in the section.
Return the data to Excel as a table. Then, inside the
table, create a new column that is a concatenation of
first name and last name, and create a pivot table that
allows you to find the average grade for each student.
48. Create an appropriate query and return the results to
Excel as a pivot table. The pivot table should allow
you to choose a building name (from the report filter
area) and see the number of sections that meet in each
room of that building.
49. Create an appropriate query and return the results to
Excel as a pivot table. The pivot table should allow
you to choose a department name (from the report
filter area) and see the number of courses given by that
department in each category. (Note that some
categories aren't given by some departments.)
Problems 50-52 (and 57-58 in the Level B section)
are based on the Northwind.mdb file. (This version
has only the tables. The queries, forms, and reports
you may have seen in other versions of Northwind
have been deleted.) Open this file in Access, click on
Database Tools and select Relationships to see the
relationships diagram. Orders at Northwind are taken
by employees from customers for products supplied
by suppliers, and the orders are shipped by shippers.
The Order Details table is required because one order
can be for several products.
50. Create an appropriate query that lists the company
names of all of Northwind's customers and their
countries. Return the results to Excel as a table. Then
build a pivot table from this table that allows you to
count customers by country.
51. Create an appropriate query that lists the first and last
names of each employee and the person, if any, that
employee reports to. (Hints: This is a case where the
Employees table includes a primary key, EmployeeID,
and a foreign key, Reports to, to itself. Bring back both
in your query and then use a VLOOKUP in the results
to find the first and last name of the employee's
supervisor, if any.)
52. Create an appropriate query that returns the following
information for all products that have not been
discontinued: product name, product category name,
unit price, units in stock, and units on order. The query
should filter out all discontinued products before they
are returned to Excel. Return the results as an Excel
table. Then create another table right next to it with a
similar query, but this time ask only for the product
name and category of all discontinued products. (Note
that the Discontinued field is 1 for discontinued
products, 0 for others.)
Problems 53-54 (and 59-60 in the Level B section) are
based on the Classical CDs.mdb file. Open this file in
Access, click on Database Tools and select Relationships
to see the relationships diagram. This is the attempt by
one of the authors to capture his rather large classical
music collection in a database. Each record in the CDs
table corresponds to one CD album. (Usually it's a
single CD, but some come in sets.) The large
CDs_Works table contains a record for each work on
each CD. There are links to the musicians playing the
work, which could include an orchestra, a chamber
group, a choral group, a conductor, and/or one or more
artists (such as pianists). The Works table has a record
for every work (piece of music) in the collection, with a
link to the composer. If you open one of the tables with
foreign keys in Access, such as the Artists table with a
foreign key to the Instruments table, you will see the
names of the instruments. This is a lookup trick that is
possible in Access. Even though you might see Piano,
only the foreign key (6 for Piano) is stored in the Artists
table. The implication is that if you want a list of artists
and their instruments, as in the next problem, you will
need both the Artists and Instruments tables in the query.
53. Create an appropriate query to find the names of all
artists and the instruments they play. Return the data
into an Excel table. Then create a pivot table from this
table and use it to find a list of all pianists.
54. Create an appropriate query to find the titles of all
works in the collection and the corresponding
composer names. Return the data as a pivot table.
Then use the pivot table to find all of the works by
Beethoven, Ludwig van. (The database stores names
like this one in a single field. It doesn't separate names
into first name and last name fields.)
Level B
55. The file P17_55.txt contains yearly data on the
number and percentage of people without health
insurance by state. Try importing this data into Excel.
What goes wrong with the first row, which should be a
single row of variable names? Can you fix this by
modifying the text file and then importing into Excel?
Or is it easier to fix it in Excel? Do it the way you
think is easier. Then shorten the variable names in
Excel to something like Ohio Uninsured and Ohio
Uninsured Pct. Save the results as an Excel (.xlsx) file.
56. (Based on the School Schedule database) Create an
appropriate query that enables you to determine, for
each student, how many classes the student has each
day of the week. Note that the Class_Sections table
has flags for each day of the week to indicate the days
each section meets. You can ignore the Saturday flag
because no classes meet on Saturday.
57. (Based on the Northwind database) Create an
appropriate query and return it as a pivot table. This
pivot table should allow you to see the number of
orders shipped by each shipper during each month.
For example, you should find that the shipped date
was July 1997 for 13 orders shipped by Speedy
Express. (Note: Microsoft Query recognizes primary
key-foreign key relationships if the primary key and
foreign key have the same name. However, it doesn't
recognize the relationship in this case because the
primary key is ShipperID and the foreign key is
ShipVia. Therefore, you must drag a link between
these two in the top pane of the Microsoft Query
screen.)
58. (Based on the Northwind database) Create an appropri-
ate query that allows you to see the total revenue bro-
ken down by any of product name, product category,
month-year of order date, or customer. Here revenue is
defined as UnitPrice*Quantity*(1-Discount), where
these fields all come from the Order Details table. You
can do this in either of two ways. First, you can return
all of the data as a table, then use an Excel formula to
calculate revenue for each row, and finally use a pivot
table to get the desired results. Second, you can create
a calculated field Revenue in Microsoft Query and
return the results as a pivot table. (The latter is actually
preferable because a pivot table takes up less room in
Excel. Imagine if there were millions of rows in the
Order Details table. They wouldn't fit in Excel if you
tried to bring them in as a table, but this isn't a problem
with a pivot table.)
59. (Based on the Classical CDs database) Create an
appropriate query to find all of the works conducted
by Bernstein, Leonard. (Use a filter for the conductor.)
Return the data as a table to Excel. The table should
include the name of the work, the orchestra, the
title/description of the CD, and the CD's label. (There
is no need to return the name of the conductor. If you
filter correctly, it will always be Bernstein, Leonard.)
60. (Based on the Classical CDs database) Create a query
to find the following fields: Artist from Artists,
Composer from Composers, Conductor from
Conductors, Orchestra from Orchestras, and Title from
Works. Return this to Excel as a pivot table. Then use
the pivot table to find a list of all works by Chopin,
Frederic. The collection has hundreds of works by
Chopin. Why are you seeing only a few of these in the
pivot table? What could you do differently to see all of
them?
61. The Foodmart.mdb file contains many transactions at
a supermarket chain. Open this file in Access, click on
Database Tools and select Relationships to see the
relationships diagram. This type of relationship
diagram is called a star schema. The Facts table
includes a record for each line item purchased by any
customer at any store on any date. The other tables are
called dimensions; facts can be broken down by them.
This type of database is set up perfectly for pivot
tables, where you can break down facts (Revenue or
UnitsSold) by dimensions. In Microsoft Query,
develop a query that uses all of the tables. It should
have the following fields: Revenue and ItemsSold
from Facts; all fields except DateID from Dates; all
fields except ProductID from Products; City, Region,
StateProvince, Country, MaritalStatus, and Gender
from Customers; and City, Region, StateProvince, and
Country from Stores. Note that City, Region,
StateProvince, and Country have the same names in
the Customers and Stores tables, which can be
confusing. Therefore, for each of these, use the Add
Column menu item from the Records menu to select
these fields and give them names like Customer City
and Store City. Return the data to a pivot table in
Excel. Then create an interesting pivot table, where
you break down Revenue or ItemsSold by any of the
various dimensions.
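
The join and the calculated field described in Problems 57 and 58 can also be written out as SQL. The following is only a minimal sketch, not the Microsoft Query procedure itself: it uses Python's built-in sqlite3 module in place of Access, and the tables contain a few made-up rows rather than the actual Northwind data. The names ShipperID, ShipVia, UnitPrice, Quantity, Discount, and Order Details come from the problems; CompanyName, OrderID, OrderDate, ShippedDate, and the two function names are assumptions made for illustration.

```python
import sqlite3

# Tiny in-memory stand-in for Northwind. All rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, CompanyName TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, OrderDate TEXT,
                     ShipVia INTEGER, ShippedDate TEXT);
CREATE TABLE [Order Details] (OrderID INTEGER, UnitPrice REAL,
                              Quantity INTEGER, Discount REAL);
INSERT INTO Shippers VALUES (1, 'Speedy Express'), (2, 'United Package');
INSERT INTO Orders VALUES
    (10, '1997-06-28', 1, '1997-07-03'),
    (11, '1997-07-10', 1, '1997-07-15'),
    (12, '1997-07-30', 2, '1997-08-01');
INSERT INTO [Order Details] VALUES
    (10, 10.0, 2, 0.0), (10, 5.0, 4, 0.25),
    (11, 20.0, 1, 0.5), (12, 8.0, 5, 0.0);
""")


def orders_by_shipper_month(conn):
    """Count orders per shipper and shipped month (the idea behind Problem 57).

    The key names differ (ShipVia vs. ShipperID), so the join condition
    must be stated explicitly.
    """
    return conn.execute("""
        SELECT s.CompanyName,
               strftime('%Y-%m', o.ShippedDate) AS ShipMonth,
               COUNT(*) AS NumOrders
        FROM Orders AS o
        JOIN Shippers AS s ON o.ShipVia = s.ShipperID
        GROUP BY s.CompanyName, ShipMonth
        ORDER BY s.CompanyName, ShipMonth
    """).fetchall()


def revenue_by_order_month(conn):
    """Total the calculated field Revenue by order month (as in Problem 58)."""
    return conn.execute("""
        SELECT strftime('%Y-%m', o.OrderDate) AS OrderMonth,
               SUM(d.UnitPrice * d.Quantity * (1 - d.Discount)) AS Revenue
        FROM [Order Details] AS d
        JOIN Orders AS o ON d.OrderID = o.OrderID
        GROUP BY OrderMonth
        ORDER BY OrderMonth
    """).fetchall()


print(orders_by_shipper_month(conn))
print(revenue_by_order_month(conn))
```

Because the database does the grouping, only one summary row per group is returned, regardless of how many detail rows there are. This is the same reason the pivot table approach in Problem 58 scales better than returning every row to Excel.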
CASE
EduToys, Inc., sells a wide variety of educational toy
products to its customers through its Web site.
Jeannie Dobson, director of information services at
EduToys, recently developed a relational database to
store critical information that the management team
needs to more effectively serve EduToys customers.
The database, which is provided in the file
EduToys.mdb, consists of five related tables:
Company, Customer, Inventory, Orders, and Toys.
The Company table consists of the following
information on each of the 159 companies that
manufacture and supply products to EduToys:
identification number, name, and telephone number.
The Customer table maintains the following data on
each of the 307 customers who purchased at least
one item from EduToys's online store during the first
10 months of operation (January–October 1998):
identification number, last name, first name, age,
gender, street address, city, state, zip code, and
telephone number. The Inventory table consists of
the following information on each of the 201
products that EduToys purchases from its various
suppliers: identification number, name, quantity in
current inventory, quantity on order, and expected
delivery date of order. The Orders table records
the following information for each of the customer
transactions that took place during the first
10 months of 1998: transaction identification number,
date, customer identification number, customer
credit card number, product identification number,
and quantity purchased. Finally, the Toys table
maintains the following data on each of the products
sold by EduToys: product identification number,
company (i.e., supplier) identification number,
product name, type of product, appropriate age
group for product, unit price, and detailed product
description.
As part of your internship with EduToys, you
have been asked by your supervisor to prepare a
report that responds to the following questions. Your
supervisor encourages you to make extensive use of
the database in completing this assignment. Also, she
wants you to retain copies of all Excel spreadsheets
that you prepare to generate the needed
information.
Questions
1. How do EduToys's past customers break down
by age and gender?
2. Which of EduToys's past customers have
spent amounts that fall in the top 20% of all
transactions (as measured in dollars)? Report
the first name, last name, street address, city,
state, and zip code for each of these
customers.
3. Which products have generated sales revenues
(in dollars) that fall in the top 25% of all such
revenue contributions? Report the current
inventory level, quantity on order, and supplier of
each of these best-selling products.
4. How do the given 1998 sales (in dollars) break
down by product type and product age group?
5. What proportion of all given transactions were
conducted through the use of each type of
credit card (including American Express,
Discover, MasterCard, and Visa)?
6. What changes or additions would you
recommend making to the present database?
Provide the reasoning behind each of your
recommendations.
17.1 EDUTOYS, INC.