Table of Contents:
Introduction
Chapter 1: Overview of Data Analytics
Foundations Data Analytics
Getting Started
Mathematics and Analytics
Analysis and Analytics
Communicating Data Insights
Automated Data Services
Chapter 2: The Basics of Data Analytics
Planning a Study
Surveys
Experiments
Gathering Data
Selecting a Useful Sample
Avoiding Bias in a Data Set
Explaining Data
Descriptive analytics
Charts and Graphs
Chapter 3: Measures of Central Tendency
Mean
Median
Mode
Variance
Standard Deviation
Coefficient of Variation
Drawing Conclusions
Chapter 4: Charts and Graphs
Pie Charts
Create a Pie Chart in MS Excel
Bar Graphs
Create a Bar Graph with MS Excel
Customizing the Bar Graph
Time Charts and Line Graphs
Create a Line Graph in MS Excel
Customizing Your Chart
Annual Employee Losses
Introduction
We live in thrilling and innovative times. As business moves to the digital environment, virtually every
action we take produces data. Information is collected from every online interaction. All sorts of devices
gather and store data about who we are, where we are, and what we are doing. Increasingly massive
warehouses of data are now freely available to the public. Skilled analysis of all this data can help
businesses, governments, and organizations make better-informed decisions, respond quickly to
changing needs, and gain deeper insights into our rapidly changing environment. Even attempting to
make good use of all the available data is a challenge. To answer specific questions, a person must
decide what data to collect, which methods to use, and how to interpret the results.
Data analytics is a way to make valuable use of all types of information. Analytics is used to help
categorize data, identify patterns, and predict results. Data use has become so ubiquitous that it has
become necessary for individuals in every profession to learn how to work with data. Those who
become the most proficient at working with data in useful and creative ways will be the most successful
in the new world of business.
Until recently, data analytics was limited to an exclusive culture of data analysts, who characteristically
presented this topic in complicated and often unintelligible terminology. Fortunately, data analytics is
not as complicated as many believe. It simply consists of using analytical methods and processes to
develop and explain specific and useful information from data. The point of data analytics is to enhance
practices and to support better-informed decisions. This can result in: safer practices within an industry,
greater revenues for a business, higher customer satisfaction, or any other object of focus. This eBook
introduces a wide range of ideas and concepts used for deriving useful information from a set of data,
including data analytics techniques and what can be achieved by using them.
Getting Started
This chapter explains the major components of data analytics: gathering, exploring, and
interpreting data. As a data analyst, you will be collecting and sorting large volumes of raw,
unstructured, and partially structured data. The amounts of data that you are likely to be working with
can be too large for a normal database system to process effectively. A data set that is too large, changes
too quickly, or does not conform to the structure of standard database designs requires a special
skillset to manage. Data analytics consists of analyzing, predicting, and visualizing data. When data
analysts gather, query, and interpret data, they conduct a process that is quite similar to data engineering.
Although useful insights can be produced from an individual source of data, the blending of several
sources gives context to the data that is necessary to make more informed decisions. As a data analyst,
you can combine multiple datasets that are maintained in a single database. You can also work with
several different databases maintained within a large data warehouse. Data can also be maintained and
managed within a cloud-based platform specially designed for that purpose. However the data is pooled
and wherever it is stored, the analyst must still issue queries on the data and make commands to retrieve
specific information. This is typically done using a specialized database language called Structured
Query Language (SQL).
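The kind of query an analyst issues can be sketched with Python's built-in sqlite3 module. The table name, columns, and values below are hypothetical examples for illustration, not data from this eBook:

```python
import sqlite3

# Build a small in-memory database for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 95.5), ("North", 60.0)],
)

# A typical analyst request: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 180.0), ('South', 95.5)]
conn.close()
```

The same SELECT / GROUP BY pattern carries over to larger database systems and data warehouses; only the connection details change.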
When using a database software application or conducting an analysis using other programming
languages, like R or Python, you can utilize a variety of digital file formats, such as:
Comma-separated values (CSV) files: Virtually all data-based software applications (including
cloud-based programs) and scripting languages are compatible with the CSV file type.
Programming Scripts: Professional data analysts generally know how to write programming
scripts in order to work with data and visualizations in languages like Python and R.
Common File Extensions: MS Excel files have the .xls or .xlsx extension. Geospatial
applications are saved with their own file formats (e.g., the .mxd extension for ArcGIS and the .qgs
extension for QGIS).
Web Programming Files: Web-based data visualizations often use the Data Driven Documents
JavaScript library (D3.js). D3.js files are saved as .html files.
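Of the formats above, CSV is the most portable. As a minimal sketch, Python's standard csv module can parse CSV data directly; the two-column data below is a hypothetical example:

```python
import csv
import io

# A tiny CSV extract, inlined as a string so the sketch is self-contained.
raw = "item,amount\nCereal,5.50\nMilk,4.10\n"

# DictReader maps each row to the column headings in the first line.
reader = csv.DictReader(io.StringIO(raw))
records = [(row["item"], float(row["amount"])) for row in reader]
print(records)  # [('Cereal', 5.5), ('Milk', 4.1)]
```

In practice the string would be replaced by `open("data.csv")`, and the same approach works in R or any scripting language with a CSV reader.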
Train current personnel. This can be an inexpensive way to provide an organization with
ongoing data analytics. This training can be used to transform certain employees into highly skilled subject-matter experts who are proficient in data analysis.
Train current personnel and also hire professional analysts. This strategy follows the same
process as the first method, but also includes hiring a few data professionals to oversee the
process and personally handle the most challenging problems and tasks.
Hire data professionals. An organization can get its needs met by hiring or contracting with
professional data analysts. This is the most expensive option, because professional data analysts
are in short supply and generally have high salary requirements.
Securing highly-skilled data analysts to meet the needs of an organization can be extremely difficult.
Many businesses and organizations outsource their data analytics jobs to external experts. This happens
in two different ways: they contract with someone to develop a wide-ranging data analytics plan to
serve the entire organization, or they contract with experts to provide individual data analytics
solutions for specific situations and problems that their organization may encounter.
Learning to work with data can also make you a more proficient and sought-after professional. Below is a brief list of benefits that data analytics
provides for various areas:
Benefits for corporations: Cost minimization, higher return on investment (ROI), increased staff productivity, reduction of customer loss, higher customer satisfaction, sales forecasting, pricing-model enhancement, loss detection, and more efficient processes.
Benefits for academia: More efficient resource allocations, improved instructional focus and
student performance, increased student retention, refinement of processes, reliable budget
forecasting, and increased ROI for student recruitment practices.
This chapter provided an introduction to the concept of data analytics. Analytics is a growing field of
science that brings together traditional statistical procedures and computer science in order to ascertain
meaningful insights from huge sets of raw data for the benefit of businesses, organizations,
governments, and society. Data analytics is sometimes confused with Business Intelligence (BI) because
of the common tools they both share, particularly data visualizations, such as traditional charts and
graphs. BI, however, is a discipline designed for business leaders without the advanced training
necessary to engage in data analytics. The following chapter discusses the basic principles of data
analytics.
Planning a Study
Once the research question is established, it is time to design a study to answer that specific question.
This requires figuring out the methods that you will use to extract the necessary data. This section covers
the two main types of studies: descriptive studies and experimental studies.
Surveys
With a descriptive study, data are gathered from people in a way that does not have an impact on them.
The most widely used type of descriptive study is a survey. Surveys are questionnaires that are given to
people who are randomly selected from a target population. Surveys are useful data tools for gathering
information. As with all methods of gathering data, improperly conducted surveys are likely to result in
inaccurate information. Common issues with surveys include inadequately worded questions (which can
be confusing), lack of participant response, and lack of randomization in the selection process. Any of
these problems can invalidate the results of the survey; therefore, surveys must be carefully planned
before they are implemented.
A limitation of the survey method is that surveys can only provide information on relationships that exist
between variables, not on causes and effects. If survey researchers observe, for example, that
people who smoke cigarettes tend to work longer hours per day than those who do not
smoke, they are not in a position to suggest that smoking is the cause of the longer work hours.
Variables that were not part of the research design might cause the relationship, such as the number of
hours participants sleep every night.
Experiments
Experiments involve the application of one or more treatments to subjects in a controlled environment.
The treatments are things that may or may not affect the subject under study. Some studies involve
medical experiments, wherein the subjects are patients who undergo medical treatments. Other
experiments might include students who receive tutoring, or exposure to a particular instructional tool as
the treatment. Businesses engage in experiments that involve sample participants from the consumer
market. These participants may be exposed to a certain type of advertisement and asked how they were
emotionally affected.
Once the treatments are applied, the responses are systematically recorded. For instance, to study the
effect of a drug dosage amount on blood pressure, a group of subjects may be administered 15 mg of a
medicine. A different sample group may be administered 30 mg of the same drug. Typically, a control
group is also involved, where subjects each receive a placebo treatment (i.e., a substance with no
medicinal properties).
Experiments are often designed to take place in a controlled setting, in order to reduce the number of
potential unrelated variables and possible biases that might affect the results. Some possible problems
might include: researchers knowing which participants received particular treatments; a particular
circumstance or condition, not factored into the study, that may impact the results (e.g., other
medications that a participant may be taking), or not including an experimental control group. However,
when experiments are designed correctly, differences in responses found when the groups are compared
allow the researchers to conclude that there is a cause-and-effect relationship. No matter what the study,
it must be designed so that the original questions can be answered in a credible way.
Gathering Data
Once a research plan (whether descriptive or experimental) has been designed, the subjects must be
selected, and data must be gathered. This stage of the research process is essential to generating
meaningful data. The ways in which data are collected vary with the type of study. In experimental
designs, the data should be collected in the most controlled manner possible, in order to reduce the
possibility of generating contaminated results. Some experiments require more stringent procedures
than others. When gathering data on people's perceptions of a new business marketing strategy, or data
concerning the effectiveness of a new teaching strategy, the consequences of inaccurate results are not as
critical as they would be in a medical study. Therefore, in low-stakes experiments, it is sometimes
preferable to use less robust data-gathering procedures in order to save time and money.
Experiments can be even more problematic in terms of gathering data. If you want to test how well
people retain information when exposed to loud music, a variety of factors could affect the outcomes.
The experiment designer should consider whether everyone will listen to the same song, whether
participants will be asked about the amount of sleep they got the night before, whether they have prior
knowledge of the subject matter, how they feel about participating in the experiment, whether they use
drugs or alcohol regularly, and a host of other factors that must be considered in order to control for
outside variables.
Explaining Data
Once data has been collected, it is time to compile it in order to get a view of the entire data set.
Analysts describe data in two basic ways: with images, like graphs and charts, and with figures, called
descriptive analytics. Descriptive analytics are the most commonly-used methods for describing data to
the general population. When used effectively, a chart or graph can easily explain volumes of data in a
single snapshot.
Descriptive analytics
Data can be summarized by using descriptive analytics. Descriptive analytics are numerical
representations of data that highlight the most important features of a dataset. With categorical data,
wherein everything is sorted into groups (e.g., gender, ethnicity, age bracket, price range, etc.), data are
usually summarized by the number of units in each category. This is expressed in terms of frequency or
percentage.
Numerical data consists of literal quantities or totals (e.g., height, weight, amount of money, etc.),
wherein the actual numbers are meaningful. When working with numerical data, more aspects can be
summarized than just the number or percentage within each category. Such elements include measures
of center (i.e., the central point of the data) and measures of variance (i.e., how widely spread or how
tightly clustered the data are around the center). Another consideration is a measure of the relationship
between different variables.
Depending on the particular situation, certain descriptive analytics are more appropriate than others. For
example, if you were to assign the codes 1 for men and 2 for women, it would not
make sense to average those numbers when analyzing the data. Likewise, attempting to use a percentage to explain a
singular amount of time would not be useful.
Another type of data, ordinal data, is somewhat of a combination of the first two types. Ordinal data
appear in categories, but the categories have a hierarchical order, such as rankings from 1 to 10,
or student ranks of freshman through senior. This data can be analyzed the same way as categorical
data. Numerical data procedures can also be used when the categories represent meaningful numbers.
Once the data is collected and described with pictures and numbers, it is time to begin the process of
data analysis. Assuming that the study was planned well, the research question can be properly answered
by applying an appropriate data analysis. As with all previous steps in the process, selecting an
appropriate analytical procedure determines the usefulness of the results.
This chapter discussed the foundations of data analytics. Using mathematical techniques and scientific
procedures to collect, measure, analyze, and draw conclusions from data is what data analytics is all
about. The following chapter discusses the major kinds of data analyses necessary to conduct effective
data analytics. There you will learn the basics of calculating and measuring common
descriptive analytics for central tendency and variation within a set of data, as well as the
analytics necessary to evaluate the relative position of a specific value within that data set.
Mean
The mean, or average, of a set of data is the sum of all the numbers within a group divided by the number
of units in the group. The mean of a group is a representative property of the collective group. Useful
assumptions can be made about an entire set of data by figuring out its mean. The formula for
calculating the mean is below.
Mean = Sum of all the set elements / Number of elements.
For example: (1+2+3+4+5) / 5 = 3
The mean of a data set summarizes all of the data with a single value. An analyst might want to compare
the average price of houses between two different neighborhoods. In order to compare these housing
prices, it would be illogical to compare the price of each individual house to the price of every other
house in the study. The best way to approach this research question would be to find the mean price of
houses in each of the two neighborhoods, and then compare the two means with each other. By doing
this, the analyst will be able to make a valid assumption about which neighborhood has the more expensive
houses.
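The comparison above can be sketched in Python with the standard statistics module. The neighborhood prices here are hypothetical figures (in thousands of dollars), invented purely for illustration:

```python
from statistics import mean

# Hypothetical house prices, in thousands of dollars.
neighborhood_a = [250, 310, 280, 295, 330]
neighborhood_b = [410, 385, 450, 420, 395]

# Compare the two groups by their means, not house by house.
mean_a = mean(neighborhood_a)  # 1465 / 5 = 293
mean_b = mean(neighborhood_b)  # 2060 / 5 = 412
print(mean_a, mean_b)  # 293 412
```

Because 412 > 293, the analyst can reasonably conclude that neighborhood B has the more expensive houses on average.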
Median
The median is the middle number of a data set. For a set of data composed of an odd number of
values, the value in the middle is the median. For a set of data composed of an even number of values,
the average of the two middle numbers is the median. The median is commonly utilized to divide a
collection of data into two separate halves.
In order to find the median of a set of data, write the numbers of the set in order from smallest to largest,
and count the number of units and identify the one or two numbers in the center. This is different from
calculating the mean, because the range of number values is not taken into consideration. Consider this
set of numbers: (1, 2, 3, 4, 20):
Mean: (1+2+3+4+20) / 5 = 6
Median: 3 (the middle value of the ordered set)
The median of a data set is important, because it is not affected by abnormal deviations in the data set.
As we can see in our example, the value "20" disproportionately affects our mean, making it appear as
though half of the values would be below 6 and the other half above 6. The mean, in this case, does not
provide a realistic representation of the data set. If the values represented dollars per week in allowance,
it would appear that the individual receives amounts that are half over and half under $6, when in fact,
the person has only once received more than $4. The median, in this case, provides us with a
more accurate description of the contents of the data set. Bear in mind that this small collection of data
only consists of 5 values, so it is easy to understand with a quick glance. When the data set contains
hundreds of thousands of values, accurate estimations cannot be made with a quick glance.
The most significant feature of this data set is the single outlier that raises the mean. An outlier is an
outstanding deviation from the majority of the data set. For instance, if a set of data contains the values:
10, 20, 30, 40, 1000, the value 1000 is considered an outlier. Outliers can move the value of the mean far
from its logical central location. The mean of the above set is 1100 / 5 = 220 and the median is 30. The
median more accurately represents this data set than does the mean.
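The outlier example can be checked with Python's statistics module, which computes both measures directly:

```python
from statistics import mean, median

# The outlier example: 1000 is far from the rest of the data.
data = [10, 20, 30, 40, 1000]

m = mean(data)     # 1100 / 5 = 220, pulled upward by the outlier
med = median(data) # 30, the middle of the ordered values
print(m, med)      # 220 30
```

The mean (220) sits far above four of the five values, while the median (30) stays at the logical center of the data.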
Mode
In a data set, the mode is the value that occurs most frequently. Mode is a measure of central tendency
like mean and median. The mode also represents a set of data with a single value. For instance, the mode
of the dataset (1,2,3,3,3,4,4,4,4,4,5,5,6,7) is 4, because it appears more than any other value.
If a data set has a normal distribution of values, the mode is equal to the values of the median and the
mean. With data distributions that are skewed (not standard), the mean, median, and mode values may
all be different. In a normal distribution, data is symmetrical about the central value, and the distribution
curve is symmetrical about a vertical axis. Also, in a perfectly normal distribution,
half of the data values are lower than the mean, and the other half are higher.
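The mode of the chapter's example set can be confirmed with the statistics module:

```python
from statistics import mode

# The dataset from this section; 4 appears five times, more than any other value.
data = [1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 7]
mo = mode(data)
print(mo)  # 4
```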
Variance
It is sometimes necessary, and always helpful, to measure the variation from the mean value within a
set of data. As we saw earlier, one or two outliers can result in an inaccurate representation of the data
set. For example, a large variance within family income data for a city may indicate a mostly poor
population with a few wealthy members, rather than a solidly middle-class population earning similar amounts.
Measuring variance adds context to a standard data analysis. Below is the procedure for finding
variance:
------------------------------------------------------------------------------------------------
Step 1
Calculate the mean of the data set.
Example: (1, 2, 3, 4, 5)
Mean: (1+2+3+4+5) / 5 = 3
Step 2
Find the difference between the mean of the data set and each individual value, using absolute
values (no negative numbers).
|3 - 1| = 2, |3 - 2| = 1, |3 - 3| = 0, |3 - 4| = 1, |3 - 5| = 2
Step 3
Square each of those differences.
2 × 2 = 4, 1 × 1 = 1, 0 × 0 = 0, 1 × 1 = 1, 2 × 2 = 4
Step 4
Add all of the squared differences together.
4 + 1 + 0 + 1 + 4 = 10
Step 5
Divide that total by the number of values in the data set minus one.
10 / (5 - 1) = 2.5
Var = 2.5
------------------------------------------------------------------------------------------------
Because the differences are squared (see Step 3), the variance of a dataset is never negative.
Calculating the variance of a large real-world data set by hand would take too much time. The variance
of a dataset with thousands of values can be calculated within seconds (actually, microseconds) using
data software. Perhaps the most important function of the variance value is the fact that it
is used to calculate the standard deviation, which is a critical concept of data analytics.
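The five steps above translate directly into Python; the standard statistics.variance function uses the same (n - 1) divisor, so it gives the same answer:

```python
from statistics import variance

data = [1, 2, 3, 4, 5]

m = sum(data) / len(data)          # Step 1: mean = 3
diffs = [x - m for x in data]      # Step 2: differences from the mean
squared = [d ** 2 for d in diffs]  # Step 3: squaring removes any negative signs
total = sum(squared)               # Step 4: 4 + 1 + 0 + 1 + 4 = 10
var = total / (len(data) - 1)      # Step 5: 10 / 4 = 2.5

print(var, variance(data))  # 2.5 2.5
```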
Standard Deviation
Standard deviation is a single value that represents how widely spread the values in a data set are from
the central value (the mean). The more spread out a data distribution is, the greater its standard deviation.
This value provides a precise measure of how widely dispersed the values are in a dataset, allowing for
more advanced statistical analyses. Standard deviation is derived by calculating the square root of the
variance. Therefore, standard deviation is a highly reliable analytical value that can be used to conduct
sophisticated analytical procedures. Standard deviation is also necessary to perform probability
calculations, making it that much more important to data analytics.
Step 1
Calculate the variance of the data set. This is necessary to find the standard deviation.
In our earlier example the variance was 2.5.
Step 2
Calculate the square root of the variance.
√2.5 ≈ 1.58
Check to verify that 3 out of 5 (60%) of the values in the data set (1, 2, 3, 4, 5) are within one standard
deviation (1.58) of the mean (3). We know what the standard deviation is, but what does it really mean? In order to
determine whether our standard deviation is low (which means that the distribution is uniform and
therefore representative of the average member of the population) or high (which means that the
distribution is not very uniform and, therefore, less representative of the average member), we must
normalize it by calculating the coefficient of variation.
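The standard-deviation calculation, and the check of how many values fall within one standard deviation of the mean (which is 3 for this set), can be sketched as:

```python
from statistics import mean, stdev

data = [1, 2, 3, 4, 5]

m = mean(data)    # 3
sd = stdev(data)  # square root of the variance: sqrt(2.5) ≈ 1.58

# Which values lie within one standard deviation of the mean?
within = [x for x in data if abs(x - m) <= sd]
print(round(sd, 2), within)  # 1.58 [2, 3, 4]
```

Three of the five values (60%) fall inside the one-standard-deviation band around the mean.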
Coefficient of Variation
The coefficient of variation (CV) is the standard deviation divided by the mean. This formula is applied to normalize
the standard deviation so that it can be evaluated. Generally, a CV >= 1 indicates high variation, and a CV
< 1 indicates low variation; the greater the distance from 1 in either direction, the stronger the indication. Let us
consider our example:
CV = Std. Dev. / mean
CV: 1.58 / 3 = 0.53
Because the CV < 1, we can assume that our data set is strongly representative of the average member of
the total population.
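Computing the CV for the running example set (1, 2, 3, 4, 5), whose mean is 3, takes one line in Python:

```python
from statistics import mean, stdev

data = [1, 2, 3, 4, 5]

# CV = standard deviation / mean ≈ 1.58 / 3
cv = stdev(data) / mean(data)
print(round(cv, 2))  # 0.53
```

Because the CV is below 1, the data set is considered to have low variation.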
Example:
Imagine that the population of a particular city has an average monthly income of $5,000. We might
assume, based upon the mean, that the average citizen in this city is doing well financially. As data
analysts, however, we know that before we can make a reasonable assumption, it is necessary to
determine how uniformly the income is distributed among the population by calculating the variation of
the data set. If the standard deviation is high, we may assume that the salaries are unevenly distributed
throughout the population. In that event, we should not assume that the average member makes a
monthly income in the neighborhood of $5,000. If the standard deviation is low, then we may tend to
consider the population generally affluent.
In a normal distribution, about 68% of the values in a data set fall within one standard deviation of
the group mean. About 95% of the values fall within two standard deviations of the mean, and
about 99.7% of all values fall within three standard deviations of the mean. Consider the statement,
"Ninety-five percent of a town's residents are between the ages of 4 and 84 years old." Assuming a
roughly normal distribution, the mean age is the midpoint of that range: (4 + 84) / 2 = 44. Because the
range covers 95% of the population, it spans about two standard deviations on each side of the mean, so
one standard deviation is (84 - 44) / 2 = 20. We can therefore assume that about 68% of the citizens are
between 24 and 64 years old.
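Working backward from a stated 95% range, and assuming a roughly normal distribution of ages, the implied mean and standard deviation can be sketched as:

```python
# "95% of residents are between 4 and 84": under the empirical rule,
# that range spans about two standard deviations on each side of the mean.
low, high = 4, 84

mean_age = (low + high) / 2  # midpoint of the range
sd_age = (high - low) / 4    # the 95% range covers about 4 standard deviations

print(mean_age, sd_age)  # 44.0 20.0
```

This kind of back-of-the-envelope calculation only holds when the distribution is approximately normal.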
Drawing Conclusions
Analysts utilize computers and formulas. However, neither computers nor formulas can detect whether
they are being used to perform useful operations, nor can they determine the meaning or significance of
the results. A common error made in analytics is to overemphasize the significance of the results, or
to apply the results to the general population when there is no logical basis for doing so. For example, suppose a
research team is investigating which types of restaurants airline travelers prefer to frequent. They
interview 100 travelers from the local airport and ask them to rate each restaurant from a provided list.
They produce a top-5 list and conclude that travelers like those 5 restaurants the most. However, they
actually only know which ones those particular travelers like the most; they cannot draw conclusions
about travelers everywhere.
Analytics is much more than just numbers. It is important for analysts to know how to draw sensible
conclusions from their results.
This chapter discussed measures of central tendency and the role they play in data analytics. Analytical
concepts were explained, including standard deviation, variance, relative standing, and other measures
of variation. All data analysis is affected by variation and by how the values within the set of
data are distributed. Normally distributed data values strengthen both the inferences that can be drawn
and the predictions that can be made from statistical procedures conducted on a set of data.
Pie Charts
Pie charts are used for categorical data. They illustrate the percentage of individuals that fall in each
category. The total of all of the pieces of the pie equals 100%. Because the pie chart is visually
straightforward, categories can clearly be compared and contrasted with each other. Budgets are typically
presented with pie charts to show how money is distributed.
In this example, the pie chart will be created to identify the relative percentage of money spent at the
grocery store. The data table includes a column with the list of grocery items and another column with
the amount of money spent on each item. The process is the same whether it is for a small list of
groceries or a large list of corporate transactions.
Items      Amount Spent
Cereal     $5.50
Milk       $4.10
Bananas    $1.25
Yogurt     $0.75
Total      $11.60
Step 2. Highlight the information that you would like to include in your pie chart. You do not have to
include all of the data in your table; however, you must have at least 1 data record. Do this by clicking
and dragging your mouse over the area. Be sure to include the column headings when you do this. In
this example, those would be "Items" and "Amount Spent." This way, you can include the headings in
your chart.
Step 3. Click on the "Insert" menu on the tool bar along the top of the screen. Select "Chart" from the
list of options. Then select the Pie Chart.
Step 4. Choose the type of pie chart you would like to make from the range of options. The pie chart
options consist of a flat chart, a 3D chart, an exploded chart, a pie-of-pie chart, and a bar-of-pie chart;
the last two options break out a section of the chart in more detail.
If you would like to preview each pie chart, click the "Press and Hold to View Sample" button.
Step 5. Press Enter, and review your pie chart. To edit or modify your chart, right-click on it and
select from the extensive range of options.
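The same grocery pie chart can be drawn in Python instead of Excel. This is an illustrative sketch that assumes the matplotlib library is installed; the file name is an arbitrary choice:

```python
import matplotlib
matplotlib.use("Agg")  # render without opening a display window
import matplotlib.pyplot as plt

# The grocery table from this section.
items = ["Cereal", "Milk", "Bananas", "Yogurt"]
amounts = [5.50, 4.10, 1.25, 0.75]

fig, ax = plt.subplots()
ax.pie(amounts, labels=items, autopct="%1.1f%%")  # label each slice with its percentage
ax.set_title("Grocery Spending")
fig.savefig("grocery_pie.png")

# The slice percentages the chart displays:
total = sum(amounts)
shares = [round(100 * a / total, 1) for a in amounts]
print(shares)  # [47.4, 35.3, 10.8, 6.5]
```

Cereal accounts for nearly half of the spending, which the pie makes obvious at a glance.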
Bar Graphs
Bar graphs are another way to summarize categorical data. Like pie charts, bar graphs display data by
category, indicating how many objects are in each group, or the percentage in each category. Analysts
typically use bar graphs to compare and contrast categorical groups by separating the categories and
displaying the resultant bars next to each other.
Make sure that the units on the Y-axis are evenly spaced.
Consider units of measurement on the scale of the bar graph. Smaller scales can make minor
differences appear to be huge.
If the bars represent percentages, as opposed to total numbers, look for the total number of units
being summarized.
Include labels for the data and variable at the head of each column. If you want to graph the number of
military personnel recruited in a month, you would write "Branch" at the head of the first column and
"Recruited" at the head of the second column.
Branch     Recruited
Army       210
Navy       165
Air Force  130
Marines    75
As an option, you could insert a third column containing a sub-data category. The Bar Graph menu
allows you to choose from a standard, clustered, or stacked bar graph. The stacked bar graph
displays an additional value that is related to the variable.
Branch     Recruited  Code
Army       210        77B
Navy       165        50A
Air Force  130        45C
Marines    75         22D
Step 1. Highlight the data that you would like to include in your graph. You can include everything in the data
table or just a selection of the data set. Microsoft Excel will separate the X and Y axes by columns.
Step 2. Click on the "Insert" menu on the tool bar along the top of the screen. Select "Chart" from the
list of options. Then select the Bar Graph. Click on the kind of bar graph you want from the choices
available in the bar menu. Bar graph options include: Cylinder, 2-D, 3-D, Pyramid, or Cone-shaped
bar graphs.
Step 3. From the range of options, select the type of bar graph you would like to make. To select a
standard bar graph, choose "Bar." However, if you would like a vertical graph, then click the
arrow next to "Column."
The image of your graph will quickly appear inside of your Excel sheet.
If you would like to customize your graph, then double click inside of it.
There are a variety of ways to customize the appearance of your graph, including line fill, line style, 3-D
format, shadow, soft edges, and glow. When you are done formatting your bar graph, click OK.
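One of the tips above is to check the total number of units when bars show percentages rather than counts. Using the recruiting table from this section, the percentage each branch contributes can be sketched as:

```python
# Recruiting counts from the table in this section.
branches = {"Army": 210, "Navy": 165, "Air Force": 130, "Marines": 75}

total = sum(branches.values())  # 580 recruits overall
percentages = {b: round(100 * n / total, 1) for b, n in branches.items()}
print(total, percentages)
# 580 {'Army': 36.2, 'Navy': 28.4, 'Air Force': 22.4, 'Marines': 12.9}
```

A bar graph of the percentages would look identical in shape to one of the raw counts, which is why knowing the underlying total (here, 580) matters when reading percentage-based bars.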
Time Charts and Line Graphs
Time charts and line graphs track how a measurement changes over a period of time (e.g., dollars,
gallons, loss of inventory, etc.). For each measure of time, a mark indicates a specific amount. All of the
individual marks are joined along a line to visually highlight changes.
Our example indicates that salaries for city employees increased from 1950 until the early 1980s, began
to fall during the 1980s, and essentially remained the same until the early 2000s, when they slightly
increased.
Inspect the scale along the vertical axis in addition to the horizontal axis. The data could be
presented to appear more or less significant than they really are by adjusting the scale.
Take into account the units used in the chart and be sure they're appropriate for comparison over
time (for example, are dollar amounts adjusted for inflation?).
Measurements over a short period are more precise than over a long time period.
In our example, we will measure the number of employees leaving a job over the course of a year.
Month       Losses
January     3
February    2
March       4
April       1
May         7
June        10
July        12
August      9
September   3
October     8
November    3
December    4
Select Insert from the tab at the top of the Excel window.
From Charts select Line Graph. A blank graph field will be displayed in your spreadsheet.
You will see several options for line graphs. Select the standard line graph option if you have a lot of
data values. For small data sets, select the Line with Markers option. This will emphasize each data
point along the line.
Click on the chart, and an editing menu will open. Click on Select Data, and a window will open that
allows you to select the data that you would like. In the Chart Data Range field, highlight the data you
want to include in your line graph. Make sure to include the column headings.
To add a second data line, enter your data into the spreadsheet, the same as in the previous section. Add
a third column of data next to your other columns. You should now have three columns that contain the
same number of values. Click on your chart, and choose Select Data. When the Select Data Source
window opens, click the Add button under Legend Entries, and you will see a field box labeled Series.
In the Series field, click the cell with the heading name for your second set of data. In the Series values
field, select the cells that contain your new data. Press OK. You will be taken back to the Select Data
Source window. Your new line will appear on the chart.
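The monthly losses a line graph displays can also be summarized numerically. The following Python sketch repeats the example table above and uses only the standard library; the month-over-month differences it computes are the same ups and downs the line on the chart would show.

```python
# Summarize the monthly employee-loss series from the example table.
import statistics

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
losses = [3, 2, 4, 1, 7, 10, 12, 9, 3, 8, 3, 4]

peak_month = months[losses.index(max(losses))]        # month with most losses
average = statistics.mean(losses)                     # typical monthly loss
changes = [b - a for a, b in zip(losses, losses[1:])] # month-over-month change

print(peak_month, average)   # July 5.5
print(changes)
```

A positive number in `changes` corresponds to an upward segment of the line; a negative number corresponds to a downward segment.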
Histograms
Histograms present a picture of data separated into ordered groups. A histogram offers a convenient
way to visualize patterns in a large set of data. A histogram is essentially a bar chart that visually
presents numerical data. Unlike categorical data, such as color of hair, which has no innate order,
numerical categories are displayed in order from smallest to largest. Every number in a histogram fits
into only one group. Although the bars on a histogram are contiguous, they do not overlap. Every bar on
the horizontal axis is marked by the values representing its boundaries. The height of each bar signifies
either frequency (the number of units) in each category or the relative frequency (the percentage of
units) in each category.
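The grouping just described can be sketched in a few lines of Python. The sample scores and bin edges below are invented for illustration; each bin covers the values up to and including its edge, so every number lands in exactly one group, as in a histogram.

```python
# Group sample values into ordered, non-overlapping bins and report
# both frequency (counts) and relative frequency (percentages).
from collections import Counter

scores = [55, 62, 68, 71, 74, 77, 80, 83, 85, 88, 91, 95]
edges = [60, 70, 80, 90, 100]   # each bin is (previous edge, edge]

def bin_label(x):
    """Return the upper edge of the bin that contains x."""
    for e in edges:
        if x <= e:
            return e
    raise ValueError("value above the last bin edge")

freq = Counter(bin_label(x) for x in scores)
total = len(scores)
for e in edges:
    print(f"<= {e}: frequency {freq[e]}, relative {freq[e] / total:.0%}")
```

Printing the frequencies side by side is a crude histogram in itself: the bin with the largest count would be the tallest bar.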
The following chart displays the grades earned by a group of students, ranging from A to F. The
numerical variable grade is sorted into 5 categories. It is clearly evident that the most frequently
occurring grade is B, and the least frequently occurring is F.
A key feature of histograms is that they display the distribution of the data as a shape, which can be
used to make simple inferences. Of course, shapes vary with each different set of data; however, there
are three main shapes that are commonly looked for in a set of data:
1. Symmetric (the left side of the histogram is the same as the right side.)
2. Skewed Right (The left side is high and gets continually lower going right.)
3. Skewed Left (The left side is low and gets continually higher going right.)
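A rough numerical hint for the three shapes above is to compare the mean with the median: a long right tail pulls the mean above the median, and a long left tail pulls it below. The data sets in this Python sketch are made up for illustration, and the rule is only a heuristic, not a formal skewness test.

```python
# Guess a distribution's shape by comparing its mean and median.
import statistics

def skew_hint(data):
    """Return a rough shape label based on mean vs. median."""
    mean, median = statistics.mean(data), statistics.median(data)
    if mean > median:
        return "skewed right"
    if mean < median:
        return "skewed left"
    return "roughly symmetric"

print(skew_hint([1, 2, 3, 4, 5]))      # mean = median = 3
print(skew_hint([1, 1, 2, 2, 3, 10]))  # the 10 drags the mean upward
```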
Variability in the data from a histogram
Histograms also help to illustrate levels of variability within a data set. A histogram that is generally flat
along the top may appear to have low variability; however, a flat top actually indicates a wide range
within the data set. Having the same numbers in each category means that the measures were spread out
widely. A hill in the center shows that the majority of measures are near the central point, with a few
straying away in both directions from the center (which is to be expected). The higher the center point,
the lower the variability in the data. In more advanced analyses, the distances of the outliers to the left
and right of the center take on greater significance.
Variability in a histogram is distinct from variability in a time chart. When values on a time chart change
over a period of time, they move either higher or lower on the chart. More highs and lows along a
time chart indicate greater variability. Conversely, a flat line on a time chart indicates low variability.
Below are considerations for evaluating a histogram:
Inspect the scale being utilized for the frequency (vertical axis). Understand that results can be
made to appear less or more significant by adjusting the size of the scales. For example, if the
weights within a group of people range end to end by 20 pounds, this difference can appear
massive on a gram scale or insignificant on a ton scale.
Examine the units along the vertical axis to see if the graph is using frequencies (numbers) or
relative frequencies (percentages).
Check the size range of the categories for the numerical variables (on the horizontal axis). If
they represent very small measures, the data may appear to have excessive variation. If they are
very large, the chart may conceal significant amounts of variation.
In the Add-Ins dialog box, click the Analysis ToolPak check box, located under Add-Ins
Available. Next, click OK. The Analysis ToolPak add-in will not appear in the dialog box if it has not
been previously installed. If the Analysis ToolPak is not in the dialog box, run MS Excel Setup and add
the ToolPak to your list of installed items. Now that the Analysis ToolPak is installed and enabled, you
are ready to create a histogram.
Creating a Histogram
Step 1: Enter the Data. Enter your data into two adjacent columns, and populate the left column with the
"input data" (the set of values that you will analyze with the Histogram tool). In the right column you
will place your bin numbers (the segments that you use for separating and analyzing your data
values). For example, in order to organize ratings into categories of Good, Better, and Best, you could
make bins for 1, 2, and 3.
Navigate to the Data tab, at the top of the screen, and click Data Analysis in the Analysis group. This
will start up the Analysis ToolPak and open the Data Analysis box.
In the Data Analysis dialog box, scroll down to Histogram, and click OK. This opens the Histogram
dialog box.
Under Histogram, enter the input and the bin ranges from your worksheet. This is done by clicking in
each input box. The input range contains the data that you want to analyze. If the input data is a set of 30
values, and you have copied it into column B (from B1 to B30), then enter your data range as
B1:B30. The bin range consists of the bin numbers. For example, if there are 5 bins at the very top of
column C, then your bin range will be C1:C5.
Under Output Options, click New Workbook. Then, place a check in the Chart Output check box.
Once you click OK, you are finished. Excel will produce a new workbook containing a histogram
table along with your chart.
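The counting rule the Histogram tool applies can be sketched outside of Excel as well: each bin number acts as an upper boundary, and anything above the last bin falls into a final "More" bucket. The ratings below echo the Good/Better/Best example; this Python sketch is an illustration of the rule, not a replacement for the tool.

```python
# Count values into bins the way a histogram table does: each value
# goes into the first bin whose number is greater than or equal to it.
from bisect import bisect_left

ratings = [1, 2, 2, 3, 1, 3, 3, 2, 1, 2]
bins = [1, 2, 3]   # Good, Better, Best

counts = [0] * (len(bins) + 1)   # one extra slot for "More"
for r in ratings:
    counts[bisect_left(bins, r)] += 1

print(dict(zip(bins + ["More"], counts)))
```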
Scatter Plots
Scatter plots are charts that visually represent the relationship between two variables. A scatterplot
consists of an X axis (the horizontal axis), a Y axis (the vertical axis), and a series of dots. Each dot on
the scatterplot represents one observation from a data set. The position of the dot on the scatterplot
represents its X and Y values. The example chart below displays the relationship between iPhone sales
and Galaxy Note sales. When we examine the number of Galaxy Note sales along the X (horizontal)
axis, we see that the more Galaxy Note sales there are, the more iPhone sales there are. The red trend
line illustrates this relationship. If the trend line were horizontal and flat, that would tell us that as
Galaxy Note sales go up, iPhone sales level off. A downward sloping trend line along the X axis would
indicate that as Galaxy Note sales rise, iPhone sales drop. This would be a possible situation within a
small population (e.g., 10 customers) who have to simply choose one phone or the other.
The dots along the trend line represent actual data points. These data points give us specific information
about the units being measured. They also help us to see the variance in the set of data. The closer the
data points are to the trend line, the stronger the relationship between the two variables. The more
spread out they are, the weaker the relationship. A weak relationship, for example, might be observed if
the data were collected from a customer population that had several other types of phones to choose
from, besides these two.
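The trend line and the strength of the relationship can both be put into numbers: a least-squares slope gives the line's direction, and the correlation coefficient r measures how tightly the dots cluster around it. The paired sales figures in this Python sketch are invented for illustration.

```python
# Fit a least-squares trend line and compute the correlation r
# for two paired variables (invented example sales figures).
import math

galaxy = [10, 20, 30, 40, 50]
iphone = [12, 24, 31, 44, 49]

n = len(galaxy)
mx = sum(galaxy) / n
my = sum(iphone) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(galaxy, iphone))
sxx = sum((x - mx) ** 2 for x in galaxy)
syy = sum((y - my) ** 2 for y in iphone)

slope = sxy / sxx                 # direction of the trend line
r = sxy / math.sqrt(sxx * syy)    # strength: near 1 means a tight fit

print(f"slope={slope:.2f}, r={r:.3f}")
```

An r near +1 corresponds to dots hugging an upward trend line; an r near 0 corresponds to the widely scattered, weak-relationship case described above.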
Click Scatter with only Markers. Your chart will appear on your Excel worksheet. If you want the
trend line, right click on the chart, click on Chart Elements, and check the box marked Trend Line.
Choropleth Maps: Choropleth maps are spatial data plotted out along area boundary shapes,
rather than by point, line, or raster coverage. For example, in a map of the U.S., state boundaries
represent the area boundary shapes. Colors may be used within areas to signify some sort of value
for an attribute being examined in each state. Perhaps red areas indicate higher values and blue
areas signify smaller values.
Point Maps: These are composed of spatial data plotted out along specific point locations. Point
maps visually display data in a graphical point form, rather than in shapes, line, or raster surface
formats.
Raster Surface Maps: These maps can be anything from a satellite image map to a surface
coverage with values that have been included from basic spatial data points.
This chapter discussed the purpose and concepts behind common visual methods for displaying data.
Descriptive analytics uses numbers to summarize aspects of a collection of data. They give you
understandable information to help you answer research questions. They can also help you to understand
what is happening in your experiment, so that you can later conduct more in-depth analyses. Visual
representations of data help analysts to present data to the outside world plainly and succinctly.
Large public and private collections of data: Private collections are information sets supplied by
the organization's data collection methods.
Technological tools and skillsets: This includes online analytical data procedures, database
development and management, warehousing of data, and information technology (IT) for
business programs and applications.
The insights that are generated in business intelligence (BI) result from standardized sets of organized
business data. BI solutions are primarily comprised of transactional data that is produced throughout
the course of countless events, such as data created during sales, or records resulting from financial
transfers among bank accounts. Transactional data is naturally produced by business actions occurring
throughout the organization. This data is critical for the variety of insights that can be gathered from it.
BI can be used to extract the following types of business insights:
Customer Information: This data can help managers identify, for example, the areas of their
business that are creating the most customer turnover.
Marketing Data: This data can let businesses know the specific marketing strategies that are
most effective and what exactly makes them so effective.
Operational Data: This data can let businesses know how efficiently different departments are
functioning and the best actions to take in order to fix identified problems.
Employee Data: This data can let businesses know which employees are producing the most, and
which are producing the least.
Because the results of data analytics are often extracted from large datasets, cloud-based data platform
solutions are common in the field. Data that's used in data analytics is often derived from
data-engineered big data solutions, like Hadoop, MapReduce, and Massively Parallel Processing. Data
analysts must be innovative forward-thinkers who often come up with creative solutions in order
to overcome limitations in data collection and interpretation. Many data analysts prefer open-source
solutions. Considering the free cost of open-source software and its robust development architecture, it
is quite popular among analysts. This benefits the organizations that employ these analysts.
Transactional Data: This is the type of organized data used in most BI models. It includes
administration data, customer data, marketing data, organizational data, and employee
productivity data.
Social Data: This includes the unfiltered data generated from emails and social networks, like
Facebook, Twitter, LinkedIn, Pinterest, and Instagram.
Machine data from business operations: This data is used to monitor the organization's
equipment and machines.
Audio, video, image, and PDF file data: These well-established formats are all sources of
unstructured data.
To streamline BI processes, you must make sure that your data is structured for maximum ease of access
and control. You can use multidimensional databases to accomplish your goals. Unlike the popular
relational and flat databases, multidimensional databases sort data into cubes that are organized into
multi-dimensional data arrays. To be able to manipulate your data as rapidly and effortlessly as possible,
you can place your data in multidimensional databases as a cube, rather than organizing your data
among multiple relational databases that may encounter difficulties working with each other. The cubic
data architecture allows for online analytical processing (OLAP). OLAP is a technology with which you
can conveniently access and use all of your data for several different procedures and explorations.
To understand the OLAP model, imagine that you have a cube of market data with three dimensions:
time, location, and department. You could, for example, arrange the data to examine only one rectangle,
in order to view one particular department. You could arrange the data to explore a proportionately
smaller cube, consisting of a specific period of time, location, and department. You could also drill up or
down your data set to view very detailed data or decidedly summarized data. You could also total a
range of numbers along a single dimension in order to sum up the totals for small units of business or
examine sales across an extended period of time within a specific location.
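The slicing and totaling just described can be illustrated with a toy cube in Python: a dictionary keyed by (time, location, department) stands in for the multidimensional array, and fixing or freeing each dimension mimics drilling down and rolling up. All figures and labels below are invented.

```python
# A toy three-dimensional data cube keyed by (time, location, department).
cube = {
    ("Q1", "East", "Sales"): 100, ("Q1", "West", "Sales"): 80,
    ("Q2", "East", "Sales"): 120, ("Q2", "West", "Sales"): 90,
    ("Q1", "East", "Support"): 40, ("Q1", "West", "Support"): 30,
    ("Q2", "East", "Support"): 45, ("Q2", "West", "Support"): 35,
}

def slice_sum(cube, time=None, location=None, department=None):
    """Total the cells that match every dimension given (None = all)."""
    want = (time, location, department)
    return sum(v for k, v in cube.items()
               if all(w is None or w == d for w, d in zip(want, k)))

print(slice_sum(cube, department="Sales"))          # roll up one dimension
print(slice_sum(cube, time="Q1", location="East"))  # drill down to one cell group
```

Leaving a dimension as None sums across it (rolling up); fixing a dimension restricts the view (drilling down), just as with the rectangles and smaller cubes described above.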
OLAP is merely one system for warehousing data. Another data warehouse system that is popular
among BI solutions is called a data mart. This is a data management system used to store specific
elements of data, fitting only one element of business in the organization. The process used for
extracting, transforming, and loading the data into a database or data mart is known as extract,
transform, and load (ETL).
Typically, business analysts are highly trained in BI technology. As a general rule, BI training is
accompanied by traditional IT training and development. Within the business world, data analytics
fulfills the same function as BI, and that is to turn mountains of raw data into useful information that
can help business leaders make informed, strategic business decisions. If you have large sets of
unconnected, possibly incomplete data sources, data analytics can convert all of that into valuable and
useful business insights throughout the entire organization. Business data analysts produce critical data
insights. This is accomplished by identifying patterns and abnormalities in business data.
Data analytics in the business world consists of:
Programming skills: These skills help you to gather, organize, and explore the data, and to share
your findings with stakeholders.
Business knowledge: Having knowledge of the particular business from a functional perspective
will definitely help you to better understand the relevancy and meanings of your findings.
Sources of Data: BI relies on structured data housed in relational databases. Data analysts utilize
both structured and unstructured data, for example, the information spawned by machines or in
social media interactions.
Products: Traditional BI products include reports, data tables, and decision dashboards. Data
analysts, on the other hand, produce outputs such as dashboards, analytics, and advanced data
visualization, but typically not data reports. Data analysts typically relay their findings through
words and data visualizations, but not tables and reports. This is due to the fact that the sources of
data with which they work tend to be more complex than a typical organizational leader would be
able to truly grasp.
Technology: BI relies on relational databases, data warehouses, OLAP, and ETL technologies.
Data analytics utilizes data-engineered systems that use Hadoop, MapReduce, or Massively
Parallel Processing.
Expertise: BI relies heavily on IT and business technology expertise, whereas data analysts rely
on expertise in analytics, statistical methods, computer programming, and business.
Because most business leaders are not trained to perform advanced data analytics themselves, it is
beneficial for them to distinguish the types of decisions that are best suited for business leaders from
those best left to their data analysts. In our rapidly-evolving knowledge-based economy, organizations
seeking to remain competitive must constantly become more efficient in their operations and more
strategic with resources. The key to this is capitalizing on the opportunities provided by skilled analyses
of industrial-level Big Data.
Prior to the recent rise in analytics, businesses and organizations did not have the capacity to analyze a
great deal of data, so a relatively small amount was maintained. In today's data-driven world, anything
and everything may have significance, so there has been an attempt to record and keep virtually any data
that we have the capacity to collect; and we have a great deal of capacity. Beyond the quantity of data
that we are gathering and storing is the quality of the data. That is to say, data has grown beyond basic
facts and figures to encompass media files. Video, audio, and presentations have all become units of data
for possible analysis. A major concern with regards to data analytics is how to store and maintain all of
these rapidly-increasing piles of data. The data science community has begun to rely more heavily upon
the software engineering community, in order to find solutions to our over-abundance of data.
Not all data is necessarily valuable. However, society now has advanced data analytics that allows us to
glean useful and important information from even the smallest bits of data. Such information, when
reconciled with other groups of information, can (and often has) resulted in breakthroughs in modern
science, business, and economics. As we consider our need to increase the role of data analytics in the
ways that
we function as organizations, we should keep in mind that data does not contain all of the answers to our
growth and advancement. Data provides us with the building material with which we can create new
understanding and innovation. The other part of the process is distinctively human. This part includes
creativity, risk taking, and cooperation. It appears as though the less we have of one, the more we need
of the other. The more intellectual rigor and collaboration between various fields of science we have,
the more we seem to benefit from even limited amounts of data. Conversely, the less of those things that
we have, the more data we need in order to learn, grow, and innovate. Perhaps the solution to our
looming
problem with big data is to reduce our need for so much of it.
Conclusion
As we have seen, data analytics is an inclusive and encompassing field of study. What distinguishes data
analytics from traditional areas of data analysis is its orientation toward the business world and its focus
on Big Data. Data analytics exists at the intersection of data science and computer technology. Each of
these sciences is constantly evolving, and each heavily influences the other. Although a career in data
analytics does not require specialized training in computer programming, familiarizing oneself with the
fundamentals of computer science will definitely benefit a data analyst. This introductory book has
provided you with the necessary understanding and skills to move on to advanced principles, techniques,
and procedures in data analytics.
Advanced data analytics builds upon the fundamentals that are covered in this book. Even the most
sophisticated studies begin with the basic research design principles that we discussed, measures of
central tendency, descriptive analytics, basic charts and graphs, and analysis of variance. The
differences lie
in additional procedures that are conducted in order to further evaluate the quality of data and reliability
of the results. The majority of data analytics is accomplished utilizing the fundamental principles that
you have just learned.