
Material for Introduction to

Chromebook Data Science

Jeffrey Leek
This book is for sale at http://leanpub.com/universities/courses/jhu/cbds-intro

This version was published on 2018-12-11

Copyright © Johns Hopkins University 2018. Creative Commons Attribution 4.0 International
License.
Contents

Welcome to Chromebook Data Science
Program Philosophy
Why Automated Videos
The Data Science Process
How To Learn
Finding Help
Account Setup
Google Account Setup
Other Accounts Setup
Your first data science project
Google Sheets
RStudio Cloud
Google Docs
Google Slides
DataCamp

Welcome to Chromebook Data Science
Hello and welcome! This is the first course in the Chromebook Data Science program. The goal of
Chromebook Data Science is to help anyone with an internet connection and a computer learn to
do data science. The program will start with the very basics of using a computer on the internet
and work all the way up to doing data science and data analysis. We hope that by building this
program we can help people get into the exciting tech world in one of the fastest growing¹ and most
satisfying² jobs in the United States. There are only going to be more and more jobs asking for data
science skills³ in the future. We believe that by making this career accessible to anyone we can have
a positive impact on the world.

Course Details
Before we jump into the content, we just wanted to orient you to how this course and all the courses
in this program will be laid out:

• Courses - There are multiple courses in the Chromebook Data Science program. The first one
is “Introduction to Chromebook Data Science”, which is the course you’re in right now.
• Lessons - Each course will consist of lessons. You’re looking at the first lesson here. It’s called
“Welcome to Chromebook Data Science”. You can see a list of all the lessons in this course in
the left panel. The lessons will contain text and images to walk you through every lesson of
each course.
• Videos - At the end of each lesson there will be a link to a YouTube video. This video contains
the same information as what is included in the text of the lesson; however, we know that some
people learn better by listening. Sometimes you may find the videos more helpful. Sometimes
you may find the text more helpful. These are included in case they are more helpful than the
text to you personally.
• Slides - Each lesson also has a link at the bottom to an accompanying slide show. Feel free to
look through these slides if you find them helpful. They are the same images that were used to
generate the video.
• Quizzes - Most lessons will have a quiz to evaluate your understanding of the material in that
lesson. Successful completion of these quizzes is required for receipt of the certificate at the
end of each course.
¹https://www.pwc.com/us/en/publications/data-science-and-analytics.html
²https://www.techrepublic.com/article/is-data-scientist-the-most-rewarding-tech-job-new-report-says-yes/
³https://www.forbes.com/sites/louiscolumbus/2017/05/13/ibm-predicts-demand-for-data-scientists-will-soar-28-by-2020/

• Exercises - A few of the courses will have associated exercises. Think of these as larger projects.
They won’t be required to receive the certificate at the end of the course; however, the skills
the exercises require will be essential if you’re interested in getting a job in data science, so
we highly suggest you complete them. Also, occasionally, there will be DataCamp exercises.
DataCamp is a company that generates content to help people learn to code. These cost money
and are not required for completion of the program; however, they will help you get additional
practice if you choose to do them.

What is data science?


You might not think about data very much. Most people don’t. So why is data science such a popular
and growing career? And what does data science even mean?
One definition of data is any information that you can store on a computer. Examples of data that
you produce all the time are text messages, Facebook posts, websites you visit, things you buy with
a credit card, pictures of your car on speed cameras, and information you fill out in profiles for your
work, school, or community organizations. If you can take a picture of it, write about it, make a
video of it, or record it on audio - then it is probably data. All of these different kinds of data are
collected and saved on a computer.
It used to be that measuring and storing data was expensive and hard. Now it is easy and cheap.
Governments, companies, organizations, and even individual people can now collect and store more
data in a few days than the entire world collected over the last few centuries. Most of the time we
don’t even think about the data we are collecting. We take pictures and post them to Facebook to
show our grandparents, not because we want to analyze the data in the pictures. This is true for
most of the data that we create and collect, both for ourselves and for companies. We don’t do it
for the data - we do it because we want to record and share information about ourselves and the
world. For example, companies do it so they can keep track of their customers, and governments do
it because they want to keep records of who got parking tickets.
But people started to figure out that you could use that data for other purposes. When you search for
“symptoms of the flu” on Google, you are just looking for information because you are sick. But the
fact that you searched for flu symptoms is itself data, and it is valuable to companies, doctors, and
even scientists. We could use that data from you to do things like show you an ad for blankets or for
flu medicine. We could also use data from you and everyone else who searched for flu symptoms
to find out where there are lots of flu cases.
Another example is social media. You might write a post on Facebook with pictures describing
your child’s birthday party. You might do this so your child’s grandparents can see pictures of her
birthday. But the information in that post can again be valuable for other people. We could figure out
birthdays, hobbies, interests, and who knows who from Facebook posts and likes. That information
can be valuable for showing ads, for suggesting other people you might know, or for studying how
humans interact with each other on birthdays and holidays.
But to make data valuable we need to be able to study it and separate the interesting facts (called the
“signal”) from the uninteresting information (the “noise”). One definition of data science is:

“Data science is asking a question that can be answered with data, collecting and cleaning
the data, studying the data, creating models to help understand and answer the question,
and sharing the answer to the question with other people.”

The reason this field is growing so fast is that nearly every government, company, and organization
is now collecting data. As the data have become cheaper and cheaper, the ability to analyze that data
and find useful information has become a more and more valuable skill. But most people don’t have
training or experience sifting through big piles of data to make interesting and valuable discoveries.
The people who can do this well are called data scientists. They have a job that is exciting, interesting,
and promises to be in high demand for years to come.

What is Chromebook Data Science?


Data science is a fast-growing and exciting profession. But it can be a challenge to get into this
career. Some of the biggest challenges are:

• Finding out that data science is even a real career
• Getting an education in data science can be costly and inconvenient
• You usually need a background in math, statistics, or computer science
• The equipment for data science can be costly and difficult to set up
• Many of the jobs are only available in major cities

Most of the people who are currently data scientists have degrees in math, statistics, or physics. They
can afford computers that cost thousands of dollars and specialized computing software to help them
do their jobs. They also mostly live in a few major cities like New York, San Francisco, Seattle, and
Washington D.C. Many of these data scientists are former software engineers or other white-collar
workers who moved into data science when they saw the demand for this kind of job.
Our goal with Chromebook Data Science is to help people who would otherwise not have
access to this exciting career get into it. To do that we need to remove some of the
challenges above. So we designed this program to tackle some of the challenges that are preventing
more widespread adoption of this career.

• Chromebook Data Science is being released as a set of online courses with a pay-what-you-can
model. That means you can take the whole series of courses for free or for whatever cost you
can afford.
• Chromebook Data Science is designed to be done entirely online using only tools you can
access from a web browser. This means that you can do the entire program on a Chromebook⁴
- which you can get for as little as $150.
• Chromebook Data Science starts at the very basics of how to set up all of your accounts, which
websites and apps to use, and simple little projects that anyone can do. The only pre-requisites
are high school math/reading and the ability to use a computer.
⁴https://www.google.com/chromebook/

• Chromebook Data Science includes resources for finding, getting, and working at data science
jobs. It also includes resources for finding and working at remote data science jobs that can be
done from anywhere in the world.

Who is this program for?


Chromebook Data Science is designed for people who have a high school education and know how
to use a computer. Some people who we hope the program will be useful for are:

• High school students
• People who are working on or have completed a high school education
• Students at community colleges
• Older adults who want to learn something new

But the program can be completed by anyone! We hope that it will be useful for anyone who wants
to learn something new about data science. This program is also focused on people who want to
learn to do data science.
In some cases this program may not be the most efficient way to learn about data science.
If you already have a background in statistics, math, or computer science and want to jump directly
to more advanced topics we have already created a Data Science Specialization⁵ on Coursera just
for you. There are many jobs that require people to understand or manage a data science project. If
you are a leader or executive who just wants a high level overview of what data science is all about,
we have also created an Executive Data Science Specialization⁶.
Our goal here is also to create a supportive and inclusive learning experience. Data science is
frustrating and slow to learn. Often the best way is to learn from other people who have discovered
similar solutions or made similar mistakes. Fortunately, there are communities in data science that
are cheerful, friendly and willing to help new people get involved. Throughout the program we will
introduce you to these communities and hope that you will also make an effort to help your fellow
students as they discover this exciting field.

How the program is organized


This program is a series of online classes. They are designed to be used in many different ways so
they can be useful for the most people possible. The courses and projects can be completed entirely
online using nothing more than a web browser. The program is organized into:

• Courses: Courses are designed to be completed in about a month working in your spare
time, or in a day or two working full time. You can receive a certificate for each course and all
courses are based on a pay-what-you-can model. Each course consists of:

⁵https://www.coursera.org/specializations/jhu-data-science
⁶https://www.coursera.org/specializations/executive-data-science

– Text-based tutorials and lessons
– Slides with the images from the tutorials
– Video tutorials that cover the same information as the lessons
– Ungraded exercises to practice what you have learned
– Graded quizzes to measure what you have learned
– Projects to help you build a portfolio for showing what you’ve learned
• Course Set: A Course Set is a group of courses that form credentials.

To keep up on the latest information about the program, courses and more go to http://jhudatascience.org/chromeboo

How this course is graded


This first class is designed to get you set up with the accounts you will use as you learn to become
a data scientist. You will also complete your first data science project. Each lesson will have a short
quiz at the end. To pass the course you need to get 70% of the questions in the course correct. If you
receive more than 90% of the points across all quizzes you will pass with honors.
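
The grading rule above can be written as a short calculation. This is only a sketch in Python (the program itself teaches R). The function name `course_result` and the example scores are hypothetical, and treating exactly 70% as a pass is our assumption, since the text does not say how boundary cases are handled.

```python
def course_result(points_earned, points_possible):
    """Classify a course score using the thresholds described in this lesson."""
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    # Integer comparisons avoid rounding surprises right at the thresholds.
    if points_earned * 10 > 9 * points_possible:    # more than 90%: honors
        return "pass with honors"
    if points_earned * 10 >= 7 * points_possible:   # assumption: exactly 70% passes
        return "pass"
    return "not yet passing"


# Hypothetical scores out of 50 quiz points
for earned in (30, 35, 45, 46):
    print(earned, course_result(earned, 50))
```

Note that a score of exactly 90% is a regular pass in this sketch, because the text says honors requires more than 90%.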

Slides and Video

View this video at https://www.youtube.com/watch?v=DJbQ2rg6KOk.


Welcome to Chromebook Data Science

Slides⁸
Take this quiz online⁹
⁸https://docs.google.com/presentation/d/18q2gRHXGZxBL7pSWcQg_HThmgoo5qDeO9O372QkAnYU/edit?usp=sharing
⁹http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_00_welcome
Program Philosophy
Our philosophy with building this course and this program is to try to make data science accessible
to the widest audience possible.
This course is part of the “Chromebook Data Science”¹⁰ series of courses.
These courses are designed to tackle some of the challenges that prevent people from getting into
data science in the first place. Some of those challenges are geographic - we’ll talk more about that
later. Some are due to the price of education - that is why we are offering these courses as MOOCs.
But one of the key barriers is that the type of computer you usually need to do data science is
expensive.
Chromebooks¹¹, on the other hand, are a very cheap type of computer. Chromebooks aren’t exactly
like normal computers and they have a few unique characteristics:

• They are usually very cheap¹²
• They are designed mostly to use the web
• You don’t “install” any software on the computer itself
• Instead of “apps” and “software” you simply go to websites for your work

A simple way to think about it is that a Chromebook is a computer that only lets you use an internet
browser like Chrome¹³. You can’t really do much on the computer itself. Some people call this way
of working - working only through the internet - “cloud computing”¹⁴.
It’s called cloud computing because the computer you are using most of the time is not the one
sitting in front of you. You are using the internet to access tools and computers to do your work.
But the physical computers doing the work are stored somewhere else - it could be nearby or on the
other side of the globe. That is why people call the computers “in the cloud”.
The goal of Chromebook Data Science is not that you have to use a Chromebook to finish the
program, it is just that you could use a Chromebook to finish the whole program. You can finish the
entire sequence of courses using any computer with an internet connection and a web browser.
We took this approach because we want data science to be accessible to everyone. We have found
that in earlier classes we taught online, the cost of computers, difficulties installing software, and
lack of computing resources prevented students from completing our courses. We wanted to
strip all those barriers away so that more students would have access to our program.
¹⁰http://jhudatascience.org/chromebookdatascience/
¹¹https://www.google.com/chromebook/
¹²https://www.google.com/chromebook/find-yours/
¹³https://www.google.com/chrome/
¹⁴https://en.wikipedia.org/wiki/Cloud_computing

We also believe that the future of data science is increasingly cloud based. So this educational choice
matches a trend we see in the field that we can help you take advantage of. It is less and less likely
that you will work only on your laptop as a data scientist. Through the internet you will access data
and computing power so that you can magnify the impact of what you are working on. We hope to
show you how to use those resources to maximize the value you can bring as a new data scientist.
We do recognize that internet access is also a limiting factor for many people. We have tried to
make it so that you don’t have to download data so hopefully the broadband requirements will be
minimal. If internet access is a challenge for you, we hope that you can use the resources
you have, whether they are local libraries, coffee shops, or internet cafes, to complete this program.
If that isn’t an option for you we’d love to hear from you and see if we can find ways to make data
science accessible to everyone, everywhere.

Slides and Video

View this video at https://www.youtube.com/watch?v=0NwqO1BSb7E.


Program Philosophy

• Slides¹⁶

¹⁶https://docs.google.com/presentation/d/1s7sLqa0GAUqVaD9TE63ntBKJfXm2Ac2dG90G_41egLI/edit?usp=sharing
Why Automated Videos
What’s the deal with these videos? And why do the videos say almost exactly the same thing as
the printed lecture material? You have probably already noticed that the lectures and videos for this
class are structured a little differently than in many MOOCs you have taken. We created this video
to explain to you why we made this change and why we think it highlights the awesome power of
R and data science.
We create a lot of massive open online courses (MOOCs) at the Johns Hopkins Data Science Lab. We have
created more than 30 courses on multiple platforms over the last 5 years. Our goal with these classes
is to provide the best and most up to date information to the broadest audience possible.
But there are significant challenges to maintaining this much material online. R packages go out of
date, new workflows are invented, and typos - oh the typos!
We used to make these courses like many other universities. We’d create course material in the form
of lecture slides, then we’d record videos of ourselves delivering those lectures. In some ways this was
great - you actually got to hear our voices delivering your lectures, including all the “yawns”, “ums”,
“buts”, and “so’s”. But the downside is that it is difficult and time consuming to update the content
when we have to book a recording studio, set up special equipment, record ourselves delivering a
lecture, edit those lectures, and then upload them to a system.
The result is that a lot of our lectures have been out of date, include errors, or don’t include the latest,
best versions of workflows and pipelines. This has been a problem for a while, but as the number of
courses we offer grows, it has become more and more of a challenge for us to keep them up to date.
So we started to think about how to solve this challenging problem. We realized that while recording
and editing videos is extremely time consuming there is another type of content we can edit, update,
and maintain much more frequently - regular old plain text documents¹⁷. We aren’t the only ones
who have thought this - massive open online course innovators like Lorena Barba have been saying
that videos aren’t even necessary for these types of courses¹⁸.
So when we sat down to develop our new process for creating and maintaining our courses we
wanted to see if we could figure out how to make a class made entirely out of plain text documents.
We broke down a massive open online course into its basic elements:

• Tutorials - these we can easily write in plain text formats like markdown or R markdown.
• Slides - these are easy enough to maintain and share if we make them with something like
Google Slides¹⁹.
¹⁷https://simplystatistics.org/2017/06/13/the-future-of-education-is-plain-text/
¹⁸https://www.class-central.com/report/why-my-mooc-is-not-built-on-video/
¹⁹https://www.google.com/slides/about/

• Assessments - here we can use a markup language²⁰ to create quizzes and other assessments.
• Videos - this was the sticking point, how were we going to make videos from plain text
documents?

By a happy coincidence, the data science and artificial intelligence communities were solving a huge
part of this problem for us by improving text-to-speech synthesis! So we could now write a script for a
video and use Amazon Polly²¹ to synthesize our voices!
To take advantage of this new technology we created two new R packages: ari²² and didactr²³.
The ari package will take a script and a set of Google Slides and narrate the script over the slides using Amazon
Polly. It will also generate the closed caption file needed to include captions and ensure that the
videos are accessible to those with hearing impairment. didactr automates several of the steps from
creating the videos with ari, to uploading them to YouTube, so that we can quickly make edits to
the scripts or slides, remake the videos, re-upload them and reduce our maintenance overhead for
keeping our content fresh.
Whenever we change the text file or edit the slides we can recreate the video in a couple of minutes.
Everything is done in R. One of the coolest features of going to this new process is showing you how
powerful the R programming language is. This is the main language you will learn in this program
and we hope you will be able to build cool things like this system by the time you are done with our
courses.
Why did we choose this approach instead of creating each piece of a lesson separately? First,
this process makes it a lot easier for us to maintain and update the courses. If you report an issue
or find a mistake in a lesson, all we need to do is change the script or the slides and recreate the
video. Second, this process makes our instruction more accessible. Because every video has a
transcript and every transcript has a voice-over, the content is accessible to those of us who have
disabilities. Everyone else can choose between reading, listening to, and watching the content.
Finally, a cool feature of using text to speech synthesis is that our videos will keep getting better
as the voice synthesis software improves. It means that we can change the voice to different
voices. Ultimately, it will allow us to translate our courses into different languages quickly and
automatically using machine learning. We think this highlights the incredible power of data science
and artificial intelligence to improve the world.
If you find the robot voice annoying, we get it. We know that the technology isn’t perfect yet. That’s
why we’ve made the written lecture material reflect as closely as possible the video lectures. This
means that you don’t have to watch these videos. Using this setup, you can pick how you want
to consume our classes. We hope that this change will allow us to better serve you with the best
content at the fastest speed. Thanks for participating in this new phase of course development with
us!
²⁰https://leanpub.com/markua/read#leanpub-auto-quizzes-and-exercises
²¹https://aws.amazon.com/polly/
²²https://cran.r-project.org/web/packages/ari/index.html
²³https://github.com/muschellij2/didactr

Slides and Video

View this video at https://www.youtube.com/watch?v=gGcfV9ngJMc.


Why Automated Videos

• Slides²⁵

²⁵https://docs.google.com/presentation/d/1FtdynwBR8IAE8x9cMTZrnWKfTPXEhH-KwkAZKQ0zwk8/edit?usp=sharing
The Data Science Process
In the first few lessons of this course we discussed what data is and talked about the fact that data
are everywhere. We also introduced you to the philosophy of this program: that everyone should
have access to the knowledge needed to become a data scientist and that these materials should be
able to be updated with ease as technologies and methodologies change over time. What we haven’t
yet covered is what an actual data science project looks like. To do so, we’ll first step through an
actual data science project, breaking down the parts of a typical project, and then provide a number
of links to other interesting data science projects. Our goal in this lesson is to expose you to the
process data scientists go through as they carry out their projects.

The Parts of a Data Science Project


Every data science project starts with a question that is to be answered with data. That means
that forming the question is an important first step in the process. The second step is finding or
generating the data you’re going to use to answer that question. With the question solidified and
data in hand, the data are then analyzed, first by exploring the data and then often by modeling
the data, which means using some statistical or machine learning techniques to analyze the data
and answer your question. After drawing conclusions from this analysis, the project has to be
communicated to others. Sometimes this is a report you send to your boss or team at work. Other
times it’s a blog post. Often it’s a presentation to a group of colleagues. Regardless, a data science
project almost always involves some form of communication of the project’s findings. We’ll walk
through these steps using a data science project example below.

A Data Science Project Example


For this example, we’re going to use an example analysis from a data scientist named Hilary Parker²⁶.
Her work can be found on her blog²⁷, and the specific project we’ll be working through here is from
2013 and titled “Hilary: the most poisoned baby name in US history”²⁸. To get the most out of this
lesson, click on that link and read through Hilary’s post. Once you’re done, come on back to this
lesson and read through the breakdown of this post.
²⁶https://hilaryparker.com/about-hilary-parker/
²⁷https://hilaryparker.com
²⁸https://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/

Hilary’s blog post

The Question

When setting out on a data science project, it’s always great to have your question well-defined.
Additional questions may pop up as you do the analysis, but knowing what you want to answer
with your analysis is a really important first step. Hilary Parker’s question is included in bold in her
post. Highlighting this makes it clear that she’s interested in answering the following question:

Is Hilary/Hillary really the most rapidly poisoned name in recorded American history?

The Data

To answer this question, Hilary collected data from the Social Security website²⁹. This dataset
included the 1000 most popular baby names from 1880 until 2011.
²⁹https://www.ssa.gov/OACT/babynames/

Data Analysis

As explained in the blog post, Hilary was interested in calculating the relative risk for each of the
4,110 different names in her dataset from one year to the next from 1880 to 2011. By hand, this would
be a nightmare. Thankfully, by writing code in R, all of which is available on GitHub³⁰, Hilary was
able to generate these values for all these names across all these years. It’s not important at this
point in time to fully understand what a relative risk calculation is (although Hilary does a great job
breaking it down in her post!), but it is important to know that after getting the data together, the
next step is figuring out what you need to do with that data in order to answer your question. For
Hilary’s question, calculating the relative risk for each name from one year to the next from 1880 to
2011 and looking at the percentage of babies named each name in a particular year would be what
she needed to do to answer her question.
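
To make the idea of a year-to-year comparison concrete, here is a minimal sketch in Python (Hilary’s own code is written in R). The names and counts are made up for illustration. Defining relative risk as a name’s share of births in one year divided by its share in the previous year is our simplifying assumption; her post gives the exact definition she used.

```python
# Made-up counts for illustration only: (babies given the name, all babies born)
counts = {
    "Name A": {1992: (2500, 1_000_000), 1993: (750, 1_000_000)},
    "Name B": {1992: (4000, 1_000_000), 1993: (4200, 1_000_000)},
}


def share(name, year):
    """Percentage of babies born in `year` who were given `name`."""
    named, total = counts[name][year]
    return 100 * named / total


def relative_risk(name, year):
    """Share in `year` divided by the share in the previous year (assumed definition)."""
    return share(name, year) / share(name, year - 1)


for name in counts:
    rr = relative_risk(name, 1993)
    print(f"{name}: the 1993 share is {rr:.2f} times the 1992 share")
```

A value below 1 means the name became less common from one year to the next, and a value above 1 means it became more common. Repeating this for thousands of names and more than a hundred years is exactly the kind of work that is a nightmare by hand and quick with code.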

Hilary’s GitHub repo for this project

Exploratory Data Analysis

What you don’t see in the blog post is all of the code Hilary wrote to get the data from the Social
Security website³¹, to get it in the format she needed to do the analysis, and to generate the figures.
As mentioned above, she made all this code available on GitHub³² so that others could see what she
did and repeat her steps if they wanted. In addition to this code, data science projects often involve
writing a lot of code and generating a lot of figures that aren’t included in your final results. This
is part of the data science process too. Figuring out how to do what you want to do to answer your
question of interest is part of the process, doesn’t always show up in your final project, and can be
very time-consuming.
³⁰https://github.com/hilaryparker/names
³¹https://www.ssa.gov/OACT/babynames/
³²https://github.com/hilaryparker/names

Data Analysis Results

That said, given that Hilary now had the necessary values calculated, she began to analyze the
data. The first thing she did was look at the names with the biggest drop in percentage from one
year to the next. By this preliminary analysis, Hilary was sixth on the list, meaning there were five
other names that had a single-year drop in popularity larger than the one the name “Hilary”
experienced from 1992 to 1993.
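
Ranking names by their biggest one-year drop can be sketched as follows. This is an illustrative Python example, not Hilary’s R code. The names and percentages are hypothetical, and measuring a drop as the fraction of a name’s share lost from one year to the next is our assumption.

```python
# Hypothetical shares (percent of babies given each name), not real data
shares = {
    "Name A": {1990: 0.20, 1991: 0.05},
    "Name B": {1990: 0.30, 1991: 0.27},
    "Name C": {1990: 0.10, 1991: 0.04},
}


def percent_drop(before, after):
    """Percentage of a name's share that was lost from one year to the next."""
    return 100 * (before - after) / before


drops = []
for name, by_year in shares.items():
    years = sorted(by_year)
    # Compare each year with the one immediately before it
    for prev, curr in zip(years, years[1:]):
        drops.append((name, curr, percent_drop(by_year[prev], by_year[curr])))

# Largest single-year drop first
ranked = sorted(drops, key=lambda row: row[2], reverse=True)
for name, year, drop in ranked:
    print(f"{name} lost {drop:.0f}% of its share going into {year}")
```

Sorting the table this way puts the most “poisoned” names at the top, which is the kind of list Hilary looked at first.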

Biggest Drop Table

In looking at the results of this analysis, the first five names appeared peculiar to Hilary Parker.
(It’s always good to consider whether or not the results were what you were expecting, from any
analysis!) None of them seemed to be names that were popular for long periods of time. To see if this
hunch was true, Hilary plotted the percent of babies born each year with each of the names from
this table. What she found was that, among these “poisoned” names (names that experienced a big
drop from one year to the next in popularity), all of the names other than Hilary became popular
all of a sudden and then dropped off in popularity. Hilary Parker was able to figure out why most
of these other names became popular, so definitely read that section of her post! The name, Hilary,
however, was different. It was popular for a while and then completely dropped off in popularity.

14 most poisoned names over time

To figure out what was specifically going on with the name Hilary, she removed names that became
popular for short periods of time before dropping off, and only looked at names that were in the
top 1000 for more than 20 years. The results from this analysis definitively show that Hilary had the
quickest fall from popularity in 1992 of any female baby name between 1880 and 2011. (“Marian”’s
decline was gradual over many years.)

39 most poisoned names over time, controlling for fads
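The "controlling for fads" step can be sketched in a similar way. The code below keeps only names that appear in a top-1000 list for more than 20 years. The data frame `top1000` and the names "SteadyName" and "FadName" are hypothetical stand-ins, not the data Hilary used.

```r
# Hypothetical record of the years in which each name was in the top 1000.
top1000 <- data.frame(
  name = c(rep("SteadyName", 25), rep("FadName", 3)),
  year = c(1970:1994, 1990:1992),
  stringsAsFactors = FALSE
)

# Count how many years each name appears.
years_in_top <- table(top1000$name)

# Keep only names that were in the top 1000 for more than 20 years.
long_running <- names(years_in_top)[as.vector(years_in_top) > 20]
long_running
```

Here the short-lived name is filtered out, so any later comparison only involves names with a long history.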

Communication

Once Hilary Parker had answered her question on her computer, the final step in this data analysis
process was to share the answer with the world. An important part of any data science project is
effectively communicating the results of the project. Hilary did so by writing a wonderful blog post
that communicated the results of her analysis, answered the question she set out to answer, and did
so in an entertaining way.
Additionally, it’s important to note that most projects build off someone else’s work. It’s really
important to give those people credit. Hilary accomplishes this by:

• linking to a blog post³³ where someone had asked a similar question previously
• linking to the Social Security website³⁴ where she got the data
• linking to the page where she learned about web scraping³⁵

What you can build using R


Hilary’s work was carried out using the R programming language. Throughout the courses in this
series, you’ll learn the basics of programming in R, exploring and analyzing data, and how to build
³³http://stuartbuck.blogspot.com/2003/09/hillary-is-most-poisoned-baby-name-in.html
³⁴https://www.ssa.gov/OACT/babynames/
³⁵http://syntaxi.net/2013/01/20/storyboard/

reports and web applications that allow you to effectively communicate your results. Below are a
few examples of what has been built using the data science process, the R programming language,
and the suite of tools that work with R. These are the types of things that you’ll be able to generate
by the end of this series of courses.

Predicting Risk of Opioid Overdoses in Providence, RI

Master’s students at the University of Pennsylvania set out to predict the risk of opioid overdoses in
Providence, Rhode Island. They include details on the data they used, the steps they took to clean
their data, their visualization process, and their final results³⁶. While the details aren’t important
now, seeing the process and what types of reports can be generated is important. Additionally,
they’ve created a Shiny App³⁷, which is an interactive web application. This means that you can
choose what neighborhood in Providence you want to focus on. All of this was built using R
programming.

Prediction of Opioid Overdoses in Providence, RI

³⁶https://pennmusa.github.io/MUSA_801.io/project_5/index.html
³⁷https://jordanbutz.shinyapps.io/directory/

Other Cool Data Science Projects


The following are smaller projects than the example above, but data science projects nonetheless! In
each project, the author had a question they wanted to answer and used data to answer that question.
They explored, visualized, and analyzed the data. Then, they wrote blog posts to communicate their
findings. Take a look to learn more about the topics listed and to see how others work through the
data science project process and communicate their results!

• Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half³⁸, by David
Robinson³⁹
• Where to Live in the US⁴⁰, by Maelle Salmon⁴¹
• Sexual Health Clinics in Toronto⁴², by Sharla Gelfand⁴³

Conclusions
In this lesson, we hope we’ve conveyed that sometimes data science projects are tackling difficult
questions (‘Can we predict the risk of opioid overdose?’) while other times the goal of the project
is to answer a question you’re interested in personally (‘Is Hilary the most rapidly poisoned baby
name in recorded American History?’). In either case, the process is similar. You have to form your
question, get data, explore and analyze your data, and communicate your results. With the tools
you’ll learn in this series of courses, you will be able to plan and carry out your own data science
projects, like the examples included in this lesson!
³⁸http://varianceexplained.org/r/trump-tweets/
³⁹http://varianceexplained.org/about/
⁴⁰http://www.masalmon.eu/2017/11/16/wheretoliveus/
⁴¹http://www.masalmon.eu/about/
⁴²https://sharlagelfand.netlify.com/posts/tidying-toronto-open-data/
⁴³https://sharlagelfand.netlify.com/about/

Slides and Video

View this Video at https://www.youtube.com/watch?v=Yq-USoMNTMU⁴⁴.


The Data Science Process

• Slides⁴⁵

Take this quiz online⁴⁶


⁴⁵https://docs.google.com/presentation/d/1SNT3SYuWJhjRYx7VmyFKWkuxESEx5THt-mWJ7Mx5Cr8/edit?usp=sharing
⁴⁶http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_003_data_science_process
How To Learn
In the last lesson we walked through a few interesting data science projects. Eventually, using the
foundational skills learned in the courses throughout this Course Set, with practice on your own,
and with other skills you pick up along the way, you’ll be completing your own, equally-awesome
data science projects!
However, what many people don’t tell you early on is that the path will be paved with a lot of
failure. This isn’t a bad thing! Data scientists fail all the time. They write code that produces an
error they have to figure out. And they regularly have to abandon projects that aren’t going to work
out. Failure is part of the process.

Failure

Even when a project is successful, know that there was failure on the way to success! The problem
is that what you see in a final blog post or a product put out by data scientists at a company is the
final product. This product may be something that is functional, really important, or even beautiful.
What you don’t see is all the failure that happened on the way to getting the end product. Data
science projects can be a lot like social media accounts. On social media, it’s easy to only show the
good stuff about one’s life. Similarly, the end product of a data science project may be awesome,
so the user will only see the good stuff. But there’s a lot of struggle and failure that
went into creating that awesome end product!

success requires failure

In fact, the pathway to success in data science is always full of failure. And, often, failure followed
by figuring out why you just failed is a great way to learn.
That doesn’t make failure easier. It will be frustrating from time to time, and figuring out why
something isn’t working can be hard. That’s ok! Know that you’re not alone. Even experienced data
scientists who have built really cool stuff experience lots of failure along the way.

process can be difficult

Learning How To Learn


In addition to learning the basics of data science in this course set, we also want you to learn how
to learn.
First and foremost, the best way to learn data science is by doing it. Throughout these lessons, copy
the code you see in the lessons and try it out on your own. If you get an error, that’s ok! Google
that error and try to learn from it! In fact, we’ve got a whole lesson in this course on how
to Google and a lesson in a later course in this course set on how to get help for questions when
you’re programming. But there’s more to learning how to learn than getting good at searching on
the Internet (although that is important!).

The Mindset

To learn how to learn, it helps to recognize just how important your mindset is. Your goal should be
to answer an interesting question. Your objective is not to memorize a bunch of functions. It’s to use
those functions to do something interesting. The path to accomplishing that goal may be circuitous.
You may take a few steps backward and experience a setback or two before moving forward. That’s
ok!

mindset

The Path

When carrying out a data science project, there is always more than one way to solve a problem.
Your path may be different than someone else’s path.
In fact, while you may not know R code yet, the following four lines of code all produce the exact
same result (the last one stores it in mtcars_long instead of printing it):

library(magrittr)  # provides the %>% pipe used below

mtcars %>% tidyr::gather(key = variable, value = value)
tidyr::gather(mtcars, key = variable, value = value)
mtcars %>% tidyr::gather(key = variable)
mtcars_long <- tidyr::gather(mtcars, key = variable)

Any one of these would be a reasonable approach. We use this example to explain that there is more
than one way to approach and to answer a question! Your path may be different than someone else’s.
Your approaches may not be identical. And, that is more than ok!
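If you'd like to convince yourself that different paths really can lead to the same place, the sketch below stores the result of each approach and compares them with `identical()`. It assumes the tidyr and magrittr packages are installed, and the object names (`piped_full` and so on) are just labels chosen for this example.

```r
library(magrittr)  # provides the %>% pipe

piped_full   <- mtcars %>% tidyr::gather(key = variable, value = value)
direct_full  <- tidyr::gather(mtcars, key = variable, value = value)
piped_short  <- mtcars %>% tidyr::gather(key = variable)
direct_short <- tidyr::gather(mtcars, key = variable)

# identical() returns TRUE only when two objects are exactly the same.
identical(piped_full, direct_full)
identical(piped_full, piped_short)
identical(piped_full, direct_short)

# Every version has one row per original value: 32 cars x 11 columns.
dim(piped_full)
```

Checking your own result against someone else's in this way is a handy habit when you've solved a problem differently than an example did.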

path

Asking For Help

We’ll point out where to find help when you’re stuck throughout this course set; however,
it may not be obvious when to ask for help. While this is not a hard and fast rule, if you’ve been
trying to find the answer to something you’re stuck on for half an hour and cannot figure it out,
it may be time to post your question online for someone else to answer or to reach out directly to
someone to get your question answered. During the half hour when you’re trying on your own, you
should Google for the answer. If it’s a coding question, you should try running code to test to see
if the fixes from Google fix your problem. If you’re getting error messages, paste those messages
into Google. If after trying all of these things you’re still stuck, then you should ask for help every
time. Rather than give up because you’re stuck, ask questions!
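As a small illustration of "paste the error message into Google," the sketch below triggers an error on purpose and captures its message as text. The variable names `error_text` and `search_terms` are made up for this example, and the exact wording of the message can vary with your R version and language settings.

```r
# log() needs a number, so giving it text fails on purpose.
error_text <- tryCatch(
  log("ten"),
  error = function(e) conditionMessage(e)
)

# This is the text to paste into a search engine, along with "r".
search_terms <- paste("r", error_text)
search_terms
```

In practice you would usually just copy the message from your screen, but seeing it as plain text makes the point: the error message itself is your best search query.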

Ask Questions

Summary
Learning how to learn and asking questions may seem simple when you read this lesson, but in
practice it can be tough. It’s hard to admit you don’t know something and it can be difficult
sometimes to explain what it is you don’t know. Try anyway! Everyone was a beginner at some point.
Those who moved from beginner to advanced did so because they learned the material, practiced,
and asked questions along the way. We’ll remind you of the information included in
this lesson throughout the course set because while it’s easy to read the information here, it’s not
always easy to remember it when you’re struggling!

Slides and Video

View this Video at https://www.youtube.com/watch?v=jXrkuWTPJoc⁴⁷.


How To Learn

• Slides⁴⁸

⁴⁸https://docs.google.com/presentation/d/1sgE2Um0t2AhkUlPHLJDSVLTJlyTabg1gtz1ybOgO-kY/edit?usp=sharing
Finding Help
In data science and computer-related work in general, it is common to ask for help multiple times
per day. While sometimes we ask our colleagues for help in-person, most of the time we search the
web.
Throughout this coursework it may surprise you just how frequently other people have run into the
same problem or had the exact same question you have. Often, there is an answer that was publicly
shared previously on the Internet that can help answer the very question you’re asking. There are
a number of websites and discussion boards where people frequently ask and answer questions. By
knowing how to effectively search the web, you can easily find these answers.

Searching the Internet


So far in this coursework, you’ve been working on the Internet in the Chrome browser. Within
every Internet browser, you have access to web search engines. These are designed to find the most
relevant answers to our questions. The most common web search engine is Google⁴⁹. In fact, Google
started as a web search engine before it developed any of the many other products it offers today. We
can access Google by typing www.google.com⁵⁰ in the address bar at the top of the Chrome browser.
This will bring you to the Google homepage, where you will see a simple text box and a button
called Google search.
⁴⁹https://www.google.com
⁵⁰https://www.google.com

Google search

In the search box, as you start typing your question you will see suggestions based on what you
have written so far. This is called Google auto-complete. Here is an example where Google suggests
a few common searches that start with “how to find help in”.

Google auto-complete

The auto-complete feature can be useful because it helps us refine our search query, which will lead to
more relevant results and answers. Throughout this coursework, we’ll be using the R programming
language to complete data analyses. Thus, you will often be searching for help related to the R
programming language. So, in this example, let’s select “how to find help in r” and then click on the
Google search button. We will get a list of websites that are most related to our question, as shown
below.

Google search results

Google highlights some of our key terms from our search in the search results list. For example, the
word help is bolded twice on the first link title “R: Getting Help with R”.
Each search result includes a short title, the web link, a short extract from the website, and some of
our search terms (words) highlighted. Using this information we have to decide if our search was
specific enough. For example, we could have searched “how to get help”. Google search would have
had no way of knowing that we had an R question specifically. Alternatively, searching “how to
get help for all the questions I’ve ever had or may have in the R programming language today or
tomorrow” is also not ideal. Devising a search with the fewest words that help accurately answer
your questions is the goal!
We will cover different ways of finding help. Throughout this coursework, you’ll likely learn that
part of being a data scientist means being good at Googling. Effectively searching the web is an
important skill to have.

Search Guidelines
The best way to get a response to your question is to be able to boil it down to relatively few words.
Less is usually better, and it’s faster to type too! So, when you’re Googling things, keep a few things
in mind:

• Use the fewest words possible - full sentences and correct grammar are not necessary when
searching Google
• Be specific - include words that are important to your specific search
• Know specific websites where you can get help - while Google is generally a great place to start,
sometimes it can be helpful to know specific websites where you can get help. StackOverflow⁵¹
and the RStudio Community⁵² will likely be helpful places as you learn to program in R. These
resources will be covered in detail in a future course; however, it’s good at this point to know
they exist
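The guidelines above can be sketched in a few lines of R: pick a handful of specific words and turn them into a search address. The helper `build_search_url()` is a hypothetical function written for this example, and the `https://www.google.com/search?q=` address format is an assumption about how Google search links are commonly written, not something covered in this lesson.

```r
# Hypothetical helper: join a few key words into a Google search address.
build_search_url <- function(terms) {
  query <- paste(terms, collapse = " ")
  # URLencode() replaces spaces and other special characters
  # so that the text is safe to use in a web address.
  paste0(
    "https://www.google.com/search?q=",
    utils::URLencode(query, reserved = TRUE)
  )
}

# Few words, but specific: the language, the topic, and the function.
few_words <- c("r", "help", "gather")
search_url <- build_search_url(few_words)
search_url
```

Running `browseURL(search_url)` on your own computer would open that search in your browser. Notice that three well-chosen words are enough; no full sentence is needed.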

View this Video at https://www.youtube.com/watch?v=tEyaXu6OVBo⁵³.


Finding Help

• Slides⁵⁴

Take this quiz online⁵⁵


⁵¹www.stackoverflow.com
⁵²https://community.rstudio.com/c/tidyverse
⁵⁴https://docs.google.com/presentation/d/180OSJkB2c7BxvtJZ3F-KrGzQ33vPLWbErC2xjO6g4O4/edit?usp=sharing
⁵⁵http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_01_findinghelp
Account Setup
Before we can get started doing fun things with data we need to make sure you are set up to use
all of the different accounts that you will need throughout the course. We will tell you briefly what
each of these accounts is used for and how to set it up now. If you don’t know what each of these
accounts is for exactly, don’t worry! We will walk you through everything you need to know.

Choosing a Username
Choosing an appropriate username is important. Some combination of your first and last name is a
good idea. For example, if your name were Jane Doe, a username such as “JaneDoe” or “Jane.Doe”
would work. If the first username you attempt is taken, you can try another, similar username. In
this case, maybe try “JDoe”.

Appropriate Usernames

But be sure that, whatever name you choose, you would be comfortable sharing it with your boss
or a family member. Usernames with nicknames or profanity are not a good idea.

What to Avoid in Usernames
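As a preview of the kind of code you'll write later, here is a small sketch in R that builds the username ideas from the Jane Doe example. The function `suggest_usernames()` is a hypothetical helper made up for this lesson, not part of any package.

```r
# Hypothetical helper: build a few username candidates from a name.
suggest_usernames <- function(first, last) {
  c(
    paste0(first, last),               # first and last name together
    paste(first, last, sep = "."),     # separated by a period
    paste0(substr(first, 1, 1), last)  # first initial plus last name
  )
}

suggest_usernames("Jane", "Doe")
```

If your first choice is taken, working down a short list like this one keeps your usernames simple and professional.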

Using a Consistent Username


Remembering different usernames for different accounts is difficult. It is best to make your life easy
and use the same username whenever possible. We will make your life easy by using the Google
Account you set up in the next section whenever possible. When it’s not possible to log in with
Google, then we suggest you try to use the same username for each account.

Accounts
To give you an idea of where we’re going, the first account (and arguably the most important
account) you set up in the next lesson will be a Google account⁵⁶. After that we will walk you
through the steps to get you set up with accounts on:

• LinkedIn⁵⁷ - this is a site to share information about yourself with employers.


• Twitter⁵⁸ - this is a social media site that we will use to share our data science products and get
support from the data science community.
⁵⁶https://mail.google.com/mail
⁵⁷https://www.linkedin.com
⁵⁸https://twitter.com/

• Slack⁵⁹ - this is a website where you will be able to chat online with your fellow students and
instructors.
• RStudio Cloud⁶⁰ - this is a website where you can use RStudio, the main tool to learn data
science.
• DataCamp⁶¹ - this is a website where you can practice using R and RStudio.
• GitHub⁶² - this is a website where we will share the results of our data science projects with
each other and the world.

Accounts
⁵⁹https://slack.com/
⁶⁰rstudio.cloud
⁶¹https://www.datacamp.com/
⁶²https://github.com

Slides and Video

View this Video at https://www.youtube.com/watch?v=6z4LQQ17lyQ⁶³.


Account Setup

• Slides⁶⁴

Take this quiz online⁶⁵


⁶⁴https://docs.google.com/presentation/d/1mQMEdR4opFzuReP9i7te5v8T-kyDNNklHPvQ2OnzZpQ/edit?usp=sharing
⁶⁵http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_02_account_setup
Google Account Setup
The first and most important account we need to set up will be a Google account. You will need
a Google account to be able to use free Google products. These Google products, such as Gmail,
Google Docs, Google Sheets, and Google Slides will be useful in many of the Data Science projects
you complete. The Google account will also be useful for letting you get access to other
websites we will use in the program.

Google Products

If you already have a Google account with an appropriate username that you would like to use
throughout this course, you can skip the next section and move to the “Log off Guest Chromebook”
section. However, it’s probably best to create a new account dedicated to all your Data Science
accounts, many of which you will set up in the next lesson.

Getting a Google Login


To get a Google account, you will first want to open a new tab in your current Chrome session. To
do this, you’ll first click on the small gray box to the right of the tab you currently have open.

Google website

Once this new tab is open, type in ‘www.gmail.com’ in the web address bar at the top of your
Chrome session. After pressing Enter, you will be brought to a Login screen. Here, you will click on
“More options.”

Google Sign in

You will then click on “Create account” to start the process of getting a Google Login.

Google Create Account

Begin filling in the blank spaces in the box to the right with your information. Google will alert you
if the username you’ve chosen has already been taken. Once you’ve filled out all the blanks, click
on “Next Step” at the bottom right.

Google Blank Form

You’ll be asked to read the “Privacy and Terms.” To scroll through the entire document, click on the
blue arrow in the middle at the bottom of the document. After reading over the Privacy and Terms,
click “I AGREE” to continue.
You will then be asked to verify your account. To do so, ensure that a valid phone number for you
is in the ‘Phone number’ box. Select whether you prefer to be contacted by “Text message (SMS)”
or ‘Voice Call’. Click ‘Continue’ once the information has been entered. You will then be sent a
verification code by text message or by phone call, depending on your choice. Enter
the verification code into the box on the screen and click “Continue”.
Congratulations! You now have a Google username and account! Be sure to remember your
username and password! This will be used for your email address (Gmail) and all other Google
products.

Google Welcome!

Log Off Guest Chromebook


At this point, you’ll want to log off the Chromebook you’re using as a Guest and sign in using your
Google Username. To do so, click on ‘Guest’ at the very bottom right-hand of the screen. Click ‘Exit
guest’ on the screen that pops up. This will log you off of the Chromebook so you can log back in
using your Google account.

Chromebook Log Off

Re-Login using Google Account


You will now be on the Chromebook login screen. To sign in using your Google account, click ‘Add
person’ at the bottom of the login screen.

Chromebook Add Person

A ‘Sign in to your Chromebook’ screen will open.



Chromebook Sign in

Enter your new Google account name here. Click ‘Next’. Enter your password. Click ‘Next.’ You will
now be logged on. Anytime you work on this Chromebook now, you will simply log in using your
new Google account.

Slides and Video

View this Video at https://www.youtube.com/watch?v=IQUB-NvoNqI⁶⁶.


Google Account Setup

• Slides⁶⁷

Take this quiz online⁶⁸


⁶⁷https://docs.google.com/presentation/d/1sOBtwszQqq366q84VCDY_BwSjWQz_4yFJLC4ib1dEGQ/edit?usp=sharing
⁶⁸http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_03_google
Other Accounts Setup
In addition to having a Google username, there are a number of other accounts to which you’ll need
access. By the end of this lesson, you will have set up a number of different accounts. While this
may seem like a lot of work now, it will get you set up with all the accounts you’ll need throughout
this series of courses. The time you spend now will pay off later! Right now, we’ll walk through each
one briefly, get you set up with an account, and discuss what the account is used for.

LinkedIn Account
LinkedIn⁶⁹ is a social networking site for employment. Think of it as Facebook for getting a job. It
allows you to put your qualifications online (like an online resume), has a space where you can look
for jobs, and can put you in contact with employers. Don’t worry about the details now. Through
this program, you will have the chance to set up your LinkedIn gradually. For right now, we’re just
worried about getting this account set up.
To begin set up, you’ll go to the web address bar in your Chrome browser. You will type
www.linkedin.com and hit ‘Enter.’
⁶⁹http://www.linkedin.com

LinkedIn website

This will bring you to LinkedIn’s login screen. On this screen you’ll begin filling out the boxes in the
middle of the screen. Be sure to use the Gmail address you created in the last chapter when
asked for your email address. Choose a password that cannot be easily guessed by someone else.
Once the four boxes are filled in, click “Join now”.

LinkedIn Blank Form

A screen may pop up asking you to verify that you’re not a robot. Whenever this happens, just click
on the empty square box to let the computer know you’re a real person.

Not a robot

After clicking “Join now” (and maybe after you verify that you are a human) you will be brought to
a new screen. Here, at the top, it will show you that you almost have a LinkedIn login but that first
you have to confirm your email address.
To do so, you will open a new tab at the top of the Chrome browser window. You will then type in
‘www.gmail.com’ and press ‘Enter.’

gmail website

This will bring you to your email account. An email from ‘LinkedIn Messages’ should be there. Click
on that email to open it.

Verify email from LinkedIn

In that email there will be a button where you can click to ‘Confirm your email’.

LinkedIn Confirm Your Email

This will open a new screen where your Google username will be in the box already. Click ‘Continue’
on this screen.

LinkedIn Continue

While other boxes may pop up for you to go further on LinkedIn and get your profile set up, this
is all you have to do for now. You now have a LinkedIn username and account! We will now go
through similar processes for the other accounts needed in this program.

Twitter Account
The second account you will need will be a Twitter⁷⁰ account. You may already be familiar with
Twitter; however, data scientists tend to use Twitter for work rather than socializing. Twitter is a
social media platform where users can “post and interact with messages.” These messages are known
as ‘tweets.’ Twitter is a great place to learn new things, connect with other data scientists, and to
ask/answer questions quickly.
You will need a professional Twitter account for our program. If you already have a Twitter
account you use for personal tweets and communicating with friends you should still create a
new, professional account where you will only post professional links and interact with other data
scientists.
To get a Twitter account, first type ‘www.twitter.com’ in the web address bar at the top of your Chrome
session.
⁷⁰http://www.twitter.com

Twitter website

This will bring you to a screen where at the top right you’ll want to click ‘Sign up’.

Twitter sign in

This will bring you to a screen that prompts you for some information and asks you to create a new
password. After filling out the information, you will click ‘Sign up.’

Twitter sign in

You will be brought to a screen asking for your phone number. After entering your phone number,
you will click ‘Next.’

Twitter phone number

This will bring you to a screen where you will choose a username. This will be your Twitter
‘handle.’ For simplicity, it is best for your Gmail username and Twitter handle to match as closely
as possible. Twitter handles can contain only letters, numbers, and underscores, so if your Gmail
address is Jane.Doe@gmail.com, ‘JaneDoe’ would be a great Twitter username. If that name is
unavailable, choose a different, but appropriate and simple, Twitter username. Once you have
chosen a Twitter username, click ‘Next’.

Twitter username

At this point you will be brought to a screen that will have a button saying ‘Let’s go!’ This will lead
you to set up your profile further, which is not needed at this time. Instead, go to your Gmail, look
for an email from ‘Twitter’ and click on the email.

Twitter Let’s go!

In this email, there will be a ‘Confirm now’ button to click. To verify your Twitter account, click on
this button.

Twitter Confirm now

Congrats! You now have a Twitter account!

Slack Account
Slack⁷¹ is a place where teams of people can easily communicate and work together on a project.
As a data scientist, you are often working in a group on a project. Slack is a place where everyone
working on that project can communicate. Slack is where communication throughout this course
will happen. You will be able to ask questions, answer questions, and communicate with others on
Slack about the things you are learning and the projects you are working on.
To get a Slack account, you will first open a new tab in your browser by pressing Ctrl and T together. Once you
have a new tab open, you will type ‘www.slack.com’ at the top of your browser in the web address
bar.
⁷¹http://www.slack.com

Slack Web Address

On this webpage, type your Gmail address in the ‘email address’ box, and click ‘GET STARTED.’

Slack Get Started!

We won’t be signing into any workspaces yet; however, later in the program, when we do, you will
have an account! That’s all you need to do with Slack for now!

RStudio Cloud Account


As a data scientist, A LOT of your work will be done in something called RStudio. In this program,
you will be learning the basics of the programming language, R. R is a free programming language
for statistical computing and graphics. In other words, the code you write to work with data will
all be done in R. RStudio Cloud is the place (or ‘platform’) where you will type this code and make
basic plots.
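To give you a first taste, here is a minimal sketch of the kind of R code you might type in RStudio Cloud. It uses `mtcars`, a small example data set that comes with R. The variable name `average_mpg` and the axis labels are choices made for this example.

```r
# mtcars is a small example data set that comes with R.
# Calculate the average miles per gallon across all cars.
average_mpg <- mean(mtcars$mpg)
average_mpg

# A basic scatter plot of car weight against fuel efficiency.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

You don't need to understand every piece yet. Later courses in this series will walk through what each line does.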
Luckily, RStudio Cloud makes it easy to sign up. You will first go to ‘rstudio.cloud’. Note: This web
address does not start with ‘www.’

rstudio.cloud web address

This will bring you to a screen where you can click on ‘Get Started.’ This will bring you to a login
screen where, instead of typing in your information, since you already have a Google account, just
click on ‘Sign up with Google.’

rstudio.cloud sign up with Google

You will be prompted to choose which Google account you want to use. Choose your professional
Google account. Then, you will be brought to a screen where you will have to enter a username.
Again, for simplicity, try to use the same username across all accounts.

rstudio.cloud choose account

Then, click ‘Create Account’ and you’re all set! You now have an RStudio Cloud account!

DataCamp Account
DataCamp⁷² is an online platform where people learn Data Science. Throughout this program, you
will take courses and do exercises on DataCamp to ensure that you are acquiring the skills necessary
to be a successful data scientist.
Getting a DataCamp account will be very similar to getting an rstudio.cloud account because you
can again sign in using your Google account. To do so, go to www.datacamp.com.
⁷²http://www.datacamp.com

DataCamp Web Address

You will be brought to a screen where you can ‘Create Your Free Account.’ Click on the ‘Google+’
button. You will again be brought to a screen where you’re asked to choose your Google account.

DataCamp Google+

Select your Google account.



DataCamp choose account

And, just like that, you have a DataCamp account!

GitHub Account
GitHub⁷³ is a website that hosts computer code and allows for version control. We’ll get back to
what version control is later, but as for now, know that GitHub is where you’ll be ‘saving’ all of the
code you write. It’s also a place where you can look at other people’s code. And, throughout this
program, you’ll realize that you can learn a lot from other people’s code!
To get a GitHub account, first type www.github.com into the web address bar at the top of your
Chrome window and hit ‘Enter’.
⁷³http://www.github.com

GitHub web address

You will be brought to a page where you should fill in your information. As with the other accounts,
try to use the same Username if possible. Enter your Gmail Email address. And, create a password
that cannot be easily guessed by others. Then, click ‘Sign up for GitHub.’

GitHub sign up

One final note about GitHub usernames in particular. Your username will be used for your website (which
you’ll build later) and all the code you write. You’ll use GitHub a lot, so this is a case where it is
particularly helpful to choose a good username, particularly one that has something to do with your
name and not much else. For example, the person writing this lesson is named Shannon Ellis. Her
GitHub username is “ShanEllis.” While it is possible to change your GitHub username down the line,
it’s a bit of a pain, so choose wisely now!
You now have a GitHub account and all of the online accounts needed for this program!

Slides and Video

View this Video at https://www.youtube.com/watch?v=U4g_b-1adso⁷⁴.


Other Accounts Setup

• Slides⁷⁵

Take this quiz online⁷⁶


⁷⁵https://docs.google.com/presentation/d/1NbqkwyE9f4__0uTHWiVDoZaVfb_sMqJw4LTNcvsVw3A/edit?usp=sharing
⁷⁶http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_04_other_accounts
Your first data science project
We are using this definition of data science.

“Data science is asking a question that can be answered with data, collecting and cleaning
the data, studying the data, creating models to help understand and answer the question,
and sharing the answer to the question with other people.”

Rather than try to explain data science with examples made by other people, we are going to show
you the process of data science through a project that you will complete.
The first step in any data science project is to come up with a question. You are taking this course
on Leanpub⁷⁷. Leanpub is a website where you can sell books and courses. For this first project the
question we are trying to answer is:

“How does the readership of a bestselling book relate to how much the author is charging
for that book?”

This question isn’t about data. It is just something we might be curious about. In this case, if you
were going to write and sell a book on Leanpub you might want to know what price to pick in order
to try to sell the most books. Many good data science questions don’t start out with data. They are
just questions you wish that you knew the answer to. Later, you try to find out if there is data to
answer your question.
In this case, to answer our question, we need some information on books on the Leanpub website.
If you go to https://leanpub.com/bookstore you will see a website that looks like this.
⁷⁷https://leanpub.com/

Leanpub bookstore website

This shows the bestseller books for the last week. If you click on one of the pictures of a book you
can get some information on that book. If I click on the page for the first book “PowerShell 101” I
see something like this.

Powershell 101 landing page on Leanpub

It will probably be a different book for you since it will be a different weekly bestseller. But you can
look in the top left corner and see how many people read the book. This information is there for
most books, but is sometimes missing if the author decides not to publish that number. In this case
there are 1,036 total readers of this book.

Number of readers for Powershell 101

Next we can find out the suggested price. This is on the right hand side and is the price the author
thinks is the appropriate price for their book. In this case the suggested price is $15.99.

Suggested price for Powershell 101

But one nice thing about Leanpub is that you can set up a “pay what you want” model where people
can choose how much they pay for a book. When authors do this, there is also a minimum price
they set for the book. If there is a minimum price it is also on the right hand side. In this case the
minimum price is $7.99.

Minimum price for Powershell 101

We could do this for each book and then we’d have a nice data set that would tell us something
about the number of readers for a book and the price of that book. Then we could look for
patterns in the numbers we collected to try to answer our question.
We’ll go through the steps necessary to do all of this and answer the project question “How does
readership of a bestselling book relate to how much the author is charging for that book?” in the
following lessons.

Slides and Video

View this Video at https://www.youtube.com/watch?v=8HABnIPgpgA⁷⁸.


Your First Data Science Project

Slides⁷⁹
⁷⁹https://docs.google.com/presentation/d/1auByZV5pghzELH-SMKLwxrZtigtXd-PC4Q5SrcT4qlE/edit?usp=sharing
Google Sheets
Google Sheets is a free, online spreadsheet program. If you’re familiar with Excel, it is similar to
Excel. If you are unfamiliar with Excel, that’s ok! We’ll go through everything you need to know to
get started on the project here. And, later in the program, we will go into more details to get you
fully comfortable working with Google Sheets. As for right now, just know that when you have data
that you want to input into a spreadsheet, Google Sheets is an ok place to start. Google Sheets is also
great because you never have to worry about saving your work. If you are online, Google Sheets
automatically saves your work.

What is a spreadsheet?
A spreadsheet is a type of document where data are stored in rows and columns of a grid. Each
square is referred to as a ‘cell’ in the spreadsheet. In Google Sheets (and many other spreadsheet
programs like Excel), the rows are numbered (like 1,2,3,…) and the columns are labeled with capital
letters (like A, B, C,…).

spreadsheet

If you want to talk about a specific spot on the grid you can use the letter and number corresponding
to that point. For example, A2 specifies the data in the cell in the first column (A) and second row (2) of
the spreadsheet.
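
To make the letter-and-number naming concrete, here is a minimal sketch of how a cell name like A2 can be split into a column number and a row number. It is written in Python rather than the R you will use later in this program, and the function name parse_cell is our own invention for illustration. You do not need to run it to complete the project.

```python
import re


def parse_cell(ref):
    """Split a cell name such as 'A2' into (column_number, row_number)."""
    match = re.fullmatch(r"([A-Z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not a cell name: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for letter in letters:
        # Read the letters like a base-26 number where A=1, B=2, ..., Z=26.
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return column, int(digits)


print(parse_cell("A2"))   # (1, 2): first column, second row
print(parse_cell("C10"))  # (3, 10): third column, tenth row
```

The column letter always comes first and the row number second, which is why A2 means column A, row 2.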

spreadsheet position

When you are working with data in a spreadsheet you can type directly into the spreadsheet. It is
important to make sure you double check all the numbers you type since there isn’t a good way to
“spellcheck” your work when you are editing a spreadsheet.
We will talk a lot more in future courses about how to organize data that you have collected. Mostly
we will want to collect “tidy data”⁸⁰ which is data that has

1. Each type of data in one column.
2. Each data point in one row.
3. One spreadsheet for each “kind” of data.
4. If you have more than one spreadsheet, they should include a column in the table that allows
them to be linked.

Here we are only collecting one “kind” of data - just data on books. The columns will be different
types of information about the books. We will collect information on the name of the book, the
number of readers of that book, the suggested price of the book, and the minimum price of the book.
⁸⁰https://en.wikipedia.org/wiki/Tidy_data

Each of those will be in a separate column. Then, for each book, we will make a new row with the
data for that book.
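
As a rough illustration of this layout, the sketch below writes one book per row and one type of information per column. It is in Python rather than R, the helper name to_csv is hypothetical, and the column names are the ones you will type into your spreadsheet later in this lesson. The single row uses the PowerShell 101 numbers shown earlier.

```python
import csv
import io

# One column per type of information about a book.
columns = ["title", "readers", "suggested", "minimum"]

# One row (here, one dictionary) per book.
books = [
    {"title": "PowerShell 101", "readers": 1036, "suggested": 15.99, "minimum": 7.99},
]


def to_csv(rows, fieldnames):
    """Return the rows as comma-separated text with a header line on top."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(books, columns))
```

Adding another book means adding another row. The columns stay the same.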
Remember we are collecting information on the bestselling books from the last week on Leanpub.
You can find the list of bestsellers here: https://leanpub.com/bookstore⁸¹. Remember that if you click
on the image of one book you will get something that looks like this.

Powershell 101 landing page

Setting up your spreadsheet


When we collect the information we will use the Google Sheets software to store it for us. You will
need to open up another browser tab. You can do this by holding down the ctrl key and pressing
t. Leave this page open and go to Google Sheets by navigating to
the website https://docs.google.com/spreadsheets/ in the new tab. You will see something like this.
⁸¹https://leanpub.com/bookstore

Google sheets home

Now click on the big plus sign and you will get a new spreadsheet that will look like this.

Untitled sheet

If you click on the words “Untitled Spreadsheet” you can rename the spreadsheet. Type in the words
“leanpub_data” to change the name of your spreadsheet. You should now have a spreadsheet that
looks like this.

leanpub_data sheet

We are almost done setting up the spreadsheet. Now we just need to label the different kinds of data
we are going to collect. Start by clicking on the upper left hand cell (A1) and type “title”. This will
be the column where we are going to store information on the title of the book.

leanpub_data sheet with title

Then move one cell to the right, click and type “readers”. This will be where we will store how many
readers a book has. Move one more cell to the right and type “suggested”, and then one more cell and
type “minimum”. Make sure your column names are not capitalized.

leanpub_data sheet with headers

Collecting data
Now you are all set to start collecting data! To do this open another new tab by holding ctrl and
pressing t, then go to the webpage: https://leanpub.com/bookstore. Click on a book and write its
title, number of readers, suggested price, and minimum price on a row in the spreadsheet tab. When you
are doing this make sure that:

• There are no commas in numbers. Just leave them out. So don’t write “1,036” write “1036”
instead.
• You don’t put dollar signs for the price, just include the number like “7.99.”
• If a book’s minimum price is free, enter “0” in the cell.
• If the book has no readers, put “0” in the cell.
• If the book’s author opted not to include how many readers their book has, put “NA” in the
“readers” column for that book.
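
The rules above can be sketched as two small helper functions that turn what you see on a Leanpub page into what belongs in a cell. This is only an illustration in Python (not the R used later in this program), and the names clean_readers and clean_price are hypothetical. In the project you apply these rules by hand as you type.

```python
def clean_readers(text):
    """Turn a reader count as shown on the page into the value to type."""
    text = text.strip()
    if text.upper() == "NA":
        # The author chose not to show a reader count.
        return "NA"
    # Leave out commas, so "1,036" becomes "1036".
    return str(int(text.replace(",", "")))


def clean_price(text):
    """Turn a price as shown on the page into the value to type."""
    text = text.strip()
    if text.upper() == "FREE":
        # A free minimum price is entered as the number 0.
        return "0"
    # Leave out the dollar sign, so "$7.99" becomes "7.99".
    return text.replace("$", "").replace(",", "")


print(clean_readers("1,036"))  # 1036
print(clean_price("$7.99"))    # 7.99
print(clean_price("FREE"))     # 0
```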

So for me, since the first book is “PowerShell 101” after getting the data for the first book my
spreadsheet will look like this.

First row of data for project

Continue this process, entering each book into a new row. Collect information on ten or twenty
books. One book for every row. At the end you should have a data set that looks something like this.
But yours will have different numbers and names in it.

First complete data set

Checking your data


Now that you’ve entered your data into the Google Sheet, we want to check for a few possible issues
before moving on to make sure the data are formatted correctly. Double check to make sure the following
are true for the data in your spreadsheet:

1. You have at least 11 rows with reader and minimum price information (one header row and at
least 10 books included; if you have NAs anywhere, you’ll want more than 11 rows)
2. Your dollar amounts do NOT have dollar signs next to them.
3. Your number of readers does not include any commas.
4. If a book’s minimum price is FREE, you have put the number 0 in the cell, rather than “FREE”
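
If you are curious how a computer could run the same checklist, here is a minimal sketch. It is written in Python rather than R, it is not the code used in the next lesson, and the names check_sheet, is_number, and EXPECTED_HEADER are made up for this example. It assumes each row is a list of four text values in the order title, readers, suggested, minimum.

```python
EXPECTED_HEADER = ["title", "readers", "suggested", "minimum"]


def is_number(text):
    """Return True when text reads as a plain number such as 1036 or 7.99."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def check_sheet(rows):
    """Return a list of problems found in rows (header first, then one row per book)."""
    problems = []
    if rows[0] != EXPECTED_HEADER:
        problems.append("header should be title, readers, suggested, minimum")
    usable = 0
    for number, (title, readers, suggested, minimum) in enumerate(rows[1:], start=2):
        for name, value in (("readers", readers), ("suggested", suggested), ("minimum", minimum)):
            if "$" in value:
                problems.append(f"row {number}: remove the dollar sign from {name}")
            if "," in value:
                problems.append(f"row {number}: remove the comma from {name}")
        if minimum.strip().upper() == "FREE":
            problems.append(f"row {number}: write 0 instead of FREE")
        # A book only helps the analysis if both values are real numbers.
        if is_number(readers) and is_number(minimum):
            usable += 1
    if usable < 10:
        problems.append(f"only {usable} books have both readers and a minimum price")
    return problems


# Made-up rows for illustration only.
bad = [EXPECTED_HEADER, ["Made-Up Book", "1,036", "$15.99", "FREE"]]
for problem in check_sheet(bad):
    print(problem)
```

An empty list of problems means the sheet passes all four checks.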

Checking your data

This is great! You now have a question you want to answer and you have collected some data to
answer that question. You are on your way to becoming a data scientist!

Publishing to the web


Our plan is to use the data in this spreadsheet to answer our question about how readership of a
bestselling book relates to how much the author is charging for that book. Before you can do that in
the next lesson, you will first have to publish the data to the web. This gives the software we’ll use
in the next lesson permission to access your data. To publish your sheet, click on File at the top of
the Google Sheet. From the drop-down menu that appears, click on “Publish to the
web.”

Publish to the web…

In the window that pops up, you’ll want to click on “Publish”



Publish

A box will appear to confirm that you would like to publish this Google Sheet. Click “OK.”

OK

Making the sheet public


After publishing your data to the web, the last step is to make these data accessible to others who
have the link. This can be done easily on a Google Sheet by clicking on “Share” in the top right-hand
corner of the Google Sheet.

Share

A “Share with others” box will pop up. Click on “Get shareable link.”

Share with others

Your screen will update so that this document can now be viewed by anyone, as long as they have
the link to the spreadsheet.

Shareable

Congrats! You have successfully made this spreadsheet shareable and the link has been copied. You’ll
be asked to paste this link in the quiz for this lesson, and we’ll use this spreadsheet link in the next
lesson when you get started using RStudio Cloud, so don’t close your Google Sheets tab quite yet.

Slides and Video

View this Video at https://www.youtube.com/watch?v=nt4xolGiAVc⁸².


Google Sheets

• Slides⁸³

Take this quiz online⁸⁴


⁸³https://docs.google.com/presentation/d/1EPt7DuMZOqJMElDNMi3PWO66OytMlWPoc-RsopdVxNM/edit?usp=sharing
⁸⁴http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_06_googlesheets
RStudio Cloud
The main software that we are going to use to analyze data in this class is called R⁸⁵. R is a piece
of software that lets you write computer code to analyze data. RStudio⁸⁶ is a company that makes a
piece of software that works with R. RStudio makes it easier to create, save, share, and work with R
code and data sets. R is one of the two most popular languages for data science. We will learn a lot
more about it throughout the courses, but here we are just going to use it to take a peek at the data
you have created.
If you have a more traditional laptop you can download and install R and RStudio on your laptop.
But this class is part of our Chromebook Data Science program where we will be teaching you how
to do everything through a web browser. Fortunately RStudio also offers a web-based version of
their software called RStudio Cloud.
In a future class we will go into much more detail about RStudio and RStudio Cloud. For now, we
will just go over the basics and then use RStudio Cloud to do a very basic analysis of the data you
collected in your Google Sheet. We will give you all the commands you need to run in RStudio Cloud
to complete this project. Don’t worry if this seems a little foreign; we are going to learn a lot
more about it later! Just follow the steps and you’ll end up with your very first plot!

RStudio Cloud Basics


Before we get started with the data you collected, we’ll explain the basic components of RStudio
Cloud.
There are four main components in an RStudio Cloud window: the scripting area, the Console, the
Environment, and the Files directory. We’ll briefly discuss each part now and go into a lot more
detail later.
⁸⁵https://www.r-project.org/
⁸⁶https://www.rstudio.com/

Four RStudio Cloud components

First, in the top left-hand portion of the window, the scripting area is where you will see code to run
in your first project in a few slides. In the future, this will be where you will type all your code. The
code typed in this space can be saved and re-run later whenever you need it.

Scripting RStudio Cloud

In the bottom left-hand portion of the window is the Console. This is where the code you type in
the scripting window above will actually run. You script what you want to happen in the scripting
window. In the Console, what you wanted to happen actually happens.

Console RStudio Cloud

When you write code in R, you create objects, such as the data set you read in. We’ll talk in
detail about what that means later. For now, just know that any objects that
you create while coding will be listed in the Environment section in the top right-hand portion
of the RStudio Cloud window.

Environment RStudio Cloud

The fourth component is at the bottom on the right-hand side of the window. Here, any files or
folders you create, such as the scripts you save, will be listed.

Files RStudio Cloud

You’ll also note that there are multiple tabs in each of these sections. We’ll talk about the other tabs
shortly; however, we’ll note now that in the bottom right-hand section, there is a “Plots” tab. If you
were to click on that you would simply see an empty blank space because you haven’t made any
plots yet. However, when you do the project you’ll be generating a plot. The plot you create will
show up in this tab.

Plots RStudio Cloud

Data Science Project in RStudio Cloud


Now that you’re a little familiar with RStudio Cloud, we can get started on using the data you
collected from Leanpub and entered into your Google Sheet spreadsheet. We’ll then be one step
closer to answering the project question

“How does readership of a bestselling book relate to how much the author is charging for
that book?”

To start working in RStudio Cloud, open up a new tab by holding ctrl and pressing t, then copy this
URL and paste it into your web browser http://bit.ly/cbds_projects⁸⁷. If you get a log in page, press
the button to “Log in with Google” just like you did when you were setting up your account.
You should now see a page that looks like this. You should see a Project listed that is called
“leanpub_project”.
⁸⁷http://bit.ly/cbds_projects

RStudio Cloud projects home page

On the right-hand side, you should see an icon to “Copy” the project. Click on this icon.

RStudio Cloud new project

You should now see a page that looks like this across the top.

RStudio Cloud project page

You’ll first want to title your project. Click on ‘leanpub_project’ at the top and begin typing. Title it
with ‘leanpub_project_lastname’. So, for example if your last name were Doe, the project would be
titled ‘leanpub_project_doe’. You’re ready to get going!

RStudio Cloud project named

You are now using the RStudio software! The first thing that you should do is go to the bottom right
hand side of the screen and click on the file called “leanpub_googlesheets_analysis.R”.

RStudio Cloud project R file

This should open up a file full of code in the top left-hand portion of the screen. Your screen should
now look like this.

RStudio Cloud project page with script open

This file already has computer code in it. That computer code will read the data from the Google
Sheet you have created and make a plot. If you scroll through this code you will see lines that start
with “#”. Any line in the code that starts with a pound sign (#) is a comment. This is text
that is added to explain to anyone looking at the code what the code does. The rest of the text in
this file tells the computer what to do. Using this code, we’ll do a few things:

1. Get things set up. The details aren’t important now, but we’ll definitely get into them later in
the series.
2. Read in the Google Sheet you generated.
3. Check to make sure that the data are in the correct format.
4. Make a plot that will look at the relationship between the number of readers and minimum
price for Leanpub books.
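
To give a feel for what steps 2 through 4 involve, here is a minimal sketch. It is written in Python, not R, and it is not the contents of the “leanpub_googlesheets_analysis.R” file. The name read_points is hypothetical, the data are typed in directly instead of being read from your Google Sheet, and the two “Made-Up” rows are invented for illustration. Instead of drawing a plot, it prints the pairs of numbers a plot of readers against minimum price would show.

```python
import csv
import io

# Stand-in for the contents of a published sheet (rows A and B are invented).
SHEET_TEXT = """title,readers,suggested,minimum
PowerShell 101,1036,15.99,7.99
Made-Up Book A,NA,20,0
Made-Up Book B,250,9.99,4.99
"""


def read_points(text):
    """Return (minimum price, readers) pairs, skipping books marked NA."""
    points = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["readers"] == "NA" or row["minimum"] == "NA":
            continue
        # Converting to numbers fails loudly if a cell holds "$" or ",".
        points.append((float(row["minimum"]), int(row["readers"])))
    return points


for minimum, readers in read_points(SHEET_TEXT):
    print(f"minimum price {minimum:.2f} -> {readers} readers")
```

This is also why the formatting rules from the Google Sheets lesson matter: a stray dollar sign or comma would stop the numbers from being read as numbers.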

In the future, you’ll learn how to write this code. For now, all the code is available to you. All you
should have to do to make this work is copy the public URL for the Google Sheet that you made
in the last chapter of the course. To do this, scroll through the code in the top left-hand panel of
RStudio Cloud. Find the place in the computer code that says
“PASTE_YOUR_GOOGLE_SHEET_LINK_HERE!”.

rstudio.cloud with leanpub_googlesheets_analysis.R with PASTE_YOUR_GOOGLE_SHEET_LINK_HERE!

Delete ‘PASTE_YOUR_GOOGLE_SHEET_LINK_HERE!’ and paste your URL.


One thing to keep in mind is that when you copy the URL from the top of your Google Sheet OR
from the blue ‘Share’ button at the top right-hand side of the screen, the link will have a little extra
information at the end. After pasting the copied URL into the code, you’ll want to delete the tail-end
of the URL starting at ‘/edit’. Below you will see what should be included in the pink box at top or
the pink text of the link below. Everything after ‘/edit’ should be deleted.
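
The trimming step can be sketched in a few lines. This is an illustration in Python (not R), the function name trim_sheet_url is hypothetical, and EXAMPLE_SHEET_ID is a placeholder rather than a real spreadsheet. The sketch follows the “starting at ‘/edit’” reading above, so ‘/edit’ itself is removed too. Check your result against the figure below.

```python
def trim_sheet_url(url):
    """Drop '/edit' and everything after it from a copied Google Sheets link."""
    position = url.find("/edit")
    if position == -1:
        # Nothing to trim: the link has no '/edit' part.
        return url
    return url[:position]


copied = "https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID/edit?usp=sharing"
print(trim_sheet_url(copied))
# https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID
```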

RStudio Cloud with URL edited

Your code should look something like this now:



RStudio Cloud with leanpub_googlesheets_analysis.R with personal URL

Now you should be ready to run your code! You can do so all at once by highlighting all the code
in the “leanpub_googlesheets_analysis.R” script. Then, you would find the button that says “Run”
at the top of the code file and click on that button.

RStudio Cloud run code

You should see code running in the bottom left-hand panel. As code runs, there will be some output
in red text, letting you know that the code is running. Red text by itself does not mean anything is
wrong. In RStudio, red text is sometimes an error, while other times it is just providing you
with information. If it says error, then it’s an error. Otherwise, don’t be alarmed that red text is
appearing on your screen. If the code runs, a plot should appear on the lower right hand side.

RStudio Cloud with plot

If a plot does not show up, there are errors.


The first place to check for errors is in your ‘leanpub_googlesheets_analysis.R’ code file. Errors in
code formatting in RStudio are marked by a red ‘X’ to the left of any code lines that have errors. For
example, if you copied and pasted your Google Sheets link but accidentally deleted the closing
quotation mark before the final parenthesis, a red X would show up, showing you which line has the coding
error that needs to be fixed.

RStudio Cloud code error in leanpub_googlesheets_analysis.R

If you don’t see any red Xs in your code, there is likely an error with how you formatted your
spreadsheet. The errors will appear in the bottom left-hand Console panel. Scroll through the text
there to see if any of the error messages help point you to what mistake may have been made. Then,
edit your spreadsheet in Google Sheets and re-run all the code.

RStudio Cloud after running the code in leanpub_googlesheets_analysis.R

Once you have your plot, you have what you need to make the Google Doc and finish your project
in the next lesson. Keep this tab open so that you can copy your plot in the next lesson!

Slides and Video

View this Video at https://www.youtube.com/watch?v=X5HWXAPGhIk⁸⁸.


RStudio Cloud

• Slides⁸⁹

Take this quiz online⁹⁰


⁸⁹https://docs.google.com/presentation/d/1FFaIAQO7qtUANdHApu4fFCcB0KT9FNo5oQCWLULqsdY/edit?usp=sharing
⁹⁰http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_07_rstudio
Google Docs
Like Google Sheets, Google Docs is an online and free program, but instead of being used to create
and edit spreadsheets, Google Docs is a place where you can create and edit word documents.
Google Docs is similar to Microsoft Word, for those who are familiar. For those who aren’t, Google
Docs is somewhere you can type notes or any type of text. Like Google Sheets, Google Docs will
automatically save your work anytime you are connected to the Internet.

Getting started in Google Docs


To begin, you’ll open a new tab by holding ctrl and pressing t. In the web address bar you’ll
go to ‘www.docs.google.com’. This will bring you to the homepage for Google Docs. To get started
with a blank document you’ll click on the ‘Blank’ document at the top left of the screen.

Google Docs Home

This will open a Blank Google Doc. You’re now ready to get started working in Google Docs.

Blank Google Doc

Using Google Docs


You could just start typing in the document; however, for your Leanpub data science project, you’ll
want to first change the name of the document. To do so, click at the top of the document where it
says ‘Untitled document’ and type leanpub_project_lastname.
So if your last name were Doe, the title of this document would be ‘leanpub_project_doe.’

Google Doc named project

In this document, you’ll want to include a short summary about what question you were asking,
what data you collected, and where these data were collected from in a section titled “Summary”.
You’ll then want to paste your results and explain what you see in the plot you generated in a
“Results” section. Finally you’ll conclude how the readership of a bestselling book relates to how much
the author is charging for that book in a “Conclusion” section.

Google Doc Report sample

In order to get the plot to paste into your report, you’ll start a new tab by typing ctrl and t at the
same time and going to http://bit.ly/cbds_projects. You should see your project here. You will click
on that project. The analysis you already carried out will be here. To copy the plot you generated,
click on ‘Export’ in the ‘Plots’ tab in the bottom right-hand of the RStudio window.

Export in rstudio.cloud

Then, click on ‘Copy to Clipboard.’



Copy to Clipboard

Your plot will pop up in a new window.



Plot in rstudio.cloud

With your cursor over the plot that pops up, you will then tap the mouse keypad with two fingers
at the same time to bring up a new menu. On this menu, select, ‘Copy Image.’

Copy image in rstudio.cloud

You can now return to Google Docs, place your cursor where you’d like the plot to go, tap the mouse
keypad with two fingers at the same time to bring up a new menu, and click ‘Paste’ to paste your
plot from RStudio Cloud in your Google Doc.

Paste in Google Docs

Your plot will now be in your Google Doc!



Plot in Google Doc

Sharing Your Google Doc


Google Docs are helpful to data scientists because they will not only allow you to keep notes about
what you have done and what you have found from your analyses, but they will allow you to share
this information with people you work with, which is critically important to data scientists.
So, the last thing we’ll do is get the link so that you can share this document with others. You will
paste this link in the quiz at the end of this section. To get this link, you’ll click on the blue ‘Share’
button on the top right-hand section of the Google Doc.

Share Google Doc

A ‘Share with others’ box will pop up in the middle of the screen. Click on ‘Get shareable link.’

Shareable Link

A new screen will pop up informing you that your link has been copied. This is the link you will
paste by pressing ctrl and v in the quiz below when asked for your Google Doc link. Congrats!
You’ve completed your first report from a data science project!

Slides and Video

View this Video at https://www.youtube.com/watch?v=Mnx3csT4P8Q⁹¹.


Google Docs

• Slides⁹²

Take this quiz online⁹³


⁹²https://docs.google.com/presentation/d/13arBfuP1WFhTca0XCZNMBB7G1gxn8ZCJhpxsqdpDz_A/edit?usp=sharing
⁹³http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_08_googledocs
Google Slides
Google Slides is the last Google product we will discuss today. It is a place to make free, online
presentations. Similar to Microsoft PowerPoint, Google Slides allows users like you to create
presentations to communicate findings from a project to a general audience. Unlike
PowerPoint, Google Slides saves your work automatically whenever you are online, and it’s free!
As a data scientist, you are frequently required to present your findings. Sometimes that is in the
form of a report, such as the Google Doc you made in the last section. However, very often, you are
required to make a slide presentation. Here, we will discuss what a slide presentation is and some
features of a good slide presentation before you make a brief presentation about your Leanpub data
science project. Slide presentations are often used to present to a group of people. You as the analyst
would be explaining what is on the slides out loud and using what is on the slides to support what
you are saying. This means that every detail of the analysis does not have to be on the slide. You
can use what you say out loud to fill in the details.

What is a slide presentation?


A slide presentation is a presentation that consists of multiple slides. Each slide is there to get a
smaller message across so that when all the slides are viewed in order, they tell a story. As data
scientists, the story we are most often telling is the story of an analysis. This typically starts with a
question, then explains the data collected to answer the question, followed by an explanation of the
analysis, and concludes with the results and conclusions drawn from the analysis. Each part of this
story will typically have at least one slide to explain that part of the analysis. Here we will discuss
how to get started in Google Slides and then make a short presentation.

Presentation Guidelines
We’ll go into more details later, but there are three things to keep in mind anytime you are making
a slide presentation:

1. Pictures are often better than words.
2. Minimize the number of words on each slide.
3. Make the font and pictures big enough to be seen when the presentation is projected.

Getting started in Google Slides


To get started, open a new tab in your browser by holding down ctrl and t at the same time and
going to ‘slides.google.com’. This will bring you to the Google Slides home page. To start a new
presentation, click on the “Blank” presentation at left.

Google Slides Home

This will open up a blank and simple slide where you can begin to work on your presentation.

Google Slides Blank presentation

Similar to the Google Doc you created, you’ll want to rename this file. To do so, click on ‘Untitled
presentation’ in the top left-hand corner of the presentation. Again, title this slideshow using your
last name. For example, if your last name were Doe, you would title this ‘leanpub_presentation_doe.’
You’re now ready to get started working on your first slide!

Google Slides rename file

Making a simple presentation


Keeping the presentation guidelines we discussed earlier in mind, you’ll now start to make a short
(approximately 4 slides) Google Slides presentation. To begin, you’ll want a title slide. To change the
title, click on the slide where it says ‘Click to add title’ and then begin to type the title of your slide
presentation.

Google Slides Blank slide

A reasonable title would be ‘Leanpub Data Science Project.’ You’d then want to include who did the
analysis as a subtitle. By clicking ‘Click to add subtitle’ you can then include your name on your
presentation.

Google slide with title and name

If you wanted to change the font size of any of the text to make it bigger or smaller, you would
highlight that text and then click on the font size at the top of the menu to display a drop down
menu. Font size can be selected from this list or typed in that box directly.

Change font size on Google Slides

You can use a similar process of highlighting text and then selecting from the toolbar to change
formatting in a number of other ways. You can change the font of the text, make the text bold,
italicize the text, underline the text, or change the color of the text as well.

other formatting options on slides

Once you’re happy with how your title slide looks, you’ll want to start working on the next slide
in your presentation. To start the next slide, you’ll click the plus sign at the top left-hand portion of
your Google Slides presentation.

Google Slides new slide

A second slide in your presentation will appear. You can add text to this slide the same way you did
on the title slide. Pictures can also be copied and pasted into your Google Slides.

Google Slides second slide

You will want to create a Google Slides presentation with approximately four slides summarizing
the Leanpub data science project you have been working on. These slides should include

• Title slide
• The question you were asking in your data science project
• Information about how the data were collected, where the data came from, and what data were
collected
• The results (including your plot!) and conclusions from your analysis

Sharing your Google Slides Presentation


Once you’ve finished creating your Google Slides presentation, you’ll want to make sure it can be
shared with others. This will be the same process as in Google Sheets and Google Docs. To make the
presentation shareable, you will first click the blue ‘Share’ button at the top right-hand side of the
slide.

Google Slides share

A ‘Share with others’ window will pop up. Here you will click ‘Get shareable link’.

Google Slides Get shareable link

This will bring up a new box indicating that your link has been copied. This is the link you will
paste when the quiz at the end of this lesson asks for your Google Slides link.

Google Slides shared

Slides and Video

View this Video at https://www.youtube.com/watch?v=mLPBZUS-AtY⁹⁴.


Google Slides

• Slides⁹⁵
⁹⁵https://docs.google.com/presentation/d/1sjOuMmP1oXuqvTMeKlAoOSCqD-TOncWraD67b_pzrUE/edit?usp=sharing

Take this quiz online⁹⁶


⁹⁶http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_09_slides
DataCamp
DataCamp is a website for learning about the R programming language⁹⁷. We will be using
DataCamp because:

• DataCamp allows you to practice R without having to set up all the software
• DataCamp has courses covering a broad range of topics we are going to cover in the courses

Logging on to DataCamp
You signed on to DataCamp in an earlier lesson, and you will simply repeat that process now:
go to ‘www.datacamp.com’ and click the ‘Google+’ logo to log on.

DataCamp Home Page

From here you will be prompted to choose a Google account.


⁹⁷https://www.r-project.org/

Google Account Sign in

You will then be on the DataCamp home page. On this page, you will click on ‘Learn’ from the menu
across the top.

DataCamp Learn

This will open a drop-down menu. From this menu, you will select ‘Introduction to R’ from the
Course listings on the left-most column.

DataCamp Introduction to R

Introduction to R
This will bring you to the course page for ‘Introduction to R.’ Here you will click ‘Start Course For
Free.’

DataCamp Start Course for Free

This will open up the DataCamp course. This layout will be used throughout the course and should
look somewhat familiar. It is similar to RStudio Cloud in that you have a place where you will write
your code (SCRIPT.R) and a place where that code will run (R CONSOLE). However, DataCamp
is different in that it has lessons and exercises to help teach you how to code in the programming
language R.
The information you need to learn will always be on the left side of the DataCamp window. At the
top there will be an ‘EXERCISE.’ The text in here will explain what you need to know to complete
the lesson.

DataCamp Exercise

Below the ‘EXERCISE’ is the ‘INSTRUCTIONS’ section. This window will include the specific
instructions for what you will need to do before continuing on to the next part of the course.

DataCamp Instructions

If you scroll through this part of the window, you will notice a ‘Take Hint’ button that you can click
on. You’ll always want to try the exercise without taking a hint; however, if you get stuck, clicking
on ‘Take Hint’ may help you.

DataCamp Take Hint

Now that you know where to find instructions, you’re ready to start learning how to code. All code
will be written in the SCRIPT.R portion of the DataCamp window in the top right-hand portion of
the screen.

DataCamp Script.R

The code you write will then execute, or be carried out, in the R CONSOLE in the bottom right-hand
corner of the screen.

DataCamp R Console

In order to run a line of code, you can first highlight the line you want to run. You then click on
‘Run Code.’ This will send the code to the console to execute. In this example, you will see that R
acts as a calculator. When you run the code ‘3 + 4’ in the R Console, you get back that the answer is
‘7.’
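
As a minimal sketch of the calculator behavior described above, the lines below could be typed into SCRIPT.R and run. The variable name sum_result is a hypothetical choice made for illustration; it is not part of the DataCamp exercise.

```r
# R evaluates arithmetic directly, like a calculator.
3 + 4

# The result can also be stored in a variable (sum_result is a
# hypothetical name used only for this illustration) and printed.
sum_result <- 3 + 4
print(sum_result)
```

Highlighting either line and clicking ‘Run Code’ sends it to the console, where the answer 7 is displayed.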

DataCamp Run Code

Once you’ve completed the task asked of you in the instructions section and clicked ‘Run Code’ to
test your answer, you can then click ‘Submit Answer.’

DataCamp Submit Answer

If your response is correct, a message will pop up on the left side of the screen to let you know
that you’re ready to continue on to the next section of the course. Press ‘Enter’ to continue.

DataCamp Continue

Completing your First Course


The information here gets you started on DataCamp’s Introduction to R course; however, it is your
job now to finish it. DataCamp estimates that this course should take approximately four hours.
It’s OK if it takes longer or shorter than that, and we realize you will likely not do it all at
once. The great thing is that DataCamp will remember where you left off, so when you log back on to
DataCamp you can pick up right there!

Slides and Video

View this Video at https://www.youtube.com/watch?v=Y1fv_xVPjkA⁹⁸.


DataCamp

• Slides⁹⁹

Take this quiz online¹⁰⁰

⁹⁹https://docs.google.com/presentation/d/1Kgpmw00v_OjhhXkf_ULGV4pWIJjNuu3Sukmd2aqbHUk/edit?usp=sharing
¹⁰⁰http://leanpub.com/courses/jhu/cbds-intro/quizzes/quiz_10_datacamp
