School of Data is re-publishing Noah Veltman’s Learning Lunches, a series of tutorials that
demystify technical subjects relevant to the data journalism newsroom.
This first Learning Lunch is about the database query language SQL and how it compares to
Excel.
https://schoolofdata.org/2013/11/07/sql-databases-vs-excel/ 1/11
3/2/2018 SQL: The Prequel (Excel vs. Databases) | School of Data - Evidence is Power
It lacks data integrity. Because every cell is a unique snowflake, things can get very
inconsistent. What you see doesn’t necessarily represent the underlying data. A
number is not necessarily a number. Data is not necessarily data. Excel tries to make
educated guesses about what you want, and sometimes it’s wrong.
It’s not very good for working with multiple datasets in combination.
It’s not very good for answering detailed questions with your data.
It doesn’t scale. As the amount of data increases, performance suffers, and the visual
interface becomes a liability instead of a benefit. It also has fixed limits on how big a
spreadsheet and its cells can be.
Collaborating is hard. It’s hard to control versions and have a “master” set of data,
especially when many people are working on the same project (Google Spreadsheets
fix some of this).
The querying is where SQL comes in, emphasis on the Q. SQL stands for Structured Query
Language, and it is a syntax for requesting things from the database. It’s the language the
reference librarian speaks. More on this later.
The “relational” part is a hint that these databases care about relationships between data.
And yes, there are non-relational databases, but let’s keep this simple. We’re all friends
here.
Every database consists of tables. Think of a table like a single worksheet in an Excel file,
except with more ground rules. A database table consists of columns and rows.
Columns
Every column is given a name (like ‘Address’) and a defined column type (like ‘Integer,’
‘Date’, ‘Date+Time’, or ‘Text’). You have to pick a column type, and it stays the same for
every row. The database will coerce all the data you put in to that type. This sounds
annoying but is very helpful. If you try to put the wrong kind of data in a column, it will get
upset. This tells you that there’s a problem with your data, your understanding of the data,
or both. Excel would just let you continue being wrong until it comes back to bite you.
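Here is a minimal sketch of that behavior using Python's built-in sqlite3 module. The table layout and the sample values are assumptions for illustration. One caveat: SQLite is more forgiving about types than most databases, so the sketch adds a CHECK rule to make it refuse a non-integer height the way a stricter database would on its own.

```python
import sqlite3

# In-memory SQLite database; the table layout is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE athletes (
        name   TEXT,
        weight REAL,
        -- SQLite is lenient about types, so this CHECK makes it refuse non-integers.
        height INTEGER CHECK (typeof(height) = 'integer')
    )
""")

# The text '74.8' is coerced to a number because the column type is REAL.
conn.execute("INSERT INTO athletes VALUES (?, ?, ?)", ("Andrew Leimdorfer", "74.8", 180))
stored_type = conn.execute("SELECT typeof(weight) FROM athletes").fetchone()[0]
print(stored_type)

# 'tall' is not a height, so the database gets upset instead of storing it.
try:
    conn.execute("INSERT INTO athletes VALUES (?, ?, ?)", ("Athlete X", 72.0, "tall"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The error is the useful part: it shows up at the moment the bad value arrives, not weeks later in a chart.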
Rows
Rows are the actual data in the table. Once you establish the column structure, you can
add in as many rows as you like.
Every row has a value for every column. Excel is a visual canvas and will allow you to create
any intricate quilt of irregular and merged cells you’d like. You can make Tetris shapes and
put a legend in the corner and footnotes at the bottom, all sharing the same cells.
This won’t fly with a database. A database table expects an actual grid. It’s OK for cells to
be empty, but to a computer intentionally empty is not the same as nonexistent.
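A small sketch of what "empty but still there" looks like in practice, again with Python's sqlite3 module and an assumed three-column table. Only the name is supplied, yet the row still comes back with a slot for every column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified table: three columns only.
conn.execute("CREATE TABLE athletes (name TEXT, country TEXT, height INTEGER)")

# Only the name is supplied; the other two cells still exist, they are just empty.
conn.execute("INSERT INTO athletes (name) VALUES ('Andrew Leimdorfer')")

row = conn.execute("SELECT * FROM athletes").fetchone()
print(row)  # ('Andrew Leimdorfer', None, None)
```

Python shows the empty cells as None; in SQL itself they are called NULL.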
is usually really good at what it does do, but it also often needs to be paired with other
things in order to create a final product, like a chart or a web page. A database is
designed to plug in to other things. This extra step is one of the things that turns a lot
of people off to databases.
By being really good at data storage and processing and not at other things, databases
are extremely scalable. 1 million rows of data? 10 million? No problem. For newsroom
purposes, there’s virtually no upper limit to how much data you can store or how
complicated you can make your queries.
When it comes to news features, a database usually gets involved when the amount of
data is large or it is expected to change over time. Rather than having a static JSON file
with your data, you keep a database and you write an app that queries the database for
the current data. Then, when the data changes, all you have to do is update the database
and the changes will be reflected in the app. For one-off apps where the data is not going
to change and the amount is small, a database is usually overkill, although you may still
use one in the early stages to generate a data file of some sort.
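As a rough sketch of that pattern, the snippet below asks the database for the current data each time and turns it into JSON. The function name `current_data_json`, the two-column table, and the placeholder athletes are all assumptions made up for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
conn.execute("CREATE TABLE athletes (name TEXT, country TEXT)")
conn.execute("INSERT INTO athletes VALUES ('Athlete A', 'Great Britain')")


def current_data_json(connection):
    """Query the database and return whatever is in it right now as JSON."""
    rows = connection.execute("SELECT name, country FROM athletes ORDER BY name")
    return json.dumps([dict(row) for row in rows])


before = current_data_json(conn)

# The data changes: update the database, and the next request reflects it.
conn.execute("INSERT INTO athletes VALUES ('Athlete B', 'Canada')")
after = current_data_json(conn)

print(before)
print(after)
```

For a one-off project you could write the output of `current_data_json` to a file once and never touch the database again.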
If you’re using an API to feed current data into an app instead, you’re still using a database,
you’re just letting someone else host it for you. This is much easier, but also poses risks
because you access the data at their pleasure.
Potential semantic quibble: sometimes an app isn’t directly accessing a database to fetch
information. Sometimes it’s accessing cached files instead, but those cached files are
generated automatically based on what’s in the database. Tomato, Tomahto.
Option 1: SQLite
SQLite is a good way to get started. You can install the “SQLite Manager” add-on for
Firefox and do everything within the browser.
MySQL and PostgreSQL are two popular ones with lots of documentation, don’t overthink
it for now).
The basic building blocks of SQL are four verbs: SELECT (look up something), UPDATE
(change some existing rows), INSERT (add some new rows), DELETE (delete some rows).
There are many other verbs but you will use these the most, especially SELECT.
Let’s imagine a table called athletes of Olympic athletes with six columns:
name
country
birthdate
height
weight
gender
When creating our table, we might also specify things like “country can be empty” or
“gender must be either M or F.”
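Those ground rules might be written roughly like this. It is a sketch run through Python's sqlite3 module; the exact column types and the NOT NULL rule on name are assumptions, not something the table above dictates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE athletes (
        name      TEXT NOT NULL,   -- assumption: every athlete needs a name
        country   TEXT,            -- no NOT NULL, so country can be empty
        birthdate DATE,
        height    INTEGER,
        weight    REAL,
        gender    TEXT CHECK (gender IN ('M', 'F'))
    )
""")

# Country left empty: allowed.
conn.execute(
    "INSERT INTO athletes (name, gender) VALUES (?, ?)", ("Andrew Leimdorfer", "M")
)

# Gender outside M/F: refused by the CHECK rule.
try:
    conn.execute("INSERT INTO athletes (name, gender) VALUES (?, ?)", ("Athlete X", "Q"))
    accepted = True
except sqlite3.IntegrityError:
    accepted = False

row_count = conn.execute("SELECT COUNT(*) FROM athletes").fetchone()[0]
print(accepted, row_count)
```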
Query 1: Get a list of all the athletes in alphabetical order. This will show you the
entire table, sorted by name from A-Z.
SELECT
*
FROM athletes
ORDER BY name ASC
Query 2: Get a list of all the athletes on Team GB. This will show you only the rows
for British athletes. You didn’t specify how to sort it, so you can’t count on it coming
back in the order you want.
SELECT
*
FROM athletes
WHERE country = 'Great Britain'
Query 3: What country is the heaviest on average? This will take all the rows and
put them into groups by country. It will show you a list of country names and the
average weight for each group.
SELECT
country, AVG(weight)
FROM athletes
GROUP BY country
Query 4: What birth month produces the most Olympic athletes? Maybe you want
to test an astrological theory about Leos being great athletes. This will show you
how many Olympic athletes were born in each month.
SELECT
MONTHNAME(birthdate),COUNT(*)
FROM athletes
GROUP BY MONTHNAME(birthdate)
Query 5: Add a new athlete to the table. To insert a row, you specify the columns
you’re adding (because you don’t necessarily need to add all of them every time),
and the value for each.
INSERT
INTO athletes
(name,country,height,weight,gender)
VALUES ('Andrew Leimdorfer','Great Britain',180,74.8,'M')
Query 6: Get all male athletes in order of height:weight ratio. Maybe you would
notice something strange about Canadian sprinter Ian Warner.
SELECT
*
FROM athletes
WHERE gender = 'M'
ORDER BY height/weight ASC
Query 7: If you got your data from london2012.com, you would think 5′ 7″ Ian
Warner was 160 kg, because his weight was probably entered in lbs instead. Let’s fix
that by UPDATEing his row.
UPDATE
athletes
SET weight = weight/2.2
WHERE name = 'Ian Warner'
Query 8: Delete all the American and Canadian athletes, those jerks.
DELETE
FROM athletes
WHERE country = 'United States of America' OR country = 'Canada';
Once you look at enough queries you’ll see that a query is like a sentence with a grammar.
It has a “verb” (what kind of action do I want?), an “object” (what tables do I want to do the
action to?), and optional “adverbs” (how specifically do I want to do the action?). The
“adverbs” include details like “sort by this column” and “only do this for certain rows.”
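If you want to see the four verbs run end to end, here is a sketch using Python's sqlite3 module. The athletes named "Athlete A", "B" and "C" and their numbers are invented placeholders; Ian Warner's row only reuses the figures mentioned above (roughly 170 cm, weight entered as 160). One assumption to flag: MONTHNAME() is a MySQL function, so the sketch uses SQLite's strftime('%m', ...) instead, which returns the month as a two-digit number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE athletes "
    "(name TEXT, country TEXT, birthdate TEXT, height INTEGER, weight REAL, gender TEXT)"
)
# Placeholder rows, invented for this sketch.
conn.executemany(
    "INSERT INTO athletes VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Athlete A", "Great Britain", "1988-08-03", 180, 75.0, "F"),
        ("Athlete B", "Great Britain", "1991-08-20", 175, 70.0, "M"),
        ("Athlete C", "United States of America", "1990-02-11", 190, 88.0, "M"),
        ("Ian Warner", "Canada", None, 170, 160.0, "M"),
    ],
)

# SELECT: athletes per birth month (SQLite stand-in for MONTHNAME).
by_month = conn.execute(
    "SELECT strftime('%m', birthdate) AS month, COUNT(*) FROM athletes "
    "WHERE birthdate IS NOT NULL GROUP BY month ORDER BY month"
).fetchall()

# INSERT: add a row, naming only the columns we have values for.
conn.execute(
    "INSERT INTO athletes (name, country, height, weight, gender) "
    "VALUES ('Andrew Leimdorfer', 'Great Britain', 180, 74.8, 'M')"
)

# UPDATE: convert the weight that was entered in lbs.
conn.execute("UPDATE athletes SET weight = weight / 2.2 WHERE name = 'Ian Warner'")
warner_weight = conn.execute(
    "SELECT weight FROM athletes WHERE name = 'Ian Warner'"
).fetchone()[0]

# DELETE: remove rows matching a condition.
conn.execute(
    "DELETE FROM athletes "
    "WHERE country = 'United States of America' OR country = 'Canada'"
)
remaining = [r[0] for r in conn.execute("SELECT name FROM athletes ORDER BY name ASC")]

print(by_month, round(warner_weight, 1), remaining)
```

Each statement has the same shape: a verb, a table, and then the details that narrow it down.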
The typical Excel approach to this is one of two forms of madness: either you have a
bunch of spreadsheets and painstakingly cross-reference them, or you have one mega-
spreadsheet with every column (government data sources love mega-spreadsheets).
You might have a spreadsheet where each row is an athlete, and then you have a long list
of columns including lots of redundant and awkwardly stored information, like:
You lose the benefit of visually browsing the information once a table gets this big.
It’s a mess.
This structure is very inflexible. The law of mega-spreadsheets states that once you
arbitrarily define n columns as the maximum number of instances a row could need,
something will need n+1.
It has no sense of relationships. Athletes are one atomic unit here, but there are
others. You have countries, you have events (which belong to sports), you have results
(which belong to events), you have athletes (which compete in events, have results in
those events, and almost always belong to countries). These relationships will probably
be the basis for lots of interesting stories in your data, and the mega-spreadsheet
does a poor job of accounting for them.
Analysis is difficult. How do you find all the athletes in the men’s 100m dash? Some
of them might have their time in event1_result, some of them might have it in
event2_result. Have fun with those nested IF() statements! And if any of this data is
manually entered, there’s a good chance you’ll get textual inconsistencies between
things like “Men’s 100m,” “Men’s 100 meter,” and “100m.”
SQL lets you keep these things in a bunch of separate tables but use logical connections
between them to smoothly treat them as one big set of data. To combine tables like this,
you use what are called JOINs. In a single sentence: a JOIN takes two tables and overlaps
them to connect rows from each table into a single row. JOINs are beyond the scope of
this primer, but they are one of the things that make databases great, so a brief example
is in order.
You might create a table for athletes with basic info like height & weight, a table for events
with details about where and when it takes place and the current world record, a table for
countries with information about each country, and a table for results where each row
contains an athlete, an event, their result, and what medal they earned (if any). Then you
use joins to temporarily combine multiple tables in a query.
-- Gold medal winners in events taking place today
SELECT
athletes.name, athletes.country, events.name
FROM athletes
JOIN results ON results.athlete_id = athletes.id
JOIN events ON events.id = results.event_id
WHERE events.date = DATE(NOW()) AND results.medal = 'Gold';
-- Medal count for each country
SELECT
countries.name, COUNT(*)
FROM athletes
JOIN results ON results.athlete_id = athletes.id
JOIN events ON events.id = results.event_id
JOIN countries ON countries.id = athletes.country_id
WHERE results.medal IN ('Gold','Silver','Bronze')
GROUP BY countries.id, countries.name;
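To try a join yourself, here is a small self-contained sketch of the medal-count idea using Python's sqlite3 module. The four tables follow the layout described above, but the exact columns, the placeholder athletes, and their results are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE athletes  (id INTEGER PRIMARY KEY, name TEXT, country_id INTEGER);
    CREATE TABLE events    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE results   (athlete_id INTEGER, event_id INTEGER, medal TEXT);

    -- Invented sample data.
    INSERT INTO countries VALUES (1, 'Great Britain'), (2, 'Canada');
    INSERT INTO athletes  VALUES (1, 'Athlete A', 1), (2, 'Athlete B', 1), (3, 'Athlete C', 2);
    INSERT INTO events    VALUES (1, '100m'), (2, '200m');
    INSERT INTO results   VALUES
        (1, 1, 'Gold'), (3, 1, 'Silver'), (2, 1, NULL),
        (2, 2, 'Bronze'), (1, 2, NULL);
""")

# Each JOIN lines up rows from two tables wherever the ids match.
medal_table = conn.execute("""
    SELECT countries.name, COUNT(*) AS medals
    FROM athletes
    JOIN results   ON results.athlete_id = athletes.id
    JOIN countries ON countries.id = athletes.country_id
    WHERE results.medal IN ('Gold', 'Silver', 'Bronze')
    GROUP BY countries.id, countries.name
    ORDER BY medals DESC
""").fetchall()

print(medal_table)  # [('Great Britain', 2), ('Canada', 1)]
```

Nothing is copied between tables here. The country name lives in one place, and the join looks it up when the question is asked.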
WRITTEN BY
Noah Veltman
Noah Veltman is a 2013 Knight-Mozilla OpenNews Fellow and a developer/data journalist on the BBC
Visual Journalism team. Some of his projects can be found here: http://noahveltman.com/sandbox/