Fall 2016
Why should you take this class?
What is this class about?
Who are we?
How will this class work?
Why?
Making connections?
Friends: What are they doing? What have they done?
Love: Finding love through Data + Algorithms
http://www.data4sdgs.org
• Foundations of Transaction Processing
• Fault Tolerance
Jim Gray
Turing Award Winner
First Berkeley CS PhD
Experimental
Theoretical
Simulation
Data Intensive
Sloan Digital Sky Survey (SDSS) + Database Systems: Sky Server
https://www.domo.com/blog/2016/06/data-never-sleeps-4-0/
• Philosophy: Save all data, then figure out how to derive value from it
What: Summing up
No
A Syllogism of Quotes
• “Any user can change any entry, and if enough users agree with them, it becomes true” – Stephen Colbert
http://www.pnas.org/content/110/15/5802.full.pdf
So… Why?
Databases?
What is a Database?
Looks Like?
http://www.computerhistory.org/storageengine/first-commercial-hard-disk-drive-shipped
Is this a database?
Rolodex contains Contacts
Organized alphabetically
≈
Is this a database?
Facebook contains
• Contacts
• Events
• Posts
Organized …
Facebook
Collection of DBs and
“business logic”
Is this a database?
Flight Booking:
• Early application of database systems
Expedia
Collection of DBs and
“business logic”
What is a Database?
• Mature technologies …
DB Engines Ranking
Based on number of mentions (e.g., Stack Overflow), Google Trends, job postings, profile data on LinkedIn, tweets …
[Figure: Data Platforms Map, June 2015, showing a relational zone, a non-relational zone, and a grid/cache zone, with regions labeled "Towards E-discovery" and "Towards SIEM". Source: https://451research.com/dashboard/dpa]
• …
– Compositional Approach
You will be able to use existing & build new DBMS technologies!
• Transactions
– Concurrency, Consistency, and Recovery
Principles
• Declarative programming
Systems
Current topics
• Parallel databases
• Data Warehousing
• NoSQL
• Streaming computation
Summary
• Principles
• Systems Design
• Current topics
• Work:
– Data management at Scale: Microsoft SQL, Twitter
– Distributed systems
• Other interests:
– Hiking, Running, Snowboarding, Travel
• Work:
– Stream processing before it was hot
– Startup guy
• Other interests:
– Photography, Hiking, Road trips
TAs
Vikram Sreekanti
(lead TA)
• 2000s:
– Shift from “programs” to data-centric services
• More recently:
– End of the full-stack programmer
• Data Engineer
– Evolution of IT
How? Administrivia
• http://www.cs186berkeley.net
• Textbook
– Database Management Systems, 3rd Edition
• Ramakrishnan and Gehrke
– Suggested
• I wouldn’t buy any more textbooks
– read it regularly
How? Homework
• Graded on completion
• HW0: Assigned today, due this Monday 8/29
7. Spark Notebooks
• 1 final exam
1. Read something
2. Do something
3. Write something
4. GOTO 1
CPU
“Out-of-Core” algorithms.
Scaling up
• Streaming
• Divide-and-Conquer
Simplifying Assumption
[Figure: INPUT and OUTPUT streams passing through RAM]
[Figure: INPUT fills an input buffer in RAM, f(x) is applied, and an output buffer drains to OUTPUT]
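The streaming picture above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stream_map`, the `buffer_size` parameter, and the example function `square` are hypothetical and do not come from the slides. The point is that memory use stays bounded by the two buffers no matter how long the input is.

```python
from typing import Callable, Iterable, Iterator, List


def stream_map(
    records: Iterable[int],
    f: Callable[[int], int],
    buffer_size: int = 4,  # hypothetical buffer capacity, in records
) -> Iterator[List[int]]:
    """Apply f to a stream while holding at most one input buffer
    and one output buffer in memory at a time."""
    input_buffer: List[int] = []
    for record in records:
        # 1. Read something: fill the input buffer.
        input_buffer.append(record)
        if len(input_buffer) == buffer_size:
            # 2. Do something: compute f(x) into an output buffer.
            # 3. Write something: hand the full output buffer downstream.
            yield [f(x) for x in input_buffer]
            input_buffer = []
    if input_buffer:
        # Flush the final, partially filled buffer.
        yield [f(x) for x in input_buffer]


def square(x: int) -> int:
    return x * x


# Each inner list is one flushed output buffer.
flushed = list(stream_map(range(10), square, buffer_size=4))
```

Because each record is handled independently of every other record, several copies of this loop could run over disjoint slices of the input without coordinating.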
Parallelize Me
[Figure: three copies of the input buffer, f(x), output buffer pipeline, each with its own INPUT, OUTPUT, and RAM, running side by side]
Unix Pipes
• e.g. “find students who got 100 on one assignment, and got 0 on no assignments”
Rendezvous
• Time-space Rendezvous
– in the same place (RAM) at the same time
– most of computing (and life?) is about this
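The student query is a small rendezvous problem: all of a student's scores have to meet in memory before the answer for that student is known. A minimal sketch, assuming records are `(student, assignment, score)` tuples and that the input has already been sorted by student (the record layout and the function name are illustrative assumptions, not taken from the slides):

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, int]  # (student, assignment, score), an assumed layout


def perfect_and_never_zero(sorted_records: Iterable[Record]) -> Iterator[str]:
    """Yield students with at least one score of 100 and no score of 0.

    The input must be sorted by student so that each student's records
    are adjacent, which is what lets them rendezvous in memory.
    """
    for student, group in groupby(sorted_records, key=itemgetter(0)):
        has_perfect = False
        has_zero = False
        for _, _, score in group:
            has_perfect = has_perfect or score == 100
            has_zero = has_zero or score == 0
        if has_perfect and not has_zero:
            yield student


records = [
    ("ann", "hw1", 100),
    ("bob", "hw1", 100),
    ("ann", "hw2", 80),
    ("bob", "hw2", 0),
    ("cy", "hw1", 90),
]
# Sorting the tuples orders them by student first.
result = list(perfect_and_never_zero(sorted(records)))
```

Only two booleans per student are kept in memory at any moment, so once the data is grouped the query itself is a single streaming pass.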
[Figure: DISK 1 and DISK 2 connected through B buffers in RAM, with INPUT (IN) and OUTPUT (OUT) buffers]
• Phase 1
– “streamwise” divide into N/(B-2) megachunks
– conquer each and write to disk
[Figure: N chunks on disk divided into megachunks of B-2 chunks each, processed through B buffers in RAM (IN, OUT) between DISK 1 and DISK 2]
• Phase 2
– a streaming algorithm over conquered megachunks.
– the streaming must ensure rendezvous
• but across rendezvous groups, order still immaterial!
• We will see concrete examples shortly
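One familiar instance of this two-phase pattern is external sorting, used here purely as an illustration; the slides do not say which examples follow. In the sketch below, `chunk_size` is a hypothetical stand-in for the B-2 chunks that fit in memory, temporary files play the role of the disk, and Phase 2 assumes all sorted runs can be merged in a single pass.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _read_run(path: str) -> Iterator[int]:
    """Stream one sorted run back from disk, one value at a time."""
    with open(path) as fh:
        for line in fh:
            yield int(line)


def external_sort(values: Iterable[int], chunk_size: int) -> List[int]:
    """Two-phase sort: conquer megachunks in memory, then stream-merge them."""
    run_paths: List[str] = []
    chunk: List[int] = []

    def flush() -> None:
        # Phase 1, "conquer": sort one megachunk and write it to disk.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as fh:
            fh.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    # Phase 1, "divide": cut the input stream into megachunks.
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            flush()
    if chunk:
        flush()

    try:
        # Phase 2: a streaming merge over the sorted runs. heapq.merge holds
        # only one pending value per run in memory. The result is collected
        # into a list for convenience; a real system would stream it out.
        return list(heapq.merge(*(_read_run(p) for p in run_paths)))
    finally:
        for p in run_paths:
            os.remove(p)


sorted_values = external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)
```

The merge is what enforces rendezvous in this case: the smallest remaining values from every run meet in memory at the moment they are needed.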
Parallelize Me?
• Phase 1
[Figure: three workers, each with its own B buffers (IN, OUT)]
Parallelize Me?
[Figure: six workers arranged in two columns, each with its own B buffers (IN, OUT)]
Summing Up 1
Summing Up 2
Life of a Query
Buffer Management
Up Next