
Privacy Risks from Mining

Online Social Networks

Alessandro Acquisti
(with Ralph Gross)
Heinz School and SCS
Carnegie Mellon University

NGDM 2007, October 10-12 2007


That is...
 Can we estimate social security numbers based on
data publicly available in online social networks?
 If so, what are the actual risks of identity theft for
social networks’ participants?

 One caveat
Agenda
 Online social networks (OSN)
 Social security numbers (SSN) and identity theft
 OSN as Breeding documents: Estimating SSN from OSN
data
 Approach
 Data
 Pattern discovery
 Estimation
 Results

 Conclusions
Online social networks
What are online social networks?

 Online social networks are sites that facilitate interaction
between members through their self-published personal
profiles
 Exciting new tools for interaction, socialization, and
self-representation
 However, participants reveal vast amounts of personal and
sometimes sensitive information to friends and strangers
alike
Possible privacy risks
 Previous research
 Gross and Acquisti, WPES 2005; Acquisti and Gross, PET 2006
 Studied usage of a campus-oriented OSN (The Facebook)
 Focused on patterns of information revelation and
members’ awareness of information practices
 Found a number of potential privacy risks
 Stalking
 Digital dossier
 Discrimination
 Re-identification
 From pseudonymous to identified
 From personal information (PI) to personal identifying information (PII)
Social security numbers (SSN) and identity theft
Social security numbers
and identity theft
 Social security numbers are unfortunately used for both
identification and authentication
 Identity theft is a crime in which an imposter obtains key
pieces of personal information and impersonates someone
else for illicit purposes
 Social security numbers are instrumental to identity theft
Anatomy of a social
security number
 Each SSN has 9 digits:
 XXX-YY-ZZZZ
 and is composed of three parts:
 Area number: XXX
 Group number: YY
 Serial number: ZZZZ
 The pattern of assignment/issuance of SSN is complex,
but fixed
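A minimal sketch of the three-part layout above. The names `SSNParts` and `split_ssn` are hypothetical helpers introduced only for illustration, and the sample number is made up:

```python
import re
from typing import NamedTuple


class SSNParts(NamedTuple):
    area: str    # XXX, first 3 digits
    group: str   # YY, middle 2 digits
    serial: str  # ZZZZ, last 4 digits


# Accepts "XXX-YY-ZZZZ" or the same nine digits without dashes.
_SSN_RE = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})$")


def split_ssn(ssn: str) -> SSNParts:
    """Split a 9-digit SSN string into area, group, and serial parts."""
    match = _SSN_RE.match(ssn.strip())
    if match is None:
        raise ValueError(f"not a 9-digit SSN: {ssn!r}")
    return SSNParts(*match.groups())


parts = split_ssn("123-45-6789")  # made-up number
print(parts.area, parts.group, parts.serial)
```

The parts are kept as strings so that leading zeros (e.g. group 01) are preserved.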
Area numbers
 First 3 digits
 A specific range of AN is assigned to each State
 520 (WY)
 159~211 (PA)
 135~158 (NJ)
 …
 Within each State, AN are assigned in ascending order
within the assigned range
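A small lookup sketch using only the three ranges listed above. The table is deliberately partial (the remaining States are elided on the slide), and the names `AREA_RANGES`, `state_for_area`, and `area_window` are assumptions made for illustration:

```python
from typing import Optional

# Partial table: only the ranges quoted on the slide.
AREA_RANGES = {
    "WY": (520, 520),
    "PA": (159, 211),
    "NJ": (135, 158),
}


def state_for_area(area: int) -> Optional[str]:
    """Return the State whose range contains this area number, if listed."""
    for state, (low, high) in AREA_RANGES.items():
        if low <= area <= high:
            return state
    return None


def area_window(state: str) -> range:
    """Return the window of possible area numbers for a listed State."""
    low, high = AREA_RANGES[state]
    return range(low, high + 1)


print(state_for_area(160))       # PA
print(len(area_window("NJ")))    # 24
```

A State with a single area number (such as WY here) leaves no uncertainty in the first three digits, while a State with a wide range leaves a window of candidates.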
Group numbers
 Second 2 digits
 GN are assigned within each State following a
non-monotonic yet ordered pattern:
 01-09: 01,03,05,07,09
 10-98: 10,12,14,…,96,98
 02-08: 02,04,06,08
 11-99: 11,13,15, …, 97,99
Serial numbers
 Last 4 digits
 SN are assigned in monotonically increasing order within
each State and within each group number
 From 0001 to 9999
Example of SSN
assignment sequence
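A sketch of the resulting sequence for a single area number, assuming groups advance in the order given earlier and serial numbers run from 0001 to 9999 within each group. The generator name `ssn_sequence` is hypothetical, and the model ignores everything the slides do not state (for example, how several area numbers within one State interleave):

```python
from itertools import islice
from typing import Iterator

# Group issuance order: odd 01-09, even 10-98, even 02-08, odd 11-99.
GROUP_ORDER = [
    *range(1, 10, 2), *range(10, 99, 2), *range(2, 9, 2), *range(11, 100, 2)
]


def ssn_sequence(area: int) -> Iterator[str]:
    """Yield SSNs for one area number in the assumed order of assignment."""
    for group in GROUP_ORDER:
        for serial in range(1, 10000):  # 0001 .. 9999
            yield f"{area:03d}-{group:02d}-{serial:04d}"


print(list(islice(ssn_sequence(520), 3)))
# ['520-01-0001', '520-01-0002', '520-01-0003']
```

After serial 9999 of group 01, the sequence continues with serial 0001 of group 03.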
Online social networks as Breeding documents
The approach
 OSN users reveal personal information in profiles
 Simple scripts allow adversaries to continuously retrieve and
save all profile information
 OSN users also reveal information that does not appear
to be sensitive, but that may be used to identify more
sensitive information
 I.e., OSN profiles as “Breeding documents”
 Specifically, social security numbers
 How?
SSN assignment pattern
is predictable
 SSN assignment pattern is known, and therefore
specific digits of an SSN are at least in principle
predictable
 Area number: based on where SSN was issued
 Related to birthplace
 Group number: based on when and where SSN was
issued
 Related to birthday and birthplace
From OSN to SSN
 Knowledge of an individual’s birthday and birthplace
may provide useful information to estimate that
individual’s AN and GN
 When matched to known ANs and GNs of other individuals
 Birthday and birthplace data can be obtained from several
sources, but most easily and in mass amounts from online
social networks
 Knowledge of SN must come from comparison of SN
of other individuals with similar birthplaces and
birthdays
 Therefore, approach is to combine OSN and publicly
available SSN data
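One way to picture the matching step is a nearest-neighbour lookup over records with a known area and group number. This is an illustrative sketch only, not the authors' procedure (the slides leave the details offline). The record type, function names, 30-day window, and all data values are made up:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class KnownRecord:
    state: str    # State where the number was issued
    birth: date   # date of birth
    area: int     # known area number
    group: int    # known group number


def nearby_records(records: List[KnownRecord], state: str, birth: date,
                   max_days: int = 30) -> List[KnownRecord]:
    """Records from the same State born within max_days, closest first."""
    near = [r for r in records
            if r.state == state and abs((r.birth - birth).days) <= max_days]
    return sorted(near, key=lambda r: abs((r.birth - birth).days))


def estimate_area_group(records: List[KnownRecord], state: str, birth: date,
                        max_days: int = 30) -> Optional[Tuple[int, int]]:
    """Borrow the area and group number of the closest known record."""
    near = nearby_records(records, state, birth, max_days)
    if not near:
        return None
    return near[0].area, near[0].group


# Synthetic records, for illustration only.
known = [
    KnownRecord("ND", date(1988, 3, 1), 501, 19),
    KnownRecord("ND", date(1988, 3, 20), 501, 21),
    KnownRecord("NV", date(1988, 3, 10), 530, 55),
]
print(estimate_area_group(known, "ND", date(1988, 3, 18)))  # (501, 21)
```

The sketch returns `None` when no record falls inside the window, which mirrors the point that the estimate depends entirely on the availability of comparable known records.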
Data sets
Combining various public data sets
to estimate SSN
 We used two data sets of publicly or semi-publicly
available information
 The Facebook
 Social security death index (SSDI)
 Studies were IRB approved
Why the Facebook
 From Facebook’s own blog:
Facebook now has more than 42 million active users (double
the number one year ago when it opened up registration
and growing at more than 200,000 per day since January)
More than half of Facebook users are outside of college (85
percent of US college students still use Facebook)
Why the Facebook
 Facebook profiles are (most often) uniquely and
personally identified
 In the sample we studied (freshmen at a North
American college), most profiles provided birthday
and birthplace information
 89% revealed first and last name
 88% revealed birthday information
 72% revealed hometown information
 No guarantee that information, if provided, is correct
Facebook data
 We collected birthday and birthplace information
from Facebook profiles of freshman students of US
nationality at a North American college
 Sample: 397 profiles
Why SSDI
 The Social Security Administration’s Death Master
File provides SSN for individuals who are deceased
 Can be studied to find patterns in the SSN assignment
schemes for individuals born in the same years and
locations as the targets (i.e., the students whose profiles
we accessed)
An example of SSDI record

Name        Birth        Death     Last Residence                  SSN         Issued
JOHN SMITH  21 Jun 1904  Oct 1979  33540 (Zephyrhills, Pasco, FL)  022-10-459  Massachusetts
Pattern discovery
Enumeration at birth
 We found that SSN assignment pattern has become
more regular over the years
 Computerization
 Enumeration at Birth Process (EAB)
 Implemented in 1989
 Prior to 1989, only a small percentage of people received SSNs
when they were born
 Currently at least 75 percent of all newborns receive SSNs via EAB
Group number patterns
in NV over time
Group number patterns
in ND over time
Group number frequencies are
decreasing over time
Group number frequencies
by State
Pattern regularities
 For group numbers, increasing over time
 What about serial numbers?
Serial number patterns
in ND over time
Estimation
Estimation approach
 We used SSDI data to identify patterns in SSN
assignment based on birthplace and birthday, which
led us to estimate:
 Area numbers
 Group numbers
 Serial numbers
 … for any individual for whom we knew (or we had
some noisy knowledge of) birthplace and birthday

 Details offline
Predictability of area numbers
Predictability of serial numbers

 Plots of R² of regressions of SN
 R² increases significantly over time, especially for less
populous states
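For reference, R² for a one-variable least-squares fit can be computed as below. The slides do not say which regressors were used; this sketch assumes the serial number is regressed on a numeric birth-date index, and `r_squared` is a hypothetical helper name:

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R² of the ordinary least-squares line y = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares relative to the total sum of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / syy


# Synthetic example: day index vs. serial number, made-up values.
days = [1, 2, 3, 4]
serials = [1, 3, 2, 4]
print(round(r_squared(days, serials), 2))  # 0.64
```

An R² close to 1 means the serial number is almost a linear function of the birth date, which is what makes it predictable.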
Results
How we verified our estimates
 Predicted SSN based on
 Facebook data
 SSDI
 Compared predicted SSNs against a sample of actual
SSNs (protected, IRB approved)
Benchmark
 To date, more than 400 million SSNs have been
issued by the SSA
 Currently, there are around 540 possible area
numbers, 99 possible group numbers, and 9,999
possible serial numbers
 Combining them together, one has odds of 1
over 643,435,650 of guessing an individual’s
SSN without using any information about that
individual
 0.000000155%
Results from our estimates
 Exact predictions: we correctly estimated…
 5.8% exact area number
 2.8% exact group number and exact area number
 I.e., for 2.8% of our sample we could correctly identify
the first 5 digits of the SSN
(versus ~0.001% for a random guess)
Results from our estimates
 Range predictions: we correctly estimated…
 60% right window of possible area numbers
 18.4% exact group number and right window of area
numbers
 22% in window of +/-10 group number and right
window of area numbers
 3.3% in window of +/-10 group number and exact
area number
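Percentages like the ones above can be obtained by scoring predictions against actual values with a tolerance window. A minimal sketch, assuming the window is a plain numeric distance (the slides do not say whether +/-10 is measured numerically or in issuance order); `hit_rate` and the sample values are made up:

```python
from typing import Sequence


def hit_rate(predicted: Sequence[int], actual: Sequence[int],
             tolerance: int = 0) -> float:
    """Percentage of predictions within +/- tolerance of the actual value."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty samples of equal length")
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return 100.0 * hits / len(predicted)


# Synthetic group-number predictions, for illustration only.
predicted = [10, 20, 30, 40]
actual = [10, 25, 45, 40]
print(hit_rate(predicted, actual))                # 50.0 (exact matches)
print(hit_rate(predicted, actual, tolerance=10))  # 75.0 (+/-10 window)
```

With `tolerance=0` the function counts exact predictions; a positive tolerance gives the range predictions.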
Results from our estimates
 Serial numbers and complete SSN: we correctly
estimated
 2.3% of serial numbers within +/-500 of the exact serial
number, with right window of area numbers and window of +/-10
group number
 1% of serial numbers within +/-500 of the exact serial number,
with exact area number and exact group number
 Closest match absolute difference=278, average=2,212, sd=2,255
 In other words:
 For 1% of our sample, we could estimate exactly the first 5
digits of the SSN and the last 4 digits within a +/- 500 digit
range
 Compare to 0.000000155% if random (an improvement by a
factor of roughly 6.45 million)
Discussion
Some concluding remarks
 The attack presented here is not unique to OSN, but
OSNs make it easy to attempt it
 Attack purely based on public data

 What can an attacker do with and about the last 4 digits?


 Why not everybody is affected the same way
 Things are much worse for smaller States
 Why it is going to get worse
 More EAB individuals, and more of them on Facebook
 What can we do?
Acknowledgements
