Data
The amount of data created by each
x 10 x 100 x 10000
A. Weigend
Time Scales
Technology: ~1 year
Biology: ~100k years
per month
Process of shopping?
Only occasionally punctuated by purchases
secret desires?
Instrument for feedback
Data Sources
- Attention: transactions, clicks
- Intention: search
- Situation: location, device
... patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff).
The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus).
Definition (Fayyad et al.): The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data.
- On Thursday nights people who buy diapers also tend to buy beer
- People with good credit ratings are less likely to have accidents
- Male consumers, 37+, income bracket 50K-75K spend
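A pattern such as "people who buy diapers also tend to buy beer" is usually quantified with support, confidence, and lift. The following is a minimal sketch of those three measures; the baskets are hypothetical toy data, not figures from the slides, and the function names are illustrative.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy baskets (assumption: invented for illustration only).
transactions: List[FrozenSet[str]] = [
    frozenset({"diapers", "beer", "chips"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "milk"}),
    frozenset({"beer", "chips"}),
    frozenset({"milk", "bread"}),
]


def support(itemset: Iterable[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of baskets that contain every item in the itemset."""
    items = frozenset(itemset)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets) -> float:
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets) -> float:
    """Confidence relative to the consequent's base rate (> 1 means positive association)."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
rule_lift = lift({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

A lift only slightly above 1, as in this toy data, is a reminder that a rule with high confidence may still be close to what the base rate alone would predict.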
Data mining:
- Few if any a priori hypotheses
- Data is usually already collected a priori
- Analysis is typically data-driven, not hypothesis-driven
- Often algorithm-oriented rather than model-oriented
Different?
- Yes, in terms of culture and motivation; however, statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful
- Increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on pioneering work of John Tukey in the 1960s)
[Figure: log-scale axis from 1E+3 to 1E+7 (label: ExaByte) against years 1988 to 2000]
Commercial products: SAS, SPSS, Insightful, IBM, Oracle
Data-Driven Discovery
Observational data cheap relative to experimental data
Examples:
- Transaction data archives for retail stores, airlines, etc.
- Web logs for Amazon, Google, etc.
- The human/mouse/rat genome
- Etc., etc.
Makes sense to leverage available data: useful (?) information may be hidden in vast archives of data
Machine Learning
Data Mining
Visualization
Information Science
Other Disciplines
Different fields have different views of what data mining is (also different terminology!)
The Course
Course Objectives
- Approach business problems data-analytically. Think carefully & systematically about whether & how data can improve business performance.
- Be able to interact competently on the topic of data mining for business intelligence. Know the basics of data mining processes, algorithms, & systems well enough to interact with CTOs, expert data miners, and business analysts. Be able to envision data-mining opportunities.
- Hands-on experience mining data. Be prepared to follow up on ideas or opportunities that present themselves, e.g., by performing pilot studies.
Our Goals
Understand the basics of the major Data Mining/Machine Learning techniques:
- What they do: problems they can solve
- Who uses them
- Where they are used
- When and how to use them
- How they work (at a high level only)
- Limitations
Apply techniques and evaluate the models built
Course Outline
Introduction to Modeling & Data Mining
- Fundamental concepts and terminology
Data Mining methods
- Classification, decision trees, association rules, clustering and segmentation, collaborative filtering, genetic algorithms, etc.
- Inner workings
- Strengths and weaknesses
Evaluation
- How to evaluate the results of a data mining solution
Applications
- Real-world business problems DM can be applied to
Course Information
Teaching style: Lecture / Lab / Guest Speakers (AT&T, IBM, Yahoo!)
Student participation/attendance is important
Lab sessions: Weka, Gephi, Python
Textbook: Various Publicly Available Readings
Weka
Gephi
Course Information
Canvas
Wordpress class site: http://opim672.wordpress.com
Facebook/Twitter
Office hours: M 6-7pm, F 2-5pm, or by appointment
Email: shawndra@wharton.upenn.edu
Course Information
Read material before and after class
- 8 homework assignments (35 points), groups of 2
- Data mining project (50 points), groups of 4-6, 10 groups per class
- Final Report
- Mid-semester update
- End of semester presentation
- Project Reviews
- Class participation (15 points)
- Data set competition (optional for extra credit)
Warning:
1. This is a hands-on class
2. A significant portion of deliverables are at the end of the semester.
What is a DSS?
Decision Support Systems aim at allowing business users to make better decisions faster and take action more easily and more profitably based on this information.
This is achieved through:
- Prediction
- Description
- Data Dissemination
- Prescription
Prediction
Induction: Rules
From specific examples (instances) to general rules
Instances:
ID       Swims  Color  Type
Animal1  yes    gray   dolphin
Animal2  yes    black  dolphin
Animal3  no     gray   elephant
Rules: IF swims=yes THEN class=dolphin
Antecedent / Assumption (Rule Body): swims=yes
Consequent / Conclusion (Rule Head): class=dolphin
Prediction = Determining the class or attribute-value for a new item with some known attributes.
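The step from instances to a rule, and from a rule to a prediction, can be sketched as follows. This is a deliberately simple one-attribute, majority-vote scheme chosen for illustration; it is an assumption and not necessarily the induction method used in the course. The three instances are the ones on the slide; the new item is hypothetical.

```python
from collections import Counter, defaultdict

# The three instances from the slide.
instances = [
    {"id": "Animal1", "swims": "yes", "color": "gray", "type": "dolphin"},
    {"id": "Animal2", "swims": "yes", "color": "black", "type": "dolphin"},
    {"id": "Animal3", "swims": "no", "color": "gray", "type": "elephant"},
]


def induce_rules(rows, attribute, target="type"):
    """For each value of `attribute`, use the most common target class as the rule head."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    return {value: classes.most_common(1)[0][0] for value, classes in counts.items()}


def predict(rules, item, attribute):
    """Apply the induced rules to a new item; returns None for an unseen attribute value."""
    return rules.get(item[attribute])


rules = induce_rules(instances, "swims")
for value, label in rules.items():
    print(f"IF swims={value} THEN class={label}")

# Hypothetical new item with some known attributes but no class.
new_item = {"swims": "yes", "color": "white"}
prediction = predict(rules, new_item, "swims")
print(prediction)
```

With only three instances the rule is obviously too general (not everything that swims is a dolphin), which is why induced rules need to be evaluated on data they were not built from.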
Text Mining
Examples
- Mining Medical Discussion Board Data
- Mining Motley Fool Caps
- Social Network Based Marketing
- Social Network Based Fraud Detection
- Social TV Examples
- Profit Maximizing Recommendation Engine
Data Mining, Spring 2013, Shawndra Hill
Linens n Things
Monster.com: Predict if stock value of company will go up based on employee attrition?
The World Bank: Predict if country/organization will default?
Pepsico
Google is a company built on data mining
- PageRank mined the web to build better search
- Google as spell checker
- Google as ad placer
- Google as news aggregator
- Google as face recognizer
Rows = objects
Columns = measurements on objects
Represent each row as a p-dimensional vector, where p is the dimensionality
In effect, embed our objects in a p-dimensional vector space
Often useful, but not always appropriate
Both n and p can be very large in data mining
Matrix can be quite sparse
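To make the n-by-p data matrix concrete, here is a small sketch in plain Python. The values are hypothetical; the point is only to show rows as p-dimensional vectors, a distance between two objects in that space, and a sparse representation that stores nonzero cells only.

```python
# Hypothetical data matrix: n = 3 objects (rows), p = 4 measurements (columns).
rows = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 0.0],
]
n = len(rows)
p = len(rows[0])

# Sparse form: keep only the nonzero cells, keyed by (row index, column index).
sparse = {
    (i, j): value
    for i, row in enumerate(rows)
    for j, value in enumerate(row)
    if value != 0.0
}
density = len(sparse) / (n * p)  # fraction of cells that are nonzero


def euclidean(a, b):
    """Euclidean distance between two p-dimensional vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


d01 = euclidean(rows[0], rows[1])
print(n, p, density, d01)
```

When n and p are large and most cells are zero, the dictionary (or a dedicated sparse-matrix library) uses memory proportional to the nonzero entries rather than to n times p.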
Text
Word IDs
Sometimes another representation is more useful
[Figure: variable-length sequences of integer codes for User 1 through User 5]
128.195.36.195, Doe, John, 12 Main St, 973-462-3421, Madison, NJ, 07932
114.12.12.25, Trank, Jill, 11 Elm St, 998-555-5675, Chester, NJ, 07911

07911, Chester, NJ, 07954, 34000, , 40.65, -74.12
07932, Madison, NJ, 56000, 40.642, -74.132
Most large data sets are stored in relational databases
- Commercial: Oracle, MSFT, IBM
- Good open source versions: MySQL, PostgreSQL
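The two record sets above can be linked through a shared key, which is what a relational join does. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table and column names are assumptions made for illustration, and the ZIP table is simplified to six columns (the extra fields in the first ZIP record on the slide are omitted because their meaning is not stated).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schemas; the slide gives the values but not the column names.
conn.execute(
    "CREATE TABLE customers (ip TEXT, last TEXT, first TEXT, street TEXT, "
    "phone TEXT, city TEXT, state TEXT, zip TEXT)"
)
conn.execute(
    "CREATE TABLE zip_info (zip TEXT PRIMARY KEY, city TEXT, state TEXT, "
    "income INTEGER, lat REAL, lon REAL)"
)

conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("128.195.36.195", "Doe", "John", "12 Main St", "973-462-3421", "Madison", "NJ", "07932"),
        ("114.12.12.25", "Trank", "Jill", "11 Elm St", "998-555-5675", "Chester", "NJ", "07911"),
    ],
)
conn.executemany(
    "INSERT INTO zip_info VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("07911", "Chester", "NJ", 34000, 40.65, -74.12),
        ("07932", "Madison", "NJ", 56000, 40.642, -74.132),
    ],
)

# Join each customer to the ZIP-level attributes for their ZIP code.
joined = conn.execute(
    "SELECT c.first, c.last, z.income "
    "FROM customers AS c JOIN zip_info AS z ON c.zip = z.zip "
    "ORDER BY c.last"
).fetchall()
print(joined)
conn.close()
```

ZIP codes are stored as text so that leading zeros such as in 07932 are preserved; storing them as integers would silently break the join key.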
[Figure: time series plot; x-axis: TIME]
Network Data
Algorithms for estimating relative importance in networks S. White and P. Smyth, ACM SIGKDD, 2003.
Accuracy is king
Only 15% of mergers and acquisitions succeed (Stephen Denning, The Leader's Guide to Storytelling, p. xiv)
Failure rate of new ventures invested in: 8 out of 10
Profit on Google investment: $4 billion (on $25 million)
Source: http://www.financialnews-us.com/?contentid=534017
Customer Lifetime Value: $2,700
Cost per flyer: 7 cents
Required hit rate = 7 / 270,000 = 1 in 38,571
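The break-even arithmetic above can be reproduced in a few lines. This is a sketch of the same calculation with both amounts converted to cents; the function name is illustrative.

```python
from fractions import Fraction


def break_even_hit_rate(lifetime_value_dollars: int, cost_per_flyer_cents: int) -> Fraction:
    """Smallest response rate at which one flyer pays for itself on average."""
    lifetime_value_cents = lifetime_value_dollars * 100
    return Fraction(cost_per_flyer_cents, lifetime_value_cents)


required_hit_rate = break_even_hit_rate(2700, 7)      # 7 / 270,000
flyers_per_customer = float(1 / required_hit_rate)    # flyers that one won customer pays for
print(required_hit_rate, round(flyers_per_customer))
```

Because the required rate is so small, even a model with low accuracy can be profitable in this setting, as long as it finds responders at a rate above the break-even point.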
Possible solutions
Offer incentives to every customer before contracts expire
- expensive
- no learning
www.crisp-dm.org
Marketing learned the modeling process as well as capabilities and weaknesses of modeling
IT learned the business processes and direct marketing strategies
- Marketing recommended additions to attributes to use in building model
Modeling
Data Selection/Preparation
- Included hundreds of basic attributes
- Derived and Ratio fields added to enrich the model
Use predictive modeling technique to refine relationship between predictors and output of interest
Test Model: how will it perform in real life
Select the best models (accuracy, profitability, etc.)
Deployment
Direct Mail and Telemarketing
Benefits
Cost Reduction
- Customers saved up to 80% more takes
- Direct Mail budget for same churner mailing reduced by 60%
Revenue Increase
- Average monthly revenue increase per bill
- Monthly usage increased
Switched
A model that, given a customer's characteristics, predicts how much the customer will spend on the next catalog order.
Most predictive models are also descriptive.
Amount spent on catalog purchase = 0.001*(Annual_Income) + 0.3*(Num_Cards) + (1/Num_Orders)
A model that classifies credit applicants to determine whether or not an applicant will default on a loan.
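The catalog-spend formula above is both predictive (it outputs a number for a new customer) and descriptive (its coefficients can be read directly). A minimal sketch follows; the coefficients are taken from the slide, while the example customer and the guard against zero orders are assumptions added for illustration.

```python
def predicted_catalog_spend(annual_income: float, num_cards: int, num_orders: int) -> float:
    """Evaluate the slide's formula for the amount spent on the next catalog purchase."""
    if num_orders <= 0:
        # Assumption: the formula is only defined for customers with at least one prior order.
        raise ValueError("num_orders must be positive")
    return 0.001 * annual_income + 0.3 * num_cards + 1 / num_orders


# Hypothetical customer: $50,000 income, 2 cards, 4 previous orders.
example_spend = predicted_catalog_spend(annual_income=50_000, num_cards=2, num_orders=4)
print(round(example_spend, 2))
```

Reading the coefficients descriptively: each additional $1,000 of income adds 1 to the predicted amount, each card adds 0.3, and the last term shrinks as the number of past orders grows.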
Like any other powerful tool, can be very dangerous if not used properly.
- Team work: Cannot (always) replace skilled business analysts; needs guidance and validation of output
Problem formulation
- Need to understand the business well, good formulation of problem
Inappropriate use of methods
- (And/Or) Lack of sufficient/high quality data
- Computational issues
Evaluation
- Need domain experts throughout the process to provide indispensable input and validate results
2006: (chris v) Published papers on "Communities of Interest" using social networks and "Guilt by association" to catch fraud
9 September 2007: NYT lead story "F.B.I. Data Mining Reached Beyond Initial Targets" discusses FBI techniques COI and GBA
23 October 2007: Blogosphere erupts: "How AT&T Provides the FBI with Terror Suspect Leads"
Wikileaks Visualizations
Fallout
CTO + at least two others fired
Data still out in the public
Is it ethical to study?
"purple lilac," "happy bunny pictures," "square dancing steps," "cut into your trachea," "pee fetish," "Simpsons incest."
Findings
Humans follow simple, reproducible patterns
Sample finding: Nearly three-quarters of those studied mainly stayed within a 20-mile-wide circle for half a year.
Results could impact all phenomena driven by human mobility, from epidemic prevention to emergency response and urban planning.
Case Study 3: Barabasi Mobile
Study of cell phone users
Uproar ensued over secret tracking
Blowback of negative feedback to Nature and scientists
Study would be illegal in the US
Approval from ONR review board and Northeastern review board.
Barabasi did not check with an ethics panel
Response
Hidalgo: "The data could be misused, but we were not trying to do evil things. We are trying to make the world a little better."
Northeastern and Nature backed the research
Continues to be referenced as an example of dangerous research
Risk and reward both very high
But, one step further, maybe all k have a given sensitive attribute!
The distribution of target values within a group is referred to as l-diversity.
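The concern above can be checked mechanically. The sketch below groups records by their quasi-identifiers, reports the smallest group size (k), and counts distinct sensitive values per group as the simplest notion of l-diversity; other definitions look at the full distribution. The records, field names, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical released records: zip and age are quasi-identifiers,
# diagnosis is the sensitive attribute.
records = [
    {"zip": "079**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "079**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "079**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "191**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "191**", "age": "40-49", "diagnosis": "asthma"},
    {"zip": "191**", "age": "40-49", "diagnosis": "diabetes"},
]


def group_by(rows, quasi_ids):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_ids)].append(row)
    return groups


def k_anonymity(rows, quasi_ids):
    """Size of the smallest group: each record hides among at least k records."""
    return min(len(group) for group in group_by(rows, quasi_ids).values())


def distinct_l_diversity(rows, quasi_ids, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    return min(
        len({row[sensitive] for row in group})
        for group in group_by(rows, quasi_ids).values()
    )


k = k_anonymity(records, ["zip", "age"])
l = distinct_l_diversity(records, ["zip", "age"], "diagnosis")
print(k, l)
```

Here every group has three records, yet the first group has only one diagnosis, so knowing that someone belongs to that group reveals their sensitive value despite the group size.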
What is R?
- Open source statistical software grown out of S/Splus
- www.r-project.org
- Packages at CRAN
- R Tutorials available online (see website and CRAN)
- Great graphics (with a bit of a learning curve)
Resources
Data mining is a new field and, as such, does not have authoritative texts (yet). This class draws from many sources; the best are:
- Data Mining Techniques: For Marketing, Sales, and Customer Support, by Michael J.A. Berry and Gordon Linoff, published by John Wiley & Sons, Inc.
- Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman
- Handbook of Data Mining, by Hand, Mannila and Smyth
- Interactive and Dynamic Graphics for Data Analysis, by Cook and Swayne
Also good class notes available from other classes:
- David Madigan, Columbia
- Di Cook, Iowa State
- Padhraic Smyth, UC Irvine
- Jiawei Han, Simon Fraser
(see class web site for pointers to these notes, or just Google them!)
Assignment 1
- Profiles will be posted on Canvas to facilitate group selection ASAP
- Generate 3 potential classification (prediction) problems/ideas as part of Assignment 1 (start exploring publicly available data sets; projects from last year are available)