effectively with the traditional applications that exist. The processing of Big Data
begins with raw data that isn't aggregated and is most often impossible to store
in the memory of a single computer.
The definition of Big Data given by Gartner is: "Big data is high-volume, high-velocity
and/or high-variety information assets that demand cost-effective, innovative forms of
information processing that enable enhanced insight, decision making, and process
automation."
Data analytics is the science of examining raw data with the purpose of drawing
conclusions about that information.
Data Analytics involves applying an algorithmic or mechanical process to derive
insights. For example, running through a number of data sets to look for
meaningful correlations between them.
Data Science is the combination of statistics, mathematics, programming, problem
solving, capturing data in ingenious ways, the ability to look at things differently,
and the activity of cleansing, preparing, and aligning data. Dealing with both
unstructured and structured data, Data Science is a field that comprises
everything related to data cleansing, preparation, and analysis.
Why is analytics becoming more important now?
Much more operational (structured) data is being created and captured because of the use of
technology
Enterprise software: ERP, CRM, SCM
Much more unstructured data is being captured and stored (social media data)
Facebook, Twitter
Much more unstructured data is being captured: web transactions, smart objects
Data Analytics vs. Statistical Analysis
Data Analytics
Utilizes data mining techniques
Identifies inexplicable or novel relationships/trends
Seeks to visualize the data to allow the observation of relationships/trends
Statistical Analysis
Utilizes statistical and/or mathematical techniques
Used on the basis of a theoretical foundation
Seeks to identify a significance level to address hypotheses or RQs
Analytics Landscape
Strategies: Social, Email, Blogs, Video, Mobile; Marketing, Sales - Product Listing, Promotions
Applications: ERP, CRM, Databases, Internal Applications, Customer/Consumer-facing applications
Context: Web, Customers, Products, Business Systems, Processes and Services
Support Systems: CRM, Recommendation Systems
Analytic Tools
4. Database and Data Warehouse
Serve as the foundation of business analytics
Principles of database design and implementation: conceptual, logical and physical modeling
Relational Databases
ETL process (Extraction, Transformation and Loading)
SQL (Structured Query Language)
Data warehouses, Business Intelligence
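The ETL process and SQL mentioned above can be sketched together with Python's built-in sqlite3 module; the table layout and raw records below are hypothetical:

```python
import sqlite3

# Hypothetical raw sales records extracted from a source system
raw_rows = [
    ("2024-01-05", " widget ", "120"),
    ("2024-01-06", "GADGET", "80"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount INTEGER)")

# Transform: trim/normalize product names, cast amounts to integers
clean = [(d, p.strip().lower(), int(a)) for d, p, a in raw_rows]

# Load into the warehouse table
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean)

# Query with SQL, as a BI tool would
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # → 200
```

In a real warehouse the extraction, transformation, and loading steps each run against dedicated systems; the structure of the pipeline is the same.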
Dangers in Analytics
Privacy
Security
Drawing decisions on incomplete data
Drawing decisions on inaccurate data
Using only data that supports our gut decisions
Drawing the wrong conclusion from the data (stock prices example)
Analytic techniques:
Data mining
Statistical analysis
Predictive analysis
Correlation
Regression
Forecasting
Process Modeling
Optimization
DATA - collected facts and figures
DATABASE - collection of computer files containing data
INFORMATION - comes from analyzing data
3. Interval Data
Ordinal data but with constant differences between observations
No true zero point
Ratios are not meaningful
Examples: temperature readings, SAT scores
4. Ratio Data
Continuous values with a natural zero point
Ratios are meaningful
Examples: monthly sales, delivery times
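The distinction can be checked numerically; the temperature and sales figures below are illustrative:

```python
# Interval data: temperature has no true zero, so ratios are not meaningful.
# 20°C is not "twice as warm" as 10°C: the ratio changes with the unit.
def c_to_f(celsius):
    return celsius * 9 / 5 + 32

print(20 / 10)                  # 2.0 in Celsius
print(c_to_f(20) / c_to_f(10))  # 68/50 = 1.36 in Fahrenheit

# Ratio data: monthly sales have a true zero, so ratios survive a change
# of unit (e.g. multiplying every figure by a currency conversion of 2):
print((200 * 2) / (100 * 2))    # still 2.0
```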
3. Prescriptive analytics
Function: make decisions based on data
Common models: linear programming, sensitivity analysis, integer programming,
goal programming, nonlinear programming, simulation modeling
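As an illustrative sketch of one of these models, a tiny integer program can be solved by brute-force enumeration; the product-mix numbers below are invented:

```python
from itertools import product

# Hypothetical product mix: profit per unit and resource use per unit
profit = {"chairs": 30, "tables": 70}
labor = {"chairs": 2, "tables": 4}   # hours per unit
wood = {"chairs": 1, "tables": 3}    # boards per unit
LABOR_CAP, WOOD_CAP = 40, 25

best = None
for c, t in product(range(21), range(11)):  # enumerate integer plans
    if labor["chairs"] * c + labor["tables"] * t > LABOR_CAP:
        continue
    if wood["chairs"] * c + wood["tables"] * t > WOOD_CAP:
        continue
    p = profit["chairs"] * c + profit["tables"] * t
    if best is None or p > best[0]:
        best = (p, c, t)

print(best)  # (650, 10, 5): 10 chairs and 5 tables maximize profit
```

Real solvers use far better algorithms than enumeration, but the structure (objective, decision variables, constraints) is the same.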
Volume: How much data is really relevant to the problem solution? What is the cost of
processing? So, can you really afford to store and process all that data?
Velocity:
Much data coming in at high speed
Need for streaming versus block approach to data analysis
So, how to analyze data in-flight and combine with data at-rest
Variety:
A small fraction is in structured formats: relational, XML, etc.
A fair amount is semi-structured, such as web logs
The rest of the data is unstructured: text, photographs, etc.
So, no single data model can currently handle the diversity
Veracity: cover term for
Accuracy, Precision, Reliability, Integrity
So, what is it that you don't know you don't know about the data?
Value:
How much value is created for each unit of data (whatever it is)?
So, what is the contribution of subsets of the data to the problem solution?
Types of Analytics
Descriptive: A set of techniques for reviewing and examining the data set(s) to
understand the data and analyze business performance.
Diagnostic: A set of techniques for determining what has happened and why
Predictive: A set of techniques that analyze current and historical data to
determine what is most likely to (not) happen
Prescriptive: A set of techniques for computationally developing and analyzing
alternatives that can become courses of action, either tactical or strategic, and that
may discover the unexpected
Decisive: A set of techniques for visualizing information and recommending
courses of action to facilitate human decision-making when presented with a set
of alternatives.
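As a minimal sketch of the predictive category, a least-squares trend line fitted to invented historical sales can forecast the next period:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical monthly sales history (months 1..5)
months = [1, 2, 3, 4, 5]
sales = [100, 110, 125, 130, 145]
a, b = fit_line(months, sales)
forecast_month6 = a + b * 6
print(round(forecast_month6, 1))  # 155.0
```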
Advanced Analytics: Application of Analytics to Critical Problems
Goals of analytics: Monitoring and Alerting; Sensemaking (Data and Analytics Science);
Cents-Making (Getting to ROI!)
Klein (2006) theorizes that sensemaking processes are initiated when individuals
or organizations recognize a lack of understanding of events.
Sensemaking Challenges
Lillian Wu (IBM) has noted that everything is
becoming:
Instrumented: We now have the ability to
measure, sense and see the exact condition of
practically everything.
Interconnected: People, systems and objects
can communicate and interact with each
other in entirely new ways
Intelligent: People, systems and objects can
respond to changes quickly and accurately,
and get better results by predicting and
optimizing for future events.
How to deal with ambiguity?
How to deal with too much data?
Network Forensics
Networks have become exponentially faster. They carry more traffic and more types of data than
ever before. Yet as they get faster, they become more difficult to monitor and analyze.
Network forensics must be:
1.Precise: capture high-speed packets without packet loss
2.Scalable: extend to new network technologies and speeds
3.Flexible: adapt to heterogeneous network segments
4.VOIP-Smart: reconstruct & replay VoIP calls; present Call Detail Records (CDR) for each call
5.Continuously available: run 24/7 with adequate storage; support real-time analysis
ADVANCED ANALYTICS
Crowdsourcing: A New Analytic?
Distribution of problem solving across a crowd
Moving Analytics To the Edge
Traditional analysis: data is stored and then analyzed
-Usually at some central location or a few distributed locations
-The cost and time to move large amounts of data may render it obsolete or of little worth
Moving analytics to the frontier of the domain:
-(Near) real-time analysis and decisions are required
-Streaming massive amounts of data is expensive, fraught with error: microsecond latency,
millions of events
-We just can't store it all
-Perishability is a key factor
-May only really need the synthesized, aggregate information, not the raw data
-True data-driven analysis
Example: Pushing analytics into cameras for images, full motion video analysis, motion
correction, 3D perception
WEB ANALYTICS
What is it?
-Now: The study of the behavior of web users
-Future: The study of one mechanism for how society makes decisions
-Example: Behavior of web users: How many people clicked on "Ebola" (or related
terms) in the past 2 months?
LOCATION ANALYTICS
What is it?
-Augmenting mission-critical, enterprise business systems with complementary
content, mapping, and geographic capabilities
-Mapping & Visualization: use maps as the medium to visualize data
-Spatial analytics: merging GIS with other types of analytics
-Find spatio-temporal patterns indicative of physical activities or social behavior
-Data/information enrichment: add maps, imagery, demographics, consumer and
lifestyle data, environment and weather, social media, etc.
Enabled by the ubiquity of GPS on cellphones, cars, wristwatches, laptops, tablets, etc.
ANALYTIC CHALLENGES
3. DATA ANALYTICS
What is Context-Aware Mobile Computing?
- Applications that can detect their users' situations and adapt their behavior
accordingly.
- Software that adapts according to its context!
Active context awareness
An application automatically adapts to discovered context by changing the
application's behavior.
Calm technology
What is context?
Context is any information that can be used to characterize the situation of an entity. An entity
is a person, place, or object that is considered relevant to the interaction between a user and an
application, including the user and the application themselves
Examples: temperature, user preferences, lighting, location, nearby resources (such as
printers), history
CONTEXT SENSING
1. Low-level context sensing
Location (GPS, IR, etc.), orientation, time, temperature, pressure
2. High-level context sensing
Activity: machine vision based on camera technology and image processing; combining
several low-level contexts (AI)
DEFINING CONTEXT
Computing context: connectivity, communication cost, bandwidth, nearby resources (printers,
displays, PCs)
User context: user profile, location, nearby people, social situation, activity, mood
Physical context: temperature, lighting, noise, traffic conditions
Context-Aware applications use context to:
Present services and information to a user
Example: the time of day and restaurants near the user
Automatically execute a service for a user
Example: a phone automatically setting a weekly alarm for the user
Tag information to retrieve at a later time
Example: a phone keeping track of recent calls
System infrastructure
Examples:
1. Phone display adjusts its brightness based on the surrounding light level
(uses a light sensor)
2. Schneider trucking trackers send emergency notifications when conditions are met
Architecture: message passing; centralized architecture (scalability problem);
distributed architecture; wireless communication (wireless LAN)
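Active context awareness, as in the brightness example, can be sketched in a few lines; the lux thresholds below are illustrative, not taken from any real device:

```python
def adjust_brightness(ambient_lux):
    # Map an ambient light reading (lux) to a display brightness (0-100%).
    # Thresholds are hypothetical, chosen only to illustrate adaptation.
    if ambient_lux < 50:       # dark room
        return 20
    elif ambient_lux < 1000:   # indoor lighting
        return 60
    else:                      # daylight
        return 100

# The application adapts automatically as the sensed context changes
for lux in (10, 400, 20000):
    print(lux, "->", adjust_brightness(lux))
```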
4.DATA WAREHOUSE
WHAT IS DATA WAREHOUSE
- A data warehouse is a relational database that is designed for query and analysis rather than
for transaction processing
- A common way of introducing data warehousing is to refer to the characteristics of a data
warehouse as set forth by William Inmon : Subject Oriented, Integrated, Nonvolatile, Time
Variant
DATA WAREHOUSE PROPERTIES
Subject oriented : Data is categorized and stored by
business subject rather than by application.
Integrated : Data warehouses must put data from
disparate sources into a consistent format.
Time variant : Data is stored as a series of
snapshots, each representing a period of time.
Non-volatile : Typically data in the data warehouse is
not updated or deleted. This is logical because the
purpose of a warehouse is to enable you to analyze
what has occurred.
Data Warehouse Architecture (Basic)
End users directly access data derived from several source systems through the data warehouse.
Operational Systems are the internal and external core systems that run the day-to-day business
operations. They are accessed through application program interfaces (APIs) and are the source of data
for the data warehouse and operational data store.
External Data is any data outside the normal data collected through an enterprise's internal applications.
Generally, external data, such as demographic, credit, competitor, and financial information, is purchased
by the enterprise from a vendor of such information.
Data Acquisition is the set of processes that capture, integrate, transform, cleanse, and load source data
into the data warehouse and operational data store
DATA MART: A data mart is a small warehouse designed for a strategic business unit or a
department.
Data Mart Advantages: The cost is low; implementation time is shorter; they are controlled
locally rather than centrally; they contain less information than the data warehouse and hence
give more rapid response; they allow a business unit to build its own DSS without relying on a
centralized IS department.
Data Mart Types:
The Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data used
to support the strategic decision-making process for the enterprise.
The Operational Data Store is a subject-oriented, integrated, current, volatile collection of data used to
support the tactical decision-making process for the enterprise.
Operation and Administration is the set of activities required to ensure smooth daily operations, to ensure
that resources are optimized, and to ensure that growth is managed.
Systems Management is the set of processes for maintaining, versioning, and upgrading the core
technology on which the data, software, and tools operate.
Data Acquisition Management is the set of processes that manage and maintain processes used to
capture source data and its preparation for loading into the data warehouse or operational data store.
Service Management is the set of processes for promoting user satisfaction and productivity within the
Corporate Information Factory. It includes processes that manage and maintain service level agreements,
requests for change, user communications, and the data delivery mechanisms.
Change Management is the set of processes coordinating modifications to the Corporate Information
Factory.
DATA MINING
CIF Data Management is the set of processes that protect the integrity and continuity of the data within
and across the data warehouse and operational data store. It may employ a staging area for cleansing
and synchronizing data.
The Transactional Interface is an easy-to-use and intuitive interface for the end user to access and
manipulate data in the operational data store.
Data Delivery is the set of processes that enables end users and their supporting IT groups to filter, format,
and deliver data to data marts and oper-marts.
The Exploration Warehouse is a data mart whose purpose is to provide a safe haven for exploratory and
ad hoc processing. An exploration warehouse may utilize specialized technologies to provide fast
response times with the ability to access the entire database.
The Data Mining Warehouse includes tasks known as knowledge extraction, data archaeology, data
exploration, data pattern processing and data harvesting.
The OLAP (online analytical processing) Data Mart is aggregated and/or summarized data that is derived
from the data warehouse and tailored to support the multidimensional requirements of a given business
unit or business function.
The Oper-Mart is a subset of data derived from the operational data store used in tactical analysis and
usually stored in a multidimensional manner (star schema or hypercube). Oper-marts may be created in a
temporary manner and dismantled when no longer needed.
The Decision Support Interface is an easy-to-use, intuitive tool to enable end user capabilities such as
exploration, data mining, OLAP, query, and reporting to distill information from data.
Meta Data Management is the set of processes for managing the information needed to promote data
legibility, use, and administration.
Information Feedback is the set of processes that transmit the intelligence gained through usage of the
Corporate Information Factory to appropriate data stores.
Information Workshop is the set of the facilities that optimize use of the Corporate Information Factory
by organizing its capabilities and knowledge, and then assimilating them into the business process.
The Library and Toolbox is the collection of meta data and capabilities that provides information to
effectively use and administer the Corporate Information Factory.
The process of extracting interesting patterns (non-trivial, implicit, previously unknown,
and potentially useful) from large volumes of data.
WHY DATA MINING?
          Database query                 Data mining
Query     well defined; SQL              poorly defined; no specific query language
Data      operational data               not operational data
Output    precise; subset of database    fuzzy
1. Understand application domain: prior knowledge, user goals
2. Create target dataset: select data, focus on subsets
3. Data cleaning and transformation: remove noise, outliers, missing values;
select features, reduce dimensions
4. Apply data mining algorithm: associations, sequences, classification, clustering, etc.
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
-No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
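A minimal data-cleaning sketch, assuming invented delivery-time readings with missing values and one outlier:

```python
import statistics

# Hypothetical raw delivery times (hours); None marks a missing value
raw = [4.1, 3.9, None, 4.4, 98.0, 4.0, None, 4.2]

# Cleaning step 1: fill missing values with the median of what's present
present = [v for v in raw if v is not None]
med = statistics.median(present)
filled = [med if v is None else v for v in raw]

# Step 2: treat values far above the median as outliers (likely noise)
# and replace them with the median as well
cleaned = [v if v <= 3 * med else med for v in filled]

print(cleaned)
```

The thresholds here (median fill, 3x-median cutoff) are illustrative choices; real pipelines pick imputation and outlier rules to suit the data.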
Accessibility
Smoothing: binning method, clustering, regression
Normalization: min-max normalization, z-score normalization
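Both normalization methods listed above are a few lines each; the attribute values below are invented:

```python
import statistics

values = [10, 20, 30, 40, 100]  # hypothetical attribute values

# Min-max normalization: rescale into [0, 1]
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: zero mean, unit (sample) standard deviation
mu = statistics.mean(values)
sigma = statistics.stdev(values)
zscores = [(v - mu) / sigma for v in values]

print(minmax)
print([round(z, 2) for z in zscores])
```

Min-max is sensitive to the single largest value; z-score is less so, which is one reason to prefer it when outliers are present.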
Supervised learning: pattern recognition, prediction
Unsupervised learning: segmentation, partitioning, characterization, generalization
Affinity Analysis
Association Rules
When people buy diapers they also buy beer 50 percent of the time.
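The 50-percent figure in a rule like this is its confidence; a minimal computation over invented transactions:

```python
# Hypothetical market-basket transactions
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "bread"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"bread", "beer"},
]

def confidence(antecedent, consequent, txns):
    # confidence(A -> B) = support(A and B) / support(A)
    has_a = [t for t in txns if antecedent <= t]
    has_both = [t for t in has_a if consequent <= t]
    return len(has_both) / len(has_a)

print(confidence({"diapers"}, {"beer"}, transactions))  # 2 of 4 -> 0.5
```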
A sequencing technique is very similar to an association technique, but it adds time
to the analysis and produces rules such as:
People who have purchased a VCR are three times more likely to
purchase a camcorder in the time period two to four months after the
VCR was purchased.
Clustering: a technique for creating partitions so that all members of each set are
similar according to some metric or set of metrics.
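A minimal one-dimensional sketch of such a partitioning using k-means (Lloyd's algorithm); the points and starting centers are invented:

```python
def kmeans_1d(points, centers, iters=10):
    # Lloyd's algorithm in one dimension: assign each point to its
    # nearest center, then move each center to its members' mean.
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = [sum(m) / len(m) for m in clusters.values() if m]
    return sorted(centers)

# Two hypothetical groups of delivery times (hours)
points = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
print(kmeans_1d(points, centers=[0.0, 5.0]))
```

Here the metric is simple distance on a line; the same assign-then-update loop generalizes to any number of dimensions and metrics.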