
August 2009

Bachelor of Science in Information Technology (BScIT) – Semester 4


BT0050 – Data Warehousing & Mining – 4 Credits
(Book ID: B0038)
Assignment Set – 1 (60 Marks)

Answer all questions 6 x 10 = 60

Ques 1 Explain the concept of knowledge discovery in database.


Knowledge discovery is a concept from the field of computer science that
describes the process of automatically searching large volumes of data for
patterns that can be considered knowledge about the data. It is often described
as deriving knowledge from the input data. This broad topic can be categorized
according to
• what kind of data is searched, and
• in what form the result of the search is represented.
The most well-known branch of data mining is knowledge discovery, also known
as Knowledge Discovery in Databases (KDD). Like many other forms of
knowledge discovery, it creates abstractions of the input data. The knowledge
obtained through the process may itself become additional data that can be
used for further discovery.

Another promising application of knowledge discovery is in the area of software
modernization, which involves understanding existing software artifacts. This
process is related to the concept of reverse engineering. Usually the knowledge
obtained from existing software is presented in the form of models to which
specific queries can be made when necessary. An entity-relationship model is a
frequent format for representing knowledge obtained from existing software.
The Object Management Group (OMG) developed the Knowledge Discovery
Metamodel (KDM) specification, which defines an ontology for software assets
and their relationships for the purpose of performing knowledge discovery on
existing code. Knowledge discovery from existing software systems, also known
as software mining, is closely related to data mining, since existing software
artifacts contain enormous business value that is key for the evolution of
software systems. Instead of mining individual data sets, software mining
focuses on metadata, such as database schemas.

Frawley et al. define knowledge discovery to be "the nontrivial extraction of
implicit, previously unknown, and potentially useful information from data". In
knowledge discovery from databases (KDD), machine learning techniques have
been adapted to large-scale databases for the discovery task. The discovery
method, which is at the core of the generic architecture for a discovery system,
computes and evaluates groupings, patterns, and relationships in the context of
a problem-solving task. The groupings, patterns, and relationships are derived
from raw data extracted from a database, or from a preprocessed form of the
raw data. Preprocessing may be done by statistical or by knowledge-based
techniques.
Depending on the discovery method used, the knowledge produced may be in
different forms:

• Data objects organized into groups or categories, where each group
represents a relevant concept in the problem-solving domain. Inductive
discovery methods in this category are called clustering methods.

• Classification rules that identify a group of objects that have the same
label or differentiate among groups of objects that have different labels.
These methods are termed classification/regression methods.

• Descriptive regularities, qualitative or quantitative, among sets of
parameters drawn from object descriptions. Inductive methods in this
category are called empirical discovery methods.

Our work in this field includes:

(1) ITERATE, a conceptual clustering system that generates stable and
cohesive clusters through the ADO-star data ordering technique and an
iterative redistribution strategy, and

(2) a knowledge-based equation discovery system that defines homogeneous
contexts using clustering techniques and derives analytical equations for
the response variable under the proper context.

More recently, we have been looking at unsupervised learning (clustering)
techniques for temporal sequences of data. The goal is to cluster objects with
temporal features, and this will find applications in domains such as the
analysis of Pediatric Intensive Care Unit (PICU) patients and the classification
of faults in complex, dynamic systems. Recent papers discuss our Hidden Markov
Model (HMM) based algorithm for clustering data objects with continuous
time-sequence features.

Ques 2. Discuss the following types of Multidimensional Data Models:
Stars, Snowflakes and Constellations
2a Stars, Snowflakes and Constellations

The star schema is a simple schema used in dimensional modeling. The star
schema (sometimes referred to as the star join schema) is the simplest style of
data warehouse schema. It consists of a few fact tables (possibly only one,
justifying the name) referencing any number of dimension tables. The star
schema is considered an important special case of the snowflake schema.

Model
The facts that the data warehouse helps analyze are classified along different
dimensions: the fact tables hold the main data, while the usually smaller
dimension tables describe each value of a dimension and can be joined to fact
tables as needed.

Dimension tables have a simple primary key, while fact tables have a
compound primary key consisting of the aggregate of relevant dimension keys.
It is common for dimension tables to consolidate redundant data and be in
second normal form, while fact tables are usually in third normal form because
all data depend on either one dimension or all of them, not on combinations of
a few dimensions.

The star schema is a way to implement multi-dimensional database (MDDB)
functionality using a mainstream relational database: given the typical
commitment to relational databases of most organizations, a specialized
multidimensional DBMS is likely to be both expensive and inconvenient.
Another reason for using a star schema is its simplicity from the users' point of
view: queries are never complex because the only joins and conditions involve
a fact table and a single level of dimension tables, without the indirect
dependencies to other tables that are possible in a better normalized snowflake
schema.

Star schema used by example query.


Consider a database of sales, perhaps from a store chain, classified by date,
store and product. The schema shown in the figure is a star schema version of
the sample schema described in the snowflake schema section below.

Example
Fact_Sales is the fact table and there are three dimension tables: Dim_Date,
Dim_Store and Dim_Product. Each dimension table has a primary key on its Id
column, relating to one of the columns of the Fact_Sales table's three-column
primary key (Date_Id, Store_Id, Product_Id). The non-primary-key Units_Sold
column of the fact table in this example represents a measure or metric that
can be used in calculations and analysis. The non-primary-key columns of the
dimension tables represent additional attributes of the dimensions (such as the
Year of the Dim_Date dimension).
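As a rough illustration of this model, the sketch below declares the same tables
using SQLite from Python. The table names and key columns follow the example;
the non-key columns (Date, Store_Name) are illustrative assumptions, not part
of the original schema.

import sqlite3

# Minimal sketch of the example star schema: simple primary keys on the
# dimension tables, a compound primary key on the fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Date TEXT, Year INTEGER);
CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Store_Name TEXT, Country TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT, Product_Category TEXT);

CREATE TABLE Fact_Sales (
    Date_Id    INTEGER REFERENCES Dim_Date(Id),
    Store_Id   INTEGER REFERENCES Dim_Store(Id),
    Product_Id INTEGER REFERENCES Dim_Product(Id),
    Units_Sold INTEGER,                      -- the measure
    PRIMARY KEY (Date_Id, Store_Id, Product_Id)
);
""")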

The following query extracts how many TV sets have been sold, for each brand
and country, in 1997.

SELECT
P.Brand,
S.Country,
SUM (F.Units_Sold)
FROM
Fact_Sales F
INNER JOIN Dim_Date D
ON F.Date_Id = D.Id
INNER JOIN Dim_Store S
ON F.Store_Id = S.Id
INNER JOIN Dim_Product P
ON F.Product_Id = P.Id
WHERE
D.Year = 1997
AND
P.Product_Category = 'tv'
GROUP BY
P.Brand,
S.Country

Snowflake schema
The snowflake schema is a variation of the star schema, featuring
normalization of dimension tables.

A snowflake schema is a logical arrangement of tables in a multidimensional
database such that the entity relationship diagram resembles a snowflake in
shape. Closely related to the star schema, the snowflake schema is represented
by centralized fact tables which are connected to multiple dimensions. In the
snowflake schema, however, dimensions are normalized into multiple related
tables whereas the star schema's dimensions are denormalized with each
dimension being represented by a single table. When the dimensions of a
snowflake schema are elaborate, having multiple levels of relationships, and
where child tables have multiple parent tables ("forks in the road"), a complex
snowflake shape starts to emerge. The "snowflaking" effect only affects the
dimension tables and not the fact tables.

Common uses
The star and snowflake schema are most commonly found in dimensional data
warehouses and data marts where speed of data retrieval is more important
than the efficiency of data manipulations. As such, the tables in these schemas
are not heavily normalized and are frequently designed at a level of
normalization short of third normal form.
The decision whether to employ a star schema or a snowflake schema should
consider the relative strengths of the database platform in question and the
query tool to be employed. Star schemas should be favored with query tools that
largely expose users to the underlying table structures, and in environments
where most queries are simpler in nature. Snowflake schemas are often better
suited to more sophisticated query tools that isolate users from the raw table
structures, and to environments having numerous queries with complex
criteria.

Data normalization and storage


Normalization splits up data to avoid redundancy (duplication) by moving
commonly repeating groups of data into a new table. Normalization therefore
tends to increase the number of tables that need to be joined in order to
perform a given query, but reduces the space required to hold the data and the
number of places where it needs to be updated if the data changes.

From a storage point of view, the dimension tables are typically small
compared to the fact tables. This often removes the storage space benefit of
snowflaking the dimension tables.

Some database developers compromise by creating an underlying snowflake
schema with views built on top of it that perform many of the necessary joins to
simulate a star schema. This provides the storage benefits achieved through
the normalization of dimensions with the ease of querying that the star schema
provides. The tradeoff is that requiring the server to perform the underlying
joins automatically can result in a performance hit when querying as well as
extra joins to tables that may not be necessary to fulfill certain queries.

Benefits of "snowflaking"
• Some OLAP multidimensional database modeling tools that use
dimensional data marts as a data source are optimized for snowflake
schemas.
• If a dimension is very sparse (i.e. most of the possible values for the
dimension have no data) and/or a dimension has a very long list of
attributes which may be used in a query, the dimension table may
occupy a significant proportion of the database and snowflaking may be
appropriate.
• A multidimensional view is sometimes added to an existing transactional
database to aid reporting. In this case, the tables which describe the
dimensions will already exist and will typically be normalized. A
snowflake schema will therefore be easier to implement.
• A snowflake schema can sometimes reflect the way in which users think
about data. Users may prefer to generate queries using a star schema in
some cases, although this may or may not be reflected in the underlying
organization of the database.
• Some users may wish to submit queries to the database which, using
conventional multidimensional reporting tools, cannot be expressed
within a simple star schema. This is particularly common in data mining
of customer databases, where a common requirement is to locate
common factors between customers who bought products meeting
complex criteria. Some snowflaking would typically be required to permit
simple query tools to form such a query, especially if provision for these
forms of query weren't anticipated when the data warehouse was first
designed.

Examples

Snowflake schema used by example query.


The example schema is a snowflaked version of the star schema example
provided in the star schema section above.

The following example query is the snowflake schema equivalent of the star
schema example code which returns the total number of TV units sold by brand
and by country for 1997. Notice that the snowflake schema query requires
many more joins than the star schema version in order to fulfill even a simple
query. The benefit of using the snowflake schema in this example is that the
storage requirements are lower since the snowflake schema eliminates many
duplicate values from the dimensions themselves.

SELECT
B.Brand,
G.Country,
SUM (F.Units_Sold)
FROM
Fact_Sales F
INNER JOIN Dim_Date D
ON F.Date_Id = D.Id
INNER JOIN Dim_Store S
ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G
ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P
ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C
ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B
ON P.Brand_Id = B.Id
WHERE
D.Year = 1997
AND
C.Product_Category = 'tv'
GROUP BY
B.Brand,
G.Country

2b Concept Hierarchies

The concept hierarchy file


The concept hierarchy is defined by a concept hierarchy file. The file contains
lines of the form

human isa thing
human words person, man

where human and thing are concepts, and person and man are words; isa and
words are the only two keywords. These lines indicate that human is a subtype
of thing, and that human is associated with the words person and man. Concept
names should begin with a lower-case letter and may contain lower- and
upper-case letters; a word is normally all lower case. No two concepts in the
hierarchy can have the same name, and no two words in the hierarchy can be
the same. A concept name and a word can be the same, though.
Concept hierarchy files should typically have the file extension .hrc.
Loading the concepts
A concept file can be loaded into the system by including a line of the form
Concepts.fileName = filename
in the Jet properties file.

Using the concepts


In an annotation pattern element, in place of a feature test of the form
feature=value, one can have a test of the form feature ?isa(concept).
This feature test succeeds if the value of feature is a word associated with
concept, or with some concept that is a descendant of concept in the hierarchy.
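As a rough sketch of this lookup, the following Python fragment loads a tiny
hierarchy in the line format described above and walks the isa links upward.
The function names are hypothetical and are not part of Jet.

def load_hierarchy(lines):
    parent, word_of = {}, {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "isa":
            parent[fields[0]] = fields[2]
        elif len(fields) >= 3 and fields[1] == "words":
            for w in " ".join(fields[2:]).split(","):
                word_of[w.strip()] = fields[0]
    return parent, word_of

def isa(value, concept, parent, word_of):
    # True if value is a word of concept or of one of concept's descendants.
    c = word_of.get(value)
    while c is not None:
        if c == concept:
            return True
        c = parent.get(c)
    return False

parent, word_of = load_hierarchy(["human isa thing", "human words person, man"])
print(isa("person", "thing", parent, word_of))   # True: human isa thing
print(isa("man", "human", parent, word_of))      # True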

Using the Concept Hierarchy UI


The Concept Hierarchy UI is an editor for concept hierarchy (.hrc) files. You start
the Concept Hierarchy UI by selecting "Concept Window" in the Jet File menu. If
a hierarchy has been specified in the Jet properties file, it will appear in the
window. The Concept Hierarchy UI has the basic features of an editor, similar to
those of a text editor.
File Open, Save, SaveAs and Exit perform the standard file I/O (note that Exit
quits Jet, not just the UI). If a concept hierarchy file has been changed since it
was last saved, it is marked as dirty with an asterisk after the file name in the
window caption, and a dialog will ask whether the current file should be saved.
There is a bug here: if the window is being closed while the file is dirty, the
dialog pops up, and the "Cancel" button is pressed, the window becomes
invisible without anything being done to the file. If the "Exit" menu item is
chosen instead, everything works fine.
From the "Edit" menu, you can add, delete, rename and search for concepts and
words. There is another bug here: when you rename a concept or word, either
by selecting the "Rename" menu item or by clicking on the concept or word
itself, and then type in the new name and press Enter, an error message pops
up if the new name is an empty string or would create a duplicate in the
hierarchy. After you press OK to dismiss the message, the concept or word is
still being edited, and if you then press ESC the bad name will be accepted. For
the time being it is the user's responsibility not to introduce this kind of error
into the file, because the editor will not give the same warning twice.

2c OLAP Operations
Online analytical processing or OLAP is an approach to quickly answer
multi-dimensional analytical queries. OLAP is part of the broader category of
business intelligence, which also encompasses relational reporting and data
mining. The typical applications of OLAP are in business reporting for sales,
marketing, management reporting, business process management (BPM),
budgeting and forecasting, financial reporting and similar areas. The term OLAP
was created as a slight modification of the traditional database term OLTP
(Online Transaction Processing).
Databases configured for OLAP use a multidimensional data model, allowing for
complex analytical and ad-hoc queries with a rapid execution time. They borrow
aspects of navigational databases and hierarchical databases that are faster
than relational databases.

The output of an OLAP query is typically displayed in a matrix (or pivot) format.
The dimensions form the rows and columns of the matrix; the measures form
the values.

Concept
At the core of any OLAP system is the concept of an OLAP cube (also called a
multidimensional cube or a hypercube). It consists of numeric facts called
measures which are categorized by dimensions. The cube metadata is typically
created from a star schema or snowflake schema of tables in a relational
database. Measures are derived from the records in the fact table and
dimensions are derived from the tables. Each measure can be thought of as
having a set of labels, or meta-data associated with it. A dimension is what
describes these labels; it provides information about the measure. A simple
example would be a cube that contains a store's sales as a measure, and
Date/Time as a dimension. Each Sale has a Date/Time label that describes more
about that sale. Any number of dimensions can be added to the structure such
as Store, Cashier, or Customer by adding a column to the fact table. This allows
an analyst to view the measures along any combination of the dimensions.

For Example:
Sales Fact Table                        Time Dimension
+-----------------------+               +--------------------------------+
| sale_amount | time_id |               | time_id | timestamp            |
+-----------------------+               +--------------------------------+
| 2008.08     |    1234 | ------------> | 1234    | 20080902 12:35:43    |
+-----------------------+               +--------------------------------+

Multidimensional databases
Multidimensional structure is defined as “a variation of the relational model that
uses multidimensional structures to organize data and express the relationships
between data” (O'Brien & Marakas, 2009, pg 177). The structure is broken into
cubes and the cubes are able to store and access data within the confines of
each cube. “Each cell within a multidimensional structure contains aggregated
data related to elements along each of its dimensions” (pg. 178). Even when the
data is manipulated it remains easy to access, the database stays compact, and
the data remains interrelated. Multidimensional structures are quite popular for
analytical databases that use online analytical processing (OLAP) applications
(O’Brien & Marakas, 2009). Analytical databases use these structures because
of their ability to deliver answers to complex business queries quickly. Data can
be viewed in different ways, which gives a broader picture of a problem than
other models (Williams, Garza, Tucker & Marcus, 1994).

Aggregations
It has been claimed that for complex queries OLAP cubes can produce an
answer in around 0.1% of the time for the same query on OLTP relational data.
The most important mechanism in OLAP which allows it to achieve such
performance is the use of aggregations. Aggregations are built from the fact
table by changing the granularity on specific dimensions and aggregating up
data along these dimensions. The number of possible aggregations is
determined by every possible combination of dimension granularities.
The combination of all possible aggregations and the base data contains the
answers to every query which can be answered from the data.
Because usually there are many aggregations that can be calculated, often only
a predetermined number are fully calculated; the remainder are solved on
demand. The problem of deciding which aggregations (views) to calculate is
known as the view selection problem. View selection can be constrained by the
total size of the selected set of aggregations, the time to update them from
changes in the base data, or both. The objective of view selection is typically to
minimize the average time to answer OLAP queries, although some studies also
minimize the update time. View selection is NP-Complete. Many approaches to
the problem have been explored, including greedy algorithms, randomized
search, genetic algorithms and A* search algorithm.
A very effective way to support aggregation and other common OLAP
operations is the use of bitmap indexes.
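The role of aggregations can be illustrated with a toy sketch: the fragment below
pre-computes one aggregate per subset of dimensions (each dimension is either
kept at full detail or rolled up completely) over a few made-up fact rows. It is
only an illustration of the idea in plain Python, not of how an OLAP server
actually stores or selects its aggregations.

from collections import defaultdict
from itertools import combinations

# Toy fact rows: (date, store, product, units_sold) -- made-up data.
facts = [
    ("1997-01-03", "Berlin", "tv",    2),
    ("1997-01-03", "Paris",  "radio", 5),
    ("1997-02-10", "Berlin", "tv",    1),
]
dimensions = ("date", "store", "product")

# Pre-compute one aggregation per combination of dimensions.
aggregations = {}
for r in range(len(dimensions) + 1):
    for group in combinations(range(len(dimensions)), r):
        totals = defaultdict(int)
        for row in facts:
            key = tuple(row[i] for i in group)  # keep only the grouped dims
            totals[key] += row[3]               # sum the units_sold measure
        aggregations[tuple(dimensions[i] for i in group)] = dict(totals)

# The ("store",) aggregation answers "units sold per store" directly:
print(aggregations[("store",)])   # {('Berlin',): 3, ('Paris',): 5}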

Types
OLAP systems have been traditionally categorized using the following
taxonomy.

Multidimensional
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
MOLAP stores data in optimized multi-dimensional array storage, rather
than in a relational database. It therefore requires the pre-computation and
storage of information in the cube - the operation known as processing.
Relational
ROLAP works directly with relational databases. The base data and the
dimension tables are stored as relational tables, and new tables are created to
hold the aggregated information. ROLAP depends on a specialized schema design.

Hybrid
There is no clear agreement across the industry as to what constitutes "Hybrid
OLAP", except that a database will divide data between relational and
specialized storage. For example, for some vendors, a HOLAP database will use
relational tables to hold the larger quantities of detailed data, and use
specialized storage for at least some aspects of the smaller quantities of more-
aggregate or less-detailed data.

Comparison
Each type has certain benefits, although there is disagreement about the
specifics of the benefits between providers.
• Some MOLAP implementations are prone to database explosion.
Database explosion is a phenomenon causing vast amounts of storage
space to be used by MOLAP databases when certain common conditions
are met: high number of dimensions, pre-calculated results and sparse
multidimensional data. The typical mitigation technique for database
explosion is not to materialize all the possible aggregations, but only an
optimal subset of aggregations based on the desired performance vs.
storage trade-off.
• MOLAP generally delivers better performance due to specialized indexing
and storage optimizations. MOLAP also needs less storage space
compared to ROLAP because the specialized storage typically includes
compression techniques.
• ROLAP is generally more scalable. However, large volume pre-processing
is difficult to implement efficiently so it is frequently skipped. ROLAP
query performance can therefore suffer tremendously.
• Since ROLAP relies more on the database to perform calculations, it has
more limitations in the specialized functions it can use.
• HOLAP encompasses a range of solutions that attempt to mix the best of
ROLAP and MOLAP. It can generally pre-process quickly, scale well, and
offer good function support.

Other types
The following acronyms are also sometimes used, although they are not as
widespread as the ones above:
• WOLAP - Web-based OLAP
• DOLAP - Desktop OLAP
• RTOLAP - Real-Time OLAP

APIs and query languages


Unlike relational databases, which had SQL as the standard query language and
widespread APIs such as ODBC, JDBC and OLE DB, there was for a long time no
such unification in the OLAP world. The first real standard API was the
OLE DB for OLAP specification from Microsoft, which appeared in 1997 and
introduced the MDX query language. Several OLAP vendors - both server and
client - adopted it. In 2001 Microsoft and Hyperion announced the XML for
Analysis specification, which was endorsed by most of the OLAP vendors. Since
this also used MDX as a query language, MDX became the de-facto standard.

Products
The first product that performed OLAP queries was Express, which was released
in 1970 (and acquired by Oracle in 1995 from Information Resources). However,
the term did not appear until 1993, when it was coined by Edgar F. Codd, who
has been described as "the father of the relational database". Codd's paper
resulted from a short consulting assignment which Codd undertook for Arbor
Software (later Hyperion Solutions, acquired in 2007 by Oracle), as a sort of
marketing coup. The company had released its own OLAP product, Essbase, a
year earlier. As a result, Codd's "twelve laws of online analytical processing"
were explicit in their reference to Essbase. There was some ensuing
controversy, and when Computerworld learned that Codd was paid by Arbor, it
retracted the article. The OLAP market experienced strong growth in the late
1990s with dozens of commercial products going to market. In 1998, Microsoft
released its first OLAP server, Microsoft Analysis Services, which drove wide
adoption of OLAP technology and moved it into the mainstream.

Product comparison
Market structure

Below is a list of top OLAP vendors in 2006, with figures in millions of United
States Dollars.

Vendor                           Global Revenue
Microsoft Corporation                     1,806
Hyperion Solutions Corporation            1,077
Cognos                                      735
Business Objects                            416
MicroStrategy                               416
SAP AG                                      330
Cartesis SA                                 210
Applix                                      205
Infor                                       199
Oracle Corporation                          159
Others                                      152
Total                                     5,700

Microsoft was the only vendor that continuously exceeded the industry
average growth during 2000-2006. Since the above data was collected,
Hyperion has been acquired by Oracle, Cartesis by Business Objects, Business
Objects by SAP, Applix by Cognos, and Cognos by IBM.

Ques 3. Discuss the following Data Preprocessing techniques:


• Data Cleaning
• Data Integration and Transformation

Data cleansing or data scrubbing is the act of detecting and correcting (or
removing) corrupt or inaccurate records from a record set, table, or database.
Used mainly in databases, the term refers to identifying incomplete, incorrect,
inaccurate or irrelevant parts of the data and then replacing, modifying or
deleting this dirty data.
After cleansing, a data set will be consistent with other similar data sets in the
system. The inconsistencies detected or removed may have been originally
caused by different data dictionary definitions of similar entities in different
stores, may have been caused by user entry errors, or may have been
corrupted in transmission or storage.
Data cleansing differs from data validation in that validation almost invariably
means data is rejected from the system at entry and is performed at entry
time, rather than on batches of data.
The actual process of data cleansing may involve removing typographical errors
or validating and correcting values against a known list of entities. The
validation may be strict (such as rejecting any address that does not have a
valid postal code) or fuzzy (such as correcting records that partially match
existing, known records).

Motivation
Administratively, incorrect or inconsistent data can lead to false conclusions
and misdirected investments on both public and private scales. For instance,
the government may want to analyze population census figures to decide which
regions require further spending and investment on infrastructure and services.
In this case, it will be important to have access to reliable data to avoid
erroneous fiscal decisions.
In the business world, incorrect data can be costly. Many companies use
customer information databases that record data like contact information,
addresses, and preferences. If for instance the addresses are inconsistent, the
company will suffer the cost of resending mail or even losing customers.

Data quality
High quality data needs to pass a set of quality criteria. Those include:
• Accuracy: An aggregated value over the criteria of integrity, consistency
and density
• Integrity: An aggregated value over the criteria of completeness and
validity
• Completeness: Achieved by correcting data containing anomalies
• Validity: Approximated by the amount of data satisfying integrity
constraints
• Consistency: Concerns contradictions and syntactical anomalies
• Uniformity: Directly related to irregularities
• Density: The quotient of missing values in the data to the total number of
values that ought to be known
• Uniqueness: Related to the number of duplicates in the data

The process of data cleansing


• Data Auditing: The data is audited with the use of statistical methods to
detect anomalies and contradictions. This eventually gives an indication
of the characteristics of the anomalies and their locations.
• Workflow specification: The detection and removal of anomalies is
performed by a sequence of operations on the data known as the
workflow. It is specified after the process of auditing the data and is
crucial in achieving the end product of high quality data. In order to
achieve a proper workflow, the causes of the anomalies and errors in the
data have to be closely considered. If for instance we find that an
anomaly is a result of typing errors in data input stages, the layout of the
keyboard can help in manifesting possible solutions.
• Workflow execution: In this stage, the workflow is executed after its
specification is complete and its correctness is verified. The
implementation of the workflow should be efficient even on large sets of
data which inevitably poses a trade-off because the execution of a data
cleansing operation can be computationally expensive.
• Post-Processing and Controlling: After executing the cleansing
workflow, the results are inspected to verify correctness. Data that could
not be corrected during execution of the workflow are manually
corrected if possible. The result is a new cycle in the data cleansing
process where the data is audited again to allow the specification of an
additional workflow to further cleanse the data by automatic processing.

Popular methods used


• Parsing: Parsing in data cleansing is performed for the detection of
syntax errors. A parser decides whether a string of data is acceptable
within the allowed data specification. This is similar to the way a parser
works with grammars and languages.
• Data Transformation: Data Transformation allows the mapping of the
data from their given format into the format expected by the appropriate
application. This includes value conversions or translation functions as
well as normalizing numeric values to conform to minimum and
maximum values.
• Duplicate Elimination: Duplicate detection requires an algorithm for
determining whether data contains duplicate representations of the
same entity. Usually, data is sorted by a key that brings duplicate
entries closer together for faster identification (see the sketch after this
list).
• Statistical Methods: By analyzing the data using the values of mean,
standard deviation, range, or clustering algorithms, it is possible for an
expert to find values that are unexpected and thus erroneous. Although
the correction of such data is difficult since the true value is not known, it
can be resolved by setting the values to an average or other statistical
value. Statistical methods can also be used to handle missing values
which can be replaced by one or more plausible values that are usually
obtained by extensive data augmentation algorithms.
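The following Python sketch illustrates the last two methods above on a handful
of made-up records: duplicate elimination by sorting on a normalized key,
followed by a simple mean and standard-deviation check that flags suspicious
values for manual review.

from statistics import mean, stdev

# Records are (name, city, age); the values are invented for illustration.
records = [
    ("Ann Smith", "Berlin", 34), ("ann smith", "Berlin", 34),  # duplicates
    ("Bob Jones", "Paris", 29),  ("Carol Wu", "Paris", 41),
    ("Dan Lee", "Berlin", 37),   ("Eve Adams", "Paris", 33),
    ("Finn Berg", "Oslo", 45),   ("Gia Rossi", "Rome", 31),
    ("Hal Moore", "Paris", 290),                               # suspicious age
]

# Duplicate elimination: sort on a normalized key so duplicate entries end
# up adjacent, then keep only the first record of each run.
key = lambda r: (r[0].lower(), r[1])
deduped = []
for rec in sorted(records, key=key):
    if not deduped or key(deduped[-1]) != key(rec):
        deduped.append(rec)

# Statistical method: flag values more than two standard deviations
# from the mean as candidates for manual correction.
ages = [age for _, _, age in deduped]
mu, sigma = mean(ages), stdev(ages)
suspects = [rec for rec in deduped if abs(rec[2] - mu) > 2 * sigma]
print(len(deduped), suspects)   # 8 records kept; Hal Moore flagged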

Existing tools
Before computer automation, data about individuals or organizations were
maintained and secured as paper records, dispersed in separate business or
organizational units. Information systems concentrate data in computer files
that can potentially be accessed by large numbers of people and by groups
outside the organization.

Criticism of Existing Tools and Processes


The value and current approaches to Data Cleansing have come under criticism
due to some parties claiming large costs and low return on investment from
major data cleansing initiatives.

Challenges and problems


• Error Correction and loss of information: The most challenging
problem within data cleansing remains the correction of values to
remove duplicates and invalid entries. In many cases, the available
information on such anomalies is limited and insufficient to determine
the necessary transformations or corrections leaving the deletion of such
entries as the only plausible solution. The deletion of data though, leads
to loss of information which can be particularly costly if there is a large
amount of deleted data.
• Maintenance of Cleansed Data: Data cleansing is an expensive and
time consuming process. So after having performed data cleansing and
achieving a data collection free of errors, one would want to avoid the re-
cleansing of data in its entirety after some values in data collection
change. The process should only be repeated on values that have
changed which means that a cleansing lineage would need to be kept
which would require efficient data collection and management
techniques.
• Data Cleansing in Virtually Integrated Environments: In virtually
integrated sources, like IBM’s DiscoveryLink, the cleansing of data has to
be performed every time the data is accessed, which considerably
decreases the response time and efficiency.
• Data Cleansing Framework: In many cases it will not be possible to
derive a complete data cleansing graph to guide the process in advance.
This makes data cleansing an iterative process involving significant
exploration and interaction which may require a framework in the form of
a collection of methods for error detection and elimination in addition to
data auditing. This can be integrated with other data processing stages
like integration and maintenance.

Data integration

Data integration involves combining data residing in different sources and
providing users with a unified view of these data.[1] This process becomes
significant in a variety of situations both commercial (when two similar
companies need to merge their databases) and scientific (combining research
results from different bioinformatics repositories, for example). Data integration
appears with increasing frequency as the volume and the need to share
existing data explodes. It has become the focus of extensive theoretical work,
and numerous open problems remain unsolved. In management circles, people
frequently refer to data integration as "Enterprise Information Integration" (EII).

Example
Consider a web application where a user can query a variety of information
about cities (such as crime statistics, weather, hotels, demographics, etc).
Traditionally, the information must exist in a single database with a single
schema. But any single enterprise would find information of this breadth
somewhat difficult and expensive to collect. Even if the resources exist to
gather the data, it would likely duplicate data in existing crime databases,
weather websites, and census data.
A data-integration solution may address this problem by considering these
external resources as materialized views over a virtual mediated schema,
resulting in "virtual data integration". This means application developers
construct a virtual schema — the mediated schema — to best model the kinds of
answers their users want. Next, they design "wrappers" or adapters for each
data source, such as the crime database and weather website. These adapters
simply transform the local query results (those returned by the respective
websites or databases) into an easily processed form for the data integration
solution (see figure 2). When an application-user queries the mediated schema,
the data-integration solution transforms this query into appropriate queries
over the respective data sources. Finally, the virtual database combines the
results of these queries into the answer to the user's query.
This solution offers the convenience of adding new sources by simply
constructing an adapter for them. It contrasts with ETL systems or with a single
database solution, which require manual integration of the entire new dataset
into the system.
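As a toy sketch of this arrangement, the fragment below fans a city query out to
two hypothetical adapters and combines their results into one mediated form.
The source data, the adapter functions and the attribute names are all invented
for illustration.

# Each adapter turns a source-specific answer into the mediated form
# {"city": ..., "attribute": ..., "value": ...}.
def crime_adapter(city):
    crime_db = {"Springfield": 123, "Shelbyville": 45}       # stand-in source
    return [{"city": city, "attribute": "crimes", "value": crime_db.get(city)}]

def weather_adapter(city):
    weather_site = {"Springfield": "rainy", "Shelbyville": "sunny"}
    return [{"city": city, "attribute": "weather", "value": weather_site.get(city)}]

def query_mediated_schema(city):
    # The "virtual database": fan the query out and combine the results.
    results = []
    for adapter in (crime_adapter, weather_adapter):
        results.extend(adapter(city))
    return results

print(query_mediated_schema("Springfield"))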

Theory of data integration


The theory of data integration forms a subset of database theory and
formalizes the underlying concepts of the problem in first-order logic. Applying
the theories gives indications as to the feasibility and difficulty of data
integration. While its definitions may appear abstract, they have sufficient
generality to accommodate all manner of integration systems.

Definitions
Data integration systems are formally defined as a triple ⟨G, S, M⟩, where G is
the global (or mediated) schema, S is the heterogeneous set of source
schemas, and M is the mapping that maps queries between the source and the
global schemas. Both G and S are expressed in languages over alphabets
composed of symbols for each of their respective relations. The mapping M
consists of assertions between queries over G and queries over S. When users
pose queries over the data integration system, they pose queries over G and
the mapping then asserts connections between the elements in the global
schema and the source schemas.
A database over a schema is defined as a set of sets, one for each relation (in a
relational database). The database corresponding to the source schema S
would comprise the set of sets of tuples for each of the heterogeneous data
sources and is called the source database. Note that this single source
database may actually represent a collection of disconnected databases. The
database corresponding to the virtual mediated schema G is called the global
database. The global database must satisfy the mapping M with respect to the
source database. The legality of this mapping depends on the nature of the
correspondence between G and S. Two popular ways to model this
correspondence exist: Global as View or GAV and Local as View or LAV.

Figure 3: Illustration of tuple space of the GAV and LAV mappings. In GAV, the
system is constrained to the set of tuples mapped by the mediators while the
set of tuples expressible over the sources may be much larger and richer. In
LAV, the system is constrained to the set of tuples in the sources while the set
of tuples expressible over the global schema can be much larger. Therefore LAV
systems must often deal with incomplete answers.
GAV systems model the global database as a set of views over S. In this case M
associates to each element of G a query over S. Query processing becomes a
straightforward operation due to the well-defined associations between G and
S. The burden of complexity falls on implementing mediator code instructing
the data integration system exactly how to retrieve elements from the source
databases. If any new sources join the system, considerable effort may be
necessary to update the mediator, thus the GAV approach appears preferable
when the sources seem unlikely to change.
In a GAV approach to the example data integration system above, the system
designer would first develop mediators for each of the city information sources
and then design the global schema around these mediators. For example,
consider if one of the sources served a weather website. The designer would
likely then add a corresponding element for weather to the global schema.
Then the bulk of effort concentrates on writing the proper mediator code that
will transform predicates on weather into a query over the weather website.
This effort can become complex if some other source also relates to weather,
because the designer may need to write code to properly combine the results
from the two sources.
On the other hand, in LAV, the source database is modeled as a set of views
over G. In this case M associates to each element of S a query over G. Here the
exact associations between G and S are no longer well-defined. As is illustrated
in the next section, the burden of determining how to retrieve elements from
the sources is placed on the query processor. The benefit of an LAV modeling is
that new sources can be added with far less work than in a GAV system, thus
the LAV approach should be favored in cases where the mediated schema is
more likely to change.
In an LAV approach to the example data integration system above, the system
designer designs the global schema first and then simply inputs the schemas of
the respective city information sources. Consider again if one of the sources
serves a weather website. The designer would add corresponding elements for
weather to the global schema only if none existed already. Then programmers
write an adapter or wrapper for the website and add a schema description of
the website's results to the source schemas. The complexity of adding the new
source moves from the designer to the query processor.

Query processing
The theory of query processing in data integration systems is commonly
expressed using conjunctive queries [5]. One can loosely think of a conjunctive
query as a logical function applied to the relations of a database such as "f(A,B)
where A < B". If a tuple or set of tuples is substituted into the rule and satisfies
it (makes it true), then we consider that tuple as part of the set of answers in
the query. While formal languages like Datalog express these queries concisely
and without ambiguity, common SQL queries count as conjunctive queries as
well.
In terms of data integration, "query containment" represents an important
property of conjunctive queries. A query A contains another query B (denoted
B ⊆ A) if the results of applying B are a subset of the results of applying A for
any database. The two queries are said to be equivalent if the resulting sets are
equal for any database. This is important because in both GAV and LAV
systems, a user poses conjunctive queries over a virtual schema represented
by a set of views, or "materialized" conjunctive queries. Integration seeks to
rewrite the queries represented by the views to make their results equivalent or
maximally contained by our user's query. This corresponds to the problem of
answering queries using views (AQUV).
In GAV systems, a system designer writes mediator code to define the query-
rewriting. Each element in the user's query corresponds to a substitution rule
just as each element in the global schema corresponds to a query over the
source. Query processing simply expands the subgoals of the user's query
according to the rule specified in the mediator and thus the resulting query is
likely to be equivalent. While the designer does the majority of the work
beforehand, some GAV systems such as Tsimmis involve simplifying the
mediator description process.
In LAV systems, queries undergo a more radical process of rewriting because no
mediator exists to align the user's query with a simple expansion strategy. The
integration system must execute a search over the space of possible queries in
order to find the best rewrite. The resulting rewrite may not be an equivalent
query but maximally contained, and the resulting tuples may be incomplete. As
of 2009 the MiniCon algorithm[6] is the leading query rewriting algorithm for LAV
data integration systems.
In general, the complexity of query rewriting is NP-complete. If the space of
rewrites is relatively small this does not pose a problem — even for integration
systems with hundreds of sources.

Enterprise information integration


Enterprise information integration (EII) applies data integration commercially.
Despite the theoretical problems described above, the private sector shows
more concern with the problems of data integration as a viable product. [7] EII
emphasizes neither correctness nor tractability, but rather speed and simplicity.
An EII industry has emerged, but many professionals believe it does not perform
to its full potential. Practitioners cite the following major issues which EII must
address for the industry to become mature:

Simplicity of understanding
Answering queries with views arouses interest from a theoretical
standpoint, but practitioners have difficulty understanding how to
incorporate it into an "enterprise solution". Some developers believe it
should be merged with EAI. Others believe it should be incorporated with
ETL systems, citing customers' confusion over the differences between
the two services.
Simplicity of deployment
Even if recognized as a solution to a problem, EII as of 2009 still takes
time to apply and presents complexities in deployment. People have
proposed a variety of schema-less solutions such as "Lean Middleware",
but ease-of-use and speed of employment appear inversely proportional
to the generality of such systems. Others cite the need for standard data
interfaces to speed and simplify the integration process in practice.
Handling higher-order information
Analysts experience difficulty — even with a functioning information
integration system — in determining whether the sources in the
database will satisfy a given application. Answering these kinds of
questions about a set of repositories requires semantic information like
metadata and/or ontologies. The few commercial tools that leverage this
information remain in their infancy.

Data transformation
In metadata terms, a data transformation converts data from a source data
format into a destination data format.
Data transformation can be divided into two steps:
1. data mapping, which maps data elements from the source to the destination
and captures any transformation that must occur, and
2. code generation, which creates the actual transformation program.
Data element to data element mapping is frequently complicated by complex
transformations that require one-to-many and many-to-one transformation
rules.
The code generation step takes the data element mapping specification and
creates an executable program that can be run on a computer system. Code
generation can also create transformation in easy-to-maintain computer
languages such as Java or XSLT.
When the mapping is indirect via a mediating data model, the process is also
called data mediation.

Transformational languages
There are numerous languages available for performing data transformation.
Many transformational languages require a grammar to be provided. In many
cases the grammar is structured using something closely resembling Backus–
Naur Form (BNF). There are numerous languages available for such purposes
varying in their accessibility (cost) and general usefulness. Examples of such
languages include:
• XSLT - the XML transformation language
• TXL - prototyping language-based descriptions using source
transformation
Although transformational languages are typically best suited for
transformation, something as simple as regular expressions can be used to
achieve useful transformations. TextPad supports the use of regular expressions
with arguments, which allows all instances of a particular pattern to be replaced
with another pattern using parts of the original pattern.
For example:
foo ("some string", 42, gCommon);
bar (someObj, anotherObj);

foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);

could both be transformed into a more compact form like:

foobar("some string", 42, someObj, anotherObj);
foobar("another string", 24, myObj, myOtherObj);
In other words, all instances of a function invocation of foo with three
arguments, followed by a function invocation of bar with two arguments, would
be replaced with a single function invocation using some or all of the original
set of arguments.
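A minimal Python sketch of the same idea is shown below; TextPad applies such
replacements inside the editor with its own regex dialect, so this only illustrates
the pattern and the back-referencing replacement.

import re

source = '''foo ("some string", 42, gCommon);
bar (someObj, anotherObj);

foo ("another string", 24, gCommon);
bar (myObj, myOtherObj);
'''

# Match a three-argument call to foo followed by a two-argument call to bar,
# capturing the parts that are kept in the merged call.
pattern = re.compile(
    r'foo \(([^,]+), ([^,]+), [^)]+\);\s*\n\s*'
    r'bar \(([^,]+), ([^)]+)\);'
)

print(pattern.sub(r'foobar(\1, \2, \3, \4);', source))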
Another advantage to using regular expressions is that they will not fail the null
transform test. That is, using your transformational language of choice, run a
sample program through a transformation that doesn't perform any
transformations. Many transformational languages will fail this test.

Difficult problems
There are many challenges in data transformation. Probably the most difficult
problem to address in C++ is "unstructured preprocessor directives". These are
preprocessor directives which do not contain blocks of code with simple
grammatical descriptions - example:
void MyFunc ()
{
    if (x > 17)
    {   printf("test");
#ifdef FOO
    } else {
#endif
        if (gWatch)
            mTest = 42;
    }
}
A really general solution to handling this is very hard because such
preprocessor directives can essentially edit the underlying language in arbitrary
ways. However, because such directives are not, in practice, used in completely
arbitrary ways, one can build practical tools for handling preprocessed
languages. The DMS Software Reengineering Toolkit is capable of handling
structured macros and preprocessor conditionals.

Ques 4. Define a frequent set. Define an association rule


In data mining, association rule learning is a popular and well researched
method for discovering interesting relations between variables in large
databases. Piatetsky-Shapiro describes analyzing and presenting strong rules
discovered in databases using different measures of interestingness. Based on
the concept of strong rules, Agrawal et al. [2] introduced association rules for
discovering regularities between products in large scale transaction data
recorded by point-of-sale (POS) systems in supermarkets. For example, the rule
{onions, potatoes} ⇒ {beef} found in the sales data of a supermarket
would indicate that if a customer buys onions and potatoes together, he or she
is likely to also buy beef. Such information can be used as the basis for
decisions about marketing activities such as, e.g., promotional pricing or
product placements. In addition to the above example from market basket
analysis association rules are employed today in many application areas
including Web usage mining, intrusion detection and bioinformatics.

Definition
Following the original definition by Agrawal et al., the problem of association
rule mining is defined as follows. Let I = {i1, i2, ..., in} be a set of n binary
attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called
the database. Each transaction in D has a unique transaction ID and contains a
subset of the items in I. A rule is defined as an implication of the form
X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X
and Y are called the antecedent (left-hand side or LHS) and consequent
(right-hand side or RHS) of the rule.
Example database with 4 items and 5 transactions

transaction ID   milk   bread   butter   beer
            1      1       1        0      0
            2      0       1        1      0
            3      0       0        0      1
            4      1       1        1      0
            5      0       1        0      0
To illustrate the concepts, we use a small example from the supermarket
domain. The set of items is I = {milk,bread,butter,beer} and a small database
containing the items (1 codes presence and 0 absence of an item in a
transaction) is shown in the table above. An example rule for the
supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and
bread are bought, customers also buy butter.
Note: this example is extremely small. In practical applications, a rule needs a
support of several hundred transactions before it can be considered statistically
significant, and datasets often contain thousands or millions of transactions.

To select interesting rules from the set of all possible rules, constraints on
various measures of significance and interest can be used. The best-known
constraints are minimum thresholds on support and confidence. The support
supp(X) of an item set X is defined as the proportion of transactions in the data
set which contain the item set. In the example database, the item set
{milk,bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all
transactions (2 out of 5 transactions).
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2 /
0.4 = 0.5 in the database, which means that for 50% of the transactions
containing milk and bread the rule is correct. Confidence can be interpreted as
an estimate of the probability P(Y | X), the probability of finding the RHS of the
rule in transactions under the condition that these transactions also contain the
LHS.

The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),
or the ratio of the observed support to that expected if X and Y were
independent. The rule {milk, bread} ⇒ {butter} has a lift of
0.2 / (0.4 × 0.4) = 1.25.

The conviction of a rule is defined as
conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)).

The rule {milk, bread} ⇒ {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2,
and can be interpreted as the ratio of the expected frequency that X occurs
without Y (that is to say, the frequency that the rule makes an incorrect
prediction) if X and Y were independent, divided by the observed frequency of
incorrect predictions. In this example, the conviction value of 1.2 shows that the
rule {milk, bread} ⇒ {butter} would be incorrect 20% more often (1.2 times as
often) if the association between X and Y were purely random chance.
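For the example database above, these measures can be reproduced with a few
lines of Python; this is only an illustrative sketch using plain sets.

transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset):
    # Fraction of transactions that contain the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return supp(lhs | rhs) / supp(lhs)

def lift(lhs, rhs):
    return supp(lhs | rhs) / (supp(lhs) * supp(rhs))

def conviction(lhs, rhs):
    return (1 - supp(rhs)) / (1 - confidence(lhs, rhs))

X, Y = {"milk", "bread"}, {"butter"}
print(supp(X))            # 0.4
print(confidence(X, Y))   # 0.5
print(lift(X, Y))         # 1.25 (up to floating-point rounding)
print(conviction(X, Y))   # 1.2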
Association rules are required to satisfy a user-specified minimum support and
a user-specified minimum confidence at the same time. To achieve this,
association rule generation is a two-step process. First, minimum support is
applied to find all frequent itemsets in a database. In a second step, these
frequent itemsets and the minimum confidence constraint are used to form
rules. While the second step is straightforward, the first step needs more
attention.
Finding all frequent itemsets in a database is difficult since it involves searching
all possible itemsets (item combinations). The set of possible itemsets is the
power set over I and has size 2^n − 1 (excluding the empty set, which is not a
valid itemset). Although the size of the power set grows exponentially in the
number of items n in I, efficient search is possible using the downward-closure
property of support [2] (also called anti-monotonicity), which guarantees that
for a frequent itemset all its subsets are also frequent, and thus for an
infrequent itemset all its supersets must be infrequent. Exploiting this property,
efficient algorithms (e.g., Apriori and Eclat) can find all frequent itemsets.

Statistically sound associations


One limitation of the standard approach to discovering associations is that by
searching massive numbers of possible associations to look for collections of
items that appear to be associated, there is a large risk of finding many
spurious associations. These are collections of items that co-occur with
unexpected frequency in the data, but only do so by chance. For example,
suppose we are considering a collection of 10,000 items and looking for rules
containing two items in the left-hand-side and 1 item in the right-hand-side.
There are approximately 1,000,000,000,000 such rules. If we apply a statistical
test for independence with a significance level of 0.05 it means there is only a
5% chance of accepting a rule if there is no association. If we assume there are
no associations, we should nonetheless expect to find 50,000,000,000 rules.
Statistically sound association discovery controls this risk, in most cases
reducing the risk of finding any spurious associations to a user-specified
significance level.
Algorithms
Many algorithms for generating association rules were presented over time.
Some well known algorithms are Apriori, Eclat and FP-Growth, but they only do
half the job, since they are algorithms for mining frequent itemsets. Another
step needs to be done afterwards to generate rules from the frequent itemsets
found in a database.

Apriori algorithm
Apriori is the best-known algorithm for mining association rules. It uses a
breadth-first search strategy to count the support of itemsets and uses a
candidate generation function which exploits the downward-closure property
of support.
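A compact sketch of this level-wise idea is shown below, run on the small
example database above; real Apriori implementations use far more careful
candidate generation and support counting, so this is only an illustration.

from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise (breadth-first) search for frequent itemsets.
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    frequent = {}
    current = {i for i in {frozenset([x]) for t in transactions for x in t}
               if support(i) >= min_support}
    k = 1
    while current:
        frequent.update({i: support(i) for i in current})
        # Join step, then prune candidates that have an infrequent subset
        # (the downward-closure property), then check support.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        current = {c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, k))
                   and support(c) >= min_support}
        k += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"milk", "bread"}, {"bread", "butter"}, {"beer"},
                 {"milk", "bread", "butter"}, {"bread"}]]
print(apriori(transactions, min_support=0.4))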
Eclat algorithm
Eclat is a depth-first search algorithm using set intersection.

FP-growth algorithm
FP-growth (frequent pattern growth) uses an extended prefix-tree (FP-tree)
structure to store the database in a compressed form. FP-growth adopts a
divide-and-conquer approach to decompose both the mining tasks and the
databases. It uses a pattern fragment growth method to avoid the costly
process of candidate generation and testing used by Apriori.

One-attribute-rule
The one-attribute-rule, or OneR, is an algorithm for finding association rules.
According to Ross, very simple association rules, involving just one attribute in
the condition part, often work well in practice with real-world data. The idea of
the OneR (one-attribute-rule) algorithm is to find the single attribute that, when
used to classify a novel data point, makes the fewest prediction errors.
For example, to classify a car you haven't seen before, you might apply the
following rule: If Fast Then Sportscar, as opposed to a rule with multiple
attributes in the condition: If Fast And Softtop And Red Then Sportscar.
The algorithm is as follows:

For each attribute A:
    For each value V of that attribute, create a rule:
        • count how often each class appears
        • find the most frequent class, c
        • make a rule "if A=V then C=c"
    Calculate the error rate of the rules for A
Pick the attribute whose rules produce the lowest error rate
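A small Python sketch of this procedure is shown below, run on a made-up car
data set; the attribute values and class labels are invented purely for
illustration.

from collections import Counter, defaultdict

data = [  # (attributes, class label) -- made-up training examples
    ({"fast": "yes", "softtop": "yes", "red": "yes"}, "sportscar"),
    ({"fast": "yes", "softtop": "no",  "red": "no"},  "sportscar"),
    ({"fast": "no",  "softtop": "no",  "red": "yes"}, "family"),
    ({"fast": "no",  "softtop": "yes", "red": "no"},  "family"),
    ({"fast": "yes", "softtop": "no",  "red": "yes"}, "family"),
]

def one_r(data):
    best = None
    for attr in data[0][0]:
        # For each value of the attribute, predict its most frequent class.
        by_value = defaultdict(Counter)
        for row, label in data:
            by_value[row[attr]][label] += 1
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(label != rules[row[attr]] for row, label in data)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

print(one_r(data))   # ('fast', {'yes': 'sportscar', 'no': 'family'}, 1)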

Zero-attribute-rule
The zero-attribute-rule, or ZeroR, does not involve any attribute in the
condition part, and always returns the most frequent class in the training set.
This algorithm is frequently used to measure the classification success of other
algorithms.

Lore
A famous story about association rule mining is the "beer and diaper" story. A
purported survey of behavior of supermarket shoppers discovered that
customers (presumably young men) who buy diapers tend also to buy beer.
This anecdote became popular as an example of how unexpected association
rules might be found from everyday data. [See
http://www.dssresources.com/newsletters/66.php]
GUHA procedure ASSOC
GUHA is a general method for exploratory data analysis that has theoretical
foundations in observational calculi. The ASSOC procedure [19] is a GUHA
method which mines for generalized association rules using fast bitstring
operations. The association rules mined by this method are more general than
those output by Apriori; for example, "items" can be connected with both
conjunctions and disjunctions, and the relation between the antecedent and
consequent of the rule is not restricted to setting minimum support and
confidence as in Apriori: an arbitrary combination of supported interest
measures can be used.

Other types of association mining


Contrast set learning is a form of associative learning. Contrast set
learners use rules that differ meaningfully in their distribution across subsets.
Weighted class learning is another form of associative learning in which
weight may be assigned to classes to give focus to a particular issue of concern
for the consumer of the data mining results.
K-optimal pattern discovery provides an alternative to the standard
approach to association rule learning that requires that each pattern appear
frequently in the data.
Mining frequent sequences uses support to find sequences in temporal data.

Ques 5 Define the terms Classification and Prediction. Discuss the issues pertaining to Classification and Prediction.

Gregory Piatetsky-Shapiro answers:

The decision tree is a classification model, applied to existing data. If you apply
it to new data, for which the class is unknown, you also get a prediction of the
class. The assumption is that the new data comes from a similar distribution to
the data you used to build your decision tree. In many cases this is a correct
assumption and that is why you can use the decision tree for building a
predictive model.

Gregory Piatetsky-Shapiro answers:

It is a matter of definition. If you are trying to classify existing data, e.g. group
patients based on their known medical data and treatment outcome, I would
call it a classification. If you use a classification model to predict the treatment
outcome for a new patient, it would be a prediction.
gabrielac adds: In the book "Data Mining Concepts and Techniques", Han and
Kamber's view is that predicting class labels is classification, and predicting
values (e.g. using regression techniques) is prediction. Other people prefer to
use "estimation" for predicting continuous values.

Ques 6. Explain major clustering methods with examples.

Clustering methods
The goal of clustering is to reduce the amount of data by categorizing or
grouping similar data items together. Such grouping is pervasive in the way
humans process information, and one of the motivations for using clustering
algorithms is to provide automated tools to help in constructing categories or
taxonomies [Jardine and Sibson, 1971, Sneath and Sokal, 1973]. The methods
may also be used to minimize the effects of human factors in the process.
Clustering methods [Anderberg, 1973, Hartigan, 1975, Jain and Dubes, 1988,
Jardine and Sibson, 1971, Sneath and Sokal, 1973, Tryon and Bailey, 1973] can
be divided into two basic types: hierarchical and partitional clustering. Within
each of the types there exists a wealth of subtypes and different algorithms for
finding the clusters.
Hierarchical clustering proceeds successively by either merging smaller clusters
into larger ones, or by splitting larger clusters. The clustering methods differ in
the rule by which it is decided which two small clusters are merged or which
large cluster is split. The end result of the algorithm is a tree of clusters called a
dendrogram, which shows how the clusters are related. By cutting the
dendrogram at a desired level a clustering of the data items into disjoint groups
is obtained.
Partitional clustering, on the other hand, attempts to directly decompose the
data set into a set of disjoint clusters. The criterion function that the clustering
algorithm tries to minimize may emphasize the local structure of the data, as
by assigning clusters to peaks in the probability density function, or the global
structure. Typically the global criteria involve minimizing some measure of
dissimilarity in the samples within each cluster, while maximizing the
dissimilarity of different clusters.

A commonly used partitional clustering method, K-means clustering [MacQueen, 1967], will be discussed in some detail since it is closely related to the SOM algorithm. In K-means clustering the criterion function is the average squared distance of the data items $x_i$ from their nearest cluster centroids,

$E = \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - m_{c(i)} \rVert^2$,    (1)

where $c(i)$ is the index of the centroid that is closest to $x_i$. One possible algorithm for minimizing the cost function begins by initializing a set of K cluster centroids denoted by $m_k$, $k = 1, \ldots, K$. The positions of the $m_k$ are then adjusted iteratively by first assigning the data samples to the nearest clusters and then recomputing the centroids. The iteration is stopped when E does not change markedly any more. In an alternative algorithm each randomly chosen sample is considered in succession, and the nearest centroid is updated.
Equation 1 is also used to describe the objective of a related method, vector quantization [Gersho, 1979, Gray, 1984, Makhoul et al., 1985]. In vector quantization the goal is to minimize the average (squared) quantization error, the distance between a sample $x$ and its representation $m_{c(x)}$. The algorithm for minimizing Equation 1 that was described above is actually a straightforward generalization of the algorithm proposed by Lloyd (1957) for minimizing the average quantization error in a one-dimensional setting.

A problem with the clustering methods is that the interpretation of the clusters
may be difficult. Most clustering algorithms prefer certain cluster shapes, and
the algorithms will always assign the data to clusters of such shapes even if
there were no clusters in the data. Therefore, if the goal is not just to compress
the data set but also to make inferences about its cluster structure, it is
essential to analyze whether the data set exhibits a clustering tendency. The
results of the cluster analysis need to be validated, as well. Jain and Dubes
(1988) present methods for both purposes.
Another potential problem is that the choice of the number of clusters may be
critical: quite different kinds of clusters may emerge when K is changed. Good
initialization of the cluster centroids may also be crucial; some clusters may
even be left empty if their centroids lie initially far from the distribution of data.
Clustering can be used to reduce the amount of data and to induce a
categorization. In exploratory data analysis, however, the categories have only
limited value as such. The clusters should be illustrated somehow to aid in
understanding what they are like. For example in the case of the K-means
algorithm the centroids that represent the clusters are still high-dimensional,
and some additional illustration methods are needed for visualizing them.
August 2009
Bachelor of Science in Information Technology (BScIT) – Semester 4
BT0050 – Data Warehousing & Mining – 4 Credits
(Book ID: B0038)
Assignment Set – 2 (60 Marks)

Answer all questions 6 x 10 = 60


Ques 1 Discuss the following Data Mining Functionalities:

Association Analysis
In data mining, association rule learning is a popular and well researched
method for discovering interesting relations between variables in large
databases. Piatetsky-Shapiro describes analyzing and presenting strong rules
discovered in databases using different measures of interestingness. Based on
the concept of strong rules, Agrawal etal. introduced association rules for
discovering regularities between products in large scale transaction data
recorded by point-of-sale (POS) systems in supermarkets. For example, the rule
found in the sales data of a supermarket would indicate that if a customer buys
onions and potatoes together, he or she is likely to also buy beef. Such
information can be used as the basis for decisions about marketing activities
such as, e.g., promotional pricing or product placements. In addition to the
above example from market basket analysis association rules are employed
today in many application areas including Web usage mining, intrusion
detection and bioinformatics.

Following the original definition by Agrawal et al., the problem of association rule mining is defined as follows. Let I = {i_1, i_2, ..., i_n} be a set of n binary attributes called items. Let D = {t_1, t_2, ..., t_m} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short itemsets) X and Y are called antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule.

Example database with 4 items and 5 transactions

transaction ID   milk   bread   butter   beer
      1            1      1       0       0
      2            0      1       1       0
      3            0      0       0       1
      4            1      1       1       0
      5            0      1       0       0

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table above. An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers also buy butter.
Note: this example is extremely small. In practical applications, a rule needs a
support of several hundred transactions before it can be considered statistically
significant, and datasets often contain thousands or millions of transactions.
To select interesting rules from the set of all possible rules, constraints on
various measures of significance and interest can be used. The best-known
constraints are minimum thresholds on support and confidence. The support
supp(X) of an itemset X is defined as the proportion of transactions in the data
set which contain the itemset. In the example database, the itemset
{milk,bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all
transactions (2 out of 5 transactions). The confidence of a rule X ⇒ Y is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

The lift of a rule is defined as lift(X ⇒ Y) = conf(X ⇒ Y) / supp(Y), or the ratio of the observed confidence to that expected by chance. The rule {milk, bread} ⇒ {butter} has a lift of 0.5 / 0.4 = 1.25. The conviction of a rule is defined as conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)). The rule {milk, bread} ⇒ {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2, and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule would be incorrect 20% more often (1.2 times as often) if the association between X and Y was purely random chance.
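
To make the arithmetic above concrete, the following small Python sketch (illustrative only; the helper name support is chosen here, not taken from any library) recomputes support, confidence, lift and conviction for the rule {milk, bread} ⇒ {butter} on the example database:

# The example database from the table above, one set of items per transaction.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset):
    """Proportion of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"milk", "bread"}, {"butter"}
supp_x, supp_y, supp_xy = support(X), support(Y), support(X | Y)
confidence = supp_xy / supp_x                  # 0.2 / 0.4 = 0.5
lift = confidence / supp_y                     # 0.5 / 0.4 = 1.25
conviction = (1 - supp_y) / (1 - confidence)   # 0.6 / 0.5 = 1.2
print(supp_xy, confidence, lift, conviction)   # 0.2 0.5 1.25 1.2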

Association rules are required to satisfy a user-specified minimum support and


a user-specified minimum confidence at the same time. To achieve this,
association rule generation is a two-step process. First, minimum support is
applied to find all frequent itemsets in a database. In a second step, these
frequent itemsets and the minimum confidence constraint are used to form
rules. While the second step is straightforward, the first step needs more attention.
Finding all frequent itemsets in a database is difficult since it involves searching all possible itemsets (item combinations). The set of possible itemsets is the power set over I and has size 2^n − 1 (excluding the empty set, which is not a valid itemset). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support (also called anti-monotonicity), which guarantees that for a frequent itemset all its subsets are also frequent and thus that for an infrequent itemset all its supersets must be infrequent. Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all frequent itemsets.
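
The second step mentioned above, generating rules from the frequent itemsets, can be sketched as follows (illustrative Python, assuming a dictionary that maps each frequent itemset to its support; counts or fractions both work, since confidence is a ratio of supports):

from itertools import combinations

def generate_rules(frequent_supports, min_confidence):
    """Split each frequent itemset into antecedent => consequent and keep confident rules."""
    rules = []
    for itemset, supp_xy in frequent_supports.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                confidence = supp_xy / frequent_supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(consequent), confidence))
    return rules

# Supports (as fractions) for the frequent itemsets of the example database above.
supports = {
    frozenset({"milk"}): 0.4, frozenset({"bread"}): 0.8,
    frozenset({"butter"}): 0.4, frozenset({"milk", "bread"}): 0.4,
    frozenset({"bread", "butter"}): 0.4,
}
print(generate_rules(supports, min_confidence=0.5))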

Classification and Prediction


This chapter focuses on the common data mining task of classification and
prediction of binary (or two-class) and multi-class targets. The model builders
supported by Rattle are introduced. We begin with a review of risk charts as a
mechanism for evaluating two class models.

Two class classification is the task of distinguishing between two classes of


entities - whether they be high risk and low risk insurance clients, productive
and unproductive audits, responsive and non-responsive customers, successful
and unsuccessful security breaches, and many other similar examples.

Rattle provides a straight-forward interface to the collection of model builders


commonly used in data mining. For each, a basic collection of the commonly
used tuning parameters is exposed through the interface for fine tuning the
model performance. Where possible, Rattle attempts to present good default
values to allow the user to simply build a model with no or little tuning. This
may not always be the right approach, but is certainly a good place to start.

The two class model builders provided by Rattle are: Decision Trees, Boosted
Decision Trees, Random Forests, Support Vector Machines, and Logistic
Regression. Whilst a model is being built you will see the cursor image change
to indicate the system is busy, and the status bar will report that a model is
being built.

We will consider each of the model builders deployed in Rattle and characterise
them through the sentences they generate and how they search for the best
sentences that capture or summarise what the data is indicating.

Cluster Analysis
Cluster analysis or clustering is the assignment of a set of observations into
subsets (called clusters) so that observations in the same cluster are similar in
some sense. Clustering is a method of unsupervised learning, and a common
technique for statistical data analysis used in many fields, including machine
learning, data mining, pattern recognition, image analysis and bioinformatics.

Types of clustering
Data clustering algorithms can be hierarchical. Hierarchical algorithms find
successive clusters using previously established clusters. These algorithms can
be either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative
algorithms begin with each element as a separate cluster and merge them into
successively larger clusters. Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller clusters. Partitional algorithms
typically determine all clusters at once, but can also be used as divisive
algorithms in the hierarchical clustering. Density-based clustering algorithms
are devised to discover arbitrary-shaped clusters. In this approach, a cluster is
regarded as a region in which the density of data objects exceeds a threshold.
DBSCAN and OPTICS are two typical algorithms of this kind.

Two-way clustering, co-clustering or biclustering are clustering methods where


not only the objects are clustered but also the features of the objects, i.e., if the
data is represented in a data matrix, the rows and columns are clustered
simultaneously. Many clustering algorithms require specification of the number
of clusters to produce in the input data set, prior to execution of the algorithm.
Barring knowledge of the proper value beforehand, the appropriate value must
be determined, a problem for which a number of techniques have been
developed.

Outlier Analysis
Rare, unusual, or just plain infrequent events are of interest in data mining in
many contexts including fraud in income tax, insurance, and online banking, as
well as for marketing. We classify analyses that focus on the discovery of such
data items as outlier analysis. A classical definition captures the concept of an outlier as:
An observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Outlier detection algorithms often fall into one of the categories of distance-based methods, density-based methods, projection-based methods, and distribution-based methods. A general approach to identifying outliers is to assume a known distribution for the data and to examine the deviation of individuals from the distribution. Such approaches are common in statistics but do not scale well. Distance-based methods are common in data mining, where the measure of an entity's outlierness is based on its distance to nearby entities; the number of nearby entities and the minimum distance are two parameters. Density-based approaches were introduced by Breunig, Kriegel, Ng and Sander (SIGMOD 2000) with LOF, the local outlier factor.
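
A hedged sketch of the distance-based idea follows (illustrative Python only; the parameter names radius and k are assumptions, not standard API names). A point is flagged as an outlier when fewer than k other points lie within the given radius:

import math

def distance_based_outliers(points, radius, k):
    """Flag points that have fewer than k neighbours within the given radius."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbours < k:
            outliers.append(p)
    return outliers

data = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (8.0, 8.0)]
print(distance_based_outliers(data, radius=1.0, k=2))   # [(8.0, 8.0)]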

Ques 2: Discuss various types of Data Warehouse Architectures with suitable examples
Data warehouse is a repository of an organization's electronically stored data.
Data warehouses are designed to facilitate reporting and analysis. A data
warehouse houses a standardized, consistent, clean and integrated form of
data sourced from various operational systems in use in the organization,
structured in a way to specifically address the reporting and analytic
requirements.

This definition of the data warehouse focuses on data storage. However, the
means to retrieve and analyze data, to extract, transform and load data, and to
manage the data dictionary are also considered essential components of a data
warehousing system. Many references to data warehousing use this broader
context. Thus, an expanded definition for data warehousing includes business
intelligence tools, tools to extract, transform, and load data into the repository,
and tools to manage and retrieve metadata.

Types of Data Warehouse


There are mainly three types of Data Warehouse:

1). Enterprise Data Warehouse.


2). Operational data store.
3). Data Mart.

An Enterprise Data Warehouse provides a central database for decision support throughout the enterprise. An Operational Data Store has a broad, enterprise-wide scope, but unlike a real enterprise data warehouse its data is refreshed in near real time and is used for routine business activity.

A Data Mart is a subset of a Data Warehouse. It supports a particular purpose or is designed for a particular line of business such as sales, marketing or finance; for example, in an organization the data of a particular department can form a data mart.
Ques 3 Discuss the issues to be considered during data integration.

Data integration involves combining data residing in different sources and


providing users with a unified view of these data. This process becomes
significant in a variety of situations both commercial (when two similar
companies need to merge their databases) and scientific (combining research
results from different bioinformatics repositories, for example). Data integration
appears with increasing frequency as the volume and the need to share
existing data explodes. It has become the focus of extensive theoretical work,
and numerous open problems remain unsolved. In management circles, people
frequently refer to data integration as "Enterprise Information Integration" (EII).

History

Figure 1: Simple schematic for a data warehouse. The ETL process extracts
information from the source databases, transforms it and then loads it into the
data warehouse.
Figure 2: Simple schematic for a data-integration solution. A system designer
constructs a mediated schema against which users can run queries. The virtual
database interfaces with the source databases via wrapper code if required.

Issues with combining heterogeneous data sources under a single query


interface have existed for some time. The rapid adoption of databases after the
1960s naturally led to the need to share or to merge existing repositories. This
merging can take place at several levels in the database architecture. One
popular solution involves data warehousing (see figure 1). The warehouse
system extracts, transforms, and loads data from several sources into a single
queriable schema. Architecturally, this offers a tightly coupled approach
because the data reside together in a single repository at query-time. Problems
with tight coupling can arise with the "freshness" of data, for example when an
original data source gets updated, but the warehouse still contains the older
data and the ETL process needs re-execution. Difficulties also arise in
constructing data warehouses when one has only a query interface to a data
source and no access to the full data. This problem frequently emerges when
integrating several commercial query services like travel or classified-
advertisement web-applications.
As of 2009 the trend in data integration has favored loosening the coupling
between data. This may involve providing a uniform query interface over a
mediated schema (see figure 2), thus transforming a query into specialized
queries over the original databases. One can also term this process "view-
based query-answering" because each of the data sources functions as a view
over the (nonexistent) mediated schema. Formally, computer scientists label
such an approach "Local As View" (LAV) — where "Local" refers to the local
sources/databases. An alternate model of integration has the mediated schema
functioning as a view over the sources. This approach, called "Global As View"
(GAV) — where "Global" refers to the global (mediated) schema — has
attractions due to the simplicity involved in answering queries issued over the
mediated schema. However, one must rewrite the view for the mediated
schema whenever a new source gets integrated and/or an existing source
changes its schema.

Some of the current work in data-integration research concerns the semantic


integration problem. This problem addresses not the structuring of the
architecture of the integration, but how to resolve semantic conflicts between
heterogeneous data sources. For example if two companies merge their
databases, certain concepts and definitions in their respective schemas like
"earnings" inevitably have different meanings. In one database it may mean
profits in dollars (a floating-point number), while in the other it might represent
the number of sales (an integer). A common strategy for the resolution of such
problems involves the use of ontologies which explicitly define schema terms
and thus help to resolve semantic conflicts. This approach represents ontology-
based data integration.

Example
Consider a web application where a user can query a variety of information
about cities (such as crime statistics, weather, hotels, demographics, etc).
Traditionally, the information must exist in a single database with a single
schema. But any single enterprise would find information of this breadth
somewhat difficult and expensive to collect. Even if the resources exist to
gather the data, it would likely duplicate data in existing crime databases,
weather websites, and census data.
A data-integration solution may address this problem by considering these
external resources as materialized views over a virtual mediated schema,
resulting in "virtual data integration". This means application-developers
construct a virtual schema — the mediated schema — to best model the kinds of
answers their users want. Next, they design "wrappers" or adapters for each
data source, such as the crime database and weather website. These adapters
simply transform the local query results (those returned by the respective
websites or databases) into an easily processed form for the data integration
solution (see figure 2). When an application-user queries the mediated schema,
the data-integration solution transforms this query into appropriate queries
over the respective data sources. Finally, the virtual database combines the
results of these queries into the answer to the user's query.
This solution offers the convenience of adding new sources by simply
constructing an adapter for them. It contrasts with ETL systems or with a single
database solution, which require manual integration of the entire new dataset
into the system.
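
A very small Python sketch of the wrapper-plus-mediator flow described above may help; everything in it (the source contents, the wrapper functions and the mediated-schema field names) is hypothetical and chosen only for illustration, not a real data-integration API:

# Hypothetical "sources": in reality these would be remote databases or web services.
crime_source = {"Springfield": {"burglaries": 120}, "Shelbyville": {"burglaries": 45}}
weather_source = {"Springfield": {"avg_temp_c": 11.5}, "Shelbyville": {"avg_temp_c": 12.1}}

# Wrappers translate each source's local format into the mediated schema's terms.
def crime_wrapper(city):
    record = crime_source.get(city, {})
    return {"crime_burglaries": record.get("burglaries")}

def weather_wrapper(city):
    record = weather_source.get(city, {})
    return {"weather_avg_temp_c": record.get("avg_temp_c")}

def mediated_query(city):
    """Answer a query over the virtual mediated schema by combining wrapper results."""
    answer = {"city": city}
    for wrapper in (crime_wrapper, weather_wrapper):
        answer.update(wrapper(city))
    return answer

print(mediated_query("Springfield"))
# {'city': 'Springfield', 'crime_burglaries': 120, 'weather_avg_temp_c': 11.5}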

Theory of data integration


The theory of data integration forms a subset of database theory and
formalizes the underlying concepts of the problem in first-order logic. Applying
the theories gives indications as to the feasibility and difficulty of data
integration. While its definitions may appear abstract, they have sufficient
generality to accommodate all manner of integration systems.

Definitions
Data integration systems are formally defined as a triple ⟨G, S, M⟩, where G is the global
(or mediated) schema, S is the heterogeneous set of source schemas, and M is
the mapping that maps queries between the source and the global schemas.
Both G and S are expressed in languages over alphabets composed of symbols
for each of their respective relations. The mapping M consists of assertions
between queries over G and queries over S. When users pose queries over the
data integration system, they pose queries over G and the mapping then
asserts connections between the elements in the global schema and the source
schemas.
A database over a schema is defined as a set of sets, one for each relation (in a
relational database). The database corresponding to the source schema S
would comprise the set of sets of tuples for each of the heterogeneous data
sources and is called the source database. Note that this single source
database may actually represent a collection of disconnected databases. The
database corresponding to the virtual mediated schema G is called the global
database. The global database must satisfy the mapping M with respect to the
source database. The legality of this mapping depends on the nature of the
correspondence between G and S. Two popular ways to model this
correspondence exist: Global as View or GAV and Local as View or LAV.

Figure 3: Illustration of tuple space of the GAV and LAV mappings. In GAV, the
system is constrained to the set of tuples mapped by the mediators while the
set of tuples expressible over the sources may be much larger and richer. In
LAV, the system is constrained to the set of tuples in the sources while the set
of tuples expressible over the global schema can be much larger. Therefore LAV
systems must often deal with incomplete answers.
GAV systems model the global database as a set of views over S. In this case M
associates to each element of G a query over S. Query processing becomes a
straightforward operation due to the well-defined associations between G and
S. The burden of complexity falls on implementing mediator code instructing
the data integration system exactly how to retrieve elements from the source
databases. If any new sources join the system, considerable effort may be
necessary to update the mediator, thus the GAV approach appears preferable
when the sources seem unlikely to change.

In a GAV approach to the example data integration system above, the system
designer would first develop mediators for each of the city information sources
and then design the global schema around these mediators. For example,
consider if one of the sources served a weather website. The designer would
likely then add a corresponding element for weather to the global schema.
Then the bulk of effort concentrates on writing the proper mediator code that
will transform predicates on weather into a query over the weather website.
This effort can become complex if some other source also relates to weather,
because the designer may need to write code to properly combine the results
from the two sources. On the other hand, in LAV, the source database is
modeled as a set of views over G. In this case M associates to each element of
S a query over G. Here the exact associations between G and S are no longer
well-defined. As is illustrated in the next section, the burden of determining how
to retrieve elements from the sources is placed on the query processor. The
benefit of an LAV modeling is that new sources can be added with far less work
than in a GAV system, thus the LAV approach should be favored in cases where
the mediated schema is more likely to change.

In an LAV approach to the example data integration system above, the system
designer designs the global schema first and then simply inputs the schemas of
the respective city information sources. Consider again if one of the sources
serves a weather website. The designer would add corresponding elements for
weather to the global schema only if none existed already. Then programmers
write an adapter or wrapper for the website and add a schema description of
the website's results to the source schemas. The complexity of adding the new
source moves from the designer to the query processor.

Query processing
The theory of query processing in data integration systems is commonly
expressed using conjunctive queries. One can loosely think of a conjunctive
query as a logical function applied to the relations of a database such as "f(A,B)
where A < B". If a tuple or set of tuples is substituted into the rule and satisfies
it (makes it true), then we consider that tuple as part of the set of answers in
the query. While formal languages like Datalog express these queries concisely
and without ambiguity, common SQL queries count as conjunctive queries as
well.
In terms of data integration, "query containment" represents an important
property of conjunctive queries. A query A contains another query B (denoted B ⊆ A)
if the results of applying B are a subset of the results of applying A for any
database. The two queries are said to be equivalent if the resulting sets are
equal for any database. This is important because in both GAV and LAV
systems, a user poses conjunctive queries over a virtual schema represented
by a set of views, or "materialized" conjunctive queries. Integration seeks to
rewrite the queries represented by the views to make their results equivalent or
maximally contained by our user's query. This corresponds to the problem of
answering queries using views (AQUV).

In GAV systems, a system designer writes mediator code to define the query-
rewriting. Each element in the user's query corresponds to a substitution rule
just as each element in the global schema corresponds to a query over the
source. Query processing simply expands the subgoals of the user's query
according to the rule specified in the mediator and thus the resulting query is
likely to be equivalent. While the designer does the majority of the work
beforehand, some GAV systems such as Tsimmis involve simplifying the
mediator description process.
In LAV systems, queries undergo a more radical process of rewriting because no
mediator exists to align the user's query with a simple expansion strategy. The
integration system must execute a search over the space of possible queries in
order to find the best rewrite. The resulting rewrite may not be an equivalent
query but maximally contained, and the resulting tuples may be incomplete. As
of 2009 the MiniCon algorithm is the leading query rewriting algorithm for LAV
data integration systems.In general, the complexity of query rewriting is NP-
complete. If the space of rewrites is relatively small this does not pose a
problem — even for integration systems with hundreds of sources.

Enterprise information integration


Enterprise information integration (EII) applies data integration commercially.
Despite the theoretical problems described above, the private sector shows
more concern with the problems of data integration as a viable product. EII
emphasizes neither correctness nor tractability, but speed and simplicity. An
EII industry has emerged, but many professionals believe it does not perform to
its full potential. Practitioners cite the following major issues which EII must
address for the industry to become mature:
simplicity of understanding
Answering queries with views arouses interest from a theoretical standpoint, but practitioners find it difficult to understand how to incorporate it as an "enterprise solution". Some developers believe it should be merged with EAI. Others believe it should be incorporated with ETL systems, citing customers' confusion over the differences between the two services.
simplicity of deployment
Even if recognized as a solution to a problem, EII as of 2009 takes time to apply and offers complexities in deployment. People have proposed a variety of schema-less solutions such as "Lean Middleware", but ease-of-use and speed of deployment appear inversely proportional to the generality of such systems. Others cite the need for standard data interfaces to speed and simplify the integration process in practice.

Handling higher-order information


Analysts experience difficulty — even with a functioning information integration
system — in determining whether the sources in the database will satisfy a
given application. Answering these kinds of questions about a set of
repositories requires semantic information like metadata and/or ontologies. The
few commercial tools that leverage this information remain in their infancy.

Ques 4 Discuss:
• Mining Multi-level Association rules from Transactional
Databases
Due to the development of information systems and technology, businesses
increasingly have the capability to accumulate huge amounts of retail data in
large databases. In recent marketing research, products' discounts have rarely been considered as an important decision variable. Although a few studies have analyzed the effect of discounts on sales, they ignore their temporal characteristics. That is, in the real world, each product may appear with different discount rates in different time periods. Moreover, they have considered discount at a single concept level. Therefore, the discovered knowledge is less concrete and implementation of the results of the analyses becomes difficult. The problem addressed here is the consideration of products' discounts in discovering multiple-level association rules in the different time intervals in which a specific discount appears on a specific product. The proposed algorithm makes it possible to acquire more concrete and specific knowledge about the association between products and their discounts, as well as to implement its results.

• Mining Multi-Dimensional Association rules from Relational


Databases
Ans: The problem of association rule mining has gained considerable prominence in the data mining community for its use as an important tool of knowledge discovery from large-scale databases, and there has been a spurt of research activity around this problem. Traditional association rule mining is limited to intra-transaction associations. Only recently was the concept of N-dimensional inter-transaction association rules (NDITAR) proposed by H.J. Lu. This work modifies and extends Lu's definition of NDITAR based on an analysis of its limitations, and the generalized multidimensional association rule (GMDAR) is subsequently introduced, which is more general, flexible and reasonable than NDITAR.

Ques 5. Briefly outline the major steps of decision tree classification.


The following preprocessing steps may be applied to the data to help improve
the accuracy,
efficiency, and scalability of the classification or prediction process.

Data cleaning: This refers to the preprocessing of data in order to remove or


reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.
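
For instance, replacing missing values by the most commonly occurring value of an attribute can be sketched in a few lines of Python (illustrative only; using None as the missing-value marker is an assumption):

from collections import Counter

def impute_most_common(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

income_band = ["low", "high", None, "high", "medium", None, "high"]
print(impute_most_common(income_band))
# ['low', 'high', 'high', 'high', 'medium', 'high', 'high']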

Relevance analysis: Many of the attributes in the data may be redundant.


Correlation analysis can be used to identify whether any two given attributes
are statistically related. For example, a strong correlation between attributes
A1 and A2 would suggest that one of the two could be removed from further
analysis. A database may also contain irrelevant attributes. Attribute subset
selection can be used in these cases to find a reduced set of attributes such
that the resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes. Hence,
relevance analysis, in the form of correlation analysis and attribute subset
selection, can be used to detect attributes that do not contribute to the
classification or prediction task. Including such attributes may otherwise slow
down, and
possibly mislead, the learning step. Ideally, the time spent on relevance
analysis, when added to the time spent on learning from the resulting
“reduced” attribute (or feature) subset, should be less than the time that would
have been spent on learning from the original set of attributes. Hence, such
analysis can help improve classification efficiency and scalability.
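
As a minimal illustration of correlation-based relevance analysis (a sketch that assumes numeric attributes; for categorical attributes a chi-square test would normally be used instead), the Pearson correlation between two attributes A1 and A2 can be computed and compared against a chosen threshold:

import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

a1 = [1.0, 2.0, 3.0, 4.0, 5.0]          # e.g. years of employment
a2 = [2.1, 3.9, 6.2, 8.0, 9.9]          # e.g. income in tens of thousands
if abs(pearson(a1, a2)) > 0.9:
    print("A1 and A2 are strongly correlated; one of them could be dropped.")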

Data transformation and reduction: The data may be transformed by


normalization, particularly when neural networks or methods involving distance
measurements are used in the learning step. Normalization involves scaling all
values for a given attribute so that they fall within a small specified range, such
as −1.0 to 1.0, or 0.0 to 1.0. In methods that use distance measurements, for
example, this would prevent attributes with initially large ranges (like, say,
income) from outweighing attributes with initially smaller ranges (such as
binary attributes).
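
A minimal min-max normalization sketch in Python (illustrative only; it assumes the attribute is not constant) scales an attribute into the range 0.0 to 1.0:

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale numeric values linearly into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 45000, 30000, 98000]
print(min_max_normalize(incomes))
# [0.0, 0.3837..., 0.2093..., 1.0]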

The data can also be transformed by generalizing it to higher-level concepts.


Concept hierarchies may be used for this purpose. This is particularly useful for
continuous-valued attributes. For example, numeric values for the attribute
income can be generalized to discrete ranges, such as low, medium, and high.
Similarly, categorical attributes, like street, can be generalized to higher-level
concepts, like city. Because generalization compresses the original training
data, fewer input/output operations may be involved during learning.

Data can also be reduced by applying many other methods, ranging from
wavelet transformation and principal components analysis to discretization
techniques, such as binning, histogram analysis, and clustering.
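
As a small illustration of the income example above, a numeric attribute can be generalized onto the ranges low, medium and high with simple cut points (an illustrative Python sketch; the cut points are assumptions):

def discretize_income(value, cut_low=30000, cut_high=70000):
    """Map a numeric income onto the concept hierarchy low / medium / high."""
    if value < cut_low:
        return "low"
    if value < cut_high:
        return "medium"
    return "high"

print([discretize_income(v) for v in (12000, 45000, 30000, 98000)])
# ['low', 'medium', 'medium', 'high']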

Ques 6. Discuss the following w.r.t. Cluster Analysis:

• Partitioning Methods

Given D, a data set of n objects, and k, the number of clusters to form, a


partitioning algorithm organizes the objects into k partitions (k ≤ n), where each
partition represents a cluster. The clusters are formed to optimize an objective
partitioning criterion, such as a dissimilarity function based on distance, so that
the objects within a cluster are “similar,” whereas the objects of different
clusters are “dissimilar” in terms of the data set attributes.

Classical Partitioning Methods: k-Means and k-Medoids


The most well-known and commonly used partitioning methods are k-means, k-
medoids,
and their variations.

Centroid-Based Technique: The k-Means Method

The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low. Cluster similarity is measured in regard to the
mean value of the objects in a cluster, which can be viewed as the cluster’s
centroid or center of gravity. “How does the k-means algorithm work?” The k-
means algorithm proceeds as follows. First, it randomly selects k of the objects,
each of which initially represents a cluster mean or center. For each of the
remaining objects, an object is assigned to the cluster to which it is the most
similar, based on the distance between the object and the cluster mean. It then
computes the new mean for each cluster. This process iterates until the
criterion function converges. Typically, the square-error criterion is used, defined as

$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^2$,

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and m_i is the mean of cluster C_i (both p and m_i are multidimensional). In other words, for each object in each
cluster, the distance from the object to its cluster center is squared, and the
distances are summed. This criterion tries to make the resulting k clusters as
compact and as separate as possible. The k-means procedure is summarized in
Figure 7.2.

Clustering by k-means partitioning. Suppose that there is a set of objects


located in space as depicted in the rectangle shown in Figure 7.3(a). Let k = 3;
that is, the user would like
the objects to be partitioned into three clusters. According to the algorithm in
Figure 7.2, we arbitrarily choose three objects as the three initial cluster
centers, where cluster centers are marked by a “+”. Each object is distributed
to a cluster based on the cluster center to which it is the nearest. Such a
distribution forms silhouettes encircled by dotted curves, as shown in Figure
7.3(a).
Next, the cluster centers are updated. That is, the mean value of each cluster is
recalculated based on the current objects in the cluster. Using the new cluster
centers, the objects are redistributed to the clusters based on which cluster
center is the nearest. Such a redistribution forms new silhouettes encircled by
dashed curves, as shown in Figure 7.3(b). This process iterates, leading to
Figure 7.3(c). The process of iteratively reassigning objects to clusters to
improve the partitioning is referred to as iterative relocation. Eventually, no
redistribution of the objects in any cluster occurs, and so the process
terminates. The resulting clusters are returned by the clustering process.
The algorithm attempts to determine k partitions that minimize the square-error function. It works well when the clusters are compact clouds that are rather well separated from one another.

Algorithm: k-means. The k-means algorithm for partitioning, where each


cluster’s center is represented by the mean value of the objects in the cluster.
Input:

k: the number of clusters,


D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most
similar,based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the
objects for
each cluster;
(5) until no change;
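
A compact Python sketch of the algorithm above follows (an illustration, not an optimized implementation; it uses Euclidean distance and the arbitrary initialization of step (1)):

import math
import random

def k_means(points, k, max_iter=100):
    """Partition points into k clusters by iterative relocation."""
    centers = random.sample(points, k)               # step (1): arbitrary initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                             # step (3): assign to nearest center
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)          # step (4): recompute the means
        ]
        if new_centers == centers:                   # step (5): stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.9, 8.1)]
centers, clusters = k_means(data, k=2)
print(centers)

Because the initialization in step (1) is arbitrary, different runs can converge to different partitions, which is one reason the method often terminates at a local optimum, as noted below.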

The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n. The method often terminates at a local optimum. The k-means method, however, can be applied only when the mean of a cluster is defined. This may not be the case in some applications, such as when data with categorical attributes are involved. The necessity for users to specify k, the number of clusters, in advance can be seen as a disadvantage. The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes. Moreover, it is sensitive to noise and outlier data points because a small number of such data can substantially influence the mean value. There are quite a few variants of the k-means method. These can differ in the selection of the initial k means, the calculation of dissimilarity, and the strategies for calculating cluster means. An interesting strategy that often yields good results is to first apply a hierarchical agglomeration algorithm, which determines the number of clusters and finds an initial clustering, and then use iterative relocation to improve the clustering. Another variant of k-means is the k-modes method, which extends the k-means paradigm to cluster categorical data by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects and a frequency-based method to update modes of clusters. The k-means and the k-modes methods can be integrated to cluster data with mixed numeric and categorical values.
The EM (Expectation-Maximization) algorithm (which will be further discussed in
Section 7.8.1) extends the k-means paradigm in a different way. Whereas the k-
means algorithm assigns each object to a cluster, in EM each object is assigned
to each cluster according to a weight representing its probability of
membership. In other words, there are no strict boundaries between clusters.
Therefore, new means are computed based on weighted measures. “How can
we make the k-means algorithm more scalable?” A recent approach to scaling
the k-means algorithm is based on the idea of identifying three kinds of regions
in data: regions that are compressible, regions that must be maintained in main
memory, and regions that are discardable. An object is discardable if its
membership in a cluster is ascertained. An object is compressible if it is not
discardable but belongs to a tight subcluster. A data structure known as a
clustering feature is used to summarize objects that have been discarded or
compressed. If an object is neither discardable nor compressible, then it should
be retained in main memory. To achieve scalability, the iterative clustering
algorithm only includes the clustering features of the compressible objects and
the
objects that must be retained in main memory, thereby turning a secondary-memory-based algorithm into a main-memory-based algorithm. An alternative
approach to scaling the k-means algorithm explores the microclustering idea,
which first groups nearby objects into “microclusters” and then performs k-
means clustering on the microclusters. Microclustering is further discussed in
Section 7.

• Density Based Methods

Density-based methods: Most partitioning methods cluster objects based on the


distance between objects. Such methods can find only spherical-shaped
clusters and encounter difficulty at discovering clusters of arbitrary shapes.
Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the
density (number of objects or data points) in the “neighborhood” exceeds some
threshold; that is, for each data point within a given cluster, the neighborhood
of a given radius has to contain at least a minimum number of points. Such a
method can be used to filter out noise (outliers) and discover clusters of
arbitrary shape. DBSCAN and its extension, OPTICS, are typical density-based
methods that grow clusters according to a density-based connectivity analysis.
DENCLUE is a method that clusters objects based on the analysis of the value
distributions of density functions.

To discover clusters with arbitrary shape, density-based clustering methods


have been developed. These typically regard clusters as dense regions of
objects in the data space that are separated by regions of low density
(representing noise). DBSCAN grows clusters according to a density-based
connectivity analysis. OPTICS extends DBSCAN to produce a cluster ordering
obtained from a wide range of parameter settings. DENCLUE clusters objects
based on a set of density distribution functions.

7.6.1 DBSCAN: A Density-Based Clustering Method Based on Connected


Regions with Sufficiently High Density

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a


density based clustering algorithm. The algorithm grows regions with
sufficiently high density into clusters and discovers clusters of arbitrary shape
in spatial databases with noise. It defines a cluster as a maximal set of density-
connected points. The basic ideas of density-based clustering involve a number
of new definitions. We intuitively present these definitions, and then follow up
with an example.

Density reachability is the transitive closure of direct density reachability, and


this relationship is asymmetric. Only core objects are mutually density
reachable. Density connectivity, however, is a symmetric relation.
A density-based cluster is a set of density-connected objects that is maximal
with respect to density-reachability. Every object not contained in any cluster is
considered to be noise.
“How does DBSCAN find clusters?” DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merging of a few density-reachable clusters. The process terminates when no new point can be added to any cluster.

If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, it is O(n^2). With appropriate settings of the user-defined parameters ε and MinPts, the algorithm is effective at finding arbitrary-shaped clusters.
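
A compact Python sketch of this search follows (illustrative only; it does not use a spatial index, so it runs in O(n^2) time, and the names eps and min_pts stand for ε and MinPts):

import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster_id = 0

    def neighbours(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:          # not a core point (for now): mark as noise
            labels[i] = -1
            continue
        labels[i] = cluster_id            # start a new cluster from this core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id    # border point previously marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:      # j is also a core point: expand further
                queue.extend(j_neighbours)
        cluster_id += 1
    return labels

data = [(1, 1), (1.1, 0.9), (0.9, 1.1), (5, 5), (5.1, 5.1), (9, 9)]
print(dbscan(data, eps=0.5, min_pts=2))   # [0, 0, 0, 1, 1, -1]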
