PROGRAM Master of Science in Information Technology (MSc IT) Revised Fall 2011
SEMESTER 4
SUBJECT CODE & NAME MIT401 Data Warehousing and Data Mining
CREDIT 4
BK ID B1633 MAX. MARKS 60
Q.No 1 Explain the Top-Down and Bottom-up Data Warehouse development Methodologies. 10
Answer:
Top-Down and Bottom-Up Development Methodology
Although Data Warehouses can be designed in a number of different ways, they all share a number
of important characteristics. Most Data Warehouses are subject oriented. This means that the information
in the Data Warehouse is stored in a way that allows it to be connected to the objects or events that occur
in the real world.
Another characteristic that is frequently seen in Data Warehouses is that they are time variant. A time-variant
Data Warehouse allows changes in the information to be monitored and recorded over time. Data Warehouses are
also integrated: the data from all the programs used by a particular institution is stored in the Data
Warehouse and brought together in a consistent form.
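The time-variant idea can be illustrated with a small sketch. It is not taken from any particular product: the
in-memory list standing in for a warehouse table, the function names record_change and value_on, and the sample
customer data are all assumptions made for the example. The point is only that a change is appended as a new
dated version instead of overwriting the old value, so earlier states can still be queried.

```python
from datetime import date

history = []  # hypothetical in-memory stand-in for a warehouse table

def record_change(key, value, as_of):
    """Append a new dated version instead of overwriting the old one."""
    history.append({"key": key, "value": value, "as_of": as_of})

def value_on(key, day):
    """Return the latest value recorded for key on or before day, else None."""
    versions = [r for r in history if r["key"] == key and r["as_of"] <= day]
    if not versions:
        return None
    return max(versions, key=lambda r: r["as_of"])["value"]

# Illustrative data only: one customer attribute that changes once.
record_change("customer-1/city", "Pune", date(2010, 1, 1))
record_change("customer-1/city", "Delhi", date(2011, 6, 1))
```

Asking value_on for a date in 2010 would return the earlier city, while a date after the second change would
return the later one.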
The first Data Warehouses were developed in the 1980s. As societies entered the information age, there was a
large demand for efficient methods of storing information.
Many of the systems that existed in the 1980s were not powerful enough to store and manage large amounts of
data. There were a number of reasons for this. The systems that existed at the time took too long to report and
process information. Many of these systems were not designed to analyze or report information. In addition,
the computer programs that were necessary for reporting information were both costly and slow. To solve
these problems, companies began designing computer databases that placed an emphasis on managing and
analyzing information. These were the first Data Warehouses, and they could obtain data from a variety of
different sources, including PCs and mainframes.
Spreadsheet programs have also played an important role in the development of Data Warehouses. By the end
of the 1990s, the technology had greatly advanced and had become much lower in cost. The technology has
continued to evolve to meet the demands of those who are looking for more functions and speed. Four advances
in Data Warehouse technology have allowed it to evolve: offline operational databases, offline Data
Warehouses, real time Data Warehouses, and integrated Data Warehouses.
The offline operational database is a system in which the information within the database of an operational
system is copied to a server that is offline. Because reporting then runs against the copy instead of the live
system, the operational system performs at a much higher level. As the name implies, a real time Data
Warehouse is updated every time an event occurs. For example, if a customer orders a product, a real time
Data Warehouse records the order as soon as it is placed.
With the integrated Data Warehouse, transactions are transferred back to the operational systems each day,
which allows companies and organizations to analyze the data easily. A number of components are present in
the typical Data Warehouse. Some of these components are the source data layer, the transformation layer,
the Data Warehouse layer, and the reporting layer. There are a number of different data sources for
Data Warehouses. Some popular data sources are Teradata, Oracle Database, and Microsoft SQL Server.
Another important concept that is related to Data Warehouses is called data transformation. As the name
suggests, data transformation is a process in which information transferred from specific sources is cleaned
and loaded into a repository.
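As a rough sketch of this idea, the snippet below cleans a few records from a source and loads them into a
repository. Everything in it is hypothetical: the field names, the cleaning rules (trimming whitespace,
normalising the case of names, rejecting rows without an amount) and the plain list standing in for the
repository are assumptions made for illustration only.

```python
def transform(record):
    """Clean one source record; return None if it cannot be used."""
    amount = str(record.get("amount", "")).strip()
    if not amount:
        return None  # assumed rule: rows without an amount are rejected
    return {
        "customer": record["customer"].strip().title(),
        "amount": float(amount),
    }

def load(source_rows, repository):
    """Append every record that survives cleaning to the repository."""
    for row in source_rows:
        cleaned = transform(row)
        if cleaned is not None:
            repository.append(cleaned)
    return repository

# Illustrative source data with typical defects: stray spaces, mixed case, a blank field.
source_rows = [
    {"customer": "  alice smith ", "amount": " 120.50"},
    {"customer": "BOB JONES", "amount": ""},
    {"customer": "carol white", "amount": "75"},
]
repository = load(source_rows, [])
```

In this sketch the second row is dropped because its amount is blank, and the other two rows reach the
repository with tidy names and numeric amounts.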
Fig.: Multicube
Q.No 5 Describe the K-means method for clustering. List its advantages and drawbacks. 5+5=10
Answer:
K-means
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known
clustering problem. The procedure classifies a given data set into a certain number of clusters (assume k
clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. The basic steps of
k-means clustering are simple. In the beginning we determine the number of clusters K and assume a centroid,
or center, for each of these clusters. We can take any random objects as the initial centroids, or the first
K objects in sequence can also serve as the initial centroids. The K-means algorithm then repeats the three
steps given below until convergence, that is, until the grouping is stable (no object moves to another group):
1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group the objects based on minimum distance
These steps are given in the form of a flow chart (see fig. below).
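The three steps can also be sketched in plain Python. This is a minimal illustration under stated assumptions,
not a reference implementation: the function name kmeans, the use of Euclidean distance, the choice of the
first K objects as initial centroids, the max_iter safety limit, and the four sample points are all choices
made for the example.

```python
import math

def kmeans(points, k, max_iter=100):
    """Minimal k-means sketch; names and defaults are illustrative assumptions."""
    # Initial centroids: the first k objects in sequence.
    centroids = [tuple(p) for p in points[:k]]
    groups = None
    for _ in range(max_iter):
        # Step 2: distance of each object to every centroid (Euclidean, assumed).
        # Step 3: assign each object to the centroid at minimum distance.
        new_groups = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stable: no object moved to another group, so stop iterating.
        if new_groups == groups:
            break
        groups = new_groups
        # Step 1: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, g in zip(points, groups) if g == c]
            if members:  # keep the old centroid if a group is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, groups

# Illustrative data: four two-dimensional objects, grouped into K = 2 clusters.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 4.0)]
centroids, groups = kmeans(points, k=2)
```

With these sample points the loop settles after a few passes, placing the first two objects in one group and
the last two in the other. Starting from different initial centroids can lead to a different grouping.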