CHAPTER 1 INTRODUCTION
In a relational database, especially with normalized tables, a significant effort is required to prepare a summary data set that can be used as input for a data mining or statistical algorithm. Most algorithms require as input a data set with a horizontal layout, with several records and one variable or dimension per column. That is the case for models like clustering, classification, regression and PCA. Each research discipline uses different terminology to describe the data set: in data mining the common terms are point-dimension, the statistics literature generally uses observation-variable, and machine learning research uses instance-feature. This work introduces a new class of aggregate functions that can be used to build data sets in a horizontal layout (denormalized, with aggregations), automating SQL query writing and extending SQL capabilities. We show that evaluating horizontal aggregations is a challenging and interesting problem, and we introduce alternative methods and optimizations for their efficient evaluation.
1.1 PROBLEM STATEMENT:
Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables and aggregating columns. Existing SQL aggregations have limitations for preparing data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code that returns aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g. point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. Our proposed horizontal aggregations provide several unique features and advantages. First, they represent a template to generate SQL code from a data mining tool. Such SQL code automates writing SQL queries, optimizing them and testing them for correctness.
A Novel Aggregations Approach for Preparing Datasets

1.2 MOTIVATION:
Building a suitable data set for data mining purposes is a time-consuming task. This task generally requires writing long SQL statements or customizing SQL code if it is automatically generated by some tool. There are two main ingredients in such SQL code: joins and aggregations; we focus on the second one. The most widely known aggregation is the sum of a column over groups of rows. Other aggregations return the average, maximum, minimum or row count over groups of rows. Many aggregation functions and operators exist in SQL. Unfortunately, all these aggregations have limitations for building data sets for data mining purposes. The main reason is that, in general, data sets stored in a relational database (or a data warehouse) come from On-Line Transaction Processing (OLTP) systems, where database schemas are highly normalized. But data mining, statistical or machine learning algorithms generally require aggregated data in summarized form. With the functions and clauses currently available in SQL, a significant effort is required to compute aggregations when they are desired in a cross-tabular (horizontal) form suitable for a data mining algorithm. Such effort is due to the amount and complexity of SQL code that needs to be written, optimized and tested. There are further practical reasons to return aggregation results in a horizontal (cross-tabular) layout. Standard aggregations are hard to interpret when there are many result rows, especially when the grouping attributes have high cardinalities. For analysis of tables exported to spreadsheets, it may be more convenient to have all aggregations of the same group in one row (e.g. to produce graphs or to compare data sets with repetitive information). OLAP tools generate SQL code to transpose results (sometimes called PIVOT).
Transposition can be more efficient if there are mechanisms combining aggregation and transposition together. With such limitations in mind, we propose a new class of aggregate functions that aggregate numeric expressions and transpose results to produce a data set with a horizontal layout. Functions belonging to this class are called horizontal aggregations. Horizontal aggregations represent an extended form of traditional SQL aggregations, which return a set of values in a horizontal layout (somewhat similar to a multidimensional vector), instead of a single value per row.
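To make the layout difference concrete, the sketch below contrasts a standard (vertical) GROUP BY result with the horizontal layout these functions aim to produce. The fact table F(k, D1, D2, A) and its column names are illustrative, not taken from any specific schema; SQLite stands in for the DBMS, and the transposition is done in client code here only to show the target shape.

```python
import sqlite3

# Hypothetical fact table F(k, D1, D2, A): D1 is the grouping column,
# D2 the transposing column, A the measure (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE F (k INTEGER, D1 TEXT, D2 TEXT, A REAL);
INSERT INTO F VALUES
  (1, 'store1', 'Jan', 10), (2, 'store1', 'Feb', 20),
  (3, 'store2', 'Jan', 30), (4, 'store2', 'Feb', 40),
  (5, 'store2', 'Feb', 5);
""")

# Standard (vertical) aggregation: one row per (D1, D2) group.
vertical = conn.execute(
    "SELECT D1, D2, SUM(A) FROM F GROUP BY D1, D2 ORDER BY D1, D2"
).fetchall()

# Horizontal layout: one row per D1 value, one column per distinct D2
# value, built here by transposing the vertical result in client code.
cols = sorted({d2 for _, d2, _ in vertical})
horizontal = {}
for d1, d2, s in vertical:
    horizontal.setdefault(d1, {c: None for c in cols})[d2] = s
```

A horizontal aggregation produces the second shape directly from SQL, instead of leaving the transposition to the client.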
1.3 SCOPE:
Data mining algorithms require suitable input in a cross-tabular (horizontal) form, and a significant effort is required to compute the corresponding aggregations. Such effort is due to the amount and complexity of SQL code that needs to be written, optimized and tested.
1.4 OUTLINE:
Data aggregation is a process in which information is gathered and expressed in summary form, and which is used for purposes such as statistical analysis. A common aggregation purpose is to get more information about particular groups based on specific variables such as age, name, phone number, address, profession, or income. Most algorithms require as input a data set with a horizontal layout, with several records and one variable or dimension per column. This layout is used by models like clustering, classification, regression and PCA; in data mining terminology, the data set is described in point-dimension terms.
CHAPTER 2 BACKGROUND
Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. SQL aggregations have limitations for preparing data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets where a horizontal layout is required. We describe methods to generate SQL code that returns aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms.
The key properties of data mining are:
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large data sets and databases
Typical applications include predicting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
Some statistical tools (e.g., SAS) can directly score data sets by generating SQL queries, but they are mathematically limited. In this work, we show that UDFs represent a promising alternative for extending the DBMS with multidimensional statistical models. The statistical models studied in this work include linear regression, principal component analysis (PCA), factor analysis, clustering, and Naive Bayes. These models are widely used and cover the whole spectrum of unsupervised and supervised techniques. UDFs are a standard Application Programming Interface (API) available in modern DBMSs. In general, UDFs are developed in the C language (or a similar language), compiled to object code and efficiently executed inside the DBMS like any other SQL function. Thus, UDFs represent an outstanding alternative for extending a DBMS with statistical models, exploiting the C language's flexibility and speed. Therefore, there is no need to change SQL syntax with new data mining primitives or clauses, making UDF implementation and usage easier. The UDFs proposed here can be programmed on any DBMS supporting scalar and aggregate UDFs.
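The paragraph above describes aggregate UDFs written in C inside a commercial DBMS. As a hedged stand-in, the sketch below uses SQLite's Python API, which exposes the same aggregate-UDF contract: a per-group object whose step/finalize methods are driven by the SQL engine. The `var_pop` function name and the statistic it computes are our own illustration, not from the original work.

```python
import sqlite3

class Variance:
    """Population variance as an aggregate UDF (illustrative)."""
    def __init__(self):
        self.n, self.s, self.s2 = 0, 0.0, 0.0
    def step(self, x):        # called by the engine once per row
        if x is not None:
            self.n += 1
            self.s += x
            self.s2 += x * x
    def finalize(self):       # called once per group
        if self.n == 0:
            return None
        m = self.s / self.n
        return self.s2 / self.n - m * m

conn = sqlite3.connect(":memory:")
conn.create_aggregate("var_pop", 1, Variance)  # register the UDF
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(2.0,), (4.0,), (6.0,)])
(v,) = conn.execute("SELECT var_pop(x) FROM t").fetchone()
# v == 8/3 (mean 4; squared deviations 4, 0, 4)
```

Once registered, the UDF is called like any built-in SQL aggregate, which is the property the text relies on.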
Some existing systems provide pivoting functionality as a post-processing operation through cursors. While existing implementations are certainly useful, they fail to consider Pivot or Unpivot as first-class RDBMS operations.
Inclusion of Pivot and Unpivot inside the RDBMS enables interesting and useful possibilities for data modeling. Existing modeling techniques must decide both the relationships between tables and the attributes within those tables to persist. The requirement that columns be strongly defined contrasts with the nature of rows, which can be added and removed easily. Pivot and Unpivot, which exchange the roles of rows and columns, allow the a priori requirement for pre-defined columns to be relaxed. These operators provide a technique to allow rows to become columns dynamically at the time of query compilation and execution. When the set of columns cannot be determined in advance, one common table design scenario employs property tables, where a table containing (id, property name, property value) is used to store, in rows, a series of values that it would be desirable to represent as columns. Users typically use this design to avoid RDBMS implementation restrictions (such as an upper limit on the number of columns in a table, or the storage overhead associated with many empty columns in a row) or to avoid changing the schema when a new property needs to be added. This design choice has implications on
how tables in this form can be used and how well they perform in queries. Property table queries are more difficult to write and maintain, and the complexity of the operation may result in less optimal query execution plans. In general, applications written to handle data stored in property tables cannot easily process data in the wide (pivoted) format. Pivot and Unpivot enable property tables to look like regular tables (and vice versa) to a data modeling tool. These operations provide the framework to enable useful extensions to data modeling.
Including Pivot and Unpivot explicitly in the query language provides excellent opportunities for query optimization. Properly defined, these operations can be used in arbitrary combinations with existing operations such as filters, joins, and grouping. For example, since Unpivot transposes columns into rows, it is possible to convert a filter (an operation that restricts rows) over Unpivot into a projection (an operation that restricts columns) beneath it. Algebraic equivalences between Pivot/Unpivot and existing operators enable consideration of many execution strategies through reordering, with the standard opportunity to improve query performance. Furthermore, new optimization techniques can also be introduced that take advantage of unique properties of these new operators. Consideration of these issues provides powerful techniques for improving existing user scenarios currently performed outside the confines of a query optimizer.
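The filter-below-Unpivot equivalence mentioned above can be checked with a toy example. This is not a real optimizer: `unpivot` here is a plain Python function and the column names are made up, but the two "plans" produce the same rows.

```python
# A tiny wide table: rows of (id, colA, colB) -- names are illustrative.
wide = [(1, 10, 100), (2, 20, 200)]

def unpivot(rows, attrs):
    """Transpose columns into (id, attribute, value) triples."""
    out = []
    for row in rows:
        rid, values = row[0], row[1:]
        out.extend((rid, a, v) for a, v in zip(attrs, values))
    return out

attrs = ["colA", "colB"]

# Plan 1: unpivot everything, then filter on the attribute name.
plan1 = [t for t in unpivot(wide, attrs) if t[1] == "colA"]

# Plan 2: project the colA column first, then unpivot only that column.
projected = [(rid, a) for rid, a, _b in wide]
plan2 = unpivot(projected, ["colA"])

assert plan1 == plan2  # the row filter became a column projection
```

Pushing the projection below the transposition avoids materializing rows that the filter would discard, which is exactly the kind of reordering an optimizer can exploit.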
4.1 EXECUTION STRATEGIES IN HORIZONTAL AGGREGATION:
Horizontal aggregations are a new class of functions that aggregate numeric expressions and transpose the results to produce data sets with a horizontal layout. The operation is needed in a number of data mining tasks, such as unsupervised classification and data summarization, as well as segmentation of large heterogeneous data sets into smaller homogeneous subsets that can be easily managed, separately modeled and analyzed. To create data sets for data mining, efficient summaries of the data are needed. For that purpose the proposed system collects the needed attributes from the different fact tables and displays them as columns, creating the data in a horizontal layout. The main goal is to define a template to generate
SQL code combining aggregation and transposition (pivoting). A second goal is to extend the SELECT statement with a clause that combines transposition with aggregation.
4.2.1 SPJ METHOD:
The SPJ method is based only on relational operators. The basic idea is to build a table with a vertical aggregation for each resulting column; to produce the horizontal aggregation FH, the system must join all those tables. There are two sub-strategies to compute a horizontal aggregation. The first strategy computes the aggregations directly from the fact table. The second computes each corresponding vertical aggregation and stores it in a temporary table.
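A minimal sketch of the second SPJ sub-strategy: one temporary vertical-aggregation table per distinct value of the transposing column, glued together with left outer joins. Run here on SQLite with an invented fact table F(k, D1, D2, A); all table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE F (k INTEGER, D1 TEXT, D2 TEXT, A REAL);
INSERT INTO F VALUES
  (1,'store1','Jan',10),(2,'store1','Feb',20),
  (3,'store2','Jan',30),(4,'store2','Feb',40),(5,'store2','Feb',5);

-- One vertical aggregation table per distinct D2 value.
CREATE TABLE F0 AS SELECT DISTINCT D1 FROM F;
CREATE TABLE F1 AS SELECT D1, SUM(A) AS A1 FROM F
                   WHERE D2 = 'Jan' GROUP BY D1;
CREATE TABLE F2 AS SELECT D1, SUM(A) AS A2 FROM F
                   WHERE D2 = 'Feb' GROUP BY D1;
""")

# Left outer joins assemble the horizontal table FH.
rows = conn.execute("""
SELECT F0.D1, F1.A1 AS Jan, F2.A2 AS Feb
FROM F0 LEFT JOIN F1 ON F0.D1 = F1.D1
        LEFT JOIN F2 ON F0.D1 = F2.D1
ORDER BY F0.D1
""").fetchall()
```

The outer joins matter: a group with no rows for some D2 value still appears, with a NULL in that column.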
4.2.2 CASE METHOD:
SQL provides the built-in CASE programming construct, which returns a value selected from a set of values based on a Boolean expression. It can be used in any statement or clause that allows a valid expression. The Boolean expression for each CASE statement is a conjunction of k equality comparisons. Query evaluation combines the desired aggregation with one CASE statement for each distinct combination of values.
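As a sketch of the CASE method, the query below computes a horizontal aggregation in a single scan, with one aggregated CASE term per distinct value of the transposing column (here k = 1, a single equality comparison per CASE). Run on SQLite; the fact table F(k, D1, D2, A) and its values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE F (k INTEGER, D1 TEXT, D2 TEXT, A REAL);
INSERT INTO F VALUES
  (1,'store1','Jan',10),(2,'store1','Feb',20),
  (3,'store2','Jan',30),(4,'store2','Feb',40),(5,'store2','Feb',5);
""")

# CASE without ELSE yields NULL, and SUM ignores NULLs, so each term
# aggregates only the rows matching its D2 value -- one pass over F.
rows = conn.execute("""
SELECT D1,
       SUM(CASE WHEN D2 = 'Jan' THEN A END) AS Jan,
       SUM(CASE WHEN D2 = 'Feb' THEN A END) AS Feb
FROM F
GROUP BY D1
ORDER BY D1
""").fetchall()
```

In practice the tool would first query the distinct D2 values and generate one CASE term per value.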
4.2.3 PIVOT METHOD:
Pivot transforms a series of rows into fewer rows with additional columns. Data in one source column is used to determine the new column for a row, and another source column is used as the data for that new column. The wide form can be considered a matrix of column values, while the narrow form is a natural encoding of a sparse matrix. In the current implementation, the PIVOT operator is used to calculate the aggregations. One way to express pivoting uses scalar subqueries, where each pivoted column is created through a separate subquery. The PIVOT operator provides a technique to allow rows to become columns dynamically at query compilation and execution time.
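The scalar-subquery formulation mentioned above can be sketched as follows, again on SQLite with an invented fact table F(k, D1, D2, A): one correlated subquery per pivoted column. A native PIVOT operator would compute the same result in a single pass instead of one subquery per column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE F (k INTEGER, D1 TEXT, D2 TEXT, A REAL);
INSERT INTO F VALUES
  (1,'store1','Jan',10),(2,'store1','Feb',20),
  (3,'store2','Jan',30),(4,'store2','Feb',40),(5,'store2','Feb',5);
""")

# One correlated scalar subquery per pivoted column.
rows = conn.execute("""
SELECT G.D1,
       (SELECT SUM(A) FROM F WHERE F.D1 = G.D1 AND D2 = 'Jan') AS Jan,
       (SELECT SUM(A) FROM F WHERE F.D1 = G.D1 AND D2 = 'Feb') AS Feb
FROM (SELECT DISTINCT D1 FROM F) AS G
ORDER BY G.D1
""").fetchall()
```

This form is easy to generate mechanically but re-scans F once per pivoted column, which is why combining aggregation and transposition in one operator can be more efficient.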
5.3 MODULES:
Admin Module
User Module
View Module
Download Module
The user will be able to download various details regarding bills. A new user can download the new connection form, subscription details, etc.; the user can also download his/her previous bill details to keep them at hand for verification.
DESCRIPTION:
User: The user registers and logs in; the user can view the bill details in an aggregation-wise view, a vertical view or a horizontal view, depending on the user's need.
Admin: The admin uploads the new connection form and enters the bill details, using the SPJ-wise, CASE-wise and PIVOT-wise views to display bill details to the user.
DESCRIPTION:
The user registers and logs in to his account. The admin creates a bill entry with fields like meter number, issue date, reading, total cost, etc. The admin uploads the form on request.
Both user and admin can view the bill details, like meter number, paid date, paid amount, etc.
DESCRIPTION:
The admin uploads the new connection form to the server. The user registers. The admin uploads bill details to the server. Both user and admin can view bill details from the server, which are displayed aggregate-wise to the user.
DESCRIPTION:
When there is a need for a search form, a new search form can be downloaded from the database. When the admin logs in and uploads the form, the database displays user bill details in the SPJ-wise, CASE-wise and PIVOT-wise views. When the user logs in, the database displays options such as change password, view details and aggregate view.
CHAPTER 8 IMPLEMENTATION
8.1 CODE TEMPLATE
The implementation was carried out with continuous revision of the requirements, matching them against the tasks the system is expected to perform.
8.2 DESIGN IMPLEMENTATION
The design for the system was formulated manually by me under the internal guide's supervision. The design for the flow of the system and data was revised many times and finalized. In the same way, the design of the forms and reports was made subject to user convenience and the organization's needs, respectively. The design implementation helped me gain knowledge about user-convenient methods of designing forms. The product entry screen was designed to make the user feel more comfortable through a Graphical User Interface. I have used check boxes to ensure that the user checks the eligible deductions. The design implementation required knowing the inputs and controls needed to carry the design from paper to Windows forms. This implementation phase made me aware of the limitations of the controls that could be used in developing the software.

8.3 CODE IMPLEMENTATION
The code for this project was implemented step by step, starting from deciding the events in which the code must execute and the logic the code has to carry out. Since my project uses a menu-driven interface, the coding was done form by form. The code was mainly focused on executing the expected logic in sequence. After implementing the code, testing was carried out.
8.4 IMPLEMENTATION DETAILS
In horizontal aggregations we used three evaluation methods to develop the horizontal tabular layout.
PIVOT METHOD: A common scenario where PIVOT can be useful is generating cross-tabulation reports to summarize data. For example, suppose you want to query the user bill details table in the Horizontal database to determine the number of bills placed by certain users. The following query template provides this report.
USE Adv
SELECT <non-pivoted column>,
       [first pivoted column] AS <column name>,
       [second pivoted column] AS <column name>,
       ...
       [last pivoted column] AS <column name>
FROM (<SELECT query that produces the data>) AS <alias for the source query>
PIVOT (
    <aggregation function>(<column being aggregated>)
    FOR [<column that contains the values that will become column headers>]
    IN ( [first pivoted column], [second pivoted column], ... [last pivoted column] )
) AS <alias for the pivot table>
<optional ORDER BY clause>;
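A data mining tool would fill this template mechanically from its parameters. The sketch below is one hypothetical way to do that in Python: the UserBills table and its column names are invented for illustration, and the generated string targets a SQL Server-style PIVOT, so it is only built and inspected here, not executed against a DBMS.

```python
def build_pivot_query(source, group_col, pivot_col, agg, value_col, values):
    """Fill the PIVOT template for a SQL Server-style DBMS.
    All table/column names are caller-supplied and not validated."""
    cols = ", ".join(f"[{v}]" for v in values)
    return (
        f"SELECT {group_col}, {cols}\n"
        f"FROM (SELECT {group_col}, {pivot_col}, {value_col} "
        f"FROM {source}) AS src\n"
        f"PIVOT ({agg}({value_col}) FOR {pivot_col} IN ({cols})) AS pvt;"
    )

# Hypothetical bill table: one pivoted column per month.
sql = build_pivot_query("UserBills", "UserName", "BillMonth",
                        "SUM", "Amount", ["Jan", "Feb", "Mar"])
print(sql)
```

Generating the query as a string is exactly the "template" role the proposed horizontal aggregations play: the tool, not the analyst, writes the repetitive column list.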
CASE METHOD: Using a CASE query we can customize the resulting table and columns based on conditions. This helps reduce the enormous amount of space used by the various user bill details. The result can be viewed in two different ways, namely horizontal and vertical.
CHAPTER 9 TESTING
9.1 TEST CASES:

Test Case: Login
Name: Login
Action: Test case
Input Parameters: User ID, Password
Expected Output: If the user ID and password are correct, redirect to the Admin or User home page; otherwise show a warning message to enter a valid user ID/password.
Actual Output: We tried different inputs and it gives the correct output: entering a correct ID and password redirects to the user's home page; otherwise a login-failed message is shown.
Result:

Test Case: User Registration
Expected Output: On clicking the submit button, the complete user details should be stored in the user registration database table.
Actual Output: On clicking the submit button, the complete user details are stored in the user database table.
Result:

Test Case: Form Download
Expected Output: When the user clicks the download button, a save-file toolbar is displayed, and clicking the save button should download the form.
Actual Output:
Result: Form downloaded
Table no: 9.3 Test Case 3: User Bill Details
Name: User bill details
Action: Test case
Input Parameters: User name, Meter number
Expected Output: When the user enters his name and meter number, the user bill details should be displayed immediately.
Actual Output: We entered the username and meter number; after clicking the submit button it automatically moves to the next page showing the user bill details.
Result:
Test Case: Database Connectivity
Name: Check database connectivity
Action: Test case
Input Parameters: User ID and password
Expected Output: When we enter the username and password, the user bill details should be retrieved.
Actual Output:
Result: Database is connected
Screen no: 10.14 Screenshots of user date-wise details
CHAPTER 11
CONCLUSION AND FURTHER DEVELOPMENT
We proposed three query evaluation methods. The first (SPJ) relies on standard relational operators. The second (CASE) relies on the SQL CASE construct. The third (PIVOT) uses a built-in operator in a commercial DBMS that is not widely available. The SPJ method is important from a theoretical point of view because it is based on select, project, and join (SPJ) queries. The CASE method is our most important contribution. It is in general the most efficient evaluation method and has wide applicability, since it can be programmed by combining GROUP BY and CASE statements. We proved that the three methods produce the same result. We explained that it is not possible to evaluate horizontal aggregations with standard SQL operators without either joins or CASE constructs. Our proposed horizontal aggregations can be used as a database method to automatically generate efficient SQL queries with three sets of parameters: grouping columns, subgrouping columns, and the aggregated column. The fact that the output horizontal columns are not available when the query is parsed (when the query plan is explored and chosen) makes evaluation through standard SQL mechanisms infeasible. Our experiments with large tables show that our proposed horizontal aggregations evaluated with the CASE method have performance similar to the built-in PIVOT operator. We believe this is remarkable, since our proposal is based on generating SQL code and not on internally modifying the query optimizer.
Horizontal aggregations produce tables with fewer rows but more columns. Thus, query optimization techniques used for standard aggregations are inappropriate for horizontal aggregations. In the future, this work can be extended to develop a more formal model of evaluation methods to achieve better results, and more complete I/O cost models can also be developed.
REFERENCES
[1] G. Bhargava, P. Goel, and B.R. Iyer, "Hypergraph Based Reordering of Outer Join Queries with Complex Predicates," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '95), pp. 304-315, 1995.
[2] J.A. Blakeley, V. Rao, I. Kunen, A. Prout, M. Henaire, and C. Kleinerman, ".NET Database Programmability and Extensibility in Microsoft SQL Server," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), pp. 1087-1098, 2008.
[3] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman, "Non-Stop SQL/MX Primitives for Knowledge Discovery," Proc. ACM SIGKDD Fifth Int'l Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 425-429, 1999.
[4] E.F. Codd, "Extending the Database Relational Model to Capture More Meaning," ACM Trans. Database Systems, vol. 4, no. 4, pp. 397-434, 1979.
[5] C. Cunningham, G. Graefe, and C.A. Galindo-Legaria, "PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS," Proc. 13th Int'l Conf. Very Large Data Bases (VLDB '04), pp. 998-1009, 2004.
[6] C. Galindo-Legaria and A. Rosenthal, "Outer Join Simplification and Reordering for Query Optimization," ACM Trans. Database Systems, vol. 22, no. 1, pp. 43-73, 1997.
[7] H. Garcia-Molina, J.D. Ullman, and J. Widom, Database Systems: The Complete Book, first ed., Prentice Hall, 2001.
[8] G. Graefe, U. Fayyad, and S. Chaudhuri, "On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases," Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD '98), pp. 204-208, 1998.
[9] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, "Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab and Sub-Total," Proc. Int'l Conf. Data Eng., pp. 152-159, 1996.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, first ed., Morgan Kaufmann, 2001.