
1. INTRODUCTION

1.1 ABOUT PROJECT:

Road traffic monitoring is of great importance for urban transportation systems. Traffic control agencies and drivers could benefit from timely and accurate road traffic prediction, which makes prompt or even advance decisions possible for detecting and avoiding road congestion. Existing methods mainly focus on raw speed sensing data collected from cameras or road sensors and suffer from a severe data-scarcity issue, because the installation and maintenance of sensors are very expensive [56]. At the same time, most existing techniques based only on past and current traffic conditions (e.g., [9], [54], [25], [38]) do not fit well when real-world factors such as traffic accidents play a part. To address the above issues, in this paper we introduce new types of traffic-related data arising from public services:

1) Social media data, posted on social networking websites such as Twitter and Facebook. With the popularization of mobile devices, people are more likely to exchange news and trifles of their lives through social media services, where messages about traffic conditions, such as “Stuck in traffic on E 32nd St. Stay away!”, are posted by drivers, passengers and pedestrians, who can be viewed as sensors observing the ongoing traffic conditions near their physical locations. Meanwhile, traffic authorities register public accounts and post tweets, such as “Slow traffic on I95 SB from Girard Ave to Vine St.” from a local transportation bureau account, to inform the public of the traffic status. Such text messages describing traffic conditions, some of them tagged with location information, are publicly accessible and can be a complementary information source to raw speed sensing data.

2) Trajectory data from map services. When users specify an origin-destination (OD) pair on a map, such services can recommend the optimal route from the origin to the destination with the least travel time, and trajectories can be collected once drivers use the service to navigate. Here a trajectory is a sequence of links for a given OD pair, and a link is a road segment between neighboring intersections. Correspondingly, a trajectory's travel time is an aggregation of link travel times, which are related to the real-time road traffic speeds. A longer trajectory travel time indicates that some of the involved road links may be congested with lower traffic speed. Trajectory data is useful for a wide range of transportation analyses and applications [49], [9].

Based on the above observations, where traditional traffic sensing data are limited while new types of data from social media and map services begin to spring up, our goal is to predict road-level traffic speed by incorporating the new-type data with traditional speed sensing data. To motivate this scenario, consider the road traffic prediction example depicted in Fig. 1: the goal is to predict the traffic speed of specific road links (shown with red question marks), given 1) speed observations collected by speed sensors (shown in blue); 2) trajectories and travel times of OD pairs, where the speeds of the road links passed are either observed or to be predicted; and 3) tweets describing traffic conditions, where the location mentioned by a tweet may be a street covering multiple road links. The links with red question marks are not covered by traditional speed sensors, but may be passed by trajectories attached with travel time information, or mentioned in tweets describing traffic conditions, so their speeds can be inferred by fusing multiple cross-domain data sources.
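The trajectory notion above can be sketched in Java. This is a minimal illustration only (the `Trajectory` class, field names and units of km and km/h are assumptions, not part of the paper): a trajectory's travel time is the sum of its link travel times, each of which depends on the link's real-time speed.

```java
// Sketch: trajectory travel time as the aggregation of link travel times.
// Hypothetical names and units (km, km/h); not the paper's actual model.
public class Trajectory {

    // Travel time in hours, given each link's length (km) and speed (km/h).
    static double travelTime(double[] lengthsKm, double[] speedsKmh) {
        double total = 0.0;
        for (int i = 0; i < lengthsKm.length; i++) {
            total += lengthsKm[i] / speedsKmh[i]; // per-link travel time
        }
        return total;
    }

    public static void main(String[] args) {
        // Two 1 km links: one free-flowing (60 km/h), one congested (20 km/h).
        double t = travelTime(new double[]{1.0, 1.0}, new double[]{60.0, 20.0});
        System.out.printf("trajectory travel time: %.4f h%n", t); // 0.0667 h
    }
}
```

A longer observed travel time than the free-flow sum is exactly the signal that some traversed link is congested, which is why trajectory data carries speed information about links with no sensors.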

1.2 DATA MINING:

In today’s world, a large amount of data is generated and collected daily. Analyzing this data and extracting the important parts of it is difficult, yet essential. The information industry holds a huge amount of data, which is of no use until it is converted into useful information. It is therefore necessary to analyze this huge amount of data and extract useful information from it.

Data mining is an interdisciplinary subfield of computer science and the natural evolution of information technology. It is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process.

Extraction of information is not the only process we need to perform; data mining also involves other processes such as data cleaning, data integration, data transformation, pattern evaluation and data presentation. Once all these processes are complete, we can use this information in many applications such as fraud detection, market analysis, production control, science exploration, etc.

1.2.1 Data Preprocessing:

In the real world, data tend to be incomplete, noisy and inconsistent, which makes data preprocessing necessary. Forms of data preprocessing include data cleaning, data integration, data transformation and data reduction. Typically, the process of duplicate detection is preceded by a data-preparation stage, during which data entries are stored in a uniform manner in the database. Data preprocessing is a data mining technique that transforms raw data into an understandable format and prepares it for further processing; it is a proven method of resolving the errors, inconsistencies and missing values that real-world data is likely to contain.

Data goes through a series of steps during preprocessing:

 Data Cleaning:

Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting this dirty or coarse data. After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores.
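A minimal illustration of data cleaning in Java. The record layout and the two rules shown (drop rows with missing fields, trim stray whitespace) are only examples of cleaning rules, not a complete cleaning pipeline:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CleaningDemo {

    // Drop records with missing (null or blank) fields; trim the rest.
    static List<String[]> clean(List<String[]> records) {
        List<String[]> out = new ArrayList<>();
        for (String[] r : records) {
            boolean complete = true;
            String[] trimmed = new String[r.length];
            for (int i = 0; i < r.length; i++) {
                if (r[i] == null || r[i].trim().isEmpty()) {
                    complete = false; // incomplete record: discard it
                    break;
                }
                trimmed[i] = r[i].trim();
            }
            if (complete) out.add(trimmed);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> raw = Arrays.asList(
            new String[]{" Alice ", "30"},
            new String[]{"Bob", null},   // missing field -> dropped
            new String[]{"", "25"});     // blank field   -> dropped
        System.out.println(clean(raw).size()); // 1
    }
}
```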

 Data Integration:

Data integration is a data preprocessing technique that merges data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning. It primarily supports the analytical processing of large data sets by aligning, combining and presenting each data set from organizational departments and external remote sources to fulfill integrator objectives. Data integration is generally implemented in data warehouses (DW) through specialized software that hosts large data repositories from internal and external resources. Data is extracted, amalgamated and presented in a unified form. For example, a user’s complete data set may include extracted and combined data from marketing, sales and operations, which is combined to form a complete report.
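A toy sketch of data integration, assuming each department exposes its records as an id-to-value map (the names here are hypothetical). Note that a conflicting key across sources is precisely where the data cleaning mentioned above would be needed:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IntegrationDemo {

    // Merge two departmental record maps into one coherent store.
    // On conflicting keys, keep the first source's value (a naive rule;
    // real integration would reconcile the conflict via cleaning).
    static Map<String, String> integrate(Map<String, String> a,
                                         Map<String, String> b) {
        Map<String, String> merged = new LinkedHashMap<>(a);
        for (Map.Entry<String, String> e : b.entrySet()) {
            merged.putIfAbsent(e.getKey(), e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> marketing = new LinkedHashMap<>();
        marketing.put("c1", "Alice");
        Map<String, String> sales = new LinkedHashMap<>();
        sales.put("c1", "Alice A."); // conflicts with marketing's entry
        sales.put("c2", "Bob");
        System.out.println(integrate(marketing, sales)); // {c1=Alice, c2=Bob}
    }
}
```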

 Data Transformation:

Data transformation is the process of converting data or information from one format to
another, usually from the format of a source system into the required format of a new
destination system. The usual process involves converting documents, but data
conversions sometimes involve the conversion of a program from one computer language
to another to enable the program to run on a different platform.
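A small example of such a conversion in Java, transforming a date from a hypothetical source-system format (MM/dd/yyyy) into the ISO format a destination system might require:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TransformDemo {

    // Assumed source-system date format (illustrative only).
    static final DateTimeFormatter SRC = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    // Convert a source-format date string to ISO yyyy-MM-dd.
    static String toIso(String srcDate) {
        return LocalDate.parse(srcDate, SRC).toString();
    }

    public static void main(String[] args) {
        System.out.println(toIso("03/07/2016")); // 2016-03-07
    }
}
```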

 Data Reduction:

Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form. The basic concept is the reduction of multitudinous amounts of data down to their meaningful parts. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

The following are common techniques used in data reduction:

 Order by some aspect of size.
 Table diagonalization, whereby rows and columns of tables are re-arranged to make patterns easier to see.
 Round drastically to one, or at most two, effective digits (effective digits are ones that vary in that part of the data).
 Use averages to provide a visual focus as well as a summary.
 Use layout and labeling to guide the eye.
 Give a brief verbal summary.
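The rounding and averaging techniques above can be sketched as follows (the method names and the sample values are illustrative): round to a chosen number of effective digits, and summarize a series by its mean.

```java
public class ReductionDemo {

    // Round x to the given number of significant ("effective") digits.
    static double roundSig(double x, int digits) {
        if (x == 0) return 0;
        double scale = Math.pow(10,
                digits - 1 - (int) Math.floor(Math.log10(Math.abs(x))));
        return Math.round(x * scale) / scale;
    }

    // Summarize a series by its average.
    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    public static void main(String[] args) {
        double[] speeds = {48.73, 51.20, 49.88};
        // Reduce three measurements to one two-digit summary value.
        System.out.println(roundSig(mean(speeds), 2)); // 50.0
    }
}
```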

1.2.2 Data Mining Applications:

Data mining is widely used in diverse areas. A number of commercial data mining systems are available today, and yet there are many challenges in this field.

Here is a list of areas where data mining is widely used:

• Financial Data Analysis

• Fraud detection

• Retail Industry

• Telecommunication Industry

• Biological Data Analysis

• Other Scientific Applications

• Intrusion Detection

 Financial Data Analysis:

The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some of the typical use cases are as follows:

• Design and construction of data warehouses for multidimensional data analysis and data mining.

• Loan payment prediction and customer credit policy analysis.

• Classification and clustering of customers for targeted marketing.

• Detection of money laundering and other financial crimes.

 Fraud Detection:

Data mining is also used in the fields of credit card services and telecommunications to detect fraud. In fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, the time of the day or week, etc. It also analyzes patterns that deviate from expected norms.

 Retail Industry:

Data mining has great application in the retail industry because retail collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.

Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry:

• Design and Construction of data warehouses based on the benefits of data mining.

• Multidimensional analysis of sales, customers, products, time and region.

• Analysis of effectiveness of sales campaigns.

• Customer Retention.

• Product recommendation and cross-referencing of items.

 Telecommunication Industry:

Today the telecommunications industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunications industry is expanding rapidly. This is why data mining has become very important in helping to understand the business. Data mining in the telecommunications industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is a list of examples for which data mining improves telecommunication services:

• Multidimensional Analysis of Telecommunication data.

• Fraudulent pattern analysis.

• Identification of unusual patterns.

• Multidimensional association and sequential patterns analysis.

• Mobile Telecommunication services.

• Use of visualization tools in telecommunication data analysis.

 Biological Data Analysis:

In recent times, we have seen tremendous growth in fields of biology such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. The following are the aspects in which data mining contributes to biological data analysis:

• Semantic integration of heterogeneous, distributed genomic and proteomic databases.

• Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.

• Discovery of structural patterns and analysis of genetic networks and protein pathways.

• Association and path analysis.

• Visualization tools in genetic data analysis.

 Other Scientific Applications:

The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have also been collected from scientific domains such as geosciences, astronomy, etc., and large data sets are being generated by fast numerical simulations in fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. The following are the applications of data mining in the field of scientific applications:

• Data Warehouses and data preprocessing.

• Graph-based mining.

• Visualization and domain specific knowledge.

 Intrusion Detection:

Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding into and attacking networks have prompted intrusion detection to become a critical component of network administration. Here is a list of areas in which data mining technology may be applied for intrusion detection:

• Development of data mining algorithms for intrusion detection.

• Association and correlation analysis, and aggregation to help select and build discriminating attributes.

• Analysis of stream data.

• Distributed data mining.

• Visualization and query tools.

2. REQUIREMENT ELICITATION

A Requirement is a feature that the system must have or a constraint that it must satisfy to be accepted by the clients. Requirements Engineering aims at defining the requirements of the system under construction. It includes two main activities, namely Requirements Elicitation and Analysis.

Requirements Elicitation focuses on describing the purpose of the system. The client, the developers, and the users identify a problem area and define a system that addresses the problem. Such a definition is called a Requirements Specification. This specification is structured and formalized during analysis to produce an Analysis Model. Requirements Elicitation and Analysis focus only on the user’s view of the system. Requirements Elicitation includes the following activities.

Identifying Actors:

During this activity, developers identify the different types of users the future system will
support.

Identifying Scenarios:

During this activity, developers observe users and develop a set of detailed scenarios for typical
functionality provided by the future system. Developers use these scenarios to communicate
with the users and deepen their understanding.

Identifying Use Cases:

Once developers and users agree on a set of scenarios, developers derive from the scenarios a set
of use cases that completely represent the future system.

Refining Use Cases:

During this activity, developers ensure that the requirements specification is complete by
detailing each use case and describing the behavior of the system in the presence of errors and
exceptional conditions.

Identifying relationships among use cases:

During this activity, developers identify dependencies among use cases and also consolidate the
use case model by factoring out common functionality.

Identifying Non-functional requirements:

During this activity, developers, users and clients agree on aspects such as the performance of the system, documentation, resources, security and quality.

2.1 Existing System :


Road traffic monitoring is of great importance for urban transportation systems. Traffic control agencies and drivers could benefit from timely and accurate road traffic prediction, which makes prompt or even advance decisions possible for detecting and avoiding road congestion. Existing methods mainly focus on raw speed sensing data collected from cameras or road sensors and suffer from a severe data-sparsity issue, because the installation and maintenance of sensors are very expensive [1]. At the same time, most existing techniques, based only on past and current traffic conditions, do not fit well when real-world factors such as traffic accidents play a part.

2.2 PROPOSED SYSTEM:


To address the above issues, in this paper we introduce new types of traffic-related data arising from public services:

1) Social media data, which is posted on social networking websites, e.g., Twitter and Facebook. With the popularization of mobile devices, people are more likely to exchange news and trifles of their lives through social media services, where messages about traffic conditions, such as “Stuck in traffic on E 32nd St. Stay away!”, are posted by drivers, passengers and pedestrians, who can be viewed as sensors observing the ongoing traffic conditions near their physical locations.

2.2.1 Project Scope:


Road traffic monitoring is of great importance for urban transportation systems. Traffic control agencies and drivers could benefit from timely and accurate road traffic prediction, which makes prompt or even advance decisions possible for detecting and avoiding road congestion. Existing methods mainly focus on raw speed sensing data collected from cameras or road sensors and suffer from a severe data-scarcity issue, because the installation and maintenance of sensors are very expensive. At the same time, most existing techniques based only on past and current traffic conditions do not fit well when real-world factors such as traffic accidents play a part. To address the above issues, in this paper we introduce new types of traffic-related data arising from public services:

2.2.2 Project objectives:
The system has a clear set of objectives to achieve. They are as follows:

 To make it easy to use.
 To make it easy to extend.
 To support a large variety of data sources, including nested data.
 To provide several basic similarity measures.
 To allow almost all algorithms to be implemented using the toolkit.

2.2.3 Project overview:


The functional overview of the system is as follows:

 This project takes the input dataset from the user.

 After taking the input, it runs both algorithms to find the duplicates in the given dataset.

 Finally, it displays the duplicates which are identified in the given dataset.
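As an illustrative sketch only (the project's actual algorithms are not shown here), duplicates in a dataset can be found by normalizing each record and flagging any record whose normalized form has already been seen:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateDemo {

    // Return the records that duplicate an earlier record,
    // comparing on a normalized (trimmed, lower-cased) key.
    static List<String> findDuplicates(List<String> records) {
        Set<String> seen = new HashSet<>();
        List<String> dups = new ArrayList<>();
        for (String r : records) {
            String key = r.trim().toLowerCase();
            if (!seen.add(key)) dups.add(r); // key already present -> duplicate
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList(
            "John Smith", "jane doe", " John Smith ", "Jane Doe");
        System.out.println(findDuplicates(data).size()); // 2
    }
}
```

Real duplicate detection would typically replace the exact-match key with a similarity measure (such as edit distance), so that near-duplicates are also caught.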

2.3 FUNCTIONAL REQUIREMENTS:
The functional requirements describe the inputs and outputs of the application. The functional
requirements of this project are as follows:

 Input: Dataset from the user.

 Output: Speed prediction.

2.3.1 Actors:
Actors represent external entities that interact with the system. An actor can be a human or an external system. The actors involved in this system, and their responsibilities, are as follows:

Admin:

 Admin collects the files from the users.

 Runs the algorithms on the files collected from the user.

 Finds the duplicates and returns the duplicates in the file to the user.

User:

 Uploads the file in which the duplicates are to be detected.

 Views the result.

2.3.2 Use Case:
Use cases are used during requirements elicitation and analysis to represent the functionality of the system. Use cases focus on the behavior of the system from an external point of view. A use case describes a function provided by the system that yields a visible result for an actor. An actor describes any entity that interacts with the system.

The identification of actors and use cases results in the definition of the boundary of the system, that is, in differentiating the tasks accomplished by the system from the tasks accomplished by its environment. The actors are outside the boundary of the system, whereas the use cases are inside the boundary of the system.

Actors are external entities that interact with the system. Use cases describe the behavior of the system as seen from an actor’s point of view. Actors initiate a use case to access the system functionality. The use case may then initiate other use cases and gather more information from the actors. When actors and use cases exchange information, they are said to communicate. To describe a use case, we use a template composed of six fields:

Use Case Name: The name of the use case.

Participating Actors: The actors participating in the particular use case.

Entry Condition: The condition for initiating the use case.

Flow of Events: The sequence of steps describing the function of the use case.

Exit Condition: The condition for terminating the use case.

Quality Requirements: Requirements that are not related to the functionality of the system (e.g., nonfunctional requirements and constraints).

Use case diagrams include four types of relations. They are as follows:

Communication Relationships

Inclusion Relationships

Extension Relationships

Inheritance Relationships

USE CASE DIAGRAM:

Figure 2.1: Use case Diagram

USE CASE 1: Login

Use Case Name: Login
Participating Actors: Admin, User
Flow of Events: The admin or user gets into the system.
Entry Condition: The admin or user must log in with the user ID and password provided.
Exit Condition: Successfully logged in.
Table 2.1: Use case table for Login

USE CASE 2: Upload Dataset

Use Case Name: Upload Dataset
Participating Actors: Admin, User
Flow of Events: The user uploads the file which contains duplicates, and the admin uploads the identified duplicates.
Entry Condition: Data files are taken.
Exit Condition: Successfully uploaded.
Table 2.2: Use case table for Upload Dataset
USE CASE 3: Duplicate Detection

Use Case Name: Duplicate detection
Participating Actors: Admin
Flow of Events: 1. The user and admin get into the system with their respective credentials. 2. The user uploads his file. 3. The admin verifies the details and approves the respective transaction.
Entry Condition: Uploaded datasets are taken for applying the algorithms.
Exit Condition: Duplicates successfully identified.
Table 2.3: Use case table for Duplicate detection
USE CASE 4: Download Dataset

Use Case Name: Download Dataset
Participating Actors: Admin, User
Flow of Events: 1. The user and admin get into the system with their respective credentials. 2. The user uploads his file. 3. The admin verifies the details and downloads the file uploaded by the user. 4. The admin finds the duplicates in the file and uploads it to the user. 5. The user downloads the file sent by the admin.
Entry Condition: The user undergoes various authentication processes.
Exit Condition: Successfully downloaded.
Table 2.4: Use case table for Download Dataset

2.3.3. Scenarios:

A use case is an abstraction that describes all possible scenarios involving the described functionality. A scenario is an instance of a use case describing a concrete set of actions. Scenarios are used as examples to illustrate common cases. We describe a scenario using a template with three fields: the name of the scenario, the participating actors, and the flow of events, which describes the sequence of events step by step.

SCENARIO 1: Login

Scenario Name: Login
Participating Actors: Admin, User
Flow of Events: The admin or user gets into the system.
Table 2.5: Scenario table for Login

SCENARIO 2: Upload Dataset

Scenario Name: Upload Dataset
Participating Actors: Admin, User
Flow of Events: The user uploads the file which contains duplicates, and the admin uploads the identified duplicates.
Table 2.6: Scenario table for Upload Dataset

SCENARIO 3: Duplicate Detection

Scenario Name: Duplicate detection
Participating Actors: Admin
Flow of Events: 1. The user and admin get into the system with their respective credentials. 2. The user uploads his file. 3. The admin verifies the details and approves the respective transaction.
Table 2.7: Scenario table for Duplicate detection

SCENARIO 4: Download Dataset

Scenario Name: Download Dataset
Participating Actors: Admin, User
Flow of Events: 1. The user and admin get into the system with their respective credentials. 2. The user uploads his file. 3. The admin verifies the details and downloads the file uploaded by the user. 4. The admin finds the duplicates in the file and uploads it to the user. 5. The user downloads the file sent by the admin.
Table 2.8: Scenario table for Download Dataset

2.4 NON-FUNCTIONAL REQUIREMENTS:

 Usability: The user needs only minimal knowledge of duplicate detection.

 Reliability: The system is highly reliable because it inherits the qualities of the Java platform; code built using Java is more reliable.

 Supportability: The system is designed to be cross-platform. It is supported on a wide range of hardware and on any software platform that has a JVM built in.

 Performance: The system is developed in a high-level language using advanced front-end and back-end technologies, so it responds to the end user on the client system in very little time.

 Implementation: This project is implemented using Java.

2.4.1. User Interface and Characteristics:


1. User interface:

Here we have chosen Java as our programming language for the implementation of the system. The reasons for choosing this language are:

 Java is platform-independent
One of the most significant advantages of Java is its ability to move easily from one
computer system to another. The ability to run the same program on many different
systems is crucial to World Wide Web software, and Java succeeds at this by being
platform-independent at both the source and binary levels.
 Java is secure
Java considers security as part of its design. The Java language, compiler, interpreter, and
runtime environment were each developed with security in mind.

2. Error handling:

Before performing any operations on the dataset, the contents of the dataset must be checked. If any value does not match the expected format or type, an error message is displayed so that the user can take an appropriate decision. For example, the system does not proceed if no data is entered.
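A minimal sketch of such a check, assuming a hypothetical rule that every field must be non-empty and numeric (the rule and the method names are illustrative, not the system's actual validation):

```java
import java.util.regex.Pattern;

public class InputCheckDemo {

    // Hypothetical validity rule: a field must be a decimal number.
    static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?");

    // Return an error message for the user, or null if the value is acceptable.
    static String check(String value) {
        if (value == null || value.trim().isEmpty()) {
            return "Error: no data entered";
        }
        if (!NUMERIC.matcher(value.trim()).matches()) {
            return "Error: value is not numeric";
        }
        return null; // value accepted
    }

    public static void main(String[] args) {
        System.out.println(check(""));     // Error: no data entered
        System.out.println(check("abc"));  // Error: value is not numeric
        System.out.println(check("42.5")); // null (accepted)
    }
}
```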

3. Performance consideration:

The performance of the system is very high when compared to current numerical model
techniques. The training time is less when compared to the current system.

4. Platform:

Windows XP & above operating systems.


5. Technology to be used:

Programming Language: Java is chosen as the programming language for the implementation of the system.

2.4.2. Hardware Requirements:


Processor : Intel

Hard Disk : 40 GB

RAM Capacity : 512 MB

Monitor : Standard

Keyboard : Standard

Mouse : Any standard mouse

2.4.3. Software Requirements:


Operating System : Windows XP and above

Front end : JAVA

Back end : MySQL

2.4.3.1 About Java:


Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that compiled Java code can run on all platforms that support Java without the need for recompilation. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of the underlying computer architecture.

Java is the foundation for virtually every type of networked application and is the global standard
for developing and delivering embedded and mobile applications, games, Web-based content,
and enterprise software. With more than 9 million developers worldwide, Java enables you to
efficiently develop, deploy and use exciting applications and services.

History of Java:

Java was originally developed by James Gosling at Sun Microsystems (which has since
been acquired by Oracle Corporation) and released in 1995 as a core component of Sun
Microsystems' Java platform. The language derives much of its syntax from C and C++, but it
has fewer low-level facilities than either of them.

Java is a general-purpose, high-level programming language developed by Sun Microsystems. A small team of engineers, known as the Green Team, initiated the language in 1991. Java was originally called Oak and was designed for handheld devices and set-top boxes. Oak was unsuccessful, so in 1995 Sun changed the name to Java and modified the language to take advantage of the burgeoning World Wide Web.

Java Features:
Simple:

 Java is easy to write, and more readable and eye-catching.
 Java has a concise, cohesive set of features that makes it easy to learn and use.
 Most of the concepts are drawn from C++, thus making Java simpler to learn.

Secure:

 A Java program cannot harm other systems, thus making it secure.
 Java provides a secure means of creating Internet applications.
 Java provides a secure way to access web applications.

Portable:

 Java programs can execute in any environment for which there is a Java run-time system (JVM).
 Java programs can be run on any platform (Linux, Windows, Mac).
 Java programs can be transferred over the World Wide Web (e.g., applets).

Object oriented:

 Java is an object-oriented programming language.
 Like C++, Java provides most of the object-oriented features.
 Java is a pure OOP language (while C++ is only semi object-oriented).

Robust:

 Java encourages error-free programming by being strictly typed and performing run-time
checks.

Multi-threaded:

 Java provides integrated support for multithreaded programming.

Architecture neutral:

 Java is not tied to a specific machine or operating system architecture.
 Java is machine independent, i.e., independent of the hardware.

Interpreted:

 Java supports cross-platform code through the use of Java bytecode.
 Bytecode can be interpreted on any platform by the JVM.

High Performance:

 Bytecodes are highly optimized.
 The JVM can execute them much faster.

Distributed:

 Java was designed for the distributed environment.
 Java programs can be transmitted and run over the internet.

Dynamic:

 Java programs carry with them substantial amounts of run-time type information that is
used to verify and resolve accesses to objects at run time.

Java Principles:
There were five primary goals in the creation of the Java language:
 It must be "simple, object-oriented, and familiar".
 It must be "robust and secure".
 It must be "architecture-neutral and portable".
 It must execute with "high performance".
 It must be "interpreted, threaded, and dynamic".

Overview of OOP Terminology:
 Class: A user-defined prototype for an object that defines a set of attributes that
characterize any object of the class. The attributes are data members (class variables
and instance variables) and methods, accessed via dot notation.
 Class variable: A variable that is shared by all instances of a class. Class variables are
defined within a class but outside any of the class's methods. Class variables aren't used
as frequently as instance variables are.
 Data member: A class variable or instance variable that holds data associated with a
class and its objects.
 Function overloading: The assignment of more than one behavior to a particular
function. The operation performed varies by the types of objects (arguments) involved.
 Instance variable: A variable that is defined inside a class but outside any method, and
belongs to an individual instance of the class.
 Inheritance: The transfer of the characteristics of a class to other classes that are
derived from it.
 Instantiation: The creation of an instance of a class.
 Method: A special kind of function that is defined in a class definition.
 Object: A unique instance of a data structure that's defined by its class. An object
comprises both data members (class variables and instance variables) and methods.
 Operator overloading: The assignment of more than one function to a particular
operator (not supported in Java, unlike C++).
 Instance: An individual object of a certain class. An object obj that belongs to a class
Circle, for example, is an instance of the class Circle.
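The terms above can be illustrated in a short Java sketch (the class names `Circle` and `Cylinder` are made up for illustration):

```java
// Class: a user-defined prototype whose attributes characterize its objects.
class Circle {
    // Class variable: shared by all instances of the class.
    static int instanceCount = 0;

    // Instance variable: belongs to one individual instance.
    double radius;

    // Instantiation happens through the constructor.
    Circle(double radius) {
        this.radius = radius;
        instanceCount++;
    }

    // Method: a function defined in a class definition.
    double area() {
        return Math.PI * radius * radius;
    }

    // Overloading: the same method name with a different parameter list.
    double area(double scale) {
        return area() * scale;
    }
}

// Inheritance: Cylinder derives the characteristics of Circle.
class Cylinder extends Circle {
    double height;

    Cylinder(double radius, double height) {
        super(radius);
        this.height = height;
    }

    double volume() {
        return area() * height; // reuses the inherited method
    }
}
```

Each `new Circle(...)` expression creates an instance; `instanceCount` is shared by all of them, while `radius` is separate per object.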

2.4.3.2. About MySQL:

MySQL is a free, open-source database engine available for all major platforms.
(Technically, MySQL is a relational database management system (RDBMS)). MySQL
represents an excellent introduction to modern database technology, as well as being a
reliable mainstream database resource for high-volume applications.

A modern database is an efficient way to organize, and gain access to, large amounts of
data. A relational database is able to create relationships between individual database
elements, to organize data at a higher level than a simple table of records, avoid data
redundancy and enforce relationships that define how the database functions.

A database is a separate application that stores a collection of data. Each database has
one or more distinct APIs for creating, accessing, managing, searching and replicating
the data it holds.

Other kinds of data stores can be used, such as files on the file system or large hash
tables in memory but data fetching and writing would not be so fast and easy with those
types of systems.

So nowadays, we use relational database management systems (RDBMS) to store and
manage huge volumes of data. This is called a relational database because all the data is
stored in different tables, and relations are established using primary keys or other keys
known as foreign keys.

A Relational Database Management System (RDBMS) is software that:

 Enables you to implement a database with tables, columns and indexes.

 Guarantees the Referential Integrity between rows of various tables.

 Updates the indexes automatically.

 Interprets an SQL query and combines information from various tables.

RDBMS Terminology:
Before we proceed to explain the MySQL database system, let's revise a few definitions related
to databases.

 Database: A database is a collection of tables, with related data.

 Table: A table is a matrix with data. A table in a database looks like a simple
spreadsheet.

 Column: One column (data element) contains data of one and the same kind, for
example the column postcode.

 Row: A row (= tuple, entry or record) is a group of related data, for example the data of
one subscription.

 Redundancy: Storing data twice, redundantly to make the system faster.

 Primary Key: A primary key is unique. A key value cannot occur twice in one table.
With a key, you can find at most one row.

 Foreign Key: A foreign key is the linking pin between two tables.

 Compound Key: A compound key (composite key) is a key that consists of multiple
columns, because one column is not sufficiently unique.

 Index: An index in a database resembles an index at the back of a book.

 Referential Integrity: Referential Integrity makes sure that a foreign key value always
points to an existing row.
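The primary-key / foreign-key relationship described above can be sketched with an in-memory toy in plain Java (this is an illustration of the concepts only, not MySQL itself; the table and column names are invented):

```java
import java.util.HashMap;
import java.util.Map;

// A toy illustration of primary keys, foreign keys and referential integrity.
class ToyTables {
    // "users" table: primary key -> row. A key value cannot occur twice,
    // which Map.containsKey lets us enforce before inserting.
    static Map<Integer, String> users = new HashMap<>();

    // "orders" table: order id -> user id (the foreign key, the linking pin).
    static Map<Integer, Integer> orders = new HashMap<>();

    // Primary key rule: reject a second row with the same key.
    static boolean insertUser(int id, String name) {
        if (users.containsKey(id)) {
            return false; // primary key must be unique
        }
        users.put(id, name);
        return true;
    }

    // Referential integrity: the foreign key must point to an existing row.
    static boolean insertOrder(int orderId, int userId) {
        if (!users.containsKey(userId)) {
            return false; // dangling foreign key rejected
        }
        orders.put(orderId, userId);
        return true;
    }
}
```

In a real RDBMS these checks are declared once (PRIMARY KEY, FOREIGN KEY) and the engine enforces them; here they are spelled out by hand to show what the guarantees mean.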

MySQL is a fast, easy-to-use RDBMS used by many small and big businesses. MySQL
was originally developed, marketed, and supported by MySQL AB, a Swedish company, and is
now owned by Oracle Corporation. MySQL is popular for many good reasons:

 MySQL is released under an open-source license. So you have nothing to pay to use it.

 MySQL is a very powerful program in its own right. It handles a large subset of the
functionality of the most expensive and powerful database packages.

 MySQL uses a standard form of the well-known SQL data language.

 MySQL works on many operating systems and with many languages including PHP,
PERL, C, C++, JAVA, etc.

 MySQL works very quickly and works well even with large data sets.

 MySQL is very friendly to PHP, the most appreciated language for web development.

 MySQL supports large databases, up to 50 million rows or more in a table. The default
file size limit for a table is 4GB, but you can increase this (if your operating system can
handle it) to a theoretical limit of 8 million terabytes (TB).

 MySQL is customizable. The open-source GPL license allows programmers to modify


the MySQL software to fit their own specific environments.

3. ANALYSIS

In object-oriented analysis, developers build a model describing the application
domain. The analysis model is then extended to describe how the actors and the system interact,
and, together with the non-functional requirements, to prepare the architecture of the system
developed during high-level design. The analysis model should be correct, complete, consistent
and verifiable.

Analysis object model is represented by class and object diagrams. Analysis focuses on
producing a model of the system, called the Analysis model, which is correct, complete,
consistent, and verifiable. Analysis differs from requirements elicitation in that developers
focus on structuring and formalizing the requirements elicited from users. This formalization
leads to new insights and the discovery of errors in the requirements. As the analysis model may not
be understandable to the users and the client, developers need to update the requirements
specification to reflect insights gained during analysis, and then review the changes with the
client and users. In the end, the requirements, however large, should be understandable by the
client and the users.

The analysis model is composed of three individual models: the Functional Model
represented by use cases and scenarios, the Analysis Object Model, represented by class and
object diagrams, and the Dynamic Model, represented by state chart and sequence diagrams. In the
requirements phase, we gather requirements from the users and represent them as use cases and
scenarios. We refine the functional model and derive the object and the dynamic models. This
leads to a more precise and complete specification as detail is added to the analysis model. We
conclude by describing management activities related to analysis.

The analysis model represents the system under development from the user’s point of
view. The analysis object model is a part of the analysis and focuses on the individual concepts
that are manipulated by the system, their properties and their relationships. The analysis object
model, depicted with UML class diagrams, includes classes, attributes, and operations. The
analysis object model is a visual dictionary of the main concepts visible to the user.

3.1. ENTITY OBJECTS:

The Analysis object model consists of entity, boundary and control objects. Entity
objects represent the persistent information tracked by the system. Participating objects form the
basis of the analysis model.

Entity objects for my system are


1. Training data set
2. Testing data set

3.2. BOUNDARY OBJECTS:

A boundary object is an object used for interaction between the user and the system;
it is an interface used to communicate with the system. Boundary objects represent the
system interface with the actors. In each use case, each actor interacts with at least one boundary
object. The boundary object collects the information from the actor and translates it into a
form that can be used by the entity objects and the control objects.

The set of Boundary Objects that are involved in the system are as follows:

 Boundary objects for Login Page.

 Boundary objects for uploading the file.

I) BOUNDARY OBJECTS FOR HOME PAGE OR LOGIN PAGE

II) BOUNDARY OBJECTS FOR UPLOADING THE FILE

3.3 CONTROL OBJECTS:

Control objects are responsible for coordinating entity objects and boundary objects. A
control object is created at the beginning of a use case and ceases to exist at its end. Control
objects usually do not have a concrete counterpart in the real world. A control object is
responsible for collecting information from the boundary objects and dispatching it to the entity
objects.

Here the uploaded files are taken and processed to train the model and produce the prediction results.
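The three kinds of analysis objects can be sketched as plain Java classes (a minimal illustration; the class names `TrainingDataSet`, `UploadForm` and `PredictionController` are invented, not taken from the actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Entity object: persistent information tracked by the system.
class TrainingDataSet {
    List<String> records = new ArrayList<>();
}

// Control object: coordinates boundary and entity objects for one use case.
class PredictionController {
    private final TrainingDataSet dataSet;

    PredictionController(TrainingDataSet dataSet) {
        this.dataSet = dataSet;
    }

    // Collects information handed over by the boundary object and
    // dispatches it to the entity object.
    void handleUpload(String record) {
        dataSet.records.add(record);
    }
}

// Boundary object: the system's interface with the actor.
class UploadForm {
    private final PredictionController controller;

    UploadForm(PredictionController controller) {
        this.controller = controller;
    }

    // Translates the actor's input into a form the control object can use.
    void submit(String record) {
        controller.handleUpload(record);
    }
}
```

The actor only ever touches the boundary object; the control object lives for the duration of the use case and mediates between it and the entity object.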

3.4. OBJECT INTERACTION:


About Sequence Diagram:

Interaction diagrams model the behavior of use cases by describing the way groups of
objects interact to complete the task. The two kinds of interaction diagrams are sequence and
collaboration diagrams. Sequence diagrams generally show the sequence of events that
occur. Sequence diagrams demonstrate the behavior of objects in a use case by describing the
objects and the messages they pass. The diagrams are read left to right and top to
bottom. Following are the sequence diagrams for the system under consideration.

SEQUENCE DIAGRAM:

Figure 3.1: Sequence Diagram

3.5 OBJECT BEHAVIOUR:

State chart diagrams are used to describe the behavior of a system. State diagrams
describe all of the possible states of an object as events occur. Each diagram usually represents
objects of a single class and tracks the different states of its objects through the system. Not all
classes will require a state diagram and state diagrams are not useful for describing the
collaboration of all objects in a use case. State diagrams have very few elements. The basic
elements are rounded boxes representing the states of the object and arrows indicating the
transition to the next state. The activity section of the state symbol depicts what activities the
object will be doing while it is in that state. All state diagrams begin with an initial state of the
object. This is the state of the object when it is created. After the initial state the object begins
changing states.

(Use case diagram: the System actor participates in the use cases Data Reading, PreProcessing,
Stemming, Training The Model and Prediction.)

(Sequence diagram: the System sends the Dataset the messages 1, 2: Preprocessing, 3: Stemming,
4: Training and 5: Prediction.)
Figure 3.2: State chart Diagram 1

STATECHART DIAGRAM (Admin):

Figure 3.3: State chart Diagram 2

4. SYSTEM DESIGN
System Design is the transformation of an analysis model into a system design model. In
System design, developers:

- Define design goals of the project

- Decompose the system into smaller sub systems

- Design hardware/software strategies

- Design persistent data management strategies

- Design global control flow strategies

- Design access control policies and

- Design strategies for handling boundary conditions.

System design is not algorithmic. It is decomposed into several activities. They are:

- Identify Design Goals

- Design the initial subsystem decomposition

- Refine the subsystem decomposition to address the design goals.

Design is the first step in the development phase. It applies various techniques and principles for
the purpose of defining a device, a process or a system in sufficient detail to permit its physical
realization.

Once the software requirements have been realized, analyzed and specified, the software
design involves three technical activities: design, code generation and testing, which are required to
build and verify the software.

The design activities are of main importance in this phase, because in this activity,
decisions ultimately affecting the success of software implementation and its ease of maintenance
are made. These decisions have the final bearing upon reliability and maintainability of the system.

Design is the place where quality is fostered in development. Software design is a process
through which requirements are translated into a representation of software.

System design is the transformation of the analysis model into a system design model. Developers
define the design goals of the project and decompose the system into smaller subsystems that can
be realized by individual teams. Developers also select strategies for building the system, such as
the hardware/software platform on which the system will run, the persistent data management
strategy, the global control flow, the access control policy and the handling of boundary conditions.
The result of the system design is a model that includes a clear description of each of these
strategies, the subsystem decomposition, and a UML deployment diagram representing the
hardware/software mapping of the system.

The analysis model describes the system completely from the actors' point of view and
serves as the basis of communication between the client and the developers. The analysis model,
however, does not contain information about the internal structure of the system, its hardware
configuration or more generally, how the system should be realized. System design is the first
step in this direction.

During the system design activities, developers bridge the gap between the requirements
specification, produced during requirements elicitation and analysis, and the system that is
delivered to the users.

4.1 DESIGN GOALS:

Design goals are the qualities that the system should focus on. Many design goals can be
inferred from the nonfunctional requirements or from the application domain.

Cost:
Java is freeware, so there are no high development and maintenance costs.
Response time:
The system response is based on the length of the training data set.

Portability:
Java has an ability to move easily from one computer system to another. The ability to run
the same program on many different systems is crucial to World Wide Web software, and
Java succeeds at this by being platform-independent at both the source and binary levels.

Usability:
Users capable of handling simple GUI are able to use the system.
Reliability:
The system is trained with the ID3 & C4.5 algorithms so that it can give us accurate
results.

4.2. SYSTEM ARCHITECTURE:

As the complexity of systems increases, the specification of the system decomposition is
critical. Moreover, the subsystem decomposition is constantly revised whenever new issues are
addressed: subsystems are merged into a single subsystem, a complex subsystem is split into parts,
and some subsystems are added to take care of new functionality. The first iterations over the
subsystem decomposition can introduce drastic changes in the system design model.

Popular System Architectures are as follows:

- Repository

- Model/View/Controller (MVC)

- Document/View/Controller (DVC)

- Peer-to-Peer

- Client/Server

- Three-tier

- Four-tier

- Pipe and Filter

4.3. SUBSYSTEM DECOMPOSITION:

A subsystem is characterized by the services it provides to other subsystems. A
service is a set of related operations that share a common purpose. The set of operations of a
subsystem that are available to other subsystems form the “subsystem interface”. The subsystem
Interface includes the name of the operations, their parameters, their types and their return
values.

The subsystems that are factored out of the main system are as follows:

 Apply algorithm: The subsystem that is responsible for applying the algorithm.

 Output: The subsystem that is responsible for giving the output.

(Two host nodes: one applies the algorithm, the other gives the output.)
Figure 4.1: Subsystem decomposition

4.4. GLOBAL CONTROL FLOW:

Control flow is the sequencing of actions in a system. It defines the order of execution of
operations. These decisions are based on external events generated by an actor or on the passage
of time. There are two possible control flow mechanisms.

(a) Procedure Driven Control:

Operations wait for input whenever they need data from an actor. In the selection of the
algorithm, the processing operation waits for the decision maker to choose a data source. Whenever
the decision maker gives the input, the preprocessing operation can be executed.

(b) Event Driven Control:

A main loop waits for an external event. Whenever an event is available, it is dispatched to
the appropriate object based on information associated with the event.
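The two mechanisms can be contrasted in a small Java sketch (the event names and method names are hypothetical, not part of the actual system):

```java
import java.util.Queue;
import java.util.Scanner;

class ControlFlowDemo {
    // Procedure-driven control: the operation blocks here until the
    // decision maker (the actor) supplies data on the input stream.
    static String procedureDriven(Scanner in) {
        return "chose:" + in.nextLine(); // execution waits for the actor
    }

    // Event-driven control: a main loop pulls events from a queue and
    // dispatches each one based on information associated with it.
    static int eventDriven(Queue<String> events) {
        int handled = 0;
        String event;
        while ((event = events.poll()) != null) { // the main loop
            if (event.equals("UPLOAD") || event.equals("PREDICT")) {
                handled++; // dispatched to the appropriate handler
            }
            // unknown events are ignored by this simple dispatcher
        }
        return handled;
    }
}
```

In the procedure-driven case the call stack encodes where the system is waiting; in the event-driven case the queue and dispatcher do, which is the usual shape of GUI frameworks.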

4.5. BOUNDARY CONDITIONS:

During this activity we review the design decisions made so far and identify additional
boundary conditions, i.e., how the system is started, initialized and shut down, and how to deal
with major failures such as data corruption.

Configure: For each persistent object, we examine in which use cases it is created or destroyed.
The download use case creates the persistent object Download files.

Start up and Shutdown: For each component we add three use cases to start, shut down and
configure the component.

Exception Handling: For each type of component failure we decide how the system should
react. In general, an exception is an event or error that occurs during the execution of the system.
Exceptions are caused by different sources.

Hardware Failure: Hardware ages and fails. For example, on failure of the network link, the
system can identify it by using the connect use case and inform the user.

5. OBJECT DESIGN
Object design closes the gap between the application objects and off-the-shelf
components by identifying additional solution objects and refining existing objects. Object
design is not sequential: although each group of activities described below addresses a specific
object design issue, they usually occur concurrently.

Object design includes four groups of activities:

 Reuse: Off-the-shelf components identified during system design are used to help in the
realization of each subsystem. Class libraries and additional components are selected for
basic data structures and services. Design patterns are selected for solving common
problems and for protecting specific classes from future change.
 Interface specification: During this activity, the subsystem services identified during
system design are specified in terms of class interfaces, including operations, arguments,
type signatures, and exceptions.
 Object model restructuring: Restructuring activities manipulate the system model to
increase code reuse or meet other design goals.
 Object model optimization: During this activity, object design model is transformed to
address performance criteria such as response time or memory utilization.

Operation parameters and return values are typed in the same way as attributes are. The
type constrains the range of values the parameter or the return value can take. The tuple made of
the types of the parameters and the type of the return value is called the signature of the
operation. The visibility of an attribute or an operation specifies whether other classes can use it
or not. UML defines three levels of visibility.

5.1. OBJECT SPECIFICATION:


The system design model focuses on the subsystem decomposition and global system
decisions such as hardware/software mapping, persistent storage and access control. We identify
the top-level subsystems and define them in terms of the services they provide.

Specification activities during object design include:

a. Identifying missing attributes and operations


b. Specifying type, signature and visibility.
c. Specifying Constraints.
d. Specifying Exceptions

Attributes and Operations:

a. Attributes: in this step, we identify the attributes of the operation.

b. Operation: an operation is a property of the objects.
Attributes:

Attributes represent the properties of individual objects; only the attributes relevant to the
system should be considered.

CLASS DIAGRAM:

Figure 5.1: Class Diagram

About Class Diagram:

Class diagrams model class structure and contents using design elements such as classes,
packages and objects. Class diagrams describe three different perspectives when designing a
system: conceptual, specification, and implementation. Classes are composed of three things: a
name, attributes, and operations. Class diagrams also display relationships such as containment,
inheritance, associations and others. The association relationship is the most common
relationship in a class diagram. The association shows the relationship between instances of
classes. The multiplicity of the association denotes the number of objects that can participate in
the relationship.

5.1.1. Type, Signature and Visibility:

Type: The type of an attribute specifies the range of values the attribute can take and the
operations that can be applied to the attribute.

Signature: Given an operation, the tuple made out of the types of its parameters and the type of
the return value is the signature. Signatures are generally defined for operations.

Visibility: The visibility of an attribute or an operation is a mechanism for specifying whether
other classes can use the attribute or the operation or not. In general there are three levels of
visibility.

Private: A private attribute or operation can only be accessed within the class in which it is defined.

Protected: A protected attribute or operation can be accessed by the class in
which it is defined and by any descendant of the class.

Public: A public attribute or operation can be accessed by any class.

Private attribute or operation (indicated by ‘-’)

Protected attribute or operation (indicated by ‘#’)

Public attribute or operation (indicated by ‘+’)
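The three visibility levels map directly onto Java keywords, as this minimal sketch shows (the class names are invented for illustration):

```java
// UML: '-' = private, '#' = protected, '+' = public
class Account {
    private double balance;        // - : accessible only inside Account

    protected String owner;        // # : accessible in Account and descendants

    public Account(String owner) { // + : accessible from any class
        this.owner = owner;
    }

    public double getBalance() {   // public operation exposing a private attribute
        return balance;
    }

    public void deposit(double amount) {
        balance += amount;         // private attribute used within its own class
    }
}

class SavingsAccount extends Account {
    SavingsAccount(String owner) {
        super(owner);
    }

    String ownerLabel() {
        return "owner:" + owner;   // protected attribute visible in a descendant
    }
}
```

Code outside these classes can call `getBalance()` and `deposit(...)` but can never touch `balance` directly, which is the point of the private level.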

Type, Signature and Visibility of the classes in this system are as follows:

5.1.2 Constraints:
We attach constraints to classes and operations to more precisely specify their behavior and
boundary cases. The following are the constraints in this project:

The input file must be .txt, .doc, .docx, .xls, .xlsx, .csv file formats.
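This file-format constraint can be enforced with a simple check (a sketch only; the class and method names are made up, not taken from the actual implementation):

```java
import java.util.Arrays;
import java.util.List;

class InputConstraint {
    // The upload formats permitted by the constraint stated above.
    private static final List<String> ALLOWED =
            Arrays.asList(".txt", ".doc", ".docx", ".xls", ".xlsx", ".csv");

    // Returns true only when the file name ends with a permitted
    // extension; the comparison is case-insensitive.
    static boolean isAllowed(String fileName) {
        String lower = fileName.toLowerCase();
        return ALLOWED.stream().anyMatch(lower::endsWith);
    }
}
```

A boundary object would call such a check before handing the file to the control object, rejecting anything else with a message to the user.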

5.1.3 Exceptions:
Exceptional conditions are usually associated with the violation of preconditions. Exceptions can
be found systematically by examining each parameter of the operation.

Preconditions are constraints that the caller needs to satisfy before invoking an operation. When
the user inputs a file which doesn't consist of any data, appropriate messages should be displayed.

5.1.4 Associations:
Associations are relationships between classes and represent groups of links. Each end of an
association can be labeled by a set of integers indicating the number of links that can legitimately
originate from an instance of the class connected to the association end. Associations are used to
represent a wide range of connections among a set of objects.

User 1 ---- 1 Input data

Figure 5.2: Association

5.2. ALGORITHMS:

Algorithm 2:

6. CODING

The goal of the coding or programming phase is to translate the design of the system produced
during the design phase into code in a given programming language, which can be executed by a
computer and that performs the computation specified by the design.

The coding phase affects both testing and maintenance. The goal of coding is not to reduce the
implementation cost but to reduce the cost of later phases. In other words, the goal is not to
simplify the job of the programmer; rather, the goal should be to simplify the job of
the tester and maintainer.

6.1. CODING APPROACH:


There are two major approaches for coding any software system: the Top-Down approach
and the Bottom-Up approach.

The Bottom-Up approach best suits the development of object-oriented systems. During the
system design phase, to reduce the complexity, we decompose the system into an appropriate
number of subsystems, for which objects can be modeled independently. These objects exhibit
the way the subsystems perform their operations.

Once objects have been modeled they are implemented by means of coding. Even though the
objects belong to the same system, they are implemented independently of each other, so the
Bottom-Up approach is more suitable for coding these objects. In this approach, we first do the
coding of objects independently and then we integrate these modules into the one system to
which they belong.

This code can detect duplicates from data. First, it takes data as input, then it scans the entire file
and performs the preprocessing step. Through this step the missing values, errors, etc. get
eliminated. Then it generates a unique key for each tuple and sorts the data using the key. And
then it identifies the duplicates and displays the result.
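The duplicate-detection steps described above can be sketched as follows (a simplified version: the key here is just the normalized tuple text, and a hash set replaces the sort-and-compare step; the class name is invented):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class DuplicateDetector {
    // Preprocessing + key generation: trim stray whitespace and
    // normalize case so that equal tuples produce equal keys.
    static String key(String tuple) {
        return tuple.trim().toLowerCase();
    }

    // Scan all tuples, generate a key for each, and collect the ones
    // whose key has already been seen (the duplicates).
    static List<String> findDuplicates(List<String> tuples) {
        Set<String> seen = new LinkedHashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String tuple : tuples) {
            if (!seen.add(key(tuple))) { // add() is false if the key exists
                duplicates.add(tuple);
            }
        }
        return duplicates;
    }
}
```

Sorting by key, as the description suggests, achieves the same grouping of equal keys; the set-based variant trades the sort for a hash lookup per tuple.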

6.2. INFORMATION HANDLING:

Any software system requires some amount of information during its operation. The
selection of appropriate data structures can help us produce code with which the objects of the
system can operate on the available information with decreased complexity.

In this project, if any of the fields is vacant, then the system cannot proceed to further steps and
prompts a message saying that "input data must be in the specified range". The system will not
have any default values.

6.3. PROGRAMMING STYLE:

Programming style deals with the set of rules that a programmer has to follow so that the
characteristics of coding such as traceability, understandability, modifiability, and extensibility
can be satisfied. In the current system, we followed the coding rules for naming the variables and
methods. As part of coding, internal documentation is also provided to help the readers better
understand the code.

6.4. VERIFICATION AND VALIDATION:


Verification is the process of checking that the product is built right. Validation is the process of
checking whether the right product is built. During the development of the system, coding for
the objects has been thoroughly verified from different aspects regarding their design, the way
they are integrated, etc. The various techniques that have been followed for validation are
discussed in testing the current system.
Validations are applied to the entire system at two levels:

 Form level Validation:

All the inputs given to the system at various points in the forms are
validated while navigating to the next form. The system raises appropriate custom and pre-
defined exceptions to alert the user about errors that occurred or are likely to occur.

 Field level Validation:

Validations at the level of individual controls are also applied wherever necessary.
The system pops up appropriate and sensible dialogs wherever necessary.

In this project, validations are performed on each individual control. In the normalizing
phase, if any one of the text fields is not filled or any wrong click occurs, the system will
generate appropriate exceptions.

7. TESTING

Testing is the process of finding differences between the expected behavior specified by
system models and the observed behavior of the system. Testing plays a critical role in quality
assurance and in ensuring the reliability of development; errors in development will be reflected
in the code, so the application should be thoroughly tested and validated.

Unit testing finds the differences between the object design model and its
corresponding components. Structural testing finds differences between the system design model
and a subset of integrated subsystems. Functional testing finds differences between the use case
model and the system.

Finally, performance testing finds differences between non-functional requirements and
actual system performance. From a modeling point of view, testing is the attempt at falsification
of the system with respect to the system models. The goal of testing is to design tests that
exercise defects in the system and to reveal problems.

7.1. TESTING ACTIVITIES:

Testing a large system is a complex activity and, like any complex activity, it has to be
broken into smaller activities. Thus incremental testing was performed on the project, i.e.,
components and subsystems of the system were tested separately before integrating them to form
the subsystem for system testing.

7.2. TESTING TYPES:

Unit Testing:

Unit testing focuses on the building blocks of the software system that is the objects and
subsystems. There are three motivations behind focusing on components. First, unit testing
reduces the complexity of the overall test activities, allowing focus on smaller units of the
system. Second, unit testing makes it easier to pinpoint and correct faults, given that few
components are involved in the test. Third, unit testing allows parallelism in the testing
activities; that is, each component can be tested independently of one another. The following are some unit
testing techniques.

1. Equivalence testing: It is a black box testing technique that minimizes the number of test
cases. The possible inputs are partitioned into equivalence classes and a test case is selected for
each class.
2. Boundary testing: It is a special case of equivalence testing and focuses on the conditions at
the boundary of the equivalence classes. Boundary testing requires that the elements be selected
from the edges of the equivalence classes.
3. Path testing: It is a white box testing technique that identifies faults in the implementation of
the component. The assumption here is that, by exercising all possible paths through the code at
least once, most faults will trigger failures. This requires knowledge of the source code.
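As a concrete illustration of equivalence and boundary testing, suppose the system validated that a speed value lies in the range 0..200 (a hypothetical operation, invented for this example):

```java
class SpeedValidator {
    // Hypothetical operation under test: accepts speeds in [0, 200].
    static boolean isValidSpeed(int speed) {
        return speed >= 0 && speed <= 200;
    }
}

class SpeedValidatorTest {
    static void check(boolean condition, String label) {
        if (!condition) {
            throw new AssertionError(label);
        }
    }

    static void run() {
        // Equivalence testing: one representative value per equivalence class.
        check(!SpeedValidator.isValidSpeed(-50), "class: below range");
        check(SpeedValidator.isValidSpeed(100), "class: inside range");
        check(!SpeedValidator.isValidSpeed(500), "class: above range");

        // Boundary testing: values selected from the edges of the classes.
        check(SpeedValidator.isValidSpeed(0), "lower edge, inside");
        check(SpeedValidator.isValidSpeed(200), "upper edge, inside");
        check(!SpeedValidator.isValidSpeed(-1), "just below lower edge");
        check(!SpeedValidator.isValidSpeed(201), "just above upper edge");
    }
}
```

Three equivalence-class cases replace an unbounded number of possible inputs, and the four boundary cases target exactly where off-by-one faults in the comparison operators would hide.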

Integration Testing:

Integration testing detects faults that have not been detected during unit testing, by focusing on
small groups of components. Two or more components are integrated and tested, and once the
tests do not reveal any new faults, additional components are added to the group. This procedure
allows testing of increasingly more complex parts of the system while keeping the location of
potential faults relatively small. I have used the following approach to implement integration
testing.

The top-down testing strategy unit-tests the components of the top layer and then integrates
the components of the next layer down. When all components of the new layer have been tested
together, the next layer is selected. This was repeated until all layers were combined and
involved in the test.

Validation Testing:

When the system is completely assembled as a package and the interfacing errors have been
uncovered and corrected, a final series of software tests, validation testing, begins. Validation
succeeds when the system functions in a manner that can be reasonably expected by the
customer. The system validation was done by a series of black-box test methods.

System Testing:

1. System testing ensures that the complete system complies with the functional and
non-functional requirements of the system. The following are some system testing
activities.
2. Functional testing finds differences between the functional requirements and the
system. This is a black box testing technique. Test cases are derived from the use case
model.
3. Performance testing finds differences between the design goals and the system; the
design goals are derived from the non-functional requirements.
4. Pilot testing: the system is installed and used by a selected set of users, who exercise
the system as if it had been permanently installed.
5. Acceptance testing: I have followed benchmark testing, in which the client prepares a
set of test cases that represent typical conditions under which the system operates. In
our project, there are no existing benchmarks.
6. Installation testing: the system is installed in the target environment.

7.3. TESTING PLAN:


Testing accounts for 45 - 75% of the typical project effort. It is also one of the most
commonly underestimated activities on a project. A test plan is a document that answers the
basic questions about your testing effort. It needs to be initiated during the requirements
gathering phase of your project and should evolve into a roadmap for the testing phase.

 A test plan enables a more reliable estimate of the testing effort up front.
 It gives the project team time to consider ways to reduce the testing effort without
being under time pressure.
 A test plan helps to identify problem areas and focuses the testing team's attention on
the critical paths.
 A test plan reduces the probability of releasing untested components.

7.4. TEST CASE REPORT:

 Test Case 1 - Login
 Test Case ID: 01
 Test Case Name: Login
 Test Case Type: Black-box testing

Description                 Expected Value          Observed Value                     Result

Admin selects the           Admin gets access       If the login id and password       Successful
'Login' button                                      are correct, the admin gets
                                                    access

Admin enters a wrong        Admin does not get      Admin does not get access and      Successful
password                    access                  has to re-enter the correct
                                                    password

Table 7.1: Test case 1
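The login check exercised by this test case can be sketched as follows. The class name, method name and the hard-coded credentials are purely illustrative assumptions; in the actual application the admin credentials are looked up in the database.

```java
// Hypothetical sketch of the admin login check exercised by Test Case 1.
// The credential values here are made up for illustration only.
public class LoginCheck {

    private static final String ADMIN_ID = "admin";        // assumed value
    private static final String ADMIN_PASSWORD = "secret"; // assumed value

    // Returns true only when both the login id and the password match,
    // mirroring the expected/observed behaviour in Table 7.1.
    public static boolean authenticate(String loginId, String password) {
        return ADMIN_ID.equals(loginId) && ADMIN_PASSWORD.equals(password);
    }
}
```

A wrong password (second row of the table) simply returns false, forcing the admin to re-enter credentials.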

 Test Case 2 - File Browse
 Test Case ID: 02
 Test Case Name: Browse the file
 Test Case Type: Black-box testing

Description                 Expected Value              Observed Value                    Result

User selects the            User uploads the file       If the file is of a valid         Successful
'Upload' button             from a folder               format, the file is accepted

User selects a file of      The system rejects ppt,     Only .csv, .txt, .doc, .xlsx      Successful
an invalid format           pdf and other invalid       and .xls files are accepted
                            files                       from the user

Table 7.2: Test case 2
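The file-format filter behind this test case can be sketched as below. The class and method names are illustrative, not the project's actual upload handler; only the allowed extension list comes from the test case itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch of the upload filter described in Test Case 2:
// only .csv, .txt, .doc, .xlsx and .xls files are accepted.
public class UploadFilter {

    private static final List<String> ALLOWED =
            Arrays.asList(".csv", ".txt", ".doc", ".xlsx", ".xls");

    public static boolean isValidFormat(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension at all
        }
        // Compare the extension case-insensitively against the whitelist.
        String ext = filename.substring(dot).toLowerCase(Locale.ROOT);
        return ALLOWED.contains(ext);
    }
}
```

With this check, a ppt or pdf selection (second row of Table 7.2) is rejected before upload.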

8. SCREENS

9. SOURCE CODE
Sample Code
/*
* To change this license header, choose License Headers in Project Properties.

* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package com.util;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.core.Instance;
import weka.core.Instances;
public class KNN {

    public static BufferedReader readDataFile(String filename) {
        BufferedReader inputReader = null;
        try {
            inputReader = new BufferedReader(new FileReader(filename));
        } catch (FileNotFoundException ex) {
            System.err.println("File not found: " + filename);
        }
        return inputReader;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader datafile = readDataFile("ads.txt");
        Instances data = new Instances(datafile);
        data.setClassIndex(data.numAttributes() - 1);

        // hold out the first two instances as test examples
        Instance first = data.instance(0);
        Instance second = data.instance(1);
        // deleting index 0 twice removes the original first two instances,
        // because the remaining instances shift up after each delete
        data.delete(0);
        data.delete(0);

        Classifier ibk = new IBk();
        ibk.buildClassifier(data);
        double class1 = ibk.classifyInstance(first);
        double class2 = ibk.classifyInstance(second);
        System.out.println("first: " + class1 + "\nsecond: " + class2);
    }
}
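Weka's IBk classifier used above is a k-nearest-neighbour learner. As a self-contained illustration of the underlying idea, independent of Weka and with made-up data, the 1-nearest-neighbour rule simply returns the label of the closest training point:

```java
// Self-contained sketch of the 1-nearest-neighbour rule that IBk generalises.
// The training data used in the test is made up for illustration.
public class NearestNeighbour {

    // Euclidean distance between two feature vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the label of the training instance closest to the query point.
    public static int classify(double[][] train, int[] labels, double[] query) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < train.length; i++) {
            double d = distance(train[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return labels[best];
    }
}
```

IBk extends this by voting over the k closest neighbours instead of just one.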

/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package com.util;

import java.util.Arrays;
import java.util.Random;

public class np {

private static Random random;


private static long seed;

static {

seed = System.currentTimeMillis();
random = new Random(seed);
}

/**
* Sets the seed of the pseudo-random number generator. This method enables
* you to produce the same sequence of "random" numbers for each execution of
* the program. Ordinarily, you should call this method at most once per
* program.
*
* @param s the seed
*/
public static void setSeed(long s) {
seed = s;
random = new Random(seed);
}

/**
* Returns the seed of the pseudo-random number generator.
*
* @return the seed
*/
public static long getSeed() {
return seed;
}

/**
* Returns a random real number uniformly in [0, 1).
*
* @return a random real number uniformly in [0, 1)

*/
public static double uniform() {
return random.nextDouble();
}

/**
* Returns a random integer uniformly in [0, n).
*
* @param n number of possible integers
* @return a random integer uniformly between 0 (inclusive) and {@code n}
* (exclusive)
* @throws IllegalArgumentException if {@code n <= 0}
*/
public static int uniform(int n) {
if (n <= 0) {
throw new IllegalArgumentException("argument must be positive: " + n);
}
return random.nextInt(n);
}

/**
* Returns a random long integer uniformly in [0, n).
*
* @param n number of possible {@code long} integers
* @return a random long integer uniformly between 0 (inclusive) and
* {@code n} (exclusive)
* @throws IllegalArgumentException if {@code n <= 0}
*/
public static long uniform(long n) {
if (n <= 0L) {

throw new IllegalArgumentException("argument must be positive: " + n);
}

long r = random.nextLong();
long m = n - 1;

// power of two
if ((n & m) == 0L) {
return r & m;
}

// reject over-represented candidates


long u = r >>> 1;
while (u + m - (r = u % n) < 0L) {
u = random.nextLong() >>> 1;
}
return r;
}

/**
* Returns a random integer uniformly in [a, b).
*
* @param a the left endpoint
* @param b the right endpoint
* @return a random integer uniformly in [a, b)
* @throws IllegalArgumentException if {@code b <= a}
* @throws IllegalArgumentException if {@code b - a >= Integer.MAX_VALUE}
*/
public static int uniform(int a, int b) {
if ((b <= a) || ((long) b - a >= Integer.MAX_VALUE)) {

throw new IllegalArgumentException("invalid range: [" + a + ", " + b + ")");
}
return a + uniform(b - a);
}

/**
* Returns a random real number uniformly in [a, b).
*
* @param a the left endpoint
* @param b the right endpoint
* @return a random real number uniformly in [a, b)
* @throws IllegalArgumentException unless {@code a < b}
*/
public static double uniform(double a, double b) {
if (!(a < b)) {
throw new IllegalArgumentException("invalid range: [" + a + ", " + b + ")");
}
return a + uniform() * (b - a);
}

/**
* @param m
* @param n
* @return random m-by-n matrix with values between 0 and 1
*/
public static double[][] random(int m, int n) {
double[][] a = new double[m][n];
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
a[i][j] = uniform(0.0, 1.0);

}
}
return a;
}

/**
* Transpose of a matrix
*
* @param a matrix
* @return b = A^T
*/
public static double[][] T(double[][] a) {
int m = a.length;
int n = a[0].length;
double[][] b = new double[n][m];
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
b[j][i] = a[i][j];
}
}
return b;
}

/**
* @param a matrix
* @param b matrix
* @return c = a + b
*/
public static double[][] add(double[][] a, double[][] b) {
int m = a.length;

int n = a[0].length;
double[][] c = new double[m][n];
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
c[i][j] = a[i][j] + b[i][j];
}
}
return c;
}

/**
* @param a matrix
* @param b matrix
* @return c = a - b
*/
public static double[][] subtract(double[][] a, double[][] b) {
int m = a.length;
int n = a[0].length;
double[][] c = new double[m][n];
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
c[i][j] = a[i][j] - b[i][j];
}
}
return c;
}

/**
* Element wise subtraction
*

* @param a scalar
* @param b matrix
* @return c = a - b
*/
public static double[][] subtract(double a, double[][] b) {
int m = b.length;
int n = b[0].length;
double[][] c = new double[m][n];
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
c[i][j] = a - b[i][j];
}
}
return c;
}

/**
* @param a matrix
* @param b matrix
* @return c = a * b
*/
public static double[][] dot(double[][] a, double[][] b) {
int m1 = a.length;
int n1 = a[0].length;
int m2 = b.length;
int n2 = b[0].length;
if (n1 != m2) {
throw new RuntimeException("Illegal matrix dimensions.");
}
double[][] c = new double[m1][n2];

for (int i = 0; i < m1; i++) {
for (int j = 0; j < n2; j++) {
for (int k = 0; k < n1; k++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
return c;
}

/**
* Element wise multiplication
*
* @param a matrix
* @param x matrix
* @return y = a * x
*/
public static double[][] multiply(double[][] x, double[][] a) {
int m = a.length;
int n = a[0].length;

if (x.length != m || x[0].length != n) {
throw new RuntimeException("Illegal matrix dimensions.");
}
double[][] y = new double[m][n];
for (int j = 0; j < m; j++) {
for (int i = 0; i < n; i++) {
y[j][i] = a[j][i] * x[j][i];
}
}

return y;
}

/**
* Element wise multiplication
*
* @param a matrix
* @param x scalar
* @return y = a * x
*/
public static double[][] multiply(double x, double[][] a) {
int m = a.length;
int n = a[0].length;

double[][] y = new double[m][n];


for (int j = 0; j < m; j++) {
for (int i = 0; i < n; i++) {
y[j][i] = a[j][i] * x;
}
}
return y;
}

/**
* Element wise power
*
* @param x matrix
* @param a scalar
* @return y
*/

public static double[][] power(double[][] x, int a) {
int m = x.length;
int n = x[0].length;

double[][] y = new double[m][n];


for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
y[i][j] = Math.pow(x[i][j], a);
}
}
return y;
}

/**
* @param a matrix
* @return shape of matrix a
*/
public static String shape(double[][] a) {
int m = a.length;
int n = a[0].length;
String Vshape = "(" + m + "," + n + ")";
return Vshape;
}

/**
* @param a matrix
* @return sigmoid of matrix a
*/
public static double[][] sigmoid(double[][] a) {
int m = a.length;

int n = a[0].length;
double[][] z = new double[m][n];

for (int i = 0; i < m; i++) {


for (int j = 0; j < n; j++) {
z[i][j] = (1.0 / (1 + Math.exp(-a[i][j])));
}
}
return z;
}

/**
* Element wise division
*
* @param x matrix
* @param a scalar divisor
* @return x / a
*/
public static double[][] divide(double[][] x, int a) {
int m = x.length;
int n = x[0].length;

double[][] z = new double[m][n];

for (int i = 0; i < m; i++) {


for (int j = 0; j < n; j++) {
z[i][j] = (x[i][j] / a);
}
}
return z;

}
/**
* Cross-entropy loss, averaged over the batch
*
* @param batch_size scalar batch size
* @param Y matrix of target labels
* @param A matrix of predicted activations
* @return loss
*/
public static double cross_entropy(int batch_size, double[][] Y, double[][] A) {
int m = A.length;
int n = A[0].length;
double[][] z = new double[m][n];

for (int i = 0; i < m; i++) {


for (int j = 0; j < n; j++) {
z[i][j] = (Y[i][j] * Math.log(A[i][j])) + ((1 - Y[i][j]) * Math.log(1 - A[i][j]));
}
}

double sum = 0;
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
sum += z[i][j];
}
}
return -sum / batch_size;
}
public static double[][] softmax(double[][] z) {
double[][] zout = new double[z.length][z[0].length];

double sum = 0.;
for (int i = 0; i < z.length; i++) {
for (int j = 0; j < z[0].length; j++) {
sum += Math.exp(z[i][j]);
}
}
for (int i = 0; i < z.length; i++) {
for (int j = 0; j < z[0].length; j++) {
zout[i][j] = Math.exp(z[i][j]) / sum;
}
}
return zout;
}

public static void print(String val) {


System.out.println(val);
}
}
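The helpers in the np class above combine into a logistic-style forward pass: dot() for the weighted sum, sigmoid() for the activation, and cross_entropy() for the loss. The following self-contained sketch reimplements the needed pieces compactly; the weights and inputs are toy values, not project data.

```java
// Compact, self-contained sketch of how dot(), sigmoid() and cross_entropy()
// from the np class combine into one forward pass of a logistic unit.
public class ForwardPassDemo {

    static double[][] dot(double[][] a, double[][] b) {
        int m = a.length, n = b[0].length, k = a[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int t = 0; t < k; t++)
                    c[i][j] += a[i][t] * b[t][j];
        return c;
    }

    static double[][] sigmoid(double[][] a) {
        double[][] z = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                z[i][j] = 1.0 / (1.0 + Math.exp(-a[i][j]));
        return z;
    }

    // Mean binary cross-entropy between labels Y and activations A.
    static double crossEntropy(int batchSize, double[][] Y, double[][] A) {
        double sum = 0.0;
        for (int i = 0; i < Y.length; i++)
            for (int j = 0; j < Y[0].length; j++)
                sum += Y[i][j] * Math.log(A[i][j])
                     + (1 - Y[i][j]) * Math.log(1 - A[i][j]);
        return -sum / batchSize;
    }

    public static double forwardLoss() {
        double[][] W = {{0.5, -0.5}};      // 1x2 weight matrix (toy values)
        double[][] X = {{1.0}, {1.0}};     // 2x1 input column (toy values)
        double[][] A = sigmoid(dot(W, X)); // W.X = 0, so A = sigmoid(0) = 0.5
        double[][] Y = {{1.0}};            // target label
        return crossEntropy(1, Y, A);      // -log(0.5) = ln 2
    }
}
```

The same chaining applies with the np class directly, e.g. np.sigmoid(np.dot(W, X)).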

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">

<meta name="description" content="company is a free job board template">
<meta name="author" content="">

<meta name="viewport" content="width=device-width, initial-scale=1">

<link href='http://fonts.googleapis.com/css?family=Open+Sans:400,300,700,800' rel='stylesheet' type='text/css'>

<!-- Place favicon.ico and apple-touch-icon.png in the root directory -->


<link rel="shortcut icon" href="favicon.ico" type="image/x-icon">
<link rel="icon" href="favicon.ico" type="image/x-icon">

<link rel="stylesheet" href="css/normalize.css">


<link rel="stylesheet" href="css/font-awesome.min.css">
<link rel="stylesheet" href="css/fontello.css">
<link rel="stylesheet" href="css/animate.css">
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/owl.carousel.css">
<link rel="stylesheet" href="css/owl.theme.css">
<link rel="stylesheet" href="css/owl.transitions.css">
<link rel="stylesheet" href="style.css">
<link rel="stylesheet" href="responsive.css">
<script src="js/vendor/modernizr-2.6.2.min.js"></script>
</head>
<body>

<div id="preloader">
<div id="status"><h1>Road Traffic Speed Prediction: A Probabilistic Model Fusing Multi-Source
Data </h1></div>
</div>

<!-- Body content -->

<div class="header-connect">
<div class="container">
<div class="row">
<div class="col-md-5 col-sm-8 col-xs-8">
<div class="header-half header-call">

</div>
</div>

<nav class="navbar navbar-default">


<div class="container">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="#"><img src="img/final.jpeg" width="150" height="90" alt=""></a>
</div>

<!-- Collect the nav links, forms, and other content for toggling -->

<h2 align="center">Road Traffic Speed Prediction: A Probabilistic Model Fusing Multi-Source
Data </h2>
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">

<center>
<div class="button navbar-right">
<a href="registerfacebook.jsp"><button class="navbar-btn nav-button wow bounceInRight login" data-wow-delay="0.8s">FaceBook Register</button></a>
<a href="registertwitter.jsp"><button class="navbar-btn nav-button wow fadeInRight" data-wow-delay="0.6s">Twitter Register</button></a>
<a href="registerqzone.jsp"><button class="navbar-btn nav-button wow bounceInRight login" data-wow-delay="0.8s">Qzone Register</button></a>
<a href="facebook.jsp"><button class="navbar-btn nav-button wow fadeInRight" data-wow-delay="0.6s">Facebook</button></a>
<a href="twitter.jsp"><button class="navbar-btn nav-button wow fadeInRight" data-wow-delay="0.6s">Twitter</button></a>
<a href="qzone.jsp"><button class="navbar-btn nav-button wow fadeInRight" data-wow-delay="0.6s">Qzone</button></a>
<a href="datauser.jsp"><button class="navbar-btn nav-button wow bounceInRight login" data-wow-delay="0.8s">Admin</button></a>

</div>
</center>

<!--

<ul class="main-nav nav navbar-nav navbar-right">


<li class="wow fadeInDown" data-wow-delay="0s"><a class="active"
href="index.html">Home</a></li>

<li class="wow fadeInDown" data-wow-delay="0.2s"><a href="alumni.jsp">Admin</a></li>
<li class="wow fadeInDown" data-wow-delay="0.5s"><a href="contact.html">User
Login</a></li>
</ul>
-->

</div><!-- /.navbar-collapse -->


</div><!-- /.container-fluid -->
</nav>

</div>

<div class="content-area">
<hr>

<div class="row how-it-work text-center">

<hr>

<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
<script>window.jQuery || document.write('<script src="js/vendor/jquery-1.10.2.min.js"><\/script>')</script>
<script src="js/bootstrap.min.js"></script>
<script src="js/owl.carousel.min.js"></script>
<script src="js/wow.js"></script>
<script src="js/main.js"></script>
</div>
</div>

</div>

<h2 align="center">Data User Login......</h2>


<br/>
<center>

<form action="facebook1.jsp" method="post">


<table border="2">

<tr>
<td>User Name:</td>
<td>
<input type="text" name="username"/>
</td>

</tr>

<tr>
<td>Password:</td>
<td>
<input type="password" name="password"/>
</td>

</tr>

<tr>
<td></td>

<td>
<input type="submit" value="Submit" style="color: #080808"/>
<input type="reset" value="Clear" style="color: #080808"/>
</td>
</tr>

</table>

</form>

</center>

</div>

</div>
</body>

</html>

Conclusion:

This project proposes a novel probabilistic framework to predict road traffic speed from
multiple cross-domain data sources. Existing works are mainly based on speed sensing data,
which suffers from data sparsity and low coverage. In our work, we handle the challenges
arising from fusing multi-source data, including location uncertainty, language ambiguity and
data heterogeneity, using a Location Disaggregation Model, a Traffic Topic Model and a Traffic
Speed Gaussian Process Model. Experiments on real data demonstrate the effectiveness and
efficiency of our model. For future work, we plan to implement kernel-based and distributed
GP, so the traffic prediction framework can be applied to a real-time large traffic network.
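As a hint of what the kernel-based GP extension involves, the covariance between two road-link feature vectors is typically computed with a squared-exponential (RBF) kernel. The sketch below is illustrative only; the length-scale is a made-up hyperparameter, not a value fitted in this project.

```java
// Illustrative squared-exponential (RBF) kernel, a standard covariance
// choice for a Gaussian-process speed model. Nearby points get covariance
// close to 1; distant points get covariance close to 0.
public class RbfKernel {

    public static double k(double[] x1, double[] x2, double lengthScale) {
        double sq = 0.0;
        for (int i = 0; i < x1.length; i++) {
            double d = x1[i] - x2[i];
            sq += d * d;
        }
        return Math.exp(-sq / (2.0 * lengthScale * lengthScale));
    }
}
```

A distributed GP would partition the road network and evaluate such kernels per partition.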

References:

[1] B. Abdulhai, H. Porwal, and W. Recker. Short-term traffic flow prediction using
neuro-genetic algorithms. ITS Journal - Intelligent Transportation Systems Journal,
7(1):3-41, 2002.

[2] R. Alfelor, H. S. Mahmassani, and J. Dong. Incorporating weather impacts in traffic
estimation and prediction systems. Technical report, US Department of Transportation, 2009.

[3] M. T. Asif, N. Mitrovic, L. Garg, and J. Dauwels. Low-dimensional models for missing
data imputation in road networks. 32(3):3527-3531, 2013.

[4] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and
Statistics). Springer-Verlag New York, Inc., 2006.

[5] D. M. Blei and J. D. Lafferty. Correlated topic models. In NIPS, pages 147-154, 2005.

[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993-1022, 2003.

[7] J. Chen, K. H. Low, Y. Yao, and P. Jaillet. Gaussian process decentralized data fusion
and active sensing for spatiotemporal traffic modeling and prediction in mobility-on-demand
systems. IEEE Transactions on Automation Science and Engineering, 12(3):1-21, 2015.

[8] P.-T. Chen, F. Chen, and Z. Qian. Road traffic congestion monitoring in social media
with hinge-loss Markov random fields. In ICDM, pages 80-89. IEEE, 2014.

[9] S. Clark. Traffic prediction using multivariate nonparametric regression. Journal of
Transportation Engineering, pages 161-168, 2003.
