Submitted By:
Karadkhelkar Kalyani (18)
Kawale Tina (19)
Mhatre Shivani (23)
Chapter 2
Introduction:
Students pursuing higher education degrees face two challenges: a myriad of courses to choose from, and a lack of knowledge about which courses to follow and in what sequence. Most of them choose and register for courses based on the recommendations of friends and colleagues. A recommender system would be useful in helping students find courses of interest. The proposed system is based on the principle of taking advantage of the collaborative experience of students who have already finished their studies. Since the volume of data concerning registered students keeps increasing, applying data mining to interpret this data can reveal hidden relations between the courses students follow. Once interesting results are discovered, a course recommender system can use them to predict the most appropriate courses for current students. As users rate the recommendations thus provided, system performance can be improved.
The course recommender system aims at predicting the best combination of courses for students. This work presents such a system and focuses on the effectiveness of incorporating data mining into course recommendation.
Chapter 3
Data Mining:
Definition:
“Data mining is the process of discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques.” [Gartner Group, in Larose, p. xi, 2005]
“Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and useful
to the data owner.” (Hand et al., 2001)
“A class of database applications that look for hidden patterns in a group of data that can be
used to predict future behavior.” (Webopedia, n.d.)
“Data mining is an interdisciplinary field bringing together techniques from machine learning,
pattern recognition, statistics, databases, and visualization to address the issue of information
extraction from large databases.” (Cabena et al., 1998)
Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data, which can in turn be used to make predictions. It is the process of discovering hidden, valuable knowledge by analyzing large amounts of data stored in databases and other repositories.
Because data mining is such an important process, it is advantageous for various industries, such as manufacturing and marketing. There is therefore a need for a standard data mining process, one that is reliable and repeatable even by business people with little or no knowledge of data science.
Data in digital form is available everywhere, for example on the Internet, and can be used to make predictions, usually with statistical approaches. Data mining extends traditional data analysis and statistical approaches by incorporating analytical techniques drawn from a range of disciplines. It covers the entire process of data analysis, including data cleaning and preparation, visualization of the results, and producing predictions in real time so that specific goals are met.
Applications of Data Mining:
- Data mining is used for market basket analysis, which provides information on what product combinations were purchased together, when they were bought, and in what sequence. This information helps businesses promote their most profitable products and maximize profit. It also encourages customers to purchase related products that they may have missed or overlooked.
- Retail companies use data mining to identify customers' buying behavior patterns.
- Data mining is used to measure customer loyalty by analyzing purchasing activity, such as the frequency of purchases in a period of time, the total monetary value of all purchases, and the recency of the last purchase. After analyzing these dimensions, a relative measure is generated for each customer; the higher the score, the more loyal the customer.
- To help banks retain credit card customers, data mining is applied to past data to predict which customers are likely to change their credit card affiliation, so the bank can plan and launch special offers to retain them.
- Credit card spending by customer groups can be identified using data mining.
- Hidden correlations between different financial indicators can be discovered using data mining.
- From historical market data, data mining can identify stock trading rules.
- Data mining is applied in claims analysis, such as identifying which medical procedures are claimed together.
- Data mining can forecast which customers will potentially purchase new policies.
- Data mining allows insurance companies to detect risky customer behavior patterns.
- Data mining helps detect fraudulent behavior.
- Data mining helps determine distribution schedules among warehouses and outlets and analyze loading patterns.
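As a small illustration of the market basket idea above, the sketch below counts how often a pair of products appears together across a list of transactions. The product names and transactions are invented for the example.

```java
import java.util.*;

// Minimal market-basket sketch: count how often a pair of products
// is purchased together. All data here is made up for illustration.
public class BasketAnalysis {

    // Number of transactions containing both item a and item b.
    static int pairCount(List<Set<String>> transactions, String a, String b) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.contains(a) && t.contains(b)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("bread", "butter", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "jam"),
                Set.of("milk", "jam"));
        // bread and butter co-occur in the first two transactions
        System.out.println(pairCount(transactions, "bread", "butter")); // 2
    }
}
```

A real market basket analysis would count all item combinations at once (as Apriori does later in this report), but the pairwise count shows the basic measurement.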
Chapter 4:
Recommender System:
In a world where the number of choices can be overwhelming, recommender systems help
users find and evaluate items of interest. They connect users with items to “consume”
(purchase, view, listen to, etc.) by associating the content of recommended items or the
opinions of other individuals with the consuming user’s actions or opinions. Such systems have
become powerful tools in domains from electronic commerce to digital libraries and knowledge
management. For example, a consumer of just about any major online retailer who expresses
an interest in an item – either through viewing a product description or by placing the item in
his “shopping cart” – will likely receive recommendations for additional products. These
products can be recommended based on the top overall sellers on a site, on the demographics
of the consumer, or on an analysis of the past buying behavior of the consumer as a prediction
for future buying behavior.
The term data mining refers to a broad spectrum of mathematical modeling techniques and
software tools that are used to find patterns in data and use these to build models. In the
context of recommender applications, the term data mining is used to describe the collection of
analysis techniques used to infer recommendation rules or build recommendation models from
large data sets. Recommender systems that incorporate data mining techniques make their
recommendations using knowledge learned from the actions and attributes of users. These
systems are often based on the development of user profiles that can be persistent (based on
demographic or item “consumption” history data), ephemeral (based on the actions during the
current session), or both.
Data mining techniques commonly used in such systems include:
- Clustering
- Classification techniques
- The generation of association rules
- The production of similarity graphs through techniques such as Horting
Chapter 5:
Data Mining Algorithms:
There are various data mining algorithms, including the following:
1. Clustering Algorithm:
Clustering means finding groups of objects such that the objects in one group are similar to one another and different from the objects in other groups. Clustering can be considered the most important unsupervised learning technique. In educational data mining, clustering has been used to group students according to their behavior; for example, it can distinguish active students from non-active students according to their performance in activities.
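As a crude sketch of this idea (not a full clustering algorithm such as k-means), the example below splits hypothetical students into "active" and "non-active" groups by comparing each activity score with the overall mean. All names and scores are made up.

```java
import java.util.*;

// Crude illustration of grouping students by activity score: split on
// the overall mean. A real system would use a proper clustering
// algorithm such as k-means; this only shows the grouping idea.
public class StudentClusters {

    static Map<String, List<String>> cluster(Map<String, Double> scores) {
        double mean = scores.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0);
        Map<String, List<String>> groups = new HashMap<>();
        groups.put("active", new ArrayList<>());
        groups.put("non-active", new ArrayList<>());
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            // Students at or above the mean score are called "active"
            groups.get(e.getValue() >= mean ? "active" : "non-active").add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("s1", 9.0, "s2", 2.0, "s3", 8.0, "s4", 1.0);
        System.out.println(cluster(scores));
    }
}
```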
2. Classification:
Classification is a data mining task that maps data into predefined groups and classes. It is also called supervised learning. It typically proceeds in two steps:
1. Model construction: A set of training samples with known class labels is used to build the model.
2. Model usage: The model is used for classifying future or unknown objects. The known label of a test sample is compared with the classified result from the model. The accuracy rate is the percentage of test set samples that are correctly classified by the model. The test set must be independent of the training set, otherwise over-fitting will occur.
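The accuracy rate in the model usage step can be computed as in the small sketch below; the predicted and actual labels are made-up test data.

```java
// Sketch of the "model usage" step: the accuracy rate is the fraction
// of test samples whose predicted label matches the known label,
// expressed as a percentage. Labels here are invented test data.
public class Accuracy {

    static double accuracyRate(int[] predicted, int[] actual) {
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == actual[i]) {
                correct++;
            }
        }
        return 100.0 * correct / predicted.length;
    }

    public static void main(String[] args) {
        int[] predicted = {1, 0, 1, 1};
        int[] actual    = {1, 0, 0, 1};
        // 3 of 4 predictions match the known labels
        System.out.println(accuracyRate(predicted, actual)); // 75.0
    }
}
```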
3. Association Rule Mining:
Association rules are used to show relationships between data items. Mining association rules finds rules of the form "if antecedent then (likely) consequent", where the antecedent and consequent are item sets, i.e. sets of one or more items. Association rule generation consists of two separate steps: first, a minimum support threshold is applied to find all frequent item sets in the database; second, these frequent item sets and a minimum confidence constraint are used to form rules.
Apriori algorithm:
The Apriori algorithm operates on database records, particularly transactional records, i.e. records consisting of certain numbers of fields or items. The algorithm is named Apriori because it uses prior knowledge of the properties of frequent item sets. It applies an iterative, level-wise search in which frequent k-item sets are used to find candidate (k+1)-item sets. The credit for introducing this algorithm goes to Rakesh Agrawal and Ramakrishnan Srikant in 1994.
The algorithm has two main drawbacks:
- It can require a large number of candidate rules, which becomes computationally expensive.
- Calculating support is also expensive, because each calculation has to scan the entire database.
The following methods can improve the efficiency of the Apriori algorithm:
- Transaction reduction: a transaction that does not contain any frequent k-item set is useless in subsequent scans and can be skipped.
- Hash-based item set counting: a k-item set whose corresponding hashing bucket count is below the threshold is infrequent and can be excluded.
Chapter 6
Apriori Association Rule and Algorithm:
The Apriori association rule algorithm is used to mine frequent patterns in a database. Support and confidence are the usual measures of the quality of an association rule. The support of the association rule X -> Y is the percentage of transactions in the database that contain X ∪ Y. The confidence of the association rule X -> Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X.
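These two measures can be computed directly from a transaction database. The sketch below evaluates a rule X -> Y over a made-up database of course-selection "transactions" (the course names are invented for the example):

```java
import java.util.*;

// Worked example of support and confidence for a rule X -> Y:
//   support(X -> Y)    = |transactions containing X ∪ Y| / |D| * 100
//   confidence(X -> Y) = |transactions containing X ∪ Y| / |transactions containing X| * 100
// The course names and transactions are invented for illustration.
public class SupportConfidence {

    // Number of transactions that contain every item of the given set.
    static int countContaining(List<Set<String>> db, Set<String> items) {
        int n = 0;
        for (Set<String> t : db) {
            if (t.containsAll(items)) {
                n++;
            }
        }
        return n;
    }

    static double support(List<Set<String>> db, Set<String> x, Set<String> y) {
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return 100.0 * countContaining(db, union) / db.size();
    }

    static double confidence(List<Set<String>> db, Set<String> x, Set<String> y) {
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return 100.0 * countContaining(db, union) / countContaining(db, x);
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
                Set.of("DBMS", "DWM"),
                Set.of("DBMS", "DWM", "AI"),
                Set.of("DBMS", "AI"),
                Set.of("DWM", "AI"));
        // Rule {DBMS} -> {DWM}: X ∪ Y appears in 2 of 4 transactions,
        // and X appears in 3 transactions.
        System.out.println(support(db, Set.of("DBMS"), Set.of("DWM")));    // 50.0
        System.out.println(confidence(db, Set.of("DBMS"), Set.of("DWM"))); // approximately 66.67
    }
}
```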
The Apriori association rule algorithm is given below:
Purpose:
To find the subsets that are common to at least a minimum number C (the support threshold) of the item sets.
Input:
Database of transactions D = (t1, t2, ..., tn)
Support threshold
Confidence threshold
Output:
Association rules satisfying the support and confidence thresholds
Method:
L1 = frequent item sets of size 1 in I
k = 1;
Repeat
    k = k + 1;
    Ck = candidates generated from Lk-1
    Remove the infrequent k-item sets from Ck to obtain Lk, i.e. any k-item set
    that is not frequent cannot be a subset of a frequent (k+1)-item set
Until Lk is empty
Return Uk Lk
Chapter 7
Design and Implementation:
The code is implemented in the Java programming language and uses database
connectivity to a MySQL database.
Packages Used:
A package in Java can be categorized in two forms: built-in packages and user-defined packages. There are many built-in packages, such as java.lang, java.awt, javax.swing, java.net, java.io, java.util, and java.sql. A Java package is used to categorize classes and interfaces so that they can be easily maintained.
This project uses the following built-in packages:
1. java.util.*
The java.util package contains the collection framework, collection classes, classes related to date and time, the event model, internationalization, and miscellaneous utility classes. Importing this package gives access to all of these classes and methods.
2. java.sql.*
The java.sql package provides the classes and interfaces for accessing and processing data stored in a relational database, including the JDBC classes used in this project.
Classes used:
1. Class Apriori
2. Class Tuple
What is a class?
A class in Java is a template that describes the data and behavior of its objects. A class can contain:
- Fields
- Methods
- Constructors
- Blocks
- Nested classes and interfaces
The general syntax is:
class <class_name> {
    field;
    method;
}
A variable which is created inside the class but outside the method is known
as an instance variable. Instance variable doesn't get memory at compile
time. It gets memory at runtime when an object or instance is created. That
is why it is known as an instance variable.
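A minimal sketch of an instance variable follows; the Student class and its field are illustrative only, not part of the project code.

```java
// Minimal illustration of an instance variable: each object gets its
// own copy of the field, allocated at runtime when the object is
// created, not at compile time.
public class Student {

    String name; // instance variable: one copy per object

    Student(String name) {
        this.name = name;
    }

    public static void main(String[] args) {
        Student s1 = new Student("s1");
        Student s2 = new Student("s2");
        // Each object holds its own value of the instance variable.
        System.out.println(s1.name + " " + s2.name); // s1 s2
    }
}
```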
What is JDBC?
JDBC stands for Java Database Connectivity, which is a standard Java API
for database-independent connectivity between the Java programming
language and a wide range of databases.
The JDBC library includes APIs for the tasks commonly associated with
database usage: making a connection to a database, creating SQL
statements, executing queries, and viewing and modifying the resulting
records. JDBC can be used from different kinds of Java executables:
- Java Applications
- Java Applets
- Java Servlets
All of these executables are able to use a JDBC driver to access a
database and take advantage of the stored data.
Pre-Requisite
Before moving further, you need a good understanding of Java
programming and of SQL databases.
JDBC Architecture
The JDBC API supports both two-tier and three-tier processing models for
database access. In general, its core consists of the following interfaces:
Driver: This interface handles the communications with the database server.
You will rarely interact with Driver objects directly. Instead, you use the
DriverManager class, which manages objects of this type and abstracts the
details associated with working with Driver objects.
Connection: This interface provides all methods for contacting a database. The
Connection object represents the communication context, i.e., all communication
with the database is through the Connection object only.
Statement: You use objects created from this interface to submit the SQL
statements to the database. Some derived interfaces accept parameters in
addition to executing stored procedures.
ResultSet: These objects hold data retrieved from a database after you execute
an SQL query using Statement objects. It acts as an iterator to allow you to
move through its data.
import java.sql.*;
import java.util.*;

// An item set together with its support count.
class Tuple {
    Set<Integer> itemset;
    int support;

    Tuple() { support = -1; }

    Tuple(Set<Integer> s) { itemset = s; support = -1; }

    Tuple(Set<Integer> s, int i) { itemset = s; support = i; }
}

class Apriori {
    static Set<Tuple> c = new HashSet<>();   // candidate item sets Ck
    static Set<Tuple> l = new HashSet<>();   // frequent item sets Lk
    static int[][] d;                        // transaction database
    static float min_support;                // minimum support count

    public static void main(String[] args) throws Exception {
        getDatabase();
        Scanner scan = new Scanner(System.in);
        System.out.print("Enter minimum support: ");
        min_support = scan.nextFloat();

        // C1: every distinct item in the database is a candidate 1-item set.
        Set<Integer> items = new TreeSet<>();
        for (int[] row : d)
            for (int item : row)
                items.add(item);
        for (Integer item : items) {
            Set<Integer> s = new HashSet<>();
            s.add(item);
            c.add(new Tuple(s));
        }
        prune();
        generateFrequentItemsets();
    }

    // Count the transactions that contain every item of s.
    static int countSupport(Set<Integer> s) {
        int support = 0;
        for (int[] row : d) {
            int count = 0;
            for (Integer element : s) {
                boolean containsElement = false;
                for (int k = 0; k < row.length; k++) {
                    if (element == row[k]) {
                        containsElement = true;
                        count++;
                        break;
                    }
                }
                if (!containsElement) break;
            }
            if (count == s.size()) support++;
        }
        return support;
    }

    // Keep only the candidates whose support reaches the threshold.
    static void prune() {
        l.clear();
        for (Tuple t : c) {
            t.support = countSupport(t.itemset);
            if (t.support >= min_support) l.add(t);
        }
        System.out.println("-+- L -+-");
        for (Tuple t : l)
            System.out.println(t.itemset + " : " + t.support);
    }

    // Level-wise search: join Lk with itself to build C(k+1), prune, repeat.
    static void generateFrequentItemsets() {
        int size = 1;
        boolean toBeContinued = true;
        while (toBeContinued) {
            Set<Set<Integer>> candidate_set = new HashSet<>();
            for (Tuple t1 : l)
                for (Tuple t2 : l) {
                    Set<Integer> temp = new HashSet<>(t1.itemset);
                    temp.addAll(t2.itemset);
                    // Keep only unions that are exactly one item larger.
                    if (temp.size() == size + 1) candidate_set.add(temp);
                }
            c.clear();
            for (Set<Integer> s : candidate_set) c.add(new Tuple(s));
            prune();
            if (l.size() <= 1) toBeContinued = false;
            size++;
        }
    }

    // Load the transactions from the DWM database into d[][].
    // The table and column layout are assumed here: one row per
    // (transaction number, item) pair; the original query is not shown.
    static void getDatabase() throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/DWM", "root", "root");
        Statement s = con.createStatement();
        ResultSet rs = s.executeQuery("SELECT * FROM transactions");
        Map<Integer, List<Integer>> m = new TreeMap<>();
        while (rs.next()) {
            int list_no = rs.getInt(1);
            int object = rs.getInt(2);
            List<Integer> temp = m.get(list_no);
            if (temp == null) temp = new ArrayList<>();
            temp.add(object);
            m.put(list_no, temp);
        }
        con.close();
        d = new int[m.size()][];
        int count = 0;
        for (List<Integer> temp : m.values()) {
            d[count] = new int[temp.size()];
            for (int i = 0; i < temp.size(); i++)
                d[count][i] = temp.get(i);
            count++;
        }
    }
}
Chapter 9
Conclusion and future scope:
Chapter 10
References: