Text Mining Techniques and Applications

Text Based Information Retrieval - Text Mining
PKB - Antonie
Background
Human dificults to process huge information Computer can do better with matemathics
why dont also use computer to process huge information?
A Large text to find:

Terrorist attack on 1995? Terrorist movement and bomb relation?
Relates to Information Retreival, Data Mining and Text Mining
Terminology
Data Mining A step in the knowledge discovery process consisting of particular algorithms (methods), produces a particular enumeration of patterns (models) over the data. Data Mining is a process of discovering advantageous patterns in data. Knowledge Discovery Process The process of using data mining methods (algorithms) to extract (identify) what is knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
What kind of data in Data Mining?

Relational Databases Data Warehouses Transactional Databases Advanced Database Systems
Object-Relational Multimedia Text Heterogeneous and Distributed WWW
Data Mining Application: Market analysis Risk analysis and management Fraud detection and detection of unusual patterns (outliers) Text mining (news group, email, documents) and Web mining Stream data mining
Knowledge Discovery
Required effort for each KDD Step

Arrows indicate the direction we hope the effort should go.
What Is Text Mining?

The objective of Text Mining is to exploit information contained in textual documents in various ways, including discovery of patterns and trends in data, associations among entities, predictive rules, etc. (Grobelnik et al., 2001) Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known. (Hearst, 1999) The non trivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data. An exploration and analysis of textual (natural-language) data by (naturalautomatic and semi automatic means to discover new knowledge.
Text Mining (2)

What is previously unknown information ?
Strict definition
Information that not even the writer knows.
Lenient (lunak) definition

Rediscover the information that the author encoded in the text e.g., Automatically extracting a products name from a web-page.
Text Mining Methods

Information Retrieval
Indexing and retrieval of textual documents
Information Extraction
Extraction of partial knowledge in the text
Web Mining
Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering
Generating collections of similar text documents
Text Mining Application

Email: Spam filtering News Feeds: Discover what is interesting Medical: Identify relationships and link information from different medical fields Marketing: Discover distinct groups of potential buyers and make suggestions for other products Industry: Identifying groups of competitors web pages Job Seeking: Identify parameters in searching for jobs
Information Retrieval (1)

Given:
A source of textual documents A well defined limited query (text based)
Find:
Sentences with relevant information Extract the relevant information and ignore non-relevant information (important!) Link related information and output in a predetermined format
Example: news stories, e-mails, web pages, photograph, music, statistical data, biomedical data, etc. Information items can be in the form of text, image, video, audio, numbers, etc.
Information Retrieval (2) 2 basic information retrieval (IR) process:

Browsing or navigation system
User skims document collection by jumping from one document to the other via hypertext or hypermedia links until relevant document found
Classical IR system: question answering system

Query: question in natural language Answer: directly extracted from text of document collection
Text Based Information Retrieval:

Information item (document) :
Text format (written/spoken) or has textual description
Information need (query):

Usually in text format
Classical IR System Process
Intelligent Information Retrieval

meaning of words
Synonyms buy / purchase Ambiguity bat (baseball vs. mammal)
order of words in the query

hot dog stand in the amusement park hot amusement stand in the dog park
Why Mine the Web?

Enormous wealth of textual information on the Web.
Book/CD/Video stores (e.g., Amazon) Restaurant information (e.g., Zagats) Car prices (e.g., Carpoint)
Lots of data on user access patterns

Web logs contain sequence of URLs accessed by users
Possible to retrieve previously unknown information

People who ski also frequently break their leg. Restaurants that serve sea food in California are likely to be outside San-Francisco
Mining the Web

Web
Spider Documents source
Query
IR / IE System
Ranked Documents
1. Doc1 2. Doc2 3. Doc3 . .
What is Web Clustering ?

Given:
A source of textual documents Similarity measure
e.g., how many words are common in these documents Find:
Documents source
Similarity measure
Clustering System
Doc Doc Doc Doc Doc Doc Doc
Several clusters of documents that are relevant to each other
Do Doc
Doc c
Text characteristics
Large textual data base
Efficiency consideration
over 2,000,000,000 web pages almost all publications are also in electronic form
High dimensionality (Sparse input)

Consider each word/phrase as a dimension
Dependency
relevant information is a complex conjunction of words/phrases
e.g., Document categorization.Pronoun disambiguation
Text characteristics
Ambiguity
Word ambiguity
Pronouns (he, she ) buy, purchase
Semantic ambiguity
The king saw the rabbit with his glasses. (? meanings)
Noisy data
Example: Spelling mistakes
Not well structured text

Chat rooms
r u available ? Hey whazzzzzz up
Speech
Text mining process

Text preprocessing
Syntactic/Semantic text analysis
Features Generation
Bag of words
Features Selection
Simple counting Statistics
Text/Data Mining
ClassificationSupervised learning ClusteringUnsupervised learning
Analyzing results
Syntactic / Semantic text analysis

Part Of Speech (pos) tagging
Find the corresponding pos for each word e.g., John (noun) gave (verb) the (det) ball (noun)
Word sense disambiguation

Context based or proximity based Very accurate
Parsing
Generates a parse tree (graph) for each sentence Each sentence is a stand alone graph
Feature Generation: Bag of words

Text document is represented by the words it contains (and their occurrences)
e.g., Lord of the rings p {the, Lord, rings, of} Highly efficient Makes learning far simpler and easier Order of words is not that important for certain applications
Stemming: identifies a word by its root

Reduce dimensionality e.g., flying, flew p fly Use Porter Algorithm
Stop words: The most common words are unlikely to help text mining
e.g., the, a, an, you
Feature selection
Reduce dimensionality
Learners have difficulty addressing tasks with high dimensionality
Irrelevant features
Not all features help!
e.g., the existence of a noun in a news article is unlikely to help classify it as politics or sport
Use Weightening
Given: a collection of labeled records (training set training set)

Each record contains a set of features (attributes and attributes), label) the true class (label
Text Mining: Classification definition
Find: a model for the class as a function of the values of the features Goal: previously unseen records should be assigned a class as accurately as possible
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it
Text Mining: Clustering definition

Given: a set of documents and a similarity measure among documents Find: clusters such that:
Documents in one cluster are more similar to one another Documents in separate clusters are less similar to one another
Goal:
Finding a correct set of documents
Similarity Measures:
Euclidean Distance if attributes are continuous Other Problem-specific Measures e.g., how many words are common in these documents
Supervised vs. Unsupervised Learning

Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set
Unsupervised learning (clustering)

The class labels of training data is unknown Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
Correct classification: The known label of test sample is identical with the class result from the classification model Accuracy ratio: the percentage of test set samples that are correctly classified by the model A distance measure between classes can be used
e.g., classifying football document as a basketball document is not as bad as classifying it as crime.
Evaluation:What Is Good Classification?
Evaluation: What Is Good Clustering?

Good clustering method: produce high quality clusters with . . .
high intra-class similarity intrainter low inter-class similarity
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Text Classification: An Example

Ex# An English football fan During a game in Italy England has been beating France Italian football fans were cheering An average USA salesman earns 75K The game in London was horrific Manchester city is likely to win the championship Rome is taking the lead in the football league Hooligan Yes Yes A Danish football fan Yes No No Yes Yes Yes
1 0
1 2 3 4 5 6 7 8
1 0
Hooligan ? ?
Turkey is playing vs. France. The Turkish fans
Test Set
Training Set
Learn Classifier
Model
Decision Tree: A Text Example

Ex# An English football fan During a game in Italy England has been beating France Italian football fans were cheering An average USA salesman earns 75K The game in London was horrific Manchester city is likely to win the championship Rome is taking the lead in the football league Hooligan
Splitting Attributes
English Yes Yes No MarSt Single, Divorced Income > 80K NO < 80K YES Married NO
1 2 3 4 5 6 7 8
10
Yes Yes Yes No No Yes Yes Yes
The splitting attribute at a node is determined based on a specific Attribute selection algorithm
Classification by DT Induction
Decision tree
A flow-chart-like tree structure Internal node denotes a test on an attribute Branch represents an outcome of the test Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases:

Tree construction Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample

Test the attribute of the sample against the decision tree
Summary
Text is tricky to process, but ok results are easily achieved There exist several text mining systems
e.g., D2K - Data to Knowledge http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
Additional Intelligence can be integrated with text mining

One may play with any phase of the text mining process
Summary
There are many other scientific and statistical text mining methods developed but not covered in this talk.
http://www.cs.utexas.edu/users/pebronia/text-mining/ http://filebox.vt.edu/users/wfan/text_mining.html
Also, it is important to study theoretical foundations of data mining.

Data Mining Concepts and Techniques / J.Han & M.Kamber Machine Learning, / T.Mitchell

Text Mining Techniques and Applications

Загружено:

Сведения о документе

Исходное описание:

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Text Mining Techniques and Applications

Загружено:

Авторское право:

Доступные форматы

Text Based Information Retrieval - Text Mining

A Large text to find:

Relates to Information Retreival, Data Mining and Text Mining

What kind of data in Data Mining?

Required effort for each KDD Step

What Is Text Mining?

Text Mining (2)

Lenient (lunak) definition

Text Mining Methods

Text Mining Application

Information Retrieval (1)

Information Retrieval (2) 2 basic information retrieval (IR) process:

Classical IR system: question answering system

Text Based Information Retrieval:

Information need (query):

Classical IR System Process

Intelligent Information Retrieval

order of words in the query

Why Mine the Web?

Lots of data on user access patterns

Possible to retrieve previously unknown information

Mining the Web

1. Doc1 2. Doc2 3. Doc3 . .

What is Web Clustering ?

Doc Doc Doc Doc Doc Doc Doc

Several clusters of documents that are relevant to each other

High dimensionality (Sparse input)

Not well structured text

Text mining process

Syntactic / Semantic text analysis

Word sense disambiguation

Feature Generation: Bag of words

Stemming: identifies a word by its root

Given: a collection of labeled records (training set training set)

Text Mining: Classification definition

Text Mining: Clustering definition

Supervised vs. Unsupervised Learning

Unsupervised learning (clustering)

Evaluation:What Is Good Classification?

Evaluation: What Is Good Clustering?

Text Classification: An Example

Turkey is playing vs. France. The Turkish fans

Decision Tree: A Text Example

Yes Yes Yes No No Yes Yes Yes

Decision tree generation consists of two phases:

Use of decision tree: Classifying an unknown sample

Additional Intelligence can be integrated with text mining

Also, it is important to study theoretical foundations of data mining.

Вам также может понравиться