PKB - Antonie
Background
Humans have difficulty processing huge amounts of information.
Computers can do better, using mathematics.
Why not also use computers to process huge amounts of information?
Terminology
Data Mining
A step in the knowledge discovery process that applies particular algorithms (methods) to produce a particular enumeration of patterns (models) over the data.
Data mining is a process of discovering advantageous patterns in data.
Knowledge Discovery Process
The process of using data mining methods (algorithms) to extract (identify) what counts as knowledge according to specified measures and thresholds, using a database along with any necessary preprocessing or transformations.
Data Mining Applications:
Market analysis
Risk analysis and management
Fraud detection and detection of unusual patterns (outliers)
Text mining (newsgroups, e-mail, documents) and Web mining
Stream data mining
Knowledge Discovery
Information Extraction
Extraction of partial knowledge in the text
Web Mining
Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering
Generating collections of similar text documents
Find:
Sentences with relevant information
Extract the relevant information and ignore non-relevant information (important!)
Link related information and output it in a predetermined format
Example: news stories, e-mails, web pages, photographs, music, statistical data, biomedical data, etc.
Information items can be in the form of text, image, video, audio, numbers, etc.
[Diagram: Query and Documents source -> IR / IE System -> Ranked Documents]
[Diagram: Documents source and Similarity measure -> Clustering System -> groups of documents]
Text characteristics
Large textual data base
Efficiency consideration
Over 2,000,000,000 web pages
Almost all publications are also in electronic form
Dependency
Relevant information is a complex conjunction of words/phrases
e.g., document categorization, pronoun disambiguation
Text characteristics
Ambiguity
Word ambiguity
Pronouns (he, she)
Synonyms (buy, purchase)
Semantic ambiguity
The king saw the rabbit with his glasses. (? meanings)
Noisy data
Example: Spelling mistakes
Speech
Features Generation
Bag of words
Features Selection
Simple counting
Statistics
Text/Data Mining
Classification: supervised learning
Clustering: unsupervised learning
Analyzing results
Parsing
Generates a parse tree (graph) for each sentence
Each sentence is a stand-alone graph
Stop words: The most common words are unlikely to help text mining
e.g., the, a, an, you
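The bag-of-words and stop-word steps above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular system: the function name `bag_of_words`, the regular-expression tokenizer, and the stop-word set (limited to the four words listed on the slide) are all assumptions.

```python
import re
from collections import Counter

# Illustrative stop-word list: only the four words named on the slide.
# Real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "you"}

def bag_of_words(text, stop_words=STOP_WORDS):
    """Count the words in `text`, ignoring case, punctuation and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

bag = bag_of_words("The king saw the rabbit, and the rabbit saw you.")
print(bag)
```

Word order is discarded on purpose: each document becomes a set of word counts, which is what the later feature selection and mining steps operate on.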
Feature selection
Reduce dimensionality
Learners have difficulty addressing tasks with high dimensionality
Irrelevant features
Not all features help!
e.g., the existence of a noun in a news article is unlikely to help classify it as politics or sport
Use weighting
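The slide does not say which weighting scheme is meant. One common choice is TF-IDF, sketched below as an assumption: a term that occurs in every document gets weight zero, which matches the point that not all features help. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document.

    weight = (term count / document length) * log(number of docs / docs containing term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["goal", "match", "team"], ["vote", "party", "team"]]
weights = tf_idf(docs)
print(weights[0])
```

Here "team" appears in both toy documents, so its weight is zero, while "goal" and "vote" keep a positive weight and remain useful for telling the documents apart.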
Find: a model for the class as a function of the values of the features
Goal: previously unseen records should be assigned a class as accurately as possible
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
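The division into training and test sets can be sketched as below. The function name `train_test_split`, the 25% test fraction, the fixed seed, and the toy records are assumptions chosen for illustration only.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled records: (document id, class label).
records = [(f"doc{i}", "sport" if i % 2 else "politics") for i in range(8)]
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is built from `train` only; `test` is held back so that accuracy is measured on records the model has not seen.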
Goal:
Finding a correct set of documents
Similarity Measures:
Euclidean distance, if attributes are continuous
Other problem-specific measures
e.g., how many words these documents have in common
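Both measures named above are short to write down. The sketch below is illustrative; the function names and the whitespace tokenization used for counting common words are assumptions.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length vectors of continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def common_words(doc_a, doc_b):
    """Problem-specific similarity: number of distinct words the two documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

print(euclidean_distance([0, 0], [3, 4]))
print(common_words("the team won the match", "the match was lost"))
```

Note that the first is a distance (smaller means more similar) and the second is a similarity (larger means more similar), so a clustering system has to treat them in opposite directions.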
Correct classification: the known label of a test sample is identical to the class assigned by the classification model
Accuracy ratio: the percentage of test set samples that are correctly classified by the model
A distance measure between classes can be used
e.g., classifying a football document as a basketball document is not as bad as classifying it as crime.
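A small sketch of the two evaluation ideas above: plain accuracy, and an error score that uses a distance between classes. The distance values in `CLASS_DISTANCE` are invented for illustration, as are the function names and the toy labels.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical distances between classes: the two sports are close to each
# other and both are far from crime.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 1,
    frozenset(["football", "crime"]): 3,
    frozenset(["basketball", "crime"]): 3,
}

def mean_error_distance(true_labels, predicted_labels, distance=CLASS_DISTANCE):
    """Average class distance between truth and prediction (0 for a correct one)."""
    total = sum(
        0 if t == p else distance[frozenset([t, p])]
        for t, p in zip(true_labels, predicted_labels)
    )
    return total / len(true_labels)

truth = ["football", "football", "crime", "basketball"]
pred_a = ["football", "basketball", "crime", "basketball"]  # one near miss
pred_b = ["football", "crime", "crime", "basketball"]       # one far miss

print(accuracy(truth, pred_a), accuracy(truth, pred_b))
print(mean_error_distance(truth, pred_a), mean_error_distance(truth, pred_b))
```

Both predictions have the same accuracy, but the distance-based score separates them: confusing football with basketball costs less than confusing football with crime.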
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
[Figure: a Training Set is used to Learn Classifier, which produces a Model; the Model is applied to the Test Set, where the class (Hooligan) is unknown (?)]
Splitting Attributes
[Figure: decision tree over the splitting attributes]
English
  Yes: Yes
  No: MarSt
    Single, Divorced: Income
      > 80K: NO
      < 80K: YES
    Married: NO
The splitting attribute at a node is determined based on a specific Attribute selection algorithm
Classification by DT Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
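The structure described above (internal nodes test an attribute, branches are outcomes, leaves are class labels) can be represented directly as nested dictionaries. The tree below is one possible reading of the slide's English / MarSt / Income example and should be treated as hypothetical; in particular, sending an income of exactly 80K down the "<= 80K" branch is an assumption, and `classify` is an illustrative name. It only applies a given tree and does not perform induction.

```python
# Internal node: {"test": function(record) -> outcome, "branches": {outcome: subtree}}
# Leaf: a plain string holding the class label.
INCOME_NODE = {
    "test": lambda r: "> 80K" if r["Income"] > 80_000 else "<= 80K",
    "branches": {"> 80K": "NO", "<= 80K": "YES"},
}
MARST_NODE = {
    "test": lambda r: r["MarSt"],
    "branches": {"Single": INCOME_NODE, "Divorced": INCOME_NODE, "Married": "NO"},
}
TREE = {
    "test": lambda r: r["English"],
    "branches": {"Yes": "YES", "No": MARST_NODE},
}

def classify(node, record):
    """Follow the branch chosen by each node's test until a leaf label is reached."""
    while isinstance(node, dict):
        node = node["branches"][node["test"](record)]
    return node

record = {"English": "No", "MarSt": "Single", "Income": 60_000}
print(classify(TREE, record))
```

Decision tree induction would build such a structure from the training set by choosing the splitting attribute at each node with an attribute selection algorithm.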
Summary
Text is tricky to process, but acceptable results are easily achieved
There exist several text mining systems
e.g., D2K - Data to Knowledge http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
Summary
There are many other scientific and statistical text mining methods developed but not covered in this talk.
http://www.cs.utexas.edu/users/pebronia/text-mining/
http://filebox.vt.edu/users/wfan/text_mining.html