
Natural Language Processing

Freeware Tools

ALE (Attribute Logic Engine)
o Description: ALE is an environment that integrates phrase structure parsing and constraint logic programming with typed feature structures. It can handle several formalisms including HPSG, PATR-II, DCG grammars, and Prolog, Prolog-II, and LOGIN programs. Sample grammars are provided with the distribution.
o Platforms: Platforms with SICStus Prolog or Quintus Prolog.
o Source: The latest version is available from the CMU Artificial Intelligence Repository.
o Reference: Additional information is available from the CMU Artificial Intelligence Repository.
o Contact: carp@lcl.cmu.edu.

CGPARSER
o Description: CGParser is a linear parser of Conceptual Graphs. It was written using the YACC compiler generator utility. The distribution includes examples of various levels of complexity for testing purposes.
o Platforms: UNIX.
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: hdp@nmsu.edu.

CHARON
o Description: CHARON is an environment for the development and testing of LFG grammars. It integrates parsers, semantic components and the generator, and provides a user interface for the compilation and testing of LFG grammars.
o Platforms: UNIX.
o Source: The latest version is available from ftp.ims.uni-stuttgart.de.
o Reference: Additional information is available from ftp.ims.uni-stuttgart.de.

Conc
o Description: Conc is used for producing concordances of texts. It also produces a frequency index for each word in the text. It displays the original text, the concordance, and the index, each in synchronized windows.
o Platforms: Mac.
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: antworth@am.dallas.sil.org.

ELIZA
o Description: This is the classic NLP program by Weizenbaum. It allows for a simple first assignment in NLP. Students are asked to develop a new knowledge base for some domain other than the classic psychoanalyst-patient one.
o Platforms: PC, Mac, VAX, UNIX and others.
o Source: The latest version is available from the CMU Artificial Intelligence Repository.
o Reference: N/A.
o Contact: N/A.

ENGLEX
o Description: Englex is a lexicon for morphological analysis of English text. It is intended for use with PC-KIMMO (or programs that use the PC-KIMMO parser, such as KTEXT). Combined with software, it facilitates production of sets of records of the morphological constituents in English texts.
o Platforms: PC, Mac, and UNIX.
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: evan@txsil.lonestar.org.

FLEX (Fast Lexical Analyzer Generator)
o Description: FLEX is a generator of lexical pattern recognizers. It is an extension to the UNIX LEX lexical analyzer utility.
o Platforms: UNIX.
o Source: The latest version is available from ftp.ee.lbl.gov.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: vern@ee.lbl.gov.

FONOL
o Description: Fonol is a programming language for experimenting with Transformation-Grammar-style phonological rules. It also incorporates input and output filters/conditions. It is intended for both phonology students and researchers in that it facilitates understanding of phonological rule fundamentals and helps manage large, complex bodies of phonological rules.
o Platforms: PC (and platforms with Turbo Pascal).
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: brandon@gamma.is.tcu.edu.

Grammar Workbench
o Description: The Grammar Workbench is an environment for the development and analysis of grammars. It is geared towards the AGFL (Affix Grammars over a Finite Lattice) formalism.
o Platforms: PC and Sun.
o Source: The latest version is available from hades.cs.kun.nl.
o Reference: Additional information is available from hades.cs.kun.nl.
o Contact: agfl@cs.kun.nl.

KGEN
o Description: KGEN is a program for building morphological parsers for NLP systems. It is an auxiliary program for PC-KIMMO.
o Platforms: PC.
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: evan@txsil.lonestar.org.

LINK
o Description: LINK is a parser for Link Grammar, a context-free formalism for the description of natural language. It also includes a Link Grammar for English.
o Platforms: UNIX.
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: sleator@cs.cmu.edu.

Lotec
o Description: The Lotec Speech Recognition Package is a simple set of libraries and tools for building single-speaker, small-vocabulary, low-quality continuous speech recognition applications.
o Platforms: Sun.
o Source: The latest version is available from ftp.sanpo.t.u-tokyo.ac.jp.
o Reference: Additional information is available from The Natural Language Software Registry.
o Contact: nigel@sanpo.t.u-tokyo.ac.jp.

LT Thistle
o Description: LT Thistle is a parameterizable display engine and editor for diagrams, allowing the inclusion of interactive diagrams within Web pages. Originally designed for use with linguistic diagrams, we envisage widespread application within other areas involving the presentation and interpretation of highly structured information. It is available free of charge for non-commercial purposes.
o Platforms: Java.
o Source: http://www.ltg.ed.ac.uk/software/thistle/demos/index.html
o Reference: http://www.ltg.ed.ac.uk/software/thistle/index.html
o Contact: Jo Calder, J.Calder@ed.ac.uk

OGI Speech Tools
o Description: The OGI Speech Tools are a set of speech data manipulation tools including an X Windows display tool (Lyre) for displaying data in a time-synchronous fashion, a Neural Network training package, a set of C library routines (LIBNSPEECH) for speech data manipulation, a set of sound-file format conversion utilities, and a set of Perl scripts for automating the use of the above tools.
o Platforms: UNIX.
o Source: N/A.
o Reference: N/A.
o Contact: tools@cse.ogi.edu.

PC-KIMMO
o Description: PC-KIMMO is a popular program among computational linguists, descriptive linguists, and NLP system developers. It generates and/or recognizes words using a two-level model of word structure, i.e., a lexical-level form and a surface-level form.
o Platforms: PC, Mac, UNIX.
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research, and PC-KIMMO: A Two-Level Processor for Morphological Analysis by Evan L. Antworth, published by the Summer Institute of Linguistics (1990).
o Contact: evan@txsil.lonestar.org.

SAX (Sequential Analyzer for syntaX and semantics)
o Description: SAX is a syntactic analyzer for Definite Clause Grammar. It employs a bottom-up and breadth-first parsing algorithm. The distribution includes a Japanese grammar and some sample Japanese data.
o Platforms: Platforms with SICStus Prolog.
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: N/A.

SYNTACTICA
o Description: SYNTACTICA is a system for grammar development with a simple graphical user interface. It is intended for use in introductory syntax classes, or introductory linguistics classes with a syntax component.
o Platforms: NextStep.
o Source: The latest version is available from the Consortium for Lexical Research.
o Reference: Additional information is available from the Consortium for Lexical Research.
o Contact: rlarson@semlab1.sbs.sunysb.edu.

Commercial Tools

Alvey Natural Language Tools (ANLT)
o Description: The Alvey Natural Language Tools is a set of tools for use in natural language processing research. These include a morphological analyzer, parsers, a grammar and a lexicon. They can be used independently or with a grammar development environment to form a complete system for the morphological, syntactic and semantic analysis of a considerable subset of English.
o Platforms: UNIX.
o Source: N/A.
o Reference: Additional information is available from ftp.cl.cam.ac.uk.
o Contact: N/A.

ALEP
o Description: The Advanced Language Engineering Platform (ALEP) is a versatile and flexible general-purpose NLP platform. It is independent of formalism, incorporates a number of standards such as SGML, ISO character sets, and MOTIF, and comes with a graphical user interface, extensive on-line documentation, and various tools for text handling, linguistic processing, and debugging.
o Platforms: Platforms supporting Prolog by BIM 4.0.5, ClauseDB 2.0, GNU Emacs 19.19, OSF/MOTIF 1.2.
o Source: Available on tape through contact below.
o Reference: Additional information is available from The Natural Language Software Registry.
o Contact: Mr. N. K. Simpkins, Cray Systems, ALEP Support, 11b Bvd Joseph II, LUXEMBOURG, L1840 LUXEMBOURG.

CSRE -- Canadian Speech Research Environment
o Description: CSRE is designed to support speech research by providing a powerful, low-cost facility using mass-produced and widely available hardware. Functions include speech capture, editing, and replay, spectral analysis procedures, 3D displays, parameter extraction/tracking and tools to automate measurement and support data logging. CSRE components include a speech editor, a time-domain analyzer, a spectral analyzer, a formant tracker, a pitch tracker, a speech synthesizer, an acoustic signal synthesizer, and an experiment generator/controller.
o Platforms: PC.
o Source: N/A.
o Reference: Additional information is available from The Natural Language Software Registry.
o Contact: Donald G. Jamieson, University of Western Ontario, Hearing Health Care Research Unit, Communicative Disorders, London, Ontario N6G1H1, Canada.

ESPS - Entropic Signal Processing System
o Description: ESPS is a set of signal and speech processing utilities. Their functionality includes spectrum analysis, time series manipulation, pattern classification, file manipulation, plotting, speech processing, data I/O and conversion, filter design, and filtering.
o Platforms: SUN, SGI, HP 9000/700, or DEC 3100/5000 and ALPHA computers running UNIX.
o Source: N/A.
o Reference: Additional information is available from The Natural Language Software Registry.
o Contact: Ken Nelson, Director of Sales and Marketing, Entropic Research Laboratory, 600 Penn. Ave. S.E., Suite 202, Washington, D.C., USA 20003.

Natural Language (TM)
o Description: Natural Language (TM) is an extensible natural language interface to relational SQL databases. It employs a parser, semantic interface, natural language generator, and a deductive system that interprets English questions in the context of the specific applications. Its extension mechanism, Intelligent Connector (ICon), may be used to customize Natural Language to specific applications.
o Platforms: MS-Windows, VMS, and UNIX.
o Source: N/A.
o Reference: Additional information is available from The Natural Language Software Registry.
o Contact: Cilla DeVries, Natural Language Inc., Marketing Department, 1125 Atlantic Avenue, Alameda, CA 94501, U.S.A.

NL Builder (TM)
o Description: NL Builder (TM) may be used to develop NLP applications or experiment with various linguistic components. It consists of a tokenizer, a dictionary, a morphological analyzer, a parser, a semantic interpreter, a semantic network KRL, lexical acquisition tools, "C" hooks, and a debugger.
o Platforms: PC, Mac, Apollo, Sun, VAX, NeXT, and others.
o Source: N/A.
o Reference: Additional information is available from The Natural Language Software Registry.
o Contact: Edwin R. Addison, Synchronetics, Inc., 301 N. Front St., Baltimore, MD 21202, U.S.A.

VisualText (R)
o Description: VisualText is an integrated development environment (IDE) for NLP. It constructs multipass text analyzers that combine pattern-based, grammar-based, and additional paradigms. It features a rapid prototyping GUI, the NLP++ (R) programming language (interpreted/compiled), and the Conceptual Grammar (TM) hierarchical knowledge base management system (KBMS). It has an open architecture integrated with C++ and ODBC database connectivity. Applications include information extraction, natural language generation (NLG), categorization, and summarization. It comes with the TAIParse general analyzer and others.
o Platforms: VisualText runs on Windows PC; analyzers run on Windows PC and Linux.
o Source: N/A.
o Reference: http://www.textanalysis.com.
o Contact: Maureen McHenry, 877-235-6259 USA toll free, maureen.mchenry@textanalysis.com, Text Analysis International, Inc.

Software Tools for NLP


Software Archive

CMU Artificial Intelligence Repository
Resources Available Through CRL
SIL Computing Resources
Linguistics Tools at the University of Vaasa in Finland
Leeds University, Natural Language Processing Research Group: RESOURCES
ICOT Free Software
Netlib Repository (mirror in Japan)

General Information

Sourcebank - a search engine for programming resources.
Resources related to content analysis and text analysis - Software
Some publicly available NLP packages
SAL (Scientific Applications on Linux) Artificial Intelligence
Public Domain Generic Tools: An Overview - a paper written by Tomaz Erjavec
A collection of online interactive CL tools (Computational Linguistics Group, University of Zurich)
The LINGUIST List: Software
The Natural Language Software Registry
Language Software Helpdesk
o Frequently Asked Questions
PennTools - Computational Linguistics Resources At Penn.
Parsing Resources
Taggers online, email message containing addresses
Parsers and Taggers Information (by Steven Paul Abney)
Relator Language Processing Resources
Corpus Search Tools
Neural Networks & Statistics: Software

Tagger, Morphological Analyzer


A Perl/Tk text tagger
Conexor
Cogilex R&D inc - Makers of expert tools for natural language processing
CLAWS part-of-speech tagger
TnT - Statistical Part-of-Speech Tagging
POS tagger for Spanish
Tagging and Parsing tools
AUTASYS - A Fully Automatic English Wordclass Analysis System
TOSCA/LOB tagger
Relaxation Labelling Based Multi-Tagger
The QTAG Part of Speech Tagger
QTAG: A portable Parts of Speech Tagger
The Alvey Natural Language Tools
The XTAG Project
TreeTagger - a language independent part-of-speech tagger
Xerox Part-of-Speech Tagger
The Edinburgh/Cambridge Morphological Analyser System
Winbrill - An adaptation of Brill's tagger to Windows 95/98.
Eric Brill's Part of Speech Tagger
Software Plaza: Brill's Tagger
Morphy - An integrated tool for German morphology and statistical part-of-speech tagging.
Korean Morphological Analyzer
Natural Language Tools - Japanese morphological analyzer (JUMAN) and parser (KNP) developed by Nagao Lab. at Kyoto University, Japan.
WordSmith Tools - Wordsmith Tools is the Swiss Army knife of lexical analysis - an integrated suite of programs for looking at how words behave in texts. It is intended for linguists, language teachers, and anyone who needs to examine language.
o Mike Scott's Home Page
o Oxford University Press
A Lexical Analyzer for HTML and Basic SGML
ARIES Natural Language Tools - Lexical platform for the Spanish language.

Stemmer

Porter stemmer
Porter stemmer
Dutch Porter stemmer
IRIS stemmer
Iterated Lovins stemmer

Collocation

Xtract - Frank Smadja's Collocation Extractor.

Parser

Malaga - a system for automatic language analysis
Attribute-Logic Engine (ALE) System and Grammars - A freeware logic programming and grammar parsing system.
CG Parser - Natural deduction categorial grammar and lambda-calculus parser.
Head-Corner Parser (by Gertjan van Noord)
A basic parser written to illustrate the bottom-up parsing algorithms in Natural Language Understanding, Second Edition
Cass Partial Parser
CHILL: An empirical parser acquisition system using inductive logic programming
ISSCO Tools - Left-head-corner Island Parser Compiler, etc.
Georgetown University Natural Language Processing Parser Modularity Demo page
PC-PATR: A syntactic parser
IMS Stuttgart: The CUF Web Page - Comprehensive Unification Formalism
Apple Pie Parser - The Apple Pie Parser is a bottom-up probabilistic chart parser which finds the parse tree with the best score using a best-first search algorithm.
Link Grammar Parser

Corpus Tools

WebCorp
Concordances: Producing and Using them
XCES: Corpus Encoding Standard for XML
RST Tool - An RST (Rhetorical Structure Theory) Markup Tool.
RST Annotation Tool
Qwick - corpus browser
Linguistic Annotation - This page describes tools and formats for creating and managing linguistic annotations.
Alembic Workbench - a suite of tools for the analysis of a corpus, along with the Alembic system to enable the automatic acquisition of domain-specific tagging heuristics.
The System Quirk - Workbench for Terminology, Lexicography and Text Analysis.
Multext: Multilingual Text Tools and Corpora
XCorpus - An Environment for Managing Corpus and Multilingual Web Server
The IMS Corpus Toolbox Webpage
X Kobe Phoenix Laboratory - Corpus Wizard program.
Concordance - A program for Windows NT 4.0 and Windows 95/98 which makes wordlists, concordances, and Web Concordances from your electronic texts.
MonoConc (concordance program)
MonoConc for Windows (concordance program)
Text Analysis Computing Tools (TACT)
The Lingua Project: The World of MultiLingual Parallel Concordancing (http://prune.loria.fr/~bonhomme/lingua/) - Sentence alignment tool for multilingual corpora.
The Lingua Project: The World of MultiLingual Parallel Concordancing (http://www.loria.fr/exterieur/equipe/dialogue/lingua/)
Textual Corpora and Tools for their Exploration

Language Modeling

Maximum Entropy Modeling
Maximum Entropy Modeling Toolkit
CMU-Cambridge Statistical Language Modeling Toolkit
CMU Statistical Language Modeling Toolkit by Roni Rosenfeld
o Program
o Document
Trigger Toolkit
Simple Good-Turing Smoothing
Smoothing tools software by Joshua Goodman and Stanley Chen
Language modeling tools
Statistical Decision Trees

HMM

A HMM mini-toolkit (by Anand Venkataraman)
HMM Software see also: Exercise: Using a Hidden Markov Model
Discrete HMM Toolkit
Hidden Markov Model (HMM) Toolbox
Meta-MEME: Motif-based Hidden Markov Models of Biological Sequences

Language Identification

Ted E. Dunning's program
Gertjan van Noord's program
Doug Beeferman's program

FSA Tools

Finite State Utilities
Automata Learning from Theory to Practice
o Downloadable Software
Index to finite-state machine software, products, and projects
FSA utilities
o FSA Utilities: A Toolbox to Manipulate Finite-state Automata
Grail - a symbolic computation environment for finite-state machines, regular expressions, and other formal language theory objects.
AMoRE - A program for the computation of Automata, Monoids, and Regular Expressions.

Speech

HTK: Hidden Markov Model Toolkit
CSLU Toolkit
The Epos Speech Synthesis System
ISIP public domain speech to text system
o The ISIP Automatic Speech Recognition Toolkit
CSLU Toolkit (Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology)
Computer generation of accent marks
Spoken Natural Language Processing Group Software
CMU Error Analysis Toolkit
Audio Tools
VOICEBOX: Speech Processing Toolbox for MATLAB

Mathematical Software

NIST Guide to Available Mathematical Software

Statistics

Bayesian inference Using Gibbs Sampling
CoCo - A statistics package for analysis of associations between discrete variables.

Machine Learning

Machine Learning Toolbox (MLT)
The Machine Learning Programs Repository
The RIPPER rule learner
mFOIL - An ILP system designed to handle noisy examples.

Support Vector Machine


SVMLight
SVM package by William Noble Grundy
Kernel Machines Web Site

Information Retrieval & Filtering


Rubryx: Text Classification Program
seft - a Search Engine For Text
MG - Managing Gigabytes
Isearch - software for indexing and searching text documents.
SMART Software and test collections (Cornell University)
o see also SMART links
Doug Oard's Research Software Page - SMART Modifications
Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering
ifile - A general mail filtering system.
IR-STAT-PAK - A program to compute descriptive and analytic statistics for the TREC IR trials.
Yavi - A visual interface to textual information.
Labeled data sets for information extraction

String/Pattern Matching

Online Approximate String Matching
Strmat package (exact string matching and suffix trees)

Sentence Boundary Detector


SATZ: An Adaptive Sentence Boundary Detector
Adwait Ratnaparkhi's MXTERMINATOR

Clustering/Classification

FCLUSTER - A tool for fuzzy cluster analysis
LNKnet Pattern Classification Software
Principal Direction Divisive Partitioning
k-means clustering

WWW

w3mir - HTTP copying and mirroring tool.
HTTrack - The Web mirror utility.
HTML Conversion, Shareware and Freeware

Other Tools

TextSTAT - Simples Text Analyse Tool
WebCONC
German Morphology Browser (online service)
'mat2D' Matrix/Vector Library in C
Content Analysis Resources - for quantitative analyses of texts, transcripts, and images.
SNoW learning program
The µ-TBL Homepage - Logic Programming Tools for Transformation-Based Learning
ROOT: An Object-Oriented Data Analysis Framework
CAQDAS Networking Project - Computer Assisted Qualitative Data Analysis Software
Suffix sort
Nb - a graphical user interface for annotating the discourse structure of spoken dialogue, monologue, and text.
GATE - General Architecture for Text Engineering.
TiMBL: Tilburg Memory Based Learner
MtRecode - The Multext character translation program
Evalb - A bracket scoring program. It reports precision, recall, non-crossing and tagging accuracy for given data.
The OC1 decision tree software system
IND Version 2.0 - creation and manipulation of decision trees from data
Paai's text utilities
Shoebox 3.0 for Windows and Macintosh - A database program oriented to the needs of a field linguist's dictionary.
Teaching materials for statistical NLP by Chris Brew, Language Technology Group, Human Communication Research Centre, University of Edinburgh
Introducing environmentalism and post-fordism into NLP (NeuroTran)
Tools for Estonian Language
Dan Melamed's Page - Simulated Annealing Program, XTAG morpholyzer post-processors for English Stemming, Good-Turing Smoothing Software, 150 miscellaneous text processing tools, 75 text statistics and bitext geometry tools.
TOOLDIAG: Pattern recognition toolbox
The DN2 Home Page - DN2 is an intelligent self-relating free format database system which accepts data in human text format, and retrieves it in response to human requests, like "Where is London?"
Software Announcements
Tools for drawing and graphically editing trees
Paul Nation's vocabulary programs
syllable prediction code (a simple lisp function)
Pratt - a pattern discovery tool
XGobi - A system for multivariate data visualization.
NODElib - Neural Optimization Development Engine library

Some information retrieval tools


Michel Beigbeder -- 2006/18/09
Please, let me know if you find any error in the following information.

Evaluation

trec_eval

trec_eval: trec_eval.8.1.tar.gz, trec_eval.8.0.tar.gz, trec_eval.7.3.tar.gz, trec_eval.7.0beta, trec_eval.v3beta, trec2_eval, trec_eval_hp, trec1_eval
The software for doing IR system evaluation.

Links:

Notes on Trec Eval. Small guide on how to use trec_eval with mg.
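As a rough illustration of what IR system evaluation computes, here is a minimal Python sketch of two common measures, precision at rank k and average precision. It is not trec_eval's code and does not read its input file formats; the function names and the toy document identifiers are assumptions made for this example only.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant.

    Divides by k even when fewer than k documents were retrieved.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy example (hypothetical document ids).
ranking = ["d3", "d1", "d7", "d2"]
judged_relevant = {"d1", "d2"}
print(precision_at_k(ranking, judged_relevant, 2))
print(average_precision(ranking, judged_relevant))
```

Averaging `average_precision` over a set of queries gives the mean average precision usually reported for a run.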

3 IR tools tested or in use within the RIM team

Zettair (team site, tool site)
Previously known as lucy. Zettair is a (small) set of programs written in C for text indexing and retrieval.

Comment: The index format is very easy to understand, and it is easy to add your own weighting scheme. The straightforward programming style makes it easy to add other features (in indexing, for instance).
mg (tool site, book site)
MG is an open-source compressing, indexing and retrieval system for text, images, and textual images. It is written in C.

Development discontinued since August 1999.
Comment: The book does not help much in understanding the software, but it is a very good book on both compression and information retrieval. The software is more difficult to extend than Zettair because it makes heavy use of (complex) macros to handle the compression features. We nevertheless succeeded in making some extensions by inserting our own code at a few key points, in both the indexing and the querying phases. However, it is very difficult to write new code that accesses the index directly (again because of the complex compression mechanisms in use).
Links:

Version 1.3g in use at the New Zealand Digital Library.
Versions mg-1.3c, mg-1.3x, mg-1.3.1x, mg-1.3.2x, mg-1.3.64x from UTAH; they seem to be derived from the above version.

smart

smart (tool site)
Smart implements the basic vector model of information retrieval. It is possible to experiment with different weighting schemes. It is written in C.

Development discontinued since 1992.
Comment: Not easy to install. The configuration mechanism is difficult to understand, and the configuration process is error-prone. Some (badly) documented features do not actually work. Extensions that fit well in the vector model are not too difficult, but it is nearly impossible to add other ones.
Links: Because this software is difficult to use and its internal documentation is not good, here are some links on how to use it.

Tutorial for beginners by Hans Paijmans
A smart tutorial by Tassos Tombros (Glasgow)
Another documentation set for smart by Christian Meunier and Ghita Bouayad.
Internal principles by Jian-Yun Nie (in French)
Internal implementation (data structures, ...) by Jian-Yun Nie (in French)
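To make the vector model mentioned for smart more concrete, here is a minimal Python sketch that ranks documents against a query by cosine similarity. It is not smart's implementation: the weighting used here (raw term frequency times log(N/df)) is only one possible scheme, chosen for illustration, and all function names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (sparse vectors, idf table)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs):
    """Return (score, document index) pairs, best score first."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy collection (hypothetical data).
collection = [["apple", "banana"], ["banana", "cherry"], ["cherry", "date"]]
for score, index in rank(["apple"], collection):
    print(index, round(score, 3))
```

Experimenting with a different weighting scheme amounts to changing how the weights in `tfidf_vectors` and the query vector are computed; the cosine ranking stays the same.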

7 software packages not tested

Cheshire (tool site) (Mainly C) (Most recently modified file: 2005-01-13 in V2.41)
A Next-Generation Online Catalog and Full-Text Information Retrieval System.

DataparkSearch Engine (tool site) (C) (Most recently modified file: 2005-12-01 in V4.35)
DataparkSearch Engine is a full-featured open source web-based search engine released under the GNU General Public License and designed to organize search within a website, group of websites, intranet or local system.

Lemur (tool site) (C++)
The Lemur Toolkit for Language Modeling and Information Retrieval.

Lucene (tool site) (Java) (Most recently modified file: 2004-11-29 in V1.4.3)
Lucene is a high-performance, full-featured text search engine library written entirely in Java.

Senga (tool site) (Mainly C++)
Senga is a development group focused on information retrieval software.
o Catalog is a Perl program for creating, maintaining and displaying Yahoo!-like directories. (Last version 1.03 2001-07-11)
o The purpose of GNU mifluz is to provide a C++ library to build and query a full text inverted index. (Last release 0.23.0 2001-07-23)
o unac is a C library and command that removes accents from a string. (Last version 1.7.0 2002-09-02)
o uri is a library that analyses URIs and transforms them. (Last version 2.13 2001-07-16)
o webbase is a crawler for the Internet. It has two main functions: crawling the Web to get documents and building a full text database from these documents. (Last version 5.17.0 2001-09-10)
Development discontinued since 2002.

Terrier (tool site) (Java) (Last version 1.0.2) (Most recently modified file: 2005-03-17 in V1.0.2)
Terrier is software for the rapid development of Web, intranet and desktop search engines. More generally, it is a modular platform for the rapid development of large-scale Information Retrieval applications, providing indexing and retrieval functionalities.

Wumpus (tool site) (C++) (Most recently modified file: 2005-11-30 in V2005-11-30)
Wumpus is an information retrieval system. Its main purpose is to study issues that arise in the context of indexing dynamic text collections in multi-user environments.

Xapian (tool site) (C++) (Most recently modified file: 2005-07-15 in V 0.9.2)
Xapian is an Open Source Probabilistic Information Retrieval library, released under the GPL. It's written in C++. Features: Ranked probabilistic search, Relevance feedback, Phrase and proximity searching, Structured boolean search operators, Stemming (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish).

1 library not tested

bow

Bow (library site)
Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

BOW
Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering
Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow). The library and its front-ends were designed and written by Andrew McCallum, with some contributions from several graduate and undergraduate students. The name of the library rhymes with `low', not `cow'.

About the library


The library provides facilities for:

Recursively descending directories, finding text files.
Finding `document' boundaries when there are multiple documents per file.
Tokenizing a text file, according to several different methods.
Including N-grams among the tokens.
Mapping strings to integers and back again, very efficiently.
Building a sparse matrix of document/token counts.
Pruning vocabulary by word counts or by information gain.
Building and manipulating word vectors.
Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
Scoring queries for retrieval or classification.
Writing all data structures to disk in a compact format.
Reading the document/token matrix from disk in an efficient, sparse fashion.
Performing test/train splits, and automatic classification tests.
Operating in server mode, receiving and answering queries over a socket.
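Two of the facilities listed above, mapping strings to integers and back and building a sparse matrix of document/token counts, can be sketched in a few lines of Python. This is only an illustration of the idea, not the library's C API; the `Vocabulary` class and `count_matrix` function are hypothetical names introduced here.

```python
from collections import Counter


class Vocabulary:
    """Bidirectional mapping between token strings and integer ids."""

    def __init__(self):
        self._index = {}   # string -> id
        self._words = []   # id -> string

    def id_of(self, word):
        """Return the id of `word`, assigning the next free id if it is new."""
        if word not in self._index:
            self._index[word] = len(self._words)
            self._words.append(word)
        return self._index[word]

    def word_of(self, token_id):
        """Return the string stored under `token_id`."""
        return self._words[token_id]

    def __len__(self):
        return len(self._words)


def count_matrix(docs, vocab):
    """Build a sparse document/token count matrix.

    Each row is a dict {token id: count}; tokens that do not occur in a
    document are simply absent from its row.
    """
    return [dict(Counter(vocab.id_of(tok) for tok in tokens)) for tokens in docs]


# Toy input (hypothetical documents, already tokenized).
vocab = Vocabulary()
matrix = count_matrix([["a", "b", "a"], ["b", "c"]], vocab)
print(matrix)
print([vocab.word_of(i) for i in range(len(vocab))])
```

Storing only the non-zero counts keeps memory use proportional to the number of distinct tokens per document rather than to the vocabulary size.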

The library does not:


Have English parsing or part-of-speech tagging facilities.
Do smoothing across N-gram models.
Claim to be finished.
Have good documentation.
Claim to be bug-free.

It is known to compile on most UNIX systems, including Linux, Solaris, SunOS, Irix and HP-UX. Over a year ago, it compiled on Windows NT (with a GNU build environment); it doesn't do this any more, but probably could with small fixes. Patches to the code are most welcome. It is developed on a Linux system. The code conforms to the GNU coding standards. It is released under the GNU Library General Public License (LGPL).

Citation
You are welcome to use the code under the terms of the licence for research or commercial purposes; however, please acknowledge its use with a citation:
McCallum, Andrew Kachites. "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering." http://www.cs.cmu.edu/~mccallum/bow. 1996.

Here is a BibTeX entry:

@unpublished{McCallumLibbow,
  author = "Andrew Kachites McCallum",
  title = "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering",
  note = "http://www.cs.cmu.edu/~mccallum/bow",
  year = 1996
}

Obtaining the Source


Source code for the library can be downloaded from this directory. Different versions are indicated by eight digit sequences that indicate year, month and day. Thus, the most recent version is the one with the largest version number. Unfortunately I do not have time to help rainbow's many users with all their compilation and usage problems. Feel free to send me mail asking for help, but please do not necessarily expect me to have time to help. Most appreciated are bug reports accompanied by fixes.

Bow Library Front-Ends


Provided in the library source distribution, there are currently three executable programs based on the library.

Rainbow is an executable program that does document classification. While mostly designed for classification by naive Bayes, it also provides TFIDF/Rocchio, Probabilistic Indexing and K-nearest neighbor.

Arrow is an executable program that does document retrieval. It currently only performs simple TFIDF-based retrieval.

Crossbow is an executable program that does document clustering (and also classification).
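For readers unfamiliar with classification by naive Bayes, the following Python sketch shows the general technique: a multinomial naive Bayes classifier with Laplace (add-one) smoothing. It is not Rainbow's code and does not reflect its command-line interface or data formats; the function names and the toy training data are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: iterable of (label, token list) pairs.

    Returns (documents per class, token counts per class, vocabulary).
    """
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log posterior for `tokens`."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):          # sorted for deterministic ties
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)   # log prior
        for tok in tokens:
            if tok not in vocab:
                continue                      # ignore tokens never seen in training
            # Laplace smoothing: add one to every token count.
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set (hypothetical data).
training = [
    ("sport", ["ball", "goal", "team"]),
    ("sport", ["team", "match"]),
    ("tech", ["code", "bug"]),
    ("tech", ["code", "compiler"]),
]
model = train_naive_bayes(training)
print(classify(["team", "goal"], model))
print(classify(["code"], model))
```

Working in log space avoids numerical underflow when documents contain many tokens.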
