
Indexed vs. Unindexed Searching:
Distributed Searching • Email Filtering
Security Classifications • Forensics

Reprinted with permission of PC AI Online Magazine V. 18 #3
For more information about PC AI Online Magazine, visit www.pcai.com

Both indexed and unindexed searching have their place in the enterprise. Indexed text retrieval is typically more efficient for uses such as general information retrieval, distributed searching and security classification systems. But unindexed searching too has its place: in outgoing email filtering, searching of live data sources like RSS news feeds, and sometimes in forensics. This article will attempt to explain which search technique to use when, and why.

[Illustration] A dog digging in its own backyard to find bones that the dog remembers burying is performing an indexed search.
[Illustration] A dog digging in a neighboring backyard in the hopes of finding some of the neighboring dog’s buried bones is performing an unindexed search.

Overview: Indexed Text Retrieval
Indexing the inevitable millions of documents that any sizeable organization generates on shared file servers is the fastest way to facilitate data retrieval. An index will typically store each unique word in a document collection and its location within each document. Indexing also works with non-document data, e.g. for forensics search purposes. After indexing, full-text search speed, even across
millions of documents, is typically less than a second. While indexing a very large collection of documents for the first time may be time consuming, subsequent updates of the index are usually much faster. dtSearch, for example, simply checks the file modification dates of all indexed files, and only reindexes those files that have been added, deleted or changed since the last index update. (While the text retrieval terminology here relies on the dtSearch product line, the concepts in this article are generally applicable.)

In addition to enabling precision boolean searching, an index can also store such information as word positions, enabling word or phrase proximity searching. An index can also hold information about word frequency and distribution, enabling computation of natural language relevancy rankings across a document collection. If the company name appears in two million documents, it would get a low relevancy ranking. If marketing terminology appears in only four documents, it would get a much higher relevancy rank. In that way, PR could, for example, enter a whole paragraph of proposed text for a press release as a natural language search, and zoom right in on the most relevant documents.

But full-text searching, whether boolean, natural language, or otherwise, is only part of the text retrieval answer. Suppose HR wants to limit its search to documents with an HR executive designation. This type of fielded data classification can result from fields or metadata inside a document, or from an overlaying document management-type application. With the latter, fielded data classification can rely on associated database entries, such as SQL or XML, or the addition of fields “on the fly” during the indexing process.

Note: This discussion uses the latest dtSearch-based examples for illustrating distributed searching using XML as an interchange medium. This is a different use of XML than as a database storage format for holding Web-based and other fielded data ... The distributed search can then combine the streams of XML data, presenting a single, unified and flexible view.

Adding in Security Classifications
Now suppose the goal is to enable searching organization-wide, but to keep the wrong documents out of the wrong hands. For example, suppose documents that bear certain HR designations might contain salary and other confidential information, and so should be kept out of the general information pool, yet still accessible for authorized HR employees who may need that information. Or suppose a company is doing government defense work, and needs, as part of that work, to implement a classified designation, so that only employees with a certain security clearance can review those documents.

One way to implement these types of document classifications is to build separate indexes for separate sets of documents that bear such classifications, and make the search indexes as well as the original documents available only from secure, limited-access servers. But while this type of method may work for very limited designations, it is unwieldy for implementing a complex document classification scheme.

A more elegant security classification solution involves a document filter. Filtering is ultimately a sifting process, the computer equivalent of panning for gold, or separating the chaff from the wheat. The greater the volume of data that requires filtering, the more important it is to automate the filtering process, usually by looking for certain flags or keywords. Hence, the relationship between filtering and text searching.

In this context, the classification filter relies on what dtSearch calls a “SearchFilter object” for use while indexing. A search filter object is a mechanism to specify a subset of the documents in a collection for purposes of limiting a query to that subset. Conceptually, the SearchFilter object is similar to a list of document identifiers.

The SearchFilter object has the flexibility to integrate with more complex security settings. The security settings themselves

can exist on various levels. To use SearchFilters to implement security, an application would first create a series of SearchFilters, one for each security category, based on information in the database (or other repository) that specifies security rights. When a user submits a query, the application would select the SearchFilter that corresponds to the user's security category and attach it to the search. When the query executes, it will only return documents that the SearchFilter permits.

Filtering of Outgoing Emails
Email filtering typically scans for certain combinations of terms that represent knowledge that should not leave the organization. For example, suppose projectx phase 2322 is a top secret item. Email filtering could flag any email body or document attachment that contains projectx or 2322.

Since email filtering is a one-step operation (either the email proceeds or it does not), unindexed searching is usually the more efficient way to process the search. While unindexed searching is much slower than indexed searching, it is faster to do a single unindexed search than to build an index and then do a single search. And the more advanced relevancy ranking features that indexed-only searching provides tend to be less important in searching through individual emails and attachments than in searching through millions of documents in a document repository.

The most important special feature for email filtering is fuzzy searching to pick up typographical errors. In dtSearch, fuzzy searching works fully with unindexed searching as well as indexed searching. With fuzzy searching on, a search for phase 2322 would also, for example, retrieve a possible misspelling in the form of phase 2332.

Typically an organization will embed unindexed searching capabilities into a custom application. Once the system flags an email, the system could trigger a warning to the sender. For example, the warning might say: did you really mean to send out an email mentioning projectx phase 2322?

The intent of the email filtering outlined above is not to catch willful abuse of company secrets, so much as to pick up casual misuse. Examples of such casual misuse might be an email sent in haste that could wind up in the wrong hands, or even the accidental attachment of the wrong file to a message. In other words, email filtering protects against ordinary human error. It does not, for example, protect against the employee who copies the projectx phase 2322 files to portable media, and walks out with this under a jacket.

In addition to scanning outgoing emails and attachments, an organization may also want to search emails and attachments as archival records. In that case, indexing the emails, enabling the repeated instantaneous search of the knowledge stores that the emails and attachments represent, is the more efficient retrieval method.

Searching Data Outside the Enterprise Borders
Whereas the other levels above relate to data internal to an organization, this discussion relates to data outside the organization. An example is information that resides on the Internet, ranging from competitor Web sites to regulatory Web sites. For such information, the answer is spidered indexed searching. The dtSearch Web spider, for example, can access text in any named Web site to a specified number of levels of depth, and even follow links off the site to related sites.

Indexed searching, with XML as the interchange language for synthesizing

search results as in dtSearch, enables display of data with highlighted hits and links and images intact for popular Web-based formats, such as dynamic ASP.NET content, as well as PDF, HTML and Web-based XML. In fact, such searches can look to the end-user just like searching a local file server, and search results can even display internal and external retrieved content in a fully integrated way. See Distributed (Indexed) Searching: Evolution to XML for additional details on this usage of XML.

[Figure: Distributed Searching. Diagram labels: Enterprise Borders; Other Web Server (two); Local Hard Drive; Enterprise X LAN Server; Enterprise X Web Server; Enterprise X Client; Portable Media.]
Figure caption: The distributed search returns to dtSearch multiple streams of XML data, which the application then combines and presents to the user in a single, unified view. Retrieved HTML, PDF, XML, ZIP, MS Office, etc. files appear with highlighted hits, as well as (for HTML, XML and PDF) images, formatting and links intact. A distributed search can also support hit-highlighted treatment of dynamically-generated content, like ASP.NET and SharePoint.

While this type of indexed searching does not, except for the spidering component, differ significantly from indexed searching of retrieved files inside an organization, another type does differ considerably in structure. Monitoring of live data streams such as RSS news feeds for relevant search queries would be an example of the latter type. In that case, there is no “file” to perform unindexed searching on, such as with email messages and attachments, or the above spidered Web site content example.

A text retrieval program like dtSearch typically receives live data streams, such as RSS news feeds, through a programmable

data source interface. Searching this type of data usually involves scanning for a series of pre-established queries, much like email filtering. And like unindexed email filtering, finding the relevant key words usually triggers certain processes, such as sending an email notice after a “hit.”

Since scanning of live data sources is typically a one-pass filtering operation, unindexed searching is usually the more efficient search method. For subsequent, often repeat, “on demand” searching of a data repository of news feeds, the balance, however, shifts to indexed searching. And the same indexed searching considerations that apply to a general information management repository would likewise apply.

[Sidebar] Distributed (Indexed) Searching: Evolution to XML
Conventional browser-based searching returns search results as an HTML stream. Returning results in XML, however, makes for much smarter search results.

HTML tells how to display the data, not what the data is. XML tells what the data is, not how to display it. While HTML simply paints a picture of what search results look like, XML can include numeric values, such as number of hits and word offsets of hits, indicating where hits are in each document.

To illustrate, assume a distributed search of six servers, some local, some remote. With HTML, a distributed search returns six different search results, with no method for combining them. With XML, a distributed search returns six streams of XML data.

The distributed search can then combine the streams of XML data, presenting a single, unified and flexible view. At the user's request, the XML data can instantly re-sort search results from, for example, ascending date order to descending hit number.

Retrieved Web-based files appear just as they would in a Web browser, i.e. including all embedded links and images, and the addition of highlighted hits.

A retrieved PDF file would look similar to other retrieved Web-based files, including all existing embedded images with highlighted hits. In the case of the PDF file, however, the server transmission includes an XML component along with the underlying file. The server first sends links to the original PDF file, followed by links to an XML file describing where the hits are as page and character offsets. Adobe Reader, operating through the browser, downloads the XML file with the hit offsets for highlighting hits.
[End of sidebar]

Searching Forensically Retrieved Data
When you delete a file, the data remains; the computer just marks it as deleted. Both “undelete” programs and forensic investigation tools work to recover such deleted files. But that, for retrieval of forensically-recovered data, is only the beginning. If an investigator is examining a stack of hard drives, and is not sure whether one or more are even relevant to an

investigation, unindexed searching is often a good first-pass tool. Maybe the PC from the dumpster really did contain only soup recipes, and has no bearing on the matter at hand. An initial unindexed search can scan the PC for certain key words to determine threshold relevance. Often, however, forensic investigators are searching not for just a couple of terms, but for hundreds of terms. For query strings consisting of hundreds of terms, building an index and then searching is generally more efficient than scanning the files for all of these terms in an unindexed search. In any case, following a determination or even a suspicion of threshold relevance (so the PC was not just for soup recipes!), the balance clearly shifts towards indexed searching.

Whether indexed or unindexed, searching generally needs to apply a file format filter to most current computer data. Popular file formats such as PDF, MS Word, and MS Excel store text in such a way as to make the text often unrecognizable in raw form. For example, abcdefghijklmno pqrstuvwxyz could look as follows in a PDF file:

8 0 obj<</Length 57/Filter/FlateDecode/L 69/S 38>>stream
xÚb```c``α`....

Without a file format filter, therefore, a search would almost certainly miss a lot of data. But searching only with a file format filter applied can miss data that appears only in raw form. For example, “slack” space, such as the space between the end of a file and the end of the allocated sector, can hide data. Hence, a good guideline for both indexed and unindexed searching of forensically retrieved data is to search twice: once with a file format filter on, and once with a file format filter off.

In dtSearch, for example, file format filtering would represent the default for searching. And searching without file format recognition would correspond to a binary data search. An intermediate step in dtSearch, called “filtered binary,” would, while viewing all retrieved data as binary, attempt to programmatically sift out the text, and leave behind what looks like formatting data.

Please visit dtSearch online at www.dtsearch.com
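
As a rough illustration of the search-twice guideline, the following Python sketch uses zlib (the Deflate compression that PDF's FlateDecode filter also relies on) to stand in for a file format. The sample text, the slack-space bytes, and the function names are all hypothetical, and this is not how dtSearch implements file format filtering. It only shows why each pass can find something the other misses.

```python
import zlib

# Hypothetical evidence: a compressed "document" followed by leftover slack bytes.
document_text = b"Budget notes for projectx phase 2322. " * 8
stored_file = zlib.compress(document_text)            # stands in for a Flate-encoded stream
slack_space = b"\x00\x00call about soup recipes\x00"  # raw bytes past the end of the file
raw_image = stored_file + slack_space


def search_raw(image: bytes, term: bytes) -> bool:
    """File format filter off: look for the term in the bytes exactly as stored."""
    return term in image


def search_filtered(stored: bytes, term: bytes) -> bool:
    """File format filter on: decode the stored format first, then search the text."""
    return term in zlib.decompress(stored)


def search_twice(image: bytes, stored: bytes, term: bytes) -> dict:
    """Run both passes and report which one, if either, found the term."""
    return {
        "filtered": search_filtered(stored, term),
        "raw": search_raw(image, term),
    }


for term in (b"projectx", b"soup recipes"):
    print(term.decode(), search_twice(raw_image, stored_file, term))
```

In this sketch the filtered pass finds projectx, which the raw pass cannot see inside the compressed bytes, while only the raw pass finds the text left in slack space.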

Unindexed Searching
Works best with single-pass operations, like:
• email filtering
• searching live data feeds like RSS
• making an initial determination of relevance in forensics

Indexed Searching
Use with everything else
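
The trade-off summarized above can be sketched in a few lines of Python. This is a minimal sketch with hypothetical document names and contents, not dtSearch's index format: the unindexed search rescans every document on each query, while the indexed search pays once to record each unique word and its positions, then answers queries by lookup.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unindexed_search(documents, term):
    """Scan every document on every query; no setup cost, so it suits one-pass use."""
    term = term.lower()
    return sorted(name for name, text in documents.items() if term in tokenize(text))


def build_index(documents):
    """Map each unique word to {document name: [word positions]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(position)
    return index


def indexed_search(index, term):
    """Answer a query by lookup; the cost of building the index is paid once."""
    return sorted(index.get(term.lower(), {}))


# Hypothetical three-document collection.
documents = {
    "memo.txt": "Projectx phase 2322 budget review",
    "recipes.txt": "Soup recipes for the weekend",
    "status.txt": "Phase 2322 is on schedule",
}
index = build_index(documents)

print(unindexed_search(documents, "phase"))
print(indexed_search(index, "phase"))
print(index["2322"])
```

Both searches return the same documents. The stored word positions are what make features such as phrase and proximity searching possible on the indexed side.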

