MarkLogic Server Technology - White Paper

MarkLogic Server Overview:
Infrastructure Software for Building Information Applications
Table of Contents
Introduction
XML is the Future
The Challenges of Information
MarkLogic Server: Addressing the Challenges of Information
An Enterprise-Ready Database Management System for XML
Accessing Information: Search and Query
A Complete Platform for Building Information Applications
Conclusion
Abstract
Rapidly changing conditions are forcing organizations to rethink how they use information
to meet their objectives. Whether battling in the marketplace or on the battlefield, the
need for flexibility and agility with information has never been greater. Organizations are
looking to integrate and enrich information to create additional value for users. User
expectations are changing too: users demand Web 2.0 and Enterprise 2.0 style applications
that provide modern search capabilities, as well as the ability to interact with information
through tagging and user-generated comments. Distribution channels also present
new challenges for information providers, who must expose their information through rich
user interfaces or through syndicated services such as RSS and Atom feeds, allowing users
to explore and access information in their own context.
Choosing the right technology for the core of its application architecture is critical for
any organization that needs the agility to meet these goals and respond rapidly
to unforeseen changes. XML servers such as MarkLogic Server provide that agility
by offering a single unified platform for storing, manipulating, and delivering XML and
for building innovative information applications.
This paper provides a technical overview of MarkLogic Server, the industry’s leading XML
server, and also discusses some of the challenges facing organizations today for storing,
repurposing, and dynamically delivering information.
Introduction
The amount of information produced by private and public organizations continues
to increase. For government in particular, more publicly available, private,
and general information is being created and processed than anyone predicted.
Along with the information explosion, XML is gaining momentum as the paradigm
for storing, managing, and delivering information. The increasing adoption of
XML is driven by a combination of factors including:
• Organizational desire to maximize the value of existing information
• Changing user expectations
• Technical factors such as the need to reuse information at any level of detail
Organizational objectives and rapidly changing environmental factors increasingly
require improvements in how information is reused and delivered. Many
organizations struggle to reuse information across different deliverables and
channels. For example, subject matter experts easily create documents with
common authoring tools, but lack the ability to use them again in books, journals,
articles, online in customer-facing applications, training materials, doctrine, and
technical manuals. Organizations want to create information once and reuse it in
as many channels as possible to add value.
On the user side, there are new expectations when it comes to accessing, interacting
with, and collaborating around information. These expectations impact
how organizations exploit information. The proliferation of Web 2.0-style interfaces
lets users interact directly with information, and heavy reliance on
internet search has set a high bar in terms of easy and immediate access to the
most up-to-date information.
XML has proven to be the right technology for these issues. Because of its ability
to represent the many forms information comes in (ordered, hierarchical, textual,
and irregular), there are new and emerging XML-based standards for representing
information in different industries. In the financial sector, industry standards include
Financial products Markup Language (FpML); for general business and financial
reporting, eXtensible Business Reporting Language (XBRL); for Department of Defense
resource discovery, Defense Discovery Metadata Specification (DDMS); and in
publishing, DocBook. There are also XML standards for representing and integrating
technical concepts, such as SensorML for describing sensor systems, and Geography
Markup Language (GML) for modeling and exchanging geographic information. And
XML is already widely used as a messaging/enterprise information integration (EII)
format across industries and for document markup in publishing and government.

Industry standards are based on XML because it improves an organization’s ability to
achieve objectives via fast, intuitive access, dynamic reuse, and flexible, personalized
information delivery.

XML is the Future
With the growing number of XML-based industry standards, more tools are available
for authoring, manipulating, and enriching XML. This has increased the adoption of
XML and spurred the creation of more tools for XML creation and processing. For many
organizations, XML information will be even more ubiquitous due to the simple fact that
productivity tools, such as Microsoft Office and OpenOffice, have adopted XML as their
base format. This means office workers are now XML authors and create huge amounts
of XML documents with no additional training. Vendors such as Adobe and Quark,
with broadly deployed document creation-focused products, have started to increase
the XML orientation in their offerings.

Whether you realize it or not, XML already exists in your organization.

Platforms designed for XML help people unlock the value of information. The problem
is existing tools and technologies have not met all the challenges of information
management and have forced organizations to make trade-offs. MarkLogic Server, the
industry’s leading XML server, was designed and optimized specifically to address the
complex nature of XML. It allows enterprises and government organizations to
transform operations by maximizing the value of their information assets. In the
federal government, a wide range of agencies have used MarkLogic Server to change the
way they manage and distribute information through innovative, new knowledge and
information management systems which dynamically deliver information to users.

The Challenges of Information
What is information? For most organizations, it is the combination of discrete
data types (such as numbers, dates, and names) with semi-structured and unstructured
text. Information is often recognized as books, journals, articles, reports, and
technical documentation. It also includes popular text sources such as email, web
pages, general message traffic, and user-generated comments. In this context,
information is semi-structured: unlike data, it is irregular in length and structure, and
largely textual in nature. XML is the ideal model for storing, managing, and delivering
all information.

Information often creates challenges for IT organizations because they typically treat
it like relational data. Among the biggest challenges are:
• Structures: information comes in many forms, formats and structures
• Size: organizations generate an enormous amount of information
• Semantics: relevant results depend on understanding the meaning of information

The Challenge of Structure
One of the fundamental challenges of information is the nearly infinite variations of
structure it can have. Structure can be highly variable and dynamic. We see this just by
looking at a few different information types: articles can have a different structure from
journals, which can have a different structure from books; email has some structure
(e.g., standard fields like to, from, date); and HTML web pages mix structure with display.
Consider the text in a book which can have headings, chapters, and paragraphs that
form the first levels of structure, then add tables and captions, extracted entities
(people, places, things) and even user-tagging and comments. What appears at first glance
to be regularly structured quickly becomes irregularly structured information.

When storing, manipulating, or updating information, organizations consider using
schemas[1] for a range of reasons including catching errors and making queries faster
and more accurate. In doing so, they often have to make a choice between maintaining
multiple structures and attempting to enforce a single structure on their information.
Enforcing such a schema on a large scale is not always possible. Books and journals may
have been authored with different tools, and information may come through corporate
acquisition or organizational merging. Yet an organization may want to exploit all
of this information together to build new products or repurpose information for new
uses. Additionally, changing business conditions inevitably force the creation of new
structures over time.

The following simple examples in Figure 1 illustrate different ways to model
information with the text-oriented DocBook schema (www.docbook.org). Inside a paragraph
you can have any number of tables, emphasis, sub-titles, etc., in any order. This schema
allows information to be arranged in different ways.

Both examples are valid structures in DocBook, but they have significant
differences in terms of which elements exist and the hierarchy they represent.
First example:

<book xmlns="http://docbook.org/ns/docbook">
  <title>...</title>
  <chapter>
    <title/>
    <para>
      <phrase>...</phrase>
      <emphasis>
        <quote>...</quote>
      </emphasis>
    </para>
  </chapter>
  <chapter>
    <title/>
    <para/>
  </chapter>
</book>

Second example:

<book xmlns="http://docbook.org/ns/docbook">
  <title>...</title>
  <subtitle>...</subtitle>
  <bookinfo>
    <author>
      <firstname>...</firstname>
      <surname>...</surname>
    </author>
    <author>
      <firstname>...</firstname>
      <surname>...</surname>
    </author>
  </bookinfo>
  <chapter>
    <title>...</title>
    <para> ...
      <phrase>...</phrase> ...
      <table>
        <caption>...</caption>
        <tbody>
          <tr>
            <th/>
            <td/>
            <td/>
          </tr>
        </tbody>
      </table>
      <graphic/>
    </para>
  </chapter>
</book>
Figure 1. Different valid structures with the same schema.
1 www.wikipedia.org [Schema] A schema is a way to define the structure, content and, to some extent, the
semantics of XML documents.
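
To make the point about structural variation concrete, the following is a minimal sketch in Python using only the standard library. The two sample documents are hypothetical stand-ins that mirror the shapes in Figure 1 (they are not validated against the DocBook schema), and `summarize` is an illustrative helper, not part of MarkLogic Server. It shows that the same structure-aware questions can be asked of both documents even though they differ in which elements exist and how deeply they nest.

```python
import xml.etree.ElementTree as ET

# Namespace prefix map; "db" is an arbitrary local prefix chosen for this sketch.
NS = {"db": "http://docbook.org/ns/docbook"}

# Hypothetical document shaped like the first example in Figure 1.
SIMPLE_BOOK = """<book xmlns="http://docbook.org/ns/docbook">
  <title>Short book</title>
  <chapter><title/><para><phrase>text</phrase></para></chapter>
  <chapter><title/><para/></chapter>
</book>"""

# Hypothetical document shaped like the second example in Figure 1.
RICH_BOOK = """<book xmlns="http://docbook.org/ns/docbook">
  <title>Rich book</title>
  <subtitle>With metadata</subtitle>
  <bookinfo><author><firstname>A</firstname><surname>B</surname></author></bookinfo>
  <chapter>
    <title>One</title>
    <para><phrase>text</phrase>
      <table><caption>Results</caption><tbody><tr><th/><td/><td/></tr></tbody></table>
    </para>
  </chapter>
</book>"""


def summarize(xml_text):
    """Ask the same structural questions of any book, whatever its shape."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("db:title", namespaces=NS),
        "chapters": len(root.findall("db:chapter", NS)),
        # ".//" matches at any depth, so nesting differences do not matter.
        "paras": len(root.findall(".//db:para", NS)),
        "captions": [c.text for c in root.findall(".//db:caption", NS)],
        "authors": len(root.findall(".//db:author", NS)),
    }


simple = summarize(SIMPLE_BOOK)
rich = summarize(RICH_BOOK)
print(simple)
print(rich)
```

Elements that are absent in one document (authors, captions) simply produce empty results rather than errors, which is the behavior an application needs when it cannot enforce a single structure on all of its information.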
Understanding the structure and representation of information allows you to build
more powerful solutions because you have more control over access and retrieval.
You can ask questions of your information that combine the knowledge of structure
and the text. It is possible to search for the existence of a particular word in a part of
the structure like a title, footnote, or caption. For example, when searching medical
journals a researcher might ask “find the word lymphoma when it is used in a caption”
and then retrieve that caption along with its associated table or image.

When accessing information using search techniques, it is important that the full-text
search capabilities are more than just “XML aware,” meaning the engine knows only that
the type of document is XML or that some fields might need processing. As a comparison,
search engines do not typically handle XML well: at best they may recognize (in order to
skip during indexing) or completely ignore XML tags, or index a fixed number of
pre-defined fields. This limitation makes search engines a poor choice for exploiting XML,
especially since updates to the indexing structure (known as schema migration) require
complete re-indexing of the documents, increasing the administrative burden and
reducing user flexibility.

The Challenge of Size
It has been estimated that 80% of an organization’s total information is unstructured
or semi-structured. Only 20% is structured and suitable for storage in a relational
database. The volume of unstructured or semi-structured information presents many
challenges beyond the obvious storage and computing requirements. Whether your
organization is big or small, you want to know the solution you choose for loading, storing,
and accessing information can scale to provide the same level of capabilities in five
years as it does for relatively small amounts today. For some, scalability is about the
number of users, simultaneous queries or transactions, and query speed. For others, it
is about the sheer amount of raw information the system must store and manage. And
for many, it is all of the above.

As information grows, the ability to efficiently respond to complex queries
becomes more important. Complex queries might have tens or even hundreds of terms.
For example, a researcher might need to query a large amount of scientific and patent
data with extensive queries for patent related research. By providing a fast response,
an XML server lets researchers give better value to customers because they can
be more responsive and do more detailed research. Architecturally, this means a
solution must be able to distribute processing and storage efficiently in order to provide
the necessary responsiveness.

The Challenge of Semantics
Another significant challenge related to information is that of semantics, or the
understanding of meaning within information. Understanding the semantics lets
an organization more accurately search, access, interpret, and manipulate text. They
can use the semantic knowledge to access and deliver the right information to users.

At the most basic level, understanding meaning requires words to be broken down
to their base forms, a process called stemming, in order to ensure the right
information, and only the right information, is found. Consider the words “mouse” and
“mice”: a search for “mouse” should find all instances of both words. Additionally,
a thesaurus can be used to find synonyms, or a broader or narrower form of a word or
phrase. In the example above, a thesaurus would enable the search for “mouse” to be
expanded to find the word “rodent” or narrowed to “field mouse.” Similarly,
taxonomies (the classification of information based on a set of criteria) can be applied
to organize information.

What about semantic ambiguity? Should a search for “George Washington” return
results for the street, the university, the person, and the bridge? Increasingly, entity
extraction engines are being used to disambiguate George Washington as the person
and president from the bridge and from the university through XML tags. Solutions that
manage and deliver XML must leverage this
additional semantic information to deliver the correct information to users.

Organizations will face the challenges of structure, size, and semantics to varying
degrees when trying to maximize the value of their information. A system that can
overcome these challenges must be architected in a way that allows for superior
flexibility and scalability.

MarkLogic Server: Addressing the Challenges of Information
MarkLogic Server meets today’s information challenges of structure, size, and
semantics because it was specifically designed from the ground up for XML. In
MarkLogic Server, XML is the native format for storing and delivering information.
MarkLogic Server is a platform for building information applications that address a
wide range of solutions, including content integration, metadata catalogs, information
analysis, information reuse, and dynamic delivery.

Optimized for XML and XQuery
When MarkLogic Server was first conceived, two fundamental assumptions were made
regarding the direction of technology. The first was the emergence of XML as a
format for representing semi-structured information. The second was the flexibility
of XQuery to easily access and manipulate XML. Both of these assumptions have
proven to be correct, and today they represent two of the fundamental strengths of
MarkLogic Server. By focusing on these two industry-standard technologies, MarkLogic
engineers can make design decisions that optimize performance for processing XML.
This is at the core of MarkLogic Server’s significant long-term, architectural advantages
over traditional technologies.

XML Top-to-Bottom
XML is the format used for storing and delivering information to other applications,
to the web, and everywhere in between. MarkLogic Server is the industry’s leading
XML server and is an extremely efficient system for processing XML. Within
MarkLogic Server, information is maintained in a highly compressed, parsed XML
form to improve throughput and processing speeds. It decompresses and serializes
information before it is delivered to an application or the web.

XQuery for Information Access and Manipulation
XQuery was designed as a query language for XML, just as SQL is a query language for
relational databases. Today, MarkLogic has the industry’s most extensive
implementation of XQuery 1.0, the W3C standard for XML query, and achieves 99.9%
conformance to the specification.[2] XQuery is a powerful language for accessing
information and building applications that contain all of the constructs needed to provide
large-scale, highly sophisticated solutions. MarkLogic Server also includes extensions
for full-text search, information update, and programming. All of this means you can
write high-performance applications in a high-level, declarative language.

The MarkLogic full-text search extensions to XQuery allow organizations to compose
full-text search and structured XML search to provide users with one-step answers to
questions. MarkLogic Server, with these XML full-text extensions, provides significant
advantage over other full-text search solutions, such as enterprise search products
that simply provide users with pages of links to documents, web pages, and other
information.

2 W3C XQuery Test Suite Result Summary - http://www.w3.org/XML/Query/test-suite/XQTSReportSimple.html
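
The idea of composing full-text search with structured XML search, for example finding a word only when it appears inside a caption and then returning the associated table, can be sketched as follows. This is a minimal illustration in Python's standard library, not MarkLogic's XQuery extensions; the sample article, the element names, and the `word_in_element` helper are all assumptions made for this sketch. It also scans the document linearly, whereas a real XML server would answer the same question from an index.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical journal article; element names are assumptions for this sketch.
ARTICLE = """<article>
  <title>Lymphoma treatment outcomes</title>
  <para>Unrelated text about lymphoma.</para>
  <table id="t1"><caption>Survival rates for lymphoma patients</caption></table>
  <table id="t2"><caption>Dosage schedule</caption></table>
</article>"""


def word_in_element(root, word, tag):
    """Return (element, parent) pairs where `word` occurs inside a `tag` element."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parents = {child: parent for parent in root.iter() for child in parent}
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for el in root.iter(tag):
        # Join all text below the element so inline markup does not hide words.
        text = " ".join(el.itertext())
        if pattern.search(text):
            hits.append((el, parents.get(el)))
    return hits


root = ET.fromstring(ARTICLE)
caption_hits = word_in_element(root, "lymphoma", "caption")
for caption, table in caption_hits:
    print(table.get("id"), "->", caption.text)
```

The word also occurs in the title and in a paragraph, but only the caption match is returned, together with its enclosing table. That is the "one-step answer" the text describes, as opposed to a link to a whole document that merely contains the word somewhere.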
Another extension of the XQuery standard enables fine-grained updates to XML
information. XQuery was originally designed for accessing XML through an XPath[3]-like
selection of information. The architects of MarkLogic Server recognized that a
database management system (DBMS) for XML needed to have the ability to update
information, resulting in the development of information-update extensions. MarkLogic
Server’s XQuery implementation includes extensions for inserting, updating, and
deleting parts of documents.

3 XML Path Language - http://www.w3.org/TR/xpath

A third set of extensions focuses on error and exception handling. Programs often
perform many operations that have the potential for generating exception conditions
that under normal circumstances would cause an application to fail. An example is a
divide by zero error. One easy way to gracefully handle this error is through try/catch
exception handling. MarkLogic Server extends XQuery with try/catch support to capture
errors as part of the normal application functionality. Most programming languages
have this capability but the XQuery specification does not include such support, so this
MarkLogic Server extension enables the development of robust applications.

By selecting XML and XQuery as the standard for storing and interacting with
information, MarkLogic has created a platform ideal for building information applications
that address the challenges of structure, size, and semantics.

An Enterprise-Ready Database Management System for XML
Designed specifically to handle XML information, MarkLogic Server has the
capabilities of an enterprise-class DBMS. It provides storage, query, read consistency,
update, failover, backup, and restore on a highly scalable and flexible architecture. It
also provides administrators with easy-to-use tools and interfaces to load and manage
information, and interfaces for managing and monitoring a deployment. It easily
integrates with systems within a data center and provides key functionalities previously
missing from enterprise architectures that must cope with the challenges of
information.

Fast, Efficient Information Loading and Processing
Loading information efficiently lets an organization keep up with rapid information
creation, including high-speed activities such as capturing and storing message
traffic. With MarkLogic Server, you can create pipelines for processing information
that automate loading using the Content Processing Framework (CPF). This
programmable feature allows you to perform various operations on-the-fly, such as
conversion, information transformation, XML classification, entity enrichment, and more.
CPF uses triggers that cause user-specified operations to be performed. CPF is an
efficient pipeline that can be parallelized to maximize information loading operations
for deployments with extreme throughput requirements.

Transactional Storage of Information with No Compromises
MarkLogic Server implements a transactional system that adheres to the ACID[4]
model to guarantee returning the correct answer. MarkLogic Server also supports
non-blocking read operations, which promote efficiency by ensuring users will never
have to wait on the read/write operations of others.

4 Atomicity, Consistency, Isolation, Durability. These are four primary requirements for transactional
databases, ensuring that only complete, valid transactions are written to the database, and preventing any
reads from returning intermediate data from a transaction. See http://en.wikipedia.org/wiki/ACID

The transactional model and ACID compliance give MarkLogic Server an advantage
over search engines and XML-extended relational database management systems
(RDBMS) in returning correct search results. Search engines, which are read optimized,
typically use a web-crawler design where the state of the information reflects the
most recent crawl. Similarly, RDBMSs often use an asynchronous or lazy index update
model in XML implementations and/or in full-text implementations. This results in
an index that is often out of sync with the database. In both cases, search engines
and RDBMSs have to sacrifice result accuracy and currency to achieve acceptable
performance.

MarkLogic Server does not force organizations to make these trade-offs. When
information is loaded, it is immediately made available to query users with assured
transactional integrity. This means organizations can be confident users will get the
correct results.

System of Record
An important difference between MarkLogic Server and enterprise search
engines is that its XML repository makes it a system of record for information because
it stores the actual information. Search engines can only provide pointers to the
original information because their indexes contain only keywords. This is an important
distinction. When you need to create applications that modify or alter information,
or create new information from existing information, you can store these changes,
updates, and new information directly in MarkLogic Server.

High Availability
Another key component of any enterprise DBMS is high availability, which assures
users continuous access to information. MarkLogic Server ensures high availability
through several mechanisms, including journaling, clustering, and automatic failover.
The journaling system used in MarkLogic Server ensures data integrity during
update operations. A clustered architecture spreads the storage and query processing
load across multiple servers and improves system responsiveness in times of heavy
use. Automatic failover protects against node failure by automatically switching
traffic between nodes based on their status. MarkLogic Server gracefully
handles node failures, so users have access to the information they need, when they
need it.

Shared-Nothing Architecture
It can be difficult to predict future information requirements, how much
information you will need to store and access, the number of users that may need to access
information, or the complexity of the queries users may want to run. One way to protect
your deployment is to use a system where you can easily add commodity hardware to
scale across more information and/or more users. This type of architecture is called a
shared-nothing architecture and MarkLogic Server was designed for deployment using
this method. The goal of this architecture is to provide scalability by easily adding nodes
to handle increased inbound query requests and increased information volumes.

In a clustered deployment, MarkLogic Server can be set up on separate hosts
(nodes) and tuned to process inbound query requests along with information storage and
retrieval. Each of these nodes runs MarkLogic Server independently of other hosts.
Nodes tuned for storage and retrieval manage components of the database (Figure 2).
Nodes tuned to handle inbound query traffic evaluate queries and transform, assemble,
and deliver results back to the requesting application. Each node can communicate
with every other node in a cluster. Increasing storage capacity is a simple matter of
adding additional nodes configured to store and retrieve information.

Figure 2. MarkLogic Server shared-nothing architecture.

The process required to add user query capacity or support increased query
complexity is a simple matter of adding nodes to evaluate inbound requests. Handling larger
information volumes in a deployment is a matter of adding servers configured for
information storage and retrieval (as opposed to query focused). MarkLogic currently has
customers with multi-hundred terabyte databases in production.

Backup and Restore
Information is often an organization’s most valuable asset, so the ability to protect
and back up that information is crucial. MarkLogic Server provides a facility for
consistent database backup that allows administrators to consider their own unique
requirements when determining how much or how little they want to back up or restore.
They have the flexibility to back up selected components of the database or the entire
database, giving them the control needed to meet the specific needs for each
deployment.

Secure Web-Based Administration
Administering a DBMS should be easy and straightforward. MarkLogic Server
features an easy-to-use, web-based administration interface. Administrators can
complete a wide range of tasks, including changing configuration, managing
databases, and managing users. Administrators can also perform important system-level
tasks such as security management (using MarkLogic Server’s built-in security
components), backup and restore operations, and performance tuning. Finally, the interface
lets administrators perform a complete status check on their resources.

Efficient Use of Storage
As a result of the rapid increase in storage densities, conventional wisdom holds that
“storage is cheap.” Unfortunately, while comparatively inexpensive, it isn’t free. For
databases sized in the hundreds of terabytes, cost can become an issue. MarkLogic
Server addresses this by efficiently using disk space through extremely effective
and efficient compressed storage. An additional benefit of storing information in a
compressed format is that it reduces disk and network traffic, resulting in enhanced
application performance.

Accessing Information: Search and Query
MarkLogic Server provides both search and query capabilities. MarkLogic Server
provides all the standard functionality you’d expect in a full-text search engine, plus
database-style queries against information. At the heart of the search and query
capability is a patented indexing technique that results in fast queries. So applications
can dynamically transform text, sort results, reformat for a specific device, and
transform messages from one format to another.

Unified Indexing of Text and Structure
Accessing semi-structured information is easily optimized when you can combine the
existence of the text with the knowledge
of the structure. However, access to text has typically been done through an inverted
index[5], supplemented by a second index of the information’s structure in an attempt
to provide search in limited portions of a document structure. A better method is to
create an index that includes the complete knowledge of the words and the structure
and inline markup. The result is a system where all information is accessible,
including its structure. MarkLogic refers to this type of index as a universal index.

5 An inverted index is an index data structure which stores a mapping from content elements, such as words or
numbers, to their locations in a document or a set of documents, in this case allowing full text search.
See http://en.wikipedia.org/wiki/Inverted_index

The universal index includes full text, values, and XML structure. By combining what are
typically separate indexes into a single universal index, MarkLogic Server delivers
extremely powerful ways to access information.

One of the advantages of the universal index is you can load any XML as-is. This
allows for immediate access regardless of schema or structure. The universal index in
MarkLogic Server uses a patented approach to indexing structure that allows
information in any format to be quickly loaded and made immediately available to users.
Information from different sources, with different schemas, can be used without needing
to be normalized, overcoming traditional, structure-related challenges.

Another advantage of the universal index is it does not require up-front decisions
and pre-processing. Search engines typically index text with a fixed number of
pre-determined fields. That enables them to answer certain types of questions, but if
other types of questions need to be asked in the future, the schema migration effort
requires a complete, costly re-indexing of documents. RDBMSs are more restrictive
since they were not designed for ad hoc user queries, and only provide answers for
questions you know in advance. XML-extended relational databases require three or more
different indexes to access information. These models have serious disadvantages,
including additional administration costs, synchronization issues, and performance
problems, as well as forcing administrators to anticipate user queries. Instead,
MarkLogic Server indexes all text and XML structure at load time, and can be further
updated without downtime.

The universal index includes text, structure, and values and can be used by an application
to deliver rich user experiences and effectively target information delivery.

Search Capabilities
Search is one of the primary ways users expect to locate the information they need.
Full-featured, highly relevant search is a core capability of MarkLogic Server. This
includes:
• Standard search functionality including word, phrase, Boolean, wildcard, proximity,
  stemmed, spell checking, and thesaurus assisted search
• Relevance measure and relevance ranking based on term frequency and document
  frequency, tunable according to organizational needs
• Advanced language processing including tokenization, collation, and diacritic
  support
• Faceted navigation and other rich analysis and exploration user interfaces
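
The notion of a single index that covers both words and structure can be illustrated with a toy sketch in Python. This is not MarkLogic's patented indexing technique, whose internals this paper does not describe; the class name `ToyUniversalIndex`, its key layout, and the three sample documents are all assumptions made for illustration. The sketch keeps one postings table whose keys are words, element names, or element-plus-word pairs, so documents with unrelated schemas can be loaded as-is and queried without declaring fields up front.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


class ToyUniversalIndex:
    """One postings table for words, elements, and element+word pairs."""

    def __init__(self):
        # key -> set of document ids containing that key
        self.postings = defaultdict(set)

    def add(self, doc_id, xml_text):
        """Index a document as-is; no schema or field list is required."""
        root = ET.fromstring(xml_text)
        for el in root.iter():
            self.postings[("element", el.tag)].add(doc_id)
            # All text below the element, lower-cased and split into words.
            text = " ".join(el.itertext()).lower()
            for word in re.findall(r"\w+", text):
                self.postings[("word", word)].add(doc_id)
                self.postings[("element-word", el.tag, word)].add(doc_id)

    def search(self, *keys):
        """Return ids of documents that match every key (set intersection)."""
        sets = [self.postings.get(key, set()) for key in keys]
        return set.intersection(*sets) if sets else set()


# Three hypothetical documents with three different structures.
index = ToyUniversalIndex()
index.add("a", "<article><title>Mouse genetics</title><para>A study of mice.</para></article>")
index.add("b", "<book><chapter><title>Rodents</title><para>The field mouse.</para></chapter></book>")
index.add("c", "<memo><subject>Mouse and keyboard order</subject></memo>")

anywhere = index.search(("word", "mouse"))
in_title = index.search(("element-word", "title", "mouse"))
with_para = index.search(("element", "para"), ("word", "mouse"))
print(sorted(anywhere), sorted(in_title), sorted(with_para))
```

Because structure is indexed alongside text, restricting the word to a `title` element is just another key lookup rather than a second index to keep synchronized. The sketch deliberately omits stemming, values, positions, and relevance ranking, so a search for "mouse" here would not find "mice".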
Combining the powerful full-text search capabilities in MarkLogic Server with an inherent understanding of the elements and attributes of XML exposes the semantics of information. The following example of enriched XML demonstrates some of the complexity involved with the semantics of information.

When enriched using a text mining tool the following snippet of text:

“General Motors (NYSE: GM-news) expects its sales in central and eastern Europe to jump to 505,000 units this year from 350,000 in 2006, GM Europe chief Jonathan Browning…”

Transforms into the XML in Figure 3:

<entity><company Domain="Automotive" value="general motors">General Motors</company></entity>
(NYSE:GM - news) expects its sales in central and eastern
<entity><location><continent value="europe">Europe</continent></location></entity>
to jump to 505,000 units
<entity><time><relative_time value="this year">this year</relative_time></time></entity>
from 350,000
<entity><time><exact_time Year="2006" value="in 2006">in 2006</exact_time></time></entity>
, GM
<relationship><board_and_management_changes value="board_function">
<entity><location><continent role="Context" value="europe">Europe</continent></location></entity>
<entity><function FunctionCategory="Board" role="Has Function" value="chief">chief</function></entity>
<entity><person Family_Name="Browning" First_Name="Jonathan" Gender="Male" role="Person" value="jonathan browning">Jonathan Browning</person></entity>
</board_and_management_changes></relationship>

Figure 3. Enriched XML.

The markup in this example identifies the organization (General Motors), the locations (Europe), time elements, role of the person mentioned, and more. The entire markup can be captured and stored with the original text and is called “inline” markup. With applications powered by MarkLogic Server, users can search this information and leverage the markup to get more refined, exact answers to their questions.

For example, a user could find the names of executives in GM’s European operations and have the name “Jonathan Browning” returned with his title. They could also require or exclude results containing the year 2006. These capabilities make it possible to accurately search across information and leverage the semantics and the meaning of the text.

Dynamic Navigation
Solutions built using MarkLogic Server provide a wide range of navigation options for exploring information. Navigation options like faceted navigation are possible because of the universal index and the fast query processing that enable dynamic updates of navigation options as the user clicks through the information. Common navigation techniques built using MarkLogic Server include facets, tag clouds, heat maps, node edge graphs, temporal exploration, geospatial navigation, and traditional pie and bar charts. Some organizations use MarkLogic Server to provide users with multiple navigation techniques so they can select one that best meets their needs and preferences.

Combined XML and Full-Text Search
MarkLogic Server goes beyond full-text search to provide structured XML search. It can search information within XML elements, attributes, and values, as well as combinations of each. This means a user’s search will examine document structure and metadata and words. You can restrict search to specific parts and return individual elements that are on a specific path, such as “return /book/chapter//title”. Users can also search for specific element values, so the path-specific example above could be narrowed to find all elements whose values equal “The Red Book” on the path /book/chapter/title. Or users can search for all elements called <title> with occurrences of the word “dog” in the path /book/chapter/. The combination of XML search and full-text search lets organizations answer questions about information not possible in other systems, and return not just a pointer to a document, but the part(s) of the document requested. This combined capability is known as “fine-grained search and retrieval” of information. It empowers organizations
to dynamically repurpose information at a granular level, creating on-the-fly, custom personalized documents and web pages to meet specific users’ needs.

A Deep Understanding of Structure
Even if an information application can index all of the structure like MarkLogic Server does, it takes more than that to correctly process information; you also have to know the difference between different kinds of tags. Some tags are hierarchical, indicating a parent/child relationship, and some are inline, e.g., indicating display style. MarkLogic Server easily distinguishes between hierarchical and inline markup and provides correct results when processing information that includes these tags.

Let’s look at a simple example of how XML structure interacts with search. In Figure 4 below, examples A, B, C, and D all show inline markup of text. They represent the case of pilots in an airplane looking for instructions on how to retract landing gear in a document using search. This type of inline markup of information is sometimes referred to as mixed content and presents serious challenges to search engines and XML-extended relational databases. Even in situations where the XML is relatively simple, search engines and XML-extended relational databases can return the wrong answer.

In Example A, the document describes the operational steps and the XML tags indicate that “retract” is in a different step from “the landing gear.” A search engine that ignores XML tags would incorrectly return this document as a relevant hit for the phrase query: “retract the landing gear.” In Example B, the XML tags again indicate different contexts for the word “retract,” and the rest of the search phrase, “the landing gear,” is in a footnote. Again, a search engine that ignores XML tags would incorrectly return this as a relevant hit. Examples C and D should be returned as relevant hits, but a system that does not recognize and correctly process XML tags will miss them, as the intervening markup breaks the string. Example C also illustrates what is known as “phrase through.” In this case, the XML tags <b> and </b> should be ignored for the purpose of relevance since they refer only to bolding the associated text. Example D demonstrates what is known as “phrase around.” The XML tags <footnote> and </footnote> must be recognized and ignored for the purpose of relevance and proximity.

MarkLogic Server’s deep understanding of XML allows it to easily handle information in the examples above as well as more complex XML such as enriched information that is becoming more common in real-world implementations. Mixed content creates special challenges because markup, such as style information like bolding or references, is embedded within the text. MarkLogic Server has been built for arbitrarily complex XML and is the right technology to unlock the value of your information.
Example A
<step number="3">
Wait for the automatic pilot control to retract.
</step> <step number="4">
The landing gear should now be ready to deploy.
</step>

Example B
<warning>
While waiting for the flaps to fully retract <footnote>The landing gear need to be stowed or else the flaps will remain deployed</footnote>, ensure that the automatic pilot control is at hand.
</warning>

Example C
<p>
It is vital that you retract <b>the landing gear</b>, but leave the flaps fully deployed.
</p>

Example D
<step number="5">
Partially retract <footnote>It is not recommended to fully retract at this point.</footnote> the landing gear.
</step>

Figure 4. Search engines and XML-extended relational databases mishandle XML (search phrase “retract the landing gear” shown in bold font for emphasis).
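The behavior described for Examples A through D can be sketched in a few lines of Python using the standard library's ElementTree parser. This is only an illustration of the "phrase through" and "phrase around" ideas, not MarkLogic Server's algorithm: the tag classifications in INLINE and AROUND, the function names, and the <doc> wrapper added around Example A (so that its two steps parse as one document) are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

INLINE = {"b", "i", "em"}  # assumption: style tags, a phrase passes through them
AROUND = {"footnote"}      # assumption: asides, a phrase passes around them


def contexts(root):
    """Split an element's mixed content into independent text runs."""
    runs = []

    def walk(elem, buf):
        buf.append(elem.text or "")
        for child in elem:
            if child.tag in INLINE:
                walk(child, buf)              # phrase through: same run
            else:
                if child.tag not in AROUND:
                    # A hierarchical child ends the parent's current run.
                    runs.append("".join(buf))
                    buf.clear()
                sub = []
                walk(child, sub)              # the child's text is its own run
                runs.append("".join(sub))
            buf.append(child.tail or "")

    top = []
    walk(root, top)
    runs.append("".join(top))
    return runs


def tokens(text):
    return re.findall(r"\w+", text.lower())


def phrase_hit(xml, phrase):
    """True if the phrase occurs inside a single text run of the document."""
    want = tokens(phrase)
    for run in contexts(ET.fromstring(xml)):
        have = tokens(run)
        if any(have[i:i + len(want)] == want
               for i in range(len(have) - len(want) + 1)):
            return True
    return False


EXAMPLES = {
    "A": '<doc><step number="3">Wait for the automatic pilot control to retract.'
         '</step> <step number="4">The landing gear should now be ready to deploy.'
         '</step></doc>',
    "B": '<warning>While waiting for the flaps to fully retract <footnote>The '
         'landing gear need to be stowed or else the flaps will remain deployed'
         '</footnote>, ensure that the automatic pilot control is at hand.</warning>',
    "C": '<p>It is vital that you retract <b>the landing gear</b>, but leave '
         'the flaps fully deployed.</p>',
    "D": '<step number="5">Partially retract <footnote>It is not recommended to '
         'fully retract at this point.</footnote> the landing gear.</step>',
}
results = {name: phrase_hit(xml, "retract the landing gear")
           for name, xml in EXAMPLES.items()}
```

With these classifications, Examples A and B are rejected and Examples C and D are accepted, which matches the outcome the text describes. A search that simply stripped all tags would accept A and B as well.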
Query Optimization
In database query performance optimization, an important technique is to evaluate the most restrictive predicates first in order to minimize the data sets on which the database operates. MarkLogic Server does this with the universal index, where predicates are automatically evaluated so restrictive terms constrain less restrictive terms. By using an extremely efficient algorithm for merging term lists, MarkLogic Server is able to deliver maximum application performance.

Instant Search
MarkLogic Server supports instant search, which ensures that information loaded into MarkLogic Server can be searched immediately. It eliminates the latency that is typically found in applications that rely on separate systems to handle storage and search. In those applications, the search index is often not up-to-date, as the lag reduces the overhead cost of continual index updates. Shorter lags require more computing resources, so application developers must make a trade-off by only periodically updating the search indexes to reduce resource requirements.

MarkLogic Server’s unified architecture ensures that storage and search synchronization is done automatically, immediately, and efficiently, allowing end users to search for information as soon as it is loaded. Users are assured that when they run searches, they are accessing all information available at the time of the search. Any end user who needs immediate access to time-sensitive information, such as financial market news or intelligence reports, can search over the latest information without waiting for the technology to catch up.

Real-time Alerting
Real-time alerting, also known as profiling, filtering, and proactive search, is the search technology for users who need to receive information as soon as it is available. The value of information is often related to its timeliness, so whether the information consists of intelligence reports, financial market data, or even late-breaking business news, users are demanding real-time delivery so they can respond immediately. Users can set up an alert such as, “send me an email whenever there’s news about oil and gas within 100 miles of San Francisco”, and will not need to run the search over and over.

Real-time alerting takes MarkLogic Server’s extensive query capabilities and rapidly applies them to an incoming feed of information. This type of monitoring allows users to set up alerts to take specific actions on information that matches their topics of interest. Millions of alerts can be created to enable a variety of immediate actions on the matching information, including delivery, categorization, enrichment, and transformation. Since the alerts are optimized for this type of search, they are far more efficient than mechanisms in systems that periodically run standard queries.
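The alert quoted above ("news about oil and gas within 100 miles of San Francisco") can be modeled as a stored query that is evaluated against each item as it arrives, rather than a search that is re-run on a schedule. The Python sketch below illustrates that inversion under stated assumptions: the Alert class, the route function, the item layout, and the user name are hypothetical, and the linear scan over alerts stands in for whatever optimized matching a real alerting engine would use.

```python
import math
import re
from dataclasses import dataclass


def miles_between(a, b):
    """Great-circle distance in miles between two (lat, lon) points (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))  # 3958.8 = mean Earth radius in miles


@dataclass
class Alert:
    user: str
    keywords: frozenset   # every keyword must appear in the item text
    center: tuple         # (lat, lon)
    radius_miles: float

    def matches(self, item):
        words = set(re.findall(r"\w+", item["text"].lower()))
        return (self.keywords <= words
                and miles_between(self.center, item["location"]) <= self.radius_miles)


def route(item, alerts):
    """Return the users whose stored alerts match a newly arrived item."""
    return [a.user for a in alerts if a.matches(item)]


SAN_FRANCISCO = (37.7749, -122.4194)
alerts = [Alert("ana", frozenset({"oil", "gas"}), SAN_FRANCISCO, 100.0)]

# Hypothetical incoming items: one from San Jose, one from Los Angeles.
nearby = {"text": "New oil and gas lease approved.", "location": (37.3382, -121.8863)}
distant = {"text": "Oil and gas prices rise.", "location": (34.0522, -118.2437)}

notified_nearby = route(nearby, alerts)
notified_distant = route(distant, alerts)
```

The San Jose item is roughly 40 miles from San Francisco and triggers the alert; the Los Angeles item matches the keywords but falls outside the radius. With millions of alerts, a production system would index the stored queries instead of looping over them.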
Geospatial Search
The use of geospatial information continues to grow as more users demand the geographical relevance of information. Many organizations already have large collections of geospatial information, and those that do not can leverage enrichment tools to add geospatial data to entities such as countries, buildings, and points of interest. And with the growth of location-aware mobile devices, geospatial information appears set to become even more relevant to our everyday lives. One consumer-oriented example of geospatial search is, “show me all restaurants within 5 miles of where I am that serve filet mignon”.

MarkLogic Server enables organizations to identify geographical relevance in their information with geospatial search capabilities combined with full-text search. MarkLogic Server supports a comprehensive set of query criteria including point search, radius search, latitude-longitude box search, and polygon search in which thousands of boundary points can be specified for a polygon. Users can also specify any number of locations per document or record, allowing flexibility in the formatting of information.

MarkLogic Server has built-in support for various geospatial types such as Geography Markup Language (GML), Keyhole Markup Language (KML), and GeoRSS. MarkLogic’s high-performance engine allows fast querying over billions of records.

Visualization is an important part of any geospatial interface, and MarkLogic Server can integrate with any geographic-aware product such as ESRI, Google Earth, Google Maps, Yahoo! Maps, and Microsoft Live Search Maps. Heat maps are also supported so users can see a distribution of data over a geographic area and instantly refine those results.

A Complete Platform for Building Information Applications
MarkLogic Server’s enterprise-class database capabilities, extensive full-text and XML search, and robust XQuery development environment provide a complete platform for building information applications. Because MarkLogic Server is XML top-to-bottom, application developers can more quickly develop better applications. They can build end-to-end applications in simple infrastructures, or they can integrate with existing applications in Java or .NET, or integrate with a Service-Oriented Architecture (including SOAP or REST interfaces). They create more efficient and reliable applications that incorporate built-in business logic, expose powerful search capabilities, and ensure users get the most up-to-date and correct results.

Figure 5. XML servers are comprehensive platforms for application development.

A Robust Application Development Platform
The MarkLogic Server architecture provides a complete platform for developing information applications. In addition to its extensive search capabilities and enterprise-class DBMS, MarkLogic Server includes a web server for serving XHTML to web browsers. Developers can create complete applications and define functional libraries for reuse. Organizations can create centralized repositories of information within MarkLogic Server and rapidly develop multiple applications, efficiently reusing and repurposing information across them.

The completeness of MarkLogic Server’s XML and XQuery implementations minimizes the need to use other languages such as Java to complete the data processing. This is important because there is a performance penalty that comes with converting XML to objects for Java processing. This XML conversion, often referred to as the X/O impedance mismatch problem, is a common problem if you need to program in Java against XML.

Applications that deliver information to a browser in particular are more efficient when information remains in XML and does
not have to be transformed into other formats or storage models. MarkLogic Server outputs XML directly to eXtensible Hypertext Markup Language (XHTML), which can be rendered by web browsers without the need for expensive transformations and allows for easy modification of application user interfaces. In addition to XHTML, MarkLogic Server can render final output in any format required (such as Microsoft Word or PDF) using native capabilities or third-party conversion tools. Furthermore, MarkLogic Server includes a full, embedded web server, allowing it to serve HTML or XHTML directly to the browser and avoid potential bottlenecks and latency issues.

MarkLogic Server also avoids the need for a thick “middle layer” of information manipulation code. Instead of getting URLs from a search engine, fetching documents, parsing them, and navigating the Document Object Model (DOM) to find required elements, MarkLogic Server lets users simply request the elements they need in XQuery and returns them directly. Additionally, developers are able to keep business logic close to the information, making it more efficient to write and run. You can, of course, still build a middle layer with Java or .NET through standard APIs.

Developer-Friendly Environment
As developers create more and more complex applications using XQuery, it is important that their platforms provide tools and interfaces for optimizing and debugging applications. MarkLogic Server provides a developer-friendly environment including interfaces that profile XQuery programs to reveal call trees and execution times for each XQuery function executed. This enables developers to optimize code and identify potential bottlenecks. MarkLogic also has integrations with XQuery debugging environments such as Eclipse and oXygen. These help developers find code errors more efficiently.

With applications built with MarkLogic Server, businesses and government organizations are taking the use and reuse of their XML information to new levels with innovative information applications.

Ease of Integration with Standard Interfaces
Since MarkLogic Server conforms to XML standards and operates natively in XML for input, storage, and delivery, it can easily integrate with any third-party product that supports XML. It is commonly used as a “hub” in a system architecture linking different applications which speak XML, enabling them to work together. It does this easily because it can consume XML message traffic as-is and can speak all the different flavors of XML spoken by the various systems. This speeds and eases integration of disparate applications in large system architectures. Additionally, MarkLogic has prebuilt interfaces to Java and .NET to allow for integration into those development frameworks.

Rapid Application Development
MarkLogic Server supports rapid application development with Application Services, a set of high-level APIs and interactive tools. Application Services consists of three components. The Search API provides a high-level library that developers can use to quickly incorporate core search functionality, such as faceted navigation, dynamic result snippets, and automatic term suggestions, in their own applications. The MarkLogic Library Services API is a way for developers to add document management functionality such as versioning and check-in/check-out to their information applications. Finally, MarkLogic Application Builder is a browser-based tool designed to rapidly prototype applications through agile development cycles, and ultimately to build baseline applications that can be further customized. All components of Application Services leverage best practices for performance and scalability, freeing application developers to focus on their applications’ unique features, not the underlying search and management infrastructure.
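The earlier point about requesting only the elements you need, rather than fetching whole documents and walking the DOM by hand, can be illustrated with path expressions. The sketch below uses Python's standard ElementTree module and its limited XPath support purely as a stand-in; it is not XQuery and not a MarkLogic API, and the sample book document is hypothetical. The three queries mirror the path examples used in this paper (/book/chapter//title, an exact title value, and titles containing the word "dog").

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
BOOK = """<book>
  <title>Animal Stories</title>
  <chapter>
    <title>The Red Book</title>
    <section><title>A dog and a bone</title></section>
  </chapter>
  <chapter><title>Dog days</title></chapter>
</book>"""

book = ET.fromstring(BOOK)  # the root element is <book>

# "/book/chapter//title": every title at any depth below a chapter.
chapter_titles = book.findall("./chapter//title")

# "/book/chapter/title" narrowed to an exact element value.
red = [t for t in book.findall("./chapter/title") if t.text == "The Red Book"]

# Titles below a chapter that contain the word "dog" (case-insensitive).
dog = [t.text for t in chapter_titles if "dog" in t.text.lower().split()]
```

Each query returns the matching elements themselves rather than a pointer to the enclosing document, which is the essence of fine-grained retrieval. Note that the book's own top-level title is not returned, because it is not on a path below a chapter.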
Toolkits and Connectors
MarkLogic Server’s set of toolkits and connectors enables rapid integration with existing IT infrastructures. MarkLogic Connector for SharePoint enables information in SharePoint to be integrated into MarkLogic Server for additional processing and delivery. The Office toolkits (MarkLogic Toolkit for Word, MarkLogic Toolkit for PowerPoint, and MarkLogic Toolkit for Excel) enable Office files to be easily reused in custom documents with a search-and-insert interface built into the Office interfaces. This set of integration tools promotes component content reuse, document tracking, auditing and reporting, and extensive delivery options. Custom dynamic publishing infrastructures can be built with MarkLogic Server that leverage collections of existing Office files and require minimal training to increase end-user productivity through easy information reuse.
Conclusion
MarkLogic Server overcomes the structure, size, and semantic challenges related to information through its superior architecture and XML capabilities. The universal index, integrated full-text and XML search, and high scalability help organizations maximize the value of information. MarkLogic Server’s complete application framework provides the industry’s leading XQuery implementation, which means organizations can confidently rely on it to meet their needs today and grow with them as their information and business needs evolve.
MarkLogic Corporation
www.marklogic.com
sales@marklogic.com
Headquarters
999 Skyway Road, Suite 200
San Carlos, CA 94070
+1 650 655 2300
New York, NY
5 Penn Plaza
23rd Floor
New York, NY 10001
+1 646 378 2104
Washington, D.C.
1600 Tysons Boulevard
8th Floor
McLean, VA 22102
+1 703 245 8590
United Kingdom
3000 Hillswood Drive
Hillswood Business Park
Chertsey, Surrey, KT16 0RS
United Kingdom
+44 (0) 1932 796 400
Version 2, January 2010
© Copyright 2010 MarkLogic Corporation. MarkLogic is a registered trademark and MarkLogic Server is a trademark of MarkLogic Corporation, all rights reserved. All other product names mentioned herein are the property of their respective owners.

