Tackling Large Scale Data in Government
by Jon Zuanich, November 02, 2010

This is a guest post provided by Booz Allen Hamilton data analysis consultant Aaron Cordova. Aaron specializes in large-scale distributed data processing systems.

Blog Working within the U.S. federal government arena provides plenty of opportunities to encounter
large-scale data analysis. Projects ranging from massive health studies to high-velocity network
Avro (3) security events to new sources of imagery and video represent a huge increase in the amount of
Careers (4) data that must be not only stored but processed quickly and efficiently. These challenges are at
once a daunting and exciting chance to turn data into a positive impact for the country. The large-
Community (29)
scale data processing technologies created and supported by Cloudera play a big part in answering
Data Collection (9)
that challenge.
Distribution (21)
Flume (2)
Often our clients have an immediate need to analyze the data at hand, to discover patterns, reveal threats, monitor critical systems, and make decisions about the direction the organization should take. Several constraints are always present: the need to implement new analytics quickly enough to capitalize on new data sources, limits on the scope of development efforts, and the pressure to expand mission capability without an increase in budgets. For many of these organizations, the large data processing stack (which includes the simplified programming model MapReduce, distributed file systems, semi-structured stores, and integration components, all running on commodity-class hardware) has opened up a new avenue for scaling out efforts and enabling analytics that were impossible in previous architectures.

We have found this new ecosystem to be remarkably versatile at handling various types of data and classes of analytics. When working to help solve clients’ large-scale data analysis problems we first take a comprehensive look at their existing environment, the resources available, the nature of the data sources, and the immediate questions that must be answered of the data. Usually a large data processing solution will be composed of several pieces that are combined into a system providing the desired capability. This can range from real-time tipping to vast index and query capabilities to periodic, deep analysis of all the data available. Constructing a solution almost always requires one or more new and highly scalable components from the large-scale data analysis software stack, and integration with conventional data storage and processing software. Having the ability to pick and choose which elements of the stack to include, and having well-defined interfaces and in some cases interoperability standards, is essential to making the system work. This is a major reason that we value the open source community and the concept of the large-scale data analysis ecosystem.

Perhaps the most exciting benefit of moving to these highly scalable architectures, however, is that once we’ve solved the immediate issues, often with a system that can handle today’s requirements and scale up by 10x or more, new analytics and capabilities become incredibly easy to develop, evaluate, and integrate thanks to the speed and ease of MapReduce, Pig, Hive, and other technologies. More than ever, the large-scale data analysis software stack is proving to be a platform for innovation.

The response to the challenge of large-scale data analysis continues to emerge, and there is room for ongoing innovation. One example: as large-scale data systems, or clouds, become more numerous, the task of integrating analysis across those clouds remains an area of open research. Even integrating data sources, existing systems, and delivery mechanisms within departments of the same enterprise can be a challenge and may require new solutions.

Recently, when researching the problem of large-scale biometric search, we at Booz Allen realized the need for a highly scalable and low-latency method of performing fuzzy matching, i.e., returning the most similar items when there is no exact match, in response to growing databases of
fingerprints and requirements to identify individuals quickly. It was clear that Hadoop would
provide a good platform on which to build a new distributed fuzzy matching capability for several
reasons. The ability to run MapReduce over all stored biometrics would help in clustering the data
to reduce the search space. The data distribution and replication features of HDFS would provide a
reliable storage system with enough parallelism to support fast queries running on multiple
machines. The result of our research is a system we developed called FuzzyTable.

FuzzyTable is a large-scale, low-latency, parallel fuzzy-matching database built over Hadoop. It can use any matching algorithm that can compare two items, often high-dimensional, and return a similarity score. This makes it suitable not only for comparing fingerprints but also other biometric modalities, images, audio, and anything that can be represented as a vector of features.
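
As a rough illustration of that pluggability, the contract for a matcher might look something like the Java interface below. This is only a sketch inferred from the description above; the actual FuzzyTable API was not published with this post, and the names here are hypothetical.

    /**
     * Hypothetical sketch of a pluggable matcher contract, inferred from the
     * description above; not the real FuzzyTable interface.
     */
    public interface FuzzyMatcher {
      /**
       * Compare two items, each represented as a (typically high-dimensional)
       * feature vector, and return a similarity score where larger means
       * more similar.
       */
      double similarity(float[] a, float[] b);
    }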

Fuzzy matching involves extracting features from the data of interest and running a given fuzzy-matching, distance, or similarity algorithm over the resulting feature vectors to produce a numerical score.
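
For example, a simple cosine similarity over dense feature vectors is one such algorithm. The class below is illustrative only, not FuzzyTable’s actual matcher; the name is hypothetical, and the method matches the shape of the FuzzyMatcher sketch above.

    /** Illustrative cosine-similarity matcher over dense feature vectors. */
    public class CosineMatcher {
      /** Returns a score in [-1, 1]; 1 means the vectors point the same way. */
      public static double similarity(float[] a, float[] b) {
        if (a.length != b.length) {
          throw new IllegalArgumentException("feature vectors must have equal length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
          return 0.0; // define similarity against an all-zero vector as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }
    }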

Our work involved developing two major components – a clustering process that we use to reduce
the total search space for each query, and a client-server system for performing fuzzy matching on
demand and in parallel across a Hadoop instance.

In the first step, we use two clustering algorithms – canopy clustering and k-means – from the Apache Mahout project to assign each biometric to a bin. Each bin contains biometrics that are statistically similar. Each bin also has a ‘mean biometric’ that represents an average of the biometrics contained in that bin.
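
The ‘mean biometric’ is simply the centroid of the feature vectors assigned to a bin. In practice Mahout’s canopy and k-means jobs compute and emit these cluster centers themselves; the method below is just a minimal sketch of the idea, with hypothetical names.

    import java.util.List;

    /** Minimal sketch of the 'mean biometric' idea: the centroid of one bin. */
    public class MeanBiometric {
      public static float[] centroid(List<float[]> bin) {
        if (bin.isEmpty()) {
          throw new IllegalArgumentException("bin must contain at least one biometric");
        }
        int dim = bin.get(0).length;
        float[] mean = new float[dim];
        for (float[] vector : bin) {
          for (int i = 0; i < dim; i++) {
            mean[i] += vector[i];
          }
        }
        for (int i = 0; i < dim; i++) {
          mean[i] /= bin.size();
        }
        return mean;
      }
    }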

When performing queries looking for the best match, we first find the bin that a given biometric scores closest to by comparing it to the list of ‘mean biometrics’ from our k-means processing. This allows us to avoid searching a large portion of the data set and to search only the bin (or small number of bins) that contains the items most similar to the biometric in question.
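
In code, that first step amounts to a nearest-centroid search over the relatively small list of mean biometrics. The sketch below reuses the CosineMatcher sketch above and assumes higher scores mean more similar; class and method names are hypothetical.

    import java.util.List;

    /** Sketch of picking the bin whose 'mean biometric' a query scores closest to. */
    public class BinSelector {
      public static int closestBin(float[] query, List<float[]> meanBiometrics) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < meanBiometrics.size(); i++) {
          double score = CosineMatcher.similarity(query, meanBiometrics.get(i));
          if (score > bestScore) {
            bestScore = score;
            best = i;
          }
        }
        return best; // index of the bin to search
      }
    }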

The FuzzyTable client then looks up the location of all blocks or chunks that contain biometrics for that bin by querying the FuzzyTable master server. Blocks are not replicas; rather, they contain different sets of data (although using replication to improve the number of concurrent queries is possible). These blocks live on several HDFS DataNodes so that a single bin can be searched by several machines at once.
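
The kind of metadata lookup involved here can be built on a standard HDFS API: FileSystem.getFileBlockLocations reports which DataNodes hold each block of a file. The sketch below shows that call in isolation; the bin-per-file layout and class name are assumptions, not FuzzyTable’s actual design.

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Sketch: ask HDFS which machines hold the blocks of a (hypothetical) bin file. */
    public class BinBlockLocator {
      public static void printBlockHosts(String binFile) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(binFile));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + Arrays.toString(block.getHosts()));
        }
      }
    }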

Finally, the FuzzyTable client submits the biometric query and the ID of the closest bin to FuzzyTable query servers running alongside the HDFS DataNodes. These servers sift through the blocks of the closest bin, comparing the query biometric to each stored biometric, and return the most similar matches. The client collects all the matches from the FuzzyTable query servers and displays a ranked list of matches to the user.
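
A minimal sketch of that per-server scan-and-rank step, under the same assumptions as the earlier snippets: compare the query against every stored biometric in the server’s local slice of the bin, keep a ranked list, and let the client merge the per-server lists. It again reuses the CosineMatcher sketch; the Match class and method names are hypothetical.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    /** Sketch of the scan a query server performs over its local slice of a bin. */
    public class BinScanner {
      public static class Match {
        public final String id;     // identifier of the stored biometric
        public final double score;  // similarity to the query
        public Match(String id, double score) { this.id = id; this.score = score; }
      }

      public static List<Match> topMatches(float[] query, List<String> ids,
                                           List<float[]> stored, int k) {
        List<Match> matches = new ArrayList<Match>();
        for (int i = 0; i < stored.size(); i++) {
          matches.add(new Match(ids.get(i), CosineMatcher.similarity(query, stored.get(i))));
        }
        // Highest similarity first; the client merges these per-server lists
        // into one ranked result for the user.
        Collections.sort(matches, new Comparator<Match>() {
          public int compare(Match x, Match y) {
            return Double.compare(y.score, x.score);
          }
        });
        return matches.subList(0, Math.min(k, matches.size()));
      }
    }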

All this takes place in a few seconds. Because Hadoop can distribute our data automatically across multiple machines, we only had to write the code to run comparisons on local data. HDFS exposes the location of data blocks, making it possible to run the comparisons only on the machines where the data resides, which is also the key to the speed of MapReduce.

The following graph shows how query times (in milliseconds) improved dramatically as we added
machines. After about seven machines, our query times became dominated by fixed costs that did
not increase with the number of machines. Our test cluster could easily store several times more
than our test data before we would see query times increase again.

Large-scale data stack components such as Hadoop and Mahout proved to be a perfect platform on
which to create this new capability. Undoubtedly the trend of innovation will continue as we further
explore the possibilities of these new reliable and easy-to-use software platforms that achieve
scalability through parallel computation over distributed shared-nothing systems.

Booz Allen Hamilton has been at the forefront of strategy and technology consulting for nearly a
century. Providing a broad range of services in strategy and organization, technology, operations,
and analytics, Booz Allen is committed to delivering results that endure. To learn more, visit
www.boozallen.com.

Aaron Cordova is a data analysis consultant at Booz Allen Hamilton, who specializes in large-scale
distributed data processing systems.


Filed under: general, hadoop

5 Responses

BEN STANDEFER / NOVEMBER 23, 2010 / 12:19 PM

I saw this talk at Hadoop World. It was really, really exciting, especially because I thought that at the end Aaron would announce that he’s open-sourced all of the stuff he got us so amped about. Aaron, are there any plans of open-sourcing the generic parts of this? I think it would be very interesting and there would be a lot of interest.

-Ben Standefer

KIM VOGT / NOVEMBER 23, 2010 / 1:16 PM

+1 for open sourcing all or parts of this

AARON CORDOVA / NOVEMBER 23, 2010 / 2:58 PM

Ben,

Thanks for the interest. We are indeed working on making this project open source. Right now we’re trying to determine how to best do so. I’ll post updates as we near a decision.

If you have a preference in terms of licensing terms, etc., please share and we will take that into consideration.

Thanks,

Aaron Cordova

BEN STANDEFER / NOVEMBER 23, 2010 / 3:51 PM

Wow, that would be awesome Aaron. Thanks for the response. It all
looks really awesome and I would love to take a closer look at it.
Good luck on getting through the hoops to open source it. Let me
know if I can help in any way.

-Ben Standefer

