
How does Hive compare to HBase?

I'm interested in finding out how the recently-released Hive (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.


hadoop hbase hive

asked Aug 23 '08 at 12:22 by mrhahn

5 Answers

It's hard to find much about Hive, but I found this snippet on the Hive site that leans heavily in favor of HBase (bold added):

"Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours."

Since HBase and Hypertable are all about performance (being modeled on Google's BigTable), they sound like they would certainly be much faster than Hive, at the cost of functionality and a higher learning curve (e.g., they don't have joins or the SQL-like syntax).

(accepted answer) answered Aug 30 '08 at 22:16 by Chris Bunch

From one perspective, Hive consists of five main components: a SQL-like grammar and parser, a query planner, a query execution engine, a metadata repository, and a columnar storage layout. Its primary focus is data warehouse-style analytical workloads, so low latency retrieval of values by key is not necessary. HBase has its own metadata repository and columnar storage layout. It is possible to author HiveQL queries over HBase tables, allowing HBase to take advantage of Hive's grammar and parser, query planner, and query execution engine. See http://wiki.apache.org/hadoop/Hive/HBaseIntegration for more details.

answered Jun 4 '10 at 4:38 by Jeff Hammerbacher
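To make the integration concrete, here is a rough sketch (not part of the original answer) of creating and querying an HBase-backed table through HiveQL from Java, assuming the Hive JDBC driver and the HBase storage handler described on that wiki page; the table name, column family, and connection details are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Runs HiveQL against an HBase-backed table through Hive's JDBC driver (a sketch).
public class HiveOverHBaseExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Register an external Hive table backed by an existing HBase table
        // (hypothetical table/column names; mapping syntax per the HBaseIntegration wiki).
        stmt.execute(
            "CREATE EXTERNAL TABLE hbase_pageviews (key STRING, views INT) " +
            "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' " +
            "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,stats:views') " +
            "TBLPROPERTIES ('hbase.table.name' = 'pageviews')");

        // Hive's parser, planner, and execution engine now run the query over HBase.
        ResultSet rs = stmt.executeQuery("SELECT key, views FROM hbase_pageviews WHERE views > 100");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        conn.close();
    }
}

Since Hive still executes such queries as batch MapReduce jobs, the latency caveats quoted in the accepted answer apply to them as well.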

Hive is an analytics tool. Just like Pig, it was designed for ad hoc batch processing of potentially enormous amounts of data by leveraging MapReduce. Think terabytes. Imagine trying to do that in a relational database...

HBase is a column-based key-value store based on BigTable. You can't do queries per se, though you can run MapReduce jobs over HBase. Its primary use case is fetching rows by key, or scanning ranges of rows. A major feature is being able to have data locality when scanning across ranges of row keys for a 'family' of columns.

edited Oct 7 '11 at 9:30 by Bolo, answered Jun 25 '09 at 21:38 by Tim

To my humble knowledge, Hive is more comparable to Pig. Hive is SQL-like and Pig is script based. Hive seems to be more complicated, with query optimization and execution engines, and it requires the end user to specify schema parameters (partitions etc). Both are intended to process text files or SequenceFiles.

HBase is for key-value data storage and retrieval... you can scan or filter on those key-value pairs (rows). You cannot do queries on (key, value) rows.

answered Jun 6 '10 at 5:09 by haijin

As of the most recent Hive releases, a lot has changed that requires a small update, as Hive and HBase are now integrated. What this means is that Hive can be used as a query layer to an HBase datastore. If people are looking for alternative HBase interfaces, Pig also offers a really nice way of loading and storing HBase data. Additionally, it looks like Cloudera Impala may offer substantial performance for Hive-based queries on top of HBase; they claim up to 45x faster queries over traditional Hive setups.
The Next-gen Databases

I'm learning traditional relational databases (with PostgreSQL) and doing some research I've come across some new types of databases. CouchDB, Drizzle, and Scalaris to name a few; what are going to be the next database technologies to deal with?
sql database nosql non-relational-database

asked Nov 12 '08 at 2:02 by Randin, edited Apr 30 '12 at 22:43 by Jakub Konecki

Could someone please update this question to refer to "databases" instead of "SQL"? – Rick Nov 12 '08 at 2:05
Even though Randin is using the term SQL incorrectly, I think that change would be against the spirit of peer editing. – Bill Karwin Nov 12 '08 at 2:29
Too late.. sorry Bill. Feel free to roll back my edit if you feel strongly. I made my change before you posted your comment. I think rephrasing it the way I did is both educational to the OP and more useful to the community. – SquareCog Nov 12 '08 at 2:31
Well, it's good to be correct. A tech writer friend of mine said, "you can't get the right answers unless you ask the right questions." – Bill Karwin Nov 12 '08 at 2:36
Ah, sorry about the misleading question, my knowledge of SQL and databases was non-existent when I asked the question. – Randin Mar 16 '09 at 2:53

8 Answers

I would say next-gen database, not next-gen SQL. SQL is a language for querying and manipulating relational databases. SQL is dictated by an international standard. While the standard is revised, it seems to always work within the relational database paradigm.

Here are a few new data storage technologies that are getting attention currently:

- CouchDB is a non-relational database. They call it a document-oriented database.
- Amazon SimpleDB is also a non-relational database accessed in a distributed manner through a web service.
- Amazon also has a distributed key-value store called Dynamo, which powers some of its S3 services. Dynomite and Kai are open source solutions inspired by Amazon Dynamo.
- BigTable is a proprietary data storage solution used by Google, and implemented using their Google File System technology. Google's MapReduce framework uses BigTable.
- Hadoop is an open-source technology inspired by Google's MapReduce, and serving a similar need, to distribute the work of very large scale data stores.
- Scalaris is a distributed transactional key/value store. Also not relational, and does not use SQL. It's a research project from the Zuse Institute in Berlin, Germany.
- RDF is a standard for storing semantic data, in which data and metadata are interchangeable. It has its own query language SPARQL, which resembles SQL superficially, but is actually totally different.
- Vertica is a highly scalable column-oriented analytic database designed for distributed (grid) architecture. It does claim to be relational and SQL-compliant. It can be used through Amazon's Elastic Compute Cloud.
- Greenplum is a high-scale data warehousing DBMS, which implements both MapReduce and SQL.
- XML isn't a DBMS at all, it's an interchange format. But some DBMS products work with data in XML format.
- ODBMS, or Object Databases, are for managing complex data. There don't seem to be any dominant ODBMS products in the mainstream, perhaps because of lack of standardization. Standard SQL is gradually gaining some OO features (e.g. extensible data types and tables).
- Drizzle is a relational database, drawing a lot of its code from MySQL. It includes various architectural changes designed to manage data in a scalable "cloud computing" system architecture. Presumably it will continue to use standard SQL with some MySQL enhancements.
- Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store, developed at Facebook by one of the authors of Amazon Dynamo, and contributed to the Apache project.
- Project Voldemort is a non-relational, distributed, key-value storage system. It is used at LinkedIn.com.
- Berkeley DB deserves some mention too. It's not "next-gen" because it dates back to the early 1990's. It's a popular key-value store that is easy to embed in a variety of applications. The technology is currently owned by Oracle Corp.

Also see this nice article by Richard Jones: "Anti-RDBMS: A list of distributed key-value stores." He goes into more detail describing some of these technologies.

Relational databases have weaknesses, to be sure. People have been arguing that they don't handle all data modeling requirements since the day it was first introduced. Year after year, researchers come up with new ways of managing data to satisfy special requirements: either requirements to handle data relationships that don't fit into the relational model, or else requirements of high-scale volume or speed that demand data processing be done on distributed collections of servers, instead of central database servers.

Even though these advanced technologies do great things to solve the specialized problem they were designed for, relational databases are still a good general-purpose solution for most business needs. SQL isn't going away.

I've written an article in php|Architect magazine about the innovation of non-relational databases, and data modeling in relational vs. non-relational databases. http://www.phparch.com/magazine/2010-2/september/
(accepted answer) edited Mar 5 at 14:40 by Emil, answered Nov 12 '08 at 2:24 by Bill Karwin

Hey Bill, we do tend to answer the same questions a lot.. your answer here is thorough enough I don't feel writing my own would be of much use -- want to add some info about Vertica et al, and Greenplum and friends, to make it more complete? – SquareCog Nov 12 '08 at 2:56
Oh and XML and Object databases.. I always forget about those.. – SquareCog Nov 12 '08 at 3:14
Thank you Bill for the thorough answer, I'll just stick with PostgreSQL for the time being. – Randin Nov 12 '08 at 10:50
PostgreSQL is a fine choice for RDBMS. Have fun! – Bill Karwin Nov 12 '08 at 16:46
Hey, thanks for the list! :) – hasen j Feb 22 '09 at 16:06

I'm missing graph databases in the answers so far. A graph or network of objects is common in programming and can be useful in databases as well. It can handle semi-structured and interconnected information in an efficient way. Among the areas where graph databases have gained a lot of interest are the semantic web and bioinformatics. RDF was mentioned, and it is in fact a language that represents a graph. Here are some pointers to what's happening in the graph database area:

- Graphs - a better database abstraction
- Graphd, the backend of Freebase
- Neo4j open source graph database engine
- AllegroGraph RDFstore
- Graphdb abstraction layer for bioinformatics
- Graphdb behind the Directed Edge recommendation engine

I'm part of the Neo4j project, which is written in Java but has bindings to Python, Ruby and Scala as well. Some people use it with Clojure or Groovy/Grails. There is also a GUI tool evolving.

edited Oct 8 '09 at 19:11, answered Mar 26 '09 at 22:26 by nawroth

How about db4o.com, an object database, but it's designed around managing object graphs. – Norman H Mar 16 '11 at 1:53
Object databases (OODB) are different from graph databases. Simply put, a graphdb won't tie your data directly to your object model. In a graphdb relationships are first-class citizens, while you'd have to implement that on your own in an OODB. In a graphdb you can have different object types representing different views on the same data. Graphdbs typically support things like finding shortest paths and the like. – nawroth Mar 16 '11 at 12:05
Cool, thanks for the clarification! – Norman H Mar 17 '11 at 13:47

Might not be the best place to answer with this, but I'd like to share this taxonomy of the NoSQL world created by Steve Yen (please find it at http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf):

(1) key-value cache: memcached, repcached, Coherence, Infinispan, eXtreme Scale, JBoss Cache, Velocity, Terracotta
(2) key-value store: Keyspace, Schema-Free, RAMCloud
(3) eventually consistent key-value store: Dynamo, Voldemort, Dynomite, SubRecord, MongoDb, Dovetaildb
(4) ordered key-value store: Tokyo Tyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
(5) data structures server: Redis
(6) tuple store: Gigaspaces, Coord, Apache River
(7) object database: ZopeDB, db4o, Shoal
(8) document store: CouchDB, Mongo, Jackrabbit, XML databases, ThruDB, CloudKit, Persevere, Riak (Basho), Scalaris
(9) wide columnar store: BigTable, HBase, Cassandra, Hypertable, KAI, OpenNeptune

answered Mar 19 '11 at 10:26 by Paolo Bozzola

For a look into what academic research is being done in the area of next-gen databases, take a look at this: http://www.thethirdmanifesto.com/

In regard to the SQL language as a proper implementation of the relational model, I quote from Wikipedia: "SQL, initially pushed as the standard language for relational databases, deviates from the relational model in several places. The current ISO SQL standard doesn't mention the relational model or use relational terms or concepts. However, it is possible to create a database conforming to the relational model using SQL if one does not use certain SQL features."

http://en.wikipedia.org/wiki/Relational_model (referenced in the section "SQL and the relational model" on March 28, 2010)

answered Mar 28 '10 at 11:15 by Norman H

Not to be pedantic, but I would like to point out that at least CouchDB isn't SQL-based. And I would hope that the next-gen SQL would make SQL a lot less... fugly and non-intuitive.

answered Nov 12 '08 at 2:05 by Jason Baker

A friend of mine said, "It's supposed to be hard to read! It's called code for a reason!" :-) – Bill Karwin Nov 12 '08 at 2:30
My brain is broken, I like SQL, too much looking at it grows on you :) – Robert Gould Nov 12 '08 at 2:35

There are special databases for XML like MarkLogic and Berkeley XML DB. They can index XML docs and one can query them with XQuery. I expect JSON databases; maybe they already exist. Did some googling but couldn't find one.

answered Mar 22 '09 at 17:30 by tuinstoel

There are a few that provide a JSON interface to the data. Terrastore is one example. – quikchange Jul 7 '10 at 18:35

SQL has been around since the early 1970s, so I don't think that it's going to go away any time soon. Maybe the new(-ish) SQL will be OQL (see http://en.wikipedia.org/wiki/ODBMS).

answered Nov 12 '08 at 2:15 by Christopher Edwards

I heard also about NimbusDB by Jim Starkey. Jim Starkey is the man who "created" Interbase, who worked on Vulcan (a Firebird fork) and who was at the beginning of Falcon for MySQL.

Hadoop examples?


I'm examining Hadoop as a possible tool with which to do some log analysis. I want to analyze several kinds of statistics in one run. Each line of my log files has all sorts of potentially useful data that I'd like to aggregate. I'd like to get all sorts of data out of the logs in a single Hadoop run, but the example Hadoop programs I see online all seem to total exactly one thing. This may be because every single example Hadoop program I can find just does word counts. Can I use Hadoop to solve two or more problems at once?

Are there other Hadoop examples, or a Hadoop tutorial out there, that don't solve the word count problem?
hadoop

asked Apr 9 '09 at 20:18 by Brandon Yarbrough, edited Mar 9 at 15:15 by j0k

As a general comment, I do seem to notice that Hadoop doesn't have a lot of examples floating around. Not sure why that is. – John Feminella Apr 9 '09 at 20:24

15 Answers

One of the best resources that I have found to get started is Cloudera. They are a startup company comprised of mainly ex-Google and ex-Yahoo employees. On their page there is a training section with lessons on the different technologies. I found that very useful in playing with straight Hadoop, Pig and Hive. They have a virtual machine that you can download that has everything configured and some examples that help you get coding. All of that is free in the training section. The only thing that I couldn't find is a tutorial on HBase. I have been looking for one for a while. Best of luck.

(accepted answer) answered May 7 '09 at 15:21 by Ryan H

Mark White's second edition has info on HBase, Pig and Hive. – C-x C-t Feb 18 '11 at 10:01

I'm finishing up a tutorial on processing Wikipedia pageview log files, several parts of which compute multiple metrics in one pass (sum of pageviews, trend over the last 24 hours, running regressions, etc.). The code is here: http://github.com/datawrangling/trendingtopics/tree/master The Hadoop code mostly uses a mix of Python streaming & Hive w/ the Cloudera distro on EC2...

answered Jun 30 '09 at 5:58 by Pete Skomoroch

I loved your tute Pete, especially the overview you gave at Hadoop World, awesome stuff! – mat kelcey Oct 20 '09 at 5:06
Here are two examples using Cascading (an API over Hadoop):

- A simple log parser
- Calculates arrival rate of requests

You can start with the second and just keep adding metrics. Cascading project site.

edited May 1 '11 at 20:07, answered Apr 10 '09 at 15:11 by cwensel

You can refer to Tom White's Hadoop book for more examples and use cases: http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732/

answered Oct 19 '10 at 10:20 by Pavan Yara

With the normal Map/Reduce paradigm, you typically solve one problem at a time. In the map step you typically perform some transformation or denormalization; in the Reduce step you often aggregate the map outputs.

If you want to answer multiple questions about your data, the best way to do it in Hadoop is to write multiple jobs, or a sequence of jobs that read the previous step's outputs. There are several higher-level abstraction languages or APIs (Pig, Hive, Cascading) that simplify some of this work for you, allowing you to write more traditional procedural or SQL-style code that, under the covers, just creates a sequence of Hadoop jobs.

answered Apr 23 '09 at 0:23 by Ilya Haykinson
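As a rough sketch of the "sequence of jobs" idea, here is a minimal driver that chains two Hadoop jobs, feeding the first job's output directory to the second as input (the paths are hypothetical, and the mapper/reducer classes are left at the identity defaults purely to keep the sketch self-contained; a real pipeline would plug in its own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: process the raw logs into an intermediate directory.
        // (Set your own mapper/reducer classes here in a real pipeline.)
        Job first = Job.getInstance(conf, "first-pass");
        first.setJarByClass(ChainedJobsDriver.class);
        first.setOutputKeyClass(LongWritable.class);
        first.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(first, new Path("/logs/raw"));            // hypothetical path
        FileOutputFormat.setOutputPath(first, new Path("/logs/intermediate")); // hypothetical path
        if (!first.waitForCompletion(true)) System.exit(1);

        // Job 2: read the first job's output directory as its input.
        Job second = Job.getInstance(conf, "second-pass");
        second.setJarByClass(ChainedJobsDriver.class);
        second.setOutputKeyClass(LongWritable.class);
        second.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(second, new Path("/logs/intermediate"));
        FileOutputFormat.setOutputPath(second, new Path("/logs/final"));       // hypothetical path
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}

Pig, Hive and Cascading generate essentially this kind of job chain for you from a higher-level script or query.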

There was a course taught by Jimmy Lin at the University of Maryland. He developed the Cloud9 package as a training tool. It contains several examples.

Cloud9 Documentation and Source

answered Jun 8 '09 at 18:29 by user119381

Amazon has a new service based on Hadoop; it's a great way to get started and they have some nice examples. http://aws.amazon.com/elasticmapreduce/

answered Apr 10 '09 at 15:14 by Maurice Flanagan

You can also follow the Cloudera blog; they recently posted a really good article about Apache log analysis with Pig.

answered Jul 8 '09 at 22:17 by Romain Rigaux

As the author of said article, I want to point out that it was written more from a "getting familiar with Pig" perspective than a "doing log parsing in Hadoop" perspective. There are more efficient and less verbose ways to do those things. But yeah, Pig is nice for this sort of stuff at large scale. – SquareCog Nov 25 '09 at 1:04

Have you looked at the wiki? You could try looking through the software in the contrib section, though the code for those will probably be hard to learn from. Looking over the page they seem to have a link to an external tutorial.

answered Apr 9 '09 at 21:45 by fuzzy-waffle

I'm sure you've solved your problem by now, but for those who still get redirected here from Google searching for examples, here is an excellent blog with hundreds of lines of working code: http://sujitpal.blogspot.com/

edited Nov 30 '09 at 15:12, answered Nov 27 '09 at 19:06 by alex

There are several examples using Ruby under Hadoop streaming in the wukong library. (Disclaimer: I am an author of same.) Besides the now-standard wordcount example, there's pagerank and a couple of simple graph manipulation scripts.

answered Apr 21 '09 at 19:18 by mrflip

For your given example I would recommend the following implementation:

In the map step you walk through the log line by line. In each line, you separate your relevant data from each other (something like split() I guess) and emit a key-value pair for each bit of information for every line. So if your log has a format like this:

(Timestamp) (A) (B) (C)
123          4   5   6
789          1   2   3

you could emit (A,4), (B,5), (C,6) for the first line and so forth for the other lines. Now you can even have parallel reducers! Each reducer collects the bits for a given category. You can tweak your Hadoop app so one reducer gets all "A"s and another one gets all "B"s. The reduce itself is like the typical word count ;-)

answered Feb 2 '11 at 18:53 by Peter Wippermann
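A minimal sketch of the mapper described in this answer, written against the newer org.apache.hadoop.mapreduce API (the column names and whitespace-separated log format are the hypothetical ones from the example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (column-name, value) pair per field of each log line,
// so a single reducer pass can aggregate every column at once.
public class FieldMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final String[] COLUMNS = {"A", "B", "C"}; // hypothetical column names

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+");
        // fields[0] is the timestamp; fields[1..3] are the A/B/C values.
        for (int i = 0; i < COLUMNS.length && i + 1 < fields.length; i++) {
            context.write(new Text(COLUMNS[i]), new IntWritable(Integer.parseInt(fields[i + 1])));
        }
    }
}

The matching reducer is then just a per-key sum, exactly like the word-count reducer the answer mentions.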

Apache have released a set of examples. You can find them at: http://svn.apache.org/repos/asf/hadoop/common/trunk/mapreduce/src/examples/org/apache/hadoop/examples/

answered Jul 13 '11 at 14:38 by Adrian Mouat

Two tools that can give a good starting point to solve the problem in the Hadoop way are Pig (which was already mentioned with a link above) and Mahout (machine learning libraries).

Regarding Mahout, you can read IBM's articles that give a very good introduction on what you can do "easily" with it: http://www.ibm.com/developerworks/java/library/j-mahout/ and http://www.ibm.com/developerworks/java/library/j-mahout-scaling/ It gives you the next thing (clustering, categorization...) that you would like to do with the accounting data that you can get from the likes of Pig or hand-written MapReduce code.

edited Jan 31 '12 at 13:06, answered Jan 31 '12 at 12:52 by Guy

Ilya said it well: folks usually write one job per task because the output from the mapper and reducers is very specific to the result you're after.

Also, at higher scale, jobs take longer and usually you'll run different jobs at different frequencies (and subsets of your data). Finally, it's a lot more maintainable. We've been spoiled by Hive for syslog and app log analysis. That might get you closer to the lightweight, ad hoc queries that would let you produce multiple results really quickly: http://help.papertrailapp.com/kb/analytics/loganalytics-with-hadoop-and-hive Passing multiple functions to a SELECT clause would probably accomplish what you're after, but you still may need a temporary table.

Parallelizing the Reduce in MapReduce

I understand how Map is easily parallelizable - each computer/CPU can just operate on a small portion of the array.

Is Reduce/foldl parallelizable? It seems like each computation depends on the previous one. Is it just parallelizable for certain types of functions?
multithreading optimization map multicore reduce

asked Nov 30 '08 at 21:44 by Claudiu

Give us some clues: what platform or programming language are you talking about? This doesn't sound like MPI. And what's a "foldl"? – Die in Sente Dec 1 '08 at 0:17
foldl is a left fold, or a fold with a left-associative operator: folding [1,2,3,4] with + would yield (((1 + 2) + 3) + 4). – Frank Shearar Apr 7 '10 at 7:47

6 Answers

If your reduction's underlying operation is associative*, you can play with the order of operations and locality. Therefore you often have a tree-like structure in the 'gather' phase, so you can do it in several passes in logarithmic time:

a + b + c + d
 \   /   \   /
 (a+b)   (c+d)
     \   /
 ((a+b)+(c+d))

instead of (((a+b)+c)+d)

If your operation is commutative, further optimizations are possible as you can gather in a different order (it may be important for data alignment when those operations are vector operations, for example).

[*] your real desired mathematical operations, not those on effective types like floats, of course.

(accepted answer) edited Dec 1 '08 at 14:07, answered Nov 30 '08 at 22:01 by Piotr Lesnicki

Do you mean "associative" rather than "commutative"? – Patrick McElhaney Dec 1 '08 at 2:38
You're right, thanks, I meant associative, corrected! But in fact it also helps if the operation is commutative, so that you can gather your chunks in any order (you do that for data alignment issues for example). – Piotr Lesnicki Dec 1 '08 at 14:02

Yes, if the operator is associative. For example, you can parallelise summing a list of numbers:

step 1: 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8
step 2: 3 + 7 + 11 + 15
step 3: 10 + 26
step 4: 36

This works because (a+b)+c = a+(b+c), i.e. the order in which the additions are performed doesn't matter.

answered Nov 30 '08 at 22:25 by Jules
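One way to realize this pairwise, tree-shaped summation on a multicore machine is a fork/join-style recursive split; a small Java sketch (the input array and threshold are illustration values only):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums an array by recursively splitting it in half and combining the
// partial sums, mirroring the pairwise reduction steps shown above.
public class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 4; // small on purpose, for illustration
    private final long[] values;
    private final int from, to;

    ParallelSum(long[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += values[i];
            return sum;
        }
        int mid = (from + to) / 2;
        ParallelSum left = new ParallelSum(values, from, mid);
        ParallelSum right = new ParallelSum(values, mid, to);
        left.fork();                      // compute left half asynchronously
        long rightSum = right.compute();  // compute right half in this thread
        return left.join() + rightSum;    // combine, relying on associativity of +
    }

    public static void main(String[] args) {
        long[] numbers = {1, 2, 3, 4, 5, 6, 7, 8};
        long total = new ForkJoinPool().invoke(new ParallelSum(numbers, 0, numbers.length));
        System.out.println(total); // prints 36, matching the steps above
    }
}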

Check out the combine phase in Hadoop: http://wiki.apache.org/hadoop/HadoopMapReduce

answered Nov 30 '08 at 23:00 by jganetsk

Technically a reduce is not the same as a foldl (fold-left), which can also be described as an accumulate.

The example given by Jules illustrates a reduce operation very well:

step 1: 1 + 2 + 3 + 4
step 2: 3 + 7
step 3: 10

Note that at each step the result is an array, including the final result which is an array of one item.

A fold-left is like the following:

step 0: a = 0
step 1: a = a + 1
step 2: a = a + 2
step 3: a = a + 3
step 4: a = a + 4
step 5: a

Now obviously these both produce the same results, but a foldl has a well-defined result when given a non-associative operator (like subtraction) whereas a reduce operator doesn't.

answered Feb 9 '10 at 2:10 by cdiggins

Subtraction's non-associative but it is left-associative (because 5 - 3 - 2 yields the same result as (5 - 3) - 2). But what happens if you give foldl a right-associative operator, or foldr a left-associative one, I wonder? – Frank Shearar Apr 7 '10 at 7:50
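One way to see the difference, for instance in Java (a sketch; the numbers are arbitrary): a left fold over subtraction is fully determined, while handing the same non-associative operator to a parallel reduce leaves the grouping, and therefore the result, unspecified.

import java.util.Arrays;
import java.util.List;

public class FoldVsReduce {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4);

        // Fold-left: a fixed left-to-right accumulation, well defined even for
        // a non-associative operator like subtraction: ((((0-1)-2)-3)-4) = -10
        int foldl = 0;
        for (int x : xs) {
            foldl = foldl - x;
        }
        System.out.println("foldl: " + foldl); // always -10

        // Parallel reduce: the runtime is free to regroup the operands, which is
        // only correct for associative operators. With subtraction the grouping
        // (and therefore the answer) is not guaranteed.
        int reduced = xs.parallelStream().reduce(0, (a, b) -> a - b);
        System.out.println("parallel reduce: " + reduced); // unspecified for '-'
    }
}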

Not sure what platform/language you're thinking of, but you can parallelize reduce operators like this:

// Original
result = null;
foreach(item in map) {
    result += item;
}

// Parallel
resultArray = array();
mapParts = map.split(numThreads);
foreach(thread) {
    result = null;
    foreach(item in mapParts[thread]) {
        result += item;
    }
    resultArray += result;  // Lock this!
}
waitForThreads();
reduce(resultArray);

As you can see, a parallel implementation is easily recursive. You split the map up, operate on each part in its own thread, then perform another reduce once those threads are done to bring the pieces together. (This is the programmatic reasoning behind Piotr Lesnicki's answer.)

answered Nov 30 '08 at 22:00 by strager

It depends on your Reduce step. In a Hadoop-style implementation of MapReduce, your Reducer gets called once per key, with all the rows relevant to that key.

So, for example, your Mapper might be taking in a lot of unordered web server logs, adding some metadata (e.g., geocoding), and emitting [key, record] pairs with a cookie ID as the key. Your Reducer would then be called once per cookie ID and would be fed all the data for that cookie, and could compute aggregate info such as visit frequency or average pages viewed per visit. Or you could key on geocode data, and gather aggregate stats based on geography.

Even if you're not doing per-key aggregate analysis - indeed, even if you're computing something over the whole set - it might be possible to break your computation into chunks, each of which could be fed to a Reducer.

How is MapReduce a good method to analyse http server logs?

I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on that topic, installed Hadoop on an array of virtual machines, and did some very interesting tests. I really think I understand the Map and Reduce steps.

But here is my problem: I can't figure out how it can help with http server log analysis. My understanding is that big companies (Facebook for instance) use MapReduce for the purpose of computing their http logs in order to speed up the process of extracting audience statistics out of these. The company I work for, while smaller than Facebook, has a big volume of web logs to compute every day (100 GB, growing between 5 and 10 percent every month). Right now we process these logs on a single server, and it works just fine. But distributing the computing jobs instantly comes to mind as a soon-to-be useful optimization. Here are the questions I can't answer right now; any help would be greatly appreciated:

- Can the MapReduce concept really be applied to weblog analysis?
- Is MapReduce the most clever way of doing it?
- How would you split the web log files between the various computing instances?

Thank you. Nicolas


distributed mapreduce logfile-analysis

asked Jun 2 '09 at 11:50 by Nicolas

2 Answers

Can the MapReduce concept really be applied to weblog analysis?

Yes.

You can split your huge logfile into chunks of say 10,000 or 1,000,000 lines (whatever is a good chunk for your type of logfile - for apache logfiles I'd go for a larger number), feed them to some mappers that would extract something specific (like Browser, IP Address, ..., Username, ...) from each log line, then reduce by counting the number of times each one appeared (simplified):

192.168.1.1,FireFox x.x,username1
192.168.1.1,FireFox x.x,username1
192.168.1.2,FireFox y.y,username1
192.168.1.7,IE 7.0,username1

You can extract browsers, ignoring version, using a map operation to get this list:

FireFox
FireFox
FireFox
IE

Then reduce to get this:

FireFox,3
IE,1

Is MapReduce the most clever way of doing it?

It's clever, but you would need to be very big in order to gain any benefit... splitting PETABYTES of logs.

To do this kind of thing, I would prefer to use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue, with jobs not executed in some timeframe made available for others to process. These clients would be small programs that do something specific. You could start with 1 client, and expand to 1000... You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual-core PCs...

With pull: you could have 100 or 10 clients working, multicore machines could have multiple clients running, and whatever a client finishes would be available for the next step. And you don't need to do any hashing or assignment for the work to be done. It's 100% dynamic.

How would you split the web log files between the various computing instances?

By number of elements or lines if it's a text-based logfile.

In order to test MapReduce, I'd like to suggest that you play with Hadoop.

(accepted answer) edited Jun 2 '09 at 13:21, answered Jun 2 '09 at 12:26 by Osama ALASSIRY

First of all, sorry for the delay. Thanks a lot for your very high-quality answer. It helps a lot! – Nicolas Jun 11 '09 at 10:01
As an alternative to splitting the log files, you could parallelize your "log analysis" script across n cores. And if you were to run this script on a virtualized cluster (of say, 96 cores), your code will run flawlessly without any changes. You need to identify and isolate the "smallest" unit of work that is side-effect free and deals with immutable data. This may require you to re-design code, possibly. Besides, Hadoop is comparatively harder to set up (and where I live, expertise is harder to find). – Imran.Fanaswala Apr 10 '10 at 20:27
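A minimal sketch of the browser-counting mapper and reducer described above, written against the newer org.apache.hadoop.mapreduce API (the comma-separated log format is the simplified one from the answer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BrowserCount {
    // Extracts the browser name (second field, version dropped) from each log line.
    public static class BrowserMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length < 2) return;             // skip malformed lines
            String browser = fields[1].split(" ")[0];  // "FireFox x.x" -> "FireFox"
            context.write(new Text(browser), ONE);
        }
    }

    // Sums the ones emitted for each browser, yielding e.g. "FireFox,3".
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text browser, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) total += c.get();
            context.write(browser, new IntWritable(total));
        }
    }
}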

Can the MapReduce concept really be applied to weblog analysis?

Sure. What sort of data are you storing?

Is MapReduce the most clever way of doing it?

It would allow you to query across many commodity machines at once, so yes it can be useful. Alternatively, you could try sharding.

How would you split the web log files between the various computing instances?

Generally you would distribute your data using a consistent hashing algorithm, so you can easily add more instances later. You should hash by whatever would be your primary key in an ordinary database. It could be a user id, an IP address, referer, page, advert; whatever is the topic of your logging.

Dynamic Nodes in Hadoop


Is it possible to add new nodes to Hadoop after it is started? I know that you can remove nodes (as the master tends to keep tabs on the node state).

java hadoop grid-computing

asked Jan 21 '10 at 21:10 by monksy

2 Answers

You can add new nodes by just booting up a new one with a proper hadoop-site.xml (one that points back to the master namenode and jobtracker).

That said, removing nodes is a really bad idea without rebalancing your HDFS file blocks to prevent removing all the dupes of a given block. If you drop three datanodes, you could lose all the dupes for a given block (that has a replication of 3, the default), thus corrupting the file the block belongs to. Removing two nodes could leave you with one replica, and it could be corrupt (known to happen with dying disks).

(accepted answer) answered Jan 22 '10 at 1:52 by cwensel

Well I want to build a cluster that may have unreliable clients [network connection etc]. – monksy Jan 22 '10 at 1:59
You're right, Hadoop isn't made for dynamic scaling, it's meant for fixed clusters. – monksy Jan 26 '10 at 9:47

I think as long as you don't use them as datanodes you wouldn't have an issue... of course the data locality aspect of Hadoop is gone at that point.

What is the use of the 'key K1' in the org.apache.hadoop.mapred.Mapper?

I'm learning Apache Hadoop and I was looking at the WordCount example org.apache.hadoop.examples.WordCount. I've understood this example, however I can see that the variable LongWritable key was not used in

(...)
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
    }
}
(...)

What is the use of this variable? Could someone give me a simple example where it would be used? Thanks

hadoop mapreduce

asked Apr 22 '09 at 18:12 by Pierre

2 Answers

When the InputFormat is TextInputFormat, the Key is the byte offset from the beginning of the current input file. Value is simply the line of text at that offset. If SequenceFileInputFormat was used, the Key would be whatever was stuffed into the Key position of the record. The same for Value. Bottom line is that the Key/Value types are dependent on the input type (text, sequence file, etc). ckw

(accepted answer) answered Apr 22 '09 at 22:06 by cwensel
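As a concrete (hypothetical) use of that key: a mapper can pass the byte offset through as its output value, e.g. to record where in the input file each word occurs. This sketch uses the same old org.apache.hadoop.mapred API as the question; it is an illustration, not part of the original WordCount:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, offset) pairs so the reducer can see the byte positions
// at which each word appears, instead of plain counts.
public class WordOffsetMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, key);  // key = byte offset of this line in the input file
        }
    }
}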

I could be wrong (I have read map/reduce tutorials but haven't used it for real projects yet), but I think in general it is the identifier of the input entry; for example, the tuple (file name, line number). In this particular case it's supposedly the line number, and it's of no interest for word counts. It could be used if the idea was to, say, aggregate word counts on a per-line, not per-file basis (or for multiple files if the key did contain that info).

What type of problems can mapreduce solve?


Is there a theoretical analysis available which describes what kind of problems mapreduce can solve?

parallel-processing mapreduce

asked Apr 1 '09 at 12:40 by amit-agrawal, edited Apr 1 '09 at 12:44 by Welbog

Total Ninja stuff. – James Andino Aug 23 '11 at 11:27

7 Answers


For problems requiring processing and generating large data sets. Say running an interest generation query over all accounts a bank holds. Say processing audit data for all transactions that happened in the past year in a bank. The best use case is from Google - generating the search index for the Google search engine.

answered Apr 1 '09 at 12:46 by sangupta

Many problems that are "Embarrassingly Parallel" (great phrase!) can use MapReduce. http://en.wikipedia.org/wiki/Embarrassingly_parallel

From this article... http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm ... Doug Cutting, founder of Hadoop (an open source implementation of MapReduce) says... "Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site"

and... "the tech team at The New York Times rented computing power on Amazon's cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months."

answered Apr 1 '09 at 13:30 by ChrisV

In Map-Reduce for Machine Learning on Multicore, Chu et al. describe "algorithms that fit the Statistical Query model can be written in a certain summation form, which allows them to be easily parallelized on multicore computers." They specifically implement 10 algorithms including e.g. weighted linear regression, k-Means, Naive Bayes, and SVM, using a map-reduce framework. The Apache Mahout project has released a recent Hadoop (Java) implementation of some methods based on the ideas from this paper.

answered May 18 '09 at 4:04 by bubaker

Anything that involves doing operations on a large set of data, where the problem can be broken down into smaller independent sub-problems whose results can then be aggregated to produce the answer to the larger problem. A trivial example would be calculating the sum of a huge set of numbers. You split the set into smaller sets, calculate the sums of those smaller sets in parallel (which can involve splitting those into yet even smaller sets), then sum those results to reach the final answer.

answered Apr 1 '09 at 13:02 by Eric Petroelje

You can also watch the videos @ Google, I'm watching them myself and I find them very educational.

answered Aug 14 '09 at 19:21 by Alix Axel

The answer really lies in the name of the algorithm. MapReduce is not a general-purpose parallel programming or batch execution framework, as some of the answers suggest. MapReduce is really useful when large data sets need to be processed (the Map phase) to derive certain attributes, and those derived attributes then need to be summarized (the Reduce phase).

answered Apr 12 '11 at 12:37 by user606308

Sort of a hello world introduction to MapReduce: http://blog.diskodev.com/parallel-processing-using-the-map-reduce-prog

Hadoop: map/reduce from HDFS

I may be wrong, but all(?) examples I've seen with Apache Hadoop take as input a file stored on the local file system (e.g. org.apache.hadoop.examples.Grep).

Is there a way to load and save the data on the Hadoop file system (HDFS)? For example, I put a tab-delimited file named 'stored.xls' on HDFS using hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls. How should I configure the JobConf to read it?

Thanks.
configuration input hadoop mapreduce

asked Apr 24 '09 at 19:45 by Pierre

3 Answers

JobConf conf = new JobConf(getConf(), ...);
...
FileInputFormat.setInputPaths(conf, new Path("stored.xls"));
...
JobClient.runJob(conf);
...

setInputPaths will do it.

(accepted answer) answered Apr 24 '09 at 20:21 by yogman

Thanks, but it throws an exception saying that "file:/home/me/workspace/HADOOP/stored.xls" (this is a local path) doesn't exist. The file on HDFS is in '/user/me/stored.xls'. I also tried new Path("/user/me/stored.xls") and it doesn't work either. – Pierre Apr 24 '09 at 20:41
First off, it's strange that Hadoop complained about "file:" rather than "hdfs:". It might be that your hadoop-site.xml is misconfigured. And, second, if that still doesn't work, mkdir input and put stored.xls in the "input" dir (all with the bin/hadoop fs command). And, new Path("input") instead of new Path("stored.xls"). – yogman Apr 24 '09 at 20:53
Revealing your command line to run the job wouldn't hurt. – yogman Apr 24 '09 at 20:54

Pierre, the default configuration for Hadoop is to run in local mode, rather than in distributed mode. You likely just need to modify some configuration in your hadoop-site.xml. It looks like your default filesystem is still localhost, when it should be hdfs://youraddress:yourport. Look at your setting for fs.default.name, and also see the setup help at Michael Noll's blog for more details.

answered May 9 '09 at 19:19 by Kevin Weil

FileInputFormat.setInputPaths(conf, new Path("hdfs://hostname:port/user/me/stored.xls"));

vote dow n vote This will do

Anybody knows a good extendable open source web-crawler? [closed]

The crawler needs to have an extendable architecture to allow changing the internal process, like implementing new steps (pre-parser, parser, etc...).

I found the Heritrix Project (http://crawler.archive.org/). But are there other nice projects like that?
open-source web-crawler

asked Jun 24 '09 at 17:29 by Zanoni

stackoverflow.com/questions/176820/ – Gavin Miller Jun 24 '09 at 17:30
@LFSR Consulting. They are for different purposes... – Zanoni Jun 24 '09 at 17:39

closed as off topic by Andrew Barber Mar 14 at 23:19



5 Answers

Nutch is the best you can do when it comes to a free crawler. It is built off of the concept of Lucene (in an enterprise-scaled manner) and is supported by the Hadoop back end using MapReduce (similar to Google) for large scale data querying. Great products! I am currently reading all about Hadoop in the new (not yet released) Hadoop in Action from Manning. If you go this route I suggest getting onto their technical review team to get an early copy of this title!

These are all Java based. If you are a .NET guy (like me!!) then you might be more interested in Lucene.NET, Nutch.NET, and Hadoop.NET, which are all class-by-class and API-by-API ports to C#.

(accepted answer) answered Jun 24 '09 at 18:00 by Andrew Siemer

+1 for Nutch and Hadoop, you can also look at Solr if you are looking for a distributed and scalable solution. – Sumit Ghosh Aug 12 '10 at 12:06
From the looks of it, Nutch.NET is completely non-existent and I couldn't even find a way to download it. – Mike Christensen Apr 12 '11 at 8:38
The same goes for Hadoop.NET, there isn't a single file for download. – Xavier Poinas May 10 '11 at 3:26

You may also want to try Scrapy: http://scrapy.org/ It is really easy to specify and run your crawlers.

answered Feb 11 '11 at 9:59 by fccoelho

I've recently discovered one called Nutch.

answered Jun 24 '09 at 17:32 by Artem Barger

If you're not tied down to a platform, I've had very good experiences with Nutch in the past. It's written in Java and goes hand in hand with the Lucene indexer.

answered Jun 24 '09 at 17:32 by Justin Niessner

Not sure why my last comment was deleted. Seems totally relevant, even to an old thread. Here it is again... Abot is a good extensible web-crawler. Every part of the architecture is pluggable, giving you complete control over its behavior. It's open source, free for commercial and personal use, written in C#. http://code.google.com/p/abot/
