I'm interested in finding out how the recently-released Hive (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.
mrhahn
5 Answers
It's hard to find much about Hive, but I found this snippet on the Hive site that leans heavily in favor of HBase (bold added): Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when the jobs are completed, as opposed to real-time queries. As a result it should not be compared with systems like Oracle, where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively, with the response times between iterations being less than a few minutes. For Hive queries, response times for even the smallest jobs can be of the order of 5-10 minutes, and for larger jobs this may even run into hours. Since HBase and Hypertable are all about performance (being modeled on Google's BigTable), they sound like they would certainly be much faster than Hive, at the cost of functionality and a higher learning curve (e.g., they don't have joins or the SQL-like syntax).
accepted
From one perspective, Hive consists of five main components: a SQL-like grammar and parser, a query planner, a query execution engine, a metadata repository, and a columnar storage layout. Its primary focus is data warehouse-style analytical workloads, so low latency retrieval of values by key is not necessary. HBase has its own metadata repository and columnar storage layout. It is possible to author HiveQL queries over HBase tables, allowing HBase to take advantage of Hive's grammar and parser, query planner, and query execution engine. See http://wiki.apache.org/hadoop/Hive/HBaseIntegration for more details.
answered Jun 4 '10 at 4:38
Hive is an analytics tool. Just like Pig, it was designed for ad hoc batch processing of potentially enormous amounts of data by leveraging MapReduce. Think terabytes. Imagine trying to do that in a relational database...

HBase is a column-based key-value store based on BigTable. You can't do queries per se, though you can run MapReduce jobs over HBase. Its primary use case is fetching rows by key, or scanning ranges of rows. A major feature is being able to have data locality when scanning across ranges of row keys for a 'family' of columns.
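To make "fetch by key / scan a range of rows" concrete, here is a toy Python model of that access pattern. This is not the HBase API, just an illustration of the semantics; the key and column names are invented for the sketch.

```python
import bisect

class ToyKeyValueStore:
    """Toy model of HBase-style access: get a row by key, scan a key range."""

    def __init__(self):
        self.keys = []   # row keys kept sorted, as HBase stores them
        self.rows = {}   # key -> {column: value}

    def put(self, key, columns):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows.setdefault(key, {}).update(columns)

    def get(self, key):
        """Point lookup by row key."""
        return self.rows.get(key)

    def scan(self, start, stop):
        """Yield rows whose keys fall in [start, stop), in key order."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        for k in self.keys[lo:hi]:
            yield k, self.rows[k]

store = ToyKeyValueStore()
store.put("user#100", {"f:name": "alice"})
store.put("user#200", {"f:name": "bob"})
store.put("user#300", {"f:name": "carol"})

print(store.get("user#200"))                               # point lookup
print([k for k, _ in store.scan("user#100", "user#300")])  # range scan
```

Because the keys are sorted, a range scan touches only contiguous keys, which is what gives HBase its data locality on scans.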
edited Oct 7 '11 at 9:30, answered Jun 25 '09 at 21:38
Tim
To my humble knowledge, Hive is more comparable to Pig. Hive is SQL-like and Pig is script-based. Hive seems to be more complicated, with query optimization and execution engines, and it requires the end user to specify schema parameters (partitions etc.). Both are intended to process text files or SequenceFiles.

HBase is for key-value data storage and retrieval... you can scan or filter on those key-value pairs (rows). You cannot do queries on (key, value) rows.
answered Jun 6 '10 at 5:09
haijin
As of the most recent Hive releases, a lot has changed that requires a small update: Hive and HBase are now integrated. This means that Hive can be used as a query layer to an HBase datastore. If people are looking for alternative HBase interfaces, Pig also offers a really nice way of loading and storing HBase data. Additionally, it looks like Cloudera Impala may offer substantial performance improvements for Hive-style queries on top of HBase; they claim up to 45x faster queries over traditional Hive setups.
I'm learning traditional relational databases (with PostgreSQL), and doing some research I've come across some new types of databases: CouchDB, Drizzle, and Scalaris, to name a few. What is going to be the next database technology to deal with?

Randin
Could someone please update this question to refer to "databases" instead of "SQL"? - Rick Nov 12 '08 at 2:05
Even though randin is using the term SQL incorrectly, I think that change would be against the spirit of peer editing. - Bill Karwin Nov 12 '08 at 2:29
Too late.. sorry Bill. Feel free to roll back my edit if you feel strongly. I made my change before you posted your comment. I think rephrasing it the way I did is both educational to the OP and more useful to the community. - SquareCog Nov 12 '08 at 2:31
Well, it's good to be correct. A tech writer friend of mine said, "you can't get the right answers unless you ask the right questions."
8 Answers
I would say next-gen databases, not next-gen SQL. SQL is a language for querying and manipulating relational databases. SQL is dictated by an international standard. While the standard is revised, it seems to always work within the relational database paradigm.

accepted
Here are a few new data storage technologies that are getting attention currently:
- CouchDB is a non-relational database. They call it a document-oriented database.
- Amazon SimpleDB is also a non-relational database, accessed in a distributed manner through a web service. Amazon also has a distributed key-value store called Dynamo, which powers some of its S3 services.
- Dynomite and Kai are open source solutions inspired by Amazon Dynamo.
- BigTable is a proprietary data storage solution used by Google, implemented using their Google File System technology. Google's MapReduce framework uses BigTable.
- Hadoop is an open-source technology inspired by Google's MapReduce, serving a similar need: to distribute the work of very large-scale data stores.
- Scalaris is a distributed transactional key/value store. Also not relational, and it does not use SQL. It's a research project from the Zuse Institute in Berlin, Germany.
- RDF is a standard for storing semantic data, in which data and metadata are interchangeable. It has its own query language, SPARQL, which resembles SQL superficially but is actually totally different.
- Vertica is a highly scalable column-oriented analytic database designed for a distributed (grid) architecture. It does claim to be relational and SQL-compliant. It can be used through Amazon's Elastic Compute Cloud.
- Greenplum is a high-scale data warehousing DBMS, which implements both MapReduce and SQL.
- XML isn't a DBMS at all, it's an interchange format. But some DBMS products work with data in XML format.
- ODBMS, or object databases, are for managing complex data. There don't seem to be any dominant ODBMS products in the mainstream, perhaps because of a lack of standardization. Standard SQL is gradually gaining some OO features (e.g. extensible data types and tables).
- Drizzle is a relational database, drawing a lot of its code from MySQL. It includes various architectural changes designed to manage data in a scalable "cloud computing" system architecture. Presumably it will continue to use standard SQL with some MySQL enhancements.
- Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store, developed at Facebook by one of the authors of Amazon Dynamo, and contributed to the Apache project.
- Project Voldemort is a non-relational, distributed, key-value storage system. It is used at LinkedIn.com.
- Berkeley DB deserves some mention too. It's not "next-gen" because it dates back to the early 1990s. It's a popular key-value store that is easy to embed in a variety of applications. The technology is currently owned by Oracle Corp.

Also see this nice article by Richard Jones: "Anti-RDBMS: A list of distributed key-value stores." He goes into more detail describing some of these technologies.

Relational databases have weaknesses, to be sure. People have been arguing that they don't handle all data modeling requirements since the day the relational model was first introduced.
Year after year, researchers come up with new ways of managing data to satisfy special requirements: either requirements to handle data relationships that don't fit into the relational model, or else requirements of high-scale volume or speed that demand data processing be done on distributed collections of servers instead of central database servers. Even though these advanced technologies do great things to solve the specialized problems they were designed for, relational databases are still a good general-purpose solution for most business needs. SQL isn't going away.
I've written an article in php|Architect magazine about the innovation of non-relational databases, and data modeling in relational vs. non-relational databases. http://www.phparch.com/magazine/2010-2/september/
edited Mar 5 at 14:40, answered Nov 12 '08 at 2:24
Emil
Hey Bill, we do tend to answer the same questions a lot.. your answer here is thorough enough I don't feel writing my own would be of much use -- want to add some info about Vertica et al, and Greenplum and friends, to make it more complete? - SquareCog Nov 12 '08 at 2:56
Oh, and XML and object databases.. I always forget about those.. - SquareCog Nov 12 '08 at 3:14
Thank you Bill for the thorough answer, I'll just stick with PostgreSQL for the time being. - Randin Nov 12 '08 at 10:50
PostgreSQL is a fine choice for RDBMS. Have fun! - Bill Karwin Nov 12 '08 at 16:46
Hey, thanks for the list! :) - hasen j Feb 22 '09 at 16:06
I'm missing graph databases in the answers so far. A graph or network of objects is common in programming and can be useful in databases as well. It can handle semi-structured and interconnected information in an efficient way. Among the areas where graph databases have gained a lot of interest are the semantic web and bioinformatics. RDF was mentioned, and it is in fact a language that represents a graph. Here are some pointers to what's happening in the graph database area:

- Graphs - a better database abstraction
- Graphd, the backend of Freebase
- Neo4j open source graph database engine
- AllegroGraph RDFstore
- Graphdb abstraction layer for bioinformatics
- Graphdb behind the Directed Edge recommendation engine

I'm part of the Neo4j project, which is written in Java but has bindings to Python, Ruby and Scala as well. Some people use it with Clojure or Groovy/Grails. There is also a GUI tool evolving.
edited Oct 8 '09 at 19:11, answered Mar 26 '09 at 22:26
nawroth

How about db4o.com, an object database, but it's designed around managing object graphs. - Norman H Mar 16 '11 at 1:53
Object databases (OODB) are different from graph databases. Simply put, a graphdb won't tie your data directly to your object model. In a graphdb, relationships are first-class citizens, while you'd have to implement that on your own in an OODB. In a graphdb you can have different object types representing different views on the same data. Graphdbs typically support things like finding shortest paths and the like. - nawroth Mar 16 '11 at 12:05
Cool, thanks for the clarification! - Norman H Mar 17 '11 at 13:47
Might not be the best place to answer with this, but I'd like to share this taxonomy of the NoSQL world created by Steve Yen (please find it at http://dl.dropbox.com/u/2075876/nosql-steve-yen.pdf):

(1) key-value cache: memcached, repcached, Coherence, Infinispan, eXtreme Scale, JBoss Cache, Velocity, Terracotta
(2) key-value store: Keyspace, Schema-Free, RAMCloud
(3) eventually-consistent key-value store: Dynamo, Voldemort, Dynomite, SubRecord, MongoDb, Dovetaildb
(4) ordered key-value store: Tokyo Tyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
(5) data-structures server: Redis
(6) tuple store: Gigaspaces, Coord, Apache River
(7) object database: ZopeDB, db4o, Shoal
(8) document store: CouchDB, Mongo, Jackrabbit, XML databases, ThruDB, CloudKit, Persevere, Riak (Basho), Scalaris
(9) wide columnar store: BigTable, HBase, Cassandra, Hypertable, KAI, OpenNeptune
answered Mar 19 '11 at 10:26
For a look into what academic research is being done in the area of next-gen databases, take a look at this: http://www.thethirdmanifesto.com/

In regard to the SQL language as a proper implementation of the relational model, I quote from Wikipedia: "SQL, initially pushed as the standard language for relational databases, deviates from the relational model in several places. The current ISO SQL standard doesn't mention the relational model or use relational terms or concepts. However, it is possible to create a database conforming to the relational model using SQL if one does not use certain SQL features."

http://en.wikipedia.org/wiki/Relational_model (referenced in the section "SQL and the relational model" on March 28, 2010)
answered Mar 28 '10 at 11:15
Norman H

Not to be pedantic, but I would like to point out that at least CouchDB isn't SQL-based. And I would hope that the next-gen SQL would make SQL a lot less... fugly and non-intuitive.

Jason Baker

A friend of mine said, "It's supposed to be hard to read! It's called code for a reason!" :-) - Bill Karwin Nov 12 '08 at 2:30
My brain is broken, I like SQL; too much looking at it grows on you :) - Robert Gould Nov 12 '08 at 2:35

There are special databases for XML, like MarkLogic and Berkeley XMLDB. They can index XML docs and one can query them with XQuery. I expect JSON databases; maybe they already exist. Did some googling but couldn't find one.

tuinstoel

There are a few that provide a JSON interface to the data. Terrastore is one example. - quikchange Jul 7 '10 at 18:35
SQL has been around since the early 1970s, so I don't think that it's going to go away any time soon.

Maybe the 'new(-ish) SQL' will be OQL (see http://en.wikipedia.org/wiki/ODBMS).
Jim Starkey is the man who created InterBase, who worked on Vulcan (a Firebird fork), and who was at the beginning of Falcon for MySQL.
Hadoop examples?
I'm examining Hadoop as a possible tool with which to do some log analysis. I want to analyze several kinds of statistics in one run. Each line of my log files has all sorts of potentially useful data that I'd like to aggregate. I'd like to get all sorts of data out of the logs in a single Hadoop run, but the example Hadoop programs I see online all seem to total exactly one thing. This may be because every single example Hadoop program I can find just does word counts. Can I use Hadoop to solve two or more problems at once?

Are there other Hadoop examples, or a Hadoop tutorial out there, that don't solve the word count problem?
hadoop
j0k

As a general comment, I do seem to notice that Hadoop doesn't have a lot of examples floating around. Not sure why.
15 Answers
One of the best resources that I have found to get started is Cloudera. They are a startup company comprised mainly of ex-Google and ex-Yahoo employees. On their page there is a training section with lessons on the different technologies. I found that very useful in playing with straight Hadoop, Pig and Hive. They have a virtual machine that you can download that has everything configured, and some examples that help you get coding. All of that is free in the training section. The only thing that I couldn't find is a tutorial on HBase. I have been looking for one for a while. Best of luck.

accepted
Ryan H
Tom White's second edition has info on HBase, Pig and Hive. - C-x C-t Feb 18 '11 at 10:01
I'm finishing up a tutorial on processing Wikipedia pageview log files, several parts of which compute multiple metrics in one pass (sum of pageviews, trend over the last 24 hours, running regressions, etc.). The code is here: http://github.com/datawrangling/trendingtopics/tree/master The Hadoop code mostly uses a mix of Python streaming & Hive w/ the Cloudera distro on EC2...
Pete Skomoroch
I loved your tute Pete, especially the overview you gave at Hadoop World, awesome stuff! - mat kelcey Oct 20 '09 at 5:06
Here are two examples using Cascading (an API over Hadoop):

- A simple log parser
- Calculates arrival rate of requests

You can start with the second and just keep adding metrics. Cascading project site
edited May 1 '11 at 20:07, answered Apr 10 '09 at 15:11
cwensel
You can refer to Tom White's Hadoop book for more examples and use cases: http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732/

answered Oct 19 '10 at 10:20
With the normal Map/Reduce paradigm, you typically solve one problem at a time. In the map step you typically perform some transformation or denormalization; in the reduce step you often aggregate the map outputs.

If you want to answer multiple questions about your data, the best way to do it in Hadoop is to write multiple jobs, or a sequence of jobs that read the previous step's outputs. There are several higher-level abstraction languages or APIs (Pig, Hive, Cascading) that simplify some of this work for you, allowing you to write more traditional procedural or SQL-style code that, under the covers, just creates a sequence of Hadoop jobs.
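As a sketch of that "sequence of jobs" idea in plain Python (no Hadoop involved; the log format and the job boundary are invented for illustration), the output of the first job becomes the input of the second:

```python
from collections import Counter

def job1(lines):
    """Job 1: count hits per URL (map + reduce collapsed for brevity).
    In Hadoop, this output would be written to HDFS."""
    counts = Counter()
    for line in lines:
        ip, url = line.split()
        counts[url] += 1
    return counts

def job2(counts, n=2):
    """Job 2: read job 1's output and keep the n most-visited URLs."""
    return counts.most_common(n)

log = [
    "10.0.0.1 /home",
    "10.0.0.2 /home",
    "10.0.0.1 /about",
    "10.0.0.3 /home",
]
print(job2(job1(log)))   # [('/home', 3), ('/about', 1)]
```

Pig, Hive, and Cascading generate exactly this kind of job chain for you from a single higher-level script or query.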
answered Apr 23 '09 at 0:23
There was a course taught by Jimmy Lin at the University of Maryland. He developed the Cloud9 package as a training tool. It contains several examples.

user119381
Amazon has a new service based on Hadoop; it's a great way to get started, and they have some nice examples. http://aws.amazon.com/elasticmapreduce/

answered Apr 10 '09 at 15:14
You can also follow the Cloudera blog; they recently posted a really good article about Apache log analysis with Pig.

answered Jul 8 '09 at 22:17
Romain Rigaux

As the author of said article, I want to point out that it was written more from a "getting familiar with Pig" perspective than a "doing log parsing in Hadoop" perspective. There are more efficient and less verbose ways to do those things. But yeah, Pig is nice for this sort of stuff at large scale. - SquareCog Nov 25 '09 at 1:04
Have you looked at the wiki? You could try looking through the software in the contrib section, though the code for those will probably be hard to learn from. Looking over the page, they seem to have a link to an external tutorial.

fuzzy-waffle
I'm sure you've solved your problem by now, but for those who still get redirected here from Google searching for examples, here is an excellent blog with hundreds of lines of working code: http://sujitpal.blogspot.com/

edited Nov 30 '09 at 15:12, answered Nov 27 '09 at 19:06
alex
There are several examples using Ruby under Hadoop Streaming in the wukong library. (Disclaimer: I am an author of same.) Besides the now-standard wordcount example, there's pagerank and a couple of simple graph manipulation scripts.

mrflip
In the Map step you walk through the log line by line. In each line, you separate your relevant data from each other (something like split(), I guess) and emit a key-value pair for each bit of information on every line. So if your log has a format like this:

    (Timestamp) (A) (B) (C)
    123          4   5   6
    789          1   2   3

you could emit (A,4), (B,5), (C,6) for the first line, and so forth for the other lines. Now you can even have parallel reducers! Each reducer collects the bits for a given category. You can tweak your Hadoop app so one reducer gets all the "A"s and another one gets all the "B"s. The reduce itself is like the typical word count ;-)
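That scheme can be sketched in plain Python (not Hadoop code; the shuffle that routes each key to its own reducer is collapsed into a single grouping step):

```python
from collections import defaultdict

def mapper(lines):
    """Emit one (column-name, value) pair per field of every log line."""
    for line in lines:
        ts, a, b, c = line.split()
        yield ("A", int(a))
        yield ("B", int(b))
        yield ("C", int(c))

def reducer(pairs):
    """Group values by key, then aggregate each group.
    Hadoop would run one reducer per key; here we sum every group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

log = ["123 4 5 6", "789 1 2 3"]
print(reducer(mapper(log)))   # {'A': 5, 'B': 7, 'C': 9}
```

One pass over the log produces all three per-category totals at once, which is the point of the answer above.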
answered Feb 2 '11 at 18:53
Apache has released a set of examples. You can find them at: http://svn.apache.org/repos/asf/hadoop/common/trunk/mapreduce/src/examples/org/apache/hadoop/examples/

answered Jul 13 '11 at 14:38
Two tools that can give a good starting point for solving the problem in the Hadoop way are Pig (already mentioned with a link above) and Mahout (machine learning libraries).

Regarding Mahout, you can read IBM's articles that give a very good introduction to what you can do "easily" with it: http://www.ibm.com/developerworks/java/library/j-mahout/ and http://www.ibm.com/developerworks/java/library/j-mahout-scaling/ It gives you the next thing (clustering, categorization...) that you would like to do with the accounting data that you can get from the likes of Pig or the hand-written MapReduce code that you will write.
edited Jan 31 '12 at 13:06, answered Jan 31 '12 at 12:52
Guy
Ilya said it well: folks usually write one job per task because the output from the mappers and reducers is very specific to the result you're after.

Also, at higher scale, jobs take longer, and usually you'll run different jobs at different frequencies (and on subsets of your data). Finally, it's a lot more maintainable.

We've been spoiled by Hive for syslog and app log analysis. That might get you closer to the lightweight, ad-hoc queries that would let you get multiple results really quickly: http://help.papertrailapp.com/kb/analytics/log-analytics-with-hadoop-and-hive Passing multiple functions to a SELECT clause would probably accomplish what you're after, but you still may need a temporary table.
I understand how Map is easily parallelizable - each computer/CPU can just operate on a small portion of the array.

Is Reduce/foldl parallelizable? It seems like each computation depends on the previous one. Is it just parallelizable for certain types of functions?

multithreading optimization map multicore reduce
Claudiu

Give us some clues: what platform or programming language are you talking about? This doesn't sound like MPI. And what's a "foldl"? - Die in Sente Dec 1 '08 at 0:17
foldl is a left fold, or a fold with a left-associative operator: folding [1,2,3,4] with + would yield (((1 + 2) + 3) + 4). - Frank Shearar Apr 7 '10 at 7:47
6 Answers
If your reduction's underlying operation is associative*, you can play with the order of operations and with locality. Therefore you often have a tree-like structure in the 'gather' phase, so you can do it in several passes in logarithmic time:

    a + b + c + d
     \   /   \   /
     (a+b)   (c+d)
        \     /
    ((a+b)+(c+d))

instead of (((a+b)+c)+d). If your operation is commutative, further optimizations are possible, as you can gather in a different order (it may be important for data alignment when those operations are vector operations, for example).

[*] your real desired mathematical operations, not those on effective types like floats, of course.

accepted
edited Dec 1 '08 at 14:07, answered Nov 30 '08 at 22:01
Piotr Lesnicki

Do you mean "associative" rather than "commutative"? - Patrick McElhaney Dec 1 '08 at 2:38
You're right, thanks, I meant associative; corrected! But in fact it also helps if the operation is commutative, so that you can gather your chunks in any order (you do that for data alignment issues, for example). - Piotr Lesnicki Dec 1 '08 at 14:02
Yes, if the operator is associative. For example, you can parallelise summing a list of numbers:

    step 1: 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8
    step 2: 3 + 7 + 11 + 15
    step 3: 10 + 26
    step 4: 36

This works because (a+b)+c = a+(b+c), i.e. the order in which the additions are performed doesn't matter.
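A sketch of that pairwise scheme, assuming only that the operator is associative. Each round's pairs are independent of one another, so a real implementation could hand them to separate workers; here they run sequentially for clarity:

```python
def tree_reduce(op, values):
    """Reduce pairwise in rounds, giving O(log n) depth if the pairs in
    each round run in parallel. Requires op to be associative."""
    while len(values) > 1:
        # combine neighbours: (v0,v1), (v2,v3), ... - independent of each other
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:           # odd element carries into the next round
            paired.append(values[-1])
        values = paired
    return values[0]

print(tree_reduce(lambda a, b: a + b, [1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

For eight values this performs the exact rounds shown above: 3,7,11,15 then 10,26 then 36.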
answered Nov 30 '08 at 22:25
Jules
Technically a reduce is not the same as a foldl (fold-left), which can also be described as an accumulate. The example given by Jules illustrates a reduce operation very well:

    step 1: 1 + 2 + 3 + 4
    step 2: 3 + 7
    step 3: 10

Note that at each step the result is an array, including the final result, which is an array of one item. A fold-left is like the following:

    step 0: a = 0
    step 1: a = a + 1
    step 2: a = a + 2
    step 3: a = a + 3
    step 4: a = a + 4
    step 5: a

Now obviously these both produce the same results, but a foldl has a well-defined result when given a non-associative operator (like subtraction), whereas a reduce operator doesn't.
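The distinction matters in practice. A small Python sketch with subtraction (non-associative) shows the strict left fold and a pairwise reduce disagreeing on the same input:

```python
from functools import reduce

values = [1, 2, 3, 4]

# foldl: strictly left-to-right with an accumulator; well-defined even
# for non-associative operators such as subtraction.
foldl_result = reduce(lambda acc, x: acc - x, values, 0)  # ((((0-1)-2)-3)-4)

# A parallel-style pairwise reduce regroups the same operator:
pairwise = (values[0] - values[1]) - (values[2] - values[3])  # (1-2)-(3-4)

print(foldl_result)  # -10
print(pairwise)      # 0
```

With an associative operator like + the two strategies would agree, which is exactly why associativity is the precondition for parallelising the reduce.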
answered Feb 9 '10 at 2:10
cdiggins

Subtraction's non-associative, but it is left-associative (because 5 - 3 - 2 yields the same result as (5 - 3) - 2). But what happens if you give foldl a right-associative operator, or foldr a left-associative one, I wonder? - Frank Shearar Apr 7 '10 at 7:50
Not sure what platform/language you're thinking of, but you can parallelize reduce operators like this:

    // Serial
    result = null;
    foreach (item in map) {
        result += item;
    }

    // Parallel
    resultArray = array();
    mapParts = map.split(numThreads);
    foreach (thread) {
        result = null;
        foreach (item in mapParts[thread]) {
            result += item;
        }
        resultArray += result;   // Lock this!
    }
    waitForThreads();
    reduce(resultArray);

As you can see, a parallel implementation is easily recursive. You split the map up, operate on each part in its own thread, then perform another reduce once those threads are done to bring the pieces together. (This is the programmatic reasoning behind Piotr Lesnicki's answer.)

answered Nov 30 '08 at 22:00
strager
It depends on your Reduce step. In a Hadoop-style implementation of MapReduce, your Reducer gets called once per key, with all the rows relevant to that key.

So, for example, your Mapper might be taking in a lot of unordered web server logs, adding some metadata (e.g., geocoding), and emitting [key, record] pairs with a cookie ID as the key. Your Reducer would then be called once per cookie ID and would be fed all the data for that cookie, and could compute aggregate info such as visit frequency or average pages viewed per visit. Or you could key on geocode data, and gather aggregate stats based on geography.

Even if you're not doing per-key aggregate analysis - indeed, even if you're computing something over the whole set - it might be possible to break your computation into chunks, each of which could be fed to a Reducer.
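A toy Python model of this flow (the field names and the metric are invented for the sketch; in real Hadoop the grouping step is the framework's shuffle, not your code):

```python
from collections import defaultdict

def mapper(records):
    """Emit (cookie_id, record) pairs, keying each log record by cookie."""
    for rec in records:
        yield rec["cookie"], rec

def shuffle(pairs):
    """Stand-in for Hadoop's shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(cookie, records):
    """Called once per cookie ID with every record for that cookie."""
    return cookie, {"pageviews": len(records)}

log = [
    {"cookie": "c1", "url": "/a"},
    {"cookie": "c2", "url": "/a"},
    {"cookie": "c1", "url": "/b"},
]
results = dict(reducer(k, v) for k, v in shuffle(mapper(log)).items())
print(results)   # {'c1': {'pageviews': 2}, 'c2': {'pageviews': 1}}
```

Swapping the key from cookie ID to a geocode would give per-geography aggregates with the same reducer structure.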
I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on the topic, installed Hadoop on an array of virtual machines, and did some very interesting tests. I really think I understand the Map and Reduce steps.

But here is my problem: I can't figure out how it can help with HTTP server log analysis. My understanding is that big companies (Facebook, for instance) use MapReduce to compute their HTTP logs in order to speed up the process of extracting audience statistics out of them. The company I work for, while smaller than Facebook, has a big volume of web logs to compute every day (100GB, growing between 5 and 10 percent every month). Right now we process these logs on a single server, and it works just fine. But distributing the computing jobs instantly comes to mind as a soon-to-be-useful optimization. Here are the questions I can't answer right now; any help would be greatly appreciated:

- Can the MapReduce concept really be applied to weblog analysis?
- Is MapReduce the most clever way of doing it?
- How would you split the web log files between the various computing instances?
Nicolas
2 Answers
You can split your huge logfile into chunks of, say, 10,000 or 1,000,000 lines (whatever is a good chunk for your type of logfile - for Apache logfiles I'd go for a larger number), feed them to some mappers that would extract something specific (like browser, IP address, ..., username, ...) from each log line, then reduce by counting the number of times each one appeared (simplified):

    192.168.1.1,FireFox x.x,username1
    192.168.1.1,FireFox x.x,username1
    192.168.1.2,FireFox y.y,username1
    192.168.1.7,IE 7.0,username1

You can extract browsers, ignoring version, using a map operation to get this list:

    FireFox
    FireFox
    FireFox
    IE

Then reduce to get this:

    FireFox,3
    IE,1

Is MapReduce the most clever way of doing it? It's clever, but you would need to be very big in order to gain any benefit... splitting petabytes of logs. To do this kind of thing, I would prefer to use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue, with jobs not executed within some timeframe made available for others to process. These clients would be small programs that do something specific. You could start with 1 client and expand to 1000... You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual-core PCs... With pull: you could have 100 or 10 clients working; multicore machines could have multiple clients running, and whatever a client finishes would be available for the next step. And you don't need to do any hashing or assignment for the work to be done. It's 100% dynamic.

How would you split the web log files between the various computing instances? By number of elements or lines if it's a text-based logfile.

In order to test MapReduce, I'd like to suggest that you play with Hadoop.
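For the browser-counting step above, here is a sketch in the style of Hadoop Streaming (plain Python, line-oriented, with the framework's sort-and-shuffle simulated by a `sorted()` call; the log format is the one from the answer):

```python
from itertools import groupby

def mapper(lines):
    """Emit 'browser\t1' for each log line of the form 'ip,browser,user'."""
    for line in lines:
        ip, browser, user = line.strip().split(",")
        family = browser.split()[0]          # drop the version number
        yield f"{family}\t1"

def reducer(lines):
    """Input arrives sorted by key (Hadoop sorts between map and reduce),
    so consecutive lines with the same key form one group."""
    pairs = (line.split("\t") for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

log = [
    "192.168.1.1,FireFox x.x,username1",
    "192.168.1.1,FireFox x.x,username1",
    "192.168.1.2,FireFox y.y,username1",
    "192.168.1.7,IE 7.0,username1",
]
mapped = sorted(mapper(log))     # simulate the shuffle/sort phase
print(list(reducer(mapped)))     # ['FireFox\t3', 'IE\t1']
```

Under real Hadoop Streaming, `mapper` and `reducer` would be two scripts reading stdin and writing stdout, and the sort between them is done by the framework.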
edited Jun 2 '09 at 13:21, answered Jun 2 '09 at 12:26
Osama ALASSIRY

First of all, sorry for the delay. Thanks a lot for your very high-quality answer. It helps a lot! - Nicolas Jun 11 '09 at 10:01
As an alternative to splitting the log files, you could parallelize your "log analysis" script across n cores. And if you were to run this script on a virtualized cluster (of, say, 96 cores), your code will run flawlessly without any changes. You need to identify and isolate the "smallest" unit of work that is side-effect free and deals with immutable data. This may require you to re-design code, possibly. Besides, Hadoop is comparatively harder to set up (and where I live, expertise is harder to find). - Imran.Fanaswala Apr 10 '10 at 20:27
Is MapReduce the most clever way of doing it? It would allow you to query across many commodity machines at once, so yes, it can be useful. Alternatively, you could try sharding.

How would you split the web log files between the various computing instances? Generally you would distribute your data using a consistent hashing algorithm, so you can easily add more instances later. You should hash by whatever would be your primary key in an ordinary database. It could be a user id, an IP address, referer, page, advert; whatever is the topic of your logging.
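A minimal sketch of such a consistent-hash ring in Python (the virtual-node count, node names, and the use of MD5 are arbitrary choices for the example, not a reference to any particular library):

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent-hash ring: a key maps to the next node position
    clockwise on the ring, so adding a node only moves the keys that fall
    into its new arcs instead of rehashing everything."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas       # virtual nodes per physical node
        self.ring = []                 # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def _pos(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            self.ring.append((self._pos(f"{node}:{i}"), node))
        self.ring.sort()

    def node_for(self, key):
        positions = [p for p, _ in self.ring]
        i = bisect(positions, self._pos(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
# the same key (e.g. an IP address or user id) always lands on the same node:
print(ring.node_for("10.1.2.3") == ring.node_for("10.1.2.3"))  # True
```

Hashing log lines by user id or IP this way means each computing instance can aggregate its own keys independently, and a new instance can be added without redistributing most of the data.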
Is it possible to add new nodes to Hadoop after it is started? I know that you can remove nodes (as the master tends to keep tabs on the node state).

monksy
2 Answers
You can add new nodes by just booting up a new one with a proper hadoop-site.xml (one that points back to the master namenode and jobtracker).

That said, removing nodes is a really bad idea without rebalancing your HDFS file blocks, to prevent removing all the dupes of a given block. If you drop three datanodes, you could lose all the dupes for a given block (one that has a replication factor of 3, the default), thus corrupting the file the block belongs to. Removing two nodes could leave you with one replica, and it could be corrupt (known to happen with dying disks).

accepted
answered Jan 22 '10 at 1:52
cwensel 88154 Well I want to build a cluster that may have unreliable clients [network connection etc] monksy Jan 22 '10 at 1:59
You're right, Hadoop isn't made for dynamic scaling; it's meant for fixed clusters. monksy Jan 26 '10 at 9:47
I think as long as you don't use them as datanodes you wouldn't have an issue ... of course the data locality
I'm learning Apache Hadoop and I was looking at the WordCount example org.apache.hadoop.examples.WordCount. I understand this example, however I can see that the variable LongWritable key is not used in:

    (...)
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
    (...)

What is the use of this variable? Could someone give me a simple example where it would be used? Thanks
hadoop mapreduce
Pierre
2 Answers
accepted
When the InputFormat is TextInputFormat, the Key is the byte offset from the beginning of the current input file, and the Value is simply the line of text at that offset. If SequenceFileInputFormat were used, the Key would be whatever was stuffed into the Key position of the record, and the same for the Value. The bottom line is that the Key/Value types depend on the input type (text, sequence file, etc.). ckw
answered Apr 22 '09 at 22:06
cwensel
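To make the byte-offset point concrete, here is an illustrative plain-Java sketch (not Hadoop code) that computes the (byte offset, line) pairs that TextInputFormat would hand to the mapper for a small input:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: reproduce the (byte offset, line) keying that
// TextInputFormat performs, without any Hadoop dependencies.
public class OffsetKeys {
    public static Map<Long, String> keyedLines(String input) {
        Map<Long, String> pairs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : input.split("\n", -1)) {
            if (!line.isEmpty()) pairs.put(offset, line);
            // Advance past the line's bytes plus its trailing '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<Long, String> pairs = keyedLines("hello world\nfoo bar\n");
        // "hello world" starts at byte 0; "foo bar" starts at byte 12
        // (11 bytes of text plus one newline).
        assert pairs.get(0L).equals("hello world");
        assert pairs.get(12L).equals("foo bar");
    }
}
```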
I could be wrong (I have read map/reduce tutorials but haven't used it for real projects yet), but I think in general it is the identifier of the input entry; for example, the tuple (file name, line number). In this particular case it's supposedly the line number, and it's of no interest for word counts. It could be used if the idea was to, say, aggregate word counts on a per-line rather than per-file basis (or for multiple files, if the key did contain that info).
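A hypothetical sketch of that per-line idea in plain Java: the input key (here a line number standing in for the byte offset) is folded into the output key, so counts aggregate per line instead of over the whole input. All names here are made up for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative per-line word count: keep the input key in the output key
// so the aggregation is per (line, word) rather than per word.
public class PerLineWordCount {
    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
            for (String word : lines.get(lineNo).split("\\s+")) {
                if (word.isEmpty()) continue;
                String key = lineNo + ":" + word; // composite (line, word) key
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count(List.of("a b a", "b b"));
        assert c.get("0:a") == 2; // "a" appears twice on line 0
        assert c.get("1:b") == 2; // "b" appears twice on line 1
    }
}
```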
Is there a theoretical analysis available which describes what kinds of problems MapReduce can solve?
Welbog
amit-agrawal
7 Answers
For problems requiring processing and generating large data sets. Say, running an interest-generation query over all accounts a bank holds, or processing audit data for all transactions that happened in the past year in a bank. The best-known use case is from Google: generating the search index for the Google search engine.
sangupta
Many problems that are "Embarrassingly Parallel" (great phrase!) can use MapReduce. http://en.wikipedia.org/wiki/Embarrassingly_parallel From this article... http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm ... Doug Cutting, founder of Hadoop (an open source implementation of MapReduce), says... "Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site"
and... the tech team at The New York Times rented computing power on Amazon's cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.
answered Apr 1 '09 at 13:30
ChrisV
In "Map-Reduce for Machine Learning on Multicore", Chu et al. describe "algorithms that fit the Statistical Query model can be written in a certain summation form, which allows them to be easily parallelized on multicore computers." They specifically implement 10 algorithms, including weighted linear regression, k-Means, Naive Bayes, and SVM, using a map-reduce framework. The Apache Mahout project has released a recent Hadoop (Java) implementation of some methods based on the ideas from this paper.
bubaker
Anything that involves doing operations on a large set of data, where the problem can be broken down into smaller independent sub-problems whose results can then be aggregated to produce the answer to the larger problem. A trivial example would be calculating the sum of a huge set of numbers: you split the set into smaller sets, calculate the sums of those smaller sets in parallel (which can involve splitting those into yet smaller sets), then sum those results to reach the final answer.
answered Apr 1 '09 at 13:02
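The trivial sum example above can be sketched in plain Java, with each chunk summed independently (the parallelizable "map" step) and the partial results aggregated at the end (the "reduce" step):

```java
import java.util.Arrays;
import java.util.stream.IntStream;
import java.util.stream.LongStream;

// Sketch of the split / partial-sum / aggregate pattern from the answer.
public class ParallelSum {
    public static long sum(long[] numbers, int chunkSize) {
        int chunks = (numbers.length + chunkSize - 1) / chunkSize;
        long[] partials = new long[chunks];
        // Each chunk is independent, so this loop parallelizes trivially.
        IntStream.range(0, chunks).parallel().forEach(i -> {
            int from = i * chunkSize;
            int to = Math.min(from + chunkSize, numbers.length);
            partials[i] = Arrays.stream(numbers, from, to).sum();
        });
        // Aggregate the partial sums into the final answer.
        return Arrays.stream(partials).sum();
    }

    public static void main(String[] args) {
        long[] nums = LongStream.rangeClosed(1, 1000).toArray();
        assert sum(nums, 64) == 500500; // 1+2+...+1000 = 1000*1001/2
    }
}
```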
You can also watch the videos @ Google; I'm watching them myself and I find them very educational.
answered Aug 14 '09 at 19:21
The answer really lies in the name of the algorithm. MapReduce is not a general-purpose parallel programming or batch execution framework, as some of the answers suggest. MapReduce is really useful when large data sets need to be processed to derive certain attributes (the Map phase), and the results then need to be summarized over those derived attributes (the Reduce phase).
user606308
I may be wrong, but all(?) examples I've seen with Apache Hadoop take as input a file stored on the local file system (e.g. org.apache.hadoop.examples.Grep).
Is there a way to load and save the data on the Hadoop file system (HDFS)? For example, I put a tab-delimited file named 'stored.xls' on HDFS using hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls. How should I configure the JobConf to read it?
Thanks.
configuration input hadoop mapreduce
Pierre 11.7k23077
3 Answers
activeoldestvotes
yogman
Thanks, but it throws an exception saying that "file:/home/me/workspace/HADOOP/stored.xls" (this is a local path) doesn't exist. The file on HDFS is in '/user/me/stored.xls'. I also tried new Path("/user/me/stored.xls") and it doesn't work either. Pierre Apr 24 '09 at 20:41
First off, it's strange that Hadoop complained about "file:" rather than "hdfs:". It might be that your hadoop-site.xml is misconfigured. Second, if that still doesn't work, mkdir input and put stored.xls in the "input" dir (all with the bin/hadoop fs command), and use new Path("input") instead of new Path("stored.xls"). yogman Apr 24 '09 at 20:53
Revealing your command line to run the job wouldn't hurt. yogman Apr 24 '09 at 20:54
Pierre, the default configuration for Hadoop is to run in local mode, rather than in distributed mode. You likely just need to modify some configuration in your hadoop-site.xml. It looks like your default filesystem is still localhost, when it should be hdfs://youraddress:yourport. Look at your setting for fs.default.name, and also see the setup help at Michael Noll's blog for more details.
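For illustration, the relevant hadoop-site.xml fragment might look like this, keeping the answer's own placeholder for your namenode's address and port:

```xml
<!-- hadoop-site.xml: switch the default filesystem from local to HDFS -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://youraddress:yourport</value>
</property>
```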
The crawler needs to have an extendable architecture to allow changing the internal process, like implementing new steps (pre-parser, parser, etc...).
I found the Heritrix Project (http://crawler.archive.org/). But are there other nice projects like it?
open-source web-crawler
Zanoni
stackoverflow.com/questions/176820/ Gavin Miller Jun 24 '09 at 17:30
@LFSR Consulting: They are for different purposes... Zanoni Jun 24 '09 at 17:39
5 Answers
accepted
Nutch is the best you can do when it comes to a free crawler. It is built on the concept of Lucene (in an enterprise-scaled manner) and is supported by the Hadoop back end, using MapReduce (similar to Google) for large-scale data querying. Great products! I am currently reading all about Hadoop in the new (not yet released) Hadoop in Action from Manning. If you go this route I suggest getting onto their technical review team to get an early copy of this title! These are all Java based. If you are a .NET guy (like me!!) then you might be more interested in Lucene.NET, Nutch.NET, and Hadoop.NET, which are all class-by-class and API-by-API ports to C#.
answered Jun 24 '09 at 18:00
Andrew Siemer
+1 for Nutch and Hadoop; you can also look at Solr if you are looking for a distributed and scalable solution. Sumit Ghosh Aug 12 '10 at 12:06
From the looks of it, Nutch.NET is completely non-existent and I couldn't even find a way to download it. Mike
You may also want to try Scrapy: http://scrapy.org/ It is really easy to specify and run your crawlers.
fccoelho
If you're not tied down to a platform, I've had very good experiences with Nutch in the past. It's written in Java and goes hand in hand with the Lucene indexer.
Justin Niessner
Not sure why my last comment was deleted. Seems totally relevant, even to an old thread. Here it is again...
Abot is a good extensible web crawler. Every part of the architecture is pluggable, giving you complete control over its behavior. It's open source, free for commercial and personal use, and written in C#. http://code.google.com/p/abot/