
Jul 11, 2012 - 2:50PM PT

Because Hadoop isn't perfect: 8 ways to replace HDFS


By Derrick Harris


Hadoop is on its way to becoming the de facto platform for the next generation of data-based applications, but it's not without flaws. Ironically, one of Hadoop's biggest shortcomings now is also one of its biggest strengths going forward: the Hadoop Distributed File System.

Within the Apache Software Foundation, HDFS is always improving in terms of performance and availability. Honestly, it's probably fine for the majority of Hadoop workloads that are running in pilot projects, skunkworks projects or generally non-demanding environments. And technologies such as HBase that are built atop HDFS speak to its versatility as a storage system even for non-MapReduce applications.

But if the growing number of options for replacing HDFS signifies anything, it's that HDFS isn't quite where it needs to be. Some Hadoop users have strict demands around performance, availability and enterprise-grade features, while others aren't keen on its direct-attached storage (DAS) architecture. Concerns around availability might be especially valid for anyone (read "almost everyone") who's using an older version of Hadoop without the High Availability NameNode.

Here are eight products and projects whose proprietors argue they can deliver what HDFS can't:

Cassandra (DataStax)

Not a file system at all but an open source, NoSQL key-value store, Cassandra has become a viable alternative to HDFS for web applications that rely on fast data access. DataStax, a startup commercializing the Cassandra database, has fused Hadoop atop Cassandra to provide web applications fast access to data processed by Hadoop, and Hadoop fast access to data streaming into Cassandra from web users.

Ceph

Ceph is an open source, multi-pronged storage system that was recently commercialized by a startup called Inktank. Among its features is a high-performance parallel file system that some think makes it a candidate for replacing HDFS (and then some) in Hadoop environments. Indeed, some researchers started looking at this possibility as far back as 2010.

Dispersed Storage Network (Cleversafe)

Cleversafe got into the HDFS-replacement business on Monday, announcing a product that will fuse Hadoop MapReduce with the company's Dispersed Storage Network system. By fully distributing metadata across the cluster (instead of relying on a single NameNode) and not relying on replication, Cleversafe says it's much faster, more reliable and more scalable than HDFS.

GPFS (IBM)

IBM has been selling its General Parallel File System to high-performance computing customers for years (including within some of the world's fastest supercomputers), and in 2010 it tuned GPFS for Hadoop. IBM claims the GPFS-SNC (Shared Nothing Cluster) edition is so much faster than Hadoop in part because it runs at the kernel level as opposed to atop the OS like HDFS.

Isilon (EMC)

EMC has offered its own Hadoop distributions for more than a year, but in January 2012 it unveiled a new method for making HDFS enterprise-class: replace it with EMC Isilon's OneFS file system. Technically, as EMC's Chuck Hollis explained at the time, because Isilon can read NFS, CIFS and HDFS protocols, a single Isilon NAS system can serve to intake, process and analyze data.

Lustre

Lustre is an open source high-performance file system that some claim can make for an HDFS alternative where performance is a major concern. Truth be told, I haven't heard of this combination running anywhere in the wild, but HPC storage provider Xyratex wrote a paper on the combination in 2011, claiming a Lustre-based cluster (even with InfiniBand) will be faster and cheaper than an HDFS-based cluster.

MapR File System

The MapR File System is probably the best-known HDFS alternative, as it's the basis of MapR's increasingly popular and well-funded Hadoop distribution. Not only does MapR claim its file system is two to five times faster than HDFS on average (although, really, up to 20 times faster), but it has features such as mirroring, snapshots and high availability that enterprise customers love.

NetApp Open Solution for Hadoop

OK, the NetApp Open Solution for Hadoop isn't so much an HDFS replacement as it is an HDFS improvement, according to NetApp and early partner Cloudera. The offering still relies on HDFS, but it re-envisions the physical Hadoop architecture by putting HDFS on a RAID array. This, NetApp claims, means faster, more reliable and more secure Hadoop jobs.

This might be a good place to say rest in peace to two other HDFS alternatives that are effectively no longer with us: KosmosFS (aka CloudStore) and Appistry CloudIQ Storage. The former was created by Kosmix (since bought by @WalmartLabs) and released to the open source world in 2007, but no longer has an active community. The latter was an attempt by Appistry in 2010 to get a piece of the Hadoop pie with its computational storage technology, but the company has since switched its focus from selling the technology to providing high-performance computing services based on it.

Feature image courtesy of Shutterstock user Panos Karapanagiotis.
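A back-of-the-envelope sketch can make the replication trade-off raised in the Cleversafe entry more concrete. The article does not say how Cleversafe avoids replication, so the code below uses a generic k-of-n dispersal model as a stand-in, and the 10-of-16 layout is a hypothetical example rather than a published Cleversafe configuration. The only non-assumed number is HDFS's default replication factor of three. The model counts raw storage only and ignores metadata, rebuild traffic and performance.

```python
from fractions import Fraction


def replication_overhead(copies: int) -> Fraction:
    """Raw bytes stored per byte of user data when every block is kept `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return Fraction(copies, 1)


def dispersal_overhead(total_slices: int, threshold: int) -> Fraction:
    """Raw bytes stored per byte of user data under a generic k-of-n dispersal model.

    Data is cut into `total_slices` slices; any `threshold` of them rebuild it.
    """
    if not 0 < threshold <= total_slices:
        raise ValueError("need 0 < threshold <= total_slices")
    return Fraction(total_slices, threshold)


def losses_tolerated(total_pieces: int, pieces_needed: int) -> int:
    """How many copies or slices can be lost before the data becomes unreadable."""
    return total_pieces - pieces_needed


if __name__ == "__main__":
    # HDFS default: three full copies, any one of which is enough to read the block.
    print("3x replication:", float(replication_overhead(3)), "x raw storage,",
          losses_tolerated(3, 1), "losses tolerated")
    # Hypothetical dispersal layout (assumption): 16 slices, any 10 rebuild the data.
    print("10-of-16 dispersal:", float(dispersal_overhead(16, 10)), "x raw storage,",
          losses_tolerated(16, 10), "losses tolerated")
```

Under these assumed numbers, dispersal stores less raw data per user byte while surviving more lost pieces, which is the kind of argument vendors make against plain replication. Whether that translates into the speed and reliability gains Cleversafe claims depends on factors this sketch leaves out.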


14 Comments

1. Per, Thursday, July 12 2012
Add the XtremeData engine and you have game-changing big data analytics at consumer-level pricing (www.xtremedata.com).

2. Adam Bane, Thursday, July 12 2012
OpenStack Object Storage deserves a mention in this list. The Hadoop implementation for OpenStack removes the NameNode requirement and streams data directly from the object store to compute memory (no staging on local disk required). The real significance here is that OpenStack Storage has already been implemented at HP, Rackspace and SoftLayer (among other IaaS providers) and can be deployed privately in the enterprise. Hadoop projects that are started at small scale in these public clouds can easily be migrated in-house. Similarly, an in-house implementation can easily be extended to these providers for easy access to additional storage and compute resources.

Derrick Harris, Thursday, July 12 2012
Good call. I didn't realize OpenStack has an HDFS implementation.

3. John Mark, Thursday, July 12 2012
Hi Derrick, GlusterFS, and by extension Red Hat Storage, have a drop-in compatibility library today that allows you to augment or replace HDFS. The win for customers is that they have a unified data backend they can access either via Hadoop operations or via traditional NAS methods, i.e., NFS or the GlusterFS client.

4. Dave Mackey, Thursday, July 12 2012
Great list of alternatives. I wonder how Microsoft's Storage Spaces will compete?

5. Bill McColl, Thursday, July 12 2012
Good post. At Cloudscale we tried using HDFS a while back. Given that we wanted to run realtime analytics, graph analytics, MPI, and bulk synchronous parallelism (BSP), as well as Hadoop, in a unified big data architecture, HDFS was a non-starter. For our purposes, Lustre has proved to be a far better option as a super-fast and super-scalable file system foundation for a unified big data platform. Running Lustre on AWS clusters is new. We are probably the only company currently doing it. To help others get started, we posted how to do it here, if you want to try out Lustre on cloud clusters for yourself: http://www.cloudscale.com/index.php/technology/lustre-on-aws-cloud

Toast to HDFS, Thursday, July 26 2012
We wanted to make a sandwich/toast with HDFS, but could not. We had to design our own toaster. So is this a problem with HDFS?

6. scarpe messi 2012, Monday, July 16 2012
Dispersed Storage Network

7. eric baldeschwieler, Wednesday, July 25 2012
Hi Derrick, I've blogged my thoughts on HDFS vs. other systems on the Hortonworks website: http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/
E14, CTO Hortonworks & Hadoop architectural contributor

Derrick Harris, Thursday, July 26 2012
I saw your post, and Charles' (from Cloudera) post. I think you're both probably right that HDFS wins out in the end. By percentage, it's probably very dominant now. But there will always be people looking for alternatives.

8. Jay, Wednesday, July 25 2012
Looks like most of the alternatives being promoted come from highly expensive, proprietary, hardware-based storage vendors who are feeling threatened by the RISE OF HDFS and crying wolf. Everyone knows that HDFS has its flaws, but the benefits outweigh the drawbacks. It has got huge traction and in a few years will be the most dominant technology for data crunching. It is important for the storage vendors to re-invent themselves and not fight for a piece of the Hadoop pie. That will mean slow decay for them.

9. gengstrand, Sunday, July 29 2012
I think that it is disingenuous to propose expensive proprietary solutions as alternatives to an open source solution. Perhaps a more valuable article would be to propose open source alternatives instead. At Zoosk, we used http://www.dynamicalsoftware.com/nosql/solr instead of Hadoop for the implementation of our activity feed. The Apache Solr project is our open source alternative to Hadoop.

Derrick Harris, Sunday, July 29 2012
I think it depends on your use case. If you're a Fortune 500 company, you might want to pay for performance, reliability, etc. And some alternatives, such as Gluster (mentioned in comments), are open source but vendor-backed.

10. Cameron, Thursday, August 16 2012
The ParaScale distributed storage and computation platform perhaps deserves a mention here too, as one of the early vendors that integrated Hadoop into its platform and submitted the filesystem plugin to the open source community back in the 2009/2010 time frame. The ParaScale filesystem provided fast parallel ingest via the standard NFS protocol, full read/write POSIX semantics, a distributed and replicated object store, a global namespace, and many other enterprise features. Hitachi Data Systems acquired ParaScale in 2010.
References:
http://en.wikipedia.org/wiki/Apache_Hadoop
https://issues.apache.org/jira/browse/HADOOP-6704
