Cloudy Journey: Why Hadoop caught on

Doug Cutting (@cutting) is a founder of the Apache Hadoop project and an architect at Hadoop provider Cloudera. When Cutting expresses surprise at Hadoop's growth — as he does below — that carries a lot of weight.

In the following interview, Cutting explains why he's surprised at Hadoop's ascendance, and he looks at the factors that helped Hadoop catch on. He'll expand on some of these points during his Hadoop session at the upcoming Strata Conference.

Why do you think Hadoop has caught on?

Doug Cutting: Hadoop is a technology whose time had come. As computer use has spread, institutions are generating vastly more data. While commodity hardware offers affordable raw storage and compute horsepower, before Hadoop, there was no commodity software to harness it. Without tools, useful data was simply discarded.

Open source is a methodology for commoditizing software. Google published its technological solutions, and the Hadoop community at Apache brought these to the rest of the world. Commodity hardware combined with the latent demand for data analysis formed the fuel that Hadoop ignited.

Are you surprised at its growth?

Doug Cutting: Yes. I didn't expect Hadoop to become such a central component of data processing. I recognized that Google's techniques would be useful to other search engines and that open source was the best way to spread these techniques. But I did not realize how many other folks had big data problems nor how many of these Hadoop applied to.

What role do you see Hadoop playing in the near-term future of data science and big data?

Doug Cutting: Hadoop is a central technology of big data and data science. HDFS is where folks store most of their data, and MapReduce is how they execute most of their analysis. There are some storage alternatives, such as Cassandra and CouchDB, and useful computing alternatives, such as S4 and Giraph, but I don't see any of these replacing HDFS or MapReduce soon as the primary tools for big data.

Long term, we'll see. The ecosystem at Apache is a loosely-coupled set of separate projects. New components are regularly added to augment or replace incumbents. Such an ecosystem can survive the obsolescence of even its most central components.
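
For readers who have not met the MapReduce model Cutting mentions, the idea can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle and reduce steps using a word count; it is not Hadoop's API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example. A real Hadoop job would spread the same steps across many machines, reading from and writing to HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big problems"]))
```

Because each call to map_phase looks at one line only, and each call to reduce_phase looks at one key only, both steps can in principle run in parallel on separate machines, which is what makes the model a fit for commodity hardware.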

In your Strata session description, you note that "Apache Hadoop forms the kernel of an operating system for big data." What else is in that operating system? How is that OS being put to use?

Doug Cutting: Operating systems permit folks to share resources, managing permissions and allocations. The two primary resources are storage and computation. Hadoop provides scalable storage through HDFS and scalable computation through MapReduce. It supports authorization, authentication, permissions, quotas and other operating system features. So, narrowly speaking, Hadoop alone is an operating system.

But no one uses Hadoop alone. Rather, folks also use HBase, Hive, Pig, Flume, Sqoop and many other ecosystem components. So, just as folks refer to more than the Linux kernel when they say "Linux," folks often refer to the entire Hadoop ecosystem when they say "Hadoop." Apache Bigtop combines many of these ecosystem projects into a distribution, much like RHEL and Ubuntu do for Linux.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.