
The Hadoop Ecosystem's Continued Influence | BLOG@CACM

As Arun Murthy's 2019 post "Hadoop is dead, long live Hadoop" put it, barring any specialized or legacy requirements, racking and stacking servers in either an on-premise or commercial data center for a Hadoop cluster was no longer a preferred pattern given advances in public cloud computing. As the post said: done and dusted, as typically deployed in the prior decade.

That said, I've seen some instances where people have written the Hadoop era off entirely, as if it "didn't work at all." That's a few steps too far. Hadoop environments could be complex, especially at scale, and did take non-trivial effort to manage. To paraphrase Crossing The Chasm, Hadoop in that form might have had some trouble finding a home at the top of the technology adoption lifecycle curve (or at least staying there), due in part to this complexity, but it was still an incredibly powerful platform in the hands of Innovators and Early Adopters. Hadoop was an advancement in distributed computing that owed much to the Google GFS, MapReduce, and BigTable papers, but it was also a response to a broader technology community need: specifically, frustration with the data management and analytic patterns available at the time, which were overwhelmingly proprietary.



While computing patterns have moved significantly from on-premise to cloud, it is worth noting that many of the frameworks from the Hadoop ecosystem are still in use.


Apache Lucene is a search engine library created in 1999, still very much in use within higher-level search frameworks such as Elasticsearch and Solr.

Search was a use-case that forced distributed computing, both in storage and in processing, and arguably it was the first use-case of the Hadoop ecosystem. Nutch is a framework for distributed web crawling that began as a sub-project of Lucene, and Hadoop started as a sub-project of Nutch. Although Lucene was created years before Hadoop existed, in a way it was the genesis of the ecosystem.
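The core data structure behind Lucene-style search is the inverted index: a map from each term to the documents that contain it. A minimal single-machine sketch in plain Python (the function names and sample documents here are illustrative, not Lucene APIs):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Return ids of documents containing all given terms (AND query)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "Hadoop started as a sub-project of Nutch",
    2: "Nutch began as a sub-project of Lucene",
    3: "Lucene is a search engine library",
}
index = build_inverted_index(docs)
print(search(index, "nutch", "sub-project"))  # → {1, 2}
```

Real search engines add tokenization, ranking, and segment files on top, but the term-to-postings mapping is the heart of it.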


Apache Hive is a distributed SQL-based query engine with a metadata repository that goes back to the earliest days of Hadoop. Programmers may be variously frustrated with commercial relational database management systems, but SQL itself is a lingua franca for data analysts, and Hive provided the first SQL support in the ecosystem. The earliest Apache Hive Jira ticket goes back to 2008, but the project was built upon prior efforts at Facebook.

Hive Metastore

While the Hive query engine attracted many competitors, the Hive Metastore was used in a variety of distributed SQL engines, such as Impala, Drill, Presto, BigSQL, Shark, and SparkSQL, among others. The Hive Metastore is also used directly, or as an integration option, by cloud data platforms such as Databricks and Snowflake.
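The reason so many engines could share the Metastore is what it provides: a common catalog mapping table names to a schema, a storage location, and a file format. A conceptual sketch of that idea (illustrative only; the actual Hive Metastore is a Thrift service, and the table name and paths below are made up):

```python
# A toy catalog: table name -> schema, storage location, file format.
metastore = {}

def register_table(name, columns, location, fmt):
    """Record where a table's files live and how to interpret them."""
    metastore[name] = {"columns": columns, "location": location, "format": fmt}

def describe_table(name):
    """What any query engine asks the catalog before reading the files."""
    return metastore[name]

register_table(
    "web_logs",
    columns=[("ts", "timestamp"), ("url", "string"), ("status", "int")],
    location="hdfs:///warehouse/web_logs",  # hypothetical path
    fmt="parquet",
)

print(describe_table("web_logs")["format"])  # → parquet
```

Because every engine (Hive, Presto, SparkSQL, and so on) consults the same catalog, they all see the same schema and know where and how the data is stored, which is exactly why the Metastore outlived the query engine it shipped with.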

Hive Clones

Speaking of Hive clones, the AWS query service Athena is actually the Presto framework under the hood. As the saying goes, imitation is the sincerest form of flattery.

Serialization Frameworks

The Hadoop ecosystem saw the development of several serialization frameworks, for both row and columnar formats.

Row oriented

Avro is a row-oriented serialization framework that dates to 2009 within Hadoop, but its usage spread beyond the ecosystem.


Parquet is a columnar serialization framework that dates to 2012/2013, from Cloudera's efforts on the Impala query engine. ORC is another columnar framework, created in 2013 by Hortonworks and used by Hive. Columnar formats were a huge step forward for data management and analytics on large datasets in the Hadoop ecosystem, as this capability was previously only available from proprietary column-oriented databases like Vertica. Both Parquet and ORC continue to be widely used.
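One reason columnar formats were such a step forward can be shown with nothing but the standard library: storing a column's values together puts similar bytes next to each other, which compresses far better than interleaving whole rows. A rough sketch, with `zlib` standing in for the real machinery (Parquet and ORC actually use dictionary and run-length encodings plus codecs like Snappy, and the data below is synthetic):

```python
import json
import zlib

# 10,000 synthetic rows with a low-cardinality "status" column.
rows = [{"id": i, "status": "OK" if i % 10 else "ERROR"} for i in range(10_000)]

# Row-oriented layout: records serialized one after another.
row_bytes = json.dumps(rows).encode()

# Column-oriented layout: each column's values stored contiguously.
columns = {
    "id": [r["id"] for r in rows],
    "status": [r["status"] for r in rows],
}
col_bytes = json.dumps(columns).encode()

row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(col_bytes))
print(row_compressed, col_compressed)
assert col_compressed < row_compressed  # columnar compresses better here
```

Columnar layouts also let a query read only the columns it touches, which matters just as much as the compression win on wide analytic tables.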


The Snappy compression library was developed by Google and released as open source in 2011. The Hadoop ecosystem provided a great showcase of Snappy's capabilities across a variety of projects, aided by a distribution-friendly license. Snappy continues to be widely used.


Apache Kafka emerged in 2011 as a distributed streaming platform from efforts at LinkedIn. While not strictly speaking a "Hadoop" project, Kafka was frequently used in solutions that utilized Hadoop, for streaming data ingress to or egress from clusters, among other use-cases.

A historical implementation detail worth noting is that, until recently, Kafka utilized the Apache ZooKeeper consensus service. ZooKeeper is yet another project that originally started as a Hadoop sub-project and then eventually split out as its own top-level project.


Speaking of responses, Spark was a response to MapReduce's early successes, as the 2010 paper "Spark: Cluster Computing with Working Sets" made clear, notably where iterative algorithms were needed. Spark eventually was used in Hadoop solutions via the Hadoop resource manager YARN, in both batch and streaming cases. A decade after its creation, Spark moved from being a response to being the dominant distributed processing framework, which is quite an achievement. But that achievement did not happen in a vacuum.
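The MapReduce model that Spark generalized can be sketched in a few lines of plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a single-process illustration of the programming model only, not a distributed implementation; Spark's insight was keeping intermediate data in memory across such steps instead of writing it out between jobs.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) for every word, as a mapper would.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values, as a reducer would.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop is dead", "long live hadoop"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # → 2
```

An iterative algorithm (say, PageRank) would re-run a pipeline like this many times over the same input, which is exactly where on-disk MapReduce paid a steep price and in-memory Spark did not.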

BigTable – What Goes Around Comes Around

Google BigTable, a distributed key-value storage framework, inspired a host of "NoSQL" clones, one of them being Apache HBase (~2007). In a circular twist, BigTable now supports HBase-compatible APIs.
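The data model the BigTable paper describes, and which HBase adopted, is a sparse, sorted, multi-dimensional map keyed by row, column, and timestamp. A toy single-machine sketch of that map (illustrative only; the class and the `com.cnn.www` example row follow the paper's running example, and none of this is the HBase API):

```python
import bisect

class ToyBigTable:
    """Sparse sorted map: (row, column) -> {timestamp: value}."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}
        self._rows = []   # sorted row keys, to support range scans

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value
        i = bisect.bisect_left(self._rows, row)
        if i == len(self._rows) or self._rows[i] != row:
            self._rows.insert(i, row)

    def get(self, row, column):
        """Latest version of a cell, like a default HBase Get."""
        versions = self._cells.get((row, column), {})
        return versions[max(versions)] if versions else None

    def scan(self, start_row, stop_row):
        """Row keys in [start_row, stop_row), in sorted order."""
        lo = bisect.bisect_left(self._rows, start_row)
        hi = bisect.bisect_left(self._rows, stop_row)
        return self._rows[lo:hi]

t = ToyBigTable()
t.put("com.cnn.www", "contents:html", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:html", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:html"))  # → <html>v2</html>
```

Keeping rows sorted is what makes range scans cheap, and it is why both systems recommend designing row keys (reversed domain names, in the paper's example) so related rows sit near each other.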


To echo Arun's sentiments at the end of his post: as long as there is data, there will be "Hadoop." May that spirit of innovation continue, in whatever form it takes.



  • "Hadoop is dead, long live Hadoop" (Arun Murthy, 2019)
  • Crossing The Chasm, Geoffrey Moore
  • Apache Hadoop
  • Apache Spark (2010)
  • Google
  • MapReduce paper (2004)
  • BigTable paper (2006)
  • BigTable
  • Related BLOG@CACM Posts
    • "The continual re-creation of the key-value datastore"
    • "Why are there so many programming languages?"


Doug Meil is a software architect in healthcare data management and analytics. He also founded the Cleveland Big Data Meetup in 2010. More of his BLOG@CACM posts can be found at


