23 June 2019

The Time For An Elephant In The Room To Go?

There is no consensus on what “Data Science” means exactly, and people often mean different things when they talk about it. Of course, many areas of expertise, from statistics to machine learning, from engineering to pipeline devops, are required to solve a business problem with data. While I have been fortunate to work with some of the best algorithmic and machine learning experts, my personal preference, and the focus of my continuing pursuit of ever-elusive technical perfection, has been the engineering side of data science. In this article, I would like to share some of the lessons I have learned about where the elephant in the room, Hadoop, goes from here, and what will be left in its place.

Before we start criticizing Hadoop, let’s give credit where credit is due. Hadoop effectively democratized access to big data, taking the place of the mainframes that ruled the world before it. MapReduce revolutionized the world of analytics, and the surrounding infrastructure became a standard for storing corporate data, enabling SQL queries through Hive or Impala and access from Python and other tools. With all of this convenience, what is wrong with Hadoop?

One problem with Hadoop is its high “overhead.” It takes too much time to run even a small task on Hadoop, however simple it might be. You will inevitably spend your time writing MapReduce code instead of perfecting your business logic, and bending your algorithm to fit the Hadoop paradigm instead of simply getting the job done. Once you finally work through all of the hiccups and run your code, you will discover a new source of frustration: low performance is a direct result of Hadoop’s architecture, since every intermediate result is written to disk and sent across the network between the nodes.

What if you prefer Python to the standard Java for writing MapReduce jobs? You will quickly learn that performance is even worse, not because of Python itself, but because of the extra layers of abstraction that slow Hadoop down further. On top of the performance issues, it will be close to impossible to keep up with newer versions of Python libraries, such as Pandas, Scikit-Learn, or various statistical modeling packages. Operations teams correctly reason that any upgrade can destabilize the Hadoop deployment and impact other critical production jobs.
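To make this overhead concrete, here is a rough sketch of what even a trivial word count looks like as a Hadoop Streaming job written in Python; the scripts follow the usual streaming convention of tab-separated key-value pairs on stdin/stdout, and the file names are only illustrative.

```python
#!/usr/bin/env python
# mapper.py -- emits "word\t1" for every word read from stdin
# (the Hadoop Streaming convention for mapper output).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts for each word; relies on Hadoop having
# sorted the mapper output by key before it reaches the reducer.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Even this toy example splits into two scripts, forces a shuffle across the cluster, and still has to be submitted through the hadoop-streaming jar before any real business logic appears.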

The overhead of Hadoop is so heavy that Adam Drake, in his article “Command Line Tools Can Be 235x Faster Than Your Hadoop Cluster,” measures a 235x speed advantage for Linux command line tools over Hadoop. This blew my mind when I first saw it, but it makes a lot of sense, given the near-zero overhead of command line tools. I have since incorporated some CLI elements into my production data processing workflows, and if you are concerned about the maintainability of my CLI-based code, rest assured that I am not the only person who can support it.
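By contrast, the kind of single-machine, stream-oriented processing Drake describes needs little more than standard input; the sketch below is a hypothetical example of dropping a small Python filter into a Unix pipeline, with the log layout and field position assumed purely for illustration.

```python
#!/usr/bin/env python
# count_status.py -- a zero-overhead alternative for small jobs: stream a log
# through a Unix pipeline and aggregate in constant memory on one machine.
# Illustrative usage: zcat access-*.log.gz | python count_status.py
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    fields = line.split()
    if len(fields) > 8:            # assumes an Apache-style access log layout
        counts[fields[8]] += 1     # in that layout, field 8 is the HTTP status code
print(counts.most_common(10))
```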

Having accepted that Hadoop is slow but at least scalable for batch processing, the right question to ask is how much latency this processing adds. The latency is the gap between the moment the data is collected and the moment the final result emerges at the end of your machine learning pipeline. This gap grows even larger when you increase the interval between batch jobs to an hour or even a day, which makes a big difference if you need to act on your logs promptly.

With your data streaming in continuously, and with the recent push for real-time processing, the concept of a data bus has emerged, along with a technology stack to support it. Apache Kafka or RabbitMQ works well for queuing the data, enabling real-time analytics pipelines and exposing data streams as topics. Can we produce useful, close-to-real-time analytics with Hadoop if we divide the data into chunks for batch processing? In an upcoming blog post, I plan to explore how to create a hybrid system with real-time and batch components, taking advantage of the existing batch processing infrastructure while delivering true real-time results.
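As a small illustration, consuming such a topic from Python with the kafka-python client takes only a few lines; the topic name, broker address, and JSON message format below are assumptions, not part of any particular deployment.

```python
# A minimal sketch of reading a topic from a Kafka data bus with kafka-python;
# "clickstream" and the broker address are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",                          # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    event = message.value
    # Feed each event into the real-time branch of the pipeline here.
    print(event)
```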

SQL has been the standard way to query data, and the Hadoop ecosystem has produced several solutions to enable SQL or SQL-like querying. We have Hive and Impala, but also Spark SQL and Presto, to name a few. Presto in particular can be integrated with a variety of data sources through its plugins, and AWS already offers Amazon Athena, a Presto-based service that queries data directly in S3. Apache Zeppelin, a popular data visualization tool, can use both Presto and Spark SQL as data sources, allowing your data team to visualize the data through flexible SQL queries, in either a private or public cloud.
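For example, a Presto query can be issued straight from Python; the sketch below assumes the PyHive client, and the coordinator host, catalog, and table names are placeholders.

```python
# A hedged sketch of querying Presto from Python with the PyHive client;
# the coordinator address, catalog, schema, and table are assumptions.
from pyhive import presto

conn = presto.connect(
    host="presto-coordinator.internal",  # hypothetical coordinator host
    port=8080,
    username="analyst",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()
cursor.execute("SELECT status, count(*) AS hits FROM web_logs GROUP BY status")
for row in cursor.fetchall():
    print(row)
```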

SQL is not the only way to access big data. If you need full-text search, or work with otherwise unstructured data, you should at least consider NoSQL. While many agree that PostgreSQL is an amazing open source database, there are other good choices: Elasticsearch can be far better for diverse, loosely structured data, and Neo4j is optimized for storing and querying graphs. Hazelcast, Apache Ignite, Aerospike, Cassandra - the list goes on, and depending on your requirements, existing technology stacks, integration points, and your vision for the perfect data processing pipeline, there are many big data storage and access alternatives to Hadoop.
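To give a flavor of the NoSQL side, here is a minimal full-text search sketch using the official elasticsearch-py client; the index name, field, and query term are made up for illustration.

```python
# A minimal full-text search sketch against Elasticsearch via elasticsearch-py;
# "app-logs" and the "message" field are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
response = es.search(
    index="app-logs",                                   # hypothetical index
    body={"query": {"match": {"message": "timeout"}}},  # full-text match query
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["message"])
```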

No discussion of big data engineering is complete without continuous integration and deployment. Let's take a look at a typical CI/CD workflow: first, developers submit their code to the repository; then automatic tests are triggered and run; and finally the code is packaged into OS packages for deployment to testing or production environments. What if you need different setups for testing? Containers can do just that. Do you want to deploy to both Ubuntu and CentOS? Containers will help. Do you need high availability, close-to-zero-downtime upgrades, and horizontal scalability? Containers will save the day.

Containers, however, are not silver bullets; they are just wrappers. They can "isolate" the required environment and create a somewhat independent sandbox for an application, but you might want more. What about flexible networking between containers, automatic configuration of replicas, or integration with distributed file systems such as GlusterFS or Ceph? Until recently, companies had to address these needs themselves or fall back on virtual machines managed by platforms such as OpenStack. Thanks to Google and many other contributors and enthusiasts, Kubernetes was born; it covers most of these needs, making it a valuable tool for enabling all kinds of flexible workflows, including data analytics pipelines.
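As an example of how naturally Kubernetes fits batch-style data work, the sketch below submits a one-off Job through the official Python client; the image, namespace, and job name are hypothetical.

```python
# A hedged sketch of submitting a one-off batch Job to Kubernetes from Python
# using the official kubernetes client; image, namespace, and names are assumptions.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster

container = client.V1Container(
    name="nightly-aggregation",
    image="registry.example.com/etl:latest",   # hypothetical image
    command=["python", "aggregate.py"],
)
template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "nightly-aggregation"}),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="nightly-aggregation"),
    spec=client.V1JobSpec(template=template, backoff_limit=2),
)
client.BatchV1Api().create_namespaced_job(namespace="data", body=job)
```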

From our experience at Aligned Research, there is one specific data-related job that Kubernetes cannot handle properly on its own: DAGs, directed acyclic graphs of tasks with topological sorting and execution control. While there are open source tools that address this gap, such as Airflow, Luigi, and Mistral, none of them is perfect. Airflow is overloaded with features while lacking a declarative API for tasks; Luigi is the opposite, too simplistic; and Mistral is best suited to OpenStack-based environments. While some of them may work for your requirements, we found a somewhat unexpected solution that proved to be both simpler and more effective for our customers: Jenkins.
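For comparison, this is roughly what a minimal Airflow DAG looks like (using Airflow 1.10-era imports): the dependencies are wired up in plain Python rather than declared in a separate format, which is exactly the trade-off mentioned above. The task commands and schedule are placeholders.

```python
# A minimal Airflow DAG sketch; the dag_id, commands, and schedule are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="nightly_pipeline",
    start_date=datetime(2019, 6, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="python extract.py", dag=dag)
transform = BashOperator(task_id="transform", bash_command="python transform.py", dag=dag)
load = BashOperator(task_id="load", bash_command="python load.py", dag=dag)

# The >> operator sets downstream dependencies; the scheduler handles the
# topological ordering and retries.
extract >> transform >> load
```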

You most likely have Jenkins already deployed as part of your CI pipeline, correct? Not so long ago, Jenkins, along with a few other CI tools, gained a declarative pipeline API built on a Groovy-based domain-specific language (DSL). This DSL finally lets pipeline developers specify whether a given processing stage should be run in parallel with another stage or strictly in sequence. Jenkins currently executes most of our Kubernetes jobs, and we are actively experimenting with Jenkins X, a newer tool that is even more tightly integrated with Kubernetes.

In the end, we cannot tell you whether you should get rid of the elephant in the room, Hadoop, or incorporate it into your next-generation pipeline. After all, Hadoop manages to do everything and nothing well, simultaneously. Likewise, we cannot tell you which other tools to use, since the right choice depends on your needs, requirements, and the talent available in your organization. In our blog, we will continue to describe best practices and “recipes” for data processing workflows, based on our failures and successes. If you’d like to be notified when we publish another article, please follow our blog.