
When you say data engineering, the first thing most professionals think of is Hadoop. Hadoop was the big game-changer more than a decade ago, allowing us to escape the confines of relational databases. It showed you could use distributed computing and storage across a large cluster of servers with instance-level failover. Hard drive on a node failed? No worries, we can get to it in the morning (assuming we set our replication factor to 3).
However, the Hadoop ecosystem is showing its age. It's been well over a decade since Hadoop first entered the enterprise computing scene, and it's starting to look outdated. Hadoop is the new mainframe: an overcomplicated system that has become inflexible to change or upgrade, and one that is neither intuitive nor easy to introduce to new users.
The problems the Hadoop ecosystem was designed to solve have been improved upon in many ways by Kubernetes.
- Common redundant storage – in HDFS, data is usually stored in triplicate, across three different pieces of hardware. If one of those pieces fails, the NameNode ensures the affected blocks get copied to another part of the cluster so three copies always exist. Kubernetes has the concept of persistent volumes: storage is decoupled from the pods that use it, and with a replicated storage backend, applications on any node have access to it (see the sketch after this list).
- Distributed job processing – the Hadoop YARN resource manager receives job submissions and manages them in queues, limiting the resources a job can use based on the queue it runs in. Kubernetes namespaces segregate applications and enforce resource limits on them (also shown in the sketch below). In both cases the managers distribute workloads across the cluster to maximize fault tolerance and balance load.
- Automatic failover – on both platforms, one segment can fail without end users or applications having any idea something went wrong. Did a disk in the HDFS pool fail? No worries, two replicas can still be read without bothering the end user. Did a pod crash on a worker? No problem, the other pod replicas pick up the slack and the control plane creates a replacement pod immediately.
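To make the storage and namespace points above concrete, here is a minimal sketch of the Kubernetes side. The analytics namespace, the quota figures, the shared-data claim, and the standard storage class are all illustrative assumptions, not a prescribed setup:

```yaml
# A namespace with a ResourceQuota plays roughly the role of a YARN queue:
# workloads inside it share a capped pool of CPU and memory.
apiVersion: v1
kind: Namespace
metadata:
  name: analytics             # hypothetical namespace name
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: analytics-quota
  namespace: analytics
spec:
  hard:
    requests.cpu: "16"        # illustrative limits; tune per cluster
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi
---
# A PersistentVolumeClaim asks the cluster for durable storage.
# Whether that storage is replicated (the HDFS-triplicate analogue)
# depends on the backing storage class, not on Kubernetes itself.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
  namespace: analytics
spec:
  accessModes:
    - ReadWriteMany           # many pods may mount it, if the backend allows
  storageClassName: standard  # assumption: a replicated class exists
  resources:
    requests:
      storage: 100Gi
```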
Where Kubernetes does it better

When a YARN job gets submitted, you kind of submit and pray. YARN does have a facility to automatically restart failed jobs, but in my professional experience it never gets set up. And getting YARN to tell me that a workload failed is neither intuitive nor easy to figure out: you have to scan the YARN queues for failed jobs and roll your own notification.
Kubernetes has the concept of a ReplicaSet, and I prefer it to the way YARN handles things. When you submit your application to the cluster, you are declaring "I would like it to always look like this," and the control plane will do its best to maintain that state (there's a concrete sketch of this below). This is a key difference in architectural philosophy: Hadoop expects you to be around to babysit it and watch it constantly, while Kubernetes manages this for you automatically. Cloudera's documentation does talk about auto-scaling a bit, but there are a ton of preconditions, the most irritating being:
> Only one autoscale policy type (either load-based or schedule-based) can be configured for a single host group, in a single cluster, at a time.
What this tells me is that Cloudera Hadoop just does not think auto-scaling is an important feature. I imagine carmakers said something similar when the automatic transmission was introduced. They just don't want to do the work to make things easier for their users, and in Cloudera's case, their customers. I remember seeing a funny demotivational poster that said, "if you're not a part of the solution, there's good money to be made in prolonging the problem."
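To make the "always look like this" contract concrete, here is a minimal Deployment sketch. Everything specific in it (the data-api name, the image, the replica count, the analytics namespace) is an assumption for illustration:

```yaml
# Declare the desired state: three replicas of this service, always.
# If a pod or node dies, the control plane replaces the pod on its own;
# nobody has to scan a queue for failures.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: data-api              # hypothetical application name
  namespace: analytics
spec:
  replicas: 3                 # the state the control plane will maintain
  selector:
    matchLabels:
      app: data-api
  template:
    metadata:
      labels:
        app: data-api
    spec:
      containers:
        - name: data-api
          image: registry.example.com/data-api:1.0  # illustrative image
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```

Apply it with `kubectl apply -f` and the ReplicaSet it creates keeps three pods running; delete one and the control plane immediately schedules a replacement, no babysitting required.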

This gets to the main problem with Hadoop: it's a collection of already-defined services you have to use in a specific way, the way they tell you. Kubernetes is a framework on which you can deploy your own customized services and applications. While Hadoop is focused on MapReduce and Spark jobs, Kubernetes is bring-what-you-want: if it can run in a container, you can run it as a Kubernetes workload. Run Python with some weird library you want to keep isolated? Sure thing, not a problem (a minimal sketch of this follows below). Some Ruby application? Go for it. A customized API for analysts to get at data? Easy. A bring-your-own web interface for interacting with data? Absolutely.
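As a sketch of that isolated-Python case, a Kubernetes Job is the closest analogue to submitting a batch job to YARN. The image, command, and names here are all hypothetical stand-ins for whatever container you build:

```yaml
# A one-off batch workload, the Kubernetes counterpart to a YARN job.
# The container isolates the odd Python dependency from everything else.
apiVersion: batch/v1
kind: Job
metadata:
  name: weird-lib-etl         # hypothetical job name
  namespace: analytics
spec:
  backoffLimit: 4             # retry failed pods automatically
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: etl
          image: registry.example.com/weird-lib:latest  # your own image
          command: ["python", "/app/run_etl.py"]        # illustrative entrypoint
```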
Hadoop does have some ability to deploy Docker containers, but why bother? Kubernetes was built for that; it describes itself as a container orchestration system. Hadoop YARN is just doing it because everyone else is, and you get the sense from the documentation that they wish you wouldn't.
But Kubernetes is complicated…
Try Hadoop and you will be surprised by its overly complex nature too. The difference, I think, is that Kubernetes is complex because it is designed to solve modern problems. Hadoop was designed to overcome the problems of the old world of relational databases. We no longer live in that world, and we need more options than just HDFS and the YARN command-line interface.
Kubernetes allows a lot more freedom in what you can deploy. For data engineering, it opens up far more options to reliably move, process, and present data to end users.