Kubernetes and Data Engineering

When you say data engineering, the first thing most professionals think of is Hadoop. Hadoop was the big game-changer more than a decade ago, allowing us to escape the confines of relational databases. Hadoop showed you could use distributed computing and storage over a large cluster of servers with instant failover. Hard drive on a node failed? No worries, we can get to it in the morning (assuming we set our replication factor to 3).

However, the Hadoop ecosystem is getting very old. It’s been well over a decade since Hadoop first entered the enterprise computing scene, and it’s beginning to look outdated. Hadoop is the new mainframe: an overcomplicated system that has become inflexible to change or upgrade. It’s not intuitive or easy to introduce to new users.

Kubernetes now solves the problems the Hadoop ecosystem was designed for, and in many ways solves them better.

  • Common redundant storage – In HDFS, data is usually stored in triplicate across 3 different pieces of hardware. If one of those hardware elements fails, the NameNode ensures the affected blocks get copied to another part of the cluster so that 3 copies exist again. In Kubernetes there is the concept of persistent volumes. An application claims a volume, and with network-backed storage the data follows the application to whichever node its pods get scheduled on. Replication is handled by the storage backend behind the volume rather than by Kubernetes itself.
  • Distributed job processing – The YARN ResourceManager receives job requests for deployment and manages them in queues. It also limits the resources a submitted job can use based on the queue it runs in. Kubernetes namespaces segregate applications, and resource quotas on a namespace enforce limits on what they can consume. In both cases the managers distribute workload across the cluster to maximize fault tolerance and balance load.
  • Automatic failover – In both cases the platform can suffer a failure in one segment without end users or applications having any idea something went wrong. Did a disk in the HDFS pool fail? No worries, as there are two replicas that can be accessed without bothering the end user. Did a pod crash on a given worker? No problem, as other pod replicas pick up the slack and the control plane creates a replacement pod immediately.
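
The pod replacement described in the last bullet can be pictured as a small reconcile loop: compare what is running against what was asked for, and close the gap. The sketch below is a toy model in Python, not how the Kubernetes control plane is actually implemented. The function names and the "web-N" pod names are made up for illustration.

```python
import itertools

def reconcile(desired_replicas, running_pods, new_name):
    """Return the pod list after one reconcile pass.

    Starts new pods while there are too few and drops the extras
    when there are too many. Purely illustrative.
    """
    pods = list(running_pods)
    while len(pods) < desired_replicas:
        pods.append(new_name())
    return pods[:desired_replicas]

# Hypothetical pod naming: web-1, web-2, ...
_counter = itertools.count(1)

def next_pod_name():
    return f"web-{next(_counter)}"

pods = reconcile(3, [], next_pod_name)    # initial rollout: three pods
pods.remove("web-2")                      # simulate a crashed pod
pods = reconcile(3, pods, next_pod_name)  # next pass replaces it
print(pods)  # ['web-1', 'web-3', 'web-4']
```

The point of the toy is that nobody has to notice the crash. The loop only ever asks "how many should there be?" and acts on the difference.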

Where Kubernetes does it better

When a YARN job gets submitted you kind of submit and pray.  YARN does have a facility to automatically restart failed jobs.  In my professional experience this never gets set up.  Getting YARN to tell me that a workload failed is not intuitive or easy to figure out.  You have to scan the YARN queues for failed jobs and roll your own notification.
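
To give a sense of what "roll your own notification" looks like, here is a minimal sketch in Python. It assumes the ResourceManager REST endpoint `/ws/v1/cluster/apps` with its `states` query parameter, which is how I understand the Hadoop REST API to work; check it against your Hadoop version. The ResourceManager address, the sample payload, and the `notify` stand-in are all hypothetical.

```python
import json
import urllib.request

def failed_apps(payload):
    """Pull (id, name, queue) out of a cluster-apps JSON response.

    The response can carry "apps": null when nothing matches,
    so missing or empty sections are treated as an empty list.
    """
    apps = (payload.get("apps") or {}).get("app") or []
    return [(a["id"], a.get("name", ""), a.get("queue", "")) for a in apps]

def fetch_failed_apps(rm_url):
    """Ask the ResourceManager for applications in the FAILED state.

    rm_url is an assumption, e.g. "http://resourcemanager.example:8088".
    """
    request = urllib.request.Request(
        f"{rm_url}/ws/v1/cluster/apps?states=FAILED",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return failed_apps(json.load(response))

def notify(apps):
    """Stand-in for a real alert (email, chat webhook, pager)."""
    return [
        f"YARN application {app_id} ({name}) failed in queue {queue}"
        for app_id, name, queue in apps
    ]

# Made-up payload in the shape the endpoint is assumed to return.
sample = {"apps": {"app": [
    {"id": "application_1_0001", "name": "nightly-load",
     "queue": "etl", "state": "FAILED"},
]}}
messages = notify(failed_apps(sample))
print(messages[0])
```

You would still need to schedule this yourself and remember which failures you have already reported, which is exactly the babysitting I am complaining about.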

Kubernetes has the concept of a replica set, and I prefer it to the way YARN handles failures. When you submit your application to the cluster you are saying, “I would like it to always look like this.” The control plane will do its best to maintain that state. This is a key difference in architecture philosophy. Hadoop expects you to be around to baby it and always watch. Kubernetes manages this for you automatically. Cloudera’s documentation does talk about autoscaling a bit, but there are a ton of preconditions. The most irritating:

Only one autoscale policy type (either load-based or schedule-based) can be configured for a single host group, in a single cluster, at a time.

What this tells me is Cloudera Hadoop just does not think auto-scaling is an important feature. I imagine the makers of cars said something similar when automatic transmission was introduced. They just don’t want to do the work to make things easier for their users, and in the case of Cloudera, their customers. I remember seeing a funny demotivator poster that said “if you’re not a part of the solution, there’s good money to be made in prolonging the problem.”

This does get to the main problem with Hadoop. It’s a collection of already defined services that you have to use in a specific way, the way they tell you. Kubernetes is the framework on which you can deploy your customized services and applications. While Hadoop is focused on MapReduce and Spark jobs, Kubernetes is bring-what-you-want. If it can run in a container you can run it as a Kubernetes workload. Run Python with some weird library you want to keep isolated? Sure thing, not a problem. Some Ruby application? Go for it. A customized API for analysts to get data? Sure thing. A bring-your-own web interface for interacting with data? Absolutely.

Hadoop does have some ability to deploy Docker containers, but why? Kubernetes was built for that. Kubernetes describes itself as a container orchestration system. Hadoop YARN is just doing it because everyone else is, and you get the sense from its documentation that they wish you wouldn’t.

But Kubernetes is complicated…

Try Hadoop and you will be surprised by its overly complex nature.  The difference I think is that Kubernetes is complex but designed to solve modern problems.  Hadoop was designed to overcome the problems of the old world of relational databases.  We no longer live in that world and we need more options than just HDFS and the YARN command line interface.

Kubernetes allows a lot more freedom in what you can deploy. For data engineering it opens up a lot of options to reliably move, process, and present data to end users.

Danforth Retail Stats

I live in the east end of Toronto near “The Danforth”. Danforth Avenue is a 9.2 km stretch of road in the borough of East York inside Toronto. It’s full of retail shops and has grown organically since its inception in 1799. It’s a true hodgepodge of buildings and shops; no city planner could design this. This is part of its appeal. There are well over 600 businesses from Broadview to Woodbine alone. It’s why I like to call East York home.

When the pandemic came I became concerned over the health of Danforth businesses. Every resident in the east end has their favorite pub, shop, or cafe on the Danforth. Shops come and go on the Danforth. That got me thinking: could I measure the health of the east end by counting the active shops? Count the shops on a regular basis. How many shops were open for business compared to empty shops?

Primitive Data Collection

Hand Counters
This is the most basic of data collection methods: simply count. I got myself a couple of hand counters, no batteries needed. I came up with some basic rules for what to count and put on a good pair of sneakers. I go out about once a month and count retail spaces that are active or not in use. The results get saved in this Google Sheet. I don’t do the whole Danforth, just the section from Broadview to Woodbine.

Variances

I notice that the total number of shops fluctuates from one month to the next. I don’t think buildings are magically appearing and disappearing every 4 to 6 weeks. Data collection is never perfect. Did I count that retail shop? Did I double count? When I start my walk I think I will pay attention, but then I see a pub or a cafe I have to stop and visit. Did my attention wander? All of that happens, which is why we don’t take one day’s count as a 100% accurate count. And neither does an organization like Statistics Canada. It’s the average over time that matters.
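
One simple way to lean on the average instead of any single walk is a rolling mean. This is a sketch in Python, not something I run against the real sheet; the monthly totals are made-up placeholders and the 3-month window is an arbitrary choice.

```python
from statistics import mean

def rolling_mean(values, window=3):
    """Average each value with up to window-1 values before it."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(mean(chunk))
    return smoothed

# Placeholder totals for six monthly walks (NOT my real counts).
monthly_totals = [612, 598, 607, 615, 601, 609]
smoothed = rolling_mean(monthly_totals)

for raw, avg in zip(monthly_totals, smoothed):
    print(f"counted {raw}, rolling average {avg:.1f}")
```

With these placeholder numbers the raw totals swing by 17 between the lowest and highest walk, while the smoothed series swings by 7.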

The Results


Mixed. I am surprised how the counts differ from month to month. It’s about 8 km from Woodbine, where I start counting, to Broadview and back. I calculate the utilization rate as the number of active retail spaces divided by the total number of retail spaces. By my estimates about 90% of retail spaces are active. Note there is a slight upward trend from when I started in June of 2023. Some of that might be explained by the new subway project at Pape and Danforth. That took out a number of retail spaces as buildings were removed.
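
The utilization rate is simple arithmetic, sketched here in Python. The counts in the example are placeholders chosen to land on 90%, not figures from my sheet.

```python
def utilization_rate(active, empty):
    """Active retail spaces divided by all retail spaces counted."""
    total = active + empty
    if total == 0:
        raise ValueError("no retail spaces counted")
    return active / total

# Placeholder walk: 540 active spaces and 60 empty ones.
rate = utilization_rate(540, 60)
print(f"{rate:.0%}")  # 90%
```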

Based on the line graph below, it doesn’t look like much of a change, but there is a slight upward slope. Is business getting better?
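
A slope that is hard to see by eye can be put into a number with an ordinary least-squares fit against the walk number. This is a sketch in Python of that calculation, not necessarily what the spreadsheet does for its trend line, and the utilization percentages are placeholders.

```python
def trend_slope(values):
    """Least-squares slope of values against their position 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points for a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator

# Placeholder utilization percentages for four walks (NOT real data).
rates = [89.0, 90.5, 89.5, 91.0]
slope = trend_slope(rates)
print(f"{slope:+.2f} percentage points per walk")  # +0.50
```

A positive number means the line tilts upward. Whether it is big enough to matter is a separate question, given how noisy the counts are.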

Let’s exaggerate the differences by setting the lower bound presented on the graph to 80%.

Making this sort of change helps us magnify the changes over time, which can be a good thing to see or a deceptive thing depending on the presenter and the audience. Am I selling you something or are we examining the data together?

Keeping the data fresh

I hope to keep updating these numbers and monitoring the health of the Danforth. It’s my neighborhood and I care about the place. As the economy expands and shrinks over the years it would be nice to see how it correlates to shops coming and going.