Data Engineering

Real-time Data Availability with Kafka

Separating the OLTP and OLAP layers with Kafka

Kshitij Gaikar

--

Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications, used by numerous companies in their data-streaming and analytics products. Kafka was developed at LinkedIn for its activity stream and news feed, and was later open-sourced for public use. This blog provides a summary of real-time data processing with Kafka.

At its core, Kafka is an event streaming platform.

What is event streaming?

Event streaming is the ability to capture and process data in real time from many sources, such as IoT devices and applications.

Kafka combines three key capabilities for event streaming:

  1. To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
  2. To store streams of events durably and reliably for as long as you want.
  3. To process streams of events as they occur or retrospectively.
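The three capabilities above can be sketched with a toy, in-memory stand-in for a Kafka topic: an append-only log that supports publishing, durable retention, and replay from any offset. The class and method names here are illustrative, not real Kafka APIs.

```python
# A toy, in-memory stand-in for a Kafka topic. This illustrates the three
# capabilities (publish/subscribe, durable storage, replay); it is NOT how
# real Kafka is implemented or accessed.
class ToyTopic:
    def __init__(self):
        self._log = []  # events are retained, not deleted after being read

    def publish(self, event):
        self._log.append(event)      # 1. write a stream of events
        return len(self._log) - 1    # offset at which the event was stored

    def read_from(self, offset=0):
        # 2./3. events are durable, so any consumer can replay them
        # from any offset, as they occur or retrospectively.
        return self._log[offset:]

topic = ToyTopic()
topic.publish({"ball": 1, "runs": 4})
topic.publish({"ball": 2, "runs": 6})

# A consumer joining later can still read from the beginning:
events = topic.read_from(0)
total_runs = sum(e["runs"] for e in events)
print(total_runs)  # 10
```

The key property mirrored here is that reading does not remove events from the log, which is what lets multiple independent consumers process the same stream.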

To simplify what Kafka does, let’s imagine you are watching a cricket match on a sports streaming platform like Hotstar. As a spectator, you see the live video, live stats, and other infographics like social media chat, games, advertisements and important highlights, all during the match.

Hotstar Streaming

You are getting all of this from Hotstar, which does the work in the middle: from end to end, the information reaches you smoothly, and Kafka makes much of this possible. Hotstar ticks all three boxes above by delivering live video to you, storing the event data durably, and processing it simultaneously to produce stats and highlights that remain accessible even after the match has ended.

How does this all happen?

To understand this, let’s zoom into Hotstar. But first, a quick recap of OLTP and OLAP systems: OLTP (Online Transaction Processing) systems handle many small, fast transactions, while OLAP (Online Analytical Processing) systems run complex queries over large volumes of data for analysis.

Now let’s look at the simplified flow below

Inside Hotstar

The OLTP layer provides your live feed, and the OLAP layer provides the statistics and other data, and all of this happens in real time. Serving live video is one thing; serving live data is another. Processing huge volumes of data is time-consuming and cannot happen in the blink of an eye. So what happens? We are looking at a problem that involves applying algorithms to a huge volume of data, and that data can be a flat file, a chat message, or a video file.

So how does this happen in real time?

Kafka for separating OLTP and OLAP

When data comes into the warehouse, Kafka acts like a bus, delivering messages to different systems. Kafka Connect is deployed to copy data into the Kafka cluster and send it on to other systems. A Kafka Connect source job sends data (in small packets, typically kilobytes in size) to the Kafka cluster whenever new data arrives. Similarly, a Kafka Connect sink job delivers data to the OLTP and OLAP systems once it has arrived in the Kafka cluster.
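The fan-out idea can be shown in miniature: one event arrives on the “bus”, and a sink step delivers an independent copy to each downstream system. The names (`oltp_store`, `olap_store`, `sink_job`) are illustrative placeholders, not Kafka Connect APIs.

```python
# Minimal sketch of fan-out: one event, delivered to both the OLTP and
# OLAP sides. Real Kafka Connect sinks are configured connectors, not
# Python functions; this only mimics the data flow.
oltp_store = []   # backs the live, transactional feed
olap_store = []   # backs the analytical layer

def sink_job(message, sinks):
    # Each sink gets its own copy of the event, so OLTP and OLAP
    # can consume independently, at their own pace.
    for sink in sinks:
        sink.append(dict(message))

event = {"type": "score_update", "runs": 4}
sink_job(event, [oltp_store, olap_store])
print(len(oltp_store), len(olap_store))  # 1 1
```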

The sink processes send only the relevant data to each of the OLTP and OLAP layers. The question that remains: how are the statistics you see on the platform produced?

This is done in two ways: Real-time and Historical

Real-time

Real-time data processing can be done using Kafka Streams. Not all data needs real-time processing, so a Kafka Streams application can be set up to pull only the important data and run the required algorithms and queries. Streams applications consume data directly from the cluster.
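The kind of live statistic such a job produces can be sketched as a running aggregation over ball-by-ball events. Kafka Streams itself is a Java/Scala library; this Python generator only mimics the idea of updating stats as each event arrives, and the event fields are invented for the example.

```python
# Hedged sketch: maintain live cricket stats over a stream of events,
# updating the output after every event, in the spirit of a streaming
# aggregation. Not real Kafka Streams code.
def running_stats(events):
    runs, balls = 0, 0
    for e in events:          # events arrive one at a time, as in a stream
        runs += e["runs"]
        balls += 1
        # emit updated stats immediately, rather than waiting for a batch
        yield {"balls": balls, "runs": runs,
               "run_rate": round(runs * 6 / balls, 2)}

stream = [{"runs": 4}, {"runs": 0}, {"runs": 6}]
for stats in running_stats(stream):
    print(stats)
```

The contrast with the historical path below is that results are emitted per event, so the dashboard never waits for the full dataset.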

Historical

All the data can also be sent to a separate database for running complex AI/ML jobs and analytical queries, with the results visualised as dashboards and metrics. Note that only the data copy is real-time here; the processing that produces results, factors and predictions takes time. The real-time path processes only a subset of this data.
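In contrast to the per-event streaming path, the historical path runs a query over the full retained log after the fact. A heavy analytical job, in miniature (the data and query are invented for illustration):

```python
# Sketch of the historical/batch path: the complete event log is already
# copied into an analytical store, and a query scans all of it at once.
log = [
    {"over": 1, "runs": 8},
    {"over": 2, "runs": 12},
    {"over": 3, "runs": 5},
]

# A "complex analytical query", in miniature: find the highest-scoring over.
best = max(log, key=lambda e: e["runs"])
print(best["over"])  # 2
```

Real historical workloads would be far larger and run in a warehouse or ML pipeline, but the shape is the same: batch over everything, not incrementally over each event.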

The Kafka engine above can be replicated by deploying Kafka on any system, on-premises or in the cloud. It is open-source, and managed versions of it exist on major clouds like AWS, Azure and Google Cloud Platform.

P.S.: The flows above are extremely simplified versions of the Kafka engine and of real-time data streaming platforms; not everything may be real-time or use Kafka. Read more on how this happens at Hotstar:

Ingesting data at “Bharat” Scale. We wrote an ingest API that can ingest… | by Bhavesh Raheja | Disney+ Hotstar (medium.com)


Data, Tech and Sustainability. I believe in preserving the world so that future generations have ample resources. If you have a green idea, reach out!