
Python, Beam and Google Dataflow: From Batch to Streaming in a Few Lines

Is it possible to transform my batch pipeline into a streaming one without a headache? YES, with Apache Beam.


data pipelines

Just before we start:

Are you not a Medium member? I think you should consider signing up via my referral link to get everything Medium has to offer for just $5 a month!

I created a very simple script in Apache Beam for a batch task, using Apache Beam’s Direct Runner, that is, running locally (not on a Spark or Dataflow engine). The data consumed is flight data containing various fields, such as flight number, origin, destination, departure delay, arrival delay…

Sample data

So, I created a routine that keeps only the records with a positive arrival delay, in column 8 (counting from index 0), along with the respective airport, in column 4. The script looks like this:
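Here is a minimal sketch of that script, reconstructed from the walkthrough below (the bucket path, file name and credentials path are placeholders, so adjust them to your project), laid out so the line numbers in the explanation match:

```python
import os
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions()
p1 = beam.Pipeline(options=pipeline_options)

service_account_path = "key.json"  # placeholder service-account key file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = service_account_path

delayed_flights = (
    p1
    | "Read from bucket" >> beam.io.ReadFromText("gs://my-bucket/flights.csv")
    | "Split by comma" >> beam.Map(lambda line: line.split(","))
    | "Positive arrival delay" >> beam.Filter(lambda record: float(record[8]) > 0)
    | "Airport and delay" >> beam.Map(lambda record: (record[4], record[8]))
    | "Print results" >> beam.Map(print)
)

p1.run()
```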

Let me explain it a little bit:
Lines 5 and 6 = set up my pipeline as p1
Lines 8 and 9 = define my key.json file as the account used to authenticate with Google
Line 13 = read the input file from the GCS bucket
Line 14 = split each record on commas
Line 15 = keep only rows where column 8, the arrival delay, is greater than 0
Line 16 = select column 4, with the airport, and column 8, with the delay value
Line 17 = print each result

Simple, right?

What if I want to see this information in real time?

Logically, the data source has to be real-time, so I created a little routine in Python that sends one line from the file we saw at the beginning to Pub/Sub every x seconds. (I won’t go…
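A minimal sketch of such a publisher, assuming the google-cloud-pubsub client library (the project, topic and file names are placeholders):

```python
import time
from google.cloud import pubsub_v1

# Placeholder identifiers; replace with your own project and topic
project_id = "my-project"
topic_id = "flights"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

with open("flights_sample.csv") as source_file:
    for line in source_file:
        # Publish one record at a time to simulate a live feed
        publisher.publish(topic_path, line.strip().encode("utf-8"))
        time.sleep(5)  # "x" seconds between messages
```

On the Beam side, the switch is then mostly a matter of swapping the source and turning on streaming mode; a hedged sketch of the same pipeline reading from Pub/Sub (the subscription path is a placeholder):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # run in streaming mode

p2 = beam.Pipeline(options=options)

(
    p2
    | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/flights-sub")
    | "Decode bytes" >> beam.Map(lambda message: message.decode("utf-8"))
    | "Split by comma" >> beam.Map(lambda line: line.split(","))
    | "Positive arrival delay" >> beam.Filter(lambda record: float(record[8]) > 0)
    | "Airport and delay" >> beam.Map(lambda record: (record[4], record[8]))
    | "Print results" >> beam.Map(print)
)

p2.run()
```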



Written by Cássio Bolba

Senior Data Engineer | Udemy Teacher | Expat in Germany | Mentor -> https://linktr.ee/cassiobolba
