Python, Beam and Google Dataflow: From Batch to Streaming in a few lines
Is it possible to turn my batch pipeline script into a streaming one without a headache? YES, with Apache Beam.

Just before we start:
Not a Medium member yet? I think you should consider signing up via my referral link to take advantage of everything Medium has to offer, for just $5 a month!
I created a very simple script in Apache Beam for a batch task, using Beam's Direct Runner, that is, running locally (not on a Spark or Dataflow engine). The data consumed is flight data, containing various information such as flight number, origin, destination, departure delay, arrival delay…

So, I created a routine that keeps only the records with a positive arrival delay, in column 8 (counting from index 0), together with the corresponding airport, in column 4. The script looks like this:
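(Below is a minimal sketch along those lines; the bucket path, file name and key.json are placeholder names, and the column indexes follow the description above.)

```python
import os
import apache_beam as beam

# create the pipeline (DirectRunner by default, so it runs locally)
p1 = beam.Pipeline()

# authenticate with GCP using the service-account key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'key.json'

delayed_flights = (
    p1
    | 'Read flights file from the bucket' >> beam.io.ReadFromText('gs://my-bucket/flights.csv')
    | 'Split each record on commas' >> beam.Map(lambda record: record.split(','))
    | 'Keep only positive arrival delays' >> beam.Filter(lambda record: float(record[8]) > 0)
    | 'Select airport (column 4) and delay (column 8)' >> beam.Map(lambda record: (record[4], float(record[8])))
    | 'Print the result' >> beam.Map(print)
)

p1.run()
```

With no runner specified in the options, it falls back to the DirectRunner, so everything executes on your machine.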
Let me explain it a little bit:
Lines 5 and 6 = setting up my pipeline, p1
Lines 8 and 9 = defining my key.json file as the service account to authenticate with Google
Line 13 = read the input file from the bucket in GCP
Line 14 = split each record on commas
Line 15 = filter column 8 for delay values greater than 0
Line 16 = select column 4, with the flight identification, and column 8, with the delay value
Line 17 = print the result
Simple, right?
What if I want to see this information in Real-Time?
Logically, the data source has to be real-time, so I created a little routine in Python that sends one line from the file we saw at the beginning to Pub/Sub every x seconds. (I won't go…
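For illustration, a publisher along those lines might look roughly like this (the project, topic and file names are placeholders, and it assumes the same key.json credentials are already exported):

```python
import time
from google.cloud import pubsub_v1

# placeholder names: adjust the project, topic and input file to your own setup
project_id = 'my-project'
topic_id = 'flights-topic'
input_file = 'flights.csv'
interval_seconds = 5  # the "x seconds" between messages

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

with open(input_file) as f:
    for line in f:
        # Pub/Sub payloads are bytes, so encode each CSV line before publishing
        future = publisher.publish(topic_path, line.strip().encode('utf-8'))
        future.result()  # block until the message is accepted by Pub/Sub
        time.sleep(interval_seconds)
```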