Python, Beam and Google Dataflow: From Batch to Streaming in a few lines
Is it possible to turn my batch pipeline into a streaming one, without a headache? YES, with Apache Beam.
I created a very simple batch script in Apache Beam, using the Direct Runner, that is, running locally (not on a Spark or Dataflow engine). The data consumed is flight data containing various fields, such as flight number, origin, destination, departure delay, and arrival delay…
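To illustrate how the columns are laid out, here is a minimal sketch of parsing one such record in plain Python. The record below is hypothetical — the real dataset's exact fields and order may differ — but it shows the 0-indexed positions referenced in this article: the airport in column 4 and the arrival delay in column 8.

```python
import csv
from io import StringIO

# Hypothetical flight record (comma-separated). In this made-up layout,
# column 4 (0-indexed) is the airport and column 8 is the arrival delay
# in minutes; the real dataset may use different fields.
record = "2015-01-01,AA,19805,AA1,JFK,LAX,0900,5,15\n"

row = next(csv.reader(StringIO(record)))
airport = row[4]             # the airport code, e.g. "JFK"
arrival_delay = int(row[8])  # the arrival delay in minutes, e.g. 15
```

A Beam pipeline applies exactly this kind of per-record extraction, just expressed as transforms over a PCollection instead of a single string.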
So, I created a routine that keeps only the records with a positive arrival delay, found in column 8 (0-indexed), along with the corresponding airport in column 4. The script looks like this: