Apache Beam, Python and GCP: Deploying a Batch Pipeline on Google Dataflow

In this article, we will describe how to deploy a batch pipeline, created locally, to Google Dataflow, in a very simplified way.

In the previous article (here), we explored how to change a pipeline from batch to streaming with just a few extra lines, which shows how versatile Apache Beam is.

The approach shown here is deliberately simple. There are other ways to deploy, some simpler and some more complex, and how complex they feel depends mainly on your level of Python knowledge.

Let’s get our hands dirty.

Create a Service Account

Go to IAM & Admin > Service Accounts > + Create > name your SA > Create:

Then grant it the Dataflow Worker role > click Done

Once it is created, click the three dots to the right of the new SA, then click Create Key > select JSON > Create

Done: the SA (Service Account) is created and its key is exported. The JSON key file should be in your Downloads folder. Here are some more details on how to use the Python SDK and Dataflow.
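
One common way to make an exported key visible to the Google client libraries is the GOOGLE_APPLICATION_CREDENTIALS environment variable. The snippet below is a minimal sketch of that idea, not necessarily the method used in the rest of this article: the function name use_service_account_key and the key path in the usage note are hypothetical, and it only checks that the file looks like a service account key before setting the variable.

```python
import json
import os
from pathlib import Path


def use_service_account_key(key_path: str) -> str:
    """Point Google client libraries at a service account JSON key.

    Hypothetical helper for illustration. Returns the service account
    e-mail address stored in the key file.
    """
    path = Path(key_path).expanduser()
    with path.open(encoding="utf-8") as handle:
        key = json.load(handle)

    # Keys exported from the console carry "type": "service_account".
    if key.get("type") != "service_account":
        raise ValueError(f"{path} does not look like a service account key")

    # Application Default Credentials read this environment variable.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path.resolve())
    return key["client_email"]
```

For example, use_service_account_key("~/Downloads/my-key.json") would set the variable for the current Python process only, where my-key.json is a placeholder for whatever name your downloaded file has.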

In your Local Environment

If you are running Apache Beam with the Direct Runner, i.e. locally, you already have the Apache Beam packages installed. Now also install the Apache Beam SDK packages for GCP by running the following command from your command line (CMD, for example):

pip install apache-beam[gcp]

This SDK allows your local Apache Beam code, which runs with the Direct Runner (it’s worth researching the other available runners, such as Spark and Flink), to be converted and…
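
As a minimal sketch of what moving from the Direct Runner to Dataflow can look like, the example below builds the command-line style options that Beam's PipelineOptions accepts and passes them to a tiny batch pipeline. The helper names dataflow_args and run_batch_pipeline, and every project, region, bucket and job name in the usage note, are hypothetical placeholders. The pipeline itself is a toy word count, not the pipeline from this series.

```python
from typing import List


def dataflow_args(project: str, region: str, temp_location: str,
                  job_name: str) -> List[str]:
    """Build the pipeline flags for a Dataflow run (hypothetical helper)."""
    # Dataflow stages temporary files in Cloud Storage, so expect a gs:// path.
    if not temp_location.startswith("gs://"):
        raise ValueError("temp_location must be a Cloud Storage path (gs://...)")
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--job_name={job_name}",
    ]


def run_batch_pipeline(argv: List[str]) -> None:
    """Run a toy batch pipeline with whatever runner the flags select."""
    # Imported here so dataflow_args() works even without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["a", "b", "a"])
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "CountPerKey" >> beam.CombinePerKey(sum)
            # On Dataflow, printed output ends up in the worker logs.
            | "Print" >> beam.Map(print)
        )
```

Calling run_batch_pipeline(["--runner=DirectRunner"]) keeps everything local, while run_batch_pipeline(dataflow_args("my-project", "europe-west1", "gs://my-bucket/temp", "batch-demo")) would submit the same code to Dataflow, assuming those placeholder resources exist and credentials are configured.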

Written by Cássio Bolba

Senior Data Engineer | Udemy Teacher | Expat in Germany | Mentor -> https://linktr.ee/cassiobolba

Responses (3)

Thanks for the article. I have gone through your Udemy course, and it really helps. Just one question: if I want to connect to an Oracle database instead of a CSV file, what changes would I need to make in Python? I tried to search but could not find much information.

--

This sensational deploy works for me, thank you Cassio

--

Very informative, thank you.

--