How to Run Google Data Catalog Connectors in Production — Cloud Functions vs. Kubernetes

Best practices on two approaches, with code samples!

Marcelo Costa
Google Cloud - Community


Disclaimer: All opinions expressed are my own and represent no one but myself. They come from my experience participating in the development of fully operational sample connectors, available on GitHub.

If you want to hear more about some of the Data Catalog connectors use cases, please check the official documentation:

Data Catalog connectors use cases, from official docs.

To put it in a nutshell, Data Catalog connectors are Python CLIs that users can run to connect their source systems' metadata with Google Data Catalog.
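For instance, a manual run of the PostgreSQL connector looks roughly like this (a sketch based on the connector's README; the values are placeholders, so double-check the flags against the repo):

```bash
# Install the connector from PyPI.
pip install google-datacatalog-postgresql-connector

# Run a metadata sync manually, pointing at the source database
# and the target Data Catalog project.
google-datacatalog-postgresql-connector \
  --datacatalog-project-id=my-project \
  --datacatalog-location-id=us-central1 \
  --postgresql-host=10.0.0.10 \
  --postgresql-user=dbuser \
  --postgresql-pass=dbpass \
  --postgresql-database=mydatabase
```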

Usually, a CLI relies on manual execution, and life has taught us that anything that is not automated will, at some point, fail.

So to better understand this, let's take a quick look at a normal day at the SmartTechnologies corporation:

[Mike] Aargh… sorry Joe, I forgot to run the connector.

As if not finding our table definition wasn't enough, the next issue we might run into is in security territory. Like any CLI that requires credentials to connect to the source system, we are dealing with sensitive data.

If we are running it manually, it's really easy to make a mistake and expose those credentials. In a production environment, that could lead to irreparable damage and could even result in legal fees.

To illustrate this, check the outcome of the next day at SmartTechnologies corporation:

To sum up, this short story taught us two things: first, automating our workloads is key; second, we need to keep our credentials secure. But how can we achieve that using some of the GCP products?

In this post series, we are going to look at two approaches to answer that question.

The SmartTechnologies team is really excited to learn more about this, and to have a peaceful day for once.

The two approaches we will tackle are:

  • Serverless with Cloud Scheduler and Cloud Functions.
  • Managed workloads with Kubernetes (Anthos Ready).

Usually, I'd go for the serverless one, since it's much simpler to set up and, in most cases, cheaper. But if you are already using Kubernetes and starting your hybrid cloud journey, the second will certainly help.

Without further ado, let’s see some action!

Serverless with Cloud Scheduler and Cloud Functions

Data Catalog Serverless Cloud Function Architecture.

In this GitHub repo, we have the sample code and a deploy.sh script that users can run to set up the architecture above. Next, we will go over each of the components:

  • Cloud Scheduler

Cloud Scheduler will help us create a serverless cron job, so we can automate our workload.

Creating it is really straightforward:

From GitHub.

We basically need to set the PubSub topic name and the CRON_SCHEDULE; if we wanted to run every 30 minutes, it would be: "*/30 * * * *".
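In gcloud terms, the job that the script creates boils down to something like this (a minimal sketch; the job and topic names are illustrative, not the repo's actual values):

```bash
# Create a serverless cron job that publishes to a PubSub topic
# every 30 minutes.
gcloud scheduler jobs create pubsub datacatalog-connector-job \
  --schedule="*/30 * * * *" \
  --topic=datacatalog-connector-topic \
  --message-body="{}"
```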

  • PubSub
Image: Kingdom by Natasha Remarchuk

Why do we even need PubSub? Why not just call the Cloud Function directly?

Cloud Scheduler is able to invoke three target types: HTTP/S endpoints, PubSub topics, and App Engine HTTP endpoints.

Since Cloud Scheduler supports HTTP/S endpoints, you might think that adding PubSub to the solution is not worth it, and that it only increases the costs and the overall management burden.

If we want to call the Cloud Function directly, using an HTTP call, in short we need to work with a security token that only the triggering server knows about. We also need to apply some security best practices and think about how to rotate that token. Trust me, it's not worth the headache.

With PubSub, the message volume in this use case won't even get close to the billing quota. So we will let GCP handle the security of the Cloud Function invocation for us.

  • Cloud Functions

This component will work as a facade on top of the Data Catalog connectors.

First, a quick look at this part of the deploy.sh script:

Part of GitHub repo.

The code above sets up the environment variables required by the connector and then creates the Cloud Function.
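Since the script is not inlined here, a condensed sketch of what that step amounts to (the function name, runtime, entry point, and variable values are illustrative assumptions, not the repo's exact contents):

```bash
# Deploy the function, wiring it to the PubSub topic and passing the
# connector's configuration as environment variables.
gcloud functions deploy datacatalog-connector-function \
  --runtime=python37 \
  --trigger-topic=datacatalog-connector-topic \
  --entry-point=run_connector \
  --set-env-vars=POSTGRESQL_SERVER=10.0.0.10,POSTGRESQL_DATABASE=mydatabase,DB_CREDENTIALS_USER_SECRET=postgresql-user-secret,DB_CREDENTIALS_PASS_SECRET=postgresql-pass-secret
```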

On the GitHub repo, there are instructions with sample values on how to use it:

Deploy instructions from GitHub repo.
Image: Kingdom by Natasha Remarchuk

You again? You are probably wondering what DB_CREDENTIALS_USER_SECRET and DB_CREDENTIALS_PASS_SECRET are, right?

Do you remember that we need to keep our credentials secure somehow?

We could store the credential values as environment variables, but then we would face the same challenges we discussed in the PubSub component.

So to handle that, we will use the GCP serverless Secret Manager to store the credentials for us.

If you have stronger security requirements, like managing your own keys, consider using Cloud KMS.

On the GitHub repo there’s a helper script to work with it:

Secrets helper from GitHub repo.

You only need to execute it once, and again whenever you want to rotate your secrets. Make sure that when you run it, you are in a secure and audited environment.
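If you prefer not to use the helper script, the equivalent gcloud calls look roughly like this (the secret names are illustrative; the DB_CREDENTIALS_*_SECRET environment variables from the deploy step point at them):

```bash
# Create the secrets (run once); gcloud reads the secret payload from stdin.
printf '%s' "dbuser" | gcloud secrets create postgresql-user-secret \
  --replication-policy=automatic --data-file=-
printf '%s' "dbpass" | gcloud secrets create postgresql-pass-secret \
  --replication-policy=automatic --data-file=-

# Rotate a credential later by adding a new version.
printf '%s' "new-dbpass" | gcloud secrets versions add postgresql-pass-secret \
  --data-file=-
```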

Next, let’s see the Cloud Functions code:

Part of GitHub repo.

Basically, it just initiates the google-datacatalog-postgresql-connector after retrieving the secret values with the following function:

Part of GitHub repo.

We are using the latest version alias in the sample to always pick the most up-to-date secret version, in case we have rotated the secrets.
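Putting the two pieces together, the function boils down to something like this (a minimal sketch that shells out to the connector CLI; the entry point and environment variable names mirror the illustrative deploy step above, not necessarily the repo's actual code):

```python
import os
import subprocess

from google.cloud import secretmanager


def _retrieve_secret(secret_name):
    """Reads the 'latest' version, so rotated secrets are picked up."""
    project_id = os.environ["GCP_PROJECT"]  # set automatically by the python37 runtime
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_name}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")


def run_connector(event, context):
    """Background Cloud Function triggered by messages on the PubSub topic."""
    user = _retrieve_secret(os.environ["DB_CREDENTIALS_USER_SECRET"])
    password = _retrieve_secret(os.environ["DB_CREDENTIALS_PASS_SECRET"])

    # Delegate the actual metadata sync to the connector CLI.
    subprocess.run(
        [
            "google-datacatalog-postgresql-connector",
            f"--datacatalog-project-id={os.environ['GCP_PROJECT']}",
            f"--postgresql-host={os.environ['POSTGRESQL_SERVER']}",
            f"--postgresql-user={user}",
            f"--postgresql-pass={password}",
            f"--postgresql-database={os.environ['POSTGRESQL_DATABASE']}",
        ],
        check=True,
    )
```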

Finally, let’s see a demonstration video of the Serverless PostgreSQL Data Catalog Connector execution:

That’s it!

Closing thoughts

In this article, we looked at some challenges in dealing with command-line interfaces and database credentials, and how the SmartTechnologies corporation was struggling with that.

We tackled the two main challenges:

  • Manual execution

First, by automating the execution using some of the GCP serverless products, we no longer need to manually run it.

  • Credentials exposure

Second, we mitigated the risks of exposing the credentials by storing them in a secrets management product. SmartTechnologies can finally have a peaceful day 😄.

Stay tuned for my next post, where I will show how to do the same using managed workloads with Kubernetes. Cheers!

References

Data Catalog Serverless Connector GitHub repo: https://github.com/mesmacosta/google-datacatalog-postgresql-connector-serverless
