How to Run Google Data Catalog Connectors in Production — Cloud Functions vs. Kubernetes
Best practices on two approaches, with code samples!
Disclaimer: All opinions expressed are my own, and represent no one but myself. They come from my experience participating in the development of fully operational sample connectors, available on GitHub.
If you want to hear more about some of the Data Catalog connectors use cases, please check the official documentation:
In a nutshell, Data Catalog connectors are Python CLIs that users run to connect their source systems’ metadata with Google Data Catalog.
Usually, a CLI relies on manual execution, and life has taught us that anything that is not automated will, at some point, fail.
So to better understand this, let’s take a quick look at a normal day at the SmartTechnologies corporation:
[Mike] Aargh… sorry Joe, I forgot to run the connector.
If not finding our table definition was not enough, the next issue we might run into is in security territory. Just like any CLI that requires credentials to connect to the source system, we are dealing with sensitive data.
If we are running it manually, it’s really easy to make a mistake and expose those credentials. In a production environment, that could lead to irreparable damage and could even result in legal fees.
To illustrate this, check the outcome of the next day at SmartTechnologies corporation:
To sum up, in this short story we learned, first, that automating our workloads is key, and second, that we need to keep our credentials secure. But how can we achieve that using some of the GCP products?
In this post series, we are going to look at two approaches to answer that question.
The SmartTechnologies company team is really excited to learn more about this, and have a peaceful day for once.
The two approaches we will tackle are:
- Serverless with Cloud Scheduler and Cloud Functions.
- Managed workloads with Kubernetes (Anthos Ready).
Usually, I’d go for the serverless one, since it’s much simpler to set up and, in most cases, cheaper. But if you are already using Kubernetes and starting your hybrid cloud journey, the second will certainly help.
Without further ado, let’s see some action!
Serverless with Cloud Scheduler and Cloud Functions
In this GitHub repo, we have the sample code and a deploy.sh script that users can run to set up the architecture above. Next, we will go over each of the components:
- Cloud Scheduler
Cloud Scheduler helps us create a serverless cron job, so we can automate our workload.
Creating it is really straightforward:
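The repo’s script isn’t reproduced here; a sketch of the gcloud commands involved (topic and job names are illustrative, not the repo’s actual values) could look like this:

```shell
# Create the Pub/Sub topic the scheduler will publish to
# (names are illustrative).
gcloud pubsub topics create datacatalog-connector-topic

# Create a serverless cron job that publishes to the topic
# every 30 minutes.
gcloud scheduler jobs create pubsub datacatalog-connector-job \
  --schedule="*/30 * * * *" \
  --topic=datacatalog-connector-topic \
  --message-body="run-connector"
```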
We basically need to set up the PubSub topic name and the CRON_SCHEDULE; for example, to run every 30 minutes it would be: "*/30 * * * *".
- PubSub
Why do we even need PubSub? Why not just call the Cloud Function directly?
Cloud Scheduler is able to invoke the following target types:
- HTTP/S endpoints
- Pub/Sub topics
- App Engine applications
Since Cloud Scheduler supports HTTP/S endpoints, you might think that adding PubSub to the solution is not worth it, and will only increase the costs and the overall management overhead.
If we wanted to call the Cloud Function directly, using an HTTP call, we would, in short, need to work with a security token that only the triggering server knows about. We would also need to apply some security best practices and think about how to rotate that token. Trust me, it’s not worth the headache.
With PubSub, in this use case, we won’t even get close to the billing quota. So we will let GCP handle the security of the Cloud Function invocation for us, using PubSub.
- Cloud Functions
This component will work as a facade on top of the Data Catalog connectors.
First, a quick look at this part of the deploy.sh script:
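The actual deploy.sh isn’t embedded here; a hedged sketch of what that fragment might do (function name, runtime, and flag values are illustrative) is:

```shell
# Deploy the function with a Pub/Sub trigger, passing the connector's
# configuration as environment variables (values are illustrative).
gcloud functions deploy datacatalog-connector-function \
  --runtime=python37 \
  --trigger-topic="${PUBSUB_TOPIC}" \
  --set-env-vars="DB_HOST=${DB_HOST},DB_CREDENTIALS_USER_SECRET=${DB_CREDENTIALS_USER_SECRET},DB_CREDENTIALS_PASS_SECRET=${DB_CREDENTIALS_PASS_SECRET}"
```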
The code above sets up the environment variables required by the connector and then creates the Cloud Function.
The GitHub repo contains instructions, with sample values, on how to use it:
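The repo’s actual instructions aren’t reproduced here; a configuration fragment with illustrative sample values might look like:

```shell
# Illustrative sample values — replace with your own project details.
export PUBSUB_TOPIC=datacatalog-connector-topic
export DB_HOST=10.0.0.5
export DB_CREDENTIALS_USER_SECRET=datacatalog-db-user
export DB_CREDENTIALS_PASS_SECRET=datacatalog-db-pass
```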
You again? You are probably wondering what the DB_CREDENTIALS_USER_SECRET and DB_CREDENTIALS_PASS_SECRET are, right?
Do you remember that we need to keep our credentials secure somehow?
We could store the credentials values as environment variables, but then it would present the same challenges we discussed in the PubSub component.
So to handle that, we will use Secret Manager, GCP’s serverless secrets-management product, to store the credentials for us.
If you have stronger security requirements, like managing your own encryption keys, consider using Cloud KMS.
On the GitHub repo there’s a helper script to work with it:
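The repo’s helper script isn’t shown here; a minimal version of such a helper (secret names are illustrative, not the repo’s actual values) might be:

```shell
# Store the database credentials in Secret Manager, reading the
# values from stdin via --data-file=- (names are illustrative).
echo -n "${DB_USER}" | gcloud secrets create datacatalog-db-user \
  --replication-policy="automatic" --data-file=-
echo -n "${DB_PASS}" | gcloud secrets create datacatalog-db-pass \
  --replication-policy="automatic" --data-file=-

# To rotate, add a new version instead of recreating the secret:
# echo -n "${NEW_DB_PASS}" | gcloud secrets versions add \
#   datacatalog-db-pass --data-file=-
```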
You only need to execute it once, and again whenever you want to rotate your secrets. Make sure that when you run it, you are in a secure and audited environment.
Next, let’s see the Cloud Functions code:
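The repo’s actual function isn’t embedded here; a minimal sketch of a Pub/Sub-triggered entry point (the function name, env var names, and the connector invocation are illustrative) could look like:

```python
import base64
import os


def datacatalog_connector_handler(event, context=None):
    """Background Cloud Function triggered by a Pub/Sub message.

    The Pub/Sub payload arrives base64-encoded in event["data"].
    """
    data = event.get("data")
    message = base64.b64decode(data).decode("utf-8") if data else ""

    # Environment variables set at deploy time (names are illustrative).
    db_host = os.environ.get("DB_HOST", "")

    # Here the real function would retrieve the credentials from
    # Secret Manager and start the google-datacatalog-postgresql-connector
    # sync — see the repo for the actual entry point.
    return message
```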
Basically, it just initiates the google-datacatalog-postgresql-connector after retrieving the secret values in the following function:
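The repo’s helper isn’t reproduced here; a sketch assuming the google-cloud-secret-manager client library (function and secret names are illustrative):

```python
def secret_version_name(project_id: str, secret_id: str,
                        version: str = "latest") -> str:
    """Build the fully qualified resource name of a secret version.

    Using "latest" always resolves to the newest enabled version, so
    rotated secrets are picked up without redeploying the function.
    """
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def access_secret(project_id: str, secret_id: str) -> str:
    """Fetch and decode the payload of the latest secret version."""
    # Requires the google-cloud-secret-manager package.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        request={"name": secret_version_name(project_id, secret_id)}
    )
    return response.payload.data.decode("UTF-8")
```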
We are using the latest version in the sample to always pick the most up-to-date secret, in case we have rotated it.
Finally, let’s see a demonstration video of the Serverless PostgreSQL Data Catalog Connector execution:
That’s it!
Closing thoughts
In this article, we looked at some challenges in dealing with command-line interfaces and database credentials, and how the SmartTechnologies corporation was struggling with that.
We tackled the two main challenges:
- Manual execution
First, by automating the execution using some of GCP’s serverless products, we no longer need to run it manually.
- Credentials exposure
Second, we mitigated the risk of exposing the credentials by storing them in a secrets-management product. SmartTechnologies can finally have a peaceful day 😄.
Stay tuned for my next post, where I will show how to do the same using managed workloads with Kubernetes. Cheers!
References
Data Catalog Serverless Connector Github Repo: https://github.com/mesmacosta/google-datacatalog-postgresql-connector-serverless