How to Automate Governance Best Practices With Google Data Catalog and Terraform

Scripts and Terraform automation to help you ensure best practices in Google Data Catalog

Marcelo Costa
Google Cloud - Community


Disclaimer: All opinions expressed are my own, and represent no one but myself…

There’s extensive documentation on which IAM roles are available for Google Data Catalog. But when you are getting started with your data governance journey, you have probably wondered what kind of access controls are needed and who in your organization should be granted them…

  • Which end users should be able to discover my data assets?
  • Who should be able to classify and add tags to them?
  • And finally, who should be able to create templates and set standards for the data classification process?

This can get really complex, so in this blog post, we will start by looking at the access controls on top of metadata, which is Google Data Catalog’s playing field.

And what if I told you that we can automate all that by using Terraform? Sounds good, right? So let’s get started!

The solution

Data Governance Architecture

This is simply a suggestion on how to work with Data Catalog. To start off, let’s say you have some common templates that will be used to create tags in different projects.

For that we need two different pieces:

  • The Tag Central Project

This is where we store all the common resources, like Tag Templates, Policy Tags, and Custom entries. That way we don’t duplicate them, are charged only once, and have a much easier time managing and changing them.

To showcase this, in the Terraform sample we will create 4 Tag Templates in the Tag Central project:

★ Data Engineering Template

★ Derived Data Template

★ Data Governance Template

★ Data Quality Template

  • A Group of Analytics Projects

By Analytics Projects, I mean those that will store data assets, such as BigQuery tables, Pub/Sub topics, or any other resource managed by Data Catalog.
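To make the Tag Central idea concrete, here is a minimal sketch of how one of the shared templates could be declared with the google_data_catalog_tag_template resource. The variable name, region, template ID, and the single field are assumptions for illustration; the sample repository may define its templates differently.

```hcl
# Hypothetical input: the ID of the Tag Central project.
variable "tag_central_project_id" {
  type        = string
  description = "Project that holds the shared Tag Templates."
}

# One of the four shared templates. The field below is illustrative only.
resource "google_data_catalog_tag_template" "data_governance" {
  project         = var.tag_central_project_id
  region          = "us-central1" # assumed region
  tag_template_id = "data_governance_template"
  display_name    = "Data Governance Template"

  field {
    field_id     = "data_owner" # hypothetical field
    display_name = "Data owner"
    is_required  = true

    type {
      primitive_type = "STRING"
    }
  }
}
```

The other three templates would follow the same shape, each with its own tag_template_id and fields.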

The personas

Now let’s look at the personas who will interact with the Tag Central and Analytics Projects, and whose access we will set up automatically with Terraform.

Keep in mind that the names are just suggestions, and you could replace them with names that play similar roles. For example, you could call Data Governors Data Architects, or Data Curators Data Stewards, or pick any of the many other names in this data alphabet soup.

  • Data Governors
Data Governor Persona

Data Governor is the role for people who perform administrative workloads on top of your metadata. This means creating, updating, and deleting Data Catalog resources like Tag Templates, and setting the standards of your data governance process.

  • Data Curators
Data Curator Persona

Data Curators will take care of your data assets :) … They will select the relevant ones and add meaning to them (by creating tags), so other users can easily discover and make use of them.

  • Data Analysts
Data Analyst Persona

This is the person who will use the curated assets and define and develop domain-specific analytics to support your decision making.

Take into consideration that those personas can change or overlap, depending on the size of your organization or the way it is structured, so the same person may play more than one role.
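The personas above can be sketched as project-level IAM bindings. The role-to-persona mapping below is an assumption for illustration (the roles themselves are predefined Data Catalog roles), and the variable names are hypothetical; check the sample repository for the mapping it actually applies. It is written as a standalone module.

```hcl
# Hypothetical inputs. Members use the IAM syntax, e.g. "group:team@example.com".
variable "tag_central_project_id" { type = string }
variable "analytics_project_ids" { type = list(string) }
variable "data_governors_members" { type = list(string) }
variable "data_curators_members" { type = list(string) }
variable "data_analysts_members" { type = list(string) }

locals {
  # One entry per (analytics project, member) pair, keyed by "project/member".
  curator_bindings = {
    for pair in setproduct(var.analytics_project_ids, var.data_curators_members) :
    "${pair[0]}/${pair[1]}" => { project = pair[0], member = pair[1] }
  }
  analyst_bindings = {
    for pair in setproduct(var.analytics_project_ids, var.data_analysts_members) :
    "${pair[0]}/${pair[1]}" => { project = pair[0], member = pair[1] }
  }
}

# Data Governors: own the shared templates in the Tag Central project (assumed role).
resource "google_project_iam_member" "governors_template_owner" {
  for_each = toset(var.data_governors_members)
  project  = var.tag_central_project_id
  role     = "roles/datacatalog.tagTemplateOwner"
  member   = each.value
}

# Data Curators: may use the shared templates (assumed role) ...
resource "google_project_iam_member" "curators_template_user" {
  for_each = toset(var.data_curators_members)
  project  = var.tag_central_project_id
  role     = "roles/datacatalog.tagTemplateUser"
  member   = each.value
}

# ... and attach tags to assets in each Analytics Project (assumed role).
resource "google_project_iam_member" "curators_tag_editor" {
  for_each = local.curator_bindings
  project  = each.value.project
  role     = "roles/datacatalog.tagEditor"
  member   = each.value.member
}

# Data Analysts: read-only discovery in each Analytics Project (assumed role).
resource "google_project_iam_member" "analysts_viewer" {
  for_each = local.analyst_bindings
  project  = each.value.project
  role     = "roles/datacatalog.viewer"
  member   = each.value.member
}
```

Using google_project_iam_member keeps each binding additive, so existing bindings for the same role on the project are left untouched.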

If you use different personas, please feel free to contribute to the sample repository or add comments to this blog post with your use case. This will be really helpful.

The automation

Without further ado, let’s look at the Terraform automation because doing things manually does us no good!

This GitHub repo contains all the samples and a detailed step-by-step guide on how to run them.

To run Terraform, we are going to use a service account, since at the time of this writing Data Catalog does not support using end-user credentials from the Google Cloud SDK.

To follow best practices, we won’t download the service account key; we will use service account impersonation instead.
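One way to set up impersonation is through the Google provider’s impersonate_service_account argument. This is a minimal sketch and not necessarily how the sample repository wires it; the variable name is hypothetical. The identity running Terraform needs the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account for this to work.

```hcl
# Hypothetical input: email of the service account Terraform should act as,
# e.g. "terraform@my-tag-central-project.iam.gserviceaccount.com".
variable "terraform_service_account" {
  type        = string
  description = "Service account to impersonate when calling Google APIs."
}

# The provider exchanges the caller's credentials for short-lived tokens
# of the service account, so no key file is downloaded or stored.
provider "google" {
  impersonate_service_account = var.terraform_service_account
}
```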

Create the Service Account

So the first step is creating a service account and setting the appropriate IAM roles:

Set Terraform variable placeholders

Next, we need to set the Terraform variable placeholders. After you clone the GitHub repo, change the placeholders in the .tfvars file.

Let’s look at an example of a valid configuration file:
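A minimal sketch of what such a file might look like is shown here. All variable names, project IDs, and members are hypothetical placeholders, so use the names defined in the repository’s own .tfvars file.

```hcl
# terraform.tfvars (hypothetical variable names and values)

# Project that stores the shared Tag Templates.
tag_central_project_id = "my-tag-central-project"

# Projects that store the data assets.
analytics_project_ids = ["my-analytics-project-1", "my-analytics-project-2"]

# Persona members, one list per persona.
data_governors_members = ["group:data-governors@example.com"]
data_curators_members  = ["user:curator@example.com"]
data_analysts_members  = ["domain:example.com"]
```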

In the sample code above, whenever you see member, it can be any of: user:{email}, serviceAccount:{email}, group:{email} or domain:{domain}.

Run Terraform

And at last, let's execute Terraform:

In case you want a quick overview, I’ve put together a demo video showing the execution:

Generated Resources

After Terraform completes, we can look at the generated resources:

IAM Roles

We can see that all the projects we set up in Terraform contain the discussed personas, with the appropriate permissions.

And let’s not forget the common resources created by Terraform:

Data Catalog Tag Templates

That’s pretty much it, thanks for reading :).

Wrapping up

Data Governance is a really complex area, and any automation that helps us set and enforce those standards is welcome. In this blog post, we looked at Terraform samples that support us when working at the project level.

Keep in mind that you may want to apply the suggested access controls at the folder or organization level, which is a common use case for large organizations.

The iam module in the GitHub repo is easily adaptable to that use case: all you need to do is switch the google_project_iam_member resource to google_folder_iam_member or google_organization_iam_member, respectively.
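As a rough sketch of that switch, here is the Data Analyst binding expressed at the folder level. The variable names and the chosen role are assumptions for illustration, not the repository’s actual code.

```hcl
# Hypothetical inputs.
variable "folder_id" {
  type        = string
  description = "Folder to bind on, e.g. \"folders/1234567\"."
}

variable "data_analysts_members" {
  type = list(string)
}

# Same shape as the project-level binding; only the resource type and
# the target argument (folder instead of project) change.
resource "google_folder_iam_member" "analysts_viewer" {
  for_each = toset(var.data_analysts_members)
  folder   = var.folder_id
  role     = "roles/datacatalog.viewer"
  member   = each.value
}
```

For the organization level, google_organization_iam_member takes an org_id argument in place of folder.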

Please feel free to contribute to the sample repository or add a feature request in case you need guidance on that.

Cheers,
