A metadata comparison between Apache Atlas and Google Data Catalog

Learn how your metadata is structured on both systems.

Marcelo Costa
Google Cloud - Community


Image created on Canva.

Disclaimer: All opinions expressed are my own and represent no one but myself. They come from the experience of participating in the development of fully operational sample connectors, available on GitHub.

If you missed any of the latest posts on how to ingest metadata into Data Catalog, please check the following: Looker, RDBMS, Tableau, Hive.

The one million dollar question

A Data Catalog is usually defined as a collection of metadata combined with data management and search tools. It enables organizations to quickly discover, understand, and manage all their data.

Now here’s the one million dollar question.

How do you structure your metadata?

Google Data Catalog

Data Catalog defines its core metadata as:

Metadata mental model based on gcp-datacatalog-diagrams

Google Data Catalog comes with pre-defined structures to represent metadata. If by any chance the built-in attributes are not enough, users are able to work with Templates to add extra attributes to their assets.

Let’s understand each main component of that diagram.

  • Entry Group

An entry group keeps related entries together. Using Cloud Identity and Access Management, we can also specify which users can create, edit, and view entries within that entry group.

It’s worth mentioning that Data Catalog automatically creates an entry group for BigQuery entries and Pub/Sub topics.

Here is one entry group as an example, showing some ingested entries that belong to the Tableau entry group:

Data Catalog Entry Group UI — showing Tableau Entry Group

For further details on those, please check the Tableau connector.
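The grouping also shows up in how Data Catalog names its resources: an entry's name is nested under the name of its entry group. Below is a minimal sketch of that naming scheme using plain string formatting rather than the client library. The project, location, and IDs are hypothetical.

```python
# Sketch only: builds Data Catalog resource names as plain strings.
# The project, location, and IDs used below are hypothetical.

def entry_group_name(project: str, location: str, entry_group_id: str) -> str:
    """Return the full resource name of an entry group."""
    return f"projects/{project}/locations/{location}/entryGroups/{entry_group_id}"


def entry_name(project: str, location: str, entry_group_id: str, entry_id: str) -> str:
    """Return the full resource name of an entry nested under its entry group."""
    parent = entry_group_name(project, location, entry_group_id)
    return f"{parent}/entries/{entry_id}"


group = entry_group_name("my-project", "us-central1", "tableau")
workbook = entry_name("my-project", "us-central1", "tableau", "sales_workbook")

print(group)
print(workbook)
```

Because every entry name starts with its entry group name, permissions granted on the group naturally scope to the entries inside it.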

  • Entry

The Entry is the native Data Catalog entity that represents an asset’s technical metadata. It comes with pre-defined fields that change according to its type.

This means some fields for a BigQuery table will not be the same as the ones representing a Pub/Sub topic, although most of them are common.

It even allows users to create their own Entry types using custom entries, like the ones from Tableau we saw above.

Now let’s look at one entry from BigQuery:

BigQuery Table

We will detail later on what the Tags and Schema tabs are used for.
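To make the built-in versus custom distinction a bit more concrete, here is a rough sketch using plain dictionaries. The keys loosely mirror field names of the Data Catalog Entry resource as I understand them (`type`, `user_specified_system`, `user_specified_type`, `linked_resource`). All values are hypothetical, and nothing here calls the API.

```python
# Sketch only: plain dicts that loosely mirror Data Catalog Entry fields.
# All values are hypothetical; no API call is made.

bigquery_entry = {
    "display_name": "orders",
    "type": "TABLE",  # built-in entry type
    "linked_resource": (
        "//bigquery.googleapis.com/projects/my-project/datasets/sales/tables/orders"
    ),
}

tableau_entry = {
    "display_name": "Sales Workbook",
    "user_specified_system": "tableau",  # custom source system
    "user_specified_type": "workbook",   # custom entry type
    "linked_resource": "https://tableau.example.com/workbooks/42",
}


def is_custom_entry(entry: dict) -> bool:
    """A custom entry declares its own system and type instead of a built-in type."""
    return "user_specified_system" in entry and "user_specified_type" in entry


def shared_fields(first: dict, second: dict) -> list:
    """Return the field names that both entries have in common, sorted."""
    return sorted(set(first) & set(second))


print(is_custom_entry(bigquery_entry))
print(is_custom_entry(tableau_entry))
print(shared_fields(bigquery_entry, tableau_entry))
```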

  • Tag Template

Data Catalog provides a templating mechanism, where you can create representations of metadata. One quick example for clarification:

Template Example

This template contains useful attributes for the discover, understand, and manage flow we talked about at the beginning of this post.

We can use them to classify our assets and, for example, search and troubleshoot all tables that have the failed status, or add some automation to our ETL pipeline that blocks jobs whose tables have a data quality score lower than 5.

Stay with me to understand how we create tags with them later on.
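As a rough illustration of the two uses mentioned above (finding failed tables and blocking jobs on low quality scores), here is a small self-contained sketch. The template is modeled as a plain dictionary; the field IDs `etl_status` and `data_quality_score`, the allowed enum values, and the helper names are all assumptions made for this example, not the template shown in the image.

```python
# Hypothetical tag template: field IDs, types, and allowed values are
# assumptions for illustration, not the template shown in the article.
ETL_TEMPLATE = {
    "etl_status": {"type": "enum", "allowed": {"SUCCESS", "FAILED", "RUNNING"}},
    "data_quality_score": {"type": "double"},
}


def validate_tag(values: dict, template: dict = ETL_TEMPLATE) -> None:
    """Raise ValueError if a tag's values do not fit the template."""
    for field_id, value in values.items():
        if field_id not in template:
            raise ValueError(f"unknown field: {field_id}")
        spec = template[field_id]
        if spec["type"] == "enum" and value not in spec["allowed"]:
            raise ValueError(f"invalid value for {field_id}: {value}")
        if spec["type"] == "double" and not isinstance(value, (int, float)):
            raise ValueError(f"{field_id} must be a number")


def should_block_job(values: dict, min_score: float = 5.0) -> bool:
    """Block the ETL job when the table's quality score is below the threshold."""
    validate_tag(values)
    return values["data_quality_score"] < min_score


def failed_tables(tags_by_table: dict) -> list:
    """Return the names of tables whose tag reports the FAILED status."""
    return sorted(
        table
        for table, values in tags_by_table.items()
        if values.get("etl_status") == "FAILED"
    )


tags = {
    "orders": {"etl_status": "SUCCESS", "data_quality_score": 8.5},
    "customers": {"etl_status": "FAILED", "data_quality_score": 3.0},
}

print(failed_tables(tags))
print(should_block_job(tags["customers"]))
```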

Coffee break

After a quick coffee break, let’s move on to Apache Atlas.

Apache Atlas

Apache Atlas defines its core metadata as:

Apache Atlas Metadata mental model

Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called types.

A type represents one attribute or a collection of attributes that define the properties of the metadata object.

“Hey Marcelo, can we compare a type with any Google Data Catalog object?”

Sorry! This is not a fair comparison. If you look at the mental models, the hierarchies are different! But let’s dig deeper and we will find some similarities.

Two of the Composite Metatypes, Struct and Relationship, are out of the scope of this article. Google Data Catalog does not support lineage at the time of this writing, so we are not using Relationships.

And if you are using struct types, I’d love to know your use cases and perhaps improve this article.

Now let’s understand each main component on that diagram.

  • Primitive and Enum Meta types

Think about any programming language: these are the most basic types, which you can use when defining the attributes of your Entities and Classifications.

  • Collection Meta types

This is where things get interesting: you can use array and map structures composed of the primitive and enum types.

Let’s say you have a Table in Atlas. That Table will surely contain some columns, so here you would represent the columns as an array meta-type.
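In Atlas type definitions, a collection is expressed through the attribute's type name, for example `array<Column>` or `map<string,string>`. The sketch below shows a trimmed-down Table definition as a plain dictionary. The attribute list is an assumption for illustration, real definitions carry more keys, and `collection_kind` is a hypothetical helper.

```python
# Sketch only: a trimmed-down Atlas entity type definition as a plain dict.
# Real definitions carry more keys; the attribute list here is an assumption.
table_def = {
    "name": "Table",
    "superTypes": ["DataSet"],
    "attributeDefs": [
        {"name": "tableType", "typeName": "string"},
        {"name": "columns", "typeName": "array<Column>"},
        {"name": "parameters", "typeName": "map<string,string>"},
    ],
}


def collection_kind(type_name: str) -> str:
    """Classify a type name as 'array', 'map', or 'single' (hypothetical helper)."""
    if type_name.startswith("array<") and type_name.endswith(">"):
        return "array"
    if type_name.startswith("map<") and type_name.endswith(">"):
        return "map"
    return "single"


for attr in table_def["attributeDefs"]:
    print(attr["name"], "->", collection_kind(attr["typeName"]))
```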

  • Composite Meta types

Here are the two most important units: Entities and Classifications.

  • Entities

I told you we would find similarities: Entities in Atlas are close to what we call Entries in Google Data Catalog.

They represent an asset’s technical metadata. The difference is that there are no pre-defined fields.

Sounds scary, right? This gives users a lot of flexibility, but it comes with complexity, so use it with care.

Lucky for us, Atlas comes with some pre-defined entity types for various Hadoop and non-Hadoop metadata, and you can even ingest sample models and data by running their quick_start.

Also, entity types can extend from other types, called superTypes, so you receive attributes from ancestors. Let’s look at one example:

Atlas Table Example

This image shows the attributes for a Table entity called customer_dim.

If we look at the Table ancestors, we would get this hierarchy: Table -> DataSet -> Asset -> Referenceable.

And the attributes we are looking at are the combination of that hierarchy.
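One way to picture that combination is to walk the superTypes chain and collect attributes from the root down. The sketch below does that over a hand-written type table. The attribute lists are abbreviated and partly assumed, and `resolve_attributes` is a hypothetical helper, not part of Atlas.

```python
# Sketch only: abbreviated, partly assumed attribute lists for each type.
TYPE_DEFS = {
    "Referenceable": {"superTypes": [], "attributes": ["qualifiedName"]},
    "Asset": {"superTypes": ["Referenceable"], "attributes": ["name", "description", "owner"]},
    "DataSet": {"superTypes": ["Asset"], "attributes": []},
    "Table": {"superTypes": ["DataSet"], "attributes": ["createTime", "tableType", "columns"]},
}


def resolve_attributes(type_name: str, type_defs: dict) -> list:
    """Return inherited attributes first, then the type's own, without duplicates."""
    resolved = []
    for parent in type_defs[type_name]["superTypes"]:
        for attr in resolve_attributes(parent, type_defs):
            if attr not in resolved:
                resolved.append(attr)
    for attr in type_defs[type_name]["attributes"]:
        if attr not in resolved:
            resolved.append(attr)
    return resolved


print(resolve_attributes("Table", TYPE_DEFS))
```

The ancestors' attributes come first in the result, followed by the ones declared on Table itself.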

Bear in mind that DataSet is one of the most important types — according to Atlas documentation: “DataSet can be expected to have a Schema” — allowing us to add classifications on them later on.

  • Classifications

Do you remember Google Data Catalog Templates? We can say Classifications are really similar to them.

Just as with Templates, entities can be associated with Classifications, enabling easier discovery and management.

As with Atlas Entities, we have the same attribute and superType capabilities when creating Classifications.

To show how similar Classifications are to Google Data Catalog Templates, we are going to create one named ETL Governance.

ETL Classification creation

The difference here is that we are adding the PII superType.

Classification Created
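For readers who prefer the API over the UI, the same classification could be described as a type definition payload. The sketch below only builds and prints the JSON. The shape follows the Atlas v2 typedefs format as I understand it, the attribute names `etl_status` and `data_quality_score` are assumptions, and the PII type is assumed to already exist in Atlas.

```python
import json


def build_classification_def(name, description, super_types, attributes):
    """Build a typedefs payload containing a single classification definition.

    `attributes` maps attribute names to Atlas type names.
    """
    return {
        "classificationDefs": [
            {
                "name": name,
                "description": description,
                "superTypes": list(super_types),
                "attributeDefs": [
                    {
                        "name": attr_name,
                        "typeName": type_name,
                        "isOptional": True,
                        "cardinality": "SINGLE",
                    }
                    for attr_name, type_name in attributes.items()
                ],
            }
        ]
    }


payload = build_classification_def(
    name="ETL Governance",
    description="Governance attributes for ETL-managed assets",
    super_types=["PII"],  # assumed to exist already
    attributes={"etl_status": "string", "data_quality_score": "double"},  # hypothetical
)

print(json.dumps(payload, indent=2))
```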

Classifications and Tags

We talked about Classifications and Templates, but how do we apply them?

  • Google Data Catalog

Google Data Catalog uses Tags to apply Templates to Entries.

Metadata Tag mental model based on gcp-datacatalog-diagrams

This is what a Tag looks like:

Google Data Catalog Entry with some Tags

If you remember the ETL example at the beginning, let’s search using it:

Search showing returned entries

So using Tags, we get rich search capabilities, enhancing our metadata management process as a whole.

Next, we will see how the same features work within Apache Atlas.

  • Apache Atlas

Apache Atlas does not create a different object like Tags. It uses the same Classification object and applies it directly to entities.

Metadata Classification mental model

This is what a Classification attached to an Entity looks like:

Apache Atlas Entity with some Classifications

Now let's do the same search:

Search showing returned entries

That’s great! So at the end of the day, the core capabilities of Google Data Catalog and Apache Atlas are similar.

A final comparison

Finally, we will put the metadata objects we saw in the article side by side.

Google Data Catalog and Apache Atlas comparison
  • Entry Groups don’t have a correlated object in Apache Atlas. You could use Glossaries to group your assets in Atlas, but they serve a broader purpose and are out of the scope of this article.
  • Entries are mapped to a combination of Entities and Attributes.
  • TagTemplates are mapped to a combination of Classifications and Attributes.
  • Tags are mapped to the Classifications when they are applied to Entities.
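The mapping in the list above can be sketched as a small conversion function over plain dictionaries. This is only an illustration of the correspondence, not how any connector is implemented: the dictionary shapes, the `apache_atlas` entry group ID, the sample qualified name, and the `to_id` helper are all assumptions.

```python
import re


def to_id(name: str) -> str:
    """Lowercase a name and replace non-alphanumeric runs with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def atlas_to_data_catalog(entity: dict, entry_group_id: str = "apache_atlas"):
    """Map one Atlas entity to an entry, tag templates, and tags (all plain dicts)."""
    attrs = entity["attributes"]
    # Entity + attributes -> Entry. Atlas has no Entry Group, so the caller picks one.
    entry = {
        "entry_group_id": entry_group_id,
        "entry_id": to_id(attrs["qualifiedName"]),
        "display_name": attrs["name"],
        "user_specified_system": "apache_atlas",
        "user_specified_type": to_id(entity["typeName"]),
    }
    templates = {}
    tags = []
    for classification in entity.get("classifications", []):
        template_id = to_id(classification["typeName"])
        values = classification.get("attributes") or {}
        # Classification + attributes -> Tag Template (field names only).
        templates[template_id] = sorted(values)
        # Classification applied to an entity -> Tag.
        tags.append({"template": template_id, "fields": dict(values)})
    return entry, templates, tags


atlas_entity = {
    "typeName": "Table",
    "attributes": {"name": "customer_dim", "qualifiedName": "customer_dim@cl1"},
    "classifications": [
        {"typeName": "ETL Governance", "attributes": {"etl_status": "FAILED"}}
    ],
}

entry, templates, tags = atlas_to_data_catalog(atlas_entity)
print(entry)
print(templates)
print(tags)
```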

“Hey Marcelo, now tell me which one is the best?”

Sorry! This is not the blog post for that. What I can say is that Google Data Catalog is a fully managed and serverless product, whereas Apache Atlas is something you have to manage yourself.

Closing thoughts

In this article, we compared how Apache Atlas and Google Data Catalog structure their metadata. We could see that many concepts are similar, since they are essential for a good metadata management process.

The assets you saw in Google Data Catalog were ingested using the Apache Atlas connector. Stay tuned for my next post, where I will show how to run the connector for both full and incremental ingestions! Cheers!


software engineer & google cloud certified architect and data engineer | love to code, working with open source and writing @ alvin.ai