Leveraging Gemini for PII Detection in BigQuery: An Experiment

Step-by-step sample code for an experiment using Google Gemini 1.0 Pro

Marcelo Costa
Google Cloud - Community

--



In this blog post, we’re going to go over how to use Google Gemini 1.0 Pro in a simple Streamlit app to find PII (personally identifiable information) in BigQuery tables. You’ll see how simple it is to plug your BigQuery metadata into Gemini or any other model by using LangChain’s modular architecture, its pre-built connectors, and Vertex AI foundation models.

Disclaimer:

  • The sample code is based on Generative AI technologies and may be subject to occasional hallucinations.

Let's start with some short summaries to give you a little bit of context; feel free to skip this part:

This walkthrough is based on Gemini 1.0 Pro on Vertex AI.
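If you want to follow along locally, these are roughly the Python packages the sample code below relies on (package names reflect the LangChain package split at the time of writing; adjust versions as needed):

pip install streamlit langchain langchain-community langchain-google-vertexai google-cloud-bigquery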

Enable Required APIs

Let's start by enabling the required APIs in the Cloud Console: at minimum, the Vertex AI API and the BigQuery API.
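If you prefer the command line, the same can be done with gcloud (assuming the gcloud CLI is installed and authenticated against your project):

gcloud services enable aiplatform.googleapis.com bigquery.googleapis.com --project=<your-project-id>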

Check out Vertex AI Foundation Models

I recommend going to the Vertex AI Model Garden page to check the latest models:

There you can find details on different foundation models fine-tuned for specific use cases. If we open Gemini Pro, we can see the latest available version, documentation, and sample code guidance on how to use it:

Set up the Service Account permissions

Create a service account such as:

gemini-demo-00001@<your-project-id>.iam.gserviceaccount.com

Give it permissions to query the BigQuery INFORMATION_SCHEMA tables in the next steps; the BigQuery Metadata Viewer role covers the required table metadata access.

Also, make sure to give it the BigQuery Job User role, so it can run the jobs that query the metadata tables.

Lastly, you will need the aiplatform.endpoints.predict permission to be able to call the Vertex AI model. To make experiments easier, you may use the Vertex AI User role; for production workloads, I recommend creating a custom role with only the necessary permissions.
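For reference, here is roughly what those role grants could look like with gcloud, using the example service account from above and project-level bindings for simplicity (a demo-friendly sketch, not a production setup):

SA=gemini-demo-00001@<your-project-id>.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding <your-project-id> --member="serviceAccount:${SA}" --role="roles/bigquery.metadataViewer"
gcloud projects add-iam-policy-binding <your-project-id> --member="serviceAccount:${SA}" --role="roles/bigquery.jobUser"
gcloud projects add-iam-policy-binding <your-project-id> --member="serviceAccount:${SA}" --role="roles/aiplatform.user"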

Code Sample

Load table metadata

We are going to start by defining a query that extracts all table DDLs from the BigQuery INFORMATION_SCHEMA.TABLES view.

project = "<your-project-id>"
dataset = "<your-dataset-id>"
# Define our query
query = f"""
SELECT table_name, ddl
FROM `{project}.{dataset}.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'BASE TABLE'
ORDER BY table_name;"""

Then we are going to use the LangChain BigQuery data loader to load the metadata.

# Load the data with LangChain's BigQuery document loader
from langchain_community.document_loaders import BigQueryLoader

loader = BigQueryLoader(query, project=project, metadata_columns=["table_name"], page_content_columns=["ddl"])
data = loader.load()

Example of loaded data from a test project:

[Document(page_content='ddl: CREATE TABLE `gcp-tutorials-bigdata-dev-4.bq_dataset_0014.org_e_enable_value_added_roi_f77e`
(
hostname STRING,
ipv4_private STRING,
coordinate FLOAT64,
iban STRING,
ipv6 STRING,
last_name STRING,
license_plate STRING,
office STRING,
title STRING,
org STRING,
salary INT64,
bonus INT64,
uri STRING,
user_name STRING
);', metadata={'table_name': 'org_e_enable_value_added_roi_f77e'}), Document(page_content='ddl: CREATE TABLE `gcp-tutorials-bigdata-dev-4.bq_dataset_0014.org_transform_customized_e_markets_50ba`
(
license_plate STRING,
mac_address STRING,
iban STRING,
ssn STRING,
credit_card_number INT64,
coordinate FLOAT64,
image_url STRING,
uri STRING,
hostname STRING,
last_name STRING
);', metadata={'table_name': 'org_transform_customized_e_markets_50ba'})]

At a quick glance we can already spot some PII columns in the schemas; let's see what we get from Gemini by adding a few more lines of code.

Set up model

First, select the model; we are using Gemini 1.0 Pro:

from langchain_google_vertexai import VertexAI

llm = VertexAI(model_name="gemini-1.0-pro-001", max_output_tokens=2048)

Next, define the chain:

from langchain.prompts import PromptTemplate
from langchain.schema import format_document

chain = (
    {
        # "Stuff" all table DDLs into a single string for the prompt
        "content": lambda docs: "\n\n".join(
            format_document(doc, PromptTemplate.from_template("{page_content}"))
            for doc in docs
        )
    }
    | PromptTemplate.from_template(
        """Identify PII and sensitive columns in the provided tables schema,
return a JSON response following this format:
<
table_name: <table_name_value>,
classification: [>
<classification_type_value> : [
<column_name>
]
>]
>:\n\n{content}"""
    )
    | llm
)

We created a basic chain that “stuffs” together all of the table metadata into one prompt. For larger datasets with many more tables, a more sophisticated chaining approach will be needed. That’s because there’s a limited length to each prompt, i.e. a context window.
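As an illustration of what "more sophisticated" could mean (this is my own sketch, not part of the original demo), the simplest workaround is to invoke the same chain over batches of documents and merge the partial results; the batch size of 10 is an arbitrary choice:

# Hypothetical batching sketch: classify a few tables per call to stay
# within the model's context window, then collect the partial responses.
def classify_in_batches(chain, docs, batch_size=10):
    results = []
    for i in range(0, len(docs), batch_size):
        results.append(chain.invoke(docs[i:i + batch_size]).strip("`"))
    return results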

We also took the liberty of asking it to format the output as a JSON structure, using a templating syntax I made up on the go… and this is what we get by invoking it:

Invoke chain

# Invoke the chain with the documents, and remove code backticks
result = chain.invoke(data).strip('```')

{
  "table_name": "org_e_enable_value_added_roi_f77e",
  "classification": [
    {
      "PII": [
        "hostname",
        "ipv4_private",
        "last_name",
        "license_plate",
        "org",
        "salary",
        "bonus",
        "uri",
        "user_name"
      ]
    },
    {
      "Sensitive": [
        "iban",
        "ipv6",
        "office",
        "title",
        "coordinate"
      ]
    }
  ]
},
{
  "table_name": "org_transform_customized_e_markets_50ba",
  "classification": [
    {
      "PII": [
        "last_name"
      ]
    },
    {
      "Sensitive": [
        "license_plate",
        "mac_address",
        "iban",
        "ssn",
        "credit_card_number",
        "image_url",
        "hostname",
        "coordinate",
        "uri"
      ]
    }
  ]
}

Not bad! I didn't need to explain what PII and Sensitive mean, since they are pretty standard data governance classifications.
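One caveat if you want to consume the response programmatically: strip('```') only removes backtick characters from the ends of the string, so a leading ```json fence would leave the json language tag behind, and the model returned two comma-separated objects rather than a single JSON document. A small cleanup sketch (my own addition, not from the original demo):

import json

def parse_model_json(raw: str):
    # Strip an optional Markdown code fence and a leftover "json" language tag.
    text = raw.strip().strip("`").strip()
    if text.startswith("json"):
        text = text[len("json"):]
    # The sample output is two objects separated by a comma, so wrap it in a list.
    return json.loads(f"[{text}]")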

Wrap it in Streamlit

By just combining the previous code with this small piece:

import streamlit as st


def setup_streamlit_interface():
    """Configures the Streamlit page and initializes the chat interface."""
    st.set_page_config(page_title="Classification Bot")
    st.title("Classification Bot")

    # Sidebar form for defining the BigQuery context to classify
    with st.sidebar:
        with st.form("context"):
            st.write("Define your context")
            st.text_input("Project", key="project", value="gcp-tutorials-bigdata-dev-4")
            st.text_input("Dataset", key="dataset", value="bq_dataset_0014")
            st.form_submit_button("Classify", on_click=fetch_and_classify)

    # Initialize the language model for processing queries
    st.session_state.llm = VertexAI(model_name="gemini-1.0-pro-001", max_output_tokens=2048)

    # Initialize or continue the chat interface
    if "messages" not in st.session_state:
        st.session_state["messages"] = [{"role": "ai", "content": "Hit classify to make me work"}]

    display_chat_interface()


def display_chat_interface():
    """Displays the chat interface."""
    for msg in st.session_state.messages:
        with st.chat_message(msg["role"]):
            if "content" in msg:
                st.write(msg["content"])


if __name__ == '__main__':
    setup_streamlit_interface()
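The fetch_and_classify callback referenced in the form is where the earlier pieces come together; the post doesn't show it explicitly, so here is a minimal hypothetical sketch that reuses the query, loader, and chain from the previous sections:

# Hypothetical callback: rebuild the query from the sidebar inputs, load the
# table DDLs, run the classification chain, and append the result to the chat.
def fetch_and_classify():
    project = st.session_state.project
    dataset = st.session_state.dataset
    query = f"""
    SELECT table_name, ddl
    FROM `{project}.{dataset}.INFORMATION_SCHEMA.TABLES`
    WHERE table_type = 'BASE TABLE'
    ORDER BY table_name;"""
    loader = BigQueryLoader(query, project=project, metadata_columns=["table_name"], page_content_columns=["ddl"])
    result = chain.invoke(loader.load()).strip("`")
    st.session_state.messages.append({"role": "ai", "content": result})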

And running it with:
streamlit run main.py

We got a cool running demo:

Final Thoughts

Creating a quick demo and experimenting with Gemini and various foundation models can be done in just a few minutes. Streamlit is also an interesting framework worth adding to your toolkit for quick experiments. However, it's crucial to acknowledge that deploying an LLM as a production workload involves several important considerations, which may be the subject of another blog post :)

--


Marcelo Costa
Google Cloud - Community

software engineer & google cloud certified architect and data engineer | love to code, working with open source and writing @ alvin.ai