How Databricks enables records of processing activities

By the Blueprint Team

Databricks’ ability to process, store, clean, share, analyze, model, and monetize datasets provides myriad benefits for businesses. A less-discussed benefit is how its structure and capabilities innately support companies’ data protection and privacy obligations.

Within the U.S. there are soon to be eight comprehensive state privacy laws, and throughout the world there are dozens more. The obligations in these laws require companies to have a solid understanding of what data they collect, where they retain it and for how long, with whom they share it, and how they use it.

Databricks’ three-phase structure lets companies move their data through bronze, silver, and gold phases, with each phase providing capabilities that help them meet their privacy and data protection obligations.

Records of processing activities (ROPAs) are required by many data protection laws, including the EU General Data Protection Regulation, and are the foundation of any successful privacy program. They are essentially logs of the lifecycle of the personal information your company holds: where you collected it, where it’s stored, how it’s used, with whom it’s shared, and when it is deleted. While ROPAs are not required by U.S. state privacy laws, having basic ones in place is essential for responding to the data subject rights requests that most businesses need to honor.

Databricks and ROPAs

When data is migrated into Databricks, it is extracted from the source as-is into the bronze stage of the data lifecycle. In this process, data engineers can extend the data frame, adding metadata columns to the source data table that enable the organization to track the provenance of the data.

For example, where native columns extracted from a CRM look like this:

  • Account ID: A unique identifier for each account record.
  • Account Name: The name of the account.
  • Account Type: The type of account (e.g., customer, partner, competitor).
  • Industry: The industry that the account belongs to (e.g., healthcare, technology, finance).
  • Annual Revenue: The annual revenue of the account.
  • Billing Address: The billing address of the account.
  • Shipping Address: The shipping address of the account.
  • Phone: The phone number associated with the account.
  • Number of Employees: The number of employees at the account.
  • Parent Account ID: If the account is a subsidiary, this column stores the ID of the parent account.
  • Created Date: The date and time that the account record was created.
  • Last Modified Date: The date and time that the account record was last modified.
  • Description: A description of the account.
  • Tax ID Number: A TIN or SSN for a small business.
  • D&B Number: The Dun & Bradstreet ID.
  • Stock Symbol: The stock symbol, if applicable.

An engineer can append metadata columns to these native columns during extraction, like so:

  • _sourceSystem = “SalesForce”
  • _sourceTable = “Accounts”
  • _extractSystem = “Databricks Notebook_X”
  • _extractUser = “account_x@company.com”
  • _extractDateTime = “2023-02-02 13:44:22Z”
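
As a rough illustration of the step above, the sketch below appends these metadata columns to rows extracted from a source system. It uses plain Python dictionaries so it runs anywhere; in a Databricks notebook the same idea would more likely be expressed with Spark DataFrame operations. The function name `add_provenance` and the sample account rows are hypothetical and only serve the example.

```python
from datetime import datetime, timezone


def add_provenance(rows, source_system, source_table,
                   extract_system, extract_user, extract_time=None):
    """Return copies of the source rows with provenance columns appended.

    Hypothetical helper: the column names follow the article, the rest
    is an assumption made for illustration.
    """
    if extract_time is None:
        extract_time = datetime.now(timezone.utc)
    metadata = {
        "_sourceSystem": source_system,
        "_sourceTable": source_table,
        "_extractSystem": extract_system,
        "_extractUser": extract_user,
        "_extractDateTime": extract_time.strftime("%Y-%m-%d %H:%M:%SZ"),
    }
    # Build new dicts so the rows extracted as-is are left untouched.
    return [{**row, **metadata} for row in rows]


# Made-up sample rows standing in for a CRM export.
crm_rows = [
    {"Account ID": "001", "Account Name": "Acme Corp", "Industry": "technology"},
    {"Account ID": "002", "Account Name": "Globex", "Industry": "finance"},
]

bronze_rows = add_provenance(
    crm_rows,
    source_system="SalesForce",
    source_table="Accounts",
    extract_system="Databricks Notebook_X",
    extract_user="account_x@company.com",
    extract_time=datetime(2023, 2, 2, 13, 44, 22, tzinfo=timezone.utc),
)
```

Every row in `bronze_rows` now carries its native columns plus the five provenance columns, while `crm_rows` stays exactly as extracted.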

Now, with these additional columns, you can perform many types of auditing queries against your bronze data; for example:

  • Produce a list of all warehouse source systems where a specific column exists;
  • Identify the [Account Name] for Acct. ID XXX when extracted from SalesForce on a specific date; or
  • Determine the last time the Account table’s data was extracted.
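
To make these auditing queries concrete, here is a minimal sketch that runs them as SQL. It uses an in-memory SQLite database from the Python standard library as a stand-in for a bronze table, so the table name `bronze_accounts`, the trimmed-down column set, and the sample rows are all assumptions. The first query is a simplified version of the first bullet: it lists the source systems that contributed rows to a single table rather than searching a whole warehouse.

```python
import sqlite3

# In-memory stand-in for a bronze table (hypothetical name and sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bronze_accounts (
        account_id TEXT,
        account_name TEXT,
        _sourceSystem TEXT,
        _sourceTable TEXT,
        _extractDateTime TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO bronze_accounts VALUES (?, ?, ?, ?, ?)",
    [
        ("001", "Acme Corp", "SalesForce", "Accounts", "2023-02-01 09:00:00Z"),
        ("001", "Acme Corporation", "SalesForce", "Accounts", "2023-02-02 13:44:22Z"),
        ("002", "Globex", "SalesForce", "Accounts", "2023-02-02 13:44:22Z"),
    ],
)

# Which source systems contributed rows to this table?
source_systems = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT _sourceSystem FROM bronze_accounts ORDER BY _sourceSystem"
    )
]

# What was the account name for a given account ID, as extracted
# from a given source system on a given date?
name_on_date = conn.execute(
    """
    SELECT account_name
    FROM bronze_accounts
    WHERE account_id = ?
      AND _sourceSystem = ?
      AND substr(_extractDateTime, 1, 10) = ?
    """,
    ("001", "SalesForce", "2023-02-02"),
).fetchone()[0]

# When was the Accounts table last extracted? The fixed-width timestamp
# format means the text maximum is also the latest timestamp.
last_extract = conn.execute(
    "SELECT MAX(_extractDateTime) FROM bronze_accounts WHERE _sourceTable = ?",
    ("Accounts",),
).fetchone()[0]
```

Because each extract is stored alongside its provenance columns, the second query can show that the account name differed between the two extraction dates in the sample data.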

Understanding your data lifecycle is at the core of every privacy program. Databricks supports this understanding through its ability to maintain data provenance and to clearly identify where data is stored, when it is used, and for what purpose.

Stay tuned for the next post in our series.
