Databricks’ ability to process, store, clean, share, analyze, model, and monetize datasets provides myriad benefits for businesses. A less-discussed benefit is how its structure and capabilities inherently support companies’ data protection and privacy obligations.
Within the U.S. there are soon to be eight comprehensive state privacy laws, and throughout the world there are dozens more. The obligations in these laws require companies to have a solid understanding of what data they collect, where they retain it and for how long, with whom they share it, and how they use it.
Databricks’ three-phase structure lets companies move their data through bronze, silver, and gold phases, each of which provides capabilities that help companies meet their privacy and data protection obligations.
Records of processing activities (ROPAs) are required by many data protection laws, including the EU General Data Protection Regulation, and are the foundation of any successful privacy program. They are essentially logs of the lifecycle of the personal information your company holds: where you collected it, where it’s stored, how it’s used, with whom it’s shared, and when it is deleted. Although U.S. state privacy laws do not require ROPAs, having basic ones in place is essential for responding to the data subject rights requests that most businesses need to honor.
Databricks and ROPAs
When data is migrated into Databricks, it is extracted from the source as-is into the bronze stage of the data lifecycle. In this process, data engineers can extend the data frame, adding metadata columns to the source data table that enable the organization to track the provenance of the data.
For example, where native columns extracted from a CRM look like this:
An engineer can extend them with metadata columns in the export, like so:
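As a minimal sketch of this step, the extension might look like the following. It assumes hypothetical CRM columns (`account_id`, `account_name`) and hypothetical metadata column names (`_source_system`, `_source_table`, `_extracted_at`); none of these come from an actual source schema. It is written in plain Python to stay self-contained, whereas on Databricks the same idea would more likely be applied to a Spark DataFrame (for example with `withColumn`).

```python
from datetime import datetime, timezone

# Hypothetical native columns extracted from a CRM (illustrative values only).
crm_rows = [
    {"account_id": "001", "account_name": "Acme Corp"},
    {"account_id": "002", "account_name": "Globex"},
]


def add_provenance(rows, source_system, source_table, extracted_at=None):
    """Return copies of the rows extended with provenance metadata columns.

    The metadata column names are assumptions chosen for this sketch.
    """
    extracted_at = extracted_at or datetime.now(timezone.utc)
    stamp = extracted_at.isoformat()
    return [
        {
            **row,  # keep the native columns exactly as extracted
            "_source_system": source_system,
            "_source_table": source_table,
            "_extracted_at": stamp,
        }
        for row in rows
    ]


# Every bronze row now records where it came from and when it was extracted.
bronze_rows = add_provenance(crm_rows, "Salesforce", "Account")
```

The native columns are left untouched and the input rows are not modified, which matches the idea of landing source data as-is in the bronze stage while recording its provenance alongside it.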
Now, with these additional columns, you can perform many types of auditing queries against your bronze data; for example:
- Produce a list of all warehouse source systems where a specific column exists;
- Identify the [Account Name] for Acct. ID XXX when extracted from Salesforce on a specific date; or
- Determine the last time the Account table’s data was extracted.
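The three auditing queries above can be sketched roughly as follows. This is an illustration in plain Python over a small in-memory list, not a Databricks query; the sample rows, the second source system (`BillingDB`), the metadata column names, and the helper function names are all hypothetical.

```python
from datetime import date, datetime, timezone


def ts(year, month, day, hour=0):
    """Build a UTC timestamp for the sample data."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


# Hypothetical bronze rows carrying provenance metadata columns.
bronze_rows = [
    {"account_id": "A-1", "account_name": "Acme Corp",
     "_source_system": "Salesforce", "_source_table": "Account",
     "_extracted_at": ts(2023, 3, 1, 2)},
    {"account_id": "A-1", "account_name": "Acme Corporation",
     "_source_system": "Salesforce", "_source_table": "Account",
     "_extracted_at": ts(2023, 3, 2, 2)},
    {"customer_no": "9", "account_name": "Acme Corp",
     "_source_system": "BillingDB", "_source_table": "Customer",
     "_extracted_at": ts(2023, 3, 1, 4)},
]


def systems_with_column(rows, column):
    """List the source systems in which a given column exists."""
    return sorted({r["_source_system"] for r in rows if column in r})


def account_name_on(rows, account_id, source_system, day):
    """Return the account name as extracted from one system on a given day."""
    matches = [
        r for r in rows
        if r.get("account_id") == account_id
        and r["_source_system"] == source_system
        and r["_extracted_at"].date() == day
    ]
    if not matches:
        return None
    # If several extractions ran that day, use the latest one.
    return max(matches, key=lambda r: r["_extracted_at"])["account_name"]


def last_extracted(rows, source_table):
    """Return the most recent extraction time for a source table, if any."""
    stamps = [r["_extracted_at"] for r in rows if r["_source_table"] == source_table]
    return max(stamps, default=None)
```

For example, `account_name_on(bronze_rows, "A-1", "Salesforce", date(2023, 3, 1))` looks up the account name as it stood in that day's extraction, and `last_extracted(bronze_rows, "Account")` reports when the Account table was last pulled.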
Understanding your data lifecycle is at the core of every privacy program. Databricks supports this understanding by maintaining data provenance and making it possible to clearly identify where data is stored, and when and for what purpose it is used.