There is no single way to organize your Unity Catalog that will fit every organization. But there is one principle that should guide whatever structure you choose.
Each metastore is a menu of data you can select.
Consider your local Chinese restaurant.
There are a ton of food options to choose from. How can you quickly locate the dishes that you enjoy? In this example, the menu groups the options into categories such as appetizers, soup, poultry, combinations, and specialties. This helps you find what you are looking for more quickly.
Similarly, your selection of catalog names helps guide the top-level groups for your organization’s data.
One of the most common cataloging systems is project-based. You define a catalog for each project. This works well when there is little overlap in source data.
You will notice that the example catalog “samples” (provided by Databricks) contains data for the New York City Taxi Commission and a retail company. Slowder is a project where I illustrated how to secure confidential data in Databricks. Sqluc_202308 is a project illustrating the migration of a single project from the Hive metastore to Unity Catalog.
In these cases, there is no sharing of data. Each project deals with distinct data. While you could allow users of one project to see data from another project’s catalog, it may not be the ideal way to organize your data if there is a great deal of shared data.
Shared Zone Catalogs
Consider a case where you want to share your bronze data across all projects. You can minimize the amount of duplicated effort in ingesting data for each subsequent project. In this case, it may make sense to create one catalog named bronze and create schemas underneath that to refer to the source system. Then, you could grant all users in your ETL team “read access” to this data, and they would be able to transform the raw data into the form that best fits the needs of data consumers.
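As a rough sketch of the shared bronze layout described above, the Python below only builds the Unity Catalog SQL statements as strings, so you can review them before running anything. The source system names (salesforce, sap) and the group name (etl_team) are made up for illustration. It also assumes that "read access" means granting USE CATALOG, USE SCHEMA, and SELECT at the catalog level, which then apply to every schema and table underneath.

```python
def shared_bronze_statements(source_systems, reader_group):
    """Build SQL for one shared 'bronze' catalog with a schema per source system.

    source_systems and reader_group are hypothetical inputs for illustration.
    """
    statements = ["CREATE CATALOG IF NOT EXISTS bronze"]

    # One schema per source system, all under the shared bronze catalog.
    for source in source_systems:
        statements.append(f"CREATE SCHEMA IF NOT EXISTS bronze.{source}")

    # Assumption: read access = these three privileges granted on the catalog,
    # so they are inherited by every schema and table below it.
    for privilege in ("USE CATALOG", "USE SCHEMA", "SELECT"):
        statements.append(
            f"GRANT {privilege} ON CATALOG bronze TO `{reader_group}`"
        )
    return statements


if __name__ == "__main__":
    for statement in shared_bronze_statements(["salesforce", "sap"], "etl_team"):
        print(statement + ";")
```

In a Databricks notebook, you could pass each generated string to spark.sql() once you are happy with the list.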
This approach could even be extended to a common silver layer. For example, suppose you created one common Operational Data Store (ODS) that contained all the transaction-level data for your entire org. In that case, you might want to make that available to your ETL developers to build separate gold zones.
Shared zones start to break down when you have multiple groups of consumers who need restrictions placed on what they can and cannot see. Suppose you deal with PII (Personally Identifiable Information), PCI (Payment Card Industry), or other data with privacy concerns. In that case, you might not want anyone to directly access a shared bronze zone. With direct access to the raw data, they could bypass the masking you put in place downstream to protect that confidential data.
You may then decide to create multiple catalogs based on the role the consumer plays within your organization. You might then consider catalogs like bronze, bronze_pii, silver_pci, etc. Then, you could secure each catalog so that only the groups who should have access can read it.
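One way to keep role-based catalogs manageable is to write down who may read which catalog in a single mapping and generate the grants from it. The sketch below does that in plain Python. The catalog names come from the examples above, while the group names (etl_team, privacy_engineers, payments_analysts) are invented for illustration, and the three read privileges are an assumption about what "access" means for your consumers.

```python
# Hypothetical mapping of catalog -> groups allowed to read it.
CATALOG_READERS = {
    "bronze": ["etl_team"],
    "bronze_pii": ["privacy_engineers"],
    "silver_pci": ["payments_analysts"],
}

# Assumption: read access is these three privileges granted on the catalog.
READ_PRIVILEGES = ("USE CATALOG", "USE SCHEMA", "SELECT")


def grants_for(catalog_readers):
    """Build one GRANT statement per catalog, group, and read privilege."""
    statements = []
    for catalog, groups in catalog_readers.items():
        for group in groups:
            for privilege in READ_PRIVILEGES:
                statements.append(
                    f"GRANT {privilege} ON CATALOG {catalog} TO `{group}`"
                )
    return statements


def can_read(catalog_readers, group, catalog):
    """Check the mapping (not the live metastore) for a group's read access."""
    return group in catalog_readers.get(catalog, [])


if __name__ == "__main__":
    for statement in grants_for(CATALOG_READERS):
        print(statement + ";")
```

Keeping the mapping in one place makes it easy to review that, for instance, the general ETL group never appears under a PII or PCI catalog.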
The Databricks documentation advocates naming your catalogs for your environments. This can work well when a single project covers your whole data lake. You create one catalog per environment; then, your zones become the schemas within each catalog.
You could combine environment-based naming with any of the other naming conventions to get a better fit. For example, if you deal with several projects, each with its own environments, you could name your catalogs <environment_abbreviation>_<project_name>. This could give you dev_finance, stg_operations, etc.
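If you adopt a combined convention like this, a small helper can keep the names consistent across teams. The sketch below is one possible approach, not a Databricks feature: the allowed environment abbreviations (dev, stg, prd) are an assumption, and it joins the parts with an underscore, as the dev_finance and stg_operations examples do.

```python
import re

# Assumption: the environment abbreviations your organization has agreed on.
ENVIRONMENTS = {"dev", "stg", "prd"}


def catalog_name(environment, project):
    """Return '<environment>_<project>' in lowercase with underscores only."""
    env = environment.strip().lower()
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")

    # Collapse anything that is not a letter or digit into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", project.strip().lower()).strip("_")
    if not slug:
        raise ValueError("project name must contain letters or digits")

    return f"{env}_{slug}"


if __name__ == "__main__":
    print(catalog_name("dev", "Finance"))
    print(catalog_name("stg", "operations"))
```

Rejecting unknown environments early stops near-duplicates such as stage_finance and stg_finance from creeping into the metastore.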
Keep in mind the Chinese menu example when organizing your data. You want people to be able to quickly identify where to go to get information. The more logical your catalog structure, the more likely someone will quickly find the data needed. Adoption will falter and even fail if you make it too difficult to find that data.