Slash costs and time with a Databricks-powered data pipeline


A California-based company that assembles and enhances law enforcement data for agencies across the country was experiencing significant bottlenecks when onboarding new customers to its data and analytics platform. These ongoing issues threatened the company’s ability to scale to meet growth goals and disseminate valuable data that could help law enforcement agencies solve and prevent more crimes. Blueprint optimized the company’s data pipeline to reduce time-to-insight for new clients, cut overhead by modernizing its compute environment, and built a machine learning model that vastly improved the searchability of the company’s crime data.

Our work

6–8 Weeks
Onboarding time before Blueprint
8 Hours
Onboarding time after Blueprint
50%
Data processing cost savings

The problem

Headquartered in California, this company’s goal is to provide law enforcement agencies across the country with access to crime data to keep their communities safer. Law enforcement customers upload their data to the company’s vast database of crime information from other agencies around the country, broadening their search and analysis capabilities so they can solve crimes in their jurisdictions more quickly.

Due to the company’s suboptimal data pipeline, however, it took between six and eight weeks to ingest a potential new customer’s data and demonstrate the capabilities of the platform. This threatened the company’s growth goals and its mission to unlock and distribute valuable crime data to solve and prevent crimes across the country.

The Blueprint Way

The company initially reached out to Databricks to develop a proof of concept to rearchitect its data pipeline. Databricks completed an initial POC and partnered with Blueprint to demonstrate that a new data pipeline powered by Databricks would both reduce the time needed to onboard new clients and reduce compute costs for the company.

As partner-led organizations, Blueprint and Databricks work closely to move POCs into production. Blueprint’s extensive implementation experience has informed the development of frameworks that support rapid environment assessment, heading off future issues and capturing the specific knowledge needed to create a comprehensive roadmap to production.

Blueprint identified a portion of the company’s architecture that depended on complex Java ETL and data enrichment tasks running on expensive virtual machines. This piece of the pipeline was significantly slowing down the onboarding process for new and potential clients. By transitioning the company to a modern, Spark-based data ingestion pipeline with Databricks at its core, Blueprint reduced onboarding time from six to eight weeks down to eight hours and cut the associated data processing costs by 50%.
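The case study doesn’t detail the pipeline internals, but the core of any such onboarding step is mapping each agency’s own data layout onto a shared schema. The sketch below illustrates that idea in plain Python; in the actual Databricks pipeline the equivalent transformation would run as Spark code over millions of records, and every field name here is an invented example, not the company’s real schema.

```python
# Illustrative sketch only: normalizing heterogeneous agency incident
# records into one canonical schema -- the kind of transformation a
# Spark-based ingestion pipeline applies at scale. All field names
# here are assumptions for illustration.
from datetime import datetime

def normalize_record(raw: dict, field_map: dict, agency_id: str) -> dict:
    """Map an agency-specific record onto the canonical schema."""
    record = {"agency_id": agency_id}
    for canonical, source in field_map.items():
        record[canonical] = raw.get(source)
    # Parse the agency's date format into a uniform ISO date.
    record["occurred_at"] = datetime.strptime(
        record["occurred_at"], "%m/%d/%Y"
    ).date().isoformat()
    return record

# Each onboarded agency supplies data in its own layout; a per-agency
# field map is all that's needed to fold it into the shared schema.
agency_a_map = {
    "incident_id": "case_no",
    "occurred_at": "date",
    "description": "narrative",
}

raw = {"case_no": "A-1001", "date": "03/15/2023",
       "narrative": "attempted burglary"}
print(normalize_record(raw, agency_a_map, agency_id="agency-a"))
```

The design point is that onboarding a new agency then means writing one small field map rather than new ETL code, which is how a rearchitected pipeline can collapse weeks of custom integration work into hours.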


While working on this project, Blueprint recognized a further challenge the company was facing with its data – one that impacted the searchability of crime reports. If a customer wanted to search for all thefts, the crime reports had to contain the word theft. This forced users to run multiple searches using different phrases – attempted burglary, breaking and entering, and so on. Adding another layer of complexity, reports often wouldn’t name the type of crime at all; instead, they’d contain only the corresponding local code ID for the crime.

Data science machine learning model

In three weeks, Blueprint’s data science team created an initial machine learning model that delivered 15% accuracy when running searches. To demonstrate the power of machine learning, Blueprint then found an enormous external data set – more than 500 million rows – tying keywords to crime codes. After applying natural language processing and marrying that data to the crime reports, Blueprint delivered, in just two weeks, a model that increased search accuracy to 72%.
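The model itself isn’t described beyond its inputs, but the underlying idea – mapping free-text crime descriptions and local code IDs onto canonical categories so that a single search covers every phrasing – can be sketched as follows. The categories, keywords, and code mappings below are invented for illustration; the real system used a 500-million-row keyword-to-code data set and NLP rather than a hand-built lookup.

```python
# Illustrative sketch only: mapping crime-report text onto canonical
# categories so one search matches every phrasing and local code.
# All categories, keywords, and codes below are invented examples.

# Keyword-to-category pairs, standing in for the large external
# data set that tied keywords to crime codes.
KEYWORD_CATEGORIES = {
    "theft": "THEFT",
    "burglary": "THEFT",
    "breaking and entering": "THEFT",
    "larceny": "THEFT",
    "assault": "ASSAULT",
    "battery": "ASSAULT",
}

# Hypothetical local numeric code IDs that some reports use
# instead of naming the crime.
LOCAL_CODE_CATEGORIES = {
    "459": "THEFT",
    "240": "ASSAULT",
}

def categorize(report_text: str) -> set:
    """Return every canonical category a report's text maps to."""
    text = report_text.lower()
    # Match crime-name keywords anywhere in the text.
    categories = {cat for kw, cat in KEYWORD_CATEGORIES.items() if kw in text}
    # Match bare local code IDs as standalone tokens.
    categories |= {cat for code, cat in LOCAL_CODE_CATEGORIES.items()
                   if code in text.split()}
    return categories

# One search for THEFT now matches reports phrased three different ways.
for report in ("Attempted burglary at warehouse",
               "Larceny reported by store owner",
               "Code 459 at residence"):
    assert "THEFT" in categorize(report)
```

With a learned model in place of this lookup, a user searching for “theft” retrieves reports phrased as burglary, larceny, or just a local code – which is the searchability gap the 15%-to-72% improvement closed.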

“The difference between a good data science team and someone who just copies code from Stack Overflow is where you go from there,” Blueprint’s Data Science Practice Lead said. “If you can Google ‘training a classification model,’ you can technically do this. But can you go from 15% to 72%? That’s where the data science comes in.”

