Bobby Huang
Author Bobby Huang

Elastic: JSON Aggregation in Power BI

August 22 2019 | Conduit

Data management teams spend hours and hours of time developing in-house connectors, managing security permissions, and writing a ton of python code, only to have to start everything over when something changes. Don’t get caught in this costly trap.

If you’ve worked with NoSQL databases, you’ll know the pain of querying and aggregating nested JSON data. Like many NoSQL data sources, JSON data looks more akin to a family tree than an excel table. This makes calculating data points more difficult.

For example, if you want to compare the age of your grandfather’s second cousin’s youngest child against the age of your parents, the query would need to include every family member in the tree’s descending order until you reach that child in the hierarchy. Whereas in a structed relational table, you’d simply filter on the second cousin relationships and view the ages in the age field.

Relational Table

Name Age Relationship to you
Bob 82 Grandfather
Barb 78 Grandmother
Tom 56 Father
Tammy 55 Mother
Ann 84 Grandfather's 2nd Cousin
John 86 Grandfather's 2nd Cousin husband
Ed 59 Grandfather's 2nd Cousin's Son
Alice 60 Grandfather's 2nd Cousin's daughter

JSON Example

{
    "name": "bob",
    "age": "82", 
    "relationship_to_you": "Grandfather",
    "wife": {
    "wife_name": "barb",
    "age": "78",
    "relationship_to_you": "Grandmother",
  },
  "Decendants": {
    "first_child_name": "tom",
    "age": "56",
    "relationship_to_you”: “father”,
    “wife”: {
    “wife_name”: “tammy”
    “age”: “55”
    “relationship_to_you”: “mother"
    }
}

… eventually -> “Grandfather's 2nd Cousin” … and on and on },

Recently, we’ve received several requests from BI and data science teams looking to connect Power BI to NoSQL data sources. This blog post highlights one of the more unique use cases and shows how Conduit was able to resolve a client’s data connectivity challenge.

The Scenario

A customer who builds applications within a large automotive company recently approached us in search of a solution that could connect Power BI to Elastic. Our customer needed to leverage sales and inventory data to drive outbound marketing campaigns. They tried several data virtualization products but found them inadequate or too costly to meet their needs.

The automotive company had marketing, sales, and inventory source data stored in several SQL and NoSQL databases. Our customer chose Elastic as part of the solution due to Elastic’s scalability and performance as a highly-available distributed system. The customer aggregated the data and built a replication layer from which to run reporting tools. Data was stored as JSON objects because of JSON’s flexible schema definition, but JSON is  impractical to use as a data source for analytics because of the nested and loosely-defined nature of the schema. This issue surfaced when the data science team wanted to create marketing optimization models by connecting Elastic to Power BI; Elastic simply can’t connect to Power BI.

Before Conduit

The team eventually found a cumbersome work around by writing custom PySpark scripts to perform their aggregations, but this was time consuming and quickly grew complex due to the necessary rounds of testing and continuous configuration.

A common practice is to avoid connecting visualization tools directly to source data. Reasons include:

  • It opens the original data source to manipulation
  • It can be difficult to track resultant issues with the data source
  • Performance impact due to opening up the data source to queries from business users, a concern if the data source represents the source of truth                      
  • Difficult to monitor access to the data source when its connection credentials can be passed around by business users that are building reports.

After Conduit

Unlike some of the heavier and more costly solutions, Conduit was able to be installed in their environment, an Elastic Search connector was created, and DirectQuery in PowerBI to the JSON files was completed in less than an hour.

Conduit simplifies access to arbitrarily structured JSON data wherever it may be stored – ElasticSearch, Azure Blob Storage, AWS S3. This is accomplished through several important steps. First, Conduit reads the original data source into the cache. Second, Conduit determines which fields are relevant based on the fields indicated in the query and limits the subsequent query operation steps to only these fields. This operation – dynamic column pruning – is used to optimize the performance of the query and facilitate efficient queries on much larger JSON files. Third, the necessary fields are flattened to provide access and enable aggregations on deeply nested values. Finally, the SQL-like query operation is performed in Conduits SQL engine and returned to the user in the form of output to their visualization tool that made the query in the first place.

This effectively “flattens out” the nested JSON files and makes them readable in Power BI! It’s pretty cool.

Takeaways

NoSQL data aggregation is hard. We repeatedly observe data management teams spend hours and hours of time developing in-house connectors, managing security permissions, and writing a ton of python code to do a simple sum value, only to have to start everything over when something in the data changes. Don’t get caught in this costly trap. Conduit can bridge these gaps with best-in-class security and minimal implementation time. We offer dedicated customer success teams and white-glove installs to every client. If you have data connectivity challenges, simply fill the short form below, and we would be delighted to get the conversation started.

Get Conduit today

Get a Conduit demo or install today in your Azure environment. Unleash the power of your data now.