Author: Bobby Huang

Introducing Spark Clusters, Caching, and JSON Explosion in Conduit v1.5

October 17, 2019 | Product, Conduit

Conduit 1.5 adds powerful new features to drive better performance at greater scale than ever before.


The popularity of data virtualization continues to rise as a growing number of companies need to query, aggregate, and integrate large data sets. Traditional ETL methods prove cumbersome and time-consuming when analyzing large data sets that span multiple data sources. In response, we built our own data virtualization tool, one with a lightweight footprint that unlocks tremendous value for businesses. Take a look at some of the features in the latest release:

1.     Additional Spark cluster attachments - This feature matters most when querying and analyzing big data consisting of thousands of files or millions of records. It allows additional VMs/clusters to be attached to your processing so those large data sets can be analyzed faster. No more waiting for hours for a slow query to run.

2.     Caching for all data sources - Many users want to grab something directly out of source data. Folks who work and play with data know that source data should be left alone in most cases. Typically, companies do a big batch scrape of their data in the middle of the night when business is closed, so as not to impact operations. Caching minimizes the impact on source data while letting users access it at closer to real-time intervals: the data is held in memory for as long as you want that cache to last. If new users hit that table, they access the data stored in memory rather than the source data directly.

3.     S3 connector - Enhancements to the S3 connector allow for greater efficiency in hybrid join scenarios. You can now join a CSV file stored in AWS S3 with a Parquet file in Azure Blob Storage.

4.     JSON explosion - Those who have played with large nested JSON files understand that writing PySpark code to turn JSON data into relational, queryable data is extremely difficult. To solve this problem, we’ve embedded a feature inside of Conduit that allows you to seamlessly flatten JSON data. You can now query JSON data like a relational table with ease.
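
To give a feel for what "exploding" nested JSON into a relational shape involves, here is a minimal sketch in plain Python. It is not Conduit's implementation, and the function name `explode` and the sample order document are hypothetical, chosen only for illustration. The assumption is a common convention: nested objects become dotted column names, and each array produces one row per element.

```python
from itertools import product


def explode(value, prefix=""):
    """Flatten one nested JSON value into a list of flat row dicts.

    Hypothetical helper for illustration only.
    Nested objects become dotted column names; each array is expanded
    into one row per element (sibling arrays yield a cross product).
    """
    if isinstance(value, dict):
        # Flatten every field on its own, then combine the results.
        per_field = [
            explode(child, f"{prefix}.{key}" if prefix else key)
            for key, child in value.items()
        ]
        rows = []
        for combo in product(*per_field):
            row = {}
            for part in combo:
                row.update(part)
            rows.append(row)
        return rows
    if isinstance(value, list):
        if not value:
            # Keep the parent row even when the array is empty.
            return [{prefix: None}]
        rows = []
        for item in value:
            rows.extend(explode(item, prefix))
        return rows
    # Scalars (str, int, float, bool, None) map to a single column.
    return [{prefix: value}]


# Hypothetical sample document: one order with two line items.
order = {
    "order_id": 1,
    "customer": {"name": "Ada"},
    "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
}

rows = explode(order)
for row in rows:
    print(row)
```

Each resulting row has the same flat set of columns (`order_id`, `customer.name`, `items.sku`, `items.qty`), which is the shape a SQL-style query expects. A real engine would also have to handle schema inference and very large files, which this sketch leaves out.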

If you have any questions, want to talk data virtualization, or want to get on a free trial, just reach out. We’re more than happy to help.  

 

Get Conduit today

Get a Conduit demo or install today in your Azure environment. Unleash the power of your data now.