Performance is a defining factor in an organization's data estate strategy.
Open-source lakehouse platforms have emerged as a powerful catalyst for data-driven innovation, offering organizations a cost-effective, flexible, and scalable solution to manage and analyze diverse data. By leveraging open-source technologies, businesses benefit from a global developer community’s collective expertise and contributions, ensuring continuous improvement, feature enhancements, and up-to-date security measures. Open-source lakehouse platforms enable seamless integration with a wide array of advanced analytics and machine-learning tools, allowing organizations to extract valuable insights and make data-driven decisions at an accelerated pace. Embracing open-source lakehouse platforms facilitates an agile, adaptable approach to data management and analytics, empowering companies to stay ahead in a highly competitive and dynamic business environment.
In my experience, organizations of all sizes conceptually understand the positive impact of embracing open-source platforms, but adoption depends directly on how clearly they understand the way a proposed platform will perform in production. Rational business leaders want to avoid frantic phone calls at two in the morning informing them that data workloads in production have slowed to a crawl and deadlines are at risk.
In the world of open-source lakehouse platforms, three have risen in popularity: Apache Hudi, Apache Iceberg, and the Linux Foundation’s Delta Lake. Each has well-documented adoption and continued contributions by large organizations. Uber initially developed Hudi, Iceberg started at Netflix, and Delta Lake originated at Databricks. Each has its benefits and deserves a deeper evaluation of its capabilities as they pertain to an individual organization’s needs. However, for this article, I’m going to focus on one criterion: performance.
Performance measurement requires one or more benchmarks. For example, Dodge recently unveiled its new Challenger SRT Demon 170, a beast of a modern muscle car with a 0-60 MPH time (benchmark) of 1.66 seconds and a quarter mile time (benchmark) of 8.91 seconds. If you’re Vin Diesel and living life a quarter mile at a time, that’s fast living. Compare that performance against the Toyota Corolla, one of the best-selling cars in 2022. The Toyota Corolla moves from 0-60 MPH in 8.6 seconds and achieves the quarter mile in 16.85 seconds. The numbers don’t lie; the winner is clear if performance is the critical benchmark.
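To make "how much faster" concrete, here is a minimal sketch that turns the times quoted above into speedup ratios, the same "Nx faster" form used for the lakehouse results later in this article. The `speedup` helper is my own illustrative name, not part of any benchmark tooling.

```python
def speedup(baseline_seconds: float, contender_seconds: float) -> float:
    """Return how many times faster the contender is than the baseline."""
    if baseline_seconds <= 0 or contender_seconds <= 0:
        raise ValueError("times must be positive")
    return baseline_seconds / contender_seconds


# Times quoted in the paragraph above, in seconds.
corolla = {"0-60 mph": 8.6, "quarter mile": 16.85}
demon = {"0-60 mph": 1.66, "quarter mile": 8.91}

for test, corolla_time in corolla.items():
    ratio = speedup(corolla_time, demon[test])
    print(f"{test}: Demon is {ratio:.1f}x faster")
# 0-60 mph: Demon is 5.2x faster
# quarter mile: Demon is 1.9x faster
```

Note that the ratio depends on which benchmark you pick, which is why a published specification matters.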
In the world of technology industry benchmarks, there are a few key criteria:
- The benchmark specification is published and accessible
- The code used to perform the benchmark tests is open and accessible
- Any datasets used in the benchmark tests are publicly available
For industry acceptance, a non-profit organization must consistently develop, evaluate, and refine the benchmarks. The organization should have a diverse group of members and affiliates representing the interests of customers and the industry.
Lakehouse platforms: Benchmark drift
Lakehouse platforms, which combine the best features of data lakes and data warehouses, are typically benchmarked using a combination of industry-standard benchmarks and custom benchmarks to assess their performance, scalability, and efficiency.
The TPC-DS benchmark, introduced in 2011 by the Transaction Processing Performance Council (TPC), was designed to evaluate the performance of modern, large-scale data warehousing systems. It followed the earlier TPC-H benchmark, which remains in use, and introduced a more diverse and complex set of queries to represent real-world decision support and business intelligence workloads. The history of the TPC and the details of the TPC-DS benchmark are beyond the scope of this article, but here is an excellent primer on both.
A team from UC Berkeley recently open-sourced a lakehouse benchmark that adapts the TPC-DS data warehouse benchmark specification to a lakehouse setting. Named LHBench, the benchmark consists of four tests (including the aforementioned TPC-DS), with the source code available in a public GitHub repo and the raw TPC-DS test dataset provided as Apache Parquet files in an AWS S3 bucket.
Fate of the Furious Three
With a benchmark specification, open-source code to perform the benchmark, and a dataset with which to perform the benchmark, a team set out to compare the performance of Hudi, Iceberg, and Delta Lake. They ran all tests using Apache Spark on AWS EMR 6.9.0, storing data in AWS S3, with Delta Lake 2.2.0, Hudi 0.12.0, and Iceberg 1.1.0. They shared their findings in a white paper titled Analyzing and Comparing Lakehouse Storage Systems, released at the 2023 Conference on Innovative Data Systems Research (CIDR). The white paper contains an incredible amount of detail, and I encourage anyone reading this article to read it in its entirety.
In every test performed in the benchmark, Delta Lake was faster than Hudi and Iceberg.
For example, in data load performance from the TPC-DS benchmark, Delta Lake was slightly faster than Iceberg and nearly 10x faster than Hudi.
In query performance from the TPC-DS benchmark, Delta Lake was 1.4x faster than Hudi and 1.7x faster than Iceberg.
The load and query performance differences can be attributed to several factors including:
- Delta Lake’s more efficient columnar compression and lower overhead for large table scans due to larger file sizes, which results in fewer files to read compared to Hudi
- Delta Lake’s utilization of the default Spark reader, which performs better than the custom-built Parquet reader used by Iceberg
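The first factor above, larger files meaning fewer files to read, comes down to simple arithmetic. The sketch below uses made-up sizes that I chose purely for illustration; they are not the defaults of Delta Lake or Hudi and do not come from the white paper.

```python
def files_to_scan(table_bytes: int, target_file_bytes: int) -> int:
    """Number of files needed to hold a table at a given target file size."""
    if table_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative and file size positive")
    # Integer ceiling division: a partially filled last file still counts.
    return -(-table_bytes // target_file_bytes)


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical table and file sizes, for illustration only.
table_size = 100 * GIB
for label, file_size in [("smaller files", 32 * MIB), ("larger files", 512 * MIB)]:
    count = files_to_scan(table_size, file_size)
    print(f"{label}: {count} files for a full table scan")
# smaller files: 3200 files for a full table scan
# larger files: 200 files for a full table scan
```

Every file adds listing, open, and metadata overhead on object storage, so the same data spread over fewer files tends to scan faster, all else being equal.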
Organizations must weigh a variety of considerations when choosing a lakehouse platform, such as data ingestion, metadata storage, and transaction coordination, all of which affect performance. If your organization is getting started on a lakehouse evaluation and performance is your primary deciding factor, I recommend running the LHBench benchmark against all three lakehouse platforms. If your organization is currently on Hudi or Iceberg and considering a migration to an alternative lakehouse platform, I’d recommend Delta Lake.
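If you want a feel for what such a comparison involves before setting up LHBench, here is a minimal timing sketch. It is not LHBench and does not use its code: `time_runs` and `compare` are names I made up, and the workloads are stand-in functions where you would plug in the same query against each table format.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(workload: Callable[[], object], runs: int = 3) -> List[float]:
    """Run the workload `runs` times and return wall-clock seconds per run."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def compare(workloads: Dict[str, Callable[[], object]], runs: int = 3) -> Dict[str, float]:
    """Return the median run time in seconds for each named workload."""
    return {
        name: statistics.median(time_runs(workload, runs))
        for name, workload in workloads.items()
    }


# Placeholder workloads. In a real evaluation each entry would run the same
# query against a table stored in a different lakehouse format.
placeholders = {
    "format_a": lambda: sum(range(200_000)),
    "format_b": lambda: sum(range(400_000)),
}

results = compare(placeholders)
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: median {seconds:.4f}s over 3 runs")
```

Taking the median of several runs reduces the effect of one-off slowdowns such as a cold cache. For a decision you can defend, use the published benchmark and your own production-like data rather than a toy harness like this one.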