Whether to combine cloud storage and compute is an argument approaching the intensity of longstanding debates like Mac vs. PC or leasing vs. buying a car. The two approaches are radically different, and a case can be made for either.
At Blueprint we're not going to weigh in on the Mac vs. PC or leasing vs. buying questions – that's an argument for another day. But we tend to err on the side of separating storage and compute, and with good reason. Separating the two is a fundamental tenet of cloud computing; it is also more affordable, and it preserves flexibility and adaptability as technologies mature and change.
The idea behind combining cloud storage and compute is to simplify things for data managers while maintaining flexibility. In practice, combining them does the opposite: you lose the flexibility to work with different data sets and to adopt new, emerging compute engines, and you end up feeding more data through the compute engine – the most expensive part of operating in a cloud environment.
You can achieve affordability and flexibility without sacrificing simplicity.
Cloud data storage is just storage, and it should be thought of that way. It is inexpensive, fast, supports all data types and is supported by all cloud services, data-ingestion tools and apps. Cloud storage also keeps data in its native format, and it remains your data: you can take it wherever you want in the future.
We suggest keeping storage simple, cheap and distinct from compute by parking it in an Azure Data Lake. This lets you use any compute engine – we often recommend Databricks – and pay for compute only on the data sets you choose, and only while you are running analytics. When you park all your data in a warehouse that also runs your compute, you pay a steeper price, because compute runs against all your data rather than spinning up only the data you need, from an inexpensive storage location, for only as long as you need it.
It is simple: the more you reduce compute, the more you reduce cost.
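A rough back-of-the-envelope calculation shows why this matters. The sketch below compares the two models using hypothetical round-number rates – not actual Azure or Databricks pricing – purely to illustrate how decoupling compute from storage changes the bill:

```python
# All rates are illustrative placeholders, not real cloud prices.
LAKE_STORAGE_PER_TB_MONTH = 20.0   # assumed $/TB/month for data lake storage
WAREHOUSE_PER_TB_MONTH = 120.0     # assumed bundled $/TB/month when storage and compute are coupled
COMPUTE_PER_HOUR = 4.0             # assumed $/hour for an on-demand cluster

def lake_plus_on_demand(total_tb: float, cluster_hours: float) -> float:
    """Store everything cheaply; pay for compute only while analytics run."""
    return total_tb * LAKE_STORAGE_PER_TB_MONTH + cluster_hours * COMPUTE_PER_HOUR

def always_on_warehouse(total_tb: float) -> float:
    """Storage and compute are coupled, so the bundled rate applies to all data."""
    return total_tb * WAREHOUSE_PER_TB_MONTH

# 50 TB of data, 40 hours of analytics in a month:
print(lake_plus_on_demand(50, 40))   # 50*20 + 40*4 = 1160.0
print(always_on_warehouse(50))       # 50*120   = 6000.0
```

With these assumed numbers, the decoupled model costs a fraction of the bundled one because most of the 50 TB simply sits in cheap storage while compute runs only for the 40 hours it is actually needed.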
No company or organization should pay for resources it doesn't need – we are no longer in the age of monolithic platforms and the massive hardware spend once required to run data analysis. By leveraging the power of the data lake and coupling it with a compute engine like Databricks, you pay only for the services you need, when you need them.
Companies ingest, buy and own an immense amount of data. It may not have a purpose or use yet, and that is OK. If a company doesn't have an immediate use for its data, a data lake can tier that data to the cheapest possible storage level and re-tier it only when you decide it is needed for business intelligence.
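The tiering logic itself is straightforward. The sketch below models the kind of rule a storage lifecycle policy applies; the tier names match Azure Blob Storage's Hot/Cool/Archive tiers, but the thresholds are hypothetical – real policies are configured per account:

```python
from datetime import date

# Hypothetical thresholds; real lifecycle rules are set per storage account.
COOL_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 180

def pick_tier(last_accessed: date, today: date) -> str:
    """Choose a storage tier based on how long a data set has sat unused."""
    idle_days = (today - last_accessed).days
    if idle_days >= ARCHIVE_AFTER_DAYS:
        return "Archive"   # cheapest to hold, slowest to retrieve
    if idle_days >= COOL_AFTER_DAYS:
        return "Cool"      # cheaper storage, slightly higher access cost
    return "Hot"           # frequently used data stays immediately available

print(pick_tier(date(2024, 1, 1), today=date(2024, 1, 10)))   # Hot
print(pick_tier(date(2024, 1, 1), today=date(2024, 12, 1)))   # Archive
```

In practice you would not write this yourself – the cloud provider's lifecycle management applies rules like these automatically – but it shows why idle data costs almost nothing until you decide to use it.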
Separating storage and compute and using the data lake for your storage lets you better manage your team's experience, your data and your usage. Multiple compute resources can leverage the same data in the data lake, and because the data lives in one place, different users can interact with it in different ways: one person can work on machine learning with Spark while another runs reports on the same data set through a high-speed Power BI connector.
In a modern data estate built on the data lake, sources that historically were disparate – copied and moved around for different queries – can be viewed and queried holistically and simultaneously using the numerous tools and connectors available through platforms like Databricks. Not only is this a more affordable model; it also eliminates the wasted time and energy of moving data between platforms that perform different tasks. Speed is your friend when extracting insights from data, so don't waste time over-processing data when it isn't needed.
With Databricks Delta Lake, for example, you have one complete compute platform on top of your data, from which you can run BI-style queries, data engineering workloads in SQL or Python, and data science with any of the common frameworks – right where your data lives.
Taking it one step further, streaming data analytics represents the next frontier in unlocking insights from data. Embrace it! With storage and compute separate, the storage layer can collect data from streaming services and Databricks can process it easily, without running compute on your whole database at once. Leaders can start integrating streaming into their data estates now, so they are agile when more streaming sources – IoT devices and web, mobile and customer-experience platforms – become available.
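The reason streaming avoids recomputing over your whole data set is the incremental pattern it uses: each micro-batch of new events is folded into running results, and historical data is never re-scanned. This minimal pure-Python sketch illustrates the idea that engines like Spark Structured Streaming implement at scale (the device names and readings are made up for illustration):

```python
from collections import defaultdict

def process_micro_batch(running_totals, batch):
    """Fold one micro-batch of (device_id, reading) events into running totals.

    Only the new events are touched; data already summarized in
    running_totals is never re-scanned.
    """
    for device_id, reading in batch:
        running_totals[device_id] += reading
    return running_totals

totals = defaultdict(float)
process_micro_batch(totals, [("sensor-a", 2.0), ("sensor-b", 1.5)])
process_micro_batch(totals, [("sensor-a", 3.0)])
print(dict(totals))   # {'sensor-a': 5.0, 'sensor-b': 1.5}
```

Each arriving batch costs compute proportional to its own size, not to the size of everything collected so far – which is exactly why streaming over a data lake stays cheap as the lake grows.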
Because Databricks is so feature-rich in data engineering, data science and support for all major business intelligence tools, you should be asking yourself: "Why am I paying more, losing time making things more complicated, and moving data to yet another data store? Shouldn't I first learn what Databricks can do with the data I already have in my data lake?"