How Does The Data Lakehouse Enhance The Customer Data Stack?
Reading the title of this post will probably make you wonder how many buzzwords can fit in one sentence. A fair point, but it’s worth exploring how a customer data stack benefits from the data lakehouse. First, let’s clarify exactly what we mean by customer data stack and data lakehouse.
What is a customer data stack?
We talk a lot about the modern data stack, but it’s important to make a distinction here because customer data is special. It provides a unique value to the organization, and it comes with a unique set of technical challenges.
Customer data provides unique value because it’s the main source of behavioral information a business has about its customers. Anyone who has ever built a business will tell you that if you don’t know your customer, you’ll be out of business fast. This is especially true for modern online businesses where direct customer interactions are rare.
But, like all valuable things, customer data comes with a cost. Remember those unique technical challenges I mentioned? Let's take a look at the characteristics of customer data that make it difficult to work with.
- Customer data comes in large quantities. Modern commerce, driven by online interactions, generates a massive amount of data every day. Just ask anyone who's worked at a large B2C company like Uber, or even at a medium-sized e-commerce company: the sheer volume of data generated by even simple interactions is huge.
- Customer data is extremely noisy. The customer journey always involves many actions, but not every action holds value. The issue here is that there’s no way to know what’s valuable and what isn’t before analysis. Your best bet is to record everything and let your brilliant data analysts and data scientists shine.
- Customer data changes a lot. No one's behavior stays the same forever, right? Only dead people behave the same way across time. This means you need to keep accumulating large amounts of data, even though only some of it will be relevant at any given point in time.
- Customer data is a multidimensional time series. This sounds very scientific, but all it means is that time ordering is important and each data point carries multiple values rather than a single one (see the sketch after this list for what one such data point can look like). This adds to the complexity of the data and how you interact with it. You can read about how we implemented our queueing system using PostgreSQL if you'd like to go down a rabbit hole with this one.
- Finally, customer data can be pretty much anything, from highly structured to completely unstructured. An invoice issued at a specific time for a specific customer is customer data. So is an interaction that same customer has on your website. Even a picture of that customer, taken as part of a verification process, is customer data.
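To make the "multidimensional time series" point concrete, here's a minimal sketch of what a single data point in a customer event stream can look like. The field names and values are hypothetical, made up for the example.

```python
# One hypothetical "data point" in a customer event stream: it has a
# timestamp (time ordering matters) and many dimensions, not one value.
event = {
    "timestamp": "2023-04-01T12:34:56Z",  # position in the time series
    "anonymous_id": "a1b2c3",             # who performed the action
    "event": "product_viewed",            # what happened
    "properties": {                       # the remaining dimensions
        "product_id": "sku-123",
        "price": 49.99,
        "referrer": "email-campaign-7",
    },
}
```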
I bring all of this up to convince you that working with customer data is important, and it’s challenging enough to require some unique choices to be made when you build your data stack. Thus, we have the term Customer Data Stack.
TL;DR: a customer data stack is a complete data stack (modern or not, it doesn't matter) that allows you to capture, store, and process customer data at scale.
What is a Data Lakehouse?
If you've been paying attention lately, you've heard a lot of buzz around the terms data lake and data lakehouse. The data lakehouse is even younger than the data lake, which, to be honest, is pretty old! Humans started building data lakes with the inception of HDFS and Hadoop. Yes, that old. A distributed file system and MapReduce are all it takes to build a data lake.
Obviously, many things have changed since the early 2000s when it comes to data technology. The term data lake has since been formalized, and the lakehouse is the new kid on the block.
But, before we get into any more details, let’s make something clear. When we talk about data lakes and lakehouses, we mainly refer to an architecture and not to a specific technology. It’s important to keep this in mind because a lot of the confusion around these two terms comes from this misconception.
Data Lake
Let’s start with the definition of a data lake. Getting clear on this first will help us understand the data lakehouse.
The main concept behind a data lake is the separation of storage and compute. Sounds familiar, right? Snowflake talks about this a lot, and so does Databricks. But, as I mentioned at the beginning of this section, the idea goes back to HDFS and Hadoop: the first data lakes used HDFS as a distributed file system and MapReduce as a processing framework, separating storage (HDFS) from processing (Hadoop MapReduce). This is the fundamental concept of a data lake.
Today, when we talk about a data lake, the first thing that comes to mind is S3, which replaces HDFS as the storage layer and moves storage to the cloud. Of course, instead of S3 we could use the equivalent products from GCP or Azure, but the idea remains the same: an extremely scalable object storage system that sits behind an API and lives in the cloud.
Processing has also evolved since Hadoop. First came Spark, which offered a more user-friendly API than raw MapReduce, and then distributed query engines like Trino. These two frameworks usually co-exist, addressing different needs: Trino is mainly used for interactive analytical queries where latency matters, while Spark is heavily used for bigger workloads (think ETL) where the volume of data is much larger and latency matters less.
So: S3 for storage, a format like Parquet for the data files, and Trino or Spark for processing gives us a lean but capable data lake.
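To make this concrete, here's a minimal sketch of querying such a lean data lake with PySpark. The bucket, paths, and column names are hypothetical, and it assumes a Spark environment already configured with the S3 connector and credentials.

```python
from pyspark.sql import SparkSession, functions as F

# Spark session; assumes the S3 connector (hadoop-aws) and credentials
# are already set up in your environment.
spark = SparkSession.builder.appName("lean-data-lake-sketch").getOrCreate()

# Read Parquet event files straight out of object storage.
# The bucket and column names are hypothetical placeholders.
events = spark.read.parquet("s3a://acme-data-lake/events/")

# A typical analytical query: daily event counts per event type.
daily_counts = (
    events
    .withColumn("event_date", F.to_date("received_at"))
    .groupBy("event_date", "event_type")
    .count()
    .orderBy("event_date")
)

daily_counts.show()
```

Notice that the engine needs nothing more than a path into object storage: storage and compute stay fully separated.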
This architecture is good on paper and scales amazingly well, but it lacks a number of functionalities commonly found in data warehouses and transactional databases. For example, we haven't said anything about transactions. This is the reason the lakehouse exists.
Data Lakehouse
A lakehouse is an architecture that builds on top of the data lake concept and enhances it with functionality commonly found in database systems. The limitations of the data lake led to the emergence of a number of technologies, including Apache Iceberg, Apache Hudi, and Delta Lake. These technologies define a table format on top of storage formats like ORC and Parquet, on which additional functionality like transactions can be built.
So, what is a data lakehouse? It’s an architecture that combines the critical data management features of a data warehouse with the openness and scalability of the data lake.
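As an illustration of those data management features, here's a hedged sketch using Apache Iceberg with Spark SQL. The catalog name, warehouse bucket, and table schema are hypothetical, and it assumes the matching iceberg-spark-runtime package is on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Configure an Iceberg catalog (named "lake" here) backed by object
# storage. The warehouse bucket is a hypothetical placeholder.
spark = (
    SparkSession.builder
    .appName("lakehouse-iceberg-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://acme-lakehouse/warehouse")
    .getOrCreate()
)

# An Iceberg table: Parquet files underneath, plus table-format metadata.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.crm.customers (
        customer_id BIGINT,
        email       STRING,
        updated_at  TIMESTAMP
    ) USING iceberg
""")

# Each statement below commits atomically. Row-level updates like this
# are exactly what a bare directory of Parquet files can't give you.
spark.sql("INSERT INTO lake.crm.customers VALUES (42, 'jane@example.com', current_timestamp())")
spark.sql("UPDATE lake.crm.customers SET email = 'jane@new-domain.com' WHERE customer_id = 42")
```

The files on S3 are still open Parquet; the table format layers transactional metadata on top of them.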
Data Lakehouse as the foundation of a Customer Data Stack
Now that we’ve covered the definitions and utility of the customer data stack and data lakehouse, I’ll make a case for leveraging the data lakehouse in the customer data stack.
When building a data stack, one of the most important and impactful decisions you'll make is the choice of storage and processing layer. In most cases, this is a data warehouse like Redshift or Snowflake. Here are the main benefits of using a lakehouse instead when dealing with customer data:
- Cheap, scalable storage. Because of the characteristics of customer data we covered above, you'll need a technology that can store it at scale while keeping costs as low as possible. The only architecture that offers this efficiently is a lakehouse (or data lake), which lets you build on top of the cheap infrastructure offered by a cloud object storage service. None of the cloud warehouse solutions can offer that, especially if you don't want to maintain a pruning policy for older data.
- Support for every format. The nature of customer data means you might have to store and process completely heterogeneous data, ranging from structured tabular data to binary data. Cloud data warehouses, though they've made progress toward supporting structured and semi-structured data, still have a hard time supporting every possible data format. Data lakes and lakehouses don't suffer from this limitation; they offer a future-proof option capable of supporting every format we currently use.
- Hybrid workloads. Another big consideration is that customer data usually results in hybrid workloads. What I mean is that you'll want to do operational analytics on your customer data, and soon you'll need to incorporate more sophisticated ML techniques for things like churn prediction and attribution (if you aren't already); see the sketch after this list. Data lakes and lakehouses are the best fit for covering both workloads, and they do it natively. Cloud data warehouses now offer some ML functionality, but it's limited compared to what you can do on a data lake or lakehouse, making the lakehouse pretty much the de facto storage and processing solution when data science and ML projects are involved. A single solution that serves all these use cases relieves companies from having to run both a cloud data warehouse and a data lake, saving cost and operational complexity.
- Lakehouses are open. Unlike the proprietary, closed cloud data warehouses on the market right now, lakehouses are open. That means there's an ever-evolving ecosystem of technologies and products that can supplement the data lake/lakehouse, allowing the operating team to be extremely agile when it comes to maintaining and extending the customer data stack.
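Here's the hybrid-workload sketch promised above: one (hypothetical) customer table in the lake feeds both an analytical aggregation and a simple churn model. The column names, the 0/1 `churned` label, and the feature set are illustrative assumptions, not a recommended model.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("hybrid-workloads-sketch").getOrCreate()

# One table in the lake feeds both workloads. Path, columns, and the
# numeric 0/1 "churned" label are hypothetical.
customers = spark.read.parquet("s3a://acme-data-lake/customer_features/")

# 1) Operational analytics: churn rate per acquisition channel.
customers.groupBy("channel").agg(F.avg("churned").alias("churn_rate")).show()

# 2) Machine learning: a simple churn-prediction model on the same data.
assembled = VectorAssembler(
    inputCols=["days_since_last_order", "order_count", "avg_order_value"],
    outputCol="features",
).transform(customers)

model = LogisticRegression(labelCol="churned", featuresCol="features").fit(assembled)
print(model.coefficients)
```

Nothing here needed a second copy of the data: the analytics query and the training job read the same files in place.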
Final thoughts
If you're still with us, you hopefully have a better understanding of why we need and use all these buzzwords and, most importantly, how they fit together.
Data lakes and lakehouses are quickly becoming fully-featured data warehouses with infinite scale and low cost. Customer data is a natural fit for such a storage and processing layer. These architectures will allow you to support any possible use case you might have in the future, even if all you care about today is operational analytics.
Here at RudderStack, we are strong supporters of the data lake and lakehouse architectures. We have invested our resources to build best-in-class integrations with both data lakes and lakehouses like Delta Lake. By using RudderStack together with a Lakehouse or data lake, you can have a complete customer data stack that scales amazingly well and allows you to do anything you want with your data.