Zero copy CDP vs. warehouse native CDP architecture
The principle of zero-copy systems is over half a century old. It was originally intended to increase CPU performance by decreasing the number of times a file needed to be replicated in the process of being transferred (IBM’s 1964 OS/360 system used zero-copy principles). Today, zero copy architecture is the latest term causing confusion in the CDP world.
At the 2023 Signal Conference, Segment used zero copy to describe an upcoming product feature that provides a way to use data without copying it into other systems. However, Segment hasn’t released the product yet, and we don’t have any details beyond this quote from their product team:
“Your accounts table can stay in the data warehouse for maximum privacy and security. We don’t have your account IDs floating around Segment and in our systems. It stays where the source of the data is.”
Since Signal, other CDP vendors have started promoting zero copy and even zero ETL. Luke Ambrosetti of Snowflake proposes two definitions for zero copy:
- The CDP vendor is not persisting the customer’s data in their infrastructure
- The customer doesn’t have to copy the data out of the vendor’s platform to query it—also known as data sharing
Zero copy may be a step forward because it lets you keep some of your data from being ingested by a third-party, black-box SaaS tool. Still, the fundamental limitations remain because part of your customer data stays siloed in a system you don’t control. At the end of the day, cloud SaaS platforms create value by leveraging proprietary data models and data processing that happens in their own cloud environments.
If you want the data collection and unification benefits provided by a CDP with the control, flexibility, and scalability you expect from modern data tooling, you need the truly open architecture of a warehouse native approach.
How would zero copy work in a CDP that stores data?
Segment is promoting zero copy as a way for their customers to use data that lives in their warehouse without having to copy it into Segment’s cloud. It’s a move doubtless aimed at addressing security, compliance, and cost concerns.
The most likely technical approach will leverage a similar flow to Segment’s 5-year-old, and largely disappointing, SQL Traits feature. SQL Traits allows you to cherry-pick data points you want from the warehouse without loading the entire user data set, which could include sensitive IDs and PII.
It’s also likely that the feature will give users more flexibility to represent key data points, as opposed to literally ‘importing’ user and account traits with a SQL query. A simple example would be mapping account size classification (small business, mid-market, and enterprise) based on rules without pulling in the actual employee count number from the warehouse.
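To make the account size example concrete, here is a minimal sketch of deriving a tier inside the warehouse so that only the classification, never the raw employee count, is handed to a downstream tool. It is not Segment’s implementation, which hasn’t been published. An in-memory SQLite database stands in for the warehouse, and the table name, column names, and tier thresholds are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_key TEXT, employee_count INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("a1", 12), ("a2", 450), ("a3", 8000)],
)

# The thresholds (100 and 1000 employees) are illustrative, not taken from any vendor.
SIZE_TIER_QUERY = """
SELECT account_key,
       CASE
           WHEN employee_count < 100 THEN 'small business'
           WHEN employee_count < 1000 THEN 'mid-market'
           ELSE 'enterprise'
       END AS size_tier
FROM accounts
ORDER BY account_key
"""


def derived_size_tiers(connection):
    """Return only the derived tier per account; employee_count is not in the result."""
    return dict(connection.execute(SIZE_TIER_QUERY).fetchall())


tiers = derived_size_tiers(conn)
print(tiers)
```

The rule runs where the data lives, and the result set carries a coarse label instead of the sensitive number it was derived from.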
What’s the key difference between zero copy and warehouse native?
The very reason a legacy SaaS CDP would implement a zero copy approach exposes its fundamental limitations. Legacy CDPs built their businesses on storing data in a closed cloud environment, but as data teams increasingly leverage the warehouse for their single source of truth, it’s now necessary for these legacy CDPs to find a way to incorporate data from the warehouse inside their closed systems. This need, combined with increasing security concerns, sparked the use of zero copy terminology – and the anticipated rollout of zero copy features – in the CDP space.
Warehouse native architecture solves this problem at a foundational level, not a feature level. It makes the warehouse itself the foundation of your CDP. With this approach, you don’t just give the CDP access to the data in your warehouse–you build your CDP on top of your warehouse. Ironically, the warehouse native approach is also “zero copy” because no copying is required—the data is ingested into, stored and modeled in, and activated out of your warehouse.
If you’re considering a CDP with zero copy architecture, here are a few questions you’ll want to answer, keeping in mind your current and aspirational goals for utilizing your customer data.
Questions to ask CDPs about their “zero copy” architecture
Is the first-party data collected directly by the CDP also zero copy, or does the CDP keep a copy of it?
If your CDP still keeps a copy of the first-party behavioral data you use it to collect, it significantly increases your exposure to regulatory risk. It doesn’t matter if zero copy allows your CDP to access the accounts table in your warehouse in a privacy-compliant manner. Entrusting your valuable first-party data to a third party is an unnecessary exposure, and it makes meeting elevated consumer expectations about personal data privacy harder.
Moreover, first-party data often includes PII, necessitating additional work from developers and data teams to meet internal and external requirements for which systems can access which data. This work is extremely challenging when working with a closed cloud environment.
Does the CDP offer flexibility around how that data is modeled, or does it enforce a strict data model of users and accounts?
If your CDP forces you to use their data model, it will be nearly impossible for you to replicate your custom business logic in the cloud. Zero copy architecture doesn’t solve this problem.
Modern business needs are complex and transcend the typical CDP user/account data model. Today, leading teams overcome these limitations by customizing business objects and logic to their specific needs in their own warehouse.
Even if a CDP with zero copy functionality can access data in the warehouse, if it still forces you to abide by its data model in the cloud, you’re still subject to the inherent limitations of the CDP’s data model.
This lack of support for custom data modeling is often a deal-breaker for businesses that need to model entities like households, products, employees, tickets, and IoT devices.
Does the CDP support custom identities and custom rules for ID stitching?
If your CDP relies on its own data model, zero copy functionality won’t overcome identity resolution limitations.
You probably have multiple identifiers for individual users that must be stitched together to create a single view of the user. For example, you could have a mobile device ID, cookie ID, app user ID, email, phone number, name, home address, CRM ID, and service ID all for one user.
The challenge here is that not every ID is created equal. Cookie IDs and device IDs can be matched directly, but phone numbers must be cleaned up (delimiters removed), and addresses can only be fuzzy matched along with other IDs.
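As a rough illustration of why custom stitching rules matter, the sketch below normalizes each identifier by type and then groups records that share any normalized ID using a union-find structure. The normalization rules, record shapes, and function names are assumptions for this example, not any vendor’s algorithm. Fuzzy address matching is left out.

```python
import re
from collections import defaultdict


def normalize(id_type, value):
    """Apply per-type cleanup before matching; these rules are illustrative."""
    if id_type == "phone":
        return re.sub(r"\D", "", value)  # strip delimiters: spaces, dots, dashes, parentheses
    if id_type == "email":
        return value.strip().lower()
    return value  # cookie and device IDs are matched exactly as given


def stitch(records):
    """Group record indices that share any normalized identifier (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups fast
            i = parent[i]
        return i

    first_seen = {}
    for idx, record in enumerate(records):
        for id_type, value in record.items():
            key = (id_type, normalize(id_type, value))
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])  # merge the two groups
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


records = [
    {"cookie": "c-123", "email": "Ana@Example.com"},
    {"phone": "(555) 010-2000", "email": "ana@example.com"},
    {"phone": "555.010.2000", "device": "d-9"},
    {"cookie": "c-777"},
]
profiles = stitch(records)
print(profiles)
```

The first three records collapse into one profile only because the email and phone values are cleaned before comparison. With exact matching alone, they would remain three separate users.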
CDPs that rely on their own data model have inherent limitations to their identity resolution capabilities that zero copy does not solve. Even if their zero copy architecture can incorporate IDs from the warehouse, lack of support for custom identities and stitching means you’re still stuck with an incomplete view of the customer in the CDP.
Does the CDP let you define complex features leveraging the full power of a language like SQL or Python, or are you stuck with their user interface?
Warehouse native architecture lets you leverage the full power of SQL, Python and machine learning tools to model your data and define traits on your entire portfolio of customer data.
Some legacy CDPs may allow you to create custom traits, or even query your warehouse on a limited basis (i.e., “zero copy” traits), but they typically offer this capability through a limited UI only. If you want code-based power and flexibility to derive the full value from your customer data, you’ll need a warehouse native data tool that provides UI and code-based control. This enables your data team to utilize the full power of modern warehouse compute capabilities, unlocking use cases at every level of complexity.
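For a sense of what code-based trait definitions can look like, here is a small sketch that defines three user traits in SQL and runs them from Python. SQLite again stands in for the warehouse, and the `orders` table, its columns, and the trait names are hypothetical. A real warehouse would use its own SQL dialect and date functions.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("u1", "2024-01-05", 40.0),
        ("u1", "2024-03-01", 60.0),
        ("u2", "2023-11-20", 25.0),
    ],
)

# Each selected column after user_id is one trait; adding a trait means adding a line of SQL.
TRAIT_QUERY = """
SELECT user_id,
       COUNT(*) AS order_count,
       SUM(amount) AS lifetime_spend,
       CAST(julianday(:as_of) - julianday(MAX(order_date)) AS INTEGER)
           AS days_since_last_order
FROM orders
GROUP BY user_id
ORDER BY user_id
"""


def compute_traits(connection, as_of):
    """Run the trait query and return {user_id: {trait_name: value}}."""
    cursor = connection.execute(TRAIT_QUERY, {"as_of": as_of})
    columns = [col[0] for col in cursor.description]
    return {row[0]: dict(zip(columns[1:], row[1:])) for row in cursor}


traits = compute_traits(conn, "2024-03-31")
print(traits)
```

Because the definition is plain SQL kept in code, it can be versioned, reviewed, and extended like any other part of the data team’s codebase.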
Does the CDP let you see under the hood and tweak the machine learning models driving predictive features like churn, LTV and product recommendations?
Many legacy CDPs now offer out-of-the-box machine learning or AI capabilities. However, those models typically live in a black box and, like identity resolution, don’t allow you to see what’s happening under the hood or customize the code to meet your business needs. Unlike a warehouse native approach, they also have limited ability to leverage the full breadth of customer data in your warehouse.
Templated models can help jumpstart ML efforts, but eventually, your data science team will want to define additional features or tweak models to improve accuracy and performance—ideally on your own infrastructure.
Advantages of a warehouse native CDP
Warehouse native architecture exposes zero copy as a marginally helpful half-measure. With the warehouse native approach, there is simply no need to copy any data into a CDP. You get all the features you need to collect comprehensive data and build truly unified customer profiles, but you don’t have to store any data with a third-party vendor or subject your data team to painful technical inflexibility.
Here are the key advantages of a warehouse native approach:
- Flexibility & accessibility – when all of your raw customer data lives in your own data warehouse, it’s easily accessible to everyone, and usage isn’t subject to vendor-specific limitations.
- More complete customer data sets – because you own your warehouse, you can combine your behavioral data with internal customer data (like transactional data) that you wouldn’t send to a third-party system.
- More advanced use cases – because your warehouse is connected to the rest of your customer data infrastructure that runs functions like data science, you don’t have to rely on vendor-provided black-box models for ML. You can enable and control more advanced machine learning use cases like churn prediction and personalized recommendations.
- Enhanced data privacy and governance – utilizing your existing data warehouse as your customer data store means one less tool where you have to deal with data security and privacy concerns. It gives you more control in the era of GDPR and CCPA regulations.
- Cost savings – it’s cheaper to store your data in the warehouse you’re already paying for than it is to pay your CDP to store it for you – again.
Now is the time to embrace the warehouse native future
Zero copy architecture doesn’t provide the control or the freedom you need to build powerful data products and use cases for your business with your customer data.
If you want to solve old problems around data collection and unification, and future-proof your efforts to build advanced capabilities with your customer data, it’s time to embrace warehouse native architecture. Request a demo with our team today to learn more about the Warehouse Native CDP.