How to Backfill Data in Mixpanel with RudderStack
We recently announced our partnership with Mixpanel, and for the last several months we’ve been busy building dashboards driving insights as a result of connecting RudderStack to Mixpanel. Unfortunately, our analysis was limited to the timeframe from when we started submitting data. To report from an earlier start date, we needed a way to load previously processed RudderStack data into Mixpanel.
To accomplish this, we explored two options:
- Replay all of the events from our S3 bucket and send them to Mixpanel
- Create a RudderStack Reverse ETL to replay the data already stored in our snowflake warehouse
The first option, replaying events from our S3 bucket and sending them to Mixpanel, was the most obvious. With a new user transformation to update the timestamp for this backfill source, we thought we’d be up and running in no time. In general this approach makes a lot of sense, but we’ve made considerable changes to our stack in the past six months, and we changed our naming conventions when we implemented our Tracking Plans. So this option meant we’d be creating a bunch of extra user profile and event properties in Mixpanel which have been previously deprecated within our RudderStack app. We could clean up this noise within the user transformation, but we decided it might be easier to do this within the data warehouse.
So, we went with option number two: create a RudderStack Reverse ETL to replay the data already stored in our snowflake warehouse. By leveraging Snowflake, we could clean, prep and test the data loading it into a staging project within Mixpanel prior to loading it into production. This route also allowed us to normalize the data schema and pair down the number of properties we wanted to load to only those important to our specific Mixpanel use cases. Yes, similar to the S3 option, this could have been done with user transformations, but we decided cleaning up the data and normalizing schemas in a database would be easier. Here’s how we did it:
Step 1: Determine the right Track, Identify and Page calls
If you are new to RudderStack, you may not be familiar with the schema RudderStack creates to store the different event types and track calls in the data warehouse. For our use case we were principally focused on user conversion activity pre and post conversion, so we wanted to process prior identify calls along with the page and track calls from our website. All of these prior events would be loaded into a single “replay” table along with an event type column - which would be used in the user transformation from the Reverse ETL. If we only needed to replay our page calls, we could have used the Reverse ETL on the page_view table from within Snowflake, but because we have over a dozen different event types, we made the decision to consolidate all of the events into a single replay and only run the Reverse ETL once as opposed to running multiple Reverse ETL, one for each event type table in Snowflake.
Step 2: Create the staging table in Snowflake
As we mentioned above, we wanted to replay events from a variety of different page, track and identify tables, each of which has more columns than we need. We decided to consolidate the various tables into a single replay table that would be the source for a single Reverse ETL. This also allowed us to rename events and properties from older events that have since changed. If we did not rename the columns in the new table, we could still create formulas within Mixpanel to combine the various properties, but it’s much easier to clean this up in snowflake and/or a user transformation.
Review Mixpanel lexicon
When creating the new table, it was helpful to review the Mixpanel Lexicon in our production instance to review the event and user profile properties. It’s important to note that both RudderStack and Mixpanel are case sensitive and since the RudderStack schema in Snowflake is created in all capitals, these columns would need to be modified within the user transformation.
Step 3: Define User Transformation mapping
The RudderStack Reverse ETL will load each row from our replay table as an identify call with each column as a trait. Therefore, we created a User Transformation to change the event type to page or track if necessary and move the traits to the appropriate properties or context objects. We also mapped the email address to the userId property and explicitly set the timestamp to our original timestamp value (or otherwise the Rudderstack Destination Transformer would use the current timestamp for the event).
We also added a new property called “backfiled_date” to make it easy to filter out these events in Mixpanel if we ever need to.
JAVASCRIPT
export function transformEvent(event) {event.timestamp = event.traits.TIMESTAMPevent.userId = event.traits.CONTEXT_TRAITS_EMAILevent.new_event = event.traits.EVENTif (event.type == 'identify') {event.context["traits"];event.context["page"];let page = {initial_referrer : event.CONTEXT_PAGE_INITIAL_REFERRER,initial_referring_domain : event.CONTEXT_PAGE_INITIAL_REFERRING_DOMAIN,path : event.CONTEXT_PAGE_PATH,referrer : event.CONTEXT_PAGE_REFERRER,search : event.CONTEXT_PAGE_SEARCH,title : event.CONTEXT_PAGE_TITLE,url : event.CONTEXT_PAGE_URL}event.context = {page : page}let traits = {account_type : event.traits.CONTEXT_TRAITS_ACCOUNT_TYPE,utm_campaign : event.traits.CONTEXT_TRAITS_UTM_CAMPAIGN,utm_content : event.traits.CONTEXT_TRAITS_UTM_CONTENT,utm_medium : event.traits.CONTEXT_TRAITS_UTM_MEDIUM,utm_source : event.traits.CONTEXT_TRAITS_UTM_SOURCE,utm_term : event.traits.CONTEXT_TRAITS_UTM_TERM,}event.context = {traits : traits}// Mixpanel Mapped Fieldsevent.context.traits.email = event.traits.CONTEXT_TRAITS_EMAILevent.context.traits.firstName = event.traits.CONTEXT_TRAITS_FIRST_NAMEevent.context.traits.lastName = event.traits.CONTEXT_TRAITS_LAST_NAMEevent.context.traits.backfill_date = "2021-11-07T12:00:00.00Z"}if ((event.new_event == 'form_submit' || event.new_event == 'webhook_source_event' || event.new_event == 'user_signed_up')) {event.type = 'track';event.event = event.traits.EVENT_TEXTevent.name = event.EVENT_TEXTevent.properties.form_id = event.context.traits.form_id}if (event.new_event == 'page') {event.type = 'page';event.event = "page_view"event.name = "page_view"}delete event.traits;return event}
Step 4: Set Up test environment in Mixpanel
If possible, it is always recommended to create a new project within Mixpanel to test your data replay prior to loading the data into production.
Quick Tip: Mixpanel dropdown menus only display events and their properties for events within the last 30 days. This was confusing to us initially because we could see the events in the eventstream window (when we set the timeframe back several months) but did not see them when we were QA’ing the data in the report builder. In order to populate these menus we had to submit a sample page and track call with current timestamp.
Step 5: Create Reverse ETL & send test events
With the data staged in the replay table and the User Transformation created, the last step was to actually create the RudderStack Reverse ETL to stream the replay events to Mixpanel. We used the Full Sync option and then disabled the source after the initial sync since we only needed to send the data once to the test environment (Check out our Snowflake Reverse ETL Sources documentation for specifics on how we connected our replay table). Once the QA was completed, we simply updated the Mixpanel Destination API Key and Secret to the production process and reran a full sync.
With the historical data loaded into production we were able to gain insights from a full year’s worth of data in Mixpanel, which made our sales and marketing teams very happy.
Try it for yourself
If you would like to learn more about backfilling data in Mixpanel with RudderStack, sign up for free, check out more specifics in our Mixpanel Documentation, and send us a note if you have any questions.