Step-by-step tutorial on creating an entity traits 360 / feature table model.
Once you have performed identity stitching to unify the identity of your users across platforms, you can evaluate and maintain the required features/traits for each identified user in an entity traits 360 model.
This guide provides a detailed walkthrough on how to use a PB project and create output tables in a warehouse for an entity traits 360 model.
Structure of a Profile Builder project and the parameters used in different files.
Identity stitching model, as entity traits 360 reuses its output to extract the required features/traits.
Entity Traits 360 Models
Once traits are defined on an entity, we need a means to use them. A primary application is sending them to downstream destinations. The destination could be either the Activation API or any of the rETL destinations that RudderStack supports. Each such destination requires data in the form of a table with an ID column and one or more feature columns. Entity Traits 360 makes this possible.
Entity Traits 360 provides a way to access entity features based on any of the given ID types, including the entity's main ID, as the identifier column. It creates a view containing all (or a specified set of) features on that entity from across the project.
To configure the creation of a specific set of traits 360 models, add a serve_traits section to the entity. Provide a list of ID types as id_served. Optionally, you can also give a name, which specifies the name of the generated model; if you don't specify one, the model is created with a default name. By default, all available features on the entity are added to the view.
If you want finer control, you can include or exclude features from any models by defining a custom entity traits 360 model and adding a reference to it in the serve_traits section, like model_served: models/name_of_custom_traits_360_model. See the examples below.
Default entity traits 360 model
pb_project.yaml:
```yaml
...
entities:
  - name: user
    id_types:
      - user_id
    serve_traits:
      - id_served: user_id
        name: user_id_stitched_features
        # This will add an entity-traits-360 model with user_id as the identifier
        # and model name user_id_stitched_features.
        # It will contain all the available features.
```
Custom entity traits 360 model:
This is an example of a custom entity traits 360 model. Here, we include/exclude features from the models of our choice.
```yaml
models:
  - name: cart_entity_traits_360
    model_type: entity_traits_360
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      id_served: user_id
      feature_list:
        - from: packages/pkg/models/cart_table # a table created by package
          include: ["*"] # will include all the traits
        - from: models/user_var_table
          include: ["*"]
          exclude: [cart_quantity, purchase_status] # except two, all the other traits will be included
        - from: models/sql_model
          include: [lifetime_value] # will include only one trait
```
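Once a custom model like this is defined, you can reference it from the entity's serve_traits section in pb_project.yaml. The following is a sketch based on the model_served syntax described above; the entity and model names are illustrative:

```yaml
entities:
  - name: user
    id_types:
      - user_id
    serve_traits:
      - id_served: user_id
        name: user_id_stitched_features
      # Reference to the custom entity traits 360 model defined in models/
      - model_served: models/cart_entity_traits_360
```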
Sample project
This sample project uses the output of an identity stitching model as an input to create an entity traits 360 model. The following sections describe how to define your PB project files:
Project detail
The pb_project.yaml file defines the project details such as name, schema version, connection name and the entities which represent different identifiers.
You can define all the identifiers from different input sources that you want to stitch together as a user_main_id:
You need to add main_id to the list only if you have defined main_id_type: main_id in the ID stitcher buildspec.
```yaml
# Project name
name: sample_id_stitching
# Project's yaml schema version
schema_version: 49
# Warehouse connection
connection: test
# Allow inputs without timestamps
include_untimed: true
# Folder containing models
model_folders:
  - models
# Entities in this project and their ids.
entities:
  - name: user
    id_types:
      - main_id # You need to add ``main_id`` to the list only if you have defined ``main_id_type: main_id`` in the ID stitcher buildspec.
      - user_id # one of the identifiers from your data source.
      - email
# lib packages can be imported into a project, signifying that this project inherits its properties from there.
packages:
  - name: corelib
    url: "https://github.com/rudderlabs/profiles-corelib/tag/schema_{{best_schema_version}}"
# If required, you can extend the package definition, such as for ID types.
```
Input
The input file includes the input table references and the corresponding SQL for the above-mentioned entities:
```yaml
inputs:
  - name: rsIdentifies
    contract: # constraints that a model adheres to
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: user_id
        - name: anonymous_id
        - name: email
    app_defaults:
      table: rudder_events_production.web.identifies # one of the warehouse tables RudderStack generates when processing identify or track events.
      occurred_at_col: timestamp
      ids:
        - select: "user_id" # kind of identity sql to pick this column from the above table.
          type: user_id
          entity: user # as defined in project file
          to_default_stitcher: true
        - select: "anonymous_id"
          type: anonymous_id
          entity: user
          to_default_stitcher: true
        - select: "lower(email)" # can use SQL.
          type: email
          entity: user
          to_default_stitcher: true
  - name: rsTracks
    contract:
      is_optional: false
      is_event_stream: true
      with_entity_ids:
        - user
      with_columns:
        - name: timestamp
        - name: user_id
        - name: anonymous_id
    app_defaults:
      table: rudder_events_production.web.tracks # another warehouse table maintained by RudderStack processing track events.
      occurred_at_col: timestamp
      ids:
        - select: "user_id"
          type: user_id
          entity: user
          to_default_stitcher: true
        - select: "anonymous_id"
          type: anonymous_id
          entity: user
          to_default_stitcher: true
```
Model
The Profiles entity traits 360 model lets you define the specific features/traits you want to evaluate from the scattered data in your warehouse tables.
A sample profiles.yaml file specifying an entity traits 360 model (user_profile):
```yaml
var_groups:
  - name: first_group
    entity_key: user
    vars:
      - entity_var:
          name: first_seen
          select: min(timestamp::date)
          from: inputs/rsTracks
          where: properties_country is not null and properties_country != ''
      - entity_var:
          name: last_seen
          select: max(timestamp::date)
          from: inputs/rsTracks
      - entity_var:
          name: user_lifespan
          select: '{{user.Var("last_seen")}} - {{user.Var("first_seen")}}'
          description: Life Time Value of a customer
      - entity_var:
          name: days_active
          select: count(distinct timestamp::date)
          from: inputs/rsTracks
          description: No. of days a customer was active
      - entity_var:
          name: campaign_source
          default: "'organic'"
      - entity_var:
          name: user_rank
          default: -1
      - entity_var:
          name: campaign_source_first_touch
          select: first_value(context_campaign_source)
          window:
            order_by:
              - timestamp asc
            partition_by:
              - main_id
          from: inputs/rsIdentifies
          where: context_campaign_source is not null and context_campaign_source != ''
      - input_var:
          name: num_c_rank_num_b_partition
          select: rank()
          from: inputs/tbl_c
          default: -1
          window:
            partition_by:
              - '{{tbl_c}}.num_b'
            order_by:
              - '{{tbl_c}}.num_c asc'
          where: '{{tbl_c}}.num_b >= 10'
      - entity_var:
          name: min_num_c_rank_num_b_partition
          select: min(num_c_rank_num_b_partition)
          from: inputs/tbl_c
```
entity_var
The entity_var field provides inputs for the entity traits 360 model. This variable stores the data temporarily; however, you can choose to store its data permanently by specifying its name as a feature in the features key.
Field
Data type
Description
name
String
Name of the entity_var to identify it uniquely.
select
String
Column name/value you want to select from the table. This defines the actual value that will be stored in the variable. You can use simple SQL expressions or select an entity_var as {{entityName.Var(\"entity_var\")}}. It must be an aggregate operation that ensures the output is a unique value for a given main_id, for example: min(timestamp), count(*), sum(amount), etc. This holds true even when a window function (optional) is used. For example, first_value() and last_value() are valid, while rank(), row_number(), etc. are not valid and give unpredictable results.
from
List
Reference to the source table from where data is to be fetched. You can either refer to another model from the same YAML or some other table specified in input YAML.
where
String
Any filters you want to apply on the input table before selecting a value. This must be SQL compatible and should consider the data type of the table.
default
String
Default value in case no data matches the filter. When defining default values, make sure you enclose string values in single quotes nested inside double quotes (for example, "'organic'") to avoid SQL failure. Non-string values can be used without any quotes.
description
String
Textual description of the entity_var.
window
Object
Specifies the window function. Window functions in SQL usually have both partition_by and order_by properties. However, for entity_var, partition_by is added with main_id by default, so adding partition_by manually is not supported. If you need partitioning on other columns too, check out input_var, where partition_by on arbitrary and multiple columns is supported.
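As a quick illustration of the default-value quoting rules described above, here is a minimal sketch (the variable names are hypothetical):

```yaml
vars:
  - entity_var:
      name: campaign_source
      default: "'organic'" # string default: single quotes nested inside double quotes
  - entity_var:
      name: user_rank
      default: -1          # non-string default: no quotes needed
```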
input_var
The syntax of input_var is similar to entity_var, with the only difference being that instead of each value being associated with a row of the entity traits 360, it is associated with a row of the specified input. While you can think of an entity_var as adding a helper column to the entity traits 360, you can consider an input_var as adding a helper column to the input.
Field
Data type
Description
name
String
Name to store the retrieved data.
select
String
Data to be stored in the name.
from
List
Reference to the source table from where data is to be fetched.
where
String
(Optional) Applies conditions for fetching data.
default
String
(Optional) Default value for any entity for which the calculated value would otherwise be NULL.
description
String
(Optional) Textual description.
window
Object
(Optional) Specifies a window over which the value should be calculated.
window
Field
Data type
Description
partition_by
String
(Optional) List of SQL expressions to use in partitioning the data.
order_by
String
(Optional) List of SQL expressions to use in ordering the data.
In the window option, main_id is not added by default; the partition can be any arbitrary list of columns from the input table. So, if a feature should be partitioned by main_id, you must add it to the partition_by key.
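Since main_id is not added automatically for an input_var window, a sketch of partitioning by it explicitly looks like this (the input table and column names are illustrative):

```yaml
- input_var:
    name: order_rank_per_user
    select: rank()
    from: inputs/rsOrders # hypothetical input table
    window:
      partition_by:
        - main_id         # must be added explicitly for input_var
        - order_category
      order_by:
        - order_amount desc
```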
Snowflake: In the left sidebar, click Database and then the corresponding Schema to view the list of all tables. You can hover over a table to see the full table name along with its creation date and time.
Write a SQL query like select * from <table_name> and execute it to see the results:
PostgreSQL: Open Postico2. If required, create a new connection by entering the relevant details. Click Test Connection, followed by Connect.
Click the + icon next to Queries in the left sidebar.
You can click Database and the corresponding schema to view the list of all tables/views.
Double-click the appropriate view name to paste it into an empty worksheet.
You can prefix SELECT * FROM before the pasted view name and append LIMIT 10; at the end.
Press Cmd+Enter, or click the Run button to execute the query.
Databricks: Enter your Databricks workspace URL in the web browser and log in with your username and password.
Click the Catalog icon in the left sidebar.
Choose the appropriate catalog from the list and click it to view its contents.
You will see a list of tables/views. Click the appropriate table/view name to paste it into the worksheet.
You can prefix SELECT * FROM before the pasted view name and append LIMIT 10; at the end.
Select the query text. Press Cmd+Enter, or click the Run button to execute the query.
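Across all three warehouses, the verification query has the same shape. A sketch assuming a generated view named user_id_stitched_features (substitute your own database, schema, and view names):

```sql
-- hypothetical database, schema, and view names
SELECT *
FROM my_database.my_schema.user_id_stitched_features
LIMIT 10;
```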
A sample output containing the results in Snowflake:
Window functions
A window function operates on a window (group) of related rows. It performs a calculation on a subset of table rows that are connected to the current row in some way. A window function can access more than just the current row of the query result.
The window function returns one output row for each input row. The values returned are calculated by using values from the sets of rows in that window. A window is defined using a window specification, and is based on three main concepts:
Window partitioning, which forms the groups of rows (PARTITION BY clause)
Window ordering, which defines an order or sequence of rows within each partition (ORDER BY clause)
Window frames, which are defined relative to each row to further restrict the set of rows (ROWS specification). This is also known as the frame clause.
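The three concepts can be seen together in a single SQL sketch (the table and column names are illustrative):

```sql
SELECT
  user_id,
  order_date,
  -- running total of amount per user, ordered by date
  SUM(amount) OVER (
    PARTITION BY user_id                             -- window partitioning
    ORDER BY order_date                              -- window ordering
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- window frame
  ) AS running_total
FROM orders;
```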
Snowflake does not require you to define cumulative or sliding frames, and uses ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING as the default cumulative window frame. However, you can override this by defining the frame manually.
For the Redshift aggregate window functions given below, specify the frame_clause when using any function from the list:
AVG
COUNT
CUME_DIST
DENSE_RANK
FIRST_VALUE
LAG
LAST_VALUE
LEAD
LISTAGG
MAX
MEDIAN
MIN
NTH_VALUE
PERCENTILE_CONT
PERCENTILE_DISC
RATIO_TO_REPORT
STDDEV_POP
STDDEV_SAMP (synonym for STDDEV)
SUM
VAR_POP
VAR_SAMP (synonym for VARIANCE)
For the Redshift ranking window functions given below, do not specify the frame_clause when using any function from the list:
DENSE_RANK
NTILE
PERCENT_RANK
RANK
ROW_NUMBER
When using a window function, use the frame_clause carefully. While it is not very critical for Snowflake, using it incorrectly in Redshift can lead to errors.
Example of using frame_clause:
```yaml
- entity_var:
    name: first_num_b_order_num_b
    select: first_value(tbl_c.num_b) # Specify frame clause as aggregate window function is used
    from: inputs/tbl_c
    default: -1
    where: tbl_c.num_b >= 10
    window:
      order_by:
        - tbl_c.num_b desc
      frame_clause: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
- entity_var:
    name: first_num_b_order_num_b_rank
    select: rank() # DO NOT specify frame clause as ranking window function is used
    window:
      partition_by:
        - first_num_b_order_num_b > 0
      order_by:
        - first_num_b_order_num_b asc
```
Note how the frame_clause is specified in the first entity_var but not in the second one.
Feature Table Model (legacy)
The Profiles feature table model lets you define the specific features/traits you want to evaluate from the scattered data in your warehouse tables.
A sample profiles.yaml file specifying a feature table model (user_profile):
validity_time
String
Specifies the validity of the model with respect to its timestamp. For example, a model run as part of a scheduled nightly job for 2009-10-23 00:00:00 UTC with validity_time: 24h would still be considered potentially valid and usable for any run requests that do not require precise timestamps between 2009-10-23 00:00:00 UTC and 2009-10-24 00:00:00 UTC. This specifies the validity of the generated entity traits 360. Once the validity expires, scheduling takes care of generating new tables. For example: 24h for 24 hours, 30m for 30 minutes, 3d for 3 days, and so on.
entity_key
String
Specifies the relevant entity from your input.yaml file.
features
String
Specifies the list of names from entity_var that must act as features.
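Putting the fields above together, here is a sketch of a legacy feature table model. The entity_var names are borrowed from the sample profiles.yaml earlier in this guide, and the model_type value is an assumption for the legacy syntax:

```yaml
models:
  - name: user_profile
    model_type: feature_table_model # assumed legacy model type
    model_spec:
      validity_time: 24h # 1 day
      entity_key: user
      features: # entity_var names to persist as features
        - first_seen
        - last_seen
        - days_active
```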