Know the specifications of a site configuration file, PB project structure, configuration files, and their parameters.
14 minute read
Once you complete the Profile Builder CLI steps, you will be able to see the Profiles project on your machine.
Site configuration file
RudderStack creates a site configuration file (~/.pb/siteconfig.yaml) while creating a warehouse connection. It contains the following details including secrets (if any):
Warehouse connection details and its credentials.
Git repository connection credentials (if any). Follow the below steps to add the same:
Add private key as credentials in the siteconfig.yaml file under key field in the gitcreds field.
If you have multiple Profiles projects and they use different warehouse connections, you can store the details for multiple connections in the same site configuration file.
A sample site configuration file containing multiple warehouse connection details is shown below:
connections:# connection nameprod-db-profile:target:devoutputs:dev:account:inb828.us-west-3dbname:MAT_STORE_DEVpassword:password_devrole:PROFILES_ROLE_DEVschema:AB_SCHEMA_DEVtype:snowflakeuser:profiles_demowarehouse:DEV_WAREHOUSEprod:account:inc654.us-west-3dbname:MAT_STOREpassword:passwordrole:PROFILES_ROLEschema:AB_SCHEMAtype:snowflakeuser:profiles_demowarehouse:PROD_WAREHOUSEtest-db-profile:target:testoutputs:db:access_token:dabasihasdhocatalog:rs_devhost:adb-98.18.azuredatabricks.nethttp_endpoint:/sql/1.0/warehouses/919uasdn92hport:443schema:rs_profilestype:databricksuser:johndoe@abc.onmicrosoft.comdev:account:uk12.us-west-1dbname:RUDDERSTACK_DBpassword:passwordrole:RS_ROLEschema:RS_PROFILEStype:snowflakeuser:johndoewarehouse:RS_WAREHOUSEsnowflake-keypair:# example of an unencrypted Snowflake key-pairtype:snowflakeaccount:vb8.us-east-1dbname:PROD_DBrole:PROFILES_ROLEwarehouse:RUDDERschema:RS_PROFILESuser:PROFILES_USER_UNCuseKeyPairAuth:trueprivateKey:-----BEGIN PRIVATE KEY----- ..keyvalue.. -----END PRIVATE KEY-----snowflake-encrypted-keypair:# example of an encrypted Snowflake key-pairtype:snowflakeaccount:vb8.us-east-1dbname:PROD_DBrole:PROFILES_ROLEwarehouse:RUDDERschema:RS_PROFILESuser:PROFILES_USER_ECuseKeyPairAuth:trueprivateKey:-----BEGIN ENCRYPTED PRIVATE KEY----- ..keyvalue.. -----END ENCRYPTED PRIVATE KEY-----privateKeyPassphrase:valuegoeshere!redshift_v1:dbname:warehouse_rshost:warehouse.abc.us-east-3.redshift.amazonaws.compassword:passwordport:5419schema:rs_profilestype:redshiftuser:redshift_userredshift_v2:workgroup_name:warehouse_workgroupregion:us-east-1driver:v2sslmode:requiredbname:warehouse_rsschema:rs_profilestype:redshiftaccess_key_id:******************secret_access_key:******************************big:credentials:auth_provider_x509_cert_url:https://www.googleapis.com/oauth2/v1/certsauth_uri:https://accounts.google.com/o/oauth2/authclient_email:johndoe@big-query-integration-poc.iam.gserviceaccount.comclient_id:"123345678909872"client_x509_cert_url:https://www.googleapis.com/robot/v1/metadata/x509/johndoe%40big-query-integration-poc.iam.gserviceaccount.comprivate_key:| -----BEGIN PRIVATE KEY-----## private key-----END PRIVATE KEY-----private_key_id:5271368bhjbd72y278222e233w23e231eproject_id:big-query-integration-poctoken_uri:https://oauth2.googleapis.com/tokentype:service_accountproject_id:rs_profilesschema:rs_profilestype:bigqueryuser:johndoe@big-query-integration-poc.iam.gserviceaccount.comgitcreds:- reporegex:"git@github.com:REPO_OWNER/*"# in case of ssh urlkey:| -----BEGIN OPENSSH PRIVATE KEY-----
**********************************************************************
**********************************************************************
**********************************************************************
**********************************************************************
****************************************************************
-----END OPENSSH PRIVATE KEY------ reporegex:"https://github.com/rudderlabs/*"# https urlbasic_auth:username:oauth2password:...# your personal access token with read permissionpy_models:enabled:true# in case you are using Python models in your project, else set it to falsepython_path:/opt/anaconda3/bin/python# the path where Python is installed (run `which python` to get the full path). If `py_models` is not enabled, set it to `""`. For Windows, you may pass the path value as: python.execredentials_presets:nullallowed_git_urls_regex:""cache_dir:/Users/YOURNAME/.pb/WhtGitCache/# For Windows, the directory path will have forward slash (\)filepath:/Users/YOURNAME/.pb/siteconfig.yaml# For Windows, the file path will have forward slash (\)
RudderStack recommends defining two target within a single connection: one for dev/staging and the other for production. For example, refer to the connection named prod-db-profile in the sample file above. The default target is set to dev, as specified by the target key. To change the output to production, you can run pb run -t prod.
Profiles project structure
The following image shows the folder structure of the project:
README.md
The README.md file provides a quick overview on how to use PB along with SQL queries for data analysis.
A sample pb_project.yaml file with entity type as user:
# Project namename:sample_attribution# Project's yaml schema versionschema_version:72# Name of warehouse connection in siteconfig.yamlconnection:prod-db-profile# Directory to store models model_folders:- models# Entities in the project and their idsentities:- name:userid_column_name:<column_name># If not specified, this field is set to <entity_name>_main_id by default.# Use the following (id_stitcher) field to define a custom ID stitcher model (optional).# id_stitcher: models/user_id_stitcherid_types:- main_id- user_id- anonymous_id- email- shopify_customer_id- device_id# Feature views - to get all features/traits of an entity into a single view (optional)feature_views:using_ids:- id:emailname:customer_profile_by_email- id:device_idname:customer_profile_by_device_idid_types:- name:shopify_customer_id- name:device_idfilters:- type:excludevalue:"carol.foster@example.com"- name:anonymous_idfilters:- type:excludevalue:""- type:excludevalue:"unknown"- type:excludevalue:"NaN"- name:emailfilters:- type:includeregex:".+@.+"- type:excludesql:select:emailfrom:inputs/csv_email_blacklist- name:user_idfilters:- type:excludesql:select:user_idfrom:inputs/models/sql_exclusion_model# lib packages can be imported to signify that this project's properties are inheritedpackages:- name:coreliburl:"https://github.com/rudderlabs/rudderstack-profiles-corelib/tag/schema_{{best_schema_version}}"# Profiles can also use certain model types defined in Python.# Examples include ML models. Those dependencies are specified here.python_requirements:- profiles-pycorelib==0.1.0- profiles-rudderstack==0.14
The following table explains the fields used in the above file:
Field
Data type
Description
name
String
Name of the project.
schema_version
Integer
Project’s YAML version. Each new schema version comes with improvements and added functionalities.
connection
String
Connection name from siteconfig.yaml used for connecting to the warehouse.
Lists all the entities used in the project for which you can define models. Each entry for an entity here is a JSON object specifying entity’s name and attributes.
List of the identifier types you want the id_stitcher model to consume, process, and stitch together. You can further define filters on these id types.
feature_views
List
(Optional) Lists all the view names along with their ID’s being served for feature view model.
The identifiers listed in id_types may have a many-to-one relationship with an entity but each ID must belong to a single entity.
For example, a user entity might have id_types as the salesforce_id, anonymous_id, email, and session_id (a user may have many session IDs over time). However, it should not include something like ip_address, as a single IP can be used by different users at different times and it is not considered as a user identifier.
filters
You can filter out the ID types you do not want to include in the stitching process by using the filters field.
For example:
If email is an ID type but you have internal testing emails you want to filter out, you can define those here.
If you want to exclude all the blacklisted email addresses, you can create an input model (for example, csv_email_blacklist) with CSV file as a source, that contains all such email addresses.
If you want to exclude all the user_ids, you can create an SQL model (for example, sql_exclusion_model) that contains a specific logic to enlist all such IDs.
Field
Data type
Description
type
String
Type of filter. Allowed values are include or exclude.
value
String
Value to match, for example, you can reject certain invalid ID values like NaN, unknown, test@domain.com, etc.
regex
String
Regular expression with which to match the values.
sql
List
SQL statement with select and from keys.
packages
You can import library packages in a project signifying where the project inherits its properties from.
Field
Data type
Description
name
String
Specify a name.
url
String
HTTPS URL of the lib package, with a tag for the best schema version.
Models folder
Models folder contains all the input sources your Profiles project will consume and process.
Whenever you create a Profiles project using pb init pb project, the default inputs.yaml and profiles.yaml are created in the Models folder.
inputs.yaml
The inputs.yaml file contains the configuration for all the input sources which Profiles uses to run models and create outputs like the ID graph, feature view, etc. You can also define specific constraints on the input sources using the contract key.
RudderStack supports the following input sources:
Table: Specify the table’s name in the table key.
View: Specify the view’s name in the view key.
S3 bucket: Specify the path of the CSV file in your bucket in the s3 key. See Use Amazon S3 bucket as input for more information.
Local CSV file: Specify the file path in the csv key. See Use CSV file as input for more information.
You can also specify the table/view along with the column name and SQL expression for retrieving values. The input specification may also include metadata and the constraints on those columns.
A sample inputs.yaml file:
inputs:# name of the input source- name:salesforceTaskscontract:is_optional:falseis_event_stream:truewith_entity_ids:- userwith_columns:- name:activitydate- name:whoidapp_defaults:# name of the table for input datatable:salesforce.task# For BigQuery, it is recommended to use view (view: _views_<view_name>) instead of table for event streaming data sets.occurred_at_col:activitydaterow_identifier:- activitydate- whoid# aliases to select from the input tableids:# column name or sql expression- select:"whoid"type:salesforce_id# entity to which the id belongsentity:userto_default_stitcher:true- name:salesforceContactcontract:is_optional:falseis_event_stream:truewith_entity_ids:- userwith_columns:- name:createddate- name:id- name:emailapp_defaults:table:salesforce.contact# For BigQuery, it is recommended to use view (view: _views_<view_name>) instead of table for event streaming data sets.occurred_at_col:createddateids:- select:"id"type:salesforce_identity:userto_default_stitcher:true- select:"case when lower(email) like any ('%gmail%', '%yahoo%') then lower(email) else split_part(lower(email),'@',2) end"type:emailentity:userto_default_stitcher:true- name:websitePageVisitscontract:is_optional:falseis_event_stream:truewith_entity_ids:- userwith_columns:- name:timestamp- name:anonymous_id- name:context_traits_email- name:user_idapp_defaults:table:autotrack.pages# For BigQuery, it is recommended to use view (view: _views_<view_name>) instead of table for event streaming data sets.occurred_at_col:timestampids:- select:"anonymous_id"type:rudder_anon_identity:userto_default_stitcher:true# below sql expression check the email type, if it is gmail and yahoo return email otherwise spilt email return domain of email. - select:"case when lower(coalesce(context_traits_email, user_id)) like any ('%gmail%', '%yahoo%') then lower(coalesce(context_traits_email, user_id)) \
else split_part(lower(coalesce(context_traits_email, user_id)),'@',2) end"type:emailentity:userto_default_stitcher:true
The following table explains the fields used in the above file:
Field
Data type
Description
name
String
Name of the input model.
contract
Dictionary
A model contract provides essential information about the model like the necessary columns and entity IDs that it should contain. This is crucial for other models that depend on it, as it helps find errors early and closer to the point of their origin.
app_defaults
Dictionary
Values that input defaults to when you run the project directly. For library projects, you can remap the inputs and override the app defaults while importing the library projects.
contract
Field
Data type
Description
is_optional
Boolean
[Optional] Whether the model’s existence in the warehouse is mandatory.
Default value: False
is_event_stream
Boolean
[Optional] Whether the table/view is a series/stream of events. A model that has a timestamp column is an event stream model.
Default value: False
with_entity_ids
List
List of all entities with which the model is related. A model M1 is considered related to model M2 if there is an ID of model M2 in M1’s output columns.
with_columns
List
List of all ID columns that this contract is applicable for.
app_defaults
Field
Data type
Description
table/view
String
Name of the warehouse table/view containing the data. You can prefix the table/view with an external schema or database in the same warehouse, if applicable. Note that you can specify either a table or view but not both.
s3
String
Name of the CSV file in your Amazon S3 bucket containing the data.
csv
String
Name of the CSV file in your local storage containing the data. The file path should be relative to the project folder.
occurred_at_col
String
Name of the column in table/view containing the timestamp.
row_identifier
String
(Optional) List of all the identifiers whose combination acts as a primary key. If the unique row exists already during the run process while creating a copy of the input table, it is not copied again.
Specifies the list of all IDs present in the source table along with their column names (or column SQL expressions).
Note: Some input columns may contain IDs of associated entities. By their presence, such ID columns associate the row with the entity of the ID. The ID Stitcher may use these declarations to automatically discover ID-to-ID edges.
ids
Field
Data type
Description
select
String
Specifies the column name to be used as the identifier. You can also specify a SQL expression if some transformation is required.
Note: You can also refer table from another Database/Schema in the same data warehouse. For example, table: <database_name>.<schema_name>.<table_name>.
type
String
Type of identifier. All the ID types of a project are declared in pb_project.yaml. You can specify additional filters on the column expression.
Note: Each ID type is linked only with a single entity.
entity
String
Entity name defined in the pb_project.yaml file to which the ID belongs.
to_default_stitcher
Boolean
Set this optional field to false for the ID to be excluded from the default ID stitcher.
Default value: True
profiles.yaml
The profiles.yaml file defines the Profiles semantic models. Each model (defined in model_type field) is a prepackaged SQL model that takes your inputs and runs in the warehouse to output the defined views and tables.
You can define the entity_vars/input_vars under var_groups which are used to create the output tables.
The following profiles.yaml file defines a group of vars named vars_list and a user_profile model:
var_groups:name:vars_listentity_key:user# This is the name defined in project file. If we change that, we need to change the name here too.vars:- entity_var:name:is_mqlselect:max(case when salesForceLeadsTable.mql__c == 'True' then 1 else 0 end)from:inputs/salesForceLeadsTabledescription:Whether a domain is mql or not- entity_var:name:blacklistFlagselect:max(case when exclude_reason is not null then 1 else 0 end)from:inputs/blacklistDomainswhere:(context_sources_job_run_id = (select top 1 context_sources_job_run_id from blacklistDomains order by timestamp desc))is_feature:false- entity_var:name:ignore_domainselect:case when {{user.Var("blacklistFlag")}} = 1 or {{user.Var("domainSummary_account_type")}} like '%free%' then 1 else 0 enddescription:Whether a domain should be ignored for the analysis- entity_var:name:salesEventsselect:json_agg(activitydate, case when (type='Email' or tasksubtype = 'Email') then case when lower(subject) like '%[in]%' then 'sf_inbound_email' \else 'sf_outbound_email' end when macro(call_conversion) then 'sf_call' else null end as event)from:inputs/salesforceTasksdescription:Salesforce touches are converted to one of following events - sf_inbound_email, sf_outbound_email, sf_call, nullis_feature:false- entity_var:name:webhookFormSubmitselect:min(timestamp)from:inputs/webhookSourcewhere:variable_1 is null and timestamp < sales_conversion_timestamp and timestamp > var('start_date')models:- name:user_profilemodel_type:feature_table_modelmodel_spec:validity_time:24h# 1 dayentity_key:user
var_groups
Field
Data type
Description
name
String
A unique name for the var_group.
entity_key
String
The entity to which the var_group belongs to.
vars
Object
This section is used to specify variables, with the help of entity_var and input_var. Aggregation on stitched ID type is done by default and is implicit.
Optionally, you can create models using the above vars. The following fields are common for all the model types:
Field
Data type
Description
name
String
Name of the model. Note that a table with the same name is created in the data warehouse. For example, if you define the name as user_table, the output table will be named something like Material_user_table_<rest-of-generated-hash>_<timestamp-number>.
model_type
String
Defines the type of model. Possible values are: id_stitcher, feature_table_model, sql_template, entity_cohort, id_collator, python_model, feature_view, etc. See model types for more infromation.
model_spec
Object
Defines the detailed configuration specification for the target model.
macros.yaml
You can optionally use macros - reusable functions that encapsulate complex processing logic directly within the SQL expression. See Macros for more information on their usage.
sql_models.yaml
You can use SQL models that are suitable for advanced use cases where you want a model that does some intermediary transformations, joins, or unions on some data before it is consumed by the identity stitcher or feature creation models. See SQL Models for more information.
This site uses cookies to improve your experience while you navigate through the website. Out of
these
cookies, the cookies that are categorized as necessary are stored on your browser as they are as
essential
for the working of basic functionalities of the website. We also use third-party cookies that
help
us
analyze and understand how you use this website. These cookies will be stored in your browser
only
with
your
consent. You also have the option to opt-out of these cookies. But opting out of some of these
cookies
may
have an effect on your browsing experience.
Necessary
Always Enabled
Necessary cookies are absolutely essential for the website to function properly. This
category only includes cookies that ensures basic functionalities and security
features of the website. These cookies do not store any personal information.
This site uses cookies to improve your experience. If you want to
learn more about cookies and why we use them, visit our cookie
policy. We'll assume you're ok with this, but you can opt-out if you wish Cookie Settings.