Input sources
Detailed information on the input source types you can use in Profiles.
Profiles lets you use a table, a view, an Amazon S3 bucket, or a CSV file as an input source. Profiles then runs models on this input data and creates outputs in your warehouse.
Tables
You can specify the table’s name in the table key:
- name: rsTracks
  app_defaults:
    table: profiles_new.tracks
    occurred_at_col: timestamp
    ids:
      - select: "user_id"
        type: user_id
        entity: user
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
Views
You can specify the view’s name in the view key:
- name: tbl_b
  app_defaults:
    view: Temp_view_b
    occurred_at_col: timestamp
    ids:
      - select: "id1"
        type: test_id
        entity: user
        to_default_stitcher: true
      - select: "id2"
        type: test_id
        entity: user
        to_default_stitcher: true
Amazon S3 bucket
This is an experimental feature.
If you store data in your Amazon S3 bucket in the CSV file format, you can use it as an input for the Profiles models. Specify the S3 URI path in the app_defaults.s3 key:
- name: s3_table
  contract:
    is_optional: false
    is_event_stream: true
    with_entity_ids:
      - user
    with_columns:
      - name: insert_ts
        datatype: timestamp
      - name: num_a
        datatype: integer
  app_defaults:
    s3: "s3://bucket-name/prefix/example.csv"
    occurred_at_col: insert_ts
    ids:
      - select: "id1"
        type: test_id
        entity: user
      - select: "id2"
        type: test_id
        entity: user
Ensure that the CSV file follows the standard format with the first row as the header containing column names, for example:
ID1,ID2,ID3,INSERT_TS,NUM_A
a,b,ex,2000-01-01T00:00:01Z,1
D,e,ex,2000-01-01T00:00:01Z,3
b,c,ex,2000-01-01T00:00:01Z,2
NULL,d,ex,2000-01-01T00:00:01Z,4
Note that:
- To escape a comma (,) within a cell of the CSV file, enclose that cell in double quotes (" ").
- Double quotes (" ") enclosing a cell are ignored.
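For instance, in the following illustrative row, the first cell is loaded as a,b and the enclosing quotes are dropped:
ID1,ID2,ID3,INSERT_TS,NUM_A
"a,b",c,ex,2000-01-01T00:00:01Z,5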
Follow these steps to grant PB the permissions required to access the file in your S3 bucket:
Private S3 bucket
Add region, access_key, secret_access_key, and session_token in your siteconfig file so that PB can access the private bucket. The region defaults to us-east-1 unless specified otherwise.
aws_credential:
  region: us-east-1
  access_key: **********
  secret_access_key: **********
  session_token: **********
Generate access key ID and secret access key
- Open the AWS IAM console in your AWS account.
- Click Policies.
- Click Create policy.
- In the Policy editor section, click the JSON option.
- Replace the existing JSON policy with the following policy, replacing <bucket_name> with your actual bucket name:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::<bucket_name>/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<bucket_name>",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "*"
          ]
        }
      }
    }
  ]
}
- Click Review policy.
- Enter the policy name. Then, click Create policy.
Next, create an IAM user. The IAM user requires the following permissions on the S3 bucket and folder to access files in the folder (and its sub-folders):
- s3:GetBucketLocation
- s3:GetObject
- s3:GetObjectVersion
- s3:ListBucket
- In the AWS IAM console, click Users.
- Click Create user.
- Enter a name for the user.
- Select Programmatic access as the access type, then click Next: Permissions.
- Click Attach existing policies directly, and select the policy you created earlier. Then, click Next.
- Review the user details, then click Create user.
- Copy the access key ID and secret access key values.
Generate session token
- Use the AWS CLI to create a named profile with the AWS credentials that you copied in the previous step.
- To get the session token, run the following command:
$ aws sts get-session-token --profile <named-profile>
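If you have not yet created the named profile, aws configure --profile <named-profile> sets it up using the access key ID and secret access key copied above. The get-session-token command returns temporary credentials as JSON, with placeholder values shown below; copy them into the access_key, secret_access_key, and session_token fields of your siteconfig file (a session token is only valid together with its matching temporary access key and secret):
{
    "Credentials": {
        "AccessKeyId": "ASIA**********",
        "SecretAccessKey": "**********",
        "SessionToken": "**********",
        "Expiration": "2030-01-01T00:00:00+00:00"
    }
}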
See Snowflake, Redshift, and Databricks for more information.
Public S3 bucket
You must have the following permissions on the S3 bucket and folder to access files in the folder (and sub-folders):
- s3:GetBucketLocation
- s3:GetObject
- s3:GetObjectVersion
- s3:ListBucket
You can use the following policy in your bucket to grant the above permissions:
- Go to the Permissions tab of your S3 bucket.
- Click Edit bucket policy in the Permissions tab and add the following policy, replacing <bucket_name> with your actual bucket name:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::<bucket_name>/*"
    },
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<bucket_name>"
    }
  ]
}
In Redshift, you additionally need to set an IAM role as the default for your cluster, unless access keys are provided. This is necessary because more than one IAM role can be associated with the cluster, and Redshift needs explicit permissions granted through an IAM role to access the S3 bucket (public or private).
Follow the Redshift documentation to set an IAM role as the default.
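The following is a minimal sketch of this step using the AWS CLI; the cluster identifier my-cluster and the role ARN are placeholders, and the flag mirrors the DefaultIamRoleArn parameter of the ModifyClusterIamRoles API. You can achieve the same from the Redshift console via Manage IAM roles:
$ aws redshift modify-cluster-iam-roles \
    --cluster-identifier my-cluster \
    --default-iam-role-arn arn:aws:iam::<account_id>:role/<role_name>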
CSV file
RudderStack recommends using a CSV file as an input only if you have a limited amount of data.
You can read data from a CSV file by specifying csv: <path_to_filename> under the app_defaults field in the input.yaml file. CSV data is loaded internally as a single SQL SELECT query, making it useful for seeding tests.
A sample configuration is shown below:
- name: rsTracks
  app_defaults:
    csv: "../common.xtra/Temp_tbl_a.csv"
    occurred_at_col: timestamp
    ids:
      - select: "user_id"
        type: user_id
        entity: user
      - select: "anonymous_id"
        type: anonymous_id
        entity: user
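Conceptually, PB inlines the CSV rows into a single SELECT, roughly as in the sketch below. This is illustrative only; the column names come from the CSV header, and the exact SQL that PB generates may differ:
SELECT 'u1' AS user_id, 'a1' AS anonymous_id, '2000-01-01T00:00:01Z' AS timestamp
UNION ALL
SELECT 'u2', 'a2', '2000-01-01T00:00:02Z'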