How we crafted a domain-specific language for JSON transformation at RudderStack
RudderStack created a JSON Template Engine to simplify transformation of JSON data from one format to another, making it easier to manage and maintain complex integrations. This blog post will cover why we needed to craft our own Domain-Specific Language for JSON transformation and how we did it.
First, let’s look at the problem we were trying to solve and why it led us to build our own JSON Template Engine.
The challenge
RudderStack is the Warehouse Native CDP. We provide an integrated solution for data collection, unification in the warehouse, and activation. Our platform supports over 200 integrations and features a powerful Transformations tool. Traditionally, we used native JavaScript code for data transformation, which required significant effort and maintenance. Writing intricate JavaScript code for complex JSON transformations can be error-prone and time-consuming. Moreover, JavaScript’s general-purpose nature did not provide the level of abstraction and expressiveness needed to succinctly represent JSON transformation logic. Although JSONata offered a more efficient way to manipulate JSON data, we still encountered performance bottlenecks due to its parsing and interpretation overhead.
Our solution
The solution was to use a domain-specific language tailored specifically for JSON transformation. By designing a custom JSON template language, we can provide developers with a specialized syntax and semantics optimized for JSON manipulation tasks. Such a language would abstract away low-level JavaScript details, simplify complex transformation logic, and enhance readability and maintainability.
With that goal in mind, we developed our own JSON Template Engine. This engine generates optimized JavaScript code from transformation templates, reducing runtime overhead and significantly improving performance.
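To illustrate why generating code can reduce per-call overhead, here is a minimal sketch. It is not the engine’s actual implementation: it assumes a toy path syntax of dot-separated keys (`a.b.c`), and `interpretPath` and `compilePath` are hypothetical names. The interpreted version re-parses the path on every call, while the compiled version emits JavaScript source once and reuses the resulting function.

```javascript
// Interpreted approach: the path string is re-parsed on every call.
function interpretPath(path, input) {
  let current = input;
  for (const key of path.split('.')) {
    if (current === null || current === undefined) return undefined;
    current = current[key];
  }
  return current;
}

// Compiled approach: parse once, emit JavaScript source, build a reusable function.
function compilePath(path) {
  const accessors = path
    .split('.')
    .map((key) => `?.[${JSON.stringify(key)}]`)
    .join('');
  // For "user.address.city" this emits:
  //   return input?.["user"]?.["address"]?.["city"];
  return new Function('input', `return input${accessors};`);
}

const event = { user: { address: { city: 'Pune' } } };
const getCity = compilePath('user.address.city');

console.log(interpretPath('user.address.city', event)); // Pune
console.log(getCity(event)); // Pune
console.log(getCity({})); // undefined
```

The compiled function can be cached and applied to every incoming event, so the cost of analysing the template is paid once rather than per event.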
Steps to build a domain-specific language
Here’s how we crafted our custom JSON template language. You can follow a similar process to create a language for your problem domain.
1. Define the domain and requirements
Start by clearly defining the domain for which you’re building the DSL — in our case, JSON transformation. Identify the specific requirements and challenges within that domain, such as the need for concise syntax, support for complex data structures, and efficient execution.
2. Design language syntax and semantics
Based on the identified requirements, design the syntax and semantics of your DSL — in our case, the JSON template language. Define language constructs such as statements, expressions, and control flow mechanisms that enable users to express JSON transformation logic in a clear and concise manner.
3. Implement lexing (tokenization)
Lexical analysis involves breaking down the source code into tokens, the smallest units of meaningful characters in the language. Implement a lexer to tokenize the input JSON template code, identifying keywords, identifiers, operators, and other lexical elements.
To see how we approach tokenization, let’s look at how the descendant operator `..` is implemented. This operator searches for a specific key in all descendants of a property.
To begin, we must first locate the descendant operator within the code. This can be achieved by creating a generic function as part of the Lexer, which is responsible for identifying various punctuators that include dots.
JAVASCRIPT
/**
 * Scans the provided code characters for punctuator tokens, specifically
 * focusing on variations of dots ('.', '..', '...', '.()').
 *
 * @param {string[]} codeChars - An array of characters representing the code being scanned.
 * @param {number} idx - The current index within the code to start scanning from.
 * @returns {Token | undefined} - Returns a Token object representing the identified punctuator or undefined if no match is found.
 */
function scanPunctuatorForDots(codeChars: string[], idx: number): Token | undefined {
  const start = idx; // Store the starting index for the token

  // Extract the characters at specific positions relative to the current index
  const ch1 = codeChars[idx];
  const ch2 = codeChars[idx + 1];
  const ch3 = codeChars[idx + 2];

  // Check if the first character is a dot ('.')
  if (ch1 !== '.') {
    return undefined; // No match, return undefined
  }

  // Handle different punctuator variations involving dots:
  if (ch2 === '(' && ch3 === ')') {
    return {
      type: TokenType.PUNCT,
      value: '.()',
      range: [start, idx + 3], // Range covers all three characters
    };
  }

  if (ch2 === '.' && ch3 === '.') {
    return {
      type: TokenType.PUNCT,
      value: '...',
      range: [start, idx + 3], // Range covers all three characters
    };
  }

  if (ch2 === '.') {
    return {
      type: TokenType.PUNCT,
      value: '..',
      range: [start, idx + 2], // Range covers both dots
    };
  }

  // Default case: single dot
  return {
    type: TokenType.PUNCT,
    value: '.',
    range: [start, idx + 1], // Range covers the single dot
  };
}
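To show where such a scanner fits, the following is a minimal, self-contained sketch of a tokenizer loop in plain JavaScript. It is not the engine’s lexer: `tokenize`, `scanDots`, `scanIdentifier`, and the token shape are simplified assumptions, and the sketch recognizes only dot punctuators and identifiers.

```javascript
// Hypothetical, reduced token kinds for this sketch.
const TokenType = { PUNCT: 'punct', ID: 'id' };

// Try the longest candidates first, so '...' wins over '..', which wins over '.'.
function scanDots(code, idx) {
  for (const value of ['...', '.()', '..', '.']) {
    if (code.startsWith(value, idx)) {
      return { type: TokenType.PUNCT, value, range: [idx, idx + value.length] };
    }
  }
  return undefined;
}

// Scan an identifier such as `employees` starting at idx.
function scanIdentifier(code, idx) {
  const match = /^[A-Za-z_$][\w$]*/.exec(code.slice(idx));
  if (!match) return undefined;
  return { type: TokenType.ID, value: match[0], range: [idx, idx + match[0].length] };
}

// Walk the input left to right, emitting one token at a time.
function tokenize(code) {
  const tokens = [];
  let idx = 0;
  while (idx < code.length) {
    const token = scanDots(code, idx) ?? scanIdentifier(code, idx);
    if (!token) {
      throw new Error(`Unexpected character '${code[idx]}' at index ${idx}`);
    }
    tokens.push(token);
    idx = token.range[1]; // continue scanning right after the token
  }
  return tokens;
}

console.log(tokenize('.employees..name').map((t) => t.value));
// [ '.', 'employees', '..', 'name' ]
```

Checking the longest candidates first matters: if `.` were tested before `..`, the descendant operator would be split into two single-dot tokens.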
4. Implement parsing (syntax analysis)
Parsing involves constructing a parse tree (or Abstract Syntax Tree — AST) from the tokenized source code. Implement a parser to generate the AST according to the grammar rules defined for the language.
After identifying the descendant selector token in the previous step, we combine it with other tokens to build an expression, a node of the Abstract Syntax Tree (AST), that represents the selector in a structured manner.
JAVASCRIPT
/**
 * Checks if the current token from the lexer is a dot (.) or double-dot (..) punctuator.
 *
 * @returns {boolean} True if the current token is a dot or double-dot punctuator, false otherwise.
 */
function matchPathPartSelector(): boolean {
  const token = this.lexer.lookahead(); // Peek at the next token without consuming it
  if (token.type === TokenType.PUNCT) {
    return token.value === '.' || token.value === '..'; // Check if the value is '.' or '..'
  }
  return false;
}

/**
 * Parses a selector expression, which can be a simple identifier, wildcard (*), or a string literal.
 *
 * @returns {SelectorExpression | IndexFilterExpression | Expression} The parsed selector expression object.
 */
function parseSelector(): SelectorExpression | IndexFilterExpression | Expression {
  const selector = this.lexer.value(); // Consume the current token as the selector
  // Other code
  let prop: Token | undefined;
  if (this.lexer.match('*') || this.lexer.matchID() || this.lexer.matchTokenType(TokenType.STR)) { // Check for specific token types
    prop = this.lexer.lex(); // Consume the next token as the property
  }
  return {
    type: SyntaxType.SELECTOR,
    selector,
    prop,
  };
}

/**
 * Parses a path part, which can be an expression, an array of expressions, or a selector expression based on the current token.
 *
 * @returns {Expression | Expression[] | undefined} The parsed path part or undefined if not applicable.
 */
function parsePathPart(): Expression | Expression[] | undefined {
  // Other code
  } else if (matchPathPartSelector()) { // Check if the current token is a dot or double-dot
    return parseSelector(); // Parse a selector if it is
  }
  // Other code
}
The above functions work together to identify and parse different parts of a path, with a focus on recognizing selectors within the path structure. They rely on a separate lexer module that provides functionality for reading and identifying different token types in the input stream.
This is the Abstract Syntax Tree (AST) representation for the expression `.employees..name`:
JAVASCRIPT
{
  "type": "statements_expr",
  "statements": [
    {
      "type": "path",
      "parts": [
        {
          "type": "selector",
          "selector": ".",
          "prop": {
            "type": "id",
            "value": "employees",
            "range": [1, 10]
          }
        },
        {
          "type": "selector",
          "selector": "..",
          "prop": {
            "type": "id",
            "value": "name",
            "range": [12, 16]
          }
        }
      ],
      "pathType": "rich"
    }
  ]
}
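As a rough illustration of how tokens become such a tree, here is a minimal sketch of a parser for paths made only of `.` and `..` selectors. The `parsePath` function, its token format, and the reduced output shape (no `statements_expr` wrapper or `pathType`) are assumptions made for this example, not the engine’s parser.

```javascript
// Tokens as a lexer might emit them for `.employees..name` (hand-written for this sketch).
const tokens = [
  { type: 'punct', value: '.', range: [0, 1] },
  { type: 'id', value: 'employees', range: [1, 10] },
  { type: 'punct', value: '..', range: [10, 12] },
  { type: 'id', value: 'name', range: [12, 16] },
];

// Hypothetical minimal parser: a cursor over the tokens plus one helper per grammar rule.
function parsePath(tokenList) {
  let pos = 0;
  const lookahead = () => tokenList[pos]; // peek without consuming
  const lex = () => tokenList[pos++]; // consume and return the current token

  const matchSelector = () => {
    const token = lookahead();
    return token?.type === 'punct' && (token.value === '.' || token.value === '..');
  };

  const parseSelector = () => {
    const selector = lex().value; // consume '.' or '..'
    let prop;
    if (lookahead()?.type === 'id') {
      prop = lex(); // consume the property name
    }
    return { type: 'selector', selector, prop };
  };

  const parts = [];
  while (matchSelector()) {
    parts.push(parseSelector());
  }
  if (pos < tokenList.length) {
    throw new Error(`Unexpected token '${lookahead().value}'`);
  }
  return { type: 'path', parts };
}

const ast = parsePath(tokens);
console.log(JSON.stringify(ast, null, 2));
```

The `lookahead`/`lex` split mirrors the peek-then-consume pattern in the excerpt above: the parser decides which rule applies before it commits to consuming a token.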
5. Implement code translation
Translate the parsed AST into executable code in a target language (e.g., JavaScript). This involves traversing the AST and generating code that performs the specified JSON transformations as defined by the DSL.
The final step converts the descendant selector expression (AST) into JavaScript code, turning the structured representation of the selector into code that can be executed against the input data.
JAVASCRIPT
/**
 * Translates a selector expression with descendant operator (..) into executable code.
 *
 * @param {SelectorExpression} expr - The selector expression containing the descendant operator.
 * @param {string} dest - The variable name to store the final result.
 * @param {string} baseCtx - The starting context for traversing descendant properties.
 * @returns {string} The generated JavaScript code representing the translation.
 */
function translateDescendantSelector(
  expr: SelectorExpression,
  dest: string,
  baseCtx: string,
): string {
  const code: string[] = []; // Array to store generated code lines

  // Acquire temporary variables for the translation process
  const ctxs = this.acquireVar();
  const currCtx = this.acquireVar();
  const result = this.acquireVar();

  // Initialize the result variable to an empty array
  code.push(JsonTemplateTranslator.generateAssignmentCode(result, '[]')); // Call a helper function to generate assignment code

  // Extract the property from the selector expression (if any)
  const { prop } = expr;
  const propStr = CommonUtils.escapeStr(prop?.value); // Escape the property value for safe string inclusion

  // Push initial code to set up the context list
  code.push(`${ctxs}=[${baseCtx}];`); // Assign the base context to the contexts list

  // Loop through contexts while there are more to process
  code.push(`while(${ctxs}.length > 0) {`);

  // Shift the current context from the list
  code.push(`${currCtx} = ${ctxs}.shift();`);

  // Handle empty contexts (skip if empty)
  code.push(`if(${JsonTemplateTranslator.returnIsEmpty(currCtx)}){continue;}`); // Call a helper function to check for emptiness

  // Handle context being an array (queue its elements for later processing)
  code.push(`if(Array.isArray(${currCtx})){`);
  code.push(`${ctxs} = ${ctxs}.concat(${currCtx});`); // Concatenate the array elements to the contexts list
  code.push('continue;'); // Skip to the next iteration
  code.push('}');

  // Handle context being an object (process its properties)
  code.push(`if(typeof ${currCtx} === "object") {`);
  const valuesCode = JsonTemplateTranslator.returnObjectValues(currCtx); // Call a helper function to get object values
  code.push(`${ctxs} = ${ctxs}.concat(${valuesCode});`); // Concatenate object values to the contexts list
  if (prop) { // If there's a property in the selector
    if (prop?.value === '*') { // If the property is a wildcard (*)
      code.push(`${result} = ${result}.concat(${valuesCode});`); // Concatenate all object values to the result
    } else { // If the property is a specific key
      code.push(`if(Object.prototype.hasOwnProperty.call(${currCtx}, ${propStr})){`); // Check if the property exists on the object
      code.push(`${result} = ${result}.concat(${currCtx}[${propStr}]);`); // Append the property value to the result
      code.push('}');
    }
  }
  code.push('}');

  // If no property was specified, add the entire current context to the result
  if (!prop) {
    code.push(`${result}.push(${currCtx});`);
  }

  // Close the loop
  code.push('}');

  // Flatten the final result array (remove nested arrays)
  code.push(`${dest} = ${result}.flat();`);

  // Join all code lines and return the generated code
  return code.join('');
}
This code translates a selector expression containing the descendant operator (..) into executable JavaScript code. The generated code works through a list of contexts, starting with the provided base context, in breadth-first order. If a context is an array, its elements are appended to the context list for later processing. If it’s an object, its property values are appended to the context list in the same way. The code also considers the property specified in the selector: if it’s a wildcard (*), all object values are included in the result; otherwise, only the value for the specific property key is included. Finally, the code flattens the result array to remove any nested arrays and stores it in the destination variable.
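The same idea can be tried end to end with a much smaller sketch. The `translateDescendant` function below is a hypothetical, reduced translator: it handles only a named key (no wildcard), hard-codes its variable names instead of acquiring temporaries, and uses `new Function` to turn the generated source into a callable function, which is an assumption of this example rather than a description of how the engine executes its output.

```javascript
// Hypothetical, reduced translator: emits JavaScript source for a `..key` lookup.
function translateDescendant(key) {
  const k = JSON.stringify(key); // safely quote the key for inclusion in source code
  return [
    'const result = [];',
    'const queue = [input];',
    'while (queue.length > 0) {',
    '  const current = queue.shift();',
    '  if (current === null || typeof current !== "object") continue;',
    '  if (Array.isArray(current)) { queue.push(...current); continue; }',
    '  queue.push(...Object.values(current));',
    `  if (Object.prototype.hasOwnProperty.call(current, ${k})) {`,
    `    result.push(current[${k}]);`,
    '  }',
    '}',
    'return result.flat();',
  ].join('\n');
}

// Compile the generated source once; the function can then be reused for every input.
const source = translateDescendant('name');
const findNames = new Function('input', source);

const company = {
  employees: [
    { name: 'Asha', manager: { name: 'Ravi' } },
    { name: 'Lee' },
  ],
};

console.log(findNames(company)); // [ 'Asha', 'Lee', 'Ravi' ]
```

Because the traversal is breadth-first, the two top-level employee names appear before the nested manager name.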
Below is the code generated for the expression `.employees..name`. It has been modified from the original generated code for better readability.
JAVASCRIPT
// This function takes an input object and processes its 'employees' property to extract names into an array.
function extractEmployeeNames(inputObject) {
  let result; // Initialize variable to store final result
  let currentObject; // Temporary variable for iterating over input object
  let employeesArray; // Temporary variable for storing 'employees' array
  let i; // Counter variable for looping over input object
  let j; // Counter variable for looping over 'employees' array
  let currentEmployee; // Temporary variable for each 'employee' object
  let extractedNames; // Temporary variable for storing names extracted from 'employee' object
  let collectedNames; // Array to collect extracted names
  let queue; // Queue for BFS traversal of objects
  let currentQueueItem; // Temporary variable for BFS traversal
  let tempNames; // Temporary array to collect names during traversal

  // Initialize result with the input object
  result = inputObject;

  // Initialize collectedNames as an empty array to collect extracted names
  collectedNames = [];

  // Assign currentObject to input object
  currentObject = result;

  // Check if currentObject is not null or undefined
  if (currentObject !== null && currentObject !== undefined) {
    // Convert currentObject to an array if it's not already one
    currentObject = Array.isArray(currentObject) ? currentObject : [currentObject];
  }

  // Loop through each item in currentObject
  for (i = 0; i < currentObject.length; i++) {
    // Assign employeesArray to 'employees' property of current item
    employeesArray = currentObject[i]?.employees;

    // Continue if employeesArray is null or undefined
    if (employeesArray === null || employeesArray === undefined) {
      continue;
    }

    // Loop through each item in employeesArray
    for (j = 0; j < employeesArray.length; j++) {
      // Assign currentEmployee to current item in employeesArray
      currentEmployee = employeesArray[j];

      // Initialize tempNames as an empty array to collect names
      tempNames = [];

      // Initialize queue with currentEmployee in an array
      queue = [currentEmployee];

      // Perform BFS traversal on queue until it's empty
      while (queue.length > 0) {
        // Pop the first item from queue
        currentQueueItem = queue.shift();

        // Continue if currentQueueItem is null or undefined
        if (currentQueueItem === null || currentQueueItem === undefined) {
          continue;
        }

        // If currentQueueItem is an array, concatenate it with queue
        if (Array.isArray(currentQueueItem)) {
          queue = queue.concat(currentQueueItem);
          continue;
        }

        // If currentQueueItem is an object, extract values and filter out null or undefined ones
        if (typeof currentQueueItem === "object") {
          queue = queue.concat(Object.values(currentQueueItem).filter(v => v !== null && v !== undefined));

          // If 'name' property exists in currentQueueItem, add it to tempNames
          if (currentQueueItem.hasOwnProperty('name')) {
            tempNames = tempNames.concat(currentQueueItem.name);
          }
        }
      }

      // Flatten tempNames and assign it to extractedNames
      extractedNames = tempNames.flat();

      // Continue if extractedNames is null or undefined
      if (extractedNames === null || extractedNames === undefined) {
        continue;
      }

      // Push extractedNames into collectedNames
      collectedNames.push(extractedNames);
    }
  }

  // If collectedNames has only one element, assign that element to collectedNames
  collectedNames = collectedNames.length < 2 ? collectedNames[0] : collectedNames;

  // Assign result to collectedNames
  result = collectedNames;

  // Return the final result
  return result;
}
Conclusion
Building a DSL at RudderStack empowered our engineering team to simplify complex workflows and scale our efforts in building and managing hundreds of integrations for our Customer Data Platform. This guide covered the process we used to craft a domain-specific language (DSL) for JSON transformation and build a tailored solution to streamline data integration challenges.
We covered everything from understanding the need for a DSL to implementing lexing, parsing, and code translation. Following this guide, you can create your own custom DSLs to address specific domain requirements.