Best Practices for Accessing Your Data Warehouse
A lot of the data businesses generate every day, such as emails, product analytics events, or customer data, needs to be analyzed for business intelligence purposes. In order for this data to be useful, it needs to be in a format that’s suitable for analysis. A data warehouse is frequently used to address this need—it stores data in a specified model that makes it readily available for analysis. A data warehouse is usually accessed by the company’s employees through business intelligence tools, internal applications, dashboards, and sometimes directly via SQL queries.
In this article, you will learn:
1. How data is stored in a data warehouse
2. How it’s accessed3. Best practices for writing SQL queries for the data warehouse
4. Who the major cloud data warehouse vendors are
5. The various types of data storage models available
Background on Data Warehouses
A data warehouse is a repository where data can be organized into a specific data model, cleaned, transformed, and stored. Consider the following two use cases:
- Suppose you work for a company that owns a SaaS product. The application is an aggregator where customers can buy tickets for multiple airlines, trains, and buses. Since the tickets are ultimately purchased from the individual transit vendors, you must connect to the vendors, extract the data from them, clean and transform it into a data model that is accepted by your company, and then store it in a data warehouse. Once the data is in the data warehouse, it can be used to inform your product and to fuel downstream applications for machine learning or data visualization.
- Alternatively, consider you work at a banking company. The company collects data about customers, including personal information, transaction history, credit card activity, and loans. The company might use a data warehouse to consolidate this data in one place. Once in the warehouse, it could run machine-learning algorithms on top of the warehouse data to extract insights such as credit risk or to raise alerts about potentially fraudulent transactions.
Data Access Methods
There are multiple ways for users to query or access the data stored in a database or data warehouse.
Direct Queries
If a user has direct access to the database or a table within the database, then they may write a SQL query that can read and display data from the table. However, it’s important that they optimize these queries correctly. Unoptimized queries cause the execution to take longer, affecting the performance of the database and increasing cost. The cost consideration is especially important in a warehouse solution like Google’s BigQuery which charges by query execution time.
Intermediate Tables
A secure way of accessing data in large operational databases is by making intermediate tables. These tables run on separate instances from the primary database and contain a subset of the primary database’s data. Accessing data this way eliminates the risk of running unoptimized queries on the primary database or of accidentally updating or deleting any records. Generally speaking, warehouse tooling should not have edit or delete permissions on a production database.
Views are wrappers on SQL queries that are ephemeral in nature. They can be used as tables while writing queries on top of them, but they are not stored in the warehouse and are materialized only when a query runs on them. Some warehouses do optimizations on how views are treated by caching previous results and using them if data in the underlying tables don't change. Views are a great way to break down complex queries into manageable chunks.
Materialized views are similar to views in that they're the result of SQL queries on top of other tables. Unlike views, though, they're materialized. This means the data is stored, so they come with additional storage costs. The advantage of materialized views is that if they are used in multiple downstream queries, they don't need to be recreated every time.
Access Management
Access management involves assigning users to specific groups, then assigning group-specific roles to those groups. Individual users aren’t granted a specific role, they inherit the roles of the group to which they belong. This is especially helpful if you have a large user base. As the number of users grows, granting and maintaining data access for each user becomes a complicated, time-consuming task. This can lead to problems like lapses in needed access or users who have more access than they need. A group-based access policy can efficiently solve these problems by ensuring each individual has access to what they need and nothing else on a group level instead of individually.
In addition, 3rd party tools that connect to your data warehouse, such as BI and visualization tools, can each be seen as their own group since the various end users of these tools all indirectly use the same access credentials (the one used by the tool to connect to the warehouse).
Best Practices for SQL Queries
In this section, you will learn about some of the important aspects and best practices to consider while writing SQL queries.
Common Table Expressions (CTEs)
When authoring SQL queries, developers often come across a situation where they need to create a temporary dataset that's then used to calculate the final results. While this approach could be achieved by introducing a view and using the view as an intermediate layer, most SQL databases provide a feature known as a common table expression, or CTE, that allows you to achieve the same result more easily.
CTEs can be used multiple times within the same query, but the results of the CTE are only available until the query has finished executing. A CTE can be used in a view, a stored procedure, or while writing complex ETL transformations based on SQL.
Data Sanitization
Data Sanitization is the practice of deleting any special characters from within your user input. For example, your SQL query might contain characters like \, #, $, @, %, and &. These characters can be misinterpreted by the query engine and result in undesirable code execution. Unsanitized inputs also cause SQL injection vulnerabilities. Refer to this article for more information about about data sanitization and the consequences of unsanitized data.
Indexing
Indexing a database means properly organizing data to maximize access to the columns that are most frequently used. Indexing is used to boost the performance of SQL queries, but it comes at the cost of storage space. The more indexing there is, the more storage space required, and the more it will cost. When indexing a database, it’s important to be thoughtful about the tradeoffs involved and select only the necessary fields to index. For more on database indexing, refer to this answer on StackOverflow.
Batching
Batching in SQL queries involves grouping similar queries in a batch, then executing them together. When you execute multiple SQL queries one by one, each of those queries begins a transaction in the database, manipulates the data as required, and then commits the transaction. This creates multiple commits to the database, since each query commits. In a batch, however, all the queries in the batch will be executed, and only when they’re all done will the transaction be committed. In some cases, this is used to improve database performance.
Explain Plan
When you write and execute a SQL query, a plan is generated by the database engine which tells us how the query is will get executed in the database. You can take a look at this by using EXPLAIN PLAN commands. This allows you to rewrite your queries to optimize their execution. Once you execute the query, the plan is cached and will be used when you run your queries in the future.
How to Adopt Modern Data Warehousing Tools and Access Data
Modern data warehousing tools provide cloud-based services that allow customers to get started easily without worrying about infrastructure setup and maintenance. This means businesses can focus on more important things, such as the development of the warehouse, the data pipelines, and implementing the business logic free from worry about the backend infrastructure, which is maintained by the cloud vendor. The following are four of the most popular cloud data warehouse vendors:
- Snowflake
- Amazon Redshift
- Google BigQuery
- Microsoft Azure Synapse Analytics
Let's look at each of these offerings in more detail.
Snowflake
Snowflake is a scalable, fully managed cloud-based data warehouse solution. You can implement your Snowflake instance on any public cloud vendor, including AWS, Azure, and GCP. This allows you to easily integrate Snowflake with the cloud provider you already use, keeping the overall experience the same. One of the things that makes Snowflake stand out is its separation of compute nodes from storage nodes. This enables them to decouple the two services and charge for higher cost compute services only when queries are being executed.
Snowflake also supports the storage of both structured and semi-structured data in its database schemas. This approach made Snowflake one of the first data warehouse systems capable of dealing with JSON-like data. Snowflake also offers several security measures to protect your data. Since it’s cloud-based, you can control access to data by whitelisting IP addresses that are allowed access. It also offers both single sign-on and user-based multi-factor authentication. These factors make Snowflake one of the most popular data warehouse solutions on the cloud.
Amazon Redshift
Amazon Redshift is an easy choice if customers are already building their infrastructure on the AWS cloud. Because it’s part of the AWS ecosystem, Redshift offers excellent integration with other AWS resources, including S3, Lambda, Kinesis, and Athena. This means that when customers want to build a customized data warehouse to fit their needs, integrating Amazon Redshift is very easy. One of the most commonly used features of Redshift is the ability to directly move data between S3 buckets and Redshift using the copy and unload commands.
Amazon Redshift is a column-oriented database, which contrasts with the traditional row-oriented databases. This means Redshift will perform aggregate calculations faster and in a more optimized way. It also leverages Massively Parallel Processing, or MPP, which allows it to execute large data jobs in parallel within the distributed cluster. AWS Redshift is also compliant with major privacy laws, such as the GDPR, HIPAA, and the CCPA. Though it’s worth noting, just because they are compliant does not mean your organization will be automatically.
BigQuery
BigQuery, a Google offering, has been popular since its inception. It allows users to use a petabyte-scale data warehouse on the cloud. Some of the main draws of BigQuery are that it is multi-cloud, able to run machine-learning models within your data warehouse, and conducts analysis for spatial or geographic data. A big benefit of BigQuery is its native integration with other Google services like Google Analytics 4, Google Cloud Storage, and Google Data Studio. It’s also completely serverless, meaning you don’t have to manage storage or compute nodes or have “unused capacity”.
Azure Synapse Analytics
Azure Synapse Analytics is the cloud-based data warehouse offering from Microsoft Azure. It’s especially appealing to customers who are already using the Microsoft ecosystem. Azure Synapse Analytics provides easy integration with machine-learning tools within the Azure environment, which helps build analytics solutions faster. Data access in Azure Synapse Analytics can be managed by integrating it with the Azure Active Directory service, which allows SSO-based user access to the platform’s data.
Data Models for Optimizing Data Access
When storing data, the very first thing that customers seek is a database where they can store data in the form of tables. A table is comprised of rows and columns, similar to a spreadsheet. However in a database table columns have enforced data types and tables have relations to other tables. This way of storing and accessing data is called the relational database management system (RDMS). Relational databases have been popular since the earliest databases, and many essential business applications still run on them. However, since the rise of Big Data, the velocity and variety of data that are generated has skyrocketed. Relational databases are capable of storing structured data, but they are not suited for unstructured or semi-structured data. In the following section, we’ll look at some other popular data models that can be used to store unstructured or semi-structured data.
Document Store
A document store, or a document database, is a system of storing information in documents, rather than in rows, as in relational databases. Documents are organized in collections, and a collection can store multiple documents. A document in a document database is usually stored in JSON format. It’s easy to add new documents, and documents can be updated or deleted using a query language. Some popular examples of document databases are MongoDB, DocumentDB by Amazon, and Cosmos DB by Azure. Most of these vendors provide an API that can be used to query and access the data using simple ANSI SQL statements.
In order to optimize data access in a document database, the database should be partitioned and indexed properly. Another important practice is to define the schema for the database, as the schema exists within your database and should be as compact as possible.
Advantages:
- Schemaless: Document databases are schemaless, meaning you are free to choose the schema on write for each of the records that you store.
- No foreign-key constraints:* In a document database records are not related to other records with a foreign-key constraint, so document databases can be independent of other tables and records.
Disadvantages:
- Schemaless: Since document storage systems don’t enforce a schema, organizations run the risk of creating ever more complex, nested data structures that reduce efficiency and can make data analysis unreliable.
Key-Value
A key-value database stores data in key and value pairs. This is a non-relational database that’s used to store data that can be identified using a row identifier. The usual format of storing data is JSON, and as such, nested JSON documents can easily be stored in a key-value database. While accessing the data, you can use the key to directly query the object, and it will return the data that it holds. One of the use cases for key-value stores is that it stores the session information of users who log into your website. They’re also commonly used for storing product details, or managing e-commerce shopping carts. One of the most popular key-value stores is Redis which offers both a hosted cloud solution as well as an open source version.
Advantages:
- High throughput: With key-value databases, users can obtain a very high throughput while reading data. While in some analytical data stores, it may be acceptable for query execution to take several minutes, production-ready systems require quicker response times.
Disadvantages:
- Data Persistence: Key-Value stores are often in-memory, meaning they do not persist data on disk by default. While they can be configured to dump snapshots or logs to a disk, they are not set up to do this by default.
Graph Databases
Graph databases are a special type of database that store data in a graph-based semantic model using nodes, edges, and properties. Nodes are connected to each other by edges, which store information on how nodes are related to each other. Graph databases are similar to how many people might express connections between entities on a whiteboard, or in a family tree. You might also consider a social network like LinkedIn, where there are many users, each of them connected to some—but not all—of the other users. This relationship can be illustrated by a graph database: the nodes represent the different users, and the relationships between the users are clarified by the edges. Popular graph databases in today's market are Neo4J, Apache Cassandra, Amazon Neptune, and Azure Cosmos DB. Learn more about graph databases in this episode of The Data Stack Show.
Advantages:
- Schemaless: Graph databases are schemaless, which provides flexibility while storing data.
- Performance: These databases offer high performance, as there are no relationships to handle.
Disadvantages:
- A different language: Graph databases often require different query languages and API connectors. This means that they often require more training as it’s difficult to find staff that have extensive experience with these languages.
Metadata Management
In simple terms, metadata means information about your existing data. When you know what kind of data you have, it becomes much more intuitive and fun to interact with your data. Metadata management deals with the management of your existing data sources and targets. It involves setting proper definitions, correct access policies, and guidelines for data governance. For example, your policies might relate to the following information:
- How is this data created?
- How is it transformed and stored?
- What is the business definition of this column?
- How is a KPI calculated, and what are the underlying fields used to define the KPI metric?
- Who owns the data, and how should it be accessed?
These questions can be used as a starting point for managing your metadata. This information can lead to building operational data catalogs that can be used as a single source of reference for all of your metadata.
Advantages:
- Operational excellence: Maintaining your metadata will allow users to easily find and access the data they need for various purposes.
- Less confusion: Users will have less confusion while using metrics because definitions are readily available in the data catalog.
Programming Languages for Accessing Data Warehouses
Data warehouses, in simple terms, are database solutions designed with a different approach in terms of schema and tables. Most of the data warehouse solutions can be built using the native ANSI-SQL commands. While dealing with data warehouses, you can use SQL commands to write Data Definition Language (DDL) statements for things like creating the databases, tables, and views. You can also use Data Query Language and Data Manipulation Language statements that allow you to select, insert, delete, or update data from various tables, and Data Control Language statements that allow you to control how data can be accessed by different users. For example, you can use the grant or revoke SQL commands to adjust permissions for users, groups, and data warehouse assets.
In addition to direct access via SQL commands, many programmers and developers use Python and R due to their fantastic support for libraries in many fields, including data science, analysis, statistics, and machine learning. For example, Snowflake offers a Python connector that allows you to directly access your data warehouse tables from your Python code. Amazon Redshift allows JDBC connections that allow users to connect to the Redshift instances using their R console. Mode’s BI Solution which connects directly to your warehouse, allows you to write complex data processing notebooks that run off your warehouse data and query results. This makes accessing the modern data warehouse solutions easier.
Conclusion
In this article, you’ve learned all about different kinds of cloud data warehouses, their optimal use cases, and the best ways to access the data in them. Data warehouses are a great way to support your complex analytical workloads.
When making foundational decisions about your data warehouse such as what size to use and how to structure access to it, it can be helpful to “think backwards” and start with your desired end users requirements. For example, if you expect to use a lot of Python or R to perform complex analysis on a large set of records, a key-value store is not going to work well for you. However if you frequently have to access user data based on a key such as their userId, a key-value store is exactly what you need.
Build a data pipeline in less than 5 minutes
Create an accountSee RudderStack in action
Get a personalized demoCollaborate with our community of data engineers
Join Slack Community