Star Schema: Do I Need One for My Data Warehouse?


Share this post:

Businesses are now responsible for managing more data than ever: at last count, an average of 163 terabytes (163,000 gigabytes). Organizations store much of this information in data warehouses: specialized data management systems that pull data from various sources and that are used for tasks in business intelligence, analysis, and reporting.  To get that data into data warehouses, businesses use a variety of data integration platforms and ETL (Extract, Transform and Load) processes.

If you’ve been learning about data warehouses for your business, you might have run across the term “star schema.” What is a star schema exactly, and do you need a star schema for your enterprise data warehouse?

What is a Star Schema?

starfish
This is a starfish, not a star schema

To define a star schema in data warehousing, we first need to go over some key database terms and concepts.

Star Schema: Terms and Concepts

There are two main types of data management systems: operational and reporting.

  • Online transaction processing (OLTP) systems are used for lightning-fast queries in response to live business processes.
  • Meanwhile, online analytical processing (OLAP) systems are used for reporting and analytics, which allows for some sacrifices in performance in exchange for better insights and business intelligence.

Data warehouses are OLAP systems, allowing users to access and analyze information from multiple sources simultaneously. Designing a data warehouse requires users to be familiar with the concept of dimensional modeling, which is an approach that optimizes the data warehouse for online queries. Dimensional modeling involves the concepts of fact tables and dimension tables.

Fact tables are used to store facts, which are data that will be included in reports and that are used as the basis of business decisions. Facts are usually numbers, such as the number of customer purchases or the number of sales made during a given quarter.

A fact table will contain a grouping of related facts. For example, a fact table about customer purchases may include facts such as the number of purchases and the amount of each purchase. Because they contain numeric quantities, fact tables are usually intended for easy and rapid mathematical manipulation: e.g. the total revenue for a particular store or product.

Dimension tables, on the other hand, store dimensions, which are attributes that qualify and provide more information about facts. In our retail example, some examples of dimension tables could be geographical locations, items, and customers. Each dimension table will then contain more specific information: e.g. IDs, cities, countries, and geographical coordinates for a dimension table that contains geography data.

Fact tables and dimension tables frequently reference each other. A fact table usually contains a number of primary keys that then form their own entries in the dimension tables. For example, a fact table about a sale may contain the primary keys of the items that were sold. These primary keys are then referenced in the dimension tables, which contain more specific information such as the item’s name, size, and brand.

Now that we’ve established this crucial database terminology, let’s discuss how they apply to the concept of a star schema.

Star Schema: Definition

In database management systems, a schema is the structure of the database that describes how objects and information in the database are logically grouped and connected. A “star schema” gets its name from the fact that its diagram resembles a star, with various points radiating outward from the center.

As described above, there are two different types of components in a star schema: fact tables and dimension tables. Each “star” in a database star schema consists of a fact table at the center, with multiple dimension tables attached to the central fact table.

For example, suppose that we have a star schema for our data warehouse that stores customer order history. The fact table at the center of the star schema will contain data such as the product ID, customer ID, and price.

The price is a simple numerical quantity that will not be referenced in a dimension table; however, the two IDs will serve as primary keys that are referenced in their corresponding dimension tables. Radiating outward from the fact table, we will have two dimension tables for products and customers. For example, the customer dimension table will contain attributes such as the customer’s name, gender, age, date of birth, and location.

star schema
An example of a Star Schema

Why Do Businesses Use Star Schemas?

We’ve defined the concept of a star schema in the section above—so why would you want to use one for your data warehouse project?

Star schemas are the simplest and most widely used form of data warehouse schema, which makes them a good choice for data warehouses that aren’t overly complicated. The star schema is less complex to understand and tends to involve fewer joins than other data warehouse schemas, which makes it optimized for querying large data sets.

One thing to note about star schemas is that they are denormalized. Essentially, this means that the schema contains redundant data records in order to improve the data warehouse’s speed and performance. More specifically, the schema is normalized in terms of the fact tables, but denormalized in terms of the dimension tables.

businesses using reporting
A data warehouse with a star schema can help produce insightful reports

Is a Star Schema Right for Your Data Warehouse?

Denormalized schemas such as star schemas offer performance advantages, but they have drawbacks as well. Because the schema is denormalized, data integrity is weaker than with a normalized schema.

For example, the geography dimension table in a star schema of moving companies may contain attributes representing city, state, and country. However, there is no way to enforce the relationship between these attributes; the data may contain the correct city but the incorrect state and country.

Star schemas are a highly popular choice for building data warehouses, but they’re by no means the only option. Two other options for your data warehouse schema are described below.

Snowflake schemas are an extension of the star schema model that include additional dimension tables; they are named after the intricate structure of a snowflake when viewed up close. The dimensions in a snowflake schema are normalized into multiple related tables.

This makes the snowflake schema a better choice than the star schema if you want your data warehouse schema to be normalized. However, complex joins mean that the performance of the snowflake schema is generally worse than the star schema.

Galaxy schemas, also known as “fact constellation schemas,” are complex data warehouse schemas that combine several different fact tables that may share one or more dimension tables. Of course, the name “galaxy schema” comes from the fact that galaxies are a collection of stars.

Because they share dimension tables, using a galaxy schema can reduce the size of the data warehouse. Galaxy schemas are often used to model multiple related subjects, versus the more singular focus of the star and snowflake schemas.

Final Thoughts

Star schemas are the simplest and most popular way of organizing information within a data warehouse. However, alternatives to the star schema, such as snowflake schemas and galaxy schemas, exist for users who will get more benefits from modeling their data warehouse in a different way.

Being able to get the right data-driven insights at the right time from your data warehouse is crucial. Whether you use a star schema for your data warehouse or a different model entirely, SQLBot allows you to get automated SQL reports in your email or Slack account, so that the information you need is always at your fingertips.

Try SQLBot - It's Free!

Sign up for a free account on SQLBot to see how easy it is to get SQL reports into your Slack.  What are you waiting for, it's free!

About the Author

undefined

David Tidmarsh is a freelance software and technology writer and graduate student in computer science who specializes in artificial intelligence and big data. He's passionate about translating complex tech issues into clear, comprehensible language.