Working with Nested and Repeated Fields in BigQuery

The What, Why, and How of BigQuery Nested and Repeated Data. By Fassil S. Yehuala and Sanyukta Suman

Bildschirmfoto 2022-10-25 um 12.36.02.png

If you have worked with nested and repeated columns and found them a bit challenging, then this article is for you. In this blog, we explain the nitty-gritty of such kinds of fields in BigQuery. We will discuss in detail why they are preferred in data warehousing and how to create and query them. Let's start by defining what nested and repeating fields are.

What is Nested data?

In relational databases, data are organized in a way that relationships between key data points can easily be identified. For example, if we want to analyze the purchasing behavior of a customer, we may derive information from three tables containing information about the customer, products, and order transactions. A single product in the products table can have multiple rows in the orders table, which makes it difficult to put the data in a single row. The process of organizing relational databases in a way that minimizes data redundancy and enhances data integrity is called normalization.

For data warehousing, data analysts and engineers often are interested in taking the denormalization route and combining several separate tables into one large reporting table without any data redundancy. Google’s BigQuery offers a way to handle these processes using nested and repeated fields. 

 

Let’s have a look at an example of nested and repeated data. The data consists the following fields:

  • id

  • name

  • location  

    • location.city

    • location.country

Here is what the JSON data file of the above example would look like.

{"id":"1","name":"Xavier Williams", “location":["city":"Berlin","country":"Germany"]}

{"id":"1","name":"Crystal Wolf", “location":["city":"Munich","country":"Germany"]}

Basically, the nested data are multiple fields within each location and the repeated data are the multiple locations.

Why Nested and Repeated Fields?

Now let's take a closer look at nested and repeating fields in BigQuery, and then see why this is such a popular design for data analysts and engineers. We'll illustrate this with an e-commerce example running on Google Cloud. This company stores transactional orders in BigQuery in the “orders” table. 

  • Each order contains information on the product and the customer who placed the order

  • An order can have one or more products

  • Data that provides detailed information about the customers who placed the particular order

As a data analyst or engineer, how can you efficiently store this disparate information in your data warehouse? Note that the company needs to support a large user base that requests large amounts of data on a daily basis.

As shown in figure 1 below, one method is that we can store each fact in one place, as is common in relational database management systems. So what are the disadvantages of this method? When working with communications programs, our data from different locations need to be brought together for analysis. This requires lots of joining. Joining, especially in the case of very large tables, means a high analysis effort and is therefore expensive. This is because DBMS are record-based and they must fully open each record and extract the join key from the table it is joined to.

Bildschirmfoto 2022-10-25 um 12.36.20.png

Normalized Schema

Another approach is that we can go the fully denormalized route and store all levels of granularity in one big table. Although it speeds up data reading, this option has several disadvantages. In the schema shown in Figure 2, information common to each order is repeated. That means we have to be very careful when analyzing data at different levels of granularity. We must be careful not to double or triple count when trying to calculate.

Bildschirmfoto 2022-10-25 um 12.36.31.png

Fully Denormalized Schema

This is where nested and repeated data fields come in handy. We can have a row for each order and repeated values in a row for more detailed data. For example, the repeated and nested table below clearly shows that there are only three rows for three unique order IDs. 

Bildschirmfoto 2022-10-25 um 12.36.44.png

Denormalizing Data with Nested and Repeated Fields

If you are looking at a nested and repeated table, like the one in figure 3, for the first time, you might be wondering about column names like “orders.price” and ask yourself "What's a period doing in those column names?"  Dotted columns are called STRUCT or structured data types in SQL. We can think of them as pre-joined tables within a table. Instead of having a separate table for products, orders, and customers, we can just put all of them on a big table.

Bildschirmfoto 2022-10-25 um 12.36.54.png

STRUCTs in BigQuery Schema

By using STRUCT we simplify the query since the data is conceptually pre-joined. The column-based nature of BigQuery makes having a huge table with repeating and nested fields the most efficient and cost-effective option. For example, if we want to know the total quantities ordered, BigQuery returns the desired result by accessing the “order.quantity” column without having to look at other columns. The ability to have lots of data in one place and be able to easily analyze it without being hampered by the problem of multi-granularity is a very desirable feature for analysts.

STRUCTs are containers in which multiple fields and data types can be nested. ARRAYs can be one of the field types within a STRUCT. Notice all the empty space between the rows, that's because some columns are at a deeper level of granularity – meaning there are multiple duplicate values for those columns. An ARRAY is the perfect data type to handle that repeated value and store that data in a single row. In a schema, ARRAY data types are identified as repeating by observing their mode in the table schema.

Bildschirmfoto 2022-10-25 um 12.37.03.png

ARRAYs in BigQuery Schema

One thing to note here is that STRUCT and ARRAY data types in SQL are independent of each other. A table may have an ARRAY column that has no relation to any structure. Also, there can be a STRUCT with no ARRAY fields. We want to use them both at the same time because we want to take advantage of their unique features. While an ARRAY allows us to go deep into different levels of granularity, STRUCT helps us to go wide by pre-joining related fields together.

When to Use Nested and Repeated Fields?

The traditional relational database system is advantageous when we have tables smaller than 10 gigabytes. Therefore it is better to keep the tables normalized. The smaller the size, the less impact JOIN has on performance. However, as data grows in size, so does the performance impact of a JOIN. In this case, denormalizing our data would be the desirable option. As a rule of thumb, 10 gigabytes is taken as a threshold.

How to Create Nested and Repeated Fields?

Let's try the STRUCT syntax and create a table.

 

SELECT

  STRUCT('2254' AS id,

    'Sunglasses' AS name) AS products

Bildschirmfoto 2022-10-25 um 12.37.16.png

Since ‘id’ and ‘name’ are a subset of products, we have a dot notation.

 

Running the following query will create the repeated and nested table shown in figure 3. 

 

CREATE OR REPLACE TABLE

database.table AS (

SELECT 1 AS order_id,

DATE('2022-02-08') AS order_date,

    [STRUCT(1 AS quantity, 27.90 AS price),

      STRUCT(2 AS quantity, 15.90 AS price)] AS orders,

      [STRUCT('101' AS id, 'Xavier Williams' AS name, 'xwilliams@example.com' AS       email, 'Berlin' AS city)] AS customers,

    [STRUCT('2254' AS id, 'Sunglasses' AS name),

        STRUCT('4568' AS id, 'Scarf' AS name)] AS products

UNION ALL

SELECT 2 AS order_id,

  DATE('2022-02-07') AS order_date,

  [STRUCT(2 AS quantity, 27.9 AS price),

    STRUCT(1 AS quantity, 120.75 AS price),

    STRUCT(2 AS quantity, 29.90 AS price)] AS orders,

  [STRUCT('102' AS id, 'Crystal Wolf' AS name, 'wolfcrystal@example.com' AS email, 'Munich' AS city)] AS customers,

  [STRUCT('2254' AS id, 'Sunglasses' AS name),

    STRUCT('5487' AS id, 'Backpack' AS name),

    STRUCT('1558' AS id, 'T-Shirt' AS name)]

UNION ALL

SELECT 3 AS order_id,

  DATE('2022-02-07') AS order_date,

  [STRUCT(1 AS quantity, 120.75 AS price)

  ] AS orders,

  [STRUCT('103' AS id, 'Laura Brooks' AS name, 'laura11@example.com' AS email, 'Coburg' AS city)

  ] AS customers,

  [STRUCT('5487' AS id, 'Backpack' AS name)])

Bildschirmfoto 2022-10-25 um 13.03.20.png

How to Query Nested and Repeated Fields?

Querying nested and repeated fields is a little bit tricky. Let’s try to query the name of a product purchased by a customer with an ID 101. 

 

SELECT

  products.name AS product_name

FROM `data.table`

WHERE customers.id = "101"

 

The above query will throw the following error: 

 

“Cannot access field id on a value with type ARRAY<STRUCT<id STRING, name STRING, email STRING, ...>> at [4:17]”.

 

Since repeated fields (arrays) are stored in a single row, we need to first break them back into rows so that we can query them normally. To extract information from a repeated field in BigQuery, we must use a more exotic pattern. This is normally done using the UNNEST function, which converts an array of values from a table into rows. This unnest function helps to flatten the data, making accessing nested data easier. 

Now, let’s try to solve the previous error by using UNNEST.

 

SELECT

  p.name AS product_name

FROM

  `data.table` AS t1,

  UNNEST(t1.products) AS p,

  UNNEST(t1.customers) AS c

WHERE

  c.id = "101"

Bildschirmfoto 2022-10-25 um 12.37.36.png

Let’s see more examples. The following query will extract product_id and product_name from the nested and repeated data. 

 

SELECT

     order_id,

     p.id AS product_id,

     p.name AS product_name   

FROM `data.table` t1,

 

UNNEST(t1.products) as p

Bildschirmfoto 2022-10-25 um 12.37.57.png

Similarly, let’s extract order.price and order.quantity from the orders field

SELECT

     order_id,

     p.id AS product_id,

     p.name AS product_name,

 

     o.price,

     o.quantity    

FROM `data.table` t1,

 

UNNEST(t1.products) AS p,

UNNEST (t1.orders) AS o

Bildschirmfoto 2022-10-25 um 12.38.12.png

WITH t1 AS (

 SELECT

 

     order_id,

     p.id AS product_id,

     p.name AS product_name,

 

     o.price AS order_price,

     o.quantity AS order_quantity

    

FROM `data.table` t1,

 

UNNEST (t1.products) AS p,

UNNEST (t1.orders) AS o

)

 

SELECT

t1. order_id,

t1. product_id,

t1.product_name,

t1.order_price *t1.order_quantity AS revenue

 

FROM t1

Bildschirmfoto 2022-10-25 um 12.38.24.png

Now that you are familiar with the nested data and unnest function. Let’s calculate the most product sold from our database 

 

WITH t1 AS (

 

SELECT

  order_id,

     COUNT(o.quantity) AS Qty_ordered,

     SUM((o.price)*(o.quantity)) AS total_revenue

FROM

 `data.table` t1,

 UNNEST(orders) AS o

GROUP BY

 1

ORDER BY

 2 DESC

)

SELECT

t1.order_id,

t1.Qty_ordered,

t1.total_revenue

 

FROM t1

Bildschirmfoto 2022-10-25 um 12.38.34.png

That’s it! Now it’s your turn to use whatever you have learned to query complex data.

 

Here are a few open-source data for you to practice in BigQuery. 

Conclusion

This article introduced you to the steps required for BigQuery Nested Fields & Repeated Fields. Furthermore, you learned about what and how to create and query BigQuery Nested & Repeated Fields.

 

With your Data Warehouse, Google BigQuery live and running, you’ll need to extract data from multiple platforms to carry out your analysis. However, integrating and analyzing your data from a diverse set of data sources can be challenging and this is where datadice comes into the picture.


Visit our website to learn more about datadice and its data services.

Further Links

This post is part of the Data Analytics series from datadice enlightening GCP AI in building ML models and unfolding data handling, manipulation, and visualization techniques using Google Cloud Services.

 

Check out our LinkedIn account, to get insights into our daily working life and get important updates about BigQuery, Data Studio, and marketing analytics

 

We also started with our own YouTube channel. We talk about important DWH, BigQuery, Data Studio, and many more topics. Check out the channel here.

 

If you want to learn more about how to use Google Data Studio and take it to the next level in combination with BigQuery, check out our Udemy course here.

 

If you are looking for help setting up a modern and cost-efficient data warehouse or analytical dashboard, send us an email at hello@datadice.io and we will schedule a call.