From the course: Data Analytics with Google Cloud BigQuery and Looker Studio

Joining two datasets in BigQuery

- In the previous chapter, we learned the basics of how to connect BigQuery with Looker Studio and how to plot a line chart. In this chapter, we're going to learn how to make column charts, another common chart type. And for, this we're going to introduce dimensions. But now I'm going to first introduce the problem we're trying to solve. In the last chapter, we analyzed the global temperature trend and in this chapter we are going to dive a little bit deeper at the country level. So we have selected three countries and we will calculate their average temperature for each decade and see how they compare in a column chart. Here comes the first question. If we go back to the GSOD dataset we have been using, you may notice that there isn't a country field or any location fields in the schema. There is, however, a station ID. That's the ID of the location where the observation data was taken. How do we map the station ID to locations? Luckily, the GSOD data set comes with the reference table already. If we scroll all the way down to the bottom of the list, there is a table called stations. In the stations table, the USAF is essentially the station ID in the observation data and there is a country column too, which tells us where the data was taken from. The next we can use a join operation to map the station ID to locations. Essentially, we will look up the country name in the stations table based on the station ID in the GSOD table. Be careful that the join operations can be quite resource intensive, meaning that it takes a larger amount of compute power and time to complete. So here we will only try to do a join using the data collected from 1940s alone. Remember the wild card asterisk we used in chapter one. If we want to cover the tables contain 1940s data only, we just need to add an asterisk after 194, like this. We can also use the as syntax to give the table an alias. And alias is basically a shorter name, so it's easier when we need to refer to this table later in this query. We rename it as gsod. For the join operations, we need to type in join followed by the table name. Because here we want to join the stations table, we will go to the detailed tab of the stations table and copy the table name from there. Remember to put the table name between back ticks and we will give it an alias too. In BigQuery, the default type of join is an inner join. And then very importantly, we need to define on which fields this two tables should be joined together. Essentially, what is the linkage between these two tables? As we said before, we are going to look up the station ID in the gsod table which matches with the USAF field in the stations table and we will select the columns we care about. Remember, we mentioned BigQuery is a columnar database. This will allow us to scan this data. Because we joined the location data from the stations table, we can also add the country field in our select. And that's it. We can run this query. This query will only process about 40 megabytes of data but you will see it takes significant longer to run than the queries we did in chapter one which process more than one or two gig of data. And this is because as we mentioned, the join operation is much more resource intensive. In fact, if you go to the execution graph, you can see the majority of the time was spent on the join operation. Are the result what you expected? I think so. By joining two tables together, we now have the country field for each observation record. In the next video, we are going to know how to aggregate the data and use the countries as a dimension in order to create the charts we saw at the beginning.

Contents