You are currently viewing Data Warehouse in 5 minutes

Data Warehouse in 5 minutes


In a response to this tuit last week, one of you asked me how this Cloud Data Warehouse really works. Google Cloud’s solution is named BigQuery, and it is a DB focused in big amounts of data.

One relevant point in this DB, and the model that Google Cloud offers in many other products, is that you separate the data storage and its computation, so you don’t have to pay for a lot of virtual machines to store all the data if you are not actually querying it. The other point is that is a fully managed PaaS service, that you don’t need to install, administer or tune: you upload the data and begin using it.

The cost is splitted in 2 parts, as you can see in the pricing calculator. This example is a 10TB DB, that we query (full queries) 10 times per month, having a cost of 700$ per month. If you don’t queries one month it will go down to 200$:

To load data you have to go to BigQuery web interface , create a dataset, on the left, in this case it is already there, demos, and then create a table, which will show the following screen:

The data is loaded from a file, in this case from my local PC, but if the size were going to be big, we would do it from data already stored in the cloud. If we had data exported from another application, we would upload the files to Google Cloud Storage, and from there create the BigQuery table.

The next step is defining the table schema, to identify the different columns y their data type … a few options, and you can click on the blue button to “create table”:

This example is data from my gps and heart rate monitor, which is measuring my position and heartbeat rate during a bicycle ride, as we can see in the CSV file that we are using:

Once loaded we have our Data Warehouse up and running, and we can start querying the data with standard SQL sentences. For example, so see at what moments I was riding with my heart above 160 bpms:

If you wonder about the suitability of a web interface to perform this processing, yes, you can issue BigQuery commands using this toolkit, or call the API from your application:

The only issue with this example is that is was not really BigData, there were only 8000 lines in the CSV file. To test the power of BigQuery we can use some of the public tables that have information that we can query. For example, this table has more than 300GB, and 4.000 million rows, of Wikipedia visits information. This simple query allow us to see BigQuery in action:

As you can see it has read those 339GB in 7.6 seconds, which gives us a bandwith of 44GB/s. This performance depends on a number of factors, like table size, type of query, etc., this is just and example, and it is not bad at all for a Data Warehouse that would cost you 100$ a month (making 2 full queries a day). As a reference, this is the level of performance that you get from a node of some of the moderm in-memory data bases, with servers filled with a lot of memory and the complexity of the environment they use.

This is what I meant with Data Warehouse in 5 minutes.

If you want to learn more on BigQuery you can check this tutorial or the official documentation.




I work for Google Cloud, but this post are personal ideas and opinions

Leave a Reply