Querying datasets with 50+ million rows
If you are presented with a relational database table or a delimited file, what is the first thing you do? Query it! Fairly rhetoric, right? Well, one does not have be a data guru to know that data mining 101 starts with the ability to query a data set.
To understand the story behind the data one has to query it.
To illustrate this, we will be using a publicly available heathcare data set from healthdata.gov. We will look at the 2014 Medicaid/Medicare data set that contains the list of medical providers in the United States, along with attributes such as NPI (National Provider ID), Name, Address, Number of Beneficiaries, Medicare Payments Submitted and so on. A plethora of information!
Typically, with huge data sets, we usually start with a sample.
We are all familiar with opening that comma delimited data file (.csv file) with Microsoft Excel but we know that we are limited to a million or so rows, and even with files of those sizes, the Excel application slows down once you start filtering on the data. This can be very frustrating. In descriptive analytics, we understand that something has happened. In this case, we know what data we want to retrieve. For instance, I would be interested in data for providers in Florida, and would probably just download that file. This would cut down the data set to 1/50th of its size, (assuming the data is distributed equally for all the 50 states in the US) and make it just a bit more manageable.
But what if the analytics I want to do is more exploratory?
I want to be able to query the entire data set of providers for all the states in the US and restrict the provider type to be diagnostic radiology centers with at least 50 beneficiaries who are male and over the age of 70 and whose average beneficiary risk score is 2.0 or greater? All I have is a 14 GB file with 56 million rows? There is no convenient data set of this already available that I can just open using Excel. If I go a step further and load this into a MySQL database, the loading of the data set into the MySQL table is the easy part. Querying it is an entirely different ball -game. We are all too familiar with firing off the queries into oblivion and waiting for eons, literally for the results to return. Granted indexes may help to a certain extent, but as soon as the number of records turns to several millions, these tend to be ineffective in improving query performance. Often times, after 10 minutes, the user simply gets frustrated and just terminates the query.
So how does one explore such huge data sets?
The good news is there are multiple ways to do this, and we will take a look at a couple of options:
Use Apache Spark and write a program that will execute the above health care query. With the right-sized infrastructure, Spark will very quickly chew through the data and churn out the results. Great! But what happens if you want to update the risk score to 3.0? You have to rerun the Spark job with the new values assuming the program accepts the risk value as a parameter. So yes, I could be exploring the data if I know how to write and run Spark jobs. As a client user though, what if I still want to stick to a more familiar, and more flexible sounding SQL like interface so I can tweak the filter values or add more filters ? Enter our next option...
Use Amazon Web Services' S3 and Redshift. To explain very briefly, loading the data into Amazon's Redshift database tables allows the end user to use postgreSQL syntax to query the tables via any DB client tool such as DBVisualizer. If the tables are defined optimally, the query responses will be under a minute in most cases. The performance gains with this approach makes exploratory data mining very feasible. The steps to do this are as follows at a very high level:
Zip the data files using gzip locally. Upload these into a S3 folder. Data transfer into S3 is cheaper for smaller data sizes so make sure to gzip those mammoth files! Make sure to separate datasets by its future destination table into separate folders.
Provision a redshift cluster using SSD instead of HDD. SSD is better for compute, but is more expensive. So it is suggested that the cluster be alive only when performing exploratory data mining, to save on running cost.
Create redshift tables. The key to the performance magic are the SORTKEYs during table creation. It is important to make sure that the fields I want to query upon are in the set of SORTKEY(s). I prefer to use INTERLEAVED instead of COMPOUND so I can choose to use all or some of the columns in the SORTKEYs. Another parameter to pay attention to is the DISTSTYLE that controls how the rows are distributed to the slices in each node. Smaller tables that are lookup tables participating in a JOIN operation with the larger tables can use ALL. But for larger tables, ALL would require storage equivalent to the table size in each node leading to excessive cloud usage costs! In this case, we want to use the EVEN or ODD or choose to go with the distribution key or DISTKEY option that will allow data to be distributed across the different nodes based on the distribution key.
Once the tables are created, use the COPY command to load the data from the respective S3 folder to redshift tables already created. To give you an idea, when using a dc1.large SSD cluster with 1 leader and 1 worker, a load of approximately 4GB of .gz files into a redshift table took around 18 minutes. In our case, we did chose a DISTSTYLE of ALL for a table with approximately 56 million rows and whose unzipped size is 14GB. As mentioned earlier, we did use INTERLEAVED SORTKEYs. We are allowed a maximum of 8 columns in an INTERLEAVED SORTKEY.
Write your query and execute it! Once the load is done, as long as the query's "where" clause parameters are columns in the SORTKEYs defined for the table, the results are returned within seconds rather than minutes or hours. Queries can be against a single table or a JOIN between multiple tables.
Repeat the above as needed. With the above approach, if I wanted to revisit the SORTKEYs, then all it will take is to create a new table and reload it with the data in S3. All of these steps can be executed by a devOps or a user trained to do just this, so they can successfully query their dataset.
So the above illustrates just one way of addressing the fast query performance that is very much needed for exploratory data mining. Furthermore, if the data is in Redshift and the queries tuned this way, it does not take much to expose the data via a REST API for custom applications or visualization tools to consume, OR even run spark jobs on the data already loaded for other data processing needs.