What is big data?
Put that search term into your favourite search engine and see how many different definitions you get back; it’ll be quite a few.
However, one of the most quoted is from a McKinsey study in 2011:
“‘Big data’ refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyse.”
The study continues with a number of caveats, and while others may define big data more precisely, McKinsey clearly states the problem that many of our enterprise customers have.
Consequently, we often hear –
“We have scaled up our database servers and tuned our schema and code as much as we can but we have outgrown the database and now need to look at new technology”.
Therefore, it will sometimes be appropriate to look at Hadoop, Spark and the like. Equally, a database will often do just fine – in particular, a couple of the big cloud vendors offer databases capable of a scale that on-premises systems cannot match.
A big data database example
While big data tends to be discussed in the abstract, here is a concrete example of a big data database.
We recently tested a 35 million row, 296 column table for a client, executing some queries against a well-specified Microsoft SQL Server 2012 instance (32 GB RAM, 8 cores, RAID 10 SSDs) as a baseline.
Now, I know some will already be doing a Paul Hogan impression at this point: “That’s not big data!” But as we will show, running analysis on the 35 million row table (over 10 billion cells of data) on a traditional RDBMS is not exactly snappy, let alone on a 500 million row version of the table (148 billion cells).
Big data analysis tools
Both Google BigQuery and Azure Data Warehouse are petabyte-scale databases with distributed storage and parallel compute, and both are also capable of querying semi-structured data.
Both also have pay-as-you-go pricing models.
With BigQuery you pay for storage and then per query; with Data Warehouse you pay for storage and then for reserved compute resources called Data Warehouse Units. While both use columnar storage by default, Data Warehouse has the added trick that it can store data row-wise too, like a conventional SQL Server.
A simple test
A simple aggregate – taking the maximum of an unindexed float column – against the 35 million row table on the SQL Server takes 70 seconds on average.
The same aggregate against the 500 million row table on either of the cloud services takes less than 3 seconds: that is a table over 14 times the size, with the aggregate running over 23 times faster.
We have done no engineering on the services, no tweaking, just loaded the data and executed the query.
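To make the test concrete, the query was an aggregate of roughly this shape (the table and column names here are hypothetical, and an in-memory SQLite database stands in for the actual servers; this is purely an illustrative sketch):

```python
import random
import sqlite3

# In-memory SQLite stands in for SQL Server / BigQuery / Data Warehouse;
# table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wide_table (id INTEGER, measure REAL)")

# A tiny sample of rows -- the real test used 35 million.
random.seed(42)
rows = [(i, random.uniform(0, 1000)) for i in range(10_000)]
conn.executemany("INSERT INTO wide_table VALUES (?, ?)", rows)

# The aggregate under test: MAX over an unindexed float column,
# which forces a full scan of the table.
(max_value,) = conn.execute("SELECT MAX(measure) FROM wide_table").fetchone()
print(max_value)
```

Because the column is unindexed, the engine has no shortcut: it must read every row, which is exactly where storage layout and parallelism start to matter.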
You would of course expect a column-based aggregate like this to be much faster on compressed columnar storage than on row-based storage. But, as we said, Data Warehouse has the added trick of being able to store data row-wise too, meaning you can get the distribution and parallelism benefits of the service whilst using it as an enterprise relational database.
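The intuition behind that speed difference can be sketched in a few lines of pure Python (illustrative only; the row counts and column index are made up): a columnar engine scans just the one column it needs, while a row store must touch every wide row in full to extract a single field.

```python
import random

random.seed(0)
NUM_ROWS, NUM_COLS = 1_000, 296  # the client's table had 296 columns

# Row-wise layout: one tuple per row, as a traditional RDBMS stores data.
row_store = [tuple(random.random() for _ in range(NUM_COLS))
             for _ in range(NUM_ROWS)]

# Columnar layout: one list per column, as BigQuery and Data Warehouse
# store data by default.
column_store = [list(col) for col in zip(*row_store)]

TARGET_COL = 7  # hypothetical float column we aggregate over

# Row store: must visit every 296-field row to read one field each time.
max_row_wise = max(row[TARGET_COL] for row in row_store)

# Column store: scans one contiguous list of values and ignores the
# other 295 columns entirely (and such columns compress far better too).
max_columnar = max(column_store[TARGET_COL])

assert max_row_wise == max_columnar
```

Both layouts give the same answer; the difference is how much data must be read to get it, which is why the cloud services can answer a column aggregate over 500 million rows so quickly.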
Summary: databases can’t do big data?
I agree with the McKinsey researchers’ caveats that what we mean by “big data” is subjective, and that the definition will change as time goes on.
For structured data in particular, the goalposts for what database software can “capture, store, manage, and analyse” have moved.
These databases can certainly address a large number of problems that are currently being labelled “Big Data” problems.
Let us know your thoughts on whether databases can do big data in the comments below. Should you wish to learn more, download this guide on choosing your IT Partner ⇓