I’ve been making my way through a handy list of database papers from MIT’s graduate database class [thanks, T!], and a follow-up blog post about “big data” made me think about the associated issues a bit more, captured below.
As context: I actually have the fortune to work with massive data sets because I’m on the team that runs the storage services for the Windows Live Mesh, Profile, SkyDrive and Spaces services. In particular, we store information for a few hundred million users – profiles, photos, blogs etc, and some of these data sets are pretty big; for example, we store billions of photos. This data is spread across a couple of thousand machines and the total amount we store is in the terabyte range for structured data [ie data stored in a database] and in the petabyte range for unstructured storage [ie data stored directly in a file system].
“Scale changes everything” – this is an oft-repeated phrase, but it begs the questions: why, and how ? I think a large part of the answer to these questions is that in order to build a large-scale system, you have to scale out and not up – you can’t solve the problem by simply buying a faster, more powerful computer, at least not in the long term. Instead, you have to buy multiple less-powerful machines and use their aggregate processing power to run your application. However, as soon as you do this, you’re building a distributed system, and that brings with it a huge increase in the number of things you need to think about. For example:
Consensus and consistency: how do you make sure all machines agree to do the same thing, in the presence of crashes and network hiccups ?
Partition tolerance: what do you do if some of the machines can’t talk to each other?
Fault tolerance and availability: how many machines can fail before your application stops working ?
Workload partitioning: how do you actually distribute the workload across multiple machines ?
Data partitioning: how do you distribute the data across multiple machines, and how do you figure out where a piece of data is ?
It turns out that these are really difficult problems to solve and lead to having to build lots of machinery that you don’t need if your application can be run on just a single machine.
Another aspect of getting to large scale is the fact that you often need to scale multiple components at once, especially the components that provide facilities that are widely used, and not just by one part of your application. An interesting example of this is this report from the folks at Facebook that talks about the issues they ran into with very fundamental facilities like kernel-level network traffic handling as they scaled out their caching servers. It’s also worth pointing out that some of the facilities you have to scale may not be part of the “core” of your application. A canonical example is the usage reporting and monitoring component – to run a large system, you need to know what and how it’s doing ie whether some machines are failing, what the current request rate is etc. As the end-user visible application that you’re providing is used by more and more people, you also need to increase the ability of your monitoring component to handle and digest more and more incoming data. In other words, you need to scale out your monitoring system, and this in turn means you have to deal with all the issues mentioned above.
OLTP vrs OLAP workloads – for the services we're running, the world actually can’t be divided neatly into OLTP-type “lots of transactions that read/write a small amount of live data” and OLAP-type “infrequent, read-mostly transactions that look at aggregated, warehoused [ie summarized, slightly-stale] data” workloads. There are certain things you may want to do with fairly frequently with your live data that requires looking at all the data; for example, if you want to build a full-text search index of all the data in your store, you need to look at all the data [so you can’t summarize it], you want to look at the live data [so you don’t return stale search results], you want to do some computation on it and stick it into an index [via code that runs outside the DB], you want to do this fairly frequently [so that your index is up-to-date], and you may want to push some data back into the DB as the result of your processing. You can think of this as a database crawl, and subsequent processing of the crawled data, in analogy with the web crawling done by search engines.
This sort of access doesn’t quite fit either of the patterns described above, and introduces some extra considerations. For example, you don’t want to simply run a query that selects everything from a particular table, because that would peg the DB; rather, you want to be able to just select a portion of the records at a time, process them, and then select the next set i.e. you need a way to iterate through a table, and you want to do it efficiently, so that you don’t put undue stress on the DB that’s also processing user transactions. Trying to avoid stressing the DB machine also means that you want to run the code that processes the records on a different machine than the one that’s running your database. Iterating through tables in turn drags behind it needing to checkpoint some state for robustness – if you’ve processed the first million records and then you crash, you don’t want to start again from the first record. Our old friend Scale also raises his scaly [haha] head again: these large data sets are spread across lots of machines, so you need to figure out a way to parallelize your database crawl, so that you have multiple machines iterating through different portions of the overall data set. This of course implies that you need to build a work-partitioning system that knows how to split up the work, makes sure there’s no overlap and that every bit of data gets looked at, handles failures of individual crawl processes, schedules work appropriately etc. Now, you can [and currently have to] build all this machinery yourself, but it’s interesting to think about what sort of support could be built into the database engine itself to facilitate this sort of access pattern and processing.
“Is MapReduce a big step backwards in data processing ?” [as per Stonebraker and DeWitt] – I’ve had the chance to do some work with Microsoft’s version of MapReduce, which consists of a SQL-like language called Scope layered on top of a large-scale distributed file system that serves a similar function as the Google File System and provides support for a file system-like view of huge streams of data [in the gigabyte-to-terabyte range]. Based on my experience in this area, I agree with the MapReduce advocates that it’s useful to be able to interpret the same data through a variety of lenses, simply by writing different map/reduce functions, without having to pick a DB schema upfront -- that’s something we’ve been doing a lot with our log data. However, I also empathize with Stonebraker et al that having to express every computation as a combination of map and reduce steps is rather painful and unnatural, and that high-level languages that hide the nuts-and-bolts of the underlying data access mechanics are highly desirable. From that perspective, Microsoft is actually ahead of Google – the Scope language looks a lot like SQL, and so you get the full declarative and expressive power of SQL, across a gigantic dataset, without having to worry about the underlying details of how the data is accessed, sent from machine to machine etc.
Overall, I think the answer to “Should I use MapReduce or [traditional] databases ?” is “Use both”. The sweet spot is using each tool for what it’s good at, and having ways to transform/move data from one system into the other and back – for example, using a MapReduce-style system to store and process raw logs, and produce summary reports which are then loaded into a DB to allow interactive querying and display.
... and that's currently all I have to say about that.
I agree that OLTP-vs-OLAP, as a classification, doesn't really describe a lot of modern systems very well; it was really just a useful pair of terms on which to hang my hat for the purposes of that one post...
Incidentally, one of the papers I enjoyed reading most, from the 6.830 class, was the Brewer's "Combining Systems and Databases." It's a little out of date, but still fun to read...
Posted by: son1 | December 29, 2008 at 05:27 AM