June 07, 2009

Xanderism, 6/3/09

At the dinner table, Xander suddenly stops eating, gets off his chair, peeks over the edge of the table and says in a very quiet voice: "Excuse me, Mama, is there a bear coming ?"

Uhhh ... no. There's no bear coming. As far as we can tell.

This little episode brings back memories of a few months ago when Xander, for reasons that are still unclear, was suddenly very concerned and scared about bears popping up in various places. Like in his ear. And, bad parents that we are, we used this to our advantage, by threatening him with bears when he refused to do something. The most egregious example that comes to mind is being in a jewelry store to get something on Christina's ring fixed -- the jeweler was working upstairs, and Xander kept trying to go upstairs, so we had to resort to telling him that there was a bear upstairs. And then, when I heard the noise of the jewelry grinding wheel, I couldn't help myself and had to add "... and that sound is the noise of the bear sharpening his knife".

As if a bear needs a knife.

[Yes, I'm a horrible parent. That was a bad thing to do. I keep telling myself that it didn't really register with him.]

May 13, 2009

Principal Component Analysis of Toddler Behavior

Over the last few months, the Mallet Household War on Terror, namely raising Xander, has escalated -- we are fighting increasingly pitched battles against the insurgent. His ever-increasing physical agility and strength, desire to have it his own way, totally random requests, and discovery of the word “no” have inflicted heavy damage on our general temperament and morale. We occasionally call up reservists [aka grandparents], but they can only handle him for short stints.

In other words, we’re dealing with the Terrible Twos.

One thing I’ve been wondering about, though, is whether there isn’t some pattern to the madness. In other words, if I gathered data about enough variables that might affect his behavior and tried to correlate them with his observed behavior, could I build a predictive model that tells me the probability of a meltdown in the next couple of hours, and what the most important contributing factors are ? [Actually, I wouldn’t really need a sophisticated model for that – a model that just predicted a 100% probability of an imminent meltdown would be correct 90% of the time …]

Some variables that might factor into this:

- Time since last nap

- Length of last nap

- Whether he’s gotten a new toy in the last couple of days

- Last time he ran around outside

- What he’s had to eat today

- Time since he last watched a bit of “Cars”

- Frequency of Bobcat rides in last week [our neighbor has one that he gets to ride on]

- “Toddler factor” drawn from a mixture of various distributions

 

Alas, I have no time to actually do this – having to say “No, stop it, get down, give me that, don’t do that, quit pulling on that, be careful, stop squirming” a few hundred times every hour is not conducive to careful data gathering.

May 04, 2009

Xanderism, 5/4/09

While eating a slice of bread with Nutella on it: "Wolves love Nutella". And since it's hard to assess the veracity of this particular statement, my response was "Yes, they probably do. Everybody loves Nutella !".

December 28, 2008

Large scale-data access patterns-MapReduce buzzword bingo

I’ve been making my way through a handy list of database papers from MIT’s graduate database class [thanks, T!], and a follow-up blog post about “big data” made me think about the associated issues a bit more, captured below.

 

As context: I actually have the good fortune to work with massive data sets because I’m on the team that runs the storage services for the Windows Live Mesh, Profile, SkyDrive and Spaces services. In particular, we store information for a few hundred million users – profiles, photos, blogs, etc. – and some of these data sets are pretty big; for example, we store billions of photos. This data is spread across a couple of thousand machines and the total amount we store is in the terabyte range for structured data [i.e., data stored in a database] and in the petabyte range for unstructured storage [i.e., data stored directly in a file system].

 

“Scale changes everything” – this is an oft-repeated phrase, but it raises the questions: why, and how ? I think a large part of the answer to these questions is that in order to build a large-scale system, you have to scale out and not up – you can’t solve the problem by simply buying a faster, more powerful computer, at least not in the long term. Instead, you have to buy multiple less-powerful machines and use their aggregate processing power to run your application. However, as soon as you do this, you’re building a distributed system, and that brings with it a huge increase in the number of things you need to think about. For example:

 

Consensus and consistency: how do you make sure all machines agree to do the same thing, in the presence of crashes and network hiccups ?

Partition tolerance: what do you do if some of the machines can’t talk to each other?

Fault tolerance and availability: how many machines can fail before your application stops working ?

Workload partitioning: how do you actually distribute the workload across multiple machines ?

Data partitioning: how do you distribute the data across multiple machines, and how do you figure out where a piece of data is ?  

 

It turns out that these are really difficult problems to solve and lead to having to build lots of machinery that you don’t need if your application can be run on just a single machine.
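To make the data partitioning question a bit more concrete, here is a minimal sketch of the simplest possible answer: hash the key and take it modulo the number of machines. The machine names and key format are made up for illustration, and this is not a description of how our services actually place data.

```python
import hashlib

# Hypothetical machine names, purely for illustration.
MACHINES = ["db-00", "db-01", "db-02", "db-03"]


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number in [0, num_partitions)."""
    # Use a stable hash: Python's built-in hash() is randomized per process
    # for strings, so two machines could disagree about where a key lives.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def machine_for(key: str) -> str:
    """Answer 'where is this piece of data?' without any lookup table."""
    return MACHINES[partition_for(key, len(MACHINES))]


print(machine_for("user:42"))
```

The catch with plain modulo hashing is that changing the number of machines moves most keys to a different machine, which is why real systems tend to use consistent hashing or an explicit lookup directory instead.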

 

Another aspect of getting to large scale is the fact that you often need to scale multiple components at once, especially the components that provide facilities that are widely used, and not just by one part of your application. One interesting example is this report from the folks at Facebook that talks about the issues they ran into with very fundamental facilities like kernel-level network traffic handling as they scaled out their caching servers. It’s also worth pointing out that some of the facilities you have to scale may not be part of the “core” of your application. A canonical example is the usage reporting and monitoring component – to run a large system, you need to know what it’s doing and how it’s doing, i.e., whether some machines are failing, what the current request rate is, etc. As the end-user visible application that you’re providing is used by more and more people, you also need to increase the ability of your monitoring component to handle and digest more and more incoming data. In other words, you need to scale out your monitoring system, and this in turn means you have to deal with all the issues mentioned above.

 

OLTP vs. OLAP workloads – for the services we're running, the world actually can’t be divided neatly into OLTP-type “lots of transactions that read/write a small amount of live data” and OLAP-type “infrequent, read-mostly transactions that look at aggregated, warehoused [i.e., summarized, slightly-stale] data” workloads. There are certain things you may want to do fairly frequently with your live data that require looking at all the data; for example, if you want to build a full-text search index of all the data in your store, you need to look at all the data [so you can’t summarize it], you want to look at the live data [so you don’t return stale search results], you want to do some computation on it and stick it into an index [via code that runs outside the DB], you want to do this fairly frequently [so that your index is up-to-date], and you may want to push some data back into the DB as the result of your processing. You can think of this as a database crawl, and subsequent processing of the crawled data, in analogy with the web crawling done by search engines.

 

This sort of access doesn’t quite fit either of the patterns described above, and introduces some extra considerations. For example, you don’t want to simply run a query that selects everything from a particular table, because that would peg the DB; rather, you want to be able to just select a portion of the records at a time, process them, and then select the next set. In other words, you need a way to iterate through a table, and you want to do it efficiently, so that you don’t put undue stress on the DB that’s also processing user transactions. Trying to avoid stressing the DB machine also means that you want to run the code that processes the records on a different machine than the one that’s running your database. Iterating through tables in turn drags behind it the need to checkpoint some state for robustness – if you’ve processed the first million records and then you crash, you don’t want to start again from the first record. Our old friend Scale also raises his scaly [haha] head again: these large data sets are spread across lots of machines, so you need to figure out a way to parallelize your database crawl, so that you have multiple machines iterating through different portions of the overall data set. This of course implies that you need to build a work-partitioning system that knows how to split up the work, makes sure there’s no overlap and that every bit of data gets looked at, handles failures of individual crawl processes, schedules work appropriately, etc. Now, you can [and currently have to] build all this machinery yourself, but it’s interesting to think about what sort of support could be built into the database engine itself to facilitate this sort of access pattern and processing.
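As a rough illustration of the "iterate in batches and checkpoint" idea, here is a small sketch using Python's built-in sqlite3 module. The `items` table, its `id` and `body` columns, and the `crawl_checkpoint` table are all hypothetical names invented for the example, and it assumes positive, increasing integer ids; it is not how our crawl is actually built.

```python
import sqlite3

BATCH_SIZE = 1000  # hypothetical; small enough that one query stays cheap


def crawl(conn, process, batch_size=BATCH_SIZE):
    """Walk the hypothetical `items` table in primary-key order, one batch
    at a time, remembering how far we got so a restart can resume."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawl_checkpoint "
        "(name TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
    )
    row = conn.execute(
        "SELECT last_id FROM crawl_checkpoint WHERE name = 'items'"
    ).fetchone()
    last_id = row[0] if row else 0  # assumes ids start above 0

    while True:
        # Seek past the last processed key instead of using OFFSET, so each
        # query only touches one batch worth of rows.
        batch = conn.execute(
            "SELECT id, body FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            return
        for item_id, body in batch:
            process(item_id, body)
        # Checkpoint only after the whole batch has been processed.
        last_id = batch[-1][0]
        conn.execute(
            "INSERT OR REPLACE INTO crawl_checkpoint (name, last_id) "
            "VALUES ('items', ?)",
            (last_id,),
        )
        conn.commit()
```

Because the checkpoint is written after the batch is processed, a crash in the middle of a batch means that batch gets processed again on restart, so `process` needs to be safe to repeat. Parallelizing this would mean handing each worker its own key range and its own checkpoint row, which is exactly the work-partitioning machinery described above.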

 

“Is MapReduce a big step backwards in data processing ?” [as per Stonebraker and DeWitt] – I’ve had the chance to do some work with Microsoft’s version of MapReduce, which consists of a SQL-like language called Scope layered on top of a large-scale distributed file system that serves a similar function to the Google File System and provides support for a file system-like view of huge streams of data [in the gigabyte-to-terabyte range]. Based on my experience in this area, I agree with the MapReduce advocates that it’s useful to be able to interpret the same data through a variety of lenses, simply by writing different map/reduce functions, without having to pick a DB schema upfront -- that’s something we’ve been doing a lot with our log data. However, I also sympathize with Stonebraker et al. that having to express every computation as a combination of map and reduce steps is rather painful and unnatural, and that high-level languages that hide the nuts-and-bolts of the underlying data access mechanics are highly desirable. From that perspective, Microsoft is actually ahead of Google – the Scope language looks a lot like SQL, and so you get the full declarative and expressive power of SQL, across a gigantic dataset, without having to worry about the underlying details of how the data is accessed, sent from machine to machine etc.
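To show what "the same data through a variety of lenses" means in practice, here is a toy, single-process sketch in Python. It is neither Scope nor Google's MapReduce, just the shape of the idea; the log format and every name in it are invented for the example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Toy in-memory map/reduce: map each record to (key, value) pairs,
    group the values by key, then reduce each group to one result."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Made-up log lines in a made-up format: "<user> <operation> <bytes>".
LOG = [
    "alice upload 300",
    "bob download 120",
    "alice download 80",
]


def by_user(line):
    """Lens 1: bytes transferred, keyed by user."""
    user, _op, size = line.split()
    yield user, int(size)


def by_operation(line):
    """Lens 2: one count per request, keyed by operation."""
    _user, op, _size = line.split()
    yield op, 1


def total(_key, values):
    return sum(values)


bytes_per_user = map_reduce(LOG, by_user, total)
requests_per_op = map_reduce(LOG, by_operation, total)
print(bytes_per_user)   # {'alice': 380, 'bob': 120}
print(requests_per_op)  # {'upload': 1, 'download': 2}
```

Neither lens required deciding on a schema before the logs were written, which is the advocates' point. The Stonebraker point also shows: both of these are one-line GROUP BY queries in SQL.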

 

Overall, I think the answer to “Should I use MapReduce or [traditional] databases ?” is “Use both”. The sweet spot is using each tool for what it’s good at, and having ways to transform/move data from one system into the other and back – for example, using a MapReduce-style system to store and process raw logs, and produce summary reports which are then loaded into a DB to allow interactive querying and display.

 

... and that's currently all I have to say about that.

November 29, 2008

Gun safety training in the NFL: who needs it ?

A friend of mine once gave me some very simple, but profound, advice: "Don't $#@% up the simple things." In other words, if your job involves doing some things that can be accomplished merely by following the rules, do them. And, by extension, don't do things that even a cursory inspection through the lens of common sense would indicate as falling into the "probably a bad idea" category. 

Pro athletes would benefit from having this drummed into them, together with all the other drills they do. Consider the general case of a highly-prized NFL player. By all rights, his job description should basically amount to

- Show up for practice and do the drills
- During a game, do whatever is appropriate for your position: run, catch, pass, tackle etc
- At end of year, pick up multi-million dollar paycheck
- Repeat

Note that nowhere in there does it say "carry a gun". Or "carry a gun into a nightclub". And it definitely doesn't say anything about shooting yourself in [almost] the foot. Especially if you've already pulled the hamstring on that leg. Because now you have two injuries to rehab, and pulled hamstrings are tricky.

It's the little things.

November 02, 2008

Live Mesh goes beta, crazy logging and other forms of hilarity ensue

Live Mesh, the product I've been working on for the last couple of years, went into open beta last week, with lots of new features and support for Mac and Windows Mobile clients. Since we rolled out the new bits, my team and I have been watching the "dashboard" for our services closely, and have learned a couple of things that I thought I'd share.

Observation #1: Performance in test environments matters [duh !]

The graph below shows the number of log messages per second generated by one of our machine types. As you can see, the service was generating tens of messages per second and then suddenly developed a severe case of logorrhea [literally and figuratively] and started pumping out close to 5000 messages a second. This, in turn, was chewing up a fair bit of CPU and keeping the disks spinning at a good clip. Loading a machine is, in and of itself, not a bad thing, given the low average datacenter machine utilization, but what was a problem was that we simply weren't keeping up with the volume of messages and had started throwing some away.

After a bit of frantic email, it turned out that the spike was due to a new service running on these machines, which was logging rather liberally. We turned down the verbosity a bit, leading to the first drop in the graph, and I thought the problem was under control. What I neglected to take into account was that all the users that had been disconnected during the upgrade hadn't re-connected yet, and so we still weren't seeing the full traffic. As the day wore on, it quickly became obvious that we were heading for trouble again [see the second spike], so we had to turn the logging verbosity down a bit more, leading to the second drop. Hopefully that solves the problem for now, and we haven't lost all the useful information from the logs. [Yes, I know this violates the "log everything all the time" mantra that we try to adhere to, but sometimes you gotta do what you gotta do, at least temporarily.]
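For the curious, "turning down the verbosity" amounts to something like the following, shown here with Python's standard logging module as a stand-in. Our services don't use this library, and the logger name and function are hypothetical.

```python
import logging

# Hypothetical logger for the chatty new service.
logger = logging.getLogger("chatty_service")
logger.setLevel(logging.DEBUG)  # the "log everything all the time" default


def set_verbosity(level_name: str) -> None:
    """Change the logging threshold at runtime, e.g. from an admin command."""
    # Logger.setLevel accepts a level name such as "WARNING" as well as
    # the numeric constant.
    logger.setLevel(level_name.upper())


set_verbosity("warning")
logger.debug("this message is now dropped before it costs any disk I/O")
logger.error("this one still gets through")
```

The important property is that the threshold can be changed without redeploying the service, so you can dial it back up once the fire is out.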


[Graph: log messages per second]

The obvious question is: how come I didn't catch this earlier, given that we run our code in a test environment before we roll it out to our production environment ? The answer is very simple: I wasn't monitoring our test environment, at least not in a consistent fashion. That's partially because it's hard to get a useful baseline from the test environment -- we update the bits on it frequently, the load on it is inconsistent etc. That said, I'm pretty sure I would have detected something was amiss if I'd done a more-than-cursory check.

So, lesson learned: keep a closer eye on the test environment.

Observation #2: Sometimes, having too little load hurts.

With our new release, our datacenter services are set up to handle a lot more load than we're currently getting [because we have grand plans and/or are just foolishly optimistic :-)]. This led to some interesting behavior between our frontend machines and the machines on our backend [see my previous description of our architecture], which talk to each other through a loadbalancer. In order to conserve resources, loadbalancers are generally configured to close connections that remain idle for a particular period of time. Given our over-provisioning, we had lots of such idle connections between front- and back-end machines, leading to many more connections being reset than we were used to. This, in turn, exposed a bug in some new code that didn't deal properly with connection resets, and led to some flakiness in the user experience.
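A generic way to cope with this kind of reset, sketched in Python, is to treat a reset on a cached connection as "the connection went stale" and retry once on a fresh one. This is not our actual code or fix; `ReconnectingClient` and the `connect` callable are hypothetical, and the sketch assumes the request is safe to send twice.

```python
class ReconnectingClient:
    """Keeps one cached connection and transparently replaces it if an
    idle timeout on the loadbalancer has reset it behind our back."""

    def __init__(self, connect):
        # `connect` is a hypothetical zero-argument callable that returns a
        # connected socket-like object with sendall(), recv() and close().
        self._connect = connect
        self._conn = None

    def request(self, payload: bytes) -> bytes:
        for attempt in range(2):
            if self._conn is None:
                self._conn = self._connect()
            try:
                self._conn.sendall(payload)
                return self._conn.recv(4096)
            except (ConnectionResetError, BrokenPipeError):
                # The cached connection is dead: drop it, and retry once
                # on a fresh connection before giving up.
                self._conn.close()
                self._conn = None
                if attempt == 1:
                    raise
        raise AssertionError("unreachable")
```

Retrying blindly is only appropriate for idempotent requests, since the reset can arrive after the other side has already acted on the first attempt. Sending periodic keep-alives so the connections never look idle is the other common approach.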

In the meantime, we've fixed the bug, but I thought it was noteworthy that there is such a thing as "too little load".


October 15, 2008

Be still, my beating heart

Apparently Cheney experienced an "irregular heart rhythm" and had to be treated for it yesterday.

In other words, his heart actually started to beat, and he wasn't used to that, so he made them turn it back off.

October 03, 2008

Area man no longer into Web 2.0, says enterprise software is the new hotness

Going from working on Facebook to working on "enterprise productivity software" [aka Office] is about as radical a shift as you can make. Maybe he's taking the admonition to stop throwing sheep seriously.

That said, I wonder why two young men in the prime of their lives would willingly work on something as yawn-inducing as enterprise productivity software.

September 28, 2008

Putting the cart before the elephant

There's a lot of handwringing about the fact that Sarah Palin is "one 72 year-old heartbeat away from being the president", and whether she's ready for that. This seems to ignore a more fundamental question: forget about being president, is she ready to be the vice-president ?

Even if it's not clear to her what a VP does, I think it's pretty clear that her grasp on something as important as the current bailout package is ... tenuous, to put it kindly. [Translation of the look on Katie Couric's face in that last clip: "Must restrain myself ... in face of ... incoherent blathering ... on ... job creation ... oh god, kill me now ..."]

In short: no, she's not ready.

But you knew that already, didn't you ?

September 22, 2008

As I was saying before I was so rudely interrupted by life ...

Well, it's been almost exactly three months since my last post. I feel like I should be reporting quarterly earnings. Or losses, given the current economic climate ...

However, in the interest of making my transition back into hopefully-more-regular blogging as easy as possible, I'll just issue a quick report on a number of ongoing threads, anchored by a perfect rendition of a lot of the interactions I've been having with Xander lately.

Xander: only a month after his 2nd birthday, he's deep into the terrible twos. That boy has a fierce temper. He also has a one-truck mind -- just about all of his activities and utterances center around trucks: trucks in the morning, trucks in the afternoon, trucks at night, trucks in the bathtub, trucks when he's eating, trucks in trucks, trucks on trucks ... trucks. On the physical side, he's a little beefwagon with oodles of energy, and just about indestructible. He likes to slow himself down by running into things, so I think football might be the right sport for him. He's also become a little chatterbox, and parrots just about anything you say, including the bits where Mommy and Daddy have something go wrong and say things like #$@$@!@#$.

Christina: her business is doing very well, and she's keeping it interesting by doing various types of work -- commercial, maternity, weddings etc. She's gotten some good press and has even been on TV. And, after gamely doing all her work on a 7-year old PC for the last couple of years [yes, I know, it's embarrassing], we finally left the Stone Age and outfitted her with a stonkin' desktop PC, laptop, lots of storage space etc. This, in turn, has made my life a lot easier because I no longer get lots of complaints about her machine running out of memory or being ridiculously slow -- 8GB of RAM are pretty handy that way.

Me: still jes' workin' for The Man. My leg is close to being back to normal, and so I've been looking around for a new sport to pick up, since taekwondo is out of the question. I've been thinking about mountain biking, and have been out on a couple of mini-rides with a friend of mine, but haven't really engaged.

And that's pretty much that. More interesting [to me, at least] stuff to follow.