
May 18, 2008

Lessons from running the Live Mesh services, 4 weeks in

It's been almost 4 weeks since we went live, and so far things have gone pretty well -- we haven't had any system meltdowns, we haven't lost anybody's data, and the general reaction to our feature set has been quite positive.

It's also been really interesting watching the system "breathe", so to speak -- looking at various parameters of system load and their evolution over time, identifying recurring patterns, figuring out the events that led to irregular patterns in the performance data, etc. In particular, I now have personal experience with a number of things that I only knew from reading about them. None of this is profound in any way, but here are some examples:

1. There's a clear day-vs-night and weekday-vs-weekend usage pattern: the graph below shows a week's worth of data for a statistic that basically measures how many clients are connected to our cloud services. The troughs in the data occur at night, the peak is around noon PDT, and the two low peaks in a row at the beginning represent weekend days.

[Figure "Socpattern": connected clients over one week]

2. Beware of synchronized clients: one of the problems that cloud services have to deal with is the "flash crowd" effect. For example, if an IM service crashes and disconnects all the millions of clients currently connected to it, you don't want all the clients to try reconnecting at the same time, or the incoming load can bring your servers to their knees (if you haven't built in mechanisms for dealing with overload gracefully). Instead, you want clients to spread out their reconnects over a period of time, to smooth out the load.

We have a similar problem to an IM service in that if a certain subset of our in-memory state management services crashes, or is brought down, we'll cause client disconnects and reconnects. Well, of course we had to bounce exactly that set of services in the first couple of weeks, and so we ended up disconnecting our clients. They all tried to reconnect at the same time, and then they stayed in sync -- the row of spikes shows a system load metric that illustrates the regular cadence that comes from the synchronization.

[Figure "Tracemsgs_2": system load metric showing regularly spaced spikes]

In the meantime, we've added the necessary code to avoid that problem in the future :-)
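
The post doesn't say what that code looks like. One common way to keep clients from reconnecting in lockstep is randomized exponential backoff; here's a minimal sketch of that idea, not the actual Live Mesh fix. The function name and the `base` and `cap` values are made up for illustration.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return how many seconds to wait before reconnect attempt `attempt` (0-based).

    The upper bound doubles with every failed attempt, up to `cap`, and the
    actual wait is drawn uniformly from [0, bound]. The randomness is what
    keeps clients that were disconnected together from retrying together.
    All defaults here are hypothetical, not values from the real service.
    """
    bound = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, bound)


if __name__ == "__main__":
    # Two simulated clients disconnected at the same moment pick
    # independent delays, so their retries drift apart.
    client_a = random.Random()
    client_b = random.Random()
    for attempt in range(5):
        print(attempt,
              round(reconnect_delay(attempt, rng=client_a), 2),
              round(reconnect_delay(attempt, rng=client_b), 2))
```

Drawing the whole delay at random (rather than adding a small random offset to a fixed interval) spreads the reconnects across the full window, which is the "smooth out the load" behavior described above.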

3. In a large-enough system, something is always broken: we have a monitoring tool that periodically looks at all our machines and tries to figure out whether they're healthy, by looking for service crashes, strange load characteristics, the wrong version of software running, etc. Looking at this data over the last few weeks, I've realized that even in a system with "only" a couple of hundred servers, there's pretty much always something that's not quite right -- there are machines with failing hard disks, some machines appear to be handling a disproportionate share of the load, some are running the wrong bits, on others our service health checks are failing, etc. The second realization was that the system state is very dynamic -- service health checks will start passing again, the load gets rebalanced, etc., and so what you really want to do is watch for persistent errors, and not just jump on everything that seems wrong at a particular point in time. And, perhaps more importantly, our system is robust enough to deal with things not being quite right.
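
To make "watch for persistent errors" concrete, here's a rough sketch of one way to do it: only flag a machine after several consecutive failed checks, and reset the count as soon as a check passes. This is an illustration, not our monitoring tool; the class name, the machine names, and the threshold of 3 are all made up.

```python
from collections import defaultdict


class PersistentFailureTracker:
    """Flag a machine only after `threshold` consecutive failed health checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold          # hypothetical default
        self._streaks = defaultdict(int)    # machine name -> current failure streak

    def record(self, machine, healthy):
        """Record one check result; return True if the machine should be flagged."""
        if healthy:
            # A passing check clears the streak: transient blips never alert.
            self._streaks[machine] = 0
            return False
        self._streaks[machine] += 1
        return self._streaks[machine] >= self.threshold


if __name__ == "__main__":
    tracker = PersistentFailureTracker(threshold=3)
    # One machine fails twice and recovers, then fails three times in a row.
    results = [False, False, True, False, False, False]
    for healthy in results:
        print(healthy, tracker.record("machine-017", healthy))
```

The trade-off is latency: with a threshold of N you find out about a real problem N check intervals later, in exchange for not chasing things that fix themselves.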

4. Invest early in "boring management infrastructure": we generate tons of logs; currently, we're producing on the order of 100 GB a day, and that'll only increase as we increase the number of users of our system. These logs are really our only source of debugging data when users complain about cloud interactions going wrong, so it's clearly very important to have the necessary infrastructure to collect and process all this data. Thankfully, we actually built some of the necessary tools before releasing our bits, and so we now have a tolerable way of filtering through this flood of data. That said, some of the tools we have are already straining to keep up, so this is something we'll have to keep working on.

Technorati tag: LiveMesh.

May 11, 2008

This just in: "omg, work is totally hard and stuff"

If you need "time off" at 23, you have a long road ahead of you.