Live Mesh, the product I've been working on for the last couple of years, went into open beta last week, with lots of new features and support for Mac and Windows Mobile clients. Since we rolled out the new bits, my team and I have been watching the "dashboard" for our services closely, and have learned a couple of things that I thought I'd share.
Observation #1: Performance in test environments matters [duh!]
The graph below shows the number of log messages per second generated by one of our machine types. As you can see, the service was generating tens of messages per second and then suddenly developed a severe case of logorrhea [literally and figuratively] and started pumping out close to 5000 messages a second. This, in turn, was chewing up a fair bit of CPU and keeping the disks spinning at a good clip. Loading a machine is not, in and of itself, a bad thing, given how low average datacenter machine utilization is. The real problem was that we simply weren't keeping up with the volume of messages and had started throwing some away.
After a bit of frantic email, it turned out that the spike was due to a new service running on these machines, which was logging rather liberally. We turned down the verbosity a bit, leading to the first drop in the graph, and I thought the problem was under control. What I neglected to take into account was that all the users who had been disconnected during the upgrade hadn't re-connected yet, and so we still weren't seeing the full traffic. As the day wore on, it quickly became obvious that we were heading for trouble again [see the second spike], so we had to turn the logging verbosity down a bit more, leading to the second drop. Hopefully that solves the problem for now without costing us all the useful information in the logs. [Yes, I know this violates the "log everything all the time" mantra that we try to adhere to, but sometimes you gotta do what you gotta do, at least temporarily.]
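For what it's worth, a throttle is a different way to keep a chatty service from swamping the log pipeline, short of turning the verbosity down. To be clear, this is not what we did; it's a minimal sketch using Python's standard `logging` module, and the class name `RateLimitFilter`, its `max_per_second` parameter and the `clock` argument are all made up for illustration.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Let at most `max_per_second` records below WARNING through
    in each one-second window. Hypothetical helper, not a stdlib class."""

    def __init__(self, max_per_second, clock=time.monotonic):
        super().__init__()
        self.max_per_second = max_per_second
        self.clock = clock            # injectable so the filter can be tested
        self.window_start = clock()
        self.passed = 0               # throttled-level records let through this window
        self.dropped = 0              # total records suppressed so far

    def filter(self, record):
        now = self.clock()
        if now - self.window_start >= 1.0:
            # A new one-second window begins; reset the budget.
            self.window_start = now
            self.passed = 0
        if record.levelno >= logging.WARNING:
            return True               # never throttle warnings and errors
        if self.passed < self.max_per_second:
            self.passed += 1
            return True
        self.dropped += 1
        return False
```

Attaching it is one line, e.g. `logging.getLogger("chatty").addFilter(RateLimitFilter(100))`. Keeping a `dropped` count at least tells you how much you threw away, which is more than we knew at the time.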
The obvious question is: how come I didn't catch this earlier, given that we run our code in a test environment before we roll it out to our production environment? The answer is very simple: I wasn't monitoring our test environment, at least not in a consistent fashion. That's partially because it's hard to get a useful baseline from the test environment -- we update the bits on it frequently, the load on it is inconsistent, etc. That said, I'm pretty sure I would have detected something was amiss if I'd done a more-than-cursory check.
So, lesson learned: keep a closer eye on the test environment.
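A "more-than-cursory check" doesn't have to be fancy. Here's a rough sketch of one, assuming you can pull a list of messages-per-second samples out of your monitoring system. Since a fixed baseline is hard to get in a test environment, it compares each sample against the median of the few samples before it. The function name `find_spikes`, the `window` and `factor` defaults, and the sample numbers are all invented for illustration.

```python
from statistics import median


def find_spikes(rates, window=5, factor=10.0):
    """Return the indices of samples that exceed `factor` times the
    median of the preceding `window` samples."""
    spikes = []
    for i in range(window, len(rates)):
        baseline = median(rates[i - window:i])
        # Skip windows with a zero baseline to avoid flagging every sample.
        if baseline > 0 and rates[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Made-up samples: tens of messages per second, then a jump to thousands.
samples = [20, 30, 25, 40, 35, 5000, 4800, 30]
print(find_spikes(samples))
```

Note that a rolling median only catches sudden jumps. Once a high rate has persisted for a full window it becomes the new "normal", so this is a tripwire, not a substitute for looking at the graph.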
Observation #2: Sometimes, having too little load hurts.
With our new release, our datacenter services are set up to handle a lot more load than we're currently getting [because we have grand plans and/or are just foolishly optimistic :-)]. This led to some interesting behavior between our frontend machines and the machines on our backend [see my previous description of our architecture], which talk to each other through a load balancer. In order to conserve resources, load balancers are generally configured to close connections that remain idle for a particular period of time. Given our over-provisioning, we had lots of such idle connections between front- and back-end machines, leading to many more connections being reset than we were used to. This, in turn, exposed a bug in some new code that didn't deal properly with connection resets, and led to some flakiness in the user experience.
In the meantime, we've fixed the bug, but I thought it was noteworthy that there is such a thing as "too little load".
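I'm not going to go into the details of our actual fix, but the general shape of "deal properly with connection resets" usually looks like the sketch below: if the connection turns out to be dead, open a fresh one and try again. The `connect` and `send` callables and the `max_attempts` parameter are placeholders I made up; only the exception types are real Python built-ins.

```python
def call_with_reconnect(connect, send, max_attempts=3):
    """Run send(conn), reconnecting if the peer has reset the connection.

    `connect` returns a new connection object; `send` performs one
    request on it. Both are hypothetical stand-ins for real client code.
    """
    conn = connect()
    for attempt in range(1, max_attempts + 1):
        try:
            return send(conn)
        except (ConnectionResetError, BrokenPipeError):
            if attempt == max_attempts:
                raise                 # out of attempts; let the caller see the error
            conn = connect()          # the old connection is dead; open a fresh one
```

One caveat: blindly retrying is only safe when the request is idempotent, since a reset can arrive after the other side has already processed it.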