Scribblelive provides dependable, real-time communications, delivering billions of updates to millions of people every month. On March 7th, during the Apple event, we encountered some problems.
We have been working hard to discover what went wrong and how it can be fixed and tested, so that we can win back the level of confidence our customers demand. As we make progress, we will remain transparent and provide as much detail as possible.
Here is what we know after an extensive review:
By analyzing our logs, it became clear that the database began periodically locking up at 12:38 pm EST. The database cluster peaked at 14% CPU usage, so it didn’t appear to be out of resources. At first, this suggested some condition on the web servers that prevented them from fully utilizing the database, e.g. an artificial constraint on the number of connections. However, further analysis showed that this wasn’t the case.
We shifted our attention to the database itself and found a series of errors in the logs. These errors were traced to a bug in our database platform. Our load-testing strategy hadn’t created the conditions necessary to trigger the bug, and the bug hadn’t been triggered during normal traffic conditions in the months prior to the Apple event.
Once we had identified the root cause of the problems, we set about finding a solution. We were able to create specialized load tests in our lab that allowed us to reproduce the failure under test conditions. With that baseline available, our development team was able to make code changes that minimize the chance of the bug being exposed, to the point where we could no longer produce a failure under load testing. Those changes will be part of our next release in the second week of April. A patch for the bug will also be available in mid-April.
Our development team will now turn its attention to further reducing the load on the database cluster during peak traffic, and to revising our load-testing strategy to simulate the traffic patterns of large-scale events. By introducing another caching layer to our code, we aim to remove 100% of direct database calls on whitelabel sites. We will then begin extensive load-testing at the end of April, culminating in a public load test on May 17th.
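As an illustration of the caching idea only (not the platform's actual code), a read-through cache with a short expiry can be sketched as follows. The class name, the `loader` callback standing in for a database query, and the five-second expiry are all assumptions made for this example.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ReadThroughCache:
    """Hypothetical read-through cache: serve reads from memory and only
    call the (assumed) database loader when an entry is missing or expired."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl_seconds: float = 5.0,  # assumed expiry, chosen for illustration
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps key -> (expiry time, cached value).
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: no database call is made.
            return entry[1]
        # Missing or expired: fall through to the loader once, then cache.
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a layer like this in front of every read, repeated requests for the same page within the expiry window reach the database at most once, which is the kind of load reduction described above. A production version would also need to handle concurrent requests for the same expired key and to bound memory use; both are left out of this sketch.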
We appreciate the trust our customers place in us every day as they cover live news, and we can’t wait to show you on May 17th how far the platform has come. For more details, please contact your Account Manager.
Sincerely,
Jonathan Keebler
CTO
@KeeblerBlog