March 7th: What We Learned and What We’re Doing

Updated: February 12, 2021

Scribblelive provides dependable, real-time communications, delivering billions of updates to millions of people every month. On March 7th, during the Apple event, we encountered some problems.

We have been working hard to discover what went wrong and how to fix and test it, so that we can win back the level of confidence our customers demand. As we make progress, we will remain transparent and provide as much detail as possible.

Here is what we know after an extensive review:

By analyzing our logs, we determined that the database began periodically locking up at 12:38 pm EST. The database cluster peaked at 14% CPU usage, so it didn’t appear to be out of resources. At first, this pointed to some condition on the web servers that prevented them from fully utilizing the database, e.g. an artificial constraint on the number of connections. However, further analysis ruled that out.

We shifted our attention to the database itself and found a series of errors in the logs. These errors were traced to a bug in our database platform. Our load-testing strategy hadn’t created the conditions necessary to trigger the bug, and the bug hadn’t surfaced during normal traffic in the months prior to the Apple event.

Once we had identified the root cause of the problems, we set about finding a solution. We created specialized load tests in our lab that allowed us to reproduce the failure under test conditions. With that baseline available, our development team was able to make code changes that minimize the chance of the bug being exposed, to the point where we could no longer reproduce the failure under load. Those changes will be part of our next release in the second week of April. A patch for the bug will also be available mid-April.

Our development team will now turn its attention to further reducing the load on the database cluster during peak traffic, and to revising our load-testing strategy to simulate the traffic patterns of large-scale events. By introducing another caching layer to our code, we aim to remove 100% of direct database calls on whitelabel sites. We will then begin extensive load-testing at the end of April, culminating in a public load test on May 17th.
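As a rough illustration of the caching idea (a minimal sketch, not the platform's actual code), a read-through cache answers repeated reads from memory and only calls the database when an entry is missing or has expired. The names `ReadThroughCache` and `fake_database_query`, and the 60-second lifetime, are assumptions made up for this example.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Serve reads from memory; call the loader only on a miss or an expired entry."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float) -> None:
        self._loader = loader  # hypothetical stand-in for a database query
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expiry time, value)
        self.loader_calls = 0  # counts how often the "database" was actually hit

    def get(self, key: str) -> str:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # fresh entry: no database call
        self.loader_calls += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


def fake_database_query(event_id: str) -> str:
    # Assumption: placeholder for a real query; returns canned text.
    return f"updates for {event_id}"


cache = ReadThroughCache(fake_database_query, ttl_seconds=60.0)
for _ in range(1000):
    cache.get("event-1")
print(cache.loader_calls)
```

In this sketch, a thousand reads of the same key reach the loader once, which is the effect a caching layer is meant to have on the database during a traffic peak. A production version would also need to handle many concurrent requests missing on the same key at once, which this example does not attempt.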

We appreciate the trust our customers place in us every day as they cover live news, and we can’t wait to show you on May 17th how far the platform has come. For more details, please contact your Account Manager.

Sincerely,
Jonathan Keebler
CTO
@KeeblerBlog
