Severe performance degradation
Incident Report for Bizzon
Postmortem

Problem

On 17 October 2022 at 16:03 UTC the platform gradually became unresponsive, severely degrading the service for all users. Most application operations started returning timeout errors. Database operations suddenly became very slow, which affected the whole platform. The platform monitoring tools showed that the DB engine had slowed down, but there was no obvious cause for this behavior.

On 18 October 2022 at 20:40 UTC the problem reoccurred. This time the team managed to isolate the root cause: a malfunctioning EFR node used for German fiscalization.

Action

On 17 October 2022, in an attempt to quickly restore database capacity, the team rebooted the database node at 16:52 UTC. This temporarily restored performance, but by 17:04 UTC the database was very slow again.

Since the reboot had no lasting effect, the team proceeded by doubling the database node's capacity (giving it more memory and CPU power). The resize was initiated at 17:28 UTC and completed by 17:45 UTC. This action also had no effect, proving that the incident was not caused by capacity constraints.

Connecting to the database node and querying its system tables showed multiple write transactions waiting for write locks to be released. This indicated concurrency issues, which arise when multiple long-running transactions attempt to modify the same data set. By 18:43 UTC the write locks were released and the DB engine resumed normal operations.
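
For illustration, this kind of check can be scripted. The following is a minimal sketch that assumes a PostgreSQL-compatible engine and the psycopg2 driver (the report does not name the actual engine or driver); it uses pg_stat_activity and pg_blocking_pids to list transactions waiting on locks held by other sessions.

    # Diagnostic sketch (assumptions: PostgreSQL-compatible engine, psycopg2
    # driver, illustrative DSN): list sessions currently blocked by others.
    import psycopg2

    BLOCKED_TX_SQL = """
    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           now() - xact_start    AS transaction_age,
           state,
           query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0
    ORDER BY transaction_age DESC;
    """

    def list_blocked_transactions(dsn):
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(BLOCKED_TX_SQL)
                return cur.fetchall()

    if __name__ == "__main__":
        for row in list_blocked_transactions("dbname=bizzon"):  # hypothetical DSN
            print(row)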

On 18 October 2022, when the incident reoccurred, the team did not intervene in the database configuration or infrastructure setup, but instead focused on analyzing the surrounding environment. The analysis showed a correlation between the database slowness and the German fiscalization errors. Once the fiscalization service was restored, database operations returned to normal.

Causes

As it turned out, creating or updating an order is accompanied by a fiscalization call to an external service, all of which happens within a single database transaction. Due to an unrelated issue, calls to the fiscalization service took unusually long, significantly prolonging the database transactions. Since fiscalization happens on order create/update, write locks were held much longer than needed, forcing subsequent transactions to wait until they were released.

In short, the problem with the fiscalization service cascaded into the database, causing it to stall and block all write operations for extended periods of time.
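
To make the failure mode concrete, here is a minimal sketch of the problematic shape of the order flow. It is illustrative only: the orders table, the fiscalize() helper, and the psycopg2-style transaction API are assumptions, not Bizzon's actual code.

    # Problematic pattern (illustrative): the external fiscalization call runs
    # inside the database transaction, so the row lock taken by the UPDATE is
    # held for the full duration of the external call.
    def save_order(conn, order):
        with conn:                          # transaction commits when the block exits
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE orders SET total = %s WHERE id = %s",  # takes a row lock
                    (order.total, order.id),
                )
                fiscalize(order)            # external service call; a slow response
                                            # keeps the lock held just as long

If the external call takes tens of seconds, every other transaction touching the same rows queues behind it, which matches the lock waits observed on 17 October.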

Solutions

The root cause revealed an unnecessary tight coupling between database operations and an external service. The team patched the logic to move the call to the external service outside of the database transaction. The fix was deployed to the live environment on 19 October.
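
A minimal sketch of the patched shape, under the same illustrative assumptions as above: the write commits and releases its locks first, and the external call happens outside of any transaction.

    # Patched pattern (illustrative): commit the database transaction first,
    # then call the external fiscalization service outside of it.
    def save_order(conn, order):
        with conn:                          # short transaction: locks released at commit
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE orders SET total = %s WHERE id = %s",
                    (order.total, order.id),
                )
        fiscalize(order)                    # no locks are held during the external call

One trade-off of this design is that a fiscalization failure can no longer roll back the order write, so it needs its own retry or error handling; the report does not describe how the patch addresses this.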

Additional monitoring of database parameters will also be put in place so that increased database load and unresponsiveness can be detected and responded to more quickly.
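
As an illustration of what such monitoring could look like (the report does not specify the tooling, thresholds, or database engine), a periodic check could count sessions waiting on locks and raise an alert above a threshold:

    # Hypothetical monitoring check (tooling, threshold, and PostgreSQL-compatible
    # engine are assumptions): alert when too many sessions are waiting on locks.
    import psycopg2

    BLOCKED_COUNT_SQL = """
    SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';
    """

    def lock_waits_exceed(dsn, threshold=5):
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(BLOCKED_COUNT_SQL)
                (blocked,) = cur.fetchone()
        return blocked >= threshold         # True means "raise an alert"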

Posted Oct 24, 2022 - 12:44 UTC

Resolved
This incident has been resolved.
Posted Oct 17, 2022 - 19:05 UTC
Monitoring
Server operation has been fully restored. We will continue to investigate and monitor its performance.
Posted Oct 17, 2022 - 18:45 UTC
Update
We are continuing to investigate this issue.
Posted Oct 17, 2022 - 17:45 UTC
Investigating
As of 16:03 UTC, we are noticing a severe performance degradation of our servers.
We are investigating the issue and will provide an update as soon as more information is available.
Posted Oct 17, 2022 - 16:26 UTC
This incident affected: Payments, Point of Sale, and Dashboard.