Delay in processing applications
Incident Report for PassFort
Postmortem

Summary

PassFort apologises for the disruption to service yesterday evening (Jun 16).

Between 18:20 and 21:40, newly created applications stayed in the Automating state for much longer than usual, which prevented them from being Approved. Notifications (for example, that an application requires manual approval) were also delayed.

All applications were processed normally by 02:05.

Incident

At 18:20, PassFort's processing queues began to increase in size. This was due to an unexpected 10x increase in load on our servers caused by a customer’s batch operation. The issue was exacerbated when many of the items in the queue errored due to a data provider issue and, as a result, were retried multiple times. The net result was an overall increase in load of more than 100x.
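As a rough illustration of how retries can turn a 10x increase into more than 100x, the sketch below uses a deliberately simplified model. The failure fraction and retry count are hypothetical values chosen for the example, not figures measured during the incident, and `amplified_load` is an illustrative helper rather than anything in PassFort's codebase.

```python
def amplified_load(base_multiplier, failing_fraction, retries_per_failure):
    """Estimate the total load multiplier when failing items are retried.

    Simplified model (an assumption): every item is attempted once, and each
    failing item is attempted `retries_per_failure` more times.
    """
    attempts_per_item = 1 + failing_fraction * retries_per_failure
    return base_multiplier * attempts_per_item


# Hypothetical inputs: 10x batch load, every item failing, 9 retries each.
print(amplified_load(10, 1.0, 9))  # 100.0
```

Under this model, a batch alone multiplies load by a constant, whereas a batch combined with failing, retried items multiplies it again by the number of attempts per item.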

PassFort automatically scaled but, due to the magnitude of the increase, hit its automated scaling limits. Applications were still seeing significant delays, so at 21:40 engineers intervened to manually scale the application further. New applications were prioritised in the queue to allow normal processing while engineers resolved the backlog.

Between 21:40 and 02:05, PassFort continued to adjust scaling and monitored the queues to ensure that the backlog was resolved as quickly as possible.

During the issue period, the API error rate was 10%, meaning the majority of requests continued to operate normally.

Resolution

PassFort will be taking a number of actions to ensure such an incident cannot occur again.

  • Rate limiting - We will introduce rate limits on API keys to prevent unexpected significant increases in load without prior agreement. In addition, we will ensure that any significant queue build-up is reported through API responses.
  • Customer queues - Currently, PassFort maintains a single processing queue for all customers. We will introduce specific customer queues so an increase in load from one customer will not affect the quality of service for other customers.
  • Scaling review - The majority of PassFort's infrastructure is able to scale without downtime. However, scaling PassFort's databases is a manual operation and requires a server restart with several minutes of downtime. We're reviewing how to mitigate this in the future.
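
The first two actions above can be sketched as follows. This is a minimal illustration under our own assumptions, not PassFort's implementation: a token bucket is one common way to rate limit an API key, and round-robin service across per-customer queues is one simple way to stop a large batch from one customer delaying everyone else. The class names, parameters, and customer identifiers are all hypothetical.

```python
from collections import OrderedDict, deque


class TokenBucket:
    """Hypothetical per-API-key limiter: bursts up to `capacity`,
    refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would reject or defer the request


class FairQueue:
    """Hypothetical per-customer queues, served round-robin."""

    def __init__(self):
        self.queues = OrderedDict()  # customer id -> deque of pending items

    def put(self, customer, item):
        self.queues.setdefault(customer, deque()).append(item)

    def get(self):
        if not self.queues:
            return None
        # Serve the customer at the front, then move them to the back.
        customer, pending = next(iter(self.queues.items()))
        item = pending.popleft()
        if pending:
            self.queues.move_to_end(customer)
        else:
            del self.queues[customer]
        return customer, item
```

With this arrangement, if one customer enqueues three items and a second customer then enqueues one, the second customer's item is served second rather than fourth. A production design would also need weighting, persistence, and limits on queue length, none of which are shown here.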
Posted Jun 17, 2020 - 17:43 BST

Resolved
Queues have remained normal overnight, so we are marking this issue as resolved.

We will follow up with a postmortem in the next few days.
Posted Jun 17, 2020 - 10:26 BST
Update
We are continuing to monitor for any further issues.
Posted Jun 17, 2020 - 03:15 BST
Monitoring
We have cleared the backlog of applications stuck in the Automating state.
Some user notifications and emails may be delayed.

We will monitor the situation over the next few hours.

We will send a postmortem on this within the next two days.
Posted Jun 17, 2020 - 02:06 BST
Update
We are continuing to work on a fix for this issue.
Posted Jun 16, 2020 - 23:02 BST
Update
New applications are now processing normally.

We are still working on processing the backlog of "Automating" profiles.
Posted Jun 16, 2020 - 22:06 BST
Update
We are continuing to work on processing the backlog.
Posted Jun 16, 2020 - 21:59 BST
Update
A fix has been implemented and we are processing the backlog of applications in the Automating state.

We will provide a further update in 30 minutes.
Posted Jun 16, 2020 - 21:24 BST
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 16, 2020 - 21:15 BST
Investigating
We have identified delays in applications being processed, which we are actively investigating.

Applications will show as “Automating” and will complete once we resolve the issue.

We will post an update in 30 minutes.
Posted Jun 16, 2020 - 21:07 BST
This incident affected: Portal and API.