On October 27, 2025, between 12:30 and 20:55 UTC, Loom was significantly degraded for all customers. Users also experienced short periods of degradation on October 28, 2025 from 20:26 to 20:33 UTC and on October 30, 2025 from 19:22 to 19:27 UTC.
During the incident, users experienced intermittent or failing playback, recording, and transcript generation; problems logging in; and general errors or slowness across the service. Our monitoring shows that between 20% and 80% of user interactions were failing during these periods, with the failure rate growing in the latter half of the incident.
We know that outages impact your productivity and we apologize to customers who were impacted during this incident. We are taking immediate steps to improve Loom’s reliability based on our learnings from this incident.
Beginning on October 25, 2025, Loom’s primary database cluster showed early indicators of a problem. Resource usage was growing slowly and replication slots were not keeping up. Engineers were alerted on October 26, 2025 and began investigating. We saw no signs of customer impact at this time; however, we did identify an issue with a background job that was failing and retrying in a loop. We shipped a temporary fix for this issue and planned to investigate further the next day.
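The report does not describe how Loom’s job system works, but the failure mode above — a job that fails and is retried indefinitely, adding steady load to the database — is commonly addressed by capping retries and parking exhausted jobs for manual review. The sketch below illustrates that general pattern; all names (`run_with_capped_retries`, `dead_letter`, the parameters) are hypothetical and not taken from Loom’s systems.

```python
import time


class RetryExhausted(Exception):
    """Raised when a job gives up after its final attempt."""


def run_with_capped_retries(job, max_attempts=5, base_delay=0.01, dead_letter=None):
    """Run `job`, retrying with exponential backoff; park it after max_attempts.

    An uncapped retry loop re-enqueues a permanently failing job forever,
    keeping constant pressure on downstream systems such as a database.
    A retry cap plus a dead-letter queue bounds that pressure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                if dead_letter is not None:
                    dead_letter.append((job, exc))  # park for manual review
                raise RetryExhausted(f"gave up after {attempt} attempts") from exc
            # exponential backoff: base, 2x base, 4x base, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

With a cap in place, a permanently failing job stops consuming resources after a bounded number of attempts instead of looping indefinitely.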
On Monday, October 27, 2025, the issue resurfaced significantly as usage increased, and we began observing customer impact. Engineers were monitoring the issue closely and engaged immediately. It was clear that our database was not performing as expected, and over the next few hours, we made multiple attempts at mitigation. Notably, we scaled up the database writer and two readers to relieve resource pressure. This did not help, and user impact remained unchanged.
Over the next several hours, we made multiple attempts to restore normal database performance. We rolled back recent changes, optimized individual query performance, added new indexes, updated database configurations, and vacuumed large system and application tables. These changes provided some short-term relief, but performance repeatedly worsened. Dropping our replication slots did resolve resource pressure, but did not improve overall query performance.
We then observed that identical queries were planning and executing much faster on the writer than on our reader instances, so we decided to shift all query traffic to the single writer instance. This change immediately mitigated the customer impact.
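The report does not say how Loom routes queries between the writer and its readers, but the mitigation described above — sending all query traffic to the writer — is typically a switch in a read/write routing layer. The sketch below is a minimal illustration of such a router; the class, its fields, and the writer-only toggle are all hypothetical, not Loom’s actual implementation.

```python
class QueryRouter:
    """Route writes to the writer and reads round-robin across readers.

    Setting `reads_to_writer = True` models the incident mitigation:
    every query, read or write, goes to the single writer instance.
    """

    def __init__(self, writer, readers):
        self.writer = writer
        self.readers = list(readers)
        self.reads_to_writer = False  # mitigation switch
        self._next = 0

    def route(self, is_write):
        """Return the instance that should serve this query."""
        if is_write or self.reads_to_writer or not self.readers:
            return self.writer
        # Round-robin across reader instances for read traffic.
        reader = self.readers[self._next % len(self.readers)]
        self._next += 1
        return reader
```

The appeal of this kind of mitigation is that it is a single, quickly reversible switch: no instances are created or destroyed, so traffic can be shifted back to the readers as soon as they perform acceptably again.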
While this action resolved the immediate issue, it was not a stable state for our service long term. Over the next few days, we made multiple attempts to revert to our typical single-writer, dual-reader architecture. Each time, we saw the same issues resurface, resulting in short periods of degradation. In between each of these attempts, we made further optimizations to our database usage, but we were still not seeing the performance we expected when adding back our reader instances. This was especially challenging because our services worked as expected with a single reader, and we only observed this degradation when adding a second reader to the cluster.
On October 31, 2025, we attempted the failover again with a smaller instance size, matching our original architecture before the vertical scaling attempt. This succeeded, restoring Loom’s standard architecture and fully mitigating the incident.
Broadly speaking, this incident can be divided into two parts. First, the initial degradation was caused by multiple changes made to Loom in the months before the incident. Second, the extended mitigation period resulted unexpectedly from one of our mitigation attempts: scaling up the database instance sizes.
The first part of this incident was caused by a complex interaction of multiple old and recent changes to Loom services:
The second part of this incident was caused by our earlier mitigation attempt of scaling up our instance sizes to handle the increased load from the first part:
As part of our investigation and mitigation efforts during the incident, we have already completed several key actions to address similar issues, including:
Additionally, we are prioritizing the following improvement actions:
Thank you,
Atlassian