Route and Deliver Web Outage

Incident Report for Innovo

Resolved

We had a great day yesterday! During our peak time (10 am-2 pm EDT), we averaged about 350 transactions per second and 2,800+ reads per second. Average CPU usage stayed between 20% and 40%. In addition to the RAM increase on Friday, we rebuilt indexes over the weekend, which seems to have really helped. We will continue to monitor the numbers this week, and if we find anything else that needs to change to keep up with the load, we will change it. Thank you for your continued patience - we really appreciate you!

Posted Apr 01, 2025 - 06:42 MDT

Update

We have finally identified the root cause of the outage.

In January we made a change to turn on Deliver alerts for everyone by default. Because of this change, we have now had three outages so far this year, which is more than we've experienced in past years. The two previous times, we put some new indexes in place and thought the issue was resolved. Unfortunately, that was not the case. Today the DB server pegged at 100% CPU. When that happens, the application server starts dropping connections and then struggles to reconnect to the database. Luckily, the connections still went through, however slowly, so in theory the portal should still show the stops as delivered, skipped, etc. And as I said previously, all signatures, photos, and comments were successfully updated to Eclipse.
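
The reconnect struggle described above is the kind of situation where clients commonly space out their retries so an overloaded database is not hit by a burst of reconnect attempts. Below is a minimal, illustrative sketch in Python of capped exponential backoff with jitter. It is not the Route and Deliver code: `connect_with_backoff`, `backoff_delays`, the `connect` callable, and every parameter value are hypothetical names and numbers chosen for the example.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random.random):
    """Yield one capped, jittered delay per attempt (values are illustrative)."""
    for attempt in range(attempts):
        # Double the ceiling each attempt, but never exceed `cap` seconds.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": pick a random delay between 0 and the ceiling.
        yield rng() * ceiling


def connect_with_backoff(connect, base=0.5, cap=30.0, attempts=6,
                         sleep=time.sleep, rng=random.random):
    """Call `connect` until it succeeds, waiting longer after each failure.

    `connect` is a hypothetical zero-argument callable that returns a
    connection object or raises ConnectionError.
    """
    last_error = None
    for delay in backoff_delays(base, cap, attempts, rng):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait before retrying so the server can recover
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

With `rng` fixed to return 1.0, the waits after the first two failures would be 0.5 and 1.0 seconds. In practice the random jitter spreads retries from many clients so they do not all reconnect at the same instant.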

To fix this today, we increased the RAM on our DB server based on recommendations from MongoDB. On Sunday we will also create additional indexes to mitigate any further slowness. In addition, we will work with MongoDB next week to implement further performance recommendations.

Again, I am really sorry this happened today, and we really do appreciate your patience. We will continue to keep you updated as we learn more.

Posted Mar 28, 2025 - 18:47 MDT

Update

Unfortunately, we are still experiencing issues and connections are intermittent. We will keep you updated.

Posted Mar 28, 2025 - 11:57 MDT

Monitoring

We have traced the issue to MongoDB (our database provider) and are working with their engineers to implement auto-scaling of our cluster to mitigate it. This seems to be working for the moment. We will continue to monitor throughout the day.

Thank you all for your patience.

All of your signatures, photos, comments, etc. were transmitted to Eclipse as normal during the outage. The outage affected only the portal side.

Posted Mar 28, 2025 - 10:33 MDT

Update

The DNS change worked for a bit and then we went down again. We have been on the phone with our database provider and AWS all morning to try to get to the bottom of things. We will keep you posted as we learn more.

Posted Mar 28, 2025 - 09:56 MDT

Update

We've pointed our DNS to our backup server so Route and Deliver Web should be coming back online. We are still investigating what caused the outage and will keep you posted.

Posted Mar 28, 2025 - 08:34 MDT

Update

We are continuing to investigate this issue.

Posted Mar 28, 2025 - 08:09 MDT

Investigating

It appears we are currently experiencing an outage. We will investigate and let you know as soon as we know more.

Posted Mar 28, 2025 - 08:08 MDT

This incident affected: Deliver and Route.