Las Vegas - Investigating Reports of Latency
Incident Report for True North Cloud
Postmortem

Event Recap

On 08/03/21, at 6:30 a.m. PDT, the True North helpdesk began receiving reports of connection performance issues at the Las Vegas data center. A review confirmed that random disconnects were occurring across multiple client environments. True North engineers immediately began a technical deep dive and discovered a significant increase in file activity on both database and application servers across several client environments. The resulting increase in infrastructure latency caused the random session disconnects.

Impact

During the event, some users were able to connect and work actively in the EMR, while others had difficulty establishing and/or maintaining an active login session.

What We Found

While we were still awaiting final vendor log analyses, we had enough data to identify the source of the problem and took action that reduced latency to its pre-event levels.

Each database and application server in the True North data center has an EDR (endpoint detection and response) security scanning component installed. This tool, SentinelOne, protects the operating system it is installed on from malicious activity and ransomware attacks. After consulting with several vendors during the investigation, we found that the active scanning function of the EDR on each machine saw an increase in activity in the hosted environments that the detection software categorized as abnormal. This appears to have been a false-positive event caused by the detection of “new” and/or unexpected traffic patterns following the recent increase in customers who enabled Medication Management for athenaPractice/athenaFlow. Although the v20 upgrade did not itself cause the false positive, the related changes to file systems, databases, and file movement patterns triggered SentinelOne to take immediate defensive measures. The extreme load of this additional scanning and log generation led to the spike in latency that caused the event.

Actions Taken

After the discovery, True North worked directly with SentinelOne to fine-tune the EDR logging and scanning filters and relieve the additional load this traffic caused. Exclusions were put in place based on findings with the vendor. After we applied the new filters to the scanner and reset all agents, the traffic load returned to normal levels. User connections to the data center began to re-establish just after 12 p.m. PDT and continued to increase through 3 p.m. PDT. Performance and activity loads have remained at expected levels since that time.

We want to be clear that the recent events are not related to any malicious activity and are not a result of a cybersecurity threat. All of your organization’s data is secure and 100% accounted for. Please be assured we are continuing to closely monitor the infrastructure for any signs of anomalies. In addition, we will continue periodic reviews and updates to internal processes to ensure more frequent and proactive communications, including status page and/or ticket updates, during a customer-impacting event such as this.

If you have any questions about the event, please let us know and we’ll do our best to respond within 72 hours. If you have any new urgent or work-stoppage issues, please call the helpdesk to open an emergency ticket so that we can assist you in a timely manner.

Posted Aug 18, 2021 - 15:28 PDT

Resolved
True North Cloud Services is marking this case resolved. RCA to follow.
Posted Aug 18, 2021 - 15:10 PDT
Monitoring
The storage latency performance has normalized and we are monitoring all systems for stability. We will continue to watch this closely and we will release an RCA within 48 hours of marking this incident as Resolved.
Posted Aug 03, 2021 - 18:49 PDT
Update
We are making progress and seeing performance improvements that should improve connectivity on your end. We are, however, still investigating, and we do not yet consider the issue stable or resolved. If you have an open emergency ticket logged with the support team, we apologize for not responding to each ticket in a timely fashion; the volume of tickets exceeded expected levels. We will continue to post updates to True North Cloud Status as they become available.
Posted Aug 03, 2021 - 14:03 PDT
Update
We are continuing to investigate this issue.
Posted Aug 03, 2021 - 14:02 PDT
Update
We continue to investigate latency issues in the Las Vegas facility. Latency metrics have improved over the last few hours. We will update this page as more information becomes available.
Posted Aug 03, 2021 - 13:00 PDT
Update
We are still investigating the root cause of the latency issues occurring within the Las Vegas facility. There is no ETA for resolution at this time.
Posted Aug 03, 2021 - 09:32 PDT
Investigating
We are investigating reports of increased latency occurring in the Las Vegas facility. We are actively engaged and will update this page with more information as it becomes available.
Posted Aug 03, 2021 - 06:35 PDT
This incident affected: True North Cloud Las Vegas Overall Availability (Las Vegas Virtual Machine Availability).