Thalamus Incident Report – 11/17/2021

Dear Thalamus Community, 

It is part of our culture and mission at Thalamus to never hide from a problem.  Instead, we tackle problems head on to continuously improve our software, promote transparency among our users, and better serve the GME community.

On Wednesday (11/17), starting at ~12pm ET, Thalamus experienced an issue that left a segment of our users unable to access Thalamus.  Our team immediately responded to the situation, and all Thalamus services were fully restored within 40 minutes. 

We sincerely apologize for the disruption this caused the affected users.  We want to provide a summary of what occurred, why it occurred, and what actions have been taken as a result, both in the short and long term.

What happened on Wednesday (11/17):

  • ~11:50am ET: The number of users on Thalamus jumped 1000% over a few-minute period, coinciding with Orthopaedic Surgery applicants logging in to Thalamus for the Universal Interview Offer Day (UIOD).  This was anticipated. 
  • 11:56am ET: Our internal system monitoring detected increased utilization and slowing of our API across a subset of our network instances.
  • 11:58am ET: A subset of these instances became overloaded, leaving affected users unable to log in and/or load pages/data.
  • As an intervention, our team immediately took the following steps:
    • 12:00pm – 12:10pm ET: Performed a system-wide analysis and assessment, and posted a notification of the issue to our Twitter feed/status page.
    • 12:10pm ET: Scaled Thalamus infrastructure to accommodate increased demand.
    • 12:15pm ET: Began selectively routing traffic to additional, lower-utilization instances to rebalance network traffic and ensure users not experiencing issues could continue using Thalamus unaffected.
    • 12:40pm ET: All Thalamus services were fully restored for all users.
    • 12:53pm ET: After further assessment and validation, an announcement that services were fully restored was posted to our Twitter feed/status page.

Why did this occur?  It was not any single event, but rather the combination of the following factors:

  • While our team anticipated all Orthopaedic Surgery applicants being on Thalamus at this time, we did not expect that many users would be logged in concurrently on multiple devices (many on more than two).
  • Most users were refreshing their sessions every few seconds, across multiple devices.
  • Many applicants logged in prior to the common interview release, at the same time and across multiple devices, and thus were routed to the same instance.  This caused a load-balancing disparity across instances (i.e., some instances were very busy while others were not), which is why only a subset of users was affected (illustrated in the simplified sketch below).
  • This also coincided with Orthopaedic Surgery programs opening their interview dates and wait lists as part of the UIOD, in addition to the normal activity of all other users on Thalamus.

As a result of our investigation, we determined that the combination of these events caused the issue.
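For readers who would like a more concrete picture, the following is a minimal, simplified Python sketch of the routing pattern described above.  It is not Thalamus's actual architecture or code; the instance names, percentages, and counts are purely illustrative assumptions used to show how a burst of simultaneous logins, pinned to whichever instance handles a user's first request, can overload one instance while others remain lightly used.

import random
from collections import Counter

# Hypothetical, simplified model (not Thalamus's actual architecture).
INSTANCES = [f"instance-{i}" for i in range(6)]

def route_first_request(user_id: int, sticky_pool: dict) -> str:
    """Pin each user to the instance chosen for their first request."""
    if user_id not in sticky_pool:
        # Model a burst of simultaneous logins mostly landing on one
        # instance: assume 70% of new sessions hit instance-0.
        sticky_pool[user_id] = (
            "instance-0" if random.random() < 0.7 else random.choice(INSTANCES)
        )
    return sticky_pool[user_id]

def simulate(users: int, refreshes_per_user: int) -> Counter:
    """Count requests per instance when pinned users refresh repeatedly."""
    load = Counter()
    sticky = {}
    for user in range(users):
        target = route_first_request(user, sticky)
        # Each device refreshing every few seconds multiplies the requests
        # that hit the pinned instance.
        load[target] += refreshes_per_user
    return load

if __name__ == "__main__":
    print(simulate(users=1000, refreshes_per_user=30))

In this toy model, one instance ends up carrying the large majority of requests while the others stay nearly idle, which matches the pattern we observed: some instances very busy, others unaffected.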

To provide further context, Thalamus has managed several specialty-wide interview release dates over the near-decade history of the company, with no downtime.  Many of these have involved larger specialties with many more programs, applicants, and interview dates.  We also consistently load and performance test our system, have auto-scale rules in place to increase infrastructure as needed, and dynamically load balance our software to ensure optimal performance.  We also add dedicated machine and human resources to accommodate these common interview release days.  We would like to reassure the few remaining specialties with common interview release days that this will not be an issue going forward.

What steps has Thalamus taken to address this?

Immediate Actions:

  • Every user who reported an issue has been contacted individually to confirm that their issue was resolved.
  • As a temporary precautionary measure, we have expanded Thalamus infrastructure to accommodate many more users than are currently using the site.
  • Our team continues to monitor all Thalamus systems as part of our normal escalation policies and procedures.
  • All Thalamus systems have been functioning optimally since the resolution of the issue at 12:40pm on Wednesday (11/17).

Longer Term Actions:

  • We are exploring expanded, dedicated infrastructure for interview scheduling on the website versus the mobile app, as well as for Thalamus video, across both applicant and program users.
  • We are further optimizing the functionality programs use to open interview dates/wait lists under high and repetitive user load.
  • We are incorporating this new usage pattern into our regular testing procedures, simulating continuous rapid refreshes across multiple devices and validating our enhanced load balancing (a simplified example of such a simulation appears after this list).
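As an illustration of what this kind of simulation can look like, here is a minimal, hypothetical Python sketch of a load test that models many users refreshing rapidly from several devices at once.  The URL, user counts, refresh counts, and worker counts are placeholder assumptions, not our actual testing parameters or tooling, and such a test should only ever be pointed at a dedicated test environment.

import concurrent.futures
import requests

# Placeholder values for illustration only.
TARGET_URL = "https://example.com/dashboard"  # point at a test environment
USERS = 200
DEVICES_PER_USER = 3        # model applicants logged in on several devices
REFRESHES_PER_DEVICE = 20   # model rapid, repeated page refreshes

def simulate_device(session: requests.Session) -> int:
    """One device repeatedly refreshing the page; returns successful requests."""
    ok = 0
    for _ in range(REFRESHES_PER_DEVICE):
        try:
            if session.get(TARGET_URL, timeout=5).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass  # count as a failed refresh
    return ok

def run() -> None:
    # One session per simulated device, all refreshing concurrently.
    sessions = [requests.Session() for _ in range(USERS * DEVICES_PER_USER)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(simulate_device, sessions))
    total = USERS * DEVICES_PER_USER * REFRESHES_PER_DEVICE
    print(f"{sum(results)}/{total} refreshes succeeded")

if __name__ == "__main__":
    run()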

We consider it an honor and a privilege to partner with the GME community, and that’s why we don’t take any issue with Thalamus’ functionality lightly.  We again apologize for the disruption and take full responsibility for the inconvenience this caused our affected users, especially the Orthopaedic Surgery community and the subset of applicants and programs whose interviews were disrupted during this time.  We hope that in addressing the issue expediently, we were able to demonstrate our commitment to you and our dedication to rapid problem solving.

My team and I truly appreciate your continued use of Thalamus. GME expects and deserves the best, and we will continue to work hard to provide an optimal Thalamus experience.  As always, if there is anything else we can do to assist you, please contact us at customercare@thalamusgme.com and our team and leadership (myself included) will be available to address any concerns.

Sincerely, 

Jason Reminick, M.D., M.B.A., M.S.

CEO & Founder, Thalamus