How We Responded and What Steps We're Taking
Thanks again for your patience and understanding during Wednesday's outage. The performance and reliability of your virtual world is our top priority as we grow to meet the incredible demand we're experiencing. In an effort to provide complete transparency, I've asked my team to write this detailed incident summary covering what happened, what we learned, and the steps we are taking to greatly improve the resilience of our platform.
Incident Response: What Happened, How We Responded and the Steps We're Taking
Early Wednesday morning, VirBELA Open and Private Campuses were unreachable for many of you for approximately 4 hours. First, we’d like to apologize to all of our customers and users who were impacted. We also want to provide an explanation of what happened and the steps we’re taking to mitigate similar issues in the future.
On July 16th at approximately 5 AM PDT, our database provider, IBM Compose, experienced an unplanned total outage in their US East server cluster. This outage took down our user authentication system, which was the root cause of the login issue our users experienced.
One of the main reasons we chose Compose was their robust backup and recovery features. Unfortunately, the outage also took down the Compose web portal that gives us access to our backup and recovery tools. This prevented us from recovering in minutes by switching to a different server cluster.
Although our servers were still down, as soon as the Compose web portal was operational again we were able to retrieve our backup and run our recovery protocol to shift the authentication system to another server cluster. However, because our backups are scheduled on a daily basis, we lost about 20 hours' worth of user registrations and admin setting changes.
We decided that getting the servers up immediately outweighed the potential data losses, and our engineering team was able to stand up a new environment in a matter of minutes. We were able to fully recover all the data that was initially lost, and are currently working with individual customers to provide missing data.
We learned two main lessons from dealing with this issue: our backups need to be housed with a separate vendor to prevent a single point of failure, and our customer communication needs to happen much more rapidly.
Steps we’re taking to prevent and mitigate this issue in the future
- We’re evaluating our relationship with this partner
- Our engineering team has started building a backup process using another vendor
- We’re investing in additional application monitoring solutions to enhance our visibility into third-party services
- We are implementing a new process to notify customers more quickly in the event of another outage
- VirBELA teams are additionally evaluating all systems and processes for redundancy and recovery
Again, we apologize to all of our users for this outage. We take the performance and reliability of VirBELA and the worlds we build very seriously.
Please arrange for a call with our leadership team if you have any questions or concerns that we can address with you or your teams.