Actions to Take
Let’s start with the actions we recommend you take as a TSM user:
- All users: Update the TSM Desktop Application to the latest version (r410)
- Your app should update automatically, but you can download it here manually
- All users: Re-select any Burning Crusade Realms previously configured on the Realm Selection Page for pricing data
- All users: Enable any Group Notifications previously selected on the Deal Notifications Page
Background
Early in the morning US time on Friday, March 25th, we were notified by our hosting provider that the hard drive containing our primary database had died and that they were unable to recover any data from it. After a ton of work, we have since been able to restore all of our services. Unfortunately, there was some data that we were not able to recover:
- Premium user addon backups
- Desktop App realm selections for Burning Crusade realms
- User configuration data for Group Notifications
- Some other account-level and user configuration changes made between 4am PDT on March 24th and when our website was brought back online, including new accounts and any other changes made on our website during that time
What we Did
We take regular backups of our database and quickly worked to restore from those backups on Friday. Unfortunately, we realized that our backup process had not been updated to include database tables related to Burning Crusade Classic. Additionally, a configuration issue with a script that manages Premium user addon backups caused all of those backups to be lost in this outage.
On Saturday we brought the website back online in a read-only mode as we continued to try to restore additional data from a few raw database files which had not been corrupted. After much work, we succeeded in restoring some additional data which we otherwise didn’t have proper backups of, but unfortunately, this still left us with the list of completely lost data above.
Naturally, the Premium user addon backups are the most critical. Working with a manifest of the missing files, we modified the TSM Desktop Application (version r410) to re-upload any local backup files whose names match an entry in the recovered manifest. While we realize this does not guarantee a full recovery of all previously synced backups, we have been able to recover a majority of the recent backups that Premium users still had stored locally.
With this in mind, we have set up a dedicated contact address for Premium users who may be missing an important backup that was not captured in the re-upload efforts. We encourage those users to reach out to outage@tradeskillmaster.com and we will explore any available options to make this right.
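To illustrate the manifest-based re-upload described above, here is a minimal sketch of the matching step. It is not the actual Desktop Application code: the function names, the flat backup directory, and the manifest being a plain list of file names are all assumptions made for illustration.

```python
from pathlib import Path


def select_backups_to_reupload(manifest_names, backup_dir):
    """Return local backup files whose names appear in the manifest.

    Hypothetical helper: assumes the manifest is a list of file names and
    that local backups sit directly inside backup_dir.
    """
    wanted = set(manifest_names)
    matches = [
        path
        for path in Path(backup_dir).iterdir()
        if path.is_file() and path.name in wanted
    ]
    # Sort by name so the upload order is deterministic.
    return sorted(matches, key=lambda path: path.name)


def missing_from_local(manifest_names, backup_dir):
    """Return manifest entries with no local copy (not recoverable this way)."""
    local = {path.name for path in Path(backup_dir).iterdir() if path.is_file()}
    return sorted(set(manifest_names) - local)
```

The second helper shows why this approach cannot guarantee full recovery: any manifest entry without a matching local file simply has nothing to re-upload.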
Additionally, with new resources available along with the protections and processes described in the next section of this post, we have removed the limit on the number of Premium user addon backups that can be marked as ‘saved’ and stored by TSM in the cloud going forward. This means Premium users no longer have to decide which backups to keep available long-term and all backups will be saved indefinitely.
Finally, if you created an account or reset your password on March 24th or March 25th and are having trouble logging in, we recommend re-creating your account or resetting your password again.
Learnings / Improvements
We’ve only had one outage like this in the past, but this one is significantly more serious given the loss of user data. We take the impact of this event seriously and are working hard both to correct things in the short term, as discussed above, and to make immediate changes that prevent this from happening again.
First and foremost, we’ve enabled multiple additional levels of redundancy and protection around our Premium user addon backups. This will prevent them from being irreversibly lost in the future and provides a mechanism to quickly recover them if any issues arise. This will have a non-negligible impact on our infrastructure costs, as it is a non-trivial amount of data, especially now that we no longer limit the number of backups we store for people, but it is the right thing to do given the importance of this data.
Next, we have addressed the gap in our database backup process which resulted in some user configuration data being lost, so any similar event in the future will just be a matter of quickly restoring the backups we have. We’ve also built a number of additional tools to help us recover any lost data in the future, although, of course, we hope we’ll never need to use those.
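As an illustration of the kind of check that catches a backup gap like this one, the sketch below compares the tables that exist in a database against the tables a backup job is configured to include. It is an assumption-laden example rather than a description of our actual tooling: the function names and the table names in the usage note are hypothetical.

```python
def find_unbacked_tables(live_tables, backed_up_tables):
    """Return tables that exist in the database but are absent from the backup set."""
    return sorted(set(live_tables) - set(backed_up_tables))


def assert_backup_coverage(live_tables, backed_up_tables):
    """Raise if any live table is not covered by the backup configuration."""
    missing = find_unbacked_tables(live_tables, backed_up_tables)
    if missing:
        raise RuntimeError(
            "Backup configuration is missing tables: " + ", ".join(missing)
        )
```

For example, calling `find_unbacked_tables(["users", "bcc_realm_selections"], ["users"])` reports the hypothetical `bcc_realm_selections` table as uncovered. Running a check like this on a schedule would flag a newly added table before it is needed in a restore.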
Lastly, the affected database has been running since 2016. A lot has changed since then in the server infrastructure space, and we’ve been steadily working over the past year to reimplement and migrate our backend services and infrastructure to a more modern, scalable, and maintainable architecture. Most of this has been behind the scenes and is already supporting more user-facing things like Ledger and all of the AuctionDB data which is downloaded by the desktop app, but this outage has definitely raised the importance of moving more things over to this much better system as quickly as possible.
Closing Thoughts
To wrap things up, we’d like to sincerely apologize for the inconvenience caused by this downtime and for the loss of data that occurred. We are looking forward to doing better and further improving the software and services we provide, and we appreciate your continued support and usage of TSM. If you have any thoughts, questions, or feedback, please feel free to share them in the #discussion channel on our Discord server.