At about 6am UTC on 2018-06-09, Sortd started experiencing intermittent issues, which impacted some users during the weekend. On Sunday 2018-06-10 the issue appeared to escalate and seemed to be related to a system used to store (temporary) login sessions, so a forced step-down/migration was performed to remove the offending infrastructure. This had the unfortunate consequence of logging some users out.
Earlier the previous week we had started seeing issues with one of our suppliers' systems, which is critical to our infrastructure. We had seen this before, and we had been planning to replace that infrastructure with another supplier for some time. We had been preparing for and testing the migration to the new supplier prior to the weekend, but the events of the preceding week and of the weekend forced us to accelerate the move to the new infrastructure. Additionally, Sunday's issues escalated as system load increased when people started work on Monday, and it became apparent that the weekend's changes had not fixed the problem.
The decision was made to migrate to the new infrastructure: the load was still relatively low, there should have been no downtime at all, and we were ready (or so we thought). Needless to say, this did not go according to plan. During the migration we (finally, correctly) identified the issues that had been occurring over the weekend, which were easy enough to fix once we knew what we were looking at, but it was too late to stop the migration. At first everything looked great, but we soon started to notice an increase in errors and that Sortd was slowing down.
We were not sure if what we were experiencing was a consequence of the weekend's events, related to the migration, or the result of some code change. A mad scramble ensued. We knew data was being duplicated thousands of times, which caused the entire system to slow down and eventually fail for certain users, but it took us some time to confirm that the issue was still occurring and to pinpoint where. We decided to take Sortd offline completely, as the volume of data was growing fast (tasks were being duplicated tens of thousands of times) and we needed to fix the issue before the data grew so large that we would spend the next week trying to find and remove the superfluous data (which, while in place, would have stopped boards from loading). [Sometimes fast servers can be a problem.]
Once we were able to reproduce the problem in our test environment, we were able to find and fix it; the cause was a change in a software library required for the migration. This took us a little while, and Sortd was completely offline for about 1-2 hours from about 10:30am UTC on 2018-06-11. While Sortd was technically back up and running, the data explosion meant that for many people their data had grown to the point where their boards would no longer load, and Sortd would time out and/or throw an "Oops" error. We have spent the last 24-36 hours removing the duplicates (Sortd didn't actually duplicate the real data, just the references on lists and boards), so for some users Sortd was down for a long time. Unfortunately, even after we cleaned up the data, the user's browser would then try to post all the duplicate data back, so we were fighting on two fronts. Oh, and did I mention one of our own servers suddenly died while we were busy with the data cleanup? Hardware (even virtualised hardware) is fun.
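Since the duplicates were only references on lists and boards (not copies of the tasks themselves), the cleanup essentially amounted to deduplicating reference lists. Here is a minimal sketch of that idea; the function and data names are ours for illustration, not Sortd's actual code:

```python
# Hypothetical sketch: removing duplicated task references from a board
# list while preserving the order of first appearance. Only references
# were duplicated, so the underlying task data stays untouched.

def dedupe_task_refs(refs):
    """Return refs with duplicates removed, keeping first occurrences in order."""
    seen = set()
    cleaned = []
    for task_id in refs:
        if task_id not in seen:
            seen.add(task_id)
            cleaned.append(task_id)
    return cleaned

# A list where some task references were duplicated many times:
refs = ["t1", "t2", "t2", "t2", "t3", "t2", "t1"]
print(dedupe_task_refs(refs))  # ['t1', 't2', 't3']
```

At the scale described above (tens of thousands of duplicates per board), the same first-seen-wins pass would be run server-side across every affected list, which is why the cleanup took so long.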
Everything should be resolved now; however, if you have not refreshed Sortd you may still get an Oops error if your Sortd tries to post back bad data. After a refresh you should be good to go. If you do still see any duplicates, please do not delete them, as they are kind of 'ghosts' of your data and deleting one will delete them all. To the best of our knowledge there should be no remaining cases of this, but if we have missed something, please contact us via support.
We apologise for the inconvenience caused and thank you for your patience and understanding. We too use Sortd all day, and know how important it is for it to be up and running at all times.
PS. Unless we discussed it with you via Twitter or our support channels, nobody saw any of your data; all data cleanup was performed by authorised people via automated scripts/programs.