Earlier today, the OpenTrainTimes website was down. The root cause was that one of the disks was full. We’ve now fixed that and put our recovery process into action. Replaying a backlog of messages takes time, a problem we’ve had before, so it took until about 3pm for us to catch up on the messages we missed.
Although we have plenty of monitoring in place to alert us to problems such as this, we’ve found that in this case the monitoring was set up for the wrong server. Consequently, nothing alerted us that the disk was about to fill up.
We’ll be fixing this over the coming days.
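For the curious, here is a minimal sketch of the kind of check involved. It is an illustration only, not our actual monitoring configuration: the function name `check_disk`, the 90% threshold and the `expected_host` parameter are all assumptions made for this example. The idea is that the check reports which host it is running on, so monitoring pointed at the wrong server shows up as a warning rather than as silence.

```python
import shutil
import socket


def check_disk(path="/", threshold=0.9, expected_host=None):
    """Return a list of warning strings; an empty list means no problems found.

    Hypothetical helper for illustration: names and defaults are assumptions.
    """
    warnings = []

    # Confirm the check is running where we think it is.
    host = socket.gethostname()
    if expected_host is not None and host != expected_host:
        warnings.append(
            f"check is running on {host!r}, expected {expected_host!r}"
        )

    # Compare the used fraction of the filesystem against the threshold.
    usage = shutil.disk_usage(path)
    if usage.total > 0:
        fraction_used = usage.used / usage.total
        if fraction_used >= threshold:
            warnings.append(
                f"{path} on {host} is {fraction_used:.0%} full "
                f"(threshold {threshold:.0%})"
            )

    return warnings


if __name__ == "__main__":
    for warning in check_disk("/", threshold=0.9):
        print("WARNING:", warning)
```

Run on a schedule, a check like this would only be useful if its warnings are routed somewhere a person will see them.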
We’re sorry for the extended degradation in service today. We don’t take outages lightly, and we’ll be looking at ways to stop this happening in the future.