We had some problems with OpenTrainTimes earlier today. Although the public site is not operated for profit, we take uptime seriously and we’ve produced this review of what happened.
If you use OpenTrainTimes as part of your job and you’re interested in a commercially supported version of the site, including freight data and integrations with your stock and crew systems, please drop us a mail at email@example.com.
Earlier this evening, we had multiple users reporting that maps on OpenTrainTimes were lagging.
Upon investigation, we found an unusually large number of users on the site for the time of day combined with a 45 minute backlog on our train describer feed.
We temporarily disabled the train movement feed and turned off logging for the real-time maps in order to process this backlog. Once the backlog had cleared, we turned the train movement feed back on and monitored the service whilst the backlog of TRUST messages cleared.
The site returned to normal operation by about 2045.
OpenTrainTimes is a very popular site, and several hundred users are usually viewing multiple maps at the same time. This figure grows steadily and gradually over time, and we review our capacity every few months to make sure we’re not caught out. Each time we release a new map, the base load on our servers increases as we have anything up to 500 new pieces of signalling data to process – and then there are the extra users that the maps attract.
But that wasn’t the issue – but not by the number of users, but by the type of users!
Briefly, when a user’s web browser connects to our map server, it either uses a long-lived connection over which map data is sent, or it requests map data every few seconds. Several things influence which is chosen, but it’s usually down to whether the device is behind a proxy server – not all proxy servers allow, or support, long-lived connections over websockets.
This evening, we noticed a larger than normal number of polling users. Since a large percentage of OpenTrainTimes users are coming from a mobile device, we think this may be because a change was made at one of the mobile network providers which meant our websocket implementation couldn’t be used by clients.
Normally, this is OK – but the gradual and continual increase in users each week, coupled with a gradual surge in the number of connections that our server was logging, meant there was insufficient CPU time available to process all of the data coming in to us from Network Rail.
The first thing we did was to turn off logging – we don’t really need it day-to-day, and it bought us some time. We then switched off processing TRUST messages, allowing them to queue whilst we allocated the rest of the server’s capacity to processing the backlog of train describer (TD) messages. It took about 20 minutes to process the TD messages, after which we turned TRUST messages back on. Processing the backlog of those messages, plus the remaining TD messages took about another hour.
What we’re going to do about it
First of all, we’re sorry that we missed a trick and took too long to respond to the initial reports of a problem.
We’re going to add some new health checks to our monitoring system, one of which will enable us to monitor the size of any message backlog.
We’re also going to look at scaling out our servers to cope with the extra demand and leave more breathing room – but this means our costs will double, so we’ll need to make sure this is sustainable.
And finally, we’re going to press forward with the new version of OpenTrainTimes which builds on the six years experience we’ve had working with railway data, and will be quicker and better than the current version.
So, sorry for the problems this evening.
Director, OpenTrainTimes Ltd.