What’s new – 25th June 2017

Monday’s problems on the website have caused us to scale our release plans back a bit to make sure the site’s stable. Some of the work we’ve done since Monday has involved adding extra monitoring to check that our fixes have been successful and, so far, they have been.

So, what have we added?

The biggest thing is the new Gloucester map, which links up with the Bromsgrove and Swindon maps, covering Gloucester to near Chepstow, Stroud and Stonehouse to Standish Junction, Gloucester Yard and all the way up to just outside Bristol Parkway. We’re planning to extend our coverage westwards too.

We’ve also fixed numerous little things, from platform numbers and junction links on the Warrington map, to missing signals between Crewe and Winsford, some TVM430 marker boards which should have been limit-of-shunt markers at Staines, and missing signals at Worting Junction. We’ve also taken off the schedule count on the Schedule Search page, which was causing inefficient database lookups and contributing to Monday’s issue.

Thanks to everyone who’s been in touch over the past fortnight – it’s time to sit in the remains of the sunshine with a glass of wine and chill out for a bit.

Post-Incident Review

We had some problems with OpenTrainTimes earlier today. Although the public site is not operated for profit, we take uptime seriously and we’ve produced this review of what happened.

If you use OpenTrainTimes as part of your job and you’re interested in a commercially supported version of the site, including freight data and integrations with your stock and crew systems, please drop us a mail at hello@opentraintimes.com.

What happened?

Earlier this evening, we had multiple users reporting that maps on OpenTrainTimes were lagging.

Upon investigation, we found an unusually large number of users on the site for the time of day, combined with a 45-minute backlog on our train describer feed.

We temporarily disabled the train movement feed and turned off logging for the real-time maps in order to process this backlog. Once the backlog had cleared, we turned the train movement feed back on and monitored the service whilst the backlog of TRUST messages cleared.

The site returned to normal operation by about 2045.

More detail

OpenTrainTimes is a very popular site, and several hundred users are usually viewing multiple maps at the same time. This figure grows steadily and gradually over time, and we review our capacity every few months to make sure we’re not caught out. Each time we release a new map, the base load on our servers increases as we have anything up to 500 new pieces of signalling data to process – and then there are the extra users that the maps attract.

But that wasn’t the issue this time – the problem was caused not by the number of users, but by the type of users!

Briefly, when a user’s web browser connects to our map server, it either uses a long-lived connection over which map data is sent, or it requests map data every few seconds. Several things influence which is chosen, but it’s usually down to whether the device is behind a proxy server – not all proxy servers allow, or support, long-lived connections over websockets.
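To give a feel for why the mix of connection types matters, here is a rough sketch of the request load that polling clients generate. The function name, the user count and the five-second interval are illustrative assumptions, not figures or code from OpenTrainTimes.

```python
def requests_per_minute(polling_users: int, poll_interval_s: float) -> float:
    """Estimate HTTP requests per minute from clients that poll for map data.

    A client on a long-lived connection connects once and then just receives
    data, whereas each polling client makes a fresh request every
    poll_interval_s seconds. Both arguments are hypothetical inputs.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return polling_users * (60.0 / poll_interval_s)


# Hypothetical example: 300 polling users, each asking every 5 seconds
print(requests_per_minute(300, 5))  # 3600.0
```

Under those assumed numbers, every polling user adds twelve requests a minute for the server to accept, handle and log, which is why a shift from long-lived connections to polling can raise the load even if the total number of users stays the same.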

This evening, we noticed a larger than normal number of polling users. Since a large percentage of OpenTrainTimes users are coming from a mobile device, we think this may be because a change was made at one of the mobile network providers which meant our websocket implementation couldn’t be used by clients.

Normally, this is OK – but the gradual and continual increase in users each week, coupled with a surge in the number of connections that our server was logging, meant there was insufficient CPU time available to process all of the data coming in to us from Network Rail.

The first thing we did was to turn off logging – we don’t really need it day-to-day, and it bought us some time. We then switched off processing TRUST messages, allowing them to queue whilst we allocated the rest of the server’s capacity to processing the backlog of train describer (TD) messages. It took about 20 minutes to process the TD messages, after which we turned TRUST messages back on. Processing the backlog of those messages, plus the remaining TD messages, took about another hour.
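The recovery order described above can be sketched roughly as follows. This is a simplified illustration, assuming two in-memory queues and a hypothetical `handle` callback; it is not the actual OpenTrainTimes message processor.

```python
from collections import deque
from typing import Callable, Deque


def drain(td_queue: Deque[str], trust_queue: Deque[str],
          handle: Callable[[str], None], trust_enabled: bool = True) -> int:
    """Process queued messages, always taking TD messages before TRUST ones.

    With trust_enabled=False, TRUST messages are left on their queue so that
    all available capacity goes to the train describer backlog.
    Returns the number of messages processed.
    """
    processed = 0
    while td_queue or (trust_enabled and trust_queue):
        msg = td_queue.popleft() if td_queue else trust_queue.popleft()
        handle(msg)
        processed += 1
    return processed


# Hypothetical messages, for illustration only
td = deque(["td-1", "td-2"])
trust = deque(["trust-1"])
seen = []

drain(td, trust, seen.append, trust_enabled=False)  # TD backlog first
drain(td, trust, seen.append)                       # then TRUST is re-enabled
```

After the first call only the TD messages have been handled and the TRUST message is still queued; the second call clears what is left.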

What we’re going to do about it

First of all, we’re sorry that we missed a trick and took too long to respond to the initial reports of a problem.

We’re going to add some new health checks to our monitoring system, one of which will enable us to monitor the size of any message backlog.
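As a minimal sketch of what a backlog check might look like, the function below grades a queue by the age of its oldest waiting message. The function name, the status strings and both thresholds are assumptions made for illustration; the real check may well measure the backlog differently, for example by message count.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backlog_status(oldest_queued_at: Optional[datetime], now: datetime,
                   warn_after: timedelta = timedelta(minutes=2),
                   critical_after: timedelta = timedelta(minutes=10)) -> str:
    """Classify a message backlog by the age of its oldest queued message.

    oldest_queued_at is None when the queue is empty. The thresholds are
    hypothetical defaults, not values used in production.
    """
    if oldest_queued_at is None:
        return "ok"
    age = now - oldest_queued_at
    if age >= critical_after:
        return "critical"
    if age >= warn_after:
        return "warning"
    return "ok"


# A backlog like the 45-minute one described earlier would be flagged at once
now = datetime.now(timezone.utc)
print(backlog_status(now - timedelta(minutes=45), now))  # critical
```

Measuring age rather than queue length has the advantage that the alert is expressed in the same terms users notice, namely how far behind the maps are.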

We’re also going to look at scaling out our servers to cope with the extra demand and leave more breathing room – but this means our costs will double, so we’ll need to make sure this is sustainable.

And finally, we’re going to press forward with the new version of OpenTrainTimes, which builds on the six years’ experience we’ve had working with railway data, and will be quicker and better than the current version.

So, sorry for the problems this evening.

Peter Hicks
Director, OpenTrainTimes Ltd.

What’s new – 11th June 2017

The last fortnight has been hard work, but we’re very pleased to bring you more maps!

  • The Crewe (exc.) to Leyland map fills in the gap between the Crewe and Preston maps, covering Warrington Bank Quay and Wigan North Western stations
  • The Southampton area map now has route indications in a number of places, and more to come
  • The Letchworth to Waterbeach map has now been extended up to King’s Lynn, linking up at Peterborough (although with a massive gap in train describer coverage) and to the Ely to Norwich map
  • The Thameslink Core map now has route indications throughout, especially useful south of Blackfriars

As always, a number of smaller issues have been fixed, and we’re grateful to everyone who’s emailed or contacted us to report small inaccuracies in our maps.

And finally, we’re currently working through the backlog of support requests – there are about 30 waiting to be answered, and a mixture of quick and not-so-quick tasks. If we haven’t gotten around to replying, we will do soon.