Thursday, June 29, 2017

The long weekend: a retrospective

The Le Mans 24 Hours is the world's greatest motor race.

One of the toughest tests in motorsport, the race pits 180 drivers in 60 cars against the 8.5-mile Circuit de la Sarthe, and against the clock, with 24 hours straight of racing through the French countryside.

It's not just the drivers and cars that are put to the test; teams, too, face a battle to stay alert and react to the changing phases of the race. It's an endurance challenge for fans as well - at the track or around the world, staying awake for as long as possible to keep following a race which rarely disappoints in terms of action and (mis)adventure. The 2017 running of the 24 Hours was also something of a technical endurance challenge for this fan in particular...

Several years ago, unhappy with the live timing web pages made available by race organisers the WEC and ACO, I decided to start playing around with the data feed and see what I could come up with. Over the course of the 2015 race, I developed a prototype written in JavaScript that would later start to evolve and grow into something much bigger...

Fast-forward to 2017, and the Live Timing Aggregator was soft-launched before the 6 Hours of Silverstone via /r/wec and the Midweek Motorsport Listeners' Collective on Facebook. Despite having to debug the integration with a new format of WEC data from the grandstands of the International Pit Straight, and the system conking out a few hours into the race being held inconveniently on my wedding anniversary, feedback was overwhelmingly positive, and a few generous individuals even asked if they could donate money as a thank-you. The money let me move the system away from my existing server (which was becoming increasingly busy with other projects!) and onto a VPS of its own.

Sadly, though, the performance of the new VPS left a lot to be desired. On regular occasions, even loopback network connections were dropped, and simply issuing ls would sometimes take more than ten seconds to execute. I decided that for the Big Race an alternative solution would be the safe bet; I took advantage of the AWS free tier to try and minimise my expenditure, and since the system isn't particularly CPU-intensive I didn't feel the restrictions on nano instances would be too arduous.

The AWS setup was ready in time for the Le Mans test day - the first opportunity the racing teams have to run on the full 24 Hours circuit, and the first opportunity for me to test the new setup. In all, over 1,500 unique users visited my timing pages that day, almost three times the previous high-water mark - helped by the inclusion of the per-sector timings that, while included in the official site's data feed, are inexplicably not displayed on the WEC timing pages.

In the following weeks, visitors enjoyed the "replay" functionality, giving them "as-live" timing replays of both test day sessions and the entire 2016 race, plus extensive live timing of a single-seater championship considered a "feeder" series into the WEC. Then into Le Mans week itself - and things started to get a bit nuts.

More and more people had caught word of the timing pages, and I was seeing steady traffic - as well as frequent messages via Twitter and email, some carrying thanks, some feature requests. One of the commentators at a well-known broadcaster even got in touch to say that they had no official timing from the circuit and that my site had made their TV production a whole lot easier! Many of the feature requests were already on my backlog, and there were a few I could sneak in as low-effort (although deployments in race week seem like a pretty bad idea in general).

Signs of not all being well were starting to become apparent - though not at my end. Rather, the WEC timing server seemed to be creaking under the load a little bit, and rather than updating every ten seconds (itself an age in motorsport terms!) there were five, ten, sometimes fifteen minutes between update cycles. I started research into an alternative data source which, at that stage at least, seemed to be more reliable. The modular nature of the timing system meant that it only took about an hour to get this alternative source (which I badged "WEC beta") up and running. (I ended up running both systems in parallel during the race, and once the WEC systems had calmed down there wasn't much difference between them.)

Peak visitor numbers over practice and qualifying were about the same as for the test day. At this point, I had no idea of what was going to happen on Saturday afternoon...

Then real life interfered. For reasons I couldn't avoid, I had to be out of my house for the start of the race. Not only that, but the place I had to be had no mobile signal; and I ended up missing the first three hours of the race entirely.

I got home to find the world was on fire.

The WEC website had buckled under its own load (later claimed to be a DoS attack), which drove more and more visitors to my timing site. At some point, various processes reached their limits for open file handles. CPU usage had hit 100%, and stayed there. To make it a proper comedy of errors, I'd managed to leave my glasses, and my work laptop, at the friend's house where I'd been at the start of the race - so I could only work by squinting an inch from the screen, or with sunglasses that rendered the monitors very dim. Nevertheless, I persevered...

First task was to get the load on the node under control. I took nginx offline briefly, and upped its allowed file handles. I also restarted the timing data collection process (which can be done while preserving its state). This helped very briefly - but after a few minutes, the number of connections was such that the data collection process itself was losing connection to the routing process, so no timing data could get in or out.

It was then that I had a brainwave - I could shunt the routing process (a Crossbar instance) onto its own node, reducing the network and CPU load on the webserver and data collection node. I still had the code on the slow VPS, so I just needed to reactivate it, and patch the JavaScript client to connect to the timing network via the VPS rather than the AWS node. I also removed nginx as a proxy to the crossbar process, reducing the overhead - crossbar is capable of managing a large number of connections itself.

It turns out network IO on the VPS is adequate for the task, and over the next hour or so, things started to stabilise. I'd also decided to reduce network load by disabling the analysis portion of the site - which is a shame, as the stint-length and drive-time analyses were written with Le Mans in mind. I'll need to re-architect that portion somewhat, as the pub/sub model has proved to be an expensive one compared to request-response, especially with a large number of cars.

I'm grateful to those on Twitter and Reddit who, at this point, started to encourage me not to forget to actually watch and enjoy the race! Thankfully, after another round of file handle limit increases (turns out that systemd completely ignores /etc/security/limits.conf and friends) - and my loving and patient spouse having retrieved my spectacles - I could do just that, only occasionally needing to hit the system with a hammer to get it running again.
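For anyone bitten by the same file handle surprise: a service started by systemd takes its open-file limit from the unit's LimitNOFILE= setting rather than from limits.conf, so it is worth checking what limit a running process actually has. The snippet below is only a minimal sketch of that check using Python's standard resource module, not the code running on the timing site; the function name and the target value are made up for illustration.

```python
import resource


def raise_nofile_soft_limit(target):
    """Raise this process's soft open-file limit towards `target`.

    The soft limit can never exceed the hard limit, so the request is
    capped at the hard limit. Returns the (soft, hard) pair in effect
    after the call. `target` is a hypothetical value chosen by the caller.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def effective(limit):
        # RLIM_INFINITY means "no limit"; treat it as infinity for comparisons.
        return float("inf") if limit == resource.RLIM_INFINITY else limit

    new_soft = min(target, effective(hard))
    if new_soft > effective(soft):
        # Only ever raise the soft limit; leave the hard limit untouched.
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open-file limits before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("open-file limits after: ", raise_nofile_soft_limit(4096))
```

If the hard limit printed here is still low, no amount of in-process fiddling will help; it has to be raised wherever the process is launched from.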

I also have some ideas to work with to improve function under load in future. Separating the data-collection process from the WAMP router one was a good idea, but still the former can be squeezed out of connectivity with the latter. Some method of ensuring that "internal" connections are given priority will keep up performance of the service for those users still connected. Upping the file handle limit and opening Crossbar directly helped increase the concurrent user count - around 10,000 over the course of the race - but a way of spreading that load over multiple nodes is going to be needed to go much beyond that.

The official WEC timekeepers, Al Kamel Systems, publish on their website a "chronological analysis" - a 3MB CSV file containing every lap and sector time set during the race. I wonder what effort will be involved to reconstruct the missing timing data from the first part of the race, into a replay-able format for my site...
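As a first rough idea of what that reconstruction might involve, the sketch below reads lap records from CSV text and sorts them by elapsed race time, which is the order a replay would need to emit them in. It is very much an assumption-laden sketch: the column names (NUMBER, LAP_NUMBER, ELAPSED), the semicolon delimiter and the time format are guesses for illustration and have not been checked against the real file, and sector times are ignored entirely.

```python
import csv
import io


def parse_elapsed(text):
    """Convert 'H:MM:SS.mmm' or 'MM:SS.mmm' into seconds as a float."""
    seconds = 0.0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def replay_events(csv_text, delimiter=";"):
    """Return (elapsed_seconds, car_number, lap_number) tuples in race order.

    The column names used here are hypothetical stand-ins for whatever
    the real chronological analysis file contains.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    events = [
        (
            parse_elapsed(row["ELAPSED"]),
            row["NUMBER"].strip(),
            int(row["LAP_NUMBER"]),
        )
        for row in reader
    ]
    # Sort by elapsed time so laps from different cars interleave correctly.
    events.sort(key=lambda event: event[0])
    return events


if __name__ == "__main__":
    sample = (
        "NUMBER;LAP_NUMBER;ELAPSED\n"
        "2;1;3:30.500\n"
        "1;1;3:25.000\n"
        "1;2;6:50.250\n"
    )
    for elapsed, car, lap in replay_events(sample):
        print(f"{elapsed:9.3f}s  car {car} completes lap {lap}")
```

A real replay would also need sector times and pit stops, and would have to turn each event into whatever message format the live site publishes, so this only covers the easy part.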