Thursday, June 29, 2017

The long weekend: a retrospective

The Le Mans 24 Hours is the world's greatest motor race.

One of the toughest tests in motorsport, the race pits 180 drivers in 60 cars against the 8.5-mile Circuit de la Sarthe, and against the clock, with 24 hours of non-stop racing through the French countryside.

It's not just the drivers and cars that are put to the test; teams, too, face a battle to stay alert and react to the changing phases of the race. It's an endurance challenge for fans as well - at the track or around the world, staying awake for as long as possible to keep following a race which rarely disappoints in terms of action and (mis)adventure. The 2017 running of the 24 Hours was also something of a technical endurance challenge for this fan in particular...

Several years ago, unhappy with the live timing web pages made available by race organisers the WEC and ACO, I decided to start playing around with the data feed and see what I could come up with. Over the course of the 2015 race, I developed a prototype written in JavaScript that would later start to evolve and grow into something much bigger...

Fast-forward to 2017, and the Live Timing Aggregator was soft-launched before the 6 Hours of Silverstone via /r/wec and the Midweek Motorsport Listeners' Collective on Facebook. Despite having to debug the integration with a new format of WEC data from the grandstands of the International Pit Straight, and the system conking out a few hours into the race being held inconveniently on my wedding anniversary, feedback was overwhelmingly positive, and a few generous individuals even asked if they could donate money as a thank-you. The money let me move the system away from my existing server (which was becoming increasingly busy with other projects!) and onto a VPS of its own.

Sadly, though, the performance of the new VPS left a lot to be desired. Even loopback network connections were regularly dropped, and when simply issuing ls could take more than ten seconds to complete, I decided that for the Big Race an alternative solution would be the safer bet; I took advantage of the AWS free tier to minimise my expenditure, and since the system isn't particularly CPU-intensive I didn't feel the restrictions of nano instances would be too arduous.

The AWS setup was ready in time for the Le Mans test day - the first opportunity the racing teams have to run on the full 24 Hours circuit, and the first opportunity for me to test the new setup. In all, over 1,500 unique users visited my timing pages that day, almost three times the previous high-water mark - helped by the inclusion of the per-sector timings that, while included in the official site's data feed, are inexplicably not displayed on the WEC timing pages.

In the following weeks, visitors enjoyed the "replay" functionality, giving them "as-live" timing replays of both test day sessions and the entire 2016 race, plus extensive live timing of a single-seater championship considered a "feeder" series into the WEC. Then into Le Mans week itself - and things started to get a bit nuts.

More and more people had caught word of the timing pages, and I was seeing steady traffic - as well as frequent messages via Twitter and email, some carrying thanks, some feature requests. One of the commentators at a well-known broadcaster even got in touch to say that they had no official timing from the circuit and that my site had made their TV production a whole lot easier! Many of the feature requests were already on my backlog, and there were a few I could sneak in as low-effort (although deployments in race week seem a pretty bad idea in general).

Signs of not all being well were starting to become apparent - though not at my end. Rather, the WEC timing server seemed to be creaking under the load a little bit, and rather than updating every ten seconds (itself an age in motorsport terms!) there were five, ten, sometimes fifteen minutes between update cycles. I started researching an alternative data source which, at that stage at least, seemed more reliable. The modular nature of the timing system meant that it only took about an hour to get this alternative source (which I badged "WEC beta") up and running. (I ended up running both systems in parallel during the race, and once the WEC systems had calmed down there wasn't much difference between them.)

Peak visitors over practice and qualifying was about the same as for the test day. At this point, I had no idea of what was going to happen on Saturday afternoon...

Then real life interfered. For reasons I couldn't avoid, I had to be out of my house for the start of the race. Not only that, but the place I had to be had no mobile signal; and I ended up missing the first three hours of the race entirely.

I got home to find the world was on fire.

The WEC website had buckled under its own load (later claimed to be a DoS attack), which drove more and more visitors to my timing site. At some point, various processes reached their limits for open file handles. CPU usage had hit 100%, and stayed there. To make it a proper comedy of errors, I'd managed to leave my glasses, and my work laptop, at the friend's house where I'd been at the start of the race - so I could only work by squinting an inch from the screen, or with sunglasses that rendered the monitors very dim. Nevertheless, I persevered...

The first task was to get the load on the node under control. I took nginx offline briefly, and upped its allowed file handles. I also restarted the timing data collection process (which can be done while preserving its state). This helped very briefly - but after a few minutes, the number of connections was such that the data collection process itself was losing its connection to the routing process, so no timing data could get in or out.
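For reference, the nginx side of that change comes down to a couple of directives in nginx.conf; the numbers below are illustrative, not the values I actually used:

```
# Allow each worker process to hold many more open file descriptors
# (sockets count against this limit too).
worker_rlimit_nofile 65536;

events {
    # Cap on simultaneous connections per worker; must fit within
    # worker_rlimit_nofile (a proxied connection can use two descriptors).
    worker_connections 16384;
}
```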

It was then that I had a brainwave - I could shunt the routing process (a crossbar instance) onto its own node, reducing the network and CPU load on the webserver and data collection node. I still had the code on the slow VPS, so I just needed to reactivate it, and patch the JavaScript client to connect to the timing network via the VPS rather than the AWS node. I also removed nginx as a proxy to the crossbar process, reducing the overhead - crossbar is capable of managing a large number of connections itself.

It turned out that network IO on the VPS was adequate for the task, and over the next hour or so, things started to stabilise. I'd also decided to reduce network load by disabling the analysis portion of the site - a shame, as the stint-length and drive-time analyses were written with Le Mans in mind. I'll need to re-architect that portion somewhat, as the pub/sub model has proved an expensive one compared to request-response, especially with a large number of cars.
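A back-of-envelope model shows why pub/sub gets expensive here. All the numbers below are illustrative assumptions, not measurements from the race:

```python
# Rough cost model: pub/sub pushes every update to every subscriber,
# whereas request-response only pays for pages actually being viewed.

def pubsub_msgs_per_update(cars, subscribers):
    # One analysis message per car, fanned out to every subscriber.
    return cars * subscribers

def request_response_msgs(active_viewers):
    # One request plus one response per viewer that actually asks.
    return 2 * active_viewers

# Illustrative numbers: a 60-car Le Mans field, 5,000 connected clients,
# of whom perhaps 500 have the analysis page open at any moment.
fanout = pubsub_msgs_per_update(60, 5000)   # messages per update cycle
on_demand = request_response_msgs(500)
print(fanout, on_demand)
```

With those (made-up) figures, the broadcast model is two orders of magnitude more traffic per update cycle than serving the analysis on request.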

I'm grateful to those on Twitter and Reddit who, at this point, started to encourage me to not forget actually watching and enjoying the race! Thankfully, after another round of file handle limit increases (it turns out that systemd completely ignores /etc/security/limits.conf and friends) - and my loving and patient spouse having retrieved my spectacles - I could do just that, only occasionally needing to hit it with a hammer to get it running again.
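For anyone hitting the same wall: services started by systemd don't read /etc/security/limits.conf, so the limit has to go in the unit itself, or in a drop-in override like the one below (assuming the router runs as a crossbar.service unit - adjust the name and number to taste, and run systemctl daemon-reload afterwards):

```
# /etc/systemd/system/crossbar.service.d/limits.conf
[Service]
LimitNOFILE=65536
```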

I also have some ideas for improving function under load in future. Separating the data-collection process from the WAMP router was a good idea, but the former can still be squeezed out of connectivity with the latter; some method of ensuring that "internal" connections are given priority would keep the service performing for those users still connected. Upping the file handle limit and opening Crossbar directly helped increase the concurrent user count - around 10,000 over the course of the race - but a way of spreading that load over multiple nodes is going to be needed to go much beyond that.

The official WEC timekeepers, Al Kamel Systems, publish on their website a "chronological analysis" - a 3MB CSV file containing every lap and sector time set during the race. I wonder what effort will be involved to reconstruct the missing timing data from the first part of the race, into a replay-able format for my site...
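As a first sketch, turning that CSV back into a replayable stream is mostly a matter of sorting rows into elapsed-time order. The column names and delimiter below are guesses at the Al Kamel format, not the real headers:

```python
import csv
from io import StringIO

def to_replay_events(csv_text):
    """Parse a chronological-analysis-style CSV into
    (elapsed_seconds, car_number, lap_time) events in replay order.
    Column names here are hypothetical placeholders."""
    events = []
    for row in csv.DictReader(StringIO(csv_text), delimiter=';'):
        # Convert an h:mm:ss.fff elapsed-time stamp into seconds.
        h, m, s = row["ELAPSED"].split(":")
        elapsed = int(h) * 3600 + int(m) * 60 + float(s)
        events.append((elapsed, row["NUMBER"], row["LAP_TIME"]))
    events.sort(key=lambda e: e[0])
    return events

# A tiny made-up sample in the assumed format.
sample = """NUMBER;LAP_TIME;ELAPSED
2;3:19.782;0:25:31.120
8;3:20.512;0:22:09.456
"""
for event in to_replay_events(sample):
    print(event)
```

The real file would need the sector columns carried through too, but the shape of the job is the same: timestamp, sort, replay.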

Friday, January 30, 2015

I'm racing at Silverstone!

Silverstone Wing and International Pits Straight. © James Muscat

It started, as these things occasionally do, with a dream.

The home of the British Grand Prix. The 3.66-mile ribbon of tarmac graced by the likes of Schumacher, Alonso, Button, and Hamilton, not to mention their many peers and forerunners. The magnificent, high-speed Maggotts/Becketts/Chapel sequence of corners. I'm a huge motorsport fan (some would say 'obsessive'); how could I pass up the opportunity to enter a race at Silverstone?

The only trouble is: I'm not going to be in a car. I'm not even going to be on a bike. I'm running the Silverstone Half Marathon on March 15th, and I'm doing so to raise money for Cancer Research UK.

Back in October, I had a dream in which I was running the Silverstone Half Marathon, and no, I don't know why. The most puzzling part of the dream was that I was running the wrong way around the track when I got to the finish line. I happened to mention this to a serial marathon-running colleague, who looked up the route map to discover that, in fact, the last lap of the race is run against the usual flow of the circuit!

Curious, but not enough to make me sign up for the darn thing. No, that would be the fault of another serial marathon-running colleague, Cristin. You can read her side of the story on her blog.

Both of those colleagues are also running the race with me, and Cristin and I are fundraising together. Please would you take a moment to visit our fundraising page, and sponsor whatever you feel you can?

Another curious part of that dream came after I'd crossed the finish line. I got a celebratory kiss from the girl who, in real life a few days later, would become my girlfriend... but that's another story! Her mum has, after a year-long fight, beaten cancer; that's one of the reasons I chose to raise money for Cancer Research UK. A colleague of ours is also fighting her own fight right now.

If you can, please sponsor us to the chequered flag (there had better be a chequered flag!), and help CRUK continue their work against cancer.

Our fundraising page is at Virgin Money Giving.
Wet GP3 qualifying, Silverstone 2014. © James Muscat

Monday, May 05, 2014

Build your own AMX replacement: part two (or, touchscreens can't swim)

Last time (admittedly, a very long time ago!) I painted the backdrop of my church and its A/V control setup, and gave some of the motivation for wanting a cheaper, more flexible replacement. I had a proof of concept up and running, the design and architecture of which I'll go into later in this post. But I also left you with a cliffhanger: just what exactly did happen on Sunday 10th February, and what bearing did it have on the project?

Somebody had thoughtfully left a glass of water hidden right next to the AMX touchscreen. That morning, the inevitable happened; the touchscreen lost its programming after a short bath, and we were left with no way of controlling our video switchers short of manually dialling in each take, from a different room. The team stumbled through the morning service as best as they could, and I got a phone call at lunchtime: was my replacement system ready, and could I install it right now please?

The answers were no and yes, respectively. That glass of water, strangely enough, was one of the most useful things to be added to the A/V system in some time. ;)

One of the things I had agreed with the church leadership after my week of working in the Parish Centre was that I wasn't going to spend any more than I already had on components for the replacement system. I'd already invested a couple of hundred pounds all told, and knowing what happened to the last few upgrade plans we'd heard of, I wasn't going to commit more until the plan had been given both approval and budget. So when the call came, I didn't have all the necessary components to install a complete solution - I had barely any more than the bare proof of concept from November (though I had been developing the software in that time). Nevertheless, some control was better than no control, so I grabbed what I had and headed to church.

What I had at this point was a Raspberry Pi with the control software installed, a USB hub, and enough serial adapters to connect exactly two devices. A quick committee meeting decided that the most important things were the main switcher and one of the cameras, so those were wired up first; I made a Kanban-style board using post-its to track which devices had been moved to the new system and which were still wired in to the AMX. (Two of those post-it notes, "Please do not remove these post-its" and "thank you", are still there. That's just how our sense of humour works.)

We didn't have a control surface (the touchscreen being the single most expensive part of the system), but I did have an old netbook that was capable enough for the time being. I was also fortunate enough to have received back the router that church had "borrowed" which meant we could connect the server and client machines together reliably. With the help of several of the team, we managed to bring the first phase of the system online in just a couple of hours - finishing just as the 6pm team arrived to set up.

Enough narrative. Show me teh codez.

Because I work as a technical architect, I drew a pretty, if also pretty basic, diagram of the new system:

Essentially, the "controller" running on the server is a big interface onto all the devices in its bucket. It's worth noting that you don't interact with the devices themselves from the client; you ask the controller to prod them for you. This is partly by design and partly a consequence of choosing Pyro: because the devices themselves need access to the serial ports on the physical server, each would need its own proxy object to be RPC'd directly. (Actually, I'm not sure quite how much work that would be.)

Having the controller as this union of all device interfaces means that it is the only object that needs to be made available through Pyro. The disadvantage (if it so proves) is that all the devices have to be physically on one server. It doesn't take much imagination to come up with a scenario in which a more distributed system is useful - for example, controlling something which doesn't have serial cabling into the server room. But that can be a future feature when it's needed. I imagine that keeping track of all possible devices on a network is one headache too far right now!
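Stripped of the Pyro wiring, the controller is just delegation. Here's a sketch of the pattern - the device class, method names, and port are made up for illustration; in the real system the single Controller instance is the one object registered with the Pyro daemon:

```python
class Camera:
    """Stand-in for a serial-attached device; names are illustrative."""
    def __init__(self, port):
        self.port = port
        self.preset = None

    def recall_preset(self, n):
        # In reality this would write a command down the serial port.
        self.preset = n
        return f"camera on {self.port} -> preset {n}"

class Controller:
    """Union of all device interfaces: clients never hold a device
    object, they ask the controller to prod one on their behalf."""
    def __init__(self):
        self.devices = {"cam1": Camera("/dev/ttyUSB0")}

    def invoke(self, device, method, *args):
        # Look the device up by name and forward the call.
        return getattr(self.devices[device], method)(*args)

ctrl = Controller()
print(ctrl.invoke("cam1", "recall_preset", 3))
```

The client-facing surface is just invoke (or named wrappers around it), which is exactly why only the controller needs exposing over RPC.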

Another interesting feature of the system is the controller's "sequencer". It essentially lets you queue up a sequence of commands to be executed in turn, at intervals of (at the moment and somewhat arbitrarily) a second. The first use case, and the one for which it was included, is the perhaps surprising candidate of the highly desirable "turn the system on" feature: it needs to turn on one power distribution unit, then pause for a short time before turning on the next one, and so on. I have no doubt that more clever things will make their way in here as use cases in future. The ability to record and play back macros is an interesting idea that I'll certainly be looking into in the future.
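The sequencer itself can be sketched in a few lines: a queue drained by a worker thread, with a fixed pause between commands. The interval is a parameter here purely for illustration; as noted above, the real thing currently hard-codes a second:

```python
import threading, time, queue

class Sequencer:
    """Run queued zero-argument commands in order, pausing `interval`
    seconds between each - e.g. staged power-up of distribution units."""
    def __init__(self, interval=1.0):
        self.interval = interval
        self.commands = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def enqueue(self, command):
        self.commands.put(command)

    def _run(self):
        while True:
            command = self.commands.get()
            command()
            time.sleep(self.interval)

log = []
seq = Sequencer(interval=0.01)  # tiny interval so the demo finishes quickly
seq.enqueue(lambda: log.append("PDU 1 on"))
seq.enqueue(lambda: log.append("PDU 2 on"))
time.sleep(0.1)
print(log)
```

Recording a macro would then be little more than capturing a list of (device, method, args) tuples and feeding them back through the same queue.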

The code itself is probably the least interesting part of the system. It's up at GitHub if you're interested (the project with all the UI is up separately). More interesting is what we can do now it's running, where I'm thinking of taking it, and how I intend to make sure getting from here to there doesn't accidentally destroy the universe on the way. (The phone calls would be insufferable.) More of that in part 3, which I'll hopefully not take quite so long to write as I did this one!