Runaway Train? Pass me the Remote!
It seems obvious that operating an unmanned freight train by remote control creates risks. So why, in the TasRail derailment, did no one check the integrity of the device's software?
It’s a trip down under this time, to look at the circumstances of the TasRail freight derailment in 2018…
Unstoppable
Loaded freight trains in motion are among the riskiest parts of the railway. Their huge momentum, combined with hazardous or explosive goods, creates the potential for catastrophe.
This risk was the key concern in 2001, when a locomotive pulling nearly fifty cars, including some loaded with hazardous chemicals, ran away at speed for nearly two hours. Catastrophe was averted when a railroad crew in a second locomotive eventually caught up with the runaway train and coupled to its rear car, allowing it to be braked. The incident came to be known as the “Crazy Eights” runaway, and its dramatic play-out in real time on US television became the basis of the 2010 Hollywood blockbuster “Unstoppable”, starring Denzel Washington.
Unfortunately, not every runaway has had a Hollywood ending. Most notably, the Lac-Mégantic rail disaster in Quebec, Canada in 2013 killed forty-seven people and destroyed half of the town’s centre when an unattended freight train carrying crude oil rolled away, derailed and exploded. And in 2009 at Viareggio, in Italy, thirty-two people died and many more were injured when a freight train carrying liquefied petroleum gas derailed, caught fire and exploded.
But in the dirty, mechanical world of freight, one might not immediately think that such an accident could be caused by a failure of software. Well, one particular accident points to the need to think again.
The TasRail freight derailment
On the morning of 21 September 2018, a train driver was using a remote control device to position TasRail train no. 604 while no one was on board. The train moved past its intended stopping point at a siding in Railton, Tasmania, so he sent a command to reverse it. The train failed to respond; instead it began to roll, picked up speed and ran out of the siding onto the main line. After an alert was raised, the train was soon diverted to a branch line, but it still hurtled through a number of level crossings and five sets of points at speeds of up to 90 km/h before eventually derailing. The accident caused major damage and injured two passers-by, but could - clearly - have been much worse.
The report by the Australian Transport Safety Bureau, eventually issued at the end of 2022, found a wide range of issues, but the fundamental faults lay in the software of the remote control device. A bug caused the device to enter a spurious fault state. When the train then rolled away and moved out of radio communication range with the transmitter, the software was supposed to apply the train’s brakes. Another software error meant that it didn’t.
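The report doesn’t reproduce the device’s code, but the fail-safe principle that should have governed it is easy to sketch. Here is a minimal illustration in Python (the names and the timeout value are my own, hypothetical choices): loss of communication defaults to braking, so the train keeps moving only while a recent, valid ‘all clear’ exists.

```python
import time

COMMS_TIMEOUT_S = 2.0  # hypothetical limit on radio silence before braking


class BrakeWatchdog:
    """Illustrative fail-safe pattern: brakes apply unless a valid radio
    message has been received recently. Silence, corrupt frames and
    start-up all converge on the safe state (brakes on)."""

    def __init__(self) -> None:
        self.last_valid_rx = None  # time of the last well-formed command

    def on_radio_message(self, frame_is_valid: bool) -> None:
        # Only a well-formed frame refreshes the watchdog; a corrupt
        # frame is treated exactly like no frame at all.
        if frame_is_valid:
            self.last_valid_rx = time.monotonic()

    def brakes_should_apply(self) -> bool:
        # Default to braking before any message arrives, and whenever
        # the watchdog timeout has expired.
        if self.last_valid_rx is None:
            return True
        return (time.monotonic() - self.last_valid_rx) > COMMS_TIMEOUT_S
```

The point of the pattern is that reaching the safe state requires no positive software action. In the TasRail device, by contrast, the ATSB found that the software failed to brake in precisely the loss-of-communication case where braking was required.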
The level of safety integrity
When railway functions like emergency braking are implemented in software, or in programmable electronic systems, they’re assigned a Safety Integrity Level - or ‘SIL’. The SIL determines a range of measures in technical standards that must be put in place to minimise the chance of bugs that might cause the function to fail. The highest level of integrity for rail functions - like emergency braking - is SIL 4.
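For a sense of what each level means quantitatively, IEC 61508 (discussed further below) ties each SIL, for a continuously operating function, to a band on the tolerable rate of dangerous failure per hour. A rough summary, from my reading of the standard:

```python
# Tolerable dangerous-failure rate bands, per hour of operation, for
# continuously operating functions (IEC 61508, high demand / continuous
# mode). SIL 4 means fewer than one dangerous failure per hundred
# million hours of operation.
SIL_BANDS_PER_HOUR = {
    1: (1e-6, 1e-5),  # >= 1e-6 and < 1e-5
    2: (1e-7, 1e-6),
    3: (1e-8, 1e-7),
    4: (1e-9, 1e-8),  # the most demanding band
}
```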
SIL 4 functions also tend to require redundancy in the engineered design, with functions duplicated in separate, independently realised hardware and software. To achieve the required reliability in executing safety functions, two-out-of-two (2oo2) or two-out-of-three (2oo3) computer processor architectures are often used, along with measures to ensure that common failures do not occur in the software run in each processing channel. This makes any complex electronic system implementing SIL 4 functions very expensive.
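To make the voting idea concrete, here is a minimal sketch (my own simplification, reducing each channel’s output to a single boolean brake demand): in 2oo3 the majority wins, so one failed channel can neither suppress a genuine demand nor raise a spurious one; in 2oo2, any disagreement drops the system to the safe state.

```python
def vote_2oo3(ch_a: bool, ch_b: bool, ch_c: bool) -> bool:
    """Two-out-of-three majority vote over redundant channels that each
    compute the same demand (True = apply brakes). Any two agreeing
    channels outvote a single faulty one."""
    return sum((ch_a, ch_b, ch_c)) >= 2


def vote_2oo2(ch_a: bool, ch_b: bool) -> bool:
    """Two-out-of-two: act on the channels' shared answer only when they
    agree; any disagreement falls back to the safe state, which for a
    brake demand means 'apply' (True)."""
    return True if ch_a != ch_b else ch_a
```

Of course, the redundancy only helps if the channels cannot all fail together - hence the accompanying measures against common failures in the software of each channel.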
For this reason it’s good to avoid the use of software for high integrity safety functions when you can. It’s therefore accepted practice for the emergency brakes on rolling stock to be ‘hard-wired.’
However, in this case the remote control was partly delivering the SIL 4 braking function. This meant that it needed to be developed by applying the full suite of stringent SIL 4 software assurance requirements, to ensure its failure rate was acceptably low. That didn’t happen.
Shifting the function
The remote control equipment used had been developed by a very small Australian-based manufacturing company. The business owner undertook most of the hardware design and manufacture personally, supported by other experts on an ‘as-needs’ basis. This included a programmer who developed software code for this, the third generation of the device.
Australia’s Rail Safety National Law places requirements on manufacturers to ensure that the equipment they provide is safe for its intended use. The law states that safety risk must be reduced to a level that is ‘as low as is reasonably practicable.’ The only way to demonstrate that SIL-rated software reaches this level of safety - and therefore meets the legal duty - is through applying recognised good practice. In this case that means the railway software safety standard EN 50128, or its parent standard IEC 61508 (badged as AS 61508 in Australia).
For the SIL 4 emergency braking function, these standards mandate a large number of stringent requirements for the design, coding, testing, validation, independent assessment and certification of software, and there must be evidence that they have been applied. However, the investigation report found that:
Although it is almost certain that the generation 3 [remote control] software was tested to some extent, there were no records of what was tested and to what extent. Records that were provided by [the supplier] related to the generation 1 [remote control].
Worse than that, apart from a single sparse document which appeared mainly to argue why the methods stated in the standards had not been applied, the ATSB report found that:
There was no documentation associated with the [remote control]’s software, apart from limited comments in the software code (which sometimes contradicted the code).
The report goes on to conclude that there were:
no other records that would necessarily result from a design process complying with AS 61508, such as requirements specifications, hazard and risk analyses, and a safety case.
These are alarming findings. Essentially, the train’s emergency brakes were controlled by software that had been developed without any meaningful safety assurance.
The chain of assurance
The absence of supplier software assurance here is shocking. But is it reasonable to expect a company to fully understand the importance of its code if it isn’t directly exposed to the risks inherent in using it? Well, the answer is both ‘Yes, it is’ and ‘No, it isn’t.’ The investigation report clarifies that:
Rolling stock operators and equipment developers both [my italics] have duties to ensure the design and modification of rolling stock and other equipment is safe for its intended use…this is especially true of complex systems such as those that implement software, due to the potential for these systems to fail in ways that are hard to predict.
The law is clear that the supplier should have known what the requirements were. But TasRail, as the Rail Transport Operator, should also have specified that the SIL requirements were met. Ultimately the investigation found that TasRail’s change management procedures were not sufficient to address changes of this nature.
The report went further still, highlighting weaknesses in the scope, availability and clarity of guidance, including guidance for the ONRSR, the Australian rail safety regulator. But even then, guidance doesn’t achieve anything unless it’s understood and applied. So what we really have here is a major competence gap in all parties, from the supplier, through the operator, right up to the top of the chain of assurance. No one knew or fully appreciated the potential for software failures in the remote control to cause a major accident - and no one in the chain of responsible actors flagged a concern or raised a challenge. Nor was the incident an isolated one. The ATSB found multiple reports containing:
driver descriptions of [remote control] behaviour that could not be readily attributed to a defined failure mode or response.
Each of these incidents presented a missed opportunity to address the fundamental issues.
The investigation report was published in late 2022 and at that time the Chief Commissioner was keen to point out that:
Since the remote control system was designed, there have been substantial improvements in the quality and availability of systems safety guidance for complex rail systems in Australia.
but added pointedly that:
…it is important this accident serves as a reminder for a heightened focus on systems safety from all transport operators and manufacturers, into the future.
Improving client competence
This blog post was inspired by work I’ve been jointly undertaking with CWG Projects in Australia. We’ve just collaborated on the release of an Australian version of the course “Railway Software Safety as a Client.” It’s intended to rapidly educate the broad community that needs to understand software safety, but to do so in a way that is free of jargon and accessible to all. As is clear from the post above, this broader understanding is both valuable and necessary.
The Australian course was itself developed from the UK version launched earlier this year, which is designed specifically to help railway companies comply with the requirements of RIS-0745-CCS, ‘Client Safety Assurance of High Integrity Software-Based Systems for Railway Applications’. This Rail Industry Standard was published following the investigation into a safety incident on the UK pilot application of the ETCS digital signalling system. That incident has many parallels to the TasRail derailment, including the absence of sufficient legacy safety design information, uncontrolled changes being made to equipment, the unavailability of digital information to support investigation, and a software reset having unexpected effects.
Thanks for reading
All views here are my own. Please feel free to give me feedback on the blog and any related topics. I can be opinionated on occasion, but I’m always happy to reconsider my positions and arguments and, where I find I’m wrong, to say so.
I’m always interested in collaborating on research or projects on the topic of risk analysis and safety assurance. If you’d like to get in touch please feel free to drop me an email on george.bearfield@libusa.co.uk.