Tech Safe Transport
Tech Safe Transport Podcast
Rail Software Safety: An Expert View

Dr Roger Short is a pioneer in the field of software safety assurance on the railway. In this podcast we discuss his career and the key lessons he's learned in over 50 years of cutting edge work.

It’s been a while since I last posted: I put this down to a glut of work, but also a little bit of a loss of focus on these topics in a world where so many more high profile political issues are in dangerous flux. Thankfully I have re-established my equilibrium now and I have a range of interesting areas I intend to deep dive into over the coming months: I hope that’s welcome news.

So anyway, it’s a podcast this time. Some months ago, I spoke to Dr Roger Short, a former colleague of mine, and someone whose orbit I have come into on a range of projects and initiatives dealing with the topic of software safety assurance over many years since.

Roger began his career with British Rail in the 1970s and led pioneering work on the safety assurance of the Advanced Passenger Train and the Solid State Interlocking (SSI) - the first computer-based signalling system on the UK rail network. And his knowledge extends to the present day, where he is actively working on topics like machine learning and cyber security.

As well as being extremely knowledgeable on the subject of railway software safety, Roger has an ability to communicate complex ideas clearly and succinctly. Many more people on the railway need to know how software is kept safe as the scope and breadth of the ‘digital railway’ continues to expand, so if you’d like to know more about the topic from a wise master, please plug in, settle back and listen to what he has to say.

I could have spoken to Roger for hours on this topic - and almost did: A full version of our discussion has been added as a bonus module on the course “Railway Software Safety as a Client” from Libusa.

And finally, for those who would prefer to read: A transcript of the discussion is included below.

[Please note that the software safety standard referred to in our discussions (EN50128) has now been superseded by EN50716 but the standards are essentially the same].

Thanks for listening

Please feel free to feed back any thoughts or comments and please do drop me an e-mail. My particular areas of professional and research interest are risk management, safety decision making and assurance of new transport technology. I’m always keen to consider interesting projects in these areas.

If you know anyone else who you think might appreciate a thought-provoking read or listen every few weeks, please let them know:

Share Tech Safe Transport


Transcription:

George Bearfield  
Hi Roger. What is your background and how did you first come to work in the area of software safety?

Roger Short  
Well, my background is as a railway signal engineer. I joined British Rail from school at age 18 and went on their student apprentice scheme. This was in the railway signals department of the Western Region. But I've never been what you might call a mainstream signalling engineer, because when I finished my training, I went straight into working with electronic systems: hot axle box detectors, digital transmission equipment and so on. I don't have the background of having been involved in designing signalling schemes, testing them, implementing them and so on.

When it came to getting involved in software, this happened a few years on, when I moved on to become a member of the Signal Development section at the British Railways Board. We were involved in the safety approval of new signalling equipment and systems, and the first software I was involved with was a system used on the Advanced Passenger Train. This was in the late 1970s. The train was designed to travel at higher speeds over any given section of track than a normal conventional train, basically because of its tilting capability. The system would display in the driver's cab the permanent speed restriction which applied to the train, which was not the same as the speed restrictions indicated by lineside notices or set out in the printed instructions, and so on. It worked by having transponders at intervals along the track which transmitted to the train the permitted speed for the coming section. There was a microprocessor-based system on board the train which decoded these messages and displayed them to the driver. If it displayed a speed that was too high, there would be a danger of derailment. So it was a safety-related system, and it was developed by the British Rail Research Department. I was responsible for its safety approval, and for assessing and validating the software. So that was my first involvement in software.

George Bearfield  
What is surprising to me is that people often think about technology as something which is relatively novel. But the techniques in the standards for software assurance have been around for decades, haven't they? And they're sort of tried and tested. Because people are often only getting to grips with it now, they might think of it as something fairly cutting edge, but it's actually pretty old and established.

Roger Short  
Yes, this was just the period when there was a great deal of interest in applying software to things which had a high demand for functional safety, and at the same time software was almost legendary for its ability to do things wrong, with jokes around themes such as: “To err is human, but to totally foul things up, it takes a computer.” And so there was a great deal of interest and lots of people working on how to get software we can rely on; software we can trust to deliver safety functions. So I was there at that time and, as you say, I'm aware that the things that are recommended now in the software safety standards, like EN 50128 for railway signalling, are almost entirely things which were developed around the 1970s and 1980s.

George Bearfield  
Just back to your career: we first worked together I think over 20 years ago. So I'm aware you worked also at British Rail Research on the first solid-state interlockings. People might not really know what that project was about, because that was a sort of paradigm shift in signalling at the time: moving safety critical functions from relay-based technology into code.

Roger Short  
Oh yes: well, an interlocking is actually a system which is designed to prevent a railway signaller from setting up unsafe train movements - setting up conflicting train movements, or allowing trains to run across points which are not set correctly, and so on. Since about the 1930s at least, this had been done largely using relays and relay logic. By the 1960s and 70s relays were providing the safety control of quite large areas of signalling, involving, in any particular installation, perhaps many thousands of relays. Railway signalling relays were designed to be inherently fail-safe, which was achieved by making them big, robust devices. A so-called ‘miniature relay’ was about the size of a house brick, and there would be perhaps several thousand of those to control the signalling in an area like Bristol. So the relays themselves were costly, and the work of installing and wiring them was costly too. As the signalling logic is relatively simple, you can define it in fairly simple Boolean logic, and these new computers and microprocessors were devices that could do that easily. So British Rail Research came up with a project to develop a processor-based interlocking system. And again, my perspective on it, as part of the headquarters signalling development team, was: “Oh, so now we've got to approve this new system and it's right at the heart of our safety. We will be totally dependent on it for protecting our trains. So yeah, this is a big problem.” So this led to me leading the team that was dealing with the safety assessment, and indeed safety validation, of the software for the solid-state interlocking system. We also did the hardware as well, which was interesting - and a lot of important engineering went into it - but it wasn't the same sort of really hard conceptual problem as there was with the software.
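[As an illustration of the “fairly simple Boolean logic” Roger describes, here is a minimal sketch in Python - with entirely hypothetical signal, point and track circuit names, not SSI code or any real installation - of the kind of condition an interlocking evaluates before allowing a route to be set.]

```python
# A minimal, illustrative interlocking condition: the route may only be set
# when the points are proved in the right position, no conflicting route is
# already set, and the track sections involved are proved clear of trains.
def route_a_to_b_may_set(points_101_normal: bool,
                         route_b_to_a_set: bool,
                         track_ta_clear: bool,
                         track_tb_clear: bool) -> bool:
    return (points_101_normal          # points proved in the correct position
            and not route_b_to_a_set   # no conflicting movement already set up
            and track_ta_clear         # track section A proved clear
            and track_tb_clear)        # track section B proved clear

print(route_a_to_b_may_set(True, False, True, True))   # True: route may be set
print(route_a_to_b_may_set(True, True, True, True))    # False: conflicting route set
```

[Roughly speaking, in a relay interlocking each of these conditions is proved through a physical relay contact wired in series; in SSI the same logic is evaluated by the processor.]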

George Bearfield  
You mentioned relays being inherently safe: in one of your papers you said that a failure rate of 10 to the minus 10 (failures per hour) is proven in use for relays. So they're inherently very safe. I remember that you said you sought to use a relay-based interlocking as a benchmark for a quantitative assessment of the performance of the software. I may have got that wrong, but I remember it from previous conversations, which of course is very interesting, because any quantitative measures around software are inherently quite difficult to make.

Roger Short  
Yes, it wasn't so much for assessing the software: it was for setting the target which we would be aiming at and hoping to achieve for the software. From that we then based the sort of high-level argument for the software validation process on an analogy with the series of tests, inspections, checks and so on which were made on a relay interlocking system, mapping across from those to the techniques, the testing, the analysis and so on which were being done for the software. We were saying that in principle they're doing the equivalent thing, and that they're being done by people who are sufficiently competent in their field - either relay interlocking design or software - so it's reasonable to believe that the overall result of all of these activities will be equivalent, and that we should achieve an equivalent level of safety with our software-based interlockings as that achieved by the relay-based interlockings.
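[To give a feel for the scale of the benchmark mentioned above: a rate of 10 to the minus 10 unsafe failures per hour corresponds, on average, to one unsafe failure in 1 / 10⁻¹⁰ = 10¹⁰ operating hours, or roughly 1.1 million years of continuous operation for a single interlocking - which is why the target has to be argued by analogy with the relay-based process rather than demonstrated by direct measurement.]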

George Bearfield 
One advantage of relays is that they're quite transparent, you know, in terms of their logic, and that’s a great benefit for assurance. I've always been fascinated by the trajectory of technology, so when you're talking about microprocessors [for the SSI project] they’re a similar sort of generation to the old ZX Spectrum or Commodore 64 type computers: that sort of level of complexity. But now obviously, modern interlockings are hugely more complicated, with advances in processing and code and also network connectivity. So, I guess the scale of the assurance problem and the loss of transparency continue to move apace?

Roger Short  
Oh yes, it does to some extent, although so far as the actual functional logic of the interlocking is concerned that hasn't become any more complex.

George Bearfield  
That's a good point.

Roger Short  
But the operating systems of the computers that it's running in have become a lot more complex and a lot more sophisticated. It was an 8-bit microprocessor which was being used for the SSI system.

George Bearfield  
Yes. And so Roger, you already referred to the software safety standards and you mentioned the signalling standard EN 50128. Is it possible to give a layperson's understanding of what those software standards are and where they came from? And also, when did you first come to grapple with them in your professional life?

Roger Short  
Well, what the software standards do is essentially set out a process - really a whole series of processes - which need to be applied through the lifecycle of the software, from the initial concept through to installation, commissioning and eventually decommissioning. The lifecycle is very central to the standards. They define the activities which have to be performed at each stage of the lifecycle. They define organisational things, and in particular they're generally very strong on independence between the people who do the design and the people who check and validate the design and so on. Also - and sometimes people tend to miss the importance of this, but it's very central to them - they define what are suitable techniques to be applied at each stage of the lifecycle. Naive readers of the standard tend to regard the techniques as secondary because they appear as appendices rather than in the main body of the text. But they’re there and they're actually normative. The appendices tell you what you have to do, and they are very central to actually achieving the levels of integrity which you want.

George Bearfield  
These standards came out of the process sector, didn't they: oil and gas, petrochemicals and factory machinery, through a standard called IEC 61508. This was adapted into EN 50128 for rail signalling purposes. You talked about target setting before, when we talked about the relays and the interlockings, and the importance of setting safety targets to define the rigour of the processes that need to be applied. There's something called a safety integrity level, or SIL, which is central to that concept. You wrote a paper called ‘The Use and Misuse of SIL’ which is a little bit of a bible for me really. I go back to it and read it from time to time, just as a reminder to bring some common sense into some of these debates around safety integrity levels. Could you explain what a safety integrity level is and how it should be applied?

Roger Short  
Yes, the problem which has existed right since the early days of software, and still exists, is that there is no recognised way of predicting the probability that there will be errors or defects in the software, and the probability that any such errors or defects will do something which is harmful from a safety point of view. You can make that kind of calculation for hardware - particularly for electronic hardware, where we have published failure rates for all the components used and you can apply reliability theory to calculate the probabilities of particular failures. So you can come up with a number which says: “Yes, actually the probability of this system failing in an unsafe state is 10 to the minus 10, or 10 to the minus eight, or something per hour.” It's not possible to do the same thing for software, because software science still isn't able to provide you with any basic concept of failure rates.

George Bearfield  
And the reason for that is basically that software has branching logic and this logic very quickly gets so vast that you can never exhaustively test it. So every time you exercise it in a different way, you might uncover something that you didn't know was there, that was sitting there in the software design just waiting to happen.

Roger Short  
Yes, that's right. Theoretically, from a sort of philosophical point of view, you could prove your software was completely correct by absolutely exhaustively testing it. The problem is that getting anywhere near an exhaustive test would require such a literally astronomical number of tests that the testing time would be longer than the anticipated useful life of your system.
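[A rough, purely illustrative calculation of that point, in Python: suppose a system's behaviour depended on only 64 bits of input and internal state, and a rig could run a million test cases per second - both generous assumptions compared with any real interlocking.]

```python
# Back-of-the-envelope estimate of exhaustive testing time (illustrative only).
input_bits = 64                       # assume just 64 bits of input/state
tests_per_second = 1_000_000          # assume a very fast automated test rig

total_cases = 2 ** input_bits         # about 1.8e19 distinct combinations
seconds = total_cases / tests_per_second
years = seconds / (60 * 60 * 24 * 365)

print(f"{total_cases:.2e} cases -> roughly {years:,.0f} years of testing")
# Roughly 585,000 years - far beyond the useful life of any signalling system,
# and a real system has vastly more than 64 bits of state.
```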

George Bearfield  
Yes.

Roger Short  
This was realised quite early on, in the early days, and the approach which was adopted was to say that we can't give a numerical prediction, but we can give at least a qualitative guide to what has gone into the process of ensuring that the software is going to be sufficiently free of defects. The safety integrity levels were a series of progressive levels of applying more and more rigorous techniques to get greater confidence that your software would be free of defects. I heard somebody give quite an interesting analogy to that in the early days. They said: “Well, look, take a sort of physical process like welding. You can't go around testing all of your welds to see whether you can make them break, because you'll just be left with a broken heap of scrap iron. So you have to trust that the processes used for the welding will produce sound welds.” Extrapolating that to software, you have to trust that the process and the techniques you've used to produce the software will give you software which has sufficient integrity.

George Bearfield  
This links back to what you said before then, Roger, because this means that you do need to read these standards in detail and understand exactly what the mandatory requirements are, because they are your means of assurance. And I guess the danger is sometimes people see this all as ‘paperwork’ when actually it's about ensuring the rigour of your fundamental test and assurance processes, to make sure errors don't creep in. In the case of the interlocking you were talking about earlier, those errors could bring two trains together at speed.

Roger Short  
Oh yes, absolutely. Yeah. So the consensus that came around was that we need a set of successively more intense levels of protection. We’ll call them ‘safety integrity levels’, and the lowest level will be SIL 1 and the highest level will be SIL 4. It was anticipated that maybe in the future somebody would invent something even better and we’d have a SIL 5, but we're still waiting for that.

George Bearfield  
You mentioned that it's impossible to put failure rates on software. Even with traditional reliability engineering, though, as somebody with a design engineering background I’m always wary of absolute failure rates. I always see those as more of a design tool to find out where the relative weaknesses are in your design so that you can improve it. The absolute numbers are often fairly sketchy even then, and for software you can't do it at all. Yet despite that, the various software standards have quoted indicative failure rates for functions developed to different SIL levels, which certainly can cause a lot of confusion, I think.

Roger Short  
It is not a good thing to associate a number with the safety integrity level. That's sort of the biggest misuse of SIL. But it's very difficult to separate it out. I think actually the EN 50128 standard itself says very clearly that SIL doesn't guarantee a specific number is achieved. Nevertheless, the guidance that people are given is: “Well, if your random failure rate for the system you're developing needs to be sort of 10 to the minus nine, then you should be using the highest level available for software, SIL 4. If the requirement is an order of magnitude less demanding, you can use SIL 3, and so on.” So it's kind of inevitable that people think that if you've built your software using all the things it says in the standard for a SIL, then you can be quite sure that the unsafe failure rate will be equivalent to the 10 to the minus nine figure which is quoted for random hardware failures of the corresponding part of the system.
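[For readers unfamiliar with the figures being alluded to: the indicative bands in IEC 61508 for continuously operating functions, expressed as dangerous failures per hour, are SIL 4, 10⁻⁹ to 10⁻⁸; SIL 3, 10⁻⁸ to 10⁻⁷; SIL 2, 10⁻⁷ to 10⁻⁶; SIL 1, 10⁻⁶ to 10⁻⁵. As Roger stresses, these are targets for random hardware failures of the function concerned, not measured or guaranteed failure rates for the software itself.]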

George Bearfield  
Getting back to the timeline, I think we first worked together probably in the early 2000s on the next generation of signalling interlockings, where the processors were getting a bit more complicated. It was almost a generation after solid-state interlockings, with a variety of international suppliers.

Roger Short  
That's right. Yes. Yeah.

George Bearfield  
The bit that's relevant to me here is ‘data preparation’. So software - whether it's a signalling interlocking or, increasingly, say, rolling stock systems like selective door opening - needs to be configured with local data. What is integral to that is how the data is processed, configured and turned into functionality. The process is integral to the failure rate. And so, you know, that's a whole area that's often neglected, because generally the supplier's job is to define the process for turning raw data into functions and the toolset to do that. That's actually part of the asset - part of the signalling system or part of the train, if you like - and it's included within the scope of the standards. But often it's missed at the time those systems are approved, and the end user gets more involved, with spreadsheets and uncontrolled processes to try and feed the system with data, which they can get wrong.

 
Roger Short  
That's right, yes.

George Bearfield  
They can get the wrong function and so the end-to-end process needs to be assured. I think that's often missed in the realities of how projects progress.

Roger Short 
Yeah, yes, that's quite true. In fact, the EN 50128 standard does have a chapter in it on software data preparation. That's really important. Going right back in the history of software, right back to the solid-state interlocking days, there was a bit of discussion about this: “Well, you know, actually what you're doing with this data preparation is programming the interlocking. So it's really a programming activity.” It was considered to be a bit frightening to suggest that railway signal engineers were going to do programming. “This is only data preparation.” But in fact the data includes not only geographical things - which signals and which points apply to which tracks - but the control logic between them too, so it includes logical statements. You are, in fact, programming the system. But it was felt that this would cause great alarm if it was suggested that people who were used to designing relay circuits should have to learn to do programming. “No, no, no… we won't frighten them. We'll say it's data preparation.” And that's how it stayed. But I think an unfortunate side effect of that is that errors [in data] have tended to be regarded as less of a threat than software errors. In fact it is a form of application software, and I'm always stressing to people that it needs to be treated as such. You're also in an area which isn't fully testable, so you need to include checking, analysis and so on, as well as a functional test of the system which you've ‘configured with data’, as we put it. You've actually loaded an application programme into it, and what you're doing is testing and validating that application programme.
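[A minimal sketch, in Python with entirely hypothetical names, of the point Roger is making: the generic “engine” below stands in for the manufacturer's product software, while the site-specific “data” carries the control logic itself - so preparing that data is, in effect, writing an application program, and an error in it is a logic error just like any software bug.]

```python
# Generic interlocking "engine" supplied with the product: it evaluates whatever
# conditions the prepared data says must hold before a route may be set.
def route_may_set(route: str, site_data: dict, state: dict) -> bool:
    return all(state[condition] for condition in site_data[route])

# Site-specific "data preparation": these entries define the control logic for
# one hypothetical installation. A wrong or missing condition here is an
# application programming error, even though it is "only data".
site_data = {
    "route_1_to_3": ["points_201_normal", "track_t1_clear", "track_t3_clear"],
    "route_1_to_4": ["points_201_reverse", "track_t1_clear", "track_t4_clear"],
}

state = {
    "points_201_normal": True, "points_201_reverse": False,
    "track_t1_clear": True, "track_t3_clear": True, "track_t4_clear": True,
}

print(route_may_set("route_1_to_3", site_data, state))  # True
print(route_may_set("route_1_to_4", site_data, state))  # False: points not reverse
```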

George Bearfield  
And so, Roger, given what we said before about the fact that some of these SIL concepts go back to the 1970s and 1980s, do you think the SIL standards are restrictive, and are they still fully fit for purpose with modern digital technologies and the way that systems are built?

Roger Short  
Well, yeah, there are two aspects to that. One is I feel that now, having these standards - having these established - has been a bit of a disincentive to look for any new or improved techniques. Why should you spend money on that when simply using the existing techniques according to the standards is all you need to get your product assessed, certified and accepted? So the standards tend to create a disincentive to further research and improvement. The other area where I think they will be stressed is with the development of more complex systems and applications which the present standards have some difficulty in dealing with, and I'm thinking particularly there of machine learning techniques. If somebody is developing a system which has got some machine learning component within it, providing some of its functionality, I don't believe that the standards as they are at present - EN 50128, the software standard, or indeed necessarily EN 50129, the electronic system standard - would be sufficient to ensure that it was safe. There's some need for new techniques, and for at least an extension and revision of some of the lifecycle and process features of the standards, in order to extend them to be applied to technology such as machine learning. I think within the existing safety community, people would recognise: “Oh no, this won't do. We can't entirely cover that with our present standards.” But I worry slightly that somebody quite outside the community will say: “We're developing this and we're going to use machine learning. Oh, we'll use EN 50128. We've done all the things it says, so we've now achieved the SIL.” From what I know of machine learning to date, I don't believe that that will be true. There are potential defects which the present standards would not prevent, or maybe would not detect in their verification and validation activities. I don't think it's impossible. I think we're back in the early 1980s situation, where we need some more development work on techniques to extend the standards to cover technology such as machine learning. But with what we have at present, I would have some doubt; I would feel that there would be dangerous gaps in it applied to a system which included machine learning in the development of its functions.

George Bearfield  
There's a couple of points there, listening to what you say, I think. One thing that's misunderstood is that the standards do have flexibility, don't they? So, you know, if you've got a highly recommended technique for doing some sort of code analysis, for example, according to the standard, you're allowed to do something else as long as you justify it. But what you were referring to is that there's something structurally wrong with the standards, and the mandatory requirements within them, that needs rethinking: is that what I heard?

Roger Short  
Yeah, yes, I think so. Possibly not entirely rethinking, but at least extending them. I regard machine learning not as some totally new phenomenon which is quite outside of anything we do at present. No, I don't believe that. It's just mathematics and software. But it's mathematics and software which is used in ways which our present techniques don't cover, and we need to develop new techniques, a new lifecycle and new organisational arrangements to extend the standards to the new technology.

George Bearfield  
And then, Roger, flipping it around - because you can look at AI and machine learning as both an opportunity and a challenge - do you think machine learning offers some opportunities to enhance the verification, validation and assurance processes? I'm thinking particularly of what you said before about the practical difficulties of exhaustive testing. Perhaps quantum computing and other things could start to make a dent in that, particularly applied to some targeted testing [through simulation] - for example, additional brute force applied in an intelligent way: you know, checking some of the automation and data preparation issues and removing some of the human error from that? Is that something you've given thought to, and do you think there are opportunities to progress there?

Roger Short  
I'm sure there's opportunity to progress there. I think there are lots of ways that AI and machine learning tools could support the development of safety-critical systems. The bit about that which worries me is that, even at the moment, we're inclined to be a bit easy-going about the tools that we use: “Oh well, this is only a tool. It's not actually being used functionally in the system. It's not providing the function itself.” But we're still depending on it. We're still depending on the tools, in some ways, for ensuring that the system we develop will perform the function correctly, and I think people fall into the trap of saying: “Oh well, since it's a tool we can always check its output.” But if it's a really sophisticated tool which does all kinds of clever things, it's really going to be difficult, if not impossible, to actually verify and prove to yourself that what that tool has done to the end system you're producing hasn't introduced some kind of defect. You're left with a problem with the tool, which may introduce defects, which is just like the original problem of how you ensure that there aren't any defects in your functional system.

George Bearfield  
Yes. Yeah.

Roger Short  
I think that's perhaps manageable at present. A step towards an AI-type tool offers great benefits, but being able to trust the tool and rely on it not to result in defects in the actual operational system is still a question which has got to be solved.

George Bearfield  
The final technical question from me, Roger, really was around cyber security. I know we're both on the drafting group for 62543, the new rail cyber security standard. What additional challenges do you think cyber security poses for safety? I’d be interested to know conceptually how you think we should go about making safe, secure rail assets in the future.

Roger Short  
Well, it seems to me that the biggest tension there - between the safety and security communities - is that the security community wants to react very, very quickly as soon as they become aware of a new threat. They want to rush out and patch the software. The safety community wants to go to very great lengths to verify and validate any software modifications which are made, in case they introduce some unforeseen defects. So there are two almost diametrically opposite goals there: “We want to do it quickly”; “Oh no, we want to be absolutely sure that it's safe.” I think probably the most effective way of dealing with it is to be able to segregate the software which is essential to safety from any software which might have to be patched quickly as a result of security threats, and that will be an important part of system architecture: keeping the security-critical software, which has to be patched quickly, external to and segregated from the safety-critical software.
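[A rough sketch of the architectural principle Roger describes - purely illustrative, in Python, with hypothetical names and no claim to represent any real system. The idea is to keep the safety-critical logic behind a small, fixed, fully validated interface, so that the outward-facing, security-critical components can be patched quickly without touching it.]

```python
class SafetyCore:
    """Safety-critical logic: changed only under the full safety validation process."""
    def movement_is_safe(self, requested_kmh: int, permitted_kmh: int) -> bool:
        # e.g. interlocking rules, speed limits, movement authorities
        return requested_kmh <= permitted_kmh

class ExternalGateway:
    """Network-facing layer: may be patched rapidly in response to new threats."""
    def __init__(self, core: SafetyCore):
        self._core = core

    def handle_request(self, requested_kmh: int, permitted_kmh: int) -> str:
        # Authentication, filtering and protocol handling would live out here,
        # but the safety decision itself stays inside the segregated core.
        if self._core.movement_is_safe(requested_kmh, permitted_kmh):
            return "accepted"
        return "rejected"

gateway = ExternalGateway(SafetyCore())
print(gateway.handle_request(120, 125))  # accepted
print(gateway.handle_request(160, 125))  # rejected: exceeds permitted speed
```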

George Bearfield  
I'm glad you said that Roger because I agree with you entirely on that one and it brings me full circle then to the reason why I was keen to have a chat today.
For those who've spent time thinking about these things there are clear principles. But we've got pressures for more flexibility, driven by cost, and we've got a general lack of awareness of some of these things. So that was kind of the driver: more people need to understand these tensions and these principles, and how they need to work. The understanding of some of these topics needs to broaden out, and we need to help people get beyond thinking “Well, this is just technical software stuff” and really start to understand how central this is becoming to the whole viability of the railway.

Thanks ever so much for your time. I'll draw stumps there, Roger. But I could keep talking for hours on these topics, and I know you've got plenty more you could say. Thanks for your time: I've always found you bring a common sense touch to this topic and I've always appreciated that. I’m really glad to have worked with you on these topics at a formative stage in my career because I often reflect on some of the conversations and things we worked through on projects. And I'm keen that more people understand some of this stuff as we move forward. So thanks so much for your time.

Roger Short  
I've really enjoyed our chat, George. Thank you very much.
