Why PESSRAL Is Not PESS
Important safety ramifications in European standards applications
This paper was presented at Madrid 2016, the International Congress on Vertical Transportation Technologies, and first published in IAEE book Elevator Technology 21, edited by A. Lustig. It is a reprint with permission from the International Association of Elevator Engineers (website: www.elevcon.com).
The lift industry is quite old-fashioned in electric/electronic/programmable electronic (E/E/PE) safety: it has used the electric safety chain for more than 30 years. However, since the EN 81-1/2 A1:2005 amendment, the standard has allowed the use of programmable electronics for safety systems (PESS). When the code committee decided to implement only a subset of the leading norm (IEC 61508) in EN 81, in order to reduce the difficulty and speed up implementation, Programmable Electronic Systems in Safety Related Applications for Lifts (PESSRAL) was born. However, because of cherry-picking and skipping the basics, both the old code and even the newest one (EN 81-20/50) make it possible to create unsafe systems. Where are the potential risks?
IEC 61508 itself consists of seven parts with a total of more than 500 pages. It describes the complete path to follow when creating an E/E/PE safety device. It contains calculations, assumptions, design strategies, risk analyses and descriptions of quality systems. It results in a safety integrity level (SIL), a number that expresses the safety of the system. All of this documentation is needed to end up with a safe system. In contrast, EN 81-20/50 uses 11 pages and claims to be a full package.
The entire process flow for making a PESS is described in a separate part of the standard, IEC 61508-1. Through a clear way of working and project management, we try to minimize systematic failures in a system. There are clear demands, and meeting them results in a systematic capability (SC) value. Techniques that can be used include project management, documentation, structured design and modularization. Neither these techniques nor the SC are demanded or described in EN 81-20. Projects without proper management can contain major mistakes, and these are hard to spot.
For safety software, SIL is used to measure safety. For example: SIL 3 corresponds to an average probability of dangerous failure between 10^-8 and 10^-7 per hour in high-demand mode, or between 10^-4 and 10^-3 per demand in low-demand mode. Normally, you have to perform a risk analysis in order to determine the required SIL. EN 81-1/2+A3 and EN 81-20/50 have already performed this risk analysis and prescribe the SIL ratings. This way, the designer does not need to perform a risk analysis, which creates uniformity among the systems of competitors. However, a risk analysis gives insight into the project and influences the design. It is mandatory in the IEC 61508 procedure, but not in EN 81-1/2 and EN 81-20.
So, a SIL is specified, but the standard does not make clear whether we are working in high or low demand. The target failure figures for the two modes, however, differ numerically by a factor of 10,000. Low demand is explained in IEC 61508-4 as “where the safety function is only performed on demand, in order to transfer the Equipment under Control (EUC) into a specified safe state, and where the frequency of demands is no greater than one per year.” For a lift, we do not use the overspeed governor more than once a year, so is it, then, low demand? The answer matters, because it changes the calculated safety by a factor of 10,000. IEC 62061, however, states that machines shall fulfill high demand, and most certifying organizations follow this guideline. Unfortunately, this is not stated plainly in EN 81-20.
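The gap between the two demand modes can be made concrete with a small sketch. The band values below are the target failure measures as tabulated in IEC 61508-1 (average probability of failure on demand for low demand, dangerous failures per hour for high demand); the function names and the one-demand-per-year threshold taken from the quoted definition are illustrative choices, not anything prescribed by EN 81.

```python
# Target failure measures per SIL (lower bound inclusive, upper bound exclusive),
# as tabulated in IEC 61508-1. Names are illustrative only.
LOW_DEMAND_PFD_AVG = {1: (1e-2, 1e-1), 2: (1e-3, 1e-2), 3: (1e-4, 1e-3), 4: (1e-5, 1e-4)}
HIGH_DEMAND_PFH = {1: (1e-6, 1e-5), 2: (1e-7, 1e-6), 3: (1e-8, 1e-7), 4: (1e-9, 1e-8)}


def demand_mode(demands_per_year: float) -> str:
    """Classify the mode using the 'no greater than one per year' rule quoted above."""
    return "low" if demands_per_year <= 1.0 else "high"


def target_band(sil: int, mode: str) -> tuple:
    """Return the (lower, upper) target band for a SIL in the given mode."""
    table = LOW_DEMAND_PFD_AVG if mode == "low" else HIGH_DEMAND_PFH
    return table[sil]


def mode_factor(sil: int) -> float:
    """Numerical ratio between the low-demand and high-demand upper bounds."""
    return LOW_DEMAND_PFD_AVG[sil][1] / HIGH_DEMAND_PFH[sil][1]


if __name__ == "__main__":
    # Hypothetical example: a function demanded roughly once every two years.
    mode = demand_mode(0.5)
    print(mode, target_band(3, mode), round(mode_factor(3)))
```

The same SIL 3 label therefore translates into very different numbers depending on which mode a certifier assumes, which is why the missing statement in the lift standard matters.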
Safe Failure Fraction
When building a SIL 3 system, the relevant tables in EN 81-1/2+A3 and EN 81-50 mandate a double-channel system. The main idea of this is “when one channel fails, the other channel will put the system to a safe state.” IEC 61508 has the same principles, but there are some major discrepancies. IEC 61508 describes the model of Safe Failure Fraction (SFF): the fraction of all failures that is either safe or detected by diagnostics, as opposed to dangerous and undetected. For components where the failure mode cannot be predicted (like CPUs and other complex systems), the demands are set higher. Here, diagnostic software also increases the SFF. Because EN 81-20/50 demands a two-channel system for SIL 3, it excludes the use of a totally failsafe (SFF = 100%) one-channel system and makes it possible to create a fail-unsafe (SFF < 90%) system. If every possible fault in a channel is directly dangerous (SFF = 0%), and if the fault remains undetected, a second fault causes an unsafe system. This way, PESSRAL solutions can be less safe than the fault tree analyses present in EN 81-20.
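As a rough illustration of how SFF behaves, the sketch below uses the usual relation SFF = (λ_S + λ_DD) / (λ_S + λ_D), where the detected dangerous rate λ_DD is the diagnostic coverage times the dangerous rate λ_D. The function name and the failure rates are made up for the example; only the relation itself comes from the IEC 61508 model.

```python
def safe_failure_fraction(lambda_safe: float,
                          lambda_dangerous: float,
                          diagnostic_coverage: float) -> float:
    """SFF = (lambda_S + lambda_DD) / (lambda_S + lambda_D), with lambda_DD = DC * lambda_D.

    Rates may be in any consistent unit (for example FIT); DC is a fraction 0..1.
    """
    if not 0.0 <= diagnostic_coverage <= 1.0:
        raise ValueError("diagnostic coverage must be between 0 and 1")
    total = lambda_safe + lambda_dangerous
    if total <= 0.0:
        raise ValueError("total failure rate must be positive")
    lambda_dd = diagnostic_coverage * lambda_dangerous  # dangerous but detected
    return (lambda_safe + lambda_dd) / total


if __name__ == "__main__":
    # Hypothetical channel in which every fault is dangerous and nothing is detected.
    print(safe_failure_fraction(0.0, 100.0, 0.0))
    # Same channel with diagnostics that catch 90% of the dangerous faults.
    print(safe_failure_fraction(0.0, 100.0, 0.9))
    # Half of the faults are safe by design, 80% of the rest are detected.
    print(safe_failure_fraction(50.0, 50.0, 0.8))
```

A channel with SFF = 0% is exactly the case described above: two of them satisfy a "two channels" rule on paper while contributing nothing in terms of fail-safe behavior.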
Because no risk analysis is performed and two channels are demanded for SIL 3, a new difficulty occurs. By demanding two channels without further specification, the standard makes it possible to build two identical channels. Identical channels carry the risk of failing at the same time due to the same error (common cause). Typical common causes are a slightly to very low supply voltage, design faults inside a CPU, or temperature. In a multichannel system, common-cause errors make up the largest part of the total failure probability.
You can compare it with throwing a die. If you lose by throwing a one, your chance of losing is exactly one in six. To decrease this chance of losing, you can add another die. Now, you need two ones to lose the game. When calculating the chance of losing, we use: 1/6 * 1/6 = 1/36.
Now, we introduce a common-cause fault in this “system”: a fault that influences both channels (the dice). The two opposite sides of a die always add up to seven, so the six is on the side opposite the one. Painting six dots takes slightly more paint than painting one dot, and more paint means more weight. Because of this faulty design, the heavier six tends to end up at the bottom, and the chance of throwing a one is bigger than that of throwing other numbers. The chance of a double one is therefore also bigger than the chance of any other double. If this common cause adds 5% of the single-die chance of losing (5% of 1/6 = 1/120), the system is no longer as safe as 1/36 suggests: we need to add 1/120 to the 1/36, for a total of 13/360.
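The dice arithmetic can be checked with exact fractions. This is only a sketch of one reading of the example: it assumes the 1/120 term is 5% of the single-die chance of 1/6, and the variable names are invented for the illustration.

```python
from fractions import Fraction

p_single = Fraction(1, 6)             # chance of a one on one fair die
p_independent = p_single * p_single   # both dice show a one by pure chance: 1/36

beta = Fraction(5, 100)               # assumed common-cause share (5%)
p_common = beta * p_single            # both dice affected by the same cause: 1/120

p_lose = p_independent + p_common     # total chance of losing
increase = p_lose / p_independent     # how much worse than the ideal 1/36

if __name__ == "__main__":
    print(p_independent, p_common, p_lose, increase)
```

Under this reading, a seemingly small 5% common-cause share makes losing noticeably more likely than the ideal two-dice figure.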
For this system, the impact is relatively small. However, the fault chance of a PESS channel is a lot smaller: for example, 10^-9. Doing the same calculation, the two-channel system has a 10^-9 * 10^-9 = 10^-18 chance of failing. Now, we add the 5% common cause: 0.05 * 10^-9 = 5 * 10^-11. We can see clearly that the common-cause part is much bigger than the part from independent single-channel faults. The smaller the failure chances of the channels, the more important the common cause becomes, until it dominates the safety calculations as well as the real safety. EN 81 does not tackle this problem; no techniques for common-cause avoidance are described or calculated.
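The same calculation for a PESS channel shows how completely the common-cause term takes over. The sketch below follows the simplified model used in the text (independent part p * p, common-cause part beta * p); the function name and the choice of beta = 5% are taken from the example, not from any standard.

```python
def two_channel_failure(p_channel: float, beta: float):
    """Return (independent, common_cause, total) failure chances for two channels.

    Simplified model from the text: both channels fail independently with
    chance p * p, or together through a common cause with chance beta * p.
    """
    independent = p_channel * p_channel
    common_cause = beta * p_channel
    return independent, common_cause, independent + common_cause


if __name__ == "__main__":
    for p in (1.0 / 6.0, 1e-3, 1e-6, 1e-9):
        independent, common_cause, total = two_channel_failure(p, 0.05)
        share = common_cause / total
        print(f"p={p:.3g}  independent={independent:.3g}  "
              f"common={common_cause:.3g}  common share={share:.6f}")
```

For the dice, the common cause is a minority of the total; at 10^-9 per channel it is practically the whole result, so adding a second identical channel buys far less than the multiplication suggests.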
EN 81-20 cherry-picks a number of techniques and states them as mandatory. No calculation is needed anymore. (EN 81-50 states that IEC 61508-6, which explains the calculations, is not needed for understanding.) IEC 61508 gives a large number of options, so the most suitable technique can be chosen for the system. In the lift standard, it can happen that completely irrelevant techniques are demanded where other techniques would be far more useful. For example: there are no demands for sensors in the lift standard, but when we use a complex programmable logic device (CPLD), there are still demands for RAM checks and watchdogs; this is not right, according to IEC 61508. Here, we cannot check if our diagnostics are good enough. Normally, diagnostic coverage (DC) has a direct influence on the safe failure fraction (SFF) and, through it, on the entire safety calculation of the system.
The backbone of IEC 61508 is the underlying calculations. By looking at the failure in time (FIT) rates of all components and at the design, a calculation of the chance of failure can be made. The calculated numbers should be in line with the required SIL. Performing failure modes and effects analysis on the components and adding DC to improve the SFF results in a safer system. IEC 61508 has demands on the SFF which need to be met.
The calculation is the theoretical basis; it gives insight into the weakest points of the system and proves that the system is safe enough. This calculation is not needed for EN 81; by fulfilling all demands, the requirements are met. These demands describe techniques only but do not give any numbers. There is no check if the system is “safe enough,” so it is possible to end up with a mathematically unsafe system.
For example: two really bad relays can be used in parallel. If each fails once in every 10 operations, both will fail at the same time once in every 100 operations (excluding common cause). This still fulfills EN 81-20 (double channel with diagnostics): it can be detected that both relays are failing. However, at that point, the fault can no longer be acted upon. When we calculate the failure rates for the system with IEC 61508, we find directly that the relays are not good enough for this system: their FIT values will be devastating for the probability-of-failure-per-hour figure of the system. Thanks to the calculation, bad components are filtered out.
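How a calculation filters out bad components can be sketched with a deliberately crude budget check. One FIT is one failure per 10^9 device-hours. Everything else below is an assumption for illustration: the component names, their FIT values and the budget are invented, and the rates are simply summed as if every failure were dangerous, with no credit for redundancy or diagnostics. A real IEC 61508 calculation is considerably more detailed.

```python
FIT = 1e-9  # one FIT = one failure per 10^9 device-hours


def failure_rate_per_hour(fit_by_component: dict) -> float:
    """Crude upper bound: sum all component FIT values and convert to per-hour."""
    return sum(fit_by_component.values()) * FIT


def over_budget(fit_by_component: dict, budget_per_hour: float) -> list:
    """Return (name, rate) for components that exceed the budget on their own."""
    offenders = [(name, fit * FIT)
                 for name, fit in fit_by_component.items()
                 if fit * FIT > budget_per_hour]
    return sorted(offenders, key=lambda item: item[1], reverse=True)


# Hypothetical bill of materials with two very poor relays.
parts = {"relay_a": 50_000, "relay_b": 50_000, "optocoupler": 15, "resistor": 2}
budget = 1e-7  # placeholder target in failures per hour, not taken from EN 81

total = failure_rate_per_hour(parts)
offenders = over_budget(parts, budget)

if __name__ == "__main__":
    print(f"total = {total:.3e} per hour, budget = {budget:.0e} per hour")
    for name, rate in offenders:
        print(f"{name}: {rate:.1e} per hour exceeds the budget by itself")
```

Even this rough check flags the relays immediately, which is the point of the argument: a list of required techniques never produces such a number, so it never raises the flag.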
Every system needs testing after development: there are always unforeseen problems that are filtered out during the test phase. Of course, a PESSRAL system will be tested, but which test strategy is the proper one? Most of the industry has no practical experience with safety software, and no test strategies are mandatory or even mentioned in the standard. The most commonly known test method is black/white box testing. This basic way of screening a system is usable for both electric and mechanical systems. When creating PESS, the system is a full black box. IEC 61508, however, can also ask for traceability of the requirements, full modeling, software simulation and performance testing. In addition, the lift norm contains no test procedure for, or even awareness of, common-cause faults.
Proof Test Interval
Again, the lifetime of a system is not considered. Because periodic inspection of PESS is almost impossible, a lifetime must be specified. Diagnostics in the system also cannot detect every possible fault; the DC is always smaller than 100%. Normally, PESS have a “proof test interval” to reveal the faults that diagnostics do not detect. EN 81 does not require this. This allows a system to accumulate undetected faults without limit and makes it possible to end up with a dangerous fault.
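The effect of a missing proof test can be illustrated with a constant-failure-rate model. With a dangerous undetected failure rate λ_DU, the chance that such a fault is present after t hours without any proof test is 1 - exp(-λ_DU * t), and with a proof test every T hours the average probability of failure on demand is commonly approximated as λ_DU * T / 2. The rate used below is a made-up value, and the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760


def prob_undetected_fault(lambda_du: float, hours: float) -> float:
    """Chance that a dangerous undetected fault is present after `hours` without a proof test."""
    return 1.0 - math.exp(-lambda_du * hours)


def pfd_avg_approx(lambda_du: float, proof_test_interval_hours: float) -> float:
    """First-order approximation of the average PFD for a single channel."""
    return lambda_du * proof_test_interval_hours / 2.0


if __name__ == "__main__":
    lambda_du = 1e-7  # hypothetical dangerous undetected failure rate per hour
    for years in (1, 5, 10, 20):
        p = prob_undetected_fault(lambda_du, years * HOURS_PER_YEAR)
        print(f"{years:>2} years without proof test: {p:.4f}")
    print(f"yearly proof test, average PFD: {pfd_avg_approx(lambda_du, HOURS_PER_YEAR):.2e}")
```

Without a proof test interval or a specified lifetime, the first figure simply keeps growing for as long as the lift stays in service.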
At this moment, only a small number of lifts work with PESS. For the ones that do, there have been no major failures yet. PESS application has been possible since the first amendment of EN 81-1/2 in 2005. We do not know how many installations are in the field today, so we cannot determine why there have been no failures. There are some possible explanations for the fact that we have not had any accidents:
- When making something revolutionary, a company must be absolutely sure it is safe. Otherwise, the product will not be accepted in the market by the customer. For PESSRAL, most lift companies want to be absolutely sure that it still works after several years, so endurance tests will probably be done. This is a powerful testing method.
- There are not many PESSRAL systems in the world: most lifts have a long lifetime, and controls are not regularly changed. Also, the development of PESSRAL has only just started, and most of these systems are still in development rather than on the market.
- The major certification bodies also perform tests on PESS systems. They have their own demands for testing or will ask for a calculation. Certification bodies also want safe systems, and most of them know how to perform the tests properly.
- There is no guideline for reporting crashes, and we cannot be sure that we will hear about all crashes in the world, especially reports that include the cause.
The biggest problem with these possible explanations is that none of them is mandatory: there are no demands on test time, and there is no requirement for Notified Bodies to have experience in PESS. Also, worldwide information about lift catastrophes related to this topic does not exist.
PESSRAL is not PESS, and this is not only due to the absence of a lot of background information. The entire mathematical backbone is gone, so we cannot calculate whether the chance of failure of the system is acceptable. This has a huge impact on the common-cause faults, which are the most dangerous faults for a double-channel system. Also, the channels themselves can be made out of unsafe components. The only way to check the system now is by testing, but testing strategies are not described. As of the time of this writing, there have been no fatal accidents yet. However, we cannot explain why they didn’t happen or predict that they won’t happen. In the end, it is possible to build unsafe systems with the rules of PESSRAL. For now, we can only hope that lifts will stay safe; for the future, we need EN 81-20 to change as quickly as possible.