Part 1: Generic Problems with DCS Alarm Systems
Donald Campbell Brown and Manus O’Donnell
This paper brings together BP's current views on what constitutes an effective DCS alarm system, and the barriers that must be overcome before the alarm system can make a full contribution to supporting the task of the process operator. A summary is presented of the role of the alarm system within the operator interface, together with definitions of typical event-driven information. The generic problems that prevent alarm systems from being effective are reviewed, including practical and fundamental limitations, poor implementation and misuse of the available alarm features. The typical symptoms of poor alarm systems (alarm flood, standing alarms etc.) are described, and a number of tools and techniques are listed which have been used by BP to bring about improvements. This paper draws heavily from work carried out on a range of BP sites, and from the work of the ASM Consortium.
One of the significant areas of impact with the introduction of Distributed Control Systems (DCS) during the late 1970s and early 1980s was in the generation and handling of alarms. Whereas before, for the panel-based operator interface, each alarm had to be separately derived and linked to a dedicated hard-wired alarm annunciator window, with the DCS the alarms were now 'free' - part of the standard functionality. On many sites large numbers of new alarms were configured in an attempt to enhance the operator interface.
The implications of this became clear when large alarm floods began to occur during upset conditions, and long lists of standing alarms started to build up during normal operations. There were recorded incidents of plant upsets arising out of an operator 'missing' an important alarm. Whilst DCS alarms were clearly powerful tools to help the operator, it became recognized that a badly implemented or maintained alarm system could actually make his task more difficult. This effect was compounded by the background of increasing plant complexity, greater sophistication of controls and reduced numbers of operators.
In tandem with a number of other Hydrocarbon Processing companies, particularly through the Abnormal Situations Management (ASM) Consortium, BP has been trying to establish and apply 'best practices' for alarm management. Alarm handling for BP Oil is primarily driven by the features provided as standard by DCS manufacturers, predominantly Honeywell and Foxboro, and the penetration of advanced software solutions, including Expert Systems, has so far been low. A wide range of initiatives to improve the usage of alarms have been taken locally on sites around the world, and the current effort is to study these, to add techniques and tools from outside BP and to try to move all of our sites up to the new paradigm.
What are the Objectives of an Alarm System?
The main role of the Operator on a typical refinery process unit can be characterized according to the current state of the plant under his control. Under normal operating conditions the main tasks are related to optimization and pursuit of efficiency, whilst at the other end of the scale, when the shutdown systems have triggered, the role is to make the plant safe and then restart as soon as possible. The full range of states and roles is listed in Figure 1:
Figure 1 - Primary role at different plant states (operator's primary role, from 'Efficiency and optimisation' under normal operation to 'Make safe and restart' once the shutdown systems have triggered)
The typical focus of control systems and the DCS operator interface is to support the operator during normal operation, and to bring the plant automatically back to 'normal' when unplanned disturbances (for example, a rain shower) push it into the 'perturbed' state. If the disturbance is too large (for example, mechanical failure of equipment), it may push the plant into the 'upset' state - from which the control system is not able to effect a recovery without operator intervention. The reason for this may be either that the current plant state has diverged too far from the fixed controller 'model' (for example, the controller tuning), or that the assumptions underlying the control (for example, that a pump is available) are no longer valid. If the upset state is not corrected satisfactorily, the Emergency Shutdown System (ESD) intervenes. This can be represented graphically as in Figure 2:
Figure 2 - Plant state and transition mechanisms
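The states and transitions described in the preceding paragraph can be sketched as a small state machine. This is only one reading of the text, not a reproduction of Figure 2: the event names, and the assumption that successful operator intervention returns the plant directly to 'normal', are illustrative.

```python
from enum import Enum


class PlantState(Enum):
    NORMAL = "normal"
    PERTURBED = "perturbed"
    UPSET = "upset"
    SHUTDOWN = "shutdown"


# Hypothetical event names; the transitions follow the prose description.
TRANSITIONS = {
    (PlantState.NORMAL, "small_disturbance"): PlantState.PERTURBED,
    (PlantState.NORMAL, "large_disturbance"): PlantState.UPSET,
    (PlantState.PERTURBED, "control_action"): PlantState.NORMAL,
    (PlantState.PERTURBED, "large_disturbance"): PlantState.UPSET,
    # Assumption: successful intervention returns the plant to normal.
    (PlantState.UPSET, "operator_intervention"): PlantState.NORMAL,
    (PlantState.UPSET, "esd_trip"): PlantState.SHUTDOWN,
    (PlantState.SHUTDOWN, "restart"): PlantState.NORMAL,
}


def step(state, event):
    """Return the next plant state; events that do not apply leave it unchanged."""
    return TRANSITIONS.get((state, event), state)


def operator_action_needed(state):
    """The control system alone cannot recover from 'upset' or 'shutdown'."""
    return state in (PlantState.UPSET, PlantState.SHUTDOWN)
```

In this sketch the control system handles the 'perturbed' state by itself, and the operator is only required once the plant has reached 'upset' or 'shutdown'.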
Whilst the plant is 'upset' the primary role of the operator is situation management, and this can be broken down into a sequence of tasks:
- Problem Identification - what is wrong?
- Situation Assessment - what else is happening to affect what I do?
- Immediate Action - minimize consequences
- Follow-up Actions - restore normal operations
It is in support of 'problem identification' that the DCS alarm system is most useful. As an 'event driven' system it can be used to notify the operator that action is required, and the text information delivered with the alarm can identify the symptom, (for example, the temperature is high), and the location, (for example, reactor V101 in Unit 100). The alarm priority, if configured properly, can inform on the urgency and criticality of the problem. Out of this comes a definition of the objective of the DCS alarm system:
The purpose of the alarm system is to assist the operator in detecting process problems and prioritizing his response.
What makes an Alarm System effective?
The key to an effective alarm system is that it should mark the boundary between the perturbed and the upset plant state - which is the point at which the operator has to take action. Figure 3 depicts an effective alarm system, whilst Figure 4 depicts an ineffective one:
Figure 3 - An effective Alarm System
Figure 4 - An ineffective Alarm System
This leads to a definition of what an alarm should be:
- Alarm - a warning to the operator that immediate action is required to correct a condition of the plant.
This definition is useful because it excludes information such as status (for example, 'the advanced control scheme has been put into an unusual mode') and sequence messages / batch reports (for example, 'the purge cycle has started'). This type of information, characterized as "the operator needs to be aware of it, but doesn't need to do anything", is often found in alarm systems and represents unnecessary clutter that obscures the operator's view of the important information. It can itself be usefully split into two parts:
- Alert - a signal to the operator that he should be aware of something about the condition of the plant, but that no immediate intervention is required.
- Status - an indication of a long term change in the state of the plant.
Ideally all 'status' information should be located on DCS schematics and 'alerts' should be handled through a different DCS 'window' from the alarm system - either the 'Message' system or via a dedicated annunciator schematic that mimics the alarm annunciator (but that is dedicated to alerts).
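The three definitions above amount to a simple routing rule for event-driven information. The following is a minimal sketch, assuming each event can be described by two yes/no questions. The function names and the destination labels are hypothetical and do not correspond to any particular DCS product.

```python
from enum import Enum


class EventKind(Enum):
    ALARM = "alarm"    # immediate operator action is required
    ALERT = "alert"    # operator should be aware, no immediate intervention
    STATUS = "status"  # long term change in the state of the plant


# Hypothetical destination labels, following the recommendation in the text.
DESTINATION = {
    EventKind.ALARM: "alarm system",
    EventKind.ALERT: "message system",
    EventKind.STATUS: "schematic",
}


def classify(immediate_action_required, long_term_state_change):
    """Apply the definitions of alarm, alert and status to one event."""
    if immediate_action_required:
        return EventKind.ALARM
    if long_term_state_change:
        return EventKind.STATUS
    return EventKind.ALERT


def route(immediate_action_required, long_term_state_change):
    """Return the operator 'window' in which the event should appear."""
    return DESTINATION[classify(immediate_action_required, long_term_state_change)]
```

For example, 'the purge cycle has started' requires no action and is not a long term change of state, so under these assumptions it would be routed to the message system and not to the alarm system.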
What are the generic problems that prevent Alarm Systems from being effective?
In order to understand how to implement an effective alarm system it is useful to define the range of generic problems that are present. These are pitfalls which must be avoided in any new alarm system implementation, or focus areas for solutions where improvements are attempted for an existing alarm system, and they fall broadly into four categories:
Problem Area # 1 – Misuse
Alarm Systems are often 'misused' as a workaround where the DCS lacks appropriate functionality. This includes their misuse:
- to counter drop-off of operator vigilance. Alarms are often configured that have no real action required - but that serve only to keep the operator awake during quiet periods, especially late at night.
- to assist with identification of tasks. Alarms are often configured to prompt the operator through an operating procedure (for example, to switch off a pump when a particular step in a normal operating sequence has been reached).
- to monitor process control strategies. Alarms are often configured to inform the operator that a process control strategy is in an unusual state, even though it is usually the operator who has intervened manually to put the strategy into this state.
In each case there are better methods of achieving the desired functionality that do not degrade the utility of the alarm system. For example, operator vigilance can often be improved by varying the tasks that he performs or providing more breaks; prompting of tasks can be delivered by on-line procedure monitors; the state of process control strategies can be represented using appropriate depiction on schematics.
Problem Area # 2 – Fundamental
There are a number of fundamental limitations inherent in the basic DCS alarm systems:
- The structure of the basic alarm system is static, not dynamic, whilst for the real plant alarms can mean different things at different times. (For example, a single pump stopping may be relatively unimportant if the backup pump is already running, but may be of major significance if the backup has also stopped.) Some form of logic may represent the solution to this.
- Alarms are generally based on single process measurements, whilst the process itself is usually multivariate. Events of concern to the operator often cannot be measured directly. (For example, it may be the state of a reaction that is important, and the temperature, pressure, pH etc. that are available as measurements may each be only poorly correlated with this.) Some form of logic, or even Principal Components treatment, may be called for.
- Prioritization is discrete, not continuous, and alarms can generally only be assigned to one of a limited range of priorities (for example, for Honeywell TPS, the effective options are limited to emergency, high and low). However, operators want to work in serial fashion, dealing with the most important alarm first and working down to the least important. How is the operator to discriminate between two alarms of the same priority level?
- The 'Acknowledgement' function, (i.e. new alarms have to be acknowledged manually by the operator before they are allowed to 'clear'), is only useful if the rate of new alarms is less than around 15 new alarms per minute. In alarm floods the operator is often not able to read the text of each new alarm fast enough to keep up, let alone take in what it means and acknowledge it in the alarm system. The solution to this is closely linked to mechanisms that avoid the alarm flood in the first place.
- Redundancy is ignored in the basic alarm system; only the first warning of a given problem is useful. (For example, if a reactor temperature is high, the high temperature alarm on the effluent stream may be extraneous - unless the reactor temperature measurement device has failed.)
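Two of the limitations listed above, the static structure and the ignoring of redundancy, can be illustrated with a few lines of logic. This is a minimal sketch based on the pump and reactor examples in the list. The function names are hypothetical, and the mapping of 'relatively unimportant' to low priority and 'major significance' to emergency priority is an assumption.

```python
def pump_stop_priority(other_pump_running):
    """Priority of a 'pump stopped' alarm, given the state of the other pump.

    Assumption: 'relatively unimportant' maps to low priority and
    'major significance' maps to emergency priority.
    """
    return "low" if other_pump_running else "emergency"


def effluent_alarm_needed(reactor_temp_high, reactor_temp_bad, effluent_temp_high):
    """Annunciate the effluent high temperature alarm only if it adds information.

    The effluent alarm is treated as redundant when the reactor high
    temperature alarm is already standing and the reactor measurement
    is healthy.
    """
    if not effluent_temp_high:
        return False
    return reactor_temp_bad or not reactor_temp_high
```

The same alarm therefore carries a different priority, or is not annunciated at all, depending on the context in which it occurs. A static alarm configuration cannot express this.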
Problem Area #3 – Practical
There are a range of practical problems associated with DCS alarm systems:
- Alarm 'creep' often occurs in response to process problems. The easiest solution to a process issue may appear to be to install an alarm, thereby avoiding a possibly extensive study to show that the problem is not real, or a process modification to rectify it properly. The sum effect of this over time can lead to degradation in the effectiveness of the alarm system.
- Failures in maintenance, either of the instruments (instrument drift etc.) or of the alarm system itself, often lead to significant numbers of spurious alarms.
- The relevance of the alarm system can easily degrade in the face of process or operational changes. Alarms must be removed from the system if they have been rendered irrelevant.
- The original design context of an alarm can be lost, and the operator no longer understands why it was installed or why it is significant. The removal of these alarms can be particularly problematical.
- Lack of adequate ownership of the alarm system by the operator can result over time in a system that is not liked and possibly not used.
- Alarm chatter during abnormal situations can be handled effectively by some DCS alarm systems, but in others can represent a major load which provides no useful information to the operator.
- The value of any alarm system is compromised by instrument measurements which are out of range, since this represents a loss of operator view of the process. However the 'Bad PV' alarming that is often set up to warn of this condition can itself become counter-productive, turning the operator's alarm pages into little more than a 'worklist' for the instrument maintenance team.
Problem Area # 4 – Implementation
Alarm Systems often suffer from uneven or suboptimal implementation. This is directly under the control of the user, and can include:
- improper configuration of priorities (too many at the same or at too high a level)
- poor BadPV handling ('Bad PV' limits defined too close to the instrument range)
- improper setting of trip thresholds (the operator does not have adequate time to respond)
- no PV clamping (the operator cannot see in which direction an instrument has gone out of range)
- mis-selection of points for 'background' processing (inappropriate points give rise to alarms)
- ineffective PV filtering (alarms arise from process noise, or the response to a real excursion is too slow)
- ineffective delay time (PV filtering for digital signals)
- ineffective alarm dead bands (leading to 'alarm chatter' when the process variable is close to the alarm set point)
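The effect of the dead band and delay time settings in the list above can be shown with a small simulation. This is an illustrative sketch of a high alarm on a sampled process variable. The class and parameter names are hypothetical, and the delay is expressed as a number of samples, not as a time.

```python
class HighAlarm:
    """High alarm with an optional dead band and on-delay (in samples)."""

    def __init__(self, setpoint, deadband=0.0, on_delay=0):
        self.setpoint = setpoint
        self.deadband = deadband  # PV must fall this far below setpoint to clear
        self.on_delay = on_delay  # extra consecutive samples above setpoint needed
        self.active = False
        self._count = 0

    def update(self, pv):
        """Process one sample of the process variable and return the alarm state."""
        if self.active:
            if pv < self.setpoint - self.deadband:
                self.active = False
                self._count = 0
        elif pv > self.setpoint:
            self._count += 1
            if self._count > self.on_delay:
                self.active = True
        else:
            self._count = 0
        return self.active


def count_activations(alarm, samples):
    """Count how many times the alarm goes from clear to active."""
    activations = 0
    previous = alarm.active
    for pv in samples:
        current = alarm.update(pv)
        if current and not previous:
            activations += 1
        previous = current
    return activations


# A noisy signal hovering around an alarm set point of 100.
noisy = [99, 101, 99, 101, 99, 101]
print(count_activations(HighAlarm(100), noisy))              # 3
print(count_activations(HighAlarm(100, deadband=2), noisy))  # 1
print(count_activations(HighAlarm(100, on_delay=1), noisy))  # 0
```

With no dead band the alarm chatters on every noise excursion. A dead band of 2 units holds it as a single standing alarm, and a one-sample delay rejects the noise altogether, at the cost of a slower response to a real excursion.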
What are the typical symptoms of a poor alarm system?
Analysis of alarm event journals, (preferably by using statistical search tools), and of the basic alarm system configuration can give important pointers to areas where the alarm system needs to be improved. A review of incident analysis reports can also provide clues, but the best indication can generally be provided by the operators themselves.
The gross symptoms of a poor alarm system (apart from a generally low credibility) include large numbers of standing alarms during normal operation, and floods of large numbers of alarms during abnormal events. An operator missing an important process condition is clearly a worrying outcome (either he is not alerted, does not assess the process condition or mis-assesses the process condition), but more frequently a poor alarm system will serve to distract the operator from other important tasks.
In BP, we are trying to quantify the status on our sites by collecting data for several key metrics:
- the number of alarms configured, as a ratio of the number of control valves within an operator's area of responsibility. A 'good' baseline is reported to be no more than 3, although a higher number could be justified by particular types of plant or operating culture.
- the proportion of alarms per operator configured at each priority level. A common target for Honeywell systems is Emergency 10%, High 20% and Low 70%.
- the proportion of alarms per operator, as a snapshot, that are 'disabled' (i.e. not annunciated to the operator) when the associated process unit is not shut down for maintenance. The target is zero, unless Dynamic Configuration Software has been implemented, since good alarm management practices should ensure that alarms are 'inhibited' (i.e. not journaled or annunciated) or removed altogether when they are not relevant.
- the number of alarms 'standing' on the alarm summary pages of one operator, taken as a snapshot during normal operations. A figure of 3-5 is reported as a reasonable target, but so far the BP sites have not achieved anywhere near this.
- the worst case number of alarms per minute presented to the operator during a process upset. It is suggested that 15 alarms per minute per operator should be a target maximum, but even this may be too high.
Another good indicator is whether there are any alarms that do not have an associated operator action, although the availability of manuals comprehensively defining the action on receipt of each alarm is not widespread in BP.
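Several of the metrics listed above can be computed directly from the alarm configuration and the alarm event journal. The following is a minimal sketch. It assumes that the configuration is available as a list of priority labels and the journal as a list of alarm timestamps in seconds, and the function names are hypothetical.

```python
from collections import Counter


def alarms_per_valve(n_alarms_configured, n_control_valves):
    """Configured alarms as a ratio of control valves (reported baseline: 3 or less)."""
    return n_alarms_configured / n_control_valves


def priority_distribution(priorities):
    """Percentage of configured alarms at each priority level."""
    counts = Counter(priorities)
    total = len(priorities)
    return {level: 100.0 * n / total for level, n in counts.items()}


def peak_alarms_per_minute(timestamps_s, window_s=60.0):
    """Worst-case number of alarms falling within any sliding window of window_s."""
    ts = sorted(timestamps_s)
    best = 0
    start = 0
    for end, t in enumerate(ts):
        # Advance the window start until it is within window_s of the current alarm.
        while t - ts[start] >= window_s:
            start += 1
        best = max(best, end - start + 1)
    return best
```

A sliding window is used here in preference to fixed clock minutes so that a flood which straddles a minute boundary is not under-counted. The results can be compared against the targets quoted above, for example the split of Emergency 10%, High 20% and Low 70%, or the suggested maximum of 15 alarms per minute per operator.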
What solutions are available for improving alarm systems?
Within BP, alarm system reviews have provided a basis for improving the effectiveness of DCS alarm systems. Poor basic alarm configuration, poor usage of standard features and poor maintenance often cover a large part of the problem. Thereafter a number of advanced solutions have been applied:
- Combinational logic - has been successful in handling simple contextual issues (for example, the simultaneous stopping of two pumps).
- Simple static alarm management techniques - such as a schematic where nuisance alarms can be 'parked' to remove them from the alarm system annunciator, but with an automatic time-out facility to ensure that they are not left there indefinitely.
- Dynamic Configuration software - where the alarm configuration is switched dynamically according to the current operating mode, has been implemented on one site to avoid alarm floods.
- Expert systems solutions - have been attempted on a number of sites and are still active on some. These tend to be expensive to build and difficult to maintain.
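The 'parking' technique with automatic time-out described in the list above can be sketched as follows. This is an illustration of the idea only, not the implementation used on any BP site. The class name, the tag names in the usage note and the use of seconds as the time unit are all assumptions.

```python
class AlarmParking:
    """Holds nuisance alarms off the annunciator for a limited time."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._parked = {}  # alarm tag -> time at which it was parked

    def park(self, tag, now_s):
        """Park an alarm so that it is removed from the annunciator."""
        self._parked[tag] = now_s

    def is_parked(self, tag, now_s):
        """Return True while the alarm is parked and has not timed out."""
        parked_at = self._parked.get(tag)
        if parked_at is None:
            return False
        if now_s - parked_at >= self.timeout_s:
            # Timed out: the alarm returns to the annunciator automatically.
            del self._parked[tag]
            return False
        return True

    def annunciated(self, active_tags, now_s):
        """Filter the active alarms down to those shown to the operator."""
        return [tag for tag in active_tags if not self.is_parked(tag, now_s)]
```

For example, with a time-out of 3600 seconds, an alarm with the hypothetical tag 'FI101' parked at time zero is hidden from the annunciator until one hour has elapsed, after which it reappears without any further action by the operator.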
The key message from BP's activities in this area is that these 'advanced' alarm management solutions are not a 'fix' for bad basic practices.
Some ten or more years after the introduction of Distributed Control Systems on to BP sites there is currently a growing focus on the importance of effective alarm systems. The systems designed for normal operation must also support the operator during abnormal situations - and experience shows that this has not always been the case.
The alarm system does have an important role in assisting the operator to detect process problems that require his intervention, and to prioritize his response. Alarms must be differentiated from 'alerts' and status information, and the generic problems associated with the alarm system, including practical and fundamental limitations, poor implementation and misuse of the available features, must be avoided in order for it to be effective.
There are a range of symptoms which characterize poor alarm systems, but the best source of opinion is usually the operator. A number of tools and techniques have been applied within BP for improving the effectiveness of DCS alarm systems, but it is recognized that advanced solutions are never a 'fix' for bad basic practices.
Much of the subject matter in this paper draws heavily on the work of others, including John McCulloch at BP Chemicals, Grangemouth, David Coates at BP Exploration, Sunbury, and John Davis at BP Oil, Sunbury. There is also a strong influence from the work of the ASM Consortium, particularly Ken Emigholtz of Exxon and Tim Montgomery/Nancy Lyche of Chevron.
The following is a short list of related papers, all of which are in the 'public domain':
- Emigholtz, KF, Improving the Operator's Capabilities; Observations from the Control House, Proceedings of the AIChE Loss Symposium, July 1995
- Lyche, NP, 1995, Alarm Management and System Design, Proceedings of the 7th Annual Honeywell Australasian IAC Users Group Conference
- McCulloch, JG, 1996, Alarm Handling and Future DCS Developments, Proceedings of the Honeywell 1996 European Users Group.
- Nimmo, I, Abnormal Situations Management: Giving your Control System the Ability to 'Cope', Honeywell Journal for Industrial Automation & Control, Aug ‘95