6 Key Takeaways from a Chemical Plant Disaster


Near the end of a warm summer day, an engineer monitors the flow of process materials at a chemical manufacturing plant. On his screen, the engineer watches a valve switch from open to closed. He’s confused. It’s not supposed to close—not on its own. The plant is under cyber attack, and, as the engineer soon learns, the closing valve is just the first failure.

Organizations frequently (and appropriately) spend a lot of time and effort on the technical aspects of operations. But the catastrophe about to unfold was caused just as much by weaknesses in plans and procedures. In this blog post, I’ll walk through the technical vulnerabilities—and the perhaps more surprising process maturity vulnerabilities—that led to the disaster, talk about why they’re so important for any organization, and suggest some tried-and-true mitigations.

A Bad Day at the Chemical Plant

In the control room of the chemical plant, the engineer quickly investigates the unexpected closure of the valve. As he watches the screen, other valves close and a pump stops. The engineer knows he didn’t make these changes, and his heart starts pounding a little faster. Suddenly, chemical-spill alarms blare in the distance, and others on the operations team race to determine the cause of the production disruption.

The engineer knows he needs to inform management of the incident so they can quickly deploy a hazmat team, and at the same time he fears something more serious might be happening. As additional chemical production steps begin to fail, the operations team members struggle to respond. They’ve received no reports of problems from elsewhere in the plant. Human nature makes them hesitant to declare an incident, and even if they do, they’re not sure whom they should tell. The operators get a sinking feeling their one training session wasn’t enough.

The operations team would later learn that the plant had been under cyber attack all day. The attackers compromised a third of the assets that controlled chemical production, triggering a spill that shut down all plant operations, required an expensive hazmat team, and led to an unpleasant press release.

Thankfully, this situation was only an exercise, and the chemical spilled was only water. It was all part of U.S. Cybersecurity and Infrastructure Security Agency (CISA) training on real, physical equipment. Members of our SEI team, which specializes in operational resilience of critical infrastructure, played the roles of plant staff. I was an engineer on the operations team and was part of a Blue team of defenders protecting the plant from the Red team of attackers.

Though the scenario was an exercise, I understood the fear that engineers in Ukraine likely felt in 2015 when they saw mouse cursors moving by themselves at an electric utility facility. When I saw those valves close on their own, it was a powerful moment for me, and it was heightened when I learned of other chaos the Red team had caused on the information technology (IT) side of the organization.

So, what happened? The Red team found some vulnerable entry points on the network and established persistence. The Blue team valiantly held back the Red team’s assault until late in the day, but ultimately the Red team achieved their objective. After searching the network and battling with the Blue team, the Red team located a specialized operational technology (OT) asset called a programmable logic controller (PLC) that had direct control of the chemical supply valves and pumps. The Red team directly modified settings on the PLC, causing it to close valves and turn off a pump, ultimately disrupting the flow of chemicals and leading to the spill. With more time, they might have compromised other PLCs to expand the scope of the plant disruption.

Through this exercise, I learned some excellent lessons that could apply to other organizations. The Blue IT team faced common technical vulnerabilities, such as weaknesses in network segmentation and undocumented assets on the network. However, the Blue operations team suffered from crippling vulnerabilities in our plans and procedures. While mitigating technical vulnerabilities should be a priority for any organization, it’s just as important to implement and maintain foundational process maturity concepts.

Process maturity includes key activities, such as documenting your processes, developing policies, and ensuring people receive the training they need. Implementing these foundational practices can help your organization perform consistently and be more resilient in the face of an incident, such as the one described above.

The mitigations and recommendations in the following sections include references to applicable goals and practices from the CERT Resilience Management Model (CERT-RMM), “the foundation for a process improvement approach to operational resilience management.” The CERT-RMM details dozens of goals and practices across 26 process areas such as Communications, Incident Management and Control, and Technology Management. It has been the basis for several cybersecurity and resilience maturity assessments and models, and it explains how the foundations of operational resilience are based on a combination of cybersecurity, business continuity, and IT operations activities. The references to specific CERT-RMM goals and practices below appear in the following format: CERT-RMM process area:goal.practice (for example, TM:SG2.SP1). A reference that stops at the goal, such as CTRL:SG4, covers all of that goal’s practices.

Technical Mitigations

Operational Technology (OT) Network Segmentation

In our exercise, the Red team accessed a PLC in the industrial (OT) segment of the network. This segment was not directly connected to the Internet, so the Red team accessed the PLC via the IT segment. Unfortunately, this IT-OT interconnection wasn’t adequately secured.

Operators of industrial and other business processes that are sensitive to disruption should carefully consider their network architecture and the controls that restrict communications between the IT and OT segments. Many OT organizations, like our chemical plant, need an interconnection between these segments for business functions, such as billing, process reporting, or enterprise resource management. Such organizations should consider the following practices to secure the connection between interconnected IT-OT networks:

  • Identify and document the requirements necessary to build a resilient architecture (CERT-RMM RTSE:SG1).
  • Implement controls to satisfy resilience requirements, such as network segmentation and limiting communications across network interconnections to highly controlled and monitored assets (CERT-RMM TM:SG2.SP1).
  • Regularly test these controls to ensure they satisfy resilience requirements (CERT-RMM CTRL:SG4).
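
As one illustration of the second and third practices above, the following sketch checks observed network flows against an allowlist of approved IT-OT crossings. The subnets, host addresses, and function names are all hypothetical and were not part of the exercise; in practice, firewalls would enforce the policy and a check like this would run against real flow logs.

```python
from ipaddress import ip_address, ip_network

# Hypothetical address plan (assumption for illustration only).
IT_NET = ip_network("10.10.0.0/16")
OT_NET = ip_network("10.20.0.0/16")

# Hypothetical allowlist: the only (source, destination, port) flows
# approved to cross the IT-OT boundary.
ALLOWED_CROSSINGS = {
    ("10.10.5.20", "10.20.1.10", 443),  # reporting server -> OT historian
    ("10.10.5.30", "10.20.1.11", 22),   # jump host -> engineering workstation
}

def crosses_boundary(src: str, dst: str) -> bool:
    """Return True if one endpoint is in IT and the other is in OT."""
    s, d = ip_address(src), ip_address(dst)
    return (s in IT_NET and d in OT_NET) or (s in OT_NET and d in IT_NET)

def find_violations(flows):
    """Return boundary-crossing flows that are not on the allowlist."""
    return [
        flow for flow in flows
        if crosses_boundary(flow[0], flow[1]) and flow not in ALLOWED_CROSSINGS
    ]

# Made-up sample of observed flows.
observed = [
    ("10.10.5.20", "10.20.1.10", 443),  # approved crossing
    ("10.10.7.99", "10.20.3.5", 502),   # IT host reaching an OT device on the Modbus/TCP port
    ("10.10.1.1", "10.10.1.2", 80),     # stays inside the IT segment
]

for flow in find_violations(observed):
    print("Unapproved IT-OT flow:", flow)
```

Running a check like this on a schedule is one way to test that segmentation controls still match the documented requirements.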

Industrial organizations might consider resources, such as the Securing Energy Infrastructure Executive Task Force’s recently released guidance on reference architectures that are based on foundational Purdue Model concepts.

Know Your Assets

Our exercise intentionally gave the Blue team an uphill battle. One of the Blue team’s first activities was determining the assets that were in the environment. Regardless of whether your organization operates OT assets, having a thorough understanding of your assets is a foundational activity for managing cyber risk:

  • Document assets in an asset inventory; be sure to consider people, information, and facilities in addition to your technology assets (CERT-RMM ADM:SG1.SP1).
  • Regularly perform asset discovery to identify any rogue assets connected to your network. While these assets may not be malicious, they do represent blind spots for security teams that are working to mitigate known vulnerabilities.
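
A minimal sketch of the discovery step, assuming the inventory and the scan results are both keyed by MAC address. The device names, addresses, and the reconcile function are hypothetical; the point is only that comparing the two sets surfaces both undocumented assets and documented assets that did not answer.

```python
# Hypothetical asset inventory keyed by MAC address.
inventory = {
    "00:1a:2b:3c:4d:01": {"name": "plc-valves-01", "owner": "operations"},
    "00:1a:2b:3c:4d:02": {"name": "hmi-control-room", "owner": "operations"},
    "00:1a:2b:3c:4d:03": {"name": "historian", "owner": "it"},
}

# Hypothetical output of a discovery scan (MAC address -> observed IP).
discovered = {
    "00:1a:2b:3c:4d:01": "10.20.3.5",
    "00:1a:2b:3c:4d:02": "10.20.2.7",
    "de:ad:be:ef:00:99": "10.20.3.44",
}

def reconcile(inventory, discovered):
    """Compare a discovery scan with the documented inventory."""
    known, seen = set(inventory), set(discovered)
    return {
        "rogue": sorted(seen - known),    # on the network but undocumented
        "missing": sorted(known - seen),  # documented but not seen in this scan
    }

report = reconcile(inventory, discovered)
print(report)
```

A rogue entry is a prompt for investigation rather than proof of compromise, and a missing entry may simply be a device that was powered off during the scan.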

A recent binding operational directive from CISA directs federal agencies to consistently maintain their asset inventories and identify software vulnerabilities.

Process Maturity Mitigations

Communications

Our operations team was largely unaware of the IT network incidents. The IT Blue team was working hard to understand and address its issues, but it didn’t immediately inform the operations team of what was happening. Of course, we suspected the Red team was behind the unusual activity on our screens. We were doing a cybersecurity exercise, after all. In the real world, personnel may dismiss unusual activity if they’re not properly briefed and trained on how to interpret and respond to it. Consider taking the time to plan for effective communications with stakeholders across the organization:

  • Identify and document the requirements for resilient communications (CERT-RMM COMM:SG1).
  • Establish and maintain a resilient communication infrastructure. It may consist of varied methods of communication based on urgency of messages or scope of recipients (CERT-RMM COMM:SG2.SP2).
  • Security teams may consider communicating the cybersecurity state of assets to other units within the organization. This communication may be accomplished through dashboards or other means that notify staff if they should be on high alert.
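
The last two practices can be sketched together: a unit’s alert level is derived from the security state of the assets it depends on, and more urgent levels reach staff through more direct channels. The alert levels, channel names, and asset names below are assumptions made for illustration, not something prescribed by the CERT-RMM.

```python
from enum import IntEnum

class Alert(IntEnum):
    NORMAL = 0
    ELEVATED = 1
    HIGH = 2

# Hypothetical channel plan: higher urgency adds more direct channels.
CHANNELS = {
    Alert.NORMAL: ["dashboard"],
    Alert.ELEVATED: ["dashboard", "email"],
    Alert.HIGH: ["dashboard", "email", "control-room phone"],
}

def unit_alert(asset_states):
    """The overall level is the worst state among the assets reported."""
    return max(asset_states.values(), default=Alert.NORMAL)

def notify(unit, asset_states):
    """Build one message per channel required at the current alert level."""
    level = unit_alert(asset_states)
    return [f"[{level.name}] {unit} via {channel}" for channel in CHANNELS[level]]

# Hypothetical states published by the IT security team.
it_states = {"mail-server": Alert.NORMAL, "jump-host": Alert.HIGH}

for message in notify("operations", it_states):
    print(message)
```

In the exercise, a notification of this kind would have told the operations team to treat unexpected valve movements as suspicious well before the spill.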

Roles and Responsibilities

Some individuals in the exercise filled management roles and were responsible for oversight tasks, such as approving change requests and determining appropriate incident response actions. However, the operations team had only individuals who were responsible for chemical production steps, and we lacked a role that provided that oversight. When we became the target of the Red team, we scrambled to respond because we had not planned who would work with management if we determined an incident had occurred. Assigning individuals to roles, making them aware of their responsibilities, and ensuring those responsibilities are appropriately captured in job descriptions are essential for resilient operations of any business:

  • Assign someone to the roles defined in the incident management plan (CERT-RMM IMC:SG1.SP2), such as personnel responsible for analyzing detected events to determine if they meet defined incident declaration criteria.
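
A simple check along these lines, run whenever staffing changes, can flag plan roles that nobody holds. The role names and assignees below are hypothetical placeholders rather than roles taken from the CERT-RMM.

```python
# Hypothetical roles defined in an incident management plan.
PLAN_ROLES = ["event analyst", "incident declarer", "management liaison"]

# Hypothetical current assignments; None marks a role nobody holds.
assignments = {
    "event analyst": "engineer-1",
    "incident declarer": None,
}

def unassigned_roles(roles, assignments):
    """Return plan roles that have no person assigned."""
    return [role for role in roles if not assignments.get(role)]

gaps = unassigned_roles(PLAN_ROLES, assignments)
print("Unassigned roles:", gaps)
```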

Policies and Procedures

While the Blue team developed effective processes to mitigate the impact of the Red team, it did so in an ad hoc manner. The CERT-RMM has a generic goal (one that spans process areas) called “Institutionalize a Managed Process.” One of its practices states, “Objectively evaluating [process] adherence is especially important during times of stress (such as during incident response) to ensure that the organization is relying on processes and not reverting to ad hoc practices that require people and technology as their basis.” Stated another way, the process needs to outlive the people and technology.

When the organization in this scenario was under great pressure, the operations team knew they had to act but stumbled when determining the correct course of action. Was the activity we observed on the screen an incident? Who should report the incident? A more prepared organization would have done the following:

  • Define event detection methods, assign responsibility for detection, and document a process to report events (CERT-RMM IMC:SG2.SP1).
  • Perform analysis of detected events to determine if they meet documented incident criteria (CERT-RMM IMC:SG2.SP4) and declare an incident if event activity meets the criteria threshold (CERT-RMM IMC:SG3.SP1).
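
To make “documented incident criteria” concrete, here is a sketch in which the criterion is a simple count of control changes that no operator initiated. The Event fields, the threshold of two, and the asset names are assumptions for illustration; real criteria would come from the organization’s own incident management plan.

```python
from dataclasses import dataclass

@dataclass
class Event:
    source: str               # asset that reported the change
    kind: str                 # type of event, e.g. "state_change"
    operator_initiated: bool  # True if an operator made the change

# Hypothetical documented criterion: declare an incident once this many
# unexplained control changes are seen in one analysis window.
DECLARE_THRESHOLD = 2

def unexplained_changes(events):
    """Keep only state changes that no operator initiated."""
    return [
        e for e in events
        if e.kind == "state_change" and not e.operator_initiated
    ]

def should_declare(events, threshold=DECLARE_THRESHOLD):
    """Apply the documented criterion to the analyzed events."""
    return len(unexplained_changes(events)) >= threshold

# Made-up events resembling what the operations team saw on screen.
events = [
    Event("valve-1", "state_change", False),
    Event("valve-2", "state_change", False),
    Event("pump-1", "state_change", False),
    Event("valve-3", "state_change", True),  # a legitimate operator action
]

if should_declare(events):
    print("Criteria met: declare an incident and notify the assigned role.")
```

Writing the threshold down ahead of time answers the question “Was this an incident?” before anyone is under pressure.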

Exercise and Training

In our exercise, the operations team completed only brief training on how to operate the industrial process and perform simple procedures like filling out forms to request a change. Organizations should periodically perform exercises for key activities to ensure they’re performed consistently, both during normal operations and in times of stress. Likewise, organizations should identify and provide training that aligns with employee responsibilities, such as incident handling or other technical training. An effective training and awareness program will do the following:

  • Identify and plan necessary training for all individuals who have a role in sustaining operational resilience (CERT-RMM OTA:SG2).
  • Periodically deliver necessary training, track the completion of training, and continually evaluate the effectiveness of training (CERT-RMM OTA:SG4).
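
Tracking completion is the part of these practices that is easiest to automate. The sketch below lists staff whose required training is missing or past its refresh interval. The courses, intervals, staff identifiers, and dates are all hypothetical, and the sketch does not attempt to measure training effectiveness.

```python
from datetime import date, timedelta

# Hypothetical requirements: role -> list of (course, refresh interval).
REQUIRED = {
    "operator": [("incident reporting", timedelta(days=365))],
    "engineer": [
        ("incident reporting", timedelta(days=365)),
        ("plc change control", timedelta(days=180)),
    ],
}

# Hypothetical staff roster and completion records.
staff = {"engineer-1": "engineer", "operator-1": "operator", "operator-2": "operator"}
completions = {
    ("engineer-1", "incident reporting"): date(2024, 1, 15),
    ("engineer-1", "plc change control"): date(2023, 6, 1),
    ("operator-1", "incident reporting"): date(2024, 3, 1),
}

def overdue(staff, completions, today):
    """Return (person, course) pairs that are missing or past refresh."""
    gaps = []
    for person, role in sorted(staff.items()):
        for course, interval in REQUIRED[role]:
            done = completions.get((person, course))
            if done is None or today - done > interval:
                gaps.append((person, course))
    return gaps

gaps = overdue(staff, completions, today=date(2024, 6, 1))
for person, course in gaps:
    print(f"{person} needs: {course}")
```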

Formalizing Cybersecurity

Dedicating the necessary resources to appropriately plan and document cybersecurity activities can help organizations achieve their desired level of operational resilience. Moreover, organizations should consider establishing and maintaining a cybersecurity program that, ideally, oversees the security of both IT and OT assets. At a minimum, organizations should build bridges to increase collaboration, clarity, and accountability across staff responsible for IT and OT security. Organizations may be able to reduce blind spots in both security controls and organizational processes by encouraging or mandating communication between these teams.

To perform the cybersecurity activities that keep the organization safe and productive, organizational leadership and those who manage individual business units must work in concert. Building a strong process maturity foundation that supports these cybersecurity activities should be a priority for critical infrastructure operators to mitigate the increasing threat of cyber attacks.