
Making Your Operation More Cyber Resilient

In last quarter's article, I discussed the process for established IT Security teams to expand their influence into OT. This quarter I want to give some practical guidance for building a more cyber-resilient operation that applies to all operations, particularly smaller operators who might not have a full in-house cyber security team.

Ultimately, accountability for maintaining a safe, sustainable, and resilient operation must rest with the operation itself. The good news is you don't need a huge team, a massive budget or special skills to make a measurable impact on your operation's cyber resilience. The better news is that these steps toward cyber resiliency will help make your operation more resilient overall and less susceptible to a wide range of IT/Technology failures.

I am a firm believer that, while significant parts of OT security can and should be centralized, effective OT security is a team effort, with actions that can be taken at the site and corporate levels. In this article, I hope to give some simple steps that can be executed at the site, alone or in cooperation with a corporate security team, regardless of the operation's size or maturity level, the skill level of its IT or OT resources, or its budget allocation.

Identification and Inventory

Before you can start to make your operation more resilient, you need to know what you are working with. It sounds simple, but it is very difficult to protect assets you don't know about!

Asset Inventory

The first step is constructing a reliable asset inventory. All technology assets that run your operation should be cataloged in some system of record. Maintenance systems are often good choices for this, but SharePoint lists and even Excel spreadsheets have worked well for some smaller operations. Try to capture as much information as possible for each device on your Operational Technology network (including PCs/Servers, Network Equipment, PLCs, MCUs - everything with a network connection). Critical information to capture includes:

  • Manufacturer
  • Model
  • Serial
  • IP address/MAC address
  • Location at the site
  • Firmware Version (if appliance/device)
  • OS version
  • Critical Software (PCs/Servers/Mobile Devices)
  • Asset Owner
  • Support Contract Information
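
If your inventory lives in a spreadsheet, it can help to think of each row as a structured record so that gaps are easy to spot. The following is a minimal Python sketch, not a prescribed tool. The Asset class, its field names, the choice of required fields and the sample values are all assumptions made for illustration; they simply mirror the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Asset:
    """One row of the asset inventory (field names are illustrative)."""
    manufacturer: Optional[str] = None
    model: Optional[str] = None
    serial: Optional[str] = None
    ip_address: Optional[str] = None
    mac_address: Optional[str] = None
    location: Optional[str] = None
    firmware_version: Optional[str] = None   # appliances/devices
    os_version: Optional[str] = None         # PCs/servers
    critical_software: Optional[str] = None  # PCs/servers/mobile devices
    owner: Optional[str] = None
    support_contract: Optional[str] = None


def missing_fields(asset: Asset, required: List[str]) -> List[str]:
    """Return the required fields that are still blank for this asset."""
    return [name for name in required if not getattr(asset, name)]


# Hypothetical record for a PLC; every value here is made up.
plc = Asset(manufacturer="ExampleCo", model="PLC-100", serial="SN123",
            ip_address="10.0.0.15", location="Crusher MCC room")

# Which fields count as "required" is a site decision (assumed here).
REQUIRED = ["manufacturer", "model", "serial", "ip_address",
            "mac_address", "location", "owner"]

gaps = missing_fields(plc, REQUIRED)
print(gaps)
```

A check like this can be run over every row to produce a to-do list for the next site walk.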

 

This inventory will be critical to the next steps in this article. How can you get this inventory? To get started, you may have pieces of it spread across existing systems and documentation already. You can consult your networking team for a list of IPs/MAC addresses currently in use. It is also a valuable exercise to walk/drive the site to confirm those lists, as well as identify equipment that may be on other networks such as cell systems.
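
Comparing the networking team's list against your own inventory is easy to automate once both are exported. Here is a rough Python sketch of that comparison, assuming both lists are available as MAC addresses. The function names and the sample addresses are hypothetical, and the different separator styles are only there to show why normalizing first is worthwhile.

```python
def normalize_mac(mac: str) -> str:
    """Reduce a MAC address to lowercase hex digits with no separators."""
    return "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")


def reconcile(inventory_macs, network_macs):
    """Compare the inventory against what the network team actually sees."""
    inv = {normalize_mac(m) for m in inventory_macs}
    net = {normalize_mac(m) for m in network_macs}
    return {
        # Devices talking on the network that nobody has cataloged yet.
        "on_network_not_in_inventory": sorted(net - inv),
        # Cataloged devices not seen: powered off, removed, or on another
        # network (a cell system, for example).
        "in_inventory_not_on_network": sorted(inv - net),
    }


# Made-up sample data in different notations.
inventory = ["00:1A:2B:3C:4D:5E", "00-1A-2B-3C-4D-5F"]
network = ["001a.2b3c.4d5e", "00:1a:2b:3c:4d:60"]

result = reconcile(inventory, network)
print(result)
```

Both output lists are worth chasing down on the site walk: the first for unknown devices, the second for equipment that has moved or been retired.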

Mapping Assets to Process

Once you have your inventory, map that inventory to your critical processes. For example, crushing/grinding may involve a set of PLCs and MCUs, an assortment of sensors and an HMI or two. Those are the obvious components, but don't forget the network switch(es) that all those components are connected to, as well as any other critical services, such as Active Directory if you use AD accounts to log into those HMIs.

In addition to those onsite systems, identify corporate and/or cloud systems that you rely on for critical production activities. For example, do you rely on an ERP such as SAP to order parts and other consumables to support your maintenance activities? How is fuel for your mobile equipment or reagents, concentrate or any other process input ordered and received?
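
One useful by-product of this mapping is a list of assets that more than one process depends on, since those are the ones whose failure hurts most. A small Python sketch of the idea follows. The process names and asset names are invented for illustration, and a plain dictionary stands in for whatever system of record you actually use.

```python
from collections import defaultdict

# Hypothetical mapping of critical processes to the assets they need.
process_map = {
    "crushing_grinding": ["plc-crusher-01", "hmi-crusher-01",
                          "switch-mill-01", "active-directory"],
    "flotation": ["plc-float-01", "hmi-float-01",
                  "switch-mill-01", "active-directory"],
    "maintenance_ordering": ["erp"],
}


def processes_by_asset(mapping):
    """Invert the map: for each asset, which processes rely on it?"""
    reverse = defaultdict(list)
    for process, assets in mapping.items():
        for asset in assets:
            reverse[asset].append(process)
    return dict(reverse)


def shared_dependencies(mapping):
    """Assets that more than one process depends on."""
    return {asset: sorted(procs)
            for asset, procs in processes_by_asset(mapping).items()
            if len(procs) > 1}


shared = shared_dependencies(process_map)
print(shared)
```

In this made-up example the shared switch and Active Directory surface immediately, which is exactly the kind of less obvious dependency the mapping exercise is meant to expose.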

Identify and Map all Connections into the OT environment and their Use Case(s)

Many mining operations, out of necessity, generally have remote connections to their OT environments. Typically, the more remote the operation, the more remote connections we see. These connections exist for a variety of reasons: vendor support, the ability for employees to provide support from home/offsite locations and a myriad of other very good reasons. We saw the number and variety of these connections explode during COVID when many employees had to work remotely. For each connection, identify:

  • Type of connection
  • The direction of connection (is the connection initiated from inside the OT environment or from outside?)
  • Hardware/Software connection endpoint (what is the hardware/software connected to and where is it in the network)
  • Authentication method
  • List of users permitted to connect
  • Frequency/schedule of use

 

Finally, identify and document the connection between the OT environment and the corporate network. What are the points of connection? What are the firewall(s) between those environments? Generate a report of the firewall rules for what is permitted to cross that boundary.

Keeping it up-to-date

Once you've spent the time building your asset inventory, you'll want to take a little extra time to ensure you have processes in place to keep it up to date. This is where connecting your inventory to (or keeping it in) your maintenance systems can sometimes be very valuable. Identify the teams that will make changes to your environment (IT/OT, electrical, instrumentation, etc.) and ensure their processes will keep the inventory up to date. For larger, more complex operations, there is technology that monitors traffic on your OT network and alerts you to changes. That is a valuable investment for larger or frequently changing environments, but likely overkill for most smaller mining operations. A simple, well-defined set of processes and procedures should solve 80% of the problem here.

Steps to Resiliency

Now that you have your asset inventory, the next set of steps is all about building some cyber resiliency into your operation. The goals here are twofold.

  1. Reduce the likelihood that you will be affected by commodity malware and attacks.
  2. If/when an incident occurs, make it easier to maintain or quickly recover operations.

 

This is not intended to be a be-all and end-all list of everything necessary to make a secure, resilient operation. It is, however, a good start to knocking off the low-hanging fruit. All progress is progress.

1. Add Redundancy for "Can't Fail" Systems

Once you have your asset inventory, identify the systems that simply can't go down, can't be rebooted, and can't fail. These systems are problematic for a couple of reasons:

Murphy just loves them - if a power failure is going to happen, it will happen to these systems; if a drive can crash, it will be the one in that system. A redundant approach allows one system to fail, spectacularly even, while operations continue on the second system.

If it can't go down, it can't be patched - Effective, regular patching is a critical step to ensuring the ongoing safety of systems. The biggest obstacle to patching is systems that cannot go down until an annual (or longer) maintenance window. These systems are the enemy of not just cyber security, but operational redundancy. Engineer these single points of failure out through the deployment of hot standby systems, parallel redundancy or some other strategy that works for you.

2. Effective Backup and Restore

For each item in your asset list, identify and document the backup from both a hardware and software perspective.

A good backup and restore capability, covering both hardware and software, will be the key to protecting you from all manner of outages. If a system is unavailable, it is not terribly relevant whether the cause is a power surge that let the magic smoke out of a device or a ransomware attack. The system is unavailable and operations are affected. With effective backup and restore capabilities at the operation level, you can get yourself back to production quickly.

PLCs, IoT and other OT devices: Have you got a backup of the current configuration or programming? Where is it, and is there a process to ensure that the backup is current? For the hardware itself, do you have a stock of spares? If not, how long will it take your supplier to get you an RMA unit? Have you validated that assumption? If those hardware replacement times are not acceptable to you, consider a small stock of spare devices to allow you to swap out failed devices.
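
The spares question boils down to a simple comparison: is the supplier's replacement time longer than the outage you can live with? A tiny Python sketch of that check is below. The function name, the device names and all of the day counts are hypothetical; plug in lead times you have actually validated with your supplier.

```python
def spare_recommended(supplier_lead_days: int,
                      acceptable_outage_days: int,
                      spares_on_shelf: int) -> bool:
    """True when no spare is stocked and the supplier's replacement time
    is longer than the outage the operation can accept."""
    return spares_on_shelf == 0 and supplier_lead_days > acceptable_outage_days


# Made-up figures: (supplier lead time, acceptable outage, spares on shelf)
devices = {
    "plc-crusher-01": (21, 2, 0),
    "io-module-07": (3, 5, 0),
    "hmi-panel-02": (14, 1, 1),
}

to_stock = sorted(name for name, args in devices.items()
                  if spare_recommended(*args))
print(to_stock)
```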

Network Equipment: The same goes for your core network equipment - Do you have a backup of the current configuration? Where is it? Is it kept up to date with all changes? For the hardware, critical network equipment (firewalls, switches, etc.) should be configured for redundancy. All devices used in OT should at minimum have a redundant power supply (the most common component to fail) and ideally should be set up in redundant pairs. If not, consider having a spare device on the shelf and ready to replace any failed devices.

Servers/Computers: Ensure you have proper backups to be able to restore not only the data but the applications and operating system as well. This is a place where virtualization can help, but regardless of your solution, you need to be able to restore the complete system, not just the data that resides on it.

Testing: Once you have identified the backup from both a hardware and software perspective, implement a testing process. At least quarterly, test your ability to restore something from each class of device. Pull a spare PLC off the shelf and restore a configuration to it. Test the redundancy of your network devices, and restore a system and configuration for an HMI.
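
Keeping track of when each class of device last had a successful restore test does not need anything fancy. As a rough Python sketch, assuming you record one date per device class: the class names, the dates and the 92-day stand-in for "quarterly" are all assumptions for illustration.

```python
from datetime import date, timedelta

# "At least quarterly", approximated here as 92 days (an assumption).
QUARTER = timedelta(days=92)

# Hypothetical log: device class -> date of last successful restore test.
last_restore_test = {
    "plc": date(2024, 1, 10),
    "network": date(2024, 5, 2),
    "hmi": None,  # never tested
}


def overdue(tests, today, max_age=QUARTER):
    """Device classes never tested, or not tested within max_age."""
    return sorted(cls for cls, tested in tests.items()
                  if tested is None or today - tested > max_age)


due = overdue(last_restore_test, today=date(2024, 6, 1))
print(due)
```

A class that has never been tested is treated as overdue, which is deliberate: an untested backup is an assumption, not a capability.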

3. Assess External Services

Your operation likely depends on a number of services supplied and managed externally to your operation. Some, such as ERP and email, might be provided by a corporate group; others may be third-party services, hosted inside your network or in the cloud.

From the perspective of the operation, you will want to ensure you have a Service Level Agreement (SLA) with the group that maintains that system. Ensure your expectations are aligned for things like uptime, scheduled maintenance windows and the like. However, an SLA will be very cold comfort when something unexpected goes wrong. From the operating entity view, you must ensure you have a plan to continue operations should those services be unexpectedly unavailable. These Business Continuity Plans (BCPs) are critical to ensuring that you can continue some level of operation should unexpected events happen outside your control. Some of the areas an operation needs to focus on when it comes to external services:

Authentication: Many operations rely on authentication services such as Microsoft Active Directory to log in and authenticate to OT systems such as HMIs, Servers and other IT-like systems. Should those authentication services not be available, it could be difficult to log in. Local accounts or "break glass" accounts can be configured to allow access to some critical services to preserve operations should authentication not be available.

ERP: Whether cloud or on-prem, operations rely on ERP systems to order spares and fuel, as well as get their product to market. ERP systems, unfortunately, are also some of the most challenging to restore and rebuild should things go sideways. At the site level, make sure you can execute your critical functions in some way, shape or form without the ERP.

Communications: If your site relies on IP-based telephony for communications, it is probably in your best interest to ensure you still have a few old-fashioned copper phone lines in critical areas to enable communications during an outage. The same goes for Radio over IP, or any other IP-based communication. Have a fallback to enable safe production to continue.

4. Implement Preventative Maintenance

Like the rest of your site, your OT systems need regular PM. Ensure that preventative maintenance is a part of your program. Like all preventative maintenance activities, these need time, processes and resources assigned to them to ensure they get done.

Patch systems - Vulnerabilities are discovered in all systems from time to time. Ensure you have a schedule that allows for regular (at least monthly) patching of all your operating systems and applications at the site. There is a risk that a patch could cause issues, so refer to your vendors for guidance, and make sure you patch from low-risk systems to high-risk systems. In addition to your regular patching process, have an exception process for any high-risk patches that need to be done outside the preventative maintenance window. (Treat this as a break/fix issue.)
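
Patching from low-risk to high-risk systems is just an ordering problem once each system carries a risk rating. A minimal Python sketch follows. The three-level rating scale, the system names and the ratings themselves are assumptions; your own ratings should come from your process mapping and vendor guidance.

```python
# Hypothetical rating scale: lower number = patch earlier.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

# Made-up systems and ratings.
systems = [
    {"name": "hmi-crusher-01", "risk": "high"},
    {"name": "eng-workstation-02", "risk": "low"},
    {"name": "historian-01", "risk": "medium"},
]


def patch_order(systems):
    """Return system names ordered from lowest to highest operational risk,
    so problems with a patch show up where they hurt least."""
    ranked = sorted(systems, key=lambda s: RISK_ORDER[s["risk"]])
    return [s["name"] for s in ranked]


order = patch_order(systems)
print(order)
```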

Know your support timelines - Hardware and software systems all come with a defined support period. This will likely be a different length of time from the warranty period on the device. When a device or system is purchased and deployed, there must be a plan to replace it before the end-of-support/end-of-life timeframe. Software systems that are beyond end-of-support will no longer receive patches and updates from the manufacturer, while hardware beyond end-of-life will additionally see failure rates begin to increase rapidly due to aging components. Waiting to upgrade until well past end-of-life also complicates the upgrade activity, as the manufacturer may not have an easy or supported upgrade path from very old systems to current versions.
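
If your asset inventory records an end-of-support date for each system, flagging the ones that need a replacement plan is straightforward. Here is a rough Python sketch, assuming a one-year planning horizon. The horizon, the asset names and every date are hypothetical; real dates have to come from the manufacturer.

```python
from datetime import date


def support_status(end_of_support: date, today: date,
                   planning_days: int = 365) -> str:
    """Classify an asset by how close it is to end-of-support.
    planning_days is an assumed lead time for budgeting a replacement."""
    if end_of_support < today:
        return "past end-of-support"
    if (end_of_support - today).days <= planning_days:
        return "plan replacement now"
    return "supported"


today = date(2024, 6, 1)

# Made-up end-of-support dates for illustration only.
assets = {
    "hmi-os": date(2023, 10, 10),
    "core-switch": date(2025, 1, 31),
    "plc-firmware": date(2029, 12, 31),
}

report = {name: support_status(eos, today) for name, eos in assets.items()}
print(report)
```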

5. Simplify External Connections

Have a look at the external connections you identified earlier. There is a chance you have a collection of external connections in and out of your environment. By simplifying those connections, you have the opportunity to both make your operation easier to understand and maintain, as well as reduce your attack surface.

  1. Can any connections be combined? - We have seen some operations where virtually every contractor had their own method for connecting to the OT environment. This creates an expensive, complicated and difficult-to-secure mess of connections. There is most likely a possibility of combining a number of those connections into a single effective remote access solution. You will need to work with your networking team to do this, but the results can be very effective.
  2. What are the usage patterns? If a connection is only used occasionally, such as during a failure or support incident, and it can't be combined with other methods, can it be disabled when not in use?
  3. What direction is the traffic? If the connection is for outgoing telemetry only, you can reduce the possibility of that connection being misused by putting it through a data diode or unidirectional gateway device.
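
The three questions above can be run as a quick triage over the connection list you documented earlier. The Python sketch below is one way to do that, not a definitive method. The Connection fields, the "two uses a month" threshold for "occasional" and the sample connections are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Connection:
    """A documented connection into or out of OT (fields are illustrative)."""
    name: str
    kind: str                      # e.g. "vendor VPN", "telemetry feed"
    outbound_telemetry_only: bool  # direction and purpose of the traffic
    uses_per_month: int            # rough frequency of use


def review(conn: Connection, occasional_threshold: int = 2) -> List[str]:
    """Suggest simplification options for one connection."""
    suggestions = []
    if conn.outbound_telemetry_only:
        suggestions.append("consider a data diode or unidirectional gateway")
    else:
        suggestions.append("candidate to combine into shared remote access")
    if conn.uses_per_month <= occasional_threshold:
        suggestions.append("disable when not in use")
    return suggestions


# Made-up connections.
vendor = Connection("vendor-a-vpn", "vendor VPN", False, 1)
feed = Connection("historian-export", "telemetry feed", True, 30)

print(review(vendor))
print(review(feed))
```

The output is only a list of prompts for a conversation with your networking team; the decisions themselves still need a human who knows the site.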

 

What are the attributes of a good remote access solution? Glad you asked. Effective remote access solutions for OT have the following attributes:

  • Multi-Factor Authentication - Remote access to OT must take advantage of phishing-resistant multi-factor authentication. Roger Grimes has published a great article on good MFA here - give it a read.
  • Be monitored - OT asset owners must have the ability to monitor (in real-time and at a later date) what the remote user, particularly vendors, did while they were connected.
  • Restrict access to required systems - Your OT remote access solution must restrict each user to only the OT resources needed to do their job. Nobody should get free rein remotely. If you need access to everything, see you at site.

 

Conclusion

Regardless of the size of the operation, budget, staff skill sets or other maturity measures, the recommendations in this article will go a long way to ensuring your operation maintains resilience as the dependency on systems increases.

An effective asset inventory will give you the basis to work with. It will make sure the entire organization is informed about what systems and assets are critical to safe production. The initial steps to cyber resilience will serve as a starting point for building an operation that can effectively withstand and recover from an incident.

This is a starting point, and much more can be done to secure OT infrastructure. Effective security requires a good working partnership between IT and OT. However, unless the basics are covered at both the IT and OT levels, those more advanced steps will be more difficult to implement.
