Developing an effective Risk Management Plan can help keep small issues from developing into emergencies. Different types of Risk Management Plans can deal with calculating the probability of an event, and how that event might impact you, what the risks are with certain ventures and how to mitigate the problems associated with those risks. Having a plan may help you deal with adverse situations when they arise and, hopefully, head them off before they arise.
1- Understand how Risk Management works. Risk is the effect (positive or negative) of an event or series of events that take place in one or several locations. It is computed from the probability of the event becoming an issue and the impact it would have (See Risk = Probability X Impact). Various factors should be identified in order to analyze risk, including:
Event: What could happen?
Probability: How likely is it to happen?
Impact: How bad will it be if it happens?
Mitigation: How can you reduce the Probability (and by how much)?
Contingency: How can you reduce the Impact (and by how much)?
Reduction = Mitigation X Contingency
Exposure = Risk – Reduction
After you identify the above, the result will be what’s called Exposure. This is the amount of risk you simply can’t avoid. Exposure may also be referred to as Threat, Liability or Severity, but they pretty much mean the same thing. It will be used to help determine if the planned activity should take place.
This is often a simple cost vs. benefits formula. You might use these elements to determine if the risk of implementing the change is higher or lower than the risk of not implementing the change.
Assumed Risk. If you decide to proceed (sometimes there is no choice, e.g. federally mandated changes) then your Exposure becomes what is known as Assumed Risk. In some environments, Assumed Risk is reduced to a dollar value which is then used to calculate the profitability of the end product.
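The formulas in this step can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures are assumptions made for the example, and clamping Exposure at zero is a choice of this sketch rather than part of the formulas above.

```python
def risk(probability: float, impact: float) -> float:
    """Risk = Probability x Impact, both on a 0.00 to 1.00 scale."""
    return probability * impact


def exposure(probability: float, impact: float,
             mitigation: float, contingency: float) -> float:
    """Exposure = Risk - Reduction, where Reduction = Mitigation x Contingency."""
    reduction = mitigation * contingency
    # Assumption of this sketch: exposure is never reported as negative.
    return max(0.0, risk(probability, impact) - reduction)


# Hypothetical figures for a single risk element.
print(round(risk(0.6, 0.5), 2))
print(round(exposure(0.6, 0.5, 0.4, 0.25), 2))
```

Whatever Exposure remains after this calculation is what becomes your Assumed Risk if you decide to proceed.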
2- Define your project. In this article, let’s pretend you are responsible for a computer system that provides important (but not life-critical) information to some large population. The main computer on which this system resides is old and needs to be replaced. Your task is to develop a Risk Management Plan for the migration. This will be a simplified model where Probability and Impact are listed as High, Medium or Low (a very common approach, especially in Project Management).
3- Get input from others. Brainstorm on risks. Get several people together that are familiar with the project and ask for input on what could happen, how to help prevent it, and what to do if it does happen. Take a lot of notes! You will use the output of this very important session several times during the following steps. Try to keep an open mind about ideas. “Out of the box” thinking is good, but do keep control of the session. It needs to stay focused and on target.
4- Identify the consequences of each risk. From your brainstorming session, you gathered information about what would happen if risks materialized. Associate each risk with the consequences arrived at during that session. Be as specific as possible with each one. “Project Delay” is not as desirable as “Project will be delayed by 13 days.” If there is a dollar value, list it; just saying “Over Budget” is too general.
5- Eliminate irrelevant issues. If you’re moving, for example, a car dealership’s computer system, then threats such as nuclear war, plague pandemic or killer asteroids would certainly disrupt the project, but there’s nothing you can do to plan for them or to lessen the impact. You might keep them in mind, but don’t put that kind of thing on your risk plan.
6- List all identified risk elements. You don’t need to put them in any order just yet. Just list them one-by-one.
7- Assign probability. For each risk element on your list, determine if the likelihood of it actually materializing is High, Medium or Low. If you absolutely have to use numbers, then figure Probability on a scale from 0.00 to 1.00. 0.01 to 0.33 = Low, 0.34 to 0.66 = Medium, 0.67 to 1.00 = High.
Note: If the probability of an event occurring is zero, then it will be removed from consideration. There’s no reason to consider things that simply cannot happen (enraged T-Rex eats the computer).
8- Assign impact. In general, assign Impact as High, Medium or Low based on some pre-established guidelines. If you absolutely have to use numbers, then figure Impact on a scale from 0.00 to 1.00 as follows: 0.01 to 0.33 = Low, 0.34 to 0.66 = Medium, 0.67 to 1.00 = High.
Note: If the impact of an event is zero, it should not be listed. There’s no reason to consider things that are irrelevant, regardless of the probability (my dog ate dinner).
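If you do use numbers, the bands from steps 7 and 8 can be turned into a small helper. This is a minimal sketch: the function name is made up for the example, and it assumes that values falling between the published bands (for instance 0.335) round up into the next band.

```python
from typing import Optional


def rating(value: float) -> Optional[str]:
    """Map a 0.00 to 1.00 Probability or Impact score to Low, Medium or High.

    Returns None for zero, because zero-probability and zero-impact
    items are dropped from the list.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0.00 and 1.00")
    if value == 0.0:
        return None
    if value <= 0.33:
        return "Low"
    if value <= 0.66:
        return "Medium"
    return "High"


for score in (0.0, 0.2, 0.5, 0.8):
    print(score, rating(score))
```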
9- Determine risk for the element. Often, a table is used for this. If you have used the Low, Medium and High values for Probability and Impact, the top table is most useful. If you have used numeric values, you will need to consider a bit more complex rating system similar to the second table here. It is important to note that there is no universal formula for combining Probability and Impact; that will vary between people and projects. This is only an example (albeit a real-life one):
Be flexible in analysis. Sometimes it may be appropriate to switch back and forth between the L-M-H designations and numeric designations. You might use a table similar to the one below.
10- Rank the risks. List all the elements you have identified from the highest risk to the lowest risk.
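With numeric values, ranking is a simple sort. The sketch below scores each element with Risk = Probability X Impact from step 1, which is only one possible way to combine the two; the class name, the risk descriptions and all of the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RiskElement:
    event: str
    probability: float  # 0.00 to 1.00
    impact: float       # 0.00 to 1.00

    @property
    def risk(self) -> float:
        # Risk = Probability x Impact
        return self.probability * self.impact


def rank(elements):
    """Return the elements ordered from highest risk to lowest."""
    return sorted(elements, key=lambda e: e.risk, reverse=True)


# Hypothetical entries for the migration project.
register = [
    RiskElement("Key staff unavailable", 0.3, 0.3),
    RiskElement("Critical parts arrive late", 0.5, 0.8),
    RiskElement("Data lost during migration", 0.2, 0.9),
]

for element in rank(register):
    print(f"{element.risk:.2f}  {element.event}")
```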
11- Compute the total risk: Here is where numbers will help you. In Table 6, you have 7 risks assigned as H, H, M, M, M, L, and L. This can translate to 0.8, 0.8, 0.5, 0.5, 0.5, 0.2 and 0.2, from Table 5. The average of the total risk is then 0.5 and this translates to Medium.
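The arithmetic in this step can be checked with a short script. It is a sketch only: the names are illustrative, and the H, M and L values are the ones quoted in this step (0.8, 0.5 and 0.2).

```python
from statistics import mean

# Numeric equivalents quoted in step 11.
RATING_VALUE = {"H": 0.8, "M": 0.5, "L": 0.2}


def total_risk(ratings):
    """Average the numeric equivalents of a list of H/M/L ratings."""
    return mean(RATING_VALUE[r] for r in ratings)


initial = ["H", "H", "M", "M", "M", "L", "L"]
print(round(total_risk(initial), 3))
```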
12- Develop mitigation strategies. Mitigation is designed to reduce the probability that a risk will materialize. Normally you will only do this for High and Medium elements. You might want to mitigate low risk items, but certainly address the other ones first. For example, if one of your risk elements is that there could be a delay in delivery of critical parts, you might mitigate the risk by ordering early in the project.
13- Develop contingency plans. Contingency is designed to reduce the impact if a risk does materialize. Again, you will usually only develop contingencies for High and Medium elements. For example, if the critical parts you need do not arrive on time, you might have to use old, existing parts while you’re waiting for the new ones.
14- Analyze the effectiveness of strategies. How much have you reduced the Probability and Impact? Evaluate your Contingency and Mitigation strategies and reassign Effective Ratings to your risks.
15- Compute your effective risk. Now your 7 risks are M, M, M, L, L, L and L, which translate to 0.5, 0.5, 0.5, 0.2, 0.2, 0.2 and 0.2. This gives an average risk of 0.329. Looking at Table 5, we see that the overall risk is now categorized as Low. Originally the Risk was Medium (0.5). After management strategies have been added, your Exposure is Low (0.329). That means you have achieved a 34.2% reduction in Risk through Mitigation and Contingency. Not bad!
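The percentage reduction is the drop in average risk divided by the original average. Here is a minimal sketch of that calculation, with an illustrative function name and the rounded figures from this step as input.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which risk dropped from `before` to `after`."""
    if before <= 0:
        raise ValueError("original risk must be greater than zero")
    return (before - after) / before * 100


original_risk = 0.5     # average before management strategies
effective_risk = 0.329  # rounded average after Mitigation and Contingency
print(round(percent_reduction(original_risk, effective_risk), 1))
```

Note that the result depends slightly on rounding: starting from the unrounded average (2.3 divided by 7) gives a figure closer to 34.3%.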
16- Monitor your risks. Now that you know what your risks are, you need to determine how you’ll know if they materialize so you’ll know when and if you should put your contingencies in place. This is done by identifying Risk Cues. Do this for each one of your High and Medium risk elements. Then, as your project progresses, you will be able to determine if a risk element has become an issue. If you don’t know these cues, it is very possible a risk could silently materialize and affect the project, even if you have good contingencies in place.
Written safety policies do not ensure a culture of safety at a company. Although putting a foundation of safety policies and best practices in writing is essential to a successful safety management system, a collection of policies alone cannot create an environment where employees feel safe and instinctively make safe choices.
Creating a culture of safety takes time and begins with real commitment from all levels of management—not just a Safety First sign as you enter the building or verbal commitment to safety by the CEO or facility manager, but an active commitment that leadership demonstrates every day in the decisions they make and the actions they take. Frontline supervisors set the tone because they have to make quick decisions throughout the day, including corrective action when a hazard is identified. Their first priority is safety.
The decision to make safety a priority must be fully supported, encouraged, and rewarded by managers and executives to consistently reinforce that making safe decisions is most important among all levels of leadership. Taking action to correct unsafe conditions or using a positive approach to coaching team members on safe behavior deepens the internal commitment to following safe-work practices.
Fostering a culture of safety also requires a facility that is clean, organized, and well maintained. It is difficult to expect employees to commit to safe-work practices if the facility in which they work is dirty, disorganized, and in disrepair. Conversely, employees are much more likely to feel that their company values safety, and to stay engaged in working safely, if the environment around them is well lit, orderly, and properly maintained; if signage clearly indicates safety requirements and expectations; and if they are provided with high-quality safety equipment and personal protective equipment when necessary.
Third-Party Evaluation Provides a Benchmark
To begin to understand where a facility is in its journey to provide a safe environment, it makes sense to engage a third party to visit a workplace, provide an unbiased facility safety evaluation, and gauge the safety culture of the organization through safety behavior observation and employee interviews. Too often a company functions under the “we’ve always done it this way” mantra or has become blind to unsafe working conditions or hazards hidden in plain sight.
Safety consultants, insurance representatives, corporate safety teams, Occupational Safety and Health Administration (OSHA) compliance specialists, industry peers, and other companies that participate in OSHA’s Voluntary Protection Program (VPP) are well equipped to conduct a safety audit to provide a company with a benchmark, indicating where the company can improve or where it excels.
A comprehensive safety analysis should not be limited to the facility. It should also include a review of all tasks and processes, giving priority to such high-risk areas as fall protection, lockout/tagout, confined spaces, electrical safety, lifting and rigging, and heavy equipment use, because failures in those areas can have serious consequences. Evaluate the hazards of each task and develop safe solutions to correct them. This means phasing out “we have always done it this way” processes and replacing them with best practices that use hazard analysis as a guide for development of new work processes. It also means implementing solutions to ensure the right equipment is being used for the right job. For example, if a company is striving to prevent falls, which are consistently one of the most frequently cited OSHA violations and a serious hazard in any line of work, an extension ladder or makeshift scaffold may not be the best choice for workers who need to use hand tools to access a motor or filter high above the shop floor.
Employee Engagement is Key
If a company wants to have success in safety, it is critical that it actively engage its employees to ensure that a strong safety culture can survive and grow. This means sharing the vision for safety and inviting and encouraging each employee to participate in shaping and achieving that vision.
Employee engagement can be accomplished in a number of ways, beginning with the establishment of employee-led safety committees and inviting all team members to join the committees. Participating in regularly scheduled meetings puts the pulse of shop floor employees in front of any safety initiative. Companies can also provide employees the opportunity to become voluntary first responders who are trained in emergency first aid, CPR, the use of automated external defibrillators, and other emergency response protocols.
Other employee engagement activities also contribute to a culture of safety, among them safety poster or calendar contests that employees can share with their families, small work teams for identifying and making safety improvements through facility safety audits or safety behavior observations, safety mentor programs that assign mentors to new team members, or, in the case of JLG Industries, Inc., a Safety Action Tracker. This program tracks safety opportunities as employees identify them, providing a description of the item and a photograph if one is available, describing the corrective action to be taken, identifying a target date for completion, and assigning an employee responsibility for taking the corrective action. This information is posted and available to all employees, supporting transparency, encouraging communications, and directing the appropriate resources to make a safety improvement in a timely manner.
JLG Industries also engages employees in activities designed to identify workers performing tasks in awkward postures or positions where ergonomics could be improved. The goal is to recognize these situations and propose solutions to remedy them. Additionally, these small work teams identify opportunities, such as maintaining work material within a self-regulated “power zone” of 15 to 60 inches. Known as the 15/60 rule, this rule ensures no product or materials necessary to perform a manufacturing job require an employee to reach down below 15 inches or above 60 inches. This simple guideline can improve ergonomics, while providing a well-organized and safer workstation.
Training Supports a Culture of Safety
Ongoing education and training also support a culture of safety. Although employers can develop their own safety training programs, OSHA offers a number of training resources for both employers and employees. The OSHA Outreach Training Program provides training for workers and employers on the recognition, avoidance, abatement, and prevention of safety and health hazards in workplaces. Through this program, workers can attend 10- or 30-hour classes delivered by OSHA-authorized trainers. The 10-hour class is intended to provide workers with awareness of common job-related safety and health hazards, while the 30-hour class is more appropriate for supervisors or workers with some safety responsibility.
Of even greater value are hands-on training programs that simulate the work environment. Programs like this provide training essentials in a setting that mimics the workplace but eliminates workplace pressures. They are especially useful as part of new employee training programs that teach job skills and ergonomic practices prior to an employee’s introduction to the production environment.
VPP Assesses Safety and Health Systems
Companies that want to be recognized for their commitment to workplace safety can participate in OSHA’s VPP. It sets performance-based criteria for a managed safety and health system, invites sites to apply, and then assesses applicants against these criteria. OSHA’s verification includes an application review and a rigorous on-site evaluation by a team of OSHA safety and health experts. OSHA approves qualified sites to one of three programs:
Star: The Star Program is designed for exemplary work sites with comprehensive, successful safety and health management systems.
Merit: Merit is an effective stepping stone to Star. Merit sites have good safety and health management systems, but these systems need some improvement to be judged excellent. Merit sites demonstrate the potential and the commitment to meet goals tailored to each site and to achieve Star quality within three years.
Demonstration: The Star Demonstration program is designed for work sites with Star-quality safety and health protection to test alternatives to current Star eligibility and performance requirements. Star Demonstration program participants are evaluated every 12 to 18 months.
Always Evaluate and Re-Evaluate
Whatever path a company chooses to follow as it strives to create a culture of safety, it is important to constantly monitor and evaluate the programs in place to ensure they are meeting the established benchmarks. It’s all about creating an environment that attracts new employees with the same safety mindset, while ensuring the safety of current employees—an evolving environment where safety is second nature to all employees and leadership, ingrained in their DNA, influencing every decision they make and action they take. In the end, a rigorous, comprehensive safety program, endorsed by management and employees alike and assessed by third-party experts, can help organizations achieve constant, continuous improvement.
Incident management is one of the most critical IT support processes that an IT organization needs to get right. Service outages can be costly to the business and IT teams need an efficient way to respond to and resolve these issues quickly. According to a 2015 HDI study, incident management remains a top priority for 65% of IT teams around the world.
Incident management 101
Here’s how the IT Infrastructure Library (ITIL) defines an incident: “An unplanned interruption that causes, may cause, or reduces the quality of an IT Service.” Incident Management is an IT service management (ITSM) process that aims to restore normal service operation as fast as possible, minimizing the impact on the business and end user. A business application going down is an incident. The printer not working is also an incident.
“A crawling-but-not-yet-dead web server can be an incident, too. It’s running slowly, and interfering with productivity. Worse yet, it poses the even greater risk of complete failure.” – Nick Wright, Service Operations Manager at Atlassian
Incident Management vs. Problem Management: A problem is just the not-yet-known root cause behind one or more incidents. In the incidents above where the printer is down and the network is creeping, a misconfigured router could be the underlying problem behind both. Incident management focuses on short-term solutions (not completing a root cause analysis to identify why an incident occurred) and on doing whatever is necessary to restore the service. We’ll talk about managing re-occurring incidents (underlying problems) in the problem management blog.
Mean time to resolution (MTTR)
Mean Time to Resolution (MTTR) is a service-level metric that measures the average elapsed time from when an incident is reported until the incident is resolved and is typically measured in business hours. MTTR is one of the key drivers of customer satisfaction, as users may be either completely down or forced to use workarounds until their incidents have been resolved.
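As a rough illustration, MTTR can be computed from the reported and resolved timestamps of each incident. The sketch below uses made-up timestamps and measures plain elapsed time; it assumes every incident is reported and resolved within the same working day, so it does not implement a business-hours calendar.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average elapsed time over (reported_at, resolved_at) datetime pairs."""
    durations = [resolved - reported for reported, resolved in incidents]
    if not durations:
        raise ValueError("at least one resolved incident is required")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: 2, 4 and 3 hours from report to resolution.
incidents = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 11, 0)),
    (datetime(2024, 1, 9, 13, 0), datetime(2024, 1, 9, 17, 0)),
    (datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 13, 0)),
]

print(mean_time_to_resolution(incidents))  # 3:00:00
```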
Consequently, improving major incident response is one of the top goals for IT teams, specifically finding ways to lower MTTR and streamline the process of finding the root cause to prevent future outages. The diagram below outlines what’s included in the MTTR. A Forrester study found that most of the time is spent within the Investigation and Diagnosis phase. In fact, that phase takes 70% of the time because IT teams find it difficult to collaborate and share valuable insights to quickly find an incident resolution.
Incident management priorities
So what are the key areas and priorities for incident management for IT teams?
Respond effectively so they can recover fast, and define who is accountable for the response.
Communicate clearly with their stakeholders: service owners, others within the organization, and ultimately their customers.
Collaborate effectively to solve the issue faster as a team and remove barriers that prevent them from sharing and collaborating.
Continuously improve to learn from these outages and apply these lessons to improve a service or even refine the process in the future.
StatusPage: While every team uses different solutions for communication, we recommend a dedicated tool like StatusPage for incident communication. This provides a central source of truth for the current status of an incident as well as a record of past incident communication. Stakeholders can customize how they want to receive StatusPage updates; whether it’s over email, text message, or a ChatOps tool like HipChat.
Incident management process
An incident management process helps service desks investigate, record, and resolve service interruptions or outages. An Information Technology Infrastructure Library (ITIL) incident management workflow aims to reduce downtime and negative impacts. The IT Service Desk template comes with an incident management workflow, which ensures that you log, diagnose, and resolve incidents. We recommend you start with this workflow and adapt it to your business needs. When managed well, incident records can identify missing service requirements, potential improvements and future team member training.
The ITIL incident management process, in brief:
Service end users, monitoring systems, or internal IT members report interruptions.
The service desk describes and logs the incident. They link together all reports related to the service interruption.
The service desk records the date and time, reporter name, and a unique ID for the incident. JIRA Service Desk does this automatically.
A service desk agent labels the incidents with appropriate categorization. The team uses these categories during post-incident reviews and for reporting.
A service desk agent prioritizes the incident based on impact and urgency.
The team diagnoses the incident, the services affected, and possible solutions. Agents communicate with incident reporters to help complete this diagnosis.
If needed, the service desk team escalates the incident to second-line support representatives. These are the people who work regularly on the affected systems.
The service desk resolves the service interruption and verifies that the fix is successful. The resolution is fully documented for future reference.
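The logging, categorization and prioritization steps above can be sketched as a small record type. Everything here is hypothetical: the field names, the P1 to P4 labels and the way impact and urgency are combined are assumptions for illustration, not JIRA Service Desk's data model or an official ITIL matrix.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

_next_id = count(1)  # simple in-memory source of unique incident IDs

LEVEL = {"low": 1, "medium": 2, "high": 3}


def prioritize(impact: str, urgency: str) -> str:
    """Derive a priority from impact and urgency (hypothetical scheme)."""
    score = LEVEL[impact] + LEVEL[urgency]
    if score == 6:
        return "P1"
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"


@dataclass
class Incident:
    reporter: str
    summary: str
    category: str
    impact: str
    urgency: str
    # Recorded automatically when the incident is logged.
    incident_id: int = field(default_factory=lambda: next(_next_id))
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    @property
    def priority(self) -> str:
        return prioritize(self.impact, self.urgency)


incident = Incident("pat", "Printer on floor 2 not working",
                    "hardware", impact="low", urgency="medium")
print(incident.incident_id, incident.category, incident.priority)
```

Keeping the category and priority on the record itself is what makes them available later for post-incident reviews and reporting.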