Scenario: Server is down.
1. A customer raised an incident ticket reporting that the server was down, affecting normal operations. The server hosts a critical application, and with it down no agent could calculate insurance policies. The helpdesk created a P1 ticket with a 1-hour SLA to resolve.
2. The resolver found that the server's power cable had been unplugged, causing the server to power down.
3. The resolver resolved the issue by plugging the power cable back into the server. Once power was restored and the server had started, all users could access the application again.
4. The helpdesk closed the ticket after verifying with the customer that all services were back to normal. (Within 1 hr, SLA met)
5. During the monthly meeting, this P1 ticket was discussed. The problem manager decided that this incident should not recur and created a problem ticket linked to this incident.
6. The problem manager set up a problem investigation team consisting of the server room operators, the Windows server admin, and the facilities management team.
7. The problem manager asked the 5 whys:
a. Why is the server down? (An engineer admitted that he tripped on the cable)
b. Why did you trip on the cable? (The cable was lying on the floor when he was in the server room)
c. Why is the cable lying on the floor? (The server vendor left it there)
d. Why did the vendor access the server room? (They were doing a UPS upgrade but didn’t finish cleaning up)
e. Why did they not tidy up? (They were short of time)
8. The problem manager came up with the following actions to prevent such P1 incidents from happening again:
a. Appointed server room operators to maintain room safety and created a server room hazards handbook.
b. Vendor access to server rooms is now restricted to those with a valid change request number.
c. Started a log book to record everyone who accesses the server room.
d. Time given to vendors is reviewed to ensure it is sufficient for them to complete their work.
e. Windows team to set up monitoring on critical servers so that they are alerted before customers complain.
f. Facilities team to install power cable fasteners to secure all cables.
9. After all the new processes were in place, the problem manager closed the problem ticket.
Incident Management: The first goal of the incident management process is to restore normal service operation as quickly as possible and to minimize the impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. ‘Normal service operation’ is defined here as service operation within the service-level agreement (SLA).
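As a minimal sketch of the SLA check in the scenario above, the snippet below compares a ticket's time to resolve against its SLA target. The `Ticket` class, its field names, and the timestamps are hypothetical and chosen only for illustration; only the 1-hour P1 target comes from the scenario.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    """Hypothetical incident ticket; field names are illustrative only."""
    priority: str
    opened_at: datetime
    resolved_at: datetime
    sla: timedelta

    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.opened_at

    def sla_met(self) -> bool:
        # The SLA is met when service is restored within the agreed target.
        return self.time_to_resolve() <= self.sla


# Illustrative timestamps (not from the scenario): opened 09:00, resolved 09:45.
p1 = Ticket(
    priority="P1",
    opened_at=datetime(2024, 1, 15, 9, 0),
    resolved_at=datetime(2024, 1, 15, 9, 45),
    sla=timedelta(hours=1),
)
print(p1.time_to_resolve(), "SLA met" if p1.sla_met() else "SLA breached")
```

Treating a resolution that lands exactly on the target as "met" is an assumption here; a real SLA would define that boundary explicitly.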
Problem Management: The primary objectives of problem management are to prevent problems and resulting incidents from happening, to eliminate recurring incidents, and to minimize the impact of incidents that cannot be prevented.
5 Whys (root cause analysis technique used in Six Sigma): The primary goal of the technique is to determine the root cause of a defect or problem by repeating the question “Why?” Each answer forms the basis of the next question.
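The chain from step 7 can be sketched as a list of question-and-answer pairs, with the last answer taken as the root cause candidate. This is only an illustration: the `Why` class and the `root_cause` helper are hypothetical names, and the answers are paraphrased from the scenario.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Why:
    question: str
    answer: str


def root_cause(chain: List[Why]) -> str:
    """Return the final answer in the chain, treated here as the root cause candidate."""
    if not chain:
        raise ValueError("the chain needs at least one why")
    return chain[-1].answer


# Answers paraphrased from step 7 of the scenario.
chain = [
    Why("Why is the server down?", "An engineer tripped on the power cable"),
    Why("Why did you trip on the cable?", "The cable was lying on the floor"),
    Why("Why is the cable lying on the floor?", "The server vendor left it there"),
    Why("Why did the vendor access the server room?", "They were doing a UPS upgrade"),
    Why("Why did they not tidy up?", "The vendor was short of time"),
]

for depth, why in enumerate(chain, start=1):
    print(f"{depth}. {why.question} -> {why.answer}")
print("Root cause candidate:", root_cause(chain))
```

Here the final answer lines up with action 8d (reviewing the time given to vendors), while intermediate answers can suggest further actions, such as securing cables (8f).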