Maintenance & Change Management
Incident Management for Automated Processes
Automated processes fail. Systems go down. Business rules change and nobody updates the logic. A well-designed incident management framework does not prevent failures — it determines how quickly they are detected, how cleanly they are resolved, and whether the organisation learns something useful from each one. The framework must exist before go-live. Built after the first incident, it is always too late.
Definition and Scope
An automation incident is any unplanned event that causes a process to fail, degrade, or produce incorrect outputs: the automation stopping entirely, performance falling below defined thresholds, exception rates rising above baseline, outputs that pass technically but encode incorrect business logic, or integration failures that cause data loss or duplication. If it affects the process outcome, it is an incident, regardless of whether the platform itself is running. The following events qualify as incidents:
- Automation stops processing — no output from a running process
- Exception rate rises more than 20% above the established baseline
- Cycle time exceeds the defined SLA threshold for more than 2 hours
- Integration with an upstream or downstream system fails or produces corrupt data
- Automation processes transactions but applies incorrect business logic
- Monitoring alert triggers — even if no visible operational impact yet
The following are not incidents:
- Planned maintenance windows — these are scheduled and communicated in advance
- A single transaction routed to the exception queue — this is the automation working as designed
- A change request to update business logic — this follows the change management process, not incident management
- A performance dip within the defined acceptable variance band
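As a minimal sketch, the triggers above can be encoded as explicit detection rules. The `ProcessMetrics` structure, field names, and helper function below are illustrative assumptions rather than any platform's API, but the thresholds mirror the list.

```python
from dataclasses import dataclass

@dataclass
class ProcessMetrics:
    """Point-in-time metrics for one automated process (illustrative fields)."""
    output_count: int               # transactions completed in the current window
    exception_rate: float           # current exception rate, e.g. 0.12 = 12%
    baseline_exception_rate: float  # agreed baseline for this process
    minutes_over_sla: float         # how long cycle time has exceeded the SLA threshold
    integration_healthy: bool       # upstream/downstream integrations responding normally

def incident_triggers(m: ProcessMetrics) -> list[str]:
    """Return every trigger that applies; an empty list means no incident."""
    triggers = []
    if m.output_count == 0:
        triggers.append("automation stopped: no output from a running process")
    if m.exception_rate > m.baseline_exception_rate * 1.20:
        triggers.append("exception rate more than 20% above baseline")
    if m.minutes_over_sla > 120:
        triggers.append("cycle time above SLA threshold for more than 2 hours")
    if not m.integration_healthy:
        triggers.append("integration failure with an upstream or downstream system")
    return triggers
```

Note that the remaining trigger, transactions processed with incorrect business logic, cannot be caught this way; as the root-cause table later in this section notes, it is usually found in an audit or by an alert analyst rather than by monitoring.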
Classifying and Responding to Automation Failures
The severity classification determines the response — not the preference of the team that notices it first. Every automation must have a defined incident severity model before go-live, not after the first failure. Classification must be documented, agreed with all stakeholders, and applied consistently regardless of time of day, volume pressure, or who is on call.
| Severity | Definition | Response Time | Action |
|---|---|---|---|
| P1 — Critical | Automation completely stopped. Process cannot execute. Regulated function or client-facing service affected. | Immediate (<1h) | Manual fallback activated; full team mobilised; executive notification; incident declared |
| P2 — High | Significant degradation. Exception rate >50% above baseline. SLA breach imminent or already occurring. | <4 hours | Partial manual fallback for affected transaction types; root cause identified within 4 hours |
| P3 — Medium | Performance below threshold but automation still running. Queue building. Error rate elevated. | Same business day | Monitor closely; schedule fix within next maintenance window; communicate to process owner |
| P4 — Low | Minor anomaly. No operational impact. Monitoring flag triggered but no SLA or compliance risk. | Next planned maintenance | Log; include in next maintenance cycle; no immediate action required |
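The table can be applied mechanically. The sketch below is one way to express it, assuming the relevant signals are already available as booleans and ratios; the function and parameter names are illustrative.

```python
from enum import Enum

class Severity(Enum):
    P1 = "Critical"
    P2 = "High"
    P3 = "Medium"
    P4 = "Low"

def classify(stopped: bool,
             regulated_or_client_facing_impact: bool,
             exception_rate_over_baseline: float,   # 0.6 means 60% above baseline
             sla_breach_imminent: bool,
             below_performance_threshold: bool) -> Severity:
    """Apply the severity table top-down, most severe condition first."""
    # P1: automation completely stopped, or a regulated / client-facing service affected
    if stopped or regulated_or_client_facing_impact:
        return Severity.P1
    # P2: significant degradation - exception rate >50% above baseline or SLA breach imminent
    if exception_rate_over_baseline > 0.50 or sla_breach_imminent:
        return Severity.P2
    # P3: performance below threshold but the automation is still running
    if below_performance_threshold:
        return Severity.P3
    # P4: minor anomaly, no operational impact
    return Severity.P4
```

Ordering matters: conditions are checked from most to least severe, so a stopped automation that also shows an elevated exception rate is classified P1, never P2.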
The Five Stages of Incident Response
Every incident moves through the same five stages: detection (the monitoring alert or operational observation that surfaces the problem), notification (the named on-call contact and, for a P1, the designated executive are informed within the response-time window), containment (the manual fallback limits operational impact while the cause is investigated), resolution (the root cause is identified and fixed or a documented workaround is applied), and post-incident review (mandatory for every P1 and P2, as described below). The severity classification determines how fast each stage must happen; the stages themselves do not change.
What Must Be Defined Before the Automation Goes Live
An incident management framework that is built after the first incident is not incident management — it is incident recovery. Every item below must be documented, agreed, and tested before go-live. If any of them is missing, the automation is not operationally ready, regardless of whether UAT has passed.
- Named on-call contact for each severity level — a person, not a team
- Escalation path if the first responder cannot resolve within the SLA window
- Named executive to notify for P1 incidents — agreed before go-live, not decided in the moment
- Compliance owner to notify if the incident affects regulated processes or the audit trail
- Severity classification criteria specific to this process — not just generic thresholds
- Manual fallback procedure — documented step by step and tested before go-live
- Communication template for each severity level — who gets notified, in what format, within what timeframe
- Incident log template — where incidents are recorded and who is responsible for maintaining it
The manual fallback procedure must be tested before go-live — not read and acknowledged, but executed by the operations team against real transaction scenarios. A fallback that exists only on paper is not a fallback. It is a plan that will fail at the worst possible moment.
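One way to make the checklist enforceable is to hold it as a machine-readable runbook and gate go-live on its completeness. The structure below is a sketch only; every name, address, and path is a hypothetical placeholder.

```python
# Illustrative pre-go-live runbook; every key maps to one checklist item above.
RUNBOOK = {
    "process": "payments-exception-handling",
    "on_call": {"P1": "jane.doe@bank.example", "P2": "jane.doe@bank.example",
                "P3": "ops.analyst@bank.example", "P4": "ops.analyst@bank.example"},
    "escalation_path": ["first.responder@bank.example",
                        "automation.lead@bank.example",
                        "head.of.operations@bank.example"],
    "p1_executive_contact": "coo.office@bank.example",
    "compliance_owner": "compliance.officer@bank.example",
    "severity_criteria": "docs/severity-model.md",
    "manual_fallback_procedure": "docs/manual-fallback.md",
    "fallback_tested_on": "2024-05-14",          # executed by operations, not just acknowledged
    "communication_templates": {"P1": "templates/p1.md", "P2": "templates/p2.md",
                                "P3": "templates/p3.md", "P4": "templates/p4.md"},
    "incident_log": "registers/incident-log.xlsx",
    "incident_log_owner": "ops.manager@bank.example",
}

REQUIRED_KEYS = ("on_call", "escalation_path", "p1_executive_contact",
                 "compliance_owner", "severity_criteria", "manual_fallback_procedure",
                 "fallback_tested_on", "communication_templates",
                 "incident_log", "incident_log_owner")

def operationally_ready(runbook: dict) -> bool:
    """Go-live gate: the automation is not ready if any checklist item is missing."""
    return all(runbook.get(key) for key in REQUIRED_KEYS)
```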
What Must Happen After Every P1 or P2
- Root cause documented formally — not just resolved and forgotten
- Post-incident review completed with all parties involved in the response
- Timeline of events reconstructed — detection, notification, containment, resolution
- Incident log updated with final root cause and resolution details
- Preventive measure identified — not “monitor more closely” but a specific action
- Change request raised if the root cause requires a logic or design fix
- Monitoring thresholds reviewed — would earlier detection have been possible?
- Compliance record updated if the incident affected regulated processes or audit trail completeness
The most common failure is resolving the incident without documenting the root cause. The next incident of the same type arrives, and the team starts from zero again. Every undocumented P1 or P2 is a lesson the organisation paid for and did not keep.
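A sketch of what a complete incident-log entry might capture, assuming the log is held as structured records rather than free text; all field names are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PostIncidentRecord:
    """One incident-log entry for a closed P1/P2 (illustrative fields)."""
    incident_id: str
    severity: str                             # "P1" or "P2"
    detected_at: datetime                     # timeline reconstructed stage by stage
    notified_at: datetime
    contained_at: datetime
    resolved_at: datetime
    root_cause: str                           # the formal root cause, not the symptom
    preventive_measure: str                   # a specific action, not "monitor more closely"
    review_participants: list[str] = field(default_factory=list)
    change_request_id: str | None = None      # raised if a logic or design fix is required
    monitoring_thresholds_reviewed: bool = False  # would earlier detection have been possible?
    compliance_record_updated: bool = False   # required if a regulated process was affected

    def closable(self) -> bool:
        """A P1/P2 should not be closed until cause, prevention and review are recorded."""
        return bool(self.root_cause and self.preventive_measure and self.review_participants)
```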
The Five Most Frequent Root Causes in Banking Automation
| Pattern | What It Looks Like | Prevention |
|---|---|---|
| Business rule drift | Automation produces outputs that were correct six months ago but no longer reflect current policy. Usually discovered in an audit or by an alert operations analyst — not by monitoring. | Quarterly compliance review of automation logic. Change management process that triggers an automation review whenever a business rule changes. |
| Upstream system change | An upstream system updates its API, data format, or UI — breaking the integration silently. The automation continues running but receives malformed or missing data. | Dependency register for all upstream systems. Change notification agreement with IT for any system the automation depends on. Integration monitoring that alerts on data format anomalies. |
| Data quality degradation | Input data quality deteriorates gradually — more missing fields, more format inconsistencies, more duplicate records. Exception rate rises slowly until it crosses the alert threshold. | Input validation monitoring with trend alerting. Monthly data quality review for processes with high exception sensitivity. Exception rate trended over 30 days — not just point-in-time. |
| Volume spike beyond design parameters | Transaction volume exceeds the load the automation was designed and tested for. Performance degrades. Queue builds. SLA breaches begin. | Load testing as part of UAT at 150% of expected volume. Queue depth monitoring with early-warning threshold. Capacity review before known volume events. |
| Unhandled exception type | A transaction type not anticipated in the original design arrives in the process. The automation has no defined path for it and either fails, loops, or routes incorrectly. | Exception log review in the first 30 and 90 days post-go-live. New exception types trigger a design review — not just a manual workaround. Exception catalogue maintained and updated continuously. |
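Several of the preventive controls above reduce to trend monitoring rather than point-in-time checks. Below is a sketch of the exception-rate drift alert, assuming a simple list of daily rates is available; the 30-day window follows the table and the 20% threshold follows the incident criteria earlier in the section, while everything else is illustrative.

```python
from statistics import mean

def exception_rate_drift_alert(daily_rates: list[float],
                               baseline_days: int = 30,
                               drift_threshold: float = 0.20) -> bool:
    """Alert when today's exception rate sits more than `drift_threshold`
    (20% by default) above the trailing baseline, so gradual data quality
    degradation is caught before it crosses a hard point-in-time limit.
    """
    if len(daily_rates) <= baseline_days:
        return False                              # not enough history for a baseline yet
    today = daily_rates[-1]
    baseline = mean(daily_rates[-(baseline_days + 1):-1])  # trailing window, excluding today
    return today > baseline * (1 + drift_threshold)
```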

