Fault Management Pipeline
The fault management pipeline from alarm detection through correlation, root cause analysis, and service impact analysis — and the inventory data each stage requires.
Fault Management: From Network Alarm to Customer Impact
Fault Management is the assurance capability that translates raw network events into business-meaningful information. A fibre cut, a router crash, a capacity threshold breach — these are infrastructure facts. Fault Management answers the question the business actually cares about: "Which customers are affected, how badly, and what are we contractually obligated to do about it?"
This translation — from network issue to customer impact — is called impact analysis, and it is the single most valuable function in service assurance. Without it, the NOC (Network Operations Centre) and the service desk operate in separate realities: the NOC sees alarms on equipment, the service desk sees customer complaints, and neither can connect the two in real time.
The Fault Management Pipeline
Fault Management is not a single step — it is a pipeline that progressively enriches raw network data until it becomes actionable business intelligence. Each stage depends on specific inventory and catalog data produced by L2C.
Event Collection
Systems: NMS / EMS / Telemetry Collectors
Network elements, probes, and management systems generate raw events — SNMP traps, syslog messages, streaming telemetry, API notifications. A single fault can produce hundreds or thousands of events across multiple devices.
Event Filtering & Deduplication
Systems: Event Management / Alarm Platform
Redundant, duplicate, and transient events are filtered out. Flapping alarms (rapid up/down cycles) are suppressed. The goal is to reduce noise to a manageable set of meaningful alarms.
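The filtering stage can be sketched as a small function. The event shape here — `(timestamp, device, alarm_key, state)` tuples — is a hypothetical simplification of the trap/syslog records a real alarm platform consumes:

```python
from collections import defaultdict

def filter_events(events, flap_window=60.0, flap_threshold=4):
    """Drop duplicate events and suppress flapping alarms.

    events: iterable of (timestamp, device, alarm_key, state) tuples,
    a simplified stand-in for real trap/syslog records.
    """
    last_state = {}                   # (device, alarm_key) -> last seen state
    transitions = defaultdict(list)   # (device, alarm_key) -> transition times
    meaningful = []
    for ts, device, key, state in events:
        k = (device, key)
        if last_state.get(k) == state:          # duplicate: no state change
            continue
        last_state[k] = state
        # keep only transitions inside the flap window
        transitions[k] = [t for t in transitions[k] if ts - t <= flap_window]
        transitions[k].append(ts)
        if len(transitions[k]) >= flap_threshold:   # rapid up/down cycling
            continue                                # suppress the flap
        meaningful.append((ts, device, key, state))
    return meaningful
```

Real platforms add far more policy (severity mapping, maintenance windows, clearing correlation), but the dedup/flap split above is the essential noise-reduction step.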
Alarm Correlation
System: Alarm Correlation Engine
Related alarms are grouped to identify the probable root cause. For example, 50 interface-down alarms on devices downstream of a single aggregation node correlate to the upstream node failure — not 50 independent faults. This requires knowledge of network topology from Resource Inventory.
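As a sketch, topology-driven correlation reduces to walking each alarmed device upstream until no alarmed ancestor remains. The `parent` map stands in for the topology held in Resource Inventory; all device names are hypothetical:

```python
def correlate(alarmed, parent):
    """Group alarms under their highest alarmed upstream device.

    alarmed: set of device names with active alarms.
    parent:  dict device -> upstream device, from Resource Inventory topology.
    Returns {probable_root: sorted alarmed devices it explains}.
    """
    groups = {}
    for dev in alarmed:
        root, cur = dev, dev
        while cur in parent:          # walk toward the network core
            cur = parent[cur]
            if cur in alarmed:        # a higher alarmed node explains this one
                root = cur
        groups.setdefault(root, []).append(dev)
    return {r: sorted(devs) for r, devs in groups.items()}
```

With a topology of two access devices hanging off one aggregation node, three alarms collapse into a single group keyed by the aggregation node — the "50 alarms, one fault" pattern described above.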
Root Cause Identification
Systems: Fault Management / Resource Inventory
The correlated alarm set is analysed to pinpoint the most likely root cause — the single failure that explains all observed symptoms. Resource Inventory provides the CI/resource relationships needed to trace upstream dependencies.
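One way to sketch "the single failure that explains all symptoms": among candidate CIs, pick the one whose downstream dependency closure covers every alarmed device, preferring the most specific (smallest) candidate. The `children` mapping is a hypothetical extract of Resource Inventory CI relationships:

```python
def downstream_closure(ci, children):
    """All CIs reachable downstream of `ci`, including itself."""
    seen, stack = {ci}, [ci]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def probable_root_cause(alarmed, children):
    """Smallest-closure CI that explains every alarmed device, or None.

    children: dict CI -> direct downstream dependents (Resource Inventory).
    """
    best, best_size = None, None
    for ci in children:
        closure = downstream_closure(ci, children)
        if alarmed <= closure and (best_size is None or len(closure) < best_size):
            best, best_size = ci, len(closure)
    return best
```

Without accurate CI relationships in `children`, this derivation is impossible and the root cause reverts to operator guesswork — exactly the failure mode the dependency table later in this section describes.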
Service Impact Analysis
Systems: SLM / Fault Management
The root cause resource is mapped upward through the RFS → CFS chain in SLM to determine which services are degraded or down. This is where a network event becomes a service event. Without SLM topology, this step is impossible to automate.
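Assuming SLM exposes the resource→RFS and RFS→CFS links as simple adjacency maps (the identifiers below are illustrative), service impact analysis is a short upward traversal:

```python
def service_impact(failed_resource, rfs_of_resource, cfs_of_rfs):
    """Map a failed resource upward through RFS -> CFS (SLM topology).

    rfs_of_resource: dict resource id -> RFS ids realised on that resource.
    cfs_of_rfs:      dict RFS id -> CFS ids composed from that RFS.
    Returns the set of customer-facing services degraded or down.
    """
    impacted = set()
    for rfs in rfs_of_resource.get(failed_resource, ()):
        impacted.update(cfs_of_rfs.get(rfs, ()))
    return impacted
```

The traversal itself is trivial; the hard part is that both maps must have been populated accurately at fulfilment time, which is the point this section's dependency table makes.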
Customer Impact Analysis
Systems: SLM / SLA Management
Affected CFS instances are traced through SLM subscriptions to identify impacted customers, their subscription tiers, and SLA commitments. A single root cause resource failure is now expressed as: "47 customers affected, 12 on Gold SLA with 4-hour restore commitment, estimated SLA breach in 2.5 hours."
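Continuing the sketch, mapping impacted CFS instances to customers and SLA clocks could look like the following. The `Subscription` record and the `subs_of_cfs` map are hypothetical stand-ins for SLM subscription data:

```python
from dataclasses import dataclass

@dataclass
class Subscription:
    customer: str
    tier: str             # e.g. "Gold"
    restore_hours: float  # contractual restore commitment

def customer_impact(impacted_cfs, subs_of_cfs, outage_age_hours):
    """Summarise customer and SLA exposure for impacted CFS instances.

    subs_of_cfs: dict CFS id -> list of Subscription records (from SLM).
    """
    affected = {}
    for cfs in impacted_cfs:
        for sub in subs_of_cfs.get(cfs, ()):
            affected[sub.customer] = sub       # one entry per customer
    gold = [s for s in affected.values() if s.tier == "Gold"]
    hours_to_breach = min(
        (s.restore_hours - outage_age_hours for s in gold), default=None
    )
    return {
        "customers_affected": len(affected),
        "gold_count": len(gold),
        "hours_to_first_breach": hours_to_breach,
    }
```

This is where the "breach in 2.5 hours" figure in the stage description would come from: a 4-hour Gold commitment minus 1.5 hours of elapsed outage.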
Business Prioritisation & Action
Systems: ITSM / Service Desk / CRM
The enriched fault — now carrying customer count, SLA exposure, revenue impact, and estimated breach time — drives prioritisation, resource dispatch, proactive customer notification, and escalation. The business can make informed decisions about where to allocate repair resources.
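A minimal scoring heuristic shows how the enriched fault can drive dispatch ordering. The weights and field names are entirely illustrative assumptions, not a standard:

```python
def priority_score(impact, weights=None):
    """Rank an enriched fault for repair dispatch (illustrative heuristic).

    impact: dict with customers_affected, gold_count, hours_to_first_breach,
    i.e. the kind of summary produced by the customer impact stage.
    """
    w = weights or {"customers": 1.0, "gold": 10.0, "urgency": 20.0}
    urgency = 0.0
    if impact.get("hours_to_first_breach") is not None:
        # closer breach => higher urgency; floor avoids division blow-up
        urgency = 1.0 / max(impact["hours_to_first_breach"], 0.1)
    return (w["customers"] * impact["customers_affected"]
            + w["gold"] * impact["gold_count"]
            + w["urgency"] * urgency)
```

In practice ITSM tools encode this as priority matrices rather than a formula, but the inputs are the same: without customer count and SLA exposure, only technical severity is left to rank on.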
Impact Analysis: The Core of Fault Management
Impact analysis is what makes fault management a business capability rather than a purely technical one. It answers questions that network monitoring alone cannot:
Network View vs Business View of a Fault
| Question | Network View (without impact analysis) | Business View (with impact analysis) |
|---|---|---|
| What happened? | Line card failure on router PE-LON-04, ports 3/0/1 through 3/0/4 down | Same — plus: this router serves 3 aggregation rings across East London |
| How bad is it? | 4 interfaces down, 12 BGP sessions lost | 47 enterprise customers affected. 23 MPLS VPN services degraded. 8 SIP trunk services down. |
| Who cares? | NOC team assigned to restore | Gold SLA customers X, Y, Z have 4-hour restore commitment — breach in 2.5 hours. Customer Z is a top-10 revenue account. |
| What should we do? | Replace line card, restore BGP sessions | Same technical fix — but escalate immediately due to Gold SLA exposure. Proactively notify 47 customers. Divert field engineer from lower-priority job. Pre-calculate SLA credits. |
| After resolution? | Interfaces up, BGP sessions re-established. Close ticket. | Verify all 47 customer services restored. Update SLM subscription statuses. Calculate SLA compliance: 2 Gold customers breached by 14 minutes — auto-generate credits. Feed data into capacity planning for that aggregation ring. |
What Fault Management Requires from L2C
Fault Management and impact analysis are only as good as the data they consume. Each stage of the pipeline has specific data dependencies on L2C outputs:
Fault Management Data Dependencies
| Pipeline Stage | Data Required | L2C Source | If Missing |
|---|---|---|---|
| Alarm Correlation | Network topology, device relationships, physical connectivity | Resource Inventory (populated during ROM/activation) | Alarms cannot be grouped; every alarm treated as independent. NOC drowns in noise. |
| Root Cause Identification | Upstream/downstream resource dependencies | Resource Inventory CI relationships | Root cause guessed from experience, not derived from data. MTTR increases. |
| Service Impact Analysis | RFS-to-CFS mappings, service topology | SLM (populated during SOM fulfilment) | Cannot determine which services are affected. Impact is unknown until customers report. |
| Customer Impact Analysis | CFS-to-product-to-customer linkage, SLA terms | SLM (populated at order completion) | Cannot identify affected customers or SLA exposure. All faults get equal priority. |
| Business Prioritisation | Customer tier, revenue, contract value, SLA penalties | CRM / SLM / Billing | Prioritisation based on technical severity only, not business impact. High-value customers get no preferential treatment. |
Key Takeaways
- Fault Management is a 7-stage pipeline: event collection → filtering → alarm correlation → root cause → service impact → customer impact → business prioritisation
- Impact analysis transforms infrastructure events into business-meaningful information: affected customers, SLA exposure, revenue risk
- Every pipeline stage depends on inventory data produced by L2C — if the inventories are wrong, the pipeline fails
- AIOps and correlation engines cannot compensate for missing inventory data — the data problem must be solved at fulfilment time