BSS/OSS Academy
Section 6.2

Fault Management Pipeline

The fault management pipeline from alarm detection through correlation, root cause analysis, and service impact analysis — and the inventory data each stage requires.

Fault Management: From Network Alarm to Customer Impact

Fault Management is the assurance capability that translates raw network events into business-meaningful information. A fibre cut, a router crash, a capacity threshold breach — these are infrastructure facts. Fault Management answers the question the business actually cares about: "Which customers are affected, how badly, and what are we contractually obligated to do about it?"

This translation — from network issue to customer impact — is called impact analysis, and it is the single most valuable function in service assurance. Without it, the NOC (Network Operations Centre) and the service desk operate in separate realities: the NOC sees alarms on equipment, the service desk sees customer complaints, and neither can connect the two in real time.

Fault Management
The assurance discipline responsible for detecting, correlating, and managing faults across the network and service infrastructure. Fault Management encompasses alarm collection, event filtering, alarm correlation, root cause analysis, and — critically — service impact analysis: determining which services and customers are affected by an infrastructure fault. In eTOM terms, Fault Management spans Resource Trouble Management (1.2.3.2), Service Problem Management (1.2.2.2), and feeds into Customer Problem Handling (1.2.1.4).

The Fault Management Pipeline

Fault Management is not a single step — it is a pipeline that progressively enriches raw network data until it becomes actionable business intelligence. Each stage depends on specific inventory and catalog data produced by L2C.


Stage 1: Event Collection (NMS / EMS / Telemetry Collectors)

Network elements, probes, and management systems generate raw events — SNMP traps, syslog messages, streaming telemetry, API notifications. A single fault can produce hundreds or thousands of events across multiple devices.
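
A minimal sketch of what a collector might normalise these raw events into before they enter the pipeline. The field names are illustrative only, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RawEvent:
    """Illustrative normalised event record built by a collector.

    Field names are hypothetical; real NMS/EMS integrations define their own
    schemas (SNMP trap varbinds, syslog structure, telemetry payloads, etc.).
    """
    source: str       # collector that received the event, e.g. "snmp-trap"
    device: str       # network element that emitted it
    object_id: str    # interface, card, or other sub-component
    event_type: str   # e.g. "linkDown", "bgpPeerLost"
    severity: str     # raw severity as reported by the element
    timestamp: datetime

# One physical failure typically arrives as many such events: a linkDown trap
# plus syslog and BGP notifications from every affected neighbour.
event = RawEvent("snmp-trap", "PE-LON-04", "3/0/1", "linkDown", "major",
                 datetime.now(timezone.utc))
```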

Stage 2: Event Filtering & Deduplication (Event Management / Alarm Platform)

Redundant, duplicate, and transient events are filtered out. Flapping alarms (rapid up/down cycles) are suppressed. The goal is to reduce noise to a manageable set of meaningful alarms.
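
A minimal sketch of the deduplication and flap-suppression step, assuming events arrive as simple dicts with device, object_id, event_type, and a numeric timestamp in seconds. The window and threshold values are illustrative, not product defaults.

```python
from collections import defaultdict

FLAP_WINDOW_S = 300   # look-back window for flap detection
FLAP_THRESHOLD = 5    # state changes within the window => suppress as flapping

def filter_events(events):
    """Return a reduced alarm list: duplicates and flapping objects removed."""
    seen_keys = set()
    transitions = defaultdict(list)
    alarms = []
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        key = (ev["device"], ev["object_id"], ev["event_type"])
        obj = (ev["device"], ev["object_id"])
        # Keep only recent state changes for this object, then record this one.
        transitions[obj] = [t for t in transitions[obj]
                            if ev["timestamp"] - t < FLAP_WINDOW_S]
        transitions[obj].append(ev["timestamp"])
        if len(transitions[obj]) > FLAP_THRESHOLD:
            continue                      # flapping object: suppress
        if key in seen_keys:
            continue                      # duplicate of an already-active alarm
        seen_keys.add(key)
        alarms.append(ev)
    return alarms
```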

Stage 3: Alarm Correlation (Alarm Correlation Engine)

Related alarms are grouped to identify the probable root cause. For example, 50 interface-down alarms on devices downstream of a single aggregation node correlate to the upstream node failure — not 50 independent faults. This requires knowledge of network topology from Resource Inventory.
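
A minimal sketch of topology-based grouping, assuming an upstream-parent map drawn from Resource Inventory. Device names and the alarm structure are invented for illustration.

```python
from collections import defaultdict

UPSTREAM = {                 # child device -> upstream aggregation device
    "ACC-01": "AGG-EAST-01",
    "ACC-02": "AGG-EAST-01",
    "ACC-03": "AGG-EAST-02",
}

def correlate(alarms):
    """Group alarms by the upstream device they depend on."""
    groups = defaultdict(list)
    for alarm in alarms:
        parent = UPSTREAM.get(alarm["device"], alarm["device"])
        groups[parent].append(alarm)
    return groups

alarms = [{"device": "ACC-01", "event_type": "linkDown"},
          {"device": "ACC-02", "event_type": "linkDown"}]
# Both alarms fall into the "AGG-EAST-01" group: one suspected fault, not two.
print(correlate(alarms))
```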

Stage 4: Root Cause Identification (Fault Management / Resource Inventory)

The correlated alarm set is analysed to pinpoint the most likely root cause — the single failure that explains all observed symptoms. Resource Inventory provides the CI/resource relationships needed to trace upstream dependencies.
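
A minimal sketch of root cause selection under the same assumption of a parent map from Resource Inventory: among the alarmed resources, the one with no alarmed ancestor is the most likely root cause. All resource names are invented.

```python
PARENT = {"port-3/0/1": "linecard-3",
          "linecard-3": "PE-LON-04",
          "PE-LON-04": "AGG-EAST-01"}

def probable_root_cause(alarmed_resources):
    """Return the alarmed resource whose failure explains the others."""
    best = None
    for res in alarmed_resources:
        node, alarmed_ancestors = res, 0
        # Count how many alarmed ancestors sit above this resource.
        while node in PARENT:
            node = PARENT[node]
            if node in alarmed_resources:
                alarmed_ancestors += 1
        # Fewest alarmed ancestors = highest alarmed node in the chain.
        if best is None or alarmed_ancestors < best[1]:
            best = (res, alarmed_ancestors)
    return best[0] if best else None

print(probable_root_cause({"port-3/0/1", "linecard-3"}))  # -> "linecard-3"
```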

Stage 5: Service Impact Analysis (SLM / Fault Management)

The root cause resource is mapped upward through the RFS → CFS chain in SLM to determine which services are degraded or down. This is where a network event becomes a service event. Without SLM topology, this step is impossible to automate.
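
A minimal sketch of the resource-to-RFS-to-CFS walk, with dictionaries standing in for the service topology held in SLM. All identifiers are invented.

```python
RESOURCE_TO_RFS = {"linecard-3": ["rfs-mpls-path-17", "rfs-sip-trunk-link-4"]}
RFS_TO_CFS = {
    "rfs-mpls-path-17": ["cfs-vpn-acme-lon", "cfs-vpn-globex-lon"],
    "rfs-sip-trunk-link-4": ["cfs-sip-initech"],
}

def impacted_services(root_cause_resource):
    """Return the CFS instances degraded or down because of one resource."""
    cfs_list = []
    for rfs in RESOURCE_TO_RFS.get(root_cause_resource, []):
        cfs_list.extend(RFS_TO_CFS.get(rfs, []))
    return cfs_list

print(impacted_services("linecard-3"))
# ['cfs-vpn-acme-lon', 'cfs-vpn-globex-lon', 'cfs-sip-initech']
```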

Stage 6: Customer Impact Analysis (SLM / SLA Management)

Affected CFS instances are mapped through SLM subscription records to the impacted customers, their subscription tiers, and SLA commitments. A single root cause resource failure is now expressed as: "47 customers affected, 12 on Gold SLA with 4-hour restore commitment, estimated SLA breach in 2.5 hours."
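
A minimal sketch of resolving affected CFS instances to customers, tiers, and time-to-breach. The tier definitions and subscription records are illustrative, not a real SLM schema.

```python
from datetime import datetime, timedelta

RESTORE_COMMITMENT = {"Gold": timedelta(hours=4), "Silver": timedelta(hours=8)}
SUBSCRIPTIONS = {
    "cfs-vpn-acme-lon":   {"customer": "Acme", "tier": "Gold"},
    "cfs-vpn-globex-lon": {"customer": "Globex", "tier": "Silver"},
}

def customer_impact(cfs_list, fault_start):
    """Attach customer, tier, and SLA breach deadline to each affected CFS."""
    impacts = []
    for cfs in cfs_list:
        sub = SUBSCRIPTIONS.get(cfs)
        if not sub:
            continue
        breach_at = fault_start + RESTORE_COMMITMENT[sub["tier"]]
        impacts.append({**sub, "cfs": cfs, "sla_breach_at": breach_at})
    return impacts

for row in customer_impact(["cfs-vpn-acme-lon", "cfs-vpn-globex-lon"],
                           datetime(2024, 5, 1, 9, 0)):
    print(row["customer"], row["tier"], "breach at", row["sla_breach_at"])
```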

Stage 7: Business Prioritisation & Action (ITSM / Service Desk / CRM)

The enriched fault — now carrying customer count, SLA exposure, revenue impact, and estimated breach time — drives prioritisation, resource dispatch, proactive customer notification, and escalation. The business can make informed decisions about where to allocate repair resources.
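
A minimal sketch of turning the enriched fault into a queue-ordering score. The weights are arbitrary illustrations, not a recommended policy.

```python
def priority_score(fault):
    """Score a fault so the NOC queue reflects business impact, not just severity."""
    score = fault["customers_affected"]
    score += 10 * fault["gold_sla_customers"]         # SLA exposure weighs heavily
    if fault["hours_to_breach"] is not None and fault["hours_to_breach"] < 4:
        score += 50                                   # imminent breach: escalate
    score += fault["monthly_revenue_at_risk"] / 1000  # revenue at risk
    return score

fault = {"customers_affected": 47, "gold_sla_customers": 12,
         "hours_to_breach": 2.5, "monthly_revenue_at_risk": 85000}
print(priority_score(fault))   # higher score -> dispatch and escalate first
```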

Impact Analysis: The Core of Fault Management

Impact analysis is what makes fault management a business capability rather than a purely technical one. It answers questions that network monitoring alone cannot:

Network View vs Business View of a Fault

| Question | Network View (without impact analysis) | Business View (with impact analysis) |
|---|---|---|
| What happened? | Line card failure on router PE-LON-04, ports 3/0/1 through 3/0/4 down | Same, plus: this router serves 3 aggregation rings across East London |
| How bad is it? | 4 interfaces down, 12 BGP sessions lost | 47 enterprise customers affected. 23 MPLS VPN services degraded. 8 SIP trunk services down. |
| Who cares? | NOC team assigned to restore | Gold SLA customers X, Y, Z have a 4-hour restore commitment; breach in 2.5 hours. Customer Z is a top-10 revenue account. |
| What should we do? | Replace line card, restore BGP sessions | Same technical fix, but escalate immediately due to Gold SLA exposure. Proactively notify 47 customers. Divert field engineer from lower-priority job. Pre-calculate SLA credits. |
| After resolution? | Interfaces up, BGP sessions re-established. Close ticket. | Verify all 47 customer services restored. Update SLM subscription statuses. Calculate SLA compliance: 2 Gold customers breached by 14 minutes; auto-generate credits. Feed data into capacity planning for that aggregation ring. |
Why Impact Analysis Matters to the Business
Without impact analysis, every fault is treated as an infrastructure problem for the NOC to solve. With impact analysis, faults are expressed in terms the business understands: affected customers, revenue at risk, SLA exposure, and churn probability. This is what allows a telco to shift from "we fix network problems" to "we manage customer service quality" — and it is entirely dependent on the inventory data that L2C produces.

What Fault Management Requires from L2C

Fault Management and impact analysis are only as good as the data they consume. Each stage of the pipeline has specific data dependencies on L2C outputs:

Fault Management Data Dependencies

| Pipeline Stage | Data Required | L2C Source | If Missing |
|---|---|---|---|
| Alarm Correlation | Network topology, device relationships, physical connectivity | Resource Inventory (populated during ROM/activation) | Alarms cannot be grouped; every alarm treated as independent. NOC drowns in noise. |
| Root Cause Identification | Upstream/downstream resource dependencies | Resource Inventory CI relationships | Root cause guessed from experience, not derived from data. MTTR increases. |
| Service Impact Analysis | RFS-to-CFS mappings, service topology | SLM (populated during SOM fulfilment) | Cannot determine which services are affected. Impact is unknown until customers report. |
| Customer Impact Analysis | CFS-to-product-to-customer linkage, SLA terms | SLM (populated at order completion) | Cannot identify affected customers or SLA exposure. All faults get equal priority. |
| Business Prioritisation | Customer tier, revenue, contract value, SLA penalties | CRM / SLM / Billing | Prioritisation based on technical severity only, not business impact. High-value customers get no preferential treatment. |
The Correlation Gap
Many telcos invest heavily in alarm correlation engines and AIOps platforms, expecting them to deliver impact analysis. But correlation engines can only work with the data they are given. If SLM has no CFS-to-RFS topology, or if Product Inventory has no linkage to SLM, even the most sophisticated correlation engine cannot answer "which customers are affected?" The data problem must be solved at L2C fulfilment time — no amount of AI on the assurance side compensates for missing inventory data.

Key Takeaways

  • Fault Management is a 7-stage pipeline: event collection → filtering → alarm correlation → root cause → service impact → customer impact → business prioritisation
  • Impact analysis transforms infrastructure events into business-meaningful information: affected customers, SLA exposure, revenue risk
  • Every pipeline stage depends on inventory data produced by L2C — if the inventories are wrong, the pipeline fails
  • AIOps and correlation engines cannot compensate for missing inventory data — the data problem must be solved at fulfilment time