BSS/OSS Academy
12 min read

Trouble Ticket & Assurance APIs

While Modules 5.2 through 5.4 covered the Lead-to-Cash domain (catalogs, orders, inventory), this section focuses on the Trouble-to-Resolve domain. Assurance APIs handle what happens when things go wrong: customer complaints, service problems, and quality degradation. These APIs are the backbone of telco fault management and SLA monitoring.

The key assurance APIs are TMF621 (Trouble Ticket Management), TMF656 (Service Problem Management), and TMF657 (Service Quality Management). Together, they provide a structured approach to detecting, diagnosing, and resolving service issues.

Trouble-to-Resolve in eTOM
In the TM Forum eTOM process framework, Trouble-to-Resolve is one of the core end-to-end operational processes. It spans from issue detection/reporting (via trouble tickets or automated alarms) through diagnosis and resolution to customer notification. The assurance APIs provide the system interfaces that support this process.

TMF621: Trouble Ticket Management

TMF621 -- Trouble Ticket Management
TMF621 manages trouble tickets -- formal records of issues reported by customers or detected by monitoring systems. A trouble ticket captures the problem description, affected service/product, priority, and tracks the resolution lifecycle from creation through investigation to closure.

Trouble tickets are the customer-facing entry point for the assurance process. Whether a customer calls the contact centre to report "my internet is down" or submits an issue through a self-service portal, the result is a trouble ticket created via TMF621.

TMF621 Key Resources

TMF621 Core Resources

Resource | Purpose | Key Attributes
TroubleTicket | A record of a reported issue | name, description, severity, priority, status, statusChangeDate, statusChangeReason, ticketType, channel, relatedEntity, relatedParty, note, attachment, troubleTicketCharacteristic

Trouble Ticket State Machine

TMF621 Trouble Ticket States

State | Meaning | Next Possible States
Submitted | Ticket created and awaiting triage | Acknowledged, Rejected
Acknowledged | Ticket accepted by the support team | InProgress, Pending
InProgress | Actively being investigated or resolved | Resolved, Pending, Cancelled
Pending | Waiting for customer input or external action | InProgress, Cancelled
Resolved | Fix applied, awaiting confirmation | Closed, InProgress (re-opened)
Closed | Issue confirmed resolved, ticket completed | InProgress (re-opened, if within policy)
Rejected | Ticket rejected during triage (duplicate, invalid) | Terminal state
Cancelled | Ticket cancelled by customer or system | Terminal state

TMF621 Operations

  • POST /troubleTicket -- Create a new trouble ticket
  • GET /troubleTicket/{id} -- Retrieve ticket details
  • GET /troubleTicket -- List/query tickets with filtering (by status, priority, customer)
  • PATCH /troubleTicket/{id} -- Update ticket (add notes, change status, reassign)
  • DELETE /troubleTicket/{id} -- Delete a ticket (rarely used in production)
  • POST /hub -- Register for event notifications (state changes, updates)
Trouble Ticket Creation Example
Customer calls: "My broadband is not working." Agent creates a ticket via POST /troubleTicket with: severity: "Medium", priority: "2 - High", ticketType: "Service Fault", description: "Customer reports no internet connectivity since 09:00 today", relatedEntity: [{ id: "PROD-12345", role: "AffectedProduct", @referredType: "Product" }], relatedParty: [{ id: "CUST-12345", role: "Customer" }]. The ticket enters "Submitted" state and triggers a TroubleTicketCreateEvent.
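The request body for the example above can be sketched as follows. Field names follow the TMF621 attributes listed in this section; the product and customer identifiers are the illustrative values from the example, not real data.

```python
# Sketch of the TMF621 create-ticket payload for "my broadband is not working".
# The server assigns the id and sets the initial status to "Submitted".
import json

ticket = {
    "severity": "Medium",
    "priority": "2 - High",
    "ticketType": "Service Fault",
    "description": "Customer reports no internet connectivity since 09:00 today",
    "relatedEntity": [
        {"id": "PROD-12345", "role": "AffectedProduct", "@referredType": "Product"}
    ],
    "relatedParty": [{"id": "CUST-12345", "role": "Customer"}],
}

# This JSON string would be the body of POST /troubleTicket
body = json.dumps(ticket)
```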

Severity vs Priority

TMF621 distinguishes between severity (the technical impact of the issue) and priority (the business urgency of resolution). These are not the same thing.

Severity vs Priority

Concept | Definition | Who Sets It | Example
Severity | Technical impact of the issue | Determined by impact analysis | Critical: total service outage; Minor: degraded but functional
Priority | Business urgency of resolution | Determined by business rules (SLA, customer tier) | Priority 1: VIP customer with SLA breach risk; Priority 3: residential, no SLA
Priority Calculation
Priority is typically calculated from a combination of severity and business context. A minor issue (severity: Low) for a VIP enterprise customer with strict SLAs might be Priority 1, while a critical issue (severity: Critical) for a residential customer with no SLA might be Priority 2. The priority determines the response and resolution time targets.
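One way such business rules might look in code is sketched below. The tier names and the severity-to-priority matrix are assumptions for illustration; TMF621 does not prescribe a calculation, only the two distinct fields.

```python
# Illustrative priority calculation combining severity with business context.
# Tiers and thresholds are assumed, not part of the TMF621 specification.
def calculate_priority(severity: str, customer_tier: str, has_sla: bool) -> int:
    base = {"Critical": 1, "Major": 2, "Minor": 3, "Low": 4}[severity]
    if customer_tier == "VIP" and has_sla:
        return 1                    # SLA breach risk overrides raw severity
    if not has_sla:
        base = min(base + 1, 4)     # no SLA commitment: lower urgency
    return base

# A low-severity issue for a VIP customer with an SLA is still Priority 1,
# while a critical issue for a residential customer without one is Priority 2.
vip = calculate_priority("Low", "VIP", has_sla=True)
res = calculate_priority("Critical", "Residential", has_sla=False)
```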

TMF656: Service Problem Management

TMF656 -- Service Problem Management
TMF656 manages service problems -- the technical root causes behind one or more trouble tickets. While a trouble ticket captures a symptom reported by a customer, a service problem captures the underlying issue in the service/resource layer. Multiple trouble tickets may be correlated to a single service problem.

The relationship between trouble tickets and service problems is many-to-one. When an OLT fails, 200 customers might report issues (200 trouble tickets), but there is only one service problem (OLT failure). Service problem management is about correlation and root cause identification.

TMF656 Key Resources

TMF656 Core Resources

Resource | Purpose | Key Attributes
ServiceProblem | A diagnosed technical issue affecting services | description, category, priority, status, reason, impactImportanceFactor, originatingSystem, affectedService, associatedTroubleTicket, underlyingProblem, rootCauseResource, correlatedAlarm

Ticket-to-Problem Correlation

From Trouble Ticket to Service Problem

1
Tickets Arrive
TMF621 -- Trouble Ticket

Multiple trouble tickets are created via TMF621 from different customers reporting similar symptoms (e.g., "internet down", "slow connection").

2
Pattern Detection
Correlation Engine

The assurance platform detects a pattern: multiple tickets in the same geographic area, affecting the same service type, within a short time window.

3
Service Problem Created
TMF656 -- Service Problem

A ServiceProblem is created via TMF656, linking to all related trouble tickets via the associatedTroubleTicket attribute.

4
Root Cause Identified
TMF656 + TMF639

Investigation identifies the root cause (e.g., OLT card failure). The rootCauseResource attribute is set to the failed resource (from TMF639).

5
Resolution Applied
Field Operations

The root cause is fixed (e.g., OLT card replaced). The service problem is resolved, which triggers resolution of all associated trouble tickets.

6
Tickets Resolved
TMF621 -- Trouble Ticket

All correlated trouble tickets are moved to "Resolved" state. Customers are notified that their issue has been fixed.
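Steps 2 and 3 of this flow can be sketched as a simple grouping exercise. The associatedTroubleTicket attribute follows TMF656; the area/service-type grouping key is a deliberate simplification of what a real correlation engine does (which typically also uses topology and time windows).

```python
# Minimal correlation sketch: tickets sharing an area and service type are
# grouped, and each group of two or more becomes one ServiceProblem.
from collections import defaultdict

tickets = [
    {"id": "TT-1", "area": "Zone-7", "serviceType": "broadband"},
    {"id": "TT-2", "area": "Zone-7", "serviceType": "broadband"},
    {"id": "TT-3", "area": "Zone-9", "serviceType": "voice"},
]

groups = defaultdict(list)
for t in tickets:
    groups[(t["area"], t["serviceType"])].append(t["id"])

problems = [
    {"description": f"Suspected fault: {stype} in {area}",
     "associatedTroubleTicket": [{"id": tid} for tid in tids]}
    for (area, stype), tids in groups.items()
    if len(tids) >= 2
]
```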

Advanced

Alarm Correlation

Service problems can be created proactively from network alarms before customers report issues. TMF656 supports correlation of alarms to service problems via the correlatedAlarm attribute.

In a proactive assurance model, the sequence is reversed:

  1. Network monitoring detects an alarm (e.g., OLT port down).
  2. The correlation engine creates a ServiceProblem linked to the alarm.
  3. Service inventory (TMF638) is queried to identify affected services and customers.
  4. Proactive trouble tickets are created for affected customers.
  5. Customers receive notification before they even notice the issue.

This proactive approach significantly improves customer experience and reduces inbound call volume.
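The proactive sequence can be sketched end to end as below. The inventory lookup is a stub standing in for the TMF638 impact query, and all identifiers are invented for illustration.

```python
# Proactive flow sketch: alarm -> ServiceProblem -> proactive tickets.
# affected_customers() stubs the TMF638 service-inventory impact query.
def affected_customers(resource_id: str) -> list[str]:
    impact_map = {"OLT-PORT-3": ["CUST-1", "CUST-2"]}   # assumed data
    return impact_map.get(resource_id, [])

def on_alarm(alarm: dict) -> dict:
    # Step 2: create a ServiceProblem linked to the alarm (TMF656)
    problem = {
        "description": f"Alarm: {alarm['probableCause']}",
        "correlatedAlarm": [{"id": alarm["id"]}],
        "status": "acknowledged",
    }
    # Steps 3-4: query impact, create one proactive ticket per customer (TMF621)
    problem["proactiveTickets"] = [
        {"ticketType": "Proactive", "relatedParty": [{"id": c, "role": "Customer"}]}
        for c in affected_customers(alarm["resourceId"])
    ]
    return problem

p = on_alarm({"id": "ALM-9", "probableCause": "portDown", "resourceId": "OLT-PORT-3"})
```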

TMF657: Service Quality Management

TMF657 -- Service Quality Management
TMF657 manages service quality objectives and their measurement. It provides APIs for defining service level objectives (SLOs), tracking service quality metrics, and detecting SLA/SLO violations. It is the API that connects performance monitoring to business commitments.

While TMF621 is reactive (responding to reported issues) and TMF656 focuses on root cause analysis, TMF657 is about continuous monitoring and compliance. It answers: "Are we meeting our service quality promises?"

TMF657 Key Concepts

TMF657 Core Concepts

Concept | Definition | Example
Service Level Specification | A template defining quality parameters for a service type | Broadband SLS: availability >= 99.5%, max latency <= 30ms
Service Level Objective (SLO) | A specific quality target for a metric | SLO: monthly availability >= 99.5% measured per customer
Service Level Agreement (SLA) | A contract committing to specific service levels | Enterprise SLA: 99.99% availability, 4-hour MTTR
Service Quality Rule | A rule that evaluates quality metrics against SLOs | IF availability < 99.5% THEN raise SLO violation event

SLA Monitoring Flow

SLA Monitoring with TMF657

1
Define SLO
TMF657 -- Service Quality

Create service level objectives via TMF657 API: target availability, latency thresholds, MTTR targets, mapped to specific service specifications.

2
Collect Metrics
Performance Management

Performance monitoring systems continuously collect quality metrics (availability, latency, throughput, error rates) per service instance.

3
Evaluate Rules
TMF657 -- Service Quality

TMF657 evaluates collected metrics against defined SLOs. When a metric breaches a threshold, a service quality violation event is raised.

4
Trigger Actions
Assurance Orchestration

SLO violations can trigger automated actions: create a service problem (TMF656), escalate a trouble ticket (TMF621), or notify account management.
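Step 3 of the flow above amounts to a rules check of metrics against thresholds. The sketch below uses the Broadband SLS thresholds from the concepts table; the violation-event shape is illustrative, not a TMF657 schema.

```python
# SLO rule evaluation sketch: compare collected metrics to thresholds and
# emit a violation event for each breach. Thresholds match the Broadband
# SLS example (availability >= 99.5%, latency <= 30ms).
slos = {
    "availability": ("min", 99.5),   # must stay at or above
    "latencyMs": ("max", 30),        # must stay at or below
}

def evaluate(metrics: dict) -> list[dict]:
    violations = []
    for name, (kind, threshold) in slos.items():
        value = metrics.get(name)
        if value is None:
            continue
        breached = value < threshold if kind == "min" else value > threshold
        if breached:
            violations.append({"slo": name, "value": value, "threshold": threshold})
    return violations

events = evaluate({"availability": 99.2, "latencyMs": 25})
```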

How Assurance APIs Work Together

The three assurance APIs form a coherent domain that covers the full spectrum of fault management and service quality:

The reactive flow starts with a customer complaint and works inward to find the root cause:

  1. Customer reports issue -> TMF621 TroubleTicket created
  2. Triage identifies potential service impact -> TMF656 ServiceProblem investigated
  3. Root cause identified in resource layer -> TMF639 Resource queried
  4. Fix applied -> ServiceProblem resolved -> TroubleTicket resolved -> Customer notified

Additional Assurance APIs

Beyond the three core assurance APIs, TM Forum defines several related APIs that support the broader assurance domain:

Related Assurance APIs

API | Name | Purpose
TMF642 | Alarm Management | Manage network alarms and events from monitoring systems
TMF650 | Performance Management | Collect and manage performance data from network elements
TMF628 | Performance Threshold | Define thresholds that trigger alarms when performance degrades
TMF649 | Performance Monitoring | Continuous monitoring of network and service performance
The Assurance Stack
Think of the assurance APIs as a stack: TMF642/TMF650 at the bottom handle raw network data (alarms, performance). TMF656 in the middle handles correlation and root cause analysis. TMF621 at the top handles customer-facing issue management. TMF657 runs alongside all layers, continuously evaluating quality against commitments. Each layer adds business context to raw technical data.

Assurance API Source of Record

Assurance Entities -- Source of Record

Entity | System of Record | System of Engagement | Notes
Trouble Tickets | Trouble Ticket System (TMF621) | Contact Centre / Self-Service Portal | Customer-reported and proactive issues
Service Problems | Service Problem Manager (TMF656) | NOC / Assurance Platform | Correlated root cause records
Service Quality/SLO | Service Quality Manager (TMF657) | SLA Management Dashboard | SLO definitions and violation tracking
Network Alarms | Alarm Management (TMF642) | NOC Dashboard | Raw network fault events
Performance Metrics | Performance Management (TMF650) | Analytics / Reporting | Collected performance data

SLA Management in Practice

A Service Level Agreement (SLA) is a contract between the telco and a customer that guarantees specific service quality levels. For example: "We guarantee 99.9% availability and will respond to faults within 4 hours." If the telco fails to meet these commitments, the customer may be entitled to credits or penalties.

An SLA typically contains several Service Level Objectives (SLOs), each defining a measurable quality target:

  • Availability SLO: Service uptime percentage per month (e.g., 99.95%)
  • Latency SLO: Maximum packet delay (e.g., < 20ms within network)
  • Jitter SLO: Maximum delay variation (e.g., < 5ms)
  • MTTR SLO: Maximum time to restore service after a fault (e.g., 4 hours)
  • MTTF SLO: Minimum time between failures (e.g., > 30 days)
  • Packet Loss SLO: Maximum acceptable packet loss (e.g., < 0.01%)
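An availability SLO becomes concrete once translated into a downtime budget. The arithmetic is straightforward: the allowed downtime is the total period multiplied by the permitted unavailability.

```python
# Availability SLO -> monthly downtime budget. For a 99.95% SLO over a
# 30-day month: 43,200 minutes * 0.05% = 21.6 minutes of allowed downtime.
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

budget = downtime_budget_minutes(99.95)
```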

Mature operators automate the full SLA lifecycle using TMF657:

  • SLO definitions are created via TMF657 API, linked to service specifications in TMF633
  • Performance data is collected from TMF650 and correlated to specific service instances in TMF638
  • TMF657 rules engine continuously evaluates metrics against SLOs
  • SLO violations trigger events that can auto-create service problems (TMF656) and trouble tickets (TMF621)
  • SLA compliance reports are generated automatically for customer-facing reporting
  • SLA breach credits can be calculated automatically and fed to billing systems
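The last step, automatic breach-credit calculation, might be sketched as follows. The tiered credit schedule here is an invented example; real schedules are defined in each customer's SLA contract, and the computed percentage would be fed to billing.

```python
# Illustrative breach-credit rule: credit (as % of the monthly fee) grows
# with the availability shortfall. The tiers are assumptions, not a standard.
def sla_credit_pct(measured_availability: float, committed: float = 99.9) -> float:
    if measured_availability >= committed:
        return 0.0
    shortfall = committed - measured_availability
    if shortfall <= 0.1:
        return 5.0    # small miss
    if shortfall <= 0.5:
        return 10.0   # moderate miss
    return 25.0       # severe miss

no_credit = sla_credit_pct(99.95)   # met the commitment
credit = sla_credit_pct(99.7)       # 0.2 points short
```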

Common Assurance Pitfalls

Pitfall: No Ticket-to-Problem Correlation
Without service problem management (TMF656), every trouble ticket is treated as an isolated incident. When 200 customers are affected by the same OLT failure, the contact centre creates 200 independent tickets, each investigated separately. Service problem management correlates these into a single root cause, dramatically improving resolution efficiency.
Pitfall: Reactive-Only Assurance
Operators that only react to customer complaints miss the opportunity for proactive resolution. Modern assurance should detect issues via alarms and performance monitoring, create service problems automatically, and notify affected customers before they call. This requires integration between TMF642 (Alarms), TMF656 (Service Problems), and TMF621 (Trouble Tickets).
Pitfall: SLAs Without Measurement
Committing to SLAs without implementing TMF657-style quality monitoring means you have no way to know if you are meeting your commitments until a customer complains. Worse, you cannot proactively manage quality to prevent SLA breaches. SLOs must be measured continuously, not just reported after the fact.

Section 5.5 Key Takeaways

  • TMF621 (Trouble Tickets) captures customer-reported and proactive issues with a standard state machine
  • TMF656 (Service Problems) correlates multiple tickets to a single root cause, enabling efficient resolution
  • TMF657 (Service Quality) defines SLOs and continuously monitors compliance
  • Severity measures technical impact; priority measures business urgency -- they are not the same
  • The reactive flow starts from customer complaints; the proactive flow starts from alarms and monitoring
  • Alarm correlation (TMF656 + TMF642) enables proactive customer notification before complaints arise
  • SLA management requires both definition (SLOs) and continuous measurement (TMF657 + TMF650)
  • Assurance APIs integrate with inventory APIs (TMF638, TMF639) for impact analysis