Trouble Ticket & SLA APIs
Trouble Ticket & Assurance APIs
While Modules 5.2 through 5.4 covered the Lead-to-Cash domain (catalogs, orders, inventory), this section focuses on the Trouble-to-Resolve domain. Assurance APIs handle what happens when things go wrong: customer complaints, service problems, and quality degradation. These APIs are the backbone of telco fault management and SLA monitoring.
The key assurance APIs are TMF621 (Trouble Ticket Management), TMF656 (Service Problem Management), and TMF657 (Service Quality Management). Together, they provide a structured approach to detecting, diagnosing, and resolving service issues.
TMF621: Trouble Ticket Management
Trouble tickets are the customer-facing entry point for the assurance process. Whether a customer calls the contact centre to report "my internet is down" or submits an issue through a self-service portal, the result is a trouble ticket created via TMF621.
TMF621 Key Resources
TMF621 Core Resources
| Resource | Purpose | Key Attributes |
|---|---|---|
| TroubleTicket | A record of a reported issue | name, description, severity, priority, status, statusChangeDate, statusChangeReason, ticketType, channel, relatedEntity, relatedParty, note, attachment, troubleTicketCharacteristic |
Trouble Ticket State Machine
TMF621 Trouble Ticket States
| State | Meaning | Next Possible States |
|---|---|---|
| Submitted | Ticket created and awaiting triage | Acknowledged, Rejected |
| Acknowledged | Ticket accepted by the support team | InProgress, Pending |
| InProgress | Actively being investigated or resolved | Resolved, Pending, Cancelled |
| Pending | Waiting for customer input or external action | InProgress, Cancelled |
| Resolved | Fix applied, awaiting confirmation | Closed, InProgress (re-opened) |
| Closed | Issue confirmed resolved, ticket completed | InProgress (re-opened, if within policy) |
| Rejected | Ticket rejected during triage (duplicate, invalid) | Terminal state |
| Cancelled | Ticket cancelled by customer or system | Terminal state |
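The state table above can be read as a transition map. The following is a minimal sketch, assuming the states and transitions are exactly those listed in the table; the function names `can_transition` and `apply_transition` are illustrative and are not part of TMF621.

```python
# Allowed transitions, transcribed from the state table above.
TRANSITIONS = {
    "Submitted": {"Acknowledged", "Rejected"},
    "Acknowledged": {"InProgress", "Pending"},
    "InProgress": {"Resolved", "Pending", "Cancelled"},
    "Pending": {"InProgress", "Cancelled"},
    "Resolved": {"Closed", "InProgress"},  # InProgress = re-opened
    "Closed": {"InProgress"},              # re-open, subject to policy
    "Rejected": set(),                     # terminal
    "Cancelled": set(),                    # terminal
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the table allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def apply_transition(current: str, target: str) -> str:
    """Return the new state, or raise ValueError for a disallowed move."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

A ticket system would typically run a check like this before accepting a status change, so that a ticket cannot, for example, jump from Submitted straight to Resolved.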
TMF621 Operations
- POST /troubleTicket -- Create a new trouble ticket
- GET /troubleTicket/{id} -- Retrieve ticket details
- GET /troubleTicket -- List/query tickets with filtering (by status, priority, customer)
- PATCH /troubleTicket/{id} -- Update ticket (add notes, change status, reassign)
- DELETE /troubleTicket/{id} -- Delete a ticket (rarely used in production)
- POST /hub -- Register for event notifications (state changes, updates)
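As a rough illustration of the create and update operations, the sketch below builds the HTTP requests without sending them. The base URL is a placeholder assumption, the helper names are hypothetical, and the payload uses only a few of the TroubleTicket attributes listed earlier; a real implementation would follow the operator's deployed API version and its mandatory fields.

```python
import json
import urllib.request

# Hypothetical base URL; real deployments define their own host and version path.
BASE_URL = "https://api.example.com/tmf-api/troubleTicket/v4"


def build_create_ticket_request(name, description, severity, priority):
    """Build (but do not send) a POST /troubleTicket request."""
    payload = {
        "name": name,
        "description": description,
        "severity": severity,
        "priority": priority,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/troubleTicket",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_status_patch_request(ticket_id, status, reason):
    """Build (but do not send) a PATCH /troubleTicket/{id} status update."""
    payload = {"status": status, "statusChangeReason": reason}
    return urllib.request.Request(
        url=f"{BASE_URL}/troubleTicket/{ticket_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


req = build_create_ticket_request(
    "Internet down", "No connectivity since 09:00", "Critical", "1"
)
print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out here because the endpoint is fictional.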
Severity vs Priority
TMF621 distinguishes between severity (the technical impact of the issue) and priority (the business urgency of resolution). These are not the same thing.
| Concept | Definition | Who Sets It | Example |
|---|---|---|---|
| Severity | Technical impact of the issue | Determined by impact analysis | Critical: total service outage; Minor: degraded but functional |
| Priority | Business urgency of resolution | Determined by business rules (SLA, customer tier) | Priority 1: VIP customer with SLA breach risk; Priority 3: residential, no SLA |
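To make the distinction concrete, here is a small sketch in which severity is an input and priority is derived from it together with business context. The rule set, the tier name "VIP", and the numeric priority scale are assumptions chosen to mirror the examples in the table; they are not defined by TMF621.

```python
# Hypothetical business rules: tiers, severity names, and priority numbers are
# illustrative, loosely following the examples in the table above.
def assign_priority(severity: str, customer_tier: str, has_sla: bool) -> int:
    """Derive a business priority (1 = most urgent) from impact and context."""
    if has_sla and customer_tier == "VIP":
        return 1  # SLA breach risk for a VIP customer dominates
    if severity == "Critical":
        return 1 if has_sla else 2
    if has_sla:
        return 2
    return 3  # e.g. residential, no SLA


# Same severity, different priority:
print(assign_priority("Minor", "VIP", True))           # VIP with SLA
print(assign_priority("Minor", "Residential", False))  # residential, no SLA
```

The point of the sketch is that two tickets with identical severity can legitimately end up with different priorities once customer tier and SLA exposure are taken into account.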
TMF656: Service Problem Management
The relationship between trouble tickets and service problems is many-to-one. When an OLT fails, 200 customers might report issues (200 trouble tickets), but there is only one service problem (OLT failure). Service problem management is about correlation and root cause identification.
TMF656 Key Resources
TMF656 Core Resources
| Resource | Purpose | Key Attributes |
|---|---|---|
| ServiceProblem | A diagnosed technical issue affecting services | description, category, priority, status, reason, impactImportanceFactor, originatingSystem, affectedService, associatedTroubleTicket, underlyingProblem, rootCauseResource, correlatedAlarm |
Ticket-to-Problem Correlation
From Trouble Ticket to Service Problem
- Tickets Arrive (TMF621 -- Trouble Ticket): Multiple trouble tickets are created via TMF621 from different customers reporting similar symptoms (e.g., "internet down", "slow connection").
- Pattern Detection (Correlation Engine): The assurance platform detects a pattern: multiple tickets in the same geographic area, affecting the same service type, within a short time window.
- Service Problem Created (TMF656 -- Service Problem): A ServiceProblem is created via TMF656, linking to all related trouble tickets via the associatedTroubleTicket attribute.
- Root Cause Identified (TMF656 + TMF639): Investigation identifies the root cause (e.g., OLT card failure). The rootCauseResource attribute is set to the failed resource (from TMF639).
- Resolution Applied (Field Operations): The root cause is fixed (e.g., OLT card replaced). The service problem is resolved, which triggers resolution of all associated trouble tickets.
- Tickets Resolved (TMF621 -- Trouble Ticket): All correlated trouble tickets are moved to the "Resolved" state. Customers are notified that their issue has been fixed.
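The pattern-detection step can be sketched as a simple grouping by area and service type within a time window. This is a deliberately simplified illustration, not how any particular correlation engine works: the ticket fields `area`, `service_type`, and `created` are hypothetical, the window is measured from the earliest ticket in each group rather than sliding, and the thresholds are arbitrary.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlate(tickets, window=timedelta(minutes=30), min_tickets=3):
    """Group tickets by (area, service_type) and emit candidate service problems.

    A group qualifies when at least `min_tickets` tickets fall within `window`
    of the earliest ticket in the group.
    """
    groups = defaultdict(list)
    for t in tickets:
        groups[(t["area"], t["service_type"])].append(t)

    problems = []
    for (area, service_type), group in sorted(groups.items()):
        group.sort(key=lambda t: t["created"])
        start = group[0]["created"]
        in_window = [t for t in group if t["created"] - start <= window]
        if len(in_window) >= min_tickets:
            problems.append({
                "description": f"Suspected {service_type} fault in {area}",
                # Link back to the tickets, as in the ServiceProblem resource.
                "associatedTroubleTicket": [{"id": t["id"]} for t in in_window],
            })
    return problems


t0 = datetime(2024, 1, 1, 9, 0)
tickets = [
    {"id": "T1", "area": "North", "service_type": "broadband", "created": t0},
    {"id": "T2", "area": "North", "service_type": "broadband", "created": t0 + timedelta(minutes=5)},
    {"id": "T3", "area": "North", "service_type": "broadband", "created": t0 + timedelta(minutes=12)},
    {"id": "T4", "area": "South", "service_type": "broadband", "created": t0 + timedelta(minutes=3)},
]
problems = correlate(tickets)
```

With this sample data, the three North tickets form one candidate service problem, while the single South ticket stays an ordinary trouble ticket.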
Alarm Correlation
Service problems can be created proactively from network alarms before customers report issues. TMF656 supports correlation of alarms to service problems via the correlatedAlarm attribute.
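A proactive service problem might be assembled from an alarm roughly as follows. The ServiceProblem attribute names come from the resource table above; the alarm fields (`id`, `probableCause`, `resourceId`) and the `originatingSystem` value are assumptions made for this sketch and would depend on the alarm source in use.

```python
def service_problem_from_alarm(alarm: dict, affected_service_ids: list) -> dict:
    """Build a ServiceProblem payload for an alarm, before any ticket exists."""
    return {
        "description": f"Proactive: {alarm['probableCause']} on {alarm['resourceId']}",
        "originatingSystem": "alarm-correlation",  # hypothetical system name
        "correlatedAlarm": [{"id": alarm["id"]}],
        "affectedService": [{"id": sid} for sid in affected_service_ids],
        "associatedTroubleTicket": [],  # no customer has reported yet
    }


# Hypothetical alarm record; field names are illustrative only.
alarm = {"id": "ALM-77", "probableCause": "card failure", "resourceId": "OLT-12"}
problem = service_problem_from_alarm(alarm, ["svc-001", "svc-002"])
```

Tickets that arrive later for the same services could then be appended to `associatedTroubleTicket` instead of triggering a new investigation.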
TMF657: Service Quality Management
While TMF621 is reactive (responding to reported issues) and TMF656 focuses on root cause analysis, TMF657 is about continuous monitoring and compliance. It answers: "Are we meeting our service quality promises?"
TMF657 Key Concepts
TMF657 Core Concepts
| Concept | Definition | Example |
|---|---|---|
| Service Level Specification | A template defining quality parameters for a service type | Broadband SLS: availability >= 99.5%, max latency <= 30ms |
| Service Level Objective (SLO) | A specific quality target for a metric | SLO: monthly availability >= 99.5% measured per customer |
| Service Level Agreement (SLA) | A contract committing to specific service levels | Enterprise SLA: 99.99% availability, 4-hour MTTR |
| Service Quality Rule | A rule that evaluates quality metrics against SLOs | IF availability < 99.5% THEN raise SLO violation event |
SLA Monitoring Flow
SLA Monitoring with TMF657
- Define SLO (TMF657 -- Service Quality): Create service level objectives via the TMF657 API: target availability, latency thresholds, and MTTR targets, mapped to specific service specifications.
- Collect Metrics (Performance Management): Performance monitoring systems continuously collect quality metrics (availability, latency, throughput, error rates) per service instance.
- Evaluate Rules (TMF657 -- Service Quality): TMF657 evaluates collected metrics against defined SLOs. When a metric breaches a threshold, a service quality violation event is raised.
- Trigger Actions (Assurance Orchestration): SLO violations can trigger automated actions: create a service problem (TMF656), escalate a trouble ticket (TMF621), or notify account management.
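The rule-evaluation step can be illustrated with a few lines of code. This is a sketch under stated assumptions: the SLO structure, metric names, and event shape are invented for illustration, and the targets reuse the example values from the TMF657 concepts table (availability >= 99.5%, latency <= 30 ms).

```python
import operator

# Comparison that must hold for the SLO to be met.
OPS = {">=": operator.ge, "<=": operator.le}

# Hypothetical SLO definitions using the example targets from the concepts table.
SLOS = [
    {"metric": "availability_pct", "op": ">=", "target": 99.5},
    {"metric": "latency_ms", "op": "<=", "target": 30.0},
]


def evaluate(service_id: str, metrics: dict, slos=SLOS) -> list:
    """Return one violation event per SLO whose target is not met."""
    violations = []
    for slo in slos:
        value = metrics.get(slo["metric"])
        if value is None:
            continue  # no measurement collected for this metric
        if not OPS[slo["op"]](value, slo["target"]):
            violations.append({
                "serviceId": service_id,
                "metric": slo["metric"],
                "measured": value,
                "target": slo["target"],
            })
    return violations


events = evaluate("svc-001", {"availability_pct": 99.2, "latency_ms": 18.0})
```

Here the latency target is met but availability is not, so a single violation event is produced; an orchestration layer could map such an event to a TMF656 service problem or a TMF621 escalation.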
How Assurance APIs Work Together
The three assurance APIs form a coherent domain that covers the full spectrum of fault management and service quality.
The reactive flow starts with a customer complaint and works inward to find the root cause:
- Customer reports issue -> TMF621 TroubleTicket created
- Triage identifies potential service impact -> TMF656 ServiceProblem investigated
- Root cause identified in resource layer -> TMF639 Resource queried
- Fix applied -> ServiceProblem resolved -> TroubleTicket resolved -> Customer notified
Additional Assurance APIs
Beyond the three core assurance APIs, TM Forum defines several related APIs that support the broader assurance domain:
Related Assurance APIs
| API | Name | Purpose |
|---|---|---|
| TMF642 | Alarm Management | Manage network alarms and events from monitoring systems |
| TMF650 | Performance Management | Collect and manage performance data from network elements |
| TMF628 | Performance Threshold | Define thresholds that trigger alarms when performance degrades |
| TMF649 | Performance Monitoring | Continuous monitoring of network and service performance |
Assurance API Source of Record
Assurance Entities -- Source of Record
| Entity | System of Record | System of Engagement | System of Reference | Notes |
|---|---|---|---|---|
| Trouble Tickets | Trouble Ticket System (TMF621) | Contact Centre / Self-Service Portal | — | Customer-reported and proactive issues |
| Service Problems | Service Problem Manager (TMF656) | NOC / Assurance Platform | — | Correlated root cause records |
| Service Quality/SLO | Service Quality Manager (TMF657) | SLA Management Dashboard | — | SLO definitions and violation tracking |
| Network Alarms | Alarm Management (TMF642) | NOC Dashboard | — | Raw network fault events |
| Performance Metrics | Performance Management (TMF650) | Analytics / Reporting | — | Collected performance data |
SLA Management in Practice
A Service Level Agreement (SLA) is a contract between the telco and a customer that guarantees specific service quality levels. For example: "We guarantee 99.9% availability and will respond to faults within 4 hours." If the telco fails to meet these commitments, the customer may be entitled to service credits or penalty payments.
An SLA typically contains several Service Level Objectives (SLOs), each defining a measurable quality target:
- Availability SLO: Service uptime percentage per month (e.g., 99.95%)
- Latency SLO: Maximum packet delay (e.g., < 20ms within network)
- Jitter SLO: Maximum delay variation (e.g., < 5ms)
- MTTR SLO: Maximum time to restore service after a fault (e.g., 4 hours)
- MTBF SLO: Minimum mean time between failures (e.g., > 30 days)
- Packet Loss SLO: Maximum acceptable packet loss (e.g., < 0.01%)
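An availability percentage is easier to reason about when converted into a downtime budget. The sketch below shows the arithmetic, assuming a 30-day month and that availability is measured as uptime over total elapsed minutes; real contracts often exclude planned maintenance windows, which this ignores.

```python
def downtime_budget_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime allowed per month at the given availability target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Availability achieved in a month with the given total downtime."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


print(round(downtime_budget_minutes(99.95), 1))
print(round(measured_availability_pct(60), 3))
```

A 30-day month has 43,200 minutes, so a 99.95% target leaves about 21.6 minutes of allowed downtime, and a single one-hour outage brings measured availability down to roughly 99.861%.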
Mature operators automate the full SLA lifecycle using TMF657:
- SLO definitions are created via TMF657 API, linked to service specifications in TMF633
- Performance data is collected from TMF650 and correlated to specific service instances in TMF638
- TMF657 rules engine continuously evaluates metrics against SLOs
- SLO violations trigger events that can auto-create service problems (TMF656) and trouble tickets (TMF621)
- SLA compliance reports are generated automatically for customer-facing reporting
- SLA breach credits can be calculated automatically and fed to billing systems
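The last step in the list, automatic credit calculation, might look like the following. The credit schedule here is entirely hypothetical; actual tiers, caps, and the basis for the fee come from each customer's contract.

```python
# Hypothetical credit schedule: (minimum availability %, credit % of monthly fee),
# checked from best to worst. Real schedules come from the contract.
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (95.0, 25)]
MAX_CREDIT_PCT = 50  # applied when availability is below every tier


def sla_credit(measured_availability_pct: float, monthly_fee: float) -> float:
    """Return the credit owed for the month under the schedule above."""
    for floor, credit_pct in CREDIT_TIERS:
        if measured_availability_pct >= floor:
            return monthly_fee * credit_pct / 100
    return monthly_fee * MAX_CREDIT_PCT / 100


print(sla_credit(99.5, 1000.0))
```

Under this illustrative schedule, a month at 99.5% availability on a 1,000-unit monthly fee falls in the 10% tier. The resulting amount would then be passed to billing as an adjustment.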
Common Assurance Pitfalls
Section 5.5 Key Takeaways
- TMF621 (Trouble Tickets) captures customer-reported and proactive issues with a standard state machine
- TMF656 (Service Problems) correlates multiple tickets to a single root cause, enabling efficient resolution
- TMF657 (Service Quality) defines SLOs and continuously monitors compliance
- Severity measures technical impact; priority measures business urgency -- they are not the same
- The reactive flow starts from customer complaints; the proactive flow starts from alarms and monitoring
- Alarm correlation (TMF656 + TMF642) enables proactive customer notification before complaints arise
- SLA management requires both definition (SLOs) and continuous measurement (TMF657 + TMF650)
- Assurance APIs integrate with inventory APIs (TMF638, TMF639) for impact analysis