BSS/OSS Academy
12 min read

Trouble Ticket & Assurance APIs

While Modules 5.2 through 5.4 covered the Lead-to-Cash domain (catalogs, orders, inventory), this section focuses on the Trouble-to-Resolve domain. Assurance APIs handle what happens when things go wrong: customer complaints, service problems, and quality degradation. These APIs are the backbone of telco fault management and SLA monitoring.

The key assurance APIs are TMF621 (Trouble Ticket Management), TMF656 (Service Problem Management), and TMF657 (Service Quality Management). Together, they provide a structured approach to detecting, diagnosing, and resolving service issues.

Trouble-to-Resolve in eTOM
In the TM Forum eTOM process framework, Trouble-to-Resolve is one of the core end-to-end operational processes. It spans from issue detection/reporting (via trouble tickets or automated alarms) through diagnosis and resolution to customer notification. The assurance APIs provide the system interfaces that support this process.

TMF621: Trouble Ticket Management

TMF621 -- Trouble Ticket Management
TMF621 manages trouble tickets -- formal records of issues reported by customers or detected by monitoring systems. A trouble ticket captures the problem description, affected service/product, priority, and tracks the resolution lifecycle from creation through investigation to closure.

Trouble tickets are the customer-facing entry point for the assurance process. Whether a customer calls the contact centre to report "my internet is down" or submits an issue through a self-service portal, the result is a trouble ticket created via TMF621.

TMF621 Key Resources

TMF621 Core Resources

Resource | Purpose | Key Attributes
TroubleTicket | A record of a reported issue | name, description, severity, priority, status, statusChangeDate, statusChangeReason, ticketType, channel, relatedEntity, relatedParty, note, attachment, troubleTicketCharacteristic

Trouble Ticket State Machine

TMF621 Trouble Ticket States

State | Meaning | Next Possible States
Submitted | Ticket created and awaiting triage | Acknowledged, Rejected
Acknowledged | Ticket accepted by the support team | InProgress, Pending
InProgress | Actively being investigated or resolved | Resolved, Pending, Cancelled
Pending | Waiting for customer input or external action | InProgress, Cancelled
Resolved | Fix applied, awaiting confirmation | Closed, InProgress (re-opened)
Closed | Issue confirmed resolved, ticket completed | InProgress (re-opened, if within policy)
Rejected | Ticket rejected during triage (duplicate, invalid) | Terminal state
Cancelled | Ticket cancelled by customer or system | Terminal state

TMF621 Operations

  • POST /troubleTicket -- Create a new trouble ticket
  • GET /troubleTicket/{id} -- Retrieve ticket details
  • GET /troubleTicket -- List/query tickets with filtering (by status, priority, customer)
  • PATCH /troubleTicket/{id} -- Update ticket (add notes, change status, reassign)
  • DELETE /troubleTicket/{id} -- Delete a ticket (rarely used in production)
  • POST /hub -- Register for event notifications (state changes, updates)
Trouble Ticket Creation Example
Customer calls: "My broadband is not working." Agent creates a ticket via POST /troubleTicket with: severity: "Medium", priority: "2 - High", ticketType: "Service Fault", description: "Customer reports no internet connectivity since 09:00 today", relatedEntity: [{ id: "PROD-12345", role: "AffectedProduct", @referredType: "Product" }], relatedParty: [{ id: "CUST-12345", role: "Customer" }]. The ticket enters "Submitted" state and triggers a TroubleTicketCreateEvent.
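The request body for the example above can be sketched as follows. Field names follow the TMF621 attributes listed in this section; the product and customer identifiers are the illustrative values from the example, not real data.

```python
# Sketch of the TMF621 create-ticket payload for "my broadband is not working".
# The server assigns the id and sets the initial status to "Submitted".
import json

ticket = {
    "severity": "Medium",
    "priority": "2 - High",
    "ticketType": "Service Fault",
    "description": "Customer reports no internet connectivity since 09:00 today",
    "relatedEntity": [
        {"id": "PROD-12345", "role": "AffectedProduct", "@referredType": "Product"}
    ],
    "relatedParty": [{"id": "CUST-12345", "role": "Customer"}],
}

# This JSON string would be the body of POST /troubleTicket
body = json.dumps(ticket)
```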

Severity vs Priority

TMF621 distinguishes between severity (the technical impact of the issue) and priority (the business urgency of resolution). These are not the same thing.

Severity vs Priority

Concept | Definition | Who Sets It | Example
Severity | Technical impact of the issue | Determined by impact analysis | Critical: total service outage; Minor: degraded but functional
Priority | Business urgency of resolution | Determined by business rules (SLA, customer tier) | Priority 1: VIP customer with SLA breach risk; Priority 3: residential, no SLA
Priority Calculation
Priority is typically calculated from a combination of severity and business context. A minor issue (severity: Low) for a VIP enterprise customer with strict SLAs might be Priority 1, while a critical issue (severity: Critical) for a residential customer with no SLA might be Priority 2. The priority determines the response and resolution time targets.
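One way such business rules might look in code is sketched below. The tier names and the severity-to-priority matrix are assumptions for illustration; TMF621 does not prescribe a calculation, only the two distinct fields.

```python
# Illustrative priority calculation combining severity with business context.
# Tiers and thresholds are assumed, not part of the TMF621 specification.
def calculate_priority(severity: str, customer_tier: str, has_sla: bool) -> int:
    base = {"Critical": 1, "Major": 2, "Minor": 3, "Low": 4}[severity]
    if customer_tier == "VIP" and has_sla:
        return 1                    # SLA breach risk overrides raw severity
    if not has_sla:
        base = min(base + 1, 4)     # no SLA commitment: lower urgency
    return base

# A low-severity issue for a VIP customer with an SLA is still Priority 1,
# while a critical issue for a residential customer without one is Priority 2.
vip = calculate_priority("Low", "VIP", has_sla=True)
res = calculate_priority("Critical", "Residential", has_sla=False)
```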

TMF656: Service Problem Management

TMF656 -- Service Problem Management
TMF656 manages service problems -- the technical root causes behind one or more trouble tickets. While a trouble ticket captures a symptom reported by a customer, a service problem captures the underlying issue in the service/resource layer. Multiple trouble tickets may be correlated to a single service problem.

The relationship between trouble tickets and service problems is many-to-one. When an OLT fails, 200 customers might report issues (200 trouble tickets), but there is only one service problem (OLT failure). Service problem management is about correlation and root cause identification.

TMF656 Key Resources

TMF656 Core Resources

Resource | Purpose | Key Attributes
ServiceProblem | A diagnosed technical issue affecting services | description, category, priority, status, reason, impactImportanceFactor, originatingSystem, affectedService, associatedTroubleTicket, underlyingProblem, rootCauseResource, correlatedAlarm

Ticket-to-Problem Correlation

From Trouble Ticket to Service Problem

1
Tickets Arrive
TMF621 -- Trouble Ticket

Multiple trouble tickets are created via TMF621 from different customers reporting similar symptoms (e.g., "internet down", "slow connection").

2
Pattern Detection
Correlation Engine

The assurance platform detects a pattern: multiple tickets in the same geographic area, affecting the same service type, within a short time window.

3
Service Problem Created
TMF656 -- Service Problem

A ServiceProblem is created via TMF656, linking to all related trouble tickets via the associatedTroubleTicket attribute.

4
Root Cause Identified
TMF656 + TMF639

Investigation identifies the root cause (e.g., OLT card failure). The rootCauseResource attribute is set to the failed resource (from TMF639).

5
Resolution Applied
Field Operations

The root cause is fixed (e.g., OLT card replaced). The service problem is resolved, which triggers resolution of all associated trouble tickets.

6
Tickets Resolved
TMF621 -- Trouble Ticket

All correlated trouble tickets are moved to "Resolved" state. Customers are notified that their issue has been fixed.
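Steps 2 and 3 of this flow can be sketched as a simple grouping exercise. The associatedTroubleTicket attribute follows TMF656; the area/service-type grouping key is a deliberate simplification of what a real correlation engine does (which typically also uses topology and time windows).

```python
# Minimal correlation sketch: tickets sharing an area and service type are
# grouped, and each group of two or more becomes one ServiceProblem.
from collections import defaultdict

tickets = [
    {"id": "TT-1", "area": "Zone-7", "serviceType": "broadband"},
    {"id": "TT-2", "area": "Zone-7", "serviceType": "broadband"},
    {"id": "TT-3", "area": "Zone-9", "serviceType": "voice"},
]

groups = defaultdict(list)
for t in tickets:
    groups[(t["area"], t["serviceType"])].append(t["id"])

problems = [
    {"description": f"Suspected fault: {stype} in {area}",
     "associatedTroubleTicket": [{"id": tid} for tid in tids]}
    for (area, stype), tids in groups.items()
    if len(tids) >= 2
]
```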

Advanced

Alarm Correlation

Service problems can be created proactively from network alarms before customers report issues. TMF656 supports correlation of alarms to service problems via the correlatedAlarm attribute.

In a proactive assurance model, the sequence is reversed:

  1. Network monitoring detects an alarm (e.g., OLT port down).
  2. The correlation engine creates a ServiceProblem linked to the alarm.
  3. Service inventory (TMF638) is queried to identify affected services and customers.
  4. Proactive trouble tickets are created for affected customers.
  5. Customers receive notification before they even notice the issue.

This proactive approach significantly improves customer experience and reduces inbound call volume.
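The proactive sequence can be sketched end to end as below. The inventory lookup is a stub standing in for the TMF638 impact query, and all identifiers are invented for illustration.

```python
# Proactive flow sketch: alarm -> ServiceProblem -> proactive tickets.
# affected_customers() stubs the TMF638 service-inventory impact query.
def affected_customers(resource_id: str) -> list[str]:
    impact_map = {"OLT-PORT-3": ["CUST-1", "CUST-2"]}   # assumed data
    return impact_map.get(resource_id, [])

def on_alarm(alarm: dict) -> dict:
    # Step 2: create a ServiceProblem linked to the alarm (TMF656)
    problem = {
        "description": f"Alarm: {alarm['probableCause']}",
        "correlatedAlarm": [{"id": alarm["id"]}],
        "status": "acknowledged",
    }
    # Steps 3-4: query impact, create one proactive ticket per customer (TMF621)
    problem["proactiveTickets"] = [
        {"ticketType": "Proactive", "relatedParty": [{"id": c, "role": "Customer"}]}
        for c in affected_customers(alarm["resourceId"])
    ]
    return problem

p = on_alarm({"id": "ALM-9", "probableCause": "portDown", "resourceId": "OLT-PORT-3"})
```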

TMF657: Service Quality Management

TMF657 -- Service Quality Management
TMF657 manages service quality objectives and their measurement. It provides APIs for defining service level objectives (SLOs), tracking service quality metrics, and detecting SLA/SLO violations. It is the API that connects performance monitoring to business commitments.

While TMF621 is reactive (responding to reported issues) and TMF656 focuses on root cause analysis, TMF657 is about continuous monitoring and compliance. It answers: "Are we meeting our service quality promises?"

TMF657 Key Concepts

TMF657 Core Concepts

Concept | Definition | Example
Service Level Specification | A template defining quality parameters for a service type | Broadband SLS: availability >= 99.5%, max latency <= 30ms
Service Level Objective (SLO) | A specific quality target for a metric | SLO: monthly availability >= 99.5% measured per customer
Service Level Agreement (SLA) | A contract committing to specific service levels | Enterprise SLA: 99.99% availability, 4-hour MTTR
Service Quality Rule | A rule that evaluates quality metrics against SLOs | IF availability < 99.5% THEN raise SLO violation event

SLA Monitoring Flow

SLA Monitoring with TMF657

1
Define SLO
TMF657 -- Service Quality

Create service level objectives via TMF657 API: target availability, latency thresholds, MTTR targets, mapped to specific service specifications.

2
Collect Metrics
Performance Management

Performance monitoring systems continuously collect quality metrics (availability, latency, throughput, error rates) per service instance.

3
Evaluate Rules
TMF657 -- Service Quality

TMF657 evaluates collected metrics against defined SLOs. When a metric breaches a threshold, a service quality violation event is raised.

4
Trigger Actions
Assurance Orchestration

SLO violations can trigger automated actions: create a service problem (TMF656), escalate a trouble ticket (TMF621), or notify account management.
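Step 3 of the flow above amounts to a rules check of metrics against thresholds. The sketch below uses the Broadband SLS thresholds from the concepts table; the violation-event shape is illustrative, not a TMF657 schema.

```python
# SLO rule evaluation sketch: compare collected metrics to thresholds and
# emit a violation event for each breach. Thresholds match the Broadband
# SLS example (availability >= 99.5%, latency <= 30ms).
slos = {
    "availability": ("min", 99.5),   # must stay at or above
    "latencyMs": ("max", 30),        # must stay at or below
}

def evaluate(metrics: dict) -> list[dict]:
    violations = []
    for name, (kind, threshold) in slos.items():
        value = metrics.get(name)
        if value is None:
            continue
        breached = value < threshold if kind == "min" else value > threshold
        if breached:
            violations.append({"slo": name, "value": value, "threshold": threshold})
    return violations

events = evaluate({"availability": 99.2, "latencyMs": 25})
```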

How Assurance APIs Work Together

The three assurance APIs form a coherent domain that covers the full spectrum of fault management and service quality:

The reactive flow starts with a customer complaint and works inward to find the root cause:

  1. Customer reports issue -> TMF621 TroubleTicket created
  2. Triage identifies potential service impact -> TMF656 ServiceProblem investigated
  3. Root cause identified in resource layer -> TMF639 Resource queried
  4. Fix applied -> ServiceProblem resolved -> TroubleTicket resolved -> Customer notified

Additional Assurance APIs

Beyond the three core assurance APIs, TM Forum defines several related APIs that support the broader assurance domain:

Related Assurance APIs

API | Name | Purpose
TMF642 | Alarm Management | Manage network alarms and events from monitoring systems
TMF650 | Performance Management | Collect and manage performance data from network elements
TMF628 | Performance Threshold | Define thresholds that trigger alarms when performance degrades
TMF649 | Performance Monitoring | Continuous monitoring of network and service performance
The Assurance Stack
Think of the assurance APIs as a stack: TMF642/TMF650 at the bottom handle raw network data (alarms, performance). TMF656 in the middle handles correlation and root cause analysis. TMF621 at the top handles customer-facing issue management. TMF657 runs alongside all layers, continuously evaluating quality against commitments. Each layer adds business context to raw technical data.

Assurance API Source of Record

Assurance Entities -- Source of Record

Entity | System of Record | System of Engagement | Notes
Trouble Tickets | Trouble Ticket System (TMF621) | Contact Centre / Self-Service Portal | Customer-reported and proactive issues
Service Problems | Service Problem Manager (TMF656) | NOC / Assurance Platform | Correlated root cause records
Service Quality/SLO | Service Quality Manager (TMF657) | SLA Management Dashboard | SLO definitions and violation tracking
Network Alarms | Alarm Management (TMF642) | NOC Dashboard | Raw network fault events
Performance Metrics | Performance Management (TMF650) | Analytics / Reporting | Collected performance data

SLA Management in Practice

A Service Level Agreement (SLA) is a contract between the telco and a customer that guarantees specific service quality levels. For example: "We guarantee 99.9% availability and will respond to faults within 4 hours." If the telco fails to meet these commitments, the customer may be entitled to credits or penalties.

An SLA typically contains several Service Level Objectives (SLOs), each defining a measurable quality target:

  • Availability SLO: Service uptime percentage per month (e.g., 99.95%)
  • Latency SLO: Maximum packet delay (e.g., < 20ms within network)
  • Jitter SLO: Maximum delay variation (e.g., < 5ms)
  • MTTR SLO: Maximum time to restore service after a fault (e.g., 4 hours)
  • MTTF SLO: Minimum time between failures (e.g., > 30 days)
  • Packet Loss SLO: Maximum acceptable packet loss (e.g., < 0.01%)
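An availability SLO becomes concrete once translated into a downtime budget. The arithmetic is straightforward: the allowed downtime is the total period multiplied by the permitted unavailability.

```python
# Availability SLO -> monthly downtime budget. For a 99.95% SLO over a
# 30-day month: 43,200 minutes * 0.05% = 21.6 minutes of allowed downtime.
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

budget = downtime_budget_minutes(99.95)
```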

Mature operators automate the full SLA lifecycle using TMF657:

  • SLO definitions are created via TMF657 API, linked to service specifications in TMF633
  • Performance data is collected from TMF650 and correlated to specific service instances in TMF638
  • TMF657 rules engine continuously evaluates metrics against SLOs
  • SLO violations trigger events that can auto-create service problems (TMF656) and trouble tickets (TMF621)
  • SLA compliance reports are generated automatically for customer-facing reporting
  • SLA breach credits can be calculated automatically and fed to billing systems
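The last step, automatic breach-credit calculation, might be sketched as follows. The tiered credit schedule here is an invented example; real schedules are defined in each customer's SLA contract, and the computed percentage would be fed to billing.

```python
# Illustrative breach-credit rule: credit (as % of the monthly fee) grows
# with the availability shortfall. The tiers are assumptions, not a standard.
def sla_credit_pct(measured_availability: float, committed: float = 99.9) -> float:
    if measured_availability >= committed:
        return 0.0
    shortfall = committed - measured_availability
    if shortfall <= 0.1:
        return 5.0    # small miss
    if shortfall <= 0.5:
        return 10.0   # moderate miss
    return 25.0       # severe miss

no_credit = sla_credit_pct(99.95)   # met the commitment
credit = sla_credit_pct(99.7)       # 0.2 points short
```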

Common Assurance Pitfalls

Pitfall: No Ticket-to-Problem Correlation
Without service problem management (TMF656), every trouble ticket is treated as an isolated incident. When 200 customers are affected by the same OLT failure, the contact centre creates 200 independent tickets, each investigated separately. Service problem management correlates these into a single root cause, dramatically improving resolution efficiency.
Pitfall: Reactive-Only Assurance
Operators that only react to customer complaints miss the opportunity for proactive resolution. Modern assurance should detect issues via alarms and performance monitoring, create service problems automatically, and notify affected customers before they call. This requires integration between TMF642 (Alarms), TMF656 (Service Problems), and TMF621 (Trouble Tickets).
Pitfall: SLAs Without Measurement
Committing to SLAs without implementing TMF657-style quality monitoring means you have no way to know if you are meeting your commitments until a customer complains. Worse, you cannot proactively manage quality to prevent SLA breaches. SLOs must be measured continuously, not just reported after the fact.

Section 5.5 Key Takeaways

  • TMF621 (Trouble Tickets) captures customer-reported and proactive issues with a standard state machine
  • TMF656 (Service Problems) correlates multiple tickets to a single root cause, enabling efficient resolution
  • TMF657 (Service Quality) defines SLOs and continuously monitors compliance
  • Severity measures technical impact; priority measures business urgency -- they are not the same
  • The reactive flow starts from customer complaints; the proactive flow starts from alarms and monitoring
  • Alarm correlation (TMF656 + TMF642) enables proactive customer notification before complaints arise
  • SLA management requires both definition (SLOs) and continuous measurement (TMF657 + TMF650)
  • Assurance APIs integrate with inventory APIs (TMF638, TMF639) for impact analysis