BSS/OSS Academy

Event-Driven Architecture in Telco BSS/OSS

In a modular BSS/OSS architecture, components must communicate. The two fundamental patterns are synchronous (request-response, typically REST APIs) and asynchronous (event-driven, typically message brokers). While synchronous APIs are essential for queries and commands, event-driven architecture is what makes a distributed BSS/OSS truly resilient, scalable, and loosely coupled.

This section explains why event-driven patterns matter specifically in the telco context, how TM Forum standardises event management, and the practical patterns for implementing event-driven BSS/OSS — including sagas, CQRS, and event sourcing.

[Figure 8.4 — Event-driven BSS/OSS architecture. Event producers (Product Catalog / TMF620, COM / TMF622, SOM / TMF641, ROM / TMF652) publish domain events such as OfferingChanged, OrderCompleted, ServiceActivated, and ResourceAllocated to a central event bus (Kafka / TMF688 Event Management) with topics bss.catalog.events, bss.ordering.events, oss.service.events, and oss.resource.events. Consumers subscribe independently: Billing (charges on order), Product Inventory (updates subscriptions), Service Inventory (tracks CFS/RFS), Notifications (SMS, email, push), and Analytics (BI, reporting). The saga pattern coordinates distributed transactions, either by choreography (event-driven chains for simple flows, e.g. OrderCreated → ServiceActivated → BillingStarted) or by orchestration (SOM coordinates complex fulfilment: SOM → Allocate → Activate → Inventory → Notify COM).]

Why Event-Driven Matters for BSS/OSS

Telco BSS/OSS operations are inherently asynchronous. An order placed by a customer does not complete instantly — it triggers a cascade of fulfilment steps that may take minutes, hours, or even days. Event-driven architecture models this reality naturally.

Synchronous vs Event-Driven: Telco Context

| Aspect | Synchronous (REST) | Event-Driven |
| --- | --- | --- |
| Order fulfilment | Frontend polls for status updates. Backend must hold connections open or implement complex polling. | Backend emits OrderStateChanged events. Interested consumers react when ready. |
| Billing notifications | Billing must call CRM, self-service, and notification systems sequentially after bill generation. | Billing emits BillGeneratedEvent. CRM, self-service, and notification systems subscribe independently. |
| Service activation | SOM calls ROM synchronously and waits. If ROM is slow or down, SOM is blocked. | SOM emits ServiceOrderCreated. ROM picks it up when ready. SOM is not blocked. |
| Fault propagation | If billing is down, anything that calls billing fails. Cascading failures. | Events are queued. When billing recovers, it processes the backlog. No cascading failure. |
| Scaling | Synchronous chains create bottlenecks at the slowest component. | Components scale independently. Consumers process events at their own pace. |
| Adding new consumers | Each new system that needs order data requires a new API call in the order service. | New consumers subscribe to existing events. No changes to the publisher. |
The Key Insight
Event-driven architecture changes "who knows about whom." In synchronous architecture, the order service must know about billing, CRM, and notification. In event-driven architecture, the order service just publishes events — it does not know or care who listens. This inversion of dependencies is what makes the system truly modular.

TMF Event Management Pattern

TM Forum defines a standardised event management pattern across all Open APIs. This pattern, known as the Hub/Listener pattern, ensures that event communication between ODA components is interoperable and consistent.

TMF Hub/Listener Pattern
Every TMF Open API that supports events defines two constructs: (1) a Hub — an endpoint where consumers register their interest in specific event types (subscription). (2) a Listener — a callback endpoint on the consumer side that receives event notifications when the subscribed event occurs. This is essentially a webhook pattern standardised across all TMF APIs.

How Hub/Listener Works

TMF Hub/Listener Event Flow

1. Register Subscription (Consumer → Publisher): The consumer (e.g., Billing) calls POST /hub on the publisher (e.g., the Product Ordering API), specifying the event type (ProductOrderStateChangeEvent) and the callback URL.

2. Event Occurs (Publisher: Product Ordering): A product order changes state (e.g., from "acknowledged" to "completed") within the Product Ordering component.

3. Publish Event (Publisher → Consumer): Product Ordering sends a POST to the registered callback URL with an event payload containing the order ID, the new state, and relevant fields.

4. Consumer Processes Event (Consumer: Billing): Billing receives the callback, processes the event (e.g., starts billing for the new product), and returns HTTP 200 to acknowledge receipt.
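The registration call in step 1 can be sketched as a small payload builder. A minimal sketch in Python, assuming the common TMF notification body shape with `callback` and `query` fields; the billing URL is a hypothetical example:

```python
import json

def build_hub_subscription(callback_url, event_type):
    """Build a hub registration body in the common TMF notification shape."""
    return {
        "callback": callback_url,            # listener endpoint on the consumer
        "query": f"eventType={event_type}",  # filter to the events we care about
    }

# Billing registers for order state changes on the Product Ordering API
# (the URL is a hypothetical example).
subscription = build_hub_subscription(
    "https://billing.example.com/listener",
    "ProductOrderStateChangeEvent",
)
request_body = json.dumps(subscription)  # sent as POST /hub to the publisher
```

Deregistering follows the same convention: the POST /hub response returns a subscription `id`, and the consumer later calls DELETE /hub/{id} to stop receiving callbacks.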

TMF Event Types

Standard TMF Event Types

| Event Pattern | When Fired | Example |
| --- | --- | --- |
| CreateEvent | A new entity is created | ProductOrderCreateEvent — fired when a new order is submitted |
| StateChangeEvent | An entity transitions to a new lifecycle state | ServiceOrderStateChangeEvent — fired when service order moves from InProgress to Completed |
| AttributeValueChangeEvent | A significant attribute changes value | ProductOfferingAttributeValueChangeEvent — fired when price changes |
| DeleteEvent | An entity is deleted or retired | ProductOfferingDeleteEvent — fired when an offering is withdrawn |
| InformationRequiredEvent | A process needs additional input to proceed | ProductOrderInformationRequiredEvent — fired when order needs missing data |
Hub/Listener Limitations
The TMF Hub/Listener pattern is a webhook model — it works well for moderate event volumes between a small number of components. For high-volume event streams (thousands of events per second), real-time processing, or complex event routing, you need a proper message broker (Kafka, RabbitMQ) underneath. Most production deployments use message brokers for the transport layer and expose the TMF Hub/Listener API as a compatibility layer on top.

Event Categories in Telco BSS/OSS

Not all events are created equal. Understanding the different categories of events in a telco context helps design appropriate handling strategies.

Event Categories

| Category | Characteristics | Telco Examples | Handling Strategy |
| --- | --- | --- | --- |
| Domain Events (State Changes) | Fact that something happened. Immutable. Past tense. | OrderCompleted, ServiceActivated, BillGenerated | Publish to topic. Multiple consumers process independently. Retry on failure. |
| Integration Events | Signal between components to trigger action. Present tense imperative. | ActivateService, AllocateResource, GenerateBill | Publish to queue. Single consumer processes. Dead-letter on failure. |
| Notification Events | Inform external parties. May trigger SMS, email, push. | PaymentReceived, OutageDetected, UsageThresholdReached | Publish to notification bus. Notification engine formats and delivers. |
| System/Operational Events | Health, performance, and lifecycle events. | ComponentHealthCheck, DeploymentCompleted, ScaleUpTriggered | Publish to observability platform. Alerting and dashboarding. |
Domain Event vs Command
A domain event is a statement of fact: "OrderCompleted" — it happened, you cannot reject it. A command is a request to do something: "ActivateService" — it can succeed or fail. This distinction matters for error handling. If a domain event consumer fails, the event is retried or dead-lettered. If a command fails, the sender needs to know about the failure and potentially compensate.

Event Sourcing and CQRS in Telco

Event Sourcing stores every change as an event, rather than just the current state. Instead of storing "order status = completed", you store the entire history: OrderCreated → OrderValidated → OrderFulfilmentStarted → OrderCompleted. The current state is derived by replaying events.

CQRS (Command Query Responsibility Segregation) separates the write model (commands that change state) from the read model (queries that return data). This allows each to be optimised independently — the write model for consistency, the read model for performance.

Event Sourcing is naturally suited to telco BSS/OSS for several reasons:

  • Audit trail — regulators and billing disputes require full history of what happened and when. Event sourcing provides this by default.
  • Order lifecycle — orders transition through many states. Storing these transitions as events captures the full lifecycle, including timestamps and causality.
  • Temporal queries — "What was the customer's product inventory on March 15?" is trivial with event sourcing (replay events up to that date) but extremely difficult with state-based storage.
  • Debugging — when an order goes wrong, replaying events shows exactly what happened and where the process diverged.
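The replay idea behind these points can be made concrete in a few lines of Python. This is an illustrative sketch, not a production event store: the event names and state values are simplified, and a real implementation would read from a persisted log.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    order_id: str
    kind: str  # e.g. "OrderCreated"
    ts: int    # event timestamp (epoch seconds)

# State each event type moves the order into (names are illustrative).
STATE_AFTER = {
    "OrderCreated": "acknowledged",
    "OrderValidated": "inProgress",
    "OrderFulfilmentStarted": "inProgress",
    "OrderCompleted": "completed",
}

def replay(log, order_id, as_of=None):
    """Derive current (or historical) state by replaying the event log."""
    state = None
    for event in sorted(log, key=lambda e: e.ts):
        if event.order_id != order_id:
            continue
        if as_of is not None and event.ts > as_of:
            break  # temporal query: ignore everything after the cut-off
        state = STATE_AFTER.get(event.kind, state)
    return state

log = [
    Event("ord-1", "OrderCreated", 100),
    Event("ord-1", "OrderValidated", 110),
    Event("ord-1", "OrderCompleted", 150),
]
```

The `as_of` parameter is the temporal query from the list above: replaying only events up to a cut-off answers "what was the state on March 15?" without any extra storage.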

CQRS is valuable because telco read and write patterns are dramatically different:

  • Write path (order submission): must be consistent, validated, and durable. Moderate throughput.
  • Read path (dashboard, self-service): must be fast, potentially denormalised, and handle massive concurrent reads. High throughput.
  • Separating these allows the read model to be a pre-computed, cached, denormalised view — perfect for BFF consumption.
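A read model is simply a projection kept current by consuming domain events. A minimal in-memory sketch (a real read model would live in a cache or denormalised store, and the event field names here are illustrative):

```python
class OrderReadModel:
    """Denormalised per-customer view, maintained by consuming domain events."""

    def __init__(self):
        self.by_customer = {}  # customerId -> list of order summaries

    def apply(self, event):
        if event["type"] == "OrderCreated":
            self.by_customer.setdefault(event["customerId"], []).append(
                {"orderId": event["orderId"], "state": "acknowledged"})
        elif event["type"] == "OrderStateChanged":
            for order in self.by_customer.get(event["customerId"], []):
                if order["orderId"] == event["orderId"]:
                    order["state"] = event["state"]

view = OrderReadModel()
view.apply({"type": "OrderCreated", "customerId": "cust-1", "orderId": "ord-1"})
view.apply({"type": "OrderStateChanged", "customerId": "cust-1",
            "orderId": "ord-1", "state": "completed"})
```

A self-service dashboard then reads `view.by_customer["cust-1"]` directly, with no joins against the write-side order database.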

Event sourcing and CQRS add significant complexity. Apply them selectively — not every BSS/OSS component needs them.

Where to Apply Event Sourcing/CQRS in BSS/OSS

| Component | Event Sourcing? | CQRS? | Rationale |
| --- | --- | --- | --- |
| Order Management (COM) | Strong candidate | Yes | Orders have complex lifecycle. Audit trail is essential. Write/read patterns differ greatly. |
| Service Order Mgmt (SOM) | Strong candidate | Yes | Complex orchestration with many state transitions. Replay capability aids debugging. |
| Product Catalog | Optional | Yes (for high-traffic reads) | Catalog reads vastly outnumber writes. CQRS with cached read model improves performance. |
| Billing | Depends | Often not needed | Rating engines are already event-driven (CDR processing). Full event sourcing may be overkill. |
| CRM / Party Mgmt | Rarely | Rarely | CRUD patterns dominate. MDM/golden record pattern is more appropriate than event sourcing. |

Message Broker Patterns

The message broker is the backbone of event-driven BSS/OSS. Choosing the right broker and using the right messaging patterns is critical. The two dominant brokers in telco deployments are Apache Kafka and RabbitMQ, though other options exist.

Kafka vs RabbitMQ for Telco BSS/OSS

| Criterion | Apache Kafka | RabbitMQ |
| --- | --- | --- |
| Message model | Distributed log. Messages persist for configured retention period. Consumers track their own offset. | Queue. Messages are consumed and acknowledged. Deleted after consumption (by default). |
| Throughput | Extremely high (millions of events/second). Designed for high-volume streaming. | High (tens of thousands/second). Sufficient for most BSS/OSS event volumes. |
| Message replay | Yes — consumers can rewind to any offset and re-process messages. | No (by default). Once consumed, messages are gone. Dead-letter queues for failed messages. |
| Ordering guarantee | Per-partition ordering. Messages within a partition are strictly ordered. | Per-queue ordering. Single consumer per queue guarantees order. |
| Consumer model | Pull-based. Consumer groups enable parallel processing with partition assignment. | Push-based. Broker pushes messages to consumers. Prefetch count controls flow. |
| Telco use case fit | Best for: CDR processing, network event streaming, analytics pipelines, event sourcing. | Best for: Order orchestration, command dispatch, task queues, notification delivery. |
| Operational complexity | Higher. Requires ZooKeeper/KRaft, careful partition management, monitoring. | Lower. Simpler to deploy and operate. Clustering is straightforward. |
Many Telcos Use Both
It is common to use Kafka and RabbitMQ together: Kafka as the backbone for high-volume event streaming (CDRs, network events, analytics) and RabbitMQ for transactional messaging (order orchestration, command dispatch). This is not an either/or decision.

Topic Design for Telco Events

How you design your event topics (Kafka) or exchanges/queues (RabbitMQ) directly impacts system maintainability. A common pattern in telco BSS/OSS:

  • Domain-based topics: bss.ordering.events, bss.catalog.events, oss.service.events, oss.resource.events
  • Event type partitioning: Within a topic, use event type headers to allow consumers to filter. E.g., bss.ordering.events contains OrderCreated, OrderStateChanged, OrderCancelled.
  • Partition by entity ID: For Kafka, partition by order ID or customer ID to ensure all events for the same entity are processed in order by the same consumer.
  • Separate topics for commands vs events: oss.service.commands (ActivateService, ModifyService) vs oss.service.events (ServiceActivated, ServiceModified). Commands go to a specific consumer; events go to all interested consumers.
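Partitioning by entity ID works because the same key always hashes to the same partition. A sketch of the idea; note that Kafka's default partitioner actually uses murmur2 hashing, so the `crc32` here is purely illustrative:

```python
import zlib

NUM_PARTITIONS = 12  # illustrative topic partition count

def partition_for(entity_key, num_partitions=NUM_PARTITIONS):
    """Map an entity key (order ID, customer ID) to a stable partition."""
    return zlib.crc32(entity_key.encode("utf-8")) % num_partitions

# Every event keyed by the same order ID lands on the same partition,
# so one consumer in the group sees that order's events in order.
p1 = partition_for("ord-42")
p2 = partition_for("ord-42")
```

The practical consequence: choose the key to match the ordering you need. Keying by order ID orders events per order; keying by customer ID orders all events for a customer, at the cost of a hotter partition for large accounts.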

The Saga Pattern for Distributed Transactions

In a monolithic BSS, a single database transaction can span ordering, billing, and inventory updates. In a modular architecture, this is impossible — each component has its own database. The Saga pattern manages distributed transactions by breaking them into a sequence of local transactions, each publishing events that trigger the next step.

Saga Pattern
A saga is a sequence of local transactions across multiple services. Each local transaction updates one service and publishes an event/message to trigger the next step. If any step fails, the saga executes compensating transactions to undo the changes made by preceding steps. There are two saga variants: Choreography (event-driven, decentralised) and Orchestration (centralised coordinator).

Choreography-Based Saga

In a choreography saga, each service reacts to events and publishes new events. There is no central coordinator. This is simple for short sagas but becomes hard to follow for complex flows.

Choreography Saga: New Order Activation

1. Order Submitted (COM, BSS): Commercial Order Management validates and accepts the order, then publishes ProductOrderCreatedEvent.

2. Order Event Received (SOM, OSS): SOM subscribes to ProductOrderCreatedEvent, decomposes the order into service orders, and publishes ServiceOrderCreatedEvent.

3. Service Order Event Received (ROM, OSS): ROM subscribes to ServiceOrderCreatedEvent, allocates resources, activates on the network, and publishes ResourceOrderCompletedEvent.

4. Resource Completion Received (SOM, OSS): SOM subscribes to ResourceOrderCompletedEvent, updates the service inventory, and publishes ServiceOrderCompletedEvent.

5. Service Completion Received (COM, BSS): COM subscribes to ServiceOrderCompletedEvent, updates the product inventory, and publishes ProductOrderCompletedEvent. Billing subscribes and starts charging.

Orchestration-Based Saga

In an orchestration saga, a central saga orchestrator (typically the SOM or a dedicated orchestration engine) coordinates the steps. It sends commands to each participant and handles success/failure responses.

Orchestration Saga: New Order Activation

1. Receive Order (Orchestrator: SOM): The saga orchestrator receives the commercial order from COM via TMF622/641 integration.

2. Allocate Resources (Orchestrator → ROM): The orchestrator sends an AllocateResource command to ROM and waits for a success/failure response.

3. Configure Network (Orchestrator → Network): On resource allocation success, the orchestrator sends an ActivateService command to network activation.

4. Update Records (Orchestrator → Inventory): On activation success, the orchestrator updates the service and resource inventory.

5. Report Completion (Orchestrator → COM): The orchestrator notifies COM of fulfilment completion. COM updates the product inventory and triggers billing.
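The orchestration loop, including compensation on failure, can be sketched generically. This is a toy coordinator with hypothetical step names; a real saga engine persists state between steps and retries before compensating:

```python
class SagaStep:
    """One local transaction plus the action that logically reverses it."""

    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action
        self.compensation = compensation

def run_saga(steps, ctx):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensation(ctx)  # must itself be idempotent
            return "compensated"
    return "completed"

trace = []

def activate(ctx):
    raise RuntimeError("network element unreachable")  # simulated step failure

steps = [
    SagaStep("allocate-resources",
             lambda ctx: trace.append("allocated"),
             lambda ctx: trace.append("released")),
    SagaStep("activate-service", activate,
             lambda ctx: trace.append("deactivated")),
]
result = run_saga(steps, {})
```

When activation fails, only the steps that actually completed are compensated, and in reverse order: the allocated resources are released, but the never-run activation has nothing to undo.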

Choreography vs Orchestration Sagas

| Aspect | Choreography | Orchestration |
| --- | --- | --- |
| Coordination | Decentralised — each service reacts to events | Centralised — orchestrator directs each step |
| Coupling | Low — services only know about events, not each other | Medium — orchestrator knows about all participants |
| Visibility | Hard to see the full flow — events are scattered across services | Easy — the orchestrator represents the complete workflow |
| Error handling | Complex — compensating events must propagate back through the chain | Simpler — orchestrator manages compensation centrally |
| Telco fit | Good for simple, linear flows (notification chains, billing triggers) | Better for complex fulfilment flows (order activation, service modification) |
Recommendation for Telco Fulfilment
Most telco BSS/OSS implementations use orchestration-based sagas for order fulfilment (the SOM acts as the orchestrator) and choreography-based sagas for cross-domain notifications (billing, CRM updates, analytics). This hybrid approach gives clear control over the critical fulfilment path while keeping peripheral integrations loosely coupled.

Compensating Transactions

When a saga step fails, all preceding steps must be "undone" through compensating transactions. In telco BSS/OSS, this is not simply a database rollback — it involves real-world reversals.

Compensating Transactions in Telco

| Original Step | Compensating Action | Complexity |
| --- | --- | --- |
| Resource allocated (IP address, VLAN) | Release allocated resources back to pool | Low — purely logical operation |
| Network configuration applied | Rollback configuration (remove VLAN, deactivate port) | Medium — requires network access and may fail |
| Service activated in inventory | Set service state to "cancelled" or "failed" | Low — inventory state change |
| Product added to customer subscription | Remove product from subscription, notify customer | Medium — customer-facing impact, notification needed |
| Billing activated | Cancel billing agreement, potentially issue credit for any charges | High — financial implications, regulatory concerns |
Compensating Transactions Are Not Perfect Undo
Compensating transactions do not magically restore the previous state. They create a new action that logically reverses the effect. A cancelled order is not the same as an order that never existed — it leaves a trail. This is especially important for billing: if a customer was charged before the failure, the compensation is a credit, not a deletion of the charge. Design compensating transactions with this in mind.

Event Delivery Guarantees

In distributed systems, message delivery is non-trivial. Understanding delivery guarantees is essential for designing reliable BSS/OSS event flows.

Delivery Guarantees

| Guarantee | Meaning | Telco Impact | When to Use |
| --- | --- | --- | --- |
| At-most-once | Message may be lost but never duplicated | Acceptable for analytics events, usage metrics. Not for orders or billing. | Non-critical telemetry, performance metrics |
| At-least-once | Message is never lost but may be delivered more than once | Safe for orders and billing IF consumers are idempotent. Most common in telco. | Order events, billing events, fulfilment events |
| Exactly-once | Message is delivered once and only once | Ideal but extremely expensive to guarantee in distributed systems. | Financial transactions, regulatory events (if achievable) |
Idempotency Is Non-Negotiable
In practice, most telco event systems use at-least-once delivery with idempotent consumers. Every event handler must be safe to execute multiple times with the same event. This means: use event IDs to detect duplicates, make state transitions idempotent (setting status to "completed" when it is already "completed" is a no-op), and use database constraints to prevent duplicate processing.
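The dedupe logic can be sketched as a handler that records processed event IDs. In production the "seen" set would be a database table with a unique constraint; this in-memory version, with illustrative field names, shows only the shape of the logic:

```python
class IdempotentConsumer:
    """At-least-once consumer that tolerates redelivered events."""

    def __init__(self):
        self.seen_event_ids = set()  # a DB table with a unique constraint in production
        self.order_status = {}

    def handle(self, event):
        if event["eventId"] in self.seen_event_ids:
            return False  # duplicate delivery: safe no-op
        # The transition itself is idempotent: re-setting "completed" is harmless.
        self.order_status[event["orderId"]] = event["state"]
        self.seen_event_ids.add(event["eventId"])
        return True

consumer = IdempotentConsumer()
event = {"eventId": "evt-1", "orderId": "ord-1", "state": "completed"}
first = consumer.handle(event)
redelivered = consumer.handle(event)  # broker delivers the same event again
```

Note the two layers of safety: the event-ID check catches exact redeliveries, while the idempotent state transition catches logical duplicates (two distinct events that express the same change).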

Practical Implementation Patterns

Transactional Outbox Pattern

Advanced

Transactional Outbox

Ensures that database updates and event publishing happen atomically — preventing scenarios where the database is updated but the event is lost (or vice versa).

Instead of publishing events directly to the message broker, the component writes events to an "outbox" table in the same database transaction as the business data update. A separate process (outbox relay) reads the outbox table and publishes events to the broker. This guarantees that if the business data was committed, the event will eventually be published. Popular implementations: Debezium (CDC-based outbox), custom polling outbox.
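The pattern can be demonstrated end-to-end with SQLite standing in for the component's database and a plain list standing in for the broker. A sketch with illustrative table and field names:

```python
import json
import sqlite3

# In-memory database standing in for the component's own datastore.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, state TEXT)")
db.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0)""")

def complete_order(order_id):
    """Write business state AND the outgoing event in one atomic transaction."""
    with db:  # commits both statements together, or neither
        db.execute("INSERT OR REPLACE INTO orders VALUES (?, 'completed')",
                   (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderCompleted",
                                "orderId": order_id}),))

published = []  # stands in for the message broker

def relay_outbox():
    """Outbox relay: publish any committed-but-unpublished events."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        published.append(json.loads(payload))  # broker.send(...) in production
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

complete_order("ord-1")
relay_outbox()
```

If the process crashes after the transaction commits but before the relay runs, the event is still sitting in the outbox table and will be published on the next relay pass — which is exactly why the consumer side must be idempotent.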

Dead Letter Queue (DLQ) Strategy

Intermediate

Dead Letter Queues in Telco

When an event cannot be processed after multiple retries, it is moved to a Dead Letter Queue for manual investigation rather than being lost or retried forever.

DLQ strategy for telco BSS/OSS: (1) Retry 3 times with exponential backoff (1s, 5s, 30s). (2) If still failing, move to DLQ with full context (original event, error, retry count). (3) Alert the operations team. (4) Provide a tool to replay DLQ events after the root cause is fixed. This prevents a single bad event from blocking the entire processing pipeline while ensuring no events are lost.
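The retry-then-dead-letter strategy maps directly onto a small helper. A sketch; the sleep function is injectable so the backoff schedule (1s, 5s, 30s) can be skipped in tests:

```python
RETRY_DELAYS = [1, 5, 30]  # seconds of backoff before each retry

def process_with_retries(event, handler, dlq, sleep=lambda seconds: None):
    """Try the handler, retry on failure, then dead-letter with full context."""
    last_error = None
    for delay in [0] + RETRY_DELAYS:  # first attempt, then 3 retries
        sleep(delay)
        try:
            handler(event)
            return "processed"
        except Exception as exc:
            last_error = exc
    dlq.append({  # keep enough context to triage and replay later
        "event": event,
        "error": str(last_error),
        "retries": len(RETRY_DELAYS),
    })
    return "dead-lettered"

dlq = []
attempts = []

def flaky_handler(event):
    attempts.append(event["eventId"])
    raise ValueError("downstream service unavailable")

outcome = process_with_retries({"eventId": "evt-1"}, flaky_handler, dlq)
```

Because the original event travels with the DLQ entry, the replay tool mentioned in step (4) only needs to feed `entry["event"]` back through the (now fixed) handler.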

Event Schema Evolution

Event schemas must evolve over time without breaking existing consumers. This is the event equivalent of API versioning.

  • Forward compatibility: New producers can add new fields; old consumers ignore unknown fields
  • Backward compatibility: New consumers can read events from old producers; missing fields use defaults
  • Schema registry: Use Apache Avro + Schema Registry or JSON Schema to enforce compatibility rules at the broker level
  • Versioned event types: Major breaking changes create new event types (OrderCreatedV2) rather than modifying existing ones
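Forward and backward compatibility both reduce to the "tolerant reader" discipline: ignore unknown fields and default the missing ones. A sketch with hypothetical field names:

```python
def read_order_created(event):
    """Tolerant reader: ignore unknown fields, default the missing ones."""
    return {
        "orderId": event["orderId"],                 # present in every version
        "channel": event.get("channel", "unknown"),  # added in a later version
    }

old_event = {"orderId": "ord-1"}                        # from an old producer
new_event = {"orderId": "ord-2", "channel": "web",
             "priority": 1}                             # from a newer producer
```

The reader handles both directions: an old event missing `channel` gets a default (backward compatibility), and a new event's extra `priority` field is silently ignored (forward compatibility). A schema registry enforces the same rules at the broker boundary instead of in every consumer.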

Operating Event-Driven Systems: Monitoring & Recovery

Building an event-driven BSS/OSS is the easy part. Operating it is where most organisations struggle. Without explicit monitoring, alerting, and recovery patterns, event-driven systems become opaque — events flow, but nobody knows if they are flowing correctly, completely, or on time.

Saga Monitoring & Stuck Order Detection

In an orchestration-based saga (which is the recommended pattern for telco fulfilment), the saga coordinator maintains state for every in-flight order. Monitoring this state is critical for operational control.

Saga Health Monitoring Checklist

| Metric | What to Monitor | Alert Condition | Recovery Action |
| --- | --- | --- | --- |
| Saga duration | Elapsed time from saga start to completion. Track p50, p95, p99. | Any saga exceeding 2x the p95 duration for its order type. | Investigate the blocking step. Check if the downstream service is responding. If the service is healthy, check for a missing callback or lost event. |
| Stuck sagas | Sagas that have not progressed (no state transition) for longer than the expected step timeout. | No state change for > configured timeout (e.g., 30 minutes for a provisioning step). | Query the saga state to identify the blocked step. Check the downstream service's DLQ for failed events. If the event was lost, replay from the saga coordinator. |
| Compensation rate | Percentage of sagas that trigger compensating transactions. | Compensation rate > 5% (threshold varies by order type). | High compensation rates indicate a systemic issue: resource exhaustion, downstream service degradation, or data quality problems. Investigate the most common compensation reason. |
| Event processing lag | Consumer lag per topic/partition — the gap between the latest published event and the latest consumed event. | Consumer lag > N events (threshold depends on volume) or lag growing over time. | Scale consumers, investigate slow processing, check for poison messages blocking the partition. |
| DLQ depth | Number of events in each Dead Letter Queue, grouped by error type. | Any DLQ depth > 0 requires investigation. DLQ depth growing indicates a persistent failure. | Triage DLQ events by error type. Fix the root cause. Replay corrected events. Do not bulk-replay without understanding why they failed. |

Compensating Transaction Failure Recovery

A critical operational question that is often overlooked: what happens when a compensating transaction itself fails? If an order fails at step 5 and the system attempts to reverse steps 4, 3, 2, and 1 — but step 3's compensation fails — the system is in an inconsistent state.

  • Retry with backoff — compensating transactions should be retried with exponential backoff, just like forward actions. Most compensation failures are transient (network blip, service restart).
  • Idempotent compensation — compensating actions must be idempotent. "Release VLAN 1042" must be safe to execute multiple times. If the VLAN is already released, the action is a no-op, not an error.
  • Human escalation — after N retries, the saga must escalate to a human operator with full context: what was the original order, which steps succeeded, which compensations succeeded, and which compensation is stuck.
  • Quarantine state — the saga enters a "requires manual intervention" state. It does not continue retrying forever, and it does not silently fail. Operations dashboards must surface quarantined sagas prominently.
  • Reconciliation — periodic reconciliation between inventory systems and the saga coordinator detects compensation gaps that were never resolved. This is the safety net behind the safety net.
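The retry-then-quarantine behaviour for compensations can be sketched in a few lines. Illustrative only: a real implementation would persist the quarantine state and back off between attempts, as described above.

```python
MAX_COMPENSATION_RETRIES = 3

def compensate(action, saga_id, quarantine):
    """Retry an idempotent compensation; quarantine for a human if it keeps failing."""
    last_error = None
    for _ in range(MAX_COMPENSATION_RETRIES):
        try:
            action()
            return "compensated"
        except Exception as exc:
            last_error = str(exc)  # real code would back off between attempts
    quarantine.append({
        "sagaId": saga_id,
        "error": last_error,
        "state": "requires-manual-intervention",  # surfaced on the ops dashboard
    })
    return "quarantined"

quarantined = []

def stuck_release():
    raise RuntimeError("inventory service returned 503")  # persistent failure

outcome = compensate(stuck_release, "saga-7", quarantined)
```

The key property is that the function always terminates with an explicit outcome: either the compensation succeeded, or the saga is parked in a state that operations dashboards can surface, never an infinite retry loop.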

Event Flow Observability

In a synchronous request-response system, a failed request returns an error immediately. In an event-driven system, a failed event may be silently dropped, stuck in a queue, or processed but ignored. Observability requires explicit investment.

Event Observability Patterns

| Pattern | Purpose | Implementation |
| --- | --- | --- |
| Correlation ID propagation | Trace a single business operation (e.g., one order) across all events, services, and saga steps. | Generate a unique correlationId at the entry point (e.g., COM). Propagate it in every event header. All logs, metrics, and traces include the correlationId. Enables "show me everything that happened for order X." |
| Event flow dashboards | Visualise the flow of events across topics and services in real-time. Identify bottlenecks and dead spots. | Use Kafka consumer lag metrics + custom topic-level throughput dashboards. Show events/second per topic, consumer group lag, and partition distribution. |
| End-to-end latency tracking | Measure how long it takes for an event to flow from producer to final consumer, including all intermediate processing. | Embed a timestamp in the event header at production time. Each consumer logs the delta between production time and processing time. Aggregate across the saga to measure end-to-end fulfilment latency. |
| Schema validation alerts | Detect events that fail schema validation before they cause processing errors. | Use a schema registry (Confluent Schema Registry, Apicurio) with strict validation. Schema violations are rejected at the broker level and routed to a validation DLQ. |

Section 8.4 Key Takeaways

  • Event-driven architecture makes BSS/OSS truly modular — publishers do not know about consumers, enabling loose coupling
  • TMF Hub/Listener is the standardised event pattern, but production systems typically use Kafka or RabbitMQ underneath
  • Events fall into categories: domain events (facts), integration events (commands), notification events, and system events
  • Event sourcing is valuable for order management and SOM where full audit trail and replay capability matter
  • CQRS separates read and write models — essential when read patterns (dashboards) differ dramatically from write patterns (orders)
  • Saga pattern manages distributed transactions: choreography for simple chains, orchestration for complex fulfilment
  • Compensating transactions are not perfect undo — they create new actions that logically reverse the effect
  • Use at-least-once delivery with idempotent consumers — this is the practical standard for telco BSS/OSS
  • Transactional outbox pattern prevents data/event inconsistency; dead letter queues prevent event loss
  • Saga monitoring requires tracking duration, stuck state, compensation rates, consumer lag, and DLQ depth
  • Compensating transaction failures must escalate to human operators — infinite retry is not a strategy
  • Correlation ID propagation across all events is mandatory for operational traceability in event-driven systems