Carrier-grade video streaming: requirements and architecture

"Carrier-grade" is one of the most overused phrases in the streaming industry. Vendors apply it to anything from a SaaS dashboard to a Kubernetes cluster, but in operator procurement it has a specific meaning: a system that meets the reliability, scale, and operational standards of a telecom network. This guide unpacks what carrier-grade actually requires of a video streaming platform, and how a modern OTT/IPTV architecture is built to meet those requirements.
What "carrier-grade" actually means
In telecom engineering, carrier-grade software has historically been defined by the so-called five-nines target: 99.999% availability, or no more than 5 minutes 15 seconds of unplanned downtime per year. For a streaming platform, that target translates into five concrete properties:
- High availability — no single point of failure across compute, storage, network, or DNS layers; planned upgrades performed without service disruption.
- Predictable scale — the system scales horizontally from thousands to millions of concurrent users without re-architecting; capacity is added incrementally, not by forklift.
- Operational manageability — single source of truth for configuration, role-based access control, audit logging, change control, and disaster recovery.
- Lawful and regulatory compliance — content protection meets studio requirements; subscriber data handling complies with regional regulations (GDPR, NIS2).
- Sustained performance under load — quality remains consistent during peak events (national football, World Cup, opening night) when concurrent viewers can multiply 10–20x in minutes.
A carrier-grade streaming platform is one that holds these properties simultaneously, not one that hits each in isolation.
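The five-nines arithmetic can be checked with a short sketch. It assumes a 365-day year, which reproduces the 5 minutes 15 seconds figure; the function names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_budget_seconds(availability_pct: float) -> float:
    """Maximum unplanned downtime per year for an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)


def format_budget(seconds: float) -> str:
    """Render a duration in seconds as hours, minutes, and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


# Compare three-, four-, and five-nines targets.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {format_budget(downtime_budget_seconds(target))}")
```

Each additional nine cuts the budget by a factor of ten, which is why the step from four nines to five usually forces architectural changes rather than operational tuning.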
Reference architecture: four layers
A streaming platform that meets carrier-grade standards is built on four integrated layers, each with its own scale and failure model.
1. Service delivery (middleware)
Subscriber accounts, entitlements, EPG, content metadata, profiles, parental controls, and concurrent-device limits. Carrier-grade middleware uses a clustered relational database (typically PostgreSQL) with synchronous replication for the primary site and asynchronous replication to a failover site. Application servers run in odd-numbered clusters (3, 5, 7) for quorum-based clustering. Cache servers reduce database load by serving repeat metadata reads.
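The preference for odd cluster sizes follows from majority-quorum arithmetic. The sketch below assumes a simple majority quorum; the source does not name the clustering protocol, so treat this as an illustration of the general principle.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return nodes - quorum_size(nodes)


# Even sizes are included to show they add cost without adding tolerance.
for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)}")
```

A four-node cluster tolerates one failure, the same as a three-node cluster, so the extra server buys no additional fault tolerance.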
2. Content processing and storage
Live transcoding, recording for catch-up and nPVR, encryption, packaging into HLS and MPEG-DASH, and storage. Hardware acceleration is the difference between a workable plant and an expensive one: a single NVIDIA GPU can transcode roughly 8 Full HD channels at 3 bitrate variants. Storage uses a tiered architecture — fast NVMe for the live DVR window, POSIX or S3-compatible object storage for catch-up and VoD.
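As a rough sizing sketch, the per-GPU figure above can be turned into a GPU count for a channel lineup. The 100-channel lineup and the `spare_gpus` parameter are hypothetical, and real density depends on codec, GPU model, and ladder design.

```python
import math

# Figure quoted above: about 8 Full HD channels per GPU at 3 bitrate variants.
CHANNELS_PER_GPU = 8


def gpus_needed(fhd_channels: int, spare_gpus: int = 0) -> int:
    """GPUs required for a Full HD lineup, plus optional spares (assumed N+k)."""
    if fhd_channels < 0 or spare_gpus < 0:
        raise ValueError("channel and spare counts must be non-negative")
    return math.ceil(fhd_channels / CHANNELS_PER_GPU) + spare_gpus


# Hypothetical 100-channel lineup with one spare GPU for redundancy.
print(gpus_needed(100, spare_gpus=1))
```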
3. Content delivery (CDN)
The delivery layer is the component that most directly determines whether the system is carrier-grade. A modern delivery layer separates origin (a small number of nodes that hold the canonical packaged content) from edge (many nodes close to subscribers). A redirector routes each subscriber to the optimal edge based on geography, server load, and content availability. Token-based authentication prevents direct origin access. Each edge node typically supports about 3,500 concurrent streams at 6 Mbps, or roughly 21 Gbps of egress per node.
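A redirector's selection logic might look like the minimal sketch below. The `EdgeNode` fields, the `pick_edge` function, and the priority order (same region first, then cached content, then lowest load) are all assumptions made for illustration, not a description of any specific product.

```python
from dataclasses import dataclass, field
from typing import Optional

EDGE_CAPACITY_STREAMS = 3500  # per-edge figure quoted above (at 6 Mbps)


@dataclass
class EdgeNode:
    name: str
    region: str
    active_streams: int
    cached_content: set = field(default_factory=set)
    healthy: bool = True

    @property
    def load(self) -> float:
        """Fraction of stream capacity currently in use."""
        return self.active_streams / EDGE_CAPACITY_STREAMS


def pick_edge(edges, subscriber_region: str, content_id: str) -> Optional[EdgeNode]:
    """Pick a healthy edge with spare capacity, or None if there is none."""
    candidates = [
        e for e in edges
        if e.healthy and e.active_streams < EDGE_CAPACITY_STREAMS
    ]
    if not candidates:
        return None
    # Tuples sort False before True: same region, then cache hit, then load.
    return min(
        candidates,
        key=lambda e: (
            e.region != subscriber_region,
            content_id not in e.cached_content,
            e.load,
        ),
    )


edges = [
    EdgeNode("edge-a", "north", 3400, {"ch1"}),
    EdgeNode("edge-b", "north", 1000),
    EdgeNode("edge-c", "south", 10, {"ch1"}),
]
chosen = pick_edge(edges, "north", "ch1")
print(chosen.name if chosen else "no edge available")
```

Because unhealthy and full nodes are filtered out before ranking, a failed edge simply drops out of the candidate set and new requests are routed elsewhere.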
4. Quality monitoring
If the platform cannot see what its subscribers see, it cannot be carrier-grade. Quality of Experience (QoE) collection from clients (buffering events, bitrate switches, frame drops, startup time, MOS) is correlated with infrastructure metrics (server CPU, network utilization, packet loss) to detect issues before they affect a meaningful share of the audience. Predictive analytics — statistical clustering on shared attributes (region, ISP, device, content) — surface problems whose blast radius is small enough that an alert threshold would never trigger.
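The idea of grouping sessions on shared attributes can be sketched as follows. This is a deliberately simple group-by comparison against the fleet-wide mean, not a full clustering algorithm; the session field names, `min_sessions`, and `factor` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

ATTRIBUTES = ("region", "isp", "device", "content")


def suspicious_groups(sessions, min_sessions=5, factor=3.0):
    """Flag attribute values whose mean buffering ratio far exceeds the fleet mean.

    Returns (attribute, value, group_mean) tuples, worst group first.
    """
    if not sessions:
        return []
    baseline = mean(s["buffering_ratio"] for s in sessions)
    groups = defaultdict(list)
    for s in sessions:
        for attr in ATTRIBUTES:
            groups[(attr, s[attr])].append(s["buffering_ratio"])
    flagged = [
        (attr, value, mean(ratios))
        for (attr, value), ratios in groups.items()
        if len(ratios) >= min_sessions and mean(ratios) > factor * baseline
    ]
    return sorted(flagged, key=lambda g: g[2], reverse=True)


# Synthetic data: 95 healthy sessions and 5 degraded sessions on one ISP.
sessions = [
    {"region": "north" if i % 2 else "south", "isp": "isp-a",
     "device": "stb", "content": "ch1", "buffering_ratio": 0.01}
    for i in range(95)
]
sessions += [
    {"region": "north", "isp": "isp-x", "device": "stb",
     "content": "ch1", "buffering_ratio": 0.5}
    for _ in range(5)
]
print(suspicious_groups(sessions))
```

In this synthetic data the fleet-wide mean stays low, so a global threshold would not fire, yet the five sessions sharing one ISP stand out once grouped.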
Scaling profile
Carrier-grade systems are dimensioned by Peak Concurrent Devices (PCD), not subscribers. A 1 million-subscriber service typically peaks between 200,000 and 400,000 concurrent devices on a high-demand evening, and a major live event can take that closer to 600,000–800,000. The architecture has to handle the peak gracefully, not the average comfortably.
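The ratios implied by the example above (20–40% on a high-demand evening, 60–80% for a major live event) can be applied to other subscriber counts as a first estimate. Treating them as generally applicable is an assumption; actual concurrency depends on the market and the content line-up.

```python
# Ratios derived from the 1 million-subscriber example above.
EVENING_PEAK = (0.20, 0.40)
LIVE_EVENT_PEAK = (0.60, 0.80)


def pcd_range(subscribers: int, ratios: tuple) -> tuple:
    """Estimated (low, high) peak concurrent devices for a subscriber base."""
    low, high = ratios
    return round(subscribers * low), round(subscribers * high)


print(pcd_range(1_000_000, EVENING_PEAK))
print(pcd_range(1_000_000, LIVE_EVENT_PEAK))
```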
| Tier | PCD range | Database | App/cache servers | CDN nodes |
|---|---|---|---|---|
| Small | Up to 10K | 2 servers, 8-core / 16GB | 2 servers | ~3 nodes |
| Medium | 10K–100K | 2 servers, scaling to 16-core / 64GB | 3 servers (clustered) | ~10–30 nodes |
| Large | 100K–500K+ | Add 1 DB server per 100K PCD | Add 2 servers per 100K PCD | Geographic CDN footprint |
The pattern is incremental: add servers, not replace them. A platform that requires re-architecting between 50K and 500K PCD is not carrier-grade; it is a startup-grade product with a marketing label.
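The table's rules can be expressed as a small lookup, sketched below. Two details are assumptions because the table does not state them: exactly 100K PCD is placed in the Medium tier, and each started block of 100K PCD growth counts as a full increment.

```python
import math


def tier_for(pcd: int) -> str:
    """Map peak concurrent devices to a tier from the table above."""
    if pcd <= 10_000:
        return "Small"
    if pcd <= 100_000:  # assumption: the 100K boundary belongs to Medium
        return "Medium"
    return "Large"


def servers_to_add(pcd_growth: int) -> dict:
    """Large-tier increments: 1 DB server and 2 app/cache servers per 100K PCD.

    Rounds up per started 100K block, which is an assumption.
    """
    steps = math.ceil(pcd_growth / 100_000)
    return {"db": steps, "app_cache": 2 * steps}


print(tier_for(50_000), tier_for(500_000))
print(servers_to_add(250_000))
```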
Deployment model
Carrier-grade does not mean on-premise only, but it does mean deployment-flexible. Tier-one operators run streaming platforms across four models: on-premise in operator data centres, virtualised on standard hypervisors (VMware, KVM, Hyper-V), in public cloud (AWS, Azure, GCP), and hybrid (on-premise core with cloud edge or DR). The platform itself must be vendor-agnostic — no dependency on a specific hypervisor, container runtime, or cloud provider — so the operator retains negotiating leverage and is not locked into one infrastructure decision for a decade.
Disaster recovery and failover
Two figures matter: Recovery Time Objective (RTO) and Recovery Point Objective (RPO). For premium pay-TV services, an RTO of a few minutes and an RPO of zero (no transactional data loss) are typical contractual targets. The standard pattern is an active primary site and a standby secondary site, with synchronous replication of the database tier and asynchronous replication of content storage. Geographic separation between sites protects against regional outages, fibre cuts, and natural disasters.
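The link between replication mode and RPO can be illustrated with a simplified model: with synchronous replication a commit is acknowledged only after the standby confirms it, so no acknowledged transaction is lost, while asynchronous replication can lose whatever is still in flight. The function name and the lag parameter are hypothetical.

```python
def worst_case_rpo_seconds(replication: str, max_lag_seconds: float = 0.0) -> float:
    """Worst-case data-loss window on failover under a simplified model."""
    if replication == "synchronous":
        # Acknowledged commits already exist on the standby.
        return 0.0
    if replication == "asynchronous":
        # Anything not yet shipped to the standby can be lost.
        return max_lag_seconds
    raise ValueError(f"unknown replication mode: {replication!r}")


print(worst_case_rpo_seconds("synchronous"))
print(worst_case_rpo_seconds("asynchronous", max_lag_seconds=2.5))
```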
Where Smartlabs fits
Smartlabs has been building carrier-grade IPTV/OTT platforms since 2007. SmartTUBE (middleware), SmartMEDIA (CDN and content processing), Universal DRM, and SmartCARE (QoE monitoring) are designed to be deployed together as a single integrated stack — but each component is also operable independently and integrable with operator-side BSS, OSS, billing, and identity systems. Production deployments have run on-premise, in private cloud, and hybrid, with operator footprints from tens of thousands to several million subscribers, and live-event peaks well above the steady-state baseline.
Procurement checklist
- Does the vendor publish dimensioning data tied to PCD, not vague "users"?
- Are all components horizontally scalable, including the database tier?
- Is the platform deployable on-premise, in cloud, and hybrid without code changes?
- Is there a multi-site failover model with documented RTO and RPO?
- Is QoE monitoring integrated, not bolted on as a separate product purchase?
- Is the database replication model synchronous or asynchronous, and what is the primary-site RPO?
- How does the platform behave when a CDN edge fails — automatic re-routing, or stalled sessions?
FAQ
Is carrier-grade still relevant in the cloud era?
Yes. Cloud abstracts hardware reliability but does not abstract software design. A poorly architected service running on AWS will fail just as visibly as one running on bare metal, and sometimes more so, because cloud outages can take entire availability zones down. The carrier-grade properties (HA, scale, operability, compliance, sustained peak performance) are deployment-model agnostic.
How is carrier-grade different from "webscale"?
Webscale optimises for cost per request at very high volumes; degraded service is acceptable as long as the average user is happy. Carrier-grade optimises for predictable behaviour for every user, including those in the long tail. They are different design philosophies, and trying to claim both at the same time is usually a sign that neither is being met.
Can a managed SaaS be carrier-grade?
In principle, yes. In practice, operators evaluating SaaS need clarity on three points: contractual SLA with concrete uptime numbers and service credits, data residency and compliance posture, and the ability to operate the platform after the contract ends. SaaS is a procurement model, not a quality grade.
If you're evaluating an IPTV/OTT platform against carrier-grade requirements, we'd be happy to walk through how SmartTUBE applies to your environment.