Capacity and next-generation mobile services (3G & 4G/LTE) seem to be constantly under scrutiny. Ever since the iPhone came on the scene and sucked the lifeblood out of AT&T's backhaul network, we constantly hear about the impending doom, the bandwidth desert we're all facing ahead. This has been labeled "The Capacity Crisis" – here's an example of one of a gazillion articles harping on the uncertainty of our mobile broadband future. Sound a bit like the swine flu? Whatever happened to that?
One thing you learn working with real operators doing real deployments is that:
- backhaul capacity is something they're dealing with (don't lose too much sleep);
- there are bigger issues: real deployment challenges to figure out first.
And field trials for 3G & 4G are full of such examples. No one's finding an issue getting bandwidth to the cell site – no magic formula is required for that – simply put, if a fiber is laid or a good microwave connection is set up, the capacity is there, pretty much on tap. The issues that operators are stumbling over have more to do with the operational nuts and bolts. A lot of new technologies are being put through their paces at the same time, and some that work great in the lab seem to be falling short in the field.
Ethernet OAM: Lies, Lies & More Lies
One of the key technologies almost every operator is counting on is Y.1731 – the popular Ethernet operations, administration and maintenance (OAM) standard for connectivity fault management (CFM) and performance monitoring (PM). Y.1731 is a must, and for good reason: it's the only standards-based QoS monitoring method available to assure Ethernet latency, jitter, frame loss and availability meet the demanding targets required for packet backhaul. It works in multi-vendor networks; it works in multi-operator networks (great for using and keeping tabs on wholesale backhaul carriers). Every network element maker selling into backhaul has it in their products, and they're all tuned up and ready to go. Are they?
A recent field trial in a 3G deployment in North America went into crisis mode when one leading mobile operator turned on OAM PM to verify latency over their backhaul provider’s network. The one-way latency target (and SLA) from mobile switching center (MSC) to tower was set at 5ms. Y.1731 measured 20ms. The mobile operator freaked. The backhaul carrier claimed 3ms. What was up?
Using an alternative test method transparent to OAM processing, the mobile operator confirmed the 3ms, giving both carriers another problem to solve: why were the OAM measurements in error by more than 300%? The first step was to turn off OAM at all intermediate nodes in the network – suddenly Y.1731 PM measurements said 3ms. They turned it back on: 20ms. It’s important to point out here that the delay only affected OAM traffic – real traffic was unaffected and was meeting spec the whole time! With the problem isolated to OAM processing itself, they were starting to experience something most network element vendors knew full well might turn up, but were hoping would go unnoticed.
The problem? Most switches and routers claim to offer the full Y.1731 feature set, but none of this was thought out when the products were originally architected. When Y.1731 became a must-have for backhaul, the features were typically shoe-horned into a software patch. Running delay-sensitive monitoring features in software is a big faux pas, because shared CPU time in the network element is a poor place to do anything critical. These CPUs are busy doing more important things (like routing / switching functions) most of the time, putting OAM into background processing queues. When traffic is at its peak, the network elements are heavily taxed – and just when you need performance measurements the most, they turn out the least accurate of all.
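A toy simulation makes the failure mode obvious. The numbers below are illustrative (not taken from the trial): assume a true path delay of 3 ms, and assume OAM frames punted to a busy control-plane CPU sit in a queue for 10–25 ms before being timestamped. Hardware timestamping at the PHY never sees that queue:

```python
import random

random.seed(42)

def measured_delay_ms(true_path_delay_ms, cpu_queue_ms, hardware_ts):
    """Hardware timestamping stamps the frame at the wire, so CPU load
    can't pollute the measurement; software stamping adds however long
    the OAM frame waited in the control-plane queue."""
    if hardware_ts:
        return true_path_delay_ms
    return true_path_delay_ms + cpu_queue_ms

# Assumed conditions: 3 ms real path delay, 10-25 ms of CPU queueing under load.
samples_sw = [measured_delay_ms(3.0, random.uniform(10, 25), False) for _ in range(1000)]
samples_hw = [measured_delay_ms(3.0, random.uniform(10, 25), True) for _ in range(1000)]

print(f"software-stamped mean: {sum(samples_sw) / len(samples_sw):.1f} ms")
print(f"hardware-stamped mean: {sum(samples_hw) / len(samples_hw):.1f} ms")
```

With these assumed queueing numbers, the software-stamped measurement reports around 20 ms while the hardware-stamped one reports the true 3 ms – the same flavor of discrepancy the two carriers were chasing.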
Scary stuff. In this case, every latency alarm the operators saw wasn’t an indication of network performance issues, but of CPU processing restrictions. Not a very useful alert.
There are of course ways to fix this situation, and these two operators came to their own conclusions and had things humming a little while later. OAM can certainly work in large-scale, multi-provider deployments, and can assure critical services. It just takes a few tricks and some solid, hardware-based OAM devices to help things out.
This gets especially critical when you consider the OAM flows hitting the MSC: expect thousands at a time as CFM and PM for three service classes from, say, 250 towers converge at a single router.
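The back-of-the-envelope arithmetic is worth spelling out. Assuming one CFM continuity session plus one delay and one loss PM session per service class (a plausible mix; actual session counts vary by deployment), the flows stack up quickly:

```python
towers = 250
service_classes = 3
# Assumed session mix per class: 1 CCM (continuity) + 1 delay + 1 loss session.
sessions_per_class = 3

flows_at_msc = towers * service_classes * sessions_per_class
print(flows_at_msc)  # 2250 concurrent OAM flows at one router
```

And that's before counting MIPs, link-level OAM, or a second MSC-facing path – all of which is why software-based OAM processing at the aggregation point falls over first.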
We’ve been getting a lot of calls in the middle of the night recently, and things can always be worked out. Let’s just say none of these calls are about ‘The Capacity Crisis’. That’s for the media to worry about.