| Monitoring & Reporting for Ultra-Low Latency Services |
|
With ever more emphasis placed on guaranteed latencies in the "handful of milliseconds" range, and trading latencies increasingly monitored to microsecond resolution, financial institutions are selecting interconnection carriers based on proven performance, backed by measurable Service Level Agreements. This session calls on a panel of experts to explore the need for ultra-low latency (ULL), one-way delay measurement and real-time reporting, while addressing key challenges and measurement methods. From the technology required to the market drivers, trading trends, and the networks that support them, the panel drills down into their own network issues, performance goals and customer requirements. Guided discussion topics include SLA portals, per-second latency measurement and reporting, micro-bursting, short-term network performance fluctuations, and circuit-based analysis for both service providers and financial firms seeking the fastest possible networks. Learn more about the latest monitoring solutions for ULL applications in this short video.

Transcript

Scott Sumner: Performance monitoring is now possible with microsecond precision: for example, you can measure one-way latency between New York and Chicago ten times a second, to the microsecond, so you can see at incredibly fine resolution how your latency, packet loss or throughput is performing. Real-time reporting puts results on your screen so you can see what’s happening as it happens and troubleshoot individual circuits and their performance. What this has really allowed, from a financial vertical perspective, is to see things like micro-bursting and micro packet loss that you will not normally see in something like a five-minute averaged measurement. Now you can actually see minute packet loss that may interrupt your trading for a significant period of time, as protocol messages go back and forth, sessions renegotiate and packets are resent.
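To illustrate the point about five-minute averages, here is a minimal Python sketch. It is not taken from the panel: the 7,000 µs baseline, the 25,000 µs burst and the helper names `synthetic_delay_us` and `per_second_peaks` are all assumptions chosen for illustration. It builds one five-minute window of one-way delay samples at ten per second, injects a half-second burst, and compares the window average with the per-second view.

```python
from statistics import mean

SAMPLES_PER_SECOND = 10      # one-way probes per second, as quoted in the transcript
WINDOW_SECONDS = 5 * 60      # one five-minute averaging window


def synthetic_delay_us(baseline_us=7000, burst_us=25000,
                       burst_start=1200, burst_len=5):
    """Return one window of made-up one-way delay samples (microseconds).

    A short burst (burst_len samples, i.e. half a second) sits on an
    otherwise flat baseline. Every number here is an illustrative assumption.
    """
    total = SAMPLES_PER_SECOND * WINDOW_SECONDS
    samples = [baseline_us] * total
    for i in range(burst_start, burst_start + burst_len):
        samples[i] = burst_us
    return samples


def per_second_peaks(samples):
    """Collapse raw samples into the worst delay seen in each second."""
    return [max(samples[i:i + SAMPLES_PER_SECOND])
            for i in range(0, len(samples), SAMPLES_PER_SECOND)]


samples = synthetic_delay_us()
five_minute_average = mean(samples)          # barely moves off the baseline
peaks = per_second_peaks(samples)
worst_second = max(range(len(peaks)), key=peaks.__getitem__)

print("five-minute average (us):", five_minute_average)
print("worst per-second peak (us):", peaks[worst_second],
      "at second", worst_second)
```

In this synthetic window the average stays within a fraction of a percent of the baseline, while the per-second peaks show the burst at more than three times the baseline.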
Gary Kim: Let’s look at sources of latency: local, national, international. What causes latency, and what is your firm doing to minimize it to the greatest possible extent?

John Knuff: Equinix, as a data centre provider, opted for the simple solution: when we connect customers to other networks, we essentially hardwire your network equipment to their network equipment. We have opted not to put any network gear, no switch or router, between you and your target or destination. So we have taken the simplest path, which is fiber optic cable or copper wire between financial participants. That’s different from some of the other providers, which have a switched platform that adds latency and can introduce new dynamics and variability in delay, security issues and whatnot.

Mary Stanhope: From a physical perspective, the biggest pieces [affecting latency] are distance and location: simply being in the key [exchange point] locations, and then continuously looking for and grooming the most direct routes between endpoints, has a big effect on transport time. So it really comes down to real estate.

Scott Sumner: [One key strategy to reduce latency is to] eliminate as many store-and-forward network elements as possible. If you have a switch in your fabric but you are not actually switching anything, and you are just using [the switch] to terminate a circuit, there is probably a much better way to do that. There are a lot of alternatives out there that are fully hardware based, built on dedicated ASICs, with no network processors, no buffering and no queuing.

Jock Percy: Removing elements is great; other than that, look at your fiber assets, whether they are underground or overhead. It’s really all about optimizing existing routes.

Ernie Hoffman: Optimum Lightpath, like RCN, is focused on the shortest distance between locations, but in addition there are different technologies that we are also looking at, like dispersion compensation.
Currently, dispersion compensation usually uses a coil of fiber, which adds an extreme amount of latency. Some new technologies are out there, fiber Bragg gratings for example, which offer a vast improvement. So we look at every piece of the connection and work very hard to reduce the latency to meet the financial vertical’s requirements. Another good practice is to stay as low as possible in the protocol stack, at layer one if possible. In some cases you need a layer two service, which is fine, but you don’t want to keep going up and down the protocol stack or you’ll introduce significant delay.

Jock Percy: On the new trans-Pacific cable, Unity, the metric today is 95.654 milliseconds round trip. That metric between LA and Tokyo actually beats some of the existing trans-Pacific cables. But to get from LA to Chicago, to take your total circuit end-to-end to Chicago, that’s going to be the interesting piece [to focus on for latency reduction].

Gary Kim: Several network [operators] sometimes pool their money to build one fiber-optic route, which is really nice in terms of cost, but not very nice in terms of earthquakes. So in terms of your routing, it’s not just about latency but how you protect yourself from earthquakes, boat anchors and other disasters.

John Knuff: Indeed, diversity and redundancy are important, and a lot of companies have found that out the hard way. There may be a single point of failure at a central office just two blocks from the data centre. There are common landing points for undersea cables: wet landings, dry landing spots. You really have to do an audit with all of your network providers and have them trace the path that you are on, both their primary path and their failover path over diverse routes, to make sure there is no common single point of failure.
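The relationship between route length and delay that runs through the answers above can be made concrete with a small Python sketch. It is an illustration, not the panelists' method. The group index of 1.468 is an assumed typical value for silica fiber, and the function names are hypothetical. The calculation covers propagation delay only, so a path length derived from a measured round trip is an upper bound, because real circuits also include equipment delay.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458   # speed of light in vacuum
FIBER_GROUP_INDEX = 1.468               # assumption: typical silica fiber


def fiber_one_way_delay_ms(path_km, group_index=FIBER_GROUP_INDEX):
    """One-way propagation delay in milliseconds over path_km of fiber."""
    return path_km * group_index / SPEED_OF_LIGHT_KM_PER_S * 1000.0


def implied_path_km(round_trip_ms, group_index=FIBER_GROUP_INDEX):
    """Fiber length that would account for a round trip by propagation alone."""
    one_way_s = round_trip_ms / 2.0 / 1000.0
    return one_way_s * SPEED_OF_LIGHT_KM_PER_S / group_index


# Round-trip figure quoted for the Unity cable in the transcript.
unity_km = implied_path_km(95.654)
delay_per_km_us = fiber_one_way_delay_ms(1.0) * 1000.0

print(f"implied fiber path: {unity_km:.0f} km")
print(f"propagation delay: {delay_per_km_us:.2f} microseconds per km")
```

Under these assumptions, light in fiber takes close to 5 microseconds per kilometre, and the quoted round trip corresponds to a fiber path of a little under 9,800 km. Every kilometre a route is shortened is therefore worth roughly 5 microseconds one way.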
Scott Sumner: When you land at the end of the country and have to bring your circuit all the way to Chicago, something can happen along the way, because you are potentially getting ping-ponged between different providers. Delay based on path length might not give you an accurate picture of actual delay [in an Ethernet service]: there is the potential for a lot of congestion, contention and oversubscription in Layer 2 traffic. It really depends on who you are working with as a provider and how they provisioned your service. You have to be continuously measuring the circuit end-to-end to actually know what you are getting.

Mary Stanhope: For RCN, actually putting networks out and working with our clients means looking at everything they are doing: which applications they run, which ones are going over which routes, and then also continuously looking for improvements. Let’s take New York - Chicago as an example: the [latency] on that route is dropping, dropping, dropping, from carrier to carrier to carrier, almost in a round-robin, so [to get the best path means] constantly looking and improving and driving that network for performance.

The Demand for Ethernet

Ernie Hoffman: Optimum Lightpath made the decision in 2004 to stop selling TDM services and focus solely on Ethernet, and this year we’ve seen tremendous demand that continues to grow.

Mary Stanhope: Similarly, we have looked at the New York metro area market and we see Ethernet growing out through 2014 at about 15%, transported waves at about 7%, and then you get into TDM and SONET, which are really staying flat.

Scott Sumner: The real thing to fear if you are an IT professional in financial services is the move to Ethernet service. When you had a T1 line or an OC3 installed, a turn-up test for latency and throughput would match what you’d measure 5 years later.
When you perform a turn-up test for Ethernet for latency and throughput, the results are likely to degrade over time if your provider isn’t actually keeping an eye on that for you, and proving it to you somehow. The days are over when you can rely on turn-up test results.

Ernie Hoffman: Part of our low-latency product is to take a separate channel and put NIDs on both ends to constantly monitor the latency of those routes, so if anything fluctuates we’re alerted and can take action right away. We are measuring down to one-second intervals, and the capability is there to go even more granular.

Scott Sumner: In terms of active testing, where you have test “probe” packets running through your network, that accounts for less than 0.3% of the total bandwidth on a GigE link. So it is fairly negligible, even at a per-second interval. It’s the management telemetry, taking that data and pushing it back to some centralized server, where it starts to get very interesting. Legacy technologies like SNMP are extremely bandwidth consuming, so instead of 0.3% you’d probably lose several percent of your bandwidth if you’re looking at the data at a high frequency. Newer monitoring systems have a way to wrap up the data, compress it, then deliver it in a secure, binary format back to a central server that unpacks the data for reporting. With these techniques you can reduce telemetry down to far less than 1% of a GigE link.

What’s Currently the Best Latency Available over Common Routes?

Jock Percy: Everybody wants to be first; there’s no ribbon for second. Currently, between New York and Chicago the magic number is 14 milliseconds. If we can get there, we’ve got the business of everybody in the room. Trans-Atlantic, it’s 64 ms that’s magical. These aren’t currently mass markets; they are not currently probable by the end of this year.

What About Bandwidth?

John Knuff: If you have a 100 Mbps network connection covering the metro area in New York, that’s great.
But you’re going to have spikes of 400, 800 [Mbps], 1.2 Gbps, and when you have those spikes that exceed your bandwidth, well over 50% to 60% more than your bandwidth, it does you no good to have a low-latency link. So we try to have the customers we deal with look at both bandwidth and latency dynamics, and drill down to the market data feeds and market data sources they connect with. We do this to make sure they have a good understanding of whether they will be able to aggregate OPRA and two or three other feeds across that link without coming close to their bandwidth limit.

Jock Percy: I’ve actually seen 1 Gbps ports that are faster than some 10 Gbps links, which comes down to the hardware discussion that we were having earlier. It’s really important to accelerate hardware alongside managing propagation delay with the most direct route; those details can actually be low-hanging fruit.

Scott Sumner: The wireless backhaul industry, which we see a lot of in monitoring, is really what has brought Ethernet to the U.S. as a carrier-grade service. Many providers I’ve talked to have said that before 3G cellular, there was really no reason to roll out massive amounts of Ethernet anywhere. But now every cell tower needs that high-capacity, ultra-low latency connection, where delay from a switching centre to a tower must be less than 5 milliseconds over a metro region, so the operators are extremely concerned about latency, maybe even more so than financial institutions in some regards. That’s brought down the price of equipment and resulted in hardened network equipment and offerings that are available to the financial vertical: technology originally proven out in wireless is spilling into metro networking.

Audience: How do you store your data so you can look at an instant that occurred 3 hours ago?
Scott Sumner: You can over-measure data. Ernie and his team are definitely already measuring per second at the request of some clients, but we wouldn’t normally do that continuously because your Oracle server fills up very fast. So normally the way these systems work is that you look at something granularly if there’s a problem going on: you can increase sampling to ten times per second. Or you can sample continuously, even at per-second intervals, but start rolling the data up after a first-stage roll-up period, say a week: after a week you don’t keep per-second data, you keep hourly averages, and per-second history only for periods where there was an exception. This is a self-maintaining system that keeps long-term storage requirements in check while providing enough granular data for detailed trending and troubleshooting.

Audience: There is now a new movement where the industry is open to more requests from clients. As a default, can we expect data and statistics to be made available for our traffic, so we can get a realistic view of the performance we are actually getting?

Mary Stanhope: We do work closely with customers in terms of providing them data, even real-time or streaming data, as well as providing customer portals that give access to and views of circuits. But the timeframes could be unique to a customer. Whether we measure at per-second intervals depends on the application in question, the guarantee in terms of throughput, what the parameters of that SLA are, etc.

John Knuff: From prior experience working at network operators, their business is transport and making sure they don’t violate the SLA. But qualitative measuring down to the microsecond, and looking at variability or packet loss, those are things you need to become very good at. I don’t think the network providers will give you that level of detail. |
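The roll-up scheme Scott Sumner describes in answer to the storage question can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not any vendor's implementation: the function name `roll_up`, the `(timestamp, delay)` tuple layout and the rule that one sample over the threshold marks a whole hour as an "exception" are all hypothetical choices. Only the one-week raw retention and the hourly averages come from the discussion.

```python
from collections import defaultdict
from statistics import mean

RAW_RETENTION_S = 7 * 24 * 3600   # keep per-second data for one week
HOUR_S = 3600


def roll_up(samples, now_s, threshold_us):
    """Apply a one-week roll-up to per-second delay samples.

    samples: iterable of (timestamp_s, delay_us) tuples.
    Returns (raw, hourly):
      raw    - per-second samples still retained: everything newer than a
               week, plus older hours in which any sample exceeded
               threshold_us (the assumed definition of an "exception")
      hourly - {hour_start_s: average delay_us} for every hour older than a week
    """
    cutoff = now_s - RAW_RETENTION_S
    raw = []
    old_by_hour = defaultdict(list)
    for ts, delay in samples:
        if ts >= cutoff:
            raw.append((ts, delay))
        else:
            old_by_hour[ts // HOUR_S].append((ts, delay))

    hourly = {}
    for hour, rows in old_by_hour.items():
        delays = [d for _, d in rows]
        hourly[hour * HOUR_S] = mean(delays)
        if max(delays) > threshold_us:
            raw.extend(rows)          # exception hour: keep the detail
    raw.sort()
    return raw, hourly


# Tiny made-up data set: two old hours (one with a spike) and one recent sample.
now = 10 * 24 * 3600
data = [(0, 7000), (1, 7100), (3600, 7000), (3601, 30000), (860000, 7000)]
kept_raw, hourly_avg = roll_up(data, now, threshold_us=20000)
print(kept_raw)
print(hourly_avg)
```

Run periodically, a job like this bounds long-term storage at roughly one record per hour per circuit, plus the per-second detail for exception periods.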

This session from the recent

