As part of a 2014 project supported by ISIF Asia and Internet NZ, we’ve been going to a number of satellite-connected islands in the Pacific on behalf of the Pacific Islands Chapter of the Internet Society (PICISOC) to see whether we could make better use of their satellite links using network-coded TCP. One of the phenomena we came across even before we got to the network coding part seemed a bit of an oddity at first. At second glance, it offered an opportunity to look and learn.
Let me paint you a scenario: You have a remote Pacific island with a few thousand inhabitants. There’s plenty of demand for Internet, but the place isn’t overly wealthy, so the only affordable way to connect it to the rest of the world is via a geostationary satellite system. Bandwidth on such satellites is very expensive, so our island needs to make do with inward bandwidth in the tens of Mbps – anything more breaks the bank. Both locally and offshore, the satellite link connects to something that can carry hundreds or thousands of Mbps.
Now you talk to plenty of islanders and you get to hear the horror stories of web pages that never load, computers that never see an update, connections that time out, and so on. So if you could eavesdrop on the satellite link, what would you expect to find?
I guess that, like us, you’d expect to find the link hopelessly overloaded, with packets rushing across it nose-to-tail without gaps. You’d expect to see nearly 100% of the link’s capacity in use nearly 100% of the time. So imagine our surprise when we looked at the satellite link utilisation in a couple of locations and found it to be well below 100%. One large island never saw more than 75% even during time periods of just a few seconds, with the average utilisation being around 60%. Another island didn’t tap into more than one sixth of the theoretically available capacity. Yet looking at the same links, we found that small parts of our data streams were getting wiped out every so often – which is what we would have expected with overloaded links.
Seems weird? Not quite so. The effect is actually quite well described in literature under the heading “queue oscillation”. It’s generally associated with router queues at Internet bottlenecks. So what is it, and why is it happening on geostationary satellite links?
What is queue oscillation?
Let’s use an analogy: Trying to get data from a sender to a receiver through an Internet bottleneck is a bit like trying to pour expensive wine from a barrel into a bottle using a funnel. Think of yourself and the barrel as the data sender; the bottle is the receiver, and the funnel (at the input of which the wine will bank up) is the satellite ground station where data arrives to be transmitted via the link. The link itself is literally the bottleneck.
The goal of the exercise is to fill the bottle as quickly as possible, while spilling an absolute minimum of the valuable wine. To do so, you’ll want to ensure that the funnel (your queue) is never empty, but also never overflows. Imagine that you do this by yourself and that you get to hold the barrel right above the funnel. Sounds manageable? It probably is (unless you’ve had too much of the wine yourself).
OK, now let’s turn this into a party game – in real life many computers download via a satellite link simultaneously. Moreover, a lot of the data senders aren’t anywhere near the satellite ground station. So imagine that you put the bottle with the funnel under the end of a (clean) downpipe, and you get a few friends with barrels (your broadband senders) to tip the wine into the (clean) gutter on the roof. You watch the funnel’s fill level at ground floor and let your friends know whether to pour more or less in. You’re only allowed two types of feedback: “Wine flowing into bottle!” and “Funnel overflowing!”
Bet that filling the bottle is going to take longer with a lot more spillage this way, even if you’re all completely sober? Why? Because your friends have no control over the wine that’s already in the gutter and the downpipe – it’s this wine that causes the overflows. Similarly, if you run out of wine in the funnel, new liquid takes a while to arrive from above. Your funnel will both be empty and overflow at times.
A geostationary satellite link carrying TCP/IP traffic behaves almost exactly the same: The long feedback loop between TCP sender and receiver makes it extremely difficult to control the data flow rate. The fact that multiple parties are involved just makes it a lot worse. On average, users on the island get the impression that the link is a lot slower – and indeed they can access only a part of the capacity they’re paying for and that is being provisioned to them. With satellite bandwidth retailing for hundreds of dollars per megabit per second per month, that’s a lot of money for nothing.
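To get a feel for how the long feedback loop alone can produce this pattern, here is a minimal sketch of the party game in Python. It is a toy model and not a TCP implementation: the senders all act in lockstep, traffic is treated as a fluid, and every name and number in it (`simulate`, the ten senders, the capacity of 100 packets per tick, the queue limit of 50, the delays of 1 and 20 ticks) is an assumption made up for illustration. Senders pour a little more each tick while they hear no complaints and halve their rate when they hear "Funnel overflowing!", but what they hear is `delay` ticks old, and what they pour takes another `delay` ticks to reach the funnel.

```python
from collections import deque

def simulate(delay, senders=10, capacity=100.0, queue_limit=50.0, ticks=2000):
    """Toy fluid model (delay >= 1): returns (link utilisation, packets dropped)."""
    rates = [1.0] * senders               # per-sender rate in packets per tick
    in_flight = deque([0.0] * delay)      # traffic still travelling to the queue
    feedback = deque([False] * delay)     # overflow signals travelling back
    queue = served_total = dropped = 0.0
    for _ in range(ticks):
        # Senders act on feedback that is `delay` ticks old.
        if feedback.popleft():
            rates = [r / 2 for r in rates]    # "Funnel overflowing!": back off
        else:
            rates = [r + 1 for r in rates]    # "Wine flowing!": pour a bit more
        # What is sent now reaches the queue `delay` ticks later.
        in_flight.append(sum(rates))
        queue += in_flight.popleft()
        # The link drains at most `capacity` packets per tick.
        served = min(queue, capacity)
        queue -= served
        served_total += served
        # Whatever does not fit into the queue is lost.
        overflow = queue > queue_limit
        if overflow:
            dropped += queue - queue_limit
            queue = queue_limit
        feedback.append(overflow)
    return served_total / (capacity * ticks), dropped

for delay in (1, 20):
    utilisation, dropped = simulate(delay)
    print(f"delay {delay:2d} ticks: utilisation {utilisation:.0%}, dropped {dropped:.0f} packets")
```

In this sketch the queue both overflows and runs dry in either case, and the run with the longer delay leaves a larger share of the link idle. The real numbers on a real link depend on many things the model ignores, so treat it as an illustration of the mechanism only.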
Who is to blame?
The culprit is quite simply TCP, the protocol that controls data flow across the Internet. More precisely, it’s TCP’s congestion control algorithm. This algorithm exists in various interoperable flavours, none of which was designed specifically with shared narrowband geostationary satellite links in mind. So, if you happen to live in the Islands, it’s not your evil local monopoly ISP, nor the price-gouging satellite provider, the government, or the fact that you may consider your country a developing one.
In TCP’s defence: The problem it would have to solve here is pretty tricky – as you’ll no doubt find out if you try the wine analogy. Even if your friends on the roof are pretty switched on, they’ll still spill plenty of the stuff. Unfortunately, as you’d find out, using a bigger funnel doesn’t help much (it’d still overflow). Explicit congestion notification (ECN) isn’t really workable in this scenario, and we don’t want to limit the number of simultaneous TCP connections on the link either. So we need a Plan B.
Plan B: Could network coding help?
A solution that we have been experimenting with is the use of network-coded tunnels, a project under the auspices of the Pacific Islands Chapter of the Internet Society (PICISOC), supported by ISIF Asia and Internet NZ. Network coding is a technology fresh out of the labs, and in this case we’ve been using a solution pioneered by colleagues at the Massachusetts Institute of Technology (MIT) in the U.S. and Aalborg University in Denmark. The idea behind network coding is based on systems of linear equations, which you might remember from school, like these:
4x + 2y + 3z = 26
2x + 5y + 2z = 19
3x + 3y + 3z = 24
You might also remember that you can solve such a system (find the values of x, y and z) as long as you have – broadly speaking – at least as many equations as you have variables. In network coding, our original data packets are the variables, but what we actually send through our tunnel are the numbers that make up the equations. At the other end, we get to solve the system and recover the values of the variables. As there’s a risk that some of the equations might get lost en route, we just send a few extra ones for good measure.
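If school was a while ago, here is a short Python sketch that solves the three equations above. It uses Cramer's rule with exact fractions, which is our choice for a readable 3x3 example and not how a network coding library goes about it; the helper names `det3` and `solve3` are made up for this post.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(coeffs, rhs):
    """Solve a 3x3 linear system with Cramer's rule, using exact fractions."""
    d = det3(coeffs)
    if d == 0:
        raise ValueError("the equations are not independent")
    solution = []
    for col in range(3):
        # Swap one column of the coefficient matrix for the right-hand side.
        m = [row[:col] + [rhs[r]] + row[col + 1:] for r, row in enumerate(coeffs)]
        solution.append(Fraction(det3(m), d))
    return solution

coeffs = [[4, 2, 3],
          [2, 5, 2],
          [3, 3, 3]]
rhs = [26, 19, 24]

x, y, z = solve3(coeffs, rhs)
print(x, y, z)  # prints: 3 1 4
```

So x = 3, y = 1 and z = 4. In the tunnel, those three values would be the original data, and the rows of numbers `4 2 3 26`, `2 5 2 19` and `3 3 3 24` are what travels across the link.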
We build our tunnels such that one end is on the “mainland” and the other on the island, which puts the tunnel right across the point at which we lose stuff (spill the wine or lose equations, as you wish). So how does this help with queue oscillation? Simple: Since we generate extra equations, we now have more equations than variables. This means we can afford to lose a few equations in overflowing queues or elsewhere – and still get all of our original data back. TCP simply doesn’t get to see the packet loss, and so doesn’t get frightened into backing off to a crawl.
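The following Python sketch plays this through for three "packets" and five equations. Two simplifications are ours and worth stating loudly: as far as we know, production network coding software works with random coefficients over finite fields, whereas this example uses fixed Vandermonde-style coefficients (1, a, a²) and ordinary fractions so that the outcome is deterministic and easy to check. The function names `encode` and `decode` are likewise invented for this post and are not the API of any library.

```python
from fractions import Fraction
from itertools import combinations

def encode(packets, n_equations):
    """Turn the packets into n_equations pairs of (coefficients, value)."""
    coded = []
    for a in range(1, n_equations + 1):
        # Coefficients 1, a, a^2, ...: any len(packets) of these rows are independent.
        coeffs = [a ** j for j in range(len(packets))]
        coded.append((coeffs, sum(c * p for c, p in zip(coeffs, packets))))
    return coded

def decode(coded, n_packets):
    """Gauss-Jordan elimination over the surviving equations; None if too few."""
    rows = [[Fraction(c) for c in coeffs] + [Fraction(v)] for coeffs, v in coded]
    for col in range(n_packets):
        pivot = next((r for r in range(col, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            return None                        # not enough independent equations
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(len(rows)):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [int(rows[i][n_packets]) for i in range(n_packets)]

packets = [3, 1, 4]                  # the original data (our x, y and z)
sent = encode(packets, 5)            # three variables, five equations

# Lose any two of the five equations: the data still comes back intact.
for lost in combinations(range(5), 2):
    survivors = [eq for i, eq in enumerate(sent) if i not in lost]
    assert decode(survivors, len(packets)) == packets

# Lose three, and there is no longer enough to go on.
print(decode(sent[:2], len(packets)))  # prints: None
```

The second case is the one to keep in mind: the extra equations only hide losses from TCP as long as no more equations go missing than we sent spares.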
Does this actually work?
Yes, it does. How do we know? We have two indicators: link utilisation and goodput. In our deployment locations with severe queue oscillation and low link utilisation, we have seen link utilisation increase to previously unobserved levels during tunnelled downloads. The tunnelled connections (when configured with a suitable amount of overhead) provide roughly the same goodput as conventional TCP under non-oscillating conditions with low packet loss. Tunnelled goodput exhibits a high degree of stability over time, whereas that of conventional TCP tends to drop mercilessly under queue oscillation.
“So, tell us, how much better are the network-coded tunnels compared to standard TCP?” Let me note here that we can’t create bandwidth, so this question can be largely reformulated as “How bad can it get for standard TCP?” We’ve seen standard TCP utilise between roughly 10% and 90% of the available bandwidth. On the island with 60% average utilisation, we were able to achieve goodput rates across our network-coded TCP tunnel that were up to 10 times higher than those of conventional TCP – during the times when conventional TCP struggled badly. At other times, conventional TCP did just fine and a network-coded tunnel with 20% overhead provided no extra goodput. However, that’s an indication that, strictly speaking, we wouldn’t have needed all the overhead, and a tunnel with less overhead would have performed better at these times.
So the trick to getting this to work well in practice is to get the amount of overhead just right. If we don’t supply enough extra equations, we risk that losses aren’t covered and the encoded TCP connections lose data and slow down. If we supply too many equations, they take up valuable satellite bandwidth. That’s also undesirable. What we really want is just enough of them, so we’re currently talking to the supplier of the software we’ve been using, Steinwurf ApS of Denmark, to see whether they can add feedback from decoder to encoder for us.
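A back-of-the-envelope sketch in Python shows both sides of this trade-off. The numbers are hypothetical: blocks of 20 original packets, a 10 Mbps link, and a 5% chance of losing any given equation. The sketch also assumes that losses are independent of each other, which is a simplification, since overflowing queues tend to drop packets in bursts. The function names are made up for this post.

```python
from math import comb

def recovery_probability(n, extra, loss):
    """Chance that at least n of the n + extra equations sent survive the link."""
    total = n + extra
    return sum(
        comb(total, k) * (1 - loss) ** k * loss ** (total - k)
        for k in range(n, total + 1)
    )

def goodput_ceiling(link_mbps, n, extra):
    """Best-case rate left for original data once the overhead is paid for."""
    return link_mbps * n / (n + extra)

n, loss, link_mbps = 20, 0.05, 10.0   # hypothetical values, for illustration only
for extra in range(0, 7):
    p = recovery_probability(n, extra, loss)
    g = goodput_ceiling(link_mbps, n, extra)
    print(f"{extra} extra ({extra / n:.0%} overhead): "
          f"recovery {p:.3f}, goodput at most {g:.2f} Mbps")
```

Each extra equation raises the chance that a block decodes without TCP noticing any loss, and each one also lowers the ceiling on goodput: with 4 extra equations per 20 packets (20% overhead), at most 10 / 1.2, or about 8.3 Mbps, of a 10 Mbps link is left for original data. Feedback from decoder to encoder would let the tunnel adjust this number as conditions on the link change.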
Written by Dr. Ulrich Speidel with support from Etuate Cocker, Péter Vingelmann, Janus Heide, and Muriel Médard. Thanks go to Telecom Cook Islands, Internet Niue, Tuvalu Telecommunications Corporation, MIT and, close to home, to the IT operations people at the University of Auckland for putting up with a whole string of extremely unusual requests!