TECHNICAL REPORT

Grantee: University of Auckland
Project Title: Coding internet satellite links for better goodput under bandwidth and latency constraints
Amount Awarded: US$34,345.05
Dates covered by this report: 2018-01-01 to 2018-12-31
Report submission date: 2019-01-18
Economies where project was implemented: New Zealand
Project leader name: Ulrich Matthias Speidel
Project Team:
Partner organization: Steinwurf ApS, Denmark

Project Summary

Many Pacific islands rely on expensive narrowband satellite links for international backhaul connectivity. To make matters worse, many such links exhibit extreme TCP queue oscillation, which causes large end user downloads to slow to a crawl well before the links reach capacity. The oscillation leaves TCP senders operating with out-of-phase congestion windows that are either much larger or much smaller than they ought to be. The latter case leads to link underutilisation after burst packet losses at the link input queue. In a previous project [1], we demonstrated that transparent forward error correction coding with a random linear network code across packets could recoup some of the lost capacity in a real deployment - albeit only for the individual flows we were able to access. In the follow-up project [2], we built a large hardware-based simulator to see whether this solution could be scaled. That project taught us a lot about the need for efficient and well-timed coding in such a bandwidth-constrained environment. The present project aims mainly at a major upgrade of the coding software, so that the timing of coding redundancy can be configured more flexibly and the ineffective coding of small packets and flows can be avoided.


Background and Justification

The Pacific Ocean covers nearly half of the planet's ocean surface. The vast majority of its population sits on the Pacific Rim and is either reasonably well served by fibre-based backhaul Internet connectivity, or, where it is not, the problems tend to be of a political or regulatory rather than technical nature. At the centre of our project, however, are Pacific Island states (or parts thereof). These countries make up only a tiny fraction of the Pacific's population, with sometimes only a few hundred or a few thousand inhabitants per island. The islands are often in very remote locations, separated by the world's deepest ocean, and generally have low GDP and least developed nation status. This makes fibre connectivity especially difficult to afford. While the number of islands with fibre connectivity has been increasing in recent years, the business case remains impossible for many islands. This leaves satellite connectivity as the only option, save experimental approaches such as Google's Project Loon [3].

Satellite connectivity currently comes in two versions: conventional geostationary (GEO) links and links via the more recent medium-earth-orbit (MEO) satellites.

GEO links require satellites to be stationed approximately 35,786 km above the equator. While this permits continuous service via a single satellite and does not require a ground station antenna to track the satellite, the high orbital altitude constrains both the payload that can be injected into such orbits and the choice of launch vehicle. This translates into comparatively small and lightweight satellites with small solar panels (i.e., low transmit power). In turn, this constrains the data rates that such satellites can support, and pushes up the size of ground antennas. The high orbital altitude of GEO satellites also means that signals travelling to and from the satellite have a round-trip time (RTT) of around 500 ms. In comparison, the round-trip time on a fibre optical submarine cable across the Pacific is typically less than 150 ms.

Continuous MEO connectivity requires multiple satellites in a much lower orbit. The current constellation of SES's O3b satellites - the only MEO fleet in operation - orbits around 8,000 km above the equator. As these are not stationary above a particular point on the equator, they cover all longitudes. Ground antennas need to track the satellites, i.e., require moving parts or phased arrays. However, the lower orbital altitude has two advantages: it is cheaper to reach, so it is easier to deploy satellites with larger solar panels and antennas, allowing a stronger signal to be transmitted towards the ground station. The shorter distance between satellite and ground also means that a much larger fraction of the satellite's signal arrives at the ground station, allowing use of a much smaller antenna there while still maintaining a comparatively higher data rate. MEO RTTs are of the order of 120 ms but vary depending on the location of the ground stations with respect to the current position of the satellite.
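The RTT figures follow directly from the orbital altitudes. As a back-of-envelope check, here is a Python sketch with deliberately simplified straight-overhead geometry (real slant paths and processing add further delay):

    C = 299_792_458        # speed of light in vacuum, m/s
    GEO_ALT = 35_786e3     # GEO orbital altitude, m
    MEO_ALT = 8_063e3      # O3b MEO orbital altitude, m

    for name, alt in (("GEO", GEO_ALT), ("MEO", MEO_ALT)):
        # Ground -> satellite -> ground, out and back: four traversals.
        rtt = 4 * alt / C
        print(f"{name}: ~{rtt * 1e3:.0f} ms RTT")
    # Prints ~477 ms for GEO and ~108 ms for MEO, consistent with the
    # "around 500 ms" and "around 120 ms" figures above once slant range
    # and processing delays are added.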

For a Pacific ISP, the entry costs to GEO and MEO connectivity are generally manageable; however, the ongoing cost of connectivity is very high, typically in the hundreds of US dollars per megabit per second per month, with MEO connectivity said to command a premium over GEO connectivity. This inevitably restricts many islands to inbound capacities much lower than the terrestrial Gigabit capacities available at either end of the link. Add the significant RTTs, and the satellite link represents a high latency bottleneck. At the input to this bottleneck sits a more or less complex buffer, which accommodates a (possibly managed) queue of packets heading towards the link.

One might naively assume that given the demand for connectivity in the islands, such bottleneck links should permanently operate at capacity. This is not so. The vast majority of Internet traffic is carried by various flavours of the Transmission Control Protocol (TCP), which implements congestion control algorithms in order to maximise the utilisation of bottlenecks along the transport path. A common feature of these algorithms is that they use acknowledgment (ACK) packets from the receiver to determine which data packets arrived, and to estimate how many data packets can be entrusted to the channel without overloading the tightest bottleneck along the way. This works reasonably well as long as the various TCP connections sharing a bottleneck get timely feedback, i.e., when the RTT across the bottleneck is short.

Add the long RTT of a satellite link, and TCP loses the plot of what is happening at the bottleneck: It sends increasing numbers of packets while the queue is overflowing, tricked into doing so by the receipt of ACKs for packets that passed through the bottleneck long before the queue started to build up. Conversely, TCP backs off excessively when the ACKs that the packets lost at the overflowing queue should have elicited fail to arrive. As all TCP senders sharing the satellite link do so more or less simultaneously, very little traffic arrives at the link, the queue clears, and the link sits there underused for a while. TCP senders flip-flop between these states, an effect known as "TCP queue oscillation".
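The mechanism is easy to reproduce in a deliberately crude discrete-time toy model (a sketch only: all constants are illustrative, and real TCP reacts at most once per RTT rather than every tick, so the oscillation here is exaggerated):

    LINK = 100     # packets the link transmits per tick
    QUEUE = 300    # input queue capacity in packets
    DELAY = 25     # feedback delay in ticks - the long satellite RTT

    cwnd, queue, sent = 10.0, 0, 0
    pending = [False] * DELAY        # loss signals still in flight
    for tick in range(400):
        queue += int(cwnd)                  # sender keeps pushing packets
        dropped = max(0, queue - QUEUE)     # tail drop at the full queue
        queue = min(queue, QUEUE)
        tx = min(queue, LINK)               # link drains at its fixed rate
        queue -= tx
        sent += tx
        pending.append(dropped > 0)
        if pending.pop(0):                  # stale news of a loss arrives
            cwnd = max(1.0, cwnd / 2)       # multiplicative decrease
        else:
            cwnd += 5                       # growth on equally stale good news
    print(f"link utilisation: {sent / (400 * LINK):.0%}")

The model cycles through exactly the phases described above - overshoot, burst loss, excessive back-off, idle link - and prints a utilisation well below 100% even though demand never stops.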

The number of data packets a sender is willing to entrust to the channel is called its "congestion window". If the amount of data that a connection has to transfer fits entirely within its initial congestion window, chances are that the sender will never get to adjust its window. TCP queue oscillation thus predominantly affects voluminous TCP flows. These are a minority of flows, but carry a significant share of the data: large e-mail attachments, software downloads, and large image, audio or video files all fall within this category. Alleviating the effects of TCP queue oscillation on large flows thus means paying attention to the needs of these flows.

The conventional approach to this problem is the use of a performance-enhancing proxy (PEP). PEPs are generally sold as an add-on to a link, but open source solutions are also available. PEPs operate at least on the world side of a satellite link, and typically split the connection between TCP sender and receiver by spoofing each end towards the respective other. This splits the RTT into a satellite segment and a terrestrial segment at the world end, alleviating the problem of the long RTTs somewhat. If PEPs are installed on either side of the satellite link, it is also possible to use special flavours of TCP or even entirely different protocols across the link itself. However, splitting connections has undesirable consequences - one side may believe that data has been delivered when in fact it has not.

The coding approach we are trying can in principle be used on its own or in combination with PEPs. It considers that the link underutilisation is a result of excessive TCP back-off following a queue overflow event. TCP recovers from such losses by retransmitting packets after having waited for the corresponding ACK for a reasonably generous timeout period. When doing so, TCP also reduces its congestion window. If we can deliver replacements for any lost packets and elicit and return an ACK before this timeout period, we can prevent the reduction of the congestion window.

The code used for this is a random linear network code, a technology originally developed at MIT and currently being implemented in a variety of contexts by Steinwurf ApS, a Danish start-up company. Its principle of operation is quite simple: At the encoder on each side of the link, we intercept each incoming IP (Internet Protocol) packet and form sets (generations) of N packets at a time. As IP packets are just byte sequences, and each byte is just a number, we can treat them as numbers, i.e., add them together or multiply them by other numbers. For each generation of N packets, we initially add a coding header to each packet. It identifies the generation that the packet belongs to, as well as which of the N packets in the generation it is, and whether it is coded or not. These N original packets are sent out immediately to the decoder with a header that identifies them as "uncoded". In a second step, the encoder generates M linear combinations of these N packets, i.e., the encoder multiplies each of the original packets by a random coefficient, sums the multiplied packets up, and sends the coefficients and the sum in a "coded" packet to the decoder. For N incoming packets, the encoder thus sends N+M packets towards the link. N is known as the "generation size" and the last M packets are known as the "overhead" or "redundancy" packets.
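To illustrate the principle, here is a minimal systematic encoder over GF(2^8) (a sketch only, not the production software; the reduction polynomial, generation size and redundancy count are illustrative choices):

    import random

    N = 120   # generation size (a value from our current experiments)
    M = 8     # redundancy packets per generation (illustrative)

    def gf_mul(a, b):
        """Multiply two bytes in GF(2^8) using the 0x11B polynomial,
        one common choice; the field used in practice may differ."""
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            carry = a & 0x80
            a = (a << 1) & 0xFF
            if carry:
                a ^= 0x1B
            b >>= 1
        return p

    def encode_generation(packets):
        """Systematic encoding: emit the N originals unchanged, then M
        random linear combinations. In the real software an uncoded
        packet carries only its index i in the coding header, which
        stands for the unit coefficient vector e_i; the vectors are
        spelled out here for clarity."""
        size = max(len(p) for p in packets)
        padded = [p.ljust(size, b'\x00') for p in packets]
        out = []
        for i, p in enumerate(padded):
            unit = [1 if j == i else 0 for j in range(len(padded))]
            out.append((unit, p))                  # "uncoded" packet
        for _ in range(M):
            coeffs = [random.randrange(1, 256) for _ in padded]
            combo = bytearray(size)
            for c, pkt in zip(coeffs, padded):
                for k, byte in enumerate(pkt):
                    combo[k] ^= gf_mul(c, byte)
            out.append((coeffs, bytes(combo)))     # "coded" packet
        return out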

When the decoder receives an uncoded packet, it simply strips the coding header and forwards the packet to its original destination. It also retains a copy in case the link queue drops one or more of the other uncoded originals. When this happens, these originals may be recovered with the help of the coded packets: We simply treat each coded and uncoded packet as an equation in a system of linear equations, where the original packets represent the variables of the system. High school mathematics tells us that a system of linear equations with N variables may be solved as long as we have at least N linearly independent equations, i.e., we may lose up to M of the N+M packets of a generation to the queue and still recover all N originals.
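Continuing the sketch above, recovery is ordinary Gauss-Jordan elimination over GF(2^8) - any N linearly independent packets of a generation, uncoded or coded, suffice:

    def gf_inv(a):
        """Inverse in GF(2^8): a^254, since a^255 = 1 for nonzero a.
        Uses gf_mul from the encoder sketch above."""
        r = 1
        for _ in range(254):
            r = gf_mul(r, a)
        return r

    def decode_generation(received, n):
        """Solve the linear system formed by the received packets, each
        given as (coefficient_vector, payload). Raises StopIteration if
        fewer than n linearly independent packets survived the queue."""
        rows = [list(c) + list(p) for c, p in received]
        for col in range(n):
            # Find a pivot row with a nonzero entry in this column.
            pivot = next(r for r in range(col, len(rows)) if rows[r][col])
            rows[col], rows[pivot] = rows[pivot], rows[col]
            inv = gf_inv(rows[col][col])
            rows[col] = [gf_mul(inv, x) for x in rows[col]]
            # Eliminate this column from every other row.
            for r in range(len(rows)):
                if r != col and rows[r][col]:
                    f = rows[r][col]
                    rows[r] = [x ^ gf_mul(f, y)
                               for x, y in zip(rows[r], rows[col])]
        return [bytes(row[n:]) for row in rows[:n]]

Dropping any M of the N+M encoder outputs and decoding the rest returns the original packets - with high probability, since randomly chosen coefficients can occasionally turn out linearly dependent, in which case a further coded packet is needed.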

This coding approach allows the original TCP sender and receiver to remain totally oblivious to the coding in the middle and preserves the Internet's end-to-end principle. In our initial project [1], we found that this worked very well when applied to individual large TCP connections: Goodput increased significantly.

Scaling this to all the TCP traffic to an island presents a number of challenges, however. Firstly, testing in a production environment is prohibitive. This led us to build a hardware-based simulator capable of replicating a GEO or MEO satellite link with up to 500 Mbps capacity and the associated demand scenario of an island with up to around 3,000-4,000 simultaneous users [2]. The simulator now consists of 96 Raspberry Pi and 10 Intel NUC "island client" machines, 5 Super Micro servers along the satellite chain (2 x PEP, 2 x encoder/decoder, 1 x satellite emulator), 22 "world server" machines to generate background traffic, two special purpose servers on the world side, one Raspberry Pi on either side for signalling purposes, two capture servers, six copper taps, two storage servers, and a command and control server, all in four 7ft 19" racks. See http://sde.blogs.auckland.ac.nz/ for a more detailed description and methodology.

The second challenge in "all-of-island" coding is to find a set of code parameters that works in a given scenario. There are a number of tradeoffs involved: The larger we make N, the more uncoded packets a set of M coded packets can protect, but if N is too large, the coding/decoding causes potentially unacceptable delays. The larger we make M, the more protection against packet loss we obtain - but the additional M packets must also fit within the spare capacity on the link. As part of our previous project, we looked at this and found that since most packet losses on the queue came in bursts, M could be made adaptive - but then there is another parameter for the adaptive algorithm to adjust. The project also pointed out that such burst errors could be distributed across multiple generations if we interleaved them, allowing us to keep M smaller.

However, that project also found that timing was a more important aspect than originally envisaged: The average rate at which data packets are generated and arrive at the link is roughly of the order of the link data rate itself. E.g., on a 16 Mbps GEO link, we would not expect this rate to exceed, say, a few dozen Mbps. This also applies to the first N uncoded packets of each generation. However, the M coded packets are generated by an encoder on a Gigabit Ethernet interface, and are fired off in the direction of the link immediately after the last uncoded packet, at a rate of 1 Gbps. From the perspective of the link queue, that corresponds to a very large number of packets arriving in a very short amount of time. This has two consequences: Firstly, if our original N packets are affected by packet loss due to an overflowing queue, the additional M packets arrive while the queue is still overflowing, and will also be dropped. Secondly, if the queue is not currently overflowing, the flash flood of packets from the encoder can cause it to overflow.
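The scale of the timing mismatch is easy to quantify (illustrative figures):

    PKT = 1500 * 8    # packet size in bits (illustrative)
    M = 10            # redundancy packets in one burst (illustrative)

    burst = M * PKT / 1e9     # burst arrives at Gigabit Ethernet rate
    drain = M * PKT / 16e6    # time a 16 Mbps link needs to send it

    print(f"burst arrives within {burst * 1e6:.0f} us")   # ~120 us
    print(f"link drains it in {drain * 1e3:.1f} ms")      # ~7.5 ms
    # Ten 1500-byte packets claim 15 kB of a 120 kB input queue
    # essentially instantaneously - an eighth of the queue per burst.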

The present project had five main objectives:

1. Extension of the network coding software. The purpose of this extension was to allow us to set a configurable delay between the transmission of the last original packet of a generation and the first redundancy packet. The existing software also had a limit on N in that all N coefficients were included in the coding header of coded packets, which for large N grew to impractical sizes. We also wanted a whitelisting functionality to be able to code only packets from certain TCP sources (with large flows).

2. Deployment of the new software on the simulator.

3. Acquiring the ability to measure goodput at all stages of the simulated satellite link.

4. An investigation into the predictability of TCP flow size based on IP address and port combinations, as an input to the whitelisting functionality under Objective 1.

5. A packet aggregation algorithm for the coding software, allowing us to code byte blocks rather than packets, removing the inefficiency associated with coding small packets (see the sketch below).
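The following sketch shows the aggregation idea behind Objective 5 as we understand it (not Steinwurf's implementation; block size, timeout and the length-prefix framing are illustrative assumptions):

    BLOCK_SIZE = 1500   # bytes per coding block (illustrative)
    TIMEOUT = 0.001     # 1 ms wait for an incomplete block

    def aggregate(packets):
        """Pack a stream of (arrival_time, payload) packets into
        BLOCK_SIZE-byte blocks for coding. A block left incomplete for
        longer than TIMEOUT is flushed as-is; a real encoder would use
        a timer, whereas this sketch checks lazily on the next arrival."""
        buf, started = bytearray(), None
        for t, pkt in packets:
            if buf and t - started > TIMEOUT:
                yield bytes(buf)                 # timeout: flush partial block
                buf, started = bytearray(), None
            # A 2-byte length prefix lets the decoder re-split the block.
            buf += len(pkt).to_bytes(2, 'big') + pkt
            if started is None:
                started = t
            while len(buf) >= BLOCK_SIZE:        # full blocks leave at once
                yield bytes(buf[:BLOCK_SIZE])
                buf = buf[BLOCK_SIZE:]
                started = t if buf else None
        if buf:
            yield bytes(buf)

Coding blocks rather than packets means a generation of size N always covers N x BLOCK_SIZE bytes, however small the packets inside are.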

Project Implementation

The first step in the project implementation was to commission the enhanced coding software from Steinwurf ApS (Objectives 1 and, with lower time priority, 5). Our original project plan envisaged that this would take several months to complete. However, Steinwurf surprised us with a staggered delivery of all features in January/February, including the aggregation algorithm option, with a number of subsequent updates as a result of feedback from our end.

This resulted in a rearrangement of the original project plan as we could now test the new software much earlier than anticipated. The testing revealed a number of issues that required follow-up:

- Our new simulator configuration with the ability to capture packets at multiple stages of the simulated link was causing our storage disk to fill rapidly. It became clear that we would require significantly more storage for trace files in order to go ahead. Luckily, we were able to obtain two storage servers with 8 x 14 TB each (96 TB each in RAID configuration) as part of our annual CAPEX round. These are now configured and integrated, and have capacity for further expansion if needed.
- The extended version of the software, when not using the packet aggregation algorithm, caused connections on the simulator to stall during the experiment. The symptoms of this stall are rather strange - the stalls occur progressively and seemingly randomly on individual connections and are not traceable to individual machines. The trace files show flawless communication, with the last data packet from the server duly acknowledged by the client, and then no further communication between the two, hogging the client socket on what becomes a de facto inactive connection. The machines affected had generally already completed a large number of connections, and the remaining clients are similarly able to complete many more connections via the coding software. These symptoms do not occur when the coding software is not in use, and do not occur when the new packet aggregation algorithm is in use. We are currently unable to tell whether the problem lies with the coding software or with the simulator.
- During the investigation of the aforementioned problem, we repeated a number of baseline experiments without the coding software, during which we noticed performance impairment compared to previous baselines. We traced this to one of the older machines in the satellite chain - again, our annual CAPEX round provided relief here in the form of a new satellite emulator and two new machines for performance-enhancing proxies. These have just recently been commissioned and we are currently running a new set of baselines. Once this is done, we will return to work with the coding software - hopefully, this upgrade will also have resolved the connection stall issue.

The hardware upgrade was drawn out due to a warranty issue with one of the storage servers, and was immediately followed by the annual electrical inspection of our lab - a two-day job for the electrician, but over a week for us to plug everything back together in the right order and test that it all works. We used this downtime to upgrade our client software and experiment harness scripts, and we are now able to collect goodput data directly from the clients, rather than being restricted to looking at TCP bytes flowing through our copper taps. These two measures are not the same: TCP bytes as recorded over a given time period by a copper tap are not necessarily goodput, as they may contain retransmissions. Conversely, goodput as reported by the clients during the same time period may omit data from TCP packets where a predecessor packet has not been received, or may report data belonging to packets that precede the measurement period if a packet completing the byte stream was received during the measurement period.

The project period to date has also seen two scholarly conference papers submitted and accepted. One was presented at APAN in Auckland in August 2018 [4] and resulted in a Best Student Paper Award for Lei Qian; the other was presented at IEEE GlobeCom in Abu Dhabi in December 2018 [7] with travel support courtesy of InternetNZ. The project leader also shared insights from the project during a lightning talk at APNIC46 in Noumea [5], chaired a panel session on satellite connectivity in the Pacific, and was interviewed for Vanuatu's Tek Tok (Tech Talk) TV show. APNIC's assistance in making this attendance possible is gratefully acknowledged. As APNIC46 was also attended by a number of ISPs and OEMs from the region, this presented an opportunity to talk to many of them. One of our lab's former masters students, Fuli Fuli, was also there as an APNIC fellow - we are looking forward to seeing him return to the lab as a PhD student in 2019.

Implementation progress: At this point, Objectives 2, 3, and 5 are complete, with Objective 1 needing further investigation of the stall issue. However, from our work with the packet aggregation algorithm to date, we know that the delay and extended generation size parts of the software upgrade work. We have yet to test the whitelisting feature, as the associated flow-size investigation (Objective 4) was pushed back as a result of the earlier availability of the enhanced coding software. Personnel: Lei Qian was unable to take part in the project for a number of months due to unforeseen personal circumstances, but is now back on deck.

Project Evaluation

At this point in time, Objectives 2, 3, and 5 have been achieved. Objective 1 has been partially achieved; the one remaining task is to clarify the observed stall issue and, if required, amend the software. In its current state, the software upgrade has already allowed us to observe the impact that certain parameter changes (overhead delay in particular) have on the loss of overhead packets. Similarly, we have been able to demonstrate that a packet aggregation approach is workable in principle. We now need to optimise it, which is mostly a matter of (time-consuming!) computing and measuring. Objective 4 is ongoing.

Gender equality, diversity and inclusion: The main motivation for the project was to address the growing digital divide between satellite-connected Pacific islands and the rest of the world, and as such it aims at improved inclusion. In terms of development of technical capacity, the project has helped us entice our Samoan Masters student back into attempting a PhD. He is currently preparing the research proposal for his application for admission. We have also been joined by a female Honours student, whose dissertation will look at ways of managing the burgeoning amounts of experiment data that have accrued.

To what extent has the project lived up to its potential for growth/further development? The project has helped us develop an alternative coding approach that is based not on packets but on aggregated byte blocks, which may contain multiple small packets. This allows us to accommodate more packets within the same generation size and avoids the inefficient encoding of small packets. The project has also helped us consolidate our satellite simulator laboratory further: The previous project led to the simulator being accommodated in a newly established dedicated network laboratory rather than a student pool office, as well as growth in hardware. This project has helped us acquire additional storage in the form of two large storage servers with 96 TB of capacity each. Each individual experiment generates hundreds of MB of data and some can generate up to 3.6 GB - and we have already run many thousands of experiments whose data we need to store.

The project has also helped us to obtain CAPEX for improved compute capacity for three of our satellite link chain machines, which has allowed us to verify that our simulated high-speed MEO baseline data was not constrained by CPU capacity.

We are also still hopeful that we can move the simulator into a dedicated climate-controlled environment. Initially this was meant to happen within the existing laboratory space - a move that had been in the planning phase for a while but had been pushed back repeatedly by our facilities managers for a variety of reasons, ranging from personnel shortages and asbestos and fire damage in other buildings to external chiller siting problems. The latter have now prompted discussions about moving the laboratory again. Two suitable candidate rooms have been identified and we are currently looking at design options.

What were the most important findings and outputs of the project? What will be done with them? The main project finding to date is that "all-of-island" coding remains challenging as the spare capacity on the uncoded link needs to accommodate both the coding overhead and any gains from coding. We now have a significant number of parameters for the coding. These include:

  • Generation size: The number of packets (or aggregation blocks) that will be coded together. Choosing a larger generation size allows the same number of coded redundancy packets to protect a larger number of original packets. As the encoder can only provide the redundancy packets once all original packets have been received, a large generation size delays delivery of the redundancy packets. This can cause decoding delay if there is packet loss at the link input queue and the redundancy packets are required to compensate for the loss. Before the start of the present project, generation size was limited to around 60. The software upgrade that was part of Objective 1 now lets us use larger generation sizes. Our current experiments use values of around 120, with values up to 180 being computationally feasible (decoding complexity is proportional to generation size cubed, so large generation sizes quickly become prohibitive).
  • Maximum number of redundancy packets to send for each generation.
  • Number of generations for which the number of redundancy packets above will be retained if there is no packet loss (retention window): This parameter allows us to adapt our coding to the occurrence of packet loss and to code more lightly under low demand conditions, thus freeing up link capacity (see the sketch after this list).
  • Number of generations by which to delay the transmission of the redundancy packets. This parameter can prevent redundancy packets from being transmitted into an already overflowing input queue. Before the project, we could delay by at most one generation, and this proved to be insufficient. The software upgrade now lets us delay by an arbitrary number of generations. In our current GEO experiments, we delay by two generations, which seems to be sufficient to get most redundancy packets across the link.
  • A size threshold for coding below which a packet will not be coded. If the input queue to the satellite link is a byte queue, it will continue to accept small packets even when it already rejects larger packets. Coding small packets is therefore not necessarily economical, and this parameter allows us to save the associated redundancy.
  • A timeout for the new packet aggregation mode specifying how long the encoder should wait for additional packets to arrive when a byte block is incomplete. Our results to date show that under medium load (60 channels) on a coded 16 Mbps GEO link with a 120 kB queue, a 1 ms timeout causes just over 80% of blocks to be filled to capacity. A 5 ms timeout increases this to about 96%, but on average delays roughly 200 coded packets per second by 4 ms each. On a 64 Mbps GEO link with a load of 180 channels, a 1 ms timeout is sufficient to fill 93% of blocks.
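To illustrate how the retention window is intended to interact with adaptive redundancy, here is a minimal sketch (the decay policy and all constants are our illustrative assumptions, not the actual algorithm in the software):

    MAX_REDUNDANCY = 10     # redundancy packets after a loss (illustrative)
    MIN_REDUNDANCY = 2      # redundancy packets in quiet periods
    RETENTION_WINDOW = 4    # generations to hold the raised level

    def redundancy_schedule(loss_flags):
        """loss_flags[i] is True if generation i saw packet loss at the
        link input queue; returns the redundancy level per generation."""
        out, hold, level = [], 0, MIN_REDUNDANCY
        for lost in loss_flags:
            if lost:
                level, hold = MAX_REDUNDANCY, RETENTION_WINDOW
            elif hold > 0:
                hold -= 1       # retain the raised level for a while
            else:
                level = MIN_REDUNDANCY
            out.append(level)
        return out

    # redundancy_schedule([0, 1, 0, 0, 0, 0, 0, 0])
    # -> [2, 10, 10, 10, 10, 10, 2, 2]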

Together with the link and load parameters, the coding parameters above give rise to a combinatorially vast number of possible combinations, and each combination requires around 10 experiments in order to have a reasonably reliable base of experimental data for assessment. Each experiment takes around 15-20 minutes, not counting analysis time.

At this point, we are still learning how the various parameters interact. For example, deploying more redundancy packets allows us to correct more packet losses, but it also affects timing. Changing the size threshold for coding can have a significant effect on the time period that is encapsulated in a single generation. Any change in the timing affects the meaning of the parameter value that controls the delayed transmission of the redundancy packets. Some parameters, such as the generation size, number of redundancy packets and retention window, depend on the link rate and type, whereas parameters such as the size threshold depend only on the traffic distribution.

Moreover, coding chiefly assists long flows, so where we see improvements in the rate for long downloads, we often find that these come at the price of lower overall goodput on the link. An optimised code would hopefully yield both higher rates for large transfers and higher overall goodput on the link.

Our strategy at the moment is to try and optimise one parameter at a time. At this point, we already have approximate values for the packet aggregation timeout and the redundancy delay, and a few tentative values for generation sizes at 16 Mbps and 64 Mbps GEO. We have observed significant performance differences for different retention window sizes, however these indicate that we may be operating either well below or well above the optimal values, and further experimentation will be needed to clarify this.

What lessons can be derived that would be useful in improving future performance? Our experiments to date show that excess redundancy on the link is a serious impediment in the all-of-island coding scenario because it adds queue sojourn delay compared to uncoded links. We are therefore considering a scheme in which coded redundancy packets are only sent when there is available capacity on the link, so that they cannot delay original data. This will require a separate managed queue, and we are currently evaluating whether this would best be implemented as part of the coding software's kernel module or as an external entity built with Linux qdiscs.
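A minimal sketch of the scheme under consideration (illustrative only, and deliberately agnostic as to whether it would end up in the kernel module or in a qdisc): redundancy packets wait in a queue of their own and are only dequeued when no original data is pending.

    from collections import deque

    class RedundancyAwareQueue:
        """Two internal queues; redundancy packets leave only when no
        original data is waiting, so they can never delay it."""
        def __init__(self, capacity_bytes):
            self.data = deque()
            self.redundancy = deque()
            self.capacity = capacity_bytes
            self.used = 0

        def enqueue(self, pkt, is_redundancy):
            q = self.redundancy if is_redundancy else self.data
            if self.used + len(pkt) <= self.capacity:
                q.append(pkt)
                self.used += len(pkt)
            # else: tail drop, as at the real link input queue

        def dequeue(self):
            """Called whenever the link can transmit another packet."""
            q = self.data if self.data else self.redundancy
            if not q:
                return None
            pkt = q.popleft()
            self.used -= len(pkt)
            return pkt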

To what extent did the project help build up the capacity of your institution or of the individuals involved? We now have the capacity to run experiments that nobody else in the satellite research community can perform. This lends substantial weight to our results. For example, the principal investigator recently presented a paper [7] at IEEE GlobeCom in Abu Dhabi on a spin-off result from our simulator facility, which queries the appropriateness of current satellite link input buffer dimensioning. After the presentation, he was approached by Dr. Lin-Nan Lee, VP of Hughes Network Systems, which runs a direct-to-end-user satellite Internet network across the Americas. Dr. Lee reported that they had seen similar effects to those described in the presentation on some of their end customer uplinks and had previously not been able to make sense of them.

The questions thrown up by the project to date will also provide plenty of material for future student projects.

Were certain aspects of project design, management and implementation particularly important to the degree of success of the project? Two words: Quality Assurance. In a simulator with over 140 CPUs, 10 networks, and a very complex experiment procedure, there is plenty that can go wrong, and not everything that goes wrong gets detected automatically. While QA red flags have led us to repeat numerous experiments and have certainly slowed experimentation down, careful inspection of experiment output ensures that we produce quality data. Where feasible, we have tried to automate QA, but nothing replaces a watchful eye over a result file.

Indicators

1. Extension of the network coding software.
   Baseline: The delay between the last coded original packet and the first redundancy packet could be 0 or 1 generations only. Generation size was limited to ~60 packets. The MTU of coded packets was fixed. All packets were coded.
   Project activities: Commissioned upgrade from Steinwurf ApS.
   Outputs and outcomes: The delay between the last coded original packet and the first redundancy packet is now flexibly configurable. Generation size can now be set arbitrarily without affecting the size of the coding header in each packet. The MTU of coded packets can now be adapted via a parameter.
   Status: Complete, except for resolution of a bug (which affects the non-aggregating version of the software only, and may not be in the software itself).

2. Deployment of the new software on the simulator.
   Baseline: Simulator used the old version of the software.
   Project activities: Compilation of the software; integration into the experiment harness (addition of new parameters to control scripts, analysis software, etc.); testing.
   Outputs and outcomes: New coding software operational.
   Status: Complete (except for bug investigation in non-aggregating mode).

3. Acquiring the ability to measure goodput at all stages of the simulated satellite link.
   Baseline: The simulator did not have copper taps at each stage of the satellite chain; problem diagnosis was difficult.
   Project activities: Acquisition and connection of additional copper taps; adaptation of experiment scripts to collect from the new taps.
   Outputs and outcomes: Experiments collect data from all stages.
   Status: Complete.

Gender Equality and Inclusion

Although the project was not specifically designed to promote gender equality, the simulator facility it has helped to extend has now attracted its first female graduate student.

A much bigger aspect of the project is inclusion, as its ultimate goal is to lessen the impact of the digital divide created by satellite links in Pacific islands unable to achieve cable connection in the short term. Better connectivity in the islands is a goal that meets the aspirations of many islanders, who are a well-recognised equity group in Aotearoa/New Zealand.

In a university context, this translates into a shortage of Pasifika students especially at graduate level, lower pass rates and grades than other ethnicities, and a higher drop-out rate. The idea for the simulator was first hatched as a consequence of the first PhD in Computer Science by a Tongan student ('Etuate Cocker), and having the facility has helped us recruit our first Samoan Masters student, Fuli Fuli, who completed during the course of the project and has since attended APNIC46 in Noumea as an APNIC fellow. He intends to return as a PhD student.

The simulator facility that this project has helped consolidate acts as an attractor for Pasifika students as they can see a way in which they can support their communities through their research - a big part of the "why" that is so important as a motivator in a Polynesian context. We would especially like to acknowledge APNIC's role here in supporting us beyond this ISIF project in the form of Fuli's APNIC fellowship and in giving the principal investigator the opportunity to attend APNIC46 in Noumea and APT in Samoa. This helps us motivate young people to pursue IT qualifications, but also lets us touch base with their island context to promote what we do and how we support their young generation. For example, the principal investigator's presentation at APT in Samoa in November 2018 precipitated a discussion about educational goals over lunchtime. This highlighted island focus on discipline, and allowed the principal investigator to put the case for fostering more self-worth, initiative and creativity among island youth.

Project Communication Strategy

As an Internet Operations Research project, the most direct users of our findings are in the research sphere and in the island ISP community, as well as among those who supply such ISPs with satellite connectivity, either as consultants or as satellite network operators. The following activities were undertaken to inform these users of our work:

  • At a presentation at the Asia Pacific Advanced Network Research Workshop in Auckland in August 2018, Lei Qian gave an introduction to our simulator and to the way in which we run our experiments [4]. This resulted in a best student paper award. The target audience here were mostly from the Internet research community.
  • An abridged version of this presentation became a lightning talk at the APNIC46 meeting in Noumea in September 2018, where the principal investigator also chaired a panel session on satellite connectivity in the Pacific. The target audience here included island users, ISPs and those implementing satellite links, as well as regulators [5].
  • In the context of a presentation on social, cultural and environmental issues at the APT conference of telecoms regulators in Apia, Samoa, at the end of November 2018 [6], we were able to bring our work to the attention of regulators and satellite connectivity providers.
  • At IEEE GlobeCom in Abu Dhabi in December 2018 [7], we were able to discuss our findings on satellite link input queue dimensioning with experts on satellite networks from the US, Europe and Asia. The conference also provided a fascinating insight into the important role that satellite networks will play alongside terrestrial networks as part of 5G+ and 6G networks.
  • The principal investigator contributed four guest posts [8][9][10][11] to the APNIC technical blog during the course of the project: https://blog.apnic.net/author/ulrich-speidel/

Recommendations and Use of Findings

Our main recommendation from this project is that certain types of research take a lot of time. In all-of-island coding, the possible gain one can expect is limited by the spare capacity on the link minus any coding redundancy transported by the link, and this limit is difficult to reach. E.g., a 16 Mbps link may already see 11 Mbps average link utilisation at a given load level. This leaves 5 Mbps of spare capacity, out of which we may need to allocate 1-2 Mbps for coding redundancy. Realistically, we may only be able to achieve another 1 Mbps or so in gain in this case. Due to the Monte Carlo technique that underpins the generation of the background traffic in the simulator, even uncoded baseline results for the same parameter combination show significant variability, in particular in the timing of the large data transfer that is part of our measurement. Differences in average transfer rate of a factor of two for the same parameters are not uncommon. This necessitates multiple measurements for each parameter combination, with a minimum of 8-10 measurements required to be able to draw any conclusions at all. With each measurement taking 15-20 minutes, scanning all possible parameter combinations becomes prohibitive.

This project increased the number of parameters available to us and, as a result, the number of possible combinations that might conceivably work. We already know that most combinations will not work: either the redundancy that makes it onto the link takes up too much capacity, or the code offers too little protection against the queue overflows that occur, or the code structure results in timing that causes the redundancy to be lost because it ends up being transmitted into an already overflowing queue. Traditional research in error-correction coding focuses on error rates rather than on the time periods during which errors occur, so our timing considerations represent a somewhat novel approach - but that does not make finding good codes any easier. The other aspect here is that in traditional coding research, the cause of errors is noise or interference from unrelated signals, and as such does not depend on the chosen code. In our case, the loss mechanism is queue tail drop, and the amount and timing of the tail drops are of course meant to change as a result of our code choice. Moreover, the periodic nature of TCP queue oscillation means that parameters such as the redundancy delay have the potential to "overshoot" the optimal delivery time and deliver the redundancy during the subsequent overflow phase instead.

Our current approach is to run a batch of experiments for a small range of parameter combinations, often only varying one or two parameters, followed by in-depth analysis, much of which is manual at this point in time. This then guides us to either change these parameters, or maintain them and vary others.

Use of findings: Towards the end of the project, we learned at IEEE GlobeCom that our problem scenario (TCP queue oscillation on the link from the world Internet to an island) also occurs in direct-to-end-user Gbps-class satellite networks, albeit in the reverse direction from end user to Internet, where the capacity is similarly constrained to only a few Mbps. End users sharing their connection among multiple sub-users (e.g., other household members or employees on the same site) can trigger the same effect, which should make our coding approach useful there in principle as well.

We are very grateful for the ISIF secretariat's flexibility with regard to our non-submission of an interim report and the trust extended to us in this regard. The preparation of an interim or final report takes us around a week of full-time work each, and due to unforeseen circumstances, the demand on the principal investigator's time in mid-2018 was unusually high.

Bibliography

[1] U. Speidel and 'E. Cocker, 'Improving Internet user experience in Pacific Island countries with network coded TCP', ISIF Asia grant 2014 final report, https://application.isif.asia/theme/default/files/isifasia_grants2014_finaltechnicalreport_universityofauckland_picisoc1.pdf
[2] U. Speidel and L. Qian, 'Realistic simulation of uncoded, coded and proxied Internet satellite links with a flexible hardware-based simulator', ISIF Asia grant 2016 final report, https://application.isif.asia/theme/default/files/ISIFAsia_2016_Grants_TechReport_UoA_SimulationSatPAC.pdf
[3] Project Loon home page, https://loon.co
[4] U. Speidel and L. Qian, 'Simulating satellite Internet performance on a small island', Asia Pacific Advanced Network Research Workshop, Auckland, August 2018. This paper won a best student paper award.
[5] U. Speidel and L. Qian, 'Simulating satellite Internet performance on a small island', Lightning talk, APNIC46 Noumea, September 2018.
[6] U. Speidel, 'Social, Cultural and Environmental Issues in the (South) Pacific and their Relationship to Internet Connectivity', Asia Pacific Telecommunity Conference, Apia, Samoa, November 2018.
[7] U. Speidel and L. Qian, 'Striking a balance between bufferbloat and TCP queue oscillation in satellite input buffers', IEEE GlobeCom, Abu Dhabi, December 2018.
[8] U. Speidel, 'Should I allow UDP over my satellite link?', APNIC Blog, March 2018, https://blog.apnic.net/2018/03/15/should-i-allow-udp-over-my-satellite-link/
[9] U. Speidel, 'Striking a balance between bufferbloat and TCP queue oscillation', APNIC Blog, March 2018, https://blog.apnic.net/2018/03/19/striking-a-balance-between-bufferbloat-and-tcp-queue-oscillation/
[10] U. Speidel, 'GEO vs MEO: which satellite solution works best?', APNIC Blog, March 2018, https://blog.apnic.net/2018/03/20/geo-vs-meo-which-satellite-solution-works-best/
[11] U. Speidel, 'Satellite still a necessity for many Pacific Islands', APNIC Blog, September 2018, https://blog.apnic.net/2018/09/18/satellite-still-a-necessity-for-many-pacific-islands/

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License