Entrepreneurs always require legal advice when they are trying to secure support from investors, but most of the time (especially when they are starting a business) they do not have the expertise to assess legal documents such as fundraising paperwork.
Entrepreneurs in South and Southeast Asia who have found potential investors also struggle, as online templates are primarily in English and most web-based solutions available today cover only US jurisdictions.
Legalese, supported by an ISIF Asia 2016 Technical Innovation grant, created an application that enables Asian entrepreneurs and investors to complete financing transactions without hiring lawyers for the paperwork. The app can access a library of contract templates, generate a suitable set of agreements, and help users understand the implications of each proposed action. In the long term, software development will begin to influence the delivery of law. Instead of relying on an experienced lawyer for advice, an end-user may consult the legal equivalent of StackExchange to hear opinions from a community of peers.
By building a fully-automated interface with smart defaults, embedded user education, and automated legal logic, Legalese aims to eventually be able to help these entrepreneurs at scale.
Internet Protocol version 6 (IPv6) is the long-term solution to the long-anticipated problem of Internet Protocol version 4 (IPv4) address exhaustion and is intended to replace IPv4 in the future. IP is the central protocol of the Internet. Each device connected to the Internet must have a unique IP address to communicate, and routers in the network use the destination IP address in packet headers to forward each packet to the receiver and vice versa.
With the rapid growth in the number of devices connecting to the network, the IPv4 address space has nearly run out. Like many other countries, both Australia and China have deployed IPv4 widely and still lack progress towards IPv6 deployment. If both countries are to participate fully in and benefit from the digital economy in the future, they had better prepare themselves for the transition and be ready for IPv6. Otherwise, they will face a huge challenge in connecting devices across different platforms.
Getting on board with IPv6 will not only prevent countries like Australia and China from being at a competitive disadvantage, but also save operational costs in the near future. As APNIC’s Director General Paul Wilson was quoted as saying in iTnews, “Without IPv6, the Australian internet will be less efficient, it will be slower and less reliable, and more expensive — and that would be bad for the country.” China is in the same boat as Australia.
The IPv4 address shortage will become an even bigger issue for both countries as their economies move rapidly toward the fourth industrial revolution, in which their manufacturing and services rely heavily on the Internet.
To promote IPv6 deployment and help prepare Australia and China for the transition, APNIC has provided a grant to the School of Engineering and IT at Murdoch University to conduct research on IPv6 readiness and deployment across all industry sectors that use the Internet in both countries. The survey received responses from 198 participants from Australia and 188 participants from China.
The main objective of this research project is to gain insights into the motivations of Australian and Chinese organisations for deploying or not deploying IPv6, to identify whether IPv6 deployment is likely to increase in the future, and to determine the driving and hindering forces behind the deployment of IPv6 in their economies.
However, most of the data from those studies is now a few years old, and only a small number of participants came from Australia and China. While some studies provide a comprehensive analysis of the awareness and urgency of IPv6, none of them provides much insight into companies’ motivations or reasons for deploying or not deploying IPv6, or into the main obstacles they face in moving to IPv6 infrastructure.
In this sense, the focus of the Murdoch University study is crucial, particularly for these two countries as they plan for the IPv6 transition. The project focuses on Australia and China because both countries currently have low IPv6 deployment, which makes the study highly relevant to both.
A significant number of islands are too remote to make submarine cable Internet connections economical. We’re looking mostly at the Pacific, but islands like these exist elsewhere, too. There are also remote places on some continents, and of course there are cruise ships, which can’t connect to cables while they’re underway. In such cases, the only way to provision Internet commercially at this point is by satellite.
Satellite Internet is expensive and sells by the Mbps per month (megabits per second of capacity per month). So places with few people – up to several tens of thousands perhaps – usually find themselves with connections clearly this side of a Gbps (gigabit per second). The time it takes for the bits to make their way to the satellite and back adds latency, and so we have all the ingredients for trouble: a bottleneck prone to congestion and long round-trip times (RTT), which give TCP senders an outdated picture of the actual level of congestion that their packets are going to encounter.
I have discussed these effects at length in two previous articles published at the APNIC blog, so won’t repeat this here. Suffice to say: It’s a problem worth investigating in depth, and we’re doing so with help from grants by ISIF Asia in 2014 and 2016 and Internet NZ. This blog post describes how we’ve built our simulator, which challenges we’ve come up against, and where we’re at some way down our journey.
What are the questions we want to answer?
The list is quite long, but our initial goals are:
We’d like to know under which circumstances (link type – GEO or MEO, bandwidth, load, input queue size) various adverse effects such as TCP queue oscillation or standing queues occur.
We’d like to know what input queue size represents the best compromise in different circumstances.
We’d like to know how much improvement we can expect from devices peripheral to the link, such as performance-enhancing proxies and network coders.
We’d like to know how best to parameterise such devices in the scenarios in which one might deploy them.
We’d like to know how devices with optimised parameters behave when loads and/or flow size distributions change.
That’s just a start, of course, and before we can answer any of these, the biggest question to solve is: How do you actually build, configure, and operate such a simulator?
Why simulate, and why build a hybrid software/hardware simulator?
We get this question a lot. There are things we simply can’t try on real satellite networks without causing serious inconvenience or cost. Some of the solutions we are looking at require significant changes in network topology. Any island out there keen to go without Internet for a few days while we try something out? So we’d better make sure it can be done in a controlled environment first.
Our first idea was to try to simulate things in software. There is a generation of engineers and networking people who have been brought up on the likes of ns-2, ns-3, mininet etc., and they swear by it. We’re part of that generation, but the moment we tried to simulate a satellite link in ns-2 with more than a couple of dozen Mbps and more than a handful of simultaneous users generating a realistic flow size distribution, we knew that we were in trouble. The experiment was meant to simulate just a few minutes’ worth of traffic, for just one link configuration, and we were looking at days of simulation. No way. This wasn’t scalable.
Also, with a software simulator, you rely on software simulating a complex system with timing, concurrent processes, etc., in an entirely sequential way. How do you know that it gets the chaos of congestion right?
So we opted for farming out as much as we could to hardware. Here, we’re dealing with actual network components, real packets, and real network stacks.
There’s been some debate as to whether we shouldn’t be calling the thing an emulator rather than a simulator. Point taken. It’s really a bit of both. We take a leaf here from airline flight simulators, which also leverage a lot of hardware.
The tangible assets
Our island-based clients at present are 84 Raspberry Pis, complemented by 10 Intel NUCs. Three Supermicro servers simulate the satellite link and terminal equipment (such as PEPs or network coding encoders and decoders), and another 14 Supermicros of varying vintage act as the servers of the world that provide the data which the clients on the island want.
The whole thing is tied together by a number of switches, and all servers have external Internet access, so we can remotely access them to control experiments without having to load the actual experimental channel. The image in Figure 1 below shows the topology – the “island” is to the right, the satellite “link” in the middle, and the “world” servers on the left.
Simulating traffic: The need for realistic traffic data
High latency is a core difference between satellite networks and, say, LANs or MANs. As I’ve explained in a previous blog, this divides TCP flows (the packets of a TCP connection going in one direction) into two distinct categories: Flows which are long enough to become subject to TCP congestion control, and those that are so short that their last data packet has left the sender by the time the first ACK for data arrives.
In networks where RTT is no more than a millisecond or two, most flows fall into the former category. In a satellite network, most flows don’t experience congestion control – but contribute very little data. Most of the data on satellite networks lives in flows whose congestion window changes in response to ACKs received.
So we were lucky to have a bit of netflow data courtesy of a cooperating Pacific ISP. From this, we’ve been able to extract a flow size distribution to assist us in traffic generation. To give you a bit of an idea as to how long the tail of the distribution is: We’re looking at a median flow size of under 500 bytes, a mean flow size of around 50 kB, and a maximum flow size of around 1 GB.
A quick reminder for those who don’t like statistics: The median is what you get if you sort all flows by size and take the flow size half-way down the list. The mean is what you get by adding all flow sizes and dividing by the number of flows. A distribution with a long tail has a mean that’s miles from the median. Put simply: Most flows are small but most of the bytes sit in large flows.
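To make the gap between median and mean concrete, here is a minimal sketch. It does not use the ISP's netflow data: the flow sizes are synthetic, drawn from a log-normal distribution picked purely because it has a long tail, and the parameters are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic long-tailed flow sizes in bytes (NOT the real netflow data):
# a log-normal distribution, chosen only because it has a long tail.
flow_sizes = [int(random.lognormvariate(6.0, 3.0)) + 1 for _ in range(100_000)]

median = statistics.median(flow_sizes)
mean = statistics.fmean(flow_sizes)

# Share of all bytes carried by the largest 1% of flows.
ordered = sorted(flow_sizes)
top_1_percent = ordered[-(len(ordered) // 100):]
share = sum(top_1_percent) / sum(ordered)

print(f"median: {median:.0f} bytes")
print(f"mean:   {mean:.0f} bytes")
print(f"top 1% of flows carry {share:.0%} of the bytes")
```

The exact numbers depend on the made-up parameters, but the pattern is the one described above: a small median, a much larger mean, and most bytes concentrated in a few large flows.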
Simulating traffic: Supplying a controllable load level
Another assumption we make is this: By and large, consumer Internet users are reasonably predictable creatures, especially if they come as a crowd. As a rule of thumb, if we increase the number of users by a factor of X, then we can reasonably expect that the number of flows of a particular size will also roughly increase by X. So if the flows we sampled were created by, say, 500 users, we can approximate the behaviour of 1000 users simply by creating twice as many flows from the same distribution. This gives us a kind of “load control knob” for our simulator.
But how are we creating the traffic? This is where our own purpose-built software comes in. Because we have only 84 Pis and 10 NUCs, but want to be able to simulate thousands of parallel flows, each physical “island client” has to play the role of a number of real clients. Our client software does this by creating a configurable number of “channels”, say 10 or 30 on each physical client machine.
Each channel creates a client socket, randomly selects one of our “world” servers to connect to, opens a connection and receives a certain number of bytes, which the server determines by random pick from our flow size distribution. The server then disconnects, and the client channel creates a new socket, selects another server, etc. Selecting the number of physical machines and client channels to use thus gives us an incremental way of ramping up load on the “link” while still having realistic conditions.
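The channel logic can be sketched roughly as follows. This is not our actual client or server software: the server names and the toy flow size list are hypothetical, no real sockets are opened, and each channel runs for a fixed number of connections rather than for the duration of an experiment.

```python
import random

# Hypothetical stand-ins for illustration only.
WORLD_SERVERS = [f"world{n:02d}" for n in range(1, 15)]   # 14 "world" servers
FLOW_SIZES = [300, 450, 500, 1_200, 40_000, 2_000_000]    # toy distribution, bytes


def pick_flow_size(rng):
    """Server side: choose how many bytes to send on this connection."""
    return rng.choice(FLOW_SIZES)


def channel(rng, connections):
    """One client channel: connect, receive, disconnect, repeat."""
    for _ in range(connections):
        server = rng.choice(WORLD_SERVERS)     # client picks a server at random
        yield server, pick_flow_size(rng)      # server decides the flow size


def plan_load(machines, channels_per_machine):
    """Total number of parallel channels: the 'load control knob'."""
    return machines * channels_per_machine


rng = random.Random(42)
for server, size in channel(rng, connections=5):
    print(f"connect to {server}, receive {size} bytes, disconnect")

print(plan_load(machines=94, channels_per_machine=10), "channels in total")
```

Turning up either the number of machines or the channels per machine raises the load in small steps while every flow still comes from the same distribution.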
Simulating traffic: Methodology challenges
There are a couple of tricky spots to navigate, though: Firstly, netflow reports a significant number of flows that consist of only a single packet, with or without payload data. These could be rare ACKs flowing back from a slow connection in the opposite direction, or be SYN packets probing, or…
However, our client channels create a minimum amount of traffic per flow through their connection handshake. This amount exceeds the flow size of these tiny flows. So we approximate the existence of these flows by pro-rating them in the distribution, i.e., each client channel connection accounts for several of these small single-packet flows.
Secondly, the long tail of the distribution means that as we sample from it, our initial few samples are very likely to have an average size that is closer to the median than to the mean. In order to obtain a comparable mean, we need to run our experiments for long enough so that our large flows have a realistic chance to occur. This is a problem in particular with experiments using low bandwidths, high latencies (GEO sats), and a low number of client channels.
For example, a ten minute experiment simulating a 16 Mbps GEO link with 20 client channels will typically generate a total of only about 14,000 flows. The main reason for this is the time it takes to establish a connection via a GEO RTT of over 500 ms. Our distribution contains well over 100,000 flows, with only a handful of really giant flows. So results at this end are naturally a bit noisy, depending on whether, and which, giant flows in the 100’s of MB get picked by our servers. This forces us to run rather lengthy experiments at this end of the scale.
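A small Monte Carlo sketch shows why short experiments are noisy. The population below is a made-up stand-in for our distribution (100,000 flows, ten of them giants); only the sample size of 14,000 comes from the example above.

```python
import random
import statistics


def sample_mean(rng, population, n):
    """Mean flow size of n flows drawn with replacement from the distribution."""
    return statistics.fmean(rng.choices(population, k=n))


rng = random.Random(7)

# Synthetic stand-in for the flow size distribution: lots of small flows
# plus a handful of giants. The sizes are invented for illustration.
population = [500] * 99_990 + [400_000_000] * 10
true_mean = statistics.fmean(population)

for n in (1_000, 14_000, 200_000):
    runs = [sample_mean(rng, population, n) for _ in range(5)]
    # Ratio of observed mean to true mean for five repeated "experiments".
    print(n, [round(m / true_mean, 2) for m in runs])
```

With only a few thousand flows per run, the ratio swings widely depending on how many giants happen to be drawn. It settles near 1 only once the sample is large enough for the giants to turn up at their expected rate.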
Simulating the satellite link itself
For our purposes, simulating a satellite link mainly means simulating the bandwidth bottleneck and the latency associated with it. More complex scenarios may include packet losses from noise or fading on the link, or issues related to link layer protocol. We’re dedicating an entire server to the simulation (server K in the centre of the topology diagram), so we have enough computing capacity to handle every case of interest. The rest is software, and here the choice is chiefly between a network simulator (such as sns-3) and something relatively simple like the Linux tc utility.
The latter lets us simulate bandwidth constraints, delay, sporadic packet loss and jitter: enough for the moment. That said, it’s a complex beast, which exists in multiple versions and – as we found out – is quite quirky and not overly extensively documented.
Following examples given by various online sources, we configured a tc netem qdisc to represent the delay, which we in turn chained to a token bucket filter. The online sources also suggested quality control: ping across the simulated link to ensure the delay is in place, then run iperf in UDP mode to see that the token bucket filter is working correctly. Sure enough, the copy-and-paste example passed these two tests with flying colours. It’s just that we then got rather strange results once we ran TCP across the link. So we decided to ping while we were running iperf. Big surprise: Some of the ping RTTs were in the hundreds of seconds – far longer than any buffer involved could explain. Moreover, no matter which configuration parameter we tweaked, the effect wouldn’t go away. So, a bug it seems. We finally found a workaround involving ingress redirection to an intermediate functional block (IFB) device, which passes all tests and produces sensible results for TCP. Just goes to show how important quality control is!
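For readers who haven't met ingress redirection before, the sketch below builds (but does not run) one possible set of commands for a single direction of the link. It is not our production configuration: the interface names, the burst size and the way the delay is attached are assumptions, and the numbers are placeholders rather than tuned values.

```python
def sat_link_commands(in_dev="eth1", out_dev="eth2", ifb="ifb0",
                      rate="16mbit", delay="250ms",
                      burst_bytes=3000, queue_bytes=100_000):
    """Return ip/tc commands shaping traffic that arrives on in_dev.

    All names and numbers are illustrative assumptions, not the
    simulator's actual settings. The commands are returned as strings
    and would need root privileges to run.
    """
    return [
        # Create the intermediate functional block device and bring it up.
        f"ip link add {ifb} type ifb",
        f"ip link set dev {ifb} up",
        # Redirect all IPv4 traffic arriving on in_dev to the IFB device.
        f"tc qdisc add dev {in_dev} handle ffff: ingress",
        f"tc filter add dev {in_dev} parent ffff: protocol ip u32 "
        f"match u32 0 0 action mirred egress redirect dev {ifb}",
        # Bandwidth bottleneck and input queue: token bucket filter on the IFB.
        f"tc qdisc add dev {ifb} root handle 1: tbf rate {rate} "
        f"burst {burst_bytes} limit {queue_bytes}",
        # One-way delay: netem on the interface the packets leave through.
        f"tc qdisc add dev {out_dev} root handle 2: netem delay {delay}",
    ]


for command in sat_link_commands():
    print(command)
```

Whatever the exact setup, the tests described above still apply: ping for the delay, iperf in UDP mode for the rate, and then ping while iperf is running.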
Simulating world latency
We also use a similar technique to add a variety of fixed ingress and egress delays to the “world” servers. This models the fact that TCP connections in real life don’t end at the off-island sat gate, but at a server that’s potentially a continent or two down the road and therefore another few dozen or even hundreds of milliseconds away.
Link periphery and data collection
We already know that we’ll want to try PEPs, network coders etc., so we have another server each on both the “island” (server L) and the “world” (server J) side of the server (K) that takes care of the “satellite link” itself. Where applicable, these servers host the PEPs and / or network coding encoders / decoders. Otherwise, these servers simply act as routers. In all cases, these two servers also function as our observation points.
At each of the two observation points, we run tcpdump on eth0 to capture all packets entering and leaving the link at either end. These get logged into pcap capture files on L and J.
An alternative to data capture here would be to capture and log on the clients and / or “world” servers. However, capture files are large and we expect lots of them, and the SD cards on the Raspberry Pis really aren’t a suitable storage medium for this sort of thing. Besides that, we’d like to let the Pis and servers get on with the job of generating and sinking traffic rather than writing large log files. Plus, we’d have to orchestrate the retrieval of logs from 108 machines with separate clocks, meaning we’d have trouble detecting effects such as link underutilisation.
So servers L and J are really without a lot of serious competition as observation points. After each experiment, we use tshark to translate the pcap files into text files, which we then copy to our storage server (bottom).
For some experiments, we also use other tools such as iperf (so we can monitor the performance of a well-defined individual download) or ping (to get a handle on RTT and queue sojourn times). We run these between the NUCs and some of the more powerful “world” servers.
A basic experiment sequence
Each experiment basically follows the same sequence, which we execute via our core script:
1. Configure the “sat link” with the right bandwidth, latency, queue capacity etc.
2. Configure and start any network coded tunnel or PEP link we wish to use between servers L and J.
3. Start the tcpdump capture at the island end (server L) of the link.
4. Start the tcpdump capture at the world end (server J) of the link with a little delay. This ensures that we capture every packet heading from the world to the island side.
5. Start the iperf server on one of the NUCs. Note that in iperf, the client sends data to the server rather than downloading it.
6. Start the world servers.
7. Ping the special purpose client from the special purpose server. This functions as a kind of “referee’s start whistle” for the experiment as it creates a unique packet record in both tcpdump captures, allowing us to synchronise them later.
8. Start the island clients as simultaneously as possible.
9. Start the iperf client.
10. Start pinging – typically, we ping 10 times per second.
11. Wait for the core experiment duration to expire. The clients terminate themselves.
12. Ping the special purpose client from the special purpose server again (“stop whistle”).
13. Terminate pinging (usually, we ping only for part of the experiment period, though).
14. Terminate the iperf client.
15. Terminate the iperf server.
16. Terminate the world servers.
17. Convert the pcap files on J and L into text log files with tshark.
18. Retrieve text log files, iperf log and ping log to the storage server.
19. Start the analysis on the storage server.
Between most steps, there is a wait period to allow the previous step to complete. For a low load 8 Mbps GEO link, the core experiment time needs to be 10 minutes to yield a half-way representative sample from the flow size distribution. The upside is that the pcap log files are small, so they need less time for conversion and transfer to storage. For higher bandwidths and more client channels, we can get away with shorter core experiment durations. However, as they produce larger pcap files, conversion and transfer take longer. Altogether, we budget around 20 minutes for a basic experiment run.
Tying it all together
We now have more than 100 machines in the simulator. Even in our basic experiments sequence, we tend to use most if not all of them. This means we need to be able to issue commands to individual machines or groups of machines in an efficient manner, and we need to be able to script this.
Enter the pssh utility. This useful little program lets our scripts establish a number of SSH connections to multiple machines simultaneously, e.g., to start our servers or clients, or to distribute configuration information. It’s not without its pitfalls though: For one, the present version has a hardwired limit of 32 simultaneous connections that isn’t properly documented in the man page. If one requests more than 32 connections, pssh quietly runs the first 32 immediately and then delays the next 32 by 60 seconds, the next 32 by 120 seconds, etc.
We wouldn’t have noticed this had we not added a feature to our analysis script that checks whether all clients and servers involved in the experiment are being seen throughout the whole core experiment period. Originally, we’d intended this feature to pick up machines that had crashed or had failed to start. Instead, it alerted us to the fact that quite a few of our machines were late starters, always by exactly a minute or two.
We now have a script that we pass the number of client channels required. It computes how to distribute the load across the Pi and NUC clients, creates subsets of up to 32 machines to pass to pssh, and invokes the right number of pssh instances with the right client parameters. This lets us start up all client machines within typically less than a second. The whole episode condemned a week’s worth of data to /dev/null, but shows again just how important quality assurance is.
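The core of that script boils down to two small steps, sketched here with hypothetical host names. The even spread of channels across Pis and NUCs is an assumption made for simplicity; the real script may weight the machines differently, and the pssh invocation itself is left out.

```python
def distribute_channels(hosts, total_channels):
    """Spread total_channels as evenly as possible across hosts."""
    base, extra = divmod(total_channels, len(hosts))
    return {host: base + (1 if i < extra else 0) for i, host in enumerate(hosts)}


def batches(hosts, size=32):
    """Split hosts into groups of at most `size`, one group per pssh instance."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


# Hypothetical host names for the 84 Pis and 10 NUCs.
clients = [f"pi{n:02d}" for n in range(1, 85)] + [f"nuc{n:02d}" for n in range(1, 11)]

plan = distribute_channels(clients, total_channels=200)
groups = batches(clients)

print(len(groups), "pssh instances needed")
print([len(group) for group in groups])
```

Keeping every group at 32 machines or fewer means no pssh instance ever runs into the delayed second batch.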
Automating the complex processes is vital, so we keep adding scripts to the simulator as we go to assist us in various aspects of analysis and quality assurance.
Observations – and how we use them
Our basic experiment collects four pieces of information:
A log file with information on the packets that enter the link from the “world” side at J (or the PEP or network encoder as the case may be). This file includes a time stamp for each packet, the source and destination addresses and ports, and the sizes of IP packets, the TCP packets they carry, and the size of the payload they contain, plus sequence and ACK numbers as well as the status of the TCP flags in the packet.
A similar log file with information on the packets that emerge at the other end of the link from L and head to the “island” clients.
An iperf log, showing average data rates achieved for the iperf transfer.
A ping log, showing the sequence numbers and RTT values for the ping packets sent.
The first two files allow us to determine the total number of packets, bytes and TCP payload bytes that arrived at and left the link. This gives us throughput, goodput, and TCP byte loss, as well as a wealth of performance information for the clients and servers. For example, we can compute the number of flows achieved and the average number of parallel flows, or the throughput, goodput and byte loss for each client.
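As a rough illustration of how throughput and goodput fall out of such a log, the sketch below bins packets into 100 ms intervals. The tuple layout is a simplified stand-in for one line of the text log, and counting every TCP payload byte towards goodput is an assumption of this sketch.

```python
from collections import defaultdict


def rates_per_interval(packets, interval=0.1):
    """Return {interval index: (throughput_bps, goodput_bps)}.

    Each packet is a (timestamp_s, ip_bytes, tcp_payload_bytes) tuple,
    a simplified stand-in for one line of the text log.
    """
    ip_bytes = defaultdict(int)
    payload_bytes = defaultdict(int)
    for timestamp, ip_len, payload_len in packets:
        slot = int(timestamp / interval)
        ip_bytes[slot] += ip_len
        payload_bytes[slot] += payload_len
    return {
        slot: (ip_bytes[slot] * 8 / interval, payload_bytes[slot] * 8 / interval)
        for slot in sorted(ip_bytes)
    }


# Toy log: two data packets and one bare ACK leaving the link.
log = [(0.01, 1500, 1448), (0.05, 1500, 1448), (0.12, 52, 0)]
for slot, (throughput, goodput) in rates_per_interval(log).items():
    print(f"{slot * 0.1:.1f} s: throughput {throughput / 1e6:.3f} Mbps, "
          f"goodput {goodput / 1e6:.3f} Mbps")
```

Grouping by client address instead of by time slot gives the per-client figures in the same way.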
Figure 2 above shows throughput (blue) and goodput (red) in relation to link capacity, taken at 100 ms intervals. The link capacity is the brown horizontal line – 16 Mbps in this case.
Any bit of blue that doesn’t reach the brown line represents idle link capacity – evidence of an empty queue some time during the 100 ms in question. So you’d think there’d be no problem fitting a little bit of download in, right? Well that’s exactly what we’re doing at the beginning of the experiment, and you can indeed see that there’s quite a bit less spare capacity – but still room for improvement.
Don’t believe me? Well, the iperf log gives us an idea as to how a single long download fares in terms of throughput. Remember that our clients and servers aim at creating a flow mix but don’t aim to complete a standardised long download. So iperf is the more appropriate tool here. In this example, our 40 MB download takes over 120 s with an average rate of 2.6 Mbps. If we run the same experiment with 10 client channels instead of 20, iperf might take only a third of the time (41 s) to complete the download. That is basically the time it takes if the download has the link to itself. So adding the extra 10 client channel load clearly has a significant impact.
At 50 client channels, iperf takes 186 seconds, although this figure can vary considerably depending on which other randomly selected flows run in parallel. At 100 client channels, the download sometimes won’t even complete – if it does, it’s usually above the 400 second mark and there’s very little spare capacity left (Figure 3).
You might ask why the iperf download is so visible in Figure 2 compared to the traffic contributed by our hundreds of client channels? The answer lies once again in the extreme nature of our flow size distribution and the fact that at any time, a lot of the channels are in connection establishment mode: The 20 client channel experiment above averages only just under 18 parallel flows, and almost all of the 14,000 flows this experiment generates are less than 40 MB: In fact, 99.989% of the flows in our distribution are shorter than our 40 MB download. As we add more load, the iperf download gets more “competition” and also contributes at a lower goodput rate.
The ping log, finally, gives us a pretty good estimate of queue sojourn time. We know the residual RTT from our configuration but can also measure it by pinging after step 2 in the basic experiment sequence. Any additional RTT during the experiment reflects the extra time that the ICMP ping packets spend being queued behind larger data packets waiting for transmission.
One nice feature here is that our queue at server K practically never fills completely: To do so, the last byte of the last packet to be accepted into the queue would have to occupy the last byte of queue capacity. However, with data packets being around 1500 bytes, the more common scenario is that the queue starts rejecting data packets once it has less than 1500 bytes capacity left. There’s generally still enough capacity for the short ping packets to slip in like a mouse into a crowded bus, though. It’s one of the reasons why standard size pings aren’t a good way of detecting whether your link is suffering from packet loss, but for our purposes – measuring sojourn time – it comes in really handy.
Figure 4 shows the ping RTTs for the first 120 seconds of the 100 client channel experiment above. Notice how the maximum RTT tops out at just below 610 ms? That’s 50 ms above the residual RTT of 560 ms (500 satellite RTT and 60 ms terrestrial), +/-5% terrestrial jitter that we’ve configured here. No surprises here: That’s exactly the time it takes to transmit the 800 kbits of capacity that the queue provides. In other words: The pings at the top of the peaks in the plot hit a queue that was, for the purposes of data transfer, overflowing.
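The arithmetic behind that ceiling is short enough to check by hand, or with a few lines of Python. The numbers are the ones quoted above; the helper function is just an illustration and ignores the configured jitter.

```python
LINK_RATE_BPS = 16_000_000      # 16 Mbps link capacity
QUEUE_CAPACITY_BITS = 800_000   # 800 kbits of queue
RESIDUAL_RTT_MS = 560           # 500 ms satellite + 60 ms terrestrial

# Time needed to transmit a completely full queue.
max_sojourn_ms = QUEUE_CAPACITY_BITS / LINK_RATE_BPS * 1000
max_rtt_ms = RESIDUAL_RTT_MS + max_sojourn_ms


def sojourn_ms(measured_rtt_ms, residual_rtt_ms=RESIDUAL_RTT_MS):
    """Queue sojourn time implied by one ping RTT (jitter ignored)."""
    return max(0.0, measured_rtt_ms - residual_rtt_ms)


print(f"full queue drains in {max_sojourn_ms:.0f} ms")
print(f"largest expected RTT: {max_rtt_ms:.0f} ms")
print(f"a 580 ms ping spent about {sojourn_ms(580):.0f} ms in the queue")
```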
The RTT here manages to hit its minimum quite frequently, and this shows in throughput of just under 14 Mbps, 2 Mbps below link capacity.
Note also that where the queue hits capacity, it generally drains again within a very short time frame. This is queue oscillation. Note also that we ping only once every 100 ms, so we may be missing shorter queue drain or overflow events here because they are too short in duration – and going by the throughput, we know that we have plenty of drain events.
This plot also illustrates one of the drawbacks of having a queue: between e.g., 35 and 65 seconds, there are multiple occasions when the RTT doesn’t return to residual for a couple of seconds. This is called a “standing queue” – the phenomenon commonly associated with buffer bloat. At times, the standing queue doesn’t contribute to actual buffering for a couple of seconds but simply adds 20 ms or so of delay. This is undesirable, not just for real-time traffic using the same queue, but also for TCP trying to get a handle on the RTT. Here, it’s not dramatic, but if we add queue capacity, we can provoke an almost continuous standing queue: the more buffer we provide, the longer it will get under load.
Should we be losing packet loss altogether?
There’s one famous observable that’s often talked about but surprisingly difficult to measure here: packet loss. How come, you may ask, given that we have lists of packets from before and after the satellite link?
Essentially, the problem boils down to the question of what we count as a packet, segment or datagram at different stages of the path.
Here’s the gory detail: The maximum size of a TCP packet can in theory be anything that will fit inside a single IP packet. The size of the IP packet in turn has to fit into the Ethernet (or other physical layer) frame and has to be able to be processed along the path.
In our simulator, and in most real connected networks, we have two incompatible goals: Large frames and packets are desirable because they lower overhead. On the other hand, if noise or interference hits on the satellite link, large frames present a higher risk of data loss: Firstly, at a given bit error rate, large packets are more likely to cop bit errors than small ones. Secondly, we lose more data if we have to discard a large packet after a bit error than if we have to discard a small packet only.
Then again, most of our servers sit on Gbps Ethernet or similar, where the network interfaces have the option of using jumbo frames. The jumbo frame size of up to 9000 bytes represents a compromise deemed reasonable for this medium. However, these may not be ideal for a satellite link. For example, given a bit error probability of 0.0000001, we can expect to lose 7 in 1000 jumbo frames, or 0.7% of our packet data. If we use 1500 byte frames instead, we’ll only lose just over 1 in 1000 frames, or 0.12% of our packet data. Why is that important? Because packet loss really puts the brakes on TCP, and these numbers really make a difference.
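These loss figures follow from a simple model: a frame is lost if at least one of its bits is hit. The sketch below assumes independent bit errors and no error correction, which is a simplification of what happens on a real link.

```python
def frame_loss_probability(frame_bytes, bit_error_rate):
    """Probability that at least one bit of the frame is corrupted,
    assuming independent bit errors and no error correction."""
    return 1 - (1 - bit_error_rate) ** (frame_bytes * 8)


BER = 1e-7  # the bit error probability used in the example above

for size in (9000, 1500):
    p = frame_loss_probability(size, BER)
    print(f"{size} byte frames: {p:.2%} lost")
```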
The number of bytes that a link may transfer in a single IP packet is generally known as the maximum transmission unit (MTU). There are several ways to deal with diversity in MTUs along the path: Either we can restrict the size of our TCP segment right from the sender to fit into the smallest MTU along the path, or we can rely on devices along the way to split IP packets with TCP segments into smaller IP packets for us. Modern network interfaces do this on the fly with TCP segmentation offload (TSO) and generic segmentation offload (GSO, see https://sandilands.info/sgordon/segmentation-offloading-with-wireshark-and-ethtool). Finally, the emergency option when an oversize IP datagram hits a link is to fragment the IP datagram.
In practice, TSO and GSO are so widespread that TCP senders on a Gbps network will generally transmit jumbo frames and have devices further down the path worry about it. This leaves us with a choice in principle: Allow jumbo frames across the “satellite link”, or break them up?
Enter the token bucket filter: If we want to use jumbo frames, we need to make the token bucket large enough to accept them. This has an undesirable side effect: Whenever the bucket has had a chance to fill with tokens, any arriving packets that are ready to consume them get forwarded immediately, regardless of rate (which is why you see small amounts of excess throughput in the plots above). So we’d “buy” jumbo frame capability by considerably relaxing instantaneous rate control for smaller data packets. That’s not what we want, so it seems prudent to stick with the “usual” MTUs of around 1500 bytes and accept fragmentation of large packets.
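A toy token bucket makes the side effect easy to see. This is a deliberately minimal model, not the Linux tbf implementation: tokens are bytes, there is no queue behind the bucket, and the 16 Mbps rate is borrowed from the earlier examples.

```python
class TokenBucket:
    """Minimal token bucket: tokens are bytes, refilled at rate_bps / 8 per second."""

    def __init__(self, rate_bps, bucket_bytes):
        self.rate = rate_bps / 8          # refill rate in bytes per second
        self.capacity = bucket_bytes
        self.tokens = bucket_bytes        # start full, as after an idle period
        self.last = 0.0

    def try_send(self, now, packet_bytes):
        """Forward the packet if enough tokens are available at time `now`."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


def burst_after_idle(bucket_bytes, packet_bytes=1500):
    """Count back-to-back packets that pass instantly after an idle period."""
    bucket = TokenBucket(rate_bps=16_000_000, bucket_bytes=bucket_bytes)
    sent = 0
    while bucket.try_send(now=0.0, packet_bytes=packet_bytes):
        sent += 1
    return sent


print("bucket sized for jumbo frames:", burst_after_idle(bucket_bytes=9000))
print("bucket sized for 1500 byte frames:", burst_after_idle(bucket_bytes=1500))
```

The larger the bucket, the more standard-size packets can leave back to back at whatever speed the interface allows, which is exactly the loss of instantaneous rate control described above.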
There’s also the issue of tcpdump not necessarily seeing the true number of packets/fragments involved, because it captures before segmentation offload etc. (https://ask.wireshark.org/questions/3949/tcpdump-vs-wireshark-differences-packets-merged).
The gist of it all: The packets we see going into the link aren’t necessarily the packets that we see coming out at the other end. Unfortunately that happens in a frighteningly large number of cases.
In principle, we could check from TCP sequence numbers & IP fragments whether all parts of each packet going in are represented in the packets going out. However, with 94 clients all connecting to 14 servers with up to 40-or-so parallel channels, doing the sequence number accounting is quite a computing-intensive task. But is it worth it? For example, if I count a small data packet with 200 bytes as lost when it doesn’t come out the far end, then what happens when I have a jumbo frame packet with 8000 bytes that gets fragmented into 7 smaller packets and one of these fragments gets dropped? Do I count the latter as one packet loss, or 1/7th of a packet loss, or what?
The good news: For our purposes, packet loss doesn’t actually help explain much unless we take it as an estimate of byte loss. But byte loss is an observable we can compute very easily here: We simply compare the number of observed TCP payload bytes on either side of the link. Any missing byte must clearly have been in a packet that got lost.
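As a rough sketch, the byte loss computation boils down to the following. The function name and the payload lists are hypothetical placeholders, not values from our captures.

```python
def byte_loss(payload_bytes_in: int, payload_bytes_out: int) -> float:
    """Fraction of TCP payload bytes that entered the link but never left it."""
    if payload_bytes_in <= 0:
        raise ValueError("need a positive number of bytes going into the link")
    return (payload_bytes_in - payload_bytes_out) / payload_bytes_in


# Hypothetical per-packet TCP payload sizes seen on each side of the link.
payloads_in = [1448] * 1000
payloads_out = [1448] * 993

loss = byte_loss(sum(payloads_in), sum(payloads_out))
print(f"byte loss: {loss:.2%}")
```

Note that this only needs payload byte totals per side, so it does not matter how segmentation offload or fragmentation carved the bytes up into packets along the way.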
There is a saying in my native Germany: “Wer misst, misst Mist”. Roughly translated, it’s a warning that those who measure blindly tend to produce rubbish results. We’ve already seen a couple of examples of how an “out-of-left-field” effect caused us problems. I’ll spare you some of the others but will say that there were more than just a few!
So what are we doing to ensure we’re producing solid data? Essentially, we rely on four pillars:
Configuration verification and testing: This includes checking that link setups have the bandwidth configured, that servers and clients are capable of handling the load, and that all machines are up and running at the beginning of an experiment.
Automated log file analysis: When we compare the log files from either side of the link, we also compute statistics about when each client and server was first and last seen, and how much traffic went to/from the respective machine. Whenever a machine deviates from the average by more than a small tolerance or a machine doesn’t show up at all, we issue a warning.
Human inspection of results: Are the results feasible? E.g., are throughput and goodput within capacity limits? Do observables change in the expected direction when we change parameters such as load or queue capacity? Plots such as those discussed above also assist us in assessing quality. Do they show what we’d expect, or do they show artefacts? This also includes discussion of our results so there are at least four eyes looking at data.
Scripting: Configuring an experiment requires the setting of no fewer than seven parameters for the link simulation, fourteen different RTT latencies for the servers, load and timeout configurations for 94 client machines, and an iperf download size, plus the orchestrated execution of everything with the right timing – see above. Configuring all of this manually would be a recipe for disaster, so we script as much as we can – this takes care of a lot of typos!
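To give a flavour of the automated log file analysis, here is a minimal sketch of such a check. The machine names, the traffic volumes and the 20% tolerance are made-up assumptions for illustration; our actual scripts differ in detail.

```python
def check_machines(expected, traffic_bytes, tolerance=0.2):
    """Warn about machines that are missing or whose traffic is off the mean.

    expected:      names of all machines that should appear in the logs
    traffic_bytes: mapping of machine name to observed bytes
    tolerance:     allowed relative deviation from the mean (assumed value)
    """
    warnings = []
    seen = {}
    for machine in expected:
        if machine in traffic_bytes:
            seen[machine] = traffic_bytes[machine]
        else:
            warnings.append(f"{machine}: never seen in the logs")
    if seen:
        mean = sum(seen.values()) / len(seen)
        for machine, volume in seen.items():
            if mean > 0 and abs(volume - mean) / mean > tolerance:
                warnings.append(
                    f"{machine}: {volume} bytes deviates from mean {mean:.0f}"
                )
    return warnings


expected = ["c1", "c2", "c3", "c4", "c5", "c6"]
observed = {"c1": 100, "c2": 100, "c3": 100, "c4": 100, "c5": 50}

warnings = check_machines(expected, observed)
for line in warnings:
    print("WARNING:", line)
```

In this made-up example, c6 is flagged because it never shows up, and c5 because it moved far less traffic than its peers.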
Also, an underperforming satellite link could simply be a matter of bad link configuration rather than a fundamental problem with TCP congestion control. It would be all too easy to take a particular combination of link capacity and queue capacity to demonstrate an effect without asking what influence these parameters have on the effect. This is why we’re performing sweeps – when it comes to comparing the performance of different technologies, we want to ensure that we are putting our best foot forward.
So what’s the best queue capacity for a given link capacity? You may remember the old formula for sizing router queues, RTT * bandwidth. However, there’s also Guido Appenzeller’s PhD thesis from Stanford, in which he recommends dividing this figure by the square root of the number of long-lived parallel flows.
This presents us with a problem: We can have hundreds of parallel flows in the scenarios we’re looking at. However, how many of those will qualify as long-lived depends to no small extent on the queue capacity at the token bucket filter!
For example, take the 16 Mbps link with 20 client channels we’ve already looked at before. At 16 Mbps (=2 MBps) and 500 ms RTT, the old formula suggests 1 MB queue capacity. We see fairly consistently 17-18 parallel flows (not necessarily long-lived ones, though) regardless of queue capacity. Assuming extremely naively that all of these flows might qualify as long-lived (well, we know they’re not), Guido’s results suggest dividing the 1 MB by a factor of around 4, which gives a figure a little larger than the 100 kB queue we’ve deployed here. But how do we know whether this is the best queue capacity to choose?
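For those who like to see the arithmetic spelled out, here is a small sketch of both sizing rules applied to the 16 Mbps example. The function names are ours, and the flow counts are the naive "all flows are long-lived" assumption from above.

```python
import math


def bdp_queue_bytes(link_bps: float, rtt_s: float) -> float:
    """Traditional rule: queue capacity = RTT * bandwidth (in bytes)."""
    return link_bps / 8 * rtt_s


def sqrt_rule_queue_bytes(link_bps: float, rtt_s: float, n_flows: int) -> float:
    """RTT * bandwidth divided by the square root of the number of long-lived flows."""
    return bdp_queue_bytes(link_bps, rtt_s) / math.sqrt(n_flows)


LINK_BPS = 16e6  # 16 Mbps
RTT_S = 0.5      # 500 ms

traditional = bdp_queue_bytes(LINK_BPS, RTT_S)
print(f"traditional: {traditional / 1e3:.0f} kB")
for flows in (17, 18):
    reduced = sqrt_rule_queue_bytes(LINK_BPS, RTT_S, flows)
    print(f"{flows} flows: {reduced / 1e3:.0f} kB")
```

The traditional rule yields 1000 kB, and dividing by the square root of 17 or 18 brings that down to somewhere around 240 kB.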
A real Internet satellite link generally doesn’t just see a constant load. So how do we know which queue capacity works best under a range of loads?
The only way to get a proper answer is to try feasible combinations of load levels and queue capacities. Which poses the next question: What exactly do we mean by “works best”?
Looking at the iperf download, increasing the queue size at 20 client channels always improves the download time. This would suggest dumping Guido’s insights in favour of the traditional value. Not so fast: Remember those standing queues in Figure 3? At around 20 ms extra delay, they seemed tolerable. Just going to a 200kB queue bumps these up to 80 ms, though, and they’re a lot more common, too. Anyone keen to annoy VoIP users for the sake of a download that could be three times faster? Maybe, maybe not. We’re clearly getting into compromise territory here, but around 100kB-200kB seems to be in the right ballpark.
So how do we get to zero in on a feasible range? Well, in the case of the 16 Mbps link, we looked at (“swept”) eleven potential queue capacities between 30 kB and 800 kB. For each capacity, we swept up to nine load levels between 10 and 600 client channels. That’s many dozens of combinations, each of which takes around 20 minutes to simulate, plus whatever time we then take for subsequent manual inspection. Multiply this by the number of possible link bandwidths of interest in GEO and MEO configuration, plus repeats for experiments with quality control issues, and we’ve got our work cut out. It’s only then that we can get to coding and PEPs.
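To get a feel for the size of one such sweep, consider the sketch below. Only the end points (30 kB to 800 kB, 10 to 600 channels), the counts (eleven and nine) and the 20 minutes per run come from the description above; the intermediate values in the two lists are hypothetical placeholders.

```python
from itertools import product

# Hypothetical grid: only the end points and the list lengths are from the text.
queue_capacities_kb = [30, 40, 50, 60, 80, 100, 150, 200, 300, 500, 800]
client_channels = [10, 20, 40, 60, 100, 200, 300, 450, 600]
MINUTES_PER_RUN = 20

runs = list(product(queue_capacities_kb, client_channels))
total_hours = len(runs) * MINUTES_PER_RUN / 60

print(f"{len(runs)} runs, about {total_hours:.0f} hours of simulation")
```

That is up to 99 runs and about 33 hours of pure simulation time for a single link bandwidth, before any manual inspection or repeats.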
A lot. If the question on your mind starts with “Have you thought of…” or “Have you considered…,” the answer is possibly yes. Here are a few challenges ahead:
Network coding (TCP/NC): We’ve already got the encoder and decoder ready, and once the sweeps are complete and we have identified the parameter combinations that represent the best compromises, we’ll collect performance data here. Again, this will probably take a few sweeps of possible generation sizes and overhead settings.
Performance-enhancing proxies (PEP): We’ve identified two “free” PEPs, PEPSal and TCPEP, which we want to use both in comparison and – eventually – in combination with network coding.
UDP and similar protocols without congestion control. In our netflow samples, UDP traffic accounts for around 12% of bytes. How will TCP behave in the presence of UDP in our various scenarios? How do we best simulate UDP traffic given that we know observed throughput, but can only control offered load? In principle, we could model UDP as a bandwidth constraint, but under heavy TCP load, we get to drop UDP packets as well, so it’s a constraint that’s a little flexible, too. What impact does this have on parameters such as queue capacities, generation sizes etc.?
Most real links are asymmetric, i.e., the inbound bandwidth is a lot larger than the outbound bandwidth. So far, we have neglected this as our data suggests that the outbound channels tend to have a comparatively generous share of the total bandwidth.
Simulating world latencies. At this point, we’re using a crude set of delays on our 14 “world servers”. We haven’t even added jitter. What if we did? What if we replaced our current crude model of “satgate in Hawaii” with a “satgate in X” model, where the latencies from the satgate in X to the servers would be distributed differently?
Tindak Malaysia is the winner of the ISIF Asia 2016 Technical Innovation Award and the Community Choice Award 2016.
TINDAK MALAYSIA: Towards A Fairer Electoral System –
1 Person, 1 Vote, 1 Value
A democracy is reflected in the sovereignty of the people. They are supposed to have the power to choose their leaders under Free and Fair Elections. Unfortunately, those in power will try to manipulate the electoral system to entrench their grip on power. Attempts to manipulate the system could be…
in tweaking the rules of elections in their favour,
in the control of the mainstream media,
through the pollsters to manipulate public perception,
during the vote count,
by making election campaigns so expensive that only the rich or powerful could afford to run or win,
through boundary delineation either by gerrymandering, or through unequal seat size.
The Nov 2016 US Presidential Election threw up all of the above in sharp contrast. There were two front runners, Donald Trump and Hillary Clinton.
Both candidates were disliked by more than half the electorate,
Both candidates generated such strong aversion that a dominant campaign theme was to vote for the lesser evil. The people were caught in the politics-of-no-choice.
Eventually, the winning candidate won with slightly fewer votes (0.3%) than the losing candidate, each winning only 27% of the electorate. Yet the winner took 306 delegates (57%) while the loser got 232 (43%), a huge difference!
The winning candidate won with barely a quarter of the total voting population. 43% of the voters did not vote. In other words, only 27% of the electorate decided on the President.
Consider Malaysia. We are located in South-east Asia. We have a population of 31 million with about 13.5 million registered voters. We practise a First-Past-The-Post System of elections, meaning the winner takes all, just like in the US.
In the 2013 General Elections, the Ruling Party obtained 47.4% of the votes and 60% of the seats. Meanwhile the opposition, with 52% of the votes, won only 40% of the seats – more votes, but far fewer seats.
We had all the problems listed above except that no opinion polls were allowed on polling day. But the most egregious problem of all was boundary delimitation, which is the subject of our project.
In 2013, the Ruling Party, with 47.4% of the popular vote, secured 60% of the seats. To hang on to power, they resorted to abuse and to changing the laws to suppress the Opposition and the people. Our concern was that continuing oppression of the people in this manner could lead to violent protests. It was our hope to achieve peaceful change in a democratic manner through the Constitution.
From a Problem Tree Analysis, it was found that the problem was cyclic in nature. The root cause was a Fascist Government maintaining power through Fraudulent Elections. See the red box in the figure (Problem Tree Analysis).
Malapportionment! The seats won by the Ruling Party in the chart below are the blue lines, with small numbers of voters in the rural seats. The red lines with huge numbers are the urban seats won by the Opposition. It was found that the Ruling Party could have won 50% of the seats with merely 20.22% of the votes. (Chart: Malapportionment in General Elections – GE213)
The above computation was based on popular vote. If based on total voting population, BN (Barisan Nasional, the Ruling Party) needed only 17.4% to secure a simple majority.
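The effect of unequal seat sizes can be illustrated with a toy calculation. This is not the computation behind the figures above; it is a simplified sketch that assumes two-candidate contests and full turnout, and the seat sizes are invented for illustration.

```python
def min_vote_share_for_majority(seat_sizes):
    """Smallest share of all votes that can still win a majority of seats.

    Simplifying assumptions: two candidates per seat, every elector votes,
    and the winning side takes the smallest seats by the barest majority
    while getting no votes anywhere else.
    """
    seats_needed = len(seat_sizes) // 2 + 1
    smallest_seats = sorted(seat_sizes)[:seats_needed]
    votes_needed = sum(size // 2 + 1 for size in smallest_seats)
    return votes_needed / sum(seat_sizes)


# Invented example: six small rural seats and four large urban seats.
unequal = [10_000] * 6 + [50_000] * 4
equal = [26_000] * 10  # same total electorate, equalized seats

print(f"unequal seats: {min_vote_share_for_majority(unequal):.1%}")
print(f"equal seats:   {min_vote_share_for_majority(equal):.1%}")
```

In this invented example, unequal seats let a party control the legislature with under 12% of the votes, while equalized seats push that floor up to about 30%.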
What is the solution we propose?
The solution was obvious. Equalize the seats.
But for the past 50 years, no one seemed to object to the unfair maps.
Why? The objectors never managed to submit a substantive objection because:
Biased EC stacked with Ruling Party cronies, who actively worked to prevent any objections being made,
Constitutional rules of delimitation drafted to make objections difficult, such that the EC had a lot of leeway to interpret them any way it wished,
Very high barriers to objection,
Insufficient information offered during a Redelineation exercise. Given the 1-month deadline, it was impossible for an ordinary voter to prepare a proper objection.
We start with a Polling District (PD). The PD is the smallest unit of area in a Constituency. It is defined by a boundary, a name and/or ID code, and includes the elector population. Map 1 is an example of a PD. To avoid clutter, the elector numbers are carried in a separate layer which can be overlaid on top.
Districting is conducted by assembling these PD into Constituencies. In theory, the Constituencies are supposed to have roughly the same number of electors, unless variation is permitted in the Constitution.
This was gazetted by the EC on 15th Sept 2016 for public objections. No Polling Districts are identified. In reality, the EC had all the information in digital format under an Electoral Geographical Information System (EGIS) but they kept it from the public.
An elector faced with such a map is stuck. He would not know where to begin. Nor would he have the technical knowledge to carry out the redistricting even if he wanted to, all within the time limit of 1 month.
This has been the case for the past 50 years. No one could object effectively.
So we had a situation where electors wanted to object but were unable to do so because of insufficient information and lack of expertise.
Studying the problem, we decided that the solution was to bridge the Digital Divide through Technical Innovation as well as to bring the matter out of the jurisdiction of the EC.
Digitize all the PD in Malaysia, about 8000 of them. This took us 1 year.
Learn how to redistrict using digital systems. We used QGIS, an open source GIS system,
Develop a plug-in to semi-automate and speed up the redistricting process.
Bring in legal expertise. Collaborate with lawyers to bring the matter out of the control of the EC and into the jurisdiction of the courts in order to defend the Constitution.
We started this initiative in July 2011 and by Dec 2015, we had digitised all the PD and redistricted the whole country twice, sharpening our expertise and correcting errors in the process. We got the Bar Council (Lawyers Association) to team up with us to guide the public on how to object when the Redelineation exercise by the EC is launched.
Redelineation, 1st Gazette:
On 15th Sept 2016, the EC published the First Gazette of the Redelineation Proposal. For the State of Selangor with 22 Parliamentary seats, they published one map only – MAP 2. We analysed their proposal and found glaring disparities in the seat sizes with elector population ranging from 39% to 200% of the State Electoral Quota (EQ) – MAP 3
At a more detailed level, it looks like MAP 4 below. We can see the densely populated central belt (brown columns) sticking out in sharp contrast to the under-populated outlying regions around the perimeter (ochre areas). Clearly the EC has not addressed the inequalities in the voting strength among the various regions.
Trial Run: We conducted a trial run on the EC maps for a local council in Selangor – MPSJ. See MAP 4. It was found that we could maintain local ties with 6 State and 2 Parliamentary Constituencies, with the elector population kept within +/-20% of the mean. This was much better than the EC’s range of -60% to +100%.
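The seat size comparison can be sketched in a few lines. We assume here that the Electoral Quota is simply the total number of electors divided by the number of seats; the seat names and elector counts are invented and are not taken from the EC proposal or from our trial run.

```python
def quota_ratios(electors_by_seat):
    """Express each seat's electorate as a fraction of the Electoral Quota (EQ).

    Assumption: EQ = total electors / number of seats.
    """
    quota = sum(electors_by_seat.values()) / len(electors_by_seat)
    return {seat: electors / quota for seat, electors in electors_by_seat.items()}


def within_tolerance(electors_by_seat, tolerance=0.20):
    """True if every seat is within +/- tolerance of the mean seat size."""
    return all(
        abs(ratio - 1.0) <= tolerance
        for ratio in quota_ratios(electors_by_seat).values()
    )


# Invented examples of an unequal and a more equal set of seats.
unequal_seats = {"Seat A": 40_000, "Seat B": 100_000, "Seat C": 160_000}
balanced_seats = {"Seat A": 90_000, "Seat B": 100_000, "Seat C": 110_000}

for seat, ratio in quota_ratios(unequal_seats).items():
    print(f"{seat}: {ratio:.0%} of EQ")
print(within_tolerance(unequal_seats), within_tolerance(balanced_seats))
```

A check like this makes it easy to see at a glance which proposed seats fall outside an agreed band around the mean.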
We have submitted objections for the First Gazette and await the call for a public hearing by the EC. Our lawyers are monitoring the EC to ensure they comply with the Constitution and preparing lawsuits in case they don’t.
While conducting our research on how to object, we uncovered yet another area of abuse. The boundaries of the polling districts, and the electors within them, had been shifted to other constituencies unannounced. This was a surreptitious form of redelineation outside the ambit of the constitution and a gross abuse of authority. As part of our next project, we intend to focus on this, to prevent such gerrymandering.
In conclusion, we feel like we are peeling an onion. As we unfold one layer, a new layer of fraud is exposed. It is a never-ending process. But we are determined to keep on digging until we reach the core and achieve our goal of Free and Fair Elections.
As a woman entrepreneur in technology, I have a unique perspective on running the company. I believe in nurturing the business and in relying on my own organization’s strength to sustain it.
Nowadays it is easy for a startup to be carried away by the trend of relying on investment to create traction or to scale up and grow. I started my company from scratch with my two co-founders, and we decided that the company should sustain itself. From the start, rather than using investment money to gain traction, we have relied on the strengths of our product (jBatik Software) and on our paying customers to grow our business. We realized that our company will be successful only if our customers are happy with our service. In other words, our success is tied to the success of our software users.
jBatik is pattern-generating software that we use to empower the traditional textile business in Indonesia. Our main customers are batik artisans, who use the software to create endless new batik patterns to increase their productivity and, of course, their profit. To date, more than 2,000 artisans have been using our software, whom we reached through direct training in the rural areas of Indonesia where they live. All of them are paying customers, and we are very happy to see that their income has increased by 20-25% through the use of jBatik Software.
The ISIF Asia Award has boosted our business in terms of visibility and credibility. The opportunity to network with fellow ISIF winners has given me a better perspective on addressing the pain points and needs of our beneficiaries, the traditional artisans. All of this is very important for continuing and growing our social business. Since winning the award, we have been able to improve our software training and reach out to more organizations to collaborate on acquiring new users with new strategies, and we have successfully secured funding from the Indonesian government to build new software to serve more traditional artisans.
Our work is far from perfect. With our focus on progress, we believe that collaboration is the key to our innovation. Only by collaborating with each stakeholder can we create a breakthrough to solve our problems.
For the past two decades, the rise of ICTs has generated new forms of violence. Such violence happens online or via mobile phones, and women are the first victims. According to UN Women, nearly 75 percent of female Internet users worldwide have been exposed to online threats and harassment.
The Philippines are no exception. Electronic violence against women, or eVAW, is on the rise, and more and more women suffer from it. In Manila, 70% of the complaints about online or mobile abuse come from women.
Definition of eVAW
eVAW refers to any violence against women perpetrated using ICTs. Such violence often causes a lasting mental, emotional or psychological distress.
— Electronic Harassment: This is the most common form of eVAW in the Philippines. Most of the time the harassment comes from a former partner who wants to take revenge. It can also come from strangers willing to exert control over their female victim. They send threats or communications with sexual undertones. Or they publish false accusations through blogs, online forums, or via mobile phones.
— Cyber Stalking: ICTs have made stalking much easier and more prevalent than before. In the Philippines, this is the second most widespread form of eVAW. Tracking someone’s phone has become quite easy (even without their permission). On a smart phone, it requires an installation of a tracking app, which can be done in five minutes. Even if the person owns a regular cell phone, it is still possible to install a tracker. This puts some women in a precarious situation.
— Unauthorized Distribution of Videos and Images: Sex videos and images have been proliferating online. With a smart phone, it is very easy for a man to record intimacy unbeknown to his partner. It is even easier to post these records online to harass, humiliate or bribe a woman. This does not happen to celebrities only.
— Cyber Pornography and Prostitution: The Philippines are sometimes considered as a “cyber sex hub.” About 25 percent of the population lives below the poverty line. It is no surprise that prostitution is flourishing. In 2013, there were about 500,000 prostitutes, mostly women. Some are now forced to engage in cyber sex or pornography in exchange for money. The situation is aggravated by the craze for pornography among Filipinos. The country places 15th in adult website Pornhub’s global traffic on mobile devices. And it ranks 26th when it comes to watching it using a computer.
Founded in 1987, FMA is a well-established Filipino nongovernment organization. Its goal is to empower the Philippines’ civil society through the media. In the 2000s, it contributed to opening up access to the Internet. In particular, it developed a free email service for NGOs.
In 2009 FMA decided to take on the rising eVAW issue in the Philippines by becoming involved in the global initiative “Take Back the Tech! To End Violence Against Women.” At the time, there was a pressing need for better adapted laws. The Philippines already considered violence against women a crime, but electronic violence was not targeted as such.
Furthermore, more awareness was required. The victims often had no idea how to deal with these offenses. “Laws […] do not always prove to be effective deterrents in the commission of crimes,” explained Lisa Garcia from FMA. “The anonymity that the Internet provides emboldens malicious citizens to commit damaging acts without fear of discovery in spite of laws. This means more advocacy and education are needed to address issues of violence and rights abuses through technology.”
Taking Action Against eVAW
That is why FMA’s first priority was to raise awareness about eVAW. It targeted the general public by featuring programs on the radio and television. It also reached representatives of public, academic and civil organizations. In total, FMA has trained more than 1,000 people.
In 2013, FMA took its struggle against eVAW one step further. It reinforced its advocacy action by launching the eVAW Mapping Project. This Ushahidi-based tool aims to collect accurate eVAW data. Women report incidents by SMS or email, and the software aggregates them into a map. FMA then conducts a trend analysis and data visualization. It eventually shares this data with the authorities and policy makers.
Safer Electronic Spaces for Women
Since 2009, FMA has managed to take the struggle against eVAW in the Philippines one step further. Today, eVAW is recognized as a form of cybercrime and more women are aware of their rights and able to report this violence.
In December 2008, the Government of Bangladesh introduced the vision ‘Digital Bangladesh by 2021’ to leverage the Internet to improve the delivery of its services, particularly among the poor people. A quiet revolution in digitizing its health sector is already under way to strengthen the Health Information System (HIS), which enables real-time monitoring of population health.
The Internet could provide solutions to a number of structural problems besetting Bangladesh’s health sector. For example, the use of ICT to provide remote diagnosis, advice, treatment, and health education could address a major part of the health issues of patients in rural clinics, which are typically the most poorly staffed. Online tools and mobile innovations can improve the operational efficiency and productivity of (rural) health system by enabling more effective service delivery.
The use of ICT in education has a similar potential to deliver rapid gains in access to education, teacher training, and learning outcomes. As pointed out above, web-based school management systems that can support standardization and monitoring of school performance could enable the government to achieve more with its education budget and provide millions of students with the foundation for a better future.
All of the available sources show that incorporating digitized materials is currently one of the most important factors in making a dramatic shift in these two sectors. However, many questions regarding the impact of the digital age on the socioeconomic conditions of Bangladesh remain unanswered.
How do we quantify the impact that the internet and internet driven business have on a country’s GDP?
How does a digitized registration process reduce corruption?
What will the legal system of the country look like once it has been digitized?
Is the internet creating a divergence among different groups of people?
How many labor hours are people wasting on unproductive activities on the internet?
For every new digital adoption, someone may be getting a new job, whereas someone, somewhere may be losing his/her job. So what if the rate of losing jobs is far greater than the rate of creating employment opportunities?
Answering these questions requires further research, including reviewing the experiences of other countries where these questions have already been addressed. However, one thing is clear: these topics will dominate research agendas in the upcoming years, and the findings will help Bangladesh move towards more balanced, robust and sustainable economic growth.
Voters worldwide seldom interact with their chosen leaders, except around five-yearly elections. However, the advent of advanced Information and Communication Technologies (ICT) might shrink this interval considerably. They may even turn back the clock towards the seminal Athenian model of democratic decision-making: directly by the people rather than their representatives. With some political discretion, today’s online forums can allow for similarly incorporating crowdsourced public opinion into policy design. This could contribute to nationally important initiatives (such as preparing Morocco’s 2011 Draft Constitution or debates on Spain’s Plaza Podemos, Brazil’s E-democracia portal and India’s own mygov.in). Nonetheless, we will concern ourselves with far more universal and local problem-solving at the municipal level.
But just who has access to such platforms? While internet penetration in rural India is rising dramatically, the lion’s share (67%) still resides with urban denizens. Moreover, as highlighted by the Wall Street Journal, India boasted of a quarter of the world’s fastest growing urban zones and 8 qualifying ‘MegaCities’ as per India’s 2011 Census definition. The demands on municipal governments are likely to be considerable, and even more likely to be mediated by internet platforms.
Regardless of this explosion of population and the associated challenges, the structure of municipal bodies has remained unchanged since Lord Ripon’s 1882 Resolution on self-government. Furthermore, as Ramesh Ramanathan of Janaagraha points out, the responsibility for action is de facto scattered across acronyms of acrimonious accusing agencies. For example, Bangalore’s (deep breath advised) BDA, BMRDA, BWWSB, BMTC, KSB, BESCOM together juggle the city’s water, transport, electricity, traffic police and development needs. Many authorities, little authority. Increasingly internet-savvy and increasingly increasing residents. Where can they all turn for help?
Enter DataKind Bangalore Partners.
15-year old Janaagraha has endeavoured to improve the quality of urban life- in terms of infrastructure, services and civic engagement- by coordinating government and citizen-led efforts. Of their various initiatives, the IChangeMyCity portal also earned Discover ISIF Asia’s award under the Rights and People’s Choice categories.
Next up, eGovernments Foundation, brainchild of Nandan Nilekani & Srikanth Nadhamuni (Silicon Valley technologist) has since 2003 sought to transform urban governance across 275 Municipalities with the use of scalable and replicable technology solutions (for Financial Accounting, Property & Professional Taxes, Public Works, etc.) Their Public Grievance and Redressal system for the Municipal Corporation of Chennai- recipient of the 2010 Skoch Award -has fielded over 0.22 million complaints over 6 years.
Though these organizations joined hands with DataKind in two distinct ‘Sprints’, the similarities are remarkable. Both their platforms allow citizens to primarily flag problems (garbage, city lighting, potholes) at the neighbourhood level for resolution by government agencies.
Then again, the differences are noteworthy too. As an advocacy-oriented organization, Janaagraha aimed to understand the factors that led to certain complaints being closed promptly by a third party. eGovernments, on the other hand, being within the system, wanted to keep officials and engineers adequately prepared for business as usual and to alert them immediately to anomalies. So both sought predictions around complaints: one on their creation, another on their likelihood of closure.
Clearly, quite a campaign lay ahead. If we forget Ancient Greek democracy and hitch a caravan to China, then Sun Tzu’s wisdom from the Art of War pops in: knowing oneself is the key to victory. Always open to relevant philosophy, the DataKinders looked into their own ranks to assess their strengths. The team assigned for E-Governments coincidentally included Ambassadors (Chapter Leader, Vinod Chandrashekhar) and Data Experts (Samarth Bhargav, Sahil Maheshwari) from the Janaagraha project. The teams were also at different junctures joined by the latter’s Vice President (Manu Srivastava) and two of his interns, plus a multidisciplinary mob of volunteers from backgrounds in business consulting, UX Design, data warehousing, development economics and digital ethnography. Let’s see how they waged war.
Progress to Date
Back in March 2015, IChangeMyCity presented a set of 18,533 complaints carrying rich meta-data on Category, Complainant Details, Comments, etc. You’d assume this level of detail opens doors to appetizing analyses. Perhaps. Unfortunately, the information dwelt in a database of 10 different tables. Sahil Maheshwari, then working as a Product Specialist, busied himself with the onerous task of unraveling the relationships between them, drawing up an ER Diagram and ‘flattening’ records into one combined table. The team then accordingly fished out missing or anomalous values.
Conversely, E-Governments users report their problems online, through SMS, on paper forms or by calling the special ‘1913’ helpline, where operators transcribe complainants’ inputs. With digital data being entered through drop-down menus rather than free text (either directly by users or call centre employees), no major missing data was to be found. Except, of course, for unresolved cases: a mere 8% of the 0.18 million complaints. Some entries, amounting to 0.8%, were exactly identical, clearly a technical glitch. Moreover, all data resided in one table. So in November 2015’s DataJam, this structure allowed the team to plunge immediately into exploratory analysis.
Across the 200 wards of Chennai, 93 kinds of complaints (grouped further into 9 categories) could be assigned to departments at either the City or Zone level. Although the numbers initially seemed staggering, Samarth Bhargav ran basic visualizations in the R programming language. The result? Another instance of Pareto’s rule: 15 of these complaint types were contributing to 82% of grievances. Several DataKind first-timers like Aditya Garg & Venkat Reddy ran similar analyses for the 10 most given-to-grumbling wards, and found trouble emanating from roughly the same top 5 sources. Apparently, malfunctioning street lights blow everyone’s fuse. These common bugbears intriguingly became less bearable (and more numerous) in the second half of the year, while others related to taxes seemed more even across the year.
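A Pareto check of this kind is easy to sketch. The team worked in R; the Python below is only an illustration of the idea, and the complaint types and counts in it are invented rather than taken from the Chennai data.

```python
from collections import Counter


def types_covering(complaint_types, share=0.8):
    """Return the most frequent complaint types that together reach `share`
    of all complaints, along with the share they actually cover."""
    counts = Counter(complaint_types)
    total = sum(counts.values())
    top_types = []
    covered = 0
    for complaint_type, count in counts.most_common():
        top_types.append(complaint_type)
        covered += count
        if covered / total >= share:
            break
    return top_types, covered / total


# Invented sample: one entry per complaint, labelled with its type.
sample = (
    ["street light"] * 50
    + ["garbage"] * 30
    + ["pothole"] * 15
    + ["property tax"] * 5
)

top_types, covered_share = types_covering(sample)
print(top_types, f"{covered_share:.0%}")
```

In this invented sample, two of the four complaint types already account for 80% of all complaints.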
Even so, how could there be 10 broken lights in an area with only one on record? Had ten people all reported the same light? As with data analysis, learning from Chinese classics (literally) involves reading the fine print. Sun Tzu’s actual words: ‘If you know the enemy AND know yourself, you need not fear the result of a hundred battles.’ Clearly, this enemy was a lot more complicated than the decoy flanks that DataKinders had speared. Tzu and George Lucas may well have hung out over green tea.
Attack of the Clones
In usual data science settings, duplicates are often easy to identify and provide little intrinsic value. However, the game changes in the world of crowdsourced data, especially data highlighting the criticality of an issue. So to achieve victory, the team would have to understand and strike at its core: dynamic social feedback. We could assess its importance at four levels.
The first involves messages from the platform itself to indicate that a complaint has been registered and no further inputs are necessary. In its absence, citizens could well create duplicates by hitting the Submit button either accidentally (not knowing if their complaint was logged) or deliberately (hoping that repeating the complaint may lead to quicker action). This is more of a concern for web platforms rather than call centres. By matching against columns involving email, phone and postal contact details and date, time and type of the complaint, DataKind had already been able to quickly hurl out these obvious clones.
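A minimal Python sketch of this kind of exact-match clone removal follows. The field names (`email`, `phone`, `address`, `date`, `time`, `type`) are assumptions for illustration, not the actual column names in either dataset, and this is not necessarily how the team implemented it:

```python
def drop_exact_duplicates(complaints, key_fields):
    """Keep the first complaint for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for complaint in complaints:
        # Build a hashable key from the chosen columns
        key = tuple(complaint.get(field) for field in key_fields)
        if key in seen:
            continue  # identical on every key field: treat as a clone
        seen.add(key)
        unique.append(complaint)
    return unique


# Hypothetical records: the first two are identical on every key field
records = [
    {"email": "a@example.com", "phone": "111", "address": "1 Main St",
     "date": "2015-03-01", "time": "10:00", "type": "street light"},
    {"email": "a@example.com", "phone": "111", "address": "1 Main St",
     "date": "2015-03-01", "time": "10:00", "type": "street light"},
    {"email": "b@example.com", "phone": "222", "address": "9 Lake Rd",
     "date": "2015-03-01", "time": "10:05", "type": "street light"},
]
KEY_FIELDS = ["email", "phone", "address", "date", "time", "type"]
unique_records = drop_exact_duplicates(records, KEY_FIELDS)
print(len(unique_records))
```

Only rows that agree on every listed field are dropped, so two different residents reporting the same street light survive this pass. That is the harder case taken up next.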
The second level of feedback is where the Force truly awakens- from other citizens. The ability to see that other fellow residents have experienced the same concern may prevent its repetition. But this rests on two assumptions. First, that they can view already posted complaints, as exists with IChangeMyCity. They may rally behind this shared cause by ‘upvoting’- an indicator to authorities of its increased importance.
Even if this feature does not exist- as with eGovernments- then all is not lost. High priority might still be inferred by large absolute numbers of complaints. But these would provide an idea of the severity of the problem across the ward (45 pot holes in Adyar) rather than one specific instance of it (that life-threatening one before the flyover). Secondly, if peeved citizens do not put in the effort of checking the roster of existing complaints- as inevitably occurred even with IChangeMyCity- then the Upvotes option alone cannot guarantee being Clone-free.
The third and most obvious feedback comes from authorities via the digital platform- to indicate closure. This is provided by both partners, with IChangeMyCity also appending contact details of which official has been assigned the task.
The fourth and final level is where a citizen can verify that a complaint marked as ‘closed’ has truly been resolved. After all, accountability forms part of the foundation of democracy. In this manner, the same poorly tended-to complaint could be reopened, rather than a new one being filed. This feature currently exists only with IChangeMyCity, which not only allows municipal authorities to mark a complaint as ‘closed’ (as eGovernments does), but also allows users to reopen it if unsatisfied.
IChangeMyCity’s resolution rates lie close to 50%, a figure probably reached after allowing for this reopening scenario. eGovernments, on the other hand, closed a commendable 97% of complaints. Closure times ranged from the same day (up to 13% of cases) to an outlier of 1043 days (almost 3 years), with the majority (56%) closed in under 3 days. Mr Srivastava emphasized that these efficiency statistics had improved dramatically in the last 2 years. But as we just explored, a possible confounding factor is that multiple duplicate complaints are being closed by engineers who have identified their Clone nature.
How to Fix It?
Thus, it was the second category, unintended duplication, which bled into the fourth. How could the DataKind team exploit the enemy’s own weakness? They decided to unsheathe their two logical light sabers: text and location. Either one in isolation didn’t necessarily pinpoint a duplicate. But in combination, they could quickly incinerate a Clone’s trooper suit.
Saber 1: WHERE was the complaint registered? For IChangeMyCity, a user can log in, peer through a map of Bangalore and plant a pin on the spot where they’d like to direct the authorities’ attention. Using that pin, analysts can procure exact latitude and longitude coordinates. It’s still entirely possible that different people place the pins some distance apart even when referring to the same issue. But it would seem like a safe bet that two closely located complaints might just be Clones.
eGovernments currently doesn’t use maps, but asks users for a fairly detailed, 6-level description of addresses (City, Regions, Zones, Wards, Area, Locality, Street). Such text might help direct an engineer gallivanting outdoors, but not a computer that speaks code. Attempting to translate the text addresses into geocodes, the team split the data into 10 parts and ran the Google Maps API with an R script on each one. Despite their best efforts, accuracy could not be guaranteed. Though eGovernments will soon be introducing such coordinates, geocoding seemed like a closed line of attack.
Saber 2: HOW was the complaint registered? The way people express themselves on a particular local issue may vary, but could feature some words in common. However, with the eGovernments system, pre-loaded tags from the website were automatically attached to complaints. Result? Nearly 40,000 entries demanding ‘NECESSARY ACTION’ (in capitals, no less) with only minor differences. Others exist, but simply restate the category of complaint (‘Removal of Garbage’). With so little variability and no hidden clues, this strategy failed too.
However, for IChangeMyCity, citizens are free to fill in complaint titles and descriptions as they please. So the DataKind team broke the text of both the complaint’s title and description into sentences and then into words. Then they ran an unsupervised learning algorithm, which helped generate the Jaccard Index, a measure of how ‘close’ two complaints were in terms of statistical similarity.
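For readers unfamiliar with the Jaccard Index: it is the size of the overlap between two sets divided by the size of their union. The Python sketch below applies it to the word sets of two complaint texts. It is only an illustration of the measure itself, not the team’s actual pipeline; the tokenizer, the function names and the sample strings are assumptions.

```python
import re


def word_set(text):
    """Lower-case the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard index of the two texts' word sets: |A & B| / |A | B|."""
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty texts
    return len(a & b) / len(a | b)


# Hypothetical complaint titles
score = jaccard("Street light not working", "street light broken")
print(score)
```

Here the two titles share 2 words (‘street’, ‘light’) out of 5 distinct words overall, so the score is 0.4. A score of 1.0 means identical word sets and 0.0 means no words in common.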
But checking this ‘distance’ for each of N complaints against every other would require on the order of N*N operations. Far too long for a dataset of this size. To assist with this more abstract sense of ‘distance’, the team decided to turn to the more intuitive geographical meaning of the term: the clearly listed geocode saber we mentioned above.
The team decided that any two complaints within 250m of each other on a map could be considered as potential duplicates, while the rest could be ignored. Plugging these codes into the MongoDB geospatial index, Samarth ingeniously reduced the computation time for this process from 2 hours to 10 minutes. He also later developed a REST API that could be queried to detect the 10 nearest complaints. Going forward, the team hopes to set a threshold of such ‘similarity’ beyond which a new entry could automatically be flagged as a duplicate, much like answered programming queries on Stack Overflow.
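The 250m rule can be sketched in plain Python as follows. This is an assumption-laden illustration, not the team’s code: it uses the haversine great-circle formula and a simple scan over existing complaints, whereas the team relied on MongoDB’s geospatial index precisely to avoid scanning everything. The field names `lat` and `lon` and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_complaints(new, existing, radius_m=250):
    """Return the existing complaints whose pin lies within radius_m of the new one."""
    return [
        c for c in existing
        if haversine_m(new["lat"], new["lon"], c["lat"], c["lon"]) <= radius_m
    ]


# Hypothetical pins in Bangalore
new_complaint = {"id": "new", "lat": 12.9716, "lon": 77.5946}
existing_complaints = [
    {"id": "close", "lat": 12.9725, "lon": 77.5946},  # roughly 100 m north
    {"id": "far", "lat": 12.9816, "lon": 77.5946},    # roughly 1.1 km north
]
candidates = nearby_complaints(new_complaint, existing_complaints)
print([c["id"] for c in candidates])
```

Only the candidates returned by this location filter would then need the more expensive text comparison, which is what keeps the pairwise work well below N*N.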
Onward to De-Duplication Success
At first glance, it may seem like the Attack of the Clones had stamped defeat over the eGovernments project, while IChangeMyCity had dodged the bullet. But let’s not jump to conclusions. The importance of this first battle is relative. Since Janaagraha is focused on closure of a single complaint, it makes sense not to muddy waters by repeating the same theory. eGovernments, on the other hand, is interested in the total number of complaints likely to arise, not the problems. Also, as we’ll see in the next installment, the larger numbers of complaints (including duplicates) would prove crucial in helping generate valid forecasts for the Chennai Municipal Corporation.
So at the end of this first DataJam session, what had the team discovered? On a flight that carried along Sun Tzu, 2 mayors, George Lucas and random Athenians in Business Class, we learnt the philosophical complexities of the idea of ‘duplication’, especially in the contexts of crowdsourcing and democratic processes in strained local governments.
The Wireless Innovation Project from the Vodafone Americas Foundation is designed as a competition to promote innovation and increase implementation of wireless related technology for a better world. Total awards up to $600,000 will be available to support projects of exceptional promise that meet our eligibility criteria.
Types of Projects
The Vodafone Wireless Innovation Project seeks to identify and fund the best innovations using wireless related technology to address critical social issues around the world. Project proposals must demonstrate significant advancement in the field of wireless-related technology applied to social benefit use.
The competition is open to projects from universities and nonprofit organizations based in the United States. Although organizations must be based in the United States, projects may operate and help people outside of the United States.
Applicants must demonstrate a multi-disciplinary approach that uses an innovation in wireless-related technology to address a critical global issue in one or more of the following areas:
Social Issue Areas: Access to communication, Education, Economic development, Environment, Health
Technical Issue Areas: Connectivity, Energy, Language or Literacy hurdles, Ease of use
The project must be at a stage of research where an advanced prototype or field/market test can occur during the award period.
The technology should have the potential for replication and large scale impact.
Teams should have a business plan or a basic framework for financial sustainability and rollout.
Winners will be selected for awards of $100,000, $200,000, and $300,000 which will be paid in equal installments over three years.
How to Submit a Proposal
To submit a proposal, Applicants must first successfully complete the Eligibility Questionnaire. Eligible Applicants will then receive the URL for the online application via e-mail and be asked to create a username and password, which will enable them to work on their proposal online. The application consists of multiple narrative questions and a project budget spreadsheet that Applicants must complete and submit. All information must be submitted through the online application.
Submissions will be accepted from 9:00 a.m. Pacific Time on November 2, 2015 to 11:59 p.m. Pacific Time on February 22, 2016 (the “Entry Period”).
2015 has been a busy year for the ISIF Asia program, with awards, grants and capacity building activities being supported around the Asia Pacific region. Here is a summary of what we have done in 2015.
ISIF Asia Awards 2015
The Information Society Innovation Fund (ISIF Asia) Awards seek to acknowledge the important contributions ICT innovators have made to their communities, by addressing social and development challenges using the Internet. The Awards recognize projects that have already been implemented, or are in the final stages of implementation, and have been successful in addressing their communities’ needs.
During 2015, 5 awards of AUD 3000 were given to very interesting projects from India, Indonesia and Pakistan, covering very relevant issues where Internet technologies make a difference for community development, such as citizens’ participation to improve public infrastructure in India; bridging fractal algorithms with traditional batik design in Indonesia; supporting female doctors in Pakistan to access the workforce; and mapping diseases in rural areas of Pakistan. The 5 award winners were selected out of the 78 nominations received from 12 economies across the region.
This year it was particularly interesting to receive an application from China, for the very first time since the inception of the ISIF Asia program. 31 applications were accepted for the selection process and are publicly available for anyone interested in learning more about the ingenuity and practical approaches that originate from our region. 16 applications were selected as finalists for full review. When the final selection of the 4 award winners was completed, the process was opened for the community to cast their votes to select the Community Choice Award winner, selected with 426 valid votes. Besides the cash prize, the award winners were invited to attend the 10th Internet Governance Forum (Joao Pessoa, Brazil, 10-13 November 2015), where the awards ceremony took place. The full video of the awards proceedings is below:
Internet Governance Forum participation
As part of the Seed Alliance support, ISIF Asia led the development of a workshop proposal that was accepted by the MAG for inclusion in the official IGF program. As a follow-up to the work conducted during the IGF in Bali and the IGF in Istanbul, workshop No. 219, “Addressing funding challenges for continuous innovation”, set out to understand how funding for Internet innovation operates and how the Internet community responds to those challenges, and to explore solutions together. In light of the publication of the new Sustainable Development Goals in August 2015, the workshop also explored the link between funding opportunities and the achievement of Goal #9, “Build resilient infrastructure, promote inclusive and sustainable industrialisation and foster innovation”, where 9c sets the objective to “Significantly increase access to information and communications technology and strive to provide universal and affordable access to the Internet in least developed countries by 2020”. The workshop speakers were Jens Karberg (Sida), Laurent Elder (IDRC), Paul Wilson (APNIC) and Vint Cerf (Google). You can follow the workshop on the video below:
Capacity Building Fund
During 2015, ISIF recipients benefited from additional support through the Capacity Building Fund to promote the results of their ISIF-supported projects at 9 international events. These events raised their profile and opened doors to negotiate additional support for their projects through a stronger and wider network of contacts. The events were as follows:
10th Internet Governance Forum. 10-13 November 2015. Joao Pessoa, Brazil
APNIC 40. 3 Sep 2015 to 10 Sep 2015. Jakarta, Indonesia
COHRED Forum 2015. 24 Aug 2015 to 27 Aug 2015. Manila, Philippines
WiSATS 2015. 6 Jul 2015 to 7 Jul 2015. Bradford, United Kingdom
APrIGF 2015. 30 Jun 2015 to 3 Jul 2015. Macau, Macau
RightsCon. 24 May 2015 to 25 May 2015. Manila, Philippines
ICTD 2015. 15 May 2015 to 18 May 2015. Singapore City, Singapore
AVPN 2015 Conference. 20 Apr 2015 to 23 Apr 2015. Singapore City, Singapore
APNIC 39. 24 Feb 2015 to 6 Mar 2015. Fukuoka, Japan
Additionally, a “Mentoring workshop on evaluation and research communications” was provided for 2014 grant recipient Operation ASHA, for the project “Linking TB with technology”, from 23 Mar 2015 to 26 Mar 2015 in Phnom Penh, Cambodia. Sonal Zaveri and Vira Ramelan mentored Jacqueline Chen and Vin “Charlie” Samnang of the eDetection app project, providing training on U-FE and ResCom concepts, refining key evaluation questions and drafting plans for a communications strategy.
On top of these face-to-face opportunities, ISIF provided access to the JFDI.Asia pre-accelerator course from August to November 2015, providing 60 accounts for ISIF recipients to join the course in teams. JFDI was founded in Singapore in 2010 by Hugh Mason and Wong Meng Weng. The community has since helped thousands of people in Asia to engineer innovative businesses around their ideas. They can do this because innovation is evolving from an art into a science, and because they have built a community whose members share their expertise and experience in turning ideas into reality. The 60 teams are all ISIF Asia funding winners seeking to accelerate their learning and thereby scale and grow the impact of their ideas.
The site visits allowed ISIF Asia to gain a deeper understanding of: 1) the context in which the supported organization operates, its partnerships with other organizations and its relationships with project beneficiaries; 2) the problem that the project addressed; 3) the solution proposed; 4) the results that the project achieved; and 5) the challenges the organizations face for future development. The site visits were documented using photographs, videos, and blog articles. The visits were not only informative about the challenging contexts in which these 2 projects operate, but also inspiring, showing what can be achieved when talented and highly committed professionals put their knowledge and effort to good use for the benefit of disadvantaged communities. During 2015, ISIF Asia visited:
iSolutions. Aug 2015. Chuuk, Federated States of Micronesia.
Access Health International. Mar 2015. Manila, The Philippines
Operation ASHA. Mar 2015. Phnom Penh, Cambodia
2014 grants completed!
During 2015, we have seen the completion of most of the 2014 grant projects. The projects addressed development problems and demonstrated the transformative role the Internet can have in emerging economies. The 2014 grant recipients and projects summarized below are examples of the kind of partnerships that ISIF encourages and supports.
The Pacific Islands Chapter of the Internet Society – PICISOC, in collaboration with the University of Auckland (Pacific Islands), worked on improving Internet connectivity in Pacific Island countries with network-coded TCP, with deployments on several islands of the Pacific yielding very positive and encouraging measurements for future development.
The Punjabi University, Patiala (India) completed their project to overcome the barriers that the Sindhi Arabic and Devnagri scripts posed for researchers. They have completed the transliteration tables for both scripts, and millions of words have been input into the database, which is now in its final version.
The Cook Islands Internet Action Group (Cook Islands) has released the Maori Database app, website and social media page, which have attracted attention from the local media and interest from the local government in preserving the language.
CoralWatch, The University of Queensland (Australia/Indonesia) finalized their mobile app in Bahasa Indonesia and English to improve citizen science monitoring of coral reefs in Indonesia.
The Internet Education and Research Laboratory, Asian Institute of Technology, in collaboration with the Mirror Foundation and the THNIC Foundation (Thailand), deployed Chiang-Rai MeshTV: An Educational Video-on-Demand (E-VoD) System for a Rural Hill-Tribe Village via a Community Wireless Mesh Network (CWMN). The Chiang-Rai community now has a fully operational mesh network that streams educational videos over a community wireless network, increasing access to educational content fit for a low-literacy context and motivating families to support their kids in keeping to their learning path.
The Institute of Social Informatics and Technological Innovations – ISITI-CoERI (Malaysia) continued their research and developed a game to digitalize and preserve Oro, a secret signage language of the nomadic Penans in the rainforest in Malaysia. Their efforts have made it possible to document traditional knowledge from the elders and make it relevant for the younger generations.
iSolutions (Micronesia) deployed the Chuuk State Solar Server Education Hub through a scale-up grant, following the PISCES project, which was supported in 2013 to connect schools to the Internet in Chuuk. The solar server education hub connects schools to educational content and lets them share communications capabilities, lowering costs by rationalizing the use of their limited broadband connections and using solar energy.
Nazdeek, in collaboration with PAHJRA and ICAAD, has introduced a different approach to improving maternal health in India. They are using SMS technologies linked to online mapping to increase accountability in the delivery of maternal health services. Their approach allows Adivasi tea garden workers in Assam to understand their rights and how to claim the benefits they are entitled to.
The ECHO app from eHomemakers in Malaysia received an award in 2012 for their work supporting working women in Malaysia to communicate and coordinate better when they work from home. In 2014 they received a scale-up grant to replicate their experience in support of Homenet in Indonesia.
The University of Engineering and Technology and Vietnam National University are working on better systems for monitoring and early warning of landslides in Vietnam.
Operation ASHA’s successes in India inspired this scale-up grant to support the deployment of an application to monitor TB in Cambodia and to support the work that health workers do to contain the spread of the disease and provide adequate follow-up for patients. They developed the eDetection app and improved diagnostics to reduce the spread of TB in Cambodia.
BAPSI has completed training and testing in the development of Morse code-based applications, which give deaf-blind people the opportunity to use mobile phones to communicate better with those around them who do not know sign language.
2015 supported projects are well under way!
The selection for the 2015 grant recipients was also completed and 4 projects have received support. Their progress reports are starting to flow in, and they will reach completion during the first semester of 2016.
Development of mobile phone based telemedicine system with interfaced diagnostic equipment for essential healthcare in rural areas of Low Resource Countries. Department of Biomedical Physics and Technology. University of Dhaka, Bangladesh.
Deployment of a Community based Hybrid Wireless Network Using TV White Space and Wi-Fi Spectrum in Remote Valleys around Manaslu Himalaya. E-Networking Research and Development. Nepal
Improved Carrier Access in Rural Emergencies (ICARE). Innovadors Lab Pvt Ltd and School of Computer and Information Science, IGNOU. India
A Peering Strategy for the Pacific Islands. Network Startup Resource Center and Telco2. Pacific Islands
The Discovery Asia blog
44 new articles have been published this year to highlight the talent, skills and commitment that the Asia Pacific region has to offer. The blog continues to draw attention to the vibrant community we serve, their needs and their innovative approaches to solving development problems using the Internet for the benefit of their communities. We encourage you to share your stories. It has been a busy year, and we look forward to new challenges during 2016! We look forward to completing 100 articles soon!
Seed Alliance end of a 3 year cycle
The three-year grants from IDRC and Sida that made the collaboration with the FIRE and FRIDA programs possible came to a close during the 10th IGF in Brazil, where the Seed Alliance website was launched. The website provides a comprehensive view of the work that IDRC and Sida’s funds have made possible, supporting 116 projects from 57 economies. The Seed Alliance has allocated around US$ 2.2 million of funding in Grants and Awards throughout Africa, Asia Pacific, and Latin America, helping to strengthen and promote the Information Society within these regions through 102 opportunities for networking, outreach, evaluation and/or capacity building. The website will be officially launched in February 2016, but we invite you all to explore it!
New funding confirmed for project implementation and more grants in 2016!
After a successful external evaluation process commissioned by IDRC and a new proposal negotiation as part of the Seed Alliance activities, ISIF Asia has received a renewed funding commitment from IDRC for 2016 and 2017. A new call for grants will open early in 2016, to which APNIC has renewed its commitment as well. In addition, the Internet Society has decided to increase their funding contribution to ISIF Asia and fund a full grant. More details about this will be shared early in 2016.
The selection process for the 2015 round is currently under way, with 60K AUD to contribute towards research in our region. And finally, APNIC has renewed the funding for the Internet Operations Research Grants 2016.
We thank all our partners and sponsors for their renewed support!