Fixing Government Data Duplication at DataKind Bangalore


Voters worldwide seldom interact with their chosen leaders, except around 5-yearly elections. However, the advent of advanced Information and Communication Technologies (ICT) might shrink this interval considerably. They may even turn back the clock towards the seminal Athenian model of democratic decision-making: directly by the people rather than their representatives. With some political discretion, today’s online forums can similarly incorporate crowdsourced public opinion into policy design. This could contribute to nationally important initiatives (such as preparing Morocco’s 2011 Draft Constitution or debates on Spain’s Plaza Podemos, Brazil’s E-democracia portal and India’s own mygov.in). Here, however, we will concern ourselves with far more universal and local problem-solving at the municipal level.

But just who has access to such platforms? While internet penetration in rural India is rising dramatically, the lion’s share (67%) still resides with urban denizens. Moreover, as highlighted by the Wall Street Journal, India boasted a quarter of the world’s fastest-growing urban zones and 8 cities qualifying as ‘MegaCities’ under India’s 2011 Census definition. The demands on municipal governments are likely to be considerable, and even more likely to be mediated by internet platforms.

Despite this explosion of population and the associated challenges, the structure of municipal bodies has remained unchanged since Lord Ripon’s 1882 Resolution on self-government. Furthermore, as Ramesh Ramanathan of Janaagraha points out, the responsibility for action is de facto scattered across acronyms of acrimonious accusing agencies. For example, Bangalore’s (deep breath advised) BDA, BMRDA, BWSSB, BMTC, KSB and BESCOM together juggle the city’s water, transport, electricity, traffic police and development needs. Many authorities, little authority. Increasingly internet-savvy and increasingly increasing residents. Where can they all turn for help?

Enter DataKind Bangalore Partners.

The 15-year-old Janaagraha has endeavoured to improve the quality of urban life, in terms of infrastructure, services and civic engagement, by coordinating government and citizen-led efforts. Of their various initiatives, the IChangeMyCity portal earned Discover ISIF Asia’s award under the Rights and People’s Choice categories.

Next up, eGovernments Foundation, brainchild of Nandan Nilekani & Srikanth Nadhamuni (Silicon Valley technologist), has since 2003 sought to transform urban governance across 275 municipalities with scalable and replicable technology solutions (for Financial Accounting, Property & Professional Taxes, Public Works, etc.). Their Public Grievance and Redressal system for the Municipal Corporation of Chennai, recipient of the 2010 Skoch Award, has fielded over 0.22 million complaints over 6 years.

Though these organizations joined hands with DataKind in two distinct ‘Sprints’, the similarities are remarkable. Both their platforms primarily allow citizens to flag problems (garbage, city lighting, potholes) at the neighbourhood level for resolution by government agencies.

Then again, the differences are noteworthy too. As an advocacy-oriented organization, Janaagraha aimed to understand the factors that led to certain complaints being closed promptly by a third party. eGovernments, on the other hand, being within the system, sought to keep officials and engineers adequately prepared for business-as-usual and to alert them immediately to anomalies. So both sought predictions around complaints: one on their likelihood of closure, the other on their creation.

Clearly, quite a campaign lay ahead. If we forget Ancient Greek democracy and hitch a caravan to China, then Sun Tzu’s wisdom from the Art of War comes to mind: knowing oneself is the key to victory. Always open to relevant philosophy, the DataKinders looked into their own ranks to assess their strengths. The team assigned to eGovernments coincidentally included Ambassadors (Chapter Leader, Vinod Chandrashekhar) and Data Experts (Samarth Bhargav, Sahil Maheshwari) from the Janaagraha project. The teams were also joined at different junctures by the latter’s Vice President (Manu Srivastava) and two of his interns, plus a multidisciplinary mob of volunteers from backgrounds in business consulting, UX design, data warehousing, development economics and digital ethnography. Let’s see how they waged war.

Progress to Date

Back in March 2015, IChangeMyCity presented a set of 18,533 complaints carrying rich metadata on Category, Complainant Details, Comments, etc. You’d assume this level of detail opens doors to appetizing analyses. Perhaps. Unfortunately, the information dwelt in a database of 10 different tables. Sahil Maheshwari, then working as a Product Specialist, busied himself with the onerous task of unraveling the relationships between them, drawing up an ER diagram and ‘flattening’ the records into one combined table. The team then fished out missing or anomalous values.

Conversely, eGovernments users report their problems online, through SMS, on paper forms or by calling the special ‘1913’ helpline, where operators transcribe complainants’ inputs. With digital data being entered through drop-down menus rather than free text (either directly by users or by call centre employees), no major missing data was to be found. Except, of course, for unresolved cases: a mere 8% of the 0.18 million complaints. Some entries, amounting to 0.8%, were exactly identical, clearly a technical glitch. Moreover, all the data resided in one table. So in November 2015’s DataJam, this structure allowed the team to plunge immediately into exploratory analysis.

Across the 200 wards of Chennai, 93 kinds of complaints (grouped further into 9 categories) could be assigned to departments at either the City or Zone level. Although the numbers initially seemed staggering, Samarth Bhargav ran basic visualizations in the R programming language. The result? Another instance of Pareto’s rule: 15 of these complaint types were contributing to 82% of grievances. Several DataKind first-timers like Aditya Garg & Venkat Reddy ran similar analyses for the 10 most given-to-grumbling wards, and found trouble emanating from roughly the same top 5 sources. Apparently, malfunctioning street lights blow everyone’s fuse. These common bugbears intriguingly became less bearable (and more numerous) in the second half of the year, while others related to taxes seemed more evenly spread across the year.
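For readers who want to try a Pareto check of this kind themselves, here is a minimal sketch in Python (the team worked in R, and their scripts are not reproduced here). The function name and the toy complaint labels are purely illustrative assumptions, not the Chennai data.

```python
from collections import Counter

def cumulative_share(complaint_types):
    """Return (type, cumulative share of all complaints), most frequent first."""
    counts = Counter(complaint_types)
    total = sum(counts.values())
    running = 0
    shares = []
    for ctype, n in counts.most_common():
        running += n
        shares.append((ctype, running / total))
    return shares

# Toy data only: 10 made-up complaints, not figures from the project.
sample = ["street light"] * 6 + ["garbage"] * 3 + ["property tax"] * 1
shares = cumulative_share(sample)
```

Reading down the resulting list shows how few complaint types are needed before the cumulative share crosses a chosen level, such as the 82% mentioned above.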

Even so, how could there be 10 broken lights in an area with only one on record? Had ten people all indicated the same light? As with data analysis, learning from Chinese classics (literally) involves reading the fine print. Sun Tzu’s actual words: ‘If you know the enemy AND know yourself, you need not fear the result of a hundred battles.’ Clearly, this enemy was a lot more complicated than the decoy flanks that DataKinders had speared. Tzu and George Lucas may well have hung out over green tea.

Attack of the Clones

In usual data science settings, duplicates are often easy to identify and provide little intrinsic value. However, the game changes in the world of crowdsourced data, especially data highlighting the criticality of an issue. So to achieve victory, the team would have to understand and strike at its core: dynamic social feedback. We could assess its importance at four levels.

The first involves messages from the platform itself to indicate that a complaint has been registered and no further inputs are necessary. In their absence, citizens could well create duplicates by hitting the Submit button either accidentally (not knowing if their complaint was logged) or deliberately (hoping that repeating the complaint may lead to quicker action). This is more of a concern for web platforms than for call centres. By matching against columns involving email, phone and postal contact details and the date, time and type of the complaint, DataKind had already been able to quickly hurl out these obvious clones.
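The matching described above can be sketched roughly as follows. This is a hedged illustration rather than the team's actual script: the column names in KEY_FIELDS are hypothetical, and the whitespace and case normalization is an added assumption.

```python
# Hypothetical column names; the article only says that contact details plus
# the date, time and type of the complaint were matched.
KEY_FIELDS = ("email", "phone", "postal_address", "date", "time", "complaint_type")

def drop_exact_clones(complaints):
    """Keep the first complaint seen for each combination of key fields."""
    seen = set()
    unique = []
    for row in complaints:
        # Assumed normalization: ignore surrounding spaces and letter case.
        key = tuple(str(row.get(field, "")).strip().lower() for field in KEY_FIELDS)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up rows for illustration: the second is a double-click clone of the first.
base = {"email": "a@example.in", "phone": "111", "postal_address": "1 MG Road",
        "date": "2015-03-01", "time": "10:00", "complaint_type": "Pothole"}
rows = [base, dict(base, email="A@example.in "), dict(base, email="b@example.in")]
cleaned = drop_exact_clones(rows)
```

Only complaints that agree on every key field are dropped, so two residents reporting the same pothole separately would both survive this step.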

The second level of feedback is where the Force truly awakens: from other citizens. The ability to see that fellow residents have experienced the same concern may prevent its repetition. But this rests on two assumptions. First, that they can view already posted complaints, as is possible with IChangeMyCity. They may rally behind this shared cause by ‘upvoting’, an indicator to authorities of its increased importance.

Even if this feature does not exist, as with eGovernments, all is not lost. High priority might still be inferred from large absolute numbers of complaints. But these would provide an idea of the severity of the problem across the ward (45 potholes in Adyar) rather than one specific instance of it (that life-threatening one before the flyover). Secondly, if peeved citizens do not put in the effort of checking the roster of existing complaints, as inevitably occurred even with IChangeMyCity, then the Upvotes option alone cannot guarantee a Clone-free dataset.

The third and most obvious feedback comes from authorities via the digital platform- to indicate closure. This is provided by both partners, with IChangeMyCity also appending contact details of which official has been assigned the task.

The fourth and final level is where a citizen can verify that a complaint marked as ‘closed’ has truly been resolved. After all, accountability forms part of the foundation of democracy. In this manner, the same poorly tended-to complaint could be reopened, rather than another one being filed. This feature currently exists only with IChangeMyCity, which not only allows municipal authorities to mark a complaint as ‘closed’ (as eGovernments does), but also allows users to reopen it if unsatisfied.

IChangeMyCity’s resolution rates lie close to 50%, a figure probably reached after allowing for this reopening scenario. eGovernments, on the other hand, closed a commendable 97%. Closure times ranged from the same day (up to 13% of complaints) to an outlier of 1,043 days (almost 3 years), with the majority (56%) closed in under 3 days. Mr Srivastava emphasized that these efficiency statistics had improved dramatically in the last 2 years. But as we just explored, a possible confounding factor is that multiple duplicate complaints are being closed by engineers who have identified their Clone nature.

How to Fix It?

Thus, it was the second category, unintended duplication, which bled into the fourth. How could the DataKind team exploit the enemy’s own weakness? They decided to unsheathe their two logical light sabers: text and location. Either one in isolation didn’t necessarily pinpoint a duplicate. But in combination, they could quickly incinerate a Clone’s trooper suit.

Saber 1: WHERE was the complaint registered? On IChangeMyCity, users can log in, peer through a map of Bangalore and plant a pin on the spot where they’d like to direct the authority’s attention. Using that pin, analysts can procure exact latitude and longitude coordinates. It’s still entirely possible that different people place their pins some distance apart even when referring to the same issue. But it would seem like a safe bet that two closely located complaints might just be Clones.

eGovernments currently doesn’t use maps, but asks users for a fairly detailed, 6-level description of addresses (City, Regions, Zones, Wards, Area, Locality, Street). Such text might help direct an engineer gallivanting outdoors, but not a computer that speaks code. Attempting to translate the text addresses into geocodes, the team split the data into 10 parts and ran the Google Maps API with an R script on each one. Despite their best efforts, accuracy could not be guaranteed. Though eGovernments plans to introduce such coordinates in future work, geocoding seemed like a closed line of attack.

Saber 2: HOW was the complaint registered? The way people express themselves on a particular local issue may vary, but could feature some words in common. However, with the eGovernments system, pre-loaded tags from the website were automatically attached to complaints. Result? Nearly 40,000 entries demanding ‘NECESSARY ACTION’ (in capitals, no less) with only minor differences. Others exist, but simply restate the category of the complaint (‘Removal of Garbage’). With so little variability and no hidden clues, this strategy failed too.

However, for IChangeMyCity, citizens are free to fill complaint titles and descriptions as they please. So the DataKind Team broke the text of both the complaint’s title and description into sentences and then into words. Then they ran an unsupervised learning algorithm, which helped generate the Jaccard Index- a measure of how ‘close’ two complaints were in terms of statistical similarity.
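To make the Jaccard Index concrete: it is the number of words two texts share divided by the number of distinct words across both. The snippet below is only a minimal sketch of that overlap calculation. The tokenizer and function names are illustrative assumptions, and it does not attempt to reproduce the team's full pipeline.

```python
import re

def tokens(text):
    """Lower-case a text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(text_a, text_b):
    """Shared words divided by all distinct words across both texts."""
    a, b = tokens(text_a), tokens(text_b)
    if not a and not b:
        return 0.0  # assumed convention for two empty texts
    return len(a & b) / len(a | b)

# Two made-up complaint titles: 2 shared words out of 5 distinct words.
score = jaccard("Street light not working", "street light broken")
```

A score of 1 means the two complaints use exactly the same words, and 0 means they share none.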

But checking this ‘distance’ for N complaints against each other would require on the order of N*N comparisons. That would take far too long for a dataset of this size. To assist with this more abstract sense of ‘distance’, the team decided to turn to the more intuitive geographical meaning of the term: the clearly listed geocode saber we mentioned above.

The team decided that any two complaints within 250m of each other on a map could be considered as potential duplicates, while the rest could be ignored. Plugging these codes into the MongoDB geospatial index, Samarth ingeniously reduced the computation time for this process from 2 hours to 10 minutes. He also later developed a REST API that could be queried to detect the 10 nearest complaints. Going forward, the team hopes to set a threshold of such ‘similarity’ beyond which a new entry could automatically be flagged as a duplicate, much like answered programming queries on Stack Overflow.
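The idea of comparing only nearby complaints can be sketched without a database. The snippet below is an assumption-laden stand-in: it swaps the MongoDB geospatial index for a simple in-memory grid plus the haversine formula, and the cell size, function names and sample coordinates are all hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000
RADIUS_M = 250    # threshold taken from the article
CELL_DEG = 0.003  # assumed grid cell; wider than 250 m below roughly 40 degrees latitude

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearby_pairs(points, radius_m=RADIUS_M):
    """points: list of (complaint_id, lat, lon). Returns id pairs within radius_m."""
    grid = defaultdict(list)
    for idx, (_, lat, lon) in enumerate(points):
        grid[(math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))].append(idx)
    pairs = []
    for (cx, cy), members in grid.items():
        # Only the 3x3 block of cells around a complaint can hold a close neighbour.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in members:
                    for j in grid.get((cx + dx, cy + dy), ()):
                        if i < j and haversine_m(*points[i][1:], *points[j][1:]) <= radius_m:
                            pairs.append((points[i][0], points[j][0]))
    return pairs

# Made-up pins in central Bangalore: A and B are about 60 m apart, C is about 1 km away.
pins = [("A", 12.9716, 77.5946), ("B", 12.9720, 77.5950), ("C", 12.9800, 77.6000)]
candidates = nearby_pairs(pins)
```

The text comparison then needs to run only on the candidate pairs this returns, rather than on every possible pair of complaints.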

Onward to De-Duplication Success

At first glance, it may seem like the Attack of the Clones had stamped defeat over the eGovernments project, while IChangeMyCity had dodged the bullet. But let’s not jump to conclusions. The importance of this first battle is relative. Since Janaagraha is focused on closure of a single complaint, it makes sense not to muddy the waters by repeating the same theory. eGovernments, on the other hand, is interested in the total number of complaints likely to arise, not the problems. Also, as we’ll see in the next installment, the larger numbers of complaints (including duplicates) would prove crucial in helping generate valid forecasts for the Chennai Municipal Corporation.

So at the end of this first DataJam session, what had the team discovered? On a flight that carried along Sun Tzu, 2 mayors, George Lucas and random Athenians in Business Class, we learnt the philosophical complexities of the idea of ‘duplication’, especially in the contexts of crowdsourcing and democratic processes in strained local governments.

Abhishek Pandit is a Strategy Consultant at ChaseFuture

DataKind Bangalore: Using Data to Improve Development


On the eve of the birthday of MK Gandhi- India’s founding father- two very different groups of technologists are buckled up onboard flights to the United States. One surrounds a man who has risen from poverty to the position of the country’s Prime Minister. Soon, their plane begins its descent into the sun-drenched hillsides around Silicon Valley. The second comprises a trio of young middle-class professionals who’ve applied for extended leave from their day jobs to visit New York.

Peering out at the towering skyline from their windows, they dwell on the upcoming Second Annual Summit of the movement that they helped launch globally a year ago. Despite these surface differences, the two groups find their eyes glazing over the same dreams: harnessing the power of technology and internet connectivity to build a better, brighter India.

Who are they? The first, as you must have guessed, is the retinue of Narendra Modi, spearheading his ambition for a Digital India. Less obvious, and the subject of this three-part series, is DataKind Bangalore, and their diverse initiatives for improved governance and accountability.

DataKind Banga-what? Mouthful alert. So let’s review that- one word at a time.

DataKind is a global nonprofit that unites pro-bono data scientists with social sector organizations to address critical humanitarian problems within a project-based framework. Since its launch in New York City in 2011, DataKind’s volunteers have undertaken a range of exciting initiatives, from scraping website data on Indonesian agricultural prices and Mozambique’s microfinance, to exploring poverty levels through satellite imagery of electric lighting in Bangladesh and roof materials in Uganda, to identifying trends in the needs of the distressed by mining their SMS text in India, the US and the UK.

This breadth of impact and depth of expertise has only been possible through a vibrant worldwide community represented at DataKind’s Chapters in Dublin, San Francisco, Singapore, the UK and Washington DC, and of course, Bangalore.

Yes, Bangalore. The city Indians would like to call the Silicon Valley of the East. And what Silicon Valley itself would like to dislike as the Outsourcing Capital of the world. Except now, Bangalore is ‘insourcing’. DataKind’s local Chapter, founded in 2014, has been harnessing the country’s top tech talent to take on its own greatest challenges.

Within just a year of operations, their tally of volunteers hit a staggering 700. So could India’s bemoaned Brain-Drain be quietly rebounding into a Brain Gain? Perhaps part of the pro-bono participants’ passion pertains to how Bangalore’s is the only Chapter situated in a developing country.


Of course, all members of DataKind’s international network confront the ‘wicked’ problems that bedevil poverty alleviation. But when you experience this wickedness first-hand, when it’s cackling in your face on a daily basis, you’re far more inclined to land an algorithmic slap on its cheek.

One of the most stinging issues, possibly one that brought Modi into power, was a lack of transparency and accountability and an almost resigned acceptance of corruption and inefficiency. And as the trio in New York soon realized at DataKind HQ, governance had unintentionally become a Chapter theme of sorts. 4 of their 6 nonprofit partners thus far had resolved to support public bodies with data-driven decision making, or at least to build societal consensus on the need for it.

As another interesting insight at the Global Summit, the Bangalore trio noted that even in developed nations like the UK, the supply of well-trained data scientists still fell short of demands from the private and public sectors. What did this portend for India?

In parallel and on the opposite coast, Modi had been pitching to several CEOs to invest in his country’s IT infrastructure. This tied in with the 17-point Digital India vision he announced 3 months ago, which concludes with ‘I dream of an India where every Netizen is an Empowered Citizen’. But as former Microsoft researcher Kentaro Toyama elaborates in his book ‘Geek Heresy’, mere provision of internet and mobile technologies, without investing first in the human capacity to handle them (and the resulting information deluge), would ring hollow. An empty promise.

Volunteers at DataKind Bangalore have been fortunate to belong to the narrow segment of digital elite equipped with the industry knowledge and cognitive capabilities to leverage these tools. And it turns out that 6 of the 17 points in Modi’s mandate could be linked to issues of Good Governance. So if there is any measure of evaluating just how truly efficacious the ICT4D mandate could prove for India, and particularly in transparency and accountability, DataKind Bangalore and its projects with local NGOs provide an exciting testing ground.

Likewise, this current Chapter theme of Governance will form the focus of this series, though future extensions may explore outstanding DataKind Bangalore projects in other areas such as education, agriculture and microfinance. The remainder of this entry outlines the workings of DataKind’s typical project cycle, and sets the stage for more detailed explorations that will follow in the coming weeks.

Given this backdrop of the non-profit and technologist landscape, how does possessing data lead to any sustainable change? The answer: it doesn’t. Not per se, at least. Then again, DataKind Bangalore isn’t a group of number crunchers alone. Think of it rather, as an innovation and strategy hub. Likewise, its leaders follow a system.

First, a rigorous scoping and outreach process helps determine which organizations hold sufficient management capacity and clearly defined data science problems for the collaboration to prove worthwhile.

Secondly, doors are opened to volunteers from not only the IT industry but a variety of fields including economics, design, journalism, anthropology, and business development. This diversity enriches the ideation process, while also providing many participants with their first on-the-job taste of programming and statistics.

Thirdly, through a defined sequence of community events, the nonprofit’s challenge is hacked and hewed much like Michelangelo sculpting David out of a block of marble. Project Accelerator Nights (evening brainstorming sessions that lead to problem formulation) and DataJams (sessions of data cleaning and exploratory analysis) then culminate in DataDives (weekend hackathons on clearly defined challenges).


If partners believe that the resulting proposition would boost social impact, a specially selected DataCorps project then fully integrates it into the host organization over a six-month period.

For every David, there’s a Goliath lurking out there somewhere. And the world we inhabit today teems not only with Big, but Giant Data. This isn’t just the statistics computed for a Non-Profit’s annual reports or the World Bank’s tables, or even decade-end census figures. Neither is it the information gleaned from large-scale randomized controlled trials on policy effectiveness.

Sure, all of that is pretty and polished. But to (grossly) twist a John Legend classic, quantitative analysts today have to learn to love data with ‘all its curves and edges, all its perfect imperfections’. And this could either pop up mercilessly in real time (through the spread of social media, mobile devices and sensors) or turn musty over years in impregnable government PDFs.

So no matter what fancy statistical technique DataKind may have planned, the first step of problem solving remains the same. All available data, whether from partners or scraped off the net, must be tamed and standardized into a format suitable for computers to perform their magic on. Once this foundation is laid, applications of data science to governance could be classified broadly into two use cases. We will explore each with a pair of DataKind Bangalore partners.

The first centres on the executive wing of public administration- specifically interface with citizens at the municipal ward level. Hell hath no fury like a Smartphone owner scorned. Naturally, public officials often feel overwhelmed and understaffed to deal with the volume and variety of their complaints. As a first remedy, duplicates must be cleared, i.e. if many citizens are lodging new entries for the same issue. These must then be allocated to the appropriate authority for resolution.

For example, ISIF 2015 award winner (and coincidentally one of DataKind Bangalore’s inaugural partners) Janaagraha has leveraged its ‘I Change My City’ online platform and mobile app to empower over 2 million Bangalore citizens to lodge over 36,000 complaints on daily hassles such as potholes, garbage left in the open, streetlights, etc.

With some practice on previous years’ data, a computer can soon begin to predict where and when complaints are likely to emerge, and calculate the probability that they will be resolved. Machine Learning, Mamma Mia! The next entry in this series will explore the mechanics of such an analysis both for the established Janaagraha initiative and for the newly commenced eGovernments Foundation project in the neighbouring metropolis of Chennai.

The second approach turns to the judiciary and public finances by visualizing data over time or in specific areas. This allows for identifying trends to take action (for public officials themselves) or demand good governance (for citizens and activists). For example, a brief mapping exercise with the Bangalore Police helped them deduce the location of organized gangs (mostly around open public spaces) and then snatch up and enchain some unassuming chain-snatchers. But more importantly, such visualization converts endless and inscrutable reams of data into a clear and visually engaging narrative. The final installments of this series will compare applications of data cleaning and visualization to two freshly minted DataKind Bangalore partnerships.

First, DAKSH and its Rule of Law project aim to throw light on another category of the overwhelmed government employee- judges at the District, State and National levels. By mapping and quantifying India’s notoriously high case pendency across courts, DAKSH aims to foster informed public debate and develop sustainable solutions.

Second, the Centre for Budget and Government Accountability from New Delhi is striving to develop a detailed data portal on Union and State budgets in India since 2005 and expose any discrepancies between funds allocated and those actually spent. With both partners, DataKind will help discipline and visualize unruly giant data for a simplified user experience that provides not only intelligible insights but also impetus for informed action.

So there we have it: common citizens in the world’s largest democracy harnessing internet technologies for improved transparency and accountability. The world has changed dramatically since Gandhi overthrew a colonial regime through the power of a clear national message and by transforming the culture of community movements. It remains to be seen whether embedding technology and data-driven decision-making within organizations can help create a similar impact on the dramatically different challenges of the present day.

Two groups who believe in this potential- Prime Minister Modi and DataKind Bangalore- may have now caught the flight back to India to achieve their mission. But now it’s time for you to fasten your seatbelts. Stay tuned as we embark on new adventures with two fascinating methodologies applied to pioneering and passionate partners in the Silicon Valley of the East. No matter how long the seed needs to take root, and whether this experiment fails or succeeds- it’s definitely a journey you don’t want to miss.

Abhishek Pandit is a Strategy Consultant at ChaseFuture