Public sector in the age of AI: How to turn a pile of papers into analytical assets?

Capgemini Invent
8 min read · Feb 19, 2020

Paperwork and complicated hierarchies of Word documents: we often equate the public sector with inefficiency. Yet these documents can be valuable assets for natural language processing (NLP). This article explores that potential.

Beyond traditional applications, such as chatbots or interactive websites, cutting-edge data science technologies can already deliver relevant and scalable products and services to the public sector: social network analysis for online public opinion, email classification for targeted, responsive governance, facial recognition at border controls, among others.

Nevertheless, we intuitively equate the public sector with paperwork and documents, an image of inefficiency. Almost everyone complains about the bureaucracy associated with that image. However, the ever-growing field of natural language processing (NLP) may be able to turn this image around, either by reducing manual text generation or by exploiting the large amounts of text data that already exist.

The primary purpose of this article is to show how NLP, especially text analysis, can contribute to the public sector. In particular, this blog post focuses on the information gathering and text-based analytics that elevate evidence-based decision-making to the next level.

Text mining: Beyond copy-paste

Imagine that you are reporting to political leaders on the latest news regarding social welfare reform, a controversy that requires a quick mapping of the media landscape and the right reaction to public opinion. Conventionally, snippets of news are collected by reading "important" newspapers or clicking through "well-known" websites and social media accounts. In essence, it is still a copy-paste procedure based on digital content. However, such manual processes notoriously lack comprehensiveness, and they are slow and error-prone.

It is not enough to turn a government of paper into a government of jammed Word documents. Text-mining techniques aim to solve the problem of information gathering and storage in a transparent and systematic way. Below, we briefly show some examples and their uses.

Amassing web-based content

Web scraping or crawling, aided by rule-based NLP techniques, is an established way to speed up and scale information gathering on the Internet. In essence, we can exploit the structure of the HTML or use the application programming interface (API) provided by the hosts to automatically download the information in the desired format.
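As a minimal sketch of the idea, the snippet below downloads a hypothetical archive page and pulls titles and dates out of its HTML table rows with the requests and BeautifulSoup libraries; the URL and the table layout are placeholders, not a real government system.

```python
# Minimal web-scraping sketch; the URL and HTML layout are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

ARCHIVE_URL = "https://example.org/parliamentary-questions"  # placeholder URL

response = requests.get(ARCHIVE_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

records = []
for row in soup.select("table tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:  # header rows contain <th> instead of <td> and are skipped here
        records.append({"title": cells[0], "date": cells[1] if len(cells) > 1 else None})

print(f"Scraped {len(records)} records")
```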

I was once involved in a research project[1] for which I was able to automatically scrape parliamentary questions from the German Bundestag's archives from the 1970s onwards and store the documents in a structured dataset within days. The dataset contained precise information including titles, content, keywords, authors, dates, and endorsing parties. Basic knowledge of regular expressions (RegEx) comes in handy when separating the questioning and answering actors from a single string.

Example documents and scraped results from the document and information system of the German Bundestag, from own research (http://dipbt.bundestag.de/dip21.web/bt)
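To illustrate the RegEx point, here is a small sketch that splits the questioning and answering actors out of a single metadata string; the string format is invented for the example and is not the exact Bundestag layout.

```python
import re

# Invented metadata string; the real archive format differs in detail.
meta = "Kleine Anfrage der Fraktion DIE LINKE, beantwortet durch die Bundesregierung"

pattern = re.compile(r"der Fraktion (?P<questioner>.+?), beantwortet durch (?P<answerer>.+)$")
match = pattern.search(meta)
if match:
    print(match.group("questioner"))  # -> DIE LINKE
    print(match.group("answerer"))    # -> die Bundesregierung
```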

For the public sector, this form of information collection provides a more comprehensive (i.e., less biased) overview of the media landscape. Moreover, the automatic pipeline offers consistent, transparent, and scalable insights for further analysis. It is also cheaper, as it reduces the cost associated with manual work.

Converting handwritten documents and speeches

Existing optical character recognition (OCR) techniques can easily detect and transcribe handwritten text or scanned documents into machine-readable characters. Various machine learning and deep learning algorithms achieve high accuracy. These techniques are particularly interesting for government agencies that want to automate the digital storage of letters or handwritten documents, such as tax declarations or work permits.
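As a minimal example, the open-source Tesseract engine can be called from Python through the pytesseract wrapper; the file name below is a placeholder, and a local Tesseract installation is assumed.

```python
# OCR sketch with pytesseract (requires a local Tesseract installation).
from PIL import Image
import pytesseract

page = Image.open("scanned_letter.png")                # placeholder file name
text = pytesseract.image_to_string(page, lang="deu")   # "deu" for German-language documents
print(text)
```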

Speech recognition systems based on neural networks can now quickly facilitate the transcription of, for instance, politicians' speeches into a readable format. Both commercial and open-source solutions are well developed.

Note: Example of handwritten numbers that could be successfully transformed into machine-readable digits. Visualization by Joseph Steppan, CC BY-SA 4.0 / LeCun, Cortes and Burges, MNIST Dataset (http://yann.lecun.com/exdb/mnist/)
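For speech transcription, a minimal sketch with the open-source SpeechRecognition package might look as follows; the audio file name is a placeholder, and the free Google Web Speech API used here is only suitable for small tests, not production workloads.

```python
# Speech-to-text sketch with the SpeechRecognition package.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:  # placeholder WAV file
    audio = recognizer.record(source)

# Free web API, fine for experiments; production systems would use a dedicated engine.
transcript = recognizer.recognize_google(audio, language="de-DE")
print(transcript)
```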

Deriving sharp insights from unstructured texts: Analytical approaches

As exciting as textual data collection sounds, readers may well ask, “So what?” This is a pressing question that raises concerns about the quality of the insights from the vast amount of (unstructured) data. We present three relevant use cases that current techniques can already solve: automatic classification of citizens’ complaints and requests, analysis of open-ended survey questions, and extraction of persons and entities related to a political issue.

Classification of citizens’ complaints

As digital infrastructure makes sending emails or online complaints to administrative agencies more convenient than ever, the expectation of a quick and accurate response increases as well. Thus, a rapid classification of the different subjects of emails or complaints is desirable. Traditional methods such as self-labeling or keyword-based categorization are often severely inaccurate and do not scale well.

Supervised machine learning for classification offers an answer to that. For instance, we build a training dataset that includes complaint content and corresponding labels, e.g., how to categorize them regarding political issues or administrative processes based on human knowledge. Then, we use statistical measures to guide the machine to learn the typical text features of a label. Features can refer to typical words, word combinations, or sometimes even characters.
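A minimal sketch of such a supervised classifier, built with scikit-learn on a tiny invented training set (the texts, labels, and categories are purely illustrative):

```python
# Supervised complaint classification sketch with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set; a real project needs thousands of labeled complaints.
texts = [
    "The streetlight on my road has been broken for weeks",
    "My parental benefit application has not been processed",
    "Potholes on the main street damage cars",
    "I have not received a reply about my child allowance",
]
labels = ["infrastructure", "family_policy", "infrastructure", "family_policy"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # words and word pairs as features
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["When will the road construction finally be finished?"]))
```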

Also, sentiment analysis can exploit the intrinsic "policy emotions" beneath the content and further distinguish the degrees of negativity of complaints. The combination of these two techniques provides government agencies with a quick and accurate way to guide messages to the right department, reducing the workload for human employees and potentially increasing citizens' satisfaction. The following example is a product of our team within Capgemini that classifies emails and scores their sentiment for the private sector; a similar logic can apply to the public sector as well.

Note: Email Sentiment Analysis, Asset of Capgemini Invent, AI Garage. Developed by Srithar Jeyaraman, Sanchit Malhotra, Kaustav Chattopadhyay, Arpit Rawal, Dheeraj Tingloo, Ravi Mrityunjay, and Deepak Kumar (2019).
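For the sentiment side, a minimal sketch with NLTK's rule-based VADER analyzer (English-only, so it stands in here merely as an illustration of scoring negativity; it is not the asset shown above):

```python
# Sentiment-scoring sketch with NLTK's rule-based VADER analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

analyzer = SentimentIntensityAnalyzer()
complaint = "I am extremely frustrated: nobody has answered my request for three months."
print(analyzer.polarity_scores(complaint))  # returns neg/neu/pos/compound scores
```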

Open-ended survey questions for measuring policy preferences

Surveys have been widely used in evidence-based governance. In Germany, chancellors have commissioned surveys since the 1950s. However, traditional survey questionnaire design has faced numerous revisionist critiques, and measurement through a multiple-choice format does not always reflect citizens' political preferences. Open-ended questions in turn provide more abundant information but are perceived as hard to analyze systematically.

Unsupervised learning through topic modeling, a clustering approach, can mitigate these concerns. In general, the algorithm looks for words that co-occur throughout multiple documents. If, for instance, "parental leave" and "child" were spotted jointly among thousands of answers, the machine would consider that they belong to the same latent cluster. Through iterative training, we can easily summarize which aspects of specific policies are most frequently raised and even calculate the proportion of those concerns.

The following example shows one application of topic modeling. Based on a sample of online complaints of Chinese citizens, the machine learns to cluster words in a way that humans can also understand. For instance, among those complaints labeled “medicine,” a majority of texts contain the typical related features.

Note: Clustering of typical words about a topic based on Chinese citizens’ online complaints. Own research. Visualization by Qixuan Yang. https://gitlab.com/qxyang/semi-supervised-tm
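For readers who want to experiment, a minimal topic-modeling sketch with scikit-learn's LDA implementation follows; the open-ended answers are invented placeholders, and real survey data would be far larger and messier.

```python
# Topic-modeling sketch with Latent Dirichlet Allocation (LDA) in scikit-learn.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented open-ended survey answers, purely for illustration.
answers = [
    "parental leave should be longer and child care cheaper",
    "we need more child care places and better parental leave rules",
    "the carbon tax makes commuting too expensive",
    "fuel prices and the carbon tax hurt rural drivers",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(answers)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Print the most characteristic words per latent topic.
words = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {topic_id}: {', '.join(top)}")
```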

Determining relevant political actors and their positions

The last example involves extracting relevant persons and entities around a policy. Imagine that after collecting pertinent policy papers and news about carbon taxation, we want to provide an overview of who's saying what. Here, named entity recognition (NER), another widely used technique in NLP, can be used to extract the relevant parties to a policy discussion. In this case, NER exploits existing or local dictionaries of names of persons and organizations and predicts them in a rule-based fashion or through a statistical procedure.
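A minimal NER sketch with spaCy's pretrained statistical model; the small English model en_core_web_sm is assumed to be downloaded, and the sentence is invented.

```python
# Named entity recognition sketch with spaCy.
# One-time setup: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Finance Minister Jane Doe criticised the carbon tax proposal of the Green Party in Berlin.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. persons, organizations, and places
```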

NER can be further combined with other data science applications, such as network analysis or any of the classification or clustering algorithms described above. Through joint efforts, we are able not only to detect the relevant parties but also their positions and relations with each other.

In a nutshell, NLP makes it possible to analyze a large amount of unstructured text data systematically. The aforementioned techniques are not only interesting in the example scenarios but also generalizable to many other contexts. Classification and clustering can be used to analyze social media posts and discussions. NER, when fused with network analysis and clustering algorithms, can be used to trace disinformation campaigns: how fake news or false information spreads in terms of persons, accounts, organizations, and content.

Enabling text-based insights: Infrastructure matters

As readers may have noticed, all the use cases presented above are either established or still developing. Perhaps the more practical question is why they are not yet widely applied. Besides the path-dependence issue within the public sector and the lack of awareness of big data before the wave of artificial intelligence, we argue that the absence of modern data collection and analytical infrastructure restricts the full potential of NLP techniques.

A complete NLP pipeline is needed to realize its full potential.

For one thing, the OCR and various data mining techniques require sufficient storage and adequate servers to digitize the documents accordingly. Without reliable data storage and preprocessing infrastructure, there is no way to employ cutting-edge analytical algorithms. As for analytics, the public sector needs to use cloud-based solutions or high-performance computing (HPC) to derive text-based insights effectively. Without an infrastructure for big data, there will not be any valuable evidence to derive.

In addition, I would like to mention the role of human intervention in the NLP pipeline. Applying quantitative methods does not make qualitative work obsolete. Instead, for all supervised or rule-based training of the models, human labeling is essential for success. Also, qualitative knowledge about specific administrative or political issues can inform NLP solution developers about model-relevant issues so that the pipeline can be tailored to the needs of various tasks. For instance, particular word combinations such as "parental leave" are important features for some political fields; splitting them into "parental" and "leave" is not informative.
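As a minimal sketch of that last point, a vectorizer can be told to keep word pairs such as "parental leave" as features instead of splitting them; the snippet uses scikit-learn and two invented sentences.

```python
# Keeping informative word pairs such as "parental leave" as features.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["parents demand longer parental leave", "leave the parental issues aside"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams plus bigrams
vectorizer.fit(docs)

# "parental leave" survives as its own feature alongside the single words.
print([f for f in vectorizer.get_feature_names_out() if "parental" in f])
```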

To sum up, NLP techniques provide the public sector with a variety of new opportunities to analyze relevant data in unstructured text form. There are already many established use cases that use cutting-edge algorithms to amass large amounts of data and deliver politically and administratively relevant insights in an automated fashion. Nevertheless, an infrastructural pipeline is essential for the success of the analytic engine, as are valuable qualitative insights and careful labeling practices.

About the author:

Qixuan Yang, Senior Data Scientist at the AI Garage of Capgemini Invent in Germany, is interested in the fusion of social sciences and data science. In particular, he focuses on text analysis in the realm of natural language processing. With a strong background in interdisciplinary research, Qixuan aims to provide neat business solutions backed by statistical learning approaches, as well as evidence-based consulting for better decision-making. You can get in touch here.

References:
[1] CAP, Comparative Agendas Project, https://www.comparativeagendas.net/germany.

