Episode II: Increasing operational excellence with analytical data quality management

Capgemini Invent
6 min read · Mar 18, 2020


Traditional data quality management is unfit to meet today’s challenges

Good data quality has become essential across industries as it enables companies to be productive and innovative while meeting privacy standards and regulatory requirements. This is especially true in banking, where fierce competition forces banks to reinvent themselves quickly while keeping up with continuously tightening regulations such as TRIM, GDPR or BCBS 239. Establishing strong data quality management (DQM) has therefore become more important for banks, a fact reflected in the emergence of dedicated organizational roles such as Chief Data Officers (CDOs) or Data Stewards.

Traditional data quality management is set up as a gate between the raw data from different systems and its users. These gates apply a set of semi-automatic rules to control data quality and flag data points that violate them. Dedicated data quality employees can then correct these exceptions where necessary and investigate the respective source systems and transformation processes to find the underlying cause of the issue. However, this reactive approach has reached the limit of its capabilities. Data itself has become more complex through new sources and sheer volume. In addition, regulation requires more granular reporting from banks, which otherwise face substantial fines. These developments create the need for a more proactive data quality management approach that not only controls the systems’ output but also analyzes the relations between systems, processes and humans.
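To make this rule-based gate concrete, here is a minimal sketch in Python with pandas. The data, column names and rules are purely illustrative, not taken from any client system; they simply show how typical completeness, uniqueness and validity checks flag exceptions for manual review.

```python
import pandas as pd

# Illustrative customer extract; columns and values are hypothetical
records = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "iban":        ["DE44500105175407324931", None, "DE13500105171234567890", "DE89370400440532013000"],
    "country":     ["DE", "FR", "FR", "XX"],
})

# Typical semi-automatic quality rules: completeness, uniqueness, validity
exceptions = pd.concat([
    records[records["iban"].isna()].assign(rule="missing IBAN"),
    records[records.duplicated("customer_id", keep=False)].assign(rule="duplicate customer_id"),
    records[~records["country"].isin({"DE", "FR", "IT", "ES"})].assign(rule="invalid country code"),
])

# Flagged data points are handed to data quality employees for correction
print(exceptions[["customer_id", "rule"]])
```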

Analytical data quality management leverages machine learning algorithms to improve data quality and reduce manual effort

The upside of the aforementioned rise in data volume is that it supports the application of machine learning (ML) algorithms for DQM. This new approach, which we call analytical data quality management (ADQM), uses these algorithms to address the shortcomings of traditional DQM and to reduce the manual effort involved.

We distinguish between two main types of data quality issues that can be detected and solved via ADQM: known data quality incidents and unknown data quality incidents. The former comprises all cases where data quality issues are already flagged by the systems in place. ML methods then analyze these issues to determine why they arose in the first place; this way, the root causes can be addressed proactively instead of merely correcting errors reactively after they have already caused downstream process failures. The latter covers data quality issues that have not been detected by the existing traditional systems. This expands the capabilities of data quality management and reduces the effort involved in identifying these issues manually. The following sections provide a detailed explanation of our methodology and illustrate the business impact on the basis of recent client projects.

With a root cause analysis, DQM changes from being reactive to proactive

Root cause analysis uses statistical models to take a broader perspective on data quality issues by filtering for significant impact factors across multiple dimensions of the data. Using interpretable supervised learning methods such as regularized regression or random forests, it investigates the causes beneath the symptoms of data quality issues. First, patterns in the symptoms are identified by finding the best-fitting separators in the data. Next, partial dependence plots and model coefficients are examined to approximate causation: ADQM does not merely look at correlated occurrences but analyzes the impacting factors, investigating the deeper underlying problem in the interplay of source systems, data transformation processes and human interaction. Finally, the critical impact factors are specified, enabling companies to take sustainable actions that proactively mitigate data quality issues before they cause downstream process interruptions or failures.
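As an illustration of this approach, the following sketch assumes scikit-learn, a binary flag dq_issue and hypothetical metadata columns such as source_system and etl_job (none of which come from an actual client setup). It trains a random forest to predict which records get flagged and then uses permutation importance and a partial dependence plot to surface candidate root causes.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# Hypothetical export: one row per record with metadata about its origin
# (source system, ETL job, entry channel) and the DQ flag (1 = flagged issue)
df = pd.read_csv("dq_incidents.csv")
X = pd.get_dummies(df[["source_system", "etl_job", "entry_channel"]])
y = df["dq_issue"]

# Interpretable supervised model: learn which factors separate flagged from clean records
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank candidate root causes by how much shuffling each factor hurts the model
importance = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, importance.importances_mean), key=lambda t: -t[1])
print(ranked[:5])

# Partial dependence shows how the issue probability changes with one factor
PartialDependenceDisplay.from_estimator(model, X, features=[X.columns[0]])
plt.show()
```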

For an international bank, we used root cause analysis to understand how visible data defects such as duplicates, missing values or inconsistencies were created in distributed database systems. By connecting multiple data sources and deploying statistical models for root cause analysis, we saved more than 50% of the time required in manual processes to eliminate fundamental DQ root causes.

Figure 1: Our analysis framework to go from symptoms to root causes

Anomaly detection models automatically uncover unknown data quality incidents based on historical data patterns

For unknown data quality incidents, anomaly detection models are the tool of choice. These models learn historical data patterns and thereby detect when a new data entry strongly deviates from the expected record. One well-performing class of methods are so-called autoencoders. An autoencoder consists of two cascaded deep neural networks: an encoder and a decoder. The encoder takes the high-dimensional historical data as input and compresses it, which forces the network to learn typical data patterns. The decoder then reconstructs the data from this compressed representation, creating a typical representation of the original input. If this reconstruction deviates strongly from the original data, the record is likely an anomaly and a potential data quality issue.
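The following minimal sketch illustrates the principle, assuming Keras/TensorFlow and a hypothetical array of already scaled numeric transaction features. It is a simplified stand-in for the deep autoencoders described above, not our production implementation.

```python
import numpy as np
from tensorflow import keras

# Hypothetical array of scaled numeric transaction features, shape (n_samples, n_features)
X = np.load("transactions.npy")
n_features = X.shape[1]

# Encoder compresses to a low-dimensional bottleneck, decoder reconstructs the input
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(n_features,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(4, activation="relu"),       # bottleneck: learned "typical" pattern
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=256, verbose=0)   # unsupervised: input == target

# Reconstruction error per record; large errors indicate atypical transactions
reconstruction = autoencoder.predict(X, verbose=0)
errors = np.mean((X - reconstruction) ** 2, axis=1)
threshold = np.quantile(errors, 0.99)               # flag the top 1% as potential anomalies
anomalies = np.where(errors > threshold)[0]
print(f"{len(anomalies)} potential data quality anomalies flagged")
```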

Figure 2: Autoencoder technology enables data reconstruction for anomaly detection

A key application opportunity for autoencoders is transaction monitoring in the financial sector. As fraud detection and anti-money laundering (AML) regulations have become more rigorous, we developed and implemented autoencoders at a global bank to flag suspicious financial transactions. Here, our algorithms enabled anomaly detection on large data streams (more than 250,000 daily transactions) without manual intervention or prior training on labeled data.

ADQM provides insights on data quality issues, automates manual processes and thereby creates tangible business value

For industries where data quality is business-essential, analytical data quality management is the logical next step on the journey to operational excellence. While root cause analysis offers a proactive investigation of quality issues, anomaly detection models improve and automate processes, yielding significant cost savings and risk reductions. To realize these benefits, Chief Data Officers and data quality departments should follow a three-step approach:

1. Ensure the company meets the technical and organizational requirements for ADQM

2. Develop prototypes to validate the business case

3. Scale the solutions and capabilities across the company

In the first step, a short status quo analysis is conducted along three dimensions: data quality processes, infrastructure and organizational capabilities. For the data quality dimension, the company identifies processes and data sets that could be analyzed with ADQM methods. Additionally, it needs to be verified that the technical infrastructure can handle the expected data volumes and that a team combining analytical capabilities and industry experience is available.

Second, the team conducts rapid prototyping with real data and validates the business case for management. For example, an A/B test can evaluate the accuracy, speed and costs of an anomaly detection model against the current processes. Finally, once the business value of ADQM is proven, the solutions and capabilities should be scaled across the company to leverage the created value.
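As a simplified illustration of such a validation, the sketch below assumes a sample of records with manually verified labels plus the flags produced by both the legacy rule engine and the new model (all file names are hypothetical), and compares the precision and recall of the two approaches.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical inputs: ground-truth labels from manual review, plus flags
# produced by the legacy rule engine and by the anomaly detection model
y_true  = np.load("verified_labels.npy")       # 1 = confirmed data quality issue
y_rules = np.load("rule_engine_flags.npy")
y_model = np.load("autoencoder_flags.npy")

for name, y_pred in [("legacy rules", y_rules), ("ADQM model", y_model)]:
    print(name,
          "precision:", round(precision_score(y_true, y_pred), 3),
          "recall:",    round(recall_score(y_true, y_pred), 3))
```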

Thanks to our hands-on experience with ADQM at clients in banking as well as other industries, Capgemini Invent can support your transition to analytical data quality management, from the status quo analysis to scaling the solutions. If you want to discuss the potential of ADQM for your business or are interested in a demonstration of our solution portfolio, feel free to reach out to Dr. Katja Tiefenbacher.

About the author:

Sebastian Dierdorf is a Senior Consultant at Capgemini Invent with a special focus on Data Science and Analytics.

Read also: #valuefromdata Episode I & #valuefromdata Episode III

Find more information on Capgemini’s service offering here

