Episode III: Data Catalog — Creating transparency on available data

Capgemini Invent
Apr 3, 2020 · 7 min read


Scaling AI — The hard truth

Artificial Intelligence (AI) is everywhere: it is hard to attend a business conference without being confronted with at least one keynote covering AI and its implications. This is not surprising, given the significant competitive advantages that an organization can achieve through the application of AI use cases.

Capgemini Research Institute, State of AI survey, N=993 companies that are implementing AI, June 2017

However, many companies have difficulties implementing AI use cases throughout their organization. A major cause is a lack of transparency about available data, which hinders or even prevents its efficient use. The unfortunate truth is that many organizations:

1. have a very limited overview of available systems (e.g. business departments operating their own siloed systems that neither IT nor other departments are aware of) and do not automatically track system changes,

2. hold inconsistent information about available data, its meaning and its source systems, as well as about data access and usage, and

3. do not document data flows between systems, their frequency and the type of link (push vs. pull, etc.).

A premium OEM had identified data as a strategic asset for the organization and a key enabler for the future success of its core business. However, after the first promising AI use cases, many of the following use cases underdelivered and took longer than anticipated.

This situation specifically hinders the roles within an organization that are needed to implement AI use cases. The use case owner is the functional source of any AI use case. This role needs easy answers to questions such as “What data is available?”, “Who owns the data?” and “What does the data mean?” to assess the feasibility of use cases.

The data scientist is responsible for extracting knowledge and insights from structured and unstructured data [https://www.capgemini.com/de-de/2018/12/data-science-journey-data-management/]. To do so, this role requires answers to the questions “What exactly does the data mean?” and “How can I quickly get access to the identified data?”

Lastly, the data steward is not associated with any particular AI use case but is rather a vital support function that uses an organization’s data governance processes to ensure the fitness of data elements [https://www.capgemini.com/de-de/2018/05/operating-models-in-a-digital-world-folge-8-information-data-management/]. This role’s most pressing questions are “How can I actively manage data?” and “How can I improve data quality?”

Solution — The Data Catalogue

To combat this lack of transparency and enable the implementation of AI use cases, organizations should consider a Data Catalogue initiative culminating in the establishment of a dedicated tool. A Data Catalogue stores all relevant data about the available data (i.e. metadata) and gives an overview of available systems, the data they store and the existing data flows within an organization’s system landscape.
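To make this concrete, the following minimal sketch shows how such catalogue metadata (systems, data assets and data flows) could be represented. It is purely illustrative: the class and field names are assumptions for this example, not the data model of any specific Data Catalogue product.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative metadata model; names are invented for this sketch.

@dataclass
class DataAsset:
    name: str           # functional name, e.g. "customer_master"
    description: str    # business meaning of the asset
    owner: str          # functional owner (data steward)

@dataclass
class System:
    name: str           # e.g. "CRM", "ERP"
    technical_owner: str
    assets: List[DataAsset] = field(default_factory=list)

@dataclass
class DataFlow:
    source: str         # source system name
    target: str         # target system name
    frequency: str      # e.g. "daily"
    link_type: str      # "push" or "pull"

# A tiny catalogue: two systems and one documented data flow between them.
crm = System("CRM", "it.crm@example.com",
             [DataAsset("customer_master", "One record per customer", "sales.ops")])
dwh = System("DWH", "it.dwh@example.com")
flows = [DataFlow(source="CRM", target="DWH", frequency="daily", link_type="push")]

for flow in flows:
    print(f"{flow.source} -> {flow.target} ({flow.frequency}, {flow.link_type})")
```

Even such a small model already answers the questions raised above: which systems exist, what data they hold, who owns it, and how data moves between systems.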

The investment in a Data Catalogue already pays off after a limited number of implemented AI use cases and subsequently realizes ever larger benefits as the number of AI use cases grows: a return on investment of more than 6 to 1 (510%) over three years [Quantifying the Business Value of the Collibra Data Governance and Catalogue Platform, IDC White Paper (2018)], through:

1. an increased rate of realization of AI use cases,

2. the implementation of additional AI use cases, and

3. an increased flexibility towards new or changing business requirements.

ROI of AI depending on a Data Catalogue

Key Features of a Data Catalogue — Consume

The main features of a Data Catalogue support the actors in finding the right data for their work and accessing it. In particular, the use case owner and the data scientist rely on this aspect of a Data Catalogue to answer their questions.

A key feature of a Data Catalogue is the business glossary. In contrast to a data dictionary, which stores a system’s technical metadata, the business glossary is a framework to create, nurture and promote a common vocabulary for an organization. For data to be meaningful, people across the organization need to share a common understanding of its definition, lineage and validity.

In addition, the relationships between technical and functional metadata, as well as the data lineage from source systems to target systems, can be visualized within a Data Catalogue. This enables easy access to information such as: “Which data comes from which source?”, “Which connections exist to other data?” or “Was the data manipulated?”
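As a rough illustration of the lineage idea (a simplified sketch with invented element names, independent of any particular catalogue tool), answering “Which data comes from which source?” amounts to walking a small graph of “derived from” relationships:

```python
# Illustrative lineage graph: each data element maps to the elements it is derived from.
lineage = {
    "report.revenue_by_region": ["dwh.sales_fact", "dwh.region_dim"],
    "dwh.sales_fact": ["crm.orders"],
    "dwh.region_dim": ["erp.org_units"],
}

def trace_sources(element, graph):
    """Return all ultimate upstream sources of a data element."""
    upstream = graph.get(element, [])
    sources = set()
    for parent in upstream:
        deeper = trace_sources(parent, graph)
        sources.update(deeper if deeper else {parent})
    return sources

print(trace_sources("report.revenue_by_region", lineage))
# -> {'crm.orders', 'erp.org_units'}
```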

The ILoveData AG has successfully implemented a Data Catalogue with information about key systems, data assets and data flows. Its users subsequently reported finding more data, much faster, while easily identifying the respective functional and technical owners.

Key Features of a Data Catalogue — Collaborate

A Data Catalogue also supports collaboration, allowing different roles to work together on the maintenance and consumption of data assets. Such features are relevant for all roles deeply engaged in the implementation of AI use cases, i.e. the data scientist and the data steward.

Bundling functionally connected data into dedicated entities, regardless of their source systems, and virtualizing them on a metadata level is a powerful capability of a Data Catalogue. As AI use cases usually demand functionally grouped data, this saves data scientists’ time in particular and may give them access to data they would not otherwise have found.
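As a simple illustration (a sketch with invented system and field names, not the mechanism of any specific tool), such a functionally grouped “virtual” entity can be thought of as a mapping from business attributes to the technical fields that hold them in different source systems:

```python
# Illustrative virtual entity: business attributes of a "customer" grouped on a
# metadata level, regardless of which source system physically holds them.
customer_entity = {
    "customer_name":   {"system": "CRM", "table": "accounts",   "column": "name"},
    "credit_limit":    {"system": "ERP", "table": "debtors",    "column": "credit_lim"},
    "last_order_date": {"system": "DWH", "table": "sales_fact", "column": "order_dt"},
}

# A data scientist can see at a glance which systems hold the data needed
# for a functionally complete "customer" view.
systems_needed = {attr["system"] for attr in customer_entity.values()}
print(sorted(systems_needed))  # ['CRM', 'DWH', 'ERP']
```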

Furthermore, a Data Catalogue offers in-tool data access management, where requests, approvals and documentation are handled in a compliant and auditable manner. When access to data is requested, all affected data stewards are automatically notified and can approve, reject and comment within the tool. After approval, the notification of system owners with system-specific data requests is also handled within the Data Catalogue.
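A heavily simplified sketch of such an access workflow might look like the following. It is purely illustrative: the roles, states and notifications are assumptions for this example, not the behaviour of a specific product, which would implement this with its own workflow engine and user interface.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative access-request workflow; names and logic are invented for this sketch.

@dataclass
class AccessRequest:
    requester: str
    data_asset: str
    stewards: List[str]            # data stewards responsible for the asset
    system_owner: str              # notified only after approval
    approvals: dict = field(default_factory=dict)
    comments: List[str] = field(default_factory=list)

    def submit(self):
        # All affected data stewards are notified when the request is raised.
        for steward in self.stewards:
            print(f"notify {steward}: {self.requester} requests access to {self.data_asset}")

    def decide(self, steward, approved, comment=""):
        self.approvals[steward] = approved
        if comment:
            self.comments.append(f"{steward}: {comment}")
        # Once every steward has approved, the system owner receives the request.
        if len(self.approvals) == len(self.stewards) and all(self.approvals.values()):
            print(f"notify {self.system_owner}: grant {self.requester} access to {self.data_asset}")

request = AccessRequest("data.scientist", "customer_master",
                        stewards=["steward.sales"], system_owner="it.crm")
request.submit()
request.decide("steward.sales", approved=True, comment="OK for the churn use case")
```

The documentation of each request, decision and comment is what makes the process auditable.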

The ILoveData AG has successfully implemented a Data Catalogue with information about key systems, data assets and data flows. Its data stewards are very positive about the tool, as it pools access management for all data they are responsible for. Furthermore, its data scientists find relevant data more easily through the functional grouping and gain access to it more quickly.

Key Features of a Data Catalogue — Connect & Curate, Crawl

Lastly, a Data Catalogue simplifies the import and maintenance of metadata and supports the relevant roles in cataloguing, enriching and quality-assuring data. These features are especially relevant for the data steward, who wants to actively manage available data through the Data Catalogue.

The maintenance of data assets is crucial for businesses to become digital and data-driven, and its biggest success factor is enabled data stewards. A Data Catalogue enables data stewards to proactively manage data rules, reactively monitor data quality and establish organization-specific workflows with other stakeholders that guarantee the maintenance as well as the enhancement of metadata.
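As an illustration of what such data rules could look like in practice (a minimal sketch with invented rule names and sample records, independent of any specific catalogue tool), a steward-defined rule is essentially a check that can be run against a data asset and monitored over time:

```python
# Illustrative data quality rules a data steward might register for an asset.
records = [
    {"customer_id": "C001", "email": "a@example.com", "country": "DE"},
    {"customer_id": "C002", "email": None,            "country": "FR"},
    {"customer_id": None,   "email": "c@example.com", "country": "XX"},
]

rules = {
    "customer_id is mandatory": lambda r: r["customer_id"] is not None,
    "email is filled":          lambda r: r["email"] is not None,
    "country code is known":    lambda r: r["country"] in {"DE", "FR", "IT"},
}

# Reactive monitoring: report the share of records passing each rule.
for name, check in rules.items():
    passed = sum(1 for r in records if check(r))
    print(f"{name}: {passed}/{len(records)} records pass")
```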

The ILoveData AG has successfully implemented a Data Catalogue with information about key systems, data assets and data flows. As a result, data governance has also improved, as data stewards are now enabled to actively manage their data.

Our Approach — The Capgemini Way

Establishing a Data Catalogue around these three aspects should be done in a use-case-driven and agile manner to maximize the value that the Data Catalogue can offer at any time. We recommend an approach in four phases.

Capgemini Approach towards establishing a Data Catalogue

A short “Discover” phase scopes the Data Catalogue and starts the ideation of possible first use cases. The “Devise” phase is dedicated to the selection of a tool vendor and the preparation of use cases. Subsequently, the “Deploy” phase is used to launch the proof of concept with a chosen number of tool vendors and to establish the necessary governance around the Data Catalogue. Finally, the Data Catalogue is scaled into production in a continuous process by implementing additional use cases incrementally.

The aforementioned premium OEM eventually realized that its failure to enable analytics had led to the suboptimal situation described above. Subsequently, a data governance initiative (including a Data Catalogue) was launched. Today, more AI use cases (especially with functional use case owners) are started and use the full spectrum of existing data; they are executed in a shorter time period and generally deliver better results.

About the author:

Paul Schmidt is a Management Consultant at Capgemini Invent with a special focus on data governance and data analytics. He supports German and European companies on their way to successfully implementing a broader variety of AI use cases. Find Paul Schmidt on LinkedIn.

Read also: #valuefromdata: Episode I & #valuefromdata Episode II

Would you like to discuss the potential of a Data Catalogue for your business, or are you interested in a demonstration of our solution portfolio? Please feel free to reach out to Dr. Katja Tiefenbacher.


Capgemini Invent

Capgemini Invent is the digital innovation, consulting and transformation brand of the Capgemini Group. #designingthenext