Deep Learning for Contract Analysis and GDPR Compliance

How to measure the compliance of contracts with respect to GDPR using NLP and Deep Learning?

Capgemini Invent
7 min read · Oct 29, 2019

GDPR increases regulatory risk through a higher level of sanctions (up to €20 million, or 4% of worldwide annual revenue of the prior financial year, whichever is higher); at the same time, reviewing thousands of contracts is a tremendous workload for large companies.

Having a rationalized and efficient contract review process is crucial.

Read on to discover how Deep Learning and AI can automatically analyze thousands of contracts in a breeze, tag every section to identify its GDPR sensitivity, and save tons of time!

GDPR is not behind you!

More than a year after GDPR came into effect, only 28% of firms are confident enough to claim that they comply with the regulation, and another 30% report that they are "close to" compliance.

That leaves 42% of companies which admit they are still far from closing the GDPR chapter, even though 78% of them claimed a year earlier that they would be ready by now¹.

In June 2019, only 28% of firms complied with GDPR, while 78% of them had thought they would be compliant by this time

Post outline

  • The 4 challenges of a contract review
  • Intelligent Contract Analysis for Regulation (ICARe)
  • Deep Learning for Clause Classification
  • Conclusion

The 4 challenges of a contract review

Here are the 4 main challenges faced during a contract review process:

  1. Identifying the scope of contracts. The first challenge is to determine which contracts are impacted by the regulation.
  2. Prioritization. Which contracts should we look at first? Now that we have collected the right set of documents, we need to figure out how to rank contracts against one another. There are multiple options here: we could use the compliance score of the document, the criticality of the personal data exchanged, the amounts or duration involved in the contract, etc. (a toy scoring sketch follows this list).
  3. Recommending the right modifications. GDPR expertise is scarce. It is one thing to identify the contracts that need to be updated; it is another to already have insight into which modifications to make.
  4. Carrying out continuous monitoring of the contract base. New contracts come in every day. Following up throughout the document lifecycle, from the negotiation phase until the end of the relationship, allows compliance gaps to be detected as soon as possible, while keeping track of past modifications and validation status.
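As an illustration of the prioritization point, here is a minimal sketch of one possible ranking rule: a weighted score over the criteria listed above. The weights, field names and contracts are illustrative assumptions, not the scoring actually used in ICARe.

```python
# Toy prioritization: rank contracts by a weighted combination of criteria.
contracts = [
    {"name": "Supplier A DPA", "compliance_score": 0.45, "data_criticality": 0.9, "amount_meur": 2.0},
    {"name": "Supplier B MSA", "compliance_score": 0.80, "data_criticality": 0.3, "amount_meur": 10.0},
]

def priority(c):
    # Low compliance, critical personal data and high amounts come first.
    return (0.5 * (1 - c["compliance_score"])
            + 0.3 * c["data_criticality"]
            + 0.2 * min(c["amount_meur"] / 10, 1))

for c in sorted(contracts, key=priority, reverse=True):
    print(f'{c["name"]}: priority {priority(c):.2f}')
```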

ICARe

At Capgemini Invent we have built a solution named Intelligent Contract Analysis for Regulation (ICARe) that specifically addresses all four of those challenges.

Contract review using ICARe: detecting relevant content, classifying clauses and suggesting modifications

We are going to take you step by step through the main components of the ICARe pipeline described below, from the preprocessing steps to the Deep Learning classifier of legal clauses.

ICARe pipeline

Document collection

We gathered thousands of contracts, among them: (i) internal Capgemini contracts, (ii) GDPR addenda to the above contracts and Data Processing Agreements (DPA), (iii) external contracts or templates scraped from the web.

We also collected other legal documents completely unrelated to data privacy, in order to train our model to distinguish GDPR-relevant content from other legal topics (cf. the relevance classifier part) in a contract base. As the vocabulary distribution of legal documents can be quite different from other content such as news articles or books, one should take care to collect data from the same domain: we don't want our model to overfit to the legal jargon of specific contracts!

Optical Character Recognition (OCR)

The vast majority of contracts are unfortunately not digitally signed yet, so we have to deal with scanned documents, usually saved as PDF files. Scanned documents can get messy, but most high-quality OCR technologies now do a great job at extracting text, even from misaligned or skewed pages.
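As a minimal sketch of this step, the snippet below renders each page of a scanned PDF and runs it through an OCR engine. It assumes the open-source pdf2image and pytesseract packages (not named in the post) as stand-ins; the file path is hypothetical.

```python
# OCR sketch: scanned PDF -> one text string per page.
from pdf2image import convert_from_path   # renders PDF pages as images (requires poppler)
import pytesseract                         # Python wrapper around the Tesseract OCR engine

def ocr_contract(pdf_path: str, lang: str = "eng+fra") -> list[str]:
    """Return the extracted text of each page of a scanned contract."""
    pages = convert_from_path(pdf_path, dpi=300)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]

pages_text = ocr_contract("contracts/master_agreement.pdf")   # hypothetical path
print(f"{len(pages_text)} pages extracted")
```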

Language detection

We also have to consider multi-language documents: in large master agreements of hundreds of pages, we may find relevant information expressed in annexes written in a language other than the one used in the rest of the document. In this particular setup, it is wise not to assume that the whole document fits in one language!
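A minimal sketch of per-page language detection, assuming the langdetect package (an assumption; any language-identification model would do). The sample pages are illustrative; in practice the input would be the OCR output from the previous step.

```python
# Detect the language of each extracted page so it can be routed accordingly.
from langdetect import detect

pages_text = [
    "The processor shall process personal data only on documented instructions.",
    "Le sous-traitant ne traite les données personnelles que sur instruction documentée.",
]
print([detect(text) for text in pages_text if text.strip()])   # expected: ['en', 'fr']
```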

The relevance classifier identified 6 pages to be reviewed

Relevance classification, or how to de-noise a contract

Contracts are noisy! Most of the text we find in a contract brings no real information or value to the task at hand. Therefore, we want to filter out the irrelevant content, i.e. sections that are not sensitive to GDPR.

There are a couple of approaches to do so: while we could experiment with unsupervised approaches such as topic modeling, we can also train a binary text classifier on a corpus of both GDPR and non-GDPR documents to predict whether a given page or section is relevant for GDPR review.
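Here is a minimal sketch of such a binary relevance classifier, using a TF-IDF + logistic regression baseline (an assumption: the post does not specify the model). The tiny labeled corpus below stands in for the real GDPR / non-GDPR training set.

```python
# Binary relevance classifier sketch: is this section GDPR-relevant?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sections = [
    "The processor shall notify the controller of any personal data breach.",
    "Invoices are payable within thirty days of receipt.",
]
labels = [1, 0]   # 1 = GDPR-relevant, 0 = irrelevant

relevance_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
relevance_clf.fit(sections, labels)

print(relevance_clf.predict(["Data subjects may request erasure of their data."]))
```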

We can now tell exactly which contracts have sensitive content and need to be reviewed!

But we are not finished yet! Once we have got rid of the noise in our documents, we are left with the core of our contracts that is actually sensitive to the regulation. This is the set of clauses our legal experts want to take a closer look at.

Clause classification

The model classified each clause into the main themes of GDPR

The classifier should predict, for each clause, the topic(s) it covers.

Consider the clause below:

“Company X is a provider of enterprise cloud computing solutions which processes personal data upon the instruction of the data exporter Company Y in accordance with the terms of the Agreement.”

Here we want our model to predict that this clause is actually about defining the roles of each of the parties involved.

We built an ad hoc deep learning architecture that captures both the semantics of a paragraph and its surrounding context in order to categorize each clause into a set of topics. Our model is trained on more than 4,000 clauses annotated by our GDPR experts, in both French and English.

Having categorized all the clauses, we are now able to suggest clauses for topics that are not covered by the contract.

Recommendations

We can recommend examples of clauses to the reviewer for topics not found in the document

This is where we leverage collective intelligence: the annotations of our legal experts, which also served as the training set for our clause classifier. There is now a "clause base" that fulfills both purposes: classifying clauses and recommending missing ones!
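A minimal sketch of that recommendation step: compare the topics predicted in a contract against a GDPR topic checklist, then pull example clauses for the missing topics from the expert-annotated clause base. The topic names and clauses below are illustrative.

```python
# Recommend example clauses for GDPR topics the contract does not cover.
GDPR_TOPICS = {"roles of the parties", "data breach notification",
               "sub-processing", "data subject rights"}

clause_base = {
    "data breach notification": [
        "The processor shall notify the controller without undue delay "
        "after becoming aware of a personal data breach."
    ],
}

def recommend_missing_clauses(predicted_topics):
    missing = GDPR_TOPICS - set(predicted_topics)
    return {topic: clause_base.get(topic, []) for topic in missing}

print(recommend_missing_clauses({"roles of the parties", "sub-processing",
                                 "data subject rights"}))
```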

If you have no interest in knowing how the clause classifier works under the hood, you can skip this section and go directly to the conclusion.

Deep Learning for clause classification

“Document classification” vs. “sequence labeling”: why defining the right task is vital

An example of a French contract; each row is a paragraph for which we want to predict the class

This is actually a sequence labeling task. The best-known NLP applications falling into this category are Part-of-Speech tagging and Named Entity Recognition, which aim at predicting a class for each token of a sequence, each token usually being a word (or a short group of words) in a semantically coherent sequence such as a sentence. We are essentially solving the same task, but at the contract level: each contract is a sequence of paragraphs, and each paragraph is a sequence of words (terminated by a newline character).

Paragraph Embeddings with Hierarchical Attention Network (HAN)

Model architecture

To build this hierarchical representation we implemented a Hierarchical Attention Network (Yang et al., 2016[2]), using a bidirectional LSTM (BiLSTM) with an attention mechanism over word embeddings. This first neural network outputs a hierarchical latent representation of a paragraph, upweighting the words that matter most for the final classification.
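Below is a minimal PyTorch sketch of what such a word-level encoder could look like: a BiLSTM over word embeddings, with an additive attention layer that pools the hidden states into a single paragraph vector. Layer sizes and vocabulary are illustrative assumptions, not the exact ICARe architecture.

```python
# Word-level encoder sketch: word embeddings -> BiLSTM -> attention pooling.
import torch
import torch.nn as nn

class ParagraphEncoder(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=200, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Additive attention: score each word, softmax, weighted sum.
        self.attention = nn.Sequential(
            nn.Linear(2 * hidden_dim, 2 * hidden_dim),
            nn.Tanh(),
            nn.Linear(2 * hidden_dim, 1),
        )

    def forward(self, token_ids):                        # (batch, seq_len)
        h, _ = self.bilstm(self.embedding(token_ids))    # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attention(h), dim=1)
        return (weights * h).sum(dim=1)                  # (batch, 2*hidden) paragraph vector

# Example: encode a batch of 4 paragraphs of 50 (padded) tokens each.
encoder = ParagraphEncoder()
paragraph_vectors = encoder(torch.randint(1, 30000, (4, 50)))
print(paragraph_vectors.shape)   # torch.Size([4, 256])
```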

Sequence Labeling with BiLSTM + Conditional Random Field

The output of the first HAN layer ultimately feeds a second BiLSTM that captures the context surrounding each paragraph, before the sequence classification is handled by a Conditional Random Field (Huang et al., 2015[3]), which is able to model the transition probabilities between classes.
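Here is a matching sketch of that paragraph-level layer: a second BiLSTM over the sequence of paragraph vectors, followed by a CRF that decodes the most likely label sequence. It assumes the third-party pytorch-crf package for the CRF layer (an assumption; the post does not name a library), and illustrative dimensions.

```python
# Paragraph-level sketch: paragraph vectors -> BiLSTM -> CRF decoding.
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf (assumed, not named in the post)

class ClauseTagger(nn.Module):
    def __init__(self, para_dim=256, hidden_dim=128, num_labels=10):
        super().__init__()
        self.bilstm = nn.LSTM(para_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def loss(self, para_vectors, labels):    # training: negative log-likelihood
        h, _ = self.bilstm(para_vectors)
        return -self.crf(self.emissions(h), labels)

    def predict(self, para_vectors):         # inference: Viterbi decoding
        h, _ = self.bilstm(para_vectors)
        return self.crf.decode(self.emissions(h))

# One contract = a sequence of 20 paragraph vectors (from the encoder above).
tagger = ClauseTagger()
contract = torch.randn(1, 20, 256)
print(tagger.predict(contract))   # one predicted label per paragraph
```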

Conclusion

We discussed some of the AI components that you can put together in a contract review solution, which should ultimately lead to great operational gains and effort savings for your organization on the road to becoming (and staying) compliant, while you focus on your core business goals. Such an architecture is both scalable and flexible and could be applied to any other specific domain. Get in touch with the author to know more.

About the author

Khémon Beh is a Senior Data Scientist at Capgemini Invent France. A Machine Learning and Deep Learning practitioner with strong knowledge of financial services and startups, Khémon strives to identify and solve complex business cases by building AI solutions that bring real value to our clients. You can get in touch here.

References

[1] Capgemini Research Institute: Championing Data Protection and Privacy: a source of competitive advantage in the digital century

[2] Yang et al., 2016: Hierarchical Attention Networks for Document Classification

[3] Huang et al., 2015: Bidirectional LSTM-CRF Models for Sequence Tagging
