Episode VI: Use Case Management — Are agile principles applicable to data science projects?

Capgemini Invent
11 min read · Oct 1, 2020

In the first episode of our blog series, we detailed how an AI Service Unit can build the baseline for scaling AI to the everyday life of your employees. The AI Service Unit is a business unit that bundles organizational, processual and technological capabilities to offer the efficient and user-centric development of AI initiatives as an internal service provider to your organization.

In this episode, we elaborate on how to set up processes within an AI Service Unit or any other Analytics-as-a-Service organization that delivers data science projects. In the first episode of our blog (Episode 1: AI Service Unit — building the baseline for scaling AI: Link), we described use case management in the context of data science as covering all activities and decision gates “from use case definition by business units, granting data access and prototyping to data-driven decision-making”. In our projects, we have seen that the data science team often iterates without continuous feedback from business users, in contrast to software development, where agile principles are widely adopted and feedback is thus assured. Software development and data science projects obviously have different characters. How can we, accounting for these differences, apply an agile way of working to data science projects to gain productivity?

In this blog post, we will uncover the incompatibilities between agile project management and data science projects and introduce aspects of our agile Use Case Management for data science that remedy them.

Figure 1: The dimensions of the AI Service Unit

Introducing agile principles

Agile ways of working originated in software engineering and are a common approach to developing, testing and deploying a software deliverable. Adopting them requires the development team to follow defined principles (Principles behind the Agile Manifesto: Link) such as:

· Working increments of software are delivered early and continuously

· Collaboration of business employees and developers in short time intervals

· Simplicity — the art of maximizing the amount of work not done — is essential

· The value of the working increment is the primary measure of success

The most common framework based on agile principles is Scrum. In Scrum, the entire body of outstanding work, the product backlog, is split into work items that the team can complete within one delivery interval, a sprint of typically 2–4 weeks. Prior to every sprint, the team conducts a sprint planning to estimate the time necessary to complete each work item. After each sprint, the development team has completed a number of work items and thus increased the value of its working increment, the primary measure of success. To maximize that value, the development team and business employees review the working increment in a demo after each sprint and prioritize the backlog for the next one.
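
To make the sprint-planning mechanics concrete, here is a minimal sketch in Python: backlog items carry a value and an effort estimate, and the sprint is filled with the most valuable items that still fit the team's capacity. The class, function and example items are purely illustrative assumptions, not part of any Scrum tooling.

```python
# Minimal sketch of sprint planning (illustrative names and numbers).
from dataclasses import dataclass


@dataclass
class BacklogItem:
    name: str
    value: int        # business value estimated during backlog refinement
    effort_days: int  # effort estimated during sprint planning


def plan_sprint(backlog: list[BacklogItem], capacity_days: int) -> list[BacklogItem]:
    """Greedily pick the most valuable items that still fit into the sprint."""
    sprint: list[BacklogItem] = []
    for item in sorted(backlog, key=lambda i: i.value, reverse=True):
        if item.effort_days <= capacity_days:
            sprint.append(item)
            capacity_days -= item.effort_days
    return sprint


backlog = [
    BacklogItem("Ingest CRM data", value=8, effort_days=5),
    BacklogItem("Baseline model", value=13, effort_days=8),
    BacklogItem("Dashboard mock-up", value=5, effort_days=3),
]
print([i.name for i in plan_sprint(backlog, capacity_days=10)])
```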

Figure 2: Development Sprint

Uncovering and resolving incompatibilities with our Use Case Management

In many aspects, data science projects are becoming more and more similar to software development projects. They also incorporate code development with the aim of releasing an increment that generates value. In data science projects, however, code development aims at generating insights from data using statistical and mathematical methods, which adds a research perspective. Research can be defined as “careful consideration of study regarding a particular concern or a problem using scientific methods” (Definition by Microsoft Academic: Link) and by nature adds uncertainty about the resulting increment in every phase of development. It cannot be predicted upfront which algorithm and features will ultimately yield the best-performing model for a given task. Software engineering projects also deal with uncertainty, but there it is uncertainty about the complexity of the tasks to be completed rather than about the resulting increment itself. A typical rundown of a data science project therefore deviates from that of a software engineering project, making the former partially incompatible with the agile principles that were developed for the latter.
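
To illustrate this uncertainty, consider a minimal sketch using scikit-learn: which of several reasonable algorithms performs best can only be determined empirically, by evaluating each candidate. The dataset and candidate list below are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative public dataset; in a real project this is your prepared data.
X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Only the empirical evaluation reveals which candidate performs best.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```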

Figure 3: CRISP-DM Framework (Link)

Hereafter, we introduce three views on applying agile principles in data science to uncover and resolve incompatibilities. These views highlight the challenges to overcome when implementing agile ways of working in data science projects: managing expectations about sprint output, finding a way to plan sprints efficiently and accommodating specialized roles, all while keeping the research character alive. To represent the phases of a data science project, we refer to the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework. Like Scrum, it is an idealized representation that is rarely implemented holistically. It is important to point out that working agile does not always mean working along Scrum, and data science projects do not have to follow CRISP-DM. However, applying both frameworks strictly highlights where the two worlds deviate and uncovers areas where relaxing the rules reveals potential benefits. In response to these needs, we developed our Use Case Management: a framework for managing the development of AI use cases that follows the general steps of CRISP-DM and partially incorporates agile principles.

1. Value view: Replace tangible output with valuable outcome

In Scrum, the development team is expected to deliver, after every sprint, a working increment that generates more value than the previous release. The results of these sprints appear tangible to the stakeholders. Agile methodologies prefer quick “one-sprint” solutions over perfect ones, following the principle of maximizing the work not done. This principle conflicts with the research character of data science projects. Spending time researching and optimizing an imperfect machine learning model to increase its accuracy is important to Data Scientists. They need this freedom in their job and easily feel pressured without it. Project managers, in turn, may grow frustrated when they cannot show a tangible increment to their stakeholders during the sprint review. In fact, there might be longer periods of time, sometimes longer than one sprint cycle, where the data science team does not create tangible output. That does not mean these periods yield no valuable outcomes: the reduction of misclassifications or a well-processed data set might have tremendous business value (Agile cycles in data science projects: Link).

As CRISP-DM reflects, data science projects are a constant loop from data exploration to model evaluation. Nonetheless, up to 80% of the total time is spent on data preparation and feature engineering rather than on tuning the model itself. The team might not produce a result that is tangible for business stakeholders in this period and might thereafter declare that the use case is not feasible (Do agile methodologies fit in data science environments: Link).
However, the team still creates value in every sprint. It is essential to define what value means to the team and the organization, to communicate it to the business stakeholders and to utilize any created value. If the team spends a lot of time preparing a certain data set, make sure it becomes available to other projects (if permitted by data and IT security). If the team learns a lot about a certain data set, it should capture this data domain knowledge in a data catalogue. Defining which sprint outcomes are valuable is thus a countermeasure to the inherent uncertainty of data science projects that might otherwise cause frustration among team members and stakeholders.
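
As a hypothetical illustration of capturing such value, the data domain knowledge gained in a sprint could be persisted as a lightweight data-catalogue entry that other projects can discover. The schema, field names and file layout below are assumptions, not a specific catalogue product.

```python
import json
import os
from datetime import date

# Illustrative catalogue entry for a data set prepared during a sprint.
catalogue_entry = {
    "dataset": "crm_customer_master",            # assumed data set name
    "prepared_by": "use_case_churn_prediction",  # assumed use case id
    "last_updated": date.today().isoformat(),
    "known_quirks": [
        "customer_id is not unique before the 2018 migration",
        "revenue is stored in local currency, not EUR",
    ],
    "reusable_artifacts": ["cleaned parquet export", "feature definitions"],
}

# Persist the entry so other teams can find and reuse the preparation work.
os.makedirs("catalogue", exist_ok=True)
with open("catalogue/crm_customer_master.json", "w") as f:
    json.dump(catalogue_entry, f, indent=2)
```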

2. Task view: Shift from “how long will it take” to “how long can it take”

As stated above, data science projects can be perceived as a combination of software engineering and research. Data science is a rather young discipline in which a variety of tools and mathematical concepts are used to find solutions that are not as straightforward and standardized as in the more mature discipline of software engineering. While the general process can be properly defined, it is difficult to break projects down into small and well-defined tasks. In software engineering, work items can mostly be defined and completed independently from and in parallel to each other; the phases of a data science project, in contrast, are often inseparable and more dependent on each other. An example is the relationship between data preparation and modelling: even though data preparation precedes modelling, we might observe during modelling that the provided data needs to be enhanced with more features or transformed differently. This makes it hard to predict the workload of both tasks a priori, which would be required for proper sprint planning. Experienced Data Scientists might nonetheless give a good estimate, but they are rare.

For our Use Case Management, we developed the practice of estimating how long a task can take, not how long it will take. Workload is considered the time you are willing to allocate to a task. Take the example of a modelling task: the responsible Data Scientist tunes a given model until either a level of a given evaluation measure (e.g. accuracy) that is sufficient from a business perspective or the maximum time the task can take is reached. In a simple example with a project duration of 4 months and 4-week sprints, the team might decide to allow the modelling to take the duration of the whole 3rd sprint. To not endanger timely delivery of the project, it has to be finished thereafter. That way, the team can allocate its effort appropriately across the overall project timeline and simultaneously commit to a time limit that aligns with the sprint cycle. This resolves the trade-off between the early and continuous delivery of working increments on the one hand and the research ambition on the other. If the given evaluation measure has not reached a sufficient level once the time limit is reached, the team needs to actively decide to assign more time. A silent continuation is prohibited.
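
A minimal sketch of this timeboxing practice, assuming an illustrative accuracy target, time budget and search space: the tuning loop stops as soon as either the business-defined target or the time limit is reached, never silently continuing.

```python
import random
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

TARGET_ACCURACY = 0.97    # sufficient level from a business perspective (assumption)
TIME_BUDGET_SECONDS = 60  # maximum time this task may take (assumption)

best_score, best_params = 0.0, None
deadline = time.monotonic() + TIME_BUDGET_SECONDS

# Tune until the business target is met or the time budget is exhausted.
while time.monotonic() < deadline and best_score < TARGET_ACCURACY:
    params = {
        "n_estimators": random.choice([50, 100, 200]),
        "max_depth": random.choice([None, 5, 10]),
    }
    score = cross_val_score(RandomForestClassifier(**params), X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

print(f"Best accuracy {best_score:.3f} with {best_params}")
# If the target was missed, the team must actively decide to assign more
# time; silently continuing past the time box is prohibited.
```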

3. Role view: Manage an agile portfolio

Following the Scrum framework, we do not assign specific roles to the members of the development team: accountability belongs to the development team as a whole and, theoretically, any work item in the sprint backlog can be completed by any member. In practice, this claim is rarely lived to the extreme: there are still tasks that should be completed by a developer possessing specific knowledge, but that reflects different levels of seniority rather than different roles. In contrast, you will find diverse roles in data science projects: Data Engineers provide and prepare the necessary data, mathematically and statistically well-educated Data Scientists work on machine learning models, and Information Architects transform the models into operable services for customers or employees. Work items thus will not be picked by the data science development team according to capacity and seniority but rather be assigned to the most suitable subject matter expert. This obviously has implications for our Use Case Management. Assuming that each team member follows their specialization and that tasks are inseparable and consequently hardly parallelizable, team members cannot always pick a task from the sprint backlog and work on it. They might simply need to wait until another task is finished, which, due to the iterative character, might be uncertain in time.

There are two remedies to this challenge. On the one hand, hybrid profiles counteract the problem: some team members can fill two or more roles along the data science process. The same team member might, for example, serve both as business translator and as Information Architect: she translates a business challenge into a data problem and, further down the process, defines the right architecture to operationalize the solution. That way, the parallelization problem does not weigh as heavily. But, similar to the experienced Data Scientist, these profiles are rare. On the other hand, the AI Service Unit, or potentially any other Analytics-as-a-Service organization, can manage its use cases as a portfolio. If projects are not seen as silos and team members are not fully assigned to one of these silos, they can contribute their specialized capabilities to each of the projects without waiting for preceding tasks to be done. Consequently, the use cases are no longer seen as independent projects; instead, they are prioritized and executed as a whole portfolio. The prioritization itself can again be aligned with agile principles: pick the open task that generates the most value across the whole portfolio (see the sketch below). We have experienced that managing the use cases as a portfolio offers another benefit: selecting the right level of abstraction for sprint backlog boards (built in a software application such as JIRA). The board for a portfolio needs to be detailed enough that one can understand the status quo of a use case from the board, but not so detailed that one gets lost in (the creation of) the work items. This is particularly important because board management can cause serious overhead in data science projects, given their iterative character.
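
The portfolio prioritization can be sketched with a simple priority queue: open tasks from all use cases live in one queue, and each specialist picks the most valuable open task matching their role. The task list, role names and helper function below are illustrative assumptions, not a specific tool.

```python
import heapq

# Tuples of (negative value, use case, task, required role); heapq pops the
# smallest element, so values are negated to obtain a max-priority queue.
portfolio = [
    (-13, "churn_prediction", "feature engineering", "data_engineer"),
    (-8,  "demand_forecast",  "model tuning",        "data_scientist"),
    (-21, "lead_scoring",     "deploy scoring API",  "information_architect"),
]
heapq.heapify(portfolio)


def next_task_for(role: str):
    """Pop the most valuable open task that matches the specialist's role."""
    deferred = []
    while portfolio:
        task = heapq.heappop(portfolio)
        if task[3] == role:
            # Restore the tasks we skipped over, then hand out the match.
            for t in deferred:
                heapq.heappush(portfolio, t)
            return task
        deferred.append(task)
    # No match: put everything back and signal that the specialist must wait.
    for t in deferred:
        heapq.heappush(portfolio, t)
    return None


print(next_task_for("data_scientist"))
```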

Conclusion

Summing up, our Use Case Management enables agile ways of working in data science projects. We propose changing the perspective on value, tasks and roles: it is essential to define what a valuable outcome of a sprint is and to ensure it is recognized by the stakeholders, even though they might not be able to fully understand or touch it. The data science team should further assign a maximum time a task can take, instead of stating precisely upfront how long it will take, so the task does not appear as a roadblock to the whole initiative. This preserves the freedom the data science team needs, which a rigid implementation of an agile way of working such as Scrum would take away. And finally, organizations should stop treating projects as silos: the skills in the data science team are often as valuable as they are rare. Specialists should be enabled to serve multiple projects by doing what they are best at, which can be achieved by managing all initiatives as the portfolio of a larger team, not as single projects. If you are curious which roles you need in your data science team and what the requirements for a technical architecture for such a team are, stay tuned for our next blog posts, where we will detail the dimensions Organization & Technology.

Thanks to Hongbin Xiang, Gerrit Wiltfang and Katja Tiefenbacher for their review and feedback!

About the author

Niclas Musies is a Management Consultant with a special focus on Data Governance and Data Analytics at Capgemini Invent. He supports German and European companies in creating the environment for successfully and sustainably delivering AI use cases. Find Niclas Musies on LinkedIn.

Read also: #valuefromdata: Episode IV & #valuefromdata Episode V

Would you like to discuss the potential of our Use Case Management for your business, or are you interested in a demonstration of our solution portfolio? Please feel free to reach out to Dr. Katja Tiefenbacher.

