Topics

Software Development Productivity
Large Language Models in Software Engineering
Test Flakiness

Software Development Productivity

The term “Software Development Productivity” refers to the efficiency and effectiveness with which individual software developers or entire teams build and maintain software systems. The relevance of this topic is obvious in an environment of rapidly changing requirements and market conditions combined with a shortage of software developers. However, productivity in software development is very hard to quantify comprehensively, because merely “producing” source code makes up a relatively small part of software development. Traditional approaches such as counting the lines of code (LoC) written per unit of time are therefore insufficient, as they do not adequately capture the actual quality and usefulness of the resulting software. This causes frustration particularly in business settings, a circumstance that the consulting firm McKinsey is currently trying to exploit. Two responses to McKinsey’s claim that productivity in software development can be measured offer a good entry point into the topic:

Gergely Orosz and Kent Beck (Blog Post 1, Blog Post 2)
Dan North (Blog Post)

Further sources that provide a scientific introduction to the topic are the book Rethinking Productivity in Software Engineering, an interview with Professor Peggy Storey of the University of Victoria in Canada on the SPACE framework (see below), and the “Developer Productivity for Humans” column in the magazine IEEE Software.

The individual seminar topics address specific aspects of the topic area “Software Development Productivity”. The students’ task is to work out and summarize the current state of research based on the listed references and further scientific articles.

Measuring Software Development Productivity: Introduction (Bachelor)

The above-mentioned book Rethinking Productivity in Software Engineering offers a generally accessible introduction to the topic. The individual articles each point to further literature that provides deeper insight where needed. The following articles from the sections Measuring Productivity: No Silver Bullet and Measuring Productivity in Practice form the basis of this seminar topic:

A successful seminar talk summarizes, based on these articles, why software development productivity is hard to reduce to a single metric and why some researchers even argue against capturing productivity (quantitatively) at all. The talk should also incorporate the subjective perspective of developers and explain how the productivity of individuals relates to that of the team as a whole.

Measuring Software Development Productivity: Frameworks (Bachelor/Master)

Naive productivity metrics such as lines of code (partially) measure the output of a software developer, but not the outcome or impact of that output. Questions such as the (potentially negative) value provided by the code remain unanswered. With the emergence of a DevOps culture that aims at integrating and automating the development and operations of software systems, with the goal of shortening release cycles, the focus shifted towards measuring aspects such as the deployment frequency or the time it takes to promote a change from development to production. The DORA metrics, introduced in 2016, capture such aspects and quickly gained popularity in the software industry. Those metrics, however, focus on a very specific notion of outcome and impact and do not include the perspective of individual developers. Nicole Forsgren, one of the original authors of DORA, moved on to (co-)develop two more holistic productivity frameworks, SPACE and DevEx, which are described in the following research papers:

On the bachelor level, the main goal of this seminar talk is to summarize and compare those two productivity frameworks and briefly outline what distinguishes them from previous frameworks and metrics. On the master level, a deeper embedding in related work is required. The resources listed in the introduction are useful starting points for this. Further, on the master level, a discussion of the results of the following paper in the context of those two frameworks is required:

What Predicts Software Developers’ Productivity?
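To make the DevOps-oriented measures mentioned in this topic description more concrete (deployment frequency and the time it takes to promote a change from development to production), the following is a minimal sketch of how they could be computed. The record format, the function names, and the sample data are assumptions made for illustration only; they are not taken from DORA, and the official metric definitions may differ in detail.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (time the change was committed, time it reached production)
changes = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 15, 0)),
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 3, 10, 0)),
    (datetime(2024, 1, 4, 8, 0), datetime(2024, 1, 4, 10, 0)),
    (datetime(2024, 1, 5, 12, 0), datetime(2024, 1, 7, 12, 0)),
]


def deployment_frequency(changes, period_days):
    """Average number of distinct deployments per day in the observed period."""
    deployments = {deployed for _, deployed in changes}
    return len(deployments) / period_days


def median_lead_time(changes):
    """Median time between committing a change and deploying it."""
    return median(deployed - committed for committed, deployed in changes)


print(deployment_frequency(changes, period_days=7))
print(median_lead_time(changes))
```

Both numbers describe the delivery pipeline as a whole. Neither says anything about how individual developers experience their work, which is the gap the paragraph above attributes to such metrics.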

Effects of the COVID-19 Pandemic on Software Development Productivity (Bachelor/Master)

We all vividly remember when a novel coronavirus swept the world in early 2020. Sudden lockdowns and strict regulations in many countries forced workers, including software developers, to work from home. While acknowledging the difficult and stressful conditions for many software developers, researchers both in industry and academia saw the opportunity to study the impact of this unique setting on developers’ wellbeing and productivity. Two main studies conducted during that time are:

The former paper reports on a company-internal study at Microsoft, the latter on a broader world-wide survey based on a questionnaire that was translated into 12 languages. On the bachelor level, the main goal of this seminar talk is to summarize and compare the results of the two studies. On the master level, an additional discussion of the methodological differences between the two studies is required, further comparing them to this third study that was published a year later:

Predictors of well-being and productivity among software professionals during the COVID-19 pandemic – a longitudinal study

Large Language Models in Software Engineering

With the release of ChatGPT in November 2022, the public suddenly became aware of the capabilities of modern large language models (LLMs). The hype around this and other chat-based tools also gripped the software industry, and companies competed with announcements about a future in which “Generative AI” is an essential part of their workflows. While media coverage is approaching the Peak of Inflated Expectations, the topic of language models in software development is not new. As early as 2012, a group led by Professor Premkumar Devanbu of UC Davis in the USA worked on the topic and wrote:

“We […] ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers.” Source

In particular, they showed that such language models can capture not only the formal grammar of a programming language but also its “naturalness”. A simple example is the comparison of the expressions i = i + 1 and i = 1 + i. Both are semantically equivalent, but the first feels considerably more natural to a programmer. Language models can capture exactly such regularities.
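The i = i + 1 example can be illustrated with a toy statistical language model. The sketch below trains a bigram model with add-one smoothing on a handful of made-up statements and compares the cross-entropy of the two expressions, where a lower value means the model finds the token sequence less surprising. The corpus, the smoothing choice, and all function names are assumptions for illustration; this is not the experimental setup of the paper.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count contexts and bigrams over token sequences padded with <s> and </s>."""
    contexts, bigrams = Counter(), Counter()
    for tokens in corpus:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    vocab = {tok for tokens in corpus for tok in tokens} | {"<s>", "</s>"}
    return contexts, bigrams, vocab


def cross_entropy(tokens, model):
    """Average negative log2 probability per token, using add-one smoothing."""
    contexts, bigrams, vocab = model
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        total -= math.log2(prob)
    return total / (len(padded) - 1)


# Hypothetical miniature "code corpus" of tokenized statements
corpus = [
    "i = i + 1".split(),
    "j = j + 1".split(),
    "n = n + 1".split(),
    "x = x + y".split(),
    "i = 0".split(),
]
model = train_bigram(corpus)

natural = cross_entropy("i = i + 1".split(), model)
unusual = cross_entropy("i = 1 + i".split(), model)
print(natural < unusual)  # the familiar ordering is less surprising to the model
```

The model never sees the language grammar. It prefers the first expression only because sequences such as "+ 1" occur in the training data while "1 +" does not.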

What also got lost in the hype around ChatGPT is that GitHub Copilot had already been released a year earlier, in October 2021. Of course, Copilot (X) will also benefit from the new generations of language models (here: GPT-4), but even the “old” Copilot with OpenAI Codex was already considerably more than simple “code completion”.

The individual seminar topics address specific aspects of the topic area “Large Language Models in Software Engineering”. The students’ task is to work out and summarize the current state of research, or the historical perspective, based on the listed references and further scientific articles. A curated list of background information on LLMs in general can be found here, for example.

Naturalness of Software (Master)

The above introduction already mentioned the 2012 paper by Hindle et al. on the “Naturalness of Software.” This paper has been cited more than 1200 times and won the prestigious Most Influential Paper award of the International Conference on Software Engineering in 2022, ten years after its publication.

On the Naturalness of Software

The goal of this seminar topic is to carefully read and understand this seminal paper and then select two papers that cite it, many of which are highly cited papers themselves. That way, the student learns how ideas can propagate and kickstart a whole range of new research projects.

GitHub Copilot (Bachelor/Master)

Researchers at GitHub and Microsoft have participated in the journey of GitHub Copilot from a closed preview to a tool that enterprises all around the world want to adopt internally. This seminar topic explores what we can learn from the studies that have been conducted at GitHub and beyond on the impact of using AI coding assistants such as Copilot:

On the bachelor level, the focus is primarily on the two above-mentioned “official” studies. On the master level, it is expected that students include at least one additional technical paper published at a high-quality venue that also targets Copilot or similar tools, in close discussion with the topic advisor. Further, the studies should be critically assessed in light of GitHub’s monetary incentive to make the tool look good.

Literature Reviews of Large Language Models in Software Engineering 1 (Bachelor/Master)

The above-mentioned hype around ChatGPT also triggered a huge number of academic papers on using Large Language Models in Software Engineering. The topic of this seminar is to carefully read one of the two literature reviews that have recently been conducted.

Large Language Models for Software Engineering: A Systematic Literature Review

On the bachelor level, the focus is on the (extensive) literature review itself. The goal of the seminar presentation is to outline the various areas in which researchers imagine large language models being used as well as potential open problems and gaps. On the master level, it is expected that students include at least one additional technical paper published at a high-quality venue that was reviewed as part of the literature review, in close discussion with the topic advisor.

Literature Reviews of Large Language Models in Software Engineering 2 (Bachelor/Master)

Large Language Models for Software Engineering: Survey and Open Problems

Test Flakiness

Regression tests are an essential part of quality assurance in modern software projects. The goal of regression testing is to check, as automatically as possible, whether changes to a software system negatively affect previously existing functionality. So-called “test flakiness” is a significant problem here, because a “flaky test” fails sporadically even though the code under test has not changed. Possible reasons are that test cases depend on the availability of external services, that the code under test contains race conditions, or that the tests are nondeterministic, for example because they use randomly generated data. The negative effects of this test behavior include developers losing trust in the tests, delays caused by the necessary manual inspection of test runs, and medium-term quality problems if flaky tests are disabled without replacement.
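One of the causes named above, tests that use randomly generated data, can be reproduced in a few lines. The sketch below defines a test-like check whose outcome depends on random input and counts how often it fails across repeated runs, first with a fresh random generator per run and then with a fixed seed. All names and the seed value are hypothetical, and fixing the seed is only one of several possible remedies.

```python
import random


def average(values):
    """Code under test: it never changes between runs."""
    return sum(values) / len(values)


def check_average_is_positive(rng):
    """Test-like check that depends on randomly generated input data."""
    data = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    return average(data) > 0  # holds for only about half of all random inputs


def count_failures(runs, make_rng):
    """Run the check repeatedly and count how often it fails."""
    return sum(not check_average_is_positive(make_rng()) for _ in range(runs))


# A fresh, unseeded generator per run: the outcome varies from run to run.
flaky_failures = count_failures(200, random.Random)

# A fixed seed per run: every run sees the same data, so the outcome is
# identical each time (either all runs pass or all runs fail).
seeded_failures = count_failures(200, lambda: random.Random(42))

print(flaky_failures, seeded_failures)
```

In the unseeded case the check fails sporadically although average() is unchanged, which is exactly the behavior that erodes trust in a test suite.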

The individual seminar topics address specific aspects of the topic area “Test Flakiness”. The students’ task is to work out and summarize the current state of research based on the listed references and further scientific articles.

The Vocabulary of Flaky Tests (Bachelor/Master)

In 2020, Pinto et al. published a study centered around the idea of treating the source code and comments of tests as natural language, and then determining whether a test is flaky or not merely based on its vocabulary. This study showed promising results and was replicated and extended afterwards, which is very valuable but unfortunately rare in software engineering research. The goal of this seminar topic is to carefully read and understand the original paper, and then compare it to its later replications.

On the bachelor level, the focus is mainly on the two papers listed above. On the master level, a very recent paper is included as well. The paper is not publicly available yet, but will be once the seminar starts:

The Vocabulary of Flaky Tests in the Context of SAP HANA

This paper is particularly interesting because it is not only a rare replication, but also an even rarer replication in industry.

Flaky Tests in Practice (Bachelor/Master)

The goal of this seminar topic is to understand how developers perceive flaky tests and how companies try to tackle the issue of test flakiness, based on the following two papers:

On the bachelor level, the focus is on combining those two perspectives into a coherent talk. On the master level, the temporal dimension is added by also including the following paper:

A large-scale longitudinal study of flaky tests