Description
Software developers and testers have long struggled with how to
elicit proactive responses from their coworkers when reviewing code
for security vulnerabilities and errors. For a code review to be
successful, it must not only identify potential problems but also
elicit an active response from the colleague responsible for
modifying the code. To understand the factors that contribute to
this outcome, we analyze a novel dataset of more than one million
code reviews for the Google Chromium
project, from which we extract linguistic features of feedback that
elicited responsive actions from coworkers. Using a
manually labeled subset of reviewer comments, we train a highly
accurate classifier to identify 'acted-upon' comments
(AUC = 0.85). Our results demonstrate the utility of our dataset,
the feasibility of using NLP for this new task, and the potential
of NLP to improve our understanding of how communications between
colleagues can be authored to elicit positive, proactive responses.
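To make the classification setup concrete, here is a minimal sketch of the kind of pipeline described above: a generic binary classifier trained on per-comment numeric linguistic features and evaluated with ROC AUC. This is illustrative only; the features and data below are synthetic placeholders, not the paper's actual features, model, or results.

```python
# Illustrative sketch: train a binary "acted-upon" classifier over numeric
# linguistic features and report ROC AUC. Features and labels are synthetic
# stand-ins, not the real dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
# Stand-ins for per-comment feature vectors (the paper uses nine linguistic features).
X = rng.normal(size=(n, 9))
# Stand-in labels: 1 = acted-upon, 0 = not (known-to-be) acted-upon.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```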
Datasets
There are two datasets available for download. In each one, we have
(to the best of our knowledge) de-identified all usernames and
email addresses of developers involved in the Chromium Project. The
datasets described below were exported from the PostgreSQL database
used in our research. Everything needed to re-collect and recreate
our database is available here.
A README.md file, which explains the structure of the
datasets, is included in the download below.
CONVERSATIONS: This is the full dataset containing over 1.5 million comments posted by developers reviewing proposed code changes. The dataset also includes the values we calculated for all nine linguistic features (described in Section 4 of the paper cited below).
ANNOTATIONS: This dataset is a subset of the CONVERSATIONS dataset. It contains the data used in the classification experiment outlined in Section 5 of the paper cited below (2,994 comments automatically identified as acted-upon and 800 comments manually identified as not (known-to-be) acted-upon).
Download: chromium_conversations.tar.gz | 270MB Compressed; 1GB Raw
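The README.md bundled in the download documents the actual file layout and column names. As a rough sketch of how one might inspect the archive and load a table with pandas, assuming the tables are exported as CSV, see below; the member name annotations.csv and the assumption of CSV are placeholders, so check the README for the real structure.

```python
# Sketch of inspecting the downloaded archive and loading one table.
# Assumes a CSV export; "annotations.csv" is a placeholder member name --
# consult the bundled README.md for the actual layout and column names.
import tarfile
import pandas as pd

archive = "chromium_conversations.tar.gz"

with tarfile.open(archive, "r:gz") as tar:
    print(tar.getnames())                          # list the files actually shipped
    member = tar.extractfile("annotations.csv")    # placeholder member name
    df = pd.read_csv(member)

print(df.shape)
print(df.head())
```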
Citation
We encourage you to use this dataset in your research. If you do, we ask that you please cite:
A Dataset for Identifying Actionable Feedback in Collaborative Software Development.
Proceedings of the 2018 Annual Meeting of the Association for Computational Linguistics (ACL).
License
Rietveld, the system that facilitates code review in Chromium, is
licensed under the Apache v2.0 license.
The datasets we are releasing are licensed under the Creative Commons Attribution-ShareAlike license (CC BY-SA).