Classifying bugfix commits with ML.NET
Training a commit bugfix / non-bugfix binary classification model with ML.NET
This article is a case study in using ML.NET to train and select the best-performing binary classification model for labeling code commits as bugfix or non-bugfix based on their commit metadata.
This is a modified version of my master’s project’s final summary report, so the tone, style, and approach will differ slightly from my other writing. Where possible I will link to other articles of mine that cover individual concepts in more depth. Nonetheless, I believe you will find it an interesting illustration of why I find ML.NET to be an effective machine learning library, along with some creative uses of Phi-3 LLMs to help draft a pre-review training dataset.
This content is also available as a YouTube video
Project Objective and Context
Software engineers, data scientists, data engineers, and other technology professionals use source control management systems such as git to track changes to repositories of code over time. These changes are called commits. Each commit touches one or more files and modifies one or more lines of code, often with many additions and deletions across several files. Each commit also records when it occurred, who performed it, and the message its author used to describe the overall set of changes.
To help analyze the history of git-based codebases, I’ve been developing a tool called GitStractor. This project can extract data from any git repository, regardless of the programming languages used, and store that data in a set of CSV files. These CSV files can then be ingested by an analytics notebook or data visualization tool to visualize the structure of the project and how it has changed over time.
GitStractor has proved effective in its ability to glean and communicate insight from code, but it was not able to adequately communicate important information on software quality without a way of determining whether a commit was…