Classifying bugfix commits with ML.NET
Training a commit bugfix / non-bugfix binary classification model with ML.NET
This article is a case study in using ML.NET to train and select an optimal binary classification model for use in classifying code commits as bugfix or non-bugfix related based on their commit metadata.
This is a modified version of my master’s project’s final summary report and so the tone, style, and approach will be slightly different than my other writing. Where possible I will link to other articles of mine detailing concepts in more detail. Nonetheless, I believe you should find it an interesting illustration of why I find ML.NET to be an effective machine learning library and some creative uses of Phi-3 LLMs to help draft a pre-review training dataset.
This content is also available as a YouTube video
Project Objective and Context
Software engineers, data scientists, data engineers, and other technology professionals use source control management systems such as git to track changes to repositories of code over time. These changes are called commits and relate to one or more files and include one or more file and line of code that is modified. Often there will be many additions and deletions across several files. Each of…