Software Repository Mining: 3 WHs You Need To Know

In this modern data-driven era, many industries treat their data as a valuable asset. They therefore tend to warehouse almost all the data generated by their business operations.

This warehoused data is later used to generate business intelligence. Much of that intelligence comes from mining the data warehouses, and the findings help organizations gain a competitive advantage in their business domain.

When it comes to the software development industry, someone could argue that we don't have typical business operations that generate and warehouse data. But if we take a closer look at the software development lifecycle (SDLC), there are many activities that generate data and store it in the tools and systems used throughout the lifecycle. For example, version control systems, bug tracking systems, issue tracking systems, and mailing list archives can all be considered great data sources.

Most of the time, the data stored in those locations is not the result of an intentional data warehousing effort. But it is still valuable and can be treated as a source for data mining to discover insights into many aspects of software development.

What is Repository Mining

Similar to the data mining concept applied to large-scale data warehouses, the term repository mining refers to the process of extracting useful data from software repositories, bug tracking and issue tracking tools, and mailing list archives, and transforming it into valuable information about many aspects of software projects.
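To make the extraction step concrete, here is a minimal sketch that turns the text output of a version control log into structured records. It assumes the log was produced with something like `git log --name-only --pretty=format:"--%h|%an"`, where the `--` prefix and the `|` separator are arbitrary choices made for this example. The function name `parse_git_log` and the sample data are hypothetical.

```python
def parse_git_log(text, marker="--"):
    """Parse log text into a list of commit records.

    Assumes each commit starts with a header line of the form
    '<marker><hash>|<author>' followed by one changed file per line.
    A file whose name starts with the marker and contains '|' would be
    misread as a header; that limitation is accepted in this sketch.
    """
    commits = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate commits
        if line.startswith(marker) and "|" in line:
            commit_hash, author = line[len(marker):].split("|", 1)
            current = {"hash": commit_hash, "author": author, "files": []}
            commits.append(current)
        elif current is not None:
            current["files"].append(line)
    return commits


# Hypothetical log output, made up for illustration.
sample_log = """--a1b2c3d|alice
src/app.py
src/util.py

--d4e5f6a|bob
README.md
"""

records = parse_git_log(sample_log)
for record in records:
    print(record["hash"], record["author"], record["files"])
```

Once the history is available as plain records like these, the mining techniques themselves reduce to ordinary data processing.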

Why Repository Mining

Large-scale software development projects normally have to deal with frequent contributions and engagement from a large number of stakeholders, year-long road maps, and heavy use of SDLC processes and tools. This makes such projects very complex to control and manage.

Although stakeholders, tools, and processes are in place to streamline the software development project, human attitudes or behavior, inefficient or conflicting processes, and unsuitable or buggy tools can change the direction and velocity of the project. Often this introduces a huge long-term risk, because the damage may be irreversible by the time it is identified.

Software repository mining has been identified as a proactive way to mitigate that risk in large-scale software development projects: the top-level management of the project can draw insightful information about potential deviations from the initial road map as soon as those deviations begin to occur.

This technique is useful not only for ongoing software projects but also for projects that are already completed, where management can evaluate the efficiency and effectiveness of the stakeholders, tools, and processes that were used.

How Repositories are Mined

There are a couple of commonly used approaches to mining software repositories, and there are likely many undocumented ways of extracting data from the tools and repositories as well. Here, we're going to discuss two commonly applied mining techniques.

Coupled Change Analysis

During new feature development and bug fixes, developers tend to change a set of coupled files in the source code. An experienced developer already knows the couplings between source code files. Inexperienced or new developers, however, face the challenge of identifying the coupled code spread across different files for a particular change.

Analyzing the repository's history can identify the files that were changed together in most previous changes. The result of coupled change analysis can therefore provide hints about incomplete source code changes, for example when a developer modifies one file but not another file that has usually changed along with it.
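As a rough illustration of the idea, the sketch below counts how often pairs of files appear in the same commit and uses that to hint at files that may be missing from a change. It is one simple way to approximate coupled change analysis, not a reference implementation. The function names, the `min_confidence` threshold, and the sample history are all hypothetical; in practice the list of changed files per commit could come from something like `git log --name-only`.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each file, and each unordered pair of files,
    appears in a commit. `commits` is a list of lists of file paths."""
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = sorted(set(files))
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return file_counts, pair_counts


def suggest_missing(changed, commits, min_confidence=0.6):
    """Suggest files that usually changed together with the files in
    `changed` but are not part of the current change.

    Confidence for (src -> other) is the share of commits touching
    `src` that also touched `other`. The 0.6 default is an arbitrary
    threshold chosen for this example."""
    file_counts, pair_counts = co_change_counts(commits)
    changed = set(changed)
    suggestions = {}
    for (a, b), together in pair_counts.items():
        for src, other in ((a, b), (b, a)):
            if src in changed and other not in changed:
                confidence = together / file_counts[src]
                if confidence >= min_confidence:
                    best = suggestions.get(other, 0.0)
                    suggestions[other] = max(best, confidence)
    # Highest confidence first, then alphabetical for stable output.
    return sorted(suggestions.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical commit history: each entry lists the files in one commit.
history = [
    ["api/user.py", "db/user_model.py", "tests/test_user.py"],
    ["api/user.py", "db/user_model.py"],
    ["api/user.py", "db/user_model.py", "docs/user.md"],
    ["api/order.py", "db/order_model.py"],
    ["api/user.py"],
]

hints = suggest_missing(["api/user.py"], history)
print(hints)  # [('db/user_model.py', 0.75)]
```

Here `api/user.py` was changed in four commits, three of which also touched `db/user_model.py`, so a change that modifies only `api/user.py` is flagged as possibly incomplete. A real analysis would probably also filter out very large commits, such as bulk renames, which would otherwise couple unrelated files.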

Commit Analysis

This can be seen as a far broader type of change analysis. A commit in a software repository contains a lot of information about the change that was made, such as its author, timestamp, message, and the files it touched. The data set extracted through this analysis can be used to generate many different kinds of knowledge.

For example, by analyzing the commit history we can identify and rank the developers who are most experienced in particular components of a huge software project. That knowledge can be useful for assessing the impact of a developer leaving the project and for identifying possible replacements.
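A minimal sketch of such a ranking is shown below. It assumes that the top-level directory of a file path stands for a component and that the number of commits touching a component is a reasonable proxy for experience. Both are simplifying assumptions, and the function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def component_of(path):
    """Map a file path to a component. Using the top-level directory
    is an assumption made for this example."""
    return path.split("/", 1)[0] if "/" in path else "(root)"


def rank_experts(commits):
    """Rank authors per component by the number of commits in which
    they touched that component. `commits` is a list of
    (author, [file paths]) tuples."""
    touches = defaultdict(Counter)
    for author, files in commits:
        # Count each component at most once per commit.
        for component in {component_of(f) for f in files}:
            touches[component][author] += 1
    return {
        component: sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        for component, counts in touches.items()
    }


def components_led_by(ranking, developer):
    """Components where `developer` is ranked first, i.e. where their
    departure would likely hurt most."""
    return sorted(c for c, ranked in ranking.items() if ranked[0][0] == developer)


# Hypothetical commit data, made up for illustration.
commit_log = [
    ("alice", ["billing/invoice.py", "billing/tax.py"]),
    ("alice", ["billing/invoice.py"]),
    ("bob", ["billing/tax.py", "search/index.py"]),
    ("carol", ["search/index.py"]),
    ("carol", ["search/query.py", "README.md"]),
]

ranking = rank_experts(commit_log)
print(ranking["billing"])                   # [('alice', 2), ('bob', 1)]
print(components_led_by(ranking, "alice"))  # ['billing']
```

In this made-up data, alice leads the billing component and bob is the next most experienced contributor there, so bob would be the first candidate to take over if alice left. Commit counts are a crude measure; weighting by recency or by lines changed would be natural refinements.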

Conclusion

Software repository mining is an emerging trend in the software development industry. Although some custom-developed tools exist for specific mining purposes, more generalized tools and techniques are yet to emerge in this discipline. Thus, there is much room for innovation that enables stakeholders to reap greater benefits from software repository mining.