Tools for the Study of the Usual Data Sources Found in Libre Software Projects

From Maisqual Wiki



Gregorio Robles, Universidad Rey Juan Carlos, Spain

Jesus M. Gonzalez-Barahona, Universidad Rey Juan Carlos, Spain

Daniel Izquierdo-Cortazar, Universidad Rey Juan Carlos, Spain

Israel Herraiz, Universidad Rey Juan Carlos, Spain



  1. Abstract
  2. Introduction
  3. Identification of Data Sources and Retrieval
  4. Source Code
    1. Hierarchical Structure
    2. File Discrimination
    3. Analysis of Source Code Files
    4. Analysis of Other Files
    5. Authorship Analysis
  5. SCM System Meta-Data
    1. Preprocessing: Retrieval and Parsing
    2. Data Treatment and Storage
  6. Mailing Lists Archives and Forums
    1. The RFC 822 Standard Format
  7. Bug-Tracking Systems
    1. Data Description
    2. Data Acquisition and Further Processing
  8. Summary


@article{robles2009tools,
 author    = {Gregorio Robles and Jesus M. Gonzalez-Barahona and Daniel Izquierdo-Cortazar and Israel Herraiz},
 title     = {Tools for the study of the usual data sources found in libre software projects},
 journal   = {International Journal of Open Source Software and Processes},
 year      = {2009},
 volume    = {1},
 number    = {1},
 pages     = {24--45},
 month     = {Jan-March},
}


You can download the file from here: File:Tools for study of usual data sources in libre sw.pdf.



Open source software has brought a new era in data mining by providing a rich set of publicly available data. The information exploited includes: source code, source code management (SCM) systems, mailing lists, and bug-tracking systems.


The availability of many public data sources makes it possible to extract development information far beyond the source code itself. Access to historical facts also enables longitudinal analysis. But access, retrieval and extraction are not simple tasks.

The data sources treated here complement other data-acquiring means: surveys, interviews and experiments.

Having access to the history of the project means we have the what (activities), the when (points in time), the who (actors involved in the project), and the why (the reasons for decisions, actions, etc.).

Data-mining techniques used are non-intrusive: no access is required to developers, no action is required from the development team.

Identification of Data Sources and Retrieval

There are two main steps before data can be used for analysis: identification and retrieval. These are not easy because:

  • Many tools serve the same purpose (SCM, BTS) across projects, but they have different features and architectures.
  • The conventional usage of some features (e.g. tags, comments, BTS attributes) may differ from one project to another.

The generic process from identification to analysis is: Identification -> Retrieval -> Extraction -> Storage -> Analysis

Some cases where this process is hard/impossible to apply:

  • No SCM tool was used at the beginning of the project (so part of the history is missing).
  • Some releases cannot be identified (e.g. no SVN tag was set for a specific release).
  • The project has been migrated from one tool to another (SCM, BTS) and some data have been lost.

There are also other sources of data for some projects, which can be retrieved on a case-by-case basis: e.g. todo lists, maintainer lists, organisational charts, roadmaps, etc. These can sometimes be retrieved from packaging systems (e.g. deb, rpm), events (e.g. the Debian popularity contest, the Debian developer database), or other means (e.g. credit files). They are project-specific and difficult (if not impossible) to generalize.

Source Code

Source code analysis often begins with the concept of release.

Source code is the central point of a project; it plays a major signalling and coordination role.

"Source Code is transient knowledge: it reflects what has been programmed and developed up to that point, resuming past development and knowledge and pointing to future experiments and future knowledge."

Source code is the main product of the software development process. The term also covers other files kept under configuration management: documentation, translations, localization, UI definitions, build procedures, etc.


  1. Get code (either tarball or directory)
  2. Identify hierarchy
  3. Group files in different categories (often by type of files) -> This allows some more specific analysis.
  4. Analyse

Hierarchical Structure

Capiluppi et al. showed that the project hierarchy gives insights on technical architecture and development team organisation.

File Discrimination

This means analysing files according to their contents. This is usually done through heuristics:

1. Rely on extensions: file types that can be considered are: documentation, images, internationalization, localization, user interface, multimedia, code files.

For code files, distinguish code driving the building process (makefiles, etc.) and documentation files that are tightly bound to the development and building process (README, TODO, INSTALL, etc.).

2. The second step is to check inside the file: 1. to verify that the extension is correct, and 2. to classify files with no or unknown extensions.

This allows for more specific heuristics, e.g. 1. look at the first line (for perl, python, etc., look at the shebang), or 2. search for specific keywords or comment types, or, for mark-up languages, tags or specific elements.

3. Other algorithms may come in handy, e.g. the 'file' Unix command.
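The three heuristics above can be sketched as follows. This is a minimal illustration, not any tool's actual implementation: the extension table and interpreter list are hypothetical and far smaller than the ones real tools maintain.

```python
import subprocess
from pathlib import Path

# Illustrative extension table; real tools use much larger ones.
EXTENSION_MAP = {
    ".c": "code", ".py": "code", ".pl": "code",
    ".png": "image", ".po": "i18n", ".html": "documentation",
}
# Interpreters whose shebang marks a file as code even without an extension.
SHEBANG_INTERPRETERS = ("python", "perl", "sh", "bash")

def classify(path):
    """Classify a file: extension first, then shebang, then `file`."""
    # Heuristic 1: rely on the extension.
    ext = Path(path).suffix.lower()
    if ext in EXTENSION_MAP:
        return EXTENSION_MAP[ext]
    # Heuristic 2: look inside the file (shebang on the first line).
    try:
        with open(path, "rb") as fh:
            first = fh.readline().decode("utf-8", errors="replace")
    except OSError:
        return "unknown"
    if first.startswith("#!") and any(i in first for i in SHEBANG_INTERPRETERS):
        return "code"
    # Heuristic 3: fall back to the Unix `file` command.
    out = subprocess.run(["file", "-b", str(path)],
                         capture_output=True, text=True)
    return "text" if "text" in out.stdout else "unknown"
```

A file named `script` with a `#!/usr/bin/env python` first line would be classified as code by the second heuristic, even though the first one fails.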

Analysis of source code files

This is one of the best-known tasks. Basic measures include:

  • Size (SLOC, LOC)
  • Complexity (McCabe, Halstead)
  • Composite metrics such as Maintainability Index (Oman and Hagemeister, 1992).
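The simplest of these measures, SLOC, can be sketched in a few lines. This is a generic illustration for C-style sources (counting non-blank, non-comment lines), not the counting rules of any particular tool:

```python
def sloc(text):
    """Count source lines of code: non-blank lines that are not
    pure comments (C-style // and /* ... */ block comments)."""
    count = 0
    in_block = False
    for line in text.splitlines():
        stripped = line.strip()
        if in_block:
            # Inside a /* ... */ block: look for its end.
            if "*/" in stripped:
                in_block = False
                rest = stripped.split("*/", 1)[1].strip()
                if rest and not rest.startswith("//"):
                    count += 1  # code after the closing */
            continue
        if not stripped or stripped.startswith("//"):
            continue  # blank line or pure line comment
        if stripped.startswith("/*"):
            if "*/" not in stripped:
                in_block = True
            else:
                rest = stripped.split("*/", 1)[1].strip()
                if rest and not rest.startswith("//"):
                    count += 1
            continue
        count += 1
        # A block comment may open at the end of a code line.
        if "/*" in stripped and "*/" not in stripped:
            in_block = True
    return count
```

Complexity measures such as McCabe's require parsing the control flow and are left to dedicated tools.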

Tools used to determine such metrics should be tightly integrated into the process as black boxes, for better efficiency and independence. GlueTheos has been built for that purpose.

The process used by GlueTheos is as follows:

Source Code Retrieval -> File Discrimination -> Tool 1 / Tool 2 / Tool 3 -> RDBMS storage -> Analysis of Data.

Having specific tools (Tool 1 / Tool 2 / Tool 3) for specific file types optimizes execution.

Analysis of other files

Capiluppi et al. studied change log files.

Translation files show the amount of work accomplished for the support of the software in a given language.

The amount of documentation gives an empirical evaluation of the support available for the software; the doceval tool has been built for that purpose.

Authorship analysis

CODD is a tool that analyses authorship information in source code, and assigns the lengths of the files (in bytes) to names.

The process is the following:

File extraction -> File Selection -> Dependency resolution -> Shared Source Files detection -> Owner grep (searches for mail address, copyright information, SCM identifiers).
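The final "owner grep" step can be sketched with two regular expressions, one for copyright lines and one for e-mail addresses. The patterns are illustrative only, in the spirit of CODD's heuristics rather than its actual implementation:

```python
import re

# Hypothetical patterns: copyright statements and e-mail addresses
# commonly found in source file headers.
COPYRIGHT_RE = re.compile(
    r"Copyright\s+(?:\(C\)\s*)?[\d,\s-]*\s*(.+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def owners(text):
    """Return candidate author identifiers found in a source file."""
    found = set()
    for line in text.splitlines():
        m = COPYRIGHT_RE.search(line)
        if m:
            found.add(m.group(1).strip())  # copyright holder string
        found.update(EMAIL_RE.findall(line))  # any e-mail addresses
    return found
```

The weaknesses listed below follow directly from this approach: the same person may appear once as a copyright string and once as an e-mail address, and nothing links the two.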

CODD has some weaknesses:

  • It cannot identify when one author appears in different forms (different email addresses, different names).
  • Its heuristics are not always able to catch the different conventions, which vary from project to project.

pyTernity has been developed to overcome these difficulties; it uses and improves some of the heuristics of CODD.

SCM System Meta-Data

SCM systems add a historical axis to the analysis. They also hold meta-data that give information about the interactions between developers and the SCM system.

There are some tools that can be used to analyze SCM logs:

  • CVSAnalY (which, by the way, supports SCM tools other than CVS)
  • SoftChange

CVSAnalY analyses the following data: committer name, date, file, revision number, lines added, lines removed, and comment. Comments can be parsed, e.g. to see if the commit is an external contribution or an automated script.

Preprocessing: retrieval and parsing

Process is the following:

  • Download sources.
  • Download the files, parse them and store them in a more usable way (XML, SQL).
  • Get logs (e.g. svn log), parse them and get the file type.
    • Files that do not match any known extension, that have no extension or that match some predefined names (e.g. README) are classified as unknown. At that stage, only file name extensions are checked, since the CVS log only shows that information.
    • Some commit messages may be discarded or get a special treatment, e.g. silent commits for small changes.
    • External contributions are detected (with words such as "patch(es)", "from", "by", email address, etc.) and marked as such.
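The last step above, spotting external contributions, can be sketched as a keyword search over the commit message. The keyword list comes from the text; the exact regular expression is an illustrative assumption, not CVSAnalY's actual rule set:

```python
import re

# Hints from the text: "patch(es)", "from", "by", or an e-mail address
# in a commit message suggest the change was contributed externally.
EXTERNAL_HINTS = re.compile(
    r"\bpatch(es)?\b|\bfrom\b|\bby\b|[\w.+-]+@[\w-]+\.[\w.-]+",
    re.IGNORECASE,
)

def is_external(commit_message):
    """Heuristic: mark a commit as an external contribution when its
    message credits someone other than the committer."""
    return bool(EXTERNAL_HINTS.search(commit_message))
```

Like all such heuristics, this over- and under-matches; projects with different commit-message conventions need different patterns.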

=> Commit message analysis may reveal the nature of a change (whether it is an adaptive, corrective or perfective change).

Data Treatment and Storage

Once the information has been parsed and transformed into a more structured format, the data is stored in a database. Data can be kept in a raw format, or transformed further to make it more efficient to query: numerical user identifiers, normalization into several tables, etc.

Then some statistical information is gathered on committers and modules, giving details on the interactions by contributor or by module.

CVS does not understand the concept of an atomic commit (revision). The authors use a sliding window algorithm to recover these. The algorithm considers all commits made by a committer within a period of time as part of the same atomic commit. The delay may be constant, or sliding (the delay is restarted at every commit caught).
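The sliding variant can be sketched as follows. This is a simplified illustration assuming per-file commits arrive sorted by time, not the authors' actual implementation; in particular, a real version would keep one window per committer rather than a single running one.

```python
from datetime import datetime, timedelta

def sliding_window(commits, delay=timedelta(seconds=300)):
    """Group per-file CVS commits into atomic transactions.

    `commits` is a time-sorted list of (committer, timestamp) tuples.
    The window slides: the delay restarts at every commit caught, so a
    transaction ends only when the committer changes or stays silent
    for longer than `delay`."""
    transactions = []
    current = []
    for committer, ts in commits:
        if current and (committer != current[-1][0]
                        or ts - current[-1][1] > delay):
            transactions.append(current)  # close the running transaction
            current = []
        current.append((committer, ts))
    if current:
        transactions.append(current)
    return transactions
```

With a constant window the comparison would be against the first commit of the transaction instead of the last one, so long bursts of activity would be split.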

Post processing computes inequality of concentration indexes and generates graphs. Results are published on a web interface.

Mailing Lists Archives and Forums

Libre software projects usually have three mailing lists: development, support and announcements. Mailing lists and forums serve the same purpose but show the information in different ways; the authors consider only mailing lists, since software exists that can transform content from one format to the other (and back).

RFC 822 MBOX Standard Format

Given the URL of a mailing list archive, MLStats outputs the information extracted from the message headers.
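Header extraction from an already-downloaded mbox archive can be sketched with the standard library. This is a generic illustration of the kind of fields such tools rely on, not MLStats' actual output format:

```python
import mailbox

def header_stats(mbox_path):
    """Extract the commonly used RFC 822 header fields
    (From, Date, Subject, Message-ID) from a local mbox archive."""
    rows = []
    for msg in mailbox.mbox(mbox_path):
        rows.append({
            "from": msg.get("From"),
            "date": msg.get("Date"),
            "subject": msg.get("Subject"),
            "message_id": msg.get("Message-ID"),
        })
    return rows
```

From these four fields alone one can already derive posting activity over time, thread participation (via Message-ID references) and the set of active posters.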

Bug-Tracking Systems

The most used BTS is Bugzilla, but others are also widely used. Additionally, some forges have developed their own bug-tracking systems.

Data Description

The fields (attributes) commonly found in BTS are:

  • Bug ID, a numerical unique identifier.
  • Description, a textual description.
  • Opened, date of opening.
  • Status: one of new, assigned, reopened, needinfo, verified, closed, resolved, unconfirmed.
  • Resolution: one of obsolete, invalid, incomplete, notproject, notabug, wontfix, fixed.
  • Assigned: a name/email address.
  • Priority
  • Severity
  • Reporter, a name/email address.
  • Product
  • Version
  • Component (subsystem)
  • Platform
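The attribute list above maps naturally onto a flat record, which is what the later database storage step consumes. The field names below are illustrative, not tied to any concrete Bugzilla schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BugReport:
    """One BTS entry, with the attributes commonly found across systems."""
    bug_id: int                 # numerical unique identifier
    description: str            # textual description
    opened: str                 # date of opening
    status: str                 # new, assigned, reopened, ..., closed
    resolution: Optional[str]   # fixed, wontfix, invalid, ... (None if open)
    assigned: Optional[str]     # name / e-mail address
    priority: str
    severity: str
    reporter: str               # name / e-mail address
    product: str
    version: str
    component: str              # subsystem
    platform: str
```

Status and resolution together describe the bug's life cycle: a bug moves through statuses, and only a closed or resolved bug carries a resolution.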

Data Acquisition and Processing

BugZilla analysis tool (developed by authors) does the following:

  • Retrieve HTML pages, extract data, process missing information, etc.
  • Apply a specific parser (specific to the BTS tool).
  • Store in an independent format.
  • Apply a generic parser.
  • Put results in Database.
  • Apply analysis and statistical algorithms.


Although libre software offers public access to development repositories, many hidden problems arise.

SCM logs analysis gives insights on the dynamics of the project.

Source code and SCM meta-data offer a wide range of analysis, still to be researched.

Mailing lists, as the main communication channel, give insights on organisation, team dynamics and technical evolution.

Major problems are:

  • Incomplete data sets (mainly due to migrations).
  • For SCM, changes are sometimes made by other people who have write access to the repository (external contributions).

Since data mining has become more popular, some projects have received too many requests on their servers, resulting in denials of service. Hence some tools have been banned, and can no longer be used on these projects.

FLOSS Metrics and FLOSS Mole projects construct, publish and analyse large scale databases with information and metrics about Libre Software developments.
