A Statistical Examination of the Evolution and Properties of Libre Software

Israel Herraiz, GSyC/Libresoft, Universidad Rey Juan Carlos, Spain


Paper Summary

  1. Abstract
  2. Introduction
  3. Data Sources
  4. Research Questions
    1. First Question: Validity of the Laws
    2. Second Question: Metrics to Characterise a Software Product
    3. Third Question: Shape of the Statistical Distribution of Size
    4. Fourth Question: Self-Similarity
    5. Fifth Question: Software Evolution Dynamics
    6. Sixth Question: Forecasting Software Evolution
  5. Conclusions and Further Work
  6. Acknowledgements


 author     = 	 {Israel Herraiz},
 title      = 	 {A statistical examination of the evolution and properties of libre software},
 booktitle  =   {Proceedings of the 25th IEEE International Conference on Software Maintenance (ICSM)},
 pages      = 	 {439-442},
 year       = 	 {2009},
 publisher  =   {IEEE Computer Society},

Download File

The file can be downloaded from here: File:Herraiz-icsm09.pdf.



Abstract

Lehman published a set of laws (1985) to characterise software evolution. Are these laws still valid today, especially for libre software? The main results are:

  • a small subset of basic size metrics is enough to characterise a software system,
  • software systems are self-similar,
  • software evolution is a short-range correlated (i.e. short-memory) process.


Introduction

Several basic rules differ between libre software and corporate software:

  • Lehman assumes two separate phases (development and evolution), which does not hold for libre software, where both coexist,
  • Lehman encourages long intervals between releases, which runs against the libre principle of "release early, release often".

Data Sources

This study takes three databases as its data set:

  • FreeBSD ports (software packaged for the FreeBSD distribution),
  • SourceForge active projects (more than 3 developers, at least 1 year of active history, use of a source code repository),
  • Three big projects: the FreeBSD and NetBSD kernels, and PostgreSQL.

First Question: Are the laws always valid?

Godfrey and Tu had already studied this question and found that the growth of Linux is super-linear. Lehman, however, argued that he had used the number of files, whereas Godfrey and Tu used SLOC. Further analysis shows that all size metrics are highly correlated, so number of files and SLOC should yield growth curves of the same shape.
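
As an illustration of what "highly correlated" means here, the following minimal sketch computes the Pearson correlation coefficient between two size metrics. The release figures are hypothetical and are not taken from the paper, and the paper may have used a different correlation measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical size measurements for six releases of one project.
files = [120, 150, 190, 240, 310, 400]
sloc = [15_000, 19_500, 24_000, 31_000, 40_500, 52_000]

r = pearson(files, sloc)
print(f"Pearson r between number of files and SLOC: {r:.3f}")
```

When r is close to 1, both metrics trace growth curves of the same shape, up to a scale factor.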

Second Question: What metrics to characterise a software product?

The preceding question showed that all size metrics are highly correlated, so why not use the simplest and most widely used one, SLOC?

Third Question: Shape of the Statistical Distribution of Size

The probability density function of software size was analysed with several metrics (SLOC, number of files) and at different granularity levels (e.g. projects within domains). In every case it shows a double Pareto distribution.

This distribution is also found in other areas, and dynamic models that generate it have already been proposed.
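
For readers unfamiliar with the double Pareto shape, here is a minimal sketch of one common parameterisation: a power law with exponent beta - 1 below a scale x0, and a power law with exponent -(alpha + 1) above it. The function names, the parameters alpha, beta and x0, and the choice of parameterisation are assumptions made for illustration. The paper may use a different variant, such as the double Pareto lognormal.

```python
from math import log


def double_pareto_pdf(x, alpha, beta, x0=1.0):
    """Density: rising power law below x0, falling power law above x0."""
    if x <= 0:
        return 0.0
    norm = alpha * beta / ((alpha + beta) * x0)
    ratio = x / x0
    if ratio <= 1.0:
        return norm * ratio ** (beta - 1.0)
    return norm * ratio ** (-alpha - 1.0)


def double_pareto_cdf(x, alpha, beta, x0=1.0):
    """Cumulative distribution matching double_pareto_pdf."""
    if x <= 0:
        return 0.0
    ratio = x / x0
    if ratio <= 1.0:
        return alpha / (alpha + beta) * ratio ** beta
    return 1.0 - beta / (alpha + beta) * ratio ** (-alpha)


# Hypothetical parameters, chosen only for illustration.
alpha, beta, x0 = 1.5, 2.0, 1000.0

# On a log-log plot the tail is a straight line with slope -(alpha + 1).
tail_slope = (
    log(double_pareto_pdf(1e6, alpha, beta, x0))
    - log(double_pareto_pdf(1e5, alpha, beta, x0))
) / (log(1e6) - log(1e5))
print(f"log-log slope of the upper tail: {tail_slope:.2f}")
```

On a log-log plot this density appears as two straight segments meeting at x0, which is the visual signature usually checked for in empirical size data.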

Fourth Question: Self-Similarity

The double Pareto distribution appears regardless of the metric used or the level of granularity.

Fifth Question: Software Evolution Dynamics

Wu found that libre software evolution is driven by self-organised criticality (SOC) dynamics. Such processes, however, are long-range correlated. The daily time series of the number of changes in the sampled SourceForge projects show that 80% of the projects are short-range correlated processes, such as ARIMA processes.

In practice, this means that recent events carry more weight than older ones when making decisions.
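
A short-memory process can be recognised by an autocorrelation function that decays quickly with the lag. The sketch below simulates an AR(1) process, the simplest member of the ARIMA family, and computes its sample autocorrelation. The coefficient, series length and seed are arbitrary illustrative choices, and the simulated series merely stands in for a daily series of change counts. This is not the procedure used in the paper.

```python
import random


def sample_acf(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return cov / var


def simulate_ar1(phi, n, seed=0):
    """Simulate x[t] = phi * x[t-1] + noise with standard normal noise."""
    rng = random.Random(seed)
    x = 0.0
    out = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


# Hypothetical stand-in for a daily series of change counts.
series = simulate_ar1(phi=0.6, n=20_000, seed=42)

for lag in (1, 2, 5, 10):
    print(f"lag {lag:2d}: autocorrelation = {sample_acf(series, lag):+.3f}")
```

For an AR(1) process the theoretical autocorrelation at lag k is phi to the power k, so it falls off exponentially. A long-range correlated process would instead show a much slower, power-law-like decay.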

Sixth Question: Forecasting Software Evolution

The study used projects with a long history and tried to forecast their last year of evolution. The mean squared relative error was:

  • 7% to 14% for the regression models,
  • 1.5% to 4% for the ARIMA models.
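
To make the evaluation procedure concrete, here is a minimal sketch of a hold-out forecast scored with a mean squared relative error. The error definition (the mean of the squared relative errors), the helper names and the monthly size series are all assumptions made for illustration. The paper's exact definition and models may differ.

```python
def mean_squared_relative_error(actual, predicted):
    """Mean of ((actual - predicted) / actual) ** 2 over all points."""
    errors = [((a - p) / a) ** 2 for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical monthly size series (SLOC) covering four years.
history = [10_000 + 400 * t + 3 * t * t for t in range(48)]

# Hold out the last 12 months and forecast them from the first 36.
train, test = history[:-12], history[-12:]
slope, intercept = fit_line(train)
forecast = [intercept + slope * t for t in range(len(train), len(history))]

msre = mean_squared_relative_error(test, forecast)
print(f"MSRE of the linear regression forecast: {msre:.4%}")
```

An ARIMA forecast would be scored in the same way: fit on the training part, predict the held-out year, and compare the two error values.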