Data Mining Treatments

From Maisqual Private Wiki

This article describes how we can use data mining techniques to achieve our goals.


Objectives

We want to:

  • Measure software quality, for both process and products.
  • Advise the project's actors so that they can reach the goals they have set for the software project.

We also think that:

  • Process quality is measured through its output, i.e. the shipped product. Process-oriented measures exist, but they do not measure the quality of the process itself. Rather, they give insight into some characteristics of the process, which is helpful as well.
  • One interesting track for measuring quality could be multidimensional analysis, which takes several base measures and composes them into a quantitative or qualitative measure. Data mining algorithms could help weight the different metrics, depending on the characteristics of the software [1], and give a more relevant measure of what we want to know.

Considering these points, three questions arise.

Measure software quality

The example data needed are the following:

For each project, we must know its evaluated quality together with the values of its metrics.

Example: We know that a project has good reliability, according to the ISO/IEC 9126 standard. We then try to correlate the different metrics with that good result.
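The correlation step in the example can be sketched as follows. This is only an illustration under stated assumptions: the metric names, the values and the 0/1 reliability rating are hypothetical, and Pearson correlation is one possible choice among many, not a method chosen by the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no information
    return cov / sqrt(var_x * var_y)


def rank_metrics(metrics, target):
    """Sort metrics by the strength (absolute value) of their correlation."""
    scores = {name: pearson(values, target) for name, values in metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data: one entry per evaluated project.
reliability = [1, 1, 0, 0, 1]  # 1 = rated reliable, 0 = not
metrics = {
    "test_coverage": [0.8, 0.7, 0.2, 0.3, 0.9],
    "avg_complexity": [4.0, 5.0, 12.0, 9.0, 3.0],
    "comment_ratio": [0.2, 0.1, 0.2, 0.1, 0.2],
}

for name, score in rank_metrics(metrics, reliability):
    print(f"{name}: {score:+.2f}")
```

With so few projects the coefficients are not statistically meaningful; the sketch only shows the shape of the data that is needed.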

Check the Characteristics of Quality page for what could define a good or bad project.

Evaluating many projects against ISO/IEC 9126 alone would teach the algorithm to recognise only the ISO/IEC 9126 quality characteristics. We have to find different means of evaluating quality and mix them.

We could:

  • Use the ISO/IEC 9126 standard, which is well-known and widely accepted.
  • Use other standards to assess quality. We have to be careful with the following caveats:
    • The metrics used should not be too close from one standard to another.
    • The semantics of the quality characteristics differ between standards, which may lead to bad mappings.
  • Manually evaluate some projects that are known to be good or bad. These projects would probably be the most interesting ones for the learning algorithm.

Multidimensional Analysis: select the right metrics

An alternative method to assess quality could be to use multidimensional analysis to dynamically weight the different metrics, depending on the context of the evaluation: size of team, language, domain, etc.
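As a rough sketch of what context-dependent weighting could look like, the snippet below picks a weight table from the project context and computes a weighted average. The contexts, metric names and weights are all hypothetical; in the approach described here they would be learned by the data mining algorithm rather than written by hand.

```python
def composite_score(normalised_metrics, weights):
    """Weighted average of metric scores already normalised to [0, 1]."""
    total_weight = sum(weights.get(name, 0.0) for name in normalised_metrics)
    if total_weight == 0:
        raise ValueError("no metric has a non-zero weight in this context")
    weighted = sum(
        score * weights.get(name, 0.0)
        for name, score in normalised_metrics.items()
    )
    return weighted / total_weight


# Hypothetical weight tables, keyed by (language, team size class).
WEIGHTS_BY_CONTEXT = {
    ("java", "small_team"): {"complexity": 0.2, "test_coverage": 0.5, "comments": 0.3},
    ("c", "large_team"): {"complexity": 0.5, "test_coverage": 0.3, "comments": 0.2},
}


def score_project(normalised_metrics, language, team_class):
    """Score a project with the weights that match its context."""
    weights = WEIGHTS_BY_CONTEXT[(language, team_class)]
    return composite_score(normalised_metrics, weights)


project_metrics = {"complexity": 0.9, "test_coverage": 0.4, "comments": 0.6}
print(score_project(project_metrics, "java", "small_team"))
print(score_project(project_metrics, "c", "large_team"))
```

The same metric values yield different scores in the two contexts, which is the behaviour the weighting is meant to provide.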

The example data needed are the following:

We must know, for a given set of project characteristics, which metrics are relevant and which are not.

Example: Project XYZ is a good project that:

  • has 13341 SLOC,
  • is written in Java,
  • has been developed by 10 people,
  • is targeted at the desktop.

The McCabe metric gives bad results for it (and we know for sure that they should be good), and is thus not relevant in this context.
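One simple way to turn this reasoning into a check is to compare a metric's verdicts with the known quality of several projects that share the same context. The sketch below assumes verdicts are plain "good"/"bad" labels and uses an arbitrary 0.7 agreement threshold; both are illustrative assumptions.

```python
def agreement_rate(metric_verdicts, known_quality):
    """Fraction of projects where the metric's verdict matches the known quality."""
    if len(metric_verdicts) != len(known_quality) or not known_quality:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(1 for v, k in zip(metric_verdicts, known_quality) if v == k)
    return matches / len(known_quality)


def is_relevant(metric_verdicts, known_quality, min_agreement=0.7):
    """Treat a metric as relevant when it agrees often enough (assumed threshold)."""
    return agreement_rate(metric_verdicts, known_quality) >= min_agreement


# Hypothetical verdicts of one metric on three projects known to be good.
known = ["good", "good", "good"]
mccabe_verdicts = ["bad", "bad", "good"]
print(is_relevant(mccabe_verdicts, known))
```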

The learning algorithm could be fed with:

  • Known projects.
  • Limits described in research articles: many papers have studied the limits of metrics, e.g. Halstead is no longer pertinent for volumes above 100K SLOC.
  • Known rules, retrieved from expert knowledge [2]. As an example, there are some known cases for which the McCabe measure is not relevant.
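Limits and expert rules of this kind could be encoded as exclusion rules that are applied to a project's characteristics before any metric is used. The sketch below is an assumption about how such rules might be represented; the Project fields mirror the XYZ example, and the only rule shown reuses the Halstead threshold quoted above without verifying it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Project:
    sloc: int
    language: str
    team_size: int
    target: str


@dataclass
class ExclusionRule:
    metric: str                          # metric the rule rules out
    reason: str                          # where the rule comes from
    applies: Callable[[Project], bool]   # True when the metric must be ignored


# Hypothetical rule base; the threshold is the one quoted in the list above.
RULES: List[ExclusionRule] = [
    ExclusionRule(
        metric="halstead",
        reason="not pertinent above 100K SLOC",
        applies=lambda p: p.sloc > 100_000,
    ),
]


def relevant_metrics(project, candidates, rules=RULES):
    """Return the candidate metrics that no rule excludes for this project."""
    excluded = {rule.metric for rule in rules if rule.applies(project)}
    return [metric for metric in candidates if metric not in excluded]


xyz = Project(sloc=13341, language="java", team_size=10, target="desktop")
print(relevant_metrics(xyz, ["halstead", "mccabe"]))
```

Keeping the reason next to each rule makes it possible to trace an exclusion back to the article or expert it came from.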

For that purpose we have to find as many research articles as possible demonstrating the limits of metrics.

Giving the right advice

The example data needed are the following:

We must know, for a project's history, which practices have caused which impacts.

Example: For a project, the reliability has been greatly improved between milestones v1.34 and v1.35 because of the number of tests written.

Data mining on software repositories could be of great help for detecting such reasons, e.g. by finding known patterns. See maisqual:Monitoring Software Quality Evolution for Defects.
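A very simple form of such detection is to look at consecutive milestones, find those where a quality characteristic jumped, and report the measure that changed the most over the same interval. The sketch below uses hypothetical history data and field names. It only produces candidate explanations: a measure that moved at the same time as the quality is not necessarily its cause.

```python
def relative_change(old, new):
    """Relative change from old to new; infinite when starting from zero."""
    if old == 0:
        return float("inf") if new != 0 else 0.0
    return (new - old) / abs(old)


def candidate_causes(history, target, min_gain):
    """For each milestone pair where `target` rose by at least `min_gain`,
    return (from_version, to_version, measure with the largest relative change)."""
    findings = []
    for (v_old, m_old), (v_new, m_new) in zip(history, history[1:]):
        if m_new[target] - m_old[target] < min_gain:
            continue
        changes = {
            name: relative_change(m_old[name], m_new[name])
            for name in m_old
            if name != target and name in m_new
        }
        if not changes:
            continue
        top = max(changes, key=lambda name: abs(changes[name]))
        findings.append((v_old, v_new, top))
    return findings


# Hypothetical project history: (milestone, measures at that milestone).
history = [
    ("v1.33", {"reliability": 0.60, "tests": 100, "sloc": 13000}),
    ("v1.34", {"reliability": 0.62, "tests": 110, "sloc": 13200}),
    ("v1.35", {"reliability": 0.85, "tests": 400, "sloc": 13341}),
]

print(candidate_causes(history, "reliability", min_gain=0.1))
```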

See also




References

  1. E.g. the size of code, number of developers, process used, planned maintenance time, domain (e.g. video player, automatic pilot of a plane, networked system), etc.
  2. Christophe, we are going to suck your brain dry! :)