Software Engineering Metrics: What Do They Measure And How Do We Know

From Maisqual Wiki


Software Engineering Metrics: What Do They Measure And How Do We Know


Cem Kaner, Senior Member, IEEE.

Walter P. Bond.


Contents



  1. Introduction
  2. What are we measuring?
    1. Defining Measurement
    2. Developing a Set of Metrics
    3. Direct Measurement
  3. A Framework for Evaluating Metrics
    1. Defining Measurement
    2. The Evaluation Framework
    3. Applying the Evaluation Framework
  4. A More Qualitative Approach to Qualitative Attributes
  5. Conclusion


Bibtex

@article{ Kaner_2004, 
 title={Software Engineering Metrics: What Do They Measure and How Do We Know?}, 
 volume={8},  
 journal={Direct},  
 publisher={Citeseer}, 
 author={Kaner, C. and Bond, W.}, 
 year={2004}, 
 pages={1--12}
}


Notes

Abstract

This paper deals with construct validity: how do we know that we are measuring the attribute that we think we are measuring? Even supposedly direct measures should be validated: few or no software engineering tasks or attributes are so simple that their measure can truly be direct. Thus:

  • All metrics should be validated.
  • Multidimensional analyses of attributes provide better accuracy.

Introduction

Metrics are not widely used, mainly because:

  1. the field is still immature,
  2. metrics have some cost (4% of the project cost, according to Fenton[1]),
  3. the way metrics are used may cause more harm than good.

As a consequence, using metrics that are inadequately validated, insufficiently understood and not tightly tied to the attributes they are intended to measure induces distortions and dysfunctions.

What are we measuring?

Defining measurement

There are several definitions for measurement. We will retain the definition of Fenton[2]: "Formally, we define measurement as a mapping from the empirical world to the formal, relational world. Consequently, a measure is the number or symbol assigned to an entity by this mapping in order to characterize an attribute."

Other definitions include:

  • Measurement is the assignment of numbers to objects or events according to rule[3]. The rule of assignment can be any consistent rule. The only rule not allowed would be random assignment, for randomness amounts in effect to a nonrule.
  • Measurement is the process of empirical, objective, assignment of numbers to properties of objects or events of the real world in such a way as to describe them.[4]
  • Measurement is the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to characterize them according to clearly defined rules.[2]
  • Measurement is "the act or process of assigning a number or category to an entity to describe an attribute of that entity."[5]
  • Fundamental measurement is a means by which numbers can be assigned according to natural laws to represent the property, and yet which does not presuppose measurement of any other variable.[6]
  • And others as well: [7][2][8][9][10]
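
As a rough illustration of measurement as a mapping (not taken from the paper), the sketch below assigns a number to an entity (a source file) according to an explicit, consistent rule in order to characterize its length attribute; the choice of counting non-blank lines is an assumption of the example.

 # Python sketch: measurement as a mapping from the empirical world (a source
 # file) to the formal world (an integer characterizing the "length" attribute).
 def measure_length_in_loc(path: str) -> int:
     """Assign a number to a source file according to an explicit rule:
     count its non-blank lines (one of many possible rules for 'length')."""
     with open(path, encoding="utf-8") as source:
         return sum(1 for line in source if line.strip())

 # The rule is consistent and non-random, which is what the definitions above
 # require; whether it captures the attribute we care about is another question.
 # print(measure_length_in_loc("example.c"))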

Developing a set of metrics

IEEE Standard 1061 lays out a methodology for developing metrics for software quality attributes.

  • attribute: a measurable physical or abstract property of an entity.
  • quality factor: a type of attribute; a management-oriented attribute of software that contributes to its quality.
  • metric: a measurement function.
  • software quality metric: a function whose inputs are software data and whose output is a single numerical value that can be interpreted as the degree to which the software possesses a given attribute that affects its quality.

For each quality factor, assign one or more direct metrics to represent the quality factor, and assign direct metric values to serve as quantitative requirements for that quality factor.
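
A minimal sketch of that assignment step, with purely hypothetical quality factors, metrics and target values (none of them come from the standard):

 # Hypothetical IEEE 1061-style assignment: each quality factor gets one or more
 # direct metrics plus a quantitative requirement. Names and thresholds are
 # invented for illustration only.
 quality_requirements = {
     "reliability": [
         {"metric": "mean_time_to_failure_hours", "target": ">= 500"},
     ],
     "maintainability": [
         {"metric": "mean_time_to_fix_hours", "target": "<= 8"},
         {"metric": "defect_density_per_kloc", "target": "<= 0.5"},
     ],
 }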

Standard 1061 proposes the following criteria for metrics validation (a small numerical sketch of the correlation and consistency checks follows the list):

  • Correlation: the metric should be linearly related to the quality factor, as measured by the statistical correlation between the metric and the corresponding quality factor.
  • Consistency: let F be the quality factor and Y the output of the metric function, M: F -> Y. M must be a monotonic function; that is, if f1 < f2 < f3 then y1 < y2 < y3.
  • Tracking: let M: F -> Y. As F changes from f1 to f2 in real time, M(f) should change promptly from y1 to y2.
  • Predictability: let M: F -> Y. If we know Y at some point in time, we should be able to predict F.
  • Discriminative power: let M: F -> Y. For a significant variation of F, Y should show a significant variation as well.
  • Reliability: "A metric shall demonstrate the correlation, tracking, consistency, predictability and discriminative power for at least P% of the applications of the metric."
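
Below is a small numerical sketch (not from the paper or the standard) of the correlation and consistency checks, using made-up paired observations of a quality factor F and a metric output Y:

 # Rough sketch of two IEEE 1061 validation checks on made-up data.
 import numpy as np
 from scipy import stats

 f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # quality factor values (hypothetical)
 y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # corresponding metric outputs (hypothetical)

 # Correlation: how linearly related the metric is to the quality factor.
 r, _ = stats.pearsonr(f, y)

 # Consistency: the mapping should be monotonic; a Spearman rank correlation
 # of 1.0 means the ordering of F is perfectly preserved by Y.
 rho, _ = stats.spearmanr(f, y)

 print(f"correlation r = {r:.3f}, rank correlation rho = {rho:.3f}")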

"Direct" measurement

A direct metric is a metric that does not depend upon a measure of any other attribute. In other words, a direct metric is a function of a single attribute: M: F -> Y. Examples of direct measures: lines of code (LOC), etc.

An indirect or derived measure relies on more than one measure. In other words, a derived measure is a function whose domain is an n-tuple: M: (F, G, H) -> Y. Examples of indirect measures: programmer productivity, defect density, etc.
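
A trivial sketch of the distinction, with invented numbers: the first value depends on a single measured attribute, while defect density is derived from two measures.

 # Direct-style measure: one attribute, program size in lines of code (invented).
 lines_of_code = 12_000

 # Derived measure: defect density combines two measures, defect count and size.
 defects_found = 30
 defect_density_per_kloc = defects_found / (lines_of_code / 1000)

 print(f"defect density: {defect_density_per_kloc:.2f} defects per KLOC")  # 2.50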

Standard 1061 offers MTTF (Mean Time To Failure) as an example of a direct metric for reliability. This can be disputed, considering different usage profiles, different levels of user knowledge, and different notions of failure.

Any measure that includes human behaviour (as MTTF does) relies on many (possibly hard to quantify) parameters, which by definition makes it no longer direct.

Other "direct" metrics (as given by Fenton & Pfleeger[2]) that could be decomposed and questioned with respect to the attribute they are supposed to measure include:

  • Length of source code (measured in lines of code): what is a line? What does it mean, given the language, coding and naming conventions? Do all lines have the same impact on the attribute being measured (size)?
  • Duration of the testing process (elapsed time in hours): what impact do hardware or algorithms have on the time spent testing? What is a test? Are all tests equal?
  • Number of defects discovered during the testing process: what is a defect? Can we compare a missing comma, a wrong copyright notice and a buffer overflow? What about poor specifications? What is the testing process: where does it start, where does it stop?
  • Time a programmer spends on a project: do all programmers have the same efficiency? How do we count: manually run clocks, automatic typing detection, work hours? What counts as part of the project and what does not: common modules, meetings?

Rather than defining a metric in terms of whatever we can readily compute, the authors prefer to think about:

  1. the question we want answered,
  2. the nature of the information (attributes) that could answer the question, and
  3. then define measures that can address those attributes in that context.

Distinguishing between direct and derived measures may not be that valuable. All metrics need validation, even the supposedly direct ones.

A framework for evaluating metrics

Construct validity refers to the question: how do you know that you are measuring what you think you are measuring?

Construct validity is rarely addressed in software engineering.

Fenton and Melton raise another question: how much do we need to know about an attribute before it is really reasonable to consider measuring it?

The representational theory of measurement is laid out in [4][7][8] and critiqued in [11]. Applied to the field of software engineering, it is presented in detail in [2][9][10], and Fenton and Melton summarize it in [12].

But the theory is deep and complex, and not well enough understood by practitioners.

Defining measurement

There are two theories in the measurement field: representational and traditional. If measurement is "assigning numbers or symbols [...] according to clearly defined rules", what is the nature of the rules?

In the traditional view, the numbers are assigned according to natural laws. That is, the model behind the theory is causal, based on natural laws.

For the representational theory, a causal model is still under discussion. IEEE 1061 refers to correlation as a means of validating a measure, but this is a weak and risk-prone substitute for a causal model[2].

Thus the definition retained is: Measurement is the empirical, objective assignment of numbers, according to a rule derived from a model or theory, to attributes of objects or events with the intent of describing them.

The evaluation framework

The questions to be asked to evaluate a proposed metric are:

  1. What is the purpose of this measure? e.g. self-assessment, evaluating project status, informing customers, informing authorities.
  2. What is the scope of this measure? e.g. a method written by a single developer, a project developed by a workgroup, a year's work of a workgroup.
  3. What attribute are we trying to measure?
  4. What is the natural scale of the attribute we are trying to measure? e.g. numbers, symbols, ratio, size.
  5. What is the natural variability of the attribute? e.g. the weight of a person changes from day to day (even within a single day).
  6. What is the metric (the function that assigns a value to the attribute)? What measuring instrument do we use to perform the measurement? e.g. counting, timing, matching, comparing.
  7. What is the natural scale for this metric? It may not be the same as the attribute's.
  8. What is the natural variability of readings from this instrument? a.k.a. measurement error.
  9. What is the relationship of the attribute to the metric value? What model relates the value of the attribute to the value of the metric? If the value of the attribute increases by 20%, why should we expect the measured value to increase, and by how much?
  10. What are the natural and foreseeable side effects of using this instrument? It is possible to improve the measured result without improving the attribute[13].

Applying the Evaluation Framework

The preceding framework is applied to a widely used metric: defect count.

"Models have been presented to tie bug curves to project status [...] [14][15] Simmons plots the time of reporting of medium and high severity bugs, fits the curve to a Weibull distribution and estimates its two parameters, the shape parameter and the characteristic life. From characteristic life, he predicts the total duration of the testing phase of the project."

Measuring testers by their bug counts will encourage:

  • hunting for superficial bugs rather than deep, hard-to-find ones,
  • skipping work that does not add to the bug count (coaching, writing tools, etc.),
  • spending less time on test documentation and on good bug reports (time spent on a good report could be spent finding another bug),
  • political problems: a tester can be made to look brilliant by being assigned unstable code (and another made to look bad by being assigned stable, hard-to-set-up code), and programmers may become less responsive, knowing that testers 'have to' file many bugs.

A more qualitative approach to qualitative attributes

The multidimensional approach breaks down software engineering tasks or attributes into a collection of related sub-attributes.

As an example, consider a test manager who wants to evaluate his staff's performance. Testers do a variety of tasks, such as bug-hunting, bug reporting, test planning, and test tool development. To fully evaluate the work of a tester, you probably want to measure the quality of work on each of these tasks. Draw up a list of the factors you want to measure, and rate each of them.

The authors have used rubrics for this: a rubric is a table with the dimensions of work (communication, coverage, clarity, reproducibility, etc.) as rows and ratings as columns (e.g. bad, acceptable, good). This works well when corporate standards are defined, because it reduces the human factor (judging against defined standards rather than personal preferences). Furthermore, the goal is not to slip into micromanagement, which would be counter-productive.

Doing this many times, on different bug reports, will give you an idea of how the tester performs: e.g. the tester may be excellent at bug-hunting but only acceptable at test planning. Note also that the measure should accommodate diversity, because there are many ways to do a task well.
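
A small scoring sketch along these lines; the dimensions, the three-level scale and the scores are invented for the example and would in practice come from the defined corporate standards.

 # Hypothetical rubric tally: dimensions of bug-report quality, rated per report
 # on a simple ordinal scale, then averaged per dimension (not collapsed into
 # a single number, to keep the multidimensional picture).
 RATINGS = {"bad": 0, "acceptable": 1, "good": 2}

 reviews = [  # one tester, three bug reports (invented scores)
     {"communication": "good", "coverage": "acceptable", "reproducibility": "good"},
     {"communication": "good", "coverage": "bad", "reproducibility": "acceptable"},
     {"communication": "acceptable", "coverage": "acceptable", "reproducibility": "good"},
 ]

 for dimension in reviews[0]:
     average = sum(RATINGS[review[dimension]] for review in reviews) / len(reviews)
     print(f"{dimension}: {average:.2f}  (0 = bad, 2 = good)")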

Conclusion

There are too many simplistic metrics that do not capture the essence of whatever they are supposed to measure, and too many uses of these simplistic measures. The authors believe that rethinking metrics may lead to more complex and more qualitative measures, but would also yield more meaningful and useful data.


References

  1. N.E. Fenton, "Software Metrics: Successes, Failures & New Directions", presented at ASM 99: Applications of Software Measurement, San Jose, CA, 1999, http://www.stickyminds.com/s.asp?F=S2624_ART_2.
  2. N.E. Fenton and S.L. Pfleeger, "Software Metrics: A Rigorous and Practical Approach", 2nd revised ed. Boston: PWS Publishing, 1997.
  3. S.S. Stevens, "On the Theory of Scales of Measurement", Science, vol. 103, pp. 677-680, 1946.
  4. L. Finkelstein, "Theory and Philosophy of Measurement", in Theoretical Fundamentals, vol. 1, Handbook of Measurement Science, P.H. Sydenham, Ed. Chichester: John Wiley & Sons, 1982, pp. 1-30.
  5. See Standard IEEE 1061.
  6. W.S. Torgerson, Theory and Methods of Scaling. New York: John Wiley & Sons, 1958.
  7. J. Pfanzagl, Theory of Measurement, 2nd revised ed. Würzburg: Physica-Verlag, 1971.
  8. D.H. Krantz, R.D. Luce, P. Suppes, and A. Tversky, Foundations of Measurement, vol. 1. New York: Academic Press, 1971.
  9. H. Zuse, A Framework of Software Measurement. Berlin: Walter de Gruyter, 1998.
  10. S. Morasca and L. Briand, "Toward a Theoretical Framework for Measuring Software Attributes", presented at the 4th International Software Metrics Symposium (METRICS'97), Albuquerque, NM, 1997.
  11. J. Michell, Measurement in Psychology: A Critical History of a Methodological Concept. Cambridge: Cambridge University Press, 1999.
  12. N.E. Fenton and A. Melton, "Measurement Theory and Software Measurement", in Software Measurement, A. Melton, Ed. London: International Thomson Computer Press, 1996, pp. 27-38.
  13. A measurement system yields distortion if it creates incentives for the employee to allocate this time so as to make the measurements look better rather than to optimize for achieving the organization's actual goals for this work.
  14. E. Simmons, "When Will We Be Done Testing? Software Defect Arrival Modeling Using the Weibull Distribution", presented at the Pacific Northwest Software Quality Conference, Portland, OR, 2000. http://www.pnsqc.org
  15. E. Simmons, "Defect Arrival Modeling Using the Weibull Distribution", presented at International Software Quality Week, San Francisco, CA, 2002.