The Inductive Software Engineering Manifesto: Principles for Industrial Data Mining


The Inductive Software Engineering Manifesto: Principles for Industrial Data Mining.

Tim Menzies, Lane Dept. of CS & EE, West Virginia University, Morgantown, USA.

Christian Bird, Thomas Zimmermann, Wolfram Schulte, Microsoft Research, Redmond, USA.

Ekrem Kocaganeli, Lane Dept. of CS & EE, West Virginia University, Morgantown, USA.





Abstract

The practices of industrial and academic data mining are very different. These differences have significant implications for (a) how we manage industrial data mining projects; (b) the direction of academic studies in data mining; and (c) training programs for engineers who seek to use data miners in an industrial setting.


Contents

  1. Introduction
  2. User-focused development
  3. The standard KDD cycle
    1. User involvement
    2. Cycle Evolution
    3. Early feedback
  4. Details
    1. Avoiding Bad Learning
    2. Data and Hypothesis Collection
    3. Tool Choice
  5. Implications
    1. Implications for Project Management
    2. Implications for Training
    3. Implications for Academic Research
  6. Conclusion
  7. References


Bibtex

@inproceedings{menzies-malets-2011,
    title = "The Inductive Software Engineering Manifesto: Principles for Industrial Data Mining",
    author = "Tim Menzies and Christian Bird and Thomas Zimmermann and Wolfram Schulte and Ekrem Kocaganeli",
    year = "2011",
    month = "November",
    booktitle = "Proceedings of the International Workshop on Machine Learning Technologies in Software Engineering",
    location = "Lawrence, Kansas, USA",
}


Notes

Introduction

We should document our methods for data mining so that newcomers become efficient sooner.

Although individual case studies exist [1], the differences between industrial and academic data mining techniques have not been clearly articulated.

User-focused development

Principle 1 - Users before algorithms
Mining algorithms are only useful in industry if users fund their use in real-world applications.
Definition 1 - Inductive Software Engineering
Understanding user goals to inductively generate the models that most matter to the user.

Fenton [2] and Valerdi & Boehm [3][4][5] show that users need to be involved in the data mining process: gather their requirements, generate their interest, and address the issues that concern them.

The standard KDD cycle

KDD: Knowledge Discovery in Databases [6].

Data -> Target Data -> Pre-processed Data -> Transformed Data -> Patterns -> Knowledge
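
A minimal sketch of the early stages of this cycle in Python (pandas and scikit-learn, assuming both are available); the file project_metrics.csv and the columns loc, churn, and defects are hypothetical:

  import pandas as pd
  from sklearn.preprocessing import StandardScaler

  raw = pd.read_csv("project_metrics.csv")                     # Data
  cols = raw[["loc", "churn", "defects"]]                      # Target Data
  clean = cols.dropna()                                        # Pre-processed Data
  X = StandardScaler().fit_transform(clean[["loc", "churn"]])  # Transformed Data
  y = clean["defects"] > 0                                     # a learnable label

Patterns and Knowledge then come from running a learner over X and y.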

Tip 1 - Most of "Data Mining" is actually "Data Pre-Processing"
Before any learner can execute, much effort must be expended on selecting and accessing the data to process, pre-processing it, and transforming it into some learnable form (the sketch above illustrates these stages).
Principle 2 - Plan for scale
In any industrial application, the data mining method is repeated multiple times to answer extra user questions, to make enhancements and/or bug fixes to the method, or to deploy it to a different set of users.
Tip 2 - Thou shall not click
For serious studies, to ensure repeatability, the entire analysis should be automated using some high-level scripting language (see the sketch below).
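
A sketch of Tips 1 and 2 combined: one small script that runs the whole analysis end-to-end, with a fixed random seed for repeatability. The input file project_metrics.csv and its defects column are hypothetical:

  import pandas as pd
  from sklearn.model_selection import cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  SEED = 1  # fixed seed: reruns give identical results

  def main(path="project_metrics.csv"):     # hypothetical input file
      data = pd.read_csv(path).dropna()     # pre-processing, scripted not clicked
      X = data.drop(columns="defects")
      y = data["defects"] > 0
      scores = cross_val_score(DecisionTreeClassifier(random_state=SEED), X, y, cv=10)
      print(f"10-fold accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

  if __name__ == "__main__":
      main()

Because every step lives in the script, the analysis can be rerun unchanged when new data arrives or a bug is fixed (Principle 2).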

The authors propose three improvements to the model:

  1. User involvement
  2. Cycle Evolution (cf. CRANE)
  3. Early feedback

Microsoft's CRANE (used internally to check the quality of software at commit time) was developed in three steps:

  1. Scout phase: gain the interest of users, define user needs.
  2. Study phase: propose models focusing on user goals.
  3. Build phase: build the product for wide deployment.

Approximate duration of the three phases, respectively: weeks, months, years. Approximate team size doubles at every step: for n scouts, have 2n studiers and 4n builders.

Principle 3 - Early Feedback
Continuous and early feedback from users allows needed changes to be made as soon as possible (e.g. when assumptions turn out not to match the users' perception) and without wasting heavy up-front investments.
Tip 3 - Simplicity first
Prior to conducting very elaborate studies, try applying very simple tools to gain rapid and early feedback (see the sketch after this list).
Principle 4 - Be open-minded
It is unwise to enter into an inductive study with fixed hypotheses or approaches, particularly for data that has not been mined before. Don't resist additional avenues when a particular idea doesn't work out.
Tip 4 - Data like surprises
Initial results often change the goals of a study when (e.g.) business plans are based on issues that are irrelevant to local data.
Principle 5 - Do Smart Learning
Important outcomes are riding on your conclusions. Make sure that you check and validate them.

Rule learners like RIPPER [7] and INDUCT [8] may help.
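
A minimal sketch of Tip 3, and of the kind of readable output rule learners produce. RIPPER and INDUCT ship with toolkits such as WEKA rather than scikit-learn; the shallow decision tree below is only a stand-in that yields similarly readable rules, and the dataset is a stand-in too:

  from sklearn.datasets import load_breast_cancer
  from sklearn.dummy import DummyClassifier
  from sklearn.model_selection import cross_val_score
  from sklearn.tree import DecisionTreeClassifier, export_text

  data = load_breast_cancer()                       # stand-in dataset
  X, y = data.data, data.target

  # simplicity first: a majority-class baseline before anything elaborate
  baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=10)
  print(f"baseline accuracy: {baseline.mean():.2f}")

  # a depth-2 tree: still simple, but it already yields readable rules
  tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)
  print(export_text(tree, feature_names=list(data.feature_names)))

If the simple model is no better than the baseline, that rapid feedback is itself a result worth showing to users.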

Tip 5 - Check the variance
Before asserting that result A is better than result B, repeat the analysis for multiple subsets of the data and perform statistical tests to check that any performance differences are not just statistical noise (see the first sketch after this list).
Tip 6 - Check the stability
Given any conclusion, see if it still holds when the analysis is repeated using (say) ten random samples, each containing 90% of the data (also sketched below).
Tip 7 - Check the support
Try to avoid conclusions based on a very small percent of the data.
Tip 8 - If you can, control data collection
The Goal-Question-Metric approach suggests putting specific data collection measures in place early to address a goal.
Tip 9 - Be smart about data collecting and cleaning
Keep in mind that collecting data comes at a cost (for example, related to hardware, runtime, and/or operations) and should not negatively affect the users' system (avoid runtime overhead). It is thus infeasible to collect all possible data. Collect data that has high return on investment, i.e. many insights for relatively little cost.
Principle 6 - Live with the data you have
You go mining with the data you have -- not the data you might want or wish to have at a later time.
Tip 10 - Rinse before use
Before learning from a data set, conduct instance or feature selection studies to see what spurious data can be removed (see the second sketch after this list). Many toolkits include feature selection tools (e.g. WEKA). As to instance selection, try clustering the data and reasoning from just a small percentage of the data in each cluster. Data cleaning has a cost too, and it is highly unlikely that you can afford perfect data; most data mining approaches can handle a limited amount of noise.
Tip 11 - Helmuth von Moltke's rule
Few hypotheses survive first contact with the data.
Principle 7 - Broad skill set, big toolset
Successful inductive engineers routinely try multiple inductive technologies.
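
Two sketches of the checks above, both on a stand-in dataset. First, Tips 5 and 6: repeat the analysis over many random subsets, test whether the performance gap between two learners beats statistical noise (here with a Mann-Whitney U test), and check whether the headline conclusion recurs across ten 90% samples:

  import numpy as np
  from scipy.stats import mannwhitneyu
  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import mutual_info_classif
  from sklearn.model_selection import ShuffleSplit, cross_val_score
  from sklearn.naive_bayes import GaussianNB
  from sklearn.tree import DecisionTreeClassifier

  data = load_breast_cancer()                        # stand-in dataset
  X, y = data.data, data.target

  # Tip 5: 20 random splits, then a statistical test on the two score samples
  cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=1)
  a = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv)
  b = cross_val_score(GaussianNB(), X, y, cv=cv)
  _, p = mannwhitneyu(a, b)
  print(f"p = {p:.3f}: {'likely real' if p < 0.05 else 'possibly noise'}")

  # Tip 6: does the top-ranked feature stay the same in ten 90% samples?
  rng = np.random.default_rng(1)
  for run in range(10):
      idx = rng.choice(len(y), size=int(0.9 * len(y)), replace=False)
      mi = mutual_info_classif(X[idx], y[idx], random_state=1)
      print(f"run {run}: top feature = {data.feature_names[mi.argmax()]}")

Second, Tip 10: a feature selection pass followed by cluster-based instance selection, keeping only the instance nearest each cluster centre:

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import SelectKBest, f_classif

  X, y = load_breast_cancer(return_X_y=True)               # stand-in dataset
  X_few = SelectKBest(f_classif, k=5).fit_transform(X, y)  # drop weak features
  km = KMeans(n_clusters=8, n_init=10, random_state=1).fit(X_few)
  reps = [int(np.argmin(((X_few - c) ** 2).sum(axis=1)))   # nearest to each centre
          for c in km.cluster_centers_]
  print(f"{len(reps)} representative instances drawn from {len(X_few)}")

Neither pass guarantees better models; they are cheap screens to run before the expensive studies.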

Buse and Zimmermann defined nine kinds of decision-making needs found in a sample of industrial practitioners [9]:

                            Past            Present         Future
  Exploration (find)        trends          alerts          forecasting
  Analysis (explain)        summarization   overlays        goals
  Experimentation (what-if) modeling        benchmarking    simulation
Tip 12 - Big ecology
Use tools supported by a large ecosystem of developers who are constantly building new learners and fixing old ones, e.g. R, WEKA, MATLAB.

Implications

Conclusion

References

  1. A. Tosun, B. Turhan, and A. Bener. Practical considerations in deploying AI for defect prediction: a case study within the Turkish telecommunications industry. In Proceedings of the 5th International Conference on Predictor Models in Software Engineering, page 11. ACM, 2009.
  2. N. Fenton, M. Neil, W. Marsh, P. Hearty, L. Radlinski, and P. Krause. Project data incorporating qualitative factors for improved software defect prediction. In PROMISE'07, 2007. Available from http://promisedata.org/pdf/mpls2007FentonNeilMarshHeartyRadlinskiKrause.pdf .
  3. B. Boehm, E. Horowitz, R. Madachy, D. Reifer, B. K. Clark, B. Steece, A. W. Brown, S. Chulani, and C. Abts. Software Cost Estimation with Cocomo II. Prentice Hall, 2000.
  4. R. Valerdi. Convergence of expert opinion via the wideband delphi method: an application in cost estimation models. In INCOSE International Symposium, Denver, USA, 2011. Available from http://goo.gl/Zo9HT .
  5. R. Valerdi, C. Miller, and G. Thomas. Systems engineering cost estimation by consensus. In 17th International Conference on Systems Engineering, September 2004.
  6. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, pages 37-54, Fall 1996.
  7. W. Cohen. Fast effective rule induction. In ICML'95, pages 115-123, 1995. Available from http://www.cs.cmu.edu/~wcohen/postscript/ml-95-ripper.ps .
  8. B. Gaines and P. Compton. Induction of ripple down rules. In Proceedings, Australian AI'92, pages 349-354. World Scientific, 1992.
  9. R. P. L. Buse and T. Zimmermann. Information needs for software development analytics. MSR-TR-2011-8, 2011.

Download

The article can be found here: http://thomas-zimmermann.com/publications/files/menzies-malets-2011.pdf

Always prefer the above download if available.
