Data Mining for Software Engineering

From Maisqual Wiki

Jump to: navigation, search

Data Mining for Software Engineering

Tao Xie and Suresh Thummalapenta, North Carolina State University

David Lo, Singapore Management University

Chao Liu, Microsoft Research


Paper Summary

  1. Abstract
  2. Mining methodology
  3. Mining challenges
    1. Requirements unique to SE
    2. Complex data and patterns
    3. Large-scale data
    4. Just-in-time mining
  4. Mining sequences
    1. Iterative pattern matching
    2. Temporal rule matching
    3. Sequence-diagram and FSM mining
    4. Sequence association rule mining
  5. Mining graphs
    1. Discriminative graph mining
    2. Graph classification
    3. Mining text
  6. Conclusion


 author = {Tao Xie and Suresh Thummalapenta and David Lo and Chao Liu.},
 title = {Data Mining for Software Engineering},
 journal = {IEEE Computer},
 year = {2009},
 volume = {42},
 number = {8},
 pages = {35--42},
 month = {August},
 url = {},       


The file can be downloaded from here: File:Data mining for software engineering.pdf.


Mining SE has two main trends that should make it successful:

  • abundance of data,
  • helpfulness in solving real-world problems.

For a project several data can be searched:

  • CVS, SVN, and other modern CM tools have the ability to track and record changes in history.
  • Bugzilla provides complete lifecycle pub management and querying.
  • Rich execution data available with specific tools -- e.g. Dr Watson, MS tools.

SE data has the 3P: People, Process, Product.

Data Mining can help in many day-to-day tasks

  • Provide help when documentation is not enough,
  • Provide changes code location (e.g. modify read if write is modified),
  • Hunt for potential bugs or identify buggy LOCs for known problems,

Mining Methodology

  1. Collect/Inventory SE Data
  2. Determine SE Tasks -- and go back to (1) if needed.
  3. Preprocess data
  4. Adopt/adapt/develop mining algorithms
  5. Post-process/apply mining results

Usually (1) and (2) are mixted. This can be a problem-driven approach ((2)->(1)) or a data-driven approach ((1)->(2)).

Step (3) extracts relevant data from raw data, clean and format for the mining algorithm

From steps 1 and 2, get requirements and produce the algorithm. There 4 of them:

  • frequent pattern mining (find common patterns)
  • pattern matching (find data instances for patterns)
  • clustering (grouping data into clusters)
  • classification (predict label based on already labeled data).

There should be data mining algorithms written especially for software engineering data mining. But people writing algorithms and people knowing the exact requirements/needs rarely work together.

The difficulties are:

  • SE Data and patterns are complex
  • Data would be relevant for large-scale repositories and searches
  • Just-In-Time data are important as well; e.g. at the IDE stage for on-the-fly analyis and feedback.

In modern integrated SE environments, software engineers shall be able to collect and mine SE data on the fly to provide rapid JIT feedback.

Mining sequences

Common types of mining algorithms are:

  • frequent pattern matching
  • sequence matching, clustering and classification

The different algorithms do not provide the same information. New algorithms should include mining:

Iterative pattern matching

Capture repetitive occurrences of the patterns within each trace and across multiple traces[1]. See this example: 1. ABCDEAXBC 2. AGXBC 3. AXBC

Mine the set of frequent closed iterative subsequence patterns, at the minimum support threshold of 4 -> {<ABC>}.

Mined iterative patterns could be used as high level features to classify program behaviours.

Temporal Rule Mining

If a series of events occurs, eventually another series of events will occur. In our example <A> -> <BC>.

Sequence diagrams and FSM mining

Patterns and rules may not catch every information available, e.g.

  • UML sequence diagrams has caller and callee
  • Patterns and rules do not express disjunctions, e.g. loop and branches.

Patterns and rules capture strongly observed properties whereas FSM capture the overall view of events, ignoring support and confidence.

Sequence Association rules

Sequence Association rules[2] are defined as (FC^1_c .. FC^n_c) /\ FC_a => (FC^1_e .. FC^m_e)

Used e.g. for database statement execution. Example: "execute statement, if exception then rollback" is true excepted for read-only statements -> we have to get the exact sequence to get the rule.

Authors have created algorithms that mine frequent sequences and post-process them to get rules. They applied them to 5 real-world applications and got 294 exception-handling rules detecting 160 exception-handling defects in code.

Mining Graphs

Graphs can be extracted from static analysis or dynamic call graphs. Graph mining algorithms can be applied to them: frequent subgraph mining (shows the programming rules), graph matching (shows potential bugs), graph classification and graph clustering.

Discriminative graph mining

top-k most discriminative graphs can serve as signatures where bugs occur => helps software engineers to locate and fix them.

Subgraphs can be designed at (a) method level or (b) basic blocks level. Results are: for (a) 74% precision, 91% recall and (b) 59% precision, 73% recall.

Graph classification

Dynamic call graph between erroneous and correct executions may differ. Divergence time and location gives hints on bug location.

Given a big enough set of correct and bad execution traces, a classifier algorithm can find the location of divergence.

Mining text

Examples of text are: bug reports, e-mails, code comments, documentation.

Common algorithms are:

  • Text clustering: helps for bug reports duplicates findings, assign bug reports automatically.
  • Text classification: helps for assign bugs to developers based on past assignement.
  • text matching: helps detecting dups in bug reports.

If duplicates are detected, assign them to the same developer (authors describe an algorithm that detects duplicates).

Algorithms need Natural Language Processing (NLP) pre-processing before data-mining mechanics: thesaurus, synonyms, dictionaries.


  • remove stop words (on, or, of), NLP processing,
  • transform docs into vectors,
  • calculate similarities between vectors.

In some cases, vector similarities are not enough: in those cases, analysing traces and finding similarity on traces can help (e.g. with sequence matching). Use some heuristics to weight both approaches. Benchmarks on firefox bug repositories can find 67% to 93% of duplicate reports.

Future research

Authors draw the following directions for future research:

  • In the pattern-mining domain, increased scalability of mining algorithms and expressibility of mined results.
  • Classification algorithms need more accuracy.
  • For the pattern-matching domain, more efficient approximate matching algorithms.


  1. D. Lo, S-C Khoo, and C. Liu, "Efficient Mining of Iterative Patterns for Software Specification Discovery", Proc. 13th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD'07), ACM Press, 2007, pp. 460-469.
  2. S. Thummalapenta and T. Xie, "Mining Exception-Handling Rules as Sequence Association Rules", Proc. 31st Int'l Conf. Software Eng. (ICSE 09), ACM Press, 2009, pp. 496-506.
Personal tools