Improving Software Quality via Code Searching and Mining

Madhuri R. Marri, Department of Computer Science, North Carolina State University

Suresh Thummalapenta, Department of Computer Science, North Carolina State University

Tao Xie, Department of Computer Science, North Carolina State University



  1. Abstract
  2. Introduction
  3. Life-Cycle Model
    1. Searching
    2. Mining
  4. Application of the Life-Cycle Model
    1. Tasks Assisted by Life-Cycle Model
    2. Preliminary Evaluation
  5. Conclusion


 author = {Marri, Madhuri R. and Thummalapenta, Suresh and Xie, Tao},
 title = {Improving software quality via code searching and mining},
 booktitle = {Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation},
 series = {SUITE '09},
 year = {2009},
 isbn = {978-1-4244-3740-5},
 pages = {33--36},
 numpages = {4},
 url = {},
 doi = {},
 acmid = {1556992},
 publisher = {IEEE Computer Society},
 address = {Washington, DC, USA},


The file can be downloaded from here: File:Improving sw quality via code searching and mining.pdf.


This paper presents three development tasks that can be improved by exploiting code search engines (CSEs), and shows how these tasks can help improve software quality.

Examples of CSEs are: Google Code Search (GCS), Krugle, Koders, Sourcerer, Codase.


CSEs are often limited to simple tasks, e.g. searching for relevant code examples.

The three development tasks shown in this paper are:

  • Learn how to use an API by discovering its programming rules.
  • Use mined patterns to detect defects in a program.
  • Infer a fix for a detected bug.

Existing approaches to the first task share one limitation: they mine from a small amount of data. As a result, the mined patterns cover only the most common programming examples, and relatively few real programming rules are inferred. Could these results be improved by exploiting the enormous amount of open-source code available?

The problem is that the mining algorithms used are often not scalable. The authors therefore propose to query CSEs for a specific pattern and to mine only the few results returned.

The lifecycle proposed is self-fed: mined patterns help write new code, which is in turn used as input for mining.

This study shows that:

  • This model addresses issues that cannot be addressed by a CSE or a mining technique alone.
  • Code examples gathered from CSEs require post-processing before mining algorithms (e.g. for common usage patterns) can be applied.

Life-Cycle Model

The life-cycle model has two steps: searching and mining.


Searching has two phases: query construction and duplicate elimination.

Example of query: "lang:java org.apache.regexp.RE"

Experience shows that the relevance of the results depends on how the query is formulated. Some CSEs (e.g. GCS) accept regular expressions; a well-crafted query both reduces the number of code examples returned and makes the results more accurate. Example of a precise query: "lang:c file:.c$ [\s\*]fopen [\s]??\(". The relevance of the code examples matters greatly for the subsequent mining step.
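
To illustrate query construction, here is a minimal Python sketch. The helpers build_query and keep_relevant are hypothetical names, not part of any CSE API, and the Python regular expression is a simplified stand-in for a precise query, not the CSE's own syntax. It only shows how a stricter pattern discards hits that merely mention the API name.

```python
import re


def build_query(language, api_name, file_pattern=None):
    """Assemble a query string such as 'lang:java org.apache.regexp.RE'.

    The 'lang:' and 'file:' prefixes follow the example queries quoted in
    the text; the function itself is a hypothetical helper.
    """
    parts = [f"lang:{language}"]
    if file_pattern:
        parts.append(f"file:{file_pattern}")
    parts.append(api_name)
    return " ".join(parts)


# Assumption: an actual call to fopen is the name followed by optional
# whitespace and an opening parenthesis.
CALL_PATTERN = re.compile(r"\bfopen\s*\(")


def keep_relevant(lines):
    """Keep only the lines that contain an actual call to fopen."""
    return [line for line in lines if CALL_PATTERN.search(line)]


hits = [
    'FILE *f = fopen("a.txt", "r");',
    "/* remember to fopen the log */",
    "my_fopen_wrapper(path);",
    'fp = fopen (path, "rb");',
]
print(build_query("java", "org.apache.regexp.RE"))
print(keep_relevant(hits))
```

A keyword-only search would return all four lines; the stricter pattern keeps only the two real calls, which is the kind of noise reduction a precise query aims for.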

Duplicates have to be identified. They may or may not be meaningful: several hits can point to the same file and thus introduce bias, but duplication can also mean that the code is widely used and therefore more trustworthy.
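
One possible way to handle duplicates is sketched below in Python. It is an assumption for illustration, not the authors' implementation: examples are compared by a hash of their whitespace-normalized text, one copy is kept, and the number of occurrences is retained so that widely copied code can still be weighted as more trustworthy. The function names and URLs are hypothetical.

```python
import hashlib
from collections import Counter


def normalize(source):
    """Strip surrounding whitespace and drop blank lines before hashing."""
    stripped = (line.strip() for line in source.splitlines())
    return "\n".join(line for line in stripped if line)


def deduplicate(examples):
    """Collapse duplicate code examples.

    examples: list of (url, source) pairs.
    Returns a list of (url, source, occurrences), keeping the first copy
    of each distinct example together with how often it was seen.
    """
    first_seen = {}
    counts = Counter()
    for url, source in examples:
        digest = hashlib.sha1(normalize(source).encode("utf-8")).hexdigest()
        counts[digest] += 1
        first_seen.setdefault(digest, (url, source))
    return [(url, src, counts[d]) for d, (url, src) in first_seen.items()]


examples = [
    ("http://example.org/a", "it.hasNext();\nit.next();"),
    ("http://example.org/b", "  it.hasNext();\n\n  it.next();  "),
    ("http://example.org/c", "it.next();"),
]
unique = deduplicate(examples)
print([(url, n) for url, _, n in unique])
```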


Mining has three post-processing phases:

  • Type resolution: infer the fully qualified name of each type. These names can be retrieved from the import statements, but this becomes difficult when wildcard imports (import x.*) are used.
  • Candidate extraction: extract candidate rules from the code (e.g. a call to next() should be preceded by iterator.hasNext()).
  • Pattern inference: apply mining techniques such as frequent subsequence mining to find common API usage patterns.
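
The phases above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: resolve_type only inspects explicit single-type Java imports and gives up on wildcard imports, and frequent_ordered_pairs counts ordered pairs of calls as a much simpler stand-in for a real frequent subsequence miner. All function names and sample data are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Matches 'import a.b.C;' and 'import a.b.*;' (group 2 is set for wildcards).
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+?)(\.\*)?\s*;", re.MULTILINE)


def resolve_type(simple_name, java_source):
    """Return the fully qualified name of simple_name, or None.

    Only explicit imports are resolved; a type that could only come from
    a wildcard import stays unresolved.
    """
    for match in IMPORT_RE.finditer(java_source):
        path, wildcard = match.group(1), match.group(2)
        if not wildcard and path.rsplit(".", 1)[-1] == simple_name:
            return path
    return None


def frequent_ordered_pairs(sequences, min_support):
    """Count ordered call pairs (a before b) across call sequences.

    Each pair is counted at most once per sequence; pairs seen in fewer
    than min_support sequences are dropped.
    """
    counts = Counter()
    for seq in sequences:
        pairs = {(seq[i], seq[j]) for i, j in combinations(range(len(seq)), 2)}
        counts.update(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_support}


SAMPLE = """
import java.util.Iterator;
import java.io.*;
"""

# Hypothetical call sequences extracted from three code examples.
sequences = [
    ["iterator", "hasNext", "next"],
    ["iterator", "hasNext", "next", "remove"],
    ["iterator", "next"],
]

print(resolve_type("Iterator", SAMPLE))
print(resolve_type("File", SAMPLE))
patterns = frequent_ordered_pairs(sequences, min_support=2)
print(sorted(patterns.items()))
```

With a support threshold of 2, the pair (hasNext, next) survives while the rarer (next, remove) is filtered out, which mirrors how only recurring usage is kept as a candidate rule.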

Application of the Life-Cycle Model

Tasks Assisted by Life-Cycle Model

  • Assist developers with common usage scenarios (e.g. PARSEWeb).
  • Detect defects: if mined patterns show that code examples check for a non-null return value, check whether the program under analysis (PUA) does so as well.
  • Infer fixes: if a defect is detected in the PUA, propose a fix based on mined patterns, e.g. add a hasNext() check before the next() call.
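
As a rough illustration of defect detection and fix inference, the Python sketch below checks a call sequence against one mined rule of the form "target must be preceded by required" and proposes a repaired sequence. The representation of programs as flat call lists, the rule format, and the function names are assumptions made for this example only.

```python
def find_violations(calls, rule):
    """Return the indices of target calls not preceded by the required call.

    rule: (required, target), e.g. ("hasNext", "next").
    Each target call consumes the preceding check, so every next() needs
    its own hasNext().
    """
    required, target = rule
    violations = []
    guarded = False
    for index, call in enumerate(calls):
        if call == required:
            guarded = True
        elif call == target:
            if not guarded:
                violations.append(index)
            guarded = False
    return violations


def propose_fix(calls, rule):
    """Insert the required call before every violating target call."""
    required, _ = rule
    fixed = list(calls)
    # Insert from the end so earlier indices stay valid.
    for index in reversed(find_violations(calls, rule)):
        fixed.insert(index, required)
    return fixed


RULE = ("hasNext", "next")
program = ["iterator", "next", "hasNext", "next", "next"]

print(find_violations(program, RULE))
print(propose_fix(program, RULE))
```

The proposed fix is only a suggestion for the developer to review: inserting a check restores the mined pattern but does not decide what the program should do when the check fails.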

Preliminary Evaluation

The authors applied the life-cycle model to an example problem: how to get the IViewPart object from the IWorkbenchWindow class?

Four CSEs were used; two of them yielded a method-invocation sequence. PARSEWeb then searched for candidate sequences and proposed a working one.
