



National Research Council Canada Institute for Information Technology

Conseil national de recherches Canada Institut de Technologie de l’information

ERB-1062

The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics
Khaled El Emam, Saida Benlarbi, and Nishith Goel
September 1999

Canada

NRC 43606


Copyright 1999 by National Research Council of Canada. Permission is granted to quote short excerpts and to reproduce figures and tables from this report, provided that the source of such material is fully acknowledged.

The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics

Khaled El Emam
National Research Council, Canada
Institute for Information Technology
Building M-50, Montreal Road
Ottawa, Ontario, Canada K1A 0R6
khaled.el-emam@iit.nrc.ca

Saida Benlarbi and Nishith Goel
Cistel Technology
210 Colonnade Road, Suite 204
Nepean, Ontario, Canada K2E 7L5
{benlarbi, ngoel}@cistel.com

Abstract
Much effort has been devoted to the development and empirical validation of object-oriented metrics. The empirical validations performed thus far would suggest that a core set of validated metrics is close to being identified. However, none of these studies control for the potentially confounding effect of class size. In this paper we demonstrate a strong size confounding effect and question the results of previous object-oriented metrics validation studies. We first investigate whether there is a confounding effect of class size in validation studies of object-oriented metrics and show that, based on previous work, there is reason to believe such an effect exists. We then describe a detailed empirical methodology for identifying such effects. Finally, we perform a study on a large C++ telecommunications framework to examine whether size is really a confounder. This study considered the Chidamber and Kemerer metrics and a subset of the Lorenz and Kidd metrics. The dependent variable was the incidence of a fault attributable to a field failure (the fault-proneness of a class). Our findings indicate that, before controlling for size, the results are very similar to previous studies: the metrics that are expected to be validated are indeed associated with fault-proneness. After controlling for size, however, none of the metrics we studied remained associated with fault-proneness. This demonstrates a strong size confounding effect and casts doubt on the results of previous object-oriented metrics validation studies. We recommend that previous validation studies be re-examined to determine whether their conclusions still hold after controlling for size, and that future validation studies always control for size.

1 Introduction
The validation of software product metrics¹ has received much research attention from the software engineering community. Two types of validation are recognized [48]: internal and external. Internal validation is a theoretical exercise that ensures that the metric is a proper numerical characterization of the property it claims to measure. External validation involves empirically demonstrating that the product metric is associated with some important external metric (such as measures of maintainability or reliability). These are also commonly referred to as theoretical and empirical validation, respectively [73], and procedures for achieving both are described in [15]. Our focus in this paper is empirical validation.² Product metrics are of little value by themselves unless there is empirical evidence that they are associated with important external attributes [65]. The demonstration of such a relationship can serve two important purposes: early prediction/identification of high risk software components, and the construction of preventative design and programming guidelines.

¹ Some authors distinguish between the terms ‘metric’ and ‘measure’ [2]. We use the term “metric” here to be consistent with prevailing international standards. Specifically, ISO/IEC 9126:1991 [64] defines a “software quality metric” as a “quantitative scale and method which can be used to determine the value a feature takes for a specific software product”.

² Theoretical validations of many of the metrics that we consider in this paper can be found in [20][21][30].



Early prediction is commonly cast as a binary classification problem. This is achieved through a quality model that classifies components into either a high or a low risk category. The definition of a high risk component varies depending on the context of the study. For example, a high risk component may be one that contains any faults found during testing [14][75], one that contains any faults found during operation [72], or one that is costly to correct after an error has been found [3][13][1]. The identification of high risk components allows an organization to take mitigating actions, such as focusing defect detection activities on high risk components (for example, by optimally allocating testing resources [56]) or redesigning components that are likely to cause field failures or be costly to maintain. This is motivated by evidence showing that most faults are found in only a few of a system’s components [86][51][67][91].

A number of organizations have integrated quality models and modeling techniques into their overall quality decision making process. For example, Lyu et al. [81] report on a prototype system to support developers with software quality models, and the EMERALD system is reportedly routinely used for risk assessment at Nortel [62][63]. Ebert and Liedtke describe the application of quality models to control the quality of switching software at Alcatel [46].

The construction of design and programming guidelines can proceed by first showing that there is a relationship between, say, a coupling metric and maintenance cost. Proscriptions on the maximum allowable value of that coupling metric are then defined in order to avoid costly rework and maintenance in the future. Examples of cases where guidelines were empirically constructed are [1][3]. Guidelines based on anecdotal experience have also been defined [80], and experience-based guidelines are used directly in the context of software product acquisition by Bell Canada [34].
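
To make the guideline idea concrete, the following is a minimal sketch of how an empirically derived cap on a coupling metric might be checked. CBO (coupling between objects) is one of the Chidamber and Kemerer metrics examined in this paper, but the threshold value, function name, and class names below are hypothetical illustrations, not values from the paper:

```python
# Hypothetical guideline check: flag classes whose CBO (coupling
# between objects) exceeds an empirically derived cap.
COUPLING_CAP = 8  # illustrative threshold; the paper prescribes no value


def flag_risky_classes(cbo_by_class):
    """Return the names of classes that violate the coupling guideline."""
    return [name for name, cbo in cbo_by_class.items() if cbo > COUPLING_CAP]


print(flag_risky_classes({"Scheduler": 12, "Logger": 3, "Router": 9}))
# -> ['Scheduler', 'Router']
```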
Concordant with the popularity of the object-oriented paradigm, there has been a concerted research effort to develop object-oriented product metrics [8][17][30][80][78][27][24][60][106] and to validate them [4][27][17][19][22][78][32][57][89][106][8][25][10]. For example, in [8] the relationship between a set of new polymorphism metrics and fault-proneness is investigated. A study of the relationship between various design and source code measures using a data set from student systems was reported in [4][17][22][18], and a validation study of a large set of object-oriented metrics on an industrial system was described in [19]. Another industrial study is described in [27], where the authors investigate the relationship between object-oriented design metrics and two dependent variables: the number of defects and size in LOC. Li and Henry [78] report an analysis where they related object-oriented design and code metrics to the extent of code change, which they use as a surrogate for maintenance effort. Chidamber et al. [32] describe an exploratory analysis where they investigate the relationship between object-oriented metrics and productivity, rework effort, and design effort, respectively, on three different financial systems. Tang et al. [106] investigate the relationship between a set of object-oriented metrics and faults found in three systems. Nesi and Querci [89] construct regression models to predict class development effort using a set of new metrics. Finally, Harrison et al. [57] propose a new object-oriented coupling metric and compare its performance with a more established coupling metric. Despite minor inconsistencies in some of the results, a reading of the object-oriented metrics validation literature would suggest that a core set of validated metrics is close to being identified.
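
The confounding question at the heart of this paper can also be made concrete with a small analysis sketch. Suppose we have, for each class, a binary fault indicator, a coupling metric (CBO), and size in LOC. One standard way to test for a size confounding effect is to fit a logistic regression for the metric alone, and again with size added as a covariate. The file name, column names, and the use of pandas/statsmodels below are assumptions for illustration, not the paper's actual tooling:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical dataset: one row per class.
df = pd.read_csv("class_metrics.csv")
y = df["faulty"]  # 1 if a field-failure fault was traced to the class, else 0

# Model 1: the metric alone -- the form most prior validation studies fit.
m1 = sm.Logit(y, sm.add_constant(df[["cbo"]])).fit(disp=0)

# Model 2: the same metric with class size (LOC) added as a covariate.
m2 = sm.Logit(y, sm.add_constant(df[["cbo", "loc"]])).fit(disp=0)

print("p-value for cbo, ignoring size:   ", m1.pvalues["cbo"])
print("p-value for cbo, controlling size:", m2.pvalues["cbo"])
```

A metric whose coefficient is significant in the first model but not in the second exhibits exactly the pattern this paper reports: the apparent association with fault-proneness is carried by class size.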