9512.net

甜梦文库

甜梦文库

当前位置：首页 >> >> # The Confounding Effect of Class Size on The Validity of Object-Oriented Metrics

National Research Council Canada Institute for Information Technology

Conseil national de recherches Canada Institut de Technologie de l’information

ERB-1062

The Confounding Effect of Class Size on the Validity of Object-oriented Metrics

Khaled El Emam, Saida Benlarbi, and Nishith Goel September 1999

Canada

NRC 43606

National Research Council Canada Institute for Information Technology

Conseil national de recherches Canada Institut de Technologie de l’information

The Confounding Effect of Class Size on the Validity of Object-oriented Metrics

Khaled El Emam, Saida Benlarbi, and Nishith Goel September 1999

Copyright 1999 by National Research Council of Canada Permission is granted to quote short excerpts and to reproduce figures and tables from this report, provided that the source of such material is fully acknowledged.

The Confounding Effect of Class Size on The Validity of Object-Oriented Metrics

Khaled El Emam National Research Council, Canada Institute for Information Technology Building M-50, Montreal Road Ottawa, Ontario Canada K1A OR6 khaled.el-emam@iit.nrc.ca Saida Benlarbi Nishith Goel Cistel Technology 210 Colonnade Road Suite 204 Nepean, Ontario Canada K2E 7L5 {benlarbi, ngoel}@cistel.com

Abstract

Much effort has been devoted to the development and empirical validation of object-oriented metrics. The empirical validations performed thus far would suggest that a core set of validated metrics is close to being identified. However, none of these studies control for the potentially confounding effect of class size. In this paper we demonstrate a strong size confounding effect, and question the results of previous object-oriented metrics validation studies. We first investigated whether there is a confounding effect of class size in validation studies of object-oriented metrics and show that based on previous work there is reason to believe that such an effect exists. We then describe a detailed empirical methodology for identifying those effects. Finally, we perform a study on a large C++ telecommunications framework to examine if size is really a confounder. This study considered the Chidamber and Kemerer metrics, and a subset of the Lorenz and Kidd metrics. The dependent variable was the incidence of a fault attributable to a field failure (fault-proneness of a class). Our findings indicate that before controlling for size, the results are very similar to previous studies: the metrics that are expected to be validated are indeed associated with fault-proneness. After controlling for size none of the metrics we studied were associated with fault-proneness anymore. This demonstrates a strong size confounding effect, and casts doubt on the results of previous object-oriented metrics validation studies. It is recommended that previous validation studies be re-examined to determine whether their conclusions would still hold after controlling for size, and that future validation studies should always control for size.

1 Introduction

The validation of software product metrics has received much research attention by the software engineering community. There are two types of validation that are recognized [48]: internal and external. Internal validation is a theoretical exercise that ensures that the metric is a proper numerical characterization of the property it claims to measure. External validation involves empirically demonstrating that the product metric is associated with some important external metric (such as measures of maintainability or reliability). These are also commonly referred to as theoretical and empirical validation respectively [73], and procedures for achieving both are described in [15]. Our focus 2 in this paper is empirical validation. Product metrics are of little value by themselves unless there is empirical evidence that they are associated with important external attributes [65]. The demonstration of such a relationship can serve two important purposes: early prediction/identification of high risk software components, and the construction of preventative design and programming guidelines.

1

1 Some authors distinguish between the terms ‘metric’ and ‘measure’ [2]. We use the term “metric” here to be consistent with prevailing international standards. Specifically, ISO/IEC 9126:1991 [64] defines a “software quality metric” as a “quantitative scale and method which can be used to determine the value a feature takes for a specific software product”. 2

Theoretical validations of many of the metrics that we consider in this paper can be found in [20][21][30].

V18-10/04/00

1

Early prediction is commonly cast as a binary classification problem. This is achieved through a quality model that classifies components into either a high or low risk category. The definition of a high risk component varies depending on the context of the study. For example, a high risk component is one that contains any faults found during testing [14][75], one that contains any faults found during operation [72], or one that is costly to correct after an error has been found [3][13][1]. The identification of high risk components allows an organization to take mitigating actions, such as focus defect detection activities on high risk components, for example optimally allocating testing resources [56], or redesign components that are likely to cause field failures or be costly to maintain. This is motivated by evidence showing that most faults are found in only a few of a system’s components [86][51][67][91]. A number of organizations have integrated quality models and modeling techniques into their overall quality decision making process. For example, Lyu et al. [81] report on a prototype system to support developers with software quality models, and the EMERALD system is reportedly routinely used for risk assessment at Nortel [62][63]. Ebert and Liedtke describe the application of quality models to control the quality of switching software at Alcatel [46]. The construction of design and programming guidelines can proceed by first showing that there is a relationship between say a coupling metric and maintenance cost. Then proscriptions on the maximum allowable value on that coupling metric are defined in order to avoid costly rework and maintenance in the 4 future. Examples of cases where guidelines were empirically constructed are [1][3]. Guidelines based on anecdotal experience have also been defined [80], and experience-based guidelines are used directly in the context of software product acquisition by Bell Canada [34]. Concordant with the popularity of the object-oriented paradigm, there has been a concerted research effort to develop object oriented product metrics [8][17][30][80][78][27][24][60][106], and to validate them [4][27][17][19][22][78][32][57][89][106][8][25][10]. For example, in [8] the relationship between a set of new polymorphism metrics and fault-proneness is investigated. A study of the relationship between various design and source code measures using a data set from student systems was reported in [4][17][22][18], and a validation study of a large set of object-oriented metrics on an industrial system was described in [19]. Another industrial study is described in [27] where the authors investigate the relationship between object-oriented design metrics and two dependent variables: the number of defects and size in LOC. Li and Henry [78] report an analysis where they related object-oriented design and code metrics to the extent of code change, which they use as a surrogate for maintenance effort. Chidamber et al. [32] describe an exploratory analysis where they investigate the relationship between objectoriented metrics and productivity, rework effort and design effort on three different financial systems respectively. Tang et al. [106] investigate the relationship between a set of object-oriented metrics and faults found in three systems. Nesi and Querci [89] construct regression models to predict class development effort using a set of new metrics. Finally, Harrison et al. [57] propose a new object-oriented coupling metric, and compare its performance with a more established coupling metric. Despite minor inconsistencies in some of the results, a reading of the object-oriented metrics validation literature would suggest that a number of metrics are indeed ‘validated’ in that they are strongly associated with outcomes of interest (e.g., fault-proneness) and that they can serve as good predictors of high-risk classes. The former is of course a precursor for the latter. For example, it has been stated that some metrics (namely the Chidamber and Kemerer – henceforth CK – metrics of [30]) “have been proven empirically to be useful for the prediction of fault-prone modules” [106]. A recent review of the literature stated that “Existing data suggests that there are important relationships between structural attributes and external quality indicators” [23]. However, almost all of the validation studies that have been performed thus far completely ignore the potential confounding impact of class size. This is the case because the analyses employed are univariate: they only model the relationship between the product metric and the dependent variable of interest. For example, recent studies used the bivariate correlation between object-oriented metrics and

3 It is not, however, always the case that binary classifiers are used. For example, there have been studies that predict the number of faults in individual components (e.g., [69]), and that produce point estimates of maintenance effort (e.g., [78][66]). 4 It should be noted that the construction of guidelines requires the demonstration of a causal relationship rather than a mere association.

3

V18-10/04/00

2

the number of faults to investigate the validity of the metrics [57][10]. Also, univariate logistic regression models are used as the basis for demonstrating the relationship between object-oriented product metrics and fault-proneness in [22][19][106]. The importance of controlling for potential confounders in empirical studies of object-oriented products has been emphasized [23]. However, size, the most obvious potential confounder, has not been controlled in previous validation studies. The objective of this paper is to investigate the confounding effect of class size on the validation of objectoriented product metrics. We first demonstrate based on previous work that there is potentially a size confounding effect in object-oriented metrics validation studies, and present a methodology for empirically testing this. We then perform an empirical study on an object-oriented telecommunications framework 5 written in C++ [102]. The metrics we investigate consist of the CK metrics suite [30], and some of the metrics defined by Lorenz and Kidd [80]. The external metric that we validate against is the occurrence of a fault, which we term the fault-proneness of the class. In our study a fault is detected due to a field failure. Briefly, our results indicate that by using the commonly employed univariate analyses our results are consistent with previous studies. After controlling for the confounding effect of class size, none of the metrics is associated with fault-proneness. This indicates a strong confounding effect of class size on some common object-oriented metrics. The results cast serious doubt that many previous validation studies demonstrate more than that size is associated with fault-proneness. Perhaps the most important practical implication of these results is that design and programming guidelines based on previous validation studies are questioned. Efforts to control cost and quality using object-oriented metrics as early indicators of problems may be achieved just as well using early indicators of size. The implications for research are that data from previous validation studies should be reexamined to gauge the impact of the size confounding effect, and future validation studies should control for size. In Section 2 we provide the rationale behind the confounding effect of class size and present a framework for its empirical investigation. Section 3 presents our research method, and Section 4 includes the results of the study. We conclude the paper in Section 5 with a summary and directions for future work.

2 Background

This section is divided in two parts. First, we present the theoretical and empirical basis of the objectoriented metrics that we attempt to validate. Second, we demonstrate that there is a potentially strong size confounding effect in object-oriented metrics validation studies.

2.1 Theoretical and Empirical Basis of Object-Oriented Metrics

2.1.1 Theoretical Basis and Its Empirical Support The primary reason why there is an interest in the development of product metrics in general is exemplified by the following justification for a product metric validity study “There is a clear intuitive basis for believing that complex programs have more faults in them than simple programs” [87]. However, an intuitive belief does not make a theory. In fact, the lack of a strong theoretical basis driving the development of traditional software product metrics has been criticized in the past [68]. Specifically, Kearney et al. [68] state that “One of the reasons that the development of software complexity measures is so difficult is that programming behaviors are poorly understood. A behavior must be understood before what makes it difficult can be determined. To clearly state what is to be measured, we need a theory of programming that includes models of the program, the programmer, the programming environment, and the programming task.”

5 It has been stated that for historical reasons the CK metrics are the most referenced [23]. Most commercial metrics collection tools available at the time of writing also collect these metrics.

V18-10/04/00

3

Structural Class Properties (e.g., coupling)

affect affect

Cognitive Complexity

indicate

External Attributes (e.g., fault-proneness, maintainability)

Figure 1: Theoretical basis for the development of object-oriented product metrics. In the arena of object-oriented metrics, a slightly more detailed articulation of a theoretical basis for developing quantitative models relating product metrics and external quality metrics has been provided in [19], and is summarized in Figure 1. There, it is hypothesized that the structural properties of a software component (such as its coupling) have an impact on its cognitive complexity. Cognitive complexity is defined as the mental burden of the individuals who have to deal with the component, for example, the developers, testers, inspectors, and maintainers. High cognitive complexity leads to a component 6 exhibiting undesirable external qualities, such as increased fault-proneness and reduced maintainability. Certain structural features of the object-oriented paradigm have been implicated in reducing the understandability of object-oriented programs, hence raising cognitive complexity. We describe these below. 2.1.1.1 Distribution of Functionality

In traditional applications developed using functional decomposition, functionality is localized in specific procedures, the contents of data structures are accessed directly, and data central to an application is often globally accessible [110]. Functional decomposition makes procedural programs easier to understand because it is based on a hierarchy in which a top-level function calls lower level functions to carry out smaller chunks of the overall task [109]. Hence tracing through a program to understand its global functionality is facilitated. In one experimental study with students and professional programmers [11], the authors compared maintenance time for three equivalent programs (implementing three different applications, therefore we have nine programs): one consisted of a straight serial structure (i.e., one main function), a program developed following the principles of functional decomposition, and an object-oriented program (without inheritance). In general, it took the students more time to change the object-oriented programs, and the professionals exhibited the same effect, although not as strongly. Furthermore, both the students and professionals noted that they found that it was most difficult to recognize program units in the objectoriented programs, and the students felt that it was also most difficult to find information in the objectoriented programs. Widenbeck et al. [109] make a distinction between program functionality at the local level and at the global (application) level. At the local level they argue that the object-oriented paradigm’s concept of encapsulation ensures that methods are bundled together with the data that they operate on, making it easier to construct appropriate mental models and specifically to understand a class’ individual functionality. At the global level, functionality is dispersed amongst many interacting classes, making it harder to understand what the program is doing. They support this in an experiment with equivalent small C++ (with no inheritance) and Pascal programs whereby the subjects were better able to answer questions about the functionality of the C++ program. They also performed an experiment with larger programs. Here the subjects with the C++ program (with inheritance) were unable to answer questions about its functionality much better than guessing. While this study was done with novices, it supports the general notions that high cohesion makes object-oriented programs easier to understand, and high coupling makes them more difficult to understand. Wilde et al.’s [110] conclusions based on an interviewbased study of two object-oriented systems at Bellcore implemented in C++ and an investigation of a PC Smalltalk environment, all in different application domains, are concordant with this finding, in that programmers have to understand a method’s context of use by tracing back through the chain of calls that reach it, and tracing the chain of methods it uses. When there are many interactions, this

6

To reflect the likelihood that not only structural properties affect a component’s external qualities, some authors have included additional metrics as predictor variables in their quantitative models, such as reuse [69], the history of corrected faults [70], and the experience of developers [72][71]. However, this does not detract from the importance of the primary relationship between product metrics and a component’s external qualities.

V18-10/04/00

4

exacerbates the understandability problem. An investigation of a C and a C++ system, both developed by the same staff in the same organization, concluded that “The developers found it much harder to trace faults in the OO C++ design than in the conventional C design. Although this may simply be a feature of C++, it appears to be more generally observed in the testing of OO systems, largely due to the distorted and frequently nonlocal relationships between cause and effect: the manifestation of a failure may be a ‘long way away’ from the fault that led to it. […] Overall, each C++ correction took more than twice as long to fix as each C correction.” [59]. 2.1.1.2 Inheritance Complications

As noted in [43], there has been a preoccupation within the community with inheritance, and therefore more studies have investigated that particular feature of the object-oriented paradigm. Inheritance introduces a new level of delocalization, making the understandability even more difficult. It has been noted that “Inheritance gives rise to distributed class descriptions. That is, the complete description for a class C can only be assembled by examining C as well as each of C’s superclasses. Because different classes are described at different places in the source code of a program (often spread across several different files), there is no single place a programmer can turn to get a complete description of a class” [77]. While this argument is stated in terms of source code, it is not difficult to generalize it to design documents. Wilde et al.’s study [110] indicated that to understand the behavior of a method one has to trace inheritance dependencies, which is considerably complicated due to dynamic binding. A similar point was made in [77] about the understandability of programs in languages that support dynamic binding, such as C++. In a set of interviews with 13 experienced users of object-oriented programming, Daly et al. [40] noted that if the inheritance hierarchy is designed properly then the effect of distributing functionality over the inheritance hierarchy would not be detrimental to understanding. However, it has been argued that there exists increasing conceptual inconsistency as one travels down an inheritance hierarchy (i.e., deeper levels in the hierarchy are characterized by inconsistent extensions and/or specializations of superclasses) [45], therefore inheritance hierarchies may not be designed properly in practice. In one study Dvorak [45] found that subjects were more inconsistent in placing classes deeper in the inheritance hierarchy when compared to at higher levels in the hierarchy. An experimental investigation found that making changes to a C++ program with inheritance consumed more effort than a program without inheritance, and the author attributed this to the subjects finding the inheritance program more difficult to understand based on responses to a questionnaire [26]. A contradictory result was found in [41], where the authors conducted a series of classroom experiments comparing the time to perform maintenance tasks on a ‘flat’ C++ program and a program with three levels of inheritance. This was premised on a survey of object-oriented practitioners showing 55% of respondents agreeing that inheritance depth is a factor when attempting to understand object-oriented software [39]. The result was a significant reduction in maintenance effort for the inheritance program. An internal replication by the same authors found the results to be in the same direction, albeit the pvalue was larger. The second experiment in [41] found that C++ programs with 5 levels of inheritance took more time to maintain than those with no inheritance, although the effect was not statistically significant. The authors explain this by observing that searching/tracing through the bigger inheritance hierarchy takes longer. Two experiments that were partial replications of the Daly et al. experiments produced different conclusions [107]. In both experiments the subjects were given three equivalent Java programs to make changes to, and the maintenance time was measured. One of the Java programs was ‘flat’, one had an inheritance depth of 3, and one had an inheritance depth of 5. The results for the first experiment indicate that the programs with inheritance depth of 3 took longer to maintain than the ‘flat’ program, but the program with inheritance depth of 5 took as much time as the ‘flat’ program. The authors attribute this to the fact that the amount of changes required to complete the maintenance task for the deepest inheritance program was smaller. The results for a second task in the first experiment and the results of the second experiment indicate that it took longer to maintain the programs with inheritance. To explain this finding and its difference from the Daly et al. results, the authors showed that the “number of methods relevant for understanding” (which is the number of methods that have to be traced in order to perform the maintenance task) was strongly correlated with the maintenance time, and this value was much larger in their study compared with the Daly et al. programs. The authors conclude that inheritance

V18-10/04/00

5

depth per se is not the factor that affects understandability, but the number of methods that have to be traced. 2.1.1.3 Summary

The current theoretical framework for explaining the effect of the structural properties of object-oriented programs on external program attributes can be justified empirically. To be specific, studies that have been performed indicate that the distribution of functionality across classes in object-oriented systems, and the exacerbation of this through inheritance, potentially makes programs more difficult to understand. This suggests that highly cohesive, sparsely coupled, and low inheritance programs are less likely to contain a fault. Therefore, metrics that measure these three dimensions of an object-oriented program would be expected to be good predictors of fault-proneness or the number of faults. The empirical question is then whether contemporary object-oriented metrics measure the relevant structural properties well enough to substantiate the above theory. Below we review the evidence on this. 2.1.2 Empirical Validation of Object-Oriented Metrics

In this section we review the empirical studies that investigate the relationship between the ten objectoriented metrics that we study and fault-proneness (or number of faults). The product metrics cover the following dimensions: coupling, cohesion, inheritance, and complexity. These dimensions are based on the definition of the metrics, and may not reflect their actual behavior. Coupling metrics characterize the static usage dependencies amongst the classes in an object-oriented system [21]. Cohesion metrics characterize the extent to which the methods and attributes of a class belong together [16]. Inheritance metrics characterize the structure of the inheritance hierarchy. Complexity metrics, as used here, are adaptations of traditional procedural paradigm complexity metrics to the object-oriented paradigm. Current methodological approaches for the validation of object-oriented product metrics are best exemplified by two articles by Briand et al. [19][22]. These are validation studies for an industrial communications system and a set of student systems respectively, where a considerable number of contemporary object-oriented product metrics were studied. We single out these studies because their methodological reporting is detailed and because they reflect what can be considered best methodological practice to date. The basic approach starts with a data set of product metrics and binary fault data for a complete system or multiple systems. The important element of the Briand et al. methodology that is of interest to us here is the univariate analysis that they stipulate should be performed. In fact, the main association between the product metrics and fault-proneness is established on the basis of the univariate analysis. If the relationship is statistically significant (and in the expected direction) than a metric is considered 7 validated. For instance, in [22] the authors state a series of hypotheses relating each metric with faultproneness. They then explain “Univariate logistic regression is performed, for each individual measure (independent variable), against the dependent variable to determine if the measure is statistically related, in the expected direction, to fault-proneness. This analysis is conducted to test the hypotheses..” Subsequently, the results of the univariate analysis are used to evaluate the extent of evidence supporting each of the hypotheses. Reliance on univariate results as the basis for drawing validity conclusions is common practice (e.g., see [4][10][17][18][57][106]). In this review we first present the definition of the metrics as we have operationalized them. The operationalization of some of the metrics is programming language dependent. We then present the magnitude of the coefficients and p values computed in the various studies. Validation coefficients were either the change in odds ratio as a measure of the magnitude of the metric to fault-proneness association from a logistic regression (see the appendix, Section 7) or the Spearman correlation coefficient. Finally, this review focuses only on the fault-proneness or number of faults dependent variable. Other studies that investigated effort, such as [32][89][78], are not covered as effort is not the topic of the current paper.

7

Briand et al. use logistic regression, and consider the statistical significance of the regression parameters.

V18-10/04/00

6

2.1.2.1

WMC

This is the Weighted Methods per Class metric [30], and can be classified as a traditional complexity metric. It is a count of the methods in a class. The developers of this metric leave the weighting scheme as an implementation decision [30]. We weight it using cyclomatic complexity as did [78]. However, other authors did not adopt a weighting scheme [4][106]. Methods from ancestor classes are not counted and neither are “friends” in C++. This is similar to the approach taken in, for example, [4][31]. To be precise, 8 WMC was counted after preprocessing to avoid undercounts due to macros [33]. One study found WMC to be associated with fault-proneness on three different sub-systems written in 9 C++ with p-values 0.054, 0.0219 and 0.0602, and change in odds ratio 1.26, 1.45, and 1.26 [106]. A study that evaluated WMC on a C++ application and a Java application found WMC to have a Spearman correlation of 0.414 and 0.456 with the number of faults due to field failures respectively, and highly significant p-values (<0.0001 and <0.0056) [10]. Another study using student systems found WMC to be 10 associated with fault-proneness with a p-value for the logistic regression coefficient of 0.0607 [4]. 2.1.2.2 DIT

The Depth of Inheritance Tree [30] metric is defined as the length of the longest path from the class to the root in the inheritance hierarchy. It is stated that as one goes further down the class hierarchy the more complex a class becomes, and hence more fault-prone. The DIT metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was related to fault-proneness (p=0.0074) with a change in odds ratio equal to 0.572 when measured on non-library classes. The second study [22] also found it to be associated with fault-proneness (p=0.0001) with a change in odds ratio of 2.311. Another study using student systems found DIT to be associated with faultproneness with a p-value for the logistic regression coefficient <0.0001 [4]. It will be noted that in the first study a negative association was found between DIT and fault-proneness. The authors explain this by stating that in the system studied classes located deeper in the inheritance hierarchy provide only implementations for a few specialized methods, and are therefore less likely to contain faults than classes closer to the root [19]. This was a deliberate strategy to place as much functionality as close as possible to the root of the inheritance tree. Note that for the latter two investigations, the same data set was used, and therefore the slightly different coefficients may have been due to removal of outliers. One study using data from an industrial system found that classes involved in an inheritance structure were more likely to have defects (found during integration testing and within 12 months post-delivery) [27]. Another study did not find DIT to be associated with fault-proneness on three different sub-systems written in C++, where faults were based on three years’ worth of trouble reports [106]. One study that evaluated DIT on a Java application found that it had a Spearman correlation of 0.523 (p<0.0015) with the number of faults due to field failures [10]. 2.1.2.3 NOC

This is the Number of Children inheritance metric [30]. This metric counts the number of classes which inherit from a particular class (i.e., the number of classes in the inheritance tree down from a class). The NOC metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was not related to fault-proneness. Conversely, the second study [22] found it to be associated with faultproneness (p=0.0276) with a change in odds ratio of 0.322. Another study using student systems found

8

Note that macros embodied in #ifdef’s are used to customize the implementation to a particular platform. Therefore, the method is defined at design time but its implementation is conditional on environment variables. Not counting it, as suggested in [31], would undercount methods known at design time.

9 In this study faults were classified as either object-oriented type faults or traditional faults. The values presented here are for all of the faults, although the same metrics were found to be significanct for both all faults and the object-oriented only faults. Furthermore, the change in odds ratio reported is based on a change of one unit of the metric rather than a change in the standard deviation. 10

This study used the same data set as in [22], except that the data was divided into subsets using different criteria. The results presented here are for all of the classes.

V18-10/04/00

7

NOC to be associated with fault-proneness with a p-value for the regression coefficient <0.0001 [4]. Note that for the latter two investigations, the same data set was used, and therefore the slightly different coefficients may have been due to removal of outliers. In both studies NOC had a negative association with fault-proneness and this was interpreted as indicating that greater attention was given to these classes (e.g., through inspections) given that many classes were dependent on them. Another study did not find NOC to be associated with fault-proneness on three different sub-systems written in C++, where faults were based on three years’ worth of trouble reports [106]. NOC was not associated with the number of faults due to field failures in a study of two systems, one implemented in C++ and the other in Java [10]. 2.1.2.4 CBO

This is the Coupling Between Object Classes coupling metric [30]. A class is coupled with another if methods of one class uses methods or attributes of the other, or vice versa. In this definition, uses can mean as a member type, parameter type, method local variable type or cast. CBO is the number of other classes to which a class is coupled. It includes inheritance-based coupling (i.e., coupling between classes related via inheritance). The CBO metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was related to fault-proneness (p<0.0001) with a change in odds ratio equal to 5.493 when measured on nonlibrary classes. The second study [22] also found it to be associated with fault-proneness (p<0.0001) with a change in odds ratio of 2.012 when measured on non-library classes. Another study did not find CBO to be associated with fault-proneness on three different sub-systems written in C++, where faults were based on three years’ worth of trouble reports [106]. This was also the case in a recent empirical analysis on two traffic simulation systems, where no relationship between CBO and the number of known faults was found [57], and a study of a Java application where CBO was not found to be associated with faults due to field failures [10]. Finally, another study using student systems found CBO to be associated with fault-proneness with a p-value for the logistic regression coefficient <0.0001 [4]. 2.1.2.5 RFC

This is the Response for a Class coupling metric [30]. The response set of a class consists of the set M of methods of the class, and the set of methods invoked directly by methods in M (i.e., the set of methods that can potentially be executed in response to a message received by that class). RFC is the number of methods in the response set of the class. The RFC metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was related to fault-proneness (p=0.0019) with a change in odds ratio equal to 1.368 when measured on nonlibrary classes. The second study [22] also found it to be associated with fault-proneness (p<0.0001) with a change in odds ratio of 3.208 when measured on non-library classes. Another study found RFC to be associated with fault-proneness on two different sub-systems written in C++ with p-values 0.0401 and 11 0.0499, and change in odds ratio 1.0562 and 1.0654 [106]. A study that evaluated RFC on a C++ application and a Java application found RFC to have a Spearman correlation of 0.417 and 0.775 with the number of faults due to field failures respectively, and highly significant p-values (both <0.0001) [10]. Another study using student systems found RFC to be associated with fault-proneness with a p-value for the logistic regression coefficient <0.0001 [4].

11 In this study faults were classified as either object-oriented type faults or traditional faults. The values presented here are for all of the faults, although the same metrics were found to be significanct for both all faults and the object-oriented only faults. Furthermore, the change in odds ratio reported is based on a change of one unit of the metric rather than a change in the standard deviation.

V18-10/04/00

8

2.1.2.6

LCOM

This is a cohesion metric that was defined in [30]. This measures the number of pairs of methods in the class using no attributes in common minus the number of pairs of methods that do. If the difference is negative it is set to zero. The LCOM metric was empirically evaluated in [19][22]. In [19] the authors found it to be associated with fault-proneness (p=0.0.249) with a change in odds ratio of 1.613. Conversely, the second study [22] did not find it to be associated with fault-proneness. 2.1.2.7 NMO

This is an inheritance metric that has been defined in [80], and measures the number of inherited methods overriden by a subclass. A large number of overriden methods indicates a design problem [80]. Since a subclass is intended to specialize its parent, it should primarily extend the parent’s services [94]. This should result in unique new method names. Numerous overrides indicate subclassing for the convenience of reusing some code and/or instance variables when the new subclass is not purely a specialization of its parent [80]. The NMO metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was related to fault-proneness (p=0.0082) with a change in odds ratio equal to 1.724. The second study [22] also found it to be associated with fault-proneness (p=0.0243) with a change in odds ratio of 1.948. Lorenz and Kidd [80] caution that in the context of frameworks methods are often defined specifically for reuse or that are meant to be overriden. Therefore, for our study there is already an a priori expectation that this metric may not be a good predictor. 2.1.2.8 NMA

This is an inheritance metric that has been defined in [80], and measures the number of methods added by a subclass (inherited methods are not counted). As this value becomes larger for a class, the functionality of that class becomes increasingly distinct from that of the parent classes. The NMO metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was related to fault-proneness (p=0.0021) with a change in odds ratio equal to 3.925, a rather substantial effect. The second study [22] also found it to be associated with fault-proneness (p=0.0021) with a change in odds ratio of 1.710. 2.1.2.9 SIX

This is an inheritance metric that has been defined in [80], and consists of a combination of inheritance metrics. It is calculated as the product of the number of overriden methods and the class hierarchy nesting level normalized by the total number of methods in the class. The higher value for SIX, the more likely that a particular class does not conform to the abstraction of it's superclasses [94]. The SIX metric was empirically evaluated in [19][22]. In [19] the authors found that this metric was not related to fault-proneness. Conversely, the second study [22] found it to be associated with faultproneness (p=0.0089) with a change in odds ratio of 1.337. 2.1.2.10 NPAVG This can be considered as a coupling metric and has been defined in [80], and measures the average number of parameters per method (not including inherited methods). Methods with a high number of parameters generally require considerable testing (as their input can be highly varied). Also, large numbers of parameters lead to more complex, less maintainable code. 2.1.2.11 Summary The current empirical studies do provide some evidence that object oriented metrics are associated with fault-proneness or the incidence of faults. Though, the evidence is equivocal. For some of the inheritance metrics that were studied (DIT and NOC) some studies found a positive association, some found a negative association, and some found no association. The CBO metric was found to be positively associated with fault-proneness in some studies, and not associated with either the number of faults found or fault-proneness in other studies. The RFC and WMC metrics were consistently found to be

V18-10/04/00

9

associated with fault-proneness. The NMO and NMA metrics were found to be associated with faultproneness, but the evidence for the SIX metric is more equivocal. The LCOM cohesion metric also has equivocal evidence supporting its validity. It should be noted that the differences in the results obtained across studies may be a consequence of the measurement of different dependent variables. For instance, some treat the dependent variable as the (continuous) number of defects found. Other studies use a binary value of incidence of a fault during testing or in the field, or both. It is plausible that the effects of product metrics may be different for each of these. An optimistic observer would conclude that the evidence as to the predictive validity of most of these metrics is good enough to recommend their practical usage.

2.2 The Confounding Effect of Size

In this section we take as a starting point the stance of an optimistic observer and assume that there is sufficient empirical evidence demonstrating the relationship between the object-oriented metrics that we study and fault-proneness. We already showed that previous empirical studies drew their conclusions from univariate analyses. Below we make the argument that univariate analyses ignore the potential confounding effects of class size. We show that if there is indeed a size confounding effect, then previous empirical studies could have harbored a large positive bias. For ease of presentation we take as a running example a coupling metric as the main metric that we are trying to validate. For our purposes, a validation study is designed to determine whether there is an association between coupling and fault-proneness. Furthermore, we assume that this coupling metric is appropriately dichotomized: Low Coupling (LC) and High Coupling (HC). This dichotomization assumption simplifies the presentation, but the conclusions can be directly generalized to a continuous metric. 2.2.1 The Case Control Analogy

An object-oriented metrics validation study can be easily seen as an unmatched case-control study. Case-control studies are frequently used in epidemiology to, for example, study the effect of exposure to 12 carcinogens on the incidence of cancers [95][12] . The reason for using case-control studies as opposed to randomized experiments in certain instances is that it would not be ethically and legally defensible to do otherwise. For example, it would not be possible to have deliberately composed ‘exposed’ and ‘unexposed’ groups in a randomized experiment when the exposure is a suspected carcinogen or toxic substance. Randomized experiments are more appropriately used to evaluate treatments or preventative measures [52]. In applying the conduct of a case-control study to the validation of an object-oriented product metric, one would first proceed by identifying classes that have faults in them (the cases). Then, for the purpose of comparison another group of classes without faults in them are identified (the controls). We determine the proportion of cases that have, say High Coupling and the proportion with Low Coupling. Similarly, we determine the proportion of controls with High Coupling, and the proportion with Low Coupling. If there is an association of coupling with fault-proneness then the prevalence of High Coupling classes would be higher in the cases than in the controls. Effectively then, a case-control study follows a paradigm that proceeds from effect to cause, attempting to find antecedents that lead to faults [99]. In a case-control study, the control group provides an estimate of the frequency of High Coupling that would be expected among the classes that do not have faults in them. In an epidemiological context, it is common to have ‘hospital-based cases’ [52][95]. For example, a subset or all patients that have been admitted to a hospital with a particular disease can be considered as 13 cases. Controls can also be selected from the same hospital or clinic. The selection of controls is not necessarily a simple affair. For example, one can match the cases with controls on some confounding

12 13

Other types of studies that are used are cohort-studies [52], but we will not consider these here.

This raises the issue of generalizability of the results. However, as noted by Breslow and Day [12], generalization from the sample in a case-control study depends on non-statistical arguments. The concern with the design of the study is to maximize internal validity. In general, replication of results establishes generalizability [79].

V18-10/04/00

10

variables, for instance, on age and sex. Matching ensures that the cases and controls are similar on the matching variable and therefore this variable cannot be considered a causal factor in the analysis. Alternatively, one can have an unmatched case-control study and control for confounding effects during the analysis stage. In an unmatched case-control study the determination of an association between the exposure (product metric) and the disease (fault-proneness) proceeds by calculating a measure of association and determining whether it is significant. For example, consider the following contingency table that is obtained from a hypothetical validation study:

Fault Proneness Coupling

HC LC Faulty 91 19 Not Faulty 19 91

Table 1: A contingency table showing the results of a hypothetical validation study. For this particular data set, the odds ratio is 22.9 (see the appendix, Section 7, for a definition of the odds ratio), which is highly significant, indicating a strong positive association between coupling and faultproneness. 2.2.2 The Potential Confounding Effect of Size

One important element that has been ignored in previous validation studies is the potential confounding effect of class size. This is illustrated in Figure 2.

Product Metric

(c)

(a)

Fault-Proneness

(b)

Size

Figure 2: Path diagram illustrating the confounding effect of size. The path diagram in Figure 2 depicts a classic text-book example of confounding in a case-control study 14 The path (a) represents the current causal beliefs about product metrics being an antecedent [99][12].

14 We make the analogy to a case-control study because it provides us with a well tested framework for defining and evaluating confounding effects, as well as for conducting observational studies from which one can make stronger causal claims (if all known confounders are controlled). However, for the sole purposes of this paper, the characteristics of a confounding effect have been described and exemplified in [61] without resort to a case-control analogy.

V18-10/04/00

11

to fault-proneness. The path (b) depicts a positive causal relationship between size and fault-proneness. The path (c) depicts a positive association between product metrics and size. If this path diagram is concordant with reality, then size distorts the relationship between product metrics and fault-proneness. Confounding can result in considerable bias in the estimate of the magnitude of the association. Size is a positive confounder, which means that ignoring size will always result in the association between say coupling and fault-proneness to be more positive than it really is. The potential confounding effect of size can be demonstrated through an example (adapted from [12]). Consider the table in Table 1 that gave an odds ratio of 22.9. As mentioned earlier, this is representative of the current univariate analyses used in the object-oriented product metrics validation literature (which explicitly exclude size as a covariate nor employ a stratification on size). Now, let us say that if we analyze the data seperately for small and large classes, we have the data in 15 Table 2 for the large classes, and the data in Table 3 for the small classes.

Fault Proneness Coupling

HC LC Faulty 90 10 Not Faulty 9 1

Table 2: A contingency table showing the results for only large classes of a hypothetical validation study.

Fault Proneness Coupling

HC LC Faulty 1 9 Not Faulty 10 90

Table 3: A contingency table showing the results for only small classes of a hypothetical validation study. In both of the above tables the odds ratio is one. By stratifying on size (i.e., controlling for the effect of size), the association between coupling and fault-proneness has been reduced dramatically. This is because size was the reason why there was an association between coupling and fault-proneness in the first place. Once the influence of size is removed, the example shows that the impact of the coupling metric disappears. Therefore, an important improvement on the conduct of validation studies of object oriented metrics is to control for the effect of size, otherwise one may be getting the illusion that the product metric is strongly associated with fault-proneness, when in reality the association is much weaker or non-existent. 2.2.3 Evidence of a Confounding Effect

Now we must consider whether the path diagram in Figure 2 can be supported in reality. There is evidence that object-oriented product metrics are associated with size. For example, in [22] the Spearman rho correlation coefficients go as high as 0.43 for associations between some coupling and cohesion metrics with size, and 0.397 for inheritance metrics, and both are statistically significant (at an alpha level of say 0.1). Similar patterns emerge in the study reported in [19], where relatively large correlations are shown. In another study [27] the authors display the correlation matrix showing the Spearman correlation between a set of object-oriented metrics that can be collected from Shlaer-Mellor designs and C++ LOC. The correlations range from 0.563 to 0.968, all statistically significant at an alpha level 0.05. This also indicates very strong correlations with size.

15 Note that in this example the odds ratio of the size to fault-proneness association is 100, and the size to coupling association is 81.3. Therefore, it follows the model in Figure 2.

V18-10/04/00

12

Associations between size and defects have been reported in non-object oriented systems [58]. For object oriented programs, the relationship between size and defects is clearly visible in the study of [27], where the Spearman correlation was found to be 0.759 and statistically significant. Another study of image analysis programs written in C++ found a Spearman correlation of 0.53 between size in LOC and the number of errors found during testing [55], and was statistically significant at an alpha level of 0.05. Briand et al. [22] find statistically significant associations between 6 different size metrics and faultproneness for C++ programs, with a change in odds ratio going as high as 4.952 for one of the size metrics. General indications of a confounding effect are seen in Figure 3, which shows the associations between a set of coupling metrics and fault-proneness, and with size from a recent study [22]. The association between coupling metrics and fault-proneness is given in terms of the change in the odds ratio and the pvalue of the univariate logistic regression parameter. The association with size is in terms of the Spearman correlation. As can be seen in Figure 3, all the metrics that had a significant relationship with fault-proneness in the univariate analysis also had a significant correlation with size. Furthermore, there is a general trend of increasing association between the coupling metric and fault-proneness as its association with size increases.

Relationship with fault-proneness Metric Change in odds Ratio 2.012 2.062 3.208 8.168 5.206 7.170 1.090 9.272 1.395 1.385 1.416 1.206 1.133 0.816 1.575 1.067 4.937 1.214 p-value <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.5898 <0.0001 0.0329 0.0389 0.0307 0.3213 0.3384 0.252 0.0922 0.6735 <0.0001 0.2737

Relationship with size rho 0.3217 0.3359 0.3940 0.4310 0.3232 0.3168 -0.124 0.3455 0.1753 0.1958 0.1296 0.0297 0.0493 -0.0855 0.2365 -0.1229 0.2765 -0.0345 p-value <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 0.1082 <0.0001 0.0163 0.0088 0.0785 0.7010 0.4913 0.2528 0.0019 0.1115 0.0001 0.6553

CBO CBO’ RFC1 RFCoo MPC ICP IH-ICP NIH-ICP DAC DAC’ OCAIC FCAEC OCMIC OCMEC IFMMIC AMMIC OMMIC OMMEC

Figure 3: Relationship between coupling metrics and fault-proneness, and between coupling metrics and size from [22]. This covers only coupling to non-library classes. This also excludes the following metrics because no results pertaining to the relationship with fault-proneness were presented: ACAIC, DCAEC, IFCMIC, ACMIC, IFCMEC, and DCMEC. The definition of these metrics is provided in the appendix.

V18-10/04/00

13

This leads us to conclude that, potentially, previous validation studies have overestimated the impact of object oriented metrics on fault-proneness due to the confounding effect of size.

2.3 Summary

In this section the theoretical basis for object-oriented product metrics was presented. This states that cognitive complexity is an intervening variable between the structural properties of classes and faultproneness. Furthermore, the empirical evidence supporting the validity of the object oriented metrics that we study was presented, and this indicates that some of the metrics are strongly associated with faultproneness or the number of faults. We have also demonstrated that there is potentially a strong size confounding effect in empirical studies to date that validate object oriented product metrics. This makes it of paramount importance to determine whether such a strong confounding effect really exists. If a size confounding effect is found, this means that previous validation studies have a positive bias and may have exaggerated the impact of product metrics on fault-proneness. The reason is that studies to date relied exclusively on univariate analysis to test the hypothesis that the product metrics are associated with fault-proneness or the number of faults. The objective of the study below then is to directly test the existence of this confounding effect and its magnitude.

3 Research Method

3.1 Data Source

Our data set comes from a telecommunications framework written in C++ [102]. The framework implements many core design patterns for concurrent communication software. The communication software tasks provided by this framework include event demultiplexing and event handler dispatching, signal handling, service initialization, interprocess communication, shared memory management, message routing, dynamic (re)configuration of distributed services, and concurrent execution and synchronization. The framework has been used in applications such as electronic medical imaging systems, configurable telecommunications systems, high-performance real-time CORBA, and web servers. Examples of its application include in the Motorola Iridium global personal communications system [101] and in network monitoring applications for telecommunications switches at Ericsson [100]. A total of 174 classes from the framework that were being reused in the development of commercial switching software constitute the system that we study. A total of 14 different programmers were involved 16 in the development of this set of classes.

3.2 Measurement

3.2.1 Product Metrics All product metrics are defined on the class, and constitute design metrics, and they have been presented in Section 2.1.2. In our study the size variable was measured as non-comment source LOC for the class. Measurement of product metrics used a commercial metrics collection tool that is currently being used by a number of large telecommunications software development organizations. 3.2.2 Dependent Variable

17

For this product, we obtained data on the faults found in the library from actual field usage. Each fault was due to a unique field failure and represents a defect in the program that caused the failure. Failures were reported by the users of the framework. The developers of the framework documented the reasons for each delta in the version control system, and it was from this that we extracted information on whether a class was faulty.

16 17

This number was obtained from the different login names of the version control system associated with each class.

It has been argued that considering faults causing field failures is a more important question to address than faults found during testing [9]. In fact, it has been argued that it is the ultimate aim of quality modeling to predict post-release fault-proness [50]. In at least one study it was found that pre-release fault-proneness is not a good surrogate measure for post-release fault-proness, the reason posited being that pre-release fault-proness is a function of testing effort [51].

V18-10/04/00

14

A total of 192 faults were detected in the framework at the time of writing. These faults occurred in 70 out of 174 classes. The dichotomous dependent variable that we used in our study was the detection or nondetection of a fault. If one or more faults are detected then the class is considered to be faulty, and if not then it is considered not faulty.

3.3 Data Analysis Methods

3.3.1 Testing for a Confounding Effect It is tempting to use a simple approach to test for a confounding effect of size: examine the association between size and fault-proneness. If this association is not significant at a traditional alpha level, then conclude that size is not different between cases and controls (and hence has no confounding effect), and proceed with a usual univariate analysis. However, it has been noted that this is an incorrect approach [38]. The reason is that traditional significance testing places the burden of proof on rejecting the null hypothesis. This means that one has to prove that the cases and controls do differ in size. In evaluating confounding potential, the burden of proof should be in the opposite direction: before discarding the potential for confounding, the researcher should demonstrate that cases and controls do not differ on size. This means controlling the Type II error rather than the Type I error. Since one usually has no control over the sample size, this means setting the alpha level to 0.25, 0.5, or even larger. A simpler and more parsimonious approach is as follows. For an unmatched case-control study, a measured confounding variable can be controlled through a regression adjustment [12][99]. A regression adjustment entails including the confounder as another independent variable in a regression model. If the regression coefficient of the object-oriented metric changes dramatically (in magnitude and statistical significance) with and without the size variable, then this is a strong indication that there was indeed a confounding effect [61]. This is further elaborated below. 3.3.2 Logistic Regression Model

Binary logistic regression is used to construct models when the dependent variable can only take on two values, as in our case. It is most convenient to use a logistic regression (henceforth LR) model rather than the contingency table analysis used earlier for illustrations since the model does not require dichotomization of our product metrics. The general form of an LR model is:

π =

1+ e

where

1

k ? ? ? ? β 0 + β i xi ? ? ? i =1 ? ?

Eqn. 1

∑

π

is the probability of a class having a fault, and the

18

xi ’s are the independent variables. The β

parameters are estimated through the (unconditional) In a univariate analysis only one being validated:

maximization of a log-likelihood [61].

xi , x1 , is included in the model, and this is the product metric that is

π =

When controlling for size, a second

1+ e

? ( β 0 + β i x1 )

1

Eqn. 2

xi , x 2 , is included that measures size:

1+ e

? ( β 0 + β i x1 + β 2 x2 )

π =

1

Eqn. 3

18 Conditional logistic regression is used when there has been matching in the case-control study and each matched set is treated as a stratum in the analysis [12].

V18-10/04/00

15

In constructing our models, we could follow the previous literature and not consider interaction effects nor consider any transformations (for example, see [4][8][17][18][19][22][106]). To err on the conservative side, however, we did test for interaction effects between the size metric and the product metric for all product metrics evaluated. In none of the cases was a significant interaction effect identified. 19 Furthermore, we performed a logarithmic transformation on our variables and re-evaluated all the 20 models. Our conclusions would not be affected by using the transformed models. Therefore, we only present the detailed results for the untransformed model. The magnitude of an association can be expressed in terms of the change in odds ratio as the x1 variable changes by one standard deviation. This is explained in the appendix (Section 7), and is denoted by ?Ψ . Since we construct two models as shown in Eqn. 2 and Eqn. 3 without and with controlling for size respectively, we will denote the change in odds ratio as ?Ψ x1 and ?Ψx1 + x2 respectively. As suggested in [74], we can evaluate the extent to which the change in odds ratio changes as an indication of the extent of confounding. We operationalize this as follows:

?2ψ =

This gives the percent change in

?ψ x1 ? ?ψ x1 + x2 ?ψ x1 + x2

× 100

Eqn. 4

?Ψx1 + x2 by removing the size confounder. If this value is large then we

can consider that class size does indeed have a confounding effect. The definition of “large” can be problematic, however, as will be seen in the results, the changes are sufficiently big in our study that by any reasonable threshold, there is little doubt. 3.3.3 Diagnostics and Hypothesis Testing

The appendix of this paper presents the details of the model diagnostics that were performed, and the approach to hypothesis testing. Here we summarize these. The diagnostics concerned checking for collinearity and identifying influential observations. We compute the condition number specific to logistic regression, η LR , to determine whether dependencies amongst

the independent variables are affecting the stability of the model (collinearity). The ?β value provides us an indication of which observations are overly influential. For hypothesis testing, we use the likelihood ratio statistic, G, to test the significance of the overall model, the Wald statistic to test for the significance 2 of individual model parameters, and the Hosmer and Lemeshow R value as a measure of goodness of fit. Note that for the univariate model the G statistic and the Wald test are statistically equivalent, but we present them both for completeness. All statistical tests were performed at an alpha level of 0.05.

4 Results

4.1 Descriptive Statistics

Box and whisker plots for all the product metrics that we collected are shown in Figure 4. These indicate th th 21 the median, the 25 and 75 quantiles. Outliers and extreme points are also shown in the figure. As is typical with product metrics their distributions are clearly heavy tailed. Most of the variables are counts, and therefore their minimal value is zero. Variables NOC, NMO, and SIX have less than six observations that are non-zero. Therefore, they were excluded from further analysis. This is the approach followed in [22].

19 20 21

Given that product metrics are counts, an appropriate transformation to stablize the variance would be the logarithm. We wish to thank an anonymous reviewer for making this suggestion.

As will be noted that in some cases the minimal value is zero. For metrics such as CBO, WMC and RFC, this would be because the class was defined in a manner similar to a C struct, with no methods associated with it.

V18-10/04/00

16

The fact that few classes have NOC values greater than zero indicates that most classes in the system are leaf classes. Overall, 76 of the classes had a DIT value greater than zero, indicating that they are subclasses. The remaining 98 classes are at the root of the inheritance hierarchy. The above makes clear that the inheritance hierarchy for this system was “flat”. Variable DIT has a small variation, but this is primarily due to there not being a large amount of deep inheritance in this framework. Shallow inheritance trees, indicating sparse use of inheritance, have been reported in a number of systems thus far [27][30][32].

180 160 140 120 100 80 60 40 20 0 WMC RFC LCOM NMA

10 8 6 4 2 0 DIT NOC CBO NMO SIX NPAVG

Figure 4: Box and whisker plots for all the object-oriented product metrics. Two charts are shown to allow for the fact that two y-axis scales are required due to the different ranges. The LCOM values may seem to be large. However, examination of the results from previous systems indicate that they are not exceptional. For instance, in the C++ systems reported in [22], the maximum LCOM value was 818, the mean 43, and the standard deviation was 106. Similarly, the system reported in [19] had a maximum LCOM value of 4988, a mean of 99.6, and standard deviation of 547.7.

4.2 Correlation with Size

Table 4 shows the correlation of the metrics with size as measured in LOC. As can be seen all of the associations are statistically significant except DIT. But DIT did not have much variation, and therefore a weak association with size is not surprising. All metrics except LCOM and NPAVG have a substantial correlation coefficient, indicating a non-trivial assocation with size.

V18-10/04/00

17

OO Metric WMC DIT CBO RFC LCOM NMA NPAVG

LOC

Rho

0.88 0.098 0.46 0.88 0.24 0.86 0.27

p-value

<0.0001 0.19 <0.0001 <0.0001 0.0011 <0.0001 0.000256

Table 4: Spearman correlation of the object oriented metrics (only the ones that have more than five nonzero values) with size in LOC.

4.3 Validation Results

The results of the univariate analyses and the models controlling for size for each of the remaining metrics are presented in this section. The complete results are presented in Table 5.

V18-10/04/00

18

Without Size Control Metric

WMC DIT CBO RFC LCOM NMA NPAVG

Controlling for Size

G (p-value)

10.1 (0.0015) 0.17 (0.6814) 4.16 (0.0414) 11.48 (0.0007) 3.24 (0.0721) 12.83 (0.0003) 0 (0.975)

H-L R

0.043

2

η LR

2.24 2.166 1.792 2.41 4.011 2.76 1.75

Coeff. (p-value)

0.0280 (0.0022) -0.1224 (0.3409) 0.2804 (0.0266) 0.0408 (0.0013) 0.0074 (0.0391) 0.0690 (0.0007) 0.0053 (0.4875)

?Ψ

1.74 0.938 1.387 1.84 1.33 1.975 1.004

G (p-value)

13.83 (0.001) -13.59 (0.0011) 16.51 (0.0003) 13.91 (0.001) 16.06 (0.0003) --

H-L R

0.059 --

2

η LR

7.63 -3.59 7.843 4.682 7.47 --

Coeff. (p-value)

-0.0113 (0.3073) -0.0209 (0.4495) -0.0207 (0.2520) 0.0026 (0.2830) -0.0025 (0.4783) --

?Ψ

0.80 -1.024 0.73 1.107 0.97 --

Size Coeff. (p-value)

0.0141 (0.0289) -0.0105 (0.0030) 0.0171 (0.0157) 0.0101 (0.0017) 0.0127 (0.0402) --

Size

?Ψ

2.374 -1.9 2.80 1.86 2.15 --

0.0007 0.018 0.049 0.014 0.055 4.18e-6

0.0579 0.071 0.059 0.069 --

Table 5: Overall results of the models without control of size (univariate models), and with control of size. The G value is the likelihood ratio test for the whole model. The “Coeff.” columns give the estimated parameters from the logistic regression model. The “p-value” is the one-sided test of 2 2 the null hypothesis for the coefficient. The R values are based on the definition of R provided by Hosmer and Lemeshow [61]. Hence they are 2 referred to as the H-L R values. For the second half of the table, presenting the results of the model with size control, the coefficient for the size parameter is provided with its change in odds ratio. For the metrics where the model without size control is not significant (DIT and NPAVG) we do not present the model with size control since given the hypothesised confounding effect, the results will not be substantively different from the no size control model.

V18-10/04/00

19

4.3.1

WMC

As can be seen in Table 5, the univariate result for WMC is rather consistent with the previous literature that explicitly evaluated WMC [4][10][106]. The overall goodness of fit is significant, and the regression parameter is significant. Furthermore, the ?Ψ indicates that an increase of one standard deviation in the value of WMC will increase the odds of a fault 1.74 times. The results after the correction for size show that the regression parameter for WMC is no longer statistically significant, but the impact of size is large and statistically significant. The condition number does not indicate that collinearity is a problem. The WMC coefficient is negative, but this is not different from zero after taking sampling variability into account. Table 5 shows that the inclusion of size in the model has a dramatic impact on the validity of WMC, and is confirmed by a we can say that size has a confounding effect on WMC. 4.3.2 DIT

?2ψ value of 117%. Hence

The results for the DIT metric as shown in Table 5 indicate that there is no relationship between DIT and fault-proneness, with the overall G coefficient being non-significant. An analysis similar to that in [27] whereby the classes were dichotomized into those involved in inheritance and those not involved was performed. This subsequent analysis does not change the conclusions that DIT is not associated with fault-proneness in this system. 4.3.3 CBO

The univariate analysis results for CBO indicate that CBO is associated with fault-proneness, with a change in odds ratio of 1.387. This is consistent with previous literature that found CBO to be associated with fault-proneness [22][19]. After controlling for the confounding impact of size, the effect of CBO on fault-proneness diminishes considerably, with a ? ψ value of 35%. Furthermore, CBO is no longer statistically significant. Also note that the condition number does not indicate collinearity problems in this model.

2

4.3.4

RFC

The univariate results for RFC indicate that RFC is associated with fault-proneness, as one would expect given the current literature [19][22][10][4][106]. However, after controlling for size, Table 5 shows that RFC coefficient is no longer statistically significant, the ? ψ value is 152%, and the multivariate model is dominated by size. This indicates that there is a strong size confounding effect. The condition number for this model does not indicate collinearity problems.

2

4.3.5

LCOM

As will be recalled, the evidence for the validity of the LCOM metric from previous studies is equivocal. This is further evident here in Table 5, whereby the G statistic is not significant at an alpha level of 0.05, but it does approach it. The multivariate LCOM model in Table 5 clearly indicates that after controlling for size, there is no evidence that LCOM is associated with fault-proneness. Again, this model does not show signs of collinearity problems. 4.3.6 NMA

The univariate and multivariate results for the NMA metric present the same pattern, whereby at the univariate level this metric is highly significant, but after controlling for size its effect disappears, with a

?2ψ value of 103%.

V18-10/04/00

20

4.3.7

NPAVG

The results for the NPAVG metric in Table 5 indicate that there is no association at all with faultproneness, and in fact, this is the worst metric amongst those studied here.

4.4 Discussion of Results

The results that we obtained are remarkably consistent. They indicate that for some of the product metrics that we studied, the univariate analyses demonstrate that they are indeed associated with faultproneness (namely WMC, CBO, RFC and NMA). After controlling for the size confounder, all of these associations disappear, and the differences in the product metrics’ odds ratios indicates a strong and classic confounding effect. The remaining product metrics were not associated with fault-proneness from a univariate analysis, but this was predictable from previous work. This result has two important implications. First, it casts doubt on the validity claims made in the past about object-oriented metrics. It seems more plausible now that previous studies demonstrated an association with fault-proneness or faults because of the confounding effect of size, which by itself leads to greater fault-proneness or more faults. Second, it suggests caution in drawing validity conclusions from univariate analyses, as is currently a common practice, in validation studies of object-oriented product metrics. To be clear, our results are not the last word on the validity of these metrics. Subsequent work with different data sets may demonstrate that they are valid after controlling for size. Re-examination of previous data sets may indicate that the metrics are still valid after controlling for size. What is clear is that researchers in this area, especially those who have demonstrated valid metrics, are urged to reexamine their data and determine the extent to which their conclusions would be contaminated after controlling for size. Similarly, future validation studies should always control for size. Indeed, a careful examination of some previous studies indicates that there are potentially some objectoriented metrics that have an association with fault-proneness after controlling for size. For instance, Briand et al. construct two multivariate models in two studies [19][22] using a forward selection procedure and allowing a large number of metrics to enter. The purpose was to build prediction models only and not to validate the metrics (recall that validation was demonstrated primarily through univariate results). When size metrics are allowed to enter, the result was a model with size metrics in both cases. Notwithstanding 22 the inherent problems in using a forward selection procedure , one of the models included the NMA metric [22], which had a significant impact on fault-proneness after controlling for size. This indicates that, out of the set of common object-oriented metrics, there is possibly a subset of metrics that are associated with fault-proneness for reasons other than their association with size. More promising results were obtained after we re-analyzed the data of Cartwright and Shepperd [27]. They had collected early design metrics from Shlaer-Mellor models for a real-time telecommunications system with 32 C++ classes and data on the number of faults per class. We analyzed this data using 23 multivariate least-squares models including LOC and each design metric in turn. We found their design metrics to be strongly associated with the number of defects after controlling for size. Specifically, the DELS (count of all delete accesses by a class), ATTRIB (count of attributes per class), READS (count of all read accesses), WRITES (count of all write accesses), EVNTS (count of events per class in the state model), and STATES (count of states per class) all had significant associations with the number of

22 A Monte Carlo simulation of forward selection indicated that in the presence of collinearity amongst the independent variables, the proportion of ‘noise’ variables that are selected can reach as high as 74% [44]. It is clear that in the object-oriented metrics validation studies many of the metrics were correlated [19][22]. Harrell and Lee [53] note that when statistical significance is the sole criterion for including a variable the number of variables selected is a function of the sample size, and therefore tends to be unstable across studies. Furthermore, some general guidelines on the number of variables to consider in an automatic selection procedure given the number of ‘faulty classes’ are provided in [54]. The studies that used automatic selection [19][22] had a much larger number of variables than these guidelines. Therefore, clearly the variables selected through such a procedure should not be construed as the best object-oriented metrics nor even as good predictors of fault-proneness.

We used the condition number of Belsley et al. [5] as a diagnostic for detecting collinearity. We did not find any serious collinearity problems. The biggest condition number was for the STATES variable: 17.18. However, this is still smaller than the recommended threshold of 30.

23

V18-10/04/00

21

defects at a one tailed alpha level of 0.0.5. Of course, metrics such as READ and WRITE, and EVNTS and STATES are strongly correlated, but individually they seem to capture a structural property that has an impact on faults.

24

4.5 Limitations

This study has a number of limitations which should be made clear in the interpretation of our results. These limitations are not unique to our study, but are characteristics of most of the product metrics validation literature. However, it is of value to repeat them here. This study did not account for the severity of faults. A failure that is caused by a fault may lead to a whole network crash or to an inability to interpret an address with specific characters in it. These types of failures are not the same, the former being more serious. Lack of accounting of fault severity was one of the criticisms of the quality modeling literature in [49]. In general, unless the organization has a reliable data collection program in place where severity is assigned, it is difficult to retrospectively obtain this data. Therefore, the prediction models developed here can be used to identify classes that are prone to have faults that cause any type of failure. Even though our overall approach ought to compel caution in future validation studies, we do not claim that none of the studied object oriented metrics are associated with fault-proneness. They were not associated in our study, but they may be in other studies. One thing is clear, however: previous studies that did not account for the confounding effect of size overestimate the impact of the product metrics on fault-proneness. It is also important to note that our conclusions are pertinent only to the fault-proneness dependent variable, albeit this seems to be one of the more popular dependent variables in validation studies. We do not make claims about the validity (or otherwise) of the studied object-oriented metrics when the external attributes of interest are, for example, maintainability (say measured as effort to make a change) or reliability (say measured as mean time between failures). Even though we cast validation studies as case-control studies, we have been cautious in not making claims of causality. There are other potential confounders that would have to be controlled in order to gain any confidence that the product metrics are really causally affecting fault-proneness, for example, 25 the competence of the designers and programmers. Our argument has been to identify each of these variables and incrementally perform better validation studies. Eventually, once a core set of confounding variables has been identified, they ought to be controlled for in every validation study. The case-control framework provides a set of techniques for controlling the confounders in the design of the study through matching, or through various types of statistical adjustments.

5 Conclusions

The objective of the current paper was to examine if there is a confounding effect of class size in the validation of object-oriented metrics. The metrics set that we studied are the CK metrics [30] and a subset of the Lorenz and Kidd metrics [80]. Our focus in this paper was the prediction of class faultproneness for faults found in the field. To attain this objective we first presented evidence from previous work indicating that there is a potential confounding effect of size in the product metrics to fault-proneness relationship. We then described and

24

For this analysis we did a careful evaluation of influential observations using Cook’s distance [35][36]. In general, we found two observations for each metric that can be considered influential. However, inclusion or removal of these observations, while changing the estimated parameters and goodness of fit, does not change our conclusions.

25

We did perform an analysis that accounted for the potential confounding effect of programmer capability. For a subset of the 174 classes we could reliably obtain the name of the primary developer. This subset consisted of 159 classes. To account for the potential confounding effect of individual capability we included 13 dummy variables in each of the multivariate models to capture the effect of the 14 different developers (see [61] for a description of LR models with design variables), in addition to the product metric in question and the size metric. The estimated coefficients for the product metrics were slightly different from the values we present in this paper, but their statistical significance followed exactly the trends shown in our results. Therefore, accounting for the programmer confounder in addition to size would not change our conclusions.

V18-10/04/00

22

justified an empirical methodology for examining whether there is a confounding effect. Finally, we performed a study with a large C++ system to answer the confounding effect question. When we performed our own empirical investigation we controlled for the confounding effect of class size. Without controlling for class size our results are consistent with what one would expect from the current literature, with most of the coefficients being statistically significant. When we controlled for class size all of the effects disappeared, providing compelling evidence for a confounding effect of class size. These results cast serious doubt on the empirical validity of object-oriented metrics since they suggest that previous empirical validity results were strongly positively biased by the confounding effect of size. We urge a re-examination of already validated object-oriented metrics after controlling for size. Our initial attempts at re-analyzing existing data (after controlling for size) indicate that some design metrics are indeed associated with faults after controlling for size. However, this ought to be performed on a broader scale.

6 Acknowledgements

The authors wish to thank Janice Singer, Hakan Erdogmus, Oliver Laitenberger, Morven Gentleman, James Miller, Barbara Kitchenham, and Anatol Kark for their comments on an earlier draft of this paper. We also wish to thank the anonymous reviewers whose suggestions have improved the paper.

7 Appendix A: The Odds Ratio

In this appendix we provide a brief description of the odds ratio, which is a measure of association. We take it that the dependent variable is dichotomous, fault or no fault. First we assume that the independent variable is also dichotomous, say High Coupling and Low Coupling. Second, we relax this and explain the odds ratio in the context of a logistic regression model and also the change in odds ratio. The latter is used extensively in this paper. The odds of an event, such as High Coupling, is the ratio of the number of ways the event can occur to the number of ways the event cannot occur. Therefore, if the probability that a class has High Coupling is denoted by

p , then the odds of a class having high coupling is

p . Consider the following 1? p

contingency table that could have been obtained from a validation study:

Fault Proneness

Faulty

Coupling

HC LC

a c

Not Faulty

b d

Table 6: A contingency table showing the notation for potential results of a validation study. The odds of a faulty class having High Coupling is:

a

a+c = a c c a+c

Eqn. 5

The odds of a not-faulty class having High Coupling is given by:

V18-10/04/00

23

b+d = b d d b+d

b

Eqn. 6

The ratio of the odds that the faulty class has High Coupling to the odds that the not-faulty class has High 26 Coupling is the odds ratio [47]:

ad ψ = c = b bc d

a

Eqn. 7

If coupling is not related to fault-proneness, then the odds ratio is equal to 1. If there is a positive association, then the odds ratio will be greater than 1, and if there is a negative association, then the odds ratio will be negative. Now, assume that we are performing a univariate logistic regression with one coupling metric. Let denote the presence of a fault ( D = 1 ) or absence ( D = 0 ), and let x be our coupling metric. Then:

D

p x = Pr (D = 1 x ) =

is the probability of a fault given the value of of x is:

1 1+ e

? ( β 0 + β1 x )

Eqn. 8

x . The probability of there not being a fault given the value

1 1+ e

? ( β 0 + β1 x )

q x = 1 ? p x = Pr (D = 0 x ) = 1 ?

=

1

e ? ( β 0 + β1 x ) 1 + e ? ( β 0 + β1 x )

Eqn. 9

One can then claim that out of every say 100 classes, odds of a class having a fault given the value of

x is:

1+ e

? ( β 0 + β1 x )

× 100 classes will have a fault. The

px = Ψ ( x ) = e ( β 0 + β1 x ) qx

The odds of a class having a fault if the product metric value

Eqn. 10

x is increased by one standard deviation is:

Eqn. 11

Ψ ( x + σ ) = e ( β 0 + β1 ( x +σ ))

The change in odds by increasing the value of

x by one standard deviation is:

Eqn. 12

?Ψ =

Ψ (x + σ ) = e β1σ Ψ (x )

26

This is the estimate of the population odds ratio.

V18-10/04/00

24

8 Appendix B: Definition of Metrics

The following is a definition of the metrics that appear in Figure 3. Metric Acronym CBO CBO’ RFC1 RFCoo MPC ICP IH-ICP NIH-ICP DAC DAC’ OCAIC FCAEC OCMIC OCMEC IFMMIC AMMIC OMMIC OMMEC Definition This is defined in the main text of the paper. Same as CBO except that inheritance-based coupling is not counted [29]. This is defined in the main text of the paper. Same as RFC1 except that methods indirectly invoked by methods in M are not included in the response set [29]. The number of method invocations in a class [78]. The number of method invocations in a class, weighted by the number of parameters of the invoked methods [76]. The same as ICP, but counts invocations of methods of ancestors of classes (i.e., inheritance-based coupling) only [76]. The same as ICP, but counts invocations to classes not related through inheritance [76]. The number of attributes in a class that have as their type another class [78]. The number of different classes that are used as types of attributes in a class [78]. These coupling metrics are counts of interactions between classes. The metrics distinguish between the classes (friendship, inheritance, none), different types of interactions, and the locus of impact of the interaction [17]. The acronyms for the metrics indicate what types of interactions are counted:

?

The first or first two letters indicate the relationship (A: coupling to ancestor classes; D: Descendents; F: Friend classes; IF: Inverse Friends; and O: other, i.e., none of the above) The next two letters indicate the type of interaction between classes c and d (CA: there is a class-attribute interaction between classes c and d if c has an attribute of type d; CM: there is a class-method interaction between classes c and d if class c has a method with a parameter of type class d; MM: there is a method-method interaction between classes c and d if c invokes a method of d, or if a method of class d is passed as parameter to a method of class c. The last two letters indicate the locus of impact (IC: Import Coupling; and EC: Export Coupling)

?

?

V18-10/04/00

25

9 Appendix C: Data Analysis Details

This appendix presents the details of the data analysis that we performed. Specifically, it covers how we determine the extent of collinearity, identify influential observations, and perform hypothesis tests.

9.1 Diagnosing Collinearity

Since in our study we control for the size confounder through regression adjustment, careful attention should be paid to the detection and mitigation of potential collinearity. Previous studies have shown that outliers can induce collinearities in regression models [83][98]. But also, it is known that collinearities may mask influential observations [5]. This has lead some authors to recommend addressing potential collinearity problems as a first step in the analysis [5], and this is what we do. We first briefly review some common approaches for diagnosing collinearity in the context of ordinary least squares regression (henceforth LS regression for short). This is important because the previously suggested diagnostic for validating object-oriented metrics [22] was initailly defined for the LS case. We then present some further diagnostics for the LR case. One of the requirements for properly interpreting a LS regression model is that no one of the independent variables are perfectly linearly correlated to one or more other independent variables. This situation is referred to as perfect collinearity. When there is perfect collinearity then the regression surface is not even defined. Perfect collinearity is rare, however, and therefore one talks about the degree of collinearity. Recall that the confounder is associated with the product metric by definition, and therefore one should not be surprised if strong collinearity exists. The larger the collinearity, the greater the standard errors of the coefficient estimates. This means that tstatistics for significance testing tend to be small and confidence intervals tend to be wide [28][92]. One implication of this is that the conclusions drawn about the relative impacts of the independent variables based on the regression coefficient estimates from the sample are less stable. In the context of LR regression, Hosmer and Lemeshow [61] have illustrated the detrimental effects of collinearity amongst the independent variables and between the independent variables and the constant term. Specifically, the parameter standard errors tend to be grossly inflated. An empirical investigation with a large data set indicated large errors and variability in maximum-likelihood parameter estimates [97]. Schaefer [98] performed a Monte Carlo simulation showing that for two independent variables (which is our context) the impact of collinearity on the LR maximum likelihood estimates was severe but reduced as sample size increases. In some cases the effect became almost negligible for sufficiently large sample sizes (n=200) [98]. Even though our sample size approaches this number, to err on the conservative side we explicitly diagnose and deal with collinearity problems if they are detected. Hosmer and Lemeshow [61] suggest that diagnostics originally devised for LS regression can also be applied for diagnosing collinearity for LR regression. Mennard [84] proposes using the tolerance statistic to detect collinearity. This is defined as

th

2 1? R2 j , where the R j is the multiple coefficient of determination

after regressing the j independent variable on all of the remaining independent variables [104]. He notes that a tolerance of less than 0.2 is cause for concern, and a tolerance of less than 0.1 indicates a serious collinearity problem. This diagnostic is the reciprocal of the more common Variance Inflation Factor (VIF) [28], where a value greater than 10 is taken as signaling a collinearity problem. A common operationalization of the VIF is as follows. Assume that the linear regression model is given by:

y=

where

)

0

+ ? xZ

Eqn. 13

y is the dependent variable vector, is a vector of ones, ? x are the LS parameter estimates, and Z is a n × k matrix of the xij raw data, with i = 1K n and j = 1K k . If we transform each of the xij

values using the correlation transform as follows [90]:

V18-10/04/00

26

? xij ? x j ? n ?1 ? ? sj 1

where

? ? ? ?

Eqn. 14

s j is the standard deviation of the j th variable, then the correlation matrix is given by Z T Z , and

the VIFs are the diagonals of

(Z Z )

T

?1

.

Belsley [5] notes a number of important disadvantages of using the VIF as a collinearity diagnostic. Namely, the common way of computing the VIF as above does not account for collinearities involving the intercept, it cannot diagnose the number of near dependencies amongst the independent variables, and there are no guidelines for interpreting the VIF values that are not ad-hoc. Belsley et al. [5] propose the condition number as a collinearity diagnostic. reformulate Eqn. 13 as follows:

27

First, it is easier if we

y = ?X

Eqn. 15

Here, there is a column of ones in the n × ( k + 1) X matrix to account for the fact that the intercept is included in the models. The condition number is defined as follows:

η=

where the

λmax λmin

matrix:

Eqn. 16

λ ’s are singular values obtained from the decomposition of the X

X = UDG T

Eqn. 17

where the λ ’s are the non-negative diagonal values of D . Based on a number of experiments, Belsley et al. suggest that a condition number greater than 30 indicates mild to severe collinearity. The condition number can also be obtained from the eigenvalues of the

X T X matrix as follows:

Eqn. 18

η=

where

? max ? min

? max ≥ L ≥ ? min

are the eigenvalues.

Belsley et al. emphasize that in the case where an

intercept is included in the model, the independent variables should not be centered since this can mask the role of the constant in any near dependencies. Furthermore, the X matrix must be column equilibrated, that is, each column should be scaled for equal Euclidean length. Without this, the collinearity diagnostic produces arbitrary results. Column equilibration is achieved by scaling each column in

X , X j , by its norm [7]: X j

Xj .

Briand et al. [22] suggest using the condition number as the basis for diagnosing collinearity in multivariate LR models when validating object-oriented product metrics. They propose using the eigenvalues obtained from a principal components analysis in conjunction with Eqn. 18. However, this 28 approach suffers from the problem of ignoring the intercept and its impact on near dependencies. It is

27 Actually, they propose a number of diagnostic tools. However, for the purposes of our analysis with only two independent variables we will consider the condition number only. 28

Recall that only the independent variables are used in a principal components analysis.

V18-10/04/00

27

known that collinearities can exist between the independent variables and the intercept, jointly and individually [104]. Therefore the Briand et al. approach can underestimate the extent of collinearity in a particular data set. For example, from our data set the condition number computed from the results of a principal components analysis for a model with the variables WMC, RFC, CBO, and LOC is 6.94, whereas computing it directly using eigenvalues of the X X or a singular value decomposition of the X matrix gives a condition number of 10.62. It is therefore prudent to compute the eigenvalues or singular values directly from X after the appropriate scaling. Even though the condition number diagnostic was originally designated for LS regression, it has been applied in some instances for diagnosing collinearity in LR regression [105]. However, this diagnostic has been extended to the case of LR regression [42][108] by capitalizing on the analogy between the independent variable cross-product matrix in LS regression to the information matrix in maximum likelihood estimation, and therefore it would certainly be parsimonious to use the latter. The information matrix in the case of LR regression is [61]:

T

29

? ? = XT V ?X I

where

()

Eqn. 19

? is the diagonal matrix consisting of π ? ii (1 ? π ? ii V

) where π? ii

is the probability estimate from the

?1

? ? . One can LR model for observation i . Note that the variance-covariance matrix of ? is given by I then compute the eigenvalues from the information matrix after column equilibrating and compute the condition number as in Eqn. 18. The general approach for non-LS models is described by Belsley [6]. In this case, the same interpretive guidelines as for the LS condition number are used [108].

We therefore use this as the condition number in diagnosing collinearity in our models that include the size confounder, and will denote it as η LR . In principle, if severe collinearity is detected we would use 30 logistic ridge regression to estimate the model parameters [97][98]. However, since we do not detect severe collinearity in our study, we will not describe the ridge regression procedure here.

()

9.2 Hypothesis Testing

The next task in evaluating the LR model is to determine whether any of the regression parameters are different from zero, i.e., test H 0 : β 1 = β 2 = L = β k = 0 . This can be achieved by using the likelihood ratio G statistic [61]. One first determines the log-likelihood for the model with the constant term only, and denote this

l 0 for the ‘null’ model.31 Then the log-likelihood for the full model with the k parameters

29 Several columns of the data matrix become jointly, but not individually, collinear with the intercept, for example, when a linear combination of several non-constant independent variables is nearly constant. 30 31

Ridge regression produces biased estimates of the parameters, but they are more robust in situations with high collinearity. This is defined as:

l 0 = ∑ {n1 ln(n1 ) + n0 ln (n0 ) ? n ln (n )}

i =1

n

where

n1 = ∑ y i

and

n0 = ∑ (1 ? y i ) .

V18-10/04/00

28

is determined, and denote this with k degrees of freedom.

33

l k .32 The G statistic is given by 2(l k ? l 0 ) which has a χ 2 distribution

If the likelihood ratio test is found to be significant at an

the individual coefficients. This is done using a Wald statistic,

α = 0.05 ? β

then we can proceed to test each of

j

? s.e. β j

( ) , which follows a standard normal

distribution. These tests were performed at a one-tailed alpha level of 0.05. We used one-tailed test since all of our alternative hypotheses are directional: there is a positive association between the metric and fault-proneness. In previous studies another descriptive statistic has been used, namely an

R 2 statistic that is analogous l0 ? l k 2 and to the multiple coefficient of determination in LS regression [22][19]. This is defined as R = l0

may be interpreted as the proportion of uncertainty explained by the model. Ideally, since the analogy to LS regression is the driving force behind such a statistic, it should vary between 1 and 0. However, Hosmer and Lemeshow [61] note that when there are groups in the data the maximum value will not be 1. A group occurs when there are multiple observations that have exactly the same values on the independent variables. In validation studies of product metrics this frequently happens (for example, the data clustering in very few groups with only one independent variable) because the product metrics are counts (i.e., integers). To be specific, in our data set there were 48 groups for the WMC metric (recall that we have 174 observations), 40 groups for the RFC metric, 6 groups for the CBO metric, 39 groups for the LCOM metric, and 31 groups for the NMA metric. Hosmer and Lemeshow [61] suggest an alternative formulation for R that corrects for this, and still allows for the reduction in uncertainty interpretation. It should be recalled that this descriptive statistic will 34 in general have low values compared to what one is accustomed to in a LS regression. In our study we will use the corrected Hosmer and Lemeshow

2

R 2 statistic as an indicator of the quality of the LR model.

9.3 Influence Analysis

Influence analysis is performed to identify influential observations (i.e., ones that have a large influence on the LR regression model). This can be achieved through deletion diagnostics. For a data set with n observations, estimated coefficients are recomputed n times, each time deleting exactly one of the

32

This is defined as:

? ( xi ) ) + (1 ? y i ) ln (1 ? π ? ( xi ) )}. l k = ∑ {yi ln (π

i =1

n

Note that we are not concerned with testing whether the intercept is zero or not since we do not draw substantive conclusions from the intercept in a validation study. If we were, we would use the log-likelihood for the null model which assigns a probability of 0.5 to each response. Hosmer and Lemeshow [61] recommend against using the R statistic stating that “ R is nothing more than an expression of the likelihood ratio test and, as such, is not a measure of goodness-of-fit. This likelihood ratio test compares fitted values under two models rather than comparing observed values to those fitted under one model.” However, Menard [85] notes that the comparison

34

33

2

2

of fitted values under two models is one way to interpret the traditional R coefficient of determination in OLS regression. Therefore, if there is no conceptual objection to its use for OLS regression, there ought to be none for logistic regression. An alternative measure of goodness-of-fit was proposed by Maddala [82]. In an OLS context this can be interpreted in terms of the geometric mean reduction in error per observation produced by the full model as opposed to the model with only the intercept [37]. Nagelkerke [88] notes that this statistic cannot have a value of 1 even when the model fits the data perfectly, and proposed a correction for the maximum value. However, this coefficient measures a criterion that is not optimized in maximum likelihood logistic regression, and therefore the Hosmer and Lemeshow [88] is preferred. Furthermore, Menard [85], based on a conceptual review and an empirical demonstration, recommended the Hosmer and Lemeshow coefficient because it is the most independent of the base rate (in our case, this is the proportion of classes that are faulty), making it a more generally useful measure of goodness-of-fit.

2

V18-10/04/00

29

observations from the model fitting process. This is similar to the approach advocated by Briand et al. [22] for identifying outliers in multivariate LR models used in validating object-oriented metrics. Specifically, they use the Mahalanobis distance from the centroid, removing each observation in turn before computing the centroid. However, this approach has some deficiencies. First, it is important to note that an outlier is not necessarily an influential observation [96]. For example, an observation may be further away from the other observations but may be right on the regression surface. Furthermore, influential observations may be clustered in groups. For example, if there are say two independent variables in our model and there are five out of the n observations that have exactly the same values on these two variables, then these 5 observations are a group. Now, assume that this group is multivariately different from all of the other n ? 5 observations. Which one of these 5 would be considered an outlier ? A leave-one-out approach may not even identify any one of these observations as being different from the rest since in each run the remaining 4 observations will be included in computing the centroid. Such masking effects are demonstrated in [5]. Therefore, a more principled approach to the detection of influential observations is necessary. The first step is to identify the groups of observations with the same values on the independent variables. Similar to Cook’s distance diagnostics [35][36] that are commonly used in LS regression to identify influential observations, Pergibon has defined the ?β diagnostic [93] to identify influential groups in LR regression.

The ?β diagnostic is a standardized distance between the parameter estimates when a group is included and when it is not included in the model. We use the ?β diagnostic in our study to identify influential groups. For groups that are deemed influential we investigate this to determine if we can identify substantive reasons for the differences. We found one observation that consistently had a large ?β value. However, its removal does not change our conclusions.

V18-10/04/00

30

10 References

[1] [2] [3] M. Almeida, H. Lounis, and W. Melo: “An Investigation on the Use of Machine Learned Models for th Estimating Correction Costs”. In Proceedings of the 20 International Conference on Software Engineering, pages 473-476, 1998. A. Baker, J. Bieman, N. Fenton, D. Gustafson, A. Mellon, and R. Whitty: “A Philosophy for Software Measurement”. In Journal of Systems and Software, 12:277-281, 1990. V. Basili, S. Condon, K. El Emam, R. Hendrick, and W. Melo: “Characterizing and Modeling the th Cost of Rework in a Library of Reusable Software Components”. In Proceedings of the 19 International Conference on Software Engineering, 1997. V. Basili, L. Briand and W. Melo: “A Validation of Object-Oriented Design Metrics as Quality Indicators”. In IEEE Transactions on Software Engineering, 22(10):751-761, 1996. D. Belsley, E. Kuh, and R. Welsch: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley and Sons, 1980. D. Belsley: Conditioning Diagnostics: Collinearity and Weak Data in Regression. John Wiley and Sons, 1991. D. Belsley: “A Guide to Using the Collinearity Diagnostics”. In Computer Science in Economics and Management, 4:33-50, 1991. S. Benlarbi and W. Melo: “Polymorphism Measures for Early Risk Prediction”. In Proceedings of the st 21 International Conference on Software Engineering, pages 334-344, 1999. A. Binkley and S. Schach: “Prediction of Run-Time Failures Using Static Product Quality Metrics”. In Software Quality Journal, 7:141-147, 1998. A. Binkley and S. Schach: “Validation of the Coupling Dependency Metric as a Predictor of Runth Time Fauilures and Maintenance Measures”. In Proceedings of the 20 International Conference on Software Engineering, pages 452-455, 1998. D. Boehm-Davis, R. Holt, and A. Schultz: “The Role of Program Structure in Software Maintenance”. In International Journal of Man-Machine Studies, 36:21-63, 1992. N. Breslow and N. Day: Statistical Methods in Cancer Research – Volume 1 – The Analysis of Case Control Studies, IARC, 1980. L. Briand, W. Thomas, and C. Hetmanski: “Modeling and Managing Risk Early in Software Development”. In Proceedings of the International Conference on Software Engineering, pages 5565, 1993. L. Briand, V. Basili, and C. Hetmanski: “Developing Interpretable Models with Optimized Set Reduction for Identifying High-Risk Software Components”. In IEEE Transactions on Software Engineering, 19(11):1028-1044, 1993. L. Briand, K. El Emam, and S. Morasca: “Theoretical and Empirical Validation of Software Product Measures”. International Software Engineering Research Network, Technical Report ISERN-95-03, 1995. L. Briand, J. Daly, and J. Wuest: “A Unified Framework for Cohesion Measurement in ObjectOriented Systems”. In Empirical Software Engineering, 3:65-117, 1998. L. Briand, P. Devanbu, and W. Melo: “An Investigation into Coupling Measures for C++”. In th Proceedings of the 19 International Conference on Software Engineering, 1997. L. Briand, J. Daly, V. Porter, and J. Wuest: “Predicting Fault-Prone Classes with Design Measures in Object Oriented Systems”. In Proceedings of the International Symposium on Software Reliability Engineering, pages 334-343, 1998. L. Briand, J. Wuest, S. Ikonomovski, and H. Lounis: “A Comprehensive Investigation of Quality Factors in Object-Oriented Designs: An Industrial Case Study”. International Software Engineering Research Network technical report ISERN-98-29, 1998. L. Briand, J. Daly, and J. Wuest: “A Unified Framework for Cohesion Measurement in ObjectOriented Systems”. In Empirical Software Engineering, 3:65-117, 1998. L. Briand, J. Daly, and J. Wuest: “A Unified Framework for Coupling Measurement in ObjectOriented Systems”. In IEEE Transactions on Software Engineering, 25(1):91-121, 1999.

[4] [5] [6] [7] [8] [9] [10]

[11] [12] [13]

[14]

[15]

[16] [17] [18]

[19]

[20] [21]

V18-10/04/00

31

[22] L. Briand, J. Wuest, J. Daly, and V. Porter: “Exploring the Relationships Between Design Measures and Software Quality in Object-Oriented Systems”. In Journal of Systems and Software, 51:245273, 2000. [23] L. Briand, E. Arisholm, S. Counsell, F. Houdek, and P. Thevenod-Fosse: “Empirical Studies of Object-Oriented Artifacts, Methods, and Processes: State of the Art and Future Direction”. In Empirical Software Engineering, 4(4):387-404, 1999. [24] F. Brite e Abreu and R. Carapuca: “Object-Oriented Software Engineering: Measuring and th Controlling the Development Process”. In Proceedings of the 4 International Conference on Software Quality, 1994. [25] F. Brito e Abreu and W. Melo: “Evaluating the Impact of Object-Oriented Design on Software rd Quality”. In Proceedings of the 3 International Software Metrics Symposium, pages 90-99, 1996. [26] M. Cartwright: “An Empirical View of Inheritance”. In Information and Software Technology, 40:795799, 1998. [27] M. Cartwright and M. Shepperd: “An Empirical Investigation of an Object-Oriented Software System”. To appear in IEEE Transactions on Software Engineering. [28] S. Chaterjee and B. Price: Regression Analysis by Example. John Wiley and Sons, 1991. [29] S. Chidamber and C. Kemerer: “Towards a Metrics Suite for Object Oriented Design”. In Proceedings of the Conference on Object Oriented Programming Systems, Languages and Applications (OOPSLA’91). Published in SIGPLAN Notices, 26(11):197-211, 1991. [30] S. Chidamber and C. Kemerer: “A Metrics Suite for Object-Oriented Design”. In IEEE Transactions on Software Engineering, 20(6):476-493, 1994. [31] S. Chidamber and C. Kemerer: “Authors’ Reply”. In IEEE Transactions on Software Engineering, 21(3):265, 1995. [32] S. Chidamber, D. Darcy, and C. Kemerer: “Managerial Use of Metrics for Object Oriented Software: An Exploratory Analysis”. In IEEE Transactions on Software Engineering, 24(8):629-639, 1998. [33] N. Churcher and M. Shepperd: “Comments on ‘A Metrics Suite for Object Oriented Design’”. In IEEE Transactions on Software Engineering, 21(3):263-265, 1995. [34] F. Coallier, J. Mayrand, and B. Lague: “Risk Management in Software Product Procurement”. In K. El Emam and N. H. Madhavji (eds.): Elements of Software Process Assessment and Improvement, IEEE CS Press, 1999. [35] R. Cook: “Detection of Influential Observations in Linear Regression”. In Technometrics, 19:15-18, 1977. [36] R. Cook: “Influential Observations in Linear Regression”. In Journal of the American Statistical Association, 74:169-174, 1979. [37] D. Cox and N. Wermuth: “A Comment on the Coefficient of Determination for Binary Responses”. In The American Statistician, 46:1-4, 1992. [38] L. Dales and H. Ury: “An Improper Use of Statistical Significance Testing in Studying Covariables”. In International Journal of Epidemiology, 7(4):373-375, 1978. [39] J. Daly, J. Miller, A. Brooks, M. Roper, and M. Wood: “Issues on the Object-Oriented Paradigm: A questionnaire Survey”. Research Report EFoCS-8-95, Department of Computer Science, University of Strathclyde, 1995. [40] J. Daly, M. Wood, A. Brooks, J. Miller, and M. Roper: “Structured Interviews on the Object-Oriented Paradigm”. Research Report EFoCS-7-95, Department of Computer Science, University of Strathclyde, 1995. [41] J. Daly, A. Brooks, J. Miller, M. Roper, and M. Wood: “Evaluating Inheritance Depth on the Maintainability of Object-Oriented Software”. In Empirical Software Engineering, 1(2):109-132, 1996. [42] C. Davies, J. Hyde, S. Bangdiwala, and J. Nelson: “An Example of Dependencies Among Variables in a Conditional Logistic Regression”. In S. Moolgavkar and R. Prentice (eds.): Modern Statistical Methods in Chronic Disease Edpidemiology. John Wiley and Sons, 1986. [43] I. Deligiannis and M. Shepperd: “A Review of Experimental Investigations into Object-Oriented Technology”. In Proceedings of the Fifth IEEE Workshop on Empirical Studies of Software Maintenance, pages 6-10, 1999. [44] S. Derksen and H. Keselman: “Backward, Forward and Stepwise Automated Subset Selection Algorithms: Frequency of Obtaining Authentic and Noise Variables”. In British Journal of Mathematical and Statistical Psychology, 45:265-282, 1992.

V18-10/04/00

32

[45] J. Dvorak: “Conceptual Entropy and Its Effect on Class Hierarchies”. In IEEE Computer, pages 5963, 1994. th [46] C. Ebert and T. Liedtke: “An Integrated Approach for Criticality Prediction”. In Proceedings of the 6 International Symposium on Software Reliability Engineering, pages 14-23, 1995. [47] B. Everitt: The Analysis of Contingency Tables. Chapman and Hall, 1992. [48] N. Fenton: “Software Metrics: Theory, Tools and Validation”. In Software Engineering Journal, pages 65-78, January 1990. [49] N. Fenton and M. Neil: “A Critique of Software Defect Prediction Models”. In IEEE Transactions on Software Engineering, 25(5):676-689, 1999. [50] N. Fenton and M. Neil: “Software Metrics: Successes, Failures, and New Directions”. In Journal of Systems and Software, 47:149-157, 1999. [51] N. Fenton and N. Ohlsson: “Quantitative Analysis of Faults and Failures in a Complex Software System”. To appear in IEEE Transactions on Software Engineering. [52] L. Gordis: Epidemiology. W. B. Saunders, 1996. [53] F. Harrell and K. Lee: “Regression Modelling Strategies for Improved Prognostic Prediction”. In Statistics in Medicine, 3:143-152, 1984. [54] F. Harrell, K. Lee, and D. Mark: “Multivariate Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors”. In Statistics in Medicine, 15:361-387, 1996. [55] R. Harrison, L. Samaraweera, M. Dobie, and P. Lewis: “An Evaluation of Code Metrics for ObjectOriented Programs”. In Information and Software Technology, 38:443-450, 1996. [56] W. Harrison: “Using Software Metrics to Allocate Testing Resources”. In Journal of Management Information Systems, 4(4):93-105, 1988. [57] R. Harrison, S. Counsell, and R. Nithi: “Coupling Metrics for Object Oriented Design”. In th Proceedings of the 5 International Symposium on Software Metrics, pages 150-157, 1998. [58] L. Hatton: “Is Modularization Always a Good Idea ?”. In Information and Software Technology, 38:719-721, 1996. [59] L. Hatton: “Does OO Sync with How We Think ?”. In IEEE Software, pages 46-54, May/June 1998. [60] B. Henderson-Sellers: Software Metrics. Prentice-Hall, 1996. [61] D. Hosmer and S. Lemeshow: Applied Logistic Regression. John Wiley & Sons, 1989. [62] J. Hudepohl, S. Aud, T. Khoshgoftaar, E. Allen, and J. Mayrand: “EMERALD: Software Metrics and Models on the Desktop”. In IEEE Software, 13(5):56-60, 1996. [63] J. Hudepohl, S. Aud, T. Khoshgoftaar, E. Allen, and J. Mayrand: “Integrating Metrics and Models for th Software Risk Assessment”. In Proceedings of the 7 International Symposium on Software Reliability Engineering, pages 93-98, 1996. [64] ISO/IEC 9126: Information Technology – Software Product Evaluation – Quality Characteristics and Guidelines for their Use. International Organization for Standardization and the International Electrotechnical Commission, 1991. [65] ISO/IEC DIS 14598-1: Information Technology – Software Product Evaluation; Part 1: Overview. International Organization for Standardization and the International Electrotechnical Commission, 1996. [66] M. Jorgensen: “Experience with the Accuracy of Software Maintenance Task Effort Prediction Models”. In IEEE Transactions on Software Engineering, 21(8):674-681, 1995. [67] M. Kaaniche and K. Kanoun: “Reliability of a Commercial Telecommunications System”. In Proceedings of the International Symposium on Software Reliability Engineering, pages 207-212, 1996. [68] J. Kearney, R. Sedlmeyer, W. Thompson, M. Gray, and M. Adler: “Software Complexity Measurement”. In Communications of the ACM, 29(11):1044-1050, 1986. [69] T. Khoshgoftaar, E. Allen, K. Kalaichelvan, and N. Goel: “The Impact of Software Evolution and Reuse on Software Quality”. In Empirical Software Engineering: An International Journal, 1:31-44, 1996. [70] T. Khoshgoftaar, E. Allen, R. Halstead, G. Trio, and R. Flass: “Process Measures for Predicting Software Quality”. In IEEE Computer, 31(4):66-72, 1998. [71] T. Khoshgoftaar, E. Allen, W. Jones, and J. Hudepohl: “Which Software Modules Have Faults that will be Discovered by Customers ?”. In Journal of Software Maintenance: Research and Practice, 11(1):1-18, 1999.

V18-10/04/00

33

[72] T. Khoshgoftaar, E. Allen, W. Jones, and J. Hudepohl: “Classification Tree Models of Software Quality Over Multiple Releases”. In Proceedings of the International Symposium on Software Reliability Engineering, pages 116-125, 1999. [73] B. Kitchenham, S-L Pfleeger, and N. Fenton: “Towards a Framework for Software Measurement Validation”. In IEEE Transactions on Software Engineering, 21(12):929-944, 1995. [74] D. Kleinbaum, L. Kupper, and H. Morgenstern: Epidemiologic Research: Principles and Quantitative Methods. Van Nostrand Reinhold, 1982. [75] F. Lanubile and G. Visaggio: “Evaluating Predictive Quality Models Derived from Software Measures: Lessons Learned”. In Journal of Systems and Software, 38:225-234, 1997. [76] Y. Lee, B. Liang, S. Wu, and F. Wang: “Measuring the Coupling and Cohesion of an ObjectOriented Program Based on Information Flow”. In Proceedings of the International Conference on Software Quality, 1995. [77] M. Leijter, S. Meyers, and S. Reiss: “Support for Maintaining Object-Oriented Programs”. In IEEE Transactions on Software Engineering, 18(12):1045-1052, 1992. [78] W. Li and S. Henry: “Object-Oriented Metrics that Predict Maintainability”. In Journal of Systems and Software, 23:111-122, 1993. [79] R. Lindsay and A. Ehrenberg: “The Design of Replicated Studies”. In The American Statistician, 47(3):217-228, 1993. [80] M. Lorenz and J. Kidd: Object-Oriented Software Metrics. Prentice-Hall, 1994. [81] M. Lyu, J. Yu, E. Keramides, and S. Dalal: “ARMOR: Analyzer for Reducing Module Operational th Risk”. In Proceedings of the 25 International Symposium on Fault-Tolerant Computing, pages 137-142, 1995. [82] G. Maddala: Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press, 1983. [83] R. Mason and R. Gunst: “Outlier-Induced Collinearities”. In Technometrics, 27:401-407, 1985. [84] S. Menard: Applied Logistic Regression Analysis. Sage Publications, 1995. [85] S. Menard: “Coefficients of Determination for Multiple Logistic Regression Analysis”. In The American Statistician, 54(1):17-24, 2000. [86] K-H Moller and D. Paulish: “An Empirical Investigation of Software Fault Distribution”. In Proceedings of the First International Software Metrics Symposium, pages 82-90, 1993. [87] J. Munson and T. Khoshgoftaar: “The Detection of Fault-Prone Programs”. In IEEE Transactions on Software Engineering, 18(5):423-433, 1992. [88] N. Nagelkerke: “A Note on a General Definition of the Coefficient of Determination”. In Biometrika, 78(3):691-692, 1991. [89] P. Nesi and T. Querci: “Effort Estimation and Prediction of Object-Oriented Systems”. In Journal of Systems and Software, 42:89-102, 1998. [90] J. Neter, W. Wasserman, and M. Kunter: Applied Linear Statistical Models. Irwin, 1990. [91] N. Ohlsson and H. Alberg: “Predicting Fault-Prone Software Modules in Telephone Switches”. In IEEE Transactions on Software Engineering, 22(12):886-894, 1996. [92] E. Pedhazur: Multiple Regression in Behavioral Research. Harcourt Brace Jovanovich 1982. [93] D. Pergibon: “Logistic Regression Diagnostics”. In The Annals of Statistics, 9(4):705-724, 1981. [94] R. Pressman: Software Engineering: A Practitioner’s Approach. McGraw Hill, 1997. [95] K. Rothman and S. Greenland: Modern Epidemiology. Lippincott-Raven, 1998. [96] P. Rousseeuw and A. Leroy: Robust Regression and Outlier Detection. John Wiley & Sons, 1987. [97] R. Schaefer, L. Roi, and R. Wolfe: “A Ridge Logistic Estimator”. In Communications in Statistics – Theory and Methods, 13(1):99-113, 1984. [98] R. Schaefer: “Alternative Estimators in Logistic Regression when the Data are Collinear”. In The Journal of Statistical Computation and Simulation, 25:75-91, 1986. [99] J. Schlesselman: Case-Control Studies: Design, Conduct, Analysis. Oxford University Press, 1982. [100] D. Schmidt and P. Stephenson: “Experiences Using Design Patterns to Evolve System Software th Across Diverse OS Platforms”. In Proceedings of the 9 European Conference on Object Oriented Programming, 1995. [101] D. Schmidt: “A System of Reusable Design Patterns for Communication Software”. In S. Berzuk (ed.): The Theory and Practice of Object Systems, Wiley, 1995. [102] D. Schmidt: “Using Design Patterns to Develop Reusable Object-Oriented Communication Software”. In Communications of the ACM, 38(10):65-74, 1995.

V18-10/04/00

34

[103] S. Shlaer and S. Mellor: Object-Oriented Systems Analysis: Modelling the World in Data. PrenticeHall, 1988. [104] S. Simon and J. Lesage: “The Impact of Collinearity Involving the Intercept Term on the Numerical Accuracy of Regression”. In Computer Science in Economics and Management, 1:137-152, 1988. [105] K. Smith, M. Slattery, and T. French: “Collinear Nutrients and the Risk of Colon Cancer”. In Journal of Clinical Epidemiology, 44(7):715-723, 1991. [106] M-H. Tang, M-H. Kao, and M-H. Chen: “An Empirical Study on Object Oriented Metrics”. In Proceedings of the Sixth International Software Metrics Symposium, pages 242-249, 1999. [107] B. Unger and L. Prechelt: “The Impact of Inheritance Depth on Maintenance Tasks – Detailed Description and Evaluation of Two Experiment Replications”. Technical Report 19/1998, Fakultat fur Informatik, Universitaet Karlsruhe, 1998. [108] Y. Wax: “Collinearity Diagnosis for Relative Risk Regression Analysis: An Application to Assessment of Diet-Cancer Relationship in Epidemiological Studies”. In Statistics in Medicine, 11:1273-1287, 1992. [109] S. Wiedenbeck, V. Ramalingam, S. Sarasamma, and C. Corritore: “A Comparison of the Comprehension of Object-Oriented and Procedural Programs by Novice Programmers”. In Interacting with Computers, 11(3):255-282, 1999. [110] N. Wilde, P. Matthews, and R. Huitt: “Maintaining Object-Oriented Software”. In IEEE Software, pages 75-80, January 1993.

V18-10/04/00

35

- effects of class size
- The Effects of Class Size on the Long Run Growth in Reading Abilities and Early Adult Outco
- The optimal class size for object-oriented software
- Measuring the Effects of Object-Oriented Frameworks on the Software Process, submitted for
- The Applicability of Proposed Object Oriented Metrics to Developer Feedback in Time to Impa
- The Coverage of the Object-Oriented Framework Application Class-Based Test Cases
- Investigating the effect of dataset size, metrics sets, and feature
- Investigating effect of Design Metrics on Fault Proneness in Object-Oriented Systems
- Optimization of Object-Oriented Programs using Static Class Hierarchy Analysis
- Software Quality Metrics for Object Oriented System Environments, A report of SATC’s resea
- Investigating effect of Design Metrics on Fault Proneness in Object-Oriented Systems
- The optimal class size for object-oriented software
- 系统科学中的控制论
- MFC消息响应顺序
- MFC消息响应顺序

更多相关文章：
更多相关标签：