Validity: The Test Doth Purport Too Much, Methinks
Two pages minimum with two references
In your unit studies, you read about two approaches or models of validity—trinitarian and unitary. In some ways, these two models are competing views of gathering evidence for a test's validity. In other ways, the two approaches share overlapping elements.
For this discussion:
- Compare and contrast these two models in terms of how they conceptualize validity.
- Identify at least one advantage and disadvantage of each model.
- Decide which model appears to be the most defensible for determining the validity of a test.
- Explain your decision in terms of the implications for decision making about a test’s validity.
Be sure to include citations from Guion’s 1980 article, “On Trinitarian Doctrines of Validity,” and Messick’s 1995 article, “Validity of Psychological Assessment: Validation of Inferences From Persons’ Responses and Performances As Scientific Inquiry Into Score Meaning.”
Validity of Psychological Assessment: Validation of Inferences From Persons' Responses and Performances as Scientific Inquiry Into Score Meaning

Samuel Messick, Educational Testing Service

The traditional conception of validity divides it into three separate and substitutable types—namely, content, criterion, and construct validities. This view is fragmented and incomplete, especially because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use. The new unified concept of validity interrelates these issues as fundamental aspects of a more comprehensive theory of construct validity that addresses both score meaning and social values in test interpretation and test use. That is, unified validity integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships, including those of an applied and a scientific nature. Six distinguishable aspects of construct validity are highlighted as a means of addressing central issues implicit in the notion of validity as a unified concept. These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement, including performance assessments, which are discussed in some detail because of their increasing emphasis in educational and employment settings.

Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment (Messick, 1989b).
Validity is not a property of the test or assessment as such, but rather of the meaning of the test scores. These scores are a function not only of the items or stimulus conditions, but also of the persons responding as well as the context of the assessment. In particular, what needs to be valid is the meaning or interpretation of the score, as well as any implications for action that this meaning entails (Cronbach, 1971). The extent to which score meaning and action implications hold across persons or population groups and across settings or contexts is a persistent and perennial empirical question. This is the main reason that validity is an evolving property and validation a continuing process.

The Value of Validity

The principles of validity apply not just to interpretive and action inferences derived from test scores as ordinarily conceived, but also to inferences based on any means of observing or documenting consistent behaviors or attributes. Thus, the term score is used generically in its broadest sense to mean any coding or summarization of observed consistencies or performance regularities on a test, questionnaire, observation procedure, or other assessment devices such as work samples, portfolios, and realistic problem simulations. This general usage subsumes qualitative as well as quantitative summaries. It applies, for example, to behavior protocols, to clinical appraisals, to computerized verbal score reports, and to behavioral or performance judgments or ratings. Scores in this sense are not limited to behavioral consistencies and attributes of persons (e.g., persistence and verbal ability). Scores may also refer to functional consistencies and attributes of groups, situations or environments, and objects or institutions, as in measures of group solidarity, situational stress, quality of artistic products, and such social indicators as school dropout rate.
Hence, the principles of validity apply to all assessments, including performance assessments. For example, student portfolios are often the source of inferences—not just about the quality of the included products but also about the knowledge, skills, or other attributes of the student—and such inferences about quality and constructs need to meet standards of validity. This is important because performance assessments, although long a staple of industrial and military applications, are now touted as purported instruments of standards-based education reform because they promise positive consequences for teaching and learning. Indeed, it is precisely because of such politically salient potential consequences that the validity of performance assessment needs to be systematically addressed, as do other basic measurement issues such as reliability, comparability, and fairness.

[Editor's note. Samuel M. Turner served as action editor for this article. This article was presented as a keynote address at the Conference on Contemporary Psychological Assessment, Arbetspsykologiska Utvecklingsinstitutet, June 7-8, 1994, Stockholm, Sweden. Author's note. Acknowledgments are gratefully extended to Isaac Bejar, Randy Bennett, Drew Gitomer, Ann Jungeblut, and Michael Zieky for their reviews of various versions of the manuscript. Correspondence concerning this article should be addressed to Samuel Messick, Educational Testing Service, Princeton, NJ 08541. American Psychologist, September 1995, Vol. 50, No. 9, 741-749. Copyright 1995 by the American Psychological Association, Inc. Photo of Samuel Messick by William Monachan, Educational Testing Service, Princeton, NJ.]
The latter reference to fairness broaches a broader set of equity issues in testing that includes fairness of test use, freedom from bias in scoring and interpretation, and the appropriateness of the test-based constructs or rules underlying decision making or resource allocation, that is, distributive justice (Messick, 1989b). These issues are critical for performance assessment—as they are for all educational and psychological assessment—because validity, reliability, comparability, and fairness are not just measurement principles, they are social values that have meaning and force outside of measurement whenever evaluative judgments and decisions are made. As a salient social value, validity assumes both a scientific and a political role that can by no means be fulfilled by a simple correlation coefficient between test scores and a purported criterion (i.e., classical criterion-related validity) or by expert judgments that test content is relevant to the proposed test use (i.e., traditional content validity).

Indeed, validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual—as well as potential—consequences of score interpretation and use (i.e., construct validity conceived comprehensively). This comprehensive view of validity integrates considerations of content, criteria, and consequences into a construct framework for empirically testing rational hypotheses about score meaning and utility. Therefore, it is fundamental that score validation is an empirical evaluation of the meaning and consequences of measurement. As such, validation combines scientific inquiry with rational argument to justify (or nullify) score interpretation and use. In principle as well as in practice, construct validity is based on an integration of any evidence that bears on the interpretation or meaning of the test scores—including content- and criterion-related evidence—which are thus subsumed as part of construct validity.
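To make the contrast concrete: the "simple correlation coefficient between test scores and a purported criterion" that Messick says cannot by itself fulfill validity's role is, computationally, just a Pearson correlation. The sketch below is illustrative only; the test scores and job-performance ratings are invented for the example.

```python
# Illustrative only: a classical criterion-related validity coefficient
# is the Pearson correlation between test scores and a criterion.
# Messick's argument is that this single number cannot carry the full
# burden of validity (score meaning, values, consequences).
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: selection test scores and later performance ratings.
test_scores = [52, 47, 61, 58, 44, 66, 50, 55]
criterion   = [3.1, 2.8, 3.9, 3.5, 2.5, 4.2, 3.0, 3.3]

r = pearson_r(test_scores, criterion)
print(f"criterion-related validity coefficient r = {r:.2f}")
```

Even a very high coefficient here would say nothing, on its own, about what the scores mean, whether construct-irrelevant variance inflates them, or what the social consequences of using them would be—which is exactly the article's point.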
In construct validation the test score is not equated with the construct it attempts to tap, nor is it considered to define the construct, as in strict operationism (Cronbach & Meehl, 1955). Rather, the measure is viewed as just one of an extensible set of indicators of the construct. Convergent empirical relationships reflecting communality among such indicators are taken to imply the operation of the construct to the degree that discriminant evidence discounts the intrusion of alternative constructs as plausible rival hypotheses.

A fundamental feature of construct validity is construct representation, whereby one attempts to identify through cognitive-process analysis or research on personality and motivation the theoretical mechanisms underlying task performance, primarily by decomposing the task into requisite component processes and assembling them into a functional model or process theory (Embretson, 1983). Relying heavily on the cognitive psychology of information processing, construct representation refers to the relative dependence of task responses on the processes, strategies, and knowledge (including metacognitive or self-knowledge) that are implicated in task performance.

Sources of Invalidity

There are two major threats to construct validity: In the one known as construct underrepresentation, the assessment is too narrow and fails to include important dimensions or facets of the construct. In the threat to validity known as construct-irrelevant variance, the assessment is too broad, containing excess reliable variance associated with other distinct constructs as well as method variance such as response sets or guessing propensities that affects responses in a manner irrelevant to the interpreted construct. Both threats are operative in all assessments.
Hence a primary validation concern is the extent to which the same assessment might underrepresent the focal construct while simultaneously contaminating the scores with construct-irrelevant variance.

There are two basic kinds of construct-irrelevant variance. In the language of ability and achievement testing, these might be called construct-irrelevant difficulty and construct-irrelevant easiness. In the former, aspects of the task that are extraneous to the focal construct make the task irrelevantly difficult for some individuals or groups. An example is the intrusion of undue reading comprehension requirements in a test of subject matter knowledge. In general, construct-irrelevant difficulty leads to construct scores that are invalidly low for those individuals adversely affected (e.g., knowledge scores of poor readers or examinees with limited English proficiency). Of course, if concern is solely with criterion prediction and the criterion performance requires reading skill as well as subject matter knowledge, then both sources of variance would be considered criterion-relevant and valid. However, for score interpretations in terms of subject matter knowledge and for any score uses based thereon, undue reading requirements would constitute construct-irrelevant difficulty. Indeed, construct-irrelevant difficulty for individuals and groups is a major source of bias in test scoring and interpretation and of unfairness in test use. Differences in construct-irrelevant difficulty for groups, as distinct from construct-relevant group differences, are the major culprit sought in analyses of differential item functioning (Holland & Wainer, 1993).

In contrast, construct-irrelevant easiness occurs when extraneous clues in item or task formats permit some individuals to respond correctly or appropriately in ways irrelevant to the construct being assessed.
Another instance occurs when the specific test material, either deliberately or inadvertently, is highly familiar to some respondents, as when the text of a reading comprehension passage is well-known to some readers or the musical score for a sight reading exercise invokes a well-drilled rendition for some performers. Construct-irrelevant easiness leads to scores that are invalidly high for the affected individuals as reflections of the construct under scrutiny.

The concept of construct-irrelevant variance is important in all educational and psychological measurement, including performance assessments. This is especially true of richly contextualized assessments and so-called "authentic" simulations of real-world tasks. This is the case because "paradoxically, the complexity of context is made manageable by contextual clues" (Wiggins, 1993, p. 208). And it matters whether the contextual clues that people respond to are construct-relevant or represent construct-irrelevant difficulty or easiness.

However, what constitutes construct-irrelevant variance is a tricky and contentious issue (Messick, 1994). This is especially true of performance assessments, which typically invoke constructs that are higher order and complex in the sense of subsuming or organizing multiple processes. For example, skill in communicating mathematical ideas might well be considered irrelevant variance in the assessment of mathematical knowledge (although not necessarily vice versa). But both communication skill and mathematical knowledge are considered relevant parts of the higher-order construct of mathematical power, according to the content standards delineated by the National Council of Teachers of Mathematics (1989).
It all depends on how compelling the evidence and arguments are that the particular source of variance is a relevant part of the focal construct, as opposed to affording a plausible rival hypothesis to account for the observed performance regularities and relationships with other variables.

A further complication arises when construct-irrelevant variance is deliberately capitalized upon to produce desired social consequences, as in score adjustments for minority groups, within-group norming, or sliding band procedures (Cascio, Outtz, Zedeck, & Goldstein, 1991; Hartigan & Wigdor, 1989; Schmidt, 1991). However, recognizing that these adjustments distort the meaning of the construct as originally assessed, psychologists should distinguish such controversial procedures in applied testing practice (Gottfredson, 1994; Sackett & Wilk, 1994) from the valid assessment of focal constructs and from any score uses based on that construct meaning. Construct-irrelevant variance is always a source of invalidity in the assessment of construct meaning and its action implications. These issues portend the substantive and consequential aspects of construct validity, which are discussed in more detail later.

Sources of Evidence in Construct Validity

In essence, construct validity comprises the evidence and rationales supporting the trustworthiness of score interpretation in terms of explanatory concepts that account for both test performance and score relationships with other variables. In its simplest terms, construct validity is the evidential basis for score interpretation.
As an integration of evidence for score meaning, it applies to any score interpretation—not just those involving so-called "theoretical constructs." Almost any kind of information about a test can contribute to an understanding of score meaning, but the contribution becomes stronger if the degree of fit of the information with the theoretical rationale underlying score interpretation is explicitly evaluated (Cronbach, 1988; Kane, 1992; Messick, 1989b). Historically, primary emphasis in construct validation has been placed on internal and external test structures—that is, on the appraisal of theoretically expected patterns of relationships among item scores or between test scores and other measures.

Probably even more illuminating in regard to score meaning are studies of expected performance differences over time, across groups and settings, and in response to experimental treatments and manipulations. For example, over time one might demonstrate the increased scores from childhood to young adulthood expected for measures of impulse control. Across groups and settings, one might contrast the solution strategies of novices versus experts for measures of domain problem-solving or, for measures of creativity, contrast the creative productions of individuals in self-determined as opposed to directive work environments. With respect to experimental treatments and manipulations, one might seek increased knowledge scores as a function of domain instruction or increased achievement motivation scores as a function of greater benefits and risks. Possibly most illuminating of all, however, are direct probes and modeling of the processes underlying test responses, which are becoming both more accessible and more powerful with continuing developments in cognitive psychology (Frederiksen, Mislevy, & Bejar, 1993; Snow & Lohman, 1989).
At the simplest level, this might involve querying respondents about their solution processes or asking them to think aloud while responding to exercises during field trials.

In addition to reliance on these forms of evidence, construct validity, as previously indicated, also subsumes content relevance and representativeness as well as criterion-relatedness. This is the case because such information about the range and limits of content coverage and about specific criterion behaviors predicted by the test scores clearly contributes to score interpretation. In the latter instance, correlations between test scores and criterion measures—viewed within the broader context of other evidence supportive of score meaning—contribute to the joint construct validity of both predictor and criterion. In other words, empirical relationships between predictor scores and criterion measures should make theoretical sense in terms of what the predictor test is interpreted to measure and what the criterion is presumed to embody (Gulliksen, 1950).

An important form of validity evidence still remaining bears on the social consequences of test interpretation and use. It is ironic that validity theory has paid so little attention over the years to the consequential basis of test validity, because validation practice has long invoked such notions as the functional worth of the testing—that is, a concern over how well the test does the job for which it is used (Cureton, 1951; Rulon, 1946). And to appraise how well a test does its job, one must inquire whether the potential and actual social consequences of test interpretation and use are not only supportive of the intended testing purposes, but also at the same time consistent with other social values.
With some trepidation due to the difficulties inherent in forecasting, both potential and actual consequences are included in this formulation for two main reasons: First, anticipation of likely outcomes may guide one where to look for side effects and toward what kinds of evidence are needed to monitor consequences; second, such anticipation may alert one to take timely steps to capitalize on positive effects and to ameliorate or forestall negative effects.

However, this form of evidence should not be viewed in isolation as a separate type of validity, say, of "consequential validity." Rather, because the values served in the intended and unintended outcomes of test interpretation and use both derive from and contribute to the meaning of the test scores, appraisal of the social consequences of the testing is also seen to be subsumed as an aspect of construct validity (Messick, 1964, 1975, 1980). In the language of the Cronbach and Meehl (1955) seminal manifesto on construct validity, the intended consequences of the testing are strands in the construct's nomological network representing presumed action implications of score meaning. The central point is that unintended consequences, when they occur, are also strands in the construct's nomological network that need to be taken into account in construct theory, score interpretation, and test use. At issue is evidence for not only negative but also positive consequences of testing, such as the promised benefits of educational performance assessment for teaching and learning.

A major concern in practice is to distinguish adverse consequences that stem from valid descriptions of individual and group differences from adverse consequences that derive from sources of test invalidity such as construct underrepresentation and construct-irrelevant variance.
The latter adverse consequences of test invalidity present measurement problems that need to be investigated in the validation process, whereas the former consequences of valid assessment represent problems of social policy. But more about this later.

Thus, the process of construct validation evolves from these multiple sources of evidence a mosaic of convergent and discriminant findings supportive of score meaning. However, in anticipated applied test use, this mosaic of general evidence may or may not include pertinent specific evidence of (a) the relevance of the test to the particular applied purpose and (b) the utility of the test in the applied setting. Hence, the general construct validity evidence may need to be buttressed in applied instances by specific evidence of relevance and utility.

In summary, the construct validity of score interpretation comes to undergird all score-based inferences—not just those related to interpretive meaningfulness but also the content- and criterion-related inferences specific to applied decisions and actions based on test scores. From the discussion thus far, it should also be clear that test validity cannot rely on any one of the supplementary forms of evidence just discussed. However, neither does validity require any one form, granted that there is defensible convergent and discriminant evidence supporting score meaning. To the extent that some form of evidence cannot be developed—as when criterion-related studies must be forgone because of small sample sizes, unreliable or contaminated criteria, and highly restricted score ranges—heightened emphasis can be placed on other evidence, especially on the construct validity of the predictor tests and on the relevance of the construct to the criterion domain (Guion, 1976; Messick, 1989b). What is required is a compelling argument that the available evidence justifies the test interpretation and use, even though some pertinent evidence had to be forgone.
Hence, validity becomes a unified concept, and the unifying force is the meaningfulness or trustworthy interpretability of the test scores and their action implications, namely, construct validity.

Aspects of Construct Validity

However, to speak of validity as a unified concept does not imply that validity cannot be usefully differentiated into distinct aspects to underscore issues and nuances that might otherwise be downplayed or overlooked, such as the social consequences of performance assessments or the role of score meaning in applied use. The intent of these distinctions is to provide a means of addressing functional aspects of validity that help disentangle some of the complexities inherent in appraising the appropriateness, meaningfulness, and usefulness of score inferences.

In particular, six distinguishable aspects of construct validity are highlighted as a means of addressing central issues implicit in the notion of validity as a unified concept. These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement (Messick, 1989b).
Following a capsule description of these six aspects, some of the validity issues and sources of evidence bearing on each are highlighted:

• The content aspect of construct validity includes evidence of content relevance, representativeness, and technical quality (Lennon, 1956; Messick, 1989b);
• The substantive aspect refers to theoretical rationales for the observed consistencies in test responses, including process models of task performance (Embretson, 1983), along with empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks;
• The structural aspect appraises the fidelity of the scoring structure to the structure of the construct domain at issue (Loevinger, 1957; Messick, 1989b);
• The generalizability aspect examines the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks (Cook & Campbell, 1979; Shulman, 1970), including validity generalization of test-criterion relationships (Hunter, Schmidt, & Jackson, 1982);
• The external aspect includes convergent and discriminant evidence from multitrait-multimethod comparisons (Campbell & Fiske, 1959), as well as evidence of criterion relevance and applied utility (Cronbach & Gleser, 1965);
• The consequential aspect appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness, and distributive justice (Messick, 1980, 1989b).

Content Relevance and Representativeness

A key issue for the content aspect of construct validity is the specification of the boundaries of the construct domain to be assessed—that is, determining the knowledge, skills, attitudes, motives, and other attributes to be revealed by the assessment tasks.
The boundaries and structure of the construct domain can be addressed by means of job analysis, task analysis, curriculum analysis, and especially domain theory, in other words, scientific inquiry into the nature of the domain processes and the ways in which they combine to produce effects or outcomes. A major goal of domain theory is to understand the construct-relevant sources of task difficulty, which then serves as a guide to the rational development and scoring of performance tasks and other assessment formats. At whatever stage of its development, then, domain theory is a primary basis for specifying the boundaries and structure of the construct to be assessed.

However, it is not sufficient merely to select tasks that are relevant to the construct domain. In addition, the assessment should assemble tasks that are representative of the domain in some sense. The intent is to insure that all important parts of the construct domain are covered, which is usually described as selecting tasks that sample domain processes in terms of their functional importance, or what Brunswik (1956) called ecological sampling. Functional importance can be considered in terms of what people actually do in the performance domain, as in job analyses, but also in terms of what characterizes and differentiates expertise in the domain, which would usually emphasize different tasks and processes. Both the content relevance and representativeness of assessment tasks are traditionally appraised by expert professional judgment, documentation of which serves to address the content aspect of construct validity.

Substantive Theories, Process Models, and Process Engagement

The substantive aspect of construct validity emphasizes the role of substantive theories and process modeling in identifying the domain processes to be revealed in assessment tasks (Embretson, 1983; Messick, 1989b).
Two important points are involved: One is the need for tasks providing appropriate sampling of domain processes in addition to traditional coverage of domain content; the other is the need to move beyond traditional professional judgment of content to accrue empirical evidence that the ostensibly sampled processes are actually engaged by respondents in task performance.

Thus, the substantive aspect adds to the content aspect of construct validity the need for empirical evidence of response consistencies or performance regularities reflective of domain processes (Loevinger, 1957). Such evidence may derive from a variety of sources, for example, from "think aloud" protocols or eye movement records during task performance; from correlation patterns among part scores; from consistencies in response times for task segments; or from mathematical or computer modeling of task processes (Messick, 1989b, pp. 53-55; Snow & Lohman, 1989). In summary, the issue of domain coverage refers not just to the content representativeness of the construct measure but also to the process representation of the construct and the degree to which these processes are reflected in construct measurement.

The core concept bridging the content and substantive aspects of construct validity is representativeness. This becomes clear once one recognizes that the term representative has two distinct meanings, both of which are applicable to performance assessment. One is in the cognitive psychologist's sense of representation or modeling (Suppes, Pavel, & Falmagne, 1994); the other is in the Brunswikian sense of ecological sampling (Brunswik, 1956; Snow, 1974). The choice of tasks or contexts in assessment is a representative sampling issue. The comprehensiveness and fidelity of simulating the construct's realistic engagement in performance is a representation issue. Both issues are important in educational and psychological measurement and especially in performance assessment.
Scoring Models As Reflective of Task and Domain Structure

According to the structural aspect of construct validity, scoring models should be rationally consistent with what is known about the structural relations inherent in behavioral manifestations of the construct in question (Loevinger, 1957; Peak, 1953). That is, the theory of the construct domain should guide not only the selection or construction of relevant assessment tasks but also the rational development of construct-based scoring criteria and rubrics.

Ideally, the manner in which behavioral instances are combined to produce a score should rest on knowledge of how the processes underlying those behaviors combine dynamically to produce effects. Thus, the internal structure of the assessment (i.e., interrelations among the scored aspects of task and subtask performance) should be consistent with what is known about the internal structure of the construct domain (Messick, 1989b). This property of construct-based rational scoring models is called structural fidelity (Loevinger, 1957).

Generalizability and the Boundaries of Score Meaning

The concern that a performance assessment should provide representative coverage of the content and processes of the construct domain is meant to insure that the score interpretation not be limited to the sample of assessed tasks but be broadly generalizable to the construct domain. Evidence of such generalizability depends on the degree of correlation of the assessed tasks with other tasks representing the construct or aspects of the construct. This issue of generalizability of score inferences across tasks and contexts goes to the very heart of score meaning. Indeed, setting the boundaries of score meaning is precisely what generalizability evidence is meant to address.
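The cross-task correlational evidence described above is often summarized in practice with an internal-consistency index; Cronbach's alpha is one common choice (alpha is not named in this article, so treat this as a supplementary illustration). A minimal sketch with invented scores for five examinees on four performance tasks drawn from one construct domain:

```python
# Illustrative only: hypothetical examinee-by-task scores. Cronbach's
# alpha summarizes how consistently the sampled tasks rank examinees,
# one rough indicator of how far score meaning generalizes beyond the
# particular tasks assessed.

def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(rows):
    """rows: one list of task scores per examinee (equal lengths)."""
    k = len(rows[0])  # number of tasks
    task_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(task_vars) / total_var)

scores = [  # examinees x tasks (invented)
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```

A high alpha over a handful of time-intensive tasks does not settle the generalizability question, of course; it speaks only to consistency among the tasks actually sampled, which is why the depth-versus-breadth trade-off discussed next matters.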
However, because of the extensive time required for the typical performance task, there is a conflict in performance assessment between time-intensive depth of examination and the breadth of domain coverage needed for generalizability of construct interpretation. This conflict between depth and breadth of coverage is often viewed as entailing a trade-off between validity and reliability (or generalizability). It might better be depicted as a trade-off between the valid description of the specifics of a complex task and the power of construct interpretation. In any event, such a conflict signals a design problem that needs to be carefully negotiated in performance assessment (Wiggins, 1993).

In addition to generalizability across tasks, the limits of score meaning are also affected by the degree of generalizability across time or occasions and across observers or raters of the task performance. Such sources of measurement error associated with the sampling of tasks, occasions, and scorers underlie traditional reliability concerns (Feldt & Brennan, 1989).

Convergent and Discriminant Correlations With External Variables

The external aspect of construct validity refers to the extent to which the assessment scores' relationships with other measures and nonassessment behaviors reflect the expected high, low, and interactive relations implicit in the theory of the construct being assessed. Thus, the meaning of the scores is substantiated externally by appraising the degree to which empirical relationships with other measures—or the lack thereof—are consistent with that meaning. That is, the constructs represented in the assessment should rationally account for the external pattern of correlations.
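The expected external pattern can be made concrete with a toy Campbell-Fiske style check: correlations with same-construct measures (convergent) should exceed correlations with different-construct measures (discriminant). All measure names and coefficients below are hypothetical, invented for illustration.

```python
# Hypothetical external correlations for a new performance assessment
# intended to measure quantitative reasoning.
convergent = {
    "established quantitative test": 0.72,  # same construct, other method
    "mathematics course grades": 0.55,
}
discriminant = {
    "reading speed test": 0.21,             # different construct
    "extraversion inventory": 0.08,
}

weakest_convergent = min(convergent.values())
strongest_discriminant = max(discriminant.values())

# Score meaning is supported when even the weakest convergent link
# exceeds the strongest discriminant one.
if weakest_convergent > strongest_discriminant:
    print("pattern consistent with the intended construct interpretation")
else:
    print("pattern suggests a rival interpretation should be examined")
```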
Both convergent and discriminant correlation patterns are important, the convergent pattern indicating a correspondence between measures of the same construct and the discriminant pattern indicating a distinctness from measures of other constructs (Campbell & Fiske, 1959). Discriminant evidence is particularly critical for discounting plausible rival alternatives to the focal construct interpretation. Both convergent and discriminant evidence are basic to construct validation.

Of special importance among these external relationships are those between the assessment scores and criterion measures pertinent to selection, placement, licensure, program evaluation, or other accountability purposes in applied settings. Once again, the construct theory points to the relevance of potential relationships between the assessment scores and criterion measures, and empirical evidence of such links attests to the utility of the scores for the applied purpose.

Consequences As Validity Evidence

The consequential aspect of construct validity includes evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use in both the short and long term. Social consequences of testing may be either positive, such as improved educational policies based on international comparisons of student performance, or negative, especially when associated with bias in scoring and interpretation or with unfairness in test use. For example, because performance assessments in education promise potential benefits for teaching and learning, it is important to accrue evidence of such positive consequences as well as evidence that adverse consequences are minimal.

The primary measurement concern with respect to adverse consequences is that any negative impact on individuals or groups should not derive from any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance (Messick, 1989b).
In other words, low scores should not occur because the assessment is missing something relevant to the focal construct that, if present, would have permitted the affected persons to display their competence. Moreover, low scores should not occur because the measurement contains something irrelevant that interferes with the affected persons' demonstration of competence.

Validity as Integrative Summary

These six aspects of construct validity apply to all educational and psychological measurement, including performance assessments. Taken together, they provide a way of addressing the multiple and interrelated validity questions that need to be answered to justify score interpretation and use. In previous writings, I maintained that it is "the relation between the evidence and the inferences drawn that should determine the validation focus" (Messick, 1989b, p. 16). This relation is embodied in theoretical rationales or persuasive arguments that the obtained evidence both supports the preferred inferences and undercuts plausible rival inferences. From this perspective, as Cronbach (1988) concluded, validation is evaluation argument. That is, as stipulated earlier, validation is empirical evaluation of the meaning and consequences of measurement. The term empirical evaluation is meant to convey that the validation process is scientific as well as rhetorical and requires both evidence and argument.

By focusing on the argument or rationale used to support the assumptions and inferences invoked in the score-based interpretations and actions of a particular test use, one can prioritize the forms of validity evidence needed according to the points in the argument requiring justification or support (Kane, 1992; Shepard, 1993).
Helpful as this may be, there still remain problems in setting priorities for needed evidence because the argument may be incomplete or off target, not all the assumptions may be addressed, and the need to discount alternative arguments evokes multiple priorities. This is one reason that Cronbach (1989) stressed cross-argument criteria for assigning priority to a line of inquiry, such as the degree of prior uncertainty, information yield, cost, and leverage in achieving consensus.

Kane (1992) illustrated the argument-based approach by prioritizing the evidence needed to validate a placement test for assigning students to a course in either remedial algebra or calculus. He addressed seven assumptions that, from the present perspective, bear on the content, substantive, generalizability, external, and consequential aspects of construct validity. Yet the structural aspect is not explicitly addressed. Hence, the compensatory property of the usual cumulative total score, which permits good performance on some algebra skills to compensate for poor performance on others, remains unevaluated in contrast, for example, to scoring models with multiple cut scores or with minimal requirements across the profile of prerequisite skills. The question is whether such profile scoring models might yield not only useful information for diagnosis and remediation but also better student placement.

The structural aspect of construct validity also received little attention in Shepard's (1993) argument-based analysis of the validity of special education placement decisions. This was despite the fact that the assessment referral system under consideration involved a profile of cognitive, biomedical, behavioral, and academic skills that required some kind of structural model linking test results to placement decisions.
However, in her analysis of selection uses of the General Aptitude Test Battery (GATB), Shepard (1993) did underscore the structural aspect because the GATB within-group scoring model is both salient and controversial.

The six aspects of construct validity afford a means of checking that the theoretical rationale or persuasive argument linking the evidence to the inferences drawn touches the important bases; if the bases are not covered, an argument that such omissions are defensible must be provided. These six aspects are highlighted because most score-based interpretations and action inferences, as well as the elaborated rationales or arguments that attempt to legitimize them (Kane, 1992), either invoke these properties or assume them, explicitly or tacitly.

In other words, most score interpretations refer to relevant content and operative processes, presumed to be reflected in scores that concatenate responses in domain-appropriate ways and are generalizable across a range of tasks, settings, and occasions. Furthermore, score-based interpretations and actions are typically extrapolated beyond the test context on the basis of presumed relationships with nontest behaviors and anticipated outcomes or consequences. The challenge in test validation is to link these inferences to convergent evidence supporting them and to discriminant evidence discounting plausible rival inferences. Evidence pertinent to all of these aspects needs to be integrated into an overall validity judgment to sustain score inferences and their action implications, or else compelling reasons must be given for why such a link is absent; this integration is what is meant by validity as a unified concept.

Meaning and Values in Test Validation

The essence of unified validity is that the appropriateness, meaningfulness, and usefulness of score-based inferences are inseparable and that the integrating power derives from empirically grounded score interpretation.
As seen in this article, both meaning and values are integral to the concept of validity, and psychologists need a way of addressing both concerns in validation practice. In particular, what is needed is a way of configuring validity evidence that forestalls undue reliance on selected forms of evidence, as opposed to a pattern of supplementary evidence, and that highlights the important yet subsidiary role of specific content- and criterion-related evidence in support of construct validity in testing applications. This means of configuring evidence should also formally bring consideration of value implications and social consequences into the validity framework.

A unified validity framework meeting these requirements distinguishes two interconnected facets of validity as a unitary concept (Messick, 1989a, 1989b). One facet is the source of justification of the testing, based on appraisal of either evidence supportive of score meaning or consequences contributing to score valuation. The other facet is the function or outcome of the testing—either interpretation or applied use. If the facet for justification (i.e., either an evidential basis for meaning implications or a consequential basis for value implications of scores) is crossed with the facet for function or outcome (i.e., either test interpretation or test use), a four-fold classification is obtained, highlighting both meaning and values in both test interpretation and test use, as represented by the row and column headings of Figure 1.

These distinctions may seem fuzzy because they are not only interlinked but overlapping. For example, social consequences of testing are a form of evidence, and other forms of evidence have consequences.
Figure 1. Facets of Validity as a Progressive Matrix

                      Evidential Basis                Consequential Basis
Test Interpretation   Construct Validity (CV)         CV + Value Implications (VI)
Test Use              CV + Relevance/Utility (R/U)    CV + R/U + VI + Social Consequences

Furthermore, to interpret a test is to use it, and all other test uses involve interpretation either explicitly or tacitly. Moreover, utility is both validity evidence and a value consequence. This conceptual messiness derives from cutting through what indeed is a unitary concept to provide a means of discussing its functional aspects.

Each of the cells in this four-fold crosscutting of unified validity is briefly considered in turn, beginning with the evidential basis of test interpretation. Because the evidence and rationales supporting the trustworthiness of score meaning are what is meant by construct validity, the evidential basis of test interpretation is clearly construct validity. The evidential basis of test use is also construct validity, but with the important proviso that the general evidence supportive of score meaning either already includes or becomes enhanced by specific evidence for the relevance of the scores to the applied purpose and for the utility of the scores in the applied setting, where utility is broadly conceived to reflect the benefits of testing relative to its costs (Cronbach & Gleser, 1965).

The consequential basis of test interpretation is the appraisal of value implications of score meaning, including the often tacit value implications of the construct label itself, of the broader theory conceptualizing construct properties and relationships that undergirds construct meaning, and of the still broader ideologies that give theories their perspective and purpose—for example, ideologies about the functions of science or about the nature of the human being as a learner or as an adaptive or fully functioning person.
The value implications of score interpretation are not only part of score meaning, but a socially relevant part that often triggers score-based actions and serves to link the construct measured to questions of applied practice and social policy. One way to protect against the tyranny of unexposed and unexamined values in score interpretation is to explicitly adopt multiple value perspectives to formulate and empirically appraise plausible rival hypotheses (Churchman, 1971; Messick, 1989b).

Many constructs, such as competence, creativity, intelligence, or extraversion, have manifold and arguable value implications that may or may not be sustainable in terms of properties of their associated measures. A central issue is whether the theoretical or trait implications and the value implications of the test interpretation are commensurate, because value implications are not ancillary but, rather, integral to score meaning. Therefore, to make clear that score interpretation is needed to appraise value implications and vice versa, this cell for the consequential basis of test interpretation needs to comprehend both the construct validity and the value ramifications of score meaning.

Finally, the consequential basis of test use is the appraisal of both potential and actual social consequences of the applied testing. One approach to appraising potential side effects is to pit the benefits and risks of the proposed test use against the pros and cons of alternatives or counterproposals. By taking multiple perspectives on proposed test use, the various (and sometimes conflicting) value commitments of each proposal are often exposed to open examination and debate (Churchman, 1971; Messick, 1989b). Counterproposals to a proposed test use might involve quite different assessment techniques, such as observations or portfolios when educational performance standards are at issue.
Counterproposals might attempt to serve the intended purpose in a different way, such as through training rather than selection when productivity levels are at issue (granted that testing may also be used to reduce training costs, and that failure in training yields a form of selection).

What matters is not only whether the social consequences of test interpretation and use are positive or negative, but how the consequences came about and what determined them. In particular, it is not that adverse social consequences of test use render the use invalid but, rather, that adverse social consequences should not be attributable to any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance. And once again, in recognition of the fact that the weighing of social consequences both presumes and contributes to evidence of score meaning, of relevance, of utility, and of values, this cell needs to include construct validity, relevance, and utility, as well as social and value consequences.

Some measurement specialists argue that adding value implications and social consequences to the validity framework unduly burdens the concept. However, it is simply not the case that values are being added to validity in this unified view. Rather, values are intrinsic to the meaning and outcomes of the testing and have always been. As opposed to adding values to validity as an adjunct or supplement, the unified view instead exposes the inherent value aspects of score meaning and outcome to open examination and debate as an integral part of the validation process (Messick, 1989a). This makes explicit what has been latent all along, namely, that validity judgments are value judgments.

A salient feature of Figure 1 is that construct validity appears in every cell, which is fitting because the construct validity of score meaning is the integrating force that unifies validity issues into a unitary concept.
At the same time, by distinguishing facets reflecting the justification and function of the testing, it becomes clear that distinct features of construct validity need to be emphasized, in addition to the general mosaic of evidence, as one moves from the focal issue of one cell to that of the others. In particular, the forms of evidence change and compound as one moves from appraisal of evidence for the construct interpretation per se, to appraisal of evidence supportive of a rational basis for test use, to appraisal of the value consequences of score interpretation as a basis for action, and finally, to appraisal of the social consequences—or, more generally, of the functional worth—of test use.

As different foci of emphasis are highlighted in addressing the basic construct validity appearing in each cell, this movement makes what at first glance was a simple four-fold classification appear more like a progressive matrix, as portrayed in the cells of Figure 1. From one perspective, each cell represents construct validity, with different features highlighted on the basis of the justification and function of the testing. From another perspective, the entire progressive matrix represents construct validity, which is another way of saying that validity is a unified concept. One implication of this progressive-matrix formulation is that both meaning and values, as well as both test interpretation and test use, are intertwined in the validation process. Thus, validity and values are one imperative, not two, and test validation implicates both the science and the ethics of assessment, which is why validity has force as a social value.

REFERENCES

Brunswik, E. (1956). Perception and the representative design of psychological experiments (2nd ed.). Berkeley: University of California Press.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix.
Psychological Bulletin, 56, 81-105.
Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test score use in personnel selection. Human Performance, 4, 233-264.
Churchman, C. W. (1971). The design of inquiring systems: Basic concepts of systems and organization. New York: Basic Books.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Cronbach, L. J. (1988). Five perspectives on validation argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147-171). Urbana: University of Illinois Press.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 621-694). Washington, DC: American Council on Education.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: Macmillan.
Frederiksen, N., Mislevy, R. J., & Bejar, I. (Eds.). (1993). Test theory for a new generation of tests. Hillsdale, NJ: Erlbaum.
Gottfredson, L. S. (1994). The science and politics of race-norming. American Psychologist, 49, 955-963.
Guion, R. M. (1976). Recruiting, selection, and job placement. In M. D.
Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 777-828). Chicago: Rand McNally.
Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5, 511-517.
Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Lennon, R. T. (1956). Assumptions underlying the use of content validity. Educational and Psychological Measurement, 16, 294-304.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694 (Monograph Suppl. 9).
Messick, S. (1964). Personality measurement and college performance. In Proceedings of the 1963 Invitational Conference on Testing Problems (pp. 110-129). Princeton, NJ: Educational Testing Service.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
Messick, S. (1989b). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.
Peak, H.
(1953). Problems of observation. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences (pp. 243-299). Hinsdale, IL: Dryden Press.
Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review, 16, 290-296.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.
Schmidt, F. L. (1991). Why all banding procedures are logically flawed. Human Performance, 4, 265-278.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.
Shulman, L. S. (1970). Reconstruction of educational research. Review of Educational Research, 40, 371-396.
Snow, R. E. (1974). Representative and quasi-representative designs for research on teaching. Review of Educational Research, 44, 265-291.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-331). New York: Macmillan.
Suppes, P., Pavel, M., & Falmagne, J.-C. (1994). Representations and models in psychology. Annual Review of Psychology, 45, 517-544.
Wiggins, G. (1993). Assessment: Authenticity, context, and validity. Phi Delta Kappan, 75, 200-214.
ROBERT M. GUION

On Trinitarian Doctrines of Validity

The evaluation of how well one measures particular attributes of people or of objects is, at least in part, a question of validity. Some discussions of validity also refer to relationships between measures of quite different attributes, either to aid in the understanding of a construct or to establish a basis for comparison between evaluations of the validity of measurement and evaluations of the validity of a hypothesis. The three conventionally listed aspects of validity—criterion-related, content, and construct—are examined from this dual perspective. The unifying nature of the validity of measurement is found in the degree to which the results of measurement (the numbers or scores) represent magnitudes of the intended attribute. Validity is thus an evaluative judgment based on a variety of considerations, including the structure of the measurement operations, the pattern of correlations with other variables, and the results of confirmatory and disconfirmatory investigations. Validity in this sense is close to the concept of construct validity but perhaps without the theoretical implications of that term; like construct validity, the evaluation cannot be expressed with a single research result. Evaluations of the validity of hypotheses should also be based on multiple considerations rather than on single coefficients. In some circumstances, conventional methods of validation may be superfluous.

People who use tests speak of "validity" in referring to evaluations either of their tests or of their use of tests; the term is ambiguous. Sometimes it refers to evaluations of how well the scores represent the attribute being measured; sometimes it refers to evaluations of how well the scores are related to some quite different attribute. Although these are complementary meanings, they are conceptually distinguishable. Measurement that is not called testing also needs to be evaluated.
Recognition of other kinds of psychological measurement may help in clarifying the concept of test validity; in return, such clarification may offer guidance for evaluating other kinds of measurement. Questions of evaluation, and certainly references to validity, are regrettably rare in reports of measurement not based on the traditional concept of psychometrics—a concept generally fenced in with the unfortunate phrase "educational and psychological testing." Examples include measures of the degrees of socialization in children's play, the information content of sentences, differences between intensities of experimental treatments, recidivism rates of mental hospital patients, the level of aggression in mice, or the degrees of preference for various classes of objects. Even in test validation studies, the list of examples includes criterion measures. In each of these examples, something is quantified or measured, whether well or poorly, in a serious test of a hypothesis in which that "something" is important. Either the concept of validity applies to all of these different kinds of measurement or the limits of its applicability need to be better understood. Each of these examples came from a research report containing no mention of validity or any other evaluation of the effectiveness of measurement. Why do authors apparently ignore the question of validity?

Vol. 11, No. 3, June 1980, PROFESSIONAL PSYCHOLOGY
Copyright 1980 by the American Psychological Association, Inc.

One reason might be that the term is simply not in an author's working vocabulary. It seems likely that many authors were never exposed to a course in tests and measurement; many of those who were exposed have successfully warded off the disease. Another reason might be disdain. Some people think mental testers are less scientific than experimenters—perhaps only a cut above astrologers.
They are assumed to be holdovers from a static, discredited trait theory that is not adequate for genuinely scientific study. The attitude may be that since validity belongs to the mental testers, it can be discarded by everyone else.

Perhaps a better reason is confusion over the meaning of the word. There has been an almost mystical, trinitarian concept of validity in mental testing over the last quarter century. Although the trio of terms introduced in the "Technical Recommendations for Psychological Tests and Diagnostic Techniques" (American Psychological Association [APA] et al., 1954)—content validity, criterion-related validity, and construct validity—were identified as different "aspects" of validity, many practitioners seem to think of them as three quite different things. Such misinterpretation is "a conceptual compartmentalization of 'types' of validity . . . [that] leads to confusion and, in the face of confusion, oversimplification" (Dunnette & Borman, 1979, p. 483). This regrettable confusion and oversimplification reached its zenith with the publication of the "Uniform Guidelines on Employee Selection Procedures" (Equal Employment Opportunity Commission et al., 1978). The Guidelines seem to treat them as something of a holy trinity representing three different roads to psychometric salvation. If you cannot demonstrate one kind of validity, you have two more chances!

These three terms may have outlived their usefulness, but at the present time, no other terms serve quite the same purpose of identifying facets of validity. Clarification of the concept of validity and of its applicability beyond testing demands a clarification of the interrelatedness and the essential unity of these three terms. The metaphor of the holy trinity is partially apt. In Christian theology, the Trinity is spoken of as one God manifested in three persons. In psychometric theology, we can speak of one validity, evidenced in three ways.
The weaknesses of the metaphor are, first, that the essential unity of validity is much more closely related to the notion of construct validity than to the other two "persons" and, second, that there is reasonable doubt whether the other two consistently serve as evidence of validity.

ROBERT M. GUION is Professor of Industrial/Organizational Psychology at Bowling Green State University, where he was Chairman of the department from 1966 to 1971. He has written a comprehensive book on Personnel Testing; more than 30 articles about assessment, employment testing, personnel assessment, and measurement; and at least 7 articles on validity. He is a fellow in the American Psychological Association, Divisions 5 and 14. COMMENTS ON AN EARLIER DRAFT by Marvin Dunnette, Clifford (Jack) Mynatt, Patricia Smith, and Ross Stagner are gratefully acknowledged; both their comments and the differences in their perspectives were challenging. REQUESTS FOR REPRINTS should be sent to Robert M. Guion, Department of Psychology, Bowling Green State University, Bowling Green, Ohio 43403.

In brief, the argument of this essay suggests:

1. Measurement consists of operations leading to quantitative statements that represent magnitudes of a variable conceptualized by the researcher.

2. The validity of the measurement consists fundamentally of the congruence of the operational and the conceptual definitions of the variable.

3. Evidence of such congruence may be partially based on relationships of the variable, as measured, to other variables. The evaluation of the strength of such relationships may be sought, however, even where the quality of the measures is not questioned. In principle, the validity of a hypothesized relationship can be studied independently of studies of the validity of measurement.

4. For some kinds of measurement, the traditional questions of validity and of so-called validation strategy do not arise.
For these, different questions may be formed to evaluate the measurement and its use.

The Validity of Measurement

One does not measure objects or people; one measures attributes of objects or people. Much of everyday physical measurement, such as height, is based on mathematically formal procedures of fundamental measurement (Coombs, Dawes, & Tversky, 1970; Torgerson, 1958). There is general agreement among scientists and others on the definition of the standard unit of measurement. The evaluation of how well one has measured is based on a judgment of the precision of measurement (by which is meant either the fineness of the permissible error or the reliability of repeated measurement) and of the accuracy of measurement (by which is meant a freedom from biasing tendencies to err systematically too high or too low). In these cases, validity (as typically defined) is not an issue.

This is not to say that validity is irrelevant to physical measurement. Much of the work done by physicists, astronomers, and others in the "hard" sciences uses measurements of one kind of physical phenomenon as a basis for inferences about a different one. For example, Idso, Jackson, and Reginato (1975) described methods of using the ratio of reflected to incoming solar radiation as a basis for inferring soil moisture by remote surveillance from high altitude aircraft or satellites. Validity of such inferences is indeed an issue and has been addressed by Idso et al. (1975). It is an issue much like that in the psychological measurement of "softer" attributes, such as aspects of intelligence or personality, where there are no units of measurement with standard definitions accepted by the scientific community. Although we speak of reliability (another ambiguous term), the concept of validity is invoked as the definitive basis for evaluating such measurement.
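The precision/accuracy distinction drawn here lends itself to a small numerical sketch. All figures below are hypothetical, chosen only to illustrate the two judgments: precision concerns the spread of repeated measurements, accuracy the systematic bias relative to the true value.

```python
# Illustrative only: hypothetical repeated measurements of one person's height
# on a miscalibrated but steady scale.
from statistics import mean, stdev

true_height_cm = 175.0
readings = [176.1, 176.3, 175.9, 176.2, 176.0]

precision = stdev(readings)             # reliability of repetition: small spread
bias = mean(readings) - true_height_cm  # systematic tendency to err too high

print(f"precision (SD of readings): {precision:.2f} cm")  # tight, about 0.16 cm
print(f"systematic bias: {bias:+.2f} cm")                 # consistently ~1.1 cm high
```

An instrument can thus be quite precise yet inaccurate, which is exactly why precision alone does not settle the evaluative question.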
In short, whether the attribute being measured is physical or psychological, "hard" or "soft," the focus of measurement is necessarily on the attributes—the "something" that is measured. The something may be clearly identified or not, well established in the scientific literature or innovative, important or trivial, material or abstract; it can be a vague idea, a moderately well-defined concept, or an established scientific construct. Nothing in the intent to measure it says that the attribute must be clearly defined at the outset. To the contrary, many scientific constructs develop in an iterative pattern of rough definition, preliminary measurement, and refinement of definition. The point is that one needs at least a hazy conceptual definition of the something to be measured. One then develops a set of operations, an operational definition, by which relative magnitudes of the defined something can be assessed. A set of operations must first be evaluated logically in terms of its apparent relevance to the underlying concept. If it is deemed satisfactory in this respect, it can be put to use and, after use, evaluated.

Validity in measurement is, first of all, essentially an evaluation of how well one has succeeded in measuring the attribute that was to be measured. The operational definition in measurement ends up with a number—a scale value, a score, or some other expression of quantity. The number ordinarily has no absolute meaning in its own right. It becomes important when it is used to draw inferences about the attribute being measured. Validity, then, refers to an evaluation of the quality of the inferences drawn from these numbers. It follows that the validity of measurement is not a precise evaluation. Rather, it is expressed in broad qualitative categories: high validity, satisfactory validity, or poor or no validity.
One might compare validities of inferences based on different operational definitions of an attribute and say that the validity of one is better, or equal to, or worse than the validity of inferences from another. These are ordinal statements; they do not denote precise quantities, and they are not expressible with precise numbers. One should not confuse an evaluative interpretation of validity with an obtained validity coefficient. Validity coefficients may be computed, but the evaluation of validity is based on those coefficients and on other information as well; it is not equated with them.

Validity is a property of inferences from scores, not (strictly speaking) of the measuring instrument or test itself. Properties of a test should influence evaluations of validity, but so also should other information. If one is evaluating a total approach to measurement, evaluative considerations include structural characteristics of stimulus materials, the degree of standardization, the adequacy of the sample taken in measurement, and the like. These things contribute to one's evaluation of the validity of certain inferences from scores, but they should not be confused with those inferences.

CRITERION-RELATED VALIDITY

The many reasons for conducting studies relating one set of measures to another can be condensed into two categories: (a) to investigate the meaning of scores as measures of a certain attribute and (b) to investigate the scores as concomitants or predictors of other attributes (APA et al., 1974). The first of these fits the old definition of validity as the extent to which a test measures what it "purports" to measure. If one has developed a test "purporting" to measure scholastic aptitude, then the "real" measure of that aptitude is achievement in school (Hull, 1928). Pintner (1931) advocated validation of intelligence tests against such other indicators of intelligence as teachers' ratings, school achievement, or even scores on other intelligence tests.
These were the criteria—the standards—for judging the goodness of the test as a basis for inferences about intelligence; for example, the test is evaluated favorably as an intelligence test if the correlation between test scores and school achievement is high, and it is evaluated less favorably if that correlation is low.

The second category is illustrated when one correlates school achievement with scores on an intelligence test that one has already evaluated as providing satisfactorily valid measures of intelligence. The purpose might be a practical interest in predicting achievement; intelligence may be one of several attributes investigated as potential predictors. In this kind of investigation, the so-called criterion is placed less in the role of a standard and more in a role like the dependent variable in an experiment. The analogy is useful because, in such criterion-related validities, the inference from the test score is based on a hypothesis. The hypothesis is that performance on the test is related to performance on the other measure, usually a measure of a totally different attribute and usually one of greater importance to the test user. In these cases, validation is not so much a matter of evaluating the test score for measuring some attribute as an evaluation of a hypothesized relationship of one variable to another. In personnel testing, the hypothesis is that an attribute of job applicants, as measured, can be used to predict future proficiency (or attendance, or whatever) if the applicants are hired. The future proficiency is of greater interest to the organization than the attribute that predicts it; the "validation" research is an attempt to evaluate the hypothesis that proficiency is a function of the predictor. The study may, of course, lead to insights about the measurement of the trait.
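Computationally, a criterion-related validity coefficient of either kind is simply the Pearson correlation between test scores and the criterion measure. A minimal sketch, using hypothetical scores:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation (the classic validity coefficient)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: aptitude test scores and later proficiency ratings.
test_scores = [12, 15, 11, 18, 14, 20, 9, 16]
proficiency = [3, 4, 2, 5, 4, 5, 2, 4]

r = pearson_r(test_scores, proficiency)
print(f"criterion-related validity coefficient r = {r:.2f}")
```

As the article stresses, this single number is evidence to be weighed alongside other information; it is not itself the evaluation of validity.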
If, over a period of time, several studies are conducted in which the same sort of dependent variable is predicted by scores on the same test, then there is a pretty good basis for nailing down one's interpretation of the meaning of scores on that test. In a pragmatic world, however, that is often treated as a relatively trivial bit of information. The interpretation of interest to the organization is, quite simply, the value of the test as a basis for predicting future performance and, therefore, as a basis for decisions, regardless of what attribute it really measures.

The two different purposes of criterion-related validity studies can be distinguished in another way. If one's purpose is to come to a better understanding of how well a particular attribute is being measured by a certain test, then research should consist of several studies, each using a different criterion thought to reflect that attribute. If one is primarily interested, however, in predicting a specific measure of future behavior, then the advisable strategy is to use many different predictors; for example, measures of proficiency may be hypothesized to be functions of certain applicant attributes, situational variables, and demographic variables used in combination (Guion, 1976).

CONSTRUCT VALIDITY

The first type of criterion-related validation yields evidence of what has been called construct validity. To evaluate the measurements for inferences of specified attributes, some criterion measures are chosen as independent reflections of the same attributes. Other criterion measures may represent competing interpretations of what a particular test score (or other measure) might mean, that is, different attributes. For example, suppose that one considers scrap rate (proportion of work that is done so poorly it must be discarded—scrapped) an indication of poor work motivation.
That is, it is proposed that a worker's scrap rate can be used as a measure of carelessness; perhaps the idea is that careless people should not be promoted to a higher level of responsibility. If one is seriously interested in evaluating the notion that scrap rate measures carelessness, it would be well to look for correlations with some other indicators of carelessness, but it would also be well to check relationships with some measures that indicate clumsiness. Clumsiness is a competing interpretation of scrap rate; the scrap rate cannot be considered a valid measure of carelessness if scrap is largely attributable to lack of coordination. This may still be important if the higher level job is another job requiring high levels of coordination. However, if motivation (in the sense of attentiveness) rather than physical grace is required on the next higher level job, the use of scrap rate would probably be an invalid instrument for selection in part, at least, because it is an invalid measure of the desired attribute.

The essential logic of construct validation is disconfirmatory. There should, of course, be positive evidence that a measurement procedure leads to valid inferences about a particular construct, but the issue is more commonly a matter of showing that alternative or competing inferences do not destroy the intended interpretation. Cook and Campbell (1976) described construct validity primarily in terms of freedom from experimental confounding. In the correlational language of psychometric discussions, it is perhaps more familiar to say that a construct has been measured validly if a set of scores is reasonably free from contaminating sources of variance. The aim of research in construct validation is to strengthen, if possible, a given interpretation of scores, assuring that alternative interpretations are not very good.
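The scrap-rate example can be put in correlational terms. All of the data below are hypothetical, invented only to show the disconfirmatory check: if scrap rate tracks a clumsiness indicator far better than a carelessness indicator, the intended interpretation is undermined.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

# Hypothetical worker data.
scrap_rate = [5, 9, 4, 8, 6, 10]
carelessness_index = [7, 5, 6, 8, 5, 7]  # e.g., skipped checks, unsigned forms
clumsiness_index = [6, 9, 5, 8, 6, 10]   # e.g., dropped parts, coordination errors

r_careless = pearson_r(scrap_rate, carelessness_index)
r_clumsy = pearson_r(scrap_rate, clumsiness_index)
print(f"scrap vs. carelessness: r = {r_careless:.2f}")  # weak convergent evidence
print(f"scrap vs. clumsiness:   r = {r_clumsy:.2f}")    # strong competing interpretation
```

In this toy pattern the disconfirmatory check succeeds in the wrong direction: scrap rate behaves like a coordination measure, not a motivation measure, so the carelessness interpretation should be abandoned or revised.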
Of course, if the alternative interpretation turns out to be a very good one, the originally intended interpretation may have to be modified.

The historical introduction of the notion of construct validity was as much concerned with the validation of a theoretical construct as with the validation of its measure (Cronbach & Meehl, 1955). The basic logic and disconfirmatory emphasis of construct validation, however, can be as useful in evaluating attributes and measures of attributes identified vaguely for purely practical purposes as for evaluating constructs and measures of constructs required in the development of a theory. Perhaps all that is being implied here is a metaphor, an analogy to the original notion of construct validity. If so, the analogy is apt.

To say that valid inferences can be drawn about a specified construct by a particular method of measurement is to say very little about the value of that measurement for practical decisions. In personnel selection, the practical value of measurement depends not on how well it measures a specified attribute but on how well it predicts future performance on some other variable. Evidence of that practical value may come from a criterion-related validity coefficient of the second kind described in the previous section. In the long run, better evidence may come from a tightly reasoned hypothesis coupled with strong evidence of the construct validity with which the independent variables of that hypothesis are measured (Guion, 1976).

Issues of the validity of measurement arise in basic experimental research as well as in testing. The measurement problem arises whenever a concept is imperfectly or partially operationalized, and it becomes an acute problem whenever an experiment fails to confirm a theoretical proposition. In such cases, the experimenter must ask whether the failure is because the axiomatic relationships posed by the theory are wrong or because the inferences drawn from the measurements are invalid.
Stagner (Note 1) has given me an excellent illustration. He and Harlow did the first curare experiment and concluded that animals could not learn a striped muscle response when paralyzed. Later studies showed that they had learned but that the learning could be shown only when curare was injected again. The data were not in error, but the inference was.

CONTENT VALIDITY

Content validity is also a special case of construct validity. The "construct" may be an attribute, like level of knowledge or level of skill, in a particular information or performance domain. It has been customary to speak of content validity when one wishes to use scores on a test to infer probable performance in a larger domain of which the test is but a sample.

In personnel testing, the concept of content validity, which was borrowed from educational measurement, has been very troublesome. In educational measurement, a test could be considered a valid measure of curriculum content insofar as the material covered on an examination matched in general proportions the material to be covered in the general curriculum. In either case, the so-called content validity of the test is an evaluation of how well the tasks or questions it contains match those in a defined content domain. In personnel testing, the definition of a content domain has been a source of very great confusion. Nowhere is that confusion better documented than in the "Standards for Educational and Psychological Tests" (APA et al., 1974). In discussing the applicability of content validity to employment testing, it says that "the performance domain would need definition in terms of the objectives of measurement, restricted perhaps only to critical, most frequent, or prerequisite work behaviors" (p. 29).
Two paragraphs further, it says, "An employer cannot justify an employment test on grounds of content validity if he cannot demonstrate that the content universe includes all, or nearly all, important parts of the job."

A Strict Approach to Content Sampling

The procrustean task of making content validity fit the problem of personnel testing can be described by a four-step process that would assure a work sample test of unquestionable job relevance:

1. Define a job content universe on the basis of job analysis. This should include all nontrivial tasks, responsibilities, prerequisite knowledge and skill, and organizational relationships that make up the job. This is not what is to be sampled directly. One rarely hires people who are already able to do all of the things that are done on the job. (The second of the two quotations given earlier is herewith declared unacceptable.) Training programs exist to teach people how to recognize and carry out job responsibilities. Job applicants may be expected to know already how to do some of the things the job requires. For example, in hiring a secretary, one expects to train the new employee in specific office procedures or the use of unique equipment encountered in that office, but one does not expect to teach the new secretary how to type.

2. Identify a portion of the job content universe for the purposes of work sample testing; this may be called the job content domain. The word domain is being used here to denote a sample—not necessarily a representative sample—of the content implied by the word universe.

3. Define a test content universe as the tasks to be included in testing and the possible methods to standardize and score performance on them. The test content is not merely a sample of job content; it includes things that are not part of the actual job. Performing a job and taking a test are not the same thing, even if the component tasks seem nearly identical.
Typing mailable letters from dictation on a real job involves a familiar machine, knowledge of the idiosyncrasies of the person who dictates the letters, telephone or other interruptions, and so forth. Typing the same material in a test situation involves the anxiety or motivation created by the testing, standard conditions such that distractions (if any) are built into the exercise equally for all people taking the test, and using material dictated by an unfamiliar voice. Moreover, typing on the job is not formally scored, but the test requires a standard scoring procedure. Therefore, one adds to the job content domain possible methods of standardization and of scoring to form the test content universe. It consists of all the tasks that might be assigned from the job content domain, the various conditions that might be imposed, the various procedures for observing and recording responses, and the possible procedures for scoring them. The test will not include everything, but defining such a universe identifies the options.

4. Define a test content domain, a part (not a representative part) of the test content universe. This defines the actual specifications settled on for test construction. A test constructed according to these specifications would certainly be seen as job related. If the measurement operations were not questioned, it might even be said to have a high level of content validity.

The foregoing steps define a very tiresome and exhaustive procedure, but they should make clear two points: (a) that what has been talked about as content validity is really a content-oriented approach to test construction (Messick, 1975) and (b) that a truly representative sample of the job does not ordinarily provide measurement of the quality of performance.
A test constructed by this procedure will almost certainly result in valid inferences about the ability to do the job, but the evidence of that validity may require something beyond the implications of the term content validity. In the first place, it is highly unlikely that the two domains would precisely overlap; if circles are drawn to represent each, they would overlap, but the degree of content validity (the degree of overlap) would be small if either the job tasks omitted or the measurement procedures added were substantial. In the second place, there is an important conceptual difference between evaluations of the validity of inferences from scores and the evaluations of the quality of sampling tasks. Content validity, by definition, refers to the latter. If the inference to be drawn from a score on a content sample is to be an inference about performance on an actual job, then it is drawn at the end of a series of inferential steps, any one of which can be a serious misstep. The most serious misstep may occur in defining the scoring system. The scoring system of a work sample is as subject to contamination as is the scoring of any other test. The score obtained by an individual may reflect the attribute one wishes to infer—ability to do the designated aspects of the job; but it may also reflect a variety of contaminations such as anxiety, ability to comprehend verbal instructions, or the perceptual skills that enable some people to perceive cues for scoring that are imperceptible to others. All of this has a familiar ring after the discussion of construct validity. It means that disconfirmatory research (that is, construct validation) may be needed to evaluate the validity of scores on many job samples. To repeat: Content validity is a special case of construct validity (Messick, 1975; Tenopyr, 1977). 
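The "degree of overlap" between the two domains can be given a toy numerical form. The task labels and the Jaccard-style index below are my own illustration of the overlapping-circles image, not anything the article proposes as a formal index.

```python
# Hypothetical task labels for a secretarial job and its work sample test.
job_content_domain = {"type letters", "file records", "answer phones",
                      "proofread copy", "schedule meetings"}
test_content_domain = {"type letters", "file records", "proofread copy",
                       "follow standardized test instructions"}  # test-only addition

shared = job_content_domain & test_content_domain
overlap_index = len(shared) / len(job_content_domain | test_content_domain)
print(f"shared tasks: {len(shared)}")
print(f"overlap index = {overlap_index:.2f}")  # shrinks as tasks are omitted or added
```

The index drops whenever job tasks are omitted from the test or testing procedures are added to it, which is exactly why the overlap between the two circles is ordinarily modest.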
A Unitarian Doctrine of Validity

This discussion, of course, has been purely semantic, offered in the conviction that semantic clarification leads to clearer thought. The meaning of validity begins with a concept of an attribute, more or less clearly defined. It has ended with an evaluation of how well such a concept is represented by the numerical result of a set of operations for measuring it. Content validity, insofar as its meaning is restricted to content sampling, may influence one's evaluation of the validity of inferences from numbers or scores, but it is conceptually distinguishable from this broader concept of validity. Criterion-related validity sometimes gives evidence directly relevant to this representation question, but sometimes its evidence is more directly relevant to evaluations of the tenability of a hypothesis—again, a conceptually distinguishable idea. Stated differently, both the kinds of evidence known as content validity and as criterion-related validity may contribute to evaluations of how well the operations represent the underlying concept, but they do so only insofar as they are special cases of construct validity.

Construct validity seems to provide the unifying theme. I am a little reluctant, however, to assert that validity in general is the same thing as construct validity. Discussions of construct validity have generally been carried on at the level of the philosophy of science, and not all evaluation of measurement needs to be done at this level. At the risk of hedging, therefore, I identify the unifying concept of validity as similar, but not necessarily identical, to what has been meant by construct validity. Validity is therefore defined as the degree to which the result of the measurement process (the numbers) satisfactorily represents the various magnitudes of the intended attribute.
This is familiar; it is another statement of the traditional definition of validity as the extent to which a test measures what it purports to measure. It is not, however, restricted to tests; the emphasis is on a more general evaluation of the goodness of measurement.

That evaluation can draw from all three aspects of validity. Certainly, what has been called content-oriented test construction contributes to the evaluation of the adequacy of measurement. If the results of measurement are to be called valid, structural questions about the measurement operations must be answered satisfactorily. These include, but are not limited to, questions of content. Questions of content apply not only to work samples but to factored aptitude tests, rating scales, personality inventories, and just about any other technique. These are not questions of the representativeness of the content and measurement operations, however; they are questions of importance and relevance. In discussing the validity of criterion measures, Jenkins said,

There is always the danger that the investigator may accept some convenient measure . . . only to find ultimately that the performance which produces this measure is merely a part, and perhaps an unimportant part, of the total field performance desired by the sponsor. (Jenkins, 1946, pp. 96-97)

To generalize: There is always the danger that the measure at hand is so constructed that it cannot faithfully mirror the attribute to be measured. In addition to content, structural considerations include reliability, standardization, language and language level, quality of graphics, and appropriateness of time limits or of standardizing samples, among others. In short, a first approximation to a judgment that a particular set of operations leads to valid inferences about a specified attribute is the judgment that the set of operations has been thoughtfully and skillfully assembled.
This may not be a sufficient basis for a judgment that the measures are valid, but it is a necessary one.

Some form of empirical evidence is equally necessary. The evidence typically takes the form of a pattern of correlations. The measures being evaluated should logically be correlated with some external variables, but there are others to which they should logically not be related. The judgment of validity depends on the confidence one has that obtained coefficients fit the logically expected pattern. Individually, such correlation coefficients are statements of criterion-related validity; collectively, they are bases for judgments of construct validity. It seems clear that the essence of a unified conception of validity is at least metaphorically the notion of construct validity; in short, the trinitarian doctrine reduces to a unitarian one so long as the meaning of validity refers only to the evaluation of how well a designated attribute is measured.

The Validity of a Hypothesis

The discussion of construct and criterion-related validation includes a different kind of evaluative research, the evaluation of hypotheses about relationships of either theoretical or practical importance. Since such research is often called validation, even if the purpose is not to validate the measurement of an attribute, it is useful to speak of evaluating the validity of a hypothesis. This requires evaluation of the research as well as the result. Under some circumstances, the validity coefficient obtained in the research is inflated. For example, there may be common error variance in both the predictor and the variable to be predicted. Under most circumstances, however, problems in the research lead one to underevaluate the validities of one's hypotheses.
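One familiar source of such underevaluation is measurement unreliability. The classical correction for attenuation (a standard psychometric formula, applied here to hypothetical numbers) shows how much an observed coefficient can understate the underlying relationship when the predictor and, especially, the criterion are measured unreliably:

```python
from math import sqrt

r_observed = 0.30  # observed predictor-criterion correlation (hypothetical)
rxx = 0.80         # predictor reliability (hypothetical)
ryy = 0.60         # criterion reliability, e.g., supervisory ratings (hypothetical)

# Classical correction for attenuation: r_true ≈ r_observed / sqrt(rxx * ryy)
r_disattenuated = r_observed / sqrt(rxx * ryy)
print(f"estimated correlation free of unreliability: {r_disattenuated:.2f}")
```

A researcher who evaluates the hypothesis solely on the observed 0.30, without asking about the quality of the measures, may reject a relationship that is substantially stronger than it appears.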
Campbell and Stanley (1966) and Cook and Campbell (1976) discussed the validity problems encountered in doing experimental and quasi-experimental research; much of their discussion is also relevant to nonexperimental studies of relationships. In addition to construct validity, they referred to "internal validity" in discussing problems in the conduct of research that undermine permissible confidence in the results. They referred to "statistical conclusion validity" in discussing statistical issues that alter such confidence. They referred to "external validity" in describing problems limiting the generalizability of research findings. In these discussions, Campbell and his associates identified what might be called a third target of evaluation, the validity of the research itself. Such evaluation is surely necessary in establishing the validity of a hypothesis.

Validation research in personnel testing is usually correlational rather than quasi-experimental. This seems unfortunate. We should have broad programs of personnel selection that can eventually be evaluated for overall effectiveness, without such severe concentration on the single predictor applied to the individual applicant. Use of program evaluation designs could adopt such dependent variables as organizational productivity or profitability—variables not predictable when the individual is the unit of analysis. We need not wait, however, for the adoption of program evaluation methods to consider the effects of the internal, external, and statistical conclusion validities on the evaluation of correlational results.

These considerations should make clear that the particular validity coefficient one obtains in a predictive study is not a sufficient basis for an evaluation of the validity of the hypothesized relationship (any more than it is sufficient for evaluating the validity of the measurement).
Many of the threats to validity described by Campbell and his colleagues conspire in prediction research to lull the researcher into either an unwarranted complacency or an unwarranted pessimism. If their effect lowers the estimate of the relationship, the researcher avoids using a good predictor.

Personnel testers have placed too much reliance on criterion-related validation and on the specific validity coefficients they obtain for evaluating their predictive hypotheses. The simplicity of the validity coefficient makes it very attractive; predictive studies, where they can be done well, are obviously valuable sources of data. However, things are rarely as simple as they seem, and it is time to abandon a simplistic overreliance on a validity coefficient obtained from a single study. There are several reasons.

First, research conditions are never exactly repeated. The logic of research on the validity of a predictive hypothesis generally assumes a static set of conditions such that the particular setting in which the study is done this year will be matched in all but trivial respects by the setting in which the results of the study will be used 2 or 3 years hence. The typical design of research and blind use of the results seem to assume that new methods of training, new equipment, new social attitudes, new characteristics of the applicant population, and many other new things will have no effect on the observed relationship.

Second, the logic assumes one or more variables important enough to predict. Too often validity studies use available criteria without serious evaluation of their importance. Jenkins (1946) called for criteria that were comprehensive measures of the performance of concern to the "sponsor" of the research; he decried the still-prevalent tendency to accept any measure that happened to be lying around.
Otis (1971) and Wallace (1965) argued that psychologists should develop behavioral criteria instead of the typical managerial records or ratings. The advice is not often taken.

Third, the logic assumes that the measures to be predicted will be psychometrically sound, that is, that they will represent consistent behaviors reliably observed and that they can be measured validly. However, it is very rare that the report of a criterion-related validity study says anything at all about the evaluation of the criterion measures themselves. The use of supervisory ratings is prevalent, and the validities of these ratings are often questionable.

Fourth, the logic assumes that the relationship observed will generalize to later samples. If motivation or attitude influences predictor scores, as in personality inventories or measures of physical strength, the findings in a concurrent study (using present employees with assurances that poor performance will not haunt them) may not generalize to samples of job applicants.

Finally, the logic assumes reliable statistics. Criterion-related studies should be conducted, if at all possible, using reliable measures encompassing a representative range of talent on many more cases than are usually available (Schmidt, Hunter, & Urry, 1976).

In short, the evaluation of a research result must always take into account the adequacy of the sample (its representativeness and size), procedures and safeguards in research (e.g., avoidance of criterion contamination), the logical and psychometric quality of the measures of the variables, and the rational foundation for the hypothesis (Guion, 1976). Certainly, one should take into account the relevant history of prior research. The latter point suggests a less static approach to prediction research.
At a recent convention, Croll and Urry (Note 2) described a Bayesian approach in which each new sample provides new information to be assimilated in the light of prior information about probabilities; an address on the use of Bayesian statistics in industrial psychology was also presented by Novick (Note 3). The Bayesian approach seems an effective way to point out that a single validity coefficient is not as useful for evaluating the tenability of the hypothesis of a predictor-criterion relationship as is a series of such coefficients (Schmidt & Hunter, 1977).

Must All Tests Be Validated?

The heading is a paraphrasing of the question asked by Ebel (1961); his answer was negative. I think he was right. This does not mean that measurements and hypotheses should not be evaluated; it means that there are other methods and standards for evaluation beyond those implied by the conventional trinitarian doctrine of validity. The unifying concept of the validity of measurement has been defined in terms of the congruence of the conceptual and operational definitions of an attribute; more precisely, validity is the degree to which the numbers obtained by a measurement procedure represent magnitudes of the attribute to be measured. Fundamentally, like the notion of construct validity, this definition refers to the meaningfulness or interpretability of the scores. By this definition, all measures (including tests) should be valid; it does not follow that all measures should be validated by looking at content sampling, or validity coefficients, or experimental or multivariate studies of convergence.
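The Bayesian accumulation of evidence that Guion credits to Croll and Urry (Note 2) and Novick (Note 3) can be sketched as a conjugate normal update on the Fisher-z scale; the model choice and the study numbers below are illustrative assumptions of mine, not their method.

```python
# Illustrative sketch: pooling a series of validity coefficients in the
# Bayesian spirit Guion describes, via a conjugate normal update on the
# Fisher-z scale (my model choice, not Croll and Urry's procedure).
import math

def update(prior_mean, prior_var, r, n):
    """Fold one study's coefficient r (sample size n) into the prior."""
    z = math.atanh(r)                 # Fisher z of the observed r
    var = 1.0 / (n - 3)               # approximate sampling variance of z
    post_var = 1.0 / (1.0 / prior_var + 1.0 / var)
    post_mean = post_var * (prior_mean / prior_var + z / var)
    return post_mean, post_var

# Diffuse prior, then three small hypothetical studies of one predictor.
mean, var = 0.0, 1.0
for r, n in [(0.21, 40), (0.35, 55), (0.28, 48)]:
    mean, var = update(mean, var, r, n)

print(f"pooled validity estimate: r = {math.tanh(mean):.2f} "
      f"(posterior SD on z scale = {math.sqrt(var):.3f})")
```

No single one of the three coefficients is trustworthy on its own, but each sample tightens the posterior; this is the sense in which a series of coefficients evaluates the hypothesis better than any one of them.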
Consider, for example, the most fundamental sorts of measurement, such as the measurement of weight using balances or the measurement of linear distance with a yardstick; consider also such mathematically formal measurement as the measurement of information or uncertainty. For these kinds of measurement, there is a formal mathematical model representing the finite set of relationships involving the attribute. Effective formal measurement will provide an isomorphic measurement set, that is, a measurement set with a one-to-one relationship to empirical realities or to the theoretical model. Such isomorphism may be sufficient evidence of the meaningfulness of the measurement that psychometric concepts of validation are superfluous. (For further discussion of the evaluation of formal measurement without reference to psychometric validity, see Coombs, Dawes, & Tversky, 1970; Hooke, 1963; or Suppes & Zinnes, 1963.)

It has also been suggested that the validity of a hypothesized relationship between variables be accepted as conceptually distinguishable from the validity of measurement. Here, too, there are circumstances in which validation research is not necessary and perhaps not meaningful. For example, a content sample does not need to be forced into the notion of either kind of validity to be considered job relevant. Under certain conditions, operational definitions of an attribute, such as ability to do a job, provide both a necessary and a sufficient evaluation of the obtained scores and of their use in personnel selection without further concern for either kind of validity.

Particularly in personnel research, the procrustean concept that everything must somehow be squeezed into a validity framework needs to be questioned. I have heard colleagues seriously propose, for example, that educational, or experience, or even age requirements be defended on the grounds of content validity!
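The isomorphism Guion invokes for formal measurement can be made concrete with a toy sketch (my construction, not the article's): a numerical assignment for weights is adequate when it mirrors the empirical relations of ordering on the balance and concatenation by stacking objects in one pan.

```python
# Toy sketch (my construction, not Guion's): a numerical assignment for
# weight is isomorphic to the empirical system when balance order and
# stacking (concatenation) carry over to the numbers.

# Hidden empirical magnitudes; the "experimenter" never reads these
# directly, only through the balance and the stacking operation.
_mass = {"a": 2.0, "b": 3.0, "c": 5.0}

def balance_says_heavier(x, y):
    """Empirical relation: does the pan holding x sink below y's?"""
    return _mass[x] > _mass[y]

def stack(x, y):
    """Empirical concatenation: place x and y in the same pan."""
    new = x + "+" + y
    _mass[new] = _mass[x] + _mass[y]
    return new

# A candidate measurement assignment in an arbitrary but admissible
# unit (here, the hidden magnitudes times 16).
scale = {k: v * 16 for k, v in _mass.items()}
def measure(x):
    if x not in scale:
        scale[x] = sum(scale[part] for part in x.split("+"))
    return scale[x]

# Homomorphism checks: order and concatenation carry over to numbers.
ab = stack("a", "b")
assert balance_says_heavier("c", "a") == (measure("c") > measure("a"))
assert measure(ab) == measure("a") + measure("b")
print("the numerical assignment mirrors the empirical relations")
```

When checks of this kind succeed for the whole relational system, the assignment is meaningful on its face, which is Guion's point that psychometric validation would add nothing.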
The principal concern in personnel testing—as in most fields of applied psychology—is with the validity of a hypothesis of a relationship between a variable used as a predictor and subsequent job performance. If solid research to evaluate the hypothesis can be conducted, complete with valid measures of job performance, then the research conventionally called criterion-related validation provides the best evidence of the usefulness of the measure—its job relatedness.

It would be an error, however, to assume that job relatedness can be evaluated only in terms of a validity coefficient describing an observed relationship. The validity coefficient itself must be logically evaluated. Beyond that, the solid logic of a well-developed hypothesis, where competent empirical research is unlikely, provides better evidence of the job relatedness of a predictor than does a validity coefficient obtained in a faulty study.

Validation is important, but it is not all-important. Sound arguments of the job relevance of well-measured, logically defensible attributes may be sufficient in themselves.

REFERENCE NOTES

1. Stagner, R. Personal communication, February 23, 1979.
2. Croll, P. R., & Urry, V. W. Tailored testing: Maximizing validity and utility for job selection. In T. Kehle (Chair), Innovations in personnel selection. Symposium presented at the meeting of the American Psychological Association, Toronto, Canada, August-September 1978.
3. Novick, M. R. Implications of Bayesian statistics for industrial/organizational psychology. Invited address presented at the meeting of the American Psychological Association, Toronto, Canada, August-September 1978.

REFERENCES

American Psychological Association, American Educational Research Association, & National Council on Measurements Used in Education (joint committee). Technical recommendations for psychological tests and diagnostic techniques.
Psychological Bulletin, 1954, 51, 201-238.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. Standards for educational and psychological tests. Washington, D.C.: American Psychological Association, 1974.
Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs for research. Chicago: Rand McNally, 1966.
Cook, T. D., & Campbell, D. T. The design and conduct of quasi-experiments and true experiments in field settings. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Coombs, C. H., Dawes, R. M., & Tversky, A. Mathematical psychology: An elementary introduction. Englewood Cliffs, N.J.: Prentice-Hall, 1970.
Cronbach, L. J., & Meehl, P. E. Construct validity in psychological tests. Psychological Bulletin, 1955, 52, 281-302.
Dunnette, M. D., & Borman, W. C. Personnel selection and classification systems. Annual Review of Psychology, 1979, 30, 477-525.
Ebel, R. L. Must all tests be valid? American Psychologist, 1961, 16, 640-647.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. Uniform guidelines on employee selection procedures. Federal Register, August 25, 1978, 43(166), 38290-38315.
Guion, R. M. Recruiting, selection, and job placement. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Hooke, R. Some figures. In R. Fox, M. Garbuny, & R. Hooke (Eds.), The science of science. New York: Walker, 1963.
Hull, C. L. Aptitude testing. Yonkers, N.Y.: World Book, 1928.
Idso, S. B., Jackson, R. D., & Reginato, R. J. Detection of soil moisture by remote surveillance. American Scientist, 1975, 63, 549-557.
Jenkins, J. G. Validity for what?
Journal of Consulting Psychology, 1946, 10, 93-98.
Messick, S. The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 1975, 30, 955-966.
Otis, J. L. Whose criterion? In W. W. Ronan & E. P. Prien (Eds.), Perspectives on the measurement of human performance. New York: Appleton-Century-Crofts, 1971.
Pintner, R. Intelligence testing: Methods and results (2nd ed.). New York: Holt, 1931.
Schmidt, F. L., & Hunter, J. E. Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 1977, 62, 529-540.
Schmidt, F. L., Hunter, J. E., & Urry, V. W. Statistical power in criterion-related validity studies. Journal of Applied Psychology, 1976, 61, 473-485.
Suppes, P., & Zinnes, J. L. Basic measurement theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1). New York: Wiley, 1963.
Tenopyr, M. L. Content-construct confusion. Personnel Psychology, 1977, 30, 47-54.
Torgerson, W. S. Theory and methods of scaling. New York: Wiley, 1958.
Wallace, S. R. Criteria for what? American Psychologist, 1965, 20, 411-417.

Received April 16, 1979