Methods in Behavioral Research, Ch. 5





When reading Chapter 5, we get the sense that measuring a test’s reliability is necessary; however, reliability alone is not enough to establish the test’s validity. Select a research example on a topic of interest to you within psychology, then follow the instructions below. Note: Ensure your research example includes one of the types of reliability and one of the types of validity listed below.


  • Test-retest reliability

  • Internal consistency reliability

  • Interrater reliability

  • Face validity

  • Content validity

  • Predictive validity

  • Concurrent validity

  • Convergent validity

  • Discriminant validity


Instructions: (a) Summarize both concepts (reliability and validity) as presented in our text by Cozby and Bates (2015) or another credible source. (b) Then summarize how those concepts are illustrated in a research example related to your topic of interest within psychology. Replies may vary depending on the research example selected (e.g., experimental vs. field research design). Our class discussion will likely relate to learning objectives/competencies 2.4 and 2.5, depending on peer selections, research examples, and replies.


The participation rubric is in the Instructor Policy document. Word count: approximately 1,000 words, excluding citations of your sources.

Methods in Behavioral Research, Ch. 5
Chapter 5: Measuring Concepts

LEARNING OBJECTIVES

  • Define reliability of a measure of behavior and describe the difference between test-retest, internal consistency, and interrater reliability.

  • Discuss ways to establish construct validity, including face validity, content validity, predictive validity, concurrent validity, convergent validity, and discriminant validity.

  • Describe the problem of reactivity of a measure of behavior and discuss ways to minimize reactivity.

  • Describe the properties of the four scales of measurement: nominal, ordinal, interval, and ratio.

We learn about behavior through careful measurement. As we discussed in Chapter 4, behavior can be measured in many ways. The most common measurement strategy is to ask people to tell you about themselves: How many times have you argued with your partner in the past week? How would you rate your overall happiness? How fair was the professor’s grading system? Of course, you can also directly observe behaviors. How many errors did someone make on a task? Will people who you approach in a shopping mall give you change for a dollar? How many times did a person smile during an interview? Physiological and neurological responses can be measured as well. How much did heart rate change while working on the problems? Did muscle tension increase during the interview? There is an endless supply of fascinating behaviors that can be studied. We will describe various methods of measuring variables at several points in subsequent chapters. In this chapter, however, we explore the technical aspects of measurement. We need to consider reliability, validity, and reactivity of measures. We will also consider scales of measurement.

RELIABILITY OF MEASURES

Reliability refers to the consistency or stability of a measure of behavior. Your everyday definition of reliability is quite close to the scientific definition.
For example, you might say that Professor Fuentes is “reliable” because she begins class exactly at 10 a.m. each day; in contrast, Professor Fine might be called “unreliable” because, although she sometimes begins class exactly on the hour, on any given day she may appear anytime between 10 and 10:20 a.m. Similarly, a reliable measure of a psychological variable such as intelligence will yield the same result each time you administer the intelligence test to the same person. The test would be unreliable if it measured the same person as average 1 week, low the next, and bright the next. Put simply, a reliable measure does not fluctuate from one reading to the next. If the measure does fluctuate, there is error in the measurement device.

A more formal way of understanding reliability is to use the concepts of true score and measurement error. Any measure that you make can be thought of as comprising two components: (1) a true score, which is the real score on the variable, and (2) measurement error. An unreliable measure of intelligence contains considerable measurement error and so does not provide an accurate indication of an individual’s true intelligence. In contrast, a reliable measure of intelligence—one that contains little measurement error—will yield an identical (or nearly identical) intelligence score each time the same individual is measured.

To illustrate the concept of reliability further, imagine that you know someone whose “true” intelligence score is 100. Now suppose that you administer an unreliable intelligence test to this person each week for a year. After the year, you calculate the person’s average score on the test based on the 52 scores you obtained. Now suppose that you test another friend who also has a true intelligence score of 100; however, this time you administer a highly reliable test. Again, you calculate the average score. What might your data look like? Typical data are shown in Figure 5.1. In each case, the average score is 100.
However, scores on the unreliable test range from 85 to 115, whereas scores on the reliable test range from 97 to 103. The measurement error in the unreliable test is revealed in the greater variability shown by the person who took the unreliable test.

FIGURE 5.1 Comparing data of a reliable and unreliable measure

When conducting research, you can measure each person only once; you cannot give the measure 50 or 100 times to discover a true score. Thus, it is very important that you use a reliable measure. Your single administration of the measure should closely reflect the person’s true score.

The importance of reliability is obvious. An unreliable measure of length would be useless in building a table; an unreliable measure of a variable such as intelligence is equally useless in studying that variable. Researchers cannot use unreliable measures to systematically study variables or the relationships among variables. Trying to study behavior using unreliable measures is a waste of time because the results will be unstable and cannot be replicated.

Reliability is most likely to be achieved when researchers use careful measurement procedures. In some research areas, this might involve carefully training observers to record behavior; in other areas, it might mean paying close attention to the way questions are phrased or the way recording electrodes are placed on the body to measure physiological reactions. In many areas, reliability can be increased by making multiple measures. This is most commonly seen when assessing personality traits and cognitive abilities. A personality measure, for example, will typically have 10 or more questions (called items) designed to assess a trait. Responses on the items are then combined for a total score. Reliability increases as the number of items increases.

How can we assess reliability? We cannot directly observe the true score and error components of an actual score on the measure.
However, we can assess the statistical stability of measures using correlation coefficients. Recall from Chapter 4 that a correlation coefficient is a number that tells us how strongly two variables are related to each other. There are several ways of calculating correlation coefficients; the most common when discussing reliability is the Pearson product-moment correlation coefficient. The Pearson correlation coefficient (symbolized as r) can range from −1.00 to +1.00. A correlation of 0.00 tells us that the two variables are not related at all. The closer a correlation is to either +1.00 or −1.00, the stronger the relationship. The positive and negative signs provide information about the direction of the relationship. When the correlation coefficient is positive (a plus sign), there is a positive linear relationship—high scores on one variable are associated with high scores on the second variable. A negative linear relationship is indicated by a minus sign—high scores on one variable are associated with low scores on the second variable. The Pearson correlation coefficient will be discussed further in Chapter 12.

To assess the reliability of a measure, we need to obtain at least two scores on the measure from many individuals. If the measure is reliable, the two scores should be very similar; a Pearson correlation coefficient that relates the two scores should be a high positive correlation. When you read about reliability, the correlation will usually be called a reliability coefficient. Let’s examine specific methods of assessing reliability.

Test-Retest Reliability

Test-retest reliability is assessed by measuring the same individuals at two points in time. For example, the reliability of a test of intelligence could be assessed by giving the measure to a group of people on one day and again a week later.
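The test-retest logic is easy to sketch in code. The scores below are invented purely for illustration (they are not from the text); the sketch simply shows that the reliability coefficient is the Pearson correlation between the two administrations:

```python
# Minimal Pearson product-moment correlation, applied to hypothetical
# test-retest data. All scores here are invented for illustration.
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical intelligence scores for five people, tested twice, a week apart.
test = [95, 100, 105, 110, 120]
retest = [96, 101, 104, 112, 118]

r = pearson_r(test, retest)
print(round(r, 3))  # close to +1.00, suggesting high test-retest reliability
```

In real research you would use many more respondents and a statistics package, but the computation is the same: two scores per person, one correlation.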
We would then have two scores for each person, and a correlation coefficient could be calculated to determine the relationship between the first test score and the retest score. Recall that high reliability is indicated by a high correlation coefficient showing that the two scores are very similar. If many people have very similar scores, we conclude that the measure reflects true scores rather than measurement error. It is difficult to say how high the correlation should be before we accept the measure as reliable, but for most measures the reliability coefficient should probably be at least .80.

Given that test-retest reliability requires administering the same test twice, the correlation might be artificially high because the individuals remember how they responded the first time. Alternate forms reliability is sometimes used to avoid this problem; it requires administering two different forms of the same test to the same individuals at two points in time. A drawback to this procedure is that creating a second equivalent measure may require considerable time and effort.

Intelligence is a variable that can be expected to stay relatively constant over time; thus, we expect the test-retest reliability for intelligence to be very high. However, some variables may be expected to change from one test period to the next. For example, a mood scale designed to measure a person’s current mood state is a measure that might easily change from one test period to another, and so test-retest reliability might not be appropriate. On a more practical level, obtaining two measures from the same people at two points in time may sometimes be difficult. To address these issues, researchers have devised methods to assess reliability without two separate assessments.

Internal Consistency Reliability

It is possible to assess reliability by measuring individuals at only one point in time.
We can do this because most psychological measures are made up of a number of different questions, called items. An intelligence test might have 100 items, a measure of extraversion might have 15 items, or a multiple-choice examination in a class might have 50 items. A person’s test score would be based on the total of his or her responses on all items. In a class, an exam consists of a number of questions about the material, and the total score is the number of correct answers. An extraversion measure might ask people to agree or disagree with items such as “I enjoy the stimulation of a lively party.” An individual’s extraversion score is obtained by finding the total number of such items that are endorsed. Recall that reliability increases with increasing numbers of items.

Internal consistency reliability is the assessment of reliability using responses at only one point in time. Because all items measure the same variable, they should yield similar or consistent results. One indicator of internal consistency is split-half reliability: the correlation of the total score on one half of the test with the total score on the other half. The two halves are created by randomly dividing the items into two parts. The actual calculation of a split-half reliability coefficient is a bit more complicated, because the final measure will include items from both halves. Thus, the combined measure will have more items and will be more reliable than either half by itself. This fact must be taken into account when calculating the reliability coefficient; the corrected reliability is termed the Spearman-Brown split-half reliability coefficient. Split-half reliability is relatively straightforward and easy to calculate, even without a computer. One drawback is that it is based on only one of many possible ways of dividing the measure into halves.
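The split-half procedure and the Spearman-Brown correction can be sketched in a few lines. The six-item responses below are invented for illustration, and an odd/even split of the items stands in for a random division:

```python
# Split-half reliability with the Spearman-Brown correction.
# Item responses (6 items, 4 respondents, scored 1-5) are invented
# for illustration; this is a sketch, not a validated scoring routine.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Each row is one respondent's answers to six items.
responses = [
    [4, 5, 4, 5, 4, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4],
    [5, 4, 5, 5, 5, 4],
]

# Divide the items into halves (here: odd vs. even positions)
# and total each half for each respondent.
half1 = [sum(row[0::2]) for row in responses]
half2 = [sum(row[1::2]) for row in responses]

r_halves = pearson_r(half1, half2)
# Spearman-Brown correction: estimated reliability of the full-length test,
# which has twice as many items as either half.
r_sb = (2 * r_halves) / (1 + r_halves)
print(round(r_halves, 3), round(r_sb, 3))
```

Note that the corrected coefficient is always at least as large as the half-test correlation, reflecting the longer combined measure. Averaging the coefficients from every possible split, rather than relying on one arbitrary division, is essentially what Cronbach’s alpha (described next) accomplishes.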
Another commonly used indicator of reliability based on internal consistency, called Cronbach’s alpha, provides us with the average of all possible split-half reliability coefficients. To actually perform the calculation, scores on each item are correlated with scores on every other item. A large number of correlation coefficients are produced; you would only want to do this with a computer! The value of Cronbach’s alpha is based on the average of all the inter-item correlation coefficients and the number of items in the measure. Again, note that more items will be associated with higher reliability.

It is also possible to examine the correlation of each item score with the total score based on all items. Such item-total correlations are very informative because they provide information about each individual item. Items that do not correlate with the total score on the measure are actually measuring a different variable; they can be eliminated to increase internal consistency reliability. This information is also useful when it is necessary to construct a brief version of a measure. Even though reliability increases with longer measures, a shorter version can be more convenient to administer and still retain acceptable reliability.

Interrater Reliability

In some research, raters observe behaviors and make ratings or judgments. To do this, a rater uses instructions for making judgments about the behaviors—for example, by rating whether a child’s behavior on a playground is aggressive and how aggressive the behavior is. You could have one rater make judgments about aggression, but the single observations of one rater might be unreliable. The solution to this problem is to use at least two raters who observe the same behavior. Interrater reliability is the extent to which raters agree in their observations.
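Agreement between two raters is commonly summarized with Cohen’s kappa, which corrects the raw percentage of agreement for the agreement expected by chance. A minimal sketch with invented playground ratings ("A" = aggressive, "N" = not aggressive):

```python
# Cohen's kappa for two raters: observed agreement corrected for chance.
# The ratings below are invented for illustration.
def cohens_kappa(rater1, rater2):
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: sum over categories of the product of each
    # rater's marginal proportion for that category.
    p_e = sum(
        (rater1.count(c) / n) * (rater2.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Ten playground observations coded as aggressive ("A") or not ("N").
r1 = ["A", "A", "N", "N", "A", "N", "N", "A", "N", "N"]
r2 = ["A", "A", "N", "N", "A", "N", "A", "A", "N", "N"]

print(round(cohens_kappa(r1, r2), 3))
```

Here the raters agree on 9 of 10 observations, but because chance alone would produce agreement of .50 given these marginals, kappa comes out at .80 rather than .90.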
Thus, if two raters are judging whether behaviors are aggressive, high interrater reliability is obtained when most of the observations result in the same judgment. A commonly used indicator of interrater reliability is called Cohen’s kappa. The methods of assessing reliability are summarized in Figure 5.2.

FIGURE 5.2 Three strategies for assessing reliability

Reliability and Accuracy of Measures

Reliability is clearly important when researchers develop measures of behavior. However, reliability is not the only characteristic of a measure that researchers worry about. Reliability tells us about measurement error, but it does not tell us whether we have a good measure of the variable of interest. To use a silly example, suppose you want to measure intelligence. The measure you develop looks remarkably like the device that is used to measure shoe size at your local shoe store. You ask your best friend to place one foot in the device, and you use the gauge to measure their intelligence. Numbers on the device provide a scale of intelligence so you can immediately assess a person’s intelligence level. Will these numbers result in a reliable measure of intelligence? The answer is that they will! Consider what a test-retest reliability coefficient would be: if you administer the “foot intelligence scale” on Monday, the score will be almost the same the following Monday, so the test-retest reliability is high. But is this an accurate measure of intelligence? Obviously, the scores have nothing to do with intelligence; just because the device is labeled an intelligence test does not mean that it is a good measure of intelligence.

Let’s consider a less silly example. Suppose your neighborhood gas station pump puts the same amount of gas in your car every time you purchase a gallon (or liter) of fuel; the gas pump gauge is reliable. However, the issue of accuracy is still open.
The only way you can know about the accuracy of the pump is to compare the gallon (or liter) you receive with some standard measure. In fact, states have inspectors responsible for comparing the amount that the pump says is a gallon with an exact gallon measure. A pump with a gauge that does not deliver what it says must be repaired or replaced. This difference between the reliability and accuracy of measures leads us to a consideration of the validity of measures.

CONSTRUCT VALIDITY OF MEASURES

If something is valid, it is “true” in the sense that it is supported by available evidence. The amount of gasoline that the gauge indicates should match some standard measure of liquid volume; a measure of a personality characteristic such as shyness should be an accurate indicator of that trait. Recall from Chapter 4 that construct validity concerns whether our methods of studying variables are accurate. That is, it refers to the adequacy of the operational definition of variables. To what extent does the operational definition of a variable actually reflect the true theoretical meaning of the variable? In terms of measurement, construct validity is a question of whether the measure that is employed actually measures the construct it is intended to measure. Applicants for some jobs are required to take a Clerical Ability Test; this measure is supposed to predict an individual’s clerical ability. The validity of such a test is determined by whether it actually does measure this ability. A measure of shyness is an operational definition of the shyness variable; the validity of this measure is determined by whether it does measure this construct.

How do we know that a measure is valid? Ways that we can assess validity are summarized in Table 5.1. Evidence for construct validity takes many forms.
TABLE 5.1 Indicators of construct validity of a measure

Indicators of Construct Validity

Face validity

The simplest way to argue that a measure is valid is to suggest that the measure appears to accurately assess the intended variable. This is called face validity—the evidence for validity is that the measure appears “on the face of it” to measure what it is supposed to measure. Face validity is not very sophisticated; it involves only a judgment of whether, given the theoretical definition of the variable, the content of the measure appears to actually measure the variable. That is, do the procedures used to measure the variable appear to be an accurate operational definition of the theoretical variable? Thus, a measure of a variable such as shyness will usually appear to measure that variable. A measure of shyness called the Shy Q (Bortnik, Henderson, & Zimbardo, 2002) includes items such as “I often feel insecure in social situations” but does not include an item such as “I learned to ride a bicycle at an early age”—the first item appears to be more closely related to shyness than does the second one.

Note that the assessment of validity here is a very subjective, intuitive process. A way to improve the process somewhat is to systematically seek out experts in the field to make the face validity determination. In either case, face validity is not sufficient to conclude that a measure is in fact valid. Appearance is not a very good indicator of accuracy. Some very poor measures may have face validity; for example, most personality measures that appear in popular magazines typically have several questions that look reasonable but often do not tell you anything meaningful. The interpretations of the scores may make fun reading, but there is no empirical evidence to support the conclusions that are drawn in the article. In addition, many good measures of variables do not have obvious face validity.
For example, is it obvious that rapid eye movement during sleep is a measure of dreaming?

Content validity

Content validity is based on comparing the content of the measure with the universe of content that defines the construct. For example, a measure of depression would have content that links to each of the symptoms that define the depression construct. Or consider a measure of “knowledge of psychology” that could be administered to graduating seniors at your college. In this case, the faculty would need to define a universe of content that constitutes this knowledge. The measure would then have to reflect that universe. Thus, if classical conditioning is one of the content areas that defines knowledge of psychology (it is!), questions relating to this topic will be included in the measure.

Both face validity and content validity focus on assessing whether the content of a measure reflects the meaning of the construct being measured. Other indicators of validity rely on research that examines how scores on a measure relate to other measures of behavior. In validity research, the behavior is termed a criterion. These validity indicators are predictive validity, concurrent validity, convergent validity, and discriminant validity.

Predictive validity

Research that uses a measure to predict some future behavior is using predictive validity. Thus, with predictive validity, the criterion measure is based on future behavior or outcomes. Predictive validity is clearly important when studying measures that are designed to improve our ability to make predictions. A Clerical Ability Test is intended to provide a fast way to predict future performance in a clerical position. Similarly, many college students take the Graduate Record Exam (GRE), which was developed to predict success in graduate programs, or the Law School Admissions Test (LSAT), developed to predict success in law school.
The construct validity of such measures is demonstrated when scores on the measure predict the future behaviors. For example, predictive validity of the LSAT is demonstrated when research shows that people who score high on the test do better in law school than people who score low on the test (i.e., there is a positive relationship between the test score and grades in law school). The measure can be used to advise people on whether they are likely to succeed in law school or to select applicants for law school admission.

Concurrent validity

Concurrent validity is demonstrated by research that examines the relationship between the measure and a criterion behavior at the same time (concurrently). Research using the concurrent validity approach can take many forms. A common method is to study whether two or more groups of people differ on the measure in expected ways. Suppose you have a measure of shyness. Your theory of shyness might lead you to expect that salespeople whose job requires making cold calls to potential customers would score lower on the shyness measure than salespeople in positions in which potential customers must make the effort to contact the company themselves. Another approach to concurrent validity is to study how people who score either low or high on the measure behave in different situations. For example, you could ask people who score high versus low on the shyness scale to describe themselves to a stranger while you measure their level of anxiety. Here you would expect that the people who score high on the shyness scale would exhibit higher amounts of anxiety.

Convergent validity

Any given measure is a particular operational definition of the variable being measured. Often there will be other operational definitions—other measures—of the same or similar constructs. Convergent validity is the extent to which scores on the measure in question are related to scores on other measures of the same construct or similar constructs.
Measures of similar constructs should converge—for example, one measure of shyness should correlate highly with another shyness measure or a measure of a similar construct such as social anxiety. In actual research on a shyness scale, the convergent validity of the Shy Q was demonstrated by showing that Shy Q scores were highly correlated (.77) with a scale called the Fear of Negative Evaluation (Bortnik et al., 2002). Because the constructs of shyness and fear of negative evaluation have many similarities (such fear is thought to be a component of shyness), the high correlation is expected and increases our confidence in the construct validity of the Shy Q measure.

Discriminant validity

When the measure is not related to variables with which it should not be related, discriminant validity is demonstrated. The measure should discriminate between the construct being measured and other unrelated constructs. In research on the discriminant validity of their shyness measure, Bortnik et al. (2002) found no relationship between Shy Q scores and several conceptually unrelated interpersonal values, such as valuing forcefulness with others.

REACTIVITY OF MEASURES

A potential problem when measuring behavior is reactivity. A measure is said to be reactive if awareness of being measured changes an individual’s behavior. A reactive measure tells what the person is like when he or she is aware of being observed, but it does not tell how the person would behave under natural circumstances. Simply having various devices such as electrodes and blood pressure cuffs attached to your body may change the physiological responses being recorded. Knowing that a researcher is observing you or recording your behavior on tape might change the way you behave. Measures of behavior vary in terms of their potential reactivity. There are also ways to minimize reactivity, such as allowing time for individuals to become used to the presence of the observer or the recording equipment.
A book by Webb, Campbell, Schwartz, Sechrest, and Grove (1981) has drawn attention to a number of measures that are called nonreactive or unobtrusive. Many such measures involve clever ways of indirectly recording a variable. For example, an unobtrusive measure of preferences for paintings in an art museum is the frequency with which tiles around each painting must be replaced—the most popular paintings are the ones with the most tile wear. Levine (1990) studied the pace of life in cities using indirect measures such as the accuracy of bank clocks and the speed of processing standard requests at post offices.

Some of the measures described by Webb et al. (1981) are simply humorous. For instance, in 1872, Sir Francis Galton studied the efficacy of prayer in producing long life. Galton wondered whether British royalty, who were frequently the recipients of prayers by the populace, lived longer than other people. He checked death records and found that members of royal families actually led shorter lives than other people, such as men of literature and science. The book by Webb and his colleagues is a rich source of such nonreactive measures. More important, it draws attention to the problem of reactivity and sensitizes researchers to the need to reduce reactivity whenever possible. We will return to this issue at several points in this book.

VARIABLES AND MEASUREMENT SCALES

Every variable that is studied must be operationally defined. The operational definition is the specific method used to manipulate or measure the variable (see Chapter 4). There must be at least two values or levels of the variable. In Chapter 4, we mentioned that the values may be quantitatively different or they may reflect categorical differences. In actuality, the world is a bit more complex. The levels can be conceptualized as one of four kinds of measurement scales: nominal, ordinal, interval, and ratio (summarized in Table 5.2).
Nominal Scales

Nominal scales have no numerical or quantitative properties. Instead, categories or groups simply differ from one another (nominal variables are sometimes called “categorical” variables). An obvious example is the variable of gender: a person is classified as either male or female. Being male does not imply a greater amount of “sexness” than being female; the two levels are merely different. This is called a nominal scale because we simply assign names to different categories. Another example is the classification of undergraduates according to major. A psychology major would not be entitled to a higher number than a history major, for instance. Even if you were to assign numbers to the different categories, the numbers would be meaningless, except for identification.

TABLE 5.2 Scales of measurement

In an experiment, the independent variable is often a nominal or categorical variable. For example, Hölzel et al. (2011) studied the effect of meditation on brain structures using magnetic resonance imaging (MRI). They found that, after participants underwent an 8-week mindfulness meditation-based stress reduction program, specific brain areas showed increased gray-matter density compared with those of other participants who did not take part in the program. The independent variable in this case (participating in the program or not) was clearly nominal because the two levels are merely different; participants either did, or did not, participate in the stress reduction program.

Ordinal Scales

Ordinal scales allow us to rank order the levels of the variable being studied. Instead of having categories that are simply different, as in a nominal scale, the categories can be ordered from first to last. Letter grades are a good example of an ordinal scale. Another example is the movie rating system used on a movie review website, in which movies are given one, two, three, or four stars based on these descriptions:

  • Four stars: Great! New or old, a classic

  • Three stars: Good! First rate

  • Two stars: Flawed, but may have some good moments

  • One star: Poor! Desperation time

The rating system is not a nominal scale because the number of stars is meaningful in terms of a continuum of quality. However, the stars allow us only to rank order the movies. A four-star movie is better than a three-star movie; a three-star movie is better than a two-star movie; and so on. Although we have this quantitative information about the movies, we cannot say that the difference between a one-star and a two-star movie is always the same or that it is equal to the difference between a two-star and a three-star movie. No particular value is attached to the intervals between the numbers used in the rating scale.

Interval and Ratio Scales

In an interval scale, the difference between the numbers on the scale is meaningful. Specifically, the intervals between the numbers are equal in size. The difference between 1 and 2 on the scale, for example, is the same as the difference between 2 and 3. Interval scales generally have five or more quantitative levels. A household thermometer (Fahrenheit or Celsius) measures temperature on an interval scale. The difference in temperature between 40° and 50° is equal to the difference between 70° and 80°. However, there is no absolute zero on the scale that would indicate the absence of temperature. The zero on any interval scale is only an arbitrary reference point. (Note that the zero point on the Celsius scale was chosen to reflect the temperature at which water freezes; this is the same as 32° on the Fahrenheit scale. The zero on both scales is arbitrary, and there are even negative numbers on the scale.) Without an absolute zero point on interval scales, we cannot form ratios of the numbers. That is, we cannot say that one number on the scale represents twice as much (or three times as much, and so forth) temperature as another number. You cannot say, for example, that 60° is twice as warm as 30°.
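This limitation of interval scales can be checked directly: convert the Fahrenheit readings to kelvins, a ratio scale with a true zero, and the apparent 2:1 ratio disappears. A small sketch:

```python
# Interval vs. ratio scales: 60 degrees F is not "twice as warm" as 30 degrees F.
def fahrenheit_to_kelvin(f):
    # Kelvin has an absolute zero, so ratios of kelvins are meaningful.
    return (f - 32) * 5 / 9 + 273.15

k60 = fahrenheit_to_kelvin(60)  # about 288.7 K
k30 = fahrenheit_to_kelvin(30)  # about 272.0 K
print(round(k60 / k30, 2))  # about 1.06, nowhere near 2
```

On the kelvin scale, 60°F represents only about 6 percent more temperature than 30°F, which is why ratio statements require a ratio scale.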
An example of an interval scale in the behavioral sciences might be a personality measure of a trait such as extraversion. If the measurement is an interval scale, we cannot make a statement such as "the person who scored 20 is twice as extraverted as the person who scored 10" because there is no absolute zero point that indicates an absence of the trait being measured.

Ratio scales do have an absolute zero point that indicates the absence of the variable being measured. Examples include many physical measures, such as length, weight, or time. With a ratio scale, such statements as "a person who weighs 220 pounds weighs twice as much as a person who weighs 110 pounds" or "participants in the experimental group responded twice as fast as participants in the control group" are possible. Ratio scales are used in the behavioral sciences when variables that involve physical measures are being studied, particularly time measures such as reaction time, rate of responding, and duration of response. However, many variables in the behavioral sciences are less precise and so use nominal, ordinal, or interval scale measures. It should also be noted that the statistical tests for interval and ratio scales are the same.

The Importance of the Measurement Scales

When you read about the operational definitions of variables, you will recognize the levels of the variable in terms of these types of scales. The conclusions one draws about the meaning of a particular score on a variable depend on which type of scale was used. With interval and ratio scales, you can make quantitative distinctions that allow you to talk about amounts of the variable. With nominal scales, there is no quantitative information. To illustrate, suppose you are studying perceptions of physical attractiveness.
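The contrast with the interval case is that ratio-scale statements such as "twice as fast" are unit-independent, because zero means the same thing in every unit: no response time at all. The Python sketch below illustrates this with hypothetical reaction-time means (the numbers are ours, for illustration only).

```python
# Ratio-scale comparisons survive a change of units because the zero point
# is absolute: 0 seconds and 0 milliseconds both mean "no response time".

control_s = 0.80        # hypothetical mean reaction time, control group (s)
experimental_s = 0.40   # hypothetical mean reaction time, experimental group (s)

ratio_in_seconds = control_s / experimental_s                   # 2.0
ratio_in_ms = (control_s * 1000) / (experimental_s * 1000)      # still 2.0

# "Twice as fast" holds no matter which unit we measure in:
print(ratio_in_seconds == ratio_in_ms)
```

Compare this with the temperature example above, where the same kind of ratio fell apart as soon as the unit changed.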
In an experiment, you might show participants pictures of people with different characteristics such as their waist-to-hip ratio (waist size divided by hip size); this variable has been studied extensively by Singh and his colleagues (see Singh, Dixson, Jessop, Morgan, & Dixson, 2010). How should you measure the participants' physical attractiveness judgments? You could use a nominal scale such as:

_____ Not Attractive     _____ Attractive

These scale values allow participants to state whether they find the person attractive or not, but do not allow you to know about the amount of attractiveness. As an alternative, you could use a scale that asks participants to rate the amount of attractiveness:

Very Unattractive _____ _____ _____ _____ _____ _____ _____ Very Attractive

This rating scale provides you with quantitative information about the amount of attractiveness because you can assign numeric values to each of the response options on the scale; in this case, the values would range from 1 to 7. A major finding of Singh's research is that males rate females with a .70 waist-to-hip ratio as most attractive. Singh interprets this finding in terms of evolutionary theory: this ratio presumably is a signal of reproductive capacity.

The scale that is used also determines the types of statistics that are appropriate when the results of a study are analyzed. For now, we do not need to worry about statistical analysis; we will return to this point in Chapter 12. We are now ready to consider methods for measuring behavior. A variety of observational methods are described in Chapter 6. We will then focus on questionnaires and interviews in Chapter 7.

ILLUSTRATIVE ARTICLE: MEASUREMENT CONCEPTS

Every term, millions of students complete course evaluations in an effort to assess the quality and performance of their instructors. This specific measurement instrument can vary from campus to campus, but the overall goal is the same.
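The difference between the two response formats shows up in what you can legitimately compute. With nominal responses, counting category frequencies is the most you can do; with the 1–7 rating scale, an average amount of attractiveness is meaningful. The Python sketch below uses made-up responses from ten hypothetical participants (these are not Singh's data).

```python
# Hypothetical responses from ten participants (illustrative only).

# Nominal coding: labels with no quantitative meaning -- you can count
# how many fall in each category, but not average them.
nominal = ["attractive", "not attractive", "attractive", "attractive",
           "not attractive", "attractive", "attractive", "attractive",
           "not attractive", "attractive"]
counts = {label: nominal.count(label) for label in set(nominal)}

# Interval-style coding on the 7-point rating scale: numeric values 1-7
# carry information about amount, so a mean is meaningful.
ratings = [5, 2, 6, 5, 3, 7, 6, 5, 2, 6]
mean_rating = sum(ratings) / len(ratings)

print(counts)        # frequencies per category
print(mean_rating)   # 4.7
```

Averaging the nominal labels would be nonsense, which is exactly the point: the scale of measurement constrains the statistics you may use, a topic the text returns to in Chapter 12.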
Course evaluations are used to inform hiring decisions, promotion decisions, and classroom instruction decisions, and they are also used by individual instructors to improve the courses that they teach. Brown (2008) was interested in student perceptions of course evaluations. He collected data from 80 undergraduates enrolled in an undergraduate research methods course and examined their perceptions of student evaluations of teaching, of mid-semester evaluations, and of the effectiveness of completing mid-semester evaluations. He found, among other things, that although participants believed that students are honest in their evaluations and that the evaluations are important in hiring decisions, they were less sure that instructors took the evaluations seriously, and they also tended to believe that students evaluate courses based on the grade that they get, or to "get back" at instructors.

For this exercise, acquire and read the following article:

Brown, M. (2008). Student perceptions of teaching evaluations. Journal of Instructional Psychology, 35(2), 177–181. Retrieved from http://www.freepatentsonline.com/article/Journal-Instructional-Psychology/181365765.html

After reading the article, consider the following:

1. Brown (2008) did not report any reliability data for his measures. How would you suggest that he go about assessing the reliability of his measures?
2. In the context of evaluating college teaching, how would you describe the construct validity of course evaluation measures generally (or of the specific tool that is used on your campus)? That is, how well do student evaluations truly assess the construct of course quality? Specifically, how would you assess the content, predictive, concurrent, convergent, and discriminant validity of student course evaluation measures?
3. Brown did not report any validity information for his measures of participant perceptions. Assess the face validity of his measures.
4. Do you think that Brown's measures are reactive? How so? Likewise, do you think that course evaluations are reactive? How so?
5. Describe the level of measurement used in Brown's study. Generate two alternative strategies for measurement that would occur at different levels.

Study Terms

Alternate forms reliability (p. 102)
Concurrent validity (p. 108)
Construct validity (p. 105)
Content validity (p. 107)
Convergent validity (p. 108)
Cronbach's alpha (p. 103)
Discriminant validity (p. 109)
Face validity (p. 107)
Internal consistency reliability (p. 103)
Interrater reliability (p. 104)
Interval scale (p. 111)
Item-total correlation (p. 104)
Measurement error (p. 100)
Nominal scale (p. 110)
Ordinal scale (p. 111)
Pearson product-moment correlation coefficient (p. 102)
Predictive validity (p. 108)
Ratio scale (p. 112)
Reactivity (p. 109)
Reliability (p. 100)
Split-half reliability (p. 103)
Test-retest reliability (p. 102)
True score (p. 100)

Review Questions

1. What is meant by the reliability of a measure? Distinguish between true score and measurement error.
2. Describe the methods of determining the reliability of a measure.
3. Discuss the concept of construct validity. Distinguish among the indicators of construct validity.
4. Why isn't face validity sufficient to establish the validity of a measure?
5. What is a reactive measure?
6. Distinguish between nominal, ordinal, interval, and ratio scales.

Activities

1. Conduct a PsycINFO search to find information on the construct validity of a psychological measure. Specify construct validity as a search term along with terms such as aptitude test, personality test, intelligence test, and so on. You can also specify particular psychological constructs such as depression, self-esteem, or extraversion. Read about a measure that interests you and describe the reliability and validity research reported.
2. Take a personality test on the Internet (you can find such tests using Internet search engines). Based on the information provided, what can you conclude about the test's reliability, construct validity, and reactivity?
3. For each of the following, identify whether a nominal, ordinal, interval, or ratio scale is being used:
a. The temperatures in cities throughout the country that are listed in most newspapers.
b. The birth weights of babies who were born at Wilshire General Hospital last week.
c. The number of hours you spent studying each day during the past week.
d. The amount of the tips left after each meal at a restaurant during a 3-hour period.
e. The number of votes received by the Republican and Democratic candidates for Congress in your district in the last election.
f. The cell phone listed as third best on an online electronics review website.
g. Connecticut's listing as the number one team in a poll of sportswriters, with Kansas listed number two.
h. Your friend's score on an intelligence test.
i. Yellow walls in your office and white walls in your boss's office.
j. The type of programming on each radio station in your city (e.g., KPSY plays jazz, KSOC is talk radio).
k. Ethnic group categories of people in a neighborhood.
4. Think of an important characteristic that you would look for in a potential romantic partner, such as humorous, intelligent, attractive, hardworking, religious, and so on. How might you measure that characteristic? Describe two methods that you might use to assess construct validity.
