Approaches to Handling Common Response Styles and Issues in Educational Surveys
Summary and Keywords
Surveys have been a widely used data collection method for a variety of purposes in educational research. Although response styles have the potential to contaminate survey results, educational researchers often do little to control for such negative effects. Under discussion are five common response issues, their impact on survey data, and the methods that may be used to minimize the negative impact of these response issues on survey data. The five response issues in question are acquiescence (including disacquiescence), careless responding, extreme response, social desirability, and item-keying effect. Acquiescence (disacquiescence) refers to a respondent’s general tendency to agree (or disagree) with an item regardless of its content. This response style can distort item and construct correlations, compromising factor analytic and correlational findings. Careless responding refers to a respondent’s tendency to pay insufficient attention to item content before responding, which can also lead to a biased estimation of relationships. Extreme response refers to the tendency to select extreme response options (e.g., strongly agree or strongly disagree) over middle options (e.g., neutral). Social desirability refers to a respondent’s tendency to rate him- or herself in an overly positive light. Finally, item-keying effect refers to a respondent’s differential responses to regular-keyed and reverse-keyed items. This effect often creates the illusion that items with opposite keying directions measure distinct constructs even when they may not.
A growing amount of research has been done on how to control for the negative impact of these response styles, although the research may be limited and uneven for different response issues. A variety of approaches and methods exist for handling these response issues in research practice. Different response issues may require considerations at different stages of research. For example, effective handling of acquiescence response may require steps in both survey construction (e.g., including a hidden measure of acquiescence) and survey data analytic treatment (partial correlation technique), while controlling for item-keying effect may require more sophisticated modeling techniques (e.g., multitrait-multimethod confirmatory factor analysis).
Approaches to Handling Common Response Styles and Issues in Educational Surveys
Response styles—respondents’ tendency to answer survey items in a systematic manner—are prevalent in educational research. They have the potential to contaminate the survey data, such as affecting the factor loading pattern of survey items (Rammstedt & Farmer, 2013), the means of comparison groups (Bolt, Lu, & Kim, 2014), and the magnitude of correlations between constructs (Kam & Meyer, 2015). Therefore, the negative impact of response styles cannot be ignored. Although response styles have been known to distort survey results, few educational researchers actually measure and control for them. This is unfortunate, because response styles could lead to incorrect research findings.
One reason for the lack of attention to response styles may be insufficient knowledge of them, of their consequences, and of the approaches for dealing with them. Therefore, the purpose of this chapter is to explain the common response styles. We focus on five major response styles: acquiescence, careless responding, extreme response, social desirability, and item-keying effect. For each response style, we explain what it is, how it can bias research results, and how it can be measured and controlled. Next, we summarize the common procedures to deal with the response styles and briefly introduce novel statistical techniques to control for multiple response styles simultaneously. Finally, we discuss unresolved issues in the response style literature.
Acquiescence
Acquiescence (disacquiescence) may be defined as respondents’ tendency to agree (disagree) with an item regardless of item content (Bentler, Jackson, & Messick, 1971). Research on acquiescence has a long history (Lorge, 1937). Jackson and Messick (Jackson & Messick, 1962; Messick & Jackson, 1961) studied acquiescence together with social desirability response style, and found that the two response styles together explained over one half of the variance in the measurement of a clinical scale. These researchers concluded that the effects of the response styles are massive. Response styles such as acquiescence thus need to be taken seriously.
Influence of Acquiescence
Because participants high in acquiescence may agree with both regular-keyed and reverse-keyed items, the negative correlations between items with opposite keying directions are attenuated. As a result, a unidimensional construct may appear bidimensional. For example, previous research suggested that job satisfaction and job dissatisfaction are distinct constructs (i.e., bidimensionality; Credé, Chernyshenko, Bagraim, & Sully, 2009), because the correlation between the two factors (satisfaction and dissatisfaction) was far from −1. However, after controlling for acquiescence, Kam and Meyer (2015) found job satisfaction and dissatisfaction items to be perfectly and negatively correlated (i.e., r = −1; they belong to opposite ends of the same construct). Similarly, as shown in some studies, acquiescence can mask the five-factor structure of Big Five personality traits; only after the influence of acquiescence was statistically partialled out of the data was the Big Five structure revealed (Rammstedt & Farmer, 2013; Rammstedt, Goldberg, & Borg, 2010). In general, as shown by Kam and Meyer (2015), acquiescence can bias construct correlations in the positive direction, thus inflating positive correlations between constructs (e.g., between job satisfaction and positive affect) and deflating negative correlations between constructs (e.g., between job satisfaction and negative affect).
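The attenuation described above can be illustrated with a small simulation. This is a minimal sketch and not the procedure used in the cited studies: it assumes continuous item scores, one standardized trait, and an additive person-level agreement tendency; the names `simulate`, `acq_sd`, and all parameter values are hypothetical.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate(n, acq_sd, seed=0):
    """Correlate one regular-keyed and one reverse-keyed item.

    Assumption: each response is trait (or minus trait) plus a shared
    agreement tendency plus independent noise.
    """
    rng = random.Random(seed)
    regular, reverse = [], []
    for _ in range(n):
        trait = rng.gauss(0, 1)
        acq = rng.gauss(0, acq_sd)  # person-level tendency to agree
        regular.append(trait + acq + rng.gauss(0, 0.5))
        reverse.append(-trait + acq + rng.gauss(0, 0.5))
    return pearson(regular, reverse)

r_clean = simulate(5000, acq_sd=0.0)  # no acquiescence
r_acq = simulate(5000, acq_sd=0.7)    # acquiescence present
print(round(r_clean, 2), round(r_acq, 2))
```

Under these assumptions the correlation between the two items stays negative but moves toward zero once the shared agreement tendency is added, which is the pattern that can make one construct look like two.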
The stability of acquiescence has also been investigated. In early research, acquiescence was considered a situational (unstable) phenomenon (Hui & Triandis, 1985; Rorer, 1965). More recently, researchers have consistently found it to be stable over time. Billiet and Davidov (2008) found acquiescence scores to be moderately correlated (r = .59) over a four-year period. Weijters, Geuens, and Schillewaert (2010) found that acquiescence scores loaded on the same latent factor over a one-year period, implying strong stability over time. These results thus challenge the previous assumption that the response style is unstable. Therefore, acquiescence response style influences item responses across time; administering survey items to the same participants on two different occasions will likely not eliminate the problem.
Measuring and Minimizing Acquiescence
Researchers use three major methods to measure acquiescence. In the first, item scores across the entire survey are summed. This method is based on the assumption that the survey contains items measuring constructs with heterogeneous content. For example, Schimmack, Oishi, and Diener (2005) summed the scores of two cultural orientations (individualism and collectivism) to derive an acquiescence score; this score was then used to control for the correlation between the cultural orientations and other constructs. Similarly, Billiet and McClendon (2000) summed the scores of political distrust, threat, individualism, and collectivism to derive an index of acquiescence.
The second way to measure acquiescence is to sum scores from pairs of items that are opposite in meaning. For example, Rammstedt and Farmer (2013) summed 16 matched pairs of items with antithetical content (e.g., being talkative and being quiet) to capture acquiescence before conducting exploratory factor analysis on a personality survey. The predicted five-factor structure of the inventory was found, but only after controlling for acquiescence. In addition, congruence in factor loadings across cultures was found only after controlling for acquiescence. Winkler, Kanouse, and Ware (1982) also computed acquiescence by using matched pairs of logically opposite items; they found that inter-item correlations were affected by acquiescence. Winkler et al. thus recommended controlling for acquiescence before conducting factor analysis.
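The matched-pairs idea can be sketched as follows. This is one plausible operationalization, not necessarily the scoring used by Rammstedt and Farmer (2013) or Winkler et al. (1982); it assumes a 5-point scale with midpoint 3, and the function name and example data are hypothetical.

```python
def paired_acquiescence(row, pairs, midpoint=3):
    """Mean summed deviation from the scale midpoint across antonym pairs.

    Responses are raw (not reverse-scored). A respondent who agrees with one
    item and disagrees equally with its opposite scores near 0; agreeing with
    both gives a positive score, disagreeing with both a negative score
    (disacquiescence).
    """
    sums = [(row[a] - midpoint) + (row[b] - midpoint) for a, b in pairs]
    return sum(sums) / len(sums)

# Hypothetical antonym pairs and two hypothetical respondents.
pairs = [("talkative", "quiet"), ("careful", "careless")]
consistent = {"talkative": 5, "quiet": 1, "careful": 4, "careless": 2}
acquiescent = {"talkative": 5, "quiet": 4, "careful": 5, "careless": 4}

print(paired_acquiescence(consistent, pairs))
print(paired_acquiescence(acquiescent, pairs))
```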
In the third method, researchers sum items that are heterogeneous in content (Baumgartner & Steenkamp, 2001). This method differs from the first in that it does not use all the items in a survey. Based on the definition of acquiescence as endorsement regardless of content, these researchers suggest including only items with diverse content. With this set of items, a researcher can ensure the generalizability of the acquiescence scores across different content areas. De Beuckelaer, Weijters, & Rutten (2010; see also Weijters, Baumgartner, & Schillewaert, 2013) affirmed the utility of this operationalization relative to the method of summing scores across an entire survey. They recommended that researchers should make sure that the inter-item correlations are low so that the items are not capturing any substantive construct. De Beuckelaer et al. (2010) suggested using 15 items to compute a “valid and reliable” response style indicator (p. 766). Recent studies (Kam & Meyer, 2015) followed a similar method to create a measure of acquiescence.
Kam (2016a) endorsed the third method on the grounds that the approach, using items with heterogeneous content, makes it likely that the acquiescence score does not measure constructs with substantive content. However, he noted two cautions when researchers use this method. First, researchers should use items that are balanced with respect to positive and negative valence. Measurement items are seldom neutral in content. Some have favorable meanings (i.e., positive valence, such as “I like my friends”) and others have unfavorable meanings (i.e., negative valence, such as “I seldom donate to charity”). Therefore, if researchers select items randomly, they may end up with a set predominantly positive or negative in valence. In other words, the researcher may measure participants’ sensitivity toward positive valence items or negative valence items rather than acquiescence. Second, Kam (2016a) found that an acquiescence score made up of 15 items had inadequate reliability and validity. Convergent validity was only .40 with 16 items, but increased to .50 with 32 items and to .62 with 64 items. For the score to possess sufficient validity, 15 items are too few.
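A heterogeneous-item acquiescence index, together with the two checks discussed above (low inter-item correlations and valence balance), might be computed along these lines. This is a sketch under assumptions: responses are raw 1 to 5 ratings, the index is the mean rather than the sum (a rescaling only), and the tiny data set, valence codes, and function names are hypothetical. A real index would need far more items, as noted above.

```python
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def acquiescence_index(responses):
    """Mean raw (unreversed) response per respondent across the item set."""
    return [sum(row) / len(row) for row in responses]

def mean_inter_item_r(responses):
    """Average correlation over all item pairs; should be low if the
    items do not share substantive content."""
    items = list(zip(*responses))  # transpose: one tuple per item
    rs = [pearson(a, b) for a, b in combinations(items, 2)]
    return sum(rs) / len(rs)

def is_valence_balanced(valences):
    """True when positive (+1) and negative (-1) items are equal in number."""
    return sum(valences) == 0

# Hypothetical data: 6 respondents by 4 content-unrelated items.
responses = [
    [5, 4, 5, 4],
    [2, 3, 1, 4],
    [3, 3, 4, 2],
    [4, 2, 3, 5],
    [1, 4, 2, 3],
    [3, 5, 3, 1],
]
valences = [+1, -1, +1, -1]  # hypothetical valence coding per item

print(acquiescence_index(responses))
print(round(mean_inter_item_r(responses), 2))
print(is_valence_balanced(valences))
```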
Although acquiescent response style may be difficult to eliminate, its effects can be partialled out statistically. Researchers often measure acquiescence and then statistically partial out its effect at the construct level when examining the correlation between variables (e.g., De Beuckelaer et al., 2010; Schimmack et al., 2005). Other researchers (Kam & Meyer, 2015) have used structural equation modeling techniques to partial out the effect of acquiescence at the item level before examining its effect on construct correlations. Either way, acquiescence response style is relatively easy to control for compared to some styles examined later in this article.
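Construct-level partialling can be illustrated with the standard first-order partial correlation formula. The sketch below is illustrative only: it assumes two substantively unrelated construct scores that share an additive acquiescence component, and it assumes the acquiescence score is measured without error, which real indices are not. All names and values are hypothetical.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(x, y, z):
    """Correlation between x and y with z partialled out of both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

rng = random.Random(1)
acq, x, y = [], [], []
for _ in range(5000):
    a = rng.gauss(0, 0.7)          # acquiescence score (assumed error-free)
    acq.append(a)
    x.append(rng.gauss(0, 1) + a)  # construct X score, contaminated
    y.append(rng.gauss(0, 1) + a)  # construct Y score, contaminated

raw_r = pearson(x, y)
adj_r = partial_corr(x, y, acq)
print(round(raw_r, 2), round(adj_r, 2))
```

In this setup the raw correlation is positive only because both scores carry the same agreement tendency, and the partial correlation returns to roughly zero.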
Careless Responding
Careless responding is not so much a response style as it is the result of participants being distracted during survey completion (Barnette, 1999; Huang, Curran, Keeney, Poposki, & DeShon, 2012; Kam & Meyer, 2015; Maniaci & Rogge, 2014; Schmitt & Stults, 1985; Woods, 2006). Participants may not have enough cognitive resources or may be unmotivated when responding to a survey item (Weijters et al., 2013). Therefore, they do not fully process the item content before responding (Meade & Craig, 2012).
Influence of Careless Responding
When participants do not pay sufficient attention to an item, their responses are not likely to be valid. Oppenheimer, Meyvis, and Davidenko (2009) showed that statistical results are compromised when a data sample includes a substantial number of careless respondents. When factor analysis is conducted, inclusion of careless respondents may obscure the true factor structure of the data (Schmitt & Stults, 1985). In a simulation study, Schmitt and Stults (1985) showed how a unidimensional construct may appear bidimensional in a dataset with 10% or more careless respondents.
Although researchers may assume that careless responding attenuates the magnitude of a correlation, at least two recent studies show that careless responding can also inflate construct correlations. Huang, Liu, and Bowling (2015) showed that careless responding and construct means interact to affect the magnitude of a correlation. They argued that the construct means of careless respondents tend to drift to the midpoint of a Likert scale (because they randomly choose a response), whereas the construct means of careful respondents do not cluster at the midpoint. Their simulation results showed that, if the scale means of careful respondents differ from the scale means of careless respondents, the observed correlations between constructs can be inflated as a consequence.
Similarly, Kam and Meyer (2015) showed that careless respondents can inflate construct correlations for another reason. Following Meade and Craig (2012) and Maniaci and Rogge (2014), Kam and Meyer (2015) discovered two types of careless respondents: the first type gives random responses to each item; the second type gives identical answers to consecutive items. Analyzing real data, Kam and Meyer (2015) discovered that the latter type can inflate construct correlations, because construct means may become identical given such response patterns.
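The mean-shift mechanism described by Huang, Liu, and Bowling (2015) can be illustrated with a toy simulation. This is not a reproduction of their study: it assumes two uncorrelated constructs, careful respondents whose scale scores centre on 4.2 on a 1 to 5 scale, and 10% careless respondents who answer 10 items per scale uniformly at random. All constants and names are hypothetical.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)
N_CAREFUL, N_CARELESS, N_ITEMS = 1800, 200, 10

def careful_scale_score():
    # Scale mean well above the midpoint, clamped to the 1-5 range.
    return min(5.0, max(1.0, rng.gauss(4.2, 0.4)))

def careless_scale_score():
    # Average of random item responses drifts toward the midpoint (3).
    return sum(rng.randint(1, 5) for _ in range(N_ITEMS)) / N_ITEMS

# Each respondent has two scale scores generated independently,
# so the true correlation within each group is zero.
careful = [(careful_scale_score(), careful_scale_score())
           for _ in range(N_CAREFUL)]
careless = [(careless_scale_score(), careless_scale_score())
            for _ in range(N_CARELESS)]

def corr(pairs):
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

r_careful = corr(careful)
r_mixed = corr(careful + careless)
print(round(r_careful, 2), round(r_mixed, 2))
```

Because the careless group sits lower on both scales at once, pooling the two groups produces a positive correlation even though neither group shows one on its own.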
Measuring and Minimizing Careless Responding
Researchers have proposed a priori and post-hoc methods to detect careless respondents. One a priori method is to include items that have a clear answer (e.g., “I am currently not working on a survey” [correct answer: Disagree or Strongly Disagree] and “I am currently answering survey questions” [correct answer: Agree or Strongly Agree]). Another a priori method is to include synonyms and antonyms (such as agreeing that one is “careful” in one item and “careless” in another). If participants give unlikely answers to items with clear answers, or inconsistent answers to synonyms and antonyms, it implies that they are careless respondents not paying attention to content.
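The two a priori checks above could be scored roughly as follows. This is a minimal sketch assuming a 5-point agreement scale (1 = Strongly Disagree, 5 = Strongly Agree); the item keys, the cutoffs for "agree" and "disagree", and the function name are all hypothetical.

```python
# Items with a clear answer: key -> set of acceptable responses.
CLEAR_ANSWER_ITEMS = {
    "not_working_on_survey": {1, 2},       # should disagree
    "answering_survey_questions": {4, 5},  # should agree
}
# Antonym pairs: agreeing (or disagreeing) with both is inconsistent.
ANTONYM_PAIRS = [("careful", "careless")]

def count_flags(row):
    """Number of failed clear-answer items plus inconsistent antonym pairs."""
    flags = sum(1 for item, ok in CLEAR_ANSWER_ITEMS.items()
                if row[item] not in ok)
    for a, b in ANTONYM_PAIRS:
        both_agree = row[a] >= 4 and row[b] >= 4
        both_disagree = row[a] <= 2 and row[b] <= 2
        flags += int(both_agree or both_disagree)
    return flags

attentive = {"not_working_on_survey": 1, "answering_survey_questions": 5,
             "careful": 4, "careless": 2}
inattentive = {"not_working_on_survey": 4, "answering_survey_questions": 4,
               "careful": 4, "careless": 4}

print(count_flags(attentive), count_flags(inattentive))
```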
Another method of identifying careless respondents was reported by Oppenheimer et al. (2009). They explicitly instructed participants to select (or not) a certain response in a survey—those who failed to follow the instruction were considered careless. Kam and Meyer (2015) revised their method to develop a four-item scale of careless responding, which had good convergent validity with other indicators (of careless responding). In fact, the four-item scale alone is as effective as a constellation of indicators in identifying careless respondents.
If researchers did not plan on including dedicated careless responding indicators, they may still use post-hoc measures. Meade and Craig (2012) investigated the efficacy of a comprehensive set of post-hoc measures, and found the best indicator to be Mahalanobis distance. Mahalanobis distance—originally designed to be a measure of statistical distance in multivariate space, and often used to detect multivariate outliers—can be used to detect abnormal response patterns. Kam and Meyer (2015) also found that Mahalanobis distance is particularly good at detecting participants who give random responses. To identify respondents giving identical rather than random responses to consecutive survey items, Kam and Meyer (2015) found that long-string (maximum number of identical responses in a survey) and repeated responses (number of times a response is identical to the two previous responses) did a better job than Mahalanobis distance. Interested readers may refer to Meade and Craig (2012) for the efficiency of other post-hoc indicators in identifying careless respondents.
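The three post-hoc indicators named above can be sketched in plain Python. The definitions of long-string and repeated responses follow the parenthetical descriptions in the text; the Mahalanobis computation uses the sample mean vector and covariance matrix. The helper names and the simulated data (continuous scores, one inserted content-inconsistent respondent) are hypothetical.

```python
import random

def long_string(row):
    """Maximum number of identical consecutive responses."""
    best = run = 1
    for prev, cur in zip(row, row[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def repeated_responses(row):
    """Number of responses identical to both of the two previous responses."""
    return sum(1 for i in range(2, len(row))
               if row[i] == row[i - 1] == row[i - 2])

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis_sq(data):
    """Squared Mahalanobis distance of each row from the sample centroid."""
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    cov = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)
            / (n - 1) for j in range(k)] for i in range(k)]
    out = []
    for row in data:
        d = [row[j] - means[j] for j in range(k)]
        out.append(sum(di * zi for di, zi in zip(d, solve(cov, d))))
    return out

# Hypothetical data: 50 respondents answering 4 items driven by one trait,
# plus one respondent whose pattern contradicts the item intercorrelations.
rng = random.Random(3)
data = []
for _ in range(50):
    trait = rng.gauss(0, 1)
    data.append([trait + rng.gauss(0, 0.5) for _ in range(4)])
data.append([3.0, -3.0, 3.0, -3.0])

d2 = mahalanobis_sq(data)
print(d2.index(max(d2)))  # index of the most atypical respondent
print(long_string([3, 3, 3, 4, 4, 1]), repeated_responses([3, 3, 3, 3, 4, 4]))
```

Where to set a cutoff on any of these indicators remains a judgment call, which is the difficulty with post-hoc measures raised in the next paragraph.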
After identifying careless respondents, previous researchers have chosen to statistically control (Huang et al., 2015) or exclude them (Kam & Meyer, 2015; Oppenheimer et al., 2009). If a priori measures of careless respondents are used, the best practice is to exclude them (Kam & Meyer, 2015), because researchers could identify them with high certainty. However, if post-hoc measures are used, it is difficult to determine the cutoff score between careful and careless respondents. Therefore, researchers may be better advised to use statistical control (usually partial correlation; Huang et al., 2015) with the post-hoc measures to recover parameter estimates.
Extreme Response Style
Extreme response style refers to participants’ tendency to select extreme response options (e.g., Strongly Agree or Strongly Disagree), rather than moderate responses (e.g., Agree or Neutral). Whereas acquiescent (and disacquiescent) participants prefer responses in a particular direction (Strongly Agree only), people with extreme response style select responses in both directions—but they prefer extreme answers. The opposite of extreme response style is often considered to be midpoint response style, which refers to respondents’ tendency to select middle responses (Neutral) as answers (Baumgartner & Steenkamp, 2001). Extreme response style has been shown to correlate negatively with midpoint response style (Weijters et al., 2010).
Influence of Extreme Response Style
Extreme and midpoint response styles are seldom investigated. Previous research has investigated the extreme response style in a cross-cultural context, in which one culture shows stronger extreme response style than another culture. Countries with high power distance and masculinity have been found to endorse extreme response style (Johnson, Kulesa, Cho, & Shavitt, 2005). Johnson et al. (2005) have also shown that the extreme response style elevates item intercepts. In contrast, the midpoint response style depresses them. When two cultures differ in the extreme response style (or the midpoint response style), the difference often results in measurement non-invariance between the two cultures. In any event, cross-cultural comparison can be seriously compromised by the response style.
Measuring and Minimizing Extreme Response Style
The most common method to measure extreme responding is to code respondents’ extreme responses (Strongly Agree and Strongly Disagree) as 1 and other response options as 0. Respondents with the highest coded scores are strong in extreme response style. For midpoint response style, participants’ responses are coded as 1 when they choose the Neutral option and 0 otherwise. Some researchers have used partial correlations to control for the effect of extreme response style on construct correlations (Weijters, Schillewaert, & Geuens, 2008), although the efficiency of this method to recover true population parameters is still unknown. Other researchers have used advanced statistical techniques such as latent class factor analysis to control for extreme response style (Morren, Gelissen, & Vermunt, 2012).
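The 0/1 coding described above translates directly into code. The sketch below assumes a 5-point scale (1 and 5 extreme, 3 the midpoint) and reports the proportion of coded responses rather than the raw count; the function names are hypothetical.

```python
def ers_index(row, low=1, high=5):
    """Proportion of responses at either endpoint of the scale."""
    return sum(1 for r in row if r in (low, high)) / len(row)

def mrs_index(row, midpoint=3):
    """Proportion of responses at the scale midpoint."""
    return sum(1 for r in row if r == midpoint) / len(row)

row = [1, 5, 3, 4, 5]   # hypothetical respondent
print(ers_index(row))   # three of five responses are extreme
print(mrs_index(row))   # one of five responses is the midpoint
```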
Item-Keying Effect
Item-keying effect is not a response style, but an issue related to item construction. The item-keying effect often has an operational definition rather than a semantic definition. It is usually assumed that responses to regular-keyed items (measuring the presence of a construct; e.g., “I have high self-esteem”) are strictly negatively correlated (r = −1) with responses to reverse-keyed items (measuring the absence of a construct; e.g., “I have low self-esteem”). Therefore, participants who strongly agree with the former item should logically strongly disagree with the latter. When this happens, regular- and reverse-keyed items will load on a single common factor in factor analysis.
Very often, however, the logical expectation above turns out to be incorrect. Instead, confirmatory factor analysis often identifies, in addition to a common factor for regular- and reverse-keyed items, a factor coming solely from reverse-keyed items. The common factor is often called the “trait” factor; the additional factor is often called a “method” factor. In the language of multitrait-multimethod analysis (at least within the framework of Eid, 2000), the trait factor captures common variance in regular- and reverse-keyed items, whereas the method factor captures unique variance shared only by reverse-keyed items but not by regular-keyed items. (There are multitrait-multimethod frameworks other than Eid’s, but the basic interpretation of the method factor is similar.)
Researchers disagree about the nature of the item-keying effect. Some believe it is simply a methodological artifact (DiStefano & Motl, 2009a, 2009b; Rauch, Schweizer, & Moosbrugger, 2007) represented by constructs such as social desirability response style (or a related characteristic such as self-enhancement). Others argue that it represents traits rather than “method” (Lindwall, Ljung, Hadžibajramović, & Jonsdottir, 2012; Marshall, Wortman, Kusulas, Hervig, & Vickers, 1992). The latter group, therefore, believes that the so-called method factor is actually a misnomer, as it represents something more substantial. According to these researchers, regular- and reverse-keyed items measure two distinct factors, not one factor.
The debate has been ongoing for several decades, with no resolution in sight. Often, one group of researchers finds a method factor to be correlated with a response style (such as social desirability) in one scale (Rauch et al., 2007), but other researchers fail to find the same result in a different scale (DiStefano & Motl, 2009b). A recent study (Weijters et al., 2013) revealed that the item-keying factor is correlated with acquiescence response style, but the amount of variance explained by acquiescence was extremely small—less than 10% of the variance in the method factor. In short, we still do not fully understand the source of the item-keying effect.
Influence of Item-Keying Effect
When the method effect is not properly modeled, the trait factor may measure variances due to both traits and methods (Cole, Martin, & Steiger, 2005). Therefore, its relationships with external variables are likely to be biased (Castro-Schilo, Widaman, & Grimm, 2013; Cole et al., 2005). Castro-Schilo et al. (2013) investigated the consequences of neglecting to model the method factor. They found that such neglect can cause noticeable bias in the trait factor’s regression coefficients with external constructs. The percentage of bias (operationalized as the percentage of increase or decrease from the true parameter estimates in the population) reached over 50% in many cases. Therefore, neglecting the structure of the method effect has the potential to severely distort the research conclusion of a study.
Measuring and Minimizing Item-Keying Effect
Researchers have developed multiple methods to model the method factor (Jöreskog, 1971; Kenny, 1976). Early researchers suggested one trait factor that is common to all items and two method factors that are specific to regular- and reverse-keyed items, respectively (Widaman, 1985). Theoretically, the trait factor represents a pure trait effect, because the two method factors capture variance due to the use of regular- and reverse-keyed items. The method factors do not correlate with the trait factor for identification purposes, and the two method factors may or may not be correlated with each other. This model is advocated by many methodologists (e.g., Lance, Noble, & Scullen, 2002) as it is most faithful to the original theorization of the multitrait-multimethod model of Campbell and Fiske (1959).
More recently, however, researchers have pointed out that the model suffers from both identification problems and collapse of the method factor, such as unreasonably low factor loadings on the method factor (Geiser, Bishop, & Lockhart, 2015; Gu, Wen, & Fan, 2015). Researchers have therefore developed models to overcome its problems. One solution—pioneered by Kenny (1976, 1979; Kenny & Berman, 1980) and further developed by Marsh (1989)—is to allow the item residual variances of the same keying direction to covary with each other instead of modeling any method factor. Kenny’s method has the disadvantage of not measuring the method factor, and thus researchers could not use this method to investigate the nature of the method effect. The second solution—due to Eid (2000), as mentioned previously—is to allow only one method factor (usually on reverse-keyed items) rather than two, and thus the method factor captures unique variance not shared by the regular-keyed items. The method factor is constrained to be orthogonal to the trait factor so that trait variance is unrelated to method variance. A common characteristic of all these methods is to assume that the item-keying effect is unwanted variance independent of the trait effect. Another novel line of research conceptualizes the item-keying effect as the latent mean difference between regular- and reverse-keyed items. Interested readers may refer to the sources for more information (Pohl & Steyer, 2010).
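The single-method-factor specification attributed to Eid (2000) above can be written compactly for one trait. This is a sketch under assumed notation (none of these symbols appear in the source): $T$ is the trait factor, $M$ is the method factor defined only on reverse-keyed items, $\alpha$ are intercepts, $\lambda$ and $\gamma$ are loadings, and $\varepsilon$ are residuals.

```latex
\begin{align}
Y_i &= \alpha_i + \lambda_i T + \varepsilon_i
  && \text{(regular-keyed item } i\text{)} \\
Y_j &= \alpha_j + \lambda_j T + \gamma_j M + \varepsilon_j
  && \text{(reverse-keyed item } j\text{)} \\
\operatorname{Cov}(T, M) &= 0
  && \text{(method factor orthogonal to trait factor)}
\end{align}
```

If reverse-keyed items are not recoded before analysis, the trait loadings $\lambda_j$ would be expected to be negative. The two-method-factor model discussed earlier would add a second method term to the first equation as well.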
Given that we are still unclear about both its nature and its antecedents, it is difficult to minimize the item-keying effect. However, the effect is prevalent across measurement scales that have reverse-keyed items, including personality scales that are theoretically unidimensional (Kam & Meyer, 2015). Kam (2016b, 2017) showed that the nature of the item-keying effect differs across measurement scales, and thus a finding from one scale does not generalize to another. Kam correlated the item-keying method factor (extracted from reverse-keyed items) across a variety of measurement scales, and found that these method factors were not strongly correlated. In addition, some of the method factors correlated well with social desirability response style while others did not, suggesting that these factors are dissimilar in nature. Kam’s results help to explain why the previous findings on the nature of the method effect largely failed to generalize. In addition, Kam’s findings implied that causes of the method effect in one particular scale (e.g., optimism) may not always apply to another scale (e.g., self-esteem). Therefore, he suggested investigating the scale-specific nature of the item-keying effect.
Finally, because we still know relatively little about the item-keying effect, we are unable to eliminate its influence on data. The best course of action is to minimize its influence by explicitly modeling the effect using multitrait-multimethod techniques (e.g., Eid, 2000).
Social Desirability Response Style
Social desirability response style refers to respondents’ tendency to present themselves in an overly positive manner (Paulhus, 1991). Another definition states it is a stylistic tendency to answer survey items in a “culturally approved” manner as opposed to “honest self-evaluation” (Wiggins, 1973, p. 36). The latter definition thus considers the role of cultural values in influencing participants’ choice of response.
Influence of Social Desirability Response Style
Social desirability is likely to distort both factor analytic results and construct correlations. Items can be correlated more strongly because they are socially desirable. Previous researchers have shown that Big Five personality traits, which are theoretically orthogonal to one another, can all load on one higher-order factor due to social desirability response style (Bäckström, Björklund, & Larsson, 2014). Bäckström, Björklund, and Larsson (2009) showed that the inter-correlations among the five factors are substantially attenuated when items are reworded to be more neutral in meaning, and thus less susceptible to social desirability. Similarly, constructs may be correlated with each other because their measurement items are loaded with social desirability (Paunonen & LeBel, 2012). Therefore, positive correlations may be inflated when both constructs are measured by socially desirable items. Conversely, negative correlations, such as the relationship between job satisfaction and negative affect, may become stronger when participants endorse socially desirable items in the job satisfaction measure and reject socially undesirable items in the negative affect measure.
Measuring and Minimizing Social Desirability Response Style
Social desirability response style is often measured with a dedicated scale that asks participants to report their tendency to engage in socially approved (e.g., admitting one’s mistakes) and disapproved (e.g., littering the street) behaviors. Common measures include the Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1960), the Lie scale of the Minnesota Multiphasic Personality Inventory (Hathaway & McKinley, 1989), and Paulhus’s Balanced Inventory of Desirable Responding (BIDR; Paulhus, 1984).
The BIDR is popular because it divides social desirability into two components—impression management and self-deception—whereas other measures do not make a similar distinction. Initially, Paulhus conceptualized impression management as respondents’ intentional effort to favorably present themselves, and self-deception as their unintentional effort to do so (1984). More recent research, however, has questioned this notion. Uziel (2010a, 2010b), for example, reframed impression management as interpersonally related self-control. Individuals high in impression management have a higher drive to succeed in public as opposed to private settings. Uziel (2010b) showed empirically that people high in impression management are more creative and have better self-control in social than in private settings.
In an attempt to examine the validity of the two components of social desirability, Kam (2013) had external judges rate the desirability of personality items and then examined whether the highly desirable items would correlate strongly with impression management and self-deception. If impression management and self-deception are valid, they should help identify items that are considered desirable or undesirable by external judges. Kam found a very strong match between decisions based on the self-deception scale and raters’ judgment (convergent validity r = .90), and a weaker match between decisions based on the impression management scale and raters’ judgment (convergent validity r = .71). This result suggested higher validity for the self-deception scale than for the impression management scale. Rather surprisingly, Zettler, Hilbig, Moshagen, and de Vries (2015) found that those high in impression management, relative to those low in the trait, are more likely to be honest in a task in which participants can choose to cheat. Therefore, although traditionally impression management is regarded as a socially desirable trait and individuals with this trait are considered more likely to behave dishonestly, Zettler et al.’s (2015) finding contradicts these assumptions: participants who endorse impression management items (e.g., never taking possessions not belonging to oneself, never saying swear words) may indeed be more honest than other people. Taking all these results together, it appears that self-deception is a better measure of social desirability than is impression management (Kam, 2013).
After measuring social desirability response style, researchers often statistically control for its influence using partial correlation or multiple regression (with social desirability as a covariate). However, two caveats should be mentioned. First, Paunonen and LeBel (2012) caution that the relationship between social desirability and construct scores may not be a simple linear one. Respondents high in social desirability may have a stronger tendency to reject possession of negative traits than to accept ownership of positive traits. For example, high social desirability respondents may disagree more strongly with negative items (e.g., I am worthless) than they agree with positive items (e.g., I am a worthwhile individual). Therefore, if a construct is measured by both positive and negative items (such as Rosenberg’s Self-Esteem Measure), the net effect of social desirability on the final score is complex, involving the interaction of the response style with the keying direction of the items.
The second caveat is that the relationship between social desirability and item scores may depend on assessment context. Dunlop, Telford, and Morrison (2012) showed that, although the extreme response (e.g., Strongly Agree in an extraversion item) may be assumed to be most desirable, the penultimate option (e.g., Moderately Agree) can be equally desirable. Furthermore, the socially desirable response also depends on the context. Extraversion, for example, may be regarded as desirable for a sales job but less desirable for a job in nursing. Therefore, the best way to minimize social desirability may be to create neutral items—ones perceived as not particularly desirable or undesirable—rather than measuring social desirability and then controlling for it statistically (Dunlop et al., 2012).
There have indeed been efforts to create scale items neutral in meaning. In a noteworthy study, Bäckström, Björklund, and Larsson (2014) attempted to attenuate item social desirability by rephrasing items so that they were more neutral. Bäckström et al. (2014) called this process “evaluative neutralization.” They claimed that the process is “so simple that even untrained undergraduate students can apply it successfully, when provided with basic instructions” (p. 28). Bäckström et al. neutralized items from a popular personality inventory (International Personality Item Pool; Goldberg et al., 2006) and compared the criterion validity of the original scale with that of the new, neutralized scale. Two findings stood out. First, there were no substantial differences in scale reliability between the original and new inventories. Second, criterion validity for the original and new inventories was comparable on all scales except social desirability: for social desirability, the original scale showed a substantially stronger correlation than the new one. Bäckström et al. concluded that evaluative neutralization can reduce social desirability content without sacrificing validity.
Summary of General Approaches
As is apparent from the preceding discussion, researchers can measure and control for response styles at multiple stages of their research.
Survey Construction Stage
Some researchers advocate the use of dedicated items to measure response style. For example, careless responding can be easily controlled if items are included to measure participants’ attentiveness to content (Kam & Meyer, 2015). Acquiescence can be measured by including a large number of heterogeneous items to measure respondents’ tendency to endorse them (Weijters et al., 2010). Doing so has the advantage that these external items are uncorrelated with the other items, thus making a clear separation between measurement of content and measurement of styles. Researchers, however, do not necessarily favor this approach because including dedicated items lengthens the survey. In addition, this approach is applicable only when researchers plan ahead of time. A common alternative is to measure response styles using post-hoc methods (e.g., measuring acquiescence by counting the number of times a respondent strongly agrees with survey items). However, such scores may correlate with construct scores, making the separation between content and style difficult.
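The count-based scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not a published scoring rule: the function name `style_indices`, the 5-point agreement scale, and the choice to treat any response above the scale midpoint as agreement are all assumptions made for the example (the text's own example counts only Strongly Agree responses).

```python
from typing import Dict, List


def style_indices(responses: List[int], n_points: int = 5) -> Dict[str, float]:
    """Return count-based response style indices for one respondent.

    responses: ratings on a 1..n_points agreement scale. The indices are
    easiest to interpret when the items are heterogeneous in content
    (an assumption of this sketch, not something the code can check).
    """
    if not responses:
        raise ValueError("responses must not be empty")
    midpoint = (n_points + 1) / 2
    n = len(responses)
    # Proportion of items answered on the agree side of the midpoint.
    agree = sum(1 for r in responses if r > midpoint)
    # Proportion of items answered on the disagree side of the midpoint.
    disagree = sum(1 for r in responses if r < midpoint)
    # Proportion of items answered with either end point of the scale.
    extreme = sum(1 for r in responses if r in (1, n_points))
    return {
        "acquiescence": agree / n,
        "disacquiescence": disagree / n,
        "extreme": extreme / n,
    }


# Hypothetical respondent answering ten unrelated items on a 5-point scale.
example = style_indices([5, 4, 4, 5, 3, 4, 5, 2, 4, 5])
print(example)
```

When the same items also measure the substantive construct, indices of this kind mix content and style, which is the limitation noted above for post-hoc scores.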
Data Analytic Stage
Some methods control for the impact of response styles after the fact. A researcher may employ partial correlation and regression to control for the effect of a response style on the relationship between two substantive constructs. However, the efficacy of this approach remains to be demonstrated. In an empirical study, controlling for social desirability using partial correlations did not significantly alter the apparent correlation between a predictor and a criterion variable (Ones, Viswesvaran, & Reiss, 1996). The same finding was obtained in a simulation study (Paunonen & LeBel, 2012). Because convenient statistical techniques (e.g., partial correlation) may fail to recover population parameters, researchers have proposed more advanced techniques to model and control for response styles.
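As a concrete illustration of the partial correlation technique, the sketch below computes the first-order partial correlation between two construct scores while holding a response style score constant, using the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The variable names and the toy data are hypothetical; the data were constructed so that the two construct scores are related only through the style score.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must not be constant")
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x: Sequence[float], y: Sequence[float],
                 z: Sequence[float]) -> float:
    """Correlation between x and y with z (the style score) partialled out."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    if denom == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (rxy - rxz * ryz) / denom


# Hypothetical toy data: both construct scores contain the style score
# plus a component that is unrelated to everything else.
desirability = [1, 1, 1, 1, -1, -1, -1, -1]
content_a = [2, 2, 0, 0, 0, 0, -2, -2]
content_b = [2, 0, 2, 0, 0, -2, 0, -2]

print("raw r:", round(pearson(content_a, content_b), 3))
print("partial r:", round(partial_corr(content_a, content_b, desirability), 3))
```

The formula assumes linear relationships among the three scores, so it does not address the nonlinearity and keying-direction concerns raised by Paunonen and LeBel (2012).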
Böckenholt (2012) demonstrated how an item response theory (IRT) model may be used to capture a response style process. This IRT model combined multinomial processing tree models (Batchelder, 2010) and diagnostic measurement models (Rupp, Templin, & Henson, 2010) to measure both construct content and response styles. The procedure first requires a researcher to theorize the sequence of decisions a respondent makes in choosing a response option, and then to test empirically how well the hypothesized sequence fits real data. Later researchers used Böckenholt’s original approach to assess the dimensionality of a multidimensional construct (Khorramdel & von Davier, 2014; von Davier & Khorramdel, 2013). In general, the advanced statistical procedures introduced by Böckenholt (2012) do not strictly require a researcher to include dedicated measures of each response style at the survey construction stage. Nevertheless, the advanced IRT model requires a very large sample size (e.g., more than 1,000 respondents) to yield stable estimates.
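To make the idea of a hypothesized response sequence concrete, the sketch below recodes a 5-point rating into three binary pseudo-items, one per decision node. The particular sequence used here (first whether to choose the midpoint, then the direction of the response, then whether to choose the extreme option) is one possible sequence assumed for illustration; a researcher would substitute the sequence that he or she actually hypothesizes. The function name `to_pseudo_items` is hypothetical, and fitting an IRT model to the resulting pseudo-items is not shown.

```python
from typing import List, Optional, Tuple

# (midpoint, direction, extreme); None marks a node that was not reached.
PseudoItems = Tuple[int, Optional[int], Optional[int]]


def to_pseudo_items(response: int) -> PseudoItems:
    """Recode one 5-point rating (1..5) into three binary pseudo-items."""
    if response not in (1, 2, 3, 4, 5):
        raise ValueError("response must be an integer from 1 to 5")
    if response == 3:
        # Midpoint chosen: the direction and extremity nodes are not reached.
        return (1, None, None)
    direction = 1 if response > 3 else 0       # 1 = agree side
    extreme = 1 if response in (1, 5) else 0   # 1 = end point of the scale
    return (0, direction, extreme)


# Recoding of every possible response option.
recoded: List[PseudoItems] = [to_pseudo_items(r) for r in [1, 2, 3, 4, 5]]
for rating, nodes in zip([1, 2, 3, 4, 5], recoded):
    print(rating, nodes)
```

Under this assumed sequence, the direction node is intended to carry construct content, whereas the midpoint and extremity nodes are intended to carry response style.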
Finally, Bolt et al. (2014) proposed the use of anchoring vignettes to correct the data for response style. In this procedure, respondents read a number of vignettes. Each vignette is a detailed description, and thus all respondents are assumed to perceive it in the same way. For example, a vignette may describe an employee who always plans ahead so that they will complete all their work on time and who thinks carefully before promising anything. Based on this vignette, the employee should be rated as extremely conscientious, but a respondent with a midpoint response style may select Neutral as the response. With a large number of vignettes, participant differences in how response options are interpreted will be recorded, and this information can be used to control for multiple response styles simultaneously.
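The logic of anchoring vignettes can be illustrated with a purely descriptive sketch. It is not the model-based procedure of Bolt et al. (2014); it only summarizes how one respondent's vignette ratings depart from the levels the vignettes were written to convey. The function name `vignette_style`, the intended levels, and the two respondents are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def vignette_style(ratings: List[int], intended: List[int]) -> Tuple[float, float]:
    """Summarize one respondent's use of the rating scale on vignettes.

    Returns (shift, spread):
      shift  > 0 suggests a tendency to rate higher than intended;
      spread < 1 suggests ratings compressed toward the middle,
      spread > 1 suggests ratings pushed toward the end points.
    """
    if len(ratings) != len(intended) or len(ratings) < 2:
        raise ValueError("need one rating per vignette (at least 2)")
    base = pstdev(intended)
    if base == 0:
        raise ValueError("intended levels must differ across vignettes")
    shift = mean([r - t for r, t in zip(ratings, intended)])
    spread = pstdev(ratings) / base
    return shift, spread


# Five hypothetical vignettes written to convey levels 1 to 5.
intended_levels = [1, 2, 3, 4, 5]
shift_mid, spread_mid = vignette_style([2, 3, 3, 3, 4], intended_levels)
shift_ext, spread_ext = vignette_style([1, 1, 3, 5, 5], intended_levels)
print("midpoint-leaning respondent:", shift_mid, round(spread_mid, 2))
print("extreme-leaning respondent:", shift_ext, round(spread_ext, 2))
```

A model-based approach estimates such tendencies jointly with the substantive trait rather than from raw differences, so this sketch should be read only as an aid to intuition.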
Despite the abundance of techniques to measure and control for response styles, questions remain unanswered and more research is needed. First, more effort is required to simplify the application of statistical control techniques. Advanced IRT models (Böckenholt, 2012; Bolt et al., 2014) are not used as commonly as they should be because they often require researchers to understand complex statistical models. Modern statistical software such as the Mplus program has made these techniques easier to implement, but understanding the underlying statistics still requires substantial effort. Therefore, development of user-friendly tools to implement these statistical procedures will help make their use more widespread.
Second, researchers seem to be more interested in developing new modeling techniques to control for a response style than in validating such techniques. This is problematic, as an applied researcher may trust an advanced technique to recover population-level parameter estimates even when the technique has questionable validity. Therefore, in addition to developing novel statistical procedures, researchers need to examine the validity of different techniques with both simulated (Maydeu-Olivares & Coffman, 2006) and empirical data (Kam & Zhou, 2015) to find out which works best in which situation. A more sophisticated technique is better than a simpler one (e.g., partial correlations) only when the former yields results with greater validity.
Finally, future research needs to investigate the nature of response styles. The amount of research has been uneven across different response styles. We know much more about the potential causes of acquiescence and extreme response styles (Johnson et al., 2005) than about those of careless response style (Huang et al., 2015; Kam & Meyer, 2015). For example, we know that acquiescence is related to agreeableness (Couch & Keniston, 1960, 1961) and that extreme response style is related to power distance and masculinity (Johnson et al., 2005). However, issues related to careless responding are still severely under-researched. Knowing who is more likely to exhibit each particular response style will help researchers minimize its negative impact at the data collection stage.
References
Bäckström, M., Björklund, F., & Larsson, M. R. (2009). Five-factor inventories have a major general factor related to social desirability which can be reduced by framing items neutrally. Journal of Research in Personality, 43, 335–344.
Bäckström, M., Björklund, F., & Larsson, M. R. (2014). Criterion validity is maintained when items are evaluatively neutralized: Evidence from a full-scale Five-Factor Model Inventory. European Journal of Personality, 28, 620–633.
Barnette, J. J. (1999). Nonattending respondent effects on internal consistency of self-administered surveys: A Monte Carlo simulation study. Educational and Psychological Measurement, 59, 38–46.
Batchelder, W. H. (2010). Cognitive psychometrics: Using multinomial processing tree models as measurement tools. In S. E. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 71–93). Washington, DC: American Psychological Association.
Baumgartner, H., & Steenkamp, J. B. E. (2001). Response styles in marketing research: A cross-national investigation. Journal of Marketing Research, 38, 143–156.
Bentler, P. M., Jackson, D. N., & Messick, S. (1971). Identification of content and style: A two-dimensional interpretation of acquiescence. Psychological Bulletin, 76, 186–204.
Billiet, J. B., & Davidov, E. (2008). Testing the stability of an acquiescence style factor behind two interrelated substantive variables in a panel design. Sociological Methods & Research, 36, 542–562.
Billiet, J. B., & McClendon, M. J. (2000). Modeling acquiescence in measurement models for two balanced sets of items. Structural Equation Modeling, 7, 608–628.
Böckenholt, U. (2012). The cognitive-miser response model: Testing for intuitive and deliberate reasoning. Psychometrika, 77, 388–399.
Bolt, D. M., Lu, Y., & Kim, J. S. (2014). Measurement and control of response styles using anchoring vignettes: A model-based approach. Psychological Methods, 19, 528–541.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Castro-Schilo, L., Widaman, K. F., & Grimm, K. J. (2013). Neglect the structure of multitrait-multimethod data at your peril: Implications for associations with external variables. Structural Equation Modeling: A Multidisciplinary Journal, 20, 181–207.
Cole, D. A., Martin, N. C., & Steiger, J. H. (2005). Empirical and conceptual problems with longitudinal trait-state models: Introducing a trait-state-occasion model. Psychological Methods, 10, 3–20.
Couch, A., & Keniston, K. (1960). Yeasayers and naysayers: Agreeing response set as a personality variable. Journal of Abnormal and Social Psychology, 60, 151–174.
Couch, A., & Keniston, K. (1961). Agreeing response set and social desirability. Journal of Abnormal and Social Psychology, 62, 175–179.
Credé, M., Chernyshenko, O. S., Bagraim, J., & Sully, M. (2009). Contextual performance and the job satisfaction–dissatisfaction distinction: Examining artifacts and utility. Human Performance, 22, 246–272.
Crowne, D. P., & Marlowe, D. (1960). A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 24, 349–354.
De Beuckelaer, A., Weijters, B., & Rutten, A. (2010). Using ad hoc measures for response styles: A cautionary note. Quality & Quantity, 44, 761–775.
DiStefano, C., & Motl, R. W. (2009a). Personality correlates of method effects due to negatively worded items on the Rosenberg Self-Esteem scale. Personality and Individual Differences, 46, 309–313.
DiStefano, C., & Motl, R. W. (2009b). Self-esteem and method effects associated with negatively worded items: Investigating factorial invariance by sex. Structural Equation Modeling, 16, 134–146.
Dunlop, P. D., Telford, A. D., & Morrison, D. L. (2012). Not too little, but not too much: The perceived desirability of responses to personality items. Journal of Research in Personality, 46, 8–18.
Eid, M. (2000). A multitrait-multimethod model with minimal assumptions. Psychometrika, 65, 241–261.
Geiser, C., Bishop, J., & Lockhart, G. (2015). Collapsing factors in multitrait-multimethod models: Examining consequences of a mismatch between measurement design and model. Frontiers in Psychology, 6, 946–970.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., et al. (2006). The international personality item pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84–96.
Gu, H., Wen, Z., & Fan, X. (2015). The impact of wording effect on reliability and validity of the Core Self-Evaluation Scale (CSES): A bi-factor perspective. Personality and Individual Differences, 83, 142–147.
Hathaway, S. R., McKinley, J. C., & MMPI Restandardization Committee. (1989). MMPI-2: Minnesota Multiphasic Personality Inventory-2: Manual for administration and scoring. Minneapolis: University of Minnesota Press.
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting and deterring insufficient effort responding to surveys. Journal of Business and Psychology, 27, 99–114.
Huang, J. L., Liu, M., & Bowling, N. A. (2015). Insufficient effort responding: Examining an insidious confound in survey data. Journal of Applied Psychology, 100, 828–845.
Hui, C. H., & Triandis, H. C. (1985). The instability of response sets. Public Opinion Quarterly, 49, 253–260.
Jackson, D. N., & Messick, S. (1962). Response styles and the assessment of psychopathology. In S. Messick & J. Ross (Eds.), Measurement in personality and cognition (pp. 129–155). New York: John Wiley.
Johnson, T., Kulesa, P., Cho, Y. I., & Shavitt, S. (2005). The relation between culture and response styles: Evidence from 19 countries. Journal of Cross-Cultural Psychology, 36, 264–277.
Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109–133.
Kenny, D. A. (1976). An empirical application of confirmatory factor analysis to the multitrait–multimethod matrix. Journal of Experimental Social Psychology, 12, 247–252.
Kam, C. (2013). Probing item social desirability by correlating personality items with Balanced Inventory of Desirable Responding (BIDR): A validity examination. Personality and Individual Differences, 54, 513–518.
Kam, C. C. S. (2016a). Further considerations in using items with diverse content to measure acquiescence. Educational and Psychological Measurement, 76, 164–174.
Kam, C. C. S. (2016b). Why do we still have an impoverished understanding of the item wording effect? Sociological Methods and Research.
Kam, C. C. S. (2017). Novel Insights into Item Keying/Valence Effect using Latent Difference (LD) Modeling Analysis. Unpublished manuscript. University of Macau.
Kam, C. C. S., & Meyer, J. P. (2015). How careless responding and acquiescence response bias can influence construct dimensionality: The case of job satisfaction. Organizational Research Methods, 18, 512–541.
Kam, C. C. S., & Zhou, M. (2015). Does acquiescence affect individual items consistently? Educational and Psychological Measurement, 75, 764–784.
Kenny, D. A. (1979). Correlation and causation. New York: Wiley.
Kenny, D. A., & Berman, J. S. (1980). Statistical approaches to the correction of correlational bias. Psychological Bulletin, 88, 288–295.
Khorramdel, L., & von Davier, M. (2014). Measuring response styles across the Big Five: A multiscale extension of an approach using multinomial processing trees. Multivariate Behavioral Research, 49, 161–177.
Lance, C. E., Noble, C. L., & Scullen, S. E. (2002). A critique of the correlated trait-correlated method and correlated uniqueness models for multitrait-multimethod data. Psychological Methods, 7, 228–244.
Lindwall, M., Ljung, T., Hadžibajramović, E., & Jonsdottir, I. H. (2012). Self-reported physical activity and aerobic fitness are differently related to mental health. Mental Health and Physical Activity, 5, 28–34.
Lorge, I. (1937). Gen-like: Halo or reality. Psychological Bulletin, 34, 545–546.
Maniaci, M. R., & Rogge, R. D. (2014). Caring about carelessness: Participant inattention and its effects on research. Journal of Research in Personality, 48, 61–83.
Marsh, H. W. (1989). Confirmatory factor analyses of multitrait-multimethod data: Many problems and a few solutions. Applied Psychological Measurement, 13, 335–361.
Marshall, G. N., Wortman, C. B., Kusulas, J. W., Hervig, L. K., & Vickers, R. R., Jr. (1992). Distinguishing optimism from pessimism: Relations to fundamental dimensions of mood and personality. Journal of Personality and Social Psychology, 62, 1067–1074.
Maydeu-Olivares, A., & Coffman, D. L. (2006). Random intercept item factor analysis. Psychological Methods, 11, 344–362.
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17, 437–455.
Messick, S., & Jackson, D. N. (1961). Acquiescence and desirability as response determinants on the MMPI. Educational and Psychological Measurement, 21, 771–790.
Morren, M., Gelissen, J. P., & Vermunt, J. K. (2012). Exploring the response process of culturally differing survey respondents with a response style: A sequential mixed methods study. Field Methods, 25, 162–181.
Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). Role of social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology, 81, 660–679.
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872.
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46, 598–609.
Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 17–59). San Diego: Academic Press.
Paunonen, S. V., & LeBel, E. P. (2012). Socially desirable responding and its elusive effects on the validity of personality assessments. Journal of Personality and Social Psychology, 103, 158–175.
Pohl, S., & Steyer, R. (2010). Modeling common traits and method effects in multitrait-multimethod analysis. Multivariate Behavioral Research, 45, 45–72.
Rammstedt, B., & Farmer, R. F. (2013). The impact of acquiescence on the evaluation of personality structure. Psychological Assessment, 25, 1137–1145.
Rammstedt, B., Goldberg, L. R., & Borg, I. (2010). The measurement equivalence of Big-Five factor markers for persons with different levels of education. Journal of Research in Personality, 44, 53–61.
Rauch, W. A., Schweizer, K., & Moosbrugger, H. (2007). Method effects due to social desirability as a parsimonious explanation of the deviation from unidimensionality in LOT-R scores. Personality and Individual Differences, 42, 1597–1607.
Rorer, L. G. (1965). The great response-style myth. Psychological Bulletin, 63, 129–156.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic assessment: Theory, methods, and applications. New York: Guilford.
Schimmack, U., Oishi, S., & Diener, E. (2005). Individualism: A valid and important dimension of cultural differences between nations. Personality and Social Psychology Review, 9, 17–31.
Schmitt, N., & Stults, D. M. (1985). Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement, 9, 367–373.
Uziel, L. (2010a). Look at me, I’m happy and creative: The effect of impression management on behavior in social presence. Personality and Social Psychology Bulletin, 36, 1591–1602.
Uziel, L. (2010b). Rethinking social desirability scales from impression management to interpersonally oriented self-control. Perspectives on Psychological Science, 5, 243–262.
von Davier, M., & Khorramdel, L. (2013). Differentiating response styles and construct-related responses: A new IRT approach using bifactor and second-order models. In R. E. Millsap (Ed.), New Developments in Quantitative Psychology (pp. 463–487). New York: Springer.
Weijters, B., Baumgartner, H., & Schillewaert, N. (2013). Reversed item bias: An integrative model. Psychological Methods, 18, 320–334.
Weijters, B., Geuens, M., & Schillewaert, N. (2010). The stability of individual response styles. Psychological Methods, 15, 96–110.
Weijters, B., Schillewaert, N., & Geuens, M. (2008). Assessing response styles across modes of data collection. Journal of the Academy of Marketing Science, 36, 409–422.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9, 1–26.
Wiggins, J. S. (1973). Personality and prediction: Principles of personality assessment. Reading, MA: Addison-Wesley.
Winkler, J. D., Kanouse, D. E., & Ware, J. E. (1982). Controlling for acquiescence response set in scale development. Journal of Applied Psychology, 67, 555–561.
Woods, C. M. (2006). Careless responding to reverse-worded items: Implications for confirmatory factor analysis. Journal of Psychopathology and Behavioral Assessment, 28, 186–191.
Zettler, I., Hilbig, B. E., Moshagen, M., & de Vries, R. E. (2015). Dishonest responding or true virtue? A behavioral test of impression management. Personality and Individual Differences, 81, 107–111.