Experimentally Manipulated Bias in School Psychologists’ Scoring of WISC-III Protocols
A presentation to the Midwest Educational Research Association, Division D: Measurement and Research Methodology
Chicago, IL
October 26, 2001
Lawrence W. Sherman (shermalw@muohio.edu) and Amy N. Taylor¹
Department of Educational Psychology
Miami University
Oxford, OH 45056
Available on the web at:
http://www.users.muohio.edu/shermalw/mwera_version5_files/mwera_version5.htm
Abstract.
Experimenter bias effects were experimentally manipulated in a sample of school psychologists (n = 97) who scored three subtests (Similarities, Vocabulary, Comprehension) of the WISC-III. First-year students (n = 29), interns (n = 42), and experienced school psychologists (n = 26) were randomly assigned to either a bias or a control group and asked to score identical protocols for the three subtests. No statistically significant interactions between experimental group (bias vs. control) and level of experience (first-year vs. intern vs. experienced) were obtained. The main effect of bias was non-significant on all three subtests; the only main effect to reach significance was a marginal effect of experience level on the Comprehension subtest. These results were interpreted as an affirmation of the objectivity of scoring for these relatively subjective subtests, as well as of the quality of training of these students, interns, and experienced practitioners.
Intelligence tests are an integral part of educational planning and placement. The most widely used intelligence test currently on the market is the Wechsler Intelligence Scale for Children-Third Edition (WISC-III). Although great efforts have been made to make this test a standardized and objective measure, some subtests have been shown to be vulnerable to examiner subjectivity. Earlier research on previous versions of the WISC has indicated that several sources of bias can significantly influence an examiner’s scoring (Sattler, 1992; Massey, 1964; Miller, Chansky, & Gredler, 1970; Miller & Chansky, 1972; Sattler, Squire, & Andres, 1977; Slate & Chick, 1989; Slate & Jones, 1990a, 1990b; Slate, 1993; Wheeler, 1987; Kirchner, 1980; Shannon, 1985; O’Reilly, 1989). Inasmuch as the WISC and its subsequent revisions, the WISC-R and WISC-III, are commonly used to determine a variety of special education classifications, it is important to know whether the latest revision of this instrument is reliable and free from bias.
Rosenthal's (1976, 1994) notion of "experimenter bias" suggests that an examiner's diagnosis of a client may be unintentionally influenced by bias, especially under the relatively subjective scoring systems associated with three specific subtests of the WISC-III. The present study focused on the effects of an experimentally induced disability bias, Down Syndrome. The randomly assigned independent variable contrasted a control group that did not receive this bias with an experimental group that did. Three levels of experience (first-year school psychology students, third-year school psychology interns, and experienced certified school psychologists) were considered as a moderator variable. The three dependent measures were the subjects' scoring of the Similarities, Vocabulary, and Comprehension subtests of the WISC-III. Based on prior research on expectancy bias and on the errors observed in Wechsler scale scoring by school psychologists and trainees, the present study sought to determine whether the scoring of a completed WISC-III protocol might also be prone to these influences. Thus, we hypothesized that our biased group would assign significantly lower scores than our control group. Among the 13 WISC-III subtests, the three that we used are often described as the most subjective and the most vulnerable to external bias. We also hypothesized that the differences between the biased and control groups would be smallest for the experienced school psychologists, larger for the interns, and greatest for the novice school psychology trainees.
¹ This study reports findings that were originally reported in Amy N. Taylor’s Specialist Degree thesis at Miami University, Oxford, Ohio, 2001. We are thankful for Dr. Alex Thomas’ support and assistance in completing this project. He greatly facilitated the participation of many of the subjects who volunteered for this study.
Participants:
One adolescent with Down Syndrome was selected, with permission from his primary caregiver, to complete a WISC-III protocol. No identifying information about this individual was known to anyone but the researchers, and his anonymity was protected throughout the study. The volunteers, 29 first-year school psychology graduate students, 42 intern school psychology graduate students, and 26 certified school psychologists practicing in the field, were randomly assigned to either the bias or the control condition. The first-year students and the interns were from various training programs in the state of Ohio, including Miami University, Kent State, and the University of Akron. The certified school psychologists were randomly selected from the South Western Ohio School Psychology Association database.
Materials:
The WISC-III was given to the adolescent with Down Syndrome. The answers he gave on the Similarities, Vocabulary, and Comprehension subtests were transcribed onto a blank protocol and coded according to which group of subjects would receive the protocol (i.e., bias vs. control and level of training). This was done so that the researcher could identify which group each protocol belonged to without identifying any subjects, who for the most part remained anonymous. The protocols were coded by slightly changing the response to one item on the Similarities subtest according to the group to which the protocol belonged. Half of the transcriptions sent to each group included a sheet of paper indicating that the individual who completed the protocol had Down Syndrome, while the other half came with no such information. The accompanying materials told the subjects nothing about the study’s purpose. Subjects were told only to score the subtests as best they could without using the WISC-III manual or any other aid. They were told not to use the manual so that they would not share a manual, which could lead to collaboration on responses and thereby taint the data collected.
Procedure:
One of the researchers first administered the Similarities, Vocabulary, and Comprehension subtests of the WISC-III to the adolescent with Down Syndrome. When the protocol was completed, the researcher transcribed the answers to these subtests onto six different blank protocols, one for each combination of training level and condition. The six protocols were coded according to the group to which they belonged, and copies of each transcription were then made. The code placed on each protocol indicated the subject’s level of training and whether or not the subject received the bias, which provided a “double-blind” element of the study. Within each level of training, half of the subjects were randomly assigned to receive the bias and the other half were not. The bias consisted of a small sheet of paper stating that an individual with Down Syndrome had completed the three subtests. All subjects also received instructions with their transcriptions asking them to score the subtests without using any scoring guide and, to ensure confidentiality, not to share their answers or protocols with anyone else. They were also instructed to mail their completed protocols back to the researchers, in the enclosed self-addressed envelopes, via the Educational Psychology office at Miami University. No information that could identify a subject was placed on any protocol, so the subjects remained anonymous. The researchers knew only whether a subject was a first-year graduate student, an intern, or a certified school psychologist, and whether or not that subject had received the bias. Thus, the researchers were “blind” as to which individual received which experimental condition. The researchers then recorded each subject’s raw scores for each subtest and used them as dependent measures in later analyses.
Research Design:
This study used a randomized posttest-only control group design. The subjects at each of the three experience levels (student, intern, certified; the moderating variable) were randomly assigned to receive or not receive the bias (the independent variable). The posttest consisted of determining whether raw scores (the dependent variable) differed significantly between the groups that received the bias and those that did not. Further, the researchers sought to determine whether these score differences varied across the three experience levels. We hypothesized that the mean scores for the biased groups would be significantly lower than the control groups’ mean scores on each of the three subtests. Further, the difference between scores in the biased and control conditions would be greatest for the first-year students, smaller for the interns, and smallest for the certified school psychologists.
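The random assignment within experience levels described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function name `assign_conditions`, the seed, and the near-even split within each level are hypothetical, since the study does not report the mechanism used, and the protocols actually returned need not yield equal cell sizes.

```python
import random

def assign_conditions(n_by_level, seed=None):
    """Randomly split each experience level into bias and control groups."""
    rng = random.Random(seed)
    assignment = {}
    for level, n in n_by_level.items():
        # Build a near-balanced list of condition labels, then shuffle it
        # so that each subject position receives a random condition.
        labels = ["bias"] * (n // 2) + ["control"] * (n - n // 2)
        rng.shuffle(labels)
        assignment[level] = labels
    return assignment

# Sample sizes reported in the Participants section.
sample = {"student": 29, "intern": 42, "certified": 26}
groups = assign_conditions(sample, seed=2001)  # seed is arbitrary

# Count how many subjects at each level fall into each condition.
cell_counts = {
    level: (labels.count("bias"), labels.count("control"))
    for level, labels in groups.items()
}
print(cell_counts)
```

Shuffling within each level, rather than across the whole sample, keeps the bias and control conditions balanced at every level of the moderating variable.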
Table 1. Symbolic Representation of Research Design.
| | Similarities | Vocabulary | Comprehension |
| Group I: Students | Rn X O1 | Rn X O1 | Rn X O1 |
| | Rn C O2 | Rn C O2 | Rn C O2 |
| Group II: Interns | Ri X O3 | Ri X O3 | Ri X O3 |
| | Ri C O4 | Ri C O4 | Ri C O4 |
| Group III: Certified | Rc X O5 | Rc X O5 | Rc X O5 |
| | Rc C O6 | Rc C O6 | Rc C O6 |
Rn = Randomly selected first-year graduate students in School Psychology (stu)
Ri = Randomly selected interns (third year of study) in School Psychology (int)
Rc = Randomly selected certified School Psychologists (cer)
X = Bias given (treatment)
C = Control (no bias given)
O = Scores
Table 2: Research Hypotheses I and II
| Hypothesis I | Hypothesis II |
| O1 < O2 | [O2-O1] > [O4-O3] > [O6-O5] |
| O3 < O4 | |
| O5 < O6 | |
O1 = Scores for experimental group I
O2 = Scores for control group I
O3 = Scores for experimental group II
O4 = Scores for control group II
O5 = Scores for experimental group III
O6 = Scores for control group III
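The two hypotheses in Table 2 amount to simple ordering conditions on the six cell means, which can be sketched in Python as follows. The function names and the example values are hypothetical and are not study data; the checks describe only the predicted ordering of means and are not a substitute for the significance tests reported in the Results.

```python
def hypothesis_one(o):
    """Hypothesis I: each biased mean is below its matching control mean."""
    return o[1] < o[2] and o[3] < o[4] and o[5] < o[6]

def hypothesis_two(o):
    """Hypothesis II: the control-minus-bias gap shrinks with experience."""
    gaps = [o[2] - o[1], o[4] - o[3], o[6] - o[5]]
    return gaps[0] > gaps[1] > gaps[2]

# Hypothetical cell means keyed by observation index (O1..O6); NOT study data.
example = {1: 10.0, 2: 13.0, 3: 11.0, 4: 13.0, 5: 12.0, 6: 13.0}

print(hypothesis_one(example), hypothesis_two(example))
```

In the hypothetical example the gaps are 3, 2, and 1 points for students, interns, and certified practitioners, so both ordering conditions hold.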
Results.
The experimental treatment (bias vs. control) did not interact significantly with level of experience on any of the three subtests. No significant main effects were obtained on the Similarities or Vocabulary subtests. The Comprehension subtest yielded one marginally significant main effect, for level of experience (F(2, 91) = 3.24, p < .047): first-year students scored their protocols significantly higher (Scheffe = 3.14) than the experienced practitioners, who scored their protocols the lowest. The third-year school psychology interns produced scores in the middle that did not differ significantly from those of either the first-year students or the experienced practitioners.
| | Subscales | | | | | |
| | Similarities | | Vocabulary | | Comprehension | |
| Groups | Control | | | | | |
| Students (n=29) | n=13 | n=16 | n=13 | n=16 | n=13 | n=16 |
| Mean | 11.62 | 12.19 | 14.77 | 15.75 | 14.77 | 15.00 |
| SD | .92 | 1.90 | 4.64 | 3.53 | 3.39 | 3.92 |
| Range | 11 | 9 | 10 | 12 | 10 | 10 |
| Interns (n=42) | n=21 | n=21 | n=21 | n=21 | n=21 | n=21 |
| Mean | 12.48 | | 16.71 | 16.24 | 14.29 | 13.29 |
| SD | 1.94 | 1.29 | 2.97 | 1.95 | 2.59 | 2.45 |
| Range | 6 | 6 | 12 | 7 | 9 | 10 |
| Certified (n=26) | n=12 | n=14 | n=12 | n=14 | n=12 | n=14 |
| Mean | 11.75 | 11.57 | 16.17 | 17.14 | 12.00 | 13.57 |
| SD | 1.52 | .934 | 2.48 | 1.96 | 2.59 | 3.37 |
| Range | 4 | 8 | 7 | 10 | | |
| ANOVA INTERACTION | .34, ns | | .61, ns | | F(2,91) = 1.44, p<.26 | |
Table 4: Three 2x3 ANOVAs* of Experimental Treatment (Bias/Unbiased) by Experience (3) for Three WISC-III Subtests
| Source | df | MS | F | P |
| Similarities | | | | |
| Bias/Unbiased | 1 | .61 | .21 | .64 |
| Experience Level | 2 | 5.27 | 1.84 | .64 |
| Interaction | 2 | 1.00 | .34 | .70 |
| Error | 91 | 2.87 | | |
| Vocabulary | | | | |
| Bias/Unbiased | 1 | 5.63 | .62 | .42 |
| Experience Level | 2 | 17.76 | 1.98 | .14 |
| Interaction | 2 | 5.43 | .61 | .55 |
| Error | 91 | 8.95 | | |
| Comprehension | | | | |
| Bias/Unbiased | 1 | 1.65 | .17 | .67 |
| Experience Level | 2 | 33.96 | 3.66 | .02 |
| Interaction | 2 | 12.74 | 1.37 | .26 |
| Error | 91 | 9.28 | | |
*Statistics computed using GB-STAT (Friedman, 1998).
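Each F value in Table 4 is the ratio of an effect's mean square to the error mean square for the same subtest. The following Python sketch recomputes those ratios from the tabled values as a consistency check. The dictionary layout and the names `TABLE_4` and `f_ratio` are assumptions made for this illustration and are not part of the original GB-STAT analysis.

```python
# Mean squares and reported F values transcribed from Table 4.
# Each effect maps to (effect MS, reported F).
TABLE_4 = {
    "Similarities": {"error_ms": 2.87, "effects": {
        "Bias/Unbiased": (0.61, 0.21),
        "Experience Level": (5.27, 1.84),
        "Interaction": (1.00, 0.34)}},
    "Vocabulary": {"error_ms": 8.95, "effects": {
        "Bias/Unbiased": (5.63, 0.62),
        "Experience Level": (17.76, 1.98),
        "Interaction": (5.43, 0.61)}},
    "Comprehension": {"error_ms": 9.28, "effects": {
        "Bias/Unbiased": (1.65, 0.17),
        "Experience Level": (33.96, 3.66),
        "Interaction": (12.74, 1.37)}},
}

def f_ratio(effect_ms, error_ms):
    """F statistic for a fixed-effects ANOVA term: effect MS over error MS."""
    return effect_ms / error_ms

gaps = []  # absolute difference between recomputed and reported F
for subtest, block in TABLE_4.items():
    for source, (effect_ms, reported_f) in block["effects"].items():
        computed_f = f_ratio(effect_ms, block["error_ms"])
        gaps.append(abs(computed_f - reported_f))
        print(f"{subtest:<14}{source:<18}"
              f"F = {computed_f:.2f} (reported {reported_f:.2f})")
```

Because the tabled mean squares are themselves rounded to two decimals, small discrepancies in the second decimal place of a recomputed F are to be expected.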
Discussion
No significant differences in scoring were found between the experimental and control groups on the Similarities, Vocabulary, or Comprehension subtests; bias alone showed no effect on any of the three. However, level of experience did show a marginally significant effect on the Comprehension subtest. This was not a hypothesis that the researchers originally set out to test, but it is an interesting finding nonetheless, even if marginal in significance. The finding did not take the direction the researchers would have expected: on the Comprehension subtest, the certified school psychologists produced significantly lower mean scores than the first-year students. This demonstrates that level of experience may be related to differential scoring on this subtest. It does not, however, support the researchers’ hypothesis that the more experienced an individual is, the less likely that individual is to be influenced by a bias. The difference may be attributable to experienced school psychologists being more stringent in their scoring or to novice school psychology trainees being too liberal. If one assumes that professionals at every level of experience have mastered the scoring techniques, then experience alone should not have had a significant effect. This finding could prompt the test-makers to search for ways to make the scoring of this subtest less subjective, so that examiners are more likely to arrive at a uniform score.
This study was limited in several ways. The small number of subjects and the limited geographical representation make the obtained results less generalizable. Further research should concentrate on increasing the size of the groups as well as the geographic and demographic diversity of the subjects involved. The final and perhaps most obtrusive limitation of this study was the contrived nature of the research. Having school psychologists and school psychologists in training score protocols from a child they have never seen, much less assessed, is very unrealistic. In the real world of practice, the child would be in front of the examiner and a much “truer” score would likely be determined. On the other hand, having a child with Down Syndrome in front of the examiner might even heighten the effect of the bias, given the possible influence of the observed physical attributes of the child being assessed. One can never really know what the “true” effects would be.
While this study failed to confirm hypotheses based on Rosenthal's (1994) experimenter bias effect, the results are interpreted as an affirmation of the objectivity of scoring for these relatively subjective subtests, as well as of the quality of training of these students, interns, and experienced practitioners.
References
Beeghly, M. & Cicchetti, D. (1990). Children with Down syndrome: A developmental perspective. Cambridge University Press: Boston, MA.
Braden, J.P. (1995). Review of the WISC-III. In J.C. Conoley & J.C. Impara (Eds.), The twelfth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements. See Test number 412.
Friedman, P. (1998). GB-STAT Tutorial. Silver Spring, MD: Dynamic Microsystems, Inc.
Kirchner, C.M. (1980). Children’s test behavior and examiner bias on the Wechsler Intelligence Scale for Children. Dissertation Abstracts International-A, 41/06, p. 2566.
Massey, J.O. (1964). WISC scoring criteria. Palo Alto, CA: Consulting Psychologists Press.
Miller, C.K., Chansky, N.M., & Gredler, G.R. (1970). Rater agreement on WISC protocols. Psychology in the Schools, 7, 190-193.
Miller, C.K. & Chansky, N.M. (1972). Psychologists’ scoring of WISC protocols. Psychology in the Schools, 9, 144-152.
O’Reilly, C. (1989). The confirmation bias in special education eligibility decisions. School Psychology Review, 18(1), 126-135.
Rosenthal, R. (1976). Experimenter Effects in Behavioral Research. Irvington Publishers, Inc.: New York, NY.
Rosenthal, R. (1994). Interpersonal expectancy effects: A 30-year perspective. Current Directions in Psychological Science, 3, 176-179.
Sandoval, J. (1995). Review of the WISC-III. In J.C. Conoley & J.C. Impara (Eds.), The twelfth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements. See Test number 412.
Sattler, J.M., Squire, L.S., & Andres, J.R. (1977). Scoring discrepancies between the WISC-R manual and two scoring guides. Journal of Clinical Psychology, 33, 1058-1059.
Sattler, J.M. (1992). Assessment of Children-Revised and Updated, 3rd ed. Jerome M. Sattler Publisher: San Diego, CA.
Shannon, R.L. (1985). Impact of special education labels: Implications for psychological re-evaluations utilizing the WISC-R. Dissertation Abstracts International-A, 45/08, p. 2488.
Shaw, S.R., Swerdlik, S.E., & Laurent, J. (1993). [Review of the WISC-III.] In B.A. Bracken (Ed.), Monograph series advances in psychoeducational assessment: Wechsler Intelligence Scale for Children-Third edition; Journal of Psychoeducational Assessment (pp. 151-159). Brandon, VT: Clinical Psychology Publishing Co., Inc.
Slate, J.R. (1993). Evidence that practitioners err in administering and scoring the WAIS-R. Measurement and Evaluation in Counseling and Development, 25, 156-161.
Slate, J.R. & Chick, D. (1989). WISC-R examiner errors: Cause for concern. Psychology in the Schools, 26, 78-84.
Slate, J.R. & Jones, C.H. (1990a). Identifying students’ errors in administering the WAIS-R. Psychology in the Schools, 27, 83-87.
Slate, J.R. & Jones, C.H. (1990b). Student error in administering the WISC-R: Identifying problem areas. Measurement and Evaluation in Counseling and Development, 23, 137-140.
Taylor, A.N. (2001). Experimentally Manipulated Bias in School Psychologists’ Scoring of WISC-III Protocols (Specialist’s Degree thesis), Miami University, Oxford, OH.
Wheeler, P.T. (1987). A study of the effect of a child’s physical attractiveness upon verbal scoring of the Wechsler Intelligence Scale for Children (Revised) and upon personality attributions. Dissertation Abstracts International-B, 47/08, p. 3550.