Nenty, H. J. (1996). Advances in test validation. In G. A. Badmus & P. I. Odor (Eds.), Challenges in managing educational assessment in Nigeria (pp. 59-69). Kaduna, Nigeria: National Conference on Educational Assessment.
CHAPTER 8
ADVANCES IN TEST VALIDATION
H. Johnson Nenty, University of Calabar, Calabar, Nigeria.
INTRODUCTION AND BACKGROUND
The increase in the size and variety of the school population, in the face of dwindling resources available for education and hence limited educational opportunities, calls for more efficient measurement processes for selecting, promoting, and graduating learners. The demand is for tests that ensure objectivity in the characterization of learners in terms of their ability and of items in terms of their parameters. This is in view of the important role test scores play in discriminating among learners and the vital decisions they sustain about the future of every Nigerian child. For such decisions to be fair and valid, the scores on which they are based must reflect nothing other than the ability, knowledge or skills which the test was designed to measure. This borders on test validity.
The various definitions of validity tend to converge around the idea that it is the degree to which scores from a test represent, and hence can be used to infer, that which the test was designed to measure. It follows that validity refers to the appropriateness, meaningfulness, and usefulness of any inference made from a test score (American Psychological Association (APA), American Educational Research Association (AERA), & National Council on Measurement in Education (NCME), 1985), or the level of confidence one can have in such an inference (Shimberg, 1990). Hence, test validity is actually the level of confidence with which an examinee's test score could be used to infer the ability under measurement possessed by the examinee.
According to Ebel (1983), the concept of test validity has been evolving for more than half a century: from Thorndike's (1918) idea of logical (intrinsic rational) and experimental (empirical) validity, through the idea of the four Cs: content, criterion, concurrent, and construct validity (APA et al., 1966), to the current emphasis on construct validity (Gardner, 1983). To Gardner (1983), one "would be willing to accept the assertion of several test specialists that all validity can be subsumed under the term construct validity" (p. 13). The differences in types of validity are seen to depend only upon the kind of inference one might wish to draw from test scores (APA et al., 1985). But in many ways, "construct" interpretations of test results "are likely to make more sense ... than interpretation based on the content validity model" (Shimberg, 1990, p. 13), and more so than interpretations based on other models. While with the previous emphasis a test could be deemed valid to the extent that its items are seen to represent a domain of content or universe of behavioural indicants (Kerlinger, 1986), with the current emphasis, in addition to this, a test is valid to the extent that scores from it reflect or are sustained by the ability under measurement, or to the level that the ability under measurement could be inferred from such scores.
Even if the test construction procedure is faithfully and logically followed, and hence a scientific sampling of subject matter or behavioural domains is done, the resulting test scores might not reflect the ability under measurement. This is because the behaviour of the examinee during the examination, and of the examiner during test administration and test scoring (especially with essay tests), combine to influence the validity of test scores. For example, the current general tendency to look down on results from public examinations in Nigeria implies that the ability of graduates from Nigerian secondary and tertiary institutions could not be legitimately inferred from their test scores. The public is not saying that the items in the tests they took do not reflect or represent the subject matter or behavioural domains, but that the ability that sustained their performance on the test might not be that which the test was actually designed to measure. Though validity can be built into a test during construction, scores from it might fail to reflect, or be sustained by, the ability which the test was so well designed to measure. In other words, ensuring that a valid test is developed is a necessary but not a sufficient test validation procedure. A 'valid' test can give invalid test scores.
Most definitions of a test rightly imply that a psychological test is an inferential tool. For example, Nenty (1985a) defined a test as a set of tasks, stimuli, questions, or statements systematically constructed, selected and administered under controlled conditions to elicit a sample of human behaviour from which inferences about the testee's total behaviour can be drawn. According to Gardner (1983), "tests are instruments that are designed to elicit responses and it is the inferences one can draw from the responses that need to be validated" (p. 13). It is not the test itself. But since valid test scores cannot come from invalid test items, validating both is a necessary and sufficient procedure in test validation. Hence the degree of confidence one has in making any inference about a learner's ability from his test score is a direct function of both the extent to which the items represent the domain that defines that ability and the extent to which it is only that ability that sustains responses to these items.
Learners' ability, or their level of competence or knowledge, is developed through the influence of a combination of inputs from the curriculum and from the teacher via instruction. Hence, it is common to talk about the curricular and instructional validities of test scores. According to Poggio, Glasnapp, Miller, Tollefson and Burry (1986), validation investigations are done "to examine the representativeness of skills tested, the fit of test items to tested skills, the educational opportunity to learn the content, skill or knowledge contained in an item, and the content equivalence of tests ..." (p. 20). Each test item is a translation of subject matter content into tasks, questions, or statements which, when read, elicit the intended behaviour (as specified based on Bloom's taxonomy) as may be specified in the course objective (Nenty, 1985a). For example, a learner can have a high level of analytic skill (behaviour) in the English language (subject matter) but a low level of the same skill when it comes to mathematical expressions, or be able to analyze but not as good at differentiating (behaviour) mathematical expressions. In other words, the same type of behaviour could be exhibited across subject matter areas, and different types of behaviour could be required within the same subject matter area; a test is valid to the extent that the behaviour required to respond correctly to its items represents the totality of behaviours intended by the curriculum or by instruction. For example, in testing for the ability to perform fundamental mathematics operations with numbers, a set of items that involve only the addition (behaviour) of two-digit numbers (subject matter) cannot be said to be valid either content-wise or behaviour-wise.
To differentiate among the three types of validity involved here: test scores are content valid to the extent that they could sustain inferences to educational objectives and domain specifications; they are curricular valid to the extent that they could sustain inferences to the curriculum materials (for example, the contents and specified objectives in textbooks) used in school; and they are instructionally valid to the extent that they could sustain inferences to the objectives and content of the instruction actually provided to learners in the classrooms.
SOURCES OF INVALIDITY
Measurement is a process of searching for or determining the true value or amount of an ability, knowledge, skill, characteristic or behaviour possessed by an individual, object or event. In education, it involves the construction, administration, and scoring of instruments or tests. According to Airasian and Madaus (1983), "a test is a sample of behaviours from a domain about which a user wishes to make inference" (p. 104). In the light of item response theory (IRT), when a person walks into a room in which his ability (θ) is to be measured through testing, he takes along with him this ability (Warm, 1978). Theta, the person parameter, is latent and not observed. A task is relevant if it is such that it would provide a suitable or maximum provocation to the particular ability under measurement. Each task or test item also has a minimum ability it demands before it could be overcome, and this it also brings to the testing situation. This minimum ability-demand is termed beta (β), or the item parameter, and is also latent. During testing both parameters are brought together in a confrontational mood, in a person-by-item encounter, or better, a theta-by-beta encounter (Nenty, 1995). The result of such an encounter can be used with a high level of confidence to infer theta, to the extent that it was the interaction involving theta alone that brought about the observed result.
Let us assume that we are dealing with mathematics ability (θ). One cannot measure θ without calling on other abilities like communication skills, for example, reading, comprehension, and writing skills. These abilities compound, or form part of, the expressed mathematics ability. Let us denote them by eta (η). Besides the systematic or predictable influence of η on test scores, there is also the random influence of measurement error. If we denote this by epsilon (ε), then the observed score, or the outcome of measurement (X), resulting from the encounter could be expressed in the form of the general linear model thus (Schmidt, 1983; Nenty, 1991a; Ackerman, 1992):

Xi = θi + ηi + εi .................... (1)

While the test score variance accounted for by ε reflects the unreliability of the test scores, that accounted for by η reflects the invalidity of the test scores. Eta (η) is intended here as a conglomeration of extraneous variables whose influence systematically distorts the result of testing, and hence reduces the level of confidence with which θ can be inferred from X.
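As a rough numerical illustration of Formula 1 (a sketch only, with invented variance figures), the following Python fragment simulates observed scores as the sum of the ability under measurement (θ), a systematic extraneous component (η), and random error (ε), and shows how the η and ε components weaken the link between X and θ:

```python
import numpy as np

rng = np.random.default_rng(42)
n_examinees = 1000

# Hypothetical components of Formula 1: X = theta + eta + epsilon.
theta = rng.normal(50, 10, n_examinees)    # ability under measurement
eta = rng.normal(0, 6, n_examinees)        # systematic extraneous influences (invalidity)
epsilon = rng.normal(0, 4, n_examinees)    # random measurement error (unreliability)

x = theta + eta + epsilon                  # observed test scores

# The confidence with which theta can be inferred from X shrinks as the
# eta and epsilon variances grow relative to the theta variance.
print("var(theta):", theta.var().round(1))
print("var(eta):  ", eta.var().round(1))
print("var(eps):  ", epsilon.var().round(1))
print("corr(X, theta):", np.corrcoef(x, theta)[0, 1].round(3))
```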


Given the latent or hypothetical nature of human characteristics, educational measurement is unavoidably indirect and inferential in nature, and hence inevitably involves the operation of some extraneous variables. Any test or item characteristic, any person's action or behaviour exhibited before, during or after testing, or any interaction among these that leads to an increase or decrease in test scores without a corresponding pre-testing increase or decrease in the ability, knowledge or skill under measurement is extraneous; it increases the η influence and thus reduces the level of confidence with which the ability (θ) could be inferred from such test scores.
Factors that invalidate test scores are examiner-, examinee-, as well as item-related. Seemingly item-related sources are basically examiner-related because they emanate from poor test construction. The most important item-related source of invalidity of test scores is the lack of unidimensionality of our test items. In other words, items are constructed such that they call on abilities other than the one under measurement. For example, when responding to an essay item, the examinee must (a) first of all read and understand the question; (b) think of the answer; and (c) think of how to present it in order to ensure that the examiner understands exactly what his answer is. This last step includes (i) the right choice of words, (ii) good sentence structure, (iii) good grammar and punctuation, (iv) logicality in, and coherence of, presentation, and, last but not least, (v) good or readable handwriting (Nenty, 1985a).
In the case of an objective test item, according to Lazarus and Taylor (1977), an examinee must (a) first read and understand the item; then (b) think of the answer, but does not have to write it down; instead he must (c) try to find the answer among the options given; (d) pick the one that he deems most appropriate; (e) keep track of the item number and the letter representing the option he has chosen in order to be able to (f) shade in the appropriate space on the answer sheet. In both cases, if an examinee knows the correct answer to an item but lacks any of these extraneous skills, he is likely to get the item wrong (Nenty, 1985a). This will cause the resulting test score to fail to reflect the actual ability under measurement; hence it could not sustain a valid inference of that ability. Two examinees with the same ability on what is being measured might end up with significantly different scores because of the influence of extraneous skills on which they differ. This is an example of item bias.
According to Nenty (1986, 1994), if one or other significant extraneous factors which influence performance on a test or an item discriminate systematically among groups of examinees, then the test or item is said to be biased. Generally, scores from a biased test are invalid. In fact, "item bias is attributable to the degree of lack of item validity" (Ackerman, 1992, p. 70). Words, illustrations, examples, concepts, ideas, etc., used in items might appeal to one group more than others and hence lead to differences in group performance, not because the groups differ in the ability under measurement but as a result of the differential influence of these extraneous factors. Whenever a test calls on more than one skill, the likelihood of the scores from it being invalid is high. Measuring with a biased test is like measuring with an elastic ruler that stretches for one person or group and shrinks for the other.
Another source of item-related invalidity is proneness to guessing. Multiple-choice items with heterogeneous options are more prone to guessing than those with homogeneous options. Guessing is a source of multidimensionality in a test as it introduces a different, and oftentimes significant, dimension into test performance. By guessing answers to test items, the examinee introduces another "ability" besides that which the items were designed to measure. The scores made through guessing cannot sustain valid inference to the ability under measurement, as it was not the ability to guess that was being measured. Similar examinee-related sources of invalidity when responding to affective instruments include faking, response style and impersonation. Other examinee-related sources of invalidity are the level of test-wiseness or test-sophistication and other examination-related behaviours of the examinees, like anxiety, motivation, etc.
Currently, the most important and disturbing examinee-related source of invalidity in our tests is examination malpractice. Examination malpractice is a psychometric as well as a social problem. As a psychometric problem, it interferes with any objective attempt to estimate examinees' ability by invalidating test scores. Any form of examination malpractice inflates the scores of those who practise it and renders such scores useless for any evaluation purpose. Examination malpractice is also an examiner-related source of invalidity because teachers and other examiners, and even parents, are known to aid and abet malpractice during examinations. Hence the behaviour of examiners, including teachers, and even parents, is a source of invalidity for our test scores, especially in public examinations.
As indicated by Nenty (1985a), during testing in general, and with essay tests in particular, the examiner is a part of the measurement instrument. He sets the test, administers it and scores it. His behaviour during each of these processes has a significant influence on the resulting test scores. The first issue is his ability and willingness to follow a step-by-step systematic procedure in developing the test items. This involves a random sampling of test content from a well-defined domain in order to ensure that the item content provides a scientific representation of both the subject matter and behavioural contents on which the measurement is based. If the contents of test items are not adequate representatives of the domain contents, the basis for inferring the domain behaviour (ability, given a well-specified domain content) of the examinees from their test scores is very weak. According to Schmidt (1983):

The question of what constitutes a representative sample is important since the degree of "non-representativeness" contributes to the discrepancy E(yi) - Zi. Thus the extent to which items are non-representative contributes to the amount of bias and correspondingly to the degree of invalidity (pp. 166-167).

[In this presentation, ηi = E(yi) - Zi.]
Secondly, teachers' personal characteristics, as displayed during test administration or as perceived by the examinee, have been found to influence examinees' performance significantly. And lastly, especially with essay tests, the score that results from the scoring process does not depend only on the examinee's response, but also on who scores the response. For example, in a series of studies by Starch and Elliott (reported in Mouly, 1978, p. 75), scores given by 180 different scorers to the same response ranged from 28 to 92. From which of the 180 scores awarded do we infer the examinee's ability?
The discrepancies between test content (content actually tested) on the one hand, and each of curriculum content (content as specified in the curriculum), content of curricular materials (textbook contents), and instructional content (content actually covered in class) on the other, are sources of invalidity of test scores, especially when validation studies are done across classrooms or schools. Differences in curricular and instructional emphasis and coverage, even given the same curriculum, have significant influences on test scores across classrooms. Generally, every factor, be it item-, examiner- or examinee-related, other than θi, that contributes significantly to differences in the size of test scores enlarges ηi, and thus increases the invalidity of test scores.
METHODS OF TEST VALIDATION
According to Kerlinger (1986), "the subject of validity is complex, controversial, and peculiarly important .... Here perhaps more than anywhere else, the nature of reality is questioned .... It is not possible to study validity ... without sooner or later inquiring into the nature and meaning of one's variables" (p. 416). Given the very complex nature of validity, test validation, or determining the level of confidence with which the variable being measured could be inferred from scores generated from a test designed to measure it, is not a simple one-shot affair like, for example, determining the reliability of a test. It is a complex, theoretically and empirically involving process. Unlike test reliability, one single index cannot satisfactorily justify the validity of a test. Validation is in fact a theory-backed empirical process of searching for the truth about the theoretical as well as the empirical nature or meaning of a variable under measurement. In this case, it is a research study in which hypotheses about the nature and meaning of the variable under measurement are stated and tested using scientifically generated data. A test validation process should be able to ascertain the level to which human behaviour reflects the totality of a clearly specified universe of skills, knowledge, indicants (Kerlinger, 1986), etc., that operationally defines the variable under consideration, and relates predictably to several behaviours that represent some other variables. The investigative process of gathering, analyzing and evaluating the necessary data with which hypotheses involving the nature and meaning of a variable are tested is called validation. There are two aspects to this: (i) an evaluation of the process used, or provision made, for the development, administration and scoring of the test; and (ii) determining, through a series of empirical hypothesis-testing investigations, the actual level of confidence with which scores from the test could be used to infer a domain behaviour.
Over the years, methods of test validation have evolved along with the meaning of validity. According to APA et al. (1974), content validity "is determined by a set of operations, and one evaluates this by the thoroughness and care with which these operations have been conducted" (p. 29). The current emphasis on subsuming all types of validity under construct validity does not discard this procedural or operational investigation, because "evidence of construct validity may also be inferred from procedures followed in developing a test" (APA et al., 1974, p. 30). This procedural investigation involves determining:
I. The exhaustiveness and clarity with which the elements of the domain or universe of behavioural and subject matter contents have been identified and itemized;
II. The adequacy of sampling to ensure representativeness of the test content of this well-defined domain or universe of contents;
III. The clarity with which the sampled content is translated into appropriate tasks or statements that do not demand abilities other than the one under measurement, and which discriminate among examinees who differ in this ability;
IV. The adequacy with which provision is made for ideal administration and testing conditions that do not allow for adverse physical and psychological effects on the examinees, nor encourage favouritism and examination malpractices of any kind; and
V. The adequacy of the provision made to ensure objectivity in scoring responses to the test items.
The determination of "the thoroughness and care with which these operations have been conducted" is done through expert judgement based on the results of an inspection of records of the test development process; content analysis, including taxonomical analysis; and an examination of the provision for objectivity in scoring. These methods of procedural evaluation are also used to determine curricular validity, by quantifying the overlap between the contents of textbooks and other curricular materials and the test contents, and instructional validity, by quantifying the congruency between the content of the instruction actually provided to learners and the test content (Mehrens & Phillips, 1987).
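A minimal sketch of how such an overlap could be quantified is given below; the content codes and the simple proportion-of-overlap index are hypothetical illustrations, not the procedures used by Mehrens and Phillips (1987):

```python
# Each test item is tagged with the content element it samples (hypothetical codes).
test_item_content = ["fractions", "decimals", "ratio", "algebra", "geometry"]

# Content elements actually covered by the textbook and by classroom instruction (invented).
textbook_content = {"fractions", "decimals", "ratio", "percentages"}
instruction_content = {"fractions", "decimals"}

def overlap(test_content, covered):
    """Proportion of test items whose content element was covered."""
    return sum(c in covered for c in test_content) / len(test_content)

print("curricular overlap:   ", overlap(test_item_content, textbook_content))     # 0.6
print("instructional overlap:", overlap(test_item_content, instruction_content))  # 0.4
```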
Empirically, test validation is no longer seen as a one-shot analysis to determine the correlation between test scores on the ability or characteristic under measurement and scores, which themselves might not be valid, on other concurrent, future, or criterion behaviours, in order to ascertain concurrent, predictive and criterion validity respectively. The current emphasis is on validation research during which hypotheses on the relationships between test scores on the ability or characteristic measured, using a variety of methods, and scores on several other theoretically related and unrelated variables are stated and tested. Such variables might include valid criterion variables. The main question for such studies is: to what extent do scores on tests designed, using a variety of measurement methods, to measure our ability or characteristic of interest relate to scores on several other variables which are theoretically related or unrelated to our ability or characteristic of interest, also measured using a variety of methods? Validity can be established by testing hypotheses that address the following concerns:
i. How well the scores from several measures of our variable of interest, using a variety of methods, relate to each other – convergence analysis (Campbell & Fiske, 1959; Kerlinger, 1986);
ii. How well the scores from several measures of our variable relate to those on appropriate valid criterion measures;
iii. How well the scores from several measures of our variable relate to scores on other variables which are theoretically related to our variable – convergence analysis (Campbell & Fiske, 1959; Kerlinger, 1986);
iv. How well the scores from several measures of our variable fail to relate to scores from other variables which are theoretically unrelated to our variable – discriminability analysis (Campbell & Fiske, 1959; Kerlinger, 1986);
v. How well scores from it tend to cluster with a set of other variables, not because of similarity in method of measurement but because of similarity in underlying trait – convergence analysis (Campbell & Fiske, 1959);
vi. What the loading patterns on the dimensions inferred from factor analyzing the resulting matrix of interrelationships look like (APA et al., 1985; Kerlinger, 1986);
vii. What simpler factors can explain the variance of our test scores, and how much of it each can explain; which factor variances our test scores can explain, and how much they can explain (Kerlinger, 1986); and
viii. Whether there are significantly differentiating characteristics between those who have high scores and those who have low scores on our variable (APA et al., 1985).
Existing methods of analysis involved in testing related hypotheses include Campbell and Fiske's (1959) multitrait-multimethod matrix analysis, analysis of variance, regression and discriminant analysis, factor analysis, path analysis and other related multivariate analyses. It is through rigorous empirical studies that the meaning and nature of a variable can be determined. The rationale behind these analyses is that if scores generated to represent a theoretically defined variable validly represent it, then scores from different methods of measuring the variable should converge, not because of commonality in method but because of commonality in underlying trait; those from measuring variables that are theoretically related to it should also converge; but scores on variables not theoretically related to it should be uncorrelated with scores from our variable. Besides investigations into the holistic nature, meaning, and behaviour of the variable as measured, the current trend in test validation also calls for the validation of each of the items that make up the instrument. Such investigations involve analyses like item bias analysis and item response pattern analysis.
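This rationale can be illustrated with a small correlation check on simulated data (a sketch only; the trait and method labels are hypothetical, and the analysis is far simpler than a full Campbell and Fiske multitrait-multimethod study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two hypothetical traits: mathematics ability and sociability (theoretically unrelated).
math_ability = rng.normal(size=n)
sociability = rng.normal(size=n)

# The same trait measured by two different hypothetical methods (written and oral tests),
# plus a measure of the unrelated trait.
math_written = math_ability + rng.normal(scale=0.5, size=n)
math_oral = math_ability + rng.normal(scale=0.5, size=n)
sociability_rating = sociability + rng.normal(scale=0.5, size=n)

def r(a, b):
    return round(float(np.corrcoef(a, b)[0, 1]), 2)

# Convergence: different methods, same underlying trait, should correlate highly.
print("math written vs math oral:  ", r(math_written, math_oral))
# Discriminability: theoretically unrelated traits should correlate near zero.
print("math written vs sociability:", r(math_written, sociability_rating))
```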
ANALYSIS OF TEST AND ITEM BIAS
Test validation also involves determining whether some extraneous factors have a significantly systematic influence on test performance. The most compelling of such factors is item or test bias. Wiley, Haertel and Harnischfeger (1981), Schmidt (1983), and Ackerman (1992) have identified invalidity with item or test bias. The invalidity component in our general linear model (see Formula 1), eta (η), is associated with item or test bias. During testing, if the influence of any extraneous factor systematically favours one group of examinees over the other, then the test is said to be biased. Technically, an item is unbiased if, for all examinees of equal ability (i.e. equal total score on a test that contains the item), the probability of a correct response is the same regardless of each examinee's group membership (Scheuneman, 1979). Or, in terms of item response theory (IRT), an item is unbiased if the groups' item response functions (see Nenty, 1995) are identical (Lord, 1980). Test bias causes two examinees who have the same ability (θ) on what is being measured to end up with different observed scores (X). In that case, using observed scores for decision making, as in selection, promotion, or graduation, often leads to unfair and invalid decisions. Scores from any test on which important decisions will be based must be tested for possible bias. If this is done at the pilot-testing stage, items identified as biased could be revised or changed. According to Ackerman (1992), "it should be apparent that as much thought must go into the analysis of bias for a test as went into the original construction of the test" (p. 90).
There are several methods of determining item bias. These can be grouped under: (i) item-parameter-related methods; (ii) chi-square/probability-related methods; (iii) analysis of variance, regression, and log-linear-related methods; and (iv) methods based on item response theory. These have been reviewed and compared (Ironson & Subkoviak, 1979; Nenty, 1979, 1986; Jensen, 1980; Shepard, Camilli & Averill, 1981; Marascuilo & Slaughter, 1981), and chi-square/probability-related methods have been found to be generally valid and mathematically less demanding methods of detecting item bias. Studies using these methods with Nigerian samples (Nenty, 1986; Umoinyang, 1991; Abiam, 1996) have found significant item bias in Nigerian ability and achievement tests at the secondary and primary school levels.
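In the spirit of the chi-square/probability-related methods, the sketch below groups examinees into total-score bands and compares the correct/incorrect split across two groups within each band; it is an illustrative approximation rather than Scheuneman's (1979) exact statistic, and the data are simulated:

```python
import numpy as np
from scipy.stats import chi2

def item_bias_chi_square(item_correct, total_score, group, n_bands=4):
    """Sum chi-square contributions from group-by-correctness tables within score bands."""
    edges = np.quantile(total_score, np.linspace(0, 1, n_bands + 1))
    band = np.digitize(total_score, edges[1:-1])   # band index 0 .. n_bands-1
    stat, df = 0.0, 0
    for b in range(n_bands):
        in_band = band == b
        n_band = in_band.sum()
        if n_band == 0:
            continue
        for g in np.unique(group):
            for correct in (0, 1):
                observed = np.sum(in_band & (group == g) & (item_correct == correct))
                expected = (np.sum(in_band & (group == g)) *
                            np.sum(in_band & (item_correct == correct)) / n_band)
                if expected > 0:
                    stat += (observed - expected) ** 2 / expected
        df += 1  # roughly one degree of freedom per 2x2 band table
    return stat, df, chi2.sf(stat, df)

# Simulated example: 400 examinees in two groups responding to one item.
rng = np.random.default_rng(1)
group = rng.integers(0, 2, 400)
total_score = rng.integers(10, 40, 400)
item_correct = rng.integers(0, 2, 400)
print(item_bias_chi_square(item_correct, total_score, group))
```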
ANALYSIS OF ITEM RESPONSE PATTERN (IRP)
Another test validation analysis, the analysis of item response patterns (IRP), is based on the assumption that if all the items in an objective test measured one and only one ability, and they are ordered according to their estimated difficulties, from the easiest to the most difficult, then some regularity in each examinee's pattern of responses to these items would be expected. Similarly, if examinees are arranged in order of their estimated ability, from the one with the highest to the one with the lowest score, then, to the extent that the test measures one and only one ability, regularity would be expected in the pattern of responses across the examinees (Nenty, 1994). With a perfectly valid test, and given this arrangement of items, the ideal response pattern for an examinee is a string of successes (ones) over the less demanding items which gradually peters out into a string of failures (zeroes) over the more demanding items. Similarly, for an item, a string of successes by the more able examinees should gradually peter out into a string of failures by the less able examinees (Wright & Stone, 1978; Nenty, 1987). This would give a perfect Guttman scale. The extent of observed deviations from these hypothesized patterns signals the extent to which responses to the test items call for abilities other than the one under measurement, and hence are invalid.
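A minimal sketch of this idea follows, using invented 0/1 response data; the count of Guttman "reversals" is only a crude indicator of deviation from the ideal pattern, not a formal index:

```python
import numpy as np

# Rows = examinees, columns = items (1 = correct, 0 = wrong); invented data.
responses = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],   # irregular: missed an easier item, passed harder ones
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],   # irregular: passed only the easiest and the hardest item
])

# Order items from easiest to hardest and examinees from highest to lowest score.
item_order = np.argsort(-responses.sum(axis=0))
person_order = np.argsort(-responses.sum(axis=1))
ordered = responses[np.ix_(person_order, item_order)]

def guttman_reversals(pattern):
    """Count (wrong easier item, right harder item) pairs in one response row."""
    return sum(
        1
        for i in range(len(pattern))
        for j in range(i + 1, len(pattern))
        if pattern[i] == 0 and pattern[j] == 1
    )

for row in ordered:
    print(row, "reversals:", guttman_reversals(row))
```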
Methods for analyzing item response patterns have been reviewed and compared (Harnisch & Linn, 1981; Harnisch & Tatsuoka, 1983), and Sato's Modified Caution Index (C) (Sato, 1975) has always been recommended because of its validity and its mathematically less demanding computational procedures. This has been used with achievement test data in Nigeria (see Nenty, 1987, 1991b & 1992, also for computational procedures). In studies that amounted to test validation using item response pattern analysis, Harnisch and Linn (1981), Harnisch (1983), Miller (1986) and Nenty (1991a, 1991b) identified the types of items that functioned differently within and across individuals, classrooms, schools, zones, states, and regions. They attributed unusual patterns of responses by individuals to guessing, carelessness, and cheating, and those by classrooms to differences in content coverage and emphasis, hence evidence of a mismatch between what is being tested and what was taught in each school.
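One common formulation of Sato's caution index compares the covariance between an examinee's observed response pattern and the item totals with the covariance that a perfect Guttman pattern carrying the same total score would produce; the Modified Caution Index of Harnisch and Linn (1981) mainly adjusts the normalization. The sketch below implements that common formulation on invented data and should not be taken as the exact computational procedure referred to above:

```python
import numpy as np

def caution_index(response_row, item_totals):
    """Caution index: 1 - cov(observed pattern, item totals) / cov(Guttman pattern, item totals)."""
    order = np.argsort(-np.asarray(item_totals))      # easiest (highest total) item first
    u = np.asarray(response_row)[order]
    n = np.asarray(item_totals, dtype=float)[order]
    r = int(u.sum())                                  # examinee's total score
    guttman = np.zeros_like(u)
    guttman[:r] = 1                                   # perfect pattern: passes the r easiest items
    num = np.dot(u, n) - r * n.mean()
    den = np.dot(guttman, n) - r * n.mean()
    return 1.0 - num / den if den != 0 else 0.0

# Invented data: item totals (number of examinees answering each item correctly).
item_totals = np.array([48, 40, 33, 21, 9])
print(caution_index([1, 0, 1, 0, 1], item_totals))  # irregular pattern -> larger index
print(caution_index([1, 1, 1, 0, 0], item_totals))  # Guttman-consistent pattern -> 0.0
```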
ITEM RESPONSE THEORY, TEST DIMENSIONALITY AND VALIDITY
Given the notations indicated earlier, item response theory (IRT) in its simplest form has it that the probability of a person i answering item j correctly (Pij) is a function of the difference between the person's θ and the item's β. For a dichotomously scored item, this is:

Pij = e^(θi - βj) / [1 + e^(θi - βj)] .................... (2)

And this holds only when certain conditions are met. The most important of these is the unidimensionality assumption. This all-involving assumption has it that all the items in a test must be constructed, administered and scored to ensure that they all measure one, and only one, ability, area of knowledge or behaviour (Nenty, 1995). In other words, test items must be constructed, administered, and scored to ensure that the resulting scores are sustainable only by the ability, area of knowledge, or behaviour under measurement. A close consideration of what is involved in ensuring the unidimensionality of a test shows a very close similarity with the demands for maximum test validity. In fact, with the current level of testing technology, validity can only be meaningfully defined for a unidimensional test.
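Assuming the one-parameter (Rasch) form of Formula 2, the probability of success in a theta-by-beta encounter can be evaluated directly; the θ and β values below are invented for illustration:

```python
import math

def p_correct(theta, beta):
    """Rasch form of Formula 2: probability of success in a theta-by-beta encounter."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# When ability just matches the item's demand the chance of success is 0.5;
# it rises as theta exceeds beta and falls as beta exceeds theta.
for theta, beta in [(0.0, 0.0), (1.5, 0.0), (0.0, 1.5)]:
    print(f"theta={theta:+.1f}, beta={beta:+.1f} -> P(correct)={p_correct(theta, beta):.2f}")
```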
According to Nenty (1994), basic to the achievement of specific objectivity in psychological measurement, with all its accompanying advantages (see Nenty, 1984), is the fact that a test must be designed, administered, and scored in such a way that one, and only one, ability accounts for an examinee's score on it. While this might not be a serious concern for classical test theory (CTT), it is a basic assumption for the valid operationalization of any of the three currently operational IRT models. But even for CTT, if this does not hold, the definitions of the classical item parameters are invalid. Item calibration that is independent of the calibrating sample, and person characterization that is independent of the particular set of items used, are the advantages of specific objectivity in measurement (Nenty, 1985b; Wright, 1967). When any factor or ability other than that under measurement contributes to an examinee getting a test item wrong or right, then the test is multidimensional. Item bias is one of such factors. Analysis of test dimensionality is a complex, mathematically involving and all-embracing test validation procedure (see Warm, 1978; Stout, 1987; Akpan, 1996). The last two analyses presented above have implications for test dimensionality. In a perfectly unidimensional test, the eta (η) component in our general linear model (see Formula 1) is zero, and hence the test has maximum validity.
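As a rough illustrative screen (not Stout's nonparametric procedure or any of the analyses cited above), one can inspect the eigenvalues of the inter-item correlation matrix: when a single dominant dimension runs through the items, the first eigenvalue dwarfs the rest. The data below are simulated:

```python
import numpy as np

rng = np.random.default_rng(7)
n_examinees, n_items = 500, 10

# Simulate dichotomous responses driven by a single latent ability (unidimensional case).
theta = rng.normal(size=(n_examinees, 1))
beta = np.linspace(-1.5, 1.5, n_items)
prob = 1.0 / (1.0 + np.exp(-(theta - beta)))
responses = (rng.random((n_examinees, n_items)) < prob).astype(int)

# Eigenvalues of the inter-item correlation matrix as a crude dimensionality screen.
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(responses, rowvar=False)))[::-1]
print("largest eigenvalues:  ", np.round(eigvals[:3], 2))
print("first-to-second ratio:", round(eigvals[0] / eigvals[1], 2))
```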
CONCLUSION AND RECOMMENDATIONS
The level of confidence with which scores from test items could be used to infer the ability under measurement depends on both (i) how well the items represent the domain that defines the ability under measurement, and (ii) the extent to which it is this ability alone that sustains responses to these items. Highly representative items with good characteristics cannot give valid results if responses to them are not sustained by the ability to which inference is intended; neither could responses that are sustained by the desired ability but made on unrepresentative items with poor measurement qualities. Both logical and empirical validation processes are therefore necessary and sufficient for ascertaining test validity. While logical or intrinsic rational validation procedures try to determine the level of test validity in terms of item representativeness and item quality, the empirical procedures try to determine the level of confidence with which a given type of inference can be drawn from scores on the test.
Given the latent or hypothetical nature of a psychological variable, the validation of a test designed to measure it requires a variety of relevant data or information from many sources, theoretically related or unrelated to the variable, with which to determine the quality and functioning of scores from the test. This implies a rigorous and preferably hypothesis-based investigation to determine the nature and meaning of such a variable.
One of the most important challenges in managing educational assessment in Nigeria is the production of valid test scores and grades through our institutional and public examinations. In order to generate scores that would sustain valid and fair decisions in schools and in the public, especially in a multi-ethnic society like ours, all persons involved in the development, administration and scoring of tests for public and school examinations should be trained and retrained from time to time in current methods of ensuring and ascertaining validity in testing. The most obnoxious source of test invalidity in Nigeria, and hence the biggest challenge to the management of educational assessment in the country, is examination malpractice (Nenty, 1987). To check this escalating menace to the learner, to the school and to the public, the government should create a "Centre for the Study and Prevention of Examination Malpractice" with an enforcement arm. Such a centre should instigate, sponsor, and coordinate studies on this malady, ensure strict enforcement of relevant decrees, and work with public examination bodies to ensure malpractice-free testing in Nigeria.
REFERENCES
Abiam, P.O. (1996). Analysis of differential item functioning (DIF) of 1992 First School Leaving Certificate (FSLC) mathematics examination in Cross River State of Nigeria. Unpublished M.Ed. thesis, University of Calabar.

Ackerman, T.A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29(1), 67-91.

Airasian, P.W., & Madaus, G.F. (1983). Linking testing and instruction: Policy issues. Journal of Educational Measurement, 20(2), 103-118.

Akpan, G.S. (1996). Speededness effect in assessing the dimensionality of agricultural science examination. Unpublished M.Ed. thesis, University of Calabar.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education (1966). Standards for educational & psychological tests. Washington, D.C.: American Psychological Association, Inc.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education (1974). Standards for educational & psychological tests. Washington, D.C.: American Psychological Association, Inc.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education (1985). Standards for educational & psychological tests. Washington, D.C.: American Psychological Association, Inc.

Campbell, D., & Fiske, D. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 54, 81-105.

Ebel, R.L. (1983). The practical validation of tests of ability. Educational Measurement: Issues and Practice, 2(2), 7-10.

Gardner, E.F. (1983). Intrinsic rational validity: Necessary but not sufficient. Educational Measurement: Issues and Practice, 2(2), 13.

Harnisch, D.L. (1983). Item response patterns: Applications for educational practice. Journal of Educational Measurement, 20(2), 191-206.

Harnisch, D.L., & Linn, R.L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18(3), 133-146.

Harnisch, D.L., & Tatsuoka, K.K. (1983). A comparison of appropriateness indices based on item response theory. In R.K. Hambleton (Ed.), Applications of item response theory. British Columbia: Educational Research Institute.

Ironson, G.H., & Subkoviak, M. (1979). A comparison of several methods of assessing item bias. Journal of Educational Measurement, 16, 209-225.

Jensen, A.R. (1980). Bias in mental testing. New York: The Free Press.

Kerlinger, F.N. (1986). Foundations of behavioural research. Fort Worth: Holt, Rinehart and Winston, Inc.

Lazarus, & Taylor, E.F. (1977, May 1). The debate: Pro and con of testing. New York Times.

Linn, R.L., & Harnisch, D.L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18(2), 109-118.

Marascuilo, & Slaughter, P.E. (1981). Statistical procedures for identifying possible sources of item bias based on χ² statistics. Journal of Educational Measurement, 18(4), 229-248.

Mehrens, W., & Phillips, S.E. (1987). Sensitivity of item difficulties to curricular validity. Journal of Educational Measurement, 24(4), 357-370.

Miller, M.D. (1986). Time allocation and patterns of item response. Journal of Educational Measurement, 23(2), 147-156.

Mouly, G.J. (1978). Educational research: The art and science of investigation. Boston: Allyn and Bacon.

Nenty, H.J. (1979). An empirical assessment of the culture fairness of the Cattell Culture Fair Intelligence Test using the Rasch latent trait measurement model. Unpublished doctoral dissertation, Kent, Ohio, USA.

Nenty, H.J. (1984, April 27). Objectivity in psychological measurement: An introduction. Paper presented at the WAEC monthly seminar, Lagos, Nigeria.

Nenty, H.J. (1985a). Fundamentals of educational measurement and evaluation. Unpublished manuscript, University of Calabar.

Nenty, H.J. (1986). Cross-cultural bias analysis of Cattell Culture-Fair Intelligence Test. Perspectives in Psychological Researches, 9(1), 1-16.

Nenty, H.J. (1987). Factors that influence students' tendency to cheat in examinations. Journal of Education in Developing Areas, vi & vii, 70-78.

Nenty, H.J. (1987, August 28). Item response pattern: Causes of its variability and its uses in educational practices. Paper presented at the WAEC monthly seminar, Lagos, Nigeria.

Nenty, H.J. (1991a, March). Background characteristics and students' pattern of responses in mathematics examination. Paper presented at the 7th annual conference of the National Association of Educational Psychologists, Ahmadu Bello University, Zaria.

Nenty, H.J. (1991b, March). Student-problem (S-P) skill analysis of pupils' performance in common entrance mathematics examination. Paper presented at the 7th annual conference of the National Association of Educational Psychologists, Ahmadu Bello University, Zaria (submitted to WAJEVM, WAEC, for publication).

Nenty, H.J. (1992). Item response pattern: Causes of its variability and its uses in educational practices. The West African Journal of Educational and Vocational Measurement, 7, 1-12.

Nenty, H.J. (1994). Response pattern analysis of Cross River State 1986 common entrance mathematics examination: An aid to item selection. Paper presented at the inaugural meeting of the National Association for Educational Assessment (NAEA), June 16, ERC, Minna, Niger State (submitted to Journal of Educational Assessment, Minna, for publication).

Nenty, H.J. (1995). Introduction to item response theory. Unpublished paper, University of Calabar (submitted to the Global Journal of Pure and Applied Sciences for publication).

Poggio, J.P., Glasnapp, D.R., Miller, M.D., Tollefson, N., & Burry, J.A. (1986). Strategies for validating teacher certification tests. Educational Measurement: Issues and Practice, 5(2), 18-25.

Sato, T. (1975). The construction and interpretation of S-P tables. Tokyo: Meiji Tosho.

Scheuneman, J.D. (1979). A method for assessing bias in test items. Journal of Educational Measurement, 16, 143-153.

Schmidt, W.H. (1983). Content bias in achievement tests. Journal of Educational Measurement, 20(2), 165-178.

Shepard, Camilli, & Averill (1981). Comparison of procedures for detecting test item bias with both internal and external ability criteria. Journal of Educational Statistics, 6(4), 317-375.

Shimberg, B. (1990). Social considerations in the validation of licensing and certification exams. Educational Measurement: Issues and Practice, 9(4), 11-14.

Stout, W.F. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589-617.

Thorndike, E.L. (1918). The nature, purposes and general methods of measurements of educational products. In The measurement of educational products (17th Yearbook, Part II). Chicago: National Society for the Study of Education.

Umoinyang, I.E. (1991). Item bias in mathematics achievement test. Unpublished M.Ed. thesis, University of Calabar, Calabar.

Warm, T.A. (1978). A primer of item response theory. Springfield, VA: National Technical Information Services, US Department of Commerce.

Wiley, D.E., Haertel, E., & Harnischfeger, A. (1981). Test validity and national educational assessment: A conception, a method, and an example (Report No. 17). Chicago, Illinois: Northwestern University.

Wright, B.D. (1967). Sample-free test calibration and person measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.

Wright, B.D., & Stone, M.H. (1978). Best test design: A handbook for Rasch measurement. Chicago: The University of Chicago & Elmhurst College.