Nenty, H. J. (1996). Advances in test validation. In G. A. Badmus & P. I. Odor (Eds.), Challenges in managing educational assessment in Nigeria (pp. 59-69). Kaduna, Nigeria: National Conference on Educational Assessment.
CHAPTER 8
ADVANCES IN TEST VALIDATION
H. Johnson Nenty, University of Calabar, Calabar, Nigeria.
INTRODUCTION AND BACKGROUND
The increase in the size and variety of the school population, in the face of dwindling resources available for education and hence limited educational opportunities, calls for more efficient measurement processes for selecting, promoting, and graduating learners. The demand is for tests that ensure objectivity in the characterization of learners in terms of their ability and of items in terms of their parameters. This is in view of the important role test scores play in discriminating among learners and the vital decisions they sustain about the future of every Nigerian child. For such decisions to be fair and valid, the scores on which they are based must reflect nothing other than the ability, knowledge or skills which the test was designed to measure. This borders on test validity.
The various definitions of validity tend to converge around the idea that it is the degree to which scores from a test represent, and hence can be used to infer, that which the test was designed to measure. It follows that validity refers to the appropriateness, meaningfulness, and usefulness of any inference made from a test score (American Psychological Association (APA), American Educational Research Association (AERA), & National Council on Measurement in Education (NCME), 1985), or the level of confidence one can have in such an inference (Shimberg, 1990). Hence, test validity is actually the level of confidence with which an examinee's test score could be used to infer the ability under measurement possessed by the examinee.
According to Ebel (1983), the concept of test validity has been evolving for more than half a century: from Thorndike's (1918) idea of logical (intrinsic rational) and experimental (empirical) validity, through the idea of the four Cs: content, criterion, concurrent, and construct validity (APA et al., 1966), to the current emphasis on construct validity (Gardner, 1983). To Gardner (1983), one "would be willing to accept the assertion of several test specialists that all validity can be subsumed under the term construct validity" (p. 13). The differences in types of validity are seen to depend only upon the kind of inference one might wish to draw from test scores (APA et al., 1985). But in many ways, "construct" interpretations of test results "are likely to make more sense ... than interpretation based on the content validity model" (Shimberg, 1990, p. 13), and more so than interpretations based on other models. While with the previous emphasis a test could be deemed valid to the extent that its items are seen to represent a domain of content or universe of behavioural indicants (Kerlinger, 1986), with the current emphasis, in addition to this, a test is valid to the extent that scores from it reflect or are sustained by the ability under measurement, or to the level that the ability under measurement could be inferred from such scores.
Even if the test construction procedure is faithfully and logically followed, and hence a scientific sampling of subject matter or behavioural domains is done, the resulting test scores might not reflect the ability under measurement. This is because the behaviour of the examinee during the examination, and of the examiner during test administration and test scoring (especially with essay tests), combine to influence the validity of test scores. For example, the current general tendency to look down on results from public examinations in Nigeria implies that the ability of graduates from Nigerian secondary and tertiary institutions could not be legitimately inferred from their test scores. The public is not saying that the items in the tests they took do not reflect or represent the subject matter or behavioural domains, but that the ability that sustained their performance on the test might not be that which the test was actually designed to measure. Though validity can be built into a test during construction, scores from it might fail to reflect, or be sustained by, the ability which the test was so well designed to measure. In other words, ensuring that a valid test is developed is a necessary but not a sufficient test validation procedure. A 'valid' test can give invalid test scores.
Most definitions of a test rightly imply that a psychological test is an inferential tool. For example, Nenty (1985a) defined a test as a set of tasks, stimuli, questions, or statements systematically constructed, selected and administered under controlled conditions to elicit a sample of human behaviour from which inferences about the testee's total behaviour can be drawn. According to Gardner (1983), "tests are instruments that are designed to elicit responses and it is the inferences one can draw from the responses that need to be validated" (p. 13). It is not the test itself. But since valid test scores cannot come from invalid test items, validating both is a necessary and sufficient procedure in test validation. Hence the degree of confidence one has in making any inference about a learner's ability from his test score is a direct function of both the extent to which the items represent the domain that defines that ability and the extent to which it is only that ability that sustains responses to these items.
Learners' ability, or their level of competence or knowledge, is developed through the influence of a combination of inputs from the curriculum and from the teacher via instruction. Hence, it is common to talk about the curricular and instructional validities of test scores. According to Poggio, Glasnapp, Miller, Tollefson and Burry (1986), validation investigations are done "to examine the representativeness of skills tested, the fit of test items to tested skills, the educational opportunity to learn the content, skill or knowledge contained in an item, and the content equivalence of tests ..." (p. 20). Each test item is a translation of subject matter content into tasks, questions, or statements which, when read, elicit the intended behaviour (as specified based on Bloom's taxonomy) as may be specified in the course objective (Nenty, 1985a). For example, a learner can have a high level of analytic skill (behaviour) in the English language (subject matter) but a low level of the same skill when it comes to mathematical expressions, or be able to analyze but not as good at differentiating (behaviour) mathematical expressions. In other words, the same type of behaviour could be exhibited across subject matter areas, and different types of behaviour could be required within the same subject matter area; a test is valid to the extent that the behaviour required to respond correctly to its items represents the totality of behaviours intended by the curriculum or by instruction. For example, in testing for the ability to perform fundamental mathematics operations with numbers, a set of items that involve only the addition (behaviour) of two-digit numbers (subject matter) cannot be said to be valid either content-wise or behaviour-wise.
To differentiate among the three types of validity involved here: test scores are content valid to the extent that they could sustain inferences to educational objectives and domain specifications; they are curricular valid to the extent that they could sustain inferences to the curriculum materials (for example, the contents and specified objectives in textbooks) used in school; and they are instructionally valid to the extent that they could sustain inferences to the objectives and content of the instruction actually provided to learners in the classrooms.
SOURCES OF INVALIDITY
Measurement is a process of searching for or determining the true value or amount of an ability, knowledge, skill, characteristic or behaviour possessed by an individual, object or event. In education, it involves the construction, administration, and scoring of instruments or tests. According to Airasian and Madaus (1983), "a test is a sample of behaviours from a domain about which a user wishes to make inference" (p. 104). In the light of item response theory (IRT), when a person walks into a room in which his ability (θ) is to be measured through testing, he takes along with him this ability (Warm, 1978). Theta, the person parameter, is latent and not observed. A task is relevant if it is such that it would provide a suitable or maximum provocation to the particular ability under measurement. Each task or test item also has a minimum ability it demands before it could be overcome, and this it also brings to the testing situation. This minimum ability-demand is termed beta (β), or the item parameter, and is also latent. During testing both parameters are brought together in a confrontational mood, in a person-by-item encounter, or better, a theta-by-beta encounter (Nenty, 1995). The result of such an encounter can be used with a high level of confidence to infer theta, to the extent that it was the interaction involving theta alone that brought about the observed result.
Let us assume that we are dealing with mathematics ability (θ). One cannot measure θ without calling on other abilities like communication skills, for example, reading, comprehension, and writing skills. These abilities compound, or form part of, the expressed mathematics ability. Let us denote them by eta (η). Besides the systematic or predictable influence of η on test scores, there is also the random influence of measurement error. If we denote this by epsilon (ε), then the observed score, or the outcome of measurement (X), resulting from the encounter could be expressed in the form of the general linear model thus (Schmidt, 1983; Nenty, 1991a; Ackerman, 1992):

Xi = θi + ηi + εi .................... (1)

While the test score variance accounted for by ε reflects the unreliability of the test scores, that accounted for by η reflects the invalidity of the test scores. Eta (η) is intended here as a conglomeration of extraneous variables whose influence systematically distorts the result of testing, and hence reduces the level of confidence with which θ can be inferred from X.
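As a rough numerical illustration of Formula 1 (a sketch only, with invented variance figures), the following Python fragment simulates observed scores as the sum of the ability under measurement (θ), a systematic extraneous component (η), and random error (ε), and shows how the η and ε components weaken the link between X and θ:

```python
import numpy as np

rng = np.random.default_rng(42)
n_examinees = 1000

# Hypothetical components of Formula 1: X = theta + eta + epsilon.
theta = rng.normal(50, 10, n_examinees)    # ability under measurement
eta = rng.normal(0, 6, n_examinees)        # systematic extraneous influences (invalidity)
epsilon = rng.normal(0, 4, n_examinees)    # random measurement error (unreliability)

x = theta + eta + epsilon                  # observed test scores

# The confidence with which theta can be inferred from X shrinks as the
# eta and epsilon variances grow relative to the theta variance.
print("var(theta):", theta.var().round(1))
print("var(eta):  ", eta.var().round(1))
print("var(eps):  ", epsilon.var().round(1))
print("corr(X, theta):", np.corrcoef(x, theta)[0, 1].round(3))
```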


Given the latent or hypothetical nature of human characteristics, educational measurement is unavoidably indirect and inferential in nature, and hence inevitably involves the operation of some extraneous variables. Any test or item characteristic, any person's action or behaviour exhibited before, during or after testing, or any interaction among these that leads to an increase or decrease in test scores without a corresponding pre-testing increase or decrease in the ability, knowledge or skill under measurement is extraneous; it increases the η influence and thus reduces the level of confidence with which the ability (θ) could be inferred from such test scores.
Factors that invalidate test scores are examiner-, examinee-, as well as item-related. Seemingly item-related sources are basically examiner-related because they emanate from poor test construction. The most important item-related source of invalidity of test scores is the lack of unidimensionality of our test items. In other words, items are constructed such that they call on abilities other than the one under measurement. For example, when responding to an essay item, the examinee must (a) first of all read and understand the question; (b) think of the answer; and (c) think of how to present it in order to ensure that the examiner understands exactly what his answer is. This last step includes (i) the right choice of words, (ii) good sentence structure, (iii) good grammar and punctuation, (iv) logicality in, and coherence of, presentation, and, last but not least, (v) good or readable handwriting (Nenty, 1985a).
In the case of an objective test item, according to Lazarus and Taylor (1977), an examinee must (a) first read and understand the item; then (b) think of the answer, but does not have to write it down; instead he must (c) try to find the answer among the options given; (d) pick the one that he deems most appropriate; (e) keep track of the item number and the letter representing the option he has chosen in order to be able to (f) shade in the appropriate space on the answer sheet. In both cases, if an examinee knows the correct answer to an item but lacks any of these extraneous skills, he is likely to get the item wrong (Nenty, 1985a). This will cause the resulting test score to fail to reflect the actual ability under measurement; hence it could not sustain a valid inference of that ability. Two examinees with the same ability on what is being measured might end up with significantly different scores because of the influence of extraneous skills on which they differ. This is an example of item bias.
According to Nenty (1986, 1994), if one or other significant extraneous factors which influence performance on a test or an item discriminate systematically among groups of examinees, then the test or item is said to be biased. Generally, scores from a biased test are invalid. In fact, "item bias is attributable to the degree of lack of item validity" (Ackerman, 1992, p. 70). Words, illustrations, examples, concepts, ideas, etc., used in items might appeal to one group more than others and hence lead to differences in group performance, not because the groups differ in the ability under measurement but as a result of the differential influence of these extraneous factors. Whenever a test calls on more than one skill, the likelihood of the scores from it being invalid is high. Measuring with a biased test is like measuring with an elastic ruler that stretches for one person or group and shrinks for the other.
Another source of item-related invalidity is proneness to guessing. Multiple-choice items with heterogeneous options are more prone to guessing than those with homogeneous options. Guessing is a source of multidimensionality in a test as it introduces a different, and oftentimes significant, dimension into test performance. By guessing answers to test items, the examinee introduces another "ability" besides that which the items were designed to measure. The scores made through guessing cannot sustain valid inference to the ability under measurement, as it was not the ability to guess that was being measured. Similar examinee-related sources of invalidity when responding to affective instruments include faking, response style and impersonation. Other examinee-related sources of invalidity are the level of test-wiseness or test-sophistication and other examination-related behaviours of the examinees, like anxiety, motivation, etc.
Currently, the most important and disturbing examinee-related source of invalidity in our tests is examination malpractice. Examination malpractice is a psychometric as well as a social problem. As a psychometric problem, it interferes with any objective attempt to estimate examinees' ability by invalidating test scores. Any form of examination malpractice inflates the scores of those who practise it and renders such scores useless for any evaluation purpose. Examination malpractice is also an examiner-related source of invalidity because teachers and other examiners, and even parents, are known to aid and abet malpractice during examinations. Hence the behaviour of examiners, including teachers, and even parents, is a source of invalidity for our test scores, especially in public examinations.
As indicated by Nenty (1985a), during testing in general, and with essay tests in particular, the examiner is a part of the measurement instrument. He sets the test, administers it and scores it. His behaviour during each of these processes has a significant influence on the resulting test scores. The first issue is his ability and willingness to follow a step-by-step systematic procedure in developing the test items. This involves a random sampling of test content from a well-defined domain in order to ensure that the item content provides a scientific representation of both the subject matter and behavioural contents on which the measurement is based. If the contents of test items are not adequate representatives of the domain contents, the basis for inferring the domain behaviour (ability, given a well-specified domain content) of the examinees from their test scores is very weak. According to Schmidt (1983):

The question of what constitutes a representative sample is important since the degree of "non-representativeness" contributes to the discrepancy E(yi) - Zi. Thus the extent to which items are non-representative contributes to the amount of bias and correspondingly to the degree of invalidity (pp. 166-167).

[In this presentation, ηi = E(yi) - Zi.]
Secondly, teachers' personal characteristics, as displayed during test administration or as perceived by the examinee, have been found to influence examinees' performance significantly. And lastly, especially with essay tests, the score that results from the scoring process does not depend only on the examinee's response, but also on who scores the response. For example, in a series of studies by Starch and Elliott (reported in Mouly, 1978, p. 75), scores given by 180 different scorers to the same response ranged from 28 to 92. From which of the 180 scores awarded do we infer the examinee's ability?
The discrepancies between test content (content actually tested) on the one hand, and each of curriculum content (content as specified in the curriculum), content of curricular materials (textbook contents), and instructional content (content actually covered in class) on the other, are sources of invalidity of test scores, especially when validation studies are done across classrooms or schools. Differences in curricular and instructional emphasis and coverage, even given the same curriculum, have significant influences on test scores across classrooms. Generally, every factor, be it item-, examiner- or examinee-related, other than θi, that contributes significantly to differences in the size of test scores enlarges ηi, and thus increases the invalidity of test scores.
METHODS OF TEST VALIDATION
According to Kerlinger (1986), "the subject of validity is complex, controversial, and peculiarly important .... Here perhaps more than anywhere else, the nature of reality is questioned .... It is not possible to study validity ... without sooner or later inquiring into the nature and meaning of one's variables" (p. 416). Given the very complex nature of validity, test validation, or determining the level of confidence with which the variable being measured could be inferred from scores generated from a test designed to measure it, is not a simple one-shot affair like, for example, determining the reliability of a test. It is a complex, theoretically and empirically involving process. Unlike test reliability, one single index cannot satisfactorily justify the validity of a test. Validation is in fact a theory-backed empirical process of searching for the truth about the theoretical as well as the empirical nature or meaning of a variable under measurement. In this case, it is a research study in which hypotheses about the nature and meaning of the variable under measurement are stated and tested using scientifically generated data. A test validation process should be able to ascertain the level to which human behaviour reflects the totality of a clearly specified universe of skills, knowledge, indicants (Kerlinger, 1986), etc., that operationally defines the variable under consideration, and relates predictably to several behaviours that represent some other variables. The investigative process of gathering, analyzing and evaluating the necessary data with which hypotheses involving the nature and meaning of a variable are tested is called validation. There are two aspects to this: (i) an evaluation of the process used, or provision made, for the development, administration and scoring of the test; and (ii) determining, through a series of empirical hypothesis-testing investigations, the actual level of confidence with which scores from the test could be used to infer a domain behaviour.
Over the years, methods of test validation have evolved along with the meaning of validity. According to APA et al. (1974), content validity "is determined by a set of operations, and one evaluates this by the thoroughness and care with which these operations have been conducted" (p. 29). The current emphasis on subsuming all types of validity under construct validity does not discard this procedural or operational investigation, because "evidence of construct validity may also be inferred from procedures followed in developing a test" (APA et al., 1974, p. 30). This procedural investigation involves determining:
I. The exhaustiveness and clarity with which the elements of the domain or universe of behavioural and subject matter contents have been identified and itemized;
II. The adequacy of sampling to ensure representativeness of the test content of this well-defined domain or universe of contents;
III. The clarity with which the sampled content is translated into appropriate tasks or statements that do not demand abilities other than the one under measurement, and which discriminate among examinees who differ in this ability;
IV. The adequacy with which provision is made for ideal administration and testing conditions that do not allow for adverse physical and psychological effects on the examinees, nor encourage favouritism and examination malpractices of any kind; and
V. The adequacy of the provision made to ensure objectivity in scoring responses to the test items.
The determination of "the thoroughness and care with which these operations have been conducted" is done through expert judgement based on the results of an inspection of records of the test development process; content analysis, including taxonomical analysis; and an examination of the provision for objectivity in scoring. These methods of procedural evaluation are also used to determine curricular validity, by quantifying the overlap between the contents of textbooks and other curricular materials and the test contents, and instructional validity, by quantifying the congruency between the content of the instruction actually provided to learners and the test content (Mehrens & Phillips, 1987).
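A minimal sketch of how such an overlap could be quantified is given below; the content codes and the simple proportion-of-overlap index are hypothetical illustrations, not the procedures used by Mehrens and Phillips (1987):

```python
# Each test item is tagged with the content element it samples (hypothetical codes).
test_item_content = ["fractions", "decimals", "ratio", "algebra", "geometry"]

# Content elements actually covered by the textbook and by classroom instruction (invented).
textbook_content = {"fractions", "decimals", "ratio", "percentages"}
instruction_content = {"fractions", "decimals"}

def overlap(test_content, covered):
    """Proportion of test items whose content element was covered."""
    return sum(c in covered for c in test_content) / len(test_content)

print("curricular overlap:   ", overlap(test_item_content, textbook_content))     # 0.6
print("instructional overlap:", overlap(test_item_content, instruction_content))  # 0.4
```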
Empirically, test validation is no longer seen as a one-shot analysis to determine the correlation between test scores on the ability or characteristic under measurement and scores, which themselves might not be valid, on other concurrent, future, or criterion behaviours, in order to ascertain concurrent, predictive and criterion validity respectively. The current emphasis is on validation research during which hypotheses on the relationships between test scores on the ability or characteristic measured, using a variety of methods, and scores on several other theoretically related and unrelated variables are stated and tested. Such variables might include valid criterion variables. The main question for such studies is: to what extent do scores on tests designed, using a variety of measurement methods, to measure our ability or characteristic of interest relate to scores on several other variables which are theoretically related or unrelated to our ability or characteristic of interest, also measured using a variety of methods? Validity can be established by testing hypotheses that address the following concerns:
i. How well the scores from several measures of our variable of interest, using a variety of methods, relate to each other – convergence analysis (Campbell & Fiske, 1959; Kerlinger, 1986);
ii. How well the scores from several measures of our variable relate to those on appropriate valid criterion measures;
iii. How well the scores from several measures of our variable relate to scores on other variables which are theoretically related to our variable – convergence analysis (Campbell & Fiske, 1959; Kerlinger, 1986);
iv. How well the scores from several measures of our variable fail to relate to scores from other variables which are theoretically unrelated to our variable – discriminability analysis (Campbell & Fiske, 1959; Kerlinger, 1986);
v. How well scores from it tend to cluster with a set of other variables, not because of similarity in method of measurement but because of similarity in underlying trait – convergence analysis (Campbell & Fiske, 1959);
vi. What the loading patterns on the dimensions inferred from factor analyzing the resulting matrix of interrelationships look like (APA et al., 1985; Kerlinger, 1986);
vii. What simpler factors can explain the variance of our test scores, and how much of it each can explain; which factor variances our test scores can explain, and how much they can explain (Kerlinger, 1986); and
viii. Whether there are significantly differentiating characteristics between those who have high scores and those who have low scores on our variable (APA et al., 1985).
Existing methods of analysis involved in testing related hypotheses include Campbell and Fiske's (1959) multitrait-multimethod matrix analysis, analysis of variance, regression and discriminant analysis, factor analysis, path analysis and other related multivariate analyses. It is through rigorous empirical studies that the meaning and nature of a variable can be determined. The rationale behind these analyses is that if scores generated to represent a theoretically defined variable validly represent it, then scores from different methods of measuring the variable should converge, not because of commonality in method but because of commonality in underlying trait; those from measuring variables that are theoretically related to it should also converge; but scores on variables not theoretically related to it should be uncorrelated with scores from our variable. Besides investigations into the holistic nature, meaning, and behaviour of the variable as measured, the current trend in test validation also calls for the validation of each of the items that make up the instrument. Such investigations involve analyses like item bias analysis and item response pattern analysis.
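This rationale can be illustrated with a small correlation check on simulated data (a sketch only; the trait and method labels are hypothetical, and the analysis is far simpler than a full Campbell and Fiske multitrait-multimethod study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two hypothetical traits: mathematics ability and sociability (theoretically unrelated).
math_ability = rng.normal(size=n)
sociability = rng.normal(size=n)

# The same trait measured by two different hypothetical methods (written and oral tests),
# plus a measure of the unrelated trait.
math_written = math_ability + rng.normal(scale=0.5, size=n)
math_oral = math_ability + rng.normal(scale=0.5, size=n)
sociability_rating = sociability + rng.normal(scale=0.5, size=n)

def r(a, b):
    return round(float(np.corrcoef(a, b)[0, 1]), 2)

# Convergence: different methods, same underlying trait, should correlate highly.
print("math written vs math oral:  ", r(math_written, math_oral))
# Discriminability: theoretically unrelated traits should correlate near zero.
print("math written vs sociability:", r(math_written, sociability_rating))
```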
ANALYSIS OF TEST AND ITEM BIAS
Test validation also involves determining whether some extraneous factors have a significantly systematic influence on test performance. The most compelling of such factors is item or test bias. Wiley, Haertel and Harnischfeger (1981), Schmidt (1983), and Ackerman (1992) have identified invalidity with item or test bias. The invalidity component in our general linear model (see Formula 1), eta (η), is associated with item or test bias. During testing, if the influence of any extraneous factor systematically favours one group of examinees over the other, then the test is said to be biased. Technically, an item is unbiased if, for all examinees of equal ability (i.e. equal total score on a test that contains the item), the probability of a correct response is the same regardless of each examinee's group membership (Scheuneman, 1979). Or, in terms of item response theory (IRT), an item is unbiased if the groups' item response functions (see Nenty, 1995) are identical (Lord, 1980). Test bias causes two examinees who have the same ability (θ) on what is being measured to end up with different observed scores (X). In that case, using observed scores for decision making, as in selection, promotion, or graduation, often leads to unfair and invalid decisions. Scores from any test on which important decisions will be based must be tested for possible bias. If this is done at the pilot-testing stage, items identified as biased could be revised or changed. According to Ackerman (1992), "it should be apparent that as much thought must go into the analysis of bias for a test as went into the original construction of the test" (p. 90).
There are several methods of determining item bias. These can be grouped under: (i) item-parameter-related methods; (ii) chi-square/probability-related methods; (iii) analysis of variance, regression, and log-linear-related methods; and (iv) methods based on item response theory. These have been reviewed and compared (Ironson & Subkoviak, 1979; Nenty, 1979, 1986; Jensen, 1980; Shepard, Camilli & Averill, 1981; Marascuilo & Slaughter, 1981), and chi-square/probability-related methods have been found to be generally valid and mathematically less demanding methods of detecting item bias. Studies using these methods with Nigerian samples (Nenty, 1986; Umoinyang, 1991; Abiam, 1996) have found significant item bias in Nigerian ability and achievement tests at the secondary and primary school levels.
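In the spirit of the chi-square/probability-related methods, the sketch below groups examinees into total-score bands and compares the correct/incorrect split across two groups within each band; it is an illustrative approximation rather than Scheuneman's (1979) exact statistic, and the data are simulated:

```python
import numpy as np
from scipy.stats import chi2

def item_bias_chi_square(item_correct, total_score, group, n_bands=4):
    """Sum chi-square contributions from group-by-correctness tables within score bands."""
    edges = np.quantile(total_score, np.linspace(0, 1, n_bands + 1))
    band = np.digitize(total_score, edges[1:-1])   # band index 0 .. n_bands-1
    stat, df = 0.0, 0
    for b in range(n_bands):
        in_band = band == b
        n_band = in_band.sum()
        if n_band == 0:
            continue
        for g in np.unique(group):
            for correct in (0, 1):
                observed = np.sum(in_band & (group == g) & (item_correct == correct))
                expected = (np.sum(in_band & (group == g)) *
                            np.sum(in_band & (item_correct == correct)) / n_band)
                if expected > 0:
                    stat += (observed - expected) ** 2 / expected
        df += 1  # roughly one degree of freedom per 2x2 band table
    return stat, df, chi2.sf(stat, df)

# Simulated example: 400 examinees in two groups responding to one item.
rng = np.random.default_rng(1)
group = rng.integers(0, 2, 400)
total_score = rng.integers(10, 40, 400)
item_correct = rng.integers(0, 2, 400)
print(item_bias_chi_square(item_correct, total_score, group))
```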
ANALYSIS OF ITEM RESPONSE PATTERN (IRP)
Another test validation analysis, the analysis of item response patterns (IRP), is based on the assumption that if all the items in an objective test measured one and only one ability, and they are ordered according to their estimated difficulties, from the easiest to the most difficult, then some regularity in each examinee's pattern of responses to these items would be expected. Similarly, if examinees are arranged in order of their estimated ability, from the one with the highest to the one with the lowest score, then, to the extent that the test measures one and only one ability, regularity would be expected in the pattern of responses across the examinees (Nenty, 1994). With a perfectly valid test, and given this arrangement of items, the ideal response pattern for an examinee is a string of successes (ones) over the less demanding items which gradually peters out into a string of failures (zeroes) over the more demanding items. Similarly, for an item, a string of successes by the more able examinees should gradually peter out into a string of failures by the less able examinees (Wright & Stone, 1978; Nenty, 1987). This would give a perfect Guttman scale. The extent of observed deviations from these hypothesized patterns signals the extent to which responses to the test items call for abilities other than the one under measurement, and hence are invalid.
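A minimal sketch of this idea follows, using invented 0/1 response data; the count of Guttman "reversals" is only a crude indicator of deviation from the ideal pattern, not a formal index:

```python
import numpy as np

# Rows = examinees, columns = items (1 = correct, 0 = wrong); invented data.
responses = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],   # irregular: missed an easier item, passed harder ones
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],   # irregular: passed only the easiest and the hardest item
])

# Order items from easiest to hardest and examinees from highest to lowest score.
item_order = np.argsort(-responses.sum(axis=0))
person_order = np.argsort(-responses.sum(axis=1))
ordered = responses[np.ix_(person_order, item_order)]

def guttman_reversals(pattern):
    """Count (wrong easier item, right harder item) pairs in one response row."""
    return sum(
        1
        for i in range(len(pattern))
        for j in range(i + 1, len(pattern))
        if pattern[i] == 0 and pattern[j] == 1
    )

for row in ordered:
    print(row, "reversals:", guttman_reversals(row))
```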
Methods for analyzing item response patterns have been reviewed and compared (Harnisch & Linn, 1981; Harnisch & Tatsuoka, 1983), and Sato's Modified Caution Index (C) (Sato, 1975) has always been recommended because of its validity and its mathematically less demanding computational procedures. This has been used with achievement test data in Nigeria (see Nenty, 1987, 1991b & 1992, also for computational procedures). In studies that amounted to test validation using item response pattern analysis, Harnisch and Linn (1981), Harnisch (1983), Miller (1986) and Nenty (1991a, 1991b) identified the types of items that functioned differently within and across individuals, classrooms, schools, zones, states, and regions. They attributed unusual patterns of responses by individuals to guessing, carelessness, and cheating, and those by classrooms to differences in content coverage and emphasis, hence evidence of a mismatch between what is being tested and what was taught in each school.
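One common formulation of Sato's caution index compares the covariance between an examinee's observed response pattern and the item totals with the covariance that a perfect Guttman pattern carrying the same total score would produce; the Modified Caution Index of Harnisch and Linn (1981) mainly adjusts the normalization. The sketch below implements that common formulation on invented data and should not be taken as the exact computational procedure referred to above:

```python
import numpy as np

def caution_index(response_row, item_totals):
    """Caution index: 1 - cov(observed pattern, item totals) / cov(Guttman pattern, item totals)."""
    order = np.argsort(-np.asarray(item_totals))      # easiest (highest total) item first
    u = np.asarray(response_row)[order]
    n = np.asarray(item_totals, dtype=float)[order]
    r = int(u.sum())                                  # examinee's total score
    guttman = np.zeros_like(u)
    guttman[:r] = 1                                   # perfect pattern: passes the r easiest items
    num = np.dot(u, n) - r * n.mean()
    den = np.dot(guttman, n) - r * n.mean()
    return 1.0 - num / den if den != 0 else 0.0

# Invented data: item totals (number of examinees answering each item correctly).
item_totals = np.array([48, 40, 33, 21, 9])
print(caution_index([1, 0, 1, 0, 1], item_totals))  # irregular pattern -> larger index
print(caution_index([1, 1, 1, 0, 0], item_totals))  # Guttman-consistent pattern -> 0.0
```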
ITEM RESPONSE THEORY, TEST DIMENSIONALITY AND VALIDITY
Given the notations indicated earlier, item response theory (IRT) in its simplest form has it that the probability of a person i answering item j correctly (Pij) is a function of the difference between the person's θ and the item's β. For a dichotomously scored item, this is:

Pij = e^(θi - βj) / [1 + e^(θi - βj)] .................... (2)

And this holds only when certain conditions are met. The most important of these is the unidimensionality assumption. This all-involving assumption has it that all the items in a test must be constructed, administered and scored to ensure that they all measure one, and only one, ability, area of knowledge or behaviour (Nenty, 1995). In other words, test items must be constructed, administered, and scored to ensure that the resulting scores are sustainable only by the ability, area of knowledge, or behaviour under measurement. A close consideration of what is involved in ensuring the unidimensionality of a test shows a very close similarity with the demands for maximum test validity. In fact, with the current level of testing technology, validity can only be meaningfully defined for a unidimensional test.
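Assuming the one-parameter (Rasch) form of Formula 2, the probability of success in a theta-by-beta encounter can be evaluated directly; the θ and β values below are invented for illustration:

```python
import math

def p_correct(theta, beta):
    """Rasch form of Formula 2: probability of success in a theta-by-beta encounter."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# When ability just matches the item's demand the chance of success is 0.5;
# it rises as theta exceeds beta and falls as beta exceeds theta.
for theta, beta in [(0.0, 0.0), (1.5, 0.0), (0.0, 1.5)]:
    print(f"theta={theta:+.1f}, beta={beta:+.1f} -> P(correct)={p_correct(theta, beta):.2f}")
```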
According to Nenty (1994), basic to the achievement of specific objectivity in psychological measurement, with all its accompanying advantages (see Nenty, 1984), is the fact that a test must be designed, administered, and scored in such a way that one, and only one, ability accounts for an examinee's score on it. While this might not be a serious concern for classical test theory (CTT), it is a basic assumption for the valid operationalization of any of the three currently operational IRT models. But even for CTT, if this does not hold, the definitions of the classical item parameters are invalid. Item calibration that is independent of the calibrating sample, and person characterization that is independent of the particular set of items used, are the advantages of specific objectivity in measurement (Nenty, 1985b; Wright, 1967). When any factor or ability other than that under measurement contributes to an examinee getting a test item wrong or right, then the test is multidimensional. Item bias is one of such factors. Analysis of test dimensionality is a complex, mathematically involving and all-embracing test validation procedure (see Warm, 1978; Stout, 1987; Akpan, 1996). The last two analyses presented above have implications for test dimensionality. In a perfectly unidimensional test, the eta (η) component in our general linear model (see Formula 1) is zero, and hence the test has maximum validity.
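As a rough illustrative screen (not Stout's nonparametric procedure or any of the analyses cited above), one can inspect the eigenvalues of the inter-item correlation matrix: when a single dominant dimension runs through the items, the first eigenvalue dwarfs the rest. The data below are simulated:

```python
import numpy as np

rng = np.random.default_rng(7)
n_examinees, n_items = 500, 10

# Simulate dichotomous responses driven by a single latent ability (unidimensional case).
theta = rng.normal(size=(n_examinees, 1))
beta = np.linspace(-1.5, 1.5, n_items)
prob = 1.0 / (1.0 + np.exp(-(theta - beta)))
responses = (rng.random((n_examinees, n_items)) < prob).astype(int)

# Eigenvalues of the inter-item correlation matrix as a crude dimensionality screen.
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(responses, rowvar=False)))[::-1]
print("largest eigenvalues:  ", np.round(eigvals[:3], 2))
print("first-to-second ratio:", round(eigvals[0] / eigvals[1], 2))
```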
CONCLUSION AND RECOMMENDATIONS
The level of confidence with which scores from test items could be used to infer the ability under measurement depends on both (i) how well the items represent the domain that defines the ability under measurement, and (ii) the extent to which it is this ability alone that sustains responses to these items. Highly representative items with good characteristics cannot give valid results if responses to them are not sustained by the ability to which inference is intended; neither could responses that are sustained by the desired ability but made on unrepresentative items with poor measurement qualities. Both logical and empirical validation processes are therefore necessary and sufficient for ascertaining test validity. While logical or intrinsic rational validation procedures try to determine the level of test validity in terms of item representativeness and item quality, the empirical procedures try to determine the level of confidence with which a given type of inference can be drawn from scores on the test.
Given the latent or hypothetical nature of a psychological variable, the validation of a test designed to measure it requires a variety of relevant data or information from many sources, theoretically related or unrelated to the variable, with which to determine the quality and functioning of scores from the test. This implies a rigorous and preferably hypothesis-based investigation to determine the nature and meaning of such a variable.
One of the most important challenges in managing educational assessment in Nigeria is the production of valid test scores and grades through our institutional and public examinations. In order to generate scores that would sustain valid and fair decisions in schools and in the public, especially in a multi-ethnic society like ours, all persons involved in the development, administration and scoring of tests for public and school examinations should be trained and retrained from time to time in current methods of ensuring and ascertaining validity in testing. The most obnoxious source of test invalidity in Nigeria, and hence the biggest challenge to the management of educational assessment in the country, is examination malpractice (Nenty, 1987). To check this escalating menace to the learner, to the school and to the public, the government should create a "Centre for the Study and Prevention of Examination Malpractice" with an enforcement arm. Such a centre should instigate, sponsor, and coordinate studies on this malady, ensure strict enforcement of relevant decrees, and work with public examination bodies to ensure malpractice-free testing in Nigeria.
REFERENCES
Abiam, P.O. (1996). Analysis of differential item functioning (DIF) of 1992 First School Leaving Certificate (FSLC) mathematics examination in Cross River State of Nigeria. Unpublished M.Ed. thesis, University of Calabar.

Ackerman, T.A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29(1), 67-91.

Airasian, P.W., & Madaus, G.F. (1983). Linking testing and instruction: Policy issues. Journal of Educational Measurement, 20(2), 103-118.

Akpan, G.S. (1996). Speededness effect in assessing the dimensionality of agricultural science examination. Unpublished M.Ed. thesis, University of Calabar.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education (1966). Standards for educational & psychological tests. Washington, D.C.: American Psychological Association, Inc.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education (1974). Standards for educational & psychological tests. Washington, D.C.: American Psychological Association, Inc.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education (1985). Standards for educational & psychological tests. Washington, D.C.: American Psychological Association, Inc.

Campbell, D., & Fiske, D. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 54, 81-105.

Ebel, R.L. (1983). The practical validation of tests of ability. Educational Measurement: Issues and Practice, 2(2), 7-10.

Gardner, E.F. (1983). Intrinsic rational validity: Necessary but not sufficient. Educational Measurement: Issues and Practice, 2(2), 13.

Harnisch, D.L. (1983). Item response patterns: Applications for educational practice. Journal of Educational Measurement, 20(2), 191-206.

Harnisch, D.L., & Linn, R.L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18(3), 133-146.

Harnisch, D.L., & Tatsuoka, K.K. (1983). A comparison of appropriateness indices based on item response theory. In R.K. Hambleton (Ed.), Applications of item response theory. British Columbia: Educational Research Institute.

Ironson, G.H., & Subkoviak, M. (1979). A comparison of several methods of assessing item bias. Journal of Educational Measurement, 16, 209-225.

Jensen, A.R. (1980). Bias in mental testing. New York: The Free Press.

Kerlinger, F.N. (1986). Foundations of behavioural research. Fort Worth: Holt, Rinehart and Winston, Inc.

Lazarus, & Taylor, E.F. (1977, May 1). The debate: Pro and con of testing. New York Times.

Linn, R.L., & Harnisch, D.L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18(2), 109-118.

Marascuilo, & Slaughter, P.E. (1981). Statistical procedures for identifying possible sources of item bias based on χ² statistics. Journal of Educational Measurement, 18(4), 229-248.

Mehrens, W., & Phillips, S.E. (1987). Sensitivity of item difficulties to curricular validity. Journal of Educational Measurement, 24(4), 357-370.

Miller, M.D. (1986). Time allocation and patterns of item response. Journal of Educational Measurement, 23(2), 147-156.

Mouly, G.J. (1978). Educational research: The art and science of investigation. Boston: Allyn and Bacon.

Nenty, H.J. (1979). An empirical assessment of the culture fairness of the Cattell Culture Fair Intelligence Test using the Rasch latent trait measurement model. Unpublished doctoral dissertation, Kent, Ohio, USA.

Nenty, H.J. (1984, April 27). Objectivity in psychological measurement: An introduction. Paper presented at the WAEC monthly seminar, Lagos, Nigeria.

Nenty, H.J. (1985a). Fundamentals of educational measurement and evaluation. Unpublished manuscript, University of Calabar.

Nenty, H.J. (1986). Cross-cultural bias analysis of Cattell Culture-Fair Intelligence Test. Perspectives in Psychological Researches, 9(1), 1-16.

Nenty, H.J. (1987). Factors that influence students' tendency to cheat in examinations. Journal of Education in Developing Areas, vi & vii, 70-78.

Nenty, H.J. (1987, August 28). Item response pattern: Causes of its variability and its uses in educational practices. Paper presented at the WAEC monthly seminar, Lagos, Nigeria.

Nenty, H.J. (1991a, March). Background characteristics and students' pattern of responses in mathematics examination. Paper presented at the 7th annual conference of the National Association of Educational Psychologists, Ahmadu Bello University, Zaria.

Nenty, H.J. (1991b, March). Student-problem (S-P) skill analysis of pupils' performance in common entrance mathematics examination. Paper presented at the 7th annual conference of the National Association of Educational Psychologists, Ahmadu Bello University, Zaria (submitted to WAJEVM, WAEC, for publication).

Nenty, H.J. (1992). Item response pattern: Causes of its variability and its uses in educational practices. The West African Journal of Educational and Vocational Measurement, 7, 1-12.

Nenty, H.J. (1994). Response pattern analysis of Cross River State 1986 common entrance mathematics examination: An aid to item selection. Paper presented at the inaugural meeting of the National Association for Educational Assessment (NAEA), June 16, ERC, Minna, Niger State (submitted to Journal of Educational Assessment, Minna, for publication).

Nenty, H.J. (1995). Introduction to item response theory. Unpublished paper, University of Calabar (submitted to the Global Journal of Pure and Applied Sciences for publication).

Poggio, J.P., Glasnapp, D.R., Miller, M.D., Tollefson, N., & Burry, J.A. (1986). Strategies for validating teacher certification tests. Educational Measurement: Issues and Practice, 5(2), 18-25.

Sato, T. (1975). The construction and interpretation of S-P tables. Tokyo: Meiji Tosho.

Scheuneman, J.D. (1979). A method for assessing bias in test items. Journal of Educational Measurement, 16, 143-153.

Schmidt, W.H. (1983). Content bias in achievement tests. Journal of Educational Measurement, 20(2), 165-178.

Shepard, Camilli, & Averill (1981). Comparison of procedures for detecting test item bias with both internal and external ability criteria. Journal of Educational Statistics, 6(4), 317-375.

Shimberg, B. (1990). Social considerations in the validation of licensing and certification exams. Educational Measurement: Issues and Practice, 9(4), 11-14.

Stout, W.F. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589-617.

Thorndike, E.L. (1918). The nature, purposes and general methods of measurements of educational products. In The measurement of educational products (17th Yearbook, Part II). Chicago: National Society for the Study of Education.

Umoinyang, I.E. (1991). Item bias in mathematics achievement test. Unpublished M.Ed. thesis, University of Calabar, Calabar.

Warm, T.A. (1978). A primer of item response theory. Springfield, VA: National Technical Information Services, US Department of Commerce.

Wiley, D.E., Haertel, E., & Harnischfeger, A. (1981). Test validity and national educational assessment: A conception, a method, and an example (Report No. 17). Chicago, Illinois: Northwestern University.

Wright, B.D. (1967). Sample-free test calibration and person measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.

Wright, B.D., & Stone, M.H. (1978). Best test design: A handbook for Rasch measurement. Chicago: The University of Chicago & Elmhurst College.