Chapter 12


Test and Item Analysis

Learning Objectives

  • State importance of test and item analysis.
  • Define Facility Value (FV), Discrimination index (DI) and Distractor efficiency.
  • Calculate FV and DI of a given item.
  • Calculate internal consistency of a given test paper.

 

When you go to the market to purchase a commodity, which shop will you trust more: one that uses stones as weight measures or one that uses certified weights? Obviously the latter, because you do not want to get cheated. Something similar is the case with examinations. The questions and tests that we use are like measures against which we compare the knowledge possessed by the students. What will happen if this measure is not standardised? A student will get either more or less marks than he actually deserves. This harms the cause of learning in more than one way: on the one hand, we erode the faith of society in the system, and on the other, we may be producing incompetent doctors. One of the ways to overcome this problem is to use standardised tests by undertaking what is called test and item analysis.

Test and item analysis consists of 2 distinct sets of activities, viz. analysis of the individual questions and analysis of the test as a whole. This is easier and more precise for objective type questions, although, with modifications, it can be used for essay type questions also. In the subsequent discussions, we will first learn about item analysis of both objective and essay type questions and then about test analysis.

*Item Analysis :

Under this category, we include questions which have only one pre-determined correct answer; in other words, questions where the student can be marked either right or wrong. Let us introduce you to the technique of item analysis.

The first step in performing item analysis is to mark the papers and then arrange them in rank order, with the student scoring the highest marks at the top.

Preparing for item analysis

The next step is to break this distribution into 2 groups, i.e. a higher ability group (HAG) and a lower ability group (LAG). If the number of students is up to 50, the groups will include 25 students each (the top and bottom halves of a class of 50), but if it is large, say 200, then you should include the top 30% and the bottom 30% of students respectively in the two groups.

Now, for each question, count the number of students ticking option a, b, c or d, as the case may be, in each of these two groups. For example, a test was administered to a group of 50 students, who were then divided into HAG and LAG. For question no. 1, the distribution of options could be something like this :

Question 1      a     b*    c     d
HAG (25)        1     20    4     0
LAG (25)       13      5    1     6

(* b is the correct answer)

Once we have this information available about all questions, we proceed further to calculate the indices related to each.
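The ranking, grouping and counting steps described above can be sketched in Python. This is only an illustration: the function name `tally_options`, the `(total_score, chosen_option)` record layout and the `fraction` parameter are assumptions made for this example, and the six students are imaginary.

```python
from collections import Counter

def tally_options(records, fraction=0.5):
    """Split ranked students into HAG and LAG and count option choices.

    records  : list of (total_score, chosen_option) pairs, one per student
               (a hypothetical layout chosen for this sketch).
    fraction : share of the class placed in each group, e.g. 0.5 for a
               class of up to 50, or 0.3 for a large class.
    """
    # Rank the papers with the highest total score at the top.
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)
    size = round(len(ranked) * fraction)
    hag = ranked[:size]                  # higher ability group
    lag = ranked[len(ranked) - size:]    # lower ability group
    # Count how many students in each group ticked each option.
    hag_counts = Counter(option for _, option in hag)
    lag_counts = Counter(option for _, option in lag)
    return hag_counts, lag_counts

# Six imaginary students answering one question.
records = [(48, "b"), (45, "b"), (40, "c"), (22, "a"), (20, "b"), (15, "d")]
hag_counts, lag_counts = tally_options(records)
print(dict(hag_counts))  # {'b': 2, 'c': 1}
print(dict(lag_counts))  # {'a': 1, 'b': 1, 'd': 1}
```

With a large class, calling `tally_options(records, fraction=0.3)` would keep only the top and bottom 30% of the papers.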

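Facility Value (FV) : Simply stated, FV is the percentage of the group answering a question right. If 60% of the group answers the question correctly, then FV will be 60%. FV can be calculated by the formula :

$$\mathrm{FV} = \frac{\mathrm{HAG} + \mathrm{LAG}}{N} \times 100$$

where HAG and LAG are the numbers of students answering correctly in each group and N is the total number of students in the two groups. Coming to the previous example, FV will be :

$$\mathrm{FV} = \frac{20 + 5}{50} \times 100 = 50\%$$

FV is a measure of how easy or how difficult a question is. The higher the FV, the easier the question.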
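The FV formula is simple enough to check by hand, but a small helper is convenient when many items have to be analysed. The following is a minimal sketch; the function name `facility_value` and its parameter names are assumptions made for this illustration.

```python
def facility_value(hag_correct, lag_correct, n_total):
    """Return FV as a percentage.

    hag_correct, lag_correct : number of correct answers in each group.
    n_total                  : total number of students in both groups.
    """
    return 100 * (hag_correct + lag_correct) / n_total

# Question 1 from the example: 20 correct in HAG, 5 in LAG, 50 students.
print(facility_value(20, 5, 50))  # 50.0
```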

Discrimination Index (DI) : This index indicates the ability of a question to discriminate between a higher and a lower ability student. This is calculated by the formula :

$$\mathrm{DI} = \frac{\mathrm{HAG} - \mathrm{LAG}}{\text{No. in each group}}$$


You would have noticed that while FV is expressed as a percentage, DI is expressed as a fraction. The maximum value for DI is 1.0, which indicates an ideal question with perfect discrimination between HAG and LAG.
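Applied to question 1 of the earlier example (20 correct answers in HAG, 5 in LAG, 25 students per group), the DI formula can be sketched as follows. The function name `discrimination_index` is an assumption made for this illustration.

```python
def discrimination_index(hag_correct, lag_correct, group_size):
    """Return DI as a fraction between -1.0 and 1.0.

    hag_correct, lag_correct : number of correct answers in each group.
    group_size               : number of students in each group.
    """
    return (hag_correct - lag_correct) / group_size

print(discrimination_index(20, 5, 25))  # 0.6
# An ideal item: everyone in HAG is right, everyone in LAG is wrong.
print(discrimination_index(25, 0, 25))  # 1.0
```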

(Figure: adapted from J.J. Guilbert, 1992)

At this stage, we would also like to introduce you to another term, which is called negative discrimination. Simply stated, it means that more LAG students are answering the question right as compared to HAG students.

Look at the following distribution :

                a*    b     c     d
HAG (25)        3     15    1     6
LAG (25)        7     14    1     3

(* key is 'a')

$$\mathrm{DI} = \frac{\mathrm{HAG} - \mathrm{LAG}}{\text{No. in each group}} = \frac{3 - 7}{25} = \frac{-4}{25} = -0.16$$

We shall return to negative discrimination when we talk about the uses of item analysis.

Distractor efficiency : Do you recall our discussion on distractors in the chapter on objective type questions? It was very strongly emphasised that distractors should not be 'bogey' and that they should attract only lower ability students. Look at our first example. Distractor 'd' is a good distractor because it has not attracted any of the higher ability students; only lower ability students have been attracted towards it. On the other hand, if you look at 'c', you find that more students in the upper group have been attracted towards it than in the lower group. Numerically speaking, any distractor which is not picked by at least 5% of the students is not considered a good distractor.
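The two checks described above (a distractor should be picked by at least 5% of the students, and it should not attract more HAG than LAG students) can be sketched as a small routine. The function name `review_distractors`, the dictionary layout and the wording of the messages are assumptions made for this illustration; the counts are those of question 1.

```python
def review_distractors(hag, lag, key, min_share=0.05):
    """Flag weak distractors for one question.

    hag, lag : dicts mapping each option to the number of students
               in that group who picked it.
    key      : the correct option, which is skipped.
    """
    total = sum(hag.values()) + sum(lag.values())
    report = {}
    for option in hag:
        if option == key:
            continue
        picked = hag[option] + lag.get(option, 0)
        problems = []
        if picked < min_share * total:
            problems.append(f"picked by fewer than {min_share:.0%} of students")
        if hag[option] > lag.get(option, 0):
            problems.append("attracts more HAG than LAG students")
        report[option] = problems or ["acceptable"]
    return report

hag = {"a": 1, "b": 20, "c": 4, "d": 0}
lag = {"a": 13, "b": 5, "c": 1, "d": 6}
for option, notes in review_distractors(hag, lag, key="b").items():
    print(option, notes)
# a ['acceptable']
# c ['attracts more HAG than LAG students']
# d ['acceptable']
```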


Item Analysis for non-MCQs

For other objective type questions, the responses can be arranged as 'a' (correct answer) and 'b' (wrong answer), and FV and DI calculated by dividing the students into the two groups as before. FV can also be calculated by using the following formula :

$$\mathrm{FV} = \frac{\text{sum of marks obtained by all students}}{\text{sum of max. marks obtainable on that question}} \times 100$$
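A minimal sketch of the marks-based formula follows. The function name `facility_value_from_marks` and the five marks are hypothetical and serve only to illustrate the arithmetic.

```python
def facility_value_from_marks(marks_obtained, max_marks):
    """Return FV (%) for a question marked out of `max_marks`.

    marks_obtained : marks scored on this question, one entry per student.
    max_marks      : maximum marks obtainable on this question.
    """
    obtainable = max_marks * len(marks_obtained)
    return 100 * sum(marks_obtained) / obtainable

# Five imaginary students answering a question marked out of 5.
marks = [4, 3, 5, 2, 1]
print(facility_value_from_marks(marks, 5))  # 60.0
```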

Uses

You may be wondering what purpose is served by undertaking these calculations. Item analysis helps in detecting specific technical flaws in the question and provides information for improvement. It increases the skill of examiners in item writing. It provides information for class discussion of results. It helps students to improve their learning and teachers to know about common misconceptions of the class. Let us elaborate on some of these points.

(a) A good item is one which approximately half the class can answer (i.e. FV of 50%). If we select an item with too low an FV, then students tend to answer that item more from guess work than from actual knowledge. Knowledge of FV for a particular question also aids in better design of the question paper. As a general rule, the paper should begin with easy questions and then progress to difficult ones. Adopting a reverse sequence may demotivate the students right from the beginning.

(b) For testing the adequacy of classroom teaching, calculation of FV is a fairly useful tool. Look at the following example.

                     Question No.
                1        2        3        4
Student       a   b    a   b    a   b    a   b
1             +   +    -   +    +   -    +   +
2             +   +    -   +    +   -    +   +
3             +   +    +   +    -   -    -   -
4             +   +    -   -    +   -    -   -
5             +   +    -   +    +   +    +   +

a : Before teaching      b : After teaching
+ : Right answer         - : Wrong answer


This indicates that the subject area related to the objective of item 1 is well known to the students and does not need too much time. The subject related to item 2 has been well taught and has been understood by most of the students. On the other hand, students were rightly answering item 3 before teaching, but after teaching they have given wrong answers. This indicates that either the question has not been properly worded or else the teaching has not been proper.
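The before and after comparison can be made explicit by computing FV for each item at both points in time. In the sketch below, the `responses` layout and the helper name `fv` are assumptions made for this illustration; the '+' and '-' strings are read off the table above, one character per student in the order 1 to 5.

```python
# '+' = right answer, '-' = wrong answer; one character per student (1 to 5).
responses = {
    1: {"before": "+++++", "after": "+++++"},
    2: {"before": "--+--", "after": "+++-+"},
    3: {"before": "++-++", "after": "----+"},
    4: {"before": "++--+", "after": "++--+"},
}

def fv(marks):
    """FV (%) for a string of '+' and '-' responses."""
    return 100 * marks.count("+") / len(marks)

for item, r in responses.items():
    print(item, fv(r["before"]), fv(r["after"]))
# 1 100.0 100.0
# 2 20.0 80.0
# 3 80.0 20.0
# 4 60.0 60.0
```

A rise in FV after teaching (item 2) suggests the topic was taught well, while a fall (item 3) points to a problem with either the question or the teaching.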

(c) For tests which are employed for the purpose of selection, we prefer items with a high DI. As already stated, an item can have a maximum DI of 1.0, but this is difficult to attain. For practical purposes, an item with a DI of 0.35 or more is considered good, while a DI between 0.2 and 0.34 can be considered acceptable. Items with a DI of less than 0.2 need to be revised.

(d) We had referred to a term called negative discrimination, which indicates that more students in the lower group are answering that item right than students in the higher group. There are two possible reasons for this. The first is ambiguous framing of the question, which forces the brighter student to read more into it than what is intended. The second is a wrong answer key, which can also create havoc with the apparent result.

For example, look at the following question :

The infant mortality rate of India is :

(a) 30
(b) 45
(c) 56
(d) None of the above

This question was given to a group of 20 students and the following distribution was obtained :

          a     b     c     d
HAG       1     0     8     1
LAG       4     3     3     0

The DI for the question will be :

$$\mathrm{DI} = \frac{8 - 3}{10} = \frac{5}{10} = 0.5$$

Now suppose, by mistake, the key is marked as 'a' in place of 'c'. In that case, the DI will become :

$$\mathrm{DI} = \frac{1 - 4}{10} = \frac{-3}{10} = -0.3$$

Also, a brilliant student who may have read a very recent reference quoting a figure of say 52, will tick option 'd'. Thus, test and item analysis will give a clue to a wrong key and prevent injustice to many deserving candidates.
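As a rough sketch of points (c) and (d), the snippet below computes DI for the infant mortality question under two possible keys and grades each result using the cut-offs quoted in (c). The names `di_for_key` and `grade_di` are hypothetical, chosen only for this illustration.

```python
# Option counts from the infant mortality example (10 students per group).
hag = {"a": 1, "b": 0, "c": 8, "d": 1}
lag = {"a": 4, "b": 3, "c": 3, "d": 0}
GROUP_SIZE = 10

def di_for_key(key):
    """DI of the item if `key` is treated as the correct option."""
    return (hag[key] - lag[key]) / GROUP_SIZE

def grade_di(di):
    """Grade an item using the cut-offs given in point (c)."""
    if di >= 0.35:
        return "good"
    if di >= 0.2:
        return "acceptable"
    return "needs revision"

print(di_for_key("c"), grade_di(di_for_key("c")))  # 0.5 good
print(di_for_key("a"), grade_di(di_for_key("a")))  # -0.3 needs revision
```

A negative DI under the marked key, together with a clearly positive DI for another option, is a hint that the key itself may be wrong.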


Reliability of the Test :

Do you recall our discussions on reliability? We had discussed the various types of reliability. The one we are going to discuss here in detail is the internal consistency of the test. Internal consistency is calculated by dividing the whole test into odd and even numbered items, and hence the method is also called the split-half method. The following example will illustrate how we calculate the reliability.

Suppose a test of 10 items was given to a group of students, and we calculated the scores obtained by the whole class.
We are using the following terminology :
X : Scores on odd numbered items
Y : Scores on even numbered items
Arrange the scores as follows :

    X       Y       X²       Y²       XY
   60      72     3600     5184     4320
   65      73     4225     5329     4745
   70      79     4900     6241     5530
   85      77     7225     5929     6545
   55      74     3025     5476     4070

ΣX = 335     ΣY = 375     ΣX² = 22975     ΣY² = 28159     ΣXY = 25210

(Σ indicates summation)

Applying the formula

$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}}$$

$$= \frac{5(25210) - (335)(375)}{\sqrt{5(22975) - (335)^2}\,\sqrt{5(28159) - (375)^2}}$$

$$= \frac{126050 - 125625}{\sqrt{114875 - 112225}\,\sqrt{140795 - 140625}}$$

$$= \frac{425}{\sqrt{2650}\,\sqrt{170}} = \frac{425}{51.48 \times 13.04} = \frac{425}{671.3} = 0.633$$

This gives the reliability of the half test as 0.63, which can be converted to the reliability of the full test by using the Spearman-Brown formula.

$$r_{\text{full}} = \frac{2r}{1 + r} = \frac{2 \times 0.633}{1 + 0.633} = \frac{1.266}{1.633} = 0.78$$

Had the figure come out very low, say 0.05, it would indicate that the reliability of the test is poor. One of the prime reasons for getting such a low reliability is a small number of items on the test.
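The split-half calculation can be cross-checked with a short script. The sketch below uses the raw-score correlation formula and the Spearman-Brown step-up on the X and Y columns of the example; the function names `pearson_r` and `spearman_brown` are assumptions made for this illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation between two score lists, using the raw-score formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

def spearman_brown(r_half):
    """Step up a half-test reliability to the full-test reliability."""
    return 2 * r_half / (1 + r_half)

odd_scores = [60, 65, 70, 85, 55]    # X : scores on odd numbered items
even_scores = [72, 73, 79, 77, 74]   # Y : scores on even numbered items

r_half = pearson_r(odd_scores, even_scores)
r_full = spearman_brown(r_half)
print(round(r_half, 2), round(r_full, 2))
```

Working from the raw scores in this way avoids slips in the intermediate sums, which are easy to make when the table is filled in by hand.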


Building Internal Consistency

From standard statistical tables, you can find the figure of reliability which would be statistically acceptable. For a group of 100 students, this value is 0.27. How do you attain this if your test gave only, say, 0.05? By increasing the length of the test. Use the following formula :

$$\frac{(r \text{ you want}) \times (1 - r \text{ you got})}{(r \text{ you got}) \times (1 - r \text{ you want})} = \frac{0.27 \times (1 - 0.05)}{0.05 \times (1 - 0.27)} = \frac{0.2565}{0.0365} = 7.03$$

It means that, to have an acceptable reliability, you should have a test 7 times longer or, in other words, of 70 items of a similar level of FV.
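The lengthening formula can be sketched as a one-line function. The name `lengthening_factor` is an assumption made for this illustration; the values 0.05 and 0.27 are the ones used in the example above, and the 10-item starting length is that of the test discussed earlier.

```python
def lengthening_factor(r_got, r_wanted):
    """How many times longer the test must be to reach `r_wanted`."""
    return (r_wanted * (1 - r_got)) / (r_got * (1 - r_wanted))

factor = lengthening_factor(r_got=0.05, r_wanted=0.27)
print(round(factor, 2))    # 7.03
print(round(10 * factor))  # 70 items for a test that started with 10
```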

*Standard error of measurement (SEM) : It is a concept related to the reliability of a test. SEM depends on the number of items in a test and is calculated by the formula :

$$\mathrm{SEM} = 0.4\sqrt{n}$$

where n is the number of items in the test. For example, SEM for a test of 20 items would be :

$$\mathrm{SEM} = 0.4\sqrt{20} = 0.4 \times 4.47 = 1.79$$

SEM for a test of 100 items will be :

$$\mathrm{SEM} = 0.4\sqrt{100} = 0.4 \times 10 = 4.0$$

Does it sound odd that the longer a test, the higher the SEM? Look at it this way: for 20 items of 1 mark each, SEM represents approximately 9% of the total marks, while for 100 items of 1 mark each, SEM represents only 4%. What do these figures mean? Just as with standard deviation, about two-thirds of the students would have got marks within 1 SEM of what they actually deserved, and 95% of the marks would fall within ± 2 SEM.
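The comparison between the 20-item and 100-item tests can be reproduced with a few lines of Python. This is a sketch of the rule of thumb quoted in the text (0.4 times the square root of the number of items), assuming 1 mark per item; the function name `sem` is an assumption made for this illustration.

```python
from math import sqrt

def sem(n_items):
    """Rule of thumb from the text: SEM = 0.4 * sqrt(number of items)."""
    return 0.4 * sqrt(n_items)

for n in (20, 100):
    s = sem(n)
    # SEM in marks, then as a percentage of the total (1 mark per item).
    print(n, round(s, 2), f"{100 * s / n:.0f}%")
# 20 1.79 9%
# 100 4.0 4%
```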

From the above, it would have become clear that reliability improves, and SEM as a proportion of the total marks falls, with the length of the test: the longer the test, the more reliable the results. This is another indirect pointer to the fact that an MCQ paper of 100 items would have far better reliability than an essay paper of 5 questions, howsoever carefully framed.

Most of the foregoing discussion has been centred around objective type questions. It does not, however, mean that the reliability of essay questions cannot be calculated. There are tests and formulae available for this also, but these are generally more difficult and require elaborate statistical treatment. For those of you who are interested in knowing about them, a few references have been listed at the end of the book.

New items

We have already emphasised that before we actually use a test, we must have the data related to each item available. You may be wondering, if you have written a few new items, how you will have these figures. Well, one of the ways to calculate the various indices related to these questions is to give them a trial run. Thus, in an actual test situation, the first 20 questions can be new questions. The students answer them and are marked on them, but the scores obtained on these 20 questions are not used for computing the results. They are used only for calculating the FV and DI, and only when a question has been found to have a satisfactory level of FV and DI is it used in the actual test situation.
