Chapter 10 : Assessment and Examinations
(i) A. I saw a big sheep over there.
    B. I saw a big ship over there.
etc.
Recognition of correct grammatical structure. Written presentation/written response. Group.
Each item below contains a group of sentences. Only one sentence in each group is correct. In the blank space at the right of each group of sentences write the letter indicating the correct sentence.
(i) A. What wants that man?
    B. What does want that man?
    C. What does that man want?
    D. What that man does want?
(ii) A. I have finished my work, and so did Paul.
     B. I have finished my work, and so has Paul.
     C. I have finished my work, and so Paul has.
     D. I have finished my work, and so Paul did.
etc.
Production of correct vocabulary. Oral presentation/response. Individual.
Examiner asks the question. The candidate must respond with the correct lexical item. Only the specified item may be accepted as correct.
(i) Q. What do you call a man who makes bread?
    A. A baker.
(ii) Q. The opposite of concave is…
     A. Convex.
etc.
Clearly discrete item tests of this kind have certain
disadvantages. Testing ability to operate various parts of the system does not
test the interrelated complex that is a system of systems—an important
implication of the underlying theory—and the need for global tests which do
interrelate the various systems is apparent. Using discrete item tests is a bit
like testing whether a potential car driver can move the gear lever into the
correct positions, depress the accelerator smoothly, release the clutch gently
and turn the steering wheel to and fro. He may be able to do all of these
correctly and yet not be able to drive the car. It is the skill which combines all
the sub-skills, control of the system which integrates the systems so that the
speaker conveys what he wishes to by the means he wishes to that constitutes
‘knowing a language’ in this sense, just as it constitutes ‘driving a car’.
Attempts were therefore made to devise types of global tests which could be
marked objectively. Two of these appear to have achieved some success:
dictation and cloze tests.
Dictation
Dictation was, of course, used as a testing device long
before Lado and the structuralist/behaviourist nexus became influential. Lado
in fact criticised dictation on three grounds, first that since the order of
words was given by the examiner, it did not test the ability to use this very
important grammatical device in English; second, since the words themselves are
given, it can in no sense be thought of as a test of lexis; and third, since
many words and grammatical forms can be identified from the context, it does
not test aural discrimination or perception. On the other hand it has been
argued that dictation involves taking in the stream of noise emitted by the
examiner, perceiving this as meaningful, and then analysing this into words
which must then be written down.
On this view the words are not
given—what are given are strings of noises. These only become words when they
have been processed by the hearer using his knowledge of the language. This
argument that perception of language, whether spoken or written, is
psychologically an active process, not purely passive, is very persuasive. That
dictation requires the co-ordination of the functioning of a substantial number
of different linguistic systems, spoken and written, seems very clear, so that
its global, active nature ought to be accepted. If this is so then the
candidate doing a dictation might well be said to be actually ‘driving the
car’.
Cloze tests
A cloze test consists of a text from which every nth
word has been deleted. The task is to replace the deleted words. The term
‘cloze’ is derived from Gestalt psychology, and relates to the apparent ability
of individuals to complete a pattern, indeed to perceive this pattern as in
fact complete, once they have grasped the structure of the pattern. Here the
patterns involved are clearly linguistic patterns. A cloze test looks something
like the following:
In the sentences of this test every fifth word has been
left out. Write in the word that fits best. Sometimes only one word will fit,
as in ‘A week has seven…’ The only word which will fit in this blank is ‘days’. But sometimes you can choose
between two or more words, as in: ‘We write with a…’ In this blank you can
write ‘pen’ or ‘pencil’ or even ‘typewriter’
or ‘crayon’. Write only one word in
each blank. The length of the blank will not help you to choose a word to put
in it. All the blanks are the same length. The first paragraph has no words
left out. Complete the sentences in the second and following paragraphs by filling in the blanks as shown above.
‘Since man first appeared on earth he has had to solve
certain problems of survival. He has had to find ways of satisfying his hunger,
clothing himself for protection against the cold and providing himself with
shelter. Fruit and leaves from trees were his first food, and his first clothes
were probably made from large leaves and animal skins. Then he began to hunt
wild animals and to trap fish.
In some such way…began to progress and
…his physical problems. But…had other, more spiritual…—for happiness, love,
security, …divine protection.’ etc.
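The deletion procedure illustrated above—every nth word removed after an intact opening stretch—is mechanical enough to sketch in a few lines. The following is a minimal illustration only, not a production cloze generator; the function name, the fixed-length blank, and the crude whitespace split are all assumptions, not anything specified in the chapter:

```python
def make_cloze(text, n=5, lead_in_words=10, blank="______"):
    """Replace every nth word with a fixed-length blank.

    The first `lead_in_words` words are left untouched, mimicking the
    intact first paragraph of the example test. Returns the gapped text
    and the list of deleted words (the marking key).
    """
    words = text.split()
    deleted = []
    # Delete the nth, 2nth, 3nth... word after the intact lead-in.
    for i in range(lead_in_words + n - 1, len(words), n):
        deleted.append(words[i])
        words[i] = blank
    return " ".join(words), deleted

gapped, key = make_cloze(
    "Since man first appeared on earth he has had to solve certain "
    "problems of survival he has had to find ways of satisfying his hunger"
)
```

Note that, as the instructions in the example insist, every blank is the same length, so its width gives the candidate no clue to the missing word.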
Like dictations, cloze tests test the ability to process
strings of aural or visual phenomena in linguistic terms such that their
potential signification is remembered and used to process further strings as
they are perceived. Cloze tests are usually presented through the written
medium and responded to in that medium too, but there seems no reason why oral
cloze should not be possible, and indeed there have been attempts to devise
such tests. (See the University of London Certificate of Proficiency in English
for Foreign Students, Comprehension of Spoken English, 1976.) Cloze tests too
are global in nature demanding perceptive and productive skills and an
integrating knowledge of the various linguistic systems, grammatical and
lexical, since some of the words left out will be grammatical and others will
be lexical. There is a good deal of discussion still going on about the
technicalities of constructing cloze tests, but useful pragmatic solutions to
many of the problems have been found, and it would seem that cloze offers a
potentially very valuable way of measuring language proficiency.
There are, however, two substantial
criticisms to be made of all tests which have a fundamentally structuralist/
behaviourist theoretical base, whether they are discrete item tests like those
of Lado, or global tests like dictation and cloze. The first of these
criticisms is that such tests rarely afford the person being tested any opportunity
to produce language spontaneously. The second is that they are fundamentally
trying to test that knowledge of the language system that underlies any actual
instance of its use—linguistic competence in Chomsky’s terms—they are not
concerned with the ability to operate the system for particular purposes with
particular people in particular situations. In other words they are testing the
basic driving skill, as does the Ministry of Transport driving test, not
whether the driver can actually use the car to get from one place to another
quickly and safely and legally—as the Institute of Advanced Motorists test
does.
Testing communication
If ‘knowing a language’ is seen as the ability to
communicate in particular sorts of situation, then the assessment will be in
terms of setting up simulations of those situations and evaluating how
effective the communication is that takes place. Situations are likely to have
to be specified in terms of the role and status of the participants, the
degree of formality of the interaction, the attitudes and purposes of the
participants, the setting or context, and the medium of transmission
used—spoken or written language. The productive-receptive dimension will also enter in
since this is often relevant to the roles of participants. A lecturer does all
the talking, his audience only listens; but a customer in a dress shop is
likely to be involved in extensive two-way exchanges with the sales assistant.
It is of course possible to devise discrete items, objectively scored tests of
communicative ability, but it would seem in general that global, subjectively
marked tests are more likely to make it possible to match the task on which the
assessment is based fairly closely with the actual performance required. The
‘situational composition’ used as a testing device is probably the most
familiar example of this, and has been part of the Cambridge Local Examinations
Syndicate’s paper in English Language for East Africa for many years. The sort
of thing that is used is exemplified by the following:
Write a reply accepting the following formal invitation:

Mr and Mrs J. Brown request the pleasure of the company of
Mr Alfred Andrews at the wedding of their daughter
Sylvia
to
Mr Alan White
on Wednesday 6th April 1977 at 2.00 p.m.
in St Martin’s Church, Puddlepool, Wessex
and afterwards at the Mount Hotel, Puddlebridge, Wessex.

18 The Crescent,                                R.S.V.P.
Puddlepool, Wessex.
There are however a great many other possibilities and one
of the most interesting explorations of what these might be is Keith Morrow’s Techniques of Evaluation for a Notional
Syllabus (RSA 1977—mimeo) from which the following examples are taken.
Identification of context of situation.
Oral—tape recorded presentation written response. Group.
Listen carefully. You are about to hear an utterance in
English. It will be repeated twice. After you have heard the utterance answer
the questions below by writing the letter of the correct reply in the
appropriately numbered box on your answer sheet. The utterance will be repeated
twice more after two minutes.
Person: ‘Excuse me, do you know where the nearest post-office is, please?’
(i) Where might somebody ask you that question?
    A. In your house.
    B. In your office.
    C. In the street.
    D. In a restaurant.
(ii) What is the person asking you about?
    A. The price of stamps.
    B. The age of the post-office.
    C. The position of the post-office.
    D. The size of the post-office.
etc.
Question (i) here relates to the setting of the utterance, (ii) to the topic;
(iii) would relate to its function, (iv) to the speaker’s role, (v) to the
degree of formality of the utterance, (vi) to the speaker’s status, and so on,
to cover as many different dimensions of the context of situation as may be
thought appropriate.
Asking questions. Mixed oral/written presentation and response. Individual.
The examiner is provided with a table of information of the
following kind:
KINGS OF ENGLAND

Name        | Came to the throne | Died | Age | Reigned
William I   | 1066               | 1087 | 60  | 21
William II  | 1087               | 1100 | 43  | 13
Henry I     | 1100               | 1135 | 67  | 35
Stephen     | 1135               | 1154 | 50  | 19
Candidates are supplied with an identical table with blanks
in certain spaces. The task is to complete the table by asking the examiner for
specific information. To ensure that the examiner treated each question on its
merits a number of different tables would be needed with different blanks at
different places for different candidates. The candidates would be assessed on
a number of related criteria. First, success: does the candidate actually
manage to fill in the blanks correctly? Second, time: how long does it take the
candidate to assess the situation and perform as required? Third, productive
skill. If he fails to ask any questions, or if his question is unlikely to be
understood by the average native speaker of English: no marks. If the question
is comprehensible but unnatural: 1 mark. If the question is appropriate,
accurate and well expressed:
4 marks. Candidates may be scaled between the extremes by
using as principal criterion how far the candidate’s faults interfere with his
ability to cope with the situation.
Clearly test items of this kind can have an almost limitless range of
variation: what has here been exemplified as oral presentation could be purely
written; information which is here exemplified as being presented in tabular
form could just as well be presented pictorially—sets of pictures of the ‘Spot
the difference’ kind, for example—and it is not unlikely that a good deal of
exciting experimentation in this field will take place in the next few years.
In the last resort most formal
assessment of English as a foreign language nowadays is a combination of
elements from a wide range of all the different kinds of test discussed above,
probably reflecting some kind of consensus view that language does involve
code, system, skill and communication.
Four kinds of assessment
If the question asked above has been ‘What kind of a thing
is it that is being assessed?’ the next question must be ‘What is the purpose
of making that assessment?’
There are at least four different sorts
of purpose that assessment may serve. First, one may wish to assess whether a
particular individual will ever be able to learn any foreign language at all.
An assessment of this kind is an assessment of aptitude. The question being asked is ‘Can he learn this at all?’
Tests designed to measure aptitude must largely be only indirectly specific
language orientated. There appear to be no tests to determine whether a
foreigner has the aptitude to learn English as such. Aptitude test batteries
include items like tests of the ability to break or use codes, to generate or
create messages on the basis of a small set of rules and symbols, tests for
memory of nonsense syllables, tests of auditory discrimination and so on. A
standardised test battery, The Modern Language Aptitude Test, has been devised
by J. B. Carroll and S. M. Sapon. Such
a test looks only forward in time from the point of the test and nothing lies
behind it in terms of English language teaching.
Second, assessment may be made to
determine how much English an individual actually knows with a view to how well
he might be able to function in situations, which may be more or less closely
specified, often quite outside the language learning classroom. The basic
question being asked is ‘Does he know enough English to…?’ ‘…follow a course in
atomic physics?’ ‘…act as waiter in a tourist hotel?’ and so on. Assessment of
this kind is assessment of proficiency.
Tests of proficiency look back over previous language learning, the precise
details of which are probably unknown, with a view to possible success in some
future activity, not necessarily language learning but requiring the effective
use of language. Proficiency tests do, however, sometimes have a direct
language teaching connection. They might, for example, be used to classify or
place individuals in appropriate language classes, or to determine their
readiness for particular levels or kinds of instruction. The question here is a
rather specific one like ‘Does he know enough to fit into the second advanced
level class in this institution?’ Thus selection examinations and placement
tests are basically proficiency tests. The title of the well-known Cambridge
Proficiency Examination implies proficiency in English to do something else,
like study in a British institution of further education.
Third, assessment may be made to
determine the extent of student learning, or the extent to which instructional
goals have been attained. In other words the question being asked is ‘Has he
learned what he has been taught?’ Indirectly of course such assessment may help
to evaluate the programme of instruction, to say nothing of the capabilities of
the teacher. If he has learned what he has been taught the teaching may well be
all right; if he hasn’t, the teaching may well have to be looked at carefully
and modified and improved. Assessments of this kind are assessments of achievement. Tests of achievement look
only backwards over a known programme of teaching. Most ordinary class tests,
the quick oral checks of fluency or aural discrimination that are part of
almost every lesson are achievement tests, and so too should be end of term or
end of year examinations.
Lastly, assessment may be undertaken to
determine what errors are occurring, what malfunctioning of the systems there
may be, with a view to future rectification of these. The question being asked
is ‘What has gone wrong that can be put right, and why did it go wrong?’
Assessment of this kind is diagnostic.
Diagnostic tests look back over previous instruction with a view to modifying
future instruction. The details of past instruction may be known or not, so
some kinds of diagnostic test will be like proficiency tests, some will be like
achievement tests in this regard. However, it is important at all times to bear
in mind the basic question which is being asked, and to realise that items
which may be very good tests of actual achievement may be very poor
diagnostically. A diagnostic test ought to reveal an individual’s strengths and
weaknesses and it is therefore likely that it will have to be fairly
comprehensive, and devote special attention to known or predicted areas of
particular difficulty for the learner. Diagnostic tests are most often used
early in a course, when particular difficulties begin to arise and the teacher
wants to pin down just what is going wrong so that he can do something about
it. Such tests are almost always informal and devised for quite specific
situations.
The four terms aptitude, proficiency, achievement, and diagnostic are very frequent in the literature on testing and it is
well to get their meaning clear. It is also worth noting the characteristic
usages which these terms have. A learner may have an aptitude for
English language learning; if he does he may quickly attain sufficient proficiency
in English for him to be able to
study mathematics; this means he has achieved
a satisfactory standard, but a test may diagnose certain faults in
his English or in the teaching he has received.
Test qualities
There remains one other important question to ask about any
assessment of knowledge of the English language—‘Does it work?’ Here again
there may be at least four different ways in which this question may be
interpreted. The first of these is revealed by the question ‘Does it measure
consistently?’ A metre stick measures the same distance each time because it is
rigid and accurately standardised against a given norm. A piece of elastic with
a metre marked on it is very unlikely to measure the same every time. In this
case the metre stick can be said to be a reliable
measure. In the same way reliability in instruments for measuring language
ability is obviously highly desirable, but very difficult to achieve. Among the
reasons for this are the effects of variation in pupil motivation, and of the
range of tasks set in making an assessment. A pupil who is just not interested
in doing a test will be unlikely to score highly on it. Generally speaking the
more instances of pupil language behaviour that can be incorporated into a test
the better. It is for this reason that testing specialists have tended to
prefer discrete item test batteries in which a large number of different
instances of language activity are used, to essay type examinations where the
tasks set are seen as more limited in kind and number. Variations in the
conditions under which tests are taken can also affect reliability—small
variations in timing where precise time limits are required for example, a
stuffy room, the time of day when the test is taken, or other equally
trivial-seeming factors may all distort test results. Perhaps most important of
all in its consequences on test results is the reliability of the marker. This
reliability may be high in objectively marked tests—like multiple-choice
tests—but can be low in free response tests—like essays—if a structured
approach or multiple marking are not used. Determining test reliability
requires a certain amount of technical know-how and familiarity with the
statistical techniques which permit the calculation of a reliability
coefficient. Guidance to these will be found in the books referred to for
further reading at the end of this chapter.
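As a concrete illustration of the kind of calculation those books discuss, the sketch below computes a split-half reliability coefficient with the Spearman-Brown correction—one standard technique among several, chosen here only as an example. The function names, the odd/even split, and the 0/1 item-score matrix are all illustrative assumptions:

```python
def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one list of per-item 0/1 scores per candidate.

    The test is split into odd- and even-numbered items; the correlation
    between the two half-test totals is then stepped up with the
    Spearman-Brown formula to estimate full-test reliability.
    """
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)
```

The Spearman-Brown step reflects the point made in the text that, other things being equal, longer tests—more instances of pupil language behaviour—are more reliable than shorter ones.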
The second way in which the question
‘Does it work?’ can be made more precise is by rephrasing it as ‘Does it
distinguish between one pupil and another?’ A metre stick may be a suitable
instrument for measuring the dimensions of an ordinary room, but it would not
be suitable for measuring a motorway or the gap of a spark plug for a car. In
one case the scale of the object to be measured is too great, in the other it
is too small. Not only should the instrument which is used be appropriate to
the thing being measured but the scale on the instrument should be right too. A
micrometer marked only in centimetres would not permit accurate measurement of
watch parts, the scale needs to be fractions of millimetres. Tests which have
the right sort of scale may be said to discriminate
well. Tests which are on the whole too easy or too difficult for the pupils who
do them do not discriminate well; they do not spread the pupils out, since
virtually all pupils score high marks or all pupils score low marks. Ideally
the test should give a distribution which comes close to that of the normal
distribution curve.
One needs to be careful in reading the
literature on testing when the term discrimination
index is encountered. This has little to do with discrimination in the
sense discussed above. It refers rather to the product of statistical
procedures which measure the extent to which any single item in a test measures
the same thing as the whole of the test. By calculating a discrimination index
for each item in a test it is possible to select those items which are most
efficient in distinguishing between the top one-third and the bottom one-third
of any group for whom the test as a whole is about right. In other words it will
help to establish the measuring scale within the limits of the instrument
itself and ensure that that is about right, giving a proper distribution of
easy and difficult questions within the test. But a discrimination index has no absolute value; to get the overall
level of difficulty of the test right requires a pragmatic approach with
repeated retrials of the test items, accepting some and rejecting others until
the correct combination has been achieved. Again details of these technical
matters will be found in the books for further reading.
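The top-third/bottom-third procedure described above can be sketched briefly. The following is one common way of computing an item discrimination index, given purely as an illustration; the function names and the 0/1 score matrix are assumptions, operational item analysis often uses 27% extreme groups rather than thirds, and at least three candidates are assumed:

```python
def discrimination_index(item_scores, item):
    """item_scores: list of per-candidate 0/1 item score lists.

    Returns D for the given item (a column index): the proportion of
    the top third of candidates (ranked by total score) answering it
    correctly, minus the proportion of the bottom third doing so.
    """
    ranked = sorted(item_scores, key=sum, reverse=True)
    k = len(ranked) // 3  # size of each extreme group
    top, bottom = ranked[:k], ranked[-k:]
    p_top = sum(row[item] for row in top) / k
    p_bottom = sum(row[item] for row in bottom) / k
    return p_top - p_bottom
```

An item answered correctly mainly by high scorers yields a D near 1; one answered equally often by weak and strong candidates yields a D near 0 and is doing little to spread the pupils out.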
The third way in which the ‘Does it
work?’ question may be more fully specified is by asking ‘Does it measure what
it is supposed to measure?’ A metre stick is very practical for measuring cloth
but it is irrelevant for measuring language ability. ‘What it is supposed to
measure’ in the case of English language tests is presumably ability in English
language, and the only way that the extent to which a test actually does this
can be determined is by comparing the test results with some other outside
measurement, some other way of estimating pupil ability, a way which ought to
be at least as reliable and accurate as the test itself. Where the results of
the outside measure match the results of the test reasonably closely the test
can be said to have empirical validity.
Suitable outside measures are difficult to come by. So far the best criterion
which seems to have been found is a teacher’s rating. An experienced teacher
who knows his class well can rank pupils in order of merit with considerable
reliability and accuracy. Thus tests whose results correlate well with teacher
ratings can be regarded as empirically valid, and the correspondence between
the two measures can be expressed as a coefficient
of validity. Testing specialists like such coefficients to have a value
higher than 0.7—perfect correlation would give a coefficient of 1.0.
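The calculation behind such a coefficient can be illustrated simply. The sketch below computes Spearman’s rank correlation between pupils’ test scores and a teacher’s order-of-merit ranking—a reasonable choice of coefficient for rank-order data, though not the only one; the data and function names are assumptions, and no tie handling is included:

```python
def ranks(values):
    """1-based ranks, highest value ranked 1 (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def validity_coefficient(test_scores, teacher_ranks):
    """Spearman's rho between test scores and the teacher's ranking."""
    n = len(test_scores)
    d2 = sum((a - b) ** 2
             for a, b in zip(ranks(test_scores), teacher_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

On the criterion quoted above, a test whose scores produced a coefficient of, say, 0.8 against the teacher’s ranking would be regarded as empirically valid; one producing 0.5 would not.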
It is clear of course that empirical
validity is unlikely to be achieved unless a test is constructed in accordance
with some respectable theory of language. It is also unlikely to be achieved
unless the test adequately samples the knowledge and activities which are
entailed by showing that one knows a language. However, a theoretical base and
adequate sampling do not guarantee empirical validity—to gain that, the test
must be set against some external criterion.
There is one final kind of validity
which is sometimes discussed in the literature on assessment. This is ‘face
validity’. This is a matter of how the test appears to the pupils being tested,
to teachers, administrators and so on. If the form or content of a test appears
foolish or irrelevant or inconsequential, then users of the test will be
suspicious of it; those in authority will be unlikely to adopt it, pupils may
be poorly motivated by it. Thus test makers must ensure that a test not only
tests what it is supposed to test, reliably and accurately, but that it looks as
though that is what it does.
A final characteristic of a good
language test is practicability. By
this is meant the extent to which the test is readily usable by teachers with
limited time and resources at their disposal. Such factors as the cost of the
test booklets, the amount of time and manpower needed to prepare, administer,
invigilate, mark and interpret the test, the requirements for special equipment
and so on must all be taken into account. For example a standardised test which
employs re-usable test booklets with separate answer sheets is likely to be
much cheaper to run than one which uses consumable test booklets. Tests which
take relatively little time to work and process are likely to be preferred to
those which take a lot of time, those which can be given to many pupils
simultaneously are usually more practicable than those which require individual
administration. Simple paper and pencil tests may well be preferred to those
which require elaborate audio- or video-recording equipment. Up-to-date tests
whose cultural content is unexceptional are evidently better than those which
are out of date and contain culturally inappropriate or objectionable material,
those with clear instruction manuals are better than those with obscure
manuals, and so on. The test maker needs to bear all such factors in mind, but
he should also bear in mind that the testing of some kinds of activity relevant
to some dimensions of ‘knowing a language’ may require the use of elaborate
equipment or individualised methods and a proper balance must be struck.
In the classroom the teacher finds
himself faced with having to assess the progress of his pupils, to judge their
suitability for one class or another and so on. He must decide out of the whole
complex of considerations which has been outlined above what kind of assessment
he wishes to make, of what aspects of his pupils’ learning, with what kind of
reliability and what kind of validity. Once those decisions are made he can go
ahead with devising his instrument for making the assessment. For help with
that he will find J.B. Heaton’s Writing
English Language Tests a useful book along with the book by Lado mentioned
earlier, and that by Rebecca Valette listed below.
The last matters to which it would seem
appropriate to give some attention here concern standardised English language
tests, and the public examinations systems.
A number of standardised tests exist.
Among these the Davis test has been widely used. More recently Elizabeth Ingram
has published English Language Battery
but the American tests in this area seem to be more readily available. Among
the best known of these are Robert Lado’s English Language Test for Foreign
Students, which developed into the Michigan Test of English Language
Proficiency, and TOEFL, the Educational Testing Service’s Test of English as a
Foreign Language.
Further information and discussion of such tests will be found in The Seventh
Mental Measurements Yearbook, ed. O. K. Buros.
Public examinations
The public examination system tends to vary from country to
country. One of the tasks which every teacher has when he takes up an
appointment in a new country is to discover just what the requirements of the
public examination system are. He needs to obtain copies of syllabuses, past
papers, regulations, and the reports of the examiners, where these are
published, and to familiarise himself with them. From this he should be able to
discover what real linguistic skills are required of examination candidates and
what kinds of examination techniques they will need to have mastered. It is
then possible to concentrate substantially on teaching the language skills and,
in about the last one-tenth of the course, to teach the necessary techniques
for passing the examination. Most teachers devote far too much time to practice
examinations—pupils often seem to like it, but it is rarely in their best
interests since many good examining techniques do little to foster greater
learning—dictation is a good case in point. For information about the public
examinations most widely taken in Britain, one can do little better than
consult J.McClafferty’s A Guide to
Examinations in English for Foreign Students. In this there are useful
hints on preparing for the examinations, details of the various examinations
offered by the boards and summaries of regulations and entry requirements. It
covers the examinations of the Cambridge Local Examinations
Syndicate, the Royal Society of Arts, the London Chamber of
Commerce, and the ARELS Oral Examination, and has a supplementary list of other
examinations in English for foreign students—altogether a very helpful
document. Much of the preliminary investigatory work suggested in the previous
paragraph has been done for the teacher by this book; there remains only the
task of analysing past papers and consulting the annual reports of the
examiners.
There are a number of types of
examination or methods of assessment which have not been discussed at all in
this chapter but which a teacher may come across from time to time. One of
these is assessment by using a structured interview schedule. Here the test
takes the form of an interview and the linguistic tasks demanded of the
candidate are progressively elaborated according to a fixed programme. The
point at which the candidate begins to fail in these tasks gives him a rating
on the schedule. Such examinations are usually entirely oral—though clearly
there is no absolute necessity that they should be so—and the rating is usually
arrived at by subjective judgment against a fairly detailed specification of
performance features, sometimes by a panel of judges. Another type of test is
that involving simultaneous translation—usually reserved for assessing
interpreters—but there are a number of such techniques and it is wise to keep
an open mind towards them for they might well turn out to be useful some day.
The final word is—avoid too much
assessment; resist pressures which might make examinations dominate teaching.
Suggestions for further reading
J. P. B. Allen and S. Pit Corder, The Edinburgh Course in Applied Linguistics, Vol. 4, Testing and Experimental Methods, Oxford University Press, 1977.
A. Davies, Language Testing Symposium: A Psycholinguistic Approach, Oxford University Press, 1968.
D. P. Harris, Testing English as a Second Language, New York: McGraw-Hill, 1969.
J. Oller, Language Tests at School, Longman, 1979.
R. M. Valette, Modern Language Testing: A Handbook, 2nd edn, New York: Harcourt Brace Jovanovich, 1977.