Chapter 10

Assessment and Examinations

Basic terms

A great deal of the language teacher’s time and attention is devoted to assessing the progress pupils make or preparing them for public examinations. One of the problems in discussing this area of English language teaching is that the words used to describe these activities are used in a number of different ways. First of all, the term examination usually refers to a formal set-piece kind of assessment. Typically one or more three-hour papers have to be worked. Pupils are isolated from one another and usually have no access to textbooks, notes or dictionaries. An examination of this kind may be set by the teachers or head of department in a school, or by some central examining body like the Ministry of Education in various countries or the Cambridge Local Examinations Syndicate—to mention only the best known of the British examining bodies. This usage of the word examination is fairly consistent in the literature on the subject and presents few difficulties.

The word test is much more complicated. It has at least three quite distinct meanings. One of them refers to a carefully prepared measuring instrument, which has been tried out on a sample of people like those who will be assessed by it, which has been corrected and made as efficient and accurate as possible using the whole panoply of statistical techniques appropriate to educational measurement. The preparation of such tests is time-consuming, expensive and requires expertise in statistical techniques as well as in devising suitable tasks for the linguistic assessment to be based on.

The second meaning of test refers to what is usually a short, quick teacher-devised activity carried out in the classroom, and used by the teacher as the basis of an ongoing assessment. It may be more or less formal, more or less carefully prepared, ranging from a carefully devised multiple-choice test of reading comprehension which has been used several times with pupils at about the same stage and of the same ability, so that it has been possible to revise the test, eliminate poor distractors and build up norms which might almost be accepted as statistically valid, to a quick check of whether pupils have grasped the basic concept behind a new linguistic item, by using a scatter of oral questions round the class. It is because of the wide range of interpretation that is put on this second meaning of test that confusions and controversy often arise. The important question to ask is always ‘What kind of test do you mean?’ and it is for this reason that there is perhaps some advantage in talking about assessment rather than testing.

The third meaning which is sometimes given to test is that of an item within a larger test, part of a test battery, or even sometimes what is often called a question in an examination. Sometimes when one paper in an examination series is devised to be marked objectively it is called a test, and once again it is important to be careful in interpreting just what is meant.

Subjective and objective testing

There is another pair of terms used in connection with assessment—one of them was used in the last sentence—which also need to be clarified. These are the terms subjective and objective. There is often talk of objective tests. It is important to note that these words refer only to the mode by which the test is marked; there is nothing intrinsically objective about any test or test item. The understanding is that objective tests are those which can be marked almost entirely mechanically, by an intelligent automaton or even a machine. The answers are usually recorded non-linguistically, by a tick or a cross in a box, a circle round a number or letter, or the writing of a letter or number. Occasionally an actual word or punctuation mark may be used. Typically such tests take the multiple-choice format or a blank-filling format but no real linguistic judgment is required of the marker. Subjective tests on the other hand can only be marked by human beings with the necessary linguistic knowledge, skill and judgment. Usually the minimum requirement for an answer is a complete sentence, though sometimes single words may be sufficient. It must be recognised, however, that the creation and setting of both kinds is ultimately subjective, since the choice of items, their relative prominence in the test and so on are matters of the knowledge, skill and judgment of the setter. Furthermore, evaluating a piece of language like a free composition is virtually an entirely subjective matter, a question of individual judgment, and quasi-analytic procedures like allocating so many marks for spelling, so many for grammar, so many for ‘expression’ and so on do almost nothing to reduce that fundamental subjectivity. A checklist of points to watch may help to make the marking more consistent but it is well to recognise that the marking is none the less subjective.
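To make concrete what marking ‘almost entirely mechanically’ can mean, the following Python sketch scores a candidate's recorded letters against a key. The function name `mark_objectively`, the key and the answer sheet are all invented for illustration; the two items echo the grammatical-structure examples given later in this chapter.

```python
def mark_objectively(key, responses):
    """Return the number of responses that match the key exactly.

    No linguistic judgment is involved: each recorded letter is
    simply compared with the letter in the key.
    """
    return sum(
        1
        for item, correct in key.items()
        if responses.get(item, "").strip().upper() == correct
    )


# Hypothetical key and one candidate's answer sheet.
key = {"i": "C", "ii": "B"}
candidate = {"i": "C", "ii": "D"}

score = mark_objectively(key, candidate)
print(f"{score} out of {len(key)}")
```

The marker here needs no knowledge of English at all, which is the sense in which such marking is called objective; the judgment lies entirely in the choice of items and of the key.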

It is frequently claimed that the results obtained from objective tests are ‘better’ than those obtained from subjectively marked tests or examinations, and books like the classic The Marking of English Essays by P. Hartog et al. with their frightening picture of the unreliability and inconsistency of marking in public examinations give good grounds for this claim. However, there are two devices which may be used to improve the consistency and reliability of subjective marking. One is to use the Nine Pile Technique and the other is to use multiple marking.

The Nine Pile Technique is based on the assumption that in any population the likelihood is that the distribution of abilities will follow a normal curve, and that subjective judgments are more reliable over scales with few points on them than over scales with a large number of points on them. In other words a five-point scale will give reasonable results, a fifty-point scale will not. Suppose a teacher has ninety-nine essays to mark. He will begin by reading these through quickly and sorting them into three piles on the basis of a straight global subjective evaluation: Good, Middling, Poor. In order to get an approximately normal distribution he would expect about seventeen of the ninety-nine to be Good, sixty-five to be Middling, and seventeen to be Poor. Next he takes the Good pile and sorts these on the basis of a second reading into Outstanding, Very Good, and Good piles. In the Outstanding pile he might put only one essay, in the Very Good pile four, and the remaining twelve in the Good pile. Similarly he would sort the Poor pile into Appalling, Very Poor, and Poor with approximately the same numbers. Finally he would sort the Middling pile into three, Middling/Good, Middling, and Middling/Bad in the proportion of about twenty, twenty-five, and twenty. This sorting gives a nine-point scale which has been arrived at by a double marking involving an element of overlap. Obviously if the second reading requires a Middling/Bad essay to go into the Poor pile or a Poor essay to go into the Middling pile such adjustments can easily be made. This technique has been shown to give good consistency as between different markers and the same marker over time.
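The arithmetic of the nine piles can be checked, and adapted to a class of a different size, with a short Python sketch. The pile sizes are those given above for ninety-nine essays; the function `target_pile_sizes` and its proportional scaling with largest-remainder rounding are assumptions introduced here, not part of the technique as described. The resulting figures are only rough targets, since the sorting itself remains a matter of the marker's judgment.

```python
# Pile sizes for ninety-nine essays, as given in the text
# (Outstanding first, Appalling last).
PILES_99 = {
    "Outstanding": 1,
    "Very Good": 4,
    "Good": 12,
    "Middling/Good": 20,
    "Middling": 25,
    "Middling/Bad": 20,
    "Poor": 12,
    "Very Poor": 4,
    "Appalling": 1,
}


def target_pile_sizes(n_essays):
    """Scale the ninety-nine-essay proportions to n_essays.

    Assumption: proportional scaling with largest-remainder rounding,
    so that the targets always add up to n_essays.
    """
    total = sum(PILES_99.values())
    exact = {name: n_essays * size / total for name, size in PILES_99.items()}
    sizes = {name: int(value) for name, value in exact.items()}
    leftover = n_essays - sum(sizes.values())
    # Hand the remaining essays to the piles with the largest fractions.
    by_fraction = sorted(
        exact, key=lambda name: exact[name] - sizes[name], reverse=True
    )
    for name in by_fraction[:leftover]:
        sizes[name] += 1
    return sizes


print(target_pile_sizes(99))  # reproduces the figures in the text
print(target_pile_sizes(30))  # rough targets for a class of thirty
```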

If this technique is then combined with multiple marking, that is to say getting a second or third marker to re-read the essays and to make adjustments between piles, the results are likely to be even more consistent and reliable. There is a very cogently argued case for multiple marking made out in Multiple Marking of English Compositions by J. Britton et al. Techniques such as these acknowledge the fundamentally subjective nature of the assessments being made, but they exploit the psychological realities of judgement-making in a controlled way and this is surely sensible and useful. The time required for multiple marking is no greater than that required for using a conventional analytic mark allocation system and there seems little justification for clinging to the well worn and substantially discredited ways.

All of the above is almost by way of being preliminary. When the fundamentals of what assessing progress in learning a foreign language really involves are considered it becomes clearly apparent that it is the underlying theoretical view of what language is and how it works that is most important.

Discrete item tests

If language is seen as a kind of code, a means by which ‘ideas’ may be expressed as easily by one set of symbols as by another, then it is likely that the bilingual dictionary and the grammar will be seen as the code books by means of which the cypher may be broken. Knowing a language will be seen as the ability to operate the code so assessment will be in terms of knowledge of the rules—the grammar—and facility in transferring from one set of symbols to another—translation. It would seem that the great majority of foreign language examinations in Britain today still reflect this as their underlying theory. The typical rubric of an assessment of language seen in this way is ‘Translate the following into English’ or ‘Give the second person plural of the preterite of the following verbs.’

If language is seen as an aggregate of ‘skills’ of various kinds, then assessment is likely to be in terms of a classification of ‘skills’. So there might be tests of the ability to hear, to discriminate between sounds or perceive tone patterns or comprehend intellectually what is spoken; tests of the ability to speak, to produce the noises of the language correctly, to utter accurately, fluently and coherently; tests of the ability to understand the written form of the language, to read quickly, accurately and efficiently, to skim, to look up information; tests of the ability to use the graphic symbol system and its associated conventions, or to generate accurate, fluent and coherent language in the written medium; tests of the ability to interrelate media, to read aloud, to take dictation; and so on. Virtually all theoretical approaches to language take a skills dimension into account and in the examples which occur later in this chapter it will be observed that part of the specification of the type of test being illustrated relates to the skills involved.

If language is seen as a structured system by means of which the members of a speech community interact, transmitting and receiving messages, then assessment will be seen in terms of structure and system, of transmission and reception. Robert Lado’s substantial work Language Testing: The Construction and Use of Foreign Language Tests is full of examples of the kind of test item this view engenders. Since language is seen as a number of systems, there will be items to test knowledge of both the production and reception of the sound segment system, of the stress system, the intonation system, the morphemic system, the grammatical system, the lexical system and so on. The tendency is to give prominence to discrete items of language and relatively little attention to the way language functions globally. There is a tendency, too, for assessments made with this theoretical background to have a behavioural dimension and to be designed to be marked objectively. Some examples of the kind of thing involved follow:

Recognition of sound segments. Oral presentation/written response. Group.

The examiner will read one of the sentences in each of the following groups of sentences. Write the letter of the sentence you heard in the space provided on the right hand side of the page.

(i) A. I saw a big sheep over there.

B. I saw a big ship over there.

etc.

Recognition of correct grammatical structure. Written presentation/written response. Group.

Each item below contains a group of sentences. Only one sentence in each group is correct. In the blank space at the right of each group of sentences write the letter indicating the correct sentence.

(i) A. What wants that man?

B. What does want that man?

C. What does that man want?

D. What that man does want?

(ii) A. I have finished my work, and so did Paul.

B. I have finished my work, and so has Paul.

C. I have finished my work, and so Paul has.

D. I have finished my work, and so Paul did.

etc.

Production of correct vocabulary. Oral presentation/response. Individual.

Examiner asks the question. The candidate must respond with the correct lexical item. Only the specified item may be accepted as correct.

(i) Q. What do you call a man who makes bread?

A. A baker.

(ii) Q. The opposite of concave is…

A. Convex.

etc.

Clearly discrete item tests of this kind have certain disadvantages. Testing ability to operate various parts of the system does not test the interrelated complex that is a system of systems—an important implication of the underlying theory—and the need for global tests which do interrelate the various systems becomes apparent. Using discrete item tests is a bit like testing whether a potential car driver can move the gear lever into the correct positions, depress the accelerator smoothly, release the clutch gently and turn the steering wheel to and fro. He may be able to do all of these correctly and yet not be able to drive the car. It is the skill which combines all the sub-skills, control of the system which integrates the systems so that the speaker conveys what he wishes to by the means he wishes to, that constitutes ‘knowing a language’ in this sense, just as it constitutes ‘driving a car’. Attempts were therefore made to devise types of global tests which could be marked objectively. Two of these appear to have achieved some success: dictation and cloze tests.

Dictation

Dictation was, of course, used as a testing device long before Lado and the structuralist/behaviourist nexus became influential. Lado in fact criticised dictation on three grounds, first that since the order of words was given by the examiner, it did not test the ability to use this very important grammatical device in English; second, since the words themselves are given, it can in no sense be thought of as a test of lexis; and third, since many words and grammatical forms can be identified from the context, it does not test aural discrimination or perception. On the other hand it has been argued that dictation involves taking in the stream of noise emitted by the examiner, perceiving this as meaningful, and then analysing this into words which must then be written down.

On this view the words are not given—what are given are strings of noises. These only become words when they have been processed by the hearer using his knowledge of the language. This argument that perception of language, whether spoken or written, is psychologically an active process, not purely passive, is very persuasive. That dictation requires the co-ordination of the functioning of a substantial number of different linguistic systems, spoken and written, seems very clear, so that its global, active nature ought to be accepted. If this is so then the candidate doing a dictation might well be said to be actually ‘driving the car’.

Cloze tests

A cloze test consists of a text from which every nth word has been deleted. The task is to replace the deleted words. The term ‘cloze’ is derived from Gestalt psychology, and relates to the apparent ability of individuals to complete a pattern, indeed to perceive this pattern as in fact complete, once they have grasped the structure of the pattern. Here the patterns involved are clearly linguistic patterns. A cloze test looks something like the following:

In the sentences of this test every fifth word has been left out. Write in the word that fits best. Sometimes only one word will fit as in ‘A week has seven…’ The only word which will fit in this blank is ‘days’. But sometimes you can choose between two or more words, as in: ‘We write with a…’ In this blank you can write ‘pen’ or ‘pencil’ or even ‘typewriter’ or ‘crayon’. Write only one word in each blank. The length of the blank will not help you to choose a word to put in it. All the blanks are the same length. The first paragraph has no words left out. Complete the sentences in the second and following paragraphs by filling in the blanks as shown above.

‘Since man first appeared on earth he has had to solve certain problems of survival. He has had to find ways of satisfying his hunger, clothing himself for protection against the cold and providing himself with shelter. Fruit and leaves from trees were his first food, and his first clothes were probably made from large leaves and animal skins. Then he began to hunt wild animals and to trap fish.

In some such way … began to progress and … his physical problems. But … had other, more spiritual …—for happiness, love, security, … divine protection.’ etc.
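The mechanical part of constructing such a test, deleting every nth word and keeping a record of what was deleted, can be sketched in a few lines of Python. The function name `make_cloze` and its parameters are invented for illustration. The sample sentence is the opening of the passage above, used here only as convenient text; in the test itself the first paragraph is left intact. Blanks are given a fixed length, in line with the instruction that the length of the blank gives no help.

```python
import re


def make_cloze(text, n=5, blank="_____"):
    """Replace every n-th word of text with a blank of fixed length.

    Returns the gapped text and the list of deleted words (the key).
    """
    gapped, answers = [], []
    for position, word in enumerate(text.split(), start=1):
        if position % n == 0:
            # Keep any surrounding punctuation so that only the word
            # itself is removed.
            lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", word).groups()
            answers.append(core)
            gapped.append(f"{lead}{blank}{trail}")
        else:
            gapped.append(word)
    return " ".join(gapped), answers


sentence = (
    "Since man first appeared on earth he has had to solve "
    "certain problems of survival."
)
gapped, answers = make_cloze(sentence, n=5)
print(gapped)
print(answers)
```

A sketch of this kind settles only where the blanks fall. Whether a candidate's word is acceptable when several would fit, as with ‘pen’ or ‘pencil’, is a separate decision for the test constructor.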

Like dictations, cloze tests test the ability to process strings of aural or visual phenomena in linguistic terms such that their potential signification is remembered and used to process further strings as they are perceived. Cloze tests are usually presented through the written medium and responded to in that medium too, but there seems no reason why oral cloze should not be possible, and indeed there have been attempts to devise such tests. (See the University of London Certificate of Proficiency in English for Foreign Students, Comprehension of Spoken English, 1976.) Cloze tests too are global in nature demanding perceptive and productive skills and an integrating knowledge of the various linguistic systems, grammatical and lexical since some of the words left out will be grammatical and others will be lexical. There is a good deal of discussion still going on about the technicalities of constructing cloze tests but useful pragmatic solutions to many of the problems have been found and it would seem that cloze offers a potentially very valuable way of measuring language proficiency.

There are, however, two substantial criticisms to be made of all tests which have a fundamentally structuralist/behaviourist theoretical base, whether they are discrete item tests like those of Lado, or global tests like dictation and cloze. The first of these criticisms is that such tests rarely afford the person being tested any opportunity to produce language spontaneously. The second is that they are fundamentally trying to test that knowledge of the language system that underlies any actual instance of its use—linguistic competence in Chomsky’s terms—they are not concerned with the ability to operate the system for particular purposes with particular people in particular situations. In other words they are testing the basic driving skill, as does the Ministry of Transport driving test, not whether the driver can actually use the car to get from one place to another quickly and safely and legally—as the Institute of Advanced Motorists test does.

Testing communication

If ‘knowing a language’ is seen as the ability to communicate in particular sorts of situation, then the assessment will be in terms of setting up simulations of those situations and evaluating how effective the communication is that takes place. Situations are likely to have to be specified in terms of the role and status of the participants, the degree of formality of the interaction, the attitudes and purposes of the participants, the setting or context and the medium of transmission used—spoken or written language. The productive-receptive dimension will also enter in since this is often relevant to the roles of participants. A lecturer does all the talking, his audience only listens; but a customer in a dress shop is likely to be involved in extensive two-way exchanges with the sales assistant. It is of course possible to devise discrete item, objectively scored tests of communicative ability, but it would seem in general that global, subjectively marked tests are more likely to make it possible to match the task on which the assessment is based fairly closely with the actual performance required. The ‘situational composition’ used as a testing device is probably the most familiar example of this, and has been part of the Cambridge Local Examinations Syndicate’s paper in English Language for East Africa for many years. The sort of thing that is used is exemplified by the following:

Write a reply accepting the following formal invitation:

Mr and Mrs J.Brown request the pleasure of the company of

Mr Alfred Andrews at the wedding of their daughter

Sylvia to

Mr Alan White

on Wednesday 6th April 1977 at 2.00 p.m.

in St Martin’s Church, Puddlepool, Wessex

and afterwards at the Mount Hotel, Puddlebridge, Wessex.

18 The Crescent, Puddlepool, Wessex.

R.S.V.P.

There are however a great many other possibilities and one of the most interesting explorations of what these might be is Keith Morrow’s Techniques of Evaluation for a Notional Syllabus (RSA 1977—mimeo) from which the following examples are taken.

Identification of context of situation. Oral (tape-recorded) presentation/written response. Group.

Listen carefully. You are about to hear an utterance in English. It will be repeated twice. After you have heard the utterance answer the questions below by writing the letter of the correct reply in the appropriately numbered box on your answer sheet. The utterance will be repeated twice more after two minutes.

Person: ‘Excuse me, do you know where the nearest post-office is, please?’

(i) Where might somebody ask you that question?

A. In your house.

B. In your office.

C. In the street.

D. In a restaurant.

(ii) What is the person asking you about?

A. The price of stamps.

B. The age of the post-office.

C. The position of the post-office.

D. The size of the post-office.

etc.

Question (i) here relates to the setting of the utterance, (ii) to the topic; (iii) would relate to its function, (iv) to the speaker’s role, (v) to the degree of formality of the utterance, (vi) to the speaker’s status, and so on, to cover as many different dimensions of the context of situation as may be thought appropriate.

Asking questions. Mixed oral/written presentation and response. Individual.

The examiner is provided with a table of information of the following kind:

KINGS OF ENGLAND

Name	Came to the throne	Died	Age	Reigned
William I	1066	1087	60	21
William II	1087	1100	43	13
Henry I	1100	1135	67	35
Stephen	1135	1154	50	19

Candidates are supplied with an identical table with blanks in certain spaces. The task is to complete the table by asking the examiner for specific information. To ensure that the examiner treated each question on its merits, a number of different tables would be needed, with different blanks at different places for different candidates. The candidates would be assessed on a number of related criteria. First, success. Does the candidate actually manage to fill in the blanks correctly? Second, time. How long does it take the candidate to assess the situation and perform as required? Third, productive skill. If he fails to ask any questions, or if his question is unlikely to be understood by the average native speaker of English: no marks. If the question is comprehensible but unnatural: 1 mark. If the question is appropriate, accurate and well expressed: 4 marks. Candidates may be scaled between the extremes by using as principal criterion how far the candidate’s faults interfere with his ability to cope with the situation.
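Preparing ‘a number of different tables … with different blanks at different places’ is a mechanical job, and one way of doing it can be sketched in Python. The function `candidate_table`, the use of a numbered seed for each candidate, and the decision never to blank the Name column (so that a candidate can always refer to a row) are assumptions made for this illustration; the data are those of the examiner's table above.

```python
import random

HEADERS = ["Name", "Came to the throne", "Died", "Age", "Reigned"]
KINGS = [
    ["William I", 1066, 1087, 60, 21],
    ["William II", 1087, 1100, 43, 13],
    ["Henry I", 1100, 1135, 67, 35],
    ["Stephen", 1135, 1154, 50, 19],
]


def candidate_table(rows, n_blanks, seed):
    """Copy the examiner's table with n_blanks cells blanked at random.

    Assumption: the first column (Name) is never blanked. The same
    seed always yields the same table, so it can be reproduced later.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(rows)) for c in range(1, len(rows[0]))]
    blanks = set(rng.sample(cells, n_blanks))
    table = [
        [None if (r, c) in blanks else value for c, value in enumerate(row)]
        for r, row in enumerate(rows)
    ]
    return table, sorted(blanks)


# One table per candidate, each with four blanks (None) to be filled
# by questioning the examiner.
for seed in (1, 2):
    table, blanks = candidate_table(KINGS, n_blanks=4, seed=seed)
    print("\t".join(HEADERS))
    for row in table:
        print("\t".join("" if cell is None else str(cell) for cell in row))
    print()
```

The list of blanked cells returned alongside each table gives the examiner a record of which facts that candidate has to ask for.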

Clearly test items of this kind can have an almost limitless range of variation: what has here been exemplified as oral presentation could be purely written, information which is here exemplified as being presented in tabular form could just as well be presented pictorially—sets of pictures of the ‘Spot the difference’ kind for example, and it is not unlikely that a good deal of exciting experimentation in this field will take place in the next few years.

In the last resort most formal assessment of English as a foreign language nowadays is a combination of elements from a wide range of all the different kinds of test discussed above, probably reflecting some kind of consensus view that language does involve code, system, skill and communication.

Four kinds of assessment

If the question asked above has been ‘What kind of a thing is it that is being assessed?’ the next question must be ‘What is the purpose of making that assessment?’

There are at least four different sorts of purpose that assessment may serve. First, one may wish to assess whether a particular individual will ever be able to learn any foreign language at all. An assessment of this kind is an assessment of aptitude. The question being asked is ‘Can he learn this at all?’ Tests designed to measure aptitude must largely be only indirectly specific language orientated. There appear to be no tests to determine whether a foreigner has the aptitude to learn English as such. Aptitude test batteries include items like tests of the ability to break or use codes, to generate or create messages on the basis of a small set of rules and symbols, tests for memory of nonsense syllables, tests of auditory discrimination and so on. A standardised test battery, The Modern Language Aptitude Test, has been devised by J.B. Carroll and S.M. Sapon. Such a test looks only forward in time from the point of the test and nothing lies behind it in terms of English language teaching.

Second, assessment may be made to determine how much English an individual actually knows with a view to how well he might be able to function in situations, which may be more or less closely specified, often quite outside the language learning classroom. The basic question being asked is ‘Does he know enough English to…?’ ‘…follow a course in atomic physics?’ ‘…act as waiter in a tourist hotel?’ and so on. Assessment of this kind is assessment of proficiency. Tests of proficiency look back over previous language learning, the precise details of which are probably unknown, with a view to possible success in some future activity, not necessarily language learning but requiring the effective use of language. Proficiency tests do, however, sometimes have a direct language teaching connection. They might, for example, be used to classify or place individuals in appropriate language classes, or to determine their readiness for particular levels or kinds of instruction. The question here is a rather specific one like ‘Does he know enough to fit into the second advanced level class in this institution?’ Thus selection examinations, and placement tests are basically proficiency tests. The title of the well-known Cambridge Proficiency Examination implies proficiency in English to do something else, like study in a British institution of further education.

Third, assessment may be made to determine the extent of student learning, or the extent to which instructional goals have been attained. In other words the question being asked is ‘Has he learned what he has been taught?’ Indirectly of course such assessment may help to evaluate the programme of instruction, to say nothing of the capabilities of the teacher. If he has learned what he has been taught the teaching may well be all right; if he hasn’t, the teaching may well have to be looked at carefully and modified and improved. Assessments of this kind are assessments of achievement. Tests of achievement look only backwards over a known programme of teaching. Most ordinary class tests, the quick oral checks of fluency or aural discrimination that are part of almost every lesson are achievement tests, and so too should be end of term or end of year examinations.

Lastly, assessment may be undertaken to determine what errors are occurring, what malfunctioning of the systems there may be, with a view to future rectification of these. The question being asked is ‘What has gone wrong that can be put right, and why did it go wrong?’ Assessment of this kind is diagnostic. Diagnostic tests look back over previous instruction with a view to modifying future instruction. The details of past instruction may be known or not, so some kinds of diagnostic test will be like proficiency tests, some will be like achievement tests in this regard. However, it is important at all times to bear in mind the basic question which is being asked, and to realise that items which may be very good tests of actual achievement may be very poor diagnostically. A diagnostic test ought to reveal an individual’s strengths and weaknesses and it is therefore likely that it will have to be fairly comprehensive, and devote special attention to known or predicted areas of particular difficulty for the learner. Diagnostic tests are most often used early in a course, when particular difficulties begin to arise and the teacher wants to pin down just what is going wrong so that he can do something about it. Such tests are almost always informal and devised for quite specific situations.

The four terms aptitude, proficiency, achievement, and diagnostic are very frequent in the literature on testing and it is well to get their meaning clear. It is also worth noting the characteristic usages which these terms have. A learner may have an aptitude for English language learning; if he does he may quickly attain sufficient proficiency in English for him to be able to study mathematics; this means he has achieved a satisfactory standard, but a test may diagnose certain faults in his English or in the teaching he has received.

Test qualities

There remains one other important question to ask about any assessment of knowledge of the English language—‘Does it work?’ Here again there may be at least four different ways in which this question may be interpreted. The first of these is revealed by the question ‘Does it measure consistently?’ A metre stick measures the same distance each time because it is rigid and accurately standardised against a given norm. A piece of elastic with a metre marked on it is very unlikely to measure the same every time. In this case the metre stick can be said to be a reliable measure. In the same way reliability in instruments for measuring language ability is obviously highly desirable, but very difficult to achieve. Among the reasons for this are the effects of variation in pupil motivation, and of the range of tasks set in making an assessment. A pupil who is just not interested in doing a test will be unlikely to score highly on it. Generally speaking the more instances of pupil language behaviour that can be incorporated into a test the better. It is for this reason that testing specialists have tended to prefer discrete item test batteries in which a large number of different instances of language activity are used, to essay type examinations where the tasks set are seen as more limited in kind and number. Variations in the conditions under which tests are taken can also affect reliability—small variations in timing where precise time limits are required for example, a stuffy room, the time of day when the test is taken, or other equally trivial-seeming factors may all distort test results. Perhaps most important of all in its consequences on test results is the reliability of the marker. This reliability may be high in objectively marked tests—like multiple-choice tests—but can be low in free response tests—like essays—if a structured approach or multiple marking are not used. Determining test reliability requires a certain amount of technical know-how and familiarity with the statistical techniques which permit the calculation of a reliability coefficient. Guidance to these will be found in the books referred to for further reading at the end of this chapter.
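As an indication of what calculating a reliability coefficient can involve, the following Python sketch uses the split-half method with the Spearman-Brown correction. This is one common textbook procedure, chosen here as an assumption; the chapter does not say which technique it has in mind. The function names and the small table of item scores are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(item_scores):
    """Estimate reliability from one sitting of an objectively marked test.

    item_scores holds one row per pupil, with 1 for a correct item and
    0 otherwise. Each pupil's odd-numbered and even-numbered items are
    totalled separately and the two sets of totals are correlated.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd_totals, even_totals)
    # Spearman-Brown correction: estimate for the full-length test.
    return 2 * r_half / (1 + r_half)


# Hypothetical results: five pupils, six items.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 2))
```

A coefficient near 1.0 indicates that the two halves of the test rank the pupils in much the same way; with so few pupils and items the figure is illustrative only.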

The second way in which the question ‘Does it work?’ can be made more precise is by rephrasing it as ‘Does it distinguish between one pupil and another?’ A metre stick may be a suitable instrument for measuring the dimensions of an ordinary room, but it would not be suitable for measuring a motorway or the gap of a spark plug for a car. In one case the scale of the object to be measured is too great, in the other it is too small. Not only should the instrument which is used be appropriate to the thing being measured but the scale on the instrument should be right too. A micrometer marked only in centimetres would not permit accurate measurement of watch parts; the scale needs to be fractions of millimetres. Tests which have the right sort of scale may be said to discriminate well. Tests which are on the whole too easy or too difficult for the pupils who do them do not discriminate well; they do not spread the pupils out, since virtually all pupils score high marks or all pupils score low marks. Ideally the test should give a distribution which comes close to that of the normal distribution curve.

One needs to be careful in reading the literature on testing when the term discrimination index is encountered. This has little to do with discrimination in the sense discussed above. It refers rather to the product of statistical procedures which measure the extent to which any single item in a test measures the same thing as the whole of the test. By calculating a discrimination index for each item in a test it is possible to select those items which are most efficient in distinguishing between the top one-third and the bottom one-third of any group for whom the test as a whole is about right. In other words it will help to establish the measuring scale within the limits of the instrument itself and ensure that that is about right, giving a proper distribution of easy and difficult questions within the test. But a discrimination index has no absolute value; to get the overall level of difficulty of the test right requires a pragmatic approach with repeated retrials of the test items, accepting some and rejecting others until the correct combination has been achieved. Again details of these technical matters will be found in the books for further reading.
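
One common way of arriving at such an index, sketched here on the assumption that each item is simply marked right or wrong, is to compare how many pupils in the top third and in the bottom third of the group answered the item correctly. This upper-lower comparison is only one of several procedures in use; the function name and the pupils' scores are invented for illustration.

```python
def discrimination_index(item_scores, item):
    """Upper-lower discrimination index for a single item.

    item_scores: one list per pupil, 1 for a right answer, 0 for a wrong one.
    item: position of the item within each pupil's list (counting from 0).
    Returns a value between -1.0 and 1.0; higher means the item separates
    strong pupils from weak ones more sharply.
    """
    if len(item_scores) < 3:
        raise ValueError("need at least three pupils to form thirds")
    # Rank pupils by their total score on the whole test, best first.
    ranked = sorted(item_scores, key=sum, reverse=True)
    group_size = len(ranked) // 3
    top = ranked[:group_size]
    bottom = ranked[-group_size:]
    top_correct = sum(row[item] for row in top)
    bottom_correct = sum(row[item] for row in bottom)
    return (top_correct - bottom_correct) / group_size


# Invented results for six pupils on a four-item test.
pupils = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
for item in range(4):
    print(item, discrimination_index(pupils, item))
```

In this invented set the first and third items separate the top third from the bottom third completely, while the last item does not separate them at all and would be a candidate for rejection or rewriting.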

The third way in which the ‘Does it work?’ question may be more fully specified is by asking ‘Does it measure what it is supposed to measure?’ A metre stick is very practical for measuring cloth but it is irrelevant for measuring language ability. ‘What it is supposed to measure’ in the case of English language tests is presumably ability in English language, and the only way that the extent to which a test actually does this can be determined is by comparing the test results with some other outside measurement, some other way of estimating pupil ability, a way which ought to be at least as reliable and accurate as the test itself. Where the results of the outside measure match the results of the test reasonably closely the test can be said to have empirical validity. Suitable outside measures are difficult to come by. So far the best criterion which seems to have been found is a teacher’s rating. An experienced teacher who knows his class well can rank pupils in order of merit with considerable reliability and accuracy. Thus tests whose results correlate well with teacher ratings can be regarded as empirically valid, and the correspondence between the two measures can be expressed as a coefficient of validity. Testing specialists like such coefficients to have a value higher than 0.7—perfect correlation would give a coefficient of 1.0.
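
Where the outside measure is a teacher's order of merit, one plausible way of obtaining such a coefficient is Spearman's rank correlation, which works from the difference between each pupil's rank on the test and his rank in the teacher's list. The sketch below is a minimal version which assumes there are no tied test scores; it is not put forward as the procedure testing specialists necessarily use. The function names are hypothetical and the marks and rankings are invented.

```python
def ranks(scores):
    """Rank scores from 1 (highest) downward; assumes no tied scores."""
    if len(set(scores)) != len(scores):
        raise ValueError("tied scores need a tie-aware method")
    order = sorted(scores, reverse=True)
    return [order.index(score) + 1 for score in scores]


def spearman_rho(test_scores, teacher_ranks):
    """Spearman rank correlation between test marks and a teacher's ranking.

    test_scores: each pupil's mark on the test.
    teacher_ranks: the same pupils' positions in the teacher's order of
    merit, 1 being the best pupil.
    """
    if len(test_scores) != len(teacher_ranks):
        raise ValueError("both lists must describe the same pupils")
    n = len(test_scores)
    test_ranks = ranks(test_scores)
    # Sum of squared differences between the two rankings.
    d_squared = sum((t - r) ** 2 for t, r in zip(test_ranks, teacher_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented marks for five pupils, with the teacher's ranking of the same five.
marks = [78, 65, 59, 40, 32]
teacher_order = [2, 1, 3, 5, 4]
print(spearman_rho(marks, teacher_order))  # 0.8
```

With only five pupils the figure is no more than an illustration; a coefficient of this kind is worth taking seriously only when it is based on a reasonably large group.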

It is clear of course that empirical validity is unlikely to be achieved unless a test is constructed in accordance with some respectable theory of language. It is also unlikely to be achieved unless the test adequately samples the knowledge and activities which are entailed by showing that one knows a language. However, a theoretical base and adequate sampling do not guarantee empirical validity—to gain that, the test must be set against some external criterion.

There is one final kind of validity which is sometimes discussed in the literature on assessment: ‘face validity’. This is a matter of how the test appears to the pupils being tested, to teachers, administrators and so on. If the form or content of a test appears foolish or irrelevant or inconsequential, then users of the test will be suspicious of it; those in authority will be unlikely to adopt it, and pupils may be poorly motivated by it. Thus test makers must ensure that a test not only tests what it is supposed to test, reliably and accurately, but also looks as though that is what it does.

A final characteristic of a good language test is practicability. By this is meant the extent to which the test is readily usable by teachers with limited time and resources at their disposal. Such factors as the cost of the test booklets, the amount of time and manpower needed to prepare, administer, invigilate, mark and interpret the test, the requirements for special equipment and so on must all be taken into account. For example a standardised test which employs re-usable test booklets with separate answer sheets is likely to be much cheaper to run than one which uses consumable test booklets. Tests which take relatively little time to work and process are likely to be preferred to those which take a lot of time, and those which can be given to many pupils simultaneously are usually more practicable than those which require individual administration. Simple paper and pencil tests may well be preferred to those which require elaborate audio- or video-recording equipment. Up-to-date tests whose cultural content is unexceptionable are evidently better than those which are out of date and contain culturally inappropriate or objectionable material; those with clear instruction manuals are better than those with obscure manuals, and so on. The test maker needs to bear all such factors in mind, but he should also bear in mind that the testing of some kinds of activity relevant to some dimensions of ‘knowing a language’ may require the use of elaborate equipment or individualised methods, and a proper balance must be struck.

In the classroom the teacher finds himself faced with having to assess the progress of his pupils, to judge their suitability for one class or another and so on. He must decide, out of the whole complex of considerations which has been outlined above, what kind of assessment he wishes to make, of what aspects of his pupils’ learning, with what kind of reliability and what kind of validity. Once those decisions are made he can go ahead with devising his instrument for making the assessment. For help with that he will find J.B. Heaton’s Writing English Language Tests a useful book, along with the book by Lado mentioned earlier and that by Rebecca Valette listed below.

The last matters to which it would seem appropriate to give some attention here concern standardised English language tests, and the public examinations systems.

A number of standardised tests exist. Among these the Davies test has been widely used. More recently Elizabeth Ingram has published English Language Battery, but the American tests in this area seem to be more readily available. Among the best known of these are Robert Lado’s English Language Test for Foreign Students, which developed into the Michigan Test of English Language Proficiency, and the Educational Testing Service’s Test of English as a Foreign Language (TOEFL). Further information and discussion of such tests will be found in The Seventh Mental Measurements Yearbook, ed. O.Buros.

Public examinations

The public examination system tends to vary from country to country. One of the tasks which every teacher has when he takes up an appointment in a new country is to discover just what the requirements of the public examination system are. He needs to obtain copies of syllabuses, past papers, regulations, and the reports of the examiners, where these are published, and to familiarise himself with them. From this he should be able to discover what real linguistic skills are required of examination candidates and what kinds of examination techniques they will need to have mastered. It is then possible to concentrate substantially on teaching the language skills and, in about the last one-tenth of the course, to teach the necessary techniques for passing the examination. Most teachers devote far too much time to practice examinations. Pupils often seem to like them, but they are rarely in the pupils’ best interests, since many good examining techniques do little to foster greater learning; dictation is a good case in point.

For information about the public examinations most widely taken in Britain, one can do little better than consult J.McClafferty’s A Guide to Examinations in English for Foreign Students. In this there are useful hints on preparing for the examinations, details of the various examinations offered by the boards and summaries of regulations and entry requirements. It covers the examinations of the Cambridge Local Examinations Syndicate, the Royal Society of Arts, the London Chamber of Commerce, and the ARELS Oral Examination, and has a supplementary list of other examinations in English for foreign students. Altogether it is a very helpful document. Much of the preliminary investigatory work suggested in the previous paragraph has been done for the teacher by this book; there remains only the task of analysing past papers and consulting the annual reports of the examiners.

There are a number of types of examination or methods of assessment which have not been discussed at all in this chapter but which a teacher may come across from time to time. One of these is assessment by using a structured interview schedule. Here the test takes the form of an interview and the linguistic tasks demanded of the candidate are progressively elaborated according to a fixed programme. The point at which the candidate begins to fail in these tasks gives him a rating on the schedule. Such examinations are usually entirely oral—though clearly there is no absolute necessity that they should be so—and the rating is usually arrived at by subjective judgment against a fairly detailed specification of performance features, sometimes by a panel of judges. Another type of test is that involving simultaneous translation—usually reserved for assessing interpreters—but there are a number of such techniques and it is wise to keep an open mind towards them for they might well turn out to be useful some day.

The final word is—avoid too much assessment; resist pressures which might make examinations dominate teaching.

Suggestions for further reading

J.P.B.Allen and S.Pit Corder, The Edinburgh Course in Applied Linguistics, Vol. 4, Testing and Experimental Methods, Oxford University Press, 1977.

A.Davies, Language Testing Symposium: A Psycholinguistic Approach, Oxford University Press, 1968.

D.P.Harris, Testing English as a Second Language, New York: McGraw-Hill, 1969.

J.Oller, Language Tests at School, Longman, 1979.

R.M.Valette, Modern Language Testing: A Handbook, 2nd edn, New York: Harcourt Brace Jovanovich, 1977.
