Syzygy: How to write/grade an exam

As a professional student (17+ years) and teacher (OK, TA, mostly) (4+ years), I have some experience with the proper creation and grading of an exam. Setting aside my own bias towards open-book, all-free-response exams, there are some things that are just plain bad examples of test writing:

Disclaimer: I have attempted to be vague when giving examples as I know some professors are trying to stop proliferation of questions since they reuse exams.

1. "trick" questions

For example, the ever-common riddle: which is heavier (i.e. has more mass), a kilogram of feathers or a kilogram of lead? The correct response is neither, since both "objects" have a mass of one kilogram. However, it tends to trip up people who do not read the question carefully, since heaviness is not the same as density. So there are people who do understand the difference between mass and density but trip up because they rush, are under pressure, etc., and there are people who don't understand the difference at all. If you consider both groups the same, I guess the question is OK, but what you really want to test is conceptual knowledge, NOT how nitpicky their reading is. People get enough of the latter on the SAT and other standardized tests.

The real life example comes from a multiple choice question regarding a certain topic. The question asks which of the following is NOT correct, with the following answer choices: a', b, c, and none of the above. In the notes, it is clearly stated that a, b, and c are correct. However, it happens to be the case that a' is a slight variation on a that could be missed by a careless reader. It may be the case that the professor is TRYING to catch the students who do not understand the difference between a and a', but with a multiple choice question, you can't tell how many students who get the question wrong do so for conceptual reasons and how many get it incorrect for other reasons. (Moreover, both a and a' are irrelevant, as the actual concept should be x, which is only vaguely related to a.)

2. "matching" questions

Have you ever had those questions where you have a list of blanks and a corresponding word/answer list? There are a number of problems with this type of question. First, the process of elimination (i.e. pigeonhole principle) can be used to successfully guess at some of the answers. While you may concede this as a bonus to the student, really it is a bonus to students with good test-taking skills. Moreover, if the number of blanks equals the number of possible responses, it is impossible to get exactly one incorrect, unless one does so intentionally (for example, by reusing an answer). In any plausible one-to-one matching, a wrong answer in one blank displaces the right answer from another blank, so there are either zero incorrect matches or at least two. The correction for this is either to increase the number of possible answers (which also removes most of the benefit of process of elimination) or to allow multiple answers for each blank. The latter was tried out by at least two of my science teachers, which has the upside of making sure you understand the concepts exactly, but the downside of MASSIVE penalties, as both missing and incorrect matches are marked incorrect.
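
The "never exactly one wrong" claim is easy to check by brute force. Here is a minimal sketch (the function name is mine, and it assumes every blank gets a different answer and blank i's correct answer is answer i) that tallies every possible one-to-one matching by how many blanks it gets wrong:

```python
from collections import Counter
from itertools import permutations


def wrong_match_counts(n):
    """Tally all one-to-one matchings of n blanks by number of wrong matches."""
    counts = Counter()
    for guess in permutations(range(n)):
        # Blank i is correct only when it is matched to answer i.
        wrong = sum(1 for blank, answer in enumerate(guess) if blank != answer)
        counts[wrong] += 1
    return dict(sorted(counts.items()))


print(wrong_match_counts(4))  # {0: 1, 2: 6, 3: 8, 4: 9}
```

Note that the key 1 never shows up: of the 24 ways to fill in four blanks, none has exactly one mistake.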

In a more abstract sense, some teachers add a caveat that extra, incorrect information is ignored, and points are only deducted for missing correct information. This has the extreme downside of promoting massive bullshitting as students try to brute-force as many points as possible. It also breeds a bad habit: some students in classes that my friends TA for write extremely long, nonsensical responses that must still be read and graded.

3. "vague" questions

Don't ask for one thing if you actually mean another. For example, one of the questions on a midterm I took had the following: "Describe one example of x." What the question really meant, judging from the key, was "Give one example of one of the functions of x." These two questions are vastly different. For example, if x were "videogames", one might write a whole essay for the first question describing a videogame, its graphical and audio properties, the gameplay, design, marketing, etc. However, the actual intended question was the second question, in which case the student need only respond with one of: entertainment, stress relief, addiction, occupation (gold farmers in China). Even if you mark both responses as correct, the student who interprets the question "correctly" saves time and energy, and the student who interprets the question accurately wastes valuable time and energy on a poorly written question. (Note: this question occurred on the same midterm that had "trick" questions, so one was both tested for and penalized for careful reading of the question.)

There are more examples of this type of question on the same exam, such as "corresponds" when the more accurate term is "induced by" or "resulting from". Neither of these corrections is actually that great; the whole question should be rephrased.

There are also minor examples of vague questions that are not that bad. For instance, one of the questions read: "name one strategy to accomplish y". Of course, a number of responses are possible, and it's clear that the question is asking for "name one USEFUL strategy to accomplish y", but a well-written test should seal all the holes, however tiny.

Of course, the flip side of the story is that there is also bad GRADING:

1. "cascade" questions

These types of questions occurred a lot in core physics classes at Caltech. The question goes something like this: part a) says solve for x, part b) says given that value of x, solve for y, part c) says given those values of x and y, what is their product z? If each part is graded independently against the answer key, then the student who makes a mistake in solving for x but does the other parts correctly still loses credit on all of them. This is clearly unacceptable, as the student may understand all the necessary concepts for the later parts of the question, but is hindered by a mistake in the initial section of the problem. The correct grading scheme, which is work-intensive on the part of the grader, is to examine each section to see if the work is correct, even with false initial data. Caltech's way around it was to change part b) so that it read given THIS value of x, solve for y, where the value of x given to the student was not the real answer.

In a more abstract sense, students should be penalized according to the scope of their mistake. If it is a minor mistake, but the student clearly shows an understanding of the rest of the problem, minor points should be taken off.
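
To make the difference between the two grading schemes concrete, here is a rough sketch. The relation in part b) (y = 2x + 1) and all the numbers are made up for illustration; only "z is the product of x and y" comes from the question described above.

```python
def solve_y(x):
    # Hypothetical part b): some known relation giving y from x.
    return 2 * x + 1


def solve_z(x, y):
    # Part c): z is the product of x and y.
    return x * y


def close(a, b, tol=1e-9):
    return abs(a - b) <= tol


def grade_independent(student, true_x):
    """Compare every part against the answer key only."""
    true_y = solve_y(true_x)
    true_z = solve_z(true_x, true_y)
    return [
        close(student["x"], true_x),
        close(student["y"], true_y),
        close(student["z"], true_z),
    ]


def grade_carry_forward(student, true_x):
    """Check parts b) and c) against the student's own earlier values."""
    return [
        close(student["x"], true_x),
        close(student["y"], solve_y(student["x"])),
        close(student["z"], solve_z(student["x"], student["y"])),
    ]


# Made-up student: wrong x, but correct method in parts b) and c).
student = {"x": 3.0, "y": 7.0, "z": 21.0}
print(grade_independent(student, true_x=2.0))    # [False, False, False]
print(grade_carry_forward(student, true_x=2.0))  # [False, True, True]
```

A human grader does the second check by reading the work rather than recomputing, which is exactly why it is more labor-intensive.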

2. "Better" answers

When testing for conceptual knowledge, there are usually key terms that the grader is looking for. For example, a short-response question asking "what is a regression line" might expect something along the lines of "line of best fit" or "line that fits the data", when a more accurate answer such as "the line that minimizes the error function, which is usually the sum of the squared errors" could be given. Ideally, any of these answers should be marked correct if the expected answer is vague. (If the expected answer is specific, vague answers should receive only partial credit, and more detailed answers should receive full credit. Assign bonus credit as you wish.) Of course, this could also be interpreted as a poorly written question, as the expected level of specificity is unclear.

Sometimes, "better" answers slip through the cracks and it becomes the student's job to request retroactive credit. However, the ideal grader is able to use common-sense judgment to either mark a "better" answer as correct or to request advice from the professor when an issue is unclear.

3. Question Analysis

We use a lot of multiple-choice tests in the psychology department. While I don't agree with using multiple-choice tests so often, they do have the advantage of making statistical analysis of the test easy. For instance, for each question, our grading software gives us the histogram of answers, so we can see which wrong answer choices actually drew students in as distractors. Furthermore, it computes a correlation, testing whether performance on a question is correlated with overall performance on the test. A "bad" question is one where the overall-good students are incorrect and the overall-bad students are correct, i.e. one where that correlation is negative.
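
I don't know exactly what our grading software computes internally, but the idea can be sketched in a few lines. The version below is my own assumption of how such an analysis might look: it builds the answer histogram for each question and computes a plain Pearson correlation between getting that question right and the score on the remaining questions (leaving the question itself out so it does not correlate with itself). The answer key and responses are made up.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # no variation (e.g. everyone right), so no information
    return cov / sqrt(vx * vy)


def analyze(responses, key):
    """Return (question index, answer histogram, correlation) per question."""
    totals = [sum(r == k for r, k in zip(resp, key)) for resp in responses]
    report = []
    for q, correct in enumerate(key):
        histogram = Counter(resp[q] for resp in responses)
        item = [1 if resp[q] == correct else 0 for resp in responses]
        # Score on the other questions only.
        rest = [t - i for t, i in zip(totals, item)]
        report.append((q, dict(histogram), pearson(item, rest)))
    return report


# Made-up data: one string of answers per student, four questions.
key = "ABCD"
responses = ["ABCA", "ABCB", "ABDD", "BCDD", "CDAD"]
report = analyze(responses, key)
for q, histogram, r in report:
    flag = "  <- suspicious" if r < 0 else ""
    print(f"Q{q + 1}: {histogram} r={r:+.2f}{flag}")
```

In this toy data the last question gets flagged: the students who did well on everything else missed it, and the students who did poorly got it right.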

In situations where such detailed analysis is infeasible, the ideal grader should keep a mental histogram of the incorrect responses. Then, it may be possible to see how the question was misinterpreted, as sort of a last check for "vague" questions.

Labels: education

Tuesday, November 6, 2007
