Is "base rate neglect" a load of tosh?

Reading a blogpost from Chris Dillow about Kevin Pietersen led me via Wikipedia to a couple of interesting psychological hypotheses by Amos Tversky and Daniel Kahneman about bias, and the way it affects our judgments. First, the "availability heuristic", which does seem to me to have some relevance to criticism of KP's approach to batting (and indeed Chris's whole theory seems to me neatly to explain why Pietersen's test record is so much better than that of Mark Ramprakash and Graeme Hick). Second, it led me to the "representativeness heuristic" or theory of "base rate neglect" as a sort of cognitive bias, and in particular a classic experiment that illustrates it: the "taxicab problem":
A cab was involved in a hit and run accident at night. Two cab companies, the Green and the Blue, operate in the city. 85% of the cabs in the city are Green and 15% are Blue. A witness identified the cab as Blue. The court tested the reliability of the witness under the same circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time. What is the probability that the cab involved in the accident was Blue rather than Green knowing that this witness identified it as Blue?
I instinctively thought the probability was 80%. Did you? Well, according to Tversky and Kahneman, we're wrong.
Most subjects gave probabilities over 50%, and some gave answers over 80%. The correct answer, found using Bayes' theorem, is lower than these estimates:
There is a 12% chance (15% times 80%) of the witness correctly identifying a blue cab.
There is a 17% chance (85% times 20%) of the witness incorrectly identifying a green cab as blue.
There is therefore a 29% chance (12% plus 17%) the witness will identify the cab as blue.
This results in a 41% chance (12% divided by 29%) that the cab identified as blue is actually blue.
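For anyone who wants to check the quoted arithmetic, here is a minimal sketch in Python. The function name and parameter names are my own labels, not anything from Tversky and Kahneman, and it assumes (as the problem states) that the witness is equally accurate for both colours.

```python
def posterior_blue(base_rate_blue, witness_accuracy):
    """P(cab is Blue | witness says Blue), by Bayes' theorem.

    Assumes the witness is equally accurate for Blue and Green cabs.
    """
    # Blue cab, correctly called Blue
    true_blue = base_rate_blue * witness_accuracy
    # Green cab, wrongly called Blue
    false_blue = (1 - base_rate_blue) * (1 - witness_accuracy)
    return true_blue / (true_blue + false_blue)


p = posterior_blue(base_rate_blue=0.15, witness_accuracy=0.80)
print(f"{p:.1%}")  # 41.4%
```

The unrounded figure is 12/29, which is where the quoted 41% comes from.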
All very scientific-looking. But I doubt it's even rational. There are many criticisms of the work of Tversky and Kahneman, so I'm far from the first. And I claim no expertise at all either in psychology or statistics. But as a lawyer and citizen I am concerned about the reliability of testimony, and I think common sense demolishes the theory - at least as it applies to the reliability of witness evidence. I have no problem with the relevance of base rates in identifying the risk of non-human tests coming up with false positives; and even in the field of human testimony I'm quite willing to admit I'm wrong - if someone can convincingly deal with the argument I set out below.
The flaw seems to me to be the use of Bayes's theorem to arrive at an objective measure of the probability of a certain state of affairs being true, disregarding what any observer might say about it - and then using that probability to evaluate evidence from an observer. I'm not criticising Bayes's theorem itself: this example about schoolkids, skirts and trousers seems to me a very good application of it. I agree: of course the probability of the child observed having been a girl must be 25%.
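The 25% figure can be reproduced in a few lines. This is only a sketch: the 40% share of girls is the one I mention below, while the assumptions that half the girls wear trousers and that all the boys do are inferred from the 20% skirt-wearing rate and the 25% answer, so treat them as labelled assumptions.

```python
# Assumed figures for the school example:
# 40% girls, 60% boys; half the girls wear trousers, all the boys do.
p_girl = 0.40
p_trousers_given_girl = 0.50
p_trousers_given_boy = 1.00

# Share of all pupils who are girls in trousers, and boys in trousers
girl_in_trousers = p_girl * p_trousers_given_girl
boy_in_trousers = (1 - p_girl) * p_trousers_given_boy

# P(girl | trousers): the observation "trousers" is taken as given
p_girl_given_trousers = girl_in_trousers / (girl_in_trousers + boy_in_trousers)
print(round(p_girl_given_trousers, 2))  # 0.25
```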
But the schoolkids example is not the same as the taxicab problem because in the schoolkids example we are not using Bayes to question the correctness of what the witness says she observed - but simply to assess the probability, taking the observation as given, of a further, unobserved fact. Applying the base rate neglect theory, the likelihood of the observer having seen a girl would be more than 25%, because the probability of the observer having seen someone in trousers in the first place would itself be less than 100%, since you'd need to take into account the "base rate" of 20% skirt-wearing.
But it's not necessarily right to take the base rate into account - and I think it may even be dangerous to assume that evidence should be evaluated in that way. I think this may be a misapplication of the Bayesian approach. Let me explain.
In real life, you don't know how reliable or not a witness is in statistical terms. You could do tests I suppose (and the taxicab problem makes it look easy to conduct them) but it'd be nearly impossible to conduct reliable ones. So you'd never actually have a quantifiable witness "reliability ratio" to apply. I suppose, though, you are left with your own assessment of the reliability of the witness both generally, and in relation to the particular sort of evidence she's giving. That's analogous to the 80% reliability ratio in the taxicab problem.
So, let's take another example. A girl in the Bayes school (the one with 40% girl pupils, remember) has her iPod snatched, she says by another girl. We think she's a pretty reliable witness: intelligent, observant and so on, and very determinedly telling us she knows what she saw. How shall we assess her reliability ratio, roughly? Okay, let's say 80%. But we also know that in the school concerned there are only 40% girls. So now, based on that, do we conclude that she's only about 73% likely to be right? That's the figure I come up with, applying the Tversky and Kahneman approach. Maybe; that sounds fair enough, you're saying.
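Here is a rough sketch of that calculation, run exactly as in the taxicab problem. The function name is my own, and the 80% reliability ratio is of course only the guess made above, applied equally to sightings of girls and of boys.

```python
def chance_witness_is_right(base_rate, accuracy):
    """P(thief was a girl | witness says girl), taxicab-style."""
    # Thief was a girl and the witness says so
    correct = base_rate * accuracy
    # Thief was a boy but the witness says "girl"
    mistaken = (1 - base_rate) * (1 - accuracy)
    return correct / (correct + mistaken)


p = chance_witness_is_right(base_rate=0.40, accuracy=0.80)
print(f"{p:.1%}")  # 72.7%
```

In fractions that is 32/44, or a little under 73%.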
But what if this is the sixth form of Wellington, or something, and there are only relatively few girls - say, 15%. Our 80% reliable sixth-former says another female sixth-former snatched her iPod. Do we now reason as follows?
There is a 12% chance (15% times 80%) of the witness correctly identifying a girl (= blue cab). There is a 17% chance (85% times 20%) of the witness incorrectly identifying a boy (= green cab) as a girl. There is therefore a 29% chance (12% plus 17%) the witness will identify the thief as a girl. This results in a 41% chance (12% divided by 29%) that the thief identified as a girl was actually a girl.
No: that would surely be nonsense. It seems to me much more rational, and much less of an error, to apply the 80% reliability ratio we started with, and to leave the "base rate" of girls to boys out of account. And unless we do that, we might well conclude that, contrary to what the witness says she saw, the thief was in fact a boy - and the girl in the dock would be acquitted.
I think the problem with the Tversky/Kahneman approach is that it gives insufficient consideration to the quality of the evidence given. The taxicab problem masks this, by positing a situation where one colour of cab might easily be mistaken for the other. But in the real world the two categories to be told apart might be more easily distinguished - in which case, I think our assessment of the witness's reliability both generally, and especially in terms of making the specific distinction, becomes much, much more important than the background "base rate" at which the two categories occur among the population. They're not of equal weight: so multiplying one rate by the other causes distortion.
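To see how much the answer depends on the reliability figure, here is a small sketch that keeps the 15% base rate fixed and varies the witness's accuracy. The 95% and 99% values are hypothetical figures of mine, standing in for categories that are easier to tell apart; only the 80% comes from the taxicab problem.

```python
def posterior(base_rate, accuracy):
    """Taxicab-style Bayes calculation for a two-category identification."""
    hit = base_rate * accuracy                    # rare category, correctly named
    false_alarm = (1 - base_rate) * (1 - accuracy)  # common category, misnamed
    return hit / (hit + false_alarm)


# 0.95 and 0.99 are hypothetical accuracies, not from the original problem
for accuracy in (0.80, 0.95, 0.99):
    print(f"accuracy {accuracy:.0%}: posterior {posterior(0.15, accuracy):.1%}")
# accuracy 80%: posterior 41.4%
# accuracy 95%: posterior 77.0%
# accuracy 99%: posterior 94.6%
```

On these assumed figures the formula itself gives the base rate less and less sway as the witness becomes more reliable, though it never drops out entirely.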
I think this illustrates that the very first multiplication is the root of the problem. We're told we must multiply 15% by 80% to arrive at the chance (12%) of the witness correctly identifying a blue cab. But the test already demonstrated that the chance of the identification being correct is actually 80%. To factor in some other ratio is to deny the validity of the test. It is to be irrationally biased against the witness, to the point of discounting statistical evidence of the witness's accuracy.
In other words, the first multiplication introduces a logical flaw. It forces us to judge evidence which, considered directly, we might conclude is of very high quality, by means of a purely probabilistic calculation that leaves that evidence out of account entirely. You end up preferring partly statistical evidence to direct evidence, thereby underestimating the likelihood of a certain state of affairs. You end up believing the thief was probably a boy, even though this runs counter to the clear evidence of the only direct witness. You end up always thinking green taxis are the ones that run pedestrians over, even when the pedestrians always say the taxis were blue. Because it doesn't matter to you what they say.
By the way, the base rate neglect theory also seems to me to suffer from the problem of identifying the pool from which to extract the relevant base rate in the first place. If, say, our sixth-form witness says she was alone with the thief in the girls' loo at the time of the podsnatch, is the base rate of girls to boys now 100 to zero? Or should our judgment of whether the other person in the girls' loo was a girl still be influenced by the base rate of girls to boys in the school? Why not consider the 50-50 base rate in the town as a whole? Or does the base rate of girls' to boys' toilets come into account?
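The choice of pool makes a large difference to the output. As a sketch, here is the same taxicab-style calculation with the 80% reliability ratio held fixed and three candidate base rates plugged in. The pool labels are mine, and treating the loo as a 100% pool is only an assumption for illustration.

```python
def posterior(base_rate, accuracy=0.80):
    """P(thief was a girl | witness says girl) for a given base rate of girls."""
    correct = base_rate * accuracy
    mistaken = (1 - base_rate) * (1 - accuracy)
    return correct / (correct + mistaken)


# Candidate pools and their assumed share of girls
pools = {
    "sixth form": 0.15,
    "town as a whole": 0.50,
    "girls' loo": 1.00,
}

for name, base_rate in pools.items():
    print(f"{name}: {posterior(base_rate):.1%}")
# sixth form: 41.4%
# town as a whole: 80.0%
# girls' loo: 100.0%
```

With a 50-50 pool the result is simply the witness's own 80%, so the calculation only departs from the reliability ratio when the pool is lopsided.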
So, if a witness we believe can tell blue taxis from green 80% of the time says the only taxi in the street in question at the time of the accident was definitely blue, why should we think this untrue because of all the green taxis there are somewhere else?
Actually, I think Tversky's and Kahneman's test subjects were probably very sensible, and show why the jury system is good, and too much reliance on expert evidence is bad. The ones who thought the witness more than 80% likely to be right obviously got mixed up. But in real life, since witness reliability cannot be independently tested, it makes sense to discount your subjective judgment about a witness's reliability to some extent to take account of the base rate: that might well lead you to conclude, depending how credible you think her (or how much you trust the test result you're given), that she is somewhere between 50% and 80% likely to be right. Which is what most of them said.
If you remain unconvinced, let me leave you with another thought that may undermine your faith in the seductive maths of the taxicab problem. How did the court know the witness correctly identified the colour of cabs 80% of the time? Doesn't the assessment of the correctness of the witness's answer in the test also depend on someone's identification of each taxi as either green or blue? Doesn't that mean the base rate of colour distribution has to be taken into account in assessing the reliability of the tester, and so, ultimately, of the test? Doesn't it mean that, in deciding whether the hit and run accident was really caused by a blue taxi, you need to factor in the base rate not once but twice? It could all get frightfully complicated.
If you think the simple, decisive point is that someone responsible for the test says they're certain what colour each test taxi was - then I think you agree with me. What matters is not the rate of green to blue taxis in town but simply how reliable you think the tester's evidence is. Or anyone else's evidence.