Tuesday, 12 August 2008

How likely is an athlete to falsely test positive for a banned substance?

Well, one opinion on this matter was given by the statistician Donald Berry, in the prestigious journal Nature. He seems to think that the tests are actually quite likely to generate false positives. For instance, commenting on the case of Floyd Landis (the American found guilty of doping in the 2006 Tour de France), he says:

Landis seemed to have an unusual test result. Because he was among the leaders he provided 8 pairs of urine samples (of the total of approximately 126 sample-pairs in the 2006 Tour de France). So there were 8 opportunities for a true positive — and 8 opportunities for a false positive. If he never doped and assuming a specificity of 95%, the probability of all 8 samples being labelled 'negative' is the eighth power of 0.95, or 0.66. Therefore, Landis's false-positive rate for the race as a whole would be about 34%. Even a very high specificity of 99% would mean a false-positive rate of about 8%.
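
Berry's arithmetic itself is easy to reproduce. A quick sketch in Python, using his assumed specificities (the function name is mine, not his):

```python
def false_positive_rate(specificity: float, n_samples: int) -> float:
    """Chance of at least one false positive across n independent tests,
    for an innocent athlete, given the per-test specificity."""
    return 1 - specificity ** n_samples

print(false_positive_rate(0.95, 8))  # ~0.34, Berry's "about 34%"
print(false_positive_rate(0.99, 8))  # ~0.077, his "about 8%"
```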

Well, yes, but here Berry is giving the completely misleading impression that the more tests done the more likely they are to end up screwing the athletes! Quite the opposite, of course. Scientists deliberately take several samples, and test them each several times, to reduce the possibility of error. For instance, Berry doesn't mention things like positive and negative controls, and the variety of confirmation tests employed prior to declaring the athlete a cheat.

More opposition to Berry's rather lopsided (though very interesting) opinion can be found at The Questionable Authority blog.

Update (14/08/08) - Oops! It appears I have misunderstood the testing protocols that are actually used, and thus was wrong in my assessment of Berry's point. To see why, have a look at the comments by myself and 'anonymous'. For what it's worth, I therefore retract the above criticism.


  1. If the basic testing procedures are flawed then Berry is right. The more tests the more chance the athlete will be screwed!

  2. Yes, but it depends what you do with the extra tests.

    Let's use Berry's example of a test that is 99% specific. What that means is that the test will identify 99% of non-cheaters as such. 1% will unfortunately register positively - i.e. 1% of innocents will inaccurately register as dopers.

    As Berry says, if eight separate samples are taken and tested, there is roughly an 8% chance that at least one of them will be a FALSE positive. That sounds terrible, but do things actually follow Berry's logic?

    It depends on what the protocol is. If just one positive test is required to damn an athlete, then the more samples that are taken, the higher his chances of being erroneously implicated, as Berry notes.

    BUT, what he doesn't admit is that this is NOT the protocol that is followed. Protocols vary, but one obvious thing to do is to make the average reading the thing that counts, or else to take the top (or bottom) 3 readings as definitive. At a stroke, this means that the MORE (not the fewer) tests you do, the more likely it is that the result is true. This is because one dodgy reading can be ignored.

    There are numerous other things that can be done, but in general, increasing the sample size of the data tends to INCREASE the accuracy of the conclusions. In the above hypothetical test, the chance of the average result being a false positive is much, much smaller than the initial 1%.

    Of course, if the tests are terribly flawed to start, then the results can't be trusted, as you say. But this isn't the point I was attacking.

    Also, the rest of Berry's article is definitely commendable and interesting - well worth a read.
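
To see how an averaging rule changes things, here is a quick simulation sketch in Python. Everything in it is hypothetical: the distribution, the threshold, and both decision rules are toy assumptions for illustration, not WADA's actual protocol:

```python
import random

def simulate(n_athletes=100_000, n_samples=8, threshold=3.0, seed=1):
    """Toy model: innocent athletes whose readings vary naturally
    (normal, mean 0, sd 1.3 - arbitrary illustrative numbers).

    Compares two hypothetical decision rules:
      'any'     - flag if ANY single reading crosses the threshold
      'average' - flag only if the MEAN of all readings crosses it
    """
    random.seed(seed)
    flagged_any = flagged_avg = 0
    for _ in range(n_athletes):
        readings = [random.gauss(0, 1.3) for _ in range(n_samples)]
        if any(r > threshold for r in readings):
            flagged_any += 1
        if sum(readings) / n_samples > threshold:
            flagged_avg += 1
    return flagged_any / n_athletes, flagged_avg / n_athletes

any_rate, avg_rate = simulate()
print(f"any-one-positive rule: {any_rate:.3%}")  # several percent of innocents flagged
print(f"averaging rule:        {avg_rate:.3%}")  # essentially none flagged
```

Under the any-one-positive rule the innocent-flag rate grows with the number of samples; under the averaging rule it shrinks, because the mean of eight readings varies far less than any single one.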

  3. Wrong wrong and more wrong.

    Berry is not talking about lab errors. False positives are not simply mistakes. False positives are when a subject produces a sample that is genuinely outside the normal range.

    The testing protocols do reduce the chance of mistakes, but if someone's natural values happen to cross some arbitrary threshold, that sample will test positive over and over again, even if caused by natural variations. That's a false positive.

    Floyd's test was a threshold test, on a measurement where we all have natural variations. The higher the threshold, the less likely you'll cross it naturally. But natural variations will still cross the boundary sometimes. Berry's main point is that WADA has no idea how often that will happen using the threshold they've set.

    I can tell you from my own review of the research that went into the test that all the research papers I reviewed had subjects with samples that crossed the 3 per mil threshold. In one study, about one out of every thirty samples had one pair of metabolites with a difference of 3 per mil or more.

    That's a long way from a presumed good standard of 1 in 1,000. But even if the false-positive rate from natural variations were that good, WADA performed more than 12,000 tests on cyclists in 2005 alone. At 1 in 1,000, cycling would average 12 false positives a year for this test.

    All from natural variations, not lab mistakes.
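
The arithmetic behind those numbers, for anyone who wants to check it (figures exactly as stated in the comment above):

```python
# Numbers as stated in the comment above.
observed_rate = 1 / 30        # samples crossing 3 per mil in the cited studies
assumed_good_rate = 1 / 1000  # a "presumed good" false-positive rate
tests_per_year = 12_000       # WADA tests on cyclists (2005 figure)

print(tests_per_year * assumed_good_rate)  # 12.0 expected false positives a year
# And, purely as an extrapolation, if the 1-in-30 study rate held in the field:
print(tests_per_year * observed_rate)      # 400.0
```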


  4. I don't mean to propose an argument from authority, but you seem to be implying that Dr. Berry has overlooked a simple truism or has an uncertain grasp of basic concepts. You may not realize that he is one of the world's preeminent biostatisticians. His predictive models in breast cancer are foundational. I wouldn't take his assertions at face value, of course, but you'll have to bring a better counter than you have here to refute his argument. Much better.

  5. I listened to the Berry interview on SF several times to be sure I heard it correctly. In his interview there it seemed his most serious criticism is that the FP rate of all these tests is unknown, so the results are "non-informative". This is due to the fact that these procedures are cooked up in a closed system that lacks peer review and published studies.

    The question then is, do you "trust" such a system. There has been plenty of good reason to not trust athletes and that is why they get tested. Do we "trust" the WADA system? Berry/Nature do not, I presume, because it is closed and as scientists they are, by all their training, suspicious of things they can't see/measure. Berry proposed that the labs themselves be tested and if I recall correctly, someone just did that with a bunch of positive EPO samples. The labs that were tested failed quite badly.

    There has been quite a bit of activity by WADA over the last two years, and enough of it has been exposed in public to ask that question. Would you "trust" this system with YOUR career? Based on what I know today I certainly wouldn't.

  6. Oh dear, I seem to have stirred up a hornet's nest!

    Kylepyro - I agree with everything you say. To emphasise again: Dr. Berry's article is otherwise excellent. He makes a range of points, and I actually agree with almost all of them.

    What I am disputing is the impression that the quoted passage gives off, namely that the more tests are done, the worse off the athlete is. Of course, as I conceded above, this may be the case given certain terrible protocols. But given sensible ones, this isn't the case. Perhaps I am misreading the passage, though.

    Sandman - I'm certainly not under the impression that I'm a match for Dr Berry in biostatistics!! But I do think my point stands. It's not of course that he's gotten his sums wrong (!), but I do think the quoted passage is misleading, don't you?

    Thomas - You make excellent points; I agree with them. (I don't believe that I equated false positives with lab errors though.) I think your summary of Berry's position is spot on.

    Hmm - apologies if my own post seems a bit "lopsided". Hope I've clarified things here.

  7. Jeremy -- not to pile on, but if I read you correctly I think you may also be mistaken about the protocols ...

    Your comment that averaging across several samples reduces chances of false positives seems correct ... but that, in fact, is NOT the protocol that "damns an athlete". It truly is the ONE positive -- regardless of number of tests-- that counts. One out of eight will do it.

    The system doesn't allow for ignoring "one dodgy reading", because the tests themselves (and the lab practices for that matter) are presumed to NEVER be dodgy.

  8. It's hard to believe you could agree with me and still think that multiple tests don't boost the odds.

    Maybe there's a mistaken impression here. The eight tests Berry talks about are not eight different things tested on one sample. The eight different tests are eight different samples taken from Floyd on different days during the tour.

    Each sample taken will find different values, because the values being measured vary day-to-day (and even hour-to-hour). So, there are eight separate opportunities for Floyd's natural variations to exceed the limits.

    It's like flipping a coin eight times - of course the odds of getting at least one head go up.


  9. Hi anonymous (? the second),

    Oh dear - you're absolutely correct, as I've just verified. It seems one sample is enough to at least set off the alarm bells and cause confirmatory tests to be done, and thus I see your (and Berry's) point. To that effect I'll add a brief retraction to the original post. Thanks for pointing that out - it really does crystallise why it is, for instance, that Tom and I seem to keep talking past each other.

    On a related note though - and I accept this is a different point - could not multiple samples help increase the accuracy in a different way (whilst simultaneously increasing the risk of false positives in the above manner)? For instance, if an athlete's natural level of a substance 'X' hovers around 1.5 mmol/dl over several tests, and then suddenly jumps to (say) ten times that level, surely that would increase the likelihood that the athlete was doping? (Assuming, of course, that such a high level of the substance is normally only achievable with performance enhancing drugs.)

    The prior tests would constitute some sort of a baseline FOR THAT SPORTSMAN, and the further away from the baseline the one aberrant reading was, the higher the likelihood of it being non-physiological. Naturally, you would still have to do the hard work of excluding non-doping causes for this jump.

    I'd be interested to hear your thoughts.
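
For what it's worth, that baseline idea can be illustrated in a few lines of Python. The numbers, the z-score rule, and the cutoff are all hypothetical, a toy sketch of the suggestion rather than any real protocol:

```python
import statistics

def flag_aberrant(history, new_reading, z_cutoff=4.0):
    """Flag a reading far outside an athlete's own baseline.

    Prior tests define a per-athlete baseline; only a large departure
    from it (here, more than z_cutoff standard deviations above the
    mean) raises a flag. All parameters are illustrative assumptions.
    """
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return (new_reading - mean) / sd > z_cutoff

baseline = [1.4, 1.5, 1.6, 1.5, 1.45]  # substance 'X' in mmol/dl, as in the comment
print(flag_aberrant(baseline, 1.55))   # within natural variation -> False
print(flag_aberrant(baseline, 15.0))   # tenfold jump -> True
```

The further the aberrant reading sits from the athlete's own history, the stronger the flag, which is the point made above about non-physiological jumps.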