14 September 2008

DNA barcoding: A glitch in the system?

Following up on last week's post about uncovering hidden species using DNA diversity (or "DNA barcoding"), an open-access paper in this week's issue of PNAS demonstrates a potentially significant glitch in the system: mitochondrial pseudogenes.

The original DNA barcoding concept is straightforward, if not uncontroversial - use a standard DNA sequence marker to identify ("barcode") species that might be challenging to identify otherwise, or that were not previously recognized as separate species. The proposed standard marker is a mitochondrial gene that codes for the protein cytochrome c oxidase subunit I (COI), which varies quite a bit between animal species (though it wouldn't work for plants, whose mitochondrial DNA mutates very rarely). The lab where I work has used COI for a lot of studies in yucca moths, though not barcoding per se.

One potential problem with barcoding is that sequencing any gene in one species using procedures derived from another species is always a bit risky. DNA sequencing relies on primers, short snippets of DNA that bind to a region near the target gene as part of the reaction that makes lots of copies of that gene for analysis (this is called PCR, for polymerase chain reaction). The easiest way to get sequence data for a new species is to try and use primers from a close relative - if there aren't any mutations at the primer site, they should carry over. But mutation happens, and it can definitely happen at primer sites.
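To make the primer problem concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the sequences are invented, the function names are hypothetical, and real primer binding involves the complementary strand and depends on temperature and chemistry, not just a mismatch count. The point is only that a couple of mutations at the primer site are enough to make a borrowed primer fail.

```python
def count_mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_primer_site(template, primer, max_mismatches=0):
    """Return the index of the first window of `template` that matches
    `primer` with at most `max_mismatches` differences, or None.

    Simplification: the primer is compared directly to the template
    strand rather than to its reverse complement.
    """
    k = len(primer)
    for i in range(len(template) - k + 1):
        if count_mismatches(template[i:i + k], primer) <= max_mismatches:
            return i
    return None


# Invented toy sequences, not real COI data.
primer = "GGTCAACAAA"
species_a = "TTAC" + "GGTCAACAAA" + "TCATAAAGAT"  # primer site intact
species_b = "TTAC" + "GGTCTACGAA" + "TCATAAAGAT"  # two mutations in the site

print(find_primer_site(species_a, primer))                    # site found
print(find_primer_site(species_b, primer))                    # no exact site
print(find_primer_site(species_b, primer, max_mismatches=2))  # tolerant match
```

Allowing mismatches (the third call) loosely mimics less stringent PCR conditions: the primer works on more species, but it will also bind more readily to anything else that resembles the target site.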

Primer site mutations are a minor problem compared to pseudogenes, the focus of the new paper by Song et al. Pseudogenes are a result of gene duplication, a mutation in which an extra copy of a gene is accidentally created during DNA replication. (The ones at issue here are, as the paper's title says, nuclear mitochondrial pseudogenes: copies of mitochondrial genes that have ended up in the nuclear genome.) Because it's redundant, the extra copy can absorb mutations that destroy its function without harming individuals who carry it. The duplicate is then "junk DNA," free to accumulate mutations - a pseudogene. (Gene duplication is also one way that new proteins and gene functions can evolve - but that's beyond the scope of the present post.) A primer site mutation just means that primers from one species won't work on another, but primers might still bind to a pseudogene. And then you can get sequence data from the pseudogene instead of the target gene.

DNA barcoding identifies species based on how many mutations have accumulated since they split from a common ancestor; a pseudogene, which mutates faster, can make two samples look further apart than they are. So barcoding studies that accidentally use pseudogenes may identify two species where only one exists. Song et al. use data on mitochondrial pseudogenes in insects and crustaceans to argue that pseudogenes are both common and unpredictable. They also perform barcoding on grasshoppers and crustaceans using data "contaminated" with pseudogenes and data without - unsurprisingly, pseudogenes inflated the number of species detected by barcoding. Although Song et al. suggest a few ways to reduce the odds of interference from pseudogenes, they conclude that there is no way to completely eliminate this problem.
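The logic of that inflation can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the method used by Song et al. or by any barcoding pipeline: the sequences are made up, the 10% distance threshold is arbitrary, and the grouping rule (lump together any samples whose proportion of differing sites is at or below the threshold) is a simple stand-in for real species-delimitation analyses.

```python
def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def count_putative_species(seqs, threshold):
    """Group samples whose pairwise distance is <= threshold
    (single linkage, via a small union-find) and count the groups."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group's representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(seqs[a], seqs[b]) <= threshold:
                parent[find(a)] = find(b)
    return len({find(n) for n in names})


# Invented 20-base "barcodes"; samples 1 and 2 are the same species.
clean = {
    "sample_1": "ATGGCATTCCTAGGAACTGC",
    "sample_2": "ATGGCATTCCTAGGAACTGT",  # 1 difference from sample_1
    "sample_3": "ATGACGTTTCTTGGTACCGC",  # a genuinely different species
}

# Same three individuals, but sample_2's read came from a diverged
# pseudogene copy instead of the real gene (5 differences from sample_1).
contaminated = dict(clean, sample_2="ACGGCAATCGTAAGAACTAC")

THRESHOLD = 0.10  # arbitrary cutoff chosen for this toy example

print(count_putative_species(clean, THRESHOLD))
print(count_putative_species(contaminated, THRESHOLD))
```

With the clean data the two samples from the same species fall in one group; swap in the pseudogene read and the same individuals split into an extra "species," which is the overestimate the paper describes.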

Last week's paper by Smith and colleagues showed the importance of species identification for conservationists, ecologists, and evolutionary biologists. This new result suggests that DNA barcoding may not be the best way to identify species.


P.D.N. Hebert, A. Cywinska, S.L. Ball, J.R. deWaard (2003). Biological identifications through DNA barcodes. Proc. Royal Society B, 270 (1512), 313-21. DOI: 10.1098/rspb.2002.2218

H. Song, J.E. Buhay, M.F. Whiting, K.A. Crandall (2008). Many species in one: DNA barcoding overestimates the number of species when nuclear mitochondrial pseudogenes are coamplified. PNAS, 105 (36), 13486-91. DOI: 10.1073/pnas.0803076105


  1. Would you say that for barcoding and deep DNA analysis several genes/gene segments should be targeted? There are a ton of papers out there that are making assertions (maybe all correct) based on single gene/gene segment analysis, and I've always been a bit concerned that this may not be a conservative enough approach. I'm interested in your take, since my knowledge of DNA would fill a thimble right now.

  2. Eric - Honestly, the simplest answer is "it depends." A single gene segment can certainly capture the evolutionary divergence between species, and a lot of perfectly acceptable studies are based on just one gene. (I have a paper in review right now that uses only COI and some nearby regions to estimate a phylogeny.)

    The catch is that what we call a "species" is not the same thing as the patterns of divergence revealed by any single gene. Species can sometimes split so rapidly that their genes can't catch up - you can have very similar genes in what you know to be two different species. There's a whole branch of evolutionary modeling called coalescent theory devoted to dealing with situations when the evolutionary histories of individual genes don't reflect the history of the species carrying those genes.

    Speaking as someone with no direct experience in DNA barcoding, I'd say we almost always get better information about evolutionary histories by looking at multiple genes. But often you don't need more data than you can get from a single gene. The issue with barcoding, I think, is that it tries to develop a "universal" procedure, when the analysis really needs to be tailored to every group of organisms you study.

  3. Thanks, that makes sense to me.