Monday, 24 November 2014

Dispelling creationist misconceptions about ENCODE

In 2012, some scientists made hyperbolic claims that the Encyclopedia of DNA Elements Project (ENCODE) had shown that 80% of our genome was functional. Unsurprisingly, special creationists latched onto this now-refuted claim as if it somehow invalidated common descent. It did not. Apart from the fact that those with the ENCODE project did not declare that their research rebutted evolution, special creationists ignored two points:

1. Functional does not mean essential. Retrotransposons that actively transpose into essential DNA are functional, but definitely harmful.
2. Once again, the evidence from consonant phylogenetic trees and shared genomic 'errors' is independent of any claim about 'functionality'.

Unfortunately, almost all special creationists peddling the ENCODE claim have not caught up with the refutation of the hyperbolic '80% is functional' claim, so a detailed rebuttal is needed.

ENCODE - The truth is that at least 66% of our DNA is worthless junk

Anyone who appeals to the ENCODE data in an attempt to rebut the evidence for common descent is merely broadcasting their ignorance of the fact that the ENCODE team and results have been heavily criticised by many evolutionary biologists. For those unaware of what ENCODE is, some context will be provided.

In 2001, the human genome was sequenced [1]. Over the past nine years, the Encyclopedia of DNA Elements (ENCODE) project has been examining the genome in order to determine what it does. Now, the ENCODE project has released several papers announcing the results of its research. One of the results is that "more than 80% of the human genome's components have now been assigned at least one biochemical function." [2] How does this square with the fact that much of our genome is made up of non-coding DNA such as retrotransposons (nearly half the DNA), intronic DNA or endogenous retroviral elements?

The key word here is 'functional'. I have to stress that functional does not mean essential or beneficial. Retrotransposons for example have been linked with human disease. [3] We would be arguably better off if those SINEs were silent. Special creation already has to accept that if every nucleotide was created by God, then the creator has deliberately inserted genomic material which causes immense misery in the human race. The intelligent designer becomes a malevolent designer if the logic of the special creationist position is carried through to its inevitable conclusion.

Another point to remember is that being transcribed counts as biological function, irrespective of whether that transcribed section actually does something beneficial for the organism. With this context in mind, claims that the 80% figure invalidates what we already know about the genome (that most of it is non-coding junk) can be dismissed.

There is of course no substitute for informed commentary (as opposed to special creationist disinformation), which is why the opinions of senior scientists involved in the ENCODE project are worth reading. Ewan Birney, the lead analysis coordinator for ENCODE over the past five years, is arguably a man whose opinion counts for something. So what does he say about the 80% figure?
It’s clear that 80% of the genome has a specific biochemical activity – whatever that might be. This question hinges on the word “functional” so let’s try to tackle this first. Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, not a single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as “specific biochemical activity” – for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like “having a phosphodiester bond” would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, “broad” histone modifications, “narrow” histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons. (Emphasis mine) [4]
Specific biochemical activity does not mean essential to life. This is the point that Burges completely ignores. This point needs to be hammered home to every special creationist who latches onto the ENCODE paper and claims that 80% of the genome is functional (though one wonders why they are still happy to accept the implication that God created the human genome with 20% junk). Birney continued by explaining which definition of 'functional' he is happy with:
Back to that word “functional”: There is no easy answer to this. In ENCODE we present this hierarchy of assays with cumulative coverage percentages, ending up with 80%. As I’ve pointed out in presentations, you shouldn’t be surprised by the 80% figure. After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising. 
However, on the other end of the scale – using very strict, classical definitions of “functional” like bound motifs and DNaseI footprints; places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases – we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as “functional” by intuition) that number goes up to 9%. Given what most people thought earlier this decade, that the regulatory elements might account for perhaps a similar amount of bases as exons, this is surprisingly high for many people – certainly it was to me! 
In addition, in this phase of ENCODE we did sample broadly but nowhere near completely in terms of cell types or transcription factors. We estimated how well we have sampled, and our most generous view of our sampling is that we’ve seen around 50% of the elements. There are lots of reasons to think we have sampled less than this (e.g., the inability to sample developmental cell types; classes of transcription factors which we have not seen). A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our sampling) to 20% (Emphasis mine)
In other words, once we start changing our definition of 'functional' to one more consistent with what the layperson would take it to mean (i.e. biologically useful or essential), the 80% figure drops to around 20%. As for why ENCODE emphasised the 80% figure, rather than the 20% one more consistent with what the layperson would perceive 'functional' to mean, Birney states:
Originally I pushed for using an “80% overall” figure and a “20% conservative floor” figure, since the 20% was extrapolated from the sampling. But putting two percentage-based numbers in the same breath/paragraph is asking a lot of your listener/reader – they need to understand why there is such a big difference between the two numbers, and that takes perhaps more explaining than most people have the patience for. We had to decide on a percentage, because that is easier to visualize, and we choose 80% because (a) it is inclusive of all the ENCODE experiments (and we did not want to leave any of the sub-projects out) and (b) 80% best coveys the difference between a genome made mostly of dead wood and one that is alive with activity. (Emphasis mine)
Alive with activity again does not mean essential to life. A retrotransposon that copies and pastes itself indiscriminately in the genome is functional, but when it causes genetic disorders it is clearly not beneficial. Unsurprisingly, special creationists tend to ignore the 45% of the genome that is retrotransposed DNA, essentially parasitic genetic material.
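
One way to read the arithmetic behind Birney's "conservative" 18% figure is as a simple scaling of the 9% observed under the strict definition by the fraction of elements he thinks ENCODE has sampled so far. The sketch below assumes that reading. The function name `extrapolate_coverage` and the assumption that coverage scales in proportion to sampling are illustrative choices made here, not ENCODE's published method.

```python
def extrapolate_coverage(observed_pct: float, sampled_fraction: float) -> float:
    """Scale an observed genome percentage up to the full, unsampled set.

    Assumes coverage grows in proportion to sampling, which is a
    simplification made here for illustration only.
    """
    if not 0 < sampled_fraction <= 1:
        raise ValueError("sampled_fraction must be in (0, 1]")
    return observed_pct / sampled_fraction


observed_pct = 9.0      # exons + specific DNA:protein contacts, per Birney
sampled_fraction = 0.5  # Birney's "most generous" estimate of sampling

estimate = extrapolate_coverage(observed_pct, sampled_fraction)
print(f"extrapolated strict-definition coverage: {estimate:.0f}%")
```

On this reading, 9% seen at 50% sampling gives 18%, which Birney then rounds up to 20% on the grounds that the true sampling is probably lower. Either way the result is a long way short of 80%.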

The ENCODE hype has been criticised severely for its misleading 80% figure. Dan Graur et al. have published a takedown of the extravagant ENCODE claims:
A recent slew of ENCODE Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is under 10%. Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that at least 80 – 10 = 70% of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions, or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect. Here, we detail the many logical and methodological transgressions involved in assigning functionality to almost every nucleotide in the human genome. The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten. (Emphasis mine) [5]
It is worth noting that when Ewan Birney, the lead scientist for ENCODE, was pressed on the claim that 80% of the genome is essential to life, he conceded that this was not true. In a BBC Radio interview, Birney admitted:
Chris Ponting: So I think we can probably agree between us that between 10% and say 20% is vital for life. 
Ewan Birney: I mean, I think we would agree with that. I think, you know, refining that percentage down is quite interesting. I think also the other components that we — biochemical events that we see in the genome, sort of, each one of them are equally likely to be part of that 10% to 20% that we’re looking for. It’s important to realize that it’s not the case that we can spot the 10% to 20% just by looking harder. Each of these different places in the genome that have some biochemical activity associated with it, when there’s some phenotype screen that’s directed there or some evolutionary screen that’s directed to that point, ENCODE can now say “Ah ha! Here is a biochemical thing that this piece of DNA looks like it could be doing”. (Emphasis mine) [6]
Evolutionary biologist TR Gregory, an expert in genome size evolution and so well placed to provide informed commentary, has taken a considerable interest in the subject. In the comments section of one of Gregory's blog posts discussing the ENCODE hype, respected evolutionary geneticist Joe Felsenstein makes a penetrating comment which cuts to the heart of the matter:
Ewan Birney is trying to give the impression that the problem is that people have misinterpreted him. But he was the one who put forward the 80% figure. It was not added by the popular science press, he wanted it out there and wanted it noticed. And when there was a huge blaze of publicity centered on the (purported) death of junk DNA, publicity that Ryan has done us the great service of listing, I didn’t notice Birney jumping up saying that he had been misinterpreted. 
Large numbers of laypeople and other scientists are now persuaded that there never was any junk DNA. It will probably take 10 years to unpersuade them. We have Birney to thank for this situation. I’m saddened to see him dance around and try to give the impression that someone else came up with the Death of Junk DNA. (Emphasis in the original) [7]
Birney later admitted on his blog that the 80% figure represented biological activity, which was not the same thing as essential to life.
The problem with 'science by press release' is that, in order to gain the attention of your audience, there is a very real temptation to succumb to hyperbole. When you are dealing with the general public, terms such as 'functional' need to be defined properly, otherwise people may well get the wrong idea. Certainly, when most people hear 'functional', they are likely to think that 80% of the genome is essential to life, which is simply false. As project leader Ewan Birney acknowledged later, the 80% figure represents biological activity, which is definitely not the same thing as essential to life:
Q. Ok, fair enough. But are you most comfortable with the 10% to 20% figure for the hard-core functional bases? Why emphasize the 80% figure in the abstract and press release? 
A. (Sigh.) Indeed. Originally I pushed for using an “80% overall” figure and a “20% conservative floor” figure, since the 20% was extrapolated from the sampling. But putting two percentage-based numbers in the same breath/paragraph is asking a lot of your listener/reader – they need to understand why there is such a big difference between the two numbers, and that takes perhaps more explaining than most people have the patience for. We had to decide on a percentage, because that is easier to visualize, and we choose 80% because (a) it is inclusive of all the ENCODE experiments (and we did not want to leave any of the sub-projects out) and (b) 80% best coveys the difference between a genome made mostly of dead wood and one that is alive with activity. We refer also to “4 million switches”, and that represents the bound motifs and footprints. 
We use the bigger number because it brings home the impact of this work to a much wider audience. But we are in fact using an accurate, well-defined figure when we say that 80% of the genome has specific biological activity. [8]
In other words, between 10% and 20% of the genome consists of 'hard core functional bases', with the rest simply being biologically active, which is not the same thing as essential to life. Retrotransposable elements that copy and paste themselves randomly into the genome are biologically active, but hardly essential or beneficial, as evidenced by the genetic diseases connected to retrotransposable DNA. Even if one grants that the entire 80% figure refers to essential genomic material, that still leaves 20% of the genome as non-functional junk, which is inconsistent with the idea that the genome is the product of an intelligent designer.

Since then, the much-touted 80% figure has been shrinking. Science journalist Faye Flam contacted John Stamatoyannopoulos, one of the ENCODE researchers, to clarify the 80% figure. It turns out that it is more like 40%:
He said he thought the skeptics hadn’t fully understood the papers, and that some of the activity measured in their tests does involve human genes and contributes something to our human physiology. He did admit that the press conference mislead people by claiming that 80% of our genome was essential and useful. He puts that number at 40%. Otherwise he stands by all the ENCODE claims. (Emphasis mine) [9]
So, we can safely bin the "80% of the genome is functional" claim as even researchers from ENCODE are backing away from it.

Max Libbrecht, another ENCODE researcher, also commented on the ENCODE debacle, showing that even members of the project realised just how damaging the "80% is functional" hype was:

After I took part in an AMA ("Ask Me Anything") on reddit, there has been some discussion elsewhere (such as by Ryan Gregory and in the comments of Ewan Birney's blog) of what I and the other ENCODE scientists meant. In response, I'd like to echo what many others have said regarding the significance of ENCODE on the fraction of the genome that is "junk" (or nonfunctional, or unimportant to phenotype, or evolutionarily unconserved). 
In its press releases, ENCODE reported finding 80% of the genome with "specific biochemical activity", which turned into (through some combination of poor presentation on the part of ENCODE and poor interpretation on the part of the media) reports that 80% of the genome is functional. This claim is unlikely given what we know about the genome (here is a good explanation of why), so this created some amount of controversy.

I think very few members of ENCODE believe that the consortium proved that 80% of the genome is functional; no one claimed as much on the reddit AMA, and Ewan Birney has made it clear on his blog that he would not make this claim either. In fact, I think importance of ENCODE's results on the question of what fraction of DNA is functional is very small, and that question is much better answered with other analysis, like that of evolutionary conservation. Lacking proof either way from ENCODE, there was some disagreement on the AMA regarding what the most likely true fraction is, but I think this stemmed from disagreements about definitions and willingness to hypothesize about undiscovered function, not misinterpretation of the significance of ENCODE's results. 
I think many members of the consortium (including Ewan Birney) regret the choice of terminology that led to the misinterpretations of the 80% number. Unfortunately, such misinterpretations are always a danger in scientific communication (both among the scientific community and to the public). Whether the consortium could have done a better job explaining the results, and whether we should expect the media to more accurately represent scientific results, is hard to say. 
I think the contribution of ENCODE lies not in determining what DNA is functional but rather in determining what the functional DNA actually does. This was the focus of the integration paper and the companion papers, and I would have preferred for this to be the focus of the media coverage. (Emphasis mine) [10]
In short:
  • The claim that 80% of the genome is essential to life is false. The figure is more like 10-20%.
  • The value of ENCODE, to quote one of its researchers, is in determining what the functional DNA actually does, rather than how much is functional.
  • The question of functionality does not take away the considerable evidence for common descent. Burges has completely failed to address this evidence in any substantive way, and the ENCODE diversion merely demonstrates his ignorance of the controversy surrounding ENCODE and of the acknowledgement that the 80% figure was hype.
How much of the genome is actually essential to life? Not much. Around 45% of the genome is made up of mobile genetic elements - retrotransposons - that copy and paste themselves into the genome randomly, often causing disease in the process. This is very much an unguided, random process. A significant fraction of the human genome owes its origin to ancient retroviral infection. In fact, there is more retroviral genetic material – the evidence of past retroviral infection – in our genome than there is direct protein coding material. Only a small percentage of the human genome directly codes for protein or has specific regulatory function.

Breaking down the human genome into the various classes of genetic material we find there, the sheer amount of parasitic DNA, decayed viral remnants and the genetic equivalent of gibberish [11] is astonishing:

    Transposable Elements: 44% junk

        DNA transposons: functional < 0.1%, defective 3%
        Retrotransposons: active < 0.1%, co-opted < 0.1%, junk 41%

    Viruses: 9% junk

        DNA Viruses: active < 0.1%, defective ~1%
        RNA Viruses: active < 0.1%, co-opted < 0.1%, defective 8%

    Pseudogenes: 1.2% junk

        Derived from protein-coding genes: 1.2% junk
        Co-opted pseudogenes: < 0.1% useful, secondarily acquired new function

    Ribosomal RNA genes: 0.19% junk

        Essential: 0.22%
        Junk: 0.19%

    Other RNA encoding genes

        tRNA genes: < 0.1% essential
        known small RNA genes: < 0.1% essential
        putative regulatory genes: ~2% essential

    Protein-encoding genes: 9.6% junk (intron sequences), 1.8% essential transcribed

    Regulatory Sequences: 0.6% essential

    Origins of DNA replication: < 0.1% essential

    Scaffold attachment regions: < 0.1% essential

    Highly repetitive regions: 1% junk, 2% essential

    Intergenic DNA: 26.3% unknown function, most likely junk, 2% essential

    Essential / Functional DNA: 8.7%
    Junk DNA: 65%
    Unknown: 26.3%
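
The totals in this breakdown can be checked by adding up the individual categories. The short sketch below does that with the percentages copied from the list above. The dictionary labels are shorthand chosen here for illustration, and entries given only as "< 0.1%" are left out of the sums, which is why the essential total comes out a little under the quoted 8.7%.

```python
# Percentages of the human genome, copied from the breakdown above.
# Entries listed only as "< 0.1%" are omitted, so the essential total
# is a slight underestimate of the quoted 8.7%.
junk = {
    "transposable elements": 44.0,
    "viruses": 9.0,
    "pseudogenes": 1.2,
    "ribosomal RNA genes": 0.19,
    "introns of protein-encoding genes": 9.6,
    "highly repetitive regions": 1.0,
}
essential = {
    "ribosomal RNA genes": 0.22,
    "putative regulatory RNA genes": 2.0,
    "protein-encoding genes (transcribed)": 1.8,
    "regulatory sequences": 0.6,
    "highly repetitive regions": 2.0,
    "intergenic DNA": 2.0,
}
unknown = 26.3  # intergenic DNA of unknown function

junk_total = sum(junk.values())
essential_total = sum(essential.values())
accounted_for = junk_total + essential_total + unknown

print(f"junk: {junk_total:.2f}%")
print(f"essential (excluding '< 0.1%' items): {essential_total:.2f}%")
print(f"unknown: {unknown:.1f}%")
print(f"accounted for: {accounted_for:.2f}%")
```

The junk categories sum to just under 65%, and the three totals together account for essentially the whole genome once the handful of "< 0.1%" entries are allowed for.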

Even if most of the intergenic DNA turns out to have a function, nearly 66% of our genome is rubbish: remnants of ancient retroviral infection, damaged genes that can no longer work, mobile genetic elements that copy and insert themselves randomly around the genome irrespective of the benefit or harm they cause, and introns, the non-coding sections of DNA that interrupt genes.


1. International Human Genome Sequencing Consortium "Initial sequencing and analysis of the human genome" Nature (2001) 409:860-921

2. Skipper M, Dhand R, Campbell P "Presenting ENCODE" Nature (2012) 489:45 doi:10.1038/489045a

3. Deininger PL and Batzer MA "Alu Repeats and Human Disease" Molecular Genetics and Metabolism (1999) 67:183-193

4. Birney E "ENCODE: My own thoughts" Ewan's Blog: Bioinformatician at large September 4th 2012

5. Graur D, Zheng Y, Price N, Azevedo RBR, Zufall RA, Elhaik E "On the Immortality of Television Sets: 'Function' in the Human Genome According to the Evolution-Free Gospel of ENCODE" Genome Biology and Evolution (2013) 5:578-590

6. Gregory TR "BBC Interview with Ewan Birney" Genomicron April 1 2013.

7. Gregory TR "BBC Interview with Ewan Birney" Genomicron April 1 2013. Comment

8. Birney E "ENCODE: My Own Thoughts" Ewan's Blog: Bioinformatician At Large 5 Sep 2012

9. Flam F "Skeptical Takes on Elevation of Junk DNA and Other Claims from ENCODE Project" Tracker: Peer Review Within Science Journalism 12 Sep 2012

10. Libbrecht M "On ENCODE's results regarding junk DNA" mlibbrecht Oct 8 2012

11. Moran L “What’s in Your Genome?” Sandwalk May 8th 2011