Sunday, 25 August 2013

Paired-end read confusion - library, fragment or insert size?

Introduction

When you do an Illumina sequencing run, you need to choose between single-end (SE) and paired-end (PE) sequencing. To prepare a library, we chop up our DNA into small fragments and then ligate adaptors to both ends. Then, for SE, we only sequence one end of a DNA fragment. For PE, we sequence both ends of the same fragment:

fragment                  ========================================
fragment + adaptors    ~~~========================================~~~
SE read                   --------->
PE reads                R1--------->                    <---------R2
unknown gap                         ....................

The two reads you get from PE sequencing are referred to as R1 and R2, and they come from the same piece of DNA. Usually the length of the fragment is much longer than the length of R1+R2, so there is a "gap" in between them. Although we don't know the sequence of DNA in between R1 and R2, we have still gained useful information from the knowledge that R1 and R2 are next to each other with a known orientation and distance apart.

Mind the gap

There is a lot of confusion about the gap of unknown bases. You will encounter terms like "insert size", "fragment size", "library size" and variations thereof. The term "insert" comes from a time before NGS existed, when cloning DNA in E. coli vectors was standard practice.

PE reads      R1--------->                    <---------R2
fragment     ~~~========================================~~~
insert          ========================================
inner mate                ....................

The main confusion is with "insert size". The name itself suggests it is the unknown gap because it is "inserted" between R1 and R2, but this is misleading. It is more accurate to think of the insert as the piece of DNA inserted between the adaptors, which enable amplification and sequencing of that piece of DNA. So the "insert" actually encompasses R1 and R2 as well as the unknown gap between them. The gap itself is better named "inner mate distance", because that name is self-descriptive and the distance varies depending on the read lengths you sequenced the DNA library with.
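The relationship between the two terms is simple arithmetic, and a small sketch may make it concrete. The function name `inner_mate_distance` and the example numbers are illustrative assumptions, not taken from any particular tool:

```python
def inner_mate_distance(insert_size: int, r1_len: int, r2_len: int) -> int:
    """Gap between the ends of R1 and R2, in bases.

    insert_size is the length of the DNA between the adaptors.
    A negative result means R1 and R2 overlap by that many bases.
    """
    return insert_size - (r1_len + r2_len)


# Same hypothetical 500 bp insert, sequenced with two different read lengths.
gap_2x100 = inner_mate_distance(500, 100, 100)
gap_2x300 = inner_mate_distance(500, 300, 300)
print(gap_2x100, gap_2x300)
```

The insert size is fixed when the library is made, while the inner mate distance changes with the read length chosen for the run.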

Overlapping reads

The Illumina MiSeq instrument has added to the confusion recently. Firstly, it can produce PE reads of length 250bp. Secondly, the Nextera preparation method is sensitive and can produce a lot of small fragments, shorter than 500bp. This results in R1 and R2 actually overlapping each other!

fragment          ~~~========================================~~~
insert               ========================================
R1                   ------------------------->                    
R2                                   <-----------------------
overlap                              ::::::::::
stitched SE read     --------------------------------------->

This can actually be a desirable outcome, as you can stitch R1 and R2 together to make a super-long SE read, with extra confidence in the middle bases from the consensus of the overlapping sections of R1 and R2.
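As a rough illustration of the stitching idea, here is a minimal sketch that merges R1 with the reverse complement of R2 at the longest exact overlap. It is a toy under simplifying assumptions: the names `stitch` and `min_overlap` are made up for this example, it requires a perfect match in the overlap, and it ignores base qualities, which real read mergers take into account.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def stitch(r1: str, r2: str, min_overlap: int = 10) -> Optional[str]:
    """Merge R1 and R2 into one read, or return None if no overlap is found.

    R2 is read from the opposite strand, so it is reverse-complemented
    first. The longest exact suffix/prefix overlap wins.
    """
    r2_rc = reverse_complement(r2)
    for k in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if r1[-k:] == r2_rc[:k]:
            # Keep all of R1, then the part of R2 past the overlap.
            return r1 + r2_rc[k:]
    return None


# Simulate a 33 bp insert sequenced as 2x22 bp (11 bp overlap).
insert = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
r1 = insert[:22]
r2 = reverse_complement(insert[-22:])
print(stitch(r1, r2, min_overlap=5) == insert)
```

Requiring an exact match keeps the sketch short; with real data a single sequencing error in the overlap would make this version give up on the pair.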

Adaptor read-through

If the fragment size distribution is skewed too small, or is very wide, some fragments will be shorter than the read length itself. Then the reads not only overlap, they run off the end of the insert. This causes R1 and R2 to read into the adaptors:

tiny fragment       ~~~~========================~~~~
insert                  ========================
R1                      -------------------------->                    
R2                   <--------------------------
read-through         !!!                        !!!

If your MiSeq is configured properly, it will automatically trim or mask any adaptor sequence. You can tell this has happened if your FASTQ file contains reads of differing lengths, or lots of Ns at the 3' end of your reads. If it is not configured properly, you will get adaptor sequence in your reads, which causes all sorts of problems for downstream applications. You should remove it with a read trimming tool.
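To show what a trimming tool is doing in principle, here is a minimal sketch that cuts a read where the adaptor begins, including the case where only the start of the adaptor fits on the 3' end of the read. The function name `trim_adaptor`, the `min_match` threshold and the example adaptor string are assumptions for illustration. Check your library kit's documentation for the real adaptor sequence, and use an established trimmer for real data, since this toy does not tolerate mismatches.

```python
def trim_adaptor(read: str, adaptor: str, min_match: int = 5) -> str:
    """Remove adaptor read-through from the 3' end of a read.

    First looks for the full adaptor anywhere in the read, then for a
    prefix of the adaptor (at least min_match bases) ending the read.
    """
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    # Partial adaptor: the read ran out before the adaptor was complete.
    for k in range(min(len(adaptor), len(read)) - 1, min_match - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read


# Hypothetical example: a 12 bp insert followed by adaptor sequence.
ADAPTOR = "AGATCGGAAGAGC"  # assumed adaptor start; substitute your kit's sequence
insert = "ACGTACGTTGCA"
print(trim_adaptor(insert + ADAPTOR + "TTT", ADAPTOR))
print(trim_adaptor(insert + ADAPTOR[:7], ADAPTOR))
```

A short `min_match` removes more partial adaptors but also risks trimming genuine sequence that happens to resemble the adaptor start.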

Conclusion

Paired-end reads are a neat molecular biology trick. Remember that "insert" refers to the DNA fragment between the adaptors, and not the gap between R1 and R2. Instead we refer to that as the "inner mate distance". In some cases, when reads overlap, the inner mate distance can actually be negative. If you are using MiSeq data, you need to be vigilant about checking for adaptor read-through and overlapping reads.


50 comments:

  1. Really clear & helpful explanation
    thanks

  2. Oh... thank you for this clarification. I was always getting lost in the alias terms whenever it came to paired-end data! This has given enough clarity for proceeding further without getting into the word-web!

  3. Well said and explained. Thanks for clearing it out. I will follow this blog from now on.

  4. Thank you !
    And I wonder what kind of size you expect to have on average between two mates?

    Replies
    1. It all depends on the fragment size the person preparing the DNA library was trying to achieve. For Illumina TruSeq libraries (HiSeq) they often aim for about 500bp fragment size. For MiSeq Nextera libraries, it is a bit harder to control, and ideally they aim for 800bp or so, but often it ends up smaller.

      The best library size depends on what you want to do. For 2x100bp, a 500bp library is fine. But for 2x300bp, the reads would overlap. For some applications (transcriptome, 16S) this is desirable, but for others you'll want the larger size.

  5. Really helpful but I have one question.
    In the sections after the introduction, whenever you refer to "fragment", do you mean
    fragment + adaptors?
    The naming is not consistent there.

    Replies
    1. I think it is correct except maybe in the conclusion. Sorry about that. I think the message gets across though. I may fix it in the future. Thanks for the heads up!

  6. Hi there! very useful and clear, thank you!

    One dummy question. I heard somebody ask "what library sizes do you have?"... what do they mean with that??

    Thank you!

    Replies
    1. When they say "library size" they usually mean the AVERAGE length of the fragments that were sequenced. For Illumina paired-end, this could be anything from 200bp to 800bp, depending on what chemistry and library preparation method were used.

  7. it was very clear and helpful, tnx

  8. Very nice explanation..Thank you so much!!!

  9. Very nice blog. I have a question: how do I calculate the insert standard deviation and fragment standard deviation from the insert size?

    Replies
    1. You need to align all the paired reads to the assembled genome (or a close reference genome), to get a BAM file. Then, you can use Picard's "CollectInsertSizeMetrics" command to calculate the statistics:
      http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeMetrics
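
      For a rough cross-check without Picard, the same kind of statistics can be sketched in plain Python from the TLEN column (column 9) of SAM records. This is only an illustrative sketch and not Picard's algorithm: the function name `insert_size_stats` is made up here, and it assumes the aligner set the properly-paired flag (0x2) and filled in TLEN. As defined in the SAM specification, TLEN runs from the leftmost to the rightmost mapped base of the pair, so it corresponds to the insert size rather than the inner mate distance.

```python
import statistics


def insert_size_stats(sam_lines):
    """Return (mean, standard deviation) of template lengths.

    sam_lines is any iterable of SAM text lines. Only properly paired
    records with a positive TLEN are used, so each pair counts once.
    Needs at least two usable pairs for a standard deviation.
    """
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        if flag & 0x2 and tlen > 0:
            sizes.append(tlen)
    return statistics.mean(sizes), statistics.stdev(sizes)
```

      In practice you would feed it the text output of a SAM/BAM viewer for your alignment, and compare the result against what the sequencing centre told you.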

  10. Oh thanks so much for the clarity!!!! So helpful!

  11. This was incredibly helpful, thanks a lot!!!

  12. Nice and very well done :) Thanks

  13. Very informative post.
    I just want to make sure I understood it correctly: if I have 2x300 bp paired-end reads from MiSeq, then the insert size will be 600?

    Thanks!

    Replies
    1. No. The insert size is a function of the fragment size; it is independent of the read length. You need to ask the sequencing centre what size they fragmented the DNA into when they prepared the library. Most likely they won't really know, or they will give you a number that they think it is from some measurement (e.g. BioAnalyzer), but this number will be wrong. The only real way to estimate it is to align all the reads to a close reference genome and measure the "typical" distance between Read1 and Read2 in every pair (e.g. the mode of the histogram).

    2. Thank you for your answer. Here I am dealing with a metagenomic data so it will be difficult to align to any particular genome.

    3. One question regarding insert size. The Picard CollectInsertSizeMetrics function generates a histogram file with the MEAN_INSERT_SIZE. Does it represent the insert size used in assembly programs or the inner mate distance (using your definition)?

  14. So what do they do about the distance in between Read1 and Read2? How do they just figure out what is between them? The overlapping idea sounds much better. I don't understand how pair end sequencing works?

  15. Dahlia, what it tells you is that two sequences are close together on one piece of DNA. If you don't have an overlap, you don't know exactly how many bases (and which) are between the two reads. But you generally know what the largest fragments in your library might be and as outlined above, for any paired reads that map "as expected" on your reference genome, you can actually calculate your average fragment size.

    If your fragments are more than twice as long as your reads, you will have a gap. The gap increases your chance e.g. of getting repetitive DNA mapped. If you have a larger fragment with both sides partially sequenced, and one side contains something with a million hits in your reference genome, it is simply more likely that the other side of the fragment contains a unique sequence that you can map. Then you know that your repetitive sequence is really close to your unique sequence. With overlapping paired ends in the region, you can easily map it all.

    Or break-points: it also makes it easier to map break-points of genomic rearrangements. If you have 200bp reads, single reads will give you anything from 199bp 5-prime of the break point and 1bp 3-prime of the break point, to 100-100, to 1-199. That's all the information about the break point you will ever get, no matter how many reads you have. If one side of your break point is repetitive, that might not be enough to map the break point. With non-overlapping paired-end reads, you increase the read distance (and you know that the two reads belong together), so you increase the likelihood that you can find out that, e.g., parts of chromosome 1 and chromosome 3 are fused... (ok, you noticed, I come from human genetics...)

    It is simply a method for mapping (or "linking") sequences without needing very long reads. A full length read of long fragments would always be best, but that is simply not available and/or more expensive.

  16. I know this isn't rocket science, but it throws me off all the time when sequencing samples. The hardest things make more sense than this (i.e. particle physics), whereas this has so much room for variability. I think I might go back to protein work...

    Replies
    1. Maybe you should go back to particle physics ;-)

  17. Hello,

    Thanks for your blog, it’s very helpful!
    I have paired-end reads (Illumina HiSeq) from a metagenome in two Fastq files. These non-overlapping reads have already been quality checked, and I want to concatenate them into one file, so I can submit it for assembling into Megahit.
    However, I’m a bit confused about the right program to use, once I have non-overlapping reads as input (the exact lengths of gaps are not known), and I don´t want to exclude the sequences that have lost their pair during quality check…
    Do you know any program I could use?

    Thanks a lot for your help!

    Replies
    1. Megahit supports paired end reads now. Use the -1 and -2 parameters. Any orphans can use the -r parameter.

  18. Hi
    Does 300X2 bp sequenced in MiSeq include adaptors and barcodes as well or just the DNA sequence?
    Thanks

    Replies
    1. MiSeq usually uses the Nextera tagmentation protocol. The reads will have adaptors and indices (barcodes). Usually the barcodes will be removed by the sequencing provider. Sometimes the adaptors are, but often not! If your reads are all the same length (eg, 300 bp) then there are probably still Nextera adaptors in them, and they can often be ~80bp in each read.

  19. Hi Dr. Seemann,

    I realize this is an old post, but I hope you can help clarify something that is unclear to me.

    I've been analyzing a processed fastq file from a 100bp PE GBS run (250bp insert) on an Illumina HiSeq. I'm hoping to get access to the raw files soon, but now have only the combined reads. The library was prepared such that the adaptors are not specifically oriented (e.g. there are P5-frag-P7 and P7-frag-P5), which means that R1s and R2s are interchangeable between fragments.

    Preliminary data analysis is showing a lot of reverse complement clustering. To me, this only makes sense if the R2s have been systematically reverse complemented so that they match R1s on inverse fragments.

    My question is: does the Illumina machine or early processing automatically reverse complement the second reads, perhaps based on the assumption that the adaptors have been oriented specifically?

    Replies
    1. This question exceeds my understanding. I would suggest politely asking my former colleagues at the Monash Uni sequence service who know a lot about this: MicromonGenomics@monash.edu

  20. Thanks for this really clear and helpful explanation.
    How can I check for adaptor read-through and overlapping reads ?

    Replies
    1. To check for overlapping reads, I just run it through PEAR or FLASH to see what the summary report says. You can just do the first 10000 reads to check.

      Not sure about adaptor read through however. I trim that first from both ends.

  21. It's been said like a hundred times already, but thank you very much!
    If you know the difference between paired-end vs. mate-pair and could clarify it, I would be so thankful!

    Replies
    1. That is a good question.

      Mate-pairs are further apart on the genome, usually 2000 to 40,000 bp apart.

      Paired-end are usually < 1000 bp apart.

      Mate-pair libraries are harder to make, and use molecular tricks to make the big fragments into smaller fragments so you can use regular paired-end sequencing.

  22. Thanks a lot!!! It was very helpful !!

  23. Hi, I was wondering if this is only for MiSeq sequencing or you can get libraries like this in other instruments? Thanks

    Replies
    1. Most "paired end" libraries have the same idea. ABI SOLiD had it too. And people made mate pair libraries on Ion Torrent etc.

  24. Thank you for this helpful information. I was wondering - Can the term 'insert size' be used when reporting single end reads from Ion Torrent PGM? If yes, what does it mean.
    Thanks

    Replies
    1. It doesn't really make sense with single-end reads. But some older molecular biologists may still use the term, as it applied to cloning sequences in E. coli vectors so they could be sequenced.

  25. Hi,
    That's a very helpful post.
    I was wondering if you know techniques to analyse reads that don't overlap?
    Thanks.

    Replies
    1. Reads that do not overlap are normal - you just use the regular tools, which are expecting non-overlapping reads.

    2. Thanks for the answer! Do you know many of these tools for reads that do not overlap? For example, I've seen it is possible to deal with it in USEARCH but it is not recommended...

  26. Years have passed and this post is still a hit! ;) I was wondering, is it the same for mate-pair libraries? Does the concept of the insert size remain the fragment between the adaptors? Which adaptors then, knowing that in MP the structure will be something like this: ~~~====================~~~~~~====================~~~.
    Thanks in advance.

    Replies
    1. The blog post that keeps on giving.

      MP reads are more complicated now, with the new Nextera Mate Pair kits etc.

      Also, MP reads are not being used as much now, due to the affordability of long reads from PacBio and Oxford Nanopore, and the new 10x Genomics Chromium systems.

  27. Thank you, very useful post!! And your comments' answers were very clarifying too!! Thank you a lot again :)
