mebioda

Genome HTS in biodiversity research

High-throughput sequencing

What are your experiences with:

Typical workflow

  1. clip any synthetic oligonucleotides (adaptors, primers)
  2. trim low quality bases, filter short reads
  3. de novo or mapping assembly
  4. further annotation
  5. variant calling
  6. consensus sequence computation
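
Step 2 of this workflow can be illustrated with a minimal sketch. It assumes Phred+33 encoded FASTQ quality strings and simply removes low-quality bases from the 3' end; the function names and the thresholds (`min_qual=20`, `min_len=30`) are illustrative assumptions, not recommendations, and dedicated trimming tools use more refined (e.g. sliding window) strategies.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    The offset of 33 (Phred+33) is an assumption about the encoding.
    """
    return [ord(char) - offset for char in quality_string]


def trim_read(seq, quals, min_qual=20, min_len=30):
    """Trim low-quality bases from the 3' end of a read.

    Returns the trimmed (seq, quals) pair, or None when the trimmed
    read is shorter than min_len and should be filtered out.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below threshold.
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], quals[:end]
```

For example, `trim_read("ACGTAC", phred_scores("IIII##"), min_len=3)` keeps only the first four bases, while the same call with the default `min_len` discards the read.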

Library preparation

Clipping adaptors

Depending on the sequencing platform, the insert size, and any additional services provided by the sequencing lab, the reads may already have been sorted by adaptor (with the adaptors then clipped off), or you may have to do this yourself.
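
If you do have to clip adaptors yourself, the basic idea can be sketched as follows. This is a naive illustration that assumes a known 3' adaptor and exact matching only; the function name, the `min_overlap` parameter, and the adaptor sequence in the usage note are illustrative assumptions. Real clipping tools tolerate sequencing errors and exploit paired-end information.

```python
def clip_adaptor(read, adaptor, min_overlap=3):
    """Remove a 3' adaptor from a read using exact matching only.

    A full copy of the adaptor anywhere in the read is clipped together
    with everything after it. Otherwise, a partial adaptor (a prefix of
    at least min_overlap bases) at the very end of the read is clipped.
    """
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]

    # Try the longest possible partial adaptor first.
    longest = min(len(adaptor) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[:len(read) - overlap]

    # No adaptor found: return the read unchanged.
    return read
```

With the illustrative adaptor `"AGATCGGAAGAGC"`, the read `"ACGTACGTAGATCGGAAG"` ends in the first ten adaptor bases, so the call returns `"ACGTACGT"`.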

Effect of adaptor clipping

Sturm M, C Schroeder & P Bauer, 2016. SeqPurge: highly-sensitive adapter trimming for paired-end NGS data. BMC Bioinformatics 17: 208. doi:10.1186/s12859-016-1069-7

Clipping primers

Sidenote about amplicon sequencing

Quality assessment and trimming

A convenient tool for initial quality assessment of HTS read data is FastQC, whose results can indicate numerous pathologies:

Additional filtering

Overlap-layout-consensus assembly

What do you know about graph theory? Edges? Vertices? Degrees? Directedness?

An alternative way to traverse the graph

Making the graph amenable to Eulerian traversal

HTS sequence data and k-mers

This re-processing can be done naively in Python as follows (dedicated tools are much faster):

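def find_kmers(string, k):
    """Return all overlapping substrings of length k in string, in order."""
    kmers = []
    n = len(string)

    # A string of length n contains n - k + 1 k-mers (none if k > n).
    for i in range(n - k + 1):
        kmers.append(string[i:i + k])

    return kmers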
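
Once reads are split into k-mers, these can serve as the edges of a de Bruijn graph, in which each k-mer connects its (k-1)-mer prefix to its (k-1)-mer suffix. The following is a minimal sketch of that construction; the function name and the adjacency-list representation are illustrative assumptions, and real assemblers additionally handle sequencing errors, reverse complements, and k-mer coverage.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads as an adjacency list.

    Nodes are (k-1)-mers. Every k-mer found in a read adds one directed
    edge from its prefix to its suffix, so a repeated k-mer yields
    parallel edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(de_bruijn_graph(["ACGT"], 3))
# prints {'AC': ['CG'], 'CG': ['GT']}
```

An assembly then corresponds to a path through this graph that uses every edge, which is why Eulerian traversal is relevant here.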

Optimal k-mer size and assembly

Chikhi R & P Medvedev, 2014. Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1): 31–37. doi:10.1093/bioinformatics/btt310

Scaffolding

Mapping assembly

The Burrows-Wheeler transform

  1. All rotations for a given input string are generated
  2. These are sorted alphabetically
  3. The final column is the transformed string, i.e. BWT(T)
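
These three steps can be sketched naively in Python. One assumption is added here that the steps above do not state: a terminator character (`$`) is appended to the input, as is conventional, so that the transform can be inverted. Sorting uses Python's default string order, in which `$` precedes all letters. Building every rotation takes quadratic memory, so this is for illustration only; mapping tools construct the transform via suffix arrays instead.

```python
def bwt(text, terminator="$"):
    """Compute the Burrows-Wheeler transform of text via sorted rotations."""
    if terminator in text:
        raise ValueError("input must not contain the terminator character")
    text += terminator

    # 1. Generate all rotations of the (terminated) input string.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # 2. Sort the rotations.
    rotations.sort()
    # 3. The final column of the sorted rotations is BWT(T).
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))
# prints annb$aa
```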

This string has the following properties:

BWT mapping assembly tools

Commonly used BWT mapping tools are:

Genome annotation

Genomes are annotated using multiple lines of evidence, such as:

Variant calling

Variants are called with a variety of methods: