Poster Presentation Australian Society for Microbiology Annual Scientific Meeting 2014

Whole Genome Sequence Typing: seeing the trees through the forest? Only with proper algorithms and quality control routines! (#417)

Hannes Pouseele 1, Katrien De Bruyne 1, Koen Janssens 1, Sander Valcke 1, Bruno Pot 1
  1. Applied Maths NV, Sint Martens Latem, Belgium

Whole-genome MLST (wgMLST) is becoming an attractive alternative to traditional bacterial typing, as sequencing costs are dropping to levels that compete with the accumulated cost of multiple traditional tests (PFGE, MLVA, MLST, spoligotyping, antibiotic resistance testing, serotyping, …). However, a reliable wgMLST analysis depends on the availability of an organism-specific, high-quality reference scheme that covers the genetic diversity of the organism concerned while avoiding the pitfalls of orthologous genes and repetitive sequences.
We developed a method to build a high-quality pan-genome wgMLST scheme for any micro-organism. Using well-annotated chromosomal reference sequences, we extract regions of interest and use an MCL-based (Markov clustering) approach to define consistent loci. Within each locus, alleles that differ strongly because of annotation inconsistencies are polished to ensure maximal consistency.
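To make the clustering step concrete, the following is a minimal pure-Python sketch of Markov clustering (alternating expansion and inflation on a column-stochastic matrix). It is not the authors' implementation: the function names, the inflation value, the thresholds and the toy similarity matrix are all assumptions for illustration, and the abstract does not state how pairwise similarities between regions of interest are computed.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1 (in place)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def matmul(a, b):
    """Plain square matrix product, adequate for small toy inputs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl_clusters(sim, inflation=2.0, max_iter=100, tol=1e-9):
    """Group items by Markov clustering of a symmetric similarity matrix.

    sim[i][j] is a hypothetical similarity between candidate regions i and j.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    n = len(sim)
    m = [[float(sim[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # self-loops stabilise the iteration
    normalize_columns(m)

    for _ in range(max_iter):
        expanded = matmul(m, m)                                  # expansion
        inflated = [[v ** inflation for v in row] for row in expanded]
        normalize_columns(inflated)                              # inflation
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Rows that keep non-zero mass are attractors; the columns they reach
    # form one cluster. Identical clusters from several attractors are merged.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy input: regions 0-2 resemble each other, as do regions 3-4.
toy_similarity = [
    [1.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 1.0],
]
print(mcl_clusters(toy_similarity))
```

Each resulting cluster would correspond to one candidate locus. A production pipeline would more likely rely on an established sparse-matrix MCL implementation than on dense lists as used here.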
We then use the two-tier approach implemented in BioNumerics 7.5® to perform both an assembly-based allele detection (providing an overall picture of the alleles) and an exhaustive, assembly-free allele detection. These computationally intensive tasks can run on a calculation engine (cloud or on-premises cluster). Besides the resulting pan-genome reference set, subschemes can easily be extracted to provide functional information (e.g. resistome or virulence risks) or traditional typing information (sequence types, clonal complexes, subtyping lineages, …).
The de novo assembly-based approach regularly misses loci because the assembly is fragmented into multiple contigs, and its behavior when reconstructing multi-copy loci is undefined, so these loci are poorly detected. Because the computationally less intensive assembly-free method is designed to be exhaustive, it picks up all copies of a multi-copy locus as separate allele calls; loci are missed only when reads are missing, not because of unpredictable de novo assembly errors. Using published sequences of Campylobacter, we illustrate that current gene-by-gene schemes carry severe risks from overlapping locus definitions, which may introduce noise into sample comparisons, in contrast to the more extensive and consistently defined pan-genome schemes.
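As an illustration of what checking a scheme for overlapping locus definitions might look like, the sketch below scans locus coordinates on a single reference sequence and reports every overlapping pair. The function name, the tuple layout (name, start, end with half-open coordinates) and the example loci are hypothetical; the abstract does not describe how overlaps were actually assessed.

```python
def overlapping_loci(loci):
    """Report pairs of loci whose reference coordinates overlap.

    loci: iterable of (name, start, end) tuples with half-open coordinates
    on the same reference sequence (strand is ignored in this sketch).
    Returns a list of (earlier_locus, later_locus, overlap_length).
    """
    ordered = sorted(loci, key=lambda locus: (locus[1], locus[2]))
    overlaps = []
    active = []  # (name, end) of loci that may still reach the current start
    for name, start, end in ordered:
        # Drop loci that finish at or before the current start position.
        active = [(n, e) for n, e in active if e > start]
        for other_name, other_end in active:
            overlaps.append((other_name, name, min(other_end, end) - start))
        active.append((name, end))
    return overlaps


# Made-up coordinates: locA overlaps both locD and locB; locC is clean.
example_scheme = [
    ("locA", 100, 400),
    ("locB", 350, 700),
    ("locC", 700, 900),
    ("locD", 120, 200),
]
for first, second, length in overlapping_loci(example_scheme):
    print(f"{first} overlaps {second} by {length} bp")
```

A single mutation in an overlapping region can change the allele call of both loci, so each overlapping pair is a place where one mutation could be counted twice in a sample comparison.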