inGAP-sv: structural variation detection and visualization

Ji Qi (qij@fudan.edu.cn) and Fangqing Zhao (zhfq@mail.biols.ac.cn)

12/16/2010

 

We developed an integrative next-generation genome analysis pipeline (inGAP), which uses a Bayesian approach to detect single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). inGAP has been applied to a number of genome projects, including bacteria, yeast, plants and mammals. Here we extend this pipeline to identify and visualize large structural variations (SVs), including insertions, deletions, inversions and translocations.

 

1.    What can inGAP-sv do?

•       Refine short-read alignments by re-aligning reads around a putative SV.

•       Detect large structural variations using paired-end sequencing reads.

•       Visualize SAM-formatted alignments and SVs.

 

2.    How does inGAP-sv identify SVs?

•       Classify mapped paired-end reads into normal and anomalous mapping types.

•       Detect "gapped regions" that are not covered by normally mapped paired reads.

•       Detect SVs based on various combinations of anomalous mappings. Read qualities, mapping qualities, and the ratio of paired-end reads relative to the average mapping density are considered when calculating SV quality.
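
As an illustration of the first two steps above, the following Python sketch classifies read pairs from SAM records and then reports regions that no normally mapped pair spans. It is not inGAP-sv's implementation: the category names, the function names (classify_pair, normal_span, gapped_regions), the insert-size bounds (200-600 bp) and the assumption of a forward/reverse paired-end library are all hypothetical choices made for this example. Only the column layout and FLAG bits follow the SAM format.

```python
# Minimal sketch, NOT inGAP-sv's actual algorithm. Category names and
# insert-size bounds are assumptions chosen for illustration only.

def classify_pair(sam_line, min_insert=200, max_insert=600):
    """Assign one SAM alignment record to a (hypothetical) mapping type."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname = fields[2]
    rnext = fields[6]
    tlen = int(fields[8])

    # FLAG 0x4: this read is unmapped; 0x8: its mate is unmapped.
    if flag & 0x4 or flag & 0x8:
        return "one_end_unmapped"
    # RNEXT is "=" when the mate maps to the same reference sequence.
    if rnext not in ("=", rname):
        return "different_reference"   # candidate translocation
    # FLAG 0x10: read on reverse strand; 0x20: mate on reverse strand.
    if bool(flag & 0x10) == bool(flag & 0x20):
        return "same_orientation"      # candidate inversion
    span = abs(tlen)
    if span > max_insert:
        return "span_too_large"        # candidate deletion
    if span < min_insert:
        return "span_too_small"        # candidate insertion
    return "normal"


def normal_span(sam_line):
    """Return the 1-based (start, end) spanned by a normal pair, taken from
    the leftmost read (TLEN > 0); return None for the rightmost read."""
    fields = sam_line.rstrip("\n").split("\t")
    pos, tlen = int(fields[3]), int(fields[8])
    if tlen <= 0:
        return None
    return (pos, pos + tlen - 1)


def gapped_regions(normal_spans, ref_length):
    """Return 1-based regions of the reference not spanned by any normal pair."""
    gaps = []
    covered_to = 0  # rightmost position covered so far
    for start, end in sorted(normal_spans):
        if start > covered_to + 1:
            gaps.append((covered_to + 1, start - 1))
        covered_to = max(covered_to, end)
    if covered_to < ref_length:
        gaps.append((covered_to + 1, ref_length))
    return gaps
```

In such a scheme, a gapped region flanked by "span_too_large" pairs would point to a deletion, while one flanked by "span_too_small" or "one_end_unmapped" pairs would point to an insertion. How inGAP-sv actually combines and scores these signals is not described here.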

 

3.    How to get started?

•       inGAP-sv requires two files: a FASTA-formatted reference sequence and a SAM-formatted alignment.

•       A PTT-formatted annotation file for the reference sequence is optional.

•       A demo application is preloaded in inGAP-sv.
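
Before loading the two required files, it can help to confirm that they refer to the same sequences. The Python sketch below is a hedged example of such a check and is not part of inGAP-sv: the function names are hypothetical, and it assumes that reference names in the SAM file (column 3, RNAME) should match the first word of each FASTA header line.

```python
# Minimal sketch of an input sanity check; not part of inGAP-sv.
# All function names here are hypothetical.

def fasta_names(lines):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def sam_reference_names(lines):
    """Collect reference names (RNAME, column 3) used by mapped reads."""
    names = set()
    for line in lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] != "*":
            names.add(fields[2])
    return names


def missing_references(fasta_lines, sam_lines):
    """Return reference names used in the SAM file but absent from the FASTA."""
    return sorted(sam_reference_names(sam_lines) - fasta_names(fasta_lines))


# Example usage (the file names are placeholders):
# with open("reference.fasta") as fa, open("alignment.sam") as sam:
#     print(missing_references(fa, sam))
```

An empty result means every reference name in the alignment has a matching FASTA record.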

 

4.    What's the difference between inGAP-sv and other SV tools?

•       Most current SV tools can only detect very short indels (e.g. 1-10 bp); inGAP-sv and a few others (e.g. BreakDancer) work well with large SVs (>100 bp).

•       inGAP-sv is a one-stop SV detector. Users can identify, visualize, annotate and manually edit SVs using inGAP-sv.

•       Unlike command-line SV tools, inGAP-sv lets users visualize paired reads, which can significantly reduce the false discovery rate.

 

5.    What's the performance of inGAP-sv?

•       We first tested inGAP-sv on simulated data with large SVs (100-1000 bp) from the Yoruban genome (NA18507). inGAP-sv successfully identified 75%-90% of large indels and >85% of inversions with high accuracy. A detailed evaluation is in progress.

•       We also applied inGAP-sv to an Arabidopsis thaliana genome re-sequencing project, where it identified 815 insertions and 1000 deletions. We compared these indels to the Monsanto A. thaliana assembly and found that 78% of the deletions were covered by Monsanto contigs, of which 99% were correct; 71% of the insertions were covered by Monsanto contigs, of which 96% were correct.

•       inGAP-sv supports parallel computing.

 

6.    How to access inGAP-sv?

•       Users can download the latest version of inGAP from http://sourceforge.net/projects/ingap/. We provide binaries for Windows, Linux and Mac OS X.

•       A quick manual is available at http://schuster-33.bx.psu.edu/shared/manual.pdf

 

7.    Screenshots

•       Main functions and workflow of inGAP-sv

•       Deletions detected by inGAP-sv

•       Insertions detected by inGAP-sv

•       Inversions detected by inGAP-sv

•       Schematic view of SVs