inGAP-sv: structural variation detection and visualization

Ji Qi (qij@fudan.edu.cn) and Fangqing Zhao (zhfq@mail.biols.ac.cn)

12/16/2010

 

We developed an integrative next-generation genome analysis pipeline (inGAP), which employs a Bayesian principle to detect single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). inGAP has been applied to a number of genome projects, including bacteria, yeast, plants and mammals. Here we extend this pipeline to identify and visualize large structural variations (SVs), including insertions, deletions, inversions and translocations.

 

1.    What can inGAP-sv do?

       Refine short read alignment by re-aligning short reads around a putative SV.

       Detect large structural variations using paired-end sequencing reads.

       Visualize SAM-formatted alignments and SVs.

 

2.    How does inGAP-sv identify SVs?

       Classify mapped paired-end reads into normal/anomalous mapping types.

       Detect gapped regions which cannot be covered by normally mapped paired reads.

       Detect SVs based on various combinations of anomalous mappings. Read qualities, mapping qualities, and the ratio of paired-end reads relative to the average mapping density are considered when calculating SV quality.
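
       The first step above, sorting read pairs into normal and anomalous mapping types, can be illustrated with a minimal sketch. This is not inGAP-sv's own code: the function name, the category labels and the insert-size bounds are assumptions made for illustration, and a forward/reverse paired-end library is assumed. Only standard SAM fields (FLAG, RNAME, POS, RNEXT, PNEXT, TLEN) are used.

```python
# Standard SAM FLAG bits (from the SAM specification).
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
MATE_REVERSE = 0x20


def classify_pair(fields, min_insert, max_insert):
    """Label one SAM record by how it maps relative to its mate.

    `fields` is a SAM alignment line split on tabs. The labels and the
    insert-size bounds are hypothetical, chosen for this illustration.
    """
    flag = int(fields[1])
    rname, pos = fields[2], int(fields[3])
    rnext, pnext = fields[6], int(fields[7])
    tlen = abs(int(fields[8]))

    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    if flag & MATE_UNMAPPED:
        return "mate_unmapped"        # only one end is anchored
    if rnext != "=" and rnext != rname:
        return "different_reference"  # candidate translocation signal

    reverse = bool(flag & REVERSE)
    mate_reverse = bool(flag & MATE_REVERSE)
    if reverse == mate_reverse:
        return "same_orientation"     # candidate inversion signal

    # In a forward/reverse library the leftmost read should be forward.
    if pos < pnext:
        leftmost_is_reverse = reverse
    elif pos > pnext:
        leftmost_is_reverse = mate_reverse
    else:
        leftmost_is_reverse = False
    if leftmost_is_reverse:
        return "everted"

    if tlen > max_insert:
        return "insert_too_large"     # candidate deletion signal
    if tlen < min_insert:
        return "insert_too_small"     # candidate insertion signal
    return "normal"
```

       Such a function would be called once per alignment line, for example classify_pair(line.rstrip("\n").split("\t"), 150, 350), with bounds taken from the observed insert-size distribution of the library. Grouping nearby anomalous pairs of the same type into SV calls, and scoring them, is a separate step not shown here.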

 

3.    How to get started?

       inGAP-sv requires two files: a FASTA-formatted reference sequence and a SAM-formatted alignment.

       A PTT-formatted annotation file for the reference sequence is optional.

       A demo application is preloaded in inGAP-sv.
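
       Before loading your own data, it can help to confirm that the SAM alignment was produced against the same reference as the FASTA file. The sketch below is an optional sanity check and is not part of inGAP-sv; the helper names are hypothetical. It compares the sequence names in the FASTA headers with the SN tags of the @SQ lines in the SAM header. The SAM header is optional, so an empty result does not prove that the files match.

```python
def fasta_names(lines):
    """Return the sequence names (first word after '>') of a FASTA file."""
    names = []
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            names.append(line[1:].split()[0])
    return names


def sam_header_names(lines):
    """Return the reference names listed in the @SQ header lines of a SAM file."""
    names = []
    for line in lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        if line.startswith("@SQ"):
            for tag in line.rstrip("\n").split("\t")[1:]:
                if tag.startswith("SN:"):
                    names.append(tag[3:])
    return names


def missing_references(fasta_lines, sam_lines):
    """List SAM reference names that do not appear in the FASTA file."""
    known = set(fasta_names(fasta_lines))
    return [name for name in sam_header_names(sam_lines) if name not in known]
```

       Both helpers accept any iterable of lines, so open file objects can be passed directly, for example missing_references(open("ref.fasta"), open("reads.sam")), where the file names are placeholders.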

 

4.    What's the difference between inGAP-sv and other SV tools?

       Most current SV tools can only detect very short indels (e.g. 1-10 bp); inGAP-sv and a few others (e.g. BreakDancer) work well with large SVs (>100 bp).

       inGAP-sv is a one-stop SV detector. Users can identify, visualize, annotate and manually edit SVs using inGAP-sv.

       Compared with other command-line based SV tools, visualization of paired reads in inGAP-sv can significantly reduce the false discovery rate.

 

5.    What's the performance of inGAP-sv?

       We first tested inGAP-sv using simulated data with large SVs (100-1000 bp) from the Yoruban genome (NA18507). inGAP-sv successfully identified 75%-90% of large indels and >85% of inversions with high accuracy. Detailed evaluation is in progress.

       We also applied inGAP-sv to an Arabidopsis thaliana genome re-sequencing project, in which it identified 815 insertions and 1000 deletions. We compared these indels to the Monsanto A. thaliana assembly and found that 78% of the deletions could be covered by Monsanto contigs, and 99% of those were correct. Likewise, 71% of the insertions could be covered by Monsanto contigs, and 96% of those were correct.

       inGAP-sv supports parallel computing.

 

6.    How to access inGAP-sv?

       Users can download the latest version of inGAP from http://sourceforge.net/projects/ingap/. We provide binaries for Windows, Linux and Mac OS X.

       A quick manual is available at http://schuster-33.bx.psu.edu/shared/manual.pdf .

 

7.    Screenshots

       Main functions and work flow of inGAP-sv

       Deletions detected by inGAP-sv

       Insertions detected by inGAP-sv

       Inversions detected by inGAP-sv

       Schematic view of SVs