sga

Genetics

Software Description

From the SGA GitHub repository:

Overview

SGA implements a set of assembly algorithms based on the FM-index. As the FM-index is a compressed data structure, the algorithms are very memory efficient. An SGA assembly has three distinct phases. The first phase corrects base calling errors in the reads. The second phase assembles contigs from the corrected reads. The third phase uses paired end and/or mate pair data to build scaffolds from the contigs. Example real-data assemblies are included in the SGA source directory.

Error Correction

Error correction is the first stage of the assembly. An FM-index of the sequence reads is constructed, then base calling errors are identified by finding low-frequency k-mers in the reads. The output from the error corrector is a set of FASTQ files containing the corrected read sequences.
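As a rough sketch of this stage, the script below preprocesses a read pair, builds the FM-index, and runs the corrector. The file names, thread count, and k-mer size are placeholders, and the `run` helper and `DRY_RUN` variable are hypothetical conveniences, not part of SGA. The flags follow the example scripts distributed with SGA as best recalled, so confirm them with `sga <subcommand> --help` on the installed version.

```shell
#!/usr/bin/env bash
# Sketch of the SGA error-correction stage (assumed file names and parameters).
# With DRY_RUN=1 (the default) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# correct_reads READS_1 READS_2 [THREADS] [KMER]
correct_reads() {
    local r1="$1" r2="$2" threads="${3:-4}" kmer="${4:-41}"
    # Filter reads with ambiguous base calls; --pe-mode 1 treats the
    # two input files as the two ends of a read pair.
    run sga preprocess --pe-mode 1 -o reads.pp.fastq "$r1" "$r2"
    # Build the FM-index of the preprocessed reads.
    run sga index -t "$threads" --no-reverse reads.pp.fastq
    # Correct base-calling errors using low-frequency k-mers.
    run sga correct -k "$kmer" -t "$threads" -o reads.ec.fastq reads.pp.fastq
}

correct_reads sample_1.fastq sample_2.fastq 4 41
```

Running the script as-is only prints the three commands; setting `DRY_RUN=0` after `module load sga` would execute them and leave the corrected reads in `reads.ec.fastq`.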

Contig Assembly

An FM-index of the corrected sequence reads is constructed. Duplicate reads, and low-quality reads after correction, are found and discarded with the sga filter subprogram. For large genomes, the sga fm-merge program can be used to merge together reads that can be unambiguously assembled. sga overlap computes the structure of the string graph and contigs are built using sga assemble.
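The contig stage described above can be sketched as four commands. This is a minimal outline under stated assumptions: the intermediate file names, the overlap length, and the thread count are placeholders, the `run` helper and `DRY_RUN` variable are hypothetical, and the name of the graph file written by `sga overlap` is assumed to derive from the input prefix. Check `sga <subcommand> --help` before relying on any flag.

```shell
#!/usr/bin/env bash
# Sketch of the SGA contig-assembly stage (assumed file names and parameters).
# With DRY_RUN=1 (the default) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# assemble_contigs CORRECTED_READS [THREADS] [MIN_OVERLAP]
assemble_contigs() {
    local reads="$1" threads="${2:-4}" min_overlap="${3:-55}"
    # Index the corrected reads.
    run sga index -t "$threads" "$reads"
    # Discard duplicate and low-quality reads.
    run sga filter -t "$threads" -o reads.filter.pass.fa "$reads"
    # Compute the string graph (assumed output: reads.filter.pass.asqg.gz).
    run sga overlap -m "$min_overlap" -t "$threads" reads.filter.pass.fa
    # Build contigs from the graph; output files use the prefix "assemble".
    run sga assemble -m "$min_overlap" -o assemble reads.filter.pass.asqg.gz
}

assemble_contigs reads.ec.fastq 4 55
```

For large genomes, an `sga fm-merge` step would sit between filtering and overlap computation; it is left out here to keep the sketch short.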

Scaffolding

The scaffolding module of sga begins by re-aligning reads to the contigs built in the previous step. The copy number of each contig, and distances between contigs, are estimated from the resulting BAM files and used as input to sga scaffold. The output of sga scaffold is passed to sga scaffold2fasta which produces a FASTA file of the resulting scaffold sequences.
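The last two steps of scaffolding can be sketched as follows. The sketch assumes the distance estimates (`libPE.de`) and copy-number estimates (`libPE.astat`) have already been derived from the BAM files; those file names, the minimum contig length, and the `run` helper with its `DRY_RUN` variable are all hypothetical. The flags are given as best recalled from the SGA example scripts and should be verified with `sga scaffold --help` and `sga scaffold2fasta --help`.

```shell
#!/usr/bin/env bash
# Sketch of the final SGA scaffolding steps (assumed file names and parameters).
# With DRY_RUN=1 (the default) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_scaffolds CONTIGS DISTANCE_EST ASTAT [MIN_LENGTH]
build_scaffolds() {
    local contigs="$1" de="$2" astat="$3" min_len="${4:-200}"
    # Link contigs using paired-end distance estimates and copy-number estimates.
    run sga scaffold -m "$min_len" --pe "$de" -a "$astat" -o scaffolds.scaf "$contigs"
    # Convert the scaffold description into FASTA sequences.
    run sga scaffold2fasta -m "$min_len" -f "$contigs" -o scaffolds.fa scaffolds.scaf
}

build_scaffolds assemble-contigs.fa libPE.de libPE.astat 200
```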


Info

Module Name

sga

Last Updated On

08/29/2023

Support Level

Secondary Support

Software Access Level

Open Access

Home Page

https://github.com/jts/sga/wiki/SGA-Design

Documentation

General Linux

SGA is available via the modules system:

module load sga

The source directory contains examples of real assemblies using SGA. You should read these scripts or, better, download the data for one of the smaller genomes (the C. elegans data set is recommended) and run the example yourself. This will help you understand the SGA pipeline so you can run the assembler effectively on your own data.

To access the example files, run:

cd $SGA_EXAMPLES
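A small helper can make it obvious when the module has not been loaded yet. The function name `list_sga_examples` is hypothetical; the only thing taken from this page is the `SGA_EXAMPLES` variable, which is assumed to be set by `module load sga`.

```shell
#!/usr/bin/env bash
# list_sga_examples: print the contents of the SGA examples directory,
# or report an error if SGA_EXAMPLES is unset or not a directory.
list_sga_examples() {
    if [ -z "${SGA_EXAMPLES:-}" ] || [ ! -d "$SGA_EXAMPLES" ]; then
        echo "SGA_EXAMPLES is not set; run 'module load sga' first" >&2
        return 1
    fi
    ls "$SGA_EXAMPLES"
}
```

After `module load sga`, calling `list_sga_examples` lists the example scripts, which can then be copied to a working directory and adapted.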

Agate Modules

Default

0.10.13

Other Modules

0.10.13, 20130314

Mangi Modules

Default

0.10.13

Other Modules

0.10.13, 20130314

Mesabi Modules

Default

0.10.13

Other Modules

0.10.13, 20130314