ABySS is a de novo sequence assembler intended for short paired-end reads and genomes of all sizes.
Please cite our papers.
If you have the Conda package manager (Linux, MacOS) installed, run:
conda install -c bioconda abyss
Or you can install ABySS in a dedicated environment:
conda create -n abyss-env
conda activate abyss-env
conda install -c bioconda abyss
If you have the Homebrew package manager (Linux, MacOS) installed, run:
brew install abyss
On Windows, install Windows Subsystem for Linux, from which you can run the Conda or Homebrew installation.
These can be installed through Conda:
conda install -c bioconda arcs tigmint
Or Homebrew:
brew install brewsci/bio/arcs brewsci/bio/links-scaffolder
Conda:
conda install -c bioconda samtools
conda install -c conda-forge pigz zsh
Homebrew:
brew install pigz samtools zsh
When compiling ABySS from source the following tools are required:
ABySS requires a C++ compiler that supports OpenMP such as GCC.
The following libraries are required:
Conda:
conda install -c conda-forge boost openmpi
conda install -c bioconda google-sparsehash btllib
It is also helpful to install the compilers Conda package, which automatically passes the correct compiler flags to use the available Conda packages:
conda install -c conda-forge compilers
Homebrew:
brew install boost open-mpi google-sparsehash
Compiling ABySS with Boost 1.51.0 or 1.52.0 fails because those versions contain a bug. Later versions of Boost compile without error.
To compile, run the following:
./autogen.sh
mkdir build
cd build
../configure --prefix=/path/to/abyss
make
make install
You may also pass the following flags to the configure script:
--with-boost=PATH
--with-mpi=PATH
--with-sqlite=PATH
--with-sparsehash=PATH
--with-btllib=PATH
Where PATH is the path to the directory containing the corresponding
dependencies. This should only be necessary if configure
doesn’t find the dependencies by default. If you are using Conda, PATH
would be the path to the Conda installation. SQLite and MPI are optional
dependencies.
The above steps install ABySS at the provided path, in this case /path/to/abyss. Not specifying --prefix would install in /usr/local, which requires sudo privileges when running make install.
ABySS requires a modern compiler such as GCC 6 or greater. If you have multiple versions of GCC installed, you can specify a different compiler:
../configure CC=gcc-10 CXX=g++-10
While OpenMPI is assumed by default you can switch to LAM/MPI or MPICH using:
../configure --enable-lammpi
../configure --enable-mpich
The default maximum k-mer size is 192. It may be decreased to reduce memory usage, or increased, at compile time. This value must be a multiple of 32 (e.g. 32, 64, 96, 128):
../configure --enable-maxk=160
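As a rough sketch of the multiple-of-32 rule, the following shell function rounds a desired k up to the nearest multiple of 32, which could then be passed to --enable-maxk. The function name maxk_for is hypothetical and not part of ABySS.

```shell
# maxk_for K: smallest multiple of 32 that is >= K.
# Hypothetical helper for illustration; not shipped with ABySS.
maxk_for() {
    echo $(( ($1 + 31) / 32 * 32 ))
}

maxk_for 160   # prints 160
maxk_for 200   # prints 224
```

For example, ../configure --enable-maxk=$(maxk_for 200) would configure a build that accepts k up to 224.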
If you encounter compiler warnings that are not critical, you can allow the compilation to continue:
../configure --disable-werror
To run ABySS, its executables should be found in your PATH environment variable. If you installed ABySS in /opt/abyss, add /opt/abyss/bin to your PATH:
PATH=/opt/abyss/bin:$PATH
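A quick way to confirm that the PATH change took effect is to look the executables up with command -v. The sketch below uses a hypothetical helper, check_tools, and checks only three of the program names mentioned in this document; extend the list as needed.

```shell
# check_tools NAME...: report which of the named programs are missing from PATH.
# Returns non-zero if any program is missing. Hypothetical helper, not part of ABySS.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found in PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_tools abyss-pe ABYSS abyss-fac; then
    echo "ABySS executables found"
else
    echo "add the ABySS bin directory to PATH first"
fi
```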
ABySS stores temporary files in TMPDIR, which is /tmp by default on most systems. If your default temporary disk volume is too small, set TMPDIR to a larger volume, such as /var/tmp or your home directory.
export TMPDIR=/var/tmp
The recommended mode of running ABySS is the Bloom filter mode. Specifying the Bloom filter memory budget with the B parameter enables this mode, which can reduce memory consumption ten-fold compared to the MPI mode. B may be specified with the unit suffixes ‘k’ (kilobytes), ‘M’ (megabytes), or ‘G’ (gigabytes). If no unit is specified, bytes are assumed. Internally, the Bloom filter assembler splits the memory budget in two: most of it (B * 8/9) goes to a counting Bloom filter, and the remainder (B/9) goes to another Bloom filter that tracks k-mers that have previously been included in contigs.
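To make the B * 8/9 and B/9 fractions concrete, here is a small arithmetic sketch. It only mirrors the fractions stated above; the function name bloom_split is hypothetical, and the real allocator may round differently.

```shell
# bloom_split B_MB: print the two shares of a budget of B_MB megabytes.
# Uses integer division, so the second share absorbs any rounding remainder.
bloom_split() {
    counting=$(( $1 * 8 / 9 ))
    tracking=$(( $1 - counting ))
    echo "counting=${counting}M tracking=${tracking}M"
}

bloom_split 9000   # prints counting=8000M tracking=1000M
```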
A good value for B depends on a number of factors, but primarily on the genome being assembled. A general guideline is:
* P. glauca (~20Gbp): B=500G
* H. sapiens (~3.1Gbp): B=50G
* C. elegans (~101Mbp): B=2G
For other genome sizes, the value for B can be interpolated. Note that there is no downside to using a larger than necessary B value, except for the memory required. To make sure you have selected a correct B value, inspect the standard error log of the assembly process and ensure that the reported FPR value under Counting Bloom filter stats is 5% or less. This requires verbosity level 1, set with the v=-v option.
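One possible reading of "interpolated" is linear interpolation between the two nearest guideline points. That reading is an assumption, as is the helper name interp_b. The sketch below interpolates between the C. elegans and H. sapiens points and prints B in gigabytes.

```shell
# interp_b SIZE_MBP: linearly interpolate B (in G) for a genome of SIZE_MBP Mbp,
# between the guideline points (101 Mbp, 2G) and (3100 Mbp, 50G).
# Hypothetical helper; linear interpolation is an assumption, not an ABySS rule.
interp_b() {
    awk -v g="$1" 'BEGIN {
        x0 = 101; y0 = 2; x1 = 3100; y1 = 50
        printf "%.0f\n", y0 + (g - x0) * (y1 - y0) / (x1 - x0)
    }'
}

interp_b 1000   # estimate for a genome of about 1 Gbp
```

Since a larger B costs only memory, rounding the estimate up is the safer direction. The FPR reported in the log remains the actual check.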
The MPI mode is legacy and we do not recommend running ABySS with it. To run ABySS in the MPI mode, you need to specify the np parameter, which specifies the number of processes to use for the parallel MPI job. Without any MPI configuration, this will allow you to use multiple cores on a single machine. To use multiple machines for assembly, you must create a hostfile for mpirun, which is described in the mpirun man page.
Do not run mpirun -np 8 abyss-pe. To run ABySS with 8 processes, use abyss-pe np=8. The abyss-pe driver script will start the MPI process, like so: mpirun -np 8 ABYSS-P.
The paired-end assembly stage is multithreaded, but must run on a single machine. The number of threads to use may be specified with the parameter j. The default value for j is the value of np.
wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.3.4/test-data.tar.gz
tar xzvf test-data.tar.gz
abyss-pe k=25 name=test B=1G \
in='test-data/reads1.fastq test-data/reads2.fastq'
Calculate assembly contiguity statistics:
abyss-fac test-unitigs.fa test-contigs.fa test-scaffolds.fa
To assemble paired reads in two files named reads1.fa and reads2.fa into contigs in a file named ecoli-contigs.fa, run the command:
abyss-pe name=ecoli k=96 B=2G in='reads1.fa reads2.fa'
The parameter in specifies the input files to read, which may be in FASTA, FASTQ, qseq, export, SRA, SAM or BAM format, may be compressed with gz, bz2 or xz, and may be tarred. The assembled contigs will be stored in ${name}-contigs.fa and the scaffolds will be stored in ${name}-scaffolds.fa.
A pair of reads must be named with the suffixes /1 and /2 to identify the first and second read, or the reads may be named identically. The paired reads may be in separate files or interleaved in a single file.
Reads without mates should be placed in a file specified by the parameter se (single-end). Reads without mates in the paired-end files will slow down the paired-end assembler considerably during the abyss-fixmate stage.
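Before assembling, it can help to check which reads in a pair of FASTQ files have no mate, so they can be moved to the se input. The sketch below is one way to do that under stated assumptions: the files are uncompressed FASTQ with exactly four lines per record, and names follow the /1 and /2 convention described above. The function names read_names and unpaired_names are hypothetical.

```shell
# read_names FILE: print one read name per 4-line FASTQ record, without the
# leading "@", any comment after the first whitespace, or a trailing /1 or /2.
read_names() {
    awk 'NR % 4 == 1 { name = substr($1, 2); sub(/\/[12]$/, "", name); print name }' "$1"
}

# unpaired_names FILE1 FILE2: print names that occur in only one of the two files.
unpaired_names() {
    tmp1="${TMPDIR:-/tmp}/names1.$$"
    tmp2="${TMPDIR:-/tmp}/names2.$$"
    read_names "$1" | sort > "$tmp1"
    read_names "$2" | sort > "$tmp2"
    # comm -3 drops names common to both files; tr removes its column indent.
    comm -3 "$tmp1" "$tmp2" | tr -d '\t'
    rm -f "$tmp1" "$tmp2"
}

# Usage: unpaired_names reads1.fastq reads2.fastq
```

An empty result suggests every read has a mate. Any names printed are candidates for a separate single-end file.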
The distribution of fragment sizes of each library is calculated empirically by aligning paired reads to the contigs produced by the single-end assembler, and the distribution is stored in a file with the extension .hist, such as ecoli-3.hist. The N50 of the single-end assembly must be well over the fragment size to obtain an accurate empirical distribution.
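As a quick sanity check on a .hist file, the mean fragment size can be computed with awk. This sketch assumes each line holds a fragment size followed by a count, separated by whitespace; that layout is an assumption about the file format, and hist_mean is a hypothetical helper.

```shell
# hist_mean FILE: count-weighted mean of column 1, assuming each line is
# "<fragment size> <count>". Prints nothing for an empty file.
hist_mean() {
    awk '{ sum += $1 * $2; n += $2 } END { if (n > 0) printf "%.1f\n", sum / n }' "$1"
}

# Usage: hist_mean ecoli-3.hist
```

A mean far from the expected library insert size is a hint that the library was specified incorrectly or that the single-end contigs are too short.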
Here’s an example scenario of assembling a data set with two different fragment libraries and single-end reads. Note that the names of the libraries (pea and peb) are arbitrary.
* Library pea has reads in two files, pea_1.fa and pea_2.fa.
* Library peb has reads in two files, peb_1.fa and peb_2.fa.
* Single-end reads are stored in two files, se1.fa and se2.fa.
The command line to assemble this example data set is:
abyss-pe k=96 B=2G name=ecoli lib='pea peb' \
pea='pea_1.fa pea_2.fa' peb='peb_1.fa peb_2.fa' \
se='se1.fa se2.fa'
The empirical distribution of fragment sizes will be stored in two files named pea-3.hist and peb-3.hist. These files may be plotted to check that the empirical distribution agrees with the expected distribution. The assembled contigs will be stored in ${name}-contigs.fa and the scaffolds will be stored in ${name}-scaffolds.fa.
Long-distance mate-pair libraries may be used to scaffold an assembly. Specify the names of the mate-pair libraries using the parameter mp. The scaffolds will be stored in the file ${name}-scaffolds.fa. Here’s an example of assembling a data set with two paired-end libraries and two mate-pair libraries. Note that the names of the libraries (pea, peb, mpc, mpd) are arbitrary.
abyss-pe k=96 B=2G name=ecoli lib='pea peb' mp='mpc mpd' \
pea='pea_1.fa pea_2.fa' peb='peb_1.fa peb_2.fa' \
mpc='mpc_1.fa mpc_2.fa' mpd='mpd_1.fa mpd_2.fa'
The mate-pair libraries are used only for scaffolding and do not contribute towards the consensus sequence.
ABySS can scaffold using linked reads from 10x Genomics Chromium. The barcodes must first be extracted from the read sequences and added to the BX:Z tag of the FASTQ header, typically using the longranger basic command of Long Ranger or EMA preproc. The linked reads are used to correct assembly errors, which requires Tigmint. The linked reads are also used for scaffolding, which requires ARCS. See Dependencies for installation instructions.
ABySS can combine paired-end, mate-pair, and linked-read libraries. The pe and lr libraries will be used to build the de Bruijn graph. The mp libraries will be used for paired-end/mate-pair scaffolding. The lr libraries will be used for misassembly correction using Tigmint and scaffolding using ARCS.
abyss-pe k=96 B=2G name=hsapiens \
pe='pea' pea='pea.fastq.gz' \
mp='mpa' mpa='mpa.fastq.gz' \
lr='lra' lra='lra.fastq.gz'
ABySS performs better with a mixture of paired-end, mate-pair, and linked reads, but it is possible to assemble only linked reads using ABySS, though this mode of operation is experimental.
abyss-pe k=96 name=hsapiens lr='lra' lra='lra.fastq.gz'
Long sequences such as RNA-Seq contigs can be used to rescaffold an assembly. Sequences are aligned using BWA-MEM to the assembled scaffolds. Additional scaffolds are then formed between scaffolds that can be linked unambiguously when considering all BWA-MEM alignments.
Similar to scaffolding, the names of the datasets can be specified with the long parameter. These scaffolds will be stored in the file ${name}-long-scaffs.fa. The following is an example of an assembly with PET, MPET and an RNA-Seq assembly. Note that the names of the libraries are arbitrary.
abyss-pe k=96 B=2G name=ecoli lib='pe1 pe2' mp='mp1 mp2' long='longa' \
pe1='pe1_1.fa pe1_2.fa' pe2='pe2_1.fa pe2_2.fa' \
mp1='mp1_1.fa mp1_2.fa' mp2='mp2_1.fa mp2_2.fa' \
longa='longa.fa'
Assemblies may be performed using a paired de Bruijn graph instead of a standard de Bruijn graph. In paired de Bruijn graph mode, ABySS uses k-mer pairs in place of k-mers, where each k-mer pair consists of two equal-size k-mers separated by a fixed distance. A k-mer pair is functionally similar to a large k-mer spanning the breadth of the k-mer pair, but uses less memory because the sequence in the gap is not stored. To assemble using paired de Bruijn graph mode, specify both the individual k-mer size (K) and the k-mer pair span (k). For example, to assemble E. coli with an individual k-mer size of 16 and a k-mer pair span of 96:
abyss-pe name=ecoli K=16 k=96 in='reads1.fa reads2.fa'
In this example, the size of the intervening gap between k-mer pairs is 64 bp (96 - 2*16). Note that the k parameter takes on a new meaning in paired de Bruijn graph mode. k indicates the k-mer pair span in paired de Bruijn graph mode (when K is set), whereas k indicates the k-mer size in standard de Bruijn graph mode (when K is not set).
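The gap arithmetic above generalizes to any choice of k and K. A minimal sketch, with a hypothetical function name:

```shell
# kmer_pair_gap k K: gap in bp between the two k-mers of a pair, where
# k is the k-mer pair span and K is the individual k-mer size.
kmer_pair_gap() {
    echo $(( $1 - 2 * $2 ))
}

kmer_pair_gap 96 16   # prints 64
```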
Strand-specific RNA-Seq libraries can be assembled such that the resulting unitigs, contigs and scaffolds are oriented correctly with respect to the original transcripts that were sequenced. In order to run ABySS in strand-specific mode, the SS parameter must be used as in the following example:
abyss-pe name=SS-RNA B=2G k=96 in='reads1.fa reads2.fa' SS=--SS
The expected orientation for the read sequences with respect to the original RNA is RF, i.e. the first read in a read pair is always in reverse orientation.
It is standard practice when running ABySS to run multiple assemblies to find the optimal values for the k and kc parameters. k determines the k-mer size in the de Bruijn graph, and kc is the k-mer minimum coverage multiplicity cutoff, which filters out erroneous k-mers. The range in which k should be tested depends on the read size and read coverage.
As a rough indicator, for 2x150bp reads and 40x coverage, the right k value is often around 70 to 90. For 2x250bp reads and 40x coverage, the right value might be around 110 to 140.
For kc, 2 is most often a good value, but it can go as high as 4.
The following shell snippet will assemble for kc values 2 and 3, and every eighth value of k from 50 to 90. At the end, we calculate the contiguity statistics as a proxy for identifying the optimal assembly. Other metrics can be used, as needed.
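# One output directory per (k, kc) combination.
for kc in 2 3; do
	for k in $(seq 50 8 90); do
		mkdir -p k${k}-kc${kc}
		# -C runs the assembly inside that directory, hence ../reads.fa
		abyss-pe -C k${k}-kc${kc} name=ecoli B=2G k=$k kc=$kc in=../reads.fa
	done
done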
abyss-fac k*/ecoli-scaffolds.fa
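To pick a candidate from the abyss-fac table automatically, the rows can be ranked by N50. The sketch below assumes the abyss-fac output is a tab-separated table whose header row includes columns labelled N50 and name; that layout is an assumption, and best_n50 is a hypothetical helper. N50 is only one proxy for assembly quality.

```shell
# best_n50: read an abyss-fac style table on stdin and print the name and N50
# of the row with the largest N50, separated by a tab.
# Columns are located by header label; nothing is printed if they are absent.
best_n50() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("N50" in col) || !("name" in col)) exit 2
            next
        }
        ($(col["N50"]) + 0) > best { best = $(col["N50"]) + 0; name = $(col["name"]) }
        END { if (name != "") print name "\t" best }
    '
}

# Usage: abyss-fac k*/ecoli-scaffolds.fa | best_n50
```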
The default maximum value for k is 192. This limit may be changed at compile time using the --enable-maxk option of configure. It may be decreased to 32 to decrease memory usage or increased to larger values.
ABySS integrates well with cluster job schedulers, such as SGE (Sun Grid Engine). For example, to submit an array of jobs to assemble every eighth value of k between 50 and 90 using 64 processes for each job:
qsub -N ecoli -pe openmpi 64 -t 50-90:8 \
<<<'mkdir k$SGE_TASK_ID && abyss-pe -C k$SGE_TASK_ID in=/data/reads.fa'
ABySS supports the use of DIDA (Distributed Indexing Dispatched Alignment), an MPI-based framework for computing sequence alignments in parallel across multiple machines. The DIDA software must be separately downloaded and installed from http://www.bcgsc.ca/platform/bioinfo/software/dida. In comparison to the standard ABySS alignment stages, which are constrained to a single machine, DIDA offers improved performance and the ability to scale to larger targets. Please see the DIDA section of the abyss-pe man page (in the doc subdirectory) for details on usage.
Parameters of the driver script, abyss-pe:
* a: maximum number of branches of a bubble [2]
* b: maximum length of a bubble (bp) [""]
* B: Bloom filter size (e.g. “100M”)
* c: minimum mean k-mer coverage of a unitig [sqrt(median)]
* d: allowable error of a distance estimate (bp) [6]
* e: minimum erosion k-mer coverage [round(sqrt(median))]
* E: minimum erosion k-mer coverage per strand [1 if sqrt(median) > 2 else 0]
* G: genome size, used to calculate NG50
* H: number of Bloom filter hash functions [4]
* j: number of threads [2]
* k: size of k-mer (when K is not set) or the span of a k-mer pair (when K is set)
* kc: minimum k-mer count threshold for Bloom filter assembly [2]
* K: the length of a single k-mer in a k-mer pair (bp)
* l: minimum alignment length of a read (bp) [40]
* m: minimum overlap of two unitigs (bp) [0 (interpreted as k - 1) if mp is provided or if k<=50, otherwise 50]
* n: minimum number of pairs required for building contigs [10]
* N: minimum number of pairs required for building scaffolds [15-20]
* np: number of MPI processes [1]
* p: minimum sequence identity of a bubble [0.9]
* q: minimum base quality [3]
* s: minimum unitig size required for building contigs (bp) [1000]
* S: minimum contig size required for building scaffolds (bp) [100-5000]
* t: maximum length of blunt contigs to trim [k]
* v: use v=-v for verbose logging, v=-vv for extra verbose
* x: spaced seed (Bloom filter assembly only)
* lr_s: minimum contig size required for building scaffolds with linked reads (bp) [S]
* lr_n: minimum number of barcodes required for building scaffolds with linked reads [10]
abyss-pe configuration variables may be set on the command line or from the environment, for example with export k=96. abyss-pe may therefore pick up variables from your environment that you did not intend it to use, which can cause trouble. To troubleshoot that situation, use the abyss-pe env command to print the values of all the abyss-pe configuration variables:
abyss-pe env [options]
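A complementary check that does not require running abyss-pe is to look for exported variables whose names collide with abyss-pe parameters. The sketch below hard-codes the parameter names from the list above; the helper name abyss_env_vars is hypothetical.

```shell
# abyss_env_vars: list exported variables whose names match abyss-pe parameters.
# The name list is copied by hand from the parameter list; extend it as needed.
abyss_env_vars() {
    env | grep -E '^(a|b|B|c|d|e|E|G|H|j|k|kc|K|l|m|n|N|np|p|q|s|S|t|v|x|lr_s|lr_n)=' || true
}

abyss_env_vars
```

Any line printed is a variable that abyss-pe could inherit. Use unset to remove the ones you did not intend to set.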
abyss-pe is a driver script implemented as a Makefile. Any option of make may be used with abyss-pe. Particularly useful options are:
* -C dir, --directory=dir: Change to the directory dir and store the results there.
* -n, --dry-run: Print the commands that would be executed, but do not execute them.
abyss-pe uses the following programs, which must be found in your PATH:
* ABYSS: de Bruijn graph assembler
* ABYSS-P: parallel (MPI) de Bruijn graph assembler
* AdjList: find overlapping sequences
* DistanceEst: estimate the distance between sequences
* MergeContigs: merge sequences
* MergePaths: merge overlapping paths
* Overlap: find overlapping sequences using paired-end reads
* PathConsensus: find a consensus sequence of ambiguous paths
* PathOverlap: find overlapping paths
* PopBubbles: remove bubbles from the sequence overlap graph
* SimpleGraph: find paths through the overlap graph
* abyss-fac: calculate assembly contiguity statistics
* abyss-filtergraph: remove shim contigs from the overlap graph
* abyss-fixmate: fill the paired-end fields of SAM alignments
* abyss-map: map reads to a reference sequence
* abyss-scaffold: scaffold contigs using distance estimates
* abyss-todot: convert graph formats and merge graphs
* abyss-rresolver: resolve repeats using short reads
This flowchart shows the ABySS assembly pipeline and its intermediate files.
ABySS has built-in support for SQLite, which lets it export log values into a SQLite file and/or .csv files at runtime.
Database parameters of abyss-pe:
* db: path to SQLite repository file [$(name).sqlite]
* species: name of species to archive [ ]
* strain: name of strain to archive [ ]
* library: name of library to archive [ ]
For example, to export data of species ‘Ecoli’, strain ‘O121’ and library ‘pea’ into your SQLite database repository named ‘/abyss/test.sqlite’:
abyss-pe db=/abyss/test.sqlite species=Ecoli strain=O121 library=pea [other options]
Found in your path:
* abyss-db-txt: create a flat file showing the entire repository at a glance
* abyss-db-csv: create .csv table(s) from the repository
Usage:
abyss-db-txt /your/repository
abyss-db-csv /your/repository program(s)
For example,
abyss-db-txt repo.sqlite
abyss-db-csv repo.sqlite DistanceEst
abyss-db-csv repo.sqlite DistanceEst abyss-scaffold
abyss-db-csv repo.sqlite --all
Shaun D Jackman, Benjamin P Vandervalk, Hamid Mohamadi, Justin Chu, Sarah Yeo, S Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, René L Warren, and Inanc Birol (2017). ABySS 2.0: Resource-efficient assembly of large genomes using a Bloom filter. Genome research, 27(5), 768-777. doi:10.1101/gr.214346.116
Simpson, Jared T., Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven JM Jones, and Inanc Birol (2009). ABySS: a parallel assembler for short read sequence data. Genome research, 19(6), 1117-1123. doi:10.1101/gr.089532.108
Vladimir Nikolić, Amirhossein Afshinfard, Justin Chu, Johnathan Wong, Lauren Coombe, Ka Ming Nip, René L. Warren & Inanç Birol (2022). RResolver: efficient short-read repeat resolution within ABySS. BMC Bioinformatics 23, Article number: 246 (2022). doi:10.1186/s12859-022-04790-z
Robertson, Gordon, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D. Jackman, Karen Mungall, et al (2010). De novo assembly and analysis of RNA-seq data. Nature methods, 7(11), 909-912. doi:10.1038/nmeth.1517
Nielsen, Cydney B., Shaun D. Jackman, Inanc Birol, and Steven JM Jones (2009). ABySS-Explorer: visualizing genome sequence assemblies. IEEE Transactions on Visualization and Computer Graphics, 15(6), 881-888. doi:10.1109/TVCG.2009.116
Subscribe to the ABySS mailing list, abyss-users@googlegroups.com.
For questions related to transcriptome assembly, contact the Trans-ABySS mailing list, trans-abyss@googlegroups.com.
Supervised by Dr. Inanc Birol.
Copyright 2016 Canada’s Michael Smith Genome Sciences Centre