Metagenome Assembly Workflow (v1.0.1)

Metagenome assembly workflow dependencies

Workflow Overview

This workflow takes paired-end Illumina reads in interleaved format and performs error correction with bbcms (BBTools), which also splits the interleaved file into two FASTQ files for downstream tasks. The corrected reads are assembled with metaSPAdes. After assembly, the reads are mapped back to the contigs with bbmap (BBTools) to obtain coverage information. The .wdl (Workflow Description Language) file includes five tasks: bbcms, assy, create_agp, read_mapping_pairs, and make_output.

  1. The bbcms task takes interleaved FASTQ input, performs error correction, and splits the interleaved FASTQ into two paired-end FASTQ files for the following tasks.
  2. The assy task performs the metaSPAdes assembly.
  3. The create_agp task takes the contigs and scaffolds produced by metaSPAdes, renames the FASTA headers, and generates an AGP file that describes the assembly.
  4. The read_mapping_pairs task maps reads back to the final assembly to generate coverage information.
  5. The final make_output task adds all output files into the specified directory.

Workflow Availability

The workflow uses the Docker images listed here to run all third-party tools. The workflow is available on GitHub: https://github.com/microbiomedata/metaAssembly; the corresponding Docker images are available on DockerHub: https://hub.docker.com/r/microbiomedata/spades and https://hub.docker.com/r/microbiomedata/bbtools

Requirements for Execution

(recommendations are in bold)

  • WDL-capable Workflow Execution Tool (Cromwell)
  • Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)

Hardware Requirements

  • Memory: >40 GB RAM

The memory requirement depends on the complexity of the input. The required memory can be estimated from the number of kmers in the input file with the following equation:

predicted_mem = (kmers * 2.962e-08 + 1.630e+01) * 1.1 (in GB)

Note

The kmers variable for the equation above can be obtained using the kmercountmulti.sh script from BBTools.

kmercountmulti.sh -k=31 in=your.read.fq.gz
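
The estimate can also be computed programmatically. The following is a minimal sketch of the equation above; the function name predicted_mem_gb and the example kmer count are illustrative assumptions, not part of the workflow.

```python
def predicted_mem_gb(kmers: float) -> float:
    """Estimated memory in GB for a given kmer count, per the equation above."""
    return (kmers * 2.962e-08 + 1.630e+01) * 1.1


if __name__ == "__main__":
    # Hypothetical kmer count, used only for illustration.
    example_kmers = 1_000_000_000
    print(f"{predicted_mem_gb(example_kmers):.1f} GB")  # prints "50.5 GB"
```

The result can be rounded up and passed as the optional memory setting in the input JSON.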

Workflow Dependencies

Third party software: (This is included in the Docker image.)

Sample dataset(s)

Zymobiomics mock-community DNA control (SRR7877884); this dataset is ~4 GB.

Note

If the input data is paired-end data, it must be in interleaved format. The following command will interleave the files, using the above dataset as an example:

paste <(zcat SRR7877884_1.fastq.gz | paste - - - -) <(zcat SRR7877884_2.fastq.gz | paste - - - -) | tr '\t' '\n' | gzip -c > SRR7877884-int.fastq.gz
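
Where process substitution is not available, the same interleaving can be sketched in Python. This is an illustrative equivalent of the command above, not part of the workflow; like the shell version, it assumes every FASTQ record occupies exactly four lines. The function name interleave_fastq is hypothetical.

```python
import gzip
from itertools import islice


def interleave_fastq(r1_path, r2_path, out_path):
    """Write records from two gzipped FASTQ files alternately (R1, R2, R1, R2, ...).

    Returns the number of read pairs written.
    """
    pairs = 0
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2, \
            gzip.open(out_path, "wt") as out:
        while True:
            rec1 = list(islice(r1, 4))  # one FASTQ record = 4 lines (assumption)
            rec2 = list(islice(r2, 4))
            if not rec1 and not rec2:
                break  # both inputs exhausted at the same time
            if len(rec1) != 4 or len(rec2) != 4:
                raise ValueError("input files differ in length or are truncated")
            out.writelines(rec1)
            out.writelines(rec2)
            pairs += 1
    return pairs
```

For the dataset above, the call would be interleave_fastq("SRR7877884_1.fastq.gz", "SRR7877884_2.fastq.gz", "SRR7877884-int.fastq.gz"). Unlike the shell pipeline, this sketch raises an error if the two files contain different numbers of records.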

For testing purposes and for the following examples, we used a 10% sub-sample of the above dataset (SRR7877884-int-0.1.fastq.gz). This dataset is already interleaved.

Input

A JSON file containing the following information:

  1. the path to the input FASTQ file (Illumina paired-end interleaved FASTQ); the output of the Reads QC workflow is recommended
  2. the contig prefix for the FASTA header
  3. the output path
  4. memory (optional), e.g. "jgi_metaASM.memory": "105G"
  5. threads (optional), e.g. "jgi_metaASM.threads": "16"

An example input JSON file is shown below:

{
    "jgi_metaASM.input_file": ["/path/to/SRR7877884-int-0.1.fastq.gz"],
    "jgi_metaASM.rename_contig_prefix": "projectID",
    "jgi_metaASM.outdir": "/path/to/SRR7877884-int-0.1_assembly",
    "jgi_metaASM.memory": "105G",
    "jgi_metaASM.threads": "16"
}
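
When many samples are processed, the input JSON can be generated rather than written by hand. The sketch below builds the same keys as the example above; the helper name build_inputs is hypothetical, and stripping whitespace from the paths is a defensive choice of this sketch rather than something the workflow does.

```python
import json


def build_inputs(input_file, contig_prefix, outdir, memory=None, threads=None):
    """Return the workflow inputs as a dict keyed like the example JSON."""
    inputs = {
        # Stray spaces in paths are easy to introduce by copy and paste.
        "jgi_metaASM.input_file": [input_file.strip()],
        "jgi_metaASM.rename_contig_prefix": contig_prefix,
        "jgi_metaASM.outdir": outdir.strip(),
    }
    # memory and threads are optional, so they are only added when given.
    if memory is not None:
        inputs["jgi_metaASM.memory"] = memory
    if threads is not None:
        inputs["jgi_metaASM.threads"] = str(threads)
    return inputs


if __name__ == "__main__":
    cfg = build_inputs(
        "/path/to/SRR7877884-int-0.1.fastq.gz",
        "projectID",
        "/path/to/SRR7877884-int-0.1_assembly",
        memory="105G",
        threads=16,
    )
    print(json.dumps(cfg, indent=4))
```

The printed text can be redirected to a file and passed to the workflow execution tool as the inputs file.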

Output

The output directory will contain four output sub-directories: bbcms, final_assembly, mapping, and spades3. The main output, the assembled contigs, is in final_assembly/assembly_contigs.fasta.

Part of an example output directory tree is shown below:

├── bbcms
│   ├── berkeleylab-jgi-meta-60ade422cd4e
│   ├── counts.metadata.json
│   ├── input.corr.fastq.gz
│   ├── input.corr.left.fastq.gz
│   ├── input.corr.right.fastq.gz
│   ├── readlen.txt
│   └── unique31mer.txt
├── final_assembly
│   ├── assembly.agp
│   ├── assembly_contigs.fasta
│   ├── assembly_scaffolds.fasta
│   └── assembly_scaffolds.legend
├── mapping
│   ├── covstats.txt (mapping_stats.txt)
│   ├── pairedMapped.bam
│   ├── pairedMapped.sam.gz
│   ├── pairedMapped_sorted.bam
│   └── pairedMapped_sorted.bam.bai
└── spades3
    ├── assembly_graph.fastg
    ├── assembly_graph_with_scaffolds.gfa
    ├── contigs.fasta
    ├── contigs.paths
    ├── scaffolds.fasta
    └── scaffolds.paths
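
As a quick sanity check after a run, the presence of the key files from the tree above can be verified with a short script. This is a sketch only: the EXPECTED list is a hand-picked subset copied from the tree, the function name missing_outputs is hypothetical, and file names may need adjusting if your workflow version names them differently.

```python
from pathlib import Path

# Subset of files taken from the example tree; adjust to your workflow version.
EXPECTED = [
    "bbcms/input.corr.fastq.gz",
    "final_assembly/assembly.agp",
    "final_assembly/assembly_contigs.fasta",
    "final_assembly/assembly_scaffolds.fasta",
    "mapping/covstats.txt",
    "mapping/pairedMapped_sorted.bam",
    "spades3/contigs.fasta",
    "spades3/scaffolds.fasta",
]


def missing_outputs(outdir, expected=EXPECTED):
    """Return the expected relative paths that are not regular files under outdir."""
    root = Path(outdir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from missing_outputs("/path/to/SRR7877884-int-0.1_assembly") means every listed file was found.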

The table below lists all output directories and files with their descriptions.

Directory | File Name | Description
bbcms | | error correction result directory
bbcms/berkeleylab-jgi-meta-60ade422cd4e | | directory containing the resource-checking script
bbcms/ | counts.metadata.json | bbcms commands and summary statistics in JSON format
bbcms/ | input.corr.fastq.gz | error-corrected reads in interleaved format
bbcms/ | input.corr.left.fastq.gz | error-corrected forward reads
bbcms/ | input.corr.right.fastq.gz | error-corrected reverse reads
bbcms/ | rc | Cromwell script submit return code
bbcms/ | readlen.txt | error-corrected read statistics
bbcms/ | resources.log | resource-checking log
bbcms/ | script | task run commands
bbcms/ | script.background | Bash script to run script.submit
bbcms/ | script.submit | Cromwell submit commands
bbcms/ | stderr | standard error of the task
bbcms/ | stderr.background | standard error of the Bash script
bbcms/ | stderr.log | standard error from the bbcms command
bbcms/ | stdout | standard output of the task
bbcms/ | stdout.background | standard output of the Bash script
bbcms/ | stdout.log | standard output from the bbcms command
bbcms/ | unique31mer.txt | count of unique kmers, K=31
spades3 | | metaSPAdes assembly result directory
spades3/K33 | | directory containing intermediate files from the run with K=33
spades3/K55 | | directory containing intermediate files from the run with K=55
spades3/K77 | | directory containing intermediate files from the run with K=77
spades3/K99 | | directory containing intermediate files from the run with K=99
spades3/K127 | | directory containing intermediate files from the run with K=127
spades3/misc | | directory containing miscellaneous files
spades3/tmp | | directory for temporary files
spades3/ | assembly_graph.fastg | metaSPAdes assembly graph in FASTG format
spades3/ | assembly_graph_with_scaffolds.gfa | metaSPAdes assembly graph and scaffold paths in GFA 1.0 format
spades3/ | before_rr.fasta | contigs before repeat resolution
spades3/ | contigs.fasta | metaSPAdes resulting contigs
spades3/ | contigs.paths | paths in the assembly graph corresponding to contigs.fasta
spades3/ | dataset.info | internal configuration file
spades3/ | first_pe_contigs.fasta | preliminary contigs from the iterative kmer assembly
spades3/ | input_dataset.yaml | internal YAML data set file
spades3/ | params.txt | information about the SPAdes parameters used in this run
spades3/ | scaffolds.fasta | metaSPAdes resulting scaffolds
spades3/ | scaffolds.paths | paths in the assembly graph corresponding to scaffolds.fasta
spades3/ | spades.log | metaSPAdes log
final_assembly | | create_agp task result directory
final_assembly/berkeleylab-jgi-meta-60ade422cd4e | | directory containing the resource-checking script
final_assembly/ | assembly.agp | AGP-format file describing the assembly
final_assembly/ | assembly_contigs.fna | final assembly contigs in FASTA format
final_assembly/ | assembly_scaffolds.fna | final assembly scaffolds in FASTA format
final_assembly/ | assembly_scaffolds.legend | name mapping file from SPAdes node names to new names
final_assembly/ | rc | Cromwell script submit return code
final_assembly/ | resources.log | resource-checking log
final_assembly/ | script | task run commands
final_assembly/ | script.background | Bash script to run script.submit
final_assembly/ | script.submit | Cromwell submit commands
final_assembly/ | stats.json | assembly statistics in JSON format
final_assembly/ | stderr | standard error of the task
final_assembly/ | stderr.background | standard error of the Bash script
final_assembly/ | stdout | standard output of the task
final_assembly/ | stdout.background | standard output of the Bash script
mapping | | result directory for mapping reads back to the final assembly
mapping/ | covstats.txt | contig coverage information
mapping/ | mapping_stats.txt | contig coverage information (same as covstats.txt)
mapping/ | pairedMapped.bam | reads mapped back to the final assembly, BAM file
mapping/ | pairedMapped.sam.gz | reads mapped back to the final assembly, sam.gz file
mapping/ | pairedMapped_sorted.bam | reads mapped back to the final assembly, sorted BAM file
mapping/ | pairedMapped_sorted.bam.bai | reads mapped back to the final assembly, sorted BAM index file
mapping/ | rc | Cromwell script submit return code
mapping/ | resources.log | resource-checking log
mapping/ | script | task run commands
mapping/ | script.background | Bash script to run script.submit
mapping/ | script.submit | Cromwell submit commands
mapping/ | stderr | standard error of the task
mapping/ | stderr.background | standard error of the Bash script
mapping/ | stdout | standard output of the task
mapping/ | stdout.background | standard output of the Bash script
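
The coverage file in the mapping directory can be summarized with a few lines of Python. The sketch below computes a length-weighted mean coverage across contigs. It assumes that covstats.txt is a tab-delimited file with a header row containing columns named Avg_fold and Length; these column names are an assumption about the bbmap output and should be checked against your own file. The function name mean_coverage is hypothetical.

```python
import csv


def mean_coverage(covstats_path):
    """Length-weighted mean of per-contig average fold coverage.

    Assumes a tab-delimited file with header columns "Avg_fold" and "Length".
    """
    total_len = 0
    weighted = 0.0
    with open(covstats_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            length = int(row["Length"])
            weighted += float(row["Avg_fold"]) * length
            total_len += length
    # An empty table yields 0.0 rather than a division error.
    return weighted / total_len if total_len else 0.0
```

Weighting by contig length keeps many short, low-coverage contigs from dominating the summary.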

Version History

  • 1.0.1 (release date 02/16/2021; previous versions: 1.0.0)

Point of contact