Software Tools
FastQC: Tool to evaluate various statistics of reads in a DNA sequencing project (available at the BaseSpace cloud platform or as a standalone application). FastQC allows one to check the number of reads and their quality (see FAQ: What are Quality Scores? and FAQ: How Do We Assess Sequence Quality?).
SPAdes: Genome assembler (available at BaseSpace or as a standalone tool). The SPAdes pipeline consists of two stages: BayesHammer (error correction of reads) and SPAdes (iterative genome assembly). Rather than using a single k-mer size for constructing the de Bruijn graph, SPAdes iterates through a specified set of k-mer lengths. SPAdes takes single (unpaired) reads or paired reads in FASTQ format as input (see section Formats).
Quality Assessment Tool for Genome Assembly (QUAST): Tool for assessing the quality of genome assemblies. QUAST works in two modes (depending on whether the reference genome is known or unknown) and computes various metrics. You can either download QUAST or use its online version.
Prokka: Prokaryote annotation toolbox (available at BaseSpace). Prokka predicts and annotates genes in bacterial genomes using the GenBank format (see section Formats). The Prokka Genome Annotation BaseSpace App can use FASTA assembly files generated by SPAdes.
RNAmmer: Tool to predict ribosomal RNA genes (see FAQ: What is 16S ribosomal RNA?) in sequenced genomes or assembled contigs. RNAmmer uses Hidden Markov Models based on alignments from a dataset of known rRNA sequences.
ResFinder: Tool for identification of acquired antimicrobial resistance genes in bacterial genomes, i.e., genes that make bacteria less treatable with antimicrobial medications such as antibiotics. ResFinder uses BLAST to find genes in a bacterial genome that are similar to genes in a database of known resistance genes.
Basic Local Alignment Search Tool (BLAST): A set of programs for aligning query sequences against sequences in a target database. Introduction to BLAST can be found here. The National Center for Biotechnology Information (NCBI) provides an access point for these web-based tools.
Mauve: Genome alignment tool that constructs multiple genome alignments in the presence of large-scale evolutionary events such as rearrangements and large insertions/deletions. It can also align annotated genomes and thus enables analysis of the functional consequences of such rearrangements. For example, when Mauve detects a large insertion, it also reports which genes the inserted segment contains.
STEAK: Tool for detecting transposable elements and retroviruses in high-throughput sequencing data (see https://www.ncbi.nlm.nih.gov/pubmed/28948042).

Databases
Sequence Read Archive (SRA): A repository of DNA sequencing data that stores reads from various sequencing platforms. When a reference genome is available, some SRA entries also contain alignment information, i.e., the coordinates of the reads mapped to the reference genome. You can either download data from the SRA database or import data using BaseSpace.
RefSeq: A non-redundant, annotated set of sequences, including genomic DNA, transcripts, and proteins. This database provides a reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, and expression studies. We will use RefSeq to find strains related to the pathogenic strain in order to investigate the origin of the pathogenic factors.
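
To make the SPAdes entry above more concrete, the following is a minimal Python sketch of building a de Bruijn graph from reads for each k in a set of k-mer lengths. It is not the SPAdes algorithm: SPAdes combines information across k values and corrects errors, which this toy example does not attempt. The function names (de_bruijn_graph, graphs_for_k_values) and the two toy reads are assumptions made for illustration only.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: each k-mer in a read contributes one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def graphs_for_k_values(reads, k_values):
    """Hypothetical helper: build one independent graph per k-mer length."""
    return {k: de_bruijn_graph(reads, k) for k in k_values}


# Toy reads (illustrative data, not from a real sequencing run).
reads = ["ACGTAC", "GTACGA"]
graphs = graphs_for_k_values(reads, [3, 4])

for k, graph in graphs.items():
    edge_count = sum(len(targets) for targets in graph.values())
    print(f"k={k}: {len(graph)} nodes with outgoing edges, {edge_count} edges")
```

Smaller k values tend to give a more connected but more tangled graph, while larger k values resolve more repeats at the cost of breaking the graph where coverage is low; iterating over several k values is one way to balance the two.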