Usage
The program includes both a command line and graphical interface. Both will produce the same results.
Command Line
This will run the entire process based on a set of options given at the terminal:
-h, --help show this help message and exit
-i FILE, --input FILE
input folder(s)
-e LABELSEP, --labelsep LABELSEP
symbol to split the sample labels on
-r FILE, --reference FILE
reference genome filename
-S SPECIES, --species SPECIES
set the species reference genome, overrides -r
-g FILE, --genbank_file FILE
annotation file, optional
-t THREADS, --threads THREADS
cpu threads to use
-w, --overwrite overwrite intermediate files
-T, --trim whether to trim fastq files
-U, --unmapped whether to save unmapped reads
-Q QUALITY, --quality QUALITY
right trim quality, default 25
-f FILTERS, --filters FILTERS
variant calling post-filters
-m MASK, --mask MASK mask regions from a bed file
-c, --custom apply custom filters
-p PLATFORM, --platform PLATFORM
sequencing platform, change to ont if using oxford nanopore
-a ALIGNER, --aligner ALIGNER
aligner to use, bwa, subread, bowtie or minimap2
-b, --buildtree whether to build a phylogenetic tree, requires RaXML
-N BOOTSTRAPS, --bootstraps BOOTSTRAPS
number of bootstraps to build tree
-o FILE, --outdir FILE
Results folder
-q, --qc Get version
-d, --dummy Check samples but don't run
-x, --test Test run
-v, --version Get version
Examples:
Call with your own reference fasta file:
snipgenie -r reference.fa -i data_files -o results
Use an in built species genome as reference. This will also supply an annotation file. The current options are Mbovis-AF212297, MTB-H37Rv, MAP-K10, M.smegmatis-MC2155:
snipgenie -S Mbovis-AF212297 -i data_files -o results
Provide more than one folder:
snipgenie -r reference.fa -i data_files1 -i data_files2 -o results
Provide an annotation (genbank format) for consequence calling:
snipgenie -r reference.fa -g reference.gb -i data_files -o results
Add your own filters and provide threads:
snipgenie -r reference.fa -i data_files -t 8 -o results` \
-f 'QUAL>=40 && INFO/DP>=20 && MQ>40'
Aligners
You can use any one of the following aligners: bwa, subread, bowtie or minimap2. These should be present on your system, unless using the Windows version. Note that for oxford nanopore reads you should use minimap2 and specify the platform as ‘ont’.
Mask file
You can selectively mask snp sites such as those contained in transposons or repetitive regions from being included in the output. You need to provide a bed file with the following columns: chromosome name, start and end coordinates of the regions. There is currently a built-in mask file used for M.bovis and of you select this genome as reference using the –species option it will be used automatically. Example:
LT708304.1 105359 106751
LT708304.1 131419 132910
LT708304.1 149570 151187
LT708304.1 306201 307872
Inputs
You can provide single folder with all the files in one place or multiple folders. Folders are searched recursively for inputs with extensions *.f*q.gz. So be careful you don’t have files in the folders you don’t want included. The following file structure will load both sets of files if you provide the parent folder as input. You can also provide multiple separate folders using -i as shown above.
For example if you provide -i data with the following structure:
data/
├── ERR1588781
│ ├── ERR1588781_1.fq.gz
│ └── ERR1588781_2.fq.gz
└── ERR1588785
├── ERR1588785_1.fastq.gz
└── ERR1588785_2.fastq.gz
Filenames are parsed and a sample name is extracted for each pair (if paired end). This is simply done by splitting on the _ symbol. So a file called /path/13-11594_S85_L001-4_R1_001.fastq.gz will be given a sample name 13-11594. As long as the sample names are unique this is ok. If you had a file names like A_2_L001-4_R1_001, A_3_L001-4_R1_001 you should split on ‘-’ instead. You can specify this in the labelsep option. The workflow won’t run unless sample names are unique.
Outputs
These files will be saved to the output folder when the workflow is finished:
raw.bcf - unfiltered output from bcftools mpileup, not overwritten by default
calls.vcf - unfiltered variant calls
filtered.vcf.gz - filtered vcf from all variant calls
snps.vcf.gz - snps only calls, used to make the core alignment
indels.vcf.gz - indels only, made from filtered calls
core.fa - fasta alignment from core snps, can be used to make a phylogeny
core.txt - text table of core snps
csq.tsv - consequence calls (if genbank provided)
csq_indels.tsv - consequence calls for indels
csq.matrix - matrix of consequence calls
snpdist.csv - comma separated distance matrix using snps
samples.csv - summary table of samples
RAxML_bipartitions.variants - ML tree if RAxML was used, optional
tree.newick - tree with SNPs branch lengths, if RAxMl used
Use from Python
You can run a workflow from within Python by importing the snipgenie package and invoking the WorkFlow class. You need to provide the options in a dictionary with the same keywords as the command line. Notice in this example we are loading files from two folders.
from sngenie import app
args = {'threads':8, 'outdir': 'results', 'labelsep':'-',
'input':['/my/folder/',
'/my/other/folder'],
'reference': None, 'overwrite':False}
W = app.WorkFlow(**args)
st = W.setup()
W.run()