PanACoTA.subcommands package
Subpackage containing the main script used to launch each available subcommand.
prepare module
Subcommand to prepare a dataset:
Download all genomes of a given species from RefSeq
Filter them using L90 and number-of-contigs thresholds
Remove genomes that are too close to or too far from the others, using Mash
@author Amandine PERRIN August 2019
- PanACoTA.subcommands.prepare.build_parser(parser)
Method to create a parser for command-line options
- Parameters:
- parser: argparse.ArgumentParser
parser to configure in order to extract command-line arguments
- PanACoTA.subcommands.prepare.check_args(parser, args)
Check that arguments given to parser are as expected.
- Parameters:
- parser: argparse.ArgumentParser
The parser used to parse command-line
- args: argparse.Namespace
Parsed arguments
- Returns:
- argparse.Namespace or None
The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.
- PanACoTA.subcommands.prepare.main(cmd, ncbi_species_name, ncbi_species_taxid, ncbi_taxid, ncbi_strains, levels, ncbi_section, outdir, tmp_dir, threads, norefseq, db_dir, only_mash, info_file, l90, nbcont, cutn, min_dist, max_dist, verbose, quiet)
Main method, constructing the draft dataset for the given species
verbosity:
default 0: stdout contains INFO, stderr contains ERROR, .log contains INFO and more, .log.err contains WARNING and more
1: same as 0 + WARNING in stderr
2: same as 1 + DETAILS in stdout + DETAILS in .log.details
>=15: same as 2 + DEBUG in stdout + .log.debug created with everything from INFO to DEBUG
- Parameters:
- cmd: str
command line used to launch this program
- ncbi_species_name: str
name of species to download, as given by NCBI
- ncbi_species_taxid: int
species taxid given in NCBI
- ncbi_taxid: int
NCBI taxid (sub-species)
- ncbi_strains: str
specific strains to download
- levels: str
Level of assembly to download. Choice between ‘all’, ‘complete’, ‘chromosome’, ‘scaffold’, ‘contig’. Default is ‘all’
- outdir: str
path to output directory (where created database will be saved).
- tmp_dir: str
Path to directory where tmp files are saved (sequences split at each row of 5 ‘N’)
- threads: int
max number of threads to use
- norefseq: bool
True if the user does not want to download the database again
- db_dir: str
Name of the folder where already downloaded fasta files are saved.
- only_mash: bool
True if the user already has the database and the quality of each genome (L90, #contigs etc.)
- info_file: str
File containing information on QC if it was already run before (columns to_annotate, gsize, nb_conts and L90).
- l90: int
Max L90 allowed to keep a genome
- nbcont: int
Max number of contigs allowed to keep a genome
- cutn: int
cut each time there are at least ‘cutn’ N in a row. Don’t cut if equal to 0
- min_dist: float
lower limit of distance between 2 genomes to keep them
- max_dist: float
upper limit of distance between 2 genomes to keep them (default is 0.06)
- verbose: int
verbosity:
default 0: stdout contains INFO, stderr contains ERROR, .log contains INFO and more, .log.err contains WARNING and more
1: same as 0 + WARNING in stderr
2: same as 1 + DETAILS in stdout + DETAILS in .log.details
>=15: same as 2 + DEBUG in stdout + .log.debug created with everything from INFO to DEBUG
- quiet: bool
True if nothing must be sent to stdout/stderr, False otherwise
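The effect of the min_dist and max_dist thresholds can be pictured with a minimal sketch. It assumes that pairwise distances are already available in a dict (in practice they would come from Mash), that genomes are visited from best to worst quality, and that a genome is kept only if its distance to every genome already kept lies between min_dist and max_dist. The function name filter_by_distance, the data layout and the min_dist value used here are hypothetical and are not part of the PanACoTA API.

```python
def filter_by_distance(genomes, dist, min_dist=1e-4, max_dist=0.06):
    """Keep genomes whose distance to every kept genome is within bounds.

    genomes: names sorted from best to worst quality (assumption).
    dist: dict mapping frozenset({a, b}) to a distance between 0 and 1.
    min_dist is an illustrative value; max_dist is the documented default.
    """
    kept = []
    for genome in genomes:
        in_range = all(
            min_dist <= dist[frozenset((genome, ref))] <= max_dist
            for ref in kept
        )
        if in_range:
            kept.append(genome)
    return kept


# Made-up distances for four hypothetical genomes.
distances = {
    frozenset(("g1", "g2")): 0.00001,  # nearly identical to g1
    frozenset(("g1", "g3")): 0.02,
    frozenset(("g1", "g4")): 0.2,      # too far from g1
    frozenset(("g2", "g3")): 0.02,
    frozenset(("g2", "g4")): 0.2,
    frozenset(("g3", "g4")): 0.2,
}
print(filter_by_distance(["g1", "g2", "g3", "g4"], distances))
```

Here g2 is dropped as too close to g1 and g4 as too far from it, so only g1 and g3 remain.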
- PanACoTA.subcommands.prepare.main_from_parse(arguments)
Call main function from the arguments given by parser
- Parameters:
- arguments: argparse.Namespace
result of argparse parsing of all arguments in command line
- PanACoTA.subcommands.prepare.parse(parser, argu)
Parse arguments given to parser
- Parameters:
- parser: argparse.ArgumentParser
the parser used
- argu: [str]
command-line given by user, to parse using parser
- Returns:
- argparse.Namespace
Parsed arguments
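Each subcommand module exposes the same set of functions: build_parser, check_args (where present), parse, main and main_from_parse. The sketch below shows how such functions can fit together using only the standard argparse module. It is a toy module written for illustration: the options -o and --threads, the program name demo and the body of main are made up, and it does not reproduce the real option set of any PanACoTA subcommand.

```python
import argparse


def build_parser(parser):
    """Configure the given parser (hypothetical options, for illustration)."""
    parser.add_argument("-o", dest="outdir", default=".", help="output directory")
    parser.add_argument("--threads", dest="threads", type=int, default=1)


def check_args(parser, args):
    """Validate parsed arguments; parser.error() prints a message and exits."""
    if args.threads < 1:
        parser.error("--threads must be a positive integer")
    return args


def parse(parser, argu):
    """Parse a list of command-line tokens, then validate the result."""
    return check_args(parser, parser.parse_args(argu))


def main(cmd, outdir, threads):
    """Stand-in for the real work: report what was requested."""
    return f"{cmd}: outdir={outdir}, threads={threads}"


def main_from_parse(arguments):
    """Unpack the Namespace and forward its fields to main()."""
    return main("demo", arguments.outdir, arguments.threads)


parser = argparse.ArgumentParser(prog="demo")
build_parser(parser)
args = parse(parser, ["-o", "results", "--threads", "4"])
print(main_from_parse(args))
```

Keeping main independent of argparse means it can also be called directly from Python with explicit arguments, which is what main_from_parse does on behalf of the command line.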
annotate module
annotate is a subcommand of PanACoTA
It is a pipeline to do quality control and annotate genomes. Steps are:
optional: find rows of at least ‘n’ N (default n=5), and cut into a new contig at this point
for each genome, calculate L90 and number of contigs (after cutting at rows of ‘n’ N, if used)
keep only genomes with:
L90 <= x (default x = 100)
#contig <= y (default y = 999)
rename those genomes and their contigs, with strain numbers assigned in order of quality (L90 and #contig)
annotate kept genomes with prokka (default) or only prodigal
format annotated genomes to gembase format
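The quality-control part of the steps above (cutting at rows of N, computing L90 and the number of contigs, applying the thresholds) can be sketched as follows. This is a minimal illustration, not the PanACoTA implementation: the function names are hypothetical, sequences are plain strings, and L90 is taken to mean the smallest number of contigs, longest first, needed to cover at least 90% of the genome length.

```python
import re


def split_at_n(sequence, cutn=5):
    """Split a sequence into contigs at each run of at least cutn 'N'."""
    if cutn == 0:
        return [sequence]  # cutting disabled
    parts = re.split(f"N{{{cutn},}}", sequence)
    return [part for part in parts if part]


def l90(contig_lengths):
    """Smallest number of contigs (longest first) covering >= 90% of the total."""
    total = sum(contig_lengths)
    covered = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        if 10 * covered >= 9 * total:  # integer form of covered >= 0.9 * total
            return count
    return 0  # no contigs at all


def passes_qc(contig_lengths, max_l90=100, max_contigs=999):
    """Apply the default thresholds quoted above (L90 <= 100, #contig <= 999)."""
    return l90(contig_lengths) <= max_l90 and len(contig_lengths) <= max_contigs


contigs = split_at_n("ACGT" + "N" * 5 + "GGCC" + "N" * 3 + "TT")
print(contigs)                         # the run of 3 N is too short to cut
print(l90([50, 30, 10, 5, 5]))         # contigs needed to reach 90 of 100 bases
print(passes_qc([50, 30, 10, 5, 5]))
```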
Input:
list_file: list of genome filenames to annotate. This file contains 1 line per genome. Each line contains the name(s) of the (multi-)fasta file(s) corresponding to the genome (separated by a space if there are several fasta files for the genome). After quality control, selected genomes will be named as follows: <gen_spe>.<date>.<strain>, with:
<gen_spe>: 4 alphanumeric characters. Usually it corresponds to the 2 first letters of genus, and 2 first letters of species, like ESCO for Escherichia coli.
<date>: date at which the genome was downloaded, formatted as MMYY (M=Month, Y=Year)
<strain>: the strain number of the genome in the species, ordered by quality.
Default values for <gen_spe> and <date> are given as input (see after). However, if some genomes do not have the same date and/or genus/species as the others, you can add this information for those genomes in the list file. Fasta filenames and information are separated by ‘::’. <gen_spe> is given after the ‘::’, and <date> is preceded by a ‘.’. Here is an example:
genome1.fasta
genome2_ch1.fna genome2_pl.fst
genome3.fst genome3_plasmid.fst :: name
genome4.fna genome4.p1.fna genome4.p2.fna :: name.
genome5.fasta :: name.date
genome6.chromo.fst genome6.pl.fst :: .date
species: with 4 alphanumeric characters, used to rename genomes (except those whose species name is specified in the list file)
date: optional. By default, takes the current date. Used to rename genomes (except those whose date is specified in the list file)
dbpath: path to folder containing all multi-fasta sequences of genomes
respath: path to folder where outputs must be saved (folders Genes, Replicons, Proteins, LSTINFO, gff3 and LSTINFO_dataset.lst file)
tmppath: optional. Path where tmp files must be saved. Default is respath/tmp_files
annotepath: optional. Path where prokka/prodigal output folders for all genomes must be saved. Default is respath/tmp_files
threads: number of threads that can be used (default 1)
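The list_file line format described above (fasta filenames, then an optional ‘::’ followed by a species name and/or a date preceded by ‘.’) can be read with a few lines of Python. The sketch below is only an illustration of that format: parse_list_line is a hypothetical helper, not a PanACoTA function, and the default values ESCO and 0419 are made up.

```python
def parse_list_line(line, default_name, default_date):
    """Return (fasta files, species name, date) for one list_file line."""
    files_part, _, info = line.partition("::")
    files = files_part.split()
    # Text before the first '.' is the species name, text after it is the date.
    name, _, date = info.strip().partition(".")
    return files, name or default_name, date or default_date


example = [
    "genome1.fasta",
    "genome3.fst genome3_plasmid.fst :: SAEN",
    "genome4.fna genome4.p1.fna genome4.p2.fna :: SAEN.",
    "genome5.fasta :: SAEN.1015",
    "genome6.chromo.fst genome6.pl.fst :: .1015",
]
for line in example:
    print(parse_list_line(line, default_name="ESCO", default_date="0419"))
```

Lines without ‘::’, or with only one of the two fields, fall back to the defaults for the missing information.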
Output:
In your given respath, you will find 5 folders:
LSTINFO (information on each genome, with gene annotations)
Genes (nuc. gene sequences)
Proteins (aa protein sequences)
Replicons (input sequences but with formatted headers)
gff3 (information on genes in gff3 format)
In your given tmppath folder, folders with prokka/prodigal results will be created for each input genome (1 folder per genome, called <genome_name>-[prokka, prodigal]Res). If errors are generated during the prokka/prodigal step, you can look at the log file to see what went wrong (<genome_name>-[prokka, prodigal].log).
In your given respath, a file called annotate-genomes-<list_file>.log will be generated. It contains all logs.
In your given respath, a file called annotate-genomes-<list_file>.log.err will be generated, containing information on errors and warnings that occurred: problems during annotation (hence no formatting step run), and problems during the formatting step. If this file is empty, then the annotation and formatting steps finished without any problem for all genomes.
In your given respath, you will find a file called LSTINFO-<list_file>.lst with information on all genomes: gembase_name, original_name, genome_size, L90, nb_contigs
In your given respath, you will find a file called discarded-<list_file>.lst with information on genomes that were discarded (and hence not annotated) because of the L90 and/or nb_contig threshold: original_name, genome_size, L90, nb_contigs
In your given respath, you will find 2 png files: QC_L90-<list_file>.png and QC_nb-contigs-<list_file>.png, containing the histograms of L90 and nb_contigs values for all genomes, with a vertical red line representing the limit applied here.
Requirements:
in prokka/prodigal results, all genes are called <whatever>_<number> -> the number will be kept.
the numbers of the genes annotated by prokka/prodigal are in increasing order in tbl, faa and ffn files
genome names given to prokka/prodigal should not end with ‘_<number>’. Ideally, they should always have the same format: <spegenus>.<date>.<strain_number>, but they can have another format, as long as they don’t end with ‘_<number>’, which is the format of a gene name.
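The naming constraint above (a genome name must not end with ‘_<number>’) is easy to check before launching an annotation. The following sketch assumes that <number> means one or more decimal digits; is_valid_genome_name is a hypothetical helper, not part of PanACoTA.

```python
import re

# A trailing underscore followed by digits is the format of a gene name.
GENE_SUFFIX = re.compile(r"_\d+$")


def is_valid_genome_name(name):
    """True if the name does not end with '_<number>'."""
    return GENE_SUFFIX.search(name) is None


for candidate in ["ESCO.0817.00001", "my_genome", "genome_12"]:
    print(candidate, is_valid_genome_name(candidate))
```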
@author gem April 2017
- PanACoTA.subcommands.annotate.build_parser(parser)
Method to create a parser for command-line options
- Parameters:
- parser: argparse.ArgumentParser
The parser to configure
- PanACoTA.subcommands.annotate.check_args(parser, args)
Check that arguments given to parser are as expected.
- Parameters:
- parser: argparse.ArgumentParser
The parser used to parse command-line
- args: argparse.Namespace
Parsed arguments
- Returns:
- argparse.Namespace or None
The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.
- PanACoTA.subcommands.annotate.main(cmd, list_file, db_path, res_dir, name, date, l90=100, nbcont=999, cutn=5, threads=1, force=False, qc_only=False, from_info=None, tmp_dir=None, res_annot_dir=None, verbose=0, quiet=False, prodigal_only=False, small=False)
Main method, doing all steps:
1. analyze genomes (nb contigs, L90, rows of N…)
2. keep only genomes with ‘good’ (according to user thresholds) L90 and nb_contigs
3. rename genomes, with strain numbers assigned in decreasing order of quality
4. annotate genomes with prokka or only prodigal
5. format annotated genomes
With option ‘-Q’, stops after step 2. With option ‘--info <genome_info file name>’, starts at step 2
verbosity:
default 0: stdout contains INFO, stderr contains ERROR.
1: stdout contains INFO, stderr contains WARNING and ERROR
2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR
>=15: Add DEBUG in stdout
- Parameters:
- cmd: str
command line used to launch this program
- list_file: str
file containing the list of genome files, 1 genome per line, separated by a space if a genome is split in several fasta files. This file can also specify date and/or species information, according to the format described in documentation.
- db_path: str
Path to the folder containing all the fasta files which will be annotated
- res_dir: str
Path to the folder which will contain result folders and files
- name: str
4 alphanumeric characters, describing the species (for example ESCO). Used by default if no species name is given in list_file line.
- date: str
4 alphanumeric characters, defining the default date, for strains where it is not specified in the list_file
- l90: int
Max L90 allowed to keep a genome
- nbcont: int
Max number of contigs allowed to keep a genome
- cutn: int
cut each time there are at least cutn ‘N’ in a row. Don’t cut if equal to 0
- threads: int
max number of threads to use
- force: bool
If True, overwrite previous results, if False keep what is already calculated
- qc_only: bool
If True, do only quality control, if False, also do annotation
- from_info: str
File containing information on genomes and their quality information (from prepare step)
- tmp_dir: str or None
Path to folder where tmp files must be saved. None to use the default tmp folder
- res_annot_dir: str or None
Path to the folder containing the prokka/prodigal result folders for the genomes. None to use the default prokka/prodigal folder
- verbose: int
verbosity:
default (0): info in stdout, error and more in stderr
1: add warnings in stderr
2: like 1 + add DETAIL to stdout (by default only INFO)
>=15: add debug to stdout
- quiet: bool
True if nothing must be sent to stdout/stderr, False otherwise
- prodigal_only: bool
True -> run only prodigal. False -> run prokka
- small: bool
True -> use -p meta option with prodigal
- Returns:
- (genomes, kept_genomes, skipped, skipped_format): tuple
with:
genomes: dict with all genomes in list_file: {genome: [gembase_name, path_split_gembase, gsize, nbcont, L90]}
kept_genomes: dict with all genomes kept for annotation (same format as genomes)
skipped: list of genomes skipped because they had a problem in annotation step
skipped_format: list of genomes skipped because they had a problem in format step
- PanACoTA.subcommands.annotate.main_from_parse(arguments)
Call main function from the arguments given by parser
- Parameters:
- arguments: argparse.Namespace
result of argparse parsing of all arguments in command line
- PanACoTA.subcommands.annotate.parse(parser, argu)
Parse arguments given to parser
- Parameters:
- parser: argparse.ArgumentParser
the parser used
- argu: [str]
command-line given by user, to parse using parser
- Returns:
- argparse.Namespace
Parsed arguments
pangenome module
pangenome is a subcommand of PanACoTA
@author gem May 2017
- PanACoTA.subcommands.pangenome.build_parser(parser)
Method to create a parser for command-line options
- Parameters:
- parser: argparse.ArgumentParser
parser to configure in order to extract command-line arguments
- PanACoTA.subcommands.pangenome.main(cmd, lstinfo, name, dbpath, min_id, outdir, clust_mode, spe_dir, threads, outfile=None, verbose=0, quiet=False)
Main method, doing all steps:
concatenate all protein files
create database as ffindex
cluster all proteins
convert to pangenome file
create summary and matrix of pangenome
- Parameters:
- lstinfo: str
file with name of genomes to consider for pan in the first column, without extension. Other columns are ignored. The first column header must be ‘gembase_name’
- name: str
name given to the dataset. For example, ESCO44 for 44 Escherichia coli genomes.
- dbpath: str
path to the folder containing all protein files (files named after the genome name given in lstinfo + “.prt”)
- min_id: float
Minimum percentage of identity between 2 proteins to put them in the same family
- outdir: str
path to folder which will contain pangenome results and tmp files
- clust_mode: [0, 1, 2]
0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’
- spe_dir: str or None
path to the folder where concatenated bank of proteins must be saved. None to use the same folder as protein files
- threads: int
Max number of threads to use
- outfile: str or None
Name of the pangenome. None to use the default name
- verbose: int
verbosity:
default 0: stdout contains INFO, stderr contains ERROR.
1: stdout contains INFO, stderr contains WARNING and ERROR
2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR
>=15: Add DEBUG in stdout
- quiet: bool
True if nothing must be sent to stdout/stderr, False otherwise
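The first step listed for this function, concatenating all protein files, follows directly from the lstinfo and dbpath descriptions above: read the genome names from the first column of lstinfo, then append each <name>.prt file found in dbpath to a single bank. The sketch below illustrates that idea under those assumptions only. The helper names, the output file name bank.prt and the demo data are hypothetical, and the real implementation may differ.

```python
import tempfile
from pathlib import Path


def read_genome_names(lstinfo):
    """Return the first-column values of lstinfo, skipping the header line."""
    names = []
    with open(lstinfo) as handle:
        for line in handle:
            fields = line.split()
            if fields and fields[0] != "gembase_name":
                names.append(fields[0])
    return names


def concatenate_proteins(lstinfo, dbpath, outfile):
    """Write the content of every <name>.prt file into a single bank file."""
    names = read_genome_names(lstinfo)
    with open(outfile, "w") as out:
        for name in names:
            text = (Path(dbpath) / f"{name}.prt").read_text()
            # Make sure two files are never glued together on one line.
            out.write(text if text.endswith("\n") else text + "\n")
    return names


# Small self-contained demo with made-up genomes and proteins.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "ESCO.0419.00001.prt").write_text(">ESCO.0419.00001.0001\nMKV\n")
    (tmp / "ESCO.0419.00002.prt").write_text(">ESCO.0419.00002.0001\nMRT")
    (tmp / "LSTINFO.lst").write_text(
        "gembase_name\torig_name\n"
        "ESCO.0419.00001\tg1.fasta\n"
        "ESCO.0419.00002\tg2.fasta\n"
    )
    names = concatenate_proteins(tmp / "LSTINFO.lst", tmp, tmp / "bank.prt")
    bank_text = (tmp / "bank.prt").read_text()
    print(names)
    print(bank_text)
```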
- PanACoTA.subcommands.pangenome.main_from_parse(args)
Call main function from the arguments given by parser
- Parameters:
- args: argparse.Namespace
result of argparse parsing of all arguments in command line
- PanACoTA.subcommands.pangenome.parse(parser, argu)
Parse arguments given to parser
- Parameters:
- parser: argparse.ArgumentParser
Parser to use to parse command-line arguments
- argu: [str]
command-line given
- Returns:
- argparse.Namespace or None
The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.
corepers module
corepers is a subcommand of PanACoTA
Generate a core genome (families containing 1 member in all genomes of the dataset) or a persistent genome (families with a given % of genomes having exactly 1 member). You can also allow:
mixed families: exactly 1 member in the given percentage of genomes, but the other genomes can contain 0 or several members
multi families: allow several members in any genome.
@author gem June 2017
- PanACoTA.subcommands.corepers.build_parser(parser)
Method to create a parser for command-line options
- Parameters:
- parser: argparse.ArgumentParser
parser to configure in order to extract command-line arguments
- PanACoTA.subcommands.corepers.check_args(parser, args)
Check that arguments given to parser are as expected.
- Parameters:
- parser: argparse.ArgumentParser
The parser used to parse command-line
- args: argparse.Namespace
Parsed arguments
- Returns:
- argparse.Namespace or None
The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.
- PanACoTA.subcommands.corepers.get_info(tol, multi, mixed, floor)
Get a string corresponding to the information that will be given to logger.
- Parameters:
- tol: float
minimum fraction of genomes that must be present in a family to consider it persistent (between 0 and 1)
- multi: bool
True if multigenic families are allowed, False otherwise
- mixed: bool
True if mixed families are allowed, False otherwise
- floor: bool
Require at least floor(nb_genomes*tol) genomes if True, ceil(nb_genomes*tol) if False
- Returns:
- str
Information to give to logger
- PanACoTA.subcommands.corepers.main(cmd, pangenome, tol, multi, mixed, outputdir, lstinfo_file, floor, verbose, quiet)
Read pangenome and deduce Persistent genome according to the user criteria
- Parameters:
- pangenome: str
file containing pangenome
- tol: float
minimum fraction of genomes that must be present in a family to consider it persistent (between 0 and 1)
- multi: bool
True if multigenic families are allowed, False otherwise
- mixed: bool
True if mixed families are allowed, False otherwise
- outputdir: str or None
Specific directory for the generated persistent genome. If not given, pangenome directory is used.
- lstinfo_file: str
list of genomes to include in the core/persistent genome. If not given, include all genomes of the pangenome
- floor: bool
Require at least floor(nb_genomes*tol) genomes if True, ceil(nb_genomes*tol) if False
- verbose: int
verbosity:
default 0: stdout contains INFO, stderr contains ERROR.
1: stdout contains INFO, stderr contains WARNING and ERROR
2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR
>=15: Add DEBUG in stdout
- quiet: bool
True if nothing must be sent to stdout/stderr, False otherwise
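The tol, multi, mixed and floor parameters interact as described in the module introduction. The sketch below is one reading of those rules, not the PanACoTA implementation: a family is represented simply by the number of members it has in each genome, the function names are hypothetical, and the strict case is assumed to require that genomes without exactly 1 member have none.

```python
import math


def min_genomes(nb_genomes, tol, floor=False):
    """Minimum number of genomes required: floor or ceil of nb_genomes*tol."""
    value = nb_genomes * tol
    return math.floor(value) if floor else math.ceil(value)


def is_persistent(members_per_genome, tol=1.0, multi=False, mixed=False, floor=False):
    """Decide whether a family belongs to the core/persistent genome.

    members_per_genome: one integer per genome of the dataset, giving the
    number of members of the family found in that genome.
    """
    needed = min_genomes(len(members_per_genome), tol, floor)
    present = sum(1 for n in members_per_genome if n >= 1)
    single = sum(1 for n in members_per_genome if n == 1)
    if multi:
        # Several members allowed in any genome: only presence counts.
        return present >= needed
    if mixed:
        # Enough genomes with exactly 1 member; the others are unconstrained.
        return single >= needed
    # Strict: enough genomes with exactly 1 member, none with several.
    return single >= needed and single == present


print(is_persistent([1, 1, 1, 1]))                          # core family
print(is_persistent([1, 1, 1, 0], tol=0.75))                # strict persistent
print(is_persistent([1, 1, 1, 2], tol=0.75))                # rejected if strict
print(is_persistent([1, 1, 1, 2], tol=0.75, mixed=True))    # accepted as mixed
print(min_genomes(4, 0.9, floor=True), min_genomes(4, 0.9)) # floor vs ceil
```

With tol=1 and both flags False, the function reduces to the core genome definition: exactly 1 member in every genome.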
- PanACoTA.subcommands.corepers.main_from_parse(args)
Call main function from the arguments given by parser
- Parameters:
- args: argparse.Namespace
result of argparse parsing of all arguments in command line
- PanACoTA.subcommands.corepers.parse(parser, argu)
Parse arguments given to parser
- Parameters:
- parser: argparse.ArgumentParser
the parser used
- argu: [str]
command-line given by user, to parse using parser
- Returns:
- argparse.Namespace
Parsed arguments
align module
align is a subcommand of PanACoTA
@author gem June 2017
- PanACoTA.subcommands.align.build_parser(parser)
Method to create a parser for command-line options
- Parameters:
- parser: argparse.ArgumentParser
parser to configure in order to extract command-line arguments
- PanACoTA.subcommands.align.main(cmd, corepers, list_genomes, dname, dbpath, outdir, prot_ali, threads, force, verbose=0, quiet=False)
Align given core genome families
- Parameters:
- corepers: str
File containing persistent genome families
- list_genomes: str
File containing the list of all genomes in the dataset. Only first column is considered.
- dname: str
Dataset name, used to name output files
- dbpath: str
path to the directory containing ‘Proteins’ and ‘Genes’ folders
- outdir: str
path to the directory where output files must be saved
- prot_ali: bool
If True, also output the aa alignment of the concatenation of persistent proteins
- threads: int
Max number of threads to use
- force: bool
Remove existing output files and rerun everything if True.
- verbose: int
verbosity:
default 0: stdout contains INFO, stderr contains ERROR.
1: stdout contains INFO, stderr contains WARNING and ERROR
2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR
>=15: Add DEBUG in stdout
- quiet: bool
True if nothing must be sent to stdout/stderr, False otherwise
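The verbose and quiet parameters, which have the same meaning in every subcommand, can be pictured as a mapping from the verbosity value to the lowest level shown on each stream. The sketch below uses the standard logging module and covers only stdout and stderr, not the log files. The DETAIL level value, the function names and the logger name are assumptions made for this example and are not taken from PanACoTA.

```python
import logging
import sys

DETAIL = 15  # hypothetical custom level, between DEBUG (10) and INFO (20)
logging.addLevelName(DETAIL, "DETAIL")


def stream_levels(verbose=0, quiet=False):
    """Return (lowest stdout level, lowest stderr level), or None if quiet."""
    if quiet:
        return None
    if verbose >= 15:
        stdout_min = logging.DEBUG
    elif verbose >= 2:
        stdout_min = DETAIL
    else:
        stdout_min = logging.INFO
    stderr_min = logging.WARNING if verbose >= 1 else logging.ERROR
    return stdout_min, stderr_min


class BelowWarning(logging.Filter):
    """Keep WARNING and above out of stdout."""

    def filter(self, record):
        return record.levelno < logging.WARNING


def make_logger(name, verbose=0, quiet=False):
    """Build a logger whose stream handlers follow the verbosity rules."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    levels = stream_levels(verbose, quiet)
    if levels is None:
        logger.addHandler(logging.NullHandler())  # nothing on stdout/stderr
        return logger
    stdout_min, stderr_min = levels
    out = logging.StreamHandler(sys.stdout)
    out.setLevel(stdout_min)
    out.addFilter(BelowWarning())
    err = logging.StreamHandler(sys.stderr)
    err.setLevel(stderr_min)
    logger.addHandler(out)
    logger.addHandler(err)
    return logger


log = make_logger("demo", verbose=1)
log.info("sent to stdout")
log.warning("sent to stderr because verbose >= 1")
log.log(DETAIL, "dropped: DETAIL needs verbose >= 2")
```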
- PanACoTA.subcommands.align.main_from_parse(args)
Call main function from the arguments given by parser
- Parameters:
- args: argparse.Namespace
result of argparse parsing of all arguments in command line
- PanACoTA.subcommands.align.parse(parser, argu)
Parse arguments given to parser
- Parameters:
- parser: argparse.ArgumentParser
the parser used
- argu: [str]
command-line given by user, to parse using parser
- Returns:
- argparse.Namespace
Parsed arguments
tree module
tree is a subcommand of PanACoTA
@author gem June 2017
- PanACoTA.subcommands.tree.build_parser(parser)
Method to create a parser for command-line options
- Parameters:
- parser: argparse.ArgumentParser
parser to configure in order to extract command-line arguments
- PanACoTA.subcommands.tree.check_args(parser, args)
Check that arguments given to parser are as expected.
- Parameters:
- parser: argparse.ArgumentParser
The parser used to parse command-line
- args: argparse.Namespace
Parsed arguments
- Returns:
- argparse.Namespace or None
The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.
- PanACoTA.subcommands.tree.main(cmd, align, outdir, soft, model, threads, boot=False, write_boot=False, write_mat=False, memory=False, fast=False, verbose=0, quiet=False)
Infer a phylogenetic tree from an alignment file, with the given software.
- Parameters:
- cmd: str
command used to launch tree module
- align: str
Path to file containing alignments of persistent families grouped by genome
- outdir: str or None
Path to file which will contain the tree inferred
- soft: str
Software to use to infer the phylogenetic tree: one of quicktree, fasttree or fastme
- model: str or None
DNA substitution model chosen by user, None if quicktree used
- threads: int
Maximum number of threads to use
- boot: int or None
Number of bootstraps to compute. None if no bootstrap asked
- write_boot: bool
True if all bootstrap pseudo-trees must be saved into a file, False otherwise
- write_mat: bool
True if distance matrix must be saved, False otherwise
- memory: str
Maximal RAM usage in GB | MB | % - Only for iqtree
- fast: bool
use -fast option with IQtree
- verbose: int
verbosity:
default 0: stdout contains INFO, stderr contains ERROR.
1: stdout contains INFO, stderr contains WARNING and ERROR
2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR
>=15: Add DEBUG in stdout
- quiet: bool
True if nothing must be sent to stdout/stderr, False otherwise
- PanACoTA.subcommands.tree.main_from_parse(args)
Call main function from the arguments given by parser
- Parameters:
- args: argparse.Namespace
result of argparse parsing of all arguments in command line
- PanACoTA.subcommands.tree.parse(parser, argu)
Parse arguments given to parser
- Parameters:
- parser: argparse.ArgumentParser
the parser used
- argu: [str]
command-line given by user, to parse using parser
- Returns:
- argparse.Namespace
Parsed arguments