`PanACoTA.subcommands` package¶

Subpackage containing the main script used to launch each available subcommand.

`prepare` module¶

Subcommand to prepare a dataset:

Download all genomes of a given species from refseq
Filter them with L90 and number of contigs thresholds
Remove too close/far genomes using Mash
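
The last step can be pictured as a greedy filter over pairwise distances. The following is a minimal sketch, not PanACoTA's actual implementation: it assumes the distances were already computed (for example by Mash) and stored in a dictionary keyed by genome pairs, and that genomes are listed from best to worst quality. The function name `filter_by_distance` and the data layout are hypothetical.

```python
def filter_by_distance(genomes, dist, min_dist, max_dist=0.06):
    """Keep a genome only if it is neither too close to nor too far from
    every genome already kept.

    genomes: list of genome names, assumed sorted from best to worst quality.
    dist: dict mapping frozenset({genome_a, genome_b}) to a distance
          (hypothetical layout, chosen for this illustration).
    """
    kept = []
    for candidate in genomes:
        in_range = all(
            min_dist <= dist[frozenset((candidate, ref))] <= max_dist
            for ref in kept
        )
        if in_range:
            kept.append(candidate)
    return kept


# Illustrative distances only.
example_dist = {
    frozenset(("A", "B")): 0.00001,  # B is nearly identical to A
    frozenset(("A", "C")): 0.01,
    frozenset(("A", "D")): 0.2,      # D is too far from A
    frozenset(("B", "C")): 0.01,
    frozenset(("B", "D")): 0.2,
    frozenset(("C", "D")): 0.2,
}
kept_genomes = filter_by_distance(["A", "B", "C", "D"], example_dist, min_dist=0.0001)
```

Processing genomes in quality order means that, of two near-identical genomes, the better one is kept.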

@author Amandine PERRIN August 2019

PanACoTA.subcommands.prepare.build_parser(parser)¶

Method to create a parser for command-line options

Parameters:

parser: argparse.ArgumentParser: parser to configure in order to extract command-line arguments

PanACoTA.subcommands.prepare.check_args(parser, args)¶

Check that arguments given to parser are as expected.

Parameters:

parser: argparse.ArgumentParser: The parser used to parse command-line
args: argparse.Namespace: Parsed arguments

Returns:

argparse.Namespace or None: The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

PanACoTA.subcommands.prepare.main(cmd, ncbi_species_name, ncbi_species_taxid, ncbi_taxid, ncbi_strains, levels, ncbi_section, outdir, tmp_dir, threads, norefseq, db_dir, only_mash, info_file, l90, nbcont, cutn, min_dist, max_dist, verbose, quiet)¶

Main method, constructing the draft dataset for the given species

verbosity:

default 0: stdout contains INFO, stderr contains ERROR, .log contains INFO and more, .log.err contains WARNING and more
1: same as 0 + WARNING in stderr
2: same as 1 + DETAILS in stdout + DETAILS in .log.details
>=15: same as 2 + DEBUG in stdout + a .log.debug file with everything from INFO to DEBUG

Parameters:

cmd: str

command line used to launch this program

ncbi_species_name: str

name of species to download, as given by NCBI

ncbi_species_taxid: int

species taxid given in NCBI

ncbi_taxid: int

NCBI taxid (sub-species)

ncbi_strains: str

specific strains to download

levels: str

Level of assembly to download. Choice between ‘all’, ‘complete’, ‘chromosome’, ‘scaffold’, ‘contig’. Default is ‘all’

outdir: str

path to output directory (where created database will be saved).

tmp_dir: str

Path to directory where tmp files are saved (sequences split at each row of 5 ‘N’)

threads: int

max number of threads to use

norefseq: bool

True if the user does not want to download the database again

db_dir: str

Name of the folder where already downloaded fasta files are saved.

only_mash: bool

True if the user already has the database and the quality information of each genome (L90, #contigs etc.)

info_file: str

File containing information on QC if it was already run before (columns to_annotate, gsize, nb_conts and L90).

l90: int

Max L90 allowed to keep a genome

nbcont: int

Max number of contigs allowed to keep a genome

cutn: int

cut a sequence each time there are at least ‘cutn’ N in a row. Don’t cut if equal to 0

min_dist: float

lower limit of distance between 2 genomes to keep them

max_dist: float

upper limit of distance between 2 genomes to keep them (default is 0.06)

verbose: int

verbosity:

default 0: stdout contains INFO, stderr contains ERROR, .log contains INFO and more, .log.err contains WARNING and more
1: same as 0 + WARNING in stderr
2: same as 1 + DETAILS in stdout + DETAILS in .log.details
>=15: same as 2 + DEBUG in stdout + a .log.debug file with everything from INFO to DEBUG

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise
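
The stdout/stderr part of the verbosity levels described above can be reproduced with the standard `logging` module. This is only a sketch under stated assumptions, not the code used by PanACoTA: the numeric value of the `DETAIL` level, the `MaxLevel` filter and the `make_handlers` helper are all hypothetical, and the log files (.log, .log.err, ...) are left out.

```python
import logging
import sys

# Hypothetical custom level between DEBUG (10) and INFO (20).
DETAIL = 15
logging.addLevelName(DETAIL, "DETAIL")


class MaxLevel(logging.Filter):
    """Let through only records at or below a given level."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def make_handlers(verbose=0, out=sys.stdout, err=sys.stderr):
    """Build one handler for stdout and one for stderr from a verbosity value."""
    if verbose >= 15:
        out_min = logging.DEBUG
    elif verbose >= 2:
        out_min = DETAIL
    else:
        out_min = logging.INFO
    out_handler = logging.StreamHandler(out)
    out_handler.setLevel(out_min)
    # stdout never receives WARNING or ERROR records.
    out_handler.addFilter(MaxLevel(logging.INFO))

    err_handler = logging.StreamHandler(err)
    err_handler.setLevel(logging.WARNING if verbose >= 1 else logging.ERROR)
    return out_handler, err_handler
```

The logger itself should be set to `logging.DEBUG` so that the handlers alone decide what is shown.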

PanACoTA.subcommands.prepare.main_from_parse(arguments)¶

Call main function from the arguments given by parser

Parameters:

arguments: argparse.Namespace: result of argparse parsing of all arguments in command line

PanACoTA.subcommands.prepare.parse(parser, argu)¶

Parse arguments given to parser

Parameters:

parser: argparse.ArgumentParser: the parser used
argu: [str]: command-line given by user, to parse using parser

Returns:

argparse.Namespace: Parsed arguments
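
Each subcommand module exposes the same entry points (`build_parser`, `parse`, `check_args`, `main_from_parse`, `main`). The sketch below shows how such functions typically fit together with `argparse`. It is an illustration only: the option names (`--max-value`, `--threads`), the validation rule, the call to `check_args` from `parse`, and the body of `main` are placeholders, not PanACoTA's actual options or code.

```python
import argparse


def build_parser(parser):
    """Declare the command-line options on an existing parser."""
    # Placeholder options, chosen for this illustration only.
    parser.add_argument("--max-value", dest="max_value", type=int, default=100)
    parser.add_argument("--threads", type=int, default=1)


def check_args(parser, args):
    """Validate parsed arguments; exit with an error message if invalid."""
    if args.threads < 1:
        parser.error("--threads must be at least 1")
    return args


def parse(parser, argu):
    """Parse the list of command-line tokens, then validate the result."""
    args = parser.parse_args(argu)
    return check_args(parser, args)


def main(max_value, threads):
    """Stand-in for the real work of a subcommand."""
    return f"max_value={max_value} threads={threads}"


def main_from_parse(arguments):
    """Unpack the Namespace and forward its fields to main."""
    return main(arguments.max_value, arguments.threads)


demo_parser = argparse.ArgumentParser()
build_parser(demo_parser)
demo_args = parse(demo_parser, ["--threads", "4"])
demo_result = main_from_parse(demo_args)
```

Keeping `main` free of any `argparse` dependency lets other Python code call it directly with explicit arguments.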

`annotate` module¶

annotate is a subcommand of PanACoTA

It is a pipeline to do quality control and annotate genomes. Steps are:

optional: find rows of at least ‘n’ N (default n=5), and cut into a new contig at this point
for each genome, calculate L90 and number of contigs (after cutting at rows of n ‘N’, if used)
keep only genomes with:
- L90 <= x (default x = 100)
- #contig <= y (default y = 999)
rename those genomes and their contigs, with strain numbers assigned in order of decreasing quality (increasing L90 and #contig)
annotate kept genomes with prokka (default) or only prodigal
format the annotated genomes to gembase format
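
The quality-control steps above can be sketched in a few lines. This is a simplified illustration, not the code used by the pipeline: it assumes sequences are plain uppercase strings and takes L90 to be the smallest number of contigs (longest first) whose cumulative length reaches 90% of the genome. The helper names `split_on_n`, `l90` and `passes_qc` are hypothetical.

```python
import re


def split_on_n(sequence, cutn=5):
    """Split a contig at each row of at least `cutn` 'N'. No cut if cutn == 0."""
    if cutn == 0:
        return [sequence]
    pieces = re.split("N{%d,}" % cutn, sequence)
    return [piece for piece in pieces if piece]


def l90(contig_lengths):
    """Smallest number of contigs covering at least 90% of the total length."""
    total = sum(contig_lengths)
    covered = 0
    for rank, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        # Integer comparison avoids floating-point rounding on the 90% limit.
        if covered * 10 >= total * 9:
            return rank
    return 0


def passes_qc(contigs, max_l90=100, max_contigs=999, cutn=5):
    """Apply both thresholds after the optional cut at rows of 'N'."""
    pieces = [piece for contig in contigs for piece in split_on_n(contig, cutn)]
    lengths = [len(piece) for piece in pieces]
    return l90(lengths) <= max_l90 and len(pieces) <= max_contigs
```

Because cutting at rows of ‘N’ can increase the number of contigs, a genome may pass the thresholds with `cutn=0` and fail them with the default value.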

Input:

list_file: list of genome filenames to annotate. This file contains 1 line per genome. Each line contains the name(s) of the (multi-)fasta file(s) corresponding to the genome (separated by a space if there are several fasta files for the genome). After quality control, selected genomes will be named as follows: <gen_spe>.<date>.<strain>, with:
- <gen_spe>: 4 alphanumeric characters. Usually it corresponds to the first 2 letters of the genus and the first 2 letters of the species, like ESCO for Escherichia coli.
- <date>: date at which the genome was downloaded, formatted as MMYY (M=Month, Y=Year)
- <strain>: the strain number of the genome in the species, ordered by quality.

Default values for <gen_spe> and <date> are given as input (see below). However, if some genomes do not have the same date and/or genus/species as the others, you can add this information for those genomes in the list file. Fasta filenames and information are separated by `::`. <gen_spe> is given after the `::`, and <date> is preceded by a `.`. Here is an example:

genome1.fasta
genome2_ch1.fna genome2_pl.fst
genome3.fst genome3_plasmid.fst :: name
genome4.fna genome4.p1.fna genome4.p2.fna :: name.
genome5.fasta :: name.date
genome6.chromo.fst genome6.pl.fst  :: .date

species: with 4 alphanumeric characters, used to rename genomes (except those whose species name is specified in the list file)
date: optional. By default, takes the current date. Used to rename genomes (except those whose date is specified in the list file)
dbpath: path to folder containing all multi-fasta sequences of genomes
respath: path to folder where outputs must be saved (folders Genes, Replicons, Proteins, LSTINFO, gff3 and LSTINFO_dataset.lst file)
tmppath: optional. Path where tmp files must be saved. Default is respath/tmp_files
annotepath: optional. Path where prokka/prodigal output folders for all genomes must be saved. Default is respath/tmp_files
threads: number of threads that can be used (default 1)
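
The list file format described above (fasta filenames, then an optional `:: name.date` suffix) can be read with a small helper. This is a sketch of one way to interpret a line, assuming that a missing name or date falls back to the default given as input. The function name `parse_list_line` is hypothetical and is not part of PanACoTA.

```python
def parse_list_line(line, default_name, default_date):
    """Return (fasta_files, name, date) for one line of the list file."""
    files_part, separator, info = line.partition("::")
    files = files_part.split()
    name, date = default_name, default_date
    if separator:
        # The text after '::' is '<name>', '<name>.', '<name>.<date>' or '.<date>'.
        given_name, _, given_date = info.strip().partition(".")
        name = given_name or default_name
        date = given_date or default_date
    return files, name, date


example_lines = [
    "genome1.fasta",
    "genome3.fst genome3_plasmid.fst :: name",
    "genome4.fna genome4.p1.fna genome4.p2.fna :: name.",
    "genome5.fasta :: name.date",
    "genome6.chromo.fst genome6.pl.fst  :: .date",
]
parsed_lines = [parse_list_line(line, "ESCO", "0417") for line in example_lines]
```

Here `ESCO` and `0417` stand for the default species and date passed to the subcommand.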

Output:

In your given respath, you will find 5 folders:
- LSTINFO (information on each genome, with gene annotations),
- Genes (nuc. gene sequences),
- Proteins (amino-acid protein sequences),
- Replicons (input sequences but with formatted headers),
- gff3 (information on genes in gff3 format)
In your given tmppath folder, folders with prokka/prodigal results will be created for each input genome (1 folder per genome, called <genome_name>-[prokka, prodigal]Res). If errors are generated during prokka/prodigal step, you can look at the log file to see what was wrong (<genome_name>-[prokka, prodigal].log).
In your given respath, a file called annotate-genomes-<list_file>.log will be generated. You can find there all logs.
In your given respath, a file called annotate-genomes-<list_file>.log.err will be generated, containing information on errors and warnings that occurred: problems during annotation (hence no formatting step ran), and problems during formatting step. If this file is empty, then annotation and formatting steps finished without any problem for all genomes.
In your given respath, you will find a file called LSTINFO-<list_file>.lst with information on all genomes: gembase_name, original_name, genome_size, L90, nb_contigs

In your given respath, you will find a file called discarded-<list_file>.lst with information on genomes that were discarded (and hence not annotated) because of the L90 and/or nb_contig threshold: original_name, genome_size, L90, nb_contigs
In your given respath, you will find 2 png files: QC_L90-<list_file>.png and QC_nb-contigs-<list_file>.png, containing the histograms of L90 and nb_contigs values for all genomes, with a vertical red line representing the limit applied here.

Requirements:

in prokka/prodigal results, all genes are called <whatever>_<number>
-> the number will be kept.
The numbers of the genes annotated by prokka/prodigal are in increasing order in the tbl, faa and ffn files
genome names given to prokka/prodigal should not end with ‘_<number>’. Ideally, they should always have the same format: <spegenus>.<date>.<strain_number>, but they can have another format, as long as they don’t end with ‘_<number>’, which is the format of a gene name.

@author gem April 2017

PanACoTA.subcommands.annotate.build_parser(parser)¶

Method to create a parser for command-line options

Parameters:

parser: argparse.ArgumentParser: The parser to configure

PanACoTA.subcommands.annotate.check_args(parser, args)¶

Check that arguments given to parser are as expected.

Parameters:

parser: argparse.ArgumentParser: The parser used to parse command-line
args: argparse.Namespace: Parsed arguments

Returns:

argparse.Namespace or None: The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

PanACoTA.subcommands.annotate.main(cmd, list_file, db_path, res_dir, name, date, l90=100, nbcont=999, cutn=5, threads=1, force=False, qc_only=False, from_info=None, tmp_dir=None, res_annot_dir=None, verbose=0, quiet=False, prodigal_only=False, small=False)¶

Main method, doing all steps:

analyze genomes (nb contigs, L90, rows of N…)
keep only genomes with ‘good’ (according to user thresholds) L90 and nb_contigs
rename genomes with strain number in decreasing quality
annotate genome with prokka or only prodigal
format annotated genomes

If option ‘-Q’ is given, the pipeline ends at step 2. If option ‘--info <genome_info file name>’ is given, it starts at step 2.

verbosity:

default 0: stdout contains INFO, stderr contains ERROR.
1: stdout contains INFO, stderr contains WARNING and ERROR
2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR
>=15: Add DEBUG in stdout

Parameters:

cmd: str: command line used to launch this program
list_file: str: file containing the list of genome files, 1 genome per line, separated by a space if a genome is split in several fasta files. This file can also specify date and/or species information, according to the format described in documentation.
db_path: str: Path to the folder containing all the fasta files which will be annotated
res_dir: str: Path to the folder which will contain result folders and files
name: str: 4 alphanumeric characters, describing the species (for example ESCO). Used by default if no species name is given in list_file line.
date: str: 4 alphanumeric characters, defining the default date, for strains where it is not specified in the list_file
l90: int: Max L90 allowed to keep a genome
nbcont: int: Max number of contigs allowed to keep a genome
cutn: int: cut each time there are at least cutn ‘N’ in a row. Don’t cut if equal to 0
threads: int: max number of threads to use
force: bool: If True, overwrite previous results, if False keep what is already calculated
qc_only: bool: If True, do only quality control, if False, also do annotation
from_info: str: File containing information on genomes and their quality information (from prepare step)
tmp_dir: str or None: Path to folder where tmp files must be saved. None to use the default tmp folder
res_annot_dir: str or None: Path to the folder containing the prokka/prodigal result folders for the genomes. None to use the default prokka/prodigal folder
verbose: int: verbosity: - default 0: INFO in stdout, ERROR in stderr - 1: add WARNING in stderr - 2: like 1 + add DETAIL to stdout (by default only INFO) - >=15: add DEBUG to stdout
quiet: bool: True if nothing must be sent to stdout/stderr, False otherwise
prodigal_only: bool: True -> run only prodigal. False -> run prokka
small: bool: True -> use -p meta option with prodigal

Returns:

(genomes, kept_genomes, skipped, skipped_format): tuple

with:

genomes: dict with all genomes in list_file: {genome: [gembase_name, path_split_gembase, gsize, nbcont, L90]}
kept_genomes: dict with all genomes kept for annotation (same format as genomes)
skipped: list of genomes skipped because they had a problem in annotation step
skipped_format : list of genomes skipped because they had a problem in format step

PanACoTA.subcommands.annotate.main_from_parse(arguments)¶

Call main function from the arguments given by parser

Parameters:

arguments: argparse.Namespace: result of argparse parsing of all arguments in command line

PanACoTA.subcommands.annotate.parse(parser, argu)¶

Parse arguments given to parser

Parameters:

parser: argparse.ArgumentParser: the parser used
argu: [str]: command-line given by user, to parse using parser

Returns:

argparse.Namespace: Parsed arguments

`pangenome` module¶

pangenome is a subcommand of PanACoTA

@author gem May 2017

PanACoTA.subcommands.pangenome.build_parser(parser)¶

Method to create a parser for command-line options

Parameters:

parser: argparse.ArgumentParser: parser to configure in order to extract command-line arguments

PanACoTA.subcommands.pangenome.main(cmd, lstinfo, name, dbpath, min_id, outdir, clust_mode, spe_dir, threads, outfile=None, verbose=0, quiet=False)¶

Main method, doing all steps:

concatenate all protein files
create database as ffindex
cluster all proteins
convert to pangenome file
create summary and matrix of pangenome
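
As an illustration of the last step, the sketch below builds a quantitative matrix (number of members of each family in each genome) from already clustered families. It is not the pangenome module's code. It assumes, for this example only, that each member name starts with the genome name `<gen_spe>.<date>.<strain>` followed by a further `.`-separated field. The function names and the input layout are hypothetical.

```python
from collections import Counter


def genome_of(member):
    """Genome name of a family member (assumed: first 3 '.'-separated fields)."""
    return ".".join(member.split(".")[:3])


def quantitative_matrix(families, genomes):
    """Map each family id to its member counts, in the order of `genomes`."""
    matrix = {}
    for family_id, members in families.items():
        counts = Counter(genome_of(member) for member in members)
        matrix[family_id] = [counts.get(genome, 0) for genome in genomes]
    return matrix


# Made-up member names following the assumed naming scheme.
example_families = {
    1: ["ESCO.0417.00001.0001b_00001", "ESCO.0417.00002.0001b_00004"],
    2: ["ESCO.0417.00001.0001b_00002", "ESCO.0417.00001.0001b_00003"],
}
example_genomes = ["ESCO.0417.00001", "ESCO.0417.00002"]
example_matrix = quantitative_matrix(example_families, example_genomes)
```

A presence/absence matrix follows directly by replacing every non-zero count with 1.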

Parameters:

lstinfo: str

file with name of genomes to consider for pan in the first column, without extension. Other columns are ignored. The first column header must be ‘gembase_name’

name: str

name given to the dataset. For example, ESCO44 for 44 Escherichia coli genomes.

dbpath: str

path to the folder containing all protein files (files named after the genome name given in lstinfo + “.prt”)

min_id: float

Minimum percentage of identity between 2 proteins to put them in the same family

outdir: str

path to folder which will contain pangenome results and tmp files

clust_mode: [0, 1, 2]

0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’

spe_dir: str or None

path to the folder where concatenated bank of proteins must be saved. None to use the same folder as protein files

threads: int

Max number of threads to use

outfile: str or None

Name of the pangenome. None to use the default name

verbose: int

verbosity:

default 0: stdout contains INFO, stderr contains ERROR.
1: stdout contains INFO, stderr contains WARNING and ERROR
2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR
>=15: Add DEBUG in stdout

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.subcommands.pangenome.main_from_parse(args)¶

Call main function from the arguments given by parser

Parameters:

args: argparse.Namespace: result of argparse parsing of all arguments in command line

PanACoTA.subcommands.pangenome.parse(parser, argu)¶

Parse arguments given to parser

Parameters:

parser: argparse.ArgumentParser: Parser to use to parse command-line arguments
argu: [str]: command-line given

Returns:

argparse.Namespace or None: The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

`corepers` module¶

corepers is a subcommand of PanACoTA

Generate a core genome (families containing 1 member in all genomes of the dataset) or a persistent genome (families with a given % of genomes having exactly 1 member). You can also allow:

mixed families: exactly 1 member in the given percentage of genomes, but the other genomes can contain 0 or several members
multi families: allow several members in any genome.
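
The three kinds of families can be told apart with a short function. This is a minimal sketch of the rules as described above, not the corepers implementation: the input is assumed to be a dictionary giving, for each genome where the family is present, its number of members. The names `min_genomes` and `is_persistent` are hypothetical.

```python
import math


def min_genomes(nb_genomes, tol, floor=False):
    """Number of genomes required: floor or ceil of nb_genomes * tol."""
    needed = nb_genomes * tol
    return math.floor(needed) if floor else math.ceil(needed)


def is_persistent(members_per_genome, nb_genomes, tol=1.0,
                  multi=False, mixed=False, floor=False):
    """Decide whether a family belongs to the core/persistent genome.

    members_per_genome: dict genome -> number of members (>= 1) in that genome.
    """
    threshold = min_genomes(nb_genomes, tol, floor)
    present = len(members_per_genome)
    single = sum(1 for count in members_per_genome.values() if count == 1)
    if multi:
        # Any number of members per genome is accepted.
        return present >= threshold
    if mixed:
        # Enough genomes with exactly 1 member; the others are unconstrained.
        return single >= threshold
    # Strict: enough genomes with exactly 1 member, all others with 0.
    return single >= threshold and single == present
```

With `tol=1.0` and both `multi` and `mixed` left to False, the function selects core families: exactly 1 member in every genome.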

@author gem June 2017

PanACoTA.subcommands.corepers.build_parser(parser)¶

Method to create a parser for command-line options

Parameters:

parser: argparse.ArgumentParser: parser to configure in order to extract command-line arguments

PanACoTA.subcommands.corepers.check_args(parser, args)¶

Check that arguments given to parser are as expected.

Parameters:

parser: argparse.ArgumentParser: The parser used to parse command-line
args: argparse.Namespace: Parsed arguments

Returns:

argparse.Namespace or None: The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

PanACoTA.subcommands.corepers.get_info(tol, multi, mixed, floor)¶

Get a string corresponding to the information that will be given to logger.

Parameters:

tol: float: min % of genomes present in a family to consider it as persistent (between 0 and 1)
multi: bool: True if multigenic families are allowed, False otherwise
mixed: bool: True if mixed families are allowed, False otherwise
floor: bool: Require at least floor(nb_genomes*tol) genomes if True, ceil(nb_genomes*tol) if False

Returns:

str: Information to give to logger

PanACoTA.subcommands.corepers.main(cmd, pangenome, tol, multi, mixed, outputdir, lstinfo_file, floor, verbose, quiet)¶

Read pangenome and deduce Persistent genome according to the user criteria

Parameters:

pangenome: str: file containing pangenome
tol: float: min % of genomes present in a family to consider it as persistent (between 0 and 1)
multi: bool: True if multigenic families are allowed, False otherwise
mixed: bool: True if mixed families are allowed, False otherwise
outputdir: str or None: Specific directory for the generated persistent genome. If not given, pangenome directory is used.
lstinfo_file: str: list of genomes to include in the core/persistent genome. If not given, include all genomes of pan
floor: bool: Require at least floor(nb_genomes*tol) genomes if True, ceil(nb_genomes*tol) if False
verbose: int: verbosity: - default 0: stdout contains INFO, stderr contains ERROR. - 1: stdout contains INFO, stderr contains WARNING and ERROR - 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR - >=15: Add DEBUG in stdout
quiet: bool: True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.subcommands.corepers.main_from_parse(args)¶

Call main function from the arguments given by parser

Parameters:

args: argparse.Namespace: result of argparse parsing of all arguments in command line

PanACoTA.subcommands.corepers.parse(parser, argu)¶

Parse arguments given to parser

Parameters:

parser: argparse.ArgumentParser: the parser used
argu: [str]: command-line given by user, to parse using parser

Returns:

argparse.Namespace: Parsed arguments

`align` module¶

align is a subcommand of PanACoTA

@author gem June 2017

PanACoTA.subcommands.align.build_parser(parser)¶

Method to create a parser for command-line options

Parameters:

parser: argparse.ArgumentParser: parser to configure in order to extract command-line arguments

PanACoTA.subcommands.align.main(cmd, corepers, list_genomes, dname, dbpath, outdir, prot_ali, threads, force, verbose=0, quiet=False)¶

Align given core genome families

Parameters:

corepers: str: File containing persistent genome families
list_genomes: str: File containing the list of all genomes in the dataset. Only first column is considered.
dname: str: Dataset name, used to name output files
dbpath: str: path to the directory containing ‘Proteins’ and ‘Genes’ folders
outdir: str: path to the directory where output files must be saved
prot_ali: bool: Also give aa alignment of concatenation of persistent proteins
threads: int: Max number of threads to use
force: bool: Remove existing output files and rerun everything if True.
verbose: int: verbosity: - default 0: stdout contains INFO, stderr contains ERROR. - 1: stdout contains INFO, stderr contains WARNING and ERROR - 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR - >=15: Add DEBUG in stdout
quiet: bool: True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.subcommands.align.main_from_parse(args)¶

Call main function from the arguments given by parser

Parameters:

args: argparse.Namespace: result of argparse parsing of all arguments in command line

PanACoTA.subcommands.align.parse(parser, argu)¶

Parse arguments given to parser

Parameters:

parser: argparse.ArgumentParser: the parser used
argu: [str]: command-line given by user, to parse using parser

Returns:

argparse.Namespace: Parsed arguments

`tree` module¶

tree is a subcommand of PanACoTA

@author gem June 2017

PanACoTA.subcommands.tree.build_parser(parser)¶

Method to create a parser for command-line options

Parameters:

parser: argparse.ArgumentParser: parser to configure in order to extract command-line arguments

PanACoTA.subcommands.tree.check_args(parser, args)¶

Check that arguments given to parser are as expected.

Parameters:

parser: argparse.ArgumentParser: The parser used to parse command-line
args: argparse.Namespace: Parsed arguments

Returns:

argparse.Namespace or None: The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

PanACoTA.subcommands.tree.main(cmd, align, outdir, soft, model, threads, boot=False, write_boot=False, write_mat=False, memory=False, fast=False, verbose=0, quiet=False)¶

Inferring a phylogenetic tree from an alignment file, with the given software.

Parameters:

cmd: str: command used to launch tree module
align: str: Path to file containing alignments of persistent families grouped by genome
outdir: str or None: Path to file which will contain the tree inferred
soft: str: Software to use to infer the phylogenetic tree: 1 of quicktree, fasttree or fastme
model: str or None: DNA substitution model chosen by user, None if quicktree used
threads: int: Maximum number of threads to use
boot: int or None: Number of bootstraps to compute. None if no bootstrap asked
write_boot: bool: True if all bootstrap pseudo-trees must be saved into a file, False otherwise
write_mat: bool: True if distance matrix must be saved, false otherwise
memory: str: Maximal RAM usage in GB | MB | % - Only for iqtree
fast: bool: use -fast option with IQtree
verbose: int: verbosity: - default 0: stdout contains INFO, stderr contains ERROR. - 1: stdout contains INFO, stderr contains WARNING and ERROR - 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR - >=15: Add DEBUG in stdout
quiet: bool: True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.subcommands.tree.main_from_parse(args)¶

Call main function from the arguments given by parser

Parameters:

args: argparse.Namespace: result of argparse parsing of all arguments in command line

PanACoTA.subcommands.tree.parse(parser, argu)¶

Parse arguments given to parser

Parameters:

parser: argparse.ArgumentParser: the parser used
argu: [str]: command-line given by user, to parse using parser

Returns:

argparse.Namespace: Parsed arguments

`all` module¶

all_modules
