PanACoTA.subcommands package

Subpackage containing the main script used to launch each available subcommand.

prepare module

Subcommand to prepare a dataset:

  • Download all genomes of a given species from refseq

  • Filter them with L90 and number of contigs thresholds

  • Remove too close/far genomes using Mash
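The last step can be pictured as a greedy filter over pairwise distances. The sketch below is only an illustration under assumptions: it takes genomes already sorted from best to worst quality and a precomputed distance table (in practice these distances would come from Mash), and it keeps a genome only if its distance to every genome kept so far lies between the two limits. The function name, the data layout and the example values are hypothetical and are not taken from PanACoTA.

```python
def filter_by_distance(genomes, distances, min_dist, max_dist):
    """Greedy filter (assumed strategy, not PanACoTA's code).

    genomes: names sorted from best to worst quality.
    distances: dict mapping frozenset({a, b}) to the distance between a and b.
    """
    kept = []
    for genome in genomes:
        # Keep the genome only if it is neither too close to nor too far
        # from every genome already kept.
        if all(min_dist <= distances[frozenset((genome, other))] <= max_dist
               for other in kept):
            kept.append(genome)
    return kept


# Hypothetical distances between four genomes.
distances = {
    frozenset(("A", "B")): 0.00001,  # B is almost identical to A
    frozenset(("A", "C")): 0.03,
    frozenset(("A", "D")): 0.2,      # D is too far from A
    frozenset(("B", "C")): 0.03,
    frozenset(("B", "D")): 0.2,
    frozenset(("C", "D")): 0.2,
}
kept = filter_by_distance(["A", "B", "C", "D"], distances,
                          min_dist=1e-4, max_dist=0.06)
print(kept)
```

With these example values, B is dropped as too close to A and D as too far from it.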

@author Amandine PERRIN August 2019

PanACoTA.subcommands.prepare.build_parser(parser)

Method to create a parser for command-line options

Parameters:
parser: argparse.ArgumentParser

parser to configure in order to extract command-line arguments

PanACoTA.subcommands.prepare.check_args(parser, args)

Check that arguments given to parser are as expected.

Parameters:
parser: argparse.ArgumentParser

The parser used to parse command-line

args: argparse.Namespace

Parsed arguments

Returns:
argparse.Namespace or None

The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

PanACoTA.subcommands.prepare.main(cmd, ncbi_species_name, ncbi_species_taxid, ncbi_taxid, ncbi_strains, levels, ncbi_section, outdir, tmp_dir, threads, norefseq, db_dir, only_mash, info_file, l90, nbcont, cutn, min_dist, max_dist, verbose, quiet)

Main method, constructing the draft dataset for the given species

verbosity:

  • default 0: stdout contains INFO, stderr contains ERROR, .log contains INFO and more, .log.err contains WARNING and more

  • 1: same as 0 + WARNING in stderr

  • 2: same as 1 + DETAILS in stdout + DETAILS in .log.details

  • >=15: same as 2 + DEBUG in stdout + create .log.debug with everything from INFO to DEBUG

Parameters:
cmd: str

command line used to launch this program

ncbi_species_name: str

name of species to download, as given by NCBI

ncbi_species_taxid: int

species taxid given in NCBI

ncbi_taxid: int

NCBI taxid (sub-species)

ncbi_strains: str

specific strains to download

levels: str

Level of assembly to download. Choice between ‘all’, ‘complete’, ‘chromosome’, ‘scaffold’, ‘contig’. Default is ‘all’

outdir: str

path to output directory (where created database will be saved).

tmp_dir: str

Path to directory where tmp files are saved (sequences split at each row of 5 ‘N’)

threads: int

max number of threads to use

norefseq: bool

True if user does not want to download the database again

db_dir: str

Name of the folder where already downloaded fasta files are saved.

only_mash: bool

True if user already has the database and quality of each genome (L90, #contigs etc.)

info_file: str

File containing information on QC if it was already run before (columns to_annotate, gsize, nb_conts and L90).

l90: int

Max L90 allowed to keep a genome

nbcont: int

Max number of contigs allowed to keep a genome

cutn: int

cut each time there are at least ‘cutn’ N in a row. Don’t cut if equal to 0

min_dist: float

lower limit of distance between 2 genomes to keep them

max_dist: float

upper limit of distance between 2 genomes to keep them (default is 0.06)

verbose: int

verbosity:

  • default 0: stdout contains INFO, stderr contains ERROR, .log contains INFO and more, .log.err contains WARNING and more

  • 1: same as 0 + WARNING in stderr

  • 2: same as 1 + DETAILS in stdout + DETAILS in .log.details

  • >=15: same as 2 + DEBUG in stdout + create .log.debug with everything from INFO to DEBUG

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.subcommands.prepare.main_from_parse(arguments)

Call main function from the arguments given by parser

Parameters:
arguments: argparse.Namespace

result of argparse parsing of all arguments in command line

PanACoTA.subcommands.prepare.parse(parser, argu)

Parse arguments given to parser

Parameters:
parser: argparse.ArgumentParser

the parser used

argu: [str]

command-line given by user, to parse using parser

Returns:
argparse.Namespace

Parsed arguments
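The functions documented for this module follow a common argparse pattern: one function configures a parser, one parses the command line, and one unpacks the resulting namespace into a call to main. The sketch below illustrates that pattern with a toy subcommand. It is an assumption about how the pieces fit together, not PanACoTA's code, and the two options (-o and -p) and the body of main are hypothetical.

```python
import argparse


def build_parser(parser):
    """Add the (hypothetical) options of a toy subcommand to `parser`."""
    parser.add_argument("-o", dest="outdir", default=".",
                        help="output directory (hypothetical option)")
    parser.add_argument("-p", dest="threads", type=int, default=1,
                        help="number of threads (hypothetical option)")


def parse(parser, argu):
    """Parse the list of command-line strings `argu` with `parser`."""
    return parser.parse_args(argu)


def main(outdir, threads):
    """Stand-in for the real work of a subcommand."""
    return f"outdir={outdir} threads={threads}"


def main_from_parse(arguments):
    """Unpack the parsed namespace and call main."""
    return main(outdir=arguments.outdir, threads=arguments.threads)


parser = argparse.ArgumentParser(description="toy subcommand")
build_parser(parser)
args = parse(parser, ["-o", "results", "-p", "4"])
result = main_from_parse(args)
print(result)
```

Keeping parsing separate from main in this way lets main be called directly from Python with explicit arguments, without going through the command line.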

annotate module

annotate is a subcommand of PanACoTA

It is a pipeline to do quality control and annotate genomes. Steps are:

  • optional: find rows of at least ‘n’ N (default n=5), and cut into a new contig at this point

  • for each genome, compute L90 and number of contigs (after cutting at rows of n ‘N’, if used)

  • keep only genomes with:

    • L90 <= x (default x = 100)

    • #contig <= y (default y = 999)

  • rename those genomes and their contigs, with strain name increasing with quality (L90 and #contig)

  • annotate kept genomes with prokka (default) or only prodigal

  • gembase format
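The quality-control steps above can be sketched in a few lines. This is a minimal illustration, assuming the usual definition of L90 (the smallest number of contigs, taken from longest to shortest, whose cumulative length reaches 90% of the genome size) and assuming sequences are given as uppercase strings. The function names are hypothetical; only the default thresholds (n=5, x=100, y=999) come from the description above.

```python
import re


def split_at_n(sequence, cutn=5):
    """Split a contig at every run of at least `cutn` 'N'. cutn=0 disables cutting."""
    if cutn == 0:
        return [sequence]
    pieces = re.split("N{%d,}" % cutn, sequence)
    # Drop empty pieces produced by runs of N at the ends of the sequence.
    return [piece for piece in pieces if piece]


def l90(contig_lengths):
    """Smallest number of contigs covering at least 90% of the total length."""
    total = sum(contig_lengths)
    running = 0
    for rank, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        # Integer comparison, equivalent to running >= 0.9 * total.
        if running * 10 >= total * 9:
            return rank
    return 0  # no contig at all


def passes_qc(contig_lengths, max_l90=100, max_nbcont=999):
    """Apply the two thresholds used to keep or discard a genome."""
    return l90(contig_lengths) <= max_l90 and len(contig_lengths) <= max_nbcont


contigs = split_at_n("ACGTACGTAC" + "N" * 5 + "ACGT")
lengths = [len(contig) for contig in contigs]
print(contigs, l90(lengths), passes_qc(lengths))
```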

Input:

  • list_file: list of genome filenames to annotate. This file contains 1 line per genome. It contains the name(s) of the (multi-)fasta file(s) corresponding to the genome (separated by space if several fasta files for the genome). After quality control, selected genomes will be named as follows: <gen_spe>.<date>.<strain>, with:

    • <gen_spe> 4 alphanumeric characters. Usually it corresponds to the 2 first letters of genus, and 2 first letters of species, like ESCO for Escherichia coli.

    • <date> date at which the genome was downloaded, formatted as MMYY (M=Month, Y=Year)

    • <strain> is the strain number of the genome in the species, ordered by quality.

Default values for <gen_spe> and <date> are given as input (see below). However, if some genomes do not have the same date and/or genus/species as the others, you can add this information for those genomes in the list file. Fasta filenames and information are separated by ‘::’. <gen_spe> is given after the ‘::’, and <date> is preceded by a ‘.’. Here is an example:

genome1.fasta
genome2_ch1.fna genome2_pl.fst
genome3.fst genome3_plasmid.fst :: name
genome4.fna genome4.p1.fna genome4.p2.fna :: name.
genome5.fasta :: name.date
genome6.chromo.fst genome6.pl.fst  :: .date
  • species: with 4 alphanumeric characters, used to rename genomes (except those whose species name is specified in the list file)

  • date: optional. By default, takes the current date. Used to rename genomes (except those whose date is specified in the list file)

  • dbpath: path to folder containing all multi-fasta sequences of genomes

  • respath: path to folder where outputs must be saved (folders Genes, Replicons, Proteins, LSTINFO, gff3 and LSTINFO_dataset.lst file)

  • tmppath: optional. Path where tmp files must be saved. Default is respath/tmp_files

  • annotepath: optional. Path where prokka/prodigal output folders for all genomes must be saved. Default is respath/tmp_files

  • threads: number of threads that can be used (default 1)
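The list-file format described above (filenames, then an optional ‘::’ followed by <gen_spe> and/or ‘.<date>’) can be parsed as in the sketch below. It is an illustration of the format only, not PanACoTA's parser: the function name and the example defaults ‘ESCO’ and ‘0417’ are hypothetical, and no validation of the 4-character constraint is done.

```python
def parse_list_line(line, default_name, default_date):
    """Return (fasta files, species name, date) for one line of the list file."""
    files_part, separator, info = line.partition("::")
    files = files_part.split()
    name, date = default_name, default_date
    if separator:
        # Text before the first '.' is the species name, text after it the date.
        given_name, _, given_date = info.strip().partition(".")
        name = given_name or default_name
        date = given_date or default_date
    return files, name, date


examples = [
    "genome1.fasta",
    "genome3.fst genome3_plasmid.fst :: SAEN",
    "genome4.fna genome4.p1.fna genome4.p2.fna :: SAEN.",
    "genome5.fasta :: SAEN.1015",
    "genome6.chromo.fst genome6.pl.fst  :: .1015",
]
parsed = [parse_list_line(line, "ESCO", "0417") for line in examples]
for entry in parsed:
    print(entry)
```

Splitting on ‘::’ first matters because the filenames themselves contain dots.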

Output:

  • In your given respath, you will find 5 folders:

    • LSTINFO (information on each genome, with gene annotations),

    • Genes (nuc. gene sequences),

    • Proteins (aa protein sequences),

    • Replicons (input sequences but with formatted headers).

    • gff3 (information on genes as gff3 format)

  • In your given tmppath folder, folders with prokka/prodigal results will be created for each input genome (1 folder per genome, called <genome_name>-[prokka, prodigal]Res). If errors are generated during prokka/prodigal step, you can look at the log file to see what was wrong (<genome_name>-[prokka, prodigal].log).

  • In your given respath, a file called annotate-genomes-<list_file>.log will be generated. You can find there all logs.

  • In your given respath, a file called annotate-genomes-<list_file>.log.err will be generated, containing information on errors and warnings that occurred: problems during annotation (hence no formatting step ran), and problems during formatting step. If this file is empty, then annotation and formatting steps finished without any problem for all genomes.

  • In your given respath, you will find a file called LSTINFO-<list_file>.lst with information on all genomes: gembase_name, original_name, genome_size, L90, nb_contigs

  • In your given respath, you will find a file called discarded-<list_file>.lst with information on genomes that were discarded (and hence not annotated) because of the L90 and/or nb_contig threshold: original_name, genome_size, L90, nb_contigs

  • In your given respath, you will find 2 png files: QC_L90-<list_file>.png and QC_nb-contigs-<list_file>.png, containing the histograms of L90 and nb_contigs values for all genomes, with a vertical red line representing the limit applied here.

Requested:

  • in prokka/prodigal results, all genes are called <whatever>_<number>

    -> the number will be kept.

  • The gene numbers assigned by prokka/prodigal are in increasing order in tbl, faa and ffn files

  • genome names given to prokka/prodigal should not end with ‘_<number>’. Ideally, they should always have the same format: <spegenus>.<date>.<strain_number>, but they can have another format, as long as they do not end with ‘_<number>’, which is the format of a gene name.
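The naming constraints above can be checked with a single regular expression. The sketch below is a hedged illustration, not code from PanACoTA: the helper names and the example gene name ‘PROKKA_00012’ are hypothetical.

```python
import re

# A gene name ends with '_<number>'; a genome name must not.
GENE_SUFFIX = re.compile(r"_(\d+)$")


def gene_number(gene_name):
    """Return the trailing number of a gene name, or None if there is none."""
    match = GENE_SUFFIX.search(gene_name)
    return int(match.group(1)) if match else None


def is_valid_genome_name(genome_name):
    """A genome name is acceptable if it cannot be mistaken for a gene name."""
    return GENE_SUFFIX.search(genome_name) is None


print(gene_number("PROKKA_00012"))
print(is_valid_genome_name("ESCO.0417.00001"))
print(is_valid_genome_name("genome_12"))
```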

@author gem April 2017

PanACoTA.subcommands.annotate.build_parser(parser)

Method to create a parser for command-line options

Parameters:
parser: argparse.ArgumentParser

The parser to configure

PanACoTA.subcommands.annotate.check_args(parser, args)

Check that arguments given to parser are as expected.

Parameters:
parser: argparse.ArgumentParser

The parser used to parse command-line

args: argparse.Namespace

Parsed arguments

Returns:
argparse.Namespace or None

The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

PanACoTA.subcommands.annotate.main(cmd, list_file, db_path, res_dir, name, date, l90=100, nbcont=999, cutn=5, threads=1, force=False, qc_only=False, from_info=None, tmp_dir=None, res_annot_dir=None, verbose=0, quiet=False, prodigal_only=False, small=False)

Main method, doing all steps:

  1. analyze genomes (nb contigs, L90, rows of N…)

  2. keep only genomes with ‘good’ (according to user thresholds) L90 and nb_contigs

  3. rename genomes with strain number in decreasing quality

  4. annotate genome with prokka or only prodigal

  5. format annotated genomes

With option ‘-Q’, the pipeline ends at step 2. With option ‘--info <genome_info file name>’, it starts at step 2.

verbosity:

  • default 0: stdout contains INFO, stderr contains ERROR.

  • 1: stdout contains INFO, stderr contains WARNING and ERROR

  • 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR

  • >=15: Add DEBUG in stdout

Parameters:
cmd: str

command line used to launch this program

list_file: str

file containing the list of genome files, 1 genome per line, separated by a space if a genome is split in several fasta files. This file can also specify date and/or species information, according to the format described in documentation.

db_path: str

Path to the folder containing all the fasta files which will be annotated

res_dir: str

Path to the folder which will contain result folders and files

name: str

4 alphanumeric characters, describing the species (for example ESCO). Used by default if no species name is given in list_file line.

date: str

4 alphanumeric characters, defining the default date, for strains where it is not specified in the list_file

l90: int

Max L90 allowed to keep a genome

nbcont: int

Max number of contigs allowed to keep a genome

cutn: int

cut each time there are at least cutn ‘N’ in a row. Don’t cut if equal to 0

threads: int

max number of threads to use

force: bool

If True, overwrite previous results, if False keep what is already calculated

qc_only: bool

If True, do only quality control, if False, also do annotation

from_info: str

File containing information on genomes and their quality information (from prepare step)

tmp_dir: str or None

Path to folder where tmp files must be saved. None to use the default tmp folder

res_annot_dir: str or None

Path to folder where are the prokka/prodigal result folders for the genomes. None to use the default prokka/prodigal folder

verbose: int

verbosity:

  • default (0): INFO in stdout, ERROR and more in stderr

  • 1: add WARNING in stderr

  • 2: like 1 + add DETAIL to stdout (by default only INFO)

  • >=15: add DEBUG to stdout

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise

prodigal_only: bool

True -> run only prodigal. False -> run prokka

small: bool

True -> use -p meta option with prodigal

Returns:
(genomes, kept_genomes, skipped, skipped_format): tuple

with:

  • genomes: dict with all genomes in list_file: {genome: [gembase_name, path_split_gembase, gsize, nbcont, L90]}

  • kept_genomes: dict with all genomes kept for annotation (same format as genomes)

  • skipped: list of genomes skipped because they had a problem in annotation step

  • skipped_format : list of genomes skipped because they had a problem in format step

PanACoTA.subcommands.annotate.main_from_parse(arguments)

Call main function from the arguments given by parser

Parameters:
arguments: argparse.Namespace

result of argparse parsing of all arguments in command line

PanACoTA.subcommands.annotate.parse(parser, argu)

Parse arguments given to parser

Parameters:
parser: argparse.ArgumentParser

the parser used

argu: [str]

command-line given by user, to parse using parser

Returns:
argparse.Namespace

Parsed arguments

pangenome module

pangenome is a subcommand of PanACoTA

@author gem May 2017

PanACoTA.subcommands.pangenome.build_parser(parser)

Method to create a parser for command-line options

Parameters:
parser: argparse.ArgumentParser

parser to configure in order to extract command-line arguments

PanACoTA.subcommands.pangenome.main(cmd, lstinfo, name, dbpath, min_id, outdir, clust_mode, spe_dir, threads, outfile=None, verbose=0, quiet=False)

Main method, doing all steps:

  • concatenate all protein files

  • create database as ffindex

  • cluster all proteins

  • convert to pangenome file

  • creating summary and matrix of pangenome

Parameters:
lstinfo: str

file with name of genomes to consider for the pangenome in the first column, without extension. Other columns are ignored. The first column header must be ‘gembase_name’

name: str

name given to the dataset. For example, ESCO44 for 44 Escherichia coli genomes.

dbpath: str

path to the folder containing all protein files (files called as the name of genome given in lstinfo + “.prt”)

min_id: float

Minimum percentage of identity between 2 proteins to put them in the same family

outdir: str

path to folder which will contain pangenome results and tmp files

clust_mode: [0, 1, 2]

0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’

spe_dir: str or None

path to the folder where concatenated bank of proteins must be saved. None to use the same folder as protein files

threads: int

Max number of threads to use

outfile: str or None

Name of the pangenome. None to use the default name

verbose: int

verbosity:

  • default 0: stdout contains INFO, stderr contains ERROR.

  • 1: stdout contains INFO, stderr contains WARNING and ERROR

  • 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR

  • >=15: Add DEBUG in stdout

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise
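The first step listed for main, concatenating all protein files, follows directly from the lstinfo and dbpath parameters: the first column of lstinfo gives the genome names, and each genome has a file called <name>.prt in dbpath. The sketch below shows that step under those assumptions. The function names are hypothetical and the remaining steps (database creation, clustering, conversion) are not covered.

```python
from pathlib import Path


def read_genome_names(lstinfo):
    """Return the first column of lstinfo, checking its 'gembase_name' header."""
    lines = [line for line in Path(lstinfo).read_text().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    if header.split()[0] != "gembase_name":
        raise ValueError("first column header must be 'gembase_name'")
    return [row.split()[0] for row in rows]


def concatenate_proteins(lstinfo, dbpath, outfile):
    """Write the content of every <name>.prt file of dbpath into outfile."""
    names = read_genome_names(lstinfo)
    with open(outfile, "w") as bank:
        for name in names:
            bank.write((Path(dbpath) / f"{name}.prt").read_text())
    return names
```

A call would look like concatenate_proteins("LSTINFO-list.lst", "Proteins", "bank.prt"), where all three paths are placeholders.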

PanACoTA.subcommands.pangenome.main_from_parse(args)

Call main function from the arguments given by parser

Parameters:
args: argparse.Namespace

result of argparse parsing of all arguments in command line

PanACoTA.subcommands.pangenome.parse(parser, argu)

Parse arguments given to parser

Parameters:
parser: argparse.ArgumentParser

Parser to use to parse command-line arguments

argu: [str]

command-line given

Returns:
argparse.Namespace or None

The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

corepers module

corepers is a subcommand of PanACoTA

Generate a core genome (families containing 1 member in all genomes of the dataset) or a persistent genome (families with a given % of genomes having exactly 1 member). You can also allow:

  • mixed families: exactly 1 member in the given percentage of genomes, but the other genomes can contain 0 or several members

  • multi families: allow several members in any genome.
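The three kinds of families described above can be told apart with a few counts. The sketch below is one reading of those definitions, not PanACoTA's implementation: it assumes that a strict persistent family has no genome with several members, that a multi family only needs enough genomes with at least one member, and that the required number of genomes is rounded up. The function name and the data layout (a dict from genome name to number of members) are hypothetical.

```python
import math


def is_persistent(members, nb_genomes, tol=1.0, mixed=False, multi=False):
    """Decide whether a family belongs to the persistent genome.

    members: dict mapping a genome name to its number of members in the family.
    tol: minimum fraction of genomes required (1.0 gives the core genome).
    """
    needed = math.ceil(tol * nb_genomes)
    present = [count for count in members.values() if count > 0]
    single = sum(1 for count in present if count == 1)
    if multi:
        # Several members allowed in any genome.
        return len(present) >= needed
    if mixed:
        # Enough genomes with exactly 1 member; the others are unconstrained.
        return single >= needed
    # Strict: enough genomes with exactly 1 member, and no genome with several.
    return single >= needed and single == len(present)


core_family = {"g1": 1, "g2": 1, "g3": 1, "g4": 1}
mixed_family = {"g1": 1, "g2": 1, "g3": 2}
print(is_persistent(core_family, 4))
print(is_persistent(mixed_family, 4, tol=0.5))
print(is_persistent(mixed_family, 4, tol=0.5, mixed=True))
```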

@author gem June 2017

PanACoTA.subcommands.corepers.build_parser(parser)

Method to create a parser for command-line options

Parameters:
parser: argparse.ArgumentParser

parser to configure in order to extract command-line arguments

PanACoTA.subcommands.corepers.check_args(parser, args)

Check that arguments given to parser are as expected.

Parameters:
parser: argparse.ArgumentParser

The parser used to parse command-line

args: argparse.Namespace

Parsed arguments

Returns:
argparse.Namespace or None

The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

PanACoTA.subcommands.corepers.get_info(tol, multi, mixed, floor)

Get a string corresponding to the information that will be given to logger.

Parameters:
tol: float

min % of genomes present in a family to consider it as persistent (between 0 and 1)

multi: bool

True if multigenic families are allowed, False otherwise

mixed: bool

True if mixed families are allowed, False otherwise

floor: bool

Require at least floor(nb_genomes*tol) genomes if True, ceil(nb_genomes*tol) if False

Returns:
str

Information to give to logger
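The effect of the floor parameter is easiest to see on a case where nb_genomes*tol is not an integer. The helper below is a hypothetical illustration of the rule stated for floor, not a function of this module.

```python
import math


def min_genomes(nb_genomes, tol, floor):
    """Number of genomes a family must cover to be considered persistent."""
    value = nb_genomes * tol
    return math.floor(value) if floor else math.ceil(value)


# 10 genomes with tol=0.75 gives 7.5: floor requires 7 genomes, ceil requires 8.
print(min_genomes(10, 0.75, floor=True))
print(min_genomes(10, 0.75, floor=False))
```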

PanACoTA.subcommands.corepers.main(cmd, pangenome, tol, multi, mixed, outputdir, lstinfo_file, floor, verbose, quiet)

Read pangenome and deduce Persistent genome according to the user criteria

Parameters:
pangenome: str

file containing pangenome

tol: float

min % of genomes present in a family to consider it as persistent (between 0 and 1)

multi: bool

True if multigenic families are allowed, False otherwise

mixed: bool

True if mixed families are allowed, False otherwise

outputdir: str or None

Specific directory for the generated persistent genome. If not given, pangenome directory is used.

lstinfo_file: str

list of genomes to include in the core/persistent genome. If not given, include all genomes of the pangenome

floor: bool

Require at least floor(nb_genomes*tol) genomes if True, ceil(nb_genomes*tol) if False

verbose: int

verbosity:

  • default 0: stdout contains INFO, stderr contains ERROR.

  • 1: stdout contains INFO, stderr contains WARNING and ERROR

  • 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR

  • >=15: Add DEBUG in stdout

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.subcommands.corepers.main_from_parse(args)

Call main function from the arguments given by parser

Parameters:
args: argparse.Namespace

result of argparse parsing of all arguments in command line

PanACoTA.subcommands.corepers.parse(parser, argu)

Parse arguments given to parser

Parameters:
parser: argparse.ArgumentParser

the parser used

argu: [str]

command-line given by user, to parse using parser

Returns:
argparse.Namespace

Parsed arguments

align module

align is a subcommand of PanACoTA

@author gem June 2017

PanACoTA.subcommands.align.build_parser(parser)

Method to create a parser for command-line options

Parameters:
parser: argparse.ArgumentParser

parser to configure in order to extract command-line arguments

PanACoTA.subcommands.align.main(cmd, corepers, list_genomes, dname, dbpath, outdir, prot_ali, threads, force, verbose=0, quiet=False)

Align given core genome families

Parameters:
corepers: str

File containing persistent genome families

list_genomes: str

File containing the list of all genomes in the dataset. Only first column is considered.

dname: str

Dataset name, used to name output files

dbpath: str

path to the directory containing ‘Proteins’ and ‘Genes’ folders

outdir: str

path to the directory where output files must be saved

prot_ali: bool

Also give aa alignment of concatenation of persistent proteins

threads: int

Max number of threads to use

force: bool

Remove existing output files and rerun everything if True.

verbose: int

verbosity:

  • default 0: stdout contains INFO, stderr contains ERROR.

  • 1: stdout contains INFO, stderr contains WARNING and ERROR

  • 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR

  • >=15: Add DEBUG in stdout

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.subcommands.align.main_from_parse(args)

Call main function from the arguments given by parser

Parameters:
args: argparse.Namespace

result of argparse parsing of all arguments in command line

PanACoTA.subcommands.align.parse(parser, argu)

Parse arguments given to parser

Parameters:
parser: argparse.ArgumentParser

the parser used

argu: [str]

command-line given by user, to parse using parser

Returns:
argparse.Namespace

Parsed arguments

tree module

tree is a subcommand of PanACoTA

@author gem June 2017

PanACoTA.subcommands.tree.build_parser(parser)

Method to create a parser for command-line options

Parameters:
parser: argparse.ArgumentParser

parser to configure in order to extract command-line arguments

PanACoTA.subcommands.tree.check_args(parser, args)

Check that arguments given to parser are as expected.

Parameters:
parser: argparse.ArgumentParser

The parser used to parse command-line

args: argparse.Namespace

Parsed arguments

Returns:
argparse.Namespace or None

The arguments parsed, updated according to some rules. Exit program with error message if error occurs with arguments given.

PanACoTA.subcommands.tree.main(cmd, align, outdir, soft, model, threads, boot=False, write_boot=False, write_mat=False, memory=False, fast=False, verbose=0, quiet=False)

Inferring a phylogenetic tree from an alignment file, with the given software.

Parameters:
cmd: str

command used to launch tree module

align: str

Path to file containing alignments of persistent families grouped by genome

outdir: str or None

Path to file which will contain the tree inferred

soft: str

Software to use to infer the phylogenetic tree: one of quicktree, fasttree or fastme

model: str or None

DNA substitution model chosen by user, None if quicktree used

threads: int

Maximum number of threads to use

boot: int or None

Number of bootstraps to compute. None if no bootstrap asked

write_boot: bool

True if all bootstrap pseudo-trees must be saved into a file, False otherwise

write_mat: bool

True if distance matrix must be saved, False otherwise

memory: str

Maximal RAM usage in GB | MB | % - Only for iqtree

fast: bool

use -fast option with iqtree

verbose: int

verbosity:

  • default 0: stdout contains INFO, stderr contains ERROR.

  • 1: stdout contains INFO, stderr contains WARNING and ERROR

  • 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR

  • >=15: Add DEBUG in stdout

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.subcommands.tree.main_from_parse(args)

Call main function from the arguments given by parser

Parameters:
args: argparse.Namespace

result of argparse parsing of all arguments in command line

PanACoTA.subcommands.tree.parse(parser, argu)

Parse arguments given to parser

Parameters:
parser: argparse.ArgumentParser

the parser used

argu: [str]

command-line given by user, to parse using parser

Returns:
argparse.Namespace

Parsed arguments

all module

all_modules