PanACoTA submodules

These submodules contain utility functions.

`PanACoTA.utils` submodule

Util functions and classes.

@author gem April 2017

class PanACoTA.utils.LessThanFilter(level)

Bases: Filter

In logging, the level set on a handler is a minimum: every record at that level or higher is emitted. To also set an upper bound, so that only records below a given level are emitted, add this filter to the handler: handler.addFilter(LessThanFilter(level))

filter(rec)

Function to decide if given log has to be logged or not, according to its level

Parameters:

rec: current record handled by logger

Returns:

bool: True if level of current log is less than the defined limit, False otherwise
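The upper-bound behaviour described above can be sketched with the standard logging module alone. This is a minimal illustration, not the PanACoTA implementation: BelowLevelFilter and build_demo_logger are hypothetical names, and the strict "less than" comparison is taken from the return description above.

```python
import io
import logging


class BelowLevelFilter(logging.Filter):
    """Hypothetical stand-in: accept only records strictly below `level`."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, rec):
        # True -> the handler emits the record, False -> it is dropped
        return rec.levelno < self.level


def build_demo_logger(stream):
    """Logger whose single handler emits INFO <= level < WARNING."""
    logger = logging.getLogger("below_level_demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.INFO)  # lower bound (handler level)
    handler.addFilter(BelowLevelFilter(logging.WARNING))  # upper bound
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
demo = build_demo_logger(stream)
demo.debug("too low")      # below the handler level
demo.info("kept")          # inside the window
demo.warning("too high")   # rejected by the filter
captured = stream.getvalue()
```

Only the INFO message reaches the stream: the handler level removes DEBUG, and the filter removes WARNING and above.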

class PanACoTA.utils.NoLevelFilter(level)

Bases: Filter

In logging, specify a level that must be ignored by the handler. This is used for the stdout handler: by default, we want to print the DEBUG (for development use) and INFO levels, but not the DETAIL level (which is between DEBUG and INFO). DETAIL is printed only if the verbose option was set.

filter(rec)

Function to decide if given log has to be logged or not, according to its level

Parameters:

rec: current record handled by logger

Returns:

bool: True if level of current log is different from forbidden level, False if it is the same
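A minimal sketch of such a "skip one level" filter, using only the standard library. SkipLevelFilter and make_record are hypothetical names, and the numeric value 15 for DETAIL is an assumption chosen to sit between DEBUG (10) and INFO (20); the real value is the one returned by detail_lvl().

```python
import logging

DETAIL = 15  # assumed value, between logging.DEBUG (10) and logging.INFO (20)


class SkipLevelFilter(logging.Filter):
    """Hypothetical stand-in: reject records whose level equals `level`."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, rec):
        # Keep every record except those at the forbidden level
        return rec.levelno != self.level


def make_record(level):
    """Build a bare record at the given level, for demonstration only."""
    return logging.LogRecord("demo", level, "demo.py", 0, "msg", None, None)


flt = SkipLevelFilter(DETAIL)
results = {lvl: flt.filter(make_record(lvl))
           for lvl in (logging.DEBUG, DETAIL, logging.INFO)}
```

DEBUG and INFO pass the filter, while the level in between is dropped.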

PanACoTA.utils.cat(list_files, output, title=None)

Equivalent of ‘cat’ unix command.

Concatenate all files in ‘list_files’ and save the result in the ‘output’ file. Concatenation uses shutil.copyfileobj, which copies by chunks, to avoid memory problems if files are big.

Parameters:

list_files (list): list of filenames to concatenate
output (str): output filename, where all concatenated files will be written
title (str or None): if you want to show a progressbar while concatenating files, add a title for this progressbar here. If no title, nothing will be shown during concatenation.
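The chunked concatenation described above can be sketched as follows. concat_files is a hypothetical helper, not the library function, and it leaves out the optional progressbar.

```python
import os
import shutil
import tempfile


def concat_files(list_files, output):
    """Append each input file to `output`, copying by chunks."""
    with open(output, "wb") as outf:
        for fname in list_files:
            with open(fname, "rb") as inf:
                # copyfileobj reads and writes fixed-size buffers,
                # so a large file is never held in memory at once
                shutil.copyfileobj(inf, outf)


# Small demonstration on temporary files
tmpdir = tempfile.mkdtemp()
parts = []
for idx, text in enumerate(["first\n", "second\n"]):
    path = os.path.join(tmpdir, f"part{idx}.txt")
    with open(path, "w") as fh:
        fh.write(text)
    parts.append(path)

merged = os.path.join(tmpdir, "merged.txt")
concat_files(parts, merged)
with open(merged) as fh:
    merged_text = fh.read()
shutil.rmtree(tmpdir)
```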

PanACoTA.utils.check_format(info)

Check that the given information (can be the genome name or the date) is in the right format: it should have 4 characters, all alphanumeric.

Parameters:

info (str): information to check

Returns:

bool: True if right format, False otherwise
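The rule above (exactly 4 alphanumeric characters) can be sketched in one line. has_valid_format is a hypothetical name. Note that str.isalnum also accepts non-ASCII letters and digits, which may or may not match the library's check.

```python
def has_valid_format(info):
    """True if `info` has exactly 4 characters, all alphanumeric."""
    return len(info) == 4 and info.isalnum()


checks = {value: has_valid_format(value)
          for value in ("ESCO", "1512", "ESC", "ES.O", "ESCOL")}
```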

PanACoTA.utils.check_installed(cmd)

Check if the command ‘cmd’ is in $PATH and can then be executed

Parameters:

cmd (str): command to run

Returns:

bool: True if installed, False otherwise
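One way to perform this kind of check with the standard library is shutil.which, which searches $PATH for an executable. This is a sketch under that assumption. is_on_path is a hypothetical name, and the library may implement the check differently.

```python
import shutil


def is_on_path(cmd):
    """True if an executable named `cmd` is found in $PATH."""
    return shutil.which(cmd) is not None


# A name that should not exist on any system
missing = is_on_path("surely-not-a-real-command-xyz123")
```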

PanACoTA.utils.check_out_dirs(resdir)

Check that there is no file in:

resdir/LSTINFO
resdir/Genes
resdir/Proteins
resdir/Replicons
resdir/gff3

Parameters:

resdir (str): path to result directory

PanACoTA.utils.count(filein, get='lines')

Similar to ‘wc’ unix command.

Count the number of what is given in ‘get’. It can be:

lines (default)
words

Parameters:

filein (str): path to the file for which we want to count lines or words
get (“lines” or “words”): either lines to count the number of lines in the file, or words to count the number of words.

Returns:

int: Number of lines or words according to value of ‘get’ parameter.
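A minimal sketch of this counting behaviour. count_in_file is a hypothetical name, and raising ValueError for an unknown 'get' value is a choice made for this example; how the library reacts to other values is not stated here.

```python
import os
import tempfile


def count_in_file(filein, get="lines"):
    """Count lines or whitespace-separated words in a text file."""
    if get not in ("lines", "words"):
        raise ValueError("get must be 'lines' or 'words'")
    total = 0
    with open(filein) as fh:
        for line in fh:  # read line by line to keep memory use low
            total += 1 if get == "lines" else len(line.split())
    return total


# Demonstration on a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one two\nthree\n")
nb_lines = count_in_file(tmp.name)
nb_words = count_in_file(tmp.name, get="words")
os.remove(tmp.name)
```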

PanACoTA.utils.detail_lvl()

Get the int level corresponding to “DETAIL”

Returns:

int: int corresponding to the level “DETAIL”

PanACoTA.utils.get_genome_contigs_and_rename(gembase_name, gpath, outfile, logger)

For the given genome (sequence in gpath), rename all its contigs with the new name ‘gembase_name’, and save the output sequence in outfile.

For each renamed contig, save its new name as well as its size. This will be used to generate gff files.

Parameters:

gembase_name (str): genome name to use (species.date.strain)
gpath (str): path to the genome sequence
outfile (str): path to the new file, containing ‘gpath’ sequence, but with ‘gembase_name’ in headers
logger (logging.Logger): logger object to write log information

Returns:

tuple, with:

Dict of all contigs with their original and new name: {>orig_name: >new_name}
Dict of all contigs with their size: {new_name: size}

PanACoTA.utils.grep(filein, pattern, counts=False)

Equivalent of ‘grep’ unix command

By default, returns all the lines containing the given pattern. If counts = True, returns the number of lines containing the pattern.

Parameters:

filein (str): path to the file in which pattern must be searched
pattern (str): pattern to search
counts (bool): True if you want to count how many lines have the pattern and return this number, False if you want to return all lines containing the pattern.

Returns:

list or int: list of lines if counts=False; number of lines if counts=True
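The two return modes can be sketched as below. grep_lines is a hypothetical helper. Treating 'pattern' as a regular expression and stripping the trailing newline from returned lines are both assumptions of this example, not documented behaviour.

```python
import os
import re
import tempfile


def grep_lines(filein, pattern, counts=False):
    """Return matching lines, or their number if counts is True."""
    regex = re.compile(pattern)  # assumption: pattern is a regex
    with open(filein) as fh:
        matches = [line.rstrip("\n") for line in fh if regex.search(line)]
    return len(matches) if counts else matches


# Demonstration: find the headers of a tiny fasta file
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">contig1\nACGT\n>contig2\nTTGA\n")
headers = grep_lines(tmp.name, "^>")
nb_headers = grep_lines(tmp.name, "^>", counts=True)
os.remove(tmp.name)
```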

PanACoTA.utils.init_logger(logfile_base, level, name, log_details=False, verbose=0, quiet=False)

Create logger and its handlers, and set them to the given level

Level hierarchy: CRITICAL > ERROR > WARNING > INFO > DETAIL > DEBUG

Messages from all levels are written in ‘logfile’.log

Messages for levels lower than WARNING (only INFO and DEBUG) are written to stdout

Messages for levels equal to or higher than WARNING are written to stderr

Messages for levels equal to or higher than WARNING are written in logfile.log.err

Parameters:

logfile_base (str): base of filename to use for logs. Will add ‘.log’, ‘.log.details’ and ‘.log.err’ for the 3 log files created
level (int): minimum level that must be considered.
name (str or None): if we need to name the logger (used for tests)
log_details (bool): if True, force creation of .log.details file. Otherwise, just create it if needed according to level
verbose (int): be more verbose:
default (0): info in stdout, error and more in stderr; info and more in *.log; warning and more in *.log.err
1: add warnings in stderr
2: like 1 + add details to stdout (by default only INFO) + add details to *.log.details
>15: add debug to stdout and create *.log.debug with all levels
quiet (bool): True if nothing must be sent to stdout/stderr, False otherwise

PanACoTA.utils.list_to_str(list, sep='\t')

Return a string corresponding to the given list, with all elements separated by ‘sep’ (a tab by default). Used to write a list into a file. Ex, with sep=' ':

[1, 2, "toto"] -> "1 2 toto"

Parameters:

list (list): list of elements that we would like to write
sep (str): Separator to use between the different elements

Returns:

str: the string to write
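A sketch of the conversion described above. join_elements is a hypothetical name. Whether the library appends anything (such as a newline) to the joined string is not stated here, so this example returns the joined elements only.

```python
def join_elements(elements, sep="\t"):
    """Convert each element to str and join them with `sep`."""
    return sep.join(str(elt) for elt in elements)


tabbed = join_elements([1, 2, "toto"])
spaced = join_elements([1, 2, "toto"], sep=" ")
```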

PanACoTA.utils.load_bin(binfile)

Unpickle python objects from the binary file ‘binfile’

Parameters:

binfile (str): path to binary file containing python object

Returns:

Object: The python objects unpickled

PanACoTA.utils.logger_thread(q)

Queue listener used in a thread to handle the logs put to a QueueHandler by several processes (multiprocessing.pool.map_async for example)

Parameters:

q (multiprocessing.managers.AutoProxy[Queue]): queue to listen to
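The listener pattern described above follows the approach shown in the Python logging cookbook: producers push records to a queue, and one thread pulls them and hands them to the real handlers. The sketch below is a simplified, single-process illustration. listen and ListHandler are hypothetical names, a plain queue.Queue replaces the multiprocessing manager queue, and using None as the stop sentinel is an assumption.

```python
import logging
import logging.handlers
import queue
import threading


def listen(q):
    """Pull records from the queue until the None sentinel arrives."""
    while True:
        record = q.get()
        if record is None:  # assumed stop signal
            break
        # Dispatch to the handlers of the logger named in the record
        logging.getLogger(record.name).handle(record)


class ListHandler(logging.Handler):
    """Demo handler that stores messages instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


sink = ListHandler()
target = logging.getLogger("queue_demo")
target.setLevel(logging.INFO)
target.propagate = False
target.addHandler(sink)

q = queue.Queue()
listener = threading.Thread(target=listen, args=(q,))
listener.start()

# Producer side: in real use, worker processes would log through
# a QueueHandler bound to the shared queue
queue_handler = logging.handlers.QueueHandler(q)
for i in range(3):
    rec = logging.LogRecord("queue_demo", logging.INFO, "demo.py", 0,
                            "job %d done", (i,), None)
    queue_handler.emit(rec)

q.put(None)
listener.join()
```

Keeping all real handlers on the listening side means only one thread writes to the log files, whatever the number of producers.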

PanACoTA.utils.plot_distr(values, limit, title, text, logger)

Plot histogram of given ‘values’, add a vertical line corresponding to the chosen ‘limit’, and return the mpl figure

Parameters:

values (list): list of values
limit (int): limit for which a vertical line must be drawn
title (str): Title to give to plot
text (str): text to write near the vertical line representing the limit
logger (logging.Logger): logger object to write log information

Returns:

matplotlib.figure.Figure: figure generated

PanACoTA.utils.read_genomes(list_file, name, date, dbpath, tmp_path, logger)

Read list of genomes, and return them. If a genome has a name, also return it. Otherwise, return the name given by user.

Check that the given genome file exists in dbpath. Otherwise, write an error message, and ignore this file.

Parameters:

list_file (str): input file containing the list of genomes
name (str): Default species name
date (str): Default date
dbpath (str): path to folder containing original genome files
tmp_path (str): path to folder which will contain the genome files to use before annotation, if needed to change them from original file (for example, merging several contig files in one file, split at each stretch of 5 ‘N’, etc.).
logger (logging.Logger): logger object to write log information

Returns:

dict: {genome: spegenus.date}, where spegenus.date = name.date

PanACoTA.utils.read_genomes_info(list_file, name, date=None, logger=None)

Read a lstinfo file containing the list of genomes with information (L90, genome size etc.). 1 line per genome, 4 required columns (others will just be ignored): to_annotate gsize nb_conts L90

Check that the given genome file (to_annotate column) exists.

Parameters:

list_file (str): input file containing information on genomes (to_annotate, size, L90, nb_contigs)
name (str): Default species name
date (str): Default date
logger (logging.Logger): logger object to write log information

Returns:

dict: genomes = {genome: [spegenus.date, path_orig_seq, path_to_splitSequence, size, nbcont, l90]}

PanACoTA.utils.read_info(name_inf, name, date, genomes_inf)

From the given information in ‘name_inf’, check if there is a name (and if its format is ok) and if there is a date (and if its format is ok). If no name (resp. no date), return default name (resp. default date).

Parameters:

name_inf (str): information on current genome, which could contain a species name and a date
name (str): default species name
date (str): default date
genomes_inf (str): current genome filename. Used to complete information when there is a warning (species name or date given not in the right format…)

Returns:

tuple: (cur_name, cur_date)

with:

cur_name: name to use for this genome (can be the default one, or the one read from ‘name_inf’)
cur_date: date to use for this genome (default or read from ‘name_inf’)

PanACoTA.utils.remove(infile)

Remove the given file if it exists

Parameters:

infile (str): path to file to remove

PanACoTA.utils.run_cmd(cmd, error, eof=False, **kwargs)

Run the given command line. If the return code is not 0, print the error message. If eof (exit on fail) is True, exit the program if the return code is not 0.

Parameters:

cmd (str): command to run
error (str): error message to print if error while running command
eof (bool): True: exit program if command failed, False: do not exit even if command fails
kwargs (Object): Can provide a logger, stdout and/or stderr streams

Returns:

subprocess.Popen: returns object of subprocess call (has attributes returncode, pid, communicate etc.)
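The run-then-check flow above can be sketched with subprocess. run_and_report is a hypothetical helper: splitting the command with shlex, capturing the output in pipes, printing the error to stderr and exiting with status 1 are all assumptions of this example, and the keyword arguments (logger, streams) are left out.

```python
import shlex
import subprocess
import sys


def run_and_report(cmd, error, eof=False):
    """Run `cmd`; print `error` on failure; exit if `eof` is True."""
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    proc.communicate()  # wait for the command to finish
    if proc.returncode != 0:
        print(error, file=sys.stderr)
        if eof:
            sys.exit(1)  # assumed exit status
    return proc


# Use the current interpreter as a portable test command
python = shlex.quote(sys.executable)
ok = run_and_report(f"{python} -c 'import sys; sys.exit(0)'",
                    "should not be printed")
failed = run_and_report(f"{python} -c 'import sys; sys.exit(3)'",
                        "command failed")
```

Returning the process object lets the caller inspect returncode even when eof is False.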

PanACoTA.utils.save_bin(objects, fileout)

Save python ‘objects’ in a binary file called ‘fileout’

Parameters:

objects (Object): python object to save
fileout (str): path to binary file where objects must be saved
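Saving and reloading Python objects in a binary file is typically done with pickle, as the load_bin description suggests. The round trip below is a sketch with hypothetical helper names (dump_objects, load_objects). As with any pickle file, only load files from a trusted source.

```python
import os
import pickle
import tempfile


def dump_objects(objects, fileout):
    """Pickle `objects` into the binary file `fileout`."""
    with open(fileout, "wb") as fh:
        pickle.dump(objects, fh)


def load_objects(binfile):
    """Unpickle and return the content of `binfile`."""
    with open(binfile, "rb") as fh:
        return pickle.load(fh)


# Round trip on a small {family_num: [members]} dictionary
families = {1: ["ESCO.1512.00001.i0001_00005"],
            2: ["ESCO.1512.00002.i0001_00010"]}
tmpdir = tempfile.mkdtemp()
binpath = os.path.join(tmpdir, "families.bin")
dump_objects(families, binpath)
reloaded = load_objects(binpath)
os.remove(binpath)
os.rmdir(tmpdir)
```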

PanACoTA.utils.sort_genomes_by_name(x)

Order by:

species
in each species, by strain number

Parameters:

x (tuple or str): [genome_orig, [gembase, path, gsize, nbcont, L90]] with gembase = species.date.strain

Returns:

str: variable to take into account for sorting. If format is ESCO.1512.00001 return ESCO and 00001. Otherwise, just return x itself (sort by alphabetical order)
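A sort key with this behaviour could look like the sketch below. genome_sort_key is a hypothetical name. Converting the strain to int and returning (name, 0) for names in another format are choices made so that all keys stay comparable; the library's fallback is described above as returning x itself.

```python
def genome_sort_key(x):
    """Key on (species, strain number), ignoring the date field."""
    # x is either a name, or [genome_orig, [gembase, ...]]
    name = x if isinstance(x, str) else x[1][0]
    fields = name.split(".")
    if len(fields) == 3 and fields[2].isdigit():
        return (fields[0], int(fields[2]))
    return (name, 0)  # fallback: plain alphabetical order


names = ["ESCO.1512.00002", "ACBA.0417.00002", "ESCO.1612.00001"]
ordered = sorted(names, key=genome_sort_key)
```

Because the date is not part of the key, ESCO.1612.00001 sorts before ESCO.1512.00002.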

PanACoTA.utils.sort_genomes_byname_l90_nbcont(x)

Sort all genomes with the following criteria:

sort by species (x[1][0] is species.date)
for each species, sort by l90
for same l90, sort by nb contigs

Parameters:

x ([[]]): [genome_name, [species.date, path, path_to_seq, gsize, nbcont, L90]]

Returns:

tuple: information on species, l90 and nb_contigs
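The three criteria above map naturally onto a tuple key. The sketch below assumes the documented layout, with nbcont and L90 as the last two elements of the inner list. quality_sort_key is a hypothetical name, and keeping only the part of species.date before the dot as "species" is an assumption.

```python
def quality_sort_key(x):
    """Key on (species, L90, number of contigs)."""
    info = x[1]
    species = info[0].split(".")[0]  # assumption: drop the date part
    nbcont, l90 = info[-2], info[-1]
    return (species, l90, nbcont)


genomes = {
    "g1.fasta": ["ESCO.1512", "orig", "seq", 5000000, 40, 12],
    "g2.fasta": ["ESCO.1512", "orig", "seq", 5100000, 25, 12],
    "g3.fasta": ["ACBA.0417", "orig", "seq", 4000000, 80, 30],
    "g4.fasta": ["ESCO.1512", "orig", "seq", 4900000, 90, 8],
}
ranked = [name for name, _ in sorted(genomes.items(), key=quality_sort_key)]
```

Within ESCO, g4 comes first (lowest L90), then g2 before g1 because they share the same L90 and g2 has fewer contigs.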

PanACoTA.utils.sort_genomes_l90_nbcont(x)

Sort all genomes with the following criteria:

for each strain, sort by l90
for same l90, sort by nb contigs

Parameters:

x ([[]]): [genome_name, [species.date, path, gsize, nbcont, L90]]

Returns:

tuple: information on l90 and nb_contigs

PanACoTA.utils.sort_proteins(x)

Order by:

species
in each species, strain number
in each species and strain number, by protein number

Parameters:

x (str): species.date.strain.contig_protnum

Returns:

str: variable to take into account for sorting. If format is ESCO.1512.00001.i0002_12124, return ESCO, 00001 and 12124. If not, it must be something_00001: return something and 00001.
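A key function for the two documented name formats might be sketched as follows. protein_sort_key is a hypothetical name. Converting the strain and protein numbers to int, and using 0 as the strain for the something_00001 format, are assumptions of this example; names outside both formats are not handled.

```python
def protein_sort_key(x):
    """Key on (species, strain number, protein number)."""
    prefix, _, protnum = x.rpartition("_")
    fields = prefix.split(".")
    if len(fields) == 4:
        # species.date.strain.contig_protnum
        return (fields[0], int(fields[2]), int(protnum))
    # something_00001
    return (prefix, 0, int(protnum))


proteins = [
    "ESCO.1512.00002.i0001_00005",
    "ESCO.1512.00001.b0003_00120",
    "ESCO.1512.00001.i0002_00013",
    "ACBA.0417.00001.i0001_00001",
]
sorted_proteins = sorted(proteins, key=protein_sort_key)
```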

PanACoTA.utils.thread_progressbar(widgets, stop)

Thread running an “infinite” progress bar, while the main thread is working. Once this progressbar has to stop, we send a signal.

Parameters:

widgets (list): list of widgets to put in the progressbar
stop (function): function returning False when thread can run, True when it has to stop.

PanACoTA.utils.write_genomes_info(genomes, kept_genomes, list_file, res_path, qc=False)

Write the list of discarded genomes to a file (qc=False), so that users can keep a trace of them, with their information (nb contigs, L90 etc.)

If qc=True, we stop after QC: write the list of genomes that would be kept for annotation, with all their information (L90, size, #contig)

Parameters:

genomes (dict): {genome: [gembase_start_name, orig_seq_file, to_annotate_seq_file, genome_size, nb_contigs, L90]}
kept_genomes (list): list of genomes kept
list_file (str): path to input file containing the list of genomes
res_path (str): folder where results must be saved
qc (bool): if True, called only in QC-only mode: name the file info-genomes-<list_file>.txt and write information on the genomes that would be annotated if not QC only. Otherwise (False), called in any case: name the file discarded-<list_file>.txt and write all discarded genomes, whether the kept sequences are annotated next or not. Columns: orig_name, to_annotate, gsize, nb_conts, L90

PanACoTA.utils.write_list(list_names, fileout)

Write the given list of strings to a file, 1 per line

PanACoTA.utils.write_lstinfo(list_file, genomes, outdir)

Write lstinfo file, with following columns: gembase_name, orig_name, to_annotate_name, size, nbcontigs, l90

Parameters:

list_file (str): input file containing the list of genomes
genomes (dict): {genome: [gembase_start_name, seq_file, seq_to_annotate, genome_size, nb_contigs, L90]}
outdir (str): folder where results must be saved

PanACoTA.utils.write_warning_skipped(skipped, do_format=False, prodigal_only=False, logfile='')

At the end of the script, write a warning to the user with the names of the genomes which had problems with prokka.

Parameters:

skipped (list): list of genomes with problems
do_format (bool): if False, genomes were not skipped because of the format step, but before that. If True, they were skipped because of format
prodigal_only (bool): if False, prokka was used to annotate. If True, prodigal was used to annotate

`PanACoTA.utils_pangenome` submodule

Functions used to deal with pangenome file

@author gem April 2017

PanACoTA.utils_pangenome.get_fams_info(families, logger)

From all families as list of members, get more information:

all strains found, sorted by species name
for each family, sort members by strain

Parameters:

families (dict): {num: [members]}
logger (logging.Logger): logger object to write log information

Returns:

tuple: (fams_by_strain, sorted_all_strains)

with:

fams_by_strain: {fam_num: {strain: [members], strain: [members]}}
sorted_all_strains: list of all strains found, sorted by species
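The two returned structures can be illustrated with a small grouping routine. group_by_strain is a hypothetical name. Taking the first three dot-separated fields of a member name (species.date.strain) as its strain, and ordering strains by species then strain number, are assumptions based on the name formats documented on this page.

```python
from collections import defaultdict


def group_by_strain(families):
    """Build {fam_num: {strain: [members]}} and the sorted strain list."""
    fams_by_strain = {}
    all_strains = set()
    for num, members in families.items():
        per_strain = defaultdict(list)
        for member in members:
            # assumption: strain = species.date.strain
            strain = ".".join(member.split(".")[:3])
            per_strain[strain].append(member)
            all_strains.add(strain)
        fams_by_strain[num] = dict(per_strain)
    sorted_strains = sorted(
        all_strains,
        key=lambda s: (s.split(".")[0], int(s.split(".")[2])))
    return fams_by_strain, sorted_strains


families = {
    1: ["ESCO.1512.00002.i0001_00005", "ACBA.0417.00001.i0001_00001"],
    2: ["ESCO.1512.00001.i0001_00010", "ESCO.1512.00001.i0002_00044"],
}
fams_by_strain, sorted_all_strains = group_by_strain(families)
```

Family 2 has two members from the same strain, so they end up in a single list under that strain.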

PanACoTA.utils_pangenome.read_gene(gene, num, fams_by_strain, all_strains)

Read information from a given gene name, and save it to appropriate dicts

Parameters:

gene (str): gene name (species.date.strain.contig_number)
num (str): num of family from which the given gene is
fams_by_strain (dict): {fam_num: {strain: [members]}}
all_strains (set): set of all strains

PanACoTA.utils_pangenome.read_lstinfo(lstinfo, logger)

Read lstinfo file and return list of genomes

Parameters:

lstinfo (str): File containing the list of all genomes to include in the pan-genome, 1 genome per line. Here, only the first column will be used.
logger (logging.Logger): logger object to write log information

Returns:

list: list of genomes

PanACoTA.utils_pangenome.read_pan_file(filein, logger)

Read PanGenome file in ‘filein’, and put it into Python objects

Parameters:

filein (str): path to pangenome file
logger (logging.Logger): logger object to write log information

Returns:

tuple: (fams_by_strain, families, sort_all_strains)

with:

fams_by_strain: {fam_num: {strain: [members]}}
families: {fam_num: [all members]}
sort_all_strains: list of all genome names, sorted by species name

PanACoTA.utils_pangenome.read_pangenome(pangenome, logger, families=None)

Read pangenome information

Read pangenome according to what is available. First, check if python objects are available; if not, search for the binary file; if there is none, read the text file.

Parameters:

pangenome (str): path to pangenome file
logger (logging.Logger): logger object to write log information
families (dict or None): {num: [members]} if families are given. If not (must read them from binary file if exists or pangenome file otherwise), None.

Returns:

tuple: (fams_by_strain, families, all_strains)

with:

fams_by_strain: {fam_num: {strain: [members]}}
families: {fam_num: [all members]}
all_strains: list of all genome names
