PanACoTA submodules

These submodules contain utility functions.

PanACoTA.utils submodule

Util functions and classes.

@author gem April 2017

class PanACoTA.utils.LessThanFilter(level)

Bases: Filter

In Python logging, the level set on a handler is a minimum: records at that level or higher are all emitted. To emit only records below a given level (nothing at or above it), add this filter to the handler: handler.addFilter(LessThanFilter(level))

filter(rec)

Function to decide if given log has to be logged or not, according to its level

Parameters:
rec

current record handled by logger
Returns:
bool

True if level of current log is less than the defined limit, False otherwise
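
The behaviour described above can be sketched with the standard logging module alone. This is an illustrative re-implementation, not the PanACoTA source: the class name LessThanFilterSketch and the logger name are hypothetical, and the strict "less than" comparison is taken from the description of the return value.

```python
import io
import logging


class LessThanFilterSketch(logging.Filter):
    """Keep only records whose level is strictly below `level` (illustrative)."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, rec):
        # True: the handler emits the record; False: the record is dropped
        return rec.levelno < self.level


stream = io.StringIO()  # stands in for sys.stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setLevel(logging.DEBUG)                           # minimum level
handler.addFilter(LessThanFilterSketch(logging.WARNING))  # exclusive upper bound

logger = logging.getLogger("less_than_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

logger.info("kept")        # below WARNING: emitted
logger.warning("dropped")  # WARNING itself: filtered out
captured = stream.getvalue()
```

With this setup only the INFO message reaches the stream; the WARNING record is rejected by the filter even though it passes the handler's minimum level.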

class PanACoTA.utils.NoLevelFilter(level)

Bases: Filter

When logging, specify a level that the handler must ignore. This is used for the stdout handler: by default, we want to print the DEBUG (for development use) and INFO levels, but not the DETAIL level (which lies between DEBUG and INFO). DETAIL messages are printed only if the verbose option is set.

filter(rec)

Function to decide if given log has to be logged or not, according to its level

Parameters:
rec

current record handled by logger
Returns:
bool

True if level of current log is different from forbidden level, False if it is the same
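
As a rough illustration of excluding one level, the sketch below registers a custom level between DEBUG and INFO and filters it out of a handler. The numeric value 15, the class name NoLevelFilterSketch and the logger name are assumptions made for the example; only the filtering rule comes from the text above.

```python
import io
import logging

DETAIL = 15  # assumed value, chosen between DEBUG (10) and INFO (20)
logging.addLevelName(DETAIL, "DETAIL")


class NoLevelFilterSketch(logging.Filter):
    """Reject records whose level equals the forbidden `level` (illustrative)."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, rec):
        return rec.levelno != self.level


stream = io.StringIO()  # stands in for sys.stdout
handler = logging.StreamHandler(stream)
handler.setLevel(logging.DEBUG)
handler.addFilter(NoLevelFilterSketch(DETAIL))

logger = logging.getLogger("no_level_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

logger.debug("debug message")
logger.log(DETAIL, "detail message")  # the only record that is filtered out
logger.info("info message")
captured = stream.getvalue().splitlines()
```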

PanACoTA.utils.cat(list_files, output, title=None)

Equivalent of ‘cat’ unix command.

Concatenate all files in ‘list_files’ and save the result in the ‘output’ file. Concatenation uses shutil.copyfileobj, which copies in chunks and so avoids memory problems with big files.

Parameters:
list_files : list

list of filenames to concatenate

output : str

output filename, where all concatenated files will be written

title : str or None

if you want to show a progressbar while concatenating files, add a title for this progressbar here. If no title, nothing will be shown during concatenation.
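
A minimal sketch of chunked concatenation with shutil.copyfileobj follows. The function name cat_sketch is hypothetical, and the optional progress bar controlled by ‘title’ is left out.

```python
import os
import shutil
import tempfile


def cat_sketch(list_files, output):
    """Write the content of every file in list_files, in order, to output."""
    with open(output, "wb") as outf:
        for path in list_files:
            with open(path, "rb") as inf:
                # copyfileobj reads and writes by chunks, never the whole file at once
                shutil.copyfileobj(inf, outf)


# Small demonstration on two temporary files
workdir = tempfile.mkdtemp()
parts = []
for idx, text in enumerate(["first\n", "second\n"]):
    path = os.path.join(workdir, f"part{idx}.txt")
    with open(path, "w") as fh:
        fh.write(text)
    parts.append(path)

merged = os.path.join(workdir, "merged.txt")
cat_sketch(parts, merged)
with open(merged) as fh:
    merged_text = fh.read()
```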

PanACoTA.utils.check_format(info)

Check that the given information (either the genome name or the date) is in the right format: it must be exactly 4 characters long, all alphanumeric.

Parameters:
info : str

information to check

Returns:
bool

True if right format, False otherwise
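
One way to express this rule is a regular expression. The sketch below assumes that "alphanumeric" means ASCII letters and digits; the function name is hypothetical.

```python
import re


def check_format_sketch(info):
    """True if info is exactly 4 ASCII letters or digits (assumed rule)."""
    return re.fullmatch(r"[A-Za-z0-9]{4}", info) is not None
```

For example, "ESCO" and "1512" pass, while "ESC" (too short) and "ES.O" (contains a dot) do not.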

PanACoTA.utils.check_installed(cmd)

Check if the command ‘cmd’ is in $PATH and can then be executed

Parameters:
cmd : str

command to run

Returns:
bool

True if installed, False otherwise
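
A check of this kind can be sketched with shutil.which, which looks a command up in $PATH. Whether PanACoTA uses this call is an assumption; the function name is hypothetical.

```python
import shutil


def check_installed_sketch(cmd):
    """True if an executable named cmd is found in $PATH."""
    return shutil.which(cmd) is not None
```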

PanACoTA.utils.check_out_dirs(resdir)

Check that there is no file in:

  • resdir/LSTINFO

  • resdir/Genes

  • resdir/Proteins

  • resdir/Replicons

  • resdir/gff3

Parameters:
resdir : str

path to result directory

PanACoTA.utils.count(filein, get='lines')

Similar to ‘wc’ unix command.

Count the number of what is given in ‘get’. It can be:

  • lines (default)

  • words

Parameters:
filein : str

path to the file for which we want to count lines or words

get : [“lines”, “words”]

either lines to count the number of lines in the file, or words to count the number of words.

Returns:
int

Number of lines or words according to value of ‘get’ parameter.
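
The counting logic might look like the sketch below. The function name is hypothetical, words are assumed to be whitespace-separated, and raising ValueError for an unknown ‘get’ value is an assumption, since the text above does not say how that case is handled.

```python
import os
import tempfile


def count_sketch(filein, get="lines"):
    """Count lines or whitespace-separated words in filein."""
    if get not in ("lines", "words"):
        # Assumption: the documented function may react differently here
        raise ValueError("get must be 'lines' or 'words'")
    total = 0
    with open(filein) as fh:
        for line in fh:
            total += 1 if get == "lines" else len(line.split())
    return total


# Small demonstration file
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "demo.txt")
with open(demo_file, "w") as fh:
    fh.write("one two three\nfour five\n")
```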

PanACoTA.utils.detail_lvl()

Get the int level corresponding to “DETAIL”

Returns:
int

int corresponding to the level “DETAIL”

PanACoTA.utils.get_genome_contigs_and_rename(gembase_name, gpath, outfile, logger)

For the given genome (sequence in gpath), rename all its contigs with the new name: ‘gembase_name’, and save the output sequence in outfile.

For each contig renamed, save its new name as well as its size. This will be used to generate gff files

Parameters:
gembase_name : str

genome name to use (species.date.strain)

gpath : str

path to the genome sequence

outfile : str

path to the new file, containing ‘gpath’ sequence, but with ‘gembase_name’ in headers

Returns:
tuple

  • dict of all contigs with their original and new names: {>orig_name: >new_name}

  • dict of all contigs with their sizes: {new_name: size}

PanACoTA.utils.grep(filein, pattern, counts=False)

Equivalent of ‘grep’ unix command

By default, returns all the lines containing the given pattern. If counts = True, returns the number of lines containing the pattern.

Parameters:
filein : str

path to the file in which pattern must be searched

pattern : str

pattern to search

counts : bool

True if you want to count how many lines have the pattern and return this number, False if you want to return all lines containing the pattern.

Returns:
list or int

list of lines if counts=False; number of lines if counts=True
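
The two return modes can be sketched as follows. The function name is hypothetical; treating ‘pattern’ as a regular expression and stripping the trailing newline from returned lines are assumptions.

```python
import os
import re
import tempfile


def grep_sketch(filein, pattern, counts=False):
    """Return matching lines, or their number if counts is True."""
    regex = re.compile(pattern)  # assumption: pattern is a regular expression
    with open(filein) as fh:
        lines = [line.rstrip("\n") for line in fh if regex.search(line)]
    return len(lines) if counts else lines


# Small demonstration file
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "demo.tsv")
with open(demo_file, "w") as fh:
    fh.write("gene\tA\nprotein\tB\ngene\tC\n")
```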

PanACoTA.utils.init_logger(logfile_base, level, name, log_details=False, verbose=0, quiet=False)

Create logger and its handlers, and set them to the given level

Level hierarchy: CRITICAL > ERROR > WARNING > INFO > DETAIL > DEBUG

Messages of all levels are written to ‘logfile_base’.log

Messages of levels lower than WARNING (INFO and DEBUG only) are written to stdout

Messages of level WARNING or higher are written to stderr

Messages of level WARNING or higher are written to ‘logfile_base’.log.err

Parameters:
logfile_base : str

base of filename to use for logs. Will add ‘.log’, ‘.log.details’ and ‘.log.err’ for the 3 log files created

level : int

minimum level that must be considered.

name : str or None

if we need to name the logger (used for tests)

log_details : bool

if True, force creation of .log.details file. Otherwise, just create it if needed according to level

verbose : int

be more verbose:

  • default (0): info in stdout, error and more in stderr; info and more in *.log; warning and more in *.log.err

  • 1: add warnings in stderr

  • 2: like 1 + add details to stdout (by default only INFO) + add details to *.log.details

  • >15: add debug to stdout and create *.log.debug with all levels

quiet : bool

True if nothing must be sent to stdout/stderr, False otherwise
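
The stdout/stderr routing described for this function can be sketched with two stream handlers. This is not the PanACoTA implementation: the helper name build_logger_sketch is hypothetical, the log files and the verbose and quiet options are left out, and in-memory streams stand in for sys.stdout and sys.stderr.

```python
import io
import logging


def build_logger_sketch(name, out_stream, err_stream, level=logging.DEBUG):
    """Send records below WARNING to out_stream, WARNING and above to err_stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False

    out_handler = logging.StreamHandler(out_stream)
    out_handler.setLevel(level)
    # A plain callable is accepted as a filter: keep only levels below WARNING
    out_handler.addFilter(lambda rec: rec.levelno < logging.WARNING)

    err_handler = logging.StreamHandler(err_stream)
    err_handler.setLevel(logging.WARNING)  # a handler level is a minimum

    logger.addHandler(out_handler)
    logger.addHandler(err_handler)
    return logger


out, err = io.StringIO(), io.StringIO()
log = build_logger_sketch("routing_demo", out, err)
log.info("progress")
log.error("failure")
```

In real use, sys.stdout and sys.stderr would be passed in place of the two StringIO objects.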

PanACoTA.utils.list_to_str(list, sep='\t')

Return a string corresponding to the given list, with all elements separated by ‘sep’ (a tab by default). Used to write a list into a file. Ex, with sep=' ':

[1, 2, "toto"] -> "1 2 toto"

Parameters:
list : list

list of elements that we would like to write

sep : str

Separator to use between the different elements

Returns:
str

the string to write
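
A sketch of the conversion, assuming each element is turned into text with str() before joining (the function and parameter names are hypothetical):

```python
def list_to_str_sketch(items, sep="\t"):
    """Join the string form of every element with sep."""
    return sep.join(str(item) for item in items)


with_space = list_to_str_sketch([1, 2, "toto"], sep=" ")
with_tab = list_to_str_sketch([1, 2, "toto"])
```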

PanACoTA.utils.load_bin(binfile)

Unpickle python objects from the binary file ‘binfile’

Parameters:
binfile : str

path to binary file containing python object

Returns:
Object

The python objects unpickled

PanACoTA.utils.logger_thread(q)

Queue listener used in a thread to handle the logs put to a QueueHandler by several processes (multiprocessing.pool.map_async for example)

Parameters:
q : multiprocessing.managers.AutoProxy[Queue]

queue to listen
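
The listener pattern follows the one shown in the Python logging cookbook. The sketch below is a single-process illustration: the function name, the None sentinel used to stop the thread, and the logger name are assumptions. To stay in one process, a record is put on the queue by hand where worker processes would normally use a logging.handlers.QueueHandler.

```python
import io
import logging
import queue
import threading


def logger_thread_sketch(q):
    """Handle every record read from q until a None sentinel arrives."""
    while True:
        record = q.get()
        if record is None:  # assumed stop signal
            break
        logging.getLogger(record.name).handle(record)


# Logger that finally writes the records (StringIO stands in for a real stream)
stream = io.StringIO()
target = logging.getLogger("queue_demo")
target.setLevel(logging.INFO)
target.propagate = False
target.addHandler(logging.StreamHandler(stream))

q = queue.Queue()
listener = threading.Thread(target=logger_thread_sketch, args=(q,))
listener.start()

# Stand-in for a record sent by a worker through a QueueHandler
q.put(logging.makeLogRecord({
    "name": "queue_demo",
    "levelno": logging.INFO,
    "levelname": "INFO",
    "msg": "hello from a worker",
}))
q.put(None)  # ask the listener to stop
listener.join()
captured = stream.getvalue()
```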

PanACoTA.utils.plot_distr(values, limit, title, text, logger)

Plot a histogram of the given ‘values’, add a vertical line at the chosen ‘limit’, and return the matplotlib figure

Parameters:
values : list

list of values

limit : int

limit for which a vertical line must be drawn

title : str

Title to give to plot

text : str

text to write near the vertical line representing the limit

logger : logging.Logger

logger object to write log information

Returns:
matplotlib.figure.Figure

figure generated

PanACoTA.utils.read_genomes(list_file, name, date, dbpath, tmp_path, logger)

Read list of genomes, and return them. If a genome has a name, also return it. Otherwise, return the name given by user.

Check that each given genome file exists in dbpath. If it does not, log an error message and ignore this file.

Parameters:
list_file : str

input file containing the list of genomes

name : str

Default species name

date : str

Default date

dbpath : str

path to folder containing original genome files

tmp_path : str

path to folder which will contain the genome files to use before annotation, if needed to change them from original file (for example, merging several contig files in one file, split at each stretch of 5 ‘N’, etc.).

Returns:
dict

{genome: spegenus.date}, where spegenus.date = name.date

PanACoTA.utils.read_genomes_info(list_file, name, date=None, logger=None)

Read a lstinfo file containing the list of genomes with their information (L90, genome size, etc.). It has 1 line per genome and 4 required columns (any others are ignored): to_annotate, gsize, nb_conts, L90

Check that the given genome file (to_annotate column) exists.

Parameters:
list_file : str

input file containing information on genomes (to_annotate, size, L90, nb_contigs)

name : str

Default species name

date : str

Default date

logger : logging.Logger

logger object to write log information

Returns:
dict
genomes = {genome: [spegenus.date, path_orig_seq, path_to_splitSequence, size, nbcont, l90]}

PanACoTA.utils.read_info(name_inf, name, date, genomes_inf)

From the given information in ‘name_inf’, check if there is a name (and if its format is ok) and if there is a date (and if its format is ok). If no name (resp. no date), return default name (resp. default date).

Parameters:
name_inf : str

information on current genome, which could contain a species name and a date

name : str

default species name

date : str

default date

genomes_inf : str

current genome filename. Used to complete information when there is a warning (species name or date given not in the right format…)

Returns:
(cur_name, cur_date) : tuple

with:

  • cur_name: name to use for this genome (can be the default one, or the one read from ‘name_inf’)

  • cur_date: date to use for this genome (default or read from ‘name_inf’)

PanACoTA.utils.remove(infile)

Remove the given file if it exists

Parameters:
infile : str

path to file to remove

PanACoTA.utils.run_cmd(cmd, error, eof=False, **kwargs)

Run the given command line. If the return code is not 0, print the error message. If eof (exit on fail) is True, also exit the program when the return code is not 0.

Parameters:
cmd : str

command to run

error : str

error message to print if error while running command

eof : bool

True: exit program if command failed, False: do not exit even if command fails

kwargs : Object

Can provide a logger, stdout and/or stderr streams

Returns:
subprocess.Popen

returns object of subprocess call (has attributes returncode, pid, communicate etc.)
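
The control flow described above might be sketched as follows. The function name is hypothetical, and several details are assumptions: the command string is split with shlex, the error message goes to stderr instead of a logger, and the exit status on failure is the command's return code.

```python
import shlex
import subprocess
import sys


def run_cmd_sketch(cmd, error, eof=False, stdout=None, stderr=None):
    """Run cmd, report `error` on a non-zero return code, optionally exit."""
    proc = subprocess.Popen(shlex.split(cmd), stdout=stdout, stderr=stderr)
    proc.wait()
    if proc.returncode != 0:
        print(error, file=sys.stderr)  # assumption: a logger could be used instead
        if eof:
            sys.exit(proc.returncode)
    return proc


# Demonstration with the current Python interpreter as the external command
ok = run_cmd_sketch(f'"{sys.executable}" -c "pass"', "command failed")
ko = run_cmd_sketch(f'"{sys.executable}" -c "raise SystemExit(3)"', "command failed")
```

With eof left at False, the second call prints the error message and still returns the process object, so the caller can inspect its returncode.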

PanACoTA.utils.save_bin(objects, fileout)

Save python ‘objects’ in a binary file called ‘fileout’

Parameters:
objects : Object

python object to save

fileout : str

path to binary file where objects must be saved
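
Saving and loading through the standard pickle module can be sketched as a round trip. Both function names are hypothetical. Note that unpickling executes code stored in the file, so only trusted files should be loaded.

```python
import os
import pickle
import tempfile


def save_bin_sketch(objects, fileout):
    """Pickle objects into the binary file fileout."""
    with open(fileout, "wb") as fh:
        pickle.dump(objects, fh)


def load_bin_sketch(binfile):
    """Unpickle and return the content of binfile (trusted files only)."""
    with open(binfile, "rb") as fh:
        return pickle.load(fh)


families = {1: ["geneA", "geneB"], 2: ["geneC"]}  # made-up example data
binfile = os.path.join(tempfile.mkdtemp(), "families.bin")
save_bin_sketch(families, binfile)
reloaded = load_bin_sketch(binfile)
```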

PanACoTA.utils.sort_genomes_by_name(x)

order by:

  • species

  • in each species, by strain number

Parameters:
x : tuple or str

[genome_orig, [gembase, path, gsize, nbcont, L90]] with gembase = species.date.strain

Returns:
str

variable to take into account for sorting. If format is ESCO.1512.00001 return ESCO and 00001. Otherwise, just return x itself (sort by alphabetical order)
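
A sort key of this kind could be written as below. This is a sketch under assumptions: the key is returned as a tuple, the strain number is compared as an integer, and the function name and genome names are made up for the example.

```python
def genome_name_key_sketch(x):
    """Key on (species, strain number) for names like ESCO.1512.00001."""
    # x is either the gembase name itself or [genome_orig, [gembase, ...]]
    name = x if isinstance(x, str) else x[1][0]
    fields = name.split(".")
    if len(fields) == 3:
        return (fields[0], int(fields[2]))
    return (name,)  # any other format: alphabetical order


names = ["ESCO.1512.00010", "ESCO.1512.00002", "ACBA.0417.00001"]
ordered = sorted(names, key=genome_name_key_sketch)
```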

PanACoTA.utils.sort_genomes_byname_l90_nbcont(x)

Sort all genomes with the following criteria:

  • sort by species (x[1][0] is species.date)

  • for each species, sort by l90

  • for same l90, sort by nb contigs

Parameters:
x : [[]]

[genome_name, [species.date, path, path_to_seq, gsize, nbcont, L90]]

Returns:
tuple

information on species, l90 and nb_contigs
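
The three criteria translate naturally into a tuple key. In the sketch below, the function name and the genome data are invented for illustration, and taking the species as the part of species.date before the first dot is an assumption.

```python
def byname_l90_nbcont_key_sketch(x):
    """Key on (species, L90, number of contigs)."""
    info = x[1]  # [species.date, path, path_to_seq, gsize, nbcont, L90]
    species = info[0].split(".")[0]
    return (species, info[-1], info[-2])


# Made-up genomes: name -> [species.date, path, path_to_seq, gsize, nbcont, L90]
genomes = {
    "g1.fa": ["ESCO.1512", "g1.fa", "g1.fa", 5000000, 40, 12],
    "g2.fa": ["ESCO.1512", "g2.fa", "g2.fa", 5100000, 25, 12],
    "g3.fa": ["ACBA.0417", "g3.fa", "g3.fa", 3900000, 90, 30],
    "g4.fa": ["ESCO.1512", "g4.fa", "g4.fa", 4900000, 60, 8],
}
ordered = [name for name, _ in sorted(genomes.items(), key=byname_l90_nbcont_key_sketch)]
```

Tuples compare element by element, so the number of contigs only matters when species and L90 are equal, as for g1.fa and g2.fa here.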

PanACoTA.utils.sort_genomes_l90_nbcont(x)

Sort all genomes with the following criteria:

  • for each strain, sort by l90

  • for same l90, sort by nb contigs

Parameters:
x : [[]]

[genome_name, [species.date, path, gsize, nbcont, L90]]

Returns:
tuple

information on l90 and nb_contigs

PanACoTA.utils.sort_proteins(x)

order by:

  • species

  • in each species, strain number

  • in each species and strain number, by protein number

Parameters:
x : str

species.date.strain.contig_protnum

Returns:
str

variable to take into account for sorting. If format is ESCO.1512.00001.i0002_12124, return ESCO, 00001 and 12124. If not, it must be something_00001: return something and 00001.
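
The two name formats can be handled with a key like the one sketched here. The function name and protein names are made up, and comparing the numbers as integers and returning a tuple are assumptions.

```python
def protein_key_sketch(x):
    """Key on (species, strain number, protein number), or (prefix, number)."""
    prefix, _, protnum = x.rpartition("_")  # split on the last underscore
    fields = prefix.split(".")
    if len(fields) >= 3:
        # species.date.strain.contig_protnum
        return (fields[0], int(fields[2]), int(protnum))
    # something_00001
    return (prefix, int(protnum))


proteins = [
    "ESCO.1512.00002.i0001_00005",
    "ESCO.1512.00001.i0002_12124",
    "ESCO.1512.00001.i0001_00007",
]
ordered = sorted(proteins, key=protein_key_sketch)
```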

PanACoTA.utils.thread_progressbar(widgets, stop)

Thread running an “infinite” progress bar while the main thread is working. When this progress bar has to stop, the main thread sends a signal.

Parameters:
widgets : list

list of widgets to put in the progressbar

stop : function

function returning False when thread can run, True when it has to stop.

PanACoTA.utils.write_genomes_info(genomes, kept_genomes, list_file, res_path, qc=False)

If qc=False, write the list of discarded genomes to a file, with their information (number of contigs, L90, etc.), so that users can keep a trace of them.

If qc=True, the run stops after QC: write the list of genomes that would be kept for annotation, with all their information (L90, size, number of contigs).

Parameters:
genomes : dict

{genome: [gembase_start_name, orig_seq_file, to_annotate_seq_file, genome_size, nb_contigs, L90]}

kept_genomes : list

list of genomes kept

list_file : str

path to input file containing the list of genomes

res_path : str

folder where results must be saved

qc : bool

  • True: called only if QC only. Name this file info-genomes-<list_file>.txt to put information on genomes that would be annotated if not QC only

  • otherwise (False), called in any case. Name this file discarded-<list_file>.txt and write all discarded genomes, whether sequences kept are next annotated or not

Columns: orig_name, to_annotate, gsize, nb_conts, L90

PanACoTA.utils.write_list(list_names, fileout)

Write the given list of strings to a file, 1 per line

PanACoTA.utils.write_lstinfo(list_file, genomes, outdir)

Write lstinfo file, with following columns: gembase_name, orig_name, to_annotate_name, size, nbcontigs, l90

Parameters:
list_file : str

input file containing the list of genomes

genomes : dict

{genome: [gembase_start_name, seq_file, seq_to_annotate, genome_size, nb_contigs, L90]}

outdir : str

folder where results must be saved

PanACoTA.utils.write_warning_skipped(skipped, do_format=False, prodigal_only=False, logfile='')

At the end of the script, write a warning to the user with the names of the genomes which had problems during annotation (with prokka or prodigal).

Parameters:
skipped : list

list of genomes with problems

do_format : bool

if False, genomes were skipped before the format step. If True, they were skipped because of the format step

prodigal_only : bool

if False, prokka was used to annotate. If True, prodigal was used

PanACoTA.utils_pangenome submodule

Functions used to deal with pangenome file

@author gem April 2017

PanACoTA.utils_pangenome.get_fams_info(families, logger)

From all families as list of members, get more information:

  • all strains found, sorted by species name

  • for each family, sort members by strain

Parameters:
families : dict

{num: [members]}

logger : logging.Logger

logger object to write log information

Returns:
(fams_by_strain, sorted_all_strains) : tuple

with:

  • fams_by_strain: {fam_num: {strain: [members], strain: [members]}}

  • sorted_all_strains: list of all strains found, sorted by species
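
The shape of the two returned objects can be illustrated with a small grouping routine. This is a sketch, not the PanACoTA code: the function name and gene names are made up, the strain is assumed to be the first three dot-separated fields of a member name (species.date.strain), and a plain alphabetical sort stands in for the species-aware ordering.

```python
def group_by_strain_sketch(families):
    """Return ({fam_num: {strain: [members]}}, sorted list of strains)."""
    fams_by_strain = {}
    all_strains = set()
    for num, members in families.items():
        per_strain = fams_by_strain.setdefault(num, {})
        for gene in members:
            # assumption: member names look like species.date.strain.contig_number
            strain = ".".join(gene.split(".")[:3])
            per_strain.setdefault(strain, []).append(gene)
            all_strains.add(strain)
    return fams_by_strain, sorted(all_strains)


families = {
    1: ["ESCO.1512.00001.i0001_00001", "ESCO.1512.00002.i0001_00004"],
    2: ["ACBA.0417.00001.i0001_00010",
        "ESCO.1512.00001.i0002_00120",
        "ESCO.1512.00001.i0002_00121"],
}
by_strain, strains = group_by_strain_sketch(families)
```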

PanACoTA.utils_pangenome.read_gene(gene, num, fams_by_strain, all_strains)

Read information from a given gene name, and save it to appropriate dicts

Parameters:
gene : str

gene name (species.date.strain.contig_number)

num : str

number of the family to which the given gene belongs

fams_by_strain : dict

{fam_num: {strain: [members]}}

all_strains : set

set of all strains

PanACoTA.utils_pangenome.read_lstinfo(lstinfo, logger)

Read lstinfo file and return list of genomes

Parameters:
lstinfo : str

File containing the list of all genomes to include in the pan-genome, 1 genome per line. Here, only the first column will be used.

Returns:
list

list of genomes

PanACoTA.utils_pangenome.read_pan_file(filein, logger)

Read PanGenome file in ‘filein’, and put it into Python objects

Parameters:
filein : str

path to pangenome file

logger : logging.Logger

logger object to write log information
Returns:
(fams_by_strain, families, sort_all_strains) : tuple

with:

  • fams_by_strain: {fam_num: {strain: [members]}}

  • families: {fam_num: [all members]}

  • sort_all_strains: list of all genome names, sorted by species name

PanACoTA.utils_pangenome.read_pangenome(pangenome, logger, families=None)

Read pangenome information

Read pangenome according to what is available. First, check if python objects are available, then if not, search for the binary file, and if not, read the text file.

Parameters:
pangenome : str

path to pangenome file

families : dict or None

{num: [members]} if families are given. If not (must read them from binary file if exists or pangenome file otherwise), None.

Returns:
(fams_by_strain, families, all_strains) : tuple

with:

  • fams_by_strain: {fam_num: {strain: [members]}}

  • families: {fam_num: [all members]}

  • all_strains: list of all genome names