PanACoTA submodules¶
These submodules contain utility functions.
PanACoTA.utils submodule¶
Util functions and classes.
@author gem April 2017
- class PanACoTA.utils.LessThanFilter(level)¶
Bases:
Filter
In the logging module, the level set on a handler is a minimum level: every record at that level or higher is printed. To also set an upper bound, so that only records below a given level are printed, add this filter to the handler: handler.addFilter(LessThanFilter(level))
Methods
filter(rec): Function to decide if given log has to be logged or not, according to its level
- filter(rec)¶
Function to decide if given log has to be logged or not, according to its level
- Parameters:
- rec
current record handled by logger
- Returns:
- bool
True if level of current log is less than the defined limit, False otherwise
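The behaviour described above can be sketched with the standard logging module alone. This is a minimal illustration, not the PanACoTA implementation: the class name BelowLevelFilter and the logger name are hypothetical, and the handler writes to an in-memory stream so that the result can be inspected.

```python
import io
import logging


class BelowLevelFilter(logging.Filter):
    """Hypothetical stand-in: keep a record only if its level is below `level`."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, rec):
        # True keeps the record, False drops it.
        return rec.levelno < self.level


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setLevel(logging.DEBUG)                       # lower bound (inclusive)
handler.addFilter(BelowLevelFilter(logging.WARNING))  # upper bound (exclusive)

demo_logger = logging.getLogger("below_level_demo")
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False
demo_logger.addHandler(handler)

demo_logger.info("kept")
demo_logger.warning("dropped")
captured = stream.getvalue()
```

Combining a handler level with such a filter gives a level window, which is how a stdout handler can be limited to messages below WARNING while a stderr handler takes the rest.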
- class PanACoTA.utils.NoLevelFilter(level)¶
Bases:
Filter
When logging, exclude one specific level from a handler. This is used for the stdout handler: by default, it should print the DEBUG (for development use) and INFO levels, but not the DETAIL level, which lies between DEBUG and INFO. DETAIL is printed only if the verbose option was set.
Methods
filter(rec): Function to decide if given log has to be logged or not, according to its level
- filter(rec)¶
Function to decide if given log has to be logged or not, according to its level
- Parameters:
- rec
current record handled by logger
- Returns:
- bool
True if level of current log is different from forbidden level, False if it is the same
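As an illustration of this "skip one level" idea, here is a small sketch using only the standard library. The class name SkipLevelFilter, the helper make_logger and the numeric value 15 for DETAIL are assumptions made for the example (any value between DEBUG = 10 and INFO = 20 would do); they are not taken from the PanACoTA source.

```python
import io
import logging

DETAIL = 15  # assumed value, between logging.DEBUG (10) and logging.INFO (20)
logging.addLevelName(DETAIL, "DETAIL")


class SkipLevelFilter(logging.Filter):
    """Hypothetical stand-in: drop records whose level equals `level`."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, rec):
        return rec.levelno != self.level


def make_logger(name, verbose):
    """Build a logger writing to an in-memory stream (hypothetical helper)."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.DEBUG)
    if not verbose:
        # Without the verbose option, DETAIL records are filtered out.
        handler.addFilter(SkipLevelFilter(DETAIL))
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.addHandler(handler)
    return logger, stream


messages = ((logging.DEBUG, "debug"), (DETAIL, "detail"), (logging.INFO, "info"))
default_logger, default_out = make_logger("skip_level_default", verbose=False)
verbose_logger, verbose_out = make_logger("skip_level_verbose", verbose=True)
for lvl, msg in messages:
    default_logger.log(lvl, msg)
    verbose_logger.log(lvl, msg)
```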
- PanACoTA.utils.cat(list_files, output, title=None)¶
Equivalent of ‘cat’ unix command.
Concatenate all files in ‘list_files’ and save the result in the ‘output’ file. The concatenation uses shutil.copyfileobj, which copies by chunks and so avoids memory problems when files are big.
- Parameters:
- list_files : list
list of filenames to concatenate
- output : str
output filename, where all concatenated files will be written
- title : str or None
title of the progress bar to show while concatenating files. If no title is given, nothing is shown during concatenation.
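The chunked concatenation described above can be sketched as follows. The function name concat_files is hypothetical and the progress bar is left out; only shutil.copyfileobj, which the description names, is relied upon.

```python
import os
import shutil
import tempfile


def concat_files(list_files, output):
    """Concatenate `list_files` into `output`, copying chunk by chunk."""
    with open(output, "wb") as outf:
        for path in list_files:
            with open(path, "rb") as inf:
                # copyfileobj reads and writes fixed-size chunks, so memory
                # use stays bounded whatever the size of the input file.
                shutil.copyfileobj(inf, outf)


# Small demonstration on two temporary files.
tmpdir = tempfile.mkdtemp()
parts = []
for i, text in enumerate(["first\n", "second\n"]):
    path = os.path.join(tmpdir, f"part{i}.txt")
    with open(path, "w") as fh:
        fh.write(text)
    parts.append(path)

merged = os.path.join(tmpdir, "merged.txt")
concat_files(parts, merged)
with open(merged) as fh:
    merged_text = fh.read()
```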
- PanACoTA.utils.check_format(info)¶
Check that the given information (either the genome name or the date) is in the right format: it should have exactly 4 characters, all alphanumeric.
- Parameters:
- info : str
information to check
- Returns:
- bool
True if right format, False otherwise
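The rule stated above (exactly 4 characters, all alphanumeric) can be written as a single regular expression. The sketch below is one possible formulation, with a hypothetical function name and restricted to ASCII letters and digits, which is an assumption; it is not necessarily how check_format is implemented.

```python
import re


def has_valid_format(info):
    """Return True if `info` is exactly 4 ASCII alphanumeric characters."""
    return re.fullmatch(r"[A-Za-z0-9]{4}", info) is not None


examples = {value: has_valid_format(value)
            for value in ("ESCO", "1512", "ESC", "ES.O", "ESCOL")}
```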
- PanACoTA.utils.check_installed(cmd)¶
Check if the command ‘cmd’ is in $PATH and can then be executed
- Parameters:
- cmd : str
command to run
- Returns:
- bool
True if installed, False otherwise
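A common way to perform this kind of check in Python is shutil.which, which looks a command up in $PATH. The sketch below uses that standard-library call under a hypothetical function name; whether check_installed itself relies on it is not stated here.

```python
import shutil


def is_installed(cmd):
    """Return True if `cmd` is found in $PATH and is executable."""
    # shutil.which returns the full path of the command, or None if not found.
    return shutil.which(cmd) is not None


missing = is_installed("surely-not-a-real-command-xyz-123")
```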
- PanACoTA.utils.check_out_dirs(resdir)¶
Check that there is no file in:
resdir/LSTINFO
resdir/Genes
resdir/Proteins
resdir/Replicons
resdir/gff3
- Parameters:
- resdir : str
path to result directory
- PanACoTA.utils.count(filein, get='lines')¶
Similar to ‘wc’ unix command.
Count what is requested in ‘get’. It can be:
lines (default)
words
- Parameters:
- filein : str
path to the file for which we want to count lines or words
- get : {"lines", "words"}
either "lines" to count the number of lines in the file, or "words" to count the number of words.
- Returns:
- int
Number of lines or words according to value of ‘get’ parameter.
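A minimal sketch of such a counter is given below, under the assumption that words are whitespace-separated tokens (as with wc). The function name count_in_file and the ValueError raised on an unknown ‘get’ value are choices made for the example.

```python
import os
import tempfile


def count_in_file(filein, get="lines"):
    """Count the lines or the whitespace-separated words of `filein`."""
    if get not in ("lines", "words"):
        raise ValueError(f"unknown value for get: {get!r}")
    total = 0
    with open(filein) as fh:
        # Read line by line so that large files are never loaded at once.
        for line in fh:
            total += 1 if get == "lines" else len(line.split())
    return total


demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as fh:
    fh.write("a b c\nd e\n")

nb_lines = count_in_file(demo_path)
nb_words = count_in_file(demo_path, get="words")
```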
- PanACoTA.utils.detail_lvl()¶
Get the int level corresponding to “DETAIL”
- Returns:
- int
int corresponding to the level “DETAIL”
- PanACoTA.utils.get_genome_contigs_and_rename(gembase_name, gpath, outfile, logger)¶
For the given genome (sequence in gpath), rename all its contigs with the new name: ‘gembase_name’, and save the output sequence in outfile.
For each contig renamed, save its new name as well as its size. This will be used to generate gff files
- Parameters:
- gembase_name : str
genome name to use (species.date.strain)
- gpath : str
path to the genome sequence
- outfile : str
path to the new file, containing ‘gpath’ sequence, but with ‘gembase_name’ in headers
- Returns:
- tuple
with:
dict of all contigs with their original and new name: {>orig_name: >new_name}
dict of all contigs with their size: {new_name: size}
- PanACoTA.utils.grep(filein, pattern, counts=False)¶
Equivalent of ‘grep’ unix command
By default, returns all the lines containing the given pattern. If counts = True, returns the number of lines containing the pattern.
- Parameters:
- filein : str
path to the file in which pattern must be searched
- pattern : str
pattern to search
- counts : bool
True if you want to count how many lines have the pattern and return this number, False if you want to return all lines containing the pattern.
- Returns:
- list or int
list of lines if counts=False; number of lines if counts=True
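The two return modes can be illustrated with a short sketch. It assumes that the pattern is a regular expression and that returned lines have their trailing newline removed; both points, like the name grep_file, are assumptions of this example rather than documented behaviour.

```python
import os
import re
import tempfile


def grep_file(filein, pattern, counts=False):
    """Return the lines of `filein` matching `pattern`, or their number."""
    regex = re.compile(pattern)
    matches = []
    with open(filein) as fh:
        for line in fh:
            if regex.search(line):
                matches.append(line.rstrip("\n"))
    return len(matches) if counts else matches


fasta = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(fasta, "w") as fh:
    fh.write(">contig1\nACGT\n>contig2\nTTGA\n")

headers = grep_file(fasta, "^>")
nb_headers = grep_file(fasta, "^>", counts=True)
```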
- PanACoTA.utils.init_logger(logfile_base, level, name, log_details=False, verbose=0, quiet=False)¶
Create the logger and its handlers, and set them to the given level.
Level hierarchy:
CRITICAL > ERROR > WARNING > INFO > DETAIL > DEBUG
Messages from all levels are written to ‘logfile’.log
Messages for levels lower than WARNING (only INFO and DEBUG) are written to stdout
Messages for levels equal to or higher than WARNING are written to stderr
Messages for levels equal to or higher than WARNING are written to ‘logfile’.log.err
- Parameters:
- logfile_base : str
base of filename to use for logs. Will add ‘.log’, ‘.log.details’ and ‘.log.err’ for the 3 log files created
- level : int
minimum level that must be considered.
- name : str or None
if we need to name the logger (used for tests)
- log_details : bool
if True, force creation of .log.details file. Otherwise, just create it if needed according to level
- verbose : int
be more verbose:
default (0): info in stdout, error and more in stderr; info and more in *.log; warning and more in *.log.err
1: add warnings in stderr
2: like 1, plus add details to stdout (by default only INFO) and to *.log.details
>15: add debug to stdout and create *.log.debug with all levels
- quiet : bool
True if nothing must be sent to stdout/stderr, False otherwise
- PanACoTA.utils.list_to_str(list, sep='\t')¶
Return a string corresponding to the given list, with all elements separated by ‘sep’ (a tab by default). Used to write a list into a file. Ex, with sep=" ":
[1, 2, "toto"] -> "1 2 toto"
- Parameters:
- list : list
list of elements that we would like to write
- sep : str
Separator to use between the different elements
- Returns:
- str
the string to write
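The core of this conversion is a join over the string form of each element. The sketch below uses a hypothetical name and returns the joined string as is; whether the library function appends anything (a trailing newline, for instance) is not stated here.

```python
def join_elements(elements, sep="\t"):
    """Join the string form of each element of `elements` with `sep`."""
    # str() is applied first, because str.join only accepts strings.
    return sep.join(str(elem) for elem in elements)


tab_joined = join_elements([1, 2, "toto"])
space_joined = join_elements([1, 2, "toto"], sep=" ")
```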
- PanACoTA.utils.load_bin(binfile)¶
Unpickle python objects from the binary file ‘binfile’
- Parameters:
- binfile : str
path to binary file containing python object
- Returns:
- Object
The python objects unpickled
- PanACoTA.utils.logger_thread(q)¶
Queue listener used in a thread to handle the logs put to a QueueHandler by several processes (multiprocessing.pool.map_async for example)
- Parameters:
- q : multiprocessing.managers.AutoProxy[Queue]
queue to listen to
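The queue-listener pattern mentioned above can be sketched with the standard library. To keep the example self-contained, it makes several assumptions: producers are threads rather than processes, the queue is a plain queue.Queue rather than a manager proxy, a None sentinel stops the listener, and the listener receives the destination logger as a second argument (the names listen, sink and producer are hypothetical).

```python
import io
import logging
import logging.handlers
import queue
import threading


def listen(q, sink):
    """Forward every record read from `q` to `sink` until None is received."""
    while True:
        record = q.get()
        if record is None:      # sentinel: no more records will come
            break
        sink.handle(record)


# Sink logger: the only place where records are actually written.
stream = io.StringIO()
sink = logging.getLogger("queue_demo.sink")
sink.propagate = False
sink.addHandler(logging.StreamHandler(stream))

# Producer logger: its only handler pushes records onto the queue.
q = queue.Queue()
producer = logging.getLogger("queue_demo.producer")
producer.setLevel(logging.INFO)
producer.propagate = False
producer.addHandler(logging.handlers.QueueHandler(q))

listener = threading.Thread(target=listen, args=(q, sink))
listener.start()

workers = [threading.Thread(target=producer.info, args=("job %d done", i))
           for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

q.put(None)          # tell the listener to stop
listener.join()
logged = sorted(stream.getvalue().splitlines())
```

Routing all records through one listener means that a single thread writes to the log files, so messages coming from concurrent workers are not interleaved within a line.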
- PanACoTA.utils.plot_distr(values, limit, title, text, logger)¶
Plot histogram of given ‘values’, and add a vertical line corresponding to the chosen ‘limit’ and return the mpl figure
- Parameters:
- values : list
list of values
- limit : int
limit for which a vertical line must be drawn
- title : str
Title to give to plot
- text : str
text to write near the vertical line representing the limit
- logger : logging.Logger
logger object to write log information
- Returns:
- matplotlib.figure.Figure
figure generated
- PanACoTA.utils.read_genomes(list_file, name, date, dbpath, tmp_path, logger)¶
Read list of genomes, and return them. If a genome has a name, also return it. Otherwise, return the name given by user.
Check that the given genome file exists in dbpath. Otherwise, log an error message, and ignore this file.
- Parameters:
- list_file : str
input file containing the list of genomes
- name : str
Default species name
- date : str
Default date
- dbpath : str
path to folder containing original genome files
- tmp_path : str
path to the folder that will contain the genome files to use before annotation, if they need to be changed from the original files (for example, merging several contig files into one file, splitting at each stretch of 5 ‘N’, etc.).
- Returns:
- dict
{genome: spegenus.date}, with spegenus.date = name.date
- PanACoTA.utils.read_genomes_info(list_file, name, date=None, logger=None)¶
Read a lstinfo file containing the list of genomes with their information (L90, genome size, etc.): 1 line per genome, with 4 required columns (others are ignored): to_annotate, gsize, nb_conts, L90.
Check that the given genome file (to_annotate column) exists.
- Parameters:
- list_file : str
input file containing information on genomes (to_annotate, size, L90, nb_contigs)
- name : str
Default species name
- date : str
Default date
- logger : logging.Logger
logger object to write log information
- Returns:
- dict
genomes = {genome: [spegenus.date, path_orig_seq, path_to_splitSequence, size, nbcont, l90]}
- PanACoTA.utils.read_info(name_inf, name, date, genomes_inf)¶
From the given information in ‘name_inf’, check if there is a name (and if its format is ok) and if there is a date (and if its format is ok). If no name (resp. no date), return default name (resp. default date).
- Parameters:
- name_inf : str
information on current genome, which could contain a species name and a date
- name : str
default species name
- date : str
default date
- genomes_inf : str
current genome filename. Used to complete information when there is a warning (species name or date given not in the right format…)
- Returns:
- (cur_name, cur_date) : tuple
with:
cur_name: name to use for this genome (either the default one, or the one read from ‘name_inf’)
cur_date: date to use for this genome (default or read from ‘name_inf’)
- PanACoTA.utils.remove(infile)¶
Remove the given file if it exists
- Parameters:
- infile : str
path to file to remove
- PanACoTA.utils.run_cmd(cmd, error, eof=False, **kwargs)¶
Run the given command line. If the return code is not 0, print the error message. If eof (exit on fail) is True, exit the program when the return code is not 0.
- Parameters:
- cmd : str
command to run
- error : str
error message to print if an error occurs while running the command
- eof : bool
True: exit program if command failed, False: do not exit even if command fails
- kwargs : Object
Can provide a logger, stdout and/or stderr streams
- Returns:
- subprocess.Popen
returns object of subprocess call (has attributes returncode, pid, communicate etc.)
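The control flow described above (run, check the return code, optionally exit) can be sketched with subprocess.Popen. The example below is a simplified illustration under a hypothetical name: it ignores the optional logger and streams, prints the error message to stderr, and exits with status 1, all of which are assumptions.

```python
import shlex
import subprocess
import sys


def run_command(cmd, error, eof=False):
    """Run `cmd`; on failure print `error`, and exit if `eof` is True."""
    proc = subprocess.Popen(shlex.split(cmd),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc.communicate()          # wait for the command to finish
    if proc.returncode != 0:
        print(error, file=sys.stderr)
        if eof:
            sys.exit(1)
    return proc


# Demonstration with the current Python interpreter as the command.
ok_cmd = shlex.join([sys.executable, "-c", "pass"])
fail_cmd = shlex.join([sys.executable, "-c", "import sys; sys.exit(3)"])

ok_proc = run_command(ok_cmd, "should not be printed")
failed_proc = run_command(fail_cmd, "command failed, but we carry on")
```

Returning the process object rather than a boolean lets the caller inspect returncode and decide how to react when eof is False.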
- PanACoTA.utils.save_bin(objects, fileout)¶
Save python ‘objects’ in a binary file called ‘fileout’
- Parameters:
- objects : Object
python object to save
- fileout : str
path to binary file where objects must be saved
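Saving and reloading Python objects in a binary file is typically a pickle round trip, as the word "unpickle" in the load_bin description suggests. The sketch below shows that round trip with hypothetical function names and a made-up families dictionary; it is an illustration, not the library code.

```python
import os
import pickle
import tempfile


def save_objects(objects, fileout):
    """Pickle `objects` into the binary file `fileout`."""
    with open(fileout, "wb") as fh:
        pickle.dump(objects, fh)


def load_objects(binfile):
    """Unpickle and return the objects stored in `binfile`."""
    with open(binfile, "rb") as fh:
        return pickle.load(fh)


# Made-up example data: {family number: [members]}
families = {1: ["ESCO.1512.00001.i0001_00004", "ESCO.1512.00002.i0001_00010"]}
binpath = os.path.join(tempfile.mkdtemp(), "families.bin")
save_objects(families, binpath)
reloaded = load_objects(binpath)
```

As with any pickle file, only files from a trusted source should be loaded this way.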
- PanACoTA.utils.sort_genomes_by_name(x)¶
Order by:
species
in each species, by strain number
- Parameters:
- x : tuple or str
[genome_orig, [gembase, path, gsize, nbcont, L90]] with gembase = species.date.strain
- Returns:
- str
variable to take into account for sorting. If format is ESCO.1512.00001 return ESCO and 00001. Otherwise, just return x itself (sort by alphabetical order)
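Such a function is meant to be passed as the key argument of sorted. The sketch below shows the idea with a hypothetical genome_sort_key; unlike the description above, it always returns a tuple (with strain number 0 for names in another format) so that mixed inputs stay comparable, which is a design choice of this example.

```python
def genome_sort_key(x):
    """Sort key: (species, strain number) for species.date.strain names."""
    # Accept either a plain name or [genome_orig, [gembase, ...]].
    name = x if isinstance(x, str) else x[1][0]
    fields = name.split(".")
    if len(fields) == 3 and fields[2].isdigit():
        return (fields[0], int(fields[2]))
    # Any other format: fall back to alphabetical order on the whole name.
    return (name, 0)


names = ["ESCO.1116.00010", "ACBA.0917.00001", "ESCO.1512.00002"]
ordered = sorted(names, key=genome_sort_key)
tuple_key = genome_sort_key(("g.fna", ["ESCO.1512.00003", "path", 1, 2, 3]))
```

Note that the date field is ignored: ESCO.1512.00002 sorts before ESCO.1116.00010 because only the species and the strain number are compared.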
- PanACoTA.utils.sort_genomes_byname_l90_nbcont(x)¶
Sort all genomes with the following criteria:
sort by species (x[1][0] is species.date)
for each species, sort by l90
for same l90, sort by nb contigs
- Parameters:
- x : [[]]
[genome_name, [species.date, path, path_to_seq, gsize, nbcont, L90]]
- Returns:
- tuple
information on species, l90 and nb_contigs
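The three criteria map naturally onto a tuple key, since Python compares tuples element by element. The sketch below assumes the layout given above (nbcont at index 4 and L90 at index 5 of the inner list) and that the species is the part of species.date before the first dot; the function name and the genome values are made up.

```python
def species_l90_nbcont_key(x):
    """Sort key: (species, L90, number of contigs)."""
    info = x[1]
    species = info[0].split(".")[0]
    nbcont, l90 = info[4], info[5]
    return (species, l90, nbcont)


# Made-up genomes: {name: [species.date, path, path_to_seq, gsize, nbcont, L90]}
genomes = {
    "g1.fna": ["ESCO.1512", "g1.fna", "g1.fna", 5000000, 40, 12],
    "g2.fna": ["ESCO.1512", "g2.fna", "g2.fna", 5100000, 25, 12],
    "g3.fna": ["ACBA.0917", "g3.fna", "g3.fna", 3900000, 90, 30],
    "g4.fna": ["ESCO.1512", "g4.fna", "g4.fna", 4900000, 10, 3],
}
ranked = [name for name, _ in sorted(genomes.items(), key=species_l90_nbcont_key)]
```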
- PanACoTA.utils.sort_genomes_l90_nbcont(x)¶
Sort all genomes with the following criteria:
for each strain, sort by l90
for same l90, sort by nb contigs
- Parameters:
- x : [[]]
[genome_name, [species.date, path, gsize, nbcont, L90]]
- Returns:
- tuple
information on l90 and nb_contigs
- PanACoTA.utils.sort_proteins(x)¶
Order by:
species
in each species, by strain number
in each species and strain number, by protein number
- Parameters:
- x : str
species.date.strain.contig_protnum
- Returns:
- str
variable to take into account for sorting. If format is ESCO.1512.00001.i0002_12124, return ESCO, 00001 and 12124. If not, it must be something_00001: return something and 00001.
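A sketch of such a key for the two formats mentioned above is given below. The name protein_sort_key is hypothetical, numeric fields are converted to int so that they compare numerically, and the fallback returns a 3-element tuple with a strain number of 0 to stay comparable with the first format; these are choices of this example.

```python
def protein_sort_key(x):
    """Sort key: (species, strain number, protein number)."""
    prefix, _, protnum = x.rpartition("_")
    if not protnum.isdigit():
        raise ValueError(f"unexpected protein name: {x!r}")
    fields = prefix.split(".")
    if len(fields) == 4 and fields[2].isdigit():
        # species.date.strain.contig_protnum
        return (fields[0], int(fields[2]), int(protnum))
    # something_protnum
    return (prefix, 0, int(protnum))


proteins = [
    "ESCO.1512.00002.i0001_00005",
    "ESCO.1512.00001.b0003_00120",
    "ESCO.1512.00001.i0001_00007",
    "ACBA.0917.00001.i0001_00300",
]
sorted_proteins = sorted(proteins, key=protein_sort_key)
```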
- PanACoTA.utils.thread_progressbar(widgets, stop)¶
Thread running an “infinite” progress bar while the main thread is working. Once this progress bar has to stop, we send a signal.
- Parameters:
- widgets : list
list of widgets to put in the progressbar
- stop : function
function returning False when thread can run, True when it has to stop.
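The ‘stop’ argument is a small callable that the worker thread polls. The sketch below shows only this stop mechanism, without any progress bar library; the worker function, the shared dictionary and the 10 ms polling interval are assumptions made for the example.

```python
import threading
import time


def run_until_stopped(stop, ticks):
    """Record a tick every 10 ms until `stop()` returns True."""
    while not stop():
        ticks.append(time.monotonic())   # a real bar would be updated here
        time.sleep(0.01)


state = {"done": False}
ticks = []
worker = threading.Thread(target=run_until_stopped,
                          args=(lambda: state["done"], ticks))
worker.start()

time.sleep(0.05)        # stand-in for the real work of the main thread
state["done"] = True    # the "signal": stop() now returns True
worker.join(timeout=2)
```

Passing a function rather than a boolean matters: the thread re-evaluates it at each iteration, so it sees the change made by the main thread.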
- PanACoTA.utils.write_genomes_info(genomes, kept_genomes, list_file, res_path, qc=False)¶
Write the list of discarded genomes to a file (qc=False), so that users can keep a trace of them, with their information (nb contigs, L90, etc.).
If qc=True, we stop after QC: write the list of genomes that would be kept for annotation, with all their information (L90, size, number of contigs).
- Parameters:
- genomes : dict
{genome: [gembase_start_name, orig_seq_file, to_annotate_seq_file, genome_size, nb_contigs, L90]}
- kept_genomes : list
list of genomes kept
- list_file : str
path to input file containing the list of genomes
- res_path : str
folder where results must be saved
- qc : bool
True: called only if QC only. Name this file info-genomes-<list_file>.txt and write the information on genomes that would be annotated if not QC only.
False: called in any case. Name this file discarded-<list_file>.txt and write all discarded genomes, whether the sequences kept are then annotated or not.
Columns: orig_name, to_annotate, gsize, nb_conts, L90
- PanACoTA.utils.write_list(list_names, fileout)¶
Write the given list of strings to a file, 1 per line
- PanACoTA.utils.write_lstinfo(list_file, genomes, outdir)¶
Write lstinfo file, with following columns: gembase_name, orig_name, to_annotate_name, size, nbcontigs, l90
- Parameters:
- list_file : str
input file containing the list of genomes
- genomes : dict
{genome: [gembase_start_name, seq_file, seq_to_annotate, genome_size, nb_contigs, L90]}
- outdir : str
folder where results must be saved
- PanACoTA.utils.write_warning_skipped(skipped, do_format=False, prodigal_only=False, logfile='')¶
At the end of the script, write a warning to the user with the names of the genomes which had problems with prokka.
- Parameters:
- skipped : list
list of genomes with problems
- do_format : bool
False: genomes were not skipped because of the format step, but before it. True: they were skipped because of the format step
- prodigal_only : bool
False: prokka was used to annotate. True: prodigal was used to annotate
PanACoTA.utils_pangenome submodule¶
Functions used to deal with pangenome file
@author gem April 2017
- PanACoTA.utils_pangenome.get_fams_info(families, logger)¶
From all families as list of members, get more information:
all strains found, sorted by species name
for each family, sort members by strain
- Parameters:
- families : dict
{num: [members]}
- logger : logging.Logger
logger object to write log information
- Returns:
- (fams_by_strain, sorted_all_strains) : tuple
with:
fams_by_strain: {fam_num: {strain: [members], strain: [members]}}
sorted_all_strains: list of all strains found, sorted by species
- PanACoTA.utils_pangenome.read_gene(gene, num, fams_by_strain, all_strains)¶
Read information from a given gene name, and save it to appropriate dicts
- Parameters:
- gene : str
gene name (species.date.strain.contig_number)
- num : str
number of the family to which the given gene belongs
- fams_by_strain : dict
{fam_num: {strain: [members]}}
- all_strains : set
set of all strains
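To make the two structures concrete, here is a sketch that fills them from made-up gene names. It assumes that the strain key is the first three dot-separated fields of the gene name (species.date.strain), and it handles only that format; the function name record_gene is hypothetical.

```python
def record_gene(gene, num, fams_by_strain, all_strains):
    """Add `gene` to family `num` under its strain, and register the strain."""
    fields = gene.split(".")
    if len(fields) < 4:
        raise ValueError(f"unexpected gene name: {gene!r}")
    strain = ".".join(fields[:3])      # assumed: species.date.strain
    fams_by_strain.setdefault(num, {}).setdefault(strain, []).append(gene)
    all_strains.add(strain)


# Made-up families: {num: [members]}
families = {
    "1": ["ESCO.1512.00001.i0001_00004", "ESCO.1512.00002.i0001_00010",
          "ESCO.1512.00001.i0002_00150"],
    "2": ["ESCO.1512.00002.b0001_00001"],
}
fams_by_strain, all_strains = {}, set()
for num, members in families.items():
    for gene in members:
        record_gene(gene, num, fams_by_strain, all_strains)
```

Both containers are modified in place, which matches a function that documents parameters but no return value.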
- PanACoTA.utils_pangenome.read_lstinfo(lstinfo, logger)¶
Read lstinfo file and return list of genomes
- Parameters:
- lstinfo : str
File containing the list of all genomes to include in the pan-genome, 1 genome per line. Here, only the first column will be used.
- Returns:
- list
list of genomes
- PanACoTA.utils_pangenome.read_pan_file(filein, logger)¶
Read PanGenome file in ‘filein’, and put it into Python objects
- Parameters:
- filein : str
path to pangenome file
- logger
- Returns:
- (fams_by_strain, families, sort_all_strains) : tuple
with:
fams_by_strain: {fam_num: {strain: [members]}}
families: {fam_num: [all members]}
sort_all_strains: list of all genome names, sorted by species name
- PanACoTA.utils_pangenome.read_pangenome(pangenome, logger, families=None)¶
Read pangenome information
Read pangenome according to what is available. First, check if python objects are available; if not, search for the binary file; if it does not exist either, read the text file.
- Parameters:
- pangenome : str
path to pangenome file
- families : dict or None
{num: [members]} if families are given. None otherwise (they must then be read from the binary file if it exists, or from the pangenome file).
- Returns:
- (fams_by_strain, families, all_strains) : tuple
with:
fams_by_strain: {fam_num: {strain: [members]}}
families: {fam_num: [all members]}
all_strains: list of all genome names