PanACoTA.align_module
package¶
pan_to_pergenome
submodule¶
From the Persistent Genome file, group all persistent proteins per genome, in order to be able to extract them faster after.
Input:
Persistent genome (1 line per family, 1st column is family number, others are all members)
list of all genomes (1 genome per line, only first column is considered)
output directory
dname: the name of the dataset, used to name output files
Output:
for each genome:
<outdir>/List-<dname>/<dname>-getEntry_gen_<genome>.txt
: list of all genes from the ‘genome’ which correspond to persistent proteins. 2 columns: the first one is the protein name, the second is the filename to which it must be extracted (corresponding to the family in the persistent genome).for each genome:
<outdir>/List-<dname>/<dname>-getEntry_prt_<genome>.txt
: same as previous file but with the list of proteins to extract instead of genes.for each family:
<outdir>/Align-<dname>/<dname>-current.<fam_num>.miss.lst
: list of genomes which do not have members in family ‘fam_num’.
@author GEM November 2016
- PanACoTA.align_module.pan_to_pergenome.get_all_genomes(list_gen)¶
Read the file containing the genomes list and get all genome names
- Parameters:
- list_genstr
File containing the list of all genomes
- Returns:
- list
list of all genome names
- PanACoTA.align_module.pan_to_pergenome.get_per_genome(persgen, list_gen, dname, outdir)¶
From persistent genome and list of all genomes, sort persistent proteins by genome
For each genome, write all persistent proteins to a file, with the family from which they are, in order to extract those proteins after. For each family, also save the names of genomes which do not have any member. This will be used to complete the alignments by stretches of ‘-‘.
- Parameters:
- persgenstr
file containing persistent genome
- list_genstr
file containing the list of all genomes
- dnamestr
name of the dataset
- outdirstr
Directory where files must be saved. Will create 2 subfolders:
Align-<dname>
andList-<dname>
- Returns:
- (all_genomes, aldir, listdir, families)tuple
all_genomes : [] list of all genome names
aldir : str, path to align directory
listdir : str, path to List directory
families : str, list of family numbers
- PanACoTA.align_module.pan_to_pergenome.proteins_per_strain(persgen, all_genomes)¶
From the persistentGenome file, get all persistent proteins, and classify them according to the strain from which they are.
- Parameters:
- persgenstr
File containing persistent genome
- Returns:
- (all_prots, fam_genomes, several)tuple
all_prots: dict, {strain: {member: fam_num}}
fam_genomes: dict, {fam_num: set(genomes having a member in fam)}
several: dict, {fam_num: set(genomes having several members in fam)}
- PanACoTA.align_module.pan_to_pergenome.write_genome_file(listdir, aldir, dname, strain, member, several)¶
For a given genome, write all the proteins and genes to extract to its file. If one of the 2 files (proteins and genes) already exists, overwrite it. If no file exists -> write them. If the 2 files exist -> warning saying that we use already existing files.
- Parameters:
- listdirstr
path to List directory
- aldirstr
path to Align directory
- dnamestr
name of dataset
- strainstr
current genome name
- memberdict
{member: fam_num}
- severaldict
{fam_num: [genomes having several members in family]}
- PanACoTA.align_module.pan_to_pergenome.write_getentry_files(all_prots, several, listdir, aldir, dname, all_genomes)¶
For each species, write all its persistent proteins into a file, with the persistent family in which they are. Those files will be used to extract all proteins.
- Parameters:
- all_protsdict
{strain: {member: fam_num}}
- severaldict
{fam_num: [genomes having several members in family]}
- listdirstr
directory where lists of proteins per genome must be saved
- aldirstr
directory where extracted proteins per family must be saved
- dnamestr
name of dataset
- all_genomeslist
list of all genomes
- PanACoTA.align_module.pan_to_pergenome.write_missing_genomes(fam_genomes, several, all_genomes, aldir, dname)¶
For each family, write the names of all genomes which do not have any member in the family.
- Parameters:
- fam_genomesdict
{fam_num: [genomes having a member in fam]}
- severaldict
{fam_num: [genomes having several members in family]}
- all_genomeslist
list of all genomes
- aldirstr
directory where the lists of missing genomes per family must be saved
- dnamestr
name of dataset
get_seqs
submodule¶
- PanACoTA.align_module.get_seqs.check_existing_extract(all_fams, aldir, dname)¶
For each family, check if its prt and gen extraction file already exist. If both exist, no need to re-extract for those families. If only one or no one exists, put to list to extract.
- Parameters:
- all_famslist
list of all family numbers
- aldirstr
path to directory where extraction files must be saved
- dnamestr
name of the dataset
- Returns:
- []
list of files that must be generated (prt and gen files)
- PanACoTA.align_module.get_seqs.extract_sequences(to_extract, fasf, files_todo=None, outf=None)¶
Extract sequences from an open fasta file ‘fasf’, and a list of sequences to extract
- Parameters:
- to_extractdict or []
{sequence_to_extract: file_to_which_it_will_be_extracted} or list of sequences to extract, all in a same outfile (name must be given in ‘outf’)
- fasf_io.TextIO
open file containing sequences in multi-fasta format
- files_todolist or None
list of files which must be generated (prt and gen files). Others already exist, so ignore them.
- outf_io.TextIO or None
If an outfile is given (not None), and ‘to_extract’ is a dict, only its keys will be considered, and all these sequences will be extracted to ‘outfile’ (if ‘to_extract’ is a list, will extract all sequences of this list). Otherwise, if None, each sequence will be extracted to its corresponding value in ‘to_extract’.
- PanACoTA.align_module.get_seqs.get_all_seqs(all_genomes, dname, dbpath, listdir, aldir, all_fams, quiet)¶
For all genomes, extract its proteins present in a persistent family to the file corresponding to this family.
- Parameters:
- all_genomes[]
list of all genome names
- dnamestr
name of dataset
- dbpathstr
path to folder containing ‘Proteins’ and ‘Genes’ folders
- listdirstr
path to folder containing the lists of proteins/genes to extract
- aldirstr
path to folder where extracted proteins/genes must be saved
- all_fams[]
list of all family numbers
- quietbool
True if nothin must be written to stdout/stderr, False otherwise
- PanACoTA.align_module.get_seqs.get_genome_seqs(fasta, tabfile, files_todo, outfile=None)¶
From a fasta file, extract all sequences given in the tab file. The tab file can contain:
1 sequence name per line -> all sequences will be extracted to the same file
1 sequence + 1 filename per line -> each sequence will be extracted in the given file
If outfile not given, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given, only the 1st column of tab file will be considered, and all sequences will be extracted to the given outfile.
- Parameters:
- fastastr
path to fasta file from which sequences must be extracted
- tabfilestr
path to the tab file containing the names of sequences to extract
- files_todolist
list of files which must be generated (prt and gen files). Others already exist, so ignore them.
- outfilestr or None
if None, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given (not None), only the 1st column of tab file will be considered, and all sequences will be extracted to the given outfile.
- PanACoTA.align_module.get_seqs.get_names_to_extract(tabf, outfile)¶
From the tab file, get names of sequences to extract.
- Parameters:
- tabf_io.TextIO
open file containing names of sequences to extract
- outfilestr or None
if None, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given (not None), only the 1st column of tab file will be considered, and all sequences will be given ‘outfile’ as output file
- Returns:
- dict
{sequence_to_extract: file_to_which_it_will_be_extracted}
alignment
submodule¶
For a given family:
align its proteins using mafft
back translate to nucleotides
add “-” of same size as alignment for genomes not having members in the family
@author: GEM, Institut Pasteur March 2017
- PanACoTA.align_module.alignment.add_missing_genomes(align_file, ali_type, miss_file, num_fam, ngenomes, status1, logger)¶
Once all family proteins are aligned, and back-translated to nucleotides, add missing genomes for the family to the alignment with ‘-‘. (Add missing genomes to both mafft alignment and back-translated alignment)
- Parameters:
- align_filestr
path to file containing alignments (proteins if from mafft output, or nucleic sequences if after backtranslating them)
- ali_typestr
protein or backtranslated
- miss_filestr
path to file containing the list of missing genomes in this family
- num_famint
family number
- ngenomesint
total number of genomes in dataset
- status1bool or str
“OK” if we did not redo the alignments as they already were as expected. In that case, if missing genomes are already present, just add a warning message saying that we used the already existing btr file.
True if we just did the alignments and backtranslate. So no warning message needed.
False if problem with extraction, alignment or backtranslation (will never happen as this function is not called if status1 == False)
- loggerlogging.Logger
the logger, having a queue Handler, to give logs to the main logger
- Returns:
- bool or str
“OK” if btr file was not recreated, and already has the right number of sequences, and all with the same length.
False if problem in btr file alignment, so missing genomes not added
True if alignment + adding missing genomes is ok. Can happen if there is no missing genome for this family (in that case, btr generated already has the right number of sequences), or if we just added the missing genomes.
- PanACoTA.align_module.alignment.align_all_families(prefix, all_fams, ngenomes, dname, quiet, threads)¶
For each family:
align all its proteins with mafft
back-translate to nucleotides
add missing genomes
- Parameters:
- prefixstr
path to
aldir/<name of dataset>
(used to get extraction, alignment and btr files easily)- all_fams[]
list of all family numbers
- ngenomesint
total number of genomes in dataset
- dnamestr
name of dataset (used to name concat and grouped files, as well as tree folder)
- quietbool
True if nothing must be written in stdout/stderr, False otherwise
- threadsint
max number of threads that can be used by mafft
- Returns:
- bool
True if everything went well, False if there was a problem in at least 1 family.
- PanACoTA.align_module.alignment.back_translate(num_fam, mafft_file, gen_file, btr_file, nbfal, logger)¶
Backtranslate protein alignment to nucleotides
- Parameters:
- num_famint
current family number. Used for log messages
- mafft_filestr
path to file containing protein alignments by mafft
- gen_filestr
path to file containing all sequences, not aligned, in nucleotides. It is used to convert the alignment in proteins into a nucleotide alignment
- btr_filestr
path to the file that will contain the nucleotide alignment
- nbfalint
number of sequences aligned for the family by mafft
- loggerlogging.Logger
logger with queueHandler to give logs to main logger
- Returns:
- bool
False if problem (back-translation, different number of families…)
number of sequences in btr file if everything went well
- PanACoTA.align_module.alignment.check_add_missing(btr_file, num_fam, ngenomes, logger, prev)¶
Check back-translated alignment while missing genomes have been added
- Parameters:
- btr_filestr
path to file containing back-translated alignment
- num_famint
current family number
- ngenomesint
total number of genomes in dataset
- loggerlogging.Logger
logger with queueHandler to give logs to main logger
- prevbool
True if we are checking alignments before adding missing genomes (in case it comes from a previous run and is already done, or in case there are no missing genomes for this family, so nothing to do to btr file). In that case, do not write error message if the number of sequences does not correspond to the total number of genomes. False if we just added missing genomes. In that case, nb sequences should be equal to the total number of genomes. If not, write error message.
- Returns:
- bool or int
True if right number of sequences in btr file and all the same length (everything is ok)
False if problem in alignment (sequences do not all have the same size), or same size but not all genomes while not being from previous run (prev=False)
alignment length if sequences aligned all have the same length, but missing genomes are not added yet (so, will have to add lines with this number of ‘-‘)
- PanACoTA.align_module.alignment.check_extractions(num_fam, miss_file, prt_file, gen_file, ngenomes, logger)¶
Check that extractions went well for the given family:
check number of proteins and genes extracted compared to the number of genomes
- Parameters:
- num_famint
current family number
- miss_filestr
path to file containing the list of genomes missing for the current family
- prt_filestr
path to file containing all proteins extracted
- gen_filestr
path to file containing all genes extracted
- ngenomesint
total number of genomes in dataset
- loggerlogging.Logger
logger with queueHandler to give logs to main logger
- Returns:
- bool or int
False if any problem (nbmiss+prt != nbgenomes or nbmiss+gen != nbgenomes). If no problem, returns the number of proteins/genes extracted
- PanACoTA.align_module.alignment.check_lens(aln_file, num_fam, logger)¶
In the given alignment file, check that all sequences have the same length. If there is no problem, it returns the length of alignment, and the number of sequences aligned.
- Parameters:
- aln_filestr
path to the alignment file to check
- num_famint
current family number. Used for log message if problem
- loggerlogging.Logger
logger having a queue Handler to give logs to the main logger in the main process
- Returns:
- bool or tuple
False if there is a problem in the alignment (sequences do not all have the same length). If they all have the same length, returns this length and the number of sequences
- PanACoTA.align_module.alignment.check_nb_seqs(alnfile, nbfal, logger, message='')¶
Check the number of sequences in the given alignment file
- Parameters:
- alnfilestr
path to alignment file
- nbfalint or []
expected number of sequences, or list of expected numbers of sequences
- loggerlogging.Logger
logger with queueHandler to give logs to main logger
- messagestr
message to write when sequence numbers do not match
- Returns:
- bool or int
False if not same number of sequences
nbseqs in align file if found among values in ‘nbfal’
- PanACoTA.align_module.alignment.family_alignment(prt_file, gen_file, miss_file, mafft_file, btr_file, num_fam, ngenomes, logger)¶
From a given family, align all its proteins with mafft, back-translate to nucleotides, and add missing genomes in this family.
- Parameters:
- prt_filestr
path to file containing proteins extracted
- gen_filestr
path to file containing genes extracted
- miss_filestr
path to file containing list of genomes missing
- mafft_filestr
path to file which will contain the protein alignment
- btr_filestr
path to file which will contain the nucleotide alignment back-translated from protein alignment
- num_famint
current family number
- ngenomesint
total number of genomes in dataset
- loggerlogging.Logger
logger with queueHandler to give logs to main logger
- Returns:
- bool or str
False if problem with extractions or with alignment or with backtranslation
‘nb_seqs’ = number of sequences aligned if everything went well (extractions and alignment ok, btr created without problem)
“OK” if extractions and alignments went well, and btr already exists and is ok
- PanACoTA.align_module.alignment.handle_family(args)¶
For the given family:
align its proteins with mafft
back-translate to nucleotides
add missing genomes
- Parameters:
- args()
(prefix, num_fam, ngenomes, q) with:
prefix: path to
aldir/<name of dataset>
num_fam: the current family number
ngenomes: the total number of genomes in dataset
q: a queue, which will be used by logger to put logs while in other process
- Returns:
- bool
“OK” if the files were not re-created, and have the expected format. This is used by
align_all_families
function, to know if something was regenerated, or if everything already existed with the expected format. If something was redone and concat/group files exist, it removes them.False if any problem (extractions, alignment, btr, add missing genomes…)
True if just generated all files, and everything is ok
- PanACoTA.align_module.alignment.handle_family_1thread(args)¶
For the given family:
align its proteins with mafft
back-translate to nucleotides
add missing genomes
- Parameters:
- args()
(prefix, num_fam, ngenomes, q) with:
prefix: path to
aldir/<name of dataset>
num_fam: the current family number
ngenomes: the total number of genomes in dataset
- Returns:
- bool
“OK” if the files were not re-created, and have the expected format. This is used by
align_all_families
function, to know if something was regenerated, or if everything already existed with the expected format. If something was redone and concat/group files exist, it removes them.False if any problem (extractions, alignment, btr, add missing genomes…)
True if just generated all files, and everything is ok
- PanACoTA.align_module.alignment.mafft_align(num_fam, prt_file, mafft_file, nbfprt, logger)¶
Align all proteins of the given family with mafft
- Parameters:
- num_famint
current family number
- prt_filestr
path to file containing all proteins extracted
- mafft_filestr
path to file which will contain proteins alignment
- nbfprtint
number of proteins extracted in prt file
- loggerlogging.Logger
logger with queueHandler to give logs to main logger
- Returns:
- bool
True if no problem (alignment ok, same number of proteins extracted and aligned), False otherwise
post_align
submodule¶
Concatenate all alignment files of all families. Then, group alignments by genome.
@author: GEM, Institut Pasteur March 2017
- PanACoTA.align_module.post_align.concat_alignments(fam_nums, prefix, ali_type, quiet)¶
Concatenate all family alignment files to a unique file
- Parameters:
- fam_nums[]
list of family numbers
- prefixstr
path to
aldir/<name of dataset>-[mafft-align or mafft-prt2nuc]
(used to get extraction, alignment and btr files easily)- ali_typestr
aa or nucl
- quietbool
True if nothing must be sent to sdtout/stderr, False otherwise
- Returns:
- tuple
(output, str) with:
output: path to file containing concatenation of all alignments
str: “OK” if concatenation file already exists, “Done” if just did concatenation
- PanACoTA.align_module.post_align.get_genome(header, all_genomes)¶
Find to which genome belongs ‘header’
- Parameters:
- headerstr
header read in alignment file
- all_genomes[]
list of all genomes
- Returns:
- str or None
name of genome from which the header is None if no genome found
- PanACoTA.align_module.post_align.group_by_genome(args)¶
From the alignment file ‘all_alns’ containing all proteins, group the alignments of proteins by their genome (listed in ‘all_genomes’), and save the result in ‘treedir’
- Parameters:
- argstuple
all_genomes: list of all genomes
all_alns: path to file containing all alignments concatenated
outfile: path to file which will contain alignments grouped by genome
- Returns:
- bool
True if everything went well
False if problem when trying to group by genomes
- PanACoTA.align_module.post_align.launch_group_by_genome(all_genomes, all_alns, status, outfile, dname, type_ali, quiet)¶
Function calling group_by_genome in a pool, while giving information to user (time elapsed)
- Parameters:
- all_genomes[]
list of all genomes in the dataset
- all_alnsstr
path to file containing all alignments concatenated
- statusstr
“OK” if concatenation file already existed before running, “Done” if just did concatenation
- outfilestr
file containing all families align by genome
- dnamestr
name of dataset
- type_alistr
nucleic or protein
- quietbool
True if nothing must be sent to sdtout/stderr, False otherwise
- Returns:
- bool
True if everything went well or was already done
False if error occurred in at least one step
- PanACoTA.align_module.post_align.post_alignment(fam_nums, all_genomes, prefix, outdir, dname, prot_ali, quiet)¶
After the alignment of all proteins by family:
concatenate all alignment files
group the alignment by genome
- Parameters:
- fam_nums[]
list of family numbers
- all_genomes[]
list of all genomes in dataset
- prefixstr
path to
aldir/<name of dataset>
(used to get extraction, alignment and btr files easily)- outdirstr
path to output directory, containing Aldir and Listdir, and that will also contain Treedir
- dnamestr
name of dataset (used to name concat and grouped files, as well as tree folder)
- prot_alibool
true: also give concatenated alignment in aa
- quietbool
True if nothing must be sent to sdtout/stderr, False otherwise
- PanACoTA.align_module.post_align.read_alignments(all_alns, all_genomes)¶
Read alignment file, and assign each sequence to a genome
- Parameters:
- all_alnsstr
path to file containing all alignments concatenated
- all_genomes[]
list of all genomes
- Returns:
- dict or None
{genome_name: [list of sequences for this genome]}
None if problem with a protein for which we don’t find the genome
- PanACoTA.align_module.post_align.write_groups(outfile, sequences)¶
Writing alignments per genome to output file.
- Parameters:
- outfilestr
path to file that will contain alignments grouped by genome
- sequencesdict
{genome_name: [list of sequences (DNA, prot…) for this genome]}