`PanACoTA.align_module` package¶

`pan_to_pergenome` submodule¶

From the Persistent Genome file, group all persistent proteins per genome, in order to be able to extract them faster after.

Input:

Persistent genome (1 line per family, 1st column is family number, others are all members)
list of all genomes (1 genome per line, only first column is considered)
output directory
dname: the name of the dataset, used to name output files

Output:

for each genome: <outdir>/List-<dname>/<dname>-getEntry_gen_<genome>.txt: list of all genes from the ‘genome’ which correspond to persistent proteins. 2 columns: the first one is the protein name, the second is the filename to which it must be extracted (corresponding to the family in the persistent genome).
for each genome: <outdir>/List-<dname>/<dname>-getEntry_prt_<genome>.txt: same as previous file but with the list of proteins to extract instead of genes.
for each family: <outdir>/Align-<dname>/<dname>-current.<fam_num>.miss.lst: list of genomes which do not have members in family ‘fam_num’.

@author GEM November 2016

PanACoTA.align_module.pan_to_pergenome.get_all_genomes(list_gen)¶

Read the file containing the genomes list and get all genome names

Parameters:

list_genstr: File containing the list of all genomes

Returns:

list: list of all genome names

PanACoTA.align_module.pan_to_pergenome.get_per_genome(persgen, list_gen, dname, outdir)¶

From persistent genome and list of all genomes, sort persistent proteins by genome

For each genome, write all persistent proteins to a file, with the family from which they are, in order to extract those proteins after. For each family, also save the names of genomes which do not have any member. This will be used to complete the alignments by stretches of ‘-‘.

Parameters:

persgenstr: file containing persistent genome
list_genstr: file containing the list of all genomes
dnamestr: name of the dataset
outdirstr: Directory where files must be saved. Will create 2 subfolders: Align-<dname> and List-<dname>

Returns:

(all_genomes, aldir, listdir, families)tuple

all_genomes : [] list of all genome names
aldir : str, path to align directory
listdir : str, path to List directory
families : str, list of family numbers

PanACoTA.align_module.pan_to_pergenome.proteins_per_strain(persgen, all_genomes)¶

From the persistentGenome file, get all persistent proteins, and classify them according to the strain from which they are.

Parameters:

persgenstr: File containing persistent genome

Returns:

(all_prots, fam_genomes, several)tuple

all_prots: dict, {strain: {member: fam_num}}
fam_genomes: dict, {fam_num: set(genomes having a member in fam)}
several: dict, {fam_num: set(genomes having several members in fam)}

PanACoTA.align_module.pan_to_pergenome.write_genome_file(listdir, aldir, dname, strain, member, several)¶

For a given genome, write all the proteins and genes to extract to its file. If one of the 2 files (proteins and genes) already exists, overwrite it. If no file exists -> write them. If the 2 files exist -> warning saying that we use already existing files.

Parameters:

listdirstr: path to List directory
aldirstr: path to Align directory
dnamestr: name of dataset
strainstr: current genome name
memberdict: {member: fam_num}
severaldict: {fam_num: [genomes having several members in family]}

PanACoTA.align_module.pan_to_pergenome.write_getentry_files(all_prots, several, listdir, aldir, dname, all_genomes)¶

For each species, write all its persistent proteins into a file, with the persistent family in which they are. Those files will be used to extract all proteins.

Parameters:

all_protsdict: {strain: {member: fam_num}}
severaldict: {fam_num: [genomes having several members in family]}
listdirstr: directory where lists of proteins per genome must be saved
aldirstr: directory where extracted proteins per family must be saved
dnamestr: name of dataset
all_genomeslist: list of all genomes

PanACoTA.align_module.pan_to_pergenome.write_missing_genomes(fam_genomes, several, all_genomes, aldir, dname)¶

For each family, write the names of all genomes which do not have any member in the family.

Parameters:

fam_genomesdict: {fam_num: [genomes having a member in fam]}
severaldict: {fam_num: [genomes having several members in family]}
all_genomeslist: list of all genomes
aldirstr: directory where the lists of missing genomes per family must be saved
dnamestr: name of dataset

`get_seqs` submodule¶

PanACoTA.align_module.get_seqs.check_existing_extract(all_fams, aldir, dname)¶

For each family, check if its prt and gen extraction file already exist. If both exist, no need to re-extract for those families. If only one or no one exists, put to list to extract.

Parameters:

all_famslist: list of all family numbers
aldirstr: path to directory where extraction files must be saved
dnamestr: name of the dataset

Returns:

[]: list of files that must be generated (prt and gen files)

PanACoTA.align_module.get_seqs.extract_sequences(to_extract, fasf, files_todo=None, outf=None)¶

Extract sequences from an open fasta file ‘fasf’, and a list of sequences to extract

Parameters:

to_extractdict or []: {sequence_to_extract: file_to_which_it_will_be_extracted} or list of sequences to extract, all in a same outfile (name must be given in ‘outf’)
fasf_io.TextIO: open file containing sequences in multi-fasta format
files_todolist or None: list of files which must be generated (prt and gen files). Others already exist, so ignore them.
outf_io.TextIO or None: If an outfile is given (not None), and ‘to_extract’ is a dict, only its keys will be considered, and all these sequences will be extracted to ‘outfile’ (if ‘to_extract’ is a list, will extract all sequences of this list). Otherwise, if None, each sequence will be extracted to its corresponding value in ‘to_extract’.

PanACoTA.align_module.get_seqs.get_all_seqs(all_genomes, dname, dbpath, listdir, aldir, all_fams, quiet)¶

For all genomes, extract its proteins present in a persistent family to the file corresponding to this family.

Parameters:

all_genomes[]: list of all genome names
dnamestr: name of dataset
dbpathstr: path to folder containing ‘Proteins’ and ‘Genes’ folders
listdirstr: path to folder containing the lists of proteins/genes to extract
aldirstr: path to folder where extracted proteins/genes must be saved
all_fams[]: list of all family numbers
quietbool: True if nothin must be written to stdout/stderr, False otherwise

PanACoTA.align_module.get_seqs.get_genome_seqs(fasta, tabfile, files_todo, outfile=None)¶

From a fasta file, extract all sequences given in the tab file. The tab file can contain:

1 sequence name per line -> all sequences will be extracted to the same file
1 sequence + 1 filename per line -> each sequence will be extracted in the given file

If outfile not given, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given, only the 1st column of tab file will be considered, and all sequences will be extracted to the given outfile.

Parameters:

fastastr: path to fasta file from which sequences must be extracted
tabfilestr: path to the tab file containing the names of sequences to extract
files_todolist: list of files which must be generated (prt and gen files). Others already exist, so ignore them.
outfilestr or None: if None, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given (not None), only the 1st column of tab file will be considered, and all sequences will be extracted to the given outfile.

PanACoTA.align_module.get_seqs.get_names_to_extract(tabf, outfile)¶

From the tab file, get names of sequences to extract.

Parameters:

tabf_io.TextIO: open file containing names of sequences to extract
outfilestr or None: if None, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given (not None), only the 1st column of tab file will be considered, and all sequences will be given ‘outfile’ as output file

Returns:

dict: {sequence_to_extract: file_to_which_it_will_be_extracted}

`alignment` submodule¶

For a given family:

align its proteins using mafft
back translate to nucleotides
add “-” of same size as alignment for genomes not having members in the family

@author: GEM, Institut Pasteur March 2017

PanACoTA.align_module.alignment.add_missing_genomes(align_file, ali_type, miss_file, num_fam, ngenomes, status1, logger)¶

Once all family proteins are aligned, and back-translated to nucleotides, add missing genomes for the family to the alignment with ‘-‘. (Add missing genomes to both mafft alignment and back-translated alignment)

Parameters:

align_filestr

path to file containing alignments (proteins if from mafft output, or nucleic sequences if after backtranslating them)

ali_typestr

protein or backtranslated

miss_filestr

path to file containing the list of missing genomes in this family

num_famint

family number

ngenomesint

total number of genomes in dataset

status1bool or str

“OK” if we did not redo the alignments as they already were as expected. In that case, if missing genomes are already present, just add a warning message saying that we used the already existing btr file.
True if we just did the alignments and backtranslate. So no warning message needed.
False if problem with extraction, alignment or backtranslation (will never happen as this function is not called if status1 == False)

loggerlogging.Logger

the logger, having a queue Handler, to give logs to the main logger

Returns:

bool or str

“OK” if btr file was not recreated, and already has the right number of sequences, and all with the same length.
False if problem in btr file alignment, so missing genomes not added
True if alignment + adding missing genomes is ok. Can happen if there is no missing genome for this family (in that case, btr generated already has the right number of sequences), or if we just added the missing genomes.

PanACoTA.align_module.alignment.align_all_families(prefix, all_fams, ngenomes, dname, quiet, threads)¶

For each family:

align all its proteins with mafft
back-translate to nucleotides
add missing genomes

Parameters:

prefixstr: path to aldir/<name of dataset> (used to get extraction, alignment and btr files easily)
all_fams[]: list of all family numbers
ngenomesint: total number of genomes in dataset
dnamestr: name of dataset (used to name concat and grouped files, as well as tree folder)
quietbool: True if nothing must be written in stdout/stderr, False otherwise
threadsint: max number of threads that can be used by mafft

Returns:

bool: True if everything went well, False if there was a problem in at least 1 family.

PanACoTA.align_module.alignment.back_translate(num_fam, mafft_file, gen_file, btr_file, nbfal, logger)¶

Backtranslate protein alignment to nucleotides

Parameters:

num_famint: current family number. Used for log messages
mafft_filestr: path to file containing protein alignments by mafft
gen_filestr: path to file containing all sequences, not aligned, in nucleotides. It is used to convert the alignment in proteins into a nucleotide alignment
btr_filestr: path to the file that will contain the nucleotide alignment
nbfalint: number of sequences aligned for the family by mafft
loggerlogging.Logger: logger with queueHandler to give logs to main logger

Returns:

bool

False if problem (back-translation, different number of families…)
number of sequences in btr file if everything went well

PanACoTA.align_module.alignment.check_add_missing(btr_file, num_fam, ngenomes, logger, prev)¶

Check back-translated alignment while missing genomes have been added

Parameters:

btr_filestr: path to file containing back-translated alignment
num_famint: current family number
ngenomesint: total number of genomes in dataset
loggerlogging.Logger: logger with queueHandler to give logs to main logger
prevbool: True if we are checking alignments before adding missing genomes (in case it comes from a previous run and is already done, or in case there are no missing genomes for this family, so nothing to do to btr file). In that case, do not write error message if the number of sequences does not correspond to the total number of genomes. False if we just added missing genomes. In that case, nb sequences should be equal to the total number of genomes. If not, write error message.

Returns:

bool or int

True if right number of sequences in btr file and all the same length (everything is ok)
False if problem in alignment (sequences do not all have the same size), or same size but not all genomes while not being from previous run (prev=False)
alignment length if sequences aligned all have the same length, but missing genomes are not added yet (so, will have to add lines with this number of ‘-‘)

PanACoTA.align_module.alignment.check_extractions(num_fam, miss_file, prt_file, gen_file, ngenomes, logger)¶

Check that extractions went well for the given family:

check number of proteins and genes extracted compared to the number of genomes

Parameters:

num_famint: current family number
miss_filestr: path to file containing the list of genomes missing for the current family
prt_filestr: path to file containing all proteins extracted
gen_filestr: path to file containing all genes extracted
ngenomesint: total number of genomes in dataset
loggerlogging.Logger: logger with queueHandler to give logs to main logger

Returns:

bool or int: False if any problem (nbmiss+prt != nbgenomes or nbmiss+gen != nbgenomes). If no problem, returns the number of proteins/genes extracted

PanACoTA.align_module.alignment.check_lens(aln_file, num_fam, logger)¶

In the given alignment file, check that all sequences have the same length. If there is no problem, it returns the length of alignment, and the number of sequences aligned.

Parameters:

aln_filestr: path to the alignment file to check
num_famint: current family number. Used for log message if problem
loggerlogging.Logger: logger having a queue Handler to give logs to the main logger in the main process

Returns:

bool or tuple: False if there is a problem in the alignment (sequences do not all have the same length). If they all have the same length, returns this length and the number of sequences

PanACoTA.align_module.alignment.check_nb_seqs(alnfile, nbfal, logger, message='')¶

Check the number of sequences in the given alignment file

Parameters:

alnfilestr: path to alignment file
nbfalint or []: expected number of sequences, or list of expected numbers of sequences
loggerlogging.Logger: logger with queueHandler to give logs to main logger
messagestr: message to write when sequence numbers do not match

Returns:

bool or int

False if not same number of sequences
nbseqs in align file if found among values in ‘nbfal’

PanACoTA.align_module.alignment.family_alignment(prt_file, gen_file, miss_file, mafft_file, btr_file, num_fam, ngenomes, logger)¶

From a given family, align all its proteins with mafft, back-translate to nucleotides, and add missing genomes in this family.

Parameters:

prt_filestr: path to file containing proteins extracted
gen_filestr: path to file containing genes extracted
miss_filestr: path to file containing list of genomes missing
mafft_filestr: path to file which will contain the protein alignment
btr_filestr: path to file which will contain the nucleotide alignment back-translated from protein alignment
num_famint: current family number
ngenomesint: total number of genomes in dataset
loggerlogging.Logger: logger with queueHandler to give logs to main logger

Returns:

bool or str

False if problem with extractions or with alignment or with backtranslation
‘nb_seqs’ = number of sequences aligned if everything went well (extractions and alignment ok, btr created without problem)
“OK” if extractions and alignments went well, and btr already exists and is ok

PanACoTA.align_module.alignment.handle_family(args)¶

For the given family:

align its proteins with mafft
back-translate to nucleotides
add missing genomes

Parameters:

args()

(prefix, num_fam, ngenomes, q) with:

prefix: path to aldir/<name of dataset>
num_fam: the current family number
ngenomes: the total number of genomes in dataset
q: a queue, which will be used by logger to put logs while in other process

Returns:

bool

“OK” if the files were not re-created, and have the expected format. This is used by align_all_families function, to know if something was regenerated, or if everything already existed with the expected format. If something was redone and concat/group files exist, it removes them.
False if any problem (extractions, alignment, btr, add missing genomes…)
True if just generated all files, and everything is ok

PanACoTA.align_module.alignment.handle_family_1thread(args)¶

For the given family:

align its proteins with mafft
back-translate to nucleotides
add missing genomes

Parameters:

args()

(prefix, num_fam, ngenomes, q) with:

prefix: path to aldir/<name of dataset>
num_fam: the current family number
ngenomes: the total number of genomes in dataset

Returns:

bool

“OK” if the files were not re-created, and have the expected format. This is used by align_all_families function, to know if something was regenerated, or if everything already existed with the expected format. If something was redone and concat/group files exist, it removes them.
False if any problem (extractions, alignment, btr, add missing genomes…)
True if just generated all files, and everything is ok

PanACoTA.align_module.alignment.mafft_align(num_fam, prt_file, mafft_file, nbfprt, logger)¶

Align all proteins of the given family with mafft

Parameters:

num_famint: current family number
prt_filestr: path to file containing all proteins extracted
mafft_filestr: path to file which will contain proteins alignment
nbfprtint: number of proteins extracted in prt file
loggerlogging.Logger: logger with queueHandler to give logs to main logger

Returns:

bool: True if no problem (alignment ok, same number of proteins extracted and aligned), False otherwise

`post_align` submodule¶

Concatenate all alignment files of all families. Then, group alignments by genome.

@author: GEM, Institut Pasteur March 2017

PanACoTA.align_module.post_align.concat_alignments(fam_nums, prefix, ali_type, quiet)¶

Concatenate all family alignment files to a unique file

Parameters:

fam_nums[]: list of family numbers
prefixstr: path to aldir/<name of dataset>-[mafft-align or mafft-prt2nuc] (used to get extraction, alignment and btr files easily)
ali_typestr: aa or nucl
quietbool: True if nothing must be sent to sdtout/stderr, False otherwise

Returns:

tuple

(output, str) with:

output: path to file containing concatenation of all alignments
str: “OK” if concatenation file already exists, “Done” if just did concatenation

PanACoTA.align_module.post_align.get_genome(header, all_genomes)¶

Find to which genome belongs ‘header’

Parameters:

headerstr: header read in alignment file
all_genomes[]: list of all genomes

Returns:

str or None: name of genome from which the header is None if no genome found

PanACoTA.align_module.post_align.group_by_genome(args)¶

From the alignment file ‘all_alns’ containing all proteins, group the alignments of proteins by their genome (listed in ‘all_genomes’), and save the result in ‘treedir’

Parameters:

argstuple

all_genomes: list of all genomes
all_alns: path to file containing all alignments concatenated
outfile: path to file which will contain alignments grouped by genome

Returns:

bool

True if everything went well
False if problem when trying to group by genomes

PanACoTA.align_module.post_align.launch_group_by_genome(all_genomes, all_alns, status, outfile, dname, type_ali, quiet)¶

Function calling group_by_genome in a pool, while giving information to user (time elapsed)

Parameters:

all_genomes[]: list of all genomes in the dataset
all_alnsstr: path to file containing all alignments concatenated
statusstr: “OK” if concatenation file already existed before running, “Done” if just did concatenation
outfilestr: file containing all families align by genome
dnamestr: name of dataset
type_alistr: nucleic or protein
quietbool: True if nothing must be sent to sdtout/stderr, False otherwise

Returns:

bool

True if everything went well or was already done
False if error occurred in at least one step

PanACoTA.align_module.post_align.post_alignment(fam_nums, all_genomes, prefix, outdir, dname, prot_ali, quiet)¶

After the alignment of all proteins by family:

concatenate all alignment files
group the alignment by genome

Parameters:

fam_nums[]: list of family numbers
all_genomes[]: list of all genomes in dataset
prefixstr: path to aldir/<name of dataset> (used to get extraction, alignment and btr files easily)
outdirstr: path to output directory, containing Aldir and Listdir, and that will also contain Treedir
dnamestr: name of dataset (used to name concat and grouped files, as well as tree folder)
prot_alibool: true: also give concatenated alignment in aa
quietbool: True if nothing must be sent to sdtout/stderr, False otherwise

PanACoTA.align_module.post_align.read_alignments(all_alns, all_genomes)¶

Read alignment file, and assign each sequence to a genome

Parameters:

all_alnsstr: path to file containing all alignments concatenated
all_genomes[]: list of all genomes

Returns:

dict or None

{genome_name: [list of sequences for this genome]}
None if problem with a protein for which we don’t find the genome

PanACoTA.align_module.post_align.write_groups(outfile, sequences)¶

Writing alignments per genome to output file.

Parameters:

outfilestr: path to file that will contain alignments grouped by genome
sequencesdict: {genome_name: [list of sequences (DNA, prot…) for this genome]}

PanACoTA.align_module package¶

pan_to_pergenome submodule¶

get_seqs submodule¶

alignment submodule¶

post_align submodule¶

`PanACoTA.align_module` package¶

`pan_to_pergenome` submodule¶

`get_seqs` submodule¶

`alignment` submodule¶

`post_align` submodule¶