PanACoTA.align_module package

pan_to_pergenome submodule

From the Persistent Genome file, group all persistent proteins per genome, in order to be able to extract them faster after.

Input:

  • Persistent genome (1 line per family, 1st column is family number, others are all members)

  • list of all genomes (1 genome per line, only first column is considered)

  • output directory

  • dname: the name of the dataset, used to name output files

Output:

  • for each genome: <outdir>/List-<dname>/<dname>-getEntry_gen_<genome>.txt: list of all genes from the ‘genome’ which correspond to persistent proteins. 2 columns: the first one is the protein name, the second is the filename to which it must be extracted (corresponding to the family in the persistent genome).

  • for each genome: <outdir>/List-<dname>/<dname>-getEntry_prt_<genome>.txt: same as previous file but with the list of proteins to extract instead of genes.

  • for each family: <outdir>/Align-<dname>/<dname>-current.<fam_num>.miss.lst: list of genomes which do not have members in family ‘fam_num’.

@author GEM November 2016

PanACoTA.align_module.pan_to_pergenome.get_all_genomes(list_gen)

Read the file containing the genomes list and get all genome names

Parameters:
list_genstr

File containing the list of all genomes

Returns:
list

list of all genome names

PanACoTA.align_module.pan_to_pergenome.get_per_genome(persgen, list_gen, dname, outdir)

From persistent genome and list of all genomes, sort persistent proteins by genome

For each genome, write all persistent proteins to a file, with the family from which they are, in order to extract those proteins after. For each family, also save the names of genomes which do not have any member. This will be used to complete the alignments by stretches of ‘-‘.

Parameters:
persgenstr

file containing persistent genome

list_genstr

file containing the list of all genomes

dnamestr

name of the dataset

outdirstr

Directory where files must be saved. Will create 2 subfolders: Align-<dname> and List-<dname>

Returns:
(all_genomes, aldir, listdir, families)tuple
  • all_genomes : [] list of all genome names

  • aldir : str, path to align directory

  • listdir : str, path to List directory

  • families : str, list of family numbers

PanACoTA.align_module.pan_to_pergenome.proteins_per_strain(persgen, all_genomes)

From the persistentGenome file, get all persistent proteins, and classify them according to the strain from which they are.

Parameters:
persgenstr

File containing persistent genome

Returns:
(all_prots, fam_genomes, several)tuple
  • all_prots: dict, {strain: {member: fam_num}}

  • fam_genomes: dict, {fam_num: set(genomes having a member in fam)}

  • several: dict, {fam_num: set(genomes having several members in fam)}

PanACoTA.align_module.pan_to_pergenome.write_genome_file(listdir, aldir, dname, strain, member, several)

For a given genome, write all the proteins and genes to extract to its file. If one of the 2 files (proteins and genes) already exists, overwrite it. If no file exists -> write them. If the 2 files exist -> warning saying that we use already existing files.

Parameters:
listdirstr

path to List directory

aldirstr

path to Align directory

dnamestr

name of dataset

strainstr

current genome name

memberdict

{member: fam_num}

severaldict

{fam_num: [genomes having several members in family]}

PanACoTA.align_module.pan_to_pergenome.write_getentry_files(all_prots, several, listdir, aldir, dname, all_genomes)

For each species, write all its persistent proteins into a file, with the persistent family in which they are. Those files will be used to extract all proteins.

Parameters:
all_protsdict

{strain: {member: fam_num}}

severaldict

{fam_num: [genomes having several members in family]}

listdirstr

directory where lists of proteins per genome must be saved

aldirstr

directory where extracted proteins per family must be saved

dnamestr

name of dataset

all_genomeslist

list of all genomes

PanACoTA.align_module.pan_to_pergenome.write_missing_genomes(fam_genomes, several, all_genomes, aldir, dname)

For each family, write the names of all genomes which do not have any member in the family.

Parameters:
fam_genomesdict

{fam_num: [genomes having a member in fam]}

severaldict

{fam_num: [genomes having several members in family]}

all_genomeslist

list of all genomes

aldirstr

directory where the lists of missing genomes per family must be saved

dnamestr

name of dataset

get_seqs submodule

PanACoTA.align_module.get_seqs.check_existing_extract(all_fams, aldir, dname)

For each family, check if its prt and gen extraction file already exist. If both exist, no need to re-extract for those families. If only one or no one exists, put to list to extract.

Parameters:
all_famslist

list of all family numbers

aldirstr

path to directory where extraction files must be saved

dnamestr

name of the dataset

Returns:
[]

list of files that must be generated (prt and gen files)

PanACoTA.align_module.get_seqs.extract_sequences(to_extract, fasf, files_todo=None, outf=None)

Extract sequences from an open fasta file ‘fasf’, and a list of sequences to extract

Parameters:
to_extractdict or []

{sequence_to_extract: file_to_which_it_will_be_extracted} or list of sequences to extract, all in a same outfile (name must be given in ‘outf’)

fasf_io.TextIO

open file containing sequences in multi-fasta format

files_todolist or None

list of files which must be generated (prt and gen files). Others already exist, so ignore them.

outf_io.TextIO or None

If an outfile is given (not None), and ‘to_extract’ is a dict, only its keys will be considered, and all these sequences will be extracted to ‘outfile’ (if ‘to_extract’ is a list, will extract all sequences of this list). Otherwise, if None, each sequence will be extracted to its corresponding value in ‘to_extract’.

PanACoTA.align_module.get_seqs.get_all_seqs(all_genomes, dname, dbpath, listdir, aldir, all_fams, quiet)

For all genomes, extract its proteins present in a persistent family to the file corresponding to this family.

Parameters:
all_genomes[]

list of all genome names

dnamestr

name of dataset

dbpathstr

path to folder containing ‘Proteins’ and ‘Genes’ folders

listdirstr

path to folder containing the lists of proteins/genes to extract

aldirstr

path to folder where extracted proteins/genes must be saved

all_fams[]

list of all family numbers

quietbool

True if nothin must be written to stdout/stderr, False otherwise

PanACoTA.align_module.get_seqs.get_genome_seqs(fasta, tabfile, files_todo, outfile=None)

From a fasta file, extract all sequences given in the tab file. The tab file can contain:

  • 1 sequence name per line -> all sequences will be extracted to the same file

  • 1 sequence + 1 filename per line -> each sequence will be extracted in the given file

If outfile not given, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given, only the 1st column of tab file will be considered, and all sequences will be extracted to the given outfile.

Parameters:
fastastr

path to fasta file from which sequences must be extracted

tabfilestr

path to the tab file containing the names of sequences to extract

files_todolist

list of files which must be generated (prt and gen files). Others already exist, so ignore them.

outfilestr or None

if None, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given (not None), only the 1st column of tab file will be considered, and all sequences will be extracted to the given outfile.

PanACoTA.align_module.get_seqs.get_names_to_extract(tabf, outfile)

From the tab file, get names of sequences to extract.

Parameters:
tabf_io.TextIO

open file containing names of sequences to extract

outfilestr or None

if None, the tab file must contain 2 columns (1 for the sequence name, 1 for its output file). If an outfile is given (not None), only the 1st column of tab file will be considered, and all sequences will be given ‘outfile’ as output file

Returns:
dict

{sequence_to_extract: file_to_which_it_will_be_extracted}

alignment submodule

For a given family:

  • align its proteins using mafft

  • back translate to nucleotides

  • add “-” of same size as alignment for genomes not having members in the family

@author: GEM, Institut Pasteur March 2017

PanACoTA.align_module.alignment.add_missing_genomes(align_file, ali_type, miss_file, num_fam, ngenomes, status1, logger)

Once all family proteins are aligned, and back-translated to nucleotides, add missing genomes for the family to the alignment with ‘-‘. (Add missing genomes to both mafft alignment and back-translated alignment)

Parameters:
align_filestr

path to file containing alignments (proteins if from mafft output, or nucleic sequences if after backtranslating them)

ali_typestr

protein or backtranslated

miss_filestr

path to file containing the list of missing genomes in this family

num_famint

family number

ngenomesint

total number of genomes in dataset

status1bool or str
  • “OK” if we did not redo the alignments as they already were as expected. In that case, if missing genomes are already present, just add a warning message saying that we used the already existing btr file.

  • True if we just did the alignments and backtranslate. So no warning message needed.

  • False if problem with extraction, alignment or backtranslation (will never happen as this function is not called if status1 == False)

loggerlogging.Logger

the logger, having a queue Handler, to give logs to the main logger

Returns:
bool or str
  • “OK” if btr file was not recreated, and already has the right number of sequences, and all with the same length.

  • False if problem in btr file alignment, so missing genomes not added

  • True if alignment + adding missing genomes is ok. Can happen if there is no missing genome for this family (in that case, btr generated already has the right number of sequences), or if we just added the missing genomes.

PanACoTA.align_module.alignment.align_all_families(prefix, all_fams, ngenomes, dname, quiet, threads)

For each family:

  • align all its proteins with mafft

  • back-translate to nucleotides

  • add missing genomes

Parameters:
prefixstr

path to aldir/<name of dataset> (used to get extraction, alignment and btr files easily)

all_fams[]

list of all family numbers

ngenomesint

total number of genomes in dataset

dnamestr

name of dataset (used to name concat and grouped files, as well as tree folder)

quietbool

True if nothing must be written in stdout/stderr, False otherwise

threadsint

max number of threads that can be used by mafft

Returns:
bool

True if everything went well, False if there was a problem in at least 1 family.

PanACoTA.align_module.alignment.back_translate(num_fam, mafft_file, gen_file, btr_file, nbfal, logger)

Backtranslate protein alignment to nucleotides

Parameters:
num_famint

current family number. Used for log messages

mafft_filestr

path to file containing protein alignments by mafft

gen_filestr

path to file containing all sequences, not aligned, in nucleotides. It is used to convert the alignment in proteins into a nucleotide alignment

btr_filestr

path to the file that will contain the nucleotide alignment

nbfalint

number of sequences aligned for the family by mafft

loggerlogging.Logger

logger with queueHandler to give logs to main logger

Returns:
bool
  • False if problem (back-translation, different number of families…)

  • number of sequences in btr file if everything went well

PanACoTA.align_module.alignment.check_add_missing(btr_file, num_fam, ngenomes, logger, prev)

Check back-translated alignment while missing genomes have been added

Parameters:
btr_filestr

path to file containing back-translated alignment

num_famint

current family number

ngenomesint

total number of genomes in dataset

loggerlogging.Logger

logger with queueHandler to give logs to main logger

prevbool

True if we are checking alignments before adding missing genomes (in case it comes from a previous run and is already done, or in case there are no missing genomes for this family, so nothing to do to btr file). In that case, do not write error message if the number of sequences does not correspond to the total number of genomes. False if we just added missing genomes. In that case, nb sequences should be equal to the total number of genomes. If not, write error message.

Returns:
bool or int
  • True if right number of sequences in btr file and all the same length (everything is ok)

  • False if problem in alignment (sequences do not all have the same size), or same size but not all genomes while not being from previous run (prev=False)

  • alignment length if sequences aligned all have the same length, but missing genomes are not added yet (so, will have to add lines with this number of ‘-‘)

PanACoTA.align_module.alignment.check_extractions(num_fam, miss_file, prt_file, gen_file, ngenomes, logger)

Check that extractions went well for the given family:

  • check number of proteins and genes extracted compared to the number of genomes

Parameters:
num_famint

current family number

miss_filestr

path to file containing the list of genomes missing for the current family

prt_filestr

path to file containing all proteins extracted

gen_filestr

path to file containing all genes extracted

ngenomesint

total number of genomes in dataset

loggerlogging.Logger

logger with queueHandler to give logs to main logger

Returns:
bool or int

False if any problem (nbmiss+prt != nbgenomes or nbmiss+gen != nbgenomes). If no problem, returns the number of proteins/genes extracted

PanACoTA.align_module.alignment.check_lens(aln_file, num_fam, logger)

In the given alignment file, check that all sequences have the same length. If there is no problem, it returns the length of alignment, and the number of sequences aligned.

Parameters:
aln_filestr

path to the alignment file to check

num_famint

current family number. Used for log message if problem

loggerlogging.Logger

logger having a queue Handler to give logs to the main logger in the main process

Returns:
bool or tuple

False if there is a problem in the alignment (sequences do not all have the same length). If they all have the same length, returns this length and the number of sequences

PanACoTA.align_module.alignment.check_nb_seqs(alnfile, nbfal, logger, message='')

Check the number of sequences in the given alignment file

Parameters:
alnfilestr

path to alignment file

nbfalint or []

expected number of sequences, or list of expected numbers of sequences

loggerlogging.Logger

logger with queueHandler to give logs to main logger

messagestr

message to write when sequence numbers do not match

Returns:
bool or int
  • False if not same number of sequences

  • nbseqs in align file if found among values in ‘nbfal’

PanACoTA.align_module.alignment.family_alignment(prt_file, gen_file, miss_file, mafft_file, btr_file, num_fam, ngenomes, logger)

From a given family, align all its proteins with mafft, back-translate to nucleotides, and add missing genomes in this family.

Parameters:
prt_filestr

path to file containing proteins extracted

gen_filestr

path to file containing genes extracted

miss_filestr

path to file containing list of genomes missing

mafft_filestr

path to file which will contain the protein alignment

btr_filestr

path to file which will contain the nucleotide alignment back-translated from protein alignment

num_famint

current family number

ngenomesint

total number of genomes in dataset

loggerlogging.Logger

logger with queueHandler to give logs to main logger

Returns:
bool or str
  • False if problem with extractions or with alignment or with backtranslation

  • ‘nb_seqs’ = number of sequences aligned if everything went well (extractions and alignment ok, btr created without problem)

  • “OK” if extractions and alignments went well, and btr already exists and is ok

PanACoTA.align_module.alignment.handle_family(args)

For the given family:

  • align its proteins with mafft

  • back-translate to nucleotides

  • add missing genomes

Parameters:
args()

(prefix, num_fam, ngenomes, q) with:

  • prefix: path to aldir/<name of dataset>

  • num_fam: the current family number

  • ngenomes: the total number of genomes in dataset

  • q: a queue, which will be used by logger to put logs while in other process

Returns:
bool
  • “OK” if the files were not re-created, and have the expected format. This is used by align_all_families function, to know if something was regenerated, or if everything already existed with the expected format. If something was redone and concat/group files exist, it removes them.

  • False if any problem (extractions, alignment, btr, add missing genomes…)

  • True if just generated all files, and everything is ok

PanACoTA.align_module.alignment.handle_family_1thread(args)

For the given family:

  • align its proteins with mafft

  • back-translate to nucleotides

  • add missing genomes

Parameters:
args()

(prefix, num_fam, ngenomes, q) with:

  • prefix: path to aldir/<name of dataset>

  • num_fam: the current family number

  • ngenomes: the total number of genomes in dataset

Returns:
bool
  • “OK” if the files were not re-created, and have the expected format. This is used by align_all_families function, to know if something was regenerated, or if everything already existed with the expected format. If something was redone and concat/group files exist, it removes them.

  • False if any problem (extractions, alignment, btr, add missing genomes…)

  • True if just generated all files, and everything is ok

PanACoTA.align_module.alignment.mafft_align(num_fam, prt_file, mafft_file, nbfprt, logger)

Align all proteins of the given family with mafft

Parameters:
num_famint

current family number

prt_filestr

path to file containing all proteins extracted

mafft_filestr

path to file which will contain proteins alignment

nbfprtint

number of proteins extracted in prt file

loggerlogging.Logger

logger with queueHandler to give logs to main logger

Returns:
bool

True if no problem (alignment ok, same number of proteins extracted and aligned), False otherwise

post_align submodule

Concatenate all alignment files of all families. Then, group alignments by genome.

@author: GEM, Institut Pasteur March 2017

PanACoTA.align_module.post_align.concat_alignments(fam_nums, prefix, ali_type, quiet)

Concatenate all family alignment files to a unique file

Parameters:
fam_nums[]

list of family numbers

prefixstr

path to aldir/<name of dataset>-[mafft-align or mafft-prt2nuc] (used to get extraction, alignment and btr files easily)

ali_typestr

aa or nucl

quietbool

True if nothing must be sent to sdtout/stderr, False otherwise

Returns:
tuple

(output, str) with:

  • output: path to file containing concatenation of all alignments

  • str: “OK” if concatenation file already exists, “Done” if just did concatenation

PanACoTA.align_module.post_align.get_genome(header, all_genomes)

Find to which genome belongs ‘header’

Parameters:
headerstr

header read in alignment file

all_genomes[]

list of all genomes

Returns:
str or None

name of genome from which the header is None if no genome found

PanACoTA.align_module.post_align.group_by_genome(args)

From the alignment file ‘all_alns’ containing all proteins, group the alignments of proteins by their genome (listed in ‘all_genomes’), and save the result in ‘treedir’

Parameters:
argstuple
  • all_genomes: list of all genomes

  • all_alns: path to file containing all alignments concatenated

  • outfile: path to file which will contain alignments grouped by genome

Returns:
bool
  • True if everything went well

  • False if problem when trying to group by genomes

PanACoTA.align_module.post_align.launch_group_by_genome(all_genomes, all_alns, status, outfile, dname, type_ali, quiet)

Function calling group_by_genome in a pool, while giving information to user (time elapsed)

Parameters:
all_genomes[]

list of all genomes in the dataset

all_alnsstr

path to file containing all alignments concatenated

statusstr

“OK” if concatenation file already existed before running, “Done” if just did concatenation

outfilestr

file containing all families align by genome

dnamestr

name of dataset

type_alistr

nucleic or protein

quietbool

True if nothing must be sent to sdtout/stderr, False otherwise

Returns:
bool
  • True if everything went well or was already done

  • False if error occurred in at least one step

PanACoTA.align_module.post_align.post_alignment(fam_nums, all_genomes, prefix, outdir, dname, prot_ali, quiet)

After the alignment of all proteins by family:

  • concatenate all alignment files

  • group the alignment by genome

Parameters:
fam_nums[]

list of family numbers

all_genomes[]

list of all genomes in dataset

prefixstr

path to aldir/<name of dataset> (used to get extraction, alignment and btr files easily)

outdirstr

path to output directory, containing Aldir and Listdir, and that will also contain Treedir

dnamestr

name of dataset (used to name concat and grouped files, as well as tree folder)

prot_alibool

true: also give concatenated alignment in aa

quietbool

True if nothing must be sent to sdtout/stderr, False otherwise

PanACoTA.align_module.post_align.read_alignments(all_alns, all_genomes)

Read alignment file, and assign each sequence to a genome

Parameters:
all_alnsstr

path to file containing all alignments concatenated

all_genomes[]

list of all genomes

Returns:
dict or None
  • {genome_name: [list of sequences for this genome]}

  • None if problem with a protein for which we don’t find the genome

PanACoTA.align_module.post_align.write_groups(outfile, sequences)

Writing alignments per genome to output file.

Parameters:
outfilestr

path to file that will contain alignments grouped by genome

sequencesdict

{genome_name: [list of sequences (DNA, prot…) for this genome]}