PanACoTA.pangenome_module package
Pangenome module of PanACoTA
mmseqs_functions submodule
Functions to use mmseqs to create a pangenome
@author gem April 2017
- PanACoTA.pangenome_module.mmseqs_functions.clusters_to_file(clust, fileout)
Write all clusters to a file
- Parameters:
- clust : {first_member: [all members = protein names]}
- fileout : filename of the pangenome file where families must be written
- Returns:
- dict
families : {famnum: [members]}
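As an illustration of the mapping from `clust` to the returned `families`, here is a minimal sketch. The function name `write_clusters`, the sort order, and the one-family-per-line layout (family number followed by its members, space-separated) are assumptions made for this example and are not taken from the PanACoTA source.

```python
def write_clusters(clust, fileout):
    """Write one family per line and return {famnum: [members]}.

    ASSUMPTION: line layout is "<famnum> <member1> <member2> ...".
    """
    families = {}
    with open(fileout, "w") as out:
        # Sort clusters by their first member so numbering is reproducible.
        for famnum, first in enumerate(sorted(clust), start=1):
            members = sorted(clust[first])
            families[famnum] = members
            out.write(" ".join([str(famnum)] + members) + "\n")
    return families
```

Numbering families from 1 in a fixed order keeps the output file and the returned dict consistent between runs.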
- PanACoTA.pangenome_module.mmseqs_functions.create_mmseqs_db(mmseqdb, prt_path, logmmseq)
Create the ffindex of the protein bank (prt_path) if not already done. If it already exists, only write a message telling the user that the existing file will be used.
- Parameters:
- mmseqdb : str
path to base filename for output of mmseqs createdb
- prt_path : str
path to the file containing all proteins to cluster
- logmmseq : str
path to file where logs must be written
- Returns:
- bool
True if the mmseqs db was just created, False if it already existed
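The create-or-reuse behaviour described above can be sketched as follows. This is not the PanACoTA implementation: the name `create_db_if_missing`, the injectable `run` parameter, and the choice to test only for the base file are assumptions for illustration. `mmseqs createdb <input> <db>` is the MMseqs2 command that builds the database.

```python
import os
import subprocess


def create_db_if_missing(mmseqdb, prt_path, logmmseq, run=subprocess.run):
    """Return True if the database was just created, False if it existed.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without mmseqs being installed.
    """
    if os.path.isfile(mmseqdb):
        # ASSUMPTION: the base file alone signals an existing database.
        return False
    cmd = ["mmseqs", "createdb", prt_path, mmseqdb]
    with open(logmmseq, "a") as log:
        # Send both output streams of mmseqs to the log file.
        run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
    return True
```

The boolean result is what lets a caller decide whether a previous clustering can still be trusted.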
- PanACoTA.pangenome_module.mmseqs_functions.do_mmseqs_db(mmseqdb, prt_path, logmmseq, quiet)
Runs create_mmseqs_db with an “infinite progress bar” in the background.
create_mmseqs_db creates the ffindex of the protein bank (prt_path) if not already done. If it already exists, it only writes a message telling the user that the existing file will be used.
- Parameters:
- mmseqdb : str
path to base filename for output of mmseqs createdb
- prt_path : str
path to the file containing all proteins to cluster
- logmmseq : str
path to file where logs must be written
- quiet : bool
True if no output in stderr/stdout, False otherwise
- Returns:
- bool
True if the mmseqs db was just created, False if it already existed
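One way to picture an “infinite progress bar” running in the background is a spinner thread that animates until the wrapped call returns. The sketch below is an assumption about the general technique, not PanACoTA's code; `run_with_spinner`, `stream` and `delay` are hypothetical names.

```python
import sys
import threading


def run_with_spinner(func, *args, quiet=False, stream=sys.stderr, delay=0.1):
    """Run func(*args) while a spinner animates in a background thread."""
    done = threading.Event()

    def spin():
        frames = "|/-\\"
        i = 0
        while not done.is_set():
            stream.write("\r" + frames[i % len(frames)])
            stream.flush()
            i += 1
            done.wait(delay)  # pause, but wake up as soon as the work ends

    spinner = None
    if not quiet:
        spinner = threading.Thread(target=spin, daemon=True)
        spinner.start()
    try:
        return func(*args)
    finally:
        done.set()
        if spinner is not None:
            spinner.join()
            stream.write("\r \r")  # erase the last spinner frame
            stream.flush()
```

With `quiet=True` no thread is started and nothing is written, which matches the meaning of the `quiet` parameter above.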
- PanACoTA.pangenome_module.mmseqs_functions.do_pangenome(outdir, prt_bank, mmseqdb, mmseqclust, tmpdir, logmmseq, min_id, clust_mode, just_done, threads, panfile, quiet=False)
Use mmseqs to cluster proteins
- Parameters:
- outdir : str
directory where output files are saved
- prt_bank : str
name of the file containing all proteins to cluster, without path
- mmseqdb : str
path to base filename of output of mmseqs createdb
- mmseqclust : str
path to base filename for output of mmseqs clustering
- tmpdir : str
path to tmp directory
- logmmseq : str
path to file for mmseqs logs
- min_id : float
minimum sequence identity for proteins to be placed in the same family, as a fraction between 0 and 1
- clust_mode : [0, 1, 2]
0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’
- just_done : bool
True if the mmseqs db was just (re)created -> remove mmseqs clust. False if the mmseqs db was kept from a previous run -> no need to rerun mmseqs clust if it already exists
- threads : int
max number of threads to use
- panfile : str
name of the pangenome file, if one is specified. Otherwise, the default pangenome name is used
- quiet : bool
True if nothing must be printed on stdout/stderr, False otherwise (show progress bar)
- Returns:
- (families, outfile) : tuple
families : {fam_num: [all members]}
outfile : pangenome filename
- PanACoTA.pangenome_module.mmseqs_functions.get_info(threads, min_id, clust_mode)
Get a string containing all information on the future run
- Parameters:
- threads : int
max number of threads to use
- min_id : float
minimum sequence identity to consider 2 proteins in the same family
- clust_mode : [0, 1, 2]
0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’
- Returns:
- str
string containing info on the current run, to put in filenames
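A run-information string of this kind can be built directly from the three parameters. The exact format PanACoTA uses is not given here, so the layout below (`<min_id>-mode<clust_mode>-th<threads>`) and the name `run_info` are purely hypothetical.

```python
def run_info(threads, min_id, clust_mode):
    """Build a filename-safe tag describing the run (hypothetical format)."""
    if not 0 <= min_id <= 1:
        raise ValueError("min_id must be between 0 and 1")
    if clust_mode not in (0, 1, 2):
        raise ValueError("clust_mode must be 0, 1 or 2")
    return f"{min_id}-mode{clust_mode}-th{threads}"
```

Embedding the parameters in filenames makes it possible to tell apart outputs produced with different settings in the same directory.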
- PanACoTA.pangenome_module.mmseqs_functions.get_logmmseq(outdir, prt_bank, infoname)
Get the filename of the logfile, given information on the run
- Parameters:
- outdir : str
output directory
- prt_bank : str
name of file (without path) containing all proteins to cluster
- infoname : str
string containing information on the current run
- Returns:
- str
path to the mmseqs logfile
- PanACoTA.pangenome_module.mmseqs_functions.mmseq_tsv_to_clusters(mmseq)
Reads the output of mmseqs as a tsv file, and converts it to a python dict
- Parameters:
- mmseq : str
filename of the mmseqs clustering output in tsv format
- Returns:
- dict
{representative_of_cluster: [list of members]}
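The tsv-to-dict conversion can be sketched as below, assuming a two-column file with the cluster representative in the first column and one member in the second, one pair per line. The function name `tsv_to_clusters` is hypothetical and the column layout is an assumption for this example.

```python
import csv
from collections import defaultdict


def tsv_to_clusters(path):
    """Return {representative: [members]} from a two-column tsv file."""
    clusters = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            representative, member = row[0], row[1]
            clusters[representative].append(member)
    return dict(clusters)
```

In this layout the representative normally also appears as a member of its own cluster, so the member lists are complete on their own.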
- PanACoTA.pangenome_module.mmseqs_functions.mmseqs_to_pangenome(mmseqdb, mmseqclust, logmmseq, outfile)
Convert mmseqs clustering to a pangenome file:
  - convert mmseqs results to tsv file
  - convert tsv file to pangenome
- Parameters:
- mmseqdb : str
path to base filename of output of mmseqs createdb
- mmseqclust : str
path to base filename of output of mmseqs cluster
- logmmseq : str
path to file where logs must be written
- outfile : str
pangenome filename
- Returns:
- dict
families : {fam_num: [all members]}
- PanACoTA.pangenome_module.mmseqs_functions.mmseqs_tsv_to_pangenome(mmseqclust, logmmseq, outfile)
Convert the tsv output file of mmseqs to the pangenome file
- Parameters:
- mmseqclust : str
path to base filename for output of mmseqs clustering
- logmmseq : str
path to file where logs must be written
- outfile : str
pangenome filename, or None if the default one must be used
- Returns:
- dict
families : {fam_num: [all members]}
- PanACoTA.pangenome_module.mmseqs_functions.run_all_pangenome(min_id, clust_mode, outdir, prt_path, threads, panfile=None, quiet=False)
Run all steps to build a pangenome:
  - create mmseqs database from protein bank
  - cluster proteins
  - convert to pangenome
- Parameters:
- min_id : float
minimum sequence identity to be in the same family
- clust_mode : [0, 1, 2]
0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’
- outdir : str
directory where the output cluster file must be saved
- prt_path : str
path to file containing all proteins to cluster.
- threads : int
number of threads which can be used
- panfile : str or None
name for the output pangenome file. If None, the default name is used
- quiet : bool
True if nothing must be written on stdout, False otherwise.
- Returns:
- (families, outfile) : tuple
families : {fam_num: [all members]}
outfile : pangenome filename
- PanACoTA.pangenome_module.mmseqs_functions.run_mmseqs_clust(args)
Run mmseqs clustering
- Parameters:
- args : tuple
(mmseqdb, mmseqclust, tmpdir, logmmseq, min_id, threads, clust_mode), with:
  - mmseqdb : path to base filename (output created by mmseqs createdb)
  - mmseqclust : path to base filename for output of mmseqs clustering
  - tmpdir : path to folder which will contain mmseqs temporary files
  - logmmseq : path to file where logs must be written
  - min_id : minimum sequence identity to be considered in the same family, as a fraction between 0 and 1
  - threads : max number of threads to use
  - clust_mode : [0, 1, 2], 0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’
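To show how the `args` tuple could map onto an MMseqs2 command line, the sketch below unpacks it and builds the argument list for `mmseqs cluster`. The helper name `build_cluster_cmd` is hypothetical, and whether PanACoTA passes exactly these options is an assumption; `--min-seq-id`, `--threads` and `--cluster-mode` are standard MMseqs2 options.

```python
def build_cluster_cmd(args):
    """Unpack the args tuple and return (command list, log path)."""
    mmseqdb, mmseqclust, tmpdir, logmmseq, min_id, threads, clust_mode = args
    cmd = [
        "mmseqs", "cluster", mmseqdb, mmseqclust, tmpdir,
        "--min-seq-id", str(min_id),         # identity threshold, 0 to 1
        "--threads", str(threads),
        "--cluster-mode", str(clust_mode),   # 0, 1 or 2, as listed above
    ]
    return cmd, logmmseq
```

The resulting list can be handed to `subprocess.run` with its output redirected to the log file. Packing everything into one tuple is convenient when the function is called by a wrapper that forwards a single argument.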
post_treatment submodule
Functions to generate the pan_quali and pan_quanti matrices, as well as a summary file for the pangenome.
@author gem April 2017
- PanACoTA.pangenome_module.post_treatment.generate_and_write_outputs(fams_by_strain, families, all_strains, panquali, panquanti, psf)
From the python objects of the pangenome, generate the qualitative and quantitative matrices, as well as the summary file.
- Parameters:
- fams_by_strain : dict
{fam_num: {strain: [members]}}
- families : dict
{fam_num: [all members]}
- all_strains : list
list of all strains
- panquali : _io.TextIOWrapper
open file where the qualitative matrix will be written
- panquanti : _io.TextIOWrapper
open file where the quantitative matrix will be written
- psf : _io.TextIOWrapper
open file where the summary will be written
- Returns:
- (qualis, quantis, summaries) : tuple
with:
  - qualis = {fam_num: [0 if no gene for the strain, 1 if at least 1 gene, for each strain in all_strains]}
  - quantis = {fam_num: [number of genes for each strain in all_strains]}
  - summaries = {fam_num: [nb_members, sum_quanti, sum_quali, nb_0, nb_mono, nb_multi, sum_0-mono-multi, max_multi]}
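The three returned dicts follow directly from `fams_by_strain`. The sketch below computes them and writes one line per family to each open handle; the function name `write_matrices` and the line layout (family number followed by space-separated values) are assumptions for illustration, not the PanACoTA file format.

```python
def write_matrices(fams_by_strain, families, all_strains, panquali, panquanti, psf):
    """Compute and write the three outputs; return (qualis, quantis, summaries)."""
    qualis, quantis, summaries = {}, {}, {}
    for fam_num in sorted(families):
        by_strain = fams_by_strain.get(fam_num, {})
        # Quantitative row: number of members per strain, in all_strains order.
        quanti = [len(by_strain.get(strain, [])) for strain in all_strains]
        # Qualitative row: presence (1) or absence (0) per strain.
        quali = [1 if count else 0 for count in quanti]
        nb_0 = quanti.count(0)
        nb_mono = quanti.count(1)
        nb_multi = len(quanti) - nb_0 - nb_mono
        summary = [
            len(families[fam_num]),      # nb_members
            sum(quanti),                 # sum_quanti
            sum(quali),                  # sum_quali
            nb_0, nb_mono, nb_multi,
            nb_0 + nb_mono + nb_multi,   # sum_0-mono-multi
            max(quanti, default=0),      # max_multi
        ]
        qualis[fam_num], quantis[fam_num], summaries[fam_num] = quali, quanti, summary
        # ASSUMED layout: family number, then space-separated values.
        for handle, row in ((panquali, quali), (panquanti, quanti), (psf, summary)):
            handle.write(" ".join(str(v) for v in [fam_num] + row) + "\n")
    return qualis, quantis, summaries
```

Because the function only calls `write` on the handles, `io.StringIO` objects can stand in for real files when trying it out.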
- PanACoTA.pangenome_module.post_treatment.open_outputs_to_write(fams_by_strain, families, all_strains, pangenome)
Open the output files, call the function that generates the matrices and the summary, and write them to those output files
- Parameters:
- fams_by_strain : dict
{fam_num: {strain: [members]}}
- families : dict
{fam_num: [all members]}
- all_strains : list
list of all genome names
- pangenome : str
filename containing the pangenome. Will be extended for the 3 output files
- Returns:
- (qualis, quantis, summaries) : tuple
with:
  - qualis = {fam_num: [0 if no gene for the strain, 1 if at least 1 gene, for each strain in all_strains]}
  - quantis = {fam_num: [number of genes for each strain in all_strains]}
  - summaries = {fam_num: [nb_members, sum_quanti, sum_quali, nb_0, nb_mono, nb_multi, sum_0-mono-multi, max_multi]}
- PanACoTA.pangenome_module.post_treatment.post_treat(families, pangenome)
From clusters = {num: [members]}, create:
  - a pan_quali matrix (lines = families, columns = genomes, 1 if the genome is present in the family, 0 otherwise)
  - a pan_quanti matrix (lines = families, columns = genomes, number of members from the given genome in the given family)
  - a summary file: lines = families. For each family:
    - nb_members: total number of members
    - sum_quanti: should be the same as nb_members!
    - sum_quali: number of different genomes in the family
    - nb_0: number of genomes missing from the family
    - nb_mono: number of genomes with exactly 1 member
    - nb_multi: number of genomes with more than 1 member
    - sum_0-mono-multi: should be equal to the total number of genomes in the dataset
    - max_multi: maximum number of members from 1 genome
- Parameters:
- families : dict
{num_fam: [list of members]}. Can be None, and then they will be retrieved from the pangenome file
- pangenome : str
file containing the pangenome
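Building the matrices first requires grouping each family's members by genome. The sketch below shows one way to do that grouping. The rule used to get a genome name from a member name (the first three dot-separated fields) is an assumption about the naming scheme, and both function names are hypothetical.

```python
def genome_of_member(member):
    """ASSUMPTION: the genome name is the first three dot-separated fields."""
    return ".".join(member.split(".")[:3])


def group_by_genome(families, genome_of=genome_of_member):
    """Return ({fam_num: {genome: [members]}}, sorted list of all genomes)."""
    fams_by_strain = {}
    all_strains = set()
    for fam_num, members in families.items():
        per_strain = {}
        for member in members:
            strain = genome_of(member)
            per_strain.setdefault(strain, []).append(member)
            all_strains.add(strain)
        fams_by_strain[fam_num] = per_strain
    return fams_by_strain, sorted(all_strains)
```

From this structure, the pan_quanti entry for a family and genome is the length of the corresponding member list, and the pan_quali entry is 1 whenever that length is non-zero. A different naming scheme only needs a different `genome_of` callable.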
protein_seq_functions submodule
Functions to build a bank of all proteins to include in the pangenome
@author gem April 2017
- PanACoTA.pangenome_module.protein_seq_functions.build_prt_bank(lstinfo, dbpath, name, spedir, quiet)
Build a file containing all proteins of all genomes contained in lstinfo.
- Parameters:
- lstinfo : str
1 line per genome; only the 1st column is considered here, as the genome name without extension
- dbpath : str
Proteins folder, containing all proteins for each genome. Each genome has its own protein file, called <genome_name>.prt.
- name : str
dataset name, used to name the output databank: <outdir>/<name>.All.prt
- spedir : str or None
By default, the output file is saved in the dbpath directory. If it must be saved somewhere else, that directory is specified here.
- quiet : bool
True if nothing must be written in stdout/stderr, False otherwise
- Returns:
- str
name (with path) of the protein databank generated
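Building the protein bank amounts to concatenating the per-genome `.prt` files named in the first column of lstinfo. The sketch below illustrates this under two assumptions: lstinfo has no header line, and columns are whitespace-separated. The function name `concat_proteins` is hypothetical.

```python
import os
import shutil


def concat_proteins(lstinfo, dbpath, name, spedir=None):
    """Concatenate <dbpath>/<genome>.prt files into <outdir>/<name>.All.prt."""
    outdir = spedir if spedir else dbpath  # default: save next to the inputs
    outfile = os.path.join(outdir, name + ".All.prt")
    with open(lstinfo) as listing, open(outfile, "wb") as bank:
        for line in listing:
            fields = line.split()
            if not fields:
                continue  # ignore empty lines
            genome = fields[0]  # only the first column is used
            with open(os.path.join(dbpath, genome + ".prt"), "rb") as prt:
                shutil.copyfileobj(prt, bank)
    return outfile
```

Copying in binary mode avoids any newline translation, so the bank is a byte-for-byte concatenation of the input files.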