PanACoTA.pangenome_module package

pangenome module of PanACoTA

mmseqs_functions submodule

Functions to use mmseqs to create a pangenome

@author gem April 2017

PanACoTA.pangenome_module.mmseqs_functions.clusters_to_file(clust, fileout)

Write all clusters to a file

Parameters:
clust : dict

{first_member: [all members = protein names]}

fileout : str

filename of pangenome where families must be written
Returns:
dict

families : {famnum: [members]}
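
As a minimal sketch of the behaviour described above, the snippet below numbers each cluster and writes one family per line. The line layout (family number followed by space-separated members), the sorting, and the numbering from 1 are assumptions made for illustration and are not confirmed by this page; `write_clusters_sketch` is a hypothetical name, not part of PanACoTA.

```python
import os
import tempfile


def write_clusters_sketch(clust, fileout):
    """Write clusters to fileout and return {famnum: [members]}.

    Assumed layout (not confirmed by the source): one family per line,
    '<famnum> <member1> <member2> ...', families numbered from 1.
    """
    families = {}
    with open(fileout, "w") as out:
        # Sort by representative name so that numbering is reproducible.
        for famnum, first_member in enumerate(sorted(clust), start=1):
            members = sorted(clust[first_member])
            families[famnum] = members
            out.write(f"{famnum} {' '.join(members)}\n")
    return families


if __name__ == "__main__":
    clusters = {"G1_0002": ["G1_0002", "G2_0007"], "G1_0001": ["G1_0001"]}
    path = os.path.join(tempfile.mkdtemp(), "pangenome.lst")
    print(write_clusters_sketch(clusters, path))
```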

PanACoTA.pangenome_module.mmseqs_functions.create_mmseqs_db(mmseqdb, prt_path, logmmseq)

Create the ffindex of the protein bank (prt_path) if it does not already exist. If it does, only write a message telling the user that the existing file will be used.

Parameters:
mmseqdb : str

path to base filename for output of mmseqs createdb

prt_path : str

path to the file containing all proteins to cluster

logmmseq : str

path to file where logs must be written

Returns:
bool

True if mmseqs db just created, False if already existed
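
The "create only if missing" logic can be sketched as follows. This is an illustration under stated assumptions, not the PanACoTA implementation: it assumes that the presence of the base file `mmseqdb` is enough to decide that the database exists, and it takes a hypothetical `runner` argument so the command can be replaced in tests. `mmseqs createdb <input> <db>` is the standard MMseqs2 command for this step.

```python
import os
import subprocess


def create_db_sketch(mmseqdb, prt_path, logmmseq, runner=subprocess.run):
    """Return True if the database was just created, False if it existed."""
    if os.path.isfile(mmseqdb):
        # Assumption: the base file being present means the db is usable.
        return False
    cmd = ["mmseqs", "createdb", prt_path, mmseqdb]
    # Append mmseqs output (stdout and stderr) to the log file.
    with open(logmmseq, "a") as log:
        runner(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
    return True
```

Injecting `runner` keeps the sketch testable on a machine where mmseqs is not installed; calling the function with the default argument runs the real command.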

PanACoTA.pangenome_module.mmseqs_functions.do_mmseqs_db(mmseqdb, prt_path, logmmseq, quiet)

Runs create_mmseqs_db with an “infinite progress bar” in the background.

create_mmseqs_db creates the ffindex of the protein bank (prt_path) if it does not already exist. If it does, it only writes a message telling the user that the existing file will be used.

Parameters:
mmseqdb : str

path to base filename for output of mmseqs createdb

prt_path : str

path to the file containing all proteins to cluster

logmmseq : str

path to file where logs must be written

quiet : bool

True if no output in stderr/stdout, False otherwise

Returns:
bool

True if mmseqs db just created, False if already existed

PanACoTA.pangenome_module.mmseqs_functions.do_pangenome(outdir, prt_bank, mmseqdb, mmseqclust, tmpdir, logmmseq, min_id, clust_mode, just_done, threads, panfile, quiet=False)

Use mmseqs to cluster proteins

Parameters:
outdir : str

directory where output files are saved

prt_bank : str

name of the file containing all proteins to cluster, without path

mmseqdb : str

path to base filename of output of mmseqs createdb

mmseqclust : str

path to base filename for output of mmseqs cluster

tmpdir : str

path to tmp directory

logmmseq : str

path to file for mmseqs logs

min_id : float

min percentage of identity to be considered in the same family (between 0 and 1)

clust_mode : [0, 1, 2]

0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’

just_done : bool

True if the mmseqs db was just (re)created, in which case any existing mmseqs clust is removed. False if the mmseqs db was kept from a previous run, in which case mmseqs clust is not rerun if its output already exists

threads : int

max number of threads to use

panfile : str

name of the pangenome file, if one is specified. Otherwise, the default pangenome name is used

quiet : bool

True if nothing must be printed on stdout/stderr, False otherwise (show progress bar)

Returns:
(families, outfile) : tuple
  • families : {fam_num: [all members]}

  • outfile : pangenome filename

PanACoTA.pangenome_module.mmseqs_functions.get_info(threads, min_id, clust_mode)

Get string containing all information on future run

Parameters:
threads : int

max number of threads to use

min_id : float

min percentage of identity to consider 2 proteins in the same family

clust_mode : [0, 1, 2]

0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’

Returns:
str

string containing info on current run, to put in filenames

PanACoTA.pangenome_module.mmseqs_functions.get_logmmseq(outdir, prt_bank, infoname)

Get filename of logfile, given information

Parameters:
outdir : str

output directory

prt_bank : str

name of file (without path) containing all proteins to cluster

infoname : str

string containing information on the current run

Returns:
str

path to mmseq logfile

PanACoTA.pangenome_module.mmseqs_functions.mmseq_tsv_to_clusters(mmseq)

Reads the output of mmseq as a tsv file, and converts it to a python dict

Parameters:
mmseq : str

filename of mmseq clustering output in tsv format

Returns:
dict

{representative_of_cluster: [list of members]}
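
A minimal sketch of this conversion, assuming the usual two-column layout of an MMseqs2 clustering tsv (representative, then member, separated by a tab, with the representative also listed as a member of its own cluster). The function name is hypothetical, and for ease of testing it takes an iterable of lines rather than a filename.

```python
from collections import defaultdict


def tsv_to_clusters_sketch(lines):
    """Group members by cluster representative.

    lines: iterable of 'representative<TAB>member' strings,
    for example an open file handle on the tsv output.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


if __name__ == "__main__":
    rows = ["p1\tp1\n", "p1\tp2\n", "p3\tp3\n"]
    print(tsv_to_clusters_sketch(rows))
```

With a real file, the same function can be called as `tsv_to_clusters_sketch(open(mmseq))`, since a file handle iterates over its lines.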

PanACoTA.pangenome_module.mmseqs_functions.mmseqs_to_pangenome(mmseqdb, mmseqclust, logmmseq, outfile)

Convert mmseqs clustering to a pangenome file:

  • convert mmseqs results to tsv file

  • convert tsv file to pangenome

Parameters:
mmseqdb : str

path to base filename of output of mmseqs createdb

mmseqclust : str

path to base filename of output of mmseqs cluster

logmmseq : str

path to file where logs must be written

outfile : str

pangenome filename

Returns:
dict
  • families : {fam_num: [all members]}

PanACoTA.pangenome_module.mmseqs_functions.mmseqs_tsv_to_pangenome(mmseqclust, logmmseq, outfile)

Convert the tsv output file of mmseqs to the pangenome file

Parameters:
mmseqclust : str

path to base filename for output of mmseq clustering

logmmseq : str

path to file where logs must be written

outfile : str or None

pangenome filename, or None if default one must be used

Returns:
dict
  • families : {fam_num: [all members]}

PanACoTA.pangenome_module.mmseqs_functions.run_all_pangenome(min_id, clust_mode, outdir, prt_path, threads, panfile=None, quiet=False)

Run all steps to build a pangenome:

  • create mmseqs database from protein bank

  • cluster proteins

  • convert to pangenome

Parameters:
min_id : float

minimum percentage of identity to be in the same family

clust_mode : [0, 1, 2]

0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’

outdir : str

directory where output cluster file must be saved

prt_path : str

path to file containing all proteins to cluster.

threads : int

number of threads which can be used

panfile : str or None

name for the output pangenome file. If None, the default name is used

quiet : bool

True if nothing must be written on stdout, False otherwise.

Returns:
(families, outfile) : tuple
  • families : {fam_num: [all members]}

  • outfile : pangenome filename

PanACoTA.pangenome_module.mmseqs_functions.run_mmseqs_clust(args)

Run mmseqs clustering

Parameters:
args : tuple

(mmseqdb, mmseqclust, tmpdir, logmmseq, min_id, threads, clust_mode), with:

  • mmseqdb: path to base filename (output created by mmseq db)

  • mmseqclust: path to base filename for output of mmseq clustering

  • tmpdir : path to folder which will contain mmseq temporary files

  • logmmseq : path to file where logs must be written

  • min_id : min percentage of identity to be considered in the same family (between 0 and 1)

  • threads : max number of threads to use

  • clust_mode : [0, 1, 2], 0 for ‘set cover’, 1 for ‘single-linkage’, 2 for ‘CD-Hit’
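
To illustrate how the args tuple maps onto an mmseqs call, the sketch below unpacks it and builds the command line without running it. `--min-seq-id`, `--threads` and `--cluster-mode` are standard `mmseqs cluster` options; whether PanACoTA passes exactly these options, and the input checks shown here, are assumptions. `build_cluster_cmd` is a hypothetical helper.

```python
def build_cluster_cmd(args):
    """Return (command as a list, log path) for an mmseqs clustering run."""
    mmseqdb, mmseqclust, tmpdir, logmmseq, min_id, threads, clust_mode = args
    if not 0 <= min_id <= 1:
        raise ValueError("min_id must be between 0 and 1")
    if clust_mode not in (0, 1, 2):
        raise ValueError("clust_mode must be 0, 1 or 2")
    cmd = [
        "mmseqs", "cluster", mmseqdb, mmseqclust, tmpdir,
        "--min-seq-id", str(min_id),
        "--threads", str(threads),
        "--cluster-mode", str(clust_mode),
    ]
    return cmd, logmmseq


if __name__ == "__main__":
    cmd, log = build_cluster_cmd(("db", "clust", "tmp", "log.txt", 0.8, 4, 1))
    print(" ".join(cmd), "(logs in", log + ")")
```

The returned list can be passed to `subprocess.run`, with stdout and stderr redirected to the log file.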

post_treatment submodule

Functions to generate the pan_quali and pan_quanti matrices, as well as a summary file for the pangenome.

@author gem April 2017

PanACoTA.pangenome_module.post_treatment.generate_and_write_outputs(fams_by_strain, families, all_strains, panquali, panquanti, psf)

From the Python objects of the pangenome, generate the qualitative and quantitative matrices, as well as the summary file.

Parameters:
fams_by_strain : dict

{fam_num: {strain: [members]}}

families : dict

{fam_num: [all members]}

all_strains : list

list of all strains

panquali : _io.TextIOWrapper

open file where qualitative matrix will be written

panquanti : _io.TextIOWrapper

open file where quantitative matrix will be written

psf : _io.TextIOWrapper

open file where summary will be written

Returns:
(qualis, quantis, summaries) : tuple

with:

  • qualis = {fam_num: [0 if no gene for the strain, 1 if at least 1 gene, for each strain in all_strains]}

  • quantis = {fam_num: [number of genes for each strain in all_strains]}

  • summaries = {fam_num: [nb_members, sum_quanti, sum_quali, nb_0, nb_mono, nb_multi, sum_0-mono-multi, max_multi]}
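
The three returned dictionaries can be computed as in the sketch below, which follows the field definitions given on this page. It leaves out the writing to the three open files, and the function name is hypothetical.

```python
def matrices_sketch(fams_by_strain, families, all_strains):
    """Return (qualis, quantis, summaries), each keyed by family number."""
    qualis, quantis, summaries = {}, {}, {}
    for fam_num in sorted(fams_by_strain):
        per_strain = fams_by_strain[fam_num]
        # Number of members of this family in each strain, in all_strains order.
        quanti = [len(per_strain.get(strain, [])) for strain in all_strains]
        # Presence (1) or absence (0) of the family in each strain.
        quali = [1 if count > 0 else 0 for count in quanti]
        nb_0 = quanti.count(0)
        nb_mono = quanti.count(1)
        nb_multi = sum(1 for count in quanti if count > 1)
        summaries[fam_num] = [
            len(families[fam_num]),      # nb_members
            sum(quanti),                 # sum_quanti
            sum(quali),                  # sum_quali
            nb_0,
            nb_mono,
            nb_multi,
            nb_0 + nb_mono + nb_multi,   # sum_0-mono-multi
            max(quanti, default=0),      # max_multi
        ]
        qualis[fam_num] = quali
        quantis[fam_num] = quanti
    return qualis, quantis, summaries
```

For a consistent input, `nb_members` equals `sum_quanti` and `sum_0-mono-multi` equals `len(all_strains)`, which makes both fields useful as sanity checks.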

PanACoTA.pangenome_module.post_treatment.open_outputs_to_write(fams_by_strain, families, all_strains, pangenome)

Open the output files, call the function that generates the matrices and the summary, and write them to those output files.

Parameters:
fams_by_strain : dict

{fam_num: {strain: [members]}}

families : dict

{fam_num: [all members]}

all_strains : list

list of all genome names

pangenome : str

filename containing pangenome. Will be extended for the 3 output files

Returns:
(qualis, quantis, summaries) : tuple

with:

  • qualis = {fam_num: [0 if no gene for the strain, 1 if at least 1 gene, for each strain in all_strains]}

  • quantis = {fam_num: [number of genes for each strain in all_strains]}

  • summaries = {fam_num: [nb_members, sum_quanti, sum_quali, nb_0, nb_mono, nb_multi, sum_0-mono-multi, max_multi]}

PanACoTA.pangenome_module.post_treatment.post_treat(families, pangenome)

From clusters = {num: [members]}, create:

  • a pan_quali matrix (lines = families, columns = genomes, 1 if genome present in family, 0 otherwise)

  • a pan_quanti matrix (lines = families, columns = genomes, number of members from given genome in the given family)

  • a summary file: lines = families. For each family:

    • nb_members: total number of members

    • sum_quanti: should be the same as nb_members!

    • sum_quali: number of different genomes in family

    • nb_0: number of missing genomes in family

    • nb_mono: number of genomes with exactly 1 member

    • nb_multi: number of genomes with more than 1 member

    • sum_0-mono-multi: should be equal to the total number of genomes in dataset

    • max_multi: maximum number of members from 1 genome

Parameters:
families : dict

{num_fam: [list of members]}. Can be None, and then they will be retrieved from the pangenome file

pangenome : str

file containing pangenome
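
Before the matrices can be filled, the members of each family have to be grouped by genome. The sketch below shows one way to do that. How a genome name is derived from a protein name is not stated on this page, so the default rule used here (everything before the last underscore) is purely an assumption and can be swapped through the hypothetical `strain_of` argument.

```python
def group_by_strain_sketch(families, strain_of=None):
    """Return ({fam_num: {strain: [members]}}, sorted list of all strains)."""
    if strain_of is None:
        # Assumed naming rule: '<genome>_<gene id>'. Not confirmed by the source.
        def strain_of(member):
            return member.rsplit("_", 1)[0]

    fams_by_strain = {}
    all_strains = set()
    for fam_num, members in families.items():
        per_strain = {}
        for member in members:
            strain = strain_of(member)
            per_strain.setdefault(strain, []).append(member)
            all_strains.add(strain)
        fams_by_strain[fam_num] = per_strain
    return fams_by_strain, sorted(all_strains)


if __name__ == "__main__":
    print(group_by_strain_sketch({1: ["A_1", "A_2", "B_1"], 2: ["C_1"]}))
```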

protein_seq_functions submodule

Functions to build a bank of all proteins to include in the pangenome

@author gem April 2017

PanACoTA.pangenome_module.protein_seq_functions.build_prt_bank(lstinfo, dbpath, name, spedir, quiet)

Build a file containing all proteins of all genomes contained in lstinfo.

Parameters:
lstinfo : str

1 line per genome, only 1st column considered here, as the genome name without extension

dbpath : str

Proteins folder, containing all proteins for each genome. Each genome has its own protein file, called <genome_name>.prt.

name : str

dataset name, used to name the output databank: <outdir>/<name>.All.prt

spedir : str or None

By default, output file is saved in dbpath directory. If it must be saved somewhere else, it is specified here.

quiet : bool

True if nothing must be written in stdout/stderr, False otherwise

Returns:
str

name (with path) of the protein databank generated
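
A minimal sketch of the concatenation this function describes, under a few assumptions: lstinfo is the path to a whitespace-separated list file with no header line, every listed genome has a `<genome_name>.prt` file in dbpath, and the progress display controlled by quiet is left out. `build_bank_sketch` is a hypothetical name.

```python
import os


def build_bank_sketch(lstinfo, dbpath, name, spedir=None):
    """Concatenate all <genome>.prt files into <outdir>/<name>.All.prt."""
    outdir = spedir if spedir else dbpath
    outfile = os.path.join(outdir, f"{name}.All.prt")
    with open(lstinfo) as lst, open(outfile, "w") as bank:
        for line in lst:
            fields = line.split()
            if not fields:
                continue  # skip empty lines
            genome = fields[0]  # only the first column is used
            with open(os.path.join(dbpath, f"{genome}.prt")) as prt:
                for prot_line in prt:
                    bank.write(prot_line)
    return outfile
```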