`PanACoTA.prepare_module` package¶

prepare module of PanACoTA

`download genomes` submodule¶

Functions for downloading the RefSeq genomes of a species, gunzipping them, adding complete genomes…

@author gem August 2017

PanACoTA.prepare_module.download_genomes_func.download_from_ncbi(species_linked, section, ncbi_species_name, ncbi_species_taxid, ncbi_taxid, spe_strains, levels, outdir, threads)¶

Download ncbi genomes of given species

Parameters:

species_linked (str): given NCBI species with ‘_’ instead of spaces, or NCBI taxID if species name not given
section (str): genbank or only refseq (default = refseq)
ncbi_species_name (str or None): name of species to download: user given NCBI species. None if no species name given
ncbi_species_taxid (int): species taxid given in NCBI (-T option)
ncbi_taxid (int): taxid given in NCBI (-t option)
spe_strains (str): specific strain name, or comma-separated strain names (or name of a file with one strain name per line)
outdir (str): directory where downloaded sequences must be saved
threads (int): number of threads to use to download genome sequences

Returns:

str: Output filename of downloaded summary

PanACoTA.prepare_module.download_genomes_func.to_database(outdir, section)¶

Move .fna.gz files to ‘database_init’ folder, and uncompress them.

Parameters:

outdir (str): directory where all results are (for now, refseq/genbank folders, assembly summary and log)
section (str): refseq (default) or genbank

Returns:

nb_gen: number of genomes downloaded
db_dir: directory containing all fna files downloaded from refseq/genbank
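
The step described above can be sketched with the standard library alone. This is a minimal illustration, not PanACoTA's actual implementation: the helper name `move_and_gunzip` and the `db_name` default are assumptions, and the sketch leaves the original `.fna.gz` files in place instead of moving them.

```python
import gzip
import shutil
from pathlib import Path


def move_and_gunzip(outdir, section="refseq", db_name="Database_init"):
    """Uncompress every .fna.gz found under outdir/section into outdir/db_name.

    Hypothetical helper: name, defaults and directory layout are assumptions.
    Returns (number of genomes, path to the database directory).
    """
    src_root = Path(outdir) / section
    db_dir = Path(outdir) / db_name
    db_dir.mkdir(parents=True, exist_ok=True)
    nb_gen = 0
    for gz_path in sorted(src_root.rglob("*.fna.gz")):
        # Drop the trailing ".gz" to get the uncompressed file name
        target = db_dir / gz_path.name[:-len(".gz")]
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        nb_gen += 1
    return nb_gen, str(db_dir)
```

The two returned values mirror `nb_gen` and `db_dir` in the description above.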

`filter genomes` submodule¶

Functions for running quality control on genomes in order to eliminate low-quality sequences, and then running Mash loops in order to discard genomes that are too close to each other.

@author gem August 2019

PanACoTA.prepare_module.filter_genomes.check_quality(species_linked, db_path, tmp_dir, max_l90, max_cont, cutn)¶

Do a quality control of all genomes in db_path

Parameters:

outdir (str): directory for all results (refseq downloads, database init etc)
species_linked (str): given NCBI species with ‘_’ instead of spaces, or NCBI taxID if species name not given
db_path (str): directory to ‘Database_init’ containing all .fna files
tmp_dir (str): directory where all tmp files must be saved (files cut at each stretch of ‘x’ N)
max_l90 (int): max L90 value tolerated to keep a genome
max_cont (int): max number of contigs tolerated to keep a genome
cutn (int): cut at each stretch of this number of ‘N’. Don’t cut if equal to 0

Returns:

genomes (dict): {genome_file: [genome_name, orig_path, path_to_seq_to_annotate, size, nbcont, l90]}
No need for a short name, as we won’t annotate genomes. genome_name is the same as the filename, but without extension.
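
The two quantities used by this quality control can be illustrated as follows. This is a hedged sketch, not the package's code: the helper names are hypothetical, and it assumes that a cut happens at every run of at least `cutn` N characters and that L90 is the smallest number of contigs covering at least 90% of the total length.

```python
import re


def split_on_n(sequence, cutn):
    """Split a contig at each run of at least `cutn` 'N' (no cut if cutn is 0)."""
    if cutn <= 0:
        return [sequence]
    pieces = re.split("N{%d,}" % cutn, sequence, flags=re.IGNORECASE)
    return [piece for piece in pieces if piece]


def l90(contig_lengths):
    """Smallest number of contigs whose summed length covers >= 90% of the total."""
    total = sum(contig_lengths)
    covered = 0
    for rank, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        # Integer arithmetic avoids floating-point rounding on the 90% threshold
        if covered * 10 >= total * 9:
            return rank
    return 0


def passes_quality(contig_lengths, max_l90, max_cont):
    """Assumed rule: keep a genome when both values are within the given limits."""
    return len(contig_lengths) <= max_cont and l90(contig_lengths) <= max_l90
```

For instance, contigs of lengths 50, 30, 10 and 10 need the three longest contigs to reach 90% of the 100 bases, so the L90 is 3.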

PanACoTA.prepare_module.filter_genomes.compare_all(out_msh, matrix, npz_matrix, mash_log, threads)¶

Compare all pairs of genomes that have already been sketched in the given file.

Parameters:

out_msh (str): output of mash
matrix (str): file to put generated matrix of pairwise distances between all genomes
npz_matrix (str): matrix of pairwise distances saved in a binary file
mash_log (str): mash logfile
threads (int): max number of threads to use

Returns:

return code

PanACoTA.prepare_module.filter_genomes.iterative_mash(sorted_genomes, genomes, outdir, species_linked, min_dist, max_dist, threads, quiet)¶

Run Mash all-vs-all to get all pairwise distances. Then take the first genome of the list and remove every genome whose distance to it is not between min_dist and max_dist (for example, 1e-4 and 0.06). Repeat with the next genome still in the list, and so on until the last genome.

Parameters:

sorted_genomes (list): list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)
genomes (dict): {genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
outdir (str): path to directory where all results are saved
species_linked (str): species name if given, otherwise species taxID
min_dist (float): lower limit of distance between 2 genomes to keep them
max_dist (float): upper limit of distance between 2 genomes to keep them
threads (int): max number of threads to use
quiet (bool): True if nothing must be sent to stdout/stderr, False otherwise

Returns:

genomes_removed (dict): {genome_name: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)
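
The iterative filtering described above can be sketched in pure Python. This is an illustration under stated assumptions, not the package's implementation: the function name is hypothetical, distances come from a plain dict instead of a Mash matrix, and both bounds are treated as inclusive.

```python
def greedy_distance_filter(sorted_genomes, dist, min_dist=1e-4, max_dist=0.06):
    """Greedy filtering of genomes by pairwise distance (hypothetical helper).

    sorted_genomes: genome names, best quality first.
    dist: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (kept, removed), where removed maps name -> [ref_name, distance].
    """
    to_try = list(sorted_genomes)
    kept = []
    removed = {}
    while to_try:
        # The first genome still in the list becomes the reference and is kept
        ref = to_try.pop(0)
        kept.append(ref)
        remaining = []
        for name in to_try:
            d = dist[frozenset((ref, name))]
            if min_dist <= d <= max_dist:
                remaining.append(name)
            else:
                # Too close to, or too far from, the reference: discard it
                removed[name] = [ref, d]
        to_try = remaining
    return kept, removed
```

Because the list is ordered by quality, a genome is only ever discarded in favour of a reference that ranks before it.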

PanACoTA.prepare_module.filter_genomes.mash_step(to_try, corresp, mat_sp, genomes_removed, min_dist, max_dist)¶

Prepare a mash run, with a given genome as reference, and others to compare to.

Parameters:

to_try (list): list of genome_file (keys of ‘genomes’) to compare, ordered by decreasing L90/nbcont
corresp (dict): {genome_file: num_of_genome in sorted_genomes}
mat_sp (scipy.sparse.dok.dok_matrix): triangle matrix containing pairwise distance comparisons
genomes_removed (dict): {genome_file: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)
min_dist (float): lower limit of distance between 2 genomes to keep them
max_dist (float): upper limit of distance between 2 genomes to keep them

Returns:

to_try is updated (reference element and all genomes not compatible with it are removed)
genomes_removed is updated
return code (0 if no problem)

PanACoTA.prepare_module.filter_genomes.read_matrix(genomes, sorted_genomes, matrix)¶

Read the matrix of pairwise distances between all genomes, and save it to a sparse matrix (only upper triangle).

Parameters:

genomes (dict): {genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
sorted_genomes (list): list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)
matrix (str): file containing the matrix of pairwise distances between all genomes

Returns:

mat_sp (scipy.sparse.dok.dok_matrix): sparse matrix holding the upper triangle of the pairwise distances
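
Keeping only the upper triangle can be sketched as below. Several points are assumptions made for illustration: the file is taken to be a square, tab-separated table with a header row and a label column; the helper name is hypothetical; and a plain dict keyed by `(i, j)` stands in for the dok_matrix, which accepts the same `mat[i, j] = value` style of assignment.

```python
import csv


def read_upper_triangle(handle, sorted_genomes):
    """Read a square distance table into {(i, j): distance} with i < j.

    Indices follow the order of sorted_genomes. Genomes that are not in
    that list are ignored. The table layout is an assumption (see above).
    """
    index = {name: num for num, name in enumerate(sorted_genomes)}
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)[1:]  # column labels; the first cell is a corner label
    upper = {}
    for row in reader:
        if not row or row[0] not in index:
            continue
        i = index[row[0]]
        for col_name, value in zip(header, row[1:]):
            j = index.get(col_name)
            # Store each pair once, in the upper triangle only
            if j is not None and i < j:
                upper[(i, j)] = float(value)
    return upper
```

Storing each pair once halves the memory needed for a symmetric matrix, which matters when many genomes are compared.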

PanACoTA.prepare_module.filter_genomes.sketch_all(genomes, sorted_genomes, outdir, list_reps, out_msh, mash_log, threads)¶

Sketch all genomes to a combined archive.

Parameters:

genomes (dict): {genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
sorted_genomes (list): list of ‘genome_file’ for all genomes kept (L90 and nbcont ok), ordered by decreasing quality
outdir (str): path to directory where all results are saved
list_reps (str): file with list of genomes to sketch. The file is emptied if it already contains something, and filled with the information from ‘genomes’.
out_msh (str): output of mash
mash_log (str): mash logfile
threads (int): max number of threads to use

Returns:

return value (0 if OK, 1 if error)

PanACoTA.prepare_module.filter_genomes.sort_genomes_minhash(genomes, max_l90, max_cont)¶

Sort draft genomes by L90, and then by number of contigs.

Parameters:

genomes (dict): {genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
max_l90 (int): max L90 value tolerated to keep a genome
max_cont (int): max number of contigs tolerated to keep a genome

Returns:

sorted_genomes (list): list of ‘genome_file’ for all genomes kept (L90 and nbcont ok), ordered by decreasing quality
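
A minimal sketch of this filtering and sorting, assuming that a lower L90 and fewer contigs mean better quality and that the limits are inclusive. The function name is hypothetical; positions 4 and 5 of each value are `nbcont` and `l90`, following the layout of ‘genomes’ given above.

```python
def sort_by_quality(genomes, max_l90, max_cont):
    """Return genome_file keys within both limits, best quality first.

    Hypothetical helper: 'best' is assumed to mean lowest L90, then
    lowest number of contigs.
    """
    kept = [
        genome_file
        for genome_file, info in genomes.items()
        if info[5] <= max_l90 and info[4] <= max_cont
    ]
    # Sort on (l90, nbcont): ties on L90 are broken by the number of contigs
    return sorted(kept, key=lambda g: (genomes[g][5], genomes[g][4]))
```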

PanACoTA.prepare_module.filter_genomes.write_outputfiles(genomes, sorted_genomes, genomes_removed, outdir, gspecies, min_dist, max_dist)¶

Write the list of genomes kept to a file, one genome per line. This file is the input for annotation and the next steps.

Write discarded genomes to another file, with, on each line:

- genome name
- the other genome against which the problem was found
- distance to this other genome

Parameters:

genomes (dict): {genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
sorted_genomes (list): list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)
genomes_removed (dict): {genome_name: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)
outdir (str): directory where those list files must be created
gspecies (str): species name if given, otherwise species taxID
min_dist (float): lower limit of distance between 2 genomes to keep them
max_dist (float): upper limit of distance between 2 genomes to keep them

Returns:

return code
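
The two lists can be produced along these lines. This is a hedged sketch: the helper name is hypothetical, it writes to already opened handles so that no output file name has to be assumed, and the exact content of each line (the genome_file key for kept genomes, tab-separated fields for discarded ones) is an assumption.

```python
def write_lists(genomes, sorted_genomes, genomes_removed, kept_out, discarded_out):
    """Write kept genomes (one per line) and discarded genomes with their reason.

    Hypothetical helper. Returns the number of genomes written as kept.
    """
    nb_kept = 0
    for genome_file in sorted_genomes:
        genome_name = genomes[genome_file][0]
        if genome_name in genomes_removed:
            continue
        kept_out.write(genome_file + "\n")
        nb_kept += 1
    for genome_name, (ref_name, dist) in genomes_removed.items():
        # One line per discarded genome: name, reference it conflicts with, distance
        discarded_out.write(f"{genome_name}\t{ref_name}\t{dist}\n")
    return nb_kept
```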
