PanACoTA.prepare_module package

prepare module of PanACoTA

download genomes submodule

Functions that help download the RefSeq genomes of a species, gunzip them, add complete genomes…

@author gem August 2017

PanACoTA.prepare_module.download_genomes_func.download_from_ncbi(species_linked, section, ncbi_species_name, ncbi_species_taxid, ncbi_taxid, spe_strains, levels, outdir, threads)

Download NCBI genomes of the given species.

Parameters:
species_linked: str

given NCBI species with ‘_’ instead of spaces, or NCBI taxID if species name not given

section: str

genbank or only refseq (default = refseq)

ncbi_species_name: str or None

name of species to download: user given NCBI species. None if no species name given

ncbi_species_taxid: int

species taxid given in NCBI (-T option)

ncbi_taxid: int

taxid given in NCBI (-t option)

spe_strains: str

specific strain name, or comma-separated strain names (or name of a file with one strain name per line)

outdir: str

directory where downloaded sequences must be saved

threads: int

number of threads to use to download genome sequences

Returns:
str

Output filename of downloaded summary
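
The spe_strains parameter accepts three forms: a single name, comma-separated names, or a file with one name per line. A minimal sketch of how such a value could be normalized to a list is given below. The helper name parse_strains is hypothetical and is not part of PanACoTA; how the package itself parses this value is not documented here.

```python
import os


def parse_strains(spe_strains):
    """Hypothetical helper: turn a spe_strains value into a list of strain names."""
    if os.path.isfile(spe_strains):
        # The value names a file: read one strain name per line, skipping blanks.
        with open(spe_strains) as handle:
            return [line.strip() for line in handle if line.strip()]
    # Otherwise treat the value as one name or several comma-separated names.
    return [name.strip() for name in spe_strains.split(",") if name.strip()]


print(parse_strains("K-12, O157:H7"))
```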

PanACoTA.prepare_module.download_genomes_func.to_database(outdir, section)

Move .fna.gz files to ‘database_init’ folder, and uncompress them.

Parameters:
outdir: str

directory where all results are (for now, refseq/genbank folders, assembly summary and log)

section: str

refseq (default) or genbank

Returns:
nb_gen: number of genomes downloaded
db_dir: directory containing all fna files downloaded from refseq/genbank
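
The move-then-uncompress step described above can be sketched with the standard library only. This is an illustration under assumptions, not the package's code: the function name gather_and_gunzip, the recursive search under outdir/section, and the db_name default are all choices made for the example.

```python
import glob
import gzip
import os
import shutil


def gather_and_gunzip(outdir, section="refseq", db_name="database_init"):
    """Move every .fna.gz found under outdir/section into db_dir, then gunzip it."""
    db_dir = os.path.join(outdir, db_name)
    os.makedirs(db_dir, exist_ok=True)
    # Assumption: archives may sit at any depth below the section folder.
    pattern = os.path.join(outdir, section, "**", "*.fna.gz")
    nb_gen = 0
    for gz_path in sorted(glob.glob(pattern, recursive=True)):
        moved = shutil.move(gz_path, os.path.join(db_dir, os.path.basename(gz_path)))
        fna_path = moved[:-len(".gz")]
        # Stream the decompressed content to a plain .fna file.
        with gzip.open(moved, "rb") as src, open(fna_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(moved)
        nb_gen += 1
    return nb_gen, db_dir
```

The return value mirrors the documented pair (nb_gen, db_dir), so the result can be passed on to a quality-control step that expects a folder of .fna files.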

filter genomes submodule

Functions that help run quality control on genomes in order to eliminate low-quality sequences, and then run Mash loops in order to discard genomes that are too close to each other.

@author gem August 2019

PanACoTA.prepare_module.filter_genomes.check_quality(species_linked, db_path, tmp_dir, max_l90, max_cont, cutn)

Run quality control on all genomes in db_path.

Parameters:
outdir: str

directory for all results (refseq downloads, database init etc)

species_linked: str

given NCBI species with ‘_’ instead of spaces, or NCBI taxID if species name not given

db_path: str

directory to ‘Database_init’ containing all .fna files

tmp_dir: str

directory where all tmp files must be saved (files cut at each stretch of ‘x’ N)

max_l90: int

max L90 value tolerated to keep a genome

max_cont: int

max number of contigs tolerated to keep a genome

cutn: int

cut at each stretch of this number of ‘N’. Don’t cut if equal to 0

Returns:
genomes: dict

{genome_file: [genome_name, orig_path, path_to_seq_to_annotate, size, nbcont, l90]}. No short name is needed, since the genomes are not annotated here; genome_name is the filename without its extension.

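
To make the three quality parameters concrete, here is a small sketch of the computations they refer to: cutting contigs at stretches of ‘N’, counting contigs, and computing L90 (the smallest number of contigs, taken longest first, that covers 90% of the total length). The function names are hypothetical, and two details are assumptions: a stretch is taken to mean at least cutn consecutive ‘N’, and both thresholds are treated as inclusive.

```python
import re


def split_at_n(sequence, cutn):
    """Split a contig at each run of at least `cutn` N characters (no cut if cutn == 0)."""
    if cutn == 0:
        return [sequence]
    pieces = re.split("N{%d,}" % cutn, sequence, flags=re.IGNORECASE)
    return [piece for piece in pieces if piece]


def l90(contig_lengths):
    """Smallest number of contigs, longest first, covering at least 90% of the total."""
    total = sum(contig_lengths)
    covered = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        # Integer arithmetic avoids floating-point issues at exactly 90%.
        if covered * 10 >= total * 9:
            return count
    return 0


def passes_quality(contigs, max_l90, max_cont, cutn):
    """True if the genome is within both thresholds after cutting at N stretches."""
    lengths = [len(p) for seq in contigs for p in split_at_n(seq, cutn)]
    return len(lengths) <= max_cont and l90(lengths) <= max_l90
```

Cutting before counting matters: a scaffold with long N gaps counts as several contigs once it is split, which raises both nbcont and L90.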
PanACoTA.prepare_module.filter_genomes.compare_all(out_msh, matrix, npz_matrix, mash_log, threads)

Compare all pairs of genomes that have already been sketched in the given file.

Parameters:
out_msh: str

output of mash

matrix: str

file in which to write the generated matrix of pairwise distances between all genomes

npz_matrix: str

matrix of pairwise distances saved in a binary file

mash_log: str

mash logfile

threads

max number of threads to use

Returns:
return code
PanACoTA.prepare_module.filter_genomes.iterative_mash(sorted_genomes, genomes, outdir, species_linked, min_dist, max_dist, threads, quiet)

Run mash all vs all to get all pairwise distances. Then take the first genome of the list and remove every genome whose distance to it is not between min_dist and max_dist (for example 1e-4 and 0.06). Repeat with the next genome kept in the list, and so on until the last genome.

Parameters:
sorted_genomes: list

list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)

genomes: dict

{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}

outdir: str

path to directory where all results are saved

species_linked: str

species name if given, otherwise species taxID

min_dist: float

lower limit of distance between 2 genomes to keep them

max_dist: float

upper limit of distance between 2 genomes to keep them

threads

max number of threads to use

quiet: bool

True if nothing must be sent to stdout/stderr, False otherwise

Returns:
genomes_removed: dict

{genome_name: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)
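
The iterative filtering described for this function can be sketched without Mash, given any way to look up a pairwise distance. The sketch below is an illustration, not the package's implementation: iterative_filter and the distance callable are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def iterative_filter(sorted_genomes, distance, min_dist, max_dist):
    """Return (kept, removed); removed maps genome -> [reference, dist]."""
    to_try = list(sorted_genomes)
    kept = []
    removed = {}
    while to_try:
        # The best remaining genome becomes the reference and is always kept.
        ref = to_try.pop(0)
        kept.append(ref)
        still_ok = []
        for other in to_try:
            dist = distance(ref, other)
            if min_dist <= dist <= max_dist:
                still_ok.append(other)
            else:
                # Too close to, or too far from, the reference: discard it.
                removed[other] = [ref, dist]
        to_try = still_ok
    return kept, removed


# Toy distances between four genomes, keyed by unordered pair.
pairs = {
    frozenset(("A", "B")): 0.00001, frozenset(("A", "C")): 0.03,
    frozenset(("A", "D")): 0.08, frozenset(("B", "C")): 0.03,
    frozenset(("B", "D")): 0.08, frozenset(("C", "D")): 0.07,
}


def lookup(g1, g2):
    return pairs[frozenset((g1, g2))]


kept, removed = iterative_filter(["A", "B", "C", "D"], lookup, 1e-4, 0.06)
print(kept, removed)
```

Because the input list is ordered by decreasing quality, the genome kept from any group of near-identical genomes is the best-assembled one.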

PanACoTA.prepare_module.filter_genomes.mash_step(to_try, corresp, mat_sp, genomes_removed, min_dist, max_dist)

Prepare a mash run, with a given genome as reference, and others to compare to.

Parameters:
to_try: list

list of genome_file (keys of ‘genomes’) to compare, ordered by decreasing L90/nbcont

corresp: dict

{genome_file : num_of_genome in sorted_genomes}

mat_sp: scipy.sparse.dok.dok_matrix

triangle matrix containing pairwise distance comparisons

genomes_removed: dict

{genome_file: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)

min_dist: float

lower limit of distance between 2 genomes to keep them

max_dist: float

upper limit of distance between 2 genomes to keep them

Returns:
to_try is updated (reference element and all genomes not compatible with it are removed)
genomes_removed is updated
return code (0 if no problem)
PanACoTA.prepare_module.filter_genomes.read_matrix(genomes, sorted_genomes, matrix)

Read the matrix of pairwise distances between all genomes, and save it to a sparse matrix (only upper triangle).

Parameters:
genomes: dict

{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}

sorted_genomes: list

list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)

matrix: str

file containing the matrix of pairwise distances between all genomes

Returns:
mat_sp: scipy.sparse.dok.dok_matrix

sparse matrix (python dok_matrix object) holding the upper triangle of pairwise distances
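
As an illustration of keeping only the upper triangle, the sketch below reads a tab-separated distance table into a plain dictionary keyed by (row, column). Several things here are assumptions: the table layout (a header line whose first field is a label followed by genome names, then one row per genome), the function name read_upper_triangle, and the use of a dict as a stand-in for scipy's dok_matrix, which is itself a dictionary-of-keys structure.

```python
def read_upper_triangle(lines, sorted_genomes):
    """Return {(i, j): dist} for i < j, with indices taken from sorted_genomes."""
    index = {name: num for num, name in enumerate(sorted_genomes)}
    lines = iter(lines)
    # Header: first field is a label, the rest are column genome names.
    header = next(lines).rstrip("\n").split("\t")[1:]
    upper = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        row = index[fields[0]]
        for col_name, value in zip(header, fields[1:]):
            col = index[col_name]
            if row < col:
                # Store each unordered pair once; skip diagonal and lower triangle.
                upper[(row, col)] = float(value)
    return upper


table = [
    "#query\tg1\tg2\tg3\n",
    "g1\t0\t0.01\t0.05\n",
    "g2\t0.01\t0\t0.02\n",
    "g3\t0.05\t0.02\t0\n",
]
print(read_upper_triangle(table, ["g1", "g2", "g3"]))
```

Storing one entry per pair halves the memory needed for a symmetric matrix, which matters when thousands of genomes are compared.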

PanACoTA.prepare_module.filter_genomes.sketch_all(genomes, sorted_genomes, outdir, list_reps, out_msh, mash_log, threads)

Sketch all genomes to a combined archive.

Parameters:
genomes: dict

{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}

sorted_genomes: list

list of ‘genome_file’ for all genomes kept (L90 and nbcont ok), ordered by decreasing quality

outdir: str

path to directory where all results are saved

list_reps: str

file with the list of genomes to sketch. The file is emptied if it already contains something, then filled with the information from ‘genomes’.

out_msh: str

output of mash

mash_log: str

mash logfile

threads

max number of threads to use

Returns:
return value (0 if OK, 1 if error)
PanACoTA.prepare_module.filter_genomes.sort_genomes_minhash(genomes, max_l90, max_cont)

Sort genomes: draft genomes are sorted by L90, and then by number of contigs.

Parameters:
genomes: dict

{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}

max_l90: int

max L90 value tolerated to keep a genome

max_cont: int

max number of contigs tolerated to keep a genome

Returns:
sorted_genomes: list of ‘genome_file’ for all genomes kept (L90 and nbcont ok),
ordered by decreasing quality
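
A minimal sketch of this filter-and-sort step, using the documented layout of the genomes dict (nbcont at index 4, l90 at index 5). The function name sort_by_quality is hypothetical, and treating the thresholds as inclusive is an assumption.

```python
def sort_by_quality(genomes, max_l90, max_cont):
    """Keep genomes within both thresholds, best first (lowest L90, then fewest contigs)."""
    kept = [
        genome_file
        for genome_file, info in genomes.items()
        if info[5] <= max_l90 and info[4] <= max_cont
    ]
    # Tuples compare element by element, so L90 is the primary key.
    return sorted(kept, key=lambda gfile: (genomes[gfile][5], genomes[gfile][4]))


genomes = {
    "a.fna": ["a", "a.fna", "a.fna", 5000000, 10, 5],
    "b.fna": ["b", "b.fna", "b.fna", 5000000, 3, 2],
    "c.fna": ["c", "c.fna", "c.fna", 5000000, 500, 200],
    "d.fna": ["d", "d.fna", "d.fna", 5000000, 1, 2],
}
print(sort_by_quality(genomes, max_l90=100, max_cont=999))
```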
PanACoTA.prepare_module.filter_genomes.write_outputfiles(genomes, sorted_genomes, genomes_removed, outdir, gspecies, min_dist, max_dist)

Write the list of genomes kept in a file, 1 genome per line. This file will be the input for annotation and the next steps.

Write discarded genomes to another file, with, for each line:
- genome name
- the other genome against which the comparison failed
- distance to this other genome

Parameters:
genomes: dict

{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}

sorted_genomes: list

list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)

genomes_removed: dict

{genome_name: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)

outdir: str

directory where those list files must be created

gspecies: str

species name if given, otherwise species taxID

min_dist: float

lower limit of distance between 2 genomes to keep them

max_dist: float

upper limit of distance between 2 genomes to keep them

Returns:
return code
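
The two list files can be sketched as follows. Everything specific in this example is an assumption made for illustration: the function name write_lists, the file names, the tab-separated layout, the choice of writing path_to_seq_to_annotate for kept genomes, and the lookup of removed genomes by genome_name.

```python
import os


def write_lists(genomes, sorted_genomes, genomes_removed, outdir):
    """Write one file of kept genomes and one of discarded genomes; return both paths."""
    kept_path = os.path.join(outdir, "kept.lst")            # hypothetical file name
    discarded_path = os.path.join(outdir, "discarded.lst")  # hypothetical file name
    with open(kept_path, "w") as kept:
        for genome_file in sorted_genomes:
            name = genomes[genome_file][0]
            if name not in genomes_removed:
                # One genome per line: here, the path of the sequence to annotate.
                kept.write(genomes[genome_file][2] + "\n")
    with open(discarded_path, "w") as discarded:
        for name, (ref_name, dist) in genomes_removed.items():
            # Genome name, genome it conflicted with, and their distance.
            discarded.write("%s\t%s\t%s\n" % (name, ref_name, dist))
    return kept_path, discarded_path
```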