PanACoTA.prepare_module package
prepare module of PanACoTA
download genomes submodule
Functions that help download the RefSeq genomes of a species, gunzip them, adding complete genomes…
@author gem August 2017
- PanACoTA.prepare_module.download_genomes_func.download_from_ncbi(species_linked, section, ncbi_species_name, ncbi_species_taxid, ncbi_taxid, spe_strains, levels, outdir, threads)
Download the NCBI genomes of the given species.
- Parameters:
- species_linked: str
NCBI species name with ‘_’ instead of spaces, or NCBI taxID if no species name is given
- section: str
genbank or only refseq (default = refseq)
- ncbi_species_name: str or None
name of the species to download, as given by the user (NCBI species name). None if no species name is given
- ncbi_species_taxid: int
species taxid given in NCBI (-T option)
- ncbi_taxid: int
taxid given in NCBI (-t option)
- spe_strains: str
specific strain name, or comma-separated strain names (or name of a file with one strain name per line)
- outdir: str
directory where downloaded sequences must be saved
- threads: int
number of threads to use to download genome sequences
- Returns:
- str
Output filename of downloaded summary
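The three accepted forms of spe_strains (a single name, comma-separated names, or a file with one name per line) can be normalised to a list before use. The helper below is a minimal sketch of that idea; `parse_strains` is a hypothetical name and is not part of PanACoTA.

```python
import os
import tempfile


def parse_strains(spe_strains):
    """Return strain names from a single name, a comma-separated string, or a file path.

    Hypothetical helper, for illustration only.
    """
    if os.path.isfile(spe_strains):
        # File form: one strain name per line, blank lines ignored.
        with open(spe_strains) as handle:
            return [line.strip() for line in handle if line.strip()]
    # String form: one name, or several names separated by commas.
    return [name.strip() for name in spe_strains.split(",") if name.strip()]


print(parse_strains("K-12,O157:H7"))

# Same content given through a file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("K-12\nO157:H7\n")
print(parse_strains(tmp.name))
os.remove(tmp.name)
```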
- PanACoTA.prepare_module.download_genomes_func.to_database(outdir, section)
Move .fna.gz files to the ‘database_init’ folder and uncompress them.
- Parameters:
- outdir: str
directory where all results are (for now, refseq/genbank folders, assembly summary and log)
- section: str
refseq (default) or genbank
- Returns:
- nb_gen: number of genomes downloaded
- db_dir: directory containing all .fna files downloaded from refseq/genbank
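The collect-and-uncompress step could look roughly like the sketch below. It is not PanACoTA's implementation: `to_database_sketch`, the `db_name` parameter and the directory layout used in the demo are assumptions, and the sketch decompresses each archive straight into the database folder instead of moving it first.

```python
import gzip
import os
import shutil
import tempfile


def to_database_sketch(outdir, section="refseq", db_name="Database_init"):
    """Gunzip every .fna.gz found under outdir/section into outdir/db_name.

    Hypothetical helper: the folder name and the search layout are assumptions.
    """
    db_dir = os.path.join(outdir, db_name)
    os.makedirs(db_dir, exist_ok=True)
    nb_gen = 0
    for root, _dirs, files in os.walk(os.path.join(outdir, section)):
        for name in files:
            if not name.endswith(".fna.gz"):
                continue
            # Drop the trailing ".gz" to get the uncompressed file name.
            target = os.path.join(db_dir, name[:-len(".gz")])
            with gzip.open(os.path.join(root, name), "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            nb_gen += 1
    return nb_gen, db_dir


# Demo on a throwaway directory containing one tiny compressed genome.
with tempfile.TemporaryDirectory() as outdir:
    genome_dir = os.path.join(outdir, "refseq", "demo_assembly")
    os.makedirs(genome_dir)
    with gzip.open(os.path.join(genome_dir, "demo.fna.gz"), "wt") as handle:
        handle.write(">contig1\nACGT\n")
    nb_gen, db_dir = to_database_sketch(outdir)
    print(nb_gen, sorted(os.listdir(db_dir)))
```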
filter genomes submodule
Functions that run quality control on genomes in order to eliminate bad-quality sequences, and then run Mash loops in order to discard genomes that are too close to each other.
@author gem August 2019
- PanACoTA.prepare_module.filter_genomes.check_quality(species_linked, db_path, tmp_dir, max_l90, max_cont, cutn)
Run quality control on all genomes in db_path.
- Parameters:
- outdir: str
directory for all results (refseq downloads, database init etc)
- species_linked: str
NCBI species name with ‘_’ instead of spaces, or NCBI taxID if no species name is given
- db_path: str
directory to ‘Database_init’ containing all .fna files
- tmp_dir: str
directory where all tmp files must be saved (files cut at each stretch of ‘x’ N)
- max_l90: int
max L90 value tolerated to keep a genome
- max_cont: int
max number of contigs tolerated to keep a genome
- cutn: int
cut at each stretch of this number of ‘N’. Don’t cut if equal to 0
- Returns:
- genomes: dict
{genome_file: [genome_name, orig_path, path_to_seq_to_annotate, size, nbcont, l90]}. No short name is needed, since genomes are not annotated at this step: genome_name is the filename without its extension.
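The two quantities used by this quality control, the number of contigs after cutting at stretches of ‘N’ and the L90, can be illustrated as follows. This is a sketch under assumptions: `split_at_n` and `l90` are hypothetical helpers, a stretch is taken to mean at least cutn consecutive N, and L90 is computed with its usual definition (the smallest number of contigs, largest first, covering at least 90% of the total length).

```python
import re


def split_at_n(sequence, cutn):
    """Split a sequence at every run of at least `cutn` N; do not cut if cutn is 0."""
    if cutn == 0:
        return [sequence]
    pieces = re.split("N{%d,}" % cutn, sequence, flags=re.IGNORECASE)
    # Drop empty pieces produced by runs at the very start or end.
    return [piece for piece in pieces if piece]


def l90(contig_lengths):
    """Smallest number of contigs, largest first, covering at least 90% of the total length."""
    total = sum(contig_lengths)
    covered = 0
    for rank, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        # Integer comparison, equivalent to covered >= 0.9 * total.
        if covered * 10 >= total * 9:
            return rank
    return 0  # no contig at all


contigs = split_at_n("ACGTNNNNNACGTACGTNNAC", 5)
print(contigs)                        # the run of 2 N is too short to trigger a cut
print(len(contigs), l90([len(c) for c in contigs]))
```

A genome would then be kept only if both values stay within max_cont and max_l90.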
- PanACoTA.prepare_module.filter_genomes.compare_all(out_msh, matrix, npz_matrix, mash_log, threads)
Compare all pairs of genomes that have already been sketched in the given file.
- Parameters:
- out_msh: str
output of mash
- matrix: str
file in which to write the generated matrix of pairwise distances between all genomes
- npz_matrix: str
matrix of pairwise distances saved in a binary file
- mash_log: str
mash logfile
- threads
max number of threads to use
- Returns:
- return code
- PanACoTA.prepare_module.filter_genomes.iterative_mash(sorted_genomes, genomes, outdir, species_linked, min_dist, max_dist, threads, quiet)
Run mash all vs all, to get all pairwise distances. Then, take the first genome of the list, and remove all genomes whose distance to it is not between min_dist and max_dist (for example 1e-4 and 0.06). Restart with the next genome kept in the list, and so on until the last genome.
- Parameters:
- sorted_genomes: list
list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)
- genomes: dict
{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
- outdir: str
path to directory where all results are saved
- species_linked: str
species name if given, otherwise species taxID
- min_dist: float
lower limit of distance between 2 genomes to keep them
- max_dist: float
upper limit of distance between 2 genomes to keep them
- threads
max number of threads to use
- quiet: bool
True if nothing must be sent to stdout/stderr, False otherwise
- Returns:
- genomes_removed: dict
{genome_name: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)
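The greedy loop described for iterative_mash can be sketched on an in-memory distance table. This is an illustration, not the module's code: `iterative_filter` is a hypothetical name, the `dist` dictionary keyed by pairs stands in for the Mash distance matrix, and the bounds are assumed to be inclusive.

```python
def iterative_filter(sorted_genomes, dist, min_dist=1e-4, max_dist=0.06):
    """Greedy filter: every kept genome discards the later genomes that are
    too close to it (< min_dist) or too far from it (> max_dist).

    `dist` maps frozenset({a, b}) to the distance between a and b
    (hypothetical stand-in for the pairwise distance matrix).
    """
    to_try = list(sorted_genomes)
    kept = []
    genomes_removed = {}
    while to_try:
        ref = to_try.pop(0)          # best remaining genome becomes the reference
        kept.append(ref)
        remaining = []
        for genome in to_try:
            d = dist[frozenset((ref, genome))]
            if min_dist <= d <= max_dist:
                remaining.append(genome)
            else:
                genomes_removed[genome] = [ref, d]   # reference and distance justifying removal
        to_try = remaining
    return kept, genomes_removed


pairs = {
    ("A", "B"): 1e-5,   # too close to A
    ("A", "C"): 0.02,
    ("A", "D"): 0.2,    # too far from A
    ("B", "C"): 0.02,
    ("B", "D"): 0.2,
    ("C", "D"): 0.19,
}
dist = {frozenset(pair): d for pair, d in pairs.items()}
kept, genomes_removed = iterative_filter(["A", "B", "C", "D"], dist)
print(kept)
print(genomes_removed)
```

Because the input list is ordered by quality, the genome that survives a conflict is always the better-quality one of the pair.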
- PanACoTA.prepare_module.filter_genomes.mash_step(to_try, corresp, mat_sp, genomes_removed, min_dist, max_dist)
Prepare a mash run, with a given genome as reference, and others to compare to.
- Parameters:
- to_try: list
list of genome_file (keys of ‘genomes’) to compare, ordered by decreasing L90/nbcont
- corresp: dict
{genome_file : num_of_genome in sorted_genomes}
- mat_sp: scipy.sparse.dok.dok_matrix
triangle matrix containing pairwise distance comparisons
- genomes_removed: dict
{genome_file: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)
- min_dist: float
lower limit of distance between 2 genomes to keep them
- max_dist: float
upper limit of distance between 2 genomes to keep them
- Returns:
- to_try is updated (reference element and all genomes not compatible with it are removed)
- genomes_removed is updated
- return code (0 if no problem)
- PanACoTA.prepare_module.filter_genomes.read_matrix(genomes, sorted_genomes, matrix)
Read the matrix of pairwise distances between all genomes, and save it to a sparse matrix (only upper triangle).
- Parameters:
- genomes: dict
{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
- sorted_genomes: list
list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)
- matrix: str
file containing the matrix of pairwise distances between all genomes
- Returns:
- mat_sp: scipy.sparse.dok.dok_matrix
python dok_matrix object
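Storing only the upper triangle means that each pair (i, j) is kept once, with i < j, where i and j are the positions of the genomes in sorted_genomes. The sketch below shows this under stated assumptions: the input layout (a first line with the number of genomes, then one tab-separated line per genome listing its distances to the genomes above it) is assumed, `read_triangle` is a hypothetical name, and a plain dictionary keyed by (row, column) stands in for the dok_matrix so that the example has no dependency.

```python
import io


def read_triangle(handle, sorted_genomes):
    """Read a lower-triangle distance listing and return {(i, j): dist} with i < j.

    Hypothetical helper; indices follow the order of `sorted_genomes`.
    """
    corresp = {name: num for num, name in enumerate(sorted_genomes)}
    upper = {}
    seen = []                       # genomes read so far, in file order
    next(handle)                    # first line: number of genomes
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue                # skip blank lines
        name = fields[0]
        # The k-th value on the line is the distance to the k-th genome seen.
        for other, value in zip(seen, fields[1:]):
            i, j = sorted((corresp[name], corresp[other]))
            upper[(i, j)] = float(value)
        seen.append(name)
    return upper


text = "3\nB\nA\t0.01\nC\t0.02\t0.03\n"
print(read_triangle(io.StringIO(text), ["A", "B", "C"]))
```

Note that the file order (B, A, C) differs from the order of sorted_genomes, which is why the name-to-index mapping is needed.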
- PanACoTA.prepare_module.filter_genomes.sketch_all(genomes, sorted_genomes, outdir, list_reps, out_msh, mash_log, threads)
Sketch all genomes to a combined archive.
- Parameters:
- genomes: dict
{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
- sorted_genomes: list
list of ‘genome_file’ for all genomes kept (L90 and nbcont ok), ordered by decreasing quality
- outdir: str
path to directory where all results are saved
- list_reps: str
file with the list of genomes to sketch. The file is emptied if it contains something, and filled with the information from ‘genomes’.
- out_msh: str
output of mash
- mash_log: str
mash logfile
- threads
max number of threads to use
- Returns:
- return value (0 if OK, 1 if error)
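One way to prepare such a sketching run is to write the list file and then assemble a `mash sketch` command that reads it. The sketch below only builds the command and does not execute it. `write_list_and_build_cmd` is a hypothetical helper, and both the choice of writing path_to_seq_to_annotate to the list and the exact options used here (`-l`, `-o`, `-p`) are assumptions about how the call could be made, not a description of what sketch_all does.

```python
import os
import tempfile


def write_list_and_build_cmd(genomes, sorted_genomes, list_reps, out_msh, threads):
    """Write one sequence path per line to `list_reps` and return a mash sketch command.

    Hypothetical helper, for illustration only.
    """
    # Opening with mode "w" empties the file if it already contains something.
    with open(list_reps, "w") as handle:
        for genome_file in sorted_genomes:
            handle.write(genomes[genome_file][2] + "\n")   # path_to_seq_to_annotate
    # -l: inputs are lists of sequence files, -o: output prefix, -p: threads.
    return ["mash", "sketch", "-l", list_reps, "-o", out_msh, "-p", str(threads)]


genomes = {
    "g1.fna": ["g1", "db/g1.fna", "tmp/g1.fna", 5000000, 80, 40],
    "g2.fna": ["g2", "db/g2.fna", "tmp/g2.fna", 4900000, 10, 5],
}
with tempfile.TemporaryDirectory() as outdir:
    list_reps = os.path.join(outdir, "list_genomes.txt")
    cmd = write_list_and_build_cmd(genomes, ["g2.fna", "g1.fna"], list_reps,
                                   os.path.join(outdir, "all_genomes"), 4)
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run`, with stdout and stderr redirected to the mash logfile, on a machine where Mash is installed.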
- PanACoTA.prepare_module.filter_genomes.sort_genomes_minhash(genomes, max_l90, max_cont)
Sort genomes: draft genomes are sorted by L90 and then by number of contigs.
- Parameters:
- genomes: dict
{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
- max_l90: int
max L90 value tolerated to keep a genome
- max_cont: int
max number of contigs tolerated to keep a genome
- Returns:
- sorted_genomes: list
list of ‘genome_file’ for all genomes kept (L90 and nbcont ok), ordered by decreasing quality
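The filter-then-sort behaviour can be sketched with a two-part sort key. `sort_genomes_sketch` is a hypothetical name, and treating the thresholds as inclusive is an assumption. The indices 4 and 5 follow the layout of the ‘genomes’ values given above (nbcont, then l90).

```python
def sort_genomes_sketch(genomes, max_l90, max_cont):
    """Keep genomes within both thresholds, best quality first (lowest L90, then fewest contigs)."""
    kept = [
        genome_file
        for genome_file, info in genomes.items()
        if info[5] <= max_l90 and info[4] <= max_cont   # info[5] = l90, info[4] = nbcont
    ]
    return sorted(kept, key=lambda genome_file: (genomes[genome_file][5], genomes[genome_file][4]))


genomes = {
    "g1.fna": ["g1", "db/g1.fna", "tmp/g1.fna", 5000000, 80, 40],
    "g2.fna": ["g2", "db/g2.fna", "tmp/g2.fna", 4900000, 10, 5],
    "g3.fna": ["g3", "db/g3.fna", "tmp/g3.fna", 5100000, 1200, 300],  # fails both thresholds
    "g4.fna": ["g4", "db/g4.fna", "tmp/g4.fna", 5000000, 12, 5],
}
print(sort_genomes_sketch(genomes, max_l90=100, max_cont=999))
```

Here g2 and g4 share the same L90, so the number of contigs breaks the tie.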
- PanACoTA.prepare_module.filter_genomes.write_outputfiles(genomes, sorted_genomes, genomes_removed, outdir, gspecies, min_dist, max_dist)
Write the list of genomes kept to a file, 1 genome per line: this file will be the input for annotation and the next steps. Write the discarded genomes to another file with, on each line:
- genome name
- the other genome against which the comparison was a problem
- distance to this other genome
- Parameters:
- genomes: dict
{genome_file: [genome_name, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
- sorted_genomes: list
list of ‘genome_file’ for all genomes kept (L90 and nbcont ok)
- genomes_removed: dict
{genome_name: [ref_name, dist]} genome against which ‘genome_name’ is removed, and corresponding distance (justifying removal)
- outdir: str
directory where those list files must be created
- gspecies: str
species name if given, otherwise species taxID
- min_dist: float
lower limit of distance between 2 genomes to keep them
- max_dist: float
upper limit of distance between 2 genomes to keep them
- Returns:
- return code
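The two output lists could be produced along the lines of the sketch below. The file names, the content of each "kept" line (here, the path of the sequence) and the tab-separated layout are all assumptions made for the illustration; `write_lists_sketch` is a hypothetical helper that does not reproduce the module's actual file naming.

```python
import os
import tempfile


def write_lists_sketch(genomes, sorted_genomes, genomes_removed, outdir, gspecies):
    """Write kept genomes (one per line) and discarded genomes (name, reference, distance)."""
    kept_path = os.path.join(outdir, "kept-%s.txt" % gspecies)            # hypothetical name
    discarded_path = os.path.join(outdir, "discarded-%s.txt" % gspecies)  # hypothetical name
    with open(kept_path, "w") as kept:
        for genome_file in sorted_genomes:
            name, _orig, seq_path = genomes[genome_file][:3]
            if name not in genomes_removed:
                kept.write(seq_path + "\n")
    with open(discarded_path, "w") as discarded:
        for name, (ref_name, dist) in genomes_removed.items():
            discarded.write("%s\t%s\t%s\n" % (name, ref_name, dist))
    return kept_path, discarded_path


genomes = {
    "g1.fna": ["g1", "db/g1.fna", "tmp/g1.fna", 5000000, 80, 40],
    "g2.fna": ["g2", "db/g2.fna", "tmp/g2.fna", 4900000, 10, 5],
}
with tempfile.TemporaryDirectory() as outdir:
    kept_path, discarded_path = write_lists_sketch(
        genomes, ["g2.fna", "g1.fna"], {"g1": ["g2", 1e-05]}, outdir, "demo_species"
    )
    with open(kept_path) as handle:
        print(handle.read(), end="")
    with open(discarded_path) as handle:
        print(handle.read(), end="")
```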