PanACoTA.annotate_module
package¶
annotate module of PanACoTA
genome_seq_functions
submodule¶
Functions to:
analyse a genome (cut at stretch of N if asked, calc L90, nb cont, size…)
if genome cut by stretch of N, write the new sequence to new file
rename genome contigs and write new sequence to new file
@author gem April 2017
- PanACoTA.annotate_module.genome_seq_functions.analyse_all_genomes(genomes, dbpath, tmp_path, nbn, soft, logger, quiet=False)¶
- Parameters:
- genomesdict
{genome: spegenus.date}
- dbpathstr
path to folder containing genomes
- tmp_pathstr
path to put out files
- nbnint
minimum number of ‘N’ required to cut into a new contig
- softstr
soft used (prokka, prodigal, or None if called by prepare module)
- loggerlogging.Logger
logger object to write log information. It is passed explicitly because this function can also be called from the prepare module, where the sub logger name is different
- quietbool
True if nothing must be written to stdout/stderr, False otherwise
- Returns:
- dict
{genome: [spegenus.date, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}
- PanACoTA.annotate_module.genome_seq_functions.analyse_genome(genome, dbpath, tmp_path, cut, pat, genomes, soft, logger)¶
Analyse given genome:
if cut is asked:
cut its contigs each time ‘pat’ is seen
save cut genome in new file
calculate genome size, L90, nb contigs and save it into genomes
- Parameters:
- genomestr
given genome to analyse
- dbpathstr
path to the folder containing the given genome sequence
- tmp_pathstr
path to folder where output files must be saved.
- cutbool
True if contigs must be cut, False otherwise
- patstr
pattern on which contigs must be cut. ex: “NNNNN”
- genomesdict
{genome_file: [genome_name]} as input, and will be changed to {genome_file: [genome_name, path, path_annotate, gsize, nbcont, L90]}
- softstr
soft used (prokka, prodigal, or None if called by prepare module)
- Returns:
- bool
True if genome analysis went well, False otherwise. Also modifies ‘genomes’ for the analysed genome: {genome_file: [genome_name, path, path_annotate, gsize, nbcont, L90]}
- PanACoTA.annotate_module.genome_seq_functions.calc_l90(contig_sizes)¶
Calc L90 of a given genome
- Parameters:
- contig_sizesdict
{name: size}
- Returns:
- None or int
if L90 is found, returns L90. Otherwise, returns None
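For illustration, L90 is commonly defined as the smallest number of contigs, taken from longest to shortest, whose cumulated length reaches at least 90% of the genome size. The block below is a minimal sketch under that definition, not PanACoTA's own code; the name `l90_sketch` is hypothetical.

```python
def l90_sketch(contig_sizes):
    """Return the L90 of a genome given {contig_name: size}, or None.

    Hypothetical helper: illustrates the usual L90 definition only.
    """
    total = sum(contig_sizes.values())
    if total <= 0:
        # No sequence at all: L90 cannot be computed.
        return None
    cumulated = 0
    # Walk through contigs from the longest to the shortest.
    for rank, size in enumerate(sorted(contig_sizes.values(), reverse=True), start=1):
        cumulated += size
        # Integer comparison avoids floating-point rounding on the 90% threshold.
        if 10 * cumulated >= 9 * total:
            return rank
    return None  # not reached when all sizes are positive
```

With sizes 50, 30, 15 and 5, the first two contigs cover only 80% of the genome, so three contigs are needed.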
- PanACoTA.annotate_module.genome_seq_functions.format_contig(cut, pat, cur_seq, cur_contig_name, genome, contig_sizes, gresf, num, logger)¶
Format given contig, and save to output file if needed
if cut: cut it and write each subsequence
otherwise: write the new contig, just checking that contig names are different
- Parameters:
- cutbool
True if contigs must be cut, False otherwise
- patstr
pattern on which contigs must be cut. ex: “NNNNN”
- cur_seqstr
current sequence (aa)
- cur_contig_namestr
name of current contig
- genomestr
name of current genome
- contig_sizesdict
{contig_name : sequence length}
- gresfio.TextIOWrapper
open file to write new sequence. If we are annotating with prodigal and not cutting, there is no new sequence -> gresf is None
- numint
current contig number
- loggerlogging.Logger
logger object to write log information
- Returns:
- bool
True if contig has been written without problem, False if any problem
- PanACoTA.annotate_module.genome_seq_functions.get_output_dir(soft, dbpath, tmp_path, genome, cut, pat)¶
Get output file to put sequence cut and/or sequence with shorter contigs (prokka)
- Parameters:
- softstr
soft used (prokka, prodigal, or None if called by prepare module)
- dbpathstr
path to the folder containing the given genome sequence
- tmp_pathstr
path to folder where output files must be saved.
- genomestr
genome name
- cutbool
True if contigs must be cut, False otherwise
- patstr
pattern on which contigs must be cut. ex: “NNNNN”
- PanACoTA.annotate_module.genome_seq_functions.plot_distributions(genomes, res_path, listfile_base, l90, nbconts)¶
FUNCTION DIRECTLY CALLED FROM MAIN ANNOTATE MODULE (step2) Plot distributions of L90 and nbcontig values.
- Parameters:
- genomesdict
{genome: [name, orig_path, to_annotate_path, size, nbcont, l90]}
- res_pathstr
path to put all output files
- listfile_basestr
name of list file
- l90int
L90 threshold
- nbcontsint
nb contigs threshold
- Returns:
- (l90_vals, nbcont_vals, dist1, dist2)
- - l90_vals: list of L90 values for all genomes
- - nbcont_vals: list of nb contigs values for all genomes
- - dist1: matplotlib figure of the distribution of L90 values
- - dist2: matplotlib figure of the distribution of nb contigs values
- PanACoTA.annotate_module.genome_seq_functions.rename_all_genomes(genomes)¶
FUNCTION DIRECTLY CALLED FROM MAIN ANNOTATE MODULE (step 3) Sort kept genomes by L90 and then nb contigs. For each genome, assign a strain number, and rename all its contigs.
- Parameters:
- genomesdict
{genome: [name, path, path_to_seq, gsize, nbcont, L90]} as input, and will become {genome: [gembase_name, path, path_to_seq, gsize, nbcont, L90]} at the end
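The sorting and numbering step can be sketched as follows. This is only an illustration: the function name is hypothetical, contig renaming is left out, and two details are assumptions, namely the 5-digit zero-padded strain number (inferred from names such as ESCO.1017.00200) and one counter per distinct input name.

```python
def assign_strain_numbers_sketch(genomes):
    """Sort genomes by (L90, nb contigs) and number them in place.

    Each value of `genomes` is [name, path, path_to_seq, gsize, nbcont, L90].
    Returns the genome keys in the order used for numbering.
    """
    # Sort by L90 first (index 5), then by number of contigs (index 4).
    ordered = sorted(genomes, key=lambda g: (genomes[g][5], genomes[g][4]))
    counters = {}  # assumption: one strain counter per distinct name
    for genome in ordered:
        name = genomes[genome][0]
        counters[name] = counters.get(name, 0) + 1
        # Assumption: strain number is zero-padded to 5 digits after a dot.
        genomes[genome][0] = f"{name}.{counters[name]:05d}"
    return ordered
```

Sorting before numbering means the genome with the lowest L90 (and then the fewest contigs) receives strain number 1.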
- PanACoTA.annotate_module.genome_seq_functions.split_contig(pat, whole_seq, cur_contig_name, contig_sizes, gresf, num)¶
Save the contig read just before into dicts and write it to sequence file. Unique ID of contig must be in the first field of header, before the first space (required by prokka)
- Parameters:
- patstr
pattern to split a contig. None if we do not want to split
- whole_seqstr
sequence of current contig, to save once split according to pat
- cur_contig_namestr
name of current contig to save once split according to pat
- contig_sizesdict
{name: size} save cur_contig once split according to pat
- gresf_io.TextIOWrapper
file open in w mode to save the split sequence
- numint
current contig number.
- Returns:
- int
new contig number, after giving number(s) to the current contig
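A minimal sketch of this splitting logic, using `re.split` from the standard library, could look like the block below. It is not the PanACoTA implementation: the function name, the `<num>_<name>` header layout (chosen so that the unique ID comes before the first space) and the example pattern "NNNNN+" (5 or more N) are all assumptions.

```python
import re


def split_contig_sketch(pat, whole_seq, cur_contig_name, contig_sizes, gresf, num):
    """Split whole_seq on pat, write each piece to gresf, return the next number."""
    # No pattern: keep the contig as a single piece.
    pieces = re.split(pat, whole_seq) if pat else [whole_seq]
    for piece in pieces:
        if not piece:
            # Empty pieces appear when the pattern sits at a sequence end.
            continue
        # Assumption: the contig number comes first, so the ID is unique
        # and placed before any space of the original header.
        new_name = f"{num}_{cur_contig_name}"
        contig_sizes[new_name] = len(piece)
        gresf.write(f">{new_name}\n{piece}\n")
        num += 1
    return num
```

Any object with a `write` method (an open file, or `io.StringIO` when testing) can be passed as `gresf`.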
annotation_functions
submodule¶
Functions to deal with prokka or prodigal only, according to user request
@author gem April 2017
- PanACoTA.annotate_module.annotation_functions.check_prodigal(gpath, name, prodigal_dir, logger)¶
When prodigal result folder already exists, check that the output files exist. We cannot check all content, but check that they are present.
- Parameters:
- gpathstr
path to fasta file given as input for prodigal
- namestr
genome name in gembase format
- prodigal_dirstr
output directory, where all files are written by prodigal
- loggerlogging.Logger
logger object to get logs
- Returns:
- bool
True if everything went well, False otherwise
- PanACoTA.annotate_module.annotation_functions.check_prokka(outdir, logf, name, gpath, nbcont, logger)¶
Prokka writes everything to stderr, and always returns a non-zero return code. So, we check if it ran well by checking the content of output directory. This function is also used when prokka files already exist (prokka was run previously), to check if everything is ok before going to next step.
- Parameters:
- outdirstr
output directory, where all files are written by prokka
- logfstr
prokka/prodigal logfile, containing stderr of prokka
- namestr
genome name in gembase format
- gpathstr
path to fasta file given as input for prokka
- nbcontint
number of contigs in fasta file given to prokka
- loggerlogging.Logger
logger object to get logs
- Returns:
- bool
True if everything went well, False otherwise
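Since the return code cannot be relied on, one simple way to check a run is to look for the expected files in the output directory. The sketch below only illustrates that idea: the function name is hypothetical, and the list of extensions (.tbl, .faa, .ffn and .gff are among the files prokka writes) is an assumption about what to check, not the list PanACoTA uses.

```python
from pathlib import Path


def missing_outputs_sketch(outdir, extensions=(".tbl", ".faa", ".ffn", ".gff")):
    """Return the extensions for which outdir does not hold exactly one file."""
    folder = Path(outdir)
    missing = []
    for ext in extensions:
        # Exactly one file per extension is expected for a single genome.
        if len(list(folder.glob(f"*{ext}"))) != 1:
            missing.append(ext)
    return missing
```

An empty list means every expected file is present; the content of the files (number of contigs, genes, proteins) would still have to be compared separately.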
- PanACoTA.annotate_module.annotation_functions.count_headers(seqfile)¶
Count how many sequences there are in the given file
- Parameters:
- seqfilestr
file containing a sequence in multi-fasta format
- Returns:
- int
number of contigs in the given multi-fasta file
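Counting the sequences of a multi-fasta file amounts to counting its header lines, which start with ">". A minimal sketch (hypothetical function name, not the PanACoTA code):

```python
def count_headers_sketch(seqfile):
    """Count fasta headers (lines starting with '>') in the given file."""
    with open(seqfile, "r") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```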
- PanACoTA.annotate_module.annotation_functions.count_tbl(tblfile)¶
Count the different features found in the tbl file:
number of contigs
number of proteins (CDS)
number of genes (locus_tag)
number of CRISPR arrays (repeat_region) -> ignore crisprs
- Parameters:
- tblfilestr
tbl file generated by prokka
- Returns:
- (nbcont, nb_cds, nb_gene, nb_crispr)
information on features found in the tbl file.
- PanACoTA.annotate_module.annotation_functions.prodigal_train(gpath, annot_folder)¶
Use prodigal training mode. First, train prodigal on the first genome (‘gpath’), and write the result to ‘genome’.trn, the file which will be used for the annotation of all following sequences.
- Parameters:
- gpathstr
path to genome to train on
- annot_folderstr
path to folder where the log files and train file will be saved
- Returns:
- str
path and name of train file (will be used to annotate all next genomes) If problem, returns empty string
- PanACoTA.annotate_module.annotation_functions.run_annotation_all(genomes, threads, force, annot_folder, fgn, prodigal_only=False, small=False, quiet=False)¶
For each genome in genomes, run prokka (or only prodigal) to annotate the genome.
- Parameters:
- genomesdict
{genome: [gembase_name, path_to_origfile, path_split_gembase, gsize, nbcont, L90]}
- threadsint
max number of threads that can be used
- forcebool
if False, do not override prokka/prodigal outdir and result dir if they exist. If True, rerun prokka and override existing results, for all genomes.
- annot_folderstr
folder where prokka/prodigal results must be written: for each genome, a directory <genome_name>-prokkaRes or <genome_name>-prodigalRes will be created in this folder, and all the results of prokka/prodigal for the genome will be written inside
- fgnstr
name (key in genomes dict) of the first genome, which will be used for prodigal training
- prodigal_onlybool
True if only prodigal must run, False if prokka must run
- smallbool
True -> use -p meta option with prodigal. Do not use training
- quietbool
True if nothing must be written to stderr/stdout, False otherwise
- Returns:
- dict
{genome: boolean} -> with True if prokka/prodigal ran well, False otherwise.
- PanACoTA.annotate_module.annotation_functions.run_prodigal(arguments)¶
Run prodigal for the given genome.
- Parameters:
- argumentstuple
(gpath, prodigal_folder, cores_annot, name, force, nbcont, small, q) with:
gpath: path and filename of genome to annotate
prodigal_folder: path to folder where all prodigal folders for all genomes are saved
cores_annot: how many cores prodigal can use
name: output name of annotated genome
force: True if force run (override existing files), False otherwise
nbcont: number of contigs in the input genome, to check prodigal results
small: if contigs are too small (<20000bp), use -p meta option
q : queue where logs are put
- Returns:
- boolean
True if everything went well (all needed output files present, corresponding numbers of proteins, genes etc.). False otherwise.
- PanACoTA.annotate_module.annotation_functions.run_prokka(arguments)¶
Run prokka for the given genome.
- Parameters:
- argumentstuple
(gpath, prok_folder, cores_annot, name, force, nbcont, small, q) with:
gpath: path and filename of genome to annotate
prok_folder: path to folder where all prokka folders for all genomes are saved
cores_annot: how many cores prokka can use
name: output name of annotated genome
force: True if force run (override existing files), False otherwise
nbcont: number of contigs in the input genome, to check prokka results
small: used for prodigal, if sequences to annotate are small. Not used here
q : queue where logs are put
- Returns:
- boolean
True if everything went well (all needed output files present, corresponding numbers of proteins, genes etc.). False otherwise.
format_functions
submodule¶
Functions to convert prokka or prodigal results to gembase format:
Proteins: multifasta with all CDS in aa
Replicons: multifasta of genome
Genes: multifasta of all genes in nuc
gff3: gff files without sequence
- LSTINFO:
if annotated by prokka: information on annotation. Columns are:
“start end strand type locus gene_name | product | EC_number | inference2” with the same types as prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>
if annotated by prodigal: “start end strand type locus NA | NA | NA | NA”, as there is no functional annotation
@author gem May 2019
- PanACoTA.annotate_module.general_format_functions.format_genomes(genomes_ok, res_path, annot_path, prodigal_only, threads=1, quiet=False)¶
For all genomes which were annotated (by prokka or prodigal), reformat them in order to have, in ‘res_path’, the following folders:
LSTINFO: containing a .lst file for each genome, with all genes
Replicons: containing all multifasta files
Genes: containing 1 multi-fasta per genome, with all its genes in nuc
Proteins: containing 1 multi-fasta per genome, with all its proteins in aa
gff: containing all gff files
- Parameters:
- genomes_okdict
genomes to format (annotation was OK) -> {genome: [name, gpath, to_annot, size, nbcont, l90]}
- res_pathstr
path to folder where these directories must be created
- annot_pathstr
path to folder containing “<genome_name>-[prokka, prodigal]Res” where all prokka/prodigal results are saved.
- prodigal_onlybool
True if genomes were annotated by prodigal, False if annotated by prokka
- threadsint
number of threads to use while formatting genomes
- quietbool
True if nothing must be sent to stderr/stdout, False otherwise
- Returns:
- skipped_formatlist
list of genomes skipped because they had a problem in format step
- PanACoTA.annotate_module.general_format_functions.handle_genome(args)¶
For a given genome, check if it has been annotated (in results), if annotation (by prokka or prodigal) ran without problems (result = True). In that case, format the genome and get the output to see if everything went ok.
- Parameters:
- argstuple
(genome, name, gpath, prok_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir, results, q) with:
genome : original genome name
name : gembase name of the genome
gpath : path to the genome sequence which was given to prokka/prodigal for annotation
annot_path : directory where prokka/prodigal folders are saved
lst_dir : path to ‘LSTINFO’ folder
prot_dir : path to ‘Proteins’ folder
gene_dir : path to ‘Genes’ folder
rep_dir : path to ‘Replicons’ folder
gff_dir : path to ‘gff3’ folder
prodigal_only : True if annotated by prodigal, False if annotated by prokka
q : multiprocessing.managers.AutoProxy[Queue] queue to put logs during subprocess
- Returns:
- (bool, str)
True if genome was annotated as expected, False otherwise
genome name (used to get info from the pool.map_async)
- PanACoTA.annotate_module.general_format_functions.write_gene(gtype, locus_num, gene_name, product, cont_loc, genome, cont_num, ecnum, inf2, db_xref, strand, start, end, lstopenfile)¶
Write given gene to output file
- Parameters:
- gtypestr
type of feature (CDS, tRNA, etc.)
- locus_numstr
number of the locus given by prokka/prodigal
- gene_namestr
gene name found by prokka/prodigal (“NA” if no gene name -> Always the case with Prodigal)
- productstr
found by prokka/Prodigal, “NA” if no product (always the case for prodigal)
- cont_locstr
‘i’ if the gene is inside a contig, ‘b’ if it is on the border (first or last gene of the contig)
- genomestr
genome name (spegenus.date.strain_num)
- cont_numint
contig number
- ecnumstr
EC number, found by prokka, or “NA” if no EC number (-> always for prodigal)
- inf2str
more information found by prokka/prodigal, or “NA” if no more information
- db_xrefstr
db_xref given by Prokka (“NA” for prodigal)
- strandstr
C (complement) or D (direct)
- startstr
start of gene in the contig
- endstr
end of gene in the contig
- lstopenfile_io.TextIOWrapper
open file where lstinfo must be written
- Returns:
- str
lstline
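To make the lst line layout concrete, here is a minimal sketch that builds one line from these parameters, following the column order "start end strand type locus gene_name | product | EC_number | inference2 | db_xref" and the locus template <genome_name>.<contig_num><i or b>_<protein_num>. The function name, the tab separator and the zero-padding widths (4 digits for the contig, 5 for the locus number) are assumptions; writing to the open file is left out.

```python
def lst_line_sketch(gtype, locus_num, gene_name, product, cont_loc, genome,
                    cont_num, ecnum, inf2, db_xref, strand, start, end):
    """Build one lst line (without trailing newline) for a feature."""
    # Assumption: contig number padded to 4 digits, locus number to 5.
    locus = f"{genome}.{int(cont_num):04d}{cont_loc}_{int(locus_num):05d}"
    # Functional information fields are separated by ' | '.
    info = " | ".join([gene_name, product, ecnum, inf2, db_xref])
    # Assumption: the leading columns are tab separated.
    return "\t".join([str(start), str(end), strand, gtype, locus, info])
```

For a prodigal annotation, every functional field would simply be passed as "NA".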
- PanACoTA.annotate_module.general_format_functions.write_header(lstline, outfile)¶
Write header to output file. Header is generated from the lst line.
- Parameters:
- lstlinestr
current line of lst file
- outfile_io.TextIOWrapper
open file where header must be written
Functions to:
convert prokka tbl file to our tab file
convert prokka ffn and faa headers to our format
Create the database, with the following folders in the given “res_path”:
Proteins: multifasta with all CDS in aa
Replicons: multifasta of genome
Genes: multifasta of all genes in nuc
gff3: gff files without sequence
LSTINFO: information on annotation. Columns are: “start end strand type locus gene_name | product | EC_number | inference2” with the same types as prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>
@author gem April 2019
- PanACoTA.annotate_module.format_prokka.create_gen(ffnseq, lstfile, genseq)¶
Generate .gen file, from sequences contained in .ffn, but changing the headers using the information in .lst
- Parameters:
- ffnseqstr
.ffn file generated by prokka
- lstfilestr
lstfile converted from prokka tbl file
- genseqstr
output file, to write in Genes directory
- loggerlogging.Logger
logger object to put information
- Returns:
- bool
True if conversion went well, False otherwise
- PanACoTA.annotate_module.format_prokka.create_prt(faaseq, lstfile, prtseq)¶
Generate .prt file, from sequences in .faa, but changing the headers using information in .lst
Note: works if proteins are in increasing order (of number after “_” in their name) in faa and tbl (hence lst) files.
If a header is not in the right format, or a protein exists in prt file but not in lstfile, conversion is stopped, an error message is output, and prt file is removed.
- Parameters:
- faaseqstr
faa file output of prokka
- lstfilestr
lstinfo converted from prokka tab file
- prtseqstr
output file where converted proteins must be saved
- Returns:
- bool
True if conversion went well, False otherwise
- PanACoTA.annotate_module.format_prokka.format_one_genome(gpath, name, prok_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir)¶
Format the given genome, and create its corresponding files in the following folders:
Proteins
Genes
Replicons
LSTINFO
gff
- Parameters:
- gpathstr
path to the genome sequence which was given to prokka for annotation
- namestr
gembase name of the genome
- prok_pathstr
directory where prokka folders are saved
- lst_dirstr
path to LSTINFO folder
- prot_dirstr
path to Proteins folder
- gene_dirstr
path to Genes folder
- rep_dirstr
path to Replicons folder
- gff_dirstr
path to gff3 folder
- Returns:
- bool
True if genome was correctly formatted, False otherwise
- PanACoTA.annotate_module.format_prokka.generate_gff(gpath, prokka_gff_file, res_gff_file, res_lst_file, sizes, contigs)¶
From the lstinfo file and contig names (retrieved from generation of Replicons files), generate a gff file.
Format:
##gff-version 3
##sequence-region contig1 start end
##sequence-region contig2 start end
…
seqid(=contig) source type start end score strand phase attributes
All fields tab separated. Empty fields contain ‘.’
For example: ESCO.1017.00200.00001 Prodigal:2.6 CDS start end . + . ID=ESCO.1017.00200.b0001_00001;locus_tag=ESCO.1017.00200.b0001_00001;product=hypothetical protein
genome1_1 Prodigal_v2.6.3 CDS 213 1880 260.0 + 0 ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.534;conf=99.99;score=259.99;cscore=268.89;sscore=-8.89;rscore=-8.50;uscore=-4.34;tscore=3.95;
- Parameters:
- gpathstr
path to the genome sequence given to prokka. Only used for error message
- res_gff_filestr
path to the gff file that must be created in result database
- res_lst_filestr
path to the lst file that was created in result database in the previous step
- sizesdict
dict of contig names with their size. {“gembase1”: “size”, “gembase2”: “size2” …}
- contigsdict
dict of contig original and gembase names. {“contig1”: “gembase1”…}
- Returns:
- bool
True if conversion worked well, False otherwise
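The gff layout described for this function (a version line, one ##sequence-region line per contig, then tab-separated feature lines with ‘.’ for empty fields) can be sketched as below. This is an illustration only: the function name and the dict-based feature representation are hypothetical, and starting each sequence region at position 1 is an assumption.

```python
def gff_lines_sketch(sizes, features):
    """Return gff3 lines for contigs {name: size} and a list of feature dicts."""
    lines = ["##gff-version 3"]
    for contig, size in sizes.items():
        # Assumption: every contig region starts at 1 and ends at its size.
        lines.append(f"##sequence-region {contig} 1 {size}")
    for feat in features:
        attributes = ";".join(f"{key}={val}" for key, val in feat["attributes"].items())
        fields = [
            feat["seqid"], feat["source"], feat["type"],
            str(feat["start"]), str(feat["end"]),
            str(feat.get("score", ".")),   # empty fields contain '.'
            feat["strand"],
            str(feat.get("phase", ".")),
            attributes,
        ]
        lines.append("\t".join(fields))
    return lines
```

Joining the returned lines with newlines gives the text of a gff file without sequence.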
- PanACoTA.annotate_module.format_prokka.tbl2lst(tblfile, lstfile, contigs, genome, gpath)¶
Read prokka tbl file, and convert it to the lst file.
prokka tbl file format:
>Feature contig_name
start end type
[EC_number text]
[gene text]
inference ab initio prediction:Prodigal:2.6
[inference text]
locus_tag test
product text
where type can be CDS, tRNA, rRNA, etc … lines between [] are not always present
lst file format:
start end strand type locus gene_name | product | EC_number | inference 2 | db_xref
with the same types as prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>
- Parameters:
- tblfilestr
name of prokka output tbl file to read
- lstfilestr
name of lst file to generate
- contigsdict
{original_contig_name: gembase_contig_name}
- genomestr
genome name (gembase format)
- gpathstr
path to the genome given to prodigal. Only used for error message
- changed_namebool
True if contig names have been changed (cutn != 0) -> contig names end by ‘_num’, False otherwise.
- Returns:
- bool
True if genome name used in lstfile and prokka tblfile are the same, False otherwise
Functions to convert prodigal result files to gembase format.
Proteins: multifasta with all CDS in aa
Replicons: (multi)fasta of genome sequences
Genes: multifasta of all genes in nuc
gff3: gff files without sequence
LSTINFO: information on annotation. Columns are: “start end strand type locus gene_name | product | EC_number | inference2” and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>. For prodigal: “start end strand type locus NA | NA | NA | NA”, as there is no functional annotation.
@author gem July 2019
- PanACoTA.annotate_module.format_prodigal.create_gene_lst(contigs, gen_file, res_gen_file, res_lst_file, gpath, name)¶
Generate .gen file, from sequences contained in .ffn, but changing the headers to match with gembase format. At the same time, generate .lst file, from the information given in prodigal ffn headers
- Parameters:
- contigsdict
{original_contig_name: gembase_contig_name}
- gen_filestr
.ffn file generated by prodigal
- res_gen_filestr
generated .gen file, to write in Genes directory
- res_lst_filestr
generated .lst file to write in LSTINFO directory
- gpathstr
path to the genome given to prodigal. Only used for error message
- namestr
gembase name of the genome to format
- loggerlogging.Logger
logger object to put information
- Returns:
- bool
True if conversion went well, False otherwise
- PanACoTA.annotate_module.format_prodigal.create_gff(gpath, gff_file, res_gff_file, res_lst_file, contigs, sizes)¶
Create .gff3 file.
Format:
##gff-version 3
##sequence-region contig1 start end
##sequence-region contig2 start end
…
seqid(=contig) source type start end score strand phase attributes
All fields tab separated. Empty fields contain ‘.’
For example: ESCO.1017.00200.00001 Prodigal:2.6 CDS start end . + . ID=ESCO.1017.00200.b0001_00001;locus_tag=ESCO.1017.00200.b0001_00001;product=hypothetical protein
genome1_1 Prodigal_v2.6.3 CDS 213 1880 260.0 + 0 ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.534;conf=99.99;score=259.99;cscore=268.89;sscore=-8.89;rscore=-8.50;uscore=-4.34;tscore=3.95;
- Parameters:
- gpathstr
path to the genome sequence given to prodigal. Only used for the error message
- gff_filestr
path to gff file generated by prodigal
- res_gff_filestr
path to the gff file that must be created in result database
- res_lst_filestr
path to the lst file that was created in result database in the previous step
- contigsdict
dict of original contig names with their gembase names. {“original_name”: “gembase_name”}
- sizesdict
dict of contig gembase names with their sizes {“gembase_name”: size}
- Returns:
- bool
True if everything went well, False if any problem
- PanACoTA.annotate_module.format_prodigal.create_prt(prot_file, res_prot_file, res_lst_file)¶
Generate .prt file (gembase formatted gene names), from features contained in .lst file generated just before.
- Parameters:
- prot_filestr
.faa file generated by prodigal
- res_prot_filestr
output file, to write in Proteins directory
- res_lst_filestr
.lst file to get all gene names in gembase format instead of re-generating them
- Returns:
- bool
True if conversion went well, False otherwise
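The idea of reusing the names already written in the .lst file, instead of regenerating them, can be sketched as follows. It is a simplified illustration, not the PanACoTA code: the function name is hypothetical, and it assumes that lst columns are tab separated with the locus in the 5th column, that proteins appear in the same order in both files, and that the new header contains only the locus name.

```python
def rename_headers_sketch(fasta_lines, lst_lines):
    """Replace each fasta header by the locus name of the matching lst line."""
    # Assumption: the locus is the 5th tab-separated column of each lst line.
    loci = (line.split("\t")[4] for line in lst_lines)
    renamed = []
    for line in fasta_lines:
        if line.startswith(">"):
            locus = next(loci, None)
            if locus is None:
                # More proteins than lst entries: the conversion cannot go on.
                raise ValueError("protein without corresponding lst line")
            renamed.append(f">{locus}")
        else:
            # Sequence lines are copied unchanged.
            renamed.append(line)
    return renamed
```

Raising an error on a missing lst entry mirrors the documented behaviour of returning False when the conversion does not go well.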
- PanACoTA.annotate_module.format_prodigal.format_one_genome(gpath, name, prod_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir)¶
Format the given genome, and create its corresponding files in the following folders:
Proteins
Genes
Replicons
LSTINFO
gff
- Parameters:
- gpathstr
path to the genome sequence which was given to prodigal for annotation
- namestr
gembase name of the genome
- prod_pathstr
directory where all tmp_files for all sequences are saved (sequences cut at each set of 5N, prodigal results and logs)
- lst_dirstr
path to LSTINFO folder
- prot_dirstr
path to Proteins folder
- gene_dirstr
path to Genes folder
- rep_dirstr
path to Replicons folder
- gff_dirstr
path to gff3 folder
- Returns:
- bool
True if genome was correctly formatted, False otherwise