PanACoTA.annotate_module package

annotate module of PanACoTA

genome_seq_functions submodule

Functions to:

  • analyse a genome (cut at stretches of N if asked, calculate L90, number of contigs, size…)

  • if the genome is cut at stretches of N, write the new sequence to a new file

  • rename genome contigs and write the new sequence to a new file

@author gem April 2017

PanACoTA.annotate_module.genome_seq_functions.analyse_all_genomes(genomes, dbpath, tmp_path, nbn, soft, logger, quiet=False)
Parameters:
genomesdict

{genome: spegenus.date}

dbpathstr

path to folder containing genomes

tmp_pathstr

path to put out files

nbnint

minimum number of ‘N’ required to cut into a new contig

softstr

soft used (prokka, prodigal, or None if called by prepare module)

loggerlogging.Logger

logger object to write log information. It is given as a parameter because this function can also be called from the prepare module, where the sub-logger name is different

quietbool

True if nothing must be written to stdout/stderr, False otherwise

Returns:
dict

{genome: [spegenus.date, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}

PanACoTA.annotate_module.genome_seq_functions.analyse_genome(genome, dbpath, tmp_path, cut, pat, genomes, soft, logger)

Analyse given genome:

  • if cut is asked:

    • cut its contigs each time ‘pat’ is seen

    • save cut genome in new file

  • calculate genome size, L90, nb contigs and save it into genomes

Parameters:
genomestr

given genome to analyse

dbpathstr

path to the folder containing the given genome sequence

tmp_pathstr

path to folder where output files must be saved.

cutbool

True if contigs must be cut, False otherwise

patstr

pattern on which contigs must be cut. ex: “NNNNN”

genomesdict

{genome_file: [genome_name]} as input, and will be changed to {genome_file: [genome_name, path, path_annotate, gsize, nbcont, L90]}

softstr

soft used (prokka, prodigal, or None if called by prepare module)

Returns:
bool

True if genome analysis went well, False otherwise. Modifies ‘genomes’ for the analysed genome: {genome_file: [genome_name, path, path_annotate, gsize, nbcont, L90]}

PanACoTA.annotate_module.genome_seq_functions.calc_l90(contig_sizes)

Calc L90 of a given genome

Parameters:
contig_sizesdict

{name: size}

Returns:
None or int

L90 if it was found, None otherwise
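
As an illustration only, the sketch below computes L90 under the usual definition (the smallest number of contigs, taken from largest to smallest, whose cumulative size reaches at least 90% of the genome size). The function name `l90_sketch` is hypothetical, and the actual implementation of calc_l90 is not shown in this documentation and may differ.

```python
def l90_sketch(contig_sizes):
    """Return the smallest number of contigs covering >= 90% of the genome.

    contig_sizes: {name: size}. Returns None if no L90 can be computed.
    Hypothetical helper, not the PanACoTA implementation.
    """
    total = sum(contig_sizes.values())
    if total == 0:
        return None
    cumulative = 0
    # Walk contigs from largest to smallest, accumulating their sizes.
    ordered = sorted(contig_sizes.values(), reverse=True)
    for count, size in enumerate(ordered, start=1):
        cumulative += size
        # Integer comparison avoids floating-point rounding on 0.9 * total.
        if 10 * cumulative >= 9 * total:
            return count
    return None
```

For example, with contigs of sizes 50, 30, 15 and 5, the first two contigs cover only 80% of the genome, so three contigs are needed.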

PanACoTA.annotate_module.genome_seq_functions.format_contig(cut, pat, cur_seq, cur_contig_name, genome, contig_sizes, gresf, num, logger)

Format given contig, and save to output file if needed

  • if cut: cut it and write each subsequence

  • otherwise: write the contig as it is, only checking that contig names are all different

Parameters:
cutbool

True if contigs must be cut, False otherwise

patstr

pattern on which contigs must be cut. ex: “NNNNN”

cur_seqstr

current sequence (aa)

cur_contig_namestr

name of current contig

genomestr

name of current genome

contig_sizesdict

{contig_name : sequence length}

gresfio.TextIOWrapper

open file to write new sequence. If we are annotating with prodigal and not cutting, there is no new sequence -> gresf is None

numint

current contig number

loggerlogging.Logger

logger object to write log information

Returns:
bool

True if contig has been written without problem, False if any problem

PanACoTA.annotate_module.genome_seq_functions.get_output_dir(soft, dbpath, tmp_path, genome, cut, pat)

Get the output directory where the cut sequence and/or the sequence with shorter contigs (prokka) will be written

Parameters:
softstr

soft used (prokka, prodigal, or None if called by prepare module)

dbpathstr

path to the folder containing the given genome sequence

tmp_pathstr

path to folder where output files must be saved.

genomestr

genome name

cutbool

True if contigs must be cut, False otherwise

patstr

pattern on which contigs must be cut. ex: “NNNNN”

PanACoTA.annotate_module.genome_seq_functions.plot_distributions(genomes, res_path, listfile_base, l90, nbconts)

FUNCTION DIRECTLY CALLED FROM MAIN ANNOTATE MODULE (step 2). Plot distributions of L90 and nbcontig values.

Parameters:
genomesdict

{genome: [name, orig_path, to_annotate_path, size, nbcont, l90]}

res_pathstr

path to put all output files

listfile_basestr

name of list file

l90int

L90 threshold

nbcontsint

nb contigs threshold

Returns:
(l90_vals, nbcont_vals, dist1, dist2)

  • l90_vals: list of L90 values for all genomes

  • nbcont_vals: list of numbers of contigs for all genomes

  • dist1: matplotlib figure of the distribution of L90 values

  • dist2: matplotlib figure of the distribution of numbers of contigs

PanACoTA.annotate_module.genome_seq_functions.rename_all_genomes(genomes)

FUNCTION DIRECTLY CALLED FROM MAIN ANNOTATE MODULE (step 3). Sort kept genomes by L90 and then by number of contigs. For each genome, assign a strain number, and rename all its contigs.

Parameters:
genomesdict

{genome: [name, path, path_to_seq, gsize, nbcont, L90]} as input, and will become {genome: [gembase_name, path, path_to_seq, gsize, nbcont, L90]} at the end
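
The sorting and naming step described above could look like the following sketch. It assumes that the gembase name is the original name followed by a 5-digit strain number (as in the `ESCO.1017.00200` example used elsewhere in this documentation) and that a single counter is shared by all genomes. The real function may restart numbering per species and also renames contigs, which is not shown here. `assign_gembase_names` is a hypothetical name.

```python
def assign_gembase_names(genomes):
    """Sort genomes by (L90, nbcont) and append a strain number to each name.

    genomes: {genome: [name, path, path_to_seq, gsize, nbcont, L90]},
    modified in place. Hypothetical sketch, not the PanACoTA implementation.
    """
    # Index 5 is L90, index 4 is the number of contigs.
    ordered = sorted(genomes, key=lambda g: (genomes[g][5], genomes[g][4]))
    for strain_num, genome in enumerate(ordered, start=1):
        name = genomes[genome][0]
        # Assumption: strain number is zero-padded to 5 digits.
        genomes[genome][0] = f"{name}.{strain_num:05d}"
    return genomes
```

With this ordering, the genome with the lowest L90 (and, on ties, the fewest contigs) receives strain number 00001.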

PanACoTA.annotate_module.genome_seq_functions.split_contig(pat, whole_seq, cur_contig_name, contig_sizes, gresf, num)

Save the contig read just before into dicts and write it to sequence file. Unique ID of contig must be in the first field of header, before the first space (required by prokka)

Parameters:
patstr

pattern to split a contig. None if we do not want to split

whole_seqstr

sequence of current contig, to save once split according to pat

cur_contig_namestr

name of current contig to save once split according to pat

contig_sizesdict

{name: size} save cur_contig once split according to pat

gresf_io.TextIOWrapper

file open in w mode to save the split sequence

numint

current contig number.

Returns:
int

new contig number, after giving number(s) to the current contig
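
To make the splitting and numbering concrete, here is a minimal sketch. It assumes the cut is made at every run of at least `nbn` uppercase ‘N’ characters and that each piece is named `<num>_<contig_name>`, so that the unique ID stays in the first field of the header. Both helper names and the naming scheme are hypothetical and are not taken from the PanACoTA source.

```python
import re


def split_at_n_stretches(seq, nbn):
    """Split seq at every run of at least nbn 'N'; empty pieces are dropped."""
    pattern = re.compile("N{%d,}" % nbn)
    return [piece for piece in pattern.split(seq) if piece]


def number_pieces(pieces, contig_name, contig_sizes, num):
    """Give each piece a number, record its size, and build fasta records.

    Returns (records, next_num), where next_num is the first unused number.
    """
    records = []
    for piece in pieces:
        # Assumption: the unique ID is '<num>_<contig_name>', with no space.
        new_name = f"{num}_{contig_name}"
        contig_sizes[new_name] = len(piece)
        records.append(f">{new_name}\n{piece}\n")
        num += 1
    return records, num
```

Runs shorter than `nbn` are kept inside the piece, which matches the idea of a minimum number of ‘N’ required to cut into a new contig.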

annotation_functions submodule

Functions to deal with prokka or prodigal only, according to user request

@author gem April 2017

PanACoTA.annotate_module.annotation_functions.check_prodigal(gpath, name, prodigal_dir, logger)

When the prodigal result folder already exists, check that the output files exist. We cannot check all content, but check that they are present.

Parameters:
gpathstr

path to fasta file given as input for prodigal

namestr

genome name in gembase format

prodigal_dirstr

output directory, where all files are written by prodigal

loggerlogging.Logger

logger object to get logs

Returns:
bool

True if everything went well, False otherwise

PanACoTA.annotate_module.annotation_functions.check_prokka(outdir, logf, name, gpath, nbcont, logger)

Prokka writes everything to stderr, and always returns a non-zero return code. So, we check if it ran well by checking the content of output directory. This function is also used when prokka files already exist (prokka was run previously), to check if everything is ok before going to next step.

Parameters:
outdirstr

output directory, where all files are written by prokka

logfstr

prokka/prodigal logfile, containing stderr of prokka

namestr

genome name in gembase format

gpathstr

path to fasta file given as input for prokka

nbcontint

number of contigs in fasta file given to prokka

loggerlogging.Logger

logger object to get logs

Returns:
bool

True if everything went well, False otherwise

PanACoTA.annotate_module.annotation_functions.count_headers(seqfile)

Count how many sequences there are in the given file

Parameters:
seqfilestr

file containing a sequence in multi-fasta format

Returns:
int

number of contigs in the given multi-fasta file
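
Counting sequences in a multi-fasta file amounts to counting header lines, which start with ‘>’. The sketch below takes any iterable of lines (for instance an open file) rather than a path, only to keep it easy to test. `count_fasta_headers` is a hypothetical name.

```python
def count_fasta_headers(lines):
    """Count fasta header lines (starting with '>') in an iterable of lines."""
    return sum(1 for line in lines if line.startswith(">"))
```

It can be used on a file with `with open(seqfile) as handle: count_fasta_headers(handle)`.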

PanACoTA.annotate_module.annotation_functions.count_tbl(tblfile)

Count the different features found in the tbl file:

  • number of contigs

  • number of proteins (CDS)

  • number of genes (locus_tag)

  • number of CRISPR arrays (repeat_region) -> ignore crisprs

Parameters:
tblfilestr

tbl file generated by prokka

Returns:
(nbcont, nb_cds, nb_gene, nb_crispr)

information on features found in the tbl file.
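
The counts listed above could be gathered as in this sketch. It assumes the usual tbl layout: `>Feature contig_name` lines, unindented `start end type` feature lines, and indented qualifier lines such as `locus_tag text`. `count_tbl_features` is a hypothetical name and the real function may handle more cases.

```python
def count_tbl_features(lines):
    """Return (nbcont, nb_cds, nb_gene, nb_crispr) from the lines of a tbl file."""
    nbcont = nb_cds = nb_gene = nb_crispr = 0
    for line in lines:
        if line.startswith(">Feature"):
            nbcont += 1
            continue
        fields = line.split()
        if not fields:
            continue
        if line[0].isspace():
            # Indented qualifier line: each locus_tag counts as one gene.
            if fields[0] == "locus_tag":
                nb_gene += 1
        elif len(fields) >= 3:
            # Unindented feature line: "start end type".
            if fields[2] == "CDS":
                nb_cds += 1
            elif fields[2] == "repeat_region":
                nb_crispr += 1
    return nbcont, nb_cds, nb_gene, nb_crispr
```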

PanACoTA.annotate_module.annotation_functions.prodigal_train(gpath, annot_folder)

Use prodigal training mode. First, train prodigal on the first genome (‘gpath’), and write the result to ‘genome’.trn, a file which will be used for the annotation of all following sequences.

Parameters:
gpathstr

path to genome to train on

annot_folderstr

path to folder where the log files and train file will be saved

Returns:
str

path and name of train file (will be used to annotate all following genomes). If there is a problem, returns an empty string

PanACoTA.annotate_module.annotation_functions.run_annotation_all(genomes, threads, force, annot_folder, fgn, prodigal_only=False, small=False, quiet=False)

For each genome in genomes, run prokka (or only prodigal) to annotate the genome.

Parameters:
genomesdict

{genome: [gembase_name, path_to_origfile, path_split_gembase, gsize, nbcont, L90]}

threadsint

max number of threads that can be used

forcebool

if False, do not override prokka/prodigal outdir and result dir if they exist. If True, rerun prokka and override existing results, for all genomes.

annot_folderstr

folder where prokka/prodigal results must be written: for each genome, a directory <genome_name>-prokkaRes or <genome_name>-prodigalRes will be created in this folder, and all the results of prokka/prodigal for the genome will be written inside

fgnstr

name (key in genomes dict) of the first genome, which will be used for prodigal training

prodigal_onlybool

True if only prodigal must run, False if prokka must run

smallbool

True -> use -p meta option with prodigal. Do not use training

quietbool

True if nothing must be written to stderr/stdout, False otherwise

Returns:
dict

{genome: boolean} -> with True if prokka/prodigal ran well, False otherwise.

PanACoTA.annotate_module.annotation_functions.run_prodigal(arguments)

Run prodigal for the given genome.

Parameters:
argumentstuple

(gpath, prodigal_folder, cores_annot, name, force, nbcont, q) with:

  • gpath: path and filename of genome to annotate

  • prodigal_folder: path to folder where all prodigal folders for all genomes are saved

  • cores_annot: how many cores prodigal can use

  • name: output name of annotated genome

  • force: True if force run (override existing files), False otherwise

  • nbcont: number of contigs in the input genome, to check prodigal results

  • small: if contigs are too small (<20000bp), use -p meta option

  • q : queue where logs are put

Returns:
boolean

True if everything went well (all needed output files present, corresponding numbers of proteins, genes etc.). False otherwise.
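
As a rough illustration of the choice between training mode and the -p meta option, the sketch below only builds a prodigal command line without running it. The options used (-i, -d, -a, -f, -o, -q, -t, -p) are standard Prodigal options. The output file names, the helper name `build_prodigal_command` and the exact set of options used by PanACoTA are assumptions.

```python
def build_prodigal_command(gpath, out_prefix, small=False, train_file=None):
    """Build (but do not run) a prodigal command as a list of arguments."""
    cmd = [
        "prodigal",
        "-i", gpath,                 # input genome (fasta)
        "-d", out_prefix + ".ffn",   # nucleotide sequences of genes
        "-a", out_prefix + ".faa",   # protein translations
        "-f", "gff",                 # output format
        "-o", out_prefix + ".gff",   # output file
        "-q",                        # quiet mode
    ]
    if small:
        # Small contigs: meta procedure, no training file.
        cmd += ["-p", "meta"]
    elif train_file:
        # Otherwise reuse the training file built on the first genome.
        cmd += ["-t", train_file]
    return cmd
```

The returned list can be passed to `subprocess.run` if prodigal is installed.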

PanACoTA.annotate_module.annotation_functions.run_prokka(arguments)

Run prokka for the given genome.

Parameters:
argumentstuple

(gpath, prok_folder, cores_annot, name, force, nbcont, small, q) with:

  • gpath: path and filename of genome to annotate

  • prok_folder: path to folder where all prokka folders for all genomes are saved

  • cores_annot: how many cores prokka can use

  • name: output name of annotated genome

  • force: True if force run (override existing files), False otherwise

  • nbcont: number of contigs in the input genome, to check prokka results

  • small: used for prodigal, if sequences to annotate are small. Not used here

  • q : queue where logs are put

Returns:
boolean

True if everything went well (all needed output files present, corresponding numbers of proteins, genes etc.). False otherwise.

format_functions submodule

Functions to convert prokka or prodigal results to gembase format:

  • Proteins: multifasta with all CDS in aa

  • Replicons: multifasta of genome

  • Genes: multifasta of all genes in nuc

  • gff3: gff files without sequence

  • LSTINFO: information on annotation.

    • if annotated by prokka, columns are: “start end strand type locus gene_name | product | EC_number | inference2”, with the same types as in the prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>

    • if annotated by prodigal, columns are: “start end strand type locus NA | NA | NA | NA”, as there is no functional annotation

@author gem May 2019

PanACoTA.annotate_module.general_format_functions.format_genomes(genomes_ok, res_path, annot_path, prodigal_only, threads=1, quiet=False)

For all genomes which were annotated (by prokka or prodigal), reformat them in order to have, in ‘res_path’, the following folders:

  • LSTINFO: containing a .lst file for each genome, with all genes

  • Replicons: containing all multifasta files

  • Genes: containing 1 multi-fasta per genome, with all its genes in nuc

  • Proteins: containing 1 multi-fasta per genome, with all its proteins in aa

  • gff: containing all gff files

Parameters:
genomes_okdict

genomes to format (annotation was OK) -> {genome: [name, gpath, to_annot, size, nbcont, l90]}

res_pathstr

path to folder where the 5 directories listed above must be created

annot_pathstr

path to folder containing “<genome_name>-[prokka, prodigal]Res” where all prokka/prodigal results are saved.

prodigal_onlybool

True if the genomes were annotated by prodigal, False if annotated by prokka

threadsint

number of threads to use while formatting genomes

quietbool

True if nothing must be sent to stderr/stdout, False otherwise

Returns:
skipped_formatlist

list of genomes skipped because they had a problem in format step

PanACoTA.annotate_module.general_format_functions.handle_genome(args)

For a given genome, check if it has been annotated (in results), if annotation (by prokka or prodigal) ran without problems (result = True). In that case, format the genome and get the output to see if everything went ok.

Parameters:
argstuple

(genome, name, gpath, prok_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir, results, q) with:

  • genome : original genome name

  • name : gembase name of the genome

  • gpath : path to the genome sequence which was given to prokka/prodigal for annotation

  • annot_path : directory where prokka/prodigal folders are saved

  • lst_dir : path to ‘LSTINFO’ folder

  • prot_dir : path to ‘Proteins’ folder

  • gene_dir : path to ‘Genes’ folder

  • rep_dir : path to ‘Replicons’ folder

  • gff_dir : path to ‘gff3’ folder

  • prodigal_only : True if annotated by prodigal, False if annotated by prokka

  • q : multiprocessing.managers.AutoProxy[Queue] queue to put logs during subprocess

Returns:
(bool, str)
  • True if genome was annotated as expected, False otherwise

  • genome name (used to get info from the pool.map_async)

PanACoTA.annotate_module.general_format_functions.write_gene(gtype, locus_num, gene_name, product, cont_loc, genome, cont_num, ecnum, inf2, db_xref, strand, start, end, lstopenfile)

Write given gene to output file

Parameters:
gtypestr

type of feature (CDS, tRNA, etc.)

locus_numstr

number of the locus given by prokka/prodigal

gene_namestr

gene name found by prokka/prodigal (“NA” if no gene name -> Always the case with Prodigal)

productstr

found by prokka/Prodigal, “NA” if no product (always the case for prodigal)

cont_locstr

‘i’ if the gene is inside a contig, ‘b’ if it is on the border (first or last gene of the contig)

genomestr

genome name (spegenus.date.strain_num)

cont_numint

contig number

ecnumstr

EC number, found by prokka, or “NA” if no EC number (-> always for prodigal)

inf2str

more information found by prokka/prodigal, or “NA” if no more information

db_xrefstr

db_xref given by Prokka (“NA” for prodigal)

strandstr

C (complement) or D (direct)

startstr

start of gene in the contig

endstr

end of gene in the contig

lstopenfile_io.TextIOWrapper

open file where lstinfo must be written

Returns:
str

lstline
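
The sketch below shows how a locus name and an lst line could be assembled from the parameters above. It follows the documented template <genome_name>.<contig_num><i or b>_<protein_num> and the column order “start end strand type locus gene_name | product | EC_number | inference2 | db_xref”. The tab separator and the 4-digit contig number are assumptions, and both helper names are hypothetical.

```python
def build_locus(genome, cont_num, cont_loc, locus_num):
    """Build '<genome_name>.<contig_num><i or b>_<protein_num>'."""
    # Assumption: contig number is zero-padded to 4 digits.
    return f"{genome}.{cont_num:04d}{cont_loc}_{locus_num}"


def build_lst_line(start, end, strand, gtype, locus, gene_name="NA",
                   product="NA", ecnum="NA", inf2="NA", db_xref="NA"):
    """Build one lst line; missing information defaults to 'NA'."""
    info = " | ".join([product, ecnum, inf2, db_xref])
    # Assumption: the leading columns are tab-separated.
    return "\t".join([start, end, strand, gtype, locus, gene_name, f"| {info}"])
```

With the defaults, the functional fields are all “NA”, as described for prodigal annotation.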

PanACoTA.annotate_module.general_format_functions.write_header(lstline, outfile)

Write header to output file. The header is generated from the lst line.

Parameters:
lstlinestr

current line of lst file

outfile_io.TextIOWrapper

open file where header must be written

Functions to:

  • convert prokka tbl file to our tab file

  • convert prokka ffn and faa headers to our format

  • Create the database, with the following folders in the given “res_path”:

    • Proteins: multifasta with all CDS in aa

    • Replicons: multifasta of genome

    • Genes: multifasta of all genes in nuc

    • gff3: gff files without sequence

    • LSTINFO: information on annotation. Columns are: “start end strand type locus gene_name | product | EC_number | inference2” with the same types as in the prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>

@author gem April 2019

PanACoTA.annotate_module.format_prokka.create_gen(ffnseq, lstfile, genseq)

Generate .gen file, from sequences contained in .ffn, but changing the headers using the information in .lst

Parameters:
ffnseqstr

.ffn file generated by prokka

lstfilestr

lstfile converted from prokka tbl file

genseqstr

output file, to write in Genes directory

loggerlogging.Logger

logger object to put information

Returns:
bool

True if conversion went well, False otherwise

PanACoTA.annotate_module.format_prokka.create_prt(faaseq, lstfile, prtseq)

Generate .prt file, from sequences in .faa, but changing the headers using information in .lst

Note: works if proteins are in increasing order (of number after “_” in their name) in faa and tbl (hence lst) files.

If a header is not in the right format, or a protein exists in prt file but not in lstfile, conversion is stopped, an error message is output, and prt file is removed.

Parameters:
faaseqstr

faa file output of prokka

lstfilestr

lstinfo converted from prokka tbl file

prtseqstr

output file where converted proteins must be saved

Returns:
bool

True if conversion went well, False otherwise

PanACoTA.annotate_module.format_prokka.format_one_genome(gpath, name, prok_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir)

Format the given genome, and create its corresponding files in the following folders:

  • Proteins

  • Genes

  • Replicons

  • LSTINFO

  • gff

Parameters:
gpathstr

path to the genome sequence which was given to prokka for annotation

namestr

gembase name of the genome

prok_pathstr

directory where prokka folders are saved

lst_dirstr

path to LSTINFO folder

prot_dirstr

path to Proteins folder

gene_dirstr

path to Genes folder

rep_dirstr

path to Replicons folder

gff_dirstr

path to gff3 folder

Returns:
bool

True if genome was correctly formatted, False otherwise

PanACoTA.annotate_module.format_prokka.generate_gff(gpath, prokka_gff_file, res_gff_file, res_lst_file, sizes, contigs)

From the lstinfo file and contig names (retrieved from generation of Replicons files), generate a gff file.

Format:

    ##gff-version 3
    ##sequence-region contig1 start end
    ##sequence-region contig2 start end
    …
    seqid(=contig) source type start end score strand phase attributes

All fields tab separated. Empty fields contain ‘.’

For example: ESCO.1017.00200.00001 Prodigal:2.6 CDS start end . + . ID=ESCO.1017.00200.b0001_00001;locus_tag=ESCO.1017.00200.b0001_00001;product=hypothetical protein

genome1_1 Prodigal_v2.6.3 CDS 213 1880 260.0 + 0 ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.534;conf=99.99;score=259.99;cscore=268.89;sscore=-8.89;rscore=-8.50;uscore=-4.34;tscore=3.95;

Parameters:
gpathstr

path to the genome sequence given to prokka. Only used for error message

res_gff_filestr

path to the gff file that must be created in result database

res_lst_filestr

path to the lst file that was created in result database in the previous step

sizesdict

dict of contig gembase names with their sizes. {“gembase1”: size1, “gembase2”: size2, …}

contigsdict

dict of contig original and gembase names. {“contig1”: “gembase1”, …}

Returns:
bool

True if conversion worked well, False otherwise

PanACoTA.annotate_module.format_prokka.tbl2lst(tblfile, lstfile, contigs, genome, gpath)

Read prokka tbl file, and convert it to the lst file.

  • prokka tbl file format:

    >Feature contig_name
    start end type
            [EC_number text]
            [gene text]
            inference ab initio prediction:Prodigal:2.6
            [inference text]
            locus_tag test
            product text
    

where type can be CDS, tRNA, rRNA, etc. Lines between [] are not always present.

  • lst file format:

    start end strand type locus gene_name | product | EC_number | inference 2 | db_xref
    

with the same types as in the prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>

Parameters:
tblfilestr

name of prokka output tbl file to read

lstfilestr

name of lst file to generate

contigsdict

{original_contig_name: gembase_contig_name}

genomestr

genome name (gembase format)

gpathstr

path to the genome given to prodigal. Only used for error message

changed_namebool

True if contig names have been changed (cutn != 0) -> contig names end by ‘_num’, False otherwise.

Returns:
bool

True if genome name used in lstfile and prokka tblfile are the same, False otherwise

Functions to convert prodigal result files to gembase format.

  • Proteins: multifasta with all CDS in aa

  • Replicons: (multi)fasta of genome sequences

  • Genes: multifasta of all genes in nuc

  • gff3: gff files without sequence

  • LSTINFO: information on annotation. Columns are: “start end strand type locus gene_name | product | EC_number | inference2”, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>. For prodigal: “start end strand type locus NA | NA | NA | NA”, as there is no functional annotation.

@author gem July 2019

PanACoTA.annotate_module.format_prodigal.create_gene_lst(contigs, gen_file, res_gen_file, res_lst_file, gpath, name)

Generate .gen file, from sequences contained in .ffn, but changing the headers to match with gembase format. At the same time, generate .lst file, from the information given in prodigal ffn headers

Parameters:
contigsdict

{original_contig_name: gembase_contig_name}

gen_filestr

.ffn file generated by prodigal

res_gen_filestr

generated .gen file, to write in Genes directory

res_lst_filestr

generated .lst file to write in LSTINFO directory

gpathstr

path to the genome given to prodigal. Only used for error message

namestr

gembase name of the genome to format

loggerlogging.Logger

logger object to put information

Returns:
bool

True if conversion went well, False otherwise

PanACoTA.annotate_module.format_prodigal.create_gff(gpath, gff_file, res_gff_file, res_lst_file, contigs, sizes)

Create .gff3 file.

Format:

    ##gff-version 3
    ##sequence-region contig1 start end
    ##sequence-region contig2 start end
    …
    seqid(=contig) source type start end score strand phase attributes

All fields tab separated. Empty fields contain ‘.’

For example: ESCO.1017.00200.00001 Prodigal:2.6 CDS start end . + . ID=ESCO.1017.00200.b0001_00001;locus_tag=ESCO.1017.00200.b0001_00001;product=hypothetical protein

genome1_1 Prodigal_v2.6.3 CDS 213 1880 260.0 + 0 ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.534;conf=99.99;score=259.99;cscore=268.89;sscore=-8.89;rscore=-8.50;uscore=-4.34;tscore=3.95;

Parameters:
gpathstr

path to the genome sequence given to prodigal. Only used for the error message

gff_filestr

path to gff file generated by prodigal

res_gff_filestr

path to the gff file that must be created in result database

res_lst_filestr

path to the lst file that was created in result database in the previous step

contigsdict

dict of contig original names with their gembase names. {“original_name”: “gembase_name”}

sizesdict

dict of contig gembase names with their sizes {“gembase_name”: size}

Returns:
bool

True if everything went well, False if any problem
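
The gff3 layout described above (version line, one ##sequence-region line per contig, then 9 tab-separated fields per feature, with ‘.’ for empty fields) can be sketched as follows. The helper names are hypothetical. The sketch assumes each sequence region starts at position 1, and it does not read the prodigal gff or the lst file as the real function does.

```python
def gff_header(sizes):
    """Build the gff3 header lines from {gembase_contig_name: size}."""
    lines = ["##gff-version 3"]
    for contig, size in sizes.items():
        # Assumption: each region spans the whole contig, from 1 to its size.
        lines.append(f"##sequence-region {contig} 1 {size}")
    return lines


def gff_feature(seqid, source, gtype, start, end, strand, attributes,
                score=".", phase="."):
    """Build one tab-separated gff3 feature line (9 fields)."""
    attr = ";".join(f"{key}={value}" for key, value in attributes.items())
    fields = [seqid, source, gtype, str(start), str(end),
              score, strand, phase, attr]
    return "\t".join(fields)
```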

PanACoTA.annotate_module.format_prodigal.create_prt(prot_file, res_prot_file, res_lst_file)

Generate .prt file (gembase formatted gene names), from features contained in .lst file generated just before.

Parameters:
prot_filestr

.faa file generated by prodigal

res_prot_filestr

output file, to write in Proteins directory

res_lst_filestr

.lst file to get all gene names in gembase format instead of re-generating them

Returns:
bool

True if conversion went well, False otherwise

PanACoTA.annotate_module.format_prodigal.format_one_genome(gpath, name, prod_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir)

Format the given genome, and create its corresponding files in the following folders:

  • Proteins

  • Genes

  • Replicons

  • LSTINFO

  • gff

Parameters:
gpathstr

path to the genome sequence which was given to prodigal for annotation

namestr

gembase name of the genome

prod_pathstr

directory where all tmp_files for all sequences are saved (sequences cut at each set of 5N, prodigal results and logs)

lst_dirstr

path to LSTINFO folder

prot_dirstr

path to Proteins folder

gene_dirstr

path to Genes folder

rep_dirstr

path to Replicons folder

gff_dirstr

path to gff3 folder

Returns:
bool

True if genome was correctly formatted, False otherwise