`PanACoTA.annotate_module` package

annotate module of PanACoTA

`genome_seq_functions` submodule

Functions to:

analyse a genome (cut at stretch of N if asked, calc L90, nb cont, size…)
if genome cut by stretch of N, write the new sequence to new file
rename genome contigs and write new sequence to new file

@author gem April 2017

PanACoTA.annotate_module.genome_seq_functions.analyse_all_genomes(genomes, dbpath, tmp_path, nbn, soft, logger, quiet=False)

Parameters:

genomes (dict): {genome: spegenus.date}
dbpath (str): path to folder containing genomes
tmp_path (str): path where output files are written
nbn (int): minimum number of ‘N’ required to cut into a new contig
soft (str): software used (prokka, prodigal, or None if called by prepare module)
logger (logging.Logger): logger object to write log information. It is passed explicitly because this function can also be called from the prepare module, where the sub-logger name is different
quiet (bool): True if nothing must be written to stdout/stderr, False otherwise

Returns:

dict: {genome: [spegenus.date, orig_name, path_to_seq_to_annotate, size, nbcont, l90]}

PanACoTA.annotate_module.genome_seq_functions.analyse_genome(genome, dbpath, tmp_path, cut, pat, genomes, soft, logger)

Analyse given genome:

if cut is asked:
- cut its contigs each time ‘pat’ is seen
- save cut genome in new file
calculate genome size, L90, nb contigs and save them into genomes

Parameters:

genome (str): given genome to analyse
dbpath (str): path to the folder containing the given genome sequence
tmp_path (str): path to folder where output files must be saved.
cut (bool): True if contigs must be cut, False otherwise
pat (str): pattern on which contigs must be cut. ex: “NNNNN”
genomes (dict): {genome_file: [genome_name]} as input, and will be changed to {genome_file: [genome_name, path, path_annotate, gsize, nbcont, L90]}
soft (str): software used (prokka, prodigal, or None if called by prepare module)

Returns:

bool: True if genome analysis went well, False otherwise. Also modifies ‘genomes’ for the analysed genome: -> {genome_file: [genome_name, path, path_annotate, gsize, nbcont, L90]}

PanACoTA.annotate_module.genome_seq_functions.calc_l90(contig_sizes)

Calculate L90 of a given genome

Parameters:

contig_sizes (dict): {name: size}

Returns:

None or int: the L90 if it is found, None otherwise
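
As an illustration of the quantity computed here, the sketch below uses the usual definition of L90: the smallest number of contigs, taken from largest to smallest, whose cumulative length reaches 90% of the genome size. The name `l90_sketch` is hypothetical, and this is not necessarily how `calc_l90` itself is implemented.

```python
def l90_sketch(contig_sizes):
    """Return the L90 of a genome given {contig_name: size}, or None."""
    total = sum(contig_sizes.values())
    if total == 0:
        # Empty genome: no L90 can be found
        return None
    covered = 0
    # Walk through contigs from largest to smallest
    for rank, size in enumerate(sorted(contig_sizes.values(), reverse=True), start=1):
        covered += size
        # Integer comparison equivalent to covered >= 0.9 * total
        if covered * 10 >= total * 9:
            return rank
    return None
```

With contigs of sizes 50, 30, 15 and 5, the first two cover only 80% of the genome, so three contigs are needed.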

PanACoTA.annotate_module.genome_seq_functions.format_contig(cut, pat, cur_seq, cur_contig_name, genome, contig_sizes, gresf, num, logger)

Format given contig, and save to output file if needed

if cut: cut it and write each subsequence
otherwise: write new contig, just checking that contig names are different

Parameters:

cut (bool): True if contigs must be cut, False otherwise
pat (str): pattern on which contigs must be cut. ex: “NNNNN”
cur_seq (str): current sequence (aa)
cur_contig_name (str): name of current contig
genome (str): name of current genome
contig_sizes (dict): {contig_name : sequence length}
gresf (io.TextIOWrapper): open file to write new sequence. If we are annotating with prodigal and not cutting, there is no new sequence -> gresf is None
num (int): current contig number
logger (logging.Logger): logger object to write log information

Returns:

bool: True if contig has been written without problem, False if any problem

PanACoTA.annotate_module.genome_seq_functions.get_output_dir(soft, dbpath, tmp_path, genome, cut, pat)

Get the output file in which to write the cut sequence and/or the sequence with shorter contigs (prokka)

Parameters:

soft (str): software used (prokka, prodigal, or None if called by prepare module)
dbpath (str): path to the folder containing the given genome sequence
tmp_path (str): path to folder where output files must be saved.
genome (str): genome name
cut (bool): True if contigs must be cut, False otherwise
pat (str): pattern on which contigs must be cut. ex: “NNNNN”

PanACoTA.annotate_module.genome_seq_functions.plot_distributions(genomes, res_path, listfile_base, l90, nbconts)

FUNCTION DIRECTLY CALLED FROM MAIN ANNOTATE MODULE (step 2).

Plot distributions of L90 and nbcontig values.

Parameters:

genomes (dict): {genome: [name, orig_path, to_annotate_path, size, nbcont, l90]}
res_path (str): path to put all output files
listfile_base (str): name of list file
l90 (int): L90 threshold
nbconts (int): nb contigs threshold

Returns:

(l90_vals, nbcont_vals, dist1, dist2)
- l90_vals: list of l90 values for all genomes
- nbcont_vals: list of nbcontigs for all genomes
- dist1: matplotlib figure of distribution of L90 values
- dist2: matplotlib figure of distribution of nb contigs values

PanACoTA.annotate_module.genome_seq_functions.rename_all_genomes(genomes)

FUNCTION DIRECTLY CALLED FROM MAIN ANNOTATE MODULE (step 3).

Sort kept genomes by L90 and then nb contigs. For each genome, assign a strain number, and rename all its contigs.

Parameters:

genomes (dict): {genome: [name, path, path_to_seq, gsize, nbcont, L90]} as input, and will become {genome: [gembase_name, path, path_to_seq, gsize, nbcont, L90]} at the end
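
The sorting and numbering step can be sketched as follows. This is a minimal illustration, not the module's code: the name `rename_sketch` is hypothetical, numbering strains separately for each `spegenus.date` name is an assumption, and the 5-digit zero padding is inferred from names such as `ESCO.1017.00200` used elsewhere on this page. Renaming of the contigs inside each sequence file is left out.

```python
def rename_sketch(genomes):
    """Replace genomes[g][0] by '<name>.<strain_num>', ranking by (L90, nbcont)."""
    counters = {}  # next strain number for each spegenus.date name
    # Index 5 is L90 and index 4 is the number of contigs
    ordered = sorted(genomes, key=lambda g: (genomes[g][5], genomes[g][4]))
    for genome in ordered:
        name = genomes[genome][0]
        counters[name] = counters.get(name, 0) + 1
        # Assumed format: strain number zero-padded to 5 digits
        genomes[genome][0] = f"{name}.{counters[name]:05d}"
    return genomes
```

The genome with the lowest L90 (and, in case of a tie, the fewest contigs) receives strain number 00001.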

PanACoTA.annotate_module.genome_seq_functions.split_contig(pat, whole_seq, cur_contig_name, contig_sizes, gresf, num)

Save the contig read just before into dicts and write it to sequence file. Unique ID of contig must be in the first field of header, before the first space (required by prokka)

Parameters:

pat (str): pattern to split a contig. None if we do not want to split
whole_seq (str): sequence of current contig, to save once split according to pat
cur_contig_name (str): name of current contig to save once split according to pat
contig_sizes (dict): {name: size} save cur_contig once split according to pat
gresf (_io.TextIOWrapper): file open in w mode to save the split sequence
num (int): current contig number.

Returns:

int: new contig number, after giving number(s) to the current contig
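
Splitting a contig at stretches of ‘N’ can be sketched with a regular expression. The helper name `split_on_n_sketch`, its `min_n` argument and the `<contig>_<num>` naming of the pieces are assumptions made for this illustration; the sketch returns the pieces instead of writing them to a file.

```python
import re

def split_on_n_sketch(whole_seq, contig_name, min_n, num=1):
    """Cut whole_seq at every run of at least min_n 'N'.

    Returns (records, next_num) where records is a list of (name, sequence).
    """
    pattern = re.compile("N{%d,}" % min_n)
    # Keep only the first field of the header as unique contig ID
    base_name = contig_name.split()[0]
    records = []
    for part in pattern.split(whole_seq):
        if not part:
            # Empty piece: the sequence started or ended with a run of N
            continue
        records.append((f"{base_name}_{num}", part))
        num += 1
    return records, num
```

For example, `split_on_n_sketch("ACGTNNNNNGGCC", "contig1", 5)` yields two pieces, while a run of only three ‘N’ is left untouched when `min_n` is 5.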

`annotation_functions` submodule

Functions to deal with prokka or prodigal only, according to user request

@author gem April 2017

PanACoTA.annotate_module.annotation_functions.check_prodigal(gpath, name, prodigal_dir, logger)

When prodigal result folder already exists, check that the output files exist. We cannot check all content, but check that they are present.

Parameters:

gpath (str): path to fasta file given as input for prodigal
name (str): genome name in gembase format
prodigal_dir (str): output directory, where all files are written by prodigal
logger (logging.Logger): logger object to get logs

Returns:

bool: True if everything went well, False otherwise

PanACoTA.annotate_module.annotation_functions.check_prokka(outdir, logf, name, gpath, nbcont, logger)

Prokka writes everything to stderr, and always returns a non-zero return code. So, we check if it ran well by checking the content of output directory. This function is also used when prokka files already exist (prokka was run previously), to check if everything is ok before going to next step.

Parameters:

outdir (str): output directory, where all files are written by prokka
logf (str): prokka/prodigal logfile, containing stderr of prokka
name (str): genome name in gembase format
gpath (str): path to fasta file given as input for prokka
nbcont (int): number of contigs in fasta file given to prokka
logger (logging.Logger): logger object to get logs

Returns:

bool: True if everything went well, False otherwise

PanACoTA.annotate_module.annotation_functions.count_headers(seqfile)

Count how many sequences there are in the given file

Parameters:

seqfile (str): file containing a sequence in multi-fasta format

Returns:

int: number of contigs in the given multi-fasta file
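
Since every sequence in a multi-fasta file starts with a header line beginning with `>`, counting sequences amounts to counting those lines. A minimal sketch, with the hypothetical name `count_headers_sketch`:

```python
def count_headers_sketch(seqfile):
    """Return the number of fasta headers (lines starting with '>') in seqfile."""
    with open(seqfile) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```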

PanACoTA.annotate_module.annotation_functions.count_tbl(tblfile)

Count the different features found in the tbl file:

number of contigs
number of proteins (CDS)
number of genes (locus_tag)
number of CRISPR arrays (repeat_region) -> ignore crisprs

Parameters:

tblfile (str): tbl file generated by prokka

Returns:

(nbcont, nb_cds, nb_gene, nb_crispr): information on features found in the tbl file.
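
The counting can be sketched from the tbl layout described for `tbl2lst` on this page: `>Feature` lines open a contig, feature lines hold `start end type`, and indented lines hold qualifiers such as `locus_tag`. The function name `count_tbl_sketch` is hypothetical and, for simplicity, it takes an iterable of lines rather than a file path.

```python
def count_tbl_sketch(lines):
    """Return (nbcont, nb_cds, nb_gene, nb_crispr) counted from tbl lines."""
    nbcont = nb_cds = nb_gene = nb_crispr = 0
    for line in lines:
        if not line.strip():
            continue
        if line.startswith(">"):
            # '>Feature contig_name' opens a new contig
            nbcont += 1
        elif line[0].isspace():
            # Indented qualifier line, e.g. 'locus_tag <value>'
            if line.split()[0] == "locus_tag":
                nb_gene += 1
        else:
            # Feature line: start, end, type
            fields = line.split()
            if len(fields) >= 3:
                if fields[2] == "CDS":
                    nb_cds += 1
                elif fields[2] == "repeat_region":
                    nb_crispr += 1
    return nbcont, nb_cds, nb_gene, nb_crispr
```

A file path can be handled by passing the open file object, since iterating over it yields its lines.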

PanACoTA.annotate_module.annotation_functions.prodigal_train(gpath, annot_folder)

Use prodigal training mode. First, train prodigal on the first genome (‘gpath’), and write the result to ‘genome’.trn, a file which will be used for the annotation of all following sequences.

Parameters:

gpath (str): path to genome to train on
annot_folder (str): path to folder where the log files and train file will be saved

Returns:

str: path and name of train file (will be used to annotate all following genomes). If there is a problem, returns an empty string
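
A training run of this kind can be sketched as a command line built in Python. This is only an illustration: `-i` (input file) and `-t` (training file, written if it does not exist yet) are options of the prodigal command line, but the helper name `build_train_cmd` and the location of the `.trn` file inside `annot_folder` are assumptions, not the module's actual code.

```python
import os

def build_train_cmd(gpath, annot_folder):
    """Return (command, training-file path) for a prodigal training run."""
    # Assumed location: '<genome file name>.trn' inside annot_folder
    trn_file = os.path.join(annot_folder, os.path.basename(gpath) + ".trn")
    # With -t pointing to a file that does not exist yet, prodigal trains on the input
    cmd = ["prodigal", "-i", gpath, "-t", trn_file]
    return cmd, trn_file
```

The returned list can be handed to `subprocess.run(cmd)`; the same `-t` file is then given to prodigal when annotating the other genomes.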

PanACoTA.annotate_module.annotation_functions.run_annotation_all(genomes, threads, force, annot_folder, fgn, prodigal_only=False, small=False, quiet=False)

For each genome in genomes, run prokka (or only prodigal) to annotate the genome.

Parameters:

genomes (dict): {genome: [gembase_name, path_to_origfile, path_split_gembase, gsize, nbcont, L90]}
threads (int): max number of threads that can be used
force (bool): if False, do not override prokka/prodigal outdir and result dir if they exist. If True, rerun prokka and override existing results, for all genomes.
annot_folder (str): folder where prokka/prodigal results must be written: for each genome, a directory <genome_name>-prokkaRes or <genome_name>-prodigalRes will be created in this folder, and all the results of prokka/prodigal for the genome will be written inside
fgn (str): name (key in genomes dict) of the first genome, which will be used for prodigal training
prodigal_only (bool): True if only prodigal must run, False if prokka must run
small (bool): True -> use -p meta option with prodigal. Do not use training
quiet (bool): True if nothing must be written to stderr/stdout, False otherwise

Returns:

dict: {genome: boolean} -> with True if prokka/prodigal ran well, False otherwise.

PanACoTA.annotate_module.annotation_functions.run_prodigal(arguments)

Run prodigal for the given genome.

Parameters:

arguments (tuple)

(gpath, prodigal_folder, cores_annot, name, force, nbcont, q) with:

gpath: path and filename of genome to annotate
prodigal_folder: path to folder where all prodigal folders for all genomes are saved
cores_annot: how many cores prodigal can use
name: output name of annotated genome
force: True if force run (override existing files), False otherwise
nbcont: number of contigs in the input genome, to check prodigal results
small: if contigs are too small (<20000bp), use -p meta option
q: queue where logs are put

Returns:

boolean: True if everything went well (all needed output files present, corresponding numbers of proteins, genes etc.). False otherwise.

PanACoTA.annotate_module.annotation_functions.run_prokka(arguments)

Run prokka for the given genome.

Parameters:

arguments (tuple)

(gpath, prok_folder, cores_annot, name, force, nbcont, small, q) with:

gpath: path and filename of genome to annotate
prok_folder: path to folder where all prokka folders for all genomes are saved
cores_annot: how many cores prokka can use
name: output name of annotated genome
force: True if force run (override existing files), False otherwise
nbcont: number of contigs in the input genome, to check prokka results
small: used for prodigal, if sequences to annotate are small. Not used here
q: queue where logs are put

Returns:

boolean: True if everything went well (all needed output files present, corresponding numbers of proteins, genes etc.). False otherwise.

`format_functions` submodule

Functions to convert prokka or prodigal results to gembase format:

Proteins: multifasta with all CDS in aa

Replicons: multifasta of genome

Genes: multifasta of all genes in nuc

gff3: gff files without sequence

LSTINFO:

if annotated by prokka: information on annotation. Columns are: “start end strand type locus gene_name | product | EC_number | inference2” with the same types as prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>

if annotated by prodigal: same columns, with “NA | NA | NA | NA” in place of the functional fields, as prodigal provides no functional annotation

@author gem May 2019

PanACoTA.annotate_module.general_format_functions.format_genomes(genomes_ok, res_path, annot_path, prodigal_only, threads=1, quiet=False)

For all genomes which were annotated (by prokka or prodigal), reformat them in order to have, in ‘res_path’, the following folders:

LSTINFO: containing a .lst file for each genome, with all genes
Replicons: containing all multifasta files
Genes: containing 1 multi-fasta per genome, with all its genes in nuc
Proteins: containing 1 multi-fasta per genome, with all its proteins in aa
gff: containing all gff files

Parameters:

genomes_ok (dict): genomes to format (annotation was OK) -> {genome: [name, gpath, to_annot, size, nbcont, l90]}
res_path (str): path to folder where the directories listed above must be created
annot_path (str): path to folder containing “<genome_name>-[prokka, prodigal]Res” where all prokka/prodigal results are saved.
prodigal_only (bool): True if it was annotated by prodigal, False if annotated by prokka
threads (int): number of threads to use while formatting genomes
quiet (bool): True if nothing must be sent to stderr/stdout, False otherwise

Returns:

skipped_format (list): list of genomes skipped because they had a problem in format step

PanACoTA.annotate_module.general_format_functions.handle_genome(args)

For a given genome, check if it has been annotated (in results), if annotation (by prokka or prodigal) ran without problems (result = True). In that case, format the genome and get the output to see if everything went ok.

Parameters:

args (tuple)

(genome, name, gpath, prok_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir, results, q) with:

genome : original genome name

name : gembase name of the genome

gpath : path to the genome sequence which was given to prokka/prodigal for annotation

annot_path : directory where prokka/prodigal folders are saved

lst_dir : path to ‘LSTINFO’ folder

prot_dir : path to ‘Proteins’ folder

gene_dir : path to ‘Genes’ folder

rep_dir : path to ‘Replicons’ folder

gff_dir : path to ‘gff3’ folder

prodigal_only : True if annotated by prodigal, False if annotated by prokka

q : multiprocessing.managers.AutoProxy[Queue] queue to put logs during subprocess

Returns:

(bool, str)

True if genome was annotated as expected, False otherwise
genome name (used to get info from the pool.map_async)

PanACoTA.annotate_module.general_format_functions.write_gene(gtype, locus_num, gene_name, product, cont_loc, genome, cont_num, ecnum, inf2, db_xref, strand, start, end, lstopenfile)

Write given gene to output file

Parameters:

gtype (str): type of feature (CDS, tRNA, etc.)
locus_num (str): number of the locus given by prokka/prodigal
gene_name (str): gene name found by prokka/prodigal (“NA” if no gene name -> always the case with prodigal)
product (str): found by prokka/prodigal, “NA” if no product (always the case for prodigal)
cont_loc (str): ‘i’ if the gene is inside a contig, ‘b’ if it is on the border (first or last gene of the contig)
genome (str): genome name (spegenus.date.strain_num)
cont_num (int): contig number
ecnum (str): EC number, found by prokka, or “NA” if no EC number (-> always for prodigal)
inf2 (str): more information found by prokka/prodigal, or “NA” if no more information
db_xref (str): db_xref given by prokka (“NA” for prodigal)
strand (str): C (complement) or D (direct)
start (str): start of gene in the contig
end (str): end of gene in the contig
lstopenfile (_io.TextIOWrapper): open file where lstinfo must be written

Returns:

str: lstline
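
To make the layout of an lst line concrete, the sketch below assembles one from the same arguments, following the column order and locus pattern `<genome_name>.<contig_num><i or b>_<protein_num>` given on this page. The function name `lst_line_sketch`, the tab separator and the zero-padding widths (4 digits for the contig, 5 for the locus number) are assumptions; the real `write_gene` also writes the line to `lstopenfile`, which is left out here.

```python
def lst_line_sketch(gtype, locus_num, gene_name, product, cont_loc, genome,
                    cont_num, ecnum, inf2, db_xref, strand, start, end):
    """Build one lst line: start end strand type locus gene_name | product | ..."""
    # Locus: <genome_name>.<contig_num><i or b>_<protein_num> (padding widths assumed)
    locus = f"{genome}.{int(cont_num):04d}{cont_loc}_{int(locus_num):05d}"
    info = f"| {product} | {ecnum} | {inf2} | {db_xref}"
    # Tab-separated columns (assumed separator)
    return "\t".join([str(start), str(end), strand, gtype, locus, gene_name, info])
```

For a prodigal annotation, all functional fields are passed as “NA”, which gives the `NA | NA | NA | NA` tail described for prodigal.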

PanACoTA.annotate_module.general_format_functions.write_header(lstline, outfile)

Write header to output file. Header is generated from the lst line.

Parameters:

lstline (str): current line of lst file
outfile (_io.TextIOWrapper): open file where header must be written

Functions to:

convert prokka tbl file to our tab file
convert prokka ffn and faa headers to our format
Create the database, with the following folders in the given “res_path”:
- Proteins: multifasta with all CDS in aa
- Replicons: multifasta of genome
- Genes: multifasta of all genes in nuc
- gff3: gff files without sequence
- LSTINFO: information on annotation. Columns are: “start end strand type locus gene_name | product | EC_number | inference2” with the same types as prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>

@author gem April 2019

PanACoTA.annotate_module.format_prokka.create_gen(ffnseq, lstfile, genseq)

Generate .gen file, from sequences contained in .ffn, but changing the headers using the information in .lst

Parameters:

ffnseq (str): .ffn file generated by prokka
lstfile (str): lstfile converted from prokka tbl file
genseq (str): output file, to write in Genes directory
logger (logging.Logger): logger object to put information

Returns:

bool: True if conversion went well, False otherwise

PanACoTA.annotate_module.format_prokka.create_prt(faaseq, lstfile, prtseq)

Generate .prt file, from sequences in .faa, but changing the headers using information in .lst

Note: works if proteins are in increasing order (of number after “_” in their name) in faa and tbl (hence lst) files.

If a header is not in the right format, or a protein exists in prt file but not in lstfile, conversion is stopped, an error message is output, and prt file is removed.

Parameters:

faaseq (str): faa file output of prokka
lstfile (str): lstinfo converted from prokka tab file
prtseq (str): output file where converted proteins must be saved

Returns:

bool: True if conversion went well, False otherwise

PanACoTA.annotate_module.format_prokka.format_one_genome(gpath, name, prok_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir)

Format the given genome, and create its corresponding files in the following folders:

Proteins
Genes
Replicons
LSTINFO
gff

Parameters:

gpath (str): path to the genome sequence which was given to prokka for annotation
name (str): gembase name of the genome
prok_path (str): directory where prokka folders are saved
lst_dir (str): path to LSTINFO folder
prot_dir (str): path to Proteins folder
gene_dir (str): path to Genes folder
rep_dir (str): path to Replicons folder
gff_dir (str): path to gff3 folder

Returns:

bool: True if genome was correctly formatted, False otherwise

PanACoTA.annotate_module.format_prokka.generate_gff(gpath, prokka_gff_file, res_gff_file, res_lst_file, sizes, contigs)

From the lstinfo file and contig names (retrieved from generation of Replicons files), generate a gff file.

Format:

##gff-version 3
##sequence-region contig1 start end
##sequence-region contig2 start end
…
seqid(=contig) source type start end score strand phase attributes

All fields tab separated. Empty fields contain ‘.’

For example: ESCO.1017.00200.00001 Prodigal:2.6 CDS start end . + . ID=ESCO.1017.00200.b0001_00001;locus_tag=ESCO.1017.00200.b0001_00001;product=hypothetical protein

genome1_1 Prodigal_v2.6.3 CDS 213 1880 260.0 + 0 ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.534;conf=99.99;score=259.99;cscore=268.89;sscore=-8.89;rscore=-8.50;uscore=-4.34;tscore=3.95;

Parameters:

gpath (str): path to the genome sequence given to prokka. Only used for error message
res_gff_file (str): path to the gff file that must be created in result database
res_lst_file (str): path to the lst file that was created in result database in the previous step
sizes (dict): dict of contig names with their size. {“gembase1”: “size1”, “gembase2”: “size2”, …}
contigs (dict): dict of contig original and gembase names. {“contig1”: “gembase1”, …}

Returns:

bool: True if conversion worked well, False otherwise
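
The file layout described above (a version pragma, one `##sequence-region` line per contig, then nine tab-separated columns per feature with ‘.’ for empty fields) can be sketched as follows. The helper names `gff_header_lines` and `gff_feature_line` are hypothetical, and starting every sequence region at position 1 is an assumption.

```python
def gff_header_lines(sizes):
    """Return the gff3 pragma lines for {contig_name: size}."""
    lines = ["##gff-version 3"]
    for contig, size in sizes.items():
        # Assumed: each contig region spans from 1 to its size
        lines.append(f"##sequence-region {contig} 1 {size}")
    return lines

def gff_feature_line(seqid, source, ftype, start, end, strand, attributes,
                     score=".", phase="."):
    """Return one tab-separated gff3 feature line; empty fields hold '.'."""
    attr = ";".join(f"{key}={value}" for key, value in attributes.items())
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      score, strand, phase, attr])
```

Joining the header lines and the feature lines with newlines gives the content of the result gff file.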

PanACoTA.annotate_module.format_prokka.tbl2lst(tblfile, lstfile, contigs, genome, gpath)

Read prokka tbl file, and convert it to the lst file.

prokka tbl file format:

>Feature contig_name
start end type
        [EC_number text]
        [gene text]
        inference ab initio prediction:Prodigal:2.6
        [inference text]
        locus_tag test
        product text

where type can be CDS, tRNA, rRNA, etc. Lines between [] are not always present.

lst file format:

start end strand type locus gene_name | product | EC_number | inference2 | db_xref

with the same types as prokka file, and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>

Parameters:

tblfile (str): name of prokka output tbl file to read
lstfile (str): name of lst file to generate
contigs (dict): {original_contig_name: gembase_contig_name}
genome (str): genome name (gembase format)
gpath (str): path to the genome given to prodigal. Only used for error message
changed_name (bool): True if contig names have been changed (cutn != 0) -> contig names end by ‘_num’, False otherwise.

Returns:

bool: True if genome name used in lstfile and prokka tblfile are the same, False otherwise

Functions to convert prodigal result files to gembase format.

Proteins: multifasta with all CDS in aa

Replicons: (multi)fasta of genome sequences

Genes: multifasta of all genes in nuc

gff3: gff files without sequence

LSTINFO: information on annotation. Columns are: “start end strand type locus gene_name | product | EC_number | inference2” and strand is C (complement) or D (direct). Locus is: <genome_name>.<contig_num><i or b>_<protein_num>. For prodigal: “start end strand type locus NA | NA | NA | NA”, as there is no functional annotation.

@author gem July 2019

PanACoTA.annotate_module.format_prodigal.create_gene_lst(contigs, gen_file, res_gen_file, res_lst_file, gpath, name)

Generate .gen file, from sequences contained in .ffn, but changing the headers to match with gembase format. At the same time, generate .lst file, from the information given in prodigal ffn headers

Parameters:

contigs (dict): {original_contig_name: gembase_contig_name}
gen_file (str): .ffn file generated by prodigal
res_gen_file (str): generated .gen file, to write in Genes directory
res_lst_file (str): generated .lst file to write in LSTINFO directory
gpath (str): path to the genome given to prodigal. Only used for error message
name (str): gembase name of the genome to format
logger (logging.Logger): logger object to put information

Returns:

bool: True if conversion went well, False otherwise

PanACoTA.annotate_module.format_prodigal.create_gff(gpath, gff_file, res_gff_file, res_lst_file, contigs, sizes)

Create .gff3 file.

Format:

##gff-version 3
##sequence-region contig1 start end
##sequence-region contig2 start end
…
seqid(=contig) source type start end score strand phase attributes

All fields tab separated. Empty fields contain ‘.’

For example: ESCO.1017.00200.00001 Prodigal:2.6 CDS start end . + . ID=ESCO.1017.00200.b0001_00001;locus_tag=ESCO.1017.00200.b0001_00001;product=hypothetical protein

genome1_1 Prodigal_v2.6.3 CDS 213 1880 260.0 + 0 ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.534;conf=99.99;score=259.99;cscore=268.89;sscore=-8.89;rscore=-8.50;uscore=-4.34;tscore=3.95;

Parameters:

gpath (str): path to the genome sequence given to prodigal. Only used for the error message
gff_file (str): path to gff file generated by prodigal
res_gff_file (str): path to the gff file that must be created in result database
res_lst_file (str): path to the lst file that was created in result database in the previous step
contigs (dict): dict of original contig names with their gembase names {“original_name”: “gembase_name”}
sizes (dict): dict of contig gembase names with their sizes {“gembase_name”: size}

Returns:

bool: True if everything went well, False if any problem

PanACoTA.annotate_module.format_prodigal.create_prt(prot_file, res_prot_file, res_lst_file)

Generate .prt file (gembase formatted gene names), from features contained in .lst file generated just before.

Parameters:

prot_file (str): .faa file generated by prodigal
res_prot_file (str): output file, to write in Proteins directory
res_lst_file (str): .lst file to get all gene names in gembase format instead of re-generating them

Returns:

bool: True if conversion went well, False otherwise

PanACoTA.annotate_module.format_prodigal.format_one_genome(gpath, name, prod_path, lst_dir, prot_dir, gene_dir, rep_dir, gff_dir)

Format the given genome, and create its corresponding files in the following folders:

Proteins
Genes
Replicons
LSTINFO
gff

Parameters:

gpath (str): path to the genome sequence which was given to prodigal for annotation
name (str): gembase name of the genome
prod_path (str): directory where all tmp_files for all sequences are saved (sequences cut at each set of 5N, prodigal results and logs)
lst_dir (str): path to LSTINFO folder
prot_dir (str): path to Proteins folder
gene_dir (str): path to Genes folder
rep_dir (str): path to Replicons folder
gff_dir (str): path to gff3 folder

Returns:

bool: True if genome was correctly formatted, False otherwise
