Coverage for PanACoTA/subcommands/annotate.py: 100%
155 statements
« prev ^ index » next coverage.py v7.3.1, created at 2023-09-20 14:37 +0000
« prev ^ index » next coverage.py v7.3.1, created at 2023-09-20 14:37 +0000
1#!/usr/bin/env python3
2# coding: utf-8
4# ###############################################################################
5# This file is part of PanACOTA. #
6# #
7# Authors: Amandine Perrin #
8# Copyright © 2018-2020 Institut Pasteur (Paris). #
9# See the COPYRIGHT file for details. #
10# #
11# PanACOTA is a software providing tools for large scale bacterial comparative #
12# genomics. From a set of complete and/or draft genomes, you can: #
13# - Do a quality control of your strains, to eliminate poor quality #
14# genomes, which would not give any information for the comparative study #
15# - Uniformly annotate all genomes #
16# - Do a Pan-genome #
17# - Do a Core or Persistent genome #
18# - Align all Core/Persistent families #
19# - Infer a phylogenetic tree from the Core/Persistent families #
20# #
21# PanACOTA is free software: you can redistribute it and/or modify it under the #
22# terms of the Affero GNU General Public License as published by the Free #
23# Software Foundation, either version 3 of the License, or (at your option) #
24# any later version. #
25# #
26# PanACOTA is distributed in the hope that it will be useful, but WITHOUT ANY #
27# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS #
28# FOR A PARTICULAR PURPOSE. See the Affero GNU General Public License #
29# for more details. #
30# #
31# You should have received a copy of the Affero GNU General Public License #
32# along with PanACOTA (COPYING file). #
33# If not, see <https://www.gnu.org/licenses/>. #
34# ###############################################################################
36"""
37annotate is a subcommand of PanACoTA
39It is a pipeline to do quality control and annotate genomes. Steps are:
41- optional: find rows of at least 'n' N (default n=5), and cut into a new contig at this point
42- for each genome, calc L90 and number of contigs (after cut at n 'N' occurrences if used)
43- keep only genomes with:
45 - L90 <= x (default x = 100)
46 - #contig <= y (default y = 999)
47- rename those genomes and their contigs, with strain name increasing with quality (L90 and
48 #contig)
49- annotate kept genomes with prokka (default) or only prodigal
50- gembase format
52Input:
54- **list_file:** list of genome filenames to annotate. This file contains 1 line per genome. It
55 contains the name(s) of the (multi-)fasta file(s) corresponding to the genome (separated
56 by space if several fasta files for the genome). After quality control, selected genomes
57 will be named as following: ``<gen-spe>.<date>.<strain>``, with:
59 * ``<gen_spe>`` 4 alphanumeric characters. Usually it corresponds to the 2 first letters
60 of genus, and 2 first letters of species, like ESCO for Escherichia coli.
61 * ``<date>`` date at which the genome was downloaded, formatted as MMYY (M=Month, Y=Year)
62 * ``<strain>`` is the strain number of the genome in the species, ordered by quality.
64Default values for ``<gen_spe>`` and ``<date>`` are given as input (see after). However, if some
65genomes do not have the same date and/or genus/species as the others, you can add
66this information for those genomes in the list file. fasta filenames and information are
67separated by ``::``. ``<gen_spe>`` is given after the ``::``, and ``<date>`` is preceded by a
68``.``. Here is an example::
70 genome1.fasta
71 genome2_ch1.fna genome2_pl.fst
72 genome3.fst genome3_plasmid.fst :: name
73 genome4.fna genome4.p1.fna genome4.p2.fna :: name.
74 genome5.fasta :: name.date
75 genome6.chromo.fst genome6.pl.fst :: .date
77- **species:** with 4 alphanumeric characters, used to rename genomes (except those whose
78 species name is specified in the list file)
79- **date:** optional. By default, takes the current date. Used to rename genomes (except those
80 whose date is specified in the list file)
81- **dbpath:** path to folder containing all multi-fasta sequences of genomes
82- **respath:** path to folder where outputs must be saved (folders Genes, Replicons, Proteins,
83 LSTINFO, gff3 and LSTINFO_dataset.lst file)
84- **tmppath** optional. Path where tmp files must be saved. Default is respath/tmp_files
85- **annotepath** optional. Path where prokka/prodigal output folders for all genomes must be saved.
86 Default is respath/tmp_files
87- **threads:** number of threads that can be used (default 1)
89Output:
91- In your given ``respath``, you will find 5 folders:
93 * LSTINFO (information on each genome, with gene annotations),
94 * Genes (nuc. gene sequences),
95 * Proteins (aa proteins sequences),
96 * Replicons (input sequences but with formatted headers).
97 * gff3 (information on genes as gff3 format)
99- In your given ``tmppath`` folder, folders with prokka/prodigal results will
100 be created for each input genome (1 folder per genome, called
101 ``<genome_name>-[prokka, prodigal]Res``). If errors are generated during prokka/prodigal
102 step, you can look at the log file to see what was wrong
103 (``<genome_name>-[prokka, prodigal].log``).
104- In your given ``respath``, a file called ``annotate-genomes-<list_file>.log`` will be generated.
105 You can find there all logs.
106- In your given ``respath``, a file called ``annotate-genomes-<list_file>.log.err`` will be
107 generated, containing information on errors and warnings that occurred: problems during
108 annotation (hence no formatting step ran), and problems during formatting step. If this file is
109 empty, then annotation and formatting steps finished without any problem for all genomes.
110- In your given ``respath``, you will find a file called ``LSTINFO-<list_file>.lst`` with
111 information on all genomes: gembase_name, original_name, genome_size, L90, nb_contigs
112- In your given ``respath``, you will find a file called ``discarded-<list_file>.lst`` with
113 information on genomes that were discarded (and hence not annotated) because of the
114 L90 and/or nb_contig threshold: original_name, genome_size, L90, nb_contigs
115- In your given ``respath``, you will find 2 png files: ``QC_L90-<list_file>.png`` and
116 ``QC_nb-contigs-<list_file>.png``, containing the histograms of L90 and nb_contigs values for
117 all genomes, with a vertical red line representing the limit applied here.
119Requested:
121- in prokka/prodigal results, all genes are called ``<whatever>_<number>``
122 -> the number will be kept.
123- The number of the genes annotated by prokka/prodigal are in increasing order
124 in tbl, faa and ffn files
125- genome names given to prokka/prodigal should not end with '_<number>'. Ideally, they should
126 always have the same format: ``<spegenus>.<date>.<strain_number>`` but they can have
127 another format, as long as they don't end by '_<number>', which is the format of a gene name.
129@author gem
130April 2017
131"""
133import os
134import sys
135from termcolor import colored
136import sys
139def main_from_parse(arguments):
140 """
141 Call main function from the arguments given by parser
143 Parameters
144 ----------
145 arguments : argparse.Namespace
146 result of argparse parsing of all arguments in command line
148 """
149 cmd = "PanACoTA " + ' '.join(arguments.argv)
150 main(cmd, arguments.list_file, arguments.db_path, arguments.res_path,
151 arguments.name,
152 arguments.date, arguments.l90, arguments.nbcont, arguments.cutn, arguments.threads,
153 arguments.force, arguments.qc_only, arguments.from_info, arguments.tmpdir,
154 arguments.annotdir, arguments.verbose, arguments.quiet, arguments.prodigal_only,
155 arguments.small)
158def main(cmd, list_file, db_path, res_dir, name, date, l90=100, nbcont=999, cutn=5,
159 threads=1, force=False, qc_only=False, from_info=None, tmp_dir=None, res_annot_dir=None,
160 verbose=0, quiet=False, prodigal_only=False, small=False):
161 """
162 Main method, doing all steps:
164 1. analyze genomes (nb contigs, L90, rows of N...)
165 2. keep only genomes with 'good' (according to user thresholds) L90 and nb_contigs
166 3. rename genomes with strain number in decreasing quality
167 4. annotate genome with prokka or only prodigal
168 5. format annotated genomes
170 If option '-Q': ends at step 2.
171 If option '--info <genome_info file name>' option: starts at step 2
173 verbosity:
175 - defaut 0 : stdout contains INFO, stderr contains ERROR.
176 - 1: stdout contains INFO, stderr contains WARNING and ERROR
177 - 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR
178 - >=15: Add DEBUG in stdout
180 Parameters
181 ----------
182 cmd : str
183 command line used to launch this program
184 list_file : str
185 file containing the list of genome files, 1 genome per line, separated by a
186 space if a genome is split in several fasta files. This file can also
187 specify date and/or species information, according to the format described
188 in documentation.
189 db_path : str
190 Path to the folder containing all the fasta files which will be annotated
191 res_dir : str
192 Path to the folder which will contain result folders and files
193 name : str
194 4 alpha numeric characters, describing the species (for example ESCO). Used by default
195 if no species name is given in list_file line.
196 date : str
197 4 alpha numeric characters, defining the default date, for strains where it is not specified
198 in the list_file
199 l90 : int
200 Max L90 allowed to keep a genome
201 nbcont : int
202 Max number of contigs allowed to keep a genome
203 cutn : int
204 cut each time there are at least cutn 'N' in a row. Don't cut if equal to 0
205 threads : int
206 max number of threads to use
207 force : bool
208 If True, overwrite previous results, if False keep what is already calculated
209 qc_only : bool
210 If True, do only quality control, if False, also do annotation
211 from_info : str
212 File containing information on genomes and their quality information (from prepare step)
213 tmp_dir : str or None
214 Path to folder where tmp files must be saved. None to use the default tmp folder
215 res_annot_dir : str or None
216 Path to folder where are the prokka/prodigal result folders for the genomes. None
217 to use the default prokka/prodigal folder
218 verbose : int
219 verbosity:
220 default (0): info in stdout, error and more in stderr
221 1 = add warnings in stderr
222 2 = like 1 + add DETAIL to stdout (by default only INFO)
223 >15: add debug to stdout
224 quiet : bool
225 True if nothing must be sent to stdout/stderr, False otherwise
226 prodigal_only : bool
227 True -> run only prodigal. False -> run prokka
228 small : bool
229 True -> use -p meta option with prodigal
231 Returns
232 -------
233 (genomes, kept_genomes, skipped, skipped_format) : tuple
234 with:
236 - genomes: dict with all genomes in list_file:
237 {genome: [gembase_name, path_split_gembase, gsize, nbcont, L90]}
238 - kept_genomes: dict with all genomes kept for annotation (same format as genomes)
239 - skipped: list of genomes skipped because they had a problem in annotation step
240 - skipped_format : list of genomes skipped because they had a problem in format step
241 """
242 # import needed packages
243 import shutil
244 import logging
245 from PanACoTA.annotate_module import genome_seq_functions as gfunc
246 from PanACoTA.annotate_module import annotation_functions as pfunc
247 from PanACoTA.annotate_module import general_format_functions as ffunc
248 from PanACoTA import utils
249 from PanACoTA import __version__ as version
250 # Check that needed softs are installed
251 prokka = utils.check_installed("prokka")
252 prodigal = utils.check_installed("prodigal")
253 if prodigal_only:
254 soft = "prodigal"
255 else:
256 soft = "prokka"
258 changed = cutn != 0
259 if not qc_only: # pragma: no cover
260 # If user using prokka: check prokka is installed and in the path
261 if not prodigal_only and not prokka:
262 print("Prokka is not installed. 'PanACoTA annotate' cannot run. Install prokka "
263 "to be able to annotate genomes. If you only need syntactical annotation, "
264 "check that prodigal is installed, and add '--prodigal' option.")
265 sys.exit(1)
266 if prodigal_only and not prodigal:
267 print("Prodigal is not installed. 'PanACoTA annotate' cannot run. Install "
268 "prodigal to be able to annotate genomes. If you also need functional "
269 "annotation, check that prokka is installed, and remove '--prodigal' "
270 "option.")
271 sys.exit(1)
273 # By default, all tmp files (split sequences, renamed sequences, prokka/prodigal results) will
274 # be saved in the given <res_dir>/tmp_files.
275 # Create output (results, tmp...) directories if not already existing
276 if not tmp_dir:
277 tmp_dir = os.path.join(res_dir, "tmp_files")
278 if not res_annot_dir:
279 res_annot_dir = tmp_dir
280 os.makedirs(res_dir, exist_ok=True)
281 os.makedirs(tmp_dir, exist_ok=True)
282 os.makedirs(res_annot_dir, exist_ok=True)
284 # If force was set, remove result folders (Proteins, Replicons, Genes, LSTINFO, gff)
285 if force:
286 shutil.rmtree(os.path.join(res_dir, "LSTINFO"), ignore_errors=True)
287 shutil.rmtree(os.path.join(res_dir, "Proteins"), ignore_errors=True)
288 shutil.rmtree(os.path.join(res_dir, "Genes"), ignore_errors=True)
289 shutil.rmtree(os.path.join(res_dir, "Replicons"), ignore_errors=True)
290 shutil.rmtree(os.path.join(res_dir, "gff3"), ignore_errors=True)
291 # If not --force, check that result folders do not already contain results
292 else:
293 utils.check_out_dirs(res_dir)
295 # get only filename of list_file, without extension
296 if list_file:
297 listfile_base = os.path.basename(os.path.splitext(list_file)[0])
298 else:
299 list_file = from_info
300 listfile_base = os.path.basename(os.path.splitext(list_file)[0])
302 # Initialize logger
303 # set level of logger: level is the minimum level that will be considered.
304 if verbose <= 1:
305 level = logging.INFO
306 # for verbose = 2, ignore only debug
307 if verbose >= 2 and verbose < 15:
308 level = utils.detail_lvl() # int corresponding to detail level
309 # for verbose >= 15, write everything
310 if verbose >= 15:
311 level = logging.DEBUG
312 logfile_base = os.path.join(res_dir, "PanACoTA-annotate_" + listfile_base)
313 logfile_base = utils.init_logger(logfile_base, level, name='annotate', log_details=True,
314 verbose=verbose, quiet=quiet)
315 logger = logging.getLogger('annotate')
316 logger.info(f'PanACoTA version {version}')
317 logger.info("Command used\n \t > " + cmd)
319 # STEP 1. analyze genomes (nb contigs, L90, rows of N...)
320 # If already info on genome ('--info <file>' option), skip this step
321 # If no info on genomes, read them and get needed information
322 if not from_info:
323 # Read genome names.
324 # genomes = {genome: [spegenus.date]}
325 genomes = utils.read_genomes(list_file, name, date, db_path, tmp_dir, logger)
326 if not genomes:
327 logger.error(("We did not find any genome listed in {} in the folder {}. "
328 "Please check your list to give valid genome "
329 "names.").format(list_file, db_path))
330 sys.exit(1)
331 # Get L90, nbcontig, size for all genomes, and cut at row of cutn 'N' if asked
332 # -> genome: [spegenus.date, orig_path, to_annotate_path, size, nbcont, l90]
333 gfunc.analyse_all_genomes(genomes, db_path, tmp_dir, cutn, soft,
334 logger, quiet=quiet)
335 # --info <filename> option given: read information (L90, nb contigs...) from this file.
336 else:
337 # genomes = {genome: [spegenus.date, orig_path, to_annotate_path, size, nbcont, l90]}
338 # orig_path is the path to the original sequence
339 # and to_annotate_path the path to the sequence to annotate (once split etc.)
340 # Here, both are the same, as we take given sequences as is.
341 genomes = utils.read_genomes_info(from_info, name, date, logger)
343 # STEP 2. keep only genomes with 'good' (according to user thresholds) L90 and nb_contigs
344 # genomes = {genome: [spegenus.date, orig_seq, path_to_splitSequence, size, nbcont, l90]}
345 # Plot L90 and nb_contigs distributions
346 gfunc.plot_distributions(genomes, res_dir, listfile_base, l90, nbcont)
347 # Get list of genomes kept (according to L90 and nbcont thresholds)
348 kept_genomes = {genome: info for genome, info in genomes.items()
349 if info[-2] <= nbcont and info[-1] <= l90}
350 # Write discarded genomes to a file -> orig_name, to_annotate, gsize, nb_conts, L90
351 utils.write_genomes_info(genomes, list(kept_genomes.keys()), list_file, res_dir)
353 if not kept_genomes:
354 logger.info("No genome kept for annotation.")
355 return "", 0
356 # Info on folder containing original sequences
357 if not from_info:
358 logger.info(f"-> Original sequences folder ('orig_name' column): {db_path} ")
359 logger.info(f"\t-> If original sequence not found in {db_path}, "
360 f"look for it in {tmp_dir}, as it must be a concatenation of several "
361 "input sequence files.")
362 if cutn == 0:
363 logger.info("-> Sequences used for annotation ('to_annotate' column) are the "
364 "same as the previous ones (original sequences).")
365 else:
366 logger.info(f"-> Folder with sequence files that will be used for annotation "
367 f"('to_annotate' column): {tmp_dir}")
368 # If only QC, stop here.
369 if qc_only:
370 # Write information on genomes that would be annotated with the current
371 # parameters if not QC_only:
372 # orig_name, to_annnote, gsize, nb_conts, L90
373 utils.write_genomes_info(genomes, [], list_file, res_dir, qc=True)
374 logger.info("QC only done.")
375 return "", 0
377 # STEP 3. Rename genomes kept, ordered by decreasing quality
378 first_gname = gfunc.rename_all_genomes(kept_genomes)
379 # kept_genomes = {genome: [gembase_name, path_to_origfile, path_split_gembase,
380 # gsize, nbcont, L90]}
381 # first_gname = name of the first genome
382 # Write lstinfo file (list of genomes kept with info on L90 etc.)
383 outlst = utils.write_lstinfo(list_file, kept_genomes, res_dir)
385 # STEP 4. Annotate all kept genomes
386 results = pfunc.run_annotation_all(kept_genomes, threads, force, res_annot_dir, first_gname,
387 prodigal_only, small=small, quiet=quiet)
388 # Information on genomes to format
389 # results_ok = {genome: [gembase_name, path_to_origfile, path_split_gembase,
390 # gsize, nbcont, L90]}
391 results_ok = {genome:info for genome, info in kept_genomes.items() if results[genome]}
392 # If no genome was ok, no need to format them. Just print that no genome was annotated,
393 # end program.
394 if not results_ok:
395 logger.error("Error: No genome was correctly annotated, no need to format them.")
396 sys.exit(1)
397 # list of genomes skipped because annotation had problems: no format step run
398 skipped = [genome for (genome, ok) in results.items() if not ok]
399 # At least 1 genome was not annotated: write a message to warn on it
400 if skipped:
401 utils.write_warning_skipped(skipped, prodigal_only=prodigal_only,
402 logfile=logfile_base)
404 # STEP 5. Format genomes annotated
405 # Here, we have at least 1 genome annotated (otherwise,
406 # it would already have stopped because results_ok is empty)
407 # Initialize list of genomes skipped because something went wrong while formatting.
408 skipped_format = []
409 # Generate database (folders Proteins, Genes, Replicons, LSTINFO)
410 skipped_format = ffunc.format_genomes(results_ok, res_dir, res_annot_dir,
411 prodigal_only, threads, quiet=quiet)
412 # At least one genome could not be formatted -> warn user
413 if skipped_format:
414 utils.write_warning_skipped(skipped_format, do_format=True, prodigal_only=prodigal_only,
415 logfile = logfile_base)
416 logger.info("Annotation step done.")
417 return outlst, len(kept_genomes) - len(skipped) - len(skipped_format)
420def build_parser(parser):
421 """
422 Method to create a parser for command-line options
424 Parameters
425 ----------
426 parser : argparse.ArgumentParser
427 The parser to configure
429 """
430 from PanACoTA import utils
431 from PanACoTA import utils_argparse
432 import multiprocessing
433 import argparse
435 # Create command-line parser for all options and arguments to give
436 required = parser.add_argument_group('Required arguments')
437 required.add_argument("-d", dest="db_path",
438 help="Path to folder containing all multifasta genome files")
439 required.add_argument("-r", dest="res_path", required=True,
440 help="Path to folder where output annotated genomes must be saved")
441 optional = parser.add_argument_group('Optional arguments')
442 optional.add_argument('-l', dest="list_file",
443 help=("File containing the list of genome filenames to annotate "
444 "(1 genome per line). Each genome must be in multi-fasta format. "
445 "You can specify the species name (4 characters) you want to "
446 "give to some genome, as well as the download (or any other "
447 "reason) date of your choice. Format 'genome_name :: name.date'. "
448 "name and date are optional. See doc for more information on "
449 "this file format. "
450 "If you want to run this module from 'prepare_module' results, "
451 "use '--info' instead."))
452 optional.add_argument("-n", dest="name", type=utils_argparse.gen_name,
453 help=("Choose a name for your annotated genomes. This name should "
454 "contain 4 alphanumeric characters. Generally, they correspond "
455 "to the 2 first letters of genus, and 2 first letters of "
456 "species, e.g. ESCO for Escherichia Coli."))
457 optional.add_argument("-Q", dest="qc_only", action="store_true", default=False,
458 help="Add this option if you want only to do quality control on your "
459 "genomes (cut at 5N if asked, calculate L90 and number of contigs "
460 "and plot their distributions). This allows you to check which "
461 "genomes would be annotated with the given parameters, and to "
462 "modify those parameters if you want, before you launch the "
463 "annotation and formatting steps.")
464 optional.add_argument("--info", dest="from_info",
465 help="If you already ran the 'prepare' data module, or already "
466 "calculated yourself the L90 and number of contigs for each "
467 "genome, you can give this information, to go directly to "
468 "annotation and formatting steps. This file contains at "
469 "least 4 columns, tab separated, with the following headers: "
470 "'to_annotate', 'gsize', 'nb_conts', 'L90'. Any other column "
471 "will be ignored.")
472 optional.add_argument("--prodigal", dest="prodigal_only", action="store_true", default=False,
473 help="Add this option if you only want syntactical annotation, given "
474 "by prodigal, and not functional annotation requiring prokka and "
475 "is slower.")
476 optional.add_argument("--small", dest="small", action="store_true", default=False,
477 help="If you use Prodigal to annotate genomes, if you sequences are "
478 "too small (less than 20000 characters), it cannot annotate them "
479 "with the default options. Add this option to use 'meta' procedure.")
480 optional.add_argument("--l90", dest="l90", type=int, default=100,
481 help="Maximum value of L90 allowed to keep a genome. Default is 100.")
482 optional.add_argument("--nbcont", dest="nbcont", type=utils_argparse.cont_num, default=999,
483 help=("Maximum number of contigs allowed to keep a genome. "
484 "Default is 999."))
485 optional.add_argument("--cutn", dest="cutn", type=utils_argparse.positive_int, default=5,
486 help=("By default, each genome will be cut into new contigs when "
487 "at least 5 'N' in a row are found in its sequence. "
488 "If you don't want to "
489 "cut genomes into new contigs when there are rows of 'N', "
490 "put 0 to this option. If you want to cut from a different number "
491 "of 'N' occurrences, put this value to this option."))
492 optional.add_argument("--date", dest="date", default=utils_argparse.get_date(),
493 type=utils_argparse.date_name,
494 help=("Specify the date (MMYY) to give to your annotated genomes. "
495 "By default, will give today's date. The only requirement on the"
496 " given date is that it is 4 characters long. You can use letters"
497 " if you want. But the common way is to use 4 digits, "
498 "corresponding to MMYY."))
499 optional.add_argument("--tmp", dest="tmpdir",
500 help=("Specify where the temporary files (sequence split by rows "
501 "of 'N', sequence with new contig names etc.) must be saved. "
502 "By default, it will be saved in your "
503 "result_directory/tmp_files."))
504 optional.add_argument("--annot_dir", dest="annotdir",
505 help=("Specify in which directory the prokka/prodigal output files "
506 "(1 folder per genome, called "
507 "<genome_name>-[prokka, Prodigal]Res) must be "
508 "saved. By default, they are saved in the same directory as "
509 "your temporary files (see --tmp option to change it)."))
510 optional.add_argument("-F", "--force", dest="force", action="store_true",
511 help=("Force run: Add this option if you want to (re)run annotation and "
512 "formatting steps for all genomes "
513 "even if their result folder (for annotation step) or files (for "
514 "format step) already exist: override existing results.\n"
515 "Without this option, if there already are results in the given "
516 "result folder, the program stops. If there are no results, but "
517 "prokka/prodigal folder already exists, prokka/prodigal won't run "
518 "again, and the formating step will use the already existing "
519 "folder if correct, or skip the genome if there are problems in "
520 "prokka/prodigal folder."))
521 optional.add_argument("--threads", dest="threads", type=utils_argparse.thread_num, default=1,
522 help="Specify how many threads can be used (default=1)")
523 helper = parser.add_argument_group('Others')
524 helper.add_argument("-v", "--verbose", dest="verbose", action="count", default=0,
525 help="Increase verbosity in stdout/stderr.")
526 helper.add_argument("-q", "--quiet", dest="quiet", action="store_true", default=False,
527 help=("Do not display anything to stdout/stderr. log files will "
528 "still be created."))
529 helper.add_argument("-h", "--help", dest="help", action="help",
530 help="show this help message and exit")
533def parse(parser, argu):
534 """
535 arse arguments given to parser
537 Parameters
538 ----------
539 parser : argparse.ArgumentParser
540 the parser used
541 argu : [str]
542 command-line given by user, to parse using parser
544 Returns
545 -------
546 argparse.Namespace
547 Parsed arguments
548 """
549 import argparse
551 args = parser.parse_args(argu)
552 return check_args(parser, args)
555def check_args(parser, args):
556 """
557 Check that arguments given to parser are as expected.
559 Parameters
560 ----------
561 parser : argparse.ArgumentParser
562 The parser used to parse command-line
563 args : argparse.Namespace
564 Parsed arguments
566 Returns
567 -------
568 argparse.Namespace or None
569 The arguments parsed, updated according to some rules. Exit program
570 with error message if error occurs with arguments given.
572 """
573 # Message if user kept default thresholds for L90 and nbcont. Just to warn him, to be sure
574 # it was on purpose
575 def thresholds_message(l90, nbcont):
576 return (" !! Your genomes will be filtered, and only the ones with 'L90' <= {} "
577 "and 'number of contigs' < {} will be kept. If you want to change those "
578 "thresholds, use '--l90' and '--nbcont' "
579 "options.".format(args.l90, args.nbcont))
581 # Message if user is giving a file with already calculated information
582 def nosplit_message():
583 split = (" !! Your sequences will be used as is by PanACoTA. Be sure you "
584 "already split your sequences at each row "
585 "of X 'N' if needed.\n")
586 trust = ("\t-> PanACoTA will use the values (L90, nbcont) given in your info file. "
587 "It will ignore the genomes for which those values are incorrect. "
588 "It will also ignore genomes with more than 999 contigs.")
589 return split + trust
591 # ERRORS
592 # Cannot be verbose and quiet at the same time
593 if args.verbose > 0 and args.quiet:
594 parser.error("Choose between a verbose output (-v) or a quiet output (-q)."
595 " You cannot have both.")
596 # User wants to run all annotation step: needs a genome dataset name
597 if not args.qc_only and not args.name:
598 parser.error("You must specify your genomes dataset name in 4 characters with "
599 "'-n name' option (type -h for more information). Or, if you do not want "
600 "to annotate and format your genomes but just to run quality control, use "
601 "option '-Q")
602 # If QC only, we do not need name -> name is NONE
603 if args.qc_only and not args.name:
604 args.name = "NONE"
605 # option --small used only with prodigal
606 if not args.prodigal_only and args.small:
607 parser.error("You cannot use --small option with prokka. Either use prodigal, "
608 "or remove this option.")
609 # If user specifies a cutN value (different than default one which is 5), and give
610 # an info file, it is not compatible: info file will use sequences as is, and won't cut them
611 if args.cutn != 5 and args.from_info:
612 parser.error("If you provide a list of genomes with their calculated L90 and number of "
613 "contigs, PanACoTA will use the given sequences as is. It will not cut "
614 "them. So, you cannot use both --cutn and --info.")
616 # Give a lst_file or an info file, not nothing
617 if not args.from_info and not args.list_file:
618 parser.error("You must provide a list of genomes to annotate. Either raw genomes "
619 "(see -l option), or genomes with quality information (see --info option).")
621 # Choose between infofile or LSTINFO
622 if args.from_info and args.list_file:
623 parser.error("Either you want to annotate raw sequences (name of files in '-l infofile') "
624 "which will first go through the QC process, "
625 "OR you already did QC on your sequences and just want to annotate them "
626 "(information on those sequences in '--info LSTINFO-file'). "
627 "Please choose one of these 2 possibilities.")
629 # If no info file nor db_path, ask for 1 of them
630 if not args.db_path and not args.from_info:
631 parser.error("You must provide a path to your database genome sequences (-d <db_path>). "
632 "If you already have a LSTINFO file, it contains this db_path. Use it with "
633 "--info <lstinfo file> option.")
635 # If given LSTINFO, already contains paths to genome to annotate. db_path must not be provided
636 if args.from_info and args.db_path:
637 parser.error("If you run from your LSTINFO file, this one already contains the "
638 "path of genomes to annotate. Remove -d <db_path> option.")
640 # WARNINGS
641 # If user wants to cut genomes, warn him to check that it is on purpose (because default is cut at each 5'N')
642 if args.cutn != 0 and not args.from_info:
643 message = (" !! Your genomes will be split when sequence contains at "
644 "least {}'N' in a row. If you want to change this threshold, "
645 "see --cutn option.").format(args.cutn)
646 print(colored(message, "yellow"))
647 # Warn user about selection of genomes thresholds
648 if args.l90 == 100 or args.nbcont == 999:
649 print(colored(thresholds_message(args.l90, args.nbcont), "yellow"))
650 if args.from_info:
651 print(colored(nosplit_message(), "yellow"))
652 print()
653 return args
655if __name__ == '__main__':
656 import argparse
657 from textwrap import dedent
658 header = '''
659 ___ _____ ___ _____ _____
660 ( _`\ ( _ )( _`\ (_ _)( _ )
661 | |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |
662 | ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |
663 | | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |
664 (_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)
667 Large scale comparative genomics tools
669 -------------------------------------------
670 '''
671 my_parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter,
672 description=dedent(header), add_help=False)
673 build_parser(my_parser)
674 OPTIONS = parse(my_parser, sys.argv[1:])
675 main_from_parse(OPTIONS)