Coverage for PanACoTA/subcommands/annotate.py: 100%

1#!/usr/bin/env python3

2# coding: utf-8

4# ###############################################################################

5# This file is part of PanACOTA. #

6# #

7# Authors: Amandine Perrin #

9# See the COPYRIGHT file for details. #

10# #

11# PanACOTA is a software providing tools for large scale bacterial comparative #

12# genomics. From a set of complete and/or draft genomes, you can: #

13# - Do a quality control of your strains, to eliminate poor quality #

14# genomes, which would not give any information for the comparative study #

15# - Uniformly annotate all genomes #

16# - Do a Pan-genome #

17# - Do a Core or Persistent genome #

18# - Align all Core/Persistent families #

19# - Infer a phylogenetic tree from the Core/Persistent families #

20# #

21# PanACOTA is free software: you can redistribute it and/or modify it under the #

22# terms of the Affero GNU General Public License as published by the Free #

23# Software Foundation, either version 3 of the License, or (at your option) #

24# any later version. #

25# #

26# PanACOTA is distributed in the hope that it will be useful, but WITHOUT ANY #

27# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS #

28# FOR A PARTICULAR PURPOSE. See the Affero GNU General Public License #

29# for more details. #

30# #

31# You should have received a copy of the Affero GNU General Public License #

32# along with PanACOTA (COPYING file). #

33# If not, see <https://www.gnu.org/licenses/>. #

34# ###############################################################################

36"""

37annotate is a subcommand of PanACoTA

39It is a pipeline to do quality control and annotate genomes. Steps are:

41- optional: find rows of at least 'n' N (default n=5), and cut into a new contig at this point

42- for each genome, calc L90 and number of contigs (after cut at n 'N' occurrences if used)

43- keep only genomes with:

45 - L90 <= x (default x = 100)

46 - #contig <= y (default y = 999)

47- rename those genomes and their contigs, with strain name increasing with quality (L90 and

48 #contig)

49- annotate kept genomes with prokka (default) or only prodigal

50- gembase format

52Input:

54- **list_file:** list of genome filenames to annotate. This file contains 1 line per genome. It

55 contains the name(s) of the (multi-)fasta file(s) corresponding to the genome (separated

56 by space if several fasta files for the genome). After quality control, selected genomes

57 will be named as following: ``<gen-spe>.<date>.<strain>``, with:

59 * ``<gen_spe>`` 4 alphanumeric characters. Usually it corresponds to the 2 first letters

60 of genus, and 2 first letters of species, like ESCO for Escherichia coli.

61 * ``<date>`` date at which the genome was downloaded, formatted as MMYY (M=Month, Y=Year)

62 * ``<strain>`` is the strain number of the genome in the species, ordered by quality.

64Default values for ``<gen_spe>`` and ``<date>`` are given as input (see after). However, if some

65genomes do not have the same date and/or genus/species as the others, you can add

66this information for those genomes in the list file. fasta filenames and information are

67separated by ``::``. ``<gen_spe>`` is given after the ``::``, and ``<date>`` is preceded by a

68``.``. Here is an example::

70 genome1.fasta

71 genome2_ch1.fna genome2_pl.fst

72 genome3.fst genome3_plasmid.fst :: name

73 genome4.fna genome4.p1.fna genome4.p2.fna :: name.

74 genome5.fasta :: name.date

75 genome6.chromo.fst genome6.pl.fst :: .date

77- **species:** with 4 alphanumeric characters, used to rename genomes (except those whose

78 species name is specified in the list file)

79- **date:** optional. By default, takes the current date. Used to rename genomes (except those

80 whose date is specified in the list file)

81- **dbpath:** path to folder containing all multi-fasta sequences of genomes

82- **respath:** path to folder where outputs must be saved (folders Genes, Replicons, Proteins,

83 LSTINFO, gff3 and LSTINFO_dataset.lst file)

84- **tmppath** optional. Path where tmp files must be saved. Default is respath/tmp_files

85- **annotepath** optional. Path where prokka/prodigal output folders for all genomes must be saved.

86 Default is respath/tmp_files

87- **threads:** number of threads that can be used (default 1)

89Output:

91- In your given ``respath``, you will find 5 folders:

93 * LSTINFO (information on each genome, with gene annotations),

94 * Genes (nuc. gene sequences),

95 * Proteins (aa proteins sequences),

96 * Replicons (input sequences but with formatted headers).

97 * gff3 (information on genes as gff3 format)

99- In your given ``tmppath`` folder, folders with prokka/prodigal results will

100 be created for each input genome (1 folder per genome, called

101 ``<genome_name>-[prokka, prodigal]Res``). If errors are generated during prokka/prodigal

102 step, you can look at the log file to see what was wrong

103 (``<genome_name>-[prokka, prodigal].log``).

104- In your given ``respath``, a file called ``annotate-genomes-<list_file>.log`` will be generated.

105 You can find there all logs.

106- In your given ``respath``, a file called ``annotate-genomes-<list_file>.log.err`` will be

107 generated, containing information on errors and warnings that occurred: problems during

108 annotation (hence no formatting step ran), and problems during formatting step. If this file is

109 empty, then annotation and formatting steps finished without any problem for all genomes.

110- In your given ``respath``, you will find a file called ``LSTINFO-<list_file>.lst`` with

111 information on all genomes: gembase_name, original_name, genome_size, L90, nb_contigs

112- In your given ``respath``, you will find a file called ``discarded-<list_file>.lst`` with

113 information on genomes that were discarded (and hence not annotated) because of the

114 L90 and/or nb_contig threshold: original_name, genome_size, L90, nb_contigs

115- In your given ``respath``, you will find 2 png files: ``QC_L90-<list_file>.png`` and

116 ``QC_nb-contigs-<list_file>.png``, containing the histograms of L90 and nb_contigs values for

117 all genomes, with a vertical red line representing the limit applied here.

118

119Requested:

120

121- in prokka/prodigal results, all genes are called ``<whatever>_<number>``

122 -> the number will be kept.

123- The number of the genes annotated by prokka/prodigal are in increasing order

124 in tbl, faa and ffn files

125- genome names given to prokka/prodigal should not end with '_<number>'. Ideally, they should

126 always have the same format: ``<spegenus>.<date>.<strain_number>`` but they can have

127 another format, as long as they don't end by '_<number>', which is the format of a gene name.

128

129@author gem

130April 2017

131"""

132

133import os

134import sys

135from termcolor import colored

136import sys

137

138

139def main_from_parse(arguments):

140 """

141 Call main function from the arguments given by parser

142

143 Parameters

144 ----------

145 arguments : argparse.Namespace

146 result of argparse parsing of all arguments in command line

147

148 """

149 cmd = "PanACoTA " + ' '.join(arguments.argv)

150 main(cmd, arguments.list_file, arguments.db_path, arguments.res_path,

151 arguments.name,

152 arguments.date, arguments.l90, arguments.nbcont, arguments.cutn, arguments.threads,

153 arguments.force, arguments.qc_only, arguments.from_info, arguments.tmpdir,

154 arguments.annotdir, arguments.verbose, arguments.quiet, arguments.prodigal_only,

155 arguments.small)

156

157

158def main(cmd, list_file, db_path, res_dir, name, date, l90=100, nbcont=999, cutn=5,

159 threads=1, force=False, qc_only=False, from_info=None, tmp_dir=None, res_annot_dir=None,

160 verbose=0, quiet=False, prodigal_only=False, small=False):

161 """

162 Main method, doing all steps:

163

164 1. analyze genomes (nb contigs, L90, rows of N...)

165 2. keep only genomes with 'good' (according to user thresholds) L90 and nb_contigs

166 3. rename genomes with strain number in decreasing quality

167 4. annotate genome with prokka or only prodigal

168 5. format annotated genomes

169

170 If option '-Q': ends at step 2.

171 If option '--info <genome_info file name>' option: starts at step 2

172

173 verbosity:

174

175 - defaut 0 : stdout contains INFO, stderr contains ERROR.

176 - 1: stdout contains INFO, stderr contains WARNING and ERROR

177 - 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR

178 - >=15: Add DEBUG in stdout

179

180 Parameters

181 ----------

182 cmd : str

183 command line used to launch this program

184 list_file : str

185 file containing the list of genome files, 1 genome per line, separated by a

186 space if a genome is split in several fasta files. This file can also

187 specify date and/or species information, according to the format described

188 in documentation.

189 db_path : str

190 Path to the folder containing all the fasta files which will be annotated

191 res_dir : str

192 Path to the folder which will contain result folders and files

193 name : str

194 4 alpha numeric characters, describing the species (for example ESCO). Used by default

195 if no species name is given in list_file line.

196 date : str

197 4 alpha numeric characters, defining the default date, for strains where it is not specified

198 in the list_file

199 l90 : int

200 Max L90 allowed to keep a genome

201 nbcont : int

202 Max number of contigs allowed to keep a genome

203 cutn : int

204 cut each time there are at least cutn 'N' in a row. Don't cut if equal to 0

205 threads : int

206 max number of threads to use

207 force : bool

208 If True, overwrite previous results, if False keep what is already calculated

209 qc_only : bool

210 If True, do only quality control, if False, also do annotation

211 from_info : str

212 File containing information on genomes and their quality information (from prepare step)

213 tmp_dir : str or None

214 Path to folder where tmp files must be saved. None to use the default tmp folder

215 res_annot_dir : str or None

216 Path to folder where are the prokka/prodigal result folders for the genomes. None

217 to use the default prokka/prodigal folder

218 verbose : int

219 verbosity:

220 default (0): info in stdout, error and more in stderr

221 1 = add warnings in stderr

222 2 = like 1 + add DETAIL to stdout (by default only INFO)

223 >15: add debug to stdout

224 quiet : bool

225 True if nothing must be sent to stdout/stderr, False otherwise

226 prodigal_only : bool

227 True -> run only prodigal. False -> run prokka

228 small : bool

229 True -> use -p meta option with prodigal

230

231 Returns

232 -------

233 (genomes, kept_genomes, skipped, skipped_format) : tuple

234 with:

235

236 - genomes: dict with all genomes in list_file:

237 {genome: [gembase_name, path_split_gembase, gsize, nbcont, L90]}

238 - kept_genomes: dict with all genomes kept for annotation (same format as genomes)

239 - skipped: list of genomes skipped because they had a problem in annotation step

240 - skipped_format : list of genomes skipped because they had a problem in format step

241 """

242 # import needed packages

243 import shutil

244 import logging

245 from PanACoTA.annotate_module import genome_seq_functions as gfunc

246 from PanACoTA.annotate_module import annotation_functions as pfunc

247 from PanACoTA.annotate_module import general_format_functions as ffunc

248 from PanACoTA import utils

249 from PanACoTA import __version__ as version

250 # Check that needed softs are installed

251 prokka = utils.check_installed("prokka")

252 prodigal = utils.check_installed("prodigal")

253 if prodigal_only:

254 soft = "prodigal"

255 else:

256 soft = "prokka"

257

258 changed = cutn != 0

259 if not qc_only: # pragma: no cover

260 # If user using prokka: check prokka is installed and in the path

261 if not prodigal_only and not prokka:

262 print("Prokka is not installed. 'PanACoTA annotate' cannot run. Install prokka "

263 "to be able to annotate genomes. If you only need syntactical annotation, "

264 "check that prodigal is installed, and add '--prodigal' option.")

265 sys.exit(1)

266 if prodigal_only and not prodigal:

267 print("Prodigal is not installed. 'PanACoTA annotate' cannot run. Install "

268 "prodigal to be able to annotate genomes. If you also need functional "

269 "annotation, check that prokka is installed, and remove '--prodigal' "

270 "option.")

271 sys.exit(1)

272

273 # By default, all tmp files (split sequences, renamed sequences, prokka/prodigal results) will

274 # be saved in the given <res_dir>/tmp_files.

275 # Create output (results, tmp...) directories if not already existing

276 if not tmp_dir:

277 tmp_dir = os.path.join(res_dir, "tmp_files")

278 if not res_annot_dir:

279 res_annot_dir = tmp_dir

280 os.makedirs(res_dir, exist_ok=True)

281 os.makedirs(tmp_dir, exist_ok=True)

282 os.makedirs(res_annot_dir, exist_ok=True)

283

284 # If force was set, remove result folders (Proteins, Replicons, Genes, LSTINFO, gff)

285 if force:

286 shutil.rmtree(os.path.join(res_dir, "LSTINFO"), ignore_errors=True)

287 shutil.rmtree(os.path.join(res_dir, "Proteins"), ignore_errors=True)

288 shutil.rmtree(os.path.join(res_dir, "Genes"), ignore_errors=True)

289 shutil.rmtree(os.path.join(res_dir, "Replicons"), ignore_errors=True)

290 shutil.rmtree(os.path.join(res_dir, "gff3"), ignore_errors=True)

291 # If not --force, check that result folders do not already contain results

292 else:

293 utils.check_out_dirs(res_dir)

294

295 # get only filename of list_file, without extension

296 if list_file:

297 listfile_base = os.path.basename(os.path.splitext(list_file)[0])

298 else:

299 list_file = from_info

300 listfile_base = os.path.basename(os.path.splitext(list_file)[0])

301

302 # Initialize logger

303 # set level of logger: level is the minimum level that will be considered.

304 if verbose <= 1:

305 level = logging.INFO

306 # for verbose = 2, ignore only debug

307 if verbose >= 2 and verbose < 15:

308 level = utils.detail_lvl() # int corresponding to detail level

309 # for verbose >= 15, write everything

310 if verbose >= 15:

311 level = logging.DEBUG

312 logfile_base = os.path.join(res_dir, "PanACoTA-annotate_" + listfile_base)

313 logfile_base = utils.init_logger(logfile_base, level, name='annotate', log_details=True,

314 verbose=verbose, quiet=quiet)

315 logger = logging.getLogger('annotate')

316 logger.info(f'PanACoTA version {version}')

317 logger.info("Command used\n \t > " + cmd)

318

319 # STEP 1. analyze genomes (nb contigs, L90, rows of N...)

320 # If already info on genome ('--info <file>' option), skip this step

321 # If no info on genomes, read them and get needed information

322 if not from_info:

323 # Read genome names.

324 # genomes = {genome: [spegenus.date]}

325 genomes = utils.read_genomes(list_file, name, date, db_path, tmp_dir, logger)

326 if not genomes:

327 logger.error(("We did not find any genome listed in {} in the folder {}. "

328 "Please check your list to give valid genome "

329 "names.").format(list_file, db_path))

330 sys.exit(1)

331 # Get L90, nbcontig, size for all genomes, and cut at row of cutn 'N' if asked

332 # -> genome: [spegenus.date, orig_path, to_annotate_path, size, nbcont, l90]

333 gfunc.analyse_all_genomes(genomes, db_path, tmp_dir, cutn, soft,

334 logger, quiet=quiet)

335 # --info <filename> option given: read information (L90, nb contigs...) from this file.

336 else:

337 # genomes = {genome: [spegenus.date, orig_path, to_annotate_path, size, nbcont, l90]}

338 # orig_path is the path to the original sequence

339 # and to_annotate_path the path to the sequence to annotate (once split etc.)

340 # Here, both are the same, as we take given sequences as is.

341 genomes = utils.read_genomes_info(from_info, name, date, logger)

342

343 # STEP 2. keep only genomes with 'good' (according to user thresholds) L90 and nb_contigs

344 # genomes = {genome: [spegenus.date, orig_seq, path_to_splitSequence, size, nbcont, l90]}

345 # Plot L90 and nb_contigs distributions

346 gfunc.plot_distributions(genomes, res_dir, listfile_base, l90, nbcont)

347 # Get list of genomes kept (according to L90 and nbcont thresholds)

348 kept_genomes = {genome: info for genome, info in genomes.items()

349 if info[-2] <= nbcont and info[-1] <= l90}

350 # Write discarded genomes to a file -> orig_name, to_annotate, gsize, nb_conts, L90

351 utils.write_genomes_info(genomes, list(kept_genomes.keys()), list_file, res_dir)

352

353 if not kept_genomes:

354 logger.info("No genome kept for annotation.")

355 return "", 0

356 # Info on folder containing original sequences

357 if not from_info:

358 logger.info(f"-> Original sequences folder ('orig_name' column): {db_path} ")

359 logger.info(f"\t-> If original sequence not found in {db_path}, "

360 f"look for it in {tmp_dir}, as it must be a concatenation of several "

361 "input sequence files.")

362 if cutn == 0:

363 logger.info("-> Sequences used for annotation ('to_annotate' column) are the "

364 "same as the previous ones (original sequences).")

365 else:

366 logger.info(f"-> Folder with sequence files that will be used for annotation "

367 f"('to_annotate' column): {tmp_dir}")

368 # If only QC, stop here.

369 if qc_only:

370 # Write information on genomes that would be annotated with the current

371 # parameters if not QC_only:

372 # orig_name, to_annnote, gsize, nb_conts, L90

373 utils.write_genomes_info(genomes, [], list_file, res_dir, qc=True)

374 logger.info("QC only done.")

375 return "", 0

376

377 # STEP 3. Rename genomes kept, ordered by decreasing quality

378 first_gname = gfunc.rename_all_genomes(kept_genomes)

379 # kept_genomes = {genome: [gembase_name, path_to_origfile, path_split_gembase,

380 # gsize, nbcont, L90]}

381 # first_gname = name of the first genome

382 # Write lstinfo file (list of genomes kept with info on L90 etc.)

383 outlst = utils.write_lstinfo(list_file, kept_genomes, res_dir)

384

385 # STEP 4. Annotate all kept genomes

386 results = pfunc.run_annotation_all(kept_genomes, threads, force, res_annot_dir, first_gname,

387 prodigal_only, small=small, quiet=quiet)

388 # Information on genomes to format

389 # results_ok = {genome: [gembase_name, path_to_origfile, path_split_gembase,

390 # gsize, nbcont, L90]}

391 results_ok = {genome:info for genome, info in kept_genomes.items() if results[genome]}

392 # If no genome was ok, no need to format them. Just print that no genome was annotated,

393 # end program.

394 if not results_ok:

395 logger.error("Error: No genome was correctly annotated, no need to format them.")

396 sys.exit(1)

397 # list of genomes skipped because annotation had problems: no format step run

398 skipped = [genome for (genome, ok) in results.items() if not ok]

399 # At least 1 genome was not annotated: write a message to warn on it

400 if skipped:

401 utils.write_warning_skipped(skipped, prodigal_only=prodigal_only,

402 logfile=logfile_base)

403

404 # STEP 5. Format genomes annotated

405 # Here, we have at least 1 genome annotated (otherwise,

406 # it would already have stopped because results_ok is empty)

407 # Initialize list of genomes skipped because something went wrong while formatting.

408 skipped_format = []

409 # Generate database (folders Proteins, Genes, Replicons, LSTINFO)

410 skipped_format = ffunc.format_genomes(results_ok, res_dir, res_annot_dir,

411 prodigal_only, threads, quiet=quiet)

412 # At least one genome could not be formatted -> warn user

413 if skipped_format:

414 utils.write_warning_skipped(skipped_format, do_format=True, prodigal_only=prodigal_only,

415 logfile = logfile_base)

416 logger.info("Annotation step done.")

417 return outlst, len(kept_genomes) - len(skipped) - len(skipped_format)

418

419

420def build_parser(parser):

421 """

422 Method to create a parser for command-line options

423

424 Parameters

425 ----------

426 parser : argparse.ArgumentParser

427 The parser to configure

428

429 """

430 from PanACoTA import utils

431 from PanACoTA import utils_argparse

432 import multiprocessing

433 import argparse

434

435 # Create command-line parser for all options and arguments to give

436 required = parser.add_argument_group('Required arguments')

437 required.add_argument("-d", dest="db_path",

438 help="Path to folder containing all multifasta genome files")

439 required.add_argument("-r", dest="res_path", required=True,

440 help="Path to folder where output annotated genomes must be saved")

441 optional = parser.add_argument_group('Optional arguments')

442 optional.add_argument('-l', dest="list_file",

443 help=("File containing the list of genome filenames to annotate "

444 "(1 genome per line). Each genome must be in multi-fasta format. "

445 "You can specify the species name (4 characters) you want to "

446 "give to some genome, as well as the download (or any other "

447 "reason) date of your choice. Format 'genome_name :: name.date'. "

448 "name and date are optional. See doc for more information on "

449 "this file format. "

450 "If you want to run this module from 'prepare_module' results, "

451 "use '--info' instead."))

452 optional.add_argument("-n", dest="name", type=utils_argparse.gen_name,

453 help=("Choose a name for your annotated genomes. This name should "

454 "contain 4 alphanumeric characters. Generally, they correspond "

455 "to the 2 first letters of genus, and 2 first letters of "

456 "species, e.g. ESCO for Escherichia Coli."))

457 optional.add_argument("-Q", dest="qc_only", action="store_true", default=False,

458 help="Add this option if you want only to do quality control on your "

459 "genomes (cut at 5N if asked, calculate L90 and number of contigs "

460 "and plot their distributions). This allows you to check which "

461 "genomes would be annotated with the given parameters, and to "

462 "modify those parameters if you want, before you launch the "

463 "annotation and formatting steps.")

464 optional.add_argument("--info", dest="from_info",

465 help="If you already ran the 'prepare' data module, or already "

466 "calculated yourself the L90 and number of contigs for each "

467 "genome, you can give this information, to go directly to "

468 "annotation and formatting steps. This file contains at "

469 "least 4 columns, tab separated, with the following headers: "

470 "'to_annotate', 'gsize', 'nb_conts', 'L90'. Any other column "

471 "will be ignored.")

472 optional.add_argument("--prodigal", dest="prodigal_only", action="store_true", default=False,

473 help="Add this option if you only want syntactical annotation, given "

474 "by prodigal, and not functional annotation requiring prokka and "

475 "is slower.")

476 optional.add_argument("--small", dest="small", action="store_true", default=False,

477 help="If you use Prodigal to annotate genomes, if you sequences are "

478 "too small (less than 20000 characters), it cannot annotate them "

479 "with the default options. Add this option to use 'meta' procedure.")

480 optional.add_argument("--l90", dest="l90", type=int, default=100,

481 help="Maximum value of L90 allowed to keep a genome. Default is 100.")

482 optional.add_argument("--nbcont", dest="nbcont", type=utils_argparse.cont_num, default=999,

483 help=("Maximum number of contigs allowed to keep a genome. "

484 "Default is 999."))

485 optional.add_argument("--cutn", dest="cutn", type=utils_argparse.positive_int, default=5,

486 help=("By default, each genome will be cut into new contigs when "

487 "at least 5 'N' in a row are found in its sequence. "

488 "If you don't want to "

489 "cut genomes into new contigs when there are rows of 'N', "

490 "put 0 to this option. If you want to cut from a different number "

491 "of 'N' occurrences, put this value to this option."))

492 optional.add_argument("--date", dest="date", default=utils_argparse.get_date(),

493 type=utils_argparse.date_name,

494 help=("Specify the date (MMYY) to give to your annotated genomes. "

495 "By default, will give today's date. The only requirement on the"

496 " given date is that it is 4 characters long. You can use letters"

497 " if you want. But the common way is to use 4 digits, "

498 "corresponding to MMYY."))

499 optional.add_argument("--tmp", dest="tmpdir",

500 help=("Specify where the temporary files (sequence split by rows "

501 "of 'N', sequence with new contig names etc.) must be saved. "

502 "By default, it will be saved in your "

503 "result_directory/tmp_files."))

504 optional.add_argument("--annot_dir", dest="annotdir",

505 help=("Specify in which directory the prokka/prodigal output files "

506 "(1 folder per genome, called "

507 "<genome_name>-[prokka, Prodigal]Res) must be "

508 "saved. By default, they are saved in the same directory as "

509 "your temporary files (see --tmp option to change it)."))

510 optional.add_argument("-F", "--force", dest="force", action="store_true",

511 help=("Force run: Add this option if you want to (re)run annotation and "

512 "formatting steps for all genomes "

513 "even if their result folder (for annotation step) or files (for "

514 "format step) already exist: override existing results.\n"

515 "Without this option, if there already are results in the given "

516 "result folder, the program stops. If there are no results, but "

517 "prokka/prodigal folder already exists, prokka/prodigal won't run "

518 "again, and the formating step will use the already existing "

519 "folder if correct, or skip the genome if there are problems in "

520 "prokka/prodigal folder."))

521 optional.add_argument("--threads", dest="threads", type=utils_argparse.thread_num, default=1,

522 help="Specify how many threads can be used (default=1)")

523 helper = parser.add_argument_group('Others')

524 helper.add_argument("-v", "--verbose", dest="verbose", action="count", default=0,

525 help="Increase verbosity in stdout/stderr.")

526 helper.add_argument("-q", "--quiet", dest="quiet", action="store_true", default=False,

527 help=("Do not display anything to stdout/stderr. log files will "

528 "still be created."))

529 helper.add_argument("-h", "--help", dest="help", action="help",

530 help="show this help message and exit")

531

532

533def parse(parser, argu):

534 """

535 arse arguments given to parser

536

537 Parameters

538 ----------

539 parser : argparse.ArgumentParser

540 the parser used

541 argu : [str]

542 command-line given by user, to parse using parser

543

544 Returns

545 -------

546 argparse.Namespace

547 Parsed arguments

548 """

549 import argparse

550

551 args = parser.parse_args(argu)

552 return check_args(parser, args)

553

554

555def check_args(parser, args):

556 """

557 Check that arguments given to parser are as expected.

558

559 Parameters

560 ----------

561 parser : argparse.ArgumentParser

562 The parser used to parse command-line

563 args : argparse.Namespace

564 Parsed arguments

565

566 Returns

567 -------

568 argparse.Namespace or None

569 The arguments parsed, updated according to some rules. Exit program

570 with error message if error occurs with arguments given.

571

572 """

573 # Message if user kept default thresholds for L90 and nbcont. Just to warn him, to be sure

574 # it was on purpose

575 def thresholds_message(l90, nbcont):

576 return (" !! Your genomes will be filtered, and only the ones with 'L90' <= {} "

577 "and 'number of contigs' < {} will be kept. If you want to change those "

578 "thresholds, use '--l90' and '--nbcont' "

579 "options.".format(args.l90, args.nbcont))

580

581 # Message if user is giving a file with already calculated information

582 def nosplit_message():

583 split = (" !! Your sequences will be used as is by PanACoTA. Be sure you "

584 "already split your sequences at each row "

585 "of X 'N' if needed.\n")

586 trust = ("\t-> PanACoTA will use the values (L90, nbcont) given in your info file. "

587 "It will ignore the genomes for which those values are incorrect. "

588 "It will also ignore genomes with more than 999 contigs.")

589 return split + trust

590

591 # ERRORS

592 # Cannot be verbose and quiet at the same time

593 if args.verbose > 0 and args.quiet:

594 parser.error("Choose between a verbose output (-v) or a quiet output (-q)."

595 " You cannot have both.")

596 # User wants to run all annotation step: needs a genome dataset name

597 if not args.qc_only and not args.name:

598 parser.error("You must specify your genomes dataset name in 4 characters with "

599 "'-n name' option (type -h for more information). Or, if you do not want "

600 "to annotate and format your genomes but just to run quality control, use "

601 "option '-Q")

602 # If QC only, we do not need name -> name is NONE

603 if args.qc_only and not args.name:

604 args.name = "NONE"

605 # option --small used only with prodigal

606 if not args.prodigal_only and args.small:

607 parser.error("You cannot use --small option with prokka. Either use prodigal, "

608 "or remove this option.")

609 # If user specifies a cutN value (different than default one which is 5), and give

610 # an info file, it is not compatible: info file will use sequences as is, and won't cut them

611 if args.cutn != 5 and args.from_info:

612 parser.error("If you provide a list of genomes with their calculated L90 and number of "

613 "contigs, PanACoTA will use the given sequences as is. It will not cut "

614 "them. So, you cannot use both --cutn and --info.")

615

616 # Give a lst_file or an info file, not nothing

617 if not args.from_info and not args.list_file:

618 parser.error("You must provide a list of genomes to annotate. Either raw genomes "

619 "(see -l option), or genomes with quality information (see --info option).")

620

621 # Choose between infofile or LSTINFO

622 if args.from_info and args.list_file:

623 parser.error("Either you want to annotate raw sequences (name of files in '-l infofile') "

624 "which will first go through the QC process, "

625 "OR you already did QC on your sequences and just want to annotate them "

626 "(information on those sequences in '--info LSTINFO-file'). "

627 "Please choose one of these 2 possibilities.")

628

629 # If no info file nor db_path, ask for 1 of them

630 if not args.db_path and not args.from_info:

631 parser.error("You must provide a path to your database genome sequences (-d <db_path>). "

632 "If you already have a LSTINFO file, it contains this db_path. Use it with "

633 "--info <lstinfo file> option.")

634

635 # If given LSTINFO, already contains paths to genome to annotate. db_path must not be provided

636 if args.from_info and args.db_path:

637 parser.error("If you run from your LSTINFO file, this one already contains the "

638 "path of genomes to annotate. Remove -d <db_path> option.")

639

640 # WARNINGS

641 # If user wants to cut genomes, warn him to check that it is on purpose (because default is cut at each 5'N')

642 if args.cutn != 0 and not args.from_info:

643 message = (" !! Your genomes will be split when sequence contains at "

644 "least {}'N' in a row. If you want to change this threshold, "

645 "see --cutn option.").format(args.cutn)

646 print(colored(message, "yellow"))

647 # Warn user about selection of genomes thresholds

648 if args.l90 == 100 or args.nbcont == 999:

649 print(colored(thresholds_message(args.l90, args.nbcont), "yellow"))

650 if args.from_info:

651 print(colored(nosplit_message(), "yellow"))

652 print()

653 return args

654

655if __name__ == '__main__':

656 import argparse

657 from textwrap import dedent

658 header = '''

659 ___ _____ ___ _____ _____

660 ( _`\ ( _ )( _`\ (_ _)( _ )

661 | |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) |

662 | ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ |

663 | | ( (_| || ( ) || | | || (_( )( (_) )| | | | | |

664 (_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_)

665

666

667 Large scale comparative genomics tools

668

669 -------------------------------------------

670 '''

671 my_parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter,

672 description=dedent(header), add_help=False)

673 build_parser(my_parser)

674 OPTIONS = parse(my_parser, sys.argv[1:])

675 main_from_parse(OPTIONS)