Coverage for PanACoTA/subcommands/annotate.py: 100%

155 statements  

« prev     ^ index     » next       coverage.py v7.3.1, created at 2023-09-20 14:37 +0000

1#!/usr/bin/env python3 

2# coding: utf-8 

3 

4# ############################################################################### 

5# This file is part of PanACOTA. # 

6# # 

7# Authors: Amandine Perrin # 

8# Copyright © 2018-2020 Institut Pasteur (Paris). # 

9# See the COPYRIGHT file for details. # 

10# # 

11# PanACOTA is a software providing tools for large scale bacterial comparative # 

12# genomics. From a set of complete and/or draft genomes, you can: # 

13# - Do a quality control of your strains, to eliminate poor quality # 

14# genomes, which would not give any information for the comparative study # 

15# - Uniformly annotate all genomes # 

16# - Do a Pan-genome # 

17# - Do a Core or Persistent genome # 

18# - Align all Core/Persistent families # 

19# - Infer a phylogenetic tree from the Core/Persistent families # 

20# # 

21# PanACOTA is free software: you can redistribute it and/or modify it under the # 

22# terms of the Affero GNU General Public License as published by the Free # 

23# Software Foundation, either version 3 of the License, or (at your option) # 

24# any later version. # 

25# # 

26# PanACOTA is distributed in the hope that it will be useful, but WITHOUT ANY # 

27# WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS # 

28# FOR A PARTICULAR PURPOSE. See the Affero GNU General Public License # 

29# for more details. # 

30# # 

31# You should have received a copy of the Affero GNU General Public License # 

32# along with PanACOTA (COPYING file). # 

33# If not, see <https://www.gnu.org/licenses/>. # 

34# ############################################################################### 

35 

36""" 

37annotate is a subcommand of PanACoTA 

38 

39It is a pipeline to do quality control and annotate genomes. Steps are: 

40 

41- optional: find rows of at least 'n' N (default n=5), and cut into a new contig at this point 

42- for each genome, calc L90 and number of contigs (after cut at n 'N' occurrences if used) 

43- keep only genomes with: 

44 

45 - L90 <= x (default x = 100) 

46 - #contig <= y (default y = 999) 

47- rename those genomes and their contigs, with strain name increasing with quality (L90 and 

48 #contig) 

49- annotate kept genomes with prokka (default) or only prodigal 

50- gembase format 

51 

52Input: 

53 

54- **list_file:** list of genome filenames to annotate. This file contains 1 line per genome. It 

55 contains the name(s) of the (multi-)fasta file(s) corresponding to the genome (separated 

56 by space if several fasta files for the genome). After quality control, selected genomes 

57 will be named as following: ``<gen-spe>.<date>.<strain>``, with: 

58 

59 * ``<gen_spe>`` 4 alphanumeric characters. Usually it corresponds to the 2 first letters 

60 of genus, and 2 first letters of species, like ESCO for Escherichia coli. 

61 * ``<date>`` date at which the genome was downloaded, formatted as MMYY (M=Month, Y=Year) 

62 * ``<strain>`` is the strain number of the genome in the species, ordered by quality. 

63 

64Default values for ``<gen_spe>`` and ``<date>`` are given as input (see after). However, if some 

65genomes do not have the same date and/or genus/species as the others, you can add 

66this information for those genomes in the list file. fasta filenames and information are 

67separated by ``::``. ``<gen_spe>`` is given after the ``::``, and ``<date>`` is preceded by a 

68``.``. Here is an example:: 

69 

70 genome1.fasta 

71 genome2_ch1.fna genome2_pl.fst 

72 genome3.fst genome3_plasmid.fst :: name 

73 genome4.fna genome4.p1.fna genome4.p2.fna :: name. 

74 genome5.fasta :: name.date 

75 genome6.chromo.fst genome6.pl.fst :: .date 

76 

77- **species:** with 4 alphanumeric characters, used to rename genomes (except those whose 

78 species name is specified in the list file) 

79- **date:** optional. By default, takes the current date. Used to rename genomes (except those 

80 whose date is specified in the list file) 

81- **dbpath:** path to folder containing all multi-fasta sequences of genomes 

82- **respath:** path to folder where outputs must be saved (folders Genes, Replicons, Proteins, 

83 LSTINFO, gff3 and LSTINFO_dataset.lst file) 

84- **tmppath** optional. Path where tmp files must be saved. Default is respath/tmp_files 

85- **annotepath** optional. Path where prokka/prodigal output folders for all genomes must be saved. 

86 Default is respath/tmp_files 

87- **threads:** number of threads that can be used (default 1) 

88 

89Output: 

90 

91- In your given ``respath``, you will find 5 folders: 

92 

93 * LSTINFO (information on each genome, with gene annotations), 

94 * Genes (nuc. gene sequences), 

95 * Proteins (aa proteins sequences), 

96 * Replicons (input sequences but with formatted headers). 

97 * gff3 (information on genes as gff3 format) 

98 

99- In your given ``tmppath`` folder, folders with prokka/prodigal results will 

100 be created for each input genome (1 folder per genome, called 

101 ``<genome_name>-[prokka, prodigal]Res``). If errors are generated during prokka/prodigal 

102 step, you can look at the log file to see what was wrong 

103 (``<genome_name>-[prokka, prodigal].log``). 

104- In your given ``respath``, a file called ``annotate-genomes-<list_file>.log`` will be generated. 

105 You can find there all logs. 

106- In your given ``respath``, a file called ``annotate-genomes-<list_file>.log.err`` will be 

107 generated, containing information on errors and warnings that occurred: problems during 

108 annotation (hence no formatting step ran), and problems during formatting step. If this file is 

109 empty, then annotation and formatting steps finished without any problem for all genomes. 

110- In your given ``respath``, you will find a file called ``LSTINFO-<list_file>.lst`` with 

111 information on all genomes: gembase_name, original_name, genome_size, L90, nb_contigs 

112- In your given ``respath``, you will find a file called ``discarded-<list_file>.lst`` with 

113 information on genomes that were discarded (and hence not annotated) because of the 

114 L90 and/or nb_contig threshold: original_name, genome_size, L90, nb_contigs 

115- In your given ``respath``, you will find 2 png files: ``QC_L90-<list_file>.png`` and 

116 ``QC_nb-contigs-<list_file>.png``, containing the histograms of L90 and nb_contigs values for 

117 all genomes, with a vertical red line representing the limit applied here. 

118 

119Requested: 

120 

121- in prokka/prodigal results, all genes are called ``<whatever>_<number>`` 

122 -> the number will be kept. 

123- The number of the genes annotated by prokka/prodigal are in increasing order 

124 in tbl, faa and ffn files 

125- genome names given to prokka/prodigal should not end with '_<number>'. Ideally, they should 

126 always have the same format: ``<spegenus>.<date>.<strain_number>`` but they can have 

127 another format, as long as they don't end by '_<number>', which is the format of a gene name. 

128 

129@author gem 

130April 2017 

131""" 

132 

133import os 

134import sys 

135from termcolor import colored 

136import sys 

137 

138 

139def main_from_parse(arguments): 

140 """ 

141 Call main function from the arguments given by parser 

142 

143 Parameters 

144 ---------- 

145 arguments : argparse.Namespace 

146 result of argparse parsing of all arguments in command line 

147 

148 """ 

149 cmd = "PanACoTA " + ' '.join(arguments.argv) 

150 main(cmd, arguments.list_file, arguments.db_path, arguments.res_path, 

151 arguments.name, 

152 arguments.date, arguments.l90, arguments.nbcont, arguments.cutn, arguments.threads, 

153 arguments.force, arguments.qc_only, arguments.from_info, arguments.tmpdir, 

154 arguments.annotdir, arguments.verbose, arguments.quiet, arguments.prodigal_only, 

155 arguments.small) 

156 

157 

158def main(cmd, list_file, db_path, res_dir, name, date, l90=100, nbcont=999, cutn=5, 

159 threads=1, force=False, qc_only=False, from_info=None, tmp_dir=None, res_annot_dir=None, 

160 verbose=0, quiet=False, prodigal_only=False, small=False): 

161 """ 

162 Main method, doing all steps: 

163 

164 1. analyze genomes (nb contigs, L90, rows of N...) 

165 2. keep only genomes with 'good' (according to user thresholds) L90 and nb_contigs 

166 3. rename genomes with strain number in decreasing quality 

167 4. annotate genome with prokka or only prodigal 

168 5. format annotated genomes 

169 

170 If option '-Q': ends at step 2. 

171 If option '--info <genome_info file name>' option: starts at step 2 

172 

173 verbosity: 

174 

175 - defaut 0 : stdout contains INFO, stderr contains ERROR. 

176 - 1: stdout contains INFO, stderr contains WARNING and ERROR 

177 - 2: stdout contains (DEBUG), DETAIL and INFO, stderr contains WARNING and ERROR 

178 - >=15: Add DEBUG in stdout 

179 

180 Parameters 

181 ---------- 

182 cmd : str 

183 command line used to launch this program 

184 list_file : str 

185 file containing the list of genome files, 1 genome per line, separated by a 

186 space if a genome is split in several fasta files. This file can also 

187 specify date and/or species information, according to the format described 

188 in documentation. 

189 db_path : str 

190 Path to the folder containing all the fasta files which will be annotated 

191 res_dir : str 

192 Path to the folder which will contain result folders and files 

193 name : str 

194 4 alpha numeric characters, describing the species (for example ESCO). Used by default 

195 if no species name is given in list_file line. 

196 date : str 

197 4 alpha numeric characters, defining the default date, for strains where it is not specified 

198 in the list_file 

199 l90 : int 

200 Max L90 allowed to keep a genome 

201 nbcont : int 

202 Max number of contigs allowed to keep a genome 

203 cutn : int 

204 cut each time there are at least cutn 'N' in a row. Don't cut if equal to 0 

205 threads : int 

206 max number of threads to use 

207 force : bool 

208 If True, overwrite previous results, if False keep what is already calculated 

209 qc_only : bool 

210 If True, do only quality control, if False, also do annotation 

211 from_info : str 

212 File containing information on genomes and their quality information (from prepare step) 

213 tmp_dir : str or None 

214 Path to folder where tmp files must be saved. None to use the default tmp folder 

215 res_annot_dir : str or None 

216 Path to folder where are the prokka/prodigal result folders for the genomes. None 

217 to use the default prokka/prodigal folder 

218 verbose : int 

219 verbosity: 

220 default (0): info in stdout, error and more in stderr 

221 1 = add warnings in stderr 

222 2 = like 1 + add DETAIL to stdout (by default only INFO) 

223 >15: add debug to stdout 

224 quiet : bool 

225 True if nothing must be sent to stdout/stderr, False otherwise 

226 prodigal_only : bool 

227 True -> run only prodigal. False -> run prokka 

228 small : bool 

229 True -> use -p meta option with prodigal 

230 

231 Returns 

232 ------- 

233 (genomes, kept_genomes, skipped, skipped_format) : tuple 

234 with: 

235 

236 - genomes: dict with all genomes in list_file: 

237 {genome: [gembase_name, path_split_gembase, gsize, nbcont, L90]} 

238 - kept_genomes: dict with all genomes kept for annotation (same format as genomes) 

239 - skipped: list of genomes skipped because they had a problem in annotation step 

240 - skipped_format : list of genomes skipped because they had a problem in format step 

241 """ 

242 # import needed packages 

243 import shutil 

244 import logging 

245 from PanACoTA.annotate_module import genome_seq_functions as gfunc 

246 from PanACoTA.annotate_module import annotation_functions as pfunc 

247 from PanACoTA.annotate_module import general_format_functions as ffunc 

248 from PanACoTA import utils 

249 from PanACoTA import __version__ as version 

250 # Check that needed softs are installed 

251 prokka = utils.check_installed("prokka") 

252 prodigal = utils.check_installed("prodigal") 

253 if prodigal_only: 

254 soft = "prodigal" 

255 else: 

256 soft = "prokka" 

257 

258 changed = cutn != 0 

259 if not qc_only: # pragma: no cover 

260 # If user using prokka: check prokka is installed and in the path 

261 if not prodigal_only and not prokka: 

262 print("Prokka is not installed. 'PanACoTA annotate' cannot run. Install prokka " 

263 "to be able to annotate genomes. If you only need syntactical annotation, " 

264 "check that prodigal is installed, and add '--prodigal' option.") 

265 sys.exit(1) 

266 if prodigal_only and not prodigal: 

267 print("Prodigal is not installed. 'PanACoTA annotate' cannot run. Install " 

268 "prodigal to be able to annotate genomes. If you also need functional " 

269 "annotation, check that prokka is installed, and remove '--prodigal' " 

270 "option.") 

271 sys.exit(1) 

272 

273 # By default, all tmp files (split sequences, renamed sequences, prokka/prodigal results) will 

274 # be saved in the given <res_dir>/tmp_files. 

275 # Create output (results, tmp...) directories if not already existing 

276 if not tmp_dir: 

277 tmp_dir = os.path.join(res_dir, "tmp_files") 

278 if not res_annot_dir: 

279 res_annot_dir = tmp_dir 

280 os.makedirs(res_dir, exist_ok=True) 

281 os.makedirs(tmp_dir, exist_ok=True) 

282 os.makedirs(res_annot_dir, exist_ok=True) 

283 

284 # If force was set, remove result folders (Proteins, Replicons, Genes, LSTINFO, gff) 

285 if force: 

286 shutil.rmtree(os.path.join(res_dir, "LSTINFO"), ignore_errors=True) 

287 shutil.rmtree(os.path.join(res_dir, "Proteins"), ignore_errors=True) 

288 shutil.rmtree(os.path.join(res_dir, "Genes"), ignore_errors=True) 

289 shutil.rmtree(os.path.join(res_dir, "Replicons"), ignore_errors=True) 

290 shutil.rmtree(os.path.join(res_dir, "gff3"), ignore_errors=True) 

291 # If not --force, check that result folders do not already contain results 

292 else: 

293 utils.check_out_dirs(res_dir) 

294 

295 # get only filename of list_file, without extension 

296 if list_file: 

297 listfile_base = os.path.basename(os.path.splitext(list_file)[0]) 

298 else: 

299 list_file = from_info 

300 listfile_base = os.path.basename(os.path.splitext(list_file)[0]) 

301 

302 # Initialize logger 

303 # set level of logger: level is the minimum level that will be considered. 

304 if verbose <= 1: 

305 level = logging.INFO 

306 # for verbose = 2, ignore only debug 

307 if verbose >= 2 and verbose < 15: 

308 level = utils.detail_lvl() # int corresponding to detail level 

309 # for verbose >= 15, write everything 

310 if verbose >= 15: 

311 level = logging.DEBUG 

312 logfile_base = os.path.join(res_dir, "PanACoTA-annotate_" + listfile_base) 

313 logfile_base = utils.init_logger(logfile_base, level, name='annotate', log_details=True, 

314 verbose=verbose, quiet=quiet) 

315 logger = logging.getLogger('annotate') 

316 logger.info(f'PanACoTA version {version}') 

317 logger.info("Command used\n \t > " + cmd) 

318 

319 # STEP 1. analyze genomes (nb contigs, L90, rows of N...) 

320 # If already info on genome ('--info <file>' option), skip this step 

321 # If no info on genomes, read them and get needed information 

322 if not from_info: 

323 # Read genome names. 

324 # genomes = {genome: [spegenus.date]} 

325 genomes = utils.read_genomes(list_file, name, date, db_path, tmp_dir, logger) 

326 if not genomes: 

327 logger.error(("We did not find any genome listed in {} in the folder {}. " 

328 "Please check your list to give valid genome " 

329 "names.").format(list_file, db_path)) 

330 sys.exit(1) 

331 # Get L90, nbcontig, size for all genomes, and cut at row of cutn 'N' if asked 

332 # -> genome: [spegenus.date, orig_path, to_annotate_path, size, nbcont, l90] 

333 gfunc.analyse_all_genomes(genomes, db_path, tmp_dir, cutn, soft, 

334 logger, quiet=quiet) 

335 # --info <filename> option given: read information (L90, nb contigs...) from this file. 

336 else: 

337 # genomes = {genome: [spegenus.date, orig_path, to_annotate_path, size, nbcont, l90]} 

338 # orig_path is the path to the original sequence 

339 # and to_annotate_path the path to the sequence to annotate (once split etc.) 

340 # Here, both are the same, as we take given sequences as is. 

341 genomes = utils.read_genomes_info(from_info, name, date, logger) 

342 

343 # STEP 2. keep only genomes with 'good' (according to user thresholds) L90 and nb_contigs 

344 # genomes = {genome: [spegenus.date, orig_seq, path_to_splitSequence, size, nbcont, l90]} 

345 # Plot L90 and nb_contigs distributions 

346 gfunc.plot_distributions(genomes, res_dir, listfile_base, l90, nbcont) 

347 # Get list of genomes kept (according to L90 and nbcont thresholds) 

348 kept_genomes = {genome: info for genome, info in genomes.items() 

349 if info[-2] <= nbcont and info[-1] <= l90} 

350 # Write discarded genomes to a file -> orig_name, to_annotate, gsize, nb_conts, L90 

351 utils.write_genomes_info(genomes, list(kept_genomes.keys()), list_file, res_dir) 

352 

353 if not kept_genomes: 

354 logger.info("No genome kept for annotation.") 

355 return "", 0 

356 # Info on folder containing original sequences 

357 if not from_info: 

358 logger.info(f"-> Original sequences folder ('orig_name' column): {db_path} ") 

359 logger.info(f"\t-> If original sequence not found in {db_path}, " 

360 f"look for it in {tmp_dir}, as it must be a concatenation of several " 

361 "input sequence files.") 

362 if cutn == 0: 

363 logger.info("-> Sequences used for annotation ('to_annotate' column) are the " 

364 "same as the previous ones (original sequences).") 

365 else: 

366 logger.info(f"-> Folder with sequence files that will be used for annotation " 

367 f"('to_annotate' column): {tmp_dir}") 

368 # If only QC, stop here. 

369 if qc_only: 

370 # Write information on genomes that would be annotated with the current 

371 # parameters if not QC_only: 

372 # orig_name, to_annnote, gsize, nb_conts, L90 

373 utils.write_genomes_info(genomes, [], list_file, res_dir, qc=True) 

374 logger.info("QC only done.") 

375 return "", 0 

376 

377 # STEP 3. Rename genomes kept, ordered by decreasing quality 

378 first_gname = gfunc.rename_all_genomes(kept_genomes) 

379 # kept_genomes = {genome: [gembase_name, path_to_origfile, path_split_gembase, 

380 # gsize, nbcont, L90]} 

381 # first_gname = name of the first genome 

382 # Write lstinfo file (list of genomes kept with info on L90 etc.) 

383 outlst = utils.write_lstinfo(list_file, kept_genomes, res_dir) 

384 

385 # STEP 4. Annotate all kept genomes 

386 results = pfunc.run_annotation_all(kept_genomes, threads, force, res_annot_dir, first_gname, 

387 prodigal_only, small=small, quiet=quiet) 

388 # Information on genomes to format 

389 # results_ok = {genome: [gembase_name, path_to_origfile, path_split_gembase, 

390 # gsize, nbcont, L90]} 

391 results_ok = {genome:info for genome, info in kept_genomes.items() if results[genome]} 

392 # If no genome was ok, no need to format them. Just print that no genome was annotated, 

393 # end program. 

394 if not results_ok: 

395 logger.error("Error: No genome was correctly annotated, no need to format them.") 

396 sys.exit(1) 

397 # list of genomes skipped because annotation had problems: no format step run 

398 skipped = [genome for (genome, ok) in results.items() if not ok] 

399 # At least 1 genome was not annotated: write a message to warn on it 

400 if skipped: 

401 utils.write_warning_skipped(skipped, prodigal_only=prodigal_only, 

402 logfile=logfile_base) 

403 

404 # STEP 5. Format genomes annotated 

405 # Here, we have at least 1 genome annotated (otherwise, 

406 # it would already have stopped because results_ok is empty) 

407 # Initialize list of genomes skipped because something went wrong while formatting. 

408 skipped_format = [] 

409 # Generate database (folders Proteins, Genes, Replicons, LSTINFO) 

410 skipped_format = ffunc.format_genomes(results_ok, res_dir, res_annot_dir, 

411 prodigal_only, threads, quiet=quiet) 

412 # At least one genome could not be formatted -> warn user 

413 if skipped_format: 

414 utils.write_warning_skipped(skipped_format, do_format=True, prodigal_only=prodigal_only, 

415 logfile = logfile_base) 

416 logger.info("Annotation step done.") 

417 return outlst, len(kept_genomes) - len(skipped) - len(skipped_format) 

418 

419 

420def build_parser(parser): 

421 """ 

422 Method to create a parser for command-line options 

423 

424 Parameters 

425 ---------- 

426 parser : argparse.ArgumentParser 

427 The parser to configure 

428 

429 """ 

430 from PanACoTA import utils 

431 from PanACoTA import utils_argparse 

432 import multiprocessing 

433 import argparse 

434 

435 # Create command-line parser for all options and arguments to give 

436 required = parser.add_argument_group('Required arguments') 

437 required.add_argument("-d", dest="db_path", 

438 help="Path to folder containing all multifasta genome files") 

439 required.add_argument("-r", dest="res_path", required=True, 

440 help="Path to folder where output annotated genomes must be saved") 

441 optional = parser.add_argument_group('Optional arguments') 

442 optional.add_argument('-l', dest="list_file", 

443 help=("File containing the list of genome filenames to annotate " 

444 "(1 genome per line). Each genome must be in multi-fasta format. " 

445 "You can specify the species name (4 characters) you want to " 

446 "give to some genome, as well as the download (or any other " 

447 "reason) date of your choice. Format 'genome_name :: name.date'. " 

448 "name and date are optional. See doc for more information on " 

449 "this file format. " 

450 "If you want to run this module from 'prepare_module' results, " 

451 "use '--info' instead.")) 

452 optional.add_argument("-n", dest="name", type=utils_argparse.gen_name, 

453 help=("Choose a name for your annotated genomes. This name should " 

454 "contain 4 alphanumeric characters. Generally, they correspond " 

455 "to the 2 first letters of genus, and 2 first letters of " 

456 "species, e.g. ESCO for Escherichia Coli.")) 

457 optional.add_argument("-Q", dest="qc_only", action="store_true", default=False, 

458 help="Add this option if you want only to do quality control on your " 

459 "genomes (cut at 5N if asked, calculate L90 and number of contigs " 

460 "and plot their distributions). This allows you to check which " 

461 "genomes would be annotated with the given parameters, and to " 

462 "modify those parameters if you want, before you launch the " 

463 "annotation and formatting steps.") 

464 optional.add_argument("--info", dest="from_info", 

465 help="If you already ran the 'prepare' data module, or already " 

466 "calculated yourself the L90 and number of contigs for each " 

467 "genome, you can give this information, to go directly to " 

468 "annotation and formatting steps. This file contains at " 

469 "least 4 columns, tab separated, with the following headers: " 

470 "'to_annotate', 'gsize', 'nb_conts', 'L90'. Any other column " 

471 "will be ignored.") 

472 optional.add_argument("--prodigal", dest="prodigal_only", action="store_true", default=False, 

473 help="Add this option if you only want syntactical annotation, given " 

474 "by prodigal, and not functional annotation requiring prokka and " 

475 "is slower.") 

476 optional.add_argument("--small", dest="small", action="store_true", default=False, 

477 help="If you use Prodigal to annotate genomes, if you sequences are " 

478 "too small (less than 20000 characters), it cannot annotate them " 

479 "with the default options. Add this option to use 'meta' procedure.") 

480 optional.add_argument("--l90", dest="l90", type=int, default=100, 

481 help="Maximum value of L90 allowed to keep a genome. Default is 100.") 

482 optional.add_argument("--nbcont", dest="nbcont", type=utils_argparse.cont_num, default=999, 

483 help=("Maximum number of contigs allowed to keep a genome. " 

484 "Default is 999.")) 

485 optional.add_argument("--cutn", dest="cutn", type=utils_argparse.positive_int, default=5, 

486 help=("By default, each genome will be cut into new contigs when " 

487 "at least 5 'N' in a row are found in its sequence. " 

488 "If you don't want to " 

489 "cut genomes into new contigs when there are rows of 'N', " 

490 "put 0 to this option. If you want to cut from a different number " 

491 "of 'N' occurrences, put this value to this option.")) 

492 optional.add_argument("--date", dest="date", default=utils_argparse.get_date(), 

493 type=utils_argparse.date_name, 

494 help=("Specify the date (MMYY) to give to your annotated genomes. " 

495 "By default, will give today's date. The only requirement on the" 

496 " given date is that it is 4 characters long. You can use letters" 

497 " if you want. But the common way is to use 4 digits, " 

498 "corresponding to MMYY.")) 

499 optional.add_argument("--tmp", dest="tmpdir", 

500 help=("Specify where the temporary files (sequence split by rows " 

501 "of 'N', sequence with new contig names etc.) must be saved. " 

502 "By default, it will be saved in your " 

503 "result_directory/tmp_files.")) 

504 optional.add_argument("--annot_dir", dest="annotdir", 

505 help=("Specify in which directory the prokka/prodigal output files " 

506 "(1 folder per genome, called " 

507 "<genome_name>-[prokka, Prodigal]Res) must be " 

508 "saved. By default, they are saved in the same directory as " 

509 "your temporary files (see --tmp option to change it).")) 

510 optional.add_argument("-F", "--force", dest="force", action="store_true", 

511 help=("Force run: Add this option if you want to (re)run annotation and " 

512 "formatting steps for all genomes " 

513 "even if their result folder (for annotation step) or files (for " 

514 "format step) already exist: override existing results.\n" 

515 "Without this option, if there already are results in the given " 

516 "result folder, the program stops. If there are no results, but " 

517 "prokka/prodigal folder already exists, prokka/prodigal won't run " 

518 "again, and the formating step will use the already existing " 

519 "folder if correct, or skip the genome if there are problems in " 

520 "prokka/prodigal folder.")) 

521 optional.add_argument("--threads", dest="threads", type=utils_argparse.thread_num, default=1, 

522 help="Specify how many threads can be used (default=1)") 

523 helper = parser.add_argument_group('Others') 

524 helper.add_argument("-v", "--verbose", dest="verbose", action="count", default=0, 

525 help="Increase verbosity in stdout/stderr.") 

526 helper.add_argument("-q", "--quiet", dest="quiet", action="store_true", default=False, 

527 help=("Do not display anything to stdout/stderr. log files will " 

528 "still be created.")) 

529 helper.add_argument("-h", "--help", dest="help", action="help", 

530 help="show this help message and exit") 

531 

532 

533def parse(parser, argu): 

534 """ 

535 arse arguments given to parser 

536 

537 Parameters 

538 ---------- 

539 parser : argparse.ArgumentParser 

540 the parser used 

541 argu : [str] 

542 command-line given by user, to parse using parser 

543 

544 Returns 

545 ------- 

546 argparse.Namespace 

547 Parsed arguments 

548 """ 

549 import argparse 

550 

551 args = parser.parse_args(argu) 

552 return check_args(parser, args) 

553 

554 

555def check_args(parser, args): 

556 """ 

557 Check that arguments given to parser are as expected. 

558 

559 Parameters 

560 ---------- 

561 parser : argparse.ArgumentParser 

562 The parser used to parse command-line 

563 args : argparse.Namespace 

564 Parsed arguments 

565 

566 Returns 

567 ------- 

568 argparse.Namespace or None 

569 The arguments parsed, updated according to some rules. Exit program 

570 with error message if error occurs with arguments given. 

571 

572 """ 

573 # Message if user kept default thresholds for L90 and nbcont. Just to warn him, to be sure 

574 # it was on purpose 

575 def thresholds_message(l90, nbcont): 

576 return (" !! Your genomes will be filtered, and only the ones with 'L90' <= {} " 

577 "and 'number of contigs' < {} will be kept. If you want to change those " 

578 "thresholds, use '--l90' and '--nbcont' " 

579 "options.".format(args.l90, args.nbcont)) 

580 

581 # Message if user is giving a file with already calculated information 

582 def nosplit_message(): 

583 split = (" !! Your sequences will be used as is by PanACoTA. Be sure you " 

584 "already split your sequences at each row " 

585 "of X 'N' if needed.\n") 

586 trust = ("\t-> PanACoTA will use the values (L90, nbcont) given in your info file. " 

587 "It will ignore the genomes for which those values are incorrect. " 

588 "It will also ignore genomes with more than 999 contigs.") 

589 return split + trust 

590 

591 # ERRORS 

592 # Cannot be verbose and quiet at the same time 

593 if args.verbose > 0 and args.quiet: 

594 parser.error("Choose between a verbose output (-v) or a quiet output (-q)." 

595 " You cannot have both.") 

596 # User wants to run all annotation step: needs a genome dataset name 

597 if not args.qc_only and not args.name: 

598 parser.error("You must specify your genomes dataset name in 4 characters with " 

599 "'-n name' option (type -h for more information). Or, if you do not want " 

600 "to annotate and format your genomes but just to run quality control, use " 

601 "option '-Q") 

602 # If QC only, we do not need name -> name is NONE 

603 if args.qc_only and not args.name: 

604 args.name = "NONE" 

605 # option --small used only with prodigal 

606 if not args.prodigal_only and args.small: 

607 parser.error("You cannot use --small option with prokka. Either use prodigal, " 

608 "or remove this option.") 

609 # If user specifies a cutN value (different than default one which is 5), and give 

610 # an info file, it is not compatible: info file will use sequences as is, and won't cut them 

611 if args.cutn != 5 and args.from_info: 

612 parser.error("If you provide a list of genomes with their calculated L90 and number of " 

613 "contigs, PanACoTA will use the given sequences as is. It will not cut " 

614 "them. So, you cannot use both --cutn and --info.") 

615 

616 # Give a lst_file or an info file, not nothing 

617 if not args.from_info and not args.list_file: 

618 parser.error("You must provide a list of genomes to annotate. Either raw genomes " 

619 "(see -l option), or genomes with quality information (see --info option).") 

620 

621 # Choose between infofile or LSTINFO 

622 if args.from_info and args.list_file: 

623 parser.error("Either you want to annotate raw sequences (name of files in '-l infofile') " 

624 "which will first go through the QC process, " 

625 "OR you already did QC on your sequences and just want to annotate them " 

626 "(information on those sequences in '--info LSTINFO-file'). " 

627 "Please choose one of these 2 possibilities.") 

628 

629 # If no info file nor db_path, ask for 1 of them 

630 if not args.db_path and not args.from_info: 

631 parser.error("You must provide a path to your database genome sequences (-d <db_path>). " 

632 "If you already have a LSTINFO file, it contains this db_path. Use it with " 

633 "--info <lstinfo file> option.") 

634 

635 # If given LSTINFO, already contains paths to genome to annotate. db_path must not be provided 

636 if args.from_info and args.db_path: 

637 parser.error("If you run from your LSTINFO file, this one already contains the " 

638 "path of genomes to annotate. Remove -d <db_path> option.") 

639 

640 # WARNINGS 

641 # If user wants to cut genomes, warn him to check that it is on purpose (because default is cut at each 5'N') 

642 if args.cutn != 0 and not args.from_info: 

643 message = (" !! Your genomes will be split when sequence contains at " 

644 "least {}'N' in a row. If you want to change this threshold, " 

645 "see --cutn option.").format(args.cutn) 

646 print(colored(message, "yellow")) 

647 # Warn user about selection of genomes thresholds 

648 if args.l90 == 100 or args.nbcont == 999: 

649 print(colored(thresholds_message(args.l90, args.nbcont), "yellow")) 

650 if args.from_info: 

651 print(colored(nosplit_message(), "yellow")) 

652 print() 

653 return args 

654 

655if __name__ == '__main__': 

656 import argparse 

657 from textwrap import dedent 

658 header = ''' 

659 ___ _____ ___ _____ _____ 

660 ( _`\ ( _ )( _`\ (_ _)( _ ) 

661 | |_) ) _ _ ___ | (_) || ( (_) _ | | | (_) | 

662 | ,__/'/'_` )/' _ `\| _ || | _ /'_`\ | | | _ | 

663 | | ( (_| || ( ) || | | || (_( )( (_) )| | | | | | 

664 (_) `\__,_)(_) (_)(_) (_)(____/'`\___/'(_) (_) (_) 

665 

666 

667 Large scale comparative genomics tools 

668 

669 ------------------------------------------- 

670 ''' 

671 my_parser = argparse.ArgumentParser(formatter_class=argparse.RawDescriptionHelpFormatter, 

672 description=dedent(header), add_help=False) 

673 build_parser(my_parser) 

674 OPTIONS = parse(my_parser, sys.argv[1:]) 

675 main_from_parse(OPTIONS)