PREQUAL on ACCESS 1.02
PRE-alignment QUALity assessment on XSEDE
Simon Whelan
Whelan, S., Irisarri, I., and Burki, F. (2018) PREQUAL: detecting non-homologous characters in sets of unaligned homologous sequences. Bioinformatics 34, 3929-3930
10.1093/bioinformatics/bty448
Phylogeny / prequal_xsede

invoke_run perl "/expanse/projects/ngbt/opt/expanse/prequal/prequal/prequal" 0

infile Fasta Sequences File perl "infile.txt" 99 infile.txt

scheduler_input scheduler.conf
perl !defined $more_memory
perl "ChargeFactor=1.0\\n" . "nodes=1\\n" . "mem=8G\\n" . "node_exclusive=0\\n" . "threads_per_process=1\\n"

scheduler_input2 scheduler.conf
perl defined $more_memory && $more_memory < 240
perl "ChargeFactor=1.0\\n" . "nodes=1\\n" . "mem=" . ($more_memory) . "G\\n" . "node_exclusive=0\\n" . "threads_per_process=1\\n"

scheduler_input3 scheduler.conf
perl defined $more_memory && $more_memory > 240
perl "ChargeFactor=1.0\\n" . "nodes=1\\n" . "mem=243G\\n" . "node_exclusive=1\\n" . "threads_per_process=1\\n"

all_outputfiles *

runtime 1 scheduler.conf Maximum Hours to Run (click here for help setting this correctly) 0.25
perl "runhours=$value\\n"
Maximum Hours to Run must be less than 168
perl $runtime > 168
Maximum Hours to Run must be greater than 0
perl $runtime < 0
Please Enter a Value for Maximum Hours to Run
perl !defined $runtime
Estimate the maximum time your job will need to run. We recommend testing initially with a run of 0.5 h or less, because jobs set for 0.5 h or less dependably run immediately in the "debug" queue. Once you are sure the configuration is correct, you can then increase the time. The reason is that jobs longer than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may run sooner than jobs configured for the full 168 hours.

more_memory 1 Set a value for more memory 16 32 64 128 243
This job will require 128 cpus as configured; if it runs for the full time, it will consume $runtime x 128 cpu hours.
perl defined $more_memory && $more_memory > 128
This job will require 8 cpus as configured; if it runs for the full time, it will consume $runtime x 8 cpu hours.
perl defined $more_memory && $more_memory == 16
This job will require 16 cpus as configured; if it runs for the full time, it will consume $runtime x 16 cpu hours.
perl defined $more_memory && $more_memory == 32
This job will require 32 cpus as configured; if it runs for the full time, it will consume $runtime x 32 cpu hours.
perl defined $more_memory && $more_memory == 64
This job will require 64 cpus as configured; if it runs for the full time, it will consume $runtime x 64 cpu hours.
perl defined $more_memory && $more_memory == 128
This job will require 4 cpus as configured; if it runs for the full time, it will consume 4 x $runtime cpu hours.
perl !defined $more_memory

core_options Core options

specify_NoCore 2 Do not define a core region (-nocore)
perl ($value) ? "-nocore":"" 0
Prevents the program from defining a core region.

specify_HighPosterior 2 Number of high posterior residues at beginning and end of core region (-corerun)
perl (defined $value) ? "-corerun $value":"" 3
Defines the number (X) of contiguous residues with high PP that are required to define a core region. Low values of X make the program more generous in defining the core, whereas high values make it more conservative.

specify_RemoveAll 2 Remove all residues rather than those outside the core region (-removeall)
perl ($value) ? "-removeall":"" 0
All low-PP residues will be removed (not masked with X, but completely removed from the sequence). The authors recommend using this option with caution. Masking, rather than simply removing, low-PP residues within the core region facilitates the inference of the original positional homology among sequences. Complete removal can negatively affect multiple sequence alignment.

specify_NoMoverepeats 2 Do not remove repeated regions of length >20 (-noremoverepeat)
perl ($value) ? "-noremoverepeat":"" 0
By default, PREQUAL will attempt to remove long identical repeats that can occur due to sequencing or assembly errors. When such a sequence is detected, a warning is generated. Choose this option to disable this behaviour.

prob_options Prior and Posterior options

specify_PPAlgorithm 2 Specify the algorithm used to calculate posterior probabilities (-pptype)
all closest longest
closest
Please specify the number of closest relatives
perl $specify_PPAlgorithm eq "closest" && !defined $specify_Closest
Please specify the number of longest sequences
perl $specify_PPAlgorithm eq "longest" && !defined $specify_Longest
Specifies the algorithm that chooses the subset of sequences used when calculating PPs for each individual sequence. The default option is -pptype closest, which compares each sequence against the 10 closest sequences as defined by Kmer distance and also applies mild coverage criteria. In some cases, you may wish to raise the default number of closest relatives Y to improve the accuracy of the PPs (e.g. -pptype closest 20). In other cases, -pptype longest might work better; it chooses the 10 longest sequences rather than selecting by evolutionary divergence (the number of longest sequences may also be changed as above). We recommend using this option with caution, because the longest sequences are often those containing the most errors. Finally, the option -pptype all will use all sequences. This option might be very slow, especially for data sets consisting of many sequences.

specify_Closest 2 Compare each sequence to how many closest sequences?
perl $specify_PPAlgorithm eq "closest"
10
Default is 10.

specify_Longest 2 Compare each sequence to how many longest sequences?
perl $specify_PPAlgorithm eq "longest"
10
Default is 10.
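
The way the three -pptype choices map onto command-line arguments can be sketched in Python. This is only an illustration of the selection logic described above, not the portal's actual code: the function name `pptype_args` and its parameters are hypothetical, and only the flag spellings (`-pptype all`, `-pptype closest N`, `-pptype longest N`) and the default of 10 are taken from the text.

```python
from typing import List, Optional


def pptype_args(algorithm: str = "closest",
                closest: Optional[int] = 10,
                longest: Optional[int] = 10) -> List[str]:
    """Return the -pptype arguments for 'all', 'closest' or 'longest'.

    Hypothetical helper: mirrors the form's rule that 'closest' and
    'longest' each need an accompanying count, while 'all' takes none.
    """
    if algorithm == "all":
        return ["-pptype", "all"]
    if algorithm == "closest":
        if closest is None:
            raise ValueError("Please specify the number of closest relatives")
        return ["-pptype", "closest", str(closest)]
    if algorithm == "longest":
        if longest is None:
            raise ValueError("Please specify the number of longest sequences")
        return ["-pptype", "longest", str(longest)]
    raise ValueError(f"unknown -pptype algorithm: {algorithm!r}")


if __name__ == "__main__":
    print(" ".join(pptype_args()))                       # -pptype closest 10
    print(" ".join(pptype_args("closest", closest=20)))  # -pptype closest 20
    print(" ".join(pptype_args("all")))                  # -pptype all
```

Returning a list of separate tokens rather than one string keeps the result safe to pass to a process launcher without shell quoting.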

specify_PPAlgorithmhidden 2
perl $specify_PPAlgorithm eq "closest"
perl "-pptype closest $specify_Closest"

specify_PPAlgorithmhidden2 2
perl $specify_PPAlgorithm eq "longest"
perl "-pptype longest $specify_Longest"

specify_PPAlgorithmhidden3 2
perl $specify_PPAlgorithm eq "all"
perl "-pptype all"

specify_FilterProp 2 Fraction of the sequences that is maintained (-filterprop)
perl (defined $value) ? "-filterprop $value":""
Sorry, the fraction of sequences maintained must be between 0 and 1
perl $value > 1 || $value < 0
Instead of filtering sequences by posterior probability, the user can choose a proportion of the original data (X%) that they are willing to lose in the filtering, and PREQUAL will adjust the filtering threshold accordingly. In practice, PREQUAL will often filter a slightly higher proportion of the data because of the way regions of low confidence are joined together and how N- and C-termini are dealt with. This is the second of two main filtering approaches available in PREQUAL.

specify_FilterThresh 2 Posterior probability threshold (-filterthresh)
perl (defined $value) ? "-filterthresh $value":""
Sorry, the posterior probability threshold must be between 0 and 1
perl $value > 1 || $value < 0
Specify a PP threshold for filtering. By default, every residue with a PP less than 0.994 will be filtered. The most stringent threshold is 1.0, which keeps only the residues with the very highest confidence. The most liberal threshold, zero, would not filter any residue. The default threshold of 0.994 was experimentally shown to perform well for a wide range of input data. This is the first of two main filtering approaches available in PREQUAL.

specify_FilterJoin 2 Extend filtering over regions of unfiltered sequence less than (-filterjoin)
perl (defined $value) ? "-filterjoin $value":""
PREQUAL joins together low-PP residues according to the threshold X given here (default is 10). If there are fewer than X residues between two residues with low PPs, PREQUAL will filter all residues between them. Low values of X are more generous and will keep more data, whereas high values of X are more stringent and will filter more data.

specify_TaxonNames 2 Select a file that contains a list of taxa names that will not be filtered (-nofilterlist)
perl (defined $value) ? "-nofilterlist filterlist.txt":"" nofilterlist.txt
Specify a file X that contains a list of taxa names that will not be filtered. List one name per line in X.

specify_WordsSeqs 2 Select a file that contains a list of words; sequences whose names contain those words will not be filtered (-nofilterword)
perl (defined $value) ? "-nofilterword nofilterword.txt":"" nofilterword.txt
Specify a file X that contains a list of words; sequences whose names contain any of those words will not be filtered. List one word per line in X.

output_options Output options

specify_Outsuffix 2 Specify a suffix for output files (-outsuffix)
perl (defined $value) ? "-outsuffix $value":"" filtered
The output file will be the original name with X as a suffix [DEFAULT .filtered].

specify_Dosummary 2 Output summary statistics to file (-dosummary)
perl ($value) ? "-dosummary":"" 1
Output a broad summary file about your filtering, including the proportion of each sequence removed and the amount of each sequence defined as core region. This information will be output to a file suffixed .summary

specify_DoDetail 2 Output detailed statistics to file (-dodetail)
perl ($value) ? "-dodetail":"" 0
Generate a detailed summary file about your filtering. The output displays the PP for each residue and sequence, arranged into four columns:
[Residue_number]Amino_acid : Indexing starts at 0
Posterior_probability : Range (0,1)
Whether_the_residue_is_removed : 0 = FALSE; 1 = TRUE
Whether_the_residue_is_in_the_core : 0 = FALSE; 1 = TRUE

specify_NoPPoutput 2 Stop outputting the posterior probability matrix (-noPP)
perl ($value) ? "-noPP":"" 0
Do not output the posterior probability file. This will also force PREQUAL to recalculate PPs on every run, in contrast to the default behaviour of reading in the corresponding .PP file if available.