PREQUAL on XSEDE1.02PRE-alignment QUALity assessment on XSEDESimon WhelanWhelan, S., Irisarri, I., and Burki, F. (2018) PREQUAL: detecting non-homologous characters in sets of unaligned homologous sequences. Bioinformatics 34, 3929-3930 10.1093/bioinformatics/bty448Phylogeny / prequal_xsedeinvoke_runperl"/expanse/projects/ngbt/opt/expanse/prequal/prequal/prequal"0infileFasta Sequences Fileperl"infile.txt"99infile.txtscheduler_inputscheduler.confperl!defined $more_memoryperl
"ChargeFactor=1.0\\n" .
"nodes=1\\n" .
"mem=8G\\n" .
"node_exclusive=0\\n" .
"threads_per_process=1\\n"
scheduler_input2scheduler.confperldefined $more_memory && $more_memory < 240perl
"ChargeFactor=1.0\\n" .
"nodes=1\\n" .
"mem=" . ($more_memory) . "G\\n" .
"node_exclusive=0\\n" .
"threads_per_process=1\\n"
scheduler_input3scheduler.confperldefined $more_memory && $more_memory > 240perl
"ChargeFactor=1.0\\n" .
"nodes=1\\n" .
"mem=243G\\n" .
"node_exclusive=1\\n" .
"threads_per_process=1\\n"
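The three scheduler.conf rules above can be read as a single decision on the requested memory. The following is a minimal Python sketch of that decision, not the portal's actual implementation; the function name `scheduler_conf` is hypothetical, and the handling of exactly 240 (which matches neither `< 240` nor `> 240`) is an assumption made only to keep the sketch total.

```python
def scheduler_conf(more_memory=None):
    """Return scheduler.conf text for a requested memory size in GB.

    more_memory=None stands for "parameter not set" (Perl: !defined $more_memory).
    """
    if more_memory is None:
        mem, exclusive = "8G", 0          # default rule
    elif more_memory < 240:
        mem, exclusive = f"{more_memory}G", 0   # shared node, requested memory
    elif more_memory > 240:
        mem, exclusive = "243G", 1        # whole node
    else:
        # Assumption: exactly 240 is covered by neither rule in the source,
        # so this sketch rejects it instead of guessing.
        raise ValueError("more_memory == 240 is not covered by any rule")
    lines = [
        "ChargeFactor=1.0",
        "nodes=1",
        f"mem={mem}",
        f"node_exclusive={exclusive}",
        "threads_per_process=1",
    ]
    return "\n".join(lines) + "\n"
```

Calling `scheduler_conf(64)` yields a shared-node configuration with `mem=64G`, while `scheduler_conf(243)` switches to an exclusive node.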
all_outputfiles*runtime1scheduler.confMaximum Hours to Run (click here for help setting this correctly)0.25perl"runhours=$value\\n"Maximum Hours to Run must be less than 168perl$runtime > 168Maximum Hours to Run must be greater than 0perl$runtime < 0Please Enter a Value for Maximum Hours to Runperl!defined $runtime Estimate the maximum time your job will need to run. We recommend testing initially with a run of 0.5 h or less, because jobs set for 0.5 h or less dependably run immediately in the "debug" queue.
Once you are sure the configuration is correct, you can then increase the time. Jobs longer than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may
run sooner than jobs configured for the full 168 hours.
more_memory1Set a value for more memory163264128243This job will require 128 cpus as configured, if it runs for the full time, it will consume $runtime x 128 cpu hours.perldefined $more_memory && $more_memory > 128This job will require 8 cpus as configured, if it runs for the full time, it will consume $runtime x 8 cpu hours.perldefined $more_memory && $more_memory == 16This job will require 16 cpus as configured, if it runs for the full time, it will consume $runtime x 16 cpu hours.perldefined $more_memory && $more_memory == 32This job will require 32 cpus as configured, if it runs for the full time, it will consume $runtime x 32 cpu hours.perldefined $more_memory && $more_memory == 64This job will require 64 cpus as configured, if it runs for the full time, it will consume $runtime x 64 cpu hours.perldefined $more_memory && $more_memory == 128This job will require 4 cpus as configured, if it runs for the full time, it will consume $runtime x 4 cpu hours.perl!defined $more_memorycore_optionsCore optionsspecify_NoCore2Do not define a core region (-nocore)perl($value) ? "-nocore":""0
Prevents the program from defining a core region.
specify_HighPosterior2Number of high posterior residues at beginning and end of core region (-corerun)perl(defined $value) ? "-corerun $value":""3
Defines the number (X) of contiguous residues with high PP that are required to define a core region. Low values of X will make the program more generous in
defining the core, whereas high values will make the program more conservative.
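To illustrate how the run length X affects the core, here is a minimal Python sketch, assuming the core spans from the first run of X contiguous high-PP residues to the last such run. The function name `core_region` is hypothetical, and using 0.994 as the cutoff for "high PP" is an assumption; PREQUAL's own procedure may differ in detail.

```python
def core_region(pps, corerun=3, high=0.994):
    """Return (start, end) inclusive indices of the core region, or None.

    pps     : per-residue posterior probabilities for one sequence
    corerun : number of contiguous high-PP residues required (X)
    high    : cutoff for "high PP" (assumed value, see lead-in)
    """
    n = len(pps)
    # Start index of every window of `corerun` residues that are all high-PP.
    starts = [
        i for i in range(n - corerun + 1)
        if all(p >= high for p in pps[i:i + corerun])
    ]
    if not starts:
        return None  # no run is long enough, so no core is defined
    # Core runs from the first qualifying window to the end of the last one.
    return starts[0], starts[-1] + corerun - 1
```

With the same PP profile, a smaller `corerun` finds a core more readily, while a larger value can leave the sequence without any core at all.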
specify_RemoveAll2Remove all residues rather than those outside the core region (-removeall)perl($value) ? "-removeall":""0
All low-PP residues will be removed completely from the sequence rather than masked with X. The authors recommend using this option with caution.
Masking, rather than simply removing, low-PP residues within the core region facilitates the inference
of the original positional homology among sequences. Complete removal can negatively affect multiple sequence alignment.
specify_NoMoverepeats2Do not remove repeated regions of length >20 (-noremoverepeat)perl($value) ? "-noremoverepeat":""0
By default PREQUAL will attempt to remove long identical repeats that can occur due to sequencing or assembly errors. When such a sequence is detected, a
warning will be generated. Choose this option to stop this functionality.
prob_optionsPrior and Posterior optionsspecify_PPAlgorithm2Specify the algorithm used to calculate posterior probabilities (-pptype)allclosestlongestclosestPlease specify the number of closest relativesperl$specify_PPAlgorithm eq "closest" && !defined $specify_ClosestPlease specify the number of longest sequencesperl$specify_PPAlgorithm eq "longest" && !defined $specify_Longest
Specifies the algorithm to choose the subset of sequences to use when calculating PPs for each individual sequence. The default option is -pptype closest, which
compares each sequence against the 10 closest sequences defined by Kmer distance and also has mild coverage criteria. In some cases, you may wish to
raise the default number of closest relatives Y to improve the accuracy of the PPs (e.g. -pptype closest 20). In other cases, -pptype longest might work better,
which chooses the 10 longest sequences instead of using evolutionary divergence (the number of longest sequences may also be changed as above). We recommend using
this option with caution, because the longest sequences are often those containing the most errors. Finally, the option -pptype all will use all sequences.
This option might be very slow, especially for data sets consisting of many sequences.
specify_Closest2Compare each sequence to how many closest sequences?perl$specify_PPAlgorithm eq "closest"10
Default is 10.
specify_Longest2Compare each sequence to how many longest sequences?perl$specify_PPAlgorithm eq "longest"10
Default is 10.
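The three forms of -pptype described above (closest with a count, longest with a count, or all) can be assembled as in the following Python sketch. The helper name `pptype_args` is hypothetical; only the option strings themselves come from this document.

```python
def pptype_args(algorithm="closest", count=10):
    """Build the -pptype arguments as a list of command-line tokens."""
    if algorithm == "all":
        # "all" takes no count: every sequence is used.
        return ["-pptype", "all"]
    if algorithm in ("closest", "longest"):
        if count is None:
            # Mirrors the form's check that a number must be supplied.
            raise ValueError(f"a count is required for -pptype {algorithm}")
        return ["-pptype", algorithm, str(count)]
    raise ValueError(f"unknown -pptype algorithm: {algorithm!r}")
```

For example, `pptype_args("closest", 20)` corresponds to the `-pptype closest 20` case mentioned in the help text.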
specify_PPAlgorithmhidden2perl$specify_PPAlgorithm eq "closest"perl"-pptype closest $specify_Closest"specify_PPAlgorithmhidden22perl$specify_PPAlgorithm eq "longest"perl"-pptype longest $specify_Longest"specify_PPAlgorithmhidden32perl$specify_PPAlgorithm eq "all"perl"-pptype all"specify_FilterProp2Fraction of the sequences are maintained. (-filterprop)perl(defined $value) ? "-filterprop $value":""Sorry the fraction of sequences maintained must be between 0 and 1perl$value > 1 || $value < 0
Instead of filtering sequences by posterior probability, the user can choose the proportion of the original data (X%) that they are willing to lose in the filtering, and PREQUAL will
adjust the filtering threshold accordingly. In practice, PREQUAL will often filter a slightly higher proportion of data because of the way regions of low confidence are
joined together and how N- and C-termini are dealt with. This is the second of two main filtering approaches available in PREQUAL.
specify_FilterThresh2Posterior probability threshold (-filterthresh)perl(defined $value) ? "-filterthresh $value":""Sorry the posterior probability threshold must be between 0 and 1perl$value > 1 || $value < 0
Specify a PP threshold for filtering. By default, every residue with PP less than 0.994 will be filtered. The most stringent threshold is 1.0, which will
keep only the residues with the very highest confidence. The most liberal threshold of zero would not filter any residue. The default threshold of
0.994 was experimentally shown to perform well for a wide range of input data. This is the first of two main filtering approaches available in PREQUAL.
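The threshold rule can be illustrated with a short Python sketch that flags every residue whose PP is strictly below the threshold. The function name `low_pp_mask` is hypothetical, and the sketch covers only the per-residue comparison, not the joining or terminus handling that PREQUAL applies afterwards.

```python
def low_pp_mask(pps, threshold=0.994):
    """Return a list of booleans: True where the residue would be filtered.

    A residue is flagged when its posterior probability is strictly
    less than the threshold, as described for -filterthresh.
    """
    return [p < threshold for p in pps]
```

A threshold of 0 flags nothing, because no PP is below 0, while a threshold of 1.0 flags everything except residues with PP exactly 1.0.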
specify_FilterJoin2Extend filtering over regions of unfiltered sequence less than (-filterjoin)perl(defined $value) ? "-filterjoin $value":""
PREQUAL joins together low-PP residues according to the threshold X (default is 10). If there are fewer than X residues between two residues with low PPs,
PREQUAL will filter all residues between them. Low values of X are more generous and will keep more data, whereas high values of X are more stringent
and will filter more data.
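The joining rule can be sketched in Python as follows: whenever two consecutive low-PP residues are separated by fewer than X unflagged residues, everything between them is flagged as well. The function name `join_filtered` is hypothetical, and the sketch assumes the rule is applied to pairs of neighbouring low-PP residues in the original mask.

```python
def join_filtered(mask, filterjoin=10):
    """Extend a low-PP mask across short gaps.

    mask       : list of booleans, True where a residue has low PP
    filterjoin : X; gaps of fewer than X residues between two low-PP
                 residues are filtered too
    """
    joined = list(mask)
    low = [i for i, flagged in enumerate(mask) if flagged]
    # Walk over neighbouring pairs of low-PP positions.
    for a, b in zip(low, low[1:]):
        gap = b - a - 1  # number of residues strictly between a and b
        if 0 < gap < filterjoin:
            for i in range(a + 1, b):
                joined[i] = True
    return joined
```

Raising `filterjoin` can only flag more residues, which matches the description of high X values as more stringent.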
specify_TaxonNames2Select a file that contains a list of taxa names that will not be filtered (-nofilterlist)perl(defined $value) ? "-nofilterlist nofilterlist.txt":""nofilterlist.txt
Specify a file X that contains a list of taxa names that will not be filtered. X must contain one name per line.
specify_WordsSeqs2Select a file that contains a list of words; sequences whose names contain those words will not be filtered (-nofilterword)perl(defined $value) ? "-nofilterword nofilterword.txt":""nofilterword.txt
Specify a file X that contains a list of words; sequences whose names contain any of those words will not be filtered. X must contain one word per line.
output_optionsOutput optionsspecify_Outsuffix2Specify a suffix for output files (-outsuffix)perl(defined $value) ? "-outsuffix $value":""filtered
Output file will be the original name with X as a suffix [DEFAULT .filtered].
specify_Dosummary2Output summary statistics to file (-dosummary)perl($value) ? "-dosummary":""1
Output a broad summary file about your filtering, including information about the proportion of individual sequences removed and the amount of individual
sequences defined as core region. This information will be output to a file with the suffix .summary.
specify_DoDetail2Output detailed statistics to file (-dodetail)perl($value) ? "-dodetail":""0
Generate a detailed summary file about your filtering. The output displays the PP for each residue of each sequence, arranged into four columns:
[Residue_number]Amino_acid: indexing starts at 0
Posterior_probability: range (0,1)
Whether_the_residue_is_removed: 0 = FALSE; 1 = TRUE
Whether_the_residue_is_in_the_core: 0 = FALSE; 1 = TRUE
specify_NoPPoutput2Stop outputting the posterior probability matrix (-noPP)perl($value) ? "-noPP":""0
Do not output the posterior probability file. This will also force PREQUAL to recalculate PPs every run, in contrast to the default behaviour of reading in the corresponding .PP file if available.