ParallelStructure on XSEDE
2.3.4
A program to investigate population structure using multi-locus genotype data
Pritchard J.K., Stephens M., and Donnelly P.
Pritchard J.K., Stephens M., and Donnelly P. (2000) Inference of population structure using multilocus genotype data. Genetics. 2000 Jun;155(2):945-59.
Melissa J. Hubisz, Daniel Falush, Matthew Stephens, and Jonathan K. Pritchard (2009) Inferring weak population structure with the assistance of sample group information.
Molecular Ecology Resources (2009) 9, 1322–1332.
Francois Besnier and Kevin A. Glover (2013) ParallelStructure: A R Package to Distribute Parallel Runs of the Population Genetics Program STRUCTURE on Multi-Core Computers. PLoS One 8(7): e70651.
Population Genetics
parallelstructure_xsede
pstructure_comet
perl
""
1
number_nodes
2
scheduler.conf
perl
"nodes=1\\n" .
"node_exclusive=0\\n" .
"threads_per_process=12\\n"
infile
Input File (must be in proper Structure format)
data.txt
pdf_results
*.pdf
txt_results
*.txt
txt_results2
*.TXT
results_results_f
results*_f
results_results_q
results*_q
csv_results
*.csv
runtime
1
scheduler.conf
Maximum Hours to Run (up to 168 hours)
0.5
The maximum hours to run must not exceed 168
perl
$runtime > 168.0
The maximum hours to run must be at least 0.05
perl
$runtime < 0.05
perl
"runhours=$value\\n"
The job will run on 12 processors as configured. If it runs for the entire configured time, it will consume 12 x $runtime CPU hours.
perl
$runtime > 0
Estimate the maximum time your job will need to run. We recommend testing initially with a run of 0.5 h or less, because jobs set for 0.5 h or less dependably run immediately in the "debug" queue.
Once you are sure the configuration is correct, you can then increase the time. Jobs longer than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may
run sooner than jobs configured for the full 168 hours.
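The runtime checks and scheduler settings described above can be sketched in Python as follows. This is an illustrative sketch only, not the portal's own code: the function name `build_scheduler_conf` is hypothetical, real newlines are assumed where the Perl fragments write `\\n`, and the limits (0.05 to 168 hours, 12 threads) are taken from the text above.

```python
THREADS_PER_PROCESS = 12  # from the scheduler.conf settings above
MIN_HOURS = 0.05          # lower bound from the runtime check
MAX_HOURS = 168.0         # upper bound from the runtime check


def build_scheduler_conf(runtime_hours):
    """Return (scheduler.conf text, worst-case CPU hours).

    Hypothetical helper mirroring the two range checks and the
    "runhours=" line described above.
    """
    if runtime_hours > MAX_HOURS:
        raise ValueError("The maximum hours to run must not exceed 168")
    if runtime_hours < MIN_HOURS:
        raise ValueError("The maximum hours to run must be at least 0.05")
    conf = (
        "nodes=1\n"
        "node_exclusive=0\n"
        f"threads_per_process={THREADS_PER_PROCESS}\n"
        f"runhours={runtime_hours}\n"
    )
    # A job that uses its whole allowance is charged threads x hours.
    cpu_hours = THREADS_PER_PROCESS * runtime_hours
    return conf, cpu_hours


conf_text, cpu_hours = build_scheduler_conf(0.5)
print(conf_text, end="")
print(f"worst-case CPU hours: {cpu_hours}")
```

A 0.5 h request stays within the range that the help text says runs in the "debug" queue, so it is a reasonable value for a first test.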
joblist_file
Joblist File
joblist.txt
Please select a joblist file
perl
!defined $joblist_file
data_file_params
Data File Configuration
set_numinds
Number of individuals in the population (NUMINDS)
3
perl
"numinds = $value,"
Please specify the number of individuals in the population
perl
!defined $set_numinds
set_numloci
Number of loci in the dataset (NUMLOCI)
4
perl
"numloci=$value,"
Please specify the number of loci in the dataset
perl
!defined $set_numloci
set_ploidy
Ploidy of the dataset (PLOIDY)
5
perl
(defined $set_ploidy) ? "ploidy=$value," :""
set_missing
Value given to missing genotype data (MISSING)
6
perl
(defined $set_missing) ? "missing=$value," :""
Must be an integer, and must not appear elsewhere in the data set. Default is -9.
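As a rough illustration of how the data file settings above turn into parameter fragments, the following Python sketch mirrors the Perl templates: NUMINDS and NUMLOCI are always emitted, while PLOIDY and MISSING are emitted only when defined. The function name `data_file_fragment` is hypothetical, and joining the fragments in the listed order is an assumption about how the interface assembles them.

```python
def data_file_fragment(numinds, numloci, ploidy=None, missing=None):
    """Build the comma-terminated fragments for the data file settings.

    Hypothetical helper; the fragment spellings copy the templates above,
    including the spaces in "numinds = ".
    """
    if numinds is None:
        raise ValueError("Please specify the number of individuals in the population")
    if numloci is None:
        raise ValueError("Please specify the number of loci in the dataset")
    parts = [f"numinds = {numinds},", f"numloci={numloci},"]
    if ploidy is not None:   # optional: omitted when not set
        parts.append(f"ploidy={ploidy},")
    if missing is not None:  # optional: omitted when not set
        parts.append(f"missing={missing},")
    return "".join(parts)


fragment = data_file_fragment(100, 10, ploidy=2, missing=-9)
print(fragment)
```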
set_onerowperind
The data for each individual are arranged in a single row (ONEROWPERIND)
7
perl
( $value) ? "onerowperind=1," :""
ONEROWPERIND (Boolean) The data for each individual are arranged in a single row. E.g.,
for diploid data, this would mean that the two alleles for each locus are in consecutive order in the same row, rather than being arranged in the same column, in two consecutive rows.
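The difference between the two layouts can be shown with a small Python sketch. It takes the rows of one individual in the multi-row layout (one row per allele copy) and interleaves them so that the alleles of each locus sit next to each other, as ONEROWPERIND expects. The function name `to_one_row` and the allele values are made up for illustration.

```python
def to_one_row(allele_rows):
    """Interleave per-copy rows so the alleles of each locus are consecutive.

    allele_rows holds one list per allele copy (two lists for diploid data),
    each with one entry per locus.
    """
    return [allele for locus in zip(*allele_rows) for allele in locus]


# One diploid individual, three loci, -9 marking a missing genotype.
two_rows = [
    [101, 200, 55],   # first allele copy at loci 1..3
    [103, 200, -9],   # second allele copy at loci 1..3
]
one_row = to_one_row(two_rows)
print(one_row)
```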
use_popinfo
Use prior population information to assign individuals to clusters (USEPOPINFO)
23
perl
( $value) ? "usepopinfo=1," :""
To use population information, you must indicate that your input file contains an indicator variable which says whether to use popinfo (POPFLAG)
perl
$use_popinfo && !$set_popflag
set_popdata
Input file contains a user-defined population-of-origin for each individual (POPDATA)
9
perl
( $value) ? "popdata=1," :""
set_popflag
Input file contains an indicator variable which says whether to use popinfo (POPFLAG)
10
perl
$use_popinfo
perl
( $value) ? "popflag=1," :""
set_locdata
Input file contains a user-defined sampling location for each individual (LOCDATA)
11
perl
$use_locprior
perl
( $value) ? "locdata=1," :""
set_phenotype
Input file contains a column of phenotype information (PHENOTYPE)
12
perl
( $value) ? "phenotype=1," :""
set_markernames
The top row of the data file contains a list of L names corresponding to the markers used (MARKERNAMES)
14
perl
( $value) ? "markernames=1,":""
set_recessivealleles
Next row of data file contains a list of L integers indicating which alleles are recessive at each locus (RECESSIVEALLELES)
15
perl
( $value) ? "recessivealleles=1," :""
Setting this to 1 implies that the dominant marker model is in use.
use_linkagemodel
Use the linkage model (LINKAGE)
22
perl
( $value) ? "linkage=1," :""
RLOG10START sets the initial value of recombination rate r per unit distance.
RLOG10MIN and RLOG10MAX set the minimum and maximum allowed values for log10r. RLOG10PROPSD sets the size of the proposed changes to log10r in
each update. The front end makes some guesses about these, but some care on the part of the user is required to be sure that the values are
sensible for the particular application.
set_phased
Indicates that data are in correct phase (PHASED)
17
perl
$use_linkagemodel
perl
( $value) ? "phased=1," :""
When the linkage model is used with polyploids, PHASED=1 is required.
perl
$set_ploidy > 2 && !$set_phased
For use with the linkage model. Indicates that data are in correct phase. If LINKAGE=1 and PHASED=0, then PHASEINFO can be used; this is an extra line in the input file that gives phase probabilities. When PHASEINFO=0, each value is set to 0.5,
implying no phase information. When the linkage model is used with polyploids, PHASED=1 is required.
set_phaseinfo
The row(s) of genotype data for each individual are followed by a row of information about haplotype phase (PHASEINFO)
18
perl
$use_linkagemodel && !$set_phased
perl
( $value) ? "phaseinfo=1," :""
The row(s) of genotype data for each individual are followed by a row of information about haplotype phase. This is for use with the linkage model only. See sections 2 and 3.1 for further details.
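The dependencies among LINKAGE, PHASED, and PHASEINFO described above can be summarized in a short Python sketch. It is a sketch of the stated rules only, not the interface's actual validation code; the function name `phase_option_errors` and the wording of the first and last messages are assumptions.

```python
def phase_option_errors(ploidy, use_linkagemodel, phased, phaseinfo):
    """Return a list of rule violations for the phase-related options."""
    errors = []
    if not use_linkagemodel:
        # PHASED and PHASEINFO are only offered with the linkage model.
        if phased or phaseinfo:
            errors.append("PHASED and PHASEINFO apply only when LINKAGE=1")
        return errors
    if ploidy > 2 and not phased:
        errors.append(
            "When the linkage model is used with polyploids, PHASED=1 is required."
        )
    if phased and phaseinfo:
        # PHASEINFO is meant for the LINKAGE=1, PHASED=0 case.
        errors.append("PHASEINFO is only used when PHASED=0")
    return errors


print(phase_option_errors(ploidy=2, use_linkagemodel=True, phased=False, phaseinfo=True))
print(phase_option_errors(ploidy=4, use_linkagemodel=True, phased=False, phaseinfo=False))
```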
second_options
Run Configuration Options (file extraparams)
set_noadmix
Assume the model without admixture (NOADMIX)
21
perl
( $value) ? "noadmix=1," :""
Each individual is assumed to be completely from one of the K populations. In the output, instead of printing the average value of Q as in the admixture case, the program prints the posterior probability that
each individual is from each population. 1 = no admixture; 0 = model with admixture.
use_locprior
Use location information to improve the performance on data that are weakly informative about structure (LOCPRIOR)
24
perl
( $value) ? "locprior=1," :""
use_inferalpha
Infer the value of the model parameter alpha from the data (INFERALPHA)
27
perl
!$set_noadmix
perl
( $value) ? "inferalpha=1," :""
Infer the value of the model parameter α from the data; otherwise α is fixed at the value given by ALPHA.
This option applies only to the admixture model and is not used when NOADMIX is set.
use_popalphas
Infer a separate α for each population (POPALPHAS)
28
perl
( $value) ? "popalphas=1," :""
Not recommended in most cases but may be useful for situations with asymmetric admixture.
set_alpha
Dirichlet parameter (α) for degree of admixture (ALPHA)
29
perl
$use_inferalpha
perl
($set_alpha) ? "alpha=$value,":""
Dirichlet parameter (α) for degree of admixture (this is the initial value if INFERALPHA==1).
use_freqscorr
Use the F model, in which the allele frequencies are correlated across populations (FREQSCORR)
25
perl
(defined $use_freqscorr) ? "freqscorr=$value," :""
FREQSCORR (Boolean) Use the “F model”, in which the allele frequencies are correlated across populations (Falush et al., 2003a). More
specifically, rather than assuming a prior in which the allele frequencies in each population are independent draws from a uniform Dirichlet
distribution, we start with a distribution which is centered around the mean allele frequencies in the sample. This model is more realistic for
very closely related populations (where we expect the allele frequencies to be similar across populations), and can produce better clustering
(section 3.2). The prior of Fk is set using FPRIORMEAN, and FPRIORSD. There may be a tendency to overestimate K when FREQSCORR is turned on.
set_fpriormean
Set mean for Fk (FPRIORMEAN)
33
perl
($set_fpriormean) ? "fpriormean=$value,":""
The prior for Fk is taken to be Gamma with mean FPRIORMEAN, and standard deviation FPRIORSD.
Our default settings place a lot of weight on small values of F. We find that this makes the algorithm sensitive to subtle structure, but at
some increased risk of overestimating K (Falush et al., 2003a).
set_fpriorsd
Set std deviation for Fk (FPRIORSD)
33
perl
($set_fpriorsd) ? "fpriorsd=$value,":""
The prior for Fk is taken to be Gamma with mean FPRIORMEAN, and standard deviation FPRIORSD.
Our default settings place a lot of weight on small values of F. We find that this makes the algorithm sensitive to subtle structure, but at
some increased risk of overestimating K (Falush et al., 2003a).
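To make the Gamma prior above concrete, the following Python sketch converts a mean and standard deviation into the usual shape and scale parameters, using the standard identities mean = shape × scale and variance = shape × scale². The function name `gamma_shape_scale` is hypothetical, and the values 0.01 and 0.05 are illustrative inputs, not a statement of what this interface uses by default.

```python
import math


def gamma_shape_scale(mean, sd):
    """Convert (mean, sd) of a Gamma distribution to (shape, scale)."""
    if mean <= 0 or sd <= 0:
        raise ValueError("mean and sd must both be positive")
    shape = (mean / sd) ** 2   # from mean^2 / variance
    scale = sd ** 2 / mean     # from variance / mean
    return shape, scale


# Illustrative values for FPRIORMEAN and FPRIORSD (assumed, see above).
shape, scale = gamma_shape_scale(0.01, 0.05)
print(f"shape={shape:.4f}, scale={scale:.4f}")

# Round trip: recover the mean and sd from shape and scale.
print(f"mean={shape * scale:.4f}, sd={math.sqrt(shape) * scale:.4f}")
```

When the standard deviation is larger than the mean, the shape is below 1 and the density decreases monotonically from zero, which is consistent with the remark that such settings place a lot of weight on small values of F.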
use_onefst
Assume the same value of Fk for all populations (ONEFST)
26
perl
( $value) ? "onefst=1," :""
Assume the same value of Fk for all populations (analogous to Wright’s traditional FST). This is not recommended for most
data, because in practice you probably expect different levels of divergence in each population. When K = 2 it may sometimes be difficult to
estimate two values of FST separately (but see Harter et al. (2004)). When you’re trying to estimate K, you should use the same model for all K
(we suggest ONEFST=0).
output_options
Output Options
set_printqhat
Print Q estimates to a separate file with suffix q (PRINTQHAT)
72
perl
( $value) ? "printqhat=1," :""
When this is turned on, the point estimate for Q is not only printed into the main results file, but also into a separate file with suffix “q”. This file is required in order to run the companion program STRAT.
set_ancestdist
Collect information about the distribution of Q for each individual (ANCESTDIST)
76
perl
( $value) ? "ancestdist=1," :""
Collect information about the distribution of Q for each individual, as well as just estimating the mean.
When this is turned on, the output file includes the left- and right-hand ends of the probability intervals for each q(i).
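For readers unfamiliar with probability intervals, the Python sketch below shows one simple way to obtain left- and right-hand ends from a set of sampled q values, using nearest-rank quantiles. It only illustrates the concept: the function name `probability_interval`, the 90% coverage, and the sample values are assumptions, and Structure's own calculation may differ.

```python
def probability_interval(samples, coverage=0.90):
    """Return (left, right) ends of a central interval from sampled values.

    Uses nearest-rank quantiles on the sorted samples; an equal share of
    the excluded probability is cut from each tail.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    tail = (1.0 - coverage) / 2.0
    last = len(ordered) - 1
    left = ordered[round(tail * last)]
    right = ordered[round((1.0 - tail) * last)]
    return left, right


# Made-up samples of q for one individual and one cluster: 0.00, 0.01, ..., 1.00.
q_samples = [i / 100 for i in range(101)]
left, right = probability_interval(q_samples)
print(left, right)
```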