Parallel Structure on XSEDE
2.3.4
A program to investigate population structure using multi-locus genotype data
Pritchard J.K., Stephens M., and Donnelly P.
Pritchard J.K., Stephens M., and Donnelly P. (2000) Inference of population structure using multilocus genotype data. Genetics. 2000 Jun;155(2):945-59.
Melissa J. Hubisz, Daniel Falush, Matthew Stephens, and Jonathan K. Pritchard (2009) Inferring weak population structure with the assistance of sample group information.
Molecular Ecology Resources (2009) 9, 1322–1332.
Francois Besnier and Kevin A. Glover (2013) ParallelStructure: A R Package to Distribute Parallel Runs of the Population Genetics Program STRUCTURE on Multi-Core Computers. PLoS One 8(7): e70651.
Population Genetics
structure_xsede
pstructure_comet
perl
""
1
number_nodes
2
scheduler.conf
perl
"nodes=1\\n" .
"node_exclusive=0\\n" .
"threads_per_process=12\\n"
infile
Input File (must be in proper Structure format)
data.txt
pdf_results
*.pdf
txt_results
*.txt
results_results
results*
csv_results
*.csv
runtime
1
scheduler.conf
Maximum Hours to Run (up to 168 hours)
0.5
The maximum hours to run must not exceed 168
perl
$runtime > 168.0
The maximum hours to run must be at least 0.05
perl
$runtime < 0.05
perl
"runhours=$value\\n"
The job will run on 12 processors as configured. If it runs for the entire configured time, it will consume 12 × $runtime CPU hours.
perl
$runtime > 0
Estimate the maximum time your job will need to run. We recommend testing initially with a run of 0.5 h or less, because jobs set for 0.5 h or less dependably run immediately in the "debug" queue.
Once you are sure the configuration is correct, you can then increase the time. Jobs longer than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may
run sooner than jobs configured for the full 168 hours.
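The run-time rules above can be sketched in a few lines of Python. This is an illustration only: the limits (0.05 to 168 hours), the 12-processor figure, and the 0.5 h boundary between the "debug" and "normal" queues come from the text above, while the function name `plan_run` and its return shape are hypothetical and are not part of the portal.

```python
# Sketch only: limits and queue names are taken from the help text above;
# the function name and return shape are hypothetical.
CORES = 12           # "The job will run on 12 processors as configured"
MIN_HOURS = 0.05     # lower bound enforced by the runtime check
MAX_HOURS = 168.0    # upper bound enforced by the runtime check
DEBUG_LIMIT = 0.5    # jobs of 0.5 h or less go to the "debug" queue


def plan_run(runtime_hours):
    """Return (queue, max_cpu_hours) for a requested wall-clock limit."""
    if runtime_hours < MIN_HOURS or runtime_hours > MAX_HOURS:
        raise ValueError("runtime must be between 0.05 and 168 hours")
    queue = "debug" if runtime_hours <= DEBUG_LIMIT else "normal"
    # Worst case: the job uses every core for the whole configured time.
    return queue, CORES * runtime_hours


print(plan_run(0.5))   # short test run
print(plan_run(24))    # production run
```

The CPU-hour figure is an upper bound; a job that finishes early consumes less.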
joblist_file
Joblist File
joblist.txt
Please select a joblist file
perl
!defined $joblist_file
basic_run_params
Basic Run Parameters
set_maxpops
Number of populations assumed for a particular run of the program
3
perl
"maxpops= $value,"
Please specify the number of populations
perl
!defined $set_maxpops
set_burnin
Set the burnin value
3
perl
"burnin = $value,"
Please specify the burnin value
perl
!defined $set_burnin
set_numreps
Number of MCMC reps after burnin
3
perl
"numreps = $value,"
Please specify the number of MCMC reps after burnin
perl
!defined $set_numreps
data_file_params
Data File Configuration
set_numinds
Number of individuals in the population
3
perl
"numinds = $value,"
Please specify the number of individuals in the population
perl
!defined $set_numinds
set_numloci
Number of loci in the dataset
4
perl
"numloci=$value,"
Please specify the number of loci in the dataset
perl
!defined $set_numloci
set_ploidy
Ploidy of the dataset
5
perl
(defined $set_ploidy) ? "ploidy=$value," :""
set_missing
Value given to missing genotype data
6
perl
(defined $set_missing) ? "missing=$value," :""
Must be an integer, and must not appear elsewhere in the data set. Default is -9
set_onerowperind
The data for each individual are arranged in a single row
7
perl
( $value) ? "onerowperind=1," :""
ONEROWPERIND (Boolean) The data for each individual are arranged in a single row. E.g.,
for diploid data, this would mean that the two alleles for each locus are in consecutive order in the same row, rather than being arranged in the same column, in two consecutive rows.
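As a rough illustration of the two layouts, the Python sketch below interleaves the rows of one individual into the single-row (ONEROWPERIND=1) order. It assumes the genotype columns have already been separated from any label, population, or other leading columns; the function name `to_one_row` and the sample genotypes are hypothetical.

```python
def to_one_row(rows_for_individual):
    """Interleave the per-copy rows of one individual, locus by locus.

    rows_for_individual holds one list of alleles per chromosome copy
    (two lists for diploid data), each with one entry per locus.
    """
    n_loci = len(rows_for_individual[0])
    if any(len(row) != n_loci for row in rows_for_individual):
        raise ValueError("all rows must cover the same number of loci")
    # For each locus, emit the alleles from every row consecutively.
    return [row[locus] for locus in range(n_loci) for row in rows_for_individual]


# Hypothetical diploid individual genotyped at three loci.
two_rows = [[102, 156, 165],
            [102, 148, 163]]
print(to_one_row(two_rows))
```

For the sample above, the two alleles of the first locus (102, 102) come first, followed by those of the second (156, 148) and third (165, 163) loci.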
set_labels
Input file contains labels (names) for each individual
8
perl
( $value) ? "labels=1," :""
set_popdata
Input file contains a user-defined population-of-origin for each individual
9
perl
( $value) ? "popdata=1," :""
set_popflag
Input file contains an indicator variable which says whether to use popinfo
10
perl
$use_popinfo
perl
( $value) ? "popflag=1," :""
set_locdata
Input file contains a user-defined sampling location for each individual
11
perl
$use_locprior
perl
( $value) ? "locdata=1," :""
set_phenotype
Input file contains a column of phenotype information
12
perl
( $value) ? "phenotype=1," :""
set_extracols
Number of additional columns of data after the Phenotype before the genotype data start
13
perl
(defined $set_extracols) ? "extracols=$set_extracols," :""
These are ignored by the program
set_markernames
The top row of the data file contains a list of L names corresponding to the markers used.
14
perl
( $value) ? "markernames=1,":""
set_recessivealleles
Next row of data file contains a list of L integers indicating which alleles are recessive at each locus.
15
perl
( $value) ? "recessivealleles=1," :""
Setting this to 1 implies that the dominant marker model is in use.
set_mapdistances
The next row of the data file contains a list of mapdistances between neighboring loci.
16
perl
( $value) ? "mapdistances=1," :""
The next row of the data file (or the first row if MARKERNAMES==0) contains a list of mapdistances between neighboring loci.
Advanced data file option.
set_phased
Indicates that data are in correct phase.
17
perl
$use_linkagemodel
perl
( $value) ? "phased=1," :""
When the linkage model is used with polyploids, PHASED=1 is required.
perl
$set_ploidy > 2
For use with linkage model. Indicates that data are in correct phase. If (LINKAGE=1, PHASED=0), then PHASEINFO can be used; this is an extra line in the input file that gives phase probabilities. When PHASEINFO=0 each value is set to 0.5,
implying no phase information. When the linkage model is used with polyploids, PHASED=1 is required.
set_phaseinfo
The row(s) of genotype data for each individual are followed by a row of information about haplotype phase.
18
perl
$use_linkagemodel && !$set_phased
perl
( $value) ? "phaseinfo=1," :""
The row(s) of genotype data for each individual are followed by a row of information about haplotype phase. This is for use with the linkage model only. See sections 2 and 3.1 for further details.
set_markovphase
The phase information follows a Markov model.
19
perl
( $value) ? "markovphase=1," :""
See sections 2.2 and 9.6 for details.
set_notambiguous
Provide an integer that indicates genotype data at a marker are unambiguous
20
perl
$set_recessivealleles && $set_ploidy > 2
perl
(defined $value) ? "notambiguous=$value," :""
Structure allows the data to consist of a mixture of loci for which there is, and is not, genotypic ambiguity.
If some loci are not ambiguous, set the code NOTAMBIGUOUS to an integer that does not match any of the alleles in
the data, and that does not equal MISSING. Then in the recessive alleles line at the top of the input file put the
NOTAMBIGUOUS code for the unambiguous loci. If instead, at a particular locus the alleles are all codominant, but
there is ambiguity about the number of each (e.g., for microsatellites in a tetraploid) then set the recessive allele
code to MISSING. Finally, if there is a recessive allele, and there is also ambiguity about the number of each allele,
then set the recessive allele code to indicate which allele is recessive. Coding of alleles where there is copy number
ambiguity is analogous to that where there are dominant markers. So for example in a tetraploid where three codominant
alleles B, C and D are observed, this should be coded as B C D D or equivalently B B C D or any other combination including
each of the three alleles. It should not be coded as B C D (MISSING), as this indicates that the particular individual
is triploid at the locus in question. Nor should it be coded B C D A if there is a recessive allele A at the locus.
For use with polyploids when RECESSIVEALLELES=1.
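The coding rules above can be sketched as follows. This is only an illustration of the stated rules: the function names, the status labels (`"unambiguous"`, `"copy_number_only"`), and the example allele codes are hypothetical, and -9 is used for MISSING because it is the default mentioned elsewhere in this file.

```python
def is_valid_notambiguous(code, alleles, missing=-9):
    """True if `code` matches no allele in the data and differs from MISSING."""
    return code != missing and code not in set(alleles)


def recessive_line(loci, notambiguous, missing=-9):
    """Build the recessive-alleles row, one entry per locus.

    Each entry of `loci` is either
      "unambiguous"       no genotypic ambiguity      -> NOTAMBIGUOUS code
      "copy_number_only"  codominant alleles, unknown
                          copy number                 -> MISSING code
      an integer          the recessive allele        -> that allele
    """
    line = []
    for status in loci:
        if status == "unambiguous":
            line.append(notambiguous)
        elif status == "copy_number_only":
            line.append(missing)
        else:
            line.append(int(status))
    return line


alleles_in_data = {1, 2, 3, 4}          # hypothetical allele codes
code = -1                               # candidate NOTAMBIGUOUS value
print(is_valid_notambiguous(code, alleles_in_data))
print(recessive_line(["unambiguous", "copy_number_only", 2], notambiguous=code))
```

Checking the candidate code before building the row avoids the case where NOTAMBIGUOUS collides with a real allele or with MISSING.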
second_options
Run Configuration Options (file extraparams)
set_noadmix
Assume the model without admixture
21
perl
( $value) ? "noadmix=1," :""
Each individual is assumed to be completely from one of the K populations. In the output, instead of printing the average value of Q as in the admixture case, the program prints the posterior probability that
each individual is from each population. 1 = no admixture; 0 = model with admixture.
use_linkagemodel
Use the linkage model.
22
perl
( $value) ? "linkage=1," :""
LOG10RSTART sets the initial value of recombination rate r per unit distance.
LOG10RMIN and LOG10RMAX set the minimum and maximum allowed values for log10 r. LOG10PROPSD sets the size of the proposed changes to log10 r in
each update. The front end makes some guesses about these, but some care on the part of the user is required to be sure that the values are
sensible for the particular application.
use_popinfo
Use prior population information to assign individuals to clusters.
23
perl
( $value) ? "usepopinfo=1," :""
To use population information, you must indicate that your input file contains an indicator variable which says whether to use popinfo
perl
$use_popinfo && !$set_popflag
use_locprior
Use location information to improve the performance on data that are weakly informative about structure.
24
perl
( $value) ? "locprior=1," :""
use_inferalpha
Infer the value of the model parameter alpha from the data
27
perl
!$set_noadmix
perl
( $value) ? "inferalpha=1," :""
Infer the value of the model parameter α from the data. If this is turned off, α stays fixed at the value given for ALPHA. This option is not available
under the model without admixture (NOADMIX=1).
use_popalphas
Infer a separate α for each population
28
perl
( $value) ? "popalphas=1," :""
Not recommended in most cases but may be useful for situations with asymmetric admixture.
set_alpha
Dirichlet parameter (α) for degree of admixture
29
perl
$use_inferalpha
perl
($set_alpha) ? "alpha=$value,":""
Dirichlet parameter (α) for degree of admixture (this is the initial value if INFERALPHA==1).
use_unifprioalpha
Assume a uniform prior for alpha which runs between 0 and ALPHAMAX
35
perl
( $value) ? "unifprioralpha=1," :""
UNIFPRIORALPHA (Boolean) Assume a uniform prior for α which runs between 0 and ALPHAMAX. This model seems to work fine; the alternative model
(when UNIFPRIORALPHA=0) is to take α as having a Gamma prior, with mean ALPHAPRIORA × ALPHAPRIORB, and variance ALPHAPRIORA × ALPHAPRIORB².
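A small Python sketch of the Gamma-prior moments may help when choosing ALPHAPRIORA and ALPHAPRIORB. It assumes ALPHAPRIORA is the shape and ALPHAPRIORB the scale of the Gamma distribution, so that the mean is A × B and the variance is A × B squared; the function name and the example values are hypothetical.

```python
def gamma_prior_moments(alphapriora, alphapriorb):
    """Mean and variance of a Gamma prior with shape A and scale B.

    Assumption: ALPHAPRIORA is the shape and ALPHAPRIORB the scale,
    which gives mean = A * B and variance = A * B**2.
    """
    mean = alphapriora * alphapriorb
    variance = alphapriora * alphapriorb ** 2
    return mean, variance


# Hypothetical settings, chosen only to show the arithmetic.
mean, variance = gamma_prior_moments(1.0, 2.0)
print(mean, variance)
```

Under this reading, raising ALPHAPRIORB widens the prior much faster than raising ALPHAPRIORA, because the scale enters the variance squared.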
set_lambda
Parameterize the allele frequency prior
32
perl
(defined $value) ? "lambda=$value,":""
1
Setting lambda together with alpha or F usually does not work well
perl
$set_lambda && $set_alpha
LAMBDA (double) parameterizes the allele frequency prior, and for most data the default value of 1 seems to work pretty well.
If the frequencies at most markers are very skewed towards low/high frequencies, a smaller value of λ may potentially lead to better performance.
It doesn’t seem to work very well to estimate λ at the same time as the other hyperparameters, α and F.
These values are used to parameterize the assumed probability models. In most cases the default settings should be fairly sensible and you may not
want to worry about these.
use_inferlambda
Infer a suitable value for lambda
30
perl
( $value) ? "inferlambda=1," :""
Not recommended for most analyses.
use_popspecificlambda
Infer a separate lambda for each population
31
perl
( $value) ? "popspecificlambda=1," :""
use_freqscorr
Use the F model, in which the allele frequencies are correlated across populations
25
perl
(defined $use_freqscorr) ? "freqscorr=$value," :""
FREQSCORR (double) Use the “F model”, in which the allele frequencies are correlated across populations (Falush et al., 2003a). More
specifically, rather than assuming a prior in which the allele frequencies in each population are independent draws from a uniform Dirichlet
distribution, we start with a distribution which is centered around the mean allele frequencies in the sample. This model is more realistic for
very closely related populations (where we expect the allele frequencies to be similar across populations), and can produce better clustering
(section 3.2). The prior of Fk is set using FPRIORMEAN, and FPRIORSD. There may be a tendency to overestimate K when FREQSCORR is turned on.
set_fpriormean
Set mean FPRIORMEAN for Fk
33
perl
(defined $value) ? "fpriormean=$value,":""
The prior for Fk is taken to be Gamma with mean FPRIORMEAN, and standard deviation FPRIORSD.
Our default settings place a lot of weight on small values of F. We find that this makes the algorithm sensitive to subtle structure, but at
some increased risk of overestimating K (Falush et al., 2003a).
use_onefst
Assume the same value of Fk for all populations
26
perl
( $value) ? "onefst=1," :""
Assume the same value of Fk for all populations (analogous to Wright’s traditional FST ). This is not recommended for most
data, because in practice you probably expect different levels of divergence in each population. When K = 2 it may sometimes be difficult to
estimate two values of FST separately (but see Harter et al. (2004)). When you’re trying to estimate K, you should use the same model for all K
(we suggest ONEFST=0).
set_fpriorsd
Set standard deviation FPRIORSD for Fk
34
perl
(defined $value) ? "fpriorsd=$value,":""
The prior for Fk is taken to be Gamma with mean FPRIORMEAN, and standard deviation FPRIORSD.
Our default settings place a lot of weight on small values of F. We find that this makes the algorithm sensitive to subtle structure, but at
some increased risk of overestimating K (Falush et al., 2003a).
set_log10rmin
Set min prior for switch rate r
36
perl
($set_log10rmin) ? "log10rmin=$value,":""
LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior
on a log scale, between LOG10RMIN and LOG10RMAX. These values need to be set by the user to make sense in terms of the scale of map units being
used.
set_log10rmax
Set max prior for switch rate r
37
perl
($set_log10rmax) ? "log10rmax=$value,":""
LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior
on a log scale, between LOG10RMIN and LOG10RMAX. These values need to be set by the user to make sense in terms of the scale of map units being
used.
set_log10rstart
Set start value for switch rate r
39
perl
($set_log10rstart) ? "log10rstart=$value,":""
LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior
on a log scale, between LOG10RMIN and LOG10RMAX. These values need to be set by the user to make sense in terms of the scale of map units being
used.
set_log10propsd
Set standard deviation for switch rate r
38
perl
($set_log10propsd) ? "log10propsd=$value,":""
LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior
on a log scale, between LOG10RMIN and LOG10RMAX. These values need to be set by the user to make sense in terms of the scale of map units being
used.
set_gensback
Set value for G
40
perl
$use_popinfo
perl
($set_gensback) ? "gensback=$value,":""
This corresponds to G (Pritchard et al., 2000a). When using prior population information for individuals (USEPOPINFO=1), the
program tests whether each individual has an immigrant ancestor in the last G generations, where G = 0 corresponds to the individual being an
immigrant itself. In order to have decent power, G should be set fairly small (2, say) unless the data are highly informative.
set_migrprior
Set migration prior
41
perl
(defined $set_migrprior) ? "migrprior=$value,":""
Please enter a value between 0 and 1
perl
$set_migrprior > 1 || $set_migrprior < 0
The value you have entered is outside the recommended range
perl
$set_migrprior > 0.1 || $set_migrprior < 0.001
MIGRPRIOR (double) Must be in [0,1]. This is ν in Pritchard et al. (2000a). Sensible values might be in the range 0.001 to 0.1.
use_pfrompopflagonly
Update the allele frequencies, P, using only a prespecified subset of the individuals
42
perl
$set_popflag
perl
( $value) ? "pfrompopflagonly=1," :""
This option, new with version 2.0, makes it possible to update the allele frequencies, P, using only a prespecified subset of the individuals.
To use this, include a POPFLAG column, and set POPFLAG=1 for individuals who should be used to update P, and POPFLAG=0 for individuals who should
not be used to update P. This can be used both with and without USEPOPINFO turned on. This option will be useful, for example, if you have a
standard reference set of individuals from known populations, and then you want to estimate the ancestry of some unknown individuals. Using this
option, the q estimate for each unknown individual depends only on the reference set, and not on the other unknown individuals in the sample.
This property is sometimes desirable.
use_locispop
Use the PopData Column for Location data
43
perl
$use_locprior
perl
( $value) ? "locispop=1," :""
This option instructs the program to use the PopData column in the input file as location data when the LOCPRIOR model is
turned on. When LOCISPOP=0, the program requires a LocData column to use LOCPRIOR.
set_locpriorinit
Initial value for the LOCPRIOR parameter r
44
perl
(defined $set_locpriorinit) ? "locpriorinit=$value,":""
1
Initial value for the LOCPRIOR parameter r, which parameterizes how informative the populations are (Hubisz et al., 2009).
We found that LOCPRIORINIT=1 helped achieve good convergence.
set_maxlocprior
Maximum value for the LOCPRIOR parameter r
45
perl
(defined $set_maxlocprior) ? "maxlocprior=$value,":""
20
output_options
Output Options
set_printnet
Print the net nucleotide distance between clusters.
70
perl
( $value) ? "printnet=1," :""
The distance between populations A and B, DAB, is calculated. In words, the net nucleotide distance is the average probability that a pair of alleles,
one each from populations A and B are different, less the average within-population heterozygosities. Perhaps more intuitively, this can be thought
of as being the average amount of pairwise difference between alleles from different populations, beyond the amount of variation found within each
population. The distance has the appropriate property that similar populations have distances near 0, and in particular, DAA = 0. Notice that the distance
is symmetric, so that DAB = DBA. This distance is suitable for drawing trees of populations to help visualize the levels of difference among the clusters (Falush et al., 2003b).
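The verbal definition above can be turned into a short Python sketch. This is one reading of the description, not STRUCTURE's own code: it assumes per-locus allele frequency vectors are available for each cluster, computes the probability that two alleles differ as one minus the sum of frequency products, and averages over loci. The function names and the example frequencies are hypothetical.

```python
def prob_differ(p, q):
    """Probability that one allele drawn from p and one drawn from q differ."""
    return 1.0 - sum(pi * qi for pi, qi in zip(p, q))


def net_distance(freqs_a, freqs_b):
    """Net distance between two clusters, averaged over loci.

    freqs_a and freqs_b are lists with one allele-frequency vector per locus.
    """
    n_loci = len(freqs_a)
    between = sum(prob_differ(a, b) for a, b in zip(freqs_a, freqs_b)) / n_loci
    within_a = sum(prob_differ(a, a) for a in freqs_a) / n_loci
    within_b = sum(prob_differ(b, b) for b in freqs_b) / n_loci
    # Between-population difference, less the average within-population
    # heterozygosity of the two populations.
    return between - (within_a + within_b) / 2.0


# Hypothetical single-locus, two-allele example.
pop_a = [[0.9, 0.1]]
pop_b = [[0.1, 0.9]]
print(net_distance(pop_a, pop_b))
print(net_distance(pop_a, pop_a))
```

With this form the two properties noted in the text follow directly: the expression is symmetric in A and B, and it vanishes when both arguments are the same population.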
set_sitebysite
Print a complete summary of assignment probabilities for every genotype in the data
71
perl
$use_linkagemodel
perl
( $value) ? "sitebysite=1," :""
Print a complete summary of assignment probabilities for every genotype in the data.
This is printed to a separate file with the suffix “ss”. This file can be big!
set_printqhat
Print Q estimates to a separate file with suffix q.
72
perl
( $value) ? "printqhat=1," :""
When this is turned on, the point estimate for Q is not only printed into the main results file, but also into a separate file with suffix “q”. This file is required in order to run the companion program STRAT.
set_intermedsave
Print this many intermediate results
74
perl
(defined $set_intermedsave) ? "intermedsave=$value," :""
If you’re impatient to see preliminary results before the end of the run, you can have results printed
to file at intervals during the MCMC run. A total of INTERMEDSAVE such files are printed, at equal intervals
following the completion of the BURNIN. Turn this off by setting to 0. The names of these files are derived from the
OUTFILE name.
set_echodata
Print a brief summary of the data set to the output file.
75
perl
( $value) ? "echodata=1," :""
Print a brief summary of the data set to the screen and output file. (Prints the beginnings and ends of the top and
bottom lines of the input file to allow the user to check that it has been read correctly.)
set_ancestdist
Collect information about the distribution of Q for each individual
76
perl
( $value) ? "ancestdist=1," :""
Collect information about the distribution of Q for each individual, as well as just estimating the mean.
When this is turned on, the output file includes the left- and right-hand ends of the probability intervals for each q(i).