Parallel Structure on XSEDE 2.3.4 A program to investigate population structure using multi-locus genotype data Pritchard J.K., Stephens M., and Donnelly P. Pritchard J.K., Stephens M., and Donnelly P. (2000) Inference of population structure using multilocus genotype data. Genetics. 2000 Jun;155(2):945-59. Melissa J. Hubisz, Daniel Falush, Matthew Stephens, and Jonathan K. Pritchard (2009) Inferring weak population structure with the assistance of sample group information Molecular Ecology Resources (2009) 9, 1322–1332. Francois Besnier and Kevin A. Glover (2013) ParallelStructure: A R Package to Distribute Parallel Runs of the Population Genetics Program STRUCTURE on Multi-Core Computers PLoS One 8(7): e70651. Population Genetics structure_xsede pstructure_comet perl "" 1 number_nodes 2 scheduler.conf perl "nodes=1\\n" . "node_exclusive=0\\n" . "threads_per_process=12\\n" infile Input File (must be in proper Structure format) data.txt pdf_results *.pdf txt_results *.txt results_results results* csv_results *.csv runtime 1 scheduler.conf Maximum Hours to Run (up to 168 hours) 0.5 The maximum hours to run must be less than 168 perl $runtime > 168.0 The maximum hours to run must be greater than 0.05 perl $runtime < 0.05 perl "runhours=$value\\n" The job will run on 12 processors as configured. If it runs for the entire configured time, it will consume 12 X $runtime cpu hours perl $runtime > 0 Estimate the maximum time your job will need to run. We recommend testing initially with a run of 0.5 h or less, because jobs set for 0.5 h or less dependably run immediately in the "debug" queue. Once you are sure the configuration is correct, increase the time. Jobs longer than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may run sooner than jobs configured for the full 168 hours.
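The runtime rules above (the 0.05 to 168 hour limits, the 12-processor charge, and the 0.5 h queue threshold) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the portal's code: the function names are hypothetical, and the sketch assumes that the 12 in `threads_per_process=12` is the same 12 used for the CPU-hour estimate.

```python
# Hypothetical helper names; limits and scheduler.conf keys are taken from the form text.
CORES = 12                 # "The job will run on 12 processors as configured"
MIN_HOURS = 0.05
MAX_HOURS = 168.0
DEBUG_QUEUE_LIMIT = 0.5    # jobs of 0.5 h or less run in the "debug" queue


def runtime_errors(hours):
    """Return the form's error messages triggered by a requested runtime."""
    errors = []
    if hours > MAX_HOURS:
        errors.append("The maximum hours to run must be less than 168")
    if hours < MIN_HOURS:
        errors.append("The maximum hours to run must be greater than 0.05")
    return errors


def max_cpu_hours(hours):
    """Upper bound on the charge if the job uses its whole time limit."""
    return CORES * hours


def expected_queue(hours):
    """Queue the job is expected to land in, per the help text."""
    return "debug" if hours <= DEBUG_QUEUE_LIMIT else "normal"


def scheduler_conf(hours):
    """Assemble the scheduler.conf lines that the form fragments write."""
    return (
        "nodes=1\n"
        "node_exclusive=0\n"
        f"threads_per_process={CORES}\n"
        f"runhours={hours}\n"
    )
```

For a first test run, `expected_queue(0.5)` gives `"debug"` and `max_cpu_hours(0.5)` gives at most 6 CPU hours; a full 168-hour request would be charged up to 2016 CPU hours.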
joblist_file Joblist File joblist.txt Please select a joblist file perl !defined $joblist_file basic_run_params Basic Run Parameters set_maxpops Number of populations assumed for a particular run of the program, 3 perl "maxpops= $value," Please specify the number of populations perl !defined $set_maxpops set_burnin Set the burnin value 3 perl "burnin = $value," Please specify the burnin value perl !defined $set_burnin set_numreps Number of MCMC reps after burnin 3 perl "numreps = $value," Please specify the number of MCMC reps after burnin perl !defined $set_numreps data_file_params Data File Configuration set_numinds Number of individuals in the population 3 perl "numinds = $value," Please specify the number of individuals in the population perl !defined $set_numinds set_numloci Number of loci in the dataset 4 perl "numloci=$value," Please specify the number of loci in the population perl !defined $set_numloci set_ploidy Ploidy of the dataset 5 perl (defined $set_ploidy) ? "ploidy=$value," :"" set_missing Value given to missing genotype data 6 perl (defined $set_missing) ? "missing=$value," :"" Must be an integer, and must not appear elsewhere in the data set. Default is -9 set_onerowperind The data for each individual are arranged in a single row 7 perl ( $value) ? "onerowperind=1," :"" ONEROWPERIND (Boolean) The data for each individual are arranged in a single row. E.g., for diploid data, this would mean that the two alleles for each locus are in consecutive order in the same row, rather than being arranged in the same column, in two consecutive rows. set_labels Input file contains labels (names) for each individual 8 perl ( $value) ? "labels=1," :"" set_popdata Input file contains a user-defined population-of-origin for each individual 9 perl ( $value) ? "popdata=1," :"" set_popflag Input file contains an indicator variable which says whether to use popinfo 10 perl $use_popinfo perl ( $value) ? 
"popflag=1," :"" set_locdata Input file contains a user-defined sampling location for each individual 11 perl $use_locprior perl ( $value) ? "locdata=1," :"" set_phenotype Input file contains a column of phenotype information 12 perl ( $value) ? "phenotype=1," :"" set_extracols Number of additional columns of data after the Phenotype before the genotype data start 13 perl (defined $set_extracols) ? "extracols=$set_extracols," :"" These are ignored by the program set_markernames The top row of the data file contains a list of L names corresponding to the markers used. 14 perl ( $value) ? "markernames=1,":"" set_recessivealleles Next row of data file contains a list of L integers indicating which alleles are recessive at each locus. 15 perl ( $value) ? "recessivealleles=1," :"" Setting this to 1 implies that the dominant marker model is in use. set_mapdistances The next row of the data file contains a list of mapdistances between neighboring loci. 16 perl ( $value) ? "mapdistances=1," :"" The next row of the data file (or the first row if MARKER- NAMES==0) contains a list of mapdistances between neighboring loci. Advanced data file option. set_phased Indicates that data are in correct phase. 17 perl $use_linkagemodel perl ( $value) ? "phased=1," :"" When the linkage model is used with polyploids, PHASED=1 is required. perl $set_ploidy > 2 For use with linkage model. Indicates that data are in correct phase. If (LINKAGE=1, PHASED=0), then PHASEINFO can be used–this is an extra line in the input file that gives phase probabilities. When PHASEINFO =0 each value is set to 0.5, implying no phase information. When the linkage model is used with polyploids, PHASED=1 is required. set_phaseinfo The row(s) of genotype data for each individual are followed by a row of information about haplotype phase. 18 perl $use_linkagemodel && !$set_phased perl ( $value) ? 
"phaseinfo=1," :"" The row(s) of genotype data for each individual are followed by a row of information about haplotype phase. This is for use with the linkage model only. See sections 2 and 3.1 for further details. set_markovphase The phase information follows a Markov model. 19 perl ( $value) ? "markovphase=1," :"" See sections 2.2 and 9.6 for details.. set_notambiguous Provide an integer that indicates genotype data at a marker are unambiguous 20 perl $set_recessivealleles && $set_ploidy > 2 perl (defined $value) ? "notambiguous=$value," :"" Structure allows the data to consist of a mixture of loci for which there is, and isn’t genotypic ambiguity. If some loci are not ambiguous, set the code NOTAMBIGUOUS to an integer that does not match any of the alleles in the data, and that does not equal MISSING. Then in the recessive alleles line at the top of the input file put the NOTAMBIGUOUS code for the unambiguous loci. If instead, at a particular locus the alleles are all codominant, but there is ambiguity about the number of each (eg for microsatellites in a tetraploid) then set the recessive allele code to MISSING. Finally, if there is a recessive allele, and there is also ambiguity about the number of each allele, then set the recessive allele code to indicate which allele is recessive. Coding of alleles where there is copy number ambiguity is analogous to that where there are dominant markers. So for example in a tetraploid where three codominant loci B, C and D observed, this should be coded as B C D D or equivalently B B C D or any other combination including each of the three alleles. It should not be coded as B C D (MISSING), as this indicates that the particular individual is triploid at the locus in question. Nor should it be coded B C D A if there is a recessive allele A at the locus. For use with polyploids when RECESSIVEALLELES=1. 
second_options Run Configuration Options (file extraparams) set_noadmix Assume the model without admixture 21 perl ( $value) ? "noadmix=1," :"" Each individual is assumed to be completely from one of the K populations. In the output, instead of printing the average value of Q as in the admixture case, the program prints the posterior probability that each individual is from each population. 1 = no admixture; 0 = model with admixture. use_linkagemodel Use the linkage model. 22 perl ( $value) ? "linkage=1," :"" LOG10RSTART sets the initial value of recombination rate r per unit distance. LOG10RMIN and LOG10RMAX set the minimum and maximum allowed values for log10 r. LOG10PROPSD sets the size of the proposed changes to log10 r in each update. The front end makes some guesses about these, but some care on the part of the user is required to be sure that the values are sensible for the particular application. use_popinfo Use prior population information to assign individuals to clusters. 23 perl ( $value) ? "usepopinfo=1," :"" To use prior population information, you must indicate that your input file contains an indicator variable which says whether to use popinfo perl $use_popinfo && !$set_popflag use_locprior Use location information to improve the performance on data that are weakly informative about structure. 24 perl ( $value) ? "locprior=1," :"" use_inferalpha Infer the value of the model parameter alpha from the data 27 perl !$set_noadmix perl ( $value) ? "inferalpha=1," :"" Assume the same value of Fk for all populations (analogous to Wright’s traditional FST). This is not recommended for most data, because in practice you probably expect different levels of divergence in each population. When K = 2 it may sometimes be difficult to estimate two values of FST separately (but see Harter et al. (2004)). When you’re trying to estimate K, you should use the same model for all K (we suggest ONEFST=0).
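Each Boolean switch in this group contributes a `name=1,` fragment when ticked and an empty string otherwise, and USEPOPINFO additionally needs the POPFLAG column to be declared. A minimal Python sketch of that behavior follows. It rests on several assumptions that the form text does not confirm: that the fragments are concatenated in the order of the numbers listed beside each option, that `perl !$set_noadmix` means INFERALPHA is only offered when NOADMIX is off, and that the helper names (all hypothetical) are a fair stand-in for the Perl expressions.

```python
def switch_fragment(name, ticked):
    """Mirror of the Perl pattern ($value) ? "name=1," : ""."""
    return f"{name}=1," if ticked else ""


def model_fragments(noadmix=False, linkage=False, usepopinfo=False,
                    locprior=False, inferalpha=False):
    """Concatenate the fragments for the model switches (assumed order)."""
    if noadmix:
        # Assumption: INFERALPHA is not available under the no-admixture model.
        inferalpha = False
    switches = [
        ("noadmix", noadmix),
        ("linkage", linkage),
        ("usepopinfo", usepopinfo),
        ("locprior", locprior),
        ("inferalpha", inferalpha),
    ]
    return "".join(switch_fragment(name, ticked) for name, ticked in switches)


def popinfo_warning(use_popinfo, set_popflag):
    """Return a warning when USEPOPINFO is on but no POPFLAG column is declared."""
    if use_popinfo and not set_popflag:
        return "USEPOPINFO requires the input file to contain a POPFLAG column"
    return None
```

For example, `model_fragments(linkage=True, usepopinfo=True)` gives `"linkage=1,usepopinfo=1,"`, and `popinfo_warning(True, False)` returns the warning text.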
use_popalphas Infer a separate α for each population 28 perl ( $value) ? "popalphas=1," :"" Not recommended in most cases but may be useful for situations with asymmetric admixture. set_alpha Dirichlet parameter (α) for degree of admixture 29 perl $use_inferalpha perl ($set_alpha) ? "alpha=$value,":"" Dirichlet parameter (α) for degree of admixture (this is the initial value if INFERALPHA==1). use_unifprioalpha Assume a uniform prior for alpha which runs between 0 and ALPHAMAX 35 perl ( $value) ? "unifprioalpha=1," :"" UNIFPRIORALPHA (Boolean) Assume a uniform prior for α which runs between 0 and ALPHAMAX. This model seems to work fine; the alternative model (when UNIFPRIORALPHA=0) is to take α as having a Gamma prior, with mean ALPHAPRIORA × ALPHAPRIORB and variance ALPHAPRIORA × ALPHAPRIORB^2. set_lambda Parameterize the allele frequency prior 32 perl ($set_alpha) ? "lamda=$value,":"" 1 Using lambda together with alpha or F usually does not work well perl $set_lambda && $set_alpha LAMBDA (double) parameterizes the allele frequency prior, and for most data the default value of 1 seems to work pretty well. If the frequencies at most markers are very skewed towards low/high frequencies, a smaller value of λ may potentially lead to better performance. It doesn’t seem to work very well to estimate λ at the same time as the other hyperparameters, α and F. Priors: these values are used to parameterize the assumed probability models. In most cases the default settings should be fairly sensible and you may not want to worry about these. use_inferlambda Infer a suitable value for lambda 30 perl ( $value) ? "inferlambda=1," :"" Not recommended for most analyses. use_popspecificlambda Infer a separate lambda for each population 31 perl ( $value) ? "popspecificlambda=1," :"" use_freqscorr Use the F model, in which the allele frequencies are correlated across populations 25 perl (defined $use_freqscorr) ? 
"freqscorr=$value," :"" FREQSCORR (double) Use the “F model”, in which the allele frequencies are correlated across populations (Falush et al., 2003a). More specifically, rather than assuming a prior in which the allele frequencies in each population are independent draws from a uniform Dirichlet distribution, we start with a distribution which is centered around the mean allele frequencies in the sample. This model is more realistic for very closely related populations (where we expect the allele frequencies to be similar across populations), and can produce better clustering (section 3.2). The prior of Fk is set using FPRIORMEAN, and FPRIORSD. There may be a tendency to overestimate K when FREQSCORR is turned on. set_fpriormean Set mean FPRIORMEAN for Fk 33 perl ($set_alpha) ? "fpriormean=$value,":"" The prior for Fk is taken to be Gamma with mean FPRIORMEAN, and standard deviation FPRIORSD. Our default settings place a lot of weight on small values of F . We find that this makes the algorithm sensitive to subtle structure, but at some increased risk of overestimating K (Falush et al., 2003a) use_onefst Assume the same value of Fk for all populations 26 perl ( $value) ? "onefst=1," :"" Assume the same value of Fk for all populations (analogous to Wright’s traditional FST ). This is not recommended for most data, because in practice you probably expect different levels of divergence in each population. When K = 2 it may sometimes be difficult to estimate two values of FST separately (but see Harter et al. (2004)). When you’re trying to estimate K, you should use the same model for all K (we suggest ONEFST=0). set_fpriorsd Set standard deviation FPRIORSD for Fk 34 perl ($set_alpha) ? "fpriormean=$value,":"" The prior for Fk is taken to be Gamma with mean FPRIORMEAN, and standard deviation FPRIORSD. Our default settings place a lot of weight on small values of F . 
We find that this makes the algorithm sensitive to subtle structure, but at some increased risk of overestimating K (Falush et al., 2003a). set_log10rmin Set min prior for switch rate r 36 perl ($set_log10rmin) ? "log10rmin=$value,":"" LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior on a log scale, between LOG10RMIN and LOG10RMAX. These values need to be set by the user to make sense in terms of the scale of map units being used. set_log10rmax Set max prior for switch rate r 37 perl ($set_log10rmax) ? "log10rmax=$value,":"" LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior on a log scale, between LOG10RMIN and LOG10RMAX. These values need to be set by the user to make sense in terms of the scale of map units being used. set_log10rstart Set start value for switch rate r 39 perl ($set_log10rstart) ? "log10rstart=$value,":"" LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior on a log scale, between LOG10RMIN and LOG10RMAX. These values need to be set by the user to make sense in terms of the scale of map units being used. set_log10propsd Set standard deviation for switch rate r 38 perl ($set_log10propsd) ? "log10propsd=$value,":"" LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior on a log scale, between LOG10RMIN and LOG10RMAX. These values need to be set by the user to make sense in terms of the scale of map units being used. set_gensback Set value for G 40 perl $use_popinfo perl ($set_gensback) ? 
"gensback=$value,":"" This corresponds to G (Pritchard et al., 2000a). When using prior population information for individuals (USEPOPINFO=1), the program tests whether each individual has an immigrant ancestor in the last G generations, where G = 0 corresponds to the individual being an immigrant itself. In order to have decent power, G should be set fairly small (2, say) unless the data are highly informative. set_migrprior Set migration prior 41 perl (defined $set_migrprior) ? "migrprior=$value,":"" Please enter a value that is greater than zero, and less than 1 perl $set_migrprior > 1 || $set_migrprior < 0 The value you have entered is outside the recommended range perl $set_migrprior > 0.1 || $set_migrprior < 0.001 MIGRPRIOR (double) Must be in [0,1]. This is ν in Pritchard et al. (2000a). Sensible values might be in the range 0.001—0.1. use_pfrompopflagonly Update the allele frequencies, P , using only a prespecified subset of the individuals 42 perl $set_popflag perl ( $value) ? "pfrompopflagonly=1," :"" This option, new with version 2.0, makes it possible to update the allele frequencies, P , using only a prespecified subset of the individuals. To use this, include a POPFLAG column, and set POPFLAG=1 for individuals who should be used to update P , and POPFLAG=0 for individuals who should not be used to update P. This can be used both with, or without USEPOPINFO turned on. This option will be useful, for example, if you have a standard reference set of individuals from known populations, and then you want to estimate the ancestry of some unknown individuals. Using this option, the q estimate for each unknown individual depends only on the reference set, and not on the other unknown individuals in the sample. This property is sometimes desirable. LOCPRIOR model for using location information. use_locispop Use the PopData Column for Location data 43 perl $use_locprior perl ( $value) ? 
"locispop=1," :"" This option instructs the program to use the PopData column in the input file as location data when the LOCPRIOR model is turned on. When LOCISPOP=0, the program requires a LocData column to use LOCPRIOR. set_locpriorinit Initial value for the LOCPRIOR parameter r 44 perl (defined $set_locpriorinit) ? "locpriorinit=$value,":"" 1 Initial value for the LOCPRIOR parameter r, that parameterizes how informative the populations are (citepHubiszEtAl09). We found that LOCPRIORINIT=1 helped achieve good convergence. set_maxlocprior Maximum value for the LOCPRIOR parameter r 45 perl (defined $set_maxlocprior) ? "maxlocprior=$value,":"" 20 output_options Output Options set_printnet Print the net nucleotide distance between clusters. 70 perl ( $value) ? "printnet=1," :"" The distance between populations A and B, DAB, is calculated. In words, the net nucleotide distance is the average probability that a pair of alleles, one each from populations A and B are different, less the average within-population heterozygosities. Perhaps more intuitively, this can be thought of as being the average amount of pairwise difference between alleles from different populations, beyond the amount of variation found within each population. The distance has the appropriate property that similar populations have distances near 0, and in particular, DAA = 0. Notice that the distance is symmetric, so that DAB = DBA.This distance is suitable for drawing trees of populations to help visualize the levels of difference among the clusters (Falush et al., 2003b) set_sitebysite Print a complete summary of assignment probabilities for every genotype in the data 71 perl $use_linkagemodel perl ( $value) ? "sitebysite=1," :"" Print a complete summary of assignment probabilities for every genotype in the data. This is printed to a separate file with the suffix “ss”. This file can be big! set_printqhat Print Q estimates to a separate file with suffix q. 72 perl ( $value) ? 
"printqhat=1," :"" When this is turned on, the point estimate for Q is not only printed into the main results file, but also into a separate file with suffix “q”. This file is required in order to run the companion program STRAT. set_intermedsave Print this many intermediate results 74 perl (defined $set_intermedsave) ? "intermedsave=$value," :"" If you’re impatient to see preliminary results before the end of the run, you can have results printed to file at intervals during the MCMC run. A total of INTERMEDSAVE such files are printed, at equal intervals following the completion of the BURNIN. Turn this off by setting to 0. Names of these files created using OUTFILE name. set_echodata Print a brief summary of the data set to the output file. 75 perl ( $value) ? "echodata=1," :"" Print a brief summary of the data set to the screen and output file. (Prints the beginnings and ends of the top and bottom lines of the input file to allow the user to check that it has been read correctly.) set_ancestdist Collect information about the distribution of Q for each individual 76 perl ( $value) ? "ancestdist=1," :"" Collect information about the distribution of Q for each individual, as well as just estimating the mean. When this is turned on, the output file includes the left- and right-hand ends of the probability intervals for each q(i).