ParallelStructure on ACCESS 2.3.4
A program to investigate population structure using multi-locus genotype data
Pritchard J.K., Stephens M., and Donnelly P.
Pritchard J.K., Stephens M., and Donnelly P. (2000) Inference of population structure using multilocus genotype data. Genetics. 2000 Jun;155(2):945-59.
Melissa J. Hubisz, Daniel Falush, Matthew Stephens, and Jonathan K. Pritchard (2009) Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources (2009) 9, 1322–1332.
Francois Besnier and Kevin A. Glover (2013) ParallelStructure: A R Package to Distribute Parallel Runs of the Population Genetics Program STRUCTURE on Multi-Core Computers. PLoS One 8(7): e70651.
Population Genetics
parallelstructure_xsede
pstructure_comet perl "" 1
number_nodes 2 scheduler.conf perl $specify_threads < 32 && !$more_memory perl "nodes=1\\n" . "node_exclusive=0\\n" . "mem=" . (int(($specify_threads)*(248/128))) . "G\\n" . "threads_per_process=$specify_threads\\n"
number_nodesb 2 scheduler.conf perl $specify_threads < 32 && $more_memory perl "nodes=1\\n" . "node_exclusive=0\\n" . "mem=" . 2*(int(($specify_threads)*(248/128))) . "G\\n" . "threads_per_process=$specify_threads\\n"
number_nodes2 2 scheduler.conf perl $specify_threads >= 32 && !$more_memory perl "nodes=1\\n" . "node_exclusive=0\\n" . "mem=" . (int(32*(248/128))) . "G\\n" . "threads_per_process=32\\n"
number_nodes2b 2 scheduler.conf perl $specify_threads >= 32 && $more_memory perl "nodes=1\\n" . "node_exclusive=0\\n" . "mem=" . (int(64*(248/128))) . "G\\n" . "threads_per_process=32\\n"
infile Input File (must be in proper Structure format) data.txt
pdf_results *.pdf
txt_results *.txt
txt_results2 *.TXT
results_results_f results*_f
results_results_q results*_q
csv_results *.csv
conf_results *.conf
runtime 1 scheduler.conf Maximum Hours to Run (up to 168 hours) 0.5
The maximum hours to run must not exceed 168 perl $runtime > 168.0
The maximum hours to run must be at least 0.05 perl $runtime < 0.05
perl "runhours=$value\\n"
The job will run on $specify_threads processors as configured. If it runs for the entire configured time, it will consume $specify_threads X $runtime cpu hours perl $specify_threads < 32 && !$more_memory
The job will run on 32 processors as configured. If it runs for the entire configured time, it will consume 32 X $runtime cpu hours perl $specify_threads > 31 && !$more_memory
The job will run on 2 X $specify_threads processors as configured. If it runs for the entire configured time, it will consume 2 X $specify_threads X $runtime cpu hours perl $specify_threads < 32 && $more_memory
The job will run on 64 processors as configured. If it runs for the entire configured time, it will consume 64 X $runtime cpu hours perl $specify_threads > 31 && $more_memory
Estimate the maximum time your job will need to run. We recommend testing initially with a run of 0.5 h or less, because jobs set for 0.5 h or less dependably run immediately in the "debug" queue. Once you are sure the configuration is correct, increase the time. Jobs longer than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may run sooner than jobs configured for the full 168 hours.
more_memory I need more memory
joblist_file Joblist File joblist.txt
Please select a joblist file perl !defined $joblist_file
specify_threads How many jobs are in your job list file?
Please specify the number of threads perl !defined $specify_threads
data_file_params Data File Configuration
set_numinds Number of individuals in the population (NUMINDS) 3 perl "numinds = $value,"
Please specify the number of individuals in the population perl !defined $set_numinds
set_numloci Number of loci in the dataset (NUMLOCI) 4 perl "numloci=$value,"
Please specify the number of loci in the dataset perl !defined $set_numloci
set_ploidy Ploidy of the dataset (PLOIDY) 5 perl (defined $set_ploidy) ? "ploidy=$value," :""
set_missing Value given to missing genotype data (MISSING) 6 perl (defined $set_missing) ? "missing=$value," :""
Must be an integer, and must not appear elsewhere in the data set. Default is -9.
set_onerowperind The data for each individual are arranged in a single row (ONEROWPERIND) 7 perl ( $value) ? "onerowperind=1," :""
ONEROWPERIND (Boolean) The data for each individual are arranged in a single row. E.g., for diploid data, this would mean that the two alleles for each locus are in consecutive order in the same row, rather than being arranged in the same column, in two consecutive rows.
use_popinfo Use prior population information to assign individuals to clusters (USEPOPINFO) 23 perl ( $value) ? "usepopinfo=1," :""
To use population information, you must indicate that your input file contains an indicator variable which says whether to use popinfo perl $use_popinfo && !$set_popflag
set_popdata Input file contains a user-defined population-of-origin for each individual (POPDATA) 9 perl ( $value) ? "popdata=1," :"" 1
ParallelStructure requires a population data column. The values can all be 1 if you don't want to use this parameter perl $set_popdata == 0
set_popflag Input file contains an indicator variable which says whether to use popinfo (POPFLAG) 10 perl $use_popinfo perl ( $value) ? "popflag=1," :""
set_locdata Input file contains a user-defined sampling location for each individual (LOCDATA) 11 perl $use_locprior perl ( $value) ? "locdata=1," :""
set_phenotype Input file contains a column of phenotype information (PHENOTYPE) 12 perl ( $value) ? "phenotype=1," :""
set_markernames The top row of the data file contains a list of L names corresponding to the markers used (MARKERNAMES) 14 perl ( $value) ? "markernames=1,":""
set_recessivealleles Next row of data file contains a list of L integers indicating which alleles are recessive at each locus (RECESSIVEALLELES) 15 perl ( $value) ? "recessivealleles=1," :""
Setting this to 1 implies that the dominant marker model is in use.
use_linkagemodel Use the linkage model (LINKAGE) 22 perl ( $value) ? "linkage=1," :""
RLOG10START sets the initial value of recombination rate r per unit distance. RLOG10MIN and RLOG10MAX set the minimum and maximum allowed values for log10r. RLOG10PROPSD sets the size of the proposed changes to log10r in each update. The front end makes some guesses about these, but some care on the part of the user is required to be sure that the values are sensible for the particular application.
set_phased Indicates that data are in correct phase (PHASED) 17 perl $use_linkagemodel perl ( $value) ? "phased=1," :""
When the linkage model is used with polyploids, PHASED=1 is required. perl $set_ploidy > 2 && !$set_phased
For use with linkage model. Indicates that data are in correct phase. If (LINKAGE=1, PHASED=0), then PHASEINFO can be used; this is an extra line in the input file that gives phase probabilities. When PHASEINFO=0 each value is set to 0.5, implying no phase information. When the linkage model is used with polyploids, PHASED=1 is required.
set_phaseinfo The row(s) of genotype data for each individual are followed by a row of information about haplotype phase (PHASEINFO) 18 perl $use_linkagemodel && !$set_phased perl ( $value) ? "phaseinfo=1," :""
The row(s) of genotype data for each individual are followed by a row of information about haplotype phase. This is for use with the linkage model only. See sections 2 and 3.1 for further details.
second_options Run Configuration Options (file extraparams)
set_noadmix Assume the model without admixture (NOADMIX) 21 perl ( $value) ? "noadmix=1," :""
Each individual is assumed to be completely from one of the K populations. In the output, instead of printing the average value of Q as in the admixture case, the program prints the posterior probability that each individual is from each population. 1 = no admixture; 0 = model with admixture.
use_locprior Use location information to improve the performance on data that are weakly informative about structure (LOCPRIOR) 24 perl ( $value) ? "locprior=1," :""
use_inferalpha Infer the value of the model parameter alpha from the data (INFERALPHA) 27 perl !$set_noadmix perl ( $value) ? "inferalpha=1," :""
Infer the value of the model parameter α from the data; otherwise α is fixed at the value ALPHA chosen by the user. This option is ignored under the NOADMIX model.
use_popalphas Infer a separate α for each population (POPALPHAS) 28 perl ( $value) ? "popalphas=1," :""
Not recommended in most cases but may be useful for situations with asymmetric admixture.
set_alpha Dirichlet parameter (α) for degree of admixture (ALPHA) 29 perl $use_inferalpha perl ($set_alpha) ? "alpha=$value,":""
Dirichlet parameter (α) for degree of admixture (this is the initial value if INFERALPHA==1).
use_freqscorr Use the F model, in which the allele frequencies are correlated across populations (FREQSCORR) 25 perl (defined $use_freqscorr) ? "freqscorr=$value," :""
FREQSCORR (double) Use the “F model”, in which the allele frequencies are correlated across populations (Falush et al., 2003a). More specifically, rather than assuming a prior in which the allele frequencies in each population are independent draws from a uniform Dirichlet distribution, we start with a distribution which is centered around the mean allele frequencies in the sample. This model is more realistic for very closely related populations (where we expect the allele frequencies to be similar across populations), and can produce better clustering (section 3.2). The prior of Fk is set using FPRIORMEAN and FPRIORSD. There may be a tendency to overestimate K when FREQSCORR is turned on.
set_fpriormean Set prior mean for Fk (FPRIORMEAN) 33 perl ($set_alpha) ? "fpriormean=$value,":""
The prior for Fk is taken to be Gamma with mean FPRIORMEAN and standard deviation FPRIORSD. Our default settings place a lot of weight on small values of F. We find that this makes the algorithm sensitive to subtle structure, but at some increased risk of overestimating K (Falush et al., 2003a).
set_fpriorsd Set prior standard deviation for Fk (FPRIORSD) 33 perl ($set_alpha) ? "fpriorsd=$value,":""
The prior for Fk is taken to be Gamma with mean FPRIORMEAN and standard deviation FPRIORSD. Our default settings place a lot of weight on small values of F. We find that this makes the algorithm sensitive to subtle structure, but at some increased risk of overestimating K (Falush et al., 2003a).
use_onefst Assume the same value of Fk for all populations (ONEFST) 26 perl ( $value) ? "onefst=1," :""
Assume the same value of Fk for all populations (analogous to Wright’s traditional FST). This is not recommended for most data, because in practice you probably expect different levels of divergence in each population. When K = 2 it may sometimes be difficult to estimate two values of FST separately (but see Harter et al. (2004)).
When you’re trying to estimate K, you should use the same model for all K (we suggest ONEFST=0).
output_options Output Options
set_sitebysite Print a complete summary of assignment probabilities for every genotype in the data (SITEBYSITE) 71 perl $use_linkagemodel perl ( $value) ? "sitebysite=1," :""
Print a complete summary of assignment probabilities for every genotype in the data. This is printed to a separate file with the suffix “ss”. This file can be big!
set_printqhat Print Q estimates to a separate file with suffix q (PRINTQHAT) 72 perl "printqhat=1,"
When this is turned on, the point estimate for Q is not only printed into the main results file, but also into a separate file with suffix “q”. This file is required in order to run the companion program STRAT.
set_ancestdist Collect information about the distribution of Q for each individual (ANCESTDIST) 76 perl ( $value) ? "ancestdist=1," :""
Collect information about the distribution of Q for each individual, as well as just estimating the mean. When this is turned on, the output file includes the left- and right-hand ends of the probability intervals for each q(i).
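
The four scheduler.conf rules near the top of this definition (number_nodes, number_nodesb, number_nodes2, number_nodes2b) and the four cpu-hour messages attached to runtime appear to follow one pattern: thread counts of 32 or more are capped at 32, and "I need more memory" doubles both the memory request and the cores charged. The following is a minimal Python sketch of that logic, written only to make the arithmetic easy to check. It is not the Perl the portal evaluates, and the function names scheduler_conf, charged_cores, and cpu_hours are hypothetical.

```python
GB_PER_CORE = 248 / 128  # ratio that appears in the Perl format expressions


def charged_cores(specify_threads: int, more_memory: bool) -> int:
    """Cores the job is charged for, per the four runtime messages."""
    cores = min(specify_threads, 32)  # 32 or more threads are capped at 32
    return 2 * cores if more_memory else cores


def scheduler_conf(specify_threads: int, more_memory: bool) -> str:
    """Text the four scheduler.conf branches would emit (illustrative only)."""
    threads = min(specify_threads, 32)
    if specify_threads < 32:
        # number_nodes / number_nodesb: truncate first, then double
        mem = int(threads * GB_PER_CORE)
        if more_memory:
            mem *= 2
    else:
        # number_nodes2 / number_nodes2b: fixed 32-core or 64-core memory
        mem = int((64 if more_memory else 32) * GB_PER_CORE)
    return (
        "nodes=1\n"
        "node_exclusive=0\n"
        f"mem={mem}G\n"
        f"threads_per_process={threads}\n"
    )


def cpu_hours(specify_threads: int, more_memory: bool, runtime: float) -> float:
    """Upper bound on cpu hours if the job runs for the whole configured time."""
    return charged_cores(specify_threads, more_memory) * runtime
```

For example, 8 threads with the default memory yields a 15G request and 8 threads_per_process, and a 0.5 h test run of that job is charged at most 4 cpu hours. Note that in the small-job branch the doubling happens after truncation, so an odd thread count with more memory gives 2*int(...) rather than int(2*...).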