GATK on XSEDE 3.5 Variant Discovery in High-Throughput Sequencing Data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 Genome Research 20:1297-303 Phylogeny / Alignment gatk_xsede gatk_invoke_command perl 'gatk_xsede.centos7' infile Input reference fasta file ref.fasta gatk_realalignertarget_command perl $gatk_real perl 'java -jar /opt/biotools/GenomeAnalysisTK/3.5/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I A1.bam.sorted_marked_readgroups.bam -o A1.bam.sorted_marked_readgroups_realign.intervals --filter_mismatching_base_and_quals && ' gatk_indelrealalignertarget_command perl $gatk_indelreal perl 'java -jar /opt/biotools/GenomeAnalysisTK/3.5/GenomeAnalysisTK.jar -I A1.bam.sorted_marked_readgroups.bam -R ref.fasta -T IndelRealigner -targetIntervals A1.bam.sorted_marked_readgroups_realign.intervals -o A1.bam.sorted_marked_readgroups_realign.bam &&' gatk_gvcf_command perl $gatk_gvcf perl 'java -jar /opt/biotools/GenomeAnalysisTK/3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I A1.bam.sorted_marked_readgroups_realign.bam -o A1.bam.sorted_marked_readgroups_realign.g.vcf -stand_call_conf 30.0 -stand_emit_conf 30.0 --variant_index_type LINEAR --variant_index_parameter 128000 --emitRefConfidence GVCF --filter_mismatching_base_and_quals --minPruning 2 --sample_ploidy 2' gatk_scheduler scheduler.conf perl "threads_per_process=12\\n" . "node_exclusive=0\\n" . 
"nodes=1\\n" 0 Fastaout fasta output output.fasta all_output fasta output * runtime 1 scheduler.conf Maximum Hours to Run (click here for help setting this correctly) perl "runhours=$value\\n" 0.25 Maximum Hours to Run must be less than 168 perl $runtime > 168.0 Maximum Hours to Run must be greater than 0.1 perl $runtime < 0.1 The job will run on 12 processors as configured. If it runs for the entire configured time, it will consume 12 x $runtime cpu hours perl $runtime ne 0 Estimate the maximum time your job will need to run. We recommend testimg initially with a < 0.5hr test run because Jobs set for 0.5 h or less depedendably run immediately in the "debug" queue. Once you are sure the configuration is correct, you then increase the time. The reason is that jobs > 0.5 h are submitted to the "normal" queue, where jobs configured for 1 or a few hours times may run sooner than jobs configured for the full 168 hours. gatk_real Run realignertarget command 1 gatk_indelreal Run indelrealignertarget command 1 gatk_gvcf Run GVCF command 1 gatkref_dict Select the .dict file (for reference file in main input) ref.dict This is the dictionary file for the ref fasta file gatkref_fai Select the .fai file (reference file in main input) ref.fasta.fai This is the fasta index file for the ref fasta file gatk_sortedbam Select the sorted .bam file (reference file in main input) A1.bam.sorted_marked_readgroups.bam This is the sorted marked readgroups bam file gatk_sortedbamindex Select the sorted .bam index file, .bai (reference file in main input) A1.bam.sorted_marked_readgroups.bai This is the sorted marked readgroups bam index file produced by Picard. 
indelrealign_inf Select the indel realign .intervals file perl $gatk_indelreal && !$gatk_real A1.bam.sorted_marked_readgroups_realign.intervals This is the sorted marked readgroups realign .intervals file vcf_inf Select the indel realign .bam file perl $gatk_gvcf && !$gatk_indelreal && !$gatk_real A1.bam.sorted_marked_readgroups_realign.bam This is the sorted marked readgroups realign .bam file for the third (vcf) step gatk_sortedbamindex2 Select the sorted .bam index file, .bai (for runs with step 3 only) perl $gatk_gvcf && !$gatk_indelreal && !$gatk_real A1.bam.sorted_marked_readgroups_realign.bai This is the sorted marked readgroups bam index file, produced by Picard or some other tool, for step-3-only runs.
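
The parameter rules above (run-time limits, the 12-processor CPU-hour estimate, the 0.5 h queue cutoff, and the conditions under which extra input files are requested) can be easier to follow as a small sketch. The Python below is illustrative only and is not part of the tool definition: the function names check_runtime, max_cpu_hours, expected_queue, and extra_inputs are hypothetical, and the logic simply mirrors the Perl expressions and help text as written.

```python
from typing import List

PROCESSORS = 12          # from "threads_per_process=12" in scheduler.conf
MIN_HOURS = 0.1          # lower bound checked by "$runtime < 0.1"
MAX_HOURS = 168.0        # upper bound checked by "$runtime > 168.0"
DEBUG_QUEUE_LIMIT = 0.5  # help text: jobs of 0.5 h or less use the "debug" queue


def check_runtime(runtime: float) -> List[str]:
    """Return the error messages the form would show (hypothetical helper)."""
    errors = []
    if runtime > MAX_HOURS:
        errors.append("Maximum Hours to Run must be less than 168")
    if runtime < MIN_HOURS:
        errors.append("Maximum Hours to Run must be greater than 0.1")
    return errors


def max_cpu_hours(runtime: float) -> float:
    """Upper bound on CPU hours: 12 x runtime, as stated in the help text."""
    return PROCESSORS * runtime


def expected_queue(runtime: float) -> str:
    """Queue named in the help text for a given run-time setting."""
    return "debug" if runtime <= DEBUG_QUEUE_LIMIT else "normal"


def extra_inputs(gatk_real: bool, gatk_indelreal: bool, gatk_gvcf: bool) -> List[str]:
    """Extra files requested when earlier steps are skipped.

    Mirrors the conditions on indelrealign_inf, vcf_inf, and
    gatk_sortedbamindex2 exactly as they appear in the definition.
    """
    extras = []
    if gatk_indelreal and not gatk_real:
        extras.append("A1.bam.sorted_marked_readgroups_realign.intervals")
    if gatk_gvcf and not gatk_indelreal and not gatk_real:
        extras.append("A1.bam.sorted_marked_readgroups_realign.bam")
        extras.append("A1.bam.sorted_marked_readgroups_realign.bai")
    return extras


if __name__ == "__main__":
    hours = 0.25  # the default value of the runtime parameter
    print(check_runtime(hours), max_cpu_hours(hours), expected_queue(hours))
    print(extra_inputs(gatk_real=False, gatk_indelreal=False, gatk_gvcf=True))
```

With all three steps enabled, extra_inputs returns an empty list, because each step's output feeds the next; the additional file selectors only come into play when a run starts at step 2 or step 3.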