GATK on XSEDE
3.5
Variant Discovery in High-Throughput Sequencing Data
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 Genome Research 20:1297-303
Phylogeny / Alignment
gatk_xsede
gatk_invoke_command
perl
'gatk_xsede.centos7'
infile
Input reference fasta file
ref.fasta
gatk_realalignertarget_command
perl
$gatk_real
perl
'java -jar /opt/biotools/GenomeAnalysisTK/3.5/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I A1.bam.sorted_marked_readgroups.bam -o A1.bam.sorted_marked_readgroups_realign.intervals --filter_mismatching_base_and_quals && '
gatk_indelrealalignertarget_command
perl
$gatk_indelreal
perl
'java -jar /opt/biotools/GenomeAnalysisTK/3.5/GenomeAnalysisTK.jar -I A1.bam.sorted_marked_readgroups.bam -R ref.fasta -T IndelRealigner -targetIntervals A1.bam.sorted_marked_readgroups_realign.intervals -o A1.bam.sorted_marked_readgroups_realign.bam &&'
gatk_gvcf_command
perl
$gatk_gvcf
perl
'java -jar /opt/biotools/GenomeAnalysisTK/3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I A1.bam.sorted_marked_readgroups_realign.bam -o A1.bam.sorted_marked_readgroups_realign.g.vcf -stand_call_conf 30.0 -stand_emit_conf 30.0 --variant_index_type LINEAR --variant_index_parameter 128000 --emitRefConfidence GVCF --filter_mismatching_base_and_quals --minPruning 2 --sample_ploidy 2'
gatk_scheduler
scheduler.conf
perl
"threads_per_process=12\\n" .
"node_exclusive=0\\n" .
"nodes=1\\n"
0
Fastaout
fasta output
output.fasta
all_output
fasta output
*
runtime
1
scheduler.conf
Maximum Hours to Run (click here for help setting this correctly)
perl
"runhours=$value\\n"
0.25
Maximum Hours to Run must be 168 or less
perl
$runtime > 168.0
Maximum Hours to Run must be at least 0.1
perl
$runtime < 0.1
The job will run on 12 processors as configured. If it runs for the entire configured time, it will consume 12 x $runtime CPU hours.
perl
$runtime ne 0
Estimate the maximum time your job will need to run. We recommend testing initially with a short run (0.5 h or less), because jobs set for 0.5 h or less dependably run immediately in the "debug" queue.
Once you are sure the configuration is correct, you can then increase the time. Jobs longer than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may
run sooner than jobs configured for the full 168 hours.
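The runtime rules above can be sketched as follows. This is a minimal illustration and not part of the tool definition: the function names are hypothetical, and the limits (0.1 h, 168 h, 0.5 h for the "debug" queue, 12 processors) are taken from the text above.

```python
# Hypothetical helpers mirroring the runtime rules described above.
CORES = 12                # threads_per_process in scheduler.conf
MIN_HOURS = 0.1           # lower bound enforced by the form
MAX_HOURS = 168.0         # upper bound enforced by the form
DEBUG_QUEUE_LIMIT = 0.5   # jobs of 0.5 h or less go to the "debug" queue


def check_runtime(hours):
    """Return a list of error messages; an empty list means the value is accepted."""
    errors = []
    if hours > MAX_HOURS:
        errors.append("Maximum Hours to Run must be 168 or less")
    if hours < MIN_HOURS:
        errors.append("Maximum Hours to Run must be at least 0.1")
    return errors


def max_cpu_hours(hours):
    """CPU hours consumed if the job runs for the entire configured time."""
    return CORES * hours


def expected_queue(hours):
    """Queue the job is expected to land in, per the help text."""
    return "debug" if hours <= DEBUG_QUEUE_LIMIT else "normal"
```

With the default of 0.25 hours, `check_runtime(0.25)` reports no errors, the job is expected to go to the "debug" queue, and it can consume at most 12 x 0.25 = 3 CPU hours.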
gatk_real
Run RealignerTargetCreator command
1
gatk_indelreal
Run IndelRealigner command
1
gatk_gvcf
Run GVCF command
1
gatkref_dict
Select the .dict file (for reference file in main input)
ref.dict
This is the dictionary file for the ref fasta file
gatkref_fai
Select the .fai file (reference file in main input)
ref.fasta.fai
This is the fasta index file for the ref fasta file
gatk_sortedbam
Select the sorted .bam file (reference file in main input)
A1.bam.sorted_marked_readgroups.bam
This is the sorted marked readgroups bam file
gatk_sortedbamindex
Select the sorted .bam index file, .bai (reference file in main input)
A1.bam.sorted_marked_readgroups.bai
This is the sorted marked readgroups bam index file produced by Picard.
indelrealign_inf
Select the indel realign .intervals file
perl
$gatk_indelreal && !$gatk_real
A1.bam.sorted_marked_readgroups_realign.intervals
This is the sorted marked readgroups realign .intervals file
vcf_inf
Select the indel realign .bam file
perl
$gatk_gvcf && !$gatk_indelreal && !$gatk_real
A1.bam.sorted_marked_readgroups_realign.bam
This is the sorted marked readgroups realign file for the third (vcf) step
gatk_sortedbamindex2
Select the sorted .bam index file, .bai (for runs with step 3 only)
perl
$gatk_gvcf && !$gatk_indelreal && !$gatk_real
A1.bam.sorted_marked_readgroups_realign.bai
This is the sorted marked readgroups bam index file produced by Picard, or some other tool for step 3 only runs.
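The conditions on the optional inputs above can be read as a small decision rule: an intermediate file must be uploaded only when the step that would have produced it is skipped. The sketch below restates the three Perl conditions exactly as written. The function name `extra_inputs` is hypothetical and used only for illustration.

```python
# Hypothetical restatement of the preconditions on the optional input files.
def extra_inputs(run_real, run_indelreal, run_gvcf):
    """List the intermediate files the user must supply for the chosen steps."""
    needed = []
    # $gatk_indelreal && !$gatk_real: IndelRealigner runs without RealignerTargetCreator
    if run_indelreal and not run_real:
        needed.append("A1.bam.sorted_marked_readgroups_realign.intervals")
    # $gatk_gvcf && !$gatk_indelreal && !$gatk_real: only the GVCF step runs
    if run_gvcf and not run_indelreal and not run_real:
        needed.append("A1.bam.sorted_marked_readgroups_realign.bam")
        needed.append("A1.bam.sorted_marked_readgroups_realign.bai")
    return needed
```

For example, `extra_inputs(False, False, True)` returns the realigned .bam and its .bai index, while a full three-step run, `extra_inputs(True, True, True)`, needs no extra files beyond the main inputs.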