1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
|
- This readme describes the alignment data available on the ftp site, how it is
- processed and what summary information is available for each alignment.
- A. File format and alignment index
- All alignment data is in the BAM format. This is the binary version of the SAM
- format which is described here: http://samtools.sourceforge.net/. Each BAM file
- has two associated files: a index file which has the same name but ends with
- .bai and a statistics file which has the same name but end with .bas. The format
- of bas file is described further down in in section D.
- The alignments are found under data/XXXXXXXX/alignment and data/XXXXXXX/exome_alignment
- where the XXXXXXXX identifier is the sample name.
- There is an alignment.index and an exome.alignment.index which links the alignment bam
- files together with their matching bai file and bas file and gives md5sums for each.
- The columns are as follows:
- 1. bam file
- 2. MD5
- 3. bai index file
- 4. bai index file md5
- 5. bas statistics file
- 6. bas statistic file md5
- Directory alignment_indices/ contains the most current alignment.index (the one
- with the latest date and identical to the alignment.index one level up), as well
- as alignment.index files of previous BAM releases. The directory also contains
- the following files about BAM statistics:
- yyyymmdd.alignment.index.bas.gz - a collective bas file of all BAM files.
- yyyymmdd_yyyymmdd.alignment_stats.csv - a summary statistics of BAMs in any
- two releases as specified by the dates in the file name; a comparison of
- the two releases is captured in the "diff" values of the file. The file
- contains the following information break down by platform.
- 1. mapped basepairs in Gb
- 2. # of individuals in the release
- 3. # of individuals with > 10 Gb of mapped sequences
- mapped basepairs in Gb is also shown as break down by population at the
- bottom of the file.
- 20091216.alignment_stats.csv - a summary statistics of the very first
- release of main project BAMs.
- The exome alignments also have a HsMetrics files which contain the results from
- the picard tool CalculateHsMetrics
- The bam filenames themselves contain a lot of information, e.g:
- NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
- The name can be broken down into 7 pieces:
- 1. Sample name: this matches column 10 in the sequence.index file and
- represents the individual all the sequences belong to.
- 2. Region, this is generally either mapped or unmapped, the mapped files represents
- all the mapping to the reference genome, the unmapped file represents all the unmapped reads.
- You may also see chromN files which represent mappings to just that chromosome.
- 3. Sequencing platform: all our current release should be ILLUMINA but old alignment files
- will have SOLID or LS454.
- 4. Mapping algorithm: For ILLUMINA this is bwa, again older files may have bfast for solid
- or ssaha or smalt for 454.
- 5. Population*: the Sample population 3 letter code, this code is defined
- in README.populations
- 6. Analysis group*: this describes the sequencing strategy used for the
- sequence being aligned, those strategies being 'low coverage',
- 'high coverage', 'exon targeted' and 'exome'.
- 7. Date in the format YYYYMMDD: this should match the date from the
- sequence.index file which was used to produce the alignments. This date
- will also be in the alignment index name. Over time one release of alignments
- will contain multiple index dates. The index itself is named for the most recent
- sequence.index it is based on. The index in the bam file name reflects the last
- sequence index new data was added for that individual
- *Note: For historical reason, in the first release of BAM files, fields 5 and 6
- are replaced with a single column of project name such as SRP000033. One example
- is NA20828.chrom7.ILLUMINA.bwa.SRP000033.20091216.bam
- B. Alignment Process
- All the most recent alignments were produced by Richard Durbin's group at the Sanger
- based on our analysis.sequence.index file which contains all our ILLUMINA sequence data
- which has 70bp or longer reads. Older alignment releases also contain SOLID and 454 sequence
- data and their alignment process is explained in the readme which sits with those
- alignments
- Illumina data was aligned with bwa v0.5.9 in 4 steps:
- 1. Index the reference fasta:
- bwa index -a bwtsw $reference_fasta
- 2. For each fastq file:
- bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file
- 3. Create SAM files using bwa sampe or samse for paired-end or unpaired
- reads respectively. For paired-end reads, the maximum insert size is
- taken to be 3 times the expected insert size.
- bwa sampe -a $max_insert_size -f $sam_file $reference_fasta $sai_files $fastq_files
- bwa samse -f $sam_file $reference_fasta $sai_file $fastq_file
- 4. Create BAM from SAM, sort, fix mate information and add the MD tag:
- samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp |
- samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp |
- samtools fillmd -u - $reference_fasta > $fixed_bam_file
- Bam Improvement
-
- The run-level alignment BAMs are improved in various ways to help increase
- the quality and speed of subsequent SNP calling that may be carried out on
- them. For Illumina BAMs the following improvements were performed:
- 1. Reads undergo local realignment around known Indels using GATK
- IndelRealigner.
- java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -o $intervals_file -known $known_indels_file(s)
- java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -known $known_indels_file(s) -LOD 0.4 -model KNOWNS_ONLY -compress 0 --disable_bam_indexing
- where the known Indels are from the following:
- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf.gz
- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf.gz
- 2. Realigned BAMs have their read qualities recalibrated with GATK
- CountCovariates and TableRecalibration.
- java $jvm_args -jar GenomeAnalysisTK.jar -T CountCovariates -R $reference_fasta -I $realigned_bam_file -recalFile recal_data.csv -knownSites $known_sites_file(s) -l INFO -L '1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;MT' -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate
- java $jvm_args -jar GenomeAnalysisTK.jar -T TableRecalibration -R $reference_fasta -recalFile recal_data.csv -I $realigned_bam_file -o $recalibrated_bam_file -l INFO -compress 0 --disable_bam_indexing
- where the known sites for recalibration are from dbSNP135:
- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.dbsnp.build135.snps.sites.vcf.gz
- 3. samtools calmd is applied to the recalibrated BAMs, which fixes the NM
- tags and introduces BQ tags which can be used during SNP calling.
- samtools calmd -Erb $recalibrated_bam_file $reference_fasta > $bq_bam_file
- For Exome Solid BAMs, only step 2 was performed.
- For low coverage Solid BAMs, only steps 1 and 2 were performed.
- Release bam file production
- The improved BAMs are merged together to get the release BAM files available
- for download. Release BAM files therefore contain reads from multiple
- readgroups. Production proceeds broadly as follows:
- 1. Run-level BAMs have extraneous tags (OQ, XM, XG, XO) stripped from them,
- to reduce total file size by around 30%.
- 2. Tag-stripped BAMs are merged to the library level with Picard. (merging
- to a given level means to create a single BAM file containing all the
- readgroups that share a given member of that level; in this case it means
- a BAM will be made for each library, each library BAM containing all the
- reads from the run-level BAMs associated with that library).
- java $jvm_args -jar MergeSamFiles.jar INPUT=$tag_stripped_bam_file(s) OUTPUT=$library_level_bam VALIDATION_STRINGENCY=SILENT
- 3. PCR duplicates are marked in library-level BAMs using Picard
- MarkDuplicates.
- java $jvm_args -jar MarkDuplicates.jar INPUT=$library_level_bam OUTPUT=$markdup_bam_file ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT
- 4. Duplicate-marked library-level BAMs are merged to the platform level.
- java $jvm_args -jar MergeSamFiles.jar INPUT=$markdup_bam_file(s) OUTPUT=$platform_level_bam VALIDATION_STRINGENCY=SILENT
- 5. Platform-level BAMs are split into chromosomes (and other, see above)
- BAMs, which are the release files.
- C. QA of the alignment data by DCC
- Platform-level bams for individuals are checked by a QA process by DCC
- before they are released to the ftp site. Here lists the QA measurements and criteria:
-
- 1. Check md5 for each file to make sure the file was not corrupted during transfer.
-
- 2. Check coverage status. For exome runs this is done by runnung calculate_HsMetrics
- (a function in PICARD) to evaluate EXOME sequence coverage. Samples with less than
- 70% 20x coverage in targetted regions are considered coverage too low. The bait file
- used in this calculation is ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets_phases1_and_2/20120518.consensus.annotation.bed
- For low coverage samples we use the bas files supplied with the bams. These contain
- the number of mapped bases and the number of duplicate bases. We calculate the non
- duplicated mapped coverage. This must be 3x or greater (consdering a genome size of 2.75Gb)
-
- 3. Check for excessive number of short insertions or deletions (<=6bp). If the ratio of
- short insertion counts to short deletion counts is greater than 5, the sample fails.
- The reverse ratio of short deletion counts to short insertion counts is also checked
- the same way. (This uses a custom script, if you would like a copy please email info@1000genomes.org)
-
- 4. Check for sample contamination using VerifyBAMId developed by Hyun Kang in University of Michigan.
- VerifyBamId is a software that verifies whether the reads in particular file match previously
- known genotypes for an individual (or group of individuals), and checks whether the reads are
- contaminated as a mixture of two samples. verifyBamID can detect sample contamination and swaps
- when external genotypes are available. When external genotypes are not available, verifyBamID still
- robustly detects sample swaps. Genotype data for most phase1 and 2 samples typed on OMNI chips and
- genotype data for most phase3 samples typed on Affymetrix chips were used in this QA process. The original OMNI
- genotype vcf files can be found at /nfs/1000g-archive/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz
- and Affy genotype data can be found at /nfs/1000g-archive/vol1/ftp/technical/working/20121128_corriel_p3_sample_genotypes/. We divided the vcf files by sample
- population when ran VerifyBamId.
- The CHIP_MIX and FREE_MIX cutoff ranges from 2-3.5% for different BAM releases.
- http://genome.sph.umich.edu/wiki/VerifyBamID
-
- 5. Check for completeness of data against sequence index:
- a. All reads and runs for an individual should be included in
- corresponding BAM files
- b. All reads and runs from a BAM file should be from the same
- individual
- It is important to note that the 20130502 bam release was unusual as rather than all QC passed bams
- did not make it into the data/XXXXXX/alignment directories
- For the final round of analysis for the project we only used ILLUMINA sequence data which was 70bp or greater.
- The group decided for the final variant calling only samples with both low coverage and exome sequence would
- be considered. This means there were a small number of samples with just exome or just low coverage alignments.
- The sequence data is still included in the sequence.index file and while the alignments are not being used in
- our variant calling they are still available to be downloaded.
- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/phase3_EX_or_LC_only_alignment/
- These bam files have their own alignment index which you can find in this directory.
- D. bas files
- .bas files are bam statistic files, one line per readgroup, columns separated by
- tabs. The first line is a header that describes each column. The first 6 columns
- provide meta information about each readgroup.
- The remaining columns provide various statistics about the readgroup, calculated
- by going through the release bams. Where data isn't available to calculate the
- result for a column the default value will be 0.
- Each column is described in detail below.
- Column 1 'bam_filename': The DCC bam file name in which the readgroup data can
- be found.
- Column 2 'md5': The md5 checksum of the bam file named in column 1.
- Column 3 'study': The SRA study id this readgroup belongs to.
- Column 4 'sample': The sample (individual) identifier the readgroup came from.
- Column 5 'platform': The sequencing platform (technology) used to sequence the
- readgroup.
- Column 6 'library': The name of the library used for the readgroup.
- Column 7 'readgroup': The readgroup identifier. This is unique per .bas file.
- The remaining columns summarise data for reads with this RG tag in the bam
- file given in column 1.
- Column 8 '#_total_bases': The sum of the length of all reads in this readgroup.
- Column 9 '#_mapped_bases': The sum of the length of all reads in this readgroup
- that did not have flag 4 (== unmapped).
- Column 10 '#_total_reads': The total number of reads in this readgroup.
- Column 11 '#_mapped_reads': The total number of reads in this readgroup that did
- not have flag 4 (== unmapped).
- Column 12 '#_mapped_reads_paired_in_sequencing': As for column 10, but also
- requiring flag 1 (== reads paired in sequecing).
- Column 13 '#_mapped_reads_properly_paired': As for column 10, but also requiring
- flag 2 (== mapped in a proper pair, inferred during alignment).
- Column 14 '%_of_mismatched_bases': Calculated by summing the read lengths of all
- reads in this readgroup that have an NM tag, summing the edit distances
- obtained from the NM tags, and getting the percentage of the latter out of
- the former to 2 decimal places.
- Column 15 'average_quality_of_mapped_bases': The mean of all the base qualities
- of the bases counted for column 8, to 2 decimal places.
- Column 16 'mean_insert_size': The mean of all insert sizes (ISIZE field) greater
- than 0 for properly paired reads (as counted in column 12) and with a
- mapping quality (MAPQ field) greater than 0. Rounded to the nearest whole
- number.
- Column 17 'insert_size_sd': The standard deviation from the mean of insert sizes
- considered for column 15. To 2 decimal places.
- Column 18 'median_insert_size': The median insert size, using the same set of
- insert sizes considered for column 15.
- Column 19 'insert_size_median_absolute_deviation': The median absolute deviation
- of the column 17 data.
- Column 20 '#_duplicate_reads': The number of reads which were marked as
- duplicates
- Column 21' #_duplicate_bases': The number of bases which were narked as
- duplicated
|