README.alignment_data

  1. This readme describes the alignment data available on the ftp site, how it is
  2. processed and what summary information is available for each alignment.
  3. A. File format and alignment index
  4. All alignment data is in the BAM format. This is the binary version of the SAM
  5. format, which is described here: http://samtools.sourceforge.net/. Each BAM file
  6. has two associated files: an index file, which has the same name but ends with
  7. .bai, and a statistics file, which has the same name but ends with .bas. The format
  8. of the bas file is described further down in section D.
  9. The alignments are found under data/XXXXXXXX/alignment and data/XXXXXXXX/exome_alignment,
  10. where the XXXXXXXX identifier is the sample name.
  11. There is an alignment.index and an exome.alignment.index, which link the alignment bam
  12. files together with their matching bai and bas files and give md5sums for each.
  13. The columns are as follows:
  14. 1. bam file
  15. 2. bam file md5
  16. 3. bai index file
  17. 4. bai index file md5
  18. 5. bas statistics file
  19. 6. bas statistics file md5
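As a minimal sketch of reading this six-column layout, the Python below splits one index line into named fields. It assumes the columns are tab-separated and that any header or comment lines start with '#'; neither is stated above. The names IndexEntry and parse_index_line are illustrative only, and the sample line uses made-up file names and checksums.

```python
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    """One row of an alignment.index file, in the column order listed above."""
    bam: str
    bam_md5: str
    bai: str
    bai_md5: str
    bas: str
    bas_md5: str


def parse_index_line(line: str) -> Optional[IndexEntry]:
    """Return the six fields of one index line, or None for blank/comment lines."""
    line = line.rstrip("\n")
    # Assumption: comment or header lines, if present, start with '#'.
    if not line or line.startswith("#"):
        return None
    # Assumption: columns are separated by tabs.
    fields = line.split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    return IndexEntry(*fields)


# Made-up example line; file names and checksums are placeholders.
example = "\t".join([
    "data/S1/alignment/S1.bam", "aaa",
    "data/S1/alignment/S1.bam.bai", "bbb",
    "data/S1/alignment/S1.bam.bas", "ccc",
])
entry = parse_index_line(example)
print(entry.bam, entry.bas_md5)
```

Using a named tuple keeps each md5 next to the file it belongs to, which helps when checking downloads against the index.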
  20. Directory alignment_indices/ contains the most current alignment.index (the one
  21. with the latest date and identical to the alignment.index one level up), as well
  22. as alignment.index files of previous BAM releases. The directory also contains
  23. the following files about BAM statistics:
  24. yyyymmdd.alignment.index.bas.gz - a collective bas file of all BAM files.
  25. yyyymmdd_yyyymmdd.alignment_stats.csv - summary statistics of the BAMs in the
  26. two releases specified by the dates in the file name; a comparison of
  27. the two releases is captured in the "diff" values of the file. The file
  28. contains the following information, broken down by platform:
  29. 1. mapped basepairs in Gb
  30. 2. # of individuals in the release
  31. 3. # of individuals with > 10 Gb of mapped sequences
  32. Mapped basepairs in Gb is also shown broken down by population at the
  33. bottom of the file.
  34. 20091216.alignment_stats.csv - summary statistics of the very first
  35. release of main project BAMs.
  36. The exome alignments also have HsMetrics files, which contain the results from
  37. the Picard tool CalculateHsMetrics.
  38. The bam filenames themselves contain a lot of information, e.g.:
  39. NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam
  40. The name can be broken down into 7 pieces:
  41. 1. Sample name: this matches column 10 in the sequence.index file and
  42. represents the individual all the sequences belong to.
  43. 2. Region: this is generally either mapped or unmapped. The mapped file contains
  44. all the reads mapped to the reference genome; the unmapped file contains all the unmapped reads.
  45. You may also see chromN files, which contain mappings to just that chromosome.
  46. 3. Sequencing platform: all our current releases should be ILLUMINA, but old alignment files
  47. will have SOLID or LS454.
  48. 4. Mapping algorithm: for ILLUMINA this is bwa; again, older files may have bfast for SOLID
  49. or ssaha or smalt for 454.
  50. 5. Population*: the sample's 3-letter population code; this code is defined
  51. in README.populations
  52. 6. Analysis group*: this describes the sequencing strategy used for the
  53. sequence being aligned, those strategies being 'low coverage',
  54. 'high coverage', 'exon targeted' and 'exome'.
  55. 7. Date in the format YYYYMMDD: this should match the date from the
  56. sequence.index file which was used to produce the alignments. This date
  57. will also be in the alignment index name. Over time one release of alignments
  58. will contain multiple index dates. The index itself is named for the most recent
  59. sequence.index it is based on. The date in the bam file name reflects the last
  60. sequence index in which new data was added for that individual.
  61. *Note: For historical reasons, in the first release of BAM files, fields 5 and 6
  62. are replaced with a single field giving the project name, such as SRP000033. One example
  63. is NA20828.chrom7.ILLUMINA.bwa.SRP000033.20091216.bam
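The naming scheme above can be sketched as a small parser. This is an illustration, not an official tool: the function name parse_bam_name and the dictionary keys are invented here, and the code assumes that no individual field contains a '.' character. It handles both the 7-field form and the 6-field first-release form described in the note.

```python
from typing import Dict

# Field names are illustrative labels for the 7 pieces described above.
CURRENT_FIELDS = ["sample", "region", "platform", "algorithm",
                  "population", "analysis_group", "date"]
# First-release names carry a project id instead of population + analysis group.
FIRST_RELEASE_FIELDS = ["sample", "region", "platform", "algorithm",
                        "project", "date"]


def parse_bam_name(filename: str) -> Dict[str, str]:
    """Split a release BAM file name into its named pieces."""
    parts = filename.split(".")
    if parts[-1] != "bam":
        raise ValueError(f"not a .bam file name: {filename}")
    parts = parts[:-1]  # drop the 'bam' extension
    if len(parts) == len(CURRENT_FIELDS):
        keys = CURRENT_FIELDS
    elif len(parts) == len(FIRST_RELEASE_FIELDS):
        keys = FIRST_RELEASE_FIELDS
    else:
        raise ValueError(f"unexpected number of fields in: {filename}")
    return dict(zip(keys, parts))


current = parse_bam_name(
    "NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam")
first_release = parse_bam_name(
    "NA20828.chrom7.ILLUMINA.bwa.SRP000033.20091216.bam")
print(current["population"], current["analysis_group"], current["date"])
print(first_release["project"], first_release["date"])
```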
  64. B. Alignment Process
  65. All the most recent alignments were produced by Richard Durbin's group at the Sanger
  66. Institute, based on our analysis.sequence.index file, which contains all our ILLUMINA
  67. sequence data with reads of 70bp or longer. Older alignment releases also contain SOLID and 454
  68. sequence data; their alignment process is explained in the readme which sits with those
  69. alignments.
  70. Illumina data was aligned with bwa v0.5.9 in 4 steps:
  71. 1. Index the reference fasta:
  72. bwa index -a bwtsw $reference_fasta
  73. 2. For each fastq file:
  74. bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file
  75. 3. Create SAM files using bwa sampe or samse for paired-end or unpaired
  76. reads respectively. For paired-end reads, the maximum insert size is
  77. taken to be 3 times the expected insert size.
  78. bwa sampe -a $max_insert_size -f $sam_file $reference_fasta $sai_files $fastq_files
  79. bwa samse -f $sam_file $reference_fasta $sai_file $fastq_file
  80. 4. Create BAM from SAM, sort, fix mate information and add the MD tag:
  81. samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp |
  82. samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp |
  83. samtools fillmd -u - $reference_fasta > $fixed_bam_file
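To make steps 2 and 3 concrete, the sketch below assembles the bwa aln and bwa sampe command lines as argument lists, using only the options shown above. It is not the pipeline that was actually used: the helper names and every file path are hypothetical, and nothing is executed. The one piece of logic it demonstrates is the rule that the maximum insert size passed to sampe is 3 times the expected insert size.

```python
import shlex
from typing import List


def bwa_aln_cmd(reference: str, fastq: str, sai: str) -> List[str]:
    """Step 2: one 'bwa aln' call per fastq file."""
    return ["bwa", "aln", "-q", "15", "-f", sai, reference, fastq]


def bwa_sampe_cmd(reference: str, sais: List[str], fastqs: List[str],
                  sam: str, expected_insert_size: int) -> List[str]:
    """Step 3 for paired-end reads; max insert size is 3x the expected size."""
    max_insert_size = 3 * expected_insert_size
    return ["bwa", "sampe", "-a", str(max_insert_size), "-f", sam,
            reference, *sais, *fastqs]


# Hypothetical file names, for illustration only.
ref = "reference.fa"
fastqs = ["run_1.fastq", "run_2.fastq"]
sais = ["run_1.sai", "run_2.sai"]

aln_cmds = [bwa_aln_cmd(ref, fq, sai) for fq, sai in zip(fastqs, sais)]
sampe_cmd = bwa_sampe_cmd(ref, sais, fastqs, "run.sam",
                          expected_insert_size=500)

for cmd in aln_cmds + [sampe_cmd]:
    print(shlex.join(cmd))
```

Building argument lists rather than shell strings avoids quoting problems if the commands are later passed to subprocess.run.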
  84. Bam Improvement
  85. The run-level alignment BAMs are improved in various ways to help increase
  86. the quality and speed of subsequent SNP calling that may be carried out on
  87. them. For Illumina BAMs the following improvements were performed:
  88. 1. Reads undergo local realignment around known Indels using GATK
  89. IndelRealigner.
  90. java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -o $intervals_file -known $known_indels_file(s)
  91. java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -known $known_indels_file(s) -LOD 0.4 -model KNOWNS_ONLY -compress 0 --disable_bam_indexing
  92. where the known Indels are from the following:
  93. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf.gz
  94. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf.gz
  95. 2. Realigned BAMs have their read qualities recalibrated with GATK
  96. CountCovariates and TableRecalibration.
  97. java $jvm_args -jar GenomeAnalysisTK.jar -T CountCovariates -R $reference_fasta -I $realigned_bam_file -recalFile recal_data.csv -knownSites $known_sites_file(s) -l INFO -L '1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;MT' -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate
  98. java $jvm_args -jar GenomeAnalysisTK.jar -T TableRecalibration -R $reference_fasta -recalFile recal_data.csv -I $realigned_bam_file -o $recalibrated_bam_file -l INFO -compress 0 --disable_bam_indexing
  99. where the known sites for recalibration are from dbSNP135:
  100. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.dbsnp.build135.snps.sites.vcf.gz
  101. 3. samtools calmd is applied to the recalibrated BAMs, which fixes the NM
  102. tags and introduces BQ tags which can be used during SNP calling.
  103. samtools calmd -Erb $recalibrated_bam_file $reference_fasta > $bq_bam_file
  104. For Exome Solid BAMs, only step 2 was performed.
  105. For low coverage Solid BAMs, only steps 1 and 2 were performed.
  106. Release bam file production
  107. The improved BAMs are merged together to get the release BAM files available
  108. for download. Release BAM files therefore contain reads from multiple
  109. readgroups. Production proceeds broadly as follows:
  110. 1. Run-level BAMs have extraneous tags (OQ, XM, XG, XO) stripped from them,
  111. to reduce total file size by around 30%.
  112. 2. Tag-stripped BAMs are merged to the library level with Picard. (merging
  113. to a given level means to create a single BAM file containing all the
  114. readgroups that share a given member of that level; in this case it means
  115. a BAM will be made for each library, each library BAM containing all the
  116. reads from the run-level BAMs associated with that library).
  117. java $jvm_args -jar MergeSamFiles.jar INPUT=$tag_stripped_bam_file(s) OUTPUT=$library_level_bam VALIDATION_STRINGENCY=SILENT
  118. 3. PCR duplicates are marked in library-level BAMs using Picard
  119. MarkDuplicates.
  120. java $jvm_args -jar MarkDuplicates.jar INPUT=$library_level_bam OUTPUT=$markdup_bam_file ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT
  121. 4. Duplicate-marked library-level BAMs are merged to the platform level.
  122. java $jvm_args -jar MergeSamFiles.jar INPUT=$markdup_bam_file(s) OUTPUT=$platform_level_bam VALIDATION_STRINGENCY=SILENT
  123. 5. Platform-level BAMs are split by chromosome (and other regions, see above) into
  124. the BAMs which are the release files.
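The idea of "merging to a level" in steps 2 and 4 can be illustrated with a small grouping function. This is a sketch of the bookkeeping only, under the assumption that each run-level BAM is known together with its library and platform; the record layout, the function name group_by_level and the file names are all made up here. The merging itself would still be done with Picard, as shown above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (run-level bam, library, platform); all values below are placeholders.
RunBam = Tuple[str, str, str]


def group_by_level(run_bams: List[RunBam]) -> Dict[str, Dict[str, List[str]]]:
    """Map platform -> library -> run-level BAMs that share that library."""
    tree: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for bam, library, platform in run_bams:
        tree[platform][library].append(bam)
    # Convert nested defaultdicts to plain dicts for predictable behaviour.
    return {platform: dict(libs) for platform, libs in tree.items()}


runs = [
    ("run1.bam", "libA", "ILLUMINA"),
    ("run2.bam", "libA", "ILLUMINA"),
    ("run3.bam", "libB", "ILLUMINA"),
]
merge_plan = group_by_level(runs)

for platform, libraries in merge_plan.items():
    for library, bams in libraries.items():
        # Each inner list is the INPUT set for one library-level merge.
        print(platform, library, bams)
```

Each inner list corresponds to one library-level merge, and each outer key to one platform-level merge of the duplicate-marked library BAMs.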
  125. C. QA of the alignment data by DCC
  126. Platform-level bams for individuals go through a QA process at the DCC
  127. before they are released to the ftp site. The QA measurements and criteria are listed here:
  128. 1. Check md5 for each file to make sure the file was not corrupted during transfer.
  129. 2. Check coverage status. For exome runs this is done by running CalculateHsMetrics
  130. (a tool in Picard) to evaluate exome sequence coverage. Samples with less than
  131. 70% 20x coverage in targeted regions are considered to have coverage that is too low. The bait file
  132. used in this calculation is ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets_phases1_and_2/20120518.consensus.annotation.bed
  133. For low coverage samples we use the bas files supplied with the bams. These contain
  134. the number of mapped bases and the number of duplicate bases. We calculate the non-duplicated
  135. mapped coverage. This must be 3x or greater (considering a genome size of 2.75Gb).
  136. 3. Check for an excessive number of short insertions or deletions (<=6bp). If the ratio of
  137. short insertion counts to short deletion counts is greater than 5, the sample fails.
  138. The reverse ratio of short deletion counts to short insertion counts is also checked
  139. the same way. (This uses a custom script, if you would like a copy please email info@1000genomes.org)
  140. 4. Check for sample contamination using VerifyBamID, developed by Hyun Kang at the University of Michigan.
  141. VerifyBamID is software that verifies whether the reads in a particular file match previously
  142. known genotypes for an individual (or group of individuals), and checks whether the reads are
  143. contaminated as a mixture of two samples. VerifyBamID can detect sample contamination and swaps
  144. when external genotypes are available. When external genotypes are not available, VerifyBamID still
  145. robustly detects sample swaps. Genotype data for most phase1 and 2 samples, typed on OMNI chips, and
  146. genotype data for most phase3 samples, typed on Affymetrix chips, were used in this QA process. The original OMNI
  147. genotype vcf files can be found at /nfs/1000g-archive/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz
  148. and Affy genotype data can be found at /nfs/1000g-archive/vol1/ftp/technical/working/20121128_corriel_p3_sample_genotypes/. We divided the vcf files by sample
  149. population when running VerifyBamID.
  150. The CHIP_MIX and FREE_MIX cutoffs range from 2% to 3.5% for different BAM releases.
  151. http://genome.sph.umich.edu/wiki/VerifyBamID
  152. 5. Check for completeness of data against sequence index:
  153. a. All reads and runs for an individual should be included in
  154. corresponding BAM files
  155. b. All reads and runs from a BAM file should be from the same
  156. individual
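The low coverage check in step 2 and the indel ratio check in step 3 reduce to simple arithmetic, sketched below. This is not the DCC's custom script: the function names are invented, the inputs are assumed to be per-sample totals, duplicate bases are assumed to be a subset of mapped bases, and a sample with no short indels at all is treated as passing, which the text above does not specify.

```python
GENOME_SIZE_BP = 2.75e9   # genome size stated in step 2
MIN_COVERAGE = 3.0        # non-duplicated mapped coverage must be 3x or greater
MAX_INDEL_RATIO = 5.0     # either ratio greater than 5 fails the sample


def non_duplicated_coverage(mapped_bases: int, duplicate_bases: int) -> float:
    """Coverage from mapped bases after removing duplicate bases."""
    return (mapped_bases - duplicate_bases) / GENOME_SIZE_BP


def passes_coverage(mapped_bases: int, duplicate_bases: int) -> bool:
    return non_duplicated_coverage(mapped_bases, duplicate_bases) >= MIN_COVERAGE


def passes_indel_ratio(short_insertions: int, short_deletions: int) -> bool:
    """Fail if insertions/deletions or deletions/insertions exceeds 5.

    Comparing by multiplication avoids dividing by zero when one count is 0.
    """
    if short_insertions > MAX_INDEL_RATIO * short_deletions:
        return False
    if short_deletions > MAX_INDEL_RATIO * short_insertions:
        return False
    return True


# Made-up totals: 10 Gb mapped, 1 Gb duplicated gives 9 / 2.75 = about 3.27x.
print(round(non_duplicated_coverage(10_000_000_000, 1_000_000_000), 2))
print(passes_coverage(8_000_000_000, 0))      # about 2.91x, below the cutoff
print(passes_indel_ratio(600, 100))           # ratio 6 is greater than 5
```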
  157. It is important to note that the 20130502 bam release was unusual in that not all QC passed bams
  158. made it into the data/XXXXXXXX/alignment directories.
  159. For the final round of analysis for the project we only used ILLUMINA sequence data with reads of 70bp or longer.
  160. The group decided that for the final variant calling only samples with both low coverage and exome sequence would
  161. be considered. This means there were a small number of samples with just exome or just low coverage alignments.
  162. The sequence data is still included in the sequence.index file and, while the alignments are not being used in
  163. our variant calling, they are still available to be downloaded from:
  164. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/phase3_EX_or_LC_only_alignment/
  165. These bam files have their own alignment index which you can find in this directory.
  166. D. bas files
  167. .bas files are bam statistics files, one line per readgroup, columns separated by
  168. tabs. The first line is a header that describes each column. The first 7 columns
  169. provide meta information about each readgroup.
  170. The remaining columns provide various statistics about the readgroup, calculated
  171. by going through the release bams. Where data isn't available to calculate the
  172. result for a column the default value will be 0.
  173. Each column is described in detail below.
  174. Column 1 'bam_filename': The DCC bam file name in which the readgroup data can
  175. be found.
  176. Column 2 'md5': The md5 checksum of the bam file named in column 1.
  177. Column 3 'study': The SRA study id this readgroup belongs to.
  178. Column 4 'sample': The sample (individual) identifier the readgroup came from.
  179. Column 5 'platform': The sequencing platform (technology) used to sequence the
  180. readgroup.
  181. Column 6 'library': The name of the library used for the readgroup.
  182. Column 7 'readgroup': The readgroup identifier. This is unique per .bas file.
  183. The remaining columns summarise data for reads with this RG tag in the bam
  184. file given in column 1.
  185. Column 8 '#_total_bases': The sum of the length of all reads in this readgroup.
  186. Column 9 '#_mapped_bases': The sum of the length of all reads in this readgroup
  187. that did not have flag 4 (== unmapped).
  188. Column 10 '#_total_reads': The total number of reads in this readgroup.
  189. Column 11 '#_mapped_reads': The total number of reads in this readgroup that did
  190. not have flag 4 (== unmapped).
  191. Column 12 '#_mapped_reads_paired_in_sequencing': As for column 11, but also
  192. requiring flag 1 (== reads paired in sequencing).
  193. Column 13 '#_mapped_reads_properly_paired': As for column 11, but also requiring
  194. flag 2 (== mapped in a proper pair, inferred during alignment).
  195. Column 14 '%_of_mismatched_bases': Calculated by summing the read lengths of all
  196. reads in this readgroup that have an NM tag, summing the edit distances
  197. obtained from the NM tags, and getting the percentage of the latter out of
  198. the former to 2 decimal places.
  199. Column 15 'average_quality_of_mapped_bases': The mean of all the base qualities
  200. of the bases counted for column 9, to 2 decimal places.
  201. Column 16 'mean_insert_size': The mean of all insert sizes (ISIZE field) greater
  202. than 0 for properly paired reads (as counted in column 13) and with a
  203. mapping quality (MAPQ field) greater than 0. Rounded to the nearest whole
  204. number.
  205. Column 17 'insert_size_sd': The standard deviation from the mean of insert sizes
  206. considered for column 16. To 2 decimal places.
  207. Column 18 'median_insert_size': The median insert size, using the same set of
  208. insert sizes considered for column 16.
  209. Column 19 'insert_size_median_absolute_deviation': The median absolute deviation
  210. of the column 18 data.
  211. Column 20 '#_duplicate_reads': The number of reads which were marked as
  212. duplicates.
  213. Column 21 '#_duplicate_bases': The number of bases which were marked as
  214. duplicates.
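A .bas file can be read with any tab-separated parser. The sketch below uses Python's csv module to sum mapped and duplicate bases per sample across readgroups. It assumes the header line uses exactly the quoted column names given above (for example 'sample' and '#_mapped_bases'); the function name and the three-column sample text are made up for illustration and omit the other columns.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Tuple


def mapped_and_duplicate_bases(bas_text: str) -> Dict[str, Tuple[int, int]]:
    """Return sample -> (mapped bases, duplicate bases), summed over readgroups."""
    # Assumption: header names match the column names quoted in this section.
    reader = csv.DictReader(io.StringIO(bas_text), delimiter="\t")
    totals = defaultdict(lambda: [0, 0])
    for row in reader:
        totals[row["sample"]][0] += int(row["#_mapped_bases"])
        totals[row["sample"]][1] += int(row["#_duplicate_bases"])
    return {sample: (mapped, dup) for sample, (mapped, dup) in totals.items()}


# Made-up, cut-down .bas content: one header line, then one line per readgroup.
bas_text = (
    "sample\t#_mapped_bases\t#_duplicate_bases\n"
    "S1\t100\t10\n"
    "S1\t50\t5\n"
    "S2\t7\t0\n"
)
per_sample = mapped_and_duplicate_bases(bas_text)
print(per_sample)
```

Per-sample totals of this kind are the inputs needed for the non-duplicated mapped coverage check described in section C.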