版權說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權,請進行舉報或認領
文檔簡介
1、Chapter 2 genome alignment and assembly,Contents,Alignment algorithms for short-reads Background Blast (why cant we use it?) Adapting hashed seed-extend algorithms to work with shorter reads Indel detection Suffix/Prefix Tries Other alignment considerations Typical alignment pipeline Assembly algori
2、thms for short reads Effect of repeats Overlap-Consensus de Bruijn graphs Assembly evaluation metrics Typical assembly pipeline,Alignment of reads to a reference,Why is short read alignment hard?,The shorter a read, the less likely it is to have a unique match to a reference sequence,Why do we gener
3、ate short reads?,Sanger reads lengths 800-2000bp Generally we define short reads as anything below 200bp Illumina (150bp max) SoLID (75bp max) Ion Torrent (200-300bp max currently.) Helicos (50bp) Even with these platforms it is cheaper to produce short reads (e.g. 36bp) rather than 100 or 200bp rea
4、ds Diminishing returns: For some applications 36bp is more than sufficient Resequencing of smaller organisms Bacterial de-novo assembly ChIP-Seq Digital Gene Expression profiling Bacterial RNA-seq,Short read alignment applications,Genotyping: Methylation SNPs Indels,Classify and measure peaks: ChIP-
5、Seq RNA-Seq,Contents,Alignment algorithms for short-reads Background Blast (why cant we use it?) Adapting hashed seed-extend algorithms to work with shorter reads Indel detection Suffix/Prefix Tries Other alignment considerations Typical alignment pipeline Assembly algorithms for short reads Effect
6、of repeats Overlap-Consensus de Bruijn graphs Assembly evaluation metrics Typical assembly pipeline,BLAST Basic Local Alignment Search Tool,Background BLAST,Primarily designed to identify homologous sequences Blast is a hashed seed-extend algorithm Functional conservation Only some parts of a sequen
7、ce are under positive selection From a functional perspective, these are the interesting parts,BLAST - Original version,A C G A A G T A A G G T C C A G T,C C C T T C C T G G A T T G C G A,Example: Seed size = 4, No mismatches in seed The matching word GGTC initiates an alignment Extension to the lef
8、t and right with no gaps until alignment score falls below 50% Output: GTAAGGTCC GTTAGGTCC,BLAST - Original algorithm,Finding seeds significantly increases the speed of BLAST compared to doing a full local alignment over a whole sequence BLAST first finds highly conserved sequences which are then ex
9、tended with a local alignment.,BLAST Speed (or lack of),Typically BLAST will take approximately 0.1 1 second to search 1 sequence against a database Depends on size of database, e-value cutoff and number of hits to report selected 60 million reads equates to 70 CPU days! Even on multi-core systems t
10、his is too long! Especially if you have multiple samples! This is still true of FPGA and SIMD (vectorised) implementations of BLAST,When NOT to use BLAST,A typical situation: you have a piece of DNA sequence and want to extend it or find what gene it belongs to. In other words, you want an exact or
11、near-exact match to a sequence that is part of an assembled genome. Short reads require very fast algorithms for finding near-exact matches in genomic sequences: BLAT Highly recommended: the BLAT paper (Kent WJ (2003) Genome Res 12:656-64) - it is famous for its unorthodox writing style SOAP Bowtie
12、MAQ BWA Shrimp2,Contents,Alignment algorithms for short-reads Background Blast (why cant we use it?) Adapting hashed seed-extend algorthims to work with shorter reads Suffix/Prefix Tries Other alignment considerations Typical alignment pipeline Assembly algorithms for short reads Effect of repeats O
13、verlap-Consensus de Bruijn graphs Assembly evaluation metrics Typical assembly pipeline,Adapting hashed seed-extend algorithms to work with shorter reads,Improve seed matching sensitivity Allow mismatches within seed BLAST Allow mismatches + Adopt spaced-seed approach ELAND, SOAP, MAQ, RMAP, ZOOM Al
14、low mismatches + Spaced-seeds + Multi-seeds SSAHA2, BLAT, ELAND2 Above and/or Improve speed of local alignment for seed extension Single Instruction Multiple Data Shrimp2, CLCBio Reduce search space to region around seed,Hashed seed-extend algorithms,2 step process 1 identify a match to the seed seq
15、uence in the reference Extend match using sensitive (but slow) Smith-Waterman algorithm,Seed-extend algorithm,Reference sequence: .ACTGGGTCATCGTACGATCGATCGATCGATCGATCGGCTAGCTAGCTA.,Short read: GTCATCGTACGATCGATAGATCGATCGATCGGCTA,Note that the short read has 1 difference wrt to reference,Seed-extend
16、algorithm,Reference sequence: .ACTGGGTCATCGTACGATCGATCGATCGATCGATCGGCTAGCTAGCTA.,Short read: GTCATCGTACG ATCGATAGATCG ATCGATCGGCTA,11bp word,11bp word,11bp word,The algorithm will try to match each word to the reference. If there is a match at with any single word it will perform a local alignment t
17、o extend the match,Seed-extend algorithm,Reference sequence: .ACTGGGTCATCGTACGATCGATCGATCGATCGATCGGCTAGCTAGCTA.,Short read: GTCATCGTACG ATCGATAGATCG ATCGATCGGCTA,Here the algorithm is able to match the short read with a word length of 11bp,GTCATCGTACG,ATCGAACGATCGATCGATCGGCTA,Seed-extend algorithm,R
18、eference sequence: .ACTGGGTCATCGTACGATCGATCGATCGATCGATCGGCTAGCTAGCTA.,Short read: GTCATCGTACGATCGATCGATCGATCGATCGGCAA,Note that the short read has 3 differences Possibly sequencing errors, possibly SNPs,Seed-extend algorithm,Reference sequence: .ACTGGGTCATCGTACGATCGATCGATCGATCGATCGGCTAGCTAGCTA.,Shor
19、t read: GTCATCGTACG ATCGATCGATCG ATCGATCGGCAA,Note that the short read has 3 differences,11bp word,11bp word,11bp word,Seed-extend algorithm,Reference sequence: .ACTGGGTCATCGTACGATCGATCGATCGATCGATCGGCTAGCTAGCTA.,No seeds match Therefore the algorithm would find no hits at all!,Short read: GTCATCGTAC
20、G ATCGATCGATCG ATCGATCGGCAA,Adapting hashed seed-extend algorithms to work with shorter reads,Improve seed matching sensitivity Allow mismatches within seed BLAST Allow mismatches + Adopt spaced-seed approach ELAND, SOAP, MAQ, RMAP, ZOOM Allow mismatches + Spaced-seeds + Multi-seeds SSAHA2, BLAT, EL
21、AND2 Above and/or Improve speed of local alignment for seed extension Single Instruction Multiple Data Shrimp2, CLCBio Reduce search space to region around seed,Adapting hashed seed-extend algorithms to work with shorter reads,Improve seed matching sensitivity Allow mismatches within seed BLAST Allo
22、w mismatches + Adopt spaced-seed approach ELAND, SOAP, MAQ, RMAP, ZOOM Allow mismatches + Spaced-seeds + Multi-seeds SSAHA2, BLAT, ELAND2 Above and/or Improve speed of local alignment for seed extension Single Instruction Multiple Data Shrimp2, CLCBio Reduce search space to region around seed,Consec
23、utive seed,CCACTGTCCTCCTACATAGGAACGA,Consecutive seed 9bp with no mismatches:,ACTCCCATCGTCATCGTACTAGGGATCGTAACA,SNP heavy read,Reference sequence,Even allowing for 2 mismatches in the seed - no seeds match. No hits!,Cannot find seed match due to A-C SNP and G-C SNP,TCATCGTAC,TCCTCCTAC,Spaced seeds,T
24、o increase sensitivity we can used spaced-seeds:,11111111111,11001100110011001,Consecutive seed template with length 9bp,Spaced-seed template with weight 9bp,ACTATCATCGTACACAT,TCATCGTAC,ACTATCATCGTACACAT,ACTCTCACCGTACACAT,Reference,Query,Reference,Query,Spaced seeds,CCACTGTAATCGTACATGGGAACGA,Spaced
25、seed with weight 9bp and no mismatches:,ACTCCCATTGTCATCGTACTTGGGATCGTAACA,SNP heavy read,Reference sequence,Can now extend with Smith-Waterman or other local alignment,Despite SNPs seed matched with 0 mismatches,CCATTGTCATCGTACAT,CCXXTGXXATXXTAXXT,Spaced seeds,Ma, B. et al. PatternHunter. Bioinforma
26、tics Vol 18, No 3, 2002,Spaced seeds:,A seed template 111010010100110111 is 55% more sensitive than BLASTs default template 11111111111 for two sequences of 70% similarity Typically seeds of length 30bp and allow up to 2 mismatches in short read datasets,SOAP,SOAP rules,SOAP will allow either a cert
27、ain number of mismatches or one continuous gap for aligning a read onto the reference sequence. The best hit of each read which has minimal number of mismatches or smaller gap will be reported. For multiple equal best hits, the user can instruct the program to report all, or randomly report one, or
28、disregard all of them.,SOAP rules,the program will allow at most two mismatches. occurrence of single nucleotide polymorphism is much higher than that of small insertions or deletions, so ungapped hits have precedence over gapped hits. For gapped alignment only one continuous gap with a size ranging
29、 from 1 to 3 bp is accepted, while no mismatches are permitted in the flanking regions to avoid ambiguous gaps.,SOAP rules,SOAP can iteratively trim several basepairs at the 3-end and redo the alignment, until hits are detected or the remaining sequence is too short for specific alignment.,SOAP (Pai
30、r-end sequencing),Pair-end sequencing means to sequence both ends of a DNA fragment. So the two reads belonging to a pair will always have the settled relative orientation and approximate distance between each other on the genome. A pair will be aligned when two reads are mapped with the right orien
31、tation relationship and proper distance. A certain number of mismatches are allowed in one or both reads of the pair. For gapped alignment, gap is only permitted on one read, and the other end should match exactly.,SOAPoptions,Input query a file, *.fq or *.fa format reference sequences file, *.fa fo
32、rmat output alignment file seed size, default=10. read18,s=8; read22,s=10, read26, s=12 maximum number of mismatches allowed on a read, OffsetQueryAlleleQual: number of mismatches, followed by detailed mutation sites and switch of allele types. Offset is relative to the initial location on reference
33、. OffsetAlleleQual: offset, allele, and quality.,Features,There are limitations to the alignment approach, such as placing reads within repetitive regions in the reference genome or in corresponding regions that may not exist in the reference genome; the latter situation may result from gaps in the
34、reference genome or the presence of structural variants (SVs) in the genome being analysed. mate-pair reads can resolve the correct genome assignment for some repetitive regions as long as one read in the pair is unique to the genome,Index,For the reads For the Reference For both Based on hash table
35、s Based on suffix trees Based on merge sorting,Contents,Alignment algorithms for short-reads Background Blast (why cant we use it?) Adapting hashed seed-extend algorthims to work with shorter reads Suffix/Prefix Tries Other alignment considerations Typical alignment pipeline Assembly algorithms for
36、short reads Effect of repeats Overlap-Consensus de Bruijn graphs Assembly evaluation metrics Typical assembly pipeline,Suffix-Prefix Trie,A family of methods which uses a Trie structure to search a reference sequence Bowtie BWA SOAP version 2 Trie data structure which stores the suffixes (i.e. ends
37、of a sequence) Key advantage over hashed algorithms: Alignment of multiple copies of an identical sequence in the reference only needs to be done once Use of an FM-Index to store Trie can drastically reduce memory requirements (e.g. Human genome can be stored in 2Gb of RAM) Burrows Wheeler Transform
38、 to perform fast lookups,Suffix Trie,Heng Li SOLiD; 454; Sanger PROS: fast CONS: short read algorithm is slow for long reads and reads with high error rate,Prefix trie,X = GOOGOL$ “G” “GO” “GOO” “GOOG” “GOOGO” “GOOGOL”,Burrows-Wheeler Transform (BWT),Algorithm used for data compression Output is eas
39、ier to compress as it groups similar symbols together,Suffix array interval and sequence alignment,Exact and Inexact Matching,Has to account for mismatches or gaps in the reads the BWT index of the reverse reference sequence narrows the search space,W = LOL X = GOOGOL$,Evaluation: Simulated Data,Sim
40、ulated reads from human genome One million pairs of different lengths Mapped to the human genome BWA was found to be more accurate than Bowtie and SOAPv2 Would need to sacrifice mapping quality in order to increase speed,Evaluation: Real Data,12.2 million pairs of 51bp reads from a male genome Mappe
41、d to human genome and a human-chicken hybrid reference Had high speed and accuracy for both,Comparison,Suffix/Prefix Trie Requires 2% divergence - seed-extend based approach E.g. wild mouse strains alignments,Comparison,Precision and recall by amount of variation for 4 datasets, by polymorphism: (nu
42、mber of SNPs, Indel size),David M et al. Bioinformatics 2011;27:1011-1012,Summary of open-source short read alignment programs,Heng Li 19:164654 Rumble SM, Lacroute P, Dalca AV, et al. SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol 2009;5:e1000386 Use of multiple seeds Especia
43、lly useful for longer reads (36bp) Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24:7134 Jiang H, Wong WH. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 2008;24: 23956,Paired-end reads are important,Repetitive
44、DNA,Unique DNA,Single read maps to multiple positions,Paired read maps uniquely,Read 1,Read 2,Known Distance,Effect of paired-end alignments,Heng Li 33:e171,Repetitive DNA,Unique DNA,Single read maps to multiple positions,Paired read maps uniquely,Read 1,Read 2,Known Distance,Paired read does not ma
45、p,Repetitive sequence,Repetitive sequence,/research/assembly_primer.shtml,Can try to identify collapsed repeats by increased relative coverage,Repetitive sequence,Main reason for fragmented genome assemblies Additional sequencing depth will not help overcome repeat limited asse
46、mblies Can estimate the number of repetitive regions, based on relative coverage Only longer reads or paired-end/mate-pair reads can overcome this PacBio reads can extend up to 10-20kb but still too early to say how easy it will be to incorporate such reads Large mate pair insert sizes 20kb are poss
47、ible, but library preparation is inefficient (2-3 days of trial and error). Also a significant fraction will be error-prone and/or chimeric,Whiteford N, Haslam N, Weber G, et al. An analysis of the feasibility of short read sequencing. Nucleic Acids Res 2005;33:e171,Assumptions made by de-novo assem
48、blers,Based on Lander-Waterman model Number of times a base is sequenced follows a Poisson distribution Reads are randomly distributed throughout a genome The ability to detect an overlap between two reads is not dependent on the base-composition of the read,Lander, E.S. and Waterman, M.S. (1988). G
49、enomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis. Genomics 2 (3): 231239,L = Read length N = Number of reads G = Genome size P = Probability base is sequenced,Assumptions are not true,Paszkiewicz K , Studholme D J Brief Bioinform 2010;11:457-472, The Author 2010. Published by
50、Oxford University Press. For Permissions, please email: ,NGS de-novo assemblies are draft quality at best,500 contigs covering most of a bacterial genome can be obtained in 1 week from genomic DNA to Genbank submission To get 1 contigs covering all genomic seque
51、nce could take many months Is the extra effort worth it? Short answer: Usually not.,Assembly complexity of prokaryotic genomes using short reads Carl Kingsford , Michael C Schatz and Mihai Pop BMC Bioinformatics 2010, 11:21,Contents,Alignment algorithms for short-reads Background Blast (why cant we
52、use it?) Adapting hashed seed-extend algorithms to work with shorter reads Suffix/Prefix Tries Other alignment considerations Typical alignment pipeline Assembly algorithms for short reads Effect of repeats Overlap-Consensus de Bruijn graphs Assembly evaluation metrics Typical assembly pipeline,Over
53、lap consensus vs. de Bruijn,2 main categories of assembly algorithms Overlap Consensus (OLC) and de Bruijn graph assemblers OLC Primarily used for Sanger and hybrid assemblies Memory constraints prevent its use beyond 1 million reads or so de Bruijn Primarily used for NGS assemblies Still memory hun
54、gry but possible,de Bruijn graph assembly,de Bruijn graph assembly,de Bruijn graph assembly,de Bruijn graph assembly,de Bruijn graph assembly,de Bruijn graph assembly,de Bruijn graph assembly,de Bruijn graph assembly,Diagrams courtesy M. Caccamo, TGAC,Dealing with errors,Thomas Keane and Jan Aerts,
55、Wellcome Trust Sanger,de Bruijn graph assembly error correction,Diagrams courtesy M. Caccamo, TGAC,Errors or rare sequence?,Depends on the type of data: Assumptions are probably true for single haploid genome data Diploid and polyploid expect any branches to have equal coverage Less clear for RNA-seq due to splicing Completely false assumption for metagenomic and metatranscriptomic data!,Short read assemblers,First de Bruijn based assembler was Newbler Adapted to handle main 454
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經(jīng)權益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
- 6. 下載文件中如有侵權或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 2026年寧波北侖區(qū)戚家山街道編外工作人員招聘1人考試備考題庫及答案解析
- 2026銀川市金鳳區(qū)天匯里幼兒園教育集團招聘7人考試備考題庫及答案解析
- 2026湖南常德市桃源縣公安局警務輔助人員招聘20人筆試模擬試題及答案解析
- 2026福建投資集團第一批集中招聘考試備考試題及答案解析
- 2026年安徽省能源集團有限公司所屬子公司社會招聘考試備考試題及答案解析
- 2026年甘肅省武威市古浪縣黑松驛鎮(zhèn)選聘大學生村文書筆試備考試題及答案解析
- 2026年昭通市鹽津縣公安局警務輔助人員招聘(21人)考試參考題庫及答案解析
- 2026備戰(zhàn)中考【語文考點專練:“非連續(xù)性文本閱讀”專題】精練(含答案)
- 2026浙江紹興市強制醫(yī)療所招聘編外人員2人考試參考題庫及答案解析
- 2026江西省贛勤發(fā)展集團有限公司社會招聘6人考試備考題庫及答案解析
- 《海洋生物學》課程教學大綱
- WST856-2025安全注射標準解讀
- 低壓控制基本知識培訓課件
- 星間激光鏈路構建-洞察及研究
- “十三五”規(guī)劃重點-銻礦石及精銻項目建議書(立項報告)
- 環(huán)衛(wèi)公司內(nèi)部管理制度
- 第3章 同位素示蹤技術課件
- 創(chuàng)傷骨科患者深靜脈血栓形成篩查與治療的專家共識
- x線胸片診斷試題及答案
- GB/T 17554.1-2025卡及身份識別安全設備測試方法第1部分:一般特性
- 電氣試驗室建設規(guī)范
評論
0/150
提交評論