版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)
文檔簡介
1、Class 26: Computing Genomes, Genomes ComputingDavid Evans/evanscs302: Theory of ComputationUniversity of Virginia Computer ScienceFinal Exam Sneak PreviewHandout available nowHonor policy: you may discuss these problems with others and use any resources you want until the FinalNo notes or other reso
2、urces may be used during the finalIntent is to give you an idea what to expect on the final and a chance to start thinking about some problemsDont attempt to memorize answers: need to understand things since the actual questions may be differentMenuComputing Genomes (PS6, Problem 6)Crash course in b
3、iologyBusy Beaver result! Computing with GenomesConclusionGenome Assembly ProblemIn order to assemble a genome, it is necessary to combine snippets from many reads into a single sequence. The input is a set of n genome snippets, each of which is a string of up to k symbols. The output is the smalles
4、t single string that contains all of the input snippets as substrings.DNASequence of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T)Two strands, A must attach to T and G must attach to CAGTCCentral Dogma of BiologyRNA makes copies of DNA segmentsRNA describes sequences of amino
5、acidsChains of amino acids make proteinsProteins make usDNATranscriptionRNATranslationProteinImage from http:/protein/Human Genome3 Billion Base PairsEach nucleotide is 2 bits (4 possibilities)3 B pairs * 1 byte/4 pairs = 750 MBEvery sequence of 3 base pairs one of 20 amino acids (or stop codon)21 p
6、ossible codons, but 43 = 64 possible So, really only 750MB * (21/64) 250 MBMuch of it ( 95%) is may be junk (doesnt encode proteins, but some might be important)1 CD 650 MBHuman Genome RaceUVa CLAS 1970Yale PhDTenured Professor at U. MichiganFrancis Collins(Director of public National Center for Hum
7、an Genome Research)(Picture from UVa Graduation 2001)Craig Venter(President of Celera Genomics)vs.San Mateo CollegeCourt-martialedDenied tenure at SUNY BuffaloReading the GenomeWhitehead Institute, MITGene Reading MachinesOne read: about 700 base pairsButdont know where they are on the chromosomeAGG
8、CATACCAGAATACCCGTGATCCAGAATAAGCActualGenomeACCAGAATACCRead 1TCCAGAATAARead 2TACCCGTGATCCARead 3Genome Assembly ProblemInput: Genome fragments (but without knowing where they are from)Ouput: The full genomeACCAGAATACCRead 1TCCAGAATAARead 2TACCCGTGATCCARead 3Decision ProblemGA = | where each xi is a s
9、tring and there is a string X of length m that includes all of the xi stringsas substrings If we had a decider for GA, can we find the length of the shortest common superstring?Yes. Try all possible m values from 1, 2, , |xi|.Is GA In Class NP?GA = | where each xi is a string and there is a string X
10、 of length m that includes all of the xi stringsas substrings Yes. The string X is a polynomial-verifiable certificate.- Check it includes each substring- Check its length is mIs GA NP-Complete?GA = | where each xi is a string and there is a string X of length m that includes all of the xi stringsas
11、 substrings NP-CompleteA language B is in NP-complete if:2. There is a polynomial-time reduction from every problem A NP to B.1. B NPBNPBNP?To Prove NP-HardnessPick some known NP-Complete problem X.Show that a polynomial-time solver for Y could be used to build a polynomial-time solver for X.This pr
12、oves that there is no polynomial-time solver for Y (unless P = NP).Possible ChoicesHAMPATHABCDE3SAT(a b c) (a b c) SUBSET-SUM3, 5, 12, 13, 17, 30By definition, all must work. Every NP-Complete problem can be reduced to every NP-Complete problem.In practice, some will work much more easily than other
13、s. Try to pick a problem “close” to the target problem.Busy Beaver ChallengeRuixin YangDoubly-Infinite Tape (Regular BB)One-Way Infinite Tape (Challenge)Winning (3, 2) MachineCodeBusyBeaver32.cppBusyBeaver.cppIs GA NP-Complete?GA = | where each xi is a string and there is a string X of length m that
14、 includes all of the xi stringsas substrings Reducing HAMPATH to GATake an arbitrary input to HAMPATH:HAMPATH = G, s, t | G represents a graph, and there is a path between s and t in G that includes every node in G exactly once Construct a corresponding input to GA such that the input is in GA if an
15、d only if the original input is in HAMPATH.So, we need to map the nodes and edges of G into thesubstrings input to GA.Consider Each NodeIncoming Edges(x, a)aOutgoing Edges(a, y)In a Hamiltonian path, for each node (except start and end), exactly one incoming edge, and one outgoing edge must be used.
16、Consider Each NodeIncoming Edges(x, a)aOutgoing Edges(a, y)We need GA substrings to represent each edge, butsuch that only 1 can be actually followed. But, the GA superstring needs to include all substrings.Idea: make it so the untaken edges can loop back!Simple NodesIf there is only one edge (a, b)
17、 out of a given node, thatedge must be used in the path. Add the substring: ababGenerating the SubstringsFor each edge (a, yi) add two substrings:ayiaThis is the “back” edge.yiayi+1 This connects the possible destinations.abcabaacabaccabPossible (length 4) alignments:aba aca bac cabIf there is a Ham
18、iltonian path, one of these must be used.Start and EndACDEThe start node must come first:Add string: A(where is unused elsewhere)Add string: E$(where $ is unused elsewhere)BWhat is m?ACDE A, ABA, BAC, ACA, CAB, BE, ED, DA, CBC, CDC, BCD, DBC, D$ BA ACA CAB ABA BAC CBC BCD CDC DCB BE ED D$m = Start a
19、nd end = 2 + one-edge nodes * 1 + multi-edge nodes * 2 for each edge + 2 for alignHuman Genome3 Billion base pairs600-700 bases per read8X coverage requiredSo, n 37 Million sequence fragmentsCelera used 27.2 Million reads (but could get more than 700 bases per read)How can we solve an NP-Complete pr
20、oblem for such large n?ApproachesHuman Genome Project (Collins)Start by producing a genome map (using biology, chemistry, etc) to have a framework for knowing where the fragments should goCelera Solution (Venter)Approximate: we cant guarantee finding the shortest possible, but we can develop clever
21、algorithms that get close most of the timeResult: DrawPresident Clinton announces Human Genome Sequenceessentially complete (with Venter and Collins), June 26, 2000But, Human Genome Project mostly adopted Venters approach.Genomes ComputingSolving HAMPATH with DNAMake up a two random 4-nucleotide seq
22、uences for each node:A: A1 = ACTT A2 = gcagB:B1 = TCGGB2 = actgC:C1 = GGCTC2 = atgtD:D1 = GATCD2 = tccaIf there is a link between two cities (XY), create a nucleotide sequence: X2Y1 ABgcagTCGGBAactgACTTBased on Fred Hapgoods notes on Adelmans talkEncoding The ProblemEach city nucleotide sequence bin
23、ds with its complement (A T, G C) :A: A1 = ACTT A2 = gcagA: TGAA cgtcB:TCGGactgB: AGCCtgacC:GGCTatgtC = CCGAtacaD:GATCtccaD = CTAGaggtMix up all the link and complement DNA strands they will bind to show a path!Path BindingABCDACTTgcagTCGGactgGATCtccaGGCTatgtATGAAcgtcgcagGGCTABBCCGAtacaatgtTCGG BCCA
24、GCCtgacDCTAGaggtactgGATC CDGetting the SolutionShake up all the DNA to get it to bindExtract DNA strands starting with A and ending with D Can do this with chemical binding on start/end tags: remove all strands that do not start with A, and then remove all strands that do not end with DWeigh remaining strands to find ones with the right weight (7 * 8 nucleotides)
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
- 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 2022~2023測繪職業(yè)技能鑒定考試題庫及答案第876期
- 職業(yè)健康科普傳播的媒介選擇策略-1
- 職業(yè)健康監(jiān)護中的標準化文書書寫規(guī)范
- 職業(yè)健康檔案在員工職業(yè)規(guī)劃中的應(yīng)用價值
- 黃岡2025年湖北麻城市城區(qū)學校選調(diào)鄉(xiāng)鎮(zhèn)教師150人筆試歷年參考題庫附帶答案詳解
- 長春2025年吉林長春新區(qū)招聘合同制教師筆試歷年參考題庫附帶答案詳解
- 職業(yè)健康與員工職業(yè)發(fā)展:醫(yī)療績效管理的健康維度
- 蘇州2025年江蘇蘇州太倉市沙溪人民醫(yī)院招聘編外專業(yè)技術(shù)人員6人筆試歷年參考題庫附帶答案詳解
- 益陽2025年湖南沅江市城區(qū)義務(wù)教育學校面向市內(nèi)選調(diào)教師97人筆試歷年參考題庫附帶答案詳解
- 職業(yè)人群職業(yè)倦怠與心理健康干預(yù)
- 成人呼吸支持治療器械相關(guān)壓力性損傷的預(yù)防
- DHA乳狀液制備工藝優(yōu)化及氧化穩(wěn)定性的研究
- 2023年江蘇省五年制專轉(zhuǎn)本英語統(tǒng)考真題(試卷+答案)
- 三星-SHS-P718-指紋鎖使用說明書
- 岳麓書社版高中歷史必修三3.13《挑戰(zhàn)教皇的權(quán)威》課件(共28張PPT)
- 2007年國家公務(wù)員考試《申論》真題及參考答案
- GC/T 1201-2022國家物資儲備通用術(shù)語
- 污水管網(wǎng)監(jiān)理規(guī)劃
- GB/T 6730.65-2009鐵礦石全鐵含量的測定三氯化鈦還原重鉻酸鉀滴定法(常規(guī)方法)
- GB/T 35273-2020信息安全技術(shù)個人信息安全規(guī)范
- 《看圖猜成語》課件
評論
0/150
提交評論