Big Genomic Data in Bioinformatics Cloud

Abstract

The completion of the Human Genome Project has led to a proliferation of genomic sequencing data. Together with next-generation sequencing, it has helped reduce the cost of sequencing, which has further increased the demand for analysis of these large genomic data sets. The data and their processing have aided medical research, so expertise in handling biological big data is required. Cloud computing and big data technologies such as the Apache Hadoop project are needed to store, handle and analyse these data, because they provide distributed, parallelised data processing and can efficiently analyse even petabyte (PB) scale data sets. There are drawbacks as well, chiefly the long time needed to transfer data and limited network bandwidth.

Introduction

The introduction of next-generation sequencing has produced unrivalled volumes of sequence data, so modern biology faces challenges in data management and analysis. A single human's DNA comprises around 3 billion base pairs (bp), representing approximately 100 gigabytes (GB) of data, and bioinformatics is encountering difficulty in storing and analysing such data. Moore's Law is commonly read as implying that computers double in speed and halve in size every 18 months, and reports say that biological data will accumulate at an even faster pace [1]. The cost of sequencing a human genome has fallen from $1 million to $1 thousand. With this falling cost, and after the completion of the Human Genome Project, a flood of biological sequence data was generated. Sequencing and cataloguing of genetic information has increased manyfold, as can be observed from the GenBank database of NCBI.

Medical research institutes such as the National Cancer Institute are aiming to sequence a million genomes in order to understand biological pathways and genomic variation and to predict the causes of disease. Given that the whole genome of a tumour together with a matching normal tissue sample consumes 0.1 TB of compressed data, one million genomes will require 0.1 million TB, i.e. about 100 PB [2]. This explosion of biological data, at a scale that exceeds a single machine, has made storing, processing and analysing the data more expensive than generating it. That has stimulated the use of the cloud to avoid large capital infrastructure and maintenance costs. It also requires a move away from conventional structured data (row-column organisation) towards semi-structured or unstructured data, and applications must be developed that execute in parallel on distributed data sets. With effective use of big data in the healthcare sector, a reduction of around 8% in expenditure is possible, which would amount to a saving of $300 billion annually.

Review

Cloud computing

Cloud computing is defined as "a pay-per-use model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [3]. The major concepts involved are grid computing, distributed systems, parallelised programming and virtualisation technology. A single physical machine can host multiple virtual machines through virtualisation. The problem with grid computing was that most of the effort was spent on maintaining the robustness and resilience of the cluster itself. Big data technologies have now identified cost-effective solutions for processing huge data sets in parallel. Cloud computing and big data technologies are two different things: the former facilitates cost-effective storage, and the latter is a Platform as a Service (PaaS).

There are three types of cloud: public, private and hybrid. A public cloud refers to resources such as infrastructure, applications and platforms made available to the general public, accessible only through the Internet on a "pay as you go" basis. A private cloud refers to virtualised cloud infrastructure owned, housed and managed by a single organisation. A hybrid cloud connects private and public clouds through Virtual Private Networking (VPN) for scalability and fault tolerance. A fourth model, the community cloud, has also been proposed, in which organisations with a shared interest, such as public sector organisations, contribute financially towards a common cloud infrastructure.

Genomics through big data technologies

Applying big data technologies to storing, processing and analysing the genomics data of medical research can profoundly impact mankind. Timely processing of the data and its subsequent analysis are still a challenge. One solution is to adopt leading big data technologies such as Hadoop, and there have been studies on the use of the Apache Hadoop platform in bioinformatics projects [4].

Bioinformatics tools developed:
MapReduce projects [5]
Crossbow project [6]
BlastReduce project [7]
CloudBurst [8]
CrossBow [9]

Cloudera

Cloudera is a service provider for big data platforms and a leading vendor of Apache Hadoop software. It contributes more than 50% of its output to open source (Apache licensed) projects, which keeps it at the cutting edge of big data technology and the Hadoop framework. It was established by leading engineers from Google, Yahoo and Facebook along with an Oracle executive, who were later joined by the founder of the Apache Hadoop project [3]. Cloudera is a pioneer of big data and cloud computing in biomedical research. Its chief scientist and co-founder aims to dedicate 25% of their time to the use of computational biology in genomics [10]. Leading pioneers of big data and computational biology, along with leading multinationals, are thus committing to aid medical discovery by contributing to the analysis of large biological data for the understanding, diagnosis and treatment of diseases. This is the need of the hour, because the annual growth of healthcare computing is projected to be around 20.5% [11].

Hadoop

Hadoop has two key modules:
i) MapReduce: a computational problem is divided into many small sub-problems, which are distributed over multiple nodes of the cluster.
ii) Hadoop Distributed File System (HDFS): a distributed file system for storing data on these nodes.

Such software is designed to balance load among nodes and to allow distributed processing of large data sets, enabling fault-tolerant parallelised analysis. A bioinformatics cloud involves services such as data storage, acquisition and analysis, since the cloud platform delivers hosted services over the Internet. These services fall into four categories: Data as a Service, Software as a Service, Platform as a Service, and Infrastructure as a Service [12-16].

Data as a service (DaaS)

Bioinformatics clouds depend on data for downstream analyses. "It is reported that annual worldwide sequencing capacity is beyond 13 Pbp and on an increase by a factor of five every year" [17]. Because of this unprecedented explosion of data, Data as a Service (DaaS) delivery via the Internet has gained importance. It provides dynamic, on-demand and up-to-date data access to a wide range of devices connected over the Web. Amazon Web Services (AWS) provides a centralised cloud of public data sets from biology, economics and other fields as services, e.g. archives of GenBank, the Ensembl databases, 1000 Genomes, Model Organism Encyclopedia and Unigene [18].
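The MapReduce model described in the Hadoop section can be illustrated with a small, single-process sketch. It is not Hadoop code and does not use any Hadoop API; the function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) and the choice of k-mer counting as the task are assumptions made purely for illustration. The point is only to show how a problem is split into independent map calls whose outputs are grouped by key and then reduced.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-length window of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would do
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle and reduce in sequence on a list of reads.
    In a real cluster each map call could run on a different node."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical toy reads, not real sequencing data.
    print(count_kmers(["ACGTAC", "GTACGT"], 3))
```

Because each call to `map_kmers` depends only on its own read, the map phase can be spread over many nodes without coordination, which is the property that lets this style of analysis scale to very large data sets.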

Software as a service (SaaS)

SaaS delivers a large variety of software services online for different types of data analysis, giving remote access to heavyweight bioinformatics software. It eliminates the need for local installation and thereby eases software maintenance. Up-to-date cloud-based services for bioinformatics data analysis have made life easier for users. Efforts have been made to develop cloud-scale and cloud-based sequence mapping [19], multiple sequence alignment [20], expression analysis [21], identification of epistatic interactions of SNPs (single nucleotide polymorphisms) [22], and NGS (next-generation sequencing) analysis.

Platform as a service (PaaS)

PaaS allows users to develop, test and use cloud applications in an environment where computing resources scale automatically and dynamically to match application demand. This scalability helps in developing applications for biological data. Two PaaS platforms are:
1. Eoulsan, a cloud-based platform for high-throughput sequencing analyses [23];
2. Galaxy Cloud, a cloud-scale platform for large-scale data analyses [24].

Infrastructure as a service (IaaS)

IaaS delivers all kinds of virtualised resources, including hardware (CPU) and software (OS), which together make up a full computer infrastructure, so that the full potential of computing resources can be reached via the Internet. Users access the virtualised resources as a public utility and pay for the cloud resources they use. Flexibility and customisation let different users access different cloud resources according to their requirements. Examples:
1. CloudBioLinux is a publicly accessible virtual machine for high-performance bioinformatics computing [25].
2. CloVR is a portable virtual machine that incorporates several pipelines for automated sequence analysis [26].

Bioinformatics cloud

Data in the cloud

The initial method of analysis involved downloading data from NCBI, Ensembl and similar sources and installing software locally on in-house computers. Placing the data and the software in the cloud makes it possible to deliver them as DaaS or SaaS, and both can be seamlessly integrated into the cloud. Storing biological data there thus achieves the aim of big data analysis within the cloud. At present, conventional biological databases are still used rather than cloud-based ones, but larger sequencing projects that generate ultra-large volumes of data will require the cloud for big data analysis and sharing [27, 28]. Genome 10K, the 1001 Genomes Project, 1KITE and TCGA are projects of this kind, in which answering complex biological queries involves the use of big data tools [29].

Transferring big data

The bottleneck of cloud computing is the transfer of data into the cloud. Instead of physically shipping hard drives to the cloud centre, a promising solution is to integrate innovative transfer technologies with cloud computing. One example is the cloud-based EasyGenomics for high-speed genomic data transfer. Genomic data were successfully transferred across the Pacific Ocean at a rate of about 10 gigabits per second, which showed that these technologies are capable of dealing with big data over the Web. In addition, technologies such as data compression and peer-to-peer (P2P) data distribution can aid big data transfer [30].

Cloud-based programming

An analysis task is implemented as a pipeline by linking the outputs of some tools to the inputs of others, which automates the system. Customised pipelines must be developed for large-scale, automated and configurable data analysis in a cloud-based environment. Hadoop adopts a similar programming paradigm, in which a single task is distributed over multiple nodes. Developing cloud-based pipelines in Hadoop requires computational skills, but not extensive coding; the main work is setting up a system for data exchange that paves the way for the programming environment [31].

Bioinformatics cloud

Presently the biggest cloud provider is Amazon, which provides commercial clouds for big data processing. Google is another provider, allowing users to develop web applications and analyse data. More remains to be done before commercial clouds provide ample data and software and keep pace with the emerging needs of research, which requires clouds customised for bioinformatics analysis. Open access and public availability of data and software are of equal significance [32]. When data and software are in the cloud, public availability of the cloud to the scientific community is essential [33]. It ensures data integration, reproducible analyses and maximum scope for sharing.

Potential Challenges

Genomics research with enormous amounts of data has recognised the potential benefits of moving to the cloud, but cloud computing raises some concerns as well. Optimising genomics analysis for the cloud has provided efficient and timely services; for instance, data can be passed from the sequencing facility to an analysis pipeline on the cloud as they are generated. However, several potential challenges in adopting cloud computing technologies need to be kept in mind.

Hadoop programming requires a high level of Java expertise; it needs to be simplified to an SQL-like interface for generating parallelised programs. Standardisation of reporting and summarisation of results is a problem that has received little attention, and better analytics and visualisation technologies need to be developed. Hadoop, with no front-end visualisation, is difficult to set up, use and maintain; efforts are being made to introduce developer-friendly management interfaces in place of shell and command-line interfaces.

Given the scale of the genomic data that must be transmitted over the Internet, transfer takes a considerable amount of time, at times extending to weeks. The data transfer rate therefore remains a bottleneck of the technology [36]. Data tenancy is another challenge. Most clouds offer limited data and service interoperability, making it difficult for a customer to move data and services back to an in-house IT environment or to migrate from one provider to another. Moreover, data privacy legislation, legal ownership and responsibility for data stored across international zones pose a further challenge [37]. Nevertheless, genomics and proteomics research projects clearly demonstrate the applications of next-generation cloud-based computational biology, which has the potential to revolutionise the pace of research in the life sciences.

Security

Privacy and confidentiality must be maintained, especially when dealing with health information. Cloud computing offers data encryption, password protection, secure data transfer, process audits and the implementation of policies against data breaches and malicious use [34]. The involvement of an external entity in data storage and processing services raises additional security concerns. Logging of data access, role-based access, third-party certifications, computer network security, notification alarms, change trackers, cloud usage terms and associated services are intended to address these concerns [35].

Future in microbiology research

Petabytes of raw information can revolutionise microbiology research if we succeed in figuring out how to use this gold mine. Winston Hide says, "In the last five years, more scientific data has been generated than in the entire history of mankind". Data are generated today far faster than just a few years ago, so it is hard to grasp the amount of digital information now available to us. Studying respiratory disease, for example, requires capturing huge quantities of air quality data and then matching them with equally large data sets; studies of this kind involve big data. We need to engage lots of eyes in this process.
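The transfer bottleneck discussed under "Transferring big data" and "Potential Challenges" can be made concrete with a back-of-the-envelope estimate. The sketch below uses figures quoted in the text (about 100 GB per human genome, 0.1 TB per tumour/normal pair, one million genomes, a 10 gigabit-per-second link). Everything else is an assumption for illustration: the function name `transfer_time_seconds`, the `efficiency` parameter, the 100 Mbit/s comparison link, and the use of decimal units (1 GB = 10^9 bytes).

```python
# Decimal storage units (an assumption; binary units would shift the results slightly).
GB = 10**9
TB = 10**12
PB = 10**15


def transfer_time_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Idealised time to move size_bytes over a link.

    efficiency is a hypothetical fudge factor in (0, 1] for protocol
    overhead and contention; 1.0 means the full line rate is achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_second * efficiency)


if __name__ == "__main__":
    genome = 100 * GB                 # one human genome, per the text
    cohort = 1_000_000 * (TB // 10)   # one million samples at 0.1 TB each

    print("cohort size in PB:", cohort / PB)
    print("one genome at 10 Gbit/s (s):", transfer_time_seconds(genome, 10e9))
    print("one genome at 100 Mbit/s (h):",
          transfer_time_seconds(genome, 100e6) / 3600)
    print("cohort at 10 Gbit/s (days):",
          transfer_time_seconds(cohort, 10e9) / 86400)
```

Even under these idealised assumptions, a full cohort takes years rather than hours to move over a single fast link, which is why compression, P2P distribution and keeping analysis close to the data are emphasised in the text.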
