A Thorough Comparison of Delta Lake, Iceberg and Hudi
Junjie Chen

About Me
- Software engineer at Tencent Data Lake Team
- Focus on the big data area for years

Agenda
- Introduction to Delta Lake, Apache Iceberg and Apache Hudi
- Key Features Comparison
  - Transaction
  - Data mutation
  - Streaming Support
  - Schema evolution
- Maturity
  - Tooling
  - Integration
  - Performance
- Conclusion

What features are expected for the data lake?
- Unified Batch & Streaming
- Data Mutation
- Scalable Metadata
- Transaction (ACID)
- Data Quality
- Storage Pluggable
- Independence of Engines

Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

Apache Iceberg
A table format for huge analytic datasets which delivers high query performance for tables with tens of petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible table evolution.

(Diagram labels: Interactive Queries; Spark Batch & Streaming; AI & Reporting; Streaming; Streaming Analytics; DFS/Cloud Storage)

Apache Hudi
Apache Hudi ingests & manages storage of large analytical datasets over DFS.

A Quick Comparison (2020-05)

| | Delta Lake (open source) | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Transaction (ACID) | Y | Y | Y |
| MVCC | Y | Y | Y |
| Time travel | Y | Y | Y |
| Schema Evolution | Y | Y | Y |
| Data Mutation | Y (update/delete/merge into) | N | Y (upsert) |
| Streaming | Sink and source for Spark Structured Streaming | Sink and source (WIP) for Spark Structured Streaming, Flink (WIP) | DeltaStreamer, HiveIncrementalPuller |
| File Format | Parquet | Parquet, ORC, AVRO | Parquet |
| Compaction/Cleanup | Manual | API available (Spark Action) | Manual and Auto |
| Integration | DSv1, Delta connector | DSv2, InputFormat | DSv1, InputFormat |
| Multiple language support | Scala/Java/Python | Java/Python | Java/Python |
| Storage Abstraction | Y | Y | N |
| API dependency | Spark-bundled | Native/Engine bundled | Spark-bundled |
| Data ingestion | Spark, Presto, Hive | Spark, Hive | DeltaStreamer |

Transaction

Delta Lake
- Model
  - Transaction log (DeltaLog)
  - Optimistic concurrency control
  - Checkpoint changes into Parquet
- Atomicity guarantee
  - HDFS rename
  - S3 file write
  - Azure rename without overwrite
- Time travel
  - timestamp
  - version number
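
The log-based model and time travel listed for Delta Lake can be pictured with a toy in-memory log. This is a minimal sketch under stated assumptions, not Delta Lake's implementation: `Commit`, `ToyLog`, `commit` and `snapshot` are hypothetical names, each commit is just a list of added and removed file names, and time travel is modeled as replaying the log up to a version number or a timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    version: int
    timestamp: int   # commit time, e.g. epoch seconds
    add: list        # data files added by this commit
    remove: list     # data files logically deleted by this commit


@dataclass
class ToyLog:
    """Hypothetical stand-in for an ordered transaction log."""
    commits: list = field(default_factory=list)

    def commit(self, timestamp, add=(), remove=()):
        version = len(self.commits)
        self.commits.append(Commit(version, timestamp, list(add), list(remove)))
        return version

    def snapshot(self, version=None, timestamp=None):
        """Replay the log and return the set of live files.

        With no arguments the latest state is returned. `version` stops the
        replay after that version; `timestamp` stops it after the last
        commit made at or before that time.
        """
        live = set()
        for c in self.commits:
            if version is not None and c.version > version:
                break
            if timestamp is not None and c.timestamp > timestamp:
                break
            live |= set(c.add)
            live -= set(c.remove)
        return live


log = ToyLog()
log.commit(100, add=["part-0.parquet"])
log.commit(200, add=["part-1.parquet"])
log.commit(300, add=["part-2.parquet"], remove=["part-0.parquet"])

latest = log.snapshot()
as_of_v1 = log.snapshot(version=1)
as_of_t250 = log.snapshot(timestamp=250)
```

In this picture a checkpoint would simply be the stored result of `snapshot()` at some version, so that a reader does not have to replay the log from version 0.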

Apache Iceberg
- Model
  - Snapshot
  - Optimistic concurrency control
- Atomicity guarantee
  - HDFS rename
  - Hive metastore lock
- Time travel
  - snapshot id
  - timestamp

(Diagram labels: W, R; snapshots S1, S2, S3, S4)

Apache Hudi
- Model
  - Timeline
  - Optimistic concurrency control
- Atomicity guarantee
  - HDFS rename
- Time travel
  - Hoodie commit_time
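
All three formats list optimistic concurrency control on top of some atomic primitive (a rename, a lock, a conditional write). The sketch below illustrates that general pattern on a local directory and is not taken from any of the three projects: `try_commit`, `latest_version` and `commit_with_retry` are hypothetical helpers, and exclusive file creation via `os.O_EXCL` stands in for the atomic primitive. Writing the payload after creation is not atomic here, and conflict checking is omitted.

```python
import os
import tempfile


def try_commit(log_dir, version, payload):
    """Try to publish `payload` as commit number `version`.

    os.O_EXCL makes creation fail if the file already exists, which plays
    the role of an atomic "create if absent" primitive in this sketch.
    Returns True on success, False if another writer took this version.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


def latest_version(log_dir):
    versions = [int(name.split(".")[0]) for name in os.listdir(log_dir)]
    return max(versions, default=-1)


def commit_with_retry(log_dir, payload, max_attempts=5):
    """Optimistic loop: read the latest version, try the next one, retry on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        # A real system would also check whether commits made in the meantime
        # conflict with what this writer read; the sketch skips that step.
        if try_commit(log_dir, version, payload):
            return version
    raise RuntimeError("too many concurrent commits")


log_dir = tempfile.mkdtemp()
v_a = commit_with_retry(log_dir, '{"writer": "A"}')
# Writer B still believes version 0 is free and loses the race.
b_lost_race = not try_commit(log_dir, 0, '{"writer": "B"}')
v_b = commit_with_retry(log_dir, '{"writer": "B"}')
```

No writer holds a lock while preparing data files; the only contended step is claiming the next version, and a loser re-reads the log and tries again.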

Data Mutation

Delta Lake
- Copy on Write mode
  - Step 1: find files to delete according to the filter expression
  - Step 2: load files as a dataframe and update column values in rows
  - Step 3: save the dataframe to new files
  - Step 4: log the files to delete and add into JSON, commit to table
- Table-level APIs
  - update, delete (condition based)
  - merge into (upsert a source into target table)
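
The four copy-on-write steps above can be sketched in plain Python. This is an illustrative model only, not Delta Lake code: files are dictionaries of rows, the "log" is a Python list rather than JSON files, and `cow_update` and the generated file names are hypothetical.

```python
import itertools

_file_ids = itertools.count(1)


def cow_update(files, log, predicate, updater):
    """files: dict of file name -> list of row dicts; log: list of commit dicts."""
    # Step 1: find the files that contain at least one matching row.
    touched = [name for name, rows in files.items() if any(predicate(r) for r in rows)]
    added = {}
    for name in touched:
        # Step 2: load the file and update matching rows, copying the rest as-is.
        new_rows = [updater(dict(r)) if predicate(r) else dict(r) for r in files[name]]
        # Step 3: save the result as a new file.
        added[f"part-{next(_file_ids):05d}-rewritten"] = new_rows
    # Step 4: record removed and added files in a single log entry, then apply it.
    log.append({"remove": touched, "add": list(added)})
    for name in touched:
        del files[name]
    files.update(added)
    return touched, list(added)


def set_city(row):
    row["city"] = "Shenzhen"
    return row


files = {
    "part-a": [{"id": 1, "city": "SZ"}, {"id": 2, "city": "BJ"}],
    "part-b": [{"id": 3, "city": "SH"}],
}
log = []
removed, added = cow_update(files, log, lambda r: r["city"] == "SZ", set_city)
```

Note that the untouched row with `id` 2 is rewritten along with the updated row because it shares a file with it, while `part-b` is not read or written at all.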

Apache Hudi
- Copy on Write table
  - Step 1: read out records from Parquet
  - Step 2: merge records according to the passed update records
  - Step 3: write merged records to files
  - Step 4: commit to table
  - CommitActionExecutor
- Merge on Read table
  - Store delta records into AVRO-format log files
  - Scheduled compaction
- Indexing
  - Mapping Hudi record key (in metadata column) to file group and file id
  - In-memory, bloom filter and HBase
- Table-level APIs
  - upsert
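
How the merge-on-read table, the index and compaction fit together can be shown with a small in-memory model. It is a sketch under simplifying assumptions rather than Hudi's implementation: `ToyMorTable` and its methods are hypothetical, the index is a plain dictionary (the in-memory variant from the list above), and delta logs are Python lists instead of AVRO files.

```python
class ToyMorTable:
    """Hypothetical merge-on-read table: base files plus per-file-group delta logs."""

    def __init__(self):
        self.base = {}    # file group id -> {record key: row}
        self.logs = {}    # file group id -> list of (record key, row) deltas
        self.index = {}   # record key -> file group id

    def upsert(self, key, row, default_group="fg-0"):
        # The index decides whether this is an update (known key) or an insert.
        group = self.index.setdefault(key, default_group)
        self.base.setdefault(group, {})
        # Writes only append to the delta log; base files are left untouched.
        self.logs.setdefault(group, []).append((key, row))

    def read(self):
        """Merge base rows with delta records at read time; later deltas win."""
        merged = {}
        for group, rows in self.base.items():
            merged.update(rows)
            for key, row in self.logs.get(group, []):
                merged[key] = row
        return merged

    def compact(self):
        """Fold delta logs into the base files and clear the logs."""
        for group, deltas in self.logs.items():
            for key, row in deltas:
                self.base[group][key] = row
        self.logs = {}


table = ToyMorTable()
table.upsert("k1", {"v": 1})
table.upsert("k2", {"v": 2}, default_group="fg-1")
# k1 is already indexed, so the update is routed to fg-0, not fg-9.
table.upsert("k1", {"v": 10}, default_group="fg-9")

before = table.read()
pending = sum(len(d) for d in table.logs.values())
table.compact()
after = table.read()
```

Writes stay cheap because they only append, reads pay the merge cost, and compaction moves that cost back to a background step.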

Apache Iceberg
- Copy on Write mode
  - File-level overwrite APIs available
- Merge on Read mode
  - Position-based delete files and equality-based delete files
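
The difference between the two kinds of delete files is easiest to see at read time. The function below is a simplified illustration, not the Iceberg reader: `apply_deletes` and its argument shapes are hypothetical, and details of the real format such as how deletes are scoped to data files are left out.

```python
def apply_deletes(data_files, position_deletes, equality_deletes):
    """Return the live rows after applying delete files at read time.

    data_files: dict of file name -> list of row dicts
    position_deletes: set of (file name, row position) pairs
    equality_deletes: list of dicts; a row is deleted if it matches every
        column/value pair of any one of them
    """
    live = []
    for name, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (name, pos) in position_deletes:
                continue  # deleted by file and position
            if any(all(row.get(c) == v for c, v in d.items()) for d in equality_deletes):
                continue  # deleted by column values
            live.append(row)
    return live


data_files = {
    "f1": [{"id": 1}, {"id": 2}, {"id": 3}],
    "f2": [{"id": 4}, {"id": 2}],
}
position_deletes = {("f1", 0)}   # drop the first row of f1
equality_deletes = [{"id": 2}]   # drop every row whose id equals 2

live = apply_deletes(data_files, position_deletes, equality_deletes)
```

A position delete requires the writer to know where the row lives, while an equality delete can be written without reading the data files first, at the cost of more work for readers.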

Streaming Support

Delta Lake
- Deeply integrated with Spark Structured Streaming
- As a streaming source
  - Streaming control: maxBytesPerTrigger, maxFilesPerTrigger
  - Does NOT handle non-append changes (ignoreDeletes or ignoreChanges)
- As a streaming sink
  - Append mode
  - Complete mode
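
The two streaming-control options above bound how much newly arrived data goes into each micro-batch. The function below sketches that idea in plain Python; it is not how Delta Lake plans batches. `plan_micro_batches` is a hypothetical helper whose parameter names merely echo the option names, and the hard cutoff used here is an assumption that may differ from the exact semantics in Delta Lake.

```python
def plan_micro_batches(new_files, max_files_per_trigger=None, max_bytes_per_trigger=None):
    """Split (file name, size in bytes) pairs into micro-batches, preserving order.

    A batch is closed when it already holds the maximum number of files or
    when adding the next file would exceed the byte limit. Every batch takes
    at least one file so the stream always makes progress.
    """
    batches, current, current_bytes = [], [], 0
    for name, size in new_files:
        too_many = max_files_per_trigger is not None and len(current) >= max_files_per_trigger
        too_big = max_bytes_per_trigger is not None and current_bytes + size > max_bytes_per_trigger
        if current and (too_many or too_big):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


files = [("a", 40), ("b", 40), ("c", 40), ("d", 100), ("e", 10)]
by_files = plan_micro_batches(files, max_files_per_trigger=2)
by_bytes = plan_micro_batches(files, max_bytes_per_trigger=100)
```

Smaller limits give lower per-batch latency and memory use; larger limits give fewer, bigger batches.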

Apache Hudi
- DeltaStreamer
  - Exactly-once ingestion of new events from Kafka
  - Support JSON, AVRO or custom record types
  - Manage checkpoints, rollback & recovery
  - Support for plugging in transformations
- Incremental Queries
  - HiveIncrementalPuller
  - As Spark datasource (beginInstantTime)
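
An incremental query returns only the records committed after a given instant, which lets a downstream job pick up where it left off. The sketch below models that filter in plain Python and is not the Hudi datasource: `incremental_pull` is a hypothetical function, records are dictionaries carrying a `_hoodie_commit_time` field, and the begin bound is assumed to be exclusive.

```python
def incremental_pull(records, begin_instant_time, end_instant_time=None):
    """Return records committed after `begin_instant_time` (exclusive) and,
    if given, no later than `end_instant_time` (inclusive).

    Instant times are compared as strings, which works for fixed-width
    timestamps such as yyyyMMddHHmmss.
    """
    result = []
    for record in records:
        commit_time = record["_hoodie_commit_time"]
        if commit_time <= begin_instant_time:
            continue
        if end_instant_time is not None and commit_time > end_instant_time:
            continue
        result.append(record)
    return result


records = [
    {"key": "a", "_hoodie_commit_time": "20200501000000"},
    {"key": "b", "_hoodie_commit_time": "20200502000000"},
    {"key": "c", "_hoodie_commit_time": "20200503000000"},
]

new_records = incremental_pull(records, "20200501000000")
# The consumer keeps the highest commit time it has seen as its checkpoint.
checkpoint = max(r["_hoodie_commit_time"] for r in new_records)
next_pull = incremental_pull(records, checkpoint)
```

Storing the last seen commit time and passing it as the next begin instant is what turns a batch table into an incremental source.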

Apache Iceberg
- Support Spark Structured Streaming
  - As streaming source (WIP)
    - Rate limit: max-files-per-batch
    - Offset range
  - As streaming sink
    - Append mode
    - Complete mode
- Support Flink (WIP)

Table Schema Evolution
- Delta Lake
  - Use Spark schema
  - Allow schema merge and overwrite
- Apache Hudi
  - Use Spark schema
  - Support adding new fields in stream; column delete is not allowed
- Apache Iceberg
  - Independent ID-based schema abstraction
  - Full schema evolution and partition evolution
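
Why an ID-based schema makes rename, drop and re-add safe can be shown with a toy model. This is a sketch of the idea only and not Iceberg's schema classes: `ToySchema` and its methods are hypothetical, and stored rows are modeled as dictionaries keyed by field ID.

```python
class ToySchema:
    """Hypothetical ID-based schema: columns are tracked by field ID, not by name."""

    def __init__(self):
        self.fields = {}      # field id -> current column name
        self._next_id = 1

    def add_column(self, name):
        field_id = self._next_id
        self._next_id += 1    # IDs are never reused
        self.fields[field_id] = name
        return field_id

    def rename_column(self, old, new):
        field_id = next(i for i, n in self.fields.items() if n == old)
        self.fields[field_id] = new

    def drop_column(self, name):
        field_id = next(i for i, n in self.fields.items() if n == name)
        del self.fields[field_id]

    def project(self, stored_row):
        """Read a row stored as {field id: value} under the current schema."""
        return {name: stored_row.get(fid) for fid, name in self.fields.items()}


schema = ToySchema()
schema.add_column("id")        # field 1
schema.add_column("name")      # field 2
old_row = {1: 7, 2: "x"}       # written under the original schema

schema.rename_column("name", "full_name")
schema.drop_column("id")
new_id = schema.add_column("id")   # a new column that reuses the old name

projected = schema.project(old_row)
```

The renamed column still reads its old data because the ID is unchanged, and the re-added `id` column does not resurrect values written for the dropped one because it received a fresh ID.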

Maturity

Integrations
- Delta Lake
  - DSv1
  - Delta.io connector enables Apache Hive, Presto
- Apache Iceberg
  - DSv2, InputFormat, Hive StorageHandler (WIP)
  - Flink sink (WIP)
- Apache Hudi
  - InputFormat, DSv1
  - DeltaStreamer for data ingestion

Query Performance Optimization
- Delta Lake
  - Vectorization from Spark
  - Data skipping via statistics from Parquet
  - Vacuum, optimize
- Apache Hudi
  - Vectorization from Spark
  - Data skipping via statistics from Parquet
  - Auto compaction
- Apache Iceberg
  - Predicate pushdown
  - Native vectorized reader (WIP)
  - Statistics from Iceberg manifest file
  - Hidden partitioning
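
Data skipping, whether the statistics come from Parquet footers or from manifest files, comes down to comparing a query's value range with per-file minimum and maximum values. The helper below is a minimal illustration of that check; `prune_files` and the shape of `file_stats` are assumptions made for the example, not an API of any of the three projects.

```python
def prune_files(file_stats, column, lower, upper):
    """Keep only files whose [min, max] range for `column` can overlap [lower, upper]."""
    kept = []
    for name, stats in file_stats.items():
        col_min, col_max = stats[column]
        if col_max < lower or col_min > upper:
            continue  # no row in this file can match, so it is never read
        kept.append(name)
    return kept


file_stats = {
    "f1": {"ts": (1, 10)},
    "f2": {"ts": (11, 20)},
    "f3": {"ts": (21, 30)},
}

# Query: WHERE ts BETWEEN 15 AND 22
to_scan = prune_files(file_stats, "ts", 15, 22)
```

The check can only rule files out; rows in the kept files still have to be filtered, and the benefit depends on how well the data is clustered on the filtered column.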

Tooling
- Delta Lake
  - CLI: VACUUM, HI