
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Junjie Chen

About Me
• Software engineer at Tencent Data Lake Team
• Focus on big data area for years

Agenda
• Introduction to Delta Lake, Apache Iceberg and Apache Hudi
• Key Features Comparison
  • Tooling
  • Transaction
  • Integration
  • Performance
  • Data mutation
  • Streaming Support
  • Schema evolution
• Maturity
• Conclusion

What features are expected for the data lake?
• Unified Batch & Streaming
• Data Mutation
• Scalable Metadata
• Transaction (ACID)
• Data Quality
• Storage Pluggable
• Independence of Engines

Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

Apache Iceberg
A table format for huge analytic datasets which delivers high query performance for tables with tens of petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible table evolution.
(diagram: Interactive Queries, Spark Batch & Streaming, AI & Reporting, and Streaming Analytics engines on top of DFS/Cloud Storage)

Apache Hudi
Apache Hudi ingests & manages storage of large analytical datasets over DFS.

A Quick Comparison (2020-05)

| | Delta Lake (open source) | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Transaction (ACID) | Y | Y | Y |
| MVCC | Y | Y | Y |
| Time travel | Y | Y | Y |
| Schema Evolution | Y | Y | Y |
| Data Mutation | Y (update/delete/merge into) | N | Y (upsert) |
| Streaming | Sink and source for Spark struct streaming | Sink and source (wip) for Spark struct streaming, Flink (wip) | DeltaStreamer, HiveIncrementalPuller |
| File Format | Parquet | Parquet, ORC, AVRO | Parquet |
| Compaction/Cleanup | Manual | Manual | Manual and Auto |
| API available | DSv1, Delta connector | DSv2, InputFormat (Spark Action) | DSv1, InputFormat |
| Multiple language support | Scala/Java/Python | Java/Python | Java/Python |
| Storage Abstraction | Y | Y | N |
| API dependency | Spark-bundled | Native/Engine bundled | Spark-bundled |
| Integration | Spark, presto, hive | Spark, hive | Spark, presto, hive |
| Data ingestion | | | DeltaStreamer |

Transaction

Delta Lake
• Model
  • Transaction Log (DeltaLog)
  • Optimistic concurrency control
  • Checkpoint changes into parquet
• Atomicity Guarantee
  • HDFS rename
  • S3 file write
  • Azure rename without overwrite
• Time Travel (see the sketch below)
  • timestamp
  • version number
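As a concrete illustration of Delta Lake time travel, this is a minimal sketch of reading an older snapshot by version number and by timestamp through the Spark DataFrame reader; the table path and values are illustrative, and `spark` is assumed to be an existing SparkSession.

```scala
// Minimal sketch of Delta Lake time travel; /tmp/delta/events is an illustrative path.
val byVersion = spark.read.format("delta")
  .option("versionAsOf", 12)             // read the table as of transaction-log version 12
  .load("/tmp/delta/events")

val byTimestamp = spark.read.format("delta")
  .option("timestampAsOf", "2020-05-01") // read the snapshot current as of this timestamp
  .load("/tmp/delta/events")
```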

Apache Iceberg
• Model
  • Snapshot (diagram: writer/reader over snapshot chain S1 → S2 → S3 → S4)
  • Optimistic concurrency control
• Atomicity Guarantee
  • HDFS rename
  • Hive metastore lock
• Time Travel (see the sketch below)
  • snapshot id
  • timestamp
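For illustration, the following minimal sketch reads an Iceberg table at an older snapshot through the Spark reader, either by snapshot id or by a point in time; the table identifier and values are illustrative.

```scala
// Minimal sketch of Iceberg time travel reads; db.events and the ids are illustrative.
val bySnapshot = spark.read.format("iceberg")
  .option("snapshot-id", 10963874102873L)    // read a specific snapshot
  .load("db.events")

val asOf = spark.read.format("iceberg")
  .option("as-of-timestamp", 1588291200000L) // read the snapshot current at this epoch-millis time
  .load("db.events")
```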

Apache Hudi
• Model
  • Timeline
  • Optimistic concurrency control
• Atomicity Guarantee
  • HDFS rename
• Time Travel
  • Hoodie commit_time

Data Mutation

Delta Lake
• Copy on Write mode
  • Step 1: find files to delete according to filter expression
  • Step 2: load files as dataframe and update column values in rows
  • Step 3: save dataframe to new files
  • Step 4: log the files to delete and add into JSON, commit to table
• Table level APIs (see the sketch below)
  • update, delete (condition based)
  • merge into (upsert a source into target table)
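A minimal sketch of these table-level APIs with the open-source Delta Lake Scala API (io.delta.tables.DeltaTable); the table path, column names and the `updatesDF` source DataFrame are illustrative assumptions.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

val events = DeltaTable.forPath(spark, "/tmp/delta/events")  // illustrative path

// Condition-based update and delete
events.update(col("eventType") === "click", Map("processed" -> lit(true)))
events.delete(col("eventDate") < "2020-01-01")

// merge into: upsert a source DataFrame into the target table
events.as("t")
  .merge(updatesDF.as("s"), "t.eventId = s.eventId")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```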

Apache Hudi
• Copy on Write table
  • Step 1: read out records from parquet
  • Step 2: merge records with the incoming update records
  • Step 3: write merged records to files
  • Step 4: commit to table (CommitActionExecutor)
• Merge on Read table
  • Store delta records into AVRO format log file
  • Scheduled compaction
• Indexing
  • Mapping Hudi record key (in metadata column) to file group and file id
  • In-memory, bloom filter and HBase
• Table level APIs
  • upsert (see the sketch below)
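A minimal sketch of a Hudi upsert through the Spark datasource follows; the path, table name and the record key / precombine / partition fields are illustrative, and `updatesDF` is assumed to be an existing DataFrame of changed records.

```scala
import org.apache.spark.sql.SaveMode

updatesDF.write.format("hudi")                                   // "org.apache.hudi" on older versions
  .option("hoodie.datasource.write.operation", "upsert")         // table-level upsert
  .option("hoodie.datasource.write.recordkey.field", "eventId")  // key mapped by the index to a file group
  .option("hoodie.datasource.write.precombine.field", "ts")      // keep the latest record per key
  .option("hoodie.datasource.write.partitionpath.field", "eventDate")
  .option("hoodie.table.name", "events")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/events")
```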

Apache Iceberg
• Copy on Write mode
  • File level overwrite APIs available
• Merge on Read mode
  • Position based delete files and equality based delete files

Streaming Support

Delta Lake
• Deeply integrated with Spark Structured Streaming (see the sketch below)
• As a streaming source
  • Streaming control: maxBytesPerTrigger, maxFilesPerTrigger
  • Does NOT handle non-append (ignoreDeletes or ignoreChanges)
• As a streaming sink
  • Append mode
  • Complete mode
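A minimal sketch of a Delta table used as both streaming source and sink with Spark Structured Streaming; the paths and checkpoint location are illustrative.

```scala
// Source side: rate control plus tolerance of upstream file rewrites
val events = spark.readStream.format("delta")
  .option("maxFilesPerTrigger", 100)   // cap the number of files consumed per micro-batch
  .option("ignoreChanges", "true")     // re-emit rows from rewritten files instead of failing
  .load("/tmp/delta/events")

// Sink side: append mode into another Delta table
val query = events.writeStream.format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
  .start("/tmp/delta/events_copy")
```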

Apache Hudi
• DeltaStreamer
  • Exactly-once ingestion of new events from Kafka
  • Support JSON, AVRO or custom record types
  • Manage checkpoints, rollback & recovery
  • Support for plugging in transformations
• Incremental Queries
  • HiveIncrementalPuller
  • As Spark datasource (beginInstantTime), see the sketch below
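As an illustration, this sketch runs an incremental query through the Hudi Spark datasource, pulling only records committed after a given instant; the path and instant value are illustrative, and the exact option names vary slightly across Hudi versions.

```scala
// Pull records written after commit instant 20200501000000 (yyyyMMddHHmmss)
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20200501000000")
  .load("/tmp/hudi/events")
```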

Apache Iceberg
• Support Spark struct streaming
  • As streaming source (WIP)
    • Rate limit: max-files-per-batch
    • Offset range
  • As streaming sink
    • Append mode
    • Complete mode
• Support Flink (WIP)

Table Schema Evolution
• Delta Lake
  • Use Spark schema
  • Allow schema merge and overwrite (see the sketch below)
• Apache Hudi
  • Use Spark schema
  • Support adding new fields in stream; column delete is not allowed
• Apache Iceberg
  • Independent ID-based schema abstraction
  • Full schema evolution and partition evolution
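For Delta Lake's schema merge and overwrite, a minimal sketch of evolving a table's schema at write time; the path and the DataFrames (`dfWithNewColumn`, `dfNewSchema`) are illustrative assumptions.

```scala
// Append a DataFrame that carries an extra column; mergeSchema adds it to the table schema
dfWithNewColumn.write.format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/tmp/delta/events")

// Replace the table schema entirely on overwrite
dfNewSchema.write.format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .save("/tmp/delta/events")
```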

Maturity

Integrations
• Delta Lake
  • DSv1
  • Delta.io connector enables Apache Hive, Presto
• Apache Iceberg
  • DSv2, InputFormat, Hive StorageHandler (WIP)
  • Flink sink (WIP)
• Apache Hudi
  • InputFormat, DSv1
  • DeltaStreamer for data ingesting

Query Performance Optimization
• Delta Lake
  • Vectorization from Spark
  • Data skipping via statistics from Parquet
  • Vacuum, optimize
• Apache Hudi
  • Vectorization from Spark
  • Data skipping via statistics from Parquet
  • Auto compaction
• Apache Iceberg
  • Predicate pushdown
  • Native vectorized reader (WIP)
  • Statistics from Iceberg manifest file
  • Hidden partitioning (see the sketch below)
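To illustrate hidden partitioning, the sketch below declares an Iceberg table partitioned by a transform of a timestamp column, so filters on that column are pruned without exposing a separate partition column; it assumes a Spark 3 session with an Iceberg catalog configured under the illustrative name `prod`.

```scala
// Partition spec uses the days(ts) transform; readers never see or manage a partition column
spark.sql("""
  CREATE TABLE prod.db.events (
    id   BIGINT,
    ts   TIMESTAMP,
    data STRING)
  USING iceberg
  PARTITIONED BY (days(ts))
""")

// A filter on ts is pruned to the matching daily partitions automatically
spark.sql("SELECT count(*) FROM prod.db.events WHERE ts >= TIMESTAMP '2020-05-01 00:00:00'")
```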

Tooling
• Delta Lake
  • CLI: VACUUM, HI
