Apache Kylin: OLAP on Hadoop
http://kylin.io

Agenda
- What's Apache Kylin?
- Tech Highlights
- Performance
- Roadmap
- Q&A

Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

What's Kylin
- kylin / ˈkiːˈlɪn / 麒麟: n. (in Chinese art) a mythical animal of composite form
- Open sourced on Oct 1st, 2014
- Accepted as an Apache Incubator project on Nov 25th, 2014

Big Data Era
- More and more data becoming available on Hadoop
- Limitations in existing Business Intelligence (BI) tools
  - Limited support for Hadoop
  - Data size growing exponentially
  - High latency of interactive queries
  - Scale-up architecture
- Challenges to adopt Hadoop as interactive analysis system
  - Majority of analyst groups are SQL savvy
  - No mature SQL interface on Hadoop
  - OLAP capability on Hadoop ecosystem not ready yet

Why not build an engine from scratch?

Features Highlights
- Extreme Scale OLAP Engine: Kylin is designed to query 10+ billion rows on Hadoop
- ANSI SQL Interface on Hadoop: Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
- Seamless Integration with BI Tools: Kylin currently offers integration capability with BI tools like Tableau
- Interactive Query Capability: users can interact with Hive tables at sub-second latency
- MOLAP Cube: define a data model from Hive tables and pre-build it in Kylin
- Scale Out Architecture: the query server cluster supports thousands of concurrent users and provides high availability

Features Highlights (continued)
- Compression and Encoding Support
- Incremental Refresh of Cubes
- Approximate Query Capability for distinct count (HyperLogLog)
- Leverage HBase Coprocessor for query latency
- Job Management and Monitoring
- Easy Web interface to manage, build, monitor and query cubes
- Security capability to set ACL at Cube/Project level
- Support LDAP Integration

Cube Designer

Job Management

Query and Visualization

Tableau Integration

Who are using Kylin
- eBay: 90% of queries < 5 seconds

  Case                          Cube Size   Raw Records
  User Session Analysis         26 TB       28+ billion rows
  Classified Traffic Analysis   21 TB       20+ billion rows
  GeoX Behavior Analysis        560 GB      1.2+ billion rows

- Baidu: Baidu Map internal analysis
- Many other Proof of Concepts: Bloomberg Law, British GAS, JD, Microsoft, StubHub, Tableau …

Kylin Architecture Overview
- Clients/users interact with Kylin via SQL
- The OLAP cube is transparent to users
- Clients: SQL-based tools (BI tools: Tableau…) via JDBC/ODBC; 3rd party apps (web app, mobile…) via REST API
- Kylin: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…)
- Storage: Hadoop Hive (star schema data), mid latency (minutes); OLAP data cube in HBase (key-value data), low latency (seconds)
- Flows shown in the diagram: online analysis data flow, offline data flow

Data Modeling
- Source: star schema (a fact table with dimension tables)
- Mapping: cube metadata
    Cube: …
    Fact Table: …
    Dimensions: …
    Measures: …
    Storage (HBase): …
- Target: HBase storage (row key plus a column family of column values, e.g. rows A, B, C holding Val 1, Val 2, Val 3)
- Roles: End User, Cube Modeler, Admin

OLAP Cube - Balance between Space and Time
- Cuboid = one combination of dimensions
- Cube = all combinations of dimensions (all cuboids)
- Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells

1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>

Cuboid lattice for the dimensions time, item, location, supplier:
- 0-D (apex) cuboid
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
- 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
- 4-D (base) cuboid: (time, item, location, supplier)
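
The cuboid lattice for the four example dimensions can be enumerated mechanically. The following is a minimal sketch, not Kylin code: the helper names `enumerate_cuboids` and `roll_up` are made up for illustration. It lists every combination of dimensions and shows how a base cell generalizes to an aggregate cell by replacing dropped dimensions with `*`.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def enumerate_cuboids(dimensions):
    """Group every cuboid (subset of dimensions) by its number of dimensions."""
    return {
        k: [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    }


def roll_up(cell, dimensions, keep):
    """Generalize a base cell to the cuboid `keep`; dropped dimensions become '*'."""
    return tuple(v if d in keep else "*" for d, v in zip(dimensions, cell))


lattice = enumerate_cuboids(DIMENSIONS)
total = sum(len(cuboids) for cuboids in lattice.values())

base_cell = ("9/15", "milk", "Urbana", "Dairy_land")
aggregate_cell = roll_up(base_cell, DIMENSIONS, {"item", "location"})

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {len(cuboids)}")
print("total cuboids:", total)
print("rolled up to <item, location>:", aggregate_cell)
```

With four dimensions the lattice holds 1 + 4 + 6 + 4 + 1 = 16 cuboids, which is the 2^N growth that the optimization slides address.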

Cube Build Job Flow

How to Store Cube? - HBase Schema

How to Query Cube? Query Engine - Calcite
- Dynamic data management framework.
- Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others.

How to Query Cube? Kylin Extensions on Calcite
- Metadata SPI
  - Provide table schema from Kylin metadata
- Optimize Rule
  - Translate the logical operator into a Kylin operator
- Relational Operator
  - Find the right cube
  - Translate SQL into storage engine API calls
  - Generate the physical execution plan via the linq4j Java implementation
- Result Enumerator
  - Translate the storage engine result into the Java implementation result
- SQL Function
  - Add HyperLogLog for distinct count
  - Implement date/time related functions (e.g. Quarter)
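
To make the HyperLogLog item concrete, here is a small, self-contained sketch of the general technique for approximate distinct counting. It is not Kylin's implementation: the class name, the register count (2^12) and the use of SHA-1 as the hash are assumptions chosen for readability.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers and a 64-bit hash."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1    # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the estimate

print("approximate distinct count:", hll.count())
```

The sketch keeps only 4,096 small registers regardless of input size, which is why an approximate distinct count fits well as a pre-aggregated cube measure: register arrays from different cells can be merged by taking the element-wise maximum.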

Query Engine - Kylin Explain Plan

SELECT test_cal_dt.week_beg_dt,
       test_category.category_name,
       test_category.lvl2_name,
       test_category.lvl3_name,
       test_kylin_fact.lstg_format_name,
       test_sites.site_name,
       SUM(test_kylin_fact.price) AS GMV,
       COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt
  ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category
  ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
 AND test_kylin_fact.lstg_site_id = test_category.site_id
LEFT JOIN test_sites
  ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456
   OR test_kylin_fact.lstg_format_name = 'New'
GROUP BY test_cal_dt.week_beg_dt,
         test_category.category_name,
         test_category.lvl2_name,
         test_category.lvl3_name,
         test_kylin_fact.lstg_format_name,
         test_sites.site_name

OLAPToEnumerableConverter
  OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
    OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
      OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
        OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
          OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
            OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
              OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
              OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
            OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])

How to Query Cube? Storage Engine
- Plugin-able storage engine
  - Common iterator interface for storage engine
  - Isolate query engine from underlying storage
- Translate cube query into HBase table scan
  - Columns, Groups -> Cuboid ID
  - Filters -> Scan Range (Row Key)
  - Aggregations -> Measure Columns (Row Values)
- Scan HBase table and translate HBase result into cube result
  - HBase Result (key + value) -> Cube Result (dimensions + measures)
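
The three translation steps above can be sketched on a toy in-memory table. Everything here is an assumption for illustration: the text row-key layout (`<cuboid bits>|<dim value>|...`), the function names and the sample rows are hypothetical, and a sorted Python dict stands in for an HBase table. Kylin's real row keys are binary and dictionary-encoded.

```python
DIMENSIONS = ("time", "item", "location", "supplier")   # assumed row-key order
WIDTH = len(DIMENSIONS)


def cuboid_id(columns, groups):
    """Columns, Groups -> Cuboid ID: one bit per dimension the query touches."""
    needed = set(columns) | set(groups)
    cid = 0
    for dim in DIMENSIONS:
        cid = (cid << 1) | (1 if dim in needed else 0)
    return cid


def scan_range(cid, filters):
    """Filters -> Scan Range (Row Key): equality filters on leading dimensions
    of the cuboid become a row-key prefix."""
    prefix = [format(cid, f"0{WIDTH}b")]
    for i, dim in enumerate(DIMENSIONS):
        if not cid & (1 << (WIDTH - 1 - i)):
            continue            # dimension is not part of this cuboid
        if dim not in filters:
            break               # prefix stops at the first unfiltered dimension
        prefix.append(str(filters[dim]))
    start = "|".join(prefix) + "|"
    return start, start + "\uffff"


def scan(table, start, stop):
    """Stand-in for an HBase range scan over lexicographically sorted keys."""
    return [(k, v) for k, v in sorted(table.items()) if start <= k < stop]


def to_cube_result(row_key, value, measures):
    """HBase Result (key + value) -> Cube Result (dimensions + measures)."""
    bits, *dim_values = row_key.split("|")
    dims = [d for d, b in zip(DIMENSIONS, bits) if b == "1"]
    return {**dict(zip(dims, dim_values)), **dict(zip(measures, value))}


# Hypothetical pre-aggregated rows: value = (sum of price, row count).
table = {
    "0110|milk|Chicago": (120.0, 3),
    "0110|milk|Urbana": (40.0, 1),
    "0110|bread|Urbana": (15.0, 2),
    "0100|milk": (160.0, 4),
}

cid = cuboid_id(columns=["item"], groups=["item", "location"])
start, stop = scan_range(cid, {"item": "milk"})
rows = [to_cube_result(k, v, ("gmv", "trans_cnt")) for k, v in scan(table, start, stop)]
print(rows)
```

Because the cuboid identifier leads the key, one range scan touches only the rows of the chosen cuboid, and leading-dimension filters narrow the range further.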

How to Optimize Cube? Cube Optimization
- Curse of dimensionality: an N-dimension cube has 2^N cuboids
  - Full Cube vs. Partial Cube
- Huge data volume
  - Dictionary Encoding
  - Incremental Building

How to Optimize Cube? Full Cube vs. Partial Cube
- Full Cube
  - Pre-aggregate all dimension combinations
  - "Curse of dimensionality": an N-dimension cube has 2^N cuboids
- Partial Cube
  - To avoid dimension explosion, we divide the dimensions into different aggregation groups
      2^(N+M+L) -> 2^N + 2^M + 2^L
  - For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
      2^30 -> 2^10 + 2^10 + 2^10
- Tradeoff between online aggregation and offline pre-aggregation
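
The arithmetic behind the 30-dimension example can be checked directly. This is a small sketch that follows the slide's formula as written; the function names are illustrative only.

```python
def full_cube_cuboids(num_dimensions):
    """A full cube pre-aggregates every subset of the dimensions: 2^N cuboids."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes):
    """With aggregation groups, each group contributes 2^(group size) cuboids."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])

print(f"full cube:    {full:,} cuboids")
print(f"partial cube: {partial:,} cuboids")
print(f"reduction:    {full // partial:,}x fewer cuboids")
```

The price of the reduction is that a query mixing dimensions from different groups can no longer be answered from a single pre-built cuboid, which is the online versus offline aggregation tradeoff named above.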

How to Optimize Cube? Partial Cube

How to Optimize Cube? Dictionary Encoding
- A data cube has lots of duplicated dimension values
- A dictionary maps dimension values to IDs, which reduces the memory and storage footprint
- The dictionary is based on a trie
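
The effect of dictionary encoding can be shown with a simplified stand-in. Note the swap: the slide says Kylin's dictionary is trie-based, while this sketch uses a sorted list with binary search to keep the example short. The class name and sample column are assumptions.

```python
from bisect import bisect_left


class SortedDictionary:
    """Maps each distinct dimension value to a small integer ID and back.
    Sorted-list stand-in for a trie-based dictionary."""

    def __init__(self, values):
        self._values = sorted(set(values))

    def encode(self, value):
        i = bisect_left(self._values, value)
        if i == len(self._values) or self._values[i] != value:
            raise KeyError(value)
        return i

    def decode(self, value_id):
        return self._values[value_id]

    def id_width_bytes(self):
        """Bytes needed to store the largest ID."""
        largest_id = max(len(self._values) - 1, 0)
        return max(1, (largest_id.bit_length() + 7) // 8)


column = ["Urbana", "Chicago", "Urbana", "Urbana", "Chicago", "Boston"]
dictionary = SortedDictionary(column)
encoded = [dictionary.encode(v) for v in column]

raw_bytes = sum(len(v.encode("utf-8")) for v in column)
encoded_bytes = len(column) * dictionary.id_width_bytes()

print("encoded column:", encoded)
print("raw bytes:", raw_bytes, "encoded bytes:", encoded_bytes)
```

Because IDs are assigned in sorted order, comparing two IDs gives the same result as comparing the original values, so range filters can operate on encoded row keys.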

How to Optimize Cube? Incremental Build

Streaming, ongoing effort
- Cube is great, but…
  - Sometimes we want to drill down to row-level information
  - A cube takes time to build; how about real-time analysis?
- Streaming with inverted index

                  Cube                                                  Inverted Index
  Storage format  Pre-aggregated cuboids                                Sharding, columnar storage, with inverted index on row blocks
  Query method    Cuboid scanning                                       Massive parallel processing
  Strength        Pre-aggregate huge historic data to small summaries   Swift response to real-time data
  Weakness        Takes time to build                                   Slow at scanning large data volume

Kylin 0.8, Lambda Architecture
- Streaming source: Kafka
- Cube: historic store, built by hourly/daily batch
- Inverted Index: real-time store, built by minutes batch
- SQL Query -> Hybrid Storage Interface -> Cube and Inverted Index

Kylin vs. Hive
Based on 12+B records case

  #  Query Type              Return Dataset  Query on Kylin (s)  Query on Hive (s)  Comments
  1  High Level Aggregation               4               0.129            157.437  1,217 times
  2  Analysis Query                  22,669               1.615            109.206  68 times
  3  Drill Down to Detail           325,029              12.058            113.123  9 times
  4  Drill Down to Detail           524,780              22.426            383.212  78 times
  5  Data Dump                      972,002              49.054                N/A

[Chart: query time on Hive vs. Kylin for SQL #1, SQL #2, SQL #3. Query type labels: High Level Aggregation, Analysis Query, Drill Down to Detail, Low Level Aggregation, Transaction Level]

Performance - Concurrency
- Linear scale out with more nodes

Performance - Query Latency
- 90% of queries < 5s
- Green line: 90th percentile queries
- Gray line: 95th percentile queries