Apache Kylin in Big Data Systems (course slides)
Apache Kylin: OLAP on Hadoop
http://kylin.io

Agenda
• What's Apache Kylin?
• Tech Highlights
• Performance
• Roadmap
• Q&A

Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

What's Kylin
kylin /ˈkiːˈlɪn/ 麒麟
n. (in Chinese art) a mythical animal of composite form
• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014

Big Data Era
• More and more data becoming available on Hadoop
• Limitations in existing Business Intelligence (BI) tools
  - Limited support for Hadoop
  - Data size growing exponentially
  - High latency of interactive queries
  - Scale-up architecture
• Challenges to adopt Hadoop as an interactive analysis system
  - Majority of analyst groups are SQL savvy
  - No mature SQL interface on Hadoop
  - OLAP capability on the Hadoop ecosystem not ready yet

Why not build an engine from scratch?

Features Highlights
• Extreme Scale OLAP Engine
  Kylin is designed to query 10+ billion rows on Hadoop
• ANSI SQL Interface on Hadoop
  Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
• Seamless Integration with BI Tools
  Kylin currently offers integration capability with BI tools like Tableau
• Interactive Query Capability
  Users can interact with Hive tables at sub-second latency
• MOLAP Cube
  Define a data model from Hive tables and pre-build it in Kylin
• Scale Out Architecture
  The query server cluster supports thousands of concurrent users and provides high availability

Features Highlights…
• Compression and encoding support
• Incremental refresh of cubes
• Approximate query capability for distinct count (HyperLogLog)
• Leverages HBase coprocessors to reduce query latency
• Job management and monitoring
• Easy web interface to manage, build, monitor and query cubes
• Security capability to set ACL at cube/project level
• Support for LDAP integration

Cube Designer
[slide image]

Job Management
[slide image]

Query and Visualization
[slide image]

Tableau Integration
[slide image]

Who is using Kylin

Case | Cube Size | Raw Records
User Session Analysis | 26 TB | 28+ billion rows
Classified Traffic Analysis | 21 TB | 20+ billion rows
GeoX Behavior Analysis | 560 GB | 1.2+ billion rows

• eBay: 90% of queries < 5 seconds
• Baidu: Baidu Map internal analysis
• Many other proofs of concept: Bloomberg Law, British GAS, JD, Microsoft, StubHub, Tableau …

Kylin Architecture Overview
• Clients
  - SQL-based tools (BI tools: Tableau…) connect via JDBC/ODBC and send SQL
  - 3rd party apps (web app, mobile…) connect via the REST API and send SQL
• Kylin server components: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…)
• Storage
  - Hadoop Hive: star schema data, mid latency (minutes)
  - Data cube (HBase): key-value data, low latency (seconds)
• Two data flows: online analysis data flow and offline data flow
• Clients/users interact with Kylin via SQL
• The OLAP cube is transparent to users
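A third-party app talking to the REST server might look roughly like the sketch below. It only builds the HTTP request and does not send it. The endpoint path `/kylin/api/query`, the `sql` and `project` payload fields, and the use of basic authentication follow Kylin's documented REST API as best recalled and may differ between versions; the host, credentials and project name are placeholders, not values from the slides.

```python
import base64
import json
import urllib.request


def build_kylin_query_request(base_url, user, password, sql, project):
    """Build (but do not send) a POST request for Kylin's SQL query endpoint.

    Assumption: the endpoint is <base_url>/kylin/api/query and accepts a JSON
    body with "sql" and "project" plus HTTP basic authentication.
    """
    payload = json.dumps({"sql": sql, "project": project}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder host, credentials and project; replace with real values.
request = build_kylin_query_request(
    "http://localhost:7070",
    "ADMIN",
    "KYLIN",
    "SELECT COUNT(*) FROM test_kylin_fact",
    "default",
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

The client only ever sees SQL in and rows out, which is what "the OLAP cube is transparent to users" means in practice.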

Data Modeling
• Source: star schema (one fact table joined to dimension tables)
• Mapping: cube metadata
  - Cube: …
  - Fact Table: …
  - Dimensions: …
  - Measures: …
  - Storage (HBase): …
• Target: HBase storage (row key and column family; rows A, B, C with column values Val 1, Val 2, Val 3)
• Roles: End User, Cube Modeler, Admin

OLAP Cube: Balance between Space and Time
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)

Cuboid lattice for the dimensions time, item, location, supplier:
• 0-D (apex) cuboid
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)

Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
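The lattice above can be generated mechanically. The following is a minimal sketch, not Kylin code: it enumerates every cuboid for the four example dimensions and groups them by level. The function name `enumerate_cuboids` is made up for illustration.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def enumerate_cuboids(dimensions):
    """Return {level: [cuboid, ...]}; a cuboid is a tuple of dimension names."""
    return {
        level: list(combinations(dimensions, level))
        for level in range(len(dimensions) + 1)
    }


lattice = enumerate_cuboids(DIMENSIONS)
for level, cuboids in lattice.items():
    print(f"{level}-D cuboids: {cuboids}")

# Total number of cuboids in the full cube: 2 ** len(DIMENSIONS).
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
```

Level 0 holds the single apex cuboid (the empty tuple) and level 4 holds the base cuboid.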

Cube Build Job Flow
[slide image]

How to Store Cube? HBase Schema
[slide image]

How to Query Cube? Query Engine: Calcite
• Dynamic data management framework.
• Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others.

How to Query Cube? Kylin Extensions on Calcite
• Metadata SPI
  - Provide table schema from Kylin metadata
• Optimize Rule
  - Translate the logical operator into a Kylin operator
• Relational Operator
  - Find the right cube
  - Translate SQL into storage engine API calls
  - Generate the physical execution plan via the linq4j Java implementation
• Result Enumerator
  - Translate the storage engine result into the Java implementation result
• SQL Function
  - Add HyperLogLog for distinct count
  - Implement date/time related functions (e.g. Quarter)
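To illustrate why HyperLogLog suits a pre-aggregated cube, here is a minimal, self-contained sketch of the textbook algorithm. It is not Kylin's implementation: the register count, the SHA-1 based hash and the class name are choices made for this example. The relevant property is that two sketches can be merged, so an approximate distinct count can be rolled up from finer cuboids.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers (illustrative, not Kylin's)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        index = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: element-wise maximum of the registers."""
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # valid for m >= 128
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:         # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


# Two "daily" sketches with overlapping users, merged into a weekly figure.
day1, day2 = HyperLogLog(), HyperLogLog()
for i in range(15000):
    day1.add(f"user-{i}")
for i in range(5000, 20000):
    day2.add(f"user-{i}")
both_days = day1.merge(day2)   # approximates 20000 distinct users
```

With 4096 registers the expected relative error is roughly 1.6%, which is the accuracy traded for constant-size storage per cube cell.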

Query Engine: Kylin Explain Plan

    SELECT test_cal_dt.week_beg_dt,
           test_category.category_name,
           test_category.lvl2_name,
           test_category.lvl3_name,
           test_kylin_fact.lstg_format_name,
           test_sites.site_name,
           SUM(test_kylin_fact.price) AS GMV,
           COUNT(*) AS TRANS_CNT
    FROM test_kylin_fact
    LEFT JOIN test_cal_dt
           ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
    LEFT JOIN test_category
           ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
          AND test_kylin_fact.lstg_site_id = test_category.site_id
    LEFT JOIN test_sites
           ON test_kylin_fact.lstg_site_id = test_sites.site_id
    WHERE test_kylin_fact.seller_id = 123456
       OR test_kylin_fact.lstg_format_name = 'New'
    GROUP BY test_cal_dt.week_beg_dt,
             test_category.category_name,
             test_category.lvl2_name,
             test_category.lvl3_name,
             test_kylin_fact.lstg_format_name,
             test_sites.site_name

    OLAPToEnumerableConverter
      OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
        OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
          OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
            OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
              OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
                OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
                  OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                    OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                    OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
                  OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
                OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])

How to Query Cube? Storage Engine
• Pluggable storage engine
  - Common iterator interface for storage engines
  - Isolates the query engine from the underlying storage
• Translate the cube query into an HBase table scan
  - Columns, Groups -> Cuboid ID
  - Filters -> Scan Range (Row Key)
  - Aggregations -> Measure Columns (Row Values)
• Scan the HBase table and translate the HBase result into a cube result
  - HBase Result (key + value) -> Cube Result (dimensions + measures)
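The translation steps above can be sketched with a toy in-memory table standing in for HBase. Everything here is an assumption made for illustration: the row-key layout (cuboid ID followed by dimension values), the bitmask form of the cuboid ID, and the names `cuboid_id`, `TABLE` and `query`. For simplicity the filter is applied during the scan instead of being turned into a narrowed key range.

```python
DIMS = ("time", "item", "location", "supplier")  # illustrative row-key order


def cuboid_id(columns):
    """Columns, Groups -> Cuboid ID: one bit per dimension used by the query."""
    n = len(DIMS)
    return sum(1 << (n - 1 - i) for i, dim in enumerate(DIMS) if dim in columns)


ITEM_LOCATION = cuboid_id({"item", "location"})
ITEM = cuboid_id({"item"})

# Toy "HBase table": row key = (cuboid id, dimension values...), value = measures.
TABLE = {
    (ITEM_LOCATION, "milk", "Urbana"): {"sum_price": 30.0, "count": 12},
    (ITEM_LOCATION, "milk", "Chicago"): {"sum_price": 55.0, "count": 20},
    (ITEM_LOCATION, "bread", "Urbana"): {"sum_price": 8.0, "count": 4},
    (ITEM, "milk"): {"sum_price": 85.0, "count": 32},
    (ITEM, "bread"): {"sum_price": 8.0, "count": 4},
}


def query(group_by, filters, measures):
    """Scan the matching cuboid and return {group key: {measure: value}}."""
    columns = set(group_by) | set(filters)
    cid = cuboid_id(columns)
    dims = [dim for dim in DIMS if dim in columns]      # key layout of this cuboid
    results = {}
    for key in sorted(TABLE):                           # stands in for HBase key order
        if key[0] != cid:
            continue
        row = dict(zip(dims, key[1:]))                  # key -> dimensions
        if any(row[dim] != wanted for dim, wanted in filters.items()):
            continue
        group = tuple(row[dim] for dim in group_by)
        acc = results.setdefault(group, {m: 0 for m in measures})
        for m in measures:                              # value -> measures
            acc[m] += TABLE[key][m]
    return results


by_item_in_urbana = query(["item"], {"location": "Urbana"}, ["sum_price"])
by_item = query(["item"], {}, ["count"])
```

The first query touches the (item, location) cuboid because the filter column counts toward the cuboid choice; the second is answered from the smaller (item) cuboid.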

How to Optimize Cube? Cube Optimization
• Curse of dimensionality: an N-dimension cube has 2^N cuboids
  - Full Cube vs. Partial Cube
• Huge data volume
  - Dictionary Encoding
  - Incremental Building

How to Optimize Cube? Full Cube vs. Partial Cube
• Full Cube
  - Pre-aggregate all dimension combinations
  - "Curse of dimensionality": an N-dimension cube has 2^N cuboids
• Partial Cube
  - To avoid dimension explosion, we divide the dimensions into different aggregation groups
    2^{N+M+L} → 2^N + 2^M + 2^L
  - For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
    2^{30} → 2^{10} + 2^{10} + 2^{10}
  - Tradeoff between online aggregation and offline pre-aggregation

How to Optimize Cube? Partial Cube
[slide image]
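The arithmetic behind the 30-dimension example can be checked with a few lines. This sketch simply applies the slide's formula; the function names are made up for illustration.

```python
def full_cube_cuboids(num_dimensions):
    """A full cube over N dimensions has 2**N cuboids."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes):
    """With aggregation groups, each group contributes 2**size cuboids."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)                   # 2**30
partial = partial_cube_cuboids([10, 10, 10])   # 2**10 + 2**10 + 2**10
print(f"full cube: {full:,} cuboids, partial cube: {partial:,} cuboids")
```

The price of the smaller cube is that a query mixing dimensions from different groups has no pre-built cuboid and needs more aggregation at query time.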

How to Optimize Cube? Dictionary Encoding
• A data cube has lots of duplicated dimension values
• A dictionary maps dimension values to IDs, which reduces the memory and storage footprint
• The dictionary is based on a trie
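A minimal sketch of the idea follows. Note that it swaps the trie for a plain sorted list with binary search, which is easier to read but is not how the slide says Kylin stores its dictionary; the class name `SortedDictionary` is made up for this example.

```python
from bisect import bisect_left


class SortedDictionary:
    """Maps each distinct value to a small integer ID (its sorted position)."""

    def __init__(self, values):
        self._values = sorted(set(values))

    def encode(self, value):
        index = bisect_left(self._values, value)
        if index == len(self._values) or self._values[index] != value:
            raise KeyError(value)
        return index

    def decode(self, value_id):
        return self._values[value_id]

    def __len__(self):
        return len(self._values)


# A dimension column with many repeated values.
locations = ["Urbana", "Chicago", "Urbana", "Urbana", "Chicago"]
dictionary = SortedDictionary(locations)
encoded = [dictionary.encode(value) for value in locations]
decoded = [dictionary.decode(value_id) for value_id in encoded]
```

Because IDs follow the sort order of the values, a range condition on values maps to a range condition on IDs, which is convenient when the IDs end up inside row keys.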

How to Optimize Cube? Incremental Build
[slide image]

Streaming with Inverted Index
• Cube is great, but…
  - Sometimes we want to drill down to row-level information
  - A cube takes time to build; how about real-time analysis?
• Streaming: ongoing effort

 | Cube | Inverted Index
Storage format | Pre-aggregated cuboids | Sharding, columnar storage, with inverted index on row blocks
Query method | Cuboid scanning | Massive parallel processing
Strength | Pre-aggregates huge historic data into small summaries | Swift response to real-time data
Weakness | Takes time to build | Slow at scanning large data volumes

Kylin 0.8, Lambda Architecture
[Diagram labels: streaming (Kafka); minutes batch; hourly/daily batch; Inverted Index (Real-time Store); Cube (Historic Store); Hybrid Storage Interface; SQL Query]

Kylin vs. Hive

# | Query Type | Return Dataset | Query on Kylin (s) | Query on Hive (s) | Comments
1 | High Level Aggregation | 4 | 0.129 | 157.437 | 1,217 times
2 | Analysis Query | 22,669 | 1.615 | 109.206 | 68 times
3 | Drill Down to Detail | 325,029 | 12.058 | 113.123 | 9 times
4 | Drill Down to Detail | 524,780 | 22.426 | 383.212 | 78 times
5 | Data Dump | 972,002 | 49.054 | N/A |

[Chart: Hive vs. Kylin query times for SQL #1, SQL #2 and SQL #3; labels range from High Level Aggregation through Analysis Query and Drill Down to Detail to Low Level Aggregation and Transaction Level]
Based on a 12+ billion records case

Performance -- Concurrency
• Linear scale out with more nodes
[chart]

Performance - Query Latency
• 90% of queries < 5s
• Green line: 90th percentile of queries
• Gray line: 95th percentile of queries
[chart]
