Achieving SLOs on a Large-Scale Kubernetes Cluster
Methods to achieve high SLOs on a large-scale Kubernetes cluster

Why SLO?

[Figure: relationship between SLI, SLO and SLA. Labels: Latency, Availability, QPS, Correctness (SLI); SLO; Punishment; SLA.]

- An SLI defines an indicator that represents user experience.
- An SLO is the objective of meeting all SLIs over a period of time.
- SLA = SLO + Punishment.

SLA/SLO/SLI

SLO (Service-Level Objective). Within service-level agreements (SLAs), SLOs are the objectives that must be achieved for each service activity, function and process to provide the best opportunity for service recipient success. (Gartner)

What we care about on a large k8s cluster

- Is the cluster healthy?
  1. Are all software components working fine?
  2. How many failures occurred on the cluster?
- What happened in the cluster?
  1. Did something unexpected happen in the cluster?
  2. What did end users do in the cluster?
- How do we locate a failure?
  1. Which component is going wrong?
  2. Which component caused the delivery of the pod to fail?

SLIs on a large k8s cluster

- Cluster health state: a combined value that indicates the risk in the cluster. Currently the possible values are Healthy, Warning and Fatal.
- Success rate: the rate of successful pod creations/upgrades.
- Number of Terminating Pods: the count of pods that cannot be deleted within a certain period.
- Centralized Components Availability: the ratio of time in which the cluster is available. It is used to evaluate the master components.
- Nodes Availability: the number of unhealthy nodes in the cluster. Pods scheduled to unhealthy nodes may not be delivered in time, so the success rate decreases as a consequence.

The success standard and reason classification

The success standard:

| Pod | Feature | Time limit | Success condition |
| --- | --- | --- | --- |
| Pod | RestartPolicy=Always | 1 min (example value) | The condition with "type=Ready" in .Status.Conditions has status True |
| Job Pod | RestartPolicy=Never/OnFailure | 1 min | .Status.Phase = Running/Succeeded/Failed |
| Terminating Pod | | 1 min | Pod is removed from etcd |
| Unhealthy Node | Taint/Degrade | 1 min | Node has taints or is degraded |
| | Processing | Based on the failure reason | Unhealthy node is healed or removed |

Reason classification:

| Source | Feature | Example |
| --- | --- | --- |
| System | Failure caused by the cluster itself | RuntimeError, ImageFailed, Unscheduled, KubeletDelay |
| End Users | Failure caused by end users | ContainerCrashLoopBackOff, FailedPostStartHook, Unhealthy |

[Figure: overview of the four parts of the system (SLO, Trace system, Increase of SLO, The unhealthy node). Labels in the figure: Data Collect, Audit log, Event, Monitoring, Isolation, Recover, Degrade, Data Analysis, Failures/Machine, Failures/Reason, Report, Lifecycle of Pod, Failure Reason, Target, Kubelet, Apiserver, Scheduler, Operator, Runtime, Daemonset, Alert, Gray Scale, Bug Fix, Success Rate, Cluster Healthy, Terminating Pod Number, Daily Report, Validation, Housekeeping, High Available, Fast Recovery, Display Board, Analysis Platform, Weekly Report.]

- SLO: indicates whether the cluster is healthy or whether something unexpected has happened.
- Trace system: collects and analyzes logs in the cluster, so that we know what happened in the cluster.
- Increase of SLO: finds the weaknesses of the cluster by analyzing the failure reasons, so that effective actions can be taken to increase the success rate.
- The unhealthy node: detects unhealthy machines in a timely manner and fixes problems automatically.

The infrastructure

[Figure: infrastructure of the trace system. Labels: Log, Event, End User, Storage, Analysis Platform, Trace, Report, Weakness.]

The trace system

- Data collection: collect the audit log for the whole cluster.
- Data analysis: analyze the failure reason when a pod fails.
- Reason analysis: analyze the failure reasons and try to find anything abnormal in the cluster.

Trace result: from the trace we can tell, for example, that the delivery of the pod failed and that the failure reason is FailedMount.

[Figure: trace result. Labels: Pod Lifecycle, FailureReason.]

Trace System: Node Metrics

[Figure: node metrics sources. Labels: node metrics, kubelet metrics, daemonset metrics, node load, slo metrics, csi metrics, dirty data.]

With the huge amount of metrics data collected, statistical methods can be used to check whether a node is healthy. In addition, node delivery capacity can be evaluated from historical data.

Dirty data metrics cover:

- escaped/zombie/uninterruptible processes
- orphaned containers
- orphaned pod directories/volumes
- orphaned cgroups
- orphaned net devices
- and so on

With these metrics, the node recovery system can clean up the dirty data or alert cluster admins to process it manually.

Unhealthy node

[Figure: unhealthy node handling pipeline. Labels: SLO, Trace, NPD, Metrics, Event Source, runtimeErrorController, failedPodController, Detector, Strategy, Unhealthy node list, Fast Taint, Weight Adjust, Recovery, Manual Handling, Improve, Auto, Human experience, Improve of strategy.]

- Collect data from metrics, NPD, the trace system and logs.
- Analyze the problem of the node, such as DiskRO or a critical Daemonset that is not ready.
- Process the unhealthy node: Heal, Degrade or Isolate. With a scoring mechanism and historical operation records, unhealthy nodes can be recovered automatically. Otherwise manual intervention is required.
- Generate a daily report to show what happened, so that programmers can continuously improve the system based on the reports.

[Figure: Daily Report.]

Tips on increasing SLO

Case 1: Image Download
Image lazy-loading technology provides the ability to run a container without downloading the image.

Case 2: Retry
The pod should be recreated when the previous pod has failed. The previous node sho…
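
The success standard and the reason classification described earlier can be sketched in a few lines of Python. This is only an illustrative sketch, not the speakers' implementation: the `Delivery` record, the function names and the `Unknown` fallback are assumptions introduced here, the 60-second limit is the example value from the slides, and the reason names are the examples listed in the classification table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

# 1 min is the example time limit given in the success standard.
TIME_LIMIT_SECONDS = 60.0

# Example reasons from the classification table; keeping them in two
# sets is an assumption made for this sketch.
SYSTEM_REASONS = {"RuntimeError", "ImageFailed", "Unscheduled", "KubeletDelay"}
USER_REASONS = {"ContainerCrashLoopBackOff", "FailedPostStartHook", "Unhealthy"}


@dataclass
class Delivery:
    """Hypothetical record of one pod creation or upgrade."""
    pod: str
    seconds_to_ready: Optional[float]  # None if the pod never became Ready
    failure_reason: Optional[str] = None


def is_success(d: Delivery) -> bool:
    """A delivery succeeds if the pod became Ready within the time limit."""
    return d.seconds_to_ready is not None and d.seconds_to_ready <= TIME_LIMIT_SECONDS


def reason_source(reason: Optional[str]) -> str:
    """Map a failure reason to its source: System, End Users or Unknown."""
    if reason in SYSTEM_REASONS:
        return "System"
    if reason in USER_REASONS:
        return "End Users"
    return "Unknown"


def success_rate(deliveries: Iterable[Delivery]) -> float:
    """Fraction of deliveries that met the success standard."""
    items = list(deliveries)
    if not items:
        return 1.0  # assumption: no deliveries means nothing failed
    return sum(is_success(d) for d in items) / len(items)


def failures_by_source(deliveries: Iterable[Delivery]) -> Counter:
    """Count failed deliveries per failure source."""
    return Counter(
        reason_source(d.failure_reason) for d in deliveries if not is_success(d)
    )


if __name__ == "__main__":
    sample = [
        Delivery("a", 12.0),
        Delivery("b", 75.0, "KubeletDelay"),
        Delivery("c", None, "ContainerCrashLoopBackOff"),
    ]
    print(success_rate(sample), dict(failures_by_source(sample)))
```

Splitting the failure counts by source keeps failures caused by the cluster itself separate from those caused by end users, which is the kind of input the reason analysis step relies on.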