Detailed Tutorial: Highly Available Kubernetes Monitoring with Prometheus and Thanos

This article is reposted from Rancher Labs.

Introduction

Why Prometheus needs high availability

Kubernetes adoption has grown several-fold over the past few years, and it has clearly become the go-to choice for container orchestration. At the same time, Prometheus has established itself as an excellent option for monitoring both containerized and non-containerized workloads. Monitoring is an essential concern for any infrastructure, and we should make sure that our monitoring setup is highly available and scalable enough to keep up with the needs of a growing infrastructure, especially when adopting Kubernetes.

Therefore, today we will deploy a clustered Prometheus setup that is not only resilient to node failures, but also ensures proper data archiving for later reference. Our setup is also very scalable, to the point where we can span multiple Kubernetes clusters under the same monitoring umbrella.

The current approach

Most Prometheus deployments today use pods with persistent volumes, and scale Prometheus through federation. However, not all data can be aggregated using federation, and as you add more servers you often need yet another mechanism to manage the Prometheus configuration.

The solution

Thanos aims to solve the problems above. With the help of Thanos, we can not only run multiple replicas of Prometheus and deduplicate the data across them, but also archive the data to long-term storage such as GCS or S3.

Implementation

Thanos architecture

[Figure: Thanos architecture]

Image source: https://thanos.io/quick-tutorial.md/

Thanos consists of the following components:

  • Thanos Sidecar: This is the main component that runs alongside Prometheus. It reads and archives data to object storage. It also manages Prometheus's configuration and lifecycle. To distinguish each Prometheus instance, the sidecar injects external labels into the Prometheus configuration. It is able to run queries against the Prometheus server's PromQL interface, and it also listens on the Thanos gRPC protocol, translating queries between gRPC and REST.
  • Thanos Store: This component implements the Store API on top of the historical data in an object storage bucket. It acts mainly as an API gateway, so it does not need large amounts of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in sync with the bucket. This data is generally safe to delete across restarts, at the cost of increased startup times.

  • Thanos Query: The query component listens on HTTP and translates queries into the Thanos gRPC format. It aggregates query results from different sources, and can read data from Sidecars and Stores. In an HA setup it even deduplicates the query results.

    Run-time deduplication of HA groups

Prometheus is stateful and does not allow its database to be replicated. This means that increasing availability by running multiple Prometheus replicas is not straightforward. Simple load balancing does not work: after a crash, for example, a replica may come back up, but querying it will show a small gap for the period during which it was down. A second replica may be up during that window, but it may be down at another moment (for example during a rolling restart), so load balancing across these replicas will not work properly.

  • Thanos Querier instead pulls the data from both replicas and deduplicates those signals, filling the gaps for the Querier consumer.

  • Thanos Compact: This component applies the compaction procedure of the Prometheus 2.0 storage engine to block data in object storage. It is generally not semantically concurrency-safe and must be deployed as a singleton against a bucket. It is also responsible for downsampling the data: 5m downsampling after 40 hours, and 1h downsampling after 10 days.

  • Thanos Ruler: It does basically the same job as Prometheus rules; the only difference is that it can communicate with Thanos components.

Configuration

Prerequisites

To follow this tutorial end to end, you will need the following:

  1. Some working knowledge of Kubernetes and of using kubectl.
  2. A running Kubernetes cluster with at least 3 nodes (a GKE cluster is used in this demo).
  3. An Ingress Controller and Ingress objects (the Nginx Ingress Controller is used in this demo). This is not mandatory, but it is strongly recommended in order to reduce the number of external endpoints you need to create.
  4. Credentials that the Thanos components will use to access object storage (GCS buckets in this case).
  5. Create 2 GCS buckets and name them prometheus-long-term and thanos-ruler.
  6. Create a service account with the role Storage Object Admin.
  7. Download the key file as JSON credentials and name it thanos-gcs-credentials.json (one way to script steps 5 to 7 is sketched after the command below).
  8. Create a Kubernetes secret from the credentials:

kubectl create secret generic thanos-gcs-credentials --from-file=thanos-gcs-credentials.json
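
For reference, steps 5 to 7 above can also be scripted with the gcloud and gsutil CLIs. The sketch below is only illustrative: the service-account name thanos-demo and <your-project-id> are placeholders, and GCS bucket names must be globally unique, so in practice you would prefix them with something of your own.

# Step 5: create the two buckets
gsutil mb gs://prometheus-long-term
gsutil mb gs://thanos-ruler
# Step 6: create a service account and grant it Storage Object Admin
gcloud iam service-accounts create thanos-demo --display-name="thanos-demo"
gcloud projects add-iam-policy-binding <your-project-id> \
  --member="serviceAccount:thanos-demo@<your-project-id>.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
# Step 7: download the key as thanos-gcs-credentials.json
gcloud iam service-accounts keys create thanos-gcs-credentials.json \
  --iam-account="thanos-demo@<your-project-id>.iam.gserviceaccount.com"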

Deploying the components

Deploying the Prometheus ServiceAccount, ClusterRole and ClusterRoleBinding

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: monitoring
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: monitoring
subjects:
  - kind: ServiceAccount
    name: monitoring
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: monitoring
  apiGroup: rbac.authorization.k8s.io
---

The manifest above creates the monitoring namespace, as well as the ServiceAccount, ClusterRole and ClusterRoleBinding needed by Prometheus.

Deploying the Prometheus configuration ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yaml.tmpl: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
      external_labels:
        cluster: prometheus-ha
        # Each Prometheus has to have unique labels.
        replica: $(POD_NAME)

    rule_files:
      - /etc/prometheus/rules/*rules.yaml

    alerting:
      # We want our alerts to be deduplicated
      # from different replicas.
      alert_relabel_configs:
      - regex: replica
        action: labeldrop

      alertmanagers:
        - scheme: http
          path_prefix: /
          static_configs:
            - targets: ['alertmanager:9093']

    scrape_configs:
    - job_name: kubernetes-nodes-cadvisor
      scrape_interval: 10s
      scrape_timeout: 10s
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        # Only for Kubernetes ^1.7.3.
        # See: https://github.com/prometheus/prometheus/issues/2916
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      metric_relabel_configs:
        - action: replace
          source_labels: [id]
          regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
          target_label: rkt_container_name
          replacement: '${2}-${1}'
        - action: replace
          source_labels: [id]
          regex: '^/system\.slice/(.+)\.service$'
          target_label: systemd_service_name
          replacement: '${1}'

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2

    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
        - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: (.+)(?::\d+);(\d+)
          replacement: $1:$2

The ConfigMap above creates the Prometheus configuration file template. This template is read by the Thanos sidecar component, which generates the actual configuration file, which in turn is consumed by the Prometheus container running in the same pod. It is extremely important to add the external_labels section to the configuration, since this is what the Querier uses to deduplicate data.
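
For reference, once the sidecar's reloader has substituted the POD_NAME environment variable, the rendered file written to /etc/prometheus-shared/prometheus.yaml should begin roughly like this for the first replica (a sketch, not verbatim output):

global:
  scrape_interval: 5s
  evaluation_interval: 5s
  external_labels:
    cluster: prometheus-ha
    # $(POD_NAME) has been replaced with the actual pod name
    replica: prometheus-0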

Deploying the Prometheus rules ConfigMap

This creates our alerting rules, which will be forwarded to Alertmanager for delivery.

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  labels:
    name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yaml: |-
    groups:
      - name: Deployment
        rules:
        - alert: Deployment at 0 Replicas
          annotations:
            summary: Deployment {{$labels.deployment}} in {{$labels.namespace}} is currently having no pods running
          expr: |
            sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace)  < 1
          for: 1m
          labels:
            team: devops

        - alert: HPA Scaling Limited
          annotations:
            summary: HPA named {{$labels.hpa}} in {{$labels.namespace}} namespace has reached scaling limited state
          expr: |
            (sum(kube_hpa_status_condition{condition="ScalingLimited",status="true"}) by (hpa,namespace)) == 1
          for: 1m
          labels:
            team: devops

        - alert: HPA at MaxCapacity
          annotations:
            summary: HPA named {{$labels.hpa}} in {{$labels.namespace}} namespace is running at Max Capacity
          expr: |
            ((sum(kube_hpa_spec_max_replicas) by (hpa,namespace)) - (sum(kube_hpa_status_current_replicas) by (hpa,namespace))) == 0
          for: 1m
          labels:
            team: devops

      - name: Pods
        rules:
        - alert: Container restarted
          annotations:
            summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} was restarted
          expr: |
            sum(increase(kube_pod_container_status_restarts_total{namespace!="kube-system",pod_template_hash=""}[1m])) by (pod,namespace,container) > 0
          for: 0m
          labels:
            team: dev

        - alert: High Memory Usage of Container
          annotations:
            summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} is using more than 75% of Memory Limit
          expr: |
            ((( sum(container_memory_usage_bytes{image!="",container_name!="POD", namespace!="kube-system"}) by (namespace,container_name,pod_name)  / sum(container_spec_memory_limit_bytes{image!="",container_name!="POD",namespace!="kube-system"}) by (namespace,container_name,pod_name) ) * 100 ) < +Inf ) > 75
          for: 5m
          labels:
            team: dev

        - alert: High CPU Usage of Container
          annotations:
            summary: Container named {{$labels.container}} in {{$labels.pod}} in {{$labels.namespace}} is using more than 75% of CPU Limit
          expr: |
            ((sum(irate(container_cpu_usage_seconds_total{image!="",container_name!="POD", namespace!="kube-system"}[30s])) by (namespace,container_name,pod_name) / sum(container_spec_cpu_quota{image!="",container_name!="POD", namespace!="kube-system"} / container_spec_cpu_period{image!="",container_name!="POD", namespace!="kube-system"}) by (namespace,container_name,pod_name) ) * 100)  > 75
          for: 5m
          labels:
            team: dev

      - name: Nodes
        rules:
        - alert: High Node Memory Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 80% memory used. Plan Capacity
          expr: |
            (sum (container_memory_working_set_bytes{id="/",container_name!="POD"}) by (kubernetes_io_hostname) / sum (machine_memory_bytes{}) by (kubernetes_io_hostname) * 100) > 80
          for: 5m
          labels:
            team: devops

        - alert: High Node CPU Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 80% allocatable cpu used. Plan Capacity.
          expr: |
            (sum(rate(container_cpu_usage_seconds_total{id="/", container_name!="POD"}[1m])) by (kubernetes_io_hostname) / sum(machine_cpu_cores) by (kubernetes_io_hostname)  * 100) > 80
          for: 5m
          labels:
            team: devops

        - alert: High Node Disk Usage
          annotations:
            summary: Node {{$labels.kubernetes_io_hostname}} has more than 85% disk used. Plan Capacity.
          expr: |
            (sum(container_fs_usage_bytes{device=~"^/dev/[sv]d[a-z][1-9]$",id="/",container_name!="POD"}) by (kubernetes_io_hostname) / sum(container_fs_limit_bytes{container_name!="POD",device=~"^/dev/[sv]d[a-z][1-9]$",id="/"}) by (kubernetes_io_hostname)) * 100 > 85
          for: 5m
          labels:
            team: devops

Deploying the Prometheus StatefulSet

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: fast
  namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 3
  serviceName: prometheus-service
  template:
    metadata:
      labels:
        app: prometheus
        thanos-store-api: "true"
    spec:
      serviceAccountName: monitoring
      containers:
        - name: prometheus
          image: prom/prometheus:v2.4.3
          args:
            - "--config.file=/etc/prometheus-shared/prometheus.yaml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
            - "--storage.tsdb.no-lockfile"
            - "--storage.tsdb.min-block-duration=2h"
            - "--storage.tsdb.max-block-duration=2h"
          ports:
            - name: prometheus
              containerPort: 9090
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus/
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - "sidecar"
            - "--log.level=debug"
            - "--tsdb.path=/prometheus"
            - "--prometheus.url=http://127.0.0.1:9090"
            - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
            - "--reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl"
            - "--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml"
            - "--reloader.rule-dir=/etc/prometheus/rules/"
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
            - name: http-sidecar
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      volumes:
        - name: prometheus-config
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-config-shared
          emptyDir: {}
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 20Gi

For the manifest provided above, it is important to understand the following:

  1. Prometheus is deployed as a StatefulSet with 3 replicas, and each replica dynamically provisions its own persistent volume.
  2. The Prometheus configuration is generated by the Thanos sidecar container, using the template file we created above.
  3. Thanos handles data compaction, so we need to set --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h.
  4. The Prometheus StatefulSet is labeled with thanos-store-api: "true", so that each pod is discovered by the headless service we create next. It is this headless service that the Thanos Querier uses to query data across all of the Prometheus instances. We also apply the same label to the Thanos Store and Thanos Ruler components, so that they too are discovered by the Querier and can be used for querying metrics.
  5. The GCS bucket credentials path is provided via the GOOGLE_APPLICATION_CREDENTIALS environment variable, and the credentials file is mounted into the pod from the secret we created as part of the prerequisites (a quick verification sketch follows this list).
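
Assuming the manifests above have been saved and applied, a quick check that the replicas and their volumes came up could look like the following (the file names here are hypothetical, and the output will vary per cluster):

kubectl apply -f prometheus-rbac.yaml -f prometheus-config.yaml -f prometheus-rules.yaml -f prometheus-statefulset.yaml
kubectl -n monitoring get statefulset prometheus
kubectl -n monitoring get pods -l app=prometheus
kubectl -n monitoring get pvc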

Deploying the Prometheus services

apiVersion: v1
kind: Service
metadata:
  name: prometheus-0-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector:
    statefulset.kubernetes.io/pod-name: prometheus-0
  ports:
    - name: prometheus
      port: 8080
      targetPort: prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-1-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector:
    statefulset.kubernetes.io/pod-name: prometheus-1
  ports:
    - name: prometheus
      port: 8080
      targetPort: prometheus
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-2-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector:
    statefulset.kubernetes.io/pod-name: prometheus-2
  ports:
    - name: prometheus
      port: 8080
      targetPort: prometheus
---
# This service creates a srv record for querier to find about store-api's
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
  selector:
    thanos-store-api: "true"

In addition to the approach above, you can also refer to this article to learn how to quickly deploy and configure the Prometheus service on Rancher.

We create a separate service for each Prometheus pod in the StatefulSet, even though this is not strictly necessary; these services exist only for debugging. The purpose of the thanos-store-gateway headless service has been explained above. We will expose the Prometheus services later using an Ingress object.
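
For example, a single replica can be inspected for debugging, without any Ingress, by port-forwarding its per-pod service locally (a sketch; the service listens on port 8080 as defined above):

kubectl -n monitoring port-forward svc/prometheus-0-service 8080:8080
# then open http://localhost:8080 to look at that one replica only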

Deploying Thanos Querier

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: monitoring
  labels:
    app: thanos-querier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
      - name: thanos
        image: quay.io/thanos/thanos:v0.8.0
        args:
        - query
        - --log.level=debug
        - --query.replica-label=replica
        - --store=dnssrv+thanos-store-gateway:10901
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        livenessProbe:
          httpGet:
            port: http
            path: /-/healthy
        readinessProbe:
          httpGet:
            port: http
            path: /-/ready
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-querier
  name: thanos-querier
  namespace: monitoring
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: http
    name: http
  selector:
    app: thanos-querier

This is one of the main pieces of the Thanos deployment. Note the following:

  1. The container argument --store=dnssrv+thanos-store-gateway:10901 helps discover all of the components that metric data should be queried from.
  2. The thanos-querier service provides a web interface for running PromQL queries. It also has the option to deduplicate data across different Prometheus clusters (a quick API check is sketched below).
  3. This is the endpoint where we point Grafana as the data source for all dashboards.
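
Because the Querier exposes the standard Prometheus HTTP API, it can be sanity-checked from inside the cluster with something like the following (a sketch; the DNS name assumes the monitoring namespace, and the dedup parameter is optional):

curl 'http://thanos-querier.monitoring.svc.cluster.local:9090/api/v1/query?query=up&dedup=true'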

Deploying the Thanos Store Gateway

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: thanos-store-gateway
  namespace: monitoring
  labels:
    app: thanos-store-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store-gateway
  serviceName: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
        thanos-store-api: "true"
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
          - "store"
          - "--log.level=debug"
          - "--data-dir=/data"
          - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
          - "--index-cache-size=500MB"
          - "--chunk-pool-size=500MB"
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      volumes:
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
---

This creates the store component, which serves metrics from object storage to the Querier.
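
If you would rather not pass the bucket configuration inline, Thanos components also accept an --objstore.config-file flag pointing at a file; for this setup such a file would look roughly like this (a sketch, assuming the same prometheus-long-term bucket):

type: GCS
config:
  bucket: prometheus-long-term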

Deploying Thanos Ruler

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-rules
  namespace: monitoring
data:
  alert_down_services.rules.yaml: |
    groups:
    - name: metamonitoring
      rules:
      - alert: PrometheusReplicaDown
        annotations:
          message: Prometheus replica in cluster {{$labels.cluster}} has disappeared from Prometheus target discovery.
        expr: |
          sum(up{cluster="prometheus-ha", instance=~".*:9090", job="kubernetes-service-endpoints"}) by (job,cluster) < 3
        for: 15s
        labels:
          severity: critical
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: thanos-ruler
  name: thanos-ruler
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-ruler
  serviceName: thanos-ruler
  template:
    metadata:
      labels:
        app: thanos-ruler
        thanos-store-api: "true"
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - rule
            - --log.level=debug
            - --data-dir=/data
            - --eval-interval=15s
            - --rule-file=/etc/thanos-ruler/*.rules.yaml
            - --alertmanagers.url=http://alertmanager:9093
            - --query=thanos-querier:9090
            - "--objstore.config={type: GCS, config: {bucket: thanos-ruler}}"
            - --label=ruler_cluster="prometheus-ha"
            - --label=replica="$(POD_NAME)"
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: http
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: http
              path: /-/ready
          volumeMounts:
            - mountPath: /etc/thanos-ruler
              name: config
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      volumes:
        - configMap:
            name: thanos-ruler-rules
          name: config
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-ruler
  name: thanos-ruler
  namespace: monitoring
spec:
  ports:
    - port: 9090
      protocol: TCP
      targetPort: http
      name: http
  selector:
    app: thanos-ruler

Now, if you start an interactive shell in the same namespace as our workloads and try to see which pods our thanos-store-gateway resolves to, you will see something like the following:

root@my-shell-95cb5df57-4q6w8:/# nslookup thanos-store-gateway
Server:    10.63.240.10
Address:  10.63.240.10#53

Name:  thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.2
Name:  thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.4
Name:  thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.2
Name:  thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.8
Name:  thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.31.2

root@my-shell-95cb5df57-4q6w8:/# exit

The IPs returned above correspond to our Prometheus pods, thanos-store and thanos-ruler. This can be verified with:

$ kubectl get pods -o wide -l thanos-store-api="true"
NAME                     READY   STATUS    RESTARTS   AGE    IP           NODE                              NOMINATED NODE   READINESS GATES
prometheus-0             2/2     Running   0          100m   10.60.31.2   gke-demo-1-pool-1-649cbe02-jdnv   <none>           <none>
prometheus-1             2/2     Running   0          14h    10.60.30.2   gke-demo-1-pool-1-7533d618-kxkd   <none>           <none>
prometheus-2             2/2     Running   0          31h    10.60.25.2   gke-demo-1-pool-1-4e9889dd-27gc   <none>           <none>
thanos-ruler-0           1/1     Running   0          100m   10.60.30.8   gke-demo-1-pool-1-7533d618-kxkd   <none>           <none>
thanos-store-gateway-0   1/1     Running   0          14h    10.60.25.4   gke-demo-1-pool-1-4e9889dd-27gc   <none>           <none>

Deploying Alertmanager

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitoring
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      slack_api_url: "<your_slack_hook>"
      victorops_api_url: "<your_victorops_hook>"

    templates:
    - '/etc/alertmanager-templates/*.tmpl'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 5m
      receiver: default
      routes:
      - match:
          team: devops
        receiver: devops
        continue: true
      - match:
          team: dev
        receiver: dev
        continue: true

    receivers:
    - name: 'default'

    - name: 'devops'
      victorops_configs:
      - api_key: '<YOUR_API_KEY>'
        routing_key: 'devops'
        message_type: 'CRITICAL'
        entity_display_name: '{{ .CommonLabels.alertname }}'
        state_message: 'Alert: {{ .CommonLabels.alertname }}. Summary:{{ .CommonAnnotations.summary }}. RawData: {{ .CommonLabels }}'
      slack_configs:
      - channel: '#k8-alerts'
        send_resolved: true

    - name: 'dev'
      victorops_configs:
      - api_key: '<YOUR_API_KEY>'
        routing_key: 'dev'
        message_type: 'CRITICAL'
        entity_display_name: '{{ .CommonLabels.alertname }}'
        state_message: 'Alert: {{ .CommonLabels.alertname }}. Summary:{{ .CommonAnnotations.summary }}. RawData: {{ .CommonLabels }}'
      slack_configs:
      - channel: '#k8-alerts'
        send_resolved: true
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.3
        args:
          - '--config.file=/etc/alertmanager/config.yml'
          - '--storage.path=/alertmanager'
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager
        - name: alertmanager
          mountPath: /alertmanager
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager
      - name: alertmanager
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/metrics'
  labels:
    name: alertmanager
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  ports:
  - name: alertmanager
    protocol: TCP
    port: 9093
    targetPort: 9093

This creates our Alertmanager deployment, which will deliver all the alerts generated according to the Prometheus rules.
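
Before rolling out changes to this ConfigMap, it can be useful to validate the routing tree locally with amtool, which ships with Alertmanager (a sketch, assuming the configuration above has been saved locally as config.yml):

amtool check-config config.yml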

Deploying Kube State Metrics

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  namespace: monitoring
  name: kube-state-metrics-resizer
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/mxinden/kube-state-metrics:v1.4.0-gzip.3
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: k8s.gcr.io/addon-resizer:1.8.3
        resources:
          limits:
            cpu: 150m
            memory: 50Mi
          requests:
            cpu: 150m
            memory: 50Mi
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - /pod_nanny
          - --container=kube-state-metrics
          - --cpu=100m
          - --extra-cpu=1m
          - --memory=100Mi
          - --extra-memory=2Mi
          - --threshold=5
          - --deployment=kube-state-metrics
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics

The Kube State Metrics deployment is needed to relay some important container metrics that are not natively exposed by the kubelet and therefore are not directly available to Prometheus.
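
For instance, the Deployment and HPA alerts defined earlier are built entirely on kube-state-metrics series, so once this deployment is being scraped, queries like the following should return data in the Querier (example queries only):

sum(kube_deployment_status_replicas) by (deployment, namespace)
sum(kube_hpa_status_current_replicas) by (hpa, namespace)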

Deploying the Node-Exporter DaemonSet

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    name: node-exporter
spec:
  template:
    metadata:
      labels:
        name: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v0.16.0
          securityContext:
            privileged: true
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - containerPort: 9100
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 100Mi
          volumeMounts:
            - name: dev
              mountPath: /host/dev
            - name: proc
              mountPath: /host/proc
            - name: sys
              mountPath: /host/sys
            - name: rootfs
              mountPath: /rootfs
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /

The node-exporter DaemonSet runs a node-exporter pod on every node and exposes very important node-level metrics, which can be pulled by the Prometheus instances.
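
As a quick check, once these pods are being scraped (they carry the prometheus.io/scrape annotation used by the kubernetes-pods job above), node-level series should be queryable; the examples below assume the metric naming introduced in node-exporter 0.16:

node_memory_MemAvailable_bytes
rate(node_cpu_seconds_total{mode!="idle"}[5m])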

Deploying Grafana

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: fast
  namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  serviceName: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var
          name: grafana-storage
        env:
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
          # The following env variables are required to make Grafana accessible via
          # the kubernetes api-server proxy. On production clusters, we recommend
          # removing these env variables, setup auth for grafana, and expose the grafana
          # service using a LoadBalancer or a public IP.
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
  volumeClaimTemplates:
  - metadata:
      name: grafana-storage
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: grafana
  name: grafana
  namespace: monitoring
spec:
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    k8s-app: grafana

This creates our Grafana deployment and service, which will be exposed using our Ingress object. To wire things up, we should add Thanos Querier as the data source of our Grafana deployment:

  1. Click on Add DataSource
  2. Set Name: DS_PROMETHEUS
  3. Set Type: Prometheus
  4. Set URL: http://thanos-querier:9090
  5. Save and Test. You can now build your custom dashboards, or simply import dashboards from grafana.net. Dashboards #315 and #1471 are very good starting points (a declarative provisioning alternative is sketched below).
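
Alternatively, the same data source can be provisioned declaratively rather than through the UI: Grafana 5 reads data source definitions from /etc/grafana/provisioning/datasources, so a ConfigMap mounted at that path with roughly the following content should have the same effect (a sketch; this is not part of the manifests above):

apiVersion: 1
datasources:
  - name: DS_PROMETHEUS
    type: prometheus
    access: proxy
    url: http://thanos-querier:9090
    isDefault: true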

Deploying the Ingress object

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: grafana.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
  - host: prometheus-0.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-0-service
          servicePort: 8080
  - host: prometheus-1.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-1-service
          servicePort: 8080
  - host: prometheus-2.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-2-service
          servicePort: 8080
  - host: alertmanager.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: alertmanager
          servicePort: 9093
  - host: thanos-querier.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: thanos-querier
          servicePort: 9090
  - host: thanos-ruler.<yourdomain>.com
    http:
      paths:
      - path: /
        backend:
          serviceName: thanos-ruler
          servicePort: 9090

This is the last piece of the puzzle. It exposes all of our services outside the Kubernetes cluster and lets us access them. Make sure you replace <yourdomain> with a domain name you control, and that you point the Ingress Controller's service at that domain.

You should now be able to access Thanos Querier at http://thanos-querier.<yourdomain>.com. It looks like this:

[Screenshot: Thanos Querier web UI]

Make sure deduplication is selected.

If you click on Stores, you can see all the active endpoints discovered by the thanos-store-gateway service.

[Screenshot: store endpoints discovered via thanos-store-gateway]

You can now add Thanos Querier as a data source in Grafana and start building dashboards.

[Screenshot: adding Thanos Querier as a Grafana data source]

Kubernetes cluster monitoring dashboard


Kubernetes node monitoring dashboard


Summary

Integrating Thanos with Prometheus undoubtedly gives you the ability to scale Prometheus horizontally, and since Thanos Querier can pull metric data from other querier instances, you can effectively pull metrics across clusters and visualize them in a single dashboard.

We are also able to archive metric data in object storage, which gives our monitoring system virtually unlimited storage while serving metrics from the object storage itself. A major part of the cost of this setup comes down to the object storage (S3 or GCS), and this can be reduced further by applying suitable retention policies to the buckets.
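
As an illustration, a retention policy on the GCS side can be set with a lifecycle rule. The sketch below deletes objects older than 365 days; the age is an arbitrary example, and pruning raw blocks this way should be weighed against using Thanos's own retention and compaction settings:

# lifecycle.json
{
  "rule": [
    { "action": { "type": "Delete" }, "condition": { "age": 365 } }
  ]
}

gsutil lifecycle set lifecycle.json gs://prometheus-long-term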

Achieving all of this does, however, require quite a bit of configuration on your part. The manifests provided above have been tested in production, so feel free to give them a try.