Notes on handling a ceph pg unfound incident

While checking the Ceph cluster today, I found a PG with unfound objects, hence this post~~~

1. Check the cluster status

[root@k8snode001 ~]# ceph health detail
HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
pg 2.2b has 1 unfound objects
OSD_SCRUB_ERRORS 17 scrub errors
PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
pg 2.44 is active+clean+inconsistent, acting [14,8,21]
pg 2.73 is active+clean+inconsistent, acting [25,14,8]
pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
pg 2.83 is active+clean+inconsistent, acting [14,13,6]
pg 2.ae is active+clean+inconsistent, acting [14,3,2]
pg 2.c4 is active+clean+inconsistent, acting [8,21,14]
pg 2.da is active+clean+inconsistent, acting [23,14,15]
pg 2.fa is active+clean+inconsistent, acting [14,23,25]
PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound

From the output we can see that pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound.
Now let's query pg 2.2b and look at its detailed information.

[root@k8snode001 ~]# ceph pg dump_json pools | grep 2.2b
dumped all
2.2b       2487                  1        1         0       1  9533198403 3048     3048                active+recovery_unfound+degraded 2020-07-23 08:56:07.669903  10373'5448370  10373:7312614  [14,22,4]         14  [14,22,4]             14  10371'5437258 2020-07-23 08:56:06.637012   10371'5437258 2020-07-23 08:56:06.637012             0

From this output we can see the PG currently has one unfound object (the UNFOUND column).
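
To see exactly which object is actually missing, Ceph can list the unfound objects for a PG. A minimal check, using the standard ceph CLI (output varies by version):

[root@k8snode001 ~]# ceph pg 2.2b list_unfound

On an RBD pool like this one, the unfound object name typically looks like rbd_data.&lt;prefix&gt;.&lt;index&gt;, which can later be matched back to a specific image.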

2. Check the pg map

[root@k8snode001 ~]# ceph pg map 2.2b
osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]

The pg map shows that pg 2.2b is mapped to OSDs [14,22,4].
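
Unfound objects are usually the aftermath of OSD flaps or failures, so it is worth confirming that the three OSDs in the acting set are actually up before going further. A quick check with standard commands (output omitted here):

[root@k8snode001 ~]# ceph osd stat
[root@k8snode001 ~]# ceph osd tree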

3. Check the pool status

[root@k8snode001 ~]# ceph osd pool stats k8s-1
pool k8s-1 id 2
1/1955664 objects degraded (0.000%)
1/651888 objects unfound (0.000%)
client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr
[root@k8snode001 ~]# ceph osd pool ls detail | grep k8s-1
pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
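
Note the min_size 1 on this pool: writes are acknowledged as long as a single replica survives, which makes it easier to end up with an object version that exists on only one (possibly failed) OSD. If you just want these two fields, you can query them directly:

[root@k8snode001 ~]# ceph osd pool get k8s-1 size
[root@k8snode001 ~]# ceph osd pool get k8s-1 min_size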

4. Try to recover pg 2.2b's lost object

[root@k8snode001 ~]# ceph pg repair 2.2b
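
Repair runs asynchronously, so give it time and watch the PG's state rather than re-issuing the command. One way to keep an eye on it (assuming watch is available):

[root@k8snode001 ~]# watch -n 5 "ceph health detail | grep 2.2b"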

If the repair never succeeds, you can inspect the stuck PG's details, paying particular attention to recovery_state:

[root@k8snode001 ~]# ceph pg 2.2b query
{
    "......
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-07-21 14:17:05.855923",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "10370",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2020-07-21 14:17:04.814061"
        }
    ],
    "agent_state": {}
}
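
The key field here is might_have_unfound: it lists the OSDs that might still hold a copy of the missing object. An empty list, as above, means the primary has already exhausted every candidate source, so recovery cannot make progress on its own and we have to fall back to mark_unfound_lost. If jq is installed, you can pull just this field out of the query output:

[root@k8snode001 ~]# ceph pg 2.2b query | jq '.recovery_state[0].might_have_unfound'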

If repair cannot fix it, there are two options: revert the object to an older version, or delete it outright.

5. Solutions

Revert to the previous version:
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost revert
Delete outright:
[root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost delete
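
revert rolls the object back to the last complete version the cluster still has (not available on erasure-coded pools), while delete forgets the object entirely; on an RBD pool, a deleted data object simply reads back as zeros in the affected image. Before deleting, you may want to find out which image owns the object. A hypothetical example, assuming list_unfound reported an object named rbd_data.1f2a3b4c5d.000000000000012c, matched against each image's block_name_prefix:

[root@k8snode001 ~]# for img in $(rbd -p k8s-1 ls); do rbd -p k8s-1 info "$img" | grep -q 1f2a3b4c5d && echo "$img"; done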

6. Verification

I went with delete here. The cluster then rebuilds the PG; check again after a short wait and the PG state becomes active+clean.

[root@k8snode001 ~]# ceph pg 2.2b query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 11069,
    "up": [
        12,
        22,
        4
    ],

Check the cluster status again:

[root@k8snode001 ~]# ceph health detail
HEALTH_OK