Re: PG down & incomplete - Olivier Bonvalet

From: Olivier Bonvalet <ceph.list-PaEMFeTk6C1QFI55V6+gNQ@public.gmane.org>
To: John Wilkins <john.wilkins-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
Cc: ceph-devel <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	ceph-users <ceph-users-Qp0mS5GaXlQ@public.gmane.org>
Subject: Re: PG down & incomplete
Date: Fri, 17 May 2013 10:31:18 +0200
Message-ID: <1368779478.22569.17.camel@localhost>
In-Reply-To: <CAM2gkg4znKDOp-D=z459G2MCQcGzkHrLWF_Ox8uGexZNcMUM3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi,

thanks for your answer. In fact I have several different problems, which
I tried to solve separately:

1) I lost 2 OSDs, and some pools have only 2 replicas, so some data was
lost.
2) One monitor refuses the Cuttlefish upgrade, so I only have 4 of 5
monitors running.
3) I have 4 old inconsistent PGs that I can't repair.


So the status:

   health HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors; noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
   monmap e7: 5 mons at {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0}, election epoch 2584, quorum 0,1,2,3 a,b,c,e
   osdmap e82502: 50 osds: 48 up, 48 in
    pgmap v12807617: 7824 pgs: 7803 active+clean, 1 active+clean+scrubbing, 15 incomplete, 4 active+clean+inconsistent, 1 active+clean+scrubbing+deep; 5676 GB data, 18948 GB used, 18315 GB / 37263 GB avail; 137KB/s rd, 1852KB/s wr, 199op/s
   mdsmap e1: 0/0/1 up
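
As a quick sanity check on the pgmap line above, here is a minimal Python
sketch that sums the per-state PG counts and compares them with the total.
The helper name pg_state_counts is hypothetical and only for illustration;
the input string is the pgmap line from the status with the mail line wraps
removed.

```python
import re

# pgmap line copied from the status above (mail line wraps removed).
PGMAP = (
    "pgmap v12807617: 7824 pgs: 7803 active+clean, 1 active+clean+scrubbing, "
    "15 incomplete, 4 active+clean+inconsistent, 1 active+clean+scrubbing+deep; "
    "5676 GB data, 18948 GB used, 18315 GB / 37263 GB avail"
)


def pg_state_counts(pgmap_line):
    """Return (total_pgs, {state: count}) parsed from a pgmap summary line."""
    total = int(re.search(r"(\d+) pgs:", pgmap_line).group(1))
    # The per-state list sits between "pgs:" and the first ";".
    states_part = pgmap_line.split("pgs:", 1)[1].split(";", 1)[0]
    counts = {}
    for item in states_part.split(","):
        count, state = item.split()
        counts[state] = int(count)
    return total, counts


total, counts = pg_state_counts(PGMAP)
# PGs whose state string does not contain "active" at all.
not_active = sum(n for state, n in counts.items() if "active" not in state)
print(total, sum(counts.values()), not_active)  # 7824 7824 15
```

The per-state counts add up to the 7824 PGs reported, and the only PGs that
are not active are the 15 incomplete ones.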



The tree:

# id	weight	type name	up/down	reweight
-8	14.26	root SSDroot
-27	8		datacenter SSDrbx2
-26	8			room SSDs25
-25	8				net SSD188-165-12
-24	8					rack SSD25B09
-23	8						host lyll
46	2							osd.46	up	1	
47	2							osd.47	up	1	
48	2							osd.48	up	1	
49	2							osd.49	up	1	
-10	4.26		datacenter SSDrbx3
-12	2			room SSDs43
-13	2				net SSD178-33-122
-16	2					rack SSD43S01
-17	2						host kaino
42	1							osd.42	up	1	
43	1							osd.43	up	1	
-22	2.26			room SSDs45
-21	2.26				net SSD5-135-138
-20	2.26					rack SSD45F01
-19	2.26						host taman
44	1.13							osd.44	up	1	
45	1.13							osd.45	up	1	
-9	2		datacenter SSDrbx4
-11	2			room SSDs52
-14	2				net SSD176-31-226
-15	2					rack SSD52B09
-18	2						host dragan
40	1							osd.40	up	1	
41	1							osd.41	up	1	
-1	33.43	root SASroot
-100	15.9		datacenter SASrbx1
-90	15.9			room SASs15
-72	15.9				net SAS188-165-15
-40	8					rack SAS15B01
-3	8						host brontes
0	1							osd.0	up	1	
1	1							osd.1	up	1	
2	1							osd.2	up	1	
3	1							osd.3	up	1	
4	1							osd.4	up	1	
5	1							osd.5	up	1	
6	1							osd.6	up	1	
7	1							osd.7	up	1	
-41	7.9					rack SAS15B02
-6	7.9						host alim
24	1							osd.24	up	1	
25	1							osd.25	down	0	
26	1							osd.26	up	1	
27	1							osd.27	up	1	
28	1							osd.28	up	1	
29	1							osd.29	up	1	
30	1							osd.30	up	1	
31	0.9							osd.31	up	1	
-101	17.53		datacenter SASrbx2
-91	17.53			room SASs27
-70	1.6				net SAS188-165-13
-44	0					rack SAS27B04
-7	0						host bul
-45	1.6					rack SAS27B06
-4	1.6						host okko
32	0.2							osd.32	up	1	
33	0.2							osd.33	up	1	
34	0.2							osd.34	up	1	
35	0.2							osd.35	up	1	
36	0.2							osd.36	up	1	
37	0.2							osd.37	up	1	
38	0.2							osd.38	up	1	
39	0.2							osd.39	up	1	
-71	15.93				net SAS188-165-14
-42	8					rack SAS27A03
-5	8						host noburo
8	1							osd.8	up	1	
9	1							osd.9	up	1	
18	1							osd.18	up	1	
19	1							osd.19	up	1	
20	1							osd.20	up	1	
21	1							osd.21	up	1	
22	1							osd.22	up	1	
23	1							osd.23	up	1	
-43	7.93					rack SAS27A04
-2	7.93						host keron
10	0.97							osd.10	up	1	
11	1							osd.11	up	1	
12	1							osd.12	up	1	
13	1							osd.13	up	1	
14	0.98							osd.14	up	1	
15	1							osd.15	down	0	
16	0.98							osd.16	up	1	
17	1							osd.17	up	1	
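
To pick the down OSDs out of a tree this long, a small Python sketch like
the following may help. It assumes the column layout shown above (id, weight,
then either "type name" for a bucket or "name up/down reweight" for an OSD);
the helper name down_osds is hypothetical, and the input is only an excerpt
of the tree (hosts alim and keron) with the tab indentation collapsed.

```python
# Excerpt of the "ceph osd tree" output above (hosts alim and keron only),
# with the tab indentation collapsed to single spaces.
TREE = """\
-6 7.9 host alim
24 1 osd.24 up 1
25 1 osd.25 down 0
26 1 osd.26 up 1
-2 7.93 host keron
14 0.98 osd.14 up 1
15 1 osd.15 down 0
16 0.98 osd.16 up 1
"""


def down_osds(tree_text):
    """Return a list of (osd_name, host_name) for every OSD reported down."""
    result = []
    host = None
    for line in tree_text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[2] == "host":
            # Bucket line: id, weight, type, name.
            host = fields[3]
        elif len(fields) == 5 and fields[2].startswith("osd."):
            # OSD line: id, weight, name, up/down, reweight.
            if fields[3] == "down":
                result.append((fields[2], host))
    return result


down = down_osds(TREE)
print(down)  # [('osd.25', 'alim'), ('osd.15', 'keron')]
```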


Here I have 2 roots: SSDroot and SASroot. All my OSD/PG problems are on
the SAS branch, and my CRUSH rules use per-"net" replication.

osd.15 has had a failing disk for a long time; its data was correctly
moved (the OSD was out until the cluster reached HEALTH_OK).
osd.25 is a buggy OSD that I can't remove or change: if I rebalance
its PGs onto other OSDs, those other OSDs crash. That problem occurred
before I lost osd.19: the OSD was unable to mark that PG as
inconsistent since it was crashing during scrub. I believe all
inconsistencies come from this OSD.
osd.19 was a failing disk, which I replaced.


And the health detail:

HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors; noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
pg 4.5c is stuck inactive since forever, current state incomplete, last acting [19,30]
pg 8.71d is stuck inactive since forever, current state incomplete, last acting [24,19]
pg 8.3fa is stuck inactive since forever, current state incomplete, last acting [19,31]
pg 8.3e0 is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.56c is stuck inactive since forever, current state incomplete, last acting [19,28]
pg 8.19f is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.792 is stuck inactive since forever, current state incomplete, last acting [19,28]
pg 4.0 is stuck inactive since forever, current state incomplete, last acting [28,19]
pg 8.78a is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.23e is stuck inactive since forever, current state incomplete, last acting [32,13]
pg 8.2ff is stuck inactive since forever, current state incomplete, last acting [6,19]
pg 8.5e2 is stuck inactive since forever, current state incomplete, last acting [0,19]
pg 8.528 is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.20f is stuck inactive since forever, current state incomplete, last acting [31,19]
pg 8.372 is stuck inactive since forever, current state incomplete, last acting [19,24]
pg 4.5c is stuck unclean since forever, current state incomplete, last acting [19,30]
pg 8.71d is stuck unclean since forever, current state incomplete, last acting [24,19]
pg 8.3fa is stuck unclean since forever, current state incomplete, last acting [19,31]
pg 8.3e0 is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.56c is stuck unclean since forever, current state incomplete, last acting [19,28]
pg 8.19f is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.792 is stuck unclean since forever, current state incomplete, last acting [19,28]
pg 4.0 is stuck unclean since forever, current state incomplete, last acting [28,19]
pg 8.78a is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.23e is stuck unclean since forever, current state incomplete, last acting [32,13]
pg 8.2ff is stuck unclean since forever, current state incomplete, last acting [6,19]
pg 8.5e2 is stuck unclean since forever, current state incomplete, last acting [0,19]
pg 8.528 is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.20f is stuck unclean since forever, current state incomplete, last acting [31,19]
pg 8.372 is stuck unclean since forever, current state incomplete, last acting [19,24]
pg 8.792 is incomplete, acting [19,28]
pg 8.78a is incomplete, acting [31,19]
pg 8.71d is incomplete, acting [24,19]
pg 8.5e2 is incomplete, acting [0,19]
pg 8.56c is incomplete, acting [19,28]
pg 8.528 is incomplete, acting [31,19]
pg 8.3fa is incomplete, acting [19,31]
pg 8.3e0 is incomplete, acting [31,19]
pg 8.372 is incomplete, acting [19,24]
pg 8.2ff is incomplete, acting [6,19]
pg 8.23e is incomplete, acting [32,13]
pg 8.20f is incomplete, acting [31,19]
pg 8.19f is incomplete, acting [31,19]
pg 3.7c is active+clean+inconsistent, acting [24,13,39]
pg 3.6b is active+clean+inconsistent, acting [28,23,5]
pg 4.5c is incomplete, acting [19,30]
pg 3.d is active+clean+inconsistent, acting [29,4,11]
pg 4.0 is incomplete, acting [28,19]
pg 3.1 is active+clean+inconsistent, acting [28,19,5]
osd.10 is near full at 85%
19 scrub errors
noout flag(s) set
mon.d (rank 4) addr 10.0.0.6:6789/0 is down (out of quorum)


Pools 4 and 8 have only 2 replicas, and pool 3 has 3 replicas but
inconsistent data.
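
The per-pool picture can be cross-checked from the health detail with a
short Python sketch. It groups the "pg ... is ..., acting [...]" lines by
pool and state, and lists the incomplete PGs whose acting set does not
contain osd.19. The regular expression is my own assumption about the line
format, based only on the output pasted above.

```python
import re
from collections import Counter

# The "pg <id> is <state>, acting [...]" lines from the health detail above.
DETAIL = """\
pg 8.792 is incomplete, acting [19,28]
pg 8.78a is incomplete, acting [31,19]
pg 8.71d is incomplete, acting [24,19]
pg 8.5e2 is incomplete, acting [0,19]
pg 8.56c is incomplete, acting [19,28]
pg 8.528 is incomplete, acting [31,19]
pg 8.3fa is incomplete, acting [19,31]
pg 8.3e0 is incomplete, acting [31,19]
pg 8.372 is incomplete, acting [19,24]
pg 8.2ff is incomplete, acting [6,19]
pg 8.23e is incomplete, acting [32,13]
pg 8.20f is incomplete, acting [31,19]
pg 8.19f is incomplete, acting [31,19]
pg 3.7c is active+clean+inconsistent, acting [24,13,39]
pg 3.6b is active+clean+inconsistent, acting [28,23,5]
pg 4.5c is incomplete, acting [19,30]
pg 3.d is active+clean+inconsistent, acting [29,4,11]
pg 4.0 is incomplete, acting [28,19]
pg 3.1 is active+clean+inconsistent, acting [28,19,5]
"""

# Assumed line format: pool id, PG suffix, state, comma-separated acting set.
LINE = re.compile(r"pg (\d+)\.(\w+) is ([^,]+), acting \[([\d,]+)\]")

by_pool = Counter()   # (pool, state) -> number of PGs
without_osd19 = []    # incomplete PGs that do not list osd.19

for line in DETAIL.splitlines():
    match = LINE.match(line)
    if not match:
        continue
    pool, suffix, state, acting = match.groups()
    by_pool[(pool, state)] += 1
    osds = [int(osd) for osd in acting.split(",")]
    if state == "incomplete" and 19 not in osds:
        without_osd19.append(f"{pool}.{suffix}")

for (pool, state), count in sorted(by_pool.items()):
    print(f"pool {pool}: {count} x {state}")
print("incomplete without osd.19:", without_osd19)
```

On this input, 14 of the 15 incomplete PGs list osd.19 in their acting set;
the exception is 8.23e, acting [32,13].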

Thanks in advance.

On Friday, 17 May 2013 at 00:14 -0700, John Wilkins wrote:
> If you can follow the documentation here:
> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/  and
> http://ceph.com/docs/master/rados/troubleshooting/  to provide some
> additional information, we may be better able to help you.
> 
> For example, "ceph osd tree" would help us understand the status of
> your cluster a bit better.
> 
> On Thu, May 16, 2013 at 10:32 PM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> > On Wednesday, 15 May 2013 at 00:15 +0200, Olivier Bonvalet wrote:
> >> Hi,
> >>
> >> I have some PGs in state down and/or incomplete on my cluster, because I
> >> lost 2 OSDs and a pool had only 2 replicas. So of course that
> >> data is lost.
> >>
> >> My problem now is that I can't get back to a "HEALTH_OK" status: if I try
> >> to remove, read or overwrite the corresponding RBD images, nearly all OSDs
> >> hang (well... they don't do anything and requests stay in a growing
> >> queue, until production goes down).
> >>
> >> So, what can I do to remove those corrupt images?
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> > Bump. Can nobody help me with this problem?
> >
> > Thanks,
> >
> > Olivier
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> John Wilkins
> Senior Technical Writer
> Inktank
> john.wilkins@inktank.com
> (415) 425-9599
> http://inktank.com
> 


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
