Re: [ceph-users] PG down & incomplete

From: Olivier Bonvalet <ceph.list@daevel.fr>
To: John Wilkins <john.wilkins@inktank.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>,
	ceph-users <ceph-users@ceph.com>
Subject: Re: [ceph-users] PG down & incomplete
Date: Fri, 17 May 2013 23:33:03 +0200
Message-ID: <1368826383.22569.66.camel@localhost>
In-Reply-To: <CAM2gkg51tSPiH_kesF=xQi3t6oET+f0D1XgxyE_6UOZei6PO3Q@mail.gmail.com>

Yes, I set the "noout" flag to prevent osd.25 from being rebalanced
automatically: rebalancing it crashes all the OSDs on this host (I have
already tried several times).

On Friday, 17 May 2013 at 11:27 -0700, John Wilkins wrote:
> It looks like you have the "noout" flag set:
> 
> "noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
>    monmap e7: 5 mons at {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0}, election epoch 2584, quorum 0,1,2,3 a,b,c,e
>    osdmap e82502: 50 osds: 48 up, 48 in"
> 
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#stopping-w-out-rebalancing
> 
> If you have down OSDs that don't get marked out, that would certainly
> cause problems. Have you tried restarting the failed OSDs?
> 
> What do the logs look like for osd.15 and osd.25?
> 
> On Fri, May 17, 2013 at 1:31 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> > Hi,
> >
> > thanks for your answer. In fact I have several different problems, which
> > I tried to solve separately:
> >
> > 1) I lost 2 OSDs, and some pools have only 2 replicas, so some data was
> > lost.
> > 2) One monitor refuses the Cuttlefish upgrade, so I only have 4 of 5
> > monitors running.
> > 3) I have 4 old inconsistent PGs that I can't repair.
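
For item 2, the monitor arithmetic can be checked with a short sketch. It assumes only that Ceph monitors need a strict majority of the monitors in the monmap to form a quorum; the function name `quorum_size` is illustrative and not part of any Ceph tool.

```python
def quorum_size(n_monitors: int) -> int:
    """Smallest strict majority of n_monitors."""
    return n_monitors // 2 + 1


monitors_total = 5    # monitors defined in the cluster (item 2 above)
monitors_running = 4  # one monitor refuses the upgrade and is down

needed = quorum_size(monitors_total)
has_quorum = monitors_running >= needed
# Number of additional monitor failures tolerated before quorum is lost.
spare = monitors_running - needed

print(f"need {needed}, have {monitors_running}, quorum={has_quorum}, spare={spare}")
```

Under that assumption, 4 running monitors out of 5 still form a quorum, with room for one more failure.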
> >
> >
> > So the status:
> >
> >    health HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors; noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
> >    monmap e7: 5 mons at {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0}, election epoch 2584, quorum 0,1,2,3 a,b,c,e
> >    osdmap e82502: 50 osds: 48 up, 48 in
> >     pgmap v12807617: 7824 pgs: 7803 active+clean, 1 active+clean+scrubbing, 15 incomplete, 4 active+clean+inconsistent, 1 active+clean+scrubbing+deep; 5676 GB data, 18948 GB used, 18315 GB / 37263 GB avail; 137KB/s rd, 1852KB/s wr, 199op/s
> >    mdsmap e1: 0/0/1 up
> >
> >
> >
> > The tree:
> >
> > # id    weight  type name       up/down reweight
> > -8      14.26   root SSDroot
> > -27     8               datacenter SSDrbx2
> > -26     8                       room SSDs25
> > -25     8                               net SSD188-165-12
> > -24     8                                       rack SSD25B09
> > -23     8                                               host lyll
> > 46      2                                                       osd.46  up      1
> > 47      2                                                       osd.47  up      1
> > 48      2                                                       osd.48  up      1
> > 49      2                                                       osd.49  up      1
> > -10     4.26            datacenter SSDrbx3
> > -12     2                       room SSDs43
> > -13     2                               net SSD178-33-122
> > -16     2                                       rack SSD43S01
> > -17     2                                               host kaino
> > 42      1                                                       osd.42  up      1
> > 43      1                                                       osd.43  up      1
> > -22     2.26                    room SSDs45
> > -21     2.26                            net SSD5-135-138
> > -20     2.26                                    rack SSD45F01
> > -19     2.26                                            host taman
> > 44      1.13                                                    osd.44  up      1
> > 45      1.13                                                    osd.45  up      1
> > -9      2               datacenter SSDrbx4
> > -11     2                       room SSDs52
> > -14     2                               net SSD176-31-226
> > -15     2                                       rack SSD52B09
> > -18     2                                               host dragan
> > 40      1                                                       osd.40  up      1
> > 41      1                                                       osd.41  up      1
> > -1      33.43   root SASroot
> > -100    15.9            datacenter SASrbx1
> > -90     15.9                    room SASs15
> > -72     15.9                            net SAS188-165-15
> > -40     8                                       rack SAS15B01
> > -3      8                                               host brontes
> > 0       1                                                       osd.0   up      1
> > 1       1                                                       osd.1   up      1
> > 2       1                                                       osd.2   up      1
> > 3       1                                                       osd.3   up      1
> > 4       1                                                       osd.4   up      1
> > 5       1                                                       osd.5   up      1
> > 6       1                                                       osd.6   up      1
> > 7       1                                                       osd.7   up      1
> > -41     7.9                                     rack SAS15B02
> > -6      7.9                                             host alim
> > 24      1                                                       osd.24  up      1
> > 25      1                                                       osd.25  down    0
> > 26      1                                                       osd.26  up      1
> > 27      1                                                       osd.27  up      1
> > 28      1                                                       osd.28  up      1
> > 29      1                                                       osd.29  up      1
> > 30      1                                                       osd.30  up      1
> > 31      0.9                                                     osd.31  up      1
> > -101    17.53           datacenter SASrbx2
> > -91     17.53                   room SASs27
> > -70     1.6                             net SAS188-165-13
> > -44     0                                       rack SAS27B04
> > -7      0                                               host bul
> > -45     1.6                                     rack SAS27B06
> > -4      1.6                                             host okko
> > 32      0.2                                                     osd.32  up      1
> > 33      0.2                                                     osd.33  up      1
> > 34      0.2                                                     osd.34  up      1
> > 35      0.2                                                     osd.35  up      1
> > 36      0.2                                                     osd.36  up      1
> > 37      0.2                                                     osd.37  up      1
> > 38      0.2                                                     osd.38  up      1
> > 39      0.2                                                     osd.39  up      1
> > -71     15.93                           net SAS188-165-14
> > -42     8                                       rack SAS27A03
> > -5      8                                               host noburo
> > 8       1                                                       osd.8   up      1
> > 9       1                                                       osd.9   up      1
> > 18      1                                                       osd.18  up      1
> > 19      1                                                       osd.19  up      1
> > 20      1                                                       osd.20  up      1
> > 21      1                                                       osd.21  up      1
> > 22      1                                                       osd.22  up      1
> > 23      1                                                       osd.23  up      1
> > -43     7.93                                    rack SAS27A04
> > -2      7.93                                            host keron
> > 10      0.97                                                    osd.10  up      1
> > 11      1                                                       osd.11  up      1
> > 12      1                                                       osd.12  up      1
> > 13      1                                                       osd.13  up      1
> > 14      0.98                                                    osd.14  up      1
> > 15      1                                                       osd.15  down    0
> > 16      0.98                                                    osd.16  up      1
> > 17      1                                                       osd.17  up      1
> >
> >
> > Here I have 2 roots: SSDroot and SASroot. All my OSD/PG problems are on
> > the SAS branch, and my CRUSH rules replicate across "net" buckets.
> >
> > osd.15 has had a failing disk for a long time; its data was correctly
> > moved (the OSD stayed out until the cluster reached HEALTH_OK).
> > osd.25 is a buggy OSD that I can't remove or replace: if I rebalance
> > its PGs onto other OSDs, those other OSDs crash. That problem occurred
> > before I lost osd.19: the OSD was unable to mark that PG as
> > inconsistent since it was crashing during scrub. In my view, all the
> > inconsistencies come from this OSD.
> > osd.19 was a failing disk, which I replaced.
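
The down OSDs and their hosts can be pulled out of the `ceph osd tree` listing above with a few lines of Python. This is a minimal sketch that assumes the whitespace-separated column layout shown in this message (id, weight, then either `type name` or `osd.N up/down reweight`); the function name `down_osds_by_host` is illustrative, and the sample is a shortened excerpt of the tree.

```python
# Shortened excerpt of the `ceph osd tree` output quoted above.
TREE_EXCERPT = """\
-41     7.9     rack SAS15B02
-6      7.9     host alim
24      1       osd.24  up      1
25      1       osd.25  down    0
26      1       osd.26  up      1
-43     7.93    rack SAS27A04
-2      7.93    host keron
14      0.98    osd.14  up      1
15      1       osd.15  down    0
16      0.98    osd.16  up      1
"""


def down_osds_by_host(tree_text: str) -> dict:
    """Map each host name to the list of its OSDs reported as down."""
    current_host = None
    result = {}
    for line in tree_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "host":
            # Bucket line: id, weight, type, name.
            current_host = fields[3]
        elif len(fields) >= 5 and fields[2].startswith("osd.") and fields[3] == "down":
            # OSD line: id, weight, osd.N, up/down, reweight.
            result.setdefault(current_host, []).append(fields[2])
    return result


down = down_osds_by_host(TREE_EXCERPT)
for host, osds in down.items():
    print(host, osds)
```

On this excerpt the sketch reports osd.25 on host alim and osd.15 on host keron, matching the two OSDs shown as down in the tree.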
> >
> >
> > And the health detail:
> >
> > HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors; noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
> > pg 4.5c is stuck inactive since forever, current state incomplete, last acting [19,30]
> > pg 8.71d is stuck inactive since forever, current state incomplete, last acting [24,19]
> > pg 8.3fa is stuck inactive since forever, current state incomplete, last acting [19,31]
> > pg 8.3e0 is stuck inactive since forever, current state incomplete, last acting [31,19]
> > pg 8.56c is stuck inactive since forever, current state incomplete, last acting [19,28]
> > pg 8.19f is stuck inactive since forever, current state incomplete, last acting [31,19]
> > pg 8.792 is stuck inactive since forever, current state incomplete, last acting [19,28]
> > pg 4.0 is stuck inactive since forever, current state incomplete, last acting [28,19]
> > pg 8.78a is stuck inactive since forever, current state incomplete, last acting [31,19]
> > pg 8.23e is stuck inactive since forever, current state incomplete, last acting [32,13]
> > pg 8.2ff is stuck inactive since forever, current state incomplete, last acting [6,19]
> > pg 8.5e2 is stuck inactive since forever, current state incomplete, last acting [0,19]
> > pg 8.528 is stuck inactive since forever, current state incomplete, last acting [31,19]
> > pg 8.20f is stuck inactive since forever, current state incomplete, last acting [31,19]
> > pg 8.372 is stuck inactive since forever, current state incomplete, last acting [19,24]
> > pg 4.5c is stuck unclean since forever, current state incomplete, last acting [19,30]
> > pg 8.71d is stuck unclean since forever, current state incomplete, last acting [24,19]
> > pg 8.3fa is stuck unclean since forever, current state incomplete, last acting [19,31]
> > pg 8.3e0 is stuck unclean since forever, current state incomplete, last acting [31,19]
> > pg 8.56c is stuck unclean since forever, current state incomplete, last acting [19,28]
> > pg 8.19f is stuck unclean since forever, current state incomplete, last acting [31,19]
> > pg 8.792 is stuck unclean since forever, current state incomplete, last acting [19,28]
> > pg 4.0 is stuck unclean since forever, current state incomplete, last acting [28,19]
> > pg 8.78a is stuck unclean since forever, current state incomplete, last acting [31,19]
> > pg 8.23e is stuck unclean since forever, current state incomplete, last acting [32,13]
> > pg 8.2ff is stuck unclean since forever, current state incomplete, last acting [6,19]
> > pg 8.5e2 is stuck unclean since forever, current state incomplete, last acting [0,19]
> > pg 8.528 is stuck unclean since forever, current state incomplete, last acting [31,19]
> > pg 8.20f is stuck unclean since forever, current state incomplete, last acting [31,19]
> > pg 8.372 is stuck unclean since forever, current state incomplete, last acting [19,24]
> > pg 8.792 is incomplete, acting [19,28]
> > pg 8.78a is incomplete, acting [31,19]
> > pg 8.71d is incomplete, acting [24,19]
> > pg 8.5e2 is incomplete, acting [0,19]
> > pg 8.56c is incomplete, acting [19,28]
> > pg 8.528 is incomplete, acting [31,19]
> > pg 8.3fa is incomplete, acting [19,31]
> > pg 8.3e0 is incomplete, acting [31,19]
> > pg 8.372 is incomplete, acting [19,24]
> > pg 8.2ff is incomplete, acting [6,19]
> > pg 8.23e is incomplete, acting [32,13]
> > pg 8.20f is incomplete, acting [31,19]
> > pg 8.19f is incomplete, acting [31,19]
> > pg 3.7c is active+clean+inconsistent, acting [24,13,39]
> > pg 3.6b is active+clean+inconsistent, acting [28,23,5]
> > pg 4.5c is incomplete, acting [19,30]
> > pg 3.d is active+clean+inconsistent, acting [29,4,11]
> > pg 4.0 is incomplete, acting [28,19]
> > pg 3.1 is active+clean+inconsistent, acting [28,19,5]
> > osd.10 is near full at 85%
> > 19 scrub errors
> > noout flag(s) set
> > mon.d (rank 4) addr 10.0.0.6:6789/0 is down (out of quorum)
> >
> >
> > Pools 4 and 8 have only 2 replicas, and pool 3 has 3 replicas but
> > inconsistent data.
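
To see which OSDs the incomplete PGs have in common, the per-PG lines of the health detail above can be tallied. The following is a minimal sketch in Python, assuming the `pg <id> is <state>, acting [...]` line format shown in this message; the names `LINE_RE` and `tally_acting` are illustrative, and the sample data is copied from the output above.

```python
import re
from collections import Counter

# Per-PG lines copied from the `ceph health detail` output quoted above.
HEALTH_DETAIL = """\
pg 8.792 is incomplete, acting [19,28]
pg 8.78a is incomplete, acting [31,19]
pg 8.71d is incomplete, acting [24,19]
pg 8.5e2 is incomplete, acting [0,19]
pg 8.56c is incomplete, acting [19,28]
pg 8.528 is incomplete, acting [31,19]
pg 8.3fa is incomplete, acting [19,31]
pg 8.3e0 is incomplete, acting [31,19]
pg 8.372 is incomplete, acting [19,24]
pg 8.2ff is incomplete, acting [6,19]
pg 8.23e is incomplete, acting [32,13]
pg 8.20f is incomplete, acting [31,19]
pg 8.19f is incomplete, acting [31,19]
pg 3.7c is active+clean+inconsistent, acting [24,13,39]
pg 3.6b is active+clean+inconsistent, acting [28,23,5]
pg 4.5c is incomplete, acting [19,30]
pg 3.d is active+clean+inconsistent, acting [29,4,11]
pg 4.0 is incomplete, acting [28,19]
pg 3.1 is active+clean+inconsistent, acting [28,19,5]
"""

LINE_RE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]+)\]$")


def tally_acting(text: str, state: str = "incomplete") -> Counter:
    """Count how often each OSD id appears in the acting set of PGs in `state`."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == state:
            counts.update(int(osd) for osd in match.group(3).split(","))
    return counts


incomplete = tally_acting(HEALTH_DETAIL)
for osd, count in incomplete.most_common():
    print(f"osd.{osd}: {count}")
```

On this data osd.19 appears in the acting set of 14 of the 15 incomplete PGs; the exception is pg 8.23e, whose acting set is [32,13].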
> >
> > Thanks in advance.
> >
> > On Friday, 17 May 2013 at 00:14 -0700, John Wilkins wrote:
> >> If you can follow the documentation here:
> >> http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/  and
> >> http://ceph.com/docs/master/rados/troubleshooting/  to provide some
> >> additional information, we may be better able to help you.
> >>
> >> For example, "ceph osd tree" would help us understand the status of
> >> your cluster a bit better.
> >>
> >> On Thu, May 16, 2013 at 10:32 PM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> >> > On Wednesday, 15 May 2013 at 00:15 +0200, Olivier Bonvalet wrote:
> >> >> Hi,
> >> >>
> >> >> I have some PGs in state down and/or incomplete on my cluster, because I
> >> >> lost 2 OSDs and a pool had only 2 replicas. So of course that
> >> >> data is lost.
> >> >>
> >> >> My problem now is that I can't get back to a "HEALTH_OK" status: if I try
> >> >> to remove, read or overwrite the corresponding RBD images, nearly all OSDs
> >> >> hang (well... they don't do anything and requests stay in a growing
> >> >> queue, until production goes down).
> >> >>
> >> >> So, what can I do to remove those corrupt images?
> >> >>
> >> >> _______________________________________________
> >> >> ceph-users mailing list
> >> >> ceph-users@lists.ceph.com
> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>
> >> >
> >> > Bump. Can nobody help me with this problem?
> >> >
> >> > Thanks,
> >> >
> >> > Olivier
> >> >
> >>
> >>
> >>
> >> --
> >> John Wilkins
> >> Senior Technical Writer
> >> Inktank
> >> john.wilkins@inktank.com
> >> (415) 425-9599
> >> http://inktank.com
> >>
> >
> >
> 
> 
> 
> -- 
> John Wilkins
> Senior Technical Writer
> Inktank
> john.wilkins@inktank.com
> (415) 425-9599
> http://inktank.com
> 


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
