pg stuck in peering

* pg stuck in peering
@ 2013-02-21 10:09 Gandalf Corvotempesta
  2013-02-21 10:26 ` Martin B Nielsen
  2013-02-21 16:46 ` Sage Weil
  0 siblings, 2 replies; 7+ messages in thread
From: Gandalf Corvotempesta @ 2013-02-21 10:09 UTC (permalink / raw)
  To: ceph-devel

Hi,
I have some trouble on a test cluster.
Many PGs have been stuck in the "peering" state since yesterday:

pg 1.2 is peering, acting [3,4]
pg 0.3 is peering, acting [3,4]
pg 2.1 is peering, acting [3,4]
pg 3.0 is peering, acting [3,4]
pg 1.3 is peering, acting [1,4]
pg 0.2 is peering, acting [3,4]
pg 2.0 is peering, acting [3,4]
pg 3.1 is peering, acting [1,4]
pg 4.6 is down+peering, acting [4,0]
pg 5.7 is peering, acting [3,1]
pg 6.4 is down+peering, acting [4,0]
pg 7.5 is peering, acting [3,1]
pg 0.1 is active+degraded, acting [4,0]
pg 2.3 is peering, acting [3,4]
pg 3.2 is peering, acting [3,4]
pg 4.5 is down+peering, acting [2,4]
pg 5.4 is down+peering, acting [2,4]
pg 6.7 is peering, acting [0,4]
pg 7.6 is peering, acting [0,4]
pg 1.1 is peering, acting [3,4]
pg 0.0 is active+degraded, acting [4,0]
pg 2.2 is peering, acting [1,4]
pg 3.3 is peering, acting [2,4]
pg 4.4 is peering, acting [1,4]
pg 5.5 is down+peering, acting [4,0]
pg 6.6 is peering, acting [3,1]
pg 7.7 is down+peering, acting [2,4]

HEALTH_WARN 332 pgs degraded; 316 pgs down; 316 pgs peering; 93 pgs
stale; 316 pgs stuck inactive; 93 pgs stuck stale; 764 pgs stuck
unclean



Any advice? I think the unhealthy cluster is preventing me from running
RGW, due to some timeout.


* Re: pg stuck in peering
  2013-02-21 10:09 pg stuck in peering Gandalf Corvotempesta
@ 2013-02-21 10:26 ` Martin B Nielsen
  2013-02-21 10:43   ` Gandalf Corvotempesta
  2013-02-21 16:46 ` Sage Weil
  1 sibling, 1 reply; 7+ messages in thread
From: Martin B Nielsen @ 2013-02-21 10:26 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel

I think this one would be a better fit for the ceph-users mailing list.

Can you supply more information? Some of your PGs are down, which
suggests that one or more OSDs are down. Can you post the output of
'ceph osd tree' and verify that all your OSDs are actually running?
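
As a rough aid while waiting for the 'ceph osd tree' output, the PG lines from the original post can be tallied by OSD. The sketch below is only an illustration: the function name and the sample list are made up for this example, and the regular expression assumes the exact "pg X is STATE, acting [a,b]" layout shown in the post. Note that an acting set lists the OSDs a PG currently maps to, so the tally shows which OSDs the stuck PGs share; it does not by itself identify the OSD that is down.

```python
import re
from collections import Counter

# Matches lines such as: "pg 4.6 is down+peering, acting [4,0]"
PG_LINE = re.compile(r"^pg (\S+) is (\S+), acting \[([\d,]*)\]$")


def tally_acting_osds(lines, state_filter="peering"):
    """Count how often each OSD id appears in the acting set of PGs
    whose '+'-separated state string includes state_filter."""
    counts = Counter()
    for line in lines:
        match = PG_LINE.match(line.strip())
        if not match:
            continue  # skip anything else, e.g. the HEALTH_WARN summary
        _pgid, state, acting = match.groups()
        if state_filter in state.split("+"):
            counts.update(int(osd) for osd in acting.split(",") if osd)
    return counts


# A few lines copied from the original post, used as sample input.
sample = [
    "pg 1.2 is peering, acting [3,4]",
    "pg 1.3 is peering, acting [1,4]",
    "pg 4.6 is down+peering, acting [4,0]",
    "pg 0.1 is active+degraded, acting [4,0]",
    "pg 4.5 is down+peering, acting [2,4]",
]

if __name__ == "__main__":
    for osd, n in sorted(tally_acting_osds(sample).items()):
        print(f"osd.{osd}: {n} peering PGs")
```

Passing state_filter="down" restricts the tally to the PGs reported as down.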

/Martin

On Thu, Feb 21, 2013 at 11:09 AM, Gandalf Corvotempesta
<gandalf.corvotempesta@gmail.com> wrote:
> Hi,
> I have some trouble on a test cluster.
> Many PGs have been stuck in the "peering" state since yesterday:
>
> pg 1.2 is peering, acting [3,4]
> pg 0.3 is peering, acting [3,4]
> pg 2.1 is peering, acting [3,4]
> pg 3.0 is peering, acting [3,4]
> pg 1.3 is peering, acting [1,4]
> pg 0.2 is peering, acting [3,4]
> pg 2.0 is peering, acting [3,4]
> pg 3.1 is peering, acting [1,4]
> pg 4.6 is down+peering, acting [4,0]
> pg 5.7 is peering, acting [3,1]
> pg 6.4 is down+peering, acting [4,0]
> pg 7.5 is peering, acting [3,1]
> pg 0.1 is active+degraded, acting [4,0]
> pg 2.3 is peering, acting [3,4]
> pg 3.2 is peering, acting [3,4]
> pg 4.5 is down+peering, acting [2,4]
> pg 5.4 is down+peering, acting [2,4]
> pg 6.7 is peering, acting [0,4]
> pg 7.6 is peering, acting [0,4]
> pg 1.1 is peering, acting [3,4]
> pg 0.0 is active+degraded, acting [4,0]
> pg 2.2 is peering, acting [1,4]
> pg 3.3 is peering, acting [2,4]
> pg 4.4 is peering, acting [1,4]
> pg 5.5 is down+peering, acting [4,0]
> pg 6.6 is peering, acting [3,1]
> pg 7.7 is down+peering, acting [2,4]
>
> HEALTH_WARN 332 pgs degraded; 316 pgs down; 316 pgs peering; 93 pgs
> stale; 316 pgs stuck inactive; 93 pgs stuck stale; 764 pgs stuck
> unclean
>
>
>
> Any advice? I think the unhealthy cluster is preventing me from running
> RGW, due to some timeout.


* Re: pg stuck in peering
  2013-02-21 10:26 ` Martin B Nielsen
@ 2013-02-21 10:43   ` Gandalf Corvotempesta
  2013-02-21 11:18     ` Wolfgang Hennerbichler
  0 siblings, 1 reply; 7+ messages in thread
From: Gandalf Corvotempesta @ 2013-02-21 10:43 UTC (permalink / raw)
  To: Martin B Nielsen; +Cc: ceph-devel

2013/2/21 Martin B Nielsen <martin@unity3d.com>:
> Can you supply more information? Some of your PGs are down, which
> suggests that one or more OSDs are down. Can you post the output of
> 'ceph osd tree' and verify that all your OSDs are actually running?

I have one OSD down, but I'm unable to start it.
Running "service ceph restart" on the node doesn't restart anything.

Running "service ceph -a restart" from the main node restarts all
services, but that OSD stays down and nothing is logged.


* Re: pg stuck in peering
  2013-02-21 10:43   ` Gandalf Corvotempesta
@ 2013-02-21 11:18     ` Wolfgang Hennerbichler
  2013-02-21 11:32       ` Gandalf Corvotempesta
  0 siblings, 1 reply; 7+ messages in thread
From: Wolfgang Hennerbichler @ 2013-02-21 11:18 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel

On 02/21/2013 11:43 AM, Gandalf Corvotempesta wrote:

> I have one OSD down, but I'm unable to start it.
> Running "service ceph restart" on the node doesn't restart anything.
> 
> Running "service ceph -a restart" from the main node restarts all
> services, but that OSD stays down and nothing is logged.

Try 'service ceph -a restart osd.X' (where X is the OSD number); maybe
that helps. Also check the kernel logs: is the OSD directory mounted?
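
The mount check can be scripted. The following is a minimal sketch, assuming the OSD data directories follow the /var/lib/ceph/osd/<cluster>-<id> layout and that each one sits on its own filesystem; neither is confirmed in this thread, and the real location is whatever 'osd data' is set to in ceph.conf. The function name and its parameters are hypothetical.

```python
import os


def unmounted_osd_dirs(osd_ids, base="/var/lib/ceph/osd", cluster="ceph",
                       ismount=os.path.ismount):
    """Return the assumed OSD data directories that are not mount points."""
    missing = []
    for osd_id in osd_ids:
        # Assumed layout: <base>/<cluster>-<id>; adjust to your 'osd data' setting.
        path = os.path.join(base, f"{cluster}-{osd_id}")
        if not ismount(path):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # OSD ids 0-4 are the ones that appear in the acting sets quoted earlier.
    for path in unmounted_osd_dirs(range(5)):
        print(f"not a mount point: {path}")
```

The ismount parameter is there only so the check can be exercised without touching a real host; run on an OSD node, the default os.path.ismount is used.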


* Re: pg stuck in peering
  2013-02-21 11:18     ` Wolfgang Hennerbichler
@ 2013-02-21 11:32       ` Gandalf Corvotempesta
  2013-02-21 12:23         ` Wolfgang Hennerbichler
  0 siblings, 1 reply; 7+ messages in thread
From: Gandalf Corvotempesta @ 2013-02-21 11:32 UTC (permalink / raw)
  To: Wolfgang Hennerbichler; +Cc: ceph-devel

2013/2/21 Wolfgang Hennerbichler <wolfgang.hennerbichler@risc-software.at>:
> Try 'service ceph -a restart osd.X' (where X is the OSD number); maybe
> that helps. Also check the kernel logs: is the OSD directory mounted?

All dirs are mounted.
I have to restart manually many times to get all OSDs up. It seems to be random.

If ceph status is not HEALTH_OK, radosgw doesn't start at all. Is this expected?


* Re: pg stuck in peering
  2013-02-21 11:32       ` Gandalf Corvotempesta
@ 2013-02-21 12:23         ` Wolfgang Hennerbichler
  0 siblings, 0 replies; 7+ messages in thread
From: Wolfgang Hennerbichler @ 2013-02-21 12:23 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel



On 02/21/2013 12:32 PM, Gandalf Corvotempesta wrote:

> All dirs are mounted.
> I have to restart manually many times to get all OSDs up. It seems to be random.

This should not happen. Is the network connection unstable? Have you checked
'tail -f /var/log/syslog' on your Ceph hosts?

> If ceph status is not HEALTH_OK, radosgw doesn't start at all. Is this expected?

What does 'ceph osd tree' say?
What is the output of 'ceph health detail'?

I don't have experience with radosgw, sorry.


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbichler@risc-software.at
http://www.risc-software.at


* Re: pg stuck in peering
  2013-02-21 10:09 pg stuck in peering Gandalf Corvotempesta
  2013-02-21 10:26 ` Martin B Nielsen
@ 2013-02-21 16:46 ` Sage Weil
  1 sibling, 0 replies; 7+ messages in thread
From: Sage Weil @ 2013-02-21 16:46 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel

On Thu, 21 Feb 2013, Gandalf Corvotempesta wrote:
> Hi,
> I have some trouble on a test cluster.
> Many PGs have been stuck in the "peering" state since yesterday:
> 
> pg 1.2 is peering, acting [3,4]
> pg 0.3 is peering, acting [3,4]
> pg 2.1 is peering, acting [3,4]
> pg 3.0 is peering, acting [3,4]
> pg 1.3 is peering, acting [1,4]
> pg 0.2 is peering, acting [3,4]
> pg 2.0 is peering, acting [3,4]
> pg 3.1 is peering, acting [1,4]
> pg 4.6 is down+peering, acting [4,0]
> pg 5.7 is peering, acting [3,1]
> pg 6.4 is down+peering, acting [4,0]
> pg 7.5 is peering, acting [3,1]
> pg 0.1 is active+degraded, acting [4,0]
> pg 2.3 is peering, acting [3,4]
> pg 3.2 is peering, acting [3,4]
> pg 4.5 is down+peering, acting [2,4]
> pg 5.4 is down+peering, acting [2,4]
> pg 6.7 is peering, acting [0,4]
> pg 7.6 is peering, acting [0,4]
> pg 1.1 is peering, acting [3,4]
> pg 0.0 is active+degraded, acting [4,0]
> pg 2.2 is peering, acting [1,4]
> pg 3.3 is peering, acting [2,4]
> pg 4.4 is peering, acting [1,4]
> pg 5.5 is down+peering, acting [4,0]
> pg 6.6 is peering, acting [3,1]
> pg 7.7 is down+peering, acting [2,4]

You can run 'ceph pg <pgid> query' on the individual pgids above to see why
each one is blocked; look near the end of the JSON dump. This kind of
diagnosis is documented under the section on handling common failures. It
should tell you which specific OSD needs to be started (or possibly marked
'lost').
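
To make "look near the end of the JSON dump" concrete, here is a minimal sketch that picks the blocking information out of a query result. It assumes the dump contains a "recovery_state" list whose entries may carry "blocked" and "peering_blocked_by" keys, as in the Ceph troubleshooting documentation; the exact field names can differ between releases. The SAMPLE document is a hand-written, trimmed stand-in, not output from this cluster, and the function name is hypothetical.

```python
import json


def peering_blockers(pg_query):
    """Collect (state name, reason, OSD ids) for every recovery_state
    entry that reports being blocked."""
    blockers = []
    for entry in pg_query.get("recovery_state", []):
        if "blocked" not in entry:
            continue
        osds = [b["osd"] for b in entry.get("peering_blocked_by", [])]
        blockers.append((entry.get("name"), entry["blocked"], osds))
    return blockers


# Hypothetical, heavily trimmed stand-in for 'ceph pg <pgid> query' output.
SAMPLE = json.loads("""
{
  "state": "down+peering",
  "recovery_state": [
    {"name": "Started/Primary/Peering",
     "blocked": "peering is blocked due to down osds",
     "down_osds_we_would_probe": [1],
     "peering_blocked_by": [{"osd": 1}]},
    {"name": "Started"}
  ]
}
""")

if __name__ == "__main__":
    for name, reason, osds in peering_blockers(SAMPLE):
        print(f"{name}: {reason} (OSDs: {osds})")
```

In practice the input would be the parsed output of the query command for one of the pgids listed above, rather than SAMPLE.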

sage


> 
> HEALTH_WARN 332 pgs degraded; 316 pgs down; 316 pgs peering; 93 pgs
> stale; 316 pgs stuck inactive; 93 pgs stuck stale; 764 pgs stuck
> unclean
> 
> 
> 
> Any advice? I think the unhealthy cluster is preventing me from running
> RGW, due to some timeout.

