HEALTH_ERR when (re)starting ceph-osd's

* HEALTH_ERR when (re)starting ceph-osd's
@ 2016-01-28 10:48 Piotr.Dalek
  2016-01-28 12:37 ` Wido den Hollander
  0 siblings, 1 reply; 4+ messages in thread
From: Piotr.Dalek @ 2016-01-28 10:48 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hello,

I haven't noticed it before, but since merging https://github.com/ceph/ceph/pull/7253 I see that, when restarting daemons on a healthy Ceph cluster, it goes to HEALTH_ERR state with "$(random_number) pgs are stuck inactive for more than 300 seconds".
I looked at the commit and it turns out it will always occur on restart/boot, as booting PGs are inactive "by default" (since mons never received any sign of life from them) - not because they're actually stuck inactive.
One solution to this would be to set the pg_stat.last_* fields to the time the PG was first seen, so PGs will become stuck (mon_pg_stuck_threshold) seconds after first registering, and not right away.
Another, less invasive one is to just let the user disable this warning.

What do you think?

With best regards / Pozdrawiam
Piotr Dałek

* Re: HEALTH_ERR when (re)starting ceph-osd's
  2016-01-28 10:48 HEALTH_ERR when (re)starting ceph-osd's Piotr.Dalek
@ 2016-01-28 12:37 ` Wido den Hollander
  2016-01-28 13:05   ` Piotr.Dalek
  2016-01-28 15:25   ` Sage Weil
  0 siblings, 2 replies; 4+ messages in thread
From: Wido den Hollander @ 2016-01-28 12:37 UTC (permalink / raw)
  To: Piotr.Dalek@ts.fujitsu.com, ceph-devel@vger.kernel.org



On 28-01-16 11:48, Piotr.Dalek@ts.fujitsu.com wrote:
> Hello,
> 
> I haven't noticed it before, but since merging https://github.com/ceph/ceph/pull/7253 I see that, when restarting daemons on a healthy Ceph cluster, it goes to HEALTH_ERR state with "$(random_number) pgs are stuck inactive for more than 300 seconds".
> I looked at the commit and it turns out it will always occur on restart/boot, as booting PGs are inactive "by default" (since mons never received any sign of life from them) - not because they're actually stuck inactive.

Well, in that case, isn't the PR correct? But I see what you mean.

> One solution to this would be to set the pg_stat.last_* fields to the time the PG was first seen, so PGs will become stuck (mon_pg_stuck_threshold) seconds after first registering, and not right away.

That sounds like a good solution, you might want to take a look at:
http://tracker.ceph.com/issues/14028

> Another, less invasive one is to just let the user disable this warning.
> 

As you can see in the discussion on GitHub, we decided to set
'mon_pg_min_inactive' to 1 by default.

You can disable these warnings by setting it to zero, or make them less
sensitive by setting it to something like 10.

This is just there so that people are informed when multiple PGs are inactive.

Being in WARN state while not performing I/O is a bad thing. WARN
should be where you take a look, but aren't worried.

If I/O stops, ERR is the right state to go into.
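
The thresholding described above can be sketched roughly as follows in
Python. This is an illustration, not the monitor's code: the function name
is hypothetical, and the exact comparison (at least 'mon_pg_min_inactive'
stuck PGs raises ERR, a non-positive value never raises ERR, anything below
the threshold falls back to WARN) is an assumption based on this discussion.

```python
def health_for_stuck_inactive(num_stuck: int, min_inactive: int = 1) -> str:
    """Map a count of stuck-inactive PGs to a health state.

    num_stuck    -- number of PGs inactive longer than mon_pg_stuck_threshold
    min_inactive -- stand-in for mon_pg_min_inactive (default 1, per the thread)
    """
    if num_stuck <= 0:
        return "HEALTH_OK"
    # Assumption: a non-positive min_inactive disables the ERR escalation.
    if min_inactive > 0 and num_stuck >= min_inactive:
        return "HEALTH_ERR"
    # Assumption: stuck PGs below the threshold are still worth a warning.
    return "HEALTH_WARN"
```

With the default of 1, a single stuck PG is enough to escalate, which is
why restarting OSDs trips it so easily.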

Wido

> What do you think?
> 
> With best regards / Pozdrawiam
> Piotr Dałek

* RE: HEALTH_ERR when (re)starting ceph-osd's
  2016-01-28 12:37 ` Wido den Hollander
@ 2016-01-28 13:05   ` Piotr.Dalek
  2016-01-28 15:25   ` Sage Weil
  1 sibling, 0 replies; 4+ messages in thread
From: Piotr.Dalek @ 2016-01-28 13:05 UTC (permalink / raw)
  To: Wido den Hollander, ceph-devel@vger.kernel.org

> -----Original Message-----
> From: Wido den Hollander [mailto:wido@42on.com]
> Sent: Thursday, January 28, 2016 1:38 PM
> 
> On 28-01-16 11:48, Piotr.Dalek@ts.fujitsu.com wrote:
> > Hello,
> >
> > I haven't noticed it before, but since merging
> > https://github.com/ceph/ceph/pull/7253 I see that, when restarting
> > daemons on a healthy Ceph cluster, it goes to HEALTH_ERR state with
> > "$(random_number) pgs are stuck inactive for more than 300 seconds".
> > I looked at the commit and it turns out it will always occur on
> > restart/boot, as booting PGs are inactive "by default" (since mons never
> > received any sign of life from them) - not because they're actually stuck
> > inactive.
> 
> Well, in that case, isn't the PR correct? But I see what you mean.

Actually, the only thing wrong with this is that it reports PGs as inactive for some prolonged period of time, when it's not true.
 
> > One solution to this would be to set the pg_stat.last_* fields to the
> > time the PG was first seen, so PGs will become stuck
> > (mon_pg_stuck_threshold) seconds after first registering, and not right
> > away.
> 
> That sounds like a good solution, you might want to take a look at:
> http://tracker.ceph.com/issues/14028

I'll take a look. Maybe we could fix two issues with one PR ;-)

With best regards / Pozdrawiam
Piotr Dałek



* Re: HEALTH_ERR when (re)starting ceph-osd's
  2016-01-28 12:37 ` Wido den Hollander
  2016-01-28 13:05   ` Piotr.Dalek
@ 2016-01-28 15:25   ` Sage Weil
  1 sibling, 0 replies; 4+ messages in thread
From: Sage Weil @ 2016-01-28 15:25 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Piotr.Dalek@ts.fujitsu.com, ceph-devel@vger.kernel.org

On Thu, 28 Jan 2016, Wido den Hollander wrote:
> On 28-01-16 11:48, Piotr.Dalek@ts.fujitsu.com wrote:
> > One solution to this would be to set the pg_stat.last_* fields to the
> > time the PG was first seen, so PGs will become stuck
> > (mon_pg_stuck_threshold) seconds after first registering, and not
> > right away.
> 
> That sounds like a good solution, you might want to take a look at:
> http://tracker.ceph.com/issues/14028

I think this makes the most sense. Most of the last_* fields should
basically be inferred to be the creation stamp if they are empty.
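
As a minimal Python sketch of that inference (not the actual monitor code):
the PgStat class, its 'created' and 'last_active' fields, and the function
names are hypothetical. Only mon_pg_stuck_threshold and the 300-second
figure come from this thread.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for mon_pg_stuck_threshold; 300 s matches the message in the report.
MON_PG_STUCK_THRESHOLD = 300.0


@dataclass
class PgStat:
    created: float                       # hypothetical creation / first-seen stamp
    last_active: Optional[float] = None  # None: the mon has no report for this PG yet


def effective_last_active(stat: PgStat) -> float:
    """Treat an empty last_* field as the creation stamp."""
    return stat.created if stat.last_active is None else stat.last_active


def is_stuck_inactive(stat: PgStat, now: float,
                      threshold: float = MON_PG_STUCK_THRESHOLD) -> bool:
    """A PG counts as stuck only after `threshold` seconds without activity."""
    return now - effective_last_active(stat) > threshold
```

With this, a PG that was just registered is not reported as stuck until the
threshold has elapsed since its creation stamp, instead of right away.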

sage

