From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wido den Hollander
Subject: Re: HEALTH_ERR when (re)starting ceph-osd's
Date: Thu, 28 Jan 2016 13:37:39 +0100
Message-ID: <56AA0B93.8050105@42on.com>
References: <0ec5364ea10c41a687edca2bfe7818e6@R01UKEXCASM115.r01.fujitsu.local>
In-Reply-To: <0ec5364ea10c41a687edca2bfe7818e6@R01UKEXCASM115.r01.fujitsu.local>
To: "Piotr.Dalek@ts.fujitsu.com" , "ceph-devel@vger.kernel.org"

On 28-01-16 11:48, Piotr.Dalek@ts.fujitsu.com wrote:
> Hello,
>
> I haven't noticed it before, but since merging https://github.com/ceph/ceph/pull/7253 I see that, when restarting daemons on healthy ceph cluster, it goes to HEALTH_ERR state with "$(random_number) pgs are stuck inactive for more than 300 seconds".
> I looked at the commit and it turns out it will be always occurring on restart/boot, as booting pgs are inactive "by default" (since mons never received any sign of life from them) - not because they're actually stuck inactive.

Well, in that case, isn't the PR correct? But I see what you mean.

> One solution to this would be to mark pg_stat.last_* fields to the point where it were first seen, so they will become stuck (mon_pg_stuck_threshold) seconds after first registering, and not right away.

That sounds like a good solution, you might want to take a look at: http://tracker.ceph.com/issues/14028

> Another, less invasive one, is to just let user disable this warning.
>

As you can see in the discussion on Github, we decided to set 'mon_pg_min_inactive' to 1 by default.
You can disable these warnings by setting it either to zero or to maybe something like 10.

This is just there so that people are informed when multiple PGs are inactive.

Being in WARN state while still not performing I/O is a bad thing. WARN should be where you take a look, but aren't worried. If I/O stops, ERR is a good state to go into.

Wido

> What do you think?
>
> With best regards / Pozdrawiam
> Piotr Dałek
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
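
For readers of the archive, the behaviour discussed in this thread can be sketched as a toy model. This is only an illustration under stated assumptions, not the Ceph monitor code: the `PgStat` class, the `first_seen` field, the `use_first_seen` switch, and the exact comparison operators are hypothetical. Only the option names `mon_pg_stuck_threshold` and `mon_pg_min_inactive`, the 300-second figure, and the default of 1 come from the messages above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Values taken from the thread; the constant names are illustrative only.
MON_PG_STUCK_THRESHOLD = 300  # seconds ("stuck inactive for more than 300 seconds")
MON_PG_MIN_INACTIVE = 1       # default mentioned in the reply


@dataclass
class PgStat:
    """Hypothetical, simplified view of what a monitor knows about one PG."""
    last_active: Optional[float]  # None: the monitor never saw this PG active
    first_seen: float             # when the monitor first registered the PG


def is_stuck_inactive(pg: PgStat, now: float,
                      threshold: float = MON_PG_STUCK_THRESHOLD,
                      use_first_seen: bool = False) -> bool:
    """Return True if the PG counts as stuck inactive at time `now`."""
    reference = pg.last_active
    if reference is None:
        if not use_first_seen:
            # Behaviour reported in the thread: a freshly booted PG is
            # treated as stuck right away.
            return True
        # Proposed fix: start the clock when the PG was first seen.
        reference = pg.first_seen
    return now - reference > threshold


def health(pgs: Iterable[PgStat], now: float,
           min_inactive: int = MON_PG_MIN_INACTIVE,
           use_first_seen: bool = False) -> str:
    """Map the number of stuck PGs to a health state (assumed semantics)."""
    stuck = sum(is_stuck_inactive(pg, now, use_first_seen=use_first_seen)
                for pg in pgs)
    if stuck == 0:
        return "HEALTH_OK"
    # Assumption: a non-positive min_inactive never escalates to ERR.
    if min_inactive > 0 and stuck >= min_inactive:
        return "HEALTH_ERR"
    return "HEALTH_WARN"
```

In this model, two PGs registered ten seconds ago with no activity report yield HEALTH_ERR under the default, HEALTH_WARN when `min_inactive` is 0 or 10, and HEALTH_OK when the first-seen timestamp is used as the starting point.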