Re: SMART monitoring - Andrey Korolyov

From: Andrey Korolyov <andrey@xdel.ru>
To: Justin Erenkrantz <justin@erenkrantz.com>
Cc: Sage Weil <sage@inktank.com>,
	James Harper <james.harper@bendigoit.com.au>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: SMART monitoring
Date: Fri, 27 Dec 2013 21:09:46 +0400
Message-ID: <52BDB45A.802@xdel.ru>
In-Reply-To: <CAOGLoJPoS3WUUJuyRqkiUbT+DoDCochHZ-LThC4Ro8wUCdkD0g@mail.gmail.com>

On 12/27/2013 08:15 PM, Justin Erenkrantz wrote:
> On Thu, Dec 26, 2013 at 9:17 PM, Sage Weil <sage@inktank.com> wrote:
>> I think the question comes down to whether Ceph should take some internal
>> action based on the information, or whether that is better handled by some
>> external monitoring agent.  For example, an external agent might collect
>> SMART info into graphite, and every so often do some predictive analysis
>> and mark out disks that are expected to fail soon.
>>
>> I'd love to see some consensus form around what this should look like...
> 
> My $.02 from the peanut gallery: at a minimum, set the HEALTH_WARN flag if
> there is a SMART failure on a physical drive that contains an OSD.  Yes,
> you could build the monitoring into a separate system, but I think it'd be
> really useful to combine it into the cluster health assessment.  -- justin

Hi,

Judging from my personal experience, SMART failures can be dangerous
when they are not bad enough to take an OSD down completely: the OSD
does not flap and is not marked down in time, yet cluster performance
suffers greatly. I don't think SMART monitoring needs to be part of
Ceph itself, because separate monitoring of predictive failure counters
can do that job well, and in case of sudden errors a SMART query may
not work at all, since the system may have issued a lot of bus resets
and the disk may be inaccessible. So I propose two strategies: run
regular, scattered background checks, and monitor OSD responsiveness
to work around cases of performance degradation caused by read/write
errors.
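
To make the two strategies a bit more concrete, here is a rough Python
sketch. It is only an illustration under assumptions, not anything Ceph
provides: the function names, the 3x threshold, the minimum sample count
and the input format (a dict of per-OSD latency samples in milliseconds,
collected by whatever external agent you already run) are all
hypothetical.

```python
import random
from statistics import median


def slow_osds(latencies_ms, factor=3.0, min_samples=5):
    """Return ids of OSDs whose median latency exceeds
    `factor` times the median of all per-OSD medians.

    `latencies_ms` maps an OSD id to a list of recent latency samples.
    The threshold and sample count are illustrative assumptions.
    """
    # Ignore OSDs with too few samples to judge.
    per_osd = {
        osd: median(samples)
        for osd, samples in latencies_ms.items()
        if len(samples) >= min_samples
    }
    if len(per_osd) < 2:
        return []  # nothing to compare against
    baseline = median(per_osd.values())
    return sorted(osd for osd, m in per_osd.items() if m > factor * baseline)


def scattered_schedule(disks, period_s, seed=None):
    """Give each disk a random start offset within `period_s` seconds,
    so background checks are spread out instead of running all at once.
    """
    rng = random.Random(seed)
    return {disk: rng.uniform(0, period_s) for disk in disks}
```

The first function covers the case of a disk that is degraded but still
answering: it is compared against its peers rather than against a fixed
limit, so it does not depend on SMART queries succeeding. The second
only spreads the check times out; what the check actually does (a SMART
query, a small read test) is left open.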

Thread overview: 5+ messages
2013-12-27  0:26 SMART monitoring James Harper
2013-12-27  2:17 ` Sage Weil
2013-12-27 16:15   ` Justin Erenkrantz
2013-12-27 17:09     ` Andrey Korolyov [this message]
2014-05-22  8:59       ` Andrey Korolyov
