SMART monitoring

* SMART monitoring
@ 2013-12-27  0:26 James Harper
  2013-12-27  2:17 ` Sage Weil
  0 siblings, 1 reply; 5+ messages in thread
From: James Harper @ 2013-12-27  0:26 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

What would be the best approach to integrate SMART with ceph, for the predictive failure case?

Assuming you agree with SMART diagnosis of an impending failure, would it be better to automatically start migrating data off the OSD (reduce the weight to 0?), or to just prompt the user to replace the disk (which requires no monitoring on ceph's part)? The former would ensure that redundancy is maintained at all times without any user interaction.

And what about the bad sector case? Assuming you are using something like btrfs with redundant copies of metadata, and assuming that is enough to keep the metadata consistent, what should be done in the case of a small number of fs errors? Can ceph handle getting an i/o error on one of its files inside the osd and just read from the replica, or should the entire osd just be failed and let ceph rebalance the data itself?

Thanks

James

* Re: SMART monitoring
  2013-12-27  0:26 SMART monitoring James Harper
@ 2013-12-27  2:17 ` Sage Weil
  2013-12-27 16:15   ` Justin Erenkrantz
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2013-12-27  2:17 UTC (permalink / raw)
  To: James Harper; +Cc: ceph-devel@vger.kernel.org

Hi James,

On Fri, 27 Dec 2013, James Harper wrote:
> What would be the best approach to integrate SMART with ceph, for the 
> predictive failure case?

Currently (as you know) we don't do anything with SMART.  It is obviously 
important for the entire system, but I'm unsure whether it should be 
something that ceph-osd is doing as part of the cluster, or whether it is 
better handled by another generic agent that is monitoring the hosts in 
your cluster.  

I think the question comes down to whether Ceph should take some internal 
action based on the information, or whether that is better handled by some 
external monitoring agent.  For example, an external agent might collect 
SMART info into graphite, and every so often do some predictive analysis 
and mark out disks that are expected to fail soon.
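
A minimal sketch of the decision step such an external agent might run,
assuming it already has the text printed by `smartctl -A` for a drive. The
watched attribute names, the zero threshold, and the sample table are
assumptions chosen for illustration, not a validated failure predictor.

```python
import re

# Raw counters that are commonly watched for early signs of trouble.
# This particular set is an assumption made for the example.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# One row of the `smartctl -A` attribute table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for every parsable table row."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(3))
    return attrs


def suspect_attributes(attrs, threshold=0):
    """Return the watched attributes whose raw value exceeds the threshold."""
    return sorted(n for n in WATCHED if attrs.get(n, 0) > threshold)


# Illustrative sample only; not output captured from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       8760
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(suspect_attributes(parse_smart_attributes(SAMPLE)))
```

A real agent would feed these counters into whatever predictive analysis it
trusts; the fixed threshold here only stands in for that step.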

I'd love to see some consensus form around what this should look like...

> Assuming you agree with SMART diagnosis of an impending failure, would 
> it be better to automatically start migrating data off the OSD (reduce 
> the weight to 0?), or to just prompt the user to replace the disk (which 
> requires no monitoring on ceph's part)? The former would ensure that 
> redundancy is maintained at all times without any user interaction.

We definitely want to mark the disk 'out' or reweight it to zero so that 
redundancy is never unnecessarily reduced.
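
The two options could be scripted roughly as follows. This is a sketch only:
it builds the `ceph osd out` and `ceph osd crush reweight` command lines for
a given OSD id and, by default, only prints them. The helper names and the
dry-run default are assumptions made for the example.

```python
import subprocess


def drain_commands(osd_id, use_crush_reweight=False):
    """Build the command that moves data off one OSD.

    By default the OSD is marked 'out'; alternatively its CRUSH weight is
    set to zero.
    """
    if use_crush_reweight:
        return [["ceph", "osd", "crush", "reweight", f"osd.{osd_id}", "0"]]
    return [["ceph", "osd", "out", str(osd_id)]]


def drain(osd_id, dry_run=True, use_crush_reweight=False):
    """Print the commands, or run them when dry_run is False."""
    commands = drain_commands(osd_id, use_crush_reweight)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    drain(3)
    drain(3, use_crush_reweight=True)
```

Keeping the dry run as the default means a false alarm from the monitoring
side cannot start a rebalance without someone opting in.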

> And what about the bad sector case? Assuming you are using something 
> like btrfs with redundant copies of metadata, and assuming that is 
> enough to keep the metadata consistent, what should be done in the case 
> of a small number of fs errors? Can ceph handle getting an i/o error on 
> one of its files inside the osd and just read from the replica, or 
> should the entire osd just be failed and let ceph rebalance the data 
> itself?

If the failure is masked by the fs, Ceph doesn't care.  Currently, if Ceph 
sees any error on write, we 'fail' the entire ceph-osd process.  On read, 
this is configurable (filestore fail eio), but also defaults to true.  
This may seem like overkill, but if we are getting read failures, this is 
a not-completely-horrible signal that the drive may fail more 
spectacularly later, and it avoids having to cope with the complexity of a 
partial failure.  Also note that since we are doing a deep-scrub with some 
regularity (which reads every byte stored and compares across replicas), 
the cluster will automatically fail drives that start issuing latent read 
errors.

sage

* Re: SMART monitoring
  2013-12-27  2:17 ` Sage Weil
@ 2013-12-27 16:15   ` Justin Erenkrantz
  2013-12-27 17:09     ` Andrey Korolyov
  0 siblings, 1 reply; 5+ messages in thread
From: Justin Erenkrantz @ 2013-12-27 16:15 UTC (permalink / raw)
  To: Sage Weil; +Cc: James Harper, ceph-devel@vger.kernel.org

On Thu, Dec 26, 2013 at 9:17 PM, Sage Weil <sage@inktank.com> wrote:
> I think the question comes down to whether Ceph should take some internal
> action based on the information, or whether that is better handled by some
> external monitoring agent.  For example, an external agent might collect
> SMART info into graphite, and every so often do some predictive analysis
> and mark out disks that are expected to fail soon.
>
> I'd love to see some consensus form around what this should look like...

My $.02 from the peanut gallery: at a minimum, set the HEALTH_WARN flag if
there is a SMART failure on a physical drive that contains an OSD.  Yes,
you could build the monitoring into a separate system, but I think it'd be
really useful to combine it into the cluster health assessment.  -- justin

* Re: SMART monitoring
  2013-12-27 16:15   ` Justin Erenkrantz
@ 2013-12-27 17:09     ` Andrey Korolyov
  2014-05-22  8:59       ` Andrey Korolyov
  0 siblings, 1 reply; 5+ messages in thread
From: Andrey Korolyov @ 2013-12-27 17:09 UTC (permalink / raw)
  To: Justin Erenkrantz; +Cc: Sage Weil, James Harper, ceph-devel@vger.kernel.org

On 12/27/2013 08:15 PM, Justin Erenkrantz wrote:
> On Thu, Dec 26, 2013 at 9:17 PM, Sage Weil <sage@inktank.com> wrote:
>> I think the question comes down to whether Ceph should take some internal
>> action based on the information, or whether that is better handled by some
>> external monitoring agent.  For example, an external agent might collect
>> SMART info into graphite, and every so often do some predictive analysis
>> and mark out disks that are expected to fail soon.
>>
>> I'd love to see some consensus form around what this should look like...
> 
> My $.02 from the peanut gallery: at a minimum, set the HEALTH_WARN flag if
> there is a SMART failure on a physical drive that contains an OSD.  Yes,
> you could build the monitoring into a separate system, but I think it'd be
> really useful to combine it into the cluster health assessment.  -- justin

Hi,

Judging from my personal experience, SMART failures can be dangerous when
they are not bad enough to tear down an OSD completely: the OSD will not
flap and will not be marked down in time, yet cluster performance is
greatly affected. I don't think the SMART monitoring task is really
related to Ceph, because separate monitoring of predictive failure
counters can do its job well, and in case of sudden errors a SMART query
may not work at all, since the system will have issued a lot of bus
resets and the disk may be completely inaccessible. So I propose a set of
two strategies: do regular scattered background checks, and monitor OSD
responsiveness to work around cases of performance degradation due to
read/write errors.

* Re: SMART monitoring
  2013-12-27 17:09     ` Andrey Korolyov
@ 2014-05-22  8:59       ` Andrey Korolyov
  0 siblings, 0 replies; 5+ messages in thread
From: Andrey Korolyov @ 2014-05-22  8:59 UTC (permalink / raw)
  To: Justin Erenkrantz; +Cc: Sage Weil, James Harper, ceph-devel@vger.kernel.org

On Fri, Dec 27, 2013 at 9:09 PM, Andrey Korolyov <andrey@xdel.ru> wrote:
> On 12/27/2013 08:15 PM, Justin Erenkrantz wrote:
>> On Thu, Dec 26, 2013 at 9:17 PM, Sage Weil <sage@inktank.com> wrote:
>>> I think the question comes down to whether Ceph should take some internal
>>> action based on the information, or whether that is better handled by some
>>> external monitoring agent.  For example, an external agent might collect
>>> SMART info into graphite, and every so often do some predictive analysis
>>> and mark out disks that are expected to fail soon.
>>>
>>> I'd love to see some consensus form around what this should look like...
>>
>> My $.02 from the peanut gallery: at a minimum, set the HEALTH_WARN flag if
>> there is a SMART failure on a physical drive that contains an OSD.  Yes,
>> you could build the monitoring into a separate system, but I think it'd be
>> really useful to combine it into the cluster health assessment.  -- justin
>
> Hi,
>
> Judging from my personal experience, SMART failures can be dangerous when
> they are not bad enough to tear down an OSD completely: the OSD will not
> flap and will not be marked down in time, yet cluster performance is
> greatly affected. I don't think the SMART monitoring task is really
> related to Ceph, because separate monitoring of predictive failure
> counters can do its job well, and in case of sudden errors a SMART query
> may not work at all, since the system will have issued a lot of bus
> resets and the disk may be completely inaccessible. So I propose a set of
> two strategies: do regular scattered background checks, and monitor OSD
> responsiveness to work around cases of performance degradation due to
> read/write errors.

Some necromancy for this thread..

Based on a year of experience with Hitachi 4T disks, there are a lot of
failures that SMART cannot handle completely: speed degradation and
sudden disk death. Although the second case resolves itself, because the
stuck OSD gets kicked out, it is not easy to tell which disks are about
to die without thorough dmesg monitoring for bus errors and periodic
speed calibration. Something like an idle-priority speed measurement for
OSDs, one that does not dramatically increase overall wear, may be
useful enough to implement together with an additional OSD perf metric,
like seek_time in SMART (though SMART may return a good value for it
when performance has already slowed to a crawl). It would also catch
most of the things that impact performance but may not be exposed to the
host OS at all, such as correctable bus errors. By the way, although 1T
Seagates have a much higher failure rate, they always die with an
'appropriate' set of attributes in SMART, while Hitachi tends to die
without warning :) I hope this is helpful for someone.
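
A rough sketch of how periodic speed measurements could be turned into a
degradation signal, assuming throughput samples (in MB/s) are already being
collected per disk by some low-priority read test. The window sizes and the
0.5 ratio are assumptions chosen for illustration and would need tuning
against real hardware.

```python
from statistics import median


def is_degraded(samples, baseline_n=5, recent_n=3, ratio=0.5):
    """Flag a disk whose recent throughput fell well below its baseline.

    samples: throughput measurements in MB/s, oldest first.
    The baseline is the median of the first baseline_n samples and the
    recent value is the median of the last recent_n samples. Medians keep
    a single slow or fast outlier from flipping the verdict.
    """
    if len(samples) < baseline_n + recent_n:
        return False  # not enough history to judge
    baseline = median(samples[:baseline_n])
    recent = median(samples[-recent_n:])
    return recent < ratio * baseline


if __name__ == "__main__":
    healthy = [150, 148, 152, 149, 151, 147, 150, 149]
    slowing = [150, 148, 152, 149, 151, 60, 55, 58]
    print(is_degraded(healthy), is_degraded(slowing))
```

Because it looks only at measured throughput, a check like this can notice
a disk that has slowed to a crawl even while its SMART attributes still
look fine.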
