Bad Hardware / Software Disk Detection for Production Systems

* Bad Hardware / Software Disk Detection for Production Systems
@ 2006-06-02 12:21 Bill Rees
  2006-06-02 12:40 ` Mark Nipper
  0 siblings, 1 reply; 4+ messages in thread
From: Bill Rees @ 2006-06-02 12:21 UTC (permalink / raw)
  To: reiserfs-list

Hi,

This is a general question but I thought I'd post it to the list to see if
anyone has any suggestions.

We have some production systems running SUSE 9.3 and reiserfs that are very
IO intensive using 3ware controllers in JBOD mode for the high throughput.
On occasion we will have a disk problem that brings the system to a virtual
standstill. Usually the problems cause an eventual system lockup that can't
even be resolved with a software reboot. One has to hit the reset switch to
get the system back and then take the offending disk offline. Usually there
are tons of errors from the kernel indicating drive problems. 90% of the
time the failure is due to a hardware issue with the drive running out of
spare sectors but sometimes it is a filesystem corruption issue. I am looking
for a way to prevent the system lockup.

Is there a way to accomplish this without RAID 5? We have the smartmon utils
installed on all of our systems and most times even setting a drive to be
fsck'ed on reboot does nothing when a system boots. Ideally, I'd just like
to recognize when a disk might be having issues, stop using it, and then
notify someone to manually check into it.
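
One way to recognize a drive that is running out of spare sectors is to
watch the raw values in the attribute table printed by smartctl -A. The
following is a minimal sketch, not a tested monitoring tool. The watched
attribute names, the threshold, and the sample table are assumptions made
up for illustration; attribute names vary by drive vendor. Drives behind a
3ware controller usually need smartctl's -d 3ware,N device type, which is
not shown here.

```python
from typing import Dict

# Assumption: these attributes tend to rise when a drive is remapping sectors.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text: str) -> Dict[str, int]:
    """Map attribute name -> raw value from `smartctl -A` style table text."""
    attrs: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the tenth column is RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def suspect_attributes(text: str, limit: int = 0) -> Dict[str, int]:
    """Return watched attributes whose raw value exceeds `limit`."""
    attrs = parse_smart_attributes(text)
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > limit}


# Invented sample output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   072   072   036    Pre-fail  Always       -       1187
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(suspect_attributes(SAMPLE))
```

A cron job could feed real smartctl output to suspect_attributes and send
a notification when the result is not empty. As noted elsewhere in this
thread, SMART does not catch every failure, so this would only be one
signal among several.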

Is there a quick and dirty check to determine if a reiserfs disk is hosed on
boot? Setting a flag in fstab doesn't seem to do the trick. I'd be willing
to write a custom mount script to accomplish this as well.
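
A custom mount script along these lines could run a read-only reiserfsck
pass and refuse to mount when the check does not come back clean. This is a
rough sketch under stated assumptions: it assumes the usual fsck convention
that exit status 0 means no errors and 1 means errors were corrected, and
that the --check and --yes options behave as described in the reiserfsck
man page of the installed version. A full check can take a long time on a
large volume, so it may not be as quick as hoped.

```python
import subprocess
from enum import Enum


class Action(Enum):
    MOUNT = "mount"
    SKIP_AND_NOTIFY = "skip and notify"


def decide(exit_status: int) -> Action:
    # Assumption: 0 = no errors, 1 = errors corrected (fsck convention).
    # Any other status is treated as "do not mount".
    return Action.MOUNT if exit_status in (0, 1) else Action.SKIP_AND_NOTIFY


def check_device(device: str, runner=subprocess.run) -> Action:
    """Run a read-only check on `device` and decide whether to mount it.

    `runner` is injectable so the decision logic can be exercised without
    touching a real disk.
    """
    result = runner(
        # Assumption: --check is the read-only mode and --yes suppresses the
        # interactive confirmation; verify both against the local man page.
        ["reiserfsck", "--check", "--yes", device],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return decide(result.returncode)
```

The script would then mount only the devices for which check_device
returns Action.MOUNT and report the rest to an operator.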

Any input would be appreciated.

thanks,
Bill Rees

* Re: Bad Hardware / Software Disk Detection for Production Systems
  2006-06-02 12:21 Bad Hardware / Software Disk Detection for Production Systems Bill Rees
@ 2006-06-02 12:40 ` Mark Nipper
  2006-06-02 13:10   ` Bill Rees
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Nipper @ 2006-06-02 12:40 UTC (permalink / raw)
  To: Bill Rees; +Cc: reiserfs-list

On 02 Jun 2006, Bill Rees wrote:
>    This is a general question but I thought I'd post it to the list to see if
>    anyone has any suggestions.

        You cannot expect any sort of data integrity if you are
using JBOD mode on a RAID controller.  The very nature of JBOD
means that any time a single drive fails, you lose every
partition across that entire JBOD volume.

        If your file systems are even coming back after such a
catastrophe, you have been extremely fortunate to date.  I still
cannot imagine that you aren't losing data though since there is
no mirroring going on whatsoever which means blocks are
inherently being lost during such failures.

        If you are just looking for read performance, you should
seriously consider at least a RAID 1 volume.  If you can
sacrifice some overall performance, you will get the best space
optimization by going with a RAID 5.  Anything more exotic than
that, you should start at:
http://en.wikipedia.org/wiki/Redundant_array_of_independent_disks

and go with whichever level of RAID makes the most sense for your
application.
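
The space trade-off between the two levels mentioned above can be worked
out with simple arithmetic. The sketch below assumes equal-sized disks and
the textbook definitions (RAID 1 as a single mirror set, RAID 5 with one
disk's worth of parity); the disk counts and sizes in the example are
hypothetical.

```python
def usable_capacity(level: int, disks: int, disk_size_gb: float) -> float:
    """Usable capacity in GB for equal-sized disks at the given RAID level."""
    if level == 0:
        # Striping only: all space is usable, with no redundancy.
        return disks * disk_size_gb
    if level == 1:
        # One mirror set: every disk holds the same data.
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return disk_size_gb
    if level == 5:
        # One disk's worth of space goes to distributed parity.
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_size_gb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # Hypothetical example: 250 GB disks.
    print("RAID 1, 2 disks:", usable_capacity(1, 2, 250), "GB")
    print("RAID 5, 4 disks:", usable_capacity(5, 4, 250), "GB")
```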

        The bottom line is, you are going to lose data in a JBOD
array (unless you also happen to be doing some sort of software
RAID on top of that).  There is really no gain through some sort
of software trick to try to avoid the data loss.  SMART enabled
hard drives may occasionally warn you of an imminent failure, but
then again, they may not.  Drives will just occasionally fail
entirely without any warning, hence the whole purpose of RAID
levels above 0.

-- 
Mark Nipper                                                e-contacts:
832 Tanglewood Drive                                nipsy@bitgnome.net
Bryan, Texas 77802-4013                     http://nipsy.bitgnome.net/
(979)575-3193                      AIM/Yahoo: texasnipsy ICQ: 66971617

-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GG/IT d- s++:+ a- C++$ UBL++++$ P--->+++ L+++$ !E---
W++(--) N+ o K++ w(---) O++ M V(--) PS+++(+) PE(--)
Y+ PGP t+ 5 X R tv b+++@ DI+(++) D+ G e h r++ y+(**)
------END GEEK CODE BLOCK------

---begin random quote of the moment---
Anyone who is capable of getting themselves made President
should on no account be allowed to do the job.
 -- Douglas Adams, _The Hitchhiker's Guide to the Galaxy_
----end random quote of the moment----

* Re: Bad Hardware / Software Disk Detection for Production Systems
  2006-06-02 12:40 ` Mark Nipper
@ 2006-06-02 13:10   ` Bill Rees
  2006-06-02 16:47     ` Hans Reiser
  0 siblings, 1 reply; 4+ messages in thread
From: Bill Rees @ 2006-06-02 13:10 UTC (permalink / raw)
  To: Mark Nipper; +Cc: reiserfs-list

Believe it or not, we aren't that concerned with data loss in this
application due to the way the data is stored on the disks.  We are more
concerned with the system becoming unresponsive due to problems writing to the
filesystem. I would think that this could still happen in a RAID 5
implementation if the filesystem becomes corrupted. We haven't used RAID 5
in a while but when we did, file system rebuilds that would take days were
not uncommon.

I am trying to maximize uptime and some data loss is not a problem.  Perhaps
redundant RAID 5 systems would be the only answer to this problem?
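
Since the concern here is responsiveness rather than data loss, one
possible approach is to probe each data mount with a small timed write from
a separate process, so that a stalled disk blocks only the probe and not
the monitor. The sketch below is an illustration under assumptions, not a
proven fix: the probe file name and the timeout are arbitrary choices, and
it can only detect a stall, it cannot prevent one inside the kernel.

```python
import subprocess
import sys

# Child script: write, fsync and remove a small file under the given directory.
PROBE = (
    "import os, sys\n"
    "path = os.path.join(sys.argv[1], '.io_probe')\n"
    "fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)\n"
    "os.write(fd, b'probe')\n"
    "os.fsync(fd)\n"
    "os.close(fd)\n"
    "os.remove(path)\n"
)


def mount_is_responsive(mount_point: str, timeout_s: float = 10.0) -> bool:
    """Return True if a small synchronous write completes within the timeout."""
    child = subprocess.Popen(
        [sys.executable, "-c", PROBE, mount_point],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O may not exit,
        # so we deliberately do not wait for it again.
        child.kill()
        return False
```

Popen with wait(timeout=...) is used instead of subprocess.run because run
waits for the child again after a timeout, which could block the monitor
on the very stall it is trying to detect. A monitor could call
mount_is_responsive for each data disk and stop scheduling work on any
mount that fails the probe.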

* Re: Bad Hardware / Software Disk Detection for Production Systems
  2006-06-02 13:10   ` Bill Rees
@ 2006-06-02 16:47     ` Hans Reiser
  0 siblings, 0 replies; 4+ messages in thread
From: Hans Reiser @ 2006-06-02 16:47 UTC (permalink / raw)
  To: Bill Rees; +Cc: Mark Nipper, reiserfs-list

Bill Rees wrote:

> Believe it or not, we aren't that concerned with data loss in this
> application due to the way the data is stored on the disks.  We are
> more concerned with the system becoming unresponsive due to problems
> writing to the filesystem. I would think that this could still happen in a
> RAID 5 implementation if the filesystem becomes corrupted. We haven't
> used RAID 5 in a while but when we did, file system rebuilds that would
> take days were not uncommon.
>
> I am trying to maximize uptime and some data loss is not a problem. 
> Perhaps redundant RAID 5 systems would be the only answer to this problem?

Look into the reiserfs mount options regarding error handling.  If your
budget can justify it, we can write code to do whatever you want in this
area.
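
The message above does not name a specific option. As an assumption for
illustration, the sketch below uses the errors= mount option with the value
ro-remount, which asks the filesystem to go read-only on an error instead
of continuing or panicking. Whether that option and value are available
depends on the kernel version, so the mount documentation for the running
kernel should be checked first. The function name and the sample fstab
lines are hypothetical.

```python
def add_error_policy(fstab_text: str, policy: str = "ro-remount") -> str:
    """Return fstab text with errors=<policy> set on every reiserfs entry."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if not is_comment and len(fields) >= 4 and fields[2] == "reiserfs":
            # Drop any existing errors= setting, then append the new one.
            opts = [o for o in fields[3].split(",") if not o.startswith("errors=")]
            opts.append(f"errors={policy}")
            fields[3] = ",".join(opts)
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical fstab content, for illustration only.
SAMPLE_FSTAB = """\
# device   mountpoint  type      options  dump pass
/dev/sda1  /           ext3      defaults 1 1
/dev/sdb1  /data1      reiserfs  noatime  0 0
"""

if __name__ == "__main__":
    print(add_error_policy(SAMPLE_FSTAB), end="")
```

The rewritten text would be reviewed by hand before replacing /etc/fstab;
the function only transforms a string and never writes to disk.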

Hans
