RAID1 seems not to be able to scrub pending sectors shown by smart

* RAID1 seems not to be able to scrub pending sectors shown by smart
@ 2011-12-23 18:39 Philip Hands
  2011-12-23 19:59 ` Roger Heflin
  0 siblings, 1 reply; 11+ messages in thread
From: Philip Hands @ 2011-12-23 18:39 UTC (permalink / raw)
  To: linux-raid

Hi,

This is a little vague I'm afraid, but I've saved the syslogs, so please
feel free to ask for details if they'd help track down what's happening.

I'm running a relatively busy server (it hosts the VM for
ftp.uk.debian.org among other things) which has 6 disks, four of which
are 2TB Western Digital Caviar Black drives.

Each of the 2TB drives is split into a couple of small partitions at the
front (250MB & 750MB) on which are built 4-way RAID1s containing /boot
and / respectively, with the rest of the drives split into 4 ~500GB
chunks, which are then assembled into 5 3-way RAID1s.

A while ago, one of the drives started showing an increasing number of
pending sectors, over the course of several weeks getting up to 360 or
so.  Meanwhile another of the drives got up to about 90 pending sectors.

I assumed that by forcing a check, it would read the drives, notice that
sectors were unreadable, and write the data back from one of the clean
drives, but having run checks on all drives, the number of pending
sectors went down by about five or so each time (or once about ten) and
then crept up again.
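
For reference, a check of this kind is requested through md's sysfs
interface.  The following is only a minimal sketch of that step, assuming
the array is md4 as in the logs further down; the function names and the
sysfs_root parameter are hypothetical (the latter exists only so the code
can be tried against a scratch directory rather than a live system).

```python
from pathlib import Path


def request_scrub(md_name, action="check", sysfs_root="/sys/block"):
    """Ask md to start a scrub by writing to <md>/md/sync_action.

    "check" only counts mismatches between mirrors, "repair" also
    rewrites them, and "idle" interrupts a running scrub.
    """
    if action not in ("check", "repair", "idle"):
        raise ValueError("unsupported sync_action: %r" % action)
    path = Path(sysfs_root) / md_name / "md" / "sync_action"
    path.write_text(action + "\n")
    return path


def mismatch_count(md_name, sysfs_root="/sys/block"):
    """Return the mismatch_cnt md reported for the most recent scrub."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

On a real system this needs root, e.g. request_scrub("md4"), and progress
then shows up in /proc/mdstat.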

So, I went in to the co-lo to see if there was something like a loose
cable causing the problem, say -- and just before I left I removed the
drive with fewer pending sectors, zeroed the superblocks to ensure that
it really would rewrite things, and then added it back in -- it dropped
the pending sector count from ~90 to 10 quite quickly, at which point
smart started declaring the drive as failed.  I've now replaced that drive.

The replacement drive was fitted a few days ago, and has now synced up.

While it was syncing, the drive with 360-ish pending sectors started
throwing many read errors, but the pending sector count remained
static -- this seems wrong to me.  Surely the md code should notice the
read errors, and decide to rewrite the data from the remaining drive.

While the read errors were happening, the system performance became dire
(with system load going up to about 15, as opposed to the normal 1-3,
and the whole system regularly pausing -- I had previously assumed that
this might be due to busy networks or dropped packets, but when I was
on-site I noticed that when a read error was occurring, that all other
disk activity would halt, as would the responsiveness of the CLI).

So, I failed the 360-pending-sector drive out of the RAID, and all
returned to normal, performance-wise.

Once the RAID synced (the one remaining disk, and the one that was
supplied as a replacement), I added the apparently duff disk back into
the array, having zeroed its superblock, and made sure that the first
array to rebuild was the one containing at least some of the pending
sectors -- it turns out that that partition contained all of the pending
sectors, as they are now all gone.

None of those sectors has resulted in a reallocated sector, so they were
soft errors it seems -- so what I'm wondering is why none of the checks
or repairs I've run over the preceding weeks managed to put a dent in
the number of pending sectors.
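
To watch how the pending and reallocated counters move relative to each
other, the attribute table printed by "smartctl -A" can be parsed.  This
is a rough sketch only: the sample text is made up for illustration (it
is not output from the drive discussed here), and the parser assumes the
usual ten-column layout with the raw value in the last column.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value from a `smartctl -A` style table."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least
        # ten columns; the raw value is the tenth.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                pass  # some raw values are not plain integers
    return values


# Illustrative sample, not real output from the drive in question.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       360
"""

counts = parse_smart_attributes(SAMPLE)
print(counts["Current_Pending_Sector"], counts["Reallocated_Sector_Ct"])
```

Logging these two numbers before and after each check would show whether
pending sectors are being cleared by rewrites or turned into reallocations.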

I'll admit the possibility that some cabling or controller issue may have
been causing the duff sectors, as I've now moved it to a different SATA
port, but even so, it seems that it wasn't even trying to rewrite the
data.  It seems more likely that there really is some fault with the
disk (especially since a smart long test has just revealed another
unreadable sector in about the same area of the disk).

Perhaps you can suggest what I should look out for in the logs to
determine if read failures are really rewriting the blocks, or if my
suspicion that it's not happening is true.

Here's a sampling of one day's log which seems to show what I'm on
about:

  http://hands.com/~phil/tmp/sheikh.hands.com-mdadm-syslog-20111205

if for instance, you search for '25314' you'll find loads of this sort
of thing:

Dec  5 17:00:54 sheikh kernel: [1663261.867952] md/raid1:md4: redirecting sector 253145096 to other mirror: sdd4
Dec  5 17:00:54 sheikh kernel: [1663262.017791] md/raid1:md4: redirecting sector 253145104 to other mirror: sdd4
Dec  5 17:00:55 sheikh kernel: [1663262.451139] md/raid1:md4: redirecting sector 253145112 to other mirror: sdd4
Dec  5 17:00:56 sheikh kernel: [1663263.409472] md/raid1:md4: redirecting sector 253145120 to other mirror: sdd4
Dec  5 17:00:56 sheikh kernel: [1663263.734508] md/raid1:md4: redirecting sector 253145128 to other mirror: sdd4
Dec  5 17:00:56 sheikh kernel: [1663263.967813] md/raid1:md4: redirecting sector 253145136 to other mirror: sdd4
Dec  5 17:00:56 sheikh kernel: [1663264.034509] md/raid1:md4: redirecting sector 253145144 to other mirror: sdd4
Dec  5 17:00:56 sheikh kernel: [1663264.209565] md/raid1:md4: redirecting sector 253145152 to other mirror: sdd4
Dec  5 17:00:58 sheikh kernel: [1663265.609860] md/raid1:md4: redirecting sector 253145160 to other mirror: sdd4
Dec  5 17:00:58 sheikh kernel: [1663265.992975] md/raid1:md4: redirecting sector 253145168 to other mirror: sdd4

often preceded by something like:

Dec  5 17:00:41 sheikh kernel: [1663248.685965] md/raid1:md4: read error corrected (8 sectors at 253147088 on sdg4)

but to my eye, there don't seem to be enough of these corrections to go
with the errors, and they didn't get rid of all the pending sectors that
have since been wiped out as described above.
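
One way to put numbers on that impression is to tally the two message
types from the saved syslog.  The sketch below assumes only the two
message formats quoted above; the function name is hypothetical, and it
counts redirect messages and corrected sectors per array without claiming
that the two figures ought to match one for one.

```python
import re
from collections import Counter

REDIRECT = re.compile(
    r"md/raid1:(\w+): redirecting sector (\d+) to other mirror: (\w+)")
CORRECTED = re.compile(
    r"md/raid1:(\w+): read error corrected "
    r"\((\d+) sectors at (\d+) on (\w+)\)")


def tally_md_events(lines):
    """Count redirected reads and corrected sectors per md array."""
    redirects = Counter()
    corrected_sectors = Counter()
    for line in lines:
        m = REDIRECT.search(line)
        if m:
            redirects[m.group(1)] += 1
            continue
        m = CORRECTED.search(line)
        if m:
            corrected_sectors[m.group(1)] += int(m.group(2))
    return redirects, corrected_sectors
```

It can be fed an open log file directly, e.g.
tally_md_events(open("syslog")), since it only iterates over lines.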

Once the raid that's currently rebuilding has finished (in about an
hour), I'll tell it to do a check to see if that notices/fixes the new
pending block that's turned up.

Cheers, Phil.
-- 
|)|  Philip Hands [+44 (0)20 8530 9560]    http://www.hands.com/
|-|  HANDS.COM Ltd.                    http://www.uk.debian.org/
|(|  10 Onslow Gardens, South Woodford, London  E18 1NE  ENGLAND


* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-23 18:39 RAID1 seems not to be able to scrub pending sectors shown by smart Philip Hands
@ 2011-12-23 19:59 ` Roger Heflin
  2011-12-23 21:22   ` Philip Hands
  0 siblings, 1 reply; 11+ messages in thread
From: Roger Heflin @ 2011-12-23 19:59 UTC (permalink / raw)
  To: Philip Hands; +Cc: linux-raid

On Fri, Dec 23, 2011 at 12:39 PM, Philip Hands <phil@hands.com> wrote:
> [...]


No idea whether raid1 was rewriting the sectors or not, but I know my
raid6 was, and performance was really bad while it was happening, so it
probably would not have helped you much.  I was typically seeing
30-second pauses each time md found a set of bad sectors and forced the
rewrite.  This went on for several days until smart finally officially
failed one of the drives and I replaced it with another one.  It appears
that MD failed the drive at about the same time smart did (a write
failed, as the drive appears to have run out of spare sectors to
relocate things to).


I had 4 1.5TB Seagate drives from 2009 (bought at different times in
2009), and 3 of those 4 started getting lots of bad sectors, all within
a 2-month period, and all 3 finally officially failed smart.  They
failed one after another; luckily they failed out over 2-3 weeks, so I
had got the replacements in before I lost data (I was down to no
redundancy for several days in the middle).  While the sectors were
failing and being rewritten, the performance was just ugly.  So even if
raid1 was rewriting the sectors, it does not do anything for performance
when the drives are going bad.  The only thing that solved my
performance problem was getting all of the failing devices to finally
fail smart so they could be RMAed and replaced at minimal cost.

* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-23 19:59 ` Roger Heflin
@ 2011-12-23 21:22   ` Philip Hands
  2011-12-23 22:26     ` Roger Heflin
  0 siblings, 1 reply; 11+ messages in thread
From: Philip Hands @ 2011-12-23 21:22 UTC (permalink / raw)
  To: Roger Heflin; +Cc: linux-raid

On Fri, 23 Dec 2011 13:59:21 -0600, Roger Heflin <rogerheflin@gmail.com> wrote:
> On Fri, Dec 23, 2011 at 12:39 PM, Philip Hands <phil@hands.com> wrote:
...
> I had 4 1.5TB Seagate drives from 2009 (bought at different times in
> 2009), and 3 of those 4 started getting lots of bad sectors, all within
> a 2-month period, and all 3 finally officially failed smart.  They
> failed one after another; luckily they failed out over 2-3 weeks, so I
> had got the replacements in before I lost data (I was down to no
> redundancy for several days in the middle).  While the sectors were
> failing and being rewritten, the performance was just ugly.  So even if
> raid1 was rewriting the sectors, it does not do anything for performance
> when the drives are going bad.  The only thing that solved my
> performance problem was getting all of the failing devices to finally
> fail smart so they could be RMAed and replaced at minimal cost.

Well, I suppose that's to some extent the reason I mentioned this.

It seems to me that if a disk is throwing _loads_ of read errors, and
running dreadfully slowly, one could react to that by favouring
different disk(s), and only occasionally throwing a read at the duff
disk, until it either sorts itself out or dies.

My performance went from rubbish to fine simply by removing the
360-pending-sector disk from the RAID.  OK, so if the problem is that
writes are being delayed by the dodgy disk, that's not easy to deal
with, but looking at the logs makes it look like the reads quite often
keep targeting the same disk even when several reads just failed and
got redirected.  This seems suboptimal to me.

Cheers, Phil.
-- 
|)|  Philip Hands [+44 (0)20 8530 9560]    http://www.hands.com/
|-|  HANDS.COM Ltd.                    http://www.uk.debian.org/
|(|  10 Onslow Gardens, South Woodford, London  E18 1NE  ENGLAND

* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-23 21:22   ` Philip Hands
@ 2011-12-23 22:26     ` Roger Heflin
  2011-12-24 10:07       ` Philip Hands
  0 siblings, 1 reply; 11+ messages in thread
From: Roger Heflin @ 2011-12-23 22:26 UTC (permalink / raw)
  To: Philip Hands; +Cc: 'LinuxRaid'

On 12/23/2011 03:22 PM, Philip Hands wrote:
> On Fri, 23 Dec 2011 13:59:21 -0600, Roger Heflin<rogerheflin@gmail.com>  wrote:
>> On Fri, Dec 23, 2011 at 12:39 PM, Philip Hands<phil@hands.com>  wrote:
> ...
>> I had 4 1.5TB Seagate drives from 2009 (bought at different times in
>> 2009), and 3 of those 4 started getting lots of bad sectors, all within
>> a 2-month period, and all 3 finally officially failed smart.  They
>> failed one after another; luckily they failed out over 2-3 weeks, so I
>> had got the replacements in before I lost data (I was down to no
>> redundancy for several days in the middle).  While the sectors were
>> failing and being rewritten, the performance was just ugly.  So even if
>> raid1 was rewriting the sectors, it does not do anything for performance
>> when the drives are going bad.  The only thing that solved my
>> performance problem was getting all of the failing devices to finally
>> fail smart so they could be RMAed and replaced at minimal cost.
>
> Well, I suppose that's to some extent the reason I mentioned this.
>
> It seems to me that if a disk is throwing _loads_ of read errors, and
> running dreadfully slowly, one could react to that by favouring
> different disk(s), and only occasionally throwing a read at the duff
> disk, until it either sorts itself out or dies.
>
> My performance went from rubbish to fine simply by removing the
> 360-pending-sector disk from the RAID.  OK, so if the problem is that
> writes are being delayed by the dodgy disk, that's not easy to deal
> with, but looking at the logs makes it look like the reads quite often
> keep targeting the same disk even when several reads just failed and
> got redirected.  This seems suboptimal to me.
>
> Cheers, Phil.

In my case I am pretty sure the delayed reads were causing the issues.

I wonder if a patch might be possible that allows one to put an array
into a mode (or have it enter said mode once a badblock condition has
happened) that causes it to read from at least 2 possible data sources
and return whichever gets there first.  In the raid1 case it would
read from another mirror (esp. if one of the data sources was known to
be flaky); in the raid5/6 case it would need to read one of the
parity disks and calculate the correct data.  That would appear to
help in this sort of situation.  In all other situations the extra
reads would appear to hurt things, but it may produce fewer performance
issues when these sorts of things happen.  No idea how bad this
would be to implement, and it won't help with the case where the
writes are getting delayed because the reads are having serious issues
with bad sectors.  In that case the reads would continue to go through,
but eventually I would think that enough writes would back up to cause
things to stop anyway.
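
Purely to illustrate the "return whichever gets there first" idea, here
is a user-space toy.  It is not how md works and not a proposed kernel
patch; the reader functions are hypothetical stand-ins for a struggling
disk and a healthy mirror, and the 0.3 second delay is arbitrary.

```python
import concurrent.futures
import time


def first_successful_read(readers, timeout=None):
    """Run every reader concurrently; return (name, data) of the first
    one that finishes without raising."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(readers))
    futures = {pool.submit(fn): name for name, fn in readers.items()}
    try:
        for fut in concurrent.futures.as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return futures[fut], fut.result()
        raise IOError("every data source failed")
    finally:
        # Do not wait for the slow source; let it finish in the background.
        pool.shutdown(wait=False)


# Hypothetical stand-ins for a struggling disk and a healthy mirror.
def flaky_mirror():
    time.sleep(0.3)  # pretend the drive is retrying internally
    return b"data from flaky disk"


def healthy_mirror():
    return b"data from healthy disk"


winner, data = first_successful_read(
    {"flaky": flaky_mirror, "healthy": healthy_mirror})
print(winner)
```

Note that the slow read is not cancelled here, it is merely ignored, which
mirrors the concern above: the extra reads still cost something even when
their result is thrown away.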

Recent disk quality does appear to have gone downhill.  With the
previous 160-250GB drives and the later 500GB drives I had not seen
many issues, but the 1-2TB drives appear to be a mess and certainly
don't appear to be aging well, nor does the initial quality appear to
be that good either.

* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-23 22:26     ` Roger Heflin
@ 2011-12-24 10:07       ` Philip Hands
  2011-12-24 14:27         ` Phil Turmel
  0 siblings, 1 reply; 11+ messages in thread
From: Philip Hands @ 2011-12-24 10:07 UTC (permalink / raw)
  To: Roger Heflin; +Cc: 'LinuxRaid'

On Fri, 23 Dec 2011 16:26:39 -0600, Roger Heflin <rogerheflin@gmail.com> wrote:
> On 12/23/2011 03:22 PM, Philip Hands wrote:
> > On Fri, 23 Dec 2011 13:59:21 -0600, Roger Heflin<rogerheflin@gmail.com>  wrote:
> >> On Fri, Dec 23, 2011 at 12:39 PM, Philip Hands<phil@hands.com>  wrote:
> > ...
> >> I had 4 1.5TB Seagate drives from 2009 (bought at different times in
> >> 2009), and 3 of those 4 started getting lots of bad sectors, all within
> >> a 2-month period, and all 3 finally officially failed smart.  They
> >> failed one after another; luckily they failed out over 2-3 weeks, so I
> >> had got the replacements in before I lost data (I was down to no
> >> redundancy for several days in the middle).  While the sectors were
> >> failing and being rewritten, the performance was just ugly.  So even if
> >> raid1 was rewriting the sectors, it does not do anything for performance
> >> when the drives are going bad.  The only thing that solved my
> >> performance problem was getting all of the failing devices to finally
> >> fail smart so they could be RMAed and replaced at minimal cost.
> >
> > Well, I suppose that's to some extent the reason I mentioned this.
> >
> > It seems to me that if a disk is throwing _loads_ of read errors, and
> > running dreadfully slowly, one could react to that by favouring
> > different disk(s), and only occasionally throwing a read at the duff
> > disk, until it either sorts itself out or dies.
> >
> > My performance went from rubbish to fine simply by removing the
> > 360-pending-sector disk from the RAID.  OK, so if the problem is that
> > writes are being delayed by the dodgy disk, that's not easy to deal
> > with, but looking at the logs makes it look like the reads quite often
> > keep targeting the same disk even when several reads just failed and
> > got redirected.  This seems suboptimal to me.
> >
> > Cheers, Phil.
> 
> In mine I am pretty sure the reads being delayed was causing issues.

Last night I started a check of the RAID that contained most of the errors on
that disk, and it's pretty much finished (81%), in which time the Pending
sector count is back up to 53. [Erm, 83% and 54 now -- while writing
this mail]

Clearly it's not a particularly happy drive, so I guess that smart will
eventually diagnose it as faulty, but in the mean time it may be a
useful test case for mdadm.

One of those newly pending sectors was found almost immediately, as I
was able to see from the logs, and while that was being dealt with, it
drove the system load up to about 18, and rendered the system
unresponsive for at least 10 seconds, probably more like 20 or 30 (the
normal load once it had chance to settle down again was about 2, on a 6
core CPU, so it wasn't really that busy).

[84% and 55 pending now -- with the first indication being a spike in
load, followed a minute or two later by mention of the read problems in
the logs, but apparently nothing logged by md, so presumably the read
eventually succeeded]

> I wonder if a patch might be possible that allows one to put an array 
> into a mode (or go into said mode once a badblock condition has 
> happened) that causes it to read from at least 2 possible data sources 
> and return whichever gets there first...

Well, given that something appears to be blocking in a fairly
disastrous way on the read that's not coming back, I was wondering if
there might be some way of having a timeout on those reads that if one
gets no response for long enough (say 10 seconds) reacts by getting the
data from elsewhere, and overwriting the slow sector.

What I find rather interesting is that the sector that I witnessed
failing to read seems to have resulted in the Pending Sector count
increasing without the md code realising that it had a failed sector
that it needed to rewrite, so I'm guessing that the drive spent 30
seconds or so desperately trying to get a read to work, which eventually
happened, thus providing the md code with a successful read, while the
drive knows that that sector is pretty damaged, and marks it as pending.

Just a theory -- feel free to tell me how to test it (while I still have
a reliably broken disk in service).

Given that the disk now has 53 Pending sectors, it would be nice to know
a way of convincing md to rewrite those sectors.  Running checks seems
not to do the trick, because, as said, it will quite often manage to get
the data off the drive, so there's no reason to fix anything, and
meanwhile every time it hits one of these sectors system performance is
severely degraded.

So far, the only ways I've worked out of rewriting the blocks are:

  1) fail the partition out of the RAID, remove it, zero its superblock
     to prevent a quick re-add, and then add it back in again.

  2) use hdparm --read-sector to find the faulty sector, use dd skip=
     to find the same sector in the partition, find the matching sector
     in one of its mirror pairs, and then use dd skip=x | dd seek=x to
     overwrite the block (hoping that the system isn't touching that
     sector at the time) -- I'm not very happy with this option.
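
The arithmetic behind option 2 can be sketched as follows.  For RAID1,
sector N of the md device lives at sector N plus the data offset on each
member partition, so a whole-disk LBA maps to an array sector by
subtracting the partition start and then the data offset.  The function
name and every number below are hypothetical: the partition start would
come from the partition table, and the data offset from "mdadm --examine"
(it is 0 for 0.90 metadata).  All values are in 512-byte sectors.

```python
def disk_lba_to_md_sector(disk_lba, part_start, part_sectors, data_offset=0):
    """Translate a whole-disk LBA into (partition sector, md array sector).

    Valid for RAID1 only, where every member holds a full copy of the
    array starting data_offset sectors into the partition.
    """
    part_sector = disk_lba - part_start
    if not 0 <= part_sector < part_sectors:
        raise ValueError("LBA is outside this partition")
    md_sector = part_sector - data_offset
    if md_sector < 0:
        raise ValueError("LBA falls in the md metadata area, not in data")
    return part_sector, md_sector


# Hypothetical layout: partition starts at LBA 2048000, 2048-sector offset.
part_sector, md_sector = disk_lba_to_md_sector(
    disk_lba=255195136, part_start=2048000,
    part_sectors=976773120, data_offset=2048)
print(part_sector, md_sector)
```

The partition sector is what would go into dd's skip= or seek= against
the member partition (with bs=512), and the array sector is the position
of the same data on the md device itself.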

It would be nice to be able to say:  read block X from that md device,
and write it back to all the devices on which it resides, in a safe
manner.

What would be even better would be a way of saying:  Sector X on Disk Y
is duff, please work out which md device that is part of, and rewrite it
from other sources -- but that's probably asking a bit too much.

Cheers, Phil.
-- 
|)|  Philip Hands [+44 (0)20 8530 9560]    http://www.hands.com/
|-|  HANDS.COM Ltd.                    http://www.uk.debian.org/
|(|  10 Onslow Gardens, South Woodford, London  E18 1NE  ENGLAND

* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-24 10:07       ` Philip Hands
@ 2011-12-24 14:27         ` Phil Turmel
  2011-12-24 15:30           ` Philip Hands
  2011-12-24 15:54           ` Roger Heflin
  0 siblings, 2 replies; 11+ messages in thread
From: Phil Turmel @ 2011-12-24 14:27 UTC (permalink / raw)
  To: Philip Hands; +Cc: Roger Heflin, 'LinuxRaid'

Hi Philip,

On 12/24/2011 05:07 AM, Philip Hands wrote:
[...]
> Last night I started a check of the RAID that contained most of the errors on
> that disk, and it's pretty much finished (81%), in which time the Pending
> sector count is back up to 53. [Erm, 83% and 54 now -- while writing
> this mail]
> 
> Clearly it's not a particularly happy drive, so I guess that smart will
> eventually diagnose it as faulty, but in the mean time it may be a
> useful test case for mdadm.
> 
> One of those newly pending sectors was found almost immediately, as I
> was able to see from the logs, and while that was being dealt with, it
> drove the system load up to about 18, and rendered the system
> unresponsive for at least 10 seconds, probably more like 20 or 30 (the
> normal load once it had chance to settle down again was about 2, on a 6
> core CPU, so it wasn't really that busy).
> 
> [84% and 55 pending now -- with the first indication being a spike in
> load, followed a minute or two later by mention of the read problems in
> the logs, but apparently nothing logged by md, so presumably the read
> eventually succeeded]
> 
>> I wonder if a patch might be possible that allows one to put an array 
>> into a mode (or go into said mode once a badblock condition has 
>> happened) that causes it to read from at least 2 possible data sources 
>> and return whichever gets there first...
> 
> Well, given that something appears to be blocking in a fairly
> disastrous way on the read that's not coming back, I was wondering if
> there might be some way of having a timeout on those reads that if one
> gets no response for long enough (say 10 seconds) reacts by getting the
> data from elsewhere, and overwriting the slow sector.

Have you set up TLER or SCTERC on these drives?  I suspect you haven't, as
these long delays on read errors are typical of default error handling on
consumer drives.
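
For anyone following along, SCT ERC is queried and set with smartctl's
"-l scterc" option.  The sketch below only builds and runs that command;
the function names are hypothetical, /dev/sde is just an example device,
and 70 (7.0 seconds) is an illustrative value rather than a
recommendation for these particular drives.

```python
import subprocess


def scterc_command(device, read_ds=None, write_ds=None):
    """Build the smartctl argv to query or set SCT ERC.

    Times are in tenths of a second (70 means 7.0 seconds).
    With no times given, the command only reports the current setting.
    """
    if (read_ds is None) != (write_ds is None):
        raise ValueError("give both read and write times, or neither")
    if read_ds is None:
        option = "scterc"
    else:
        option = "scterc,%d,%d" % (read_ds, write_ds)
    return ["smartctl", "-l", option, device]


def run_scterc(device, read_ds=None, write_ds=None):
    """Run the command (needs root) and return smartctl's text output."""
    result = subprocess.run(scterc_command(device, read_ds, write_ds),
                            capture_output=True, text=True)
    return result.stdout
```

For example, run_scterc("/dev/sde") reports the current setting, and
run_scterc("/dev/sde", 70, 70) asks the drive to give up on a read or
write after 7 seconds, if the drive supports the command at all.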

Can you show the complete "smartctl -x" output for this failing drive?

Phil

* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-24 14:27         ` Phil Turmel
@ 2011-12-24 15:30           ` Philip Hands
  2011-12-25  0:11             ` Phil Turmel
  2011-12-24 15:54           ` Roger Heflin
  1 sibling, 1 reply; 11+ messages in thread
From: Philip Hands @ 2011-12-24 15:30 UTC (permalink / raw)
  To: 'LinuxRaid'

Hi Phil,

On Sat, 24 Dec 2011 09:27:45 -0500, Phil Turmel <philip@turmel.org> wrote:
> Hi Philip,
...
> > Well, given that something appears to be blocking in a fairly
> > disastrous way on the read that's not coming back, I was wondering if
> > there might be some way of having a timeout on those reads that if one
> > gets no response for long enough (say 10 seconds) reacts by getting the
> > data from elsewhere, and overwriting the slow sector.
> 
> Have you set up TLER or SCTERC on these drives?

The WD Caviar Black model doesn't appear to support that, judging by the:

  Warning: device does not support SCT Error Recovery Control command

in the smartctl output.  As for TLER, threads like this:

  http://www.eggxpert.com/forums/thread/602903.aspx

suggest that there used to be a DOS utility for doing it, but that WD
have since disabled the ability to set that -- and TBH the chances of me
scheduling down time, and working out how to boot DOS on a system with
no floppy, which is in a co-lo centre where I generally am not, are slim
to say the least -- I'd be more likely to simply replace the disks if
that's the only solution, since I'm not impressed with them so far.

> I suspect you haven't, as these long delays on read errors are typical
> of default error handling on consumer drives.

That's my understanding too.

Unfortunately, this enlightenment only came to me after I'd already
bought the el-cheapo drives, rather than the overly expensive RAID-ready
model.

You may say that I deserve what I'm getting, but I'm rather used to
Linux being able to get the best out of cheap hardware, and was hoping
that this would be another example where that could be made to be the case.

> Can you show the complete "smartctl -x" output for this failing drive?

  http://hands.com/~phil/tmp/sheikh.hands.com-smartctl-e-sde--20111224

Cheers, Phil.
-- 
|)|  Philip Hands [+44 (0)20 8530 9560]    http://www.hands.com/
|-|  HANDS.COM Ltd.                    http://www.uk.debian.org/
|(|  10 Onslow Gardens, South Woodford, London  E18 1NE  ENGLAND

* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-24 14:27         ` Phil Turmel
  2011-12-24 15:30           ` Philip Hands
@ 2011-12-24 15:54           ` Roger Heflin
  2011-12-25  0:24             ` Phil Turmel
  1 sibling, 1 reply; 11+ messages in thread
From: Roger Heflin @ 2011-12-24 15:54 UTC (permalink / raw)
  To: Phil Turmel; +Cc: Philip Hands, 'LinuxRaid'

On 12/24/2011 08:27 AM, Phil Turmel wrote:
> Hi Philip,
>
> On 12/24/2011 05:07 AM, Philip Hands wrote:
> [...]
>> Last night I started a check of the RAID that contained most of the errors on
>> that disk, and it's pretty much finished (81%), in which time the Pending
>> sector count is back up to 53. [Erm, 83% and 54 now -- while writing
>> this mail]
>>
>> Clearly it's not a particularly happy drive, so I guess that smart will
>> eventually diagnose it as faulty, but in the mean time it may be a
>> useful test case for mdadm.
>>
>> One of those newly pending sectors was found almost immediately, as I
>> was able to see from the logs, and while that was being dealt with, it
>> drove the system load up to about 18, and rendered the system
>> unresponsive for at least 10 seconds, probably more like 20 or 30 (the
>> normal load once it had a chance to settle down again was about 2, on a 6
>> core CPU, so it wasn't really that busy).
>>
>> [84% and 55 pending now -- with the first indication being a spike in
>> load, followed a minute or two later by mention of the read problems in
>> the logs, but apparently nothing logged by md, so presumably the read
>> eventually succeeded]
>>
>>> I wonder if a patch might be possible that allows one to put an array
>>> into a mode (or go into said mode once a badblock condition has
>>> happened) that causes it to read from at least 2 possible data sources
>>> and return whichever gets there first...
>>
>> Well, given that something appears to be blocking in a fairly
>> disastrous way on the read that's not coming back, I was wondering if
>> there might be some way of having a timeout on those reads that if one
>> gets no response for long enough (say 10 seconds) reacts by getting the
>> data from elsewhere, and overwriting the slow sector.
>
> Have you set up TLER or SCTERC on these drives?  I suspect you haven't, as
> these long delays on read errors are typical of default error handling on
> consumer drives.
>
> Can you show the complete "smartctl -x" output for this failing drive?
>
> Phil

On my Seagates I turned the SCTERC down really low (i.e. 0.2 seconds),
and from what I could see it made no obvious difference to the length
of time that the system paused: the pauses appeared to stay at about 30
seconds. I guess that implies that the actual read-failure timeout was
being hit, rather than the disk returning an error in a reasonable
time. From the log, each time it was forcing a re-write it appeared to
be 8 sections of 8 sectors each, so 64 sectors or 32k of data. I seem
to remember there is a way to turn down the disk op timeout, but at
least on my system turning it down lower would mean that the disks
might not have enough time to spin up out of a sleep.
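
For readers following along, the two numbers in this message can be checked with a small sketch. It assumes 512-byte logical sectors, and it assumes the "disk op timeout" is the per-device command timeout that Linux exposes at /sys/block/<dev>/device/timeout (in seconds), which would fit the roughly 30 second pauses. The function names are illustrative only.

```python
from pathlib import Path

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def rewrite_size_bytes(sections: int = 8, sectors_per_section: int = 8) -> int:
    """Size of one forced re-write: sections x sectors x bytes per sector."""
    return sections * sectors_per_section * SECTOR_BYTES


def command_timeout_seconds(dev: str, sysfs_root: str = "/sys/block") -> int:
    """Read the kernel's per-device command timeout, in seconds.

    sysfs_root is a parameter only so the function can be pointed at a
    scratch directory instead of the live system.
    """
    path = Path(sysfs_root) / dev / "device" / "timeout"
    return int(path.read_text().strip())
```

With the defaults, rewrite_size_bytes() gives 8 * 8 * 512 = 32768 bytes, which matches the "64 sectors, 32k" figure. Calling command_timeout_seconds("sde") on a live system would show whether the pause length lines up with the kernel timeout rather than with the drive's own setting.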


* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-24 15:30           ` Philip Hands
@ 2011-12-25  0:11             ` Phil Turmel
  0 siblings, 0 replies; 11+ messages in thread
From: Phil Turmel @ 2011-12-25  0:11 UTC (permalink / raw)
  To: Philip Hands; +Cc: 'LinuxRaid'

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/24/2011 10:30 AM, Philip Hands wrote:
> On Sat, 24 Dec 2011 09:27:45 -0500, Phil Turmel <philip@turmel.org> wrote:
[...]
>> Have you set up TLER or SCTERC on these drives?
> 
> The WD Caviar Black model doesn't appear to support that, judging by the:
> 
>   Warning: device does not support SCT Error Recovery Control command
> 
> in the smartctl output.  As for TLER, threads like this:
> 
>   http://www.eggxpert.com/forums/thread/602903.aspx
> 
> suggest that there used to be a DOS utility for doing it, but that WD
> have since disabled the ability to set that -- and TBH the chances of me
> scheduling down time, and working out how to boot DOS on a system with
> no floppy, which is in a co-lo centre where I generally am not, are slim
> to say the least -- I'd be more likely to simply replace the disks if
> that's the only solution, since I'm not impressed with them so far.

Yup.  You're stuck.  I read about this and deliberately avoided WD drives.

>> I suspect you haven't, as these long delays on read errors are typical
>> of default error handling on consumer drives.
> 
> That's my understanding too.
> 
> Unfortunately, this enlightenment only came to me after I'd already
> bought the el-cheapo drives, rather than the overly expensive RAID-ready
> model.

I got burned this summer myself.  My older 1T Seagate drives support SCTERC.
I bought some 2T Seagate drives in July that don't, and didn't notice right
away that my init script was failing to set those drives.

I've since taken the 2T Seagate drives out of RAID service.  I'm using them
for non-raid non-critical media storage and for my offsite backup rotations.
Without a time limit in the drive, extreme system delays are unavoidable.  MD
does not have timeouts, so a delayed error report on one drive can block the
entire array for the duration.
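
A minimal sketch of the kind of boot-time check described here, so that a drive refusing the setting is noticed rather than silently skipped. It assumes smartctl is installed and uses its "-l scterc,READ,WRITE" option, which takes limits in units of 100 ms. The match on "not support" is based on the warning quoted above; the exact wording may differ between smartctl versions, so treat that test as an assumption. Function names are hypothetical.

```python
import subprocess


def scterc_command(dev: str, deciseconds: int = 70) -> list:
    """Build the smartctl call that sets the read and write ERC limits.

    smartctl takes the limits in units of 100 ms, so 70 means 7.0 seconds.
    """
    return ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", dev]


def scterc_unsupported(smartctl_output: str) -> bool:
    """True if the output says the drive lacks SCT Error Recovery Control.

    Assumption: the message mentions the feature by name together with
    "not support", as in the warning quoted in this thread.
    """
    return ("SCT Error Recovery Control" in smartctl_output
            and "not support" in smartctl_output)


def set_scterc(dev: str, deciseconds: int = 70) -> bool:
    """Try to set the limit; return False if smartctl fails or the drive refuses."""
    result = subprocess.run(scterc_command(dev, deciseconds),
                            capture_output=True, text=True, check=False)
    output = result.stdout + result.stderr
    return result.returncode == 0 and not scterc_unsupported(output)
```

An init script built this way could log every device for which set_scterc() returns False, which is the failure that went unnoticed above.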

I've also noticed the scattered complaints across the 'net that the major
manufacturers are crippling SCTERC to push buyers to the enterprise drives.

> You may say that I deserve what I'm getting, but I'm rather used to
> Linux being able to get the best out of cheap hardware, and was hoping
> that this would be another example where that could be made to be the case.

I've been doing so as well, but the manufacturers are trying to close this
off.  I've identified the Hitachi Deskstar 5K3000 as a consumer-grade drive
that still supports SCTERC, but I'm not sure how long that'll last.  Careful
reading and re-reading of drive specs will be part of my future purchasing
plans.

For your situation, you either need to figure out how to tolerate the long
delays, or swap your drives for models that can report errors quickly.

Phil
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk72ahoACgkQBP+iHzflm3CE3gCfVmH2sDMBxeKxajZYyfqFB5j1
n60AnitjWSZZh88GuSc+Fps61lvCHbiI
=+R3p
-----END PGP SIGNATURE-----


* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-24 15:54           ` Roger Heflin
@ 2011-12-25  0:24             ` Phil Turmel
  2011-12-25 15:07               ` Philip Hands
  0 siblings, 1 reply; 11+ messages in thread
From: Phil Turmel @ 2011-12-25  0:24 UTC (permalink / raw)
  To: Roger Heflin; +Cc: Philip Hands, 'LinuxRaid'

On 12/24/2011 10:54 AM, Roger Heflin wrote:
> On my Seagates I turned down the SCTERC to really low (ie .2 seconds)
> and from what I could see it did not make an obvious difference in
> the length of the time that the system paused, the pauses appeared to
> stay at about 30 seconds...which I guess implies that the actual read
> failed timeout was being hit rather than the disk returning an error
> in a reasonable time...from the log each time it was forcing a
> re-write it appeared to be 8 sections of 8 sectors each so 32k of
> data, 64 sectors.    I seem to remember there is a way to turn down
> the disk op timeout...but at least on my system turning it down lower
> would mean that the disks might not have enough time to spinup out of
> a sleep...

On the drives I've checked closely, any SCTERC setting below 6.5 seconds
is discarded and treated as zero (no limit).  Setting timeouts in the
driver stack below the timeout in the drive is counterproductive, as
drives won't abandon the error recovery attempt to reply to the controller's
next command.  So the drive gets kicked out of the array as completely
failed (unresponsive) instead of dealing with the localized read error.
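
The ordering rule in the paragraph above can be written down as a small check. This is only a model of the behaviour described in this message: the 6.5 second floor is what was observed on the drives mentioned here, not a value from any specification, and the function names are made up for the illustration.

```python
from typing import Optional

# Floor reported in this thread for some drives; an observation, not a spec value.
MIN_HONOURED_ERC_S = 6.5


def effective_erc_seconds(requested_s: float) -> Optional[float]:
    """Model the described behaviour: requests under the floor act as 'no limit'."""
    if requested_s < MIN_HONOURED_ERC_S:
        return None  # drive discards the setting and recovers for as long as it likes
    return requested_s


def drive_reports_first(requested_erc_s: float, kernel_timeout_s: float) -> bool:
    """True if the drive should report the read error before the driver stack gives up."""
    erc = effective_erc_seconds(requested_erc_s)
    return erc is not None and erc < kernel_timeout_s
```

Under this model, a 7.0 second ERC with a 30 second driver timeout is fine, a 0.2 second request behaves like no limit at all, and a driver timeout shorter than the drive's limit is the counterproductive case where the whole drive gets kicked.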

If I recall my Seagate spec right, the 6.5 second timeout wouldn't count
the spin-up time.  I haven't tested that, as my application doesn't sleep.

Phil


* Re: RAID1 seems not to be able to scrub pending sectors shown by smart
  2011-12-25  0:24             ` Phil Turmel
@ 2011-12-25 15:07               ` Philip Hands
  0 siblings, 0 replies; 11+ messages in thread
From: Philip Hands @ 2011-12-25 15:07 UTC (permalink / raw)
  To: 'LinuxRaid'


On Sat, 24 Dec 2011 19:24:37 -0500, Phil Turmel <philip@turmel.org> wrote:
> On 12/24/2011 10:54 AM, Roger Heflin wrote:
> > On my Seagates I turned down the SCTERC to really low (ie .2 seconds)
> > and from what I could see it did not make an obvious difference in
> > the length of the time that the system paused, the pauses appeared to
> > stay at about 30 seconds...which I guess implies that the actual read
> > failed timeout was being hit rather than the disk returning an error
> > in a reasonable time...from the log each time it was forcing a
> > re-write it appeared to be 8 sections of 8 sectors each so 32k of
> > data, 64 sectors.    I seem to remember there is a way to turn down
> > the disk op timeout...but at least on my system turning it down lower
> > would mean that the disks might not have enough time to spinup out of
> > a sleep...
> 
> On the drives I've checked closely, any SCTERC setting below 6.5 seconds
> is discarded and treated as zero (no limit).  Setting timeouts in the
> driver stack below the timeout in the drive is counterproductive, as
> drives won't abandon the error recovery attempt to reply to the controller's
> next command.  So the drive gets kicked out of the array as completely
> failed (unresponsive) instead of dealing with the localized read
> error.

Well, that's fair enough, but I'm guessing that it would be relatively
cheap to notice the fact that the read took _ages_ to return, and treat
that as a failure of sorts, even if the drive eventually claims success.

Then, at least the sector would be rewritten, which would either solve
the problem by refreshing the data, or provoke the sector to be re-mapped
if the physical sector was really damaged.  That way you'd not be
constantly bumping into the same pending sectors, provoking extended
read attempts, and thus degrading the whole system's performance.
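
To make the "took ages, so treat it as a failure" idea concrete, here is a userspace sketch of the bookkeeping only. It is not how md works and not a proposed patch: a real implementation would live in the kernel's read path, and a userspace read would need to bypass the page cache to see drive latency at all. The 10 second threshold is the figure floated earlier in the thread; the function names are hypothetical.

```python
import os
import time

SLOW_READ_SECONDS = 10.0  # threshold suggested earlier in the thread


def timed_pread(fd: int, length: int, offset: int) -> tuple:
    """Read `length` bytes at `offset`; return (data, elapsed seconds)."""
    start = time.monotonic()
    data = os.pread(fd, length, offset)
    return data, time.monotonic() - start


def is_suspect(elapsed_s: float, threshold_s: float = SLOW_READ_SECONDS) -> bool:
    """Treat a read that succeeded, but only after a long delay, as a soft failure."""
    return elapsed_s >= threshold_s
```

A caller would queue any region for which is_suspect() is true for a rewrite from a healthy mirror, which is the step that either refreshes the data or provokes the re-map.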

Alternatively, some way of nudging mdadm into rewriting a sector on one
device from a good copy elsewhere in the RAID could be combined with
something that watches the logs for read failures, without needing to
add any extra checks to the normal operational code.
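
The log-watching half of that idea might look roughly like the sketch below. No log lines were posted in this thread, so the message format is an assumption: it expects kernel block-layer messages containing "I/O error, dev <name>, sector <number>", which is how kernels of this era commonly reported failed reads. What to do with the resulting sector list is left open, since md offers no per-sector rewrite request as far as this thread establishes.

```python
import re
from collections import defaultdict

# Assumed message shape; adjust the pattern to whatever the local kernel logs.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def failed_sectors(log_lines) -> dict:
    """Collect the distinct failing sector numbers per device from log lines."""
    found = defaultdict(set)
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            found[match.group(1)].add(int(match.group(2)))
    # Sorted lists make the result stable and easy to feed to a later step.
    return {dev: sorted(sectors) for dev, sectors in found.items()}
```

Feeding it the saved syslog (for example, the lines of an open file) yields a per-device list of sectors that a separate tool could then try to rewrite from another mirror.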

Cheers, Phil.
-- 
|)|  Philip Hands [+44 (0)20 8530 9560]    http://www.hands.com/
|-|  HANDS.COM Ltd.                    http://www.uk.debian.org/
|(|  10 Onslow Gardens, South Woodford, London  E18 1NE  ENGLAND



end of thread, other threads:[~2011-12-25 15:07 UTC | newest]

Thread overview: 11+ messages
2011-12-23 18:39 RAID1 seems not to be able to scrub pending sectors shown by smart Philip Hands
2011-12-23 19:59 ` Roger Heflin
2011-12-23 21:22   ` Philip Hands
2011-12-23 22:26     ` Roger Heflin
2011-12-24 10:07       ` Philip Hands
2011-12-24 14:27         ` Phil Turmel
2011-12-24 15:30           ` Philip Hands
2011-12-25  0:11             ` Phil Turmel
2011-12-24 15:54           ` Roger Heflin
2011-12-25  0:24             ` Phil Turmel
2011-12-25 15:07               ` Philip Hands
