linux-raid.vger.kernel.org archive mirror
* Read errors and SMART tests
@ 2008-12-20  1:30 Kevin Shanahan
  2008-12-20  4:13 ` David Lethe
  0 siblings, 1 reply; 7+ messages in thread
From: Kevin Shanahan @ 2008-12-20  1:30 UTC (permalink / raw)
  To: linux-raid

Hi,

Just a quick question about SMART tests :-

I have a Samsung drive returning read errors, e.g.:

Dec 20 08:59:24 hermes kernel: ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Dec 20 08:59:24 hermes kernel: ata4.00: irq_stat 0x40000008
Dec 20 08:59:24 hermes kernel: ata4.00: cmd 60/80:00:3f:0e:50/00:00:24:00:00/40 tag 0 ncq 65536 in
Dec 20 08:59:24 hermes kernel:          res 41/40:00:61:0e:50/00:00:24:00:00/40 Emask 0x409 (media error) <F>
Dec 20 08:59:24 hermes kernel: ata4.00: status: { DRDY ERR }
Dec 20 08:59:24 hermes kernel: ata4.00: error: { UNC }
Dec 20 08:59:24 hermes kernel: ata4.00: configured for UDMA/133
Dec 20 08:59:24 hermes kernel: ata4: EH complete
Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

So, I ran the short (and long) self-test and it showed read
failures. Then I put in a new drive to replace it and ran the short
self-test again - this one is showing read errors also:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%      2572         294961
# 2  Short offline       Aborted by host               20%      2572         -

I'm guessing this is just bad luck, i.e. drives from the same bad
batch. Erm, so my question - am I right in assuming that the SMART
self-test is not influenced in any way by bad cables, etc.? If the
drive returns read errors on its self-test, the error is within the
drive itself, right?

Thanks,
Kevin.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Read errors and SMART tests
  2008-12-20  1:30 Read errors and SMART tests Kevin Shanahan
@ 2008-12-20  4:13 ` David Lethe
  2008-12-20  5:22   ` Kevin Shanahan
  0 siblings, 1 reply; 7+ messages in thread
From: David Lethe @ 2008-12-20  4:13 UTC (permalink / raw)
  To: Kevin Shanahan, linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Kevin Shanahan
> Sent: Friday, December 19, 2008 7:31 PM
> To: linux-raid@vger.kernel.org
> Subject: Read errors and SMART tests
> 
> Hi,
> 
> Just a quick question about SMART tests :-
> 
> I have a Samsung drive returning read errors, e.g.:
> 
> Dec 20 08:59:24 hermes kernel: ata4.00: exception Emask 0x0 SAct 0x1
> SErr 0x0 action 0x0
> Dec 20 08:59:24 hermes kernel: ata4.00: irq_stat 0x40000008
> Dec 20 08:59:24 hermes kernel: ata4.00: cmd
> 60/80:00:3f:0e:50/00:00:24:00:00/40 tag 0 ncq 65536 in
> Dec 20 08:59:24 hermes kernel:          res
> 41/40:00:61:0e:50/00:00:24:00:00/40 Emask 0x409 (media error) <F>
> Dec 20 08:59:24 hermes kernel: ata4.00: status: { DRDY ERR }
> Dec 20 08:59:24 hermes kernel: ata4.00: error: { UNC }
> Dec 20 08:59:24 hermes kernel: ata4.00: configured for UDMA/133
> Dec 20 08:59:24 hermes kernel: ata4: EH complete
> Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte
> hardware sectors (1000205 MB)
> Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
> Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00
> 00
> Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled,
> read cache: enabled, doesn't support DPO or FUA
> 
> So, I ran the short (and long) selftest and it showed read
> failures. Then I put in a new drive to replace it and ran the short
> selftest again - this one is showing read errors also:
> 
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision
> number = 1
> Num  Test_Description    Status                  Remaining
> LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed: read failure       20%      2572
> 294961
> # 2  Short offline       Aborted by host               20%      2572
> -
> 
> I'm guessing this is just bad luck, i.e. drives from the same bad
> batch. Erm, so my question - Am I right in assuming that the SMART
> self test is not influenced in any way by bad cables, etc.? If the
> drive returns read errors on it's self-test the error is within the
> drive itself, right?
> 
> Thanks,
> Kevin.

This shows nothing more than you having a single bad block.  You have a
1TB drive, for crying out loud - they can't all stay perfect ;)

This is no reason to assume the disk is bad, or that it has anything to
do with cabling.  When you wrote that you have read "errors", does that
mean you have dozens or hundreds of individual unreadable blocks, or
could you just have this one bad block?

Why not use dd to do raw reads from /dev/sdd, sending the output to
/dev/null and starting at the next LBA? If dd hits another read error
it will tell you; then repeat the process and carry on. I am assuming
you aren't using any software RAID1/5/6, so just fix the bad blocks by
using dd to write /dev/zero over them.
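Something like this, say (an untested sketch - the LBA is made up, and
I'm using a scratch file below in place of /dev/sdd so the mechanics
are visible without touching a real disk; on the real device you'd use
bs=512 with skip/seek set to the failing LBA from the self-test log):

```shell
# Scratch file standing in for /dev/sdd (1024 sectors of 512 bytes).
dev=$(mktemp)
dd if=/dev/zero of="$dev" bs=512 count=1024 2>/dev/null

lba=100   # hypothetical bad-block LBA; use the real one from the log

# Raw read of the suspect sector; on a real bad block dd reports an
# I/O error here instead of succeeding.
dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null && echo "read ok"

# Overwrite the sector with zeros so the drive can remap it;
# conv=notrunc leaves the rest of the device untouched.
dd if=/dev/zero of="$dev" bs=512 seek="$lba" count=1 conv=notrunc 2>/dev/null && echo "rewrite ok"
```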

When you write to the block, the disk will either map a reserved block
to it or just correct the ECC without remapping. Which happens depends
on the root cause and on details you can't get without running more
sophisticated software.

David




* Re: Read errors and SMART tests
  2008-12-20  4:13 ` David Lethe
@ 2008-12-20  5:22   ` Kevin Shanahan
  2008-12-20  6:54     ` David Lethe
  2009-01-14 20:59     ` Bill Davidsen
  0 siblings, 2 replies; 7+ messages in thread
From: Kevin Shanahan @ 2008-12-20  5:22 UTC (permalink / raw)
  To: David Lethe; +Cc: linux-raid

On Fri, Dec 19, 2008 at 10:13:14PM -0600, David Lethe wrote:
> This shows nothing more than you having a single bad block.  You have a
> 1TB drive, for crying out loud, they can't all stay perfect ;)

Heh, true.

> This is no reason to assume the disk is bad, or that it has anything to
> do with cabling.   When you wrote you have 
> read "errors" .. does that mean you have dozens, hundreds of individual
> unreadable blocks, or 
> could you just have just this one bad block.

Sorry, I didn't provide a lot of detail there. The "bad" drive,
/dev/sdd, was doing more than just failing the self-test:

Dec 20 06:55:20 hermes kernel: ata4.00: exception Emask 0x0 SAct 0x5 SErr 0x0 action 0x0
Dec 20 06:55:20 hermes kernel: ata4.00: irq_stat 0x40000008
Dec 20 06:55:20 hermes kernel: ata4.00: cmd 60/78:10:47:d5:fa/00:00:1e:00:00/40 tag 2 ncq 61440 in
Dec 20 06:55:20 hermes kernel:          res 51/40:00:b9:d5:fa/00:00:1e:00:00/40 Emask 0x409 (media error) <F>
Dec 20 06:55:20 hermes kernel: ata4.00: status: { DRDY ERR }
Dec 20 06:55:20 hermes kernel: ata4.00: error: { UNC }
Dec 20 06:55:20 hermes kernel: ata4.00: configured for UDMA/133
Dec 20 06:55:20 hermes kernel: ata4: EH complete
Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

(repeats several times)

Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755016 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755024 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755032 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755040 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755048 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755056 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755064 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755072 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755080 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755088 on sdd1)

...

Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165696 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165704 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165712 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165720 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165728 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165736 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165744 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165752 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165760 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165768 on sdd1)

...

Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181440 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181448 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181456 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181464 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181472 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181480 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181488 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181496 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181504 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181512 on sdd1)

...

Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552584 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552592 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552600 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552608 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552616 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552624 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552632 on sdd1)

...

Dec 20 08:16:19 hermes kernel: raid5:md5: read error corrected (8 sectors at 613020008 on sdd1)

That's just a sample from today - it's been doing similar things for
several days.  So the drive was hanging in there in the array, thanks
to the error correction, but it was of course impacting performance.

Anyway, when I put the replacement drive in I decided to do a self-test
before adding it to the array, and I guess I was a bit concerned that
it immediately failed the test. Since it was inserted into the same
slot in the drive cage, same cable, etc., I wondered if those factors
can affect a self-test. My assumption was no, but I thought I'd ask.

Cheers,
Kevin.


* RE: Read errors and SMART tests
  2008-12-20  5:22   ` Kevin Shanahan
@ 2008-12-20  6:54     ` David Lethe
  2008-12-20  9:09       ` Kevin Shanahan
  2009-01-14 20:59     ` Bill Davidsen
  1 sibling, 1 reply; 7+ messages in thread
From: David Lethe @ 2008-12-20  6:54 UTC (permalink / raw)
  To: Kevin Shanahan; +Cc: linux-raid



> -----Original Message-----
> From: Kevin Shanahan [mailto:kmshanah@disenchant.net]
> Sent: Friday, December 19, 2008 11:23 PM
> To: David Lethe
> Cc: linux-raid@vger.kernel.org
> Subject: Re: Read errors and SMART tests
> 
> On Fri, Dec 19, 2008 at 10:13:14PM -0600, David Lethe wrote:
> > This shows nothing more than you having a single bad block.  You have a
> > 1TB drive, for crying out loud, they can't all stay perfect ;)
> 
> Heh, true.
> 
> > This is no reason to assume the disk is bad, or that it has anything to
> > do with cabling.   When you wrote you have
> > read "errors" .. does that mean you have dozens, hundreds of individual
> > unreadable blocks, or
> > could you just have just this one bad block.
> 
> Sorry, I didn't provide a lot of detail there. The "bad" drive,
> /dev/sdd was doing more than just failing the self test:
> 
> Dec 20 06:55:20 hermes kernel: ata4.00: exception Emask 0x0 SAct 0x5
> SErr 0x0 action 0x0
> Dec 20 06:55:20 hermes kernel: ata4.00: irq_stat 0x40000008
> Dec 20 06:55:20 hermes kernel: ata4.00: cmd
> 60/78:10:47:d5:fa/00:00:1e:00:00/40 tag 2 ncq 61440 in
> Dec 20 06:55:20 hermes kernel:          res
> 51/40:00:b9:d5:fa/00:00:1e:00:00/40 Emask 0x409 (media error) <F>
> Dec 20 06:55:20 hermes kernel: ata4.00: status: { DRDY ERR }
> Dec 20 06:55:20 hermes kernel: ata4.00: error: { UNC }
> Dec 20 06:55:20 hermes kernel: ata4.00: configured for UDMA/133
> Dec 20 06:55:20 hermes kernel: ata4: EH complete
> Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte
> hardware sectors (1000205 MB)
> Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
> Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00
> 00
> Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled,
> read cache: enabled, doesn't support DPO or FUA
> 
> (repeats several times)
> 
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755016 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755024 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755032 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755040 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755048 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755056 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755064 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755072 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755080 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 519755088 on sdd1)
> 
> ...
> 
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165696 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165704 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165712 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165720 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165728 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165736 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165744 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165752 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165760 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613165768 on sdd1)
> 
> ...
> 
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181440 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181448 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181456 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181464 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181472 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181480 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181488 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181496 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181504 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613181512 on sdd1)
> 
> ...
> 
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613552584 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613552592 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613552600 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613552608 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613552616 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613552624 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613552632 on sdd1)
> 
> ...
> 
> Dec 20 08:16:19 hermes kernel: raid5:md5: read error corrected (8
> sectors at 613020008 on sdd1)
> 
> That's just a sample from today - it's been doing similar things for
> several days.  So the drive was hanging in there in the array, thanks
> to the error correction, but it was of course impacting performance.
> 
> Anyway, when I put the replacement drive in I decided to do a self
> test before adding it to the array and I guess I was a bit concerned
> that it immediately failed the test. Since it was inserted into the
> same slot in the drive cage, same cable, etc. I wondered if those
> factors can affect a self test. My assumption was no, but I thought
> I'd ask.
> 
> Cheers,
> Kevin.

This particular test terminates when the FIRST bad block is found.  It
is not an indication of a drive under stress or in need of immediate
replacement.  I don't have the desire or time to look up how many
reserved blocks that disk has, but I wouldn't be surprised if it was
well over 10,000.  The count is certainly documented in the product
manual, but not necessarily the data sheet, and certainly not on the
outside of the box.  (I'm curious - if you look it up, please post it.)

Time for you to run full consistency checks/repairs.  These errors
could be the result of something relatively benign, like unexpected
power loss.
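With md that's just a write to sysfs, e.g. (sketch - needs root, and
"md5" is whatever your array is called; use "repair" instead of
"check" to have md also rewrite mismatched parity):

```shell
# Kick off an md consistency check via sysfs (requires root).
md=md5
action=check
if [ -w "/sys/block/$md/md/sync_action" ]; then
    echo "$action" > "/sys/block/$md/md/sync_action"
    # Progress appears in /proc/mdstat; mismatch count afterwards:
    cat "/sys/block/$md/md/mismatch_cnt"
else
    echo "skipping: no writable sync_action for $md (not root, or no such array)"
fi
```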
 



* Re: Read errors and SMART tests
  2008-12-20  6:54     ` David Lethe
@ 2008-12-20  9:09       ` Kevin Shanahan
  2008-12-20 21:46         ` David Greaves
  0 siblings, 1 reply; 7+ messages in thread
From: Kevin Shanahan @ 2008-12-20  9:09 UTC (permalink / raw)
  To: David Lethe; +Cc: linux-raid

On Sat, Dec 20, 2008 at 12:54:24AM -0600, David Lethe wrote:
> This particular test terminates when the FIRST bad block is found.
> It is not an indication of a drive in stress or immediate
> replacement.  I don't have the desire or time to look up how many
> reserved blocks that disk has, but I wouldn't be surprised if it was
> well over 10,000.  The count is certainly documented in the product
> manual, but not necessarily the data sheet, and certainly not on the
> outside of the box.  (I'm curious, if you look it up, please post
> it).

Sorry, I didn't have any luck finding that info.

Data sheet - http://www.samsung.com/global/system/business/hdd/prdmodel/2008/8/19/525716F1_DT_R4.8.pdf
Product manual - http://downloadcenter.samsung.com/content/UM/200704/20070419200104171_3.5_Install_Gudie_Eng_200704.pdf

> Time for you to run full consistency check/repairs.

You mean array consistency? Yeah, I've done that. This drive was
removed, raid superblock zeroed and then re-added to the array on
Thursday morning, so the entire drive had been re-written only
recently.

Dec 18 04:16:04 hermes kernel: md: bind<sdd1>
Dec 18 04:16:08 hermes kernel: RAID5 conf printout:
Dec 18 04:16:08 hermes kernel:  --- rd:10 wd:9
Dec 18 04:16:08 hermes kernel:  disk 0, o:1, dev:sde1
Dec 18 04:16:08 hermes kernel:  disk 1, o:1, dev:sdf1
Dec 18 04:16:08 hermes kernel:  disk 2, o:1, dev:sdg1
Dec 18 04:16:08 hermes kernel:  disk 3, o:1, dev:sdk1
Dec 18 04:16:08 hermes kernel:  disk 4, o:1, dev:sdj1
Dec 18 04:16:08 hermes kernel:  disk 5, o:1, dev:sdi1
Dec 18 04:16:08 hermes kernel:  disk 6, o:1, dev:sdh1
Dec 18 04:16:08 hermes kernel:  disk 7, o:1, dev:sdd1
Dec 18 04:16:08 hermes kernel:  disk 8, o:1, dev:sdc1
Dec 18 04:16:08 hermes kernel:  disk 9, o:1, dev:sdl1
Dec 18 04:16:08 hermes mdadm[1949]: RebuildStarted event detected on md device /dev/md5
Dec 18 04:16:08 hermes kernel: md: recovery of RAID array md5
Dec 18 04:16:08 hermes kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Dec 18 04:16:08 hermes kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Dec 18 04:16:08 hermes kernel: md: using 128k window, over a total of 976759936 blocks.
Dec 18 08:41:08 hermes mdadm[1949]: Rebuild20 event detected on md device /dev/md5
Dec 18 11:46:08 hermes mdadm[1949]: Rebuild40 event detected on md device /dev/md5
Dec 18 14:35:08 hermes mdadm[1949]: Rebuild60 event detected on md device /dev/md5
Dec 18 17:20:08 hermes mdadm[1949]: Rebuild80 event detected on md device /dev/md5
Dec 18 19:58:05 hermes kernel: md: md5: recovery done.
Dec 18 19:58:05 hermes kernel: RAID5 conf printout:
Dec 18 19:58:05 hermes kernel:  --- rd:10 wd:10
Dec 18 19:58:05 hermes kernel:  disk 0, o:1, dev:sde1
Dec 18 19:58:05 hermes kernel:  disk 1, o:1, dev:sdf1
Dec 18 19:58:05 hermes kernel:  disk 2, o:1, dev:sdg1
Dec 18 19:58:05 hermes kernel:  disk 3, o:1, dev:sdk1
Dec 18 19:58:05 hermes kernel:  disk 4, o:1, dev:sdj1
Dec 18 19:58:05 hermes kernel:  disk 5, o:1, dev:sdi1
Dec 18 19:58:05 hermes kernel:  disk 6, o:1, dev:sdh1
Dec 18 19:58:05 hermes kernel:  disk 7, o:1, dev:sdd1
Dec 18 19:58:05 hermes kernel:  disk 8, o:1, dev:sdc1
Dec 18 19:58:05 hermes kernel:  disk 9, o:1, dev:sdl1
Dec 18 19:58:05 hermes mdadm[1949]: RebuildFinished event detected on md device /dev/md5
Dec 18 19:58:05 hermes mdadm[1949]: SpareActive event detected on md device /dev/md5, component device /dev/sdd1

And then, e.g.

Dec 18 22:17:44 hermes kernel: ata4.00: exception Emask 0x0 SAct 0xc3f SErr 0x0 action 0x0
Dec 18 22:17:44 hermes kernel: ata4.00: irq_stat 0x40000008
Dec 18 22:17:44 hermes kernel: ata4.00: cmd 60/58:50:c7:b1:c6/00:00:1e:00:00/40 tag 10 ncq 45056 in
Dec 18 22:17:44 hermes kernel:          res 41/40:00:ca:b1:c6/00:00:1e:00:00/40 Emask 0x409 (media error) <F>
Dec 18 22:17:44 hermes kernel: ata4.00: status: { DRDY ERR }
Dec 18 22:17:44 hermes kernel: ata4.00: error: { UNC }
Dec 18 22:17:44 hermes kernel: ata4.00: configured for UDMA/133
Dec 18 22:17:44 hermes kernel: ata4: EH complete
Dec 18 22:17:44 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Dec 18 22:17:44 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
Dec 18 22:17:44 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Dec 18 22:17:44 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

There are lots of these.

hermes:~# zgrep UNC /var/log/syslog{.1.gz,.0,} | wc -l
385

Of the remaining drives, SMART attributes for /dev/sd[cghijkl] all show:

  196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
  197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

/dev/sde shows:

  196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
  197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3

/dev/sdf shows:

  196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       2
  197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

Unfortunately the original /dev/sdd isn't currently attached, but I'll
hook that up on Monday and check. I'd expect to see some high numbers
there.

> These errors could be
> Result of something relatively benign, like unexpected power loss.

Sorry, are you saying that about the errors from the libata layer, or
just the errors from the md layer?

Cheers,
Kevin.


* Re: Read errors and SMART tests
  2008-12-20  9:09       ` Kevin Shanahan
@ 2008-12-20 21:46         ` David Greaves
  0 siblings, 0 replies; 7+ messages in thread
From: David Greaves @ 2008-12-20 21:46 UTC (permalink / raw)
  To: Kevin Shanahan; +Cc: David Lethe, linux-raid

Kevin Shanahan wrote:
> Of the remaining drives, SMART attributes for /dev/sd[cghijkl] all show:
> 
>   196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
>   197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 
> /dev/sde shows:
> 
>   196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
>   197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
> 
> /dev/sdf shows:
> 
>   196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       2
>   197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 
> Unfortunately the original /dev/sdd isn't currently attached, but I'll
> hook that up on Monday and check. I'd expect to see some high numbers
> there.
> 
>> These errors could be
>> Result of something relatively benign, like unexpected power loss.
> 
> Sorry, are you saying that about the errors from libata layer or just
> the errors from the md layer?

I wouldn't dream of contradicting David and I'm sure you've got nothing to worry
about. What's a few bad blocks between friends anyway :)

I will say that I have had very similar problems.

I used ddrescue to read the area around the block until it read without error,
and then re-wrote it. A subsequent smartctl -tlong /dev/sdX would then show no
errors.
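For the curious, the arithmetic for a window around the bad LBA looks
like this (a sketch - the LBA and window size are made up, and the
ddrescue/dd commands are only echoed here; point them at the real
device yourself after checking ddrescue(1)):

```shell
# Offsets for a 1 MiB window around a hypothetical bad LBA.
lba=294961
start=$(( lba - 1024 ))     # first sector of the window
off=$(( start * 512 ))      # byte offset for ddrescue -i
len=$(( 2048 * 512 ))       # window length in bytes (1 MiB)

# Shown rather than run - substitute the real device for /dev/sdX:
echo "ddrescue -r3 -i $off -s $len /dev/sdX window.img window.map"
echo "dd if=window.img of=/dev/sdX bs=512 seek=$start conv=notrunc"
```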

In my experience the bad blocks returned regularly and I became very familiar
indeed with forced rebuilds of arrays, array re-creation and other mdadm
incantations as the errors hit the system.

I will say that I've returned a *lot* of these under RMA (after discussions with
Samsung engineers).

Any drive that returns *fail* for a built-in self-test now gets 1 chance and is
then RMAed.

David

-- 
"Don't worry, you'll be fine; I saw it work in a cartoon once..."


* Re: Read errors and SMART tests
  2008-12-20  5:22   ` Kevin Shanahan
  2008-12-20  6:54     ` David Lethe
@ 2009-01-14 20:59     ` Bill Davidsen
  1 sibling, 0 replies; 7+ messages in thread
From: Bill Davidsen @ 2009-01-14 20:59 UTC (permalink / raw)
  To: Kevin Shanahan; +Cc: David Lethe, linux-raid

Kevin Shanahan wrote:
> On Fri, Dec 19, 2008 at 10:13:14PM -0600, David Lethe wrote:
>   
>> This shows nothing more than you having a single bad block.  You have a
>> 1TB drive, for crying out loud, they can't all stay perfect ;)
>>     
>
> Heh, true.
>
>   
>> This is no reason to assume the disk is bad, or that it has anything to
>> do with cabling.   When you wrote you have 
>> read "errors" .. does that mean you have dozens, hundreds of individual
>> unreadable blocks, or 
>> could you just have just this one bad block.
>>     
>
> Sorry, I didn't provide a lot of detail there. The "bad" drive,
> /dev/sdd was doing more than just failing the self test:
>
> Dec 20 06:55:20 hermes kernel: ata4.00: exception Emask 0x0 SAct 0x5 SErr 0x0 action 0x0
> Dec 20 06:55:20 hermes kernel: ata4.00: irq_stat 0x40000008
> Dec 20 06:55:20 hermes kernel: ata4.00: cmd 60/78:10:47:d5:fa/00:00:1e:00:00/40 tag 2 ncq 61440 in
> Dec 20 06:55:20 hermes kernel:          res 51/40:00:b9:d5:fa/00:00:1e:00:00/40 Emask 0x409 (media error) <F>
> Dec 20 06:55:20 hermes kernel: ata4.00: status: { DRDY ERR }
> Dec 20 06:55:20 hermes kernel: ata4.00: error: { UNC }
> Dec 20 06:55:20 hermes kernel: ata4.00: configured for UDMA/133
> Dec 20 06:55:20 hermes kernel: ata4: EH complete
> Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
> Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
> Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> (repeats several times)
>
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755016 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755024 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755032 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755040 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755048 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755056 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755064 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755072 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755080 on sdd1)
> Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755088 on sdd1)
>
> ...
>
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165696 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165704 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165712 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165720 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165728 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165736 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165744 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165752 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165760 on sdd1)
> Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165768 on sdd1)
>
> ...
>
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181440 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181448 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181456 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181464 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181472 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181480 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181488 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181496 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181504 on sdd1)
> Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181512 on sdd1)
>
> ...
>
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552584 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552592 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552600 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552608 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552616 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552624 on sdd1)
> Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552632 on sdd1)
>
> ...
>
> Dec 20 08:16:19 hermes kernel: raid5:md5: read error corrected (8 sectors at 613020008 on sdd1)
>
> That's just a sample from today - it's been doing similar things for
> several days.  So the drive was hanging in there in the array, thanks
> to the error correction, but it was of course impacting performance.
>
> Anyway, when I put the replacement drive in I decided to do a self
> test before adding it to the array and I guess I was a bit concerned
> that it immediately failed the test. Since it was inserted into the
> same slot in the drive cage, same cable, etc. I wondered if those
> factors can affect a self test. My assumption was no, but I thought
> I'd ask.
>   

A bad cable, poor cooling, funky power - no external problem is going
to go away by replacing the drive. And I don't expect a new drive to
have bad sectors which haven't been relocated before the drive got to
me...

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismarck



