* Read errors and SMART tests
@ 2008-12-20 1:30 Kevin Shanahan
2008-12-20 4:13 ` David Lethe
0 siblings, 1 reply; 7+ messages in thread
From: Kevin Shanahan @ 2008-12-20 1:30 UTC (permalink / raw)
To: linux-raid
Hi,
Just a quick question about SMART tests :-
I have a Samsung drive returning read errors, e.g.:
Dec 20 08:59:24 hermes kernel: ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Dec 20 08:59:24 hermes kernel: ata4.00: irq_stat 0x40000008
Dec 20 08:59:24 hermes kernel: ata4.00: cmd 60/80:00:3f:0e:50/00:00:24:00:00/40 tag 0 ncq 65536 in
Dec 20 08:59:24 hermes kernel: res 41/40:00:61:0e:50/00:00:24:00:00/40 Emask 0x409 (media error) <F>
Dec 20 08:59:24 hermes kernel: ata4.00: status: { DRDY ERR }
Dec 20 08:59:24 hermes kernel: ata4.00: error: { UNC }
Dec 20 08:59:24 hermes kernel: ata4.00: configured for UDMA/133
Dec 20 08:59:24 hermes kernel: ata4: EH complete
Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Dec 20 08:59:24 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
So I ran the short (and long) self-tests and they showed read
failures. Then I put in a new drive to replace it and ran the short
self-test again - this one is showing read errors too:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%       2572         294961
# 2  Short offline       Aborted by host               20%       2572         -
I'm guessing this is just bad luck, i.e. drives from the same bad
batch. Erm, so my question - am I right in assuming that the SMART
self-test is not influenced in any way by bad cables, etc.? If the
drive returns read errors on its self-test, the error is within the
drive itself, right?
Thanks,
Kevin.
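For reference, a minimal sketch of how the self-tests discussed here
are run and read back with smartctl, assuming the drive is /dev/sdd:

    # Queue a short offline self-test; the command returns immediately
    # and the test runs inside the drive (typically a minute or two).
    smartctl -t short /dev/sdd

    # The long test reads the entire surface - hours on a 1TB drive.
    smartctl -t long /dev/sdd

    # Read back the self-test log shown above.
    smartctl -l selftest /dev/sdd

    # Overall health plus the attribute table (reallocated/pending sectors).
    smartctl -H -A /dev/sdd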
* RE: Read errors and SMART tests
  2008-12-20 1:30 Read errors and SMART tests Kevin Shanahan
@ 2008-12-20 4:13 ` David Lethe
  2008-12-20 5:22 ` Kevin Shanahan
  0 siblings, 1 reply; 7+ messages in thread
From: David Lethe @ 2008-12-20 4:13 UTC (permalink / raw)
To: Kevin Shanahan, linux-raid

> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org]
> On Behalf Of Kevin Shanahan
> Sent: Friday, December 19, 2008 7:31 PM
> To: linux-raid@vger.kernel.org
> Subject: Read errors and SMART tests
>
> [...]
>
> Am I right in assuming that the SMART self-test is not influenced in
> any way by bad cables, etc.? If the drive returns read errors on its
> self-test, the error is within the drive itself, right?

This shows nothing more than you having a single bad block. You have a
1TB drive, for crying out loud, they can't all stay perfect ;)

This is no reason to assume the disk is bad, or that it has anything to
do with cabling. When you wrote you have read "errors" - does that mean
you have dozens or hundreds of individual unreadable blocks, or could
you just have this one bad block?

Why not use dd to do raw reads from /dev/sdd, sending the output to
/dev/null? Start at the next LBA; if dd hits another read error it will
tell you, then repeat the process and go on. I am assuming you aren't
using any software RAID1/5/6, so just fix the bad blocks by using dd to
write /dev/zero to the bad block(s).
When you write to the block, the disk will either map a reserved block
to it or just correct the ECC without remapping. It depends on the root
cause and on details that you can't get without running some more
sophisticated software.

David
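A minimal sketch of the probe-and-repair loop described above, assuming
512-byte sectors and the failing LBA 294961 from the self-test log
quoted earlier (the zero-write destroys that sector's contents, so be
sure the LBA is right):

    # Probe the suspect sector; dd reports an I/O error if it is unreadable.
    dd if=/dev/sdd of=/dev/null bs=512 skip=294961 count=1

    # Overwrite it with zeros; the drive either rewrites it in place or
    # remaps it to a reserved sector. This destroys the sector's data.
    dd if=/dev/zero of=/dev/sdd bs=512 seek=294961 count=1 conv=fsync

    # Re-probe, then continue scanning from the next LBA onward to see
    # whether there are further bad sectors; dd stops at the next error.
    dd if=/dev/sdd of=/dev/null bs=512 skip=294962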
* Re: Read errors and SMART tests
  2008-12-20 4:13 ` David Lethe
@ 2008-12-20 5:22 ` Kevin Shanahan
  2008-12-20 6:54 ` David Lethe
  2009-01-14 20:59 ` Bill Davidsen
  0 siblings, 2 replies; 7+ messages in thread
From: Kevin Shanahan @ 2008-12-20 5:22 UTC (permalink / raw)
To: David Lethe; +Cc: linux-raid

On Fri, Dec 19, 2008 at 10:13:14PM -0600, David Lethe wrote:
> This shows nothing more than you having a single bad block. You have a
> 1TB drive, for crying out loud, they can't all stay perfect ;)

Heh, true.

> This is no reason to assume the disk is bad, or that it has anything to
> do with cabling. When you wrote you have read "errors" - does that mean
> you have dozens or hundreds of individual unreadable blocks, or could
> you just have this one bad block?

Sorry, I didn't provide a lot of detail there. The "bad" drive,
/dev/sdd, was doing more than just failing the self-test:

Dec 20 06:55:20 hermes kernel: ata4.00: exception Emask 0x0 SAct 0x5 SErr 0x0 action 0x0
Dec 20 06:55:20 hermes kernel: ata4.00: irq_stat 0x40000008
Dec 20 06:55:20 hermes kernel: ata4.00: cmd 60/78:10:47:d5:fa/00:00:1e:00:00/40 tag 2 ncq 61440 in
Dec 20 06:55:20 hermes kernel:          res 51/40:00:b9:d5:fa/00:00:1e:00:00/40 Emask 0x409 (media error) <F>
Dec 20 06:55:20 hermes kernel: ata4.00: status: { DRDY ERR }
Dec 20 06:55:20 hermes kernel: ata4.00: error: { UNC }
Dec 20 06:55:20 hermes kernel: ata4.00: configured for UDMA/133
Dec 20 06:55:20 hermes kernel: ata4: EH complete
Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Dec 20 06:55:20 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

(repeats several times)

Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755016 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755024 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755032 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755040 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755048 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755056 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755064 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755072 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755080 on sdd1)
Dec 20 06:55:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 519755088 on sdd1)

...
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165696 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165704 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165712 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165720 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165728 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165736 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165744 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165752 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165760 on sdd1)
Dec 20 07:04:30 hermes kernel: raid5:md5: read error corrected (8 sectors at 613165768 on sdd1)

...

Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181440 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181448 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181456 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181464 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181472 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181480 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181488 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181496 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181504 on sdd1)
Dec 20 07:04:47 hermes kernel: raid5:md5: read error corrected (8 sectors at 613181512 on sdd1)

...

Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552584 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552592 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552600 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552608 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552616 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552624 on sdd1)
Dec 20 08:10:09 hermes kernel: raid5:md5: read error corrected (8 sectors at 613552632 on sdd1)

...

Dec 20 08:16:19 hermes kernel: raid5:md5: read error corrected (8 sectors at 613020008 on sdd1)

That's just a sample from today - it's been doing similar things for
several days. So the drive was hanging in there in the array, thanks to
the error correction, but it was of course impacting performance.

Anyway, when I put the replacement drive in, I decided to do a
self-test before adding it to the array, and I guess I was a bit
concerned that it immediately failed the test. Since it was inserted
into the same slot in the drive cage, same cable, etc., I wondered
whether those factors can affect a self-test. My assumption was no, but
I thought I'd ask.

Cheers,
Kevin.
* RE: Read errors and SMART tests
  2008-12-20 5:22 ` Kevin Shanahan
@ 2008-12-20 6:54 ` David Lethe
  2008-12-20 9:09 ` Kevin Shanahan
  1 sibling, 1 reply; 7+ messages in thread
From: David Lethe @ 2008-12-20 6:54 UTC (permalink / raw)
To: Kevin Shanahan; +Cc: linux-raid

> -----Original Message-----
> From: Kevin Shanahan [mailto:kmshanah@disenchant.net]
> Sent: Friday, December 19, 2008 11:23 PM
> To: David Lethe
> Cc: linux-raid@vger.kernel.org
> Subject: Re: Read errors and SMART tests
>
> Sorry, I didn't provide a lot of detail there. The "bad" drive,
> /dev/sdd, was doing more than just failing the self-test:
>
> [kernel log and raid5:md5 "read error corrected" samples quoted in
> full upthread]
>
> That's just a sample from today - it's been doing similar things for
> several days. So the drive was hanging in there in the array, thanks to
> the error correction, but it was of course impacting performance.
>
> Anyway, when I put the replacement drive in, I decided to do a
> self-test before adding it to the array, and I guess I was a bit
> concerned that it immediately failed the test. Since it was inserted
> into the same slot in the drive cage, same cable, etc., I wondered
> whether those factors can affect a self-test. My assumption was no, but
> I thought I'd ask.
>
> Cheers,
> Kevin.

This particular test terminates when the FIRST bad block is found.
It is not an indication of a drive in distress or in need of immediate
replacement. I don't have the desire or time to look up how many
reserved blocks that disk has, but I wouldn't be surprised if it was
well over 10,000. The count is certainly documented in the product
manual, but not necessarily the data sheet, and certainly not on the
outside of the box. (I'm curious - if you look it up, please post it.)

Time for you to run full consistency check/repairs. These errors could
be the result of something relatively benign, like unexpected power
loss.
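For reference, a minimal sketch of a full md consistency check/repair
via the sysfs interface, assuming the array is /dev/md5 as in the logs
above:

    # Read every stripe and verify parity against the data blocks.
    echo check > /sys/block/md5/md/sync_action

    # Progress appears alongside normal resync status.
    cat /proc/mdstat

    # Sectors found inconsistent during the last check:
    cat /sys/block/md5/md/mismatch_cnt

    # 'repair' recomputes and rewrites inconsistent stripes instead of
    # just counting them.
    echo repair > /sys/block/md5/md/sync_action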
* Re: Read errors and SMART tests
  2008-12-20 6:54 ` David Lethe
@ 2008-12-20 9:09 ` Kevin Shanahan
  2008-12-20 21:46 ` David Greaves
  0 siblings, 1 reply; 7+ messages in thread
From: Kevin Shanahan @ 2008-12-20 9:09 UTC (permalink / raw)
To: David Lethe; +Cc: linux-raid

On Sat, Dec 20, 2008 at 12:54:24AM -0600, David Lethe wrote:
> This particular test terminates when the FIRST bad block is found.
> It is not an indication of a drive in distress or in need of immediate
> replacement. I don't have the desire or time to look up how many
> reserved blocks that disk has, but I wouldn't be surprised if it was
> well over 10,000. The count is certainly documented in the product
> manual, but not necessarily the data sheet, and certainly not on the
> outside of the box. (I'm curious - if you look it up, please post it.)

Sorry, I didn't have any luck finding that info.

Data sheet -
http://www.samsung.com/global/system/business/hdd/prdmodel/2008/8/19/525716F1_DT_R4.8.pdf

Product manual -
http://downloadcenter.samsung.com/content/UM/200704/20070419200104171_3.5_Install_Gudie_Eng_200704.pdf

> Time for you to run full consistency check/repairs.

You mean array consistency? Yeah, I've done that. This drive was
removed, its raid superblock zeroed, and then it was re-added to the
array on Thursday morning, so the entire drive had been re-written only
recently.

Dec 18 04:16:04 hermes kernel: md: bind<sdd1>
Dec 18 04:16:08 hermes kernel: RAID5 conf printout:
Dec 18 04:16:08 hermes kernel:  --- rd:10 wd:9
Dec 18 04:16:08 hermes kernel:  disk 0, o:1, dev:sde1
Dec 18 04:16:08 hermes kernel:  disk 1, o:1, dev:sdf1
Dec 18 04:16:08 hermes kernel:  disk 2, o:1, dev:sdg1
Dec 18 04:16:08 hermes kernel:  disk 3, o:1, dev:sdk1
Dec 18 04:16:08 hermes kernel:  disk 4, o:1, dev:sdj1
Dec 18 04:16:08 hermes kernel:  disk 5, o:1, dev:sdi1
Dec 18 04:16:08 hermes kernel:  disk 6, o:1, dev:sdh1
Dec 18 04:16:08 hermes kernel:  disk 7, o:1, dev:sdd1
Dec 18 04:16:08 hermes kernel:  disk 8, o:1, dev:sdc1
Dec 18 04:16:08 hermes kernel:  disk 9, o:1, dev:sdl1
Dec 18 04:16:08 hermes mdadm[1949]: RebuildStarted event detected on md device /dev/md5
Dec 18 04:16:08 hermes kernel: md: recovery of RAID array md5
Dec 18 04:16:08 hermes kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Dec 18 04:16:08 hermes kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Dec 18 04:16:08 hermes kernel: md: using 128k window, over a total of 976759936 blocks.
Dec 18 08:41:08 hermes mdadm[1949]: Rebuild20 event detected on md device /dev/md5
Dec 18 11:46:08 hermes mdadm[1949]: Rebuild40 event detected on md device /dev/md5
Dec 18 14:35:08 hermes mdadm[1949]: Rebuild60 event detected on md device /dev/md5
Dec 18 17:20:08 hermes mdadm[1949]: Rebuild80 event detected on md device /dev/md5
Dec 18 19:58:05 hermes kernel: md: md5: recovery done.
Dec 18 19:58:05 hermes kernel: RAID5 conf printout:
Dec 18 19:58:05 hermes kernel:  --- rd:10 wd:10
Dec 18 19:58:05 hermes kernel:  disk 0, o:1, dev:sde1
Dec 18 19:58:05 hermes kernel:  disk 1, o:1, dev:sdf1
Dec 18 19:58:05 hermes kernel:  disk 2, o:1, dev:sdg1
Dec 18 19:58:05 hermes kernel:  disk 3, o:1, dev:sdk1
Dec 18 19:58:05 hermes kernel:  disk 4, o:1, dev:sdj1
Dec 18 19:58:05 hermes kernel:  disk 5, o:1, dev:sdi1
Dec 18 19:58:05 hermes kernel:  disk 6, o:1, dev:sdh1
Dec 18 19:58:05 hermes kernel:  disk 7, o:1, dev:sdd1
Dec 18 19:58:05 hermes kernel:  disk 8, o:1, dev:sdc1
Dec 18 19:58:05 hermes kernel:  disk 9, o:1, dev:sdl1
Dec 18 19:58:05 hermes mdadm[1949]: RebuildFinished event detected on md device /dev/md5
Dec 18 19:58:05 hermes mdadm[1949]: SpareActive event detected on md device /dev/md5, component device /dev/sdd1

And then, e.g.:

Dec 18 22:17:44 hermes kernel: ata4.00: exception Emask 0x0 SAct 0xc3f SErr 0x0 action 0x0
Dec 18 22:17:44 hermes kernel: ata4.00: irq_stat 0x40000008
Dec 18 22:17:44 hermes kernel: ata4.00: cmd 60/58:50:c7:b1:c6/00:00:1e:00:00/40 tag 10 ncq 45056 in
Dec 18 22:17:44 hermes kernel:          res 41/40:00:ca:b1:c6/00:00:1e:00:00/40 Emask 0x409 (media error) <F>
Dec 18 22:17:44 hermes kernel: ata4.00: status: { DRDY ERR }
Dec 18 22:17:44 hermes kernel: ata4.00: error: { UNC }
Dec 18 22:17:44 hermes kernel: ata4.00: configured for UDMA/133
Dec 18 22:17:44 hermes kernel: ata4: EH complete
Dec 18 22:17:44 hermes kernel: sd 3:0:0:0: [sdd] 1953525168 512-byte hardware sectors (1000205 MB)
Dec 18 22:17:44 hermes kernel: sd 3:0:0:0: [sdd] Write Protect is off
Dec 18 22:17:44 hermes kernel: sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Dec 18 22:17:44 hermes kernel: sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

There are lots of these:

hermes:~# zgrep UNC /var/log/syslog{.1.gz,.0,} | wc -l
385

Of the remaining drives, SMART attributes for /dev/sd[cghijkl] all show:

196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector  0x0012 100 100 000 Old_age Always - 0

/dev/sde shows:

196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector  0x0012 100 100 000 Old_age Always - 3

/dev/sdf shows:

196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 2
197 Current_Pending_Sector  0x0012 100 100 000 Old_age Always - 0

Unfortunately the original /dev/sdd isn't currently attached, but I'll
hook that up on Monday and check. I'd expect to see some high numbers
there.

> These errors could be the result of something relatively benign, like
> unexpected power loss.

Sorry, are you saying that about the errors from the libata layer, or
just the errors from the md layer?

Cheers,
Kevin.
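A minimal sketch of the remove/zero/re-add cycle Kevin describes,
assuming the member is /dev/sdd1 in /dev/md5 as in the logs above
(zeroing the superblock makes md treat the disk as brand new, so the
re-add forces a full rebuild that rewrites every sector):

    # Drop the member out of the array.
    mdadm /dev/md5 --fail /dev/sdd1
    mdadm /dev/md5 --remove /dev/sdd1

    # Erase the md superblock so the disk is no longer seen as a member.
    mdadm --zero-superblock /dev/sdd1

    # Add it back; md reconstructs its contents from the other disks.
    mdadm /dev/md5 --add /dev/sdd1

    # Watch the rebuild.
    watch cat /proc/mdstat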
* Re: Read errors and SMART tests
  2008-12-20 9:09 ` Kevin Shanahan
@ 2008-12-20 21:46 ` David Greaves
  0 siblings, 0 replies; 7+ messages in thread
From: David Greaves @ 2008-12-20 21:46 UTC (permalink / raw)
To: Kevin Shanahan; +Cc: David Lethe, linux-raid

Kevin Shanahan wrote:
> Of the remaining drives, SMART attributes for /dev/sd[cghijkl] all show:
>
> 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
> 197 Current_Pending_Sector  0x0012 100 100 000 Old_age Always - 0
>
> [...]
>
> Unfortunately the original /dev/sdd isn't currently attached, but I'll
> hook that up on Monday and check. I'd expect to see some high numbers
> there.
>
>> These errors could be the result of something relatively benign, like
>> unexpected power loss.
>
> Sorry, are you saying that about the errors from the libata layer, or
> just the errors from the md layer?

I wouldn't dream of contradicting David, and I'm sure you've got
nothing to worry about. What's a few bad blocks between friends
anyway :)

I will say that I have had very similar problems. I used ddrescue to
read the area around the block until it read without error, and then
re-wrote it. A subsequent smartctl -tlong /dev/sdX would then show no
errors.

In my experience the bad blocks returned regularly, and I became very
familiar indeed with forced rebuilds of arrays, array re-creation and
other mdadm incantations as the errors hit the system.

I will say that I've returned a *lot* of these under RMA (after
discussions with Samsung engineers). Any drive that returns *fail* for
a built-in self-test now gets one chance and is then RMAed.

David
--
"Don't worry, you'll be fine; I saw it work in a cartoon once..."
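A minimal sketch of the ddrescue approach David Greaves describes,
using GNU ddrescue and a hypothetical bad spot at LBA 294961 on
/dev/sdX (-i/-o are byte positions, -s is a byte count, and the map
file records which sectors still won't read):

    # Byte offset of the suspect LBA, assuming 512-byte sectors.
    OFFSET=$((294961 * 512))

    # Re-read a 128 KiB window around the bad spot, retrying failed
    # sectors for up to 3 passes.
    ddrescue -r 3 -i $((OFFSET - 65536)) -s 131072 /dev/sdX window.img rescue.map

    # Once the window reads cleanly, write it back so the drive rewrites
    # (and, if necessary, remaps) the marginal sectors. --force is
    # required to write to a block device.
    ddrescue --force -o $((OFFSET - 65536)) window.img /dev/sdX

    # Then re-run the long self-test and confirm the log is clean.
    smartctl -t long /dev/sdX
    smartctl -l selftest /dev/sdX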
* Re: Read errors and SMART tests
  2008-12-20 5:22 ` Kevin Shanahan
  2008-12-20 6:54 ` David Lethe
@ 2009-01-14 20:59 ` Bill Davidsen
  1 sibling, 0 replies; 7+ messages in thread
From: Bill Davidsen @ 2009-01-14 20:59 UTC (permalink / raw)
To: Kevin Shanahan; +Cc: David Lethe, linux-raid

Kevin Shanahan wrote:
> Sorry, I didn't provide a lot of detail there. The "bad" drive,
> /dev/sdd, was doing more than just failing the self-test:
>
> [kernel log and raid5:md5 "read error corrected" samples quoted in
> full upthread]
>
> That's just a sample from today - it's been doing similar things for
> several days. So the drive was hanging in there in the array, thanks to
> the error correction, but it was of course impacting performance.
>
> Anyway, when I put the replacement drive in, I decided to do a
> self-test before adding it to the array, and I guess I was a bit
> concerned that it immediately failed the test. Since it was inserted
> into the same slot in the drive cage, same cable, etc., I wondered
> whether those factors can affect a self-test. My assumption was no, but
> I thought I'd ask.

A bad cable, poor cooling, funky power - any external problem isn't
going away by replacing the drive.
And I don't expect a new drive to have bad sectors which haven't been
relocated before the drive got to me...

--
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismarck
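One way to probe Bill's point about external causes - a sketch,
assuming smartctl is available: SATA interface CRC errors (attribute
199, UDMA_CRC_Error_Count) are counted when data is corrupted on the
link, so a rising count implicates the cable or controller, while
pending/reallocated sectors point at the media itself:

    # Cable/link trouble shows up as UDMA_CRC_Error_Count increments;
    # media trouble as Current_Pending_Sector / Reallocated_Sector_Ct.
    smartctl -A /dev/sdd | grep -Ei 'crc|pending|realloc'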
end of thread, other threads:[~2009-01-14 20:59 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-20  1:30 Read errors and SMART tests Kevin Shanahan
2008-12-20  4:13 ` David Lethe
2008-12-20  5:22   ` Kevin Shanahan
2008-12-20  6:54     ` David Lethe
2008-12-20  9:09       ` Kevin Shanahan
2008-12-20 21:46         ` David Greaves
2009-01-14 20:59     ` Bill Davidsen