* Re: disk testing
@ 2004-09-17 2:50 harry
2004-09-17 9:18 ` Tim Small
2004-09-17 15:08 ` Sebastien Koechlin
0 siblings, 2 replies; 9+ messages in thread
From: harry @ 2004-09-17 2:50 UTC (permalink / raw)
To: linux-raid
Tim and Neil have suggested (apparently correctly) that the disk had a bad sector and the firmware remapped it when I wrote to it. My question is, how many spare sectors does the typical disk have? More importantly: now that the sector has been remapped, recreating the raid5 array worked fine, but is a failure right out of the box normal? I was going to return the drive, but since it's working now I'm not sure whether I should.
Thanks
--- On Tue 09/14, Tim Small < tim@buttersideup.com > wrote:
From: Tim Small [mailto: tim@buttersideup.com]
To: linux-raid@vger.kernel.org
Date: Tue, 14 Sep 2004 10:15:43 +0100
Subject: Re: disk testing
If there is an unreadable sector on the disk, then reading it will fail,
but if you write to it, the drive firmware will reallocate the sector,
and then allow reading (actually the sector it is reading is now
somewhere else on the disk, but the firmware hides this). If the raid5
sync was trying to read such a sector, but your other tests have written
it, then it will now appear to be fine (and the raid5 should now work).
If I were you I would use smartmontools to check out the drive (you can
then see if it has reallocated any sectors, and read errors should show
up in the SMART error log).
Tim.
p.s. Prelim SMART support for libata:
http://www.ussg.iu.edu/hypermail/linux/kernel/0408.3/2304.html
harry wrote:
>I just bought 3 sata drives and set them up in a raid5 array. About 45% into syncing them, the first disk gets an error and goes offline. I figure I did something wrong, so I retrace my steps and try again, and again, I get an error about 45% of the way through, the first disk errors and goes offline.
>
>So, I think I have a bad disk. But wait! I created a raid 1 array on the remaining two to see if there are any other errors later on those two (there weren't), and I create a normal partition/fs on the failing disk. I begin writing various bitpatterns across the entire disk and reading them back, trying to find the problem. So far, I've done about 5 passes over the entire disk without error!
>
>So, any idea why raid would be getting errors from the disk, but I don't seem to be able to? (or, what I should tell the store I bought it from when I try to get it replaced?)
>
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
_______________________________________________
Join Excite! - http://www.excite.com
The most personalized portal on the Web!
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: disk testing
2004-09-17 2:50 disk testing harry
@ 2004-09-17 9:18 ` Tim Small
2004-09-17 15:08 ` Sebastien Koechlin
1 sibling, 0 replies; 9+ messages in thread
From: Tim Small @ 2004-09-17 9:18 UTC (permalink / raw)
To: hfranklin97; +Cc: linux-raid
harry wrote:
>Tim and Neil have suggested (apparently correctly) that the disk had a bad sector and the firmware remapped it when I wrote to it. My question is, how many spare sectors does the typical disk have?
>
Good question. The drive technical documentation (if you can get it)
may tell you. I think I low-level formatted a 70G SCSI drive a few
weeks ago, which had a couple of percent in its default setup (on SCSI,
you can change the spare portion when you low-level format, if you want to).
I think that it's impossible to tell with xATA drives (at least without
vendor-specific tools), as the detail is hidden by the firmware. At a
guess (and it is a complete guess) I would say that it wouldn't be more
than 0.5% of the drive capacity. I think that the low-level formatting
geometry puts a certain percentage of the total raw capacity aside for
spare sectors. A certain number of these are used up for
manufacture-time defects (i.e. unusable sectors due to imperfections in
the platters) when the drive is low-level formatted in the factory, and
the rest of the spare sectors (down to some manufacturer-defined minimum
below which the drive fails QC) are left for in-service spares. BICBW.
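To put those percentages into sector counts, here is a rough back-of-the-envelope sketch. It assumes a nominal 250 GB drive and 512-byte sectors; both figures, and the function name `spare_sectors`, are illustrative assumptions rather than anything taken from a datasheet. The 0.5% and 2% inputs are simply the guesses mentioned above, not known values for any particular drive.

```python
SECTOR_BYTES = 512  # assumed sector size, typical for drives of this era


def spare_sectors(capacity_bytes: int, spare_permille: int) -> int:
    """Return how many sectors a spare share (in parts per thousand) amounts to."""
    total_sectors = capacity_bytes // SECTOR_BYTES
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_sectors * spare_permille // 1000


capacity = 250 * 10**9  # nominal 250 GB, assumed for illustration

# 5 permille = 0.5% (the xATA guess), 20 permille = 2% (the SCSI figure)
for permille in (5, 20):
    count = spare_sectors(capacity, permille)
    print(f"{permille / 10:.1f}% spare -> {count:,} sectors")
```

Even at the low end of the guess this works out to millions of spare sectors, so a single remapped sector would consume a negligible share of the pool, if the guess is anywhere near right.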
> More importantly, since the sector has been remapped, recreating the raid5 array worked fine, but is a failure right out of the box normal? I was going to return it but since its working now I'm not sure if I should or not.
>
>
Well, that's a difficult choice - here are some things that may help you
to decide:
. Do the SMART read-retry counts etc. seem noticeably higher than on the
other drives in the array, or are they increasing more quickly? (For
"rate" attributes, check instead whether they are lower or decreasing,
as some drives report these as "1 failure every x operations" style
counters.)
. How long does the warranty run for?
. Will the manufacturer or your supplier actually take the drive back in
its current condition? If you run their "factory revalidation test" or
whatever they call it, the drive will probably pass now.
. How much is your time to replace it worth vs. the cost of the drive
(or the cost of the drive once its warranty has expired)?
If it was me, I'd be inclined to leave it in place, but return it if I
got another failure on a different part of the disk (if an adjacent
sector fails this may be OK), or if the drive looked to be deteriorating
quickly.
If SMART support for libata were complete, I'd be inclined to get smartd
to run an extended self-test on the drive every week. As it is, you may
want to do this manually a couple of times on the drive to see what
difference this makes to the SMART counters (smartctl -t long).
Another option is to put in a cron job that does "dd if=/dev/sdx
of=/dev/null" once a week for all drives in the array (e.g. every Sunday
night, or some other quiet period for the computer) to give the drives a
similar workout to the SMART long test (albeit with a lot more work for
the CPU and buses). This way, you get to check that all sectors are
readable (and the firmware may get the chance to correct failing sectors
before they become unreadable, if the drive firmware supports this).
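The same whole-device read pass can be sketched in Python, which makes it easier to record where a read failed instead of stopping at the first error as plain dd does. This is only a minimal sketch: the function name `scan_readable` and the 1 MiB chunk size are assumptions made for illustration, it relies on `os.pread` (Unix only), and reading a real block device normally requires root.

```python
import os


def scan_readable(path: str, chunk_bytes: int = 1024 * 1024) -> list:
    """Read `path` from start to end; return byte offsets of chunks that failed."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, min(chunk_bytes, size - offset), offset)
            except OSError:
                # Typically EIO for an unreadable sector: note it and skip ahead.
                bad_offsets.append(offset)
                offset += chunk_bytes
                continue
            if not data:
                break
            offset += len(data)
    finally:
        os.close(fd)
    return bad_offsets
```

Calling `scan_readable("/dev/sdx")` from a weekly cron job (with the device name replaced by a real one) and mailing any non-empty result would be one way to use it. A failed chunk only narrows the problem down to `chunk_bytes` of the disk; a smaller chunk size gives a finer location at the cost of speed.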
Tim.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: disk testing
2004-09-17 2:50 disk testing harry
2004-09-17 9:18 ` Tim Small
@ 2004-09-17 15:08 ` Sebastien Koechlin
1 sibling, 0 replies; 9+ messages in thread
From: Sebastien Koechlin @ 2004-09-17 15:08 UTC (permalink / raw)
To: harry; +Cc: linux-raid
On Thu, Sep 16, 2004 at 10:50:05PM -0400, harry wrote:
>
> Tim and Neil have suggested (apparently correctly) that the disk had a bad
> sector and the firmware remapped it when I wrote to it. My question is,
> how many spare sectors does the typical disk have? More importantly, since
> the sector has been remapped, recreating the raid5 array worked fine, but
> is a failure right out of the box normal? I was going to return it but
> since its working now I'm not sure if I should or not.
Can you read SMART Attributes under any OS?
Every disk I use has a SMART Reallocated_Sector_Ct attribute:
------
# smartctl -A /dev/hde
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID ATTRIBUTE_NAME        FLAG   VALUE WORST THRESH TYPE     UPDATED RAW_VALUE
 1 Raw_Read_Error_Rate   0x000f 065   063   006    Pre-fail Always  147852401
 3 Spin_Up_Time          0x0003 100   100   000    Pre-fail Always  0
 4 Start_Stop_Count      0x0032 100   100   020    Old_age  Always  0
 5 Reallocated_Sector_Ct 0x0033 100   100   036    Pre-fail Always  0
(...)
------
You should look up your hard drive's datasheet, but RAW_VALUE is probably a
counter of remapped sectors.
If RAW_VALUE is high, or if VALUE is low and approaching THRESH, it means
you're likely to have trouble with this disk.
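The rule of thumb above (watch RAW_VALUE, and watch how close VALUE gets to THRESH) can be checked mechanically. The following is a minimal sketch that parses attribute rows in the layout shown in the smartctl output above; the helper names `parse_attributes` and `margin` are made up for this example, and it assumes each raw value is a single token, which does not hold for every attribute on every drive.

```python
def parse_attributes(smartctl_output: str) -> dict:
    """Map attribute name -> dict with value, worst, thresh and raw fields."""
    attributes = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry a hex FLAG field.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        attributes[fields[1]] = {
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[-1],  # assumes a single-token raw value
        }
    return attributes


def margin(attr: dict) -> int:
    """Distance between the normalised VALUE and its failure THRESH."""
    return attr["value"] - attr["thresh"]


SAMPLE = """\
ID ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 065 063 006 Pre-fail Always 147852401
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always 0
"""

attrs = parse_attributes(SAMPLE)
realloc = attrs["Reallocated_Sector_Ct"]
print("remapped (raw):", realloc["raw"], "margin to THRESH:", margin(realloc))
```

For the sample rows taken from the output above, the reallocated-sector raw count is 0 and the normalised value sits well clear of its threshold, which is what a healthy drive would be expected to show.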
You can read http://www.linuxjournal.com/article.php?sid=6983
--
Seb, autocuiseur
^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <20040914095208.E790A3969@xprdmailfe9.nwk.excite.com>]
* Re: disk testing
[not found] <20040914095208.E790A3969@xprdmailfe9.nwk.excite.com>
@ 2004-09-14 12:17 ` Tim Small
0 siblings, 0 replies; 9+ messages in thread
From: Tim Small @ 2004-09-14 12:17 UTC (permalink / raw)
To: hfranklin97, linux-raid
harry wrote:
>Here's an interesting twist: smartctl claims that all three disks don't support smart. However, I think this is because the disks show up as scsi and not ata disks (the controller they're attached to is a promise sata150 tx4). I tried tricking it into looking at the drive as an ata device with the '-d ata' option, but no dice.
>
>(I'm fairly certain that these 3 disks support smart because I have two more attached through a sata controller built into the motherboard, which both show up as hd? drives, and smartctl gives loads of info for them). All of the drives in question (3 new ones, 2 bought about 8 months ago) are Western Digital 2500JD's.
>
>
>
>And finally, forgive my ignorance, but what does libata do/provide? (I'm guessing it would allow smartctl to see disks that show up as scsi on the system as ata disks, but just want to verify). The system involved is running debian sid with a custom 2.4.27 kernel, and I just did an apt-file and apt-cache search for libata and both came up empty.
>
>
>
libata is the new SATA controller support in 2.6.x (also available for
2.4.x). If you are seeing your SATA disks as /dev/sdx then you are
probably using libata. libata lives here:
http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/
and here:
http://gkernel.bkbits.net/
The necessary ioctls to support SMART are not in libata yet, but this patch:
http://www.ussg.iu.edu/hypermail/linux/kernel/0408.3/2304.html
will allow you to use smartctl (with the "-d ata" argument) - however it
doesn't do the proper locking, so you probably wouldn't want to use it
on active disks whilst a system is in production (and certainly not with
smartd).
Tim.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: disk testing
@ 2004-09-14 9:54 harry
0 siblings, 0 replies; 9+ messages in thread
From: harry @ 2004-09-14 9:54 UTC (permalink / raw)
To: linux-raid
Here's an interesting twist: smartctl claims that all three disks don't support smart. However, I think this is because the disks show up as scsi and not ata disks (the controller they're attached to is a promise sata150 tx4). I tried tricking it into looking at the drive as an ata device with the '-d ata' option, but no dice.
(I'm fairly certain that these 3 disks support smart because I have two more attached through a sata controller built into the motherboard, which both show up as hd? drives, and smartctl gives loads of info for them). All of the drives in question (3 new ones, 2 bought about 8 months ago) are Western Digital 2500JD's.
And finally, forgive my ignorance, but what does libata do/provide? (I'm guessing it would allow smartctl to see disks that show up as scsi on the system as ata disks, but just want to verify). The system involved is running debian sid with a custom 2.4.27 kernel, and I just did an apt-file and apt-cache search for libata and both came up empty.
Thanks, Harry
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: disk testing
@ 2004-09-14 9:04 harry
0 siblings, 0 replies; 9+ messages in thread
From: harry @ 2004-09-14 9:04 UTC (permalink / raw)
To: linux-raid
It occurs to me that I should include the errors that were in dmesg:
raid5: switching cache buffer size, 1024 --> 4096
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on md(9,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Request Sense 00 00 00 40 00
Current sd0b:00: sense key Medium Error
Additional sense indicates Unrecovered read error
I/O error: dev 0b:00, sector 7985264
scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 0b 01 8c 2f 00 00 c8 00
Current sd08:01: sense key Medium Error
Additional sense indicates Unrecovered read error - auto reallocate failed
I/O error: dev 08:01, sector 184650736
raid5: Disk failure on sda1, disabling device. Operation continuing on 2 devices
raid5: parity resync was not fully finished, restarting next time.
md: recovery thread got woken up ...
md: updating md1 RAID superblock on device
md: sdc1 [events: 00000004]<6>(write) sdc1's sb offset: 244195904
md: sdb1 [events: 00000004]<6>(write) sdb1's sb offset: 244195904
md: (skipping faulty sda1 )
md1: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
md: md_do_sync() got signal ... exiting
raid5: resync aborted!
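The Read (10) line in the log above can be decoded to see which part of the disk the failing request covered. The sketch below assumes that the kernel printed the command bytes that follow the opcode, so the first byte shown is the flags byte; that is an assumption about the log format, and the function name `decode_read10` is made up for this example.

```python
def decode_read10(cdb_after_opcode: str):
    """Decode the bytes printed after 'Read (10)'.

    Returns (starting LBA, number of blocks requested).
    """
    b = bytes.fromhex(cdb_after_opcode)
    # Layout after the opcode: flags, 4-byte LBA, reserved,
    # 2-byte transfer length, control. Multi-byte fields are big-endian.
    lba = int.from_bytes(b[1:5], "big")
    blocks = int.from_bytes(b[6:8], "big")
    return lba, blocks


lba, blocks = decode_read10("00 0b 01 8c 2f 00 00 c8 00")
reported_sector = 184650736  # from the "I/O error: dev 08:01" line

print("request starts at LBA", lba, "and covers", blocks, "blocks")
print("offset from the reported sector:", lba - reported_sector)
```

Under that assumption the request starts 63 sectors above the sector number reported for dev 08:01, which would be consistent with the partition starting at sector 63 of the disk (the partition table is not shown in the thread, so this is only an inference). It also suggests that the unreadable sector could be anywhere within the requested range, not necessarily at its first sector.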
Thanks, Harry
--- On Tue 09/14, harry < hfranklin97@excite.com > wrote:
From: harry [mailto: hfranklin97@excite.com]
To: linux-raid@vger.kernel.org
Date: Tue, 14 Sep 2004 04:50:07 -0400 (EDT)
Subject: disk testing
I just bought 3 sata drives and set them up in a raid5 array. About 45% into syncing them, the first disk gets an error and goes offline. I figure I did something wrong, so I retrace my steps and try again, and again, I get an error about 45% of the way through, the first disk errors and goes offline.
So, I think I have a bad disk. But wait! I created a raid 1 array on the remaining two to see if there are any other errors later on those two (there weren't), and I create a normal partition/fs on the failing disk. I begin writing various bitpatterns across the entire disk and reading them back, trying to find the problem. So far, I've done about 5 passes over the entire disk without error!
So, any idea why raid would be getting errors from the disk, but I don't seem to be able to? (or, what I should tell the store I bought it from when I try to get it replaced?)
Thanks, Harry
ps, The only thing I can think of is that the first time through I had been using the array (created a partition, started moving files onto it), and the excessive thrashing of the heads caused an intermittent error to show itself, whereas the tests I'm currently running are strictly linear and easy enough on the disk that the problem doesn't appear.
^ permalink raw reply [flat|nested] 9+ messages in thread
* disk testing
@ 2004-09-14 8:50 harry
2004-09-14 9:06 ` Neil Brown
2004-09-14 9:15 ` Tim Small
0 siblings, 2 replies; 9+ messages in thread
From: harry @ 2004-09-14 8:50 UTC (permalink / raw)
To: linux-raid
I just bought 3 sata drives and set them up in a raid5 array. About 45% into syncing them, the first disk gets an error and goes offline. I figure I did something wrong, so I retrace my steps and try again, and again, I get an error about 45% of the way through, the first disk errors and goes offline.
So, I think I have a bad disk. But wait! I created a raid 1 array on the remaining two to see if there are any other errors later on those two (there weren't), and I create a normal partition/fs on the failing disk. I begin writing various bitpatterns across the entire disk and reading them back, trying to find the problem. So far, I've done about 5 passes over the entire disk without error!
So, any idea why raid would be getting errors from the disk, but I don't seem to be able to? (or, what I should tell the store I bought it from when I try to get it replaced?)
Thanks, Harry
ps, The only thing I can think of is that the first time through I had been using the array (created a partition, started moving files onto it), and the excessive thrashing of the heads caused an intermittent error to show itself, whereas the tests I'm currently running are strictly linear and easy enough on the disk that the problem doesn't appear.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: disk testing
2004-09-14 8:50 harry
@ 2004-09-14 9:06 ` Neil Brown
2004-09-14 9:15 ` Tim Small
1 sibling, 0 replies; 9+ messages in thread
From: Neil Brown @ 2004-09-14 9:06 UTC (permalink / raw)
To: hfranklin97; +Cc: linux-raid
On Tuesday September 14, hfranklin97@excite.com wrote:
>
> I just bought 3 sata drives and set them up in a raid5 array. About
> 45% into syncing them, the first disk gets an error and goes
> offline. I figure I did something wrong, so I retrace my steps and
> try again, and again, I get an error about 45% of the way through,
> the first disk errors and goes offline.
>
Presumably a read-error. Did the kernel logs indicate the error type?
> So, I think I have a bad disk. But wait! I created a raid 1 array on
> the remaining two to see if there are any other errors later on
> those two (there weren't), and I create a normal partition/fs on the
> failing disk. I begin writing various bitpatterns across the entire
> disk and reading them back, trying to find the problem. So far, I've
> done about 5 passes over the entire disk without error!
Maybe writing to the disk caused the drive to fix up the bad sector.
Try building the raid5 again?
NeilBrown
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: disk testing
2004-09-14 8:50 harry
2004-09-14 9:06 ` Neil Brown
@ 2004-09-14 9:15 ` Tim Small
1 sibling, 0 replies; 9+ messages in thread
From: Tim Small @ 2004-09-14 9:15 UTC (permalink / raw)
To: linux-raid
If there is an unreadable sector on the disk, then reading it will fail,
but if you write to it, the drive firmware will reallocate the sector,
and then allow reading (actually the sector it is reading is now
somewhere else on the disk, but the firmware hides this). If the raid5
sync was trying to read such a sector, but your other tests have written
it, then it will now appear to be fine (and the raid5 should now work).
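The sequence described above (read fails, write triggers reallocation, later reads succeed) can be illustrated with a toy model. This is purely a sketch of the idea: the class `ToyDisk` and its fields are invented for the example, and real firmware behaviour is vendor-specific and considerably more involved.

```python
class ToyDisk:
    """Toy model of firmware sector reallocation (not real drive behaviour)."""

    def __init__(self, spares: int):
        self.data = {}           # logical sector -> contents
        self.unreadable = set()  # logical sectors whose medium is bad
        self.spares_left = spares
        self.reallocated = 0

    def read(self, sector: int) -> bytes:
        if sector in self.unreadable:
            raise OSError(f"medium error at sector {sector}")
        return self.data.get(sector, b"\0")

    def write(self, sector: int, payload: bytes) -> None:
        if sector in self.unreadable:
            if self.spares_left == 0:
                raise OSError("no spare sectors left")
            # The firmware quietly points the logical sector at a spare.
            self.spares_left -= 1
            self.reallocated += 1
            self.unreadable.discard(sector)
        self.data[sector] = payload


disk = ToyDisk(spares=4)
disk.unreadable.add(450)

try:
    disk.read(450)           # what the raid5 resync ran into
except OSError as err:
    print("resync read failed:", err)

disk.write(450, b"\xaa")     # the pattern test overwrites the sector
print("read after write:", disk.read(450), "reallocated:", disk.reallocated)
```

In this model a read-only scan keeps failing on the bad sector, while a single write pass clears it, which mirrors why the write-and-read-back tests saw no errors after the resync had failed.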
If I were you I would use smartmontools to check out the drive (you can
then see if it has reallocated any sectors, and read errors should show
up in the SMART error log).
Tim.
p.s. Prelim SMART support for libata:
http://www.ussg.iu.edu/hypermail/linux/kernel/0408.3/2304.html
harry wrote:
>I just bought 3 sata drives and set them up in a raid5 array. About 45% into syncing them, the first disk gets an error and goes offline. I figure I did something wrong, so I retrace my steps and try again, and again, I get an error about 45% of the way through, the first disk errors and goes offline.
>
>So, I think I have a bad disk. But wait! I created a raid 1 array on the remaining two to see if there are any other errors later on those two (there weren't), and I create a normal partition/fs on the failing disk. I begin writing various bitpatterns across the entire disk and reading them back, trying to find the problem. So far, I've done about 5 passes over the entire disk without error!
>
>So, any idea why raid would be getting errors from the disk, but I don't seem to be able to? (or, what I should tell the store I bought it from when I try to get it replaced?)
>
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2004-09-17 15:08 UTC | newest]
Thread overview: 9+ messages
2004-09-17 2:50 disk testing harry
2004-09-17 9:18 ` Tim Small
2004-09-17 15:08 ` Sebastien Koechlin
[not found] <20040914095208.E790A3969@xprdmailfe9.nwk.excite.com>
2004-09-14 12:17 ` Tim Small
-- strict thread matches above, loose matches on Subject: below --
2004-09-14 9:54 harry
2004-09-14 9:04 harry
2004-09-14 8:50 harry
2004-09-14 9:06 ` Neil Brown
2004-09-14 9:15 ` Tim Small