linux-raid.vger.kernel.org archive mirror
* Re: disk testing
@ 2004-09-17  2:50 harry
  2004-09-17  9:18 ` Tim Small
  2004-09-17 15:08 ` Sebastien Koechlin
  0 siblings, 2 replies; 9+ messages in thread
From: harry @ 2004-09-17  2:50 UTC
  To: linux-raid


Tim and Neil have suggested (apparently correctly) that the disk had a bad sector and that the firmware remapped it when I wrote to it. My question is: how many spare sectors does a typical disk have? More importantly, now that the sector has been remapped, recreating the raid5 array worked fine, but is a failure right out of the box normal? I was going to return the disk, but since it's working now I'm not sure whether I should.

Thanks




 --- On Tue 09/14, Tim Small < tim@buttersideup.com > wrote:
From: Tim Small [mailto: tim@buttersideup.com]
To: linux-raid@vger.kernel.org
Date: Tue, 14 Sep 2004 10:15:43 +0100
Subject: Re: disk testing

If there is an unreadable sector on the disk, then reading it will fail,
but if you write to it, the drive firmware will reallocate the sector,
and then allow reading (actually the sector it is reading is now
somewhere else on the disk, but the firmware hides this).  If the raid5
sync was trying to read such a sector, but your other tests have written
it, then it will now appear to be fine (and the raid5 should now work).

If I were you I would use smartmontools to check out the drive (you can
then see if it has reallocated any sectors, and read errors should show
up in the SMART error log).

Tim.


p.s. Prelim SMART support for libata:

http://www.ussg.iu.edu/hypermail/linux/kernel/0408.3/2304.html


harry wrote:

>I just bought 3 sata drives and set them up in a raid5 array. About 45% into syncing them, the first disk gets an error and goes offline. I figure I did something wrong, so I retrace my steps and try again, and again, I get an error about 45% of the way through, the first disk errors and goes offline.
>
>So, I think I have a bad disk. But wait! I created a raid 1 array on the remaining two to see if there are any other errors later on those two (there weren't), and I create a normal partition/fs on the failing disk. I begin writing various bitpatterns across the entire disk and reading them back, trying to find the problem. So far, I've done about 5 passes over the entire disk without error!
>
>So, any idea why raid would be getting errors from the disk, but I don't seem to be able to? (or, what I should tell the store I bought it from when I try to get it replaced?)

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
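
Tim's suggestion to check the drive with smartmontools can be sketched roughly as follows. This is only an illustration: it assumes the attribute table that `smartctl -A` prints, the attribute names vary between drive models, and the `SAMPLE` text is made up for the example (it is not output from the drive in this thread). The helper names `parse_smart_attributes`, `sector_health`, and `read_attributes` are hypothetical.

```python
import subprocess

# Attributes that commonly indicate remapped or unreadable sectors.
# Names differ between vendors; treat this list as an assumption.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            attrs[fields[1]] = int(raw) if raw.isdigit() else raw
    return attrs


def sector_health(attrs):
    """Keep only the attributes related to bad or remapped sectors."""
    return {name: attrs[name] for name in WATCHED if name in attrs}


def read_attributes(device):
    """Run `smartctl -A` on a device (needs root and smartmontools installed)."""
    # smartctl uses a bitmask exit status, so a non-zero code is not treated as fatal.
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return parse_smart_attributes(result.stdout)


# Made-up sample in the usual column layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(sector_health(parse_smart_attributes(SAMPLE)))
```

A non-zero raw value for a reallocated-sector attribute would be consistent with the remapping Tim describes. On a real system one would call something like `sector_health(read_attributes("/dev/hda"))`, where the device path is only an example.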

_______________________________________________
Join Excite! - http://www.excite.com
The most personalized portal on the Web!

[parent not found: <20040914095208.E790A3969@xprdmailfe9.nwk.excite.com>]
* Re: disk testing
@ 2004-09-14  9:54 harry
  0 siblings, 0 replies; 9+ messages in thread
From: harry @ 2004-09-14  9:54 UTC
  To: linux-raid


Here's an interesting twist: smartctl claims that none of the three disks supports SMART. However, I think this is because the disks show up as SCSI rather than ATA disks (the controller they're attached to is a Promise SATA150 TX4). I tried tricking it into treating the drive as an ATA device with the '-d ata' option, but no dice.

(I'm fairly certain that these 3 disks support SMART, because I have two more attached through a SATA controller built into the motherboard, which both show up as hd? drives, and smartctl gives loads of info for them.) All of the drives in question (3 new ones, 2 bought about 8 months ago) are Western Digital 2500JDs.

And finally, forgive my ignorance, but what does libata do/provide? (I'm guessing it would allow smartctl to see disks that show up as SCSI on the system as ATA disks, but I just want to verify.) The system involved is running Debian sid with a custom 2.4.27 kernel, and I just did an apt-file and apt-cache search for libata and both came up empty.


Thanks, Harry

* RE: disk testing
@ 2004-09-14  9:04 harry
  0 siblings, 0 replies; 9+ messages in thread
From: harry @ 2004-09-14  9:04 UTC
  To: linux-raid


It occurs to me that I should include the errors that were in dmesg:

raid5: switching cache buffer size, 1024 --> 4096
kjournald starting.  Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on md(9,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
scsi0: ERROR on channel 0, id 0, lun 0, CDB: Request Sense 00 00 00 40 00 
Current sd0b:00: sense key Medium Error
Additional sense indicates Unrecovered read error
 I/O error: dev 0b:00, sector 7985264
scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 0b 01 8c 2f 00 00 c8 00 
Current sd08:01: sense key Medium Error
Additional sense indicates Unrecovered read error - auto reallocate failed
 I/O error: dev 08:01, sector 184650736
raid5: Disk failure on sda1, disabling device. Operation continuing on 2 devices
raid5: parity resync was not fully finished, restarting next time.
md: recovery thread got woken up ...
md: updating md1 RAID superblock on device
md: sdc1 [events: 00000004]<6>(write) sdc1's sb offset: 244195904
md: sdb1 [events: 00000004]<6>(write) sdb1's sb offset: 244195904
md: (skipping faulty sda1 )
md1: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
md: md_do_sync() got signal ... exiting
raid5: resync aborted!
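
The `Read (10)` line in the log above can be decoded to see which sector the failing request started at. The sketch below assumes the kernel printed the nine bytes that follow the opcode (so the opcode byte itself is not in the string), and `decode_read10` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
def decode_read10(cdb_hex):
    """Decode the nine bytes printed after 'Read (10)' in the kernel log.

    Layout after the opcode: 1 flags byte, 4-byte big-endian LBA,
    1 group byte, 2-byte big-endian transfer length, 1 control byte.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 9:
        raise ValueError("expected 9 bytes after the opcode")
    lba = int.from_bytes(cdb[1:5], "big")
    length = int.from_bytes(cdb[6:8], "big")
    return lba, length


# Values copied from the dmesg output above.
LOGGED_CDB = "00 0b 01 8c 2f 00 00 c8 00"
REPORTED_SECTOR = 184650736  # from "I/O error: dev 08:01, sector 184650736"

lba, length = decode_read10(LOGGED_CDB)
offset = lba - REPORTED_SECTOR

if __name__ == "__main__":
    print(f"request starts at LBA {lba}, {length} sectors long")
    print(f"whole-disk LBA minus reported sector = {offset}")
```

The request starts at LBA 184650799 and covers 200 sectors, and the difference from the sector reported for dev 08:01 is 63. That would be consistent with the reported sector being relative to a partition that begins at sector 63, a common layout at the time, but the thread does not state where sda1 starts, so treat that reading as an assumption.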

Thanks, Harry

 --- On Tue 09/14, harry < hfranklin97@excite.com > wrote:
From: harry [mailto: hfranklin97@excite.com]
To: linux-raid@vger.kernel.org
Date: Tue, 14 Sep 2004 04:50:07 -0400 (EDT)
Subject: disk testing

I just bought 3 sata drives and set them up in a raid5 array. About 45% into syncing them, the first disk gets an error and goes offline. I figure I did something wrong, so I retrace my steps and try again, and again, I get an error about 45% of the way through, the first disk errors and goes offline.

So, I think I have a bad disk. But wait! I created a raid 1 array on the remaining two to see if there are any other errors later on those two (there weren't), and I create a normal partition/fs on the failing disk. I begin writing various bitpatterns across the entire disk and reading them back, trying to find the problem. So far, I've done about 5 passes over the entire disk without error!

So, any idea why raid would be getting errors from the disk, but I don't seem to be able to? (or, what I should tell the store I bought it from when I try to get it replaced?)

Thanks, Harry

ps, The only thing I can think of is that the first time through I had been using the array (created a partition, started moving files onto it), and the excessive thrashing of the heads caused an intermittent error to show itself, whereas the tests I'm currently running are strictly linear and easy enough on the disk that the problem doesn't appear.

* disk testing
@ 2004-09-14  8:50 harry
  2004-09-14  9:06 ` Neil Brown
  2004-09-14  9:15 ` Tim Small
  0 siblings, 2 replies; 9+ messages in thread
From: harry @ 2004-09-14  8:50 UTC
  To: linux-raid


I just bought 3 sata drives and set them up in a raid5 array. About 45% into syncing them, the first disk gets an error and goes offline. I figure I did something wrong, so I retrace my steps and try again, and again, I get an error about 45% of the way through, the first disk errors and goes offline. 

So, I think I have a bad disk. But wait! I created a raid 1 array on the remaining two to see if there are any other errors later on those two (there weren't), and I create a normal partition/fs on the failing disk. I begin writing various bitpatterns across the entire disk and reading them back, trying to find the problem. So far, I've done about 5 passes over the entire disk without error! 

So, any idea why raid would be getting errors from the disk, but I don't seem to be able to? (or, what I should tell the store I bought it from when I try to get it replaced?)

Thanks, Harry

ps, The only thing I can think of is that the first time through I had been using the array (created a partition, started moving files onto it), and the excessive thrashing of the heads caused an intermittent error to show itself, whereas the tests I'm currently running are strictly linear and easy enough on the disk that the problem doesn't appear. 
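
The write-then-read-back test described above might look roughly like the sketch below. It is a minimal illustration, not the procedure actually used in this thread: `write_and_verify` is a hypothetical helper, the block size and pattern are arbitrary, and it overwrites whatever `path` points at. Reads may also be served from the page cache rather than the disk unless the cache is bypassed or dropped, which this sketch does not attempt.

```python
import os


def write_and_verify(path, pattern, block_size=4096, blocks=256):
    """Fill `blocks` blocks with a repeating byte pattern, then read them back.

    Returns the indices of blocks whose contents did not match.
    DESTRUCTIVE: existing data in the written range is overwritten.
    """
    # Repeat the pattern enough times to fill one block, then trim to size.
    block = (pattern * (block_size // len(pattern) + 1))[:block_size]
    bad = []
    with open(path, "r+b") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out of Python and the OS buffers
        f.seek(0)
        for index in range(blocks):
            if f.read(block_size) != block:
                bad.append(index)
    return bad
```

Because the test writes before it reads, it never asks the drive to read a sector in its original, possibly unreadable, state. A read-only pass before any writes is the kind of access that would hit the same error the raid5 resync did.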




end of thread, other threads:[~2004-09-17 15:08 UTC | newest]

Thread overview: 9+ messages
2004-09-17  2:50 disk testing harry
2004-09-17  9:18 ` Tim Small
2004-09-17 15:08 ` Sebastien Koechlin
     [not found] <20040914095208.E790A3969@xprdmailfe9.nwk.excite.com>
2004-09-14 12:17 ` Tim Small
  -- strict thread matches above, loose matches on Subject: below --
2004-09-14  9:54 harry
2004-09-14  9:04 harry
2004-09-14  8:50 harry
2004-09-14  9:06 ` Neil Brown
2004-09-14  9:15 ` Tim Small
