* Raid and badblocks
@ 2009-05-22 9:09 Jeremy Sanders
2009-05-22 9:19 ` Jeremy Sanders
0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Sanders @ 2009-05-22 9:09 UTC (permalink / raw)
To: linux-raid
We have a linux RAID 5 md setup with 10 disks controlled using a 3ware
9650se card. Here is /proc/mdstat:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[0] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
      8788959360 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
The built-in 3ware autoverify feature found some bad blocks on sdj. To try
to correct these I ran:
echo repair > /sys/block/md0/md/sync_action
This completed with no errors, but the bad blocks have not gone away.
Smartctl has these entries:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       961         1953103085
I also tried a check on md0, but it found no errors. If I run a badblocks
read test directly on sdj, it finds bad blocks. A smartctl long test also
finds the problems.
Isn't the check/repair supposed to find bad blocks and report/repair them? I
thought a check would read all the data off all the disks and check for
inconsistencies. How can I get these bad blocks repaired?
The kernel is Fedora 10, 2.6.29.2-52.fc10.x86_64.
Thanks
Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
^ permalink raw reply [flat|nested] 8+ messages in thread

* Re: Raid and badblocks
2009-05-22 9:09 Raid and badblocks Jeremy Sanders
@ 2009-05-22 9:19 ` Jeremy Sanders
2009-05-22 11:50 ` NeilBrown
2009-05-30 14:21 ` Sujit Karataparambil
0 siblings, 2 replies; 8+ messages in thread
From: Jeremy Sanders @ 2009-05-22 9:19 UTC (permalink / raw)
To: linux-raid

PS, here are some more data from the logs:

3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953102848
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953102856
Buffer I/O error on device sdj, logical block 244137857
Buffer I/O error on device sdj, logical block 244137858
Buffer I/O error on device sdj, logical block 244137859
Buffer I/O error on device sdj, logical block 244137860
Buffer I/O error on device sdj, logical block 244137861
Buffer I/O error on device sdj, logical block 244137862
Buffer I/O error on device sdj, logical block 244137863
Buffer I/O error on device sdj, logical block 244137864
Buffer I/O error on device sdj, logical block 244137865
Buffer I/O error on device sdj, logical block 244137866
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953103080
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953103080
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953103080
__ratelimit: 24 callbacks suppressed
Buffer I/O error on device sdj, logical block 244137885
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953103080
Buffer I/O error on device sdj, logical block 244137885
md: requested-resync of RAID array md0
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
md: using 128k window, over a total of 976551040 blocks.
md: md0: requested-resync done.
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/

* Re: Raid and badblocks
2009-05-22 9:19 ` Jeremy Sanders
@ 2009-05-22 11:50 ` NeilBrown
2009-05-26 12:59 ` Jeremy Sanders
2009-05-30 14:21 ` Sujit Karataparambil
1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2009-05-22 11:50 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-raid

On Fri, May 22, 2009 7:19 pm, Jeremy Sanders wrote:
> PS, here are some more data from the logs:
> .....
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 1953103080
                                          ^^^^^^^^^^
> Buffer I/O error on device sdj, logical block 244137885
                                                ^^^^^^^^^
> md: requested-resync of RAID array md0
> md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> md: using maximum available idle IO bandwidth (but not more than 200000
> KB/sec) for requested-resync.
> md: using 128k window, over a total of 976551040 blocks.
                                         ^^^^^^^^^
> md: md0: requested-resync done.

Consider those three underlined numbers.

The first is the (512-byte) sector number of the error:
   1953103080

The second is the 4K block number. Multiply by 8 to get the sector number
and we get
   1953103080
exactly the same. That is encouraging.

The third is the number of 1K blocks that the array covers. So double
that to get sectors and the answer is
   1953102080

So that erroneous sector is 1000 sectors (500K) beyond the end of the
data area of the array, i.e. it is in the unused padding at the end,
possibly near where the metadata lives.
So md/raid5 will never touch that block.

If you want to write to it, which might make the drive think it isn't
bad any more, you could
   dd if=/dev/zero of=/dev/sdj seek=1953103080 count=1

But if it were me, I would do
   dd if=/dev/sdj of=/dev/null skip=1953103080 count=1
and confirm that gives an error, just to double check the numbers.

NeilBrown
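
The unit conversions in the reply above can be re-checked with a few lines of Python. This is only a sketch of the same back-of-envelope arithmetic: the numbers are copied from the logs and the SMART self-test output in this thread, the variable and function names are made up for illustration, and, like the reply, it ignores where the sdj1 partition starts on the disk.

```python
SECTOR_BYTES = 512  # sector size used by the kernel "end_request" messages

# Values copied from the logs quoted in this thread.
error_sector = 1953103080           # "end_request: I/O error, dev sdj, sector ..."
error_block_4k = 244137885          # "Buffer I/O error ... logical block ..."
resync_blocks_1k = 976551040        # "md: using 128k window, over a total of ... blocks"
smart_first_error_lba = 1953103085  # LBA_of_first_error in the SMART self-test log


def block_4k_to_sector(block):
    """Convert a 4 KiB block number to a 512-byte sector number."""
    return block * (4096 // SECTOR_BYTES)


def blocks_1k_to_sectors(blocks):
    """Convert a count of 1 KiB blocks to a count of 512-byte sectors."""
    return blocks * (1024 // SECTOR_BYTES)


data_end_sector = blocks_1k_to_sectors(resync_blocks_1k)
overshoot_sectors = error_sector - data_end_sector
overshoot_kib = overshoot_sectors * SECTOR_BYTES // 1024

# The 4K block number and the sector number describe the same place.
print(block_4k_to_sector(error_block_4k) == error_sector)  # True
# End of the region md resyncs, in sectors.
print(data_end_sector)                                      # 1953102080
# How far past that end the failing sector lies.
print(overshoot_sectors, overshoot_kib)                     # 1000 500
# The SMART LBA falls inside the same 4 KiB block.
print(smart_first_error_lba // 8 == error_block_4k)         # True
```

The last check is an extra observation rather than something stated in the reply: the LBA reported by the extended self-test sits five sectors into the same 4 KiB block that the kernel reported.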

* Re: Raid and badblocks
2009-05-22 11:50 ` NeilBrown
@ 2009-05-26 12:59 ` Jeremy Sanders
2009-05-26 15:27 ` Andrew Burgess
0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Sanders @ 2009-05-26 12:59 UTC (permalink / raw)
To: linux-raid

NeilBrown wrote:

> If you want to write to it which might make the drive think it isn't
> bad any more, you could
> dd if=/dev/zero of=/dev/sdj seek=1953103080 count=1
>
> But if it were me, I would do
> dd if=/dev/sdj of=/dev/null skip=1953103080 count=1
> and confirm that gives an error, just to double check the numbers.

Thanks very much for your help. I managed to get rid of the bad block.
However, I had to write 4kB at once to the bad region (bs=4096) for the
drive to correct the block. Using bs=1 did not get rid of the errors.

Strangely, the Reallocated_Sector_Ct is still zero on the drive even though
Current_Pending_Sector is now zero. This is a Samsung HD103UJ by the way.

Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
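
The message above does not show the exact command that was run with bs=4096, so the following is only a guess at its shape. Because dd counts seek= in units of the block size, the offset has to be rescaled when bs changes. The sketch below builds the argument list without running anything; the helper name dd_write_args is hypothetical, and the resulting command is destructive if actually executed against a device.

```python
from typing import List


def dd_write_args(device: str, sector: int, bs: int = 4096) -> List[str]:
    """Build (but do not run) a dd command that zeroes the bs-sized block
    containing the given 512-byte sector. Hypothetical helper for illustration."""
    if bs % 512 != 0:
        raise ValueError("bs must be a multiple of the 512-byte sector size")
    sectors_per_block = bs // 512
    # dd interprets seek= in units of bs, not in 512-byte sectors.
    seek = sector // sectors_per_block
    return ["dd", "if=/dev/zero", "of=" + device,
            "bs=%d" % bs, "seek=%d" % seek, "count=1"]


# Sector number taken from the kernel log in this thread.
print(" ".join(dd_write_args("/dev/sdj", 1953103080)))
# dd if=/dev/zero of=/dev/sdj bs=4096 seek=244137885 count=1
```

With bs=4096 the seek value comes out as 244137885, the same number the kernel printed as the failing logical block.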

* Re: Raid and badblocks
2009-05-26 12:59 ` Jeremy Sanders
@ 2009-05-26 15:27 ` Andrew Burgess
2009-05-30 10:13 ` hank peng
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Burgess @ 2009-05-26 15:27 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-raid

On Tue, 2009-05-26 at 13:59 +0100, Jeremy Sanders wrote:

> Strangely, the Reallocated_Sector_Ct is still zero on the drive even though
> Current_Pending_Sector is now zero. This is a Samsung HD103UJ by the way.

That can be ok. It probably tried writing and then rereading and when
that worked it decided the sector didn't really have a 'hard' error
(like a physical defect on the platter) and thus didn't need to be
reallocated to a spare sector. You could generate an unreadable sector
during a write by the power failing or with excessive vibration.

That said, the drive firmware could also be broken.

* Re: Raid and badblocks
2009-05-26 15:27 ` Andrew Burgess
@ 2009-05-30 10:13 ` hank peng
2009-05-30 13:29 ` John Robinson
0 siblings, 1 reply; 8+ messages in thread
From: hank peng @ 2009-05-30 10:13 UTC (permalink / raw)
To: Andrew Burgess; +Cc: Jeremy Sanders, linux-raid

2009/5/26 Andrew Burgess <aab@cichlid.com>:
> On Tue, 2009-05-26 at 13:59 +0100, Jeremy Sanders wrote:
>
>> Strangely, the Reallocated_Sector_Ct is still zero on the drive even though
>> Current_Pending_Sector is now zero. This is a Samsung HD103UJ by the way.
>
> That can be ok. It probably tried writing and then rereading and when
> that worked it decided the sector didn't really have a 'hard' error
> (like a physical defect on the platter) and thus didn't need to be
> reallocated to a spare sector. You could generate an unreadable sector
> during a write by the power failing or with excessive vibration.
>
I have a question: in this situation, if I do as Jeremy did and write
zeros to the bad block to make the drive think it is not bad any more,
then what about the old data? Isn't it lost?

> That said, the drive firmware could also be broken
>

--
The simplest is not all best but the best is surely the simplest!
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

* Re: Raid and badblocks
2009-05-30 10:13 ` hank peng
@ 2009-05-30 13:29 ` John Robinson
0 siblings, 0 replies; 8+ messages in thread
From: John Robinson @ 2009-05-30 13:29 UTC (permalink / raw)
To: hank peng; +Cc: Linux RAID

On 30/05/2009 11:13, hank peng wrote:
> 2009/5/26 Andrew Burgess <aab@cichlid.com>:
>> On Tue, 2009-05-26 at 13:59 +0100, Jeremy Sanders wrote:
>>
>>> Strangely, the Reallocated_Sector_Ct is still zero on the drive even though
>>> Current_Pending_Sector is now zero. This is a Samsung HD103UJ by the way.
>> That can be ok. It probably tried writing and then rereading and when
>> that worked it decided the sector didn't really have a 'hard' error
>> (like a physical defect on the platter) and thus didn't need to be
>> reallocated to a spare sector. You could generate an unreadable sector
>> during a write by the power failing or with excessive vibration.
>>
> I have a question, in this situation, if I do as Jeremy did, write
> zero to bad block to make drive think it is not bad any more, then
> what about old data? Isn't it lost?

In Jeremy's case, the bad block was past the end of the data in the
array, so no, data wasn't lost. If the bad block had been within the
array data, the repair operation he ran earlier would have found it,
reconstructed the correct data from the other drives, and rewritten it,
again avoiding any data loss.

Cheers,

John.
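
The reconstruction mentioned above rests on the XOR parity property that RAID 5 uses: any one missing chunk in a stripe equals the XOR of all the remaining chunks, parity included. The sketch below is a toy illustration of that property only, not md's code; the chunk contents, the three-data-chunk stripe and the function name xor_blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A made-up stripe: three data chunks plus the parity computed from them.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Pretend the chunk on one drive is unreadable.
lost_index = 1
survivors = [chunk for i, chunk in enumerate(data) if i != lost_index]

# XOR of the surviving data chunks and the parity gives back the lost chunk,
# which could then be rewritten to the drive that returned the read error.
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[lost_index])  # True
```

This only works for sectors that belong to the array's data area; a sector in unused space outside it, as in this thread, has no parity covering it and is never read by a check or repair.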

* Re: Raid and badblocks
2009-05-22 9:19 ` Jeremy Sanders
2009-05-22 11:50 ` NeilBrown
@ 2009-05-30 14:21 ` Sujit Karataparambil
1 sibling, 0 replies; 8+ messages in thread
From: Sujit Karataparambil @ 2009-05-30 14:21 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-raid

>>We have a linux RAID 5 md setup with 10 disks controlled using a 3ware
>>9650se card. Here is /proc/mdstat:
>>Personalities : [raid6] [raid5] [raid4]
>>md0 : active raid5 sdb1[0] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4]
>>sde1[3] sdd1[2] sdc1[1]
>> 8788959360 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]

Above is the RAID configuration. (Cut and paste and other info)

> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error

3ware 9650se card: check whether the drive on sdj is plugged in properly,
since the error is "sd 4:0:8:0: [sdj] Unhandled sense code". Probably a jumper.
But why does the error come up as a read error? Is a read error given
priority/precedence over a write error in the hardware?

> end_request: I/O error, dev sdj, sector 1953102848
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error

3ware 9650se card. Same here.

> end_request: I/O error, dev sdj, sector 1953102856
> Buffer I/O error on device sdj, logical block 244137857
> Buffer I/O error on device sdj, logical block 244137858
> Buffer I/O error on device sdj, logical block 244137859
> Buffer I/O error on device sdj, logical block 244137860
> Buffer I/O error on device sdj, logical block 244137861
> Buffer I/O error on device sdj, logical block 244137862
> Buffer I/O error on device sdj, logical block 244137863
> Buffer I/O error on device sdj, logical block 244137864
> Buffer I/O error on device sdj, logical block 244137865
> Buffer I/O error on device sdj, logical block 244137866

This is a read error as the disk is being added to the raid. Here the
sector and logical block are being looked for.

> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 1953103080
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 1953103080
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error

3ware 9650se card. Same here.

> end_request: I/O error, dev sdj, sector 1953103080
> __ratelimit: 24 callbacks suppressed
> Buffer I/O error on device sdj, logical block 244137885

Sector and block info.

> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error

Same here.

> end_request: I/O error, dev sdj, sector 1953103080
> Buffer I/O error on device sdj, logical block 244137885
> md: requested-resync of RAID array md0
> md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> md: using maximum available idle IO bandwidth (but not more than 200000
> KB/sec) for requested-resync.
> md: using 128k window, over a total of 976551040 blocks.
> md: md0: requested-resync done.

Is it printing out information regarding minimum/optimal speed/IO bandwidth
due to a failed/partially failed device?

>
> --
> Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
>

--
-- Sujit K M