* Raid and badblocks
@ 2009-05-22 9:09 Jeremy Sanders
2009-05-22 9:19 ` Jeremy Sanders
0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Sanders @ 2009-05-22 9:09 UTC (permalink / raw)
To: linux-raid
We have a Linux md RAID 5 array of 10 disks attached to a 3ware
9650SE card. Here is /proc/mdstat:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[0] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1]
      8788959360 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
The built-in 3ware autoverify feature found some bad blocks on sdj. To try
to correct these I ran:
echo repair > /sys/block/md0/md/sync_action
This completed with no errors, but the bad blocks have not gone away.
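For reference, md reports the state of a check or repair through sysfs. The sketch below is not from the original post: it assumes the standard md sysfs files sync_action and mismatch_cnt, and the function name md_status is made up for illustration. It only reads, so it should be safe to run.

```shell
#!/bin/sh
# Hypothetical helper: print the current sync action and mismatch count
# for an md array. The default path matches the md0 array in this thread.
md_status() {
    md_dir=${1:-/sys/block/md0/md}
    printf 'sync_action=%s mismatch_cnt=%s\n' \
        "$(cat "$md_dir/sync_action")" \
        "$(cat "$md_dir/mismatch_cnt")"
}

# Only query the real array if it exists on this machine.
if [ -d /sys/block/md0/md ]; then
    md_status
fi
```

As far as I understand, mismatch_cnt counts parity inconsistencies found by a check, whereas unreadable sectors show up as I/O errors in the kernel log.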
Smartctl has these entries:
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       961         1953103085
I also tried a check on md0, but it found no errors. If I run a badblocks
read test directly on sdj, it finds bad blocks. A smartctl long test also
finds the problems.
Isn't the check/repair supposed to find bad blocks and report/repair them? I
thought a check would read all the data off all the disks and check for
inconsistencies. How can I get these bad blocks repaired?
The system is Fedora 10 with kernel 2.6.29.2-52.fc10.x86_64.
Thanks
Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
* Re: Raid and badblocks
2009-05-22 9:09 Raid and badblocks Jeremy Sanders
@ 2009-05-22 9:19 ` Jeremy Sanders
2009-05-22 11:50 ` NeilBrown
2009-05-30 14:21 ` Sujit Karataparambil
0 siblings, 2 replies; 8+ messages in thread
From: Jeremy Sanders @ 2009-05-22 9:19 UTC (permalink / raw)
To: linux-raid
PS, here are some more data from the logs:
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953102848
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953102856
Buffer I/O error on device sdj, logical block 244137857
Buffer I/O error on device sdj, logical block 244137858
Buffer I/O error on device sdj, logical block 244137859
Buffer I/O error on device sdj, logical block 244137860
Buffer I/O error on device sdj, logical block 244137861
Buffer I/O error on device sdj, logical block 244137862
Buffer I/O error on device sdj, logical block 244137863
Buffer I/O error on device sdj, logical block 244137864
Buffer I/O error on device sdj, logical block 244137865
Buffer I/O error on device sdj, logical block 244137866
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953103080
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953103080
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953103080
__ratelimit: 24 callbacks suppressed
Buffer I/O error on device sdj, logical block 244137885
3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
sd 4:0:8:0: [sdj] Unhandled sense code
sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
end_request: I/O error, dev sdj, sector 1953103080
Buffer I/O error on device sdj, logical block 244137885
md: requested-resync of RAID array md0
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for requested-resync.
md: using 128k window, over a total of 976551040 blocks.
md: md0: requested-resync done.
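As a cross-check (not part of the original message), the per-device size in the resync log agrees with the array size in /proc/mdstat: a 10-disk RAID 5 gives up one disk's worth of space to parity, so the usable size should be 9 times the per-device size. The variable names below are illustrative only.

```shell
#!/bin/sh
# Values copied from the thread; both are in 1 KiB blocks.
per_device_kib=976551040   # "over a total of 976551040 blocks"
disks=10                   # [10/10] in /proc/mdstat

# RAID 5 spends one device's worth of space on parity.
array_kib=$(( per_device_kib * (disks - 1) ))
echo "$array_kib"
```

This prints 8788959360, the block count shown for md0 in /proc/mdstat.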
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
* Re: Raid and badblocks
2009-05-22 9:19 ` Jeremy Sanders
@ 2009-05-22 11:50 ` NeilBrown
2009-05-26 12:59 ` Jeremy Sanders
2009-05-30 14:21 ` Sujit Karataparambil
1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2009-05-22 11:50 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-raid
On Fri, May 22, 2009 7:19 pm, Jeremy Sanders wrote:
> PS, here are some more data from the logs:
> .....
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 1953103080
                                          ^^^^^^^^^^
> Buffer I/O error on device sdj, logical block 244137885
                                                ^^^^^^^^^
> md: requested-resync of RAID array md0
> md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> md: using maximum available idle IO bandwidth (but not more than 200000
> KB/sec) for requested-resync.
> md: using 128k window, over a total of 976551040 blocks.
                                         ^^^^^^^^^
> md: md0: requested-resync done.
Consider those three underlined numbers.
The first is the (512byte) sector number of the error:
1953103080
The second is the 4K block number. Multiply by 8 to get sector number
and we get
1953103080
exactly the same. That is encouraging.
The third is the number of 1K blocks that the array covers.
So double that to get sectors and the answer is
1953102080
So that erroneous sector is 1000 sectors (500K) beyond the end of
the data area of the array. i.e. it is in the unused padding at the end,
possibly near where the metadata lives.
So md/raid5 will never touch that block.
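The arithmetic above can be reproduced with a few lines of shell. This is only a sketch using the numbers from the quoted log; the variable names are made up. Like the message, it treats sector numbers on sdj and offsets within sdj1 as the same thing, because the start offset of the partition is not given in the thread.

```shell
#!/bin/sh
# Numbers copied from the kernel log quoted above.
error_sector=1953103080      # 512-byte sector from end_request
error_block_4k=244137885     # 4 KiB logical block from the buffer I/O error
device_kib=976551040         # 1 KiB blocks covered by the array on each device

block_as_sector=$(( error_block_4k * 8 ))   # 8 sectors per 4 KiB block
data_end_sector=$(( device_kib * 2 ))       # 2 sectors per 1 KiB block
past_end=$(( error_sector - data_end_sector ))

echo "block as sector: $block_as_sector"
echo "data end sector: $data_end_sector"
echo "sectors past end: $past_end ($(( past_end / 2 )) KiB)"
```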
If you want to write to it, which might make the drive think it isn't
bad any more, you could
dd if=/dev/zero of=/dev/sdj seek=1953103080 count=1
But if it were me, I would first do
dd if=/dev/sdj of=/dev/null skip=1953103080 count=1
and confirm that it gives an error, just to double-check the numbers.
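A slightly more explicit form of the read check might look like the sketch below. It only prints the command rather than running it. The helper name read_check_cmd is hypothetical; bs=512 simply spells out dd's default block size, and iflag=direct is a GNU dd option added here on the assumption that bypassing the page cache makes the test more reliable. Neither appears in the original command.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a read-only dd command that
# reads a single 512-byte sector from a device.
read_check_cmd() {
    dev=$1
    sector=$2
    printf 'dd if=%s of=/dev/null bs=512 skip=%s count=1 iflag=direct\n' \
        "$dev" "$sector"
}

read_check_cmd /dev/sdj 1953103080
```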
NeilBrown
* Re: Raid and badblocks
2009-05-22 11:50 ` NeilBrown
@ 2009-05-26 12:59 ` Jeremy Sanders
2009-05-26 15:27 ` Andrew Burgess
0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Sanders @ 2009-05-26 12:59 UTC (permalink / raw)
To: linux-raid
NeilBrown wrote:
> If you want to write to it which might make the drive think it isn't
> bad any more, you could
> dd if=/dev/zero of=/dev/sdj seek=1953103080 count=1
>
> But if it were me, I would do
> dd if=/dev/sdj of=/dev/null skip=1953103080 count=1
> and confirm that gives and error, just to double check the numbers.
Thanks very much for your help. I managed to get rid of the bad block.
However, I had to write 4 kB at once to the bad region (bs=4096) for the drive
to correct the block. Using bs=1 did not seem to clear the errors.
Strangely, the Reallocated_Sector_Ct is still zero on the drive even though
Current_Pending_Sector is now zero. This is a Samsung HD103UJ by the way.
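One possible reading, which the thread does not confirm: the SMART self-test log reported LBA 1953103085, which lies 5 sectors into the 4 KiB block that starts at sector 1953103080. A single 512-byte write at sector 1953103080 would not touch it, while a 4 KiB write covers the whole block. The exact command used is not given here, so the sketch below is a hypothetical reconstruction that only prints a command of that shape.

```shell
#!/bin/sh
# LBA reported by the SMART self-test log earlier in the thread.
smart_lba=1953103085

# Round down to the enclosing 4 KiB block (8 sectors of 512 bytes).
block_4k=$(( smart_lba / 8 ))
first_sector=$(( block_4k * 8 ))
offset_in_block=$(( smart_lba - first_sector ))

echo "4 KiB block: $block_4k (sectors $first_sector to $(( first_sector + 7 )))"
echo "offset within block: $offset_in_block"

# Print only: running this command would overwrite 4 KiB on the disk.
echo "dd if=/dev/zero of=/dev/sdj bs=4096 seek=$block_4k count=1"
```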
Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
* Re: Raid and badblocks
2009-05-26 12:59 ` Jeremy Sanders
@ 2009-05-26 15:27 ` Andrew Burgess
2009-05-30 10:13 ` hank peng
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Burgess @ 2009-05-26 15:27 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-raid
On Tue, 2009-05-26 at 13:59 +0100, Jeremy Sanders wrote:
> Strangely, the Reallocated_Sector_Ct is still zero on the drive even though
> Current_Pending_Sector is now zero. This is a Samsung HD103UJ by the way.
That can be OK. The drive probably tried writing and then rereading, and
when that worked it decided the sector didn't really have a 'hard' error
(like a physical defect on the platter) and thus didn't need to be
reallocated to a spare sector. An unreadable sector can be created
during a write if the power fails or there is excessive vibration.
That said, the drive firmware could also be broken.
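To watch these counters over time, the attribute table can be filtered as in the sketch below. The function name sector_counters is made up, and it assumes the usual smartctl -A column layout in which the raw value is the last field, which holds for the rows quoted in this thread. Attribute 5 is Reallocated_Sector_Ct.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` style attribute rows on stdin and
# print name=raw_value for the three sector-health counters.
sector_counters() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2 "=" $NF }'
}

# Sample rows taken from the first message in this thread.
sector_counters <<'EOF'
197 Current_Pending_Sector  0x0012 100 100 000 Old_age Always  - 1
198 Offline_Uncorrectable   0x0030 100 100 000 Old_age Offline - 1
EOF
```

On a live system the input would come from something like smartctl -A /dev/sdj piped into sector_counters.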
* Re: Raid and badblocks
2009-05-26 15:27 ` Andrew Burgess
@ 2009-05-30 10:13 ` hank peng
2009-05-30 13:29 ` John Robinson
0 siblings, 1 reply; 8+ messages in thread
From: hank peng @ 2009-05-30 10:13 UTC (permalink / raw)
To: Andrew Burgess; +Cc: Jeremy Sanders, linux-raid
2009/5/26 Andrew Burgess <aab@cichlid.com>:
> On Tue, 2009-05-26 at 13:59 +0100, Jeremy Sanders wrote:
>
>> Strangely, the Reallocated_Sector_Ct is still zero on the drive even though
>> Current_Pending_Sector is now zero. This is a Samsung HD103UJ by the way.
>
> That can be ok. It probably tried writing and then rereading and when
> that worked it decided the sector didn't really have a 'hard' error
> (like a physical defect on the platter) and thus didn't need to be
> reallocated to a spare sector. You could generate an unreadable sector
> during a write by the power failing or with excessive vibration.
>
I have a question: in this situation, if I do as Jeremy did and write
zeros to the bad block to make the drive think it is not bad any more,
what happens to the old data? Isn't it lost?
> That said, the drive firmware could also be broken
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
The simplest is not all best but the best is surely the simplest!
* Re: Raid and badblocks
2009-05-30 10:13 ` hank peng
@ 2009-05-30 13:29 ` John Robinson
0 siblings, 0 replies; 8+ messages in thread
From: John Robinson @ 2009-05-30 13:29 UTC (permalink / raw)
To: hank peng; +Cc: Linux RAID
On 30/05/2009 11:13, hank peng wrote:
> 2009/5/26 Andrew Burgess <aab@cichlid.com>:
>> On Tue, 2009-05-26 at 13:59 +0100, Jeremy Sanders wrote:
>>
>>> Strangely, the Reallocated_Sector_Ct is still zero on the drive even though
>>> Current_Pending_Sector is now zero. This is a Samsung HD103UJ by the way.
>> That can be ok. It probably tried writing and then rereading and when
>> that worked it decided the sector didn't really have a 'hard' error
>> (like a physical defect on the platter) and thus didn't need to be
>> reallocated to a spare sector. You could generate an unreadable sector
>> during a write by the power failing or with excessive vibration.
>>
> I have a question, in this situation, if I do as Jeremy did, write
> zero to bad block to make drive think it is not bad any more, then
> what about old data? Isn't it lost?
In Jeremy's case, the bad block was past the end of the data in the
array, so no, data wasn't lost. If the bad block had been within the
array data, the repair operation he ran earlier would have found it,
reconstructed the correct data from the other drives, and rewritten it,
again avoiding any data loss.
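The reconstruction step relies on RAID 5 parity being the XOR of the data blocks in a stripe. The toy sketch below shows the idea on single bytes with made-up values; it is an illustration of the principle, not the md implementation.

```shell
#!/bin/sh
# Toy RAID 5 stripe: three data bytes plus one parity byte.
d0=165   # 0xA5
d1=60    # 0x3C
d2=15    # 0x0F
parity=$(( d0 ^ d1 ^ d2 ))

# Pretend d1 sits on an unreadable sector: rebuild it from the rest.
rebuilt_d1=$(( d0 ^ d2 ^ parity ))

echo "parity=$parity rebuilt_d1=$rebuilt_d1"
```

The rebuilt value equals the original d1, which is what lets a repair rewrite an unreadable block from the other drives.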
Cheers,
John.
* Re: Raid and badblocks
2009-05-22 9:19 ` Jeremy Sanders
2009-05-22 11:50 ` NeilBrown
@ 2009-05-30 14:21 ` Sujit Karataparambil
1 sibling, 0 replies; 8+ messages in thread
From: Sujit Karataparambil @ 2009-05-30 14:21 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-raid
>>We have a linux RAID 5 md setup with 10 disks controlled using a 3ware
>>9650se card. Here is /proc/mdstat:
>>Personalities : [raid6] [raid5] [raid4]
>>md0 : active raid5 sdb1[0] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4]
>>sde1[3] sdd1[2] sdc1[1]
>> 8788959360 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
Above is the RAID configuration (cut and paste, plus other info).
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
This is from the 3ware 9650se card. Check whether the drive on sdj is plugged
in properly, since the error is "sd 4:0:8:0: [sdj] Unhandled sense code". It is
probably a jumper. But why does the error come up as a read error? Is a read
error given priority/precedence over a write error in the hardware?
> end_request: I/O error, dev sdj, sector 1953102848
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
3ware 9650se card. Same here.
> end_request: I/O error, dev sdj, sector 1953102856
> Buffer I/O error on device sdj, logical block 244137857
> Buffer I/O error on device sdj, logical block 244137858
> Buffer I/O error on device sdj, logical block 244137859
> Buffer I/O error on device sdj, logical block 244137860
> Buffer I/O error on device sdj, logical block 244137861
> Buffer I/O error on device sdj, logical block 244137862
> Buffer I/O error on device sdj, logical block 244137863
> Buffer I/O error on device sdj, logical block 244137864
> Buffer I/O error on device sdj, logical block 244137865
> Buffer I/O error on device sdj, logical block 244137866
This is a read error as the disk is being added to the RAID.
Here the sector and logical block are being looked up.
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 1953103080
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
> end_request: I/O error, dev sdj, sector 1953103080
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
3ware 9650se card. Same Here.
> end_request: I/O error, dev sdj, sector 1953103080
> __ratelimit: 24 callbacks suppressed
> Buffer I/O error on device sdj, logical block 244137885
Sector and block info.
> 3w-9xxx: scsi4: ERROR: (0x03:0x0202): Drive ECC error:port=8.
> sd 4:0:8:0: [sdj] Unhandled sense code
> sd 4:0:8:0: [sdj] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
> sd 4:0:8:0: [sdj] Sense Key : Medium Error [current]
> sd 4:0:8:0: [sdj] Add. Sense: Unrecovered read error
Same here.
> end_request: I/O error, dev sdj, sector 1953103080
> Buffer I/O error on device sdj, logical block 244137885
> md: requested-resync of RAID array md0
> md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> md: using maximum available idle IO bandwidth (but not more than 200000
> KB/sec) for requested-resync.
> md: using 128k window, over a total of 976551040 blocks.
> md: md0: requested-resync done.
Is it printing out information about the minimum/optimal speed and I/O
bandwidth because of a failed or partially failed device?
>
>
> --
> Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
>
>
>
--
-- Sujit K M