* Re: Lose two disks during Raid 10 rebuild
From: NeilBrown @ 2012-08-23 21:07 UTC (permalink / raw)
To: Steven La; +Cc: linux-raid@vger.kernel.org
On Thu, 23 Aug 2012 19:28:27 +0000 Steven La <Steven.La@riverbed.com> wrote:
> Hello all,
>
> Got the following messages from syslog during Raid 10 rebuild cycle.
>
> Aug 3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] Unhandled sense code
> Aug 3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] Result: hostbyte=invalid
> driverbyte=DRIVER_SENSE
> Aug 3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] Sense Key : Medium Error
> [current]
"Medium Error" normally means that the recording medium (magnetic regions) is
corrupt in some way and a valid data block cannot be extracted.
> Aug 3 01:48:11 oak-sh283 kernel: Info fld=0x3ae0f43c
> Aug 3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] Add. Sense: Unrecovered
> read error
> Aug 3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 3a e0
> f3 ab 00 01 00 00
> Aug 3 01:48:11 oak-sh283 kernel: end_request: I/O error, dev sda, sector
> 987821116
> Aug 3 01:48:11 oak-sh283 kernel: md/raid10:md7: Disk failure on sda8,
> disabling device.
> Aug 3 01:48:11 oak-sh283 kernel: md/raid10:md7: Operation continuing on 2
> devices.
> Aug 3 01:48:11 oak-sh283 kernel: md: md7: recovery done.
> Aug 3 01:48:11 oak-sh283 kernel: md/raid10:md7: Disk failure on sdc8,
> disabling device.
Presumably md7 was trying to recover sdc8 from sda8. It got a data error on
sda8, so it marked sda8 as failed; with its only source gone, it could not
finish recovering sdc8 and marked that as failed too.
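To see why sda8 was the only source for sdc8, here is a minimal sketch of the "2 near-copies" placement on 4 devices. It assumes the usual near layout rule (copy k of chunk c lands on slot (c * copies + k) % disks) and assumes sda8 and sdc8 held slots 0 and 1, as the matching partitions do in the other arrays and as the [__UU] in the mdstat output further down suggests.

```shell
#!/bin/sh
# Near-copies layout: copy k of chunk c goes to slot (c * copies + k) % disks.
disks=4
copies=2
layout=""
for chunk in 0 1 2 3; do
    first=$(( (chunk * copies) % disks ))
    second=$(( (chunk * copies + 1) % disks ))
    echo "chunk $chunk: slots $first and $second"
    # Collect the slot pairs so the pattern is easy to inspect afterwards.
    layout="$layout$first$second "
done
```

Slots 0 and 1 always mirror each other, as do slots 2 and 3, so losing both members of one pair leaves no copy of half the chunks.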
> Aug 3 01:48:11 oak-sh283 kernel: md/raid10:md7: Operation continuing on 2
> devices.
> Aug 3 01:48:14 oak-sh283 kernel: md: unbind<sdc8>
> Aug 3 01:48:14 oak-sh283 kernel: md: export_rdev(sdc8)
> Aug 3 01:48:14 oak-sh283 kernel: md: unbind<sda8>
> Aug 3 01:48:14 oak-sh283 kernel: md: export_rdev(sda8)
> Aug 3 01:48:16 oak-sh283 raid_rebuild: Sending sighup to hald[22152] for event
> RebuildFinished for /dev/md7
>
>
> [admin@oak-sh283 ~]# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10]
> md5 : active raid10 sdc9[1] sde9[2] sdg9[3] sda9[0]
>       562997760 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>
> md7 : active raid10 sde8[2] sdg8[3]
>       562997760 blocks 64K chunks 2 near-copies [4/2] [__UU]
>
> md6 : active raid10 sdc7[1] sde7[2] sdg7[3] sda7[0]
>       562997760 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>
> md3 : active raid10 sdc6[1] sde6[2] sdg6[3] sda6[0]
>       52435968 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>
> md0 : active raid10 sdc2[1] sde2[2] sdg2[3] sda2[0]
>       10490240 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>
> md4 : active raid10 sdb3[0] sdh3[3] sdf3[2] sdd3[1]
>       19518720 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>
> md2 : active raid10 sdc3[1] sde3[2] sdg3[3] sda3[0]
>       67119360 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>
> md1 : active raid10 sdc5[1] sde5[2] sdg5[3] sda5[0]
>       134222848 blocks 64K chunks 2 near-copies [4/4] [UUUU]
>
> From the error message below (also shown above), the block that cannot be
> read from sda has lba=0x3ae0f3ab.
>
> Aug 3 01:48:11 oak-sh283 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 3a e0
> f3 ab 00 01 00 00
>
> [admin@oak-sh283 ~]# fdisk -s /dev/sda
> 976762584
This number is in kilobytes (fdisk -s reports 1K blocks), not sectors. The
drive is 1 TB, or 1953525168 sectors of 512 bytes.
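As a quick cross-check of the units, a sketch in shell arithmetic. It assumes 512-byte sectors, and takes the failing sector from the "end_request: I/O error" line in the quoted syslog.

```shell
#!/bin/sh
# fdisk -s reports the size in 1024-byte blocks, not in sectors.
size_kib=976762584
# Assumption: 512-byte sectors, so two sectors per kilobyte.
total_sectors=$(( size_kib * 2 ))
# Sector number from the "end_request: I/O error, dev sda" line.
bad_sector=987821116

echo "total sectors: $total_sectors"
if [ "$bad_sector" -lt "$total_sectors" ]; then
    echo "sector $bad_sector is inside the device"
else
    echo "sector $bad_sector is past the end of the device"
fi
```

With these numbers the failing sector is roughly half way through the drive, not beyond its end.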
>
> The last block on the drive is 0x3a3836d8
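Read as a sector number, 976762584 is 500102443008 bytes into the device,
about half way, so it is not the last block. The failing sector (987821116 in
the end_request line) is likewise only about half way in.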
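The failing sector can be cross-checked against the sense data in the quoted log. This is a small sketch in shell arithmetic, assuming the standard Read(10) layout (bytes 2-5 are the big-endian starting LBA, bytes 7-8 the transfer length) and assuming the "Info fld" value is the LBA of the failing block.

```shell
#!/bin/sh
# Bytes of the Read(10) CDB from the quoted syslog line; $1 is CDB byte 0.
set -- 28 00 3a e0 f3 ab 00 01 00 00
# Bytes 2-5: starting LBA. Bytes 7-8: transfer length in sectors.
start_lba=$(( 0x$3$4$5$6 ))
length=$(( 0x$8$9 ))
# "Info fld" from the sense data, assumed to be the failing LBA.
info_fld=$(( 0x3ae0f43c ))
offset=$(( info_fld - start_lba ))

echo "read starts at sector $start_lba, length $length sectors"
echo "failing sector $info_fld is $offset sectors into the request"
```

The read starts at 987820971 and covers 256 sectors; the failing sector, 987821116, is 145 sectors into it and matches the sector in the end_request line.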
You can probably correct the bad sector (987821116, from the end_request
line) by writing to it:
  dd if=/dev/zero of=/dev/sda seek=987821116 count=1 oflag=direct
I would try to read from the address first to make sure it is in error:
  dd of=/dev/null if=/dev/sda skip=987821116 count=1 iflag=direct
Then read the entire device to ensure there are no other media errors.
Then stop the array and re-assemble with --force.
Then try the recovery again.
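The steps above can be sketched as a dry run that only prints the commands. Everything here is an assumption for illustration: the variable names are hypothetical, the member list is a guess based on the other arrays, and which devices to pass to --assemble depends on the actual state of the array. The final step, starting the recovery again, is not shown.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# All defaults are assumptions; check each value before using any command.
disk=${DISK:-/dev/sda}
sector=${SECTOR:-987821116}   # from the end_request line in the syslog
array=${ARRAY:-/dev/md7}
members=${MEMBERS:-"/dev/sda8 /dev/sdc8 /dev/sde8 /dev/sdg8"}

plan() {
    # 1. Confirm the sector really is unreadable (direct I/O skips the cache).
    echo "dd if=$disk of=/dev/null skip=$sector count=1 iflag=direct"
    # 2. Overwrite it; the old contents are already unreadable.
    echo "dd if=/dev/zero of=$disk seek=$sector count=1 oflag=direct"
    # 3. Read the whole device to look for further media errors.
    echo "dd if=$disk of=/dev/null bs=1M"
    # 4. Stop the array, then force-assemble it from its members.
    echo "mdadm --stop $array"
    echo "mdadm --assemble --force $array $members"
}
plan
```

Printing rather than executing keeps the sketch harmless; each command should be reviewed and run by hand, in order, stopping if any step fails.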
NeilBrown
>
> (gdb) p/x 976762584
> $1 = 0x3a3836d8
> (gdb) p 0x3ae0f3ab
> $2 = 987820971
>
> So, it seems like the lba number used in the Read(10) command has exceeded the last block of the drive.
> Has anyone had this problem before? What else can I look at?
>
> Relevant info are shown below,
>
> [admin@oak-sh283 ~]# mdadm -V
> mdadm - v2.6.4 - 19th October 2007
>
> [admin@oak-sh283 ~]# uname -a
> Linux oak-sh283 2.6.32 #1 SMP Wed Aug 1 01:38:35 PDT 2012 x86_64 x86_64 x86_64 GNU/Linux
>
> Thanks and regards,
> --Steven