* Failed drive in raid6 while doing data-check
@ 2012-06-03 17:31 Krzysztof Adamski
  2012-06-03 18:32 ` Igor M Podlesny
  2012-06-04  3:56 ` NeilBrown
  0 siblings, 2 replies; 9+ messages in thread
From: Krzysztof Adamski @ 2012-06-03 17:31 UTC (permalink / raw)
  To: linux-raid

The monthly data check found a bad drive in my raid6 array. This is what
started to show up in the log:
Jun  3 12:02:53 rogen kernel: [9908355.355940] sd 2:0:1:0: attempting task abort! scmd(ffff8801547c6a00)
Jun  3 12:02:53 rogen kernel: [9908355.355953] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 e4 5c ed 38 00 00 08 00
Jun  3 12:02:53 rogen kernel: [9908355.355983] scsi target2:0:1: handle(0x0009), sas_address(0x4433221100000000), phy(0)
Jun  3 12:02:53 rogen kernel: [9908355.355992] scsi target2:0:1: enclosure_logical_id(0x500605b003f7aa10), slot(3)
Jun  3 12:02:56 rogen kernel: [9908359.141194] sd 2:0:1:0: task abort: SUCCESS scmd(ffff8801547c6a00)
Jun  3 12:02:56 rogen kernel: [9908359.141206] sd 2:0:1:0: attempting task abort! scmd(ffff8803aea45400)
Jun  3 12:02:56 rogen kernel: [9908359.141216] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 e4 5c ed 40 00 00 08 00

But now it has changed to this:
Jun  3 12:04:44 rogen kernel: [9908466.716281] sd 2:0:1:0: [sdb] Unhandled error code
Jun  3 12:04:44 rogen kernel: [9908466.716287] sd 2:0:1:0: [sdb]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun  3 12:04:44 rogen kernel: [9908466.716296] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 e4 5c ee 38 00 00 08 00
Jun  3 12:04:44 rogen kernel: [9908466.716319] end_request: I/O error, dev sdb, sector 3831295544
Jun  3 12:04:44 rogen kernel: [9908466.716616] sd 2:0:1:0: [sdb]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun  3 12:04:44 rogen kernel: [9908466.717200] mpt2sas0: removing handle(0x0009), sas_addr(0x4433221100000000)
Jun  3 12:04:44 rogen kernel: [9908466.917090] md/raid:md7: Disk failure on sdb2, disabling device.
Jun  3 12:04:44 rogen kernel: [9908466.917091] md/raid:md7: Operation continuing on 11 devices.
Jun  3 12:07:41 rogen kernel: [9908643.882541] INFO: task md7_resync:28497 blocked for more than 120 seconds.
Jun  3 12:07:41 rogen kernel: [9908643.882552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  3 12:07:41 rogen kernel: [9908643.882556] md7_resync      D ffff8800b508aa20     0 28497      2 0x00000000
Jun  3 12:07:41 rogen kernel: [9908643.882560]  ffff8802ab877b80 0000000000000046 ffff8803ffbfa340 0000000000000046
Jun  3 12:07:41 rogen kernel: [9908643.882564]  ffff8802ab876010 ffff8800b508a6a0 00000000001d29c0 ffff8802ab877fd8
Jun  3 12:07:41 rogen kernel: [9908643.882566]  ffff8802ab877fd8 00000000001d29c0 ffff880070448000 ffff8800b508a6a0
Jun  3 12:07:41 rogen kernel: [9908643.882569] Call Trace:
Jun  3 12:07:41 rogen kernel: [9908643.882577]  [<ffffffff81339704>] schedule+0x55/0x57
Jun  3 12:07:41 rogen kernel: [9908643.882599]  [<ffffffffa01da26b>] bitmap_cond_end_sync+0xbc/0x152 [md_mod]
Jun  3 12:07:41 rogen kernel: [9908643.882602]  [<ffffffff8106190d>] ? wake_up_bit+0x25/0x25
Jun  3 12:07:41 rogen kernel: [9908643.882607]  [<ffffffffa022f7a7>] sync_request+0x22e/0x2ef [raid456]
Jun  3 12:07:41 rogen kernel: [9908643.882613]  [<ffffffffa01d1ebc>] ? is_mddev_idle+0x106/0x118 [md_mod]
Jun  3 12:07:41 rogen kernel: [9908643.882618]  [<ffffffffa01d2689>] md_do_sync+0x7bb/0xbce [md_mod]
Jun  3 12:07:41 rogen kernel: [9908643.882624]  [<ffffffffa01d2cbe>] md_thread+0xff/0x11d [md_mod]
Jun  3 12:07:41 rogen kernel: [9908643.882629]  [<ffffffffa01d2bbf>] ? md_rdev_init+0x8d/0x8d [md_mod]
Jun  3 12:07:41 rogen kernel: [9908643.882631]  [<ffffffff81061499>] kthread+0x9b/0xa3
Jun  3 12:07:41 rogen kernel: [9908643.882634]  [<ffffffff81342ca4>] kernel_thread_helper+0x4/0x10
Jun  3 12:07:41 rogen kernel: [9908643.882637]  [<ffffffff810613fe>] ? __init_kthread_worker+0x56/0x56
Jun  3 12:07:41 rogen kernel: [9908643.882639]  [<ffffffff81342ca0>] ? gs_change+0x13/0x13
Jun  3 12:07:41 rogen kernel: [9908643.882641] INFO: lockdep is turned off.

The output of cat /proc/mdstat is:
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
      29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
      [=============>.......]  check = 65.3% (1913765076/2928312160) finish=44345.9min speed=381K/sec
      bitmap: 1/22 pages [4KB], 65536KB chunk

I don't really want to wait 30 days for this to finish. What is the
correct thing to do before I replace the failed drive?

Thanks in advance,
K



* Re: Failed drive in raid6 while doing data-check
  2012-06-03 17:31 Failed drive in raid6 while doing data-check Krzysztof Adamski
@ 2012-06-03 18:32 ` Igor M Podlesny
  2012-06-03 18:38   ` Krzysztof Adamski
  2012-06-04  3:56 ` NeilBrown
  1 sibling, 1 reply; 9+ messages in thread
From: Igor M Podlesny @ 2012-06-03 18:32 UTC (permalink / raw)
  To: Krzysztof Adamski; +Cc: linux-raid

On 4 June 2012 01:31, Krzysztof Adamski <k@adamski.org> wrote:
[…]
> The output of cat /proc/mdstat is:
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
>      29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
>      [=============>.......]  check = 65.3% (1913765076/2928312160) finish=44345.9min speed=381K/sec
>      bitmap: 1/22 pages [4KB], 65536KB chunk
>
> I don't really want to wait 30 days for this to finish. What is the
> correct thing to do before I replace the failed drive?

   Is stripe_cache_size reasonably adjusted?
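
   For what it's worth, it can be checked and raised at runtime -- a
sketch, assuming the array in question is md7 (the value is the number of
stripe-cache entries, so memory use grows with it):

   cat /sys/block/md7/md/stripe_cache_size
   echo 32768 > /sys/block/md7/md/stripe_cache_size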



* Re: Failed drive in raid6 while doing data-check
  2012-06-03 18:32 ` Igor M Podlesny
@ 2012-06-03 18:38   ` Krzysztof Adamski
  2012-06-03 20:35     ` Mathias Burén
  0 siblings, 1 reply; 9+ messages in thread
From: Krzysztof Adamski @ 2012-06-03 18:38 UTC (permalink / raw)
  To: Igor M Podlesny; +Cc: linux-raid

On Mon, 2012-06-04 at 02:32 +0800, Igor M Podlesny wrote:
> On 4 June 2012 01:31, Krzysztof Adamski <k@adamski.org> wrote:
> […]
> > The output of cat /proc/mdstat is:
> > Personalities : [raid1] [raid6] [raid5] [raid4]
> > md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
> >      29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
> >      [=============>.......]  check = 65.3% (1913765076/2928312160) finish=44345.9min speed=381K/sec
> >      bitmap: 1/22 pages [4KB], 65536KB chunk
> >
> > I don't really want to wait 30 days for this to finish. What is the
> > correct thing to do before I replace the failed drive?
> 
>    Is stripe_cache_size reasonably adjusted?
> 
I hope so.

# cat /sys/block/md7/md/stripe_cache_size
32768

# mdadm -E /dev/sdd2
/dev/sdd2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 3fe8cdcf:d7b1a55c:d8c07daa:c7c3021b
           Name : rogen:7  (local to host rogen)
  Creation Time : Thu Nov  3 13:31:13 2011
     Raid Level : raid6
   Raid Devices : 12

 Avail Dev Size : 5856624640 (2792.66 GiB 2998.59 GB)
     Array Size : 58566243200 (27926.56 GiB 29985.92 GB)
  Used Dev Size : 5856624320 (2792.66 GiB 2998.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 26f3ab68:62de305e:0720d287:ad4624a4

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Jun  3 12:43:57 2012
       Checksum : c6da765f - correct
         Events : 23610

         Layout : left-symmetric
     Chunk Size : 32K

   Device Role : Active device 0
   Array State : AAA.AAAAAAAA ('A' == active, '.' == missing)





* Re: Failed drive in raid6 while doing data-check
  2012-06-03 18:38   ` Krzysztof Adamski
@ 2012-06-03 20:35     ` Mathias Burén
  0 siblings, 0 replies; 9+ messages in thread
From: Mathias Burén @ 2012-06-03 20:35 UTC (permalink / raw)
  To: Krzysztof Adamski; +Cc: Igor M Podlesny, linux-raid

On 3 June 2012 19:38, Krzysztof Adamski <k@adamski.org> wrote:
> On Mon, 2012-06-04 at 02:32 +0800, Igor M Podlesny wrote:
>> On 4 June 2012 01:31, Krzysztof Adamski <k@adamski.org> wrote:
>> […]
>> > The output of cat /proc/mdstat is:
>> > Personalities : [raid1] [raid6] [raid5] [raid4]
>> > md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
>> >      29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
>> >      [=============>.......]  check = 65.3% (1913765076/2928312160) finish=44345.9min speed=381K/sec
>> >      bitmap: 1/22 pages [4KB], 65536KB chunk
>> >
>> > I don't really want to wait 30 days for this to finish. What is the
>> > correct thing to do before I replace the failed drive?
>>

Don't quote me on this, but since you are running a RAID6 with one failed
drive, you could stop the check (echo idle >
/sys/block/mdX/md/sync_action, I believe). Then you could manually fail
the broken HDD, remove it, and physically replace it. However, it looks
like MD has already failed it for you ([UUU_UUUUUUUU]).
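
Roughly, the sequence would be something like this (a sketch, assuming the
array is /dev/md7 and the failed member is /dev/sdb2; adjust names to
taste):

  echo idle > /sys/block/md7/md/sync_action   # stop the running check
  mdadm /dev/md7 --fail /dev/sdb2             # likely a no-op; md already failed it
  mdadm /dev/md7 --remove /dev/sdb2           # drop the dead member
  # ...physically swap the drive, then add the replacement:
  mdadm /dev/md7 --add /dev/sdb2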

If possible I'd unmount any filesystems on the array first just in case.

Kind regards,
Mathias


* Re: Failed drive in raid6 while doing data-check
  2012-06-03 17:31 Failed drive in raid6 while doing data-check Krzysztof Adamski
  2012-06-03 18:32 ` Igor M Podlesny
@ 2012-06-04  3:56 ` NeilBrown
  2012-06-04 13:19   ` Krzysztof Adamski
  1 sibling, 1 reply; 9+ messages in thread
From: NeilBrown @ 2012-06-04  3:56 UTC (permalink / raw)
  To: Krzysztof Adamski; +Cc: linux-raid

On Sun, 03 Jun 2012 13:31:14 -0400 Krzysztof Adamski <k@adamski.org> wrote:

> The monthly data check found a bad drive in my raid6 array. This is what
> started to show up in the log:
> […]
> I don't really want to wait 30 days for this to finish. What is the
> correct thing to do before I replace the failed drive?

If it is still hanging, then I suspect a reboot is your only way forward.
This should not affect the data on the array.

What kernel are you running?  I'll see if I can find the cause.

NeilBrown


* Re: Failed drive in raid6 while doing data-check
  2012-06-04  3:56 ` NeilBrown
@ 2012-06-04 13:19   ` Krzysztof Adamski
  2012-06-05  3:35     ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Krzysztof Adamski @ 2012-06-04 13:19 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Mon, 2012-06-04 at 13:56 +1000, NeilBrown wrote:
> On Sun, 03 Jun 2012 13:31:14 -0400 Krzysztof Adamski <k@adamski.org> wrote:
> 
> > The monthly data check found a bad drive in my raid6 array.
> > […]
> > I don't really want to wait 30 days for this to finish. What is the
> > correct thing to do before I replace the failed drive?
> 
> If it is still hanging, then I suspect a reboot is your only way forward.
> This should not affect the data on the array.

Nothing more is being written to the log, but there is no progress:

Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
      29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
      [=============>.......]  check = 65.3% (1913765076/2928312160) finish=660982.5min speed=25K/sec
      bitmap: 4/22 pages [16KB], 65536KB chunk


> What kernel are you running?  I'll see if I can find the cause.

Self-compiled:
# cat /proc/version 
Linux version 3.0.18-KAA (root@rogen) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Fri Jan 27 18:35:11 EST 2012

I can provide any other info that is needed.

K



* Re: Failed drive in raid6 while doing data-check
  2012-06-04 13:19   ` Krzysztof Adamski
@ 2012-06-05  3:35     ` NeilBrown
  2012-06-05 16:48       ` Krzysztof Adamski
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2012-06-05  3:35 UTC (permalink / raw)
  To: Krzysztof Adamski; +Cc: linux-raid

On Mon, 04 Jun 2012 09:19:06 -0400 Krzysztof Adamski <k@adamski.org> wrote:

> On Mon, 2012-06-04 at 13:56 +1000, NeilBrown wrote:
> > […]
> > 
> > If it is still hanging, then I suspect a reboot is your only way forward.
> > This should not affect the data on the array.
> 
> Nothing more is being written to the log, but there is no progress:
> 
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
>       29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
>       [=============>.......]  check = 65.3% (1913765076/2928312160) finish=660982.5min speed=25K/sec
>       bitmap: 4/22 pages [16KB], 65536KB chunk
> 
> 
> > What kernel are you running?  I'll see if I can find the cause.
> 
> Self-compiled:
> # cat /proc/version 
> Linux version 3.0.18-KAA (root@rogen) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Fri Jan 27 18:35:11 EST 2012
> 
> I can provide any other info that is needed.

Thanks.
I have a theory about what might be causing it, but I cannot duplicate the
problem.  I think you were quite "unlucky" to hit it.

If I'm right, you might be able to kick the 'check' back into life by reading
from the area of the array that is currently being synced, i.e. about
19137650760K into the array.

dd skip=19137650000 bs=1024 count=2000 if=/dev/md7 of=/dev/null

might do it.
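
Spelling out the arithmetic (a sketch; it assumes /proc/mdstat reports the
per-device position in 1K blocks and that 10 of the 12 raid6 members carry
data):

  POS=1913765076     # per-device check position from /proc/mdstat, in K
  DATA=10            # 12 devices minus 2 parity
  # array offset ~= POS * DATA = 19137650760K; dd's skip counts bs-sized blocks
  dd if=/dev/md7 of=/dev/null bs=1024 skip=$((POS * DATA)) count=2000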

The thread that is doing the 'check' is blocked in bitmap_cond_end_sync,
waiting for all pending check requests to complete.  It only does this every
5 seconds or so, so once in hundreds of requests.  (It doesn't really need to
for 'check', only for 'sync', but it does anyway.)

But for some reason one request isn't completing.  That should mean that
STRIPE_INSYNC is not yet set on a 'struct stripe_head'.

I'm guessing that it was processed by handle_parity_checks6, which noticed
in the check_state_check_result case that s->failed is non-zero, and so set
sh->check_state to check_state_compute_result, but then nothing caused
the stripe to go around the loop again.  The code there should probably
set STRIPE_HANDLE... maybe.
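
As a sketch of that hypothesis (not the actual raid5.c source; the names
are taken from the trace and the description above):

  switch (sh->check_state) {
  case check_state_check_result:
          if (s->failed) {
                  /* a member failed: compute the missing block(s)
                   * instead of comparing parity ... */
                  sh->check_state = check_state_compute_result;
                  /* ... and -- the suspected missing step -- flag the
                   * stripe so it gets handled again: */
                  set_bit(STRIPE_HANDLE, &sh->state);
          }
          break;
  /* ... other states ... */
  }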

If the dead disk provides data to that stripe, rather than parity, then the
read should set STRIPE_HANDLE and the handling of the stripe can progress.

It might not work, but it is worth a try.  What should happen is that the
check will immediately abort.  If it doesn't, you probably need to schedule a
reboot ... if you haven't already.

NeilBrown


* Re: Failed drive in raid6 while doing data-check
  2012-06-05  3:35     ` NeilBrown
@ 2012-06-05 16:48       ` Krzysztof Adamski
  2012-06-06  1:22         ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Krzysztof Adamski @ 2012-06-05 16:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, 2012-06-05 at 13:35 +1000, NeilBrown wrote:
> […]
> 
> If I'm right, you might be able to kick the 'check' back into life by reading
> from the area of the array that is currently being synced, i.e. about
> 19137650760K into the array.
> 
> dd skip=19137650000 bs=1024 count=2000 if=/dev/md7 of=/dev/null
> 
> might do it.
> […]
> It might not work, but it is worth a try.  What should happen is that the
> check will immediately abort.  If it doesn't, you probably need to schedule a
> reboot ... if you haven't already.

I had to do a reboot already.

The drive was wedged for some reason, with its activity light solid; a soft
reboot (shutdown -r now) didn't wake it up, and the card BIOS didn't see it.
I had to pull and reinsert it before the LSI card would recognize it.

I added it back to the array and the rebuild took a few minutes (thanks to
the bitmap). I forced a recheck, and after 12 hours everything is good.
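
For the record, that was roughly (a sketch; exact device names assumed):

  mdadm /dev/md7 --re-add /dev/sdb2            # bitmap makes this a quick catch-up
  echo check > /sys/block/md7/md/sync_action   # start a fresh data-check
  cat /proc/mdstat                             # watch it progress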

This was not the first time I have seen the check stall like this when a
drive stops working in this machine, though previously with a different
adapter card and different drives. I will try what you suggested if it
happens again.

Now I wonder if I should have just pulled/reinserted the drive to see
what would have happened.

K



* Re: Failed drive in raid6 while doing data-check
  2012-06-05 16:48       ` Krzysztof Adamski
@ 2012-06-06  1:22         ` NeilBrown
  0 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2012-06-06  1:22 UTC (permalink / raw)
  To: Krzysztof Adamski; +Cc: linux-raid

On Tue, 05 Jun 2012 12:48:02 -0400 Krzysztof Adamski <k@adamski.org> wrote:

> […]
> 
> I had to do a reboot already.
> 
> The drive was wedged for some reason, with its activity light solid; a soft
> reboot (shutdown -r now) didn't wake it up, and the card BIOS didn't see it.
> I had to pull and reinsert it before the LSI card would recognize it.

Maybe that was the problem.  If md had sent a request to the drive and not
received any reply (success or failure), it would keep waiting until it got
one.

> 
> I added it back to the array and the rebuild took a few minutes (thanks to
> the bitmap). I forced a recheck, and after 12 hours everything is good.

:-)

> 
> This was not the first time I have seen the check stall like this when a
> drive stops working in this machine, though previously with a different
> adapter card and different drives. I will try what you suggested if it
> happens again.

I looked at the code again and realised I was missing something and it
couldn't possibly lock up where I thought it might.  So while that experiment
won't hurt, it is very unlikely to help.


> 
> Now I wonder if I should have just pulled/reinserted the drive to see
> what would have happened.

Yes, that would be interesting.

Thanks,
NeilBrown

