* Failed drive in raid6 while doing data-check
From: Krzysztof Adamski @ 2012-06-03 17:31 UTC
To: linux-raid

The monthly data check found a bad drive in my raid6 array. This is what
started to show up in the log:

Jun 3 12:02:53 rogen kernel: [9908355.355940] sd 2:0:1:0: attempting task abort! scmd(ffff8801547c6a00)
Jun 3 12:02:53 rogen kernel: [9908355.355953] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 e4 5c ed 38 00 00 08 00
Jun 3 12:02:53 rogen kernel: [9908355.355983] scsi target2:0:1: handle(0x0009), sas_address(0x4433221100000000), phy(0)
Jun 3 12:02:53 rogen kernel: [9908355.355992] scsi target2:0:1: enclosure_logical_id(0x500605b003f7aa10), slot(3)
Jun 3 12:02:56 rogen kernel: [9908359.141194] sd 2:0:1:0: task abort: SUCCESS scmd(ffff8801547c6a00)
Jun 3 12:02:56 rogen kernel: [9908359.141206] sd 2:0:1:0: attempting task abort! scmd(ffff8803aea45400)
Jun 3 12:02:56 rogen kernel: [9908359.141216] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 e4 5c ed 40 00 00 08 00

But now it has changed to this:

Jun 3 12:04:44 rogen kernel: [9908466.716281] sd 2:0:1:0: [sdb] Unhandled error code
Jun 3 12:04:44 rogen kernel: [9908466.716287] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 3 12:04:44 rogen kernel: [9908466.716296] sd 2:0:1:0: [sdb] CDB: Read(10): 28 00 e4 5c ee 38 00 00 08 00
Jun 3 12:04:44 rogen kernel: [9908466.716319] end_request: I/O error, dev sdb, sector 3831295544
Jun 3 12:04:44 rogen kernel: [9908466.716616] sd 2:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 3 12:04:44 rogen kernel: [9908466.717200] mpt2sas0: removing handle(0x0009), sas_addr(0x4433221100000000)
Jun 3 12:04:44 rogen kernel: [9908466.917090] md/raid:md7: Disk failure on sdb2, disabling device.
Jun 3 12:04:44 rogen kernel: [9908466.917091] md/raid:md7: Operation continuing on 11 devices.
Jun 3 12:07:41 rogen kernel: [9908643.882541] INFO: task md7_resync:28497 blocked for more than 120 seconds.
Jun 3 12:07:41 rogen kernel: [9908643.882552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 3 12:07:41 rogen kernel: [9908643.882556] md7_resync      D ffff8800b508aa20     0 28497      2 0x00000000
Jun 3 12:07:41 rogen kernel: [9908643.882560]  ffff8802ab877b80 0000000000000046 ffff8803ffbfa340 0000000000000046
Jun 3 12:07:41 rogen kernel: [9908643.882564]  ffff8802ab876010 ffff8800b508a6a0 00000000001d29c0 ffff8802ab877fd8
Jun 3 12:07:41 rogen kernel: [9908643.882566]  ffff8802ab877fd8 00000000001d29c0 ffff880070448000 ffff8800b508a6a0
Jun 3 12:07:41 rogen kernel: [9908643.882569] Call Trace:
Jun 3 12:07:41 rogen kernel: [9908643.882577]  [<ffffffff81339704>] schedule+0x55/0x57
Jun 3 12:07:41 rogen kernel: [9908643.882599]  [<ffffffffa01da26b>] bitmap_cond_end_sync+0xbc/0x152 [md_mod]
Jun 3 12:07:41 rogen kernel: [9908643.882602]  [<ffffffff8106190d>] ? wake_up_bit+0x25/0x25
Jun 3 12:07:41 rogen kernel: [9908643.882607]  [<ffffffffa022f7a7>] sync_request+0x22e/0x2ef [raid456]
Jun 3 12:07:41 rogen kernel: [9908643.882613]  [<ffffffffa01d1ebc>] ? is_mddev_idle+0x106/0x118 [md_mod]
Jun 3 12:07:41 rogen kernel: [9908643.882618]  [<ffffffffa01d2689>] md_do_sync+0x7bb/0xbce [md_mod]
Jun 3 12:07:41 rogen kernel: [9908643.882624]  [<ffffffffa01d2cbe>] md_thread+0xff/0x11d [md_mod]
Jun 3 12:07:41 rogen kernel: [9908643.882629]  [<ffffffffa01d2bbf>] ? md_rdev_init+0x8d/0x8d [md_mod]
Jun 3 12:07:41 rogen kernel: [9908643.882631]  [<ffffffff81061499>] kthread+0x9b/0xa3
Jun 3 12:07:41 rogen kernel: [9908643.882634]  [<ffffffff81342ca4>] kernel_thread_helper+0x4/0x10
Jun 3 12:07:41 rogen kernel: [9908643.882637]  [<ffffffff810613fe>] ? __init_kthread_worker+0x56/0x56
Jun 3 12:07:41 rogen kernel: [9908643.882639]  [<ffffffff81342ca0>] ? gs_change+0x13/0x13
Jun 3 12:07:41 rogen kernel: [9908643.882641] INFO: lockdep is turned off.

The output of cat /proc/mdstat is:

Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
      29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
      [=============>.......]  check = 65.3% (1913765076/2928312160) finish=44345.9min speed=381K/sec
      bitmap: 1/22 pages [4KB], 65536KB chunk

I don't really want to wait 30 days for this to finish. What is the correct
thing to do before I replace the failed drive?

Thanks in advance,
K
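(The "monthly data check" above is md's scrub, driven through the sync_action
file in sysfs; on many distributions a cron job starts it once a month. The
commands below are only a generic sketch of that interface, assuming the array
is md7 as in this thread; they are not taken from the reporter's system.)

  echo check > /sys/block/md7/md/sync_action   # start a scrub of md7
  echo idle  > /sys/block/md7/md/sync_action   # interrupt a running scrub
  cat /proc/mdstat                             # watch progress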
* Re: Failed drive in raid6 while doing data-check
From: Igor M Podlesny @ 2012-06-03 18:32 UTC
To: Krzysztof Adamski; +Cc: linux-raid

On 4 June 2012 01:31, Krzysztof Adamski <k@adamski.org> wrote:
[…]
> The output of cat /proc/mdstat is:
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
>       29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
>       [=============>.......]  check = 65.3% (1913765076/2928312160) finish=44345.9min speed=381K/sec
>       bitmap: 1/22 pages [4KB], 65536KB chunk
>
> I don't really want to wait 30 days for this to finish. What is the correct
> thing to do before I replace the failed drive?

Is stripe_cache_size reasonably adjusted?
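(For reference, stripe_cache_size and the global resync speed limits are both
runtime tunables; a typical way to inspect and, if needed, raise them is shown
below as a sketch. The value 8192 is purely illustrative, the right number
depends on RAM and workload, and md7 is assumed as in this thread.)

  cat /sys/block/md7/md/stripe_cache_size
  echo 8192 > /sys/block/md7/md/stripe_cache_size
  cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max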
* Re: Failed drive in raid6 while doing data-check
From: Krzysztof Adamski @ 2012-06-03 18:38 UTC
To: Igor M Podlesny; +Cc: linux-raid

On Mon, 2012-06-04 at 02:32 +0800, Igor M Podlesny wrote:
> […]
> Is stripe_cache_size reasonably adjusted?

I hope so.

# cat /sys/block/md7/md/stripe_cache_size
32768

# mdadm -E /dev/sdd2
/dev/sdd2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 3fe8cdcf:d7b1a55c:d8c07daa:c7c3021b
           Name : rogen:7  (local to host rogen)
  Creation Time : Thu Nov  3 13:31:13 2011
     Raid Level : raid6
   Raid Devices : 12

 Avail Dev Size : 5856624640 (2792.66 GiB 2998.59 GB)
     Array Size : 58566243200 (27926.56 GiB 29985.92 GB)
  Used Dev Size : 5856624320 (2792.66 GiB 2998.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 26f3ab68:62de305e:0720d287:ad4624a4

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Jun  3 12:43:57 2012
       Checksum : c6da765f - correct
         Events : 23610

         Layout : left-symmetric
     Chunk Size : 32K

    Device Role : Active device 0
    Array State : AAA.AAAAAAAA ('A' == active, '.' == missing)
* Re: Failed drive in raid6 while doing data-check
From: Mathias Burén @ 2012-06-03 20:35 UTC
To: Krzysztof Adamski; +Cc: Igor M Podlesny, linux-raid

On 3 June 2012 19:38, Krzysztof Adamski <k@adamski.org> wrote:
> […]
> > > I don't really want to wait 30 days for this to finish. What is the correct
> > > thing to do before I replace the failed drive?

Don't quote me on this, but since you are running a RAID6 with one failed
drive, you could stop the check (echo idle > /sys/block/mdX/md/sync_action,
I believe). Then you could manually fail the broken HDD, remove it, and
physically replace it. However, it looks like MD already did that for you
([UUU_UUUUUUUU]). If possible I'd unmount any filesystems on the array first,
just in case.

Kind regards,
Mathias
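(The sequence Mathias describes maps onto roughly the commands below. This is
only a sketch: it assumes the array is md7 and the failed member is sdb2, as
in this thread, and md has already marked sdb2 failed here, so the --fail step
is redundant in this particular case.)

  echo idle > /sys/block/md7/md/sync_action   # stop the running check
  mdadm /dev/md7 --fail /dev/sdb2             # redundant here; md already did it
  mdadm /dev/md7 --remove /dev/sdb2           # drop the member from the array
  # physically swap the drive, recreate the partition, then:
  mdadm /dev/md7 --add /dev/sdb2              # start the rebuild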
* Re: Failed drive in raid6 while doing data-check
From: NeilBrown @ 2012-06-04 3:56 UTC
To: Krzysztof Adamski; +Cc: linux-raid

On Sun, 03 Jun 2012 13:31:14 -0400 Krzysztof Adamski <k@adamski.org> wrote:
> The monthly data check found a bad drive in my raid6 array. This is what
> started to show up in the log:
> […]
> Jun 3 12:04:44 rogen kernel: [9908466.917090] md/raid:md7: Disk failure on sdb2, disabling device.
> Jun 3 12:04:44 rogen kernel: [9908466.917091] md/raid:md7: Operation continuing on 11 devices.
> Jun 3 12:07:41 rogen kernel: [9908643.882541] INFO: task md7_resync:28497 blocked for more than 120 seconds.
> […]
> The output of cat /proc/mdstat is:
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
>       29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
>       [=============>.......]  check = 65.3% (1913765076/2928312160) finish=44345.9min speed=381K/sec
>       bitmap: 1/22 pages [4KB], 65536KB chunk
>
> I don't really want to wait 30 days for this to finish. What is the correct
> thing to do before I replace the failed drive?

If it is still hanging, then I suspect a reboot is your only way forward.
This should not affect the data on the array.

What kernel are you running?  I'll see if I can find the cause.

NeilBrown
* Re: Failed drive in raid6 while doing data-check
From: Krzysztof Adamski @ 2012-06-04 13:19 UTC
To: NeilBrown; +Cc: linux-raid

On Mon, 2012-06-04 at 13:56 +1000, NeilBrown wrote:
> […]
> If it is still hanging, then I suspect a reboot is your only way forward.
> This should not affect the data on the array.

Nothing more is being written to the log, but there is no progress:

Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
      29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
      [=============>.......]  check = 65.3% (1913765076/2928312160) finish=660982.5min speed=25K/sec
      bitmap: 4/22 pages [16KB], 65536KB chunk

> What kernel are you running?  I'll see if I can find the cause.

Self compiled:

# cat /proc/version
Linux version 3.0.18-KAA (root@rogen) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Fri Jan 27 18:35:11 EST 2012

I can provide any other info that is needed.

K
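(A check that is genuinely stalled, as above, can be told apart from one that
is merely throttled by watching the low-level progress counters for a minute
or two. These are generic md sysfs files, shown only as a sketch for an array
named md7.)

  cat /sys/block/md7/md/sync_completed   # sectors done / sectors total
  cat /sys/block/md7/md/sync_speed       # current rate in K/sec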
* Re: Failed drive in raid6 while doing data-check
From: NeilBrown @ 2012-06-05 3:35 UTC
To: Krzysztof Adamski; +Cc: linux-raid

On Mon, 04 Jun 2012 09:19:06 -0400 Krzysztof Adamski <k@adamski.org> wrote:
> […]
> Nothing more is being written to the log, but there is no progress:
>
> md7 : active raid6 sdd2[0] sdab2[11] sdaa2[10] sdz2[9] sdy2[8] sde2[7] sdh2[6] sdf2[5] sdg2[4] sdb2[3](F) sdc2[2] sda2[1]
>       29283121600 blocks super 1.2 level 6, 32k chunk, algorithm 2 [12/11] [UUU_UUUUUUUU]
>       [=============>.......]  check = 65.3% (1913765076/2928312160) finish=660982.5min speed=25K/sec
>       bitmap: 4/22 pages [16KB], 65536KB chunk
>
> Self compiled:
> # cat /proc/version
> Linux version 3.0.18-KAA (root@rogen) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Fri Jan 27 18:35:11 EST 2012
>
> I can provide any other info that is needed.

Thanks.
I have a theory about what might be causing it, but I cannot duplicate the
problem.  I think you were quite "unlucky" to hit it.

If I'm right you might be able to kick the 'check' back into life by reading
from the area of the array that is currently being synced, i.e. about
19137650760K into the array.

  dd skip=19137650000 bs=1024 count=2000 if=/dev/md7 of=/dev/null

might do it.

The thread that is doing the 'check' is blocked in bitmap_cond_end_sync,
waiting for all pending check requests to complete.  It only does this every
5 seconds or so, so once in hundreds of requests.  (It doesn't really need to
for 'check', only for 'sync', but it does anyway.)

But for some reason one request isn't completing.  That should mean that
STRIPE_INSYNC is not yet set on a 'struct stripe_head'.

I'm guessing that it has been processed by handle_parity_checks6, noticed in
the check_state_check_result case that s->failed is non-zero, and so set
sh->check_state to check_state_compute_result, but then nothing caused the
stripe to go around the loop again.  The code there should probably set
STRIPE_HANDLE... maybe.

If the dead disk provides data to that stripe, rather than parity, then the
read should set STRIPE_HANDLE and the handling of the stripe can progress.

It might not work, but it is worth a try.  What should happen is that the
check will immediately abort.  If it doesn't you probably need to schedule a
reboot ... if you haven't already.

NeilBrown
* Re: Failed drive in raid6 while doing data-check
From: Krzysztof Adamski @ 2012-06-05 16:48 UTC
To: NeilBrown; +Cc: linux-raid

On Tue, 2012-06-05 at 13:35 +1000, NeilBrown wrote:
> […]
> If I'm right you might be able to kick the 'check' back into life by reading
> from the area of the array that is currently being synced, i.e. about
> 19137650760K into the array.
>
>   dd skip=19137650000 bs=1024 count=2000 if=/dev/md7 of=/dev/null
>
> might do it.
> […]
> It might not work, but it is worth a try.  What should happen is that the
> check will immediately abort.  If it doesn't you probably need to schedule a
> reboot ... if you haven't already.

I had to do a reboot already.

The drive was wedged for some reason: the activity light was solid, a soft
reboot (shutdown -r now) didn't wake it up, and the card BIOS didn't see it.
I had to pull and reinsert it for it to be recognized by the LSI card.

I added it to the array and the rebuild took a few minutes (thanks to the
bitmap).  I forced a recheck and after 12 hours everything is good.

This was not the first time I have seen the check stall like this when one
drive stops working in this machine, though that was with a different adaptor
card and different drives.  I will try what you suggested if it happens again.

Now I wonder if I should have just pulled/reinserted the drive to see what
would have happened.

K
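(The recovery Krzysztof describes corresponds roughly to the commands below.
This is only a sketch: it assumes the replaced disk comes back as sdb and is
re-partitioned so that sdb2 exists again; the write-intent bitmap is what lets
the re-add finish in minutes rather than days.)

  mdadm /dev/md7 --re-add /dev/sdb2     # or --add if re-add is refused
  cat /proc/mdstat                      # wait for the recovery to finish
  echo check > /sys/block/md7/md/sync_action
  cat /sys/block/md7/md/mismatch_cnt    # inspect after the check completes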
* Re: Failed drive in raid6 while doing data-check
From: NeilBrown @ 2012-06-06 1:22 UTC
To: Krzysztof Adamski; +Cc: linux-raid

On Tue, 05 Jun 2012 12:48:02 -0400 Krzysztof Adamski <k@adamski.org> wrote:
> […]
> I had to do a reboot already.
>
> The drive was wedged for some reason: the activity light was solid, a soft
> reboot (shutdown -r now) didn't wake it up, and the card BIOS didn't see it.
> I had to pull and reinsert it for it to be recognized by the LSI card.

Maybe that was the problem.  If md had sent a request to the drive and not
received any reply (success or failure), it would keep waiting until it got
one.

> I added it to the array and the rebuild took a few minutes (thanks to the
> bitmap).  I forced a recheck and after 12 hours everything is good.

:-)

> This was not the first time I have seen the check stall like this when one
> drive stops working in this machine, though that was with a different adaptor
> card and different drives.  I will try what you suggested if it happens again.

I looked at the code again and realised I was missing something, and it
couldn't possibly lock up where I thought it might.  So while that experiment
won't hurt, it is very unlikely to help.

> Now I wonder if I should have just pulled/reinserted the drive to see what
> would have happened.

Yes, that would be interesting.

Thanks,
NeilBrown
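(If this recurs, one generic way to see whether a member device still has
requests outstanding, which would explain md waiting forever, is to look at
the block layer's in-flight counters and md's view of the member. The paths
below assume the suspect disk is sdb with member partition sdb2 in md7, as in
this thread.)

  cat /sys/block/sdb/inflight            # reads and writes still in flight
  cat /sys/block/md7/md/dev-sdb2/state   # e.g. in_sync, faulty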