linux-raid.vger.kernel.org archive mirror
* [BUG] MD/RAID1 hung forever on bitmap_startwrite+0x122
@ 2016-11-28  8:45 Jinpu Wang
  2016-11-30  0:08 ` Shaohua Li
  0 siblings, 1 reply; 3+ messages in thread
From: Jinpu Wang @ 2016-11-28  8:45 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown, Shaohua Li

Hi folks,

We hit another hung task with our test machines; this time it is stuck
in bitmap_startwrite.

We build MD/RAID1 over two block devices exported via IB, with
bitmap=internal. KVM runs on top of the RAID1 on a compute node; the
disks are on remote storage nodes. When one storage node
crashed/rebooted, KVM on top of multiple RAID1 arrays on multiple
compute nodes ran into hung tasks like the one below.
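
As background, a minimal sketch of how such an array is assembled (a
sketch only; the device names are hypothetical, not the ones from our
setup):

    # two block devices exported from the remote storage nodes via IB
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal /dev/storageA-vol /dev/storageB-vol
    # the KVM guest then issues direct/AIO writes on top of /dev/md0

The hung-task trace: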

[106204.343870] INFO: task kvm:37669 blocked for more than 180 seconds.
[106204.344138]       Tainted: G          IO    4.4.28-1-pserver #1
[106204.344385] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[106204.344798] kvm             D ffff882037723710     0 37669      1 0x00000000
[106204.344805]  ffff882037723710 ffff882038f08d00 ffff882029770d00 ffff8820377236d8
[106204.344809]  ffff8820377236d8 ffff882037724000 0000000000308648 0000000000000008
[106204.344813]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728 ffffffff81811c60
[106204.344818] Call Trace:
[106204.344831]  [<ffffffff81811c60>] schedule+0x30/0x80
[106204.344841]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
[106204.344848]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
[106204.344853]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
[106204.344859]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
[106204.344865]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
[106204.344871]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
[106204.344875]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
[106204.344879]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
[106204.344883]  [<ffffffff813fd692>] submit_bio+0x62/0x150
[106204.344892]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
[106204.344897]  [<ffffffff810b9999>] ? __remove_hrtimer+0x89/0xa0
[106204.344903]  [<ffffffff8173c08f>] ? udp_poll+0x1f/0xb0
[106204.344908]  [<ffffffff816b71c7>] ? sock_poll+0x57/0x120
[106204.344913]  [<ffffffff811cdbf0>] ? I_BDEV+0x10/0x10
[106204.344918]  [<ffffffff811d3b1e>] __blockdev_direct_IO+0x3e/0x40
[106204.344922]  [<ffffffff811ce287>] blkdev_direct_IO+0x47/0x50
[106204.344930]  [<ffffffff81132c60>] generic_file_direct_write+0xb0/0x170
[106204.344934]  [<ffffffff81132ded>] __generic_file_write_iter+0xcd/0x1f0
[106204.344943]  [<ffffffff81184ff8>] ? kmem_cache_free+0x78/0x190
[106204.344948]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
[106204.344952]  [<ffffffff811ce547>] blkdev_write_iter+0x87/0x110
[106204.344956]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
[106204.344962]  [<ffffffff811dec56>] aio_run_iocb+0x236/0x2a0
[106204.344966]  [<ffffffff811dd183>] ? eventfd_ctx_read+0x53/0x200
[106204.344973]  [<ffffffff811b3bbf>] ? __fget_light+0x1f/0x60
[106204.344976]  [<ffffffff811b3c0e>] ? __fdget+0xe/0x10
[106204.344980]  [<ffffffff811dfb5a>] do_io_submit+0x23a/0x4d0
[106204.344985]  [<ffffffff811dfdfb>] SyS_io_submit+0xb/0x10
[106204.344989]  [<ffffffff818154d7>] entry_SYSCALL_64_fastpath+0x12/0x6a

[106384.345330] INFO: task kvm:37669 blocked for more than 180 seconds.
[106384.345621]       Tainted: G          IO    4.4.28-1-pserver #1
[106384.345866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[106384.346275] kvm             D ffff882037723710     0 37669      1 0x00000000
[106384.346282]  ffff882037723710 ffff882038f08d00 ffff882029770d00 ffff8820377236d8
[106384.346286]  ffff8820377236d8 ffff882037724000 0000000000308648 0000000000000008
[106384.346290]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728 ffffffff81811c60
[106384.346294] Call Trace:
[106384.346308]  [<ffffffff81811c60>] schedule+0x30/0x80
[106384.346317]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
[106384.346325]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
[106384.346330]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
[106384.346336]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
[106384.346341]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
[106384.346347]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
[106384.346352]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
[106384.346356]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
[106384.346360]  [<ffffffff813fd692>] submit_bio+0x62/0x150
[106384.346369]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0


(gdb) l *bitmap_startwrite+0x122
0x121d2 is in bitmap_startwrite (drivers/md/bitmap.c:1396).

1394                 if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
1395                         DEFINE_WAIT(__wait);
1396                         /* note that it is safe to do the prepare_to_wait
1397                          * after the test as long as we do it before dropping
1398                          * the spinlock.
1399                          */
1400                         prepare_to_wait(&bitmap->overflow_wait, &__wait,
1401                                         TASK_UNINTERRUPTIBLE);
1402                         spin_unlock_irq(&bitmap->counts.lock);
1403                         schedule();
1404                         finish_wait(&bitmap->overflow_wait, &__wait);
1405                         continue;
1406                 }

So it seems KVM is waiting on the overflow_wait queue, but somehow
nobody wakes it up. While the storage node was rebooting, the RAID1 saw
a lot of IO errors, so I suspect some error-handling path is broken.
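
For context, as far as I can see the only place that wakes up
overflow_wait is the write-completion side, bitmap_endwrite() in
drivers/md/bitmap.c. Roughly (quoting 4.4 from memory, so take the
exact lines with a grain of salt):

        /* COUNTER() is the low 14 bits of the per-chunk counter and
         * COUNTER_MAX is its limit; a waiter above only exists while
         * the counter is pegged at COUNTER_MAX, so the wakeup happens
         * as a write to that chunk completes, success or error. */
        if (COUNTER(*bmc) == COUNTER_MAX)
                wake_up(&bitmap->overflow_wait);
        (*bmc)--;

So if completions for in-flight writes are never delivered, e.g.
because the transport to the rebooted storage node dropped them without
erroring out the bios, the waiter sleeps in TASK_UNINTERRUPTIBLE
forever.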

I don't have a reproducer yet; I just want to report this to the
community in case it is a known bug, or someone already has a patch :)

Thanks,

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss


* Re: [BUG] MD/RAID1 hung forever on bitmap_startwrite+0x122
  2016-11-28  8:45 [BUG] MD/RAID1 hung forever on bitmap_startwrite+0x122 Jinpu Wang
@ 2016-11-30  0:08 ` Shaohua Li
  2016-11-30  9:29   ` Jinpu Wang
  0 siblings, 1 reply; 3+ messages in thread
From: Shaohua Li @ 2016-11-30  0:08 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, NeilBrown

On Mon, Nov 28, 2016 at 09:45:07AM +0100, Jinpu Wang wrote:
> Hi folks,
> 
> We hit another hung task with our test machines; this time it is
> stuck in bitmap_startwrite.
> 
> We build MD/RAID1 over two block devices exported via IB, with
> bitmap=internal. KVM runs on top of the RAID1 on a compute node; the
> disks are on remote storage nodes. When one storage node
> crashed/rebooted, KVM on top of multiple RAID1 arrays on multiple
> compute nodes ran into hung tasks like the one below.
> 
> [106204.343870] INFO: task kvm:37669 blocked for more than 180 seconds.
> [106204.344138]       Tainted: G          IO    4.4.28-1-pserver #1
> [106204.344385] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [106204.344798] kvm             D ffff882037723710     0 37669      1 0x00000000
> [106204.344805]  ffff882037723710 ffff882038f08d00 ffff882029770d00 ffff8820377236d8
> [106204.344809]  ffff8820377236d8 ffff882037724000 0000000000308648 0000000000000008
> [106204.344813]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728 ffffffff81811c60
> [106204.344818] Call Trace:
> [106204.344831]  [<ffffffff81811c60>] schedule+0x30/0x80
> [106204.344841]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
> [106204.344848]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
> [106204.344853]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
> [106204.344859]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
> [106204.344865]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
> [106204.344871]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
> [106204.344875]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
> [106204.344879]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
> [106204.344883]  [<ffffffff813fd692>] submit_bio+0x62/0x150
> [106204.344892]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
> [106204.344897]  [<ffffffff810b9999>] ? __remove_hrtimer+0x89/0xa0
> [106204.344903]  [<ffffffff8173c08f>] ? udp_poll+0x1f/0xb0
> [106204.344908]  [<ffffffff816b71c7>] ? sock_poll+0x57/0x120
> [106204.344913]  [<ffffffff811cdbf0>] ? I_BDEV+0x10/0x10
> [106204.344918]  [<ffffffff811d3b1e>] __blockdev_direct_IO+0x3e/0x40
> [106204.344922]  [<ffffffff811ce287>] blkdev_direct_IO+0x47/0x50
> [106204.344930]  [<ffffffff81132c60>] generic_file_direct_write+0xb0/0x170
> [106204.344934]  [<ffffffff81132ded>] __generic_file_write_iter+0xcd/0x1f0
> [106204.344943]  [<ffffffff81184ff8>] ? kmem_cache_free+0x78/0x190
> [106204.344948]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
> [106204.344952]  [<ffffffff811ce547>] blkdev_write_iter+0x87/0x110
> [106204.344956]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
> [106204.344962]  [<ffffffff811dec56>] aio_run_iocb+0x236/0x2a0
> [106204.344966]  [<ffffffff811dd183>] ? eventfd_ctx_read+0x53/0x200
> [106204.344973]  [<ffffffff811b3bbf>] ? __fget_light+0x1f/0x60
> [106204.344976]  [<ffffffff811b3c0e>] ? __fdget+0xe/0x10
> [106204.344980]  [<ffffffff811dfb5a>] do_io_submit+0x23a/0x4d0
> [106204.344985]  [<ffffffff811dfdfb>] SyS_io_submit+0xb/0x10
> [106204.344989]  [<ffffffff818154d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
>
> [106384.345330] INFO: task kvm:37669 blocked for more than 180 seconds.
> [106384.345621]       Tainted: G          IO    4.4.28-1-pserver #1
> [106384.345866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [106384.346275] kvm             D ffff882037723710     0 37669      1 0x00000000
> [106384.346282]  ffff882037723710 ffff882038f08d00 ffff882029770d00 ffff8820377236d8
> [106384.346286]  ffff8820377236d8 ffff882037724000 0000000000308648 0000000000000008
> [106384.346290]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728 ffffffff81811c60
> [106384.346294] Call Trace:
> [106384.346308]  [<ffffffff81811c60>] schedule+0x30/0x80
> [106384.346317]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
> [106384.346325]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
> [106384.346330]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
> [106384.346336]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
> [106384.346341]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
> [106384.346347]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
> [106384.346352]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
> [106384.346356]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
> [106384.346360]  [<ffffffff813fd692>] submit_bio+0x62/0x150
> [106384.346369]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
> 
> 
> (gdb) l *bitmap_startwrite+0x122
> 0x121d2 is in bitmap_startwrite (drivers/md/bitmap.c:1396).
>
> 1394                 if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
> 1395                         DEFINE_WAIT(__wait);
> 1396                         /* note that it is safe to do the prepare_to_wait
> 1397                          * after the test as long as we do it before dropping
> 1398                          * the spinlock.
> 1399                          */
> 1400                         prepare_to_wait(&bitmap->overflow_wait, &__wait,
> 1401                                         TASK_UNINTERRUPTIBLE);
> 1402                         spin_unlock_irq(&bitmap->counts.lock);
> 1403                         schedule();
> 1404                         finish_wait(&bitmap->overflow_wait, &__wait);
> 1405                         continue;
> 1406                 }
> 
> So it seems KVM is waiting on the overflow_wait queue, but somehow
> nobody wakes it up. While the storage node was rebooting, the RAID1
> saw a lot of IO errors, so I suspect some error-handling path is
> broken.
>
> I don't have a reproducer yet; I just want to report this to the
> community in case it is a known bug, or someone already has a patch :)

Does the kernel report the raid disk as faulty and remove it? Is this a
real hang? E.g. maybe we are just waiting for an IO error to be
reported.
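
A quick way to check on the compute node would be something like this
(device name hypothetical):

    cat /proc/mdstat            # a faulty leg shows as (F), e.g. [U_]
    mdadm --detail /dev/md0     # see "State" and "Failed Devices"
    dmesg | grep 'md/raid1'     # "Disk failure on ..., disabling device"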


* Re: [BUG] MD/RAID1 hung forever on bitmap_startwrite+0x122
  2016-11-30  0:08 ` Shaohua Li
@ 2016-11-30  9:29   ` Jinpu Wang
  0 siblings, 0 replies; 3+ messages in thread
From: Jinpu Wang @ 2016-11-30  9:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, NeilBrown

On Wed, Nov 30, 2016 at 1:08 AM, Shaohua Li <shli@kernel.org> wrote:
> On Mon, Nov 28, 2016 at 09:45:07AM +0100, Jinpu Wang wrote:
>> Hi folks,
>>
>> We hit another hung task with our test machines; this time it is
>> stuck in bitmap_startwrite.
>>
>> We build MD/RAID1 over two block devices exported via IB, with
>> bitmap=internal. KVM runs on top of the RAID1 on a compute node; the
>> disks are on remote storage nodes. When one storage node
>> crashed/rebooted, KVM on top of multiple RAID1 arrays on multiple
>> compute nodes ran into hung tasks like the one below.
>>
>> [106204.343870] INFO: task kvm:37669 blocked for more than 180 seconds.
>> [106204.344138]       Tainted: G          IO    4.4.28-1-pserver #1
>> [106204.344385] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [106204.344798] kvm             D ffff882037723710     0 37669      1 0x00000000
>> [106204.344805]  ffff882037723710 ffff882038f08d00 ffff882029770d00 ffff8820377236d8
>> [106204.344809]  ffff8820377236d8 ffff882037724000 0000000000308648 0000000000000008
>> [106204.344813]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728 ffffffff81811c60
>> [106204.344818] Call Trace:
>> [106204.344831]  [<ffffffff81811c60>] schedule+0x30/0x80
>> [106204.344841]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
>> [106204.344848]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
>> [106204.344853]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
>> [106204.344859]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
>> [106204.344865]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
>> [106204.344871]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
>> [106204.344875]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
>> [106204.344879]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
>> [106204.344883]  [<ffffffff813fd692>] submit_bio+0x62/0x150
>> [106204.344892]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
>> [106204.344897]  [<ffffffff810b9999>] ? __remove_hrtimer+0x89/0xa0
>> [106204.344903]  [<ffffffff8173c08f>] ? udp_poll+0x1f/0xb0
>> [106204.344908]  [<ffffffff816b71c7>] ? sock_poll+0x57/0x120
>> [106204.344913]  [<ffffffff811cdbf0>] ? I_BDEV+0x10/0x10
>> [106204.344918]  [<ffffffff811d3b1e>] __blockdev_direct_IO+0x3e/0x40
>> [106204.344922]  [<ffffffff811ce287>] blkdev_direct_IO+0x47/0x50
>> [106204.344930]  [<ffffffff81132c60>] generic_file_direct_write+0xb0/0x170
>> [106204.344934]  [<ffffffff81132ded>] __generic_file_write_iter+0xcd/0x1f0
>> [106204.344943]  [<ffffffff81184ff8>] ? kmem_cache_free+0x78/0x190
>> [106204.344948]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
>> [106204.344952]  [<ffffffff811ce547>] blkdev_write_iter+0x87/0x110
>> [106204.344956]  [<ffffffff811ce4c0>] ? bd_unlink_disk_holder+0xf0/0xf0
>> [106204.344962]  [<ffffffff811dec56>] aio_run_iocb+0x236/0x2a0
>> [106204.344966]  [<ffffffff811dd183>] ? eventfd_ctx_read+0x53/0x200
>> [106204.344973]  [<ffffffff811b3bbf>] ? __fget_light+0x1f/0x60
>> [106204.344976]  [<ffffffff811b3c0e>] ? __fdget+0xe/0x10
>> [106204.344980]  [<ffffffff811dfb5a>] do_io_submit+0x23a/0x4d0
>> [106204.344985]  [<ffffffff811dfdfb>] SyS_io_submit+0xb/0x10
>> [106204.344989]  [<ffffffff818154d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
>>
>> [106384.345330] INFO: task kvm:37669 blocked for more than 180 seconds.
>> [106384.345621]       Tainted: G          IO    4.4.28-1-pserver #1
>> [106384.345866] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [106384.346275] kvm             D ffff882037723710     0 37669      1 0x00000000
>> [106384.346282]  ffff882037723710 ffff882038f08d00 ffff882029770d00 ffff8820377236d8
>> [106384.346286]  ffff8820377236d8 ffff882037724000 0000000000308648 0000000000000008
>> [106384.346290]  ffff880f9bd9e8c0 ffff882037723768 ffff882037723728 ffffffff81811c60
>> [106384.346294] Call Trace:
>> [106384.346308]  [<ffffffff81811c60>] schedule+0x30/0x80
>> [106384.346317]  [<ffffffffa09d31a2>] bitmap_startwrite+0x122/0x190 [md_mod]
>> [106384.346325]  [<ffffffff813f660b>] ? bio_clone_bioset+0x11b/0x310
>> [106384.346330]  [<ffffffff810956b0>] ? wait_woken+0x80/0x80
>> [106384.346336]  [<ffffffffa0cc5127>] 0xffffffffa0cc5127
>> [106384.346341]  [<ffffffffa09c4863>] md_set_array_sectors+0xac3/0xe20 [md_mod]
>> [106384.346347]  [<ffffffff813faa94>] ? generic_make_request_checks+0x234/0x4c0
>> [106384.346352]  [<ffffffff813fdb91>] blk_prologue_bio+0x91/0xc0
>> [106384.346356]  [<ffffffff813fd54e>] generic_make_request+0xfe/0x1e0
>> [106384.346360]  [<ffffffff813fd692>] submit_bio+0x62/0x150
>> [106384.346369]  [<ffffffff811d3257>] do_blockdev_direct_IO+0x2317/0x2ba0
>>
>>
>> (gdb) l *bitmap_startwrite+0x122
>> 0x121d2 is in bitmap_startwrite (drivers/md/bitmap.c:1396).
>>
>> 1394                 if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
>> 1395                         DEFINE_WAIT(__wait);
>> 1396                         /* note that it is safe to do the prepare_to_wait
>> 1397                          * after the test as long as we do it before dropping
>> 1398                          * the spinlock.
>> 1399                          */
>> 1400                         prepare_to_wait(&bitmap->overflow_wait, &__wait,
>> 1401                                         TASK_UNINTERRUPTIBLE);
>> 1402                         spin_unlock_irq(&bitmap->counts.lock);
>> 1403                         schedule();
>> 1404                         finish_wait(&bitmap->overflow_wait, &__wait);
>> 1405                         continue;
>> 1406                 }
>>
>> So it seems KVM is waiting on the overflow_wait queue, but somehow
>> nobody wakes it up. While the storage node was rebooting, the RAID1
>> saw a lot of IO errors, so I suspect some error-handling path is
>> broken.
>>
>> I don't have a reproducer yet; I just want to report this to the
>> community in case it is a known bug, or someone already has a patch :)
>
> Does the kernel report the raid disk as faulty and remove it? Is this
> a real hang? E.g. maybe we are just waiting for an IO error to be
> reported.


Thanks, Shaohua, for the reply.

I checked the log again; the hung task had been there for 10+ hours.
I found something wrong with the test case: it creates MD/RAID1 on two
drives from two remote storage servers and reboots one storage server
before some of the MDs have finished recovery; then the other storage
server was also rebooted. This left both legs broken, which somehow
led to the hung tasks.

As RAID1 can't handle two broken legs at the same time, I think we will
change our test case to a more practical scenario.
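
Roughly, the failing sequence was (a sketch; device names are
hypothetical and timings approximate):

    # each leg comes from a different storage server, A and B
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal /dev/legA /dev/legB
    # storage server A reboots while /dev/md0 is still recovering
    # storage server B reboots shortly afterwards
    # -> both legs gone; writers end up stuck in bitmap_startwrite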


-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@profitbricks.com
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

