* [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-25 13:30 UTC
To: linux-raid, Shaohua Li; +Cc: Neil Brown, Nate Dailey

Hi,

I'm hitting a hung task in mdx_raid1 when running tests; I can reproduce
it easily with the test below:

I create one md with one local loop device and one remote scsi device
exported by SRP, run fio with mixed read/write on top of the md, and then
force_close the session on the storage side. mdx_raid1 is waiting on
freeze_array in D state, and a lot of fio threads are also in D state in
wait_barrier.

[ 335.154711] blk_update_request: I/O error, dev sdb, sector 8
[ 335.154855] md: super_written gets error=-5
[ 335.154999] md/raid1:md1: Disk failure on sdb, disabling device.
              md/raid1:md1: Operation continuing on 1 devices.
[ 335.155258] sd 1:0:0:0: rejecting I/O to offline device
[ 335.155402] blk_update_request: I/O error, dev sdb, sector 80
[ 335.155547] md: super_written gets error=-5
[ 340.158828] scsi host1: ib_srp: reconnect succeeded
[ 373.017608] md/raid1:md1: redirecting sector 616617 to other mirror: loop1
[ 373.110527] md/raid1:md1: redirecting sector 1320893 to other mirror: loop1
[ 373.117230] md/raid1:md1: redirecting sector 1564499 to other mirror: loop1
[ 373.127652] md/raid1:md1: redirecting sector 104034 to other mirror: loop1
[ 373.135665] md/raid1:md1: redirecting sector 1209765 to other mirror: loop1
[ 373.145634] md/raid1:md1: redirecting sector 51200 to other mirror: loop1
[ 373.158824] md/raid1:md1: redirecting sector 755750 to other mirror: loop1
[ 373.169964] md/raid1:md1: redirecting sector 1681631 to other mirror: loop1
[ 373.178619] md/raid1:md1: redirecting sector 1894296 to other mirror: loop1
[ 373.186153] md/raid1:md1: redirecting sector 1905016 to other mirror: loop1
[ 374.364370] RAID1 conf printout:
[ 374.364377]  --- wd:1 rd:2
[ 374.364379]  disk 0, wo:1, o:0, dev:sdb
[ 374.364381]  disk 1, wo:0, o:1, dev:loop1
[ 374.437099] RAID1 conf printout:
[ 374.437103]  --- wd:1 rd:2
snip

[ 810.266112] sysrq: SysRq : Show Blocked State
[ 810.266235]   task                        PC stack   pid father
[ 810.266362] md1_raid1       D ffff88022d927c48     0  4022      2 0x00000000
[ 810.266487]  ffff88022d927c48 ffff8802351a0000 ffff8800b91bc100 000000008010000e
[ 810.266747]  ffff88022d927c30 ffff88022d928000 0000000000000001 ffff880233b49b70
[ 810.266975]  ffff880233b49b88 ffff8802325d5a40 ffff88022d927c60 ffffffff81810600
[ 810.267203] Call Trace:
[ 810.267322]  [<ffffffff81810600>] schedule+0x30/0x80
[ 810.267437]  [<ffffffffa01342c1>] freeze_array+0x71/0xc0 [raid1]
[ 810.267555]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[ 810.267669]  [<ffffffffa013578b>] handle_read_error+0x3b/0x570 [raid1]
[ 810.267816]  [<ffffffff81185783>] ? kmem_cache_free+0x183/0x190
[ 810.267929]  [<ffffffff81094e36>] ? __wake_up+0x46/0x60
[ 810.268045]  [<ffffffffa0136dcd>] raid1d+0x20d/0xfc0 [raid1]
[ 810.268159]  [<ffffffff81813043>] ? schedule_timeout+0x1a3/0x230
[ 810.268274]  [<ffffffff8180fe77>] ? __schedule+0x2e7/0xa40
[ 810.268391]  [<ffffffffa0211839>] md_thread+0x119/0x120 [md_mod]
[ 810.268508]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[ 810.268624]  [<ffffffffa0211720>] ? find_pers+0x70/0x70 [md_mod]
[ 810.268741]  [<ffffffff81075614>] kthread+0xc4/0xe0
[ 810.268853]  [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
[ 810.268970]  [<ffffffff8181415f>] ret_from_fork+0x3f/0x70
[ 810.269114]  [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
[ 810.269227] fio             D ffff8802325137a0     0  4212   4206 0x00000000
[ 810.269347]  ffff8802325137a0 ffff88022de3db00 ffff8800ba7bb400 0000000000000000
[ 810.269574]  ffff880233b49b00 ffff880232513788 ffff880232514000 ffff880233b49b88
[ 810.269801]  ffff880233b49b70 ffff8800ba7bb400 ffff8800b5f5db00 ffff8802325137b8
[ 810.270028] Call Trace:
[ 810.270138]  [<ffffffff81810600>] schedule+0x30/0x80
[ 810.270282]  [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
[ 810.270396]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[ 810.270513]  [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
[ 810.270628]  [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
[ 810.270746]  [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
[ 810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
[ 810.270976]  [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
[ 810.271090]  [<ffffffff813fc362>] submit_bio+0x62/0x140
[ 810.271209]  [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
[ 810.271323]  [<ffffffff81810600>] ? schedule+0x30/0x80
[ 810.271468]  [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
[ 810.271580]  [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
[ 810.271696]  [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
[ 810.271828]  [<ffffffff81132cbf>] generic_file_read_iter+0x44f/0x570
[ 810.271949]  [<ffffffff811ceaa0>] ? blkdev_write_iter+0x110/0x110
[ 810.272062]  [<ffffffff811cead0>] blkdev_read_iter+0x30/0x40
[ 810.272179]  [<ffffffff811de5a6>] aio_run_iocb+0x126/0x2b0
[ 810.272291]  [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
[ 810.272407]  [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
[ 810.272521]  [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
[ 810.272665]  [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
[ 810.272778]  [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
[ 810.272895]  [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
[ 810.273007]  [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66
[ 810.273130] fio             D ffff88022fa6f730     0  4213   4206 0x00000000
[ 810.273247]  ffff88022fa6f730 ffff8800b549a700 ffff8800af703400 0000000002011200
[ 810.273475]  ffff880236001700 ffff88022fa6f718 ffff88022fa70000 ffff880233b49b88
[ 810.273702]  ffff880233b49b70 ffff8800af703400 ffff88022f843700 ffff88022fa6f748
[ 810.273958] Call Trace:
[ 810.274070]  [<ffffffff81810600>] schedule+0x30/0x80
[ 810.274183]  [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
[ 810.274300]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[ 810.274413]  [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
[ 810.274537]  [<ffffffff81408f15>] ? __bt_get.isra.7+0xd5/0x1b0
[ 810.274650]  [<ffffffff81094feb>] ? finish_wait+0x5b/0x80
[ 810.274766]  [<ffffffff8140917f>] ? bt_get+0x18f/0x1b0
[ 810.274881]  [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
[ 810.274998]  [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
[ 810.275144]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
[ 810.275257]  [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
[ 810.275373]  [<ffffffff813fc362>] submit_bio+0x62/0x140
[ 810.275486]  [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
[ 810.275607]  [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
[ 810.275721]  [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
[ 810.275843]  [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
[ 810.275956]  [<ffffffff81132e8c>] generic_file_direct_write+0xac/0x170
[ 810.276073]  [<ffffffff8113301d>] __generic_file_write_iter+0xcd/0x1f0
[ 810.276187]  [<ffffffff811ce990>] ? blkdev_close+0x30/0x30
[ 810.276332]  [<ffffffff811cea17>] blkdev_write_iter+0x87/0x110
[ 810.276445]  [<ffffffff811de6d0>] aio_run_iocb+0x250/0x2b0
[ 810.276560]  [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
[ 810.276673]  [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
[ 810.276786]  [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
[ 810.276902]  [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
[ 810.277015]  [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
[ 810.277131]  [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
[ 810.277244]  [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66

I dumped r1conf in crash:

struct r1conf {
  mddev = 0xffff88022d761800,
  mirrors = 0xffff88023456a180,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
  retry_list = { next = 0xffff8800b5fe3b40, prev = 0xffff8800b50164c0 },
  bio_end_io_list = { next = 0xffff88022fcd45c0, prev = 0xffff8800b53d57c0 },
  pending_bio_list = { head = 0x0, tail = 0x0 },
  pending_count = 0,
  wait_barrier = {
    lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
    task_list = { next = 0xffff8800b51d37e0, prev = 0xffff88022fbbb770 }
  },
  resync_lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
  nr_pending = 406,
  nr_waiting = 100,
  nr_queued = 404,
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 1,
  poolinfo = 0xffff88022d829bb0,
  r1bio_pool = 0xffff88022b4512a0,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0008c97b00,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}

Every time, nr_pending is 1 bigger than (nr_queued + 1), so it seems we
forgot to increase nr_queued somewhere?

I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
nr_queued to prevent freeze_array hang"), which seems to have fixed a
similar bug.

Could you give your suggestion?

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL: https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
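The wait that md1_raid1 is stuck in is the one in freeze_array(); a rough
sketch of the 4.4-era code (paraphrased from drivers/md/raid1.c, not a
verbatim copy) looks like this, with handle_read_error() passing extra = 1
for the r1bio it is currently retrying:

    static void freeze_array(struct r1conf *conf, int extra)
    {
            /* Wait until the number of in-flight requests (nr_pending)
             * matches the number already queued for retry (nr_queued)
             * plus the 'extra' request being handled right now.
             */
            spin_lock_irq(&conf->resync_lock);
            conf->array_frozen = 1;
            wait_event_lock_irq_cmd(conf->wait_barrier,
                                    conf->nr_pending == conf->nr_queued + extra,
                                    conf->resync_lock,
                                    flush_pending_writes(conf));
            spin_unlock_irq(&conf->resync_lock);
    }

With the values dumped above (nr_pending = 406, nr_queued = 404, extra = 1)
the condition 406 == 404 + 1 can never become true, which matches the
"one request unaccounted for" observation in the report.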
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-25 13:59 UTC
To: linux-raid, Shaohua Li; +Cc: Neil Brown, Nate Dailey

On Fri, Nov 25, 2016 at 2:30 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
> Hi,
>
> I'm hitting a hung task in mdx_raid1 when running tests; I can reproduce
> it easily with the test below:
>
> snip (kernel log, stack traces and r1conf dump quoted in full above)
>
> Every time, nr_pending is 1 bigger than (nr_queued + 1), so it seems we
> forgot to increase nr_queued somewhere?
>
> I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
> nr_queued to prevent freeze_array hang"), which seems to have fixed a
> similar bug.
>
> Could you give your suggestion?
>
Sorry, forgot to mention the kernel version is 4.4.28

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28  4:47 UTC
To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On 2016/11/25 9:59 PM, Jinpu Wang wrote:
> On Fri, Nov 25, 2016 at 2:30 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
>> snip (full report quoted above)
>>
>> Every time, nr_pending is 1 bigger than (nr_queued + 1), so it seems we
>> forgot to increase nr_queued somewhere?
>>
>> I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
>> nr_queued to prevent freeze_array hang"), which seems to have fixed a
>> similar bug.
>>
>> Could you give your suggestion?
>>
> Sorry, forgot to mention the kernel version is 4.4.28

This commit is Cced to stable@vger.kernel.org for v4.3+; do you use a
stable kernel or a distribution with a 4.4.28 kernel?

Coly
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-28  8:24 UTC
To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

snip
>>> Sorry, forgot to mention the kernel version is 4.4.28
>>
>> This commit is Cced to stable@vger.kernel.org for v4.3+; do you use a
>> stable kernel or a distribution with a 4.4.28 kernel?
>>
>> Coly
>>
Hi Coly,

I'm using Debian 8 with a 4.4.28 kernel.

Best,
-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28  8:54 UTC
To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On 2016/11/28 4:24 PM, Jinpu Wang wrote:
> snip
>>
>> This commit is Cced to stable@vger.kernel.org for v4.3+; do you use a
>> stable kernel or a distribution with a 4.4.28 kernel?
>
> Hi Coly,
>
> I'm using Debian 8 with a 4.4.28 kernel.

Hi Jinpu,

Is it possible for you to run an upstream or vanilla kernel to test
whether the issue can still be reproduced? Then we would know whether it
is an upstream bug or a distro issue.

Thanks.

Coly
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-28  9:02 UTC
To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@suse.de> wrote:
> snip
>
> Is it possible for you to run an upstream or vanilla kernel to test
> whether the issue can still be reproduced? Then we would know whether it
> is an upstream bug or a distro issue.

Hi Coly,

I did run kernel 4.4.34 (downloaded from kernel.org), and I can reproduce
the same bug.

I can also try the latest 4.8 or 4.9-rc kernel, if you think it's necessary?

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28  9:10 UTC
To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On 2016/11/28 5:02 PM, Jinpu Wang wrote:
> snip
>
> I did run kernel 4.4.34 (downloaded from kernel.org), and I can reproduce
> the same bug.
>
> I can also try the latest 4.8 or 4.9-rc kernel, if you think it's necessary?

Yes, please. If it can be reproduced on an upstream kernel by a set of
scripts, that will be very helpful for debugging and fixing this issue.

Thanks in advance.

Coly
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-29 11:15 UTC
To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On Mon, Nov 28, 2016 at 10:10 AM, Coly Li <colyli@suse.de> wrote:
> snip
>
> Yes, please. If it can be reproduced on an upstream kernel by a set of
> scripts, that will be very helpful for debugging and fixing this issue.

Hi Coly,

I tried with kernel 4.9-rc7; I can't reproduce it with my test case anymore.

It's hard to say whether the bug is fixed or just harder to reproduce,
because the code has changed a lot.

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-07 14:17 UTC
To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On Tue, Nov 29, 2016 at 12:15 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
> snip
>
> I tried with kernel 4.9-rc7; I can't reproduce it with my test case anymore.
>
> It's hard to say whether the bug is fixed or just harder to reproduce,
> because the code has changed a lot.

Hi,

I continued debugging the bug; here is a new dump from 20161207:

crash> struct r1conf 0xffff8800b9792000
struct r1conf {
  mddev = 0xffff88022db03800,
  mirrors = 0xffff880227729200,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
  retry_list = { next = 0xffff8800afe690c0, prev = 0xffff8800b96acac0 },
  bio_end_io_list = { next = 0xffff8800b96ac2c0, prev = 0xffff88003735f140 },
  pending_bio_list = { head = 0x0, tail = 0x0 },
  pending_count = 0,
  wait_barrier = {
    lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
    task_list = { next = 0xffff880221bc7770, prev = 0xffff8800ad44bc88 }
  },
  resync_lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
  nr_pending = 948,
  nr_waiting = 9,
  nr_queued = 946,   // again we need one more to finish the wait_event
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 2,
  poolinfo = 0xffff88022c567580,
  r1bio_pool = 0xffff88022fdccea0,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0002bf1600,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}
crash> exit

On conf->bio_end_io_list we have 91 entries:

crash> list -H 0xffff88003735f140
ffff8800b9792048
ffff8800b96ac2c0
ffff8800b96ac1c0
snip
ffff88022243dfc0
ffff88022243dc40
crash>

On conf->retry_list we have 855:

crash> list -H 0xffff8800b96acac0
ffff8800b9792038
ffff8800afe690c0
snip

list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio

ffff8800b9791ff8
struct r1bio {
  remaining = { counter = 0 },
  behind_remaining = { counter = 0 },
  sector = 18446612141670676480,            // corrupted?
  start_next_window = 18446612141565972992, // ditto
  sectors = 2,
  state = 18446744073709527039,             // ditto
  mddev = 0xffffffffffffffff,
  master_bio = 0x0,
  read_disk = 0,
  retry_list = { next = 0xffff8800afe690c0, prev = 0xffff8800b96acac0 },
  behind_bvecs = 0xffff8800b96ac2c0,
  behind_page_count = 926282048,
  bios = 0xffff8800b9792058
}
ffff8800afe69080
struct r1bio {
  remaining = { counter = 0 },
  behind_remaining = { counter = 0 },
  sector = 1566,
  start_next_window = 0,
  sectors = 128,
  state = 257,
  mddev = 0xffff88022db03800,
  master_bio = 0xffff8800371c1f00,
  read_disk = 0,
  retry_list = { next = 0xffff8800ad41a540, prev = 0xffff8800b9792038 },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800afe690e0
}

Check conf->bio_end_io_list:

list -H 0xffff88003735f140 r1bio.retry_list -s r1bio

ffff8800b9792008
struct r1bio {
  remaining = { counter = 661819904 },
  behind_remaining = { counter = -30718 },
  sector = 2,
  start_next_window = 18446744073709527039, // corrupted?
  sectors = -1,                             // corrupted?
  state = 0,
  mddev = 0x0,                              // corrupted?
  master_bio = 0xffff8800afe690c0,
  read_disk = -1184183616,                  // ?
  retry_list = { next = 0xffff8800b96ac2c0, prev = 0xffff88003735f140 },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800b9792068
}
ffff8800b96ac280
struct r1bio {
  remaining = { counter = 0 },
  behind_remaining = { counter = 0 },
  sector = 980009,
  start_next_window = 0,
  sectors = 16,
  state = 257,
  mddev = 0xffff88022db03800,
  master_bio = 0xffff8800370b0600,
  read_disk = 0,
  retry_list = { next = 0xffff8800b96ac1c0, prev = 0xffff8800b9792048 },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800b96ac2e0
}

I still have no clue what it could be; does anyone have an idea?

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-08  3:17 UTC
To: Jinpu Wang, Coly Li; +Cc: linux-raid, Shaohua Li, Nate Dailey

On Thu, Dec 08 2016, Jinpu Wang wrote:
> snip
>
> I continued debugging the bug; here is a new dump from 20161207:
>
>   nr_pending = 948,
>   nr_waiting = 9,
>   nr_queued = 946,   // again we need one more to finish the wait_event
>   barrier = 0,
>   array_frozen = 1,
>
> On conf->bio_end_io_list we have 91 entries.
> On conf->retry_list we have 855.

This is useful.  It confirms that nr_queued is correct, and that
nr_pending is consistently 1 higher than expected.
This suggests that a request has been counted in nr_pending, but hasn't
yet been submitted, or has been taken off one of the queues but has not
yet been processed.

I notice that in your first email the blocked tasks listed included
raid1d, which is blocked in freeze_array(), and a few others in
make_request() blocked on wait_barrier().
In that case nr_waiting was 100, so there should have been 100 threads
blocked in wait_barrier().  Is that correct?  I assume you thought it
was pointless to list them all, which seems reasonable.

I asked because I wonder if there might have been one thread in
make_request() which was blocked on something else.  There are a couple
of places where make_request() will wait after having successfully called
wait_barrier().  If that happened, it would cause exactly the symptoms
you report.  Could you check all blocked threads carefully please?

There are other ways that nr_pending and nr_queued can get out of sync,
though I think they would result in nr_pending being less than
nr_queued, not more.

If the presence of a bad block in the bad block log causes a request to
be split into two r1bios, and if both of those end up on one of the
queues, then they would be added to nr_queued twice, but to nr_pending
only once.  We should fix that.

> list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio
>
> ffff8800b9791ff8
> struct r1bio {
>   remaining = { counter = 0 },
>   behind_remaining = { counter = 0 },
>   sector = 18446612141670676480,            // corrupted?
>   start_next_window = 18446612141565972992, // ditto

I don't think this is corruption.

> crash> struct r1conf 0xffff8800b9792000
> struct r1conf {
....
>   retry_list = { next = 0xffff8800afe690c0, prev = 0xffff8800b96acac0 },

The pointer you started at was at the end of the list.
So this r1bio structure you are seeing is not an r1bio at all but the
memory out of the middle of the r1conf, being interpreted as an r1bio.
You can confirm this by noticing that the retry_list in the r1bio:

>   retry_list = { next = 0xffff8800afe690c0, prev = 0xffff8800b96acac0 },

is exactly the same as the retry_list in the r1conf.

NeilBrown
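For context on the accounting Neil describes: in 4.4-era raid1.c,
nr_pending is incremented once per bio in wait_barrier() as the request
enters make_request(), while nr_queued is incremented once per r1bio that
gets parked for the raid1d thread. A rough sketch of the latter
(paraphrased, not a verbatim copy of the source) is:

    static void reschedule_retry(struct r1bio *r1_bio)
    {
            unsigned long flags;
            struct mddev *mddev = r1_bio->mddev;
            struct r1conf *conf = mddev->private;

            spin_lock_irqsave(&conf->device_lock, flags);
            /* park the r1bio for raid1d and account for it */
            list_add(&r1_bio->retry_list, &conf->retry_list);
            conf->nr_queued++;
            spin_unlock_irqrestore(&conf->device_lock, flags);

            /* freeze_array() re-checks its wait condition on this wakeup */
            wake_up(&conf->wait_barrier);
            md_wakeup_thread(mddev->thread);
    }

So if one bio were ever split into two r1bios that both end up queued, the
counters would drift in the direction Neil mentions (nr_queued too high);
the hang observed here is the opposite drift, one nr_pending increment with
no matching queued r1bio.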
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-08  9:50 UTC
To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey

Thanks Neil for the valuable input; please see my comments inline.

On Thu, Dec 8, 2016 at 4:17 AM, NeilBrown <neilb@suse.com> wrote:
> snip
>
> I notice that in your first email the blocked tasks listed included
> raid1d, which is blocked in freeze_array(), and a few others in
> make_request() blocked on wait_barrier().
> In that case nr_waiting was 100, so there should have been 100 threads
> blocked in wait_barrier().  Is that correct?  I assume you thought it
> was pointless to list them all, which seems reasonable.

This is correct. In my test I initially set numjobs to 100 in fio; later I
reduced that to 10 and could still trigger the bug.

> I asked because I wonder if there might have been one thread in
> make_request() which was blocked on something else.  There are a couple
> of places where make_request() will wait after having successfully called
> wait_barrier().  If that happened, it would cause exactly the symptoms
> you report.  Could you check all blocked threads carefully please?

Every time it's the same hung task; raid1d is blocked in freeze_array:

PID: 5002   TASK: ffff8800ad430d00  CPU: 0   COMMAND: "md1_raid1"
 #0 [ffff8800ad44bbf8] __schedule at ffffffff81811483
 #1 [ffff8800ad44bc50] schedule at ffffffff81811c60
 #2 [ffff8800ad44bc68] freeze_array at ffffffffa085a1e1 [raid1]
 #3 [ffff8800ad44bcc0] handle_read_error at ffffffffa085bfe9 [raid1]
 #4 [ffff8800ad44bd68] raid1d at ffffffffa085d29b [raid1]
 #5 [ffff8800ad44be60] md_thread at ffffffffa0549040 [md_mod]

and 9 fio threads are blocked in wait_barrier:

PID: 5172   TASK: ffff88022bf4a700  CPU: 3   COMMAND: "fio"
 #0 [ffff88022186b6d8] __schedule at ffffffff81811483
 #1 [ffff88022186b730] schedule at ffffffff81811c60
 #2 [ffff88022186b748] wait_barrier at ffffffffa085a0b6 [raid1]
 #3 [ffff88022186b7b0] make_request at ffffffffa085c3cd [raid1]
 #4 [ffff88022186b890] md_make_request at ffffffffa0549903 [md_mod]

The other fio thread is sleeping in futex wait; I think it's unrelated.

PID: 5171   TASK: ffff88022bf49a00  CPU: 2   COMMAND: "fio"
 #0 [ffff880221f97c08] __schedule at ffffffff81811483
 #1 [ffff880221f97c60] schedule at ffffffff81811c60
 #2 [ffff880221f97c78] futex_wait_queue_me at ffffffff810c8a42
 #3 [ffff880221f97cb0] futex_wait at ffffffff810c9c72

> There are other ways that nr_pending and nr_queued can get out of sync,
> though I think they would result in nr_pending being less than
> nr_queued, not more.
>
> If the presence of a bad block in the bad block log causes a request to
> be split into two r1bios, and if both of those end up on one of the
> queues, then they would be added to nr_queued twice, but to nr_pending
> only once.  We should fix that.

I checked; there are some bad blocks md has detected, not sure if it helps:

root@ibnbd-clt1:~/jack# cat /sys/block/md1/md/dev-ibnbd0/bad_blocks
180630 512
181142 7
982011 8
1013386 255

I also checked each md_rdev:

crash> struct md_rdev 0xffff880230630000
struct md_rdev {
  same_set = { next = 0xffff88022db03818, prev = 0xffff880230653e00 },
  sectors = 2095104,
  mddev = 0xffff88022db03800,
  last_events = 125803676,
  meta_bdev = 0x0,
  bdev = 0xffff880234a56080,
  sb_page = 0xffffea0008d3edc0,
  bb_page = 0xffffea0008b17400,
  sb_loaded = 1,
  sb_events = 72,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = { name = 0xffff88022c567510 "dev-loop1", entry = { next = 0xffff880230630080, prev = 0xffff880230630080 }, parent = 0xffff88022db03850, kset = 0x0, ktype = 0xffffffffa055f200 <rdev_ktype>, sd = 0xffff88022c4ac708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 },
  flags = 2,
  blocked_wait = { lock = { rlock = { raw_lock = { val = { counter = 0 } } } }, task_list = { next = 0xffff8802306300c8, prev = 0xffff8802306300c8 } },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 },
  nr_pending = { counter = 1 },
  read_errors = { counter = 0 },
  last_read_error = { tv_sec = 0, tv_nsec = 0 },
  corrected_errors = { counter = 0 },
  del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 },
  sysfs_state = 0xffff88022c4ac780,
  badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff88022740d000, changed = 0, lock = { seqcount = { sequence = 60 }, lock = { rlock = { raw_lock = { val = { counter = 0 } } } } }, sector = 0, size = 0 }
}

crash> struct md_rdev 0xffff880230653e00
struct md_rdev {
  same_set = { next = 0xffff880230630000, prev = 0xffff88022db03818 },
  sectors = 2095104,
  mddev = 0xffff88022db03800,
  last_events = 9186098,
  meta_bdev = 0x0,
  bdev = 0xffff880234a56700,
  sb_page = 0xffffea0007758f40,
  bb_page = 0xffffea000887b480,
  sb_loaded = 1,
  sb_events = 42,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = { name = 0xffff880233c825b0 "dev-ibnbd0", entry = { next = 0xffff880230653e80, prev = 0xffff880230653e80 }, parent = 0xffff88022db03850, kset = 0x0, ktype = 0xffffffffa055f200 <rdev_ktype>, sd = 0xffff8800b9fcce88, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 },
  flags = 581,
  blocked_wait = { lock = { rlock = { raw_lock = { val = { counter = 0 } } } }, task_list = { next = 0xffff880230653ec8, prev = 0xffff880230653ec8 } },
  desc_nr = 0,
  raid_disk = 0,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 },
  nr_pending = { counter = 856 },
  read_errors = { counter = 0 },
  last_read_error = { tv_sec = 0, tv_nsec = 0 },
  corrected_errors = { counter = 0 },
  del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 },
  sysfs_state = 0xffff8800b9fccf78,
  badblocks = { count = 4, unacked_exist = 0, shift = 0, page = 0xffff880227211000, changed = 0, lock = { seqcount = { sequence = 1624 }, lock = { rlock = { raw_lock = { val = { counter = 0 } } } } }, sector = 80, size = 8 }
}

> The pointer you started at was at the end of the list.
> So this r1bio structure you are seeing is not an r1bio at all but the
> memory out of the middle of the r1conf, being interpreted as an r1bio.
> You can confirm this by noticing that the retry_list in the r1bio:
>
> is exactly the same as the retry_list in the r1conf.

Oh, thanks for the explanation; I had noticed the coincidence and was
curious why. I still have the hung machine alive, so I can check whatever
else you think is necessary.

Thanks again for debugging this with me.

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-09  6:01 UTC
To: Jinpu Wang; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey

On Thu, Dec 08 2016, Jinpu Wang wrote:

This number:

>   nr_pending = { counter = 1 },

and this number:

>   nr_pending = { counter = 856 },

might be interesting.

There are 855 requests on the list.  Adding the one that is currently
being retried gives 856, which is nr_pending for the device that failed.
But nr_pending on the device that didn't fail is 1.  I would expect
zero.
When a read or write request succeeds, rdev_dec_pending() is called
immediately, so this should quickly go to zero.

It seems as though there must be a request to the loop device that is
stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly
inside read_balance) and the call to generic_make_request().
I cannot yet see how that would happen.

Can you check whether this is a repeatable observation?  Is
nr_pending.counter always '1' on the loop device?

Thanks,
NeilBrown
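The pairing Neil refers to, roughly as it looks in 4.4-era md code (a
sketch from memory, not a verbatim copy): read_balance() takes a reference
on the chosen device before the bio is resubmitted, and the per-device
count only drops again when the I/O completes:

    /* in read_balance(), once a best_disk has been chosen */
    atomic_inc(&rdev->nr_pending);

    /* in md.h; called from the raid1 completion paths */
    static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
    {
            int faulty = test_bit(Faulty, &rdev->flags);

            if (atomic_dec_and_test(&rdev->nr_pending) && faulty) {
                    set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                    md_wakeup_thread(mddev->thread);
            }
    }

A bio stuck between that atomic_inc() and the later generic_make_request()
would hold rdev->nr_pending at 1 on the healthy loop device, and would
also be the one request counted in conf->nr_pending without a matching
entry on any retry queue.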
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-09 6:01 ` NeilBrown @ 2016-12-09 15:28 ` Jinpu Wang 2016-12-09 15:36 ` Jinpu Wang 1 sibling, 0 replies; 31+ messages in thread From: Jinpu Wang @ 2016-12-09 15:28 UTC (permalink / raw) To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey On Fri, Dec 9, 2016 at 7:01 AM, NeilBrown <neilb@suse.com> wrote: > On Thu, Dec 08 2016, Jinpu Wang wrote: > > > This number: > >> nr_pending = { >> counter = 1 >> }, > > > > and this number: > >> nr_pending = { >> counter = 856 >> }, > > might be interesting. > > There are 855 requested on the list. Add the one that is currently > being retried give 856, which is nr_pending for the device that failed. > But nr_pending on the device that didn't fail is 1. I would expect > zero. > When a read or write requests succeeds, rdev_dec_pending() is called > immediately so this should quickly go to zero. > > It seems as though there must be a request to the loop device that is > stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly > inside read_balance) and the call to generic_make_request(). > I cannot yet see how that would happen. > > Can you check if the is a repeatable observation? Is nr_pending.counter > always '1' on the loop device? > > Thanks, > NeilBrown Hi Neil, Yes, it's repreatable observation. I triggered again, this time nr_pending = 1203, nr_waiting = 8, nr_queued = 1201, conf->retry_list has 1175 entries. on conf->bio_end_io_list has 26 entries. Totol is 1201, match nr_queued. in md_rdev healthy one loop1 has 1 nr_pending. faulty one ibnbd1 has 1076. crash> struct md_rdev 0xffff880228880400 struct md_rdev { same_set = { next = 0xffff88023202a200, prev = 0xffff8800b64c6018 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 17764573, meta_bdev = 0x0, bdev = 0xffff8800b60ce080, sb_page = 0xffffea0002bd3040, bb_page = 0xffffea0002dc76c0, sb_loaded = 1, sb_events = 166, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff880037962af0 "dev-ibnbd0", entry = { next = 0xffff880228880480, prev = 0xffff880228880480 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bfc12d0, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 581, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8802288804c8, prev = 0xffff8802288804c8 } }, desc_nr = 0, raid_disk = 0, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 1176 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff88022bfc1348, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff8802289aa000, changed = 0, lock = { seqcount = { sequence = 0 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } crash> struct md_rdev 0xffff88023202a200 struct md_rdev { same_set = { next = 0xffff8800b64c6018, prev = 0xffff880228880400 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 37178561, meta_bdev = 0x0, bdev = 0xffff8800b60d09c0, sb_page = 0xffffea0008af7580, bb_page = 0xffffea0002e69380, sb_loaded = 1, sb_events 
= 167, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88023521ec30 "dev-loop1", entry = { next = 0xffff88023202a280, prev = 0xffff88023202a280 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bc0a708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023202a2c8, prev = 0xffff88023202a2c8 } }, crash> struct md_rdev 0xffff88023202a200 struct md_rdev { same_set = { next = 0xffff8800b64c6018, prev = 0xffff880228880400 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 37178561, meta_bdev = 0x0, bdev = 0xffff8800b60d09c0, sb_page = 0xffffea0008af7580, bb_page = 0xffffea0002e69380, sb_loaded = 1, sb_events = 167, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88023521ec30 "dev-loop1", entry = { next = 0xffff88023202a280, prev = 0xffff88023202a280 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bc0a708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023202a2c8, prev = 0xffff88023202a2c8 } }, desc_nr = 1, raid_disk = 1, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 1 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff88022bc0a780, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff88022bff0000, changed = 0, lock = { seqcount = { sequence = 164 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } Thanks -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-09 6:01 ` NeilBrown 2016-12-09 15:28 ` Jinpu Wang @ 2016-12-09 15:36 ` Jinpu Wang 1 sibling, 0 replies; 31+ messages in thread From: Jinpu Wang @ 2016-12-09 15:36 UTC (permalink / raw) To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey On Fri, Dec 9, 2016 at 7:01 AM, NeilBrown <neilb@suse.com> wrote: > On Thu, Dec 08 2016, Jinpu Wang wrote: > > > This number: > >> nr_pending = { >> counter = 1 >> }, > > > > and this number: > >> nr_pending = { >> counter = 856 >> }, > > might be interesting. > > There are 855 requested on the list. Add the one that is currently > being retried give 856, which is nr_pending for the device that failed. > But nr_pending on the device that didn't fail is 1. I would expect > zero. > When a read or write requests succeeds, rdev_dec_pending() is called > immediately so this should quickly go to zero. > > It seems as though there must be a request to the loop device that is > stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly > inside read_balance) and the call to generic_make_request(). > I cannot yet see how that would happen. > > Can you check if the is a repeatable observation? Is nr_pending.counter > always '1' on the loop device? > > Thanks, > NeilBrown Hi Neil, Yes, it's repreatable observation. I triggered again, this time nr_pending = 1203, nr_waiting = 8, nr_queued = 1201, conf->retry_list has 1175 entries. on conf->bio_end_io_list has 26 entries. Totol is 1201, match nr_queued. in md_rdev healthy one loop1 has 1 nr_pending. faulty one ibnbd1 has 1076. crash> struct md_rdev 0xffff880228880400 struct md_rdev { same_set = { next = 0xffff88023202a200, prev = 0xffff8800b64c6018 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 17764573, meta_bdev = 0x0, bdev = 0xffff8800b60ce080, sb_page = 0xffffea0002bd3040, bb_page = 0xffffea0002dc76c0, sb_loaded = 1, sb_events = 166, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff880037962af0 "dev-ibnbd0", entry = { next = 0xffff880228880480, prev = 0xffff880228880480 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bfc12d0, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 581, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8802288804c8, prev = 0xffff8802288804c8 } }, desc_nr = 0, raid_disk = 0, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 1176 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff88022bfc1348, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff8802289aa000, changed = 0, lock = { seqcount = { sequence = 0 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } crash> struct md_rdev 0xffff88023202a200 struct md_rdev { same_set = { next = 0xffff8800b64c6018, prev = 0xffff880228880400 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 37178561, meta_bdev = 0x0, bdev = 0xffff8800b60d09c0, sb_page = 0xffffea0008af7580, bb_page = 0xffffea0002e69380, sb_loaded = 1, sb_events 
= 167, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88023521ec30 "dev-loop1", entry = { next = 0xffff88023202a280, prev = 0xffff88023202a280 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bc0a708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023202a2c8, prev = 0xffff88023202a2c8 } }, crash> struct md_rdev 0xffff88023202a200 struct md_rdev { same_set = { next = 0xffff8800b64c6018, prev = 0xffff880228880400 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 37178561, meta_bdev = 0x0, bdev = 0xffff8800b60d09c0, sb_page = 0xffffea0008af7580, bb_page = 0xffffea0002e69380, sb_loaded = 1, sb_events = 167, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88023521ec30 "dev-loop1", entry = { next = 0xffff88023202a280, prev = 0xffff88023202a280 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bc0a708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023202a2c8, prev = 0xffff88023202a2c8 } }, desc_nr = 1, raid_disk = 1, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 1 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff88022bc0a780, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff88022bff0000, changed = 0, lock = { seqcount = { sequence = 164 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } Thanks -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-11-25 13:30 [BUG] MD/RAID1 hung forever on freeze_array Jinpu Wang 2016-11-25 13:59 ` Jinpu Wang @ 2016-12-12 0:59 ` NeilBrown 2016-12-12 13:10 ` Jinpu Wang 1 sibling, 1 reply; 31+ messages in thread From: NeilBrown @ 2016-12-12 0:59 UTC (permalink / raw) To: Jinpu Wang, linux-raid, Shaohua Li; +Cc: Nate Dailey [-- Attachment #1: Type: text/plain, Size: 184 bytes --] On Sat, Nov 26 2016, Jinpu Wang wrote: > [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0 What is this? I cannot find that function in the upstream kernel. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-12 0:59 ` NeilBrown @ 2016-12-12 13:10 ` Jinpu Wang 2016-12-12 21:53 ` NeilBrown 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-12 13:10 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote: > On Sat, Nov 26 2016, Jinpu Wang wrote: >> [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0 > > What is this? I cannot find that function in the upstream kernel. > > NeilBrown Hi Neil, blk_prologue_bio is our internal extension to gather some stats, sorry not informed before. It does below, code snip only: +static void bio_endio_complete_diskstat(struct bio *clone) +{ + struct bio *bio = clone->bi_private; + struct block_device *bdev = bio->bi_bdev; + struct hd_struct *hdp = bdev->bd_part; + unsigned long start = clone->bi_start_time; + const int rw = !!bio_data_dir(bio); + int cpu, err = clone->bi_error; + + bio_put(clone); + blk_end_io_acct() + bio->bi_error = err; + bio_endio(bio); +} + +blk_qc_t blk_prologue_bio(struct request_queue *q, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct hd_struct *hdp = bdev->bd_part; + struct bio *clone; + + if (!hdp->additional_diskstat) + return q->custom_make_request_fn(q, bio); + clone = bio_clone_fast(bio, GFP_NOWAIT | __GFP_NOWARN, NULL); + if (unlikely(!clone)) + return q->custom_make_request_fn(q, bio); + clone->bi_start_time = jiffies; + if (hdp->bio_mode_diskstat) { + int rw = !!bio_data_dir(bio); + + generic_start_io_acct(rw, bio_sectors(clone), hdp); + } + clone->bi_end_io = bio_endio_complete_diskstat; + clone->bi_private = bio; + return q->custom_make_request_fn(q, clone); +} IMHO, it seems unrelated, but I will rerun my test without this change. -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-12 13:10 ` Jinpu Wang @ 2016-12-12 21:53 ` NeilBrown 2016-12-13 15:08 ` Jinpu Wang 0 siblings, 1 reply; 31+ messages in thread From: NeilBrown @ 2016-12-12 21:53 UTC (permalink / raw) To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey [-- Attachment #1: Type: text/plain, Size: 835 bytes --] On Tue, Dec 13 2016, Jinpu Wang wrote: > On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote: >> On Sat, Nov 26 2016, Jinpu Wang wrote: >>> [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0 >> >> What is this? I cannot find that function in the upstream kernel. >> >> NeilBrown > > Hi Neil, > > blk_prologue_bio is our internal extension to gather some stats, sorry > not informed before. Ahhh. .... > + return q->custom_make_request_fn(q, clone); I haven't heard of custom_make_request_fn before either. > +} > > IMHO, it seems unrelated, but I will rerun my test without this change. Yes, please re-test with an unmodified upstream kernel (and always report *exactly* what kernel you are running. I cannot analyse code that I cannot see). NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-12 21:53 ` NeilBrown @ 2016-12-13 15:08 ` Jinpu Wang 2016-12-13 22:18 ` NeilBrown 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-13 15:08 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey Hi Neil, On Mon, Dec 12, 2016 at 10:53 PM, NeilBrown <neilb@suse.com> wrote: > On Tue, Dec 13 2016, Jinpu Wang wrote: > >> On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote: >>> On Sat, Nov 26 2016, Jinpu Wang wrote: >>>> [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0 >>> >>> What is this? I cannot find that function in the upstream kernel. >>> >>> NeilBrown >> >> Hi Neil, >> >> blk_prologue_bio is our internal extension to gather some stats, sorry >> not informed before. > > Ahhh. > > .... >> + return q->custom_make_request_fn(q, clone); > > I haven't heard of custom_make_request_fn before either. > >> +} >> >> IMHO, it seems unrelated, but I will rerun my test without this change. > > Yes, please re-test with an unmodified upstream kernel (and always > report *exactly* what kernel you are running. I cannot analyse code > that I cannot see). > > NeilBrown As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. 4.4.36 kernel: crash> bt 4069 PID: 4069 TASK: ffff88022b4f8d00 CPU: 3 COMMAND: "md2_raid1" #0 [ffff8800b77d3bf8] __schedule at ffffffff81811453 #1 [ffff8800b77d3c50] schedule at ffffffff81811c30 #2 [ffff8800b77d3c68] freeze_array at ffffffffa07ee17e [raid1] #3 [ffff8800b77d3cc0] handle_read_error at ffffffffa07f093b [raid1] #4 [ffff8800b77d3d68] raid1d at ffffffffa07f10a6 [raid1] #5 [ffff8800b77d3e60] md_thread at ffffffffa04dee80 [md_mod] #6 [ffff8800b77d3ed0] kthread at ffffffff81075fb6 #7 [ffff8800b77d3f50] ret_from_fork at ffffffff818157df crash> bt 2558 bt: invalid task or pid value: 2558 crash> bt 4558 PID: 4558 TASK: ffff88022b550d00 CPU: 3 COMMAND: "fio" #0 [ffff88022c287710] __schedule at ffffffff81811453 #1 [ffff88022c287768] schedule at ffffffff81811c30 #2 [ffff88022c287780] wait_barrier at ffffffffa07ee044 [raid1] #3 [ffff88022c2877e8] make_request at ffffffffa07efc65 [raid1] #4 [ffff88022c2878d0] md_make_request at ffffffffa04df609 [md_mod] #5 [ffff88022c287928] generic_make_request at ffffffff813fd3de #6 [ffff88022c287970] submit_bio at ffffffff813fd522 #7 [ffff88022c2879b8] do_blockdev_direct_IO at ffffffff811d32a7 #8 [ffff88022c287be8] __blockdev_direct_IO at ffffffff811d3b6e #9 [ffff88022c287c10] blkdev_direct_IO at ffffffff811ce2d7 #10 [ffff88022c287c38] generic_file_direct_write at ffffffff81132c90 #11 [ffff88022c287cb0] __generic_file_write_iter at ffffffff81132e1d #12 [ffff88022c287d08] blkdev_write_iter at ffffffff811ce597 #13 [ffff88022c287d68] aio_run_iocb at ffffffff811deca6 #14 [ffff88022c287e68] do_io_submit at ffffffff811dfbaa #15 [ffff88022c287f40] sys_io_submit at ffffffff811dfe4b #16 [ffff88022c287f50] entry_SYSCALL_64_fastpath at ffffffff81815497 RIP: 00007f63b1362737 RSP: 00007ffff7eb17f8 RFLAGS: 00000206 RAX: ffffffffffffffda RBX: 00007f63a142a000 RCX: 00007f63b1362737 RDX: 0000000001179b58 RSI: 0000000000000001 RDI: 00007f63b1f4a000 RBP: 0000000000000512 R8: 0000000000000001 R9: 0000000001171fa0 R10: 00007f639ef84000 R11: 0000000000000206 R12: 0000000000000001 R13: 0000000000000200 R14: 000000003a2d3000 R15: 0000000000000001 ORIG_RAX: 00000000000000d1 CS: 0033 SS: 002b crash> struct r1conf 0xffff880037362100 struct r1conf { 
mddev = 0xffff880037352800, mirrors = 0xffff88022c209c00, raid_disks = 2, next_resync = 18446744073709527039, start_next_window = 18446744073709551615, current_window_requests = 0, next_window_requests = 0, device_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, retry_list = { next = 0xffff8801ce757740, prev = 0xffff8801b1b79140 }, bio_end_io_list = { next = 0xffff8801ce7d9ac0, prev = 0xffff88022838f4c0 }, pending_bio_list = { head = 0x0, tail = 0x0 }, pending_count = 0, wait_barrier = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8801f6d87818, prev = 0xffff88022c2877a8 } }, resync_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, nr_pending = 2086, nr_waiting = 97, nr_queued = 2084, barrier = 0, array_frozen = 1, fullsync = 0, recovery_disabled = 1, poolinfo = 0xffff8802330be390, r1bio_pool = 0xffff88022bdf54e0, r1buf_pool = 0x0, tmppage = 0xffffea0000dcee40, thread = 0x0, cluster_sync_low = 0, cluster_sync_high = 0 } crash> crash> struct raid1_info 0xffff88022c209c00 struct raid1_info { rdev = 0xffff880231635800, head_position = 1318965, next_seq_sect = 252597, seq_start = 252342 } crash> struct raid1_info 0xffff88022c209c20 struct raid1_info { rdev = 0xffff88023166ce00, head_position = 1585216, next_seq_sect = 839992, seq_start = 839977 } crash> struct md_rdev 0xffff880231635800 struct md_rdev { same_set = { next = 0xffff880037352818, prev = 0xffff88023166ce00 }, sectors = 2095104, mddev = 0xffff880037352800, last_events = 41325652, meta_bdev = 0x0, bdev = 0xffff880235c2aa40, sb_page = 0xffffea0002dd98c0, bb_page = 0xffffea0002e48f80, sb_loaded = 1, sb_events = 205, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff8802341cdef0 "dev-loop1", entry = { next = 0xffff880231635880, prev = 0xffff880231635880 }, parent = 0xffff880037352850, kset = 0x0, ktype = 0xffffffffa04f3020 <rdev_ktype>, sd = 0xffff880233e3b8e8, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8802316358c8, prev = 0xffff8802316358c8 } }, desc_nr = 0, raid_disk = 0, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 0, journal_tail = 0 }, nr_pending = { counter = 1 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff880233e3b960, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff88022c0d6000, changed = 0, lock = { seqcount = { sequence = 264 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } struct md_rdev { same_set = { next = 0xffff880231635800, prev = 0xffff880037352818 }, sectors = 2095104, mddev = 0xffff880037352800, last_events = 10875407, meta_bdev = 0x0, bdev = 0xffff880234a86a40, sb_page = 0xffffea00089e0ac0, bb_page = 0xffffea0007db4980, sb_loaded = 1, sb_events = 204, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88022c100e30 "dev-ibnbd0", entry = { next = 0xffff88023166ce80, prev = 0xffff88023166ce80 }, parent = 0xffff880037352850, kset = 0x0, ktype = 0xffffffffa04f3020 <rdev_ktype>, sd = 0xffff8800b6539e10, kref = { 
refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 581, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023166cec8, prev = 0xffff88023166cec8 } }, desc_nr = 1, raid_disk = 1, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 2073 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff8800b6539e88, badblocks = { count = 1, unacked_exist = 0, shift = 0, page = 0xffff880099ced000, changed = 0, lock = { seqcount = { sequence = 4 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 80, size = 8 } } -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-13 15:08 ` Jinpu Wang @ 2016-12-13 22:18 ` NeilBrown 2016-12-14 10:22 ` Jinpu Wang 0 siblings, 1 reply; 31+ messages in thread From: NeilBrown @ 2016-12-13 22:18 UTC (permalink / raw) To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey [-- Attachment #1: Type: text/plain, Size: 1703 bytes --] On Wed, Dec 14 2016, Jinpu Wang wrote: > > As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. > I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. > Thanks. I have an hypothesis. md_make_request() calls blk_queue_split(). If that does split the request it will call generic_make_request() on the first half. That will call back into md_make_request() and raid1_make_request() which will submit requests to the underlying devices. These will get caught on the bio_list_on_stack queue in generic_make_request(). This is a queue which is not accounted in nr_queued. When blk_queue_split() completes, 'bio' will be the second half of the bio. This enters raid1_make_request() and by this time the array has been frozen. So wait_barrier() has to wait for pending requests to complete, and that includes the one that is stuck in bio_list_on_stack, which will never complete now. To see if this might be happening, please change the blk_queue_split(q, &bio, q->bio_split); call in md_make_request() to struct bio *tmp = bio; blk_queue_split(q, &bio, q->bio_split); WARN_ON_ONCE(bio != tmp); If that ever triggers, then the above is a real possibility. Fixing the problem isn't very easy... You could try: 1/ write a function in raid1.c which calls punt_bios_to_rescuer() (which you will need to export from block/bio.c), passing mddev->queue->bio_split as the bio_set. 2/ change the wait_event_lock_irq() call in wait_barrier() to wait_event_lock_irq_cmd(), and pass the new function as the command. That way, if wait_barrier() ever blocks, all the requests in bio_list_on_stack will be handled by a separate thread. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
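To make the hypothesised deadlock concrete: freeze_array() in the 4.4-era raid1.c waits roughly as sketched below (a paraphrase, not a verbatim quote). A bio that has already bumped nr_pending but is parked on bio_list_on_stack never reaches the lower device, so it neither completes nor lands on retry_list/bio_end_io_list (and hence is not counted in nr_queued), and the wait condition can never become true:

/* Sketch of the wait in freeze_array(), assuming the 4.4-era layout. */
static void freeze_array(struct r1conf *conf, int extra)
{
	spin_lock_irq(&conf->resync_lock);
	conf->array_frozen = 1;
	/* Wait until every pending request is either completed or queued for
	 * raid1d; 'extra' allows for the request(s) the caller itself holds,
	 * e.g. 1 when called from handle_read_error(). */
	wait_event_lock_irq_cmd(conf->wait_barrier,
				conf->nr_pending == conf->nr_queued + extra,
				conf->resync_lock,
				flush_pending_writes(conf));
	spin_unlock_irq(&conf->resync_lock);
}

This matches the dumps earlier in the thread, where conf->nr_pending was consistently one more than conf->nr_queued plus the request held by handle_read_error().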
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-13 22:18 ` NeilBrown @ 2016-12-14 10:22 ` Jinpu Wang 2016-12-14 12:13 ` Jinpu Wang 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-14 10:22 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey Thanks Neil, On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote: > On Wed, Dec 14 2016, Jinpu Wang wrote: > >> >> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. >> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. >> > > Thanks. > > I have an hypothesis. > > md_make_request() calls blk_queue_split(). > If that does split the request it will call generic_make_request() > on the first half. That will call back into md_make_request() and > raid1_make_request() which will submit requests to the underlying > devices. These will get caught on the bio_list_on_stack queue in > generic_make_request(). > This is a queue which is not accounted in nr_queued. > > When blk_queue_split() completes, 'bio' will be the second half of the > bio. > This enters raid1_make_request() and by this time the array have been > frozen. > So wait_barrier() has to wait for pending requests to complete, and that > includes the one that it stuck in bio_list_on_stack, which will never > complete now. > > To see if this might be happening, please change the > > blk_queue_split(q, &bio, q->bio_split); > > call in md_make_request() to > > struct bio *tmp = bio; > blk_queue_split(q, &bio, q->bio_split); > WARN_ON_ONCE(bio != tmp); > > If that ever triggers, then the above is a real possibility. I triggered the warning as you expected, we can confirm the bug was caused by your above hypothesis. [ 429.282235] ------------[ cut here ]------------ [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 md_set_array_sectors+0xac0/0xc30 [md_mod]() [ 429.285288] Modules linked in: raid1 ibnbd_client(O) ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm _infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci libahci libata scsi_mod crc32c_intel r8169 psmo use xhci_pci xhci_hcd [last unloaded: mlx_compat] [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pse rver #1 [ 429.288825] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014 [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486 0000000000000000 [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 ffff8800b8f3e000 [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700 ffff880234f9a700 [ 429.290594] Call Trace: [ 429.290743] [<ffffffff81424486>] dump_stack+0x4d/0x67 [ 429.290893] [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0 [ 429.291046] [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20 [ 429.291202] [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod] [ 429.291358] [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0 [ 429.291540] [<ffffffff813fd522>] submit_bio+0x62/0x150 [ 429.291693] [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60 [ 429.291847] [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0 [ 429.292011] [<ffffffffa0834f64>] ? 
ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client] [ 429.292271] [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10 [ 429.292417] [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40 [ 429.292566] [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50 [ 429.292746] [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580 [ 429.292894] [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110 [ 429.293073] [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40 [ 429.293284] [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0 [ 429.293492] [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200 [ 429.293703] [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180 [ 429.293901] [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0 [ 429.294047] [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0 [ 429.294194] [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10 [ 429.294375] [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a [ 429.294610] ---[ end trace 25d1cece0e01494b ]--- I double checked the nr_pending on heathy leg is still 1 as before. > > Fixing the problem isn't very easy... > > You could try: > 1/ write a function in raid1.c which calls punt_bios_to_rescuer() > (which you will need to export from block/bio.c), > passing mddev->queue->bio_split as the bio_set. > > 1/ change the wait_event_lock_irq() call in wait_barrier() to > wait_event_lock_irq_cmd(), and pass the new function as the command. > > That way, if wait_barrier() ever blocks, all the requests in > bio_list_on_stack will be handled by a separate thread. > > NeilBrown I will try your sugested way to see if it fix the bug, will report back soon. -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-14 10:22 ` Jinpu Wang @ 2016-12-14 12:13 ` Jinpu Wang 2016-12-14 14:49 ` Jinpu Wang 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-14 12:13 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey [-- Attachment #1: Type: text/plain, Size: 8255 bytes --] On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote: > Thanks Neil, > > On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote: >> On Wed, Dec 14 2016, Jinpu Wang wrote: >> >>> >>> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. >>> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. >>> >> >> Thanks. >> >> I have an hypothesis. >> >> md_make_request() calls blk_queue_split(). >> If that does split the request it will call generic_make_request() >> on the first half. That will call back into md_make_request() and >> raid1_make_request() which will submit requests to the underlying >> devices. These will get caught on the bio_list_on_stack queue in >> generic_make_request(). >> This is a queue which is not accounted in nr_queued. >> >> When blk_queue_split() completes, 'bio' will be the second half of the >> bio. >> This enters raid1_make_request() and by this time the array have been >> frozen. >> So wait_barrier() has to wait for pending requests to complete, and that >> includes the one that it stuck in bio_list_on_stack, which will never >> complete now. >> >> To see if this might be happening, please change the >> >> blk_queue_split(q, &bio, q->bio_split); >> >> call in md_make_request() to >> >> struct bio *tmp = bio; >> blk_queue_split(q, &bio, q->bio_split); >> WARN_ON_ONCE(bio != tmp); >> >> If that ever triggers, then the above is a real possibility. > > I triggered the warning as you expected, we can confirm the bug was > caused by your above hypothesis. > [ 429.282235] ------------[ cut here ]------------ > [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 > md_set_array_sectors+0xac0/0xc30 [md_mod]() > [ 429.285288] Modules linked in: raid1 ibnbd_client(O) > ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm > iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c > ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core > ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd > edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm > _infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor > button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci > libahci libata scsi_mod crc32c_intel r8169 psmo > use xhci_pci xhci_hcd [last unloaded: mlx_compat] > [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pse > rver #1 > [ 429.288825] Hardware name: To be filled by O.E.M. 
To be filled by > O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014 > [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486 > 0000000000000000 > [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 > ffff8800b8f3e000 > [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700 > ffff880234f9a700 > [ 429.290594] Call Trace: > [ 429.290743] [<ffffffff81424486>] dump_stack+0x4d/0x67 > [ 429.290893] [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0 > [ 429.291046] [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20 > [ 429.291202] [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod] > [ 429.291358] [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0 > [ 429.291540] [<ffffffff813fd522>] submit_bio+0x62/0x150 > [ 429.291693] [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60 > [ 429.291847] [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0 > [ 429.292011] [<ffffffffa0834f64>] ? > ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client] > [ 429.292271] [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10 > [ 429.292417] [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40 > [ 429.292566] [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50 > [ 429.292746] [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580 > [ 429.292894] [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110 > [ 429.293073] [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40 > [ 429.293284] [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0 > [ 429.293492] [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200 > [ 429.293703] [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180 > [ 429.293901] [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0 > [ 429.294047] [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0 > [ 429.294194] [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10 > [ 429.294375] [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a > [ 429.294610] ---[ end trace 25d1cece0e01494b ]--- > > I double checked the nr_pending on heathy leg is still 1 as before. > >> >> Fixing the problem isn't very easy... >> >> You could try: >> 1/ write a function in raid1.c which calls punt_bios_to_rescuer() >> (which you will need to export from block/bio.c), >> passing mddev->queue->bio_split as the bio_set. >> >> 1/ change the wait_event_lock_irq() call in wait_barrier() to >> wait_event_lock_irq_cmd(), and pass the new function as the command. >> >> That way, if wait_barrier() ever blocks, all the requests in >> bio_list_on_stack will be handled by a separate thread. >> >> NeilBrown > > I will try your sugested way to see if it fix the bug, will report back soon. > Hi Neil, Sorry, bad news, with the 2 patch attached, I can still reproduce the same bug. nr_pending on healthy leg is still 1, as before. 
crash> struct r1conf 0xffff8800b7176100 struct r1conf { mddev = 0xffff8800b59b0000, mirrors = 0xffff88022bab7900, raid_disks = 2, next_resync = 18446744073709527039, start_next_window = 18446744073709551615, current_window_requests = 0, next_window_requests = 0, device_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, retry_list = { next = 0xffff880211b2ec40, prev = 0xffff88022819ad40 }, bio_end_io_list = { next = 0xffff880227e9a9c0, prev = 0xffff8802119c6140 }, pending_bio_list = { head = 0x0, tail = 0x0 }, pending_count = 0, wait_barrier = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8800adf3b818, prev = 0xffff88021180f7a8 } }, resync_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, nr_pending = 1675, nr_waiting = 100, nr_queued = 1673, barrier = 0, array_frozen = 1, fullsync = 0, recovery_disabled = 1, poolinfo = 0xffff88022c80f640, r1bio_pool = 0xffff88022b8b6a20, r1buf_pool = 0x0, tmppage = 0xffffea0008a90c80, thread = 0x0, cluster_sync_low = 0, cluster_sync_high = 0 } kobj = { name = 0xffff88022b7194a0 "dev-loop1", entry = { next = 0xffff880231495280, prev = 0xffff880231495280 }, parent = 0xffff8800b59b0050, kset = 0x0, ktype = 0xffffffffa0564060 <rdev_ktype>, sd = 0xffff8800b6510960, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8802314952c8, prev = 0xffff8802314952c8 } }, desc_nr = 1, raid_disk = 1, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 0, journal_tail = 0 }, nr_pending = { counter = 1 }, -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 
207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss [-- Attachment #2: 0001-block-export-punt_bios_to_rescuer.patch --] [-- Type: text/x-patch, Size: 1566 bytes --] From e7adbbb1a8d542ea68ada5996e0f9ffe87c479b6 Mon Sep 17 00:00:00 2001 From: Jack Wang <jinpu.wang@profitbricks.com> Date: Wed, 14 Dec 2016 11:26:23 +0100 Subject: [PATCH 1/2] block: export punt_bios_to_rescuer We need it later in raid1 Suggested-by: Neil Brown <neil.brown@suse.com> Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> --- block/bio.c | 3 ++- include/linux/bio.h | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/block/bio.c b/block/bio.c index 46e2cc1..f6a250d 100644 --- a/block/bio.c +++ b/block/bio.c @@ -354,7 +354,7 @@ static void bio_alloc_rescue(struct work_struct *work) } } -static void punt_bios_to_rescuer(struct bio_set *bs) +void punt_bios_to_rescuer(struct bio_set *bs) { struct bio_list punt, nopunt; struct bio *bio; @@ -384,6 +384,7 @@ static void punt_bios_to_rescuer(struct bio_set *bs) queue_work(bs->rescue_workqueue, &bs->rescue_work); } +EXPORT_SYMBOL(punt_bios_to_rescuer); /** * bio_alloc_bioset - allocate a bio for I/O diff --git a/include/linux/bio.h b/include/linux/bio.h index 42e4e3c..6256ba7 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -479,6 +479,7 @@ extern void bio_advance(struct bio *, unsigned); extern void bio_init(struct bio *); extern void bio_reset(struct bio *); void bio_chain(struct bio *, struct bio *); +void punt_bios_to_rescuer(struct bio_set *); extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int); extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *, -- 2.7.4 [-- Attachment #3: 0002-raid1-fix-deadlock.patch --] [-- Type: text/x-patch, Size: 1420 bytes --] From 2ad4cc5e8b5d7ec9db7a6fffaa2fdcd5f20419bf Mon Sep 17 00:00:00 2001 From: Jack Wang <jinpu.wang@profitbricks.com> Date: Wed, 14 Dec 2016 11:35:52 +0100 Subject: [PATCH 2/2] raid1: fix deadlock Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> --- drivers/md/raid1.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 478223c..61dafb1 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -190,6 +190,11 @@ static void put_all_bios(struct r1conf *conf, struct r1bio *r1_bio) } } +static void raid1_punt_bios_to_rescuer(struct r1conf *conf) +{ + punt_bios_to_rescuer(conf->mddev->queue->bio_split); +} + static void free_r1bio(struct r1bio *r1_bio) { struct r1conf *conf = r1_bio->mddev->private; @@ -871,14 +876,15 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio) * that queue to allow conf->start_next_window * to increase. */ - wait_event_lock_irq(conf->wait_barrier, + wait_event_lock_irq_cmd(conf->wait_barrier, !conf->array_frozen && (!conf->barrier || ((conf->start_next_window < conf->next_resync + RESYNC_SECTORS) && current->bio_list && !bio_list_empty(current->bio_list))), - conf->resync_lock); + conf->resync_lock, + raid1_punt_bios_to_rescuer(conf)); conf->nr_waiting--; } -- 2.7.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
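A note on the mechanism the attached 0002 patch relies on: wait_event_lock_irq_cmd() runs the extra command each time the waiter is about to sleep, with the spinlock dropped, so passing the punt helper there should hand any bios allocated from q->bio_split that are parked on current->bio_list over to that bio_set's rescuer thread before wait_barrier() blocks. A simplified paraphrase of the include/linux/wait.h helper (semantics only, not the literal macro text):

/* Simplified expansion of wait_event_lock_irq_cmd(wq, condition, lock, cmd): */
for (;;) {
	prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);
	if (condition)
		break;
	spin_unlock_irq(&lock);
	cmd;			/* e.g. raid1_punt_bios_to_rescuer(conf) from patch 0002 */
	schedule();
	spin_lock_irq(&lock);
}
finish_wait(&wq, &__wait);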
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-14 12:13 ` Jinpu Wang @ 2016-12-14 14:49 ` Jinpu Wang 2016-12-15 3:20 ` NeilBrown 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-14 14:49 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey On Wed, Dec 14, 2016 at 1:13 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote: > On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang > <jinpu.wang@profitbricks.com> wrote: >> Thanks Neil, >> >> On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote: >>> On Wed, Dec 14 2016, Jinpu Wang wrote: >>> >>>> >>>> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. >>>> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. >>>> >>> >>> Thanks. >>> >>> I have an hypothesis. >>> >>> md_make_request() calls blk_queue_split(). >>> If that does split the request it will call generic_make_request() >>> on the first half. That will call back into md_make_request() and >>> raid1_make_request() which will submit requests to the underlying >>> devices. These will get caught on the bio_list_on_stack queue in >>> generic_make_request(). >>> This is a queue which is not accounted in nr_queued. >>> >>> When blk_queue_split() completes, 'bio' will be the second half of the >>> bio. >>> This enters raid1_make_request() and by this time the array have been >>> frozen. >>> So wait_barrier() has to wait for pending requests to complete, and that >>> includes the one that it stuck in bio_list_on_stack, which will never >>> complete now. >>> >>> To see if this might be happening, please change the >>> >>> blk_queue_split(q, &bio, q->bio_split); >>> >>> call in md_make_request() to >>> >>> struct bio *tmp = bio; >>> blk_queue_split(q, &bio, q->bio_split); >>> WARN_ON_ONCE(bio != tmp); >>> >>> If that ever triggers, then the above is a real possibility. >> >> I triggered the warning as you expected, we can confirm the bug was >> caused by your above hypothesis. >> [ 429.282235] ------------[ cut here ]------------ >> [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 >> md_set_array_sectors+0xac0/0xc30 [md_mod]() >> [ 429.285288] Modules linked in: raid1 ibnbd_client(O) >> ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm >> iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c >> ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core >> ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd >> edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm >> _infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor >> button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci >> libahci libata scsi_mod crc32c_intel r8169 psmo >> use xhci_pci xhci_hcd [last unloaded: mlx_compat] >> [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pse >> rver #1 >> [ 429.288825] Hardware name: To be filled by O.E.M. 
To be filled by >> O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014 >> [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486 >> 0000000000000000 >> [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 >> ffff8800b8f3e000 >> [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700 >> ffff880234f9a700 >> [ 429.290594] Call Trace: >> [ 429.290743] [<ffffffff81424486>] dump_stack+0x4d/0x67 >> [ 429.290893] [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0 >> [ 429.291046] [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20 >> [ 429.291202] [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod] >> [ 429.291358] [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0 >> [ 429.291540] [<ffffffff813fd522>] submit_bio+0x62/0x150 >> [ 429.291693] [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60 >> [ 429.291847] [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0 >> [ 429.292011] [<ffffffffa0834f64>] ? >> ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client] >> [ 429.292271] [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10 >> [ 429.292417] [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40 >> [ 429.292566] [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50 >> [ 429.292746] [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580 >> [ 429.292894] [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110 >> [ 429.293073] [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40 >> [ 429.293284] [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0 >> [ 429.293492] [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200 >> [ 429.293703] [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180 >> [ 429.293901] [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0 >> [ 429.294047] [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0 >> [ 429.294194] [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10 >> [ 429.294375] [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a >> [ 429.294610] ---[ end trace 25d1cece0e01494b ]--- >> >> I double checked the nr_pending on heathy leg is still 1 as before. >> >>> >>> Fixing the problem isn't very easy... >>> >>> You could try: >>> 1/ write a function in raid1.c which calls punt_bios_to_rescuer() >>> (which you will need to export from block/bio.c), >>> passing mddev->queue->bio_split as the bio_set. >>> >>> 1/ change the wait_event_lock_irq() call in wait_barrier() to >>> wait_event_lock_irq_cmd(), and pass the new function as the command. >>> >>> That way, if wait_barrier() ever blocks, all the requests in >>> bio_list_on_stack will be handled by a separate thread. >>> >>> NeilBrown >> >> I will try your sugested way to see if it fix the bug, will report back soon. >> Hi Neil, I found a old mail thread below http://www.spinics.net/lists/raid/msg52792.html Likely Alex is trying to fix same bug, right? in one reply you suggested to modify the call in make_request @@ -1207,7 +1207,8 @@ read_again: sectors_handled; goto read_again; } else - generic_make_request(read_bio); + reschedule_retry(r1_bio); return; } I append above change, it looks fix the bug, I've run same tests over one hour, no hung task anymore. Do you think this is right fix? Do we still need the change you suggested with punt_bios_to_rescuer? -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
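For context on the alternative change tried here: reschedule_retry() hands the r1_bio to the raid1d thread instead of re-entering generic_make_request() from the same stack, so the re-issued read can never be parked on the caller's bio_list_on_stack. Roughly, paraphrased from the 4.4-era raid1.c:

static void reschedule_retry(struct r1bio *r1_bio)
{
	unsigned long flags;
	struct r1conf *conf = r1_bio->mddev->private;

	spin_lock_irqsave(&conf->device_lock, flags);
	list_add(&r1_bio->retry_list, &conf->retry_list);
	conf->nr_queued++;		/* now accounted for by freeze_array() */
	spin_unlock_irqrestore(&conf->device_lock, flags);

	wake_up(&conf->wait_barrier);
	md_wakeup_thread(conf->mddev->thread);	/* raid1d issues the bio later */
}

The trade-off is an extra hand-off to raid1d for every read taking this path, which is the performance concern raised in the reply below.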
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-14 14:49 ` Jinpu Wang @ 2016-12-15 3:20 ` NeilBrown 2016-12-15 9:24 ` Jinpu Wang [not found] ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com> 0 siblings, 2 replies; 31+ messages in thread From: NeilBrown @ 2016-12-15 3:20 UTC (permalink / raw) To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey [-- Attachment #1: Type: text/plain, Size: 2300 bytes --] > Hi Neil, > > I found a old mail thread below > http://www.spinics.net/lists/raid/msg52792.html > > Likely Alex is trying to fix same bug, right? > in one reply you suggested to modify the call in make_request > > @@ -1207,7 +1207,8 @@ read_again: > sectors_handled; > goto read_again; > } else > - generic_make_request(read_bio); > + reschedule_retry(r1_bio); > return; > } > > > I append above change, it looks fix the bug, I've run same tests over > one hour, no hung task anymore. > > Do you think this is right fix? Do we still need the change you > suggested with punt_bios_to_rescuer? I don't really like that fix. I suspect it would probably hurt performance. I'd prefer to fix generic_make_request() to process queued requests in a more sensible order. Can you please try the following (with all other patches removed)? Thanks, NeilBrown diff --git a/block/blk-core.c b/block/blk-core.c index 14d7c0740dc0..3436b6fc3ef8 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, false) == 0)) { + struct bio_list hold; + struct bio_list lower, same; + + /* Create a fresh bio_list for all subordinate requests */ + bio_list_merge(&hold, &bio_list_on_stack); + bio_list_init(&bio_list_on_stack); ret = q->make_request_fn(q, bio); blk_queue_exit(q); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); + bio_list_merge(&bio_list_on_stack, &hold); + bio = bio_list_pop(current->bio_list); } else { struct bio *bio_next = bio_list_pop(current->bio_list); [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-15 3:20 ` NeilBrown @ 2016-12-15 9:24 ` Jinpu Wang [not found] ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com> 1 sibling, 0 replies; 31+ messages in thread From: Jinpu Wang @ 2016-12-15 9:24 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey On Thu, Dec 15, 2016 at 4:20 AM, NeilBrown <neilb@suse.com> wrote: > >> Hi Neil, >> >> I found a old mail thread below >> http://www.spinics.net/lists/raid/msg52792.html >> >> Likely Alex is trying to fix same bug, right? >> in one reply you suggested to modify the call in make_request >> >> @@ -1207,7 +1207,8 @@ read_again: >> sectors_handled; >> goto read_again; >> } else >> - generic_make_request(read_bio); >> + reschedule_retry(r1_bio); >> return; >> } >> >> >> I append above change, it looks fix the bug, I've run same tests over >> one hour, no hung task anymore. >> >> Do you think this is right fix? Do we still need the change you >> suggested with punt_bios_to_rescuer? > > I don't really like that fix. I suspect it would probably hurt > performance. > > I'd prefer to fix generic_make_request() to process queued requests in a > more sensible order. > Can you please try the following (with all other patches removed)? > Thanks, > NeilBrown Thanks Neil, I will try when back to office, I'm sick at home. > > diff --git a/block/blk-core.c b/block/blk-core.c > index 14d7c0740dc0..3436b6fc3ef8 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio) > struct request_queue *q = bdev_get_queue(bio->bi_bdev); > > if (likely(blk_queue_enter(q, false) == 0)) { > + struct bio_list hold; > + struct bio_list lower, same; > + > + /* Create a fresh bio_list for all subordinate requests */ > + bio_list_merge(&hold, &bio_list_on_stack); > + bio_list_init(&bio_list_on_stack); > ret = q->make_request_fn(q, bio); > > blk_queue_exit(q); > > + /* sort new bios into those for a lower level > + * and those for the same level > + */ > + bio_list_init(&lower); > + bio_list_init(&same); > + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) > + if (q == bdev_get_queue(bio->bi_bdev)) > + bio_list_add(&same, bio); > + else > + bio_list_add(&lower, bio); > + /* now assemble so we handle the lowest level first */ > + bio_list_merge(&bio_list_on_stack, &lower); > + bio_list_merge(&bio_list_on_stack, &same); > + bio_list_merge(&bio_list_on_stack, &hold); > + > bio = bio_list_pop(current->bio_list); > } else { > struct bio *bio_next = bio_list_pop(current->bio_list); -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com>]
* Re: [BUG] MD/RAID1 hung forever on freeze_array [not found] ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com> @ 2016-12-19 14:56 ` Jinpu Wang 2016-12-19 22:45 ` NeilBrown 1 sibling, 0 replies; 31+ messages in thread From: Jinpu Wang @ 2016-12-19 14:56 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey On Mon, Dec 19, 2016 at 12:16 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote: > Hi Neil, > > After apply the patch below, it paniced during boot in > generic_make_request-> bio_list_pop. > Looks related to you do bio_list_init(&bio_list_on_stack); again. >> diff --git a/block/blk-core.c b/block/blk-core.c >> index 14d7c0740dc0..3436b6fc3ef8 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio) >> struct request_queue *q = bdev_get_queue(bio->bi_bdev); >> >> if (likely(blk_queue_enter(q, false) == 0)) { >> + struct bio_list hold; >> + struct bio_list lower, same; >> + >> + /* Create a fresh bio_list for all subordinate requests */ >> + bio_list_merge(&hold, &bio_list_on_stack); >> + bio_list_init(&bio_list_on_stack); ??? maybe init hold, and then merge bio_list_on_stack? >> ret = q->make_request_fn(q, bio); >> >> blk_queue_exit(q); >> >> + /* sort new bios into those for a lower level >> + * and those for the same level >> + */ >> + bio_list_init(&lower); >> + bio_list_init(&same); >> + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) >> + if (q == bdev_get_queue(bio->bi_bdev)) >> + bio_list_add(&same, bio); >> + else >> + bio_list_add(&lower, bio); >> + /* now assemble so we handle the lowest level first */ >> + bio_list_merge(&bio_list_on_stack, &lower); >> + bio_list_merge(&bio_list_on_stack, &same); >> + bio_list_merge(&bio_list_on_stack, &hold); >> + >> bio = bio_list_pop(current->bio_list); >> } else { >> struct bio *bio_next = bio_list_pop(current->bio_list); > > Seems we don't need bio_list hold at all, lower and seem merge back to bio_list_on_stack? I'm currently testing patch below, so far so good. diff --git a/block/blk-core.c b/block/blk-core.c index e8d15d8..9c56cf7 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2064,10 +2064,24 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) { + struct bio_list lower, same; ret = q->make_request_fn(q, bio); blk_queue_exit(q); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); bio = bio_list_pop(current->bio_list); } else { -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  [not found]              ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com>
  2016-12-19 14:56          ` Jinpu Wang
@ 2016-12-19 22:45          ` NeilBrown
  2016-12-20 10:34            ` Jinpu Wang
  1 sibling, 1 reply; 31+ messages in thread
From: NeilBrown @ 2016-12-19 22:45 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey

[-- Attachment #1: Type: text/plain, Size: 1418 bytes --]

On Mon, Dec 19 2016, Jinpu Wang wrote:

> Hi Neil,
>
> After applying the patch below, it panicked during boot in
> generic_make_request() -> bio_list_pop().
> It looks related to calling bio_list_init(&bio_list_on_stack) again.
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 14d7c0740dc0..3436b6fc3ef8 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio)
>>         struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>>         if (likely(blk_queue_enter(q, false) == 0)) {
>> +               struct bio_list hold;
>> +               struct bio_list lower, same;
>> +
>> +               /* Create a fresh bio_list for all subordinate requests */
>> +               bio_list_merge(&hold, &bio_list_on_stack);

This is the problem.  'hold' hasn't been initialised.
We could either do:
   bio_list_init(&hold);
   bio_list_merge(&hold, &bio_list_on_stack);
or just
   hold = bio_list_on_stack;

You didn't find 'hold' to be necessary in your testing, but I think that
in more complex arrangements it could make an important difference.

Thanks,
NeilBrown


>> +               bio_list_init(&bio_list_on_stack); ??? maybe init hold, and then merge bio_list_on_stack?
>>                 ret = q->make_request_fn(q, bio);
>>
>>                 blk_queue_exit(q);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
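To make the diagnosis above concrete: bio_list_merge() follows the head/tail
pointers of the destination list, so merging into a never-initialised 'hold'
dereferences whatever stack garbage happens to sit in those fields. The sketch
below is a simplified userspace model -- the struct layout and helper
semantics mirror the kernel's bio_list as I read it, but this is not kernel
code, and the bios are invented for the example. It shows the first safe
option (init 'hold' before merging); the second option is just a struct copy,
because struct bio_list is only a head/tail pair.

/*
 * Simplified userspace model of the bio_list helpers -- illustration only.
 * Helper semantics mirror the kernel's bio_list; the bios are made up.
 */
#include <stdio.h>

struct bio { struct bio *bi_next; int id; };
struct bio_list { struct bio *head, *tail; };

static void bio_list_init(struct bio_list *bl)
{
        bl->head = bl->tail = NULL;
}

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
        bio->bi_next = NULL;
        if (bl->tail)
                bl->tail->bi_next = bio;
        else
                bl->head = bio;
        bl->tail = bio;
}

static void bio_list_merge(struct bio_list *bl, struct bio_list *bl2)
{
        if (!bl2->head)
                return;
        if (bl->head)
                bl->tail->bi_next = bl2->head; /* garbage 'tail' blows up here */
        else
                bl->head = bl2->head;
        bl->tail = bl2->tail;
}

int main(void)
{
        struct bio a = { .id = 1 }, b = { .id = 2 };
        struct bio_list on_stack, hold;

        bio_list_init(&on_stack);
        bio_list_add(&on_stack, &a);
        bio_list_add(&on_stack, &b);

        /* Option 1 from the mail: initialise 'hold' before merging into it. */
        bio_list_init(&hold);
        bio_list_merge(&hold, &on_stack);
        bio_list_init(&on_stack);

        /* Option 2 would simply be a struct copy:  hold = on_stack;  */

        for (struct bio *bio = hold.head; bio; bio = bio->bi_next)
                printf("bio %d preserved in 'hold'\n", bio->id);
        return 0;
}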
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-19 22:45          ` NeilBrown
@ 2016-12-20 10:34            ` Jinpu Wang
  2016-12-20 21:23              ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Jinpu Wang @ 2016-12-20 10:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey

Hi Neil,
> This is the problem.  'hold' hasn't been initialised.
> We could either do:
>    bio_list_init(&hold);
>    bio_list_merge(&hold, &bio_list_on_stack);

I tried the above variant first and it led to a panic in the endio path:

PID: 4004   TASK: ffff8802337f3400   CPU: 1   COMMAND: "fio"
 #0 [ffff88023ec838d0] machine_kexec at ffffffff8104075a
 #1 [ffff88023ec83918] crash_kexec at ffffffff810d54c3
 #2 [ffff88023ec839e0] oops_end at ffffffff81008784
 #3 [ffff88023ec83a08] no_context at ffffffff8104a8f6
 #4 [ffff88023ec83a60] __bad_area_nosemaphore at ffffffff8104abcf
 #5 [ffff88023ec83aa8] bad_area_nosemaphore at ffffffff8104ad3e
 #6 [ffff88023ec83ab8] __do_page_fault at ffffffff8104afd7
 #7 [ffff88023ec83b10] do_page_fault at ffffffff8104b33c
 #8 [ffff88023ec83b20] page_fault at ffffffff818173a2
    [exception RIP: bio_check_pages_dirty+65]
    RIP: ffffffff813f6221  RSP: ffff88023ec83bd8  RFLAGS: 00010212
    RAX: 0000000000000020  RBX: ffff880232d75010  RCX: 0000000000000001
    RDX: ffff880232d74000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88023ec83bf8   R8: 0000000000000001   R9: 0000000000000000
    R10: ffffffff81f25ac0  R11: ffff8802348acef0  R12: 0000000000000001
    R13: 0000000000000000  R14: ffff8800b53b7d00  R15: ffff88009704d180
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88023ec83c00] dio_bio_complete at ffffffff811d010e
#10 [ffff88023ec83c38] dio_bio_end_aio at ffffffff811d0367
#11 [ffff88023ec83c68] bio_endio at ffffffff813f637a
#12 [ffff88023ec83c80] call_bio_endio at ffffffffa0868220 [raid1]
#13 [ffff88023ec83cc8] raid_end_bio_io at ffffffffa086885b [raid1]
#14 [ffff88023ec83cf8] raid1_end_read_request at ffffffffa086a184 [raid1]
#15 [ffff88023ec83d50] bio_endio at ffffffff813f637a
#16 [ffff88023ec83d68] blk_update_request at ffffffff813fdab6
#17 [ffff88023ec83da8] blk_mq_end_request at ffffffff81406dfe

> or just
>    hold = bio_list_on_stack;
>
>
> You didn't find 'hold' to be necessary in your testing, but I think that
> in more complex arrangements it could make an important difference.

Could you elaborate a bit more? From my understanding, in the later code we
pop all bios from bio_list_on_stack and add each to either the "lower" or
the "same" bio_list, so merging both will give the whole list again, right?

>
> Thanks,
> NeilBrown
>
>
>>> +               bio_list_init(&bio_list_on_stack); ??? maybe init hold, and then merge bio_list_on_stack?
>>>                 ret = q->make_request_fn(q, bio);
>>>
>>>                 blk_queue_exit(q);

Thanks
--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-20 10:34            ` Jinpu Wang
@ 2016-12-20 21:23              ` NeilBrown
  2016-12-21 12:48                ` Jinpu Wang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2016-12-20 21:23 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey

[-- Attachment #1: Type: text/plain, Size: 1783 bytes --]

On Tue, Dec 20 2016, Jinpu Wang wrote:

> Hi Neil,
>> This is the problem.  'hold' hasn't been initialised.
>> We could either do:
>>    bio_list_init(&hold);
>>    bio_list_merge(&hold, &bio_list_on_stack);
> I tried the above variant first and it led to a panic in the endio path:
> ...
>
> [exception RIP: bio_check_pages_dirty+65]


I can't explain that at all.  Maybe if I saw the complete patch I would
be able to see what went wrong (or maybe there is a separate cause).

>>
>> You didn't find 'hold' to be necessary in your testing, but I think that
>> in more complex arrangements it could make an important difference.
>
> Could you elaborate a bit more? From my understanding, in the later code we
> pop all bios from bio_list_on_stack and add each to either the "lower" or
> the "same" bio_list, so merging both will give the whole list again, right?

If there are several bios on bio_list_on_stack, and if processing the
first one causes several calls to generic_make_request(), then we want
all the bios passed to generic_make_request() to be handled *before* the
remaining bios that were originally on bio_list_on_stack.  Of these new
bios, we want to handle those for a lower-level device first, and those
for the same device after that.  Only when all of those have been
completely submitted (together with any subordinate bios submitted to
generic_make_request()) do we move on to the rest of bio_list_on_stack
(which are at the same level as the first bio, or higher in the device
stack).

The usage of 'hold' would become important when you have multiple
levels, e.g. dm on md on scsi.  If the first bio sent to dm is split,
then the first half needs to be processed all the way down to all the
scsi devices, before the second half of the split is handled.

Hope that helps,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
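The ordering described above can be replayed in userspace. The sketch below
is an illustration only: the bio_list helpers are re-implemented to mirror
the kernel ones, and the device names ("md1", "loop1") and the bios are
invented for the example. It walks one iteration of the patched
generic_make_request() loop by hand -- a previously queued bio is stashed in
'hold', the new bios produced by make_request_fn() are split into 'lower' and
'same', and the reassembled list comes out lowest level first.

/* Userspace illustration of the lower/same/hold ordering -- not kernel code. */
#include <stdio.h>
#include <string.h>

struct bio { struct bio *bi_next; char dev[8]; };
struct bio_list { struct bio *head, *tail; };

static void bio_list_init(struct bio_list *bl) { bl->head = bl->tail = NULL; }

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
        bio->bi_next = NULL;
        if (bl->tail)
                bl->tail->bi_next = bio;
        else
                bl->head = bio;
        bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
        struct bio *bio = bl->head;
        if (bio) {
                bl->head = bio->bi_next;
                if (!bl->head)
                        bl->tail = NULL;
                bio->bi_next = NULL;
        }
        return bio;
}

static void bio_list_merge(struct bio_list *bl, struct bio_list *bl2)
{
        if (!bl2->head)
                return;
        if (bl->head)
                bl->tail->bi_next = bl2->head;
        else
                bl->head = bl2->head;
        bl->tail = bl2->tail;
}

int main(void)
{
        /* Hypothetical devices: "md1" is the current level, "loop1" is lower. */
        struct bio held    = { .dev = "md1"   }; /* queued before this iteration   */
        struct bio split   = { .dev = "md1"   }; /* resubmitted to the same level  */
        struct bio read_lo = { .dev = "loop1" }; /* internal READ for a member dev */
        struct bio_list on_stack, hold, lower, same;
        struct bio *bio;

        bio_list_init(&on_stack);
        bio_list_add(&on_stack, &held);

        /* generic_make_request(): stash the old list, start a fresh one */
        hold = on_stack;
        bio_list_init(&on_stack);

        /* ->make_request_fn() queues new bios onto the (now empty) list */
        bio_list_add(&on_stack, &split);
        bio_list_add(&on_stack, &read_lo);

        /* sort what make_request_fn produced, then reassemble */
        bio_list_init(&lower);
        bio_list_init(&same);
        while ((bio = bio_list_pop(&on_stack)) != NULL)
                bio_list_add(strcmp(bio->dev, "md1") ? &lower : &same, bio);
        bio_list_merge(&on_stack, &lower);
        bio_list_merge(&on_stack, &same);
        bio_list_merge(&on_stack, &hold);

        while ((bio = bio_list_pop(&on_stack)) != NULL)
                printf("submit to %s\n", bio->dev); /* loop1, md1, md1 */
        return 0;
}

With this ordering, a bio redirected at a lower-level member device is
dispatched before any same-level or previously held bio, which is what lets
the pending internal READs drain.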
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-20 21:23              ` NeilBrown
@ 2016-12-21 12:48                ` Jinpu Wang
  2016-12-21 23:51                  ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Jinpu Wang @ 2016-12-21 12:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey

[-- Attachment #1: Type: text/plain, Size: 2806 bytes --]

Thanks Neil,

On Tue, Dec 20, 2016 at 10:23 PM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Dec 20 2016, Jinpu Wang wrote:
>
>> Hi Neil,
>>> This is the problem.  'hold' hasn't been initialised.
>>> We could either do:
>>>    bio_list_init(&hold);
>>>    bio_list_merge(&hold, &bio_list_on_stack);
>> I tried the above variant first and it led to a panic in the endio path:
>> ...
>>
>> [exception RIP: bio_check_pages_dirty+65]
>
>
> I can't explain that at all.  Maybe if I saw the complete patch I would
> be able to see what went wrong (or maybe there is a separate cause).

I think I know the cause now.  What I actually did was:

+bio_list_init(&hold);
 bio_list_merge(&hold, &bio_list_on_stack);
-bio_list_init(&bio_list_on_stack);

I didn't read your comment carefully enough and thought we didn't need to
init &bio_list_on_stack a second time.  But your idea is exactly that: save
a copy of the list before we call q->make_request_fn.

>
>>>
>>> You didn't find 'hold' to be necessary in your testing, but I think that
>>> in more complex arrangements it could make an important difference.
>>
>> Could you elaborate a bit more? From my understanding, in the later code we
>> pop all bios from bio_list_on_stack and add each to either the "lower" or
>> the "same" bio_list, so merging both will give the whole list again, right?
>
> If there are several bios on bio_list_on_stack, and if processing the
> first one causes several calls to generic_make_request(), then we want
> all the bios passed to generic_make_request() to be handled *before* the
> remaining bios that were originally on bio_list_on_stack.  Of these new
> bios, we want to handle those for a lower-level device first, and those
> for the same device after that.  Only when all of those have been
> completely submitted (together with any subordinate bios submitted to
> generic_make_request()) do we move on to the rest of bio_list_on_stack
> (which are at the same level as the first bio, or higher in the device
> stack).
>
> The usage of 'hold' would become important when you have multiple
> levels, e.g. dm on md on scsi.  If the first bio sent to dm is split,
> then the first half needs to be processed all the way down to all the
> scsi devices, before the second half of the split is handled.
>
> Hope that helps,
> NeilBrown
>

Thanks, that helps a lot.  I've attached the patch I'm still testing, but
so far so good.  Could you check if I got it right?

Cheers,
--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

[-- Attachment #2: 0001-block-fix-deadlock-between-freeze_array-and-wait_bar.patch --]
[-- Type: text/x-patch, Size: 2300 bytes --]

From 9500d5e9a7ccd1f4397f927ffdc8d378955d477c Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@profitbricks.com>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting in
current->bio_list, which prevents the freeze_array() call from completing.
Those can only be internal READs, which have already passed the
wait_barrier call (thus incrementing nr_pending), but were still not
submitted to the lower level, due to the generic_make_request logic that
avoids recursive calls.  In such a case we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
  never complete.

To fix this, modify generic_make_request to always sort bio_list_on_stack:
bios for lower-level devices first, then bios for the same level.

Suggested-by: Neil Brown <neilb@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index e8d15d8..5a6390a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2064,10 +2064,30 @@ blk_qc_t generic_make_request(struct bio *bio)
        struct request_queue *q = bdev_get_queue(bio->bi_bdev);

        if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
+               struct bio_list lower, same, hold;
+
+               /* Create a fresh bio_list for all subordinate requests */
+               bio_list_init(&hold);
+               bio_list_merge(&hold, &bio_list_on_stack);
+               bio_list_init(&bio_list_on_stack);
                ret = q->make_request_fn(q, bio);

                blk_queue_exit(q);

+               /* sort new bios into those for a lower level
+                * and those for the same level
+                */
+               bio_list_init(&lower);
+               bio_list_init(&same);
+               while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+                       if (q == bdev_get_queue(bio->bi_bdev))
+                               bio_list_add(&same, bio);
+                       else
+                               bio_list_add(&lower, bio);
+               /* now assemble so we handle the lowest level first */
+               bio_list_merge(&bio_list_on_stack, &lower);
+               bio_list_merge(&bio_list_on_stack, &same);
+               bio_list_merge(&bio_list_on_stack, &hold);
                bio = bio_list_pop(current->bio_list);
        } else {
--
2.7.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread
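For readers who want to see the deadlock from the changelog in isolation,
here is a toy pthread model -- purely an illustration with invented counters
and timeouts, not the raid1 code. A 'freeze_array' thread waits for
nr_pending to drain while a 'fio' thread sits in a simplified wait_barrier();
the internal READ that would let nr_pending drop is assumed to be parked on
the submitting thread's bio_list, so neither side can make progress and the
demo gives up after two seconds and reports the stall.

/* Toy pthread model of the deadlock -- illustration only, not kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wait_barrier_cv = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  freeze_cv = PTHREAD_COND_INITIALIZER;
static int array_frozen;      /* set by the freeze side */
static int nr_pending = 1;    /* one internal READ passed wait_barrier(), but its
                               * bio still sits on current->bio_list, unsubmitted */
static int nr_queued;

static struct timespec deadline(int sec)
{
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += sec;
        return ts;
}

static void *freeze_array_thread(void *arg)
{
        struct timespec ts = deadline(2);
        (void)arg;
        pthread_mutex_lock(&lock);
        array_frozen = 1;
        while (nr_pending != nr_queued)         /* never becomes true */
                if (pthread_cond_timedwait(&freeze_cv, &lock, &ts)) {
                        printf("freeze_array: stuck (nr_pending=%d nr_queued=%d)\n",
                               nr_pending, nr_queued);
                        break;
                }
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void *fio_thread(void *arg)
{
        struct timespec ts = deadline(2);
        (void)arg;
        pthread_mutex_lock(&lock);
        while (array_frozen)                    /* wait_barrier(): blocked */
                if (pthread_cond_timedwait(&wait_barrier_cv, &lock, &ts)) {
                        printf("wait_barrier: stuck (array_frozen=%d)\n", array_frozen);
                        break;
                }
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        struct timespec delay = { 0, 100 * 1000 * 1000 };

        pthread_create(&a, NULL, freeze_array_thread, NULL);
        nanosleep(&delay, NULL);        /* let the freeze side set array_frozen first */
        pthread_create(&b, NULL, fio_thread, NULL);
        /* The bio that would make nr_pending drop to nr_queued is parked behind
         * bios that are themselves stuck in wait_barrier(), so nothing moves. */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}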
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-21 12:48                ` Jinpu Wang
@ 2016-12-21 23:51                  ` NeilBrown
  2016-12-22  8:35                    ` Jinpu Wang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2016-12-21 23:51 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey

[-- Attachment #1: Type: text/plain, Size: 295 bytes --]

On Wed, Dec 21 2016, Jinpu Wang wrote:

>>
> Thanks, that helps a lot.  I've attached the patch I'm still testing, but
> so far so good.
> Could you check if I got it right?

Yes, that looks exactly right.
I guess one of us should try to push it upstream... maybe next year :-)

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-21 23:51                  ` NeilBrown
@ 2016-12-22  8:35                    ` Jinpu Wang
  0 siblings, 0 replies; 31+ messages in thread
From: Jinpu Wang @ 2016-12-22 8:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey

On Thu, Dec 22, 2016 at 12:51 AM, NeilBrown <neilb@suse.com> wrote:
> On Wed, Dec 21 2016, Jinpu Wang wrote:
>
>>>
>> Thanks, that helps a lot.  I've attached the patch I'm still testing, but
>> so far so good.
>> Could you check if I got it right?
>
> Yes, that looks exactly right.
> I guess one of us should try to push it upstream... maybe next year :-)
>
> Thanks,
> NeilBrown

Thanks, Neil. I will try to push it upstream next year!
Happy holidays!

Cheers!
--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply	[flat|nested] 31+ messages in thread
end of thread, other threads:[~2016-12-22  8:35 UTC | newest]

Thread overview: 31+ messages:
2016-11-25 13:30 [BUG] MD/RAID1 hung forever on freeze_array Jinpu Wang
2016-11-25 13:59 ` Jinpu Wang
2016-11-28  4:47 ` Coly Li
2016-11-28  8:24 ` Jinpu Wang
2016-11-28  8:54 ` Coly Li
2016-11-28  9:02 ` Jinpu Wang
2016-11-28  9:10 ` Coly Li
2016-11-29 11:15 ` Jinpu Wang
2016-12-07 14:17 ` Jinpu Wang
2016-12-08  3:17 ` NeilBrown
2016-12-08  9:50 ` Jinpu Wang
2016-12-09  6:01 ` NeilBrown
2016-12-09 15:28 ` Jinpu Wang
2016-12-09 15:36 ` Jinpu Wang
2016-12-12  0:59 ` NeilBrown
2016-12-12 13:10 ` Jinpu Wang
2016-12-12 21:53 ` NeilBrown
2016-12-13 15:08 ` Jinpu Wang
2016-12-13 22:18 ` NeilBrown
2016-12-14 10:22 ` Jinpu Wang
2016-12-14 12:13 ` Jinpu Wang
2016-12-14 14:49 ` Jinpu Wang
2016-12-15  3:20 ` NeilBrown
2016-12-15  9:24 ` Jinpu Wang
[not found]      ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com>
2016-12-19 14:56 ` Jinpu Wang
2016-12-19 22:45 ` NeilBrown
2016-12-20 10:34 ` Jinpu Wang
2016-12-20 21:23 ` NeilBrown
2016-12-21 12:48 ` Jinpu Wang
2016-12-21 23:51 ` NeilBrown
2016-12-22  8:35 ` Jinpu Wang