* [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-25 13:30 UTC
To: linux-raid, Shaohua Li; +Cc: Neil Brown, Nate Dailey

Hi,

I'm hitting a hung task in mdx_raid1 when running tests; I can reproduce
it easily with the test below:

I create one md with one local loop device and one remote scsi device
exported by SRP, run fio with mixed read/write on top of the md, and then
force_close the session on the storage side. mdx_raid1 is waiting on
freeze_array in D state, and a lot of fio threads are also in D state in
wait_barrier.

[ 335.154711] blk_update_request: I/O error, dev sdb, sector 8
[ 335.154855] md: super_written gets error=-5
[ 335.154999] md/raid1:md1: Disk failure on sdb, disabling device.
              md/raid1:md1: Operation continuing on 1 devices.
[ 335.155258] sd 1:0:0:0: rejecting I/O to offline device
[ 335.155402] blk_update_request: I/O error, dev sdb, sector 80
[ 335.155547] md: super_written gets error=-5
[ 340.158828] scsi host1: ib_srp: reconnect succeeded
[ 373.017608] md/raid1:md1: redirecting sector 616617 to other mirror: loop1
[ 373.110527] md/raid1:md1: redirecting sector 1320893 to other mirror: loop1
[ 373.117230] md/raid1:md1: redirecting sector 1564499 to other mirror: loop1
[ 373.127652] md/raid1:md1: redirecting sector 104034 to other mirror: loop1
[ 373.135665] md/raid1:md1: redirecting sector 1209765 to other mirror: loop1
[ 373.145634] md/raid1:md1: redirecting sector 51200 to other mirror: loop1
[ 373.158824] md/raid1:md1: redirecting sector 755750 to other mirror: loop1
[ 373.169964] md/raid1:md1: redirecting sector 1681631 to other mirror: loop1
[ 373.178619] md/raid1:md1: redirecting sector 1894296 to other mirror: loop1
[ 373.186153] md/raid1:md1: redirecting sector 1905016 to other mirror: loop1
[ 374.364370] RAID1 conf printout:
[ 374.364377]  --- wd:1 rd:2
[ 374.364379]  disk 0, wo:1, o:0, dev:sdb
[ 374.364381]  disk 1, wo:0, o:1, dev:loop1
[ 374.437099] RAID1 conf printout:
[ 374.437103]  --- wd:1 rd:2
snip

[ 810.266112] sysrq: SysRq : Show Blocked State
[ 810.266235]   task                        PC stack   pid father
[ 810.266362] md1_raid1       D ffff88022d927c48     0  4022      2 0x00000000
[ 810.266487]  ffff88022d927c48 ffff8802351a0000 ffff8800b91bc100 000000008010000e
[ 810.266747]  ffff88022d927c30 ffff88022d928000 0000000000000001 ffff880233b49b70
[ 810.266975]  ffff880233b49b88 ffff8802325d5a40 ffff88022d927c60 ffffffff81810600
[ 810.267203] Call Trace:
[ 810.267322]  [<ffffffff81810600>] schedule+0x30/0x80
[ 810.267437]  [<ffffffffa01342c1>] freeze_array+0x71/0xc0 [raid1]
[ 810.267555]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[ 810.267669]  [<ffffffffa013578b>] handle_read_error+0x3b/0x570 [raid1]
[ 810.267816]  [<ffffffff81185783>] ? kmem_cache_free+0x183/0x190
[ 810.267929]  [<ffffffff81094e36>] ? __wake_up+0x46/0x60
[ 810.268045]  [<ffffffffa0136dcd>] raid1d+0x20d/0xfc0 [raid1]
[ 810.268159]  [<ffffffff81813043>] ? schedule_timeout+0x1a3/0x230
[ 810.268274]  [<ffffffff8180fe77>] ? __schedule+0x2e7/0xa40
[ 810.268391]  [<ffffffffa0211839>] md_thread+0x119/0x120 [md_mod]
[ 810.268508]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[ 810.268624]  [<ffffffffa0211720>] ? find_pers+0x70/0x70 [md_mod]
[ 810.268741]  [<ffffffff81075614>] kthread+0xc4/0xe0
[ 810.268853]  [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
[ 810.268970]  [<ffffffff8181415f>] ret_from_fork+0x3f/0x70
[ 810.269114]  [<ffffffff81075550>] ? kthread_worker_fn+0x150/0x150
[ 810.269227] fio             D ffff8802325137a0     0  4212   4206 0x00000000
[ 810.269347]  ffff8802325137a0 ffff88022de3db00 ffff8800ba7bb400 0000000000000000
[ 810.269574]  ffff880233b49b00 ffff880232513788 ffff880232514000 ffff880233b49b88
[ 810.269801]  ffff880233b49b70 ffff8800ba7bb400 ffff8800b5f5db00 ffff8802325137b8
[ 810.270028] Call Trace:
[ 810.270138]  [<ffffffff81810600>] schedule+0x30/0x80
[ 810.270282]  [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
[ 810.270396]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[ 810.270513]  [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
[ 810.270628]  [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
[ 810.270746]  [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
[ 810.270860]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
[ 810.270976]  [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
[ 810.271090]  [<ffffffff813fc362>] submit_bio+0x62/0x140
[ 810.271209]  [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
[ 810.271323]  [<ffffffff81810600>] ? schedule+0x30/0x80
[ 810.271468]  [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
[ 810.271580]  [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
[ 810.271696]  [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
[ 810.271828]  [<ffffffff81132cbf>] generic_file_read_iter+0x44f/0x570
[ 810.271949]  [<ffffffff811ceaa0>] ? blkdev_write_iter+0x110/0x110
[ 810.272062]  [<ffffffff811cead0>] blkdev_read_iter+0x30/0x40
[ 810.272179]  [<ffffffff811de5a6>] aio_run_iocb+0x126/0x2b0
[ 810.272291]  [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
[ 810.272407]  [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
[ 810.272521]  [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
[ 810.272665]  [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
[ 810.272778]  [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
[ 810.272895]  [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
[ 810.273007]  [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66
[ 810.273130] fio             D ffff88022fa6f730     0  4213   4206 0x00000000
[ 810.273247]  ffff88022fa6f730 ffff8800b549a700 ffff8800af703400 0000000002011200
[ 810.273475]  ffff880236001700 ffff88022fa6f718 ffff88022fa70000 ffff880233b49b88
[ 810.273702]  ffff880233b49b70 ffff8800af703400 ffff88022f843700 ffff88022fa6f748
[ 810.273958] Call Trace:
[ 810.274070]  [<ffffffff81810600>] schedule+0x30/0x80
[ 810.274183]  [<ffffffffa0133727>] wait_barrier+0x117/0x1f0 [raid1]
[ 810.274300]  [<ffffffff81095480>] ? wake_atomic_t_function+0x70/0x70
[ 810.274413]  [<ffffffffa0135d72>] make_request+0xb2/0xd80 [raid1]
[ 810.274537]  [<ffffffff81408f15>] ? __bt_get.isra.7+0xd5/0x1b0
[ 810.274650]  [<ffffffff81094feb>] ? finish_wait+0x5b/0x80
[ 810.274766]  [<ffffffff8140917f>] ? bt_get+0x18f/0x1b0
[ 810.274881]  [<ffffffffa02123fc>] md_make_request+0xec/0x230 [md_mod]
[ 810.274998]  [<ffffffff813f96f9>] ? generic_make_request_checks+0x219/0x500
[ 810.275144]  [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0
[ 810.275257]  [<ffffffff813fc230>] generic_make_request+0xe0/0x1b0
[ 810.275373]  [<ffffffff813fc362>] submit_bio+0x62/0x140
[ 810.275486]  [<ffffffff811d2bbc>] do_blockdev_direct_IO+0x289c/0x33c0
[ 810.275607]  [<ffffffff811cd620>] ? I_BDEV+0x10/0x10
[ 810.275721]  [<ffffffff811d371e>] __blockdev_direct_IO+0x3e/0x40
[ 810.275843]  [<ffffffff811cdfb7>] blkdev_direct_IO+0x47/0x50
[ 810.275956]  [<ffffffff81132e8c>] generic_file_direct_write+0xac/0x170
[ 810.276073]  [<ffffffff8113301d>] __generic_file_write_iter+0xcd/0x1f0
[ 810.276187]  [<ffffffff811ce990>] ? blkdev_close+0x30/0x30
[ 810.276332]  [<ffffffff811cea17>] blkdev_write_iter+0x87/0x110
[ 810.276445]  [<ffffffff811de6d0>] aio_run_iocb+0x250/0x2b0
[ 810.276560]  [<ffffffff8181209d>] ? mutex_lock+0xd/0x30
[ 810.276673]  [<ffffffff811ddd04>] ? aio_read_events+0x284/0x370
[ 810.276786]  [<ffffffff81183c29>] ? kmem_cache_alloc+0xd9/0x180
[ 810.276902]  [<ffffffff811df438>] ? do_io_submit+0x178/0x4a0
[ 810.277015]  [<ffffffff811df4ed>] do_io_submit+0x22d/0x4a0
[ 810.277131]  [<ffffffff811df76b>] SyS_io_submit+0xb/0x10
[ 810.277244]  [<ffffffff81813e17>] entry_SYSCALL_64_fastpath+0x12/0x66

I dumped r1conf in crash:

struct r1conf {
  mddev = 0xffff88022d761800,
  mirrors = 0xffff88023456a180,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
  retry_list = { next = 0xffff8800b5fe3b40, prev = 0xffff8800b50164c0 },
  bio_end_io_list = { next = 0xffff88022fcd45c0, prev = 0xffff8800b53d57c0 },
  pending_bio_list = { head = 0x0, tail = 0x0 },
  pending_count = 0,
  wait_barrier = {
    lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
    task_list = { next = 0xffff8800b51d37e0, prev = 0xffff88022fbbb770 }
  },
  resync_lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
  nr_pending = 406,
  nr_waiting = 100,
  nr_queued = 404,
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 1,
  poolinfo = 0xffff88022d829bb0,
  r1bio_pool = 0xffff88022b4512a0,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0008c97b00,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}

Every time, nr_pending is 1 bigger than (nr_queued + 1), so it seems we
forgot to increase nr_queued somewhere?

I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
nr_queued to prevent freeze_array hang"), which seems to have fixed a
similar bug.

Could you give your suggestion?

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL: https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
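The wait that md1_raid1 is stuck in is the one in freeze_array(); a rough
sketch of the 4.4-era code (paraphrased from drivers/md/raid1.c, not a
verbatim copy) looks like this, with handle_read_error() passing extra = 1
for the r1bio it is currently retrying:

    static void freeze_array(struct r1conf *conf, int extra)
    {
            /* Wait until the number of in-flight requests (nr_pending)
             * matches the number already queued for retry (nr_queued)
             * plus the 'extra' request being handled right now.
             */
            spin_lock_irq(&conf->resync_lock);
            conf->array_frozen = 1;
            wait_event_lock_irq_cmd(conf->wait_barrier,
                                    conf->nr_pending == conf->nr_queued + extra,
                                    conf->resync_lock,
                                    flush_pending_writes(conf));
            spin_unlock_irq(&conf->resync_lock);
    }

With the values dumped above (nr_pending = 406, nr_queued = 404, extra = 1)
the condition 406 == 404 + 1 can never become true, which matches the
"one request unaccounted for" observation in the report.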
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-25 13:59 UTC
To: linux-raid, Shaohua Li; +Cc: Neil Brown, Nate Dailey

On Fri, Nov 25, 2016 at 2:30 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
> Hi,
>
> I'm hitting a hung task in mdx_raid1 when running tests; I can reproduce
> it easily with the test below:
>
> snip (kernel log, stack traces and r1conf dump quoted in full above)
>
> Every time, nr_pending is 1 bigger than (nr_queued + 1), so it seems we
> forgot to increase nr_queued somewhere?
>
> I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
> nr_queued to prevent freeze_array hang"), which seems to have fixed a
> similar bug.
>
> Could you give your suggestion?
>
Sorry, forgot to mention the kernel version is 4.4.28

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28  4:47 UTC
To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On 2016/11/25 9:59 PM, Jinpu Wang wrote:
> On Fri, Nov 25, 2016 at 2:30 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
>> snip (full report quoted above)
>>
>> Every time, nr_pending is 1 bigger than (nr_queued + 1), so it seems we
>> forgot to increase nr_queued somewhere?
>>
>> I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
>> nr_queued to prevent freeze_array hang"), which seems to have fixed a
>> similar bug.
>>
>> Could you give your suggestion?
>>
> Sorry, forgot to mention the kernel version is 4.4.28

This commit is Cced to stable@vger.kernel.org for v4.3+; do you use a
stable kernel or a distribution with a 4.4.28 kernel?

Coly
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-28  8:24 UTC
To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

snip
>>> Sorry, forgot to mention the kernel version is 4.4.28
>>
>> This commit is Cced to stable@vger.kernel.org for v4.3+; do you use a
>> stable kernel or a distribution with a 4.4.28 kernel?
>>
>> Coly
>>
Hi Coly,

I'm using Debian 8 with a 4.4.28 kernel.

Best,
-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28  8:54 UTC
To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On 2016/11/28 4:24 PM, Jinpu Wang wrote:
> snip
>>
>> This commit is Cced to stable@vger.kernel.org for v4.3+; do you use a
>> stable kernel or a distribution with a 4.4.28 kernel?
>
> Hi Coly,
>
> I'm using Debian 8 with a 4.4.28 kernel.

Hi Jinpu,

Is it possible for you to run an upstream or vanilla kernel to test
whether the issue can still be reproduced? Then we would know whether it
is an upstream bug or a distro issue.

Thanks.

Coly
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-28  9:02 UTC
To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@suse.de> wrote:
> snip
>
> Is it possible for you to run an upstream or vanilla kernel to test
> whether the issue can still be reproduced? Then we would know whether it
> is an upstream bug or a distro issue.

Hi Coly,

I did run kernel 4.4.34 (downloaded from kernel.org), and I can reproduce
the same bug.

I can also try the latest 4.8 or 4.9-rc kernel, if you think it's necessary?

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Coly Li @ 2016-11-28  9:10 UTC
To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On 2016/11/28 5:02 PM, Jinpu Wang wrote:
> snip
>
> I did run kernel 4.4.34 (downloaded from kernel.org), and I can reproduce
> the same bug.
>
> I can also try the latest 4.8 or 4.9-rc kernel, if you think it's necessary?

Yes, please. If it can be reproduced on an upstream kernel by a set of
scripts, that will be very helpful for debugging and fixing this issue.

Thanks in advance.

Coly
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-11-29 11:15 UTC
To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On Mon, Nov 28, 2016 at 10:10 AM, Coly Li <colyli@suse.de> wrote:
> snip
>
> Yes, please. If it can be reproduced on an upstream kernel by a set of
> scripts, that will be very helpful for debugging and fixing this issue.

Hi Coly,

I tried with kernel 4.9-rc7; I can't reproduce it with my test case anymore.

It's hard to say whether the bug is fixed or just harder to reproduce,
because the code has changed a lot.

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-07 14:17 UTC
To: Coly Li; +Cc: linux-raid, Shaohua Li, Neil Brown, Nate Dailey

On Tue, Nov 29, 2016 at 12:15 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
> snip
>
> I tried with kernel 4.9-rc7; I can't reproduce it with my test case anymore.
>
> It's hard to say whether the bug is fixed or just harder to reproduce,
> because the code has changed a lot.

Hi,

I continued debugging the bug; here is a new dump from 20161207:

crash> struct r1conf 0xffff8800b9792000
struct r1conf {
  mddev = 0xffff88022db03800,
  mirrors = 0xffff880227729200,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
  retry_list = { next = 0xffff8800afe690c0, prev = 0xffff8800b96acac0 },
  bio_end_io_list = { next = 0xffff8800b96ac2c0, prev = 0xffff88003735f140 },
  pending_bio_list = { head = 0x0, tail = 0x0 },
  pending_count = 0,
  wait_barrier = {
    lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
    task_list = { next = 0xffff880221bc7770, prev = 0xffff8800ad44bc88 }
  },
  resync_lock = { rlock = { raw_lock = { val = { counter = 0 } } } },
  nr_pending = 948,
  nr_waiting = 9,
  nr_queued = 946,   // again we need one more to finish the wait_event
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 2,
  poolinfo = 0xffff88022c567580,
  r1bio_pool = 0xffff88022fdccea0,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0002bf1600,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}
crash> exit

On conf->bio_end_io_list we have 91 entries:

crash> list -H 0xffff88003735f140
ffff8800b9792048
ffff8800b96ac2c0
ffff8800b96ac1c0
snip
ffff88022243dfc0
ffff88022243dc40
crash>

On conf->retry_list we have 855:

crash> list -H 0xffff8800b96acac0
ffff8800b9792038
ffff8800afe690c0
snip

list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio

ffff8800b9791ff8
struct r1bio {
  remaining = { counter = 0 },
  behind_remaining = { counter = 0 },
  sector = 18446612141670676480,            // corrupted?
  start_next_window = 18446612141565972992, // ditto
  sectors = 2,
  state = 18446744073709527039,             // ditto
  mddev = 0xffffffffffffffff,
  master_bio = 0x0,
  read_disk = 0,
  retry_list = { next = 0xffff8800afe690c0, prev = 0xffff8800b96acac0 },
  behind_bvecs = 0xffff8800b96ac2c0,
  behind_page_count = 926282048,
  bios = 0xffff8800b9792058
}
ffff8800afe69080
struct r1bio {
  remaining = { counter = 0 },
  behind_remaining = { counter = 0 },
  sector = 1566,
  start_next_window = 0,
  sectors = 128,
  state = 257,
  mddev = 0xffff88022db03800,
  master_bio = 0xffff8800371c1f00,
  read_disk = 0,
  retry_list = { next = 0xffff8800ad41a540, prev = 0xffff8800b9792038 },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800afe690e0
}

Check conf->bio_end_io_list:

list -H 0xffff88003735f140 r1bio.retry_list -s r1bio

ffff8800b9792008
struct r1bio {
  remaining = { counter = 661819904 },
  behind_remaining = { counter = -30718 },
  sector = 2,
  start_next_window = 18446744073709527039, // corrupted?
  sectors = -1,                             // corrupted?
  state = 0,
  mddev = 0x0,                              // corrupted?
  master_bio = 0xffff8800afe690c0,
  read_disk = -1184183616,                  // ?
  retry_list = { next = 0xffff8800b96ac2c0, prev = 0xffff88003735f140 },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800b9792068
}
ffff8800b96ac280
struct r1bio {
  remaining = { counter = 0 },
  behind_remaining = { counter = 0 },
  sector = 980009,
  start_next_window = 0,
  sectors = 16,
  state = 257,
  mddev = 0xffff88022db03800,
  master_bio = 0xffff8800370b0600,
  read_disk = 0,
  retry_list = { next = 0xffff8800b96ac1c0, prev = 0xffff8800b9792048 },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800b96ac2e0
}

I still have no clue what it could be; does anyone have an idea?

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-08  3:17 UTC
To: Jinpu Wang, Coly Li; +Cc: linux-raid, Shaohua Li, Nate Dailey

On Thu, Dec 08 2016, Jinpu Wang wrote:
> snip
>
> I continued debugging the bug; here is a new dump from 20161207:
>
>   nr_pending = 948,
>   nr_waiting = 9,
>   nr_queued = 946,   // again we need one more to finish the wait_event
>   barrier = 0,
>   array_frozen = 1,
>
> On conf->bio_end_io_list we have 91 entries.
> On conf->retry_list we have 855.

This is useful.  It confirms that nr_queued is correct, and that
nr_pending is consistently 1 higher than expected.
This suggests that a request has been counted in nr_pending, but hasn't
yet been submitted, or has been taken off one of the queues but has not
yet been processed.

I notice that in your first email the blocked tasks listed included
raid1d, which is blocked in freeze_array(), and a few others in
make_request() blocked on wait_barrier().
In that case nr_waiting was 100, so there should have been 100 threads
blocked in wait_barrier().  Is that correct?  I assume you thought it
was pointless to list them all, which seems reasonable.

I asked because I wonder if there might have been one thread in
make_request() which was blocked on something else.  There are a couple
of places where make_request() will wait after having successfully called
wait_barrier().  If that happened, it would cause exactly the symptoms
you report.  Could you check all blocked threads carefully please?

There are other ways that nr_pending and nr_queued can get out of sync,
though I think they would result in nr_pending being less than
nr_queued, not more.

If the presence of a bad block in the bad block log causes a request to
be split into two r1bios, and if both of those end up on one of the
queues, then they would be added to nr_queued twice, but to nr_pending
only once.  We should fix that.

> list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio
>
> ffff8800b9791ff8
> struct r1bio {
>   remaining = { counter = 0 },
>   behind_remaining = { counter = 0 },
>   sector = 18446612141670676480,            // corrupted?
>   start_next_window = 18446612141565972992, // ditto

I don't think this is corruption.

> crash> struct r1conf 0xffff8800b9792000
> struct r1conf {
....
>   retry_list = { next = 0xffff8800afe690c0, prev = 0xffff8800b96acac0 },

The pointer you started at was at the end of the list.
So this r1bio structure you are seeing is not an r1bio at all but the
memory out of the middle of the r1conf, being interpreted as an r1bio.
You can confirm this by noticing that the retry_list in the r1bio:

>   retry_list = { next = 0xffff8800afe690c0, prev = 0xffff8800b96acac0 },

is exactly the same as the retry_list in the r1conf.

NeilBrown
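For context on the accounting Neil describes: in 4.4-era raid1.c,
nr_pending is incremented once per bio in wait_barrier() as the request
enters make_request(), while nr_queued is incremented once per r1bio that
gets parked for the raid1d thread. A rough sketch of the latter
(paraphrased, not a verbatim copy of the source) is:

    static void reschedule_retry(struct r1bio *r1_bio)
    {
            unsigned long flags;
            struct mddev *mddev = r1_bio->mddev;
            struct r1conf *conf = mddev->private;

            spin_lock_irqsave(&conf->device_lock, flags);
            /* park the r1bio for raid1d and account for it */
            list_add(&r1_bio->retry_list, &conf->retry_list);
            conf->nr_queued++;
            spin_unlock_irqrestore(&conf->device_lock, flags);

            /* freeze_array() re-checks its wait condition on this wakeup */
            wake_up(&conf->wait_barrier);
            md_wakeup_thread(mddev->thread);
    }

So if one bio were ever split into two r1bios that both end up queued, the
counters would drift in the direction Neil mentions (nr_queued too high);
the hang observed here is the opposite drift, one nr_pending increment with
no matching queued r1bio.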
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: Jinpu Wang @ 2016-12-08  9:50 UTC
To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey

Thanks Neil for the valuable input; please see my comments inline.

On Thu, Dec 8, 2016 at 4:17 AM, NeilBrown <neilb@suse.com> wrote:
> snip
>
> I notice that in your first email the blocked tasks listed included
> raid1d, which is blocked in freeze_array(), and a few others in
> make_request() blocked on wait_barrier().
> In that case nr_waiting was 100, so there should have been 100 threads
> blocked in wait_barrier().  Is that correct?  I assume you thought it
> was pointless to list them all, which seems reasonable.

This is correct. In my test I initially set numjobs to 100 in fio; later I
reduced that to 10 and could still trigger the bug.

> I asked because I wonder if there might have been one thread in
> make_request() which was blocked on something else.  There are a couple
> of places where make_request() will wait after having successfully called
> wait_barrier().  If that happened, it would cause exactly the symptoms
> you report.  Could you check all blocked threads carefully please?

Every time it's the same hung task; raid1d is blocked in freeze_array:

PID: 5002   TASK: ffff8800ad430d00  CPU: 0   COMMAND: "md1_raid1"
 #0 [ffff8800ad44bbf8] __schedule at ffffffff81811483
 #1 [ffff8800ad44bc50] schedule at ffffffff81811c60
 #2 [ffff8800ad44bc68] freeze_array at ffffffffa085a1e1 [raid1]
 #3 [ffff8800ad44bcc0] handle_read_error at ffffffffa085bfe9 [raid1]
 #4 [ffff8800ad44bd68] raid1d at ffffffffa085d29b [raid1]
 #5 [ffff8800ad44be60] md_thread at ffffffffa0549040 [md_mod]

and 9 fio threads are blocked in wait_barrier:

PID: 5172   TASK: ffff88022bf4a700  CPU: 3   COMMAND: "fio"
 #0 [ffff88022186b6d8] __schedule at ffffffff81811483
 #1 [ffff88022186b730] schedule at ffffffff81811c60
 #2 [ffff88022186b748] wait_barrier at ffffffffa085a0b6 [raid1]
 #3 [ffff88022186b7b0] make_request at ffffffffa085c3cd [raid1]
 #4 [ffff88022186b890] md_make_request at ffffffffa0549903 [md_mod]

The other fio thread is sleeping in futex wait; I think it's unrelated.

PID: 5171   TASK: ffff88022bf49a00  CPU: 2   COMMAND: "fio"
 #0 [ffff880221f97c08] __schedule at ffffffff81811483
 #1 [ffff880221f97c60] schedule at ffffffff81811c60
 #2 [ffff880221f97c78] futex_wait_queue_me at ffffffff810c8a42
 #3 [ffff880221f97cb0] futex_wait at ffffffff810c9c72

> There are other ways that nr_pending and nr_queued can get out of sync,
> though I think they would result in nr_pending being less than
> nr_queued, not more.
>
> If the presence of a bad block in the bad block log causes a request to
> be split into two r1bios, and if both of those end up on one of the
> queues, then they would be added to nr_queued twice, but to nr_pending
> only once.  We should fix that.

I checked; there are some bad blocks md has detected, not sure if it helps:

root@ibnbd-clt1:~/jack# cat /sys/block/md1/md/dev-ibnbd0/bad_blocks
180630 512
181142 7
982011 8
1013386 255

I also checked each md_rdev:

crash> struct md_rdev 0xffff880230630000
struct md_rdev {
  same_set = { next = 0xffff88022db03818, prev = 0xffff880230653e00 },
  sectors = 2095104,
  mddev = 0xffff88022db03800,
  last_events = 125803676,
  meta_bdev = 0x0,
  bdev = 0xffff880234a56080,
  sb_page = 0xffffea0008d3edc0,
  bb_page = 0xffffea0008b17400,
  sb_loaded = 1,
  sb_events = 72,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = { name = 0xffff88022c567510 "dev-loop1", entry = { next = 0xffff880230630080, prev = 0xffff880230630080 }, parent = 0xffff88022db03850, kset = 0x0, ktype = 0xffffffffa055f200 <rdev_ktype>, sd = 0xffff88022c4ac708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 },
  flags = 2,
  blocked_wait = { lock = { rlock = { raw_lock = { val = { counter = 0 } } } }, task_list = { next = 0xffff8802306300c8, prev = 0xffff8802306300c8 } },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 },
  nr_pending = { counter = 1 },
  read_errors = { counter = 0 },
  last_read_error = { tv_sec = 0, tv_nsec = 0 },
  corrected_errors = { counter = 0 },
  del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 },
  sysfs_state = 0xffff88022c4ac780,
  badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff88022740d000, changed = 0, lock = { seqcount = { sequence = 60 }, lock = { rlock = { raw_lock = { val = { counter = 0 } } } } }, sector = 0, size = 0 }
}

crash> struct md_rdev 0xffff880230653e00
struct md_rdev {
  same_set = { next = 0xffff880230630000, prev = 0xffff88022db03818 },
  sectors = 2095104,
  mddev = 0xffff88022db03800,
  last_events = 9186098,
  meta_bdev = 0x0,
  bdev = 0xffff880234a56700,
  sb_page = 0xffffea0007758f40,
  bb_page = 0xffffea000887b480,
  sb_loaded = 1,
  sb_events = 42,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = { name = 0xffff880233c825b0 "dev-ibnbd0", entry = { next = 0xffff880230653e80, prev = 0xffff880230653e80 }, parent = 0xffff88022db03850, kset = 0x0, ktype = 0xffffffffa055f200 <rdev_ktype>, sd = 0xffff8800b9fcce88, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 },
  flags = 581,
  blocked_wait = { lock = { rlock = { raw_lock = { val = { counter = 0 } } } }, task_list = { next = 0xffff880230653ec8, prev = 0xffff880230653ec8 } },
  desc_nr = 0,
  raid_disk = 0,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 },
  nr_pending = { counter = 856 },
  read_errors = { counter = 0 },
  last_read_error = { tv_sec = 0, tv_nsec = 0 },
  corrected_errors = { counter = 0 },
  del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 },
  sysfs_state = 0xffff8800b9fccf78,
  badblocks = { count = 4, unacked_exist = 0, shift = 0, page = 0xffff880227211000, changed = 0, lock = { seqcount = { sequence = 1624 }, lock = { rlock = { raw_lock = { val = { counter = 0 } } } } }, sector = 80, size = 8 }
}

> The pointer you started at was at the end of the list.
> So this r1bio structure you are seeing is not an r1bio at all but the
> memory out of the middle of the r1conf, being interpreted as an r1bio.
> You can confirm this by noticing that the retry_list in the r1bio:
>
> is exactly the same as the retry_list in the r1conf.

Oh, thanks for the explanation; I had noticed the coincidence and was
curious why. I still have the hung machine alive, so I can check whatever
else you think is necessary.

Thanks again for debugging this with me.

-- 
Jinpu Wang
Linux Kernel Developer
ProfitBricks GmbH
* Re: [BUG] MD/RAID1 hung forever on freeze_array
From: NeilBrown @ 2016-12-09  6:01 UTC
To: Jinpu Wang; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey

On Thu, Dec 08 2016, Jinpu Wang wrote:

This number:

>   nr_pending = { counter = 1 },

and this number:

>   nr_pending = { counter = 856 },

might be interesting.

There are 855 requests on the list.  Adding the one that is currently
being retried gives 856, which is nr_pending for the device that failed.
But nr_pending on the device that didn't fail is 1.  I would expect
zero.
When a read or write request succeeds, rdev_dec_pending() is called
immediately, so this should quickly go to zero.

It seems as though there must be a request to the loop device that is
stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly
inside read_balance) and the call to generic_make_request().
I cannot yet see how that would happen.

Can you check whether this is a repeatable observation?  Is
nr_pending.counter always '1' on the loop device?

Thanks,
NeilBrown
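The pairing Neil refers to, roughly as it looks in 4.4-era md code (a
sketch from memory, not a verbatim copy): read_balance() takes a reference
on the chosen device before the bio is resubmitted, and the per-device
count only drops again when the I/O completes:

    /* in read_balance(), once a best_disk has been chosen */
    atomic_inc(&rdev->nr_pending);

    /* in md.h; called from the raid1 completion paths */
    static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
    {
            int faulty = test_bit(Faulty, &rdev->flags);

            if (atomic_dec_and_test(&rdev->nr_pending) && faulty) {
                    set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                    md_wakeup_thread(mddev->thread);
            }
    }

A bio stuck between that atomic_inc() and the later generic_make_request()
would hold rdev->nr_pending at 1 on the healthy loop device, and would
also be the one request counted in conf->nr_pending without a matching
entry on any retry queue.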
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-09 6:01 ` NeilBrown @ 2016-12-09 15:28 ` Jinpu Wang 2016-12-09 15:36 ` Jinpu Wang 1 sibling, 0 replies; 31+ messages in thread From: Jinpu Wang @ 2016-12-09 15:28 UTC (permalink / raw) To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey On Fri, Dec 9, 2016 at 7:01 AM, NeilBrown <neilb@suse.com> wrote: > On Thu, Dec 08 2016, Jinpu Wang wrote: > > > This number: > >> nr_pending = { >> counter = 1 >> }, > > > > and this number: > >> nr_pending = { >> counter = 856 >> }, > > might be interesting. > > There are 855 requested on the list. Add the one that is currently > being retried give 856, which is nr_pending for the device that failed. > But nr_pending on the device that didn't fail is 1. I would expect > zero. > When a read or write requests succeeds, rdev_dec_pending() is called > immediately so this should quickly go to zero. > > It seems as though there must be a request to the loop device that is > stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly > inside read_balance) and the call to generic_make_request(). > I cannot yet see how that would happen. > > Can you check if the is a repeatable observation? Is nr_pending.counter > always '1' on the loop device? > > Thanks, > NeilBrown Hi Neil, Yes, it's repreatable observation. I triggered again, this time nr_pending = 1203, nr_waiting = 8, nr_queued = 1201, conf->retry_list has 1175 entries. on conf->bio_end_io_list has 26 entries. Totol is 1201, match nr_queued. in md_rdev healthy one loop1 has 1 nr_pending. faulty one ibnbd1 has 1076. crash> struct md_rdev 0xffff880228880400 struct md_rdev { same_set = { next = 0xffff88023202a200, prev = 0xffff8800b64c6018 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 17764573, meta_bdev = 0x0, bdev = 0xffff8800b60ce080, sb_page = 0xffffea0002bd3040, bb_page = 0xffffea0002dc76c0, sb_loaded = 1, sb_events = 166, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff880037962af0 "dev-ibnbd0", entry = { next = 0xffff880228880480, prev = 0xffff880228880480 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bfc12d0, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 581, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8802288804c8, prev = 0xffff8802288804c8 } }, desc_nr = 0, raid_disk = 0, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 1176 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff88022bfc1348, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff8802289aa000, changed = 0, lock = { seqcount = { sequence = 0 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } crash> struct md_rdev 0xffff88023202a200 struct md_rdev { same_set = { next = 0xffff8800b64c6018, prev = 0xffff880228880400 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 37178561, meta_bdev = 0x0, bdev = 0xffff8800b60d09c0, sb_page = 0xffffea0008af7580, bb_page = 0xffffea0002e69380, sb_loaded = 1, sb_events 
= 167, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88023521ec30 "dev-loop1", entry = { next = 0xffff88023202a280, prev = 0xffff88023202a280 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bc0a708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023202a2c8, prev = 0xffff88023202a2c8 } }, crash> struct md_rdev 0xffff88023202a200 struct md_rdev { same_set = { next = 0xffff8800b64c6018, prev = 0xffff880228880400 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 37178561, meta_bdev = 0x0, bdev = 0xffff8800b60d09c0, sb_page = 0xffffea0008af7580, bb_page = 0xffffea0002e69380, sb_loaded = 1, sb_events = 167, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88023521ec30 "dev-loop1", entry = { next = 0xffff88023202a280, prev = 0xffff88023202a280 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bc0a708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023202a2c8, prev = 0xffff88023202a2c8 } }, desc_nr = 1, raid_disk = 1, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 1 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff88022bc0a780, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff88022bff0000, changed = 0, lock = { seqcount = { sequence = 164 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } Thanks -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-09 6:01 ` NeilBrown 2016-12-09 15:28 ` Jinpu Wang @ 2016-12-09 15:36 ` Jinpu Wang 1 sibling, 0 replies; 31+ messages in thread From: Jinpu Wang @ 2016-12-09 15:36 UTC (permalink / raw) To: NeilBrown; +Cc: Coly Li, linux-raid, Shaohua Li, Nate Dailey On Fri, Dec 9, 2016 at 7:01 AM, NeilBrown <neilb@suse.com> wrote: > On Thu, Dec 08 2016, Jinpu Wang wrote: > > > This number: > >> nr_pending = { >> counter = 1 >> }, > > > > and this number: > >> nr_pending = { >> counter = 856 >> }, > > might be interesting. > > There are 855 requested on the list. Add the one that is currently > being retried give 856, which is nr_pending for the device that failed. > But nr_pending on the device that didn't fail is 1. I would expect > zero. > When a read or write requests succeeds, rdev_dec_pending() is called > immediately so this should quickly go to zero. > > It seems as though there must be a request to the loop device that is > stuck somewhere between the atomic_inc(&rdev->nr_pending) (possibly > inside read_balance) and the call to generic_make_request(). > I cannot yet see how that would happen. > > Can you check if the is a repeatable observation? Is nr_pending.counter > always '1' on the loop device? > > Thanks, > NeilBrown Hi Neil, Yes, it's repreatable observation. I triggered again, this time nr_pending = 1203, nr_waiting = 8, nr_queued = 1201, conf->retry_list has 1175 entries. on conf->bio_end_io_list has 26 entries. Totol is 1201, match nr_queued. in md_rdev healthy one loop1 has 1 nr_pending. faulty one ibnbd1 has 1076. crash> struct md_rdev 0xffff880228880400 struct md_rdev { same_set = { next = 0xffff88023202a200, prev = 0xffff8800b64c6018 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 17764573, meta_bdev = 0x0, bdev = 0xffff8800b60ce080, sb_page = 0xffffea0002bd3040, bb_page = 0xffffea0002dc76c0, sb_loaded = 1, sb_events = 166, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff880037962af0 "dev-ibnbd0", entry = { next = 0xffff880228880480, prev = 0xffff880228880480 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bfc12d0, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 581, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8802288804c8, prev = 0xffff8802288804c8 } }, desc_nr = 0, raid_disk = 0, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 1176 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff88022bfc1348, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff8802289aa000, changed = 0, lock = { seqcount = { sequence = 0 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } crash> struct md_rdev 0xffff88023202a200 struct md_rdev { same_set = { next = 0xffff8800b64c6018, prev = 0xffff880228880400 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 37178561, meta_bdev = 0x0, bdev = 0xffff8800b60d09c0, sb_page = 0xffffea0008af7580, bb_page = 0xffffea0002e69380, sb_loaded = 1, sb_events 
= 167, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88023521ec30 "dev-loop1", entry = { next = 0xffff88023202a280, prev = 0xffff88023202a280 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bc0a708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023202a2c8, prev = 0xffff88023202a2c8 } }, crash> struct md_rdev 0xffff88023202a200 struct md_rdev { same_set = { next = 0xffff8800b64c6018, prev = 0xffff880228880400 }, sectors = 2095104, mddev = 0xffff8800b64c6000, last_events = 37178561, meta_bdev = 0x0, bdev = 0xffff8800b60d09c0, sb_page = 0xffffea0008af7580, bb_page = 0xffffea0002e69380, sb_loaded = 1, sb_events = 167, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88023521ec30 "dev-loop1", entry = { next = 0xffff88023202a280, prev = 0xffff88023202a280 }, parent = 0xffff8800b64c6050, kset = 0x0, ktype = 0xffffffffa0501300 <rdev_ktype>, sd = 0xffff88022bc0a708, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023202a2c8, prev = 0xffff88023202a2c8 } }, desc_nr = 1, raid_disk = 1, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 1 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff88022bc0a780, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff88022bff0000, changed = 0, lock = { seqcount = { sequence = 164 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } Thanks -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-11-25 13:30 [BUG] MD/RAID1 hung forever on freeze_array Jinpu Wang 2016-11-25 13:59 ` Jinpu Wang @ 2016-12-12 0:59 ` NeilBrown 2016-12-12 13:10 ` Jinpu Wang 1 sibling, 1 reply; 31+ messages in thread From: NeilBrown @ 2016-12-12 0:59 UTC (permalink / raw) To: Jinpu Wang, linux-raid, Shaohua Li; +Cc: Nate Dailey [-- Attachment #1: Type: text/plain, Size: 184 bytes --] On Sat, Nov 26 2016, Jinpu Wang wrote: > [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0 What is this? I cannot find that function in the upstream kernel. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-12 0:59 ` NeilBrown @ 2016-12-12 13:10 ` Jinpu Wang 2016-12-12 21:53 ` NeilBrown 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-12 13:10 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote: > On Sat, Nov 26 2016, Jinpu Wang wrote: >> [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0 > > What is this? I cannot find that function in the upstream kernel. > > NeilBrown Hi Neil, blk_prologue_bio is our internal extension to gather some stats, sorry not informed before. It does below, code snip only: +static void bio_endio_complete_diskstat(struct bio *clone) +{ + struct bio *bio = clone->bi_private; + struct block_device *bdev = bio->bi_bdev; + struct hd_struct *hdp = bdev->bd_part; + unsigned long start = clone->bi_start_time; + const int rw = !!bio_data_dir(bio); + int cpu, err = clone->bi_error; + + bio_put(clone); + blk_end_io_acct() + bio->bi_error = err; + bio_endio(bio); +} + +blk_qc_t blk_prologue_bio(struct request_queue *q, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct hd_struct *hdp = bdev->bd_part; + struct bio *clone; + + if (!hdp->additional_diskstat) + return q->custom_make_request_fn(q, bio); + clone = bio_clone_fast(bio, GFP_NOWAIT | __GFP_NOWARN, NULL); + if (unlikely(!clone)) + return q->custom_make_request_fn(q, bio); + clone->bi_start_time = jiffies; + if (hdp->bio_mode_diskstat) { + int rw = !!bio_data_dir(bio); + + generic_start_io_acct(rw, bio_sectors(clone), hdp); + } + clone->bi_end_io = bio_endio_complete_diskstat; + clone->bi_private = bio; + return q->custom_make_request_fn(q, clone); +} IMHO, it seems unrelated, but I will rerun my test without this change. -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-12 13:10 ` Jinpu Wang @ 2016-12-12 21:53 ` NeilBrown 2016-12-13 15:08 ` Jinpu Wang 0 siblings, 1 reply; 31+ messages in thread From: NeilBrown @ 2016-12-12 21:53 UTC (permalink / raw) To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey [-- Attachment #1: Type: text/plain, Size: 835 bytes --] On Tue, Dec 13 2016, Jinpu Wang wrote: > On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote: >> On Sat, Nov 26 2016, Jinpu Wang wrote: >>> [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0 >> >> What is this? I cannot find that function in the upstream kernel. >> >> NeilBrown > > Hi Neil, > > blk_prologue_bio is our internal extension to gather some stats, sorry > not informed before. Ahhh. .... > + return q->custom_make_request_fn(q, clone); I haven't heard of custom_make_request_fn before either. > +} > > IMHO, it seems unrelated, but I will rerun my test without this change. Yes, please re-test with an unmodified upstream kernel (and always report *exactly* what kernel you are running. I cannot analyse code that I cannot see). NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-12 21:53 ` NeilBrown @ 2016-12-13 15:08 ` Jinpu Wang 2016-12-13 22:18 ` NeilBrown 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-13 15:08 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey Hi Neil, On Mon, Dec 12, 2016 at 10:53 PM, NeilBrown <neilb@suse.com> wrote: > On Tue, Dec 13 2016, Jinpu Wang wrote: > >> On Mon, Dec 12, 2016 at 1:59 AM, NeilBrown <neilb@suse.com> wrote: >>> On Sat, Nov 26 2016, Jinpu Wang wrote: >>>> [ 810.270860] [<ffffffff813fc851>] blk_prologue_bio+0x91/0xc0 >>> >>> What is this? I cannot find that function in the upstream kernel. >>> >>> NeilBrown >> >> Hi Neil, >> >> blk_prologue_bio is our internal extension to gather some stats, sorry >> not informed before. > > Ahhh. > > .... >> + return q->custom_make_request_fn(q, clone); > > I haven't heard of custom_make_request_fn before either. > >> +} >> >> IMHO, it seems unrelated, but I will rerun my test without this change. > > Yes, please re-test with an unmodified upstream kernel (and always > report *exactly* what kernel you are running. I cannot analyse code > that I cannot see). > > NeilBrown As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. 4.4.36 kernel: crash> bt 4069 PID: 4069 TASK: ffff88022b4f8d00 CPU: 3 COMMAND: "md2_raid1" #0 [ffff8800b77d3bf8] __schedule at ffffffff81811453 #1 [ffff8800b77d3c50] schedule at ffffffff81811c30 #2 [ffff8800b77d3c68] freeze_array at ffffffffa07ee17e [raid1] #3 [ffff8800b77d3cc0] handle_read_error at ffffffffa07f093b [raid1] #4 [ffff8800b77d3d68] raid1d at ffffffffa07f10a6 [raid1] #5 [ffff8800b77d3e60] md_thread at ffffffffa04dee80 [md_mod] #6 [ffff8800b77d3ed0] kthread at ffffffff81075fb6 #7 [ffff8800b77d3f50] ret_from_fork at ffffffff818157df crash> bt 2558 bt: invalid task or pid value: 2558 crash> bt 4558 PID: 4558 TASK: ffff88022b550d00 CPU: 3 COMMAND: "fio" #0 [ffff88022c287710] __schedule at ffffffff81811453 #1 [ffff88022c287768] schedule at ffffffff81811c30 #2 [ffff88022c287780] wait_barrier at ffffffffa07ee044 [raid1] #3 [ffff88022c2877e8] make_request at ffffffffa07efc65 [raid1] #4 [ffff88022c2878d0] md_make_request at ffffffffa04df609 [md_mod] #5 [ffff88022c287928] generic_make_request at ffffffff813fd3de #6 [ffff88022c287970] submit_bio at ffffffff813fd522 #7 [ffff88022c2879b8] do_blockdev_direct_IO at ffffffff811d32a7 #8 [ffff88022c287be8] __blockdev_direct_IO at ffffffff811d3b6e #9 [ffff88022c287c10] blkdev_direct_IO at ffffffff811ce2d7 #10 [ffff88022c287c38] generic_file_direct_write at ffffffff81132c90 #11 [ffff88022c287cb0] __generic_file_write_iter at ffffffff81132e1d #12 [ffff88022c287d08] blkdev_write_iter at ffffffff811ce597 #13 [ffff88022c287d68] aio_run_iocb at ffffffff811deca6 #14 [ffff88022c287e68] do_io_submit at ffffffff811dfbaa #15 [ffff88022c287f40] sys_io_submit at ffffffff811dfe4b #16 [ffff88022c287f50] entry_SYSCALL_64_fastpath at ffffffff81815497 RIP: 00007f63b1362737 RSP: 00007ffff7eb17f8 RFLAGS: 00000206 RAX: ffffffffffffffda RBX: 00007f63a142a000 RCX: 00007f63b1362737 RDX: 0000000001179b58 RSI: 0000000000000001 RDI: 00007f63b1f4a000 RBP: 0000000000000512 R8: 0000000000000001 R9: 0000000001171fa0 R10: 00007f639ef84000 R11: 0000000000000206 R12: 0000000000000001 R13: 0000000000000200 R14: 000000003a2d3000 R15: 0000000000000001 ORIG_RAX: 00000000000000d1 CS: 0033 SS: 002b crash> struct r1conf 0xffff880037362100 struct r1conf { 
mddev = 0xffff880037352800, mirrors = 0xffff88022c209c00, raid_disks = 2, next_resync = 18446744073709527039, start_next_window = 18446744073709551615, current_window_requests = 0, next_window_requests = 0, device_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, retry_list = { next = 0xffff8801ce757740, prev = 0xffff8801b1b79140 }, bio_end_io_list = { next = 0xffff8801ce7d9ac0, prev = 0xffff88022838f4c0 }, pending_bio_list = { head = 0x0, tail = 0x0 }, pending_count = 0, wait_barrier = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8801f6d87818, prev = 0xffff88022c2877a8 } }, resync_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, nr_pending = 2086, nr_waiting = 97, nr_queued = 2084, barrier = 0, array_frozen = 1, fullsync = 0, recovery_disabled = 1, poolinfo = 0xffff8802330be390, r1bio_pool = 0xffff88022bdf54e0, r1buf_pool = 0x0, tmppage = 0xffffea0000dcee40, thread = 0x0, cluster_sync_low = 0, cluster_sync_high = 0 } crash> crash> struct raid1_info 0xffff88022c209c00 struct raid1_info { rdev = 0xffff880231635800, head_position = 1318965, next_seq_sect = 252597, seq_start = 252342 } crash> struct raid1_info 0xffff88022c209c20 struct raid1_info { rdev = 0xffff88023166ce00, head_position = 1585216, next_seq_sect = 839992, seq_start = 839977 } crash> struct md_rdev 0xffff880231635800 struct md_rdev { same_set = { next = 0xffff880037352818, prev = 0xffff88023166ce00 }, sectors = 2095104, mddev = 0xffff880037352800, last_events = 41325652, meta_bdev = 0x0, bdev = 0xffff880235c2aa40, sb_page = 0xffffea0002dd98c0, bb_page = 0xffffea0002e48f80, sb_loaded = 1, sb_events = 205, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff8802341cdef0 "dev-loop1", entry = { next = 0xffff880231635880, prev = 0xffff880231635880 }, parent = 0xffff880037352850, kset = 0x0, ktype = 0xffffffffa04f3020 <rdev_ktype>, sd = 0xffff880233e3b8e8, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8802316358c8, prev = 0xffff8802316358c8 } }, desc_nr = 0, raid_disk = 0, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 0, journal_tail = 0 }, nr_pending = { counter = 1 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff880233e3b960, badblocks = { count = 0, unacked_exist = 0, shift = 0, page = 0xffff88022c0d6000, changed = 0, lock = { seqcount = { sequence = 264 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 0, size = 0 } } struct md_rdev { same_set = { next = 0xffff880231635800, prev = 0xffff880037352818 }, sectors = 2095104, mddev = 0xffff880037352800, last_events = 10875407, meta_bdev = 0x0, bdev = 0xffff880234a86a40, sb_page = 0xffffea00089e0ac0, bb_page = 0xffffea0007db4980, sb_loaded = 1, sb_events = 204, data_offset = 2048, new_data_offset = 2048, sb_start = 8, sb_size = 512, preferred_minor = 65535, kobj = { name = 0xffff88022c100e30 "dev-ibnbd0", entry = { next = 0xffff88023166ce80, prev = 0xffff88023166ce80 }, parent = 0xffff880037352850, kset = 0x0, ktype = 0xffffffffa04f3020 <rdev_ktype>, sd = 0xffff8800b6539e10, kref = { 
refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 581, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff88023166cec8, prev = 0xffff88023166cec8 } }, desc_nr = 1, raid_disk = 1, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 }, nr_pending = { counter = 2073 }, read_errors = { counter = 0 }, last_read_error = { tv_sec = 0, tv_nsec = 0 }, corrected_errors = { counter = 0 }, del_work = { data = { counter = 0 }, entry = { next = 0x0, prev = 0x0 }, func = 0x0 }, sysfs_state = 0xffff8800b6539e88, badblocks = { count = 1, unacked_exist = 0, shift = 0, page = 0xffff880099ced000, changed = 0, lock = { seqcount = { sequence = 4 }, lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } } }, sector = 80, size = 8 } } -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-13 15:08 ` Jinpu Wang @ 2016-12-13 22:18 ` NeilBrown 2016-12-14 10:22 ` Jinpu Wang 0 siblings, 1 reply; 31+ messages in thread From: NeilBrown @ 2016-12-13 22:18 UTC (permalink / raw) To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey [-- Attachment #1: Type: text/plain, Size: 1703 bytes --] On Wed, Dec 14 2016, Jinpu Wang wrote: > > As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. > I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. > Thanks. I have an hypothesis. md_make_request() calls blk_queue_split(). If that does split the request it will call generic_make_request() on the first half. That will call back into md_make_request() and raid1_make_request() which will submit requests to the underlying devices. These will get caught on the bio_list_on_stack queue in generic_make_request(). This is a queue which is not accounted in nr_queued. When blk_queue_split() completes, 'bio' will be the second half of the bio. This enters raid1_make_request() and by this time the array has been frozen. So wait_barrier() has to wait for pending requests to complete, and that includes the one that is stuck in bio_list_on_stack, which will never complete now. To see if this might be happening, please change the blk_queue_split(q, &bio, q->bio_split); call in md_make_request() to struct bio *tmp = bio; blk_queue_split(q, &bio, q->bio_split); WARN_ON_ONCE(bio != tmp); If that ever triggers, then the above is a real possibility. Fixing the problem isn't very easy... You could try: 1/ write a function in raid1.c which calls punt_bios_to_rescuer() (which you will need to export from block/bio.c), passing mddev->queue->bio_split as the bio_set. 2/ change the wait_event_lock_irq() call in wait_barrier() to wait_event_lock_irq_cmd(), and pass the new function as the command. That way, if wait_barrier() ever blocks, all the requests in bio_list_on_stack will be handled by a separate thread. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
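To make the hypothesised deadlock concrete: freeze_array() in the 4.4-era raid1.c waits roughly as sketched below (a paraphrase, not a verbatim quote). A bio that has already bumped nr_pending but is parked on bio_list_on_stack never reaches the lower device, so it neither completes nor lands on retry_list/bio_end_io_list (and hence is not counted in nr_queued), and the wait condition can never become true:

/* Sketch of the wait in freeze_array(), assuming the 4.4-era layout. */
static void freeze_array(struct r1conf *conf, int extra)
{
	spin_lock_irq(&conf->resync_lock);
	conf->array_frozen = 1;
	/* Wait until every pending request is either completed or queued for
	 * raid1d; 'extra' allows for the request(s) the caller itself holds,
	 * e.g. 1 when called from handle_read_error(). */
	wait_event_lock_irq_cmd(conf->wait_barrier,
				conf->nr_pending == conf->nr_queued + extra,
				conf->resync_lock,
				flush_pending_writes(conf));
	spin_unlock_irq(&conf->resync_lock);
}

This matches the dumps earlier in the thread, where conf->nr_pending was consistently one more than conf->nr_queued plus the request held by handle_read_error().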
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-13 22:18 ` NeilBrown @ 2016-12-14 10:22 ` Jinpu Wang 2016-12-14 12:13 ` Jinpu Wang 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-14 10:22 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey Thanks Neil, On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote: > On Wed, Dec 14 2016, Jinpu Wang wrote: > >> >> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. >> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. >> > > Thanks. > > I have an hypothesis. > > md_make_request() calls blk_queue_split(). > If that does split the request it will call generic_make_request() > on the first half. That will call back into md_make_request() and > raid1_make_request() which will submit requests to the underlying > devices. These will get caught on the bio_list_on_stack queue in > generic_make_request(). > This is a queue which is not accounted in nr_queued. > > When blk_queue_split() completes, 'bio' will be the second half of the > bio. > This enters raid1_make_request() and by this time the array have been > frozen. > So wait_barrier() has to wait for pending requests to complete, and that > includes the one that it stuck in bio_list_on_stack, which will never > complete now. > > To see if this might be happening, please change the > > blk_queue_split(q, &bio, q->bio_split); > > call in md_make_request() to > > struct bio *tmp = bio; > blk_queue_split(q, &bio, q->bio_split); > WARN_ON_ONCE(bio != tmp); > > If that ever triggers, then the above is a real possibility. I triggered the warning as you expected, we can confirm the bug was caused by your above hypothesis. [ 429.282235] ------------[ cut here ]------------ [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 md_set_array_sectors+0xac0/0xc30 [md_mod]() [ 429.285288] Modules linked in: raid1 ibnbd_client(O) ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm _infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci libahci libata scsi_mod crc32c_intel r8169 psmo use xhci_pci xhci_hcd [last unloaded: mlx_compat] [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pse rver #1 [ 429.288825] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014 [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486 0000000000000000 [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 ffff8800b8f3e000 [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700 ffff880234f9a700 [ 429.290594] Call Trace: [ 429.290743] [<ffffffff81424486>] dump_stack+0x4d/0x67 [ 429.290893] [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0 [ 429.291046] [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20 [ 429.291202] [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod] [ 429.291358] [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0 [ 429.291540] [<ffffffff813fd522>] submit_bio+0x62/0x150 [ 429.291693] [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60 [ 429.291847] [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0 [ 429.292011] [<ffffffffa0834f64>] ? 
ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client] [ 429.292271] [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10 [ 429.292417] [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40 [ 429.292566] [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50 [ 429.292746] [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580 [ 429.292894] [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110 [ 429.293073] [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40 [ 429.293284] [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0 [ 429.293492] [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200 [ 429.293703] [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180 [ 429.293901] [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0 [ 429.294047] [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0 [ 429.294194] [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10 [ 429.294375] [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a [ 429.294610] ---[ end trace 25d1cece0e01494b ]--- I double checked the nr_pending on heathy leg is still 1 as before. > > Fixing the problem isn't very easy... > > You could try: > 1/ write a function in raid1.c which calls punt_bios_to_rescuer() > (which you will need to export from block/bio.c), > passing mddev->queue->bio_split as the bio_set. > > 1/ change the wait_event_lock_irq() call in wait_barrier() to > wait_event_lock_irq_cmd(), and pass the new function as the command. > > That way, if wait_barrier() ever blocks, all the requests in > bio_list_on_stack will be handled by a separate thread. > > NeilBrown I will try your sugested way to see if it fix the bug, will report back soon. -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-14 10:22 ` Jinpu Wang @ 2016-12-14 12:13 ` Jinpu Wang 2016-12-14 14:49 ` Jinpu Wang 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-14 12:13 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey [-- Attachment #1: Type: text/plain, Size: 8255 bytes --] On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote: > Thanks Neil, > > On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote: >> On Wed, Dec 14 2016, Jinpu Wang wrote: >> >>> >>> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. >>> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. >>> >> >> Thanks. >> >> I have an hypothesis. >> >> md_make_request() calls blk_queue_split(). >> If that does split the request it will call generic_make_request() >> on the first half. That will call back into md_make_request() and >> raid1_make_request() which will submit requests to the underlying >> devices. These will get caught on the bio_list_on_stack queue in >> generic_make_request(). >> This is a queue which is not accounted in nr_queued. >> >> When blk_queue_split() completes, 'bio' will be the second half of the >> bio. >> This enters raid1_make_request() and by this time the array have been >> frozen. >> So wait_barrier() has to wait for pending requests to complete, and that >> includes the one that it stuck in bio_list_on_stack, which will never >> complete now. >> >> To see if this might be happening, please change the >> >> blk_queue_split(q, &bio, q->bio_split); >> >> call in md_make_request() to >> >> struct bio *tmp = bio; >> blk_queue_split(q, &bio, q->bio_split); >> WARN_ON_ONCE(bio != tmp); >> >> If that ever triggers, then the above is a real possibility. > > I triggered the warning as you expected, we can confirm the bug was > caused by your above hypothesis. > [ 429.282235] ------------[ cut here ]------------ > [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 > md_set_array_sectors+0xac0/0xc30 [md_mod]() > [ 429.285288] Modules linked in: raid1 ibnbd_client(O) > ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm > iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c > ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core > ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd > edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm > _infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor > button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci > libahci libata scsi_mod crc32c_intel r8169 psmo > use xhci_pci xhci_hcd [last unloaded: mlx_compat] > [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pse > rver #1 > [ 429.288825] Hardware name: To be filled by O.E.M. 
To be filled by > O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014 > [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486 > 0000000000000000 > [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 > ffff8800b8f3e000 > [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700 > ffff880234f9a700 > [ 429.290594] Call Trace: > [ 429.290743] [<ffffffff81424486>] dump_stack+0x4d/0x67 > [ 429.290893] [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0 > [ 429.291046] [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20 > [ 429.291202] [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod] > [ 429.291358] [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0 > [ 429.291540] [<ffffffff813fd522>] submit_bio+0x62/0x150 > [ 429.291693] [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60 > [ 429.291847] [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0 > [ 429.292011] [<ffffffffa0834f64>] ? > ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client] > [ 429.292271] [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10 > [ 429.292417] [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40 > [ 429.292566] [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50 > [ 429.292746] [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580 > [ 429.292894] [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110 > [ 429.293073] [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40 > [ 429.293284] [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0 > [ 429.293492] [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200 > [ 429.293703] [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180 > [ 429.293901] [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0 > [ 429.294047] [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0 > [ 429.294194] [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10 > [ 429.294375] [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a > [ 429.294610] ---[ end trace 25d1cece0e01494b ]--- > > I double checked the nr_pending on heathy leg is still 1 as before. > >> >> Fixing the problem isn't very easy... >> >> You could try: >> 1/ write a function in raid1.c which calls punt_bios_to_rescuer() >> (which you will need to export from block/bio.c), >> passing mddev->queue->bio_split as the bio_set. >> >> 1/ change the wait_event_lock_irq() call in wait_barrier() to >> wait_event_lock_irq_cmd(), and pass the new function as the command. >> >> That way, if wait_barrier() ever blocks, all the requests in >> bio_list_on_stack will be handled by a separate thread. >> >> NeilBrown > > I will try your sugested way to see if it fix the bug, will report back soon. > Hi Neil, Sorry, bad news, with the 2 patch attached, I can still reproduce the same bug. nr_pending on healthy leg is still 1, as before. 
crash> struct r1conf 0xffff8800b7176100 struct r1conf { mddev = 0xffff8800b59b0000, mirrors = 0xffff88022bab7900, raid_disks = 2, next_resync = 18446744073709527039, start_next_window = 18446744073709551615, current_window_requests = 0, next_window_requests = 0, device_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, retry_list = { next = 0xffff880211b2ec40, prev = 0xffff88022819ad40 }, bio_end_io_list = { next = 0xffff880227e9a9c0, prev = 0xffff8802119c6140 }, pending_bio_list = { head = 0x0, tail = 0x0 }, pending_count = 0, wait_barrier = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8800adf3b818, prev = 0xffff88021180f7a8 } }, resync_lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, nr_pending = 1675, nr_waiting = 100, nr_queued = 1673, barrier = 0, array_frozen = 1, fullsync = 0, recovery_disabled = 1, poolinfo = 0xffff88022c80f640, r1bio_pool = 0xffff88022b8b6a20, r1buf_pool = 0x0, tmppage = 0xffffea0008a90c80, thread = 0x0, cluster_sync_low = 0, cluster_sync_high = 0 } kobj = { name = 0xffff88022b7194a0 "dev-loop1", entry = { next = 0xffff880231495280, prev = 0xffff880231495280 }, parent = 0xffff8800b59b0050, kset = 0x0, ktype = 0xffffffffa0564060 <rdev_ktype>, sd = 0xffff8800b6510960, kref = { refcount = { counter = 1 } }, state_initialized = 1, state_in_sysfs = 1, state_add_uevent_sent = 0, state_remove_uevent_sent = 0, uevent_suppress = 0 }, flags = 2, blocked_wait = { lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }, task_list = { next = 0xffff8802314952c8, prev = 0xffff8802314952c8 } }, desc_nr = 1, raid_disk = 1, new_raid_disk = 0, saved_raid_disk = -1, { recovery_offset = 0, journal_tail = 0 }, nr_pending = { counter = 1 }, -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 
207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss [-- Attachment #2: 0001-block-export-punt_bios_to_rescuer.patch --] [-- Type: text/x-patch, Size: 1566 bytes --] From e7adbbb1a8d542ea68ada5996e0f9ffe87c479b6 Mon Sep 17 00:00:00 2001 From: Jack Wang <jinpu.wang@profitbricks.com> Date: Wed, 14 Dec 2016 11:26:23 +0100 Subject: [PATCH 1/2] block: export punt_bios_to_rescuer We need it later in raid1 Suggested-by: Neil Brown <neil.brown@suse.com> Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> --- block/bio.c | 3 ++- include/linux/bio.h | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/block/bio.c b/block/bio.c index 46e2cc1..f6a250d 100644 --- a/block/bio.c +++ b/block/bio.c @@ -354,7 +354,7 @@ static void bio_alloc_rescue(struct work_struct *work) } } -static void punt_bios_to_rescuer(struct bio_set *bs) +void punt_bios_to_rescuer(struct bio_set *bs) { struct bio_list punt, nopunt; struct bio *bio; @@ -384,6 +384,7 @@ static void punt_bios_to_rescuer(struct bio_set *bs) queue_work(bs->rescue_workqueue, &bs->rescue_work); } +EXPORT_SYMBOL(punt_bios_to_rescuer); /** * bio_alloc_bioset - allocate a bio for I/O diff --git a/include/linux/bio.h b/include/linux/bio.h index 42e4e3c..6256ba7 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -479,6 +479,7 @@ extern void bio_advance(struct bio *, unsigned); extern void bio_init(struct bio *); extern void bio_reset(struct bio *); void bio_chain(struct bio *, struct bio *); +void punt_bios_to_rescuer(struct bio_set *); extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int); extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *, -- 2.7.4 [-- Attachment #3: 0002-raid1-fix-deadlock.patch --] [-- Type: text/x-patch, Size: 1420 bytes --] From 2ad4cc5e8b5d7ec9db7a6fffaa2fdcd5f20419bf Mon Sep 17 00:00:00 2001 From: Jack Wang <jinpu.wang@profitbricks.com> Date: Wed, 14 Dec 2016 11:35:52 +0100 Subject: [PATCH 2/2] raid1: fix deadlock Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> --- drivers/md/raid1.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 478223c..61dafb1 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -190,6 +190,11 @@ static void put_all_bios(struct r1conf *conf, struct r1bio *r1_bio) } } +static void raid1_punt_bios_to_rescuer(struct r1conf *conf) +{ + punt_bios_to_rescuer(conf->mddev->queue->bio_split); +} + static void free_r1bio(struct r1bio *r1_bio) { struct r1conf *conf = r1_bio->mddev->private; @@ -871,14 +876,15 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio) * that queue to allow conf->start_next_window * to increase. */ - wait_event_lock_irq(conf->wait_barrier, + wait_event_lock_irq_cmd(conf->wait_barrier, !conf->array_frozen && (!conf->barrier || ((conf->start_next_window < conf->next_resync + RESYNC_SECTORS) && current->bio_list && !bio_list_empty(current->bio_list))), - conf->resync_lock); + conf->resync_lock, + raid1_punt_bios_to_rescuer(conf)); conf->nr_waiting--; } -- 2.7.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
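A note on the mechanism the attached 0002 patch relies on: wait_event_lock_irq_cmd() runs the extra command each time the waiter is about to sleep, with the spinlock dropped, so passing the punt helper there should hand any bios allocated from q->bio_split that are parked on current->bio_list over to that bio_set's rescuer thread before wait_barrier() blocks. A simplified paraphrase of the include/linux/wait.h helper (semantics only, not the literal macro text):

/* Simplified expansion of wait_event_lock_irq_cmd(wq, condition, lock, cmd): */
for (;;) {
	prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);
	if (condition)
		break;
	spin_unlock_irq(&lock);
	cmd;			/* e.g. raid1_punt_bios_to_rescuer(conf) from patch 0002 */
	schedule();
	spin_lock_irq(&lock);
}
finish_wait(&wq, &__wait);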
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-14 12:13 ` Jinpu Wang @ 2016-12-14 14:49 ` Jinpu Wang 2016-12-15 3:20 ` NeilBrown 0 siblings, 1 reply; 31+ messages in thread From: Jinpu Wang @ 2016-12-14 14:49 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey On Wed, Dec 14, 2016 at 1:13 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote: > On Wed, Dec 14, 2016 at 11:22 AM, Jinpu Wang > <jinpu.wang@profitbricks.com> wrote: >> Thanks Neil, >> >> On Tue, Dec 13, 2016 at 11:18 PM, NeilBrown <neilb@suse.com> wrote: >>> On Wed, Dec 14 2016, Jinpu Wang wrote: >>> >>>> >>>> As you suggested, I re-run same test with 4.4.36 with no our own patch on MD. >>>> I can still reproduce the same bug, nr_pending on heathy leg(loop1) is till 1. >>>> >>> >>> Thanks. >>> >>> I have an hypothesis. >>> >>> md_make_request() calls blk_queue_split(). >>> If that does split the request it will call generic_make_request() >>> on the first half. That will call back into md_make_request() and >>> raid1_make_request() which will submit requests to the underlying >>> devices. These will get caught on the bio_list_on_stack queue in >>> generic_make_request(). >>> This is a queue which is not accounted in nr_queued. >>> >>> When blk_queue_split() completes, 'bio' will be the second half of the >>> bio. >>> This enters raid1_make_request() and by this time the array have been >>> frozen. >>> So wait_barrier() has to wait for pending requests to complete, and that >>> includes the one that it stuck in bio_list_on_stack, which will never >>> complete now. >>> >>> To see if this might be happening, please change the >>> >>> blk_queue_split(q, &bio, q->bio_split); >>> >>> call in md_make_request() to >>> >>> struct bio *tmp = bio; >>> blk_queue_split(q, &bio, q->bio_split); >>> WARN_ON_ONCE(bio != tmp); >>> >>> If that ever triggers, then the above is a real possibility. >> >> I triggered the warning as you expected, we can confirm the bug was >> caused by your above hypothesis. >> [ 429.282235] ------------[ cut here ]------------ >> [ 429.282407] WARNING: CPU: 2 PID: 4139 at drivers/md/md.c:262 >> md_set_array_sectors+0xac0/0xc30 [md_mod]() >> [ 429.285288] Modules linked in: raid1 ibnbd_client(O) >> ibtrs_client(O) dm_service_time dm_multipath rdma_ucm ib_ucm rdma_cm >> iw_cm ib_ipoib ib_cm ib_uverbs ib_umad mlx5_ib mlx5_c >> ore vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ib_sa ib_mad ib_core >> ib_addr ib_netlink mlx4_core mlx_compat loop md_mod kvm_amd >> edac_mce_amd kvm edac_core irqbypass acpi_cpufreq tpm >> _infineon tpm_tis i2c_piix4 tpm serio_raw evdev k10temp processor >> button fam15h_power crct10dif_pclmul crc32_pclmul sg sd_mod ahci >> libahci libata scsi_mod crc32c_intel r8169 psmo >> use xhci_pci xhci_hcd [last unloaded: mlx_compat] >> [ 429.288543] CPU: 2 PID: 4139 Comm: fio Tainted: G O 4.4.36-1-pse >> rver #1 >> [ 429.288825] Hardware name: To be filled by O.E.M. 
To be filled by >> O.E.M./M5A97 R2.0, BIOS 2501 04/07/2014 >> [ 429.289113] 0000000000000000 ffff8801f64ff8f0 ffffffff81424486 >> 0000000000000000 >> [ 429.289538] ffffffffa0561938 ffff8801f64ff928 ffffffff81058a60 >> ffff8800b8f3e000 >> [ 429.290157] 0000000000000000 ffff8800b51f4100 ffff880234f9a700 >> ffff880234f9a700 >> [ 429.290594] Call Trace: >> [ 429.290743] [<ffffffff81424486>] dump_stack+0x4d/0x67 >> [ 429.290893] [<ffffffff81058a60>] warn_slowpath_common+0x90/0xd0 >> [ 429.291046] [<ffffffff81058b55>] warn_slowpath_null+0x15/0x20 >> [ 429.291202] [<ffffffffa0550740>] md_set_array_sectors+0xac0/0xc30 [md_mod] >> [ 429.291358] [<ffffffff813fd3de>] generic_make_request+0xfe/0x1e0 >> [ 429.291540] [<ffffffff813fd522>] submit_bio+0x62/0x150 >> [ 429.291693] [<ffffffff813f53d9>] ? bio_set_pages_dirty+0x49/0x60 >> [ 429.291847] [<ffffffff811d32a7>] do_blockdev_direct_IO+0x2317/0x2ba0 >> [ 429.292011] [<ffffffffa0834f64>] ? >> ib_post_rdma_write_imm+0x24/0x30 [ibtrs_client] >> [ 429.292271] [<ffffffff811cdc40>] ? I_BDEV+0x10/0x10 >> [ 429.292417] [<ffffffff811d3b6e>] __blockdev_direct_IO+0x3e/0x40 >> [ 429.292566] [<ffffffff811ce2d7>] blkdev_direct_IO+0x47/0x50 >> [ 429.292746] [<ffffffff81132abf>] generic_file_read_iter+0x45f/0x580 >> [ 429.292894] [<ffffffff811ce620>] ? blkdev_write_iter+0x110/0x110 >> [ 429.293073] [<ffffffff811ce652>] blkdev_read_iter+0x32/0x40 >> [ 429.293284] [<ffffffff811deb86>] aio_run_iocb+0x116/0x2a0 >> [ 429.293492] [<ffffffff813fed52>] ? blk_flush_plug_list+0xc2/0x200 >> [ 429.293703] [<ffffffff81183ac6>] ? kmem_cache_alloc+0xb6/0x180 >> [ 429.293901] [<ffffffff811dfaf4>] ? do_io_submit+0x184/0x4d0 >> [ 429.294047] [<ffffffff811dfbaa>] do_io_submit+0x23a/0x4d0 >> [ 429.294194] [<ffffffff811dfe4b>] SyS_io_submit+0xb/0x10 >> [ 429.294375] [<ffffffff81815497>] entry_SYSCALL_64_fastpath+0x12/0x6a >> [ 429.294610] ---[ end trace 25d1cece0e01494b ]--- >> >> I double checked the nr_pending on heathy leg is still 1 as before. >> >>> >>> Fixing the problem isn't very easy... >>> >>> You could try: >>> 1/ write a function in raid1.c which calls punt_bios_to_rescuer() >>> (which you will need to export from block/bio.c), >>> passing mddev->queue->bio_split as the bio_set. >>> >>> 1/ change the wait_event_lock_irq() call in wait_barrier() to >>> wait_event_lock_irq_cmd(), and pass the new function as the command. >>> >>> That way, if wait_barrier() ever blocks, all the requests in >>> bio_list_on_stack will be handled by a separate thread. >>> >>> NeilBrown >> >> I will try your sugested way to see if it fix the bug, will report back soon. >> Hi Neil, I found a old mail thread below http://www.spinics.net/lists/raid/msg52792.html Likely Alex is trying to fix same bug, right? in one reply you suggested to modify the call in make_request @@ -1207,7 +1207,8 @@ read_again: sectors_handled; goto read_again; } else - generic_make_request(read_bio); + reschedule_retry(r1_bio); return; } I append above change, it looks fix the bug, I've run same tests over one hour, no hung task anymore. Do you think this is right fix? Do we still need the change you suggested with punt_bios_to_rescuer? -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
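For context on the alternative change tried here: reschedule_retry() hands the r1_bio to the raid1d thread instead of re-entering generic_make_request() from the same stack, so the re-issued read can never be parked on the caller's bio_list_on_stack. Roughly, paraphrased from the 4.4-era raid1.c:

static void reschedule_retry(struct r1bio *r1_bio)
{
	unsigned long flags;
	struct r1conf *conf = r1_bio->mddev->private;

	spin_lock_irqsave(&conf->device_lock, flags);
	list_add(&r1_bio->retry_list, &conf->retry_list);
	conf->nr_queued++;		/* now accounted for by freeze_array() */
	spin_unlock_irqrestore(&conf->device_lock, flags);

	wake_up(&conf->wait_barrier);
	md_wakeup_thread(conf->mddev->thread);	/* raid1d issues the bio later */
}

The trade-off is an extra hand-off to raid1d for every read taking this path, which is the performance concern raised in the reply below.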
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-14 14:49 ` Jinpu Wang @ 2016-12-15 3:20 ` NeilBrown 2016-12-15 9:24 ` Jinpu Wang [not found] ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com> 0 siblings, 2 replies; 31+ messages in thread From: NeilBrown @ 2016-12-15 3:20 UTC (permalink / raw) To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey [-- Attachment #1: Type: text/plain, Size: 2300 bytes --] > Hi Neil, > > I found a old mail thread below > http://www.spinics.net/lists/raid/msg52792.html > > Likely Alex is trying to fix same bug, right? > in one reply you suggested to modify the call in make_request > > @@ -1207,7 +1207,8 @@ read_again: > sectors_handled; > goto read_again; > } else > - generic_make_request(read_bio); > + reschedule_retry(r1_bio); > return; > } > > > I append above change, it looks fix the bug, I've run same tests over > one hour, no hung task anymore. > > Do you think this is right fix? Do we still need the change you > suggested with punt_bios_to_rescuer? I don't really like that fix. I suspect it would probably hurt performance. I'd prefer to fix generic_make_request() to process queued requests in a more sensible order. Can you please try the following (with all other patches removed)? Thanks, NeilBrown diff --git a/block/blk-core.c b/block/blk-core.c index 14d7c0740dc0..3436b6fc3ef8 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, false) == 0)) { + struct bio_list hold; + struct bio_list lower, same; + + /* Create a fresh bio_list for all subordinate requests */ + bio_list_merge(&hold, &bio_list_on_stack); + bio_list_init(&bio_list_on_stack); ret = q->make_request_fn(q, bio); blk_queue_exit(q); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); + bio_list_merge(&bio_list_on_stack, &hold); + bio = bio_list_pop(current->bio_list); } else { struct bio *bio_next = bio_list_pop(current->bio_list); [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 832 bytes --] ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array 2016-12-15 3:20 ` NeilBrown @ 2016-12-15 9:24 ` Jinpu Wang [not found] ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com> 1 sibling, 0 replies; 31+ messages in thread From: Jinpu Wang @ 2016-12-15 9:24 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey On Thu, Dec 15, 2016 at 4:20 AM, NeilBrown <neilb@suse.com> wrote: > >> Hi Neil, >> >> I found a old mail thread below >> http://www.spinics.net/lists/raid/msg52792.html >> >> Likely Alex is trying to fix same bug, right? >> in one reply you suggested to modify the call in make_request >> >> @@ -1207,7 +1207,8 @@ read_again: >> sectors_handled; >> goto read_again; >> } else >> - generic_make_request(read_bio); >> + reschedule_retry(r1_bio); >> return; >> } >> >> >> I append above change, it looks fix the bug, I've run same tests over >> one hour, no hung task anymore. >> >> Do you think this is right fix? Do we still need the change you >> suggested with punt_bios_to_rescuer? > > I don't really like that fix. I suspect it would probably hurt > performance. > > I'd prefer to fix generic_make_request() to process queued requests in a > more sensible order. > Can you please try the following (with all other patches removed)? > Thanks, > NeilBrown Thanks Neil, I will try when back to office, I'm sick at home. > > diff --git a/block/blk-core.c b/block/blk-core.c > index 14d7c0740dc0..3436b6fc3ef8 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio) > struct request_queue *q = bdev_get_queue(bio->bi_bdev); > > if (likely(blk_queue_enter(q, false) == 0)) { > + struct bio_list hold; > + struct bio_list lower, same; > + > + /* Create a fresh bio_list for all subordinate requests */ > + bio_list_merge(&hold, &bio_list_on_stack); > + bio_list_init(&bio_list_on_stack); > ret = q->make_request_fn(q, bio); > > blk_queue_exit(q); > > + /* sort new bios into those for a lower level > + * and those for the same level > + */ > + bio_list_init(&lower); > + bio_list_init(&same); > + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) > + if (q == bdev_get_queue(bio->bi_bdev)) > + bio_list_add(&same, bio); > + else > + bio_list_add(&lower, bio); > + /* now assemble so we handle the lowest level first */ > + bio_list_merge(&bio_list_on_stack, &lower); > + bio_list_merge(&bio_list_on_stack, &same); > + bio_list_merge(&bio_list_on_stack, &hold); > + > bio = bio_list_pop(current->bio_list); > } else { > struct bio *bio_next = bio_list_pop(current->bio_list); -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com>]
* Re: [BUG] MD/RAID1 hung forever on freeze_array [not found] ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com> @ 2016-12-19 14:56 ` Jinpu Wang 2016-12-19 22:45 ` NeilBrown 1 sibling, 0 replies; 31+ messages in thread From: Jinpu Wang @ 2016-12-19 14:56 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey On Mon, Dec 19, 2016 at 12:16 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote: > Hi Neil, > > After apply the patch below, it paniced during boot in > generic_make_request-> bio_list_pop. > Looks related to you do bio_list_init(&bio_list_on_stack); again. >> diff --git a/block/blk-core.c b/block/blk-core.c >> index 14d7c0740dc0..3436b6fc3ef8 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio) >> struct request_queue *q = bdev_get_queue(bio->bi_bdev); >> >> if (likely(blk_queue_enter(q, false) == 0)) { >> + struct bio_list hold; >> + struct bio_list lower, same; >> + >> + /* Create a fresh bio_list for all subordinate requests */ >> + bio_list_merge(&hold, &bio_list_on_stack); >> + bio_list_init(&bio_list_on_stack); ??? maybe init hold, and then merge bio_list_on_stack? >> ret = q->make_request_fn(q, bio); >> >> blk_queue_exit(q); >> >> + /* sort new bios into those for a lower level >> + * and those for the same level >> + */ >> + bio_list_init(&lower); >> + bio_list_init(&same); >> + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) >> + if (q == bdev_get_queue(bio->bi_bdev)) >> + bio_list_add(&same, bio); >> + else >> + bio_list_add(&lower, bio); >> + /* now assemble so we handle the lowest level first */ >> + bio_list_merge(&bio_list_on_stack, &lower); >> + bio_list_merge(&bio_list_on_stack, &same); >> + bio_list_merge(&bio_list_on_stack, &hold); >> + >> bio = bio_list_pop(current->bio_list); >> } else { >> struct bio *bio_next = bio_list_pop(current->bio_list); > > Seems we don't need bio_list hold at all, lower and seem merge back to bio_list_on_stack? I'm currently testing patch below, so far so good. diff --git a/block/blk-core.c b/block/blk-core.c index e8d15d8..9c56cf7 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2064,10 +2064,24 @@ blk_qc_t generic_make_request(struct bio *bio) struct request_queue *q = bdev_get_queue(bio->bi_bdev); if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) { + struct bio_list lower, same; ret = q->make_request_fn(q, bio); blk_queue_exit(q); + /* sort new bios into those for a lower level + * and those for the same level + */ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL) + if (q == bdev_get_queue(bio->bi_bdev)) + bio_list_add(&same, bio); + else + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack, &lower); + bio_list_merge(&bio_list_on_stack, &same); bio = bio_list_pop(current->bio_list); } else { -- Jinpu Wang Linux Kernel Developer ProfitBricks GmbH Greifswalder Str. 207 D - 10405 Berlin Tel: +49 30 577 008 042 Fax: +49 30 577 008 299 Email: jinpu.wang@profitbricks.com URL: https://www.profitbricks.de Sitz der Gesellschaft: Berlin Registergericht: Amtsgericht Charlottenburg, HRB 125506 B Geschäftsführer: Achim Weiss ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  [not found]              ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com>
  2016-12-19 14:56          ` Jinpu Wang
@ 2016-12-19 22:45          ` NeilBrown
  2016-12-20 10:34            ` Jinpu Wang
  1 sibling, 1 reply; 31+ messages in thread
From: NeilBrown @ 2016-12-19 22:45 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey

[-- Attachment #1: Type: text/plain, Size: 1418 bytes --]

On Mon, Dec 19 2016, Jinpu Wang wrote:

> Hi Neil,
>
> After applying the patch below, it panicked during boot in
> generic_make_request() -> bio_list_pop().
> It looks related to calling bio_list_init(&bio_list_on_stack) again.
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 14d7c0740dc0..3436b6fc3ef8 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2036,10 +2036,31 @@ blk_qc_t generic_make_request(struct bio *bio)
>>         struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>
>>         if (likely(blk_queue_enter(q, false) == 0)) {
>> +               struct bio_list hold;
>> +               struct bio_list lower, same;
>> +
>> +               /* Create a fresh bio_list for all subordinate requests */
>> +               bio_list_merge(&hold, &bio_list_on_stack);

This is the problem.  'hold' hasn't been initialised.
We could either do:
   bio_list_init(&hold);
   bio_list_merge(&hold, &bio_list_on_stack);
or just
   hold = bio_list_on_stack;

You didn't find 'hold' to be necessary in your testing, but I think that
in more complex arrangements it could make an important difference.

Thanks,
NeilBrown


>> +               bio_list_init(&bio_list_on_stack); ??? maybe init hold, and then merge bio_list_on_stack?
>>                 ret = q->make_request_fn(q, bio);
>>
>>                 blk_queue_exit(q);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
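To make the diagnosis above concrete: bio_list_merge() follows the head/tail
pointers of the destination list, so merging into a never-initialised 'hold'
dereferences whatever stack garbage happens to sit in those fields. The sketch
below is a simplified userspace model -- the struct layout and helper
semantics mirror the kernel's bio_list as I read it, but this is not kernel
code, and the bios are invented for the example. It shows the first safe
option (init 'hold' before merging); the second option is just a struct copy,
because struct bio_list is only a head/tail pair.

/*
 * Simplified userspace model of the bio_list helpers -- illustration only.
 * Helper semantics mirror the kernel's bio_list; the bios are made up.
 */
#include <stdio.h>

struct bio { struct bio *bi_next; int id; };
struct bio_list { struct bio *head, *tail; };

static void bio_list_init(struct bio_list *bl)
{
        bl->head = bl->tail = NULL;
}

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
        bio->bi_next = NULL;
        if (bl->tail)
                bl->tail->bi_next = bio;
        else
                bl->head = bio;
        bl->tail = bio;
}

static void bio_list_merge(struct bio_list *bl, struct bio_list *bl2)
{
        if (!bl2->head)
                return;
        if (bl->head)
                bl->tail->bi_next = bl2->head; /* garbage 'tail' blows up here */
        else
                bl->head = bl2->head;
        bl->tail = bl2->tail;
}

int main(void)
{
        struct bio a = { .id = 1 }, b = { .id = 2 };
        struct bio_list on_stack, hold;

        bio_list_init(&on_stack);
        bio_list_add(&on_stack, &a);
        bio_list_add(&on_stack, &b);

        /* Option 1 from the mail: initialise 'hold' before merging into it. */
        bio_list_init(&hold);
        bio_list_merge(&hold, &on_stack);
        bio_list_init(&on_stack);

        /* Option 2 would simply be a struct copy:  hold = on_stack;  */

        for (struct bio *bio = hold.head; bio; bio = bio->bi_next)
                printf("bio %d preserved in 'hold'\n", bio->id);
        return 0;
}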
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-19 22:45          ` NeilBrown
@ 2016-12-20 10:34            ` Jinpu Wang
  2016-12-20 21:23              ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Jinpu Wang @ 2016-12-20 10:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey

Hi Neil,
> This is the problem.  'hold' hasn't been initialised.
> We could either do:
>    bio_list_init(&hold);
>    bio_list_merge(&hold, &bio_list_on_stack);

I tried the above variant first and it led to a panic in the endio path:

PID: 4004   TASK: ffff8802337f3400   CPU: 1   COMMAND: "fio"
 #0 [ffff88023ec838d0] machine_kexec at ffffffff8104075a
 #1 [ffff88023ec83918] crash_kexec at ffffffff810d54c3
 #2 [ffff88023ec839e0] oops_end at ffffffff81008784
 #3 [ffff88023ec83a08] no_context at ffffffff8104a8f6
 #4 [ffff88023ec83a60] __bad_area_nosemaphore at ffffffff8104abcf
 #5 [ffff88023ec83aa8] bad_area_nosemaphore at ffffffff8104ad3e
 #6 [ffff88023ec83ab8] __do_page_fault at ffffffff8104afd7
 #7 [ffff88023ec83b10] do_page_fault at ffffffff8104b33c
 #8 [ffff88023ec83b20] page_fault at ffffffff818173a2
    [exception RIP: bio_check_pages_dirty+65]
    RIP: ffffffff813f6221  RSP: ffff88023ec83bd8  RFLAGS: 00010212
    RAX: 0000000000000020  RBX: ffff880232d75010  RCX: 0000000000000001
    RDX: ffff880232d74000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88023ec83bf8   R8: 0000000000000001   R9: 0000000000000000
    R10: ffffffff81f25ac0  R11: ffff8802348acef0  R12: 0000000000000001
    R13: 0000000000000000  R14: ffff8800b53b7d00  R15: ffff88009704d180
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88023ec83c00] dio_bio_complete at ffffffff811d010e
#10 [ffff88023ec83c38] dio_bio_end_aio at ffffffff811d0367
#11 [ffff88023ec83c68] bio_endio at ffffffff813f637a
#12 [ffff88023ec83c80] call_bio_endio at ffffffffa0868220 [raid1]
#13 [ffff88023ec83cc8] raid_end_bio_io at ffffffffa086885b [raid1]
#14 [ffff88023ec83cf8] raid1_end_read_request at ffffffffa086a184 [raid1]
#15 [ffff88023ec83d50] bio_endio at ffffffff813f637a
#16 [ffff88023ec83d68] blk_update_request at ffffffff813fdab6
#17 [ffff88023ec83da8] blk_mq_end_request at ffffffff81406dfe

> or just
>    hold = bio_list_on_stack;
>
>
> You didn't find 'hold' to be necessary in your testing, but I think that
> in more complex arrangements it could make an important difference.

Could you elaborate a bit more? From my understanding, in the later code we
pop all bios from bio_list_on_stack and add each to either the "lower" or
the "same" bio_list, so merging both will give the whole list again, right?

>
> Thanks,
> NeilBrown
>
>
>>> +               bio_list_init(&bio_list_on_stack); ??? maybe init hold, and then merge bio_list_on_stack?
>>>                 ret = q->make_request_fn(q, bio);
>>>
>>>                 blk_queue_exit(q);

Thanks
--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-20 10:34            ` Jinpu Wang
@ 2016-12-20 21:23              ` NeilBrown
  2016-12-21 12:48                ` Jinpu Wang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2016-12-20 21:23 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey

[-- Attachment #1: Type: text/plain, Size: 1783 bytes --]

On Tue, Dec 20 2016, Jinpu Wang wrote:

> Hi Neil,
>> This is the problem.  'hold' hasn't been initialised.
>> We could either do:
>>    bio_list_init(&hold);
>>    bio_list_merge(&hold, &bio_list_on_stack);
> I tried the above variant first and it led to a panic in the endio path:
> ...
>
> [exception RIP: bio_check_pages_dirty+65]


I can't explain that at all.  Maybe if I saw the complete patch I would
be able to see what went wrong (or maybe there is a separate cause).

>>
>> You didn't find 'hold' to be necessary in your testing, but I think that
>> in more complex arrangements it could make an important difference.
>
> Could you elaborate a bit more? From my understanding, in the later code we
> pop all bios from bio_list_on_stack and add each to either the "lower" or
> the "same" bio_list, so merging both will give the whole list again, right?

If there are several bios on bio_list_on_stack, and if processing the
first one causes several calls to generic_make_request(), then we want
all the bios passed to generic_make_request() to be handled *before* the
remaining bios that were originally on bio_list_on_stack.  Of these new
bios, we want to handle those for a lower-level device first, and those
for the same device after that.  Only when all of those have been
completely submitted (together with any subordinate bios submitted to
generic_make_request()) do we move on to the rest of bio_list_on_stack
(which are at the same level as the first bio, or higher in the device
stack).

The usage of 'hold' would become important when you have multiple
levels, e.g. dm on md on scsi.  If the first bio sent to dm is split,
then the first half needs to be processed all the way down to all the
scsi devices, before the second half of the split is handled.

Hope that helps,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
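The ordering described above can be replayed in userspace. The sketch below
is an illustration only: the bio_list helpers are re-implemented to mirror
the kernel ones, and the device names ("md1", "loop1") and the bios are
invented for the example. It walks one iteration of the patched
generic_make_request() loop by hand -- a previously queued bio is stashed in
'hold', the new bios produced by make_request_fn() are split into 'lower' and
'same', and the reassembled list comes out lowest level first.

/* Userspace illustration of the lower/same/hold ordering -- not kernel code. */
#include <stdio.h>
#include <string.h>

struct bio { struct bio *bi_next; char dev[8]; };
struct bio_list { struct bio *head, *tail; };

static void bio_list_init(struct bio_list *bl) { bl->head = bl->tail = NULL; }

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
        bio->bi_next = NULL;
        if (bl->tail)
                bl->tail->bi_next = bio;
        else
                bl->head = bio;
        bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
        struct bio *bio = bl->head;
        if (bio) {
                bl->head = bio->bi_next;
                if (!bl->head)
                        bl->tail = NULL;
                bio->bi_next = NULL;
        }
        return bio;
}

static void bio_list_merge(struct bio_list *bl, struct bio_list *bl2)
{
        if (!bl2->head)
                return;
        if (bl->head)
                bl->tail->bi_next = bl2->head;
        else
                bl->head = bl2->head;
        bl->tail = bl2->tail;
}

int main(void)
{
        /* Hypothetical devices: "md1" is the current level, "loop1" is lower. */
        struct bio held    = { .dev = "md1"   }; /* queued before this iteration   */
        struct bio split   = { .dev = "md1"   }; /* resubmitted to the same level  */
        struct bio read_lo = { .dev = "loop1" }; /* internal READ for a member dev */
        struct bio_list on_stack, hold, lower, same;
        struct bio *bio;

        bio_list_init(&on_stack);
        bio_list_add(&on_stack, &held);

        /* generic_make_request(): stash the old list, start a fresh one */
        hold = on_stack;
        bio_list_init(&on_stack);

        /* ->make_request_fn() queues new bios onto the (now empty) list */
        bio_list_add(&on_stack, &split);
        bio_list_add(&on_stack, &read_lo);

        /* sort what make_request_fn produced, then reassemble */
        bio_list_init(&lower);
        bio_list_init(&same);
        while ((bio = bio_list_pop(&on_stack)) != NULL)
                bio_list_add(strcmp(bio->dev, "md1") ? &lower : &same, bio);
        bio_list_merge(&on_stack, &lower);
        bio_list_merge(&on_stack, &same);
        bio_list_merge(&on_stack, &hold);

        while ((bio = bio_list_pop(&on_stack)) != NULL)
                printf("submit to %s\n", bio->dev); /* loop1, md1, md1 */
        return 0;
}

With this ordering, a bio redirected at a lower-level member device is
dispatched before any same-level or previously held bio, which is what lets
the pending internal READs drain.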
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-20 21:23              ` NeilBrown
@ 2016-12-21 12:48                ` Jinpu Wang
  2016-12-21 23:51                  ` NeilBrown
  0 siblings, 1 reply; 31+ messages in thread
From: Jinpu Wang @ 2016-12-21 12:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey

[-- Attachment #1: Type: text/plain, Size: 2806 bytes --]

Thanks Neil,

On Tue, Dec 20, 2016 at 10:23 PM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Dec 20 2016, Jinpu Wang wrote:
>
>> Hi Neil,
>>> This is the problem.  'hold' hasn't been initialised.
>>> We could either do:
>>>    bio_list_init(&hold);
>>>    bio_list_merge(&hold, &bio_list_on_stack);
>> I tried the above variant first and it led to a panic in the endio path:
>> ...
>>
>> [exception RIP: bio_check_pages_dirty+65]
>
>
> I can't explain that at all.  Maybe if I saw the complete patch I would
> be able to see what went wrong (or maybe there is a separate cause).

I think I know the cause now.  What I actually did was:

+bio_list_init(&hold);
 bio_list_merge(&hold, &bio_list_on_stack);
-bio_list_init(&bio_list_on_stack);

I didn't read your comment carefully enough and thought we didn't need to
init &bio_list_on_stack a second time.  But your idea is exactly that: save
a copy of the list before we call q->make_request_fn.

>
>>>
>>> You didn't find 'hold' to be necessary in your testing, but I think that
>>> in more complex arrangements it could make an important difference.
>>
>> Could you elaborate a bit more? From my understanding, in the later code we
>> pop all bios from bio_list_on_stack and add each to either the "lower" or
>> the "same" bio_list, so merging both will give the whole list again, right?
>
> If there are several bios on bio_list_on_stack, and if processing the
> first one causes several calls to generic_make_request(), then we want
> all the bios passed to generic_make_request() to be handled *before* the
> remaining bios that were originally on bio_list_on_stack.  Of these new
> bios, we want to handle those for a lower-level device first, and those
> for the same device after that.  Only when all of those have been
> completely submitted (together with any subordinate bios submitted to
> generic_make_request()) do we move on to the rest of bio_list_on_stack
> (which are at the same level as the first bio, or higher in the device
> stack).
>
> The usage of 'hold' would become important when you have multiple
> levels, e.g. dm on md on scsi.  If the first bio sent to dm is split,
> then the first half needs to be processed all the way down to all the
> scsi devices, before the second half of the split is handled.
>
> Hope that helps,
> NeilBrown
>

Thanks, that helps a lot.  I've attached the patch I'm still testing, but
so far so good.  Could you check if I got it right?

Cheers,
--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

[-- Attachment #2: 0001-block-fix-deadlock-between-freeze_array-and-wait_bar.patch --]
[-- Type: text/x-patch, Size: 2300 bytes --]

From 9500d5e9a7ccd1f4397f927ffdc8d378955d477c Mon Sep 17 00:00:00 2001
From: Jack Wang <jinpu.wang@profitbricks.com>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting in
current->bio_list, which prevents the freeze_array() call from completing.
Those can only be internal READs, which have already passed the
wait_barrier call (thus incrementing nr_pending), but were still not
submitted to the lower level, due to the generic_make_request logic that
avoids recursive calls.  In such a case we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
  never complete.

To fix this, modify generic_make_request to always sort bio_list_on_stack:
bios for lower-level devices first, then bios for the same level.

Suggested-by: Neil Brown <neilb@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index e8d15d8..5a6390a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2064,10 +2064,30 @@ blk_qc_t generic_make_request(struct bio *bio)
        struct request_queue *q = bdev_get_queue(bio->bi_bdev);

        if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
+               struct bio_list lower, same, hold;
+
+               /* Create a fresh bio_list for all subordinate requests */
+               bio_list_init(&hold);
+               bio_list_merge(&hold, &bio_list_on_stack);
+               bio_list_init(&bio_list_on_stack);
                ret = q->make_request_fn(q, bio);

                blk_queue_exit(q);

+               /* sort new bios into those for a lower level
+                * and those for the same level
+                */
+               bio_list_init(&lower);
+               bio_list_init(&same);
+               while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+                       if (q == bdev_get_queue(bio->bi_bdev))
+                               bio_list_add(&same, bio);
+                       else
+                               bio_list_add(&lower, bio);
+               /* now assemble so we handle the lowest level first */
+               bio_list_merge(&bio_list_on_stack, &lower);
+               bio_list_merge(&bio_list_on_stack, &same);
+               bio_list_merge(&bio_list_on_stack, &hold);
                bio = bio_list_pop(current->bio_list);
        } else {
--
2.7.4

^ permalink raw reply related	[flat|nested] 31+ messages in thread
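For readers who want to see the deadlock from the changelog in isolation,
here is a toy pthread model -- purely an illustration with invented counters
and timeouts, not the raid1 code. A 'freeze_array' thread waits for
nr_pending to drain while a 'fio' thread sits in a simplified wait_barrier();
the internal READ that would let nr_pending drop is assumed to be parked on
the submitting thread's bio_list, so neither side can make progress and the
demo gives up after two seconds and reports the stall.

/* Toy pthread model of the deadlock -- illustration only, not kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wait_barrier_cv = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  freeze_cv = PTHREAD_COND_INITIALIZER;
static int array_frozen;      /* set by the freeze side */
static int nr_pending = 1;    /* one internal READ passed wait_barrier(), but its
                               * bio still sits on current->bio_list, unsubmitted */
static int nr_queued;

static struct timespec deadline(int sec)
{
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += sec;
        return ts;
}

static void *freeze_array_thread(void *arg)
{
        struct timespec ts = deadline(2);
        (void)arg;
        pthread_mutex_lock(&lock);
        array_frozen = 1;
        while (nr_pending != nr_queued)         /* never becomes true */
                if (pthread_cond_timedwait(&freeze_cv, &lock, &ts)) {
                        printf("freeze_array: stuck (nr_pending=%d nr_queued=%d)\n",
                               nr_pending, nr_queued);
                        break;
                }
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void *fio_thread(void *arg)
{
        struct timespec ts = deadline(2);
        (void)arg;
        pthread_mutex_lock(&lock);
        while (array_frozen)                    /* wait_barrier(): blocked */
                if (pthread_cond_timedwait(&wait_barrier_cv, &lock, &ts)) {
                        printf("wait_barrier: stuck (array_frozen=%d)\n", array_frozen);
                        break;
                }
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        struct timespec delay = { 0, 100 * 1000 * 1000 };

        pthread_create(&a, NULL, freeze_array_thread, NULL);
        nanosleep(&delay, NULL);        /* let the freeze side set array_frozen first */
        pthread_create(&b, NULL, fio_thread, NULL);
        /* The bio that would make nr_pending drop to nr_queued is parked behind
         * bios that are themselves stuck in wait_barrier(), so nothing moves. */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}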
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-21 12:48                ` Jinpu Wang
@ 2016-12-21 23:51                  ` NeilBrown
  2016-12-22  8:35                    ` Jinpu Wang
  0 siblings, 1 reply; 31+ messages in thread
From: NeilBrown @ 2016-12-21 23:51 UTC (permalink / raw)
  To: Jinpu Wang; +Cc: linux-raid, Shaohua Li, Nate Dailey

[-- Attachment #1: Type: text/plain, Size: 295 bytes --]

On Wed, Dec 21 2016, Jinpu Wang wrote:

>>
> Thanks, that helps a lot.  I've attached the patch I'm still testing, but
> so far so good.
> Could you check if I got it right?

Yes, that looks exactly right.
I guess one of us should try to push it upstream... maybe next year :-)

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [BUG] MD/RAID1 hung forever on freeze_array
  2016-12-21 23:51                  ` NeilBrown
@ 2016-12-22  8:35                    ` Jinpu Wang
  0 siblings, 0 replies; 31+ messages in thread
From: Jinpu Wang @ 2016-12-22 8:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Shaohua Li, Nate Dailey

On Thu, Dec 22, 2016 at 12:51 AM, NeilBrown <neilb@suse.com> wrote:
> On Wed, Dec 21 2016, Jinpu Wang wrote:
>
>>>
>> Thanks, that helps a lot.  I've attached the patch I'm still testing, but
>> so far so good.
>> Could you check if I got it right?
>
> Yes, that looks exactly right.
> I guess one of us should try to push it upstream... maybe next year :-)
>
> Thanks,
> NeilBrown

Thanks, Neil. I will try to push it upstream next year!
Happy holidays!

Cheers!
--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@profitbricks.com
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss

^ permalink raw reply	[flat|nested] 31+ messages in thread
end of thread, other threads:[~2016-12-22  8:35 UTC | newest]

Thread overview: 31+ messages:
2016-11-25 13:30 [BUG] MD/RAID1 hung forever on freeze_array Jinpu Wang
2016-11-25 13:59 ` Jinpu Wang
2016-11-28  4:47 ` Coly Li
2016-11-28  8:24 ` Jinpu Wang
2016-11-28  8:54 ` Coly Li
2016-11-28  9:02 ` Jinpu Wang
2016-11-28  9:10 ` Coly Li
2016-11-29 11:15 ` Jinpu Wang
2016-12-07 14:17 ` Jinpu Wang
2016-12-08  3:17 ` NeilBrown
2016-12-08  9:50 ` Jinpu Wang
2016-12-09  6:01 ` NeilBrown
2016-12-09 15:28 ` Jinpu Wang
2016-12-09 15:36 ` Jinpu Wang
2016-12-12  0:59 ` NeilBrown
2016-12-12 13:10 ` Jinpu Wang
2016-12-12 21:53 ` NeilBrown
2016-12-13 15:08 ` Jinpu Wang
2016-12-13 22:18 ` NeilBrown
2016-12-14 10:22 ` Jinpu Wang
2016-12-14 12:13 ` Jinpu Wang
2016-12-14 14:49 ` Jinpu Wang
2016-12-15  3:20 ` NeilBrown
2016-12-15  9:24 ` Jinpu Wang
[not found]      ` <CAMGffEkufeaDytaHxtLR02iiQifZDhcwkLdzMj3X8_yaitSoFQ@mail.gmail.com>
2016-12-19 14:56 ` Jinpu Wang
2016-12-19 22:45 ` NeilBrown
2016-12-20 10:34 ` Jinpu Wang
2016-12-20 21:23 ` NeilBrown
2016-12-21 12:48 ` Jinpu Wang
2016-12-21 23:51 ` NeilBrown
2016-12-22  8:35 ` Jinpu Wang