* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
[not found] ` <20211102081929.s6eyxawq32phvufr@shindev>
@ 2021-11-02 8:28 ` Damien Le Moal
0 siblings, 0 replies; 5+ messages in thread
From: Damien Le Moal @ 2021-11-02 8:28 UTC (permalink / raw)
To: Shinichiro Kawasaki, Chaitanya Kulkarni
Cc: linux-block@vger.kernel.org, Jens Axboe, Damien Le Moal,
linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig
+linux-nvme (Keith and Christoph)
On 2021/11/02 17:19, Shinichiro Kawasaki wrote:
> On Nov 02, 2021 / 03:07, Chaitanya Kulkarni wrote:
>>
>>> The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned namespace and
>>> a regular namespace, and the hang is observed with both namespaces. I have
>>> not yet tried other NVMe devices, so I will try them.
>
> I tried another NVMe device, WD Black SN750, but it did not recreate the hang.
>
>>>
>>
>> See if you can reproduce this with QEMU NVMe emulation (ZNS and non-ZNS
>> mode); if you can, it will be easier for everyone to reproduce.
>
> Chaitanya, thank you for the advice. I have managed to reproduce the hang with
> QEMU NVMe emulation. Actually, ZNS mode is not required. I tried several device
> setup configurations with QEMU, and the hang was recreated when a single NVMe
> device has two namespaces. With a single namespace in a single NVMe device, the
> hang is not observed.
>
> So it looks like the number of namespaces may be related to the cause. The
> WD Black SN750, which does not hang, has a single namespace. When I reduced the
> number of namespaces of the U.2 NVMe ZNS SSD from 2 to 1, the hang was not observed.
>
> FYI, the QEMU command line options that I used were as follows. They prepare
> /dev/nvme0n1 and /dev/nvme0n2, and a block/005 run on /dev/nvme0n1 recreated
> the hang.
>
> -device nvme,id=nvme0,serial=qemunvme,logical_block_size=4096,physical_block_size=4096 \
> -drive file=(path)/nvme0n1.img,id=nvme0n1,format=raw,if=none \
> -device nvme-ns,drive=nvme0n1,bus=nvme0,nsid=1 \
> -drive file=(path)/nvme0n2.img,id=nvme0n2,format=raw,if=none \
> -device nvme-ns,drive=nvme0n2,bus=nvme0,nsid=2
>
> Regarding the two image files, I created them beforehand with the command below:
>
> $ qemu-img create -f raw "${image_file_path}" 1024M
>
> Hope this helps.
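[Editorial note: the image creation and the controller/namespace options above
can be collected into one small script. This is only a sketch: IMG_DIR is a
placeholder for the "(path)" the original message elides, and the rest of the
QEMU invocation (machine type, memory, boot disk) must come from your own
environment.]

```shell
#!/bin/sh
# Sketch of the two-namespace reproducer described above.
# IMG_DIR is a placeholder; the original message only shows "(path)".
IMG_DIR=${IMG_DIR:-/tmp}

# Backing files for the two namespaces (raw format, 1024M each),
# as created with qemu-img in the message above.
for ns in nvme0n1 nvme0n2; do
    if command -v qemu-img >/dev/null 2>&1; then
        qemu-img create -f raw "$IMG_DIR/$ns.img" 1024M
    fi
done

# One NVMe controller carrying two namespaces -- the configuration that
# reproduced the hang; a single-namespace controller did not.
NVME_OPTS="\
 -device nvme,id=nvme0,serial=qemunvme,logical_block_size=4096,physical_block_size=4096 \
 -drive file=$IMG_DIR/nvme0n1.img,id=nvme0n1,format=raw,if=none \
 -device nvme-ns,drive=nvme0n1,bus=nvme0,nsid=1 \
 -drive file=$IMG_DIR/nvme0n2.img,id=nvme0n2,format=raw,if=none \
 -device nvme-ns,drive=nvme0n2,bus=nvme0,nsid=2"

# Append $NVME_OPTS to your usual qemu-system-x86_64 invocation, then run
# blktests block/005 against /dev/nvme0n1 inside the guest.
echo "$NVME_OPTS"
```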
>
--
Damien Le Moal
Western Digital Research
* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
[not found] ` <20211102082846.m632phnsaqnwtaec@shindev>
@ 2021-11-02 9:02 ` Shinichiro Kawasaki
2021-11-02 10:48 ` Ming Lei
0 siblings, 1 reply; 5+ messages in thread
From: Shinichiro Kawasaki @ 2021-11-02 9:02 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block@vger.kernel.org, Damien Le Moal,
linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig
Let me add linux-nvme, Keith and Christoph to the CC list.
--
Best Regards,
Shin'ichiro Kawasaki
On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f) and
> > > > >> observed a process hang during a blktests block/005 run on an NVMe device.
> > > > >> The kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > >> seconds." with a call trace [1]. So far, the hang is 100% reproducible on my
> > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > >>
> > > > >> I bisected and found that commit 4f5022453acd ("nvme: wire up completion batching
> > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > >> for-next branch tip, the hang disappears. The block/005 test case switches the IO
> > > > >> scheduler during IO, and the completion path change in that commit looks to be
> > > > >> affecting the scheduler switch. Comments toward a solution would be appreciated.
> > > > >
> > > > > I'll take a look at this.
> > > >
> > > > I've tried running various things most of the day, and I cannot
> > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > split between batched completion and one-by-one completion, it works
> > > > just fine for me. No special care needs to be taken for put_many() on
> > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > >
> > > > Tell me more about your setup. What does the runtimes of the test look
> > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > this?
> > >
> > > Thank you for spending your precious time. With a kernel that does not hang,
> > > the test case completes in around 20 seconds. When the hang happens, the check
> > > script process stops at blk_mq_freeze_queue_wait() during the scheduler change, and fio
> > > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > > end, so I need to reboot the system for the next trial. While waiting for the test
> > > case to complete, the kernel repeats the same INFO message every 2 minutes.
> > >
> > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > >
> > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned namespace and
> > > a regular namespace, and the hang is observed with both namespaces. I have
> > > not yet tried other NVMe devices, so I will try them.
> > >
> > > >
> > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > preferable.
> > >
> > > I see. I have switched from the linux-block for-next branch to Linus' upstream
> > > branch. At git hash 879dbe9ffebc, the hang is still observed.
> >
> > Can you post the blk-mq debugfs log after the hang is triggered?
> >
> > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
>
> Thanks Ming. When I ran the command above, the grep command stopped when it
> opened tag-related files in the debugfs tree. That grep command appeared to be
> hanging as well, so I used the find command below instead, which excludes the
> tag-related files.
>
> # find . -type f -not -name *tag* -exec grep -aH . {} \;
>
> Here I share the captured log.
>
> ./sched/queued:0 8 0
> ./sched/owned_by_driver:0 8 0
> ./sched/async_depth:192
> ./sched/starved:0
> ./sched/batching:1
> ./rqos/wbt/wb_background:4
> ./rqos/wbt/wb_normal:8
> ./rqos/wbt/unknown_cnt:0
> ./rqos/wbt/min_lat_nsec:2000000
> ./rqos/wbt/inflight:0: inflight 0
> ./rqos/wbt/inflight:1: inflight 0
> ./rqos/wbt/inflight:2: inflight 0
> ./rqos/wbt/id:0
> ./rqos/wbt/enabled:1
> ./rqos/wbt/curr_win_nsec:0
> ./hctx7/type:default
> ./hctx7/dispatch_busy:7
> ./hctx7/active:1024
> ./hctx7/run:5
> ./hctx7/ctx_map:00000000: 00
> ./hctx7/dispatch:000000003dfed3fd {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
> ./hctx7/dispatch:0000000077876d9e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=193}
> ./hctx7/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx7/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx6/type:default
> ./hctx6/dispatch_busy:5
> ./hctx6/active:1025
> ./hctx6/run:4
> ./hctx6/ctx_map:00000000: 00
> ./hctx6/dispatch:00000000c0b8e1c9 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
> ./hctx6/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx6/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx5/type:default
> ./hctx5/dispatch_busy:5
> ./hctx5/active:1024
> ./hctx5/run:4
> ./hctx5/ctx_map:00000000: 00
> ./hctx5/dispatch:00000000aaf1e364 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
> ./hctx5/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx5/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx4/type:default
> ./hctx4/dispatch_busy:0
> ./hctx4/active:1023
> ./hctx4/run:1
> ./hctx4/ctx_map:00000000: 00
> ./hctx4/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx3/type:default
> ./hctx3/dispatch_busy:5
> ./hctx3/active:1024
> ./hctx3/run:4
> ./hctx3/ctx_map:00000000: 00
> ./hctx3/dispatch:000000008b07d5e1 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
> ./hctx3/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx3/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx2/type:default
> ./hctx2/dispatch_busy:5
> ./hctx2/active:1024
> ./hctx2/run:4
> ./hctx2/ctx_map:00000000: 00
> ./hctx2/dispatch:00000000c4887013 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
> ./hctx2/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx2/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx1/type:default
> ./hctx1/dispatch_busy:6
> ./hctx1/active:1024
> ./hctx1/run:5
> ./hctx1/ctx_map:00000000: 00
> ./hctx1/dispatch:00000000efe38e4e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=0}
> ./hctx1/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx1/state:TAG_ACTIVE|SCHED_RESTART
> ./hctx0/type:default
> ./hctx0/dispatch_busy:5
> ./hctx0/active:1024
> ./hctx0/run:4
> ./hctx0/ctx_map:00000000: 00
> ./hctx0/dispatch:0000000015147095 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
> ./hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
> ./hctx0/state:TAG_ACTIVE|SCHED_RESTART
> ./write_hints:hint0: 0
> ./write_hints:hint1: 0
> ./write_hints:hint2: 0
> ./write_hints:hint3: 0
> ./write_hints:hint4: 0
> ./state:SAME_COMP|NONROT|IO_STAT|DISCARD|INIT_DONE|STATS|REGISTERED|PCI_P2PDMA|NOWAIT
> ./pm_only:0
> ./poll_stat:read (512 Bytes): samples=0
> ./poll_stat:write (512 Bytes): samples=0
> ./poll_stat:read (1024 Bytes): samples=0
> ./poll_stat:write (1024 Bytes): samples=0
> ./poll_stat:read (2048 Bytes): samples=0
> ./poll_stat:write (2048 Bytes): samples=0
> ./poll_stat:read (4096 Bytes): samples=0
> ./poll_stat:write (4096 Bytes): samples=0
> ./poll_stat:read (8192 Bytes): samples=0
> ./poll_stat:write (8192 Bytes): samples=0
> ./poll_stat:read (16384 Bytes): samples=0
> ./poll_stat:write (16384 Bytes): samples=0
> ./poll_stat:read (32768 Bytes): samples=0
> ./poll_stat:write (32768 Bytes): samples=0
> ./poll_stat:read (65536 Bytes): samples=0
> ./poll_stat:write (65536 Bytes): samples=0
>
>
> --
> Best Regards,
> Shin'ichiro Kawasaki
* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
2021-11-02 9:02 ` Shinichiro Kawasaki
@ 2021-11-02 10:48 ` Ming Lei
2021-11-02 11:24 ` Shinichiro Kawasaki
0 siblings, 1 reply; 5+ messages in thread
From: Ming Lei @ 2021-11-02 10:48 UTC (permalink / raw)
To: Shinichiro Kawasaki
Cc: Jens Axboe, linux-block@vger.kernel.org, Damien Le Moal,
linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig,
ming.lei
On Tue, Nov 02, 2021 at 09:02:47AM +0000, Shinichiro Kawasaki wrote:
> Let me add linux-nvme, Keith and Christoph to the CC list.
>
> --
> Best Regards,
> Shin'ichiro Kawasaki
>
>
> On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> > On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f) and
> > > > > >> observed a process hang during a blktests block/005 run on an NVMe device.
> > > > > >> The kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > > >> seconds." with a call trace [1]. So far, the hang is 100% reproducible on my
> > > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > > >>
> > > > > >> I bisected and found that commit 4f5022453acd ("nvme: wire up completion batching
> > > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > > >> for-next branch tip, the hang disappears. The block/005 test case switches the IO
> > > > > >> scheduler during IO, and the completion path change in that commit looks to be
> > > > > >> affecting the scheduler switch. Comments toward a solution would be appreciated.
> > > > > >
> > > > > > I'll take a look at this.
> > > > >
> > > > > I've tried running various things most of the day, and I cannot
> > > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > > split between batched completion and one-by-one completion, it works
> > > > > just fine for me. No special care needs to be taken for put_many() on
> > > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > > >
> > > > > Tell me more about your setup. What does the runtimes of the test look
> > > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > > this?
> > > >
> > > > Thank you for spending your precious time. With a kernel that does not hang,
> > > > the test case completes in around 20 seconds. When the hang happens, the check
> > > > script process stops at blk_mq_freeze_queue_wait() during the scheduler change, and fio
> > > > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > > > end, so I need to reboot the system for the next trial. While waiting for the test
> > > > case to complete, the kernel repeats the same INFO message every 2 minutes.
> > > >
> > > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > > >
> > > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned namespace and
> > > > a regular namespace, and the hang is observed with both namespaces. I have
> > > > not yet tried other NVMe devices, so I will try them.
> > > >
> > > > >
> > > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > > preferable.
> > > >
> > > > I see. I have switched from the linux-block for-next branch to Linus' upstream
> > > > branch. At git hash 879dbe9ffebc, the hang is still observed.
> > >
> > > Can you post the blk-mq debugfs log after the hang is triggered?
> > >
> > > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
> >
> > Thanks Ming. When I ran the command above, the grep command stopped when it
> > opened tag-related files in the debugfs tree. That grep command appeared to be
> > hanging as well, so I used the find command below instead, which excludes the
> > tag-related files.
> >
> > # find . -type f -not -name *tag* -exec grep -aH . {} \;
> >
> > Here I share the captured log.
> >
It is a bit odd, since batched completion shouldn't be triggered when an IO
scheduler is in use, but blk_mq_end_request_batch() does not restart the hctx,
so maybe you can try the following patch:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 07eb1412760b..4c0c9af9235e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -846,16 +846,20 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
rq_qos_done(rq->q, rq);
if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
- if (cur_hctx)
+ if (cur_hctx) {
blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
+ blk_mq_sched_restart(cur_hctx);
+ }
nr_tags = 0;
cur_hctx = rq->mq_hctx;
}
tags[nr_tags++] = rq->tag;
}
- if (nr_tags)
+ if (nr_tags) {
blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
+ blk_mq_sched_restart(cur_hctx);
+ }
}
EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);
--
Ming
* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
2021-11-02 10:48 ` Ming Lei
@ 2021-11-02 11:24 ` Shinichiro Kawasaki
2021-11-02 12:26 ` Ming Lei
0 siblings, 1 reply; 5+ messages in thread
From: Shinichiro Kawasaki @ 2021-11-02 11:24 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block@vger.kernel.org, Damien Le Moal,
linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig
On Nov 02, 2021 / 18:48, Ming Lei wrote:
> On Tue, Nov 02, 2021 at 09:02:47AM +0000, Shinichiro Kawasaki wrote:
> > Let me add linux-nvme, Keith and Christoph to the CC list.
> >
> > --
> > Best Regards,
> > Shin'ichiro Kawasaki
> >
> >
> > On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> > > On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > > > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f) and
> > > > > > >> observed a process hang during a blktests block/005 run on an NVMe device.
> > > > > > >> The kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > > > >> seconds." with a call trace [1]. So far, the hang is 100% reproducible on my
> > > > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > > > >>
> > > > > > >> I bisected and found that commit 4f5022453acd ("nvme: wire up completion batching
> > > > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > > > >> for-next branch tip, the hang disappears. The block/005 test case switches the IO
> > > > > > >> scheduler during IO, and the completion path change in that commit looks to be
> > > > > > >> affecting the scheduler switch. Comments toward a solution would be appreciated.
> > > > > > >
> > > > > > > I'll take a look at this.
> > > > > >
> > > > > > I've tried running various things most of the day, and I cannot
> > > > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > > > split between batched completion and one-by-one completion, it works
> > > > > > just fine for me. No special care needs to be taken for put_many() on
> > > > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > > > >
> > > > > > Tell me more about your setup. What does the runtimes of the test look
> > > > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > > > this?
> > > > >
> > > > > Thank you for spending your precious time. With a kernel that does not hang,
> > > > > the test case completes in around 20 seconds. When the hang happens, the check
> > > > > script process stops at blk_mq_freeze_queue_wait() during the scheduler change, and fio
> > > > > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > > > > end, so I need to reboot the system for the next trial. While waiting for the test
> > > > > case to complete, the kernel repeats the same INFO message every 2 minutes.
> > > > >
> > > > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > > > >
> > > > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned namespace and
> > > > > a regular namespace, and the hang is observed with both namespaces. I have
> > > > > not yet tried other NVMe devices, so I will try them.
> > > > >
> > > > > >
> > > > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > > > preferable.
> > > > >
> > > > > I see. I have switched from the linux-block for-next branch to Linus' upstream
> > > > > branch. At git hash 879dbe9ffebc, the hang is still observed.
> > > >
> > > > Can you post the blk-mq debugfs log after the hang is triggered?
> > > >
> > > > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
> > >
> > > Thanks Ming. When I ran the command above, the grep command stopped when it
> > > opened tag-related files in the debugfs tree. That grep command appeared to be
> > > hanging as well, so I used the find command below instead, which excludes the
> > > tag-related files.
> > >
> > > # find . -type f -not -name *tag* -exec grep -aH . {} \;
> > >
> > > Here I share the captured log.
> > >
>
> It is a bit odd, since batched completion shouldn't be triggered when an IO
> scheduler is in use, but blk_mq_end_request_batch() does not restart the hctx,
> so maybe you can try the following patch:
>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 07eb1412760b..4c0c9af9235e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -846,16 +846,20 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
> rq_qos_done(rq->q, rq);
>
> if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
> - if (cur_hctx)
> + if (cur_hctx) {
> blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
> + blk_mq_sched_restart(cur_hctx);
> + }
> nr_tags = 0;
> cur_hctx = rq->mq_hctx;
> }
> tags[nr_tags++] = rq->tag;
> }
>
> - if (nr_tags)
> + if (nr_tags) {
> blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
> + blk_mq_sched_restart(cur_hctx);
> + }
> }
> EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);
Ming, thank you, but unfortunately I still observe the hang with the patch
above.
--
Best Regards,
Shin'ichiro Kawasaki
* Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next
2021-11-02 11:24 ` Shinichiro Kawasaki
@ 2021-11-02 12:26 ` Ming Lei
0 siblings, 0 replies; 5+ messages in thread
From: Ming Lei @ 2021-11-02 12:26 UTC (permalink / raw)
To: Shinichiro Kawasaki
Cc: Jens Axboe, linux-block@vger.kernel.org, Damien Le Moal,
linux-nvme@lists.infradead.org, Keith Busch, Christoph Hellwig,
ming.lei
On Tue, Nov 02, 2021 at 11:24:06AM +0000, Shinichiro Kawasaki wrote:
> On Nov 02, 2021 / 18:48, Ming Lei wrote:
> > On Tue, Nov 02, 2021 at 09:02:47AM +0000, Shinichiro Kawasaki wrote:
> > > Let me add linux-nvme, Keith and Christoph to the CC list.
> > >
> > > --
> > > Best Regards,
> > > Shin'ichiro Kawasaki
> > >
> > >
> > > On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> > > > On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > > > > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > > > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f) and
> > > > > > > >> observed a process hang during a blktests block/005 run on an NVMe device.
> > > > > > > >> The kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > > > > >> seconds." with a call trace [1]. So far, the hang is 100% reproducible on my
> > > > > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > > > > >>
> > > > > > > >> I bisected and found that commit 4f5022453acd ("nvme: wire up completion batching
> > > > > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > > > > >> for-next branch tip, the hang disappears. The block/005 test case switches the IO
> > > > > > > >> scheduler during IO, and the completion path change in that commit looks to be
> > > > > > > >> affecting the scheduler switch. Comments toward a solution would be appreciated.
> > > > > > > >
> > > > > > > > I'll take a look at this.
> > > > > > >
> > > > > > > I've tried running various things most of the day, and I cannot
> > > > > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > > > > split between batched completion and one-by-one completion, it works
> > > > > > > just fine for me. No special care needs to be taken for put_many() on
> > > > > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > > > > >
> > > > > > > Tell me more about your setup. What does the runtimes of the test look
> > > > > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > > > > this?
> > > > > >
> > > > > > Thank you for spending your precious time. With a kernel that does not hang,
> > > > > > the test case completes in around 20 seconds. When the hang happens, the check
> > > > > > script process stops at blk_mq_freeze_queue_wait() during the scheduler change, and fio
> > > > > > workload processes stop at __blkdev_direct_IO_simple(). The test case does not
> > > > > > end, so I need to reboot the system for the next trial. While waiting for the test
> > > > > > case to complete, the kernel repeats the same INFO message every 2 minutes.
> > > > > >
> > > > > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > > > > >
> > > > > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned namespace and
> > > > > > a regular namespace, and the hang is observed with both namespaces. I have
> > > > > > not yet tried other NVMe devices, so I will try them.
> > > > > >
> > > > > > >
> > > > > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > > > > preferable.
> > > > > >
> > > > > > I see. I have switched from the linux-block for-next branch to Linus' upstream
> > > > > > branch. At git hash 879dbe9ffebc, the hang is still observed.
> > > > >
> > > > > Can you post the blk-mq debugfs log after the hang is triggered?
> > > > >
> > > > > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
> > > >
> > > > Thanks Ming. When I ran the command above, the grep command stopped when it
> > > > opened tag-related files in the debugfs tree. That grep command appeared to be
> > > > hanging as well, so I used the find command below instead, which excludes the
> > > > tag-related files.
> > > >
> > > > # find . -type f -not -name *tag* -exec grep -aH . {} \;
> > > >
> > > > Here I share the captured log.
> > > >
> >
> > It is a bit odd, since batched completion shouldn't be triggered when an IO
> > scheduler is in use, but blk_mq_end_request_batch() does not restart the hctx,
> > so maybe you can try the following patch:
> >
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 07eb1412760b..4c0c9af9235e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -846,16 +846,20 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
> > rq_qos_done(rq->q, rq);
> >
> > if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
> > - if (cur_hctx)
> > + if (cur_hctx) {
> > blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
> > + blk_mq_sched_restart(cur_hctx);
> > + }
> > nr_tags = 0;
> > cur_hctx = rq->mq_hctx;
> > }
> > tags[nr_tags++] = rq->tag;
> > }
> >
> > - if (nr_tags)
> > + if (nr_tags) {
> > blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
> > + blk_mq_sched_restart(cur_hctx);
> > + }
> > }
> > EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);
>
> Ming, thank you, but unfortunately I still observe the hang with the patch
> above.
OK, thanks for your test. Care to try the following patch?
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 888c43aff1df..ef767d4e65da 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -837,6 +837,9 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
if (iob->need_ts)
__blk_mq_end_request_acct(rq, now);
+ if (rq->rq_flags & RQF_MQ_INFLIGHT)
+ __blk_mq_dec_active_requests(rq->mq_hctx);
+
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
if (!refcount_dec_and_test(&rq->ref))
continue;
Thanks,
Ming
end of thread, other threads:[~2021-11-02 12:27 UTC | newest]
Thread overview: 5+ messages
[not found] <20211101083417.fcttizyxpahrcgov@shindev>
[not found] ` <30d7ccec-c798-3936-67bd-e66ae59c318b@kernel.dk>
[not found] ` <f56c7b71-cef4-10be-7804-b171929cfb76@kernel.dk>
[not found] ` <20211102022214.7hetxsg4z2yqafyd@shindev>
[not found] ` <9e22ece3-d080-945d-5011-b0009b781798@nvidia.com>
[not found] ` <20211102081929.s6eyxawq32phvufr@shindev>
2021-11-02 8:28 ` [bug report] block/005 hangs with NVMe device and linux-block/for-next Damien Le Moal
[not found] ` <YYC0ESdW1+B/dDTs@T590>
[not found] ` <20211102082846.m632phnsaqnwtaec@shindev>
2021-11-02 9:02 ` Shinichiro Kawasaki
2021-11-02 10:48 ` Ming Lei
2021-11-02 11:24 ` Shinichiro Kawasaki
2021-11-02 12:26 ` Ming Lei