NFS workload leaves nfsd threads in D state

Linux NFS development

* NFS workload leaves nfsd threads in D state
@ 2023-07-08 18:30 Chuck Lever III
  2023-07-09  6:58 ` Linux regression tracking (Thorsten Leemhuis)
  2023-07-10  7:56 ` Christoph Hellwig
  0 siblings, 2 replies; 14+ messages in thread
From: Chuck Lever III @ 2023-07-08 18:30 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig
  Cc: linux-block@vger.kernel.org, Linux NFS Mailing List, Chuck Lever

Hi -

I have a "standard" test of running the git regression suite with
many threads against an NFS mount. I found that with 6.5-rc, the
test stalled and several nfsd threads on the server were stuck
in D state.
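
A thread in D state is in uninterruptible sleep. As a minimal sketch of
how such threads could be spotted on the server, the helper below pulls
the state letter out of one /proc/<pid>/stat line, where the state field
follows the parenthesised command name. The function name
task_state_from_stat is made up for this illustration; it is not part of
any tool used in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the one-letter task state from a /proc/<pid>/stat line such as
 * "1237 (nfsd) D 2 ...", or '\0' if the line does not look like one.
 * The command name may itself contain ')' characters, so search for the
 * last closing parenthesis rather than the first.
 */
static char task_state_from_stat(const char *line)
{
	const char *p;

	assert(line != NULL);
	p = strrchr(line, ')');
	if (p == NULL || p[1] != ' ' || p[2] == '\0')
		return '\0';
	return p[2];
}
```

A caller would read each /proc/<pid>/stat file and count the entries for
which the helper returns 'D'.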

I can reproduce this stall 100% with both an xfs and an ext4
export, so I bisected with both, and both bisects landed on the
same commit:

615939a2ae734e3e68c816d6749d1f5f79c62ab7 is the first bad commit
commit 615939a2ae734e3e68c816d6749d1f5f79c62ab7
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri May 19 06:40:48 2023 +0200

    blk-mq: defer to the normal submission path for post-flush requests

    Requests with the FUA bit on hardware without FUA support need a post
    flush before returning to the caller, but they can still be sent using
    the normal I/O path after initializing the flush-related fields and
    end I/O handler.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

 block/blk-flush.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
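
The commit message above turns on one rule: a FUA write to a device
without native FUA support has to be followed by a cache flush. The
following is a simplified userspace model of that decision, not the code
in block/blk-flush.c. The model_* types and FSEQ_* constants are
hypothetical stand-ins for the kernel's REQ_FSEQ_* steps, and the model
assumes that flushes only matter when a volatile write cache is present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the flush sequence steps. */
enum {
	FSEQ_PREFLUSH  = 1 << 0,	/* flush the cache before the data */
	FSEQ_DATA      = 1 << 1,	/* the data write itself */
	FSEQ_POSTFLUSH = 1 << 2,	/* flush the cache after the data */
};

struct model_req {
	bool has_data;		/* request carries a payload */
	bool want_preflush;	/* caller asked for a preceding flush */
	bool want_fua;		/* caller asked for forced unit access */
};

struct model_dev {
	bool write_cache;	/* device has a volatile write cache */
	bool fua;		/* device supports FUA natively */
};

/* Work out which steps a request needs on a given device. */
static unsigned int model_flush_policy(const struct model_dev *dev,
				       const struct model_req *rq)
{
	unsigned int policy = 0;

	assert(dev != NULL && rq != NULL);

	if (rq->has_data)
		policy |= FSEQ_DATA;

	if (dev->write_cache) {
		if (rq->want_preflush)
			policy |= FSEQ_PREFLUSH;
		/* FUA is emulated with a post flush when not supported. */
		if (rq->want_fua && !dev->fua)
			policy |= FSEQ_POSTFLUSH;
	}
	return policy;
}
```

In this model, a FUA write to a device with a write cache but no FUA
support yields data plus post flush, which is the case the commit sends
down the normal I/O path.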

On system 1: the exports are on top of /dev/mapper and reside on
an "INTEL SSDSC2BA400G3" SATA device.

On system 2: the exports are on top of /dev/mapper and reside on
an "INTEL SSDSC2KB240G8" SATA device.

System 1 was where I discovered the stall. System 2 is where I ran
the bisects.

The call stacks vary a little. I've seen stalls in both the WRITE
and SETATTR paths. Here's a sample from system 1:

INFO: task nfsd:1237 blocked for more than 122 seconds.
      Tainted: G        W          6.4.0-08699-g9e268189cb14 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nfsd            state:D stack:0     pid:1237  ppid:2      flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x78f/0x7db
 schedule+0x93/0xc8
 jbd2_log_wait_commit+0xb4/0xf4
 ? __pfx_autoremove_wake_function+0x10/0x10
 jbd2_complete_transaction+0x85/0x97
 ext4_fc_commit+0x118/0x70a
 ? _raw_spin_unlock+0x18/0x2e
 ? __mark_inode_dirty+0x282/0x302
 ext4_write_inode+0x94/0x121
 ext4_nfs_commit_metadata+0x72/0x7d
 commit_inode_metadata+0x1f/0x31 [nfsd]
 commit_metadata+0x26/0x33 [nfsd]
 nfsd_setattr+0x2f2/0x30e [nfsd]
 nfsd_create_setattr+0x4e/0x87 [nfsd]
 nfsd4_open+0x604/0x8fa [nfsd]
 nfsd4_proc_compound+0x4a8/0x5e3 [nfsd]
 ? nfs4svc_decode_compoundargs+0x291/0x2de [nfsd]
 nfsd_dispatch+0xb3/0x164 [nfsd]
 svc_process_common+0x3c7/0x53a [sunrpc]
 ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
 svc_process+0xc6/0xe3 [sunrpc]
 nfsd+0xf2/0x18c [nfsd]
 ? __pfx_nfsd+0x10/0x10 [nfsd]
 kthread+0x10d/0x115
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>

--
Chuck Lever




* Re: NFS workload leaves nfsd threads in D state
  2023-07-08 18:30 NFS workload leaves nfsd threads in D state Chuck Lever III
@ 2023-07-09  6:58 ` Linux regression tracking (Thorsten Leemhuis)
  2023-07-10  7:56 ` Christoph Hellwig
  1 sibling, 0 replies; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-07-09  6:58 UTC (permalink / raw)
  To: Chuck Lever III, Jens Axboe, Christoph Hellwig
  Cc: linux-block@vger.kernel.org, Linux NFS Mailing List, Chuck Lever,
	Linux kernel regressions list

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 08.07.23 20:30, Chuck Lever III wrote:
> 
> I have a "standard" test of running the git regression suite with
> many threads against an NFS mount. I found that with 6.5-rc, the
> test stalled and several nfsd threads on the server were stuck
> in D state.
> 
> I can reproduce this stall 100% with both an xfs and an ext4
> export, so I bisected with both, and both bisects landed on the
> same commit:
> 
> 615939a2ae734e3e68c816d6749d1f5f79c62ab7 is the first bad commit
> commit 615939a2ae734e3e68c816d6749d1f5f79c62ab7
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Fri May 19 06:40:48 2023 +0200
> 
>     blk-mq: defer to the normal submission path for post-flush requests
> 
>     Requests with the FUA bit on hardware without FUA support need a post
>     flush before returning to the caller, but they can still be sent using
>     the normal I/O path after initializing the flush-related fields and
>     end I/O handler.
> 
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Reviewed-by: Bart Van Assche <bvanassche@acm.org>
>     Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@lst.de
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
>  block/blk-flush.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> On system 1: the exports are on top of /dev/mapper and reside on
> an "INTEL SSDSC2BA400G3" SATA device.
> 
> On system 2: the exports are on top of /dev/mapper and reside on
> an "INTEL SSDSC2KB240G8" SATA device.
> 
> System 1 was where I discovered the stall. System 2 is where I ran
> the bisects.
> 
> The call stacks vary a little. I've seen stalls in both the WRITE
> and SETATTR paths. Here's a sample from system 1:
> 
> INFO: task nfsd:1237 blocked for more than 122 seconds.
>       Tainted: G        W          6.4.0-08699-g9e268189cb14 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:nfsd            state:D stack:0     pid:1237  ppid:2      flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x78f/0x7db
>  schedule+0x93/0xc8
>  jbd2_log_wait_commit+0xb4/0xf4
>  ? __pfx_autoremove_wake_function+0x10/0x10
>  jbd2_complete_transaction+0x85/0x97
>  ext4_fc_commit+0x118/0x70a
>  ? _raw_spin_unlock+0x18/0x2e
>  ? __mark_inode_dirty+0x282/0x302
>  ext4_write_inode+0x94/0x121
>  ext4_nfs_commit_metadata+0x72/0x7d
>  commit_inode_metadata+0x1f/0x31 [nfsd]
>  commit_metadata+0x26/0x33 [nfsd]
>  nfsd_setattr+0x2f2/0x30e [nfsd]
>  nfsd_create_setattr+0x4e/0x87 [nfsd]
>  nfsd4_open+0x604/0x8fa [nfsd]
>  nfsd4_proc_compound+0x4a8/0x5e3 [nfsd]
>  ? nfs4svc_decode_compoundargs+0x291/0x2de [nfsd]
>  nfsd_dispatch+0xb3/0x164 [nfsd]
>  svc_process_common+0x3c7/0x53a [sunrpc]
>  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
>  svc_process+0xc6/0xe3 [sunrpc]
>  nfsd+0xf2/0x18c [nfsd]
>  ? __pfx_nfsd+0x10/0x10 [nfsd]
>  kthread+0x10d/0x115
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x2c/0x50
>  </TASK>

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 615939a2ae734e
#regzbot title blk-mq: NFS workload leaves nfsd threads in D state
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


* Re: NFS workload leaves nfsd threads in D state
  2023-07-08 18:30 NFS workload leaves nfsd threads in D state Chuck Lever III
  2023-07-09  6:58 ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-07-10  7:56 ` Christoph Hellwig
  2023-07-10 14:06   ` Chuck Lever III
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2023-07-10  7:56 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Jens Axboe, Christoph Hellwig, linux-block@vger.kernel.org,
	Linux NFS Mailing List, Chuck Lever

On Sat, Jul 08, 2023 at 06:30:26PM +0000, Chuck Lever III wrote:
> Hi -
> 
> I have a "standard" test of running the git regression suite with
> many threads against an NFS mount. I found that with 6.5-rc, the
> test stalled and several nfsd threads on the server were stuck
> in D state.

Can you paste the exact reproducer here?

> I can reproduce this stall 100% with both an xfs and an ext4
> export, so I bisected with both, and both bisects landed on the
> same commit:

> On system 1: the exports are on top of /dev/mapper and reside on
> an "INTEL SSDSC2BA400G3" SATA device.
> 
> On system 2: the exports are on top of /dev/mapper and reside on
> an "INTEL SSDSC2KB240G8" SATA device.
> 
> System 1 was where I discovered the stall. System 2 is where I ran
> the bisects.

Ok. I'd be curious if this reproduces without either device mapper
or on a non-SATA device.  If you have an easy way to run it in a VM
that'd be great.  Otherwise I'll try to recreate it in various
setups if you post the exact reproducer.



* Re: NFS workload leaves nfsd threads in D state
  2023-07-10  7:56 ` Christoph Hellwig
@ 2023-07-10 14:06   ` Chuck Lever III
  2023-07-10 15:10     ` Chuck Lever III
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever III @ 2023-07-10 14:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
	Chuck Lever

> On Jul 10, 2023, at 3:56 AM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Sat, Jul 08, 2023 at 06:30:26PM +0000, Chuck Lever III wrote:
>> Hi -
>> 
>> I have a "standard" test of running the git regression suite with
>> many threads against an NFS mount. I found that with 6.5-rc, the
>> test stalled and several nfsd threads on the server were stuck
>> in D state.
> 
> Can you paste the exact reproducer here?

It's a twisty little maze of scripts, but it does essentially this:

1. export a test filesystem on system B

2. mount that export on system A via NFS (I think I used NFSv4.1)

3. download the latest git tarball on system A

4. unpack the tarball on the test NFS mount on system A. umount / mount

5. "make -jN all docs" on system A, where N is nprocs. umount / mount

6. "make -jN test" on system A, where N is as in step 5.

(For "make test" to work, the mounted-on directory on system A has to be
exactly the same for all steps).

My system A has 12 cores, and B has 4, fwiw. The network fabric
is InfiniBand, but I suspect that won't make much difference.

During step 6, the tests will slow down and then stop cold.
After another two minutes, on system B you'll start to see the
INFO splats about hung processes.

As an interesting side note, I have a btrfs filesystem on that same
mapper group and physical device. I'm not able to reproduce the problem
on that filesystem.

>> I can reproduce this stall 100% with both an xfs and an ext4
>> export, so I bisected with both, and both bisects landed on the
>> same commit:
> 
>> On system 1: the exports are on top of /dev/mapper and reside on
>> an "INTEL SSDSC2BA400G3" SATA device.
>> 
>> On system 2: the exports are on top of /dev/mapper and reside on
>> an "INTEL SSDSC2KB240G8" SATA device.
>> 
>> System 1 was where I discovered the stall. System 2 is where I ran
>> the bisects.
> 
> Ok. I'd be curious if this reproduces without either device mapper
> or on a non-SATA device.  If you have an easy way to run it in a VM
> that'd be great.  Otherwise I'll try to recreate it in various
> setups if you post the exact reproducer.

I have a way to test it on an xfs export backed by a pair of AIC
NVMe devices. Stand by.

--
Chuck Lever


* Re: NFS workload leaves nfsd threads in D state
  2023-07-10 14:06   ` Chuck Lever III
@ 2023-07-10 15:10     ` Chuck Lever III
  2023-07-10 15:18       ` Christoph Hellwig
  2023-07-10 17:28       ` Christoph Hellwig
  0 siblings, 2 replies; 14+ messages in thread
From: Chuck Lever III @ 2023-07-10 15:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
	Chuck Lever


> On Jul 10, 2023, at 10:06 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
>> On Jul 10, 2023, at 3:56 AM, Christoph Hellwig <hch@lst.de> wrote:
>> 
>> Ok. I'd be curious if this reproduces without either device mapper
>> or on a non-SATA device.  If you have an easy way to run it in a VM
>> that'd be great.  Otherwise I'll try to recreate it in various
>> setups if you post the exact reproducer.
> 
> I have a way to test it on an xfs export backed by a pair of AIC
> NVMe devices. Stand by.

The issue is not reproducible with PCIe NVMe devices.


--
Chuck Lever




* Re: NFS workload leaves nfsd threads in D state
  2023-07-10 15:10     ` Chuck Lever III
@ 2023-07-10 15:18       ` Christoph Hellwig
  2023-07-10 17:28       ` Christoph Hellwig
  1 sibling, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2023-07-10 15:18 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Christoph Hellwig, Jens Axboe, linux-block@vger.kernel.org,
	Linux NFS Mailing List, Chuck Lever

On Mon, Jul 10, 2023 at 03:10:41PM +0000, Chuck Lever III wrote:
> > I have a way to test it on an xfs export backed by a pair of AIC
> > NVMe devices. Stand by.
> 
> The issue is not reproducible with PCIe NVMe devices.

Thanks.  ATA is special because it can't queue flush commands, unlike
all other protocols.  I'll see if I can get an ATA test setup and will
look into it ASAP.
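
One way to picture the constraint mentioned here, as a toy model only: a
flush that cannot be queued has to wait until the device has no queued
commands outstanding, and nothing else can be issued while it runs. The
toy_port structure and toy_can_issue() are hypothetical and do not
correspond to libata code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one device port. */
struct toy_port {
	int queued_inflight;	/* queued commands currently outstanding */
	bool flush_inflight;	/* a non-queued flush is running */
};

/* Decide whether the next command may be sent to the device now. */
static bool toy_can_issue(const struct toy_port *p, bool is_flush)
{
	assert(p != NULL);

	/* A running flush owns the device exclusively. */
	if (p->flush_inflight)
		return false;

	/* A flush has to wait for the queue to drain first. */
	if (is_flush)
		return p->queued_inflight == 0;

	return true;
}
```

In such a model, how flushes and data commands are ordered against each
other decides whether the queue ever drains far enough for the flush to
start.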


* Re: NFS workload leaves nfsd threads in D state
  2023-07-10 15:10     ` Chuck Lever III
  2023-07-10 15:18       ` Christoph Hellwig
@ 2023-07-10 17:28       ` Christoph Hellwig
  2023-07-10 17:40         ` Chuck Lever III
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2023-07-10 17:28 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Christoph Hellwig, Jens Axboe, linux-block@vger.kernel.org,
	Linux NFS Mailing List, Chuck Lever

Only found a virtualized AHCI setup so far without much success.  Can
you try this (from Chengming Zhou) in the meantime?

diff --git a/block/blk-flush.c b/block/blk-flush.c
index dba392cf22bec6..5c392a277b9eb2 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -443,7 +443,7 @@ bool blk_insert_flush(struct request *rq)
 		 * the post flush, and then just pass the command on.
 		 */
 		blk_rq_init_flush(rq);
-		rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
+		rq->flush.seq |= REQ_FSEQ_PREFLUSH;
 		spin_lock_irq(&fq->mq_flush_lock);
 		list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
 		spin_unlock_irq(&fq->mq_flush_lock);


* Re: NFS workload leaves nfsd threads in D state
  2023-07-10 17:28       ` Christoph Hellwig
@ 2023-07-10 17:40         ` Chuck Lever III
  2023-07-11 12:01           ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever III @ 2023-07-10 17:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
	Chuck Lever


> On Jul 10, 2023, at 1:28 PM, Christoph Hellwig <hch@lst.de> wrote:
> 
> Only found a virtualized AHCI setup so far without much success.  Can
> you try this (from Chengming Zhou) in the meantime?
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index dba392cf22bec6..5c392a277b9eb2 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -443,7 +443,7 @@ bool blk_insert_flush(struct request *rq)
> * the post flush, and then just pass the command on.
> */
> blk_rq_init_flush(rq);
> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
> spin_lock_irq(&fq->mq_flush_lock);
> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
> spin_unlock_irq(&fq->mq_flush_lock);

Thanks for the quick response. No change.

--
Chuck Lever




* Re: NFS workload leaves nfsd threads in D state
  2023-07-10 17:40         ` Chuck Lever III
@ 2023-07-11 12:01           ` Christoph Hellwig
  2023-07-12 11:34             ` Chengming Zhou
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2023-07-11 12:01 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Christoph Hellwig, Jens Axboe, linux-block@vger.kernel.org,
	Linux NFS Mailing List, Chuck Lever

On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
> > blk_rq_init_flush(rq);
> > - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
> > + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
> > spin_lock_irq(&fq->mq_flush_lock);
> > list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
> > spin_unlock_irq(&fq->mq_flush_lock);
> 
> Thanks for the quick response. No change.

I'm a bit lost and still can't reproduce.  Below is a patch with the
only behavior differences I can find.  It has two "#if 1" blocks,
which I'll need to bisect to find out which made it work (if any,
but I hope so).

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5504719b970d59..67364e607f2d1d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2927,6 +2927,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	struct blk_plug *plug = blk_mq_plug(bio);
 	const int is_sync = op_is_sync(bio->bi_opf);
+	bool is_flush = op_is_flush(bio->bi_opf);
 	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
 	unsigned int nr_segs = 1;
@@ -2967,16 +2968,23 @@ void blk_mq_submit_bio(struct bio *bio)
 		return;
 	}
 
-	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
-		return;
-
-	if (plug) {
-		blk_add_rq_to_plug(plug, rq);
-		return;
+#if 1	/* Variant 1, the plug is holding us back */
+	if (op_is_flush(bio->bi_opf)) {
+		if (blk_insert_flush(rq))
+			return;
+	} else {
+		if (plug) {
+			blk_add_rq_to_plug(plug, rq);
+			return;
+		}
 	}
+#endif
 
 	hctx = rq->mq_hctx;
 	if ((rq->rq_flags & RQF_USE_SCHED) ||
+#if 1	/* Variant 2 (unlikely), blk_mq_try_issue_directly causes problems */
+	    is_flush || 
+#endif
 	    (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
 		blk_mq_insert_request(rq, 0);
 		blk_mq_run_hw_queue(hctx, true);
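
To make the two variants easier to compare, here is a small userspace
model of the routing decision the patch touches. It is a sketch under
assumptions, not kernel code: struct submit_ctx, enum path and route()
are hypothetical names, variant1 = false models the '-' lines of the
diff, and the direct-issue fallback is inferred from the comment on
Variant 2 rather than shown in the hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum path {
	PATH_FLUSH_MACHINE,	/* blk_insert_flush() took the request */
	PATH_PLUG,		/* request was added to the plug list */
	PATH_INSERT,		/* inserted and the hardware queue run */
	PATH_DIRECT,		/* issued directly (assumed fallback) */
};

struct submit_ctx {
	bool is_flush;		/* op_is_flush(bio->bi_opf) */
	bool flush_took_it;	/* blk_insert_flush(rq) returned true */
	bool have_plug;		/* a plug is active */
	bool use_sched;		/* RQF_USE_SCHED is set */
	bool dispatch_busy;	/* hctx->dispatch_busy */
	bool single_hw_queue;	/* q->nr_hw_queues == 1 */
	bool is_sync;		/* op_is_sync(bio->bi_opf) */
};

static enum path route(const struct submit_ctx *c, bool variant1, bool variant2)
{
	assert(c != NULL);

	if (c->is_flush && c->flush_took_it)
		return PATH_FLUSH_MACHINE;

	/* Variant 1: flush requests no longer go into the plug. */
	if (c->have_plug && !(variant1 && c->is_flush))
		return PATH_PLUG;

	/* Variant 2: flush requests never take the direct-issue path. */
	if (c->use_sched || (variant2 && c->is_flush) ||
	    (c->dispatch_busy && (c->single_hw_queue || !c->is_sync)))
		return PATH_INSERT;

	return PATH_DIRECT;
}
```

For a post-flush request that blk_insert_flush() hands back while a plug
is active, the model returns PATH_PLUG without the patch, PATH_DIRECT
with only Variant 1, and PATH_INSERT with both variants enabled.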


* Re: NFS workload leaves nfsd threads in D state
  2023-07-11 12:01           ` Christoph Hellwig
@ 2023-07-12 11:34             ` Chengming Zhou
  2023-07-12 13:29               ` Chuck Lever III
  0 siblings, 1 reply; 14+ messages in thread
From: Chengming Zhou @ 2023-07-12 11:34 UTC (permalink / raw)
  To: Chuck Lever III, Christoph Hellwig, ross.lagerwall
  Cc: Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
	Chuck Lever

On 2023/7/11 20:01, Christoph Hellwig wrote:
> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>> blk_rq_init_flush(rq);
>>> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>>> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
>>> spin_lock_irq(&fq->mq_flush_lock);
>>> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>>> spin_unlock_irq(&fq->mq_flush_lock);
>>
>> Thanks for the quick response. No change.
> 
> I'm a bit lost and still can't reproduce.  Below is a patch with the
> only behavior differences I can find.  It has two "#if 1" blocks,
> which I'll need to bisect to find out which made it work (if any,
> but I hope so).

Hello,

I tried to reproduce this today, but unfortunately I couldn't.

Could you please also try the patch [1] from Ross Lagerwall, which fixes
an I/O hang problem with recursive plug flushes?

(The main difference is that post-flush requests can now go into the plug.)

[1] https://lore.kernel.org/all/20230711160434.248868-1-ross.lagerwall@citrix.com/

Thanks!

> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 5504719b970d59..67364e607f2d1d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2927,6 +2927,7 @@ void blk_mq_submit_bio(struct bio *bio)
>  	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  	struct blk_plug *plug = blk_mq_plug(bio);
>  	const int is_sync = op_is_sync(bio->bi_opf);
> +	bool is_flush = op_is_flush(bio->bi_opf);
>  	struct blk_mq_hw_ctx *hctx;
>  	struct request *rq;
>  	unsigned int nr_segs = 1;
> @@ -2967,16 +2968,23 @@ void blk_mq_submit_bio(struct bio *bio)
>  		return;
>  	}
>  
> -	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
> -		return;
> -
> -	if (plug) {
> -		blk_add_rq_to_plug(plug, rq);
> -		return;
> +#if 1	/* Variant 1, the plug is holding us back */
> +	if (op_is_flush(bio->bi_opf)) {
> +		if (blk_insert_flush(rq))
> +			return;
> +	} else {
> +		if (plug) {
> +			blk_add_rq_to_plug(plug, rq);
> +			return;
> +		}
>  	}
> +#endif
>  
>  	hctx = rq->mq_hctx;
>  	if ((rq->rq_flags & RQF_USE_SCHED) ||
> +#if 1	/* Variant 2 (unlikely), blk_mq_try_issue_directly causes problems */
> +	    is_flush || 
> +#endif
>  	    (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
>  		blk_mq_insert_request(rq, 0);
>  		blk_mq_run_hw_queue(hctx, true);


* Re: NFS workload leaves nfsd threads in D state
  2023-07-12 11:34             ` Chengming Zhou
@ 2023-07-12 13:29               ` Chuck Lever III
  2023-07-25  9:57                 ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever III @ 2023-07-12 13:29 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Christoph Hellwig, ross.lagerwall@citrix.com, Jens Axboe,
	linux-block@vger.kernel.org, Linux NFS Mailing List, Chuck Lever



> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
> 
> On 2023/7/11 20:01, Christoph Hellwig wrote:
>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>>> blk_rq_init_flush(rq);
>>>> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>>>> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
>>>> spin_lock_irq(&fq->mq_flush_lock);
>>>> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>>>> spin_unlock_irq(&fq->mq_flush_lock);
>>> 
>>> Thanks for the quick response. No change.
>> 
>> I'm a bit lost and still can't reproduce.  Below is a patch with the
>> only behavior differences I can find.  It has two "#if 1" blocks,
>> which I'll need to bisect to find out which made it work (if any,
>> but I hope so).
> 
> Hello,
> 
> I tried today to reproduce, but can't unfortunately.
> 
> Could you please also try the fix patch [1] from Ross Lagerwall that fixes
> IO hung problem of plug recursive flush?
> 
> (Since the main difference is that post-flush requests now can go into plug.)
> 
> [1] https://lore.kernel.org/all/20230711160434.248868-1-ross.lagerwall@citrix.com/

Thanks for the suggestion. No change, unfortunately.


> Thanks!
> 
>> 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 5504719b970d59..67364e607f2d1d 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2927,6 +2927,7 @@ void blk_mq_submit_bio(struct bio *bio)
>> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>> struct blk_plug *plug = blk_mq_plug(bio);
>> const int is_sync = op_is_sync(bio->bi_opf);
>> + bool is_flush = op_is_flush(bio->bi_opf);
>> struct blk_mq_hw_ctx *hctx;
>> struct request *rq;
>> unsigned int nr_segs = 1;
>> @@ -2967,16 +2968,23 @@ void blk_mq_submit_bio(struct bio *bio)
>> return;
>> }
>> 
>> - if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
>> - return;
>> -
>> - if (plug) {
>> - blk_add_rq_to_plug(plug, rq);
>> - return;
>> +#if 1 /* Variant 1, the plug is holding us back */
>> + if (op_is_flush(bio->bi_opf)) {
>> + if (blk_insert_flush(rq))
>> + return;
>> + } else {
>> + if (plug) {
>> + blk_add_rq_to_plug(plug, rq);
>> + return;
>> + }
>> }
>> +#endif
>> 
>> hctx = rq->mq_hctx;
>> if ((rq->rq_flags & RQF_USE_SCHED) ||
>> +#if 1 /* Variant 2 (unlikely), blk_mq_try_issue_directly causes problems */
>> +     is_flush || 
>> +#endif
>>     (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
>> blk_mq_insert_request(rq, 0);
>> blk_mq_run_hw_queue(hctx, true);


--
Chuck Lever




* Re: NFS workload leaves nfsd threads in D state
  2023-07-12 13:29               ` Chuck Lever III
@ 2023-07-25  9:57                 ` Linux regression tracking (Thorsten Leemhuis)
  2023-07-25 13:21                   ` Chuck Lever III
  0 siblings, 1 reply; 14+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-07-25  9:57 UTC (permalink / raw)
  To: Chuck Lever III, Chengming Zhou
  Cc: Christoph Hellwig, ross.lagerwall@citrix.com, Jens Axboe,
	linux-block@vger.kernel.org, Linux NFS Mailing List, Chuck Lever,
	Linux kernel regressions list

On 12.07.23 15:29, Chuck Lever III wrote:
>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>>>> blk_rq_init_flush(rq);
>>>>> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>>>>> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
>>>>> spin_lock_irq(&fq->mq_flush_lock);
>>>>> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>>>>> spin_unlock_irq(&fq->mq_flush_lock);
>>>>
>>>> Thanks for the quick response. No change.
>>> I'm a bit lost and still can't reproduce.  Below is a patch with the
>>> only behavior differences I can find.  It has two "#if 1" blocks,
>>> which I'll need to bisect to find out which made it work (if any,
>>> but I hope so).
>>
>> I tried today to reproduce, but can't unfortunately.
>>
>> Could you please also try the fix patch [1] from Ross Lagerwall that fixes
>> IO hung problem of plug recursive flush?
>>
>> (Since the main difference is that post-flush requests now can go into plug.)
>>
>> [1] https://lore.kernel.org/all/20230711160434.248868-1-ross.lagerwall@citrix.com/
> 
> Thanks for the suggestion. No change, unfortunately.

Chuck, what's the status here? This thread looks stalled, that's why I
wonder.

FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
requests")). But unless I missed something you are not CCed, so I guess
that's a different issue?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke


* Re: NFS workload leaves nfsd threads in D state
  2023-07-25  9:57                 ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-07-25 13:21                   ` Chuck Lever III
  2023-07-25 13:34                     ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever III @ 2023-07-25 13:21 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Chengming Zhou, Christoph Hellwig, ross.lagerwall@citrix.com,
	Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
	Chuck Lever



> On Jul 25, 2023, at 5:57 AM, Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
> 
> On 12.07.23 15:29, Chuck Lever III wrote:
>>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>>>>> blk_rq_init_flush(rq);
>>>>>> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>>>>>> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
>>>>>> spin_lock_irq(&fq->mq_flush_lock);
>>>>>> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>>>>>> spin_unlock_irq(&fq->mq_flush_lock);
>>>>> 
>>>>> Thanks for the quick response. No change.
>>>> I'm a bit lost and still can't reproduce.  Below is a patch with the
>>>> only behavior differences I can find.  It has two "#if 1" blocks,
>>>> which I'll need to bisect to find out which made it work (if any,
>>>> but I hope so).
>>> 
>>> I tried today to reproduce, but can't unfortunately.
>>> 
>>> Could you please also try the fix patch [1] from Ross Lagerwall that fixes
>>> IO hung problem of plug recursive flush?
>>> 
>>> (Since the main difference is that post-flush requests now can go into plug.)
>>> 
>>> [1] https://lore.kernel.org/all/20230711160434.248868-1-ross.lagerwall@citrix.com/
>> 
>> Thanks for the suggestion. No change, unfortunately.
> 
> Chuck, what's the status here? This thread looks stalled, that's why I
> wonder.
> 
> FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
> 28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
> requests")). But unless I missed something you are not CCed, so I guess
> that's a different issue?

Hi Thorsten-

This issue was fixed in 6.5-rc2 by commit

9f87fc4d72f5 ("block: queue data commands from the flush state machine at the head")
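
The title of that commit refers to queueing at the head. As a toy
illustration of what head insertion means for ordering, and not a
rendering of the actual fix, the sketch below uses a small array-backed
queue. All toyq_* names and the capacity are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOYQ_CAP 8

/* Tiny dispatch queue: ids[0] is the next request to be issued. */
struct toyq {
	int ids[TOYQ_CAP];
	int len;
};

/* Normal insertion: the request waits behind everything queued. */
static void toyq_add_tail(struct toyq *q, int id)
{
	assert(q != NULL && q->len < TOYQ_CAP);
	q->ids[q->len++] = id;
}

/* Head insertion: the request becomes the next one to be issued. */
static void toyq_add_head(struct toyq *q, int id)
{
	assert(q != NULL && q->len < TOYQ_CAP);
	memmove(&q->ids[1], &q->ids[0], (size_t)q->len * sizeof(q->ids[0]));
	q->ids[0] = id;
	q->len++;
}

/* Take the next request to issue off the front of the queue. */
static int toyq_pop(struct toyq *q)
{
	int id;

	assert(q != NULL && q->len > 0);
	id = q->ids[0];
	q->len--;
	memmove(&q->ids[0], &q->ids[1], (size_t)q->len * sizeof(q->ids[0]));
	return id;
}
```

With ordinary requests 1 and 2 already queued, a request added at the
head is popped before both of them, whereas a tail insertion would leave
it waiting behind them.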


--
Chuck Lever




* Re: NFS workload leaves nfsd threads in D state
  2023-07-25 13:21                   ` Chuck Lever III
@ 2023-07-25 13:34                     ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 0 replies; 14+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-07-25 13:34 UTC (permalink / raw)
  To: Chuck Lever III, Linux regressions mailing list
  Cc: Chengming Zhou, Christoph Hellwig, ross.lagerwall@citrix.com,
	Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
	Chuck Lever

On 25.07.23 15:21, Chuck Lever III wrote:
>> On Jul 25, 2023, at 5:57 AM, Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
>>
>> On 12.07.23 15:29, Chuck Lever III wrote:
>>>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>
>> Chuck, what's the status here? This thread looks stalled, that's why I
>> wonder.
>>
>> FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
>> 28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
>> requests")). But unless I missed something you are not CCed, so I guess
>> that's a different issue?
> 
> This issue was fixed in 6.5-rc2 by commit
> 
> 9f87fc4d72f5 ("block: queue data commands from the flush state machine at the head")

Ahh, many thx for the update, it lacked a proper Link:/Closes: tag and
mentioned another culprit, so I had missed that!

#regzbot fix: 9f87fc4d72f5
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
