* [PATCH RFC] block: Revert 615939a2ae73
@ 2023-08-09 13:08 Chuck Lever
2023-08-09 15:51 ` Jens Axboe
0 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2023-08-09 13:08 UTC (permalink / raw)
To: linux-block; +Cc: Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Even after commit 9f87fc4d72f5 ("block: queue data commands from the
flush state machine at the head"), I'm still seeing hangs on NFS
exports that reside on SATA devices.
Bisected to commit 615939a2ae73 ("blk-mq: defer to the normal
submission path for post-flush requests"). I've tested reverting
that commit on several kernels from 0aa69d53ac7c ("Merge tag
'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux") to
v6.5-rc5, and it seems to address the problem.
I also confirmed that commit 9f87fc4d72f5 ("block: queue data
commands from the flush state machine at the head") is still
necessary.
There is still no root cause.
Fixes: 615939a2ae73 ("blk-mq: defer to the normal submission path for post-flush requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
block/blk-flush.c | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 8220517c2d67..e4d12f2c344c 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -437,17 +437,6 @@ bool blk_insert_flush(struct request *rq)
* Queue for normal execution.
*/
return false;
- case REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH:
- /*
- * Initialize the flush fields and completion handler to trigger
- * the post flush, and then just pass the command on.
- */
- blk_rq_init_flush(rq);
- rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
- spin_lock_irq(&fq->mq_flush_lock);
- list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
- spin_unlock_irq(&fq->mq_flush_lock);
- return false;
default:
/*
* Mark the request as part of a flush sequence and submit it
^ permalink raw reply related [flat|nested] 9+ messages in thread* Re: [PATCH RFC] block: Revert 615939a2ae73 2023-08-09 13:08 [PATCH RFC] block: Revert 615939a2ae73 Chuck Lever @ 2023-08-09 15:51 ` Jens Axboe 2023-08-09 16:11 ` Christoph Hellwig 0 siblings, 1 reply; 9+ messages in thread From: Jens Axboe @ 2023-08-09 15:51 UTC (permalink / raw) To: Chuck Lever, linux-block; +Cc: Chuck Lever, Christoph Hellwig On 8/9/23 7:08 AM, Chuck Lever wrote: > From: Chuck Lever <chuck.lever@oracle.com> > > Even after commit 9f87fc4d72f5 ("block: queue data commands from the > flush state machine at the head"), I'm still seeing hangs on NFS > exports that reside on SATA devices. > > Bisected to commit 615939a2ae73 ("blk-mq: defer to the normal > submission path for post-flush requests"). I've tested reverting > that commit on several kernels from 0aa69d53ac7c ("Merge tag > 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux") to > v6.5-rc5, and it seems to address the problem. > > I also confirmed that commit 9f87fc4d72f5 ("block: queue data > commands from the flush state machine at the head") is still > necessary. Adding the author - Christoph? -- Jens Axboe ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] block: Revert 615939a2ae73 2023-08-09 15:51 ` Jens Axboe @ 2023-08-09 16:11 ` Christoph Hellwig 2023-08-09 16:26 ` Chuck Lever III 0 siblings, 1 reply; 9+ messages in thread From: Christoph Hellwig @ 2023-08-09 16:11 UTC (permalink / raw) To: Jens Axboe; +Cc: Chuck Lever, linux-block, Chuck Lever, Christoph Hellwig On Wed, Aug 09, 2023 at 09:51:18AM -0600, Jens Axboe wrote: > > Bisected to commit 615939a2ae73 ("blk-mq: defer to the normal > > submission path for post-flush requests"). I've tested reverting > > that commit on several kernels from 0aa69d53ac7c ("Merge tag > > 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux") to > > v6.5-rc5, and it seems to address the problem. > > > > I also confirmed that commit 9f87fc4d72f5 ("block: queue data > > commands from the flush state machine at the head") is still > > necessary. > > Adding the author - Christoph? Hmm, weird. I though we had all this sorted out a few weeks ago when that 9f87fc4d72f5 fixed it for you? What is different in this round of testing? ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] block: Revert 615939a2ae73 2023-08-09 16:11 ` Christoph Hellwig @ 2023-08-09 16:26 ` Chuck Lever III 2023-08-09 21:49 ` Christoph Hellwig 0 siblings, 1 reply; 9+ messages in thread From: Chuck Lever III @ 2023-08-09 16:26 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Jens Axboe, Chuck Lever, linux-block@vger.kernel.org > On Aug 9, 2023, at 12:11 PM, Christoph Hellwig <hch@lst.de> wrote: > > On Wed, Aug 09, 2023 at 09:51:18AM -0600, Jens Axboe wrote: >>> Bisected to commit 615939a2ae73 ("blk-mq: defer to the normal >>> submission path for post-flush requests"). I've tested reverting >>> that commit on several kernels from 0aa69d53ac7c ("Merge tag >>> 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux") to >>> v6.5-rc5, and it seems to address the problem. >>> >>> I also confirmed that commit 9f87fc4d72f5 ("block: queue data >>> commands from the flush state machine at the head") is still >>> necessary. >> >> Adding the author - Christoph? > > Hmm, weird. I though we had all this sorted out a few weeks > ago when that 9f87fc4d72f5 fixed it for you? What is different > in this round of testing? I don't know the answer to that... I suspect that I originally tested the first revert on a kernel that also did not have 615939 applied. At the time we thought only one of those commits was broken. I've bisected this every way I can think of, and both fixes are needed to fully stop the hangs. -- Chuck Lever ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] block: Revert 615939a2ae73 2023-08-09 16:26 ` Chuck Lever III @ 2023-08-09 21:49 ` Christoph Hellwig 2023-08-10 3:42 ` Chengming Zhou 0 siblings, 1 reply; 9+ messages in thread From: Christoph Hellwig @ 2023-08-09 21:49 UTC (permalink / raw) To: Chuck Lever III Cc: Christoph Hellwig, Jens Axboe, Chuck Lever, linux-block@vger.kernel.org, Chengming Zhou Oh well. I don't feel like we're going to find the root cause given that its late in the merge window and I'm running out of time I have to work due to the annual summer vacation. Unless someone like Chengming who knows the flush code pretty well feels like working with chuck on a few more iterations we'll have to revert it. Which is going to be a very painful merge with Chengming's work in the for-6.6 branch. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] block: Revert 615939a2ae73 2023-08-09 21:49 ` Christoph Hellwig @ 2023-08-10 3:42 ` Chengming Zhou 2023-08-10 13:47 ` Chuck Lever III ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: Chengming Zhou @ 2023-08-10 3:42 UTC (permalink / raw) To: Christoph Hellwig, Chuck Lever III Cc: Jens Axboe, Chuck Lever, linux-block@vger.kernel.org On 2023/8/10 05:49, Christoph Hellwig wrote: > Oh well. I don't feel like we're going to find the root cause > given that its late in the merge window and I'm running out of > time I have to work due to the annual summer vacation. Unless > someone like Chengming who knows the flush code pretty well > feels like working with chuck on a few more iterations we'll Hi, No problem, I can work with Chuck to find the root cause if Chuck has time too, since I still can't reproduce it by myself. Maybe we should first find what's the status of the hang request? I can write a Drgn script to find if any request hang in the queue. Christoph, it would be very helpful if you share some thoughts and directions. > have to revert it. Which is going to be a very painful merge > with Chengming's work in the for-6.6 branch. > Maybe we can revert it manually if we still fail since that commit just let postflushes go to the normal I/O path, instead of going to the flush state machine. So I think it should be fine if we just delete that case? Chuck, could you please help to check this change on block/for-next? Thanks! diff --git a/block/blk-flush.c b/block/blk-flush.c index e73dc22d05c1..7ea3c00f40ce 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -442,17 +442,6 @@ bool blk_insert_flush(struct request *rq) * Queue for normal execution. */ return false; - case REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH: - /* - * Initialize the flush fields and completion handler to trigger - * the post flush, and then just pass the command on. - */ - blk_rq_init_flush(rq); - rq->flush.seq |= REQ_FSEQ_PREFLUSH; - spin_lock_irq(&fq->mq_flush_lock); - fq->flush_data_in_flight++; - spin_unlock_irq(&fq->mq_flush_lock); - return false; default: /* * Mark the request as part of a flush sequence and submit it ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] block: Revert 615939a2ae73 2023-08-10 3:42 ` Chengming Zhou @ 2023-08-10 13:47 ` Chuck Lever III 2023-08-10 15:58 ` Christoph Hellwig 2023-08-10 16:02 ` Chuck Lever 2 siblings, 0 replies; 9+ messages in thread From: Chuck Lever III @ 2023-08-10 13:47 UTC (permalink / raw) To: Chengming Zhou Cc: Christoph Hellwig, Jens Axboe, Chuck Lever, linux-block@vger.kernel.org > On Aug 9, 2023, at 11:42 PM, Chengming Zhou <zhouchengming@bytedance.com> wrote: > > On 2023/8/10 05:49, Christoph Hellwig wrote: >> Oh well. I don't feel like we're going to find the root cause >> given that its late in the merge window and I'm running out of >> time I have to work due to the annual summer vacation. Unless >> someone like Chengming who knows the flush code pretty well >> feels like working with chuck on a few more iterations we'll > > Hi, > > No problem, I can work with Chuck to find the root cause if Chuck > has time too, since I still can't reproduce it by myself. Yes, I have some time to help. > Maybe we should first find what's the status of the hang request? > I can write a Drgn script to find if any request hang in the queue. > > Christoph, it would be very helpful if you share some thoughts > and directions. > > >> have to revert it. Which is going to be a very painful merge >> with Chengming's work in the for-6.6 branch. >> > > Maybe we can revert it manually if we still fail since that commit > just let postflushes go to the normal I/O path, instead of going to > the flush state machine. > > So I think it should be fine if we just delete that case? That's basically the fix I have been testing on v6.5-rc5 and earlier. > Chuck, could you please help to check this change on block/for-next? I'll set up a block/for-next kernel and try this out. > Thanks! > > diff --git a/block/blk-flush.c b/block/blk-flush.c > index e73dc22d05c1..7ea3c00f40ce 100644 > --- a/block/blk-flush.c > +++ b/block/blk-flush.c > @@ -442,17 +442,6 @@ bool blk_insert_flush(struct request *rq) > * Queue for normal execution. > */ > return false; > - case REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH: > - /* > - * Initialize the flush fields and completion handler to trigger > - * the post flush, and then just pass the command on. > - */ > - blk_rq_init_flush(rq); > - rq->flush.seq |= REQ_FSEQ_PREFLUSH; > - spin_lock_irq(&fq->mq_flush_lock); > - fq->flush_data_in_flight++; > - spin_unlock_irq(&fq->mq_flush_lock); > - return false; > default: > /* > * Mark the request as part of a flush sequence and submit it -- Chuck Lever ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] block: Revert 615939a2ae73 2023-08-10 3:42 ` Chengming Zhou 2023-08-10 13:47 ` Chuck Lever III @ 2023-08-10 15:58 ` Christoph Hellwig 2023-08-10 16:02 ` Chuck Lever 2 siblings, 0 replies; 9+ messages in thread From: Christoph Hellwig @ 2023-08-10 15:58 UTC (permalink / raw) To: Chengming Zhou Cc: Christoph Hellwig, Chuck Lever III, Jens Axboe, Chuck Lever, linux-block@vger.kernel.org, Damien Le Moal On Thu, Aug 10, 2023 at 11:42:50AM +0800, Chengming Zhou wrote: > Christoph, it would be very helpful if you share some thoughts > and directions. I am not quite sure where the problems could even be. I think the interesting aspects are: - we are not directly processing through the normal submission path instead of adding the command to the flush_data_in_flight list and then going through blk_kick_flush. So a command now gets executed earlier than it did before. - given that it only happens with SATA, it must be in some way related to the fact that ATA flush commands are not queud, that is when a flush is outstanding no other command can't. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC] block: Revert 615939a2ae73 2023-08-10 3:42 ` Chengming Zhou 2023-08-10 13:47 ` Chuck Lever III 2023-08-10 15:58 ` Christoph Hellwig @ 2023-08-10 16:02 ` Chuck Lever 2 siblings, 0 replies; 9+ messages in thread From: Chuck Lever @ 2023-08-10 16:02 UTC (permalink / raw) To: Chengming Zhou Cc: Christoph Hellwig, Jens Axboe, Chuck Lever, linux-block@vger.kernel.org On Thu, Aug 10, 2023 at 11:42:50AM +0800, Chengming Zhou wrote: > On 2023/8/10 05:49, Christoph Hellwig wrote: > > Oh well. I don't feel like we're going to find the root cause > > given that its late in the merge window and I'm running out of > > time I have to work due to the annual summer vacation. Unless > > someone like Chengming who knows the flush code pretty well > > feels like working with chuck on a few more iterations we'll > > Hi, > > No problem, I can work with Chuck to find the root cause if Chuck > has time too, since I still can't reproduce it by myself. > > Maybe we should first find what's the status of the hang request? > I can write a Drgn script to find if any request hang in the queue. > > Christoph, it would be very helpful if you share some thoughts > and directions. > > > > have to revert it. Which is going to be a very painful merge > > with Chengming's work in the for-6.6 branch. > > > > Maybe we can revert it manually if we still fail since that commit > just let postflushes go to the normal I/O path, instead of going to > the flush state machine. > > So I think it should be fine if we just delete that case? > > Chuck, could you please help to check this change on block/for-next? > > Thanks! > > diff --git a/block/blk-flush.c b/block/blk-flush.c > index e73dc22d05c1..7ea3c00f40ce 100644 > --- a/block/blk-flush.c > +++ b/block/blk-flush.c > @@ -442,17 +442,6 @@ bool blk_insert_flush(struct request *rq) > * Queue for normal execution. > */ > return false; > - case REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH: > - /* > - * Initialize the flush fields and completion handler to trigger > - * the post flush, and then just pass the command on. > - */ > - blk_rq_init_flush(rq); > - rq->flush.seq |= REQ_FSEQ_PREFLUSH; > - spin_lock_irq(&fq->mq_flush_lock); > - fq->flush_data_in_flight++; > - spin_unlock_irq(&fq->mq_flush_lock); > - return false; > default: > /* > * Mark the request as part of a flush sequence and submit it linux-block/for-next with the above hunk applied is not able to reproduce the NFSD hangs on SATA devices. -- Chuck Lever ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-08-10 16:03 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-08-09 13:08 [PATCH RFC] block: Revert 615939a2ae73 Chuck Lever 2023-08-09 15:51 ` Jens Axboe 2023-08-09 16:11 ` Christoph Hellwig 2023-08-09 16:26 ` Chuck Lever III 2023-08-09 21:49 ` Christoph Hellwig 2023-08-10 3:42 ` Chengming Zhou 2023-08-10 13:47 ` Chuck Lever III 2023-08-10 15:58 ` Christoph Hellwig 2023-08-10 16:02 ` Chuck Lever
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox