public inbox for linux-block@vger.kernel.org
* [PATCH] block: use plug request list tail for one-shot backmerge attempt
@ 2025-06-11 14:53 Jens Axboe
  2025-06-11 16:55 ` Mohamed Abuelfotoh, Hazem
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2025-06-11 14:53 UTC (permalink / raw)
  To: linux-block@vger.kernel.org; +Cc: Hazem Mohamed Abuelfotoh

Previously, the block layer stored the requests in the plug list in
LIFO order. For this reason, blk_attempt_plug_merge() would check
just the head entry for a back merge attempt, and abort after that
unless requests for multiple queues existed in the plug list. If more
than one request is present in the plug list, this makes the one-shot
back merging less useful than before, as it'll always fail to find a
quick merge candidate.

Use the tail entry for the one-shot merge attempt, which is the last
added request in the list. If that fails, abort immediately unless
there are multiple queues available. If multiple queues are available,
then scan the list. Ideally the latter scan would be a backwards scan
of the list, but as it currently stands, the plug list is singly linked
and hence this isn't easily feasible.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3af1d284add5..70d704615be5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -998,20 +998,20 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	if (!plug || rq_list_empty(&plug->mq_list))
 		return false;
 
-	rq_list_for_each(&plug->mq_list, rq) {
-		if (rq->q == q) {
-			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
-			    BIO_MERGE_OK)
-				return true;
-			break;
-		}
+	rq = plug->mq_list.tail;
+	if (rq->q == q)
+		return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+			BIO_MERGE_OK;
+	else if (!plug->multiple_queues)
+		return false;
 
-		/*
-		 * Only keep iterating plug list for merges if we have multiple
-		 * queues
-		 */
-		if (!plug->multiple_queues)
-			break;
+	rq_list_for_each(&plug->mq_list, rq) {
+		if (rq->q != q)
+			continue;
+		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
+		    BIO_MERGE_OK)
+			return true;
+		break;
 	}
 	return false;
 }
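The tail-first strategy of the patch can be modeled outside the kernel as a small userspace sketch. All struct and field names below are illustrative stand-ins, not the kernel's actual types, and back-merge eligibility is reduced to a simple sector comparison:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the patched blk_attempt_plug_merge(): try the tail
 * (most recently plugged) request first, and only fall back to a full
 * forward scan when requests for multiple queues are plugged. */
struct rq {
	int q;			/* stands in for rq->q */
	int sector_end;		/* where a back merge would attach */
	struct rq *next;
};

struct plug {
	struct rq *head, *tail;	/* singly linked, like struct rq_list */
	int multiple_queues;
};

static int can_back_merge(const struct rq *rq, int bio_sector)
{
	return rq->sector_end == bio_sector;
}

static int attempt_plug_merge(struct plug *plug, int q, int bio_sector)
{
	struct rq *rq;

	if (!plug->head)
		return 0;

	/* One-shot attempt on the last added request. */
	rq = plug->tail;
	if (rq->q == q)
		return can_back_merge(rq, bio_sector);
	else if (!plug->multiple_queues)
		return 0;

	/* Forward scan; stops at the first same-queue request, since the
	 * singly linked list cannot easily be walked backwards. */
	for (rq = plug->head; rq; rq = rq->next) {
		if (rq->q != q)
			continue;
		return can_back_merge(rq, bio_sector);
	}
	return 0;
}
```

The one-shot nature shows up when a request earlier in the list would merge but only the tail is tried for a single-queue plug.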

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-11 14:53 [PATCH] block: use plug request list tail for one-shot backmerge attempt Jens Axboe
@ 2025-06-11 16:55 ` Mohamed Abuelfotoh, Hazem
  2025-06-11 17:53   ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Mohamed Abuelfotoh, Hazem @ 2025-06-11 16:55 UTC (permalink / raw)
  To: Jens Axboe, linux-block@vger.kernel.org

On 11/06/2025 15:53, Jens Axboe wrote:
> Previously, the block layer stored the requests in the plug list in
> LIFO order. For this reason, blk_attempt_plug_merge() would check
> just the head entry for a back merge attempt, and abort after that
> unless requests for multiple queues existed in the plug list. If more
> than one request is present in the plug list, this makes the one-shot
> back merging less useful than before, as it'll always fail to find a
> quick merge candidate.
> 
> Use the tail entry for the one-shot merge attempt, which is the last
> added request in the list. If that fails, abort immediately unless
> there are multiple queues available. If multiple queues are available,
> then scan the list. Ideally the latter scan would be a backwards scan
> of the list, but as it currently stands, the plug list is singly linked
> and hence this isn't easily feasible.
> 
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
> Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
> Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug")
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3af1d284add5..70d704615be5 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -998,20 +998,20 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
>          if (!plug || rq_list_empty(&plug->mq_list))
>                  return false;
> 
> -       rq_list_for_each(&plug->mq_list, rq) {
> -               if (rq->q == q) {
> -                       if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
> -                           BIO_MERGE_OK)
> -                               return true;
> -                       break;
> -               }
> +       rq = plug->mq_list.tail;
> +       if (rq->q == q)
> +               return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
> +                       BIO_MERGE_OK;
> +       else if (!plug->multiple_queues)
> +               return false;
> 
> -               /*
> -                * Only keep iterating plug list for merges if we have multiple
> -                * queues
> -                */
> -               if (!plug->multiple_queues)
> -                       break;
> +       rq_list_for_each(&plug->mq_list, rq) {
> +               if (rq->q != q)
> +                       continue;
> +               if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
> +                   BIO_MERGE_OK)
> +                       return true;
> +               break;
>          }
>          return false;
>   }
> 
> --
> Jens Axboe
> 

Thanks for posting this fix, I can confirm that this patch mitigated the 
regression reported in [1]. My only concern is the impact when we have 
multiple queues available, as you explained in the commit message. Given 
that reverting e70c301faece ("block: don't reorder requests in 
blk_add_rq_to_plug") would break zoned storage support and the plug list 
is singly linked, I don't think we have a better solution here.

[1] 
https://lore.kernel.org/lkml/20250611121626.7252-1-abuehaze@amazon.com/T/#u

Hazem


* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-11 16:55 ` Mohamed Abuelfotoh, Hazem
@ 2025-06-11 17:53   ` Jens Axboe
  2025-06-12  5:22     ` Christoph Hellwig
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Jens Axboe @ 2025-06-11 17:53 UTC (permalink / raw)
  To: Mohamed Abuelfotoh, Hazem, linux-block@vger.kernel.org

On 6/11/25 10:55 AM, Mohamed Abuelfotoh, Hazem wrote:
> On 11/06/2025 15:53, Jens Axboe wrote:
>> Previously, the block layer stored the requests in the plug list in
>> LIFO order. For this reason, blk_attempt_plug_merge() would check
>> just the head entry for a back merge attempt, and abort after that
>> unless requests for multiple queues existed in the plug list. If more
>> than one request is present in the plug list, this makes the one-shot
>> back merging less useful than before, as it'll always fail to find a
>> quick merge candidate.
>>
>> Use the tail entry for the one-shot merge attempt, which is the last
>> added request in the list. If that fails, abort immediately unless
>> there are multiple queues available. If multiple queues are available,
>> then scan the list. Ideally the latter scan would be a backwards scan
>> of the list, but as it currently stands, the plug list is singly linked
>> and hence this isn't easily feasible.
>>
>> Cc: stable@vger.kernel.org
>> Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
>> Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
>> Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug")
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 3af1d284add5..70d704615be5 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -998,20 +998,20 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
>>          if (!plug || rq_list_empty(&plug->mq_list))
>>                  return false;
>>
>> -       rq_list_for_each(&plug->mq_list, rq) {
>> -               if (rq->q == q) {
>> -                       if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>> -                           BIO_MERGE_OK)
>> -                               return true;
>> -                       break;
>> -               }
>> +       rq = plug->mq_list.tail;
>> +       if (rq->q == q)
>> +               return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>> +                       BIO_MERGE_OK;
>> +       else if (!plug->multiple_queues)
>> +               return false;
>>
>> -               /*
>> -                * Only keep iterating plug list for merges if we have multiple
>> -                * queues
>> -                */
>> -               if (!plug->multiple_queues)
>> -                       break;
>> +       rq_list_for_each(&plug->mq_list, rq) {
>> +               if (rq->q != q)
>> +                       continue;
>> +               if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>> +                   BIO_MERGE_OK)
>> +                       return true;
>> +               break;
>>          }
>>          return false;
>>   }
>>
>> -- 
>> Jens Axboe
>>
> 
> Thanks for posting this fix, I can confirm that this patch mitigated
> the regression reported in [1], my only concern is the impact when we
> have multiple queues available as you explained in the commit message.
> Given that reverting e70c301faece ("block: don't reorder requests in
> blk_add_rq_to_plug") will break zoned storage support and the plug
> list is singly linked, I don't think we have a better solution here.

Yes we can't revert it, and honestly I would not want to even if that
was an option. If the multi-queue case is particularly important, you
could just do something ala the below - keep scanning until a merge
_could_ have happened but didn't. Ideally we'd want to iterate the plug
list backwards and then we could keep the same single shot logic, where
you only attempt one request that has a matching queue. And obviously we
could just doubly link the requests, there's space in the request
linkage code to do that. But that'd add overhead in general, I think
it's better to shove a bit of that overhead to the multi-queue case.

I suspect the below would do the trick, however.

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be5..4313301f131c 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -1008,6 +1008,8 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	rq_list_for_each(&plug->mq_list, rq) {
 		if (rq->q != q)
 			continue;
+		if (blk_try_merge(rq, bio) == ELEVATOR_NO_MERGE)
+			continue;
 		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
 		    BIO_MERGE_OK)
 			return true;

-- 
Jens Axboe


* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-11 17:53   ` Jens Axboe
@ 2025-06-12  5:22     ` Christoph Hellwig
  2025-06-12  5:23       ` Christoph Hellwig
  2025-06-12 11:49       ` Jens Axboe
  2025-06-12 12:27     ` Mohamed Abuelfotoh, Hazem
  2025-06-24 10:45     ` Mohamed Abuelfotoh, Hazem
  2 siblings, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-12  5:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mohamed Abuelfotoh, Hazem, linux-block@vger.kernel.org

On Wed, Jun 11, 2025 at 11:53:07AM -0600, Jens Axboe wrote:
> Yes we can't revert it, and honestly I would not want to even if that
> was an option. If the multi-queue case is particularly important, you
> could just do something ala the below - keep scanning until a merge
> _could_ have happened but didn't. Ideally we'd want to iterate the plug
> list backwards and then we could keep the same single shot logic, where
> you only attempt one request that has a matching queue. And obviously we
> could just doubly link the requests, there's space in the request
> linkage code to do that. But that'd add overhead in general, I think
> it's better to shove a bit of that overhead to the multi-queue case.

Maybe bite the bullet and just make the request lists doubly linked?
Unlike the bio, memory usage for requests should not be quite as
critical.  Right now in my config the last cacheline in struct request
only has a single 8 byte field anyway, so in practice we won't even
bloat it.



* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-12  5:22     ` Christoph Hellwig
@ 2025-06-12  5:23       ` Christoph Hellwig
  2025-06-12 11:49       ` Jens Axboe
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-12  5:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mohamed Abuelfotoh, Hazem, linux-block@vger.kernel.org

On Wed, Jun 11, 2025 at 10:22:43PM -0700, Christoph Hellwig wrote:
> Maybe bite the bullet and just make the request lists doubly linked?
> Unlike the bio, memory usage for requests should not be quite as
> critical.  Right now in my config the last cacheline in struct request
> only has a single 8 byte field anyway, so in practice we won't even
> bloat it.

Oh, and we actually union rq_next with a list_head anyway..



* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-12  5:22     ` Christoph Hellwig
  2025-06-12  5:23       ` Christoph Hellwig
@ 2025-06-12 11:49       ` Jens Axboe
  2025-06-12 11:56         ` Christoph Hellwig
  1 sibling, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2025-06-12 11:49 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Mohamed Abuelfotoh, Hazem, linux-block@vger.kernel.org

On 6/11/25 11:22 PM, Christoph Hellwig wrote:
> On Wed, Jun 11, 2025 at 11:53:07AM -0600, Jens Axboe wrote:
>> Yes we can't revert it, and honestly I would not want to even if that
>> was an option. If the multi-queue case is particularly important, you
>> could just do something ala the below - keep scanning until a merge
>> _could_ have happened but didn't. Ideally we'd want to iterate the plug
>> list backwards and then we could keep the same single shot logic, where
>> you only attempt one request that has a matching queue. And obviously we
>> could just doubly link the requests, there's space in the request
>> linkage code to do that. But that'd add overhead in general, I think
>> it's better to shove a bit of that overhead to the multi-queue case.
> 
> Maybe bite the bullet and just make the request lists doubly linked?
> Unlike the bio, memory usage for requests should not be quite as
> critical.  Right now in my config the last cacheline in struct request
> only has a single 8 byte field anyway, so in practice we won't even
> bloat it.

The space isn't a concern, as you found as well. It's the fact that
doubly linked lists suck in terms of needing to touch both prev
and next for removal.
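The objection above can be made concrete with a toy comparison: deleting a node from a doubly linked list writes into both neighbouring nodes (two potentially cold cache lines), while popping from a singly linked LIFO only writes the list head. A minimal sketch, with all names invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Toy illustration of removal cost: doubly linked deletion dirties both
 * neighbours, while a singly linked LIFO pop touches only the head. */
struct dnode {
	struct dnode *prev, *next;
	int writes;		/* counts stores into this node */
};

static void dlist_del(struct dnode *n)
{
	n->prev->next = n->next;
	n->prev->writes++;	/* neighbour 1 dirtied */
	n->next->prev = n->prev;
	n->next->writes++;	/* neighbour 2 dirtied */
}

struct snode {
	struct snode *next;
};

static struct snode *slist_pop(struct snode **head)
{
	/* Only *head is written; no other node is touched. */
	struct snode *n = *head;

	if (n)
		*head = n->next;
	return n;
}
```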

-- 
Jens Axboe



* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-12 11:49       ` Jens Axboe
@ 2025-06-12 11:56         ` Christoph Hellwig
  2025-06-12 12:21           ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-12 11:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Mohamed Abuelfotoh, Hazem,
	linux-block@vger.kernel.org

On Thu, Jun 12, 2025 at 05:49:16AM -0600, Jens Axboe wrote:
> > Maybe bite the bullet and just make the request lists doubly linked?
> > Unlike the bio, memory usage for requests should not be quite as
> > critical.  Right now in my config the last cacheline in struct request
> > only has a single 8 byte field anyway, so in practice we won't even
> > bloat it.
> 
> The space isn't a concern, as you found as well. It's the fact that
> doubly linked lists suck in terms of needing to touch both prev
> and next for removal.

But is that actually a concern here?  If you look at my patch we can
now use the list_cut helper for the queue_rqs submission sorting,
and for the actual submission we walk each request anyway (and might
get along without removal entirely if we dare to leave the dangling
pointers around).  The multi-queue dispatch could probably use the
cut as well.  For the cached rqs and completion we could stick to the
singly linked list as they don't really mix up with the submission
IFF that shows up as an issue there.
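The list_cut helper referred to here is list_cut_position() from the kernel's <linux/list.h>, which detaches an initial run of entries in O(1). A stripped-down userspace model (edge cases such as an empty list or entry == head are omitted; the real helper handles them):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of struct list_head and list_cut_position(): move the
 * run head->next .. entry onto @list in O(1), leaving the remainder on
 * @head. The real kernel helper additionally handles edge cases. */
struct list_head {
	struct list_head *prev, *next;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->prev = h->next = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_cut_position(struct list_head *list, struct list_head *head,
			      struct list_head *entry)
{
	struct list_head *new_first = entry->next;

	list->next = head->next;
	list->next->prev = list;
	list->prev = entry;
	entry->next = list;
	head->next = new_first;
	new_first->prev = head;
}

static int list_len(const struct list_head *h)
{
	const struct list_head *p;
	int n = 0;

	for (p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

The O(1) cut is what makes splitting off a per-queue submission batch cheap compared to popping entries one by one.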



* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-12 11:56         ` Christoph Hellwig
@ 2025-06-12 12:21           ` Jens Axboe
  2025-06-12 12:23             ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2025-06-12 12:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Mohamed Abuelfotoh, Hazem, linux-block@vger.kernel.org

On 6/12/25 5:56 AM, Christoph Hellwig wrote:
> On Thu, Jun 12, 2025 at 05:49:16AM -0600, Jens Axboe wrote:
>>> Maybe bite the bullet and just make the request lists doubly linked?
>>> Unlike the bio, memory usage for requests should not be quite as
>>> critical.  Right now in my config the last cacheline in struct request
>>> only has a single 8 byte field anyway, so in practice we won't even
>>> bloat it.
>>
>> The space isn't a concern, as you found as well. It's the fact that
>> doubly linked lists suck in terms of needing to touch both prev
>> and next for removal.
> 
> But is that actually a concern here?  If you look at my patch we can
> now use the list_cut helper for the queue_rqs submission sorting,
> and for the actual submission we walk each request anyway (and might
> get along without removal entirely if we dare to leave the dangling
> pointers around).  The multi-queue dispatch could probably use the
> cut as well.  For the cached rqs and completion we could stick to the
> singly linked list as they don't really mix up with the submission
> IFF that shows up as an issue there.

It's certainly going to make the cached handling more expensive, as the
doubly linked behavior there is just pointless. Generally LIFO behavior
there is preferable. I'd strongly suggest we use the doubly linked side
for dispatch, and retain singly linked for cached + completion. If not
I'm 100% sure we're going to be revisiting this again down the line, and
redo those parts yet again.

-- 
Jens Axboe


* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-12 12:21           ` Jens Axboe
@ 2025-06-12 12:23             ` Christoph Hellwig
  2025-06-12 12:28               ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-12 12:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Mohamed Abuelfotoh, Hazem,
	linux-block@vger.kernel.org

On Thu, Jun 12, 2025 at 06:21:14AM -0600, Jens Axboe wrote:
> It's certainly going to make the cached handling more expensive, as the
> doubly linked behavior there is just pointless. Generally LIFO behavior
> there is preferable. I'd strongly suggest we use the doubly linked side
> for dispatch, and retain singly linked for cached + completion. If not
> I'm 100% sure we're going to be revisiting this again down the line, and
> redo those parts yet again.

Yeah.  For cached requests and completions it might even make sense
to have a simple fixed size array FIFO buffer..



* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-11 17:53   ` Jens Axboe
  2025-06-12  5:22     ` Christoph Hellwig
@ 2025-06-12 12:27     ` Mohamed Abuelfotoh, Hazem
  2025-06-24 10:45     ` Mohamed Abuelfotoh, Hazem
  2 siblings, 0 replies; 17+ messages in thread
From: Mohamed Abuelfotoh, Hazem @ 2025-06-12 12:27 UTC (permalink / raw)
  To: Jens Axboe, linux-block@vger.kernel.org

On 11/06/2025 18:53, Jens Axboe wrote:
> On 6/11/25 10:55 AM, Mohamed Abuelfotoh, Hazem wrote:
>> On 11/06/2025 15:53, Jens Axboe wrote:
>>> Previously, the block layer stored the requests in the plug list in
>>> LIFO order. For this reason, blk_attempt_plug_merge() would check
>>> just the head entry for a back merge attempt, and abort after that
>>> unless requests for multiple queues existed in the plug list. If more
>>> than one request is present in the plug list, this makes the one-shot
>>> back merging less useful than before, as it'll always fail to find a
>>> quick merge candidate.
>>>
>>> Use the tail entry for the one-shot merge attempt, which is the last
>>> added request in the list. If that fails, abort immediately unless
>>> there are multiple queues available. If multiple queues are available,
>>> then scan the list. Ideally the latter scan would be a backwards scan
>>> of the list, but as it currently stands, the plug list is singly linked
>>> and hence this isn't easily feasible.
>>>
>>> Cc: stable@vger.kernel.org
>>> Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
>>> Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
>>> Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug")
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>
>>> ---
>>>
>>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>>> index 3af1d284add5..70d704615be5 100644
>>> --- a/block/blk-merge.c
>>> +++ b/block/blk-merge.c
>>> @@ -998,20 +998,20 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
>>>           if (!plug || rq_list_empty(&plug->mq_list))
>>>                   return false;
>>>
>>> -       rq_list_for_each(&plug->mq_list, rq) {
>>> -               if (rq->q == q) {
>>> -                       if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>>> -                           BIO_MERGE_OK)
>>> -                               return true;
>>> -                       break;
>>> -               }
>>> +       rq = plug->mq_list.tail;
>>> +       if (rq->q == q)
>>> +               return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>>> +                       BIO_MERGE_OK;
>>> +       else if (!plug->multiple_queues)
>>> +               return false;
>>>
>>> -               /*
>>> -                * Only keep iterating plug list for merges if we have multiple
>>> -                * queues
>>> -                */
>>> -               if (!plug->multiple_queues)
>>> -                       break;
>>> +       rq_list_for_each(&plug->mq_list, rq) {
>>> +               if (rq->q != q)
>>> +                       continue;
>>> +               if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>>> +                   BIO_MERGE_OK)
>>> +                       return true;
>>> +               break;
>>>           }
>>>           return false;
>>>    }
>>>
>>> --
>>> Jens Axboe
>>>
>>
>> Thanks for posting this fix, I can confirm that this patch mitigated
>> the regression reported in [1], my only concern is the impact when we
>> have multiple queues available as you explained in the commit message.
>> Given that reverting e70c301faece ("block: don't reorder requests in
>> blk_add_rq_to_plug") will break zoned storage support and the plug
>> list is singly linked, I don't think we have a better solution here.
> 
> Yes we can't revert it, and honestly I would not want to even if that
> was an option. If the multi-queue case is particularly important, you
> could just do something ala the below - keep scanning until a merge
> _could_ have happened but didn't. Ideally we'd want to iterate the plug
> list backwards and then we could keep the same single shot logic, where
> you only attempt one request that has a matching queue. And obviously we
> could just doubly link the requests, there's space in the request
> linkage code to do that. But that'd add overhead in general, I think
> it's better to shove a bit of that overhead to the multi-queue case.

Thanks, I got your point. I will test the impact on raid0 aggregated 
disks and report the numbers with this proposed patch. During my 
debugging I actually tried iterating the list from head to tail looking 
for the perfect request to merge, and although that improved the bio 
merge success rate, it dropped the fio randwrite bandwidth by about 66% 
compared to the baseline performance without reverting e70c301faece 
("block: don't reorder requests in blk_add_rq_to_plug"). That's likely 
because of the overhead of iterating the whole plug list looking for the 
perfect request to merge with.

> 
> I suspect the below would do the trick, however.
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 70d704615be5..4313301f131c 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -1008,6 +1008,8 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
>          rq_list_for_each(&plug->mq_list, rq) {
>                  if (rq->q != q)
>                          continue;
> +               if (blk_try_merge(rq, bio) == ELEVATOR_NO_MERGE)
> +                       continue;
I think this should be break? It's not clear to me why you would need to 
continue scanning the list if blk_try_merge() returns ELEVATOR_NO_MERGE 
here.
>                  if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>                      BIO_MERGE_OK)
>                          return true;
> 
> --
> Jens Axboe



* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-12 12:23             ` Christoph Hellwig
@ 2025-06-12 12:28               ` Jens Axboe
  2025-06-16 13:11                 ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2025-06-12 12:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Mohamed Abuelfotoh, Hazem, linux-block@vger.kernel.org

On 6/12/25 6:23 AM, Christoph Hellwig wrote:
> On Thu, Jun 12, 2025 at 06:21:14AM -0600, Jens Axboe wrote:
>> It's certainly going to make the cached handling more expensive, as the
>> doubly linked behavior there is just pointless. Generally LIFO behavior
>> there is preferable. I'd strongly suggest we use the doubly linked side
>> for dispatch, and retain singly linked for cached + completion. If not
>> I'm 100% sure we're going to be revisiting this again down the line, and
>> redo those parts yet again.
> 
> Yeah.  For cached requests and completions it might even make sense
> to have a simple fixed size array FIFO buffer..

I did ponder that in the past too, as that's clearly better.
Experimentally we need ~32 slots in there though, which is 256b of
storage. Pretty sure I have patches laying around somewhere that did
that, but didn't like the plug and batch size growth on the stack. Maybe
overthinking that part...
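The fixed-size array idea, with the ~32 slots mentioned above and the LIFO reuse Jens prefers for cached requests, could be sketched as follows. The structure and names here are hypothetical, not a kernel API:

```c
#include <assert.h>
#include <stddef.h>

#define CACHED_RQS 32	/* ~32 pointer slots = 256 bytes on 64-bit */

/* Hypothetical fixed-size LIFO cache for requests: push and pop are one
 * array store plus an index update, and LIFO reuse hands back the most
 * recently freed request, whose cache lines are most likely still hot. */
struct rq_cache {
	void *slot[CACHED_RQS];
	unsigned int nr;
};

static int rq_cache_push(struct rq_cache *c, void *rq)
{
	if (c->nr == CACHED_RQS)
		return 0;	/* full: caller would free the request */
	c->slot[c->nr++] = rq;
	return 1;
}

static void *rq_cache_pop(struct rq_cache *c)
{
	return c->nr ? c->slot[--c->nr] : NULL;
}
```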

But ideally we'd have that, and just a plain doubly linked list on the
queue/dispatch side. Which makes the list handling there much easier to
follow, as per your patch.

-- 
Jens Axboe


* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-12 12:28               ` Jens Axboe
@ 2025-06-16 13:11                 ` Christoph Hellwig
  2025-06-16 16:01                   ` Caleb Sander Mateos
  2025-06-18  6:04                   ` Hannes Reinecke
  0 siblings, 2 replies; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-16 13:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Mohamed Abuelfotoh, Hazem,
	linux-block@vger.kernel.org

On Thu, Jun 12, 2025 at 06:28:47AM -0600, Jens Axboe wrote:
> But ideally we'd have that, and just a plain doubly linked list on the
> queue/dispatch side. Which makes the list handling there much easier to
> follow, as per your patch.

Quick hack from the weekend.  This also never deletes the requests from
the submission list for the queue_rqs case, so depending on the workload
it should touch either the same amount of less cache lines as the
existing version.  Only very lightly tested, and ublk is broken and
doesn't even compile as it's running out space in the io_uring pdu.
I'll need help from someone who knows ublk for that.

---
From 07e283303c63fcb694e828380a24ad51f225a228 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 13 Jun 2025 09:48:40 +0200
Subject: block: always use a list_head for requests

Use standard doubly linked lists for the remaining lists of queued up
requests. This removes a lot of hairy list manipulation code and allows
easy reverse walking of the lists, which is used in
blk_attempt_plug_merge to improve the merging, and in blk_add_rq_to_plug
to look at the correct request.

XXX: ublk is broken right now, because there is no space in the io_uring
pdu for the list backpointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c              |  2 +-
 block/blk-merge.c             | 11 +---
 block/blk-mq.c                | 97 ++++++++++++++---------------------
 drivers/block/null_blk/main.c | 16 +++---
 drivers/block/ublk_drv.c      | 43 +++++++---------
 drivers/block/virtio_blk.c    | 31 +++++------
 drivers/nvme/host/pci.c       | 32 ++++++------
 include/linux/blk-mq.h        |  2 +-
 include/linux/blkdev.h        |  2 +-
 9 files changed, 103 insertions(+), 133 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b862c66018f2..29aad939a1e3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1127,7 +1127,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
 		return;
 
 	plug->cur_ktime = 0;
-	rq_list_init(&plug->mq_list);
+	INIT_LIST_HEAD(&plug->mq_list);
 	rq_list_init(&plug->cached_rqs);
 	plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
 	plug->rq_count = 0;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be5..223941e9ec08 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -995,17 +995,10 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	struct blk_plug *plug = current->plug;
 	struct request *rq;
 
-	if (!plug || rq_list_empty(&plug->mq_list))
+	if (!plug)
 		return false;
 
-	rq = plug->mq_list.tail;
-	if (rq->q == q)
-		return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
-			BIO_MERGE_OK;
-	else if (!plug->multiple_queues)
-		return false;
-
-	rq_list_for_each(&plug->mq_list, rq) {
+	list_for_each_entry_reverse(rq, &plug->mq_list, queuelist) {
 		if (rq->q != q)
 			continue;
 		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..6d56471d4346 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1378,7 +1378,8 @@ static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
 
 static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 {
-	struct request *last = rq_list_peek(&plug->mq_list);
+	struct request *last =
+		list_last_entry(&plug->mq_list, struct request, queuelist);
 
 	if (!plug->rq_count) {
 		trace_block_plug(rq->q);
@@ -1398,7 +1399,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 	 */
 	if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
 		plug->has_elevator = true;
-	rq_list_add_tail(&plug->mq_list, rq);
+	list_add_tail(&rq->queuelist, &plug->mq_list);
 	plug->rq_count++;
 }
 
@@ -2780,15 +2781,15 @@ static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
 	return __blk_mq_issue_directly(hctx, rq, last);
 }
 
-static void blk_mq_issue_direct(struct rq_list *rqs)
+static void blk_mq_issue_direct(struct list_head *rqs)
 {
 	struct blk_mq_hw_ctx *hctx = NULL;
-	struct request *rq;
+	struct request *rq, *n;
 	int queued = 0;
 	blk_status_t ret = BLK_STS_OK;
 
-	while ((rq = rq_list_pop(rqs))) {
-		bool last = rq_list_empty(rqs);
+	list_for_each_entry_safe(rq, n, rqs, queuelist) {
+		list_del_init(&rq->queuelist);
 
 		if (hctx != rq->mq_hctx) {
 			if (hctx) {
@@ -2798,7 +2799,7 @@ static void blk_mq_issue_direct(struct rq_list *rqs)
 			hctx = rq->mq_hctx;
 		}
 
-		ret = blk_mq_request_issue_directly(rq, last);
+		ret = blk_mq_request_issue_directly(rq, list_empty(rqs));
 		switch (ret) {
 		case BLK_STS_OK:
 			queued++;
@@ -2819,45 +2820,18 @@ static void blk_mq_issue_direct(struct rq_list *rqs)
 		blk_mq_commit_rqs(hctx, queued, false);
 }
 
-static void __blk_mq_flush_list(struct request_queue *q, struct rq_list *rqs)
+static void __blk_mq_flush_list(struct request_queue *q, struct list_head *rqs)
 {
 	if (blk_queue_quiesced(q))
 		return;
 	q->mq_ops->queue_rqs(rqs);
 }
 
-static unsigned blk_mq_extract_queue_requests(struct rq_list *rqs,
-					      struct rq_list *queue_rqs)
-{
-	struct request *rq = rq_list_pop(rqs);
-	struct request_queue *this_q = rq->q;
-	struct request **prev = &rqs->head;
-	struct rq_list matched_rqs = {};
-	struct request *last = NULL;
-	unsigned depth = 1;
-
-	rq_list_add_tail(&matched_rqs, rq);
-	while ((rq = *prev)) {
-		if (rq->q == this_q) {
-			/* move rq from rqs to matched_rqs */
-			*prev = rq->rq_next;
-			rq_list_add_tail(&matched_rqs, rq);
-			depth++;
-		} else {
-			/* leave rq in rqs */
-			prev = &rq->rq_next;
-			last = rq;
-		}
-	}
-
-	rqs->tail = last;
-	*queue_rqs = matched_rqs;
-	return depth;
-}
-
-static void blk_mq_dispatch_queue_requests(struct rq_list *rqs, unsigned depth)
+static void blk_mq_dispatch_queue_requests(struct list_head *rqs,
+					   unsigned depth)
 {
-	struct request_queue *q = rq_list_peek(rqs)->q;
+	struct request *rq = list_first_entry(rqs, struct request, queuelist);
+	struct request_queue *q = rq->q;
 
 	trace_block_unplug(q, depth, true);
 
@@ -2869,39 +2843,35 @@ static void blk_mq_dispatch_queue_requests(struct rq_list *rqs, unsigned depth)
 	 */
 	if (q->mq_ops->queue_rqs) {
 		blk_mq_run_dispatch_ops(q, __blk_mq_flush_list(q, rqs));
-		if (rq_list_empty(rqs))
+		if (list_empty(rqs))
 			return;
 	}
 
 	blk_mq_run_dispatch_ops(q, blk_mq_issue_direct(rqs));
 }
 
-static void blk_mq_dispatch_list(struct rq_list *rqs, bool from_sched)
+static void blk_mq_dispatch_list(struct list_head *rqs, bool from_sched)
 {
 	struct blk_mq_hw_ctx *this_hctx = NULL;
 	struct blk_mq_ctx *this_ctx = NULL;
-	struct rq_list requeue_list = {};
+	LIST_HEAD(list);
+	struct request *rq, *n;
 	unsigned int depth = 0;
 	bool is_passthrough = false;
-	LIST_HEAD(list);
-
-	do {
-		struct request *rq = rq_list_pop(rqs);
 
+	list_for_each_entry_safe(rq, n, rqs, queuelist) {
 		if (!this_hctx) {
 			this_hctx = rq->mq_hctx;
 			this_ctx = rq->mq_ctx;
 			is_passthrough = blk_rq_is_passthrough(rq);
 		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
 			   is_passthrough != blk_rq_is_passthrough(rq)) {
-			rq_list_add_tail(&requeue_list, rq);
 			continue;
 		}
-		list_add_tail(&rq->queuelist, &list);
+		list_move_tail(&rq->queuelist, &list);
 		depth++;
-	} while (!rq_list_empty(rqs));
+	}
 
-	*rqs = requeue_list;
 	trace_block_unplug(this_hctx->queue, depth, !from_sched);
 
 	percpu_ref_get(&this_hctx->queue->q_usage_counter);
@@ -2921,17 +2891,27 @@ static void blk_mq_dispatch_list(struct rq_list *rqs, bool from_sched)
 	percpu_ref_put(&this_hctx->queue->q_usage_counter);
 }
 
-static void blk_mq_dispatch_multiple_queue_requests(struct rq_list *rqs)
+static void blk_mq_dispatch_multiple_queue_requests(struct list_head *rqs)
 {
 	do {
-		struct rq_list queue_rqs;
-		unsigned depth;
+		struct request_queue *this_q = NULL;
+		struct request *rq, *n;
+		LIST_HEAD(queue_rqs);
+		unsigned depth = 0;
+
+		list_for_each_entry_safe(rq, n, rqs, queuelist) {
+			if (!this_q)
+				this_q = rq->q;
+			if (this_q == rq->q) {
+				list_move_tail(&rq->queuelist, &queue_rqs);
+				depth++;
+			}
+		}
 
-		depth = blk_mq_extract_queue_requests(rqs, &queue_rqs);
 		blk_mq_dispatch_queue_requests(&queue_rqs, depth);
-		while (!rq_list_empty(&queue_rqs))
+		while (!list_empty(&queue_rqs))
 			blk_mq_dispatch_list(&queue_rqs, false);
-	} while (!rq_list_empty(rqs));
+	} while (!list_empty(rqs));
 }
 
 void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
@@ -2955,15 +2935,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 			blk_mq_dispatch_multiple_queue_requests(&plug->mq_list);
 			return;
 		}
-
 		blk_mq_dispatch_queue_requests(&plug->mq_list, depth);
-		if (rq_list_empty(&plug->mq_list))
+		if (list_empty(&plug->mq_list))
 			return;
 	}
 
 	do {
 		blk_mq_dispatch_list(&plug->mq_list, from_schedule);
-	} while (!rq_list_empty(&plug->mq_list));
+	} while (!list_empty(&plug->mq_list));
 }
 
 static void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index aa163ae9b2aa..ce3ac928122f 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1694,22 +1694,22 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_OK;
 }
 
-static void null_queue_rqs(struct rq_list *rqlist)
+static void null_queue_rqs(struct list_head *rqlist)
 {
-	struct rq_list requeue_list = {};
 	struct blk_mq_queue_data bd = { };
+	LIST_HEAD(requeue_list);
+	struct request *rq, *n;
 	blk_status_t ret;
 
-	do {
-		struct request *rq = rq_list_pop(rqlist);
-
+	list_for_each_entry_safe(rq, n, rqlist, queuelist) {
 		bd.rq = rq;
 		ret = null_queue_rq(rq->mq_hctx, &bd);
 		if (ret != BLK_STS_OK)
-			rq_list_add_tail(&requeue_list, rq);
-	} while (!rq_list_empty(rqlist));
+			list_move_tail(&rq->queuelist, &requeue_list);
+	}
 
-	*rqlist = requeue_list;
+	INIT_LIST_HEAD(rqlist);
+	list_splice(&requeue_list, rqlist);
 }
 
 static void null_init_queue(struct nullb *nullb, struct nullb_queue *nq)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c637ea010d34..4d5b88ca7b1b 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -101,7 +101,7 @@ struct ublk_uring_cmd_pdu {
 	 */
 	union {
 		struct request *req;
-		struct request *req_list;
+		struct list_head req_list;
 	};
 
 	/*
@@ -1325,24 +1325,18 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
-	struct request *rq = pdu->req_list;
-	struct request *next;
+	struct request *rq, *n;
 
-	do {
-		next = rq->rq_next;
-		rq->rq_next = NULL;
+	list_for_each_entry_safe(rq, n, &pdu->req_list, queuelist)
 		ublk_dispatch_req(rq->mq_hctx->driver_data, rq, issue_flags);
-		rq = next;
-	} while (rq);
 }
 
-static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
+static void ublk_queue_cmd_list(struct ublk_io *io, struct list_head *rqlist)
 {
 	struct io_uring_cmd *cmd = io->cmd;
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
 
-	pdu->req_list = rq_list_peek(l);
-	rq_list_init(l);
+	list_splice(&pdu->req_list, rqlist);
 	io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
 }
 
@@ -1416,30 +1410,31 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_OK;
 }
 
-static void ublk_queue_rqs(struct rq_list *rqlist)
+static void ublk_queue_rqs(struct list_head *rqlist)
 {
-	struct rq_list requeue_list = { };
-	struct rq_list submit_list = { };
 	struct ublk_io *io = NULL;
-	struct request *req;
+	struct request *req, *n;
+	LIST_HEAD(requeue_list);
 
-	while ((req = rq_list_pop(rqlist))) {
+	list_for_each_entry_safe(req, n, rqlist, queuelist) {
 		struct ublk_queue *this_q = req->mq_hctx->driver_data;
 		struct ublk_io *this_io = &this_q->ios[req->tag];
 
-		if (io && io->task != this_io->task && !rq_list_empty(&submit_list))
+		if (io && io->task != this_io->task) {
+			LIST_HEAD(submit_list);
+
+			list_cut_before(&submit_list, rqlist, &req->queuelist);
 			ublk_queue_cmd_list(io, &submit_list);
+		}
 		io = this_io;
 
-		if (ublk_prep_req(this_q, req, true) == BLK_STS_OK)
-			rq_list_add_tail(&submit_list, req);
-		else
-			rq_list_add_tail(&requeue_list, req);
+		if (ublk_prep_req(this_q, req, true) != BLK_STS_OK)
+			list_move_tail(&req->queuelist, &requeue_list);
 	}
 
-	if (!rq_list_empty(&submit_list))
-		ublk_queue_cmd_list(io, &submit_list);
-	*rqlist = requeue_list;
+	if (!list_empty(rqlist))
+		ublk_queue_cmd_list(io, rqlist);
+	list_splice(&requeue_list, rqlist);
 }
 
 static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 30bca8cb7106..29f900eada0f 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -471,15 +471,14 @@ static bool virtblk_prep_rq_batch(struct request *req)
 }
 
 static void virtblk_add_req_batch(struct virtio_blk_vq *vq,
-		struct rq_list *rqlist)
+		struct list_head *rqlist)
 {
 	struct request *req;
 	unsigned long flags;
 	bool kick;
 
 	spin_lock_irqsave(&vq->lock, flags);
-
-	while ((req = rq_list_pop(rqlist))) {
+	list_for_each_entry(req, rqlist, queuelist) {
 		struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 		int err;
 
@@ -498,29 +497,31 @@ static void virtblk_add_req_batch(struct virtio_blk_vq *vq,
 		virtqueue_notify(vq->vq);
 }
 
-static void virtio_queue_rqs(struct rq_list *rqlist)
+static void virtio_queue_rqs(struct list_head *rqlist)
 {
-	struct rq_list submit_list = { };
-	struct rq_list requeue_list = { };
 	struct virtio_blk_vq *vq = NULL;
-	struct request *req;
+	LIST_HEAD(requeue_list);
+	struct request *req, *n;
 
-	while ((req = rq_list_pop(rqlist))) {
+	list_for_each_entry_safe(req, n, rqlist, queuelist) {
 		struct virtio_blk_vq *this_vq = get_virtio_blk_vq(req->mq_hctx);
 
-		if (vq && vq != this_vq)
+		if (vq && vq != this_vq) {
+			LIST_HEAD(submit_list);
+
+			list_cut_before(&submit_list, rqlist, &req->queuelist);
 			virtblk_add_req_batch(vq, &submit_list);
+		}
 		vq = this_vq;
 
-		if (virtblk_prep_rq_batch(req))
-			rq_list_add_tail(&submit_list, req);
-		else
-			rq_list_add_tail(&requeue_list, req);
+		if (!virtblk_prep_rq_batch(req))
+			list_move_tail(&req->queuelist, &requeue_list);
 	}
 
 	if (vq)
-		virtblk_add_req_batch(vq, &submit_list);
-	*rqlist = requeue_list;
+		virtblk_add_req_batch(vq, rqlist);
+	INIT_LIST_HEAD(rqlist);
+	list_splice(&requeue_list, rqlist);
 }
 
 #ifdef CONFIG_BLK_DEV_ZONED
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8ff12e415cb5..7bcb4b33e154 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1051,15 +1051,15 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_OK;
 }
 
-static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct rq_list *rqlist)
+static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct list_head *rqlist)
 {
 	struct request *req;
 
-	if (rq_list_empty(rqlist))
+	if (list_empty(rqlist))
 		return;
 
 	spin_lock(&nvmeq->sq_lock);
-	while ((req = rq_list_pop(rqlist))) {
+	list_for_each_entry(req, rqlist, queuelist) {
 		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
 		nvme_sq_copy_cmd(nvmeq, &iod->cmd);
@@ -1082,27 +1082,29 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
 	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
 }
 
-static void nvme_queue_rqs(struct rq_list *rqlist)
+static void nvme_queue_rqs(struct list_head *rqlist)
 {
-	struct rq_list submit_list = { };
-	struct rq_list requeue_list = { };
 	struct nvme_queue *nvmeq = NULL;
-	struct request *req;
+	LIST_HEAD(requeue_list);
+	struct request *req, *n;
+
+	list_for_each_entry_safe(req, n, rqlist, queuelist) {
+		if (nvmeq && nvmeq != req->mq_hctx->driver_data) {
+			LIST_HEAD(submit_list);
 
-	while ((req = rq_list_pop(rqlist))) {
-		if (nvmeq && nvmeq != req->mq_hctx->driver_data)
+			list_cut_before(&submit_list, rqlist, &req->queuelist);
 			nvme_submit_cmds(nvmeq, &submit_list);
+		}
 		nvmeq = req->mq_hctx->driver_data;
 
-		if (nvme_prep_rq_batch(nvmeq, req))
-			rq_list_add_tail(&submit_list, req);
-		else
-			rq_list_add_tail(&requeue_list, req);
+		if (!nvme_prep_rq_batch(nvmeq, req))
+			list_move_tail(&req->queuelist, &requeue_list);
 	}
 
 	if (nvmeq)
-		nvme_submit_cmds(nvmeq, &submit_list);
-	*rqlist = requeue_list;
+		nvme_submit_cmds(nvmeq, rqlist);
+	INIT_LIST_HEAD(rqlist);
+	list_splice(&requeue_list, rqlist);
 }
 
 static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index de8c85a03bb7..76c7ec906481 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -574,7 +574,7 @@ struct blk_mq_ops {
 	 * empty the @rqlist completely, then the rest will be queued
 	 * individually by the block layer upon return.
 	 */
-	void (*queue_rqs)(struct rq_list *rqlist);
+	void (*queue_rqs)(struct list_head *rqlist);
 
 	/**
 	 * @get_budget: Reserve budget before queue request, once .queue_rq is
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 332b56f323d9..1cc87e939b40 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1083,7 +1083,7 @@ struct rq_list {
  * blk_flush_plug() is called.
  */
 struct blk_plug {
-	struct rq_list mq_list; /* blk-mq requests */
+	struct list_head mq_list; /* blk-mq requests */
 
 	/* if ios_left is > 1, we can batch tag/rq allocations */
 	struct rq_list cached_rqs;
-- 
2.47.2



* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-16 13:11                 ` Christoph Hellwig
@ 2025-06-16 16:01                   ` Caleb Sander Mateos
  2025-06-17  2:36                     ` Ming Lei
  2025-06-18  6:04                   ` Hannes Reinecke
  1 sibling, 1 reply; 17+ messages in thread
From: Caleb Sander Mateos @ 2025-06-16 16:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Mohamed Abuelfotoh, Hazem,
	linux-block@vger.kernel.org

On Mon, Jun 16, 2025 at 6:11 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Jun 12, 2025 at 06:28:47AM -0600, Jens Axboe wrote:
> > But ideally we'd have that, and just a plain doubly linked list on the
> > queue/dispatch side. Which makes the list handling there much easier to
> > follow, as per your patch.
>
> Quick hack from the weekend.  This also never deletes the requests from
> the submission list for the queue_rqs case, so depending on the workload
> it should touch either the same number of cache lines as the existing
> version, or fewer.  Only very lightly tested, and ublk is broken and
> doesn't even compile as it's running out of space in the io_uring pdu.
> I'll need help from someone who knows ublk for that.
>
> ---
> From 07e283303c63fcb694e828380a24ad51f225a228 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Fri, 13 Jun 2025 09:48:40 +0200
> Subject: block: always use a list_head for requests
>
> Use standard doubly linked lists for the remaining lists of queued-up
> requests. This removes a lot of hairy list manipulation code and allows
> easy reverse walking of the lists, which is used in
> blk_attempt_plug_merge to improve merging, and in blk_add_rq_to_plug
> to look at the correct request.
>
> XXX: ublk is broken right now, because there is no space in the io_uring
> pdu for the list backpointer.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-core.c              |  2 +-
>  block/blk-merge.c             | 11 +---
>  block/blk-mq.c                | 97 ++++++++++++++---------------------
>  drivers/block/null_blk/main.c | 16 +++---
>  drivers/block/ublk_drv.c      | 43 +++++++---------
>  drivers/block/virtio_blk.c    | 31 +++++------
>  drivers/nvme/host/pci.c       | 32 ++++++------
>  include/linux/blk-mq.h        |  2 +-
>  include/linux/blkdev.h        |  2 +-
>  9 files changed, 103 insertions(+), 133 deletions(-)
>
> [...]
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index c637ea010d34..4d5b88ca7b1b 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -101,7 +101,7 @@ struct ublk_uring_cmd_pdu {
>          */
>         union {
>                 struct request *req;
> -               struct request *req_list;
> +               struct list_head req_list;
>         };
>
>         /*
> @@ -1325,24 +1325,18 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
>                 unsigned int issue_flags)
>  {
>         struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> -       struct request *rq = pdu->req_list;
> -       struct request *next;
> +       struct request *rq, *n;
>
> -       do {
> -               next = rq->rq_next;
> -               rq->rq_next = NULL;
> +       list_for_each_entry_safe(rq, n, &pdu->req_list, queuelist)
>                 ublk_dispatch_req(rq->mq_hctx->driver_data, rq, issue_flags);
> -               rq = next;
> -       } while (rq);
>  }
>
> -static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
> +static void ublk_queue_cmd_list(struct ublk_io *io, struct list_head *rqlist)
>  {
>         struct io_uring_cmd *cmd = io->cmd;
>         struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
>
> -       pdu->req_list = rq_list_peek(l);
> -       rq_list_init(l);
> +       list_splice(&pdu->req_list, rqlist);
>         io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);

ublk_cmd_list_tw_cb() doesn't need a doubly-linked list. It should be
fine to continue storing just the first struct request * of the list
in struct ublk_uring_cmd_pdu. That would avoid growing
ublk_uring_cmd_pdu past the 32-byte limit.

Best,
Caleb

> +static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct list_head *rqlist)
>  {
>         struct request *req;
>
> -       if (rq_list_empty(rqlist))
> +       if (list_empty(rqlist))
>                 return;
>
>         spin_lock(&nvmeq->sq_lock);
> -       while ((req = rq_list_pop(rqlist))) {
> +       list_for_each_entry(req, rqlist, queuelist) {
>                 struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>
>                 nvme_sq_copy_cmd(nvmeq, &iod->cmd);
> @@ -1082,27 +1082,29 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
>         return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
>  }
>
> -static void nvme_queue_rqs(struct rq_list *rqlist)
> +static void nvme_queue_rqs(struct list_head *rqlist)
>  {
> -       struct rq_list submit_list = { };
> -       struct rq_list requeue_list = { };
>         struct nvme_queue *nvmeq = NULL;
> -       struct request *req;
> +       LIST_HEAD(requeue_list);
> +       struct request *req, *n;
> +
> +       list_for_each_entry_safe(req, n, rqlist, queuelist) {
> +               if (nvmeq && nvmeq != req->mq_hctx->driver_data) {
> +                       LIST_HEAD(submit_list);
>
> -       while ((req = rq_list_pop(rqlist))) {
> -               if (nvmeq && nvmeq != req->mq_hctx->driver_data)
> +                       list_cut_before(&submit_list, rqlist, &req->queuelist);
>                         nvme_submit_cmds(nvmeq, &submit_list);
> +               }
>                 nvmeq = req->mq_hctx->driver_data;
>
> -               if (nvme_prep_rq_batch(nvmeq, req))
> -                       rq_list_add_tail(&submit_list, req);
> -               else
> -                       rq_list_add_tail(&requeue_list, req);
> +               if (!nvme_prep_rq_batch(nvmeq, req))
> +                       list_move_tail(&req->queuelist, &requeue_list);
>         }
>
>         if (nvmeq)
> -               nvme_submit_cmds(nvmeq, &submit_list);
> -       *rqlist = requeue_list;
> +               nvme_submit_cmds(nvmeq, rqlist);
> +       INIT_LIST_HEAD(rqlist);
> +       list_splice(&requeue_list, rqlist);
>  }
>
>  static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index de8c85a03bb7..76c7ec906481 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -574,7 +574,7 @@ struct blk_mq_ops {
>          * empty the @rqlist completely, then the rest will be queued
>          * individually by the block layer upon return.
>          */
> -       void (*queue_rqs)(struct rq_list *rqlist);
> +       void (*queue_rqs)(struct list_head *rqlist);
>
>         /**
>          * @get_budget: Reserve budget before queue request, once .queue_rq is
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 332b56f323d9..1cc87e939b40 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1083,7 +1083,7 @@ struct rq_list {
>   * blk_flush_plug() is called.
>   */
>  struct blk_plug {
> -       struct rq_list mq_list; /* blk-mq requests */
> +       struct list_head mq_list; /* blk-mq requests */
>
>         /* if ios_left is > 1, we can batch tag/rq allocations */
>         struct rq_list cached_rqs;
> --
> 2.47.2
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-16 16:01                   ` Caleb Sander Mateos
@ 2025-06-17  2:36                     ` Ming Lei
  2025-06-17  4:39                       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2025-06-17  2:36 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Christoph Hellwig, Jens Axboe, Mohamed Abuelfotoh, Hazem,
	linux-block@vger.kernel.org

On Mon, Jun 16, 2025 at 09:01:54AM -0700, Caleb Sander Mateos wrote:
> On Mon, Jun 16, 2025 at 6:11 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Jun 12, 2025 at 06:28:47AM -0600, Jens Axboe wrote:
> > > But ideally we'd have that, and just a plain doubly linked list on the
> > > queue/dispatch side. Which makes the list handling there much easier to
> > > follow, as per your patch.
> >
> > Quick hack from the weekend.  This also never deletes the requests from
> > the submission list for the queue_rqs case, so depending on the workload
> > it should touch either the same amount of less cache lines as the
> > existing version.  Only very lightly tested, and ublk is broken and
> > doesn't even compile as it's running out space in the io_uring pdu.
> > I'll need help from someone who knows ublk for that.
> >
> > ---
> > From 07e283303c63fcb694e828380a24ad51f225a228 Mon Sep 17 00:00:00 2001
> > From: Christoph Hellwig <hch@lst.de>
> > Date: Fri, 13 Jun 2025 09:48:40 +0200
> > Subject: block: always use a list_head for requests
> >
> > Use standard double linked lists for the remaining lists of queued up
> > requests. This removes a lot of hairy list manipulation code and allows
> > easy reverse walking of the lists, which is used in
> > blk_attempt_plug_merge to improve the merging, and in blk_add_rq_to_plug
> > to look at the correct request.
> >
> > XXX: ublk is broken right now, because there is no space in the io_uring
> > pdu for the list backpointer.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  block/blk-core.c              |  2 +-
> >  block/blk-merge.c             | 11 +---
> >  block/blk-mq.c                | 97 ++++++++++++++---------------------
> >  drivers/block/null_blk/main.c | 16 +++---
> >  drivers/block/ublk_drv.c      | 43 +++++++---------
> >  drivers/block/virtio_blk.c    | 31 +++++------
> >  drivers/nvme/host/pci.c       | 32 ++++++------
> >  include/linux/blk-mq.h        |  2 +-
> >  include/linux/blkdev.h        |  2 +-
> >  9 files changed, 103 insertions(+), 133 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index b862c66018f2..29aad939a1e3 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1127,7 +1127,7 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios)
> >                 return;
> >
> >         plug->cur_ktime = 0;
> > -       rq_list_init(&plug->mq_list);
> > +       INIT_LIST_HEAD(&plug->mq_list);
> >         rq_list_init(&plug->cached_rqs);
> >         plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT);
> >         plug->rq_count = 0;
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 70d704615be5..223941e9ec08 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -995,17 +995,10 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
> >         struct blk_plug *plug = current->plug;
> >         struct request *rq;
> >
> > -       if (!plug || rq_list_empty(&plug->mq_list))
> > +       if (!plug)
> >                 return false;
> >
> > -       rq = plug->mq_list.tail;
> > -       if (rq->q == q)
> > -               return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
> > -                       BIO_MERGE_OK;
> > -       else if (!plug->multiple_queues)
> > -               return false;
> > -
> > -       rq_list_for_each(&plug->mq_list, rq) {
> > +       list_for_each_entry_reverse(rq, &plug->mq_list, queuelist) {
> >                 if (rq->q != q)
> >                         continue;
> >                 if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 4806b867e37d..6d56471d4346 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1378,7 +1378,8 @@ static inline unsigned short blk_plug_max_rq_count(struct blk_plug *plug)
> >
> >  static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> >  {
> > -       struct request *last = rq_list_peek(&plug->mq_list);
> > +       struct request *last =
> > +               list_last_entry(&plug->mq_list, struct request, queuelist);
> >
> >         if (!plug->rq_count) {
> >                 trace_block_plug(rq->q);
> > @@ -1398,7 +1399,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> >          */
> >         if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
> >                 plug->has_elevator = true;
> > -       rq_list_add_tail(&plug->mq_list, rq);
> > +       list_add_tail(&rq->queuelist, &plug->mq_list);
> >         plug->rq_count++;
> >  }
> >
> > @@ -2780,15 +2781,15 @@ static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
> >         return __blk_mq_issue_directly(hctx, rq, last);
> >  }
> >
> > -static void blk_mq_issue_direct(struct rq_list *rqs)
> > +static void blk_mq_issue_direct(struct list_head *rqs)
> >  {
> >         struct blk_mq_hw_ctx *hctx = NULL;
> > -       struct request *rq;
> > +       struct request *rq, *n;
> >         int queued = 0;
> >         blk_status_t ret = BLK_STS_OK;
> >
> > -       while ((rq = rq_list_pop(rqs))) {
> > -               bool last = rq_list_empty(rqs);
> > +       list_for_each_entry_safe(rq, n, rqs, queuelist) {
> > +               list_del_init(&rq->queuelist);
> >
> >                 if (hctx != rq->mq_hctx) {
> >                         if (hctx) {
> > @@ -2798,7 +2799,7 @@ static void blk_mq_issue_direct(struct rq_list *rqs)
> >                         hctx = rq->mq_hctx;
> >                 }
> >
> > -               ret = blk_mq_request_issue_directly(rq, last);
> > +               ret = blk_mq_request_issue_directly(rq, list_empty(rqs));
> >                 switch (ret) {
> >                 case BLK_STS_OK:
> >                         queued++;
> > @@ -2819,45 +2820,18 @@ static void blk_mq_issue_direct(struct rq_list *rqs)
> >                 blk_mq_commit_rqs(hctx, queued, false);
> >  }
> >
> > -static void __blk_mq_flush_list(struct request_queue *q, struct rq_list *rqs)
> > +static void __blk_mq_flush_list(struct request_queue *q, struct list_head *rqs)
> >  {
> >         if (blk_queue_quiesced(q))
> >                 return;
> >         q->mq_ops->queue_rqs(rqs);
> >  }
> >
> > -static unsigned blk_mq_extract_queue_requests(struct rq_list *rqs,
> > -                                             struct rq_list *queue_rqs)
> > -{
> > -       struct request *rq = rq_list_pop(rqs);
> > -       struct request_queue *this_q = rq->q;
> > -       struct request **prev = &rqs->head;
> > -       struct rq_list matched_rqs = {};
> > -       struct request *last = NULL;
> > -       unsigned depth = 1;
> > -
> > -       rq_list_add_tail(&matched_rqs, rq);
> > -       while ((rq = *prev)) {
> > -               if (rq->q == this_q) {
> > -                       /* move rq from rqs to matched_rqs */
> > -                       *prev = rq->rq_next;
> > -                       rq_list_add_tail(&matched_rqs, rq);
> > -                       depth++;
> > -               } else {
> > -                       /* leave rq in rqs */
> > -                       prev = &rq->rq_next;
> > -                       last = rq;
> > -               }
> > -       }
> > -
> > -       rqs->tail = last;
> > -       *queue_rqs = matched_rqs;
> > -       return depth;
> > -}
> > -
> > -static void blk_mq_dispatch_queue_requests(struct rq_list *rqs, unsigned depth)
> > +static void blk_mq_dispatch_queue_requests(struct list_head *rqs,
> > +                                          unsigned depth)
> >  {
> > -       struct request_queue *q = rq_list_peek(rqs)->q;
> > +       struct request *rq = list_first_entry(rqs, struct request, queuelist);
> > +       struct request_queue *q = rq->q;
> >
> >         trace_block_unplug(q, depth, true);
> >
> > @@ -2869,39 +2843,35 @@ static void blk_mq_dispatch_queue_requests(struct rq_list *rqs, unsigned depth)
> >          */
> >         if (q->mq_ops->queue_rqs) {
> >                 blk_mq_run_dispatch_ops(q, __blk_mq_flush_list(q, rqs));
> > -               if (rq_list_empty(rqs))
> > +               if (list_empty(rqs))
> >                         return;
> >         }
> >
> >         blk_mq_run_dispatch_ops(q, blk_mq_issue_direct(rqs));
> >  }
> >
> > -static void blk_mq_dispatch_list(struct rq_list *rqs, bool from_sched)
> > +static void blk_mq_dispatch_list(struct list_head *rqs, bool from_sched)
> >  {
> >         struct blk_mq_hw_ctx *this_hctx = NULL;
> >         struct blk_mq_ctx *this_ctx = NULL;
> > -       struct rq_list requeue_list = {};
> > +       LIST_HEAD(list);
> > +       struct request *rq, *n;
> >         unsigned int depth = 0;
> >         bool is_passthrough = false;
> > -       LIST_HEAD(list);
> > -
> > -       do {
> > -               struct request *rq = rq_list_pop(rqs);
> >
> > +       list_for_each_entry_safe(rq, n, rqs, queuelist) {
> >                 if (!this_hctx) {
> >                         this_hctx = rq->mq_hctx;
> >                         this_ctx = rq->mq_ctx;
> >                         is_passthrough = blk_rq_is_passthrough(rq);
> >                 } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
> >                            is_passthrough != blk_rq_is_passthrough(rq)) {
> > -                       rq_list_add_tail(&requeue_list, rq);
> >                         continue;
> >                 }
> > -               list_add_tail(&rq->queuelist, &list);
> > +               list_move_tail(&rq->queuelist, &list);
> >                 depth++;
> > -       } while (!rq_list_empty(rqs));
> > +       }
> >
> > -       *rqs = requeue_list;
> >         trace_block_unplug(this_hctx->queue, depth, !from_sched);
> >
> >         percpu_ref_get(&this_hctx->queue->q_usage_counter);
> > @@ -2921,17 +2891,27 @@ static void blk_mq_dispatch_list(struct rq_list *rqs, bool from_sched)
> >         percpu_ref_put(&this_hctx->queue->q_usage_counter);
> >  }
> >
> > -static void blk_mq_dispatch_multiple_queue_requests(struct rq_list *rqs)
> > +static void blk_mq_dispatch_multiple_queue_requests(struct list_head *rqs)
> >  {
> >         do {
> > -               struct rq_list queue_rqs;
> > -               unsigned depth;
> > +               struct request_queue *this_q = NULL;
> > +               struct request *rq, *n;
> > +               LIST_HEAD(queue_rqs);
> > +               unsigned depth = 0;
> > +
> > +               list_for_each_entry_safe(rq, n, rqs, queuelist) {
> > +                       if (!this_q)
> > +                               this_q = rq->q;
> > +                       if (this_q == rq->q) {
> > +                               list_move_tail(&rq->queuelist, &queue_rqs);
> > +                               depth++;
> > +                       }
> > +               }
> >
> > -               depth = blk_mq_extract_queue_requests(rqs, &queue_rqs);
> >                 blk_mq_dispatch_queue_requests(&queue_rqs, depth);
> > -               while (!rq_list_empty(&queue_rqs))
> > +               while (!list_empty(&queue_rqs))
> >                         blk_mq_dispatch_list(&queue_rqs, false);
> > -       } while (!rq_list_empty(rqs));
> > +       } while (!list_empty(rqs));
> >  }
> >
> >  void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> > @@ -2955,15 +2935,14 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >                         blk_mq_dispatch_multiple_queue_requests(&plug->mq_list);
> >                         return;
> >                 }
> > -
> >                 blk_mq_dispatch_queue_requests(&plug->mq_list, depth);
> > -               if (rq_list_empty(&plug->mq_list))
> > +               if (list_empty(&plug->mq_list))
> >                         return;
> >         }
> >
> >         do {
> >                 blk_mq_dispatch_list(&plug->mq_list, from_schedule);
> > -       } while (!rq_list_empty(&plug->mq_list));
> > +       } while (!list_empty(&plug->mq_list));
> >  }
> >
> >  static void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
> > diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> > index aa163ae9b2aa..ce3ac928122f 100644
> > --- a/drivers/block/null_blk/main.c
> > +++ b/drivers/block/null_blk/main.c
> > @@ -1694,22 +1694,22 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
> >         return BLK_STS_OK;
> >  }
> >
> > -static void null_queue_rqs(struct rq_list *rqlist)
> > +static void null_queue_rqs(struct list_head *rqlist)
> >  {
> > -       struct rq_list requeue_list = {};
> >         struct blk_mq_queue_data bd = { };
> > +       LIST_HEAD(requeue_list);
> > +       struct request *rq, *n;
> >         blk_status_t ret;
> >
> > -       do {
> > -               struct request *rq = rq_list_pop(rqlist);
> > -
> > +       list_for_each_entry_safe(rq, n, rqlist, queuelist) {
> >                 bd.rq = rq;
> >                 ret = null_queue_rq(rq->mq_hctx, &bd);
> >                 if (ret != BLK_STS_OK)
> > -                       rq_list_add_tail(&requeue_list, rq);
> > -       } while (!rq_list_empty(rqlist));
> > +                       list_move_tail(&rq->queuelist, &requeue_list);
> > +       }
> >
> > -       *rqlist = requeue_list;
> > +       INIT_LIST_HEAD(rqlist);
> > +       list_splice(&requeue_list, rqlist);
> >  }
> >
> >  static void null_init_queue(struct nullb *nullb, struct nullb_queue *nq)
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index c637ea010d34..4d5b88ca7b1b 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -101,7 +101,7 @@ struct ublk_uring_cmd_pdu {
> >          */
> >         union {
> >                 struct request *req;
> > -               struct request *req_list;
> > +               struct list_head req_list;
> >         };
> >
> >         /*
> > @@ -1325,24 +1325,18 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
> >                 unsigned int issue_flags)
> >  {
> >         struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > -       struct request *rq = pdu->req_list;
> > -       struct request *next;
> > +       struct request *rq, *n;
> >
> > -       do {
> > -               next = rq->rq_next;
> > -               rq->rq_next = NULL;
> > +       list_for_each_entry_safe(rq, n, &pdu->req_list, queuelist)
> >                 ublk_dispatch_req(rq->mq_hctx->driver_data, rq, issue_flags);
> > -               rq = next;
> > -       } while (rq);
> >  }
> >
> > -static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
> > +static void ublk_queue_cmd_list(struct ublk_io *io, struct list_head *rqlist)
> >  {
> >         struct io_uring_cmd *cmd = io->cmd;
> >         struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> >
> > -       pdu->req_list = rq_list_peek(l);
> > -       rq_list_init(l);
> > +       list_splice(&pdu->req_list, rqlist);
> >         io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
> 
> ublk_cmd_list_tw_cb() doesn't need a doubly-linked list. It should be
> fine to continue storing just the first struct request * of the list
> in struct ublk_uring_cmd_pdu. That would avoid growing
> ublk_uring_cmd_pdu past the 32-byte limit.

Agree.

ublk doesn't need to reorder, and a doubly linked list is useless here.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-17  2:36                     ` Ming Lei
@ 2025-06-17  4:39                       ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-17  4:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Caleb Sander Mateos, Christoph Hellwig, Jens Axboe,
	Mohamed Abuelfotoh, Hazem, linux-block@vger.kernel.org

On Tue, Jun 17, 2025 at 10:36:06AM +0800, Ming Lei wrote:

[full quote deleted, please fix your mailer]

> > ublk_cmd_list_tw_cb() doesn't need a doubly-linked list. It should be
> > fine to continue storing just the first struct request * of the list
> > in struct ublk_uring_cmd_pdu. That would avoid growing
> > ublk_uring_cmd_pdu past the 32-byte limit.
> 
> Agree.
> 
> ublk needn't re-order, and doubly-linked list is useless here.

No driver needs to reorder; it just needs to consume the list passed to
it, for which list_cut_before() is really useful.  Just passing the list
on would be optimal, but if that can't be done, the frankenlist proposed
by Caleb might work.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-16 13:11                 ` Christoph Hellwig
  2025-06-16 16:01                   ` Caleb Sander Mateos
@ 2025-06-18  6:04                   ` Hannes Reinecke
  1 sibling, 0 replies; 17+ messages in thread
From: Hannes Reinecke @ 2025-06-18  6:04 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Mohamed Abuelfotoh, Hazem, linux-block@vger.kernel.org

On 6/16/25 15:11, Christoph Hellwig wrote:
> On Thu, Jun 12, 2025 at 06:28:47AM -0600, Jens Axboe wrote:
>> But ideally we'd have that, and just a plain doubly linked list on the
>> queue/dispatch side. Which makes the list handling there much easier to
>> follow, as per your patch.
> 
> Quick hack from the weekend.  This also never deletes the requests from
> the submission list for the queue_rqs case, so depending on the workload
> it should touch either the same number of, or fewer, cache lines than the
> existing version.  Only very lightly tested, and ublk is broken and
> doesn't even compile as it's running out space in the io_uring pdu.
> I'll need help from someone who knows ublk for that.
> 
>   	 */
>   	union {
>   		struct request *req;
> -		struct request *req_list;
> +		struct list_head req_list;
>   	};
>   
>   	/*
> @@ -1325,24 +1325,18 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
>   		unsigned int issue_flags)
>   {
>   	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> -	struct request *rq = pdu->req_list;
> -	struct request *next;
> +	struct request *rq, *n;
>   
> -	do {
> -		next = rq->rq_next;
> -		rq->rq_next = NULL;
> +	list_for_each_entry_safe(rq, n, &pdu->req_list, queuelist)
>   		ublk_dispatch_req(rq->mq_hctx->driver_data, rq, issue_flags);
> -		rq = next;
> -	} while (rq);
>   }
>   
> -static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
> +static void ublk_queue_cmd_list(struct ublk_io *io, struct list_head *rqlist)
>   {
>   	struct io_uring_cmd *cmd = io->cmd;
>   	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
>   
> -	pdu->req_list = rq_list_peek(l);
> -	rq_list_init(l);
> +	list_splice(&pdu->req_list, rqlist);
>   	io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
>   }
>   
> @@ -1416,30 +1410,31 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>   	return BLK_STS_OK;
>   }
>   
> -static void ublk_queue_rqs(struct rq_list *rqlist)
> +static void ublk_queue_rqs(struct list_head *rqlist)
>   {
> -	struct rq_list requeue_list = { };
> -	struct rq_list submit_list = { };
>   	struct ublk_io *io = NULL;
> -	struct request *req;
> +	struct request *req, *n;
> +	LIST_HEAD(requeue_list);
>   
> -	while ((req = rq_list_pop(rqlist))) {
> +	list_for_each_entry_safe(req, n, rqlist, queuelist) {
>   		struct ublk_queue *this_q = req->mq_hctx->driver_data;
>   		struct ublk_io *this_io = &this_q->ios[req->tag];
>   
> -		if (io && io->task != this_io->task && !rq_list_empty(&submit_list))
> +		if (io && io->task != this_io->task) {
> +			LIST_HEAD(submit_list);
> +
> +			list_cut_before(&submit_list, rqlist, &req->queuelist);
>   			ublk_queue_cmd_list(io, &submit_list);
> +		}
>   		io = this_io;
>   
> -		if (ublk_prep_req(this_q, req, true) == BLK_STS_OK)
> -			rq_list_add_tail(&submit_list, req);
> -		else
> -			rq_list_add_tail(&requeue_list, req);
> +		if (ublk_prep_req(this_q, req, true) != BLK_STS_OK)
> +			list_move_tail(&req->queuelist, &requeue_list);
>   	}
>   
> -	if (!rq_list_empty(&submit_list))
> -		ublk_queue_cmd_list(io, &submit_list);
> -	*rqlist = requeue_list;
> +	if (!list_empty(rqlist))
> +		ublk_queue_cmd_list(io, rqlist);
> +	list_splice(&requeue_list, rqlist);
>   }
>   
>   static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 30bca8cb7106..29f900eada0f 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -471,15 +471,14 @@ static bool virtblk_prep_rq_batch(struct request *req)
>   }
>   
>   static void virtblk_add_req_batch(struct virtio_blk_vq *vq,
> -		struct rq_list *rqlist)
> +		struct list_head *rqlist)
>   {
>   	struct request *req;
>   	unsigned long flags;
>   	bool kick;
>   
>   	spin_lock_irqsave(&vq->lock, flags);
> -
> -	while ((req = rq_list_pop(rqlist))) {
> +	list_for_each_entry(req, rqlist, queuelist) {
>   		struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
>   		int err;
>   

Arguably this changes the outcome; in the original code rq_list_pop()
would remove the requests from the list, while with your conversion we're
leaving the list intact (even though the requests have been processed).
Shouldn't we be using 'list_del_init()' (or at least 'list_del()') here
to clear out the list?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH] block: use plug request list tail for one-shot backmerge attempt
  2025-06-11 17:53   ` Jens Axboe
  2025-06-12  5:22     ` Christoph Hellwig
  2025-06-12 12:27     ` Mohamed Abuelfotoh, Hazem
@ 2025-06-24 10:45     ` Mohamed Abuelfotoh, Hazem
  2 siblings, 0 replies; 17+ messages in thread
From: Mohamed Abuelfotoh, Hazem @ 2025-06-24 10:45 UTC (permalink / raw)
  To: Jens Axboe, linux-block@vger.kernel.org

On 11/06/2025 18:53, Jens Axboe wrote:
> Yes we can't revert it, and honestly I would not want to even if that
> was an option. If the multi-queue case is particularly important, you
> could just do something a la the below - keep scanning until a merge
> _could_ have happened but didn't. Ideally we'd want to iterate the plug
> list backwards and then we could keep the same single shot logic, where
> you only attempt one request that has a matching queue. And obviously we
> could just doubly link the requests, there's space in the request
> linkage code to do that. But that'd add overhead in general, I think
> it's better to shove a bit of that overhead to the multi-queue case.
> 
> I suspect the below would do the trick, however.
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 70d704615be5..4313301f131c 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -1008,6 +1008,8 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
>          rq_list_for_each(&plug->mq_list, rq) {
>                  if (rq->q != q)
>                          continue;
> +               if (blk_try_merge(rq, bio) == ELEVATOR_NO_MERGE)
> +                       continue;
>                  if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>                      BIO_MERGE_OK)
>                          return true;
> 
> --
> Jens Axboe

Sorry for my delayed reply here as I was on a business trip for the last
couple of weeks. I have done some testing on 6 SSDs aggregated as raid0
to simulate the multi-queue case, but I haven't seen a measurable impact
from that change, at least on the random write test case. It looks like
the patch has been queued for 6.15 & 6.12 stable without this change, so
I assume we are dropping it?

Kernel         | fio B/W (MiB/sec) | I/O size (iostat)
---------------+-------------------+-------------------
6.15.2         | 639               | 4KiB
6.15.2+patchv1 | 648               | 4KiB
6.15.2+patchv2 | 665               | 4KiB

Hazem


Thread overview: 17+ messages
2025-06-11 14:53 [PATCH] block: use plug request list tail for one-shot backmerge attempt Jens Axboe
2025-06-11 16:55 ` Mohamed Abuelfotoh, Hazem
2025-06-11 17:53   ` Jens Axboe
2025-06-12  5:22     ` Christoph Hellwig
2025-06-12  5:23       ` Christoph Hellwig
2025-06-12 11:49       ` Jens Axboe
2025-06-12 11:56         ` Christoph Hellwig
2025-06-12 12:21           ` Jens Axboe
2025-06-12 12:23             ` Christoph Hellwig
2025-06-12 12:28               ` Jens Axboe
2025-06-16 13:11                 ` Christoph Hellwig
2025-06-16 16:01                   ` Caleb Sander Mateos
2025-06-17  2:36                     ` Ming Lei
2025-06-17  4:39                       ` Christoph Hellwig
2025-06-18  6:04                   ` Hannes Reinecke
2025-06-12 12:27     ` Mohamed Abuelfotoh, Hazem
2025-06-24 10:45     ` Mohamed Abuelfotoh, Hazem
