* [PATCH] block: plug attempts to batch allocate tags multiple times
From: Xue He @ 2025-09-01 8:22 UTC (permalink / raw)
To: axboe; +Cc: linux-block, linux-kernel, hexue
From: hexue <xue01.he@samsung.com>
In the existing plug mechanism, tags are allocated in batches based on
the number of requests. However, testing has shown that the plug
attempts batch allocation of tags only once, at the beginning of a
batch of I/O operations. Since the tag mask does not always contain
enough free bits to satisfy the requested count, a full batch
allocation is not guaranteed to succeed. The remaining tags are then
allocated one at a time (which happens frequently), incurring the
overhead of multiple single-tag allocations.
This patch allows the remaining I/O operations to retry batch
allocation of tags, reducing the overhead caused by repeated
individual tag allocations.
------------------------------------------------------------------------
Test results
On a PCIe Gen4 Samsung PM9A3 SSD, perf showed a CPU usage improvement:
the __blk_mq_alloc_requests function dropped from 1.39% to 0.82%
after the modification.
Performance variations were also observed across different devices.
workload: randread
blocksize: 4k
threads: 1
------------------------------------------------------------------------
                PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
native kernel     553k iops       633k iops       793k iops
modified          553k iops       635k iops       801k iops
With Optane SSDs, performance was as follows:
two devices, one thread
cmd :sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
-n1 -r4 /dev/nvme0n1 /dev/nvme1n1
base: 6.4 Million IOPS
patch: 6.49 Million IOPS
two devices, two threads
cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
-n1 -r4 /dev/nvme0n1 /dev/nvme1n1
base: 7.34 Million IOPS
patch: 7.48 Million IOPS
-------------------------------------------------------------------------
Signed-off-by: hexue <xue01.he@samsung.com>
---
block/blk-mq.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b67d6c02eceb..1fb280764b76 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
if (blk_queue_enter(q, flags))
return NULL;
- plug->nr_ios = 1;
-
rq = __blk_mq_alloc_requests(&data);
+ plug->nr_ios = data.nr_tags;
+
if (unlikely(!rq))
blk_queue_exit(q);
return rq;
@@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
if (plug) {
data.nr_tags = plug->nr_ios;
- plug->nr_ios = 1;
data.cached_rqs = &plug->cached_rqs;
}
rq = __blk_mq_alloc_requests(&data);
+ if (plug)
+ plug->nr_ios = data.nr_tags;
+
if (unlikely(!rq))
rq_qos_cleanup(q, bio);
return rq;
--
2.34.1
* Re: [PATCH] block: plug attempts to batch allocate tags multiple times
From: Yu Kuai @ 2025-09-02 8:47 UTC (permalink / raw)
To: Xue He, axboe; +Cc: linux-block, linux-kernel, yukuai (C)
Hi,
On 2025/09/01 16:22, Xue He wrote:
> From: hexue <xue01.he@samsung.com>
>
> In the existing plug mechanism, tags are allocated in batches based on
> the number of requests. However, testing has shown that the plug only
> attempts batch allocation of tags once at the beginning of a batch of
> I/O operations. Since the tag_mask does not always have enough available
> tags to satisfy the requested number, a full batch allocation is not
> guaranteed to succeed each time. The remaining tags are then allocated
> individually (occurs frequently), leading to multiple single-tag
> allocation overheads.
>
> This patch aims to allow the remaining I/O operations to retry batch
> allocation of tags, reducing the overhead caused by multiple
> individual tag allocations.
>
> ------------------------------------------------------------------------
> test result
> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool
> observed CPU improvements. The CPU usage of the original function
> _blk_mq_alloc_requests function was 1.39%, which decreased to 0.82%
> after modification.
>
> Additionally, performance variations were observed on different devices.
> workload:randread
> blocksize:4k
> thread:1
> ------------------------------------------------------------------------
> PCIe Gen3 SSD PCIe Gen4 SSD PCIe Gen5 SSD
> native kernel 553k iops 633k iops 793k iops
> modified 553k iops 635k iops 801k iops
>
> with Optane SSDs, the performance like
> two device one thread
> cmd :sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>
How many hw_queues and how many tags in each hw_queues in your nvme?
I feel it's unlikely that tags can be exhausted, usually cpu will become
bottleneck first.
> base: 6.4 Million IOPS
> patch: 6.49 Million IOPS
>
> two device two thread
> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>
> base: 7.34 Million IOPS
> patch: 7.48 Million IOPS
> -------------------------------------------------------------------------
>
> Signed-off-by: hexue <xue01.he@samsung.com>
> ---
> block/blk-mq.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b67d6c02eceb..1fb280764b76 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
> if (blk_queue_enter(q, flags))
> return NULL;
>
> - plug->nr_ios = 1;
> -
> rq = __blk_mq_alloc_requests(&data);
> + plug->nr_ios = data.nr_tags;
> +
> if (unlikely(!rq))
> blk_queue_exit(q);
> return rq;
> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>
> if (plug) {
> data.nr_tags = plug->nr_ios;
> - plug->nr_ios = 1;
> data.cached_rqs = &plug->cached_rqs;
> }
>
> rq = __blk_mq_alloc_requests(&data);
> + if (plug)
> + plug->nr_ios = data.nr_tags;
> +
> if (unlikely(!rq))
> rq_qos_cleanup(q, bio);
> return rq;
>
In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
plug->nr_ios to 1 in this case.
What am I missing?
Thanks,
Kuai
* Re: [PATCH] block: plug attempts to batch allocate tags multiple times
From: Xue He @ 2025-09-03 8:41 UTC (permalink / raw)
To: yukuai1, axboe; +Cc: linux-block, linux-kernel, yukuai3
On 2025/09/02 08:47 AM, Yu Kuai wrote:
>On 2025/09/01 16:22, Xue He wrote:
......
>> This patch aims to allow the remaining I/O operations to retry batch
>> allocation of tags, reducing the overhead caused by multiple
>> individual tag allocations.
>>
>> ------------------------------------------------------------------------
>> test result
>> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool
>> observed CPU improvements. The CPU usage of the original function
>> _blk_mq_alloc_requests function was 1.39%, which decreased to 0.82%
>> after modification.
>>
>> Additionally, performance variations were observed on different devices.
>> workload:randread
>> blocksize:4k
>> thread:1
>> ------------------------------------------------------------------------
>> PCIe Gen3 SSD PCIe Gen4 SSD PCIe Gen5 SSD
>> native kernel 553k iops 633k iops 793k iops
>> modified 553k iops 635k iops 801k iops
>>
>> with Optane SSDs, the performance like
>> two device one thread
>> cmd :sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>
>
>How many hw_queues and how many tags in each hw_queues in your nvme?
>I feel it's unlikely that tags can be exhausted, usually cpu will become
>bottleneck first.
My NVMe configuration is as follows:
number of CPUs: 16
memory: 16G
nvme nvme0: 16/0/16 default/read/poll queues
cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
1023
More precisely, I think it is not that the tags are fully exhausted,
but rather that, after scanning the bitmap for free bits, the remaining
contiguous bits are insufficient to meet the request (tags are
available, but not enough). The specific function involved is
__sbitmap_queue_get_batch() in lib/sbitmap.c:
	get_mask = ((1UL << nr_tags) - 1) << nr;
	if (nr_tags > 1)
		printk("before %ld\n", get_mask);
	while (!atomic_long_try_cmpxchg(ptr, &val, get_mask | val))
		;
	get_mask = (get_mask & ~val) >> nr;
During the batch acquisition of contiguous free bits, an atomic
operation is performed, and the tag mask actually obtained can differ
from the one originally requested.
Am I missing something?
>> base: 6.4 Million IOPS
>> patch: 6.49 Million IOPS
>>
>> two device two thread
>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>
>> base: 7.34 Million IOPS
>> patch: 7.48 Million IOPS
>> -------------------------------------------------------------------------
>>
>> Signed-off-by: hexue <xue01.he@samsung.com>
>> ---
>> block/blk-mq.c | 8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b67d6c02eceb..1fb280764b76 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>> if (blk_queue_enter(q, flags))
>> return NULL;
>>
>> - plug->nr_ios = 1;
>> -
>> rq = __blk_mq_alloc_requests(&data);
>> + plug->nr_ios = data.nr_tags;
>> +
>> if (unlikely(!rq))
>> blk_queue_exit(q);
>> return rq;
>> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>>
>> if (plug) {
>> data.nr_tags = plug->nr_ios;
>> - plug->nr_ios = 1;
>> data.cached_rqs = &plug->cached_rqs;
>> }
>>
>> rq = __blk_mq_alloc_requests(&data);
>> + if (plug)
>> + plug->nr_ios = data.nr_tags;
>> +
>> if (unlikely(!rq))
>> rq_qos_cleanup(q, bio);
>> return rq;
>>
>
>In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
>data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
>plug->nr_ios to 1 in this case.
>
>What am I missing?
Yes, you are right: if __blk_mq_alloc_requests_batch() fails, it is set
to 1. In this case, however, the function did not fail; the number of
tags allocated was merely insufficient, as only part of the requested
count was obtained, so the call is still considered successful.
>Thanks,
>Kuai
>
Thanks,
Xue
* Re: [PATCH] block: plug attempts to batch allocate tags multiple times
From: Yu Kuai @ 2025-09-03 9:35 UTC (permalink / raw)
To: Xue He, yukuai1, axboe; +Cc: linux-block, linux-kernel, yukuai (C)
Hi,
On 2025/09/03 16:41, Xue He wrote:
> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>> On 2025/09/01 16:22, Xue He wrote:
> ......
>>> This patch aims to allow the remaining I/O operations to retry batch
>>> allocation of tags, reducing the overhead caused by multiple
>>> individual tag allocations.
>>>
>>> ------------------------------------------------------------------------
>>> test result
>>> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool
>>> observed CPU improvements. The CPU usage of the original function
>>> _blk_mq_alloc_requests function was 1.39%, which decreased to 0.82%
>>> after modification.
>>>
>>> Additionally, performance variations were observed on different devices.
>>> workload:randread
>>> blocksize:4k
>>> thread:1
>>> ------------------------------------------------------------------------
>>> PCIe Gen3 SSD PCIe Gen4 SSD PCIe Gen5 SSD
>>> native kernel 553k iops 633k iops 793k iops
>>> modified 553k iops 635k iops 801k iops
>>>
>>> with Optane SSDs, the performance like
>>> two device one thread
>>> cmd :sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>
>> How many hw_queues and how many tags in each hw_queues in your nvme?
>> I feel it's unlikely that tags can be exhausted, usually cpu will become
>> bottleneck first.
>
> the information of my nvme like this:
> number of CPU: 16
> memory: 16G
> nvme nvme0: 16/0/16 default/read/poll queue
> cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
> 1023
>
> In more precise terms, I think it is not that the tags are fully exhausted,
> but rather that after scanning the bitmap for free bits, the remaining
> contiguous bits are insufficient to meet the requirement (have but not enough).
> The specific function involved is __sbitmap_queue_get_batch in lib/sbitmap.c.
> get_mask = ((1UL << nr_tags) - 1) << nr;
> if (nr_tags > 1) {
> printk("before %ld\n", get_mask);
> }
> while (!atomic_long_try_cmpxchg(ptr, &val,
> get_mask | val))
> ;
> get_mask = (get_mask & ~val) >> nr;
>
> where during the batch acquisition of contiguous free bits, an atomic operation
> is performed, resulting in the actual tag_mask obtained differing from the
> originally requested one.
Yes, so this function is likely to obtain fewer tags than nr_tags: the
mask always starts from the first zero bit and spans nr_tags bits, and
sbitmap_deferred_clear() is called unconditionally, so there are likely
non-zero bits within that range.
Just wondering, have you considered fixing this directly in
__blk_mq_alloc_requests_batch()?
- call sbitmap_deferred_clear() and retry on allocation failure, so
that the whole word can be used even if previously allocated requests
are done, especially for nvme with huge tag depths;
- retry blk_mq_get_tags() until data->nr_tags reaches zero;
>
> Am I missing something?
>
>>> base: 6.4 Million IOPS
>>> patch: 6.49 Million IOPS
>>>
>>> two device two thread
>>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>> base: 7.34 Million IOPS
>>> patch: 7.48 Million IOPS
>>> -------------------------------------------------------------------------
>>>
>>> Signed-off-by: hexue <xue01.he@samsung.com>
>>> ---
>>> block/blk-mq.c | 8 +++++---
>>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index b67d6c02eceb..1fb280764b76 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>>> if (blk_queue_enter(q, flags))
>>> return NULL;
>>>
>>> - plug->nr_ios = 1;
>>> -
>>> rq = __blk_mq_alloc_requests(&data);
>>> + plug->nr_ios = data.nr_tags;
>>> +
>>> if (unlikely(!rq))
>>> blk_queue_exit(q);
>>> return rq;
>>> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>>>
>>> if (plug) {
>>> data.nr_tags = plug->nr_ios;
>>> - plug->nr_ios = 1;
>>> data.cached_rqs = &plug->cached_rqs;
>>> }
>>>
>>> rq = __blk_mq_alloc_requests(&data);
>>> + if (plug)
>>> + plug->nr_ios = data.nr_tags;
>>> +
>>> if (unlikely(!rq))
>>> rq_qos_cleanup(q, bio);
>>> return rq;
>>>
>>
>> In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
>> data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
>> plug->nr_ios to 1 in this case.
>>
>> What am I missing?
>
> yes, you are right, if __blk_mq_alloc_requests_batch() failed, it will set
> to 1. However, in this case, it did not fail to execute; instead, the
> allocated number of tags was insufficient, as only a partial number were
> allocated. Therefore, the function is considered successfully executed.
>
Thanks for the explanation, I understand this now.
Thanks,
Kuai
>> Thanks,
>> Kuai
>>
>
> Thanks,
> Xue