From: Nilay Shroff <nilay@linux.ibm.com>
To: Sagi Grimberg <sagi@grimberg.me>, Hannes Reinecke <hare@suse.de>,
linux-nvme@lists.infradead.org
Cc: hch@lst.de, kbusch@kernel.org, dwagner@suse.de, axboe@kernel.dk,
kanie@linux.alibaba.com, gjoyce@ibm.com
Subject: Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Date: Mon, 2 Feb 2026 19:03:49 +0530
Message-ID: <cf426e56-d14e-413b-a846-5dd437cbf040@linux.ibm.com>
In-Reply-To: <6c2ed0d7-fd7d-4375-9e77-501a24494531@linux.ibm.com>
On 1/6/26 7:46 PM, Nilay Shroff wrote:
>
>
> On 1/5/26 2:36 AM, Sagi Grimberg wrote:
>>
>>
>> On 04/01/2026 11:07, Nilay Shroff wrote:
>>>
>>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is
>>>>> the job file I used for the test, followed by the observed throughput
>>>>> results for reference.
>>>>>
>>>>> Job file:
>>>>> =========
>>>>>
>>>>> [global]
>>>>> time_based
>>>>> runtime=120
>>>>> group_reporting=1
>>>>>
>>>>> [cpu]
>>>>> ioengine=cpuio
>>>>> cpuload=85
>>>>> cpumode=qsort
>>>>> numjobs=32
>>>>>
>>>>> [disk]
>>>>> ioengine=io_uring
>>>>> filename=/dev/nvme1n2
>>>>> rw=<randread/randwrite/randrw>
>>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>>> iodepth=32
>>>>> numjobs=32
>>>>> direct=1
>>>>>
>>>>> Throughput:
>>>>> ===========
>>>>>
>>>>> numa round-robin queue-depth adaptive
>>>>> ----------- ----------- ----------- ---------
>>>>> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
>>>>> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
>>>>> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
>>>>> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>>>>>
>>>>> When comparing the results, I did not observe a significant throughput
>>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>>> out the varying latency values and distribute I/O reasonably evenly
>>>>> across the active paths (assuming symmetric paths).
>>>>>
>>>>> Next I'd implement I/O size buckets and also a per-NUMA-node weight, and
>>>>> then rerun the tests and share the results. Let's see if these changes
>>>>> help further improve the throughput numbers for the adaptive policy. We
>>>>> may then review the results again and discuss further.
>>>>>
>>>>> Thanks,
>>>>> --Nilay
>>>> two comments:
>>>> 1. I'd make the read split slightly biased towards small block sizes, and the write split biased towards larger block sizes
>>>> 2. I'd also suggest measuring with the weight calculation averaged across all NUMA-node cores and then set per-CPU (such that
>>>> the datapath does not introduce serialization)
>>> Thanks for the suggestions. I ran experiments incorporating both points—
>>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>>> weight calculation—using the following setup.
>>>
>>> Job file:
>>> =========
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n1
>>> rw=<randread/randwrite/randrw>
>>> bssplit=<based-on-I/O-pattern-type>[1]
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>> ==========
>>>
>>> [1] Block-size distributions:
>>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>>
>>> Results:
>>> =======
>>>
>>> i) Symmetric paths + system load
>>> (CPU stress using cpuload):
>>>
>>> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
>>> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
>>> ------- ------------------- -------- -------------------
>>> READ: 636 621 613 618
>>> WRITE: 1832 1847 1840 1852
>>> RW: R:872 R:869 R:866 R:874
>>> W:872 W:870 W:867 W:876
>>>
>>> ii) Asymmetric paths + system load
>>> (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>>
>>> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
>>> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
>>> ------- ------------------- -------- -------------------
>>> READ: 553 543 540 533
>>> WRITE: 1705 1670 1710 1655
>>> RW: R:769 R:771 R:784 R:772
>>> W:768 W:767 W:785 W:771
>>>
>>>
>>> Looking at the above results,
>>> - Per-CPU vs per-CPU with I/O buckets:
>>> The per-CPU implementation already averages latency effectively across CPUs.
>>> Introducing per-CPU I/O buckets does not provide a meaningful throughput
>>> improvement and remains largely comparable.
>>>
>>> - Per-CPU vs per-NUMA aggregation:
>>> Calculating or averaging weights at the NUMA level does not significantly
>>> improve throughput over per-CPU weight calculation. Across both symmetric
>>> and asymmetric scenarios, the results remain very close.
>>>
>>> So, based on the above results and assessment, unless there are additional
>>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>>> calculation for this new I/O policy?
>>
>> I think it is counterintuitive that bucketing I/O sizes does not present any advantage. Don't you?
>> Maybe the test is not a good enough representation...
>>
> Hmm, you were correct. I also thought the same, but I couldn't find
> any test which could prove the advantage of using I/O buckets. So
> today I spent some time thinking about scenarios that could
> demonstrate the worth of I/O buckets, and after some thought I came
> up with the following use cases.
>
> Size-dependent path behavior:
>
> 1. Example:
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all
>
> Now running mixed I/O (bssplit => 16k/75:64k/25),
>
> Without buckets:
> Path B looks good; scheduler forwards more I/Os towards path B.
>
> With buckets:
> small I/Os are distributed across path A and B
> large I/Os favor path B
>
> So in theory, throughput should improve with buckets.
>
> 2. Example:
> Path A: good for ≤16k, bad for ≥32k
> Path B: opposite
>
> Without buckets:
> latency averages cancel out
> scheduler sees “paths are equal”
>
> With buckets:
> small I/O bucket favors A
> large I/O bucket favors B
>
> Again, in theory, throughput should improve with buckets.
>
> With the above in mind, I ran another experiment; the results are
> shown below:
>
> I injected additional delay on one path for larger I/Os (>=32k)
> and mixed I/O sizes with bssplit => 16k/75:64k/25. So with this
> test, we have:
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all
>
> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
> ------- ------------------- -------- -------------------
> READ: 550 622 523 615
> WRITE: 726 829 747 834
> RW: R:324 R:381 R: 306 R:375
> W:323 W:381 W: 306 W:374
>
> So yes, I/O buckets could be useful for the scenario tested
> above. And regarding per-CPU vs per-NUMA weight calculation:
> do you agree per-CPU should be good enough for this policy,
> given that, as seen above, per-NUMA does not improve
> performance much?
>
>
>> Lets also test what happens with multiple clients against the same subsystem.
> Yes, this is a good test to run; I will test and post the results.
>
Finally, I was able to run tests with two nvmf-tcp hosts connected to the
same nvmf-tcp target. Apologies for the delay; setting up this topology took
some time, partly due to non-technical infrastructure challenges after our
recent lab relocation.

The goal of these tests was to evaluate per-CPU vs per-NUMA weight
calculation, with and without I/O size buckets, under multi-client
contention. I ran randread, randwrite, and randrw workloads with mixed I/O
sizes (using bssplit) and added CPU stress on the hosts using cpuload, as in
my earlier tests. Please find the test results and observations below.

Workload characteristics:
=========================
- Workloads tested: randread, randwrite, randrw
- Mixed I/O sizes using bssplit
- CPU stress induced using cpuload
- Both hosts run workloads simultaneously

Job file:
=========

[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
ramp_time=120

[1] Block-size distributions:
randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
Test topology:
==============

1. Two nvmf-tcp hosts connected to the same nvmf-tcp target
2. Each host connects to the target using two symmetric paths
3. System load on each host is induced using cpuload (as shown in the job file)
4. Both hosts run I/O workloads concurrently

Results:
========

Host1:
per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
(MiB/s) (MiB/s) (MiB/s) (MiB/s)
------- ------------------- -------- -------------------
READ: 153 164 166 131
WRITE: 839 837 889 839
RW: R:249 R:255 R:226 R:256
W:247 W:254 W:225 W:253

Host2:
per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
(MiB/s) (MiB/s) (MiB/s) (MiB/s)
------- ------------------- -------- -------------------
READ: 268 258 279 268
WRITE: 1012 992 880 1017
RW: R:386 R:410 R:401 R:405
W:385 W:409 W:399 W:405
These results give me the same impression as the earlier tests between a
single nvmf-tcp host and the target:

Per-CPU vs per-CPU with I/O buckets:
- The per-CPU implementation already averages latency effectively across CPUs.
- Introducing per-CPU I/O buckets does not provide a meaningful throughput
  improvement in the general case.
- Results remain largely comparable across workloads and hosts.
- However, as shown in the earlier experiments with I/O size-dependent path
  behavior, I/O buckets can provide measurable benefits in specific scenarios.
Per-CPU vs per-NUMA aggregation:
- Calculating or averaging weights at the NUMA level does not significantly
  improve throughput over per-CPU weight calculation.
- This holds true even under multi-host contention.

Based on all the tests conducted so far, covering symmetric and asymmetric
paths, CPU stress, size-dependent path behavior, and multi-client access to
the same target, the results suggest that we should move forward with a
per-CPU implementation using I/O buckets. That said, I am open to any further
feedback, suggestions, or additional scenarios worth evaluating.
Thanks,
--Nilay