public inbox for linux-nvme@lists.infradead.org
From: Nilay Shroff <nilay@linux.ibm.com>
To: Sagi Grimberg <sagi@grimberg.me>, Hannes Reinecke <hare@suse.de>,
	linux-nvme@lists.infradead.org
Cc: hch@lst.de, kbusch@kernel.org, dwagner@suse.de, axboe@kernel.dk,
	kanie@linux.alibaba.com, gjoyce@ibm.com
Subject: Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
Date: Sun, 4 Jan 2026 14:37:48 +0530	[thread overview]
Message-ID: <c8fa5728-5ee1-4f79-b288-71edcefe135d@linux.ibm.com> (raw)
In-Reply-To: <b98e91ef-254c-44bc-b46a-0039c66539e2@grimberg.me>



On 12/27/25 3:07 PM, Sagi Grimberg wrote:
> 
>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>> file I used for the test, followed by the observed throughput result for reference.
>>
>> Job file:
>> =========
>>
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> cpumode=qsort
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n2
>> rw=<randread/randwrite/randrw>
>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>> iodepth=32
>> numjobs=32
>> direct=1
>>
>> Throughput:
>> ===========
>>
>>          numa          round-robin   queue-depth    adaptive
>>          -----------   -----------   -----------    ---------
>> READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
>> WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
>> RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
>>          W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s
>>
>> When comparing the results, I did not observe a significant throughput
>> difference between the queue-depth, round-robin, and adaptive policies.
>> With random I/O of mixed sizes, the adaptive policy appears to average
>> out the varying latency values and distribute I/O reasonably evenly
>> across the active paths (assuming symmetric paths).
>>
>> Next I'd implement I/O size buckets and also per-NUMA-node weights and
>> then rerun the tests and share the results. Let's see if these changes
>> help further improve the throughput numbers for the adaptive policy. We
>> may then review the results and discuss further.
>>
>> Thanks,
>> --Nilay
> 
> two comments:
> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
> 2. I'd also suggest measuring with the weight calculation averaged across all cores of a NUMA node and then set per-CPU (such that
> the datapath does not introduce serialization).

Thanks for the suggestions. I ran experiments incorporating both points:
biasing I/O sizes by operation type, and comparing per-CPU against per-NUMA
weight calculation. The setup is described below.

Job file:
=========
[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
==========

[1] Block-size distributions:
    randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
    randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
    randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5

Results:
========

i) Symmetric paths + system load
   (CPU stress using cpuload):

         per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
         (MiB/s)        (MiB/s)          (MiB/s)        (MiB/s)
         -------   ------------------   --------   -------------------
READ:    636            621               613            618
WRITE:   1832           1847              1840           1852
RW:      R:872          R:869             R:866          R:874
         W:872          W:870             W:867          W:876

ii) Asymmetric paths + system load
    (CPU stress using cpuload, plus iperf3 traffic to induce network congestion):

         per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
         (MiB/s)        (MiB/s)          (MiB/s)        (MiB/s)
         -------   ------------------   --------   -------------------
READ:    553            543               540            533
WRITE:   1705           1670              1710           1655
RW:      R:769          R:771             R:784          R:772
         W:768          W:767             W:785          W:771


Looking at the above results,
- Per-CPU vs per-CPU with I/O buckets:
  The per-CPU implementation already averages latency effectively across CPUs.
  Introducing per-CPU I/O buckets does not yield a meaningful throughput
  improvement; the results remain largely comparable.

- Per-CPU vs per-NUMA aggregation:
  Calculating or averaging weights at the NUMA level does not significantly
  improve throughput over per-CPU weight calculation. Across both symmetric
  and asymmetric scenarios, the results remain very close.

Based on the above results and assessment, unless there are additional
scenarios or metrics of interest, shall we proceed with per-CPU weight
calculation for this new I/O policy?

Thanks,
--Nilay


Thread overview: 28+ messages
2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
2025-12-12 12:16   ` Sagi Grimberg
2025-11-05 10:33 ` [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
2025-12-12 13:04   ` Sagi Grimberg
2025-12-13  7:27     ` Nilay Shroff
2025-12-15 23:36       ` Sagi Grimberg
2025-12-18 11:19         ` Nilay Shroff
2025-12-18 13:46           ` Hannes Reinecke
2025-12-23 14:50             ` Nilay Shroff
2025-12-25 12:45               ` Sagi Grimberg
2025-12-26 18:16                 ` Nilay Shroff
2025-12-27  9:33                   ` Sagi Grimberg
2025-12-27  9:37                   ` Sagi Grimberg
2026-01-04  9:07                     ` Nilay Shroff [this message]
2026-01-04 21:06                       ` Sagi Grimberg
2026-01-06 14:16                         ` Nilay Shroff
2026-02-02 13:33                           ` Nilay Shroff
2026-01-07 11:15                         ` Hannes Reinecke
2025-12-25 12:28           ` Sagi Grimberg
2025-11-05 10:33 ` [RFC PATCHv5 3/7] nvme: add generic debugfs support Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 6/7] nvme-multipath: add debugfs attribute adaptive_stat Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 7/7] nvme-multipath: add documentation for adaptive I/O policy Nilay Shroff
2025-12-09 13:56 ` [RFC PATCHv5 0/7] nvme-multipath: introduce " Nilay Shroff
2025-12-12 12:08 ` Sagi Grimberg
2025-12-13  8:22   ` Nilay Shroff
