From: Nilay Shroff <nilay@linux.ibm.com>
To: linux-nvme@lists.infradead.org
Cc: hare@suse.de, hch@lst.de, kbusch@kernel.org, sagi@grimberg.me,
dwagner@suse.de, axboe@kernel.dk, kanie@linux.alibaba.com,
gjoyce@ibm.com
Subject: [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
Date: Wed, 5 Nov 2025 16:03:19 +0530
Message-ID: <20251105103347.86059-1-nilay@linux.ibm.com>

Hi,
This series introduces a new adaptive I/O policy for NVMe native
multipath. The existing policies, numa, round-robin, and queue-depth,
are static and do not adapt to real-time transport performance. The
numa policy selects the path closest to the NUMA node of the current
CPU, optimizing memory and path locality, but ignores actual path
performance. The round-robin policy distributes I/O evenly across all
paths, providing fairness but no performance awareness. The
queue-depth policy reacts to instantaneous queue occupancy, avoiding
heavily loaded paths, but does not account for actual latency,
throughput, or link speed.
The new adaptive policy addresses these gaps by selecting paths
dynamically, based on measured I/O latency, for both PCIe and fabrics.
Latency is derived by passively sampling I/O completions. Each path is
assigned a weight proportional to its latency score, and I/Os are then
forwarded accordingly. As conditions change (e.g. latency spikes,
bandwidth differences), path weights are updated, automatically
steering traffic toward better-performing paths.
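
To make the mechanism concrete, below is a minimal userspace C sketch
of the idea: an EWMA of completion latency per path, weights derived
from inverse latency and normalized to 0-128 (the range used since
v2), and credit-based forwarding that splits traffic in proportion to
the weights. All names, constants, and structure here are illustrative
assumptions, not the actual patch code.

/*
 * Illustrative sketch only; names, constants, and structure are
 * assumptions and differ from the actual multipath.c code.
 */
#include <stdint.h>
#include <stddef.h>

#define EWMA_SHIFT  3    /* each sample contributes 1/2^3 of the average */
#define MAX_WEIGHT  128  /* per-path weights are normalized to 0..128 */

struct io_path {
        uint64_t ewma_lat;  /* smoothed completion latency (ns) */
        uint32_t weight;    /* share of traffic, 0..MAX_WEIGHT */
        uint32_t credits;   /* I/Os left before trying the next path */
};

/* Passive sampling: fold one completion latency into the average. */
static void path_sample(struct io_path *p, uint64_t lat_ns)
{
        if (!p->ewma_lat)
                p->ewma_lat = lat_ns;
        else
                p->ewma_lat = p->ewma_lat - (p->ewma_lat >> EWMA_SHIFT)
                              + (lat_ns >> EWMA_SHIFT);
}

/* Periodic reweighting: lower latency => larger share of MAX_WEIGHT. */
static void recompute_weights(struct io_path *p, size_t n)
{
        uint64_t score, total = 0;
        size_t i;

        for (i = 0; i < n; i++)
                total += p[i].ewma_lat ?
                         UINT64_C(1000000000) / p[i].ewma_lat : 0;
        for (i = 0; i < n; i++) {
                score = p[i].ewma_lat ?
                        UINT64_C(1000000000) / p[i].ewma_lat : 0;
                /* no samples yet anywhere: treat all paths equally */
                p[i].weight = total ?
                        (uint32_t)(score * MAX_WEIGHT / total) : MAX_WEIGHT;
        }
}

/* Forwarding: consume credits; refill from the weights once all are
 * spent, so each refill cycle issues ~weight I/Os on each path. */
static struct io_path *select_path(struct io_path *p, size_t n)
{
        int round;
        size_t i;

        for (round = 0; round < 2; round++) {
                for (i = 0; i < n; i++) {
                        if (p[i].credits) {
                                p[i].credits--;
                                return &p[i];
                        }
                }
                for (i = 0; i < n; i++)
                        p[i].credits = p[i].weight;
        }
        return NULL; /* all weights zero; caller must fall back */
}

The real series samples completions via blk_stat (patch 1) and
recomputes weights from a workqueue rather than in the completion path
(see the v1 changelog below).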
Early results show reduced tail latency under mixed workloads and
improved throughput by exploiting higher-speed links more effectively.
For example, with NVMf/TCP using two paths (one throttled with a
~30 ms delay), fio results for random read, write, and mixed
read/write workloads (direct I/O) showed:
             numa          round-robin   queue-depth   adaptive
             ------------  ------------  ------------  ------------
READ:        50.0 MiB/s    105 MiB/s     230 MiB/s     350 MiB/s
WRITE:       65.9 MiB/s    125 MiB/s     385 MiB/s     446 MiB/s
RW:          R:30.6 MiB/s  R:56.5 MiB/s  R:122 MiB/s   R:175 MiB/s
             W:30.7 MiB/s  W:56.5 MiB/s  W:122 MiB/s   W:175 MiB/s
This patchset includes a total of 7 patches:
[PATCH 1/7] block: expose blk_stat_{enable,disable}_accounting()
- Make blk_stat APIs available to block drivers.
- Needed for per-path latency measurement in adaptive policy.
[PATCH 2/7] nvme-multipath: add adaptive I/O policy
- Implement path scoring based on latency (EWMA).
- Distribute I/O proportionally to per-path weights.
[PATCH 3/7] nvme: add generic debugfs support
- Introduce generic debugfs support for the NVMe module (see the
  sketch after this list).
[PATCH 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
- Add a debugfs attribute to control the EWMA shift.
[PATCH 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
- Add a debugfs attribute to control the path weight calculation
  timeout.
[PATCH 6/7] nvme-multipath: add debugfs attribute adaptive_stat
- Add "adaptive_stat" under the per-path and head debugfs directories
  to expose adaptive policy state and statistics.
[PATCH 7/7] nvme-multipath: add documentation for adaptive I/O policy
- Add documentation for the adaptive I/O multipath policy.
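
As an illustration of what patches 3-6 add, here is a hypothetical
sketch of the debugfs wiring. The attribute names come from the patch
titles above, but the directory layout, permissions, and defaults are
assumptions, not the actual patch code.

/*
 * Hypothetical sketch of the debugfs wiring in patches 3-6; the real
 * directory layout, permissions, and defaults may differ.
 */
#include <linux/debugfs.h>

static struct dentry *nvme_debugfs_root;

static u32 adaptive_ewma_shift = 3;        /* assumed default */
static u32 adaptive_weight_timeout = 1000; /* assumed default */

static void nvme_mpath_debugfs_init(void)
{
        nvme_debugfs_root = debugfs_create_dir("nvme", NULL);
        debugfs_create_u32("adaptive_ewma_shift", 0600,
                           nvme_debugfs_root, &adaptive_ewma_shift);
        debugfs_create_u32("adaptive_weight_timeout", 0600,
                           nvme_debugfs_root, &adaptive_weight_timeout);
}

static void nvme_mpath_debugfs_exit(void)
{
        debugfs_remove_recursive(nvme_debugfs_root);
}

With wiring like this, the knobs would be tunable at runtime, e.g.
echo 4 > /sys/kernel/debug/nvme/adaptive_ewma_shift (the exact path
is an assumption here).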
As usual, feedback and suggestions are most welcome!
Thanks!
Changes from v4:
- Added patch #7 which includes the documentation for adaptive I/O
policy. (Guixin Liu)
Link to v4: https://lore.kernel.org/all/20251104104533.138481-1-nilay@linux.ibm.com/
Changes from v3:
- Update the adaptive API names (which enable/disable the adaptive
  policy) to reflect the actual work they do. Also remove the
  misleading use of "current_path" from the adaptive policy code
  (Hannes Reinecke)
- Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
sysfs to debugfs (Hannes Reinecke)
Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
Changes from v2:
- Add a new patch to allow the user to configure the EWMA shift
  through sysfs (Hannes Reinecke)
- Add a new patch to allow the user to configure the path weight
  calculation timeout (Hannes Reinecke)
- Distinguish between read/write and other commands (e.g. admin
  commands) and calculate a path weight for other commands separately
  from the read/write weight (Hannes Reinecke)
- Normalize per-path weights to the range 0-128 instead of 0-100
  (Hannes Reinecke)
- Restructure and optimize adaptive I/O forwarding code to use
one loop instead of two (Hannes Reinecke)
Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
Changes from v1:
- Ensure that I/O completions are processed on the same CPU that
  submitted the I/O (Hannes Reinecke)
- Remove adapter link speed from the path weight calculation
(Hannes Reinecke)
- Add adaptive I/O statistics under debugfs instead of sysfs
  (Hannes Reinecke)
- Move the path weight calculation from the I/O completion path to a
  workqueue (see the sketch after this list)
Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
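
For reference, deferring the reweighting to a workqueue typically
follows the standard kernel pattern sketched below; all names here are
hypothetical, not the series' actual code:

#include <linux/workqueue.h>

/* Hypothetical container for the adaptive per-head state. */
struct adaptive_state {
        struct work_struct weight_work;
        /* ... per-path EWMA samples ... */
};

static void adaptive_weight_fn(struct work_struct *work)
{
        struct adaptive_state *st =
                container_of(work, struct adaptive_state, weight_work);

        /* recompute per-path weights here, off the completion path */
        (void)st;
}

/* Setup: INIT_WORK(&st->weight_work, adaptive_weight_fn);
 * Completion path: sample latency cheaply, then schedule_work() so
 * the heavier reweighting runs in process context. */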
Nilay Shroff (7):
block: expose blk_stat_{enable,disable}_accounting() to drivers
nvme-multipath: add support for adaptive I/O policy
nvme: add generic debugfs support
nvme-multipath: add debugfs attribute adaptive_ewma_shift
nvme-multipath: add debugfs attribute adaptive_weight_timeout
nvme-multipath: add debugfs attribute adaptive_stat
nvme-multipath: add documentation for adaptive I/O policy
Documentation/admin-guide/nvme-multipath.rst | 19 +
block/blk-stat.h | 4 -
drivers/nvme/host/Makefile | 2 +-
drivers/nvme/host/core.c | 22 +-
drivers/nvme/host/debugfs.c | 335 +++++++++++++++
drivers/nvme/host/ioctl.c | 31 +-
drivers/nvme/host/multipath.c | 430 ++++++++++++++++++-
drivers/nvme/host/nvme.h | 86 +++-
drivers/nvme/host/pr.c | 6 +-
drivers/nvme/host/sysfs.c | 2 +-
include/linux/blk-mq.h | 4 +
11 files changed, 913 insertions(+), 28 deletions(-)
create mode 100644 drivers/nvme/host/debugfs.c
--
2.51.0