* [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy
@ 2025-09-21 11:12 Nilay Shroff
  2025-09-21 11:12 ` [RFC PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Nilay Shroff @ 2025-09-21 11:12 UTC (permalink / raw)
  To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, hare, dwagner, gjoyce

Hi,

This series introduces a new adaptive I/O policy for NVMe native
multipath. The existing policies (numa, round-robin, and queue-depth)
are static and do not adapt to real-time transport performance. The
numa policy selects the path closest to the NUMA node of the current
CPU, optimizing memory and path locality, but it ignores actual path
performance. The round-robin policy distributes I/O evenly across all
paths, providing fairness but no performance awareness. The queue-depth
policy reacts to instantaneous queue occupancy and avoids heavily
loaded paths, but it does not account for actual latency, throughput,
or link speed.

The new adaptive policy addresses these gaps by selecting paths
dynamically based on measured I/O latency and, for fabrics, the
negotiated link speed. Latency is derived by passively sampling I/O
completions. Link speed is queried from the adapter and factored into
path scoring. Each path is assigned a weight proportional to its score,
and I/O is then forwarded accordingly. As conditions change (e.g.
latency spikes or bandwidth differences), path weights are updated,
automatically steering traffic toward better-performing paths.
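
For illustration, the fixed-point EWMA bookkeeping this implies could
look roughly like the sketch below. This is not the code from the
series; the names (adp_path, adp_sample_latency) and the shift constant
are made up for the example:

#include <linux/types.h>

#define ADP_EWMA_SHIFT	3	/* history weight: 7/8 old, 1/8 new sample */

struct adp_path {
	u64	ewma_lat_ns;	/* smoothed completion latency */
	u32	weight;		/* relative share of forwarded I/O */
};

/* Fold one completion latency sample into the per-path EWMA. */
static void adp_sample_latency(struct adp_path *p, u64 lat_ns)
{
	u64 old = p->ewma_lat_ns;

	/* ewma = old - old/2^k + sample/2^k; integer math only */
	p->ewma_lat_ns = old - (old >> ADP_EWMA_SHIFT) +
			 (lat_ns >> ADP_EWMA_SHIFT);
}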

Early results show reduced tail latency under mixed workloads and
improved throughput by exploiting higher-speed links more effectively.
For example, with NVMf/TCP using two paths (one throttled with ~30 ms
delay), fio results with random read/write/rw workloads (direct I/O)
showed:

        numa         round-robin   queue-depth  adaptive
        -----------  -----------   -----------  ---------
READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
        W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s

This patchset includes a total of 5 patches:
[PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting()
  - Make blk_stat APIs available to block drivers.
  - Needed for per-path latency measurement in adaptive policy.

[PATCH 2/5] nvme-multipath: add adaptive I/O policy
  - Implement path scoring based on latency (EWMA).
  - Distribute I/O proportionally to per-path weights.

[PATCH 3/5] nvme-multipath: add sysfs attribute for adaptive policy
  - Introduce an "adp_stat" attribute under each NVMe path block device.
  - Provide observability of latency, weight, and selection stats.

[PATCH 4/5] nvme-tcp: export NIC link speed
  - Retrieve negotiated link speed (Mbps) from the adapter.
  - Expose via sysfs for visibility/debugging.

[PATCH 5/5] nvme-multipath: factor link speed into path scoring
  - Adjust adaptive path weights using link speed as a multiplier.
  - Favor higher-bandwidth links while still considering latency (a
    rough sketch of this weighting follows the patch list).
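
To make the patch descriptions above more concrete, here is a rough
sketch of how a smoothed latency and a link speed could be combined
into proportional per-path weights, extending the adp_path struct from
the earlier sketch with a link-speed field. The scoring formula, the
constants, and all names are assumptions for illustration only, not
the code in patches 2 and 5:

#include <linux/types.h>
#include <linux/math64.h>

#define ADP_WEIGHT_SCALE	100	/* per-path weights sum to ~100 */

struct adp_path {
	u64	ewma_lat_ns;		/* from completion sampling */
	u32	link_speed_mbps;	/* 0 if unknown (e.g. PCIe path) */
	u32	weight;			/* share of I/O, out of ADP_WEIGHT_SCALE */
};

/* Higher score for faster links and for lower smoothed latency. */
static u64 adp_path_score(struct adp_path *p)
{
	u64 lat = p->ewma_lat_ns ?: 1;
	u64 speed = p->link_speed_mbps ?: 1;

	return div64_u64(speed * 1000000ULL, lat);
}

/* Recompute per-path weights in proportion to their scores. */
static void adp_update_weights(struct adp_path *paths, int nr)
{
	u64 total = 0;
	int i;

	for (i = 0; i < nr; i++)
		total += adp_path_score(&paths[i]);

	for (i = 0; i < nr; i++) {
		if (total)
			paths[i].weight = div64_u64(adp_path_score(&paths[i]) *
						    ADP_WEIGHT_SCALE, total);
		else
			paths[i].weight = ADP_WEIGHT_SCALE / nr;
	}
}

A path selector would then hand out I/O in proportion to these weights
(for example via weighted round-robin credits), recomputing them
periodically as new latency samples arrive.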

Currently, link speed reporting is implemented only for TCP NICs.
Support for Fibre Channel adapters will follow in a future patch.
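
For TCP NICs, the standard in-kernel way to query the negotiated speed
is the ethtool link-settings interface. The helper below is only a
hedged illustration of that (the helper name is made up, and how
nvme-tcp would resolve the net_device behind its socket is not shown):

#include <linux/ethtool.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

/*
 * Illustrative helper, not code from this series: return the negotiated
 * link speed of @netdev in Mbps, or 0 if the speed is not reported.
 */
static u32 adp_netdev_speed_mbps(struct net_device *netdev)
{
	struct ethtool_link_ksettings ks;
	u32 speed = 0;

	rtnl_lock();	/* __ethtool_get_link_ksettings() expects RTNL held */
	if (!__ethtool_get_link_ksettings(netdev, &ks) &&
	    ks.base.speed != (u32)SPEED_UNKNOWN)
		speed = ks.base.speed;
	rtnl_unlock();

	return speed;
}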

As usual, feedback and suggestions are most welcome!

Thanks!

Nilay Shroff (5):
  block: expose blk_stat_{enable,disable}_accounting() to drivers
  nvme-multipath: add support for adaptive I/O policy
  nvme-multipath: add sysfs attribute for adaptive I/O policy
  nvmf-tcp: add support for retrieving adapter link speed
  nvme-multipath: factor fabric link speed into path score

 block/blk-stat.h              |   4 -
 drivers/nvme/host/core.c      |  10 +-
 drivers/nvme/host/ioctl.c     |   7 +-
 drivers/nvme/host/multipath.c | 441 +++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h      |  38 ++-
 drivers/nvme/host/pr.c        |   6 +-
 drivers/nvme/host/sysfs.c     |  12 +-
 drivers/nvme/host/tcp.c       |  66 +++++
 include/linux/blk-mq.h        |   4 +
 9 files changed, 562 insertions(+), 26 deletions(-)

-- 
2.51.0





Thread overview: 16+ messages
2025-09-21 11:12 [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
2025-09-22  7:30   ` Hannes Reinecke
2025-09-23  3:43     ` Nilay Shroff
2025-09-23  7:03       ` Hannes Reinecke
2025-09-23 10:56         ` Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 3/5] nvme-multipath: add sysfs attribute " Nilay Shroff
2025-09-22  7:35   ` Hannes Reinecke
2025-09-23  3:53     ` Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed Nilay Shroff
2025-09-22  7:38   ` Hannes Reinecke
2025-09-23  9:33     ` Nilay Shroff
2025-09-23 10:27       ` Hannes Reinecke
2025-09-23 17:58         ` Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 5/5] nvme-multipath: factor fabric link speed into path score Nilay Shroff
