From: Nilay Shroff <nilay@linux.ibm.com>
To: Keith Busch <kbusch@kernel.org>
Cc: hare@suse.de, hch@lst.de, sagi@grimberg.me, dwagner@suse.de,
	axboe@kernel.dk, kanie@linux.alibaba.com, gjoyce@ibm.com,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
Date: Tue, 9 Dec 2025 19:26:40 +0530
Message-ID: <11893d18-e66a-495c-b318-17dfc7338ec7@linux.ibm.com>
In-Reply-To: <20251105103347.86059-1-nilay@linux.ibm.com>

Hi Keith,

Just a gentle ping on this one...

It has been reviewed and has been ready for some time now, and I wanted to
check whether you had any remaining feedback or concerns, or whether you
could consider pulling it into nvme-next.

Link to the latest version for convenience:
https://lore.kernel.org/all/20251105103347.86059-1-nilay@linux.ibm.com/

Please let me know if there's anything further needed on my side.

Thanks,
--Nilay

On 11/5/25 4:03 PM, Nilay Shroff wrote:
> Hi,
> 
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance. The numa
> policy selects the path closest to the NUMA node of the current CPU,
> optimizing memory and path locality, but ignores actual path performance.
> The round-robin policy distributes I/O evenly across all paths, providing
> fairness but not performance awareness. The queue-depth policy reacts to
> instantaneous queue occupancy, avoiding heavily loaded paths, but does
> not account for actual latency, throughput, or link speed.
> 
> The new adaptive policy addresses these gaps by selecting paths
> dynamically based on measured I/O latency for both PCIe and fabrics.
> Latency is derived by passively sampling I/O completions. Each path is
> assigned a weight proportional to its latency score, and I/Os are then
> forwarded accordingly. As conditions change (e.g. latency spikes,
> bandwidth differences), path weights are updated, automatically steering
> traffic toward better-performing paths.
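
To make the weighting described above a bit more concrete, here is a rough
userspace sketch. It is not the code from the series. It assumes an integer
EWMA controlled by a shift (in the spirit of adaptive_ewma_shift) and
assumes that the "latency score" is the inverse of the averaged latency,
with weights normalized to the 0-128 range. The names ewma_update(),
latency_score(), compute_weights() and WEIGHT_SCALE are hypothetical and
exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: weights are normalized to 0..128 (hypothetical constant name). */
#define WEIGHT_SCALE 128u

/*
 * Integer EWMA: avg += (sample - avg) / 2^shift.
 * The first sample (avg == 0) seeds the average directly.
 */
static uint64_t ewma_update(uint64_t avg, uint64_t sample, unsigned int shift)
{
	if (avg == 0)
		return sample;
	if (sample >= avg)
		return avg + ((sample - avg) >> shift);
	return avg - ((avg - sample) >> shift);
}

/* Assumption: score is inverse latency; a path with no samples scores 0. */
static uint64_t latency_score(uint64_t lat_ns)
{
	return lat_ns ? 1000000000ULL / lat_ns : 0;
}

/* Scale each path's share of the total score into 0..WEIGHT_SCALE. */
static void compute_weights(const uint64_t *lat_ns, uint32_t *weight, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += latency_score(lat_ns[i]);

	for (i = 0; i < n; i++)
		weight[i] = total ?
			(uint32_t)(latency_score(lat_ns[i]) * WEIGHT_SCALE / total) : 0;
}
```

With this sketch, two paths with equal averaged latency each get a weight
of 64, and a path that is three times slower than its peer ends up with
roughly a quarter of the total weight.
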
> 
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
> 
>         numa         round-robin   queue-depth  adaptive
>         -----------  -----------   -----------  ---------
> READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
> WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
> RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
>         W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s
> 
> This patchset includes a total of 7 patches:
> [PATCH 1/7] block: expose blk_stat_{enable,disable}_accounting()
>   - Make blk_stat APIs available to block drivers.
>   - Needed for per-path latency measurement in adaptive policy.
> 
> [PATCH 2/7] nvme-multipath: add adaptive I/O policy
>   - Implement path scoring based on latency (EWMA).
>   - Distribute I/O proportionally to per-path weights.
> 
> [PATCH 3/7] nvme: add generic debugfs support
>   - Introduce generic debugfs support for NVMe module
> 
> [PATCH 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
>   - Adds a debugfs attribute to control ewma shift
> 
> [PATCH 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
>   - Adds a debugfs attribute to control path weight calculation timeout
> 
> [PATCH 6/7] nvme-multipath: add debugfs attribute adaptive_stat
>   - Add “adaptive_stat” under per-path and head debugfs directories to
>     expose adaptive policy state and statistics.
> 
> [PATCH 7/7] nvme-multipath: add documentation for adaptive I/O policy
>   - Includes documentation for adaptive I/O multipath policy.
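
As an illustration of what "distribute I/O proportionally to per-path
weights" can look like, below is a small userspace sketch that uses smooth
weighted round-robin. This is a stand-in technique chosen for clarity, not
necessarily how patch 2/7 forwards I/O. struct sketch_path and pick_path()
are hypothetical names, and the weights are assumed to be already
normalized to 0-128.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-path state for this sketch only. */
struct sketch_path {
	uint32_t weight;	/* normalized weight, assumed 0..128 */
	int64_t current;	/* running credit used by the selector */
};

/*
 * Smooth weighted round-robin in a single loop: every eligible path earns
 * its weight as credit, the path with the most credit is picked, and the
 * picked path pays back the sum of all weights.
 * Returns the index of the chosen path, or -1 if no path has a weight.
 */
static int pick_path(struct sketch_path *paths, size_t n)
{
	int64_t total = 0;
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (paths[i].weight == 0)
			continue;	/* skip paths with no weight */
		paths[i].current += paths[i].weight;
		total += paths[i].weight;
		if (best < 0 || paths[i].current > paths[best].current)
			best = (int)i;
	}

	if (best >= 0)
		paths[best].current -= total;
	return best;
}
```

Over many calls, each path is chosen in proportion to its weight, and the
picks are interleaved rather than issued in long bursts to a single path.
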
> 
> As usual, feedback and suggestions are most welcome!
> 
> Thanks!
> 
> Changes from v4:
>   - Added patch #7 which includes the documentation for adaptive I/O
>     policy. (Guixin Liu)
> Link to v4: https://lore.kernel.org/all/20251104104533.138481-1-nilay@linux.ibm.com/    
> 
> Changes from v3:
>   - Update the adaptive APIs name (which actually enable/disable
>     adaptive policy) to reflect the actual work it does. Also removed
>     the misleading use of "current_path" from the adaptive policy code
>     (Hannes Reinecke)
>   - Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
>     sysfs to debugfs (Hannes Reinecke)
> Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
> 
> Changes from v2:
>   - Added a new patch to allow the user to configure the EWMA shift
>     through sysfs (Hannes Reinecke)
>   - Added a new patch to allow user to configure path weight
>     calculation timeout (Hannes Reinecke)
>   - Distinguish between read/write and other commands (e.g.
>     admin commands) and calculate a path weight for other commands
>     that is separate from the read/write weight. (Hannes Reinecke)
>   - Normalize per-path weight in the range from 0-128 instead
>     of 0-100 (Hannes Reinecke)
>   - Restructure and optimize adaptive I/O forwarding code to use
>     one loop instead of two (Hannes Reinecke)
> Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
> 
> Changes from v1:
>   - Ensure that the completion of I/O occurs on the same CPU as the
>     submitting I/O CPU (Hannes Reinecke)
>   - Remove adapter link speed from the path weight calculation
>     (Hannes Reinecke)
>   - Add adaptive I/O stat under debugfs instead of current sysfs
>     (Hannes Reinecke)
>   - Move path weight calculation to a workqueue from IO completion
>     code path
> Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
> 
> Nilay Shroff (7):
>   block: expose blk_stat_{enable,disable}_accounting() to drivers
>   nvme-multipath: add support for adaptive I/O policy
>   nvme: add generic debugfs support
>   nvme-multipath: add debugfs attribute adaptive_ewma_shift
>   nvme-multipath: add debugfs attribute adaptive_weight_timeout
>   nvme-multipath: add debugfs attribute adaptive_stat
>   nvme-multipath: add documentation for adaptive I/O policy
> 
>  Documentation/admin-guide/nvme-multipath.rst |  19 +
>  block/blk-stat.h                             |   4 -
>  drivers/nvme/host/Makefile                   |   2 +-
>  drivers/nvme/host/core.c                     |  22 +-
>  drivers/nvme/host/debugfs.c                  | 335 +++++++++++++++
>  drivers/nvme/host/ioctl.c                    |  31 +-
>  drivers/nvme/host/multipath.c                | 430 ++++++++++++++++++-
>  drivers/nvme/host/nvme.h                     |  86 +++-
>  drivers/nvme/host/pr.c                       |   6 +-
>  drivers/nvme/host/sysfs.c                    |   2 +-
>  include/linux/blk-mq.h                       |   4 +
>  11 files changed, 913 insertions(+), 28 deletions(-)
>  create mode 100644 drivers/nvme/host/debugfs.c
> 


Thread overview: 28+ messages
2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
2025-12-12 12:16   ` Sagi Grimberg
2025-11-05 10:33 ` [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
2025-12-12 13:04   ` Sagi Grimberg
2025-12-13  7:27     ` Nilay Shroff
2025-12-15 23:36       ` Sagi Grimberg
2025-12-18 11:19         ` Nilay Shroff
2025-12-18 13:46           ` Hannes Reinecke
2025-12-23 14:50             ` Nilay Shroff
2025-12-25 12:45               ` Sagi Grimberg
2025-12-26 18:16                 ` Nilay Shroff
2025-12-27  9:33                   ` Sagi Grimberg
2025-12-27  9:37                   ` Sagi Grimberg
2026-01-04  9:07                     ` Nilay Shroff
2026-01-04 21:06                       ` Sagi Grimberg
2026-01-06 14:16                         ` Nilay Shroff
2026-02-02 13:33                           ` Nilay Shroff
2026-01-07 11:15                         ` Hannes Reinecke
2025-12-25 12:28           ` Sagi Grimberg
2025-11-05 10:33 ` [RFC PATCHv5 3/7] nvme: add generic debugfs support Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 6/7] nvme-multipath: add debugfs attribute adaptive_stat Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 7/7] nvme-multipath: add documentation for adaptive I/O policy Nilay Shroff
2025-12-09 13:56 ` Nilay Shroff [this message]
2025-12-12 12:08 ` [RFC PATCHv5 0/7] nvme-multipath: introduce " Sagi Grimberg
2025-12-13  8:22   ` Nilay Shroff
