From: Guixin Liu <kanie@linux.alibaba.com>
To: Nilay Shroff <nilay@linux.ibm.com>, linux-nvme@lists.infradead.org
Cc: hare@suse.de, hch@lst.de, kbusch@kernel.org, sagi@grimberg.me,
dwagner@suse.de, axboe@kernel.dk, gjoyce@ibm.com
Subject: Re: [RFC PATCHv4 0/6] nvme-multipath: introduce adaptive I/O policy
Date: Wed, 5 Nov 2025 00:57:35 +0800 [thread overview]
Message-ID: <96f9d51d-7cde-404b-a2a4-0c1b97d07be6@linux.alibaba.com> (raw)
In-Reply-To: <20251104104533.138481-1-nilay@linux.ibm.com>
Hi Nilay,
Could you please update Documentation/admin-guide/nvme-multipath.rst too?
Best Regards,
Guixin Liu
On 2025/11/4 18:45, Nilay Shroff wrote:
> Hi,
>
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance. The
> numa policy selects the path closest to the NUMA node of the current
> CPU, optimizing memory and path locality, but ignores actual path
> performance. The round-robin policy distributes I/O evenly across all
> paths, providing fairness but not performance awareness. The
> queue-depth policy reacts to instantaneous queue occupancy, avoiding
> heavily loaded paths, but does not account for actual latency,
> throughput, or link speed.
>
> The new adaptive policy addresses these gaps by selecting paths
> dynamically based on measured I/O latency for both PCIe and fabrics.
> Latency is
> derived by passively sampling I/O completions. Each path is assigned a
> weight proportional to its latency score, and I/Os are then forwarded
> accordingly. As conditions change (e.g. latency spikes, bandwidth
> differences), path weights are updated, automatically steering traffic
> toward better-performing paths.
>
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
>
>             numa           round-robin    queue-depth    adaptive
>             ------------   ------------   ------------   ------------
>   READ:     50.0 MiB/s     105 MiB/s      230 MiB/s      350 MiB/s
>   WRITE:    65.9 MiB/s     125 MiB/s      385 MiB/s      446 MiB/s
>   RW:       R:30.6 MiB/s   R:56.5 MiB/s   R:122 MiB/s    R:175 MiB/s
>             W:30.7 MiB/s   W:56.5 MiB/s   W:122 MiB/s    W:175 MiB/s
>
> This patchset includes a total of 6 patches:
> [PATCH 1/6] block: expose blk_stat_{enable,disable}_accounting()
> - Make blk_stat APIs available to block drivers.
> - Needed for per-path latency measurement in adaptive policy.
>
> [PATCH 2/6] nvme-multipath: add adaptive I/O policy
> - Implement path scoring based on latency (EWMA).
> - Distribute I/O proportionally to per-path weights.
>
> [PATCH 3/6] nvme: add generic debugfs support
> - Introduce generic debugfs support for the NVMe module
>
> [PATCH 4/6] nvme-multipath: add debugfs attribute adaptive_ewma_shift
> - Adds a debugfs attribute to control ewma shift
>
> [PATCH 5/6] nvme-multipath: add debugfs attribute adaptive_weight_timeout
> - Adds a debugfs attribute to control path weight calculation timeout
>
> [PATCH 6/6] nvme-multipath: add debugfs attribute adaptive_stat
> - Add "adaptive_stat" under per-path and head debugfs directories to
> expose adaptive policy state and statistics.
>
> As usual, feedback and suggestions are most welcome!
>
> Thanks!
>
> Changes from v3:
> - Renamed the adaptive APIs (which actually enable/disable the
> adaptive policy) to reflect the work they do. Also removed
> the misleading use of "current_path" from the adaptive policy code
> (Hannes Reinecke)
> - Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
> sysfs to debugfs (Hannes Reinecke)
> Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
>
> Changes from v2:
> - Added a new patch to allow the user to configure the EWMA shift
> through sysfs (Hannes Reinecke)
> - Added a new patch to allow the user to configure the path weight
> calculation timeout (Hannes Reinecke)
> - Distinguish between read/write and other commands (e.g.
> admin commands) and calculate path weight for other commands
> separately from the read/write weight. (Hannes Reinecke)
> - Normalize per-path weight in the range from 0-128 instead
> of 0-100 (Hannes Reinecke)
> - Restructure and optimize adaptive I/O forwarding code to use
> one loop instead of two (Hannes Reinecke)
> Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
>
> Changes from v1:
> - Ensure that I/O completion occurs on the same CPU that submitted
> the I/O (Hannes Reinecke)
> - Remove adapter link speed from the path weight calculation
> (Hannes Reinecke)
> - Add adaptive I/O stat under debugfs instead of current sysfs
> (Hannes Reinecke)
> - Move path weight calculation from the I/O completion code path
> to a workqueue
> Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
>
> Nilay Shroff (6):
> block: expose blk_stat_{enable,disable}_accounting() to drivers
> nvme-multipath: add support for adaptive I/O policy
> nvme: add generic debugfs support
> nvme-multipath: add debugfs attribute adaptive_ewma_shift
> nvme-multipath: add debugfs attribute adaptive_weight_timeout
> nvme-multipath: add debugfs attribute adaptive_stat
>
> block/blk-stat.h | 4 -
> drivers/nvme/host/Makefile | 2 +-
> drivers/nvme/host/core.c | 22 +-
> drivers/nvme/host/debugfs.c | 335 ++++++++++++++++++++++++++
> drivers/nvme/host/ioctl.c | 31 ++-
> drivers/nvme/host/multipath.c | 430 +++++++++++++++++++++++++++++++++-
> drivers/nvme/host/nvme.h | 87 ++++++-
> drivers/nvme/host/pr.c | 6 +-
> drivers/nvme/host/sysfs.c | 2 +-
> include/linux/blk-mq.h | 4 +
> 10 files changed, 895 insertions(+), 28 deletions(-)
> create mode 100644 drivers/nvme/host/debugfs.c
>
Thread overview: 12+ messages
2025-11-04 10:45 [RFC PATCHv4 0/6] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-11-04 10:45 ` [RFC PATCHv4 1/6] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
2025-11-04 10:45 ` [RFC PATCHv4 2/6] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
2025-11-04 14:57 ` Hannes Reinecke
2025-11-04 10:45 ` [RFC PATCHv4 3/6] nvme: add generic debugfs support Nilay Shroff
2025-11-04 10:45 ` [RFC PATCHv4 4/6] nvme-multipath: add debugfs attribute adaptive_ewma_shift Nilay Shroff
2025-11-04 14:58 ` Hannes Reinecke
2025-11-04 10:45 ` [RFC PATCHv4 5/6] nvme-multipath: add debugfs attribute adaptive_weight_timeout Nilay Shroff
2025-11-04 14:58 ` Hannes Reinecke
2025-11-04 10:45 ` [RFC PATCHv4 6/6] nvme-multipath: add debugfs attribute adaptive_stat Nilay Shroff
2025-11-04 16:57 ` Guixin Liu [this message]
2025-11-05 6:57 ` [RFC PATCHv4 0/6] nvme-multipath: introduce adaptive I/O policy Nilay Shroff