* [PATCH 00/13] libmultipath: a generic multipath lib for block drivers
@ 2026-02-25 15:32 John Garry
2026-02-25 15:32 ` [PATCH 01/13] libmultipath: Add initial framework John Garry
` (12 more replies)
0 siblings, 13 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
This series introduces libmultipath. It is essentially a refactoring of
NVMe multipath support, so that we can have a common library which also
supports native SCSI multipath.
Much of the code is taken directly from the NVMe multipath code. However,
NVMe specifics are removed. A template structure is provided so the driver
may provide callbacks for driver specifics, like ANA support for NVMe.
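As a rough illustration of the template idea described above, here is a
minimal userspace sketch. It is not code from this series: every demo_
name is hypothetical, and the "ANA-like" integer state is an assumption
made only to give the callbacks something to translate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Generic access states, loosely modelled on enum mpath_access_state. */
enum demo_access_state {
	DEMO_STATE_OPTIMIZED,
	DEMO_STATE_ACTIVE,
	DEMO_STATE_INVALID = 0xFF,
};

/* Generic per-path object; the library never looks inside drvdata. */
struct demo_path {
	void *drvdata;
};

/* Library-side template: the driver supplies the callbacks. */
struct demo_template {
	bool (*is_disabled)(struct demo_path *path);
	enum demo_access_state (*get_access_state)(struct demo_path *path);
};

/* Hypothetical driver-private state (assumption: 0 = optimized, 1 = active). */
struct demo_drv_path {
	int ana_like_state;
	bool removing;
};

static bool demo_drv_is_disabled(struct demo_path *path)
{
	struct demo_drv_path *p = path->drvdata;

	return p->removing;
}

static enum demo_access_state demo_drv_get_access_state(struct demo_path *path)
{
	struct demo_drv_path *p = path->drvdata;

	switch (p->ana_like_state) {
	case 0:
		return DEMO_STATE_OPTIMIZED;
	case 1:
		return DEMO_STATE_ACTIVE;
	default:
		return DEMO_STATE_INVALID;
	}
}

static const struct demo_template demo_drv_template = {
	.is_disabled = demo_drv_is_disabled,
	.get_access_state = demo_drv_get_access_state,
};

/* Library-side helper: only ever reaches the driver through the template. */
static bool demo_path_usable(const struct demo_template *t,
			     struct demo_path *path)
{
	assert(t && path);

	if (t->is_disabled(path))
		return false;
	return t->get_access_state(path) != DEMO_STATE_INVALID;
}
```

The point of the indirection is that demo_path_usable() contains no
driver knowledge, so a second driver with a different notion of path
state would only need to provide its own template.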
Important new structures introduced include:
- mpath_head and mpath_disk
These contain much of the multipath-specific functionality from
nvme_ns_head. Separate structures are needed to suit SCSI - that is
because SCSI has the concept of a scsi_driver, like scsi_disk. For SCSI,
the mpath_head would be associated with the scsi_device, while
mpath_disk would be associated with scsi_disk.
- mpath_device
This is the per-path structure, and contains the multipath-specific
functionality in nvme_ns.
libmultipath provides functionality for path management, path selection,
data path, and failover handling.
Since the NVMe driver has some code in the sysfs and ioctl handling
which iterates over all multipath NSes, functions like
mpath_call_for_device() are added to do the same per-path iteration.
John Garry (13):
libmultipath: Add initial framework
libmultipath: Add basic gendisk support
libmultipath: Add path selection support
libmultipath: Add bio handling
libmultipath: Add support for mpath_device management
libmultipath: Add cdev support
libmultipath: Add delayed removal support
libmultipath: Add sysfs helpers
libmultipath: Add PR support
libmultipath: Add mpath_bdev_report_zones()
libmultipath: Add support for block device IOCTL
libmultipath: Add mpath_bdev_getgeo()
libmultipath: Add mpath_bdev_get_unique_id()
include/linux/multipath.h | 205 ++++++
lib/Kconfig | 6 +
lib/Makefile | 2 +
lib/multipath.c | 1261 +++++++++++++++++++++++++++++++++++++
4 files changed, 1474 insertions(+)
create mode 100644 include/linux/multipath.h
create mode 100644 lib/multipath.c
--
2.43.5
* [PATCH 01/13] libmultipath: Add initial framework
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
@ 2026-02-25 15:32 ` John Garry
2026-03-02 12:08 ` Nilay Shroff
2026-02-25 15:32 ` [PATCH 02/13] libmultipath: Add basic gendisk support John Garry
` (11 subsequent siblings)
12 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add initial framework for libmultipath. libmultipath is a library for
multipath-capable block drivers, such as NVMe. The main function is to
support path management, path selection, and failover handling.
Basic support to add and remove the head structure - mpath_head - is
included.
The main purpose of this structure is to manage available paths and path
selection. It is quite similar to the multipath functionality in
nvme_ns_head. However, a separate structure will be introduced later to
manage the multipath gendisk.
Each path is represented by the mpath_device structure. It holds a
pointer to the per-path gendisk and a list element linking it to its
sibling paths. For NVMe, there would be one mpath_device per nvme_ns.
All the libmultipath code is more or less taken from
drivers/nvme/host/multipath.c, which was originally authored by Christoph
Hellwig <hch@lst.de>.
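For readers unfamiliar with the get-unless-zero pattern behind
mpath_get_head() and mpath_put_head(), the following userspace sketch
models it with C11 atomics. It is an illustration only, not the kernel
kref implementation, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_head {
	atomic_int ref;
};

/* Allocate a head holding one initial reference. */
static struct demo_head *demo_alloc_head(void)
{
	struct demo_head *h = calloc(1, sizeof(*h));

	if (!h)
		return NULL;
	atomic_init(&h->ref, 1);
	return h;
}

/*
 * Take a reference only if the count has not already dropped to zero,
 * so that an object which is being torn down is never revived.
 */
static int demo_get_head(struct demo_head *h)
{
	int old = atomic_load(&h->ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&h->ref, &old, old + 1))
			return 0;
	}
	return -ENXIO;
}

/* Drop a reference; returns true if this call freed the object. */
static bool demo_put_head(struct demo_head *h)
{
	if (atomic_fetch_sub(&h->ref, 1) == 1) {
		free(h);
		return true;
	}
	return false;
}
```

A caller that receives -ENXIO from the get side must treat the head as
already gone, which mirrors how mpath_get_head() reports failure.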
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 28 +++++++++++++++
lib/Kconfig | 6 ++++
lib/Makefile | 2 ++
lib/multipath.c | 74 +++++++++++++++++++++++++++++++++++++++
4 files changed, 110 insertions(+)
create mode 100644 include/linux/multipath.h
create mode 100644 lib/multipath.c
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
new file mode 100644
index 0000000000000..18cd133b7ca21
--- /dev/null
+++ b/include/linux/multipath.h
@@ -0,0 +1,28 @@
+
+#ifndef _LIBMULTIPATH_H
+#define _LIBMULTIPATH_H
+
+#include <linux/blkdev.h>
+#include <linux/srcu.h>
+
+struct mpath_device {
+ struct list_head siblings;
+ struct gendisk *disk;
+};
+
+struct mpath_head {
+ struct srcu_struct srcu;
+ struct list_head dev_list; /* list of all mpath_devs */
+ struct mutex lock;
+
+ struct kref ref;
+
+ struct mpath_device __rcu *current_path[MAX_NUMNODES];
+ void *drvdata;
+};
+
+int mpath_get_head(struct mpath_head *mpath_head);
+void mpath_put_head(struct mpath_head *mpath_head);
+struct mpath_head *mpath_alloc_head(void);
+
+#endif // _LIBMULTIPATH_H
diff --git a/lib/Kconfig b/lib/Kconfig
index 2923924bea78c..465aed2477d90 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -649,3 +649,9 @@ config UNION_FIND
config MIN_HEAP
bool
+
+config LIBMULTIPATH
+ bool "MULTIPATH BLOCK DRIVER LIBRARY"
+ depends on BLOCK
+ help
+ If you say yes here then you get a multipath lib for block drivers
diff --git a/lib/Makefile b/lib/Makefile
index aaf677cf4527e..b81002bc64d2e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -332,3 +332,5 @@ obj-$(CONFIG_GENERIC_LIB_DEVMEM_IS_ALLOWED) += devmem_is_allowed.o
obj-$(CONFIG_FIRMWARE_TABLE) += fw_table.o
subdir-$(CONFIG_FORTIFY_SOURCE) += test_fortify
+
+obj-$(CONFIG_LIBMULTIPATH) += multipath.o
diff --git a/lib/multipath.c b/lib/multipath.c
new file mode 100644
index 0000000000000..15c495675d729
--- /dev/null
+++ b/lib/multipath.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2017-2018 Christoph Hellwig.
+ * Copyright (c) 2026 Oracle and/or its affiliates.
+ */
+#include <linux/module.h>
+#include <linux/multipath.h>
+
+static struct workqueue_struct *mpath_wq;
+
+static void mpath_free_head(struct kref *ref)
+{
+ struct mpath_head *mpath_head =
+ container_of(ref, struct mpath_head, ref);
+
+ cleanup_srcu_struct(&mpath_head->srcu);
+ kfree(mpath_head);
+}
+
+int mpath_get_head(struct mpath_head *mpath_head)
+{
+ if (!kref_get_unless_zero(&mpath_head->ref)) {
+ return -ENXIO;
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mpath_get_head);
+
+void mpath_put_head(struct mpath_head *mpath_head)
+{
+ kref_put(&mpath_head->ref, mpath_free_head);
+}
+EXPORT_SYMBOL_GPL(mpath_put_head);
+
+struct mpath_head *mpath_alloc_head(void)
+{
+ struct mpath_head *mpath_head;
+ int ret;
+
+ mpath_head = kzalloc(sizeof(*mpath_head), GFP_KERNEL);
+ if (!mpath_head)
+ return ERR_PTR(-ENOMEM);
+ INIT_LIST_HEAD(&mpath_head->dev_list);
+ mutex_init(&mpath_head->lock);
+ kref_init(&mpath_head->ref);
+
+ ret = init_srcu_struct(&mpath_head->srcu);
+ if (ret) {
+ kfree(mpath_head);
+ return ERR_PTR(ret);
+ }
+
+ return mpath_head;
+}
+EXPORT_SYMBOL_GPL(mpath_alloc_head);
+
+static int __init mpath_init(void)
+{
+ mpath_wq = alloc_workqueue("mpath-wq",
+ WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
+ if (!mpath_wq)
+ return -ENOMEM;
+ return 0;
+}
+
+static void __exit mpath_exit(void)
+{
+ destroy_workqueue(mpath_wq);
+}
+
+module_init(mpath_init);
+module_exit(mpath_exit);
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("libmultipath");
--
2.43.5
* [PATCH 02/13] libmultipath: Add basic gendisk support
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
2026-02-25 15:32 ` [PATCH 01/13] libmultipath: Add initial framework John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-26 2:16 ` Benjamin Marzinski
` (2 more replies)
2026-02-25 15:32 ` [PATCH 03/13] libmultipath: Add path selection support John Garry
` (10 subsequent siblings)
12 siblings, 3 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add support to allocate and free a multipath gendisk.
NVMe has almost like-for-like equivalents here:
- mpath_alloc_head_disk() -> nvme_mpath_alloc_disk()
- multipath_partition_scan_work() -> nvme_partition_scan_work()
- mpath_remove_disk() -> nvme_remove_head()
- mpath_device_set_live() -> nvme_mpath_set_live()
struct mpath_head_template is introduced as a method for drivers to
provide custom multipath functionality.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 41 ++++++++++++
lib/multipath.c | 129 ++++++++++++++++++++++++++++++++++++++
2 files changed, 170 insertions(+)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index 18cd133b7ca21..be9dd9fb83345 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -5,11 +5,28 @@
#include <linux/blkdev.h>
#include <linux/srcu.h>
+extern const struct block_device_operations mpath_ops;
+
+struct mpath_disk {
+ struct gendisk *disk;
+ struct kref ref;
+ struct work_struct partition_scan_work;
+ struct mutex lock;
+ struct mpath_head *mpath_head;
+ struct device *parent;
+};
+
struct mpath_device {
struct list_head siblings;
struct gendisk *disk;
};
+struct mpath_head_template {
+ const struct attribute_group **device_groups;
+};
+
+#define MPATH_HEAD_DISK_LIVE 0
+
struct mpath_head {
struct srcu_struct srcu;
struct list_head dev_list; /* list of all mpath_devs */
@@ -17,12 +34,36 @@ struct mpath_head {
struct kref ref;
+ unsigned long flags;
struct mpath_device __rcu *current_path[MAX_NUMNODES];
+ const struct mpath_head_template *mpdt;
void *drvdata;
};
+static inline struct mpath_disk *mpath_bd_device_to_disk(struct device *dev)
+{
+ return dev_get_drvdata(dev);
+}
+
+static inline struct mpath_disk *mpath_gendisk_to_disk(struct gendisk *disk)
+{
+ return mpath_bd_device_to_disk(disk_to_dev(disk));
+}
+
int mpath_get_head(struct mpath_head *mpath_head);
void mpath_put_head(struct mpath_head *mpath_head);
struct mpath_head *mpath_alloc_head(void);
+void mpath_put_disk(struct mpath_disk *mpath_disk);
+void mpath_remove_disk(struct mpath_disk *mpath_disk);
+void mpath_unregister_disk(struct mpath_disk *mpath_disk);
+struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
+ int numa_node);
+void mpath_device_set_live(struct mpath_disk *mpath_disk,
+ struct mpath_device *mpath_device);
+
+static inline bool is_mpath_head(struct gendisk *disk)
+{
+ return disk->fops == &mpath_ops;
+}
#endif // _LIBMULTIPATH_H
diff --git a/lib/multipath.c b/lib/multipath.c
index 15c495675d729..88efb0ae16acb 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -32,6 +32,135 @@ void mpath_put_head(struct mpath_head *mpath_head)
}
EXPORT_SYMBOL_GPL(mpath_put_head);
+static void mpath_free_disk(struct kref *ref)
+{
+ struct mpath_disk *mpath_disk =
+ container_of(ref, struct mpath_disk, ref);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+
+ put_disk(mpath_disk->disk);
+ mpath_put_head(mpath_head);
+ kfree(mpath_disk);
+}
+
+void mpath_put_disk(struct mpath_disk *mpath_disk)
+{
+ kref_put(&mpath_disk->ref, mpath_free_disk);
+}
+EXPORT_SYMBOL_GPL(mpath_put_disk);
+
+static int mpath_get_disk(struct mpath_disk *mpath_disk)
+{
+ if (!kref_get_unless_zero(&mpath_disk->ref)) {
+ return -ENXIO;
+ }
+ return 0;
+}
+
+static int mpath_bdev_open(struct gendisk *disk, blk_mode_t mode)
+{
+ struct mpath_disk *mpath_disk = disk->private_data;
+
+ return mpath_get_disk(mpath_disk);
+}
+
+static void mpath_bdev_release(struct gendisk *disk)
+{
+ struct mpath_disk *mpath_disk = disk->private_data;
+
+ mpath_put_disk(mpath_disk);
+}
+
+const struct block_device_operations mpath_ops = {
+ .owner = THIS_MODULE,
+ .open = mpath_bdev_open,
+ .release = mpath_bdev_release,
+};
+EXPORT_SYMBOL_GPL(mpath_ops);
+
+static void multipath_partition_scan_work(struct work_struct *work)
+{
+ struct mpath_disk *mpath_disk =
+ container_of(work, struct mpath_disk, partition_scan_work);
+
+ if (WARN_ON_ONCE(!test_and_clear_bit(GD_SUPPRESS_PART_SCAN,
+ &mpath_disk->disk->state)))
+ return;
+
+ mutex_lock(&mpath_disk->disk->open_mutex);
+ bdev_disk_changed(mpath_disk->disk, false);
+ mutex_unlock(&mpath_disk->disk->open_mutex);
+}
+
+void mpath_remove_disk(struct mpath_disk *mpath_disk)
+{
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+
+ if (test_and_clear_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags)) {
+ struct gendisk *disk = mpath_disk->disk;
+
+ del_gendisk(disk);
+ }
+}
+EXPORT_SYMBOL_GPL(mpath_remove_disk);
+
+void mpath_unregister_disk(struct mpath_disk *mpath_disk)
+{
+ mpath_remove_disk(mpath_disk);
+ mpath_put_disk(mpath_disk);
+}
+EXPORT_SYMBOL_GPL(mpath_unregister_disk);
+
+struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim, int numa_node)
+{
+ struct mpath_disk *mpath_disk;
+
+ mpath_disk = kzalloc(sizeof(*mpath_disk), GFP_KERNEL);
+ if (!mpath_disk)
+ return NULL;
+
+ INIT_WORK(&mpath_disk->partition_scan_work,
+ multipath_partition_scan_work);
+ mutex_init(&mpath_disk->lock);
+ kref_init(&mpath_disk->ref);
+
+ mpath_disk->disk = blk_alloc_disk(lim, numa_node);
+ if (IS_ERR(mpath_disk->disk)) {
+ kfree(mpath_disk);
+ return NULL;
+ }
+
+ mpath_disk->disk->private_data = mpath_disk;
+ mpath_disk->disk->fops = &mpath_ops;
+
+ set_bit(GD_SUPPRESS_PART_SCAN, &mpath_disk->disk->state);
+
+ return mpath_disk;
+}
+EXPORT_SYMBOL_GPL(mpath_alloc_head_disk);
+
+void mpath_device_set_live(struct mpath_disk *mpath_disk,
+ struct mpath_device *mpath_device)
+{
+ struct mpath_head *mpath_head;
+ int ret;
+
+ if (!mpath_disk)
+ return;
+ mpath_head = mpath_disk->mpath_head;
+ if (!test_and_set_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags)) {
+ dev_set_drvdata(disk_to_dev(mpath_disk->disk), mpath_disk);
+ ret = device_add_disk(mpath_disk->parent, mpath_disk->disk,
+ mpath_head->mpdt->device_groups);
+ if (ret) {
+ clear_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags);
+ return;
+ }
+ queue_work(mpath_wq, &mpath_disk->partition_scan_work);
+ }
+}
+EXPORT_SYMBOL_GPL(mpath_device_set_live);
+
struct mpath_head *mpath_alloc_head(void)
{
struct mpath_head *mpath_head;
--
2.43.5
* [PATCH 03/13] libmultipath: Add path selection support
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
2026-02-25 15:32 ` [PATCH 01/13] libmultipath: Add initial framework John Garry
2026-02-25 15:32 ` [PATCH 02/13] libmultipath: Add basic gendisk support John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-26 3:37 ` Benjamin Marzinski
` (2 more replies)
2026-02-25 15:32 ` [PATCH 04/13] libmultipath: Add bio handling John Garry
` (9 subsequent siblings)
12 siblings, 3 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add code for path selection.
NVMe ANA is abstracted into enum mpath_access_state. The motivation is to
allow SCSI ALUA to be supported as well. The callbacks .is_disabled,
.is_optimized, and .get_access_state are added to get the path access state.
The round-robin, NUMA, and queue-depth path selection policies are added,
matching those which NVMe supports.
NVMe has almost like-for-like equivalents here:
- __mpath_find_path() -> __nvme_find_path()
- mpath_find_path() -> nvme_find_path()
and similar for all introduced callee functions.
Functions mpath_set_iopolicy() and mpath_get_iopolicy() are added for
setting and reporting the default iopolicy.
A separate mpath_iopolicy structure is introduced. There is no iopolicy
member included in the mpath_head structure as it may not suit NVMe, where
iopolicy is per-subsystem and not per-namespace.
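The selection rule used by __mpath_find_path() (prefer the closest
optimized path, otherwise fall back to the closest active path) can be
modelled in userspace as below. This is a simplified sketch: the demo_
names are hypothetical, an array stands in for the SRCU-protected sibling
list, and the distance field stands in for node_distance().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_state {
	DEMO_OPTIMIZED,
	DEMO_ACTIVE,
	DEMO_INVALID = 0xFF,
};

struct demo_path {
	bool disabled;		/* models the .is_disabled callback */
	enum demo_state state;	/* models the .get_access_state callback */
	int distance;		/* stand-in for the NUMA distance */
};

/* Return the best usable path, or NULL if none is usable. */
static struct demo_path *demo_find_path(struct demo_path *paths, size_t n)
{
	struct demo_path *found = NULL, *fallback = NULL;
	int found_distance = INT_MAX, fallback_distance = INT_MAX;
	size_t i;

	for (i = 0; i < n; i++) {
		struct demo_path *p = &paths[i];

		if (p->disabled)
			continue;

		switch (p->state) {
		case DEMO_OPTIMIZED:
			if (p->distance < found_distance) {
				found_distance = p->distance;
				found = p;
			}
			break;
		case DEMO_ACTIVE:
			if (p->distance < fallback_distance) {
				fallback_distance = p->distance;
				fallback = p;
			}
			break;
		default:
			break;
		}
	}

	/* An optimized path always wins, even if an active one is closer. */
	return found ? found : fallback;
}
```

Note that both candidate pointers start out as NULL, so an empty or fully
disabled path list yields NULL rather than an indeterminate pointer.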
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 36 ++++++
lib/multipath.c | 251 ++++++++++++++++++++++++++++++++++++++
2 files changed, 287 insertions(+)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index be9dd9fb83345..c964a1aba9c42 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -7,6 +7,22 @@
extern const struct block_device_operations mpath_ops;
+enum mpath_iopolicy_e {
+ MPATH_IOPOLICY_NUMA,
+ MPATH_IOPOLICY_RR,
+ MPATH_IOPOLICY_QD,
+};
+
+struct mpath_iopolicy {
+ enum mpath_iopolicy_e iopolicy;
+};
+
+enum mpath_access_state {
+ MPATH_STATE_OPTIMIZED,
+ MPATH_STATE_ACTIVE,
+ MPATH_STATE_INVALID = 0xFF
+};
+
struct mpath_disk {
struct gendisk *disk;
struct kref ref;
@@ -18,10 +34,16 @@ struct mpath_disk {
struct mpath_device {
struct list_head siblings;
+ atomic_t nr_active;
struct gendisk *disk;
+ int numa_node;
};
struct mpath_head_template {
+ bool (*is_disabled)(struct mpath_device *);
+ bool (*is_optimized)(struct mpath_device *);
+ enum mpath_access_state (*get_access_state)(struct mpath_device *);
+ enum mpath_iopolicy_e (*get_iopolicy)(struct mpath_head *);
const struct attribute_group **device_groups;
};
@@ -50,6 +72,14 @@ static inline struct mpath_disk *mpath_gendisk_to_disk(struct gendisk *disk)
return mpath_bd_device_to_disk(disk_to_dev(disk));
}
+static inline enum mpath_iopolicy_e mpath_read_iopolicy(
+ struct mpath_iopolicy *mpath_iopolicy)
+{
+ return READ_ONCE(mpath_iopolicy->iopolicy);
+}
+void mpath_synchronize(struct mpath_head *mpath_head);
+int mpath_set_iopolicy(const char *val, int *iopolicy);
+int mpath_get_iopolicy(char *buf, int iopolicy);
int mpath_get_head(struct mpath_head *mpath_head);
void mpath_put_head(struct mpath_head *mpath_head);
struct mpath_head *mpath_alloc_head(void);
@@ -66,4 +96,10 @@ static inline bool is_mpath_head(struct gendisk *disk)
{
return disk->fops == &mpath_ops;
}
+
+static inline bool mpath_qd_iopolicy(struct mpath_iopolicy *mpath_iopolicy)
+{
+ return mpath_read_iopolicy(mpath_iopolicy) == MPATH_IOPOLICY_QD;
+}
+
#endif // _LIBMULTIPATH_H
diff --git a/lib/multipath.c b/lib/multipath.c
index 88efb0ae16acb..65a0d2d2bf524 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -6,8 +6,243 @@
#include <linux/module.h>
#include <linux/multipath.h>
+static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head);
+
static struct workqueue_struct *mpath_wq;
+static const char *mpath_iopolicy_names[] = {
+ [MPATH_IOPOLICY_NUMA] = "numa",
+ [MPATH_IOPOLICY_RR] = "round-robin",
+ [MPATH_IOPOLICY_QD] = "queue-depth",
+};
+
+int mpath_set_iopolicy(const char *val, int *iopolicy)
+{
+ if (!val)
+ return -EINVAL;
+ if (!strncmp(val, "numa", 4))
+ *iopolicy = MPATH_IOPOLICY_NUMA;
+ else if (!strncmp(val, "round-robin", 11))
+ *iopolicy = MPATH_IOPOLICY_RR;
+ else if (!strncmp(val, "queue-depth", 11))
+ *iopolicy = MPATH_IOPOLICY_QD;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mpath_set_iopolicy);
+
+int mpath_get_iopolicy(char *buf, int iopolicy)
+{
+ return sprintf(buf, "%s\n", mpath_iopolicy_names[iopolicy]);
+}
+EXPORT_SYMBOL_GPL(mpath_get_iopolicy);
+
+
+void mpath_synchronize(struct mpath_head *mpath_head)
+{
+ synchronize_srcu(&mpath_head->srcu);
+}
+EXPORT_SYMBOL_GPL(mpath_synchronize);
+
+static bool mpath_path_is_disabled(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device)
+{
+ return mpath_head->mpdt->is_disabled(mpath_device);
+}
+
+static struct mpath_device *__mpath_find_path(struct mpath_head *mpath_head,
+ enum mpath_iopolicy_e iopolicy, int node)
+{
+ int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
+ struct mpath_device *mpath_dev_found = NULL;
+ struct mpath_device *mpath_dev_fallback = NULL, *mpath_device;
+
+ list_for_each_entry_srcu(mpath_device, &mpath_head->dev_list, siblings,
+ srcu_read_lock_held(&mpath_head->srcu)) {
+ if (mpath_path_is_disabled(mpath_head, mpath_device))
+ continue;
+
+ if (mpath_device->numa_node != NUMA_NO_NODE &&
+ (iopolicy == MPATH_IOPOLICY_NUMA))
+ distance = node_distance(node, mpath_device->numa_node);
+ else
+ distance = LOCAL_DISTANCE;
+
+ switch (mpath_head->mpdt->get_access_state(mpath_device)) {
+ case MPATH_STATE_OPTIMIZED:
+ if (distance < found_distance) {
+ found_distance = distance;
+ mpath_dev_found = mpath_device;
+ }
+ break;
+ case MPATH_STATE_ACTIVE:
+ if (distance < fallback_distance) {
+ fallback_distance = distance;
+ mpath_dev_fallback = mpath_device;
+ }
+ break;
+ default:
+ break;
+ }
+ }
+
+ if (!mpath_dev_found)
+ mpath_dev_found = mpath_dev_fallback;
+
+ if (mpath_dev_found)
+ rcu_assign_pointer(mpath_head->current_path[node],
+ mpath_dev_found);
+
+ return mpath_dev_found;
+}
+
+static struct mpath_device *mpath_next_dev(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_dev)
+{
+ mpath_dev = list_next_or_null_rcu(&mpath_head->dev_list,
+ &mpath_dev->siblings, struct mpath_device,
+ siblings);
+
+ if (mpath_dev)
+ return mpath_dev;
+ return list_first_or_null_rcu(&mpath_head->dev_list,
+ struct mpath_device, siblings);
+}
+
+static struct mpath_device *mpath_round_robin_path(
+ struct mpath_head *mpath_head,
+ enum mpath_iopolicy_e iopolicy)
+{
+ struct mpath_device *mpath_device, *found = NULL;
+ int node = numa_node_id();
+ enum mpath_access_state access_state_old;
+ struct mpath_device *old =
+ srcu_dereference(mpath_head->current_path[node],
+ &mpath_head->srcu);
+
+ if (unlikely(!old))
+ return __mpath_find_path(mpath_head, iopolicy, node);
+
+ if (list_is_singular(&mpath_head->dev_list)) {
+ if (mpath_path_is_disabled(mpath_head, old))
+ return NULL;
+ return old;
+ }
+
+ for (mpath_device = mpath_next_dev(mpath_head, old);
+ mpath_device && mpath_device != old;
+ mpath_device = mpath_next_dev(mpath_head, mpath_device)) {
+ enum mpath_access_state access_state;
+
+ if (mpath_path_is_disabled(mpath_head, mpath_device))
+ continue;
+ access_state = mpath_head->mpdt->get_access_state(mpath_device);
+ if (access_state == MPATH_STATE_OPTIMIZED) {
+ found = mpath_device;
+ goto out;
+ }
+ if (access_state == MPATH_STATE_ACTIVE)
+ found = mpath_device;
+ }
+
+ /*
+ * The loop above skips the current path for round-robin semantics.
+ * Fall back to the current path if either:
+ * - no other optimized path found and current is optimized,
+ * - no other usable path found and current is usable.
+ */
+ access_state_old = mpath_head->mpdt->get_access_state(old);
+ if (!mpath_path_is_disabled(mpath_head, old) &&
+ (access_state_old == MPATH_STATE_OPTIMIZED ||
+ (!found && access_state_old == MPATH_STATE_ACTIVE)))
+ return old;
+
+ if (!found)
+ return NULL;
+out:
+ rcu_assign_pointer(mpath_head->current_path[node], found);
+
+ return found;
+}
+
+static struct mpath_device *mpath_queue_depth_path(struct mpath_head *mpath_head)
+{
+ struct mpath_device *best_opt = NULL, *mpath_device;
+ struct mpath_device *best_nonopt = NULL;
+ unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+ unsigned int depth;
+
+ list_for_each_entry_srcu(mpath_device, &mpath_head->dev_list, siblings,
+ srcu_read_lock_held(&mpath_head->srcu)) {
+
+ if (mpath_path_is_disabled(mpath_head, mpath_device))
+ continue;
+
+ depth = atomic_read(&mpath_device->nr_active);
+
+ switch (mpath_head->mpdt->get_access_state(mpath_device)) {
+ case MPATH_STATE_OPTIMIZED:
+ if (depth < min_depth_opt) {
+ min_depth_opt = depth;
+ best_opt = mpath_device;
+ }
+ break;
+ case MPATH_STATE_ACTIVE:
+ if (depth < min_depth_nonopt) {
+ min_depth_nonopt = depth;
+ best_nonopt = mpath_device;
+ }
+ break;
+ default:
+ break;
+ }
+
+ if (min_depth_opt == 0)
+ return best_opt;
+ }
+
+ return best_opt ? best_opt : best_nonopt;
+}
+
+static inline bool mpath_path_is_optimized(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device)
+{
+ return mpath_head->mpdt->is_optimized(mpath_device);
+}
+
+static struct mpath_device *mpath_numa_path(struct mpath_head *mpath_head,
+ enum mpath_iopolicy_e iopolicy)
+{
+ int node = numa_node_id();
+ struct mpath_device *mpath_device;
+
+ mpath_device = srcu_dereference(mpath_head->current_path[node],
+ &mpath_head->srcu);
+ if (unlikely(!mpath_device))
+ return __mpath_find_path(mpath_head, iopolicy, node);
+ if (unlikely(!mpath_path_is_optimized(mpath_head, mpath_device)))
+ return __mpath_find_path(mpath_head, iopolicy, node);
+ return mpath_device;
+}
+
+__maybe_unused
+static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head)
+{
+ enum mpath_iopolicy_e iopolicy =
+ mpath_head->mpdt->get_iopolicy(mpath_head);
+
+ switch (iopolicy) {
+ case MPATH_IOPOLICY_QD:
+ return mpath_queue_depth_path(mpath_head);
+ case MPATH_IOPOLICY_RR:
+ return mpath_round_robin_path(mpath_head, iopolicy);
+ default:
+ return mpath_numa_path(mpath_head, iopolicy);
+ }
+}
+
static void mpath_free_head(struct kref *ref)
{
struct mpath_head *mpath_head =
@@ -99,6 +334,7 @@ void mpath_remove_disk(struct mpath_disk *mpath_disk)
if (test_and_clear_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags)) {
struct gendisk *disk = mpath_disk->disk;
+ mpath_synchronize(mpath_head);
del_gendisk(disk);
}
}
@@ -158,6 +394,21 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
}
queue_work(mpath_wq, &mpath_disk->partition_scan_work);
}
+
+ mutex_lock(&mpath_head->lock);
+ if (mpath_path_is_optimized(mpath_head, mpath_device)) {
+ int node, srcu_idx;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ for_each_online_node(node)
+ __mpath_find_path(mpath_head,
+ mpath_head->mpdt->get_iopolicy(mpath_head),
+ node);
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+ }
+ mutex_unlock(&mpath_head->lock);
+
+ mpath_synchronize(mpath_head);
}
EXPORT_SYMBOL_GPL(mpath_device_set_live);
--
2.43.5
* [PATCH 04/13] libmultipath: Add bio handling
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (2 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 03/13] libmultipath: Add path selection support John Garry
@ 2026-02-25 15:32 ` John Garry
2026-03-02 12:39 ` Nilay Shroff
2026-02-25 15:32 ` [PATCH 05/13] libmultipath: Add support for mpath_device management John Garry
` (8 subsequent siblings)
12 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add support to submit a bio per-path. In addition, for failover, add
support to requeue a failed bio.
NVMe has almost like-for-like equivalents here:
- nvme_available_path() -> mpath_available_path()
- nvme_requeue_work() -> mpath_requeue_work()
- nvme_ns_head_submit_bio() -> mpath_bdev_submit_bio()
For failover, a driver may want to re-submit a bio, so add support to
clone a bio prior to submission.
A bio which is submitted to a per-path device has the REQ_MPATH flag set,
as NVMe does with REQ_NVME_MPATH.
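The three outcomes in mpath_bdev_submit_bio() (submit, requeue, or fail)
can be summarised by the userspace sketch below. It is a simplified model
rather than the kernel code: the demo_ names are hypothetical, and the
three booleans are assumed stand-ins for the result of mpath_find_path(),
the MPATH_HEAD_DISK_LIVE flag, and the driver's available_path callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_verdict {
	DEMO_SUBMIT,	/* a usable path exists: remap and submit the bio */
	DEMO_REQUEUE,	/* no usable path now, but one may come back */
	DEMO_FAIL,	/* nothing can serve the bio: complete it with an error */
};

struct demo_head {
	bool have_usable_path;	/* models mpath_find_path() != NULL */
	bool disk_live;		/* models MPATH_HEAD_DISK_LIVE */
	bool path_may_return;	/* models available_path reporting a candidate */
};

static enum demo_verdict demo_submit_decision(const struct demo_head *h)
{
	assert(h);

	if (h->have_usable_path)
		return DEMO_SUBMIT;

	/* Requeue only while the head disk is live and a path may recover. */
	if (h->disk_live && h->path_may_return)
		return DEMO_REQUEUE;

	return DEMO_FAIL;
}
```

Clearing the live flag before scheduling the requeue work therefore
turns every parked bio into a failure on resubmission, which is the
behaviour that the comment in mpath_remove_disk() describes.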
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 15 +++++++
lib/multipath.c | 92 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 106 insertions(+), 1 deletion(-)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index c964a1aba9c42..d557fb9bab4c9 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -3,6 +3,7 @@
#define _LIBMULTIPATH_H
#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
#include <linux/srcu.h>
extern const struct block_device_operations mpath_ops;
@@ -40,10 +41,12 @@ struct mpath_device {
};
struct mpath_head_template {
+ bool (*available_path)(struct mpath_device *, bool *);
bool (*is_disabled)(struct mpath_device *);
bool (*is_optimized)(struct mpath_device *);
enum mpath_access_state (*get_access_state)(struct mpath_device *);
enum mpath_iopolicy_e (*get_iopolicy)(struct mpath_head *);
+ struct bio *(*clone_bio)(struct bio *);
const struct attribute_group **device_groups;
};
@@ -56,12 +59,23 @@ struct mpath_head {
struct kref ref;
+ struct bio_list requeue_list; /* list for requeuing bios */
+ spinlock_t requeue_lock;
+ struct work_struct requeue_work; /* work struct for requeue */
+
unsigned long flags;
struct mpath_device __rcu *current_path[MAX_NUMNODES];
const struct mpath_head_template *mpdt;
void *drvdata;
};
+#define REQ_MPATH REQ_DRV
+
+static inline bool is_mpath_request(struct request *req)
+{
+ return req->cmd_flags & REQ_MPATH;
+}
+
static inline struct mpath_disk *mpath_bd_device_to_disk(struct device *dev)
{
return dev_get_drvdata(dev);
@@ -82,6 +96,7 @@ int mpath_set_iopolicy(const char *val, int *iopolicy);
int mpath_get_iopolicy(char *buf, int iopolicy);
int mpath_get_head(struct mpath_head *mpath_head);
void mpath_put_head(struct mpath_head *mpath_head);
+void mpath_requeue_work(struct work_struct *work);
struct mpath_head *mpath_alloc_head(void);
void mpath_put_disk(struct mpath_disk *mpath_disk);
void mpath_remove_disk(struct mpath_disk *mpath_disk);
diff --git a/lib/multipath.c b/lib/multipath.c
index 65a0d2d2bf524..b494b35e8dccc 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -5,6 +5,7 @@
*/
#include <linux/module.h>
#include <linux/multipath.h>
+#include <trace/events/block.h>
static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head);
@@ -227,7 +228,6 @@ static struct mpath_device *mpath_numa_path(struct mpath_head *mpath_head,
return mpath_device;
}
-__maybe_unused
static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head)
{
enum mpath_iopolicy_e iopolicy =
@@ -243,6 +243,66 @@ static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head)
}
}
+static bool mpath_available_path(struct mpath_head *mpath_head)
+{
+ struct mpath_device *mpath_device;
+
+ if (!test_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags))
+ return false;
+
+ list_for_each_entry_srcu(mpath_device, &mpath_head->dev_list, siblings,
+ srcu_read_lock_held(&mpath_head->srcu)) {
+ bool available = false;
+
+ if (!mpath_head->mpdt->available_path(mpath_device,
+ &available))
+ continue;
+ if (available)
+ return true;
+ }
+
+ return false;
+}
+
+static void mpath_bdev_submit_bio(struct bio *bio)
+{
+ struct mpath_disk *mpath_disk = bio->bi_bdev->bd_disk->private_data;
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct device *dev = mpath_disk->parent;
+ struct mpath_device *mpath_device;
+ int srcu_idx;
+
+ bio = bio_split_to_limits(bio);
+ if (!bio)
+ return;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (likely(mpath_device)) {
+ bio->bi_opf |= REQ_MPATH;
+ if (mpath_head->mpdt->clone_bio)
+ bio = mpath_head->mpdt->clone_bio(bio);
+ trace_block_bio_remap(bio, disk_devt(mpath_device->disk),
+ bio->bi_iter.bi_sector);
+ bio_set_dev(bio, mpath_device->disk->part0);
+
+ submit_bio_noacct(bio);
+ } else if (mpath_available_path(mpath_head)) {
+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
+
+ spin_lock_irq(&mpath_head->requeue_lock);
+ bio_list_add(&mpath_head->requeue_list, bio);
+ spin_unlock_irq(&mpath_head->requeue_lock);
+ } else {
+ dev_warn_ratelimited(dev, "no available path - failing I/O\n");
+
+ bio_io_error(bio);
+ }
+
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+}
+
static void mpath_free_head(struct kref *ref)
{
struct mpath_head *mpath_head =
@@ -310,6 +370,7 @@ const struct block_device_operations mpath_ops = {
.owner = THIS_MODULE,
.open = mpath_bdev_open,
.release = mpath_bdev_release,
+ .submit_bio = mpath_bdev_submit_bio,
};
EXPORT_SYMBOL_GPL(mpath_ops);
@@ -327,6 +388,24 @@ static void multipath_partition_scan_work(struct work_struct *work)
mutex_unlock(&mpath_disk->disk->open_mutex);
}
+void mpath_requeue_work(struct work_struct *work)
+{
+ struct mpath_head *mpath_head =
+ container_of(work, struct mpath_head, requeue_work);
+ struct bio *bio, *next;
+
+ spin_lock_irq(&mpath_head->requeue_lock);
+ next = bio_list_get(&mpath_head->requeue_list);
+ spin_unlock_irq(&mpath_head->requeue_lock);
+
+ while ((bio = next) != NULL) {
+ next = bio->bi_next;
+ bio->bi_next = NULL;
+ submit_bio_noacct(bio);
+ }
+}
+EXPORT_SYMBOL_GPL(mpath_requeue_work);
+
void mpath_remove_disk(struct mpath_disk *mpath_disk)
{
struct mpath_head *mpath_head = mpath_disk->mpath_head;
@@ -334,6 +413,12 @@ void mpath_remove_disk(struct mpath_disk *mpath_disk)
if (test_and_clear_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags)) {
struct gendisk *disk = mpath_disk->disk;
+ /*
+ * requeue I/O after MPATH_HEAD_DISK_LIVE has been cleared
+ * to allow multipath to fail all I/O.
+ */
+ kblockd_schedule_work(&mpath_head->requeue_work);
+
mpath_synchronize(mpath_head);
del_gendisk(disk);
}
@@ -409,6 +494,7 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
mutex_unlock(&mpath_head->lock);
mpath_synchronize(mpath_head);
+ kblockd_schedule_work(&mpath_head->requeue_work);
}
EXPORT_SYMBOL_GPL(mpath_device_set_live);
@@ -424,6 +510,10 @@ struct mpath_head *mpath_alloc_head(void)
mutex_init(&mpath_head->lock);
kref_init(&mpath_head->ref);
+ INIT_WORK(&mpath_head->requeue_work, mpath_requeue_work);
+ spin_lock_init(&mpath_head->requeue_lock);
+ bio_list_init(&mpath_head->requeue_list);
+
ret = init_srcu_struct(&mpath_head->srcu);
if (ret) {
kfree(mpath_head);
--
2.43.5
* [PATCH 05/13] libmultipath: Add support for mpath_device management
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (3 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 04/13] libmultipath: Add bio handling John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-25 15:32 ` [PATCH 06/13] libmultipath: Add cdev support John Garry
` (7 subsequent siblings)
12 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add support to add or remove a mpath_device as a path.
NVMe has almost like-for-like equivalents here:
- nvme_mpath_clear_current_path() -> mpath_clear_current_path()
- nvme_mpath_add_sysfs_link() -> mpath_add_sysfs_link()
- nvme_mpath_remove_sysfs_link() -> mpath_remove_sysfs_link()
- nvme_mpath_revalidate_paths() -> mpath_revalidate_paths()
mpath_revalidate_paths() has a CB arg for NVMe specific handling.
mpath_clear_paths() and mpath_synchronize() capture patterns which are
frequently used in the NVMe code.
Helper mpath_call_for_device() is added to allow a driver to run a callback
on any available path. It is intended for occasions when the NVMe driver
accesses the list of paths outside its multipath code, like in NVMe
sysfs.c.
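For illustration only, the contract of mpath_call_for_device() can be
modelled in userspace as below. This is a sketch under assumptions, not the
kernel code: the model_* names are hypothetical, SRCU protection is left
out, and path selection is reduced to "first usable path" rather than the
configured iopolicy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel structures. */
struct model_path {
	int id;
	int usable;	/* non-zero if I/O may be sent down this path */
};

struct model_head {
	struct model_path *paths;
	size_t nr_paths;
};

/* Return the first usable path, or NULL if there is none. */
static struct model_path *model_find_path(struct model_head *head)
{
	size_t i;

	for (i = 0; i < head->nr_paths; i++) {
		if (head->paths[i].usable)
			return &head->paths[i];
	}
	return NULL;
}

/* Run cb on one usable path; -EWOULDBLOCK if no path can be found. */
static int model_call_for_device(struct model_head *head,
				 int (*cb)(struct model_path *path))
{
	struct model_path *path;

	assert(head && cb);
	path = model_find_path(head);
	if (!path)
		return -EWOULDBLOCK;
	return cb(path);
}

/* Example callback: report which path was chosen. */
static int model_report_id(struct model_path *path)
{
	return path->id;
}
```

The point of the shape is that the caller never touches the path list
itself; it only supplies the callback and handles -EWOULDBLOCK.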
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 18 ++++
lib/multipath.c | 169 +++++++++++++++++++++++++++++++++++++-
2 files changed, 186 insertions(+), 1 deletion(-)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index d557fb9bab4c9..4255b73de56b2 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -33,10 +33,13 @@ struct mpath_disk {
struct device *parent;
};
+#define MPATH_DEVICE_SYSFS_ATTR_LINK 0
+
struct mpath_device {
struct list_head siblings;
atomic_t nr_active;
struct gendisk *disk;
+ unsigned long flags;
int numa_node;
};
@@ -94,6 +97,21 @@ static inline enum mpath_iopolicy_e mpath_read_iopolicy(
void mpath_synchronize(struct mpath_head *mpath_head);
int mpath_set_iopolicy(const char *val, int *iopolicy);
int mpath_get_iopolicy(char *buf, int iopolicy);
+bool mpath_clear_current_path(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device);
+void mpath_synchronize(struct mpath_head *mpath_head);
+void mpath_add_device(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device);
+void mpath_delete_device(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device);
+int mpath_call_for_device(struct mpath_head *mpath_head,
+ int (*cb)(struct mpath_device *mpath_device));
+void mpath_clear_paths(struct mpath_head *mpath_head);
+void mpath_revalidate_paths(struct mpath_disk *mpath_disk,
+ void (*cb)(struct mpath_device *mpath_device, sector_t capacity));
+void mpath_add_sysfs_link(struct mpath_disk *mpath_disk);
+void mpath_remove_sysfs_link(struct mpath_disk *mpath_disk,
+ struct mpath_device *mpath_device);
int mpath_get_head(struct mpath_head *mpath_head);
void mpath_put_head(struct mpath_head *mpath_head);
void mpath_requeue_work(struct work_struct *work);
diff --git a/lib/multipath.c b/lib/multipath.c
index b494b35e8dccc..7f3b0cccf053b 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -40,13 +40,108 @@ int mpath_get_iopolicy(char *buf, int iopolicy)
}
EXPORT_SYMBOL_GPL(mpath_get_iopolicy);
-
void mpath_synchronize(struct mpath_head *mpath_head)
{
synchronize_srcu(&mpath_head->srcu);
}
EXPORT_SYMBOL_GPL(mpath_synchronize);
+void mpath_add_device(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device)
+{
+ mutex_lock(&mpath_head->lock);
+ list_add_tail_rcu(&mpath_device->siblings, &mpath_head->dev_list);
+ mutex_unlock(&mpath_head->lock);
+}
+EXPORT_SYMBOL_GPL(mpath_add_device);
+
+void mpath_delete_device(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device)
+{
+ mutex_lock(&mpath_head->lock);
+ list_del_rcu(&mpath_device->siblings);
+ mutex_unlock(&mpath_head->lock);
+}
+EXPORT_SYMBOL_GPL(mpath_delete_device);
+
+int mpath_call_for_device(struct mpath_head *mpath_head,
+ int (*cb)(struct mpath_device *mpath_device))
+{
+ struct mpath_device *mpath_device;
+ int ret = -EWOULDBLOCK, srcu_idx;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+ if (mpath_device)
+ ret = cb(mpath_device);
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mpath_call_for_device);
+
+bool mpath_clear_current_path(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device)
+{
+ bool changed = false;
+ int node;
+
+ if (!mpath_head)
+ goto out;
+
+ for_each_node(node) {
+ if (mpath_device ==
+ rcu_access_pointer(mpath_head->current_path[node])) {
+ rcu_assign_pointer(mpath_head->current_path[node],
+ NULL);
+ changed = true;
+ }
+ }
+out:
+ return changed;
+}
+EXPORT_SYMBOL_GPL(mpath_clear_current_path);
+
+static void mpath_revalidate_paths_iter(struct mpath_disk *mpath_disk,
+ void (*cb)(struct mpath_device *mpath_device, sector_t capacity))
+{
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ sector_t capacity = get_capacity(mpath_disk->disk);
+ struct mpath_device *mpath_device;
+ int srcu_idx;
+
+ if (!cb)
+ return;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ list_for_each_entry_srcu(mpath_device, &mpath_head->dev_list, siblings,
+ srcu_read_lock_held(&mpath_head->srcu)) {
+ cb(mpath_device, capacity);
+ }
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+}
+
+void mpath_clear_paths(struct mpath_head *mpath_head)
+{
+ int node;
+
+ for_each_node(node)
+ rcu_assign_pointer(mpath_head->current_path[node], NULL);
+}
+EXPORT_SYMBOL_GPL(mpath_clear_paths);
+
+void mpath_revalidate_paths(struct mpath_disk *mpath_disk,
+ void (*cb)(struct mpath_device *mpath_device, sector_t capacity))
+{
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+
+ mpath_revalidate_paths_iter(mpath_disk, cb);
+ mpath_clear_paths(mpath_head);
+
+ kblockd_schedule_work(&mpath_head->requeue_work);
+}
+EXPORT_SYMBOL_GPL(mpath_revalidate_paths);
+
static bool mpath_path_is_disabled(struct mpath_head *mpath_head,
struct mpath_device *mpath_device)
{
@@ -480,6 +575,8 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
queue_work(mpath_wq, &mpath_disk->partition_scan_work);
}
+ mpath_add_sysfs_link(mpath_disk);
+
mutex_lock(&mpath_head->lock);
if (mpath_path_is_optimized(mpath_head, mpath_device)) {
int node, srcu_idx;
@@ -498,6 +595,76 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
}
EXPORT_SYMBOL_GPL(mpath_device_set_live);
+void mpath_add_sysfs_link(struct mpath_disk *mpath_disk)
+{
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct device *target;
+ struct device *source;
+ int rc, srcu_idx;
+ struct kobject *mpath_gd_kobj;
+ struct mpath_device *mpath_device;
+
+ /*
+ * Ensure head disk node is already added otherwise we may get invalid
+ * kobj for head disk node
+ */
+ if (!test_bit(GD_ADDED, &mpath_disk->disk->state))
+ return;
+
+ mpath_gd_kobj = &disk_to_dev(mpath_disk->disk)->kobj;
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+
+ list_for_each_entry_srcu(mpath_device, &mpath_head->dev_list, siblings,
+ srcu_read_lock_held(&mpath_head->srcu)) {
+ if (!test_bit(GD_ADDED, &mpath_device->disk->state))
+ continue;
+
+ if (test_and_set_bit(MPATH_DEVICE_SYSFS_ATTR_LINK, &mpath_device->flags))
+ continue;
+
+ target = disk_to_dev(mpath_device->disk);
+ source = disk_to_dev(mpath_disk->disk);
+ /*
+ * Create sysfs link from head gendisk kobject @mpath_gd_kobj to
+ * the path gendisk kobject @target->kobj.
+ */
+ rc = sysfs_add_link_to_group(mpath_gd_kobj, "multipath",
+ &target->kobj, dev_name(target));
+
+ if (unlikely(rc)) {
+ dev_err(disk_to_dev(mpath_disk->disk),
+ "failed to create link to %s rc=%d\n",
+ dev_name(target), rc);
+ clear_bit(MPATH_DEVICE_SYSFS_ATTR_LINK, &mpath_device->flags);
+ } else {
+ dev_info(source, "Created multipath sysfs link to %s\n",
+ mpath_device->disk->disk_name);
+ }
+ }
+
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+}
+EXPORT_SYMBOL_GPL(mpath_add_sysfs_link);
+
+void mpath_remove_sysfs_link(struct mpath_disk *mpath_disk,
+ struct mpath_device *mpath_device)
+{
+ struct device *target;
+ struct kobject *mpath_gd_kobj;
+
+ if (!test_bit(MPATH_DEVICE_SYSFS_ATTR_LINK, &mpath_device->flags))
+ return;
+
+ target = disk_to_dev(mpath_device->disk);
+ mpath_gd_kobj = &disk_to_dev(mpath_disk->disk)->kobj;
+
+ sysfs_remove_link_from_group(mpath_gd_kobj, "multipath",
+ dev_name(target));
+
+ clear_bit(MPATH_DEVICE_SYSFS_ATTR_LINK, &mpath_device->flags);
+}
+EXPORT_SYMBOL_GPL(mpath_remove_sysfs_link);
+
struct mpath_head *mpath_alloc_head(void)
{
struct mpath_head *mpath_head;
--
2.43.5
* [PATCH 06/13] libmultipath: Add cdev support
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (4 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 05/13] libmultipath: Add support for mpath_device management John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-25 15:32 ` [PATCH 07/13] libmultipath: Add delayed removal support John Garry
` (6 subsequent siblings)
12 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add support to create a cdev multipath device. The functionality is much
the same as NVMe, where the cdev is created when a mpath device is set
live.
The driver must provide a mpath_head_template.cdev_ioctl callback to
actually handle the ioctl.
Structure mpath_generic_chr_fops would be used for setting the cdev fops in
the mpath_head_template.add_cdev callback.
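The locking contract for the cdev_ioctl callback may not be obvious from
the description alone: the generic ioctl handler takes the SRCU read lock,
and once a path has been found the callback is expected to drop it. A
minimal userspace model of that handoff, assuming a plain counter in place
of SRCU and hypothetical model_* names:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the SRCU read-side section: just a counter. */
struct model_srcu {
	int readers;
};

static int model_read_lock(struct model_srcu *s)
{
	return s->readers++;
}

static void model_read_unlock(struct model_srcu *s, int idx)
{
	(void)idx;
	assert(s->readers > 0);
	s->readers--;
}

typedef int (*model_ioctl_fn)(struct model_srcu *s, int idx,
			      unsigned int cmd);

/* Generic entry point: unlocks itself only when no path was found. */
static int model_chr_ioctl(struct model_srcu *s, int have_path,
			   model_ioctl_fn cdev_ioctl, unsigned int cmd)
{
	int idx = model_read_lock(s);

	if (!have_path) {
		model_read_unlock(s, idx);
		return -EWOULDBLOCK;
	}
	/* From here on the callback owns the read lock and must drop it. */
	return cdev_ioctl(s, idx, cmd);
}

/* Driver-side callback: drops the lock before returning. */
static int model_driver_ioctl(struct model_srcu *s, int idx,
			      unsigned int cmd)
{
	int ret = (int)cmd;	/* pretend the command was handled */

	model_read_unlock(s, idx);
	return ret;
}
```

Passing srcu_idx into the callback is what makes this possible; a driver
callback that returns without unlocking would leave the read-side section
open.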
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 15 +++++
lib/multipath.c | 116 ++++++++++++++++++++++++++++++++++++++
2 files changed, 131 insertions(+)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index 4255b73de56b2..0dcfdd205237c 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -4,8 +4,11 @@
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
+#include <linux/cdev.h>
#include <linux/srcu.h>
+#include <linux/io_uring/cmd.h>
+extern const struct file_operations mpath_generic_chr_fops;
extern const struct block_device_operations mpath_ops;
enum mpath_iopolicy_e {
@@ -45,9 +48,18 @@ struct mpath_device {
struct mpath_head_template {
bool (*available_path)(struct mpath_device *, bool *);
+ int (*add_cdev)(struct mpath_head *);
+ void (*del_cdev)(struct mpath_head *);
bool (*is_disabled)(struct mpath_device *);
bool (*is_optimized)(struct mpath_device *);
enum mpath_access_state (*get_access_state)(struct mpath_device *);
+ int (*cdev_ioctl)(struct mpath_head *, struct mpath_device *,
+ blk_mode_t mode, unsigned int cmd, unsigned long arg, int srcu_idx);
+ int (*chr_uring_cmd)(struct mpath_device *, struct io_uring_cmd *ioucmd,
+ unsigned int issue_flags);
+ int (*chr_uring_cmd_iopoll)(struct io_uring_cmd *ioucmd,
+ struct io_comp_batch *iob,
+ unsigned int poll_flags);
enum mpath_iopolicy_e (*get_iopolicy)(struct mpath_head *);
struct bio *(*clone_bio)(struct bio *);
const struct attribute_group **device_groups;
@@ -66,6 +78,9 @@ struct mpath_head {
spinlock_t requeue_lock;
struct work_struct requeue_work; /* work struct for requeue */
+ struct cdev cdev;
+ struct device cdev_device;
+
unsigned long flags;
struct mpath_device __rcu *current_path[MAX_NUMNODES];
const struct mpath_head_template *mpdt;
diff --git a/lib/multipath.c b/lib/multipath.c
index 7f3b0cccf053b..ce12d42918fdd 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -469,6 +469,113 @@ const struct block_device_operations mpath_ops = {
};
EXPORT_SYMBOL_GPL(mpath_ops);
+static int mpath_generic_chr_open(struct inode *inode, struct file *file)
+{
+ struct cdev *cdev = file_inode(file)->i_cdev;
+ struct mpath_head *mpath_head =
+ container_of(cdev, struct mpath_head, cdev);
+
+ return mpath_get_head(mpath_head);
+}
+
+static int mpath_generic_chr_release(struct inode *inode, struct file *file)
+{
+ struct cdev *cdev = file_inode(file)->i_cdev;
+ struct mpath_head *mpath_head =
+ container_of(cdev, struct mpath_head, cdev);
+
+ mpath_put_head(mpath_head);
+ return 0;
+}
+
+static long mpath_generic_chr_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ struct cdev *cdev = file_inode(file)->i_cdev;
+ struct mpath_head *mpath_head =
+ container_of(cdev, struct mpath_head, cdev);
+ struct mpath_device *mpath_device;
+ fmode_t mode = file->f_mode;
+ int srcu_idx, err = -EWOULDBLOCK;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+ if (!mpath_device)
+ goto out_unlock;
+
+ /*
+ * If we are in the middle of error recovery, don't let anyone
+ * else try and use this device. Also, if error recovery fails, it
+ * may try and take the device offline, in which case all further
+ * access to the device is prohibited.
+ */
+ err = mpath_head->mpdt->cdev_ioctl(mpath_head, mpath_device,
+ mode, cmd, arg, srcu_idx);
+ lockdep_assert_not_held(&mpath_head->srcu);
+ return err; /* the cdev_ioctl callback must unlock */
+
+out_unlock:
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+ return err;
+}
+
+static int mpath_generic_chr_uring_cmd(struct io_uring_cmd *ioucmd,
+ unsigned int issue_flags)
+{
+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+ struct mpath_head *mpath_head =
+ container_of(cdev, struct mpath_head, cdev);
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ if (!mpath_head->mpdt->chr_uring_cmd)
+ return -EOPNOTSUPP;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (!mpath_device)
+ goto out_unlock;
+
+ ret = mpath_head->mpdt->chr_uring_cmd(mpath_device, ioucmd,
+ issue_flags);
+out_unlock:
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+ return ret;
+}
+
+static int mpath_generic_chr_uring_cmd_iopoll(struct io_uring_cmd *ioucmd,
+ struct io_comp_batch *iob,
+ unsigned int poll_flags)
+{
+ struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
+ struct mpath_head *mpath_head =
+ container_of(cdev, struct mpath_head, cdev);
+
+ if (!mpath_head->mpdt->chr_uring_cmd_iopoll)
+ return -EOPNOTSUPP;
+
+ return mpath_head->mpdt->chr_uring_cmd_iopoll(ioucmd, iob, poll_flags);
+}
+
+const struct file_operations mpath_generic_chr_fops = {
+ .owner = THIS_MODULE,
+ .open = mpath_generic_chr_open,
+ .release = mpath_generic_chr_release,
+ .unlocked_ioctl = mpath_generic_chr_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
+ .uring_cmd = mpath_generic_chr_uring_cmd,
+ .uring_cmd_iopoll = mpath_generic_chr_uring_cmd_iopoll,
+};
+EXPORT_SYMBOL_GPL(mpath_generic_chr_fops);
+
+static int mpath_head_add_cdev(struct mpath_head *mpath_head)
+{
+ if (mpath_head->mpdt->add_cdev)
+ return mpath_head->mpdt->add_cdev(mpath_head);
+ return 0;
+}
+
static void multipath_partition_scan_work(struct work_struct *work)
{
struct mpath_disk *mpath_disk =
@@ -501,6 +608,12 @@ void mpath_requeue_work(struct work_struct *work)
}
EXPORT_SYMBOL_GPL(mpath_requeue_work);
+static void mpath_head_del_cdev(struct mpath_head *mpath_head)
+{
+ if (mpath_head->mpdt->del_cdev)
+ mpath_head->mpdt->del_cdev(mpath_head);
+}
+
void mpath_remove_disk(struct mpath_disk *mpath_disk)
{
struct mpath_head *mpath_head = mpath_disk->mpath_head;
@@ -514,6 +627,7 @@ void mpath_remove_disk(struct mpath_disk *mpath_disk)
*/
kblockd_schedule_work(&mpath_head->requeue_work);
+ mpath_head_del_cdev(mpath_head);
mpath_synchronize(mpath_head);
del_gendisk(disk);
}
@@ -572,6 +686,8 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
clear_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags);
return;
}
+
+ mpath_head_add_cdev(mpath_head);
queue_work(mpath_wq, &mpath_disk->partition_scan_work);
}
--
2.43.5
* [PATCH 07/13] libmultipath: Add delayed removal support
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (5 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 06/13] libmultipath: Add cdev support John Garry
@ 2026-02-25 15:32 ` John Garry
2026-03-02 12:41 ` Nilay Shroff
2026-02-25 15:32 ` [PATCH 08/13] libmultipath: Add sysfs helpers John Garry
` (5 subsequent siblings)
12 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add support for delayed removal, same as exists for NVMe.
The purpose of this feature is to keep the multipath disk and cdev present
for intermittent periods of no available path.
Helpers mpath_delayed_removal_secs_show() and
mpath_delayed_removal_secs_store() may be used in the driver sysfs code.
The driver is responsible for supplying the removal work callback for
the delayed work.
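As a rough sketch of the intended behaviour, the interaction between
delayed_removal_secs, the queue-if-no-path flag and head removal can be
modelled in userspace as follows. All model_* names are hypothetical, and
the locking, the module reference and the actual delayed work are
deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

struct model_head {
	unsigned int delayed_removal_secs;
	bool queue_if_no_path;		/* models MPATH_HEAD_QUEUE_IF_NO_PATH */
	bool removal_scheduled;
	unsigned int removal_delay_secs;
};

/* Models the sysfs store: a non-zero value turns queueing on. */
static void model_store_delayed_removal(struct model_head *head,
					unsigned int secs)
{
	assert(head);
	head->delayed_removal_secs = secs;
	head->queue_if_no_path = secs != 0;
}

/* With no path left: true means requeue the bio, false means fail it. */
static bool model_queue_when_no_path(const struct model_head *head)
{
	return head->queue_if_no_path;
}

/* Returns true if the head may be removed now, else arms the delay. */
static bool model_can_remove_head(struct model_head *head)
{
	if (!head->queue_if_no_path)
		return true;
	head->removal_scheduled = true;
	head->removal_delay_secs = head->delayed_removal_secs;
	return false;
}

/* A path coming back cancels any pending removal. */
static void model_add_device(struct model_head *head)
{
	head->removal_scheduled = false;
}
```

With the default of zero the head is removed immediately and I/O fails
once no path is available; a non-zero value keeps the head around and
queues I/O until the delay expires or a path returns.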
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 17 +++++++++
lib/multipath.c | 79 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 95 insertions(+), 1 deletion(-)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index 0dcfdd205237c..f7998de261899 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -66,6 +66,7 @@ struct mpath_head_template {
};
#define MPATH_HEAD_DISK_LIVE 0
+#define MPATH_HEAD_QUEUE_IF_NO_PATH 1
struct mpath_head {
struct srcu_struct srcu;
@@ -81,6 +82,10 @@ struct mpath_head {
struct cdev cdev;
struct device cdev_device;
+ struct delayed_work remove_work;
+ unsigned int delayed_removal_secs;
+ struct module *drv_module;
+
unsigned long flags;
struct mpath_device __rcu *current_path[MAX_NUMNODES];
const struct mpath_head_template *mpdt;
@@ -132,6 +137,7 @@ void mpath_put_head(struct mpath_head *mpath_head);
void mpath_requeue_work(struct work_struct *work);
struct mpath_head *mpath_alloc_head(void);
void mpath_put_disk(struct mpath_disk *mpath_disk);
+bool mpath_can_remove_head(struct mpath_head *mpath_head);
void mpath_remove_disk(struct mpath_disk *mpath_disk);
void mpath_unregister_disk(struct mpath_disk *mpath_disk);
struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
@@ -139,6 +145,10 @@ struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
void mpath_device_set_live(struct mpath_disk *mpath_disk,
struct mpath_device *mpath_device);
void mpath_unregister_disk(struct mpath_disk *mpath_disk);
+ssize_t mpath_delayed_removal_secs_show(struct mpath_head *mpath_head,
+ char *buf);
+ssize_t mpath_delayed_removal_secs_store(struct mpath_head *mpath_head,
+ const char *buf, size_t count);
static inline bool is_mpath_head(struct gendisk *disk)
{
@@ -150,4 +160,11 @@ static inline bool mpath_qd_iopolicy(struct mpath_iopolicy *mpath_iopolicy)
return mpath_read_iopolicy(mpath_iopolicy) == MPATH_IOPOLICY_QD;
}
+static inline bool mpath_head_queue_if_no_path(struct mpath_head *mpath_head)
+{
+ if (test_bit(MPATH_HEAD_QUEUE_IF_NO_PATH, &mpath_head->flags))
+ return true;
+ return false;
+}
+
#endif // _LIBMULTIPATH_H
diff --git a/lib/multipath.c b/lib/multipath.c
index ce12d42918fdd..1ce57b9b14d2e 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -52,6 +52,7 @@ void mpath_add_device(struct mpath_head *mpath_head,
mutex_lock(&mpath_head->lock);
list_add_tail_rcu(&mpath_device->siblings, &mpath_head->dev_list);
mutex_unlock(&mpath_head->lock);
+ cancel_delayed_work(&mpath_head->remove_work);
}
EXPORT_SYMBOL_GPL(mpath_add_device);
@@ -356,7 +357,17 @@ static bool mpath_available_path(struct mpath_head *mpath_head)
return true;
}
- return false;
+ /*
+ * If "mpath_head->delayed_removal_secs" is configured (i.e., non-zero), do
+ * not immediately fail I/O. Instead, requeue the I/O for the configured
+ * duration, anticipating that if there's a transient link failure then
+ * it may recover within this time window. This parameter is exported to
+ * userspace via sysfs, and its default value is zero. It is internally
+ * mapped to MPATH_HEAD_QUEUE_IF_NO_PATH. When delayed_removal_secs is
+ * non-zero, this flag is set to true. When zero, the flag is cleared.
+ */
+ return mpath_head_queue_if_no_path(mpath_head);
+
}
static void mpath_bdev_submit_bio(struct bio *bio)
@@ -614,6 +625,29 @@ static void mpath_head_del_cdev(struct mpath_head *mpath_head)
mpath_head->mpdt->del_cdev(mpath_head);
}
+bool mpath_can_remove_head(struct mpath_head *mpath_head)
+{
+ bool remove = false;
+
+ mutex_lock(&mpath_head->lock);
+ /*
+ * Ensure that no one could remove this module while the head
+ * remove work is pending.
+ */
+ if (mpath_head_queue_if_no_path(mpath_head) &&
+ try_module_get(mpath_head->drv_module)) {
+
+ mod_delayed_work(mpath_wq, &mpath_head->remove_work,
+ mpath_head->delayed_removal_secs * HZ);
+ } else {
+ remove = true;
+ }
+
+ mutex_unlock(&mpath_head->lock);
+ return remove;
+}
+EXPORT_SYMBOL_GPL(mpath_can_remove_head);
+
void mpath_remove_disk(struct mpath_disk *mpath_disk)
{
struct mpath_head *mpath_head = mpath_disk->mpath_head;
@@ -711,6 +745,47 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
}
EXPORT_SYMBOL_GPL(mpath_device_set_live);
+ssize_t mpath_delayed_removal_secs_show(struct mpath_head *mpath_head,
+ char *buf)
+{
+ int ret;
+
+ mutex_lock(&mpath_head->lock);
+ ret = sysfs_emit(buf, "%u\n", mpath_head->delayed_removal_secs);
+ mutex_unlock(&mpath_head->lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(mpath_delayed_removal_secs_show);
+
+ssize_t mpath_delayed_removal_secs_store(struct mpath_head *mpath_head,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+ unsigned int sec;
+
+ ret = kstrtouint(buf, 0, &sec);
+ if (ret < 0)
+ return ret;
+
+ mutex_lock(&mpath_head->lock);
+ mpath_head->delayed_removal_secs = sec;
+ if (sec)
+ set_bit(MPATH_HEAD_QUEUE_IF_NO_PATH, &mpath_head->flags);
+ else
+ clear_bit(MPATH_HEAD_QUEUE_IF_NO_PATH, &mpath_head->flags);
+ mutex_unlock(&mpath_head->lock);
+
+ /*
+ * Ensure that update to MPATH_HEAD_QUEUE_IF_NO_PATH is seen
+ * by its reader.
+ */
+ mpath_synchronize(mpath_head);
+
+ return count;
+}
+EXPORT_SYMBOL_GPL(mpath_delayed_removal_secs_store);
+
void mpath_add_sysfs_link(struct mpath_disk *mpath_disk)
{
struct mpath_head *mpath_head = mpath_disk->mpath_head;
@@ -793,6 +868,8 @@ struct mpath_head *mpath_alloc_head(void)
mutex_init(&mpath_head->lock);
kref_init(&mpath_head->ref);
+ mpath_head->delayed_removal_secs = 0;
+
INIT_WORK(&mpath_head->requeue_work, mpath_requeue_work);
spin_lock_init(&mpath_head->requeue_lock);
bio_list_init(&mpath_head->requeue_list);
--
2.43.5
* [PATCH 08/13] libmultipath: Add sysfs helpers
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (6 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 07/13] libmultipath: Add delayed removal support John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-27 19:05 ` Benjamin Marzinski
2026-02-25 15:32 ` [PATCH 09/13] libmultipath: Add PR support John Garry
` (4 subsequent siblings)
12 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add helpers for driver sysfs code for the following functionality:
- get/set iopolicy with mpath_iopolicy_store() and mpath_iopolicy_show()
- show device path per NUMA node
- "multipath" attribute group, equivalent to nvme_ns_mpath_attr_group
- device groups attribute array, similar to nvme_ns_attr_groups but not
containing NVMe members.
Note that mpath_iopolicy_store() has an update callback to allow the same
functionality as nvme_subsys_iopolicy_update() to be run for clearing paths.
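To make the store/update contract concrete, here is a userspace model of
the name matching and the update callback. It is a sketch only: the
model_* names are hypothetical, the policy name strings are assumed to
match the NVMe ones (the real name table is not part of this patch), and
model_streq() imitates the newline-tolerant comparison of sysfs_streq().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed policy names, borrowed from NVMe; the real table is not shown. */
static const char *const model_iopolicy_names[] = {
	"numa", "round-robin", "queue-depth",
};

struct model_iopolicy {
	int iopolicy;
};

/* Same contract as sysfs_streq(): equal, ignoring one trailing newline. */
static bool model_streq(const char *s1, const char *s2)
{
	while (*s1 && *s1 == *s2) {
		s1++;
		s2++;
	}
	if (*s1 == *s2)
		return true;
	if (!*s1 && *s2 == '\n' && !s2[1])
		return true;
	if (*s1 == '\n' && !s1[1] && !*s2)
		return true;
	return false;
}

/* Returns count on success, -EINVAL for an unknown name. */
static long model_iopolicy_store(struct model_iopolicy *pol, const char *buf,
				 size_t count, void (*update)(void *),
				 void *data)
{
	size_t n = sizeof(model_iopolicy_names) /
		   sizeof(model_iopolicy_names[0]);
	size_t i;

	assert(pol && buf && update);
	for (i = 0; i < n; i++) {
		if (!model_streq(buf, model_iopolicy_names[i]))
			continue;
		if (pol->iopolicy != (int)i) {
			pol->iopolicy = (int)i;
			update(data);	/* e.g. clear cached current paths */
		}
		return (long)count;
	}
	return -EINVAL;
}

/* Example update hook: count how many times paths were cleared. */
static void model_count_update(void *data)
{
	(*(int *)data)++;
}
```

Writing the policy that is already in effect succeeds without invoking
the update hook, so cached paths are only cleared on a real change.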
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 9 ++++
lib/multipath.c | 110 ++++++++++++++++++++++++++++++++++++++
2 files changed, 119 insertions(+)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index f7998de261899..9122560f71778 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -10,6 +10,8 @@
extern const struct file_operations mpath_generic_chr_fops;
extern const struct block_device_operations mpath_ops;
+extern const struct attribute_group mpath_attr_group;
+extern const struct attribute_group *mpath_device_groups[];
enum mpath_iopolicy_e {
MPATH_IOPOLICY_NUMA,
@@ -145,6 +147,13 @@ struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
void mpath_device_set_live(struct mpath_disk *mpath_disk,
struct mpath_device *mpath_device);
void mpath_unregister_disk(struct mpath_disk *mpath_disk);
+ssize_t mpath_numa_nodes_show(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device,
+ struct mpath_iopolicy *iopolicy, char *buf);
+ssize_t mpath_iopolicy_show(struct mpath_iopolicy *mpath_iopolicy, char *buf);
+ssize_t mpath_iopolicy_store(struct mpath_iopolicy *mpath_iopolicy,
+ const char *buf, size_t count,
+ void (*update)(void *data), void *);
ssize_t mpath_delayed_removal_secs_show(struct mpath_head *mpath_head,
char *buf);
ssize_t mpath_delayed_removal_secs_store(struct mpath_head *mpath_head,
diff --git a/lib/multipath.c b/lib/multipath.c
index 1ce57b9b14d2e..c05b4d25ca223 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -745,6 +745,116 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
}
EXPORT_SYMBOL_GPL(mpath_device_set_live);
+static struct attribute dummy_attr = {
+ .name = "dummy",
+};
+
+static struct attribute *mpath_attrs[] = {
+ &dummy_attr,
+ NULL
+};
+
+static bool multipath_sysfs_group_visible(struct kobject *kobj)
+{
+ struct device *dev = container_of(kobj, struct device, kobj);
+ struct gendisk *disk = dev_to_disk(dev);
+
+ return is_mpath_head(disk);
+}
+
+static bool multipath_sysfs_attr_visible(struct kobject *kobj,
+ struct attribute *attr, int n)
+{
+ return false;
+}
+
+DEFINE_SYSFS_GROUP_VISIBLE(multipath_sysfs)
+
+const struct attribute_group mpath_attr_group = {
+ .name = "multipath",
+ .attrs = mpath_attrs,
+ .is_visible = SYSFS_GROUP_VISIBLE(multipath_sysfs),
+};
+EXPORT_SYMBOL_GPL(mpath_attr_group);
+
+const struct attribute_group *mpath_device_groups[] = {
+ &mpath_attr_group,
+ NULL
+};
+EXPORT_SYMBOL_GPL(mpath_device_groups);
+
+ssize_t mpath_iopolicy_show(struct mpath_iopolicy *mpath_iopolicy, char *buf)
+{
+ return sysfs_emit(buf, "%s\n",
+ mpath_iopolicy_names[mpath_read_iopolicy(mpath_iopolicy)]);
+}
+EXPORT_SYMBOL_GPL(mpath_iopolicy_show);
+
+static void mpath_iopolicy_update(struct mpath_iopolicy *mpath_iopolicy,
+ int iopolicy, void (*update)(void *), void *data)
+{
+ int old_iopolicy = READ_ONCE(mpath_iopolicy->iopolicy);
+
+ if (old_iopolicy == iopolicy)
+ return;
+
+ WRITE_ONCE(mpath_iopolicy->iopolicy, iopolicy);
+
+ /*
+ * iopolicy changes clear the mpath by design, which @update
+ * must do.
+ */
+ update(data);
+
+ pr_notice("iopolicy changed from %s to %s\n",
+ mpath_iopolicy_names[old_iopolicy],
+ mpath_iopolicy_names[iopolicy]);
+}
+
+ssize_t mpath_iopolicy_store(struct mpath_iopolicy *mpath_iopolicy,
+ const char *buf, size_t count,
+ void (*update)(void *), void *data)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(mpath_iopolicy_names); i++) {
+ if (sysfs_streq(buf, mpath_iopolicy_names[i])) {
+ mpath_iopolicy_update(mpath_iopolicy, i, update, data);
+ return count;
+ }
+ }
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(mpath_iopolicy_store);
+
+ssize_t mpath_numa_nodes_show(struct mpath_head *mpath_head,
+ struct mpath_device *mpath_device,
+ struct mpath_iopolicy *mpath_iopolicy, char *buf)
+{
+ int node, srcu_idx;
+ nodemask_t numa_nodes;
+ struct mpath_device *current_mpath_dev;
+
+ if (mpath_read_iopolicy(mpath_iopolicy) != MPATH_IOPOLICY_NUMA)
+ return 0;
+
+ nodes_clear(numa_nodes);
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ for_each_node(node) {
+ current_mpath_dev =
+ srcu_dereference(mpath_head->current_path[node],
+ &mpath_head->srcu);
+ if (current_mpath_dev == mpath_device)
+ node_set(node, numa_nodes);
+ }
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
+}
+EXPORT_SYMBOL_GPL(mpath_numa_nodes_show);
+
ssize_t mpath_delayed_removal_secs_show(struct mpath_head *mpath_head,
char *buf)
{
--
2.43.5
* [PATCH 09/13] libmultipath: Add PR support
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (7 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 08/13] libmultipath: Add sysfs helpers John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-25 15:49 ` Keith Busch
2026-02-25 15:32 ` [PATCH 10/13] libmultipath: Add mpath_bdev_report_zones() John Garry
` (3 subsequent siblings)
12 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add support for persistent reservations.
Effectively all that is done here is that a multipath version of pr_ops is
created which calls into the driver version of the callbacks for the
mpath_device selected.
Structure mpath_pr_ops is introduced, which must be set by the driver for
PR callbacks.
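The forwarding described above can be pictured with a cut-down userspace
model in which a head-level operation hands its arguments to the driver
operation for the selected path. The model_* names, the single-operation
table and the key check in the example driver op are all assumptions made
for illustration; they are not taken from any driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct model_path {
	uint64_t registered_key;
};

/* Cut-down analogue of mpath_pr_ops with a single operation. */
struct model_pr_ops {
	int (*pr_register)(struct model_path *path, uint64_t old_key,
			   uint64_t new_key, uint32_t flags);
};

struct model_head {
	struct model_path *selected;	/* NULL when no path is usable */
	const struct model_pr_ops *pr_ops;
};

/* Head-level operation: forward to the driver op on the selected path. */
static int model_pr_register(struct model_head *head, uint64_t old_key,
			     uint64_t new_key, uint32_t flags)
{
	assert(head && head->pr_ops && head->pr_ops->pr_register);
	if (!head->selected)
		return -EWOULDBLOCK;
	return head->pr_ops->pr_register(head->selected, old_key, new_key,
					 flags);
}

/* Example driver op: accept the new key only if old_key matches. */
static int model_driver_pr_register(struct model_path *path,
				    uint64_t old_key, uint64_t new_key,
				    uint32_t flags)
{
	(void)flags;
	if (path->registered_key != old_key)
		return -EINVAL;
	path->registered_key = new_key;
	return 0;
}

static const struct model_pr_ops model_driver_pr_ops = {
	.pr_register = model_driver_pr_register,
};
```

The head-level function adds no PR semantics of its own; whatever the
driver op returns for the chosen path is what the caller sees.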
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 18 +++++
lib/multipath.c | 146 ++++++++++++++++++++++++++++++++++++++
2 files changed, 164 insertions(+)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index 9122560f71778..454826c385923 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -5,6 +5,7 @@
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/cdev.h>
+#include <linux/pr.h>
#include <linux/srcu.h>
#include <linux/io_uring/cmd.h>
@@ -48,6 +49,22 @@ struct mpath_device {
int numa_node;
};
+struct mpath_pr_ops {
+ int (*pr_register)(struct mpath_device *mpath_device, u64 old_key,
+ u64 new_key, u32 flags);
+ int (*pr_reserve)(struct mpath_device *mpath_device, u64 key,
+ enum pr_type type, u32 flags);
+ int (*pr_release)(struct mpath_device *mpath_device, u64 key,
+ enum pr_type type);
+ int (*pr_preempt)(struct mpath_device *mpath_device, u64 old_key,
+ u64 new_key, enum pr_type type, bool abort);
+ int (*pr_clear)(struct mpath_device *mpath_device, u64 key);
+ int (*pr_read_keys)(struct mpath_device *mpath_device,
+ struct pr_keys *keys_info);
+ int (*pr_read_reservation)(struct mpath_device *mpath_device,
+ struct pr_held_reservation *rsv);
+};
+
struct mpath_head_template {
bool (*available_path)(struct mpath_device *, bool *);
int (*add_cdev)(struct mpath_head *);
@@ -64,6 +81,7 @@ struct mpath_head_template {
unsigned int poll_flags);
enum mpath_iopolicy_e (*get_iopolicy)(struct mpath_head *);
struct bio *(*clone_bio)(struct bio *);
+ const struct mpath_pr_ops *pr_ops;
const struct attribute_group **device_groups;
};
diff --git a/lib/multipath.c b/lib/multipath.c
index c05b4d25ca223..8ee2d12600035 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -472,11 +472,157 @@ static void mpath_bdev_release(struct gendisk *disk)
mpath_put_disk(mpath_disk);
}
+static int mpath_pr_register(struct block_device *bdev, u64 old_key,
+ u64 new_key, unsigned int flags)
+{
+ struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+ if (mpath_device)
+ ret = mpath_head->mpdt->pr_ops->pr_register(mpath_device,
+ old_key, new_key, flags);
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+
+static int mpath_pr_reserve(struct block_device *bdev, u64 key,
+ enum pr_type type, unsigned flags)
+{
+ struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (mpath_device)
+ ret = mpath_head->mpdt->pr_ops->pr_reserve(mpath_device, key,
+ type, flags);
+
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+
+static int mpath_pr_release(struct block_device *bdev, u64 key, enum pr_type type)
+{
+ struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (mpath_device)
+ ret = mpath_head->mpdt->pr_ops->pr_release(mpath_device, key,
+ type);
+
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+
+static int mpath_pr_preempt(struct block_device *bdev, u64 old, u64 new,
+ enum pr_type type, bool abort)
+{
+ struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (mpath_device)
+ ret = mpath_head->mpdt->pr_ops->pr_preempt(mpath_device, old,
+ new, type, abort);
+
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+
+static int mpath_pr_clear(struct block_device *bdev, u64 key)
+{
+ struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (mpath_device)
+ ret = mpath_head->mpdt->pr_ops->pr_clear(mpath_device, key);
+
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+
+static int mpath_pr_read_keys(struct block_device *bdev,
+ struct pr_keys *keys_info)
+{
+ struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (mpath_device)
+ ret = mpath_head->mpdt->pr_ops->pr_read_keys(mpath_device,
+ keys_info);
+
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+
+static int mpath_pr_read_reservation(struct block_device *bdev,
+ struct pr_held_reservation *resv)
+{
+ struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (mpath_device)
+ ret = mpath_head->mpdt->pr_ops->pr_read_reservation(
+ mpath_device, resv);
+
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+
+static const struct pr_ops mpath_pr_ops = {
+ .pr_register = mpath_pr_register,
+ .pr_reserve = mpath_pr_reserve,
+ .pr_release = mpath_pr_release,
+ .pr_preempt = mpath_pr_preempt,
+ .pr_clear = mpath_pr_clear,
+ .pr_read_keys = mpath_pr_read_keys,
+ .pr_read_reservation = mpath_pr_read_reservation,
+};
+
const struct block_device_operations mpath_ops = {
.owner = THIS_MODULE,
.open = mpath_bdev_open,
.release = mpath_bdev_release,
.submit_bio = mpath_bdev_submit_bio,
+ .pr_ops = &mpath_pr_ops,
};
EXPORT_SYMBOL_GPL(mpath_ops);
--
2.43.5
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 10/13] libmultipath: Add mpath_bdev_report_zones()
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (8 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 09/13] libmultipath: Add PR support John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-25 15:32 ` [PATCH 11/13] libmultipath: Add support for block device IOCTL John Garry
` (2 subsequent siblings)
12 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add a multipath handler for block_device_operations.report_zones.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 2 ++
lib/multipath.c | 25 +++++++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index 454826c385923..3846ea8cfd319 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -74,6 +74,8 @@ struct mpath_head_template {
enum mpath_access_state (*get_access_state)(struct mpath_device *);
int (*cdev_ioctl)(struct mpath_head *, struct mpath_device *,
blk_mode_t mode, unsigned int cmd, unsigned long arg, int srcu_idx);
+ int (*report_zones)(struct mpath_device *, sector_t sector,
+ unsigned int nr_zones, struct blk_report_zones_args *args);
int (*chr_uring_cmd)(struct mpath_device *, struct io_uring_cmd *ioucmd,
unsigned int issue_flags);
int (*chr_uring_cmd_iopoll)(struct io_uring_cmd *ioucmd,
diff --git a/lib/multipath.c b/lib/multipath.c
index 8ee2d12600035..4c57feefff480 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -40,6 +40,30 @@ int mpath_get_iopolicy(char *buf, int iopolicy)
}
EXPORT_SYMBOL_GPL(mpath_get_iopolicy);
+#ifdef CONFIG_BLK_DEV_ZONED
+static int mpath_bdev_report_zones(struct gendisk *disk, sector_t sector,
+ unsigned int nr_zones, struct blk_report_zones_args *args)
+{
+ struct mpath_disk *mpath_disk = mpath_gendisk_to_disk(disk);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, ret = -EWOULDBLOCK;
+
+ if (!mpath_head->mpdt->report_zones)
+ return -EOPNOTSUPP;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+ if (mpath_device)
+ ret = mpath_head->mpdt->report_zones(mpath_device, sector,
+ nr_zones, args);
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+ return ret;
+}
+#else
+#define mpath_bdev_report_zones NULL
+#endif /* CONFIG_BLK_DEV_ZONED */
+
void mpath_synchronize(struct mpath_head *mpath_head)
{
synchronize_srcu(&mpath_head->srcu);
@@ -622,6 +646,7 @@ const struct block_device_operations mpath_ops = {
.open = mpath_bdev_open,
.release = mpath_bdev_release,
.submit_bio = mpath_bdev_submit_bio,
+ .report_zones = mpath_bdev_report_zones,
.pr_ops = &mpath_pr_ops,
};
EXPORT_SYMBOL_GPL(mpath_ops);
--
2.43.5
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 11/13] libmultipath: Add support for block device IOCTL
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (9 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 10/13] libmultipath: Add mpath_bdev_report_zones() John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-27 19:52 ` Benjamin Marzinski
2026-02-25 15:32 ` [PATCH 12/13] libmultipath: Add mpath_bdev_getgeo() John Garry
2026-02-25 15:32 ` [PATCH 13/13] libmultipath: Add mpath_bdev_get_unique_id() John Garry
12 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add mpath_bdev_ioctl() as a multipath block device IOCTL handler. This
handler calls into the driver's mpath_head_template.bdev_ioctl handler.
It is expected that the .bdev_ioctl handler will unlock the SRCU read lock,
as this is what NVMe requires - see nvme_ns_head_ctrl_ioctl(). As such,
export a handler to unlock, mpath_head_read_unlock().
The .compat_ioctl handler is given the standard handler.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 4 ++++
lib/multipath.c | 42 +++++++++++++++++++++++++++++++++++++++
2 files changed, 46 insertions(+)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index 3846ea8cfd319..40dda6a914c5f 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -72,6 +72,9 @@ struct mpath_head_template {
bool (*is_disabled)(struct mpath_device *);
bool (*is_optimized)(struct mpath_device *);
enum mpath_access_state (*get_access_state)(struct mpath_device *);
+ int (*bdev_ioctl)(struct block_device *bdev, struct mpath_device *,
+ blk_mode_t mode, unsigned int cmd, unsigned long arg,
+ int srcu_idx);
int (*cdev_ioctl)(struct mpath_head *, struct mpath_device *,
blk_mode_t mode, unsigned int cmd, unsigned long arg, int srcu_idx);
int (*report_zones)(struct mpath_device *, sector_t sector,
@@ -154,6 +157,7 @@ void mpath_revalidate_paths(struct mpath_disk *mpath_disk,
void mpath_add_sysfs_link(struct mpath_disk *mpath_disk);
void mpath_remove_sysfs_link(struct mpath_disk *mpath_disk,
struct mpath_device *mpath_device);
+void mpath_head_read_unlock(struct mpath_head *mpath_head, int srcu_idx);
int mpath_get_head(struct mpath_head *mpath_head);
void mpath_put_head(struct mpath_head *mpath_head);
void mpath_requeue_work(struct work_struct *work);
diff --git a/lib/multipath.c b/lib/multipath.c
index 4c57feefff480..537579ad5989e 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -496,6 +496,46 @@ static void mpath_bdev_release(struct gendisk *disk)
mpath_put_disk(mpath_disk);
}
+static int mpath_bdev_ioctl(struct block_device *bdev, blk_mode_t mode,
+ unsigned int cmd, unsigned long arg)
+{
+ struct gendisk *disk = bdev->bd_disk;
+ struct mpath_disk *mpath_disk = mpath_gendisk_to_disk(disk);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ struct mpath_device *mpath_device;
+ int srcu_idx, err;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+
+ if (!mpath_device) {
+ err = -EWOULDBLOCK;
+ goto out_unlock;
+ }
+
+ if (bdev_is_partition(bdev) && !capable(CAP_SYS_RAWIO)) {
+ err = -ENOIOCTLCMD;
+ goto out_unlock;
+ }
+
+ /* ->ioctl must always unlock */
+ err = mpath_head->mpdt->bdev_ioctl(bdev, mpath_device, mode, cmd,
+ arg, srcu_idx);
+ lockdep_assert_not_held(&mpath_head->srcu);
+ return err;
+
+out_unlock:
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+ return err;
+}
+
+void mpath_head_read_unlock(struct mpath_head *mpath_head, int srcu_idx)
+__releases(&mpath_head->srcu)
+{
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+}
+EXPORT_SYMBOL_GPL(mpath_head_read_unlock);
+
static int mpath_pr_register(struct block_device *bdev, u64 old_key,
u64 new_key, unsigned int flags)
{
@@ -646,6 +686,8 @@ const struct block_device_operations mpath_ops = {
.open = mpath_bdev_open,
.release = mpath_bdev_release,
.submit_bio = mpath_bdev_submit_bio,
+ .ioctl = mpath_bdev_ioctl,
+ .compat_ioctl = blkdev_compat_ptr_ioctl,
.report_zones = mpath_bdev_report_zones,
.pr_ops = &mpath_pr_ops,
};
--
2.43.5
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 12/13] libmultipath: Add mpath_bdev_getgeo()
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (10 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 11/13] libmultipath: Add support for block device IOCTL John Garry
@ 2026-02-25 15:32 ` John Garry
2026-02-25 15:32 ` [PATCH 13/13] libmultipath: Add mpath_bdev_get_unique_id() John Garry
12 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add mpath_bdev_getgeo() as a multipath block device .getgeo handler.
Here we just redirect into the selected mpath_device disk fops->getgeo
handler.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
lib/multipath.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/lib/multipath.c b/lib/multipath.c
index 537579ad5989e..192ecd886b958 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -536,6 +536,22 @@ __releases(&mpath_head->srcu)
}
EXPORT_SYMBOL_GPL(mpath_head_read_unlock);
+static int mpath_bdev_getgeo(struct gendisk *disk, struct hd_geometry *geo)
+{
+ struct mpath_disk *mpath_disk = mpath_gendisk_to_disk(disk);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ int srcu_idx, ret = -EWOULDBLOCK;
+ struct mpath_device *mpath_device;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+ if (mpath_device)
+ ret = mpath_device->disk->fops->getgeo(mpath_device->disk, geo);
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
+
static int mpath_pr_register(struct block_device *bdev, u64 old_key,
u64 new_key, unsigned int flags)
{
@@ -689,6 +705,7 @@ const struct block_device_operations mpath_ops = {
.ioctl = mpath_bdev_ioctl,
.compat_ioctl = blkdev_compat_ptr_ioctl,
.report_zones = mpath_bdev_report_zones,
+ .getgeo = mpath_bdev_getgeo,
.pr_ops = &mpath_pr_ops,
};
EXPORT_SYMBOL_GPL(mpath_ops);
--
2.43.5
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 13/13] libmultipath: Add mpath_bdev_get_unique_id()
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
` (11 preceding siblings ...)
2026-02-25 15:32 ` [PATCH 12/13] libmultipath: Add mpath_bdev_getgeo() John Garry
@ 2026-02-25 15:32 ` John Garry
12 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 15:32 UTC (permalink / raw)
To: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel, John Garry
Add mpath_bdev_get_unique_id() as a multipath block device .get_unique_id
handler.
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
include/linux/multipath.h | 2 ++
lib/multipath.c | 17 +++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/multipath.h b/include/linux/multipath.h
index 40dda6a914c5f..1aa70ae11a195 100644
--- a/include/linux/multipath.h
+++ b/include/linux/multipath.h
@@ -86,6 +86,8 @@ struct mpath_head_template {
unsigned int poll_flags);
enum mpath_iopolicy_e (*get_iopolicy)(struct mpath_head *);
struct bio *(*clone_bio)(struct bio *);
+ int (*get_unique_id)(struct mpath_device *, u8 id[16],
+ enum blk_unique_id type);
const struct mpath_pr_ops *pr_ops;
const struct attribute_group **device_groups;
};
diff --git a/lib/multipath.c b/lib/multipath.c
index 192ecd886b958..bba13b18215ee 100644
--- a/lib/multipath.c
+++ b/lib/multipath.c
@@ -496,6 +496,22 @@ static void mpath_bdev_release(struct gendisk *disk)
mpath_put_disk(mpath_disk);
}
+static int mpath_bdev_get_unique_id(struct gendisk *disk, u8 id[16],
+ enum blk_unique_id type)
+{
+ struct mpath_disk *mpath_disk = mpath_gendisk_to_disk(disk);
+ struct mpath_head *mpath_head = mpath_disk->mpath_head;
+ int srcu_idx, ret = -EWOULDBLOCK;
+ struct mpath_device *mpath_device;
+
+ srcu_idx = srcu_read_lock(&mpath_head->srcu);
+ mpath_device = mpath_find_path(mpath_head);
+ if (mpath_device)
+ ret = mpath_head->mpdt->get_unique_id(mpath_device, id, type);
+ srcu_read_unlock(&mpath_head->srcu, srcu_idx);
+
+ return ret;
+}
static int mpath_bdev_ioctl(struct block_device *bdev, blk_mode_t mode,
unsigned int cmd, unsigned long arg)
{
@@ -704,6 +720,7 @@ const struct block_device_operations mpath_ops = {
.submit_bio = mpath_bdev_submit_bio,
.ioctl = mpath_bdev_ioctl,
.compat_ioctl = blkdev_compat_ptr_ioctl,
+ .get_unique_id = mpath_bdev_get_unique_id,
.report_zones = mpath_bdev_report_zones,
.getgeo = mpath_bdev_getgeo,
.pr_ops = &mpath_pr_ops,
--
2.43.5
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 09/13] libmultipath: Add PR support
2026-02-25 15:32 ` [PATCH 09/13] libmultipath: Add PR support John Garry
@ 2026-02-25 15:49 ` Keith Busch
2026-02-25 16:52 ` John Garry
2026-02-27 18:12 ` Benjamin Marzinski
0 siblings, 2 replies; 46+ messages in thread
From: Keith Busch @ 2026-02-25 15:49 UTC (permalink / raw)
To: John Garry
Cc: hch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On Wed, Feb 25, 2026 at 03:32:21PM +0000, John Garry wrote:
> +static int mpath_pr_register(struct block_device *bdev, u64 old_key,
> + u64 new_key, unsigned int flags)
> +{
> + struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
> + struct mpath_head *mpath_head = mpath_disk->mpath_head;
> + struct mpath_device *mpath_device;
> + int srcu_idx, ret = -EWOULDBLOCK;
> +
> + srcu_idx = srcu_read_lock(&mpath_head->srcu);
> + mpath_device = mpath_find_path(mpath_head);
> + if (mpath_device)
> + ret = mpath_head->mpdt->pr_ops->pr_register(mpath_device,
> + old_key, new_key, flags);
> + srcu_read_unlock(&mpath_head->srcu, srcu_idx);
Instead of having the lower layer define new mp template functions, why
not use the existing pr_ops from mpath_device->disk->fops->pr_ops?
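A minimal userspace sketch of that idea, assuming the per-path block
device can be reached as disk->part0. The struct definitions below are
cut-down stand-ins for the kernel types, and the helper name
mpath_pr_register_via_fops() is hypothetical; in the patch the caller
would still hold the SRCU read lock around the call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct block_device;

struct pr_ops {
	int (*pr_register)(struct block_device *bdev, uint64_t old_key,
			   uint64_t new_key, uint32_t flags);
};

struct block_device_operations {
	const struct pr_ops *pr_ops;
};

struct gendisk {
	const struct block_device_operations *fops;
	struct block_device *part0;	/* whole-disk block device */
};

struct mpath_device {
	struct gendisk *disk;
};

/*
 * Hypothetical helper: forward a PR register request to the pr_ops that
 * the selected path's own gendisk already provides, instead of going
 * through a separate multipath template callback.
 */
static int mpath_pr_register_via_fops(struct mpath_device *mpath_device,
				      uint64_t old_key, uint64_t new_key,
				      uint32_t flags)
{
	const struct pr_ops *ops;

	/* No usable path: same error the patch returns in this case. */
	if (!mpath_device)
		return -EWOULDBLOCK;

	assert(mpath_device->disk && mpath_device->disk->fops);

	ops = mpath_device->disk->fops->pr_ops;
	if (!ops || !ops->pr_register)
		return -EOPNOTSUPP;

	return ops->pr_register(mpath_device->disk->part0, old_key,
				new_key, flags);
}
```

The other six PR operations would follow the same shape. This only
shows the plumbing, not whether forwarding to a single path gives
correct reservation semantics.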
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 09/13] libmultipath: Add PR support
2026-02-25 15:49 ` Keith Busch
@ 2026-02-25 16:52 ` John Garry
2026-02-27 18:12 ` Benjamin Marzinski
1 sibling, 0 replies; 46+ messages in thread
From: John Garry @ 2026-02-25 16:52 UTC (permalink / raw)
To: Keith Busch
Cc: hch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 25/02/2026 15:49, Keith Busch wrote:
> On Wed, Feb 25, 2026 at 03:32:21PM +0000, John Garry wrote:
>> +static int mpath_pr_register(struct block_device *bdev, u64 old_key,
>> + u64 new_key, unsigned int flags)
>> +{
>> + struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
>> + struct mpath_head *mpath_head = mpath_disk->mpath_head;
>> + struct mpath_device *mpath_device;
>> + int srcu_idx, ret = -EWOULDBLOCK;
>> +
>> + srcu_idx = srcu_read_lock(&mpath_head->srcu);
>> + mpath_device = mpath_find_path(mpath_head);
>> + if (mpath_device)
>> + ret = mpath_head->mpdt->pr_ops->pr_register(mpath_device,
>> + old_key, new_key, flags);
>> + srcu_read_unlock(&mpath_head->srcu, srcu_idx);
> Instead of having the lower layer define new mp template functions, why
> not use the existing pr_ops from mpath_device->disk->fops->pr_ops?
Yeah, that should be possible and I did use disk->fops elsewhere. We
would just need to find the per-path bdev. I just wasn't sure if that
was a preferred style.
cheers
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 02/13] libmultipath: Add basic gendisk support
2026-02-25 15:32 ` [PATCH 02/13] libmultipath: Add basic gendisk support John Garry
@ 2026-02-26 2:16 ` Benjamin Marzinski
2026-02-26 9:04 ` John Garry
2026-03-02 12:31 ` Nilay Shroff
2026-03-03 12:13 ` Markus Elfring
2 siblings, 1 reply; 46+ messages in thread
From: Benjamin Marzinski @ 2026-02-26 2:16 UTC (permalink / raw)
To: John Garry
Cc: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On Wed, Feb 25, 2026 at 03:32:14PM +0000, John Garry wrote:
> Add support to allocate and free a multipath gendisk.
>
> NVMe has almost like-for-like equivalents here:
> - mpath_alloc_head_disk() -> nvme_mpath_alloc_disk()
> - multipath_partition_scan_work() -> nvme_partition_scan_work()
> - mpath_remove_disk() -> nvme_remove_head()
> - mpath_device_set_live() -> nvme_mpath_set_live()
>
> struct mpath_head_template is introduced as a method for drivers to
> provide custom multipath functionality.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> +
> +void mpath_device_set_live(struct mpath_disk *mpath_disk,
> + struct mpath_device *mpath_device)
> +{
> + struct mpath_head *mpath_head = mpath_disk->mpath_head;
You're dereferencing mpath_disk here, before the check if it's NULL.
-Ben
> + int ret;
> +
> + if (!mpath_disk)
> + return;
> +
> + if (!test_and_set_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags)) {
> + dev_set_drvdata(disk_to_dev(mpath_disk->disk), mpath_disk);
> + ret = device_add_disk(mpath_disk->parent, mpath_disk->disk,
> + mpath_head->mpdt->device_groups);
> + if (ret) {
> + clear_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags);
> + return;
> + }
> + queue_work(mpath_wq, &mpath_disk->partition_scan_work);
> + }
> +}
> +EXPORT_SYMBOL_GPL(mpath_device_set_live);
> +
> struct mpath_head *mpath_alloc_head(void)
> {
> struct mpath_head *mpath_head;
> --
> 2.43.5
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-02-25 15:32 ` [PATCH 03/13] libmultipath: Add path selection support John Garry
@ 2026-02-26 3:37 ` Benjamin Marzinski
2026-02-26 9:26 ` John Garry
2026-03-02 12:36 ` Nilay Shroff
2026-03-04 13:10 ` Nilay Shroff
2 siblings, 1 reply; 46+ messages in thread
From: Benjamin Marzinski @ 2026-02-26 3:37 UTC (permalink / raw)
To: John Garry
Cc: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On Wed, Feb 25, 2026 at 03:32:15PM +0000, John Garry wrote:
> +__maybe_unused
> +static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head)
> +{
> + enum mpath_iopolicy_e iopolicy =
> + mpath_head->mpdt->get_iopolicy(mpath_head);
> +
> + switch (iopolicy) {
> + case MPATH_IOPOLICY_QD:
> + return mpath_queue_depth_path(mpath_head);
> + case MPATH_IOPOLICY_RR:
> + return mpath_round_robin_path(mpath_head, iopolicy);
> + default:
> + return mpath_numa_path(mpath_head, iopolicy);
When we're in mpath_round_robin_path() and mpath_numa_path(), we know the
iopolicy, so we don't really need to pass it in.
-Ben
> + }
> +}
> +
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 02/13] libmultipath: Add basic gendisk support
2026-02-26 2:16 ` Benjamin Marzinski
@ 2026-02-26 9:04 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-02-26 9:04 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On 26/02/2026 02:16, Benjamin Marzinski wrote:
>> struct mpath_head_template is introduced as a method for drivers to
>> provide custom multipath functionality.
>>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
>> ---
>> +
>> +void mpath_device_set_live(struct mpath_disk *mpath_disk,
>> + struct mpath_device *mpath_device)
>> +{
>> + struct mpath_head *mpath_head = mpath_disk->mpath_head;
> You're dereferencing mpath_disk here, before the check if it's NULL.
>
> -Ben
>
>> + int ret;
>> +
>> + if (!mpath_disk)
Yeah, this NULL check is not needed.
Thanks!
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-02-26 3:37 ` Benjamin Marzinski
@ 2026-02-26 9:26 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-02-26 9:26 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On 26/02/2026 03:37, Benjamin Marzinski wrote:
> On Wed, Feb 25, 2026 at 03:32:15PM +0000, John Garry wrote:
>> +__maybe_unused
>> +static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head)
>> +{
>> + enum mpath_iopolicy_e iopolicy =
>> + mpath_head->mpdt->get_iopolicy(mpath_head);
>> +
>> + switch (iopolicy) {
>> + case MPATH_IOPOLICY_QD:
>> + return mpath_queue_depth_path(mpath_head);
>> + case MPATH_IOPOLICY_RR:
>> + return mpath_round_robin_path(mpath_head, iopolicy);
>> + default:
>> + return mpath_numa_path(mpath_head, iopolicy);
> When we're in mpath_round_robin_path() and mpath_numa_path(), we know the
> iopolicy, so we don't really need to pass it in.
Sure, I don't need to pass that around.
Thanks!
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 09/13] libmultipath: Add PR support
2026-02-25 15:49 ` Keith Busch
2026-02-25 16:52 ` John Garry
@ 2026-02-27 18:12 ` Benjamin Marzinski
2026-03-02 10:45 ` John Garry
1 sibling, 1 reply; 46+ messages in thread
From: Benjamin Marzinski @ 2026-02-27 18:12 UTC (permalink / raw)
To: Keith Busch
Cc: John Garry, hch, sagi, axboe, martin.petersen, james.bottomley,
hare, jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On Wed, Feb 25, 2026 at 08:49:00AM -0700, Keith Busch wrote:
> On Wed, Feb 25, 2026 at 03:32:21PM +0000, John Garry wrote:
> > +static int mpath_pr_register(struct block_device *bdev, u64 old_key,
> > + u64 new_key, unsigned int flags)
> > +{
> > + struct mpath_disk *mpath_disk = dev_get_drvdata(&bdev->bd_device);
> > + struct mpath_head *mpath_head = mpath_disk->mpath_head;
> > + struct mpath_device *mpath_device;
> > + int srcu_idx, ret = -EWOULDBLOCK;
> > +
> > + srcu_idx = srcu_read_lock(&mpath_head->srcu);
> > + mpath_device = mpath_find_path(mpath_head);
> > + if (mpath_device)
> > + ret = mpath_head->mpdt->pr_ops->pr_register(mpath_device,
> > + old_key, new_key, flags);
> > + srcu_read_unlock(&mpath_head->srcu, srcu_idx);
>
> Instead of having the lower layer define new mp template functions, why
> not use the existing pr_ops from mpath_device->disk->fops->pr_ops?
I don't think that's the right answer. The regular scsi persistent
reservation functions simply won't work on a multipath device. Even just
a simple reservation fails.
For example (with /dev/sda being multipath device 0):
# echo round-robin > /sys/class/scsi_mpath_device/0/iopolicy
# blkpr -c register -k 0x1 /dev/sda
# blkpr -c reserve -k 0x1 -t exclusive-access-reg-only /dev/sda
# dd if=/dev/sda of=/dev/null iflag=direct count=100
dd: error reading '/dev/sda': Invalid exchange
1+0 records in
1+0 records out
512 bytes copied, 0.00871312 s, 58.8 kB/s
Here are the kernel messages:
[ 3494.660401] sd 7:0:1:0: reservation conflict
[ 3494.661802] sd 7:0:1:0: [sda:1] tag#768 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 3494.664848] sd 7:0:1:0: [sda:1] tag#768 CDB: Read(10) 28 00 00 00 00 01 00 00 01 00
[ 3494.667092] reservation conflict error, dev sda:1, sector 1 op 0x0:(READ) flags 0x2800800 phys_seg 1 prio class 2
If you don't have a multipathed scsi device to try this on, you can run:
targetcli <<EOF
/backstores/ramdisk create mptest 1G
/loopback create naa.5001401111111111
/loopback create naa.5001402222222222
/loopback create naa.5001403333333333
/loopback create naa.5001404444444444
/loopback/naa.5001401111111111/luns create /backstores/ramdisk/mptest
/loopback/naa.5001402222222222/luns create /backstores/ramdisk/mptest
/loopback/naa.5001403333333333/luns create /backstores/ramdisk/mptest
/loopback/naa.5001404444444444/luns create /backstores/ramdisk/mptest
EOF
to create one.
Handling scsi Persistent Reservations on a multipath device is painful.
Here is a non-exhaustive list of the problems with trying to make a
multipath device act like a single scsi device for persistent
reservation purposes:
You need to register the key on all the I_T Nexuses. You can't just pick
a single path. Otherwise, when you set up the reservation, you will only
be able to do IO on one of the paths. That's what happened above.
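To make that concrete, here is a toy userspace model, not kernel code.
It assumes four paths, a "registrants only" reservation, and strict
round-robin path selection; all toy_* names are made up for the
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PATHS		4
#define TOY_RESV_CONFLICT	(-1)

/* Toy LUN: key[i] is the key registered on path (I_T nexus) i, 0 = none. */
struct toy_lun {
	uint64_t key[NR_PATHS];
	bool reserved;		/* a "registrants only" reservation is held */
};

/* Register a key on one path only, as a single-path PR call would. */
static void toy_register(struct toy_lun *lun, int path, uint64_t key)
{
	assert(path >= 0 && path < NR_PATHS);
	lun->key[path] = key;
}

/* Register the same key on every path. */
static void toy_register_all(struct toy_lun *lun, uint64_t key)
{
	for (int i = 0; i < NR_PATHS; i++)
		toy_register(lun, i, key);
}

/* A read conflicts if a reservation is held and this path has no key. */
static int toy_read(const struct toy_lun *lun, int path)
{
	if (lun->reserved && lun->key[path] == 0)
		return TOY_RESV_CONFLICT;
	return 0;
}

/* Issue nr_ios reads round-robin over the paths; return the failure count. */
static int toy_round_robin_reads(const struct toy_lun *lun, int nr_ios)
{
	int failures = 0;

	for (int i = 0; i < nr_ios; i++)
		if (toy_read(lun, i % NR_PATHS))
			failures++;
	return failures;
}
```

With the key registered on path 0 only, every read that round-robin
sends to paths 1-3 conflicts. After toy_register_all() none do.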
If a path is down when you do the reservation, you might not be able to
register the key on that path. You certainly can't do it directly.
Using the All Target Ports bit (assuming the device supports it) could
let you extend a reservation from one target port to others, assuming
your path isn't down because of a connection issue on the host side. But
in general, you have to be able to handle the case where you can't
register (or unregister) a key on your failed paths. If you don't do
that (un)registration when the path comes up, before it can get selected
for handling IO, you will fail when accessing a path you should be
allowed to access, or succeed in accessing a path that you should not
be allowed to access.
The same is true when new paths are discovered. You need to register
them.
Except that a preempt can come and remove your registration at any time.
You can't register the new (or newly active) path if the key has been
preempted, and this preemption can happen at any moment, even after you
check if the other paths are still registered. If this isn't handled
correctly, paths can access storage that they should not be allowed to
access.
Changing the reservation type (for instance from
exclusive-access-reg-only to write-exclusive-reg-only) in scsi devices
is done by preempting the existing reservation. This will remove the
registered keys from every path except the one issuing the command. The
key needs to be reregistered on all the other paths again. If any IO
goes to these paths before they are reregistered, it will fail with a
reservation conflict, so IO needs to be suspended during this time.
The path that is holding the reservation might be down. In this case,
you aren't able to release the reservation from that path. The only way
I figured out to handle this in dm-mpath was for the device to preempt
its own key, to move the reservation to a working path. This causes the
same issues as preempting a key to change the reservation type, where you
need to reregister all the paths with IO suspended.
An actual preemption can come in from another machine while you are
doing this. In that case, you must not reregister the paths, and if you
already started, you must unregister them.
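A toy userspace sketch of that roll-back rule, under heavy assumptions:
it is not how dm-mpath implements it, the toy_* names are invented, and
the foreign preempt is injected at a chosen step rather than racing for
real. A real implementation also has to cope with a preempt landing
between the check and the registration, which this model does not
capture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PATHS 4

/* Toy LUN: key[i] is the key registered on path i, 0 meaning none. */
struct toy_lun {
	uint64_t key[NR_PATHS];
};

/* Remove `key` from every path; models both a foreign preempt and undo. */
static void toy_remove_key(struct toy_lun *lun, uint64_t key)
{
	for (int i = 0; i < NR_PATHS; i++)
		if (lun->key[i] == key)
			lun->key[i] = 0;
}

/*
 * Re-register `key` on every path except `holder`, which kept its
 * registration across our own preempt. `preempt_before` injects a
 * foreign preempt just before that path index is handled (-1 for none).
 *
 * Returns true if every path ends up registered, false if the key was
 * preempted, in which case any registrations already made are undone.
 */
static bool toy_reregister_paths(struct toy_lun *lun, int holder,
				 uint64_t key, int preempt_before)
{
	assert(holder >= 0 && holder < NR_PATHS);

	for (int i = 0; i < NR_PATHS; i++) {
		if (i == preempt_before)
			toy_remove_key(lun, key);	/* foreign preempt */

		/* The holder losing its key means we were preempted. */
		if (lun->key[holder] != key) {
			toy_remove_key(lun, key);	/* roll back */
			return false;
		}

		if (i != holder)
			lun->key[i] = key;
	}
	return true;
}
```

In the preempted case no path is left registered, so IO on them keeps
failing with a reservation conflict, which is the required outcome.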
I can probably come up with more issues.
I think the best course of action for now is to just fail persistent
reservations as non-supported for scsi devices. IMHO, making them work
correctly (where multipath device IO won't fail when it should succeed,
and succeed when it should fail with a reservation conflict) dwarfs the
amount of work necessary to support ALUA.
dm-mpath previously did a pretty good job handling Persistent
Reservations. But recently it became much better, because it became very
clear that pretty good is not good enough for what people want to do
with Persistent Reservations and multipath devices.
-Ben
* Re: [PATCH 08/13] libmultipath: Add sysfs helpers
2026-02-25 15:32 ` [PATCH 08/13] libmultipath: Add sysfs helpers John Garry
@ 2026-02-27 19:05 ` Benjamin Marzinski
2026-03-02 11:11 ` John Garry
0 siblings, 1 reply; 46+ messages in thread
From: Benjamin Marzinski @ 2026-02-27 19:05 UTC (permalink / raw)
To: John Garry
Cc: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On Wed, Feb 25, 2026 at 03:32:20PM +0000, John Garry wrote:
> Add helpers for driver sysfs code for the following functionality:
> - get/set iopolicy with mpath_iopolicy_store() and mpath_iopolicy_show()
> - show device path per NUMA node
> - "multipath" attribute group, equivalent to nvme_ns_mpath_attr_group
> - device groups attribute array, similar to nvme_ns_attr_groups but not
> containing NVMe members.
>
> Note that mpath_iopolicy_store() has a update callback to allow same
> functionality as nvme_subsys_iopolicy_update() be run for clearing paths.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
>
> diff --git a/lib/multipath.c b/lib/multipath.c
> index 1ce57b9b14d2e..c05b4d25ca223 100644
> --- a/lib/multipath.c
> +++ b/lib/multipath.c
> @@ -745,6 +745,116 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
> }
> EXPORT_SYMBOL_GPL(mpath_device_set_live);
>
> +static struct attribute dummy_attr = {
> + .name = "dummy",
> +};
> +
> +static struct attribute *mpath_attrs[] = {
> + &dummy_attr,
> + NULL
> +};
> +
> +static bool multipath_sysfs_group_visible(struct kobject *kobj)
> +{
> + struct device *dev = container_of(kobj, struct device, kobj);
> + struct gendisk *disk = dev_to_disk(dev);
> +
> + return is_mpath_head(disk);
> +}
> +
> +static bool multipath_sysfs_attr_visible(struct kobject *kobj,
> + struct attribute *attr, int n)
> +{
> + return false;
> +}
> +
> +DEFINE_SYSFS_GROUP_VISIBLE(multipath_sysfs)
nitpick: this could use DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE instead.
> +
> +const struct attribute_group mpath_attr_group = {
> + .name = "multipath",
> + .attrs = mpath_attrs,
> + .is_visible = SYSFS_GROUP_VISIBLE(multipath_sysfs),
> +};
> +EXPORT_SYMBOL_GPL(mpath_attr_group);
> +
> +const struct attribute_group *mpath_device_groups[] = {
> + &mpath_attr_group,
> + NULL
> +};
> +EXPORT_SYMBOL_GPL(mpath_device_groups);
> +
> +ssize_t mpath_iopolicy_show(struct mpath_iopolicy *mpath_iopolicy, char *buf)
> +{
> + return sysfs_emit(buf, "%s\n",
> + mpath_iopolicy_names[mpath_read_iopolicy(mpath_iopolicy)]);
> +}
> +EXPORT_SYMBOL_GPL(mpath_iopolicy_show);
> +
> +static void mpath_iopolicy_update(struct mpath_iopolicy *mpath_iopolicy,
> + int iopolicy, void (*update)(void *), void *data)
> +{
> + int old_iopolicy = READ_ONCE(mpath_iopolicy->iopolicy);
> +
> + if (old_iopolicy == iopolicy)
> + return;
> +
> + WRITE_ONCE(mpath_iopolicy->iopolicy, iopolicy);
> +
> + /*
> + * iopolicy changes clear the mpath by design, which @update
> + * must do.
> + */
> + update(data);
> +
> + pr_err("iopolicy changed from %s to %s\n",
> + mpath_iopolicy_names[old_iopolicy],
> + mpath_iopolicy_names[iopolicy]);
I'm not sure this warrants a pr_err().
-Ben
* Re: [PATCH 11/13] libmultipath: Add support for block device IOCTL
2026-02-25 15:32 ` [PATCH 11/13] libmultipath: Add support for block device IOCTL John Garry
@ 2026-02-27 19:52 ` Benjamin Marzinski
2026-03-02 11:19 ` John Garry
0 siblings, 1 reply; 46+ messages in thread
From: Benjamin Marzinski @ 2026-02-27 19:52 UTC (permalink / raw)
To: John Garry
Cc: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On Wed, Feb 25, 2026 at 03:32:23PM +0000, John Garry wrote:
> Add mpath_bdev_ioctl() as a multipath block device IOCTL handler. This
> handler calls into driver mpath_head_template.ioctl handler.
>
> It is expected that the .ioctl handler will unlock the SRCU read lock,
> as this is what NVMe requires - see nvme_ns_head_ctrl_ioctl(). As such,
> export a handler to unlock, mpath_head_read_unlock().
>
> The .compat_ioctl handler is given the standard handler.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> include/linux/multipath.h | 4 ++++
> lib/multipath.c | 42 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 46 insertions(+)
>
> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
> index 3846ea8cfd319..40dda6a914c5f 100644
> --- a/include/linux/multipath.h
> +++ b/include/linux/multipath.h
> @@ -72,6 +72,9 @@ struct mpath_head_template {
> bool (*is_disabled)(struct mpath_device *);
> bool (*is_optimized)(struct mpath_device *);
> enum mpath_access_state (*get_access_state)(struct mpath_device *);
> + int (*bdev_ioctl)(struct block_device *bdev, struct mpath_device *,
> + blk_mode_t mode, unsigned int cmd, unsigned long arg,
> + int srcu_idx);
I don't know that this API is going to work out. SCSI persistent
reservations need access to all the mpath_devices, not just one, and
they are commonly handled via SG_IO ioctls. Unless you want to disallow
SCSI persistent reservations via SG_IO, you need to be able to detect
them, and handle them using the persistent reservation code with the
mpath_head.
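As a rough illustration of the detection part, assuming the ioctl path can
look at the CDB carried in the SG_IO request: the two opcodes and the
service-action mask below are the standard SPC values, but the function
names are made up and nothing like this exists in the patch set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* SPC opcodes for the persistent reservation commands. */
#define PERSISTENT_RESERVE_IN	0x5e
#define PERSISTENT_RESERVE_OUT	0x5f

/* Hypothetical helper: does this CDB carry a persistent reservation command? */
bool cdb_is_persistent_reserve(const uint8_t *cdb, size_t len)
{
	if (!cdb || len < 1)
		return false;
	return cdb[0] == PERSISTENT_RESERVE_IN ||
	       cdb[0] == PERSISTENT_RESERVE_OUT;
}

/*
 * Hypothetical helper: the service action (REGISTER, RESERVE, ...) lives
 * in the low five bits of byte 1. The caller must pass a CDB of at least
 * two bytes.
 */
uint8_t cdb_pr_service_action(const uint8_t *cdb)
{
	return cdb[1] & 0x1f;
}
```

A multipath ioctl handler could use a check like this to route such
commands to the mpath_head-wide persistent reservation code instead of
passing them to a single mpath_device.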
-Ben
* Re: [PATCH 09/13] libmultipath: Add PR support
2026-02-27 18:12 ` Benjamin Marzinski
@ 2026-03-02 10:45 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-03-02 10:45 UTC (permalink / raw)
To: Benjamin Marzinski, Keith Busch
Cc: hch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On 27/02/2026 18:12, Benjamin Marzinski wrote:
>> Instead of having the lower layer define new mp template functions, why
>> not use the existing pr_ops from mpath_device->disk->fops->pr_ops?
> I don't think that's the right answer. The regular scsi persistent
> reservation functions simply won't work on a multipath device. Even just
> a simple reservation fails.
>
> For example (with /dev/sda being multipath device 0):
> # echo round-robin > /sys/class/scsi_mpath_device/0/iopolicy
> # blkpr -c register -k 0x1 /dev/sda
> # blkpr -c reserve -k 0x1 -t exclusive-access-reg-only /dev/sda
> # dd if=/dev/sda of=/dev/null iflag=direct count=100
> dd: error reading '/dev/sda': Invalid exchange
> 1+0 records in
> 1+0 records out
> 512 bytes copied, 0.00871312 s, 58.8 kB/s
>
> Here are the kernel messages:
> [ 3494.660401] sd 7:0:1:0: reservation conflict
> [ 3494.661802] sd 7:0:1:0: [sda:1] tag#768 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
> [ 3494.664848] sd 7:0:1:0: [sda:1] tag#768 CDB: Read(10) 28 00 00 00 00 01 00 00 01 00
> [ 3494.667092] reservation conflict error, dev sda:1, sector 1 op 0x0:(READ) flags 0x2800800 phys_seg 1 prio class 2
>
> If you don't have a multipathed scsi device to try this on, you can run:
>
> targetcli <<EOF
> /backstores/ramdisk create mptest 1G
> /loopback create naa.5001401111111111
> /loopback create naa.5001402222222222
> /loopback create naa.5001403333333333
> /loopback create naa.5001404444444444
> /loopback/naa.5001401111111111/luns create /backstores/ramdisk/mptest
> /loopback/naa.5001402222222222/luns create /backstores/ramdisk/mptest
> /loopback/naa.5001403333333333/luns create /backstores/ramdisk/mptest
> /loopback/naa.5001404444444444/luns create /backstores/ramdisk/mptest
> EOF
>
> to create one.
>
> Handling scsi Persistent Reservations on a multipath device is painful.
> Here is a non-exhaustive list of the problems with trying to make a
> multipath device act like a single scsi device for persistent
> reservation purposes:
>
> You need to register the key on all the I_T Nexuses. You can't just pick
> a single path. Otherwise, when you set up the reservation, you will only
> be able to do IO on one of the paths. That's what happened above.
ok, thanks for the pointer. This does not sound too difficult to
implement, but obviously it will require special handling (vs NVMe).
>
> If a path is down when you do the reservation, you might not be able to
> register the key on that path. You certainly can't do it directly.
> Using the All Target Ports bit (assuming the device supports it) could
> let you extend a reservation from one target port to others, assuming
> your path isn't down because of a connection issue on the host side. But
> in general, you have to be able to handle the case where you can't
> register (or unregister) a key on your failed paths. If you don't do
> that (un)registration when the path comes up, before it can get selected
> for handling IO, you will fail when accessing a path you should be
> allowed to access, or succeed in accessing a path that you should not
> be allowed to access.
Understood
>
> The same is true when new paths are discovered. You need to register
> them.
>
> Except that a preempt can come and remove your registration at any time.
> You can't register the new (or newly active) path if the key has been
> preempted, and this preemption can happen at any moment, even after you
> check if the other paths are still registered. If this isn't handled
> correctly, paths can access storage that they should not be allowed to
> access.
right
>
> Changing the reservation type (for instance from
> exclusive-access-reg-only to write-exclusive-reg-only) in scsi devices
> is done by preempting the existing reservation. This will remove the
> registered keys from every path except the one issuing the command. The
> key needs to be reregistered on all the other paths again. If any IO
> goes to these paths before they are reregistered, it will fail with a
> reservation conflict, so IO needs to be suspended during this time.
>
> The path that is holding the reservation might be down. In this case,
> you aren't able to release the reservation from that path. The only way
> I figured out to handle this in dm-mpath was for the device to preempt
> its own key, to move the reservation to a working path. This causes the
> same issues as preempting a key to change the reservation type, where you
> need to reregister all the paths with IO suspended.
>
> An actual preemption can come in from another machine while you are
> doing this. In that case, you must not reregister the paths, and if you
> already started, you must unregister them.
>
> I can probably come up with more issues.
This all is becoming complicated... :)
>
> I think the best course of action for now is to just fail persistent
> reservations as non-supported for scsi devices. IMHO, making them work
> correctly (where multipath device IO won't fail when it should succeed,
> and succeed when it should fail with a reservation conflict) dwarfs the
> amount of work necessary to support ALUA.
Yeah, that sounds reasonable, but I want to ensure libmultipath API does
not later change here such that it disrupts NVMe support (if indeed NVMe
goes on to use libmultipath).
>
> dm-mpath previously did a pretty good job handling Persistent
> Reservations. But recently it became much better, because it became very
> clear that pretty good is not good enough for what people want to do
> with Persistent Reservations and multipath devices.
Thanks for the feedback. I'll check these details further now.
* Re: [PATCH 08/13] libmultipath: Add sysfs helpers
2026-02-27 19:05 ` Benjamin Marzinski
@ 2026-03-02 11:11 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-03-02 11:11 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On 27/02/2026 19:05, Benjamin Marzinski wrote:
> On Wed, Feb 25, 2026 at 03:32:20PM +0000, John Garry wrote:
>> Add helpers for driver sysfs code for the following functionality:
>> - get/set iopolicy with mpath_iopolicy_store() and mpath_iopolicy_show()
>> - show device path per NUMA node
>> - "multipath" attribute group, equivalent to nvme_ns_mpath_attr_group
>> - device groups attribute array, similar to nvme_ns_attr_groups but not
>> containing NVMe members.
>>
>> Note that mpath_iopolicy_store() has a update callback to allow same
>> functionality as nvme_subsys_iopolicy_update() be run for clearing paths.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>>
>> diff --git a/lib/multipath.c b/lib/multipath.c
>> index 1ce57b9b14d2e..c05b4d25ca223 100644
>> --- a/lib/multipath.c
>> +++ b/lib/multipath.c
>> @@ -745,6 +745,116 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
>> }
>> EXPORT_SYMBOL_GPL(mpath_device_set_live);
>>
>> +static struct attribute dummy_attr = {
>> + .name = "dummy",
>> +};
>> +
>> +static struct attribute *mpath_attrs[] = {
>> + &dummy_attr,
>> + NULL
>> +};
>> +
>> +static bool multipath_sysfs_group_visible(struct kobject *kobj)
>> +{
>> + struct device *dev = container_of(kobj, struct device, kobj);
>> + struct gendisk *disk = dev_to_disk(dev);
>> +
>> + return is_mpath_head(disk);
>> +}
>> +
>> +static bool multipath_sysfs_attr_visible(struct kobject *kobj,
>> + struct attribute *attr, int n)
>> +{
>> + return false;
>> +}
>> +
>> +DEFINE_SYSFS_GROUP_VISIBLE(multipath_sysfs)
>
> nitpick: this could use DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE instead.
>
Yes, that seems reasonable. And, FWIW, I think that
multipath_sysfs_attr_visible() should return umode_t.
BTW, this is the same as the mainline NVMe code, so that could be updated first.
>> +
>> +const struct attribute_group mpath_attr_group = {
>> + .name = "multipath",
>> + .attrs = mpath_attrs,
>> + .is_visible = SYSFS_GROUP_VISIBLE(multipath_sysfs),
>> +};
>> +EXPORT_SYMBOL_GPL(mpath_attr_group);
>> +
>> +const struct attribute_group *mpath_device_groups[] = {
>> + &mpath_attr_group,
>> + NULL
>> +};
>> +EXPORT_SYMBOL_GPL(mpath_device_groups);
>> +
>> +ssize_t mpath_iopolicy_show(struct mpath_iopolicy *mpath_iopolicy, char *buf)
>> +{
>> + return sysfs_emit(buf, "%s\n",
>> + mpath_iopolicy_names[mpath_read_iopolicy(mpath_iopolicy)]);
>> +}
>> +EXPORT_SYMBOL_GPL(mpath_iopolicy_show);
>> +
>> +static void mpath_iopolicy_update(struct mpath_iopolicy *mpath_iopolicy,
>> + int iopolicy, void (*update)(void *), void *data)
>> +{
>> + int old_iopolicy = READ_ONCE(mpath_iopolicy->iopolicy);
>> +
>> + if (old_iopolicy == iopolicy)
>> + return;
>> +
>> + WRITE_ONCE(mpath_iopolicy->iopolicy, iopolicy);
>> +
>> + /*
>> + * iopolicy changes clear the mpath by design, which @update
>> + * must do.
>> + */
>> + update(data);
>> +
>> + pr_err("iopolicy changed from %s to %s\n",
>> + mpath_iopolicy_names[old_iopolicy],
>> + mpath_iopolicy_names[iopolicy]);
>
> I'm not sure this warrants a pr_err().
>
Agreed, I can downgrade this to something like pr_info() or pr_notice().
Thanks
* Re: [PATCH 11/13] libmultipath: Add support for block device IOCTL
2026-02-27 19:52 ` Benjamin Marzinski
@ 2026-03-02 11:19 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-03-02 11:19 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: hch, kbusch, sagi, axboe, martin.petersen, james.bottomley, hare,
jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
dm-devel, linux-block, linux-kernel
On 27/02/2026 19:52, Benjamin Marzinski wrote:
>> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
>> index 3846ea8cfd319..40dda6a914c5f 100644
>> --- a/include/linux/multipath.h
>> +++ b/include/linux/multipath.h
>> @@ -72,6 +72,9 @@ struct mpath_head_template {
>> bool (*is_disabled)(struct mpath_device *);
>> bool (*is_optimized)(struct mpath_device *);
>> enum mpath_access_state (*get_access_state)(struct mpath_device *);
>> + int (*bdev_ioctl)(struct block_device *bdev, struct mpath_device *,
>> + blk_mode_t mode, unsigned int cmd, unsigned long arg,
>> + int srcu_idx);
> I don't know that this API is going to work out. SCSI persistent
> reservations need access to all the mpath_devices, not just one, and
> they are commonly handled via SG_IO ioctls. Unless you want to disallow
> SCSI persistent reservations via SG_IO,
I want to make the multipathed /dev/sdX behave the same as a
non-multipathed one, so we should support it.
> you need to be able to detect
> them, and handle them using the persistent reservation code with the
> mpath_head.
Understood, I am going to check this PR handling further.
thanks!
* Re: [PATCH 01/13] libmultipath: Add initial framework
2026-02-25 15:32 ` [PATCH 01/13] libmultipath: Add initial framework John Garry
@ 2026-03-02 12:08 ` Nilay Shroff
2026-03-02 12:21 ` John Garry
0 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-02 12:08 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
Hi John,
On 2/25/26 9:02 PM, John Garry wrote:
> Add initial framework for libmultipath. libmultipath is a library for
> multipath-capable block drivers, such as NVMe. The main function is to
> support path management, path selection, and failover handling.
>
> Basic support to add and remove the head structure - mpath_head - is
> included.
>
> The main purpose of this structure is to manage available paths and path
> selection. It is quite similar to the multipath functionality in
> nvme_ns_head. However, a separate structure will be introduced later to
> manage the multipath gendisk.
>
> Each path is represented by the mpath_device structure. It should hold a
> pointer to the per-path gendisk and also a list element for all siblings
> of paths. For NVMe, there would be a mpath_device per nvme_ns.
>
> All the libmultipath code is more or less taken from
> drivers/nvme/host/multipath.c, which was originally authored by Christoph
> Hellwig <hch@lst.de>.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> include/linux/multipath.h | 28 +++++++++++++++
> lib/Kconfig | 6 ++++
> lib/Makefile | 2 ++
> lib/multipath.c | 74 +++++++++++++++++++++++++++++++++++++++
> 4 files changed, 110 insertions(+)
> create mode 100644 include/linux/multipath.h
> create mode 100644 lib/multipath.c
>
> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
> new file mode 100644
> index 0000000000000..18cd133b7ca21
> --- /dev/null
> +++ b/include/linux/multipath.h
> @@ -0,0 +1,28 @@
> +
> +#ifndef _LIBMULTIPATH_H
> +#define _LIBMULTIPATH_H
> +
> +#include <linux/blkdev.h>
> +#include <linux/srcu.h>
> +
> +struct mpath_device {
> + struct list_head siblings;
> + struct gendisk *disk;
> +};
> +
> +struct mpath_head {
> + struct srcu_struct srcu;
> + struct list_head dev_list; /* list of all mpath_devs */
> + struct mutex lock;
> +
> + struct kref ref;
> +
> + struct mpath_device __rcu *current_path[MAX_NUMNODES];
> + void *drvdata;
> +};
Can we make current_path[] the last element and a flex array (same as what
we have today under struct nvme_ns_head) so that we don't need to
allocate an array as big as MAX_NUMNODES? With a flex array we can use
num_possible_nodes() which may be much smaller than MAX_NUMNODES.
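Roughly what I have in mind, as a user-space sketch only: the struct and
function names are made up, calloc() stands in for a kernel allocation
(where something like kzalloc() with struct_size() would be the natural
fit), and nr_nodes stands in for num_possible_nodes().

```c
#include <stdlib.h>

struct mpath_device;	/* opaque here; only pointers are stored */

/* Hypothetical trimmed-down head; the real struct has more members. */
struct mpath_head_sketch {
	void *drvdata;
	unsigned int nr_nodes;
	/* Flexible array member: must be the last member of the struct. */
	struct mpath_device *current_path[];
};

/* Allocate the head plus exactly nr_nodes current_path slots, all NULL. */
struct mpath_head_sketch *mpath_head_sketch_alloc(unsigned int nr_nodes)
{
	struct mpath_head_sketch *head;

	head = calloc(1, sizeof(*head) +
			 nr_nodes * sizeof(head->current_path[0]));
	if (!head)
		return NULL;
	head->nr_nodes = nr_nodes;
	return head;
}
```

This way the per-node array costs nr_nodes pointers per head rather than
MAX_NUMNODES pointers.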
Thanks,
--Nilay
* Re: [PATCH 01/13] libmultipath: Add initial framework
2026-03-02 12:08 ` Nilay Shroff
@ 2026-03-02 12:21 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-03-02 12:21 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 02/03/2026 12:08, Nilay Shroff wrote:
>> +struct mpath_head {
>> + struct srcu_struct srcu;
>> + struct list_head dev_list; /* list of all mpath_devs */
>> + struct mutex lock;
>> +
>> + struct kref ref;
>> +
>> + struct mpath_device __rcu *current_path[MAX_NUMNODES];
>> + void *drvdata;
>> +};
>
> Can we make current_path[] the last element and a flex array (same as what
> we have today under struct nvme_ns_head) so that we don't need to
> allocate an array as big as MAX_NUMNODES? With a flex array we can use
> num_possible_nodes() which may be much smaller than MAX_NUMNODES.
Sure, I don't see a problem with that. I think that using MAX_NUMNODES
was a leftover from an earlier dev approach.
Thanks!
* Re: [PATCH 02/13] libmultipath: Add basic gendisk support
2026-02-25 15:32 ` [PATCH 02/13] libmultipath: Add basic gendisk support John Garry
2026-02-26 2:16 ` Benjamin Marzinski
@ 2026-03-02 12:31 ` Nilay Shroff
2026-03-02 15:39 ` John Garry
2026-03-03 12:13 ` Markus Elfring
2 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-02 12:31 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 2/25/26 9:02 PM, John Garry wrote:
> Add support to allocate and free a multipath gendisk.
>
> NVMe has almost like-for-like equivalents here:
> - mpath_alloc_head_disk() -> nvme_mpath_alloc_disk()
> - multipath_partition_scan_work() -> nvme_partition_scan_work()
> - mpath_remove_disk() -> nvme_remove_head()
> - mpath_device_set_live() -> nvme_mpath_set_live()
>
> struct mpath_head_template is introduced as a method for drivers to
> provide custom multipath functionality.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> include/linux/multipath.h | 41 ++++++++++++
> lib/multipath.c | 129 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 170 insertions(+)
>
> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
> index 18cd133b7ca21..be9dd9fb83345 100644
> --- a/include/linux/multipath.h
> +++ b/include/linux/multipath.h
> @@ -5,11 +5,28 @@
> #include <linux/blkdev.h>
> #include <linux/srcu.h>
>
> +extern const struct block_device_operations mpath_ops;
> +
> +struct mpath_disk {
> + struct gendisk *disk;
> + struct kref ref;
> + struct work_struct partition_scan_work;
> + struct mutex lock;
> + struct mpath_head *mpath_head;
> + struct device *parent;
> +};
> +
> struct mpath_device {
> struct list_head siblings;
> struct gendisk *disk;
> };
>
> +struct mpath_head_template {
> + const struct attribute_group **device_groups;
> +};
> +
> +#define MPATH_HEAD_DISK_LIVE 0
> +
> struct mpath_head {
> struct srcu_struct srcu;
> struct list_head dev_list; /* list of all mpath_devs */
> @@ -17,12 +34,36 @@ struct mpath_head {
>
> struct kref ref;
>
> + unsigned long flags;
> struct mpath_device __rcu *current_path[MAX_NUMNODES];
> + const struct mpath_head_template *mpdt;
> void *drvdata;
> };
>
Not sure why we don't have a back reference to struct mpath_disk
from struct mpath_head here. Does it make sense to have this?
> +static inline struct mpath_disk *mpath_bd_device_to_disk(struct device *dev)
> +{
> + return dev_get_drvdata(dev);
> +}
> +
> +static inline struct mpath_disk *mpath_gendisk_to_disk(struct gendisk *disk)
> +{
> + return mpath_bd_device_to_disk(disk_to_dev(disk));
> +}
> +
> int mpath_get_head(struct mpath_head *mpath_head);
> void mpath_put_head(struct mpath_head *mpath_head);
> struct mpath_head *mpath_alloc_head(void);
> +void mpath_put_disk(struct mpath_disk *mpath_disk);
> +void mpath_remove_disk(struct mpath_disk *mpath_disk);
> +void mpath_unregister_disk(struct mpath_disk *mpath_disk);
> +struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
> + int numa_node);
> +void mpath_device_set_live(struct mpath_disk *mpath_disk,
> + struct mpath_device *mpath_device);
> +void mpath_unregister_disk(struct mpath_disk *mpath_disk);
>
> +static inline bool is_mpath_head(struct gendisk *disk)
> +{
> + return disk->fops == &mpath_ops;
> +}
> #endif // _LIBMULTIPATH_H
> diff --git a/lib/multipath.c b/lib/multipath.c
> index 15c495675d729..88efb0ae16acb 100644
> --- a/lib/multipath.c
> +++ b/lib/multipath.c
> @@ -32,6 +32,135 @@ void mpath_put_head(struct mpath_head *mpath_head)
> }
> EXPORT_SYMBOL_GPL(mpath_put_head);
>
> +static void mpath_free_disk(struct kref *ref)
> +{
> + struct mpath_disk *mpath_disk =
> + container_of(ref, struct mpath_disk, ref);
> + struct mpath_head *mpath_head = mpath_disk->mpath_head;
> +
> + put_disk(mpath_disk->disk);
> + mpath_put_head(mpath_head);
> + kfree(mpath_disk);
> +}
> +
mpath_alloc_head_disk() doesn't take a reference on the
mpath_head object, but here, while freeing mpath_disk, we put
a reference on mpath_head. Would that create a reference
imbalance? Yes, we got a reference to mpath_head while
allocating it, but these are two disjoint operations (alloc
mpath_disk and alloc mpath_head). In that case, can't
we have both mpath_disk and mpath_head allocated under one
libmultipath API?
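Something along these lines, as a plain user-space sketch with made-up
names and simple integer refcounts standing in for kref (the live-object
counter is only there so the example can be checked):

```c
#include <stdlib.h>

int sketch_live_objects;	/* bookkeeping for the example only */

struct head_sketch {
	int ref;
};

struct disk_sketch {
	int ref;
	struct head_sketch *head;
};

void head_sketch_put(struct head_sketch *head)
{
	if (--head->ref == 0) {
		free(head);
		sketch_live_objects--;
	}
}

/*
 * One call allocates both objects. The disk owns the initial reference
 * on the head, so the put in the disk release path is visibly paired
 * with this allocation.
 */
struct disk_sketch *disk_sketch_alloc(void)
{
	struct head_sketch *head = calloc(1, sizeof(*head));
	struct disk_sketch *disk = calloc(1, sizeof(*disk));

	if (!head || !disk) {
		free(head);
		free(disk);
		return NULL;
	}
	head->ref = 1;	/* owned by the disk */
	disk->ref = 1;	/* owned by the caller */
	disk->head = head;
	sketch_live_objects += 2;
	return disk;
}

void disk_sketch_put(struct disk_sketch *disk)
{
	if (--disk->ref)
		return;
	head_sketch_put(disk->head);	/* drops the reference taken at alloc */
	free(disk);
	sketch_live_objects--;
}
```

With the get and the put living in the same pair of functions, the
reference ownership does not depend on the caller doing the two
allocations in the right order.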
Thanks,
--Nilay
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-02-25 15:32 ` [PATCH 03/13] libmultipath: Add path selection support John Garry
2026-02-26 3:37 ` Benjamin Marzinski
@ 2026-03-02 12:36 ` Nilay Shroff
2026-03-02 15:11 ` John Garry
2026-03-04 13:10 ` Nilay Shroff
2 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-02 12:36 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 2/25/26 9:02 PM, John Garry wrote:
> Add code for path selection.
>
> NVMe ANA is abstracted into enum mpath_access_state. The motivation here is
> so that SCSI ALUA can be used. Callbacks .is_disabled, .is_optimized,
> .get_access_state are added to get the path access state.
>
> Path selection modes round-robin, NUMA, and queue-depth are added, same
> as NVMe supports.
>
> NVMe has almost like-for-like equivalents here:
> - __mpath_find_path() -> __nvme_find_path()
> - mpath_find_path() -> nvme_find_path()
>
> and similar for all introduced callee functions.
>
> Functions mpath_set_iopolicy() and mpath_get_iopolicy() are added for
> setting default iopolicy.
>
> A separate mpath_iopolicy structure is introduced. There is no iopolicy
> member included in the mpath_head structure as it may not suit NVMe, where
> iopolicy is per-subsystem and not per namespace.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> include/linux/multipath.h | 36 ++++++
> lib/multipath.c | 251 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 287 insertions(+)
>
> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
> index be9dd9fb83345..c964a1aba9c42 100644
> --- a/include/linux/multipath.h
> +++ b/include/linux/multipath.h
> @@ -7,6 +7,22 @@
>
> extern const struct block_device_operations mpath_ops;
>
> +enum mpath_iopolicy_e {
> + MPATH_IOPOLICY_NUMA,
> + MPATH_IOPOLICY_RR,
> + MPATH_IOPOLICY_QD,
> +};
> +
> +struct mpath_iopolicy {
> + enum mpath_iopolicy_e iopolicy;
> +};
> +
> +enum mpath_access_state {
> + MPATH_STATE_OPTIMIZED,
> + MPATH_STATE_ACTIVE,
> + MPATH_STATE_INVALID = 0xFF
> +};
Hmm, so here we don't have MPATH_STATE_NONOPTIMIZED.
We are mapping NVME_ANA_NONOPTIMIZED to MPATH_STATE_ACTIVE.
Is it because SCSI doesn't have a (NONOPTIMIZED) state?
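To spell out the mapping as I read it, here is a small stand-alone sketch.
The ANA values are the ones from the NVMe spec and the mpath enum is copied
from the patch; the function name and the choice to fold every other ANA
state into MPATH_STATE_INVALID are my assumptions, not something the patch
states.

```c
/* ANA state values as defined by the NVMe specification. */
enum ana_state_sketch {
	ANA_OPTIMIZED		= 0x01,
	ANA_NONOPTIMIZED	= 0x02,
	ANA_INACCESSIBLE	= 0x03,
	ANA_PERSISTENT_LOSS	= 0x04,
	ANA_CHANGE		= 0x0f,
};

/* Copied from the patch under review. */
enum mpath_access_state {
	MPATH_STATE_OPTIMIZED,
	MPATH_STATE_ACTIVE,
	MPATH_STATE_INVALID = 0xFF
};

/* Hypothetical .get_access_state style mapping for an NVMe driver. */
enum mpath_access_state ana_to_mpath_state(enum ana_state_sketch ana)
{
	switch (ana) {
	case ANA_OPTIMIZED:
		return MPATH_STATE_OPTIMIZED;
	case ANA_NONOPTIMIZED:
		/* non-optimized is reported as plain "active" */
		return MPATH_STATE_ACTIVE;
	default:
		/* assumption: anything else is not usable for IO */
		return MPATH_STATE_INVALID;
	}
}
```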
> +
> struct mpath_disk {
> struct gendisk *disk;
> struct kref ref;
> @@ -18,10 +34,16 @@ struct mpath_disk {
>
> struct mpath_device {
> struct list_head siblings;
> + atomic_t nr_active;
> struct gendisk *disk;
> + int numa_node;
> };
>
I haven't seen any API which helps set nr_active or numa_node.
Do we need to have those under struct mpath_head_template?
Thanks,
--Nilay
* Re: [PATCH 04/13] libmultipath: Add bio handling
2026-02-25 15:32 ` [PATCH 04/13] libmultipath: Add bio handling John Garry
@ 2026-03-02 12:39 ` Nilay Shroff
2026-03-02 15:52 ` John Garry
0 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-02 12:39 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 2/25/26 9:02 PM, John Garry wrote:
> Add support to submit a bio per-path. In addition, for failover, add
> support to requeue a failed bio.
>
> NVMe has almost like-for-like equivalents here:
> - nvme_available_path() -> mpath_available_path()
> - nvme_requeue_work() -> mpath_requeue_work()
> - nvme_ns_head_submit_bio() -> mpath_bdev_submit_bio()
>
> For failover, a driver may want to re-submit a bio, so add support to
> clone a bio prior to submission.
>
> A bio which is submitted to a per-path device has flag REQ_MPATH set,
> same as what is done for NVMe with REQ_NVME_MPATH.
>
> Signed-off-by: John Garry<john.g.garry@oracle.com>
> ---
> include/linux/multipath.h | 15 +++++++
> lib/multipath.c | 92 ++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 106 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
> index c964a1aba9c42..d557fb9bab4c9 100644
> --- a/include/linux/multipath.h
> +++ b/include/linux/multipath.h
> @@ -3,6 +3,7 @@
> #define _LIBMULTIPATH_H
>
> #include <linux/blkdev.h>
> +#include <linux/blk-mq.h>
> #include <linux/srcu.h>
>
> extern const struct block_device_operations mpath_ops;
> @@ -40,10 +41,12 @@ struct mpath_device {
> };
>
> struct mpath_head_template {
> + bool (*available_path)(struct mpath_device *, bool *);
> bool (*is_disabled)(struct mpath_device *);
> bool (*is_optimized)(struct mpath_device *);
> enum mpath_access_state (*get_access_state)(struct mpath_device *);
> enum mpath_iopolicy_e (*get_iopolicy)(struct mpath_head *);
> + struct bio *(*clone_bio)(struct bio *);
> const struct attribute_group **device_groups;
> };
>
> @@ -56,12 +59,23 @@ struct mpath_head {
>
> struct kref ref;
>
> + struct bio_list requeue_list; /* list for requeing bio */
> + spinlock_t requeue_lock;
> + struct work_struct requeue_work; /* work struct for requeue */
> +
> unsigned long flags;
> struct mpath_device __rcu *current_path[MAX_NUMNODES];
> const struct mpath_head_template *mpdt;
> void *drvdata;
> };
>
> +#define REQ_MPATH REQ_DRV
> +
> +static inline bool is_mpath_request(struct request *req)
> +{
> + return req->cmd_flags & REQ_MPATH;
> +}
> +
> static inline struct mpath_disk *mpath_bd_device_to_disk(struct device *dev)
> {
> return dev_get_drvdata(dev);
> @@ -82,6 +96,7 @@ int mpath_set_iopolicy(const char *val, int *iopolicy);
> int mpath_get_iopolicy(char *buf, int iopolicy);
> int mpath_get_head(struct mpath_head *mpath_head);
> void mpath_put_head(struct mpath_head *mpath_head);
> +void mpath_requeue_work(struct work_struct *work);
> struct mpath_head *mpath_alloc_head(void);
> void mpath_put_disk(struct mpath_disk *mpath_disk);
> void mpath_remove_disk(struct mpath_disk *mpath_disk);
> diff --git a/lib/multipath.c b/lib/multipath.c
> index 65a0d2d2bf524..b494b35e8dccc 100644
> --- a/lib/multipath.c
> +++ b/lib/multipath.c
> @@ -5,6 +5,7 @@
> */
> #include <linux/module.h>
> #include <linux/multipath.h>
> +#include <trace/events/block.h>
>
> static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head);
>
> @@ -227,7 +228,6 @@ static struct mpath_device *mpath_numa_path(struct mpath_head *mpath_head,
> return mpath_device;
> }
>
> -__maybe_unused
> static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head)
> {
> enum mpath_iopolicy_e iopolicy =
> @@ -243,6 +243,66 @@ static struct mpath_device *mpath_find_path(struct mpath_head *mpath_head)
> }
> }
>
> +static bool mpath_available_path(struct mpath_head *mpath_head)
> +{
> + struct mpath_device *mpath_device;
> +
> + if (!test_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags))
> + return false;
> +
> + list_for_each_entry_srcu(mpath_device, &mpath_head->dev_list, siblings,
> + srcu_read_lock_held(&mpath_head->srcu)) {
> + bool available = false;
> +
> + if (!mpath_head->mpdt->available_path(mpath_device,
> + &available))
> + continue;
> + if (available)
> + return true;
> + }
> +
> + return false;
> +}
IMO, we may further simplify the callback ->available_path() to return
true or false instead of passing the result in a separate @available
argument.
Thanks,
--Nilay
* Re: [PATCH 07/13] libmultipath: Add delayed removal support
2026-02-25 15:32 ` [PATCH 07/13] libmultipath: Add delayed removal support John Garry
@ 2026-03-02 12:41 ` Nilay Shroff
2026-03-02 15:54 ` John Garry
0 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-02 12:41 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 2/25/26 9:02 PM, John Garry wrote:
> Add support for delayed removal, same as exists for NVMe.
>
> The purpose of this feature is to keep the multipath disk and cdev present
> for intermittent periods of no available path.
>
> Helpers mpath_delayed_removal_secs_show() and
> mpath_delayed_removal_secs_store() may be used in the driver sysfs code.
>
> The driver is responsible for supplying the removal work callback for
> the delayed work.
>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> include/linux/multipath.h | 17 +++++++++
> lib/multipath.c | 79 ++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
> index 0dcfdd205237c..f7998de261899 100644
> --- a/include/linux/multipath.h
> +++ b/include/linux/multipath.h
> @@ -66,6 +66,7 @@ struct mpath_head_template {
> };
>
> #define MPATH_HEAD_DISK_LIVE 0
> +#define MPATH_HEAD_QUEUE_IF_NO_PATH 1
>
> struct mpath_head {
> struct srcu_struct srcu;
> @@ -81,6 +82,10 @@ struct mpath_head {
> struct cdev cdev;
> struct device cdev_device;
>
> + struct delayed_work remove_work;
> + unsigned int delayed_removal_secs;
> + struct module *drv_module;
> +
> unsigned long flags;
> struct mpath_device __rcu *current_path[MAX_NUMNODES];
> const struct mpath_head_template *mpdt;
> @@ -132,6 +137,7 @@ void mpath_put_head(struct mpath_head *mpath_head);
> void mpath_requeue_work(struct work_struct *work);
> struct mpath_head *mpath_alloc_head(void);
> void mpath_put_disk(struct mpath_disk *mpath_disk);
> +bool mpath_can_remove_head(struct mpath_head *mpath_head);
> void mpath_remove_disk(struct mpath_disk *mpath_disk);
> void mpath_unregister_disk(struct mpath_disk *mpath_disk);
> struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
> @@ -139,6 +145,10 @@ struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
> void mpath_device_set_live(struct mpath_disk *mpath_disk,
> struct mpath_device *mpath_device);
> void mpath_unregister_disk(struct mpath_disk *mpath_disk);
> +ssize_t mpath_delayed_removal_secs_show(struct mpath_head *mpath_head,
> + char *buf);
> +ssize_t mpath_delayed_removal_secs_store(struct mpath_head *mpath_head,
> + const char *buf, size_t count);
>
> static inline bool is_mpath_head(struct gendisk *disk)
> {
> @@ -150,4 +160,11 @@ static inline bool mpath_qd_iopolicy(struct mpath_iopolicy *mpath_iopolicy)
> return mpath_read_iopolicy(mpath_iopolicy) == MPATH_IOPOLICY_QD;
> }
>
> +static inline bool mpath_head_queue_if_no_path(struct mpath_head *mpath_head)
> +{
> + if (test_bit(MPATH_HEAD_QUEUE_IF_NO_PATH, &mpath_head->flags))
> + return true;
> + return false;
> +}
> +
> #endif // _LIBMULTIPATH_H
> diff --git a/lib/multipath.c b/lib/multipath.c
> index ce12d42918fdd..1ce57b9b14d2e 100644
> --- a/lib/multipath.c
> +++ b/lib/multipath.c
> @@ -52,6 +52,7 @@ void mpath_add_device(struct mpath_head *mpath_head,
> mutex_lock(&mpath_head->lock);
> list_add_tail_rcu(&mpath_device->siblings, &mpath_head->dev_list);
> mutex_unlock(&mpath_head->lock);
> + cancel_delayed_work(&mpath_head->remove_work);
> }
> EXPORT_SYMBOL_GPL(mpath_add_device);
>
> @@ -356,7 +357,17 @@ static bool mpath_available_path(struct mpath_head *mpath_head)
> return true;
> }
>
> - return false;
> + /*
> + * If "mpath_head->delayed_removal_secs" is configured (i.e., non-zero), do
> + * not immediately fail I/O. Instead, requeue the I/O for the configured
> + * duration, anticipating that if there's a transient link failure then
> + * it may recover within this time window. This parameter is exported to
> + * userspace via sysfs, and its default value is zero. It is internally
> + * mapped to MPATH_HEAD_QUEUE_IF_NO_PATH. When delayed_removal_secs is
> + * non-zero, this flag is set to true. When zero, the flag is cleared.
> + */
> + return mpath_head_queue_if_no_path(mpath_head);
> +
> }
>
> static void mpath_bdev_submit_bio(struct bio *bio)
> @@ -614,6 +625,29 @@ static void mpath_head_del_cdev(struct mpath_head *mpath_head)
> mpath_head->mpdt->del_cdev(mpath_head);
> }
>
> +bool mpath_can_remove_head(struct mpath_head *mpath_head)
> +{
> + bool remove = false;
> +
> + mutex_lock(&mpath_head->lock);
> + /*
> + * Ensure that no one could remove this module while the head
> + * remove work is pending.
> + */
> + if (mpath_head_queue_if_no_path(mpath_head) &&
> + try_module_get(mpath_head->drv_module)) {
> +
> + mod_delayed_work(mpath_wq, &mpath_head->remove_work,
> + mpath_head->delayed_removal_secs * HZ);
> + } else {
> + remove = true;
> + }
> +
> + mutex_unlock(&mpath_head->lock);
> + return remove;
> +}
> +EXPORT_SYMBOL_GPL(mpath_can_remove_head);
> +
> void mpath_remove_disk(struct mpath_disk *mpath_disk)
> {
> struct mpath_head *mpath_head = mpath_disk->mpath_head;
> @@ -711,6 +745,47 @@ void mpath_device_set_live(struct mpath_disk *mpath_disk,
> }
> EXPORT_SYMBOL_GPL(mpath_device_set_live);
>
> +ssize_t mpath_delayed_removal_secs_show(struct mpath_head *mpath_head,
> + char *buf)
> +{
> + int ret;
> +
> + mutex_lock(&mpath_head->lock);
> + ret = sysfs_emit(buf, "%u\n", mpath_head->delayed_removal_secs);
> + mutex_unlock(&mpath_head->lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(mpath_delayed_removal_secs_show);
> +
> +ssize_t mpath_delayed_removal_secs_store(struct mpath_head *mpath_head,
> + const char *buf, size_t count)
> +{
> + ssize_t ret;
> + int sec;
> +
> + ret = kstrtouint(buf, 0, &sec);
> + if (ret < 0)
> + return ret;
> +
> + mutex_lock(&mpath_head->lock);
> + mpath_head->delayed_removal_secs = sec;
> + if (sec)
> + set_bit(MPATH_HEAD_QUEUE_IF_NO_PATH, &mpath_head->flags);
> + else
> + clear_bit(MPATH_HEAD_QUEUE_IF_NO_PATH, &mpath_head->flags);
> + mutex_unlock(&mpath_head->lock);
> +
> + /*
> + * Ensure that update to MPATH_HEAD_QUEUE_IF_NO_PATH is seen
> + * by its reader.
> + */
> + mpath_synchronize(mpath_head);
> +
> + return count;
> +}
> +EXPORT_SYMBOL_GPL(mpath_delayed_removal_secs_store);
> +
> void mpath_add_sysfs_link(struct mpath_disk *mpath_disk)
> {
> struct mpath_head *mpath_head = mpath_disk->mpath_head;
> @@ -793,6 +868,8 @@ struct mpath_head *mpath_alloc_head(void)
> mutex_init(&mpath_head->lock);
> kref_init(&mpath_head->ref);
>
> + mpath_head->delayed_removal_secs = 0;
> +
> INIT_WORK(&mpath_head->requeue_work, mpath_requeue_work);
> spin_lock_init(&mpath_head->requeue_lock);
> bio_list_init(&mpath_head->requeue_list);
I think we also need to initialize ->drv_module here.
Thanks,
--Nilay
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-03-02 12:36 ` Nilay Shroff
@ 2026-03-02 15:11 ` John Garry
2026-03-03 11:01 ` Nilay Shroff
0 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-03-02 15:11 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 02/03/2026 12:36, Nilay Shroff wrote:
> On 2/25/26 9:02 PM, John Garry wrote:
>> Add code for path selection.
>>
>> NVMe ANA is abstracted into enum mpath_access_state. The motivation
>> here is
>> so that SCSI ALUA can be used. Callbacks .is_disabled, .is_optimized,
>> .get_access_state are added to get the path access state.
>>
>> Path selection modes round-robin, NUMA, and queue-depth are added, same
>> as NVMe supports.
>>
>> NVMe has almost like-for-like equivalents here:
>> - __mpath_find_path() -> __nvme_find_path()
>> - mpath_find_path() -> nvme_find_path()
>>
>> and similar for all introduced callee functions.
>>
>> Functions mpath_set_iopolicy() and mpath_get_iopolicy() are added for
>> setting default iopolicy.
>>
>> A separate mpath_iopolicy structure is introduced. There is no iopolicy
>> member included in the mpath_head structure as it may not suit NVMe,
>> where
>> iopolicy is per-subsystem and not per namespace.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> include/linux/multipath.h | 36 ++++++
>> lib/multipath.c | 251 ++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 287 insertions(+)
>>
>> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
>> index be9dd9fb83345..c964a1aba9c42 100644
>> --- a/include/linux/multipath.h
>> +++ b/include/linux/multipath.h
>> @@ -7,6 +7,22 @@
>> extern const struct block_device_operations mpath_ops;
>> +enum mpath_iopolicy_e {
>> + MPATH_IOPOLICY_NUMA,
>> + MPATH_IOPOLICY_RR,
>> + MPATH_IOPOLICY_QD,
>> +};
>> +
>> +struct mpath_iopolicy {
>> + enum mpath_iopolicy_e iopolicy;
>> +};
>> +
>> +enum mpath_access_state {
>> + MPATH_STATE_OPTIMIZED,
>> + MPATH_STATE_ACTIVE,
>> + MPATH_STATE_INVALID = 0xFF
>> +};
> Hmm so here we don't have MPATH_STATE_NONOPTIMIZED.
> We are morphing NVME_ANA_NONOPTIMIZED as MPATH_STATE_ACTIVE.
Yes, well it is treated the same (as NVME_ANA_NONOPTIMIZED) for path
selection.
> Is it because SCSI doesn't have (NONOPTIMIZED) state?
It does have an active (and optimal) state, but I think that keeping
NVMe terminology may be better for now.
>
>> +
>> struct mpath_disk {
>> struct gendisk *disk;
>> struct kref ref;
>> @@ -18,10 +34,16 @@ struct mpath_disk {
>> struct mpath_device {
>> struct list_head siblings;
>> + atomic_t nr_active;
>> struct gendisk *disk;
>> + int numa_node;
>> };
> I haven't seen any API which help set nr_active or numa_node.
I missed setting numa_node for NVMe. About nr_active, that is set/read
by the NVMe code, like nvme_mpath_start_request(). I did try to abstract
that function into a common helper, but it just becomes a mess.
> Do we need to have those under struct mpath_head_template ?
I think that the drivers can handle these directly.
Thanks
* Re: [PATCH 02/13] libmultipath: Add basic gendisk support
2026-03-02 12:31 ` Nilay Shroff
@ 2026-03-02 15:39 ` John Garry
2026-03-03 12:39 ` Nilay Shroff
0 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-03-02 15:39 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 02/03/2026 12:31, Nilay Shroff wrote:
>>
>> +#define MPATH_HEAD_DISK_LIVE 0
>> +
>> struct mpath_head {
>> struct srcu_struct srcu;
>> struct list_head dev_list; /* list of all mpath_devs */
>> @@ -17,12 +34,36 @@ struct mpath_head {
>> struct kref ref;
>> + unsigned long flags;
>> struct mpath_device __rcu *current_path[MAX_NUMNODES];
>> + const struct mpath_head_template *mpdt;
>> void *drvdata;
>> };
> Not sure why we don't have back reference to struct mpath_disk
> from struct mpath_head here. Does it make sense to have this?
We can get away without it.
Some more background info .. so the concept of separate mpath_head and
mpath_disk is driven by SCSI, which has scsi_device and scsi_disk
classes. The scsi_disk driver (sd.c) controls the per-path gendisk and
the mpath_disk, and these internals are hidden from the scsi_core (which
controls the scsi_device). SCSI having this layered approach makes
things more complicated. This is unlike NVMe, where the core driver
controls the NS gendisk also.
>
>
>> +static inline struct mpath_disk *mpath_bd_device_to_disk(struct
>> device *dev)
>> +{
>> + return dev_get_drvdata(dev);
>> +}
>> +
>> +static inline struct mpath_disk *mpath_gendisk_to_disk(struct gendisk
>> *disk)
>> +{
>> + return mpath_bd_device_to_disk(disk_to_dev(disk));
>> +}
>> +
>> int mpath_get_head(struct mpath_head *mpath_head);
>> void mpath_put_head(struct mpath_head *mpath_head);
>> struct mpath_head *mpath_alloc_head(void);
>> +void mpath_put_disk(struct mpath_disk *mpath_disk);
>> +void mpath_remove_disk(struct mpath_disk *mpath_disk);
>> +void mpath_unregister_disk(struct mpath_disk *mpath_disk);
>> +struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
>> + int numa_node);
>> +void mpath_device_set_live(struct mpath_disk *mpath_disk,
>> + struct mpath_device *mpath_device);
>> +void mpath_unregister_disk(struct mpath_disk *mpath_disk);
>> +static inline bool is_mpath_head(struct gendisk *disk)
>> +{
>> + return disk->fops == &mpath_ops;
>> +}
>> #endif // _LIBMULTIPATH_H
>> diff --git a/lib/multipath.c b/lib/multipath.c
>> index 15c495675d729..88efb0ae16acb 100644
>> --- a/lib/multipath.c
>> +++ b/lib/multipath.c
>> @@ -32,6 +32,135 @@ void mpath_put_head(struct mpath_head *mpath_head)
>> }
>> EXPORT_SYMBOL_GPL(mpath_put_head);
>> +static void mpath_free_disk(struct kref *ref)
>> +{
>> + struct mpath_disk *mpath_disk =
>> + container_of(ref, struct mpath_disk, ref);
>> + struct mpath_head *mpath_head = mpath_disk->mpath_head;
>> +
>> + put_disk(mpath_disk->disk);
>> + mpath_put_head(mpath_head);
>> + kfree(mpath_disk);
>> +}
>> +
>
> The mpath_alloc_head_disk() doesn't get a reference to the
> mpath_head object but here while freeing mpath_disk we put
> the reference to mpath_head. Would that create a reference
> imbalance?
I think that what I have done can be improved. If you check
nvme_mpath_alloc_disk(), when we alloc the head the ref is 1, and then
we rely on the disk release to release that head reference.
> Yes we got a reference to mpath_head while
> allocating it but then these are two (alloc mpath_disk and
> alloc mpath_head) disjoint operations. In that case, can't
> we have both mpath_disk and mpath_head allocated under one
> libmultipath API?
I would like to have something simpler (like mainline NVMe code), but I
have it this way because of SCSI, as above.
Thanks
* Re: [PATCH 04/13] libmultipath: Add bio handling
2026-03-02 12:39 ` Nilay Shroff
@ 2026-03-02 15:52 ` John Garry
2026-03-03 14:00 ` Nilay Shroff
0 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-03-02 15:52 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 02/03/2026 12:39, Nilay Shroff wrote:
>> static struct mpath_device *mpath_find_path(struct mpath_head
>> *mpath_head)
>> {
>> enum mpath_iopolicy_e iopolicy =
>> @@ -243,6 +243,66 @@ static struct mpath_device
>> *mpath_find_path(struct mpath_head *mpath_head)
>> }
>> }
>> +static bool mpath_available_path(struct mpath_head *mpath_head)
>> +{
>> + struct mpath_device *mpath_device;
>> +
>> + if (!test_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags))
>> + return false;
>> +
>> + list_for_each_entry_srcu(mpath_device, &mpath_head->dev_list,
>> siblings,
>> + srcu_read_lock_held(&mpath_head->srcu)) {
>> + bool available = false;
>> +
>> + if (!mpath_head->mpdt->available_path(mpath_device,
>> + &available))
>> + continue;
>> + if (available)
>> + return true;
>> + }
>> +
>> + return false;
>> +}
>
> IMO, we may further simplify the callback ->available_path() to return
> true or false instead of passing the result in a separate @available
> argument.
I have to admit that I am not keen on this abstraction at all, as it is
purely generated to fit the current code.
Anyway, from checking mainline nvme_available_path(), we skip checking
the ctrl state if the ctrl failfast flag is set (which means
mpath_head->mpdt->available_path returns false). But I suppose the
callback could check both the ctrl flags and state (and just return a
single boolean), like:
if (failfast flag set)
return false;
if (ctrl live, resetting, connecting)
return true;
return false;
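
A minimal sketch of how such a single-boolean callback might look,
written as self-contained userspace C. The type and field names
(sketch_ctrl, failfast_expired, sketch_available_path) are hypothetical
stand-ins and not the NVMe or libmultipath definitions:

```c
#include <stdbool.h>

/* Hypothetical controller states, loosely modelled on the NVMe ones. */
enum sketch_ctrl_state {
	SKETCH_CTRL_LIVE,
	SKETCH_CTRL_RESETTING,
	SKETCH_CTRL_CONNECTING,
	SKETCH_CTRL_DELETING,
	SKETCH_CTRL_DEAD,
};

/* Hypothetical per-path controller data held by the driver. */
struct sketch_ctrl {
	bool failfast_expired;		/* stand-in for the failfast flag */
	enum sketch_ctrl_state state;
};

/*
 * Single-boolean form of the callback: the failfast check and the state
 * check are folded together, so no separate @available argument is needed.
 */
static bool sketch_available_path(const struct sketch_ctrl *ctrl)
{
	if (ctrl->failfast_expired)
		return false;

	switch (ctrl->state) {
	case SKETCH_CTRL_LIVE:
	case SKETCH_CTRL_RESETTING:
	case SKETCH_CTRL_CONNECTING:
		return true;
	default:
		return false;
	}
}
```

With a callback of this shape, the loop in mpath_available_path() could
presumably return true as soon as one path reports true.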
Thanks,
John
* Re: [PATCH 07/13] libmultipath: Add delayed removal support
2026-03-02 12:41 ` Nilay Shroff
@ 2026-03-02 15:54 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-03-02 15:54 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 02/03/2026 12:41, Nilay Shroff wrote:
>> +
>> void mpath_add_sysfs_link(struct mpath_disk *mpath_disk)
>> {
>> struct mpath_head *mpath_head = mpath_disk->mpath_head;
>> @@ -793,6 +868,8 @@ struct mpath_head *mpath_alloc_head(void)
>> mutex_init(&mpath_head->lock);
>> kref_init(&mpath_head->ref);
>> + mpath_head->delayed_removal_secs = 0;
>> +
>> INIT_WORK(&mpath_head->requeue_work, mpath_requeue_work);
>> spin_lock_init(&mpath_head->requeue_lock);
>> bio_list_init(&mpath_head->requeue_list);
>
> I think we also need to initialize ->drv_module here.
Yes, thanks!
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-03-02 15:11 ` John Garry
@ 2026-03-03 11:01 ` Nilay Shroff
2026-03-03 12:41 ` John Garry
0 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-03 11:01 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 3/2/26 8:41 PM, John Garry wrote:
> On 02/03/2026 12:36, Nilay Shroff wrote:
>> On 2/25/26 9:02 PM, John Garry wrote:
>>> Add code for path selection.
>>>
>>> NVMe ANA is abstracted into enum mpath_access_state. The motivation
>>> here is
>>> so that SCSI ALUA can be used. Callbacks .is_disabled, .is_optimized,
>>> .get_access_state are added to get the path access state.
>>>
>>> Path selection modes round-robin, NUMA, and queue-depth are added, same
>>> as NVMe supports.
>>>
>>> NVMe has almost like-for-like equivalents here:
>>> - __mpath_find_path() -> __nvme_find_path()
>>> - mpath_find_path() -> nvme_find_path()
>>>
>>> and similar for all introduced callee functions.
>>>
>>> Functions mpath_set_iopolicy() and mpath_get_iopolicy() are added for
>>> setting default iopolicy.
>>>
>>> A separate mpath_iopolicy structure is introduced. There is no iopolicy
>>> member included in the mpath_head structure as it may not suit NVMe,
>>> where
>>> iopolicy is per-subsystem and not per namespace.
>>>
>>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>>> ---
>>> include/linux/multipath.h | 36 ++++++
>>> lib/multipath.c | 251 ++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 287 insertions(+)
>>>
>>> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
>>> index be9dd9fb83345..c964a1aba9c42 100644
>>> --- a/include/linux/multipath.h
>>> +++ b/include/linux/multipath.h
>>> @@ -7,6 +7,22 @@
>>> extern const struct block_device_operations mpath_ops;
>>> +enum mpath_iopolicy_e {
>>> + MPATH_IOPOLICY_NUMA,
>>> + MPATH_IOPOLICY_RR,
>>> + MPATH_IOPOLICY_QD,
>>> +};
>>> +
>>> +struct mpath_iopolicy {
>>> + enum mpath_iopolicy_e iopolicy;
>>> +};
>>> +
>>> +enum mpath_access_state {
>>> + MPATH_STATE_OPTIMIZED,
>>> + MPATH_STATE_ACTIVE,
>>> + MPATH_STATE_INVALID = 0xFF
>>> +};
>> Hmm so here we don't have MPATH_STATE_NONOPTIMIZED.
>> We are morphing NVME_ANA_NONOPTIMIZED as MPATH_STATE_ACTIVE.
>
> Yes, well it is treated the same (as NVME_ANA_NONOPTIMIZED) for path
> selection.
>
>> Is it because SCSI doesn't have (NONOPTIMIZED) state?
>
> It does have an active (and optimal) state, but I think that keeping
> NVMe terminology may be better for now.
>
>>
>>> +
>>> struct mpath_disk {
>>> struct gendisk *disk;
>>> struct kref ref;
>>> @@ -18,10 +34,16 @@ struct mpath_disk {
>>> struct mpath_device {
>>> struct list_head siblings;
>>> + atomic_t nr_active;
>>> struct gendisk *disk;
>>> + int numa_node;
>>> };
>> I haven't seen any API which help set nr_active or numa_node.
>
> I missed setting numa_node for NVMe. About nr_active, that is set/read
> by the NVMe code, like nvme_mpath_start_request(). I did try to abstract
> that function into a common helper, but it just becomes a mess.
>
The nvme_mpath_start_request() increments ns->ctrl->nr_active, and
nvme_mpath_end_request() decrements it. This means that nr_active is
maintained per controller. If multiple NVMe namespaces are created and
attached to the same controller, their I/O activity is accumulated in
the single ctrl->nr_active counter.
In contrast, libmultipath defines nr_active in struct mpath_device,
which is referenced from struct nvme_ns. Even if we add code to update
mpath_device->nr_active, that accounting would effectively be per
namespace, not per controller.
The nr_active value is used by the queue-depth policy. Currently,
mpath_queue_depth_path() accesses mpath_device->nr_active to make
forwarding decisions. However, if mpath_device->nr_active is maintained
per namespace, it does not correctly reflect controller-wide load when
multiple namespaces share the same controller.
Therefore, instead of maintaining a separate nr_active in struct
mpath_device, it may be more appropriate for mpath_queue_depth_path() to
reference ns->ctrl->nr_active directly. In that case, nr_active could be
removed from struct mpath_device entirely.
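
The accounting difference described above can be illustrated with a
small userspace sketch. All names here (sketch_ctrl, sketch_ns,
sketch_start_request and so on) are hypothetical and only model the two
counting schemes side by side; this is not the NVMe or libmultipath
code:

```c
#include <stdatomic.h>

/* Hypothetical controller: one counter shared by all of its namespaces. */
struct sketch_ctrl {
	atomic_int nr_active;
};

/* Hypothetical per-path namespace with its own (per-namespace) counter. */
struct sketch_ns {
	struct sketch_ctrl *ctrl;
	atomic_int nr_active;
};

/* Account one in-flight request in both schemes, for comparison. */
static void sketch_start_request(struct sketch_ns *ns)
{
	atomic_fetch_add(&ns->nr_active, 1);		/* per-namespace view */
	atomic_fetch_add(&ns->ctrl->nr_active, 1);	/* per-controller view */
}

static void sketch_end_request(struct sketch_ns *ns)
{
	atomic_fetch_sub(&ns->nr_active, 1);
	atomic_fetch_sub(&ns->ctrl->nr_active, 1);
}

/* Depth a queue-depth policy would see under each scheme. */
static int sketch_depth_per_ns(struct sketch_ns *ns)
{
	return atomic_load(&ns->nr_active);
}

static int sketch_depth_per_ctrl(struct sketch_ns *ns)
{
	return atomic_load(&ns->ctrl->nr_active);
}
```

With two namespaces on one controller and three requests in flight on
the first, the per-namespace depth of the second namespace reads 0 while
the per-controller depth reads 3, which is the load a queue-depth policy
would want to see.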
Thanks,
--Nilay
* Re: [PATCH 02/13] libmultipath: Add basic gendisk support
2026-02-25 15:32 ` [PATCH 02/13] libmultipath: Add basic gendisk support John Garry
2026-02-26 2:16 ` Benjamin Marzinski
2026-03-02 12:31 ` Nilay Shroff
@ 2026-03-03 12:13 ` Markus Elfring
2 siblings, 0 replies; 46+ messages in thread
From: Markus Elfring @ 2026-03-03 12:13 UTC (permalink / raw)
To: John Garry, linux-block, linux-scsi, dm-devel, linux-nvme,
Christoph Hellwig, Hannes Reinecke, James Bottomley, Jens Axboe,
Keith Busch, Martin K. Petersen, Sagi Grimberg
Cc: LKML, Benjamin Marzinski, John Meneghini, Mike Christie,
Mike Snitzer
…
> +++ b/lib/multipath.c
…
> +static void multipath_partition_scan_work(struct work_struct *work)
> +{
…
> + mutex_lock(&mpath_disk->disk->open_mutex);
> + bdev_disk_changed(mpath_disk->disk, false);
> + mutex_unlock(&mpath_disk->disk->open_mutex);
> +}
…
Under which circumstances would you become interested to apply a statement
like “guard(mutex)(&mpath_disk->disk->open_mutex);”?
https://elixir.bootlin.com/linux/v6.19.3/source/include/linux/mutex.h#L253
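
For context, the kernel guard() helpers build on the compiler's cleanup
attribute. Below is a minimal userspace sketch of that mechanism,
assuming GCC or Clang. The toy lock (sketch_mutex) and the macro
(SKETCH_GUARD) are hypothetical illustrations, not the kernel
definitions:

```c
#include <stdbool.h>

/* Hypothetical toy lock; it only records state so the effect is observable. */
struct sketch_mutex {
	bool locked;
};

/* Take the lock and hand back the pointer stored in the guard variable. */
static struct sketch_mutex *sketch_guard_enter(struct sketch_mutex *m)
{
	m->locked = true;
	return m;
}

/* Cleanup handler: receives a pointer to the guard variable at scope exit. */
static void sketch_guard_exit(struct sketch_mutex **guard)
{
	(*guard)->locked = false;
}

/* Lock now, unlock automatically when the enclosing scope is left. */
#define SKETCH_GUARD(m) \
	struct sketch_mutex *sketch_guard_var \
		__attribute__((cleanup(sketch_guard_exit), unused)) = \
		sketch_guard_enter(m)

static struct sketch_mutex sketch_open_mutex;
static int sketch_scan_count;
static bool sketch_locked_during_scan;

/* Modelled on the quoted scan work: no explicit unlock on any return path. */
static void sketch_partition_scan(void)
{
	SKETCH_GUARD(&sketch_open_mutex);

	/* Stand-in for bdev_disk_changed(): runs with the lock held. */
	sketch_locked_during_scan = sketch_open_mutex.locked;
	sketch_scan_count++;
}
```

The benefit is largest in functions with several return paths. For a
three-line critical section like the quoted one, the explicit
lock/unlock pair is arguably just as readable.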
Regards,
Markus
* Re: [PATCH 02/13] libmultipath: Add basic gendisk support
2026-03-02 15:39 ` John Garry
@ 2026-03-03 12:39 ` Nilay Shroff
2026-03-03 12:59 ` John Garry
0 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-03 12:39 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 3/2/26 9:09 PM, John Garry wrote:
> On 02/03/2026 12:31, Nilay Shroff wrote:
>>>
>>> +#define MPATH_HEAD_DISK_LIVE 0
>>> +
>>> struct mpath_head {
>>> struct srcu_struct srcu;
>>> struct list_head dev_list; /* list of all mpath_devs */
>>> @@ -17,12 +34,36 @@ struct mpath_head {
>>> struct kref ref;
>>> + unsigned long flags;
>>> struct mpath_device __rcu *current_path[MAX_NUMNODES];
>>> + const struct mpath_head_template *mpdt;
>>> void *drvdata;
>>> };
>> Not sure why we don't have back reference to struct mpath_disk
>> from struct mpath_head here. Does it make sense to have this?
>
> We can get away without it.
>
> Some more background info .. so the concept of separate mpath_head and
> mpath_disk is driven by SCSI, which has scsi_device and scsi_disk
> classes. The scsi_disk driver (sd.c) controls the per-path gendisk and
> the mpath_disk, and these internals are hidden from the scsi_core (which
> controls the scsi_device). SCSI having this layered approach makes
> things more complicated. This is unlike NVMe, where the core driver
> controls the NS gendisk also.
>
>>
>>
>>> +static inline struct mpath_disk *mpath_bd_device_to_disk(struct
>>> device *dev)
>>> +{
>>> + return dev_get_drvdata(dev);
>>> +}
>>> +
>>> +static inline struct mpath_disk *mpath_gendisk_to_disk(struct
>>> gendisk *disk)
>>> +{
>>> + return mpath_bd_device_to_disk(disk_to_dev(disk));
>>> +}
>>> +
>>> int mpath_get_head(struct mpath_head *mpath_head);
>>> void mpath_put_head(struct mpath_head *mpath_head);
>>> struct mpath_head *mpath_alloc_head(void);
>>> +void mpath_put_disk(struct mpath_disk *mpath_disk);
>>> +void mpath_remove_disk(struct mpath_disk *mpath_disk);
>>> +void mpath_unregister_disk(struct mpath_disk *mpath_disk);
>>> +struct mpath_disk *mpath_alloc_head_disk(struct queue_limits *lim,
>>> + int numa_node);
>>> +void mpath_device_set_live(struct mpath_disk *mpath_disk,
>>> + struct mpath_device *mpath_device);
>>> +void mpath_unregister_disk(struct mpath_disk *mpath_disk);
>>> +static inline bool is_mpath_head(struct gendisk *disk)
>>> +{
>>> + return disk->fops == &mpath_ops;
>>> +}
>>> #endif // _LIBMULTIPATH_H
>>> diff --git a/lib/multipath.c b/lib/multipath.c
>>> index 15c495675d729..88efb0ae16acb 100644
>>> --- a/lib/multipath.c
>>> +++ b/lib/multipath.c
>>> @@ -32,6 +32,135 @@ void mpath_put_head(struct mpath_head *mpath_head)
>>> }
>>> EXPORT_SYMBOL_GPL(mpath_put_head);
>>> +static void mpath_free_disk(struct kref *ref)
>>> +{
>>> + struct mpath_disk *mpath_disk =
>>> + container_of(ref, struct mpath_disk, ref);
>>> + struct mpath_head *mpath_head = mpath_disk->mpath_head;
>>> +
>>> + put_disk(mpath_disk->disk);
>>> + mpath_put_head(mpath_head);
>>> + kfree(mpath_disk);
>>> +}
>>> +
>>
>> The mpath_alloc_head_disk() doesn't get a reference to the
>> mpath_head object but here while freeing mpath_disk we put
>> the reference to mpath_head. Would that create a reference
>> imbalance?
>
> I think that what I have done can be improved. If you check
> nvme_mpath_alloc_disk(), when we alloc the head the ref is 1, and then
> we rely on the disk release to release that head reference.
>
>> Yes we got a reference to mpath_head while
>> allocating it but then these are two (alloc mpath_disk and
>> alloc mpath_head) disjoint operations. In that case, can't
>> we have both mpath_disk and mpath_head allocated under one
>> libmultipath API?
>
> I would like to have something simpler (like mainline NVMe code), but I
> have it this way because of SCSI, as above.
>
I understand the intended lifetime model due to SCSI, but the current
flow is somewhat confusing.
In nvme_mpath_alloc_disk(), mpath_disk and mpath_head are allocated
separately. However, during teardown, both objects are ultimately
released through mpath_free_disk(), which drops the reference to
mpath_head via mpath_put_head().
Since the allocation of mpath_disk and mpath_head happens independently,
it is not immediately obvious why their lifetime is tied together and
why they are not freed independently when the NVMe head node is removed.
This coupling makes the ownership and reference flow harder to reason
about.
Additionally, I noticed that nvme_remove_head() has been removed in the
NVMe code that integrates with libmultipath. IMO, it might be clearer to
retain this function and make the teardown sequence explicit (after
removing mpath_put_head() from mpath_free_disk()).
For example:
nvme_remove_head():
mpath_unregister_disk(); /* removes mpath_disk and drops its ref */
mpath_put_head(); /* drops mpath_head reference */
nvme_put_ns_head(); /* drops NVMe namespace head reference */
Does the above example make sense?
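
The reference flow of that teardown order can be modelled in a small
userspace sketch. The structures and helpers below (sketch_head,
sketch_disk, sketch_remove_head and so on) are hypothetical, use plain
integers instead of kref, and set a freed flag instead of calling
kfree() so the state stays observable:

```c
#include <stdbool.h>

/* Hypothetical refcounted head object. */
struct sketch_head {
	int ref;
	bool freed;
};

/* Hypothetical refcounted disk object pointing at its head. */
struct sketch_disk {
	int ref;
	bool freed;
	struct sketch_head *head;
};

static void sketch_put_head(struct sketch_head *head)
{
	if (--head->ref == 0)
		head->freed = true;
}

/* Proposed behaviour: releasing the disk no longer puts the head. */
static void sketch_put_disk(struct sketch_disk *disk)
{
	if (--disk->ref == 0)
		disk->freed = true;
}

/*
 * Explicit teardown in the suggested order. The final NVMe ns head put
 * is left out because this sketch has no NVMe object.
 */
static void sketch_remove_head(struct sketch_disk *disk)
{
	struct sketch_head *head = disk->head;

	sketch_put_disk(disk);	/* models mpath_unregister_disk() dropping its ref */
	sketch_put_head(head);	/* models mpath_put_head() */
}
```

In this model, dropping the disk alone leaves the head alive, and the
head is released only by the explicit put in the remove path.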
Thanks,
--Nilay
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-03-03 11:01 ` Nilay Shroff
@ 2026-03-03 12:41 ` John Garry
2026-03-04 10:26 ` Nilay Shroff
0 siblings, 1 reply; 46+ messages in thread
From: John Garry @ 2026-03-03 12:41 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
>>
> The nvme_mpath_start_request() increments ns->ctrl->nr_active, and
> nvme_mpath_end_request() decrements it. This means that nr_active is
> maintained per controller. If multiple NVMe namespaces are created and
> attached to the same controller, their I/O activity is accumulated in
> the single ctrl->nr_active counter.
>
> In contrast, libmultipath defines nr_active in struct mpath_device,
> which is referenced from struct nvme_ns. Even if we add code to update
> mpath_device->nr_active, that accounting would effectively be per
> namespace, not per controller.
Right, I need to change that back to per-controller.
>
> The nr_active value is used by the queue-depth policy. Currently,
> mpath_queue_depth_path() accesses mpath_device->nr_active to make
> forwarding decisions. However, if mpath_device->nr_active is maintained
> per namespace, it does not correctly reflect controller-wide load when
> multiple namespaces share the same controller.
Yes
>
> Therefore, instead of maintaining a separate nr_active in struct
> mpath_device, it may be more appropriate for mpath_queue_depth_path() to
> reference ns->ctrl->nr_active directly. In that case, nr_active could be
> removed from struct mpath_device entirely.
>
I think so, but we will need scsi to maintain such a count internally to
support this policy. And for NVMe we will need some abstraction to
look up the per-controller QD for an mpath_device.
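
One possible shape for that abstraction is a template hook that lets
the driver report the depth for a path. The sketch below is userspace C
under stated assumptions: the get_nr_active hook and every sketch_* name
are hypothetical, and the selection loop ignores access state, which a
real queue-depth policy would also need to consider:

```c
#include <stddef.h>

struct sketch_mpath_device {
	void *drvdata;	/* driver-private per-path data */
};

/* Hypothetical template hook: the driver reports the depth for a path. */
struct sketch_head_template {
	int (*get_nr_active)(struct sketch_mpath_device *dev);
};

/* Hypothetical NVMe-like driver side: depth is kept per controller. */
struct sketch_ctrl {
	int nr_active;
};

static int sketch_nvme_get_nr_active(struct sketch_mpath_device *dev)
{
	struct sketch_ctrl *ctrl = dev->drvdata;

	return ctrl->nr_active;
}

/* Library side: pick the path with the lowest reported depth. */
static struct sketch_mpath_device *
sketch_queue_depth_path(const struct sketch_head_template *t,
			struct sketch_mpath_device *devs, size_t n)
{
	struct sketch_mpath_device *best = NULL;
	int best_depth = 0;

	for (size_t i = 0; i < n; i++) {
		int depth = t->get_nr_active(&devs[i]);

		if (!best || depth < best_depth) {
			best = &devs[i];
			best_depth = depth;
		}
	}
	return best;
}
```

A SCSI driver could supply its own hook backed by whatever count it
maintains internally, so the library would not need to know where the
counter lives.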
Thanks for checking!
* Re: [PATCH 02/13] libmultipath: Add basic gendisk support
2026-03-03 12:39 ` Nilay Shroff
@ 2026-03-03 12:59 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-03-03 12:59 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 03/03/2026 12:39, Nilay Shroff wrote:
>>
>>> Yes we got a reference to mpath_head while
>>> allocating it but then these are two (alloc mpath_disk and
>>> alloc mpath_head) disjoint operations. In that case, can't
>>> we have both mpath_disk and mpath_head allocated under one
>>> libmultipath API?
>>
>> I would like to have something simpler (like mainline NVMe code), but
>> I have it this way because of SCSI, as above.
>>
> I understand the intended lifetime model due to SCSI, but the current
> flow is somewhat confusing.
>
> In nvme_mpath_alloc_disk(), mpath_disk and mpath_head are allocated
> separately. However, during teardown, both objects are ultimately
> released through mpath_free_disk(), which drops the reference to
> mpath_head via mpath_put_head().
>
> Since the allocation of mpath_disk and mpath_head happens independently,
> it is not immediately obvious why their lifetime is tied together and
> why they are not freed independently when the NVMe head node is removed.
> This coupling makes the ownership and reference flow harder to reason
> about.
Yes, and also having 2x separate structures bloats the code, as we are
continuously looking up one from another and so on. Only having a
mpath_head for a nvme_ns_head would be nice - I'll look at this approach
(again).
>
> Additionally, I noticed that nvme_remove_head() has been removed in the
> NVMe code that integrates with libmultipath. IMO, it might be clearer to
> retain this function and make the teardown sequence explicit (after
> removing mpath_put_head() from mpath_free_disk()).
> For example:
>
> nvme_remove_head():
> mpath_unregister_disk(); /* removes mpath_disk and drops its ref */
> mpath_put_head(); /* drops mpath_head reference */
> nvme_put_ns_head(); /* drops NVMe namespace head reference */
>
> Does the above example make sense?
Yeah, something like that would be simpler. I just need to make it work
for scsi :)
My current implementation has it that the scsi_device manages the
mpath_head and the scsi_disk manages the mpath_disk.
Supporting a single structure for scsi is more complicated, as the
lifetime of the scsi_disk is not tied to that of the scsi_device, i.e.
we can do something like this:
echo "8:0:0:0" > /sys/bus/scsi/drivers/sd/unbind
Which removes the scsi_disk, but the scsi_device remains. So, for
multipath support, doing this would mean that the mpath_disk would be
removed (if it is the last path), but not the mpath_head.
Furthermore, the scsi_disk info is private to sd.c, so this would mean
that the mpath_disk could/should also be private. My point is that scsi
makes things more complicated.
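To make the two lifetimes concrete, here is a minimal user-space sketch of
the model described above, assuming the disk object holds its own reference
on the head. The helper names and the plain integer refcount are
hypothetical (the kernel would use a kref); this is not the code from the
series:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins, not the structures from the series. */
struct mpath_head {
	int refs;			/* plain counter for illustration */
};

struct mpath_disk {
	struct mpath_head *head;	/* pinned while the disk exists */
};

/* Allocate a head holding one reference for its owner (scsi_device-like). */
static struct mpath_head *head_alloc(void)
{
	struct mpath_head *head = calloc(1, sizeof(*head));

	if (head)
		head->refs = 1;
	return head;
}

/* Drop one reference; returns 1 if this freed the head, 0 otherwise. */
static int head_put(struct mpath_head *head)
{
	assert(head->refs > 0);
	if (--head->refs)
		return 0;
	free(head);
	return 1;
}

/* Allocate a disk (scsi_disk-like owner) that takes its own head reference. */
static struct mpath_disk *disk_alloc(struct mpath_head *head)
{
	struct mpath_disk *disk = calloc(1, sizeof(*disk));

	if (disk) {
		head->refs++;
		disk->head = head;
	}
	return disk;
}

/* Free the disk and drop its head reference; returns 1 if the head went too. */
static int disk_free(struct mpath_disk *disk)
{
	struct mpath_head *head = disk->head;

	free(disk);
	return head_put(head);
}
```

In the sd unbind case, disk_free() runs while the device-side reference is
still held, so the head survives until its owner calls head_put().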
Thanks for checking!
* Re: [PATCH 04/13] libmultipath: Add bio handling
2026-03-02 15:52 ` John Garry
@ 2026-03-03 14:00 ` Nilay Shroff
0 siblings, 0 replies; 46+ messages in thread
From: Nilay Shroff @ 2026-03-03 14:00 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 3/2/26 9:22 PM, John Garry wrote:
> On 02/03/2026 12:39, Nilay Shroff wrote:
>>> static struct mpath_device *mpath_find_path(struct mpath_head
>>> *mpath_head)
>>> {
>>> enum mpath_iopolicy_e iopolicy =
>>> @@ -243,6 +243,66 @@ static struct mpath_device
>>> *mpath_find_path(struct mpath_head *mpath_head)
>>> }
>>> }
>>> +static bool mpath_available_path(struct mpath_head *mpath_head)
>>> +{
>>> + struct mpath_device *mpath_device;
>>> +
>>> + if (!test_bit(MPATH_HEAD_DISK_LIVE, &mpath_head->flags))
>>> + return false;
>>> +
>>> + list_for_each_entry_srcu(mpath_device, &mpath_head->dev_list,
>>> siblings,
>>> + srcu_read_lock_held(&mpath_head->srcu)) {
>>> + bool available = false;
>>> +
>>> + if (!mpath_head->mpdt->available_path(mpath_device,
>>> + &available))
>>> + continue;
>>> + if (available)
>>> + return true;
>>> + }
>>> +
>>> + return false;
>>> +}
>>
>> IMO, we may further simplify the callback ->available_path() to return
>> true or false instead of passing the result in a separate @available
>> argument.
>
> I have to admit that I am not keen on this abstraction at all, as it is
> purely generated to fit the current code.
>
> Anyway, from checking mainline nvme_available_path(), we skip checking
> the ctrl state if the ctrl failfast flag is set (which means
> mpath_head->mpdt->available_path returns false). But I suppose the
> callback could check both the ctrl flags and state (and just return a
> single boolean), like:
>
> if (failfast flag set)
> return false;
> if (ctrl live, resetting, connecting)
> return true;
> return false;
>
Yes, I think the above logic makes sense now that the ->dev_list (or ns
sibling) iteration is handled within the libmultipath code. We should plan
to simplify nvme_available_path() as per the above pseudocode.
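Spelled out as compilable C, the pseudocode could look roughly like the
sketch below. This is a user-space illustration only: the ctrl_state enum,
the failfast field and both function names are hypothetical stand-ins for
the NVMe controller state and flag checks, and a plain array replaces the
SRCU-protected ->dev_list:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical controller states, loosely modelled on the NVMe ones. */
enum ctrl_state {
	CTRL_NEW,
	CTRL_LIVE,
	CTRL_RESETTING,
	CTRL_CONNECTING,
	CTRL_DELETING,
	CTRL_DEAD,
};

struct ctrl {
	enum ctrl_state state;
	bool failfast;		/* stand-in for the ctrl failfast flag */
};

struct mpath_device {
	struct ctrl *ctrl;
};

/* Driver callback: one boolean result, no separate output argument. */
static bool example_available_path(const struct mpath_device *dev)
{
	const struct ctrl *ctrl = dev->ctrl;

	if (ctrl->failfast)
		return false;

	switch (ctrl->state) {
	case CTRL_LIVE:
	case CTRL_RESETTING:
	case CTRL_CONNECTING:
		return true;
	default:
		return false;
	}
}

/* Library side: true as soon as any path reports itself available. */
static bool any_available_path(const struct mpath_device *devs, size_t n)
{
	assert(devs || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (example_available_path(&devs[i]))
			return true;
	}
	return false;
}
```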
Thanks,
--Nilay
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-03-03 12:41 ` John Garry
@ 2026-03-04 10:26 ` Nilay Shroff
2026-03-04 11:09 ` John Garry
0 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-04 10:26 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 3/3/26 6:11 PM, John Garry wrote:
>>>
>> The nvme_mpath_start_request() increments ns->ctrl->nr_active, and
>> nvme_mpath_end_request() decrements it. This means that nr_active is
>> maintained per controller. If multiple NVMe namespaces are created and
>> attached to the same controller, their I/O activity is accumulated in
>> the single ctrl->nr_active counter.
>>
>> In contrast, libmultipath defines nr_active in struct mpath_device,
>> which is referenced from struct nvme_ns. Even if we add code to update
>> mpath_device->nr_active, that accounting would effectively be per
>> namespace, not per controller.
>
> Right, I need to change that back to per-controller.
>
>>
>> The nr_active value is used by the queue-depth policy. Currently,
>> mpath_queue_depth_path() accesses mpath_device->nr_active to make
>> forwarding decisions. However, if mpath_device->nr_active is
>> maintained per namespace, it does not correctly reflect
>> controller-wide load when multiple namespaces share the same controller.
>
> Yes
>
>>
>> Therefore, instead of maintaining a separate nr_active in struct
>> mpath_device, it may be more appropriate for mpath_queue_depth_path()
>> to reference ns->ctrl->nr_active directly. In that case, nr_active
>> could be removed from struct mpath_device entirely.
>>
>
> I think so, but we will need SCSI to maintain such a count internally to
> support this policy. And for NVMe we will need some abstraction to
> look up the per-controller QD for an mpath_device.
>
This raises another question regarding the current framework. From what
I can see, all NVMe multipath I/O policies are currently supported for
SCSI as well. Going forward, if we introduce a new I/O policy for NVMe
that does not make sense for SCSI, how can we ensure that the new policy
is supported only for NVMe and not for SCSI? Conversely, we may also
want to introduce a policy that is relevant only for SCSI but not for NVMe.
With the current framework, it seems difficult to restrict a policy to a
specific transport. It appears that all policies are implicitly shared
between NVMe and SCSI.
Would it make sense to introduce some abstraction for I/O policies in
the framework so that a given policy can be implemented and exposed only
for the relevant transport (e.g., NVMe-only or SCSI-only), rather than
requiring it to be supported by both?
Thanks,
--Nilay
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-03-04 10:26 ` Nilay Shroff
@ 2026-03-04 11:09 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-03-04 11:09 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 04/03/2026 10:26, Nilay Shroff wrote:
>>
>> I think so, but we will need SCSI to maintain such a count internally
>> to support this policy. And for NVMe we will need some abstraction to
>> look up the per-controller QD for an mpath_device.
>>
> This raises another question regarding the current framework. From what
> I can see, all NVMe multipath I/O policies are currently supported for
> SCSI as well. Going forward, if we introduce a new I/O policy for NVMe
> that does not make sense for SCSI, how can we ensure that the new policy
> is supported only for NVMe and not for SCSI? Conversely, we may also
> want to introduce a policy that is relevant only for SCSI but not for NVMe.
>
> With the current framework, it seems difficult to restrict a policy to a
> specific transport. It appears that all policies are implicitly shared
> between NVMe and SCSI.
>
> Would it make sense to introduce some abstraction for I/O policies in
> the framework so that a given policy can be implemented and exposed only
> for the relevant transport (e.g., NVMe-only or SCSI-only), rather than
> requiring it to be supported by both?
I think that we can cross that bridge if it ever happens. It should not
be too difficult to allow a driver to specify which policies are
supported/unsupported and the lib can take care of management of that.
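One possible shape for that, purely as a sketch: the driver template
carries a bitmask of the policies it supports, and the library rejects
anything outside the mask when the policy is set. Every name below
(ex_iopolicy, supported_policies, ex_set_iopolicy) is hypothetical and not
part of the series:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical policy list; the real enum in the series may differ. */
enum ex_iopolicy {
	EX_IOPOLICY_NUMA,
	EX_IOPOLICY_RR,
	EX_IOPOLICY_QD,
	EX_IOPOLICY_COUNT,
};

#define EX_POLICY_BIT(p)	(1u << (p))

/* Hypothetical per-driver template: a mask of the policies it supports. */
struct ex_template {
	unsigned int supported_policies;
};

struct ex_head {
	const struct ex_template *tmpl;
	enum ex_iopolicy iopolicy;
};

static bool ex_policy_supported(const struct ex_template *t,
				enum ex_iopolicy p)
{
	return (unsigned int)p < (unsigned int)EX_IOPOLICY_COUNT &&
	       (t->supported_policies & EX_POLICY_BIT(p));
}

/* Library-side setter: refuse policies the driver did not opt in to. */
static int ex_set_iopolicy(struct ex_head *head, enum ex_iopolicy p)
{
	assert(head && head->tmpl);

	if (!ex_policy_supported(head->tmpl, p))
		return -EINVAL;
	head->iopolicy = p;
	return 0;
}
```

A sysfs store handler could use the same check to hide or reject policies
on a per-transport basis.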
Thanks,
John
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-02-25 15:32 ` [PATCH 03/13] libmultipath: Add path selection support John Garry
2026-02-26 3:37 ` Benjamin Marzinski
2026-03-02 12:36 ` Nilay Shroff
@ 2026-03-04 13:10 ` Nilay Shroff
2026-03-04 14:38 ` John Garry
2 siblings, 1 reply; 46+ messages in thread
From: Nilay Shroff @ 2026-03-04 13:10 UTC (permalink / raw)
To: John Garry, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
Hi John,
On 2/25/26 9:02 PM, John Garry wrote:
> +static struct mpath_device *__mpath_find_path(struct mpath_head *mpath_head,
> + enum mpath_iopolicy_e iopolicy, int node)
> +{
> + int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
> + struct mpath_device *mpath_dev_found, *mpath_dev_fallback,
> + *mpath_device;
> +
I think we should initialize mpath_dev_found and mpath_dev_fallback to
NULL. Otherwise this may lead to storing a junk mpath_device pointer
in ->current_path[node] when mpath_head->dev_list is empty. This may
manifest in particular when a controller is being shut down while I/O
is concurrently forwarded to the same controller.
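A reduced user-space sketch of the selection loop shows why the
initialization matters: with an empty list the loop body never runs, so the
returned pointer is whatever the locals started as. The struct layout and
the ex_find_path() name are simplified, hypothetical stand-ins for
__mpath_find_path():

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified path descriptor. */
struct mpath_device {
	int distance;		/* e.g. NUMA distance to the path */
	bool optimized;		/* preferred path vs. fallback path */
};

static struct mpath_device *ex_find_path(struct mpath_device *devs, size_t n)
{
	int found_distance = INT_MAX, fallback_distance = INT_MAX;
	/* Both start as NULL so that an empty list yields NULL, not junk. */
	struct mpath_device *found = NULL, *fallback = NULL;

	assert(devs || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct mpath_device *dev = &devs[i];

		if (dev->optimized) {
			if (dev->distance < found_distance) {
				found_distance = dev->distance;
				found = dev;
			}
		} else if (dev->distance < fallback_distance) {
			fallback_distance = dev->distance;
			fallback = dev;
		}
	}

	if (!found)
		found = fallback;
	return found;	/* the caller would cache this in current_path[node] */
}
```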
Thanks,
--Nilay
* Re: [PATCH 03/13] libmultipath: Add path selection support
2026-03-04 13:10 ` Nilay Shroff
@ 2026-03-04 14:38 ` John Garry
0 siblings, 0 replies; 46+ messages in thread
From: John Garry @ 2026-03-04 14:38 UTC (permalink / raw)
To: Nilay Shroff, hch, kbusch, sagi, axboe, martin.petersen,
james.bottomley, hare
Cc: jmeneghi, linux-nvme, linux-scsi, michael.christie, snitzer,
bmarzins, dm-devel, linux-block, linux-kernel
On 04/03/2026 13:10, Nilay Shroff wrote:
> On 2/25/26 9:02 PM, John Garry wrote:
>> +static struct mpath_device *__mpath_find_path(struct mpath_head
>> *mpath_head,
>> + enum mpath_iopolicy_e iopolicy, int node)
>> +{
>> + int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
>> + struct mpath_device *mpath_dev_found, *mpath_dev_fallback,
>> + *mpath_device;
>> +
>
> I think we should initialize mpath_dev_found and mpath_dev_fallback to
> NULL. Otherwise this may lead to storing a junk mpath_device pointer
> in ->current_path[node] when mpath_head->dev_list is empty. This may
> manifest in particular when a controller is being shut down while I/O
> is concurrently forwarded to the same controller.
Right, I see that we were doing the equivalent in __nvme_find_path().
Will fix.
Thanks for the notice.
Thread overview: 46+ messages
2026-02-25 15:32 [PATCH 00/13] libmultipath: a generic multipath lib for block drivers John Garry
2026-02-25 15:32 ` [PATCH 01/13] libmultipath: Add initial framework John Garry
2026-03-02 12:08 ` Nilay Shroff
2026-03-02 12:21 ` John Garry
2026-02-25 15:32 ` [PATCH 02/13] libmultipath: Add basic gendisk support John Garry
2026-02-26 2:16 ` Benjamin Marzinski
2026-02-26 9:04 ` John Garry
2026-03-02 12:31 ` Nilay Shroff
2026-03-02 15:39 ` John Garry
2026-03-03 12:39 ` Nilay Shroff
2026-03-03 12:59 ` John Garry
2026-03-03 12:13 ` Markus Elfring
2026-02-25 15:32 ` [PATCH 03/13] libmultipath: Add path selection support John Garry
2026-02-26 3:37 ` Benjamin Marzinski
2026-02-26 9:26 ` John Garry
2026-03-02 12:36 ` Nilay Shroff
2026-03-02 15:11 ` John Garry
2026-03-03 11:01 ` Nilay Shroff
2026-03-03 12:41 ` John Garry
2026-03-04 10:26 ` Nilay Shroff
2026-03-04 11:09 ` John Garry
2026-03-04 13:10 ` Nilay Shroff
2026-03-04 14:38 ` John Garry
2026-02-25 15:32 ` [PATCH 04/13] libmultipath: Add bio handling John Garry
2026-03-02 12:39 ` Nilay Shroff
2026-03-02 15:52 ` John Garry
2026-03-03 14:00 ` Nilay Shroff
2026-02-25 15:32 ` [PATCH 05/13] libmultipath: Add support for mpath_device management John Garry
2026-02-25 15:32 ` [PATCH 06/13] libmultipath: Add cdev support John Garry
2026-02-25 15:32 ` [PATCH 07/13] libmultipath: Add delayed removal support John Garry
2026-03-02 12:41 ` Nilay Shroff
2026-03-02 15:54 ` John Garry
2026-02-25 15:32 ` [PATCH 08/13] libmultipath: Add sysfs helpers John Garry
2026-02-27 19:05 ` Benjamin Marzinski
2026-03-02 11:11 ` John Garry
2026-02-25 15:32 ` [PATCH 09/13] libmultipath: Add PR support John Garry
2026-02-25 15:49 ` Keith Busch
2026-02-25 16:52 ` John Garry
2026-02-27 18:12 ` Benjamin Marzinski
2026-03-02 10:45 ` John Garry
2026-02-25 15:32 ` [PATCH 10/13] libmultipath: Add mpath_bdev_report_zones() John Garry
2026-02-25 15:32 ` [PATCH 11/13] libmultipath: Add support for block device IOCTL John Garry
2026-02-27 19:52 ` Benjamin Marzinski
2026-03-02 11:19 ` John Garry
2026-02-25 15:32 ` [PATCH 12/13] libmultipath: Add mpath_bdev_getgeo() John Garry
2026-02-25 15:32 ` [PATCH 13/13] libmultipath: Add mpath_bdev_get_unique_id() John Garry