Linux SCSI subsystem development
* [RFC v1 0/8] scsi: Multipath support for scsi disk devices.
@ 2024-11-09  4:45 himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 1/8] scsi: Add multipath device support himanshu.madhani
                   ` (9 more replies)
  0 siblings, 10 replies; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Hello Folks,

Here is a very early RFC for multipath support in the SCSI layer. This patch series
implements native multipath support for SCSI disk devices.

In this series, I am providing conceptual changes which still need work. However,
I wanted to get this RFC out to get community feedback on the direction of the changes.

This RFC follows the NVMe multipath implementation closely for SCSI multipath. Currently,
SCSI multipath only supports disk devices which advertise ALUA (Asymmetric Logical
Unit Access) capability in the Inquiry response data.
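
As an illustration of the ALUA check described above, here is a minimal userspace
sketch. It is not code from this series. It assumes the standard INQUIRY layout,
where the TPGS (Target Port Group Support) field occupies bits 5:4 of byte 5. The
helper names inquiry_tpgs() and inquiry_supports_alua() are hypothetical; in the
kernel, scsi_device_tpgs() performs the equivalent extraction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* TPGS field values from the standard INQUIRY data (byte 5, bits 5:4). */
enum tpgs_mode {
	TPGS_NONE     = 0x0,	/* no asymmetric access support */
	TPGS_IMPLICIT = 0x1,	/* target changes access states on its own */
	TPGS_EXPLICIT = 0x2,	/* host may request access state changes */
	TPGS_BOTH     = 0x3,	/* implicit and explicit */
};

/* Hypothetical helper: extract TPGS from a raw standard INQUIRY buffer. */
static enum tpgs_mode inquiry_tpgs(const uint8_t *inq, size_t len)
{
	assert(inq != NULL);
	if (len < 6)		/* byte 5 is not present in the buffer */
		return TPGS_NONE;
	return (enum tpgs_mode)((inq[5] >> 4) & 0x3);
}

/* Hypothetical helper: a device qualifies only if it advertises ALUA. */
static bool inquiry_supports_alua(const uint8_t *inq, size_t len)
{
	return inquiry_tpgs(inq, len) != TPGS_NONE;
}
```

A device reporting TPGS_NONE would be left as a regular, single-path sd device
under this assumption.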

The patches are split as follows:

Patch 1: Adds new SCSI multipath files and Makefile changes for enabling multipath support.
Patch 2: Adds changes to the scsi_host structure for multipath support.
Patch 3: Adds error handling capability to the multipath changes.
Patch 4: Wires up the completion path for the request.
Patch 5: Adds sysfs hooks for displaying iopolicy and state.
Patch 6: Adds changes to use the ALUA handler for multipath.
Patch 7: Adds changes in the sd driver for multipath.
Patch 8: Adds changes to the scsi_debug driver for ALUA testing.

Here is the list of TO-DO items that will be addressed in the next RFC version:

1. Clean up the sysfs directory structure and only show the first multipath device.
2. Test failover scenarios with multiple disks while injecting errors during I/O.
3. Test updating the iopolicy while running I/O and make sure path failover happens.
4. Clean up the ALUA code to integrate more closely with the new multipath code.
5. Collect performance numbers for the multipath disks.
6. PR ops are not yet handled by this series and will be added in the next RFC.

Thanks,
Himanshu

Himanshu Madhani (8):
  scsi: Add multipath device support
  scsi: create multipath capable scsi host
  scsi: Add error handling capability for multipath
  scsi: Complete multipath request
  scsi: Add scsi multipath sysfs hooks
  scsi: Add multipath suppport for device handler
  scsi: Add multipath disk init code for sd driver
  scsi_debug: Add module parameter for ALUA multipath

 drivers/scsi/Kconfig                       |  12 +
 drivers/scsi/Makefile                      |   2 +
 drivers/scsi/device_handler/scsi_dh_alua.c |  15 +
 drivers/scsi/hosts.c                       |  12 +
 drivers/scsi/scsi_debug.c                  |  16 +-
 drivers/scsi/scsi_dh.c                     |   3 +
 drivers/scsi/scsi_error.c                  |   8 +
 drivers/scsi/scsi_lib.c                    |  25 +
 drivers/scsi/scsi_multipath.c              | 896 +++++++++++++++++++++
 drivers/scsi/scsi_sysfs.c                  | 104 +++
 drivers/scsi/sd.c                          |  83 ++
 include/scsi/scsi.h                        |   1 +
 include/scsi/scsi_device.h                 |  64 ++
 include/scsi/scsi_host.h                   |   7 +
 include/scsi/scsi_multipath.h              |  86 ++
 15 files changed, 1332 insertions(+), 2 deletions(-)
 create mode 100644 drivers/scsi/scsi_multipath.c
 create mode 100644 include/scsi/scsi_multipath.h


base-commit: 128faa1845a2d5b0178b986f3bd18fb38cc08cc2
-- 
2.41.0.rc2


* [RFC v1 1/8] scsi: Add multipath device support
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
@ 2024-11-09  4:45 ` himanshu.madhani
  2024-11-12 21:09   ` Bart Van Assche
  2024-11-09  4:45 ` [RFC v1 2/8] scsi: create multipath capable scsi host himanshu.madhani
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

- Add multipath device support to scsi_device
- Add multipath support to scsi_host
- Add Kconfig and Makefile
- Create new scsi_multipath.[ch] files

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
---
 drivers/scsi/Kconfig          |  12 +
 drivers/scsi/Makefile         |   2 +
 drivers/scsi/scsi_multipath.c | 896 ++++++++++++++++++++++++++++++++++
 include/scsi/scsi_device.h    |  64 +++
 include/scsi/scsi_host.h      |   7 +
 include/scsi/scsi_multipath.h |  86 ++++
 6 files changed, 1067 insertions(+)
 create mode 100644 drivers/scsi/scsi_multipath.c
 create mode 100644 include/scsi/scsi_multipath.h

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 37c24ffea65c..d1298fac774c 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -76,6 +76,18 @@ config SCSI_LIB_KUNIT_TEST
 
 	  If unsure say N.
 
+config SCSI_MULTIPATH
+	bool "SCSI multipath support"
+	depends on SCSI
+	depends on SCSI_DH && SCSI_DH_ALUA
+	help
+	  This option enables native SCSI multipath support for SCSI
+	  hosts. It requires Asymmetric Logical Unit Access (ALUA) support
+	  to be enabled on the device. If this option is enabled, a single
+	  /dev/mpathXsdY device will show up for each SCSI host.
+
+	  If unsure say N.
+
 comment "SCSI support type (disk, tape, CD-ROM)"
 	depends on SCSI
 
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 1313ddf2fd1a..017795bc224d 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -154,6 +154,8 @@ obj-$(CONFIG_SCSI_ENCLOSURE)	+= ses.o
 
 obj-$(CONFIG_SCSI_HISI_SAS) += hisi_sas/
 
+obj-$(CONFIG_SCSI_MULTIPATH) += scsi_multipath.o
+
 # This goes last, so that "real" scsi devices probe earlier
 obj-$(CONFIG_SCSI_DEBUG)	+= scsi_debug.o
 scsi_mod-y			+= scsi.o hosts.o scsi_ioctl.o \
diff --git a/drivers/scsi/scsi_multipath.c b/drivers/scsi/scsi_multipath.c
new file mode 100644
index 000000000000..45684704b9e2
--- /dev/null
+++ b/drivers/scsi/scsi_multipath.c
@@ -0,0 +1,896 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2024 Himanshu Madhani
+ *
+ * SCSI Multipath support using ALUA (Asymmetric Logical Unit Access)
+ * capable devices.
+ */
+
+#include <linux/bio.h>
+#include <linux/moduleparam.h>
+#include <linux/topology.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_dh.h>
+#include <scsi/scsi_proto.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi_multipath.h>
+
+bool scsi_multipath = true;
+module_param(scsi_multipath, bool, 0444);
+MODULE_PARM_DESC(scsi_multipath,
+    "turn on native multipath support for SCSI devices; "
+    "set this value to false to disable multipath");
+
+static const char *scsi_iopolicy_names[] = {
+	[SCSI_MPATH_IOPOLICY_NUMA]	= "numa",
+	[SCSI_MPATH_IOPOLICY_RR]	= "round-robin",
+};
+
+static int iopolicy = SCSI_MPATH_IOPOLICY_NUMA;
+
+/*
+ * SCSI multipath only allows the 'numa' or 'round-robin' policy for I/O.
+ * If a more appropriate I/O policy is proposed in the future, it will be
+ * added based on community feedback.
+ */
+static int scsi_set_iopolicy(const char *val, const struct kernel_param *kp)
+{
+	if (!val)
+		return -EINVAL;
+	if (!strncmp(val, "numa", 4))
+		iopolicy = SCSI_MPATH_IOPOLICY_NUMA;
+	else if (!strncmp(val, "round-robin", 11))
+		iopolicy = SCSI_MPATH_IOPOLICY_RR;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
+static int scsi_get_iopolicy(char *buf, const struct kernel_param *kp)
+{
+	return sprintf(buf, "%s\n", scsi_iopolicy_names[iopolicy]);
+}
+
+module_param_call(iopolicy, scsi_set_iopolicy, scsi_get_iopolicy,
+    &iopolicy, 0644);
+MODULE_PARM_DESC(iopolicy,
+    "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
+
+void scsi_mpath_default_iopolicy(struct scsi_device *sdev)
+{
+	sdev->mpath_iopolicy = iopolicy;
+}
+
+void scsi_multipath_iopolicy_update(struct scsi_device *sdev, int iopolicy)
+{
+	struct Scsi_Host *shost =  sdev->host;
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+	int old_iopolicy = READ_ONCE(sdev->mpath_iopolicy);
+
+	if (old_iopolicy == iopolicy)
+		return;
+
+	WRITE_ONCE(sdev->mpath_iopolicy, iopolicy);
+
+	/* An iopolicy change clears the cached paths. */
+	mutex_lock(&mpath_dev->mpath_lock);
+	/* scsi_mpath_clear_paths() walks the host's multipath list itself. */
+	scsi_mpath_clear_paths(shost);
+	mutex_unlock(&mpath_dev->mpath_lock);
+
+	sdev_printk(KERN_NOTICE, sdev, "Multipath iopolicy changed from %s to %s\n",
+	    scsi_iopolicy_names[old_iopolicy], scsi_iopolicy_names[iopolicy]);
+}
+
+bool scsi_mpath_clear_current_path(struct scsi_device *sdev)
+{
+	struct scsi_mpath *mpath_dev;
+	bool changed = false;
+	int node;
+
+	/* Check sdev before dereferencing it. */
+	if (!sdev)
+		return changed;
+	mpath_dev = sdev->host->mpath_dev;
+	for_each_node(node) {
+		if (sdev == rcu_access_pointer(mpath_dev->current_path[node])) {
+			rcu_assign_pointer(mpath_dev->current_path[node], NULL);
+			changed = true;
+		}
+	}
+
+	return changed;
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_clear_current_path);
+
+void scsi_mpath_clear_paths(struct Scsi_Host *shost)
+{
+	struct scsi_device *sdev;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&shost->mpath_dev->srcu);
+	list_for_each_entry_rcu(sdev, &shost->mpath_sdev, mpath_entry) {
+		scsi_mpath_clear_current_path(sdev);
+		kblockd_schedule_work(&shost->mpath_dev->mpath_requeue_work);
+	}
+	srcu_read_unlock(&shost->mpath_dev->srcu, srcu_idx);
+
+}
+
+static inline bool scsi_mpath_state_is_live(enum scsi_mpath_access_state state)
+{
+	if (state == SCSI_MPATH_OPTIMAL ||
+	    state == SCSI_MPATH_ACTIVE)
+		return true;
+
+	return false;
+}
+
+/* Check for path error */
+static inline bool scsi_is_mpath_error(struct scsi_cmnd *scmd)
+{
+	struct request *req = scsi_cmd_to_rq(scmd);
+	struct scsi_device *sdev = req->q->queuedata;
+
+	if (sdev->handler && sdev->handler->prep_fn) {
+		blk_status_t ret = sdev->handler->prep_fn(sdev, req);
+
+		if (ret != BLK_STS_OK)
+			return true;
+	}
+
+	return false;
+}
+
+static bool scsi_mpath_is_disabled(struct scsi_device *sdev)
+{
+	enum scsi_device_state sdev_state = sdev->sdev_state;
+
+	/*
+	 * if device multipath state is not set to LIVE
+	 * then return true
+	 */
+	if (!scsi_mpath_state_is_live(sdev->mpath_state))
+		return true;
+
+	/*
+	 * Do not treat SDEV_CANCEL as a disabled path, as I/O should
+	 * still be able to complete, assuming that the scsi_device is
+	 * within its timeout limit.
+	 * Otherwise I/O will fail immediately and return to the
+	 * requeue list.
+	 */
+	if (sdev_state != SDEV_RUNNING && sdev_state != SDEV_CANCEL)
+		return true;
+
+	return false;
+}
+
+/* handle failover request for path */
+void scsi_mpath_failover_req(struct request *req)
+{
+	struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
+	struct scsi_device *sdev = scmd->device;
+	struct Scsi_Host *shost = scmd->device->host;
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+	unsigned long flags;
+	struct bio *bio;
+
+	if (!scsi_device_online(sdev) || sdev->was_reset || sdev->locked)
+		return;
+
+	scsi_mpath_clear_current_path(sdev);
+
+	/*
+	 * If we got a device handler error, we know that the device is alive
+	 * but not ready to process commands. Kick off a requeue of the SCSI
+	 * command and try another available path.
+	 */
+	if (scsi_is_mpath_error(scmd)) {
+		/*
+		 * Set flag as pending and requeue bio for retry on
+		 * another path
+		 */
+		set_bit(SCSI_MPATH_DISK_IO_PENDING, &sdev->mpath_flags);
+		queue_work(shost->work_q, &mpath_dev->mpath_requeue_work);
+	}
+
+	/*
+	 * Steal the bios from the request. If a bio was submitted as a polled
+	 * operation, clear the polled flag before requeueing the bio.
+	 */
+	spin_lock_irqsave(&mpath_dev->mpath_requeue_lock, flags);
+	for (bio = req->bio; bio; bio = bio->bi_next) {
+		bio_set_dev(bio, req->q->disk->part0);
+		if (bio->bi_opf & REQ_POLLED) {
+			bio->bi_opf &= ~REQ_POLLED;
+			bio->bi_cookie = BLK_QC_T_NONE;
+		}
+	}
+	blk_steal_bios(&mpath_dev->mpath_requeue_list, req);
+	spin_unlock_irqrestore(&mpath_dev->mpath_requeue_lock, flags);
+
+	scmd->result = 0;
+
+	blk_mq_end_request(req, 0);
+
+	kblockd_schedule_work(&mpath_dev->mpath_requeue_work);
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_failover_req);
+
+static inline bool scsi_mpath_is_optimized(struct scsi_device *sdev)
+{
+	return (scsi_device_online(sdev) &&
+	    ((sdev->mpath_state == SCSI_MPATH_OPTIMAL) ||
+	     (sdev->mpath_state == SCSI_MPATH_ACTIVE)));
+}
+
+static struct scsi_device *scsi_next_mpath_sdev(struct Scsi_Host *shost,
+			struct scsi_device *sdev)
+{
+	sdev = list_next_or_null_rcu(&shost->mpath_sdev, &sdev->siblings,
+	    struct scsi_device, siblings);
+
+	if (sdev)
+		return sdev;
+
+	return list_first_or_null_rcu(&shost->mpath_sdev, struct scsi_device,
+	    siblings);
+}
+
+static struct scsi_device *scsi_mpath_round_robin_path(struct Scsi_Host *shost,
+	int node, struct scsi_device *old_sdev)
+{
+	struct scsi_device *sdev, *found = NULL;
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+
+	if (list_is_singular(&shost->mpath_sdev)) {
+		if (scsi_mpath_is_disabled(old_sdev))
+			return NULL;
+		return old_sdev;
+	}
+
+	for (sdev = scsi_next_mpath_sdev(shost, old_sdev);
+	    sdev && sdev != old_sdev;
+	    sdev = scsi_next_mpath_sdev(shost, sdev)) {
+		if (scsi_mpath_is_disabled(sdev))
+			continue;
+		if (sdev->mpath_state == SCSI_MPATH_OPTIMAL) {
+			found = sdev;
+			goto out;
+		}
+		if (sdev->mpath_state == SCSI_MPATH_ACTIVE)
+			found = sdev;
+	}
+
+	if (!scsi_mpath_is_disabled(old_sdev) &&
+	    (old_sdev->mpath_state == SCSI_MPATH_OPTIMAL ||
+	    (!found && old_sdev->mpath_state == SCSI_MPATH_ACTIVE)))
+		return old_sdev;
+
+	if (!found)
+		return NULL;
+out:
+	rcu_assign_pointer(mpath_dev->current_path[node], found);
+
+	return found;
+}
+
+/*
+ * Search path based on iopolicy and numa node affinity
+ * and return the scsi_device for that path
+ */
+inline struct scsi_device *__scsi_find_path(struct Scsi_Host *shost, int node)
+{
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+	int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
+	struct scsi_device *sdev_found = NULL, *sdev_fallback = NULL, *sdev;
+
+	list_for_each_entry_rcu(sdev, &shost->mpath_sdev, mpath_entry) {
+		if (scsi_mpath_is_disabled(sdev))
+			continue;
+
+		if (sdev->mpath_numa_node != NUMA_NO_NODE &&
+		    (READ_ONCE(sdev->mpath_iopolicy) == SCSI_MPATH_IOPOLICY_NUMA))
+			distance = node_distance(node, sdev->mpath_numa_node);
+		else
+			distance = LOCAL_DISTANCE;
+
+		switch (sdev->mpath_state) {
+		case SCSI_MPATH_OPTIMAL:
+			if (distance < found_distance) {
+				found_distance = distance;
+				sdev_found = sdev;
+			}
+			break;
+		case SCSI_MPATH_ACTIVE:
+			if (distance < fallback_distance) {
+				fallback_distance = distance;
+				sdev_fallback = sdev;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+
+	if (!sdev_found)
+		sdev_found = sdev_fallback;
+
+	if (sdev_found)
+		rcu_assign_pointer(mpath_dev->current_path[node], sdev_found);
+
+	return sdev_found;
+}
+
+inline struct scsi_device *scsi_find_path(struct Scsi_Host *shost)
+{
+	int node = numa_node_id();
+	struct scsi_device *sdev;
+
+	sdev = srcu_dereference(shost->mpath_dev->current_path[node],
+	    &shost->mpath_dev->srcu);
+
+	if (unlikely(!sdev))
+		return __scsi_find_path(shost, node);
+
+	if (READ_ONCE(sdev->mpath_iopolicy) == SCSI_MPATH_IOPOLICY_RR)
+		return scsi_mpath_round_robin_path(shost, node, sdev);
+
+	if (unlikely(!scsi_mpath_is_optimized(sdev)))
+		return __scsi_find_path(shost, node);
+
+	return sdev;
+}
+
+void scsi_mpath_requeue_work(struct work_struct *work)
+{
+	struct scsi_mpath *mpath_dev =
+	    container_of(work, struct scsi_mpath, mpath_requeue_work);
+	struct bio *bio, *next;
+
+	spin_lock_irq(&mpath_dev->mpath_requeue_lock);
+	next = bio_list_get(&mpath_dev->mpath_requeue_list);
+	spin_unlock_irq(&mpath_dev->mpath_requeue_lock);
+
+	while ((bio = next) != NULL) {
+		next = bio->bi_next;
+		bio->bi_next = NULL;
+		submit_bio_noacct(bio);
+	}
+}
+
+void scsi_mpath_set_live(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+	int ret;
+
+	if (!sdev->mpath_disk)
+		return;
+
+	if (!test_and_set_bit(SCSI_MPATH_DISK_LIVE, &sdev->mpath_flags)) {
+		ret = device_add_disk(&sdev->sdev_dev, sdev->mpath_disk, NULL);
+		if (ret) {
+			clear_bit(SCSI_MPATH_DISK_LIVE, &sdev->mpath_flags);
+			return;
+		}
+	}
+
+	pr_info("Attached SCSI %s disk\n", sdev->mpath_disk->disk_name);
+
+	mutex_lock(&mpath_dev->mpath_lock);
+	if (scsi_mpath_is_optimized(sdev)) {
+		int node, srcu_idx;
+
+		srcu_idx = srcu_read_lock(&mpath_dev->srcu);
+		for_each_online_node(node)
+			__scsi_find_path(shost, node);
+		srcu_read_unlock(&mpath_dev->srcu, srcu_idx);
+	}
+	mutex_unlock(&mpath_dev->mpath_lock);
+
+	synchronize_srcu(&mpath_dev->srcu);
+	kblockd_schedule_work(&mpath_dev->mpath_requeue_work);
+}
+
+/*
+ * Callback function for activating multipath devices
+ */
+static void activate_mpath(void *data, int err)
+{
+	struct scsi_device *sdev = data;
+	struct scsi_mpath_dh_data *mpath_h = sdev->mpath_pg_data;
+	bool retry = false;
+
+	if (!mpath_h)
+		return;
+
+	switch (err) {
+	case SCSI_DH_OK:
+		break;
+	case SCSI_DH_NOSYS:
+		sdev_printk(KERN_ERR, sdev,
+			"Could not failover the device scsi_dh_%s, Error %d\n",
+			sdev->handler->name, err);
+		scsi_mpath_clear_current_path(sdev);
+		break;
+	case SCSI_DH_DEV_TEMP_BUSY:
+		sdev_printk(KERN_ERR, sdev,
+			"Device Handler Path Busy\n");
+		break;
+	case SCSI_DH_RETRY:
+		sdev_printk(KERN_ERR, sdev,
+			"Device Handler Path Retry\n");
+		retry = true;
+		fallthrough;
+	case SCSI_DH_IMM_RETRY:
+	case SCSI_DH_RES_TEMP_UNAVAIL:
+		sdev_printk(KERN_ERR, sdev,
+			"Device Handler Path Unavailable, Clear current path\n");
+		if ((mpath_h->state == SCSI_ACCESS_STATE_OFFLINE) ||
+		    (mpath_h->state == SCSI_ACCESS_STATE_UNAVAILABLE))
+			scsi_mpath_clear_current_path(sdev);
+		err = 0;
+		break;
+	case SCSI_DH_DEV_OFFLINED:
+	default:
+		sdev_printk(KERN_ERR, sdev, "Device Handler Path offlined\n");
+		scsi_mpath_clear_current_path(sdev);
+		break;
+	}
+
+	if (retry)
+		set_bit(SCSI_MPATH_DISK_IO_PENDING, &sdev->mpath_flags);
+
+	if (scsi_mpath_state_is_live(sdev->mpath_state))
+		scsi_mpath_set_live(sdev);
+}
+
+void scsi_activate_path(struct scsi_device *sdev)
+{
+	struct request_queue *q = sdev->mpath_disk->queue;
+	struct scsi_mpath_dh_data *mpath_dh = sdev->mpath_pg_data;
+
+	if (!mpath_dh)
+		return;
+
+	if (!scsi_mpath_state_is_live(sdev->mpath_state)) {
+		sdev_printk(KERN_INFO, sdev, "Path state is not live\n");
+		return;
+	}
+
+	if (!blk_queue_dying(q))
+		scsi_dh_activate(q, activate_mpath, sdev);
+	else
+		activate_mpath(sdev, SCSI_DH_OK);
+}
+
+static void scsi_activate_mpath_work(struct work_struct *work)
+{
+	struct scsi_device *sdev = container_of(work,
+	    struct scsi_device, activate_mpath);
+
+	if (!sdev)
+		return;
+
+	scsi_activate_path(sdev);
+}
+
+int scsi_mpath_add_disk(struct scsi_device *sdev)
+{
+	if (!sdev->mpath_pg_data) {
+		/* Re-initialize ALUA */
+		sdev->handler->rescan(sdev);
+	} else {
+		sdev->mpath_state = SCSI_MPATH_OPTIMAL;
+		scsi_mpath_set_live(sdev);
+	}
+
+	return (test_bit(SCSI_MPATH_DISK_LIVE, &sdev->mpath_flags));
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_add_disk);
+
+int scsi_multipath_init(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_mpath_dh_data *h;
+	struct scsi_mpath *mpath_dev;
+	int ret = -ENOMEM;
+
+	mpath_dev = kzalloc(sizeof(struct scsi_mpath), GFP_KERNEL);
+	if (!mpath_dev)
+		return ret;
+
+	h = kzalloc(sizeof(struct scsi_mpath_dh_data), GFP_KERNEL);
+	if (!h)
+		goto out_mpath_dev;
+
+	sdev->mpath_pg_data = h;
+
+	ret = init_srcu_struct(&mpath_dev->srcu);
+	if (ret) {
+		/* init failed, so there is no SRCU state to clean up */
+		goto out_handler;
+	}
+
+	shost->mpath_dev = mpath_dev;
+
+	mutex_init(&mpath_dev->mpath_lock);
+	bio_list_init(&mpath_dev->mpath_requeue_list);
+	spin_lock_init(&mpath_dev->mpath_requeue_lock);
+	INIT_WORK(&mpath_dev->mpath_requeue_work, scsi_mpath_requeue_work);
+	INIT_LIST_HEAD(&mpath_dev->mpath_list);
+	INIT_WORK(&sdev->activate_mpath, scsi_activate_mpath_work);
+	INIT_LIST_HEAD(&sdev->mpath_entry);
+	sdev->mpath_numa_node = NUMA_NO_NODE;
+	sdev->is_shared = 1;
+
+	return 0;
+
+out_handler:
+	kfree(h);
+out_mpath_dev:
+	if (mpath_dev)
+		kfree(mpath_dev);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(scsi_multipath_init);
+
+static bool scsi_available_mpath(struct Scsi_Host *shost)
+{
+	struct scsi_device *sdev;
+
+	list_for_each_entry_rcu(sdev, &shost->mpath_sdev, mpath_entry) {
+		if (scsi_device_online(sdev))
+			return true;
+	}
+	return false;
+}
+
+/* Called when shost is being freed */
+void scsi_mpath_dev_release(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_mpath *mpath_dev;
+
+	if (!shost->mpath_dev)
+		return;
+
+	mpath_dev = shost->mpath_dev;
+	cancel_work_sync(&mpath_dev->mpath_requeue_work);
+	cleanup_srcu_struct(&mpath_dev->srcu);
+
+	if (sdev->mpath_pg_data)
+		kfree(sdev->mpath_pg_data);
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_dev_release);
+
+void scsi_put_mpath_sdev(struct scsi_device *sdev)
+{
+	scsi_device_put(sdev);
+}
+
+void scsi_mpath_revalidate_path(struct gendisk *mpath_disk, sector_t capacity)
+{
+	struct Scsi_Host *shost = mpath_disk->private_data;
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+	struct scsi_device *sdev;
+	int srcu_idx;
+	int node;
+
+	if (!shost->mpath_dev)
+		return;
+
+	srcu_idx = srcu_read_lock(&mpath_dev->srcu);
+	list_for_each_entry_rcu(sdev, &shost->mpath_sdev, mpath_entry) {
+		if (capacity != get_capacity(sdev->mpath_disk))
+			clear_bit(SCSI_MPATH_DISK_LIVE, &sdev->mpath_flags);
+	}
+	srcu_read_unlock(&mpath_dev->srcu, srcu_idx);
+
+	for_each_node(node)
+		rcu_assign_pointer(mpath_dev->current_path[node], NULL);
+	kblockd_schedule_work(&mpath_dev->mpath_requeue_work);
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_revalidate_path);
+
+static int scsi_mpath_open(struct gendisk *disk, blk_mode_t mode)
+{
+	if (!scsi_get_device(disk->private_data))
+		return -ENXIO;
+
+	return 0;
+}
+
+static void scsi_mpath_release(struct gendisk *disk)
+{
+	struct Scsi_Host *shost = disk->private_data;
+	struct scsi_device *sdev;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&shost->mpath_dev->srcu);
+	sdev = scsi_find_path(shost);
+	srcu_read_unlock(&shost->mpath_dev->srcu, srcu_idx);
+}
+
+int scsi_mpath_failover_disposition(struct scsi_cmnd *scmd)
+{
+	struct request *req = scsi_cmd_to_rq(scmd);
+
+	if (req->cmd_flags & REQ_SCSI_MPATH) {
+		if (scsi_is_mpath_error(scmd) ||
+		    blk_queue_dying(req->q)) {
+			return NEEDS_RETRY;
+		}
+	} else {
+		if (blk_queue_dying(req->q))
+			return SUCCESS;
+	}
+
+	return SUCCESS;
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_failover_disposition);
+
+static void scsi_multipath_submit_bio(struct bio *bio)
+{
+	struct Scsi_Host *shost = bio->bi_bdev->bd_disk->private_data;
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+	struct scsi_device *sdev;
+	int srcu_idx;
+
+	/*
+	 * The scsi device might be going away and the bio might be
+	 * moved to a different queue via blk_steal_bios(), so we
+	 * need to use bio_split pool from the original queue to
+	 * allocate the bvecs from.
+	 */
+	bio = bio_split_to_limits(bio);
+	if (!bio)
+		return;
+
+	srcu_idx = srcu_read_lock(&mpath_dev->srcu);
+	sdev = scsi_find_path(shost);
+	if (likely(sdev)) {
+		bio_set_dev(bio, bio->bi_bdev->bd_disk->part0);
+		bio->bi_opf |= REQ_SCSI_MPATH;
+		submit_bio_noacct(bio);
+	} else if (scsi_available_mpath(shost)) {
+		shost_printk(KERN_NOTICE, shost,
+		    "No Usable Path - Requeuing I/O\n");
+
+		spin_lock_irq(&mpath_dev->mpath_requeue_lock);
+		bio_list_add(&mpath_dev->mpath_requeue_list, bio);
+		spin_unlock_irq(&mpath_dev->mpath_requeue_lock);
+	} else {
+		shost_printk(KERN_NOTICE, shost,
+		    "No available path - Failing I/O\n");
+
+		bio_io_error(bio);
+	}
+	srcu_read_unlock(&mpath_dev->srcu, srcu_idx);
+}
+
+static int scsi_mpath_get_unique_id(struct gendisk *disk, u8 id[16],
+    enum blk_unique_id type)
+{
+	struct Scsi_Host *shost = disk->private_data;
+	struct scsi_device *sdev;
+	int srcu_idx, ret = -EWOULDBLOCK;
+
+	srcu_idx = srcu_read_lock(&shost->mpath_dev->srcu);
+	sdev = scsi_find_path(shost);
+	if (sdev)
+		ret = scsi_mpath_unique_id(sdev, id, type);
+	srcu_read_unlock(&shost->mpath_dev->srcu, srcu_idx);
+
+	return ret;
+}
+
+const struct block_device_operations scsi_mpath_ops = {
+	.owner          = THIS_MODULE,
+	.submit_bio	= scsi_multipath_submit_bio,
+	.open		= scsi_mpath_open,
+	.release	= scsi_mpath_release,
+	.get_unique_id	= scsi_mpath_get_unique_id,
+};
+
+int scsi_mpath_unique_id(struct scsi_device *sdev, u8 id[16],
+		enum blk_unique_id type)
+{
+	struct scsi_mpath_dh_data *dh_data = sdev->mpath_pg_data;
+
+	if (type != BLK_UID_NAA)
+		return -EINVAL;
+
+	if (strncmp(dh_data->device_id_str, id, 16) == 0)
+		return dh_data->device_id_len;
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_unique_id);
+
+int scsi_mpath_unique_lun_id(struct scsi_device *sdev)
+{
+	struct scsi_mpath_dh_data *dh_data = sdev->mpath_pg_data;
+	char device_id_str[20];
+	int ret = -EINVAL;
+
+	ret = scsi_vpd_lun_id(sdev, device_id_str, dh_data->device_id_len);
+	if (ret < 0)
+		return ret;
+
+	if (strncmp(dh_data->device_id_str, device_id_str,
+	    dh_data->device_id_len) == 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
+ * Allocate Disk for Multipath Device
+ */
+int scsi_mpath_alloc_disk(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+	struct queue_limits lim;
+
+	/*
+	 * Don't allocate mpath disk if ALUA handler is not attached
+	 */
+	if (!sdev->handler || strncmp(sdev->handler->name, "alua", 4) != 0) {
+		sdev_printk(KERN_NOTICE, sdev,
+		    "No handler or incorrect handler attached for multipath\n");
+		return 0;
+	}
+
+	/*
+	 * Add the multipath disk only if the scsi_multipath modparam is enabled
+	 */
+	if (!scsi_multipath) {
+		sdev_printk(KERN_NOTICE, sdev,
+		    "%s Handler attached but modparam scsi_multipath is set to false\n",
+		    sdev->handler->name);
+		return 0;
+	}
+
+	if (scsi_mpath_unique_lun_id(sdev) == 0) {
+		sdev_printk(KERN_NOTICE, sdev,
+		    "existing sdev with path, return\n");
+		return 0;
+	}
+
+	blk_set_stacking_limits(&lim);
+
+	lim.features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
+	lim.max_zone_append_sectors = 0;
+	lim.dma_alignment = 3;
+
+	sdev->mpath_disk = blk_alloc_disk(&lim, sdev->mpath_numa_node);
+	if (IS_ERR(sdev->mpath_disk))
+		return PTR_ERR(sdev->mpath_disk);
+
+	sdev->mpath_disk->private_data = shost;
+	sdev->mpath_disk->fops = &scsi_mpath_ops;
+
+	list_add_tail(&sdev->mpath_entry, &shost->mpath_sdev);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_alloc_disk);
+
+void scsi_mpath_start_request(struct request *req)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+	struct scsi_device *sdev = cmd->device;
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+
+	if (!blk_queue_io_stat(sdev->mpath_disk->queue) ||
+	    blk_rq_is_passthrough(req))
+		return;
+
+	req->rq_flags |= SCSI_MPATH_IO_STATS;
+	mpath_dev->mpath_start_time = bdev_start_io_acct(sdev->mpath_disk->part0,
+	    req_op(req), jiffies);
+}
+
+void scsi_mpath_end_request(struct request *req)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+	struct scsi_device *sdev = cmd->device;
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+
+	if (!(req->rq_flags & SCSI_MPATH_IO_STATS))
+		return;
+
+	bdev_end_io_acct(sdev->mpath_disk->part0, req_op(req),
+	    blk_rq_bytes(req) >> SECTOR_SHIFT,
+	    mpath_dev->mpath_start_time);
+}
+
+void scsi_mpath_kick_requeue_lists(struct Scsi_Host *shost)
+{
+	struct scsi_mpath *mpath_dev = shost->mpath_dev;
+	struct scsi_device *sdev;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&mpath_dev->srcu);
+	list_for_each_entry_rcu(sdev, &shost->mpath_sdev, mpath_entry) {
+		if (sdev->is_shared)
+			continue;
+
+		kblockd_schedule_work(&mpath_dev->mpath_requeue_work);
+		if (sdev->sdev_state == SDEV_RUNNING)
+			disk_uevent(sdev->mpath_disk, KOBJ_CHANGE);
+	}
+	srcu_read_unlock(&mpath_dev->srcu, srcu_idx);
+}
+
+void scsi_mpath_shutdown_disk(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+
+	if (!sdev->mpath_disk)
+		return;
+
+	if (test_and_clear_bit(SCSI_MPATH_DISK_LIVE, &sdev->mpath_flags)) {
+		synchronize_srcu(&shost->mpath_dev->srcu);
+		kblockd_schedule_work(&shost->mpath_dev->mpath_requeue_work);
+		del_gendisk(sdev->mpath_disk);
+	}
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_shutdown_disk);
+
+void scsi_mpath_remove_disk(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+
+	if (!sdev->mpath_disk)
+		return;
+
+	if (!sdev->is_shared)
+		return;
+
+	/* Make sure all pending bios are cleaned up */
+	kblockd_schedule_work(&shost->mpath_dev->mpath_requeue_work);
+	flush_work(&shost->mpath_dev->mpath_requeue_work);
+	put_disk(sdev->mpath_disk);
+}
+EXPORT_SYMBOL_GPL(scsi_mpath_remove_disk);
+
+int scsi_mpath_update_state(struct scsi_device *sdev)
+{
+	struct scsi_mpath_dh_data *mpath_h;
+
+	mpath_h = sdev->mpath_pg_data;
+	if (!mpath_h)
+		return SCSI_MPATH_UNAVAILABLE;
+
+	switch (mpath_h->state) {
+	case SCSI_ACCESS_STATE_OPTIMAL:
+		sdev->mpath_state = SCSI_MPATH_OPTIMAL;
+		break;
+	case SCSI_ACCESS_STATE_ACTIVE:
+		sdev->mpath_state = SCSI_MPATH_ACTIVE;
+		break;
+	case SCSI_ACCESS_STATE_STANDBY:
+		sdev->mpath_state = SCSI_MPATH_STANDBY;
+		break;
+	case SCSI_ACCESS_STATE_UNAVAILABLE:
+		sdev->mpath_state = SCSI_MPATH_UNAVAILABLE;
+		break;
+	case SCSI_ACCESS_STATE_TRANSITIONING:
+		sdev->mpath_state = SCSI_MPATH_TRANSITIONING;
+		break;
+	case SCSI_ACCESS_STATE_OFFLINE:
+	default:
+		sdev->mpath_state = SCSI_MPATH_OFFLINE;
+		break;
+	}
+
+	return sdev->mpath_state;
+}
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 9c540f5468eb..b46e06a01179 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -9,6 +9,8 @@
 #include <scsi/scsi.h>
 #include <linux/atomic.h>
 #include <linux/sbitmap.h>
+#include <scsi/scsi_multipath.h>
+#include <scsi/scsi_host.h>
 
 struct bsg_device;
 struct device;
@@ -100,6 +102,11 @@ struct scsi_vpd {
 	unsigned char	data[];
 };
 
+/*
+ * Mark bio as coming from scsi multipath node
+ */
+#define REQ_SCSI_MPATH		REQ_DRV
+
 struct scsi_device {
 	struct Scsi_Host *host;
 	struct request_queue *request_queue;
@@ -120,6 +127,7 @@ struct scsi_device {
 	unsigned short last_queue_full_count; /* scsi_track_queue_full() */
 	unsigned long last_queue_full_time;	/* last queue full time */
 	unsigned long queue_ramp_up_period;	/* ramp up period in jiffies */
+
 #define SCSI_DEFAULT_RAMP_UP_PERIOD	(120 * HZ)
 
 	unsigned long last_queue_ramp_up;	/* last queue ramp up time */
@@ -265,6 +273,25 @@ struct scsi_device {
 	struct device		sdev_gendev,
 				sdev_dev;
 
+#ifdef	CONFIG_SCSI_MULTIPATH
+	int				is_shared; 	/* Set Multipath flag  */
+	int				mpath_first_path; /* Indicate if this was first path */
+	struct gendisk          	*mpath_disk;	/* Multipath disk */
+	int				mpath_numa_node; /* NUMA node for Path  */
+	enum scsi_mpath_access_state	mpath_state;	/* Multipath State */
+	enum scsi_mpath_iopolicy	mpath_iopolicy;	/* IO Policy */
+	struct list_head		mpath_entry;	/* list of all mpath_sdevs */
+	struct scsi_mpath_dh_data	*mpath_pg_data; /* Place holder for Port group data */
+	struct work_struct		activate_mpath; /* Activate path work */
+	atomic_t			nr_mpath;	/* Number of Active mpath */
+
+#define SCSI_MPATH_DISK_LIVE            0
+#define SCSI_MPATH_DISK_IO_PENDING      1
+#define SCSI_MPATH_IO_STATS             2
+
+	unsigned long           mpath_flags;		/* flags for multipath devices */
+#endif
+
 	struct work_struct	requeue_work;
 
 	struct scsi_device_handler *handler;
@@ -294,6 +321,43 @@ struct scsi_device {
 #define sdev_dbg(sdev, fmt, a...) \
 	dev_dbg(&(sdev)->sdev_gendev, fmt, ##a)
 
+#ifdef CONFIG_SCSI_MULTIPATH
+extern bool scsi_multipath;
+extern const struct block_device_operations scsi_mpath_ops;
+
+static inline bool scsi_sdev_use_alua(struct scsi_device *sdev)
+{
+	return sdev->handler_data != NULL;
+}
+
+static inline bool scsi_disk_is_multipath(struct gendisk *disk)
+{
+	return disk->fops == &scsi_mpath_ops;
+}
+
+static inline bool scsi_mpath_enabled(struct scsi_device *sdev)
+{
+	return IS_ENABLED(CONFIG_SCSI_MULTIPATH);
+}
+static inline bool scsi_is_sdev_multipath(struct scsi_device *sdev)
+{
+	return IS_ENABLED(CONFIG_SCSI_MULTIPATH) && sdev->mpath_disk;
+}
+#else
+#define scsi_multipath	false
+static inline bool scsi_disk_is_multipath(struct gendisk *disk)
+{
+	return false;
+}
+static inline bool scsi_mpath_enabled(struct scsi_device *sdev)
+{
+	return false;
+}
+static inline bool scsi_is_sdev_multipath(struct scsi_device *sdev)
+{
+	return false;
+}
+#endif
 /*
  * like scmd_printk, but the device name is passed in
  * as a string pointer
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 2b4ab0369ffb..d20def053254 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -571,6 +571,12 @@ struct Scsi_Host {
 	/* Area to keep a shared tag map */
 	struct blk_mq_tag_set	tag_set;
 
+#ifdef	CONFIG_SCSI_MULTIPATH
+	struct scsi_mpath	*mpath_dev;
+	struct list_head	mpath_sdev;
+	int			mpath_alua_grpid; /* Group ID for ALUA devices */
+#endif
+
 	atomic_t host_blocked;
 
 	unsigned int host_failed;	   /* commands that failed.
@@ -761,6 +767,7 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
 		shost->tmf_in_progress;
 }
 
+
 extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
diff --git a/include/scsi/scsi_multipath.h b/include/scsi/scsi_multipath.h
new file mode 100644
index 000000000000..b441241c8316
--- /dev/null
+++ b/include/scsi/scsi_multipath.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SCSI_SCSI_MULTIPATH_H
+#define _SCSI_SCSI_MULTIPATH_H
+
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/rcupdate.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+#include <linux/blk-mq.h>
+#include <scsi/scsi.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_host.h>
+
+struct scsi_device;
+
+enum scsi_mpath_iopolicy {
+	SCSI_MPATH_IOPOLICY_NUMA,
+	SCSI_MPATH_IOPOLICY_RR,
+};
+
+enum scsi_mpath_access_state {
+	SCSI_MPATH_OPTIMAL	= SCSI_ACCESS_STATE_OPTIMAL,
+	SCSI_MPATH_ACTIVE	= SCSI_ACCESS_STATE_ACTIVE,
+	SCSI_MPATH_STANDBY	= SCSI_ACCESS_STATE_STANDBY,
+	SCSI_MPATH_UNAVAILABLE	= SCSI_ACCESS_STATE_UNAVAILABLE,
+	SCSI_MPATH_LBA		= SCSI_ACCESS_STATE_LBA,
+	SCSI_MPATH_OFFLINE	= SCSI_ACCESS_STATE_OFFLINE,
+	SCSI_MPATH_TRANSITIONING = SCSI_ACCESS_STATE_TRANSITIONING,
+	SCSI_MPATH_INVALID	= 0xFF
+};
+
+struct scsi_mpath_dh_data {
+	const char	*hndlr_name; /* device Handler name */
+	int	group_id;		/* Group ID reported from RTPG cmd */
+	int	tpgs;			/* Target Port Groups reported from RTPG cmd */
+	int	state;			/* Target Port Group State */
+	char	*device_id_str;		/* Multipath Device String */
+	int	device_id_len;		/* Device ID Length */
+	int	valid_states;		/* states from RTPG cmd */
+	int	prefrence;		/* Path preference for Port Group from RTPG cmd */
+	int	is_active;		/* Current Sdev is active */
+};
+
+struct scsi_mpath {
+	struct srcu_struct 	srcu;
+	struct Scsi_Host	*shost;	/* Scsi_Host this mpath belongs to */
+	struct list_head        mpath_list;  /* list of multipath scsi_devices */
+	struct	bio_list	mpath_requeue_list; /* list for requeueing bios */
+	spinlock_t		mpath_requeue_lock;
+	struct work_struct	mpath_requeue_work; /* work struct for requeue */
+	struct mutex            mpath_lock;
+	unsigned long		mpath_start_time;
+	struct delayed_work	activate_mpath; /* Path Activation work */
+	struct scsi_device __rcu *current_path[]; /* scsi_device of current path */
+};
+
+extern void scsi_mpath_default_iopolicy(struct scsi_device *);
+extern void scsi_mpath_unfreeze(struct Scsi_Host *);
+extern void scsi_mpath_wait_freeze(struct Scsi_Host *);
+extern void scsi_mpath_start_freeze(struct Scsi_Host *);
+extern void scsi_mpath_failover_req(struct request *);
+extern void scsi_mpath_start_request(struct request *);
+extern void scsi_mpath_end_request(struct request *);
+extern void scsi_kick_requeue_lists(struct Scsi_Host *);
+extern bool scsi_mpath_clear_current_path(struct scsi_device *);
+int scsi_multipath_init(struct scsi_device *);
+extern int scsi_mpath_failover_disposition(struct scsi_cmnd *);
+int scsi_mpath_alloc_disk(struct scsi_device *);
+extern void scsi_mpath_remove_disk(struct scsi_device *);
+extern void scsi_mpath_shutdown_disk(struct scsi_device *);
+void scsi_put_mpath_sdev(struct scsi_device *);
+void scsi_mpath_requeue_work(struct work_struct *);
+extern void scsi_mpath_dev_release(struct scsi_device *);
+void scsi_mpath_kick_requeue_lists(struct Scsi_Host *);
+int scsi_mpath_update_state(struct scsi_device *);
+extern int scsi_mpath_add_disk(struct scsi_device *);
+void scsi_mpath_set_live(struct scsi_device *);
+void scsi_activate_path(struct scsi_device *);
+void scsi_multipath_iopolicy_update(struct scsi_device *, int);
+void scsi_mpath_clear_paths(struct Scsi_Host *);
+int scsi_mpath_unique_lun_id(struct scsi_device *);
+
+extern void scsi_mpath_revalidate_path(struct gendisk *, sector_t);
+extern int scsi_mpath_unique_id(struct scsi_device *sdev, u8 id[16], enum blk_unique_id type);
+#endif /* _SCSI_SCSI_MULTIPATH_H */
-- 
2.41.0.rc2



* [RFC v1 2/8] scsi: create multipath capable scsi host
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 1/8] scsi: Add multipath device support himanshu.madhani
@ 2024-11-09  4:45 ` himanshu.madhani
  2024-11-10 21:11   ` Bart Van Assche
  2024-11-09  4:45 ` [RFC v1 3/8] scsi: Add error handling capability for multipath himanshu.madhani
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

- Create multipath capable scsi host

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
---
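The sizing done in scsi_host_alloc() in this patch (one struct scsi_mpath
plus one current_path[] slot per possible NUMA node) can be modelled in
userspace as below. This is only a sketch: struct path, struct mpath_model
and both helpers are hypothetical stand-ins, not kernel code. In the kernel
itself, the struct_size() helper from <linux/overflow.h> performs the same
computation with overflow checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are not the kernel types. */
struct path;				/* models struct scsi_device */

struct mpath_model {
	unsigned int nr_nodes;		/* number of per-node slots */
	struct path *current_path[];	/* one cached path per NUMA node */
};

/* Bytes needed for the header plus nr_nodes trailing pointer slots. */
static size_t mpath_model_size(unsigned int nr_nodes)
{
	return sizeof(struct mpath_model) +
	       (size_t)nr_nodes * sizeof(struct path *);
}

/* Allocate a zeroed instance; returns NULL on allocation failure. */
static struct mpath_model *mpath_model_alloc(unsigned int nr_nodes)
{
	struct mpath_model *m;

	assert(nr_nodes > 0);
	m = calloc(1, mpath_model_size(nr_nodes));
	if (m)
		m->nr_nodes = nr_nodes;
	return m;
}
```

The element size is taken from the type of the trailing array, so the
computed size stays correct if the element type ever changes.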
 drivers/scsi/hosts.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index e021f1106bea..3cedb2a9af7b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -39,6 +39,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_transport.h>
 #include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_multipath.h>
 
 #include "scsi_priv.h"
 #include "scsi_logging.h"
@@ -394,6 +395,14 @@ struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int priv
 	struct Scsi_Host *shost;
 	int index;
 
+#ifdef CONFIG_SCSI_MULTIPATH
+	struct scsi_mpath *mpath_dev;
+	size_t	size = sizeof(*mpath_dev);
+
+	size += num_possible_nodes() * sizeof(struct scsi_device *);
+	privsize = privsize + size;
+#endif
+
 	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL);
 	if (!shost)
 		return NULL;
@@ -409,6 +418,9 @@ struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int priv
 	init_waitqueue_head(&shost->host_wait);
 	mutex_init(&shost->scan_mutex);
 
+#ifdef CONFIG_SCSI_MULTIPATH
+	INIT_LIST_HEAD(&shost->mpath_sdev);
+#endif
 	index = ida_alloc(&host_index_ida, GFP_KERNEL);
 	if (index < 0) {
 		kfree(shost);
-- 
2.41.0.rc2



* [RFC v1 3/8] scsi: Add error handling capability for multipath
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 1/8] scsi: Add multipath device support himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 2/8] scsi: create multipath capable scsi host himanshu.madhani
@ 2024-11-09  4:45 ` himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 4/8] scsi: Complete multipath request himanshu.madhani
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

For multipath-capable devices, call scsi_mpath_failover_disposition() to
kick off failover to another path. This invokes the path selector
algorithm to pick an active path for the failover.

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
---
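As a reading aid, the decision this hook is meant to make could look roughly
like the userspace model below. Everything in it is an assumption made for
illustration: the model_* names, the error classes and the rule itself are
hypothetical and are not taken from scsi_mpath_failover_disposition().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; names and values are illustrative, not the kernel's. */
enum model_disposition { MODEL_NEEDS_RETRY, MODEL_FAILOVER };

enum model_error {
	MODEL_ERR_MEDIUM,	/* error tied to the medium itself */
	MODEL_ERR_TRANSPORT,	/* link or transport failure */
	MODEL_ERR_PATH_STATE,	/* e.g. port group became unavailable */
};

struct model_cmd {
	bool on_mpath_disk;	/* request came in through the multipath node */
	int other_usable_paths;	/* usable paths besides the one that failed */
	enum model_error err;
};

/* True for errors tied to the path rather than to the medium. */
static bool model_is_path_error(enum model_error err)
{
	return err == MODEL_ERR_TRANSPORT || err == MODEL_ERR_PATH_STATE;
}

/* Fail over only for path errors on a multipath request with a spare path. */
static enum model_disposition
model_failover_disposition(const struct model_cmd *cmd)
{
	assert(cmd != NULL);
	if (cmd->on_mpath_disk && cmd->other_usable_paths > 0 &&
	    model_is_path_error(cmd->err))
		return MODEL_FAILOVER;
	return MODEL_NEEDS_RETRY;
}
```

In this model a medium error, or a failure on the last remaining path, falls
through to the ordinary retry handling instead of requesting failover.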
 drivers/scsi/scsi_error.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 612489afe8d2..d5d1b20928a6 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -40,6 +40,7 @@
 #include <scsi/scsi_ioctl.h>
 #include <scsi/scsi_dh.h>
 #include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_multipath.h>
 #include <scsi/sg.h>
 
 #include "scsi_priv.h"
@@ -2047,6 +2048,13 @@ enum scsi_disposition scsi_decide_disposition(struct scsi_cmnd *scmd)
 
 maybe_retry:
 
+	/*
+	 * For SCSI multipath, check whether a path error should
+	 * trigger failover to another available path.
+	 */
+	if (scsi_mpath_enabled(scmd->device))
+		return scsi_mpath_failover_disposition(scmd);
+
 	/* we requeue for retry because the error was retryable, and
 	 * the request was not marked fast fail.  Note that above,
 	 * even if the request is marked fast fail, we still requeue
-- 
2.41.0.rc2



* [RFC v1 4/8] scsi: Complete multipath request
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
                   ` (2 preceding siblings ...)
  2024-11-09  4:45 ` [RFC v1 3/8] scsi: Add error handling capability for multipath himanshu.madhani
@ 2024-11-09  4:45 ` himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 5/8] scsi: Add scsi multipath sysfs hooks himanshu.madhani
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add a check for multipath requests when scsi_complete() is called.
In the error handling case, call scsi_mpath_failover_req() to
complete the multipath I/O.

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
---
 drivers/scsi/scsi_lib.c | 25 +++++++++++++++++++++++++
 include/scsi/scsi.h     |  1 +
 2 files changed, 26 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0561b318dade..1c8113abc154 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -33,6 +33,7 @@
 #include <scsi/scsi_eh.h>
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_transport.h> /* scsi_init_limits() */
+#include <scsi/scsi_multipath.h>
 #include <scsi/scsi_dh.h>
 
 #include <trace/events/scsi.h>
@@ -620,6 +621,14 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
 	}
 }
 
+static inline void __scsi_mpath_end_request(struct request *req,
+    blk_status_t status)
+{
+	if (req->cmd_flags & REQ_SCSI_MPATH)
+		scsi_mpath_end_request(req);
+	blk_mq_end_request(req, status);
+}
+
 /* Returns false when no more bytes to process, true if there are more */
 static bool scsi_end_request(struct request *req, blk_status_t error,
 		unsigned int bytes)
@@ -661,6 +670,9 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 	 */
 	percpu_ref_get(&q->q_usage_counter);
 
+	if (req->cmd_flags & REQ_SCSI_MPATH)
+		scsi_mpath_end_request(req);
+
 	__blk_mq_end_request(req, error);
 
 	scsi_run_queue_async(sdev);
@@ -1528,6 +1540,9 @@ static void scsi_complete(struct request *rq)
 	case ADD_TO_MLQUEUE:
 		scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
 		break;
+	case FAILOVER:
+		scsi_mpath_failover_req(rq);
+		break;
 	default:
 		scsi_eh_scmd_add(cmd);
 		break;
@@ -1840,6 +1855,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	memset(cmd->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);
 	cmd->submitter = SUBMITTED_BY_BLOCK_LAYER;
 
+	if (req->cmd_flags & REQ_SCSI_MPATH)
+		scsi_mpath_start_request(req);
+
 	blk_mq_start_request(req);
 	reason = scsi_dispatch_cmd(cmd);
 	if (reason) {
@@ -2811,6 +2829,9 @@ EXPORT_SYMBOL(scsi_target_resume);
 
 static int __scsi_internal_device_block_nowait(struct scsi_device *sdev)
 {
+	if (scsi_mpath_enabled(sdev))
+		scsi_mpath_clear_current_path(sdev);
+
 	if (scsi_device_set_state(sdev, SDEV_BLOCK))
 		return scsi_device_set_state(sdev, SDEV_CREATED_BLOCK);
 
@@ -2927,6 +2948,10 @@ int scsi_internal_device_unblock_nowait(struct scsi_device *sdev,
 		return -EINVAL;
 	}
 
+	/* For a multipath device, mark the path live */
+	if (scsi_mpath_enabled(sdev))
+		scsi_mpath_set_live(sdev);
+
 	/*
 	 * Try to transition the scsi device to SDEV_RUNNING or one of the
 	 * offlined states and goose the device queue if successful.
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 96b350366670..544153a01b3f 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -103,6 +103,7 @@ enum scsi_disposition {
 	TIMEOUT_ERROR		= 0x2007,
 	SCSI_RETURN_NOT_HANDLED	= 0x2008,
 	FAST_IO_FAIL		= 0x2009,
+	FAILOVER		= 0x2010,
 };
 
 /*
-- 
2.41.0.rc2



* [RFC v1 5/8] scsi: Add scsi multipath sysfs hooks
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
                   ` (3 preceding siblings ...)
  2024-11-09  4:45 ` [RFC v1 4/8] scsi: Complete multipath request himanshu.madhani
@ 2024-11-09  4:45 ` himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 6/8] scsi: Add multipath support for device handler himanshu.madhani
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add sysfs hooks to:
- Show the current multipath state
- Show and update the multipath iopolicy

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
---
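The name/value table used by the iopolicy attribute can be exercised outside
the kernel with a small model such as the one below. It is a sketch only: the
model_* names are hypothetical, and model_streq() merely imitates the
documented behaviour of sysfs_streq() (a single trailing newline is ignored).
Unlike the store handler in this patch, the model returns the value stored in
the matching table entry rather than the loop index, so it would keep working
if the enum values were ever made non-contiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for enum scsi_mpath_iopolicy. */
enum model_iopolicy { MODEL_IOPOLICY_NUMA, MODEL_IOPOLICY_RR };

static const struct {
	enum model_iopolicy value;
	const char *name;
} model_iopolicies[] = {
	{ MODEL_IOPOLICY_NUMA, "NUMA" },
	{ MODEL_IOPOLICY_RR,   "Round-Robin" },
};

#define MODEL_NR_POLICIES \
	(sizeof(model_iopolicies) / sizeof(model_iopolicies[0]))

/* Compare two strings, ignoring one trailing newline on either side. */
static int model_streq(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);

	if (la && a[la - 1] == '\n')
		la--;
	if (lb && b[lb - 1] == '\n')
		lb--;
	return la == lb && memcmp(a, b, la) == 0;
}

/* Returns 0 and stores the policy on a match, -1 for an unknown name. */
static int model_parse_iopolicy(const char *buf, enum model_iopolicy *out)
{
	size_t i;

	assert(buf != NULL && out != NULL);
	for (i = 0; i < MODEL_NR_POLICIES; i++) {
		if (model_streq(buf, model_iopolicies[i].name)) {
			*out = model_iopolicies[i].value;
			return 0;
		}
	}
	return -1;
}
```

The comparison is case-sensitive, so "Round-Robin" followed by a newline (as
written by echo) is accepted while "round-robin" is rejected.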
 drivers/scsi/scsi_sysfs.c | 104 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 32f94db6d6bf..cc7dc5c30d2c 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1198,6 +1198,103 @@ sdev_show_preferred_path(struct device *dev,
 static DEVICE_ATTR(preferred_path, S_IRUGO, sdev_show_preferred_path, NULL);
 #endif
 
+#ifdef CONFIG_SCSI_MULTIPATH
+static const struct {
+	unsigned char	value;
+	char		*name;
+} scsi_multipath_iopolicy[] = {
+	{ SCSI_MPATH_IOPOLICY_NUMA, "NUMA" },
+	{ SCSI_MPATH_IOPOLICY_RR, "Round-Robin" },
+};
+static const char *scsi_mpath_policy_name(unsigned char policy)
+{
+	int i;
+	char *name = NULL;
+
+	for (i = 0; i < ARRAY_SIZE(scsi_multipath_iopolicy); i++) {
+		if (scsi_multipath_iopolicy[i].value == policy) {
+			name = scsi_multipath_iopolicy[i].name;
+			break;
+		}
+	}
+	return name;
+}
+
+static ssize_t
+sdev_show_multipath_iopolicy(struct device *dev,
+			     struct device_attribute *attr,
+			     char *buf)
+{
+	struct scsi_device *sdev = to_scsi_device(dev);
+	const char *name = scsi_mpath_policy_name(sdev->mpath_iopolicy);
+
+	if (!sdev->mpath_disk)
+		return -EINVAL;
+
+	return sysfs_emit(buf, "%s\n", name);
+}
+
+static ssize_t sdev_store_multipath_iopolicy(struct device *dev,
+    struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct scsi_device *sdev = to_scsi_device(dev);
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(scsi_multipath_iopolicy); i++) {
+		if (sysfs_streq(buf, scsi_mpath_policy_name(i))) {
+			scsi_multipath_iopolicy_update(sdev, i);
+			return count;
+		}
+	}
+
+	return -EINVAL;
+}
+static DEVICE_ATTR(multipath_iopolicy, S_IRUGO | S_IWUSR,
+    sdev_show_multipath_iopolicy, sdev_store_multipath_iopolicy);
+
+static const struct {
+	unsigned char	value;
+	char		*name;
+} scsi_mpath_states[] = {
+	{ SCSI_MPATH_OPTIMAL,	"active/optimized" },
+	{ SCSI_MPATH_ACTIVE,	"active/non-optimized" },
+	{ SCSI_MPATH_STANDBY,	"standby" },
+	{ SCSI_MPATH_UNAVAILABLE,"unavailable" },
+	{ SCSI_MPATH_LBA,	"lba-dependent" },
+	{ SCSI_MPATH_OFFLINE,	"offline" },
+	{ SCSI_MPATH_TRANSITIONING,"transitioning" },
+};
+
+static const char *scsi_mpath_state_names(unsigned char state)
+{
+	int i;
+	char *name = NULL;
+
+	for (i = 0; i < ARRAY_SIZE(scsi_mpath_states); i++) {
+		if (scsi_mpath_states[i].value == state) {
+		    name = scsi_mpath_states[i].name;
+		    break;
+		}
+	}
+	return name;
+}
+
+static ssize_t
+sdev_show_multipath_state(struct device *dev,
+			  struct device_attribute *attr,
+			  char *buf)
+{
+	struct scsi_device *sdev = to_scsi_device(dev);
+	const char *name = scsi_mpath_state_names(sdev->mpath_state);
+
+	if (!sdev->mpath_disk)
+		return -EINVAL;
+
+	return sysfs_emit(buf, "%s\n", name);
+}
+static DEVICE_ATTR(multipath_state, S_IRUGO, sdev_show_multipath_state, NULL);
+#endif
+
 static ssize_t
 sdev_show_queue_ramp_up_period(struct device *dev,
 			       struct device_attribute *attr,
@@ -1335,6 +1432,10 @@ static struct attribute *scsi_sdev_attrs[] = {
 	&dev_attr_dh_state.attr,
 	&dev_attr_access_state.attr,
 	&dev_attr_preferred_path.attr,
+#endif
+#ifdef CONFIG_SCSI_MULTIPATH
+	&dev_attr_multipath_iopolicy.attr,
+	&dev_attr_multipath_state.attr,
 #endif
 	&dev_attr_queue_ramp_up_period.attr,
 	&dev_attr_cdl_supported.attr,
@@ -1500,6 +1601,9 @@ void __scsi_remove_device(struct scsi_device *sdev)
 	} else
 		put_device(&sdev->sdev_dev);
 
+	if (scsi_is_sdev_multipath(sdev))
+		scsi_mpath_dev_release(sdev);
+
 	/*
 	 * Stop accepting new requests and wait until all queuecommand() and
 	 * scsi_run_queue() invocations have finished before tearing down the
-- 
2.41.0.rc2



* [RFC v1 6/8] scsi: Add multipath support for device handler
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
                   ` (4 preceding siblings ...)
  2024-11-09  4:45 ` [RFC v1 5/8] scsi: Add scsi multipath sysfs hooks himanshu.madhani
@ 2024-11-09  4:45 ` himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 7/8] scsi: Add multipath disk init code for sd driver himanshu.madhani
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add multipath initialization during device handler (DH) attachment.
Also initialize the multipath port group data for the scsi_device.

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 15 +++++++++++++++
 drivers/scsi/scsi_dh.c                     |  3 +++
 2 files changed, 18 insertions(+)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index 4eb0837298d4..29bd6517a2e3 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -258,6 +258,21 @@ static struct alua_port_group *alua_alloc_pg(struct scsi_device *sdev,
 		return tmp_pg;
 	}
 
+	if (scsi_mpath_enabled(sdev)) {
+		struct scsi_mpath_dh_data *dh_data = sdev->mpath_pg_data;
+
+		dh_data->group_id = pg->group_id;
+		dh_data->tpgs = pg->tpgs;
+		dh_data->state = pg->state;
+		dh_data->valid_states = pg->valid_states;
+		dh_data->prefrence = pg->pref;
+		dh_data->is_active = 1;
+		dh_data->device_id_str = kstrdup(pg->device_id_str, GFP_KERNEL);
+		dh_data->device_id_len = pg->device_id_len;
+
+		sdev->host->mpath_alua_grpid = pg->group_id;
+	}
+
 	list_add(&pg->node, &port_group_list);
 	spin_unlock(&port_group_lock);
 
diff --git a/drivers/scsi/scsi_dh.c b/drivers/scsi/scsi_dh.c
index 7b56e00c7df6..d61eddc3c1f8 100644
--- a/drivers/scsi/scsi_dh.c
+++ b/drivers/scsi/scsi_dh.c
@@ -129,6 +129,9 @@ static int scsi_dh_handler_attach(struct scsi_device *sdev,
 	if (!try_module_get(scsi_dh->module))
 		return -EINVAL;
 
+	if (scsi_mpath_enabled(sdev))
+		scsi_multipath_init(sdev);
+
 	error = scsi_dh->attach(sdev);
 	if (error != SCSI_DH_OK) {
 		switch (error) {
-- 
2.41.0.rc2



* [RFC v1 7/8] scsi: Add multipath disk init code for sd driver
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
                   ` (5 preceding siblings ...)
  2024-11-09  4:45 ` [RFC v1 6/8] scsi: Add multipath support for device handler himanshu.madhani
@ 2024-11-09  4:45 ` himanshu.madhani
  2024-11-09  4:45 ` [RFC v1 8/8] scsi_debug: Add module parameter for ALUA multipath himanshu.madhani
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add multipath disk allocation and initialization code to the
SCSI disk (sd) driver.

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
---
 drivers/scsi/sd.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 41e2dfa2d67d..b4727b599794 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1483,6 +1483,9 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
 
 static bool sd_need_revalidate(struct gendisk *disk, struct scsi_disk *sdkp)
 {
+	if (scsi_is_sdev_multipath(sdkp->device))
+		return true;
+
 	if (sdkp->device->removable || sdkp->write_prot) {
 		if (disk_check_media_change(disk))
 			return true;
@@ -1892,6 +1895,10 @@ static int sd_get_unique_id(struct gendisk *disk, u8 id[16],
 		if (len == 16)
 			break;
 	}
+
+	if (scsi_mpath_enabled(sdev))
+		ret = scsi_mpath_unique_id(sdev, id, type);
+
 out_unlock:
 	rcu_read_unlock();
 	return ret;
@@ -3817,6 +3824,33 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	if (sdkp->media_present && scsi_device_supports_vpd(sdp))
 		sd_read_cpr(sdkp);
 
+	/* For a multipath device, adjust the queue limits of the mpath disk */
+	if (scsi_is_sdev_multipath(sdp)) {
+		struct queue_limits *mpath_lim = &sdp->mpath_disk->queue->limits;
+
+		blk_mq_freeze_queue(sdp->mpath_disk->queue);
+		lim = queue_limits_start_update(sdp->mpath_disk->queue);
+		lim.logical_block_size = mpath_lim->logical_block_size;
+		lim.physical_block_size = mpath_lim->physical_block_size;
+		lim.io_min = mpath_lim->io_min;
+		lim.io_opt = mpath_lim->io_opt;
+		queue_limits_stack_bdev(&lim, sdp->mpath_disk->part0, 0,
+		    sdp->mpath_disk->disk_name);
+
+		sdp->mpath_disk->flags |= GENHD_FL_HIDDEN;
+
+		set_capacity_and_notify(sdp->mpath_disk,
+		    logical_to_sectors(sdp, sdkp->capacity));
+
+		err = queue_limits_commit_update(sdp->mpath_disk->queue, &lim);
+
+		scsi_mpath_revalidate_path(sdp->mpath_disk,
+		    logical_to_sectors(sdp, sdkp->capacity));
+
+		blk_mq_unfreeze_queue(sdp->mpath_disk->queue);
+		if (err)
+			return err;
+	}
 	/*
 	 * For a zoned drive, revalidating the zones can be done only once
 	 * the gendisk capacity is set. So if this fails, set back the gendisk
@@ -3943,6 +3977,9 @@ static int sd_probe(struct device *dev)
 	if (!sdkp)
 		goto out;
 
+	if (scsi_mpath_enabled(sdp) && sdp->is_shared)
+		scsi_mpath_alloc_disk(sdp);
+
 	gd = blk_mq_alloc_disk_for_queue(sdp->request_queue,
 					 &sd_bio_compl_lkclass);
 	if (!gd)
@@ -3960,6 +3997,10 @@ static int sd_probe(struct device *dev)
 		goto out_free_index;
 	}
 
+	if (scsi_is_sdev_multipath(sdp))
+		snprintf(sdp->mpath_disk->disk_name, DISK_NAME_LEN, "mpath%dsd%d",
+		    sdp->host->host_no, index);
+
 	sdkp->device = sdp;
 	sdkp->disk = gd;
 	sdkp->index = index;
@@ -4021,6 +4062,21 @@ static int sd_probe(struct device *dev)
 			sdp->host->rpm_autosuspend_delay);
 	}
 
+	if (scsi_is_sdev_multipath(sdp)) {
+		sdp->mpath_disk->major = sd_major((index & 0xf0) >> 4);
+		sdp->mpath_disk->first_minor = ((index & 0xf) << 4) | (index & 0xfff00);
+		sdp->mpath_disk->minors = SD_MINORS;
+
+		scsi_mpath_add_disk(sdp);
+
+		if (!test_bit(SCSI_MPATH_DISK_LIVE, &sdp->mpath_flags)) {
+			device_unregister(&sdkp->disk_dev);
+			clear_bit(SCSI_MPATH_DISK_LIVE, &sdp->mpath_flags);
+			put_disk(sdp->mpath_disk);
+			goto out;
+		}
+	}
+
 	error = device_add_disk(dev, gd, NULL);
 	if (error) {
 		device_unregister(&sdkp->disk_dev);
@@ -4074,12 +4130,20 @@ static int sd_remove(struct device *dev)
 		sd_shutdown(dev);
 
 	put_disk(sdkp->disk);
+
+	if (scsi_is_sdev_multipath(sdkp->device))
+		scsi_mpath_remove_disk(sdkp->device);
+
 	return 0;
 }
 
 static void scsi_disk_release(struct device *dev)
 {
 	struct scsi_disk *sdkp = to_scsi_disk(dev);
+	struct scsi_device *sdp = sdkp->device;
+
+	if (scsi_is_sdev_multipath(sdp))
+		scsi_mpath_dev_release(sdp);
 
 	ida_free(&sd_index_ida, sdkp->index);
 	put_device(&sdkp->device->sdev_gendev);
@@ -4171,6 +4235,25 @@ static void sd_shutdown(struct device *dev)
 	if (pm_runtime_suspended(dev))
 		return;
 
+	if (scsi_is_sdev_multipath(sdkp->device)) {
+		struct scsi_device *sdp = sdkp->device;
+		bool last_path = false;
+
+		if (scsi_mpath_clear_current_path(sdp))
+			synchronize_srcu(&sdp->host->mpath_dev->srcu);
+
+		mutex_lock(&sdp->host->mpath_dev->mpath_lock);
+		list_del_rcu(&sdp->siblings);
+		if (list_empty(&sdp->host->mpath_sdev)) {
+			list_del_init(&sdp->mpath_entry);
+			last_path = true;
+		}
+		mutex_unlock(&sdp->host->mpath_dev->mpath_lock);
+
+		if (last_path)
+			scsi_mpath_shutdown_disk(sdp);
+	}
+
 	if (sdkp->WCE && sdkp->media_present) {
 		sd_printk(KERN_NOTICE, sdkp, "Synchronizing SCSI cache\n");
 		sd_sync_cache(sdkp);
-- 
2.41.0.rc2



* [RFC v1 8/8] scsi_debug: Add module parameter for ALUA multipath
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
                   ` (6 preceding siblings ...)
  2024-11-09  4:45 ` [RFC v1 7/8] scsi: Add multipath disk init code for sd driver himanshu.madhani
@ 2024-11-09  4:45 ` himanshu.madhani
  2024-11-10 21:15 ` [RFC v1 0/8] scsi: Multipath support for scsi disk devices Bart Van Assche
  2024-11-22 14:27 ` Hannes Reinecke
  9 siblings, 0 replies; 15+ messages in thread
From: himanshu.madhani @ 2024-11-09  4:45 UTC (permalink / raw)
  To: martin.petersen, linux-scsi

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
---
 drivers/scsi/scsi_debug.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 9be2a6a00530..811d3005c0a5 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -167,6 +167,7 @@ static const char *sdebug_version_date = "20210520";
 #define DEF_TUR_MS_TO_READY 0
 #define DEF_UUID_CTL 0
 #define JDELAY_OVERRIDDEN -9999
+#define DEF_ALUA_MPATH	0
 
 /* Default parameters for ZBC drives */
 #define DEF_ZBC_ZONE_SIZE_MB	128
@@ -884,6 +885,8 @@ static bool write_since_sync;
 static bool sdebug_statistics = DEF_STATISTICS;
 static bool sdebug_wp;
 static bool sdebug_allow_restart;
+static unsigned int sdebug_alua_mpath = DEF_ALUA_MPATH;
+
 static enum {
 	BLK_ZONED_NONE	= 0,
 	BLK_ZONED_HA	= 1,
@@ -2070,8 +2073,14 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	arr[3] = 2;    /* response_data_format==2 */
 	arr[4] = SDEBUG_LONG_INQ_SZ - 5;
 	arr[5] = (int)have_dif_prot;	/* PROTECT bit */
-	if (sdebug_vpd_use_hostno == 0)
-		arr[5] |= 0x10; /* claim: implicit TPGS */
+	if (sdebug_vpd_use_hostno == 0) {
+		 arr[5] |= 0x10;
+	} else {
+		if (sdebug_alua_mpath == 1)
+			arr[5] |= 0x11;
+		else
+			arr[5] |= 0x10;
+	}
 	arr[6] = 0x10; /* claim: MultiP */
 	/* arr[6] |= 0x40; ... claim: EncServ (enclosure services) */
 	arr[7] = 0xa; /* claim: LINKED + CMDQUE */
@@ -6643,6 +6652,7 @@ module_param_named(zone_max_open, sdeb_zbc_max_open, int, S_IRUGO);
 module_param_named(zone_nr_conv, sdeb_zbc_nr_conv, int, S_IRUGO);
 module_param_named(zone_size_mb, sdeb_zbc_zone_size_mb, int, S_IRUGO);
 module_param_named(allow_restart, sdebug_allow_restart, bool, S_IRUGO | S_IWUSR);
+module_param_named(alua_mpath, sdebug_alua_mpath, uint, S_IRUGO | S_IWUSR);
 
 MODULE_AUTHOR("Eric Youngdale + Douglas Gilbert");
 MODULE_DESCRIPTION("SCSI debug adapter driver");
@@ -6722,6 +6732,8 @@ MODULE_PARM_DESC(zone_max_open, "Maximum number of open zones; [0] for no limit
 MODULE_PARM_DESC(zone_nr_conv, "Number of conventional zones (def=1)");
 MODULE_PARM_DESC(zone_size_mb, "Zone size in MiB (def=auto)");
 MODULE_PARM_DESC(allow_restart, "Set scsi_device's allow_restart flag(def=0)");
+MODULE_PARM_DESC(alua_mpath,
+	"ALUA multipath mode: 0 = no ALUA support (default), 1 = implicit ALUA, 2 = explicit and implicit ALUA");
 
 #define SDEBUG_INFO_LEN 256
 static char sdebug_info[SDEBUG_INFO_LEN];
-- 
2.41.0.rc2



* Re: [RFC v1 2/8] scsi: create multipath capable scsi host
  2024-11-09  4:45 ` [RFC v1 2/8] scsi: create multipath capable scsi host himanshu.madhani
@ 2024-11-10 21:11   ` Bart Van Assche
  0 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2024-11-10 21:11 UTC (permalink / raw)
  To: himanshu.madhani, martin.petersen, linux-scsi

On 11/8/24 8:45 PM, himanshu.madhani@oracle.com wrote:
>   #include "scsi_priv.h"
>   #include "scsi_logging.h"
> @@ -394,6 +395,14 @@ struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int priv
>   	struct Scsi_Host *shost;
>   	int index;
>   
> +#ifdef CONFIG_SCSI_MULTIPATH
> +	struct scsi_mpath *mpath_dev;
> +	size_t	size = sizeof(*mpath_dev);
> +
> +	size += num_possible_nodes() * sizeof(struct mpath_dev *);
> +	privsize = privsize + size;
> +#endif
> +
>   	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL);
>   	if (!shost)
>   		return NULL;
> @@ -409,6 +418,9 @@ struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int priv
>   	init_waitqueue_head(&shost->host_wait);
>   	mutex_init(&shost->scan_mutex);
>   
> +#ifdef CONFIG_SCSI_MULTIPATH
> +	INIT_LIST_HEAD(&shost->mpath_sdev);
> +#endif
>   	index = ida_alloc(&host_index_ida, GFP_KERNEL);
>   	if (index < 0) {
>   		kfree(shost);

 From Documentation/process/4.Coding.rst: "As a general rule, #ifdef use
should be confined to header files whenever possible." Please follow
this advice.
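
A minimal userspace sketch of that pattern, applied to the two hunks quoted
above, might look as follows. It assumes a hypothetical MODEL_MULTIPATH macro
in place of CONFIG_SCSI_MULTIPATH, and the model_* types and helpers are
invented for illustration; none of them appear in the posted series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types. */
struct list_node { struct list_node *next, *prev; };

struct model_host {
	size_t privsize;
#ifdef MODEL_MULTIPATH
	struct list_node mpath_sdev;
#endif
};

/* ---- this part would live in a header ---- */
#ifdef MODEL_MULTIPATH
static inline size_t model_mpath_extra_size(unsigned int nr_nodes)
{
	return (size_t)nr_nodes * sizeof(void *);
}
static inline void model_mpath_host_init(struct model_host *h)
{
	h->mpath_sdev.next = h->mpath_sdev.prev = &h->mpath_sdev;
}
#else
/* Stubs: the calls compile away when the feature is disabled. */
static inline size_t model_mpath_extra_size(unsigned int nr_nodes)
{
	(void)nr_nodes;
	return 0;
}
static inline void model_mpath_host_init(struct model_host *h)
{
	(void)h;
}
#endif

/* ---- the .c file then needs no #ifdef at all ---- */
static void model_host_setup(struct model_host *h, size_t privsize,
			     unsigned int nr_nodes)
{
	assert(h != NULL);
	h->privsize = privsize + model_mpath_extra_size(nr_nodes);
	model_mpath_host_init(h);
}
```

With the stubs in the header, the allocation and init paths read the same
whether or not the feature is built in.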

Thanks,

Bart.


* Re: [RFC v1 0/8] scsi: Multipath support for scsi disk devices.
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
                   ` (7 preceding siblings ...)
  2024-11-09  4:45 ` [RFC v1 8/8] scsi_debug: Add module parameter for ALUA multipath himanshu.madhani
@ 2024-11-10 21:15 ` Bart Van Assche
  2024-11-12 20:46   ` Himanshu Madhani
  2024-11-22 14:27 ` Hannes Reinecke
  9 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2024-11-10 21:15 UTC (permalink / raw)
  To: himanshu.madhani, martin.petersen, linux-scsi


On 11/8/24 8:45 PM, himanshu.madhani@oracle.com wrote:
> Here is a very early RFC for multipath support in the scsi layer. This patch series
> implements native multipath support for scsi disks devices.
> 
> In this series, I am providing conceptual changes which still needs work. However,
> I wanted to get this RFC out to get community feedback on the direction of changes.
> 
> This RFC follows NVMe multipath implementation closely for SCSI multipath. Currently,
> SCSI multipath only supports disk devices which advertises ALUA (Asymmetric Logical
> Unit Access) capability in the Inquiry response data.

Something very important is missing from the cover letter, namely the
motivation for starting this initiative. Why add native multipath
support to the SCSI core instead of using dm-multipath? Isn't one of the
goals of the Linux kernel not to duplicate functionality that already
exists? How does the new infrastructure compare with dm-multipath in
terms of performance and functionality?

Thanks,

Bart.



* Re: [RFC v1 0/8] scsi: Multipath support for scsi disk devices.
  2024-11-10 21:15 ` [RFC v1 0/8] scsi: Multipath support for scsi disk devices Bart Van Assche
@ 2024-11-12 20:46   ` Himanshu Madhani
  0 siblings, 0 replies; 15+ messages in thread
From: Himanshu Madhani @ 2024-11-12 20:46 UTC (permalink / raw)
  To: Bart Van Assche, martin.petersen, linux-scsi

Hi Bart,

On 11/10/24 13:15, Bart Van Assche wrote:
> 
> On 11/8/24 8:45 PM, himanshu.madhani@oracle.com wrote:
>> Here is a very early RFC for multipath support in the scsi layer. This 
>> patch series
>> implements native multipath support for scsi disks devices.
>>
>> In this series, I am providing conceptual changes which still needs 
>> work. However,
>> I wanted to get this RFC out to get community feedback on the 
>> direction of changes.
>>
>> This RFC follows NVMe multipath implementation closely for SCSI 
>> multipath. Currently,
>> SCSI multipath only supports disk devices which advertises ALUA 
>> (Asymmetric Logical
>> Unit Access) capability in the Inquiry response data.
> 
> Something very important is missing from the cover letter, namely a
> motivation of why this initiative has been started. Why to add native
> multipath support to the SCSI core instead of using dm-multipath? Isn't
> one of the goals of the Linux kernel not to duplicate functionality that
> already exists? How does the new infrastructure compare with
> dm-multipath from the point of view of performance and functionality?
> 

Sorry about the missing motivation section in the cover letter. I'll add
it in v2 when I am ready to send an updated version of this RFC.

Here's the motivation:

1. Native multipath provides seamless multipath configuration and setup
for SCSI without any other dependencies, especially for discovery and
assembly of RAID arrays. My motivation with native SCSI multipath is to
avoid needing any third-party daemon for discovery and assembly of
multipath devices, which can sometimes create issues if devices are not
discovered properly. A native implementation avoids those additional
steps and thereby provides plug-and-play capability for SCSI multipath
configurations. Native support will also help modernize the SCSI code
with respect to multipath and provide tighter integration with the SCSI
stack.


2. From a performance point of view, I believe that switching to
RCU-based path selection logic will provide faster path failover and
improve overall I/O latency. I have not spent time collecting
performance data for this RFC. I am hoping to provide more comprehensive
data with the next RFC update.

I do not believe this duplicates functionality, since I am providing an
in-kernel multipath option that gives users a choice between the native
and the out-of-kernel multipath implementation based on their needs.


> Thanks,
> 
> Bart.
> 

-- 
Himanshu Madhani                                Oracle Linux Engineering



* Re: [RFC v1 1/8] scsi: Add multipath device support
  2024-11-09  4:45 ` [RFC v1 1/8] scsi: Add multipath device support himanshu.madhani
@ 2024-11-12 21:09   ` Bart Van Assche
  2024-11-13  0:20     ` Himanshu Madhani
  0 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2024-11-12 21:09 UTC (permalink / raw)
  To: himanshu.madhani, martin.petersen, linux-scsi

On 11/8/24 8:45 PM, himanshu.madhani@oracle.com wrote:
> +		switch(sdev->mpath_state) {
> +		case SCSI_MPATH_OPTIMAL:
> +		    if (distance < found_distance) {
> +			    found_distance = distance;
> +			    sdev_found = sdev;
> +		    }

Please follow the Linux kernel coding style. As an example, I have never
seen anyone else indenting statements under a case label in the kernel
with four spaces. You may want to run the entire patch series through
clang-format.
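
For illustration, a small standalone sketch of the same kind of
selection loop laid out in kernel style: case labels at the same depth
as the switch, bodies indented with one tab, a space after "switch".
The demo_* types and the non-optimal fallback are assumptions made for
this example and are not taken from the patch:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum demo_mpath_state {
	DEMO_MPATH_OPTIMAL,
	DEMO_MPATH_NONOPTIMAL,
	DEMO_MPATH_FAILED,
};

struct demo_path {
	enum demo_mpath_state state;
	int distance;		/* e.g. a NUMA distance to the path */
};

/* Return the nearest optimal path, else the nearest non-optimal one. */
static const struct demo_path *demo_find_path(const struct demo_path *paths,
					      size_t n)
{
	const struct demo_path *found = NULL, *fallback = NULL;
	int found_distance = INT_MAX, fallback_distance = INT_MAX;
	size_t i;

	assert(paths != NULL || n == 0);

	for (i = 0; i < n; i++) {
		const struct demo_path *p = &paths[i];

		switch (p->state) {
		case DEMO_MPATH_OPTIMAL:
			if (p->distance < found_distance) {
				found_distance = p->distance;
				found = p;
			}
			break;
		case DEMO_MPATH_NONOPTIMAL:
			if (p->distance < fallback_distance) {
				fallback_distance = p->distance;
				fallback = p;
			}
			break;
		default:
			/* failed or unknown paths are never selected */
			break;
		}
	}

	return found ? found : fallback;
}
```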

Thanks,

Bart.


* Re: [RFC v1 1/8] scsi: Add multipath device support
  2024-11-12 21:09   ` Bart Van Assche
@ 2024-11-13  0:20     ` Himanshu Madhani
  0 siblings, 0 replies; 15+ messages in thread
From: Himanshu Madhani @ 2024-11-13  0:20 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Martin Petersen, linux-scsi@vger.kernel.org



> On Nov 12, 2024, at 13:09, Bart Van Assche <bvanassche@acm.org> wrote:
> 
> On 11/8/24 8:45 PM, himanshu.madhani@oracle.com wrote:
>> + switch(sdev->mpath_state) {
>> + case SCSI_MPATH_OPTIMAL:
>> +    if (distance < found_distance) {
>> +    found_distance = distance;
>> +    sdev_found = sdev;
>> +    }
> 
> Please follow the Linux kernel coding style. As an example, I have never
> seen anyone else indenting statements under a case label in the kernel
> with four spaces. You may want to run the entire patch series through
> clang-format.
> 
Sure, will do that.
> Thanks,
> 
> Bart.



* Re: [RFC v1 0/8] scsi: Multipath support for scsi disk devices.
  2024-11-09  4:45 [RFC v1 0/8] scsi: Multipath support for scsi disk devices himanshu.madhani
                   ` (8 preceding siblings ...)
  2024-11-10 21:15 ` [RFC v1 0/8] scsi: Multipath support for scsi disk devices Bart Van Assche
@ 2024-11-22 14:27 ` Hannes Reinecke
  9 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-11-22 14:27 UTC (permalink / raw)
  To: himanshu.madhani, martin.petersen, linux-scsi

On 11/9/24 05:45, himanshu.madhani@oracle.com wrote:
> From: Himanshu Madhani <himanshu.madhani@oracle.com>
> 
> Hello Folks,
> 
> Here is a very early RFC for multipath support in the scsi layer. This patch series
> implements native multipath support for scsi disks devices.
> 
> In this series, I am providing conceptual changes which still needs work. However,
> I wanted to get this RFC out to get community feedback on the direction of changes.
> 
> This RFC follows NVMe multipath implementation closely for SCSI multipath. Currently,
> SCSI multipath only supports disk devices which advertises ALUA (Asymmetric Logical
> Unit Access) capability in the Inquiry response data.
> 
First of all, thank you for doing this.
It has been on my to-do list for a long time.

However, the one crucial reason I kept pushing it back is:

Residuals.

NVMe native multipathing works because NVMe is an 'all-or-nothing'
protocol, i.e. either the entire I/O has been completed, or nothing has
happened.
Which means for any failure we can safely retry the entire I/O on a 
different path (that's the 'steal_bio' thingie), knowing that it's safe
to do so.

For SCSI, however, this is not the case; it's perfectly valid for a
target to do a partial completion and ask the initiator to retry the
remainder. And this partial completion might end at any position within
the bvec, requiring us to resend the bio from an arbitrary starting
position. Meaning we cannot do a blind 'steal_bio' thing.

So: have you evaluated your series with respect to residuals?
Have you _measured_ whether residuals are happening?
Have you considered how your patchset could treat residuals?
(It _might_ be possible to resend the entire I/O over another path,
even if the command had been partially completed. That's perfectly safe
for reads, but for writes you have to be extremely careful not to cause
data corruption. We had some fun discussions about this over on the
NVMe side ...)
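
To make the distinction concrete, here is a minimal standalone sketch of
one possible retry policy on path failure. It is not what the series
implements; the demo_* names and the policy itself are assumptions. It
treats a full resend as acceptable for reads, or when nothing was
transferred, and otherwise resumes a write at the first byte that was
not transferred:

```c
#include <assert.h>
#include <stddef.h>

enum demo_dir { DEMO_READ, DEMO_WRITE };

enum demo_action {
	DEMO_COMPLETE,		/* nothing left to transfer */
	DEMO_RETRY_WHOLE,	/* resend the full I/O on another path */
	DEMO_RESUME_AT_OFFSET,	/* resend only the untransferred tail */
};

struct demo_retry {
	enum demo_action action;
	size_t offset;		/* byte offset at which the resend starts */
	size_t len;		/* number of bytes to resend */
};

/*
 * total: bytes requested; resid: bytes the target reported as NOT
 * transferred (the residual).
 */
static struct demo_retry demo_plan_retry(enum demo_dir dir, size_t total,
					 size_t resid)
{
	struct demo_retry r = { DEMO_COMPLETE, 0, 0 };
	size_t done;

	assert(resid <= total);
	done = total - resid;

	if (resid == 0)
		return r;

	/* All-or-nothing case, or a read: a blind full retry is safe. */
	if (done == 0 || dir == DEMO_READ) {
		r.action = DEMO_RETRY_WHOLE;
		r.len = total;
		return r;
	}

	/* Partially completed write: conservatively resend only the tail. */
	r.action = DEMO_RESUME_AT_OFFSET;
	r.offset = done;
	r.len = resid;
	return r;
}
```

For a 4096-byte write with a residual of 1024, this plans a resend of
1024 bytes starting at offset 3072, which is exactly the case a blind
'steal_bio' cannot express.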

And: please drop the device handler thingie for this, and concentrate
on ALUA. No point in carrying legacy stuff around.
_AND_ you have to evaluate the ALUA settings anyway to get a decent
path selection.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

