[RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
@ 2025-03-14  1:29 JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 01/19] scsi: scsi_error: Define framework for LUN/target based error handle JiangJianJun
                   ` (19 more replies)
  0 siblings, 20 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

It's unbearable for systems with large scale scsi devices share HBAs to
block all devices' IOs when handle error commands, we need a new error
handle mechanism to address this issue.

I consulted about this issue a year ago, the discuss link can be found in
refenence. Hannes replied about why we have to block the SCSI host
then perform error recovery kindly. I think it's unnecessary to block
SCSI host for all drivers and can try a small level recovery(LUN based for
example) first to avoid block the SCSI host.

The new error handle mechanism introduced in this patchset has been
developed and tested with out self developed hardware since one year
ago, now we want this mechanism can be used by more drivers.

Drivers can decide if using the new error handle mechanism and how to
handle error commands when scsi_device are scanned,the new mechanism
makes SCSI error handle more flexible.

SCSI error recovery strategy after blocking host's IO is mainly
following steps:

- LUN reset
- Target reset
- Bus reset
- Host reset

Some drivers did not implement callbacks for host reset, it's unnecessary
to block host's IO for these drivers. For example, smartpqi only registered
device reset, if device reset failed, it's meaningless to fallback to target
reset, bus reset or host reset any more, because these steps would also
failed.

Here are some drivers we concerned:(there are too many kinds of drivers
to figure out, so here I just list some drivers I am familiar with)

+-------------+--------------+--------------+-----------+------------+
|  drivers    | device_reset | target_reset | bus_reset | host_reset |
+-------------+--------------+--------------+-----------+------------+
| mpt3sas     |     Y        |     Y        |    N      |    Y       |
+-------------+--------------+--------------+-----------+------------+
| smartpqi    |     Y        |     N        |    N      |    N       |
+-------------+--------------+--------------+-----------+------------+
| megaraidsas |     N        |     Y        |    N      |    Y       |
+-------------+--------------+--------------+-----------+------------+
| virtioscsi  |     Y        |     N        |    N      |    N       |
+-------------+--------------+--------------+-----------+------------+
| iscsi_tcp   |     Y        |     Y        |    N      |    N       |
+-------------+--------------+--------------+-----------+------------+
| hisisas     |     Y        |     Y        |    N      |    N       |
+-------------+--------------+--------------+-----------+------------+

For LUN based error handle, when scsi command is classified as error,
we would block the scsi device's IO and try to recover this scsi
device, if still can not recover all error commands, it might
fallback to target or host level recovery.

It's same for target based error handle, but target based error handle
would block the scsi target's IO then try to recover the error commands
of this target.

The first patch defines basic framework to support LUN/target based error
handle mechanism, three key operations are abstracted which are:
 - add error command
 - wake up error handle
 - block IOs when error command is added and recoverying.

Drivers can implement these three function callbacks and setup to SCSI
middle level; I also add a general LUN/target based error handle strategy
which can be called directly from drivers to implement LUN/tartget based
error handle.

The changes of SCSI middle level's error handle are tested with scsi_debug
which support single LUN error injection, the scsi_debug patches can be
found in reference, following scenarios are tested.

Scenario1: LUN based error handle is enabled:
+-----------+---------+-------------------------------------------------------+
| lun reset | TUR     | Desired result                                        |
+ --------- + ------- + ------------------------------------------------------+
| success   | success | retry or finish with  EIO(may offline disk)           |
+ --------- + ------- + ------------------------------------------------------+
| success   | fail    | fallback to host  recovery, retry or finish with      |
|           |         | EIO(may offline disk)                                 |
+ --------- + ------- + ------------------------------------------------------+
| fail      | NA      | fallback to host  recovery, retry or finish with      |
|           |         | EIO(may offline disk)                                 |
+ --------- + ------- + ------------------------------------------------------+

Scenario2: target based error handle is enabled:
+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | fallback to host  recovery,  |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+

Scenario3: both LUN and target based error handle are enabled:
+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | lun recovery fallback to     |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk                         |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | lun recovery fallback to     |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk                         |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | lun recovery fallback to     |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or fhinsi with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | lun recovery fallback to     |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or fhinsi with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+

References: https://lore.kernel.org/linux-scsi/20230815122316.4129333-1-haowenchao2@huawei.com/
References: https://lore.kernel.org/linux-scsi/71e09bb4-ff0a-23fe-38b4-fe6425670efa@huawei.com/

Wenchao Hao (19):
  scsi: scsi_error: Define framework for LUN/target based error handle
  scsi: scsi_error: Move complete variable eh_action from shost to
    sdevice
  scsi: scsi_error: Check if to do reset in scsi_try_xxx_reset
  scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
  scsi: scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
  scsi: scsi_error: Add flags to mark error handle steps has done
  scsi: scsi_error: Add helper to handle scsi device's error command
    list
  scsi: scsi_error: Add a general LUN based error handler
  scsi: core: increase/decrease target_busy without check can_queue
  scsi: scsi_error: Add helper to handle scsi target's error command
    list
  scsi: scsi_error: Add a general target based error handler
  scsi: scsi_debug: Add param to control LUN bassed error handler
  scsi: scsi_debug: Add param to control target based error handle
  scsi: mpt3sas: Add param to control LUN based error handle
  scsi: mpt3sas: Add param to control target based error handle
  scsi: smartpqi: Add param to control LUN based error handle
  scsi: megaraid_sas: Add param to control target based error handle
  scsi: virtio_scsi: Add param to control LUN based error handle
  scsi: iscsi_tcp: Add param to control LUN based error handle

 drivers/scsi/iscsi_tcp.c                  |  20 +
 drivers/scsi/megaraid/megaraid_sas_base.c |  20 +
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      |  28 +
 drivers/scsi/scsi_debug.c                 |  24 +
 drivers/scsi/scsi_error.c                 | 756 ++++++++++++++++++++--
 drivers/scsi/scsi_lib.c                   |  23 +-
 drivers/scsi/scsi_priv.h                  |  18 +
 drivers/scsi/smartpqi/smartpqi_init.c     |  14 +
 drivers/scsi/virtio_scsi.c                |  16 +-
 include/scsi/scsi_device.h                |  97 +++
 include/scsi/scsi_eh.h                    |   8 +
 include/scsi/scsi_host.h                  |   2 -
 12 files changed, 963 insertions(+), 63 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 01/19] scsi: scsi_error: Define framework for LUN/target based error handle
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:49   ` Bart Van Assche
  2025-03-14  1:29 ` [RFC PATCH v3 02/19] scsi: scsi_error: Move complete variable eh_action from shost to sdevice JiangJianJun
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

The old scsi error handle logic is based on host, once a scsi command
in one LUN of this host is classfied as failed, SCSI mid-level would
set the whole host to recovery state, and no IO can be submitted to
all LUNs of this host any more before recovery finished, while the
recovery process might take a long time to finish.
It's unreasonable when there are a lot of LUNs in one host.

This change introduce a way for driver to implement its own
error handle logic which can be based on scsi LUN or scsi target
as minimum unit.

scsi_device_eh is defined for error handle based on scsi LUN, and
pointer struct scsi_device_eh "eh" is added in scsi_device, which
is NULL by default.
LLDs can initialize the sdev->eh in hostt->slave_alloc to implement an
scsi LUN based error handle. If this member is not NULL, SCSI mid-level
would branch to drivers' error handler rather than the old one which
block whole host's IO.

scsi_target_eh is defined for error handle based on scsi target, and
pointer struct scsi_target_eh "eh" is added in scsi_target, which is NULL
by default.
LLDs can initialize the starget->eh in hostt->target_alloc to implement
an scsi target based error handle. If this member is not NULL, SCSI
mid-level would branch to drivers' error handler rather than the
old one which block whole host's IO.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c  | 57 +++++++++++++++++++++++++++++++-
 drivers/scsi/scsi_lib.c    | 12 +++++++
 drivers/scsi/scsi_priv.h   | 18 ++++++++++
 include/scsi/scsi_device.h | 67 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 815e7d63f3e2..f89de23a6807 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -291,11 +291,48 @@ static void scsi_eh_inc_host_failed(struct rcu_head *head)
 	spin_unlock_irqrestore(shost->host_lock, flags);
 }
 
+#define SCSI_EH_NO_HANDLER 1
+
+static int __scsi_eh_scmd_add_sdev(struct scsi_cmnd *scmd)
+{
+	struct scsi_device *sdev = scmd->device;
+	struct scsi_device_eh *eh = sdev->eh;
+
+	if (!eh || !eh->add_cmnd)
+		return SCSI_EH_NO_HANDLER;
+
+	scsi_eh_reset(scmd);
+	eh->add_cmnd(scmd);
+
+	if (eh->wakeup)
+		eh->wakeup(sdev);
+
+	return 0;
+}
+
+static int __scsi_eh_scmd_add_starget(struct scsi_cmnd *scmd)
+{
+	struct scsi_device *sdev = scmd->device;
+	struct scsi_target *starget = scsi_target(sdev);
+	struct scsi_target_eh *eh = starget->eh;
+
+	if (!eh || !eh->add_cmnd)
+		return SCSI_EH_NO_HANDLER;
+
+	scsi_eh_reset(scmd);
+	eh->add_cmnd(scmd);
+
+	if (eh->wakeup)
+		eh->wakeup(starget);
+
+	return 0;
+}
+
 /**
  * scsi_eh_scmd_add - add scsi cmd to error handling.
  * @scmd:	scmd to run eh on.
  */
-void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
+static void __scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 {
 	struct Scsi_Host *shost = scmd->device->host;
 	unsigned long flags;
@@ -322,6 +359,24 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 	call_rcu_hurry(&scmd->rcu, scsi_eh_inc_host_failed);
 }
 
+void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
+{
+	struct scsi_device *sdev = scmd->device;
+	struct scsi_target *starget = scsi_target(sdev);
+	struct Scsi_Host *shost = sdev->host;
+
+	if (unlikely(scsi_host_in_recovery(shost)))
+		__scsi_eh_scmd_add(scmd);
+
+	if (unlikely(scsi_target_in_recovery(starget)))
+		if (__scsi_eh_scmd_add_starget(scmd))
+			__scsi_eh_scmd_add(scmd);
+
+	if (__scsi_eh_scmd_add_sdev(scmd))
+		if (__scsi_eh_scmd_add_starget(scmd))
+			__scsi_eh_scmd_add(scmd);
+}
+
 /**
  * scsi_timeout - Timeout function for normal scsi commands.
  * @req:	request that is timing out.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f1cfe0bb89b2..8cdab7ac5980 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -399,6 +399,12 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 
 	sbitmap_put(&sdev->budget_map, cmd->budget_token);
 	cmd->budget_token = -1;
+
+	if (sdev->eh && sdev->eh->wakeup)
+		sdev->eh->wakeup(sdev);
+
+	if (starget->eh && starget->eh->wakeup)
+		starget->eh->wakeup(starget);
 }
 
 /*
@@ -1357,6 +1363,9 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 {
 	int token;
 
+	if (scsi_device_in_recovery(sdev))
+		return -1;
+
 	token = sbitmap_get(&sdev->budget_map);
 	if (token < 0)
 		return -1;
@@ -1390,6 +1399,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 	struct scsi_target *starget = scsi_target(sdev);
 	unsigned int busy;
 
+	if (scsi_target_in_recovery(starget))
+		return 0;
+
 	if (starget->single_lun) {
 		spin_lock_irq(shost->host_lock);
 		if (starget->starget_sdev_user &&
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 9fc397a9ce7a..90e1e6d80265 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -193,6 +193,24 @@ static inline void scsi_dh_add_device(struct scsi_device *sdev) { }
 static inline void scsi_dh_release_device(struct scsi_device *sdev) { }
 #endif
 
+static inline int scsi_device_in_recovery(struct scsi_device *sdev)
+{
+	struct scsi_device_eh *eh = sdev->eh;
+
+	if (eh && eh->is_busy)
+		return eh->is_busy(sdev);
+	return 0;
+}
+
+static inline int scsi_target_in_recovery(struct scsi_target *starget)
+{
+	struct scsi_target_eh *eh = starget->eh;
+
+	if (eh && eh->is_busy)
+		return eh->is_busy(starget);
+	return 0;
+}
+
 struct bsg_device *scsi_bsg_register_queue(struct scsi_device *sdev);
 
 extern int scsi_device_max_queue_depth(struct scsi_device *sdev);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 7acd0ec82bb0..5911c82ca435 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -100,6 +100,71 @@ struct scsi_vpd {
 	unsigned char	data[];
 };
 
+struct scsi_device;
+struct scsi_target;
+
+struct scsi_device_eh {
+	/*
+	 * add scsi command to error handler so it would be handuled by
+	 * driver's error handle strategy
+	 */
+	void (*add_cmnd)(struct scsi_cmnd *scmd);
+
+	/*
+	 * to judge if the device is busy handling errors, called before
+	 * dispatch scsi cmnd
+	 *
+	 * return 0 if it's ready to accepy scsi cmnd
+	 * return 0 if it's in error handle, command's would not be dispatched
+	 */
+	int (*is_busy)(struct scsi_device *sdev);
+
+	/*
+	 * wakeup device's error handle
+	 *
+	 * usually the error handler strategy would not run at once when
+	 * error command is added. This function would be called when any
+	 * scsi cmnd is finished or when scsi cmnd is added.
+	 */
+	int (*wakeup)(struct scsi_device *sdev);
+
+	/*
+	 * data entity for device specific error handler
+	 */
+	unsigned long driver_data[];
+};
+
+struct scsi_target_eh {
+	/*
+	 * add scsi command to error handler so it would be handuled by
+	 * driver's error handle strategy
+	 */
+	void (*add_cmnd)(struct scsi_cmnd *scmd);
+
+	/*
+	 * to judge if the device is busy handling errors, called before
+	 * dispatch scsi cmnd
+	 *
+	 * return 0 if it's ready to accepy scsi cmnd
+	 * return 0 if it's in error handle, command's would not be dispatched
+	 */
+	int (*is_busy)(struct scsi_target *starget);
+
+	/*
+	 * wakeup device's error handle
+	 *
+	 * usually the error handler strategy would not run at once when
+	 * error command is added. This function would be called when any
+	 * scsi cmnd is finished or when scsi cmnd is added.
+	 */
+	int (*wakeup)(struct scsi_target *starget);
+
+	/*
+	 * data entity for device specific error handler
+	 */
+	unsigned long driver_data[];
+};
+
 struct scsi_device {
 	struct Scsi_Host *host;
 	struct request_queue *request_queue;
@@ -281,6 +346,7 @@ struct scsi_device {
 	struct mutex		state_mutex;
 	enum scsi_device_state sdev_state;
 	struct task_struct	*quiesced_by;
+	struct scsi_device_eh	*eh;
 	unsigned long		sdev_data[];
 } __attribute__((aligned(sizeof(unsigned long))));
 
@@ -367,6 +433,7 @@ struct scsi_target {
 	char			scsi_level;
 	enum scsi_target_state	state;
 	void 			*hostdata; /* available to low-level driver */
+	struct scsi_target_eh	*eh;
 	unsigned long		starget_data[]; /* for the transport */
 	/* starget_data must be the last element!!!! */
 } __attribute__((aligned(sizeof(unsigned long))));
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 02/19] scsi: scsi_error: Move complete variable eh_action from shost to sdevice
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 01/19] scsi: scsi_error: Define framework for LUN/target based error handle JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 03/19] scsi: scsi_error: Check if to do reset in scsi_try_xxx_reset JiangJianJun
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

eh_action is used to wait for error handle command's completion if scsi
command is send in error handle. Now the error handler might based on
scsi_device, so move it to scsi_device.

This is preparation for a genernal LUN/target based error handle
strategy.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c  | 6 +++---
 include/scsi/scsi_device.h | 2 ++
 include/scsi/scsi_host.h   | 2 --
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f89de23a6807..4f37af9e20b6 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -916,7 +916,7 @@ void scsi_eh_done(struct scsi_cmnd *scmd)
 	SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd,
 			"%s result: %x\n", __func__, scmd->result));
 
-	eh_action = scmd->device->host->eh_action;
+	eh_action = scmd->device->eh_action;
 	if (eh_action)
 		complete(eh_action);
 }
@@ -1205,7 +1205,7 @@ static enum scsi_disposition scsi_send_eh_cmnd(struct scsi_cmnd *scmd,
 
 retry:
 	scsi_eh_prep_cmnd(scmd, &ses, cmnd, cmnd_size, sense_bytes);
-	shost->eh_action = &done;
+	sdev->eh_action = &done;
 
 	scsi_log_send(scmd);
 	scmd->submitter = SUBMITTED_BY_SCSI_ERROR_HANDLER;
@@ -1249,7 +1249,7 @@ static enum scsi_disposition scsi_send_eh_cmnd(struct scsi_cmnd *scmd,
 		rtn = SUCCESS;
 	}
 
-	shost->eh_action = NULL;
+	sdev->eh_action = NULL;
 
 	scsi_log_completion(scmd, rtn);
 
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 5911c82ca435..6388abb6c0d4 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -347,6 +347,8 @@ struct scsi_device {
 	enum scsi_device_state sdev_state;
 	struct task_struct	*quiesced_by;
 	struct scsi_device_eh	*eh;
+	struct completion	*eh_action;	/* Wait for specific actions */
+						/* on the device. */
 	unsigned long		sdev_data[];
 } __attribute__((aligned(sizeof(unsigned long))));
 
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 26bc23419cfd..de014c60746c 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -558,8 +558,6 @@ struct Scsi_Host {
 	struct list_head	eh_abort_list;
 	struct list_head	eh_cmd_q;
 	struct task_struct    * ehandler;  /* Error recovery thread. */
-	struct completion     * eh_action; /* Wait for specific actions on the
-					      host. */
 	wait_queue_head_t       host_wait;
 	const struct scsi_host_template *hostt;
 	struct scsi_transport_template *transportt;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 03/19] scsi: scsi_error: Check if to do reset in scsi_try_xxx_reset
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 01/19] scsi: scsi_error: Define framework for LUN/target based error handle JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 02/19] scsi: scsi_error: Move complete variable eh_action from shost to sdevice JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 04/19] scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT JiangJianJun
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

This is preparation for a genernal LUN/target based error handle
strategy, the strategy would reuse some error handler APIs,
but some steps of these function should not be performed. For
example, we should not perform target reset if we just stop IOs
on one single LUN.

This change add checks in scsi_try_xxx_reset to make sure
the reset operations would not be performed only if the condition
is not satisfied.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 4f37af9e20b6..cc3a5adb9daa 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -925,7 +925,7 @@ void scsi_eh_done(struct scsi_cmnd *scmd)
  * scsi_try_host_reset - ask host adapter to reset itself
  * @scmd:	SCSI cmd to send host reset.
  */
-static enum scsi_disposition scsi_try_host_reset(struct scsi_cmnd *scmd)
+static enum scsi_disposition __scsi_try_host_reset(struct scsi_cmnd *scmd)
 {
 	unsigned long flags;
 	enum scsi_disposition rtn;
@@ -951,11 +951,19 @@ static enum scsi_disposition scsi_try_host_reset(struct scsi_cmnd *scmd)
 	return rtn;
 }
 
+static enum scsi_disposition scsi_try_host_reset(struct scsi_cmnd *scmd)
+{
+	if (!scsi_host_in_recovery(scmd->device->host))
+		return FAILED;
+
+	return __scsi_try_host_reset(scmd);
+}
+
 /**
  * scsi_try_bus_reset - ask host to perform a bus reset
  * @scmd:	SCSI cmd to send bus reset.
  */
-static enum scsi_disposition scsi_try_bus_reset(struct scsi_cmnd *scmd)
+static enum scsi_disposition __scsi_try_bus_reset(struct scsi_cmnd *scmd)
 {
 	unsigned long flags;
 	enum scsi_disposition rtn;
@@ -981,6 +989,14 @@ static enum scsi_disposition scsi_try_bus_reset(struct scsi_cmnd *scmd)
 	return rtn;
 }
 
+static enum scsi_disposition scsi_try_bus_reset(struct scsi_cmnd *scmd)
+{
+	if (!scsi_host_in_recovery(scmd->device->host))
+		return FAILED;
+
+	return __scsi_try_bus_reset(scmd);
+}
+
 static void __scsi_report_device_reset(struct scsi_device *sdev, void *data)
 {
 	sdev->was_reset = 1;
@@ -997,7 +1013,7 @@ static void __scsi_report_device_reset(struct scsi_device *sdev, void *data)
  *    timer on it, and set the host back to a consistent state prior to
  *    returning.
  */
-static enum scsi_disposition scsi_try_target_reset(struct scsi_cmnd *scmd)
+static enum scsi_disposition __scsi_try_target_reset(struct scsi_cmnd *scmd)
 {
 	unsigned long flags;
 	enum scsi_disposition rtn;
@@ -1018,6 +1034,15 @@ static enum scsi_disposition scsi_try_target_reset(struct scsi_cmnd *scmd)
 	return rtn;
 }
 
+static enum scsi_disposition scsi_try_target_reset(struct scsi_cmnd *scmd)
+{
+	if (!(scsi_target_in_recovery(scsi_target(scmd->device)) ||
+	      scsi_host_in_recovery(scmd->device->host)))
+		return FAILED;
+
+	return __scsi_try_target_reset(scmd);
+}
+
 /**
  * scsi_try_bus_device_reset - Ask host to perform a BDR on a dev
  * @scmd:	SCSI cmd used to send BDR
@@ -2541,17 +2566,17 @@ scsi_ioctl_reset(struct scsi_device *dev, int __user *arg)
 			break;
 		fallthrough;
 	case SG_SCSI_RESET_TARGET:
-		rtn = scsi_try_target_reset(scmd);
+		rtn = __scsi_try_target_reset(scmd);
 		if (rtn == SUCCESS || (val & SG_SCSI_RESET_NO_ESCALATE))
 			break;
 		fallthrough;
 	case SG_SCSI_RESET_BUS:
-		rtn = scsi_try_bus_reset(scmd);
+		rtn = __scsi_try_bus_reset(scmd);
 		if (rtn == SUCCESS || (val & SG_SCSI_RESET_NO_ESCALATE))
 			break;
 		fallthrough;
 	case SG_SCSI_RESET_HOST:
-		rtn = scsi_try_host_reset(scmd);
+		rtn = __scsi_try_host_reset(scmd);
 		if (rtn == SUCCESS)
 			break;
 		fallthrough;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 04/19] scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (2 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 03/19] scsi: scsi_error: Check if to do reset in scsi_try_xxx_reset JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-04-24  9:27   ` Diangang Li
  2025-03-14  1:29 ` [RFC PATCH v3 05/19] scsi: scsi_error: Add helper scsi_eh_sdev_reset to do lun reset JiangJianJun
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add helper function scsi_eh_sdev_stu() to perform START_UNIT and check
if to finish some error commands.

This is preparation for a genernal LUN/target based error handle
strategy and did not change original logic.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c | 50 +++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index cc3a5adb9daa..3b55642fb585 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1567,6 +1567,31 @@ static int scsi_eh_try_stu(struct scsi_cmnd *scmd)
 	return 1;
 }
 
+static int scsi_eh_sdev_stu(struct scsi_cmnd *scmd,
+			      struct list_head *work_q,
+			      struct list_head *done_q)
+{
+	struct scsi_device *sdev = scmd->device;
+	struct scsi_cmnd *next;
+
+	SCSI_LOG_ERROR_RECOVERY(3, sdev_printk(KERN_INFO, sdev,
+				"%s: Sending START_UNIT\n", current->comm));
+
+	if (scsi_eh_try_stu(scmd)) {
+		SCSI_LOG_ERROR_RECOVERY(3, sdev_printk(KERN_INFO, sdev,
+				    "%s: START_UNIT failed\n", current->comm));
+		return 0;
+	}
+
+	if (!scsi_device_online(sdev) || !scsi_eh_tur(scmd))
+		list_for_each_entry_safe(scmd, next, work_q, eh_entry)
+			if (scmd->device == sdev &&
+			    scsi_eh_action(scmd, SUCCESS) == SUCCESS)
+				scsi_eh_finish_cmd(scmd, done_q);
+
+	return list_empty(work_q);
+}
+
  /**
  * scsi_eh_stu - send START_UNIT if needed
  * @shost:	&scsi host being recovered.
@@ -1581,7 +1606,7 @@ static int scsi_eh_stu(struct Scsi_Host *shost,
 			      struct list_head *work_q,
 			      struct list_head *done_q)
 {
-	struct scsi_cmnd *scmd, *stu_scmd, *next;
+	struct scsi_cmnd *scmd, *stu_scmd;
 	struct scsi_device *sdev;
 
 	shost_for_each_device(sdev, shost) {
@@ -1604,26 +1629,9 @@ static int scsi_eh_stu(struct Scsi_Host *shost,
 		if (!stu_scmd)
 			continue;
 
-		SCSI_LOG_ERROR_RECOVERY(3,
-			sdev_printk(KERN_INFO, sdev,
-				     "%s: Sending START_UNIT\n",
-				    current->comm));
-
-		if (!scsi_eh_try_stu(stu_scmd)) {
-			if (!scsi_device_online(sdev) ||
-			    !scsi_eh_tur(stu_scmd)) {
-				list_for_each_entry_safe(scmd, next,
-							  work_q, eh_entry) {
-					if (scmd->device == sdev &&
-					    scsi_eh_action(scmd, SUCCESS) == SUCCESS)
-						scsi_eh_finish_cmd(scmd, done_q);
-				}
-			}
-		} else {
-			SCSI_LOG_ERROR_RECOVERY(3,
-				sdev_printk(KERN_INFO, sdev,
-					    "%s: START_UNIT failed\n",
-					    current->comm));
+		if (scsi_eh_sdev_stu(stu_scmd, work_q, done_q)) {
+			scsi_device_put(sdev);
+			break;
 		}
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 05/19] scsi: scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (3 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 04/19] scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 06/19] scsi: scsi_error: Add flags to mark error handle steps has done JiangJianJun
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add helper function scsi_eh_sdev_reset() to perform lun reset and check
if to finish some error commands.

This is preparation for a genernal LUN/target based error handle
strategy and did not change original logic.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c | 54 +++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 3b55642fb585..80e5787055d3 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1638,6 +1638,34 @@ static int scsi_eh_stu(struct Scsi_Host *shost,
 	return list_empty(work_q);
 }
 
+static int scsi_eh_sdev_reset(struct scsi_cmnd *scmd,
+			      struct list_head *work_q,
+			      struct list_head *done_q)
+{
+	struct scsi_cmnd *next;
+	struct scsi_device *sdev = scmd->device;
+	enum scsi_disposition rtn;
+
+	SCSI_LOG_ERROR_RECOVERY(3, sdev_printk(KERN_INFO, sdev,
+			     "%s: Sending BDR\n", current->comm));
+
+	rtn = scsi_try_bus_device_reset(scmd);
+	if (rtn != SUCCESS && rtn != FAST_IO_FAIL) {
+		SCSI_LOG_ERROR_RECOVERY(3,
+			sdev_printk(KERN_INFO, sdev,
+				    "%s: BDR failed\n", current->comm));
+		return 0;
+	}
+
+	if (!scsi_device_online(sdev) || rtn == FAST_IO_FAIL ||
+	    !scsi_eh_tur(scmd))
+		list_for_each_entry_safe(scmd, next, work_q, eh_entry)
+			if (scmd->device == sdev &&
+			    scsi_eh_action(scmd, rtn) != FAILED)
+				scsi_eh_finish_cmd(scmd, done_q);
+
+	return list_empty(work_q);
+}
 
 /**
  * scsi_eh_bus_device_reset - send bdr if needed
@@ -1655,9 +1683,8 @@ static int scsi_eh_bus_device_reset(struct Scsi_Host *shost,
 				    struct list_head *work_q,
 				    struct list_head *done_q)
 {
-	struct scsi_cmnd *scmd, *bdr_scmd, *next;
+	struct scsi_cmnd *scmd, *bdr_scmd;
 	struct scsi_device *sdev;
-	enum scsi_disposition rtn;
 
 	shost_for_each_device(sdev, shost) {
 		if (scsi_host_eh_past_deadline(shost)) {
@@ -1678,26 +1705,9 @@ static int scsi_eh_bus_device_reset(struct Scsi_Host *shost,
 		if (!bdr_scmd)
 			continue;
 
-		SCSI_LOG_ERROR_RECOVERY(3,
-			sdev_printk(KERN_INFO, sdev,
-				     "%s: Sending BDR\n", current->comm));
-		rtn = scsi_try_bus_device_reset(bdr_scmd);
-		if (rtn == SUCCESS || rtn == FAST_IO_FAIL) {
-			if (!scsi_device_online(sdev) ||
-			    rtn == FAST_IO_FAIL ||
-			    !scsi_eh_tur(bdr_scmd)) {
-				list_for_each_entry_safe(scmd, next,
-							 work_q, eh_entry) {
-					if (scmd->device == sdev &&
-					    scsi_eh_action(scmd, rtn) != FAILED)
-						scsi_eh_finish_cmd(scmd,
-								   done_q);
-				}
-			}
-		} else {
-			SCSI_LOG_ERROR_RECOVERY(3,
-				sdev_printk(KERN_INFO, sdev,
-					    "%s: BDR failed\n", current->comm));
+		if (scsi_eh_sdev_reset(bdr_scmd, work_q, done_q)) {
+			scsi_device_put(sdev);
+			break;
 		}
 	}
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 06/19] scsi: scsi_error: Add flags to mark error handle steps has done
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (4 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 05/19] scsi: scsi_error: Add helper scsi_eh_sdev_reset to do lun reset JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 07/19] scsi: scsi_error: Add helper to handle scsi device's error command list JiangJianJun
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

LUN based error handle would mainly do three steps to recover
commands which are check sense, start unit, and reset lun. It might
fallback to target/host based error handle which would do these steps
too.

Target based error handle would reset target, it would also fallback
to host based error handle.

Add some flags to mark these steps are done to avoid repeating
these steps.

The flags should be cleared when LUN/target based error handler is
waked up or when target/host based error handle finished, and set
when fallback to target/host based error handle.

scsi_eh_get_sense, scsi_eh_stu, scsi_eh_bus_device_reset and
scsi_eh_target_reset would check these flags before actually action.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c  | 55 ++++++++++++++++++++++++++++++++++++++
 include/scsi/scsi_device.h | 28 +++++++++++++++++++
 2 files changed, 83 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 80e5787055d3..628ecbfcfff2 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -57,10 +57,50 @@
 #define BUS_RESET_SETTLE_TIME   (10)
 #define HOST_RESET_SETTLE_TIME  (10)
 
+#define sdev_flags_done(flag)					\
+static inline int sdev_##flag(struct scsi_device *sdev)		\
+{								\
+	struct scsi_device_eh *eh = sdev->eh;			\
+	if (!eh)						\
+		return 0;					\
+	return eh->flag;					\
+}
+
 static int scsi_eh_try_stu(struct scsi_cmnd *scmd);
 static enum scsi_disposition scsi_try_to_abort_cmd(const struct scsi_host_template *,
 						   struct scsi_cmnd *);
 
+sdev_flags_done(get_sense_done);
+sdev_flags_done(stu_done);
+sdev_flags_done(reset_done);
+
+static inline int starget_reset_done(struct scsi_target *starget)
+{
+	struct scsi_target_eh *eh = starget->eh;
+
+	if (!eh)
+		return 0;
+	return eh->reset_done;
+}
+
+static inline void shost_clear_eh_done(struct Scsi_Host *shost)
+{
+	struct scsi_device *sdev;
+	struct scsi_target *starget;
+
+	list_for_each_entry(starget, &shost->__targets, siblings)
+		if (starget->eh)
+			starget->eh->reset_done = 0;
+
+	shost_for_each_device(sdev, shost) {
+		if (!sdev->eh)
+			continue;
+		sdev->eh->get_sense_done = 0;
+		sdev->eh->stu_done	 = 0;
+		sdev->eh->reset_done	 = 0;
+	}
+}
+
 void scsi_eh_wakeup(struct Scsi_Host *shost, unsigned int busy)
 {
 	lockdep_assert_held(shost->host_lock);
@@ -1405,6 +1445,9 @@ int scsi_eh_get_sense(struct list_head *work_q,
 					     current->comm));
 			break;
 		}
+		if (sdev_get_sense_done(scmd->device) ||
+		    starget_reset_done(scsi_target(scmd->device)))
+			continue;
 		if (!scsi_status_is_check_condition(scmd->result))
 			/*
 			 * don't request sense if there's no check condition
@@ -1618,6 +1661,9 @@ static int scsi_eh_stu(struct Scsi_Host *shost,
 			scsi_device_put(sdev);
 			break;
 		}
+		if (sdev_stu_done(sdev) ||
+		    starget_reset_done(scsi_target(sdev)))
+			continue;
 		stu_scmd = NULL;
 		list_for_each_entry(scmd, work_q, eh_entry)
 			if (scmd->device == sdev && SCSI_SENSE_VALID(scmd) &&
@@ -1701,6 +1747,9 @@ static int scsi_eh_bus_device_reset(struct Scsi_Host *shost,
 				bdr_scmd = scmd;
 				break;
 			}
+		if (sdev_reset_done(sdev) ||
+		    starget_reset_done(scsi_target(sdev)))
+			continue;
 
 		if (!bdr_scmd)
 			continue;
@@ -1749,6 +1798,11 @@ static int scsi_eh_target_reset(struct Scsi_Host *shost,
 		}
 
 		scmd = list_entry(tmp_list.next, struct scsi_cmnd, eh_entry);
+		if (starget_reset_done(scsi_target(scmd->device))) {
+			/* push back on work queue for further processing */
+			list_move(&scmd->eh_entry, work_q);
+			continue;
+		}
 		id = scmd_id(scmd);
 
 		SCSI_LOG_ERROR_RECOVERY(3,
@@ -2365,6 +2419,7 @@ static void scsi_unjam_host(struct Scsi_Host *shost)
 	if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
 		scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
 
+	shost_clear_eh_done(shost);
 	spin_lock_irqsave(shost->host_lock, flags);
 	if (shost->eh_deadline != -1)
 		shost->last_reset = 0;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 6388abb6c0d4..5cc18ca69fc7 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -104,6 +104,24 @@ struct scsi_device;
 struct scsi_target;
 
 struct scsi_device_eh {
+	/*
+	 * LUN rebased error handle would mainly do three
+	 * steps to recovery commands which are
+	 *   check sense
+	 *   start unit
+	 *   reset lun
+	 * While we would fallback to target or host based error handle
+	 * which would do these steps too. Add flags to mark thes steps
+	 * are done to avoid repeating these steps.
+	 *
+	 * The flags should be cleared when LUN based error handler is
+	 * wakedup or when target/host based error handle finished,
+	 * set when fallback to target or host based error handle.
+	 */
+	unsigned get_sense_done:1;
+	unsigned stu_done:1;
+	unsigned reset_done:1;
+
 	/*
 	 * add scsi command to error handler so it would be handuled by
 	 * driver's error handle strategy
@@ -135,6 +153,16 @@ struct scsi_device_eh {
 };
 
 struct scsi_target_eh {
+	/*
+	 * flag to mark target reset is done to avoid repeating
+	 * these steps when fallback to host based error handle
+	 *
+	 * The flag should be cleared when target based error handler
+	 * is * wakedup or when host based error handle finished,
+	 * set when fallback to host based error handle.
+	 */
+	unsigned reset_done:1;
+
 	/*
 	 * add scsi command to error handler so it would be handuled by
 	 * driver's error handle strategy
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 07/19] scsi: scsi_error: Add helper to handle scsi device's error command list
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (5 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 06/19] scsi: scsi_error: Add flags to mark error handle steps has done JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 08/19] scsi: scsi_error: Add a general LUN based error handler JiangJianJun
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add helper scsi_sdev_eh() to handle scsi device's error command list,
it would perform some steps which can be done with LUN's IO blocked,
including check sense, start unit and reset lun.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c | 37 +++++++++++++++++++++++++++++++++++++
 include/scsi/scsi_eh.h    |  2 ++
 2 files changed, 39 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 628ecbfcfff2..21f72c075531 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2516,6 +2516,43 @@ int scsi_error_handler(void *data)
 	return 0;
 }
 
+/*
+ * Single LUN error handle
+ *
+ * @work_q: list of scsi commands need to recovery
+ * @done_q: list of scsi commands handled
+ *
+ * return: return 1 if all commands in work_q is recoveryed, else 0 is returned
+ */
+int scsi_sdev_eh(struct scsi_device *sdev,
+		 struct list_head *work_q,
+		 struct list_head *done_q)
+{
+	int ret = 0;
+	struct scsi_cmnd *scmd;
+
+	SCSI_LOG_ERROR_RECOVERY(2, sdev_printk(KERN_INFO, sdev,
+		"%s:luneh: checking sense\n", current->comm));
+	ret = scsi_eh_get_sense(work_q, done_q);
+	if (ret)
+		return ret;
+
+	SCSI_LOG_ERROR_RECOVERY(2, sdev_printk(KERN_INFO, sdev,
+		"%s:luneh: start unit\n", current->comm));
+	scmd = list_first_entry(work_q, struct scsi_cmnd, eh_entry);
+	ret = scsi_eh_sdev_stu(scmd, work_q, done_q);
+	if (ret)
+		return ret;
+
+	SCSI_LOG_ERROR_RECOVERY(2, sdev_printk(KERN_INFO, sdev,
+		"%s:luneh reset LUN\n", current->comm));
+	scmd = list_first_entry(work_q, struct scsi_cmnd, eh_entry);
+	ret = scsi_eh_sdev_reset(scmd, work_q, done_q);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(scsi_sdev_eh);
+
 /**
  * scsi_report_bus_reset() - report bus reset observed
  *
diff --git a/include/scsi/scsi_eh.h b/include/scsi/scsi_eh.h
index 1ae08e81339f..5ce791063baf 100644
--- a/include/scsi/scsi_eh.h
+++ b/include/scsi/scsi_eh.h
@@ -18,6 +18,8 @@ extern int scsi_block_when_processing_errors(struct scsi_device *);
 extern bool scsi_command_normalize_sense(const struct scsi_cmnd *cmd,
 					 struct scsi_sense_hdr *sshdr);
 extern enum scsi_disposition scsi_check_sense(struct scsi_cmnd *);
+extern int scsi_sdev_eh(struct scsi_device *sdev, struct list_head *workq,
+			struct list_head *doneq);
 
 static inline bool scsi_sense_is_deferred(const struct scsi_sense_hdr *sshdr)
 {
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 08/19] scsi: scsi_error: Add a general LUN based error handler
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (6 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 07/19] scsi: scsi_error: Add helper to handle scsi device's error command list JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 09/19] scsi: core: increase/decrease target_busy without check can_queue JiangJianJun
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add a general LUN based error handler which can be used by drivers
directly. This error handler implements an scsi_device_eh, when handling
error commands, it would call helper function scsi_sdev_eh() added before
to try recover error commands.

The behavior if scsi_sdev_eh() can not recover all error commands
depends on fallback flag, which is initialized when scsi_device is
allocated. If fallback is set, it would fallback to further error
recover strategy like old host based error handle; else it would
mark this scsi device offline and flush all error commands.

To using this error handler, drivers should call scsi_device_setup_eh()
in its slave_alloc() to setup it's LUN based error handler;
call scsi_device_clear_eh() in its slave_destroy() to clear LUN based
error handler.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c | 170 ++++++++++++++++++++++++++++++++++++++
 include/scsi/scsi_eh.h    |   2 +
 2 files changed, 172 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 21f72c075531..7302536ba62a 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2766,3 +2766,173 @@ bool scsi_get_sense_info_fld(const u8 *sense_buffer, int sb_len,
 	}
 }
 EXPORT_SYMBOL(scsi_get_sense_info_fld);
+
+struct scsi_lun_eh {
+	spinlock_t		eh_lock;
+	unsigned int		eh_num;
+	struct list_head	eh_cmd_q;
+	struct scsi_device	*sdev;
+	struct work_struct	eh_handle_work;
+	unsigned int		fallback:1;	/* If fallback to further */
+						/* recovery on failure  */
+};
+
+/*
+ * error handle strategy based on LUN, following steps
+ * is applied to recovery error commands in list:
+ *    check sense data
+ *    send start unit
+ *    reset lun
+ * if there are still error commands, it would fallback to
+ * target based or host based error handle for further recovery.
+ */
+static void sdev_eh_work(struct work_struct *work)
+{
+	unsigned long flags;
+	struct scsi_lun_eh *luneh =
+			container_of(work, struct scsi_lun_eh, eh_handle_work);
+	struct scsi_device *sdev = luneh->sdev;
+	struct scsi_device_eh *eh = sdev->eh;
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_cmnd *scmd, *next;
+	LIST_HEAD(eh_work_q);
+	LIST_HEAD(eh_done_q);
+
+	spin_lock_irqsave(&luneh->eh_lock, flags);
+	list_splice_init(&luneh->eh_cmd_q, &eh_work_q);
+	spin_unlock_irqrestore(&luneh->eh_lock, flags);
+
+	if (scsi_sdev_eh(sdev, &eh_work_q, &eh_done_q))
+		goto out_flush_done;
+
+	if (!luneh->fallback) {
+		list_for_each_entry_safe(scmd, next, &eh_work_q, eh_entry)
+			scsi_eh_finish_cmd(scmd, &eh_done_q);
+
+		sdev_printk(KERN_INFO, sdev, "%s:luneh: Device offlined - "
+			    "not ready after error recovery\n", current->comm);
+
+		mutex_lock(&sdev->state_mutex);
+		scsi_device_set_state(sdev, SDEV_OFFLINE);
+		mutex_unlock(&sdev->state_mutex);
+
+		goto out_flush_done;
+	}
+
+	/*
+	 * fallback to target or host based error handle
+	 */
+	SCSI_LOG_ERROR_RECOVERY(2, sdev_printk(KERN_INFO, sdev,
+		"%s:luneh fallback to further recovery\n", current->comm));
+	list_for_each_entry_safe(scmd, next, &eh_work_q, eh_entry) {
+		list_del_init(&scmd->eh_entry);
+
+		if (scsi_host_in_recovery(shost) ||
+		    __scsi_eh_scmd_add_starget(scmd))
+			__scsi_eh_scmd_add(scmd);
+	}
+
+	eh->get_sense_done = 1;
+	eh->stu_done = 1;
+	eh->reset_done = 1;
+
+out_flush_done:
+	scsi_eh_flush_done_q(&eh_done_q);
+	spin_lock_irqsave(&luneh->eh_lock, flags);
+	luneh->eh_num = 0;
+	spin_unlock_irqrestore(&luneh->eh_lock, flags);
+}
+static void sdev_eh_add_cmnd(struct scsi_cmnd *scmd)
+{
+	unsigned long flags;
+	struct scsi_lun_eh *luneh;
+	struct scsi_device *sdev = scmd->device;
+
+	luneh = (struct scsi_lun_eh *)sdev->eh->driver_data;
+
+	spin_lock_irqsave(&luneh->eh_lock, flags);
+	list_add_tail(&scmd->eh_entry, &luneh->eh_cmd_q);
+	luneh->eh_num++;
+	spin_unlock_irqrestore(&luneh->eh_lock, flags);
+}
+static int sdev_eh_is_busy(struct scsi_device *sdev)
+{
+	int ret = 0;
+	unsigned long flags;
+	struct scsi_lun_eh *luneh;
+
+	if (!sdev->eh)
+		return 0;
+
+	luneh = (struct scsi_lun_eh *)sdev->eh->driver_data;
+
+	spin_lock_irqsave(&luneh->eh_lock, flags);
+	ret = luneh->eh_num;
+	spin_unlock_irqrestore(&luneh->eh_lock, flags);
+
+	return ret;
+}
+static int sdev_eh_wakeup(struct scsi_device *sdev)
+{
+	unsigned long flags;
+	unsigned int nr_error;
+	unsigned int nr_busy;
+	struct scsi_lun_eh *luneh;
+
+	luneh = (struct scsi_lun_eh *)sdev->eh->driver_data;
+
+	spin_lock_irqsave(&luneh->eh_lock, flags);
+	nr_error = luneh->eh_num;
+	spin_unlock_irqrestore(&luneh->eh_lock, flags);
+
+	nr_busy = scsi_device_busy(sdev);
+
+	if (!nr_error || nr_busy != nr_error) {
+		SCSI_LOG_ERROR_RECOVERY(5, sdev_printk(KERN_INFO, sdev,
+			"%s:luneh: do not wake up, busy/error: %d/%d\n",
+			current->comm, nr_busy, nr_error));
+		return 0;
+	}
+
+	SCSI_LOG_ERROR_RECOVERY(2, sdev_printk(KERN_INFO, sdev,
+		"%s:luneh: waking up, busy/error: %d/%d\n",
+		current->comm, nr_busy, nr_error));
+
+	return schedule_work(&luneh->eh_handle_work);
+}
+
+int scsi_device_setup_eh(struct scsi_device *sdev, int fallback)
+{
+	struct scsi_device_eh *eh;
+	struct scsi_lun_eh *luneh;
+
+	eh = kzalloc(sizeof(struct scsi_device_eh) + sizeof(struct scsi_lun_eh),
+		GFP_KERNEL);
+	if (!eh) {
+		sdev_printk(KERN_ERR, sdev, "failed to setup error handle\n");
+		return -ENOMEM;
+	}
+	luneh = (struct scsi_lun_eh *)eh->driver_data;
+
+	eh->add_cmnd = sdev_eh_add_cmnd;
+	eh->is_busy  = sdev_eh_is_busy;
+	eh->wakeup   = sdev_eh_wakeup;
+
+	luneh->fallback = fallback;
+	luneh->sdev  = sdev;
+	spin_lock_init(&luneh->eh_lock);
+	INIT_LIST_HEAD(&luneh->eh_cmd_q);
+	INIT_WORK(&luneh->eh_handle_work, sdev_eh_work);
+
+	sdev->eh = eh;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_device_setup_eh);
+
+void scsi_device_clear_eh(struct scsi_device *sdev)
+{
+	kfree(sdev->eh);
+	sdev->eh = NULL;
+}
+EXPORT_SYMBOL_GPL(scsi_device_clear_eh);
diff --git a/include/scsi/scsi_eh.h b/include/scsi/scsi_eh.h
index 5ce791063baf..89b471aa484f 100644
--- a/include/scsi/scsi_eh.h
+++ b/include/scsi/scsi_eh.h
@@ -20,6 +20,8 @@ extern bool scsi_command_normalize_sense(const struct scsi_cmnd *cmd,
 extern enum scsi_disposition scsi_check_sense(struct scsi_cmnd *);
 extern int scsi_sdev_eh(struct scsi_device *sdev, struct list_head *workq,
 			struct list_head *doneq);
+extern int scsi_device_setup_eh(struct scsi_device *sdev, int fallback);
+extern void scsi_device_clear_eh(struct scsi_device *sdev);
 
 static inline bool scsi_sense_is_deferred(const struct scsi_sense_hdr *sshdr)
 {
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 09/19] scsi: core: increase/decrease target_busy without check can_queue
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (7 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 08/19] scsi: scsi_error: Add a general LUN based error handler JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:35   ` Bart Van Assche
  2025-03-14  1:29 ` [RFC PATCH v3 10/19] scsi: scsi_error: Add helper to handle scsi target's error command list JiangJianJun
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

This is preparation for a genernal target based error handle strategy
to check if to wake up actual error handler.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_lib.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8cdab7ac5980..e477c5d3c393 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -394,8 +394,7 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 
 	scsi_dec_host_busy(shost, cmd);
 
-	if (starget->can_queue > 0)
-		atomic_dec(&starget->target_busy);
+	atomic_dec(&starget->target_busy);
 
 	sbitmap_put(&sdev->budget_map, cmd->budget_token);
 	cmd->budget_token = -1;
@@ -1413,10 +1412,10 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 		spin_unlock_irq(shost->host_lock);
 	}
 
+	busy = atomic_inc_return(&starget->target_busy) - 1;
 	if (starget->can_queue <= 0)
 		return 1;
 
-	busy = atomic_inc_return(&starget->target_busy) - 1;
 	if (atomic_read(&starget->target_blocked) > 0) {
 		if (busy)
 			goto starved;
@@ -1441,8 +1440,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 	list_move_tail(&sdev->starved_entry, &shost->starved_list);
 	spin_unlock_irq(shost->host_lock);
 out_dec:
-	if (starget->can_queue > 0)
-		atomic_dec(&starget->target_busy);
+	atomic_dec(&starget->target_busy);
 	return 0;
 }
 
@@ -1886,8 +1884,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_dec_host_busy:
 	scsi_dec_host_busy(shost, cmd);
 out_dec_target_busy:
-	if (scsi_target(sdev)->can_queue > 0)
-		atomic_dec(&scsi_target(sdev)->target_busy);
+	atomic_dec(&scsi_target(sdev)->target_busy);
 out_put_budget:
 	scsi_mq_put_budget(q, cmd->budget_token);
 	cmd->budget_token = -1;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 10/19] scsi: scsi_error: Add helper to handle scsi target's error command list
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (8 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 09/19] scsi: core: increase/decrease target_busy without check can_queue JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 11/19] scsi: scsi_error: Add a general target based error handler JiangJianJun
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add helper scsi_starget_eh() to handle scsi target's error command list,
it would perform some steps which can be done with target's IO blocked,
including check sense, start unit, reset lun and reset target.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c | 129 ++++++++++++++++++++++++++++++++++++++
 include/scsi/scsi_eh.h    |   2 +
 2 files changed, 131 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 7302536ba62a..46415db951ed 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2553,6 +2553,135 @@ int scsi_sdev_eh(struct scsi_device *sdev,
 }
 EXPORT_SYMBOL_GPL(scsi_sdev_eh);
 
+static int starget_eh_stu(struct scsi_target *starget,
+			  struct list_head *work_q,
+			  struct list_head *done_q)
+{
+	struct scsi_device *sdev;
+	struct scsi_cmnd *scmd, *stu_scmd;
+
+	list_for_each_entry(sdev, &starget->devices, same_target_siblings) {
+		if (sdev_stu_done(sdev))
+			continue;
+
+		stu_scmd = NULL;
+		list_for_each_entry(scmd, work_q, eh_entry)
+			if (scmd->device == sdev && SCSI_SENSE_VALID(scmd) &&
+			    scsi_check_sense(scmd) == FAILED) {
+				stu_scmd = scmd;
+				break;
+			}
+		if (!stu_scmd)
+			continue;
+
+		if (scsi_eh_sdev_stu(stu_scmd, work_q, done_q))
+			return 1;
+	}
+
+	return 0;
+}
+
+static int starget_eh_reset_lun(struct scsi_target *starget,
+				struct list_head *work_q,
+				struct list_head *done_q)
+{
+	struct scsi_device *sdev;
+	struct scsi_cmnd *scmd, *bdr_scmd;
+
+	list_for_each_entry(sdev, &starget->devices, same_target_siblings) {
+		if (sdev_reset_done(sdev))
+			continue;
+
+		bdr_scmd = NULL;
+		list_for_each_entry(scmd, work_q, eh_entry)
+			if (scmd->device) {
+				bdr_scmd = scmd;
+				break;
+			}
+		if (!bdr_scmd)
+			continue;
+
+		if (scsi_eh_sdev_reset(bdr_scmd, work_q, done_q))
+			return 1;
+	}
+
+	return 0;
+}
+
+static int starget_eh_reset_target(struct scsi_target *starget,
+				    struct list_head *work_q,
+				    struct list_head *done_q)
+{
+	enum scsi_disposition rtn;
+	struct scsi_cmnd *scmd, *next;
+	LIST_HEAD(check_list);
+
+	scmd = list_first_entry(work_q, struct scsi_cmnd, eh_entry);
+
+	SCSI_LOG_ERROR_RECOVERY(3, starget_printk(KERN_INFO, starget,
+			     "%s: Sending target reset\n", current->comm));
+
+	rtn = scsi_try_target_reset(scmd);
+	if (rtn != SUCCESS && rtn != FAST_IO_FAIL) {
+		SCSI_LOG_ERROR_RECOVERY(3, starget_printk(KERN_INFO, starget,
+				     "%s: Target reset failed\n",
+				     current->comm));
+		return 0;
+	}
+
+	SCSI_LOG_ERROR_RECOVERY(3, starget_printk(KERN_INFO, starget,
+			     "%s: Target reset success\n", current->comm));
+
+	list_for_each_entry_safe(scmd, next, work_q, eh_entry) {
+		if (rtn == SUCCESS)
+			list_move_tail(&scmd->eh_entry, &check_list);
+		else if (rtn == FAST_IO_FAIL)
+			scsi_eh_finish_cmd(scmd, done_q);
+	}
+
+	return scsi_eh_test_devices(&check_list, work_q, done_q, 0);
+}
+
+/*
+ * Target based error handle
+ *
+ * @work_q: list of scsi commands need to recovery
+ * @done_q: list of scsi commands handled
+ *
+ * return: return 1 if all commands in work_q is recoveryed, else 0 is returned
+ */
+int scsi_starget_eh(struct scsi_target *starget,
+		    struct list_head *work_q,
+		    struct list_head *done_q)
+{
+	int ret = 0;
+
+	SCSI_LOG_ERROR_RECOVERY(2, starget_printk(KERN_INFO, starget,
+		"%s:targeteh: checking sense\n", current->comm));
+	ret = scsi_eh_get_sense(work_q, done_q);
+	if (ret)
+		return ret;
+
+	SCSI_LOG_ERROR_RECOVERY(2, starget_printk(KERN_INFO, starget,
+		"%s:targeteh: start unit\n", current->comm));
+	ret = starget_eh_stu(starget, work_q, done_q);
+	if (ret)
+		return ret;
+
+	SCSI_LOG_ERROR_RECOVERY(2, starget_printk(KERN_INFO, starget,
+		"%s:targeteh reset LUN\n", current->comm));
+	ret = starget_eh_reset_lun(starget, work_q, done_q);
+	if (ret)
+		return ret;
+
+	SCSI_LOG_ERROR_RECOVERY(2, starget_printk(KERN_INFO, starget,
+		"%s:targeteh reset target\n", current->comm));
+	ret = starget_eh_reset_target(starget, work_q, done_q);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(scsi_starget_eh);
+
 /**
  * scsi_report_bus_reset() - report bus reset observed
  *
diff --git a/include/scsi/scsi_eh.h b/include/scsi/scsi_eh.h
index 89b471aa484f..80e2f130e884 100644
--- a/include/scsi/scsi_eh.h
+++ b/include/scsi/scsi_eh.h
@@ -20,6 +20,8 @@ extern bool scsi_command_normalize_sense(const struct scsi_cmnd *cmd,
 extern enum scsi_disposition scsi_check_sense(struct scsi_cmnd *);
 extern int scsi_sdev_eh(struct scsi_device *sdev, struct list_head *workq,
 			struct list_head *doneq);
+extern int scsi_starget_eh(struct scsi_target *starget,
+			struct list_head *workq, struct list_head *doneq);
 extern int scsi_device_setup_eh(struct scsi_device *sdev, int fallback);
 extern void scsi_device_clear_eh(struct scsi_device *sdev);
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 11/19] scsi: scsi_error: Add a general target based error handler
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (9 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 10/19] scsi: scsi_error: Add helper to handle scsi target's error command list JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 12/19] scsi: scsi_debug: Add param to control LUN bassed " JiangJianJun
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add a general target based error handler which can be used by drivers
directly. This error handler implements an scsi_target_eh, when handling
error commands, it would call helper function scsi_starget_eh() added
before to try recover error commands.

The behavior if scsi_starget_eh() can not recover all error commands
depends on fallback flag, which is initialized when scsi_target is
allocated. If fallback is set, it would fallback to further error
recover strategy like old host based error handle; else it would
mark this scsi devices of this target  offline and flush all error
commands.

To using this error handler, drivers should call scsi_target_setup_eh()
in its target_alloc() to setup it's target based error handler;
call scsi_device_clear_eh() in its target_destroy() to clear this
target based error handler.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_error.c | 161 ++++++++++++++++++++++++++++++++++++++
 include/scsi/scsi_eh.h    |   2 +
 2 files changed, 163 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 46415db951ed..d9415d3d32c9 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -101,6 +101,19 @@ static inline void shost_clear_eh_done(struct Scsi_Host *shost)
 	}
 }
 
+static inline void starget_clear_eh_done(struct scsi_target *starget)
+{
+	struct scsi_device *sdev;
+
+	list_for_each_entry(sdev, &starget->devices, same_target_siblings) {
+		if (!sdev->eh)
+			continue;
+		sdev->eh->get_sense_done = 0;
+		sdev->eh->stu_done   = 0;
+		sdev->eh->reset_done = 0;
+	}
+}
+
 void scsi_eh_wakeup(struct Scsi_Host *shost, unsigned int busy)
 {
 	lockdep_assert_held(shost->host_lock);
@@ -3065,3 +3078,151 @@ void scsi_device_clear_eh(struct scsi_device *sdev)
 	sdev->eh = NULL;
 }
 EXPORT_SYMBOL_GPL(scsi_device_clear_eh);
+
+struct starget_eh {
+	spinlock_t		eh_lock;
+	unsigned int		eh_num;
+	struct list_head	eh_cmd_q;
+	struct scsi_target	*starget;
+	struct work_struct	eh_handle_work;
+	unsigned int		fallback:1;
+};
+
+static void starget_eh_work(struct work_struct *work)
+{
+	struct scsi_cmnd *scmd, *next;
+	unsigned long flags;
+	LIST_HEAD(eh_work_q);
+	LIST_HEAD(eh_done_q);
+	struct starget_eh *stargeteh =
+			container_of(work, struct starget_eh, eh_handle_work);
+	struct scsi_target *starget = stargeteh->starget;
+	struct scsi_target_eh *eh = starget->eh;
+
+	spin_lock_irqsave(&stargeteh->eh_lock, flags);
+	list_splice_init(&stargeteh->eh_cmd_q, &eh_work_q);
+	spin_unlock_irqrestore(&stargeteh->eh_lock, flags);
+
+	if (scsi_starget_eh(starget, &eh_work_q, &eh_done_q))
+		goto out_clear_flag;
+
+	if (!stargeteh->fallback) {
+		scsi_eh_offline_sdevs(&eh_work_q, &eh_done_q);
+		goto out_clear_flag;
+	}
+
+	/*
+	 * fallback to host based error handle
+	 */
+	SCSI_LOG_ERROR_RECOVERY(2, starget_printk(KERN_INFO, starget,
+		"%s:targeteh fallback to further recovery\n", current->comm));
+	eh->reset_done = 1;
+	list_for_each_entry_safe(scmd, next, &eh_work_q, eh_entry) {
+		list_del_init(&scmd->eh_entry);
+		__scsi_eh_scmd_add(scmd);
+	}
+	goto out_flush_done;
+
+out_clear_flag:
+	starget_clear_eh_done(starget);
+
+out_flush_done:
+	scsi_eh_flush_done_q(&eh_done_q);
+	spin_lock_irqsave(&stargeteh->eh_lock, flags);
+	stargeteh->eh_num = 0;
+	spin_unlock_irqrestore(&stargeteh->eh_lock, flags);
+}
+
+static void starget_eh_add_cmnd(struct scsi_cmnd *scmd)
+{
+	unsigned long flags;
+	struct scsi_target *starget = scmd->device->sdev_target;
+	struct starget_eh *eh;
+
+	eh = (struct starget_eh *)starget->eh->driver_data;
+
+	spin_lock_irqsave(&eh->eh_lock, flags);
+	list_add_tail(&scmd->eh_entry, &eh->eh_cmd_q);
+	eh->eh_num++;
+	spin_unlock_irqrestore(&eh->eh_lock, flags);
+}
+
+static int starget_eh_is_busy(struct scsi_target *starget)
+{
+	int ret = 0;
+	unsigned long flags;
+	struct starget_eh *eh;
+
+	eh = (struct starget_eh *)starget->eh->driver_data;
+
+	spin_lock_irqsave(&eh->eh_lock, flags);
+	ret = eh->eh_num;
+	spin_unlock_irqrestore(&eh->eh_lock, flags);
+
+	return ret;
+}
+
+static int starget_eh_wakeup(struct scsi_target *starget)
+{
+	unsigned long flags;
+	unsigned int nr_error;
+	unsigned int nr_busy;
+	struct starget_eh *eh;
+
+	eh = (struct starget_eh *)starget->eh->driver_data;
+
+	spin_lock_irqsave(&eh->eh_lock, flags);
+	nr_error = eh->eh_num;
+	spin_unlock_irqrestore(&eh->eh_lock, flags);
+
+	nr_busy = atomic_read(&starget->target_busy);
+
+	if (!nr_error || nr_busy != nr_error) {
+		SCSI_LOG_ERROR_RECOVERY(5, starget_printk(KERN_INFO, starget,
+			"%s:targeteh: do not wake up, busy/error is %d/%d\n",
+			current->comm, nr_busy, nr_error));
+		return 0;
+	}
+
+	SCSI_LOG_ERROR_RECOVERY(2, starget_printk(KERN_INFO, starget,
+		"%s:targeteh: waking up, busy/error is %d/%d\n",
+		current->comm, nr_busy, nr_error));
+
+	return schedule_work(&eh->eh_handle_work);
+}
+
+int scsi_target_setup_eh(struct scsi_target *starget, int fallback)
+{
+	struct scsi_target_eh *eh;
+	struct starget_eh *stargeteh;
+
+	eh = kzalloc(sizeof(struct scsi_device_eh) + sizeof(struct starget_eh),
+		GFP_KERNEL);
+	if (!eh) {
+		starget_printk(KERN_ERR, starget, "failed to setup eh\n");
+		return -ENOMEM;
+	}
+	stargeteh = (struct starget_eh *)eh->driver_data;
+
+	eh->add_cmnd = starget_eh_add_cmnd;
+	eh->is_busy  = starget_eh_is_busy;
+	eh->wakeup   = starget_eh_wakeup;
+	stargeteh->starget = starget;
+	stargeteh->fallback = fallback;
+
+	spin_lock_init(&stargeteh->eh_lock);
+	INIT_LIST_HEAD(&stargeteh->eh_cmd_q);
+	INIT_WORK(&stargeteh->eh_handle_work, starget_eh_work);
+
+	starget->eh = eh;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(scsi_target_setup_eh);
+
+void scsi_target_clear_eh(struct scsi_target *starget)
+{
+	kfree(starget->eh);
+	starget->eh = NULL;
+}
+EXPORT_SYMBOL_GPL(scsi_target_clear_eh);
diff --git a/include/scsi/scsi_eh.h b/include/scsi/scsi_eh.h
index 80e2f130e884..011f63030589 100644
--- a/include/scsi/scsi_eh.h
+++ b/include/scsi/scsi_eh.h
@@ -24,6 +24,8 @@ extern int scsi_starget_eh(struct scsi_target *starget,
 			struct list_head *workq, struct list_head *doneq);
 extern int scsi_device_setup_eh(struct scsi_device *sdev, int fallback);
 extern void scsi_device_clear_eh(struct scsi_device *sdev);
+extern int scsi_target_setup_eh(struct scsi_target *starget, int fallback);
+extern void scsi_target_clear_eh(struct scsi_target *starget);
 
 static inline bool scsi_sense_is_deferred(const struct scsi_sense_hdr *sshdr)
 {
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 12/19] scsi: scsi_debug: Add param to control LUN bassed error handler
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (10 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 11/19] scsi: scsi_error: Add a general target based error handler JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 13/19] scsi: scsi_debug: Add param to control target based error handle JiangJianJun
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add new module param lun_eh to control if enable LUN based
error handle, and param lun_eh_fallback to control if fallback
to further recover when LUN recovery can not recover all
error commands. This is used to test the LUN based error handle.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_debug.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 5ceaa4665e5d..162728cca99d 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -884,6 +884,8 @@ static bool write_since_sync;
 static bool sdebug_statistics = DEF_STATISTICS;
 static bool sdebug_wp;
 static bool sdebug_allow_restart;
+static bool sdebug_lun_eh;
+static bool sdebug_lun_eh_fallback;
 static enum {
 	BLK_ZONED_NONE	= 0,
 	BLK_ZONED_HA	= 1,
@@ -5885,6 +5887,9 @@ static int scsi_debug_sdev_init(struct scsi_device *sdp)
 		pr_info("sdev_init <%u %u %u %llu>\n",
 		       sdp->host->host_no, sdp->channel, sdp->id, sdp->lun);
 
+	if (sdebug_lun_eh)
+		return scsi_device_setup_eh(sdp, sdebug_lun_eh_fallback);
+
 	return 0;
 }
 
@@ -5953,6 +5958,9 @@ static void scsi_debug_sdev_destroy(struct scsi_device *sdp)
 	/* make this slot available for re-use */
 	devip->used = false;
 	sdp->hostdata = NULL;
+
+	if (sdebug_lun_eh)
+		scsi_device_clear_eh(sdp);
 }
 
 /* Returns true if we require the queued memory to be freed by the caller. */
@@ -6644,6 +6652,8 @@ module_param_named(zone_max_open, sdeb_zbc_max_open, int, S_IRUGO);
 module_param_named(zone_nr_conv, sdeb_zbc_nr_conv, int, S_IRUGO);
 module_param_named(zone_size_mb, sdeb_zbc_zone_size_mb, int, S_IRUGO);
 module_param_named(allow_restart, sdebug_allow_restart, bool, S_IRUGO | S_IWUSR);
+module_param_named(lun_eh, sdebug_lun_eh, bool, S_IRUGO);
+module_param_named(lun_eh_fallback, sdebug_lun_eh_fallback, bool, S_IRUGO);
 
 MODULE_AUTHOR("Eric Youngdale + Douglas Gilbert");
 MODULE_DESCRIPTION("SCSI debug adapter driver");
@@ -6723,6 +6733,8 @@ MODULE_PARM_DESC(zone_max_open, "Maximum number of open zones; [0] for no limit
 MODULE_PARM_DESC(zone_nr_conv, "Number of conventional zones (def=1)");
 MODULE_PARM_DESC(zone_size_mb, "Zone size in MiB (def=auto)");
 MODULE_PARM_DESC(allow_restart, "Set scsi_device's allow_restart flag(def=0)");
+MODULE_PARM_DESC(lun_eh, "LUN based error handle (def=0)");
+MODULE_PARM_DESC(lun_eh_fallback, "Fallback to further recovery if LUN recovery failed (def=0)");
 
 #define SDEBUG_INFO_LEN 256
 static char sdebug_info[SDEBUG_INFO_LEN];
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 13/19] scsi: scsi_debug: Add param to control target based error handle
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (11 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 12/19] scsi: scsi_debug: Add param to control LUN bassed " JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 14/19] scsi: mpt3sas: Add param to control LUN " JiangJianJun
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add new module param target_eh to control if enable target based error
handler, and param target_eh_fallback to control if fallback to further
recover when target recovery can not recover all error commands.
This is used to test the target based error handle.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/scsi_debug.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 162728cca99d..5c0c2529b865 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -886,6 +886,8 @@ static bool sdebug_wp;
 static bool sdebug_allow_restart;
 static bool sdebug_lun_eh;
 static bool sdebug_lun_eh_fallback;
+static bool sdebug_target_eh;
+static bool sdebug_target_eh_fallback;
 static enum {
 	BLK_ZONED_NONE	= 0,
 	BLK_ZONED_HA	= 1,
@@ -1195,6 +1197,9 @@ static int sdebug_target_alloc(struct scsi_target *starget)
 
 	starget->hostdata = targetip;
 
+	if (sdebug_target_eh)
+		return scsi_target_setup_eh(starget, sdebug_target_eh_fallback);
+
 	return 0;
 }
 
@@ -1210,6 +1215,9 @@ static void sdebug_target_destroy(struct scsi_target *starget)
 {
 	struct sdebug_target_info *targetip;
 
+	if (sdebug_target_eh)
+		scsi_target_clear_eh(starget);
+
 	targetip = (struct sdebug_target_info *)starget->hostdata;
 	if (targetip) {
 		starget->hostdata = NULL;
@@ -6654,6 +6662,8 @@ module_param_named(zone_size_mb, sdeb_zbc_zone_size_mb, int, S_IRUGO);
 module_param_named(allow_restart, sdebug_allow_restart, bool, S_IRUGO | S_IWUSR);
 module_param_named(lun_eh, sdebug_lun_eh, bool, S_IRUGO);
 module_param_named(lun_eh_fallback, sdebug_lun_eh_fallback, bool, S_IRUGO);
+module_param_named(target_eh, sdebug_target_eh, bool, S_IRUGO);
+module_param_named(target_eh_fallback, sdebug_target_eh_fallback, bool, S_IRUGO);
 
 MODULE_AUTHOR("Eric Youngdale + Douglas Gilbert");
 MODULE_DESCRIPTION("SCSI debug adapter driver");
@@ -6735,6 +6745,8 @@ MODULE_PARM_DESC(zone_size_mb, "Zone size in MiB (def=auto)");
 MODULE_PARM_DESC(allow_restart, "Set scsi_device's allow_restart flag(def=0)");
 MODULE_PARM_DESC(lun_eh, "LUN based error handle (def=0)");
 MODULE_PARM_DESC(lun_eh_fallback, "Fallback to further recovery if LUN recovery failed (def=0)");
+MODULE_PARM_DESC(target_eh, "target based error handle (def=0)");
+MODULE_PARM_DESC(target_eh_fallback, "Fallback to further recovery if target recovery failed (def=0)");
 
 #define SDEBUG_INFO_LEN 256
 static char sdebug_info[SDEBUG_INFO_LEN];
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 14/19] scsi: mpt3sas: Add param to control LUN based error handle
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (12 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 13/19] scsi: scsi_debug: Add param to control target based error handle JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 15/19] scsi: mpt3sas: Add param to control target " JiangJianJun
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add new module param lun_eh to control if enable LUN based
error handler, since mpt3sas defined callback eh_host_reset
and eh_target_reset, so make it fallback to further recover
when LUN based recovery can not recover all error commands.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index a456e5ec74d8..b3ceba3c1ea8 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -173,6 +173,10 @@ module_param(host_tagset_enable, int, 0444);
 MODULE_PARM_DESC(host_tagset_enable,
 	"Shared host tagset enable/disable Default: enable(1)");
 
+static bool lun_eh;
+module_param(lun_eh, bool, 0444);
+MODULE_PARM_DESC(lun_eh, "LUN based error handle (def=0)");
+
 /* raid transport support */
 static struct raid_template *mpt3sas_raid_template;
 static struct raid_template *mpt2sas_raid_template;
@@ -2043,6 +2047,13 @@ scsih_sdev_init(struct scsi_device *sdev)
 	struct _sas_device *sas_device;
 	struct _pcie_device *pcie_device;
 	unsigned long flags;
+	int ret = 0;
+
+	if (lun_eh) {
+		ret = scsi_device_setup_eh(sdev, 1);
+		if (ret)
+			return ret;
+	}
 
 	sas_device_priv_data = kzalloc(sizeof(*sas_device_priv_data),
 				       GFP_KERNEL);
@@ -2121,6 +2132,9 @@ scsih_sdev_destroy(struct scsi_device *sdev)
 	struct _pcie_device *pcie_device;
 	unsigned long flags;
 
+	if (lun_eh)
+		scsi_device_clear_eh(sdev);
+
 	if (!sdev->hostdata)
 		return;
 
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 15/19] scsi: mpt3sas: Add param to control target based error handle
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (13 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 14/19] scsi: mpt3sas: Add param to control LUN " JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 16/19] scsi: smartpqi: Add param to control LUN " JiangJianJun
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add new module param target_eh to control if enable target based
error handle, since mpt3sas defined callback eh_host_reset, so make
it fallback to further recover when target based recovery can not
recover all error commands.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index b3ceba3c1ea8..6f3eced511bf 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -177,6 +177,10 @@ static bool lun_eh;
 module_param(lun_eh, bool, 0444);
 MODULE_PARM_DESC(lun_eh, "LUN based error handle (def=0)");
 
+static bool target_eh;
+module_param(target_eh, bool, 0444);
+MODULE_PARM_DESC(target_eh, "target based error handle (def=0)");
+
 /* raid transport support */
 static struct raid_template *mpt3sas_raid_template;
 static struct raid_template *mpt2sas_raid_template;
@@ -1878,6 +1882,13 @@ scsih_target_alloc(struct scsi_target *starget)
 	struct _pcie_device *pcie_device;
 	unsigned long flags;
 	struct sas_rphy *rphy;
+	int ret = 0;
+
+	if (target_eh) {
+		ret = scsi_target_setup_eh(starget, 1);
+		if (ret)
+			return ret;
+	}
 
 	sas_target_priv_data = kzalloc(sizeof(*sas_target_priv_data),
 				       GFP_KERNEL);
@@ -1968,6 +1979,9 @@ scsih_target_destroy(struct scsi_target *starget)
 	struct _pcie_device *pcie_device;
 	unsigned long flags;
 
+	if (target_eh)
+		scsi_target_clear_eh(starget);
+
 	sas_target_priv_data = starget->hostdata;
 	if (!sas_target_priv_data)
 		return;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 16/19] scsi: smartpqi: Add param to control LUN based error handle
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (14 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 15/19] scsi: mpt3sas: Add param to control target " JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 17/19] scsi: megaraid_sas: Add param to control target " JiangJianJun
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add new param lun_eh to control if enable LUN based error handler, since
smartpqi did not define other further reset callbacks, it is not
necessary to fallback to further recover any more, so set the LUN
error handler with fallback set to 0.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/smartpqi/smartpqi_init.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 0da7be40c925..eb6e9be2efc2 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -189,6 +189,10 @@ module_param_named(ctrl_ready_timeout,
 MODULE_PARM_DESC(ctrl_ready_timeout,
 	"Timeout in seconds for driver to wait for controller ready.");
 
+static bool pqi_lun_eh;
+module_param_named(lun_eh, pqi_lun_eh, bool, 0444);
+MODULE_PARM_DESC(lun_eh, "LUN based error handle (def=0)");
+
 static char *raid_levels[] = {
 	"RAID-0",
 	"RAID-4",
@@ -6496,6 +6500,13 @@ static int pqi_sdev_init(struct scsi_device *sdev)
 	struct pqi_ctrl_info *ctrl_info;
 	struct scsi_target *starget;
 	struct sas_rphy *rphy;
+	int ret = 0;
+
+	if (pqi_lun_eh) {
+		ret = scsi_device_setup_eh(sdev, 0);
+		if (ret)
+			return ret;
+	}
 
 	ctrl_info = shost_to_hba(sdev->host);
 
@@ -6583,6 +6594,9 @@ static void pqi_sdev_destroy(struct scsi_device *sdev)
 
 	ctrl_info = shost_to_hba(sdev->host);
 
+	if (pqi_lun_eh)
+		scsi_device_clear_eh(sdev);
+
 	mutex_acquired = mutex_trylock(&ctrl_info->scan_mutex);
 	if (!mutex_acquired)
 		return;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 17/19] scsi: megaraid_sas: Add param to control target based error handle
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (15 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 16/19] scsi: smartpqi: Add param to control LUN " JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 18/19] scsi: virtio_scsi: Add param to control LUN " JiangJianJun
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add new param target_eh to control if enable target based error
handler, since megaraid_sas did not define callback eh_device_reset,
so only target based error handler is enabled; and megaraid_sas
defined eh_host_reset, so make it fallback to further recover
when target based recovery can not recover all error commands.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index d85f990aec88..4ea4b754b090 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -44,6 +44,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi_tcq.h>
 #include <scsi/scsi_dbg.h>
+#include <scsi/scsi_eh.h>
 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
 
@@ -126,6 +127,10 @@ int host_tagset_enable = 1;
 module_param(host_tagset_enable, int, 0444);
 MODULE_PARM_DESC(host_tagset_enable, "Shared host tagset enable/disable Default: enable(1)");
 
+static bool target_eh;
+module_param(target_eh, bool, 0444);
+MODULE_PARM_DESC(target_eh, "target based error handle (def=0)");
+
 MODULE_LICENSE("GPL");
 MODULE_VERSION(MEGASAS_VERSION);
 MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
@@ -2176,6 +2181,19 @@ static void megasas_sdev_destroy(struct scsi_device *sdev)
 	sdev->hostdata = NULL;
 }
 
+static int megasas_target_alloc(struct scsi_target *starget)
+{
+	if (target_eh)
+		return scsi_target_setup_eh(starget, 1);
+	return 0;
+}
+
+static void megasas_target_destroy(struct scsi_target *starget)
+{
+	if (target_eh)
+		scsi_target_clear_eh(starget);
+}
+
 /*
 * megasas_complete_outstanding_ioctls - Complete outstanding ioctls after a
 *                                       kill adapter
@@ -3524,6 +3542,8 @@ static const struct scsi_host_template megasas_template = {
 	.change_queue_depth = scsi_change_queue_depth,
 	.max_segment_size = 0xffffffff,
 	.cmd_size = sizeof(struct megasas_cmd_priv),
+	.target_alloc = megasas_target_alloc,
+	.target_destroy = megasas_target_destroy,
 };
 
 /**
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 18/19] scsi: virtio_scsi: Add param to control LUN based error handle
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (16 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 17/19] scsi: megaraid_sas: Add param to control target " JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  1:29 ` [RFC PATCH v3 19/19] scsi: iscsi_tcp: " JiangJianJun
  2025-03-14  9:01 ` [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism Hannes Reinecke
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add new param lun_eh to control if enable LUN based error handler, since
virtio_scsi did not define other further reset callbacks, it is not
necessary to fallback to further recover any more, so set the LUN
error handler with fallback set to 0.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/virtio_scsi.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 21ce3e940192..99276ad0e441 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -28,6 +28,7 @@
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_tcq.h>
 #include <scsi/scsi_devinfo.h>
+#include <scsi/scsi_eh.h>
 #include <linux/seqlock.h>
 
 #include "sd.h"
@@ -41,6 +42,10 @@ module_param(virtscsi_poll_queues, uint, 0644);
 MODULE_PARM_DESC(virtscsi_poll_queues,
 		 "The number of dedicated virtqueues for polling I/O");
 
+static bool lun_eh;
+module_param(lun_eh, bool, 0444);
+MODULE_PARM_DESC(lun_eh, "LUN based error handle (def=0)");
+
 /* Command queue element */
 struct virtio_scsi_cmd {
 	struct scsi_cmnd *sc;
@@ -682,9 +687,18 @@ static int virtscsi_device_alloc(struct scsi_device *sdevice)
 	 */
 	sdevice->sdev_bflags = BLIST_TRY_VPD_PAGES;
 
+	if (lun_eh)
+		return scsi_device_setup_eh(sdevice, 0);
+
 	return 0;
 }
 
+static void virtscsi_device_destroy(struct scsi_device *sdevice)
+{
+	if (lun_eh)
+		return scsi_device_clear_eh(sdevice);
+}
+
 
 /**
  * virtscsi_change_queue_depth() - Change a virtscsi target's queue depth
@@ -801,7 +815,7 @@ static const struct scsi_host_template virtscsi_host_template = {
 	.eh_device_reset_handler = virtscsi_device_reset,
 	.eh_timed_out = virtscsi_eh_timed_out,
 	.sdev_init = virtscsi_device_alloc,
-
+	.sdev_destroy = virtscsi_device_destroy,
 	.dma_boundary = UINT_MAX,
 	.map_queues = virtscsi_map_queues,
 	.track_queue_depth = 1,
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH v3 19/19] scsi: iscsi_tcp: Add param to control LUN based error handle
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (17 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 18/19] scsi: virtio_scsi: Add param to control LUN " JiangJianJun
@ 2025-03-14  1:29 ` JiangJianJun
  2025-03-14  9:01 ` [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism Hannes Reinecke
  19 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-03-14  1:29 UTC (permalink / raw)
  To: jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, jiangjianjun3, hewenliang4,
	yangkunlin7

From: Wenchao Hao <haowenchao2@huawei.com>

Add new param lun_eh to control if enable LUN based error handler,
since iscsi_tcp defined callback eh_target_reset, so make it
fallback to further recover when LUN based recovery can not recover
all error commands.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
---
 drivers/scsi/iscsi_tcp.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index e81f60985193..6212d88a4259 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -35,6 +35,7 @@
 #include <scsi/scsi_host.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_transport_iscsi.h>
+#include <scsi/scsi_eh.h>
 #include <trace/events/iscsi.h>
 #include <trace/events/sock.h>
 
@@ -63,6 +64,10 @@ module_param_named(debug_iscsi_tcp, iscsi_sw_tcp_dbg, int,
 MODULE_PARM_DESC(debug_iscsi_tcp, "Turn on debugging for iscsi_tcp module "
 		 "Set to 1 to turn on, and zero to turn off. Default is off.");
 
+static bool iscsi_sw_tcp_lun_eh;
+module_param_named(lun_eh, iscsi_sw_tcp_lun_eh, bool, 0444);
+MODULE_PARM_DESC(lun_eh, "LUN based error handle (def=0)");
+
 #define ISCSI_SW_TCP_DBG(_conn, dbg_fmt, arg...)		\
 	do {							\
 		if (iscsi_sw_tcp_dbg)				\
@@ -1069,6 +1074,19 @@ static int iscsi_sw_tcp_sdev_configure(struct scsi_device *sdev,
 	return 0;
 }
 
+static int iscsi_sw_tcp_sdev_alloc(struct scsi_device *sdev)
+{
+	if (iscsi_sw_tcp_lun_eh)
+		return scsi_device_setup_eh(sdev, 1);
+	return 0;
+}
+
+static void iscsi_sw_tcp_sdev_destroy(struct scsi_device *sdev)
+{
+	if (iscsi_sw_tcp_lun_eh)
+		return scsi_device_clear_eh(sdev);
+}
+
 static const struct scsi_host_template iscsi_sw_tcp_sht = {
 	.module			= THIS_MODULE,
 	.name			= "iSCSI Initiator over TCP/IP",
@@ -1084,6 +1102,8 @@ static const struct scsi_host_template iscsi_sw_tcp_sht = {
 	.eh_target_reset_handler = iscsi_eh_recover_target,
 	.dma_boundary		= PAGE_SIZE - 1,
 	.sdev_configure		= iscsi_sw_tcp_sdev_configure,
+	.sdev_init			= iscsi_sw_tcp_sdev_alloc,
+	.sdev_destroy		= iscsi_sw_tcp_sdev_destroy,
 	.proc_name		= "iscsi_tcp",
 	.this_id		= -1,
 	.track_queue_depth	= 1,
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v3 09/19] scsi: core: increase/decrease target_busy without check can_queue
  2025-03-14  1:29 ` [RFC PATCH v3 09/19] scsi: core: increase/decrease target_busy without check can_queue JiangJianJun
@ 2025-03-14  1:35   ` Bart Van Assche
  0 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2025-03-14  1:35 UTC (permalink / raw)
  To: JiangJianJun, jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, hewenliang4, yangkunlin7

On 3/13/25 6:29 PM, JiangJianJun wrote:
> This is preparation for a genernal target based error handle strategy
> to check if to wake up actual error handler.

I don't like this change because it slows down the hot path for LLD
drivers that do not set starget->can_queue. Why is this change
necessary? What are the alternatives?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v3 01/19] scsi: scsi_error: Define framework for LUN/target based error handle
  2025-03-14  1:29 ` [RFC PATCH v3 01/19] scsi: scsi_error: Define framework for LUN/target based error handle JiangJianJun
@ 2025-03-14  1:49   ` Bart Van Assche
  0 siblings, 0 replies; 30+ messages in thread
From: Bart Van Assche @ 2025-03-14  1:49 UTC (permalink / raw)
  To: JiangJianJun, jejb, martin.petersen, linux-scsi
  Cc: hare, linux-kernel, lixiaokeng, hewenliang4, yangkunlin7

On 3/13/25 6:29 PM, JiangJianJun wrote:
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 815e7d63f3e2..f89de23a6807 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -291,11 +291,48 @@ static void scsi_eh_inc_host_failed(struct rcu_head *head)
>   	spin_unlock_irqrestore(shost->host_lock, flags);
>   }
>   

Please move the new-style error handling functions into a new file. 
scsi_error.c is already way too big. Mixing host-based and device-based
error handling in a single file is confusing. Please don't do this.

> +#define SCSI_EH_NO_HANDLER 1

This should be a new enumeration type instead of a define.

>   /**
>    * scsi_eh_scmd_add - add scsi cmd to error handling.
>    * @scmd:	scmd to run eh on.
>    */
> -void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
> +static void __scsi_eh_scmd_add(struct scsi_cmnd *scmd)

Please choose a better name for this function than __scsi_eh_scmd_add().

> +void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
> +{
> +	struct scsi_device *sdev = scmd->device;
> +	struct scsi_target *starget = scsi_target(sdev);
> +	struct Scsi_Host *shost = sdev->host;
> +
> +	if (unlikely(scsi_host_in_recovery(shost)))
> +		__scsi_eh_scmd_add(scmd);
> +
> +	if (unlikely(scsi_target_in_recovery(starget)))
> +		if (__scsi_eh_scmd_add_starget(scmd))
> +			__scsi_eh_scmd_add(scmd);
> +
> +	if (__scsi_eh_scmd_add_sdev(scmd))
> +		if (__scsi_eh_scmd_add_starget(scmd))
> +			__scsi_eh_scmd_add(scmd);
> +}

Please rename the function __scsi_eh_scmd_add() to make it clear that it 
is used for the host-based error handling strategey.

> +static inline int scsi_device_in_recovery(struct scsi_device *sdev)
> +{
> +	struct scsi_device_eh *eh = sdev->eh;
> +
> +	if (eh && eh->is_busy)
> +		return eh->is_busy(sdev);
> +	return 0;
> +}

Can the return type of this function be changed into 'bool'? Can the
three statements in the function body be combined into the following
statement?

	return eh && eh->is_busy && eh->is_busy(sdev);

> +static inline int scsi_target_in_recovery(struct scsi_target *starget)
> +{
> +	struct scsi_target_eh *eh = starget->eh;
> +
> +	if (eh && eh->is_busy)
> +		return eh->is_busy(starget);
> +	return 0;
> +}

Same questions here.

> +struct scsi_device_eh {
> +	/*
> +	 * add scsi command to error handler so it would be handuled by
> +	 * driver's error handle strategy
> +	 */
> +	void (*add_cmnd)(struct scsi_cmnd *scmd);
> +
> +	/*
> +	 * to judge if the device is busy handling errors, called before
> +	 * dispatch scsi cmnd
> +	 *
> +	 * return 0 if it's ready to accepy scsi cmnd
> +	 * return 0 if it's in error handle, command's would not be dispatched
> +	 */
> +	int (*is_busy)(struct scsi_device *sdev);
> +
> +	/*
> +	 * wakeup device's error handle
> +	 *
> +	 * usually the error handler strategy would not run at once when
> +	 * error command is added. This function would be called when any
> +	 * scsi cmnd is finished or when scsi cmnd is added.
> +	 */
> +	int (*wakeup)(struct scsi_device *sdev);
> +
> +	/*
> +	 * data entity for device specific error handler
> +	 */
> +	unsigned long driver_data[];
> +};

Adding unsigned long driver_data[] at the end of a structure is the old
way for extending a structure. Please don't do this. Instead, leave out
the driver_data[] array, embed the scsi_device_eh in a larger data
structure where necessary and use container_of() to convert
scsi_device_eh pointers into a pointer to the containing structure.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
  2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
                   ` (18 preceding siblings ...)
  2025-03-14  1:29 ` [RFC PATCH v3 19/19] scsi: iscsi_tcp: " JiangJianJun
@ 2025-03-14  9:01 ` Hannes Reinecke
  2025-03-14 15:55   ` Bart Van Assche
  2025-03-20  6:05   ` Christoph Hellwig
  19 siblings, 2 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-03-14  9:01 UTC (permalink / raw)
  To: JiangJianJun, jejb, martin.petersen, linux-scsi
  Cc: linux-kernel, lixiaokeng, hewenliang4, yangkunlin7

On 3/14/25 02:29, JiangJianJun wrote:
> It's unbearable for systems with large scale scsi devices share HBAs to
> block all devices' IOs when handle error commands, we need a new error
> handle mechanism to address this issue.
> 
> I consulted about this issue a year ago, the discuss link can be found in
> refenence. Hannes replied about why we have to block the SCSI host
> then perform error recovery kindly. I think it's unnecessary to block
> SCSI host for all drivers and can try a small level recovery(LUN based for
> example) first to avoid block the SCSI host.
> 
Technically, yes.
There are, however, some issues which would need to be addressed if 
someone would design a new error handler.

1. The 'LUN Reset' TMF (as it's currently being used) is badly scoped; 
it will reset the LUN itself, affecting all ports to that LUN.
So in a multipathed/multiported environment all initiators will be 
affected, even if they haven't experienced an error.
Is that what we want?
Shouldn't we rather use the 'Reset IT Nexus' TMF here?
And, of course, the 'Target Reset' TMF has been dropped from SAM,
so I really don't see the point in spending time here ...

2. Irrespective of the EH granularity, any error handing requires
that all activity on the level has to be stopped. If you need to
issue a LUN reset, you need to stop I/O for that LUN.

3. The current EH framework is designed around 'struct scsi_cmnd'.
Which means that the command _initiating_ the error handling can
only be returned once the _entire_ error handling (with all
escalations) is finished. And more often than not, the application
is waiting on that command to be completed before the next I/O
is sent. And that really limits the effectiveness of any improved
error handler; the application ultimatively has to wait for a
host reset before it can contine.

But anyway.
We already have a mechanism for asynchronous command aborts;
have you checked if you can adapt if for LUN reset, too?
That would be the easiest solution, I guess ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
  2025-03-14  9:01 ` [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism Hannes Reinecke
@ 2025-03-14 15:55   ` Bart Van Assche
  2025-04-24  9:44     ` Diangang Li
  2025-03-20  6:05   ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Bart Van Assche @ 2025-03-14 15:55 UTC (permalink / raw)
  To: Hannes Reinecke, JiangJianJun, jejb, martin.petersen, linux-scsi
  Cc: linux-kernel, lixiaokeng, hewenliang4, yangkunlin7

On 3/14/25 2:01 AM, Hannes Reinecke wrote:
> 3. The current EH framework is designed around 'struct scsi_cmnd'.
> Which means that the command _initiating_ the error handling can
> only be returned once the _entire_ error handling (with all
> escalations) is finished. And more often than not, the application
> is waiting on that command to be completed before the next I/O
> is sent. And that really limits the effectiveness of any improved
> error handler; the application ultimatively has to wait for a
> host reset before it can contine.
> 
> But anyway.
> We already have a mechanism for asynchronous command aborts;
> have you checked if you can adapt if for LUN reset, too?
> That would be the easiest solution, I guess ...

Hmm ... does this mean submitting a LUN reset while concurrently new
SCSI commands can be submitted from another thread? I don't think that's
safe.

Additionally, how could a LUN reset help if a SCSI abort doesn't help?
If a SCSI abort doesn't help, it probably means that the host controller
locked up, e.g. due to a firmware bug. How to recover from this without
resetting the host controller?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
  2025-03-14  9:01 ` [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism Hannes Reinecke
  2025-03-14 15:55   ` Bart Van Assche
@ 2025-03-20  6:05   ` Christoph Hellwig
  2025-03-31  3:10     ` 答复: " Jiangjianjun
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2025-03-20  6:05 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: JiangJianJun, jejb, martin.petersen, linux-scsi, linux-kernel,
	lixiaokeng, hewenliang4, yangkunlin7

On Fri, Mar 14, 2025 at 10:01:40AM +0100, Hannes Reinecke wrote:
> 3. The current EH framework is designed around 'struct scsi_cmnd'.
> Which means that the command _initiating_ the error handling can
> only be returned once the _entire_ error handling (with all
> escalations) is finished. And more often than not, the application
> is waiting on that command to be completed before the next I/O
> is sent. And that really limits the effectiveness of any improved
> error handler; the application ultimatively has to wait for a
> host reset before it can contine.

And someone needs to get your old series to fix that merged before
we even start talking about any major EH change.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* 答复: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
  2025-03-20  6:05   ` Christoph Hellwig
@ 2025-03-31  3:10     ` Jiangjianjun
  2025-03-31  7:50       ` John Garry
  0 siblings, 1 reply; 30+ messages in thread
From: Jiangjianjun @ 2025-03-31  3:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jiangjianjun, jejb@linux.ibm.com, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org,
	lixiaokeng, hewenliang (C), Yangkunlin(Poincare)

Sorry for late message! I'm working on fixing and testing these issues before re-emailing.

-----邮件原件-----
发件人: Christoph Hellwig <hch@infradead.org> 
发送时间: 2025年3月20日 14:06
收件人: Hannes Reinecke <hare@suse.de>
抄送: Jiangjianjun <jiangjianjun3@huawei.com>; jejb@linux.ibm.com; martin.petersen@oracle.com; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; lixiaokeng <lixiaokeng@huawei.com>; hewenliang (C) <hewenliang4@huawei.com>; Yangkunlin(Poincare) <yangkunlin7@huawei.com>
主题: Re: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism

On Fri, Mar 14, 2025 at 10:01:40AM +0100, Hannes Reinecke wrote:
> 3. The current EH framework is designed around 'struct scsi_cmnd'.
> Which means that the command _initiating_ the error handling can only 
> be returned once the _entire_ error handling (with all
> escalations) is finished. And more often than not, the application is 
> waiting on that command to be completed before the next I/O is sent. 
> And that really limits the effectiveness of any improved error 
> handler; the application ultimatively has to wait for a host reset 
> before it can contine.

And someone needs to get your old series to fix that merged before we even start talking about any major EH change.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 答复: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
  2025-03-31  3:10     ` 答复: " Jiangjianjun
@ 2025-03-31  7:50       ` John Garry
  0 siblings, 0 replies; 30+ messages in thread
From: John Garry @ 2025-03-31  7:50 UTC (permalink / raw)
  To: Jiangjianjun, Christoph Hellwig
  Cc: jejb@linux.ibm.com, martin.petersen@oracle.com,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org,
	lixiaokeng, hewenliang (C), Yangkunlin(Poincare), yangxingui,
	liyihang9

On 31/03/2025 04:10, Jiangjianjun wrote:
> Sorry for late message! I'm working on fixing and testing these issues before re-emailing.

What are you actually working on?

It seems that Hannes' "scsi: EH rework, main part" series and maybe this 
one can help resolve this following issue:

https://lore.kernel.org/linux-block/eef1e927-c9b2-c61d-7f48-92e65d8b0418@huawei.com/

with fix attempted in:

https://lore.kernel.org/linux-ide/20241031140731.224589-4-cassel@kernel.org/

so that we don't see "fixes" like:
https://lore.kernel.org/linux-scsi/20250329073236.2300582-1-liyihang9@huawei.com/T/#m80bcb3f57fd176b7ce41b1f26e8560de6ad52c9d

> 
> -----邮件原件-----
> 发件人: Christoph Hellwig <hch@infradead.org>
> 发送时间: 2025年3月20日 14:06
> 收件人: Hannes Reinecke <hare@suse.de>
> 抄送: Jiangjianjun <jiangjianjun3@huawei.com>; jejb@linux.ibm.com; martin.petersen@oracle.com; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; lixiaokeng <lixiaokeng@huawei.com>; hewenliang (C) <hewenliang4@huawei.com>; Yangkunlin(Poincare) <yangkunlin7@huawei.com>
> 主题: Re: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
> 
> On Fri, Mar 14, 2025 at 10:01:40AM +0100, Hannes Reinecke wrote:
>> 3. The current EH framework is designed around 'struct scsi_cmnd'.
>> Which means that the command _initiating_ the error handling can only
>> be returned once the _entire_ error handling (with all
>> escalations) is finished. And more often than not, the application is
>> waiting on that command to be completed before the next I/O is sent.
>> And that really limits the effectiveness of any improved error
>> handler; the application ultimatively has to wait for a host reset
>> before it can contine.
> 
> And someone needs to get your old series to fix that merged before we even start talking about any major EH change.
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v3 04/19] scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
  2025-03-14  1:29 ` [RFC PATCH v3 04/19] scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT JiangJianJun
@ 2025-04-24  9:27   ` Diangang Li
  0 siblings, 0 replies; 30+ messages in thread
From: Diangang Li @ 2025-04-24  9:27 UTC (permalink / raw)
  To: JiangJianJun
  Cc: jejb, martin.petersen, linux-scsi, hare, linux-kernel, lixiaokeng,
	hewenliang4, yangkunlin7, changfengnan

On Fri, Mar 14, 2025 at 09:29:12AM +0800, JiangJianJun wrote:
> From: Wenchao Hao <haowenchao2@huawei.com>
> 
> Add helper function scsi_eh_sdev_stu() to perform START_UNIT and check
> if to finish some error commands.
> 
> This is preparation for a genernal LUN/target based error handle
> strategy and did not change original logic.
> 
> Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
> ---
>  drivers/scsi/scsi_error.c | 50 +++++++++++++++++++++++----------------
>  1 file changed, 29 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index cc3a5adb9daa..3b55642fb585 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1567,6 +1567,31 @@ static int scsi_eh_try_stu(struct scsi_cmnd *scmd)
>  	return 1;
>  }
>  
> +static int scsi_eh_sdev_stu(struct scsi_cmnd *scmd,
> +			      struct list_head *work_q,
> +			      struct list_head *done_q)
> +{
> +	struct scsi_device *sdev = scmd->device;
> +	struct scsi_cmnd *next;
> +
> +	SCSI_LOG_ERROR_RECOVERY(3, sdev_printk(KERN_INFO, sdev,
> +				"%s: Sending START_UNIT\n", current->comm));
> +

As in the scsi_eh_stu, SCSI_SENSE_VALID and scsi_check_sense is required
before calling scsi_eh_try_stu.

> +	if (scsi_eh_try_stu(scmd)) {
> +		SCSI_LOG_ERROR_RECOVERY(3, sdev_printk(KERN_INFO, sdev,
> +				    "%s: START_UNIT failed\n", current->comm));
> +		return 0;
> +	}
> +
> +	if (!scsi_device_online(sdev) || !scsi_eh_tur(scmd))
> +		list_for_each_entry_safe(scmd, next, work_q, eh_entry)
> +			if (scmd->device == sdev &&
> +			    scsi_eh_action(scmd, SUCCESS) == SUCCESS)
> +				scsi_eh_finish_cmd(scmd, done_q);
> +
> +	return list_empty(work_q);
> +}
> +
>   /**
>   * scsi_eh_stu - send START_UNIT if needed
>   * @shost:	&scsi host being recovered.
> @@ -1581,7 +1606,7 @@ static int scsi_eh_stu(struct Scsi_Host *shost,
>  			      struct list_head *work_q,
>  			      struct list_head *done_q)
>  {
> -	struct scsi_cmnd *scmd, *stu_scmd, *next;
> +	struct scsi_cmnd *scmd, *stu_scmd;
>  	struct scsi_device *sdev;
>  
>  	shost_for_each_device(sdev, shost) {
> @@ -1604,26 +1629,9 @@ static int scsi_eh_stu(struct Scsi_Host *shost,
>  		if (!stu_scmd)
>  			continue;
>  
> -		SCSI_LOG_ERROR_RECOVERY(3,
> -			sdev_printk(KERN_INFO, sdev,
> -				     "%s: Sending START_UNIT\n",
> -				    current->comm));
> -
> -		if (!scsi_eh_try_stu(stu_scmd)) {
> -			if (!scsi_device_online(sdev) ||
> -			    !scsi_eh_tur(stu_scmd)) {
> -				list_for_each_entry_safe(scmd, next,
> -							  work_q, eh_entry) {
> -					if (scmd->device == sdev &&
> -					    scsi_eh_action(scmd, SUCCESS) == SUCCESS)
> -						scsi_eh_finish_cmd(scmd, done_q);
> -				}
> -			}
> -		} else {
> -			SCSI_LOG_ERROR_RECOVERY(3,
> -				sdev_printk(KERN_INFO, sdev,
> -					    "%s: START_UNIT failed\n",
> -					    current->comm));
> +		if (scsi_eh_sdev_stu(stu_scmd, work_q, done_q)) {
> +			scsi_device_put(sdev);
> +			break;

Maybe this shouldn't break early. If one scsi_device fails to try STU,
the next one might succeed.

>  		}
>  	}
>  
> -- 
> 2.33.0
>

Thanks,

Diangang Li

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
  2025-03-14 15:55   ` Bart Van Assche
@ 2025-04-24  9:44     ` Diangang Li
  0 siblings, 0 replies; 30+ messages in thread
From: Diangang Li @ 2025-04-24  9:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hannes Reinecke, JiangJianJun, jejb, martin.petersen, linux-scsi,
	linux-kernel, lixiaokeng, hewenliang4, yangkunlin7, changfengnan

On Fri, Mar 14, 2025 at 08:55:25AM -0700, Bart Van Assche wrote:
> On 3/14/25 2:01 AM, Hannes Reinecke wrote:
> > 3. The current EH framework is designed around 'struct scsi_cmnd'.
> > Which means that the command _initiating_ the error handling can
> > only be returned once the _entire_ error handling (with all
> > escalations) is finished. And more often than not, the application
> > is waiting on that command to be completed before the next I/O
> > is sent. And that really limits the effectiveness of any improved
> > error handler; the application ultimatively has to wait for a
> > host reset before it can contine.
> > 
> > But anyway.
> > We already have a mechanism for asynchronous command aborts;
> > have you checked if you can adapt if for LUN reset, too?
> > That would be the easiest solution, I guess ...
> 
> Hmm ... does this mean submitting a LUN reset while concurrently new
> SCSI commands can be submitted from another thread? I don't think that's
> safe.
> 
> Additionally, how could a LUN reset help if a SCSI abort doesn't help?
> If a SCSI abort doesn't help, it probably means that the host controller
> locked up, e.g. due to a firmware bug. How to recover from this without
> resetting the host controller?
>

Hi Bart,

Based on our statistic data, nearly 80% of scsi_error_handler cases recover
successfully after scsi_eh_target_reset. The current solution effectively
prevents all targets under the host from being blocked, which is particularly
beneficial for servers with large numbers of HDD data disks.
 
> Thanks,
> 
> Bart.

Thanks,

Diangang Li

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH v3 04/19] scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
@ 2025-06-25  3:37 JiangJianJun
  0 siblings, 0 replies; 30+ messages in thread
From: JiangJianJun @ 2025-06-25  3:37 UTC (permalink / raw)
  To: lidiangang
  Cc: jejb, martin.petersen, linux-scsi, hare, linux-kernel, lixiaokeng,
	hewenliang4, yangkunlin7, changfengnan

 > From: Wenchao Hao <haowenchao2@huawei.com>
 >
 > Add helper function scsi_eh_sdev_stu() to perform START_UNIT and check
 > if to finish some error commands.
 >
 > > This is preparation for a genernal LUN/target based error handle
 > > strategy and did not change original logic.
 > >
 > > Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
 > > ---
 > >  drivers/scsi/scsi_error.c | 50 +++++++++++++++++++++++----------------
 > >  1 file changed, 29 insertions(+), 21 deletions(-)
 > >
 > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
 > > index cc3a5adb9daa..3b55642fb585 100644
 > > --- a/drivers/scsi/scsi_error.c
 > > +++ b/drivers/scsi/scsi_error.c
 > > @@ -1567,6 +1567,31 @@ static int scsi_eh_try_stu(struct scsi_cmnd 
*scmd)
 > >  	return 1;
 > >  }
 > >
 > > +static int scsi_eh_sdev_stu(struct scsi_cmnd *scmd,
 > > +			      struct list_head *work_q,
 > > +			      struct list_head *done_q)
 > > +{
 > > +	struct scsi_device *sdev = scmd->device;
 > > +	struct scsi_cmnd *next;
 > > +
 > > +	SCSI_LOG_ERROR_RECOVERY(3, sdev_printk(KERN_INFO, sdev,
 > > +				"%s: Sending START_UNIT\n", current->comm));
 > > +
 >
 > As in the scsi_eh_stu, SCSI_SENSE_VALID and scsi_check_sense is required
 > before calling scsi_eh_try_stu.

But the SCSI_SENSE_VALID and scsi_check_sense has been called before 
calling scsi_eh_try_stu, see in loop devices in scsi_eh_stu, do you 
means re-call at here?


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-06-25  3:37 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-14  1:29 [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 01/19] scsi: scsi_error: Define framework for LUN/target based error handle JiangJianJun
2025-03-14  1:49   ` Bart Van Assche
2025-03-14  1:29 ` [RFC PATCH v3 02/19] scsi: scsi_error: Move complete variable eh_action from shost to sdevice JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 03/19] scsi: scsi_error: Check if to do reset in scsi_try_xxx_reset JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 04/19] scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT JiangJianJun
2025-04-24  9:27   ` Diangang Li
2025-03-14  1:29 ` [RFC PATCH v3 05/19] scsi: scsi_error: Add helper scsi_eh_sdev_reset to do lun reset JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 06/19] scsi: scsi_error: Add flags to mark error handle steps has done JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 07/19] scsi: scsi_error: Add helper to handle scsi device's error command list JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 08/19] scsi: scsi_error: Add a general LUN based error handler JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 09/19] scsi: core: increase/decrease target_busy without check can_queue JiangJianJun
2025-03-14  1:35   ` Bart Van Assche
2025-03-14  1:29 ` [RFC PATCH v3 10/19] scsi: scsi_error: Add helper to handle scsi target's error command list JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 11/19] scsi: scsi_error: Add a general target based error handler JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 12/19] scsi: scsi_debug: Add param to control LUN bassed " JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 13/19] scsi: scsi_debug: Add param to control target based error handle JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 14/19] scsi: mpt3sas: Add param to control LUN " JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 15/19] scsi: mpt3sas: Add param to control target " JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 16/19] scsi: smartpqi: Add param to control LUN " JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 17/19] scsi: megaraid_sas: Add param to control target " JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 18/19] scsi: virtio_scsi: Add param to control LUN " JiangJianJun
2025-03-14  1:29 ` [RFC PATCH v3 19/19] scsi: iscsi_tcp: " JiangJianJun
2025-03-14  9:01 ` [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism Hannes Reinecke
2025-03-14 15:55   ` Bart Van Assche
2025-04-24  9:44     ` Diangang Li
2025-03-20  6:05   ` Christoph Hellwig
2025-03-31  3:10     ` 答复: " Jiangjianjun
2025-03-31  7:50       ` John Garry
  -- strict thread matches above, loose matches on Subject: below --
2025-06-25  3:37 [RFC PATCH v3 04/19] scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT JiangJianJun

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.