From: "haowenchao (C)" <haowenchao2@huawei.com>
To: "James E . J . Bottomley" <jejb@linux.ibm.com>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	Hannes Reinecke <hare@suse.de>, <linux-scsi@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Cc: Dan Carpenter <error27@gmail.com>, <louhongxiang@huawei.com>
Subject: Re: [PATCH 00/13] scsi: Support LUN/target based error handle
Date: Tue, 15 Aug 2023 22:08:31 +0800
Message-ID: <2fa67edb-7cf2-e6bb-a2ab-425911226fbb@huawei.com>
In-Reply-To: <20230723234422.1629194-1-haowenchao2@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 7741 bytes --]

On 2023/7/24 7:44, Wenchao Hao wrote:
> The original error handler puts the host into the recovery state while
> it performs error recovery, so none of the LUNs that share the host
> can handle I/O. This is unacceptable for systems that deploy many LUNs
> on one HBA.
> 
> This patchset introduces support for LUN/target based error handling;
> drivers can choose whether to implement it. They can implement LUN
> based, target based, or both, according to their own error handling
> strategy. The first patch defines the framework. It abstracts three
> key operations: add an error command, wake up the error handler, and
> block I/O while an error command is added and being recovered. Drivers
> should implement these three callbacks and register them with the SCSI
> mid-level.
> 
> Besides the basic framework, this patchset also adds a basic LUN/target
> based error handling strategy.
> 
> LUN based eh tries check sense, start unit and LUN reset. If these
> steps cannot recover all error commands, it falls back to further
> recovery: target based (if implemented) or host based error handling.
> 
> Target based eh works the same way: it tries check sense, start unit,
> LUN reset and target reset. If these steps cannot recover all error
> commands, it falls back to host based error handling.
> 
> This patchset was tested with scsi_debug, which supports single-LUN
> error injection. The scsi_debug patches are here:
> 
> https://lore.kernel.org/linux-scsi/20230723234105.1628982-1-haowenchao2@huawei.com/T/#t
> 

I tested this patch set with scsi_debug in the following scenarios; see
the attachments for my test script and result logs.

LUN based error handling only (scsi_debug lun_eh=Y target_eh=N):

+-----------+---------+-------------------------------------------------------+
| lun reset | TUR     | Desired result                                        |
+-----------+---------+-------------------------------------------------------+
| success   | success | retry or finish with EIO(may offline disk)            |
+-----------+---------+-------------------------------------------------------+
| success   | fail    | fallback to host recovery, retry or finish with       |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+
| fail      | NA      | fallback to host recovery, retry or finish with       |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+

Target based error handling only (scsi_debug lun_eh=N target_eh=Y):

+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+

Both LUN and target based error handling (scsi_debug lun_eh=Y target_eh=Y):

+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | lun recovery fallback to     |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | lun recovery fallback to     |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | lun recovery fallback to     |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | lun recovery fallback to     |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
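
The escalation order in the tables above can be summarised as a small
shell function. This is only an illustrative sketch, not part of the
patch set or of test.sh: the name eh_path and its arguments are made up
here, and it models the expected result, not the kernel code. It prints
the level at which recovery is expected to end.

```shell
# Illustrative model of the tables above; eh_path and its arguments are
# hypothetical names, not kernel or scsi_debug interfaces.
#
# usage: eh_path <target_eh: Y|N> <lun_reset> <tur_after_lun_reset> \
#                [<target_reset> <tur_after_target_reset>]
# Each result argument is "success", "fail" or "NA".
eh_path()
{
	local target_eh=$1 lun_reset=$2 lun_tur=$3
	local target_reset=${4:-NA} target_tur=${5:-NA}

	# LUN reset worked and the device answers TUR: stop at LUN level
	if [ "$lun_reset" = success ] && [ "$lun_tur" = success ]; then
		echo lun
		return
	fi

	# no target based eh configured: the next step is host recovery
	if [ "$target_eh" != Y ]; then
		echo host
		return
	fi

	# target reset worked and the device answers TUR: stop at target level
	if [ "$target_reset" = success ] && [ "$target_tur" = success ]; then
		echo target
		return
	fi

	# everything else falls back to host recovery
	echo host
}
```

For example, "eh_path Y fail NA success fail" prints "host", matching
the fourth row of the two five-row tables. In every case the command is
then retried or finished with EIO, as the tables say.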


> Wenchao Hao (13):
>    scsi: Define basic framework for driver LUN/target based error handle
>    scsi:scsi_error: Move complete variable eh_action from shost to sdevice
>    scsi:scsi_error: Check if to do reset in scsi_try_xxx_reset
>    scsi:scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
>    scsi:scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
>    scsi:scsi_error: Add flags to mark error handle steps has done
>    scsi:scsi_error: Define helper to perform LUN based error handle
>    scsi:scsi_error: Add LUN based error handler based previous helper
>    scsi:core: increase/decrease target_busy without check can_queue
>    scsi:scsi_error: Define helper to perform target based error handle
>    scsi:scsi_error: Add target based error handler based previous helper
>    scsi:scsi_debug: Add param to control if setup LUN based error handle
>    scsi:scsi_debug: Add param to control if setup target based error handle
> 
>   drivers/scsi/scsi_debug.c  |  19 +
>   drivers/scsi/scsi_error.c  | 705 ++++++++++++++++++++++++++++++++++---
>   drivers/scsi/scsi_lib.c    |  23 +-
>   drivers/scsi/scsi_priv.h   |  20 ++
>   include/scsi/scsi_device.h |  97 +++++
>   include/scsi/scsi_eh.h     |   4 +
>   include/scsi/scsi_host.h   |   2 -
>   7 files changed, 813 insertions(+), 57 deletions(-)
> 

[-- Attachment #2: logs.tar.gz --]
[-- Type: application/x-gzip, Size: 7681 bytes --]

[-- Attachment #3: test.sh --]
[-- Type: text/plain, Size: 6362 bytes --]

#!/bin/bash
# bash is required: the script uses the "function" keyword and for((...)) loops

scsi_debug=/mnt/mainline/drivers/scsi/scsi_debug.ko 

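# Remove every injected error rule and let target reset succeed again.
function clear_error()
{
	error=$1
	tmpfile=$$_clear
	# keep the first and third columns of each rule, skipping the header line
	grep -v Type "$error" | awk '{print $1,$3}' > "$tmpfile"
	# write each rule back prefixed with "- " to remove it
	while read -r line; do echo "- $line" > "$error"; done < "$tmpfile"
	rm -f "$tmpfile"

	echo 0 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset
}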

function lun_test_sense1()
{
	echo "LUN reset success, TUR success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function lun_test_sense2()
{
	echo "LUN reset success, TUR failed"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject timeout command for TUR command
	echo "0 -1 0x0 " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function lun_test_sense3()
{
	echo "LUN reset failed, fallback to target reset success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject lunreset failed 
	echo "4 -1 0xff" > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense1()
{
	echo "LUN reset success, TUR success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense2()
{
	echo "LUN reset success, TUR failed, target reset success, TUR success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject timeout command for TUR command
	echo "0 -1 0x0 " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense3()
{
	echo "LUN reset failed, target reset success, TUR success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject lunreset failed 
	echo "4 -1 0xff" > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense4()
{
	echo "LUN reset failed, target reset success, TUR failed"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject lunreset failed 
	echo "4 -1 0xff" > ${error}
	# inject timeout command for TUR command
	echo "0 -1 0x0 " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)
	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense5()
{
	echo "LUN reset failed, target reset failed, fallback to host recovery"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject lunreset failed 
	echo "4 -1 0xff" > ${error}
	# inject target reset failed
	echo 1 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

scsi_logging_level -s --error 4 > /dev/null 2>&1

insmod $scsi_debug lun_eh=Y target_eh=N
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error 
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1    > /sys/block/$disk/device/timeout
echo 1    > /sys/block/$disk/device/eh_timeout

for((loop=1;loop<=3;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	lun_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	mkdir -p logs/lun_sense$loop
	journalctl --since="$since" --until="$until" > logs/lun_sense$loop/$time.log
done
rmmod scsi_debug

insmod $scsi_debug lun_eh=N target_eh=Y
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error 
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
# re-derive after reloading the module; the host number may have changed
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1    > /sys/block/$disk/device/timeout
echo 1    > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=5;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	target_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	mkdir -p logs/target_sense$loop
	journalctl --since="$since" --until="$until" > logs/target_sense$loop/$time.log
done
rmmod scsi_debug

insmod $scsi_debug lun_eh=Y target_eh=Y
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error 
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
# re-derive after reloading the module; the host number may have changed
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1    > /sys/block/$disk/device/timeout
echo 1    > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=5;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	target_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	mkdir -p logs/lun_target_sense$loop
	journalctl --since="$since" --until="$until" > logs/lun_target_sense$loop/$time.log
done
rmmod scsi_debug
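
The setup after each insmod derives scsi_id, target_id and disk from one
line of lsscsi output. A possible helper for that step is sketched below
so the parsing can be checked without loading the module. The function
name parse_lsscsi_line is hypothetical, and the sample line is made up
for illustration, not captured from a real run.

```shell
# Sketch only: parse_lsscsi_line is a hypothetical helper and the sample
# line below is invented, not taken from a real lsscsi run.
parse_lsscsi_line()
{
	local line=$1 str

	str=$(echo "$line" | awk '{print $1}')   # "[H:C:T:L]"
	scsi_id=${str#*\[}                       # drop the leading "["
	scsi_id=${scsi_id%\]*}                   # drop the trailing "]"
	target_id=${scsi_id%:*}                  # "H:C:T", the ":L" part removed
	disk=$(basename "$(echo "$line" | awk '{print $6}')")
}

parse_lsscsi_line "[2:0:0:0]    disk    Linux    scsi_debug       0191  /dev/sdb"
echo "$scsi_id $target_id $disk"
```

With the sample line this prints "2:0:0:0 2:0:0 sdb". In the script it
would be called as parse_lsscsi_line "$(lsscsi | grep scsi_debug | head -n 1)"
after every insmod, which also keeps target_id in step with the current
host number.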
