Linux SCSI subsystem development
From: "haowenchao (C)" <haowenchao2@huawei.com>
To: "James E . J . Bottomley" <jejb@linux.ibm.com>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	Hannes Reinecke <hare@suse.de>, <linux-scsi@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Cc: Dan Carpenter <error27@gmail.com>, <louhongxiang@huawei.com>
Subject: Re: [PATCH 00/13] scsi: Support LUN/target based error handle
Date: Tue, 15 Aug 2023 22:08:31 +0800
Message-ID: <2fa67edb-7cf2-e6bb-a2ab-425911226fbb@huawei.com>
In-Reply-To: <20230723234422.1629194-1-haowenchao2@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 7741 bytes --]

On 2023/7/24 7:44, Wenchao Hao wrote:
> The original error handler sets the host to the recovery state while it
> performs error recovery, so no LUN that shares that host can handle IO.
> This is unacceptable for systems that deploy many LUNs on one HBA.
> 
> This patchset introduces support for LUN/target based error handling;
> drivers can choose whether to implement it. They can implement LUN based,
> target based, or both kinds of error handling according to their own
> error handling strategy. The first patch defines the framework, which
> abstracts three key operations: add an error command, wake up the error
> handler, and block IOs while an error command is added and being
> recovered. Drivers should implement these three callbacks and register
> them with the SCSI mid-layer.
> 
> Besides the basic framework, this patchset also adds a basic LUN/target
> based error handling strategy.
> 
> LUN based eh tries check sense, start unit and reset LUN. If these steps
> cannot recover all error commands, it falls back to further recovery:
> target based eh (if implemented) or host based error handling.
> 
> Target based eh works the same way: it tries check sense, start unit,
> reset LUN and reset target. If these steps cannot recover all error
> commands, it falls back to host based error handling.
> 
> This patchset was tested with scsi_debug, which supports single LUN error
> injection. The scsi_debug patches are here:
> 
> https://lore.kernel.org/linux-scsi/20230723234105.1628982-1-haowenchao2@huawei.com/T/#t
> 
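
To make the escalation order described above concrete, here is a minimal
shell sketch. It is not the kernel code: the function name eh_ladder is made
up for illustration, and its two arguments simply mirror the lun_eh/target_eh
module parameters used by the test script. It only prints which recovery
levels would be tried, in order, and the steps of each level as listed in the
cover letter.

```shell
# Print the recovery levels tried in order (illustrative only).
# $1: Y if LUN based eh is implemented, $2: Y if target based eh is.
eh_ladder()
{
	lun_eh=$1
	target_eh=$2

	if [ "$lun_eh" = Y ]; then
		echo "lun: check sense, start unit, reset LUN"
	fi
	if [ "$target_eh" = Y ]; then
		echo "target: check sense, start unit, reset LUN, reset target"
	fi
	# Host based recovery is always the last fallback.
	echo "host: host based error handle"
}

eh_ladder Y N
```

With "Y N" the sketch prints the lun line followed by the host line, which
matches the LUN-only case where a failed LUN recovery goes straight to host
recovery.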

I tested this patch set with scsi_debug in the following scenarios; see the
attachments for my test script and the resulting logs.

LUN based error handle only (lun_eh=Y target_eh=N):

+-----------+---------+-------------------------------------------------------+
| lun reset | TUR     | Desired result                                        |
+-----------+---------+-------------------------------------------------------+
| success   | success | retry or finish with EIO (may offline disk)           |
+-----------+---------+-------------------------------------------------------+
| success   | fail    | fall back to host recovery, retry or finish with      |
|           |         | EIO (may offline disk)                                |
+-----------+---------+-------------------------------------------------------+
| fail      | NA      | fall back to host recovery, retry or finish with      |
|           |         | EIO (may offline disk)                                |
+-----------+---------+-------------------------------------------------------+

Target based error handle only (lun_eh=N target_eh=Y):

+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO (may offline disk)       |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | retry or finish with         |
|           |         |              |         | EIO (may offline disk)       |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | retry or finish with         |
|           |         |              |         | EIO (may offline disk)       |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | fall back to host recovery,  |
|           |         |              |         | retry or finish with EIO     |
|           |         |              |         | (may offline disk)           |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | fall back to host recovery,  |
|           |         |              |         | retry or finish with EIO     |
|           |         |              |         | (may offline disk)           |
+-----------+---------+--------------+---------+------------------------------+

Both LUN and target based error handle (lun_eh=Y target_eh=Y):

+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO (may offline disk)       |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | LUN recovery falls back to   |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO (may offline |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | LUN recovery falls back to   |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO (may offline |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | LUN recovery falls back to   |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO (may      |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | LUN recovery falls back to   |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO (may      |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
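
The table for the case where both LUN and target based error handle are
enabled can also be read as a small decision function. The sketch below is
only an illustration of that table (recovery_level is a made-up name, not
something from the patches): given the four cell values of a row, it prints
the level at which recovery stops. Whatever that level is, the command is
then retried or finished with EIO. Combinations that are not in the table,
such as a failed TUR followed by a failed target reset, fall through to host
recovery here; that is my assumption.

```shell
# Each argument is "success", "fail" or "NA", as in the table cells.
# $1: lun reset, $2: TUR after lun reset,
# $3: target reset, $4: TUR after target reset.
recovery_level()
{
	lun_reset=$1
	lun_tur=$2
	target_reset=$3
	target_tur=$4

	if [ "$lun_reset" = success ] && [ "$lun_tur" = success ]; then
		echo lun
	elif [ "$target_reset" = success ] && [ "$target_tur" = success ]; then
		echo target
	else
		# Assumption: anything else ends in host based recovery.
		echo host
	fi
}

recovery_level success success NA NA
recovery_level success fail success success
recovery_level fail NA success fail
```

The three calls print lun, target and host, one per line.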


> Wenchao Hao (13):
>    scsi: Define basic framework for driver LUN/target based error handle
>    scsi:scsi_error: Move complete variable eh_action from shost to sdevice
>    scsi:scsi_error: Check if to do reset in scsi_try_xxx_reset
>    scsi:scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
>    scsi:scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
>    scsi:scsi_error: Add flags to mark error handle steps has done
>    scsi:scsi_error: Define helper to perform LUN based error handle
>    scsi:scsi_error: Add LUN based error handler based previous helper
>    scsi:core: increase/decrease target_busy without check can_queue
>    scsi:scsi_error: Define helper to perform target based error handle
>    scsi:scsi_error: Add target based error handler based previous helper
>    scsi:scsi_debug: Add param to control if setup LUN based error handle
>    scsi:scsi_debug: Add param to control if setup target based error handle
> 
>   drivers/scsi/scsi_debug.c  |  19 +
>   drivers/scsi/scsi_error.c  | 705 ++++++++++++++++++++++++++++++++++---
>   drivers/scsi/scsi_lib.c    |  23 +-
>   drivers/scsi/scsi_priv.h   |  20 ++
>   include/scsi/scsi_device.h |  97 +++++
>   include/scsi/scsi_eh.h     |   4 +
>   include/scsi/scsi_host.h   |   2 -
>   7 files changed, 813 insertions(+), 57 deletions(-)
> 

[-- Attachment #2: logs.tar.gz --]
[-- Type: application/x-gzip, Size: 7681 bytes --]

[-- Attachment #3: test.sh --]
[-- Type: text/plain, Size: 6362 bytes --]

#!/bin/bash

scsi_debug=/mnt/mainline/drivers/scsi/scsi_debug.ko 

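function clear_error()
{
	error=$1
	tmpfile=$$_clear
	# Collect "<type> <opcode>" for every injected rule, skipping lines
	# that contain "Type"
	grep -v Type "$error" | awk '{print $1,$3}' > "$tmpfile"
	# Write "- <type> <opcode>" back to the error file for each rule
	while read -r line; do echo "- $line" > "$error"; done < "$tmpfile"
	rm -f "$tmpfile"

	echo 0 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset
}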

function lun_test_sense1()
{
	echo "LUN reset success, TUR success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function lun_test_sense2()
{
	echo "LUN reset success, TUR failed"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject timeout command for TUR command
	echo "0 -1 0x0 " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function lun_test_sense3()
{
	echo "LUN reset failed, fallback to target reset success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject lunreset failed 
	echo "4 -1 0xff" > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense1()
{
	echo "LUN reset success, TUR success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense2()
{
	echo "LUN reset success, TUR failed, target reset success, TUR success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject timeout command for TUR command
	echo "0 -1 0x0 " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense3()
{
	echo "LUN reset failed, target reset success, TUR success"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject lunreset failed 
	echo "4 -1 0xff" > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense4()
{
	echo "LUN reset failed, target reset success, TUR failed"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject lunreset failed 
	echo "4 -1 0xff" > ${error}
	# inject timeout command for TUR command
	echo "0 -1 0x0 " > ${error}

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)
	clear_error $error
	echo running > /sys/block/$disk/device/state
}

function target_test_sense5()
{
	echo "LUN reset failed, target reset failed, fallback to host recovery"

	# inject timeout command for write command
	echo "0 -10 0x2a " > ${error}
	# inject abort command for write command
	echo "3 -1 0x2a " > ${error}
	# inject lunreset failed 
	echo "4 -1 0xff" > ${error}
	# inject target reset failed
	echo 1 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset

	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	echo $(cat /sys/block/$disk/device/state)

	clear_error $error
	echo running > /sys/block/$disk/device/state
}

scsi_logging_level -s --error 4 > /dev/null 2>&1

insmod $scsi_debug lun_eh=Y target_eh=N
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error 
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1    > /sys/block/$disk/device/timeout
echo 1    > /sys/block/$disk/device/eh_timeout

for((loop=1;loop<=3;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	lun_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	mkdir -p logs/lun_sense$loop
	journalctl --since="$since" --until="$until" > logs/lun_sense$loop/$time.log
done
rmmod scsi_debug

insmod $scsi_debug lun_eh=N target_eh=Y
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error 
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1    > /sys/block/$disk/device/timeout
echo 1    > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=5;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	target_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	mkdir -p logs/target_sense$loop
	journalctl --since="$since" --until="$until" > logs/target_sense$loop/$time.log
done
rmmod scsi_debug

insmod $scsi_debug lun_eh=Y target_eh=Y
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error 
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1    > /sys/block/$disk/device/timeout
echo 1    > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=5;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	target_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	mkdir -p logs/lun_target_sense$loop
	journalctl --since="$since" --until="$until" > logs/lun_target_sense$loop/$time.log
done
rmmod scsi_debug
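
The script derives the debugfs paths from lsscsi output with shell parameter
expansion. A standalone sketch of that parsing follows; the sample line is
made up (host, channel, target and LUN numbers, revision and device node are
all hypothetical), so it only shows how the expansions behave, not what a
real system prints.

```shell
# Hypothetical lsscsi line for a scsi_debug disk.
line='[2:0:0:0]    disk    Linux    scsi_debug       0191  /dev/sdb'

str=$(echo "$line" | awk '{print $1}')    # first column, with brackets
scsi_id=${str#*\[}                        # strip up to and including "["
scsi_id=${scsi_id%\]*}                    # strip from "]" to the end
target_id=${scsi_id%:*}                   # strip the last ":<lun>" field
disk=$(basename "$(echo "$line" | awk '{print $6}')")

echo "$scsi_id $target_id $disk"
```

For the sample line this prints "2:0:0:0 2:0:0 sdb", so the error file would
be looked up under scsi_debug/2:0:0:0/ and fail_reset under
scsi_debug/target2:0:0/.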


Thread overview: 20+ messages
2023-07-23 23:44 [PATCH 00/13] scsi: Support LUN/target based error handle Wenchao Hao
2023-07-23 23:44 ` [PATCH 01/13] scsi: Define basic framework for driver " Wenchao Hao
2023-07-23 23:44 ` [PATCH 02/13] scsi:scsi_error: Move complete variable eh_action from shost to sdevice Wenchao Hao
2023-07-23 23:44 ` [PATCH 03/13] scsi:scsi_error: Check if to do reset in scsi_try_xxx_reset Wenchao Hao
2023-07-23 23:44 ` [PATCH 04/13] scsi:scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT Wenchao Hao
2023-07-23 23:44 ` [PATCH 05/13] scsi:scsi_error: Add helper scsi_eh_sdev_reset to do lun reset Wenchao Hao
2023-07-23 23:44 ` [PATCH 06/13] scsi:scsi_error: Add flags to mark error handle steps has done Wenchao Hao
2023-07-23 23:44 ` [PATCH 07/13] scsi:scsi_error: Define helper to perform LUN based error handle Wenchao Hao
2023-07-23 23:44 ` [PATCH 08/13] scsi:scsi_error: Add LUN based error handler based previous helper Wenchao Hao
2023-07-23 23:44 ` [PATCH 09/13] scsi:core: increase/decrease target_busy without check can_queue Wenchao Hao
2023-07-23 23:44 ` [PATCH 10/13] scsi:scsi_error: Define helper to perform target based error handle Wenchao Hao
2023-07-23 23:44 ` [PATCH 11/13] scsi:scsi_error: Add target based error handler based previous helper Wenchao Hao
2023-07-23 23:44 ` [PATCH 12/13] scsi:scsi_debug: Add param to control if setup LUN based error handle Wenchao Hao
2023-07-23 23:44 ` [PATCH 13/13] scsi:scsi_debug: Add param to control if setup target " Wenchao Hao
2023-08-15 14:08 ` haowenchao (C) [this message]
2023-08-15 14:17 ` [PATCH 00/13] scsi: Support LUN/target " haowenchao (C)
2023-08-15 15:48   ` Bart Van Assche
2023-08-16  2:14     ` haowenchao (C)
2023-08-21 13:31 ` haowenchao (C)
2023-08-30  9:45 ` haowenchao (C)
