From: "haowenchao (C)" <haowenchao2@huawei.com>
To: "James E . J . Bottomley" <jejb@linux.ibm.com>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
Hannes Reinecke <hare@suse.de>, <linux-scsi@vger.kernel.org>,
<linux-kernel@vger.kernel.org>
Cc: Dan Carpenter <error27@gmail.com>, <louhongxiang@huawei.com>
Subject: Re: [PATCH 00/13] scsi: Support LUN/target based error handle
Date: Tue, 15 Aug 2023 22:08:31 +0800
Message-ID: <2fa67edb-7cf2-e6bb-a2ab-425911226fbb@huawei.com>
In-Reply-To: <20230723234422.1629194-1-haowenchao2@huawei.com>
[-- Attachment #1: Type: text/plain, Size: 7741 bytes --]
On 2023/7/24 7:44, Wenchao Hao wrote:
> The original error handler sets the host to recovery state while it
> performs error recovery operations, so no LUN that shares that host can
> handle I/O. This is unacceptable for systems that deploy many LUNs
> behind one HBA.
>
> This patchset introduces support for LUN/target based error handling;
> drivers can choose whether to implement it. They can implement LUN based,
> target based, or both kinds of error handling according to their own
> error handling strategy. The first patch defines the framework. It
> abstracts three key operations: add an error command, wake up the error
> handler, and block I/O while an error command is added and being
> recovered. Drivers should implement these three callbacks and register
> them with the SCSI mid-layer.
>
> Besides the basic framework, this patchset also adds a basic LUN/target
> based error handling strategy.
>
> LUN based eh tries check sense, start unit and LUN reset. If these steps
> cannot recover all error commands, it falls back to further recovery:
> target based eh (if implemented) or host based error handling.
>
> Target based eh works the same way: it tries check sense, start unit,
> LUN reset and target reset. If these steps cannot recover all error
> commands, it falls back to host based error handling.
>
> This patchset was tested with scsi_debug, which supports single LUN error
> injection. The scsi_debug patches are here:
>
> https://lore.kernel.org/linux-scsi/20230723234105.1628982-1-haowenchao2@huawei.com/T/#t
>
I tested this patch set with scsi_debug in the following scenarios. See the
attachments for my test script and the result logs.
LUN based error handle only (lun_eh=Y target_eh=N):
+-----------+---------+-------------------------------------------------------+
| lun reset | TUR     | Desired result                                        |
+-----------+---------+-------------------------------------------------------+
| success   | success | retry or finish with EIO(may offline disk)            |
+-----------+---------+-------------------------------------------------------+
| success   | fail    | fallback to host recovery, retry or finish with       |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+
| fail      | NA      | fallback to host recovery, retry or finish with       |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+
Target based error handle only (lun_eh=N target_eh=Y):
+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
Both LUN and target based error handle (lun_eh=Y target_eh=Y):
+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | lun recovery fallback to     |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | lun recovery fallback to     |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | lun recovery fallback to     |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | lun recovery fallback to     |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
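The expected results in the tables can be condensed into one lookup. The
following is only a sketch for readers, not part of the patch set or of the
attached test.sh: expected_path is a hypothetical helper name, and the mode
names lun, target and both are labels chosen here for the three module
configurations. It prints the recovery levels that are expected to run for
one combination of injected outcomes.

```shell
# Hypothetical helper, for illustration only.
# usage: expected_path <lun|target|both> <lun_reset> <tur> [<target_reset> <tur>]
# Each outcome is "success", "fail" or "NA".
expected_path()
{
	ep_lun_ok=no
	ep_tgt_ok=no
	# A reset only counts as recovered when the following TUR succeeds too.
	if [ "$2" = success ] && [ "$3" = success ]; then
		ep_lun_ok=yes
	fi
	if [ "${4:-NA}" = success ] && [ "${5:-NA}" = success ]; then
		ep_tgt_ok=yes
	fi

	case "$1" in
	lun)
		# No target handler: whatever LUN recovery cannot fix goes to the host.
		if [ "$ep_lun_ok" = yes ]; then echo "lun"; else echo "lun host"; fi
		;;
	target)
		# The target handler tries LUN reset first, then target reset.
		if [ "$ep_lun_ok" = yes ] || [ "$ep_tgt_ok" = yes ]; then
			echo "target"
		else
			echo "target host"
		fi
		;;
	both)
		if [ "$ep_lun_ok" = yes ]; then
			echo "lun"
		elif [ "$ep_tgt_ok" = yes ]; then
			echo "lun target"
		else
			echo "lun target host"
		fi
		;;
	*)
		echo "unknown mode: $1" >&2
		return 1
		;;
	esac
}

expected_path lun success fail               # prints "lun host"
expected_path target fail NA success success # prints "target"
expected_path both fail NA success fail      # prints "lun target host"
```

Whether the command is finally retried or finished with EIO is decided after
recovery in every row, so the sketch leaves that part out.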
> Wenchao Hao (13):
> scsi: Define basic framework for driver LUN/target based error handle
> scsi:scsi_error: Move complete variable eh_action from shost to sdevice
> scsi:scsi_error: Check if to do reset in scsi_try_xxx_reset
> scsi:scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
> scsi:scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
> scsi:scsi_error: Add flags to mark error handle steps has done
> scsi:scsi_error: Define helper to perform LUN based error handle
> scsi:scsi_error: Add LUN based error handler based previous helper
> scsi:core: increase/decrease target_busy without check can_queue
> scsi:scsi_error: Define helper to perform target based error handle
> scsi:scsi_error: Add target based error handler based previous helper
> scsi:scsi_debug: Add param to control if setup LUN based error handle
> scsi:scsi_debug: Add param to control if setup target based error handle
>
> drivers/scsi/scsi_debug.c | 19 +
> drivers/scsi/scsi_error.c | 705 ++++++++++++++++++++++++++++++++++---
> drivers/scsi/scsi_lib.c | 23 +-
> drivers/scsi/scsi_priv.h | 20 ++
> include/scsi/scsi_device.h | 97 +++++
> include/scsi/scsi_eh.h | 4 +
> include/scsi/scsi_host.h | 2 -
> 7 files changed, 813 insertions(+), 57 deletions(-)
>
[-- Attachment #2: logs.tar.gz --]
[-- Type: application/x-gzip, Size: 7681 bytes --]
[-- Attachment #3: test.sh --]
[-- Type: text/plain, Size: 6362 bytes --]
#!/bin/bash
scsi_debug=/mnt/mainline/drivers/scsi/scsi_debug.ko
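# Remove every injected error rule from $1 and stop failing target resets.
function clear_error()
{
	error=$1
	tmpfile=$$_clear
	# Skip the header line; keep the error type and opcode of each rule.
	cat $error | grep -v Type | awk '{print $1,$3}' > $tmpfile
	# Writing "- <type> <opcode>" removes the matching rule.
	while read -r line; do echo "- $line" > $error; done < $tmpfile
	rm -rf $tmpfile
	echo 0 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset
}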
function lun_test_sense1()
{
	echo "LUN reset success, TUR success"
	# inject a timeout for the write command (opcode 0x2a)
	echo "0 -10 0x2a " > ${error}
	# make the abort of the write command fail
	echo "3 -1 0x2a " > ${error}
	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	cat /sys/block/$disk/device/state
	clear_error $error
	echo running > /sys/block/$disk/device/state
}
function lun_test_sense2()
{
	echo "LUN reset success, TUR failed"
	# inject a timeout for the write command (opcode 0x2a)
	echo "0 -10 0x2a " > ${error}
	# make the abort of the write command fail
	echo "3 -1 0x2a " > ${error}
	# inject a timeout for the TUR command (opcode 0x0)
	echo "0 -1 0x0 " > ${error}
	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	cat /sys/block/$disk/device/state
	clear_error $error
	echo running > /sys/block/$disk/device/state
}
function lun_test_sense3()
{
	echo "LUN reset failed, fallback to target reset success"
	# inject a timeout for the write command (opcode 0x2a)
	echo "0 -10 0x2a " > ${error}
	# make the abort of the write command fail
	echo "3 -1 0x2a " > ${error}
	# make the LUN reset fail
	echo "4 -1 0xff" > ${error}
	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	cat /sys/block/$disk/device/state
	clear_error $error
	echo running > /sys/block/$disk/device/state
}
function target_test_sense1()
{
	echo "LUN reset success, TUR success"
	# inject a timeout for the write command (opcode 0x2a)
	echo "0 -10 0x2a " > ${error}
	# make the abort of the write command fail
	echo "3 -1 0x2a " > ${error}
	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	cat /sys/block/$disk/device/state
	clear_error $error
	echo running > /sys/block/$disk/device/state
}
function target_test_sense2()
{
	echo "LUN reset success, TUR failed, target reset success, TUR success"
	# inject a timeout for the write command (opcode 0x2a)
	echo "0 -10 0x2a " > ${error}
	# make the abort of the write command fail
	echo "3 -1 0x2a " > ${error}
	# inject a timeout for the TUR command (opcode 0x0)
	echo "0 -1 0x0 " > ${error}
	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	cat /sys/block/$disk/device/state
	clear_error $error
	echo running > /sys/block/$disk/device/state
}
function target_test_sense3()
{
	echo "LUN reset failed, target reset success, TUR success"
	# inject a timeout for the write command (opcode 0x2a)
	echo "0 -10 0x2a " > ${error}
	# make the abort of the write command fail
	echo "3 -1 0x2a " > ${error}
	# make the LUN reset fail
	echo "4 -1 0xff" > ${error}
	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	cat /sys/block/$disk/device/state
	clear_error $error
	echo running > /sys/block/$disk/device/state
}
function target_test_sense4()
{
	echo "LUN reset failed, target reset success, TUR failed"
	# inject a timeout for the write command (opcode 0x2a)
	echo "0 -10 0x2a " > ${error}
	# make the abort of the write command fail
	echo "3 -1 0x2a " > ${error}
	# make the LUN reset fail
	echo "4 -1 0xff" > ${error}
	# inject a timeout for the TUR command (opcode 0x0)
	echo "0 -1 0x0 " > ${error}
	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	cat /sys/block/$disk/device/state
	clear_error $error
	echo running > /sys/block/$disk/device/state
}
function target_test_sense5()
{
	echo "LUN reset failed, target reset failed, fallback to host recovery"
	# inject a timeout for the write command (opcode 0x2a)
	echo "0 -10 0x2a " > ${error}
	# make the abort of the write command fail
	echo "3 -1 0x2a " > ${error}
	# make the LUN reset fail
	echo "4 -1 0xff" > ${error}
	# make the target reset fail
	echo 1 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset
	dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
	cat /sys/block/$disk/device/state
	clear_error $error
	echo running > /sys/block/$disk/device/state
}
scsi_logging_level -s --error 4 > /dev/null 2>&1

# Round 1: LUN based error handle only.
insmod $scsi_debug lun_eh=Y target_eh=N
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=3;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	lun_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	# -p also creates logs/ and does not fail on a rerun
	mkdir -p logs/lun_sense$loop
	journalctl --since="$since" --until="$until" > logs/lun_sense$loop/$time.log
done
rmmod scsi_debug

# Round 2: target based error handle only.
insmod $scsi_debug lun_eh=N target_eh=Y
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
# the host number may differ after reloading the module
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=5;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	target_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	# -p also creates logs/ and does not fail on a rerun
	mkdir -p logs/target_sense$loop
	journalctl --since="$since" --until="$until" > logs/target_sense$loop/$time.log
done
rmmod scsi_debug

# Round 3: both LUN and target based error handle.
insmod $scsi_debug lun_eh=Y target_eh=Y
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
# the host number may differ after reloading the module
target_id=${scsi_id%\:*}
echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout
for((loop=1;loop<=5;loop++))
do
	time=$(date "+%Y-%m-%d-%H-%M-%S")
	since=$(date "+%Y-%m-%d %H:%M:%S")
	target_test_sense$loop
	sleep 3
	until=$(date "+%Y-%m-%d %H:%M:%S")
	# -p also creates logs/ and does not fail on a rerun
	mkdir -p logs/lun_target_sense$loop
	journalctl --since="$since" --until="$until" > logs/lun_target_sense$loop/$time.log
done
rmmod scsi_debug
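
The setup part of test.sh derives its debugfs paths from the first
scsi_debug line printed by lsscsi, using shell parameter expansion. Below is
a small sketch of only that parsing step so it can be tried without loading
the module. The sample line is an assumed example of lsscsi output and is
not taken from the logs; parse_hctl is a hypothetical name.

```shell
# Sample line in the usual lsscsi layout (assumed, for illustration only).
line='[2:0:0:0]    disk    Linux    scsi_debug    0191  /dev/sdb'

# parse_hctl <lsscsi line>: set scsi_id, target_id and disk as test.sh does.
parse_hctl()
{
	# First field is "[H:C:T:L]".
	str=$(printf '%s\n' "$1" | awk '{print $1}')
	scsi_id=${str#*\[}        # strip up to and including "["
	scsi_id=${scsi_id%\]*}    # strip from "]" to the end
	target_id=${scsi_id%:*}   # drop the trailing ":L" to get H:C:T
	# Sixth field is the device node; keep only its basename.
	str=$(printf '%s\n' "$1" | awk '{print $6}')
	disk=${str##*/}
}

parse_hctl "$line"
echo "scsi_id=$scsi_id target_id=$target_id disk=$disk"
# prints "scsi_id=2:0:0:0 target_id=2:0:0 disk=sdb"
```

Because target_id is derived from scsi_id, it has to be computed again
whenever the module is reloaded and the host number may have changed.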