* [PATCH blktests V2 0/1] nvme: add nvme pci timeout testcase
@ 2024-01-17 4:22 Chaitanya Kulkarni
2024-01-17 4:22 ` [PATCH blktests V2 1/1] " Chaitanya Kulkarni
0 siblings, 1 reply; 4+ messages in thread
From: Chaitanya Kulkarni @ 2024-01-17 4:22 UTC (permalink / raw)
To: linux-nvme, linux-block; +Cc: shinichiro.kawasaki, Chaitanya Kulkarni
Hi,
Add a test to trigger nvme pci timeout handler function
Below is the test log.
-ck
Change log :-
1. Use test_device() and TEST_DEV.
2. Make tesctcase pci only just like in nvme/032.
3. Remove . from copyright and test description.
4. Add _have_kernel_option FAIL_IO_TIMEOUT.
5. Save and restore fault injection settings.
6. Use "$(nproc)" to get rid of shellcheck annotation.
7. Don't remove the device keep it as it is, add module remove testcase later.
Chaitanya Kulkarni (1):
nvme: add nvme pci timeout testcase
tests/nvme/050 | 74 ++++++++++++++++++++++++++++++++++++++++++++++
tests/nvme/050.out | 2 ++
2 files changed, 76 insertions(+)
create mode 100755 tests/nvme/050
create mode 100644 tests/nvme/050.out
blktests (master) # modprobe -r nvme
blktests (master) # modprobe nvme
blktests (master) # TEST_DEVS=/dev/nvme0n1 nvme_trtype=pci ./check nvme/050
nvme/050 => nvme0n1 (test nvme-pci timeout with fio jobs) [passed]
runtime 90.854s ... 91.087s
blktests (master) #
--
2.40.0
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH blktests V2 1/1] nvme: add nvme pci timeout testcase
2024-01-17 4:22 [PATCH blktests V2 0/1] nvme: add nvme pci timeout testcase Chaitanya Kulkarni
@ 2024-01-17 4:22 ` Chaitanya Kulkarni
2024-01-19 11:11 ` Shinichiro Kawasaki
0 siblings, 1 reply; 4+ messages in thread
From: Chaitanya Kulkarni @ 2024-01-17 4:22 UTC (permalink / raw)
To: linux-nvme, linux-block; +Cc: shinichiro.kawasaki, Chaitanya Kulkarni
Trigger and test nvme-pci timeout with concurrent fio jobs.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
tests/nvme/050 | 74 ++++++++++++++++++++++++++++++++++++++++++++++
tests/nvme/050.out | 2 ++
2 files changed, 76 insertions(+)
create mode 100755 tests/nvme/050
create mode 100644 tests/nvme/050.out
diff --git a/tests/nvme/050 b/tests/nvme/050
new file mode 100755
index 0000000..6c44b43
--- /dev/null
+++ b/tests/nvme/050
@@ -0,0 +1,74 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2024 Chaitanya Kulkarni
+#
+# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function.
+#
+
+. tests/nvme/rc
+
+DESCRIPTION="test nvme-pci timeout with fio jobs"
+
+sysfs_path="/sys/kernel/debug/fail_io_timeout/"
+#restrict test to nvme-pci only
+nvme_trtype=pci
+
+# fault injection config array
+declare -A fi_array
+
+requires() {
+ _require_nvme_trtype pci
+ _have_fio
+ _nvme_requires
+ _have_kernel_option FAIL_IO_TIMEOUT
+}
+
+device_requires() {
+ _require_test_dev_is_nvme
+}
+
+save_fi_settings() {
+ for fi_attr in probability interval times space verbose
+ do
+ fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}")
+ done
+}
+
+restore_fi_settings() {
+ for fi_attr in probability interval times space verbose
+ do
+ echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}"
+ done
+}
+
+test_device() {
+ local nvme_ns
+ local io_fimeout_fail
+
+ echo "Running ${TEST_NAME}"
+
+ nvme_ns="$(basename "${TEST_DEV}")"
+ io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)"
+ save_fi_settings
+ echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
+
+ echo 99 > /sys/kernel/debug/fail_io_timeout/probability
+ echo 10 > /sys/kernel/debug/fail_io_timeout/interval
+ echo -1 > /sys/kernel/debug/fail_io_timeout/times
+ echo 0 > /sys/kernel/debug/fail_io_timeout/space
+ echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
+
+ fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
+ --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
+ --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
+
+ # shellcheck disable=SC2181
+ if [ $? -eq 0 ]; then
+ echo "Test complete"
+ else
+ echo "Test failed"
+ fi
+
+ restore_fi_settings
+ echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail
+}
diff --git a/tests/nvme/050.out b/tests/nvme/050.out
new file mode 100644
index 0000000..b78b05f
--- /dev/null
+++ b/tests/nvme/050.out
@@ -0,0 +1,2 @@
+Running nvme/050
+Test complete
--
2.40.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH blktests V2 1/1] nvme: add nvme pci timeout testcase
2024-01-17 4:22 ` [PATCH blktests V2 1/1] " Chaitanya Kulkarni
@ 2024-01-19 11:11 ` Shinichiro Kawasaki
2024-01-22 4:06 ` Chaitanya Kulkarni
0 siblings, 1 reply; 4+ messages in thread
From: Shinichiro Kawasaki @ 2024-01-19 11:11 UTC (permalink / raw)
To: Chaitanya Kulkarni
Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
Hi Chaitanya, thanks for this v2 patch.
I ran this new test case with several devices and observed these two:
1) The test case fails with my QEMU NMVE devices. It looks all of the inject
failures get recovered by error handler, then no I/O error is reported to the
fio process. I modified two fail_io_timeout parameters as follows to inject
more failures, and saw the test case fails with the device. I suggest these
parameters.
@@ -49,8 +50,8 @@ test_device() {
save_fi_settings
echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
- echo 99 > /sys/kernel/debug/fail_io_timeout/probability
- echo 10 > /sys/kernel/debug/fail_io_timeout/interval
+ echo 100 > /sys/kernel/debug/fail_io_timeout/probability
+ echo 1 > /sys/kernel/debug/fail_io_timeout/interval
echo -1 > /sys/kernel/debug/fail_io_timeout/times
echo 0 > /sys/kernel/debug/fail_io_timeout/space
echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
2) After I ran the test case, I see the test target device left with zero
size. I think this is consistent as your report [*]. I still think device
remove and rescan at the test case end is required to avoid impacts on
following test cases. It will depend on the discussion for your report,
though.
[*] https://lore.kernel.org/linux-nvme/287b9ed9-6eb3-46d0-a6f0-a9d283245b90@nvidia.com/
Please find my some more comments in line:
On Jan 16, 2024 / 20:22, Chaitanya Kulkarni wrote:
> Trigger and test nvme-pci timeout with concurrent fio jobs.
>
> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
> ---
> tests/nvme/050 | 74 ++++++++++++++++++++++++++++++++++++++++++++++
> tests/nvme/050.out | 2 ++
> 2 files changed, 76 insertions(+)
> create mode 100755 tests/nvme/050
> create mode 100644 tests/nvme/050.out
>
> diff --git a/tests/nvme/050 b/tests/nvme/050
> new file mode 100755
> index 0000000..6c44b43
> --- /dev/null
> +++ b/tests/nvme/050
> @@ -0,0 +1,74 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-3.0+
> +# Copyright (C) 2024 Chaitanya Kulkarni
> +#
> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function.
> +#
> +
> +. tests/nvme/rc
> +
> +DESCRIPTION="test nvme-pci timeout with fio jobs"
I ran this test case a ZNS drive today, and it looks working. I suggest to add:
CAN_BE_ZONED=1
> +
> +sysfs_path="/sys/kernel/debug/fail_io_timeout/"
> +#restrict test to nvme-pci only
> +nvme_trtype=pci
> +
> +# fault injection config array
> +declare -A fi_array
> +
> +requires() {
> + _require_nvme_trtype pci
We can remove this line, since we added "nvme_trtype=pci" above.
> + _have_fio
> + _nvme_requires
> + _have_kernel_option FAIL_IO_TIMEOUT
Today, I found that this test case depends on another kernel option. Let's add,
_have_kernel_option FAULT_INJECTION_DEBUG_FS
> +}
> +
> +device_requires() {
> + _require_test_dev_is_nvme
> +}
Actually, this check is not required here, since it is already done in
group_device_requires() in tests/nvme/rc. It is a small left work to remove
the same device_requires() in nvme/032.
> +
> +save_fi_settings() {
> + for fi_attr in probability interval times space verbose
> + do
> + fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}")
> + done
> +}
> +
> +restore_fi_settings() {
> + for fi_attr in probability interval times space verbose
> + do
> + echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}"
> + done
> +}
> +
> +test_device() {
> + local nvme_ns
> + local io_fimeout_fail
> +
> + echo "Running ${TEST_NAME}"
> +
> + nvme_ns="$(basename "${TEST_DEV}")"
> + io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)"
> + save_fi_settings
> + echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
> +
> + echo 99 > /sys/kernel/debug/fail_io_timeout/probability
> + echo 10 > /sys/kernel/debug/fail_io_timeout/interval
> + echo -1 > /sys/kernel/debug/fail_io_timeout/times
> + echo 0 > /sys/kernel/debug/fail_io_timeout/space
> + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
> +
> + fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
> + --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
> + --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
When I investigated the failure cause of this test case, I found fio output
is useful. So, I suggest to keep the output in $FULL.
--time_based --runtime=1m >&"$FULL"
> +
> + # shellcheck disable=SC2181
> + if [ $? -eq 0 ]; then
With that $FULL file, the if statement can be as follows. And we don't need to
disable SC2181 either.
if grep -q "Input/output error" "$FULL"; then
> + echo "Test complete"
> + else
> + echo "Test failed"
> + fi
> +
> + restore_fi_settings
> + echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail
> +}
> diff --git a/tests/nvme/050.out b/tests/nvme/050.out
> new file mode 100644
> index 0000000..b78b05f
> --- /dev/null
> +++ b/tests/nvme/050.out
> @@ -0,0 +1,2 @@
> +Running nvme/050
> +Test complete
> --
> 2.40.0
>
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH blktests V2 1/1] nvme: add nvme pci timeout testcase
2024-01-19 11:11 ` Shinichiro Kawasaki
@ 2024-01-22 4:06 ` Chaitanya Kulkarni
0 siblings, 0 replies; 4+ messages in thread
From: Chaitanya Kulkarni @ 2024-01-22 4:06 UTC (permalink / raw)
To: Shinichiro Kawasaki
Cc: linux-nvme@lists.infradead.org, Chaitanya Kulkarni,
linux-block@vger.kernel.org
On 1/19/2024 3:11 AM, Shinichiro Kawasaki wrote:
> Hi Chaitanya, thanks for this v2 patch.
>
> I ran this new test case with several devices and observed these two:
>
> 1) The test case fails with my QEMU NMVE devices. It looks all of the inject
> failures get recovered by error handler, then no I/O error is reported to the
> fio process. I modified two fail_io_timeout parameters as follows to inject
> more failures, and saw the test case fails with the device. I suggest these
> parameters.
>
> @@ -49,8 +50,8 @@ test_device() {
> save_fi_settings
> echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
>
> - echo 99 > /sys/kernel/debug/fail_io_timeout/probability
> - echo 10 > /sys/kernel/debug/fail_io_timeout/interval
> + echo 100 > /sys/kernel/debug/fail_io_timeout/probability
> + echo 1 > /sys/kernel/debug/fail_io_timeout/interval
> echo -1 > /sys/kernel/debug/fail_io_timeout/times
> echo 0 > /sys/kernel/debug/fail_io_timeout/space
> echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
>
Let me test these and add to the V3. But it is really strange, current
code pass testcase 100% of the time without above changes ...
> 2) After I ran the test case, I see the test target device left with zero
> size. I think this is consistent as your report [*]. I still think device
> remove and rescan at the test case end is required to avoid impacts on
> following test cases. It will depend on the discussion for your report,
> though.
>
Yes, still waiting on that, hopefully we will get some information to
move this forward. The reason I didn't add remove/rescan deviec coz it
helped me debug the state of the device post timeout handler which is
turned out to be not consistent for sure ..
> [*] https://lore.kernel.org/linux-nvme/287b9ed9-6eb3-46d0-a6f0-a9d283245b90@nvidia.com/
>
> Please find my some more comments in line:
>
> On Jan 16, 2024 / 20:22, Chaitanya Kulkarni wrote:
>> Trigger and test nvme-pci timeout with concurrent fio jobs.
>>
>> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
>> ---
>> tests/nvme/050 | 74 ++++++++++++++++++++++++++++++++++++++++++++++
>> tests/nvme/050.out | 2 ++
>> 2 files changed, 76 insertions(+)
>> create mode 100755 tests/nvme/050
>> create mode 100644 tests/nvme/050.out
>>
>> diff --git a/tests/nvme/050 b/tests/nvme/050
>> new file mode 100755
>> index 0000000..6c44b43
>> --- /dev/null
>> +++ b/tests/nvme/050
>> @@ -0,0 +1,74 @@
>> +#!/bin/bash
>> +# SPDX-License-Identifier: GPL-3.0+
>> +# Copyright (C) 2024 Chaitanya Kulkarni
>> +#
>> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function.
>> +#
>> +
>> +. tests/nvme/rc
>> +
>> +DESCRIPTION="test nvme-pci timeout with fio jobs"
>
> I ran this test case a ZNS drive today, and it looks working. I suggest to add:
>
> CAN_BE_ZONED=1
>
okay ..
>> +
>> +sysfs_path="/sys/kernel/debug/fail_io_timeout/"
>> +#restrict test to nvme-pci only
>> +nvme_trtype=pci
>> +
>> +# fault injection config array
>> +declare -A fi_array
>> +
>> +requires() {
>> + _require_nvme_trtype pci
>
> We can remove this line, since we added "nvme_trtype=pci" above.
yeah ..
>
>> + _have_fio
>> + _nvme_requires
>> + _have_kernel_option FAIL_IO_TIMEOUT
>
> Today, I found that this test case depends on another kernel option. Let's add,
>
> _have_kernel_option FAULT_INJECTION_DEBUG_FS
indeed good catch I'll add this ..
>
>> +}
>> +
>> +device_requires() {
>> + _require_test_dev_is_nvme
>> +}
>
> Actually, this check is not required here, since it is already done in
> group_device_requires() in tests/nvme/rc. It is a small left work to remove
> the same device_requires() in nvme/032.
>
okay will remove that ..
>> +
>> +save_fi_settings() {
>> + for fi_attr in probability interval times space verbose
>> + do
>> + fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}")
>> + done
>> +}
>> +
>> +restore_fi_settings() {
>> + for fi_attr in probability interval times space verbose
>> + do
>> + echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}"
>> + done
>> +}
>> +
>> +test_device() {
>> + local nvme_ns
>> + local io_fimeout_fail
>> +
>> + echo "Running ${TEST_NAME}"
>> +
>> + nvme_ns="$(basename "${TEST_DEV}")"
>> + io_fimeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)"
>> + save_fi_settings
>> + echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
>> +
>> + echo 99 > /sys/kernel/debug/fail_io_timeout/probability
>> + echo 10 > /sys/kernel/debug/fail_io_timeout/interval
>> + echo -1 > /sys/kernel/debug/fail_io_timeout/times
>> + echo 0 > /sys/kernel/debug/fail_io_timeout/space
>> + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
>> +
>> + fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
>> + --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
>> + --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
>
> When I investigated the failure cause of this test case, I found fio output
> is useful. So, I suggest to keep the output in $FULL.
>
> --time_based --runtime=1m >&"$FULL"
>
okay and will modify the testcase for that ..
>> +
>> + # shellcheck disable=SC2181
>> + if [ $? -eq 0 ]; then
>
> With that $FULL file, the if statement can be as follows. And we don't need to
> disable SC2181 either.
>
> if grep -q "Input/output error" "$FULL"; then
agree ..
>
>> + echo "Test complete"
>> + else
>> + echo "Test failed"
>> + fi
>> +
>> + restore_fi_settings
>> + echo "${io_fimeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail
>> +}
>> diff --git a/tests/nvme/050.out b/tests/nvme/050.out
>> new file mode 100644
>> index 0000000..b78b05f
>> --- /dev/null
>> +++ b/tests/nvme/050.out
>> @@ -0,0 +1,2 @@
>> +Running nvme/050
>> +Test complete
>> --
>> 2.40.0
Thanks for the review comments, I'll send out V3 with your
suggestions ..
-ck
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-01-22 4:06 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-17 4:22 [PATCH blktests V2 0/1] nvme: add nvme pci timeout testcase Chaitanya Kulkarni
2024-01-17 4:22 ` [PATCH blktests V2 1/1] " Chaitanya Kulkarni
2024-01-19 11:11 ` Shinichiro Kawasaki
2024-01-22 4:06 ` Chaitanya Kulkarni
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox