* [PATCH] tests/nvme: Add admin-passthru+reset race test
@ 2022-11-14 20:34 Jonathan Derrick
2022-11-15 0:33 ` kernel test robot
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Jonathan Derrick @ 2022-11-14 20:34 UTC (permalink / raw)
To: linux-nvme, linux-block, Shin\'ichiro Kawasaki
Cc: Keith Busch, Omar Sandoval, Jonathan Derrick
Adds a test which runs many formats and reset_controllers in parallel.
The intent is to expose timing holes in the controller state machine
which will lead to hung task timing and the controller becoming
unavailable.
Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354
Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
---
tests/nvme/046 | 85 ++++++++++++++++++++++++++++++++++++++++++++++
tests/nvme/046.out | 2 ++
2 files changed, 87 insertions(+)
create mode 100755 tests/nvme/046
create mode 100644 tests/nvme/046.out
diff --git a/tests/nvme/046 b/tests/nvme/046
new file mode 100755
index 0000000..4b47783
--- /dev/null
+++ b/tests/nvme/046
@@ -0,0 +1,85 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2022 Jonathan Derrick <jonathan.derrick@linux.dev>
+#
+# Test nvme reset controller during admin passthru
+#
+# Regression for issue reported by
+# https://bugzilla.kernel.org/show_bug.cgi?id=216354
+
+. tests/nvme/rc
+
+#restrict test to nvme-pci only
+nvme_trtype=pci
+
+DESCRIPTION="test nvme reset controller during admin passthru"
+QUICK=1
+CAN_BE_ZONED=1
+
+requires() {
+ _nvme_requires
+}
+
+device_requires() {
+ _require_test_dev_is_nvme
+}
+
+test_device() {
+ echo "Running ${TEST_NAME}"
+
+ local sysfs
+ local attr
+ local m
+
+ sysfs="$TEST_DEV_SYSFS/device"
+ timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2))
+
+ sleep 5
+
+ if [[ ! -d "$sysfs" ]]; then
+ echo "$sysfs doesn't exist"
+ fi
+
+ # do reset controller/format loops
+ # don't check status now because a timing race is desired
+ i=0
+ start=0
+ timing_out=false
+ while [[ $i -le 1000 ]]; do
+ start=$SECONDS
+ if [[ -f "$sysfs/reset_controller" ]]; then
+ echo 1 > "$sysfs/reset_controller" 2>/dev/null &
+ i=$((i+1))
+ fi
+ nvme format -l 0 -f $TEST_DEV 2>/dev/null &
+
+ #Assume the controller is hung and unrecoverable
+ if [[ $(($SECONDS - $start)) -gt $timeout ]]; then
+ echo "nvme controller timing out"
+ timing_out=true
+ break
+ fi
+ done
+
+ { kill $!; wait; } &> /dev/null
+
+ # at this point it may have waited hung_task_timeout / 2 already, so
+ # only wait 25% longer for a total of about 75% of allowed timeout
+ m=0
+ while [[ $m -le $((timeout / 2)) ]]; do
+ if [[ $timing_out == true ]]; then
+ break
+ fi
+ if grep -q live "$sysfs/state"; then
+ break
+ fi
+ sleep 1
+ m=$((m+1))
+ done
+ if ! grep -q live "$sysfs/state"; then
+ echo "nvme still not live after $(($SECONDS - $start)) seconds!"
+ fi
+ udevadm settle
+
+ echo "Test complete"
+}
diff --git a/tests/nvme/046.out b/tests/nvme/046.out
new file mode 100644
index 0000000..2b5fa6a
--- /dev/null
+++ b/tests/nvme/046.out
@@ -0,0 +1,2 @@
+Running nvme/046
+Test complete
--
2.31.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
@ 2022-11-15 0:33 ` kernel test robot
2022-11-15 23:07 ` Keith Busch
2022-11-16 7:43 ` Christoph Hellwig
2 siblings, 0 replies; 4+ messages in thread
From: kernel test robot @ 2022-11-15 0:33 UTC (permalink / raw)
To: Jonathan Derrick, linux-nvme, linux-block,
Shin'ichiro Kawasaki
Cc: oe-kbuild-all, Keith Busch, Omar Sandoval, Jonathan Derrick
Hi Jonathan,
I love your patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v6.1-rc5 next-20221114]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Jonathan-Derrick/tests-nvme-Add-admin-passthru-reset-race-test/20221115-043752
patch link: https://lore.kernel.org/r/20221114203412.383-1-jonathan.derrick%40linux.dev
patch subject: [PATCH] tests/nvme: Add admin-passthru+reset race test
reproduce:
scripts/spdxcheck.py
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
spdxcheck warnings: (new ones prefixed by >>)
drivers/cpufreq/amd-pstate-ut.c: 1:28 Invalid License ID: GPL-1.0-or-later
>> tests/nvme/046: 2:27 Invalid License ID: GPL-3.0+
--
0-DAY CI Kernel Test Service
https://01.org/lkp
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
2022-11-15 0:33 ` kernel test robot
@ 2022-11-15 23:07 ` Keith Busch
2022-11-16 7:43 ` Christoph Hellwig
2 siblings, 0 replies; 4+ messages in thread
From: Keith Busch @ 2022-11-15 23:07 UTC (permalink / raw)
To: Jonathan Derrick
Cc: linux-nvme, linux-block, Shin\'ichiro Kawasaki, Omar Sandoval
On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote:
> + echo "Running ${TEST_NAME}"
> +
> + local sysfs
> + local attr
> + local m
> +
> + sysfs="$TEST_DEV_SYSFS/device"
That's not the correct directory when the device is using native
nvme-multipath.
> + timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2))
> +
> + sleep 5
> +
> + if [[ ! -d "$sysfs" ]]; then
> + echo "$sysfs doesn't exist"
> + fi
> +
> + # do reset controller/format loops
> + # don't check status now because a timing race is desired
> + i=0
> + start=0
> + timing_out=false
> + while [[ $i -le 1000 ]]; do
> + start=$SECONDS
> + if [[ -f "$sysfs/reset_controller" ]]; then
> + echo 1 > "$sysfs/reset_controller" 2>/dev/null &
> + i=$((i+1))
> + fi
> + nvme format -l 0 -f $TEST_DEV 2>/dev/null &
> +
> + #Assume the controller is hung and unrecoverable
> + if [[ $(($SECONDS - $start)) -gt $timeout ]]; then
> + echo "nvme controller timing out"
> + timing_out=true
> + break
> + fi
> + done
If the controller is already undergoing a reset, then writing to the
reset_controller file becomes a no-op. Unless your "reset_controller"
completes near instantaneously, I find that this loop tears through 1000
iterations, forks 1000 formats, and only 1 reset_controller actually
gets through.
If I remove the upper limit, then I can also see the stalled task, but
it is only temporary and gets itself out of it after the admin timeout
(1 minute). Is that also your observation, or is it stuck forever?
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
2022-11-15 0:33 ` kernel test robot
2022-11-15 23:07 ` Keith Busch
@ 2022-11-16 7:43 ` Christoph Hellwig
2 siblings, 0 replies; 4+ messages in thread
From: Christoph Hellwig @ 2022-11-16 7:43 UTC (permalink / raw)
To: Jonathan Derrick
Cc: linux-nvme, linux-block, Shin\'ichiro Kawasaki, Keith Busch,
Omar Sandoval
On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote:
> Adds a test which runs many formats and reset_controllers in parallel.
> The intent is to expose timing holes in the controller state machine
> which will lead to hung task timing and the controller becoming
> unavailable.
This passes just fine for me both on qemu and a thunderbolt attached
m.2 SSD. So it seems to be hardware / timing dependent.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2022-11-16 7:43 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
2022-11-15 0:33 ` kernel test robot
2022-11-15 23:07 ` Keith Busch
2022-11-16 7:43 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).