[ndctl PATCH] test: fail on unexpected kernel error & warning, not just "Call Trace"

Linux CXL
 help / color / mirror / Atom feed

* [ndctl PATCH] test: fail on unexpected kernel error & warning, not just "Call Trace"
@ 2025-05-10  1:20 marc.herbert
  2025-05-12 18:35 ` Alison Schofield
  0 siblings, 1 reply; 7+ messages in thread
From: marc.herbert @ 2025-05-10  1:20 UTC (permalink / raw)
  To: linux-cxl, nvdimm; +Cc: Marc Herbert

From: Marc Herbert <marc.herbert@linux.intel.com>

While a "Call Trace" is usually a bad omen, the show_trace_log_lvl()
function supports any log level. So a "Call Trace" is not a reliable
indication of a failure. More importantly: any other WARNING or ERROR
during a test should make a test FAIL. Before this commit, it does not.

So, leverage log levels for the PASS/FAIL decision.  This catches all
issues and not just the ones printing Call Traces.

Add a simple way to exclude expected warnings and errors, either on a
per-test basis or globally.

Add a way for negative tests to fail when if some expected errors are
missing.

Add COOLDOWN_MS to address the inaccuracy of the magic $SECONDS
variable.

As a good example (which initiated this), the test feedback when hitting
bug https://github.com/pmem/ndctl/issues/278, where the cxl_test module
errors at load, is completely changed by this. Instead of only half the
tests failing with a fairly cryptic and late "Numerical result out of
range" error from user space, now all tests are failing early and
consistently, all displaying the same, earlier and more relevant error.

This simple log-level based approach has been successfully used for
years in the CI of https://github.com/thesofproject and caught
countless firmware and kernel bugs.

Note: the popular message "possible circular locking ..." recently fixed
by revert v6.15-rc1-4-gdc1771f71854 is at the WARNING level, including
its Call Trace.

Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
---
 test/common            | 74 +++++++++++++++++++++++++++++++++++++++---
 test/cxl-events.sh     |  2 ++
 test/cxl-poison.sh     |  5 +++
 test/cxl-xor-region.sh |  2 ++
 test/dax.sh            |  6 ++++
 5 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/test/common b/test/common
index 75ff1a6e12be..2a95437186e7 100644
--- a/test/common
+++ b/test/common
@@ -3,6 +3,15 @@
 
 # Global variables
 
+# Small gap in journalctl to avoid cross-test pollution.  Unfortunately,
+# this needs be at least 1 second because we don't know how bash rounds
+# up or down its magic $SECONDS variable that we use below.
+COOLDOWN_MS=1200
+sleep "${COOLDOWN_MS}E-3"
+
+# Log anchor, especially useful when running tests back to back
+printf "<5>%s: sourcing test/common\n" "$0" > /dev/kmsg
+
 # NDCTL
 if [ -z $NDCTL ]; then
 	if [ -f "../ndctl/ndctl" ] && [ -x "../ndctl/ndctl" ]; then
@@ -140,15 +149,70 @@ json2var()
 	sed -e "s/[{}\",]//g; s/\[//g; s/\]//g; s/:/=/g"
 }
 
-# check_dmesg
+# - "declare -a" gives the main script the freedom to source this file
+#   before OR after adding some excludes.
+declare -a kmsg_no_fail_on
+# kmsg_no_fail_on+=('this array must never be empty to keep the code simple')
+
+kmsg_no_fail_on+=('cxl_core: loading out-of-tree module taints kernel')
+kmsg_no_fail_on+=('cxl_mock_mem.*: CXL MCE unsupported')
+kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.*: Extended linear cache calculation failed rc:-2')
+
+# 'modprobe nfit_test' prints these every time it's not already loaded
+kmsg_no_fail_on+=(
+    'nd_pmem namespace.*: unable to guarantee persistence of writes'
+    'nfit_test nfit_test.*: failed to evaluate _FIT'
+    'nfit_test nfit_test.*: Error found in NVDIMM nmem. flags: save_fail restore_fail flush_fail not_armed'
+    'nfit_test nfit_test.1: Error found in NVDIMM nmem. flags: map_fail'
+)
+
+declare -a kmsg_fail_if_missing
+
+print_all_warnings()
+{
+	( set +x;
+	  printf '%s\n' '------------ ALL warnings and errors -----------')
+	journalctl -p warning -b --since "-$((SECONDS*1000)) ms"
+}
+
 # $1: line number where this is called
 check_dmesg()
 {
-	# validate no WARN or lockdep report during the run
+	local _e_kmsg_no_fail_on=()
+	for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
+		_e_kmsg_no_fail_on+=('-e' "$re")
+	done
+
+	# Give some time for a complete kmsg->journalctl flush + any delayed test effect.
 	sleep 1
-	log=$(journalctl -r -k --since "-$((SECONDS+1))s")
-	grep -q "Call Trace" <<< $log && err $1
-	true
+
+	# Optional code to manually verify the SECONDS / COOLDOWN_MS logic.
+	# journalctl -q -b -o short-precise --since "-$((SECONDS*1000 - COOLDOWN_MS/2)) ms" > journal-"$(basename "$0")".log
+	# After enabling, check the timings in:
+	#    head -n 7 $(ls -1t build/journal-*.log | tac)
+	#    journalctl --since='- 5 min' -o short-precise -g 'test/common'
+
+	{ # Redirect to stderr so this is all at the _bottom_ in the log file
+	# Fail on kernel WARNING or ERROR. $SECONDS is bash magic.
+	if journalctl -q -p warning -k --since "-$((SECONDS*1000 - COOLDOWN_MS/2)) ms" |
+		grep -E -v "${_e_kmsg_no_fail_on[@]}"; then
+			print_all_warnings
+			err "$1"
+	fi
+
+	local expected_re
+	for expected_re in "${kmsg_fail_if_missing[@]}"; do
+		journalctl -q -p warning -k --since "-$((SECONDS*1000 - COOLDOWN_MS/2)) ms" |
+			grep -q "${expected_re}" || {
+				printf 'FAIL: expected error not found: %s\n' "$expected_re"
+				print_all_warnings
+				err "$1"
+		}
+	done
+	} >&2
+
+	# Log anchor, especially useful when running tests back to back
+	printf "<5>%s: test/common check_dmesg() OK\n" "$0" > /dev/kmsg
 }
 
 # CXL COMMON
diff --git a/test/cxl-events.sh b/test/cxl-events.sh
index c216d6aa9148..1461b487e208 100644
--- a/test/cxl-events.sh
+++ b/test/cxl-events.sh
@@ -25,6 +25,8 @@ rc=1
 dev_path="/sys/bus/platform/devices"
 trace_path="/sys/kernel/tracing"
 
+kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.* no CXL window for range')
+
 test_region_info()
 {
 	# Trigger a memdev in the cxl_test autodiscovered region
diff --git a/test/cxl-poison.sh b/test/cxl-poison.sh
index 2caf092db460..4df7d7ffbe8a 100644
--- a/test/cxl-poison.sh
+++ b/test/cxl-poison.sh
@@ -8,6 +8,11 @@ rc=77
 
 set -ex
 
+kmsg_no_fail_on+=(
+    'cxl_mock_mem cxl_mem.*: poison inject dpa:0x'
+    'cxl_mock_mem cxl_mem.*: poison clear dpa:0x'
+)
+
 trap 'err $LINENO' ERR
 
 check_prereq "jq"
diff --git a/test/cxl-xor-region.sh b/test/cxl-xor-region.sh
index b9e1d79212d3..f5e0db98b67f 100644
--- a/test/cxl-xor-region.sh
+++ b/test/cxl-xor-region.sh
@@ -17,6 +17,8 @@ modprobe cxl_test interleave_arithmetic=1
 udevadm settle
 rc=1
 
+kmsg_fail_if_missing+=('cxl_mock_mem cxl_mem.* no CXL window for range')
+
 # THEORY OF OPERATION: Create x1,2,3,4 regions to exercise the XOR math
 # option of the CXL driver. As with other cxl_test tests, changes to the
 # CXL topology in tools/testing/cxl/test/cxl.c may require an update here.
diff --git a/test/dax.sh b/test/dax.sh
index 3ffbc8079eba..c325e144753d 100755
--- a/test/dax.sh
+++ b/test/dax.sh
@@ -118,6 +118,12 @@ else
 	run_xfs
 fi
 
+kmsg_fail_if_missing=(
+    'nd_pmem pfn.*: unable to guarantee persistence of writes'
+    'Memory failure: .*: Sending SIGBUS to dax-pmd:.* due to hardware memory corruption'
+    'Memory failure: .*: recovery action for dax page: Recovered'
+)
+
 check_dmesg "$LINENO"
 
 exit 0
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [ndctl PATCH] test: fail on unexpected kernel error & warning, not just "Call Trace"
  2025-05-10  1:20 [ndctl PATCH] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
@ 2025-05-12 18:35 ` Alison Schofield
  2025-05-12 23:12   ` Marc Herbert
  0 siblings, 1 reply; 7+ messages in thread
From: Alison Schofield @ 2025-05-12 18:35 UTC (permalink / raw)
  To: marc.herbert; +Cc: linux-cxl, nvdimm

On Sat, May 10, 2025 at 01:20:46AM +0000, marc.herbert@linux.intel.com wrote:
> From: Marc Herbert <marc.herbert@linux.intel.com>
> 
> While a "Call Trace" is usually a bad omen, the show_trace_log_lvl()
> function supports any log level. So a "Call Trace" is not a reliable
> indication of a failure. More importantly: any other WARNING or ERROR
> during a test should make a test FAIL. Before this commit, it does not.
> 
> So, leverage log levels for the PASS/FAIL decision.  This catches all
> issues and not just the ones printing Call Traces.
> 
> Add a simple way to exclude expected warnings and errors, either on a
> per-test basis or globally.
> 
> Add a way for negative tests to fail when if some expected errors are
> missing.
> 
> Add COOLDOWN_MS to address the inaccuracy of the magic $SECONDS
> variable.

Thanks Marc, this is good stuff.
Since this patch is doing 2 things, the the journalctl timing, and
the parse of additional messages, I would typically ask for 2 patches,
but - I want to do even more. I want to revive an old, unmerged set
tackling similar work and get it all tidy'd up at once.

https://lore.kernel.org/all/cover.1701143039.git.alison.schofield@intel.com/
  cxl/test: add and use cxl_common_[start|stop] helpers
  cxl/test: add a cxl_ derivative of check_dmesg()
  cxl/test: use an explicit --since time in journalctl

Please take a look at how the prev patch did journalctl start time.
I believe the kmesg_fail... can be used to catch any of the failed sorts
that the old series wanted to do.

Maybe add a brief write up of how to use the kmesg choices per
test and in the common code.

Is the new kmesg approach going to fail on any ERROR or WARNING that
we don't kmesg_no_fail_on ?

And then can we simply add dev_dbg() messages to fail if missing.

I'll take a further look for example at the poison test. We want
it to warn that the poison is in a region. That is a good and
expected warning.  However, if that warn is missing, then the test
is broken! It might not 'FAIL', but it's no longer doing what we
want.

So, let's work on a rev 2 that does all the things of both our
patches. I'm happy to work it with you, or not.

Thanks,
Alison


> 
> As a good example (which initiated this), the test feedback when hitting
> bug https://github.com/pmem/ndctl/issues/278, where the cxl_test module
> errors at load, is completely changed by this. Instead of only half the
> tests failing with a fairly cryptic and late "Numerical result out of
> range" error from user space, now all tests are failing early and
> consistently, all displaying the same, earlier and more relevant error.
> 
> This simple log-level based approach has been successfully used for
> years in the CI of https://github.com/thesofproject and caught
> countless firmware and kernel bugs.
> 
> Note: the popular message "possible circular locking ..." recently fixed
> by revert v6.15-rc1-4-gdc1771f71854 is at the WARNING level, including
> its Call Trace.
> 
> Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
> ---
>  test/common            | 74 +++++++++++++++++++++++++++++++++++++++---
>  test/cxl-events.sh     |  2 ++
>  test/cxl-poison.sh     |  5 +++
>  test/cxl-xor-region.sh |  2 ++
>  test/dax.sh            |  6 ++++
>  5 files changed, 84 insertions(+), 5 deletions(-)
> 
> diff --git a/test/common b/test/common
> index 75ff1a6e12be..2a95437186e7 100644
> --- a/test/common
> +++ b/test/common
> @@ -3,6 +3,15 @@
>  
>  # Global variables
>  
> +# Small gap in journalctl to avoid cross-test pollution.  Unfortunately,
> +# this needs be at least 1 second because we don't know how bash rounds
> +# up or down its magic $SECONDS variable that we use below.
> +COOLDOWN_MS=1200
> +sleep "${COOLDOWN_MS}E-3"
> +
> +# Log anchor, especially useful when running tests back to back
> +printf "<5>%s: sourcing test/common\n" "$0" > /dev/kmsg
> +
>  # NDCTL
>  if [ -z $NDCTL ]; then
>  	if [ -f "../ndctl/ndctl" ] && [ -x "../ndctl/ndctl" ]; then
> @@ -140,15 +149,70 @@ json2var()
>  	sed -e "s/[{}\",]//g; s/\[//g; s/\]//g; s/:/=/g"
>  }
>  
> -# check_dmesg
> +# - "declare -a" gives the main script the freedom to source this file
> +#   before OR after adding some excludes.
> +declare -a kmsg_no_fail_on
> +# kmsg_no_fail_on+=('this array must never be empty to keep the code simple')
> +
> +kmsg_no_fail_on+=('cxl_core: loading out-of-tree module taints kernel')
> +kmsg_no_fail_on+=('cxl_mock_mem.*: CXL MCE unsupported')
> +kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.*: Extended linear cache calculation failed rc:-2')
> +
> +# 'modprobe nfit_test' prints these every time it's not already loaded
> +kmsg_no_fail_on+=(
> +    'nd_pmem namespace.*: unable to guarantee persistence of writes'
> +    'nfit_test nfit_test.*: failed to evaluate _FIT'
> +    'nfit_test nfit_test.*: Error found in NVDIMM nmem. flags: save_fail restore_fail flush_fail not_armed'
> +    'nfit_test nfit_test.1: Error found in NVDIMM nmem. flags: map_fail'
> +)
> +
> +declare -a kmsg_fail_if_missing
> +
> +print_all_warnings()
> +{
> +	( set +x;
> +	  printf '%s\n' '------------ ALL warnings and errors -----------')
> +	journalctl -p warning -b --since "-$((SECONDS*1000)) ms"
> +}
> +
>  # $1: line number where this is called
>  check_dmesg()
>  {
> -	# validate no WARN or lockdep report during the run
> +	local _e_kmsg_no_fail_on=()
> +	for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
> +		_e_kmsg_no_fail_on+=('-e' "$re")
> +	done
> +
> +	# Give some time for a complete kmsg->journalctl flush + any delayed test effect.
>  	sleep 1
> -	log=$(journalctl -r -k --since "-$((SECONDS+1))s")
> -	grep -q "Call Trace" <<< $log && err $1
> -	true
> +
> +	# Optional code to manually verify the SECONDS / COOLDOWN_MS logic.
> +	# journalctl -q -b -o short-precise --since "-$((SECONDS*1000 - COOLDOWN_MS/2)) ms" > journal-"$(basename "$0")".log
> +	# After enabling, check the timings in:
> +	#    head -n 7 $(ls -1t build/journal-*.log | tac)
> +	#    journalctl --since='- 5 min' -o short-precise -g 'test/common'
> +
> +	{ # Redirect to stderr so this is all at the _bottom_ in the log file
> +	# Fail on kernel WARNING or ERROR. $SECONDS is bash magic.
> +	if journalctl -q -p warning -k --since "-$((SECONDS*1000 - COOLDOWN_MS/2)) ms" |
> +		grep -E -v "${_e_kmsg_no_fail_on[@]}"; then
> +			print_all_warnings
> +			err "$1"
> +	fi
> +
> +	local expected_re
> +	for expected_re in "${kmsg_fail_if_missing[@]}"; do
> +		journalctl -q -p warning -k --since "-$((SECONDS*1000 - COOLDOWN_MS/2)) ms" |
> +			grep -q "${expected_re}" || {
> +				printf 'FAIL: expected error not found: %s\n' "$expected_re"
> +				print_all_warnings
> +				err "$1"
> +		}
> +	done
> +	} >&2
> +
> +	# Log anchor, especially useful when running tests back to back
> +	printf "<5>%s: test/common check_dmesg() OK\n" "$0" > /dev/kmsg
>  }
>  
>  # CXL COMMON
> diff --git a/test/cxl-events.sh b/test/cxl-events.sh
> index c216d6aa9148..1461b487e208 100644
> --- a/test/cxl-events.sh
> +++ b/test/cxl-events.sh
> @@ -25,6 +25,8 @@ rc=1
>  dev_path="/sys/bus/platform/devices"
>  trace_path="/sys/kernel/tracing"
>  
> +kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.* no CXL window for range')
> +
>  test_region_info()
>  {
>  	# Trigger a memdev in the cxl_test autodiscovered region
> diff --git a/test/cxl-poison.sh b/test/cxl-poison.sh
> index 2caf092db460..4df7d7ffbe8a 100644
> --- a/test/cxl-poison.sh
> +++ b/test/cxl-poison.sh
> @@ -8,6 +8,11 @@ rc=77
>  
>  set -ex
>  
> +kmsg_no_fail_on+=(
> +    'cxl_mock_mem cxl_mem.*: poison inject dpa:0x'
> +    'cxl_mock_mem cxl_mem.*: poison clear dpa:0x'
> +)
> +
>  trap 'err $LINENO' ERR
>  
>  check_prereq "jq"
> diff --git a/test/cxl-xor-region.sh b/test/cxl-xor-region.sh
> index b9e1d79212d3..f5e0db98b67f 100644
> --- a/test/cxl-xor-region.sh
> +++ b/test/cxl-xor-region.sh
> @@ -17,6 +17,8 @@ modprobe cxl_test interleave_arithmetic=1
>  udevadm settle
>  rc=1
>  
> +kmsg_fail_if_missing+=('cxl_mock_mem cxl_mem.* no CXL window for range')
> +
>  # THEORY OF OPERATION: Create x1,2,3,4 regions to exercise the XOR math
>  # option of the CXL driver. As with other cxl_test tests, changes to the
>  # CXL topology in tools/testing/cxl/test/cxl.c may require an update here.
> diff --git a/test/dax.sh b/test/dax.sh
> index 3ffbc8079eba..c325e144753d 100755
> --- a/test/dax.sh
> +++ b/test/dax.sh
> @@ -118,6 +118,12 @@ else
>  	run_xfs
>  fi
>  
> +kmsg_fail_if_missing=(
> +    'nd_pmem pfn.*: unable to guarantee persistence of writes'
> +    'Memory failure: .*: Sending SIGBUS to dax-pmd:.* due to hardware memory corruption'
> +    'Memory failure: .*: recovery action for dax page: Recovered'
> +)
> +
>  check_dmesg "$LINENO"
>  
>  exit 0
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ndctl PATCH] test: fail on unexpected kernel error & warning, not just "Call Trace"
  2025-05-12 18:35 ` Alison Schofield
@ 2025-05-12 23:12   ` Marc Herbert
  2025-05-13  0:45     ` Alison Schofield
  0 siblings, 1 reply; 7+ messages in thread
From: Marc Herbert @ 2025-05-12 23:12 UTC (permalink / raw)
  To: Alison Schofield; +Cc: linux-cxl, nvdimm

Thanks for the prompt feedback!

On 2025-05-12 11:35, Alison Schofield wrote:

> Since this patch is doing 2 things, the the journalctl timing, and
> the parse of additional messages, I would typically ask for 2 patches,
> but - I want to do even more. I want to revive an old, unmerged set
> tackling similar work and get it all tidy'd up at once.
> 
> https://lore.kernel.org/all/cover.1701143039.git.alison.schofield@intel.com/
>   cxl/test: add and use cxl_common_[start|stop] helpers
>   cxl/test: add a cxl_ derivative of check_dmesg()
>   cxl/test: use an explicit --since time in journalctl
> 
> Please take a look at how the prev patch did journalctl start time.

We've been using a "start time" in
https://github.com/thesofproject/sof-test for many years and it's been
only "OK", not great. I did not know about the $SECONDS magic variable
at the time, otherwise I would have tried it in sof-test! The main
advantage of $SECONDS: there is nothing to do, meaning there is no
"cxl_common_start()" to forget or do wrong. Speaking of which: I tested
this patch on the _entire_ ndctl/test, not just with --suite=cxl whereas
https://lore.kernel.org/all/d76c005105b7612dc47ccd19e102d462c0f4fc1b.1701143039.git.alison.schofield@intel.com/
seems to have a CXL-specific "cxl_common_start()" only?

Also, in my experience some sort of short COOLDOWN is always necessary
anyway for various reasons:
- Some tests can sometimes have "after shocks" and a cooldown helps
  with most of these.
- A short gap in the logs really help with their _readability_.
- Clocks can shift, especially inside QEMU (I naively tried to increase
  the number of cores in run_qemu.sh but had to give up due so "clock skew")
- Others I probably forgot.

On my system, the average, per-test duration is about 10s and I find that
10% is an acceptable price to pay for the peace of mind. But a starttime
should hopefully work too, at least for the majority of the time.

> I believe the kmesg_fail... can be used to catch any of the failed
> sorts that the old series wanted to do.

Yes it does, I tried to explain that but afraid my English wasn't good
enough?

> Maybe add a brief write up of how to use the kmesg choices per
> test and in the common code.

Q.E.D ;-)

> Is the new kmesg approach going to fail on any ERROR or WARNING that
> we don't kmesg_no_fail_on ?

Yes, this is the main purpose. The other feature is failing when
any of the _expected_ ERROR or WARNING is not found.

> And then can we simply add dev_dbg() messages to fail if missing.

I'm afraid you just lost me at this point... my patch already does that
without any dev_dbg()...?

> I'll take a further look for example at the poison test. We want
> it to warn that the poison is in a region. That is a good and
> expected warning.  However, if that warn is missing, then the test
> is broken! It might not 'FAIL', but it's no longer doing what we
> want.

I agree: the expected "poison inject" and "poison clear" messages should
be in the kmsg_fail_if_missing array[], not in the kmsg_no_fail_on[]
array. BUT in my experience this makes cxl-poison.sh fail when run
multiple times.  So yes: there seems to be a problem with this test.  (I
should probably file a bug somewhere?) So I put them in
kmsg_fail_if_missing[] for now because I don't have time to look into it
now and I don't think a problem in a single test should hold back the
improvement for the entire suite that exposes it. Even with just
kmsg_no_fail_on[], this test is still better than now.

BTW this is a typical game of whack-a-mole every time you try to tighten
a test screw. In SOF it took 4-5 years to finally catch all firmware
errors: https://github.com/thesofproject/sof-test/issues/297

> So, let's work on a rev 2 that does all the things of both our
> patches. I'm happy to work it with you, or not.

I agree the COOLDOWN / starttime is a separate feature. But... I needed it
for the tests to pass! I find it important to keep the tests all passing
in every commit for bisectability etc., hope you agree. Also, really hard
to submit anything that does not pass the tests :-)

As of now, the tests tolerate cross-test pollution. Being more
demanding when inspecting the logs obviously makes them fail, at least
sometimes. I agree the "timing" solution should go first, so here's
a suggested plan:

1. a) Either I resubmit my COOLDOWN alone,
   b) or you generalize your cxl_common_start()/starttime to non-CXL tests.

No check_dmesg() change yet. "cxl_check_dmesg()" is abandoned forever.

Then:

2. I rebase and resubmit my kmsg_no_fail_on=...

This will give more time for people to try and report any issue in the
timing fix 1. - whichever is it.

In the 1.a) case, I think your [cxl_]common_start() de-duplication is
99% independent and can be submitted at any point.

Thoughts?

PS: keep in mind I may be pulled in other priorities at any time :-(

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ndctl PATCH] test: fail on unexpected kernel error & warning, not just "Call Trace"
  2025-05-12 23:12   ` Marc Herbert
@ 2025-05-13  0:45     ` Alison Schofield
  2025-05-15  2:14       ` marc.herbert
  0 siblings, 1 reply; 7+ messages in thread
From: Alison Schofield @ 2025-05-13  0:45 UTC (permalink / raw)
  To: Marc Herbert; +Cc: linux-cxl, nvdimm

On Mon, May 12, 2025 at 04:12:35PM -0700, Marc Herbert wrote:
> Thanks for the prompt feedback!
> 
> On 2025-05-12 11:35, Alison Schofield wrote:
> 
> > Since this patch is doing 2 things, the the journalctl timing, and
> > the parse of additional messages, I would typically ask for 2 patches,
> > but - I want to do even more. I want to revive an old, unmerged set
> > tackling similar work and get it all tidy'd up at once.
> > 
> > https://lore.kernel.org/all/cover.1701143039.git.alison.schofield@intel.com/
> >   cxl/test: add and use cxl_common_[start|stop] helpers
> >   cxl/test: add a cxl_ derivative of check_dmesg()
> >   cxl/test: use an explicit --since time in journalctl
> > 
> > Please take a look at how the prev patch did journalctl start time.
> 
> We've been using a "start time" in
> https://github.com/thesofproject/sof-test for many years and it's been
> only "OK", not great. I did not know about the $SECONDS magic variable
> at the time, otherwise I would have tried it in sof-test! The main
> advantage of $SECONDS: there is nothing to do, meaning there is no
> "cxl_common_start()" to forget or do wrong. Speaking of which: I tested
> this patch on the _entire_ ndctl/test, not just with --suite=cxl whereas
> https://lore.kernel.org/all/d76c005105b7612dc47ccd19e102d462c0f4fc1b.1701143039.git.alison.schofield@intel.com/
> seems to have a CXL-specific "cxl_common_start()" only?
> 
> Also, in my experience some sort of short COOLDOWN is always necessary
> anyway for various reasons:
> - Some tests can sometimes have "after shocks" and a cooldown helps
>   with most of these.
> - A short gap in the logs really help with their _readability_.
> - Clocks can shift, especially inside QEMU (I naively tried to increase
>   the number of cores in run_qemu.sh but had to give up due so "clock skew")
> - Others I probably forgot.
> 
> On my system, the average, per-test duration is about 10s and I find that
> 10% is an acceptable price to pay for the peace of mind. But a starttime
> should hopefully work too, at least for the majority of the time.
>
> 
> > I believe the kmesg_fail... can be used to catch any of the failed
> > sorts that the old series wanted to do.
> 
> Yes it does, I tried to explain that but afraid my English wasn't good
> enough?
> 
> > Maybe add a brief write up of how to use the kmesg choices per
> > test and in the common code.
> 
> Q.E.D ;-)
> 
> > Is the new kmesg approach going to fail on any ERROR or WARNING that
> > we don't kmesg_no_fail_on ?
> 
> Yes, this is the main purpose. The other feature is failing when
> any of the _expected_ ERROR or WARNING is not found.
> 
> > And then can we simply add dev_dbg() messages to fail if missing.
> 
> I'm afraid you just lost me at this point... my patch already does that
> without any dev_dbg()...?

Let me rephrase that - can we simply add dev_dbg() messages to the
'kmesg_' fail scheme, like in my check_dmesg() patch.

> 
> > I'll take a further look for example at the poison test. We want
> > it to warn that the poison is in a region. That is a good and
> > expected warning.  However, if that warn is missing, then the test
> > is broken! It might not 'FAIL', but it's no longer doing what we
> > want.
> 
> I agree: the expected "poison inject" and "poison clear" messages should
> be in the kmsg_fail_if_missing array[], not in the kmsg_no_fail_on[]
> array. BUT in my experience this makes cxl-poison.sh fail when run
> multiple times.  So yes: there seems to be a problem with this test.  (I
> should probably file a bug somewhere?) So I put them in
> kmsg_fail_if_missing[] for now because I don't have time to look into it
> now and I don't think a problem in a single test should hold back the
> improvement for the entire suite that exposes it. Even with just
> kmsg_no_fail_on[], this test is still better than now.
> 
> BTW this is a typical game of whack-a-mole every time you try to tighten
> a test screw. In SOF it took 4-5 years to finally catch all firmware
> errors: https://github.com/thesofproject/sof-test/issues/297
> 
> 
> 
> > So, let's work on a rev 2 that does all the things of both our
> > patches. I'm happy to work it with you, or not.
> 
> I agree the COOLDOWN / starttime is a separate feature. But... I needed it
> for the tests to pass! I find it important to keep the tests all passing
> in every commit for bisectability etc., hope you agree. Also, really hard
> to submit anything that does not pass the tests :-)
> 

How are the tests failing without the COOLDOWN now?

> As of now, the tests tolerate cross-test pollution. Being more
> demanding when inspecting the logs obviously makes them fail, at least
> sometimes. I agree the "timing" solution should go first, so here's
> a suggested plan:
> 
> 1. a) Either I resubmit my COOLDOWN alone,
>    b) or you generalize your cxl_common_start()/starttime to non-CXL tests.
> 
> No check_dmesg() change yet. "cxl_check_dmesg()" is abandoned forever.
> 
> Then:
> 
> 2. I rebase and resubmit my kmsg_no_fail_on=...
> 
> This will give more time for people to try and report any issue in the
> timing fix 1. - whichever is it.
> 
> In the 1.a) case, I think your [cxl_]common_start() de-duplication is
> 99% independent and can be submitted at any point.
> 
> 
> Thoughts?

Split them into a patchset for easier review and then I'll take
a look. Thanks!

> 
> PS: keep in mind I may be pulled in other priorities at any time :-(




^ permalink raw reply	[flat|nested] 7+ messages in thread

* (no subject)
  2025-05-13  0:45     ` Alison Schofield
@ 2025-05-15  2:14       ` marc.herbert
  2025-05-15  2:14         ` [ndctl PATCH v2 1/2] test: move err() function at the top marc.herbert
  2025-05-15  2:14         ` [ndctl PATCH v2 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
  0 siblings, 2 replies; 7+ messages in thread
From: marc.herbert @ 2025-05-15  2:14 UTC (permalink / raw)
  To: linux-cxl, nvdimm, alison.schofield

Major changes in v2:
- The old $SECONDS variable is dropped from journalctl. Which allows:
- ... dropping the very short-lived COOLDOWN proposed in version 1.
- A new, optional NDTEST_LOG_DBG code which allows "stress testing"
the approach and proving that it is safe.

I tested and compared for many hours $SECONDS versus the NDTEST_START
approach that Alison submitted a few months ago and the conclusion is
very clear:
- $SECONDS does require a ~1.2 cool down between every test. As it was
  done in v1.
- NDTEST_START requires zero cool down.

So that is why I dropped $SECONDS and the cool down.


> Split them into a patchset for easier review and then I'll take a
> look. Thanks!

There are 3 logical changes in the main commit:

A1) Dropping $SECONDS, replaced with NDTEST_START

A2) The new NDTEST_LOG_DBG which was/is critical for:
   - proving that $SECONDS required a "cool down" (with version 1)
   - proving NDTEST_START does _not_ require any cool down, safe
     even without any.

B) The new, _harden_ journalctl check in check_dmesg() and its
   kmsg_fail_if_missing and kmsg_no_fail_on. The main feature!


- B) requires A1) because $SECONDS is too imprecise. With B) only, the
  tests fail.

- The A2) test code achieves nothing without B), it cannot prove
  anything without B).

- A1) and A2) are logically independent but their code are fairly
  intertwined and very painful to separate. Plus, B) would have to sit
  in the middle: A1->B->A2


Long story short:

- while they could be logically separate, these changes are tightly
  coupled with each other.

- breaking down that (relatively small) commit is theoretically
possible but would be very labor intensive. I know because I just went
through a similar "git action"  to compare $SECONDS versus
NDTEST_START for COOLDOWN reasons and it was not fun at all.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [ndctl PATCH v2 1/2] test: move err() function at the top
  2025-05-15  2:14       ` marc.herbert
@ 2025-05-15  2:14         ` marc.herbert
  2025-05-15  2:14         ` [ndctl PATCH v2 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
  1 sibling, 0 replies; 7+ messages in thread
From: marc.herbert @ 2025-05-15  2:14 UTC (permalink / raw)
  To: linux-cxl, nvdimm, alison.schofield; +Cc: Marc Herbert

From: Marc Herbert <marc.herbert@linux.intel.com>

move err() function at the top so we can fail early. err() does not have
any dependency so it can be first.

Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
---
 test/common | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/test/common b/test/common
index 75ff1a6e12be..74e74dd4fff9 100644
--- a/test/common
+++ b/test/common
@@ -1,6 +1,20 @@
 # SPDX-License-Identifier: GPL-2.0
 # Copyright (C) 2018, FUJITSU LIMITED. All rights reserved.
 
+# err
+# $1: line number which error detected
+# $2: cleanup function (optional)
+#
+
+test_basename=$(basename "$0")
+
+err()
+{
+	echo test/"$test_basename": failed at line "$1"
+	[ -n "$2" ] && "$2"
+	exit "$rc"
+}
+
 # Global variables
 
 # NDCTL
@@ -53,17 +67,6 @@ E820_BUS="e820"
 
 # Functions
 
-# err
-# $1: line number which error detected
-# $2: cleanup function (optional)
-#
-err()
-{
-	echo test/$(basename $0): failed at line $1
-	[ -n "$2" ] && "$2"
-	exit $rc
-}
-
 reset()
 {
 	$NDCTL disable-region -b $NFIT_TEST_BUS0 all
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [ndctl PATCH v2 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"
  2025-05-15  2:14       ` marc.herbert
  2025-05-15  2:14         ` [ndctl PATCH v2 1/2] test: move err() function at the top marc.herbert
@ 2025-05-15  2:14         ` marc.herbert
  1 sibling, 0 replies; 7+ messages in thread
From: marc.herbert @ 2025-05-15  2:14 UTC (permalink / raw)
  To: linux-cxl, nvdimm, alison.schofield; +Cc: Marc Herbert

From: Marc Herbert <marc.herbert@linux.intel.com>

While a "Call Trace" is usually a bad omen, the show_trace_log_lvl()
function supports any log level. So a "Call Trace" is not a reliable
indication of a failure. More importantly: any other WARNING or ERROR
during a test should make a test FAIL. Before this commit, it does not.

So, leverage log levels for the PASS/FAIL decision.  This catches all
issues and not just the ones printing Call Traces.

Add a simple way to exclude expected warnings and errors, either on a
per-test basis or globally.

Add a way for negative tests to fail when if some expected errors are
missing.

Stop relying on the magic and convenient but inaccurate $SECONDS bash
variable because its precision is (surprise!) about 1 second. In the
first version of this patch, $SECONDS was kept and working but it
required a 1++ second long "cooldown" between tests to isolate their
logs from each other. After dropping $SECONDS from journalctl, no
cooldown delay is required.

As a good example (which initiated this), the test feedback when hitting
bug https://github.com/pmem/ndctl/issues/278, where the cxl_test module
errors at load, is completely changed by this. Instead of only half the
tests failing with a fairly cryptic and late "Numerical result out of
range" error from user space, now all tests are failing early and more
consistently; displaying the same, earlier and more relevant error.

This simple log-level based approach has been successfully used for
years in the CI of https://github.com/thesofproject and caught
countless firmware and kernel bugs.

Note: the popular message "possible circular locking ..." recently fixed
by revert v6.15-rc1-4-gdc1771f71854 is at the WARNING level, including
its Call Trace.

Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
---
 test/common            | 137 +++++++++++++++++++++++++++++++++++++++--
 test/cxl-events.sh     |   2 +
 test/cxl-poison.sh     |   7 +++
 test/cxl-xor-region.sh |   2 +
 test/dax.sh            |   6 ++
 5 files changed, 149 insertions(+), 5 deletions(-)

diff --git a/test/common b/test/common
index 74e74dd4fff9..67dfd03f9693 100644
--- a/test/common
+++ b/test/common
@@ -15,6 +15,28 @@ err()
 	exit "$rc"
 }
 
+time_init()
+{
+	test "$SECONDS" -le 1 || err 'test/common must be included first!'
+	# ... otherwise NDTEST_START is inaccurate
+
+	NDTEST_START=$(LC_TIME=C date '+%F %T.%3N')
+
+	# Log anchor, especially useful when running tests back to back
+	printf "<5>%s@%ds: sourcing test/common: NDTEST_START=%s\n" \
+		"$test_basename" "$SECONDS" "$NDTEST_START" > /dev/kmsg
+
+	# Default value, can be overriden by the environment
+	: "${NDTEST_LOG_DBG:=false}"
+
+	if "$NDTEST_LOG_DBG"; then
+		local _cd_dbg_msg="NDTEST_LOG_DBG: $test_basename early msg must be found by check_dmesg()"
+		printf '<3>%s: %s\n' "$test_basename" "$_cd_dbg_msg" > /dev/kmsg
+		kmsg_fail_if_missing+=("$_cd_dbg_msg")
+	fi
+}
+time_init
+
 # Global variables
 
 # NDCTL
@@ -143,15 +165,120 @@ json2var()
 	sed -e "s/[{}\",]//g; s/\[//g; s/\]//g; s/:/=/g"
 }
 
-# check_dmesg
+# check_dmesg() performs two actions controlled by two bash arrays:
+# "kmsg_no_fail_on" and "kmsg_fail_if_missing". These list of extended
+# regular expressions (grep '-E') have default values here that can
+# be customized by each test.
+#
+# 1. check_dmesg() first checks the output of `journalctl -k -p warning`
+# and makes the invoking test FAIL if any unexpected kernel error or
+# warning occurred during the test. Messages in either the
+# "kmsg_no_fail_on" or the "kmsg_fail_if_missing" arrays are expected
+# and do NOT cause a test failure. All other errors and warnings cause a
+# test failure.
+#
+# 2.1 Then, check_dmesg() makes sure at least one line in the logs
+# matches each regular expression in the "kmsg_fail_if_missing" array. If
+# any of these expected messages was never issued during the test, then
+# the test fails. This is especially useful for "negative" tests
+# triggering expected errors; but not just. Unlike 1., all log levels
+# are searched. Avoid relying on "optional" messages (e.g.: dyndbg) in
+# "kmsg_fail_if_missing".
+#
+# 2.2 to make sure "something" happened during the test, check_dmesg()
+# provides a default, non-empty kmsg_fail_if_missing value that searches
+# for either "nfit_test" or pmem" or "cxl_". These are not searched if
+# the test already provides some value(s) in "kmsg_fail_if_missing".
+# While not recommended, a test could use check_dmesg() and opt out of
+# "kmsg_fail_if_missing" with a pointless regular expression like '.'
+
+# Always append with '+=' to give any test the freedom to source this
+# file before or after adding exclusions.
+# kmsg_no_fail_on+=('this array cannot be empty otherwise grep -v fails')
+
+kmsg_no_fail_on+=('cxl_core: loading out-of-tree module taints kernel')
+kmsg_no_fail_on+=('cxl_mock_mem.*: CXL MCE unsupported')
+kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.*: Extended linear cache calculation failed rc:-2')
+
+# 'modprobe nfit_test' prints these every time it's not already loaded
+kmsg_no_fail_on+=(
+    'nd_pmem namespace.*: unable to guarantee persistence of writes'
+    'nfit_test nfit_test.*: failed to evaluate _FIT'
+    'nfit_test nfit_test.*: Error found in NVDIMM nmem.* flags: save_fail restore_fail flush_fail not_armed'
+    'nfit_test nfit_test.1: Error found in NVDIMM nmem.* flags: map_fail'
+)
+
+# notice level to give some information without flooding the (single!)
+# testlog.txt file
+journalctl_notice()
+{
+	( set +x;
+	  printf ' ------------ More verbose logs at t=%ds ----------\n' "$SECONDS" )
+	journalctl -b --no-pager -o short-precise -p notice --since "-$((SECONDS*1000 + 1000)) ms"
+}
+
 # $1: line number where this is called
 check_dmesg()
 {
-	# validate no WARN or lockdep report during the run
+	local _e_kmsg_no_fail_on=()
+	for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
+		_e_kmsg_no_fail_on+=('-e' "$re")
+	done
+
+	# Give some time for a complete kmsg->journalctl flush + any delayed test effect.
 	sleep 1
-	log=$(journalctl -r -k --since "-$((SECONDS+1))s")
-	grep -q "Call Trace" <<< $log && err $1
-	true
+
+	if "$NDTEST_LOG_DBG"; then
+		journalctl -q -b --since "$NDTEST_START" \
+			-o short-precise > journal-"$(basename "$0")".log
+	fi
+	# After enabling, check the timings in:
+	#    head -n 7 $(ls -1t build/journal-*.log | tac)
+	#    journalctl --since='- 5 min' -o short-precise -g 'test/common'
+
+	{ # Redirect to stderr so this is all at the _bottom_ in the log file
+
+	# Fail on kernel WARNING or ERROR.
+	if journalctl -q -o short-precise -p warning -k --since "$NDTEST_START" |
+		grep -E -v "${_e_kmsg_no_fail_on[@]}"; then
+			journalctl_notice
+			err "$1"
+	fi
+
+	# Sanity check: make sure "something" has run
+	if [ "${#kmsg_fail_if_missing[@]}" = 0 ]; then
+		kmsg_fail_if_missing+=( '(nfit_test)|(pmem)|(cxl_)' )
+	fi
+
+	local expected_re
+	for expected_re in "${kmsg_fail_if_missing[@]}"; do
+		journalctl -q -k -p 7 --since "$NDTEST_START" |
+			grep -q -E -e "${expected_re}" || {
+				printf 'FAIL: expected error not found: %s\n' "$expected_re"
+				journalctl_notice
+				err "$1"
+		}
+	done
+	} >&2
+
+	# Log anchor, especially useful when running tests back to back
+	printf "<5>%s@%ds: test/common: check_dmesg() OK\n" "$test_basename" "$SECONDS" > /dev/kmsg
+
+	if "$NDTEST_LOG_DBG"; then
+	    log_stress from_check_dmesg
+	fi
+}
+# Many tests don't use check_dmesg() (yet?) so double down here. Also, this
+# runs later which is better. But before using this make sure there is
+# still no test defining its own EXIT trap.
+if "$NDTEST_LOG_DBG"; then
+    trap 'log_stress from_trap' EXIT
+fi
+
+log_stress()
+{
+	printf '<3>%s@%ds: NDTEST_LOG_DBG; trying to break the next check_dmesg() %s\n' \
+		"$test_basename" "$SECONDS" "$1" > /dev/kmsg
 }
 
 # CXL COMMON
diff --git a/test/cxl-events.sh b/test/cxl-events.sh
index c216d6aa9148..1461b487e208 100644
--- a/test/cxl-events.sh
+++ b/test/cxl-events.sh
@@ -25,6 +25,8 @@ rc=1
 dev_path="/sys/bus/platform/devices"
 trace_path="/sys/kernel/tracing"
 
+kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.* no CXL window for range')
+
 test_region_info()
 {
 	# Trigger a memdev in the cxl_test autodiscovered region
diff --git a/test/cxl-poison.sh b/test/cxl-poison.sh
index 2caf092db460..3985c817914b 100644
--- a/test/cxl-poison.sh
+++ b/test/cxl-poison.sh
@@ -8,6 +8,13 @@ rc=77
 
 set -ex
 
+# FIXME: this should be in "kmsg_fail_if_missing" but this test seems to
+# work only once. Cleanup/reset issue?
+kmsg_no_fail_on+=(
+    'cxl_mock_mem cxl_mem.*: poison inject dpa:0x'
+    'cxl_mock_mem cxl_mem.*: poison clear dpa:0x'
+)
+
 trap 'err $LINENO' ERR
 
 check_prereq "jq"
diff --git a/test/cxl-xor-region.sh b/test/cxl-xor-region.sh
index b9e1d79212d3..f5e0db98b67f 100644
--- a/test/cxl-xor-region.sh
+++ b/test/cxl-xor-region.sh
@@ -17,6 +17,8 @@ modprobe cxl_test interleave_arithmetic=1
 udevadm settle
 rc=1
 
+kmsg_fail_if_missing+=('cxl_mock_mem cxl_mem.* no CXL window for range')
+
 # THEORY OF OPERATION: Create x1,2,3,4 regions to exercise the XOR math
 # option of the CXL driver. As with other cxl_test tests, changes to the
 # CXL topology in tools/testing/cxl/test/cxl.c may require an update here.
diff --git a/test/dax.sh b/test/dax.sh
index 3ffbc8079eba..0589f0d053ec 100755
--- a/test/dax.sh
+++ b/test/dax.sh
@@ -118,6 +118,12 @@ else
 	run_xfs
 fi
 
+kmsg_fail_if_missing+=(
+    'nd_pmem pfn.*: unable to guarantee persistence of writes'
+    'Memory failure: .*: Sending SIGBUS to dax-pmd:.* due to hardware memory corruption'
+    'Memory failure: .*: recovery action for dax page: Recovered'
+)
+
 check_dmesg "$LINENO"
 
 exit 0
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-05-15  2:18 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-10  1:20 [ndctl PATCH] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
2025-05-12 18:35 ` Alison Schofield
2025-05-12 23:12   ` Marc Herbert
2025-05-13  0:45     ` Alison Schofield
2025-05-15  2:14       ` marc.herbert
2025-05-15  2:14         ` [ndctl PATCH v2 1/2] test: move err() function at the top marc.herbert
2025-05-15  2:14         ` [ndctl PATCH v2 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox