[ndctl PATCH v3] test: fail on unexpected kernel error & warning, not just "Call Trace"

linux-cxl.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [ndctl PATCH v3] test: fail on unexpected kernel error & warning, not just "Call Trace"
@ 2025-06-11 23:44 marc.herbert
  2025-06-11 23:44 ` [ndctl PATCH v3 1/2] test: move err() function at the top marc.herbert
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: marc.herbert @ 2025-06-11 23:44 UTC (permalink / raw)
  To: linux-cxl, nvdimm, alison.schofield, dan.j.williams

v3 changes:

- One-line fix of kmsg_no_fail_on exclusion of the warning "Extended
  linear cache calculation failed". Fixes test failures since kernel
  commit de7fb30a5870 ("Add extended linear cache support for CXL").

v2 major changes:

- The old $SECONDS variable is dropped from journalctl. Which allows:
- ... dropping the very short-lived COOLDOWN proposed in version 1.
- A new, optional NDTEST_LOG_DBG code which allows "stress testing"
  the approach and proving that it is safe.

I tested and compared for many hours $SECONDS versus the NDTEST_START
approach that Alison submitted a few months ago and the conclusion is
very clear:
- $SECONDS does require a ~1.2 cool down between every test. As it was
  done in v1.
- NDTEST_START requires zero cool down.

So that is why I dropped $SECONDS and the cool down.


> Split them into a patchset for easier review and then I'll take a
> look. Thanks!

There are 3 logical changes in the main commit:

A1) Dropping $SECONDS, replaced with NDTEST_START

A2) The new NDTEST_LOG_DBG which was/is critical for:
   - proving that $SECONDS required a "cool down" (with version 1)
   - proving NDTEST_START does _not_ require any cool down, safe
     even without any.

B) The new, _harden_ journalctl check in check_dmesg() and its
   kmsg_fail_if_missing and kmsg_no_fail_on. The main feature!


- B) requires A1) because $SECONDS is too imprecise. With B) only, the
  tests fail.

- The A2) test code achieves nothing without B), it cannot prove
  anything without B).

- A1) and A2) are logically independent but their code are fairly
  intertwined and very painful to separate. Plus, B) would have to sit
  in the middle: A1->B->A2

Long story short:

- while they could be logically separate, these changes are tightly
coupled with each other.

- breaking down that (relatively small) commit is theoretically
possible but would be very labor intensive. I know because I just went
through a similar "git action"  to compare $SECONDS versus
NDTEST_START for COOLDOWN reasons and it was not fun at all.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [ndctl PATCH v3 1/2] test: move err() function at the top
  2025-06-11 23:44 [ndctl PATCH v3] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
@ 2025-06-11 23:44 ` marc.herbert
  2025-07-11 23:56   ` dan.j.williams
  2025-06-11 23:44 ` [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
  2025-06-12  0:02 ` [ndctl PATCH v3] " Dave Jiang
  2 siblings, 1 reply; 7+ messages in thread
From: marc.herbert @ 2025-06-11 23:44 UTC (permalink / raw)
  To: linux-cxl, nvdimm, alison.schofield, dan.j.williams; +Cc: Marc Herbert

From: Marc Herbert <marc.herbert@linux.intel.com>

move err() function at the top so we can fail early. err() does not have
any dependency so it can be first.

Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
---
 test/common | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/test/common b/test/common
index 75ff1a6e12be..74e74dd4fff9 100644
--- a/test/common
+++ b/test/common
@@ -1,6 +1,20 @@
 # SPDX-License-Identifier: GPL-2.0
 # Copyright (C) 2018, FUJITSU LIMITED. All rights reserved.
 
+# err
+# $1: line number which error detected
+# $2: cleanup function (optional)
+#
+
+test_basename=$(basename "$0")
+
+err()
+{
+	echo test/"$test_basename": failed at line "$1"
+	[ -n "$2" ] && "$2"
+	exit "$rc"
+}
+
 # Global variables
 
 # NDCTL
@@ -53,17 +67,6 @@ E820_BUS="e820"
 
 # Functions
 
-# err
-# $1: line number which error detected
-# $2: cleanup function (optional)
-#
-err()
-{
-	echo test/$(basename $0): failed at line $1
-	[ -n "$2" ] && "$2"
-	exit $rc
-}
-
 reset()
 {
 	$NDCTL disable-region -b $NFIT_TEST_BUS0 all
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"
  2025-06-11 23:44 [ndctl PATCH v3] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
  2025-06-11 23:44 ` [ndctl PATCH v3 1/2] test: move err() function at the top marc.herbert
@ 2025-06-11 23:44 ` marc.herbert
  2025-07-12  0:32   ` dan.j.williams
  2025-06-12  0:02 ` [ndctl PATCH v3] " Dave Jiang
  2 siblings, 1 reply; 7+ messages in thread
From: marc.herbert @ 2025-06-11 23:44 UTC (permalink / raw)
  To: linux-cxl, nvdimm, alison.schofield, dan.j.williams; +Cc: Marc Herbert

From: Marc Herbert <marc.herbert@linux.intel.com>

While a "Call Trace" is usually a bad omen, the show_trace_log_lvl()
function supports any log level. So a "Call Trace" is not a reliable
indication of a failure. More importantly: any other WARNING or ERROR
during a test should make a test FAIL. Before this commit, it does not.

So, leverage log levels for the PASS/FAIL decision.  This catches all
issues and not just the ones printing Call Traces.

Add a simple way to exclude expected warnings and errors, either on a
per-test basis or globally.

Add a way for negative tests to fail when if some expected errors are
missing.

Stop relying on the magic and convenient but inaccurate $SECONDS bash
variable because its precision is (surprise!) about 1 second. In the
first version of this patch, $SECONDS was kept and working but it
required a 1++ second long "cooldown" between tests to isolate their
logs from each other. After dropping $SECONDS from journalctl, no
cooldown delay is required.

As a good example (which initiated this), the test feedback when hitting
bug https://github.com/pmem/ndctl/issues/278, where the cxl_test module
errors at load, is completely changed by this. Instead of only half the
tests failing with a fairly cryptic and late "Numerical result out of
range" error from user space, now all tests are failing early and more
consistently; displaying the same, earlier and more relevant error.

This simple log-level based approach has been successfully used for
years in the CI of https://github.com/thesofproject and caught
countless firmware and kernel bugs.

Note: the popular message "possible circular locking ..." recently fixed
by revert v6.15-rc1-4-gdc1771f71854 is at the WARNING level, including
its Call Trace.

Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
---
 test/common            | 137 +++++++++++++++++++++++++++++++++++++++--
 test/cxl-events.sh     |   2 +
 test/cxl-poison.sh     |   7 +++
 test/cxl-xor-region.sh |   2 +
 test/dax.sh            |   6 ++
 5 files changed, 149 insertions(+), 5 deletions(-)

diff --git a/test/common b/test/common
index 74e74dd4fff9..8709932ffbbd 100644
--- a/test/common
+++ b/test/common
@@ -15,6 +15,28 @@ err()
 	exit "$rc"
 }
 
+time_init()
+{
+	test "$SECONDS" -le 1 || err 'test/common must be included first!'
+	# ... otherwise NDTEST_START is inaccurate
+
+	NDTEST_START=$(LC_TIME=C date '+%F %T.%3N')
+
+	# Log anchor, especially useful when running tests back to back
+	printf "<5>%s@%ds: sourcing test/common: NDTEST_START=%s\n" \
+		"$test_basename" "$SECONDS" "$NDTEST_START" > /dev/kmsg
+
+	# Default value, can be overriden by the environment
+	: "${NDTEST_LOG_DBG:=false}"
+
+	if "$NDTEST_LOG_DBG"; then
+		local _cd_dbg_msg="NDTEST_LOG_DBG: $test_basename early msg must be found by check_dmesg()"
+		printf '<3>%s: %s\n' "$test_basename" "$_cd_dbg_msg" > /dev/kmsg
+		kmsg_fail_if_missing+=("$_cd_dbg_msg")
+	fi
+}
+time_init
+
 # Global variables
 
 # NDCTL
@@ -143,15 +165,120 @@ json2var()
 	sed -e "s/[{}\",]//g; s/\[//g; s/\]//g; s/:/=/g"
 }
 
-# check_dmesg
+# check_dmesg() performs two actions controlled by two bash arrays:
+# "kmsg_no_fail_on" and "kmsg_fail_if_missing". These list of extended
+# regular expressions (grep '-E') have default values here that can
+# be customized by each test.
+#
+# 1. check_dmesg() first checks the output of `journalctl -k -p warning`
+# and makes the invoking test FAIL if any unexpected kernel error or
+# warning occurred during the test. Messages in either the
+# "kmsg_no_fail_on" or the "kmsg_fail_if_missing" arrays are expected
+# and do NOT cause a test failure. All other errors and warnings cause a
+# test failure.
+#
+# 2.1 Then, check_dmesg() makes sure at least one line in the logs
+# matches each regular expression in the "kmsg_fail_if_missing" array. If
+# any of these expected messages was never issued during the test, then
+# the test fails. This is especially useful for "negative" tests
+# triggering expected errors; but not just. Unlike 1., all log levels
+# are searched. Avoid relying on "optional" messages (e.g.: dyndbg) in
+# "kmsg_fail_if_missing".
+#
+# 2.2 to make sure "something" happened during the test, check_dmesg()
+# provides a default, non-empty kmsg_fail_if_missing value that searches
+# for either "nfit_test" or pmem" or "cxl_". These are not searched if
+# the test already provides some value(s) in "kmsg_fail_if_missing".
+# While not recommended, a test could use check_dmesg() and opt out of
+# "kmsg_fail_if_missing" with a pointless regular expression like '.'
+
+# Always append with '+=' to give any test the freedom to source this
+# file before or after adding exclusions.
+# kmsg_no_fail_on+=('this array cannot be empty otherwise grep -v fails')
+
+kmsg_no_fail_on+=('cxl_core: loading out-of-tree module taints kernel')
+kmsg_no_fail_on+=('cxl_mock_mem.*: CXL MCE unsupported')
+kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.*: Extended linear cache calculation failed')
+
+# 'modprobe nfit_test' prints these every time it's not already loaded
+kmsg_no_fail_on+=(
+    'nd_pmem namespace.*: unable to guarantee persistence of writes'
+    'nfit_test nfit_test.*: failed to evaluate _FIT'
+    'nfit_test nfit_test.*: Error found in NVDIMM nmem.* flags: save_fail restore_fail flush_fail not_armed'
+    'nfit_test nfit_test.1: Error found in NVDIMM nmem.* flags: map_fail'
+)
+
+# notice level to give some information without flooding the (single!)
+# testlog.txt file
+journalctl_notice()
+{
+	( set +x;
+	  printf ' ------------ More verbose logs at t=%ds ----------\n' "$SECONDS" )
+	journalctl -b --no-pager -o short-precise -p notice --since "-$((SECONDS*1000 + 1000)) ms"
+}
+
 # $1: line number where this is called
 check_dmesg()
 {
-	# validate no WARN or lockdep report during the run
+	local _e_kmsg_no_fail_on=()
+	for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
+		_e_kmsg_no_fail_on+=('-e' "$re")
+	done
+
+	# Give some time for a complete kmsg->journalctl flush + any delayed test effect.
 	sleep 1
-	log=$(journalctl -r -k --since "-$((SECONDS+1))s")
-	grep -q "Call Trace" <<< $log && err $1
-	true
+
+	if "$NDTEST_LOG_DBG"; then
+		journalctl -q -b --since "$NDTEST_START" \
+			-o short-precise > journal-"$(basename "$0")".log
+	fi
+	# After enabling, check the timings in:
+	#    head -n 7 $(ls -1t build/journal-*.log | tac)
+	#    journalctl --since='- 5 min' -o short-precise -g 'test/common'
+
+	{ # Redirect to stderr so this is all at the _bottom_ in the log file
+
+	# Fail on kernel WARNING or ERROR.
+	if journalctl -q -o short-precise -p warning -k --since "$NDTEST_START" |
+		grep -E -v "${_e_kmsg_no_fail_on[@]}"; then
+			journalctl_notice
+			err "$1"
+	fi
+
+	# Sanity check: make sure "something" has run
+	if [ "${#kmsg_fail_if_missing[@]}" = 0 ]; then
+		kmsg_fail_if_missing+=( '(nfit_test)|(pmem)|(cxl_)' )
+	fi
+
+	local expected_re
+	for expected_re in "${kmsg_fail_if_missing[@]}"; do
+		journalctl -q -k -p 7 --since "$NDTEST_START" |
+			grep -q -E -e "${expected_re}" || {
+				printf 'FAIL: expected error not found: %s\n' "$expected_re"
+				journalctl_notice
+				err "$1"
+		}
+	done
+	} >&2
+
+	# Log anchor, especially useful when running tests back to back
+	printf "<5>%s@%ds: test/common: check_dmesg() OK\n" "$test_basename" "$SECONDS" > /dev/kmsg
+
+	if "$NDTEST_LOG_DBG"; then
+	    log_stress from_check_dmesg
+	fi
+}
+# Many tests don't use check_dmesg() (yet?) so double down here. Also, this
+# runs later which is better. But before using this make sure there is
+# still no test defining its own EXIT trap.
+if "$NDTEST_LOG_DBG"; then
+    trap 'log_stress from_trap' EXIT
+fi
+
+log_stress()
+{
+	printf '<3>%s@%ds: NDTEST_LOG_DBG; trying to break the next check_dmesg() %s\n' \
+		"$test_basename" "$SECONDS" "$1" > /dev/kmsg
 }
 
 # CXL COMMON
diff --git a/test/cxl-events.sh b/test/cxl-events.sh
index 7326eb7447ee..29afd86b8bf8 100644
--- a/test/cxl-events.sh
+++ b/test/cxl-events.sh
@@ -26,6 +26,8 @@ rc=1
 dev_path="/sys/bus/platform/devices"
 trace_path="/sys/kernel/tracing"
 
+kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.* no CXL window for range')
+
 test_region_info()
 {
 	# Trigger a memdev in the cxl_test autodiscovered region
diff --git a/test/cxl-poison.sh b/test/cxl-poison.sh
index 6ed890bc666c..8b38cb7960b0 100644
--- a/test/cxl-poison.sh
+++ b/test/cxl-poison.sh
@@ -9,6 +9,13 @@ rc=77
 set -ex
 [ -d "/sys/kernel/tracing" ] || do_skip "test requires CONFIG_TRACING"
 
+# FIXME: this should be in "kmsg_fail_if_missing" but this test seems to
+# work only once. Cleanup/reset issue?
+kmsg_no_fail_on+=(
+    'cxl_mock_mem cxl_mem.*: poison inject dpa:0x'
+    'cxl_mock_mem cxl_mem.*: poison clear dpa:0x'
+)
+
 trap 'err $LINENO' ERR
 
 check_prereq "jq"
diff --git a/test/cxl-xor-region.sh b/test/cxl-xor-region.sh
index fb4f9a0a1515..d1a58023a61e 100644
--- a/test/cxl-xor-region.sh
+++ b/test/cxl-xor-region.sh
@@ -16,6 +16,8 @@ modprobe -r cxl_test
 modprobe cxl_test interleave_arithmetic=1
 rc=1
 
+kmsg_fail_if_missing+=('cxl_mock_mem cxl_mem.* no CXL window for range')
+
 # THEORY OF OPERATION: Create x1,2,3,4 regions to exercise the XOR math
 # option of the CXL driver. As with other cxl_test tests, changes to the
 # CXL topology in tools/testing/cxl/test/cxl.c may require an update here.
diff --git a/test/dax.sh b/test/dax.sh
index 3ffbc8079eba..0589f0d053ec 100755
--- a/test/dax.sh
+++ b/test/dax.sh
@@ -118,6 +118,12 @@ else
 	run_xfs
 fi
 
+kmsg_fail_if_missing+=(
+    'nd_pmem pfn.*: unable to guarantee persistence of writes'
+    'Memory failure: .*: Sending SIGBUS to dax-pmd:.* due to hardware memory corruption'
+    'Memory failure: .*: recovery action for dax page: Recovered'
+)
+
 check_dmesg "$LINENO"
 
 exit 0
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [ndctl PATCH v3] test: fail on unexpected kernel error & warning, not just "Call Trace"
  2025-06-11 23:44 [ndctl PATCH v3] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
  2025-06-11 23:44 ` [ndctl PATCH v3 1/2] test: move err() function at the top marc.herbert
  2025-06-11 23:44 ` [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
@ 2025-06-12  0:02 ` Dave Jiang
  2 siblings, 0 replies; 7+ messages in thread
From: Dave Jiang @ 2025-06-12  0:02 UTC (permalink / raw)
  To: marc.herbert, linux-cxl, nvdimm, alison.schofield, dan.j.williams



On 6/11/25 4:44 PM, marc.herbert@linux.intel.com wrote:
> v3 changes:
> 
> - One-line fix of kmsg_no_fail_on exclusion of the warning "Extended
>   linear cache calculation failed". Fixes test failures since kernel
>   commit de7fb30a5870 ("Add extended linear cache support for CXL").
> 
> v2 major changes:
> 
> - The old $SECONDS variable is dropped from journalctl. Which allows:
> - ... dropping the very short-lived COOLDOWN proposed in version 1.
> - A new, optional NDTEST_LOG_DBG code which allows "stress testing"
>   the approach and proving that it is safe.
> 
> I tested and compared for many hours $SECONDS versus the NDTEST_START
> approach that Alison submitted a few months ago and the conclusion is
> very clear:
> - $SECONDS does require a ~1.2 cool down between every test. As it was
>   done in v1.
> - NDTEST_START requires zero cool down.
> 
> So that is why I dropped $SECONDS and the cool down.
> 
> 
>> Split them into a patchset for easier review and then I'll take a
>> look. Thanks!
> 
> There are 3 logical changes in the main commit:
> 
> A1) Dropping $SECONDS, replaced with NDTEST_START
> 
> A2) The new NDTEST_LOG_DBG which was/is critical for:
>    - proving that $SECONDS required a "cool down" (with version 1)
>    - proving NDTEST_START does _not_ require any cool down, safe
>      even without any.
> 
> B) The new, _harden_ journalctl check in check_dmesg() and its
>    kmsg_fail_if_missing and kmsg_no_fail_on. The main feature!
> 
> 
> - B) requires A1) because $SECONDS is too imprecise. With B) only, the
>   tests fail.
> 
> - The A2) test code achieves nothing without B), it cannot prove
>   anything without B).
> 
> - A1) and A2) are logically independent but their code are fairly
>   intertwined and very painful to separate. Plus, B) would have to sit
>   in the middle: A1->B->A2
> 
> Long story short:
> 
> - while they could be logically separate, these changes are tightly
> coupled with each other.
> 
> - breaking down that (relatively small) commit is theoretically
> possible but would be very labor intensive. I know because I just went
> through a similar "git action"  to compare $SECONDS versus
> NDTEST_START for COOLDOWN reasons and it was not fun at all.
> 
> 


Not bash expert, but LGTM for the series
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ndctl PATCH v3 1/2] test: move err() function at the top
  2025-06-11 23:44 ` [ndctl PATCH v3 1/2] test: move err() function at the top marc.herbert
@ 2025-07-11 23:56   ` dan.j.williams
  0 siblings, 0 replies; 7+ messages in thread
From: dan.j.williams @ 2025-07-11 23:56 UTC (permalink / raw)
  To: marc.herbert, linux-cxl, nvdimm, alison.schofield, dan.j.williams
  Cc: Marc Herbert

marc.herbert@ wrote:
> From: Marc Herbert <marc.herbert@linux.intel.com>
> 
> move err() function at the top so we can fail early. err() does not have
> any dependency so it can be first.
> 
> Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
> ---
>  test/common | 25 ++++++++++++++-----------
>  1 file changed, 14 insertions(+), 11 deletions(-)

Looks ok to me

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"
  2025-06-11 23:44 ` [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
@ 2025-07-12  0:32   ` dan.j.williams
  2025-07-19  0:22     ` Marc Herbert
  0 siblings, 1 reply; 7+ messages in thread
From: dan.j.williams @ 2025-07-12  0:32 UTC (permalink / raw)
  To: marc.herbert, linux-cxl, nvdimm, alison.schofield, dan.j.williams
  Cc: Marc Herbert

marc.herbert@ wrote:
> From: Marc Herbert <marc.herbert@linux.intel.com>
> 
> While a "Call Trace" is usually a bad omen, the show_trace_log_lvl()
> function supports any log level. So a "Call Trace" is not a reliable
> indication of a failure. More importantly: any other WARNING or ERROR
> during a test should make a test FAIL. Before this commit, it does not.
> 
> So, leverage log levels for the PASS/FAIL decision.  This catches all
> issues and not just the ones printing Call Traces.
> 
> Add a simple way to exclude expected warnings and errors, either on a
> per-test basis or globally.
> 
> Add a way for negative tests to fail when if some expected errors are
> missing.

Each subsequent "Add" in a changelog is usually an opportunity to split
the patch into smaller digestable pieces.

> 
> Stop relying on the magic and convenient but inaccurate $SECONDS bash
> variable because its precision is (surprise!) about 1 second. In the
> first version of this patch, $SECONDS was kept and working but it
> required a 1++ second long "cooldown" between tests to isolate their
> logs from each other. After dropping $SECONDS from journalctl, no
> cooldown delay is required.
> 
> As a good example (which initiated this), the test feedback when hitting
> bug https://github.com/pmem/ndctl/issues/278, where the cxl_test module
> errors at load, is completely changed by this. Instead of only half the
> tests failing with a fairly cryptic and late "Numerical result out of
> range" error from user space, now all tests are failing early and more
> consistently; displaying the same, earlier and more relevant error.
> 
> This simple log-level based approach has been successfully used for
> years in the CI of https://github.com/thesofproject and caught
> countless firmware and kernel bugs.
> 
> Note: the popular message "possible circular locking ..." recently fixed
> by revert v6.15-rc1-4-gdc1771f71854 is at the WARNING level, including
> its Call Trace.

This looks promising to me, some comments below.

> 
> Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
> ---
>  test/common            | 137 +++++++++++++++++++++++++++++++++++++++--
>  test/cxl-events.sh     |   2 +
>  test/cxl-poison.sh     |   7 +++
>  test/cxl-xor-region.sh |   2 +
>  test/dax.sh            |   6 ++
>  5 files changed, 149 insertions(+), 5 deletions(-)
> 
> diff --git a/test/common b/test/common
> index 74e74dd4fff9..8709932ffbbd 100644
> --- a/test/common
> +++ b/test/common
> @@ -15,6 +15,28 @@ err()
>  	exit "$rc"
>  }
>  
> +time_init()
> +{
> +	test "$SECONDS" -le 1 || err 'test/common must be included first!'
> +	# ... otherwise NDTEST_START is inaccurate

What is this protecting against... that the test makes sure that
NDTEST_START happens before any error might have been produced?

I think the proposed anchors make this easily debuggable, there are no
tests that include test/common late, as evidenced by no fixups for this
in this patch.

> +	NDTEST_START=$(LC_TIME=C date '+%F %T.%3N')
> +
> +	# Log anchor, especially useful when running tests back to back
> +	printf "<5>%s@%ds: sourcing test/common: NDTEST_START=%s\n" \
> +		"$test_basename" "$SECONDS" "$NDTEST_START" > /dev/kmsg

Why is SECONDS here?

Note that there are some non-shell tests in the suite, not for CXL, but
might want to make this consistent by following on with wrapping those
tests in a script.

> +
> +	# Default value, can be overriden by the environment
> +	: "${NDTEST_LOG_DBG:=false}"
> +
> +	if "$NDTEST_LOG_DBG"; then
> +		local _cd_dbg_msg="NDTEST_LOG_DBG: $test_basename early msg must be found by check_dmesg()"
> +		printf '<3>%s: %s\n' "$test_basename" "$_cd_dbg_msg" > /dev/kmsg
> +		kmsg_fail_if_missing+=("$_cd_dbg_msg")
> +	fi
> +}
> +time_init
> +
>  # Global variables
>  
>  # NDCTL
> @@ -143,15 +165,120 @@ json2var()
>  	sed -e "s/[{}\",]//g; s/\[//g; s/\]//g; s/:/=/g"
>  }
>  
> -# check_dmesg
> +# check_dmesg() performs two actions controlled by two bash arrays:
> +# "kmsg_no_fail_on" and "kmsg_fail_if_missing". These list of extended
> +# regular expressions (grep '-E') have default values here that can
> +# be customized by each test.
> +#
> +# 1. check_dmesg() first checks the output of `journalctl -k -p warning`
> +# and makes the invoking test FAIL if any unexpected kernel error or
> +# warning occurred during the test. Messages in either the
> +# "kmsg_no_fail_on" or the "kmsg_fail_if_missing" arrays are expected
> +# and do NOT cause a test failure. All other errors and warnings cause a
> +# test failure.

Is there a mechanism to opt-out of errors and warnings. Sometimes
upstream gets overzealous with chatty dmesg and it would be nice to
quickly check if tests otherwise pass with ignoring messages. Then go
spend the time to track down new messages tripping up the test.

Otherwise I do like that this causes people to be careful in the
messages they add as new warnings and errors are things that distros
debug for customers on a consistent basis.

> +#
> +# 2.1 Then, check_dmesg() makes sure at least one line in the logs
> +# matches each regular expression in the "kmsg_fail_if_missing" array. If
> +# any of these expected messages was never issued during the test, then
> +# the test fails. This is especially useful for "negative" tests
> +# triggering expected errors; but not just. Unlike 1., all log levels
> +# are searched. Avoid relying on "optional" messages (e.g.: dyndbg) in
> +# "kmsg_fail_if_missing".

This feature feels pedantic, not grokking the immediate value. Perhaps
stage this at the end of the series.

> +#
> +# 2.2 to make sure "something" happened during the test, check_dmesg()
> +# provides a default, non-empty kmsg_fail_if_missing value that searches
> +# for either "nfit_test" or pmem" or "cxl_". These are not searched if
> +# the test already provides some value(s) in "kmsg_fail_if_missing".
> +# While not recommended, a test could use check_dmesg() and opt out of
> +# "kmsg_fail_if_missing" with a pointless regular expression like '.'

Is this not captured by modprobe failures?

> +
> +# Always append with '+=' to give any test the freedom to source this
> +# file before or after adding exclusions.
> +# kmsg_no_fail_on+=('this array cannot be empty otherwise grep -v fails')
> +
> +kmsg_no_fail_on+=('cxl_core: loading out-of-tree module taints kernel')
> +kmsg_no_fail_on+=('cxl_mock_mem.*: CXL MCE unsupported')
> +kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.*: Extended linear cache calculation failed')
> +
> +# 'modprobe nfit_test' prints these every time it's not already loaded
> +kmsg_no_fail_on+=(
> +    'nd_pmem namespace.*: unable to guarantee persistence of writes'
> +    'nfit_test nfit_test.*: failed to evaluate _FIT'
> +    'nfit_test nfit_test.*: Error found in NVDIMM nmem.* flags: save_fail restore_fail flush_fail not_armed'
> +    'nfit_test nfit_test.1: Error found in NVDIMM nmem.* flags: map_fail'
> +)
> +
> +# notice level to give some information without flooding the (single!)
> +# testlog.txt file
> +journalctl_notice()
> +{
> +	( set +x;
> +	  printf ' ------------ More verbose logs at t=%ds ----------\n' "$SECONDS" )
> +	journalctl -b --no-pager -o short-precise -p notice --since "-$((SECONDS*1000 + 1000)) ms"

Why is SECONDS here and not NDTEST_START?

> +}
> +
>  # $1: line number where this is called
>  check_dmesg()
>  {
> -	# validate no WARN or lockdep report during the run
> +	local _e_kmsg_no_fail_on=()
> +	for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
> +		_e_kmsg_no_fail_on+=('-e' "$re")
> +	done
> +
> +	# Give some time for a complete kmsg->journalctl flush + any delayed test effect.
>  	sleep 1

Feels magical. The sleep was only there to make sure that SECONDS rolls
over. If a test has delayed effects there is no hard guarantee that 1
second is sufficient.

> -	log=$(journalctl -r -k --since "-$((SECONDS+1))s")
> -	grep -q "Call Trace" <<< $log && err $1
> -	true
> +
> +	if "$NDTEST_LOG_DBG"; then
> +		journalctl -q -b --since "$NDTEST_START" \
> +			-o short-precise > journal-"$(basename "$0")".log
> +	fi
> +	# After enabling, check the timings in:
> +	#    head -n 7 $(ls -1t build/journal-*.log | tac)
> +	#    journalctl --since='- 5 min' -o short-precise -g 'test/common'
> +
> +	{ # Redirect to stderr so this is all at the _bottom_ in the log file
> +
> +	# Fail on kernel WARNING or ERROR.
> +	if journalctl -q -o short-precise -p warning -k --since "$NDTEST_START" |
> +		grep -E -v "${_e_kmsg_no_fail_on[@]}"; then
> +			journalctl_notice
> +			err "$1"
> +	fi

Per above, add an opt-out for this.

> +
> +	# Sanity check: make sure "something" has run
> +	if [ "${#kmsg_fail_if_missing[@]}" = 0 ]; then
> +		kmsg_fail_if_missing+=( '(nfit_test)|(pmem)|(cxl_)' )
> +	fi
> +
> +	local expected_re
> +	for expected_re in "${kmsg_fail_if_missing[@]}"; do
> +		journalctl -q -k -p 7 --since "$NDTEST_START" |
> +			grep -q -E -e "${expected_re}" || {
> +				printf 'FAIL: expected error not found: %s\n' "$expected_re"
> +				journalctl_notice
> +				err "$1"
> +		}
> +	done
> +	} >&2
> +
> +	# Log anchor, especially useful when running tests back to back

This comment is going to go stale, I do not think it helps with
understanding the implementation.

> +	printf "<5>%s@%ds: test/common: check_dmesg() OK\n" "$test_basename" "$SECONDS" > /dev/kmsg

I like this, I have long wanted anchors in the log for coordinating
messages.

> +
> +	if "$NDTEST_LOG_DBG"; then
> +	    log_stress from_check_dmesg
> +	fi
> +}
> +# Many tests don't use check_dmesg() (yet?) so double down here. Also, this

A comment like this belongs in the changelog, not the code. I don't want
review later patches fixing up the "(yet?)", a comment should help
understand the present, not comment on some future state.

> +# runs later which is better. But before using this make sure there is
> +# still no test defining its own EXIT trap.

> +if "$NDTEST_LOG_DBG"; then
> +    trap 'log_stress from_trap' EXIT
> +fi
> +
> +log_stress()
> +{
> +	printf '<3>%s@%ds: NDTEST_LOG_DBG; trying to break the next check_dmesg() %s\n' \
> +		"$test_basename" "$SECONDS" "$1" > /dev/kmsg
>  }
>  
>  # CXL COMMON
> diff --git a/test/cxl-events.sh b/test/cxl-events.sh
> index 7326eb7447ee..29afd86b8bf8 100644
> --- a/test/cxl-events.sh
> +++ b/test/cxl-events.sh
> @@ -26,6 +26,8 @@ rc=1
>  dev_path="/sys/bus/platform/devices"
>  trace_path="/sys/kernel/tracing"
>  
> +kmsg_no_fail_on+=('cxl_mock_mem cxl_mem.* no CXL window for range')

I want to see the other benefits from this patch before debugging
whether this exclusion needs a fixup somewhere else. So lets break out
the kmsg_no_fail_on mechanism separately from the base dmesg time,
anchors, and err / warn infrastructure.

> +
>  test_region_info()
>  {
>  	# Trigger a memdev in the cxl_test autodiscovered region
> diff --git a/test/cxl-poison.sh b/test/cxl-poison.sh
> index 6ed890bc666c..8b38cb7960b0 100644
> --- a/test/cxl-poison.sh
> +++ b/test/cxl-poison.sh
> @@ -9,6 +9,13 @@ rc=77
>  set -ex
>  [ -d "/sys/kernel/tracing" ] || do_skip "test requires CONFIG_TRACING"
>  
> +# FIXME: this should be in "kmsg_fail_if_missing" but this test seems to
> +# work only once. Cleanup/reset issue?
> +kmsg_no_fail_on+=(
> +    'cxl_mock_mem cxl_mem.*: poison inject dpa:0x'
> +    'cxl_mock_mem cxl_mem.*: poison clear dpa:0x'
> +)

Yeah another fixup that should be done independently from the base test
infra.

> +
>  trap 'err $LINENO' ERR
>  
>  check_prereq "jq"
> diff --git a/test/cxl-xor-region.sh b/test/cxl-xor-region.sh
> index fb4f9a0a1515..d1a58023a61e 100644
> --- a/test/cxl-xor-region.sh
> +++ b/test/cxl-xor-region.sh
> @@ -16,6 +16,8 @@ modprobe -r cxl_test
>  modprobe cxl_test interleave_arithmetic=1
>  rc=1
>  
> +kmsg_fail_if_missing+=('cxl_mock_mem cxl_mem.* no CXL window for range')

Not sure I want to fail this test if we fixup this message.

> +
>  # THEORY OF OPERATION: Create x1,2,3,4 regions to exercise the XOR math
>  # option of the CXL driver. As with other cxl_test tests, changes to the
>  # CXL topology in tools/testing/cxl/test/cxl.c may require an update here.
> diff --git a/test/dax.sh b/test/dax.sh
> index 3ffbc8079eba..0589f0d053ec 100755
> --- a/test/dax.sh
> +++ b/test/dax.sh
> @@ -118,6 +118,12 @@ else
>  	run_xfs
>  fi
>  
> +kmsg_fail_if_missing+=(
> +    'nd_pmem pfn.*: unable to guarantee persistence of writes'

This one is not valuable for this test.

> +    'Memory failure: .*: Sending SIGBUS to dax-pmd:.* due to hardware memory corruption'
> +    'Memory failure: .*: recovery action for dax page: Recovered'

These are useful.

> +)
> +
>  check_dmesg "$LINENO"
>  
>  exit 0
> -- 
> 2.49.0
> 



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace"
  2025-07-12  0:32   ` dan.j.williams
@ 2025-07-19  0:22     ` Marc Herbert
  0 siblings, 0 replies; 7+ messages in thread
From: Marc Herbert @ 2025-07-19  0:22 UTC (permalink / raw)
  To: dan.j.williams, linux-cxl, nvdimm, alison.schofield

Thanks Dan for the thorough review, appreciated and...  unfortunately
rare, especially for shell scripts!

On 2025-07-11 17:32, dan.j.williams@intel.com wrote:
> marc.herbert@ wrote:
>> So, leverage log levels for the PASS/FAIL decision.  This catches all
>> issues and not just the ones printing Call Traces.
>>
>> Add a simple way to exclude expected warnings and errors, either on a
>> per-test basis or globally.
>>
>> Add a way for negative tests to fail when if some expected errors are
>> missing.
>
> Each subsequent "Add" in a changelog is usually an opportunity to split
> the patch into smaller digestable pieces.

As I explained in the cover letter, the problem is: testing. While
logically independent, it's in practice impossible to thoroughly test
these changes independently of each other. That's OK; I'll keep
performing thorough testing in my workspace and I will also make sure no
single, smaller commit introduces any regression in this particular
check_dmesg() area. The current bar is pretty low in this specific
check_dmsg() area so this shouldn't be a problem.

So, in v4 I will submit only the switch from SECONDS to NDTEST_START.
It's 1. logically coming first 2. low risk, low stakes and low
contention. 3. almost half the code: it will be nice to get all that out
of the way.

For that reason, I will for now answer only your comments below that are
related to this first NDTEST_START commit and (preciously) keep your
other comments for a later reply.

>> Stop relying on the magic and convenient but inaccurate $SECONDS bash
>> variable because its precision is (surprise!) about 1 second. In the
>> first version of this patch, $SECONDS was kept and working but it
>> required a 1++ second long "cooldown" between tests to isolate their
>> logs from each other. After dropping $SECONDS from journalctl, no
>> cooldown delay is required.

>>  
>> +time_init()
>> +{
>> +	test "$SECONDS" -le 1 || err 'test/common must be included first!'
>> +	# ... otherwise NDTEST_START is inaccurate
> 
> What is this protecting against... that the test makes sure that
> NDTEST_START happens before any error might have been produced?

Yes - and more generally speaking that no test log will be missing, and
also making sure the test anchors are accurate.

> I think the proposed anchors make this easily debuggable, there are no
> tests that include test/common late, as evidenced by no fixups for this
> in this patch.

There is none now but it's really not that hard to imagine someone
wanting to do random stuff before sourcing test/common. This was not an
issue with $SECONDS and without log anchors; it is one now.

Also, this is just one very simple line; costs practically nothing.

>> +	NDTEST_START=$(LC_TIME=C date '+%F %T.%3N')
>> +
>> +	# Log anchor, especially useful when running tests back to back
>> +	printf "<5>%s@%ds: sourcing test/common: NDTEST_START=%s\n" \
>> +		"$test_basename" "$SECONDS" "$NDTEST_START" > /dev/kmsg
> 
> Why is SECONDS here?

Because test duration is more user-friendly that manually substracting
timestamps; all sorts of people read logs when tests fail and they may
not automatically know that "sourcing common" implies SECONDS is 0 or 1;
it is consistent with other log statements using SECONDS too; it takes
almost no space in the logs; in case the check above gets dropped, this
can become even more useful.

Generally speaking, there is rarely ever "too much" information in
failure logs, only too much "volume" - and SECONDS takes practically
zero space.

> Note that there are some non-shell tests in the suite, not for CXL, but
> might want to make this consistent by following on with wrapping those
> tests in a script.

Good idea!

> Is there a mechanism to opt-out of errors and warnings. Sometimes
> upstream gets overzealous with chatty dmesg and it would be nice to
> quickly check if tests otherwise pass with ignoring messages. Then go
> spend the time to track down new messages tripping up the test.

Good idea (after NDTEST_START).

>> +# notice level to give some information without flooding the (single!)
>> +# testlog.txt file
>> +journalctl_notice()
>> +{
>> +	( set +x;
>> +	  printf ' ------------ More verbose logs at t=%ds ----------\n' "$SECONDS" )
>> +	journalctl -b --no-pager -o short-precise -p notice --since "-$((SECONDS*1000 + 1000)) ms"
> 
> Why is SECONDS here and not NDTEST_START?

Because this is approximative (+1000) to give a bit more background and
it's easier to do math on SECONDS in shell scripts than on timestamps.

>>  # $1: line number where this is called
>>  check_dmesg()
>>  {
>> -	# validate no WARN or lockdep report during the run
>> +	local _e_kmsg_no_fail_on=()
>> +	for re in "${kmsg_no_fail_on[@]}" "${kmsg_fail_if_missing[@]}"; do
>> +		_e_kmsg_no_fail_on+=('-e' "$re")
>> +	done
>> +
>> +	# Give some time for a complete kmsg->journalctl flush + any delayed test effect.
>>  	sleep 1
> 
> Feels magical.

It's a "sleep" so it is magical. That does not automatically make it
bad.

> The sleep was only there to make sure that SECONDS rolls over.

The purpose is a bit different now, funny that git diff matched it.

> If a test has delayed effects there is no hard guarantee that 1
> second is sufficient.

There is no such hard guarantee but testing can only prove the existence
of bugs, not their absence. It's not an exact science. This sleep 1 does
not help with effects delayed more than 1s, but it does help with all
delayed effects shorter 1s and that is useful. So, why 1s specifically?
Because 1s is a short time for humans and for most tests in this suite
while being a very long time for computers - plenty enough time for many
timers to expire and for journalctl to get all kernel messages. I can
make it 0.5s, that should work too and save time when running the whole
suite.

>> +	# Log anchor, especially useful when running tests back to back
> 
> This comment is going to go stale, I do not think it helps with
> understanding the implementation.

Sorry, I don't see what specific part could go stale in this
one... Could you elaborate?

>> +	printf "<5>%s@%ds: test/common: check_dmesg() OK\n" "$test_basename" "$SECONDS" > /dev/kmsg
> 
> I like this, I have long wanted anchors in the log for coordinating
> messages.
> 
>> +
>> +	if "$NDTEST_LOG_DBG"; then
>> +	    log_stress from_check_dmesg
>> +	fi
>> +}
>> +# Many tests don't use check_dmesg() (yet?) so double down here. Also, this
> 
> A comment like this belongs in the changelog, not the code. I don't want
> review later patches fixing up the "(yet?)", a comment should help
> understand the present, not comment on some future state.

I will rephrase.

>> +# runs later which is better. But before using this make sure there is
>> +# still no test defining its own EXIT trap.
> 
>> +if "$NDTEST_LOG_DBG"; then
>> +    trap 'log_stress from_trap' EXIT
>> +fi
>> +
>> +log_stress()
>> +{
>> +	printf '<3>%s@%ds: NDTEST_LOG_DBG; trying to break the next check_dmesg() %s\n' \
>> +		"$test_basename" "$SECONDS" "$1" > /dev/kmsg
>>  }

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-07-19  0:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-11 23:44 [ndctl PATCH v3] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
2025-06-11 23:44 ` [ndctl PATCH v3 1/2] test: move err() function at the top marc.herbert
2025-07-11 23:56   ` dan.j.williams
2025-06-11 23:44 ` [ndctl PATCH v3 2/2] test: fail on unexpected kernel error & warning, not just "Call Trace" marc.herbert
2025-07-12  0:32   ` dan.j.williams
2025-07-19  0:22     ` Marc Herbert
2025-06-12  0:02 ` [ndctl PATCH v3] " Dave Jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).