All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nicolin Chen <nicolinc@nvidia.com>
To: "Thomas Weißschuh" <thomas.weissschuh@linutronix.de>
Cc: "Jason Gunthorpe" <jgg@nvidia.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Shuah Khan" <skhan@linuxfoundation.org>,
	"Willy Tarreau" <w@1wt.eu>,
	"Thomas Weißschuh" <linux@weissschuh.net>,
	"Kees Cook" <kees@kernel.org>,
	"Andy Lutomirski" <luto@amacapital.net>,
	"Will Drewry" <wad@chromium.org>,
	"Mark Brown" <broonie@kernel.org>,
	"Muhammad Usama Anjum" <usama.anjum@collabora.com>,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v4 09/14] selftests: harness: Move teardown conditional into test metadata
Date: Wed, 11 Jun 2025 10:19:50 -0700	[thread overview]
Message-ID: <aEm6tuzy7WK12sMh@nvidia.com> (raw)
In-Reply-To: <20250611093942-f6c65a06-c72a-4451-aa1e-8cb8de0d69cb@linutronix.de>

On Wed, Jun 11, 2025 at 10:04:35AM +0200, Thomas Weißschuh wrote:
> On Wed, Jun 11, 2025 at 12:05:25AM -0700, Nicolin Chen wrote:
> > On Tue, Jun 10, 2025 at 08:46:57PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 10, 2025 at 11:48:44AM -0700, Nicolin Chen wrote:
> > > > On Tue, Jun 10, 2025 at 09:09:02AM -0300, Jason Gunthorpe wrote:
> > > > > On Tue, Jun 10, 2025 at 01:38:22PM +0200, Thomas Weißschuh wrote:
> > > > > > > ------------------------------------------------------------------
> > > > > > > #  RUN           iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty ...
> > > > > > > # enforce_dirty: Test terminated unexpectedly by signal 11
> > > > > 
> > > > > Sig 11 is weird..
> > > > 
> > > > > > On another note, the selftest should use the kselftest_harness' ASSERT_*()
> > > > > > macros instead of plain assert().
> > > > > 
> > > > > IIRC the kselftest stuff explodes if you try to use it's assert
> > > > > functions within a fixture setup/teardown context.
> > > > > 
> > > > > I also wasn't able to reproduce this (x86 ubuntu 24 LTS OS) Maybe
> > > > > it is ARM specific, I think Nicolin is running on ARM..
> > > > 
> > > > Yes. And I was running with 64KB page size. I just quickly retried
> > > > with 4KB page size (matching x86), and all failed tests pass now.
> > > 
> > > That's a weird thing to be sensitive too. Can you get a backtrace from
> > > the crash, what function/line is crashing?
> > 
> > I think I am getting what's going on. Here the harness code has a
> > parent process and a child process:
> > 
> > --------------------------------------------------------------
> > 428-            /* _metadata and potentially self are shared with all forks. */ \
> > 429:            child = fork(); \
> > 430:            if (child == 0) { \
> > 431-                    fixture_name##_setup(_metadata, self, variant->data); \
> > 432-                    /* Let setup failure terminate early. */ \
> > 433-                    if (_metadata->exit_code) \
> > 434-                            _exit(0); \
> > 435-                    *_metadata->no_teardown = false; \
> > 436-                    fixture_name##_##test_name(_metadata, self, variant->data); \
> > 437-                    _metadata->teardown_fn(false, _metadata, self, variant->data); \
> > 438-                    _exit(0); \
> > 439:            } else if (child < 0 || child != waitpid(child, &status, 0)) { \
> > 440-                    ksft_print_msg("ERROR SPAWNING TEST GRANDCHILD\n"); \
> > 441-                    _metadata->exit_code = KSFT_FAIL; \
> > 442-            } \
> > 443-            _metadata->teardown_fn(true, _metadata, self, variant->data); \
> > 444-            munmap(_metadata->no_teardown, sizeof(*_metadata->no_teardown)); \
> > 445-            _metadata->no_teardown = NULL; \
> > 446-            if (self && fixture_name##_teardown_parent) \
> > 447-                    munmap(self, sizeof(*self)); \
> > 448-            if (WIFEXITED(status)) { \
> > 449-                    if (WEXITSTATUS(status)) \
> > 450-                            _metadata->exit_code = WEXITSTATUS(status); \
> > 451-            } else if (WIFSIGNALED(status)) { \
> > 452-                    /* Forward signal to __wait_for_test(). */ \
> > 453-                    kill(getpid(), WTERMSIG(status)); \
> > 454-            } \
> > ....
> > 456-    static void wrapper_##fixture_name##_##test_name##_teardown( \
> > 457-            bool in_parent, struct __test_metadata *_metadata, \
> > 458-            void *self, const void *variant) \
> > 459-    { \
> > 460-            if (fixture_name##_teardown_parent == in_parent && \
> > 461-                            !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
> > 462-                    fixture_name##_teardown(_metadata, self, variant); \
> > 463-    } \
> > --------------------------------------------------------------
> > 
> > I found there is a race between those two processes, resulting in
> > the teardown() not getting invoked: I added some ksft_print_msg()
> > calls in-between the lines to debug, those tests can pass mostly,
> > as teardown() got invoked.
> > 
> > I think the reason why those huge page cases fail is just because 
> > the huge version of setup() takes longer time..
> 
> Can you try to recreate this issue with changes to
> tools/testing/selftests/kselftest_harness/harness-selftest.c ?

Hmm, I assume all 9 cases should pass? Mine only passes 4 on rc1,
without any change (perhaps we should start from here?):

TAP version 13
1..9
# Starting 9 tests from 4 test cases.
#  RUN           global.standalone_pass ...
# harness-selftest.c:19:standalone_pass:before
# harness-selftest.c:23:standalone_pass:after
#            OK  global.standalone_pass
ok 1 global.standalone_pass
#  RUN           global.standalone_fail ...
# harness-selftest.c:27:standalone_fail:before
# harness-selftest.c:29:standalone_fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:30:standalone_fail:Expected 0 (0) == 1 (1)
# standalone_fail: Test terminated by assertion
#          FAIL  global.standalone_fail
not ok 2 global.standalone_fail
#  RUN           global.signal_pass ...
# harness-selftest.c:35:signal_pass:before
# harness-selftest.c:37:signal_pass:after
#            OK  global.signal_pass
ok 3 global.signal_pass
#  RUN           global.signal_fail ...
# harness-selftest.c:42:signal_fail:before
# harness-selftest.c:43:signal_fail:Expected 0 (0) == 1 (1)
# signal_fail: Test terminated by assertion
#          FAIL  global.signal_fail
not ok 4 global.signal_fail
#  RUN           fixture.pass ...
# harness-selftest.c:53:pass:setup
# harness-selftest.c:62:pass:before
# harness-selftest.c:19:pass:before
# harness-selftest.c:23:pass:after
# harness-selftest.c:66:pass:after
# harness-selftest.c:58:pass:teardown same-process=1
#            OK  fixture.pass
ok 5 fixture.pass
#  RUN           fixture.fail ...
# harness-selftest.c:53:fail:setup
# harness-selftest.c:70:fail:before
# harness-selftest.c:71:fail:Expected 0 (0) == 1 (1)
# harness-selftest.c:58:fail:teardown same-process=1
# fail: Test terminated by assertion
#          FAIL  fixture.fail
not ok 6 fixture.fail
#  RUN           fixture.timeout ...
# harness-selftest.c:53:timeout:setup
# harness-selftest.c:77:timeout:before
# timeout: Test terminated by timeout
#          FAIL  fixture.timeout
not ok 7 fixture.timeout
#  RUN           fixture_parent.pass ...
# harness-selftest.c:87:pass:setup
# harness-selftest.c:96:pass:before
# harness-selftest.c:98:pass:after
# harness-selftest.c:92:pass:teardown same-process=0
#            OK  fixture_parent.pass
ok 8 fixture_parent.pass
#  RUN           fixture_setup_failure.pass ...
# harness-selftest.c:106:pass:setup
# harness-selftest.c:108:pass:Expected 0 (0) == 1 (1)
# pass: Test terminated by assertion
#          FAIL  fixture_setup_failure.pass
not ok 9 fixture_setup_failure.pass
# FAILED: 4 / 9 tests passed.
# Totals: pass:4 fail:5 xfail:0 xpass:0 skip:0 error:0

> > I haven't figured out a proper fix yet, but something smells bad:
> > 1) *no_teardown is set non-atomically, while both processes calls
> >    __atomic_test_and_set()
> 
> Does this make a difference?
> 
> diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
> index 2925e47db995..89fb37a21d9d 100644
> --- a/tools/testing/selftests/kselftest_harness.h
> +++ b/tools/testing/selftests/kselftest_harness.h
> @@ -410,7 +410,7 @@
>                 /* Makes sure there is only one teardown, even when child forks again. */ \
>                 _metadata->no_teardown = mmap(NULL, sizeof(*_metadata->no_teardown), \
>                         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); \
> -               *_metadata->no_teardown = true; \
> +               __atomic_store_n(_metadata->no_teardown, true, __ATOMIC_SEQ_CST); \
>                 if (sizeof(*self) > 0) { \
>                         if (fixture_name##_teardown_parent) { \
>                                 self = mmap(NULL, sizeof(*self), PROT_READ | PROT_WRITE, \
> @@ -429,7 +429,7 @@
>                         /* Let setup failure terminate early. */ \
>                         if (_metadata->exit_code) \
>                                 _exit(0); \
> -                       *_metadata->no_teardown = false; \
> +                       __atomic_store_n(_metadata->no_teardown, false, __ATOMIC_SEQ_CST); \
>                         fixture_name##_##test_name(_metadata, self, variant->data); \
>                         _metadata->teardown_fn(false, _metadata, self, variant->data); \
>                         _exit(0); \
> @@ -455,7 +455,7 @@
>                 void *self, const void *variant) \
>         { \
>                 if (fixture_name##_teardown_parent == in_parent && \
> -                               !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_RELAXED)) \
> +                               !__atomic_test_and_set(_metadata->no_teardown, __ATOMIC_SEQ_CST)) \
>                         fixture_name##_teardown(_metadata, self, variant); \
>         } \
>         static struct __test_metadata *_##fixture_name##_##test_name##_object; \

Unfortunately, no :(

> 
> > 2) parent doesn't seem to wait for the setup() to complete..
> 
> setup() is called in the child (L431) right before the testcase itself is
> called (L436). The parent waits for the child to exit (L439) before unmapping.
> 
> > 3) when parent runs faster than the child that is still running
> >    setup(), the parent unmaps the no_teardown and set it to NULL,
> >    then UAF in the child, i.e. signal 11?
> 
> That should never happen as the waitpid() will block until the child running
> setup() and the testcase itself have exited.

Ah, maybe I was wrong about these narratives. But the results show
that iommufd_dirty_tracking_teardown() was not called in the failed
cases:

// I added a huge.c file to run only 4 cases on one variant
// And I added two fprintf to its FIXTURE_SETUP/TEARDOWN().

TAP version 13
1..4
# Starting 4 tests from 1 test cases.
#  RUN           iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking ...
---------iommufd_dirty_tracking_setup
---------iommufd_dirty_tracking_teardown
#            OK  iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
ok 1 iommufd_dirty_tracking.domain_dirty64M_huge.set_dirty_tracking
#  RUN           iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability ...
---------iommufd_dirty_tracking_setup
# device_dirty_capability: Test terminated unexpectedly by signal 11
#          FAIL  iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
not ok 2 iommufd_dirty_tracking.domain_dirty64M_huge.device_dirty_capability
#  RUN           iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap: Test terminated unexpectedly by signal 11
#          FAIL  iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
not ok 3 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap
#  RUN           iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear ...
---------iommufd_dirty_tracking_setup
# get_dirty_bitmap_no_clear: Test terminated unexpectedly by signal 11
#          FAIL  iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
not ok 4 iommufd_dirty_tracking.domain_dirty64M_huge.get_dirty_bitmap_no_clear
# FAILED: 1 / 4 tests passed.

> 
> Does the issue also happen when you only execute a single testcase?
> $ ./iommufd -r iommufd_dirty_tracking.domain_dirty128M_huge.enforce_dirty

$ sudo tools/testing/selftests/iommu/iommufd -r iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
TAP version 13
1..1
# Starting 1 tests from 1 test cases.
#  RUN           iommufd_dirty_tracking.domain_dirty128M.enforce_dirty ...
#            OK  iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
ok 1 iommufd_dirty_tracking.domain_dirty128M.enforce_dirty
# PASSED: 1 / 1 tests passed.
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

This one passes. Looks like the first hugepage case would pass but
the following ones would fail if running them sequentially..

Thanks
Nicolin

  reply	other threads:[~2025-06-11 17:20 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-05 15:15 [PATCH v4 00/14] kselftest harness and nolibc compatibility Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 01/14] selftests: harness: Add kselftest harness selftest Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 02/14] selftests: harness: Use C89 comment style Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 03/14] selftests: harness: Ignore unused variant argument warning Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 04/14] selftests: harness: Mark functions without prototypes static Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 05/14] selftests: harness: Remove inline qualifier for wrappers Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 06/14] selftests: harness: Remove dependency on libatomic Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 07/14] selftests: harness: Implement test timeouts through pidfd Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 08/14] selftests: harness: Don't set setup_completed for fixtureless tests Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 09/14] selftests: harness: Move teardown conditional into test metadata Thomas Weißschuh
2025-06-10  6:49   ` Nicolin Chen
2025-06-10 11:38     ` Thomas Weißschuh
2025-06-10 12:09       ` Jason Gunthorpe
2025-06-10 18:48         ` Nicolin Chen
2025-06-10 23:46           ` Jason Gunthorpe
2025-06-11  7:05             ` Nicolin Chen
2025-06-11  8:04               ` Thomas Weißschuh
2025-06-11 17:19                 ` Nicolin Chen [this message]
2025-06-11 21:47                   ` Nicolin Chen
2025-06-11 23:43                     ` Nicolin Chen
2025-06-11 23:51                       ` Jason Gunthorpe
2025-06-12  6:59                         ` Nicolin Chen
2025-06-12 13:58                           ` Jason Gunthorpe
2025-06-12 14:27                             ` Thomas Weißschuh
2025-06-12 14:58                               ` Jason Gunthorpe
2025-06-12 15:23                                 ` Thomas Weißschuh
2025-06-12 15:42                                   ` Jason Gunthorpe
2025-06-12 17:53                                     ` Nicolin Chen
2025-06-12 18:53                                       ` Nicolin Chen
2025-06-12 18:56                                         ` Jason Gunthorpe
2025-06-12 19:03                                           ` Nicolin Chen
2025-06-12 23:31                                             ` Jason Gunthorpe
2025-06-13  5:55                                               ` Nicolin Chen
2025-06-11  6:53           ` Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 10/14] selftests: harness: Add teardown callback to " Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 11/14] selftests: harness: Add "variant" and "self" " Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 12/14] selftests: harness: Stop using setjmp()/longjmp() Thomas Weißschuh
2025-06-10  6:40   ` Nicolin Chen
2025-06-10 12:21     ` Thomas Weißschuh
2025-06-10 23:13       ` Nicolin Chen
2025-05-05 15:15 ` [PATCH v4 13/14] selftests: harness: Guard includes on nolibc Thomas Weißschuh
2025-05-05 15:15 ` [PATCH v4 14/14] HACK: selftests/nolibc: demonstrate usage of the kselftest harness Thomas Weißschuh
2025-05-06  5:59   ` Mark Brown
2025-05-10  6:54 ` [PATCH v4 00/14] kselftest harness and nolibc compatibility Thomas Weißschuh
2025-05-12 21:32   ` Shuah Khan
2025-05-13  5:33     ` Thomas Weißschuh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aEm6tuzy7WK12sMh@nvidia.com \
    --to=nicolinc@nvidia.com \
    --cc=broonie@kernel.org \
    --cc=jgg@nvidia.com \
    --cc=kees@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux@weissschuh.net \
    --cc=luto@amacapital.net \
    --cc=shuah@kernel.org \
    --cc=skhan@linuxfoundation.org \
    --cc=thomas.weissschuh@linutronix.de \
    --cc=usama.anjum@collabora.com \
    --cc=w@1wt.eu \
    --cc=wad@chromium.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.