Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] selftests/damon: add regression test for damos_walk() vs kdamond exit race
@ 2026-05-24 10:02 Sailesh Nandanavanam
  2026-05-24 18:37 ` SeongJae Park
  0 siblings, 1 reply; 5+ messages in thread
From: Sailesh Nandanavanam @ 2026-05-24 10:02 UTC (permalink / raw)
  To: sj
  Cc: shuah, damon, linux-mm, linux-kselftest, linux-kernel,
	Sailesh Nandanavanam

Add a regression test that verifies damos_walk() does not deadlock
when racing with kdamond_fn() exit.

When kdamond_fn() finishes its main loop, it cancels remaining
damos_walk() requests and unsets damon_ctx->kdamond. Without the fix
in commit 33c3f6c2b48c, damos_walk() could be called right after
cancellation but before kdamond pointer unset, causing it to wait
forever for handling that never comes.

The test starts kdamond monitoring a short-lived process, waits for
the process to exit naturally triggering kdamond termination, then
rapidly calls update_schemes_tried_regions in a separate thread to
hit the race window. Using a thread with join timeout ensures the
test can detect kernel-level deadlocks where the system call blocks
in uninterruptible state.

The sysfs state path is resolved dynamically via the kdamonds object
instead of being hardcoded, and exceptions are handled specifically
as OSError rather than using a bare except block.

Fixes: 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn() exit race")
Signed-off-by: Sailesh Nandanavanam <saileshnandanavanam@gmail.com>
---
 tools/testing/selftests/damon/Makefile        |  1 +
 .../sysfs_damos_walk_kdamond_exit_race.py     | 82 +++++++++++++++++++
 2 files changed, 83 insertions(+)
 create mode 100755 tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py

diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile
index 2180c328a825..60c83d6c318e 100644
--- a/tools/testing/selftests/damon/Makefile
+++ b/tools/testing/selftests/damon/Makefile
@@ -20,6 +20,7 @@ TEST_PROGS += sysfs_update_removed_scheme_dir.sh
 TEST_PROGS += sysfs_update_schemes_tried_regions_hang.py
 TEST_PROGS += sysfs_memcg_path_leak.sh
 TEST_PROGS += sysfs_no_op_commit_break.py
+TEST_PROGS += sysfs_damos_walk_kdamond_exit_race.py
 
 EXTRA_CLEAN = __pycache__
 
diff --git a/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
new file mode 100755
index 000000000000..8e8006d63926
--- /dev/null
+++ b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
@@ -0,0 +1,82 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+#
+# Regression test for damos_walk() vs kdamond_fn() exit race.
+#
+# When kdamond_fn() finishes its main loop, it cancels remaining damos_walk()
+# requests and unsets damon_ctx->kdamond. If damos_walk() is called right
+# after cancellation but before kdamond pointer unset, it could wait forever
+# for handling that never comes, causing a deadlock.
+#
+# This test verifies the fix by rapidly calling update_schemes_tried_regions
+# while kdamond is naturally terminating (monitored process exits).
+# Without the fix (commit 33c3f6c2b48c), this would hang indefinitely.
+
+import os
+import subprocess
+import threading
+import time
+import _damon_sysfs
+
+def call_update(kdamond, result):
+    err = kdamond.update_schemes_tried_regions()
+    result['err'] = err
+    result['done'] = True
+
+def main():
+    proc = subprocess.Popen(['sleep', '0.3'])
+
+    kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond(
+            contexts=[_damon_sysfs.DamonCtx(
+                ops='vaddr',
+                targets=[_damon_sysfs.DamonTarget(pid=proc.pid)],
+                schemes=[_damon_sysfs.Damos(
+                    action='stat',
+                    access_pattern=_damon_sysfs.DamosAccessPattern(
+                        nr_accesses=[0, 200]))]
+                )]
+            )])
+
+    err = kdamonds.start()
+    if err is not None:
+        print('kdamond start failed: %s' % err)
+        exit(1)
+
+    # Wait for monitored process to die naturally
+    proc.wait()
+
+    # Rapidly call damos_walk() while kdamond is exiting
+    # Use a thread with real timeout to detect kernel-level deadlock
+    deadline = time.time() + 5
+    while time.time() < deadline:
+        result = {'done': False, 'err': None}
+        t = threading.Thread(target=call_update,
+                             args=(kdamonds.kdamonds[0], result))
+        t.daemon = True
+        t.start()
+        t.join(timeout=5)
+
+        if not result['done']:
+            print('FAIL: update_schemes_tried_regions hung - '
+                  'possible damos_walk/kdamond exit race deadlock')
+            exit(1)
+
+        if result['err'] is not None:
+            # kdamond stopped cleanly - expected
+            break
+
+        # Check kdamond state via sysfs using dynamic path
+        state_path = os.path.join(
+                kdamonds.kdamonds[0].sysfs_dir(), 'state')
+        try:
+            with open(state_path) as f:
+                if f.read().strip() == 'off':
+                    break
+        except OSError as e:
+            print('failed to read kdamond state: %s' % e)
+            exit(1)
+
+    print('PASS: damos_walk() vs kdamond exit race not triggered')
+
+if __name__ == '__main__':
+    main()
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] selftests/damon: add regression test for damos_walk() vs kdamond exit race
  2026-05-24 10:02 [PATCH v2] selftests/damon: add regression test for damos_walk() vs kdamond exit race Sailesh Nandanavanam
@ 2026-05-24 18:37 ` SeongJae Park
  2026-06-05  7:14   ` Sailesh Nandanavanam
  0 siblings, 1 reply; 5+ messages in thread
From: SeongJae Park @ 2026-05-24 18:37 UTC (permalink / raw)
  To: Sailesh Nandanavanam
  Cc: SeongJae Park, shuah, damon, linux-mm, linux-kselftest,
	linux-kernel

Hello Sailesh,

On Sun, 24 May 2026 15:32:58 +0530 Sailesh Nandanavanam <saileshnandanavanam@gmail.com> wrote:

> Add a regression test that verifies damos_walk() does not deadlock
> when racing with kdamond_fn() exit.
> 
> When kdamond_fn() finishes its main loop, it cancels remaining
> damos_walk() requests and unsets damon_ctx->kdamond. Without the fix
> in commit 33c3f6c2b48c, damos_walk() could be called right after

Please use more common commit description format that has the commit subject.
E.g., commit 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn()
exit race").

> cancellation but before kdamond pointer unset, causing it to wait
> forever for handling that never comes.
> 
> The test starts kdamond monitoring a short-lived process, waits for
> the process to exit naturally triggering kdamond termination, then
> rapidly calls update_schemes_tried_regions in a separate thread to
> hit the race window. Using a thread with join timeout ensures the
> test can detect kernel-level deadlocks where the system call blocks
> in uninterruptible state.
> 
> The sysfs state path is resolved dynamically via the kdamonds object
> instead of being hardcoded, and exceptions are handled specifically
> as OSError rather than using a bare except block.

Thank you for this patch.  But, is this test reliable?  I ran the test more
than 100 times on my system running a kernel that has commit 33c3f6c2b48c is
reverted.  But test always passed.

> 
> Fixes: 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn() exit race")

This is not fixing a bug in the commit, so the above 'Fixes:' tag is
inappropriate and may only confuse people.

> Signed-off-by: Sailesh Nandanavanam <saileshnandanavanam@gmail.com>
> ---

From next time, please add changelog on the commentary area [1], with links to
the previous revisions.

I was able to find the previous version on the mailing list [2], so putting the
link here for others.

Also, before sending a new version, please share your revisioning plan and give
time (about a daay) for others to comment about.

>  tools/testing/selftests/damon/Makefile        |  1 +
>  .../sysfs_damos_walk_kdamond_exit_race.py     | 82 +++++++++++++++++++
>  2 files changed, 83 insertions(+)
>  create mode 100755 tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
> 
> diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile
> index 2180c328a825..60c83d6c318e 100644
> --- a/tools/testing/selftests/damon/Makefile
> +++ b/tools/testing/selftests/damon/Makefile
> @@ -20,6 +20,7 @@ TEST_PROGS += sysfs_update_removed_scheme_dir.sh
>  TEST_PROGS += sysfs_update_schemes_tried_regions_hang.py
>  TEST_PROGS += sysfs_memcg_path_leak.sh
>  TEST_PROGS += sysfs_no_op_commit_break.py
> +TEST_PROGS += sysfs_damos_walk_kdamond_exit_race.py
>  
>  EXTRA_CLEAN = __pycache__
>  
> diff --git a/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
> new file mode 100755
> index 000000000000..8e8006d63926
> --- /dev/null
> +++ b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
> @@ -0,0 +1,82 @@
> +#!/usr/bin/env python3
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Regression test for damos_walk() vs kdamond_fn() exit race.
> +#
> +# When kdamond_fn() finishes its main loop, it cancels remaining damos_walk()
> +# requests and unsets damon_ctx->kdamond. If damos_walk() is called right
> +# after cancellation but before kdamond pointer unset, it could wait forever
> +# for handling that never comes, causing a deadlock.
> +#
> +# This test verifies the fix by rapidly calling update_schemes_tried_regions
> +# while kdamond is naturally terminating (monitored process exits).
> +# Without the fix (commit 33c3f6c2b48c), this would hang indefinitely.
> +
> +import os
> +import subprocess
> +import threading
> +import time
> +import _damon_sysfs

Let's add a blank line before _damon_sysfs import, to be consistent with other
test files, like sysfs_update_schemes_tried_regions_hang.py.

> +
> +def call_update(kdamond, result):
> +    err = kdamond.update_schemes_tried_regions()
> +    result['err'] = err
> +    result['done'] = True
> +
> +def main():
> +    proc = subprocess.Popen(['sleep', '0.3'])
> +
> +    kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond(
> +            contexts=[_damon_sysfs.DamonCtx(
> +                ops='vaddr',
> +                targets=[_damon_sysfs.DamonTarget(pid=proc.pid)],
> +                schemes=[_damon_sysfs.Damos(
> +                    action='stat',
> +                    access_pattern=_damon_sysfs.DamosAccessPattern(
> +                        nr_accesses=[0, 200]))]
> +                )]
> +            )])
> +
> +    err = kdamonds.start()
> +    if err is not None:
> +        print('kdamond start failed: %s' % err)
> +        exit(1)
> +
> +    # Wait for monitored process to die naturally
> +    proc.wait()
> +
> +    # Rapidly call damos_walk() while kdamond is exiting
> +    # Use a thread with real timeout to detect kernel-level deadlock
> +    deadline = time.time() + 5
> +    while time.time() < deadline:
> +        result = {'done': False, 'err': None}
> +        t = threading.Thread(target=call_update,
> +                             args=(kdamonds.kdamonds[0], result))
> +        t.daemon = True
> +        t.start()
> +        t.join(timeout=5)

I'm not sure if this is reliable to trigger the exact race.  As I mentioned
abovely, I tried this test more than 100 times on a kernel that having the fix
reverted, but I was unable to make the test fail.  If it is that unreliable,
I'm not very sure if having this test is beneficial or just make people
confused.

If the test has no false positive, maybe having this make sense to
opportunistically finding the bug.  But I think the 5 seconds timeout is still
not very reliable on some case, and therefore it seems false positive test
failure is available.  If that is correct, I think having this test might only
confuse people.

I think having damos_walk() kunit test for its functionalities including the
walk_control_obsolete might make more sense.

> +
> +        if not result['done']:
> +            print('FAIL: update_schemes_tried_regions hung - '
> +                  'possible damos_walk/kdamond exit race deadlock')
> +            exit(1)
> +
> +        if result['err'] is not None:
> +            # kdamond stopped cleanly - expected
> +            break

Is the above if condition correct?  Could you please explain why having an
error here is expected?

> +
> +        # Check kdamond state via sysfs using dynamic path
> +        state_path = os.path.join(
> +                kdamonds.kdamonds[0].sysfs_dir(), 'state')
> +        try:
> +            with open(state_path) as f:
> +                if f.read().strip() == 'off':
> +                    break
> +        except OSError as e:
> +            print('failed to read kdamond state: %s' % e)
> +            exit(1)
> +
> +    print('PASS: damos_walk() vs kdamond exit race not triggered')
> +
> +if __name__ == '__main__':
> +    main()
> -- 
> 2.34.1

[1] https://docs.kernel.org/process/submitting-patches.html#commentary
[2] https://lore.kernel.org/20260524091812.35283-1-saileshnandanavanam@gmail.com


Thanks,
SJ


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] selftests/damon: add regression test for damos_walk() vs kdamond exit race
  2026-05-24 18:37 ` SeongJae Park
@ 2026-06-05  7:14   ` Sailesh Nandanavanam
  2026-06-05  9:55     ` Sailesh Nandanavanam
  2026-06-06  0:36     ` SeongJae Park
  0 siblings, 2 replies; 5+ messages in thread
From: Sailesh Nandanavanam @ 2026-06-05  7:14 UTC (permalink / raw)
  To: SeongJae Park; +Cc: shuah, damon, linux-mm, linux-kselftest, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 9852 bytes --]

Hi SeongJae,

Apologies for the delayed response.

Thank you for the detailed review and for actually testing the patch on a
kernel with the fix reverted.

You are right that the userspace selftest cannot reliably trigger this
race. The race window between kdamond cancelling requests and unsetting the
pointer is microseconds wide. Since userspace calls have system call
overhead, they cannot hit that window reliably — which is why the test
always passed even on a broken kernel.

Regarding the error condition question:

When kdamond stops naturally, any subsequent write to the state file (which
update_schemes_tried_regions() does internally) fails because kdamond is no
longer running. So receiving an error indicates kdamond exited
without deadlock — the expected success condition. I should have explained
this more clearly in the comment.

I agree that a KUnit test is the right approach since it runs inside the
kernel and can directly set walk_control_obsolete to simulate the race
condition reliably without timing dependencies.

I will drop the userspace selftest approach and write a KUnit test for
damos_walk() functionality including walk_control_obsolete. I will verify
it fails on a kernel with commit 33c3f6c2b48c ("mm/damon/core: fix
damos_walk() vs kdamond_fn() exit race") reverted and passes with the fix
present before sending v3.

Regarding other feedback:
- Will use correct commit description format with subject in brackets
- Will remove the Fixes: tag
- Will add blank line before _damon_sysfs import
- Will add changelog and link previous versions
- Will share revision plan and wait at least one day
  before sending new version

Please give me some time to study mm/damon/tests/core-kunit.h and implement
this properly.

Thanks,
Sailesh Nandanavanam

On Mon, May 25, 2026 at 12:08 AM SeongJae Park <sj@kernel.org> wrote:

> Hello Sailesh,
>
> On Sun, 24 May 2026 15:32:58 +0530 Sailesh Nandanavanam <
> saileshnandanavanam@gmail.com> wrote:
>
> > Add a regression test that verifies damos_walk() does not deadlock
> > when racing with kdamond_fn() exit.
> >
> > When kdamond_fn() finishes its main loop, it cancels remaining
> > damos_walk() requests and unsets damon_ctx->kdamond. Without the fix
> > in commit 33c3f6c2b48c, damos_walk() could be called right after
>
> Please use more common commit description format that has the commit
> subject.
> E.g., commit 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn()
> exit race").
>
> > cancellation but before kdamond pointer unset, causing it to wait
> > forever for handling that never comes.
> >
> > The test starts kdamond monitoring a short-lived process, waits for
> > the process to exit naturally triggering kdamond termination, then
> > rapidly calls update_schemes_tried_regions in a separate thread to
> > hit the race window. Using a thread with join timeout ensures the
> > test can detect kernel-level deadlocks where the system call blocks
> > in uninterruptible state.
> >
> > The sysfs state path is resolved dynamically via the kdamonds object
> > instead of being hardcoded, and exceptions are handled specifically
> > as OSError rather than using a bare except block.
>
> Thank you for this patch.  But, is this test reliable?  I ran the test more
> than 100 times on my system running a kernel that has commit 33c3f6c2b48c
> is
> reverted.  But test always passed.
>
> >
> > Fixes: 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn()
> exit race")
>
> This is not fixing a bug in the commit, so the above 'Fixes:' tag is
> inappropriate and may only confuse people.
>
> > Signed-off-by: Sailesh Nandanavanam <saileshnandanavanam@gmail.com>
> > ---
>
> From next time, please add changelog on the commentary area [1], with
> links to
> the previous revisions.
>
> I was able to find the previous version on the mailing list [2], so
> putting the
> link here for others.
>
> Also, before sending a new version, please share your revisioning plan and
> give
> time (about a daay) for others to comment about.
>
> >  tools/testing/selftests/damon/Makefile        |  1 +
> >  .../sysfs_damos_walk_kdamond_exit_race.py     | 82 +++++++++++++++++++
> >  2 files changed, 83 insertions(+)
> >  create mode 100755
> tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
> >
> > diff --git a/tools/testing/selftests/damon/Makefile
> b/tools/testing/selftests/damon/Makefile
> > index 2180c328a825..60c83d6c318e 100644
> > --- a/tools/testing/selftests/damon/Makefile
> > +++ b/tools/testing/selftests/damon/Makefile
> > @@ -20,6 +20,7 @@ TEST_PROGS += sysfs_update_removed_scheme_dir.sh
> >  TEST_PROGS += sysfs_update_schemes_tried_regions_hang.py
> >  TEST_PROGS += sysfs_memcg_path_leak.sh
> >  TEST_PROGS += sysfs_no_op_commit_break.py
> > +TEST_PROGS += sysfs_damos_walk_kdamond_exit_race.py
> >
> >  EXTRA_CLEAN = __pycache__
> >
> > diff --git
> a/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
> b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
> > new file mode 100755
> > index 000000000000..8e8006d63926
> > --- /dev/null
> > +++ b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
> > @@ -0,0 +1,82 @@
> > +#!/usr/bin/env python3
> > +# SPDX-License-Identifier: GPL-2.0
> > +#
> > +# Regression test for damos_walk() vs kdamond_fn() exit race.
> > +#
> > +# When kdamond_fn() finishes its main loop, it cancels remaining
> damos_walk()
> > +# requests and unsets damon_ctx->kdamond. If damos_walk() is called
> right
> > +# after cancellation but before kdamond pointer unset, it could wait
> forever
> > +# for handling that never comes, causing a deadlock.
> > +#
> > +# This test verifies the fix by rapidly calling
> update_schemes_tried_regions
> > +# while kdamond is naturally terminating (monitored process exits).
> > +# Without the fix (commit 33c3f6c2b48c), this would hang indefinitely.
> > +
> > +import os
> > +import subprocess
> > +import threading
> > +import time
> > +import _damon_sysfs
>
> Let's add a blank line before _damon_sysfs import, to be consistent with
> other
> test files, like sysfs_update_schemes_tried_regions_hang.py.
>
> > +
> > +def call_update(kdamond, result):
> > +    err = kdamond.update_schemes_tried_regions()
> > +    result['err'] = err
> > +    result['done'] = True
> > +
> > +def main():
> > +    proc = subprocess.Popen(['sleep', '0.3'])
> > +
> > +    kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond(
> > +            contexts=[_damon_sysfs.DamonCtx(
> > +                ops='vaddr',
> > +                targets=[_damon_sysfs.DamonTarget(pid=proc.pid)],
> > +                schemes=[_damon_sysfs.Damos(
> > +                    action='stat',
> > +                    access_pattern=_damon_sysfs.DamosAccessPattern(
> > +                        nr_accesses=[0, 200]))]
> > +                )]
> > +            )])
> > +
> > +    err = kdamonds.start()
> > +    if err is not None:
> > +        print('kdamond start failed: %s' % err)
> > +        exit(1)
> > +
> > +    # Wait for monitored process to die naturally
> > +    proc.wait()
> > +
> > +    # Rapidly call damos_walk() while kdamond is exiting
> > +    # Use a thread with real timeout to detect kernel-level deadlock
> > +    deadline = time.time() + 5
> > +    while time.time() < deadline:
> > +        result = {'done': False, 'err': None}
> > +        t = threading.Thread(target=call_update,
> > +                             args=(kdamonds.kdamonds[0], result))
> > +        t.daemon = True
> > +        t.start()
> > +        t.join(timeout=5)
>
> I'm not sure if this is reliable to trigger the exact race.  As I mentioned
> abovely, I tried this test more than 100 times on a kernel that having the
> fix
> reverted, but I was unable to make the test fail.  If it is that
> unreliable,
> I'm not very sure if having this test is beneficial or just make people
> confused.
>
> If the test has no false positive, maybe having this make sense to
> opportunistically finding the bug.  But I think the 5 seconds timeout is
> still
> not very reliable on some case, and therefore it seems false positive test
> failure is available.  If that is correct, I think having this test might
> only
> confuse people.
>
> I think having damos_walk() kunit test for its functionalities including
> the
> walk_control_obsolete might make more sense.
>
> > +
> > +        if not result['done']:
> > +            print('FAIL: update_schemes_tried_regions hung - '
> > +                  'possible damos_walk/kdamond exit race deadlock')
> > +            exit(1)
> > +
> > +        if result['err'] is not None:
> > +            # kdamond stopped cleanly - expected
> > +            break
>
> Is the above if condition correct?  Could you please explain why having an
> error here is expected?
>
> > +
> > +        # Check kdamond state via sysfs using dynamic path
> > +        state_path = os.path.join(
> > +                kdamonds.kdamonds[0].sysfs_dir(), 'state')
> > +        try:
> > +            with open(state_path) as f:
> > +                if f.read().strip() == 'off':
> > +                    break
> > +        except OSError as e:
> > +            print('failed to read kdamond state: %s' % e)
> > +            exit(1)
> > +
> > +    print('PASS: damos_walk() vs kdamond exit race not triggered')
> > +
> > +if __name__ == '__main__':
> > +    main()
> > --
> > 2.34.1
>
> [1] https://docs.kernel.org/process/submitting-patches.html#commentary
> [2]
> https://lore.kernel.org/20260524091812.35283-1-saileshnandanavanam@gmail.com
>
>
> Thanks,
> SJ
>

[-- Attachment #2: Type: text/html, Size: 11785 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] selftests/damon: add regression test for damos_walk() vs kdamond exit race
  2026-06-05  7:14   ` Sailesh Nandanavanam
@ 2026-06-05  9:55     ` Sailesh Nandanavanam
  2026-06-06  0:36     ` SeongJae Park
  1 sibling, 0 replies; 5+ messages in thread
From: Sailesh Nandanavanam @ 2026-06-05  9:55 UTC (permalink / raw)
  To: SeongJae Park; +Cc: shuah, damon, linux-mm, linux-kselftest, linux-kernel

Hi SeongJae,

Apologies for the delayed response.

Thank you for the detailed review and for actually testing the patch
on a kernel with the fix reverted.

You are right that the userspace selftest cannot reliably trigger this
race. The race window between kdamond cancelling requests and
unsetting the pointer is microseconds wide. Since userspace calls have
system call overhead, they cannot hit that window reliably — which is
why the test always passed even on a broken kernel.

Regarding the error condition question:

When kdamond stops naturally, any subsequent write to the state file
(which update_schemes_tried_regions() does internally) fails because
kdamond is no longer running. So receiving an error indicates kdamond
exited without deadlock — the expected success condition. I should
have explained this more clearly in the comment.

I agree that a KUnit test is the right approach since it runs inside
the kernel and can directly set walk_control_obsolete to simulate the
race condition reliably without timing dependencies.

I will drop the userspace selftest approach and write a KUnit test for
damos_walk() functionality including walk_control_obsolete. I will
verify it fails on a kernel with commit 33c3f6c2b48c ("mm/damon/core:
fix damos_walk() vs kdamond_fn() exit race") reverted and passes with
the fix present before sending v3.

Regarding other feedback:
- Will use correct commit description format with subject in brackets
- Will remove the Fixes: tag
- Will add blank line before _damon_sysfs import
- Will add changelog and link previous versions
- Will share revision plan and wait at least one day
  before sending new version

Please give me some time to study mm/damon/tests/core-kunit.h and
implement this properly.

Thanks,
Sailesh Nandanavanam


On Fri, Jun 5, 2026 at 12:44 PM Sailesh Nandanavanam
<saileshnandanavanam@gmail.com> wrote:
>
> Hi SeongJae,
>
> Apologies for the delayed response.
>
> Thank you for the detailed review and for actually testing the patch on a kernel with the fix reverted.
>
> You are right that the userspace selftest cannot reliably trigger this race. The race window between kdamond cancelling requests and unsetting the pointer is microseconds wide. Since userspace calls have system call overhead, they cannot hit that window reliably — which is why the test always passed even on a broken kernel.
>
> Regarding the error condition question:
>
> When kdamond stops naturally, any subsequent write to the state file (which update_schemes_tried_regions() does internally) fails because kdamond is no longer running. So receiving an error indicates kdamond exited without deadlock — the expected success condition. I should have explained this more clearly in the comment.
>
> I agree that a KUnit test is the right approach since it runs inside the kernel and can directly set walk_control_obsolete to simulate the race condition reliably without timing dependencies.
>
> I will drop the userspace selftest approach and write a KUnit test for damos_walk() functionality including walk_control_obsolete. I will verify it fails on a kernel with commit 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn() exit race") reverted and passes with the fix present before sending v3.
>
> Regarding other feedback:
> - Will use correct commit description format with subject in brackets
> - Will remove the Fixes: tag
> - Will add blank line before _damon_sysfs import
> - Will add changelog and link previous versions
> - Will share revision plan and wait at least one day
>   before sending new version
>
> Please give me some time to study mm/damon/tests/core-kunit.h and implement this properly.
>
> Thanks,
> Sailesh Nandanavanam
>
> On Mon, May 25, 2026 at 12:08 AM SeongJae Park <sj@kernel.org> wrote:
>>
>> Hello Sailesh,
>>
>> On Sun, 24 May 2026 15:32:58 +0530 Sailesh Nandanavanam <saileshnandanavanam@gmail.com> wrote:
>>
>> > Add a regression test that verifies damos_walk() does not deadlock
>> > when racing with kdamond_fn() exit.
>> >
>> > When kdamond_fn() finishes its main loop, it cancels remaining
>> > damos_walk() requests and unsets damon_ctx->kdamond. Without the fix
>> > in commit 33c3f6c2b48c, damos_walk() could be called right after
>>
>> Please use more common commit description format that has the commit subject.
>> E.g., commit 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn()
>> exit race").
>>
>> > cancellation but before kdamond pointer unset, causing it to wait
>> > forever for handling that never comes.
>> >
>> > The test starts kdamond monitoring a short-lived process, waits for
>> > the process to exit naturally triggering kdamond termination, then
>> > rapidly calls update_schemes_tried_regions in a separate thread to
>> > hit the race window. Using a thread with join timeout ensures the
>> > test can detect kernel-level deadlocks where the system call blocks
>> > in uninterruptible state.
>> >
>> > The sysfs state path is resolved dynamically via the kdamonds object
>> > instead of being hardcoded, and exceptions are handled specifically
>> > as OSError rather than using a bare except block.
>>
>> Thank you for this patch.  But, is this test reliable?  I ran the test more
>> than 100 times on my system running a kernel that has commit 33c3f6c2b48c is
>> reverted.  But test always passed.
>>
>> >
>> > Fixes: 33c3f6c2b48c ("mm/damon/core: fix damos_walk() vs kdamond_fn() exit race")
>>
>> This is not fixing a bug in the commit, so the above 'Fixes:' tag is
>> inappropriate and may only confuse people.
>>
>> > Signed-off-by: Sailesh Nandanavanam <saileshnandanavanam@gmail.com>
>> > ---
>>
>> From next time, please add changelog on the commentary area [1], with links to
>> the previous revisions.
>>
>> I was able to find the previous version on the mailing list [2], so putting the
>> link here for others.
>>
>> Also, before sending a new version, please share your revisioning plan and give
>> time (about a daay) for others to comment about.
>>
>> >  tools/testing/selftests/damon/Makefile        |  1 +
>> >  .../sysfs_damos_walk_kdamond_exit_race.py     | 82 +++++++++++++++++++
>> >  2 files changed, 83 insertions(+)
>> >  create mode 100755 tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
>> >
>> > diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile
>> > index 2180c328a825..60c83d6c318e 100644
>> > --- a/tools/testing/selftests/damon/Makefile
>> > +++ b/tools/testing/selftests/damon/Makefile
>> > @@ -20,6 +20,7 @@ TEST_PROGS += sysfs_update_removed_scheme_dir.sh
>> >  TEST_PROGS += sysfs_update_schemes_tried_regions_hang.py
>> >  TEST_PROGS += sysfs_memcg_path_leak.sh
>> >  TEST_PROGS += sysfs_no_op_commit_break.py
>> > +TEST_PROGS += sysfs_damos_walk_kdamond_exit_race.py
>> >
>> >  EXTRA_CLEAN = __pycache__
>> >
>> > diff --git a/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
>> > new file mode 100755
>> > index 000000000000..8e8006d63926
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/damon/sysfs_damos_walk_kdamond_exit_race.py
>> > @@ -0,0 +1,82 @@
>> > +#!/usr/bin/env python3
>> > +# SPDX-License-Identifier: GPL-2.0
>> > +#
>> > +# Regression test for damos_walk() vs kdamond_fn() exit race.
>> > +#
>> > +# When kdamond_fn() finishes its main loop, it cancels remaining damos_walk()
>> > +# requests and unsets damon_ctx->kdamond. If damos_walk() is called right
>> > +# after cancellation but before kdamond pointer unset, it could wait forever
>> > +# for handling that never comes, causing a deadlock.
>> > +#
>> > +# This test verifies the fix by rapidly calling update_schemes_tried_regions
>> > +# while kdamond is naturally terminating (monitored process exits).
>> > +# Without the fix (commit 33c3f6c2b48c), this would hang indefinitely.
>> > +
>> > +import os
>> > +import subprocess
>> > +import threading
>> > +import time
>> > +import _damon_sysfs
>>
>> Let's add a blank line before _damon_sysfs import, to be consistent with other
>> test files, like sysfs_update_schemes_tried_regions_hang.py.
>>
>> > +
>> > +def call_update(kdamond, result):
>> > +    err = kdamond.update_schemes_tried_regions()
>> > +    result['err'] = err
>> > +    result['done'] = True
>> > +
>> > +def main():
>> > +    proc = subprocess.Popen(['sleep', '0.3'])
>> > +
>> > +    kdamonds = _damon_sysfs.Kdamonds([_damon_sysfs.Kdamond(
>> > +            contexts=[_damon_sysfs.DamonCtx(
>> > +                ops='vaddr',
>> > +                targets=[_damon_sysfs.DamonTarget(pid=proc.pid)],
>> > +                schemes=[_damon_sysfs.Damos(
>> > +                    action='stat',
>> > +                    access_pattern=_damon_sysfs.DamosAccessPattern(
>> > +                        nr_accesses=[0, 200]))]
>> > +                )]
>> > +            )])
>> > +
>> > +    err = kdamonds.start()
>> > +    if err is not None:
>> > +        print('kdamond start failed: %s' % err)
>> > +        exit(1)
>> > +
>> > +    # Wait for monitored process to die naturally
>> > +    proc.wait()
>> > +
>> > +    # Rapidly call damos_walk() while kdamond is exiting
>> > +    # Use a thread with real timeout to detect kernel-level deadlock
>> > +    deadline = time.time() + 5
>> > +    while time.time() < deadline:
>> > +        result = {'done': False, 'err': None}
>> > +        t = threading.Thread(target=call_update,
>> > +                             args=(kdamonds.kdamonds[0], result))
>> > +        t.daemon = True
>> > +        t.start()
>> > +        t.join(timeout=5)
>>
>> I'm not sure if this is reliable to trigger the exact race.  As I mentioned
>> abovely, I tried this test more than 100 times on a kernel that having the fix
>> reverted, but I was unable to make the test fail.  If it is that unreliable,
>> I'm not very sure if having this test is beneficial or just make people
>> confused.
>>
>> If the test has no false positive, maybe having this make sense to
>> opportunistically finding the bug.  But I think the 5 seconds timeout is still
>> not very reliable on some case, and therefore it seems false positive test
>> failure is available.  If that is correct, I think having this test might only
>> confuse people.
>>
>> I think having damos_walk() kunit test for its functionalities including the
>> walk_control_obsolete might make more sense.
>>
>> > +
>> > +        if not result['done']:
>> > +            print('FAIL: update_schemes_tried_regions hung - '
>> > +                  'possible damos_walk/kdamond exit race deadlock')
>> > +            exit(1)
>> > +
>> > +        if result['err'] is not None:
>> > +            # kdamond stopped cleanly - expected
>> > +            break
>>
>> Is the above if condition correct?  Could you please explain why having an
>> error here is expected?
>>
>> > +
>> > +        # Check kdamond state via sysfs using dynamic path
>> > +        state_path = os.path.join(
>> > +                kdamonds.kdamonds[0].sysfs_dir(), 'state')
>> > +        try:
>> > +            with open(state_path) as f:
>> > +                if f.read().strip() == 'off':
>> > +                    break
>> > +        except OSError as e:
>> > +            print('failed to read kdamond state: %s' % e)
>> > +            exit(1)
>> > +
>> > +    print('PASS: damos_walk() vs kdamond exit race not triggered')
>> > +
>> > +if __name__ == '__main__':
>> > +    main()
>> > --
>> > 2.34.1
>>
>> [1] https://docs.kernel.org/process/submitting-patches.html#commentary
>> [2] https://lore.kernel.org/20260524091812.35283-1-saileshnandanavanam@gmail.com
>>
>>
>> Thanks,
>> SJ


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] selftests/damon: add regression test for damos_walk() vs kdamond exit race
  2026-06-05  7:14   ` Sailesh Nandanavanam
  2026-06-05  9:55     ` Sailesh Nandanavanam
@ 2026-06-06  0:36     ` SeongJae Park
  1 sibling, 0 replies; 5+ messages in thread
From: SeongJae Park @ 2026-06-06  0:36 UTC (permalink / raw)
  To: Sailesh Nandanavanam
  Cc: SeongJae Park, shuah, damon, linux-mm, linux-kselftest,
	linux-kernel

On Fri, 5 Jun 2026 12:44:50 +0530 Sailesh Nandanavanam <saileshnandanavanam@gmail.com> wrote:

> Hi SeongJae,
> 
> Apologies for the delayed response.

No worry!

[...]
> I agree that a KUnit test is the right approach since it runs inside the
> kernel and can directly set walk_control_obsolete to simulate the race
> condition reliably without timing dependencies.
> 
> I will drop the userspace selftest approach and write a KUnit test for
> damos_walk() functionality including walk_control_obsolete. I will verify
> it fails on a kernel with commit 33c3f6c2b48c ("mm/damon/core: fix
> damos_walk() vs kdamond_fn() exit race") reverted and passes with the fix
> present before sending v3.
> 
> Regarding other feedback:
> - Will use correct commit description format with subject in brackets
> - Will remove the Fixes: tag
> - Will add blank line before _damon_sysfs import
> - Will add changelog and link previous versions
> - Will share revision plan and wait at least one day
>   before sending new version

Sounds good.  Please consider using in-line reply [1] from the next time,
though!

> 
> Please give me some time to study mm/damon/tests/core-kunit.h and implement
> this properly.

No rush, take your time.

[1] https://subspace.kernel.org/etiquette.html#do-not-top-post-when-replying


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-06  0:36 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-24 10:02 [PATCH v2] selftests/damon: add regression test for damos_walk() vs kdamond exit race Sailesh Nandanavanam
2026-05-24 18:37 ` SeongJae Park
2026-06-05  7:14   ` Sailesh Nandanavanam
2026-06-05  9:55     ` Sailesh Nandanavanam
2026-06-06  0:36     ` SeongJae Park

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox