From: SeongJae Park <sj@kernel.org>
To: SeongJae Park <sj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
damon@lists.linux.dev, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: (sashiko review) [RFC PATCH v4 01/10] mm/damon/core: introduce damon_ctx->paused
Date: Sun, 22 Mar 2026 13:51:24 -0700
Message-ID: <20260322205125.88701-1-sj@kernel.org>
In-Reply-To: <20260322174018.83729-1-sj@kernel.org>
On Sun, 22 Mar 2026 10:40:16 -0700 SeongJae Park <sj@kernel.org> wrote:
> On Sun, 22 Mar 2026 10:06:59 -0700 SeongJae Park <sj@kernel.org> wrote:
>
> > Forwarding Sashiko review for doing discussions via mails.
> >
> > # review url: https://sashiko.dev/#/patchset/20260322155728.81434-2-sj@kernel.org
> > # start of sashiko.dev inline review
> > commit 2a3be2c8b6c327ef72163cd54e6121ef72b9c932
> > Author: SeongJae Park <sj@kernel.org>
> >
> > mm/damon/core: introduce damon_ctx->paused
> >
> > DAMON supports only start and stop of the execution. When it is
> > stopped, its internal data that it self-trained goes away. It will be
> > useful if the execution can be paused and resumed with the previous
> > self-trained data.
> >
> > > diff --git a/mm/damon/core.c b/mm/damon/core.c
> > > index db6c67e52d2b..0ab2cfa848e6 100644
> > > --- a/mm/damon/core.c
> > > +++ b/mm/damon/core.c
> >
> > [ ... ]
> >
> > > @@ -3003,6 +3004,14 @@ static int kdamond_fn(void *data)
> > > kdamond_call(ctx, false);
> > > if (ctx->maybe_corrupted)
> > > break;
> > > + while (ctx->pause) {
> > > + damos_walk_cancel(ctx);
> >
> > Is it safe to repeatedly call damos_walk_cancel() here?
> >
> > In damos_walk(), the walk_control_lock is dropped after setting
> > ctx->walk_control to check damon_is_running(), which opens a race window.
> >
> > Concurrently, kdamond_fn() calls damos_walk_cancel(), which reads
> > ctx->walk_control, drops the lock, and completes the old request.
> >
> > If a new caller invokes damos_walk() and successfully registers a new
> > walk_control, could damos_walk_cancel() resume, reacquire the lock,
> > and unconditionally set ctx->walk_control = NULL, dropping the new
> > request?
> >
> > This might leave the new caller permanently deadlocked in
> > wait_for_completion().
>
> Nice catch. Orthogonal to this patch, though.
>
> The deadlock could happen in below sequence.
>
> CPU0 │CPU1
> ──────────────────────────────┼────────────────────────
> damos_walk() │
> │register request │
> │wait completion │damos_walk_cancel()
> │ │ │complete the request
> ▼wakeup,return │ │
> damos_walk() │ │
> │register new request │ │
> │ │ │remove the new request
> │wait completion │ ▼return
> ▼ nobody completes it. │
>
> Nonetheless, kdamond_fn() is calling damos_walk() already in several places
> including this loop. This issue hence exists regardless of this patch. I will
> work on fixing this as a separate hotfix. Below fix may work.
TL;DR: there is no deadlock in the existing code. I will work on cleaner code
or documentation, though.
The scenario that I illustrated above cannot happen, because the second
damos_walk() cannot register its new request before the old request is unset.

The request is unset in three places: damos_walk_complete(),
damos_walk_cancel(), and damos_walk(). damos_walk_complete() and
damos_walk_cancel() are called from the same kdamond thread, so there is no
race between them.

damos_walk() unsets the request only if !damon_is_running(). damos_walk()
seeing !damon_is_running() means the kdamond is stopped. That in turn means
there can be no concurrent damos_walk_cancel() or damos_walk_complete() that
works for the same context and started before the damon_is_running() call.

Hence, unless the same context is restarted, there is no chance to race. Only
DAMON_SYSFS calls damos_walk(), and it doesn't restart the same context.
DAMON_RECLAIM and DAMON_LRU_SORT do restart the same context, but they don't
use damos_walk(). So, there is no deadlock in the existing code (or, no such
deadlock has been found so far).
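
The registration rule that this argument relies on can be sketched as a small
user-space model. This is only an illustration, not the kernel code: the
struct and function names (walk_request, model_ctx, model_register,
model_unset) are hypothetical, locking is left out, and returning -EBUSY for
an occupied slot is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel definitions. */
struct walk_request {
	int completed;
};

struct model_ctx {
	/* At most one pending walk request per context. */
	struct walk_request *walk_control;
};

/*
 * Registration step of the model: a new request is refused while an old
 * one is still registered.  (-EBUSY is an assumption of this sketch.)
 */
int model_register(struct model_ctx *ctx, struct walk_request *req)
{
	if (ctx->walk_control)
		return -EBUSY;
	req->completed = 0;
	ctx->walk_control = req;
	return 0;
}

/* Unset step of the model: frees the slot for the next request. */
void model_unset(struct model_ctx *ctx)
{
	ctx->walk_control = NULL;
}
```

In this model, a second model_register() call fails until model_unset() has
run, which mirrors the claim that a new request cannot be registered before
the old one is unset.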
Let's assume there could be a damos_walk() call with a parallel restart of a
DAMON context, though. In that case, the deadlock below is possible. This
seems to be what Sashiko was trying to say.
0. A DAMON context is stopped.
1-1. CPU0: calls damos_walk() for the stopped context.
1-2. CPU0: damos_walk(): registers a new damos_walk() request to the stopped
     context.
1-3. CPU0: damos_walk(): sees !damon_is_running().
2. CPU1: restarts the DAMON context.
3-1. CPU2: executes kdamond_fn() -> damos_walk_cancel().
3-2. CPU2: damos_walk_cancel(): completes the walk request that was registered
     in step 1-2.
4-1. CPU0: damos_walk(): unsets the request.
4-2. CPU0: calls damos_walk() again.
4-3. CPU0: damos_walk() 2: registers a new damos_walk() request.
4-4. CPU0: damos_walk() 2: waits for the completion.
5-1. CPU2: damos_walk_cancel(): unsets the walk request that was registered in
     step 4-3.

Nobody can complete the request that was registered in step 4-3. CPU0 waits
indefinitely.
In a more graphical way, this can be illustrated as below:
CPU0                           │CPU1             │CPU2
───────────────────────────────┼─────────────────┼────────────────────────────────────────
damos_walk()                   │                 │
│register request              │                 │
│see !damon_is_running(ctx)    │                 │
│                              │                 │
│                              │damon_start(ctx) │
│                              │                 │damos_walk_cancel()
│                              │                 │ complete first damos_walk() request
│                              │                 │
│unset request                 │                 │
▼return                        │                 │
                               │                 │
damos_walk()                   │                 │
│register request              │                 │
│wait completion               │                 │ unset second request
▼                              │                 │
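
The same interleaving can be replayed deterministically in a single-threaded
user-space model. This is a rough sketch under stated assumptions, not the
kernel implementation: model_ctx, walk_request and replay_restart_race() are
hypothetical names, and each CPU's step is simply executed in the order given
by the numbered list, with no real threads or locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel definitions. */
struct walk_request {
	bool completed;
};

struct model_ctx {
	struct walk_request *walk_control;	/* the single request slot */
	bool running;				/* models damon_is_running() */
};

/* Replays steps 0 to 5-1; returns true if the second request is lost. */
bool replay_restart_race(void)
{
	struct model_ctx ctx = { .walk_control = NULL, .running = false };
	struct walk_request first = { false }, second = { false };
	struct walk_request *seen_by_cancel;
	bool cpu0_saw_running;

	/* 1-2. CPU0 registers the first request on the stopped context. */
	ctx.walk_control = &first;
	/* 1-3. CPU0 sees that the context is not running. */
	cpu0_saw_running = ctx.running;

	/* 2. CPU1 restarts the context. */
	ctx.running = true;

	/* 3-1, 3-2. CPU2 reads the registered request and completes it. */
	seen_by_cancel = ctx.walk_control;
	seen_by_cancel->completed = true;

	/* 4-1. CPU0 unsets its request because it saw a stopped context. */
	if (!cpu0_saw_running)
		ctx.walk_control = NULL;

	/* 4-3. CPU0 calls again and registers the second request. */
	if (ctx.walk_control == NULL)
		ctx.walk_control = &second;
	/* 4-4. CPU0 now sees a running context, so it would wait here. */

	/* 5-1. CPU2 resumes and unconditionally clears the slot. */
	ctx.walk_control = NULL;

	/* The second request is no longer registered and was never completed. */
	return first.completed && !second.completed &&
		ctx.walk_control == NULL;
}
```

In this model replay_restart_race() returns true: after step 5-1 the slot is
empty while the second request is still incomplete, so nothing is left that
could complete it.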
As I mentioned above, this cannot happen in the existing code, since there is
no code that both restarts a terminated DAMON context and calls damos_walk().
In the future, there might be such use cases, or such a call sequence made by
mistake, though. I will work on improving this. But, as I mentioned before, it
is not a blocker for this patch.
Thanks,
SJ
[...]