From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2CD5430E0F2; Sun, 22 Mar 2026 20:51:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774212700; cv=none; b=AEJAbgEtb4ponIOHbi368QQqGTMld5ZqlKmUNl9D073uzXKZIK/bk8xjHZ6CtnOKXcaSA5BC58cr0fKs90Fiki0sSdt2auAo8B4X7sy33OZL4ij4a9u9ZqVv3Mit82QGY48qc1Ev5YbLqd3v+Spx1mLtfNwkkZk2jLJvHXcaVrQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774212700; c=relaxed/simple; bh=v+JZu3qXNVA/S1bG8WB9r8Oz6NwO0tcGbFSffeOQTzY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=ooej7vW5NRJQmYghxVBQUmiOjEmMIe5hPmJRCmC5NLbaxs5qXdwyfYtnFVtNhwuvP/LUA7CSvR2Km9vUzNiqJGhLHrcWCQZgwU1oil0H+bWOT8pQ/n+KH9DTrPGn0J5QX5i31m1BCa2xgb5fbFFFu/uVIEDNbkceKtvECZoKvhk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=bw63tztB; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="bw63tztB" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 93881C19424; Sun, 22 Mar 2026 20:51:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1774212699; bh=v+JZu3qXNVA/S1bG8WB9r8Oz6NwO0tcGbFSffeOQTzY=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=bw63tztB5mqokQeZ2TQ6TDgQ4i0zunDVLTtd5npDG9dOE6JkAm8NmvH5a3DbIAQtj SzA5CdTcLOxvx+90tqeYn6Kfhak7EOQB7EUPapY2FUo3mtHsnWMsOQrg9xp1Rln9Xf s4NI7tvE8ho66tmEeWrR9Y9/jq2gWDAxyhp+bVH9mROWrdOmUqTRWfZeeM+/yXCN8f ic55MSQlH6BTWwh+UsyYxREouYUYT+6oZEAxURvIWoRxBDZID8R3sv4ezMpAoe4A6i CF/Rp3FLokntfAAFAAr5BHFPN5BCJwOFdBKigTQ6vyb72QNjgu6jq4gEQHRZbnUjbQ 3qfs6ZkevBUrQ== From: SeongJae Park To: SeongJae Park Cc: Andrew Morton , damon@lists.linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: (sashiko review) [RFC PATCH v4 01/10] mm/damon/core: introduce damon_ctx->paused Date: Sun, 22 Mar 2026 13:51:24 -0700 Message-ID: <20260322205125.88701-1-sj@kernel.org> X-Mailer: git-send-email 2.47.3 In-Reply-To: <20260322174018.83729-1-sj@kernel.org> References: Precedence: bulk X-Mailing-List: damon@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On Sun, 22 Mar 2026 10:40:16 -0700 SeongJae Park wrote: > On Sun, 22 Mar 2026 10:06:59 -0700 SeongJae Park wrote: > > > Forwarding Sashiko review for doing discussions via mails. > > > > # review url: https://sashiko.dev/#/patchset/20260322155728.81434-2-sj@kernel.org > > # start of sashiko.dev inline review > > commit 2a3be2c8b6c327ef72163cd54e6121ef72b9c932 > > Author: SeongJae Park > > > > mm/damon/core: introduce damon_ctx->paused > > > > DAMON supports only start and stop of the execution. When it is > > stopped, its internal data that it self-trained goes away. It will be > > useful if the execution can be paused and resumed with the previous > > self-trained data. > > > > > diff --git a/mm/damon/core.c b/mm/damon/core.c > > > index db6c67e52d2b..0ab2cfa848e6 100644 > > > --- a/mm/damon/core.c > > > +++ b/mm/damon/core.c > > > > [ ... ] > > > > > @@ -3003,6 +3004,14 @@ static int kdamond_fn(void *data) > > > kdamond_call(ctx, false); > > > if (ctx->maybe_corrupted) > > > break; > > > + while (ctx->pause) { > > > + damos_walk_cancel(ctx); > > > > Is it safe to repeatedly call damos_walk_cancel() here? > > > > In damos_walk(), the walk_control_lock is dropped after setting > > ctx->walk_control to check damon_is_running(), which opens a race window. > > > > Concurrently, kdamond_fn() calls damos_walk_cancel(), which reads > > ctx->walk_control, drops the lock, and completes the old request. > > > > If a new caller invokes damos_walk() and successfully registers a new > > walk_control, could damos_walk_cancel() resume, reacquire the lock, > > and unconditionally set ctx->walk_control = NULL, dropping the new > > request? > > > > This might leave the new caller permanently deadlocked in > > wait_for_completion(). > > Nice catch. Orthogonal to this patch, though. > > The deadlock could happen in below sequence. > > CPU0 │CPU1 > ──────────────────────────────┼──────────────────────── > damos_walk() │ > │register request │ > │wait completion │damos_walk_cancel() > │ │ │complete the request > ▼wakeup,return │ │ > damos_walk() │ │ > │register new request │ │ > │ │ │remove the new request > │wait completion │ ▼return > ▼ nobody completes it. │ > > Nonetheless, kdamond_fn() is calling damos_walk() already in several places > including this loop. This issue hence exists regardless of this patch. I will > work on fixing this as a separate hotfix. Below fix may work. TL; DR: there is no deadlock in existing code. I will work on more clean code or documentation, though. The scenario that I illustrated above cannot happen, because the second damos_walk() cannot register its new request before the old request is unset. The request is unset in three places. damos_walk_complete(), damos_walk_cancel(), and damos_walk(). damos_walk_complete() and damos_walk_cancel() are called from same kdamond thread, so no race between them exists. damos_walk() unsets the request, only if !damon_is_running(). damos_walk() seeing !damon_is_running() means the kdamond is stopped. It again means there can be no concurrent damos_walk_cancel() or damos_walk_complete() that works for same context and started before the damon_is_running() call. Unless the same context is restarted, hence, there is no chance to race. Only DAMON_SYSFS calls damos_walk() and it doesn't restart same context. DAMON_RECLAIM and DAMON_LRU_SORT do restart same context, but they don't use damos_walk(). So, there is no deadlock in the existing code (or, no such deadlock is found so far). Let's assume there could be damos_walk() call with parallel restart of a DAMON context, though. In the case, below deadlock is available. Seems this is what Sashiko was trying to say. 0. A DAMON context is stopped. 1-1. CPU0: calls damos_walk() for the stopped context. 1-2. CPU0: damos_walk(): register a new damos_walk() request to the stopped context. 1-3. CPU0: damos_walk(): shows !damon_is_running(). 2. CPU1: Re-start the DAMON context. 3-1. CPU2: Execute kdamond_fn() -> damos_walk_cancel() 3-2. CPU2: damos_walk_cancel(): complete the walk request that registered on step 1-2. 4-1. CPU0: damos_walk(): unset the request. 4-2: CPU0: calls damos_walk() again. 4-3: CPU0: damos_walk() 2: register a new damos_walk() request. 4-4: CPU0: damos_walk() 2: wait for the completion. 5-1. CPU2: damos_walk_cancel(): unset the walk request that registered on step 4-3. Nobody can complete the request that registered on step 4-3. CPU0 infinitely wait. In more graphiscal way, this can be illustrated as below: CPU0 │CPU1 │CPU2 ───────────────────────────────┼─────────────────┼──────────────────────────────────────── damos_walk() │ │ │register reqeust │ │ │show !damon_is_running(ctx)│ │ │ │ │ │ │damon_start(ctx) │ │ │ │damos_walk_cancel() │ │ │ complete first damos_walk() request │ │ │ │unset request │ │ ▼return │ │ │ │ damos_walk() │ │ │register request │ │ │wait completion │ │ unset second request ▼ │ │ As I mentioned abovely, this cannot happen on existing code, since there is no code that restarts a terminated DAMON context, and calls damos_walk(). In the future, there might be such use cases or mistakenly made call sequence, though. I will work on improving this. But, as I mentioned before, it is not a blocker for this patch. Thanks, SJ [...]