[BUG] 'damo stop' causes kernel crash in v6.17-rc3


* [BUG] 'damo stop' causes kernel crash in v6.17-rc3
@ 2025-09-04  1:17 Yunjeong Mun
  2025-09-04  4:02 ` SeongJae Park
  0 siblings, 1 reply; 8+ messages in thread
From: Yunjeong Mun @ 2025-09-04  1:17 UTC
  To: damon; +Cc: sj, honggyu.kim, kernel_team

Hi!

I encountered a kernel crash when running 'damo stop' on kernel v6.17-rc3.
I tested and confirmed that this issue also occurs on v6.17-rc1.

The 'damo' versions that I tested are v2.9.3 and v2.4.7.

The crash happens when DAMON is configured to use both the 'migrate_hot'
and 'migrate_cold' actions. If DAMON is started with only one of the two
actions, it works fine. Below is the command I used:

```shell
$ ./damo start \
 --ops paddr --numa_node 0 --monitoring_intervals 100ms 2s 20s --damos_action migrate_cold 1 \
 --ops paddr --numa_node 1 --monitoring_intervals 100ms 2s 20s --damos_action migrate_hot 0 \
 --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1

$ ps aux | grep kdamond
root      1193 98.2  0.0      0     0 ?        R    07:58   0:18 [kdamond.0]
root      1194 11.2  0.0      0     0 ?        R    07:58   0:02 [kdamond.1]

# Error occurs
$ ./damo stop 
```

This issue also occurs when starting DAMON using a YAML configuration file
that includes both the 'migrate_hot' and 'migrate_cold' actions.
Below is the dmesg log at the time of the issue:

```
[157729.130361] Call Trace:
[157729.130540]  <TASK>
[157729.130718]  kthread_stop+0x158/0x190
[157729.130904]  kthread_stop_put+0x18/0x80
[157729.131084]  damon_stop+0x4c/0xd0
[157729.131264]  state_store+0x190/0x380
[157729.131445]  ? __x64_sys_ioctl+0x7e/0xf0
[157729.131625]  kobj_attr_store+0x13/0x30
[157729.131805]  sysfs_kf_write+0x73/0x90
[157729.131986]  kernfs_fop_write_iter+0x13a/0x1c0
[157729.132167]  vfs_write+0x304/0x420
[157729.132349]  ksys_write+0x6d/0xe0
[157729.132525]  __x64_sys_write+0x1d/0x30
[157729.132696]  x64_sys_call+0x16ec/0x2180
[157729.132865]  do_syscall_64+0x74/0x1d0
[157729.133030]  ? __x64_sys_ioctl+0x7e/0xf0
[157729.133195]  ? x64_sys_call+0x1268/0x2180
[157729.133364]  ? do_syscall_64+0xa3/0x1d0
[157729.133534]  ? do_syscall_64+0xa3/0x1d0
[157729.133698]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[157729.133861] RIP: 0033:0x7fafb5d14887
[157729.134023] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[157729.134363] RSP: 002b:00007ffe44347cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[157729.134535] RAX: ffffffffffffffda RBX: 00005612b1e91b20 RCX: 00007fafb5d14887
[157729.134705] RDX: 0000000000000003 RSI: 00005612b9a3ae70 RDI: 0000000000000003
[157729.134870] RBP: 00005612b98998e0 R08: 0000000000000000 R09: 0000000000000000
[157729.135031] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[157729.135189] R13: 00007fafb5f9bf80 R14: 0000000000000003 R15: 00005612b9a3ae70
[157729.135344]  </TASK>
[157729.135491] ---[ end trace 0000000000000000 ]---
[157729.135639] BUG: kernel NULL pointer dereference, address: 0000000000000000
[157729.135783] #PF: supervisor write access in kernel mode
[157729.135923] #PF: error_code(0x0002) - not-present page
[157729.136059] PGD 0 P4D 0
[157729.136192] Oops: Oops: 0002 [#32] SMP NOPTI
[157729.136323] CPU: 65 UID: 0 PID: 655661 Comm: python3 Tainted: G S    D W           6.17.0-rc3 #23 PREEMPT(voluntary)
[157729.136457] Tainted: [S]=CPU_OUT_OF_SPEC, [D]=DIE, [W]=WARN
[157729.136587] Hardware name: Supermicro SSG-222B-NE3X24R/X14DBHM, BIOS 1.0a 12/15/2024
[157729.136718] RIP: 0010:kthread_stop+0x4c/0x190
[157729.136849] Code: 00 f0 0f c1 43 28 85 c0 0f 84 4b 01 00 00 8d 50 01 09 c2 0f 88 10 01 00 00 f6 43 2e 20 0f 84 1d 01 00 00 4c 8b a3 a8 0a 00 00 <f0> 41 80 0c 24 02 48 89 df e8 16 f4 ff ff f0 80 4b 02 02 48 89 df
[157729.137125] RSP: 0018:ffffbd40c65dfc08 EFLAGS: 00010202
[157729.137265] RAX: 0000000000000000 RBX: ffff963a641699c0 RCX: 0000000000000027
[157729.137404] RDX: ffff96b4bdc5cec8 RSI: 0000000000000001 RDI: ffff96b4bdc5cec0
[157729.137543] RBP: ffffbd40c65dfc20 R08: 0000000000000003 R09: 0000000000000000
[157729.137682] R10: 0000000000000002 R11: 0000000000000003 R12: 0000000000000000
[157729.137818] R13: ffff963a641699e8 R14: ffff9637a42574d0 R15: ffff963a7f8910f8
[157729.137956] FS:  00007fafb5f9c000(0000) GS:ffff96b5018fd000(0000) knlGS:0000000000000000
[157729.138098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[157729.138240] CR2: 0000000000000000 CR3: 00000003a550c004 CR4: 0000000000f72ef0
[157729.138387] PKRU: 55555554
[157729.138531] Call Trace:
[157729.138675]  <TASK>
[157729.138818]  kthread_stop_put+0x18/0x80
[157729.138964]  damon_stop+0x4c/0xd0
[157729.139109]  state_store+0x190/0x380
[157729.139254]  ? __x64_sys_ioctl+0x7e/0xf0
[157729.139398]  kobj_attr_store+0x13/0x30
[157729.139543]  sysfs_kf_write+0x73/0x90
[157729.139688]  kernfs_fop_write_iter+0x13a/0x1c0
[157729.139833]  vfs_write+0x304/0x420
[157729.139980]  ksys_write+0x6d/0xe0
[157729.140125]  __x64_sys_write+0x1d/0x30
[157729.140271]  x64_sys_call+0x16ec/0x2180
[157729.140416]  do_syscall_64+0x74/0x1d0
[157729.140562]  ? __x64_sys_ioctl+0x7e/0xf0
[157729.140707]  ? x64_sys_call+0x1268/0x2180
[157729.140853]  ? do_syscall_64+0xa3/0x1d0
[157729.140993]  ? do_syscall_64+0xa3/0x1d0
[157729.141128]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[157729.141262] RIP: 0033:0x7fafb5d14887
```


* Re: [BUG] 'damo stop' causes kernel crash in v6.17-rc3
  2025-09-04  1:17 [BUG] 'damo stop' causes kernel crash in v6.17-rc3 Yunjeong Mun
@ 2025-09-04  4:02 ` SeongJae Park
  2025-09-04  8:29   ` Yunjeong Mun
  0 siblings, 1 reply; 8+ messages in thread
From: SeongJae Park @ 2025-09-04  4:02 UTC
  To: Yunjeong Mun; +Cc: SeongJae Park, damon, honggyu.kim, kernel_team

Hi Yunjeong,

On Thu,  4 Sep 2025 10:17:38 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

> Hi!
> 
> I encountered a kernel crash when running 'damo stop' in kernel v6.17-rc3, 
> I tested and confirmed that this issue also occurs in v6.17-rc1.

Thank you for finding and sharing this issue!

> 
> 'damo' version that I tested is v2.9.3 and v2.4.7.
> 
> The crash happens when DAMON is configured to used both 'migrate_hot' 
> and migrate_cold' actions. I tested that if DAMON is started with only 
> one of the two actions, it works fine.

I understand you mean the problem is reproducible when you use two kdamond
threads, and you confirmed it doesn't happen when you use a single kdamond
thread.  Please let me know if I'm misunderstanding.

> Below is the command I used:
> 
> ```shell
> $ ./damo start \
>  --ops paddr --numa_node 0 --monitoring_intervals 100ms 2s 20s --damos_action migrate_cold 1 \
>  --ops paddr --numa_node 1 --monitoring_intervals 100ms 2s 20s --damos_action migrate_hot 0 \
>  --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
> 
> $ ps aux | grep kdamond
> root      1193 98.2  0.0      0     0 ?        R    07:58   0:18 [kdamond.0]
> root      1194 11.2  0.0      0     0 ?        R    07:58   0:02 [kdamond.1]
> 
> # Error occurs
> $ ./damo stop 
> ```

Thank you for sharing these detailed steps.  On my setup, this doesn't cause a
crash but makes 'damo' hang.

I found that commit d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file
internal work") is the first bad commit, according to 'git bisect'.  And
indeed the code is broken for the multiple-kdamonds case, since it shares one
damon_call_control object among multiple kdamonds, overwriting the data field
with that of the kdamond that called last.
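
To illustrate the sharing problem, here is a minimal userspace sketch.  It is
not the actual mm/damon/sysfs.c code: 'struct kdamond_stub', 'struct
call_control_stub', 'turn_on_shared()' and 'both_see_last_kdamond()' are
hypothetical names made up for this example.  The only assumption is that each
start request stores its own kdamond pointer into one static object.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct kdamond_stub {
	const char *name;
};

struct call_control_stub {
	void *data;
};

/* One static object, shared by every kdamond that gets started. */
static struct call_control_stub shared_control;

/* Each start request overwrites .data and hands out the same object. */
static struct call_control_stub *turn_on_shared(struct kdamond_stub *k)
{
	shared_control.data = k;
	return &shared_control;
}

/* Returns 1 when the first kdamond's control points at the second one. */
static int both_see_last_kdamond(void)
{
	struct kdamond_stub k0 = { "kdamond.0" }, k1 = { "kdamond.1" };
	struct call_control_stub *c0 = turn_on_shared(&k0);
	struct call_control_stub *c1 = turn_on_shared(&k1);

	assert(c0 == c1);		/* same object for both kdamonds */
	return c0->data == &k1;		/* k0's pointer has been lost */
}
```

In this sketch 'both_see_last_kdamond()' returns 1: the callback for the first
kdamond would run with the second kdamond's data.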

I haven't yet dug into which code path triggers the issue, but I'm sharing
this first, since I have to go out soon.  I'll take a further look later.
Meanwhile, could you please also confirm whether it is the first bad commit for
your issue, too?

> 
> This issue also occurs when starting DAMON using yaml configuration file 
> that includes both the 'migrate_hot' and 'migrate_cold' actions.
> Below is the dmesg log at the time of the issue:
> 
> ```
> [157729.130361] Call Trace:
> [157729.130540]  <TASK>
> [157729.130718]  kthread_stop+0x158/0x190
> [157729.130904]  kthread_stop_put+0x18/0x80
> [157729.131084]  damon_stop+0x4c/0xd0
> [157729.131264]  state_store+0x190/0x380
> [157729.131445]  ? __x64_sys_ioctl+0x7e/0xf0
> [157729.131625]  kobj_attr_store+0x13/0x30
> [157729.131805]  sysfs_kf_write+0x73/0x90
> [157729.131986]  kernfs_fop_write_iter+0x13a/0x1c0
> [157729.132167]  vfs_write+0x304/0x420
> [157729.132349]  ksys_write+0x6d/0xe0
> [157729.132525]  __x64_sys_write+0x1d/0x30
> [157729.132696]  x64_sys_call+0x16ec/0x2180
> [157729.132865]  do_syscall_64+0x74/0x1d0
> [157729.133030]  ? __x64_sys_ioctl+0x7e/0xf0
> [157729.133195]  ? x64_sys_call+0x1268/0x2180
> [157729.133364]  ? do_syscall_64+0xa3/0x1d0
> [157729.133534]  ? do_syscall_64+0xa3/0x1d0
> [157729.133698]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

'scripts/decode_stacktrace.sh' can show which line of which source file each of
the above lines points to.  So if you could share the output of the script in
your future bug reports, it would be pretty helpful.

So, I'll take a further look, but please let me know if the first broken commit
I found is also the first broken commit for your issue.

Thanks,
SJ

[...]


* Re: [BUG] 'damo stop' causes kernel crash in v6.17-rc3
  2025-09-04  4:02 ` SeongJae Park
@ 2025-09-04  8:29   ` Yunjeong Mun
  2025-09-05  3:54     ` SeongJae Park
  0 siblings, 1 reply; 8+ messages in thread
From: Yunjeong Mun @ 2025-09-04  8:29 UTC
  To: sj; +Cc: damon, honggyu.kim, kernel_team

On Wed,  3 Sep 2025 21:02:03 -0700 SeongJae Park <sj@kernel.org> wrote:
> Hi Yunjeong,
> 
> On Thu,  4 Sep 2025 10:17:38 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> 
> > Hi!
> > 
> > I encountered a kernel crash when running 'damo stop' in kernel v6.17-rc3, 
> > I tested and confirmed that this issue also occurs in v6.17-rc1.
> 
> Thank you for finding and sharing this issue!
> 
> > 
> > 'damo' version that I tested is v2.9.3 and v2.4.7.
> > 
> > The crash happens when DAMON is configured to used both 'migrate_hot' 
> > and migrate_cold' actions. I tested that if DAMON is started with only 
> > one of the two actions, it works fine.
> 
> I understand you mean the problem is reproducible when you use two kdamond
> threads, and you confirmed it doesn't happen when you use single kdamond
> thread.  Please let me know if I'm misunderstanding.
>

Your understanding is correct!

> > Below is the command I used:
> > 
> > ```shell
> > $ ./damo start \
> >  --ops paddr --numa_node 0 --monitoring_intervals 100ms 2s 20s --damos_action migrate_cold 1 \
> >  --ops paddr --numa_node 1 --monitoring_intervals 100ms 2s 20s --damos_action migrate_hot 0 \
> >  --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
> > 
> > $ ps aux | grep kdamond
> > root      1193 98.2  0.0      0     0 ?        R    07:58   0:18 [kdamond.0]
> > root      1194 11.2  0.0      0     0 ?        R    07:58   0:02 [kdamond.1]
> > 
> > # Error occurs
> > $ ./damo stop 
> > ```
> 
> Thank you for sharingthis detailed steps.  On my setup, this doesn't cause
> crash but make 'damo' hang.

I have tested this issue on both a physical server and a QEMU guest.
On the physical server, both a kernel crash and a 'damo' hang occur.
In the QEMU environment, only the 'damo' hang occurs, without a kernel crash.

> 
> I found commit d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file
> internal work") is the first bad commit, according to 'git bisect'.  And
> actually the code is broken for multiple kdamonds case, since it is sharing one
> damon_call_control object for multiple kdamonds while overwriting the data
> field to later-called one.
>
> I haven't yet deep dive into by what code path the issue happens, but sharing
> this first, since I have to go out soon.  I'll further take a look later.
> Meanwhile, could you please also confirm if it is the first bad commit for your
> issue, too?

Thanks for sharing your analysis!
I also confirmed that the first bad commit is d809a7c64ba8.
The previous commit b907494768e5 doesn't cause the above issue.

> 
> > 
> > This issue also occurs when starting DAMON using yaml configuration file 
> > that includes both the 'migrate_hot' and 'migrate_cold' actions.
> > Below is the dmesg log at the time of the issue:
> > 
> > ```
> > [157729.130361] Call Trace:
> > [157729.130540]  <TASK>
> > [157729.130718]  kthread_stop+0x158/0x190
> > [157729.130904]  kthread_stop_put+0x18/0x80
> > [157729.131084]  damon_stop+0x4c/0xd0
> > [157729.131264]  state_store+0x190/0x380
> > [157729.131445]  ? __x64_sys_ioctl+0x7e/0xf0
> > [157729.131625]  kobj_attr_store+0x13/0x30
> > [157729.131805]  sysfs_kf_write+0x73/0x90
> > [157729.131986]  kernfs_fop_write_iter+0x13a/0x1c0
> > [157729.132167]  vfs_write+0x304/0x420
> > [157729.132349]  ksys_write+0x6d/0xe0
> > [157729.132525]  __x64_sys_write+0x1d/0x30
> > [157729.132696]  x64_sys_call+0x16ec/0x2180
> > [157729.132865]  do_syscall_64+0x74/0x1d0
> > [157729.133030]  ? __x64_sys_ioctl+0x7e/0xf0
> > [157729.133195]  ? x64_sys_call+0x1268/0x2180
> > [157729.133364]  ? do_syscall_64+0xa3/0x1d0
> > [157729.133534]  ? do_syscall_64+0xa3/0x1d0
> > [157729.133698]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> 'scripts/decode_stacktrace.sh' can show which line of what source file each of
> the above line points.  So if you could share the output of the script from
> your next bug reports, it would be pretty helpful.

I wasn't aware of this tool, thank you for letting me know!
I'll try using it next time.

Thanks,
Yunjeong Mun

> 
> So, I'll take further look, but please let me know if the first broken commit I
> found is also the first broken commit for your issue.
> 
> 
> Thanks,
> SJ
> 
> [...]


* Re: [BUG] 'damo stop' causes kernel crash in v6.17-rc3
  2025-09-04  8:29   ` Yunjeong Mun
@ 2025-09-05  3:54     ` SeongJae Park
  2025-09-05  9:08       ` Yunjeong Mun
  0 siblings, 1 reply; 8+ messages in thread
From: SeongJae Park @ 2025-09-05  3:54 UTC
  To: Yunjeong Mun; +Cc: SeongJae Park, damon, honggyu.kim, kernel_team

On Thu,  4 Sep 2025 17:29:45 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

> On Wed,  3 Sep 2025 21:02:03 -0700 SeongJae Park <sj@kernel.org> wrote:
> > Hi Yunjeong,
> > 
> > On Thu,  4 Sep 2025 10:17:38 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > 
> > > Hi!
> > > 
> > > I encountered a kernel crash when running 'damo stop' in kernel v6.17-rc3, 
> > > I tested and confirmed that this issue also occurs in v6.17-rc1.
[...]
> > I found commit d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file
> > internal work") is the first bad commit, according to 'git bisect'.  And
> > actually the code is broken for multiple kdamonds case, since it is sharing one
> > damon_call_control object for multiple kdamonds while overwriting the data
> > field to later-called one.

More problematically, the damon_call_control->list is continuously overwritten
by the multiple kdamond threads.  This corrupts the damon_call_control lists of
the contexts, and as a result kdamond_call() infinitely loops.  Hence
kdamond_fn() cannot catch the termination request and the hang happens.
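
The effect of linking one list node into two lists can be reproduced in
userspace.  The following is a simplified model, not the kernel code path:
'struct list_head' mimics the kernel's circular doubly linked list, and
'list_init()', 'list_add_tail_sketch()', 'bounded_walk()',
'shared_node_walk()' and 'separate_nodes_walk()' are hypothetical helpers
written for this example.  It assumes each context appends the control's node
to the tail of its own list.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list node, modeled after the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link @node in front of @head, i.e. at the tail of the list. */
static void list_add_tail_sketch(struct list_head *node,
				 struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Walk the list and count its entries.  Returns -1 if the walk does not
 * come back to @head within @limit steps, i.e. the list is corrupted.
 */
static int bounded_walk(const struct list_head *head, int limit)
{
	const struct list_head *pos = head->next;
	int nr = 0;

	while (pos != head) {
		if (++nr > limit)
			return -1;
		pos = pos->next;
	}
	return nr;
}

/* Broken case: one node is added to the lists of two contexts. */
static int shared_node_walk(void)
{
	struct list_head ctx0, ctx1, shared;

	list_init(&ctx0);
	list_init(&ctx1);
	list_add_tail_sketch(&shared, &ctx0);
	/* The second add rewrites shared.next/prev to point at ctx1. */
	list_add_tail_sketch(&shared, &ctx1);
	return bounded_walk(&ctx0, 16);
}

/* Fixed case: every context gets its own node. */
static int separate_nodes_walk(void)
{
	struct list_head ctx0, ctx1, node0, node1;

	list_init(&ctx0);
	list_init(&ctx1);
	list_add_tail_sketch(&node0, &ctx0);
	list_add_tail_sketch(&node1, &ctx1);
	return bounded_walk(&ctx0, 16);
}
```

Here 'shared_node_walk()' returns -1, because a walk starting at the first
context's list head cycles between the shared node and the second head and
never returns.  'separate_nodes_walk()' returns 1.  An unbounded walk over the
corrupted list would spin forever, which is the kind of loop that can keep a
thread from noticing a stop request.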

> >
> > I haven't yet deep dive into by what code path the issue happens, but sharing
> > this first, since I have to go out soon.  I'll further take a look later.
> > Meanwhile, could you please also confirm if it is the first bad commit for your
> > issue, too?
> 
> Thanks for sharing your analysis!
> I also confirmed that first bad commit is d809a7c64ba8.
> The previous commit b907494768e5 doesn't cause the above issue.

Thank you for confirming.  I confirmed that the attached patch fixes the
problem with your repro on my setup.  Could you please also test it on your
machines and confirm whether it fixes the issue on your setups, too?  If you
confirm, I will post it soon.

[...]
> > 'scripts/decode_stacktrace.sh' can show which line of what source file each of
> > the above line points.  So if you could share the output of the script from
> > your next bug reports, it would be pretty helpful.
> 
> I'm not aware of this tool, thank you for letting me know!
> I'll trying using it next time.

No worries, and let me know if you need any help with that.

And please don't delay or hesitate to report new issues in the future just to
learn a tool first.  I'd much prefer getting early and incomplete issue reports
over late and complete ones. :)


Thanks,
SJ

[...]

=== >8 ===
From 6754cdb95c03313fd7d9f104b9dbe851ecef237e Mon Sep 17 00:00:00 2001
From: SeongJae Park <sj@kernel.org>
Date: Thu, 4 Sep 2025 20:18:46 -0700
Subject: [PATCH] mm/damon/sysfs: use dynamically allocated repeat mode
 damon_call_control

For testing.

Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work") # v6.17.x
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/linux/damon.h |  2 ++
 mm/damon/core.c       |  8 ++++++--
 mm/damon/sysfs.c      | 23 +++++++++++++++--------
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index ec8716292c09..aa7381be388c 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -636,6 +636,7 @@ struct damon_operations {
  * @data:		Data that will be passed to @fn.
  * @repeat:		Repeat invocations.
  * @return_code:	Return code from @fn invocation.
+ * @dealloc_on_cancel:	De-allocate when canceled.
  *
  * Control damon_call(), which requests specific kdamond to invoke a given
  * function.  Refer to damon_call() for more details.
@@ -645,6 +646,7 @@ struct damon_call_control {
 	void *data;
 	bool repeat;
 	int return_code;
+	bool dealloc_on_cancel;
 /* private: internal use only */
 	/* informs if the kdamond finished handling of the request */
 	struct completion completion;
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 7aeb3f24aae8..be5942435d78 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -2510,10 +2510,14 @@ static void kdamond_call(struct damon_ctx *ctx, bool cancel)
 		mutex_lock(&ctx->call_controls_lock);
 		list_del(&control->list);
 		mutex_unlock(&ctx->call_controls_lock);
-		if (!control->repeat)
+		if (!control->repeat) {
 			complete(&control->completion);
-		else
+		} else if (control->canceled && control->dealloc_on_cancel) {
+			kfree(control);
+			continue;
+		} else {
 			list_add(&control->list, &repeat_controls);
+		}
 	}
 	control = list_first_entry_or_null(&repeat_controls,
 			struct damon_call_control, list);
diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
index 0ed404c89f80..a182670493bb 100644
--- a/mm/damon/sysfs.c
+++ b/mm/damon/sysfs.c
@@ -1565,14 +1565,10 @@ static int damon_sysfs_repeat_call_fn(void *data)
 	return 0;
 }
 
-static struct damon_call_control damon_sysfs_repeat_call_control = {
-	.fn = damon_sysfs_repeat_call_fn,
-	.repeat = true,
-};
-
 static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond)
 {
 	struct damon_ctx *ctx;
+	struct damon_call_control *repeat_call_control;
 	int err;
 
 	if (damon_sysfs_kdamond_running(kdamond))
@@ -1585,18 +1581,29 @@ static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond)
 		damon_destroy_ctx(kdamond->damon_ctx);
 	kdamond->damon_ctx = NULL;
 
+	repeat_call_control = kmalloc(sizeof(*repeat_call_control),
+			GFP_KERNEL);
+	if (!repeat_call_control)
+		return -ENOMEM;
+
 	ctx = damon_sysfs_build_ctx(kdamond->contexts->contexts_arr[0]);
-	if (IS_ERR(ctx))
+	if (IS_ERR(ctx)) {
+		kfree(repeat_call_control);
 		return PTR_ERR(ctx);
+	}
 	err = damon_start(&ctx, 1, false);
 	if (err) {
+		kfree(repeat_call_control);
 		damon_destroy_ctx(ctx);
 		return err;
 	}
 	kdamond->damon_ctx = ctx;
 
-	damon_sysfs_repeat_call_control.data = kdamond;
-	damon_call(ctx, &damon_sysfs_repeat_call_control);
+	repeat_call_control->fn = damon_sysfs_repeat_call_fn;
+	repeat_call_control->data = kdamond;
+	repeat_call_control->repeat = true;
+	repeat_call_control->dealloc_on_cancel = true;
+	damon_call(ctx, repeat_call_control);
 	return err;
 }
 
-- 
2.39.5
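
The ownership rule that the patch introduces can be modeled in userspace as
below.  This is only a sketch under stated assumptions, not the kernel
implementation: 'struct repeat_control', 'new_repeat_control()' and
'handle_control()' are hypothetical names, 'calloc()'/'free()' stand in for
'kmalloc()'/'kfree()', and setting 'canceled' inside the handler is a
simplification for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, reduced model of a repeat-mode call control. */
struct repeat_control {
	void *data;
	bool repeat;
	bool canceled;
	bool dealloc_on_cancel;
};

/* Allocate one control per caller instead of sharing a static object. */
static struct repeat_control *new_repeat_control(void *data)
{
	struct repeat_control *c = calloc(1, sizeof(*c));

	if (!c)
		return NULL;
	c->data = data;
	c->repeat = true;
	c->dealloc_on_cancel = true;
	return c;
}

/*
 * Model of the consumer side.  Returns true if the control was freed and
 * must not be touched again, false if it stays queued for the next round.
 */
static bool handle_control(struct repeat_control *c, bool cancel)
{
	c->canceled = cancel;
	if (c->repeat && c->canceled && c->dealloc_on_cancel) {
		free(c);
		return true;
	}
	return false;
}
```

With this layout each caller owns a distinct object until it hands it over,
and the consumer frees it exactly once, when the request is canceled.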



* Re: [BUG] 'damo stop' causes kernel crash in v6.17-rc3
  2025-09-05  3:54     ` SeongJae Park
@ 2025-09-05  9:08       ` Yunjeong Mun
  2025-09-05 20:07         ` SeongJae Park
  0 siblings, 1 reply; 8+ messages in thread
From: Yunjeong Mun @ 2025-09-05  9:08 UTC
  To: sj; +Cc: damon, honggyu.kim, kernel_team

On Thu,  4 Sep 2025 20:54:11 -0700 SeongJae Park <sj@kernel.org> wrote:
> On Thu,  4 Sep 2025 17:29:45 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> 
> > On Wed,  3 Sep 2025 21:02:03 -0700 SeongJae Park <sj@kernel.org> wrote:
> > > Hi Yunjeong,
> > > 
> > > On Thu,  4 Sep 2025 10:17:38 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > > 
> > > > Hi!
> > > > 
> > > > I encountered a kernel crash when running 'damo stop' in kernel v6.17-rc3, 
> > > > I tested and confirmed that this issue also occurs in v6.17-rc1.
> [...]
> > > I found commit d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file
> > > internal work") is the first bad commit, according to 'git bisect'.  And
> > > actually the code is broken for multiple kdamonds case, since it is sharing one
> > > damon_call_control object for multiple kdamonds while overwriting the data
> > > field to later-called one.
> 
> More problematically, the damon_call_control->list is continuously overwritten
> by the multiple kdamond threads.  This corrupts the damon_call_control lists of
> the contexts, and as a result kdamond_call() infinitely loops.  Hence
> kdamond_fn() cannot catch the termination request and the hang happens.
> 
> > >
> > > I haven't yet deep dive into by what code path the issue happens, but sharing
> > > this first, since I have to go out soon.  I'll further take a look later.
> > > Meanwhile, could you please also confirm if it is the first bad commit for your
> > > issue, too?
> > 
> > Thanks for sharing your analysis!
> > I also confirmed that first bad commit is d809a7c64ba8.
> > The previous commit b907494768e5 doesn't cause the above issue.
> 
> Thank you for confirming.  I confirmed attaching patch fixes the problem with
> your repro on my setup.  Could you please also test that on your machines and
> confirm if it fixes the issues on your setups, too?  If you confirm, I will
> post it soon.

Thank you for the patch. I have confirmed that it fixes the problem on QEMU.
I will test it on my physical machine early next week and share the result
with you.

> 
> [...]
> > > 'scripts/decode_stacktrace.sh' can show which line of what source file each of
> > > the above line points.  So if you could share the output of the script from
> > > your next bug reports, it would be pretty helpful.
> > 
> > I'm not aware of this tool, thank you for letting me know!
> > I'll trying using it next time.
> 
> No worry, and let me know if you need any help for that.
> 
> And please don't delay or hesitate reporting new issues in future for learning
> of a tool.  I'd prefer getting early and incomplete issue reports much more
> than late and complete reports. :)
> 

Thank you for the advice. I'll report any issues right away. :)

> 
> Thanks,
> SJ
> 
> [...]
> 
> === >8 ===
> From 6754cdb95c03313fd7d9f104b9dbe851ecef237e Mon Sep 17 00:00:00 2001
> From: SeongJae Park <sj@kernel.org>
> Date: Thu, 4 Sep 2025 20:18:46 -0700
> Subject: [PATCH] mm/damon/sysfs: use dynamically allocated repeat mode
>  damon_call_control
> 
> For testing.
> 
> Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work") # v6.17.x
> Signed-off-by: SeongJae Park <sj@kernel.org>
> ---
>  include/linux/damon.h |  2 ++
>  mm/damon/core.c       |  8 ++++++--
>  mm/damon/sysfs.c      | 23 +++++++++++++++--------
>  3 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index ec8716292c09..aa7381be388c 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -636,6 +636,7 @@ struct damon_operations {
>   * @data:		Data that will be passed to @fn.
>   * @repeat:		Repeat invocations.
>   * @return_code:	Return code from @fn invocation.
> + * @dealloc_on_cancel:	De-allocate when canceled.
>   *
>   * Control damon_call(), which requests specific kdamond to invoke a given
>   * function.  Refer to damon_call() for more details.
> @@ -645,6 +646,7 @@ struct damon_call_control {
>  	void *data;
>  	bool repeat;
>  	int return_code;
> +	bool dealloc_on_cancel;
>  /* private: internal use only */
>  	/* informs if the kdamond finished handling of the request */
>  	struct completion completion;
> diff --git a/mm/damon/core.c b/mm/damon/core.c
> index 7aeb3f24aae8..be5942435d78 100644
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
> @@ -2510,10 +2510,14 @@ static void kdamond_call(struct damon_ctx *ctx, bool cancel)
>  		mutex_lock(&ctx->call_controls_lock);
>  		list_del(&control->list);
>  		mutex_unlock(&ctx->call_controls_lock);
> -		if (!control->repeat)
> +		if (!control->repeat) {
>  			complete(&control->completion);
> -		else
> +		} else if (control->canceled && control->dealloc_on_cancel) {
> +			kfree(control);
> +			continue;
> +		} else {
>  			list_add(&control->list, &repeat_controls);
> +		}
>  	}
>  	control = list_first_entry_or_null(&repeat_controls,
>  			struct damon_call_control, list);
> diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c
> index 0ed404c89f80..a182670493bb 100644
> --- a/mm/damon/sysfs.c
> +++ b/mm/damon/sysfs.c
> @@ -1565,14 +1565,10 @@ static int damon_sysfs_repeat_call_fn(void *data)
>  	return 0;
>  }
>  
> -static struct damon_call_control damon_sysfs_repeat_call_control = {
> -	.fn = damon_sysfs_repeat_call_fn,
> -	.repeat = true,
> -};
> -
>  static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond)
>  {
>  	struct damon_ctx *ctx;
> +	struct damon_call_control *repeat_call_control;
>  	int err;
>  
>  	if (damon_sysfs_kdamond_running(kdamond))
> @@ -1585,18 +1581,29 @@ static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond)
>  		damon_destroy_ctx(kdamond->damon_ctx);
>  	kdamond->damon_ctx = NULL;
>  
> +	repeat_call_control = kmalloc(sizeof(*repeat_call_control),
> +			GFP_KERNEL);
> +	if (!repeat_call_control)
> +		return -ENOMEM;
> +
>  	ctx = damon_sysfs_build_ctx(kdamond->contexts->contexts_arr[0]);
> -	if (IS_ERR(ctx))
> +	if (IS_ERR(ctx)) {
> +		kfree(repeat_call_control);
>  		return PTR_ERR(ctx);
> +	}
>  	err = damon_start(&ctx, 1, false);
>  	if (err) {
> +		kfree(repeat_call_control);
>  		damon_destroy_ctx(ctx);
>  		return err;
>  	}
>  	kdamond->damon_ctx = ctx;
>  
> -	damon_sysfs_repeat_call_control.data = kdamond;
> -	damon_call(ctx, &damon_sysfs_repeat_call_control);
> +	repeat_call_control->fn = damon_sysfs_repeat_call_fn;
> +	repeat_call_control->data = kdamond;
> +	repeat_call_control->repeat = true;
> +	repeat_call_control->dealloc_on_cancel = true;
> +	damon_call(ctx, repeat_call_control);
>  	return err;
>  }
>  
> -- 
> 2.39.5
> 


* Re: [BUG] 'damo stop' causes kernel crash in v6.17-rc3
  2025-09-05  9:08       ` Yunjeong Mun
@ 2025-09-05 20:07         ` SeongJae Park
  2025-09-08  4:36           ` Yunjeong Mun
  0 siblings, 1 reply; 8+ messages in thread
From: SeongJae Park @ 2025-09-05 20:07 UTC
  To: Yunjeong Mun; +Cc: SeongJae Park, damon, honggyu.kim, kernel_team

On Fri,  5 Sep 2025 18:08:26 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

> On Thu,  4 Sep 2025 20:54:11 -0700 SeongJae Park <sj@kernel.org> wrote:
> > On Thu,  4 Sep 2025 17:29:45 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > 
> > > On Wed,  3 Sep 2025 21:02:03 -0700 SeongJae Park <sj@kernel.org> wrote:
> > > > Hi Yunjeong,
> > > > 
> > > > On Thu,  4 Sep 2025 10:17:38 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > > > 
> > > > > Hi!
> > > > > 
> > > > > I encountered a kernel crash when running 'damo stop' in kernel v6.17-rc3, 
> > > > > I tested and confirmed that this issue also occurs in v6.17-rc1.
[...]
> > Thank you for confirming.  I confirmed attaching patch fixes the problem with
> > your repro on my setup.  Could you please also test that on your machines and
> > confirm if it fixes the issues on your setups, too?  If you confirm, I will
> > post it soon.
> 
> Thank you for the patch. I have confirmed that it fixes the problem on QEMU.

Thank you for confirming.

> I will test it on my physical machine early next week and share the result
> with you.

Looking forward to it!


Thanks,
SJ

[...]


* Re: [BUG] 'damo stop' causes kernel crash in v6.17-rc3
  2025-09-05 20:07         ` SeongJae Park
@ 2025-09-08  4:36           ` Yunjeong Mun
  2025-09-08 20:18             ` SeongJae Park
  0 siblings, 1 reply; 8+ messages in thread
From: Yunjeong Mun @ 2025-09-08  4:36 UTC
  To: sj; +Cc: damon, honggyu.kim, kernel_team

Hi SeongJae,

On Fri,  5 Sep 2025 13:07:41 -0700 SeongJae Park <sj@kernel.org> wrote:
> On Fri,  5 Sep 2025 18:08:26 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> 
> > On Thu,  4 Sep 2025 20:54:11 -0700 SeongJae Park <sj@kernel.org> wrote:
> > > On Thu,  4 Sep 2025 17:29:45 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > > 
> > > > On Wed,  3 Sep 2025 21:02:03 -0700 SeongJae Park <sj@kernel.org> wrote:
> > > > > Hi Yunjeong,
> > > > > 
> > > > > On Thu,  4 Sep 2025 10:17:38 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > > > > 
> > > > > > Hi!
> > > > > > 
> > > > > > I encountered a kernel crash when running 'damo stop' in kernel v6.17-rc3, 
> > > > > > I tested and confirmed that this issue also occurs in v6.17-rc1.
> [...]
> > > Thank you for confirming.  I confirmed attaching patch fixes the problem with
> > > your repro on my setup.  Could you please also test that on your machines and
> > > confirm if it fixes the issues on your setups, too?  If you confirm, I will
> > > post it soon.
> > 
> > Thank you for the patch. I have confirmed that it fixes the problem on QEMU.
> 
> Thank you for confirming.
> 
> > I will test it on my physical machine early next week and share the result
> > with you.
> 
> Looking forward to!

I've confirmed that this patch also fixes the issue on my physical machine. :)
Thank you very much.

Yunjeong Mun

> 
> 
> Thanks,
> SJ
> 
> [...]


* Re: [BUG] 'damo stop' causes kernel crash in v6.17-rc3
  2025-09-08  4:36           ` Yunjeong Mun
@ 2025-09-08 20:18             ` SeongJae Park
  0 siblings, 0 replies; 8+ messages in thread
From: SeongJae Park @ 2025-09-08 20:18 UTC
  To: Yunjeong Mun; +Cc: SeongJae Park, damon, honggyu.kim, kernel_team

On Mon,  8 Sep 2025 13:36:18 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

> Hi SeongJae,
> 
> On Fri,  5 Sep 2025 13:07:41 -0700 SeongJae Park <sj@kernel.org> wrote:
> > On Fri,  5 Sep 2025 18:08:26 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > 
> > > On Thu,  4 Sep 2025 20:54:11 -0700 SeongJae Park <sj@kernel.org> wrote:
> > > > On Thu,  4 Sep 2025 17:29:45 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > > > 
> > > > > On Wed,  3 Sep 2025 21:02:03 -0700 SeongJae Park <sj@kernel.org> wrote:
> > > > > > Hi Yunjeong,
> > > > > > 
> > > > > > On Thu,  4 Sep 2025 10:17:38 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > > > > > 
> > > > > > > Hi!
> > > > > > > 
> > > > > > > I encountered a kernel crash when running 'damo stop' in kernel v6.17-rc3, 
> > > > > > > I tested and confirmed that this issue also occurs in v6.17-rc1.
> > [...]
> > > > Thank you for confirming.  I confirmed attaching patch fixes the problem with
> > > > your repro on my setup.  Could you please also test that on your machines and
> > > > confirm if it fixes the issues on your setups, too?  If you confirm, I will
> > > > post it soon.
[...]
> I've confirmed that this patch also fixes the issue on my physical machine. :)
> Thank you very much.

Thank you for confirming!  I just posted the fix:
https://lore.kernel.org/20250908201513.60802-1-sj@kernel.org


Thanks,
SJ

[...]


Thread overview: 8+ messages
2025-09-04  1:17 [BUG] 'damo stop' causes kernel crash in v6.17-rc3 Yunjeong Mun
2025-09-04  4:02 ` SeongJae Park
2025-09-04  8:29   ` Yunjeong Mun
2025-09-05  3:54     ` SeongJae Park
2025-09-05  9:08       ` Yunjeong Mun
2025-09-05 20:07         ` SeongJae Park
2025-09-08  4:36           ` Yunjeong Mun
2025-09-08 20:18             ` SeongJae Park
