* Problems running DAMON kunit tests with spinlock debugging enabled
@ 2024-09-03 19:23 Guenter Roeck
2024-09-04 1:00 ` SeongJae Park
0 siblings, 1 reply; 12+ messages in thread
From: Guenter Roeck @ 2024-09-03 19:23 UTC (permalink / raw)
To: SeongJae Park; +Cc: damon
Hi,
when trying to run DAMON kunit tests with spinlock debugging enabled,
I get the following error messages.
BUG: spinlock bad magic on CPU#0, kunit_try_catch/209
lock: mm.13+0x40/0x840, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 0 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00148-g178297ec52d6 #1
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x9e/0xe0
do_raw_spin_lock+0x63/0xb0
damon_test_three_regions_in_vmas+0x121/0x450S
...
This happens because damon_test_three_regions_in_vmas() calls
mt_init_flags() with MM_MT_FLAGS, which sets MT_FLAGS_LOCK_EXTERN.
Because of this, the spinlock is not initialized, which then results
in the error message when mas_lock() is called during the test.
Unfortunately, replacing MM_MT_FLAGS with MT_FLAGS_ALLOC_RANGE |
MT_FLAGS_USE_RCU or just with MT_FLAGS_ALLOC_RANGE doesn't help;
it results in various "suspicious RCU usage" messages.
At this point I gave up trying to fix the problem.
The problem should be easy to reproduce by enabling CONFIG_DEBUG_SPINLOCK
and CONFIG_LOCKDEP.
Please let me know if there is anything I can do to help fix the problem.
Thanks,
Guenter
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-03 19:23 Problems running DAMON kunit tests with spinlock debugging enabled Guenter Roeck
@ 2024-09-04 1:00 ` SeongJae Park
2024-09-04 2:24 ` Guenter Roeck
0 siblings, 1 reply; 12+ messages in thread
From: SeongJae Park @ 2024-09-04 1:00 UTC (permalink / raw)
To: Guenter Roeck; +Cc: SeongJae Park, damon
Hi Guenter,
On Tue, 3 Sep 2024 12:23:50 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> Hi,
>
> when trying to run DAMON kunit tests with spinlock debugging enabled,
> I get the following error messages.
>
> BUG: spinlock bad magic on CPU#0, kunit_try_catch/209
> lock: mm.13+0x40/0x840, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
> CPU: 0 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00148-g178297ec52d6 #1
> Tainted: [N]=TEST
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x9e/0xe0
> do_raw_spin_lock+0x63/0xb0
> damon_test_three_regions_in_vmas+0x121/0x450S
> ...
>
> This happens because damon_test_three_regions_in_vmas() calls
> mt_init_flags() with MM_MT_FLAGS, which sets MT_FLAGS_LOCK_EXTERN.
> Because of this, the spinlock is not initialized, which then results
> in the error message when mas_lock() is called during the test.
>
> Unfortunately, replacing MM_MT_FLAGS with MT_FLAGS_ALLOC_RANGE |
> MT_FLAGS_USE_RCU or just with MT_FLAGS_ALLOC_RANGE doesn't help;
> it results in various "suspicious RCU usage" messages.
Thank you very much for this report with the grateful detailed investigation!
I was able to make and send a fix[1], thanks to the investigation. Seems it
may need some more revisions[2], though. I'll continue discussion on the
thread.
[1] https://lore.kernel.org/20240904004534.1189-1-sj@kernel.org
[2] https://lore.kernel.org/jy6263g6em4jsdhp6tknmh2cljpuvq652kvcet4ko3z2xt7pym@ltc5h5twsszu/
Thanks,
SJ
>
> Thanks,
> Guenter
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-04 1:00 ` SeongJae Park
@ 2024-09-04 2:24 ` Guenter Roeck
2024-09-04 2:46 ` Guenter Roeck
0 siblings, 1 reply; 12+ messages in thread
From: Guenter Roeck @ 2024-09-04 2:24 UTC (permalink / raw)
To: SeongJae Park; +Cc: damon
On 9/3/24 18:00, SeongJae Park wrote:
> Hi Guenter,
>
> On Tue, 3 Sep 2024 12:23:50 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>
>> Hi,
>>
>> when trying to run DAMON kunit tests with spinlock debugging enabled,
>> I get the following error messages.
>>
>> BUG: spinlock bad magic on CPU#0, kunit_try_catch/209
>> lock: mm.13+0x40/0x840, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
>> CPU: 0 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00148-g178297ec52d6 #1
>> Tainted: [N]=TEST
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x9e/0xe0
>> do_raw_spin_lock+0x63/0xb0
>> damon_test_three_regions_in_vmas+0x121/0x450S
>> ...
>>
>> This happens because damon_test_three_regions_in_vmas() calls
>> mt_init_flags() with MM_MT_FLAGS, which sets MT_FLAGS_LOCK_EXTERN.
>> Because of this, the spinlock is not initialized, which then results
>> in the error message when mas_lock() is called during the test.
>>
>> Unfortunately, replacing MM_MT_FLAGS with MT_FLAGS_ALLOC_RANGE |
>> MT_FLAGS_USE_RCU or just with MT_FLAGS_ALLOC_RANGE doesn't help;
>> it results in various "suspicious RCU usage" messages.
>
> Thank you very much for this report with the grateful detailed investigation!
> I was able to make and send a fix[1], thanks to the investigation. Seems it
> may need some more revisions[2], though. I'll continue discussion on the
> thread.
>
> [1] https://lore.kernel.org/20240904004534.1189-1-sj@kernel.org
> [2] https://lore.kernel.org/jy6263g6em4jsdhp6tknmh2cljpuvq652kvcet4ko3z2xt7pym@ltc5h5twsszu/
>
>
Some more problems. I tried your patch on various architectures.
Unfortunately, that didn't turn out well. See below for some of the failures.
[ I am not sure I understand the code in damon_test_nr_accesses_to_accesses_bp(),
especially
.aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
What happens if sizeof(int) == sizeof(long) ?
Unless I am really missing something, .aggr_interval will be
0 in that case.
]
Thanks,
Guenter
---
arm:
[ 6.065379] ok 9 damon_test_set_regions
[ 6.067694] Internal error: Oops - undefined instruction: 0 [#1] ARM
[ 6.068458] CPU: 0 UID: 0 PID: 86 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00157-gfd747295fe53 #1
[ 6.068794] Tainted: [N]=TEST
[ 6.068890] Hardware name: Generic DT based system
[ 6.069068] PC is at damon_test_nr_accesses_to_accesses_bp+0x8/0xc
[ 6.069718] LR is at kunit_try_run_case+0x8c/0x1f0
[ 6.069847] pc : [<80323990>] lr : [<8074da70>] psr: 60000113
[ 6.069976] sp : 88bedf18 ip : 00000000 fp : 00000000
[ 6.070096] r10: 00000000 r9 : 88015c0c r8 : 88015d18
[ 6.070218] r7 : 8075035c r6 : 838b94fc r5 : 88015d0c r4 : 88015d18
[ 6.070365] r3 : 80323988 r2 : 028c45c5 r1 : 00000000 r0 : 88015d0c
[ 6.070560] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 6.070747] Control: 00093177 Table: 40004000 DAC: 00000053
[ 6.070911] Register r0 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xb0/0x394
[ 6.071384] Register r1 information: NULL pointer
[ 6.071552] Register r2 information: non-paged memory
[ 6.071695] Register r3 information: non-slab/vmalloc memory
[ 6.071904] Register r4 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xb0/0x394
[ 6.072139] Register r5 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xb0/0x394
[ 6.072387] Register r6 information: slab kmalloc-512 start 838b9200 data offset 512 pointer offset 252 size 512 allocated at kunit_filter_attr_tests+0x84/0x238
[ 6.073263] __kmalloc_noprof+0x320/0x350
[ 6.073383] kunit_filter_attr_tests+0x84/0x238
[ 6.073499] kunit_filter_suites+0x42c/0x5f4
[ 6.073613] kunit_run_all_tests+0x84/0x2cc
[ 6.073723] kernel_init_freeable+0x1cc/0x238
[ 6.073840] kernel_init+0x18/0x138
[ 6.073942] ret_from_fork+0x14/0x38
[ 6.074055] Free path:
[ 6.074172] alg_test_comp+0x22c/0x940
[ 6.074294] alg_test+0x138/0x5f4
[ 6.074390] cryptomgr_test+0x20/0x3c
[ 6.074549] kthread+0xf0/0x114
[ 6.074665] ret_from_fork+0x14/0x38
[ 6.074796] Register r7 information: non-slab/vmalloc memory
[ 6.074960] Register r8 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xb0/0x394
[ 6.075226] Register r9 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xb0/0x394
[ 6.075491] Register r10 information: NULL pointer
[ 6.075631] Register r11 information: NULL pointer
[ 6.075769] Register r12 information: NULL pointer
[ 6.075927] Process kunit_try_catch (pid: 86, stack limit = 0x88bec000)
[ 6.076152] Stack: (0x88bedf18 to 0x88bee000)
[ 6.076379] df00: 83899404 8013c844
[ 6.076613] df20: 83899488 00000001 00000006 00000000 028c45c5 00000000 00000000 00000000
[ 6.076828] df40: 00000000 00000000 40000113 80d7da88 8135e824 801f72d4 40000113 838993f4
[ 6.077036] df60: 83d8b800 80d7da88 83898e60 566860c4 88015d18 83898e60 83d8b800 8075035c
[ 6.077248] df80: 88015d18 80750380 83d7a620 8013e5f8 83d7a620 8013e508 00000000 00000000
[ 6.077458] dfa0: 00000000 00000000 00000000 801000fc 00000000 00000000 00000000 00000000
[ 6.077669] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 6.077875] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[ 6.078136] Call trace:
[ 6.078238] damon_test_nr_accesses_to_accesses_bp from kunit_try_run_case+0x8c/0x1f0
[ 6.078613] kunit_try_run_case from kunit_generic_run_threadfn_adapter+0x24/0x3c
[ 6.078835] kunit_generic_run_threadfn_adapter from kthread+0xf0/0x114
[ 6.079022] kthread from ret_from_fork+0x14/0x38
[ 6.079194] Exception stack(0x88bedfb0 to 0x88bedff8)
[ 6.079333] dfa0: 00000000 00000000 00000000 00000000
[ 6.079537] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 6.079736] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 6.080043] Code: 8074ff5c 80e1542c e52de004 e28dd004 (e7f000f0)
[ 6.080577] ---[ end trace 0000000000000000 ]---
[ 6.080902] Kernel panic - not syncing: Fatal exception
m68k:
ok 9 damon_test_set_regions
*** TRAP #7 *** FORMAT=0
Current process id is 86
BAD KERNEL TRAP: 00000000
Modules linked in:
PC: [<000eaf02>] damos_commit_filter_arg+0x0/0x3c
SR: 2000 SP: 01573f08 a2: 01578660
d0: 00000000 d1: 05bc73b4 d2: 00000000 d3: 01573f7c
d4: 014695a0 d5: 0002cd6c a0: 000eaf00 a1: 05bc73b4
Process kunit_try_catch (pid: 86, task=01578660)
Frame format=0
Stack from 01573f3c:
0025b072 00c63df0 01573f70 00c63dfc 00c63c74 014695a0 0002cd6c 00000000
014695a0 00c63dfc 014bd860 0002cafa 0025c77c 00000000 00000004 05bc73b4
00000000 00000000 00000000 01578a22 00c63dfc 00c63c74 0025c792 00c63cf4
014695a0 0002ce26 00c63dfc 014bd860 00000000 00c515ae 0002cd96 00000000
00000000 00000000 0000252c 014695a0 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 20000000
Call Trace: [<0025b072>] kunit_try_run_case+0xa0/0x176
[<0002cd6c>] list_del_init+0x0/0x2a
[<0002cafa>] kthread_exit+0x0/0x14
[<0025c77c>] kunit_generic_run_threadfn_adapter+0x0/0x2a
[<0025c792>] kunit_generic_run_threadfn_adapter+0x16/0x2a
[<0002ce26>] kthread+0x90/0x9a
[<0002cd96>] kthread+0x0/0x9a
[<0000252c>] ret_from_kernel_thread+0xc/0x14
Code: b082 66aa 4cdf 78fc 4fef 0064 4e75 4e47 <206f> 0004 226f 0008 2010 7203 b280 6712 7204 b280 671e 5380 6606 3169 0006 0006
Disabling lock debugging due to kernel taint
# damon_test_nr_accesses_to_accesses_bp: try faulted
# damon_test_nr_accesses_to_accesses_bp: internal error occurred preventing test case from running: -4
# damon_test_nr_accesses_to_accesses_bp: pass:0 fail:1 skip:0 total:1
not ok 10 damon_test_nr_accesses_to_accesses_bp
riscv32:
[ 4.398451] ok 9 damon_test_set_regions
[ 4.400790] Kernel BUG [#1]
[ 4.400952] Modules linked in:
[ 4.401208] CPU: 0 UID: 0 PID: 70 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00157-gfd747295fe53 #1
[ 4.401305] Tainted: [N]=TEST
[ 4.401318] Hardware name: riscv-virtio,qemu (DT)
[ 4.401389] epc : damon_test_nr_accesses_to_accesses_bp+0x6/0x8
[ 4.401740] ra : kunit_try_run_case+0xb4/0x1b4
[ 4.401764] epc : c01ceaaa ra : c0395000 sp : c3ae1eb0
[ 4.401779] gp : c2157848 tp : c4a78040 t0 : c2d8baec
[ 4.401793] t1 : 00000001 t2 : 00006e3d s0 : c3ae1ec0
[ 4.401805] s1 : c2d8bc7c a0 : c2d8bc7c a1 : 3b9ac9ff
[ 4.401819] a2 : 00000000 a3 : 00000003 a4 : 00000001
[ 4.401831] a5 : c01ceaa4 a6 : 16b9ab70 a7 : ffffffff
[ 4.401845] s2 : fffffffc s3 : c2d8bb4c s4 : c4ab38fc
[ 4.401858] s5 : c2d8bc88 s6 : c4a78040 s7 : 00000000
[ 4.401871] s8 : 00000000 s9 : 00000000 s10: 00000000
[ 4.401886] s11: 00000000 t3 : 16b9ab70 t4 : 00000000
[ 4.401898] t5 : 4e6a1710 t6 : c20e3300
[ 4.401910] status: 00000120 badaddr: 00000000 cause: 00000003
[ 4.401948] [<c01ceaaa>] damon_test_nr_accesses_to_accesses_bp+0x6/0x8
[ 4.402017] [<c0395000>] kunit_try_run_case+0xb4/0x1b4
[ 4.402031] [<c039686a>] kunit_generic_run_threadfn_adapter+0x1a/0x32
[ 4.402048] [<c002fc9c>] kthread+0xd2/0xd4
[ 4.402062] [<c08b795a>] ret_from_fork+0xe/0x1c
[ 4.402164] Code: 4b0a 5bf6 5c66 5cd6 5d46 610d 8082 1141 c622 0800 (9002) 1141
[ 4.402354] ---[ end trace 0000000000000000 ]---
[ 4.402493] Kernel panic - not syncing: Fatal exception in interrupt
sheb:
ok 9 damon_test_set_regions
------------[ cut here ]------------
kernel BUG at kernel/exit.c:1912!
Kernel BUG: 003e [#1]
Modules linked in:
CPU: 0 UID: 0 PID: 97 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00157-gfd747295fe53 #1
Tainted: [N]=TEST
PC is at abort+0x0/0x4
PR is at damon_test_nr_accesses_to_accesses_bp+0x8/0xc
PC : 8c021338 SP : 8cb41f20 SR : 40008000 TEA : 295f8f58
R0 : 00000000 R1 : 8c021338 R2 : 00000000 R3 : 00000001
R4 : 8c827d28 R5 : 00000000 R6 : 02fc63da R7 : 00000000
R8 : 8c827d28 R9 : 8c71112c R10 : 8caff4fc R11 : 8c064104
R12 : 8c827c2c R13 : 8c1ec660 R14 : 8c827d34
MACH: 0000ba01 MACL: b554de4a GBR : 00000000 PR : 8c0e9424
Call trace:
[<8c1eab14>] kunit_try_run_case+0x58/0x174
[<8c011704>] arch_local_irq_restore+0x0/0x24
[<8c5609ba>] schedule+0x1a/0xf0
[<8c1ec660>] kunit_generic_run_threadfn_adapter+0x0/0x24
[<8c1ec670>] kunit_generic_run_threadfn_adapter+0x10/0x24
[<8c1ec660>] kunit_generic_run_threadfn_adapter+0x0/0x24
[<8c011704>] arch_local_irq_restore+0x0/0x24
[<8c03a69e>] kthread+0x96/0xc4
[<8c0151c4>] ret_from_kernel_thread+0xc/0x14
[<8c011704>] arch_local_irq_restore+0x0/0x24
[<8c041960>] schedule_tail+0x0/0x58
[<8c03a608>] kthread+0x0/0xc4
Process: kunit_try_catch (pid: 97, stack limit = (ptrval))
Stack: (0x8cb41f20 to 0x8cb42000)
1f20: 8c1eab14 8c011704 8c71b51c 00000000 00000001 02fc63da 00000000 00000000
1f40: 00000000 00000000 bec1813c 8c5609ba 8cb41f6c 8c1ec660 8c827c2c bec1813c
1f60: 8c1ec670 8c1ec660 8c827c2c 8c011704 00000000 8cafccc0 8c827d34 8c03a69e
1f80: 8cb3aa80 8c0151c4 8c833f30 8c8285fc 00000000 8c8901e0 8c011704 8c71b51c
1fa0: 8c041960 00000000 00000000 00000000 00000000 8cb3aa80 8c03a608 00000000
1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
1fe0: 00000000 00000000 00000000 40008000 00000000 00000000 00000000 00000000
---[ end trace 0000000000000000 ]---
# damon_test_nr_accesses_to_accesses_bp: try faulted
# damon_test_nr_accesses_to_accesses_bp: internal error occurred preventing test case from running: -4
# damon_test_nr_accesses_to_accesses_bp: pass:0 fail:1 skip:0 total:1
not ok 10 damon_test_nr_accesses_to_accesses_bp
# damon_test_update_monitoring_result: pass:1 fail:0 skip:0 total:1
ok 11 damon_test_update_monitoring_result
# damon_test_set_attrs: pass:1 fail:0 skip:0 total:1
ok 12 damon_test_set_attrs
# damon_test_moving_sum: pass:1 fail:0 skip:0 total:1
ok 13 damon_test_moving_sum
# damos_test_new_filter: pass:1 fail:0 skip:0 total:1
ok 14 damos_test_new_filter
# damos_test_filter_out: pass:1 fail:0 skip:0 total:1
ok 15 damos_test_filter_out
# damon_test_feed_loop_next_input: EXPECTATION FAILED at mm/damon/core-test.h:485
Expected damon_feed_loop_next_input(last_input, 200) > damon_feed_loop_next_input(last_input, 2000), but
damon_feed_loop_next_input(last_input, 200) == 923006 (0xe157e)
damon_feed_loop_next_input(last_input, 2000) == 1190503 (0x122a67)
# damon_test_feed_loop_next_input: pass:0 fail:1 skip:0 total:1
not ok 16 damon_test_feed_loop_next_input
# damon: pass:14 fail:2 skip:0 total:16
# Totals: pass:14 fail:2 skip:0 total:16
not ok 4 damon
xtensa:
[ 7.469744] ok 9 damon_test_set_regions
[ 7.471548] Unhandled division by 0 in kernel: sig: 9 [#1] PREEMPT
[ 7.472769] CPU: 0 UID: 0 PID: 119 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00157-gfd747295fe53 #1
[ 7.473399] Tainted: [N]=TEST
[ 7.473681] a00: 903d2bfd d2503e50 d1081c10 00000000 1a1fcdc0 00000000 00000007 00000007
[ 7.474153] a08: 0012c4b0 00000000 00000138 00000000 00000000 000005d0 00000000 d1ef6060
[ 7.475251] pc: d0102555, ps: 00060710, depc: 00000006, excvaddr: c0ac652c
[ 7.475578] lbeg: d06b3664, lend: d06b366f lcount: ffffffff, sar: 00000019
[ 7.475941] Stack:
[ 7.476404] > 903d4352 d2503e90 d117b8fc d1081c10
[ 7.476750] d03d2bf6 00000000 00000000 00000000
[ 7.477068] 00000000 00000001 ffffffff 00060300
[ 7.477336] 90024f32 d2503ef0 d1081c1c fffffffc
[ 7.477603] 00000000 00000001 d1ef65a4 00060300
[ 7.477864] 00000000 d1ef6560 00000001 00000000
[ 7.478126] 90024f2d d2503ef0 d1edb380 d2502000
[ 7.478386] 00000007 00000000 1a1fcdc0 d2502000
[ 7.478647] d0102544 d1ef6060 00000001 d2503ec0
[ 7.478907] 500054c8 d2503f10 d1ef6360 00000000
[ 7.479166] d0024b34 d0024ae6 00000040 d2503ef0
[ 7.479426] 00000000 d2503f30 d0024e64 d2508e20
[ 7.479686] ffffffdf d03d4344 d1081c1c d1081af0
[ 7.479945] 00000000 d2503f30 d0024e64 d2508e20
[ 7.480204] 00000000 00000000 00000000 00000000
[ 7.480462] 00000000 00000000 00000000 00000000
[ 7.480737] Call Trace:
[ 7.480967] Disabling lock debugging due to kernel taint
[ 7.483121] # damon_test_nr_accesses_to_accesses_bp: try faulted: last line seen /opt/buildbot/slave/qemu-xtensa/build/mm/damon/core-test.h:0
[ 7.484664] # damon_test_nr_accesses_to_accesses_bp: internal error occurred preventing test case from running: -4
[ 7.486237] # damon_test_nr_accesses_to_accesses_bp: pass:0 fail:1 skip:0 total:1
[ 7.486385] not ok 10 damon_test_nr_accesses_to_accesses_bp
[ 7.488894] # damon_test_update_monitoring_result: pass:1 fail:0 skip:0 total:1
[ 7.489238] ok 11 damon_test_update_monitoring_result
[ 7.491703] # damon_test_set_attrs: pass:1 fail:0 skip:0 total:1
[ 7.492033] ok 12 damon_test_set_attrs
[ 7.495038] # damon_test_moving_sum: pass:1 fail:0 skip:0 total:1
[ 7.495468] ok 13 damon_test_moving_sum
[ 7.499501] # damos_test_new_filter: pass:1 fail:0 skip:0 total:1
[ 7.499970] ok 14 damos_test_new_filter
[ 7.504989] # damos_test_filter_out: pass:1 fail:0 skip:0 total:1
[ 7.505292] ok 15 damos_test_filter_out
[ 7.507201] # damon_test_feed_loop_next_input: EXPECTATION FAILED at /opt/buildbot/slave/qemu-xtensa/build/mm/damon/core-test.h:487
[ 7.507201] Expected damon_feed_loop_next_input(last_input, 200) > damon_feed_loop_next_input(last_input, 2000), but
[ 7.507201] damon_feed_loop_next_input(last_input, 200) == 923006 (0xe157e)
[ 7.507201] damon_feed_loop_next_input(last_input, 2000) == 1190503 (0x122a67)
[ 7.509319] # damon_test_feed_loop_next_input: pass:0 fail:1 skip:0 total:1
[ 7.510666] not ok 16 damon_test_feed_loop_next_input
[ 7.511060] # damon: pass:14 fail:2 skip:0 total:16
[ 7.511378] # Totals: pass:14 fail:2 skip:0 total:16
[ 7.511694] not ok 4 damon
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-04 2:24 ` Guenter Roeck
@ 2024-09-04 2:46 ` Guenter Roeck
2024-09-04 17:53 ` SeongJae Park
0 siblings, 1 reply; 12+ messages in thread
From: Guenter Roeck @ 2024-09-04 2:46 UTC (permalink / raw)
To: SeongJae Park; +Cc: damon
On 9/3/24 19:24, Guenter Roeck wrote:
> On 9/3/24 18:00, SeongJae Park wrote:
>> Hi Guenter,
>>
>> On Tue, 3 Sep 2024 12:23:50 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>>
>>> Hi,
>>>
>>> when trying to run DAMON kunit tests with spinlock debugging enabled,
>>> I get the following error messages.
>>>
>>> BUG: spinlock bad magic on CPU#0, kunit_try_catch/209
>>> lock: mm.13+0x40/0x840, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
>>> CPU: 0 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00148-g178297ec52d6 #1
>>> Tainted: [N]=TEST
>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>>> Call Trace:
>>> <TASK>
>>> dump_stack_lvl+0x9e/0xe0
>>> do_raw_spin_lock+0x63/0xb0
>>> damon_test_three_regions_in_vmas+0x121/0x450S
>>> ...
>>>
>>> This happens because damon_test_three_regions_in_vmas() calls
>>> mt_init_flags() with MM_MT_FLAGS, which sets MT_FLAGS_LOCK_EXTERN.
>>> Because of this, the spinlock is not initialized, which then results
>>> in the error message when mas_lock() is called during the test.
>>>
>>> Unfortunately, replacing MM_MT_FLAGS with MT_FLAGS_ALLOC_RANGE |
>>> MT_FLAGS_USE_RCU or just with MT_FLAGS_ALLOC_RANGE doesn't help;
>>> it results in various "suspicious RCU usage" messages.
>>
>> Thank you very much for this report with the grateful detailed investigation!
>> I was able to make and send a fix[1], thanks to the investigation. Seems it
>> may need some more revisions[2], though. I'll continue discussion on the
>> thread.
>>
>> [1] https://lore.kernel.org/20240904004534.1189-1-sj@kernel.org
>> [2] https://lore.kernel.org/jy6263g6em4jsdhp6tknmh2cljpuvq652kvcet4ko3z2xt7pym@ltc5h5twsszu/
>>
>>
>
> Some more problems. I tried your patch on various architectures.
> Unfortunately, that didn't turn out well. See below for some of the failures.
>
> [ I am not sure I understand the code in damon_test_nr_accesses_to_accesses_bp(),
> especially
> .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
> What happens if sizeof(int) == sizeof(long) ?
> Unless I am really missing something, .aggr_interval will be
> 0 in that case.
> ]
>
The diff below fixes the problem in damon_test_nr_accesses_to_accesses_bp().
Obviously I don't know if this is the correct fix.
Guenter
---
diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h
index 0cee634f3544..8072f83b2bba 100644
--- a/mm/damon/core-test.h
+++ b/mm/damon/core-test.h
@@ -309,6 +309,9 @@ static void damon_test_nr_accesses_to_accesses_bp(struct kunit *test)
.aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
};
+ if (sizeof(int) == sizeof(long))
+ kunit_skip(test, "Test requires sizeof(int) != sizeof(long)");
+
KUNIT_EXPECT_EQ(test, damon_nr_accesses_to_accesses_bp(123, &attrs), 0);
}
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-04 2:46 ` Guenter Roeck
@ 2024-09-04 17:53 ` SeongJae Park
2024-09-04 19:06 ` Guenter Roeck
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: SeongJae Park @ 2024-09-04 17:53 UTC (permalink / raw)
To: Guenter Roeck; +Cc: SeongJae Park, damon
Hi Guenter,
On Tue, 3 Sep 2024 19:46:44 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> On 9/3/24 19:24, Guenter Roeck wrote:
> > On 9/3/24 18:00, SeongJae Park wrote:
> >> Hi Guenter,
> >>
> >> On Tue, 3 Sep 2024 12:23:50 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> >>
> >>> Hi,
> >>>
> >>> when trying to run DAMON kunit tests with spinlock debugging enabled,
> >>> I get the following error messages.
> >>>
> >>> BUG: spinlock bad magic on CPU#0, kunit_try_catch/209
> >>> lock: mm.13+0x40/0x840, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
> >>> CPU: 0 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00148-g178297ec52d6 #1
> >>> Tainted: [N]=TEST
> >>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> >>> Call Trace:
> >>> <TASK>
> >>> dump_stack_lvl+0x9e/0xe0
> >>> do_raw_spin_lock+0x63/0xb0
> >>> damon_test_three_regions_in_vmas+0x121/0x450S
> >>> ...
> >>>
> >>> This happens because damon_test_three_regions_in_vmas() calls
> >>> mt_init_flags() with MM_MT_FLAGS, which sets MT_FLAGS_LOCK_EXTERN.
> >>> Because of this, the spinlock is not initialized, which then results
> >>> in the error message when mas_lock() is called during the test.
> >>>
> >>> Unfortunately, replacing MM_MT_FLAGS with MT_FLAGS_ALLOC_RANGE |
> >>> MT_FLAGS_USE_RCU or just with MT_FLAGS_ALLOC_RANGE doesn't help;
> >>> it results in various "suspicious RCU usage" messages.
> >>
> >> Thank you very much for this report with the grateful detailed investigation!
> >> I was able to make and send a fix[1], thanks to the investigation. Seems it
> >> may need some more revisions[2], though. I'll continue discussion on the
> >> thread.
> >>
> >> [1] https://lore.kernel.org/20240904004534.1189-1-sj@kernel.org
> >> [2] https://lore.kernel.org/jy6263g6em4jsdhp6tknmh2cljpuvq652kvcet4ko3z2xt7pym@ltc5h5twsszu/
> >>
> >>
> >
> > Some more problems. I tried your patch on various architectures.
> > Unfortunately, that didn't turn out well. See below for some of the failures.
Nice finding, thank you for reporting this! Just to be clear, seems this is
independent from the originally reported issue or the fix for it. Please let
me know if not.
> >
> > [ I am not sure I understand the code in damon_test_nr_accesses_to_accesses_bp(),
> > especially
> > .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
> > What happens if sizeof(int) == sizeof(long) ?
> > Unless I am really missing something, .aggr_interval will be
> > 0 in that case.
I agree. aggr_interval would be zero in the case. And in this case,
damon_max_nr_accesses(attrs) in damon_nr_accesses_to_accesses_bp() returns
zero, so damon_nr_accesses_to_accesses_bp() will get a divide by zero problem.
That is, the problem happens when damon_nr_accesses_to_accesses_bp() is called
with zero aggr_interval, regardless of the architecture.
> > ]
> >
> The diff below fixes the problem in damon_test_nr_accesses_to_accesses_bp().
> Obviously I don't know if this is the correct fix.
>
> Guenter
>
> ---
> diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h
> index 0cee634f3544..8072f83b2bba 100644
> --- a/mm/damon/core-test.h
> +++ b/mm/damon/core-test.h
> @@ -309,6 +309,9 @@ static void damon_test_nr_accesses_to_accesses_bp(struct kunit *test)
> .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
> };
>
> + if (sizeof(int) == sizeof(long))
> + kunit_skip(test, "Test requires sizeof(int) != sizeof(long)");
> +
> KUNIT_EXPECT_EQ(test, damon_nr_accesses_to_accesses_bp(123, &attrs), 0);
> }
So, above fix would work, but I think doing the skip if the resulting
aggr_interval is zero would make more sense. Having more comments would also
be nice.
Nonetheless, if the issue can be triggered outside of the test, we may need to
think a better fundamental fix. And seems it is the case.
Sharing current incomplete findings here first.
First of all, users can set the zero aggr_interval. And the problematic
function (damon_nr_accesses_to_accesses_bp()) is called when user updates the
intervals while DAMON is running, with the old intervals attributes. Hence,
users could trigger the issue by making DAMON to run with zero aggregation
interval, and then updating the intervals to any values.
Due to another complicatd logic behind the online intervals updates, trigerring
the issue seems not that simple. In short, seems it can be triggered only by
using DAMOS watermarks together. This is something bettr to revisit later.
So, I haven't confirmed if this is exploitable or not due to the lack of time
and other things to do, at the moment. I will take a look further and share
more findings soon, hopefully within a couple of days.
Thanks,
SJ
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-04 17:53 ` SeongJae Park
@ 2024-09-04 19:06 ` Guenter Roeck
2024-09-05 16:30 ` SeongJae Park
2024-09-04 21:04 ` Guenter Roeck
2024-09-05 6:14 ` SeongJae Park
2 siblings, 1 reply; 12+ messages in thread
From: Guenter Roeck @ 2024-09-04 19:06 UTC (permalink / raw)
To: SeongJae Park; +Cc: damon
On 9/4/24 10:53, SeongJae Park wrote:
> Hi Guenter,
>
> On Tue, 3 Sep 2024 19:46:44 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>
>> On 9/3/24 19:24, Guenter Roeck wrote:
>>> On 9/3/24 18:00, SeongJae Park wrote:
>>>> Hi Guenter,
>>>>
>>>> On Tue, 3 Sep 2024 12:23:50 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> when trying to run DAMON kunit tests with spinlock debugging enabled,
>>>>> I get the following error messages.
>>>>>
>>>>> BUG: spinlock bad magic on CPU#0, kunit_try_catch/209
>>>>> lock: mm.13+0x40/0x840, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
>>>>> CPU: 0 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00148-g178297ec52d6 #1
>>>>> Tainted: [N]=TEST
>>>>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>>>>> Call Trace:
>>>>> <TASK>
>>>>> dump_stack_lvl+0x9e/0xe0
>>>>> do_raw_spin_lock+0x63/0xb0
>>>>> damon_test_three_regions_in_vmas+0x121/0x450S
>>>>> ...
>>>>>
>>>>> This happens because damon_test_three_regions_in_vmas() calls
>>>>> mt_init_flags() with MM_MT_FLAGS, which sets MT_FLAGS_LOCK_EXTERN.
>>>>> Because of this, the spinlock is not initialized, which then results
>>>>> in the error message when mas_lock() is called during the test.
>>>>>
>>>>> Unfortunately, replacing MM_MT_FLAGS with MT_FLAGS_ALLOC_RANGE |
>>>>> MT_FLAGS_USE_RCU or just with MT_FLAGS_ALLOC_RANGE doesn't help;
>>>>> it results in various "suspicious RCU usage" messages.
>>>>
>>>> Thank you very much for this report with the grateful detailed investigation!
>>>> I was able to make and send a fix[1], thanks to the investigation. Seems it
>>>> may need some more revisions[2], though. I'll continue discussion on the
>>>> thread.
>>>>
>>>> [1] https://lore.kernel.org/20240904004534.1189-1-sj@kernel.org
>>>> [2] https://lore.kernel.org/jy6263g6em4jsdhp6tknmh2cljpuvq652kvcet4ko3z2xt7pym@ltc5h5twsszu/
>>>>
>>>>
>>>
>>> Some more problems. I tried your patch on various architectures.
>>> Unfortunately, that didn't turn out well. See below for some of the failures.
>
> Nice finding, thank you for reporting this! Just to be clear, seems this is
> independent from the originally reported issue or the fix for it. Please let
> me know if not.
>
It is a different problem.
>>>
>>> [ I am not sure I understand the code in damon_test_nr_accesses_to_accesses_bp(),
>>> especially
>>> .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
>>> What happens if sizeof(int) == sizeof(long) ?
>>> Unless I am really missing something, .aggr_interval will be
>>> 0 in that case.
>
> I agree. aggr_interval would be zero in the case. And in this case,
> damon_max_nr_accesses(attrs) in damon_nr_accesses_to_accesses_bp() returns
> zero, so damon_nr_accesses_to_accesses_bp() will get a divide by zero problem.
> That is, the problem happens when damon_nr_accesses_to_accesses_bp() is called
> with zero aggr_interval, regardless of the architecture.
>
>>> ]
>>>
>> The diff below fixes the problem in damon_test_nr_accesses_to_accesses_bp().
>> Obviously I don't know if this is the correct fix.
>>
>> Guenter
>>
>> ---
>> diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h
>> index 0cee634f3544..8072f83b2bba 100644
>> --- a/mm/damon/core-test.h
>> +++ b/mm/damon/core-test.h
>> @@ -309,6 +309,9 @@ static void damon_test_nr_accesses_to_accesses_bp(struct kunit *test)
>> .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
>> };
>>
>> + if (sizeof(int) == sizeof(long))
>> + kunit_skip(test, "Test requires sizeof(int) != sizeof(long)");
>> +
>> KUNIT_EXPECT_EQ(test, damon_nr_accesses_to_accesses_bp(123, &attrs), 0);
>> }
>
> So, above fix would work, but I think doing the skip if the resulting
> aggr_interval is zero would make more sense. Having more comments would also
> be nice.
>
Sure; the above was not intended to be a final (or even acceptable) fix,
just a workaround to let me run tests with damon kunit tests enabled
on 32-bit systems.
Thanks,
Guenter
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-04 17:53 ` SeongJae Park
2024-09-04 19:06 ` Guenter Roeck
@ 2024-09-04 21:04 ` Guenter Roeck
2024-09-05 17:35 ` SeongJae Park
2024-09-05 6:14 ` SeongJae Park
2 siblings, 1 reply; 12+ messages in thread
From: Guenter Roeck @ 2024-09-04 21:04 UTC (permalink / raw)
To: SeongJae Park; +Cc: damon
On 9/4/24 10:53, SeongJae Park wrote:
[ ... ]
> Nonetheless, if the issue can be triggered outside of the test, we may need to
> think a better fundamental fix. And seems it is the case.
>
> Sharing current incomplete findings here first.
>
> First of all, users can set the zero aggr_interval. And the problematic
> function (damon_nr_accesses_to_accesses_bp()) is called when user updates the
> intervals while DAMON is running, with the old intervals attributes. Hence,
> users could trigger the issue by making DAMON to run with zero aggregation
> interval, and then updating the intervals to any values.
>
> Due to another complicatd logic behind the online intervals updates, trigerring
> the issue seems not that simple. In short, seems it can be triggered only by
> using DAMOS watermarks together. This is something bettr to revisit later.
>
> So, I haven't confirmed if this is exploitable or not due to the lack of time
> and other things to do, at the moment. I will take a look further and share
> more findings soon, hopefully within a couple of days.
>
Here is another problem. With all other problems fixed, when running on a 32-bit system:
# damon_test_feed_loop_next_input: EXPECTATION FAILED at mm/damon/core-test.h:488
Expected damon_feed_loop_next_input(last_input, 200) > damon_feed_loop_next_input(last_input, 2000), but
damon_feed_loop_next_input(last_input, 200) == 923006 (0xe157e)
damon_feed_loop_next_input(last_input, 2000) == 1190503 (0x122a67)
# damon_test_feed_loop_next_input: pass:0 fail:1 skip:0 total:1
not ok 16 damon_test_feed_loop_next_input
# damon: pass:14 fail:1 skip:1 total:16
Problem is that both damon_feed_loop_next_input(900000, 200) and
damon_feed_loop_next_input(900000, 2000) overflow on 32-bit systems.
Also,
static unsigned long damon_feed_loop_next_input(unsigned long last_input,
unsigned long score)
{
const unsigned long goal = 10000;
unsigned long score_goal_diff = max(goal, score) - min(goal, score);
unsigned long score_goal_diff_bp = score_goal_diff * 10000 / goal;
^^^^^^^^^^^^^^^
Since goal == 10000, the last statement has no effect.
The diff below should address the problem for the most part, though the calculations
may still overflow for large parameter values.
Guenter
---
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 7a87628b76ab..ab9c625ee2e0 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -10,6 +10,7 @@
#include <linux/damon.h>
#include <linux/delay.h>
#include <linux/kthread.h>
+#include <linux/math64.h>
#include <linux/mm.h>
#include <linux/psi.h>
#include <linux/slab.h>
@@ -1451,8 +1452,7 @@ static unsigned long damon_feed_loop_next_input(unsigned long last_input,
{
const unsigned long goal = 10000;
unsigned long score_goal_diff = max(goal, score) - min(goal, score);
- unsigned long score_goal_diff_bp = score_goal_diff * 10000 / goal;
- unsigned long compensation = last_input * score_goal_diff_bp / 10000;
+ unsigned long compensation = div_u64((unsigned long long)last_input * score_goal_diff, 10000);
/* Set minimum input as 10000 to avoid compensation be zero */
const unsigned long min_input = 10000;
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-04 17:53 ` SeongJae Park
2024-09-04 19:06 ` Guenter Roeck
2024-09-04 21:04 ` Guenter Roeck
@ 2024-09-05 6:14 ` SeongJae Park
2 siblings, 0 replies; 12+ messages in thread
From: SeongJae Park @ 2024-09-05 6:14 UTC (permalink / raw)
To: SeongJae Park; +Cc: Guenter Roeck, damon
On Wed, 4 Sep 2024 10:53:06 -0700 SeongJae Park <sj@kernel.org> wrote:
> Hi Guenter,
>
> On Tue, 3 Sep 2024 19:46:44 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>
> > On 9/3/24 19:24, Guenter Roeck wrote:
> > > On 9/3/24 18:00, SeongJae Park wrote:
> > >> Hi Guenter,
> > >>
> > >> On Tue, 3 Sep 2024 12:23:50 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> when trying to run DAMON kunit tests with spinlock debugging enabled,
> > >>> I get the following error messages.
> > >>>
> > >>> BUG: spinlock bad magic on CPU#0, kunit_try_catch/209
> > >>> lock: mm.13+0x40/0x840, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
> > >>> CPU: 0 UID: 0 PID: 209 Comm: kunit_try_catch Tainted: G N 6.11.0-rc6-00148-g178297ec52d6 #1
> > >>> Tainted: [N]=TEST
> > >>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > >>> Call Trace:
> > >>> <TASK>
> > >>> dump_stack_lvl+0x9e/0xe0
> > >>> do_raw_spin_lock+0x63/0xb0
> > >>> damon_test_three_regions_in_vmas+0x121/0x450S
> > >>> ...
> > >>>
> > >>> This happens because damon_test_three_regions_in_vmas() calls
> > >>> mt_init_flags() with MM_MT_FLAGS, which sets MT_FLAGS_LOCK_EXTERN.
> > >>> Because of this, the spinlock is not initialized, which then results
> > >>> in the error message when mas_lock() is called during the test.
> > >>>
> > >>> Unfortunately, replacing MM_MT_FLAGS with MT_FLAGS_ALLOC_RANGE |
> > >>> MT_FLAGS_USE_RCU or just with MT_FLAGS_ALLOC_RANGE doesn't help;
> > >>> it results in various "suspicious RCU usage" messages.
> > >>
> > >> Thank you very much for this report with the grateful detailed investigation!
> > >> I was able to make and send a fix[1], thanks to the investigation. Seems it
> > >> may need some more revisions[2], though. I'll continue discussion on the
> > >> thread.
> > >>
> > >> [1] https://lore.kernel.org/20240904004534.1189-1-sj@kernel.org
> > >> [2] https://lore.kernel.org/jy6263g6em4jsdhp6tknmh2cljpuvq652kvcet4ko3z2xt7pym@ltc5h5twsszu/
> > >>
> > >>
> > >
> > > Some more problems. I tried your patch on various architectures.
> > > Unfortunately, that didn't turn out well. See below for some of the failures.
>
> Nice finding, thank you for reporting this! Just to be clear, seems this is
> independent from the originally reported issue or the fix for it. Please let
> me know if not.
>
> > >
> > > [ I am not sure I understand the code in damon_test_nr_accesses_to_accesses_bp(),
> > > especially
> > > .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
> > > What happens if sizeof(int) == sizeof(long) ?
> > > Unless I am really missing something, .aggr_interval will be
> > > 0 in that case.
>
> I agree. aggr_interval would be zero in the case. And in this case,
> damon_max_nr_accesses(attrs) in damon_nr_accesses_to_accesses_bp() returns
> zero, so damon_nr_accesses_to_accesses_bp() will get a divide by zero problem.
> That is, the problem happens when damon_nr_accesses_to_accesses_bp() is called
> with zero aggr_interval, regardless of the architecture.
>
> > > ]
> > >
> > The diff below fixes the problem in damon_test_nr_accesses_to_accesses_bp().
> > Obviously I don't know if this is the correct fix.
> >
> > Guenter
> >
> > ---
> > diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h
> > index 0cee634f3544..8072f83b2bba 100644
> > --- a/mm/damon/core-test.h
> > +++ b/mm/damon/core-test.h
> > @@ -309,6 +309,9 @@ static void damon_test_nr_accesses_to_accesses_bp(struct kunit *test)
> > .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10
> > };
> >
> > + if (sizeof(int) == sizeof(long))
> > + kunit_skip(test, "Test requires sizeof(int) != sizeof(long)");
> > +
> > KUNIT_EXPECT_EQ(test, damon_nr_accesses_to_accesses_bp(123, &attrs), 0);
> > }
>
> So, above fix would work, but I think doing the skip if the resulting
> aggr_interval is zero would make more sense. Having more comments would also
> be nice.
Now I think this is enough fix for this issue, since this test code is only
place that can trigger the issue. More detail below.
>
> Nonetheless, if the issue can be triggered outside of the test, we may need to
> think a better fundamental fix. And seems it is the case.
>
> Sharing current incomplete findings here first.
>
> First of all, users can set the zero aggr_interval. And the problematic
> function (damon_nr_accesses_to_accesses_bp()) is called when user updates the
> intervals while DAMON is running, with the old intervals attributes. Hence,
> users could trigger the issue by making DAMON to run with zero aggregation
> interval, and then updating the intervals to any values.
It turned out this is not the case. Only the test code can trigger this issue.
The only possible callstack for the problematic function is as below.
damon_nr_accesses_to_accesses_bp()
damon_nr_accesses_for_new_attrs()
damon_update_monitoring_result()
damon_update_monitoring_results()
damon_set_attrs()
And damon_update_monitoring_results() does not call
damon_update_monitoring_result() if the aggregation interval of the old context
is zero, as below.
static void damon_update_monitoring_results(struct damon_ctx *ctx,
struct damon_attrs *new_attrs)
{
struct damon_attrs *old_attrs = &ctx->attrs;
[...]
/* if any interval is zero, simply forgive conversion */
if (!old_attrs->sample_interval || !old_attrs->aggr_interval ||
!new_attrs->sample_interval ||
!new_attrs->aggr_interval)
return;
damon_for_each_target(t, ctx)
damon_for_each_region(r, t)
damon_update_monitoring_result(
r, old_attrs, new_attrs);
}
Hence, again, only the test code can invoke damon_nr_accesses_to_accesses_bp()
with zero aggregation interval. Hence fixing the test code and adding comments
would be sufficient for now.
Thanks,
SJ
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-04 19:06 ` Guenter Roeck
@ 2024-09-05 16:30 ` SeongJae Park
0 siblings, 0 replies; 12+ messages in thread
From: SeongJae Park @ 2024-09-05 16:30 UTC (permalink / raw)
To: Guenter Roeck; +Cc: SeongJae Park, damon
On Wed, 4 Sep 2024 12:06:29 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> On 9/4/24 10:53, SeongJae Park wrote:
> > On Tue, 3 Sep 2024 19:46:44 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> >> On 9/3/24 19:24, Guenter Roeck wrote:
> >>> On 9/3/24 18:00, SeongJae Park wrote:
> >>>> Hi Guenter,
> >>>>
> >>>> On Tue, 3 Sep 2024 12:23:50 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
[...]
> >>> Some more problems. I tried your patch on various architectures.
> >>> Unfortunately, that didn't turn out well. See below for some of the failures.
> >
> > Nice finding, thank you for reporting this! Just to be clear, seems this is
> > independent from the originally reported issue or the fix for it. Please let
> > me know if not.
> >
>
> It is a different problem.
Thank you for confirming :)
[...]
> > I think doing the skip if the resulting
> > aggr_interval is zero would make more sense. Having more comments would also
> > be nice.
> >
>
> Sure; the above was not intended to be a final (or even acceptable) fix,
> just a workaround to let me run tests with damon kunit tests enabled
> on 32-bit systems.
Thank you for kindly agreeing and clarifying. I just posted the fix:
https://lore.kernel.org/20240905162423.74053-1-sj@kernel.org
Hopefully the patch explains more details and my opinions.
Thanks,
SJ
>
> Thanks,
> Guenter
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-04 21:04 ` Guenter Roeck
@ 2024-09-05 17:35 ` SeongJae Park
2024-09-05 18:04 ` Guenter Roeck
0 siblings, 1 reply; 12+ messages in thread
From: SeongJae Park @ 2024-09-05 17:35 UTC (permalink / raw)
To: Guenter Roeck; +Cc: SeongJae Park, damon
Hi Guenter,
On Wed, 4 Sep 2024 14:04:03 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
[...]
> Here is another problem. With all other problems fixed, when running on a 32-bit system:
>
> # damon_test_feed_loop_next_input: EXPECTATION FAILED at mm/damon/core-test.h:488
> Expected damon_feed_loop_next_input(last_input, 200) > damon_feed_loop_next_input(last_input, 2000), but
> damon_feed_loop_next_input(last_input, 200) == 923006 (0xe157e)
> damon_feed_loop_next_input(last_input, 2000) == 1190503 (0x122a67)
> # damon_test_feed_loop_next_input: pass:0 fail:1 skip:0 total:1
> not ok 16 damon_test_feed_loop_next_input
> # damon: pass:14 fail:1 skip:1 total:16
>
> Problem is that both damon_feed_loop_next_input(900000, 200) and
> damon_feed_loop_next_input(900000, 2000) overflow on 32-bit systems.
Thank you for reporting this yet another important issue!
>
> Also,
>
> static unsigned long damon_feed_loop_next_input(unsigned long last_input,
> unsigned long score)
> {
> const unsigned long goal = 10000;
> unsigned long score_goal_diff = max(goal, score) - min(goal, score);
> unsigned long score_goal_diff_bp = score_goal_diff * 10000 / goal;
> ^^^^^^^^^^^^^^^
>
> Since goal == 10000, the last statement has no effect.
You're right, thank you.
>
> The diff below should address the problem for the most part, though the calculations
> may still overflow for large parameter values.
>
> Guenter
>
> ---
> diff --git a/mm/damon/core.c b/mm/damon/core.c
> index 7a87628b76ab..ab9c625ee2e0 100644
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
> @@ -10,6 +10,7 @@
> #include <linux/damon.h>
> #include <linux/delay.h>
> #include <linux/kthread.h>
> +#include <linux/math64.h>
> #include <linux/mm.h>
> #include <linux/psi.h>
> #include <linux/slab.h>
> @@ -1451,8 +1452,7 @@ static unsigned long damon_feed_loop_next_input(unsigned long last_input,
> {
> const unsigned long goal = 10000;
> unsigned long score_goal_diff = max(goal, score) - min(goal, score);
> - unsigned long score_goal_diff_bp = score_goal_diff * 10000 / goal;
> - unsigned long compensation = last_input * score_goal_diff_bp / 10000;
> + unsigned long compensation = div_u64((unsigned long long)last_input * score_goal_diff, 10000);
> /* Set minimum input as 10000 to avoid compensation be zero */
> const unsigned long min_input = 10000;
Thank you for kindly sharing this, too. Your detailed reports of symptoms,
root cause, and possible fix help a lot.
As you also mentioned, this function may better to be significantly re-written
with overflows in mind. However, I have no enough time and setup to
sufficiently test the fix for now, due to some personal reasons. Also, the
issue is important, but not time-critical for now, in my humble opinion.
Please let me know if someone has different opinions.
So, I made one possible fix off the top of my head, and posted as an RFC patch:
https://lore.kernel.org/20240905172405.46995-1-sj@kernel.org
I only tested that against the kunit test on my 64bit machine. After I get
some time and setup (maybe 1-2 weeks later), I will take more time to review
the code, do more tests, and post RFC-dropped version.
Thanks,
SJ
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-05 17:35 ` SeongJae Park
@ 2024-09-05 18:04 ` Guenter Roeck
2024-09-05 18:08 ` SeongJae Park
0 siblings, 1 reply; 12+ messages in thread
From: Guenter Roeck @ 2024-09-05 18:04 UTC (permalink / raw)
To: SeongJae Park; +Cc: damon
On 9/5/24 10:35, SeongJae Park wrote:
> Hi Guenter,
>
> On Wed, 4 Sep 2024 14:04:03 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
>
> [...]
>> Here is another problem. With all other problems fixed, when running on a 32-bit system:
>>
>> # damon_test_feed_loop_next_input: EXPECTATION FAILED at mm/damon/core-test.h:488
>> Expected damon_feed_loop_next_input(last_input, 200) > damon_feed_loop_next_input(last_input, 2000), but
>> damon_feed_loop_next_input(last_input, 200) == 923006 (0xe157e)
>> damon_feed_loop_next_input(last_input, 2000) == 1190503 (0x122a67)
>> # damon_test_feed_loop_next_input: pass:0 fail:1 skip:0 total:1
>> not ok 16 damon_test_feed_loop_next_input
>> # damon: pass:14 fail:1 skip:1 total:16
>>
>> Problem is that both damon_feed_loop_next_input(900000, 200) and
>> damon_feed_loop_next_input(900000, 2000) overflow on 32-bit systems.
>
> Thank you for reporting this yet another important issue!
>
>>
>> Also,
>>
>> static unsigned long damon_feed_loop_next_input(unsigned long last_input,
>> unsigned long score)
>> {
>> const unsigned long goal = 10000;
>> unsigned long score_goal_diff = max(goal, score) - min(goal, score);
>> unsigned long score_goal_diff_bp = score_goal_diff * 10000 / goal;
>> ^^^^^^^^^^^^^^^
>>
>> Since goal == 10000, the last statement has no effect.
>
> You're right, thank you.
>
>>
>> The diff below should address the problem for the most part, though the calculations
>> may still overflow for large parameter values.
>>
>> Guenter
>>
>> ---
>> diff --git a/mm/damon/core.c b/mm/damon/core.c
>> index 7a87628b76ab..ab9c625ee2e0 100644
>> --- a/mm/damon/core.c
>> +++ b/mm/damon/core.c
>> @@ -10,6 +10,7 @@
>> #include <linux/damon.h>
>> #include <linux/delay.h>
>> #include <linux/kthread.h>
>> +#include <linux/math64.h>
>> #include <linux/mm.h>
>> #include <linux/psi.h>
>> #include <linux/slab.h>
>> @@ -1451,8 +1452,7 @@ static unsigned long damon_feed_loop_next_input(unsigned long last_input,
>> {
>> const unsigned long goal = 10000;
>> unsigned long score_goal_diff = max(goal, score) - min(goal, score);
>> - unsigned long score_goal_diff_bp = score_goal_diff * 10000 / goal;
>> - unsigned long compensation = last_input * score_goal_diff_bp / 10000;
>> + unsigned long compensation = div_u64((unsigned long long)last_input * score_goal_diff, 10000);
>> /* Set minimum input as 10000 to avoid compensation be zero */
>> const unsigned long min_input = 10000;
>
> Thank you for kindly sharing this, too. Your detailed reports of symptoms,
> root cause, and possible fix help a lot.
>
> As you also mentioned, this function may better to be significantly re-written
> with overflows in mind. However, I have no enough time and setup to
> sufficiently test the fix for now, due to some personal reasons. Also, the
> issue is important, but not time-critical for now, in my humble opinion.
> Please let me know if someone has different opinions.
>
> So, I made one possible fix off the top of my head, and posted as an RFC patch:
> https://lore.kernel.org/20240905172405.46995-1-sj@kernel.org
>
> I only tested that against the kunit test on my 64bit machine. After I get
> some time and setup (maybe 1-2 weeks later), I will take more time to review
> the code, do more tests, and post RFC-dropped version.
>
No worries. I queued the RFC patch up in my testbed, so we should get results
for all targets later today (or maybe tomorrow, the testbed is busy with stable
release tests).
Guenter
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Problems running DAMON kunit tests with spinlock debugging enabled
2024-09-05 18:04 ` Guenter Roeck
@ 2024-09-05 18:08 ` SeongJae Park
0 siblings, 0 replies; 12+ messages in thread
From: SeongJae Park @ 2024-09-05 18:08 UTC (permalink / raw)
To: Guenter Roeck; +Cc: SeongJae Park, damon
On Thu, 5 Sep 2024 11:04:51 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> On 9/5/24 10:35, SeongJae Park wrote:
> > Hi Guenter,
> >
> > On Wed, 4 Sep 2024 14:04:03 -0700 Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > [...]
> >> Here is another problem. With all other problems fixed, when running on a 32-bit system:
[...]
> > As you also mentioned, this function may better to be significantly re-written
> > with overflows in mind. However, I have no enough time and setup to
> > sufficiently test the fix for now, due to some personal reasons. Also, the
> > issue is important, but not time-critical for now, in my humble opinion.
> > Please let me know if someone has different opinions.
> >
> > So, I made one possible fix off the top of my head, and posted as an RFC patch:
> > https://lore.kernel.org/20240905172405.46995-1-sj@kernel.org
> >
> > I only tested that against the kunit test on my 64bit machine. After I get
> > some time and setup (maybe 1-2 weeks later), I will take more time to review
> > the code, do more tests, and post RFC-dropped version.
> >
>
> No worries. I queued the RFC patch up in my testbed, so we should get results
> for all targets later today (or maybe tomorrow, the testbed is busy with stable
> release tests).
No problem, please take your time :)
Thanks,
SJ
>
> Guenter
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-09-05 18:08 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-03 19:23 Problems running DAMON kunit tests with spinlock debugging enabled Guenter Roeck
2024-09-04 1:00 ` SeongJae Park
2024-09-04 2:24 ` Guenter Roeck
2024-09-04 2:46 ` Guenter Roeck
2024-09-04 17:53 ` SeongJae Park
2024-09-04 19:06 ` Guenter Roeck
2024-09-05 16:30 ` SeongJae Park
2024-09-04 21:04 ` Guenter Roeck
2024-09-05 17:35 ` SeongJae Park
2024-09-05 18:04 ` Guenter Roeck
2024-09-05 18:08 ` SeongJae Park
2024-09-05 6:14 ` SeongJae Park
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.