* [BUG] -next lockdep invalid wait context
@ 2024-10-30 21:05 Paul E. McKenney
2024-10-30 21:48 ` Vlastimil Babka
0 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2024-10-30 21:05 UTC (permalink / raw)
To: linux-next, linux-kernel, kasan-dev, linux-mm
Cc: sfr, bigeasy, longman, boqun.feng, elver, cl, penberg, rientjes,
iamjoonsoo.kim, akpm, vbabka
Hello!
The next-20241030 release gets the splat shown below when running
scftorture in a preemptible kernel. This bisects to this commit:
560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
Except that all this is doing is enabling lockdep to find the problem.
The obvious way to fix this is to make the kmem_cache structure's
cpu_slab field's ->lock be a raw spinlock, but this might not be what
we want for real-time response.
This can be reproduced deterministically as follows:
tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
I doubt that the number of CPUs or amount of memory makes any difference,
but that is what I used.
Thoughts?
Thanx, Paul
------------------------------------------------------------------------
[ 35.659746] =============================
[ 35.659746] [ BUG: Invalid wait context ]
[ 35.659746] 6.12.0-rc5-next-20241029 #57233 Not tainted
[ 35.659746] -----------------------------
[ 35.659746] swapper/37/0 is trying to lock:
[ 35.659746] ffff8881ff4bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x49/0x1b0
[ 35.659746] other info that might help us debug this:
[ 35.659746] context-{2:2}
[ 35.659746] no locks held by swapper/37/0.
[ 35.659746] stack backtrace:
[ 35.659746] CPU: 37 UID: 0 PID: 0 Comm: swapper/37 Not tainted 6.12.0-rc5-next-20241029 #57233
[ 35.659746] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 35.659746] Call Trace:
[ 35.659746] <IRQ>
[ 35.659746] dump_stack_lvl+0x68/0xa0
[ 35.659746] __lock_acquire+0x8fd/0x3b90
[ 35.659746] ? start_secondary+0x113/0x210
[ 35.659746] ? __pfx___lock_acquire+0x10/0x10
[ 35.659746] ? __pfx___lock_acquire+0x10/0x10
[ 35.659746] ? __pfx___lock_acquire+0x10/0x10
[ 35.659746] ? __pfx___lock_acquire+0x10/0x10
[ 35.659746] lock_acquire+0x19b/0x520
[ 35.659746] ? put_cpu_partial+0x49/0x1b0
[ 35.659746] ? __pfx_lock_acquire+0x10/0x10
[ 35.659746] ? __pfx_lock_release+0x10/0x10
[ 35.659746] ? lock_release+0x20f/0x6f0
[ 35.659746] ? __pfx_lock_release+0x10/0x10
[ 35.659746] ? lock_release+0x20f/0x6f0
[ 35.659746] ? kasan_save_track+0x14/0x30
[ 35.659746] put_cpu_partial+0x52/0x1b0
[ 35.659746] ? put_cpu_partial+0x49/0x1b0
[ 35.659746] ? __pfx_scf_handler_1+0x10/0x10
[ 35.659746] __flush_smp_call_function_queue+0x2d2/0x600
[ 35.659746] __sysvec_call_function_single+0x50/0x280
[ 35.659746] sysvec_call_function_single+0x6b/0x80
[ 35.659746] </IRQ>
[ 35.659746] <TASK>
[ 35.659746] asm_sysvec_call_function_single+0x1a/0x20
[ 35.659746] RIP: 0010:default_idle+0xf/0x20
[ 35.659746] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 33 80 3e 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[ 35.659746] RSP: 0018:ffff888100a9fe68 EFLAGS: 00000202
[ 35.659746] RAX: 0000000000040d75 RBX: 0000000000000025 RCX: ffffffffab83df45
[ 35.659746] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa8a5f7ba
[ 35.659746] RBP: dffffc0000000000 R08: 0000000000000001 R09: ffffed103fe96c3c
[ 35.659746] R10: ffff8881ff4b61e3 R11: 0000000000000000 R12: ffffffffad13f1d0
[ 35.659746] R13: 1ffff11020153fd2 R14: 0000000000000000 R15: 0000000000000000
[ 35.659746] ? ct_kernel_exit.constprop.0+0xc5/0xf0
[ 35.659746] ? do_idle+0x2fa/0x3b0
[ 35.659746] default_idle_call+0x6d/0xb0
[ 35.659746] do_idle+0x2fa/0x3b0
[ 35.659746] ? __pfx_do_idle+0x10/0x10
[ 35.659746] cpu_startup_entry+0x4f/0x60
[ 35.659746] start_secondary+0x1bc/0x210
[ 35.659746] common_startup_64+0x12c/0x138
[ 35.659746] </TASK>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [BUG] -next lockdep invalid wait context
2024-10-30 21:05 [BUG] -next lockdep invalid wait context Paul E. McKenney
@ 2024-10-30 21:48 ` Vlastimil Babka
2024-10-30 22:34 ` Marco Elver
0 siblings, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2024-10-30 21:48 UTC (permalink / raw)
To: paulmck, linux-next, linux-kernel, kasan-dev, linux-mm
Cc: sfr, bigeasy, longman, boqun.feng, elver, cl, penberg, rientjes,
iamjoonsoo.kim, akpm
On 10/30/24 22:05, Paul E. McKenney wrote:
> Hello!
Hi!
> The next-20241030 release gets the splat shown below when running
> scftorture in a preemptible kernel. This bisects to this commit:
>
> 560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
>
> Except that all this is doing is enabling lockdep to find the problem.
>
> The obvious way to fix this is to make the kmem_cache structure's
> cpu_slab field's ->lock be a raw spinlock, but this might not be what
> we want for real-time response.
But it's a local_lock, not a spinlock, and it's doing local_lock_irqsave().
I'm confused about what's happening here; the code has been like this for
years now.
> This can be reproduced deterministically as follows:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
>
> I doubt that the number of CPUs or amount of memory makes any difference,
> but that is what I used.
>
> Thoughts?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> [ 35.659746] =============================
> [ 35.659746] [ BUG: Invalid wait context ]
> [ 35.659746] 6.12.0-rc5-next-20241029 #57233 Not tainted
> [ 35.659746] -----------------------------
> [ 35.659746] swapper/37/0 is trying to lock:
> [ 35.659746] ffff8881ff4bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x49/0x1b0
> [ 35.659746] other info that might help us debug this:
> [ 35.659746] context-{2:2}
> [ 35.659746] no locks held by swapper/37/0.
> [ 35.659746] stack backtrace:
> [ 35.659746] CPU: 37 UID: 0 PID: 0 Comm: swapper/37 Not tainted 6.12.0-rc5-next-20241029 #57233
> [ 35.659746] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [ 35.659746] Call Trace:
> [ 35.659746] <IRQ>
> [ 35.659746] dump_stack_lvl+0x68/0xa0
> [ 35.659746] __lock_acquire+0x8fd/0x3b90
> [ 35.659746] ? start_secondary+0x113/0x210
> [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> [ 35.659746] lock_acquire+0x19b/0x520
> [ 35.659746] ? put_cpu_partial+0x49/0x1b0
> [ 35.659746] ? __pfx_lock_acquire+0x10/0x10
> [ 35.659746] ? __pfx_lock_release+0x10/0x10
> [ 35.659746] ? lock_release+0x20f/0x6f0
> [ 35.659746] ? __pfx_lock_release+0x10/0x10
> [ 35.659746] ? lock_release+0x20f/0x6f0
> [ 35.659746] ? kasan_save_track+0x14/0x30
> [ 35.659746] put_cpu_partial+0x52/0x1b0
> [ 35.659746] ? put_cpu_partial+0x49/0x1b0
> [ 35.659746] ? __pfx_scf_handler_1+0x10/0x10
> [ 35.659746] __flush_smp_call_function_queue+0x2d2/0x600
How did we even get to put_cpu_partial() directly from flushing smp calls?
SLUB doesn't use them; it uses queue_work_on() for flushing, and that
flushing doesn't involve put_cpu_partial() AFAIK.
I think only a slab allocation or free can lead to put_cpu_partial(), which
would mean the backtrace is missing something, and that somebody does a slab
alloc/free from an smp callback, which I'd then assume isn't allowed?
> [ 35.659746] __sysvec_call_function_single+0x50/0x280
> [ 35.659746] sysvec_call_function_single+0x6b/0x80
> [ 35.659746] </IRQ>
> [ 35.659746] <TASK>
> [ 35.659746] asm_sysvec_call_function_single+0x1a/0x20
> [ 35.659746] RIP: 0010:default_idle+0xf/0x20
> [ 35.659746] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 33 80 3e 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
> [ 35.659746] RSP: 0018:ffff888100a9fe68 EFLAGS: 00000202
> [ 35.659746] RAX: 0000000000040d75 RBX: 0000000000000025 RCX: ffffffffab83df45
> [ 35.659746] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa8a5f7ba
> [ 35.659746] RBP: dffffc0000000000 R08: 0000000000000001 R09: ffffed103fe96c3c
> [ 35.659746] R10: ffff8881ff4b61e3 R11: 0000000000000000 R12: ffffffffad13f1d0
> [ 35.659746] R13: 1ffff11020153fd2 R14: 0000000000000000 R15: 0000000000000000
> [ 35.659746] ? ct_kernel_exit.constprop.0+0xc5/0xf0
> [ 35.659746] ? do_idle+0x2fa/0x3b0
> [ 35.659746] default_idle_call+0x6d/0xb0
> [ 35.659746] do_idle+0x2fa/0x3b0
> [ 35.659746] ? __pfx_do_idle+0x10/0x10
> [ 35.659746] cpu_startup_entry+0x4f/0x60
> [ 35.659746] start_secondary+0x1bc/0x210
> [ 35.659746] common_startup_64+0x12c/0x138
> [ 35.659746] </TASK>
* Re: [BUG] -next lockdep invalid wait context
2024-10-30 21:48 ` Vlastimil Babka
@ 2024-10-30 22:34 ` Marco Elver
2024-10-30 23:04 ` Boqun Feng
2024-10-30 23:10 ` Paul E. McKenney
0 siblings, 2 replies; 38+ messages in thread
From: Marco Elver @ 2024-10-30 22:34 UTC (permalink / raw)
To: Vlastimil Babka
Cc: paulmck, linux-next, linux-kernel, kasan-dev, linux-mm, sfr,
bigeasy, longman, boqun.feng, cl, penberg, rientjes,
iamjoonsoo.kim, akpm
On Wed, Oct 30, 2024 at 10:48PM +0100, Vlastimil Babka wrote:
> On 10/30/24 22:05, Paul E. McKenney wrote:
> > Hello!
>
> Hi!
>
> > The next-20241030 release gets the splat shown below when running
> > scftorture in a preemptible kernel. This bisects to this commit:
> >
> > 560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
> >
> > Except that all this is doing is enabling lockdep to find the problem.
> >
> > The obvious way to fix this is to make the kmem_cache structure's
> > cpu_slab field's ->lock be a raw spinlock, but this might not be what
> > we want for real-time response.
>
> But it's a local_lock, not a spinlock, and it's doing local_lock_irqsave().
> I'm confused about what's happening here; the code has been like this for
> years now.
>
> > This can be reproduced deterministically as follows:
> >
> > tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
> >
> > I doubt that the number of CPUs or amount of memory makes any difference,
> > but that is what I used.
> >
> > Thoughts?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > [ 35.659746] =============================
> > [ 35.659746] [ BUG: Invalid wait context ]
> > [ 35.659746] 6.12.0-rc5-next-20241029 #57233 Not tainted
> > [ 35.659746] -----------------------------
> > [ 35.659746] swapper/37/0 is trying to lock:
> > [ 35.659746] ffff8881ff4bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x49/0x1b0
> > [ 35.659746] other info that might help us debug this:
> > [ 35.659746] context-{2:2}
> > [ 35.659746] no locks held by swapper/37/0.
> > [ 35.659746] stack backtrace:
> > [ 35.659746] CPU: 37 UID: 0 PID: 0 Comm: swapper/37 Not tainted 6.12.0-rc5-next-20241029 #57233
> > [ 35.659746] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > [ 35.659746] Call Trace:
> > [ 35.659746] <IRQ>
> > [ 35.659746] dump_stack_lvl+0x68/0xa0
> > [ 35.659746] __lock_acquire+0x8fd/0x3b90
> > [ 35.659746] ? start_secondary+0x113/0x210
> > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > [ 35.659746] lock_acquire+0x19b/0x520
> > [ 35.659746] ? put_cpu_partial+0x49/0x1b0
> > [ 35.659746] ? __pfx_lock_acquire+0x10/0x10
> > [ 35.659746] ? __pfx_lock_release+0x10/0x10
> > [ 35.659746] ? lock_release+0x20f/0x6f0
> > [ 35.659746] ? __pfx_lock_release+0x10/0x10
> > [ 35.659746] ? lock_release+0x20f/0x6f0
> > [ 35.659746] ? kasan_save_track+0x14/0x30
> > [ 35.659746] put_cpu_partial+0x52/0x1b0
> > [ 35.659746] ? put_cpu_partial+0x49/0x1b0
> > [ 35.659746] ? __pfx_scf_handler_1+0x10/0x10
> > [ 35.659746] __flush_smp_call_function_queue+0x2d2/0x600
>
> How did we even get to put_cpu_partial() directly from flushing smp calls?
> SLUB doesn't use them; it uses queue_work_on() for flushing, and that
> flushing doesn't involve put_cpu_partial() AFAIK.
>
> I think only a slab allocation or free can lead to put_cpu_partial(), which
> would mean the backtrace is missing something, and that somebody does a slab
> alloc/free from an smp callback, which I'd then assume isn't allowed?
Tail-call optimization is hiding the caller. Compiling with
-fno-optimize-sibling-calls exposes the caller. This gives the full
picture:
[ 40.321505] =============================
[ 40.322711] [ BUG: Invalid wait context ]
[ 40.323927] 6.12.0-rc5-next-20241030-dirty #4 Not tainted
[ 40.325502] -----------------------------
[ 40.326653] cpuhp/47/253 is trying to lock:
[ 40.327869] ffff8881ff9bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x48/0x1a0
[ 40.330081] other info that might help us debug this:
[ 40.331540] context-{2:2}
[ 40.332305] 3 locks held by cpuhp/47/253:
[ 40.333468] #0: ffffffffae6e6910 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x590
[ 40.336048] #1: ffffffffae6e9060 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0xe0/0x590
[ 40.338607] #2: ffff8881002a6948 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x78/0x100
[ 40.341454] stack backtrace:
[ 40.342291] CPU: 47 UID: 0 PID: 253 Comm: cpuhp/47 Not tainted 6.12.0-rc5-next-20241030-dirty #4
[ 40.344807] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 40.347482] Call Trace:
[ 40.348199] <IRQ>
[ 40.348827] dump_stack_lvl+0x6b/0xa0
[ 40.349899] dump_stack+0x10/0x20
[ 40.350850] __lock_acquire+0x900/0x4010
[ 40.360290] lock_acquire+0x191/0x4f0
[ 40.364850] put_cpu_partial+0x51/0x1a0
[ 40.368341] scf_handler+0x1bd/0x290
[ 40.370590] scf_handler_1+0x4e/0xb0
[ 40.371630] __flush_smp_call_function_queue+0x2dd/0x600
[ 40.373142] generic_smp_call_function_single_interrupt+0xe/0x20
[ 40.374801] __sysvec_call_function_single+0x50/0x280
[ 40.376214] sysvec_call_function_single+0x6c/0x80
[ 40.377543] </IRQ>
[ 40.378142] <TASK>
And scf_handler does indeed tail-call kfree:
static void scf_handler(void *scfc_in)
{
[...]
} else {
kfree(scfcp);
}
}
* Re: [BUG] -next lockdep invalid wait context
2024-10-30 22:34 ` Marco Elver
@ 2024-10-30 23:04 ` Boqun Feng
2024-10-30 23:10 ` Paul E. McKenney
1 sibling, 0 replies; 38+ messages in thread
From: Boqun Feng @ 2024-10-30 23:04 UTC (permalink / raw)
To: Marco Elver
Cc: Vlastimil Babka, paulmck, linux-next, linux-kernel, kasan-dev,
linux-mm, sfr, bigeasy, longman, cl, penberg, rientjes,
iamjoonsoo.kim, akpm
On Wed, Oct 30, 2024 at 11:34:08PM +0100, Marco Elver wrote:
> On Wed, Oct 30, 2024 at 10:48PM +0100, Vlastimil Babka wrote:
> > On 10/30/24 22:05, Paul E. McKenney wrote:
> > > Hello!
> >
> > Hi!
> >
> > > The next-20241030 release gets the splat shown below when running
> > > scftorture in a preemptible kernel. This bisects to this commit:
> > >
> > > 560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
> > >
> > > Except that all this is doing is enabling lockdep to find the problem.
> > >
> > > The obvious way to fix this is to make the kmem_cache structure's
> > > cpu_slab field's ->lock be a raw spinlock, but this might not be what
> > > we want for real-time response.
> >
> > But it's a local_lock, not a spinlock, and it's doing
> > local_lock_irqsave(). I'm confused about what's happening here; the code
> > has been like this for years now.
> >
> > > This can be reproduced deterministically as follows:
> > >
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
> > >
> > > I doubt that the number of CPUs or amount of memory makes any difference,
> > > but that is what I used.
> > >
> > > Thoughts?
> > >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > [ 35.659746] =============================
> > > [ 35.659746] [ BUG: Invalid wait context ]
> > > [ 35.659746] 6.12.0-rc5-next-20241029 #57233 Not tainted
> > > [ 35.659746] -----------------------------
> > > [ 35.659746] swapper/37/0 is trying to lock:
> > > [ 35.659746] ffff8881ff4bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x49/0x1b0
> > > [ 35.659746] other info that might help us debug this:
> > > [ 35.659746] context-{2:2}
> > > [ 35.659746] no locks held by swapper/37/0.
> > > [ 35.659746] stack backtrace:
> > > [ 35.659746] CPU: 37 UID: 0 PID: 0 Comm: swapper/37 Not tainted 6.12.0-rc5-next-20241029 #57233
> > > [ 35.659746] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > > [ 35.659746] Call Trace:
> > > [ 35.659746] <IRQ>
> > > [ 35.659746] dump_stack_lvl+0x68/0xa0
> > > [ 35.659746] __lock_acquire+0x8fd/0x3b90
> > > [ 35.659746] ? start_secondary+0x113/0x210
> > > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > > [ 35.659746] lock_acquire+0x19b/0x520
> > > [ 35.659746] ? put_cpu_partial+0x49/0x1b0
> > > [ 35.659746] ? __pfx_lock_acquire+0x10/0x10
> > > [ 35.659746] ? __pfx_lock_release+0x10/0x10
> > > [ 35.659746] ? lock_release+0x20f/0x6f0
> > > [ 35.659746] ? __pfx_lock_release+0x10/0x10
> > > [ 35.659746] ? lock_release+0x20f/0x6f0
> > > [ 35.659746] ? kasan_save_track+0x14/0x30
> > > [ 35.659746] put_cpu_partial+0x52/0x1b0
> > > [ 35.659746] ? put_cpu_partial+0x49/0x1b0
> > > [ 35.659746] ? __pfx_scf_handler_1+0x10/0x10
> > > [ 35.659746] __flush_smp_call_function_queue+0x2d2/0x600
> >
> > How did we even get to put_cpu_partial() directly from flushing smp calls?
> > SLUB doesn't use them; it uses queue_work_on() for flushing, and that
> > flushing doesn't involve put_cpu_partial() AFAIK.
> >
> > I think only a slab allocation or free can lead to put_cpu_partial(), which
> > would mean the backtrace is missing something, and that somebody does a slab
> > alloc/free from an smp callback, which I'd then assume isn't allowed?
>
I think in this particular case, it is queuing a callback for
smp_call_function_single(), which doesn't have an interrupt handler
thread AFAICT; that means the callback will be executed in non-threaded
hardirq context, and therefore locks must be taken with real interrupts
disabled.
Using irq_work might be fine, because it has a handler thread (but the
torture test is for s(mp) c(all) f(unction), so replacing it with irq_work
is not really fixing it ;-)).
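[Editorial note: a minimal sketch of the irq_work idea above, not a patch
from this thread. The free_node field and the helper names are hypothetical;
on PREEMPT_RT, irq_work items not marked hard-irq run from a handler thread,
which is what would make the deferred kfree() legal there.]

```c
#include <linux/irq_work.h>
#include <linux/llist.h>
#include <linux/slab.h>

/* Hypothetical: assumes struct scf_check gains a free_node member. */
static LLIST_HEAD(deferred_free_list);

static void deferred_free_fn(struct irq_work *work)
{
	struct llist_node *n = llist_del_all(&deferred_free_list);
	struct scf_check *scfcp, *tmp;

	/* Runs from the irq_work handler thread on PREEMPT_RT. */
	llist_for_each_entry_safe(scfcp, tmp, n, free_node)
		kfree(scfcp);
}

static DEFINE_IRQ_WORK(deferred_free_work, deferred_free_fn);

/* Called from the smp_call_function() handler instead of kfree(scfcp). */
static void defer_kfree(struct scf_check *scfcp)
{
	llist_add(&scfcp->free_node, &deferred_free_list);
	irq_work_queue(&deferred_free_work);
}
```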
Regards,
Boqun
> Tail-call optimization is hiding the caller. Compiling with
> -fno-optimize-sibling-calls exposes the caller. This gives the full
> picture:
>
> [ 40.321505] =============================
> [ 40.322711] [ BUG: Invalid wait context ]
> [ 40.323927] 6.12.0-rc5-next-20241030-dirty #4 Not tainted
> [ 40.325502] -----------------------------
> [ 40.326653] cpuhp/47/253 is trying to lock:
> [ 40.327869] ffff8881ff9bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x48/0x1a0
> [ 40.330081] other info that might help us debug this:
> [ 40.331540] context-{2:2}
> [ 40.332305] 3 locks held by cpuhp/47/253:
> [ 40.333468] #0: ffffffffae6e6910 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x590
> [ 40.336048] #1: ffffffffae6e9060 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0xe0/0x590
> [ 40.338607] #2: ffff8881002a6948 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x78/0x100
> [ 40.341454] stack backtrace:
> [ 40.342291] CPU: 47 UID: 0 PID: 253 Comm: cpuhp/47 Not tainted 6.12.0-rc5-next-20241030-dirty #4
> [ 40.344807] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 40.347482] Call Trace:
> [ 40.348199] <IRQ>
> [ 40.348827] dump_stack_lvl+0x6b/0xa0
> [ 40.349899] dump_stack+0x10/0x20
> [ 40.350850] __lock_acquire+0x900/0x4010
> [ 40.360290] lock_acquire+0x191/0x4f0
> [ 40.364850] put_cpu_partial+0x51/0x1a0
> [ 40.368341] scf_handler+0x1bd/0x290
> [ 40.370590] scf_handler_1+0x4e/0xb0
> [ 40.371630] __flush_smp_call_function_queue+0x2dd/0x600
> [ 40.373142] generic_smp_call_function_single_interrupt+0xe/0x20
> [ 40.374801] __sysvec_call_function_single+0x50/0x280
> [ 40.376214] sysvec_call_function_single+0x6c/0x80
> [ 40.377543] </IRQ>
> [ 40.378142] <TASK>
>
> And scf_handler does indeed tail-call kfree:
>
> static void scf_handler(void *scfc_in)
> {
> [...]
> } else {
> kfree(scfcp);
> }
> }
* Re: [BUG] -next lockdep invalid wait context
2024-10-30 22:34 ` Marco Elver
2024-10-30 23:04 ` Boqun Feng
@ 2024-10-30 23:10 ` Paul E. McKenney
2024-10-31 7:21 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2024-10-30 23:10 UTC (permalink / raw)
To: Marco Elver
Cc: Vlastimil Babka, linux-next, linux-kernel, kasan-dev, linux-mm,
sfr, bigeasy, longman, boqun.feng, cl, penberg, rientjes,
iamjoonsoo.kim, akpm
On Wed, Oct 30, 2024 at 11:34:08PM +0100, Marco Elver wrote:
> On Wed, Oct 30, 2024 at 10:48PM +0100, Vlastimil Babka wrote:
> > On 10/30/24 22:05, Paul E. McKenney wrote:
> > > Hello!
> >
> > Hi!
> >
> > > The next-20241030 release gets the splat shown below when running
> > > scftorture in a preemptible kernel. This bisects to this commit:
> > >
> > > 560af5dc839e ("lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING")
> > >
> > > Except that all this is doing is enabling lockdep to find the problem.
> > >
> > > The obvious way to fix this is to make the kmem_cache structure's
> > > cpu_slab field's ->lock be a raw spinlock, but this might not be what
> > > we want for real-time response.
> >
> > But it's a local_lock, not a spinlock, and it's doing
> > local_lock_irqsave(). I'm confused about what's happening here; the code
> > has been like this for years now.
> >
> > > This can be reproduced deterministically as follows:
> > >
> > > tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
> > >
> > > I doubt that the number of CPUs or amount of memory makes any difference,
> > > but that is what I used.
> > >
> > > Thoughts?
> > >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > [ 35.659746] =============================
> > > [ 35.659746] [ BUG: Invalid wait context ]
> > > [ 35.659746] 6.12.0-rc5-next-20241029 #57233 Not tainted
> > > [ 35.659746] -----------------------------
> > > [ 35.659746] swapper/37/0 is trying to lock:
> > > [ 35.659746] ffff8881ff4bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x49/0x1b0
> > > [ 35.659746] other info that might help us debug this:
> > > [ 35.659746] context-{2:2}
> > > [ 35.659746] no locks held by swapper/37/0.
> > > [ 35.659746] stack backtrace:
> > > [ 35.659746] CPU: 37 UID: 0 PID: 0 Comm: swapper/37 Not tainted 6.12.0-rc5-next-20241029 #57233
> > > [ 35.659746] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > > [ 35.659746] Call Trace:
> > > [ 35.659746] <IRQ>
> > > [ 35.659746] dump_stack_lvl+0x68/0xa0
> > > [ 35.659746] __lock_acquire+0x8fd/0x3b90
> > > [ 35.659746] ? start_secondary+0x113/0x210
> > > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > > [ 35.659746] ? __pfx___lock_acquire+0x10/0x10
> > > [ 35.659746] lock_acquire+0x19b/0x520
> > > [ 35.659746] ? put_cpu_partial+0x49/0x1b0
> > > [ 35.659746] ? __pfx_lock_acquire+0x10/0x10
> > > [ 35.659746] ? __pfx_lock_release+0x10/0x10
> > > [ 35.659746] ? lock_release+0x20f/0x6f0
> > > [ 35.659746] ? __pfx_lock_release+0x10/0x10
> > > [ 35.659746] ? lock_release+0x20f/0x6f0
> > > [ 35.659746] ? kasan_save_track+0x14/0x30
> > > [ 35.659746] put_cpu_partial+0x52/0x1b0
> > > [ 35.659746] ? put_cpu_partial+0x49/0x1b0
> > > [ 35.659746] ? __pfx_scf_handler_1+0x10/0x10
> > > [ 35.659746] __flush_smp_call_function_queue+0x2d2/0x600
> >
> > How did we even get to put_cpu_partial() directly from flushing smp calls?
> > SLUB doesn't use them; it uses queue_work_on() for flushing, and that
> > flushing doesn't involve put_cpu_partial() AFAIK.
> >
> > I think only a slab allocation or free can lead to put_cpu_partial(), which
> > would mean the backtrace is missing something, and that somebody does a slab
> > alloc/free from an smp callback, which I'd then assume isn't allowed?
>
> Tail-call optimization is hiding the caller. Compiling with
> -fno-optimize-sibling-calls exposes the caller. This gives the full
> picture:
>
> [ 40.321505] =============================
> [ 40.322711] [ BUG: Invalid wait context ]
> [ 40.323927] 6.12.0-rc5-next-20241030-dirty #4 Not tainted
> [ 40.325502] -----------------------------
> [ 40.326653] cpuhp/47/253 is trying to lock:
> [ 40.327869] ffff8881ff9bf2f0 (&c->lock){....}-{3:3}, at: put_cpu_partial+0x48/0x1a0
> [ 40.330081] other info that might help us debug this:
> [ 40.331540] context-{2:2}
> [ 40.332305] 3 locks held by cpuhp/47/253:
> [ 40.333468] #0: ffffffffae6e6910 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0xe0/0x590
> [ 40.336048] #1: ffffffffae6e9060 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0xe0/0x590
> [ 40.338607] #2: ffff8881002a6948 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x78/0x100
> [ 40.341454] stack backtrace:
> [ 40.342291] CPU: 47 UID: 0 PID: 253 Comm: cpuhp/47 Not tainted 6.12.0-rc5-next-20241030-dirty #4
> [ 40.344807] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 40.347482] Call Trace:
> [ 40.348199] <IRQ>
> [ 40.348827] dump_stack_lvl+0x6b/0xa0
> [ 40.349899] dump_stack+0x10/0x20
> [ 40.350850] __lock_acquire+0x900/0x4010
> [ 40.360290] lock_acquire+0x191/0x4f0
> [ 40.364850] put_cpu_partial+0x51/0x1a0
> [ 40.368341] scf_handler+0x1bd/0x290
> [ 40.370590] scf_handler_1+0x4e/0xb0
> [ 40.371630] __flush_smp_call_function_queue+0x2dd/0x600
> [ 40.373142] generic_smp_call_function_single_interrupt+0xe/0x20
> [ 40.374801] __sysvec_call_function_single+0x50/0x280
> [ 40.376214] sysvec_call_function_single+0x6c/0x80
> [ 40.377543] </IRQ>
> [ 40.378142] <TASK>
>
> And scf_handler does indeed tail-call kfree:
>
> static void scf_handler(void *scfc_in)
> {
> [...]
> } else {
> kfree(scfcp);
> }
> }
So I need to avoid calling kfree() within an smp_call_function() handler?
Thanx, Paul
* Re: [BUG] -next lockdep invalid wait context
2024-10-30 23:10 ` Paul E. McKenney
@ 2024-10-31 7:21 ` Sebastian Andrzej Siewior
2024-10-31 7:35 ` Vlastimil Babka
2024-11-02 0:12 ` [BUG] -next lockdep invalid wait context Hillf Danton
0 siblings, 2 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-31 7:21 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Marco Elver, Vlastimil Babka, linux-next, linux-kernel, kasan-dev,
linux-mm, sfr, longman, boqun.feng, cl, penberg, rientjes,
iamjoonsoo.kim, akpm
On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
>
> So I need to avoid calling kfree() within an smp_call_function() handler?
Yes. No kmalloc()/kfree() in IRQ context.
> Thanx, Paul
Sebastian
* Re: [BUG] -next lockdep invalid wait context
2024-10-31 7:21 ` Sebastian Andrzej Siewior
@ 2024-10-31 7:35 ` Vlastimil Babka
2024-10-31 7:55 ` Sebastian Andrzej Siewior
2024-11-02 0:12 ` [BUG] -next lockdep invalid wait context Hillf Danton
1 sibling, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2024-10-31 7:35 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Paul E. McKenney
Cc: Marco Elver, linux-next, linux-kernel, kasan-dev, linux-mm, sfr,
longman, boqun.feng, cl, penberg, rientjes, iamjoonsoo.kim, akpm
On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
>>
>> So I need to avoid calling kfree() within an smp_call_function() handler?
>
> Yes. No kmalloc()/kfree() in IRQ context.
However, isn't it the case that the rule is really about hardirq context
on RT, and that most of these operations that run in IRQ context on !RT
move to threaded interrupt context on RT, so they are actually fine? Or
does an smp call callback run in hardirq context even on RT, so that it
really can't do those operations?
Vlastimil
>> Thanx, Paul
>
> Sebastian
* Re: [BUG] -next lockdep invalid wait context
2024-10-31 7:35 ` Vlastimil Babka
@ 2024-10-31 7:55 ` Sebastian Andrzej Siewior
2024-10-31 8:18 ` Vlastimil Babka
2024-10-31 17:50 ` Paul E. McKenney
0 siblings, 2 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-31 7:55 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Paul E. McKenney, Marco Elver, linux-next, linux-kernel,
kasan-dev, linux-mm, sfr, longman, boqun.feng, cl, penberg,
rientjes, iamjoonsoo.kim, akpm
On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
> On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> >>
> >> So I need to avoid calling kfree() within an smp_call_function() handler?
> >
> > Yes. No kmalloc()/kfree() in IRQ context.
>
>> However, isn't it the case that the rule is really about hardirq context
>> on RT, and that most of these operations that run in IRQ context on !RT
>> move to threaded interrupt context on RT, so they are actually fine? Or
>> does an smp call callback run in hardirq context even on RT, so that it
>> really can't do those operations?
Interrupt handlers registered via request_irq() are forced-threaded on RT,
so you can do kmalloc()/kfree() there. smp_call_function.*() callbacks, on
the other hand, are not threaded and are invoked directly in hardirq context.
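[Editorial note: an illustrative contrast, not code from this thread; the
handler names are made up. On PREEMPT_RT, the request_irq() handler below
runs in a kthread (absent IRQF_NO_THREAD), while the smp_call_function
callback always runs in raw hardirq context.]

```c
#include <linux/interrupt.h>
#include <linux/slab.h>
#include <linux/smp.h>

/* Forced-threaded on PREEMPT_RT: runs as a kthread, so freeing is fine. */
static irqreturn_t my_dev_irq_handler(int irq, void *dev_id)
{
	kfree(dev_id);
	return IRQ_HANDLED;
}

/* smp_call_function_single() callback: non-threaded hardirq context,
 * even on PREEMPT_RT, so no kmalloc()/kfree() here.
 */
static void my_smp_callback(void *info)
{
	/* kfree(info);  <- would trigger the lockdep splat above */
}
```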
> Vlastimil
>
Sebastian
* Re: [BUG] -next lockdep invalid wait context
2024-10-31 7:55 ` Sebastian Andrzej Siewior
@ 2024-10-31 8:18 ` Vlastimil Babka
2024-11-01 17:14 ` Paul E. McKenney
2024-10-31 17:50 ` Paul E. McKenney
1 sibling, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2024-10-31 8:18 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Paul E. McKenney
Cc: Marco Elver, linux-next, linux-kernel, kasan-dev, linux-mm, sfr,
longman, boqun.feng, cl, penberg, rientjes, iamjoonsoo.kim, akpm
On 10/31/24 08:55, Sebastian Andrzej Siewior wrote:
> On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
>> On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
>> > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
>> >>
>> >> So I need to avoid calling kfree() within an smp_call_function() handler?
>> >
>> > Yes. No kmalloc()/kfree() in IRQ context.
>>
>> However, isn't it the case that the rule is really about hardirq context
>> on RT, and that most of these operations that run in IRQ context on !RT
>> move to threaded interrupt context on RT, so they are actually fine? Or
>> does an smp call callback run in hardirq context even on RT, so that it
>> really can't do those operations?
>
> Interrupt handlers registered via request_irq() are forced-threaded on RT,
> so you can do kmalloc()/kfree() there. smp_call_function*() handlers, on
> the other hand, are not threaded and are invoked directly in IRQ context.
Makes sense, thanks.
So how come rcutorture wasn't deadlocking on RT already? Is it (or RCU
itself) doing anything differently there that avoids the kfree() from
the smp_call_function() handler?
>> Vlastimil
>>
> Sebastian
>
* Re: [BUG] -next lockdep invalid wait context
2024-10-31 7:55 ` Sebastian Andrzej Siewior
2024-10-31 8:18 ` Vlastimil Babka
@ 2024-10-31 17:50 ` Paul E. McKenney
2024-11-01 19:50 ` Boqun Feng
1 sibling, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2024-10-31 17:50 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Vlastimil Babka, Marco Elver, linux-next, linux-kernel, kasan-dev,
linux-mm, sfr, longman, boqun.feng, cl, penberg, rientjes,
iamjoonsoo.kim, akpm
On Thu, Oct 31, 2024 at 08:55:09AM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
> > On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> > > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> > >>
> > >> So I need to avoid calling kfree() within an smp_call_function() handler?
> > >
> > > Yes. No kmalloc()/ kfree() in IRQ context.
> >
> > However, isn't this the case that the rule is actually about hardirq context
> > on RT, and most of these operations that are in IRQ context on !RT become
> > the threaded interrupt context on RT, so they are actually fine? Or is smp
> > call callback a hardirq context on RT and thus it really can't do those
> > operations?
>
> Interrupt handlers registered via request_irq() are forced-threaded on RT,
> so you can do kmalloc()/kfree() there. smp_call_function*() handlers, on
> the other hand, are not threaded and are invoked directly in IRQ context.
OK, thank you all for the explanation! I will fix using Boqun's
suggestion of irq work, but avoiding the issue Boqun raises by invoking
the irq-work handler from the smp_call_function() handler.
It will be a few days before I get to this, so if there is a better way,
please do not keep it a secret!
Thanx, Paul
* Re: [BUG] -next lockdep invalid wait context
2024-10-31 8:18 ` Vlastimil Babka
@ 2024-11-01 17:14 ` Paul E. McKenney
0 siblings, 0 replies; 38+ messages in thread
From: Paul E. McKenney @ 2024-11-01 17:14 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Sebastian Andrzej Siewior, Marco Elver, linux-next, linux-kernel,
kasan-dev, linux-mm, sfr, longman, boqun.feng, cl, penberg,
rientjes, iamjoonsoo.kim, akpm
On Thu, Oct 31, 2024 at 09:18:52AM +0100, Vlastimil Babka wrote:
> On 10/31/24 08:55, Sebastian Andrzej Siewior wrote:
> > On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
> >> On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> >> > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> >> >>
> >> >> So I need to avoid calling kfree() within an smp_call_function() handler?
> >> >
> >> > Yes. No kmalloc()/ kfree() in IRQ context.
> >>
> >> However, isn't this the case that the rule is actually about hardirq context
> >> on RT, and most of these operations that are in IRQ context on !RT become
> >> the threaded interrupt context on RT, so they are actually fine? Or is smp
> >> call callback a hardirq context on RT and thus it really can't do those
> >> operations?
> >
> > Interrupt handlers registered via request_irq() are forced-threaded on RT,
> > so you can do kmalloc()/kfree() there. smp_call_function*() handlers, on
> > the other hand, are not threaded and are invoked directly in IRQ context.
>
> Makes sense, thanks.
>
> So how come rcutorture wasn't deadlocking on RT already? Is it (or RCU
> itself) doing anything differently there that avoids the kfree() from
> the smp_call_function() handler?
This was scftorture rather than rcutorture. While I know of others who
regularly run rcutorture, to the best of my knowledge I am the only one
who ever runs scftorture, which tests smp_call_function(), its friends,
and its diagnostics.
Thanx, Paul
* Re: [BUG] -next lockdep invalid wait context
2024-10-31 17:50 ` Paul E. McKenney
@ 2024-11-01 19:50 ` Boqun Feng
2024-11-01 19:54 ` [PATCH] scftorture: Use workqueue to free scf_check Boqun Feng
0 siblings, 1 reply; 38+ messages in thread
From: Boqun Feng @ 2024-11-01 19:50 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
penberg, rientjes, iamjoonsoo.kim, akpm
On Thu, Oct 31, 2024 at 10:50:29AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2024 at 08:55:09AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2024-10-31 08:35:45 [+0100], Vlastimil Babka wrote:
> > > On 10/31/24 08:21, Sebastian Andrzej Siewior wrote:
> > > > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> > > >>
> > > >> So I need to avoid calling kfree() within an smp_call_function() handler?
> > > >
> > > > Yes. No kmalloc()/ kfree() in IRQ context.
> > >
> > > However, isn't this the case that the rule is actually about hardirq context
> > > on RT, and most of these operations that are in IRQ context on !RT become
> > > the threaded interrupt context on RT, so they are actually fine? Or is smp
> > > call callback a hardirq context on RT and thus it really can't do those
> > > operations?
> >
> > Interrupt handlers registered via request_irq() are forced-threaded on RT,
> > so you can do kmalloc()/kfree() there. smp_call_function*() handlers, on
> > the other hand, are not threaded and are invoked directly in IRQ context.
>
> OK, thank you all for the explanation! I will fix using Boqun's
> suggestion of irq work, but avoiding the issue Boqun raises by invoking
I've tried fixing this with irq work; however, unlike a normal
work_struct, an irq_work will still touch the work-item header after the
work function has run (see irq_work_single()). So it needs more work
to build a "one-off free" facility on top of it.
I think we can just use a normal workqueue, because queue_work() uses
local_irq_save() + raw_spin_lock(), so it is IRQ-safe even from
non-threaded interrupts.
Sending a patch soon.
Regards,
Boqun
> the irq-work handler from the smp_call_function() handler.
>
> It will be a few days before I get to this, so if there is a better way,
> please do not keep it a secret!
>
> Thanx, Paul
* [PATCH] scftorture: Use workqueue to free scf_check
2024-11-01 19:50 ` Boqun Feng
@ 2024-11-01 19:54 ` Boqun Feng
2024-11-01 23:35 ` Paul E. McKenney
0 siblings, 1 reply; 38+ messages in thread
From: Boqun Feng @ 2024-11-01 19:54 UTC (permalink / raw)
To: paulmck
Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
penberg, rientjes, iamjoonsoo.kim, akpm, Thomas Gleixner,
Peter Zijlstra, Boqun Feng
Paul reported an invalid wait context issue in scftorture caught by
lockdep; the cause is that scf_handler() may call
kfree() to free the struct scf_check:
static void scf_handler(void *scfc_in)
{
[...]
} else {
kfree(scfcp);
}
}
(call chain analysis from Marco Elver)
This is problematic because smp_call_function() uses a non-threaded
interrupt, and kfree() may acquire a local_lock, which is a sleepable
lock on RT.
The general rule is: do not allocate or free memory in non-threaded
interrupt contexts.
A quick fix is to use a workqueue to defer the kfree(). However, this is
OK only because scftorture is test code. In general, users of
interrupts should avoid giving interrupt handlers ownership of
objects; that is, object lifetimes should be managed outside the
handlers, and interrupt handlers should only hold references to objects.
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Link: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/scftorture.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index 44e83a646264..ab6dcc7c0116 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -127,6 +127,7 @@ static unsigned long scf_sel_totweight;
// Communicate between caller and handler.
struct scf_check {
+ struct work_struct work;
bool scfc_in;
bool scfc_out;
int scfc_cpu; // -1 for not _single().
@@ -252,6 +253,13 @@ static struct scf_selector *scf_sel_rand(struct torture_random_state *trsp)
return &scf_sel_array[0];
}
+static void kfree_scf_check_work(struct work_struct *w)
+{
+ struct scf_check *scfcp = container_of(w, struct scf_check, work);
+
+ kfree(scfcp);
+}
+
// Update statistics and occasionally burn up mass quantities of CPU time,
// if told to do so via scftorture.longwait. Otherwise, occasionally burn
// a little bit.
@@ -296,7 +304,10 @@ static void scf_handler(void *scfc_in)
if (scfcp->scfc_rpc)
complete(&scfcp->scfc_completion);
} else {
- kfree(scfcp);
+ // Cannot call kfree() directly, pass it to workqueue. It's OK
+ // only because this is test code, avoid this in real world
+ // usage.
+ queue_work(system_wq, &scfcp->work);
}
}
@@ -335,6 +346,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
scfcp->scfc_wait = scfsp->scfs_wait;
scfcp->scfc_out = false;
scfcp->scfc_rpc = false;
+ INIT_WORK(&scfcp->work, kfree_scf_check_work);
}
}
switch (scfsp->scfs_prim) {
--
2.45.2
* Re: [PATCH] scftorture: Use workqueue to free scf_check
2024-11-01 19:54 ` [PATCH] scftorture: Use workqueue to free scf_check Boqun Feng
@ 2024-11-01 23:35 ` Paul E. McKenney
2024-11-03 3:35 ` Boqun Feng
0 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2024-11-01 23:35 UTC (permalink / raw)
To: Boqun Feng
Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
penberg, rientjes, iamjoonsoo.kim, akpm, Thomas Gleixner,
Peter Zijlstra
On Fri, Nov 01, 2024 at 12:54:38PM -0700, Boqun Feng wrote:
> Paul reported an invalid wait context issue in scftorture caught by
> lockdep; the cause is that scf_handler() may call
> kfree() to free the struct scf_check:
>
> static void scf_handler(void *scfc_in)
> {
> [...]
> } else {
> kfree(scfcp);
> }
> }
>
> (call chain analysis from Marco Elver)
>
> This is problematic because smp_call_function() uses a non-threaded
> interrupt, and kfree() may acquire a local_lock, which is a sleepable
> lock on RT.
>
> The general rule is: do not allocate or free memory in non-threaded
> interrupt contexts.
>
> A quick fix is to use a workqueue to defer the kfree(). However, this is
> OK only because scftorture is test code. In general, users of
> interrupts should avoid giving interrupt handlers ownership of
> objects; that is, object lifetimes should be managed outside the
> handlers, and interrupt handlers should only hold references to objects.
>
> Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> Link: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Thank you!
I was worried that putting each kfree() into a separate workqueue handler
would result in freeing not keeping up with allocation for asynchronous
testing (for example, scftorture.weight_single=1), but it seems to be
doing fine in early testing.
So I have queued this in my -rcu tree for review and further testing.
Thanx, Paul
> ---
> kernel/scftorture.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> index 44e83a646264..ab6dcc7c0116 100644
> --- a/kernel/scftorture.c
> +++ b/kernel/scftorture.c
> @@ -127,6 +127,7 @@ static unsigned long scf_sel_totweight;
>
> // Communicate between caller and handler.
> struct scf_check {
> + struct work_struct work;
> bool scfc_in;
> bool scfc_out;
> int scfc_cpu; // -1 for not _single().
> @@ -252,6 +253,13 @@ static struct scf_selector *scf_sel_rand(struct torture_random_state *trsp)
> return &scf_sel_array[0];
> }
>
> +static void kfree_scf_check_work(struct work_struct *w)
> +{
> + struct scf_check *scfcp = container_of(w, struct scf_check, work);
> +
> + kfree(scfcp);
> +}
> +
> // Update statistics and occasionally burn up mass quantities of CPU time,
> // if told to do so via scftorture.longwait. Otherwise, occasionally burn
> // a little bit.
> @@ -296,7 +304,10 @@ static void scf_handler(void *scfc_in)
> if (scfcp->scfc_rpc)
> complete(&scfcp->scfc_completion);
> } else {
> - kfree(scfcp);
> + // Cannot call kfree() directly, pass it to workqueue. It's OK
> + // only because this is test code, avoid this in real world
> + // usage.
> + queue_work(system_wq, &scfcp->work);
> }
> }
>
> @@ -335,6 +346,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
> scfcp->scfc_wait = scfsp->scfs_wait;
> scfcp->scfc_out = false;
> scfcp->scfc_rpc = false;
> + INIT_WORK(&scfcp->work, kfree_scf_check_work);
> }
> }
> switch (scfsp->scfs_prim) {
> --
> 2.45.2
>
* Re: [BUG] -next lockdep invalid wait context
2024-10-31 7:21 ` Sebastian Andrzej Siewior
2024-10-31 7:35 ` Vlastimil Babka
@ 2024-11-02 0:12 ` Hillf Danton
2024-11-02 0:45 ` Boqun Feng
1 sibling, 1 reply; 38+ messages in thread
From: Hillf Danton @ 2024-11-02 0:12 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Paul E. McKenney
Cc: Marco Elver, linux-kernel, boqun.feng
On Thu, 31 Oct 2024 08:21:36 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> >
> > So I need to avoid calling kfree() within an smp_call_function() handler?
>
> Yes. No kmalloc()/ kfree() in IRQ context.
>
Another splat [1] with no link to kmalloc()/ kfree().
[ BUG: Invalid wait context ]
6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
-----------------------------
strace-static-x/5846 is trying to lock:
ffffffff8eac8698 (kernfs_rename_lock){....}-{3:3}, at: kernfs_path_from_node+0x92/0xb00 fs/kernfs/dir.c:229
other info that might help us debug this:
context-{5:5}
3 locks held by strace-static-x/5846:
#0: ffff8880b873e598 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2a/0x140 kernel/sched/core.c:598
#1: ffffffff8e939f20 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
#1: ffffffff8e939f20 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
#1: ffffffff8e939f20 (rcu_read_lock){....}-{1:3}, at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
#1: ffffffff8e939f20 (rcu_read_lock){....}-{1:3}, at: bpf_trace_run2+0x1fc/0x540 kernel/trace/bpf_trace.c:2381
#2: ffff88802f7129e0 (&mm->mmap_lock){++++}-{4:4}, at: mmap_read_trylock include/linux/mmap_lock.h:208 [inline]
#2: ffff88802f7129e0 (&mm->mmap_lock){++++}-{4:4}, at: stack_map_get_build_id_offset+0x431/0x870 kernel/bpf/stackmap.c:157
[1] Subject: [syzbot] [kernfs?] WARNING: locking bug in kernfs_path_from_node
https://lore.kernel.org/lkml/67251dc6.050a0220.529b6.015e.GAE@google.com/
* Re: [BUG] -next lockdep invalid wait context
2024-11-02 0:12 ` [BUG] -next lockdep invalid wait context Hillf Danton
@ 2024-11-02 0:45 ` Boqun Feng
2024-11-04 18:08 ` Tejun Heo
0 siblings, 1 reply; 38+ messages in thread
From: Boqun Feng @ 2024-11-02 0:45 UTC (permalink / raw)
To: Hillf Danton
Cc: Sebastian Andrzej Siewior, Paul E. McKenney, Marco Elver,
linux-kernel, tj, gregkh
On Sat, Nov 02, 2024 at 08:12:24AM +0800, Hillf Danton wrote:
> On Thu, 31 Oct 2024 08:21:36 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > On 2024-10-30 16:10:58 [-0700], Paul E. McKenney wrote:
> > >
> > > So I need to avoid calling kfree() within an smp_call_function() handler?
> >
> > Yes. No kmalloc()/ kfree() in IRQ context.
> >
> Another splat [1] with no link to kmalloc()/ kfree().
>
Because it's a different bug.
schedule():
__schedule():
rq_lock(); // <- rq lock is a raw spin lock, so no sleep inside
switch_mm_irqs_off():
trace_tlb_flush():
bpf_trace_run2():
stack_map_get_build_id_offset():
mmap_read_trylock(): // this is actually ok, because
// trylock() on rwsem won't
// sleep.
__mmap_lock_trace_acquire_returned():
get_mm_memcg_path():
cgroup_path():
kernfs_path_from_node():
read_lock_irqsave(); // on RT, read_lock()
// is a rwsem IIUC,
// so might sleep.
// hence the issue.
Maybe kernfs can use RCU instead of read_lock here? I don't know...
Regards,
Boqun
> [ BUG: Invalid wait context ]
> 6.12.0-rc5-next-20241031-syzkaller #0 Not tainted
> -----------------------------
> strace-static-x/5846 is trying to lock:
> ffffffff8eac8698 (kernfs_rename_lock){....}-{3:3}, at: kernfs_path_from_node+0x92/0xb00 fs/kernfs/dir.c:229
> other info that might help us debug this:
> context-{5:5}
> 3 locks held by strace-static-x/5846:
> #0: ffff8880b873e598 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2a/0x140 kernel/sched/core.c:598
> #1: ffffffff8e939f20 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
> #1: ffffffff8e939f20 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
> #1: ffffffff8e939f20 (rcu_read_lock){....}-{1:3}, at: __bpf_trace_run kernel/trace/bpf_trace.c:2339 [inline]
> #1: ffffffff8e939f20 (rcu_read_lock){....}-{1:3}, at: bpf_trace_run2+0x1fc/0x540 kernel/trace/bpf_trace.c:2381
> #2: ffff88802f7129e0 (&mm->mmap_lock){++++}-{4:4}, at: mmap_read_trylock include/linux/mmap_lock.h:208 [inline]
> #2: ffff88802f7129e0 (&mm->mmap_lock){++++}-{4:4}, at: stack_map_get_build_id_offset+0x431/0x870 kernel/bpf/stackmap.c:157
>
> [1] Subject: [syzbot] [kernfs?] WARNING: locking bug in kernfs_path_from_node
> https://lore.kernel.org/lkml/67251dc6.050a0220.529b6.015e.GAE@google.com/
* Re: [PATCH] scftorture: Use workqueue to free scf_check
2024-11-01 23:35 ` Paul E. McKenney
@ 2024-11-03 3:35 ` Boqun Feng
2024-11-03 15:03 ` Paul E. McKenney
0 siblings, 1 reply; 38+ messages in thread
From: Boqun Feng @ 2024-11-03 3:35 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
penberg, rientjes, iamjoonsoo.kim, akpm, Thomas Gleixner,
Peter Zijlstra
On Fri, Nov 01, 2024 at 04:35:28PM -0700, Paul E. McKenney wrote:
> On Fri, Nov 01, 2024 at 12:54:38PM -0700, Boqun Feng wrote:
> > Paul reported an invalid wait context issue in scftorture caught by
> > lockdep; the cause is that scf_handler() may call
> > kfree() to free the struct scf_check:
> >
> > static void scf_handler(void *scfc_in)
> > {
> > [...]
> > } else {
> > kfree(scfcp);
> > }
> > }
> >
> > (call chain analysis from Marco Elver)
> >
> > This is problematic because smp_call_function() uses a non-threaded
> > interrupt, and kfree() may acquire a local_lock, which is a sleepable
> > lock on RT.
> >
> > The general rule is: do not allocate or free memory in non-threaded
> > interrupt contexts.
> >
> > A quick fix is to use a workqueue to defer the kfree(). However, this is
> > OK only because scftorture is test code. In general, users of
> > interrupts should avoid giving interrupt handlers ownership of
> > objects; that is, object lifetimes should be managed outside the
> > handlers, and interrupt handlers should only hold references to objects.
> >
> > Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> > Link: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
>
> Thank you!
>
> I was worried that putting each kfree() into a separate workqueue handler
> would result in freeing not keeping up with allocation for asynchronous
> testing (for example, scftorture.weight_single=1), but it seems to be
> doing fine in early testing.
>
I shared the same worry, which is why I added the comment before
queue_work() saying it's OK only because this is test code; it's
certainly not something recommended for general use.
But glad it turns out OK so far for scftorture ;-)
Regards,
Boqun
> So I have queued this in my -rcu tree for review and further testing.
>
> Thanx, Paul
>
> > ---
> > kernel/scftorture.c | 14 +++++++++++++-
> > 1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> > index 44e83a646264..ab6dcc7c0116 100644
> > --- a/kernel/scftorture.c
> > +++ b/kernel/scftorture.c
> > @@ -127,6 +127,7 @@ static unsigned long scf_sel_totweight;
> >
> > // Communicate between caller and handler.
> > struct scf_check {
> > + struct work_struct work;
> > bool scfc_in;
> > bool scfc_out;
> > int scfc_cpu; // -1 for not _single().
> > @@ -252,6 +253,13 @@ static struct scf_selector *scf_sel_rand(struct torture_random_state *trsp)
> > return &scf_sel_array[0];
> > }
> >
> > +static void kfree_scf_check_work(struct work_struct *w)
> > +{
> > + struct scf_check *scfcp = container_of(w, struct scf_check, work);
> > +
> > + kfree(scfcp);
> > +}
> > +
> > // Update statistics and occasionally burn up mass quantities of CPU time,
> > // if told to do so via scftorture.longwait. Otherwise, occasionally burn
> > // a little bit.
> > @@ -296,7 +304,10 @@ static void scf_handler(void *scfc_in)
> > if (scfcp->scfc_rpc)
> > complete(&scfcp->scfc_completion);
> > } else {
> > - kfree(scfcp);
> > + // Cannot call kfree() directly, pass it to workqueue. It's OK
> > + // only because this is test code, avoid this in real world
> > + // usage.
> > + queue_work(system_wq, &scfcp->work);
> > }
> > }
> >
> > @@ -335,6 +346,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
> > scfcp->scfc_wait = scfsp->scfs_wait;
> > scfcp->scfc_out = false;
> > scfcp->scfc_rpc = false;
> > + INIT_WORK(&scfcp->work, kfree_scf_check_work);
> > }
> > }
> > switch (scfsp->scfs_prim) {
> > --
> > 2.45.2
> >
* Re: [PATCH] scftorture: Use workqueue to free scf_check
2024-11-03 3:35 ` Boqun Feng
@ 2024-11-03 15:03 ` Paul E. McKenney
2024-11-04 10:50 ` [PATCH 1/2] scftorture: Move memory allocation outside of preempt_disable region Sebastian Andrzej Siewior
0 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2024-11-03 15:03 UTC (permalink / raw)
To: Boqun Feng
Cc: Sebastian Andrzej Siewior, Vlastimil Babka, Marco Elver,
linux-next, linux-kernel, kasan-dev, linux-mm, sfr, longman, cl,
penberg, rientjes, iamjoonsoo.kim, akpm, Thomas Gleixner,
Peter Zijlstra
On Sat, Nov 02, 2024 at 08:35:36PM -0700, Boqun Feng wrote:
> On Fri, Nov 01, 2024 at 04:35:28PM -0700, Paul E. McKenney wrote:
> > On Fri, Nov 01, 2024 at 12:54:38PM -0700, Boqun Feng wrote:
> > > Paul reported an invalid wait context issue in scftorture caught by
> > > lockdep; the cause is that scf_handler() may call
> > > kfree() to free the struct scf_check:
> > >
> > > static void scf_handler(void *scfc_in)
> > > {
> > > [...]
> > > } else {
> > > kfree(scfcp);
> > > }
> > > }
> > >
> > > (call chain analysis from Marco Elver)
> > >
> > > This is problematic because smp_call_function() uses a non-threaded
> > > interrupt, and kfree() may acquire a local_lock, which is a sleepable
> > > lock on RT.
> > >
> > > The general rule is: do not allocate or free memory in non-threaded
> > > interrupt contexts.
> > >
> > > A quick fix is to use a workqueue to defer the kfree(). However, this is
> > > OK only because scftorture is test code. In general, users of
> > > interrupts should avoid giving interrupt handlers ownership of
> > > objects; that is, object lifetimes should be managed outside the
> > > handlers, and interrupt handlers should only hold references to objects.
> > >
> > > Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> > > Link: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
> > > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> >
> > Thank you!
> >
> > I was worried that putting each kfree() into a separate workqueue handler
> > would result in freeing not keeping up with allocation for asynchronous
> > testing (for example, scftorture.weight_single=1), but it seems to be
> > doing fine in early testing.
>
> I shared the same worry, so it's why I added the comments before
> queue_work() saying it's only OK because it's test code, it's certainly
> not something recommended for general use.
>
> But glad it turns out OK so far for scftorture ;-)
That said, I have tried only a couple of memory sizes at 64 CPUs: the
default (512M), which OOMs both with and without this fix, and 7G, which
is selected by torture.sh and avoids OOMing either way. It would be
interesting to vary the memory provided between those limits and see
whether there is any difference in behavior.
It avoids OOM at the default 512M at 16 CPUs.
Ah, and I did not check throughput, which might have changed. A quick
test on my laptop says that it dropped by almost a factor of two,
from not quite 1M invocations/s to a bit more than 500K invocations/s.
So something more efficient does seem in order. ;-)
tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --configs PREEMPT --duration 30 --bootargs "scftorture.weight_single=1" --trust-make
Thanx, Paul
> Regards,
> Boqun
>
> > So I have queued this in my -rcu tree for review and further testing.
> >
> > Thanx, Paul
> >
> > > [ . . . ]
* [PATCH 1/2] scftorture: Move memory allocation outside of preempt_disable region.
2024-11-03 15:03 ` Paul E. McKenney
@ 2024-11-04 10:50 ` Sebastian Andrzej Siewior
2024-11-04 10:50 ` [PATCH 2/2] scftorture: Use a lock-less list to free memory Sebastian Andrzej Siewior
0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-04 10:50 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel, kasan-dev,
linux-mm, sfr, longman, cl, penberg, rientjes, iamjoonsoo.kim,
akpm, Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior
Memory allocations cannot happen within regions with explicitly disabled
preemption on PREEMPT_RT. The problem is that the locking structures
underneath are sleeping locks.
Move the memory allocation outside of the preempt-disabled section. Keep
GFP_ATOMIC for the allocation so that it behaves like an "emergency
allocation".
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/scftorture.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index 44e83a6462647..e5546fe256329 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -320,10 +320,6 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
struct scf_check *scfcp = NULL;
struct scf_selector *scfsp = scf_sel_rand(trsp);
- if (use_cpus_read_lock)
- cpus_read_lock();
- else
- preempt_disable();
if (scfsp->scfs_prim == SCF_PRIM_SINGLE || scfsp->scfs_wait) {
scfcp = kmalloc(sizeof(*scfcp), GFP_ATOMIC);
if (!scfcp) {
@@ -337,6 +333,10 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
scfcp->scfc_rpc = false;
}
}
+ if (use_cpus_read_lock)
+ cpus_read_lock();
+ else
+ preempt_disable();
switch (scfsp->scfs_prim) {
case SCF_PRIM_RESCHED:
if (IS_BUILTIN(CONFIG_SCF_TORTURE_TEST)) {
--
2.45.2
* [PATCH 2/2] scftorture: Use a lock-less list to free memory.
2024-11-04 10:50 ` [PATCH 1/2] scftorture: Move memory allocation outside of preempt_disable region Sebastian Andrzej Siewior
@ 2024-11-04 10:50 ` Sebastian Andrzej Siewior
2024-11-05 1:00 ` Boqun Feng
0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-04 10:50 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel, kasan-dev,
linux-mm, sfr, longman, cl, penberg, rientjes, iamjoonsoo.kim,
akpm, Thomas Gleixner, Peter Zijlstra, Sebastian Andrzej Siewior
scf_handler() is used as an SMP function call handler. It is always
invoked in IRQ context, even with forced threading enabled. The function
frees memory, which is not allowed on PREEMPT_RT because the locking
underneath uses sleeping locks.
Add a per-CPU scf_free_pool to which each SMP function handler adds the
memory to be freed. This memory is then freed by scftorture_invoker() on
each iteration. On the majority of invocations the number of items is
less than five. If the thread sleeps or gets delayed, the number
exceeded 350 but did not reach 400 in testing; these were the spikes.
The bulk free of 64 pointers at once should improve the give-back if the
list grows. The average list size is ~1.3 items per invocation.
Having one global scf_free_pool with one cleaning thread let the list
grow to over 10,000 items with 32 CPUs (again, spikes, not the average),
especially if the CPU went to sleep. The per-CPU approach looks like a
good compromise.
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Closes: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/scftorture.c | 47 +++++++++++++++++++++++++++++++++++++++++----
1 file changed, 43 insertions(+), 4 deletions(-)
diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index e5546fe256329..ba9f1125821b8 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -97,6 +97,7 @@ struct scf_statistics {
static struct scf_statistics *scf_stats_p;
static struct task_struct *scf_torture_stats_task;
static DEFINE_PER_CPU(long long, scf_invoked_count);
+static DEFINE_PER_CPU(struct llist_head, scf_free_pool);
// Data for random primitive selection
#define SCF_PRIM_RESCHED 0
@@ -133,6 +134,7 @@ struct scf_check {
bool scfc_wait;
bool scfc_rpc;
struct completion scfc_completion;
+ struct llist_node scf_node;
};
// Use to wait for all threads to start.
@@ -148,6 +150,40 @@ static DEFINE_TORTURE_RANDOM_PERCPU(scf_torture_rand);
extern void resched_cpu(int cpu); // An alternative IPI vector.
+static void scf_add_to_free_list(struct scf_check *scfcp)
+{
+ struct llist_head *pool;
+ unsigned int cpu;
+
+ cpu = raw_smp_processor_id() % nthreads;
+ pool = &per_cpu(scf_free_pool, cpu);
+ llist_add(&scfcp->scf_node, pool);
+}
+
+static void scf_cleanup_free_list(unsigned int cpu)
+{
+ struct llist_head *pool;
+ struct llist_node *node;
+ struct scf_check *scfcp;
+ unsigned int slot = 0;
+ void *free_pool[64];
+
+ pool = &per_cpu(scf_free_pool, cpu);
+ node = llist_del_all(pool);
+ while (node) {
+ scfcp = llist_entry(node, struct scf_check, scf_node);
+ node = node->next;
+ free_pool[slot] = scfcp;
+ slot++;
+ if (slot == ARRAY_SIZE(free_pool)) {
+ kfree_bulk(slot, free_pool);
+ slot = 0;
+ }
+ }
+ if (slot)
+ kfree_bulk(slot, free_pool);
+}
+
// Print torture statistics. Caller must ensure serialization.
static void scf_torture_stats_print(void)
{
@@ -296,7 +332,7 @@ static void scf_handler(void *scfc_in)
if (scfcp->scfc_rpc)
complete(&scfcp->scfc_completion);
} else {
- kfree(scfcp);
+ scf_add_to_free_list(scfcp);
}
}
@@ -363,7 +399,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
scfp->n_single_wait_ofl++;
else
scfp->n_single_ofl++;
- kfree(scfcp);
+ scf_add_to_free_list(scfcp);
scfcp = NULL;
}
break;
@@ -391,7 +427,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
preempt_disable();
} else {
scfp->n_single_rpc_ofl++;
- kfree(scfcp);
+ scf_add_to_free_list(scfcp);
scfcp = NULL;
}
break;
@@ -428,7 +464,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
pr_warn("%s: Memory-ordering failure, scfs_prim: %d.\n", __func__, scfsp->scfs_prim);
atomic_inc(&n_mb_out_errs); // Leak rather than trash!
} else {
- kfree(scfcp);
+ scf_add_to_free_list(scfcp);
}
barrier(); // Prevent race-reduction compiler optimizations.
}
@@ -442,6 +478,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
schedule_timeout_uninterruptible(1);
}
+
// SCF test kthread. Repeatedly does calls to members of the
// smp_call_function() family of functions.
static int scftorture_invoker(void *arg)
@@ -479,6 +516,8 @@ static int scftorture_invoker(void *arg)
VERBOSE_SCFTORTOUT("scftorture_invoker %d started", scfp->cpu);
do {
+ scf_cleanup_free_list(scfp->cpu);
+
scftorture_invoke_one(scfp, &rand);
while (cpu_is_offline(cpu) && !torture_must_stop()) {
schedule_timeout_interruptible(HZ / 5);
--
2.45.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [BUG] -next lockdep invalid wait context
2024-11-02 0:45 ` Boqun Feng
@ 2024-11-04 18:08 ` Tejun Heo
2024-11-05 9:37 ` Vlastimil Babka
2024-11-08 10:05 ` Sebastian Andrzej Siewior
0 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2024-11-04 18:08 UTC (permalink / raw)
To: Boqun Feng
Cc: Hillf Danton, Sebastian Andrzej Siewior, Paul E. McKenney,
Marco Elver, linux-kernel, gregkh
Hello,
On Fri, Nov 01, 2024 at 05:45:01PM -0700, Boqun Feng wrote:
...
> Because it's a different bug.
>
> schedule():
> __schedule():
> rq_lock(); // <- rq lock is a raw spin lock, so no sleep inside
> switch_mm_irqs_off():
> trace_tlb_flush():
> bpf_trace_run2():
> stack_map_get_build_id_offset():
> mmap_read_trylock(): // this is actually ok, because
> // trylock() on rwsem won't
> // sleep.
> __mmap_lock_trace_acquire_returned():
> get_mm_memcg_path():
> cgroup_path():
> kernfs_path_from_node():
> read_lock_irqsave(); // on RT, read_lock()
> // is a rwsem IIUC,
> // so might sleep.
> // hence the issue.
>
> Maybe kernfs can use RCU instead of read_lock here? I don't know...
Yeah, we should be able to make kn->name RCU protected and drop the usage of
the rename lock in read paths.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
2024-11-04 10:50 ` [PATCH 2/2] scftorture: Use a lock-less list to free memory Sebastian Andrzej Siewior
@ 2024-11-05 1:00 ` Boqun Feng
2024-11-07 11:21 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 38+ messages in thread
From: Boqun Feng @ 2024-11-05 1:00 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Paul E. McKenney, Vlastimil Babka, Marco Elver, linux-kernel,
kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
iamjoonsoo.kim, akpm, Tomas Gleixner, Peter Zijlstra
Hi Sebastian,
I love this approach, I think it's better than my workqueue work around,
a few comments below:
On Mon, Nov 04, 2024 at 11:50:53AM +0100, Sebastian Andrzej Siewior wrote:
> scf_handler() is used as a SMP function call. This function is always
> invoked in IRQ-context even with forced-threading enabled. This function
> frees memory which not allowed on PREEMPT_RT because the locking
> underneath is using sleeping locks.
>
> Add a per-CPU scf_free_pool where each SMP functions adds its memory to
> be freed. This memory is then freed by scftorture_invoker() on each
> iteration. On the majority of invocations the number of items is less
> than five. If the thread sleeps/ gets delayed the number exceed 350 but
> did not reach 400 in testing. These were the spikes during testing.
> The bulk free of 64 pointers at once should improve the give-back if the
> list grows. The list size is ~1.3 items per invocations.
>
> Having one global scf_free_pool with one cleaning thread let the list
> grow to over 10.000 items with 32 CPUs (again, spikes not the average)
> especially if the CPU went to sleep. The per-CPU part looks like a good
> compromise.
>
> Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
> Closes: https://lore.kernel.org/lkml/41619255-cdc2-4573-a360-7794fc3614f7@paulmck-laptop/
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> kernel/scftorture.c | 47 +++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/scftorture.c b/kernel/scftorture.c
> index e5546fe256329..ba9f1125821b8 100644
> --- a/kernel/scftorture.c
> +++ b/kernel/scftorture.c
> @@ -97,6 +97,7 @@ struct scf_statistics {
> static struct scf_statistics *scf_stats_p;
> static struct task_struct *scf_torture_stats_task;
> static DEFINE_PER_CPU(long long, scf_invoked_count);
> +static DEFINE_PER_CPU(struct llist_head, scf_free_pool);
>
> // Data for random primitive selection
> #define SCF_PRIM_RESCHED 0
> @@ -133,6 +134,7 @@ struct scf_check {
> bool scfc_wait;
> bool scfc_rpc;
> struct completion scfc_completion;
> + struct llist_node scf_node;
> };
>
> // Use to wait for all threads to start.
> @@ -148,6 +150,40 @@ static DEFINE_TORTURE_RANDOM_PERCPU(scf_torture_rand);
>
> extern void resched_cpu(int cpu); // An alternative IPI vector.
>
> +static void scf_add_to_free_list(struct scf_check *scfcp)
> +{
> + struct llist_head *pool;
> + unsigned int cpu;
> +
> + cpu = raw_smp_processor_id() % nthreads;
> + pool = &per_cpu(scf_free_pool, cpu);
> + llist_add(&scfcp->scf_node, pool);
> +}
> +
> +static void scf_cleanup_free_list(unsigned int cpu)
> +{
> + struct llist_head *pool;
> + struct llist_node *node;
> + struct scf_check *scfcp;
> + unsigned int slot = 0;
> + void *free_pool[64];
> +
> + pool = &per_cpu(scf_free_pool, cpu);
> + node = llist_del_all(pool);
> + while (node) {
> + scfcp = llist_entry(node, struct scf_check, scf_node);
> + node = node->next;
> + free_pool[slot] = scfcp;
> + slot++;
> + if (slot == ARRAY_SIZE(free_pool)) {
> + kfree_bulk(slot, free_pool);
> + slot = 0;
> + }
> + }
> + if (slot)
> + kfree_bulk(slot, free_pool);
> +}
> +
> // Print torture statistics. Caller must ensure serialization.
> static void scf_torture_stats_print(void)
> {
> @@ -296,7 +332,7 @@ static void scf_handler(void *scfc_in)
> if (scfcp->scfc_rpc)
> complete(&scfcp->scfc_completion);
> } else {
> - kfree(scfcp);
> + scf_add_to_free_list(scfcp);
> }
> }
>
> @@ -363,7 +399,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
> scfp->n_single_wait_ofl++;
> else
> scfp->n_single_ofl++;
> - kfree(scfcp);
> + scf_add_to_free_list(scfcp);
> scfcp = NULL;
> }
> break;
> @@ -391,7 +427,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
> preempt_disable();
> } else {
> scfp->n_single_rpc_ofl++;
> - kfree(scfcp);
> + scf_add_to_free_list(scfcp);
> scfcp = NULL;
> }
> break;
> @@ -428,7 +464,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
> pr_warn("%s: Memory-ordering failure, scfs_prim: %d.\n", __func__, scfsp->scfs_prim);
> atomic_inc(&n_mb_out_errs); // Leak rather than trash!
> } else {
> - kfree(scfcp);
> + scf_add_to_free_list(scfcp);
> }
> barrier(); // Prevent race-reduction compiler optimizations.
> }
> @@ -442,6 +478,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
> schedule_timeout_uninterruptible(1);
> }
>
> +
> // SCF test kthread. Repeatedly does calls to members of the
> // smp_call_function() family of functions.
> static int scftorture_invoker(void *arg)
> @@ -479,6 +516,8 @@ static int scftorture_invoker(void *arg)
> VERBOSE_SCFTORTOUT("scftorture_invoker %d started", scfp->cpu);
>
> do {
> + scf_cleanup_free_list(scfp->cpu);
> +
I think this needs to be:
scf_cleanup_free_list(cpu);
or
scf_cleanup_free_list(curcpu);
because scfp->cpu is actually the thread number, and I got a NULL
dereference:
[ 14.219225] BUG: unable to handle page fault for address: ffffffffb2ff7210
while running Paul's reproduce command:
tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 2 --configs PREEMPT --kconfig CONFIG_NR_CPUS=64 --memory 7G --trust-make --kasan --bootargs "scftorture.nthreads=64 torture.disable_onoff_at_boot csdlock_debug=1"
on my 48-core VM (I think 48 cores may be key to reproducing the NULL
dereference).
Another thing is, how do we guarantee that we don't exit the loop
earlier (i.e. while there are still callbacks on the list)? After the
following scftorture_invoke_one(), there could be an IPI pending somewhere,
and we may exit this loop if torture_must_stop() is true. And that IPI
might add its scf_check to the list but no scf_cleanup_free_list() is
going to handle that, right?
Regards,
Boqun
> scftorture_invoke_one(scfp, &rand);
> while (cpu_is_offline(cpu) && !torture_must_stop()) {
> schedule_timeout_interruptible(HZ / 5);
> --
> 2.45.2
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [BUG] -next lockdep invalid wait context
2024-11-04 18:08 ` Tejun Heo
@ 2024-11-05 9:37 ` Vlastimil Babka
2024-11-08 10:05 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2024-11-05 9:37 UTC (permalink / raw)
To: Tejun Heo, Boqun Feng
Cc: Hillf Danton, Sebastian Andrzej Siewior, Paul E. McKenney,
Marco Elver, linux-kernel, gregkh
On 11/4/24 19:08, Tejun Heo wrote:
> Hello,
>
> On Fri, Nov 01, 2024 at 05:45:01PM -0700, Boqun Feng wrote:
> ...
>> Because it's a different bug.
>>
>> schedule():
>> __schedule():
>> rq_lock(); // <- rq lock is a raw spin lock, so no sleep inside
>> switch_mm_irqs_off():
>> trace_tlb_flush():
>> bpf_trace_run2():
>> stack_map_get_build_id_offset():
>> mmap_read_trylock(): // this is actually ok, because
>> // trylock() on rwsem won't
>> // sleep.
>> __mmap_lock_trace_acquire_returned():
>> get_mm_memcg_path():
>> cgroup_path():
>> kernfs_path_from_node():
>> read_lock_irqsave(); // on RT, read_lock()
>> // is a rwsem IIUC,
>> // so might sleep.
>> // hence the issue.
>>
>> Maybe kernfs can use RCU instead of read_lock here? I don't know...
>
> Yeah, we should be able to make kn->name RCU protected and drop the usage of
> the rename lock in read paths.
For the purposes of the get_mm_memcg_path() usage for tracepoints, it would
be also sufficient to have a trylock variant. If the lock cannot be
acquired, the trace event will omit the memcg detail, which is documented as
a possibility already.
> Thanks.
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
2024-11-05 1:00 ` Boqun Feng
@ 2024-11-07 11:21 ` Sebastian Andrzej Siewior
2024-11-07 14:08 ` Paul E. McKenney
0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-07 11:21 UTC (permalink / raw)
To: Boqun Feng
Cc: Paul E. McKenney, Vlastimil Babka, Marco Elver, linux-kernel,
kasan-dev, linux-mm, sfr, longman, cl, penberg, rientjes,
iamjoonsoo.kim, akpm, Tomas Gleixner, Peter Zijlstra
On 2024-11-04 17:00:19 [-0800], Boqun Feng wrote:
> Hi Sebastian,
Hi Boqun,
…
> I think this needs to be:
>
> scf_cleanup_free_list(cpu);
>
> or
>
> scf_cleanup_free_list(curcpu);
>
> because scfp->cpu is actually the thread number, and I got a NULL
> dereference:
>
> [ 14.219225] BUG: unable to handle page fault for address: ffffffffb2ff7210
Right. Replaced with cpu.
…
>
> Another thing is, how do we guarantee that we don't exit the loop
> eariler (i.e. while there are still callbacks on the list)? After the
> following scftorture_invoke_one(), there could an IPI pending somewhere,
> and we may exit this loop if torture_must_stop() is true. And that IPI
> might add its scf_check to the list but no scf_cleanup_free_list() is
> going to handle that, right?
Okay. Assuming that IPIs are done by the time scf_torture_cleanup is
invoked, I added scf_cleanup_free_list() for all CPUs there.
Reposted at
https://lore.kernel.org/20241107111821.3417762-1-bigeasy@linutronix.de
> Regards,
> Boqun
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
2024-11-07 11:21 ` Sebastian Andrzej Siewior
@ 2024-11-07 14:08 ` Paul E. McKenney
2024-11-07 14:43 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 38+ messages in thread
From: Paul E. McKenney @ 2024-11-07 14:08 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel, kasan-dev,
linux-mm, sfr, longman, cl, penberg, rientjes, iamjoonsoo.kim,
akpm, Tomas Gleixner, Peter Zijlstra
On Thu, Nov 07, 2024 at 12:21:07PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-11-04 17:00:19 [-0800], Boqun Feng wrote:
> > Hi Sebastian,
> Hi Boqun,
>
> …
> > I think this needs to be:
> >
> > scf_cleanup_free_list(cpu);
> >
> > or
> >
> > scf_cleanup_free_list(curcpu);
> >
> > because scfp->cpu is actually the thread number, and I got a NULL
> > dereference:
> >
> > [ 14.219225] BUG: unable to handle page fault for address: ffffffffb2ff7210
>
> Right. Replaced with cpu.
> …
> >
> > Another thing is, how do we guarantee that we don't exit the loop
> > eariler (i.e. while there are still callbacks on the list)? After the
> > following scftorture_invoke_one(), there could an IPI pending somewhere,
> > and we may exit this loop if torture_must_stop() is true. And that IPI
> > might add its scf_check to the list but no scf_cleanup_free_list() is
> > going to handle that, right?
>
> Okay. Assuming that IPIs are done by the time scf_torture_cleanup is
> invoked, I added scf_cleanup_free_list() for all CPUs there.
This statement in scf_torture_cleanup() is supposed to wait for all
outstanding IPIs:
smp_call_function(scf_cleanup_handler, NULL, 0);
And the scf_cleanup_handler() function is as follows:
static void scf_cleanup_handler(void *unused)
{
}
Does that work, or am I yet again being overly naive?
> Reposted at
> https://lore.kernel.org/20241107111821.3417762-1-bigeasy@linutronix.de
Thank you!
I will do some testing on this later today.
Thanx, Paul
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
2024-11-07 14:08 ` Paul E. McKenney
@ 2024-11-07 14:43 ` Sebastian Andrzej Siewior
2024-11-07 14:59 ` Paul E. McKenney
0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-07 14:43 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel, kasan-dev,
linux-mm, sfr, longman, cl, penberg, rientjes, iamjoonsoo.kim,
akpm, Tomas Gleixner, Peter Zijlstra
On 2024-11-07 06:08:35 [-0800], Paul E. McKenney wrote:
…
> This statement in scf_torture_cleanup() is supposed to wait for all
> outstanding IPIs:
>
> smp_call_function(scf_cleanup_handler, NULL, 0);
This should be
smp_call_function(scf_cleanup_handler, NULL, 1);
so it queues the function call and waits for its completion. Otherwise
it is queued and might be invoked _later_.
> And the scf_cleanup_handler() function is as follows:
>
> static void scf_cleanup_handler(void *unused)
> {
> }
>
> Does that work, or am I yet again being overly naive?
See above. I can send a patch later on if you have no other complaints ;)
> Thanx, Paul
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] scftorture: Use a lock-less list to free memory.
2024-11-07 14:43 ` Sebastian Andrzej Siewior
@ 2024-11-07 14:59 ` Paul E. McKenney
0 siblings, 0 replies; 38+ messages in thread
From: Paul E. McKenney @ 2024-11-07 14:59 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Boqun Feng, Vlastimil Babka, Marco Elver, linux-kernel, kasan-dev,
linux-mm, sfr, longman, cl, penberg, rientjes, iamjoonsoo.kim,
akpm, Tomas Gleixner, Peter Zijlstra
On Thu, Nov 07, 2024 at 03:43:00PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-11-07 06:08:35 [-0800], Paul E. McKenney wrote:
> …
> > This statement in scf_torture_cleanup() is supposed to wait for all
> > outstanding IPIs:
> >
> > smp_call_function(scf_cleanup_handler, NULL, 0);
>
> This should be
> smp_call_function(scf_cleanup_handler, NULL, 1);
>
> so it queues the function call and waits for its completion. Otherwise
> it is queued and might be invoked _later_.
>
> > And the scf_cleanup_handler() function is as follows:
> >
> > static void scf_cleanup_handler(void *unused)
> > {
> > }
> >
> > Does that work, or am I yet again being overly naive?
>
> See above. I can send a patch later on if you have no other complains ;)
You got me on that one! Thank you, and please do feel free to send
a patch.
Interestingly enough, this has never failed. Perhaps because the usual
scftorture run injects enough idle time that all the IPIs have time
to finish. ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [BUG] -next lockdep invalid wait context
2024-11-04 18:08 ` Tejun Heo
2024-11-05 9:37 ` Vlastimil Babka
@ 2024-11-08 10:05 ` Sebastian Andrzej Siewior
2024-11-08 17:02 ` Tejun Heo
1 sibling, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-08 10:05 UTC (permalink / raw)
To: Tejun Heo
Cc: Boqun Feng, Hillf Danton, Paul E. McKenney, Marco Elver,
linux-kernel, gregkh, tglx, syzbot
On 2024-11-04 08:08:05 [-1000], Tejun Heo wrote:
> Yeah, we should be able to make kn->name RCU protected and drop the usage of
> the rename lock in read paths.
Something like this maybe? There are a few rcu_dereference_check(, true)
in there because there seems to be no locking, so the protection might not
be needed, but I don't know why.
I added "down_read(&root->kernfs_rwsem)" to kernfs_notify_workfn() for
the name lookup, I don't think kernfs_supers_rwsem implies this.
I dropped up_read(&root->kernfs_rwsem) from kernfs_fop_readdir() during
dir_emit(). I *think* that a rename could happen while the lock is
dropped and I can't have RCU here because of the page fault in
dir_emit().
I didn't see anything complain about this so far. I had just a few
renames during boot for the network interfaces, so it is not excessive
testing.
---
fs/kernfs/dir.c | 141 +++++++++++++++++++-----------------
fs/kernfs/file.c | 6 +-
fs/kernfs/kernfs-internal.h | 23 +++++-
fs/kernfs/mount.c | 17 +++--
fs/kernfs/symlink.c | 27 +++----
include/linux/kernfs.h | 4 +-
6 files changed, 128 insertions(+), 90 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 458519e416fe7..a4ea5114cba2c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,6 @@
#include "kernfs-internal.h"
-static DEFINE_RWLOCK(kernfs_rename_lock); /* kn->parent and ->name */
/*
* Don't use rename_lock to piggy back on pr_cont_buf. We don't want to
* call pr_cont() while holding rename_lock. Because sometimes pr_cont()
@@ -53,10 +52,15 @@ static bool kernfs_lockdep(struct kernfs_node *kn)
static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
{
+ struct kernfs_node *kn_parent;
+ const char *kn_name;
+
if (!kn)
return strscpy(buf, "(null)", buflen);
- return strscpy(buf, kn->parent ? kn->name : "/", buflen);
+ kn_parent = rcu_dereference(kn->parent);
+ kn_name = rcu_dereference(kn->name);
+ return strscpy(buf, kn_parent ? kn_name : "/", buflen);
}
/* kernfs_node_depth - compute depth from @from to @to */
@@ -66,7 +70,7 @@ static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
while (to->parent && to != from) {
depth++;
- to = to->parent;
+ to = rcu_dereference(to->parent);
}
return depth;
}
@@ -84,18 +88,18 @@ static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
db = kernfs_depth(rb->kn, b);
while (da > db) {
- a = a->parent;
+ a = rcu_dereference(a->parent);
da--;
}
while (db > da) {
- b = b->parent;
+ b = rcu_dereference(b->parent);
db--;
}
/* worst case b and a will be the same at root */
while (b != a) {
- b = b->parent;
- a = a->parent;
+ b = rcu_dereference(b->parent);
+ a = rcu_dereference(a->parent);
}
return a;
@@ -169,9 +173,9 @@ static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
/* Calculate how many bytes we need for the rest */
for (i = depth_to - 1; i >= 0; i--) {
for (kn = kn_to, j = 0; j < i; j++)
- kn = kn->parent;
+ kn = rcu_dereference(kn->parent);
- len += scnprintf(buf + len, buflen - len, "/%s", kn->name);
+ len += scnprintf(buf + len, buflen - len, "/%s", rcu_dereference(kn->name));
}
return len;
@@ -195,12 +199,11 @@ static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
*/
int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
{
- unsigned long flags;
int ret;
- read_lock_irqsave(&kernfs_rename_lock, flags);
+ rcu_read_lock();
ret = kernfs_name_locked(kn, buf, buflen);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
+ rcu_read_unlock();
return ret;
}
@@ -223,12 +226,11 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
char *buf, size_t buflen)
{
- unsigned long flags;
int ret;
- read_lock_irqsave(&kernfs_rename_lock, flags);
+ rcu_read_lock();
ret = kernfs_path_from_node_locked(to, from, buf, buflen);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
+ rcu_read_unlock();
return ret;
}
EXPORT_SYMBOL_GPL(kernfs_path_from_node);
@@ -292,12 +294,11 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn)
{
struct kernfs_node *parent;
- unsigned long flags;
- read_lock_irqsave(&kernfs_rename_lock, flags);
- parent = kn->parent;
+ rcu_read_lock();
+ parent = rcu_dereference(kn->parent);
kernfs_get(parent);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
+ rcu_read_unlock();
return parent;
}
@@ -336,13 +337,13 @@ static int kernfs_name_compare(unsigned int hash, const char *name,
return -1;
if (ns > kn->ns)
return 1;
- return strcmp(name, kn->name);
+ return strcmp(name, kernfs_rcu_get_name(kn));
}
static int kernfs_sd_compare(const struct kernfs_node *left,
const struct kernfs_node *right)
{
- return kernfs_name_compare(left->hash, left->name, left->ns, right);
+ return kernfs_name_compare(left->hash, kernfs_rcu_get_name(left), left->ns, right);
}
/**
@@ -360,9 +361,12 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
*/
static int kernfs_link_sibling(struct kernfs_node *kn)
{
- struct rb_node **node = &kn->parent->dir.children.rb_node;
+ struct kernfs_node *node_parent;
struct rb_node *parent = NULL;
+ struct rb_node **node;
+ node_parent = kernfs_rcu_get_parent(kn);
+ node = &node_parent->dir.children.rb_node;
while (*node) {
struct kernfs_node *pos;
int result;
@@ -380,13 +384,13 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
/* add new node and rebalance the tree */
rb_link_node(&kn->rb, parent, node);
- rb_insert_color(&kn->rb, &kn->parent->dir.children);
+ rb_insert_color(&kn->rb, &node_parent->dir.children);
/* successfully added, account subdir number */
down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
if (kernfs_type(kn) == KERNFS_DIR)
- kn->parent->dir.subdirs++;
- kernfs_inc_rev(kn->parent);
+ node_parent->dir.subdirs++;
+ kernfs_inc_rev(node_parent);
up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
return 0;
@@ -407,16 +411,19 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
*/
static bool kernfs_unlink_sibling(struct kernfs_node *kn)
{
+ struct kernfs_node *node_parent;
+
if (RB_EMPTY_NODE(&kn->rb))
return false;
down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
+ node_parent = kernfs_rcu_get_parent(kn);
if (kernfs_type(kn) == KERNFS_DIR)
- kn->parent->dir.subdirs--;
- kernfs_inc_rev(kn->parent);
+ node_parent->dir.subdirs--;
+ kernfs_inc_rev(node_parent);
up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
- rb_erase(&kn->rb, &kn->parent->dir.children);
+ rb_erase(&kn->rb, &node_parent->dir.children);
RB_CLEAR_NODE(&kn->rb);
return true;
}
@@ -533,7 +540,8 @@ static void kernfs_free_rcu(struct rcu_head *rcu)
{
struct kernfs_node *kn = container_of(rcu, struct kernfs_node, rcu);
- kfree_const(kn->name);
+ /* If the whole node goes away, the name can't be used outside */
+ kfree_const(rcu_dereference_check(kn->name, true));
if (kn->iattr) {
simple_xattrs_free(&kn->iattr->xattrs, NULL);
@@ -556,17 +564,19 @@ void kernfs_put(struct kernfs_node *kn)
if (!kn || !atomic_dec_and_test(&kn->count))
return;
+
+ rcu_read_lock();
root = kernfs_root(kn);
repeat:
/*
* Moving/renaming is always done while holding reference.
* kn->parent won't change beneath us.
*/
- parent = kn->parent;
+ parent = rcu_dereference(kn->parent);
WARN_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS,
"kernfs_put: %s/%s: released with incorrect active_ref %d\n",
- parent ? parent->name : "", kn->name, atomic_read(&kn->active));
+ parent ? rcu_dereference(parent->name) : "", kn->name, atomic_read(&kn->active));
if (kernfs_type(kn) == KERNFS_LINK)
kernfs_put(kn->symlink.target_kn);
@@ -586,6 +596,7 @@ void kernfs_put(struct kernfs_node *kn)
idr_destroy(&root->ino_idr);
kfree_rcu(root, rcu);
}
+ rcu_read_unlock();
}
EXPORT_SYMBOL_GPL(kernfs_put);
@@ -643,7 +654,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
RB_CLEAR_NODE(&kn->rb);
- kn->name = name;
+ rcu_assign_pointer(kn->name, name);
kn->mode = mode;
kn->flags = flags;
@@ -701,7 +712,7 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
name, mode, uid, gid, flags);
if (kn) {
kernfs_get(parent);
- kn->parent = parent;
+ rcu_assign_pointer(kn->parent, parent);
}
return kn;
}
@@ -769,12 +780,14 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
*/
int kernfs_add_one(struct kernfs_node *kn)
{
- struct kernfs_node *parent = kn->parent;
- struct kernfs_root *root = kernfs_root(parent);
+ struct kernfs_node *parent;
+ struct kernfs_root *root;
struct kernfs_iattrs *ps_iattr;
bool has_ns;
int ret;
+ parent = rcu_dereference_check(kn->parent, true);
+ root = kernfs_root(parent);
down_write(&root->kernfs_rwsem);
ret = -EINVAL;
@@ -790,7 +803,7 @@ int kernfs_add_one(struct kernfs_node *kn)
if (parent->flags & (KERNFS_REMOVING | KERNFS_EMPTY_DIR))
goto out_unlock;
- kn->hash = kernfs_name_hash(kn->name, kn->ns);
+ kn->hash = kernfs_name_hash(kernfs_rcu_get_name(kn), kn->ns);
ret = kernfs_link_sibling(kn);
if (ret)
@@ -1111,6 +1124,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
{
+ struct kernfs_node *parent;
struct kernfs_node *kn;
struct kernfs_root *root;
@@ -1119,8 +1133,6 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
/* Negative hashed dentry? */
if (d_really_is_negative(dentry)) {
- struct kernfs_node *parent;
-
/* If the kernfs parent node has changed discard and
* proceed to ->lookup.
*
@@ -1162,16 +1174,17 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
if (!kernfs_active(kn))
goto out_bad;
+ parent = kernfs_rcu_get_parent(kn);
/* The kernfs node has been moved? */
- if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
+ if (kernfs_dentry_node(dentry->d_parent) != parent)
goto out_bad;
/* The kernfs node has been renamed */
- if (strcmp(dentry->d_name.name, kn->name) != 0)
+ if (strcmp(dentry->d_name.name, kernfs_rcu_get_name(kn)) != 0)
goto out_bad;
/* The kernfs node has been moved to a different namespace */
- if (kn->parent && kernfs_ns_enabled(kn->parent) &&
+ if (parent && kernfs_ns_enabled(parent) &&
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
@@ -1364,7 +1377,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
return kernfs_leftmost_descendant(rb_to_kn(rbn));
/* no sibling left, visit parent */
- return pos->parent;
+ return kernfs_rcu_get_parent(pos);
}
static void kernfs_activate_one(struct kernfs_node *kn)
@@ -1490,8 +1503,11 @@ static void __kernfs_remove(struct kernfs_node *kn)
* to decide who's responsible for cleanups.
*/
if (!pos->parent || kernfs_unlink_sibling(pos)) {
- struct kernfs_iattrs *ps_iattr =
- pos->parent ? pos->parent->iattr : NULL;
+ struct kernfs_iattrs *ps_iattr;
+ struct kernfs_node *parent;
+
+ parent = kernfs_rcu_get_parent(pos);
+ ps_iattr = parent ? parent->iattr : NULL;
/* update timestamps on the parent */
down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
@@ -1733,8 +1749,10 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
goto out;
error = 0;
- if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
- (strcmp(kn->name, new_name) == 0))
+ old_parent = kernfs_rcu_get_parent(kn);
+ old_name = kernfs_rcu_get_name(kn);
+ if ((old_parent == new_parent) && (kn->ns == new_ns) &&
+ (strcmp(old_name, new_name) == 0))
goto out; /* nothing to rename */
error = -EEXIST;
@@ -1742,7 +1760,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
goto out;
/* rename kernfs_node */
- if (strcmp(kn->name, new_name) != 0) {
+ if (strcmp(old_name, new_name) != 0) {
error = -ENOMEM;
new_name = kstrdup_const(new_name, GFP_KERNEL);
if (!new_name)
@@ -1757,25 +1775,18 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
kernfs_unlink_sibling(kn);
kernfs_get(new_parent);
- /* rename_lock protects ->parent and ->name accessors */
- write_lock_irq(&kernfs_rename_lock);
-
- old_parent = kn->parent;
- kn->parent = new_parent;
+ rcu_assign_pointer(kn->parent, new_parent);
kn->ns = new_ns;
- if (new_name) {
- old_name = kn->name;
- kn->name = new_name;
- }
+ if (new_name)
+ rcu_assign_pointer(kn->name, new_name);
- write_unlock_irq(&kernfs_rename_lock);
-
- kn->hash = kernfs_name_hash(kn->name, kn->ns);
+ kn->hash = kernfs_name_hash(kernfs_rcu_get_name(kn), kn->ns);
kernfs_link_sibling(kn);
kernfs_put(old_parent);
- kfree_const(old_name);
+ if (new_name && !is_kernel_rodata((unsigned long)old_name))
+ kfree_rcu_mightsleep(old_name);
error = 0;
out:
@@ -1794,7 +1805,7 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
{
if (pos) {
int valid = kernfs_active(pos) &&
- pos->parent == parent && hash == pos->hash;
+ kernfs_rcu_get_parent(pos) == parent && hash == pos->hash;
kernfs_put(pos);
if (!valid)
pos = NULL;
@@ -1859,7 +1870,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
pos;
pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
- const char *name = pos->name;
+ const char *name = kernfs_rcu_get_name(pos);
unsigned int type = fs_umode_to_dtype(pos->mode);
int len = strlen(name);
ino_t ino = kernfs_ino(pos);
@@ -1868,10 +1879,10 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);
- up_read(&root->kernfs_rwsem);
- if (!dir_emit(ctx, name, len, ino, type))
+ if (!dir_emit(ctx, name, len, ino, type)) {
+ up_read(&root->kernfs_rwsem);
return 0;
- down_read(&root->kernfs_rwsem);
+ }
}
up_read(&root->kernfs_rwsem);
file->private_data = NULL;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b9..8672264c166ca 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -911,9 +911,11 @@ static void kernfs_notify_workfn(struct work_struct *work)
/* kick fsnotify */
down_read(&root->kernfs_supers_rwsem);
+ down_read(&root->kernfs_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
struct kernfs_node *parent;
struct inode *p_inode = NULL;
+ const char *kn_name;
struct inode *inode;
struct qstr name;
@@ -927,7 +929,8 @@ static void kernfs_notify_workfn(struct work_struct *work)
if (!inode)
continue;
- name = (struct qstr)QSTR_INIT(kn->name, strlen(kn->name));
+ kn_name = kernfs_rcu_get_name(kn);
+ name = (struct qstr)QSTR_INIT(kn_name, strlen(kn_name));
parent = kernfs_get_parent(kn);
if (parent) {
p_inode = ilookup(info->sb, kernfs_ino(parent));
@@ -947,6 +950,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
iput(inode);
}
+ up_read(&root->kernfs_rwsem);
up_read(&root->kernfs_supers_rwsem);
kernfs_put(kn);
goto repeat;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index b42ee6547cdc1..847e3cc7c903a 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -64,11 +64,13 @@ struct kernfs_root {
*
* Return: the kernfs_root @kn belongs to.
*/
-static inline struct kernfs_root *kernfs_root(struct kernfs_node *kn)
+static inline struct kernfs_root *kernfs_root(const struct kernfs_node *kn)
{
+ struct kernfs_node *parent;
/* if parent exists, it's always a dir; otherwise, @sd is a dir */
- if (kn->parent)
- kn = kn->parent;
+ parent = rcu_dereference_check(kn->parent, true);
+ if (parent)
+ kn = parent;
return kn->dir.root;
}
@@ -97,6 +99,21 @@ struct kernfs_super_info {
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
+static inline bool kernfs_root_is_locked(const struct kernfs_node *kn)
+{
+ return lockdep_is_held(&kernfs_root(kn)->kernfs_rwsem);
+}
+
+static inline struct kernfs_node *kernfs_rcu_get_parent(struct kernfs_node *kn)
+{
+ return rcu_dereference_check(kn->parent, kernfs_root_is_locked(kn));
+}
+
+static inline const char *kernfs_rcu_get_name(const struct kernfs_node *kn)
+{
+ return rcu_dereference_check(kn->name, kernfs_root_is_locked(kn));
+}
+
static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
{
if (d_really_is_negative(dentry))
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1358c21837f1a..de12915d20d68 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -146,7 +146,8 @@ static struct dentry *kernfs_get_parent_dentry(struct dentry *child)
{
struct kernfs_node *kn = kernfs_dentry_node(child);
- return d_obtain_alias(kernfs_get_inode(child->d_sb, kn->parent));
+ return d_obtain_alias(kernfs_get_inode(child->d_sb,
+ rcu_dereference_check(kn->parent, true)));
}
static const struct export_operations kernfs_export_ops = {
@@ -181,15 +182,18 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
static struct kernfs_node *find_next_ancestor(struct kernfs_node *child,
struct kernfs_node *parent)
{
+ struct kernfs_node *childs_parent;
+
if (child == parent) {
pr_crit_once("BUG in find_next_ancestor: called with parent == child");
return NULL;
}
- while (child->parent != parent) {
- if (!child->parent)
+ childs_parent = rcu_dereference_check(child->parent, true);
+ while (childs_parent != parent) {
+ if (!childs_parent)
return NULL;
- child = child->parent;
+ child = childs_parent;
+ childs_parent = rcu_dereference_check(child->parent, true);
}
return child;
@@ -225,6 +229,7 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
do {
struct dentry *dtmp;
struct kernfs_node *kntmp;
+ const char *name;
if (kn == knparent)
return dentry;
@@ -233,8 +238,8 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
dput(dentry);
return ERR_PTR(-EINVAL);
}
- dtmp = lookup_positive_unlocked(kntmp->name, dentry,
- strlen(kntmp->name));
+ name = rcu_dereference_check(kntmp->name, true);
+ dtmp = lookup_positive_unlocked(name, dentry, strlen(name));
dput(dentry);
if (IS_ERR(dtmp))
return dtmp;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 45371a70caa71..f98725853c4eb 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -62,10 +62,10 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
/* go up to the root, stop at the base */
base = parent;
- while (base->parent) {
- kn = target->parent;
- while (kn->parent && base != kn)
- kn = kn->parent;
+ while (kernfs_rcu_get_parent(base)) {
+ kn = kernfs_rcu_get_parent(target);
+ while (kernfs_rcu_get_parent(kn) && base != kn)
+ kn = kernfs_rcu_get_parent(kn);
if (base == kn)
break;
@@ -75,14 +75,14 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
strcpy(s, "../");
s += 3;
- base = base->parent;
+ base = kernfs_rcu_get_parent(base);
}
/* determine end of target string for reverse fillup */
kn = target;
- while (kn->parent && kn != base) {
- len += strlen(kn->name) + 1;
- kn = kn->parent;
+ while (kernfs_rcu_get_parent(kn) && kn != base) {
+ len += strlen(kernfs_rcu_get_name(kn)) + 1;
+ kn = kernfs_rcu_get_parent(kn);
}
/* check limits */
@@ -94,15 +94,16 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
/* reverse fillup of target string from target to base */
kn = target;
- while (kn->parent && kn != base) {
- int slen = strlen(kn->name);
+ while (kernfs_rcu_get_parent(kn) && kn != base) {
+ const char *name = kernfs_rcu_get_name(kn);
+ int slen = strlen(name);
len -= slen;
- memcpy(s + len, kn->name, slen);
+ memcpy(s + len, name, slen);
if (len)
s[--len] = '/';
- kn = kn->parent;
+ kn = kernfs_rcu_get_parent(kn);
}
return 0;
@@ -111,7 +112,7 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
static int kernfs_getlink(struct inode *inode, char *path)
{
struct kernfs_node *kn = inode->i_private;
- struct kernfs_node *parent = kn->parent;
+ struct kernfs_node *parent = rcu_dereference_check(kn->parent, true);
struct kernfs_node *target = kn->symlink.target_kn;
struct kernfs_root *root = kernfs_root(parent);
int error;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 87c79d076d6d7..733d89de40542 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -199,8 +199,8 @@ struct kernfs_node {
* never moved to a different parent, it is safe to access the
* parent directly.
*/
- struct kernfs_node *parent;
- const char *name;
+ struct kernfs_node __rcu *parent;
+ const char __rcu *name;
struct rb_node rb;
--
2.45.2
Sebastian
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [BUG] -next lockdep invalid wait context
2024-11-08 10:05 ` Sebastian Andrzej Siewior
@ 2024-11-08 17:02 ` Tejun Heo
2024-11-08 17:12 ` Sebastian Andrzej Siewior
2024-11-08 22:24 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup Sebastian Andrzej Siewior
0 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2024-11-08 17:02 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Boqun Feng, Hillf Danton, Paul E. McKenney, Marco Elver,
linux-kernel, gregkh, tglx, syzbot
Hello,
On Fri, Nov 08, 2024 at 11:05:03AM +0100, Sebastian Andrzej Siewior wrote:
...
> static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
> {
> + struct kernfs_node *kn_parent;
> + const char *kn_name;
> +
> if (!kn)
> return strscpy(buf, "(null)", buflen);
>
> - return strscpy(buf, kn->parent ? kn->name : "/", buflen);
> + kn_parent = rcu_dereference(kn->parent);
kn->parent can never change. This can just be a regular deref.
> + kn_name = rcu_dereference(kn->name);
> + return strscpy(buf, kn_parent ? kn_name : "/", buflen);
> }
>
> /* kernfs_node_depth - compute depth from @from to @to */
> @@ -66,7 +70,7 @@ static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
>
> while (to->parent && to != from) {
> depth++;
> - to = to->parent;
> + to = rcu_dereference(to->parent);
Ditto here and other places.
> int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
> {
> - unsigned long flags;
> int ret;
>
> - read_lock_irqsave(&kernfs_rename_lock, flags);
> + rcu_read_lock();
> ret = kernfs_name_locked(kn, buf, buflen);
> - read_unlock_irqrestore(&kernfs_rename_lock, flags);
> + rcu_read_unlock();
> return ret;
> }
>
> @@ -223,12 +226,11 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
> int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
> char *buf, size_t buflen)
> {
> - unsigned long flags;
> int ret;
>
> - read_lock_irqsave(&kernfs_rename_lock, flags);
> + rcu_read_lock();
> ret = kernfs_path_from_node_locked(to, from, buf, buflen);
> - read_unlock_irqrestore(&kernfs_rename_lock, flags);
> + rcu_read_unlock();
> return ret;
> }
The _locked suffix looks awkward afterwards. Given that there's no
restriction on how the function is called anymore, maybe we can just
collapse _locked bodies into the callers?
...
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 87c79d076d6d7..733d89de40542 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -199,8 +199,8 @@ struct kernfs_node {
> * never moved to a different parent, it is safe to access the
> * parent directly.
> */
> - struct kernfs_node *parent;
> - const char *name;
> + struct kernfs_node __rcu *parent;
> + const char __rcu *name;
->parent doesn't have to be converted to RCU. As long as the node is
accessible, the parent is guaranteed to be pinned and unchanged. We only
need to RCUfy ->name.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [BUG] -next lockdep invalid wait context
2024-11-08 17:02 ` Tejun Heo
@ 2024-11-08 17:12 ` Sebastian Andrzej Siewior
2024-11-08 22:24 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup Sebastian Andrzej Siewior
1 sibling, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-08 17:12 UTC (permalink / raw)
To: Tejun Heo
Cc: Boqun Feng, Hillf Danton, Paul E. McKenney, Marco Elver,
linux-kernel, gregkh, tglx, syzbot
On 2024-11-08 07:02:24 [-1000], Tejun Heo wrote:
> Hello,
Hi,
> > @@ -223,12 +226,11 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
> > int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
> > char *buf, size_t buflen)
> > {
> > - unsigned long flags;
> > int ret;
> >
> > - read_lock_irqsave(&kernfs_rename_lock, flags);
> > + rcu_read_lock();
> > ret = kernfs_path_from_node_locked(to, from, buf, buflen);
> > - read_unlock_irqrestore(&kernfs_rename_lock, flags);
> > + rcu_read_unlock();
> > return ret;
> > }
>
> The _locked suffix looks awkward afterwards. Given that there's no
> restriction on how the function is called anymore, maybe we can just
> collapse _locked bodies into the callers?
Sure.
> ...
> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> > index 87c79d076d6d7..733d89de40542 100644
> > --- a/include/linux/kernfs.h
> > +++ b/include/linux/kernfs.h
> > @@ -199,8 +199,8 @@ struct kernfs_node {
> > * never moved to a different parent, it is safe to access the
> > * parent directly.
> > */
> > - struct kernfs_node *parent;
> > - const char *name;
> > + struct kernfs_node __rcu *parent;
> > + const char __rcu *name;
>
> ->parent doesn't have to be converted to RCU. As long as the node is
> accessible, the parent is guaranteed to be pinned and unchanged. We only
> need to RCUfy ->name.
I was uncertain about parent. But in that case, let me keep parent as-is
and just do name.
Thanks.
> Thanks.
>
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH] kernfs: Use RCU for kernfs_node::name lookup.
2024-11-08 17:02 ` Tejun Heo
2024-11-08 17:12 ` Sebastian Andrzej Siewior
@ 2024-11-08 22:24 ` Sebastian Andrzej Siewior
2024-11-08 22:31 ` Tejun Heo
` (2 more replies)
1 sibling, 3 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-08 22:24 UTC (permalink / raw)
To: Tejun Heo
Cc: Boqun Feng, Hillf Danton, Paul E. McKenney, Marco Elver,
linux-kernel, gregkh, tglx, syzbot
Instead of using kernfs_rename_lock for lookups of ::name, use RCU for the
lookup. Rely on kn's kernfs_root::kernfs_rwsem for update
synchronisation.
The .*_locked() variants have been moved into the callers.
The lock in kernfs_get_parent() has been dropped, the parent node should not
vanish underneath us. The RCU read-lock and atomic_inc_not_zero()
is a safety net in case it does.
kernfs_fop_readdir() no longer drops kernfs_root::kernfs_rwsem to ensure
the name pointer does not vanish while the page fault is handled.
kernfs_notify_workfn() gained the lock for the same reason.
Fixes: 2b5067a8143e3 ("mm: mmap_lock: add tracepoints around lock acquisition")
Reported-by: syzbot+6ea37e2e6ffccf41a7e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/67251dc6.050a0220.529b6.015e.GAE@google.com/
Reported-by: Hillf Danton <hdanton@sina.com>
Closes: https://lore.kernel.org/20241102001224.2789-1-hdanton@sina.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
fs/kernfs/dir.c | 164 ++++++++++++++----------------------
fs/kernfs/file.c | 6 +-
fs/kernfs/kernfs-internal.h | 12 ++-
fs/kernfs/mount.c | 6 +-
fs/kernfs/symlink.c | 10 ++-
include/linux/kernfs.h | 4 +-
6 files changed, 93 insertions(+), 109 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 458519e416fe7..9f4f2fc48b0a1 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,6 @@
#include "kernfs-internal.h"
-static DEFINE_RWLOCK(kernfs_rename_lock); /* kn->parent and ->name */
/*
* Don't use rename_lock to piggy back on pr_cont_buf. We don't want to
* call pr_cont() while holding rename_lock. Because sometimes pr_cont()
@@ -51,14 +50,6 @@ static bool kernfs_lockdep(struct kernfs_node *kn)
#endif
}
-static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
-{
- if (!kn)
- return strscpy(buf, "(null)", buflen);
-
- return strscpy(buf, kn->parent ? kn->name : "/", buflen);
-}
-
/* kernfs_node_depth - compute depth from @from to @to */
static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
{
@@ -102,7 +93,35 @@ static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
}
/**
- * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * kernfs_name - obtain the name of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Copies the name of @kn into @buf of @buflen bytes. The behavior is
+ * similar to strscpy().
+ *
+ * Fills buffer with "(null)" if @kn is %NULL.
+ *
+ * Return: the resulting length of @buf. If @buf isn't long enough,
+ * it's filled up to @buflen-1 and nul terminated, and returns -E2BIG.
+ *
+ * This function can be called from any context.
+ */
+int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ const char *kn_name;
+
+ if (!kn)
+ return strscpy(buf, "(null)", buflen);
+
+ guard(rcu)();
+ kn_name = rcu_dereference(kn->name);
+ return strscpy(buf, kn->parent ? kn_name : "/", buflen);
+}
+
+/**
+ * kernfs_path_from_node - find a pseudo-absolute path to @kn_to,
* where kn_from is treated as root of the path.
* @kn_from: kernfs node which should be treated as root for the path
* @kn_to: kernfs node to which path is needed
@@ -131,9 +150,8 @@ static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
* greater than @buflen, @buf contains the truncated path with the trailing
* '\0'. On error, -errno is returned.
*/
-static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
- struct kernfs_node *kn_from,
- char *buf, size_t buflen)
+int kernfs_path_from_node(struct kernfs_node *kn_to, struct kernfs_node *kn_from,
+ char *buf, size_t buflen)
{
struct kernfs_node *kn, *common;
const char parent_str[] = "/..";
@@ -166,71 +184,16 @@ static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
len += copied;
}
+ guard(rcu)();
/* Calculate how many bytes we need for the rest */
for (i = depth_to - 1; i >= 0; i--) {
for (kn = kn_to, j = 0; j < i; j++)
kn = kn->parent;
- len += scnprintf(buf + len, buflen - len, "/%s", kn->name);
+ len += scnprintf(buf + len, buflen - len, "/%s", rcu_dereference(kn->name));
}
-
return len;
}
-
-/**
- * kernfs_name - obtain the name of a given node
- * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
- * @buflen: size of @buf
- *
- * Copies the name of @kn into @buf of @buflen bytes. The behavior is
- * similar to strscpy().
- *
- * Fills buffer with "(null)" if @kn is %NULL.
- *
- * Return: the resulting length of @buf. If @buf isn't long enough,
- * it's filled up to @buflen-1 and nul terminated, and returns -E2BIG.
- *
- * This function can be called from any context.
- */
-int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
-{
- unsigned long flags;
- int ret;
-
- read_lock_irqsave(&kernfs_rename_lock, flags);
- ret = kernfs_name_locked(kn, buf, buflen);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
- return ret;
-}
-
-/**
- * kernfs_path_from_node - build path of node @to relative to @from.
- * @from: parent kernfs_node relative to which we need to build the path
- * @to: kernfs_node of interest
- * @buf: buffer to copy @to's path into
- * @buflen: size of @buf
- *
- * Builds @to's path relative to @from in @buf. @from and @to must
- * be on the same kernfs-root. If @from is not parent of @to, then a relative
- * path (which includes '..'s) as needed to reach from @from to @to is
- * returned.
- *
- * Return: the length of the constructed path. If the path would have been
- * greater than @buflen, @buf contains the truncated path with the trailing
- * '\0'. On error, -errno is returned.
- */
-int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
- char *buf, size_t buflen)
-{
- unsigned long flags;
- int ret;
-
- read_lock_irqsave(&kernfs_rename_lock, flags);
- ret = kernfs_path_from_node_locked(to, from, buf, buflen);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
- return ret;
-}
EXPORT_SYMBOL_GPL(kernfs_path_from_node);
/**
@@ -292,13 +255,15 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn)
{
struct kernfs_node *parent;
- unsigned long flags;
- read_lock_irqsave(&kernfs_rename_lock, flags);
+ guard(rcu)();
parent = kn->parent;
- kernfs_get(parent);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
+ if (!parent)
+ return parent;
+ if (WARN_ON_ONCE(!__kernfs_active(parent) ||
+ !atomic_inc_not_zero(&parent->count)))
+ return NULL;
return parent;
}
@@ -336,13 +301,13 @@ static int kernfs_name_compare(unsigned int hash, const char *name,
return -1;
if (ns > kn->ns)
return 1;
- return strcmp(name, kn->name);
+ return strcmp(name, kernfs_rcu_get_name(kn));
}
static int kernfs_sd_compare(const struct kernfs_node *left,
const struct kernfs_node *right)
{
- return kernfs_name_compare(left->hash, left->name, left->ns, right);
+ return kernfs_name_compare(left->hash, kernfs_rcu_get_name(left), left->ns, right);
}
/**
@@ -533,7 +498,8 @@ static void kernfs_free_rcu(struct rcu_head *rcu)
{
struct kernfs_node *kn = container_of(rcu, struct kernfs_node, rcu);
- kfree_const(kn->name);
+ /* If the whole node goes away, the name can't be used outside */
+ kfree_const(rcu_dereference_check(kn->name, true));
if (kn->iattr) {
simple_xattrs_free(&kn->iattr->xattrs, NULL);
@@ -556,7 +522,9 @@ void kernfs_put(struct kernfs_node *kn)
if (!kn || !atomic_dec_and_test(&kn->count))
return;
+
root = kernfs_root(kn);
+ guard(rcu)();
repeat:
/*
* Moving/renaming is always done while holding reference.
@@ -566,7 +534,8 @@ void kernfs_put(struct kernfs_node *kn)
WARN_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS,
"kernfs_put: %s/%s: released with incorrect active_ref %d\n",
- parent ? parent->name : "", kn->name, atomic_read(&kn->active));
+ parent ? rcu_dereference(parent->name) : "",
+ rcu_dereference(kn->name), atomic_read(&kn->active));
if (kernfs_type(kn) == KERNFS_LINK)
kernfs_put(kn->symlink.target_kn);
@@ -643,7 +612,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
RB_CLEAR_NODE(&kn->rb);
- kn->name = name;
+ rcu_assign_pointer(kn->name, name);
kn->mode = mode;
kn->flags = flags;
@@ -790,7 +759,7 @@ int kernfs_add_one(struct kernfs_node *kn)
if (parent->flags & (KERNFS_REMOVING | KERNFS_EMPTY_DIR))
goto out_unlock;
- kn->hash = kernfs_name_hash(kn->name, kn->ns);
+ kn->hash = kernfs_name_hash(kernfs_rcu_get_name(kn), kn->ns);
ret = kernfs_link_sibling(kn);
if (ret)
@@ -1167,7 +1136,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
goto out_bad;
/* The kernfs node has been renamed */
- if (strcmp(dentry->d_name.name, kn->name) != 0)
+ if (strcmp(dentry->d_name.name, kernfs_rcu_get_name(kn)) != 0)
goto out_bad;
/* The kernfs node has been moved to a different namespace */
@@ -1733,8 +1702,10 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
goto out;
error = 0;
- if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
- (strcmp(kn->name, new_name) == 0))
+ old_parent = kn->parent;
+ old_name = kernfs_rcu_get_name(kn);
+ if ((old_parent == new_parent) && (kn->ns == new_ns) &&
+ (strcmp(old_name, new_name) == 0))
goto out; /* nothing to rename */
error = -EEXIST;
@@ -1742,7 +1713,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
goto out;
/* rename kernfs_node */
- if (strcmp(kn->name, new_name) != 0) {
+ if (strcmp(old_name, new_name) != 0) {
error = -ENOMEM;
new_name = kstrdup_const(new_name, GFP_KERNEL);
if (!new_name)
@@ -1757,25 +1728,18 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
kernfs_unlink_sibling(kn);
kernfs_get(new_parent);
- /* rename_lock protects ->parent and ->name accessors */
- write_lock_irq(&kernfs_rename_lock);
-
- old_parent = kn->parent;
kn->parent = new_parent;
kn->ns = new_ns;
- if (new_name) {
- old_name = kn->name;
- kn->name = new_name;
- }
+ if (new_name)
+ rcu_assign_pointer(kn->name, new_name);
- write_unlock_irq(&kernfs_rename_lock);
-
- kn->hash = kernfs_name_hash(kn->name, kn->ns);
+ kn->hash = kernfs_name_hash(kernfs_rcu_get_name(kn), kn->ns);
kernfs_link_sibling(kn);
kernfs_put(old_parent);
- kfree_const(old_name);
+ if (new_name && !is_kernel_rodata((unsigned long)old_name))
+ kfree_rcu_mightsleep(old_name);
error = 0;
out:
@@ -1859,7 +1823,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
pos;
pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
- const char *name = pos->name;
+ const char *name = kernfs_rcu_get_name(pos);
unsigned int type = fs_umode_to_dtype(pos->mode);
int len = strlen(name);
ino_t ino = kernfs_ino(pos);
@@ -1868,10 +1832,10 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);
- up_read(&root->kernfs_rwsem);
- if (!dir_emit(ctx, name, len, ino, type))
+ if (!dir_emit(ctx, name, len, ino, type)) {
+ up_read(&root->kernfs_rwsem);
return 0;
- down_read(&root->kernfs_rwsem);
+ }
}
up_read(&root->kernfs_rwsem);
file->private_data = NULL;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b9..8672264c166ca 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -911,9 +911,11 @@ static void kernfs_notify_workfn(struct work_struct *work)
/* kick fsnotify */
down_read(&root->kernfs_supers_rwsem);
+ down_read(&root->kernfs_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
struct kernfs_node *parent;
struct inode *p_inode = NULL;
+ const char *kn_name;
struct inode *inode;
struct qstr name;
@@ -927,7 +929,8 @@ static void kernfs_notify_workfn(struct work_struct *work)
if (!inode)
continue;
- name = (struct qstr)QSTR_INIT(kn->name, strlen(kn->name));
+ kn_name = kernfs_rcu_get_name(kn);
+ name = (struct qstr)QSTR_INIT(kn_name, strlen(kn_name));
parent = kernfs_get_parent(kn);
if (parent) {
p_inode = ilookup(info->sb, kernfs_ino(parent));
@@ -947,6 +950,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
iput(inode);
}
+ up_read(&root->kernfs_rwsem);
up_read(&root->kernfs_supers_rwsem);
kernfs_put(kn);
goto repeat;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index b42ee6547cdc1..335826a1f42d5 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -64,7 +64,7 @@ struct kernfs_root {
*
* Return: the kernfs_root @kn belongs to.
*/
-static inline struct kernfs_root *kernfs_root(struct kernfs_node *kn)
+static inline struct kernfs_root *kernfs_root(const struct kernfs_node *kn)
{
/* if parent exists, it's always a dir; otherwise, @sd is a dir */
if (kn->parent)
@@ -97,6 +97,16 @@ struct kernfs_super_info {
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
+static inline bool kernfs_root_is_locked(const struct kernfs_node *kn)
+{
+ return lockdep_is_held(&kernfs_root(kn)->kernfs_rwsem);
+}
+
+static inline const char *kernfs_rcu_get_name(const struct kernfs_node *kn)
+{
+ return rcu_dereference_check(kn->name, kernfs_root_is_locked(kn));
+}
+
static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
{
if (d_really_is_negative(dentry))
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1358c21837f1a..7dedcfae022c4 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -222,9 +222,11 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
return ERR_PTR(-EINVAL);
}
+ guard(rcu)();
do {
struct dentry *dtmp;
struct kernfs_node *kntmp;
+ const char *name;
if (kn == knparent)
return dentry;
@@ -233,8 +235,8 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
dput(dentry);
return ERR_PTR(-EINVAL);
}
- dtmp = lookup_positive_unlocked(kntmp->name, dentry,
- strlen(kntmp->name));
+ name = rcu_dereference(kntmp->name);
+ dtmp = lookup_positive_unlocked(name, dentry, strlen(name));
dput(dentry);
if (IS_ERR(dtmp))
return dtmp;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 45371a70caa71..bdb4f45254b91 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -57,8 +57,10 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
struct kernfs_node *target, char *path)
{
struct kernfs_node *base, *kn;
+ const char *name;
char *s = path;
int len = 0;
+ int slen;
/* go up to the root, stop at the base */
base = parent;
@@ -81,7 +83,8 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
/* determine end of target string for reverse fillup */
kn = target;
while (kn->parent && kn != base) {
- len += strlen(kn->name) + 1;
+ name = kernfs_rcu_get_name(kn);
+ len += strlen(name) + 1;
kn = kn->parent;
}
@@ -95,10 +98,11 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
/* reverse fillup of target string from target to base */
kn = target;
while (kn->parent && kn != base) {
- int slen = strlen(kn->name);
+ name = kernfs_rcu_get_name(kn);
+ slen = strlen(name);
len -= slen;
- memcpy(s + len, kn->name, slen);
+ memcpy(s + len, name, slen);
if (len)
s[--len] = '/';
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 87c79d076d6d7..bbeeee3ae7dbc 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -200,7 +200,7 @@ struct kernfs_node {
* parent directly.
*/
struct kernfs_node *parent;
- const char *name;
+ const char __rcu *name;
struct rb_node rb;
@@ -395,7 +395,7 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
}
int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
-int kernfs_path_from_node(struct kernfs_node *root_kn, struct kernfs_node *kn,
+int kernfs_path_from_node(struct kernfs_node *kn_to, struct kernfs_node *kn_from,
char *buf, size_t buflen);
void pr_cont_kernfs_name(struct kernfs_node *kn);
void pr_cont_kernfs_path(struct kernfs_node *kn);
--
2.45.2
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH] kernfs: Use RCU for kernfs_node::name lookup.
2024-11-08 22:24 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup Sebastian Andrzej Siewior
@ 2024-11-08 22:31 ` Tejun Heo
2024-11-11 17:04 ` Sebastian Andrzej Siewior
2024-11-08 23:16 ` Hillf Danton
2024-11-11 4:49 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup kernel test robot
2 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-11-08 22:31 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Boqun Feng, Hillf Danton, Paul E. McKenney, Marco Elver,
linux-kernel, gregkh, tglx, syzbot
Hello, Sebastian.
On Fri, Nov 08, 2024 at 11:24:06PM +0100, Sebastian Andrzej Siewior wrote:
> Instead of using kernfs_rename_lock for lookups of ::name use RCU for
> lookup. Rely on kn's kernfs_root::kernfs_rwsem for update
> synchronisation.
>
> The .*_locked() variants have been moved into the callers.
> The lock in kernfs_get_parent() has been dropped, the parent node should not
> vanish underneath us. The RCU read-lock and atomic_inc_not_zero()
> is a safety net in case it does.
> kernfs_fop_readdir() no longer drops kernfs_root::kernfs_rwsem to ensure
> the name pointer does not vanish while the page fault is handled.
> kernfs_notify_workfn() gained the lock for the same reason.
I owe an apology. I was just thinking about cgroups. Sysfs, I think, does
allow moving node a different parent, which IIRC is used by netdevs. How
about something like this:
- Add a KERNFS_ROOT flag indicating that parent-changing moves aren't
allowed.
- Update kernfs_rename_ns() to trigger warning and fail if the above flag is
set and new_parent is different from the old one.
- Create a separate interface which uses RCU instead of rename lock for name
/ path lookups. The RCU ones should trigger warning if used when the above
KERNFS_ROOT flag is not set.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH] kernfs: Use RCU for kernfs_node::name lookup.
2024-11-08 22:24 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup Sebastian Andrzej Siewior
2024-11-08 22:31 ` Tejun Heo
@ 2024-11-08 23:16 ` Hillf Danton
2024-11-08 23:48 ` [syzbot] [kernfs?] WARNING: locking bug in kernfs_path_from_node syzbot
2024-11-11 4:49 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup kernel test robot
2 siblings, 1 reply; 38+ messages in thread
From: Hillf Danton @ 2024-11-08 23:16 UTC (permalink / raw)
To: syzbot
Cc: Sebastian Andrzej Siewior, Tejun Heo, Boqun Feng,
Paul E. McKenney, Marco Elver, linux-kernel, gregkh,
syzkaller-bugs, Hillf Danton
On Fri, 8 Nov 2024 23:24:06 +0100 Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Instead of using kernfs_rename_lock for lookups of ::name use RCU for
> lookup. Rely on kn's kernfs_root::kernfs_rwsem for update
> synchronisation.
>
> The .*_locked() variants have been moved into the callers.
> The lock in kernfs_get_parent() has been dropped, the parent node should not
> vanish underneath us. The RCU read-lock and atomic_inc_not_zero()
> is a safety net in case it does.
> kernfs_fop_readdir() no longer drops kernfs_root::kernfs_rwsem to ensure
> the name pointer does not vanish while the page fault is handled.
> kernfs_notify_workfn() gained the lock for the same reason.
#syz test https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git f9f24ca362a4
Fixes: 2b5067a8143e3 ("mm: mmap_lock: add tracepoints around lock acquisition")
Reported-by: syzbot+6ea37e2e6ffccf41a7e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/67251dc6.050a0220.529b6.015e.GAE@google.com/
Reported-by: Hillf Danton <hdanton@sina.com>
Closes: https://lore.kernel.org/20241102001224.2789-1-hdanton@sina.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
fs/kernfs/dir.c | 164 ++++++++++++++----------------------
fs/kernfs/file.c | 6 +-
fs/kernfs/kernfs-internal.h | 12 ++-
fs/kernfs/mount.c | 6 +-
fs/kernfs/symlink.c | 10 ++-
include/linux/kernfs.h | 4 +-
6 files changed, 93 insertions(+), 109 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 458519e416fe7..9f4f2fc48b0a1 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,6 @@
#include "kernfs-internal.h"
-static DEFINE_RWLOCK(kernfs_rename_lock); /* kn->parent and ->name */
/*
* Don't use rename_lock to piggy back on pr_cont_buf. We don't want to
* call pr_cont() while holding rename_lock. Because sometimes pr_cont()
@@ -51,14 +50,6 @@ static bool kernfs_lockdep(struct kernfs_node *kn)
#endif
}
-static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
-{
- if (!kn)
- return strscpy(buf, "(null)", buflen);
-
- return strscpy(buf, kn->parent ? kn->name : "/", buflen);
-}
-
/* kernfs_node_depth - compute depth from @from to @to */
static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
{
@@ -102,7 +93,35 @@ static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
}
/**
- * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * kernfs_name - obtain the name of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Copies the name of @kn into @buf of @buflen bytes. The behavior is
+ * similar to strscpy().
+ *
+ * Fills buffer with "(null)" if @kn is %NULL.
+ *
+ * Return: the resulting length of @buf. If @buf isn't long enough,
+ * it's filled up to @buflen-1 and nul terminated, and returns -E2BIG.
+ *
+ * This function can be called from any context.
+ */
+int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ const char *kn_name;
+
+ if (!kn)
+ return strscpy(buf, "(null)", buflen);
+
+ guard(rcu)();
+ kn_name = rcu_dereference(kn->name);
+ return strscpy(buf, kn->parent ? kn_name : "/", buflen);
+}
+
+/**
+ * kernfs_path_from_node - find a pseudo-absolute path to @kn_to,
* where kn_from is treated as root of the path.
* @kn_from: kernfs node which should be treated as root for the path
* @kn_to: kernfs node to which path is needed
@@ -131,9 +150,8 @@ static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
* greater than @buflen, @buf contains the truncated path with the trailing
* '\0'. On error, -errno is returned.
*/
-static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
- struct kernfs_node *kn_from,
- char *buf, size_t buflen)
+int kernfs_path_from_node(struct kernfs_node *kn_to, struct kernfs_node *kn_from,
+ char *buf, size_t buflen)
{
struct kernfs_node *kn, *common;
const char parent_str[] = "/..";
@@ -166,71 +184,16 @@ static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
len += copied;
}
+ guard(rcu)();
/* Calculate how many bytes we need for the rest */
for (i = depth_to - 1; i >= 0; i--) {
for (kn = kn_to, j = 0; j < i; j++)
kn = kn->parent;
- len += scnprintf(buf + len, buflen - len, "/%s", kn->name);
+ len += scnprintf(buf + len, buflen - len, "/%s", rcu_dereference(kn->name));
}
-
return len;
}
-
-/**
- * kernfs_name - obtain the name of a given node
- * @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
- * @buflen: size of @buf
- *
- * Copies the name of @kn into @buf of @buflen bytes. The behavior is
- * similar to strscpy().
- *
- * Fills buffer with "(null)" if @kn is %NULL.
- *
- * Return: the resulting length of @buf. If @buf isn't long enough,
- * it's filled up to @buflen-1 and nul terminated, and returns -E2BIG.
- *
- * This function can be called from any context.
- */
-int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
-{
- unsigned long flags;
- int ret;
-
- read_lock_irqsave(&kernfs_rename_lock, flags);
- ret = kernfs_name_locked(kn, buf, buflen);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
- return ret;
-}
-
-/**
- * kernfs_path_from_node - build path of node @to relative to @from.
- * @from: parent kernfs_node relative to which we need to build the path
- * @to: kernfs_node of interest
- * @buf: buffer to copy @to's path into
- * @buflen: size of @buf
- *
- * Builds @to's path relative to @from in @buf. @from and @to must
- * be on the same kernfs-root. If @from is not parent of @to, then a relative
- * path (which includes '..'s) as needed to reach from @from to @to is
- * returned.
- *
- * Return: the length of the constructed path. If the path would have been
- * greater than @buflen, @buf contains the truncated path with the trailing
- * '\0'. On error, -errno is returned.
- */
-int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
- char *buf, size_t buflen)
-{
- unsigned long flags;
- int ret;
-
- read_lock_irqsave(&kernfs_rename_lock, flags);
- ret = kernfs_path_from_node_locked(to, from, buf, buflen);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
- return ret;
-}
EXPORT_SYMBOL_GPL(kernfs_path_from_node);
/**
@@ -292,13 +255,15 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn)
{
struct kernfs_node *parent;
- unsigned long flags;
- read_lock_irqsave(&kernfs_rename_lock, flags);
+ guard(rcu)();
parent = kn->parent;
- kernfs_get(parent);
- read_unlock_irqrestore(&kernfs_rename_lock, flags);
+ if (!parent)
+ return parent;
+ if (WARN_ON_ONCE(!__kernfs_active(parent) ||
+ !atomic_inc_not_zero(&parent->count)))
+ return NULL;
return parent;
}
@@ -336,13 +301,13 @@ static int kernfs_name_compare(unsigned int hash, const char *name,
return -1;
if (ns > kn->ns)
return 1;
- return strcmp(name, kn->name);
+ return strcmp(name, kernfs_rcu_get_name(kn));
}
static int kernfs_sd_compare(const struct kernfs_node *left,
const struct kernfs_node *right)
{
- return kernfs_name_compare(left->hash, left->name, left->ns, right);
+ return kernfs_name_compare(left->hash, kernfs_rcu_get_name(left), left->ns, right);
}
/**
@@ -533,7 +498,8 @@ static void kernfs_free_rcu(struct rcu_head *rcu)
{
struct kernfs_node *kn = container_of(rcu, struct kernfs_node, rcu);
- kfree_const(kn->name);
+ /* If the whole node goes away, the name can't be used outside */
+ kfree_const(rcu_dereference_check(kn->name, true));
if (kn->iattr) {
simple_xattrs_free(&kn->iattr->xattrs, NULL);
@@ -556,7 +522,9 @@ void kernfs_put(struct kernfs_node *kn)
if (!kn || !atomic_dec_and_test(&kn->count))
return;
+
root = kernfs_root(kn);
+ guard(rcu)();
repeat:
/*
* Moving/renaming is always done while holding reference.
@@ -566,7 +534,8 @@ void kernfs_put(struct kernfs_node *kn)
WARN_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS,
"kernfs_put: %s/%s: released with incorrect active_ref %d\n",
- parent ? parent->name : "", kn->name, atomic_read(&kn->active));
+ parent ? rcu_dereference(parent->name) : "",
+ rcu_dereference(kn->name), atomic_read(&kn->active));
if (kernfs_type(kn) == KERNFS_LINK)
kernfs_put(kn->symlink.target_kn);
@@ -643,7 +612,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
RB_CLEAR_NODE(&kn->rb);
- kn->name = name;
+ rcu_assign_pointer(kn->name, name);
kn->mode = mode;
kn->flags = flags;
@@ -790,7 +759,7 @@ int kernfs_add_one(struct kernfs_node *kn)
if (parent->flags & (KERNFS_REMOVING | KERNFS_EMPTY_DIR))
goto out_unlock;
- kn->hash = kernfs_name_hash(kn->name, kn->ns);
+ kn->hash = kernfs_name_hash(kernfs_rcu_get_name(kn), kn->ns);
ret = kernfs_link_sibling(kn);
if (ret)
@@ -1167,7 +1136,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
goto out_bad;
/* The kernfs node has been renamed */
- if (strcmp(dentry->d_name.name, kn->name) != 0)
+ if (strcmp(dentry->d_name.name, kernfs_rcu_get_name(kn)) != 0)
goto out_bad;
/* The kernfs node has been moved to a different namespace */
@@ -1733,8 +1702,10 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
goto out;
error = 0;
- if ((kn->parent == new_parent) && (kn->ns == new_ns) &&
- (strcmp(kn->name, new_name) == 0))
+ old_parent = kn->parent;
+ old_name = kernfs_rcu_get_name(kn);
+ if ((old_parent == new_parent) && (kn->ns == new_ns) &&
+ (strcmp(old_name, new_name) == 0))
goto out; /* nothing to rename */
error = -EEXIST;
@@ -1742,7 +1713,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
goto out;
/* rename kernfs_node */
- if (strcmp(kn->name, new_name) != 0) {
+ if (strcmp(old_name, new_name) != 0) {
error = -ENOMEM;
new_name = kstrdup_const(new_name, GFP_KERNEL);
if (!new_name)
@@ -1757,25 +1728,18 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
kernfs_unlink_sibling(kn);
kernfs_get(new_parent);
- /* rename_lock protects ->parent and ->name accessors */
- write_lock_irq(&kernfs_rename_lock);
-
- old_parent = kn->parent;
kn->parent = new_parent;
kn->ns = new_ns;
- if (new_name) {
- old_name = kn->name;
- kn->name = new_name;
- }
+ if (new_name)
+ rcu_assign_pointer(kn->name, new_name);
- write_unlock_irq(&kernfs_rename_lock);
-
- kn->hash = kernfs_name_hash(kn->name, kn->ns);
+ kn->hash = kernfs_name_hash(kernfs_rcu_get_name(kn), kn->ns);
kernfs_link_sibling(kn);
kernfs_put(old_parent);
- kfree_const(old_name);
+ if (new_name && !is_kernel_rodata((unsigned long)old_name))
+ kfree_rcu_mightsleep(old_name);
error = 0;
out:
@@ -1859,7 +1823,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
pos;
pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
- const char *name = pos->name;
+ const char *name = kernfs_rcu_get_name(pos);
unsigned int type = fs_umode_to_dtype(pos->mode);
int len = strlen(name);
ino_t ino = kernfs_ino(pos);
@@ -1868,10 +1832,10 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);
- up_read(&root->kernfs_rwsem);
- if (!dir_emit(ctx, name, len, ino, type))
+ if (!dir_emit(ctx, name, len, ino, type)) {
+ up_read(&root->kernfs_rwsem);
return 0;
- down_read(&root->kernfs_rwsem);
+ }
}
up_read(&root->kernfs_rwsem);
file->private_data = NULL;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b9..8672264c166ca 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -911,9 +911,11 @@ static void kernfs_notify_workfn(struct work_struct *work)
/* kick fsnotify */
down_read(&root->kernfs_supers_rwsem);
+ down_read(&root->kernfs_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
struct kernfs_node *parent;
struct inode *p_inode = NULL;
+ const char *kn_name;
struct inode *inode;
struct qstr name;
@@ -927,7 +929,8 @@ static void kernfs_notify_workfn(struct work_struct *work)
if (!inode)
continue;
- name = (struct qstr)QSTR_INIT(kn->name, strlen(kn->name));
+ kn_name = kernfs_rcu_get_name(kn);
+ name = (struct qstr)QSTR_INIT(kn_name, strlen(kn_name));
parent = kernfs_get_parent(kn);
if (parent) {
p_inode = ilookup(info->sb, kernfs_ino(parent));
@@ -947,6 +950,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
iput(inode);
}
+ up_read(&root->kernfs_rwsem);
up_read(&root->kernfs_supers_rwsem);
kernfs_put(kn);
goto repeat;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index b42ee6547cdc1..335826a1f42d5 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -64,7 +64,7 @@ struct kernfs_root {
*
* Return: the kernfs_root @kn belongs to.
*/
-static inline struct kernfs_root *kernfs_root(struct kernfs_node *kn)
+static inline struct kernfs_root *kernfs_root(const struct kernfs_node *kn)
{
/* if parent exists, it's always a dir; otherwise, @sd is a dir */
if (kn->parent)
@@ -97,6 +97,16 @@ struct kernfs_super_info {
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
+static inline bool kernfs_root_is_locked(const struct kernfs_node *kn)
+{
+ return lockdep_is_held(&kernfs_root(kn)->kernfs_rwsem);
+}
+
+static inline const char *kernfs_rcu_get_name(const struct kernfs_node *kn)
+{
+ return rcu_dereference_check(kn->name, kernfs_root_is_locked(kn));
+}
+
static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
{
if (d_really_is_negative(dentry))
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1358c21837f1a..7dedcfae022c4 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -222,9 +222,11 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
return ERR_PTR(-EINVAL);
}
+ guard(rcu)();
do {
struct dentry *dtmp;
struct kernfs_node *kntmp;
+ const char *name;
if (kn == knparent)
return dentry;
@@ -233,8 +235,8 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
dput(dentry);
return ERR_PTR(-EINVAL);
}
- dtmp = lookup_positive_unlocked(kntmp->name, dentry,
- strlen(kntmp->name));
+ name = rcu_dereference(kntmp->name);
+ dtmp = lookup_positive_unlocked(name, dentry, strlen(name));
dput(dentry);
if (IS_ERR(dtmp))
return dtmp;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 45371a70caa71..bdb4f45254b91 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -57,8 +57,10 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
struct kernfs_node *target, char *path)
{
struct kernfs_node *base, *kn;
+ const char *name;
char *s = path;
int len = 0;
+ int slen;
/* go up to the root, stop at the base */
base = parent;
@@ -81,7 +83,8 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
/* determine end of target string for reverse fillup */
kn = target;
while (kn->parent && kn != base) {
- len += strlen(kn->name) + 1;
+ name = kernfs_rcu_get_name(kn);
+ len += strlen(name) + 1;
kn = kn->parent;
}
@@ -95,10 +98,11 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
/* reverse fillup of target string from target to base */
kn = target;
while (kn->parent && kn != base) {
- int slen = strlen(kn->name);
+ name = kernfs_rcu_get_name(kn);
+ slen = strlen(name);
len -= slen;
- memcpy(s + len, kn->name, slen);
+ memcpy(s + len, name, slen);
if (len)
s[--len] = '/';
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 87c79d076d6d7..bbeeee3ae7dbc 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -200,7 +200,7 @@ struct kernfs_node {
* parent directly.
*/
struct kernfs_node *parent;
- const char *name;
+ const char __rcu *name;
struct rb_node rb;
@@ -395,7 +395,7 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
}
int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
-int kernfs_path_from_node(struct kernfs_node *root_kn, struct kernfs_node *kn,
+int kernfs_path_from_node(struct kernfs_node *kn_to, struct kernfs_node *kn_from,
char *buf, size_t buflen);
void pr_cont_kernfs_name(struct kernfs_node *kn);
void pr_cont_kernfs_path(struct kernfs_node *kn);
--
2.45.2
* Re: [syzbot] [kernfs?] WARNING: locking bug in kernfs_path_from_node
2024-11-08 23:16 ` Hillf Danton
@ 2024-11-08 23:48 ` syzbot
0 siblings, 0 replies; 38+ messages in thread
From: syzbot @ 2024-11-08 23:48 UTC (permalink / raw)
To: bigeasy, boqun.feng, elver, gregkh, hdanton, linux-kernel,
paulmck, syzkaller-bugs, tj
Hello,
syzbot has tested the proposed patch and the reproducer did not trigger any issue:
Reported-by: syzbot+6ea37e2e6ffccf41a7e6@syzkaller.appspotmail.com
Tested-by: syzbot+6ea37e2e6ffccf41a7e6@syzkaller.appspotmail.com
Tested on:
commit: f9f24ca3 Add linux-next specific files for 20241031
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=125e435f980000
kernel config: https://syzkaller.appspot.com/x/.config?x=328572ed4d152be9
dashboard link: https://syzkaller.appspot.com/bug?extid=6ea37e2e6ffccf41a7e6
compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
patch: https://syzkaller.appspot.com/x/patch.diff?x=1062bd87980000
Note: testing is done by a robot and is best-effort only.
* Re: [PATCH] kernfs: Use RCU for kernfs_node::name lookup.
2024-11-08 22:24 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup Sebastian Andrzej Siewior
2024-11-08 22:31 ` Tejun Heo
2024-11-08 23:16 ` Hillf Danton
@ 2024-11-11 4:49 ` kernel test robot
2 siblings, 0 replies; 38+ messages in thread
From: kernel test robot @ 2024-11-11 4:49 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Tejun Heo
Cc: oe-kbuild-all, Boqun Feng, Hillf Danton, Paul E. McKenney,
Marco Elver, linux-kernel, gregkh, tglx, syzbot
Hi Sebastian,
kernel test robot noticed the following build warnings:
[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on driver-core/driver-core-next driver-core/driver-core-linus linus/master v6.12-rc7 next-20241108]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/kernfs-Use-RCU-for-kernfs_node-name-lookup/20241109-062450
base: driver-core/driver-core-testing
patch link: https://lore.kernel.org/r/20241108222406.n5azgO98%40linutronix.de
patch subject: [PATCH] kernfs: Use RCU for kernfs_node::name lookup.
config: x86_64-randconfig-122-20241111 (https://download.01.org/0day-ci/archive/20241111/202411111204.JebDWFvY-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241111/202411111204.JebDWFvY-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411111204.JebDWFvY-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> fs/sysfs/dir.c:126:51: sparse: sparse: incorrect type in argument 3 (different address spaces) @@ expected char const *new_name @@ got char const [noderef] __rcu *name @@
fs/sysfs/dir.c:126:51: sparse: expected char const *new_name
fs/sysfs/dir.c:126:51: sparse: got char const [noderef] __rcu *name
--
>> arch/x86/kernel/cpu/resctrl/pseudo_lock.c:1363:65: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected char const *name @@ got char const [noderef] __rcu *name @@
arch/x86/kernel/cpu/resctrl/pseudo_lock.c:1363:65: sparse: expected char const *name
arch/x86/kernel/cpu/resctrl/pseudo_lock.c:1363:65: sparse: got char const [noderef] __rcu *name
arch/x86/kernel/cpu/resctrl/pseudo_lock.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
--
>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:3654:27: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected char const * @@ got char const [noderef] __rcu *name @@
arch/x86/kernel/cpu/resctrl/rdtgroup.c:3654:27: sparse: expected char const *
arch/x86/kernel/cpu/resctrl/rdtgroup.c:3654:27: sparse: got char const [noderef] __rcu *name
>> arch/x86/kernel/cpu/resctrl/rdtgroup.c:3793:45: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected char const *name @@ got char const [noderef] __rcu *name @@
arch/x86/kernel/cpu/resctrl/rdtgroup.c:3793:45: sparse: expected char const *name
arch/x86/kernel/cpu/resctrl/rdtgroup.c:3793:45: sparse: got char const [noderef] __rcu *name
arch/x86/kernel/cpu/resctrl/rdtgroup.c:3879:42: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected char const *name @@ got char const [noderef] __rcu *name @@
arch/x86/kernel/cpu/resctrl/rdtgroup.c:3879:42: sparse: expected char const *name
arch/x86/kernel/cpu/resctrl/rdtgroup.c:3879:42: sparse: got char const [noderef] __rcu *name
arch/x86/kernel/cpu/resctrl/rdtgroup.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
vim +126 fs/sysfs/dir.c
ca1bab38195d66 Eric W. Biederman 2009-11-20 116
e34ff4906199d2 Tejun Heo 2013-09-11 117 int sysfs_move_dir_ns(struct kobject *kobj, struct kobject *new_parent_kobj,
e34ff4906199d2 Tejun Heo 2013-09-11 118 const void *new_ns)
8a82472f86bf69 Cornelia Huck 2006-11-20 119 {
324a56e16e44ba Tejun Heo 2013-12-11 120 struct kernfs_node *kn = kobj->sd;
324a56e16e44ba Tejun Heo 2013-12-11 121 struct kernfs_node *new_parent;
8a82472f86bf69 Cornelia Huck 2006-11-20 122
324a56e16e44ba Tejun Heo 2013-12-11 123 new_parent = new_parent_kobj && new_parent_kobj->sd ?
324a56e16e44ba Tejun Heo 2013-12-11 124 new_parent_kobj->sd : sysfs_root_kn;
51225039f3cf9d Tejun Heo 2007-06-14 125
adc5e8b58f4886 Tejun Heo 2013-12-11 @126 return kernfs_rename_ns(kn, new_parent, kn->name, new_ns);
8a82472f86bf69 Cornelia Huck 2006-11-20 127 }
87d2846fcf8811 Eric W. Biederman 2015-05-13 128
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] kernfs: Use RCU for kernfs_node::name lookup.
2024-11-08 22:31 ` Tejun Heo
@ 2024-11-11 17:04 ` Sebastian Andrzej Siewior
2024-11-12 19:02 ` Tejun Heo
0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-11 17:04 UTC (permalink / raw)
To: Tejun Heo
Cc: Boqun Feng, Hillf Danton, Paul E. McKenney, Marco Elver,
linux-kernel, gregkh, tglx, syzbot
On 2024-11-08 12:31:22 [-1000], Tejun Heo wrote:
> Hello, Sebastian.
Hi Tejun,
> On Fri, Nov 08, 2024 at 11:24:06PM +0100, Sebastian Andrzej Siewior wrote:
> > Instead of using kernfs_rename_lock for lookups of ::name use RCU for
> > lookup. Rely on kn's kernfs_root::kernfs_rwsem for update
> > synchronisation.
> >
> > The .*_locked() have been moved into the callers.
> > The lock in kernfs_get_parent() has been dropped, the parent node should
> > not vanish underneath us. The RCU read-lock and atomic_inc_not_zero()
> > is a safety net in case it does.
> > kernfs_fop_readdir() no longer drops kernfs_root::kernfs_rwsem to ensure
> > the name pointer does not vanish while the page fault is handled.
> > kernfs_notify_workfn() gained the lock for the same reason.
>
> I owe an apology. I was just thinking about cgroups. Sysfs, I think, does
no worries.
> allow moving a node to a different parent, which IIRC is used by netdevs. How
> about something like this:
>
> - Add a KERNFS_ROOT flag indicating that parent-changing moves aren't
> allowed.
>
> - Update kernfs_rename_ns() to trigger warning and fail if the above flag is
> set and new_parent is different from the old one.
>
> - Create a separate interface which uses RCU instead of rename lock for name
> / path lookups. The RCU ones should trigger warning if used when the above
> KERNFS_ROOT flag is not set.
Let me check if I understood. We have three users of kernfs:
- cgroup
cgroup1_rename(): parent check (would get the new KERNFS_ROOT flag)
- resctrl
rdtgroup_rename(): seems to allow any "mon group" directory
different parent possible.
- sysfs
sysfs_move_dir_ns(): rename to a different parent
That new RCU interface would be used by cgroup only and sysfs/ resctrl
would remain using the "old" code?
If so, would you prefer
|struct kernfs_node {
| …
| union {
| const char *name;
| const char __rcu *name_rcu;
| }
to avoid patching resctrl + sysfs for the rcu_dereference name
lookup?
> Thanks.
Sebastian
* Re: [PATCH] kernfs: Use RCU for kernfs_node::name lookup.
2024-11-11 17:04 ` Sebastian Andrzej Siewior
@ 2024-11-12 19:02 ` Tejun Heo
2024-11-13 7:58 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2024-11-12 19:02 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Boqun Feng, Hillf Danton, Paul E. McKenney, Marco Elver,
linux-kernel, gregkh, tglx, syzbot
Hello, Sebastian.
Sorry about the late reply. Have been sick for a bit.
On Mon, Nov 11, 2024 at 06:04:50PM +0100, Sebastian Andrzej Siewior wrote:
...
> Let me check if I understood. We have three users of kernfs:
>
> - cgroup
> cgroup1_rename(): parent check (would get the new KERNFS_ROOT flag)
>
> - resctrl
> rdtgroup_rename(): seems to allow any "mon group" directory
> different parent possible.
>
> - sysfs
> > sysfs_move_dir_ns(): rename to a different parent
I'm not sure about resctrl but here we just need to add that flag to cgroup,
right?
> That new RCU interface would be used by cgroup only and sysfs/ resctrl
> would remain using the "old" code?
> If so, would you prefer
> |struct kernfs_node {
> | …
> | union {
> > | const char *name;
> | const char __rcu *name_rcu;
> | }
>
> > to avoid patching resctrl + sysfs for the rcu_dereference name
> lookup?
As replied on the patches, it probably is cleaner to always use __rcu and
apply the additional locking on the reader side if renaming across different
parents is allowed.
Thanks.
--
tejun
* Re: [PATCH] kernfs: Use RCU for kernfs_node::name lookup.
2024-11-12 19:02 ` Tejun Heo
@ 2024-11-13 7:58 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-13 7:58 UTC (permalink / raw)
To: Tejun Heo
Cc: Boqun Feng, Hillf Danton, Paul E. McKenney, Marco Elver,
linux-kernel, gregkh, tglx, syzbot
On 2024-11-12 09:02:53 [-1000], Tejun Heo wrote:
> Hello, Sebastian.
Hi Tejun,
> Sorry about late reply. Have been sick for a bit.
No worries, hopefully you are getting better.
> On Mon, Nov 11, 2024 at 06:04:50PM +0100, Sebastian Andrzej Siewior wrote:
> ...
> > Let me check if I understood. We have three users of kernfs:
> >
> > - cgroup
> > cgroup1_rename(): parent check (would get the new KERNFS_ROOT flag)
> >
> > - resctrl
> > rdtgroup_rename(): seems to allow any "mon group" directory
> > different parent possible.
> >
> > - sysfs
> > > sysfs_move_dir_ns(): rename to a different parent
>
> I'm not sure about resctrl but here we just need to add that flag to cgroup,
> right?
Correct. I was just piecing your answer together ;)
> > That new RCU interface would be used by cgroup only and sysfs/ resctrl
> > would remain using the "old" code?
> > If so, would you prefer
> > |struct kernfs_node {
> > | …
> > | union {
> > > | const char *name;
> > | const char __rcu *name_rcu;
> > | }
> >
> > > to avoid patching resctrl + sysfs for the rcu_dereference name
> > lookup?
>
> As replied on the patches, it probably is cleaner to always use __rcu and
> apply the additional locking on the reader side if renaming across different
> parents is allowed.
Yes, on it.
> Thanks.
Sebastian
end of thread, other threads:[~2024-11-13 7:58 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-30 21:05 [BUG] -next lockdep invalid wait context Paul E. McKenney
2024-10-30 21:48 ` Vlastimil Babka
2024-10-30 22:34 ` Marco Elver
2024-10-30 23:04 ` Boqun Feng
2024-10-30 23:10 ` Paul E. McKenney
2024-10-31 7:21 ` Sebastian Andrzej Siewior
2024-10-31 7:35 ` Vlastimil Babka
2024-10-31 7:55 ` Sebastian Andrzej Siewior
2024-10-31 8:18 ` Vlastimil Babka
2024-11-01 17:14 ` Paul E. McKenney
2024-10-31 17:50 ` Paul E. McKenney
2024-11-01 19:50 ` Boqun Feng
2024-11-01 19:54 ` [PATCH] scftorture: Use workqueue to free scf_check Boqun Feng
2024-11-01 23:35 ` Paul E. McKenney
2024-11-03 3:35 ` Boqun Feng
2024-11-03 15:03 ` Paul E. McKenney
2024-11-04 10:50 ` [PATCH 1/2] scftorture: Move memory allocation outside of preempt_disable region Sebastian Andrzej Siewior
2024-11-04 10:50 ` [PATCH 2/2] scftorture: Use a lock-less list to free memory Sebastian Andrzej Siewior
2024-11-05 1:00 ` Boqun Feng
2024-11-07 11:21 ` Sebastian Andrzej Siewior
2024-11-07 14:08 ` Paul E. McKenney
2024-11-07 14:43 ` Sebastian Andrzej Siewior
2024-11-07 14:59 ` Paul E. McKenney
2024-11-02 0:12 ` [BUG] -next lockdep invalid wait context Hillf Danton
2024-11-02 0:45 ` Boqun Feng
2024-11-04 18:08 ` Tejun Heo
2024-11-05 9:37 ` Vlastimil Babka
2024-11-08 10:05 ` Sebastian Andrzej Siewior
2024-11-08 17:02 ` Tejun Heo
2024-11-08 17:12 ` Sebastian Andrzej Siewior
2024-11-08 22:24 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup Sebastian Andrzej Siewior
2024-11-08 22:31 ` Tejun Heo
2024-11-11 17:04 ` Sebastian Andrzej Siewior
2024-11-12 19:02 ` Tejun Heo
2024-11-13 7:58 ` Sebastian Andrzej Siewior
2024-11-08 23:16 ` Hillf Danton
2024-11-08 23:48 ` [syzbot] [kernfs?] WARNING: locking bug in kernfs_path_from_node syzbot
2024-11-11 4:49 ` [PATCH] kernfs: Use RCU for kernfs_node::name lookup kernel test robot