* bpf_ringbuf_reserve deadlock on rt kernels
From: Dmitry Dolgov @ 2024-06-10 15:17 UTC (permalink / raw)
To: bpf, linux-rt-users; +Cc: ast, daniel, andrii
Hi,
we're facing an interesting issue with a BPF program that writes into a
bpf_ringbuf from different CPUs on an RT kernel. Here is my attempt to
reproduce it on QEMU:
======================================================
WARNING: possible circular locking dependency detected
6.9.0-rt5-g66834e17536e #3 Not tainted
------------------------------------------------------
swapper/4/0 is trying to acquire lock:
ffffc90006b4d118 (&lock->wait_lock){....}-{2:2}, at: rt_spin_lock+0x6d/0x100
but task is already holding lock:
ffffc90006b4d158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&rb->spinlock){....}-{2:2}:
lock_acquire+0xc5/0x300
rt_spin_lock+0x2a/0x100
__bpf_ringbuf_reserve+0x5a/0xf0
bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
bpf_trace_run4+0xae/0x1e0
__schedule+0x42c/0xca0
preempt_schedule_notrace+0x37/0x60
preempt_schedule_notrace_thunk+0x1a/0x30
rcu_is_watching+0x32/0x40
__flush_work+0x30b/0x480
n_tty_poll+0x131/0x1d0
tty_poll+0x54/0x90
do_select+0x490/0x9b0
core_sys_select+0x238/0x620
kern_select+0x101/0x190
__x64_sys_select+0x21/0x30
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #2 (&rq->__lock){-...}-{2:2}:
lock_acquire+0xc5/0x300
_raw_spin_lock_nested+0x2e/0x40
raw_spin_rq_lock_nested+0x15/0x30
task_fork_fair+0x3e/0xb0
sched_cgroup_fork+0xe9/0x110
copy_process+0x1b76/0x2fd0
kernel_clone+0xab/0x3e0
user_mode_thread+0x5f/0x90
rest_init+0x1e/0x160
start_kernel+0x61d/0x620
x86_64_start_reservations+0x24/0x30
x86_64_start_kernel+0x8c/0x90
common_startup_64+0x13e/0x148
-> #1 (&p->pi_lock){-...}-{2:2}:
lock_acquire+0xc5/0x300
_raw_spin_lock+0x30/0x40
rtlock_slowlock_locked+0x130/0x1c70
rt_spin_lock+0x78/0x100
prepare_to_wait_event+0x1a/0x140
wake_up_and_wait_for_irq_thread_ready+0xc3/0xe0
__setup_irq+0x374/0x660
request_threaded_irq+0xe5/0x180
acpi_os_install_interrupt_handler+0xb7/0xe0
acpi_ev_install_xrupt_handlers+0x22/0x90
acpi_init+0x8f/0x4d0
do_one_initcall+0x73/0x2d0
kernel_init_freeable+0x24a/0x290
kernel_init+0x1a/0x130
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1a/0x30
-> #0 (&lock->wait_lock){....}-{2:2}:
check_prev_add+0xeb/0xd80
__lock_acquire+0x113e/0x15b0
lock_acquire+0xc5/0x300
_raw_spin_lock_irqsave+0x3c/0x60
rt_spin_lock+0x6d/0x100
__bpf_ringbuf_reserve+0x5a/0xf0
bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
bpf_trace_run4+0xae/0x1e0
__schedule+0x42c/0xca0
schedule_idle+0x20/0x40
cpu_startup_entry+0x29/0x30
start_secondary+0xfa/0x100
common_startup_64+0x13e/0x148
other info that might help us debug this:
Chain exists of:
&lock->wait_lock --> &rq->__lock --> &rb->spinlock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&rb->spinlock);
lock(&rq->__lock);
lock(&rb->spinlock);
lock(&lock->wait_lock);
*** DEADLOCK ***
3 locks held by swapper/4/0:
#0: ffff88813bd32558 (&rq->__lock){-...}-{2:2}, at: __schedule+0xc4/0xca0
#1: ffffffff83590540 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run4+0x6c/0x1e0
#2: ffffc90006b4d158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0
stack backtrace:
CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rt5-g66834e17536e #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
print_circular_bug.cold+0x178/0x1be
check_noncircular+0x14e/0x170
check_prev_add+0xeb/0xd80
__lock_acquire+0x113e/0x15b0
lock_acquire+0xc5/0x300
? rt_spin_lock+0x6d/0x100
_raw_spin_lock_irqsave+0x3c/0x60
? rt_spin_lock+0x6d/0x100
rt_spin_lock+0x6d/0x100
__bpf_ringbuf_reserve+0x5a/0xf0
bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
bpf_trace_run4+0xae/0x1e0
__schedule+0x42c/0xca0
schedule_idle+0x20/0x40
cpu_startup_entry+0x29/0x30
start_secondary+0xfa/0x100
common_startup_64+0x13e/0x148
</TASK>
CPU: 1 PID: 160 Comm: screen Not tainted 6.9.0-rt5-g66834e17536e #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
__might_resched.cold+0xcc/0xdf
rt_spin_lock+0x4c/0x100
? __bpf_ringbuf_reserve+0x5a/0xf0
__bpf_ringbuf_reserve+0x5a/0xf0
bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
bpf_trace_run4+0xae/0x1e0
__schedule+0x42c/0xca0
preempt_schedule_notrace+0x37/0x60
preempt_schedule_notrace_thunk+0x1a/0x30
? __flush_work+0x84/0x480
rcu_is_watching+0x32/0x40
__flush_work+0x30b/0x480
n_tty_poll+0x131/0x1d0
tty_poll+0x54/0x90
do_select+0x490/0x9b0
? __bfs+0x136/0x230
? do_select+0x26d/0x9b0
? __pfx_pollwake+0x10/0x10
? __pfx_pollwake+0x10/0x10
? core_sys_select+0x238/0x620
core_sys_select+0x238/0x620
kern_select+0x101/0x190
__x64_sys_select+0x21/0x30
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The BPF program in question is attached to sched_switch. The issue seems
to be similar to a couple of syzkaller reports [1], [2], although the
latter one is about nested progs, which does not seem to be the case here.
Speaking of nested progs, applying an approach similar to [3], reworked
for bpf_ringbuf, eliminates the issue.
Am I missing anything, is this a known issue? Any ideas how to address it?
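For reference, here is a minimal sketch of the kind of program in question
(the map size, event layout and names are illustrative, our real program is
larger):

```c
/* Illustrative sketch only -- a minimal sched_switch program that reserves
 * records in a BPF ring buffer, as described above. Builds with clang
 * -target bpf against vmlinux.h/libbpf. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

struct event {
	__u32 prev_pid;
	__u32 next_pid;
};

SEC("tp_btf/sched_switch")
int BPF_PROG(handle_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	struct event *e;

	/* On PREEMPT_RT this reserve takes rb->spinlock, which lockdep
	 * flags in the splat above, while __schedule() already holds
	 * rq->__lock. */
	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	e->prev_pid = prev->pid;
	e->next_pid = next->pid;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```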
[1]: https://lore.kernel.org/all/0000000000000656bf061a429057@google.com/
[2]: https://lore.kernel.org/lkml/0000000000004aa700061379547e@google.com/
[3]: https://lore.kernel.org/bpf/20240514124052.1240266-2-sidchintamaneni@gmail.com/
* Re: bpf_ringbuf_reserve deadlock on rt kernels
From: Sebastian Andrzej Siewior @ 2024-06-12 14:32 UTC (permalink / raw)
To: Dmitry Dolgov; +Cc: bpf, linux-rt-users, ast, daniel, andrii
On 2024-06-10 17:17:35 [+0200], Dmitry Dolgov wrote:
> Hi,
Hi,
…
> The BPF program in question is attached to sched_switch. The issue seems
> to be similar to a couple of syzkaller reports [1], [2], although the
> latter one is about nested progs, which seems to be not the case here.
> Talking about nested progs, applying a similar approach as in [3]
> reworked for bpf_ringbuf, elliminates the issue.
>
> Do I miss anything, is it a known issue? Any ideas how to address that?
I haven't attached BPF programs to trace-events myself, so this is new to
me. But if you attach BPF programs to trace-events then there might be
more things that can go wrong…
Let me add this to the bpf-list-to-look-at.
Do you get more splats with CONFIG_DEBUG_ATOMIC_SLEEP=y?
Sebastian
* Re: bpf_ringbuf_reserve deadlock on rt kernels
From: Dmitry Dolgov @ 2024-06-13 10:23 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: bpf, linux-rt-users, ast, daniel, andrii
> On Wed, Jun 12, 2024 at 04:32:23PM GMT, Sebastian Andrzej Siewior wrote:
>
> > The BPF program in question is attached to sched_switch. The issue seems
> > to be similar to a couple of syzkaller reports [1], [2], although the
> > latter one is about nested progs, which does not seem to be the case here.
> > Speaking of nested progs, applying an approach similar to [3], reworked
> > for bpf_ringbuf, eliminates the issue.
> >
> > Am I missing anything, is this a known issue? Any ideas how to address it?
>
> I haven't attached BPF programs to trace-events myself, so this is new to
> me. But if you attach BPF programs to trace-events then there might be
> more things that can go wrong…
Things related to RT kernels, or something else?
> Let me add this to the bpf-list-to-look-at.
> Do you get more splats with CONFIG_DEBUG_ATOMIC_SLEEP=y?
Thanks. Adding CONFIG_DEBUG_ATOMIC_SLEEP gives me this:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 154, name: script
preempt_count: 3, expected: 0
RCU nest depth: 1, expected: 1
4 locks held by script/154:
#0: ffff8881049798a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x28/0x60
#1: ffff88813bdb2558 (&rq->__lock){-...}-{2:2}, at: __schedule+0xc4/0xca0
#2: ffffffff83590540 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run4+0x6c/0x1e0
#3: ffffc90007b61158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0
irq event stamp: 129370
hardirqs last enabled at (129369): [<ffffffff82216818>] _raw_spin_unlock_irq+0x28/0x50
hardirqs last disabled at (129370): [<ffffffff822084a9>] __schedule+0x5d9/0xca0
softirqs last enabled at (0): [<ffffffff81110ecb>] copy_process+0xc3b/0x2fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
* Re: bpf_ringbuf_reserve deadlock on rt kernels
From: Sebastian Andrzej Siewior @ 2024-06-13 10:40 UTC (permalink / raw)
To: Dmitry Dolgov; +Cc: bpf, linux-rt-users, ast, daniel, andrii
On 2024-06-13 12:23:46 [+0200], Dmitry Dolgov wrote:
> > On Wed, Jun 12, 2024 at 04:32:23PM GMT, Sebastian Andrzej Siewior wrote:
> >
> > > The BPF program in question is attached to sched_switch. The issue seems
> > > to be similar to a couple of syzkaller reports [1], [2], although the
> > > latter one is about nested progs, which does not seem to be the case here.
> > > Speaking of nested progs, applying an approach similar to [3], reworked
> > > for bpf_ringbuf, eliminates the issue.
> > >
> > > Am I missing anything, is this a known issue? Any ideas how to address it?
> >
> > I haven't attached BPF programs to trace-events myself, so this is new to
> > me. But if you attach BPF programs to trace-events then there might be
> > more things that can go wrong…
>
> Things related to RT kernels, or something else?
Related to the RT kernel. The trace-event is invoked with preemption
disabled. This means locking is limited to raw_spinlock_t and no memory
allocations are allowed. Otherwise a splat like the one you posted will
appear.
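As a userspace analogy of that constraint (illustrative only, not kernel
code): a context that must not sleep cannot take a blocking lock, so a
reserve path there has to trylock and drop the event on contention, similar
in spirit to what the in_nmi() branch of __bpf_ringbuf_reserve() already
does:

```c
/* Userspace analogy only -- NOT kernel code. A context that must not
 * sleep replaces a blocking lock acquisition with a trylock that drops
 * the event on contention instead of sleeping. */
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t rb_lock = PTHREAD_MUTEX_INITIALIZER;
static char slot[64];	/* stand-in for one ring buffer record */

/* Reserve without ever blocking: on contention, return NULL so the
 * caller drops the event rather than sleeping in atomic context. */
static void *reserve_nonblocking(void)
{
	if (pthread_mutex_trylock(&rb_lock) != 0)
		return NULL;
	return slot;
}

static void commit(void *rec)
{
	(void)rec;
	pthread_mutex_unlock(&rb_lock);
}
```

In the kernel the concrete options differ (one possible direction being a
raw_spinlock_t for rb->spinlock), but the no-sleeping rule is the same.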
Sebastian