* bpf_ringbuf_reserve deadlock on rt kernels
From: Dmitry Dolgov @ 2024-06-10 15:17 UTC (permalink / raw)
To: bpf, linux-rt-users; +Cc: ast, daniel, andrii
Hi,
we're facing an interesting issue with a BPF program that writes into a
bpf_ringbuf from different CPUs on an RT kernel. Here is my attempt to
reproduce it on QEMU:
======================================================
WARNING: possible circular locking dependency detected
6.9.0-rt5-g66834e17536e #3 Not tainted
------------------------------------------------------
swapper/4/0 is trying to acquire lock:
ffffc90006b4d118 (&lock->wait_lock){....}-{2:2}, at: rt_spin_lock+0x6d/0x100
but task is already holding lock:
ffffc90006b4d158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&rb->spinlock){....}-{2:2}:
lock_acquire+0xc5/0x300
rt_spin_lock+0x2a/0x100
__bpf_ringbuf_reserve+0x5a/0xf0
bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
bpf_trace_run4+0xae/0x1e0
__schedule+0x42c/0xca0
preempt_schedule_notrace+0x37/0x60
preempt_schedule_notrace_thunk+0x1a/0x30
rcu_is_watching+0x32/0x40
__flush_work+0x30b/0x480
n_tty_poll+0x131/0x1d0
tty_poll+0x54/0x90
do_select+0x490/0x9b0
core_sys_select+0x238/0x620
kern_select+0x101/0x190
__x64_sys_select+0x21/0x30
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #2 (&rq->__lock){-...}-{2:2}:
lock_acquire+0xc5/0x300
_raw_spin_lock_nested+0x2e/0x40
raw_spin_rq_lock_nested+0x15/0x30
task_fork_fair+0x3e/0xb0
sched_cgroup_fork+0xe9/0x110
copy_process+0x1b76/0x2fd0
kernel_clone+0xab/0x3e0
user_mode_thread+0x5f/0x90
rest_init+0x1e/0x160
start_kernel+0x61d/0x620
x86_64_start_reservations+0x24/0x30
x86_64_start_kernel+0x8c/0x90
common_startup_64+0x13e/0x148
-> #1 (&p->pi_lock){-...}-{2:2}:
lock_acquire+0xc5/0x300
_raw_spin_lock+0x30/0x40
rtlock_slowlock_locked+0x130/0x1c70
rt_spin_lock+0x78/0x100
prepare_to_wait_event+0x1a/0x140
wake_up_and_wait_for_irq_thread_ready+0xc3/0xe0
__setup_irq+0x374/0x660
request_threaded_irq+0xe5/0x180
acpi_os_install_interrupt_handler+0xb7/0xe0
acpi_ev_install_xrupt_handlers+0x22/0x90
acpi_init+0x8f/0x4d0
do_one_initcall+0x73/0x2d0
kernel_init_freeable+0x24a/0x290
kernel_init+0x1a/0x130
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1a/0x30
-> #0 (&lock->wait_lock){....}-{2:2}:
check_prev_add+0xeb/0xd80
__lock_acquire+0x113e/0x15b0
lock_acquire+0xc5/0x300
_raw_spin_lock_irqsave+0x3c/0x60
rt_spin_lock+0x6d/0x100
__bpf_ringbuf_reserve+0x5a/0xf0
bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
bpf_trace_run4+0xae/0x1e0
__schedule+0x42c/0xca0
schedule_idle+0x20/0x40
cpu_startup_entry+0x29/0x30
start_secondary+0xfa/0x100
common_startup_64+0x13e/0x148
other info that might help us debug this:
Chain exists of:
&lock->wait_lock --> &rq->__lock --> &rb->spinlock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&rb->spinlock);
lock(&rq->__lock);
lock(&rb->spinlock);
lock(&lock->wait_lock);
*** DEADLOCK ***
3 locks held by swapper/4/0:
#0: ffff88813bd32558 (&rq->__lock){-...}-{2:2}, at: __schedule+0xc4/0xca0
#1: ffffffff83590540 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run4+0x6c/0x1e0
#2: ffffc90006b4d158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0
stack backtrace:
CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rt5-g66834e17536e #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
print_circular_bug.cold+0x178/0x1be
check_noncircular+0x14e/0x170
check_prev_add+0xeb/0xd80
__lock_acquire+0x113e/0x15b0
lock_acquire+0xc5/0x300
? rt_spin_lock+0x6d/0x100
_raw_spin_lock_irqsave+0x3c/0x60
? rt_spin_lock+0x6d/0x100
rt_spin_lock+0x6d/0x100
__bpf_ringbuf_reserve+0x5a/0xf0
bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
bpf_trace_run4+0xae/0x1e0
__schedule+0x42c/0xca0
schedule_idle+0x20/0x40
cpu_startup_entry+0x29/0x30
start_secondary+0xfa/0x100
common_startup_64+0x13e/0x148
</TASK>
CPU: 1 PID: 160 Comm: screen Not tainted 6.9.0-rt5-g66834e17536e #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
__might_resched.cold+0xcc/0xdf
rt_spin_lock+0x4c/0x100
? __bpf_ringbuf_reserve+0x5a/0xf0
__bpf_ringbuf_reserve+0x5a/0xf0
bpf_prog_abf021cf8a50b730_sched_switch+0x281/0x70d
bpf_trace_run4+0xae/0x1e0
__schedule+0x42c/0xca0
preempt_schedule_notrace+0x37/0x60
preempt_schedule_notrace_thunk+0x1a/0x30
? __flush_work+0x84/0x480
rcu_is_watching+0x32/0x40
__flush_work+0x30b/0x480
n_tty_poll+0x131/0x1d0
tty_poll+0x54/0x90
do_select+0x490/0x9b0
? __bfs+0x136/0x230
? do_select+0x26d/0x9b0
? __pfx_pollwake+0x10/0x10
? __pfx_pollwake+0x10/0x10
? core_sys_select+0x238/0x620
core_sys_select+0x238/0x620
kern_select+0x101/0x190
__x64_sys_select+0x21/0x30
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The BPF program in question is attached to sched_switch. The issue seems
to be similar to a couple of syzkaller reports [1], [2], although the
latter one is about nested progs, which does not seem to be the case here.
Speaking of nested progs, applying an approach similar to [3], reworked
for bpf_ringbuf, eliminates the issue.
Am I missing anything, is this a known issue? Any ideas how to address it?
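For reference, here is a minimal sketch of the kind of program in question
(the map size, event layout and names are illustrative, our real program is
larger):

```c
/* Illustrative sketch only -- a minimal sched_switch program that reserves
 * records in a BPF ring buffer, as described above. Builds with clang
 * -target bpf against vmlinux.h/libbpf. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

struct event {
	__u32 prev_pid;
	__u32 next_pid;
};

SEC("tp_btf/sched_switch")
int BPF_PROG(handle_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	struct event *e;

	/* On PREEMPT_RT this reserve takes rb->spinlock, which lockdep
	 * flags in the splat above, while __schedule() already holds
	 * rq->__lock. */
	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
	if (!e)
		return 0;

	e->prev_pid = prev->pid;
	e->next_pid = next->pid;
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```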
[1]: https://lore.kernel.org/all/0000000000000656bf061a429057@google.com/
[2]: https://lore.kernel.org/lkml/0000000000004aa700061379547e@google.com/
[3]: https://lore.kernel.org/bpf/20240514124052.1240266-2-sidchintamaneni@gmail.com/
* Re: bpf_ringbuf_reserve deadlock on rt kernels
From: Sebastian Andrzej Siewior @ 2024-06-12 14:32 UTC (permalink / raw)
To: Dmitry Dolgov; +Cc: bpf, linux-rt-users, ast, daniel, andrii
On 2024-06-10 17:17:35 [+0200], Dmitry Dolgov wrote:
> Hi,
Hi,
…
> The BPF program in question is attached to sched_switch. The issue seems
> to be similar to a couple of syzkaller reports [1], [2], although the
> latter one is about nested progs, which seems to be not the case here.
> Talking about nested progs, applying a similar approach as in [3]
> reworked for bpf_ringbuf, elliminates the issue.
>
> Do I miss anything, is it a known issue? Any ideas how to address that?
I haven't attached BPF programs to trace-events myself, so this is new to
me. But if you attach BPF programs to trace-events then there might be
more things that can go wrong…
Let me add this to the bpf-list-to-look-at.
Do you get more splats with CONFIG_DEBUG_ATOMIC_SLEEP=y?
Sebastian
* Re: bpf_ringbuf_reserve deadlock on rt kernels
From: Dmitry Dolgov @ 2024-06-13 10:23 UTC (permalink / raw)
To: Sebastian Andrzej Siewior; +Cc: bpf, linux-rt-users, ast, daniel, andrii
> On Wed, Jun 12, 2024 at 04:32:23PM GMT, Sebastian Andrzej Siewior wrote:
>
> > The BPF program in question is attached to sched_switch. The issue seems
> > to be similar to a couple of syzkaller reports [1], [2], although the
> > latter one is about nested progs, which does not seem to be the case here.
> > Speaking of nested progs, applying an approach similar to [3], reworked
> > for bpf_ringbuf, eliminates the issue.
> >
> > Am I missing anything, is this a known issue? Any ideas how to address it?
>
> I haven't attached BPF programs to trace-events myself, so this is new to
> me. But if you attach BPF programs to trace-events then there might be
> more things that can go wrong…
Things related to RT kernels, or something else?
> Let me add this to the bpf-list-to-look-at.
> Do you get more splats with CONFIG_DEBUG_ATOMIC_SLEEP=y?
Thanks. Adding CONFIG_DEBUG_ATOMIC_SLEEP gives me this:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 154, name: script
preempt_count: 3, expected: 0
RCU nest depth: 1, expected: 1
4 locks held by script/154:
#0: ffff8881049798a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x28/0x60
#1: ffff88813bdb2558 (&rq->__lock){-...}-{2:2}, at: __schedule+0xc4/0xca0
#2: ffffffff83590540 (rcu_read_lock){....}-{1:2}, at: bpf_trace_run4+0x6c/0x1e0
#3: ffffc90007b61158 (&rb->spinlock){....}-{2:2}, at: __bpf_ringbuf_reserve+0x5a/0xf0
irq event stamp: 129370
hardirqs last enabled at (129369): [<ffffffff82216818>] _raw_spin_unlock_irq+0x28/0x50
hardirqs last disabled at (129370): [<ffffffff822084a9>] __schedule+0x5d9/0xca0
softirqs last enabled at (0): [<ffffffff81110ecb>] copy_process+0xc3b/0x2fd0
softirqs last disabled at (0): [<0000000000000000>] 0x0
* Re: bpf_ringbuf_reserve deadlock on rt kernels
From: Sebastian Andrzej Siewior @ 2024-06-13 10:40 UTC (permalink / raw)
To: Dmitry Dolgov; +Cc: bpf, linux-rt-users, ast, daniel, andrii
On 2024-06-13 12:23:46 [+0200], Dmitry Dolgov wrote:
> > On Wed, Jun 12, 2024 at 04:32:23PM GMT, Sebastian Andrzej Siewior wrote:
> >
> > > The BPF program in question is attached to sched_switch. The issue seems
> > > to be similar to a couple of syzkaller reports [1], [2], although the
> > > latter one is about nested progs, which does not seem to be the case here.
> > > Speaking of nested progs, applying an approach similar to [3], reworked
> > > for bpf_ringbuf, eliminates the issue.
> > >
> > > Am I missing anything, is this a known issue? Any ideas how to address it?
> >
> > I haven't attached BPF programs to trace-events myself, so this is new to
> > me. But if you attach BPF programs to trace-events then there might be
> > more things that can go wrong…
>
> Things related to RT kernels, or something else?
Related to the RT kernel. The trace-event is invoked with preemption
disabled. This means locking is limited to raw_spinlock_t and no memory
allocations are allowed. Otherwise a splat like the one you posted will
appear.
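As a userspace analogy of that constraint (illustrative only, not kernel
code): a context that must not sleep cannot take a blocking lock, so a
reserve path there has to trylock and drop the event on contention, similar
in spirit to what the in_nmi() branch of __bpf_ringbuf_reserve() already
does:

```c
/* Userspace analogy only -- NOT kernel code. A context that must not
 * sleep replaces a blocking lock acquisition with a trylock that drops
 * the event on contention instead of sleeping. */
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t rb_lock = PTHREAD_MUTEX_INITIALIZER;
static char slot[64];	/* stand-in for one ring buffer record */

/* Reserve without ever blocking: on contention, return NULL so the
 * caller drops the event rather than sleeping in atomic context. */
static void *reserve_nonblocking(void)
{
	if (pthread_mutex_trylock(&rb_lock) != 0)
		return NULL;
	return slot;
}

static void commit(void *rec)
{
	(void)rec;
	pthread_mutex_unlock(&rb_lock);
}
```

In the kernel the concrete options differ (one possible direction being a
raw_spinlock_t for rb->spinlock), but the no-sleeping rule is the same.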
Sebastian