From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Raj Sahu <rjsu26@gmail.com>,
	Siddharth Chintamaneni <sidchintamaneni@gmail.com>,
	 bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	 Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	 Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	 Yonghong Song <yonghong.song@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	 KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	 Jiri Olsa <jolsa@kernel.org>, Dan Williams <djwillia@vt.edu>,
	miloc@vt.edu, ericts@vt.edu,  rahult@vt.edu, doniaghazy@vt.edu,
	quanzhif@vt.edu,  Jinghao Jia <jinghao7@illinois.edu>,
	egor@vt.edu,  Sai Roop Somaraju <sairoop10@gmail.com>
Subject: Re: [RFC bpf-next v2 3/4] bpf: Runtime part of fast-path termination approach
Date: Mon, 7 Jul 2025 21:16:21 +0200	[thread overview]
Message-ID: <CAP01T74i8a4daQ1Cca5Eysy==hTKox-ovpc1Y==64M1LacATEQ@mail.gmail.com> (raw)
In-Reply-To: <CAADnVQ+EiaoWUVcN9=Nm=RWJ6XE=Kcm8Q2FYQqWGJ_NsCtyJ=A@mail.gmail.com>

On Mon, 7 Jul 2025 at 19:41, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jul 4, 2025 at 12:11 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, 4 Jul 2025 at 19:29, Raj Sahu <rjsu26@gmail.com> wrote:
> > >
> > > > > Introduces a watchdog-based runtime mechanism to terminate
> > > > > a BPF program. When a BPF program is interrupted by
> > > > > the watchdog, its registers are passed to bpf_die.
> > > > >
> > > > > Inside bpf_die we perform the text_poke and stack walk
> > > > > to stub out helpers/kfuncs, and replace the bpf_loop helper
> > > > > if it is called inside the BPF program.
> > > > >
> > > > > The current implementation doesn't handle termination of
> > > > > tail-call programs.
> > > > >
> > > > > There is a known issue with calling text_poke inside interrupt
> > > > > context - https://elixir.bootlin.com/linux/v6.15.1/source/kernel/smp.c#L815.
> > > >
> > > > I don't have a good idea so far; maybe defer the work to wq context?
> > > > Each CPU would need its own context to schedule work on.
> > > > The problem is that it may not be invoked immediately.
> > > We will give it a try using wq. We were a bit hesitant to pursue wq
> > > earlier because, to modify the return address on the stack, we would
> > > want to interrupt the running BPF program and access its stack, since
> > > that's a key part of the design.
> > >
> > > We will need some suggestions here on how to achieve that.
> >
> > Yeah, this is not trivial, now that I think more about it.
> > One option is to keep the stack state untouched and synchronize with
> > the callback (spin until it signals us that it's done touching the
> > stack). I guess we can do it from another CPU, not too bad.
> >
> > There's another problem though: wq execution not happening instantly
> > is not a big deal, but the wq getting interrupted by yet another
> > program that stalls can set up a cascading chain that locks up
> > the machine.
> > Say we have a program that stalls in NMI/IRQ context. It might happen
> > that all CPUs that can service the wq enter this stall. The kthread is
> > ready to run the wq callback (or is in the middle of it), but it may be
> > interrupted indefinitely.
> > This seems like a more fundamental problem with the non-cloning
> > approach. We can prevent program execution on the CPU where the wq
> > callback will run, but we can also have a case where all CPUs lock
> > up simultaneously.
>
> If we have bugs where a prog in NMI can stall the CPU indefinitely,
> they need to be fixed independently of fast-execute.
> Timed may_goto, tail calls, or whatever may need to have different
> limits when we detect that the prog is running in NMI or with hard IRQs
> disabled.

I think a lot of programs end up running with hard IRQs disabled. Most
scheduler programs (which would use all these runtime-checked
facilities) can fall into this category.
I don't think we can come up with appropriate limits without proper
WCET analysis.

> Fast-execute doesn't have to be a universal kill-bpf-prog
> mechanism that can work in any context. I think fast-execute
> is for progs that deadlocked in res_spin_lock, faulted in an arena,
> or were slow for wrong, but not kernel-fatal, reasons.
> imo we can rely on schedule_work() and bpf_arch_text_poke() from there.
> The alternative of cloning all progs, and the memory waste for a rare
> case, is not appealing. Unless we can detect "dangerous" progs and
> clone with fast-execute only for them, so that the majority of bpf progs
> stay as a single copy.

Right, I sympathize with the memory overhead argument. But I think
just writing off NMI programs as broken is not sufficient.
You can have some tracing program that gets stuck while tracing parts
of the kernel in a way that prevents the wq from making progress in
patching it out.
I think we will discover more edge cases. All this effort, only to end
up with something that works incompletely, is sort of a bummer.

Most of the syzbot reports triggering deadlocks were also not "real"
use cases, but we still decided to close the gap with rqspinlock.
If we leave the door open, it's hard to anticipate what the failure
mode will be in cases where fast-execute is not triggered.
I think let's try to see how bad cloning can be, if done
conditionally, before giving up on it completely.

I think most programs won't use the facilities that end up extending
the total runtime beyond acceptable bounds.
Most legacy programs are not using cond_break, rqspinlock, or arenas.
Most network function programs and tracing programs probably don't use
these facilities either.
This can certainly change in the future, but unless pressing use cases
come up, I think people will stick to how things are.
Schedulers are a different category, and they will definitely make use
of all this.
So only triggering cloning when these facilities are used sounds better
to me, even if not ideal.

We can have some simple WCET heuristic that assigns fixed weights to
certain instructions/helpers and computes an approximate cost which,
when breached, triggers cloning of the prog (a rough sketch below).
We can obviously disable cloning when we see things like bpf_for(i, 0, 1000).
We can restrict it to specific subprogs where we notice specific
patterns requiring it, so we may be able to avoid cloning the entire prog.
But for simplicity we can also just trigger cloning when one of
rqspinlock/cond_break/arenas/iterators/bpf_loop is used.
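
Roughly something like this, just as a sketch; the weights, the
threshold, and both function names below are made up for illustration:

#include <linux/filter.h>

/* Hypothetical fixed-weight cost heuristic over the instruction
 * stream. A static estimate ignores loop trip counts, so iterators
 * and bpf_loop would still need to force cloning unconditionally.
 */
static u64 bpf_prog_estimate_cost(const struct bpf_prog *prog)
{
	u64 cost = 0;
	u32 i;

	for (i = 0; i < prog->len; i++) {
		const struct bpf_insn *insn = &prog->insnsi[i];

		if (BPF_CLASS(insn->code) == BPF_JMP &&
		    BPF_OP(insn->code) == BPF_CALL)
			cost += 100;	/* helper/kfunc calls weigh the most */
		else if (BPF_CLASS(insn->code) == BPF_LDX ||
			 BPF_CLASS(insn->code) == BPF_STX)
			cost += 10;	/* memory accesses */
		else
			cost += 1;	/* plain ALU/branch */
	}
	return cost;
}

/* Clone only progs whose static estimate crosses an arbitrary bound. */
static bool bpf_prog_should_clone(const struct bpf_prog *prog)
{
	return bpf_prog_estimate_cost(prog) > 100000;
}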

The other option, of course, is to do runtime checks of a
prog->aux->terminate bit and start failing things that way.
Most of the time the branch will be untaken. The bit can be set to 1 when
a prog detects a fatal condition, or (later) from a watchdog.
cond_break, rqspinlock, iterators, and bpf_loop can all be adjusted to
check it; see the sketch below.
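
E.g. something like this; note this is just a sketch, not the in-tree
bpf_loop, and aux->terminate is the proposed new field, not an
existing one:

#include <linux/filter.h>

/* Simplified looping helper that polls the proposed terminate bit on
 * every iteration and fails fast once it is set.
 */
static long bpf_loop_sketch(struct bpf_prog *prog, u32 nr_loops,
			    bpf_callback_t callback_fn, void *ctx)
{
	u32 i;
	u64 ret;

	for (i = 0; i < nr_loops; i++) {
		/* Almost always untaken; aux->terminate would be set
		 * by the watchdog or when a fatal condition is hit.
		 */
		if (unlikely(READ_ONCE(prog->aux->terminate)))
			return -EINTR;
		ret = callback_fn((u64)i, (u64)(long)ctx, 0, 0, 0);
		if (ret)
			break;
	}
	return i;
}

An amortized variant would only poll, say, every 64th iteration
(!(i & 63)) to keep the common path cheap.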

When I was playing with all this, this is basically what I did
(amortized or not), and I think the overhead was negligible/within the
range of noise.
If a prog is hot, the cache line holding the terminate bit is almost
always in cache, and it's not a load dependency for other accesses.
If the prog is not hot, the cost of every other state access dominates
anyway.

Thread overview: 19+ messages
2025-06-14  6:40 [RFC bpf-next v2 0/4] bpf: Fast-Path approach for BPF program Siddharth Chintamaneni
2025-06-14  6:40 ` [RFC bpf-next v2 1/4] bpf: Introduce new structs and struct fields Siddharth Chintamaneni
2025-06-14  6:40 ` [RFC bpf-next v2 2/4] bpf: Generating a stubbed version of BPF program for termination Siddharth Chintamaneni
2025-06-30 12:47   ` Kumar Kartikeya Dwivedi
2025-07-04 17:32     ` Raj Sahu
2025-07-04 18:37       ` Kumar Kartikeya Dwivedi
2025-06-14  6:40 ` [RFC bpf-next v2 3/4] bpf: Runtime part of fast-path termination approach Siddharth Chintamaneni
2025-06-30 12:15   ` Kumar Kartikeya Dwivedi
2025-07-04 17:29     ` Raj Sahu
2025-07-04 19:10       ` Kumar Kartikeya Dwivedi
2025-07-07 17:40         ` Alexei Starovoitov
2025-07-07 19:16           ` Kumar Kartikeya Dwivedi [this message]
2025-07-07 20:09             ` Alexei Starovoitov
2025-07-07 22:08             ` Zvi Effron
2025-07-07 22:39               ` Kumar Kartikeya Dwivedi
2025-07-08  7:07           ` Raj Sahu
2025-07-10  0:54             ` Kumar Kartikeya Dwivedi
2025-06-14  6:40 ` [RFC bpf-next v2 4/4] selftests/bpf: Adds selftests to check termination of long running nested bpf loops Siddharth Chintamaneni
2025-06-30 13:03 ` [RFC bpf-next v2 0/4] bpf: Fast-Path approach for BPF program Kumar Kartikeya Dwivedi
