From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Joel Fernandes <joelaf@google.com>,
Daniel Colascione <dancol@google.com>,
Alexei Starovoitov <ast@fb.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
Tim Murray <timmurray@google.com>,
Daniel Borkmann <daniel@iogearbox.net>,
netdev <netdev@vger.kernel.org>, fengc <fengc@google.com>
Subject: Re: [RFC] Add BPF_SYNCHRONIZE bpf(2) command
Date: Mon, 9 Jul 2018 15:19:03 -0700 [thread overview]
Message-ID: <20180709221903.GK3593@linux.vnet.ibm.com> (raw)
In-Reply-To: <566257859.2699.1531172134285.JavaMail.zimbra@efficios.com>
On Mon, Jul 09, 2018 at 05:35:34PM -0400, Mathieu Desnoyers wrote:
>
>
> ----- On Jul 9, 2018, at 5:09 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
>
> > On Sun, Jul 08, 2018 at 04:54:38PM -0400, Mathieu Desnoyers wrote:
> >> ----- On Jul 7, 2018, at 4:33 PM, Joel Fernandes joelaf@google.com wrote:
> >>
> >> > On Fri, Jul 06, 2018 at 07:54:28PM -0700, Alexei Starovoitov wrote:
> >> >> On Fri, Jul 06, 2018 at 06:56:16PM -0700, Daniel Colascione wrote:
> >> >> > BPF_SYNCHRONIZE waits for any BPF programs active at the time of
> >> >> > BPF_SYNCHRONIZE to complete, allowing userspace to ensure atomicity of
> >> >> > RCU data structure operations with respect to active programs. For
> >> >> > example, userspace can update a map->map entry to point to a new map,
> >> >> > use BPF_SYNCHRONIZE to wait for any BPF programs using the old map to
> >> >> > complete, and then drain the old map without fear that BPF programs
> >> >> > may still be updating it.
> >> >> >
> >> >> > Signed-off-by: Daniel Colascione <dancol@google.com>
> >> >> > ---
> >> >> > include/uapi/linux/bpf.h | 1 +
> >> >> > kernel/bpf/syscall.c | 14 ++++++++++++++
> >> >> > 2 files changed, 15 insertions(+)
> >> >> >
> >> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> >> > index b7db3261c62d..4365c50e8055 100644
> >> >> > --- a/include/uapi/linux/bpf.h
> >> >> > +++ b/include/uapi/linux/bpf.h
> >> >> > @@ -98,6 +98,7 @@ enum bpf_cmd {
> >> >> > BPF_BTF_LOAD,
> >> >> > BPF_BTF_GET_FD_BY_ID,
> >> >> > BPF_TASK_FD_QUERY,
> >> >> > + BPF_SYNCHRONIZE,
> >> >> > };
> >> >> >
> >> >> > enum bpf_map_type {
> >> >> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> >> > index d10ecd78105f..60ec7811846e 100644
> >> >> > --- a/kernel/bpf/syscall.c
> >> >> > +++ b/kernel/bpf/syscall.c
> >> >> > @@ -2272,6 +2272,20 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *,
> >> >> > uattr, unsigned int, siz
> >> >> > if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> >> >> > return -EPERM;
> >> >> >
> >> >> > + if (cmd == BPF_SYNCHRONIZE) {
> >> >> > + if (uattr != NULL || size != 0)
> >> >> > + return -EINVAL;
> >> >> > + err = security_bpf(cmd, NULL, 0);
> >> >> > + if (err < 0)
> >> >> > + return err;
> >> >> > + /* BPF programs are run with preempt disabled, so
> >> >> > + * synchronize_sched is sufficient even with
> >> >> > + * RCU_PREEMPT.
> >> >> > + */
> >> >> > + synchronize_sched();
> >> >> > + return 0;
> >> >>
> >> >> I don't think it's necessary. sys_membarrier() can do this already
> >> >> and some folks use it exactly for this use case.
> >> >
> >> > Alexei, the use of sys_membarrier for this purpose seems kind of weird to me
> >> > though. No where does the manpage say membarrier should be implemented this
> >> > way so what happens if the implementation changes?
> >> >
> >> > Further, membarrier manpage says that a memory barrier should be matched with
> >> > a matching barrier. In this use case there is no matching barrier, so it
> >> > makes it weirder.
> >> >
> >> > Lastly, sys_membarrier seems will not work on nohz-full systems, so its a bit
> >> > fragile to depend on it for this?
> >> >
> >> > case MEMBARRIER_CMD_GLOBAL:
> >> > /* MEMBARRIER_CMD_GLOBAL is not compatible with nohz_full. */
> >> > if (tick_nohz_full_enabled())
> >> > return -EINVAL;
> >> > if (num_online_cpus() > 1)
> >> > synchronize_sched();
> >> > return 0;
> >> >
> >> >
> >> > Adding Mathieu as well who I believe is author/maintainer of membarrier.
> >>
> >> See commit 907565337
> >> "Fix: Disable sys_membarrier when nohz_full is enabled"
> >>
> >> "Userspace applications should be allowed to expect the membarrier system
> >> call with MEMBARRIER_CMD_SHARED command to issue memory barriers on
> >> nohz_full CPUs, but synchronize_sched() does not take those into
> >> account."
> >>
> >> So AFAIU you'd want to re-use membarrier to issue synchronize_sched, and you
> >> only care about kernel preempt off critical sections.
> >>
> >> Clearly bpf code does not run in user-space, so it would "work".
> >>
> >> But the guarantees provided by membarrier are not to synchronize against
> >> preempt off per se. It's just that the current implementation happens to
> >> do that. The point of membarrier is to turn user-space memory barriers
> >> into compiler barriers.
> >>
> >> If what you need is to wait for a RCU grace period for whatever RCU flavor
> >> ebpf is using, I would against using membarrier for this. I would rather
> >> recommend adding a dedicated BPF_SYNCHRONIZE so you won't leak
> >> implementation details to user-space, *and* you can eventually change you
> >> RCU implementation for e.g. SRCU in the future if needed.
> >
> > The point about future changes to underlying bpf mechanisms is valid.
> > There is work already on the way to reduce the scope of preempt_off+rcu_lock
> > that currently lasts the whole prog. We will have new prog types that won't
> > have such wrappers and will do rcu_lock/unlock and preempt on/off only
> > when necessary.
> > So something like BPF_SYNCHRONIZE will break soon, since the kernel cannot have
> > guarantees on when programs finish. Calling this command BPF_SYNCHRONIZE_PROG
> > also won't make sense for the same reason.
> > What we can do it instead is to define synchronization barrier for
> > programs accessing maps. May be call it something like:
> > BPF_SYNC_MAP_ACCESS ?
> > uapi/bpf.h would need to have extensive comment what this barrier is doing.
> > Implementation should probably call synchronize_rcu() and not play games
> > with synchronize_sched(), since that's going too much into implementation.
> > Also should such sys_bpf command be root only?
> > I'm not sure whether dos attack can be made by spamming synchronize_rcu()
> > and synchronize_sched() for that matter.
>
> Adding Paul E. McKenney in CC. He may want to share his thoughts on the matter.
Let's see...
Spamming synchronize_rcu() and synchronize_sched() should be a non-event,
at least aside from the CPUs doing the spamming. The reason for this
is that a given task can only fire off a single synchronize_sched or
synchronize_rcu() per few milliseconds, so you need a -lot- of tasks
to have much effect, at which point the sheer number of tasks is much
more a problem than the large number of outstanding synchronize_rcu()
or synchronize_sched() invocations.
I very strongly agree that usermode should have a operation that
synchronizes with whatever eBPF uses, rather than something that forces
a specific type of RCU grace period.
Finally, in a few releases, synchronize_sched() will be retiring in favor
of synchronize_rcu(), which will wait on preemption-disabled regions of
code in addition to waiting on RCU read-side critical sections. Not a
big deal, as I expect to enlist Coccinelle's aid in this.
Did I manage to hit all the high points?
Thanx, Paul
next prev parent reply other threads:[~2018-07-09 22:16 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-07-07 1:56 [RFC] Add BPF_SYNCHRONIZE bpf(2) command Daniel Colascione
2018-07-07 2:54 ` Alexei Starovoitov
2018-07-07 3:22 ` Daniel Colascione
2018-07-07 20:33 ` Joel Fernandes
2018-07-08 20:54 ` Mathieu Desnoyers
2018-07-09 21:09 ` Alexei Starovoitov
2018-07-09 21:35 ` Mathieu Desnoyers
2018-07-09 22:19 ` Paul E. McKenney [this message]
2018-07-09 22:19 ` Alexei Starovoitov
2018-07-09 22:48 ` Paul E. McKenney
2018-07-09 21:36 ` Daniel Colascione
2018-07-09 22:10 ` Alexei Starovoitov
2018-07-09 22:21 ` Daniel Colascione
2018-07-09 22:34 ` Alexei Starovoitov
2018-07-10 5:25 ` Chenbo Feng
2018-07-10 23:52 ` Alexei Starovoitov
2018-07-11 2:46 ` Lorenzo Colitti
2018-07-11 3:40 ` Alexei Starovoitov
2018-07-14 18:18 ` Joel Fernandes
2018-07-16 15:29 ` Daniel Colascione
2018-07-16 20:23 ` Joel Fernandes
2018-07-26 16:51 ` [PATCH v2] Add BPF_SYNCHRONIZE_MAPS " Daniel Colascione
2018-07-27 19:17 ` [RFC] Add BPF_SYNCHRONIZE " Daniel Colascione
2018-07-10 5:13 ` Joel Fernandes
2018-07-10 16:42 ` Paul E. McKenney
2018-07-10 16:57 ` Joel Fernandes
2018-07-10 17:12 ` Paul E. McKenney
2018-07-10 17:29 ` Joel Fernandes
2018-07-10 17:42 ` Paul E. McKenney
-- strict thread matches above, loose matches on Subject: below --
2018-07-29 15:57 Alexei Starovoitov
2018-07-30 22:26 ` Joel Fernandes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180709221903.GK3593@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=alexei.starovoitov@gmail.com \
--cc=ast@fb.com \
--cc=dancol@google.com \
--cc=daniel@iogearbox.net \
--cc=fengc@google.com \
--cc=joelaf@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=netdev@vger.kernel.org \
--cc=timmurray@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.