From: Daniel Borkmann <daniel@iogearbox.net>
To: Hannes Frederic Sowa <hannes@stressinduktion.org>, davem@davemloft.net
Cc: ast@plumgrid.com, viro@ZenIV.linux.org.uk, ebiederm@xmission.com,
	tgraf@suug.ch, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>
Subject: Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
Date: Fri, 16 Oct 2015 15:36:18 +0200
Message-ID: <5620FD52.2060103@iogearbox.net>
In-Reply-To: <1444991103.2861759.411876897.42C807BD@webmail.messagingengine.com>

On 10/16/2015 12:25 PM, Hannes Frederic Sowa wrote:
> On Fri, Oct 16, 2015, at 03:09, Daniel Borkmann wrote:
>> This eventually leads us to this patch, which implements a minimal
>> eBPF file system. The idea is a bit similar, except that these
>> inodes reside at one or multiple mount points. A directory
>> hierarchy can be tailored to a specific application use case by
>> the various subsystem users, with maps/progs pinned inside it. Two
>> new eBPF commands (BPF_PIN_FD, BPF_NEW_FD) have been added to the
>> syscall in order to create one or multiple special inodes from an
>> existing file descriptor that points to a map/program (we call it
>> eBPF fd pinning), or to create a new file descriptor from an
>> existing special inode. BPF_PIN_FD requires CAP_SYS_ADMIN
>> capabilities, whereas BPF_NEW_FD can also be used unprivileged,
>> given appropriate permissions on the path.
>
> In my opinion this is very un-unixy, I have to say.
>
> Namespaces at some point dealt with the same problem; nowadays they
> use bind mounts of /proc/$$/ns/* to some place in the file hierarchy
> to keep the namespace alive. This at least allows someone to build
> up their own hierarchy with normal Unix tools, not hidden inside a
> C program. For file descriptors we already have /proc/$$/fd/*, but
> it seems that doesn't work out of the box nowadays.

Yes, that doesn't work out of the box, but I also don't know how
usable it would really be. The idea is roughly similar to the paths
passed to bind(2)/connect(2) on Unix domain sockets, as mentioned.
You have a map/prog resource that you stick to a special inode, so
that you can retrieve it at a later point in time, from the same or
from different processes, through a new fd pointing to the resource
from the user side, and then perform the bpf(2) syscall on it.
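
To make that concrete, usage of the two new commands could look
roughly like the snippet below. This is only a sketch: the bpf_attr
member names (fd, pathname) and the mount point are placeholders for
illustration, not necessarily the exact layout from the patch.

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>	/* BPF_PIN_FD/BPF_NEW_FD from this series */

  /* Thin wrapper, since glibc provides none for bpf(2). */
  static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
  {
          return syscall(__NR_bpf, cmd, attr, size);
  }

  int pin_and_reopen(int map_fd)
  {
          /* Somewhere below a bpf fs mount point; illustrative. */
          const char *path = "/mnt/bpf/my_map";
          union bpf_attr attr;

          /* Pin: back a special inode with the map behind map_fd
           * (requires CAP_SYS_ADMIN). */
          memset(&attr, 0, sizeof(attr));
          attr.fd = map_fd;
          attr.pathname = (__u64)(unsigned long)path;
          if (bpf(BPF_PIN_FD, &attr, sizeof(attr)) < 0)
                  return -1;

          /* Later, from the same or another process: get a new fd
           * to the map back (unprivileged, path permitting). */
          memset(&attr, 0, sizeof(attr));
          attr.pathname = (__u64)(unsigned long)path;
          return bpf(BPF_NEW_FD, &attr, sizeof(attr));
  }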

With Unix tools, you could still create/remove a hierarchy or unlink
inodes that hold maps/progs. You are correct that tools which don't
implement bpf(2) currently cannot access the content behind them,
since bpf(2) manages access to the data itself. I did like the second
idea mentioned in the commit log, though, but I don't know how
flexible we are in terms of adding S_IFBPF to the UAPI.
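
As an aside, the namespace technique you mention boils down to a
bind mount; a sketch of what ip-netns does under the hood (the
target path is illustrative):

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/mount.h>

  /* Keep the current net namespace alive past process exit by
   * bind-mounting its /proc entry onto a regular file, e.g.
   * /var/run/netns/foo. */
  int pin_netns(const char *target)
  {
          int fd = open(target, O_CREAT | O_RDONLY, 0644);

          if (fd < 0)
                  return -1;
          close(fd);
          return mount("/proc/self/ns/net", target, "none",
                       MS_BIND, NULL);
  }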

> I don't know how many objects bpf should be able to handle, nor
> whether such a bind-mount based solution would work for that; I
> guess not.
>
> In my opinion I still favor a user space approach. Subsystems which
> use eBPF in a way that no user space program needs to be running to
> control them would need to export the fds by themselves, e.g. via
> something like sysfs/kobject for tc? The hierarchy would then be
> under the control of the subsystem, which could also create a proper
> naming hierarchy, or maybe even use an already given one. Do most
> other eBPF users really need to persist file descriptors somewhere
> without user space control and pick them up later?

I was thinking about a strict predefined hierarchy dictated by the
kernel as well, but then settled on a more flexible approach that can
be tailored freely to various use cases. A predefined hierarchy would
most likely need to be resolved per subsystem, and it's not really
easy to map this properly. F.e. if the kernel were to provide unique
ids (as opposed to having a name or annotation member passed through
the syscall), they could end up being quite cryptic. If we let users
choose names, I'm not sure a single hierarchy level would be enough.
Then, additionally, you have facilities like tail calls that eBPF
programs can perform.
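
For concreteness, a tail call from one program into another, written
in the style of the samples/bpf helpers, looks roughly like this (the
section names and the slot index are illustrative):

  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") jmp_table = {
          .type        = BPF_MAP_TYPE_PROG_ARRAY,
          .key_size    = sizeof(u32),
          .value_size  = sizeof(u32),
          .max_entries = 8,
  };

  SEC("socket_root")
  int root_prog(struct __sk_buff *skb)
  {
          /* Jump into the program at slot 0 of jmp_table; if that
           * slot is empty, we simply fall through to the return. */
          bpf_tail_call(skb, &jmp_table, 0);
          return 0;
  }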

In such cases, one could even craft relationships where a (strictly
auto-generated) tree representation would not be sufficient (f.e.
recirculation up to a certain depth). The tail-called programs can
also be replaced atomically at runtime, etc. The other issue with a
per-subsystem representation is that bpf(2) is the central management
interface for creating/accessing maps/progs, while each subsystem
then has its own little interface to "install" them internally (f.e.
via netlink, setsockopt(2), etc). That means that, with tail calls,
only the 'root' programs are installed there, and further
transactions would be needed in order to make the individual
subsystems aware, so that they could potentially generate some
hierarchy; I don't know, it seems rather complex.
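
F.e. for the socket case, once a program has been loaded centrally
via bpf(2), the subsystem-specific install step is just a plain
setsockopt(2) call, roughly:

  #include <sys/socket.h>

  #ifndef SO_ATTACH_BPF
  #define SO_ATTACH_BPF 50	/* see asm-generic/socket.h */
  #endif

  /* Attach an eBPF program fd (from BPF_PROG_LOAD) to a socket.
   * Only this step is subsystem-specific; the fd itself comes
   * from the central bpf(2) interface. */
  int attach_prog(int sock, int prog_fd)
  {
          return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
                            &prog_fd, sizeof(prog_fd));
  }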

Thanks,
Daniel


Thread overview: 56+ messages
2015-10-16  1:09 [PATCH net-next 0/4] BPF updates Daniel Borkmann
2015-10-16  1:09 ` [PATCH net-next 1/4] bpf: abstract anon_inode_getfd invocations Daniel Borkmann
2015-10-16  1:09 ` [PATCH net-next 2/4] bpf: align and clean bpf_{map,prog}_get helpers Daniel Borkmann
2015-10-16  1:09 ` [PATCH net-next 3/4] bpf: add support for persistent maps/progs Daniel Borkmann
2015-10-16 10:25   ` Hannes Frederic Sowa
2015-10-16 13:36     ` Daniel Borkmann [this message]
2015-10-16 16:36       ` Hannes Frederic Sowa
2015-10-16 17:27         ` Daniel Borkmann
2015-10-16 17:37           ` Alexei Starovoitov
2015-10-16 16:18     ` Alexei Starovoitov
2015-10-16 16:43       ` Hannes Frederic Sowa
2015-10-16 17:32         ` Alexei Starovoitov
2015-10-16 17:37           ` Thomas Graf
2015-10-16 17:21   ` Hannes Frederic Sowa
2015-10-16 17:42     ` Alexei Starovoitov
2015-10-16 17:56       ` Daniel Borkmann
2015-10-16 18:41         ` Eric W. Biederman
2015-10-16 19:27           ` Alexei Starovoitov
2015-10-16 19:53             ` Eric W. Biederman
2015-10-16 20:56               ` Alexei Starovoitov
2015-10-16 23:44                 ` Eric W. Biederman
2015-10-17  2:43                   ` Alexei Starovoitov
2015-10-17 12:28                     ` Daniel Borkmann
2015-10-18  2:20                       ` Alexei Starovoitov
2015-10-18 15:03                         ` Daniel Borkmann
2015-10-18 16:49                           ` Daniel Borkmann
2015-10-18 20:59                             ` Alexei Starovoitov
2015-10-19  7:36                               ` Hannes Frederic Sowa
2015-10-19  9:51                                 ` Daniel Borkmann
2015-10-19 14:23                                   ` Daniel Borkmann
2015-10-19 16:22                                     ` Alexei Starovoitov
2015-10-19 17:37                                       ` Daniel Borkmann
2015-10-19 18:15                                         ` Alexei Starovoitov
2015-10-19 18:46                                           ` Hannes Frederic Sowa
2015-10-19 19:34                                             ` Alexei Starovoitov
2015-10-19 20:03                                               ` Hannes Frederic Sowa
2015-10-19 20:48                                                 ` Alexei Starovoitov
2015-10-19 22:17                                                   ` Daniel Borkmann
2015-10-20  0:30                                                     ` Alexei Starovoitov
2015-10-20  8:46                                                       ` Daniel Borkmann
2015-10-20 17:53                                                         ` Alexei Starovoitov
2015-10-20 18:56                                                           ` Eric W. Biederman
2015-10-21 15:17                                                             ` Daniel Borkmann
2015-10-21 18:34                                                               ` Thomas Graf
2015-10-21 22:44                                                                 ` Alexei Starovoitov
2015-10-22 13:22                                                                   ` Daniel Borkmann
2015-10-22 19:35                                                               ` Eric W. Biederman
2015-10-23 13:47                                                                 ` Daniel Borkmann
2015-10-20  9:43                                                       ` Hannes Frederic Sowa
2015-10-19 23:02                                                   ` Hannes Frederic Sowa
2015-10-20  1:09                                                     ` Alexei Starovoitov
2015-10-20 10:07                                                       ` Hannes Frederic Sowa
2015-10-20 18:44                                                         ` Alexei Starovoitov
2015-10-16 19:54             ` Daniel Borkmann
2015-10-16  1:09 ` [PATCH net-next 4/4] bpf: add sample usages " Daniel Borkmann
2015-10-19  2:53 ` [PATCH net-next 0/4] BPF updates David Miller
