From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752666AbbJQM3O (ORCPT <rfc822;w@1wt.eu>);
	Sat, 17 Oct 2015 08:29:14 -0400
Received: from www62.your-server.de ([213.133.104.62]:39860 "EHLO
	www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751365AbbJQM3N (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sat, 17 Oct 2015 08:29:13 -0400
Message-ID: <56223F00.5030203@iogearbox.net>
Date: Sat, 17 Oct 2015 14:28:48 +0200
From: Daniel Borkmann <daniel@iogearbox.net>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: Alexei Starovoitov <ast@plumgrid.com>,
        "Eric W. Biederman" <ebiederm@xmission.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>, davem@davemloft.net,
        viro@ZenIV.linux.org.uk, tgraf@suug.ch, netdev@vger.kernel.org,
        linux-kernel@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>
Subject: Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs
References: <cover.1444956943.git.daniel@iogearbox.net> <ab1fceb2d68876d89bb2ebb3d2b45486d3cf2388.1444956943.git.daniel@iogearbox.net> <1445016105.1251655.412231129.6574D430@webmail.messagingengine.com> <5621371C.2000507@plumgrid.com> <56213A61.40509@iogearbox.net> <87d1welkp8.fsf@x220.int.ebiederm.org> <56214FAC.5060704@plumgrid.com> <87y4f2io9l.fsf@x220.int.ebiederm.org> <5621649A.80403@plumgrid.com> <87mvviidku.fsf@x220.int.ebiederm.org> <5621B5BC.8020204@plumgrid.com>
In-Reply-To: <5621B5BC.8020204@plumgrid.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Authenticated-Sender: daniel@iogearbox.net
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/17/2015 04:43 AM, Alexei Starovoitov wrote:
> On 10/16/15 4:44 PM, Eric W. Biederman wrote:
>> Alexei Starovoitov <ast@plumgrid.com> writes:
>>
>>> We can argue about api for 2nd, whether it's mount with fd=1234 string
>>> or else, but for the first mount style doesn't make sense.
>>
>> Why does mount not make sense?  It is exactly what you are looking for
>> so why does it not make sense?
>
> hmm, how do you get a new fd back after mounting it?
> Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway,
> but now it's more convoluted and empty mounts are everywhere.

That would be my understanding as well, but as Alexei already said,
these are two different issues, it would be step 2 (let me get back
to that further below). But in any case, I don't really like dumping
key/value somewhere as files. You have binary blobs as both, and
lets say your application has a lookup-key (for whatever reason) of
several cachelines it all ends up getting rather messy than making
it really useful for non-bpf(2) aware cmdline tools to deal with.

Anyway, another idea I've been brainstorming with Hannes today a
bit is about the following:

We register two major numbers, one for eBPF maps (X), one for eBPF
progs (Y). A user can either via cmdline call something like ...
mknod /dev/bpf/maps/map_pkts c X Z to create a special character
device, or alternatively out of an application through mknod(2)
syscall (f.e. tc when setting up maps/progs internally from the obj
file for a classifer).

Then, we still have 2 eBPF commands for bpf(2) syscall to add, say
(for example) BPF_BIND_DEV and BPF_FETCH_DEV. The application that
created a map (or prog) already has the map fd and after mknod(2) it
can open(2) the special file to get the special file fd. Then it can
call something like bpf(BPF_BIND_DEV, &attr, sizeof(attr))) where
attr looks like:

   union bpf_attr attr = {
     .bpf_fd    = bpf_fd,
     .dev_fd    = dev_fd,
   };

The bpf(2) syscall can check whether dev_fd belongs to an eBPF special
file and it can then copy over file->private_data from the bpf_fd
to the dev_fd's underlying file, where the private_data, as we know,
from the bpf_fd already points to a proper bpf_map/bpf_prog structure.
The map/prog would then get ref'ed and lives onwards in the char device's
lifetime. No special hashtable, gc, etc needed. The char device has fops
that we can define by ourself, and unlinking would drop the ref from
its private_data.

Now to the other part: BPF_FETCH_DEV would work similar. The application
opens the device, and fills bpf_attr as follows again:

   union bpf_attr attr = {
     .bpf_fd    = 0,
     .dev_fd    = dev_fd,
   };

This would allow us to look up the map/prog from the dev_fd's file->
private_data, and installs a new fd via bpf_{map,prog}_new_fd() that
is returned from bpf(2) for bpf-related access. The remaining fops
from the char device could still be reserved for possibilities like
debugging in future.

Now in future (2nd step), could either be to use Eric's idea and then do
something like mount -t bpffs ... -o /dev/bpf/maps/map_pkts to dump
attributes or other properties to some location for inspection from such
a special file, or we could use kobjects for that attached to the device
if the fops from the cdev should not be sufficient.

So closing the loop to the special files where there were concerns:

This won't forbid to have a future shell-style access possibility, and
it would also not end up as a nightmare on what you mentioned with the
S_ISSOCK-like bit in the other email.

The pinning mechanism would not require an extra file system to be mounted
somewhere, and yet the user can define himself an arbitrary hierarchy
where he puts the special files as this facility already exists. An
approach like this looks overall cleaner to me, and most likely be
realizable in fewer lines of code as well.

Thoughts?

Cheers,
Daniel