From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752666AbbJQM3O (ORCPT ); Sat, 17 Oct 2015 08:29:14 -0400 Received: from www62.your-server.de ([213.133.104.62]:39860 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751365AbbJQM3N (ORCPT ); Sat, 17 Oct 2015 08:29:13 -0400 Message-ID: <56223F00.5030203@iogearbox.net> Date: Sat, 17 Oct 2015 14:28:48 +0200 From: Daniel Borkmann User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: Alexei Starovoitov , "Eric W. Biederman" CC: Hannes Frederic Sowa , davem@davemloft.net, viro@ZenIV.linux.org.uk, tgraf@suug.ch, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Alexei Starovoitov Subject: Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs References: <1445016105.1251655.412231129.6574D430@webmail.messagingengine.com> <5621371C.2000507@plumgrid.com> <56213A61.40509@iogearbox.net> <87d1welkp8.fsf@x220.int.ebiederm.org> <56214FAC.5060704@plumgrid.com> <87y4f2io9l.fsf@x220.int.ebiederm.org> <5621649A.80403@plumgrid.com> <87mvviidku.fsf@x220.int.ebiederm.org> <5621B5BC.8020204@plumgrid.com> In-Reply-To: <5621B5BC.8020204@plumgrid.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Authenticated-Sender: daniel@iogearbox.net Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/17/2015 04:43 AM, Alexei Starovoitov wrote: > On 10/16/15 4:44 PM, Eric W. Biederman wrote: >> Alexei Starovoitov writes: >> >>> We can argue about api for 2nd, whether it's mount with fd=1234 string >>> or else, but for the first mount style doesn't make sense. >> >> Why does mount not make sense? It is exactly what you are looking for >> so why does it not make sense? > > hmm, how do you get a new fd back after mounting it? > Note, open cannot be overloaded, so we end up with BPF_NEW_FD anyway, > but now it's more convoluted and empty mounts are everywhere. That would be my understanding as well, but as Alexei already said, these are two different issues, it would be step 2 (let me get back to that further below). But in any case, I don't really like dumping key/value somewhere as files. You have binary blobs as both, and lets say your application has a lookup-key (for whatever reason) of several cachelines it all ends up getting rather messy than making it really useful for non-bpf(2) aware cmdline tools to deal with. Anyway, another idea I've been brainstorming with Hannes today a bit is about the following: We register two major numbers, one for eBPF maps (X), one for eBPF progs (Y). A user can either via cmdline call something like ... mknod /dev/bpf/maps/map_pkts c X Z to create a special character device, or alternatively out of an application through mknod(2) syscall (f.e. tc when setting up maps/progs internally from the obj file for a classifer). Then, we still have 2 eBPF commands for bpf(2) syscall to add, say (for example) BPF_BIND_DEV and BPF_FETCH_DEV. The application that created a map (or prog) already has the map fd and after mknod(2) it can open(2) the special file to get the special file fd. Then it can call something like bpf(BPF_BIND_DEV, &attr, sizeof(attr))) where attr looks like: union bpf_attr attr = { .bpf_fd = bpf_fd, .dev_fd = dev_fd, }; The bpf(2) syscall can check whether dev_fd belongs to an eBPF special file and it can then copy over file->private_data from the bpf_fd to the dev_fd's underlying file, where the private_data, as we know, from the bpf_fd already points to a proper bpf_map/bpf_prog structure. The map/prog would then get ref'ed and lives onwards in the char device's lifetime. No special hashtable, gc, etc needed. The char device has fops that we can define by ourself, and unlinking would drop the ref from its private_data. Now to the other part: BPF_FETCH_DEV would work similar. The application opens the device, and fills bpf_attr as follows again: union bpf_attr attr = { .bpf_fd = 0, .dev_fd = dev_fd, }; This would allow us to look up the map/prog from the dev_fd's file-> private_data, and installs a new fd via bpf_{map,prog}_new_fd() that is returned from bpf(2) for bpf-related access. The remaining fops from the char device could still be reserved for possibilities like debugging in future. Now in future (2nd step), could either be to use Eric's idea and then do something like mount -t bpffs ... -o /dev/bpf/maps/map_pkts to dump attributes or other properties to some location for inspection from such a special file, or we could use kobjects for that attached to the device if the fops from the cdev should not be sufficient. So closing the loop to the special files where there were concerns: This won't forbid to have a future shell-style access possibility, and it would also not end up as a nightmare on what you mentioned with the S_ISSOCK-like bit in the other email. The pinning mechanism would not require an extra file system to be mounted somewhere, and yet the user can define himself an arbitrary hierarchy where he puts the special files as this facility already exists. An approach like this looks overall cleaner to me, and most likely be realizable in fewer lines of code as well. Thoughts? Cheers, Daniel