From: ebiederm@xmission.com (Eric W. Biederman)
To: Andy Lutomirski <luto@amacapital.net>
Cc: Alexei Starovoitov <ast@fb.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
"David S . Miller" <davem@davemloft.net>,
Daniel Borkmann <daniel@iogearbox.net>,
David Ahern <dsa@cumulusnetworks.com>, Tejun Heo <tj@kernel.org>,
Thomas Graf <tgraf@suug.ch>,
Network Development <netdev@vger.kernel.org>
Subject: Re: [PATCH net] bpf: expose netns inode to bpf programs
Date: Sat, 04 Feb 2017 10:06:41 +1300
Message-ID: <8737fvt25a.fsf@xmission.com>
In-Reply-To: <CALCETrWDxVHc4_hpjLQRj-XbvGX0+tbGZ8o388-xn6rNcQTySQ@mail.gmail.com> (Andy Lutomirski's message of "Fri, 3 Feb 2017 13:00:47 -0800")

Andy Lutomirski <luto@amacapital.net> writes:
> On Thu, Feb 2, 2017 at 8:33 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Alexei Starovoitov <ast@fb.com> writes:
>>
>>> On 1/26/17 11:07 AM, Andy Lutomirski wrote:
>>>> On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov <ast@fb.com> wrote:
>>>>> On 1/26/17 10:12 AM, Andy Lutomirski wrote:
>>>>>>
>>>>>> On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov <ast@fb.com> wrote:
>>>>>>>
>>>>>>> On 1/26/17 8:37 AM, Andy Lutomirski wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Think of bpf programs as safe kernel modules. They don't have
>>>>>>>>> confined boundaries and program authors, if not careful, can shoot
>>>>>>>>> themselves in the foot. We're not trying to prevent that because
>>>>>>>>> it's impossible to check that the program is sane. Just like
>>>>>>>>> it's impossible to check that kernel module is sane.
>>>>>>>>> But in case of bpf we check that bpf program is _safe_ from the kernel
>>>>>>>>> point of view. If it's doing some garbage, it's program's business.
>>>>>>>>> Does it make more sense now?
>>>>>>>>>
>>>>>>>>
>>>>>>>> With all due respect, I think this is not an acceptable way to think
>>>>>>>> about BPF at all. If you think of BPF this way, I think there needs
>>>>>>>> to be a real discussion at KS or similar as to whether this is okay.
>>>>>>>> The reason is simple: the kernel promises a stable ABI to userspace
>>>>>>>> but not to kernel modules. By thinking of BPF as more like a module,
>>>>>>>> you're taking a big shortcut that will either result in ABI breakage
>>>>>>>> down the road or in committing to a problematic stable ABI.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> you misunderstood the analogy.
>>>>>>> bpf abi is certainly stable. that's why we were careful of not
>>>>>>> exposing anything to it that is not already stable.
>>>>>>>
>>>>>>
>>>>>> In that case I don't understand what you're trying to say. Eric
>>>>>> thinks your patch exposes a bad interface. A bad interface for
>>>>>> userspace is a very different thing from a bad interface available to
>>>>>> kernel modules. Are you saying that BPF is kernel-module-like in that
>>>>>> the ABI exposed to BPF programs doesn't need to meet the same quality
>>>>>> standards as userspace ABIs?
>>>>>
>>>>>
>>>>> of course not.
>>>>> ns.inum is already exposed to user space as a value.
>>>>> This patch exposes it to bpf program in a convenient and stable way,
>>>>
>>>> Here's what I'm imaging Eric is thinking:
>>>>
>>>> ns.inum is currently exposed to userspace via procfs. In principle,
>>>> the value could be local to a namespace, though, which would enable
>>>> CRIU to be able to preserve namespace inode numbers across a
>>>> checkpoint+restore operation. If this happened, the contained and
>>>> restored procfs would see a different inode number than the outermost
>>>> procfs.
>>>
>>> sure. there are many different ways for the program to see inode
>>> that either was already reused or disappeared.
>>> What I'm saying that it is expected. We cannot prevent that from
>>> bpf side. Just like ifindex value read by the program can be bogus
>>> as in the example I just provided.
>>
>> The point is that we can make the inode number stable across migration
>> and the user space API for namespaces has been designed with that
>> possibility in mind.
>
> How does it help if BPF starts exposing both inode number and device
> number?

Adding the device number comparison helps in that it makes explicit what
is being compared against. That gives me at least a bit of a namespace
for the namespaces, and a program from a sufficiently wrong context will
have its comparisons fail rather than produce a match.

I think the operation that is exported to BPF should be a full
comparison of device and inode number, so that it can be
optimized/compiled to something else depending upon the context.

In other words, the compilation of the bpf program would have the
opportunity to remove the namespace dependency and make the program work
in a global context, so we don't have to carry namespace information
around at run time.

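To make the comparison concrete, here is a minimal userspace sketch of
what "compare device and inode number" means for a namespace file such
as /proc/self/ns/net. It is only an illustration of the idea, not the
proposed BPF operation: the names NsId, ns_id and same_ns are
hypothetical, and the sketch assumes that the (st_dev, st_ino) pair
returned by stat is what identifies a namespace to userspace.

```python
import os
from typing import NamedTuple


class NsId(NamedTuple):
    """Hypothetical helper type: a namespace identity as (device, inode)."""
    dev: int
    ino: int


def ns_id(path: str) -> NsId:
    # os.stat() follows symlinks, so for a path such as /proc/self/ns/net
    # this returns st_dev and st_ino of the namespace file itself.
    st = os.stat(path)
    return NsId(st.st_dev, st.st_ino)


def same_ns(a: NsId, b: NsId) -> bool:
    # Compare both fields: an inode number is only meaningful relative to
    # the device (filesystem instance) that issued it.
    return a.dev == b.dev and a.ino == b.ino
```

On a Linux system, same_ns(ns_id("/proc/self/ns/net"),
ns_id("/proc/1/ns/net")) would report whether the caller shares a
network namespace with pid 1, given permission to read the latter.
Two identities with equal inode numbers but different device numbers
compare as different, which is the "fail rather than match" behaviour
described above.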
> ISTM any ability to migrate namespaces and to migrate eBPF programs
> that know about namespaces needs to have the eBPF program firmly
> rooted in some namespace (or perhaps cgroup in this case) so that it
> can see a namespaced view of the world. For this to work, presumably
> we need to make sure that eBPF programs that are installed by programs
> that are in a container don't see traffic that isn't in that
> container. This is part of why I think that we should consider
> preventing programs that aren't in the root namespace (perhaps *all*
> the root namespaces) from installing bpf+cgroup programs in the first
> place until there's a clearer understanding of how this all fits
> together.

Andy, I agree, at least to the point that those programs are reading
attributes that are in a namespace. That is something that should be
straightforward to verify in the bpf checker when installing the
program.

Eric