From mboxrd@z Thu Jan 1 00:00:00 1970
From: ebiederm@xmission.com (Eric W. Biederman)
Subject: Re: [PATCH net] bpf: expose netns inode to bpf programs
Date: Sat, 04 Feb 2017 10:06:41 +1300
Message-ID: <8737fvt25a.fsf@xmission.com>
References: <1485401274-2836524-1-git-send-email-ast@fb.com> <87efzq8jbi.fsf@xmission.com> <588995DE.9040707@fb.com> <588A35F8.6050909@fb.com> <588A40D7.9070603@fb.com> <588A4D3C.9070601@fb.com> <87r33fevva.fsf@xmission.com>
Mime-Version: 1.0
Content-Type: text/plain
Cc: Alexei Starovoitov, Linus Torvalds, "David S . Miller", Daniel Borkmann, David Ahern, Tejun Heo, Thomas Graf, Network Development
To: Andy Lutomirski
Return-path:
Received: from out01.mta.xmission.com ([166.70.13.231]:56348 "EHLO out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752478AbdBCVLP (ORCPT ); Fri, 3 Feb 2017 16:11:15 -0500
In-Reply-To: (Andy Lutomirski's message of "Fri, 3 Feb 2017 13:00:47 -0800")
Sender: netdev-owner@vger.kernel.org
List-ID:

Andy Lutomirski writes:

> On Thu, Feb 2, 2017 at 8:33 PM, Eric W. Biederman wrote:
>> Alexei Starovoitov writes:
>>
>>> On 1/26/17 11:07 AM, Andy Lutomirski wrote:
>>>> On Thu, Jan 26, 2017 at 10:32 AM, Alexei Starovoitov wrote:
>>>>> On 1/26/17 10:12 AM, Andy Lutomirski wrote:
>>>>>>
>>>>>> On Thu, Jan 26, 2017 at 9:46 AM, Alexei Starovoitov wrote:
>>>>>>>
>>>>>>> On 1/26/17 8:37 AM, Andy Lutomirski wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Think of bpf programs as safe kernel modules. They don't have
>>>>>>>>> confined boundaries and program authors, if not careful, can shoot
>>>>>>>>> themselves in the foot. We're not trying to prevent that because
>>>>>>>>> it's impossible to check that the program is sane. Just like
>>>>>>>>> it's impossible to check that kernel module is sane.
>>>>>>>>> But in case of bpf we check that bpf program is _safe_ from the kernel
>>>>>>>>> point of view. If it's doing some garbage, it's program's business.
>>>>>>>>> Does it make more sense now?
>>>>>>>>>
>>>>>>>>
>>>>>>>> With all due respect, I think this is not an acceptable way to think
>>>>>>>> about BPF at all. If you think of BPF this way, I think there needs
>>>>>>>> to be a real discussion at KS or similar as to whether this is okay.
>>>>>>>> The reason is simple: the kernel promises a stable ABI to userspace
>>>>>>>> but not to kernel modules. By thinking of BPF as more like a module,
>>>>>>>> you're taking a big shortcut that will either result in ABI breakage
>>>>>>>> down the road or in committing to a problematic stable ABI.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> you misunderstood the analogy.
>>>>>>> bpf abi is certainly stable. that's why we were careful of not
>>>>>>> exposing anything to it that is not already stable.
>>>>>>>
>>>>>>
>>>>>> In that case I don't understand what you're trying to say. Eric
>>>>>> thinks your patch exposes a bad interface. A bad interface for
>>>>>> userspace is a very different thing from a bad interface available to
>>>>>> kernel modules. Are you saying that BPF is kernel-module-like in that
>>>>>> the ABI exposed to BPF programs doesn't need to meet the same quality
>>>>>> standards as userspace ABIs?
>>>>>
>>>>>
>>>>> of course not.
>>>>> ns.inum is already exposed to user space as a value.
>>>>> This patch exposes it to bpf program in a convenient and stable way,
>>>>
>>>> Here's what I'm imaging Eric is thinking:
>>>>
>>>> ns.inum is currently exposed to userspace via procfs. In principle,
>>>> the value could be local to a namespace, though, which would enable
>>>> CRIU to be able to preserve namespace inode numbers across a
>>>> checkpoint+restore operation. If this happened, the contained and
>>>> restored procfs would see a different inode number than the outermost
>>>> procfs.
>>>
>>> sure. there are many different ways for the program to see inode
>>> that either was already reused or disappeared.
>>> What I'm saying that it is expected. We cannot prevent that from
>>> bpf side.
>>> Just like ifindex value read by the program can be bogus
>>> as in the example I just provided.
>>
>> The point is that we can make the inode number stable across migration
>> and the user space API for namespaces has been designed with that
>> possibility in mind.
>
> How does it help if BPF starts exposing both inode number and device
> number?

Adding the device number comparison helps in that it is explicit what is
being compared against. That gives me at least a bit of a namespace for
the namespaces, and a program from a sufficiently wrong context will
have its comparisons fail rather than having a match.

I think the operation that is exported in the BPF should be a full
comparison operation of device and inode number so that it could be
optimized/compiled to something else depending upon the context. AKA
the compilation of the bpf program would have the opportunity to remove
the namespace dependency and make the program work in a global context.
So we don't have to carry namespace information around at run time.

> ISTM any ability to migrate namespaces and to migrate eBPF programs
> that know about namespaces needs to have the eBPF program firmly
> rooted in some namespace (or perhaps cgroup in this case) so that it
> can see a namespaced view of the world. For this to work, presumably
> we need to make sure that eBPF programs that are installed by programs
> that are in a container don't see traffic that isn't in that
> container. This is part of why I think that we should consider
> preventing programs that aren't in the root namespace (perhaps *all*
> the root namespaces) from installing bpf+cgroup programs in the first
> place until there's a clearer understanding of how this all fits
> together.

Andy, I agree. At least to the point that those programs are reading
attributes that are in a namespace. That is something that should be
straightforward to verify in the bpf checker when installing the
program.

Eric
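
For illustration only, here is a minimal userspace sketch of the full
device-plus-inode comparison discussed above. It is written in Python,
not BPF, and it is not the interface proposed in the patch. The names
NsId, ns_id and same_namespace are hypothetical. The one system
assumption is that stat() on a namespace file such as /proc/self/ns/net
reports both st_dev and st_ino on Linux.

```python
import os
from typing import NamedTuple


class NsId(NamedTuple):
    """Hypothetical identifier for a namespace: (device, inode) pair."""
    dev: int
    ino: int


def ns_id(path: str) -> NsId:
    """Return the (st_dev, st_ino) pair for the file at `path`.

    On Linux, `path` would typically be a namespace file such as
    /proc/self/ns/net (assumption: that file exists and is readable).
    """
    st = os.stat(path)
    return NsId(st.st_dev, st.st_ino)


def same_namespace(a: NsId, b: NsId) -> bool:
    """Full comparison: both the device and the inode number must match.

    Comparing the inode number alone would treat equal inode numbers
    taken from different devices as the same object.
    """
    return a.dev == b.dev and a.ino == b.ino
```

With this, same_namespace(ns_id("/proc/self/ns/net"), expected) only
matches when the caller's value came from the same device, so a value
recorded in a sufficiently wrong context fails the comparison instead
of matching by accident.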