* Re: [PATCH net-next 1/2] bpf: allow extended BPF programs access skb fields
From: Alexei Starovoitov @ 2015-03-13 17:09 UTC (permalink / raw)
To: Daniel Borkmann, David S. Miller
Cc: Thomas Graf, linux-api, netdev, linux-kernel
In-Reply-To: <550313B6.2080701@iogearbox.net>
On 3/13/15 9:43 AM, Daniel Borkmann wrote:
> On 03/13/2015 05:22 PM, Alexei Starovoitov wrote:
>> On 3/13/15 2:57 AM, Daniel Borkmann wrote:
>>> On 03/13/2015 03:21 AM, Alexei Starovoitov wrote:
>>>> introduce user accessible mirror of in-kernel 'struct sk_buff':
>>>
>>> For each member, I'd also add BUILD_BUG_ON()s similarly as we have in
>>> convert_bpf_extensions(). That way, people won't forget to adjust the
>>> code.
>>
>> I thought about it, but didn't add it, since we already have them
>> in the same file several lines above this spot.
>> I think one build error per .c file should be enough to attract
>> attention.
>> Though I'll add a comment to convert_bpf_extensions() that build_bug_on
>> errors should be addressed in two places.
>
> My concern would be that in case the code gets changed, this spot
> could easily be overlooked by someone possibly unfamiliar with the
> code, since no build error triggers there.
>
> So I guess it wouldn't hurt or cost us much to also adding the
> BUILD_BUG_ON()s there instead of a comment.
I think it's overkill, but fine, it's minor. Will add another set
of build_bug_ons and see how it looks.
^ permalink raw reply
* Re: [PATCH v4 3/9] selftests: Introduce minimal shared logic for running tests
From: Shuah Khan @ 2015-03-13 17:19 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL, mmarek-AlSwsSmVLrQ,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426046765-19289-3-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
On 03/10/2015 10:05 PM, Michael Ellerman wrote:
> This adds a Make include file which most selftests can then include to
> get the run_tests logic.
>
> On its own this has the advantage of some reduction in repetition, and
> also means the pass/fail message is defined in fewer places.
>
> However the key advantage is it will allow us to implement install very
> simply in a subsequent patch.
>
> The default implementation just executes each program in $(TEST_PROGS).
>
> We use a variable to hold the default implementation of $(RUN_TESTS)
> because that gives us a clean way to override it if necessary, ie. using
> override. The mount, memory-hotplug and mqueue tests use that to provide
> a different implementation.
>
> Tests are not run via /bin/bash, so if they are scripts they must be
> executable, we add a+x to several.
>
> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
This patch will be applied to next and queued for 4.1.
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978
^ permalink raw reply
* Re: [PATCH v4 4/9] selftests: Add install target
From: Shuah Khan @ 2015-03-13 17:20 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL, mmarek-AlSwsSmVLrQ,
linux-api-u79uwXL29TY76Z2rM5mHXA, Shuah Khan
In-Reply-To: <1426046765-19289-4-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
On 03/10/2015 10:06 PM, Michael Ellerman wrote:
> This adds make install support to selftests. The basic usage is:
>
> $ cd tools/testing/selftests
> $ make install
>
> That installs into tools/testing/selftests/install, which can then be
> copied where ever necessary.
>
> The install destination is also configurable using eg:
>
> $ INSTALL_PATH=/mnt/selftests make install
>
> The implementation uses two targets in the child makefiles. The first
> "install" is expected to install all files into $(INSTALL_PATH).
>
> The second, "emit_tests", is expected to emit the test instructions (ie.
> bash script) on stdout. Separating this from install means the child
> makefiles need no knowledge of the location of the test script.
>
> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
Thanks for doing the work. This patch will be applied to next and will
be queued for 4.1
thanks,
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978
^ permalink raw reply
* [GIT PULL] kselftest fixes for 4.0-rc4
From: Shuah Khan @ 2015-03-13 17:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
open list:KERNEL SELFTEST F...
Hi Linus,
Please pull the kselftest fix for 4.0-rc4.
thanks,
-- Shuah
The following changes since commit 9eccca0843205f87c00404b663188b88eb248051:
Linux 4.0-rc3 (2015-03-08 16:09:09 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
tags/linux-kselftest-4.0-rc4
for you to fetch changes up to 9a0b57451ae8142c74d65bddb6d7765818babbed:
selftests/exec: Check if the syscall exists and bail if not
(2015-03-11 10:15:19 -0600)
----------------------------------------------------------------
Kselftest fixes for Linux 4.0-rc4
selftests/exec: Check if the syscall exists and bail if not
----------------------------------------------------------------
Michael Ellerman (1):
selftests/exec: Check if the syscall exists and bail if not
tools/testing/selftests/exec/execveat.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978
^ permalink raw reply
* Re: [RFC] capabilities: Ambient capabilities
From: Andy Lutomirski @ 2015-03-13 17:57 UTC (permalink / raw)
To: Andrew G. Morgan
Cc: Jarkko Sakkinen, Ted Ts'o, Andrew Lutomirski, Andrew Morton,
Michael Kerrisk, Mimi Zohar, Linux API, Austin S Hemmelgarn,
linux-security-module, Aaron Jones, Christoph Lameter, LKML,
Serge Hallyn, Markku Savela, Kees Cook, Jonathan Corbet
In-Reply-To: <CALQRfL6HUGMfPtsqLcSed8G14Yqg27R8tpDnMZOKtvQE+rqX0A@mail.gmail.com>
On Mar 13, 2015 6:24 AM, "Andrew G. Morgan" <morgan@kernel.org> wrote:
>
> > It's to preserve the invariant that pA is always a subset of pI.
>
> But since a user can always raise a bit in pI if it is present in pP,
> what does this invariant add to your model other than inconvenience?
The useful part is that dropping a bit from pI also drops it from pA.
To keep the model consistent, I also require that you add the bit to
pI before adding it to pA.
>
> >> I'm also unclear how you can turn off this new 'feature' for a process
> >> tree? As it is, the code creates an exploit path for a capable (pP !=
> >> 0) program with an exploitable flaw to create a privilege escalation
> >> for an arbitrary child program.
> >
> > Huh? If you exploit the parent, you already win. Yes, if a kiddie
> > injects shellcode that does system("/bin/bash") into some pP != 0
> > program, they don't actually elevate their privileges. On the other
> > hand, by the time an attacker injected shellcode for:
> >
> > prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_SYS_ADMIN);
> > system("/bin/bash");
>
> Let's call the above two lines [a] and [b]. With this patch, you are
> encouraging folk to write programs that contain a line like [a]
> already. So, yes, I am saying that you are creating an exploitable
> path in these programs that says if someone can inject
> system("/bin/bash") into these programs they can get a new (because of
> this patch) privilege escalation.
>
> In the prevailing model, this kind of privilege escalation (resulting
> from naive inheritance) is designed out. I recognize that you want to
> add it back in, but I am concerned that you are not allowing for the
> possibility that some folk might still want to still be able to
> prevent it.
If you have a program that deliberately uses PR_CAP_AMBIENT, then
setting such a securebit will break the program, so it still doesn't
buy you anything.
>
> > into a target, they can already do whatever they want.
> >
> >> While I understand that everyone
> >> 'knows what they are doing' in implementing this change, I'm convinced
> >> that folk that are up to no good also do... Why not provide a lockable
> >> secure bit to selectively disable this support?
> >
> > Show me a legitimate use case and I'll gladly implement a secure bit.
>
> Thanks. I was kind of hoping that you would add a lockable secure bit
> that defaults this support to off, but can be used to turn it on with
> or without a lock. That way, you can avoid disturbing the legacy
> defaults (no surprises).
I think this thing needs to default on to be useful.
>
> > In the mean time, I don't even believe that there's a legitimate use
> > for any of the other secure bits (except keepcaps, and I don't know
> > why that's a securebit in the first place).
>
> Those bits currently make it possible to run a subsystem with no
> [set]uid-0 support in its process tree.
Not usefully, because even with all the securebits set to their
non-legacy modes, caps don't inherit, so it doesn't work. I've tried.
>
> > In the mean time, see CVE-2014-3215 for an example of why securebits
> > are probably more trouble than they're worth.
>
> I think it is safe to say that naive privilege inheritance has a fair
> track record of being exploited orders of magnitude more frequently
> than this. After all, these are the reasons LD_PRELOAD and shell
> script setuid bits are suppressed.
I don't know what you mean here by naive privilege inheritance. The
examples you're taking about aren't inheritance at all; they're
exploring privilege *grants* during execve. My patch deliberately
leaves grants like that alone.
--Andy
>
> Cheers
>
> Andrew
^ permalink raw reply
* Re: [RFC] capabilities: Ambient capabilities
From: Andy Lutomirski @ 2015-03-13 17:58 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew G. Morgan, Andrew Lutomirski, Kees Cook, Serge Hallyn,
Jonathan Corbet, Aaron Jones, Ted Ts'o, linux-security-module,
LKML, Linux API, Andrew Morton, Mimi Zohar, Austin S Hemmelgarn,
Markku Savela, Jarkko Sakkinen, Michael Kerrisk
In-Reply-To: <alpine.DEB.2.11.1503131049150.13899-gkYfJU5Cukgdnm+yROfE0A@public.gmane.org>
On Fri, Mar 13, 2015 at 9:06 AM, Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, 13 Mar 2015, Andrew G. Morgan wrote:
>
>> > prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_SYS_ADMIN);
>> > system("/bin/bash");
>>
>> Let's call the above two lines [a] and [b]. With this patch, you are
>> encouraging folk to write programs that contain a line like [a]
>> already. So, yes, I am saying that you are creating an exploitable
>> path in these programs that says if someone can inject
>> system("/bin/bash") into these programs they can get a new (because of
>> this patch) privilege escalation.
>
> Well this is what one naively expects capabilities to give you. An ability
> to avoid full superuser binaries by segmenting off capabilities. Often
> there is really no other choice. If you do not provide this mode then
> the system will be even less secure since people run stuff as root.
>
> This looks to many like the design of capabilties is inherent flawed since
> it does not give you what you need. You experience a go around that leads
> nowhere and just wastes your time.
>
>> In the prevailing model, this kind of privilege escalation (resulting
>> from naive inheritance) is designed out. I recognize that you want to
>> add it back in, but I am concerned that you are not allowing for the
>> possibility that some folk might still want to still be able to
>> prevent it.
>
> The functionalty here depends on CAP_SETPCAP. That was intended as some
> point to be off by default? You can have distros/kernels with that being
> off.
Not in my version. I don't want to further encourage people to hand
out CAP_SETPCAP.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply
* Re: [PATCH v4 2/9] kbuild: Don't pass LDFLAGS to selftest Makefile
From: Shuah Khan @ 2015-03-13 18:06 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL, mmarek-AlSwsSmVLrQ,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1426046765-19289-2-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
On 03/10/2015 10:05 PM, Michael Ellerman wrote:
> The makefile in arch/x86/Makefile.um sets LDFLAGS and exports it, which
> is then propagated to the selftest Makefiles and leads to build errors
> there. The build errors occur because we are passing LDFLAGS to CC, but
> the option set in Makefile.um (-m elf_x86_64) is not understood by CC.
> We could fix that by using -Wl, but that might break the UM build if
> it's actually passing that option to LD directly.
>
> We don't actually want the LDFLAGS from kbuild in the selftest Makefile,
> so the simplest fix seems to be to clear LDFLAGS before invoking the
> selftest Makefile.
>
> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
> ---
> Makefile | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Makefile b/Makefile
> index 0836e9d628f0..5cef1d4c2ea0 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1085,7 +1085,7 @@ headers_check: headers_install
>
> PHONY += kselftest
> kselftest:
> - $(Q)$(MAKE) -C tools/testing/selftests MAKEFLAGS="$(filter-out rR,$(MAKEFLAGS))" run_tests
> + $(Q)$(MAKE) LDFLAGS= -C tools/testing/selftests MAKEFLAGS="$(filter-out rR,$(MAKEFLAGS))" run_tests
>
> # ---------------------------------------------------------------------------
> # Modules
>
Thanks for finding the problem and fix. I want to handle this in
the selftests/Makefile level instead. I have a patch ready to send
out for review.
thanks,
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978
^ permalink raw reply
* Re: [RFC] capabilities: Ambient capabilities
From: Christoph Lameter @ 2015-03-13 18:52 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andrew G. Morgan, Andrew Lutomirski, Kees Cook, Serge Hallyn,
Jonathan Corbet, Aaron Jones, Ted Ts'o, linux-security-module,
LKML, Linux API, Andrew Morton, Mimi Zohar, Austin S Hemmelgarn,
Markku Savela, Jarkko Sakkinen, Michael Kerrisk
In-Reply-To: <CALCETrXcUfbqfm7av9crrxQ5nCBYpdJO8gRo4ZhRA97g27B2iw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Fri, 13 Mar 2015, Andy Lutomirski wrote:
> > The functionalty here depends on CAP_SETPCAP. That was intended as some
> > point to be off by default? You can have distros/kernels with that being
> > off.
>
> Not in my version. I don't want to further encourage people to hand
> out CAP_SETPCAP.
Owww... But if you leave it in then maybe Andrew will be satisfied with
this approach?
^ permalink raw reply
* Re: [RFC] capabilities: Ambient capabilities
From: Kees Cook @ 2015-03-13 18:52 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andrew G. Morgan, Jarkko Sakkinen, Ted Ts'o,
Andrew Lutomirski, Andrew Morton, Michael Kerrisk, Mimi Zohar,
Linux API, Austin S Hemmelgarn, linux-security-module,
Aaron Jones, Christoph Lameter, LKML, Serge Hallyn, Markku Savela,
Jonathan Corbet
In-Reply-To: <CALCETrVz39pJmMAEFodBNA8cOZjd0xkH95EzMazY0MWJDtpYgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Fri, Mar 13, 2015 at 10:57 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mar 13, 2015 6:24 AM, "Andrew G. Morgan" <morgan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> I think it is safe to say that naive privilege inheritance has a fair
>> track record of being exploited orders of magnitude more frequently
>> than this. After all, these are the reasons LD_PRELOAD and shell
>> script setuid bits are suppressed.
>
> I don't know what you mean here by naive privilege inheritance. The
> examples you're taking about aren't inheritance at all; they're
> exploring privilege *grants* during execve. My patch deliberately
> leaves grants like that alone.
Just to clarify (at least for myself), I think the issue here is the
perception of privileges leaking into newly execed processes that may
have flaws that allow arbitrary execution control.
Case A: daemon is running (with whatever set of privileges), has a
flaw and an attacker now has those same privileges.
Case B: running as root, runs some child (which stays root) with a
flaw and an attacker now has root privileges.
Case C: running with pE=CAP_NET_ADMIN, runs some child (which loses
the cap) with a flaw and an attacker now has only user privileges.
Case D: running with pE=pI=CAP_NET_ADMIN, runs some child that has
fI=CAP_NET_ADMIN, but other children don't and lose the cap as in case
C. The privileged attack surface is reduced to only the child with fI
set.
Case E: running with CAP_NET_ADMIN in pA, runs some child (which gets
pE=pA=CAP_NET_ADMIN), and has an attack surface larger (all children)
than Case D, but with a scope smaller (only CAP_NET_ADMIN) than Case
B.
We don't need to talk about Case A, since we're not crossing an exec
boundary and the attacker already has execution control. We all lose.
Case B is the poster-child for "don't run daemons as root", since
privileges aren't dropped, and we're basically back to Case A again.
Case C is the classic "just give the daemon what caps it needs" case,
and if the daemon itself isn't flawed (Case A), then the privileges
don't leak out to any children (usually helpers of some kind) since
the cap doesn't cross the exec boundary. This lack of leaking is
really frustrating for daemons that need to give caps to helpers (e.g.
containers).
Case D is the standard solution to the frustrations in Case C, but
requires filesystem capabilities (and/or SELinux) to pass
capabilities, and comes with various limitations as described in
Andy's first email.
Case E would be possible with Andy's patch. It lacks the limitations
and frustrations of C and D, but opens us up to a limited version of
the risks in Case B. So, I don't think it's as "bad" as Case B since a
process must opt into it (by setting pA), and it passes only the
limited set of capabilities.
I think this boils down to accepting the elevated risk while
recognizing that it may be less than Case B but greater than Case C.
All this said, almost half of the capabilities, if passed to flawed
children with attacker controlled execution, can be elevated to full
root privileges pretty easily[1], so I think any documentation around
this feature should include some pretty dire warnings about using
this.
-Kees
[1] https://forums.grsecurity.net/viewtopic.php?f=7&t=2522
--
Kees Cook
Chrome OS Security
^ permalink raw reply
* [PATCH v2 net-next 0/2] bpf: allow eBPF access skb fields
From: Alexei Starovoitov @ 2015-03-13 18:57 UTC (permalink / raw)
To: David S. Miller
Cc: Daniel Borkmann, Thomas Graf, linux-api-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
Hi All,
V1->V2:
- refactored field access converter into common helper convert_skb_access()
used in both classic and extended BPF
- added missing build_bug_on for field 'len'
- added comment to uapi/linux/bpf.h as suggested by Daniel
- dropped exposing 'ifindex' field for now
classic BPF has a way to access skb fields, whereas extended BPF didn't.
This patch introduces this ability.
Classic BPF can access fields via negative SKF_AD_OFF offset.
Positive bpf_ld_abs N is treated as load from packet, whereas
bpf_ld_abs -0x1000 + N is treated as skb fields access.
Many offsets were hard coded over years: SKF_AD_PROTOCOL, SKF_AD_PKTTYPE, etc.
The problem with this approach was that for every new field classic bpf
assembler had to be tweaked.
I've considered doing the same for extended, but for every new field LLVM
compiler would have to be modifed. Since it would need to add a new intrinsic.
It could be done with single intrinsic and magic offset or use of inline
assembler, but neither are clean from compiler backend point of view, since
they look like calls but shouldn't scratch caller-saved registers.
Another approach was to introduce a new helper functions like bpf_get_pkt_type()
for every field that we want to access, but that is equally ugly for kernel
and slow, since helpers are calls and they are slower then just loads.
In theory helper calls can be 'inlined' inside kernel into direct loads, but
since they were calls for user space, compiler would have to spill registers
around such calls anyway. Teaching compiler to treat such helpers differently
is even uglier.
They were few other ideas considered. At the end the best seems to be to
introduce a user accessible mirror of in-kernel sk_buff structure:
struct __sk_buff {
__u32 len;
__u32 pkt_type;
__u32 mark;
__u32 queue_mapping;
};
bpf programs will do:
int bpf_prog1(struct __sk_buff *skb)
{
__u32 var = skb->pkt_type;
which will be compiled to bpf assembler as:
dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)
bpf verifier will check validity of access and will convert it to:
dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
dst_reg &= 7
since 'pkt_type' is a bitfield.
No new instructions added. LLVM doesn't need to be modified.
JITs don't change and verifier already knows when it accesses 'ctx' pointer.
The only thing needed was to convert user visible offset within __sk_buff
to kernel internal offset within sk_buff.
For 'len' and other fields conversion is trivial.
Converting 'pkt_type' takes 2 or 3 instructions depending on endianness.
More fields can be exposed by adding to the end of the 'struct __sk_buff'.
Like vlan_tci and others can be added later.
When pkt_type field is moved around, goes into different structure, removed or
its size changes, the function convert_skb_access() would need to updated and
it will cover both classic and extended.
Patch 2 updates examples to demonstrates how fields are accessed and
adds new tests for verifier, since it needs to detect a corner case when
attacker is using single bpf instruction in two branches with different
register types.
The 4 fields of __sk_buff are already exposed to user space via classic bpf and
I believe they're useful in extended as well.
Alexei Starovoitov (2):
bpf: allow extended BPF programs access skb fields
samples: bpf: add skb->field examples and tests
include/linux/bpf.h | 5 +-
include/uapi/linux/bpf.h | 10 +++
kernel/bpf/syscall.c | 2 +-
kernel/bpf/verifier.c | 152 ++++++++++++++++++++++++++++++++++++++-----
net/core/filter.c | 100 +++++++++++++++++++++++-----
samples/bpf/sockex1_kern.c | 8 ++-
samples/bpf/sockex1_user.c | 2 +-
samples/bpf/sockex2_kern.c | 26 +++++---
samples/bpf/sockex2_user.c | 11 +++-
samples/bpf/test_verifier.c | 70 ++++++++++++++++++++
10 files changed, 335 insertions(+), 51 deletions(-)
--
1.7.9.5
^ permalink raw reply
* [PATCH v2 net-next 1/2] bpf: allow extended BPF programs access skb fields
From: Alexei Starovoitov @ 2015-03-13 18:57 UTC (permalink / raw)
To: David S. Miller
Cc: Daniel Borkmann, Thomas Graf, linux-api, netdev, linux-kernel
In-Reply-To: <1426273064-4837-1-git-send-email-ast@plumgrid.com>
introduce user accessible mirror of in-kernel 'struct sk_buff':
struct __sk_buff {
__u32 len;
__u32 pkt_type;
__u32 mark;
__u32 queue_mapping;
};
bpf programs can do:
int bpf_prog(struct __sk_buff *skb)
{
__u32 var = skb->pkt_type;
which will be compiled to bpf assembler as:
dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)
bpf verifier will check validity of access and will convert it to:
dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
dst_reg &= 7
since skb->pkt_type is a bitfield.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
include/linux/bpf.h | 5 +-
include/uapi/linux/bpf.h | 10 +++
kernel/bpf/syscall.c | 2 +-
kernel/bpf/verifier.c | 152 +++++++++++++++++++++++++++++++++++++++++-----
net/core/filter.c | 100 ++++++++++++++++++++++++------
5 files changed, 234 insertions(+), 35 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 80f2e0fc3d02..2c17ebdfb5ae 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -103,6 +103,9 @@ struct bpf_verifier_ops {
* with 'type' (read or write) is allowed
*/
bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
+
+ u32 (*convert_ctx_access)(int dst_reg, int src_reg, int ctx_off,
+ struct bpf_insn *insn);
};
struct bpf_prog_type_list {
@@ -133,7 +136,7 @@ struct bpf_map *bpf_map_get(struct fd f);
void bpf_map_put(struct bpf_map *map);
/* verify correctness of eBPF program */
-int bpf_check(struct bpf_prog *fp, union bpf_attr *attr);
+int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
#else
static inline void bpf_register_prog_type(struct bpf_prog_type_list *tl)
{
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3fa1af8a58d7..936e9257da95 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -168,4 +168,14 @@ enum bpf_func_id {
__BPF_FUNC_MAX_ID,
};
+/* user accessible mirror of in-kernel sk_buff.
+ * new fields can only be added to the end of this structure
+ */
+struct __sk_buff {
+ __u32 len;
+ __u32 pkt_type;
+ __u32 mark;
+ __u32 queue_mapping;
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 669719ccc9ee..ea75c654af1b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -519,7 +519,7 @@ static int bpf_prog_load(union bpf_attr *attr)
goto free_prog;
/* run eBPF verifier */
- err = bpf_check(prog, attr);
+ err = bpf_check(&prog, attr);
if (err < 0)
goto free_used_maps;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e6b522496250..c22ebd36fa4b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1620,11 +1620,10 @@ static int do_check(struct verifier_env *env)
return err;
} else if (class == BPF_LDX) {
- if (BPF_MODE(insn->code) != BPF_MEM ||
- insn->imm != 0) {
- verbose("BPF_LDX uses reserved fields\n");
- return -EINVAL;
- }
+ enum bpf_reg_type src_reg_type;
+
+ /* check for reserved fields is already done */
+
/* check src operand */
err = check_reg_arg(regs, insn->src_reg, SRC_OP);
if (err)
@@ -1643,6 +1642,29 @@ static int do_check(struct verifier_env *env)
if (err)
return err;
+ src_reg_type = regs[insn->src_reg].type;
+
+ if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
+ /* saw a valid insn
+ * dst_reg = *(u32 *)(src_reg + off)
+ * use reserved 'imm' field to mark this insn
+ */
+ insn->imm = src_reg_type;
+
+ } else if (src_reg_type != insn->imm &&
+ (src_reg_type == PTR_TO_CTX ||
+ insn->imm == PTR_TO_CTX)) {
+ /* ABuser program is trying to use the same insn
+ * dst_reg = *(u32*) (src_reg + off)
+ * with different pointer types:
+ * src_reg == ctx in one branch and
+ * src_reg == stack|map in some other branch.
+ * Reject it.
+ */
+ verbose("same insn cannot be used with different pointers\n");
+ return -EINVAL;
+ }
+
} else if (class == BPF_STX) {
if (BPF_MODE(insn->code) == BPF_XADD) {
err = check_xadd(env, insn);
@@ -1790,6 +1812,13 @@ static int replace_map_fd_with_map_ptr(struct verifier_env *env)
int i, j;
for (i = 0; i < insn_cnt; i++, insn++) {
+ if (BPF_CLASS(insn->code) == BPF_LDX &&
+ (BPF_MODE(insn->code) != BPF_MEM ||
+ insn->imm != 0)) {
+ verbose("BPF_LDX uses reserved fields\n");
+ return -EINVAL;
+ }
+
if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
struct bpf_map *map;
struct fd f;
@@ -1881,6 +1910,92 @@ static void convert_pseudo_ld_imm64(struct verifier_env *env)
insn->src_reg = 0;
}
+static void adjust_branches(struct bpf_prog *prog, int pos, int delta)
+{
+ struct bpf_insn *insn = prog->insnsi;
+ int insn_cnt = prog->len;
+ int i;
+
+ for (i = 0; i < insn_cnt; i++, insn++) {
+ if (BPF_CLASS(insn->code) != BPF_JMP ||
+ BPF_OP(insn->code) == BPF_CALL ||
+ BPF_OP(insn->code) == BPF_EXIT)
+ continue;
+
+ /* adjust offset of jmps if necessary */
+ if (i < pos && i + insn->off + 1 > pos)
+ insn->off += delta;
+ else if (i > pos && i + insn->off + 1 < pos)
+ insn->off -= delta;
+ }
+}
+
+/* convert load instructions that access fields of 'struct __sk_buff'
+ * into sequence of instructions that access fields of 'struct sk_buff'
+ */
+static int convert_ctx_accesses(struct verifier_env *env)
+{
+ struct bpf_insn *insn = env->prog->insnsi;
+ int insn_cnt = env->prog->len;
+ struct bpf_insn insn_buf[16];
+ struct bpf_prog *new_prog;
+ u32 cnt;
+ int i;
+
+ if (!env->prog->aux->ops->convert_ctx_access)
+ return 0;
+
+ for (i = 0; i < insn_cnt; i++, insn++) {
+ if (insn->code != (BPF_LDX | BPF_MEM | BPF_W))
+ continue;
+
+ if (insn->imm != PTR_TO_CTX) {
+ /* clear internal mark */
+ insn->imm = 0;
+ continue;
+ }
+
+ cnt = env->prog->aux->ops->
+ convert_ctx_access(insn->dst_reg, insn->src_reg,
+ insn->off, insn_buf);
+ if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) {
+ verbose("bpf verifier is misconfigured\n");
+ return -EINVAL;
+ }
+
+ if (cnt == 1) {
+ memcpy(insn, insn_buf, sizeof(*insn));
+ continue;
+ }
+
+ /* several new insns need to be inserted. Make room for them */
+ insn_cnt += cnt - 1;
+ new_prog = bpf_prog_realloc(env->prog,
+ bpf_prog_size(insn_cnt),
+ GFP_USER);
+ if (!new_prog)
+ return -ENOMEM;
+
+ new_prog->len = insn_cnt;
+
+ memmove(new_prog->insnsi + i + cnt, new_prog->insns + i + 1,
+ sizeof(*insn) * (insn_cnt - i - cnt));
+
+ /* copy substitute insns in place of load instruction */
+ memcpy(new_prog->insnsi + i, insn_buf, sizeof(*insn) * cnt);
+
+ /* adjust branches in the whole program */
+ adjust_branches(new_prog, i, cnt - 1);
+
+ /* keep walking new program and skip insns we just inserted */
+ env->prog = new_prog;
+ insn = new_prog->insnsi + i + cnt - 1;
+ i += cnt - 1;
+ }
+
+ return 0;
+}
+
static void free_states(struct verifier_env *env)
{
struct verifier_state_list *sl, *sln;
@@ -1903,13 +2018,13 @@ static void free_states(struct verifier_env *env)
kfree(env->explored_states);
}
-int bpf_check(struct bpf_prog *prog, union bpf_attr *attr)
+int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
{
char __user *log_ubuf = NULL;
struct verifier_env *env;
int ret = -EINVAL;
- if (prog->len <= 0 || prog->len > BPF_MAXINSNS)
+ if ((*prog)->len <= 0 || (*prog)->len > BPF_MAXINSNS)
return -E2BIG;
/* 'struct verifier_env' can be global, but since it's not small,
@@ -1919,7 +2034,7 @@ int bpf_check(struct bpf_prog *prog, union bpf_attr *attr)
if (!env)
return -ENOMEM;
- env->prog = prog;
+ env->prog = *prog;
/* grab the mutex to protect few globals used by verifier */
mutex_lock(&bpf_verifier_lock);
@@ -1951,7 +2066,7 @@ int bpf_check(struct bpf_prog *prog, union bpf_attr *attr)
if (ret < 0)
goto skip_full_check;
- env->explored_states = kcalloc(prog->len,
+ env->explored_states = kcalloc(env->prog->len,
sizeof(struct verifier_state_list *),
GFP_USER);
ret = -ENOMEM;
@@ -1968,6 +2083,10 @@ skip_full_check:
while (pop_stack(env, NULL) >= 0);
free_states(env);
+ if (ret == 0)
+ /* program is valid, convert *(u32*)(ctx + off) accesses */
+ ret = convert_ctx_accesses(env);
+
if (log_level && log_len >= log_size - 1) {
BUG_ON(log_len >= log_size);
/* verifier log exceeded user supplied buffer */
@@ -1983,18 +2102,18 @@ skip_full_check:
if (ret == 0 && env->used_map_cnt) {
/* if program passed verifier, update used_maps in bpf_prog_info */
- prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
- sizeof(env->used_maps[0]),
- GFP_KERNEL);
+ env->prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
+ sizeof(env->used_maps[0]),
+ GFP_KERNEL);
- if (!prog->aux->used_maps) {
+ if (!env->prog->aux->used_maps) {
ret = -ENOMEM;
goto free_log_buf;
}
- memcpy(prog->aux->used_maps, env->used_maps,
+ memcpy(env->prog->aux->used_maps, env->used_maps,
sizeof(env->used_maps[0]) * env->used_map_cnt);
- prog->aux->used_map_cnt = env->used_map_cnt;
+ env->prog->aux->used_map_cnt = env->used_map_cnt;
/* program is valid. Convert pseudo bpf_ld_imm64 into generic
* bpf_ld_imm64 instructions
@@ -2006,11 +2125,12 @@ free_log_buf:
if (log_level)
vfree(log_buf);
free_env:
- if (!prog->aux->used_maps)
+ if (!env->prog->aux->used_maps)
/* if we didn't copy map pointers into bpf_prog_info, release
* them now. Otherwise free_bpf_prog_info() will release them.
*/
release_maps(env);
+ *prog = env->prog;
kfree(env);
mutex_unlock(&bpf_verifier_lock);
return ret;
diff --git a/net/core/filter.c b/net/core/filter.c
index 7a4eb7030dba..23b7c087cece 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -150,10 +150,43 @@ static u64 __get_random_u32(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
return prandom_u32();
}
+static u32 convert_skb_access(int skb_field, int dst_reg, int src_reg,
+ struct bpf_insn *insn_buf)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ switch (skb_field) {
+ case SKF_AD_MARK:
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
+
+ *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+ offsetof(struct sk_buff, mark));
+ break;
+
+ case SKF_AD_PKTTYPE:
+ *insn++ = BPF_LDX_MEM(BPF_B, dst_reg, src_reg, PKT_TYPE_OFFSET());
+ *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, PKT_TYPE_MAX);
+#ifdef __BIG_ENDIAN_BITFIELD
+ *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, 5);
+#endif
+ break;
+
+ case SKF_AD_QUEUE:
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, queue_mapping) != 2);
+
+ *insn++ = BPF_LDX_MEM(BPF_H, dst_reg, src_reg,
+ offsetof(struct sk_buff, queue_mapping));
+ break;
+ }
+
+ return insn - insn_buf;
+}
+
static bool convert_bpf_extensions(struct sock_filter *fp,
struct bpf_insn **insnp)
{
struct bpf_insn *insn = *insnp;
+ u32 cnt;
switch (fp->k) {
case SKF_AD_OFF + SKF_AD_PROTOCOL:
@@ -167,13 +200,8 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
break;
case SKF_AD_OFF + SKF_AD_PKTTYPE:
- *insn++ = BPF_LDX_MEM(BPF_B, BPF_REG_A, BPF_REG_CTX,
- PKT_TYPE_OFFSET());
- *insn = BPF_ALU32_IMM(BPF_AND, BPF_REG_A, PKT_TYPE_MAX);
-#ifdef __BIG_ENDIAN_BITFIELD
- insn++;
- *insn = BPF_ALU32_IMM(BPF_RSH, BPF_REG_A, 5);
-#endif
+ cnt = convert_skb_access(SKF_AD_PKTTYPE, BPF_REG_A, BPF_REG_CTX, insn);
+ insn += cnt - 1;
break;
case SKF_AD_OFF + SKF_AD_IFINDEX:
@@ -197,10 +225,8 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
break;
case SKF_AD_OFF + SKF_AD_MARK:
- BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
-
- *insn = BPF_LDX_MEM(BPF_W, BPF_REG_A, BPF_REG_CTX,
- offsetof(struct sk_buff, mark));
+ cnt = convert_skb_access(SKF_AD_MARK, BPF_REG_A, BPF_REG_CTX, insn);
+ insn += cnt - 1;
break;
case SKF_AD_OFF + SKF_AD_RXHASH:
@@ -211,10 +237,8 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
break;
case SKF_AD_OFF + SKF_AD_QUEUE:
- BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, queue_mapping) != 2);
-
- *insn = BPF_LDX_MEM(BPF_H, BPF_REG_A, BPF_REG_CTX,
- offsetof(struct sk_buff, queue_mapping));
+ cnt = convert_skb_access(SKF_AD_QUEUE, BPF_REG_A, BPF_REG_CTX, insn);
+ insn += cnt - 1;
break;
case SKF_AD_OFF + SKF_AD_VLAN_TAG:
@@ -1147,13 +1171,55 @@ sk_filter_func_proto(enum bpf_func_id func_id)
static bool sk_filter_is_valid_access(int off, int size,
enum bpf_access_type type)
{
- /* skb fields cannot be accessed yet */
- return false;
+ /* only read is allowed */
+ if (type != BPF_READ)
+ return false;
+
+ /* check bounds */
+ if (off < 0 || off >= sizeof(struct __sk_buff))
+ return false;
+
+ /* disallow misaligned access */
+ if (off % size != 0)
+ return false;
+
+ /* all __sk_buff fields are __u32 */
+ if (size != 4)
+ return false;
+
+ return true;
+}
+
+static u32 sk_filter_convert_ctx_access(int dst_reg, int src_reg, int ctx_off,
+ struct bpf_insn *insn_buf)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ switch (ctx_off) {
+ case offsetof(struct __sk_buff, len):
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
+
+ *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+ offsetof(struct sk_buff, len));
+ break;
+
+ case offsetof(struct __sk_buff, mark):
+ return convert_skb_access(SKF_AD_MARK, dst_reg, src_reg, insn);
+
+ case offsetof(struct __sk_buff, pkt_type):
+ return convert_skb_access(SKF_AD_PKTTYPE, dst_reg, src_reg, insn);
+
+ case offsetof(struct __sk_buff, queue_mapping):
+ return convert_skb_access(SKF_AD_QUEUE, dst_reg, src_reg, insn);
+ }
+
+ return insn - insn_buf;
}
static const struct bpf_verifier_ops sk_filter_ops = {
.get_func_proto = sk_filter_func_proto,
.is_valid_access = sk_filter_is_valid_access,
+ .convert_ctx_access = sk_filter_convert_ctx_access,
};
static struct bpf_prog_type_list sk_filter_type __read_mostly = {
--
1.7.9.5
^ permalink raw reply related
* [PATCH v2 net-next 2/2] samples: bpf: add skb->field examples and tests
From: Alexei Starovoitov @ 2015-03-13 18:57 UTC (permalink / raw)
To: David S. Miller
Cc: Daniel Borkmann, Thomas Graf, linux-api, netdev, linux-kernel
In-Reply-To: <1426273064-4837-1-git-send-email-ast@plumgrid.com>
- modify sockex1 example to count number of bytes in outgoing packets
- modify sockex2 example to count number of bytes and packets per flow
- add 4 stress tests that exercise 'skb->field' code path of verifier
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
samples/bpf/sockex1_kern.c | 8 +++--
samples/bpf/sockex1_user.c | 2 +-
samples/bpf/sockex2_kern.c | 26 +++++++++-------
samples/bpf/sockex2_user.c | 11 +++++--
samples/bpf/test_verifier.c | 70 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 101 insertions(+), 16 deletions(-)
diff --git a/samples/bpf/sockex1_kern.c b/samples/bpf/sockex1_kern.c
index 066892662915..ed18e9a4909c 100644
--- a/samples/bpf/sockex1_kern.c
+++ b/samples/bpf/sockex1_kern.c
@@ -1,5 +1,6 @@
#include <uapi/linux/bpf.h>
#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
#include <uapi/linux/ip.h>
#include "bpf_helpers.h"
@@ -11,14 +12,17 @@ struct bpf_map_def SEC("maps") my_map = {
};
SEC("socket1")
-int bpf_prog1(struct sk_buff *skb)
+int bpf_prog1(struct __sk_buff *skb)
{
int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
long *value;
+ if (skb->pkt_type != PACKET_OUTGOING)
+ return 0;
+
value = bpf_map_lookup_elem(&my_map, &index);
if (value)
- __sync_fetch_and_add(value, 1);
+ __sync_fetch_and_add(value, skb->len);
return 0;
}
diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c
index 34a443ff3831..678ce4693551 100644
--- a/samples/bpf/sockex1_user.c
+++ b/samples/bpf/sockex1_user.c
@@ -40,7 +40,7 @@ int main(int ac, char **argv)
key = IPPROTO_ICMP;
assert(bpf_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);
- printf("TCP %lld UDP %lld ICMP %lld packets\n",
+ printf("TCP %lld UDP %lld ICMP %lld bytes\n",
tcp_cnt, udp_cnt, icmp_cnt);
sleep(1);
}
diff --git a/samples/bpf/sockex2_kern.c b/samples/bpf/sockex2_kern.c
index 6f0135f0f217..ba0e177ff561 100644
--- a/samples/bpf/sockex2_kern.c
+++ b/samples/bpf/sockex2_kern.c
@@ -42,13 +42,13 @@ static inline int proto_ports_offset(__u64 proto)
}
}
-static inline int ip_is_fragment(struct sk_buff *ctx, __u64 nhoff)
+static inline int ip_is_fragment(struct __sk_buff *ctx, __u64 nhoff)
{
return load_half(ctx, nhoff + offsetof(struct iphdr, frag_off))
& (IP_MF | IP_OFFSET);
}
-static inline __u32 ipv6_addr_hash(struct sk_buff *ctx, __u64 off)
+static inline __u32 ipv6_addr_hash(struct __sk_buff *ctx, __u64 off)
{
__u64 w0 = load_word(ctx, off);
__u64 w1 = load_word(ctx, off + 4);
@@ -58,7 +58,7 @@ static inline __u32 ipv6_addr_hash(struct sk_buff *ctx, __u64 off)
return (__u32)(w0 ^ w1 ^ w2 ^ w3);
}
-static inline __u64 parse_ip(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
+static inline __u64 parse_ip(struct __sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
struct flow_keys *flow)
{
__u64 verlen;
@@ -82,7 +82,7 @@ static inline __u64 parse_ip(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
return nhoff;
}
-static inline __u64 parse_ipv6(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
+static inline __u64 parse_ipv6(struct __sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
struct flow_keys *flow)
{
*ip_proto = load_byte(skb,
@@ -96,7 +96,7 @@ static inline __u64 parse_ipv6(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto
return nhoff;
}
-static inline bool flow_dissector(struct sk_buff *skb, struct flow_keys *flow)
+static inline bool flow_dissector(struct __sk_buff *skb, struct flow_keys *flow)
{
__u64 nhoff = ETH_HLEN;
__u64 ip_proto;
@@ -183,18 +183,23 @@ static inline bool flow_dissector(struct sk_buff *skb, struct flow_keys *flow)
return true;
}
+struct pair {
+ long packets;
+ long bytes;
+};
+
struct bpf_map_def SEC("maps") hash_map = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(__be32),
- .value_size = sizeof(long),
+ .value_size = sizeof(struct pair),
.max_entries = 1024,
};
SEC("socket2")
-int bpf_prog2(struct sk_buff *skb)
+int bpf_prog2(struct __sk_buff *skb)
{
struct flow_keys flow;
- long *value;
+ struct pair *value;
u32 key;
if (!flow_dissector(skb, &flow))
@@ -203,9 +208,10 @@ int bpf_prog2(struct sk_buff *skb)
key = flow.dst;
value = bpf_map_lookup_elem(&hash_map, &key);
if (value) {
- __sync_fetch_and_add(value, 1);
+ __sync_fetch_and_add(&value->packets, 1);
+ __sync_fetch_and_add(&value->bytes, skb->len);
} else {
- long val = 1;
+ struct pair val = {1, skb->len};
bpf_map_update_elem(&hash_map, &key, &val, BPF_ANY);
}
diff --git a/samples/bpf/sockex2_user.c b/samples/bpf/sockex2_user.c
index d2d5f5a790d3..29a276d766fc 100644
--- a/samples/bpf/sockex2_user.c
+++ b/samples/bpf/sockex2_user.c
@@ -6,6 +6,11 @@
#include <unistd.h>
#include <arpa/inet.h>
+struct pair {
+ __u64 packets;
+ __u64 bytes;
+};
+
int main(int ac, char **argv)
{
char filename[256];
@@ -29,13 +34,13 @@ int main(int ac, char **argv)
for (i = 0; i < 5; i++) {
int key = 0, next_key;
- long long value;
+ struct pair value;
while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
bpf_lookup_elem(map_fd[0], &next_key, &value);
- printf("ip %s count %lld\n",
+ printf("ip %s bytes %lld packets %lld\n",
inet_ntoa((struct in_addr){htonl(next_key)}),
- value);
+ value.bytes, value.packets);
key = next_key;
}
sleep(1);
diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 7b56b59fad8e..df6dbb6576f6 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -14,6 +14,7 @@
#include <linux/unistd.h>
#include <string.h>
#include <linux/filter.h>
+#include <stddef.h>
#include "libbpf.h"
#define MAX_INSNS 512
@@ -642,6 +643,75 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
},
+ {
+ "access skb fields ok",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, len)),
+ BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, mark)),
+ BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, pkt_type)),
+ BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 1),
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, queue_mapping)),
+ BPF_JMP_IMM(BPF_JGE, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+ },
+ .result = ACCEPT,
+ },
+ {
+ "access skb fields bad1",
+ .insns = {
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, -4),
+ BPF_EXIT_INSN(),
+ },
+ .errstr = "invalid bpf_context access",
+ .result = REJECT,
+ },
+ {
+ "access skb fields bad2",
+ .insns = {
+ BPF_JMP_IMM(BPF_JGE, BPF_REG_1, 0, 9),
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+ BPF_EXIT_INSN(),
+ BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, pkt_type)),
+ BPF_EXIT_INSN(),
+ },
+ .fixup = {4},
+ .errstr = "different pointers",
+ .result = REJECT,
+ },
+ {
+ "access skb fields bad3",
+ .insns = {
+ BPF_JMP_IMM(BPF_JGE, BPF_REG_1, 0, 2),
+ BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+ offsetof(struct __sk_buff, pkt_type)),
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+ BPF_EXIT_INSN(),
+ BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+ BPF_JMP_IMM(BPF_JA, 0, 0, -12),
+ },
+ .fixup = {6},
+ .errstr = "different pointers",
+ .result = REJECT,
+ },
};
static int probe_filter_length(struct bpf_insn *fp)
--
1.7.9.5
^ permalink raw reply related
* Re: [RFC] capabilities: Ambient capabilities
From: Andy Lutomirski @ 2015-03-13 19:03 UTC (permalink / raw)
To: Kees Cook
Cc: Andrew G. Morgan, Jarkko Sakkinen, Ted Ts'o,
Andrew Lutomirski, Andrew Morton, Michael Kerrisk, Mimi Zohar,
Linux API, Austin S Hemmelgarn, linux-security-module,
Aaron Jones, Christoph Lameter, LKML, Serge Hallyn, Markku Savela,
Jonathan Corbet
In-Reply-To: <CAGXu5jLiLsXN2-RAbPJh7ZursuTroi5DAs=VTz2kyAZ2urNvgA@mail.gmail.com>
On Fri, Mar 13, 2015 at 11:52 AM, Kees Cook <keescook@chromium.org> wrote:
>
> All this said, almost half of the capabilities, if passed to flawed
> children with attacker controlled execution, can be elevated to full
> root privileges pretty easily[1], so I think any documentation around
> this feature should include some pretty dire warnings about using
> this.
That's a good point. I'll make sure to document that.
It's worth noting that, for many applications, that list is
overstated. For example, many of the suggested privilege escalations
don't work if you're in a sufficiently restrictive mount namespace.
For my own use, I plan on adding only CAP_NET_BIND_SERVICE and
CAP_NET_RAW to pA, and I'll be layering seccomp on top to the extent
possible.
--Andy
>
> -Kees
>
> [1] https://forums.grsecurity.net/viewtopic.php?f=7&t=2522
>
> --
> Kees Cook
> Chrome OS Security
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply
* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Josh Triplett @ 2015-03-13 19:42 UTC (permalink / raw)
To: David Drysdale
Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
Thomas Gleixner, Thiago Macieira, Michael Kerrisk,
linux-kernel@vger.kernel.org, Linux API, linux-fsdevel, X86 ML
In-Reply-To: <CAHse=S9hnGm=Z8FR4f+z_TFwnrjBLkcV4wVwjDYgKuOSemdVrA@mail.gmail.com>
On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@joshtriplett.org> wrote:
> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> > handle child process exit notification via a file descriptor rather than
> > SIGCHLD. CLONE_FD makes it possible for libraries to safely launch and manage
> > child processes on behalf of their caller, *without* taking over process-wide
> > SIGCHLD handling (either via signal handler or signalfd).
>
> Hi Josh,
>
> From the overall description (i.e. I haven't looked at the code yet)
> this looks very interesting. However, it seems to cover a lot of the
> same ground as the process descriptor feature that was added to FreeBSD
> in 9.x/10.x:
> https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
Interesting.
> I think it would ideally be nice for a userspace library developer to be
> able to do subprocess management (without SIGCHLD) in a similar way
> across both platforms, without lots of complicated autoconf shenanigans.
>
> So could we look at the overlap and seeing if we can come up with
> something that covers your requirements and also allows for something
> that looks like FreeBSD's process descriptors?
Agreed; however, I think it's reasonable to provide appropriate Linux
system calls, and then let glibc or libbsd or similar provide the
BSD-compatible calls on top of those. I don't think the kernel
interface needs to exactly match FreeBSD's, as long as it's a superset
of the functionality.
For example, pdfork can just call clone4 with CLONE_FD and return the
resulting file descriptor.
In my further comments below, I'll suggest ways that the FreeBSD library
calls could be implemented on top of Linux system calls.
> (I've actually got some rough patches to add process descriptor
> functionality on Linux, so I can look at how the two approaches compare
> and contrast.)
>
> > Note that signalfd for SIGCHLD does not suffice here, because that still
> > receives notification for all child processes, and interferes with process-wide
> > signal handling.
> >
> > The CLONE_FD file descriptor uniquely identifies a process on the system in a
> > race-free way, by holding a reference to the task_struct. In the future, we
> > may introduce APIs that support using process file descriptors instead of PIDs.
>
> FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines.
> I suspect we need either need pdkill(2) or a way to retrieve a PID from
> a process file descriptor, so that there's a way to send signals to the
> child.
The original caller of clone4 with CLONE_FD can pass CLONE_PARENT_SETTID
to get the PID.
In the future, I plan to add an fd-based equivalent of
rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
whether to kill a process or thread) which is a superset of pdkill.
pdkill could then call that and just not pass the extra info.
A fair bit of pdwait4 could be implemented on top of read(), other than
the full rusage information (see below), and the ability to wait for
STOP/CONT (which the CLONE_FD file descriptor could support if desired,
but it'd have to be set via a flag at clone time).
I think it's a feature to use read() rather than an additional magic
system call.
> > Introducing CLONE_FD required two additional bits of yak shaving: Since clone
> > has no more usable flags (with the three currently unused flags unusable
> > because old kernels ignore them without EINVAL), also introduce a new clone4
> > system call with more flag bits and an extensible argument structure. And
> > since the magic pt_regs-based syscall argument processing for clone's tls
> > argument would otherwise prevent introducing a sane clone4 system call, fix
> > that too.
> >
> > I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> > threads independently reading and writing a __thread variable), on both 32-bit
> > and 64-bit, and I observed no issues there.
>
> Worth preserving in tools/testing/selftests/ ?
Not really; it's just the following trivial program, which was faster to
write than to attempt to find somewhere:
#include <pthread.h>
#include <stdio.h>
__thread unsigned x = 0;
void *thread_func(void *unused)
{
unsigned *tx = &x;
for (; *tx < 10; (*tx)++)
printf("child: tx=%p *tx=%u\n", tx, *tx);
return NULL;
}
int main(void)
{
unsigned *tx = &x;
pthread_t thread;
pthread_create(&thread, NULL, thread_func, NULL);
for (; *tx < 10; (*tx)++)
printf("main: tx=%p *tx=%u\n", tx, *tx);
pthread_join(thread, NULL);
return 0;
}
(I didn't bother with error handling, because I ran it under strace.)
> > I tested clone4 and the new CLONE_FD call with several additional test
> > programs, launching either a process or thread (in the former case using
> > syscall(), in the latter case by calling clone4 via assembly and returning to
> > C), sleeping in parent and child to test the case of either exiting first, and
> > then printing the received clone4_info structure. Thiago also tested clone4
> > with CLONE_FD with a modified version of libqt's process handling, which
> > includes a test suite.
> >
> > I've also included the manpages patch at the end of this series. (Note that
> > the manpage documents the behavior of the future glibc wrapper as well as the
> > raw syscall.) Here's a formatted plain-text version of the manpage for
> > reference:
>
> FYI, I've added some comparisons with the FreeBSD equivalents below.
Thanks!
> > CLONE4(2) Linux Programmer's Manual CLONE4(2)
> >
> >
> >
> > NAME
> > clone4 - create a child process
> >
> > SYNOPSIS
> > /* Prototype for the glibc wrapper function */
> >
> > #define _GNU_SOURCE
> > #include <sched.h>
> >
> > int clone4(uint64_t flags,
> > size_t args_size,
> > struct clone4_args *args,
> > int (*fn)(void *), void *arg);
> >
> > /* Prototype for the raw system call */
> >
> > int clone4(unsigned flags_high, unsigned flags_low,
> > unsigned long args_size,
> > struct clone4_args *args);
> >
> > struct clone4_args {
> > pid_t *ptid;
> > pid_t *ctid;
> > unsigned long stack_start;
> > unsigned long stack_size;
> > unsigned long tls;
> > };
> >
> >
> > DESCRIPTION
> > clone4() creates a new process, similar to clone(2) and fork(2).
> > clone4() supports additional flags that clone(2) does not, and accepts
> > arguments via an extensible structure.
> >
> > args points to a clone4_args structure, and args_size must contain the
> > size of that structure, as understood by the caller. If the caller
> > passes a shorter structure than the kernel expects, the remaining
> > fields will default to 0. If the caller passes a larger structure than
> > the kernel expects (such as one from a newer kernel), clone4() will
> > return EINVAL. The clone4_args structure may gain additional fields at
> > the end in the future, and callers must only pass a size that encom‐
> > passes the number of fields they understand. If the caller passes 0
> > for args_size, args is ignored and may be NULL.
> >
> > In the clone4_args structure, ptid, ctid, stack_start, stack_size, and
> > tls have the same semantics as they do with clone(2) and clone2(2).
> >
> > In the glibc wrapper, fn and arg have the same semantics as they do
> > with clone(2). As with clone(2), the underlying system call works more
> > like fork(2), returning 0 in the child process; the glibc wrapper sim‐
> > plifies thread execution by calling fn(arg) and exiting the child when
> > that function exits.
> >
> > The 64-bit flags argument (split into the 32-bit flags_high and
> > flags_low arguments in the kernel interface) accepts all the same flags
> > as clone(2), with the exception of the obsolete CLONE_PID,
> > CLONE_DETACHED, and CLONE_STOPPED. In addition, flags accepts the fol‐
> > lowing flags:
> >
> >
> > CLONE_FD
> > Instead of returning a process ID, clone4() with the CLONE_FD
> > flag returns a file descriptor associated with the new process.
> > When the new process exits, the kernel will not send a signal to
> > the parent process, and will not keep the new process around as
> > a "zombie" process until a call to waitpid(2) or similar.
> > Instead, the file descriptor will become available for reading,
> > and the new process will be immediately reaped.
>
> Just to confirm: presumably a waitpid(-1,...) call that's already in
> progress won't return when one of these child processes exits?
I agree, I don't think it should. Because otherwise you'd also assume
you can waitpid() on the PID itself, and that'd be a race condition
since the process autoreaps.
> > Unlike using signalfd(2) for the SIGCHLD signal, the file
> > descriptor returned by clone4() with the CLONE_FD flag works
> > even with SIGCHLD unblocked in one or more threads of the parent
> > process, and allows the process to have different handlers for
> > different child processes, such as those created by a library,
> > without introducing race conditions around process-wide signal
> > handling.
> >
> > clone4() will never return a file descriptor in the range 0-2 to
> > the caller, to avoid ambiguity with the return of 0 in the child
> > process. Only the calling process will have the new file
> > descriptor open; the child process will not.
>
> FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> return the file descriptor separately, which avoids the need for special
> case processing for low FD values (and means that POSIX's "lowest file
> descriptor not currently open" behaviour can be preserved if desired).
That'd be easy to implement if desired, by adding an outbound pointer to
clone4_args.
The (very mild) reason I'd dropped the PID: with CLONE_FD and future
syscalls that use the fd as an identifier, PIDs can hopefully become
mostly unnecessary. However, I'm not that attached to changing the
return value; it'd be trivial to switch to an outbound parameter
instead, and then drop the "not 0-2".
> > Since the kernel does not send a termination signal when a child
> > process created with CLONE_FD exits, the low byte of flags does
> > not contain a signal number. Instead, the low byte of flags can
> > contain the following additional flags for use with CLONE_FD:
> >
> >
> > CLONEFD_CLOEXEC
> > Set the O_CLOEXEC flag on the new open file descriptor.
> > See the description of the O_CLOEXEC flag in open(2) for
> > reasons why this may be useful.
> >
> >
> > CLONEFD_NONBLOCK
> > Set the O_NONBLOCK flag on the new open file descriptor.
> > Using this flag saves extra calls to fcntl(2) to achieve
> > the same result.
> >
> >
> > clone4() with the CLONE_FD flag returns a file descriptor that
> > supports the following operations:
> >
> > read(2) (and similar)
> > When the new process exits, reading from the file
> > descriptor produces a single clonefd_info structure:
> >
> > struct clonefd_info {
> > uint32_t code; /* Signal code */
> > uint32_t status; /* Exit status or signal */
> > uint64_t utime; /* User CPU time */
> > uint64_t stime; /* System CPU time */
> > };
>
> Presumably there is no way to get full rusage information for the exited
> process?
I focused on the information available via SIGCHLD. Even utime and
stime are unnecessary for the primary use case of CLONE_FD, but I
included them because SIGCHLD does. I'd like to avoid sending the much
larger rusage over the file descriptor when the caller may not care.
However, given that the task_struct sticks around as long as the
CLONE_FD file descriptor does, if that information is normally still
available from a dead-but-not-waited-on process, it should be trivial to
add an operation that takes the file descriptor and returns the full
rusage, if someone needs that. I think that can be done as part of a
later patch series adding other operations for use with the file
descriptor, though.
> [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> process descriptor, including rusage retrieval. However, I don't think
> they actually implemented it:
> http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
That's a pretty good argument that we don't need to either, at least not
yet.
> > If the new process has not yet exited, read(2) either
> > blocks until it does, or fails with the error EAGAIN if
> > the file descriptor has been made nonblocking.
> >
> > Future kernels may extend clonefd_info by appending addi‐
> > tional fields to the end. Callers should read as many
> > bytes as they understand; unread data will be discarded,
> > and subsequent reads after the first will return 0 to
> > indicate end-of-file. Callers requesting more bytes than
> > the kernel provides (such as callers expecting a newer
> > clonefd_info structure) will receive a shorter structure
> > from older kernels.
>
> FreeBSD also implements fstat(2) for its process descriptors, although
> only a few of the fields get filled in.
I looked at what they provide, and that seems like more of a novelty
than something particularly useful (since most of the stat fields aren't
meaningful), but if that's useful for compatibility then adding it seems
fine.
> > poll(2), select(2), epoll(7) (and similar)
> > The file descriptor is readable (the select(2) readfds
> > argument; the poll(2) POLLIN flag) if the new process has
> > exited.
>
> FreeBSD uses POLLHUP here.
That makes sense given that they provide the information via a separate
call rather than read. Since the CLONE_FD file descriptor uses read, it
needs to provide POLLIN, but I have no objection to using *both* POLLIN
and POLLHUP if that'd be at all useful.
> > close(2)
> > When the file descriptor is no longer required it should
> > be closed. If no process has a file descriptor open for
> > the new process, no process will receive any notification
> > when the new process exits. The new process will still
> > be immediately reaped.
>
> FreeBSD has two different behaviours for close(2), depending on a flag
> value (PD_DAEMON). With the flag set it's roughly like this, but
> without PD_DAEMON a close(2) operation on the (last open) file
> descriptor terminates the child process.
>
> This can be quite useful, particularly for the use case where some
> userspace library has an FD-controlled subprocess -- if the application
> using the library terminates, the process descriptor is closed and so
> the subprocess is automatically terminated.
That's an interesting idea. I don't think it makes sense for that to be
the default behavior, but if someone wanted to add an additional flag
to implement that behavior, that seems fine. A FreeBSD-compatible
pdfork could then use that flag when not passed PD_DAEMON and not use it
when passed PD_DAEMON.
How does it kill the process when the last open descriptor closes?
SIGKILL? SIGTERM? The former seems unfriendly (preventing graceful
termination), and the latter blockable. There's a reason init systems
send TERM, then wait, then KILL.
- Josh Triplett
^ permalink raw reply
* Re: [RFC] capabilities: Ambient capabilities
From: Kees Cook @ 2015-03-13 19:54 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andrew G. Morgan, Jarkko Sakkinen, Ted Ts'o,
Andrew Lutomirski, Andrew Morton, Michael Kerrisk, Mimi Zohar,
Linux API, Austin S Hemmelgarn, linux-security-module,
Aaron Jones, Christoph Lameter, LKML, Serge Hallyn, Markku Savela,
Jonathan Corbet
In-Reply-To: <CALCETrV7vDJVhM_AgtGn8ENStcyxZBwCw3zhSn-whArT_XPg8A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Fri, Mar 13, 2015 at 12:03 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Mar 13, 2015 at 11:52 AM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>>
>> All this said, almost half of the capabilities, if passed to flawed
>> children with attacker controlled execution, can be elevated to full
>> root privileges pretty easily[1], so I think any documentation around
>> this feature should include some pretty dire warnings about using
>> this.
>
> That's a good point. I'll make sure to document that.
>
> It's worth noting that, for many applications, that list is
> overstated. For example, many of the suggested privilege escalations
> don't work if you're in a sufficiently restrictive mount namespace.
>
> For my own use, I plan on adding only CAP_NET_BIND_SERVICE and
> CAP_NET_RAW to pA, and I'll be layering seccomp on top to the extent
> possible.
Right, keeping software authors aware of the fact that their efforts
for attack surface reducing may need additional confinement beyond
just the capability reduction.
-Kees
>
> --Andy
>
>>
>> -Kees
>>
>> [1] https://forums.grsecurity.net/viewtopic.php?f=7&t=2522
>>
>> --
>> Kees Cook
>> Chrome OS Security
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
--
Kees Cook
Chrome OS Security
^ permalink raw reply
* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: josh @ 2015-03-13 19:57 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Al Viro, Andrew Morton, Andy Lutomirski, Ingo Molnar, Kees Cook,
Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
Thiago Macieira, Michael Kerrisk, linux-kernel, linux-api,
linux-fsdevel, x86
In-Reply-To: <20150313162113.GA25966@redhat.com>
On Fri, Mar 13, 2015 at 05:21:13PM +0100, Oleg Nesterov wrote:
> Josh,
>
> I'll certainly try to read this series, but not before next week.
Thanks for looking at it.
> but a couple of nits right now.
>
> On 03/12, Josh Triplett wrote:
> >
> > When passed CLONE_FD, clone4 will return a file descriptor rather than a
> > PID. When the child process exits, it gets automatically reaped,
>
> And even I have no idea what you are actually doing, this doesn't look
> right, see below.
>
> > +static unsigned int clonefd_poll(struct file *file, poll_table *wait)
> > +{
> > + struct task_struct *p = file->private_data;
> > + poll_wait(file, &p->clonefd_wqh, wait);
> > + return p->exit_state == EXIT_DEAD ? (POLLIN | POLLRDNORM) : 0;
> > +}
> > +
> > +static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
> > +{
> > + struct task_struct *p = file->private_data;
> > + int ret = 0;
> > +
> > + /* EOF after first read */
> > + if (*ppos)
> > + return 0;
> > +
> > + if (file->f_flags & O_NONBLOCK)
> > + ret = -EAGAIN;
> > + else
> > + ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state == EXIT_DEAD);
> > +
> > + if (p->exit_state == EXIT_DEAD) {
>
> Again, I simply do not know what this code does at all. But I bet the usage
> of EXIT_DEAD is wrong ;)
>
> OK, OK, I can be wrong. But I simply do not see what protects this task_struct
> if it is EXIT_DEAD (in fact even if it is EXIT_ZOMBIE).
If by "what protects" you mean "what keeps it alive", the file
descriptor holds a reference to the task_struct by calling
get_task_struct when created and put_task_struct when released.
This wait_event_interruptible pairs with the wake_up_all called from
clonefd_do_notify, which exit_notify calls *after* setting the task to
TASK_DEAD.
Apart from that, what about what the code is doing isn't clear?
> > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > if (group_dead)
> > kill_orphaned_pgrp(tsk->group_leader, NULL);
> >
> > - if (unlikely(tsk->ptrace)) {
> > + if (tsk->autoreap) {
> > + autoreap = true;
>
> Debuggers won't be happy. A ptraced task should not autoreap itself.
A process launching a new process with CLONE_FD is explicitly requesting
that the process be automatically reaped without any other process
having to wait on it. The task needs to not become a zombie, because
otherwise, it'll show up in waitpid(-1, ...) calls in the parent
process, which would break the ability to use this to completely
encapsulate process management within a library and not interfere with
the parent's process handling via SIGCHLD and wait{pid,3,4}.
- Josh Triplett
^ permalink raw reply
* Re: [PATCH v4 3/9] selftests: Introduce minimal shared logic for running tests
From: Shuah Khan @ 2015-03-13 20:28 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL, mmarek-AlSwsSmVLrQ,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <55031C14.1030203-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
On 03/13/2015 11:19 AM, Shuah Khan wrote:
> On 03/10/2015 10:05 PM, Michael Ellerman wrote:
>> This adds a Make include file which most selftests can then include to
>> get the run_tests logic.
>>
>> On its own this has the advantage of some reduction in repetition, and
>> also means the pass/fail message is defined in fewer places.
>>
>> However the key advantage is it will allow us to implement install very
>> simply in a subsequent patch.
>>
>> The default implementation just executes each program in $(TEST_PROGS).
>>
>> We use a variable to hold the default implementation of $(RUN_TESTS)
>> because that gives us a clean way to override it if necessary, ie. using
>> override. The mount, memory-hotplug and mqueue tests use that to provide
>> a different implementation.
>>
>> Tests are not run via /bin/bash, so if they are scripts they must be
>> executable, we add a+x to several.
>>
>> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
>
> This patch will be applied to next and queued for 4.1.
>
This patch is now in linux-kselftest next. Could you please review
to make sure, it looks right. I had to drop the shared logic from
timers Makefile because, it changed considerably with the additional
tests and it wasn't easy to resolve the conflict and keep both changes.
So at the moment, timers doesn't use the shared logic.
If lib.mk could provide a way to run additional programs that require
arguments in addition to RUN_TESTS. In the case of timers, there is one
test that requires arguments. In some cases, e.g: memory hotplug,
override works well since it is just one executable. In this case,
there is a mix.
Something that can be addressed in a separate patch. For now, I made
the decision to apply with shared logic patch minus the changes to
use lib.mk
thanks,
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978
^ permalink raw reply
* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Thiago Macieira @ 2015-03-13 21:16 UTC (permalink / raw)
To: Josh Triplett
Cc: David Drysdale, Al Viro, Andrew Morton, Andy Lutomirski,
Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
linux-kernel@vger.kernel.org, Linux API, linux-fsdevel, X86 ML
In-Reply-To: <20150313194252.GA10317@cloud>
On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > Hi Josh,
> >
> > From the overall description (i.e. I haven't looked at the code yet)
> > this looks very interesting. However, it seems to cover a lot of the
> > same ground as the process descriptor feature that was added to FreeBSD
> >
> > in 9.x/10.x:
> > https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>
> Interesting.
Hi Josh, David
I wasn't aware of the FreeBSD implementation of pdfork(). It is actually
exactly what I need in userspace. The only difference between pdfork() and and
my proposed forkfd() is where the PID and where the file descriptor are
returned (meaning, which is optional and which isn't).
Josh and I opted to return the file descriptor in the regular return value in
forkfd and in clone4 because getting the file descriptor the whole objective of
using the forkfd or clone4-with-CLONE_FD in the first place: the file descriptor
is not optional, but the PID is.
> Agreed; however, I think it's reasonable to provide appropriate Linux
> system calls, and then let glibc or libbsd or similar provide the
> BSD-compatible calls on top of those. I don't think the kernel
> interface needs to exactly match FreeBSD's, as long as it's a superset
> of the functionality.
>
> For example, pdfork can just call clone4 with CLONE_FD and return the
> resulting file descriptor.
Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().
I'm not too attached to the forkfd() interface, but I find it slightly superior
for the reasons above.
If we want the PD_DAEMON flag, it will have to translate to a clone flag, like
CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.
> In the future, I plan to add an fd-based equivalent of
> rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> whether to kill a process or thread) which is a superset of pdkill.
> pdkill could then call that and just not pass the extra info.
>
> A fair bit of pdwait4 could be implemented on top of read(), other than
> the full rusage information (see below), and the ability to wait for
> STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> but it'd have to be set via a flag at clone time).
>
> I think it's a feature to use read() rather than an additional magic
> system call.
Indeed, even if the libc provides a wrapper for you, like glibc does for
eventfd (eventfd_read, eventfd_write).
Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in
the initial patch set, but it was part of the plans.
> > > clone4() will never return a file descriptor in the range
> > > 0-2 to
> > > the caller, to avoid ambiguity with the return of 0 in the
> > > child
> > > process. Only the calling process will have the new
> > > file
> > > descriptor open; the child process will not.
> >
> > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > return the file descriptor separately, which avoids the need for special
> > case processing for low FD values (and means that POSIX's "lowest file
> > descriptor not currently open" behaviour can be preserved if desired).
>
> That'd be easy to implement if desired, by adding an outbound pointer to
> clone4_args.
>
> The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> syscalls that use the fd as an identifier, PIDs can hopefully become
> mostly unnecessary. However, I'm not that attached to changing the
> return value; it'd be trivial to switch to an outbound parameter
> instead, and then drop the "not 0-2".
See above for more motivation on making the PID optional.
As for the file descriptor range, if we need to be able to return 0, we can
implement a magic constant to mean the child process, like the userspace
forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on
Linux, since that is neither a valid file descriptor nor a valid errno value.
> > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> > process descriptor, including rusage retrieval. However, I don't think
> >
> > they actually implemented it:
> > http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
>
> That's a pretty good argument that we don't need to either, at least not
> yet.
pdwait4() can be implemented on top of read(), with the WNOHANG flag being just
toggling the O_NONBLOCK bit. The problem is with the rest of the flags. We
could implement it via more ioctls to be done prior to read() if we don't want
to add a syscall...
Another alternative is to add a P_PD flag that can be passed as the first
argument to waitid(), making the second argument a file descriptor instead of a
PID or pgrp.
> > FreeBSD also implements fstat(2) for its process descriptors, although
> > only a few of the fields get filled in.
>
> I looked at what they provide, and that seems like more of a novelty
> than something particularly useful (since most of the stat fields aren't
> meaningful), but if that's useful for compatibility then adding it seems
> fine.
I don't think we need to do anything: anon_inode will do it for us.
If I stat an eventfd:
stat("/proc/107751/fd/4", {st_dev=makedev(0, 9), st_ino=3943,
st_mode=0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
st_ctime=2015/03/12-16:12:00}) = 0
And just out of curiosity, in the following order: epoll, signalfd, timerfd
and inotify:
stat("/proc/1462/fd/4", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
st_ctime=2015/03/12-16:12:00}) = 0
stat("/proc/1462/fd/5", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
st_ctime=2015/03/12-16:12:00}) = 0
stat("/proc/1462/fd/7", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
st_ctime=2015/03/12-16:12:00}) = 0
stat("/proc/1462/fd/8", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
st_ctime=2015/03/12-16:12:00}) = 0
(that process is systemd --user)
> > > poll(2), select(2), epoll(7) (and similar)
> > >
> > > The file descriptor is readable (the select(2)
> > > readfds
> > > argument; the poll(2) POLLIN flag) if the new
> > > process has
> > > exited.
> >
> > FreeBSD uses POLLHUP here.
>
> That makes sense given that they provide the information via a separate
> call rather than read. Since the CLONE_FD file descriptor uses read, it
> needs to provide POLLIN, but I have no objection to using *both* POLLIN
> and POLLHUP if that'd be at all useful.
I think we should provide both, since we're notifying that there are things to
be read and that the file descriptor has closed.
> > FreeBSD has two different behaviours for close(2), depending on a flag
> > value (PD_DAEMON). With the flag set it's roughly like this, but
> > without PD_DAEMON a close(2) operation on the (last open) file
> > descriptor terminates the child process.
> >
> > This can be quite useful, particularly for the use case where some
> > userspace library has an FD-controlled subprocess -- if the application
> > using the library terminates, the process descriptor is closed and so
> > the subprocess is automatically terminated.
>
> That's an interesting idea. I don't think it makes sense for that to be
> the default behavior, but if someone wanted to add an additional flag
> to implement that behavior, that seems fine. A FreeBSD-compatible
> pdfork could then use that flag when not passed PD_DAEMON and not use it
> when passed PD_DAEMON.
>
> How does it kill the process when the last open descriptor closes?
> SIGKILL? SIGTERM? The former seems unfriendly (preventing graceful
> termination), and the latter blockable. There's a reason init systems
> send TERM, then wait, then KILL.
I was wondering if it shouldn't be a SIGHUP, since we're talking about a file
descriptor closing. We could make it configurable too, but I'd rather not use
the current CSIGNAL field -- better move to the arguments structure, just in
case someone is passing SIGCHLD there, they should get EINVAL instead of
silently sending SIGCHLD to the child process to ask it to terminate.
--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
^ permalink raw reply
* Re: [PATCH v4 4/9] selftests: Add install target
From: Shuah Khan @ 2015-03-13 21:32 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL, mmarek-AlSwsSmVLrQ,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <55031C78.4040609-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
On 03/13/2015 11:20 AM, Shuah Khan wrote:
> On 03/10/2015 10:06 PM, Michael Ellerman wrote:
>> This adds make install support to selftests. The basic usage is:
>>
>> $ cd tools/testing/selftests
>> $ make install
>>
>> That installs into tools/testing/selftests/install, which can then be
>> copied where ever necessary.
>>
>> The install destination is also configurable using eg:
>>
>> $ INSTALL_PATH=/mnt/selftests make install
>>
>> The implementation uses two targets in the child makefiles. The first
>> "install" is expected to install all files into $(INSTALL_PATH).
>>
>> The second, "emit_tests", is expected to emit the test instructions (ie.
>> bash script) on stdout. Separating this from install means the child
>> makefiles need no knowledge of the location of the test script.
>>
>> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
>
> Thanks for doing the work. This patch will be applied to next and will
> be queued for 4.1
>
ok. linux-kselftest next has both shared logic and install patches
now. As I mentioned in my reply to shared logic patch, I had to drop
the changes to timers Makefile to add shared logic, hence timers progs
don't get installed at the moment.
Could you please review to make sure, it looks right.
As I said in the other email, if lib.mk could provide a way to run
additional programs that require arguments in addition to RUN_TESTS.
In the case of timers, there is one test that requires arguments. In
some cases, e.g: memory hotplug, override works well since it is just
one executable. In this case, there is a mix of executables that don't
need any special handling and some that need it.
You are welcome to add shared logic and install to timers if you would
like.
thanks,
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978
^ permalink raw reply
* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Andy Lutomirski @ 2015-03-13 21:33 UTC (permalink / raw)
To: Josh Triplett
Cc: David Drysdale, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
Thomas Gleixner, Thiago Macieira, Michael Kerrisk,
linux-kernel@vger.kernel.org, Linux API, Linux FS Devel, X86 ML
In-Reply-To: <20150313194252.GA10317@cloud>
On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett <josh@joshtriplett.org> wrote:
> On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
>> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@joshtriplett.org> wrote:
>> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
>> > handle child process exit notification via a file descriptor rather than
>> > SIGCHLD. CLONE_FD makes it possible for libraries to safely launch and manage
>> > child processes on behalf of their caller, *without* taking over process-wide
>> > SIGCHLD handling (either via signal handler or signalfd).
>>
>> Hi Josh,
>>
>> From the overall description (i.e. I haven't looked at the code yet)
>> this looks very interesting. However, it seems to cover a lot of the
>> same ground as the process descriptor feature that was added to FreeBSD
>> in 9.x/10.x:
>> https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>
> Interesting.
>
>> I think it would ideally be nice for a userspace library developer to be
>> able to do subprocess management (without SIGCHLD) in a similar way
>> across both platforms, without lots of complicated autoconf shenanigans.
>>
>> So could we look at the overlap and seeing if we can come up with
>> something that covers your requirements and also allows for something
>> that looks like FreeBSD's process descriptors?
>
> Agreed; however, I think it's reasonable to provide appropriate Linux
> system calls, and then let glibc or libbsd or similar provide the
> BSD-compatible calls on top of those. I don't think the kernel
> interface needs to exactly match FreeBSD's, as long as it's a superset
> of the functionality.
We need to be careful with things like read(2), though. It's hard to
write a glibc function that makes read(2) do something other than what
the kernel thinks. Similarly, poll(2) is defined by the kernel. It
would be really nice to be consistent here.
--Andy
^ permalink raw reply
* Re: [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd
From: Andy Lutomirski @ 2015-03-13 21:34 UTC (permalink / raw)
To: Josh Triplett
Cc: Oleg Nesterov, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
Thiago Macieira, Michael Kerrisk,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Linux FS Devel, X86 ML
In-Reply-To: <20150313195707.GA10487@cloud>
On Fri, Mar 13, 2015 at 12:57 PM, <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> On Fri, Mar 13, 2015 at 05:21:13PM +0100, Oleg Nesterov wrote:
>> Josh,
>>
>> I'll certainly try to read this series, but not before next week.
>
> Thanks for looking at it.
>
>> but a couple of nits right now.
>>
>> On 03/12, Josh Triplett wrote:
>> >
>> > When passed CLONE_FD, clone4 will return a file descriptor rather than a
>> > PID. When the child process exits, it gets automatically reaped,
>>
>> And even I have no idea what you are actually doing, this doesn't look
>> right, see below.
>>
>> > +static unsigned int clonefd_poll(struct file *file, poll_table *wait)
>> > +{
>> > + struct task_struct *p = file->private_data;
>> > + poll_wait(file, &p->clonefd_wqh, wait);
>> > + return p->exit_state == EXIT_DEAD ? (POLLIN | POLLRDNORM) : 0;
>> > +}
>> > +
>> > +static ssize_t clonefd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>> > +{
>> > + struct task_struct *p = file->private_data;
>> > + int ret = 0;
>> > +
>> > + /* EOF after first read */
>> > + if (*ppos)
>> > + return 0;
>> > +
>> > + if (file->f_flags & O_NONBLOCK)
>> > + ret = -EAGAIN;
>> > + else
>> > + ret = wait_event_interruptible(p->clonefd_wqh, p->exit_state == EXIT_DEAD);
>> > +
>> > + if (p->exit_state == EXIT_DEAD) {
>>
>> Again, I simply do not know what this code does at all. But I bet the usage
>> of EXIT_DEAD is wrong ;)
>>
>> OK, OK, I can be wrong. But I simply do not see what protects this task_struct
>> if it is EXIT_DEAD (in fact even if it is EXIT_ZOMBIE).
>
> If by "what protects" you mean "what keeps it alive", the file
> descriptor holds a reference to the task_struct by calling
> get_task_struct when created and put_task_struct when released.
>
> This wait_event_interruptible pairs with the wake_up_all called from
> clonefd_do_notify, which exit_notify calls *after* setting the task to
> TASK_DEAD.
>
> Apart from that, what about what the code is doing isn't clear?
>
>> > @@ -598,7 +600,9 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>> > if (group_dead)
>> > kill_orphaned_pgrp(tsk->group_leader, NULL);
>> >
>> > - if (unlikely(tsk->ptrace)) {
>> > + if (tsk->autoreap) {
>> > + autoreap = true;
>>
>> Debuggers won't be happy. A ptraced task should not autoreap itself.
>
> A process launching a new process with CLONE_FD is explicitly requesting
> that the process be automatically reaped without any other process
> having to wait on it. The task needs to not become a zombie, because
> otherwise, it'll show up in waitpid(-1, ...) calls in the parent
> process, which would break the ability to use this to completely
> encapsulate process management within a library and not interfere with
> the parent's process handling via SIGCHLD and wait{pid,3,4}.
Wouldn't the correct behavior be to keep it alive as a zombie but
*not* show it in waitpid, etc?
--Andy
^ permalink raw reply
* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: josh @ 2015-03-13 21:44 UTC (permalink / raw)
To: Thiago Macieira
Cc: David Drysdale, Al Viro, Andrew Morton, Andy Lutomirski,
Ingo Molnar, Kees Cook, Oleg Nesterov, Paul E. McKenney,
H. Peter Anvin, Rik van Riel, Thomas Gleixner, Michael Kerrisk,
linux-kernel@vger.kernel.org, Linux API, linux-fsdevel, X86 ML
In-Reply-To: <8029771.gA9WzoLePv@tjmaciei-mobl4>
On Fri, Mar 13, 2015 at 02:16:07PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > > Hi Josh,
> > >
> > > From the overall description (i.e. I haven't looked at the code yet)
> > > this looks very interesting. However, it seems to cover a lot of the
> > > same ground as the process descriptor feature that was added to FreeBSD
> > >
> > > in 9.x/10.x:
> > > https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> >
> > Interesting.
>
> I wasn't aware of the FreeBSD implementation of pdfork(). It is actually
> exactly what I need in userspace.
Right; libqt should be able to use pdfork on FreeBSD and CLONE_FD on
Linux.
> The only difference between pdfork() and and
> my proposed forkfd() is where the PID and where the file descriptor are
> returned (meaning, which is optional and which isn't).
>
> Josh and I opted to return the file descriptor in the regular return value in
> forkfd and in clone4 because getting the file descriptor the whole objective of
> using the forkfd or clone4-with-CLONE_FD in the first place: the file descriptor
> is not optional, but the PID is.
And as long as you can get the fd, where it's returned really doesn't
matter.
> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those. I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
> >
> > For example, pdfork can just call clone4 with CLONE_FD and return the
> > resulting file descriptor.
>
> Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().
>
> I'm not too attached to the forkfd() interface, but I find it slightly superior
> for the reasons above.
Agreed.
> If we want the PD_DAEMON flag, it will have to translate to a clone flag, like
> CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.
I think the inverted version makes more sense, so that the default
behavior just changes exit notification without adding the kill-on-close
behavior. And that kill-on-close behavior can come in a later patch. :)
> > In the future, I plan to add an fd-based equivalent of
> > rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> > whether to kill a process or thread) which is a superset of pdkill.
> > pdkill could then call that and just not pass the extra info.
> >
> > A fair bit of pdwait4 could be implemented on top of read(), other than
> > the full rusage information (see below), and the ability to wait for
> > STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> > but it'd have to be set via a flag at clone time).
> >
> > I think it's a feature to use read() rather than an additional magic
> > system call.
>
> Indeed, even if the libc provides a wrapper for you, like glibc does for
> eventfd (eventfd_read, eventfd_write).
>
> Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in
> the initial patch set, but it was part of the plans.
>
> > > > clone4() will never return a file descriptor in the range
> > > > 0-2 to
> > > > the caller, to avoid ambiguity with the return of 0 in the
> > > > child
> > > > process. Only the calling process will have the new
> > > > file
> > > > descriptor open; the child process will not.
> > >
> > > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > > return the file descriptor separately, which avoids the need for special
> > > case processing for low FD values (and means that POSIX's "lowest file
> > > descriptor not currently open" behaviour can be preserved if desired).
> >
> > That'd be easy to implement if desired, by adding an outbound pointer to
> > clone4_args.
> >
> > The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> > syscalls that use the fd as an identifier, PIDs can hopefully become
> > mostly unnecessary. However, I'm not that attached to changing the
> > return value; it'd be trivial to switch to an outbound parameter
> > instead, and then drop the "not 0-2".
>
> See above for more motivation on making the PID optional.
>
> As for the file descriptor range, if we need to be able to return 0, we can
> implement a magic constant to mean the child process, like the userspace
> forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on
> Linux, since that is neither a valid file descriptor nor a valid errno value.
I don't think that logic is worth implementing, though, since it would
require changing all the architecture-specific copy_thread
implementations. If we really want to go this path, we should just
return the fd via an out parameter in the clone4_args structure.
> > > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> > > process descriptor, including rusage retrieval. However, I don't think
> > >
> > > they actually implemented it:
> > > http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
> >
> > That's a pretty good argument that we don't need to either, at least not
> > yet.
>
> pdwait4() can be implemented on top of read(), with the WNOHANG flag being just
> toggling the O_NONBLOCK bit. The problem is with the rest of the flags. We
> could implement it via more ioctls to be done prior to read() if we don't want
> to add a syscall...
>
> Another alternative is to add a P_PD flag that can be passed as the first
> argument to waitid(), making the second argument a file descriptor instead of a
> PID or pgrp.
Or a flag that can be added to the options argument of wait4 to indicate
that the first argument is really a file descriptor.
> > > FreeBSD also implements fstat(2) for its process descriptors, although
> > > only a few of the fields get filled in.
> >
> > I looked at what they provide, and that seems like more of a novelty
> > than something particularly useful (since most of the stat fields aren't
> > meaningful), but if that's useful for compatibility then adding it seems
> > fine.
>
> I don't think we need to do anything: anon_inode will do it for us.
>
> If I stat an eventfd:
>
> stat("/proc/107751/fd/4", {st_dev=makedev(0, 9), st_ino=3943,
> st_mode=0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
>
> And just out of curiosity, in the following order: epoll, signalfd, timerfd
> and inotify:
>
> stat("/proc/1462/fd/4", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/5", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/7", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/8", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
>
> (that process is systemd --user)
Interesting. What does stat on a CLONE_FD file descriptor return?
> > > > poll(2), select(2), epoll(7) (and similar)
> > > >
> > > > The file descriptor is readable (the select(2)
> > > > readfds
> > > > argument; the poll(2) POLLIN flag) if the new
> > > > process has
> > > > exited.
> > >
> > > FreeBSD uses POLLHUP here.
> >
> > That makes sense given that they provide the information via a separate
> > call rather than read. Since the CLONE_FD file descriptor uses read, it
> > needs to provide POLLIN, but I have no objection to using *both* POLLIN
> > and POLLHUP if that'd be at all useful.
>
> I think we should provide both, since we're notifying that there are things to
> be read and that the file descriptor has closed.
"closed" in the "other end of the not-quite-a-pipe" sense, sure. I'll
add that in a v2.
> > > FreeBSD has two different behaviours for close(2), depending on a flag
> > > value (PD_DAEMON). With the flag set it's roughly like this, but
> > > without PD_DAEMON a close(2) operation on the (last open) file
> > > descriptor terminates the child process.
> > >
> > > This can be quite useful, particularly for the use case where some
> > > userspace library has an FD-controlled subprocess -- if the application
> > > using the library terminates, the process descriptor is closed and so
> > > the subprocess is automatically terminated.
> >
> > That's an interesting idea. I don't think it makes sense for that to be
> > the default behavior, but if someone wanted to add an additional flag
> > to implement that behavior, that seems fine. A FreeBSD-compatible
> > pdfork could then use that flag when not passed PD_DAEMON and not use it
> > when passed PD_DAEMON.
> >
> > How does it kill the process when the last open descriptor closes?
> > SIGKILL? SIGTERM? The former seems unfriendly (preventing graceful
> > termination), and the latter blockable. There's a reason init systems
> > send TERM, then wait, then KILL.
>
> I was wondering if it shouldn't be a SIGHUP, since we're talking about a file
> descriptor closing. We could make it configurable too, but I'd rather not use
> the current CSIGNAL field -- better move to the arguments structure, just in
> case someone is passing SIGCHLD there, they should get EINVAL instead of
> silently sending SIGCHLD to the child process to ask it to terminate.
That sounds like several good reasons right there to defer "kill on
close" to a future patch, the author of which should research how
FreeBSD implements this.
- Josh Triplett
^ permalink raw reply
* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: josh-iaAMLnmF4UmaiuxdJuQwMA @ 2015-03-13 21:45 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Drysdale, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
Thomas Gleixner, Thiago Macieira, Michael Kerrisk,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Linux FS Devel, X86 ML
In-Reply-To: <CALCETrXH+Ui1XqVZjFB=8vDpJLCYeVf+XNUPGWNfwDsNKi_nKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Fri, Mar 13, 2015 at 02:33:44PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> > On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
> >> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> >> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> >> > handle child process exit notification via a file descriptor rather than
> >> > SIGCHLD. CLONE_FD makes it possible for libraries to safely launch and manage
> >> > child processes on behalf of their caller, *without* taking over process-wide
> >> > SIGCHLD handling (either via signal handler or signalfd).
> >>
> >> Hi Josh,
> >>
> >> From the overall description (i.e. I haven't looked at the code yet)
> >> this looks very interesting. However, it seems to cover a lot of the
> >> same ground as the process descriptor feature that was added to FreeBSD
> >> in 9.x/10.x:
> >> https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> >
> > Interesting.
> >
> >> I think it would ideally be nice for a userspace library developer to be
> >> able to do subprocess management (without SIGCHLD) in a similar way
> >> across both platforms, without lots of complicated autoconf shenanigans.
> >>
> >> So could we look at the overlap and seeing if we can come up with
> >> something that covers your requirements and also allows for something
> >> that looks like FreeBSD's process descriptors?
> >
> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those. I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
>
> We need to be careful with things like read(2), though. It's hard to
> write a glibc function that makes read(2) do something other than what
> the kernel thinks. Similarly, poll(2) is defined by the kernel. It
> would be really nice to be consistent here.
It doesn't sound like FreeBSD implements read(2) on the pdfork file
descriptor at all. If it does, yes, we're not going to be able to be
compatible with that.
- Josh Triplett
^ permalink raw reply
* Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
From: Andy Lutomirski @ 2015-03-13 21:51 UTC (permalink / raw)
To: Josh Triplett
Cc: David Drysdale, Al Viro, Andrew Morton, Ingo Molnar, Kees Cook,
Oleg Nesterov, Paul E. McKenney, H. Peter Anvin, Rik van Riel,
Thomas Gleixner, Thiago Macieira, Michael Kerrisk,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Linux FS Devel, X86 ML
In-Reply-To: <20150313214556.GB10954@cloud>
On Fri, Mar 13, 2015 at 2:45 PM, <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> On Fri, Mar 13, 2015 at 02:33:44PM -0700, Andy Lutomirski wrote:
>> On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
>> > On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
>> >> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
>> >> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
>> >> > handle child process exit notification via a file descriptor rather than
>> >> > SIGCHLD. CLONE_FD makes it possible for libraries to safely launch and manage
>> >> > child processes on behalf of their caller, *without* taking over process-wide
>> >> > SIGCHLD handling (either via signal handler or signalfd).
>> >>
>> >> Hi Josh,
>> >>
>> >> From the overall description (i.e. I haven't looked at the code yet)
>> >> this looks very interesting. However, it seems to cover a lot of the
>> >> same ground as the process descriptor feature that was added to FreeBSD
>> >> in 9.x/10.x:
>> >> https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>> >
>> > Interesting.
>> >
>> >> I think it would ideally be nice for a userspace library developer to be
>> >> able to do subprocess management (without SIGCHLD) in a similar way
>> >> across both platforms, without lots of complicated autoconf shenanigans.
>> >>
>> >> So could we look at the overlap and seeing if we can come up with
>> >> something that covers your requirements and also allows for something
>> >> that looks like FreeBSD's process descriptors?
>> >
>> > Agreed; however, I think it's reasonable to provide appropriate Linux
>> > system calls, and then let glibc or libbsd or similar provide the
>> > BSD-compatible calls on top of those. I don't think the kernel
>> > interface needs to exactly match FreeBSD's, as long as it's a superset
>> > of the functionality.
>>
>> We need to be careful with things like read(2), though. It's hard to
>> write a glibc function that makes read(2) do something other than what
>> the kernel thinks. Similarly, poll(2) is defined by the kernel. It
>> would be really nice to be consistent here.
>
> It doesn't sound like FreeBSD implements read(2) on the pdfork file
> descriptor at all. If it does, yes, we're not going to be able to be
> compatible with that.
There's an argument that using read(2) for stuff like this is a bad
idea. If anyone tried to do this in C++ (or any other OO language):
class GenericInterface
{
public:
virtual void DoAction(const char *value, size_t len) = 0;
};
class Process : public GenericInterface
{
public:
virtual void DoAction(const char *value, size_t len) = 0;
};
void Kill(Process *p)
{
p->DoAction("kill", 4);
};
They'd be re-educated very quickly. This is like duck typing, but
taken to a whole new level: *everything* is a duck, and ducks have a
grand total of three operations.
On the other hand, this seems to be UNIX tradition. It's not as if
using echo on pidfds is going to be a common idiom, though.
In any event, we should find out what FreeBSD does in response to
read(2) on the fd.
--Andy
^ permalink raw reply
* Re: [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
From: Andy Lutomirski @ 2015-03-13 22:01 UTC (permalink / raw)
To: Josh Triplett
Cc: Al Viro, Andrew Morton, Ingo Molnar, Kees Cook, Oleg Nesterov,
Paul E. McKenney, H. Peter Anvin, Rik van Riel, Thomas Gleixner,
Thiago Macieira, Michael Kerrisk,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Linux FS Devel, X86 ML
In-Reply-To: <cf79b9f0c40314e6bfda7c634e378015bd7ba037.1426180120.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
On Thu, Mar 12, 2015 at 6:40 PM, Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org> wrote:
> For 32-bit userspace on a 64-bit kernel, this requires modifying
> stub32_clone to actually swap the appropriate arguments to match
> CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
> broken.
>
> Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
> Signed-off-by: Thiago Macieira <thiago.macieira-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/ia32/ia32entry.S | 2 +-
> arch/x86/kernel/process_32.c | 6 +++---
> arch/x86/kernel/process_64.c | 8 ++++----
> 4 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b7d31ca..4960b0d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -124,6 +124,7 @@ config X86
> select MODULES_USE_ELF_REL if X86_32
> select MODULES_USE_ELF_RELA if X86_64
> select CLONE_BACKWARDS if X86_32
> + select HAVE_COPY_THREAD_TLS
> select ARCH_USE_BUILTIN_BSWAP
> select ARCH_USE_QUEUE_RWLOCK
> select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 156ebca..0286735 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -487,7 +487,7 @@ GLOBAL(\label)
> ALIGN
> GLOBAL(stub32_clone)
> leaq sys_clone(%rip),%rax
> - mov %r8, %rcx
> + xchg %r8, %rcx
> jmp ia32_ptregs_common
Do I understand correct that whatever function this is a stub for just
takes its arguments in the wrong order? If so, can we just fix it
instead of using xchg here?
In general, I much prefer C code to new asm where it makes sense to
make this tradeoff.
Other than that, this is a huge improvement. You'll have minor
conflicts against -tip, though.
--Andy
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox