* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Christian Brauner @ 2019-04-29 20:52 UTC (permalink / raw)
To: Florian Weimer
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Aleksa Sarai,
Enrico Weigelt, metux IT consult, Linus Torvalds, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes,
Daniel Colascione <danco>
In-Reply-To: <87v9ywbkp8.fsf@oldenburg2.str.redhat.com>
On Mon, Apr 29, 2019 at 10:50 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Jann Horn:
>
> >> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
> >> <clone flags and arguments, maybe in a struct>)
> >>
> >> and then you'd use it like this to fork off a child process:
> >>
> >> int spawn_shell_subprocess_(void *arg) {
> >> char *cmdline = arg;
> >> execl("/bin/sh", "sh", "-c", cmdline);
> >> return -1;
> >> }
> >> pid_t spawn_shell_subprocess(char *cmdline) {
> >> pid_t child_pid;
> >> int res = clone_temporary(spawn_shell_subprocess_, cmdline,
> >> &child_pid, [...]);
> >> if (res == 0) return child_pid;
> >> return res;
> >> }
> >>
> >> clone_temporary() could be implemented roughly as follows by the libc
> >> (or other userspace code):
> >>
> >> sigset_t sigset, sigset_old;
> >> sigfillset(&sigset);
> >> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
> >> int child_pid;
> >> int result = 0;
> >> /* starting here, use inline assembly to ensure that no stack
> >> allocations occur */
> >> long child = syscall(__NR_clone,
> >> CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
> >> ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
> >> if (child == -1) { result = -1; goto reset_sigmask; }
> >> if (child == 0) {
> >> result = fn(arg);
> >> syscall(__NR_exit, 0);
> >> }
> >> futex(&child_pid, FUTEX_WAIT, child, NULL);
> >> /* end of no-stack-allocations zone */
> >> reset_sigmask:
> >> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
> >> return result;
> >
> > ... I guess that already has a name, and it's called vfork(). (Well,
> > except that the Linux vfork() isn't a real vfork().)
> >
> > So I guess my question is: Why not vfork()?
>
> Mainly because some users want access to the clone flags, and that's not
> possible with the current userspace wrappers. The stack setup for the
> undocumented clone wrapper is also cumbersome, and the ia64 pecularity
> annoying.
>
> For the stack sharing, the callback-based interface looks like the
> absolutely right thing to do to me. It enforces the notion that you can
> safely return on the child path from a function calling vfork.
>
> > And if vfork() alone isn't flexible enough, alternatively: How about
> > an API that forks a new child in the same address space, and then
> > allows the parent to invoke arbitrary syscalls in the context of the
> > child?
>
> As long it's not an eBPF script …
You shouldn't even joke about this (I'm serious.).
I'm very certain there are people who'd think this is a good idea.
>
> > You could also build that in userspace if you wanted, I think - just
> > let the child run an assembly loop that reads registers from a unix
> > seqpacket socket, invokes the syscall instruction, and writes the
> > value of the result register back into the seqpacket socket. As long
> > as you use CLONE_VM, you don't have to worry about moving the pointer
> > targets of syscalls. The user-visible API could look like this:
>
> People already use a variant of this, execve'ing twice. See
> jspawnhelper.
>
> Thanks,
> Florian
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Linus Torvalds @ 2019-04-29 21:31 UTC (permalink / raw)
To: Florian Weimer
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
Aleksa Sarai, Enrico Weigelt, metux IT consult, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <87zho8bl8x.fsf@oldenburg2.str.redhat.com>
On Mon, Apr 29, 2019 at 1:38 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
> is sometimes implemented as fork, so applications cannot rely on the
> vfork behavior regarding the stopped parent and the shared address
> space.
What broken library does that?
Sure, we didn't have a proper vfork() long long long ago. But that
predates both git and BK, so it's some time in the 90's. We've had a
proper vfork() *forever*.
> In fact, it would be nice to have a flag we can check in the posix_spawn
> implementation, so that we can support vfork-as-fork without any run
> time cost to native Linux.
No. Just make a bug-report to whatever broken library you use. What's
the point of having a library that can't even get vfork() right? Why
would you want to have a flag to say "vfork is broken"?
Linus
^ permalink raw reply
* Re: [PATCH V2] mm: Allow userland to request that the kernel clear memory on release
From: Michal Hocko @ 2019-04-29 21:44 UTC (permalink / raw)
To: Matthew Garrett; +Cc: linux-mm, Linux Kernel Mailing List, Linux API
In-Reply-To: <CACdnJutweLKsir_r9EgP9g=Eih-hbhq20N8zHzKawR8=awnENw@mail.gmail.com>
On Fri 26-04-19 11:08:44, Matthew Garrett wrote:
> On Thu, Apr 25, 2019 at 10:25 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 25-04-19 13:39:01, Matthew Garrett wrote:
> > > Yes, given MADV_DONTDUMP doesn't imply mlock I thought it'd be more
> > > consistent to keep those independent.
> >
> > Do we want to fail madvise call on VMAs that are not mlocked then? What
> > if the munlock happens later after the madvise is called?
>
> I'm not sure if it's strictly necessary. We already have various
> combinations of features that only make sense when used together and
> which can be undermined by later actions. I can see the appeal of
> designing this in a way that makes it harder to misuse, but is that
> worth additional implementation complexity?
If the complexity is not worth the usual usecases then this should be
really documented and noted that without an mlock you are not getting
the full semantic and you can leave memory behind on the swap partition.
I cannot judge how much that matter but it certainly looks half feature
to me but if nobody is going to use the madvise without mlock then it
looks certainly much easier to implement.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [RFC PATCH for 5.2 10/10] rseq/selftests: mips: use break instruction for RSEQ_SIG
From: Paul Burton @ 2019-04-29 22:31 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Paul E . McKenney, Boqun Feng, linux-kernel,
linux-api, Thomas Gleixner, Andy Lutomirski, Dave Watson,
Paul Turner, Andrew Morton, Russell King, Ingo Molnar,
H. Peter Anvin, Andi Kleen, Chris Lameter, Ben Maurer, rostedt,
Josh Triplett, Linus Torvalds, Catalin Marinas, Will
In-Reply-To: <1183307732.352.1556202092390.JavaMail.zimbra@efficios.com>
Hi Mathieu,
On Thu, Apr 25, 2019 at 10:21:32AM -0400, Mathieu Desnoyers wrote:
> I've tried to figure out if we could find a way to have RSEQ_SIG left undefined
> if it's not on the plain mips environment, but could not find anything that
> would be #defined on plain mips, but #undefined on both micromips and nanomips.
>
> What I'd like to do is e.g.:
>
> #if defined(__nanomips__)
> # ifdef __MIPSEL__
> # define RSEQ_SIG 0x03500010
> # else
> # define RSEQ_SIG 0x00100350
> # endif
> #elif defined(__mips_micromips)
> # ifdef __MIPSEL__
> # define RSEQ_SIG 0xd4070000
> # else
> # define RSEQ_SIG 0x0000d407
> # endif
> #elif defined(__mips__)
> # define RSEQ_SIG 0x0350000d
> #else
> /* Leave RSEQ_SIG as is. */
> #endif
>
> The idea here is to not allow code targeting future MIPS ISA to compile
> with the wrong signature.
>
> The delta between compiling without/with -mmicromips on a gcc-8 compiler
> is only:
>
> > #define __mips_micromips 1
>
> Some interesting delta when compiling for plain little-endian mips with
> gcc-8 compared to the nanomips compiler is:
>
> < #define __mips__ 1
> < #define _mips 1
> < #define MIPSEL 1
>
> > #define __nanomips__ 1
>
> < #define __mips_isa_rev 2
> > #define __mips_isa_rev 6
>
> So let's say we have a picomips introduced in the future, can we rely
> on it not defining __mips__ like the nanomips compiler does ? If so,
> my "#elif defined(__mips__)" approach would indeed leave RSEQ_SIG undefined
> as expected.
>
> Thoughts ?
That seems like a reasonable approach to me. I don't think it'll be
guaranteed, but it'll give the best odds of the behavior you want.
If I recall correctly the reason for not defining __mips__ in the
nanoMIPS compiler was to force people to audit MIPS-specific code given
the scale of the changes in nanoMIPS - there are some incompatibilities
at the assembly level but more than that the ABI changes in multiple
ways from register assignment & calling convention to kernel-user struct
layouts & other things. If we were to build existing MIPS-specific code
as-is then some of this could lead to brokenness that the tools wouldn't
have a good way to detect & reject automatically, so making people audit
the code & add in the __nanomips__ check is a sort of safety measure.
So the likelihood of your code above picking up on any future ISA
changes will probably depend upon how incompatible they are, which seems
pretty sensible.
Thanks,
Paul
^ permalink raw reply
* Re: [PATCH V32 01/27] Add the ability to lock down access to the running kernel image
From: Matthew Garrett @ 2019-04-29 22:56 UTC (permalink / raw)
To: James Morris
Cc: LSM List, Linux Kernel Mailing List, David Howells, Linux API,
Andy Lutomirski
In-Reply-To: <20190404003249.14356-2-matthewgarrett@google.com>
Hi James,
What's the best way forward with this? I'm still not entirely clear on
how it can be implemented purely as an LSM, but if you have ideas on
what sort of implementation you'd prefer I'm happy to work on that.
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Jann Horn @ 2019-04-30 0:38 UTC (permalink / raw)
To: Linus Torvalds
Cc: Kevin Easton, Andy Lutomirski, Christian Brauner, Aleksa Sarai,
Enrico Weigelt, metux IT consult, Al Viro, David Howells,
Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <CAHk-=wi_N81mKYFz33ycoWiL7_tGbZBMJOsAs16inYzSza+OEw@mail.gmail.com>
On Mon, Apr 29, 2019 at 4:21 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Apr 29, 2019 at 12:55 PM Jann Horn <jannh@google.com> wrote:
> >
> > ... I guess that already has a name, and it's called vfork(). (Well,
> > except that the Linux vfork() isn't a real vfork().)
>
> What?
>
> Linux vfork() is very much a real vfork(). What do you mean?
... uuuh, whoops. Turns out I don't know what I'm talking about.
Nevermind. For some reason I thought vfork() was just
CLONE_VFORK|SIGCHLD, but now I see I got that completely wrong.
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Linus Torvalds @ 2019-04-30 2:16 UTC (permalink / raw)
To: Jann Horn
Cc: Kevin Easton, Andy Lutomirski, Christian Brauner, Aleksa Sarai,
Enrico Weigelt, metux IT consult, Al Viro, David Howells,
Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <CAG48ez1CV54c1xZ9s26ym=9avkihiNi=ppW-CWA1-qrCpYdc1A@mail.gmail.com>
On Mon, Apr 29, 2019 at 5:39 PM Jann Horn <jannh@google.com> wrote:
>
> ... uuuh, whoops. Turns out I don't know what I'm talking about.
Well, apparently there's some odd libc issue accoprding to Florian, so
there *might* be something to it.
> Nevermind. For some reason I thought vfork() was just
> CLONE_VFORK|SIGCHLD, but now I see I got that completely wrong.
Well, inside the kernel, that's actually *very* close to what vfork() is:
SYSCALL_DEFINE0(vfork)
{
return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
0, NULL, NULL, 0);
}
but that's just an internal implementation detail. It's a real vfork()
and should act as the traditional BSD "share everything" without any
address space copying. The CLONE_VFORK flag is what does the "wait for
child to exit or execve" magic.
Note that vfork() is "exciting" for the compiler in much the same way
"setjmp/longjmp()" is, because of the shared stack use in the child
and the parent. It is *very* easy to get this wrong and cause massive
and subtle memory corruption issues because the parent returns to
something that has been messed up by the child.
That may be why some libc might end up just using "fork()", because it
ends up avoiding bugs in user space.
(In fact, if I recall correctly, the _reason_ we have an explicit
'vfork()' entry point rather than using clone() with magic parameters
was that the lack of arguments meant that you didn't have to
save/restore any registers in user space, which made the whole stack
issue simpler. But it's been two decades, so my memory is bitrotting).
Also, particularly if you have a big address space, vfork()+execve()
can be quite a bit faster than fork()+execve(). Linux fork() is pretty
efficient, but if you have gigabytes of VM space to copy, it's going
to take time even if you do it fairly well.
Linus
^ permalink raw reply
* Re: [PATCH V32 01/27] Add the ability to lock down access to the running kernel image
From: Andrew Donnellan @ 2019-04-30 5:15 UTC (permalink / raw)
To: Daniel Axtens, Matthew Garrett
Cc: James Morris, LSM List, Linux Kernel Mailing List, David Howells,
Linux API, Andy Lutomirski, linuxppc-dev, Michael Ellerman, cmr
In-Reply-To: <87tvehxvh0.fsf@dja-thinkpad.axtens.net>
On 29/4/19 2:54 pm, Daniel Axtens wrote:
> Hi,
>
>>>> I'm thinking about whether we should lock down the powerpc xmon debug
>>>> monitor - intuitively, I think the answer is yes if for no other reason
>>>> than Least Astonishment, when lockdown is enabled you probably don't
>>>> expect xmon to keep letting you access kernel memory.
>>>
>>> The original patchset contained a sysrq hotkey to allow physically
>>> present users to disable lockdown, so I'm not super concerned about
>>> this case - I could definitely be convinced otherwise, though.
>
> So Mimi contacted me offlist and very helpfully provided me with a much
> better and less confused justification for disabling xmon in lockdown:
>
> On x86, physical presence (== console access) is a trigger to
> disable/enable lockdown mode.
>
> In lockdown mode, you're not supposed to be able to modify memory. xmon
> allows you to modify memory, and therefore shouldn't be allowed in
> lockdown.
>
> So, if you can disable lockdown on the console that's probably OK, but
> it should be specifically disabling lockdown, not randomly editing
> memory with xmon.
That makes sense.
--
Andrew Donnellan OzLabs, ADL Canberra
andrew.donnellan@au1.ibm.com IBM Australia Limited
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Florian Weimer @ 2019-04-30 7:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
Aleksa Sarai, Enrico Weigelt, metux IT consult, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <CAHk-=wiPv4QJBC0qX8xxnT5P2C7S5uDG0HKdvdSpcoXaHG91tQ@mail.gmail.com>
* Linus Torvalds:
> On Mon, Apr 29, 2019 at 1:38 PM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> In Linux-as-the-ABI (as opposed to Linux-as-the-implementation), vfork
>> is sometimes implemented as fork, so applications cannot rely on the
>> vfork behavior regarding the stopped parent and the shared address
>> space.
>
> What broken library does that?
>
> Sure, we didn't have a proper vfork() long long long ago. But that
> predates both git and BK, so it's some time in the 90's. We've had a
> proper vfork() *forever*.
It's not so much about libraries, it's alternative implementations of
the system call interface: valgrind, qemu-user and WSL. For valgrind
and qemu-user, it's about cloning the internal data structures, so that
the subprocess does not clobber what's in the parent process (which may
have multiple threads and may not be fully blocked due to vfork). For
WSL, who knows what's going on there.
>> In fact, it would be nice to have a flag we can check in the posix_spawn
>> implementation, so that we can support vfork-as-fork without any run
>> time cost to native Linux.
>
> No. Just make a bug-report to whatever broken library you use. What's
> the point of having a library that can't even get vfork() right? Why
> would you want to have a flag to say "vfork is broken"?
It's apparently quite difficult to fix valgrind and qemu-user. WSL is
apparently not given the resources it needs, yet a surprising number of
people see it as a viable replacement and report what are essentially
vfork-related bugs.
If I had the flag, I could at least fix posix_spawn in glibc to consult
it before assuming that vfork shares address space. (The sharing allows
straightforward reporting of the vfork error code, without resorting to
pipes or creating a MAP_SHARED mapping.) For obvious reasons, I do not
want to apply the workaround unconditionally.
Thanks,
Florian
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Florian Weimer @ 2019-04-30 8:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
Aleksa Sarai, Enrico Weigelt, metux IT consult, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <CAHk-=wg73au-kvOwWpPDY+rXrz8O5gwrcPiw1FZx-Qr2PqpRFg@mail.gmail.com>
* Linus Torvalds:
> Note that vfork() is "exciting" for the compiler in much the same way
> "setjmp/longjmp()" is, because of the shared stack use in the child
> and the parent. It is *very* easy to get this wrong and cause massive
> and subtle memory corruption issues because the parent returns to
> something that has been messed up by the child.
Just using a wrapper around vfork is enough for that, if the return
address is saved on the stack. It's surprising hard to write a test
case for that, but the corruption is definitely there.
> (In fact, if I recall correctly, the _reason_ we have an explicit
> 'vfork()' entry point rather than using clone() with magic parameters
> was that the lack of arguments meant that you didn't have to
> save/restore any registers in user space, which made the whole stack
> issue simpler. But it's been two decades, so my memory is bitrotting).
That's an interesting point. Using a callback-style interface avoids
that because you never need to restore the registers in the new
subprocess. It's still appropriate to use an assembler implementation,
I think, because it will be more obviously correct.
> Also, particularly if you have a big address space, vfork()+execve()
> can be quite a bit faster than fork()+execve(). Linux fork() is pretty
> efficient, but if you have gigabytes of VM space to copy, it's going
> to take time even if you do it fairly well.
vfork is also more benign from a memory accounting perspective. In some
environments, it's not possible to call fork from a large process
because the accounting assumes (conservatively) that the new process
will dirty a lot of its private memory.
Thanks,
Florian
^ permalink raw reply
* Re: [PATCH v5 1/6] arm64: HWCAP: add support for AT_HWCAP2
From: Andrew Murray @ 2019-04-30 11:09 UTC (permalink / raw)
To: Dave Martin
Cc: Will Deacon, Catalin Marinas, Szabolcs Nagy, linux-arm-kernel,
Mark Rutland, Phil Blundell, libc-alpha, linux-api,
Suzuki K Poulose
In-Reply-To: <20190416163041.GU3567@e103592.cambridge.arm.com>
On Tue, Apr 16, 2019 at 05:30:41PM +0100, Dave Martin wrote:
> On Tue, Apr 16, 2019 at 02:51:57PM +0100, Will Deacon wrote:
> > On Tue, Apr 09, 2019 at 10:52:40AM +0100, Andrew Murray wrote:
> > > As we will exhaust the first 32 bits of AT_HWCAP let's start
> > > exposing AT_HWCAP2 to userspace to give us up to 64 caps.
> > >
> > > Whilst it's possible to use the remaining 32 bits of AT_HWCAP, we
> > > prefer to expand into AT_HWCAP2 in order to provide a consistent
> > > view to userspace between ILP32 and LP64. However internal to the
> > > kernel we prefer to continue to use the full space of elf_hwcap.
> > >
> > > To reduce complexity and allow for future expansion, we now
> > > represent hwcaps in the kernel as ordinals and use a
> > > KERNEL_HWCAP_ prefix. This allows us to support automatic feature
> > > based module loading for all our hwcaps.
> > >
> > > We introduce cpu_set_feature to set hwcaps which complements the
> > > existing cpu_have_feature helper. These helpers allow us to clean
> > > up existing direct uses of elf_hwcap and reduce any future effort
> > > required to move beyond 64 caps.
> > >
> > > For convenience we also introduce cpu_{have,set}_named_feature which
> > > makes use of the cpu_feature macro to allow providing a hwcap name
> > > without a {KERNEL_}HWCAP_ prefix.
> > >
> > > Signed-off-by: Andrew Murray <andrew.murray@arm.com>
> > > ---
> > > Documentation/arm64/elf_hwcaps.txt | 14 +++--
> > > arch/arm64/crypto/aes-ce-ccm-glue.c | 2 +-
> > > arch/arm64/crypto/aes-neonbs-glue.c | 2 +-
> > > arch/arm64/crypto/chacha-neon-glue.c | 2 +-
> > > arch/arm64/crypto/crct10dif-ce-glue.c | 4 +-
> > > arch/arm64/crypto/ghash-ce-glue.c | 8 +--
> > > arch/arm64/crypto/nhpoly1305-neon-glue.c | 2 +-
> > > arch/arm64/crypto/sha256-glue.c | 4 +-
> > > arch/arm64/include/asm/cpufeature.h | 22 ++++----
> > > arch/arm64/include/asm/hwcap.h | 52 ++++++++++++++++++-
> > > arch/arm64/include/uapi/asm/hwcap.h | 2 +-
> > > arch/arm64/kernel/cpufeature.c | 66 ++++++++++++------------
> > > arch/arm64/kernel/cpuinfo.c | 2 +-
> > > arch/arm64/kernel/fpsimd.c | 4 +-
> > > drivers/clocksource/arm_arch_timer.c | 8 +++
> > > 15 files changed, 131 insertions(+), 63 deletions(-)
> > >
> > > diff --git a/Documentation/arm64/elf_hwcaps.txt b/Documentation/arm64/elf_hwcaps.txt
> > > index 13d6691b37be..c04f8e87bab8 100644
> > > --- a/Documentation/arm64/elf_hwcaps.txt
> > > +++ b/Documentation/arm64/elf_hwcaps.txt
> > > @@ -13,9 +13,9 @@ architected discovery mechanism available to userspace code at EL0. The
> > > kernel exposes the presence of these features to userspace through a set
> > > of flags called hwcaps, exposed in the auxilliary vector.
> > >
> > > -Userspace software can test for features by acquiring the AT_HWCAP entry
> > > -of the auxilliary vector, and testing whether the relevant flags are
> > > -set, e.g.
> > > +Userspace software can test for features by acquiring the AT_HWCAP or
> > > +AT_HWCAP2 entry of the auxiliary vector, and testing whether the relevant
> > > +flags are set, e.g.
> > >
> > > bool floating_point_is_present(void)
> > > {
> > > @@ -194,3 +194,11 @@ HWCAP_PACG
> > > Functionality implied by ID_AA64ISAR1_EL1.GPA == 0b0001 or
> > > ID_AA64ISAR1_EL1.GPI == 0b0001, as described by
> > > Documentation/arm64/pointer-authentication.txt.
> > > +
> > > +
> > > +4. Unused AT_HWCAP bits
> > > +-----------------------
> > > +
> > > +Each AT_HWCAP and AT_HWCAP2 entry provides for up to 32 hwcaps contained
> > > +in bits [31:0]. For interoperation with userspace we guarantee that bits
> > > +62 and 63 of AT_HWCAP will always be returned as 0.
> >
> > I'm a little nervous about the first sentence here, since it could be
> > taken to mean that we will never allocate 61:32. Mind if I drop it?
>
> Ack: I don't think we want to say explicitly that we will never use
> those bits, apart from AT_HWCAP[63:62] for which there are specific
> reasons.
No problem with me. Thanks for making this change (and const_ilog2).
>
> (For now of course, we won't use them.)
>
> > > diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> > > index aa4ec53281ce..6cc8aff83805 100644
> > > --- a/drivers/clocksource/arm_arch_timer.c
> > > +++ b/drivers/clocksource/arm_arch_timer.c
> > > @@ -833,7 +833,11 @@ static void arch_timer_evtstrm_enable(int divider)
> > > cntkctl |= (divider << ARCH_TIMER_EVT_TRIGGER_SHIFT)
> > > | ARCH_TIMER_VIRT_EVT_EN;
> > > arch_timer_set_cntkctl(cntkctl);
> > > +#ifdef CONFIG_ARM64
> > > + cpu_set_named_feature(EVTSTRM);
> > > +#else
> > > elf_hwcap |= HWCAP_EVTSTRM;
> > > +#endif
> > > #ifdef CONFIG_COMPAT
> > > compat_elf_hwcap |= COMPAT_HWCAP_EVTSTRM;
> > > #endif
> > > @@ -1055,7 +1059,11 @@ static int arch_timer_cpu_pm_notify(struct notifier_block *self,
> > > } else if (action == CPU_PM_ENTER_FAILED || action == CPU_PM_EXIT) {
> > > arch_timer_set_cntkctl(__this_cpu_read(saved_cntkctl));
> > >
> > > +#ifdef CONFIG_ARM64
> > > + if (cpu_have_named_feature(EVTSTRM))
> > > +#else
> > > if (elf_hwcap & HWCAP_EVTSTRM)
> > > +#endif
> >
> > I think this is an indication that the abstraction isn't quite right and
> > should probably be done in an arch-helped via asm/arch_timer.h. However,
> > that can be done as a separate patch later on.
>
> It probably does make sense to add an arch-specific helper for that.
>
> Given that we don't want to encourage this kind of poking about in
> elf_hwcap. It might make sense to have a single-purpose helper just for
> checking this flag.
I'll propose something in a separate patch.
Thanks,
Andrew Murray
>
> Cheers
> ---Dave
^ permalink raw reply
* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Oleg Nesterov @ 2019-04-30 11:53 UTC (permalink / raw)
To: Joel Fernandes
Cc: Christian Brauner, linux-kernel, luto, rostedt, dancol, sspatil,
jannh, surenb, timmurray, Jonathan Kowalski, torvalds,
kernel-team, Andrew Morton, Arnd Bergmann, Eric W. Biederman,
Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
Stephen Rothwell, Thomas Gleixner <tgl>
In-Reply-To: <20190429163259.GA201155@google.com>
On 04/29, Joel Fernandes wrote:
>
> On Mon, Apr 29, 2019 at 04:20:30PM +0200, Oleg Nesterov wrote:
> > On 04/29, Joel Fernandes wrote:
> > >
> > > However, in your code above, it is avoided because we get:
> > >
> > > Task A (poller) Task B (exiting task being polled)
> > > ------------ ----------------
> > > poll() called
> > > add_wait_queue()
> > > exit_state is set to non-zero
> > > read exit_state
> > > remove_wait_queue()
> > > wake_up_all()
> >
> > just to clarify... No, sys_poll() path doesn't do remove_wait_queue() until
> > it returns to user mode, and that is why we can't race with set-exit_code +
> > wake_up().
>
> I didn't follow what you mean, the removal from the waitqueue happens in
> free_poll_entry() called from poll_freewait() which happens from
> do_sys_poll() which is before the syscall returns to user mode. Could you
> explain more?
Hmm. I do not really understand the question... Sure, do_sys_poll() does
poll_freewait() before sysret or even before return from syscall, but why
does this matter? This is the exit path, it frees the memory, does fput(),
etc, f_op->poll() won't be call after that.
> > pidfd_poll() can race with the exiting task, miss exit_code != 0, and return
> > zero. However, do_poll() won't block after that and pidfd_poll() will be called
> > again.
>
> Here also I didn't follow what you mean. If exit_code is read as 0 in
> pidfd_poll(), then in do_poll() the count will be 0 and it will block in
> poll_schedule_timeout(). Right?
No. Please note the pwq->triggered check and please read __pollwake().
But if you want to understand this you can forget about poll/select. It is
a bit complicated, in particular because it has to do set_current_state()
right before schedule() and thus it plays games with pwq->triggered. But in
essence this doesn't differ too much from the plain wait_event-like code
(although you can also look at wait_woken/woken_wake_function).
If remove_wait_queue() could happem before wake_up_all() (like in your pseudo-
code above), then pidfd_poll() or any other ->poll() method could miss _both_
the condition and wakeup. But sys_poll() doesn't do this, so it is fine to miss
the condition and rely on wake_up_all() which ensures we won't block and the
next iteration must see condition == T.
Oleg.
^ permalink raw reply
* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Oleg Nesterov @ 2019-04-30 12:07 UTC (permalink / raw)
To: Joel Fernandes
Cc: Christian Brauner, linux-kernel, luto, rostedt, dancol, sspatil,
jannh, surenb, timmurray, Jonathan Kowalski, torvalds,
kernel-team, Andrew Morton, Arnd Bergmann, Eric W. Biederman,
Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
Stephen Rothwell, Thomas Gleixner <tgl>
In-Reply-To: <20190430115332.GB23020@redhat.com>
On 04/30, Oleg Nesterov wrote:
>
> > > pidfd_poll() can race with the exiting task, miss exit_code != 0, and return
> > > zero. However, do_poll() won't block after that and pidfd_poll() will be called
> > > again.
> >
> > Here also I didn't follow what you mean. If exit_code is read as 0 in
> > pidfd_poll(), then in do_poll() the count will be 0 and it will block in
> > poll_schedule_timeout(). Right?
>
> No. Please note the pwq->triggered check and please read __pollwake().
>
> But if you want to understand this you can forget about poll/select. It is
> a bit complicated, in particular because it has to do set_current_state()
> right before schedule() and thus it plays games with pwq->triggered. But in
> essence this doesn't differ too much from the plain wait_event-like code
> (although you can also look at wait_woken/woken_wake_function).
>
> If remove_wait_queue() could happem before wake_up_all() (like in your pseudo-
> code above), then pidfd_poll() or any other ->poll() method could miss _both_
> the condition and wakeup. But sys_poll() doesn't do this, so it is fine to miss
> the condition and rely on wake_up_all() which ensures we won't block and the
> next iteration must see condition == T.
Oh, just in case... If it is not clear, of course I am talking about the case
when wake_up_call() was already called when we check the condition. Otherwise
everything is simple.
Oleg.
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Oleg Nesterov @ 2019-04-30 12:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
Aleksa Sarai, Enrico Weigelt, metux IT consult, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Joel Fernandes, Daniel Colascione
In-Reply-To: <CAHk-=wi_N81mKYFz33ycoWiL7_tGbZBMJOsAs16inYzSza+OEw@mail.gmail.com>
On 04/29, Linus Torvalds wrote:
>
> Linux vfork() is very much a real vfork(). What do you mean?
Yes, but I am wondering if man vfork should clarify what "child terminates"
actually means. I mean, the child can do clone(CLONE_THREAD) + sys_exit(),
this will wake the parent thread up before the child process exits or execs.
I see nothing wrong, but I was always curious whether it was designed this
way on purpose or not.
Oleg.
^ permalink raw reply
* Re: [PATCH v1 1/2] Add polling support to pidfd
From: Joel Fernandes @ 2019-04-30 15:49 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, linux-kernel, luto, rostedt, dancol, sspatil,
jannh, surenb, timmurray, Jonathan Kowalski, torvalds,
kernel-team, Andrew Morton, Arnd Bergmann, Eric W. Biederman,
Greg Kroah-Hartman, Ingo Molnar, Jann Horn, linux-kselftest,
Michal Hocko, Peter Zijlstra (Intel), Serge Hallyn, Shuah Khan,
Stephen Rothwell, Thomas Gleixner <tgl>
In-Reply-To: <20190430115332.GB23020@redhat.com>
On Tue, Apr 30, 2019 at 01:53:33PM +0200, Oleg Nesterov wrote:
> On 04/29, Joel Fernandes wrote:
> >
> > On Mon, Apr 29, 2019 at 04:20:30PM +0200, Oleg Nesterov wrote:
> > > On 04/29, Joel Fernandes wrote:
> > > >
> > > > However, in your code above, it is avoided because we get:
> > > >
> > > > Task A (poller) Task B (exiting task being polled)
> > > > ------------ ----------------
> > > > poll() called
> > > > add_wait_queue()
> > > > exit_state is set to non-zero
> > > > read exit_state
> > > > remove_wait_queue()
> > > > wake_up_all()
> > >
> > > just to clarify... No, sys_poll() path doesn't do remove_wait_queue() until
> > > it returns to user mode, and that is why we can't race with set-exit_code +
> > > wake_up().
> >
> > I didn't follow what you mean, the removal from the waitqueue happens in
> > free_poll_entry() called from poll_freewait() which happens from
> > do_sys_poll() which is before the syscall returns to user mode. Could you
> > explain more?
>
> Hmm. I do not really understand the question... Sure, do_sys_poll() does
> poll_freewait() before sysret or even before return from syscall, but why
> does this matter? This is the exit path, it frees the memory, does fput(),
> etc, f_op->poll() won't be call after that.
Ok, we are on the same page on this.
> > > pidfd_poll() can race with the exiting task, miss exit_code != 0, and return
> > > zero. However, do_poll() won't block after that and pidfd_poll() will be called
> > > again.
> >
> > Here also I didn't follow what you mean. If exit_code is read as 0 in
> > pidfd_poll(), then in do_poll() the count will be 0 and it will block in
> > poll_schedule_timeout(). Right?
>
> No. Please note the pwq->triggered check and please read __pollwake().
>
> But if you want to understand this you can forget about poll/select. It is
> a bit complicated, in particular because it has to do set_current_state()
> right before schedule() and thus it plays games with pwq->triggered. But in
> essence this doesn't differ too much from the plain wait_event-like code
> (although you can also look at wait_woken/woken_wake_function).
>
> If remove_wait_queue() could happem before wake_up_all() (like in your pseudo-
> code above), then pidfd_poll() or any other ->poll() method could miss _both_
> the condition and wakeup. But sys_poll() doesn't do this, so it is fine to miss
> the condition and rely on wake_up_all() which ensures we won't block and the
> next iteration must see condition == T.
Agreed. In my pseudo-code above, I meant removal from waitqueue only once we
are not going to be blocking in poll and returning to userspace. I may have
messed the sequence of events, but my point was to show the race I had in
mind (missing a wake up due to adding to the waitqueue too late inside
pidfd_poll()). Anyway, I will repost with your suggested change and send it
soon. Thanks for the discussions.
thanks,
- Joel
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Linus Torvalds @ 2019-04-30 16:19 UTC (permalink / raw)
To: Florian Weimer
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
Aleksa Sarai, Enrico Weigelt, metux IT consult, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <87r29jaoov.fsf@oldenburg2.str.redhat.com>
On Tue, Apr 30, 2019 at 1:21 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> > (In fact, if I recall correctly, the _reason_ we have an explicit
> > 'vfork()' entry point rather than using clone() with magic parameters
> > was that the lack of arguments meant that you didn't have to
> > save/restore any registers in user space, which made the whole stack
> > issue simpler. But it's been two decades, so my memory is bitrotting).
>
> That's an interesting point. Using a callback-style interface avoids
> that because you never need to restore the registers in the new
> subprocess. It's still appropriate to use an assembler implementation,
> I think, because it will be more obviously correct.
I agree that a callback interface would have been a whole lot more
obvious and less prone to subtle problems.
But if you want vfork() because the programs you want to build use it,
that's the interface you need..
Of course, if you *don't* need the exact vfork() semantics, clone
itself actually very much supports a callback model with s separate
stack. You can basically do this:
- allocate new stack for the child
- in trivial asm wrapper, do:
- push the callback address on the child stack
- clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
- "ret"
- free new stack
where the "ret" in the child will just go to the callback, while the
parent (eventually) just returns from the trivial wrapper and frees
the new stack (which by definition is no longer used, since the child
has exited or execve'd.
So you can most definitely create a "vfork_with_child_callback()" with
clone, and it would arguably be a much superior interface to vfork()
anyway (maybe you'd like to pass in some arguments to the callback too
- add more stack setup for the child as needed), but it wouldn't be
the right solution for programs that just want to use the standard BSD
vfork() model.
> vfork is also more benign from a memory accounting perspective. In some
> environments, it's not possible to call fork from a large process
> because the accounting assumes (conservatively) that the new process
> will dirty a lot of its private memory.
Indeed.
Linus
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Linus Torvalds @ 2019-04-30 16:24 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
Aleksa Sarai, Enrico Weigelt, metux IT consult, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Joel Fernandes, Daniel Colascione
In-Reply-To: <20190430123901.GD23020@redhat.com>
On Tue, Apr 30, 2019 at 5:39 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Yes, but I am wondering if man vfork should clarify what "child terminates"
> actually means. I mean, the child can do clone(CLONE_THREAD) + sys_exit(),
> this will wake the parent thread up before the child process exits or execs.
That falls solidly into the "give people rope" category.
If the vfork() child wants to mess with the parent, it has many easier
ways to do it than create more threads.
As mentioned, the real problem with vfork() tends to be that the child
unintentionally messes with the parent because it just gets the stack
sharing wrong. No need to add intention there.
> I see nothing wrong, but I was always curious whether it was designed this
> way on purpose or not.
Oh, it's definitely on purpose. Trying to do some nested usage count
would be horrendously complex, and even a trivial "don't allow any
other clone() calls if we already have a vfork completion pending" is
just unnecessary logic.
Because at least in *theory*, there's actually nothing horribly wrong
with allowing a thread to be created during the vfork(). I don't see
the _point_, but it's not conceptually something that couldn't work
(you'd need a separate thread stack etc, but that's normal clone()).
So no, there's no safety or bogus "you can't do that". If you want to
play games after vfork(), go wild.
Linus
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Linus Torvalds @ 2019-04-30 16:26 UTC (permalink / raw)
To: Florian Weimer
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
Aleksa Sarai, Enrico Weigelt, metux IT consult, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <CAHk-=wiM8VQ_Ny6Y=fTqE9Aq1LuDdU5bKfnXPyYXU1bi7GtU4w@mail.gmail.com>
On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with s separate
> stack. You can basically do this:
>
> - allocate new stack for the child
> - in trivial asm wrapper, do:
> - push the callback address on the child stack
> - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
> - "ret"
> - free new stack
>
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd.
In fact, Florian, maybe this is the solution to your "I want to use
vfork for posix_spawn(), but I don't know if I can trust it" problem.
Just use clone() directly. On WSL it will presumably just fail, and
you can then fall back on doing the slow stupid
fork+pipes-to-communicate.
On valgrind, I don't know what will happen. Maybe it will just do an
unchecked posix_spawn() because valgrind doesn't catch it?
Linus
^ permalink raw reply
* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Florian Weimer @ 2019-04-30 17:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
Aleksa Sarai, Enrico Weigelt, metux IT consult, Al Viro,
David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <CAHk-=wgxqoH=Cztd7W67pYJyZwPPNtY5P7K66u2QTDEpEYHVAw@mail.gmail.com>
* Linus Torvalds:
> On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Of course, if you *don't* need the exact vfork() semantics, clone
>> itself actually very much supports a callback model with s separate
>> stack. You can basically do this:
>>
>> - allocate new stack for the child
>> - in trivial asm wrapper, do:
>> - push the callback address on the child stack
>> - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
>> - "ret"
>> - free new stack
>>
>> where the "ret" in the child will just go to the callback, while the
>> parent (eventually) just returns from the trivial wrapper and frees
>> the new stack (which by definition is no longer used, since the child
>> has exited or execve'd.
>
> In fact, Florian, maybe this is the solution to your "I want to use
> vfork for posix_spawn(), but I don't know if I can trust it" problem.
>
> Just use clone() directly. On WSL it will presumably just fail, and
> you can then fall back on doing the slow stupid
> fork+pipes-to-communicate.
We already use clone. I don't know why. We should add a comment that
provides the reason.
> On valgrind, I don't know what will happen. Maybe it will just do an
> unchecked posix_spawn() because valgrind doesn't catch it?
I think what happens with these emulators that they use fork (no shared
address space) but suspend the parent thread. clone with CLONE_VFORK
definitely does not fail. That mostly works, except that you do not get
back the error code from the execve. Instead, the process is considered
launched, and the caller collects the exit status from the _exit after
the failed execve.
> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with s separate
> stack. You can basically do this:
>
> - allocate new stack for the child
> - in trivial asm wrapper, do:
> - push the callback address on the child stack
> - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
> - "ret"
> - free new stack
>
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd.
>
> So you can most definitely create a "vfork_with_child_callback()" with
> clone, and it would arguably be a much superior interface to vfork()
> anyway (maybe you'd like to pass in some arguments to the callback too
> - add more stack setup for the child as needed), but it wouldn't be
> the right solution for programs that just want to use the standard BSD
> vfork() model.
As far as we understand the situation, we believe that we absolutely
must block all signals for both the parent thread and the new
subprocess. Signals can be unblocked in the subprocess, but only after
setting their handlers to SIG_DFL or SIG_IGN. (Parent signal handlers
cannot run in the subprocess because application-supplied signal
handlers generally do not expect to run with a corrupt TCB—or a
different PID.)
At that point, I wonder if we can just skip the stack setup for the
CLONE_VFORK case and reuse the existing stack.
Thanks,
Florian
^ permalink raw reply
* Re: [PATCH V32 22/27] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Jann Horn @ 2019-04-30 19:19 UTC (permalink / raw)
To: Matthew Garrett, bpf
Cc: James Morris, linux-security-module, kernel list, David Howells,
Linux API, Andy Lutomirski, Alexei Starovoitov, Matthew Garrett,
Network Development, Chun-Yi Lee, Daniel Borkmann
In-Reply-To: <20190404003249.14356-23-matthewgarrett@google.com>
+bpf list
On Wed, Apr 3, 2019 at 8:34 PM Matthew Garrett
<matthewgarrett@google.com> wrote:
> There are some bpf functions can be used to read kernel memory:
> bpf_probe_read, bpf_probe_write_user and bpf_trace_printk. These allow
> private keys in kernel memory (e.g. the hibernation image signing key) to
> be read by an eBPF program and kernel memory to be altered without
> restriction. Disable them if the kernel has been locked down in
> confidentiality mode.
>
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> cc: netdev@vger.kernel.org
> cc: Chun-Yi Lee <jlee@suse.com>
> cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> ---
> kernel/trace/bpf_trace.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 8b068adb9da1..9e8eda605b5e 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -137,6 +137,9 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
> {
> int ret;
>
> + if (kernel_is_locked_down("BPF", LOCKDOWN_CONFIDENTIALITY))
> + return -EINVAL;
> +
> ret = probe_kernel_read(dst, unsafe_ptr, size);
> if (unlikely(ret < 0))
> memset(dst, 0, size);
This looks wrong. bpf_probe_read_proto is declared with an
ARG_PTR_TO_UNINIT_MEM argument, so if you don't do a "memset(dst, 0,
size);" like in the probe_kernel_read() error path, the BPF program
can read uninitialized memory.
^ permalink raw reply
* [PATCH] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Yu-cheng Yu @ 2019-05-01 21:12 UTC (permalink / raw)
To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
Oleg Nesterov, Pa
Cc: Yu-cheng Yu
An ELF file's .note.gnu.property indicates features the executable file
can support. For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
GNU_PROPERTY_X86_FEATURE_1_SHSTK.
This patch was part of the Control-flow Enforcement series; the original
patch is here: https://lkml.org/lkml/2018/11/20/205. Dave Martin responded
that ARM recently introduced new features to NT_GNU_PROPERTY_TYPE_0 with
properties closely modelled on GNU_PROPERTY_X86_FEATURE_1_AND, and it is
logical to split out the generic part. Here it is.
With this patch, if an arch needs to setup features from ELF properties,
it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and a specific
arch_setup_property().
For example, for X86_64:
int arch_setup_property(void *ehdr, void *phdr, struct file *f, bool inter)
{
int r;
uint32_t property;
r = get_gnu_property(ehdr, phdr, f, GNU_PROPERTY_X86_FEATURE_1_AND,
&property);
...
}
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
fs/Kconfig.binfmt | 4 +
fs/Makefile | 1 +
fs/binfmt_elf.c | 13 ++
fs/gnu_property.c | 363 +++++++++++++++++++++++++++++++++++++++
include/linux/elf.h | 12 ++
include/uapi/linux/elf.h | 8 +
6 files changed, 401 insertions(+)
create mode 100644 fs/gnu_property.c
diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index b795f8da81f3..175a1f58e785 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -35,6 +35,10 @@ config COMPAT_BINFMT_ELF
config ARCH_BINFMT_ELF_STATE
bool
+config ARCH_USE_GNU_PROPERTY
+ bool
+ depends on 64BIT
+
config BINFMT_ELF_FDPIC
bool "Kernel support for FDPIC ELF binaries"
default y if !BINFMT_ELF
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..8a35abbebf8b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -44,6 +44,7 @@ obj-$(CONFIG_BINFMT_ELF) += binfmt_elf.o
obj-$(CONFIG_COMPAT_BINFMT_ELF) += compat_binfmt_elf.o
obj-$(CONFIG_BINFMT_ELF_FDPIC) += binfmt_elf_fdpic.o
obj-$(CONFIG_BINFMT_FLAT) += binfmt_flat.o
+obj-$(CONFIG_ARCH_USE_GNU_PROPERTY) += gnu_property.o
obj-$(CONFIG_FS_MBCACHE) += mbcache.o
obj-$(CONFIG_FS_POSIX_ACL) += posix_acl.o
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 7d09d125f148..40aa4a4fd64d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1076,6 +1076,19 @@ static int load_elf_binary(struct linux_binprm *bprm)
goto out_free_dentry;
}
+ if (interpreter) {
+ retval = arch_setup_property(&loc->interp_elf_ex,
+ interp_elf_phdata,
+ interpreter, true);
+ } else {
+ retval = arch_setup_property(&loc->elf_ex,
+ elf_phdata,
+ bprm->file, false);
+ }
+
+ if (retval < 0)
+ goto out_free_dentry;
+
if (elf_interpreter) {
unsigned long interp_map_addr = 0;
diff --git a/fs/gnu_property.c b/fs/gnu_property.c
new file mode 100644
index 000000000000..656ea3951840
--- /dev/null
+++ b/fs/gnu_property.c
@@ -0,0 +1,363 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Extract an ELF file's .note.gnu.property.
+ *
+ * The path from the ELF header to the note section is the following:
+ * elfhdr->elf_phdr->elf_note->property[].
+ */
+
+#include <uapi/linux/elf-em.h>
+#include <linux/processor.h>
+#include <linux/binfmts.h>
+#include <linux/elf.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/string.h>
+#include <linux/compat.h>
+
+/*
+ * The .note.gnu.property layout:
+ *
+ * struct elf_note {
+ * u32 n_namesz; --> sizeof(n_name[]); always (4)
+ * u32 n_ndescsz;--> sizeof(property[])
+ * u32 n_type; --> always NT_GNU_PROPERTY_TYPE_0
+ * };
+ * char n_name[4]; --> always 'GNU\0'
+ *
+ * struct {
+ * struct gnu_property {
+ * u32 pr_type;
+ * u32 pr_datasz;
+ * };
+ * u8 pr_data[pr_datasz];
+ * }[];
+ */
+
+#define BUF_SIZE (PAGE_SIZE / 4)
+
+struct gnu_property {
+ u32 pr_type;
+ u32 pr_datasz;
+};
+
+typedef bool (test_item_fn)(void *buf, u32 *arg, u32 type);
+typedef void *(next_item_fn)(void *buf, u32 *arg, u32 type);
+
+static inline bool test_note_type(void *buf, u32 *align, u32 note_type)
+{
+ struct elf_note *n = buf;
+
+ return ((n->n_type == note_type) && (n->n_namesz == 4) &&
+ (memcmp(n + 1, "GNU", 4) == 0));
+}
+
+static inline void *next_note(void *buf, u32 *align, u32 note_type)
+{
+ struct elf_note *n = buf;
+ u64 size;
+
+ if (check_add_overflow((u64)sizeof(*n), (u64)n->n_namesz, &size))
+ return NULL;
+
+ size = round_up(size, *align);
+
+ if (check_add_overflow(size, (u64)n->n_descsz, &size))
+ return NULL;
+
+ size = round_up(size, *align);
+
+ if (buf + size < buf)
+ return NULL;
+ else
+ return (buf + size);
+}
+
+static inline bool test_property(void *buf, u32 *max_type, u32 pr_type)
+{
+ struct gnu_property *pr = buf;
+
+ /*
+ * Property types must be in ascending order.
+ * Keep track of the max when testing each.
+ */
+ if (pr->pr_type > *max_type)
+ *max_type = pr->pr_type;
+
+ return (pr->pr_type == pr_type);
+}
+
+static inline void *next_property(void *buf, u32 *max_type, u32 pr_type)
+{
+ struct gnu_property *pr = buf;
+
+ if ((buf + sizeof(*pr) + pr->pr_datasz < buf) ||
+ (pr->pr_type > pr_type) ||
+ (pr->pr_type > *max_type))
+ return NULL;
+ else
+ return (buf + sizeof(*pr) + pr->pr_datasz);
+}
+
+/*
+ * Scan 'buf' for a pattern; return true if found.
+ * *pos is the distance from the beginning of buf to where
+ * the searched item or the next item is located.
+ */
+static int scan(u8 *buf, u32 buf_size, int item_size, test_item_fn test_item,
+ next_item_fn next_item, u32 *arg, u32 type, u32 *pos)
+{
+ int found = 0;
+ u8 *p, *max;
+
+ max = buf + buf_size;
+ if (max < buf)
+ return 0;
+
+ p = buf;
+
+ while ((p + item_size < max) && (p + item_size > buf)) {
+ if (test_item(p, arg, type)) {
+ found = 1;
+ break;
+ }
+
+ p = next_item(p, arg, type);
+ }
+
+ *pos = (p + item_size <= buf) ? 0 : (u32)(p - buf);
+ return found;
+}
+
+/*
+ * Search an NT_GNU_PROPERTY_TYPE_0 for the property of 'pr_type'.
+ */
+static int find_property(struct file *file, unsigned long desc_size,
+ loff_t file_offset, u8 *buf,
+ u32 pr_type, u32 *property)
+{
+ u32 buf_pos;
+ unsigned long read_size;
+ unsigned long done;
+ int found = 0;
+ int ret = 0;
+ u32 last_pr = 0;
+
+ *property = 0;
+ buf_pos = 0;
+
+ for (done = 0; done < desc_size; done += buf_pos) {
+ read_size = desc_size - done;
+ if (read_size > BUF_SIZE)
+ read_size = BUF_SIZE;
+
+ ret = kernel_read(file, buf, read_size, &file_offset);
+
+ if (ret != read_size)
+ return (ret < 0) ? ret : -EIO;
+
+ ret = 0;
+ found = scan(buf, read_size, sizeof(struct gnu_property),
+ test_property, next_property,
+ &last_pr, pr_type, &buf_pos);
+
+ if ((!buf_pos) || found)
+ break;
+
+ file_offset += buf_pos - read_size;
+ }
+
+ if (found) {
+ struct gnu_property *pr =
+ (struct gnu_property *)(buf + buf_pos);
+
+ if (pr->pr_datasz == 4) {
+ u32 *max = (u32 *)(buf + read_size);
+ u32 *data = (u32 *)((u8 *)pr + sizeof(*pr));
+
+ if (data + 1 <= max) {
+ *property = *data;
+ } else {
+ file_offset += buf_pos - read_size;
+ file_offset += sizeof(*pr);
+ ret = kernel_read(file, property, 4,
+ &file_offset);
+ }
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Search a PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
+ */
+static int find_note_type_0(struct file *file, loff_t file_offset,
+ unsigned long note_size, u32 align,
+ u32 pr_type, u32 *property)
+{
+ u8 *buf;
+ u32 buf_pos;
+ unsigned long read_size;
+ unsigned long done;
+ int found = 0;
+ int ret = 0;
+
+ buf = kmalloc(BUF_SIZE, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ *property = 0;
+ buf_pos = 0;
+
+ for (done = 0; done < note_size; done += buf_pos) {
+ read_size = note_size - done;
+ if (read_size > BUF_SIZE)
+ read_size = BUF_SIZE;
+
+ ret = kernel_read(file, buf, read_size, &file_offset);
+
+ if (ret != read_size) {
+ ret = (ret < 0) ? ret : -EIO;
+ kfree(buf);
+ return ret;
+ }
+
+ /*
+ * item_size = sizeof(struct elf_note) + elf_note.n_namesz.
+ * n_namesz is 4 for the note type we look for.
+ */
+ ret = scan(buf, read_size, sizeof(struct elf_note) + 4,
+ test_note_type, next_note,
+ &align, NT_GNU_PROPERTY_TYPE_0, &buf_pos);
+
+ file_offset += buf_pos - read_size;
+
+ if (ret && !found) {
+ struct elf_note *n =
+ (struct elf_note *)(buf + buf_pos);
+ u64 start = round_up(sizeof(*n) + n->n_namesz, align);
+ u64 total = 0;
+
+ if (check_add_overflow(start, (u64)n->n_descsz, &total)) {
+ ret = -EINVAL;
+ break;
+ }
+ total = round_up(total, align);
+
+ ret = find_property(file, n->n_descsz,
+ file_offset + start,
+ buf, pr_type, property);
+ found++;
+ file_offset += total;
+ buf_pos += total;
+ } else if (!buf_pos || ret) {
+ ret = 0;
+ *property = 0;
+ break;
+ }
+ }
+
+ kfree(buf);
+ return ret;
+}
+
+/*
+ * Look at an ELF file's PT_NOTE segments, then NT_GNU_PROPERTY_TYPE_0, then
+ * the property of pr_type.
+ *
+ * Input:
+ * file: the file to search;
+ * phdr: the file's elf header;
+ * phnum: number of entries in phdr;
+ * pr_type: the property type.
+ *
+ * Output:
+ * The property found.
+ *
+ * Return:
+ * Zero or error.
+ */
+static int scan_segments_64(struct file *file, struct elf64_phdr *phdr,
+ int phnum, u32 pr_type, u32 *property)
+{
+ int i;
+ int err = 0;
+
+ for (i = 0; i < phnum; i++, phdr++) {
+ if ((phdr->p_type != PT_NOTE) || (phdr->p_align != 8))
+ continue;
+
+ /*
+ * Search the PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
+ */
+ err = find_note_type_0(file, phdr->p_offset, phdr->p_filesz,
+ phdr->p_align, pr_type, property);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+
+#ifdef CONFIG_COMPAT
+static int scan_segments_32(struct file *file, struct elf32_phdr *phdr,
+ int phnum, u32 pr_type, u32 *property)
+{
+ int i;
+ int err = 0;
+
+ for (i = 0; i < phnum; i++, phdr++) {
+ if ((phdr->p_type != PT_NOTE) || (phdr->p_align != 4))
+ continue;
+
+ /*
+ * Search the PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
+ */
+ err = find_note_type_0(file, phdr->p_offset, phdr->p_filesz,
+ phdr->p_align, pr_type, property);
+ if (err)
+ return err;
+ }
+
+ return 0;
+}
+#endif
+
+int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
+ u32 pr_type, u32 *property)
+{
+ struct elf64_hdr *ehdr64 = ehdr_p;
+ int err = 0;
+
+ *property = 0;
+
+ if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {
+ struct elf64_phdr *phdr64 = phdr_p;
+
+ err = scan_segments_64(f, phdr64, ehdr64->e_phnum,
+ pr_type, property);
+ if (err < 0)
+ goto out;
+ } else {
+#ifdef CONFIG_COMPAT
+ struct elf32_hdr *ehdr32 = ehdr_p;
+
+ if (ehdr32->e_ident[EI_CLASS] == ELFCLASS32) {
+ struct elf32_phdr *phdr32 = phdr_p;
+
+ err = scan_segments_32(f, phdr32, ehdr32->e_phnum,
+ pr_type, property);
+ if (err < 0)
+ goto out;
+ }
+#else
+ WARN_ONCE(1, "Exec of 32-bit app, but CONFIG_COMPAT is not enabled.\n");
+ return -ENOTSUPP;
+#endif
+ }
+
+out:
+ return err;
+}
diff --git a/include/linux/elf.h b/include/linux/elf.h
index e3649b3e970e..c15febebe7f2 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -56,4 +56,16 @@ static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) {
extern int elf_coredump_extra_notes_size(void);
extern int elf_coredump_extra_notes_write(struct coredump_params *cprm);
#endif
+
+#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
+extern int arch_setup_property(void *ehdr, void *phdr, struct file *f,
+ bool interp);
+extern int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
+ u32 pr_type, u32 *feature);
+#else
+static inline int arch_setup_property(void *ehdr, void *phdr, struct file *f,
+ bool interp) { return 0; }
+static inline int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
+ u32 pr_type, u32 *feature) { return 0; }
+#endif
#endif /* _LINUX_ELF_H */
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 34c02e4290fe..7b7603a44cbc 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -372,6 +372,7 @@ typedef struct elf64_shdr {
#define NT_PRFPREG 2
#define NT_PRPSINFO 3
#define NT_TASKSTRUCT 4
+#define NT_GNU_PROPERTY_TYPE_0 5
#define NT_AUXV 6
/*
* Note to userspace developers: size of NT_SIGINFO note may increase
@@ -443,4 +444,11 @@ typedef struct elf64_note {
Elf64_Word n_type; /* Content type */
} Elf64_Nhdr;
+/* .note.gnu.property types */
+#define GNU_PROPERTY_X86_FEATURE_1_AND (0xc0000002)
+
+/* Bits of GNU_PROPERTY_X86_FEATURE_1_AND */
+#define GNU_PROPERTY_X86_FEATURE_1_IBT (0x00000001)
+#define GNU_PROPERTY_X86_FEATURE_1_SHSTK (0x00000002)
+
#endif /* _UAPI_LINUX_ELF_H */
--
2.17.1
^ permalink raw reply related
* Re: [PATCH] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Matthew Wilcox @ 2019-05-01 21:37 UTC (permalink / raw)
To: Yu-cheng Yu
Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
Oleg Nesterov, Pa
In-Reply-To: <20190501211217.5039-1-yu-cheng.yu@intel.com>
On Wed, May 01, 2019 at 02:12:17PM -0700, Yu-cheng Yu wrote:
> +++ b/fs/Kconfig.binfmt
> @@ -35,6 +35,10 @@ config COMPAT_BINFMT_ELF
> config ARCH_BINFMT_ELF_STATE
> bool
>
> +config ARCH_USE_GNU_PROPERTY
> + bool
> + depends on 64BIT
I don't think this is right. I think you should get rid of the depends line
and instead select the symbol from each of argh64 and x86 Kconfig files.
^ permalink raw reply
* Re: [PATCH] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Yu-cheng Yu @ 2019-05-01 21:54 UTC (permalink / raw)
To: Matthew Wilcox
Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
Oleg Nesterov, Pa
In-Reply-To: <20190501213709.GD28500@bombadil.infradead.org>
On Wed, 2019-05-01 at 14:37 -0700, Matthew Wilcox wrote:
> On Wed, May 01, 2019 at 02:12:17PM -0700, Yu-cheng Yu wrote:
> > +++ b/fs/Kconfig.binfmt
> > @@ -35,6 +35,10 @@ config COMPAT_BINFMT_ELF
> > config ARCH_BINFMT_ELF_STATE
> > bool
> >
> > +config ARCH_USE_GNU_PROPERTY
> > + bool
> > + depends on 64BIT
>
> I don't think this is right. I think you should get rid of the depends line
> and instead select the symbol from each of argh64 and x86 Kconfig files.
That makes sense. Thanks!
Yu-cheng
^ permalink raw reply
* Re: [PATCHv3 19/27] timens/fs/proc: Introduce /proc/pid/timens_offsets
From: Andrei Vagin @ 2019-05-02 6:08 UTC (permalink / raw)
To: Jann Horn
Cc: Dmitry Safonov, kernel list, Adrian Reber, Andrei Vagin,
Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Dmitry Safonov, Eric W. Biederman,
H. Peter Anvin, Ingo Molnar, Jeff Dike, Oleg Nesterov,
Pavel Emelyanov, Shuah Khan, Thomas Gleixner, Vincenzo Frascino,
containers, criu, Linux API
In-Reply-To: <CAG48ez1ws9qXyNHfScY1RoajG=pquFe4y9QpM1c_xtPSeC2SNA@mail.gmail.com>
Hi Jann,
Thank you for the review. Here are a few comments inline.
On Thu, Apr 25, 2019 at 08:16:41PM +0200, Jann Horn wrote:
> On Thu, Apr 25, 2019 at 6:15 PM Dmitry Safonov <dima@arista.com> wrote:
> > API to set time namespace offsets for children processes, i.e.:
> > echo "clockid off_ses off_nsec" > /proc/self/timens_offsets
> [...]
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 6a803a0b75df..76d58e9b5178 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> [...]
> > @@ -1521,6 +1522,103 @@ static const struct file_operations proc_pid_sched_autogroup_operations = {
> >
> > #endif /* CONFIG_SCHED_AUTOGROUP */
> >
> > +#ifdef CONFIG_TIME_NS
> > +static int timens_offsets_show(struct seq_file *m, void *v)
> > +{
> > + struct inode *inode = m->private;
> > + struct task_struct *p;
> > +
> > + p = get_proc_task(inode);
>
> (FYI, this could also be "p = get_proc_task(file_inode(m->file));".
> But this works, too.)
>
> > + if (!p)
> > + return -ESRCH;
> > + proc_timens_show_offsets(p, m);
> > +
> > + put_task_struct(p);
> > +
> > + return 0;
> > +}
> > +
> > +static ssize_t
> > +timens_offsets_write(struct file *file, const char __user *buf,
> > + size_t count, loff_t *ppos)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct proc_timens_offset offsets[2];
> > + char *kbuf = NULL, *pos, *next_line;
> > + struct task_struct *p;
> > + int ret, noffsets;
> > +
> > + /* Only allow < page size writes at the beginning of the file */
> > + if ((*ppos != 0) || (count >= PAGE_SIZE))
> > + return -EINVAL;
> > +
> > + /* Slurp in the user data */
> > + kbuf = memdup_user_nul(buf, count);
> > + if (IS_ERR(kbuf))
> > + return PTR_ERR(kbuf);
> > +
> > + /* Parse the user data */
> > + ret = -EINVAL;
> > + noffsets = 0;
> > + pos = kbuf;
> > + for (; pos; pos = next_line) {
> > + struct proc_timens_offset *off = &offsets[noffsets];
> > + int err;
> > +
> > + /* Find the end of line and ensure I don't look past it */
> > + next_line = strchr(pos, '\n');
> > + if (next_line) {
> > + *next_line = '\0';
> > + next_line++;
> > + if (*next_line == '\0')
> > + next_line = NULL;
> > + }
> > +
> > + err = sscanf(pos, "%u %lld %lu", &off->clockid,
> > + &off->val.tv_sec, &off->val.tv_nsec);
> > + if (err != 3 || off->val.tv_nsec >= NSEC_PER_SEC)
> > + goto out;
> > + if (noffsets++ == ARRAY_SIZE(offsets))
> > + break;
>
> This is equivalent to:
>
> if (noffsets == ARRAY_SIZE(offsets)) {
noffsets++;
> break;
}
> noffsets++;
>
> So we can reach the start of the loop with
> noffsets==ARRAY_SIZE(offsets), right? Which means that an
> out-of-bounds write can happen?
good catch. I will add a test for this case.
>
> I think that for code like this, it makes sense to write the increment
> and the check out separately so that it's easier to spot problems;
> e.g. like this:
>
> noffsets++;
> if (noffsets == ARRAY_SIZE(offsets))
> break;
will rewrite this way. Thanks!
>
> > + }
> > +
> > + ret = -ESRCH;
> > + p = get_proc_task(inode);
> > + if (!p)
> > + goto out;
> > + ret = proc_timens_set_offset(p, offsets, noffsets);
> > + put_task_struct(p);
> > + if (ret)
> > + goto out;
> > +
> > + ret = count;
> > +out:
> > + kfree(kbuf);
> > + return ret;
> > +}
> > +
> > +static int timens_offsets_open(struct inode *inode, struct file *filp)
> > +{
> > + int ret;
> > +
> > + ret = single_open(filp, timens_offsets_show, NULL);
> > + if (!ret) {
> > + struct seq_file *m = filp->private_data;
> > +
> > + m->private = inode;
> > + }
> > + return ret;
> > +}
>
> Why did you write it like this? Wouldn't the following be equivalent?
Probably I looked at sched_autogroup_open
>
> static int timens_offsets_open(struct inode *inode, struct file *file)
> {
> return single_open(file, timens_offsets_show, inode);
> }
>
> (But also, you can reach the inode of a seq_file as file_inode(m->file).)
>
> [...]
> > diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
> > index e806accc4eaf..9ad4b63c4ed2 100644
> > --- a/kernel/time_namespace.c
> > +++ b/kernel/time_namespace.c
> [...]
> > +
> > +int proc_timens_set_offset(struct task_struct *p,
> > + struct proc_timens_offset *offsets, int noffsets)
> > +{
> > + struct ns_common *ns;
> > + struct time_namespace *time_ns;
> > + struct timens_offsets *ns_offsets;
> > + struct timespec64 *offset;
> > + struct timespec64 tp;
> > + int i, err;
> > +
> > + ns = timens_for_children_get(p);
> > + if (!ns)
> > + return -ESRCH;
> > + time_ns = to_time_ns(ns);
> > +
> > + if (!time_ns->offsets || time_ns->initialized ||
> > + !ns_capable(time_ns->user_ns, CAP_SYS_TIME)) {
>
> Capability checks in VFS read/write handlers are bad. Please pass
> through the file pointer to this function and replace the call with
> "file_ns_capable(file, time_ns->user_ns, CAP_SYS_TIME)".
Will fix. Thanks!
>
> > + put_time_ns(time_ns);
> > + return -EPERM;
> > + }
> [...]
> > +}
^ permalink raw reply
* Re: [PATCH] binfmt_elf: Extract .note.gnu.property from an ELF file
From: Dave Martin @ 2019-05-02 11:10 UTC (permalink / raw)
To: Yu-cheng Yu
Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
Andy Lutomirski, Balbir Singh, Cyrill Gorcunov, Dave Hansen,
Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
Oleg Nesterov, Pa
In-Reply-To: <20190501211217.5039-1-yu-cheng.yu@intel.com>
On Wed, May 01, 2019 at 02:12:17PM -0700, Yu-cheng Yu wrote:
> An ELF file's .note.gnu.property indicates features the executable file
> can support. For example, the property GNU_PROPERTY_X86_FEATURE_1_AND
> indicates the file supports GNU_PROPERTY_X86_FEATURE_1_IBT and/or
> GNU_PROPERTY_X86_FEATURE_1_SHSTK.
>
> This patch was part of the Control-flow Enforcement series; the original
> patch is here: https://lkml.org/lkml/2018/11/20/205. Dave Martin responded
> that ARM recently introduced new features to NT_GNU_PROPERTY_TYPE_0 with
> properties closely modelled on GNU_PROPERTY_X86_FEATURE_1_AND, and it is
> logical to split out the generic part. Here it is.
>
> With this patch, if an arch needs to setup features from ELF properties,
> it needs CONFIG_ARCH_USE_GNU_PROPERTY to be set, and a specific
> arch_setup_property().
>
> For example, for X86_64:
>
> int arch_setup_property(void *ehdr, void *phdr, struct file *f, bool inter)
> {
> int r;
> uint32_t property;
>
> r = get_gnu_property(ehdr, phdr, f, GNU_PROPERTY_X86_FEATURE_1_AND,
> &property);
> ...
> }
Thanks, this is timely for me. I should be able to build the needed
arm64 support pretty quickly around this now.
[Cc'ing libc-alpha for the elf.h question -- see (2)]
A couple of questions before I look in more detail:
1) Can we rely on PT_GNU_PROPERTY being present in the phdrs to describe
the NT_GNU_PROPERTY_TYPE_0 note? If so, we can avoid trying to parse
irrelevant PT_NOTE segments.
2) Are there standard types for things like the program property header?
If not, can we add something in elf.h? We should try to coordinate with
libc on that. Something like
typedef __u32 Elf_Word;
typedef struct {
Elf_Word pr_type;
Elf_Word pr_datasz;
} Elf_Gnu_Prophdr;
(i.e., just the header part from [1], with a more specific name -- which
I just made up).
Given the fragmented nature and draft status of the specs -- and
differing opiniions about the sizes and alignments of certain things --
it could be useful to have this explicitly in the kernel. Some
documentation as to _precisely_ what we accept may also be a good idea.
3) It looks like we have to go and re-parse all the notes for every
property requested by the arch code.
For now there is only one property requested anyway, so this is probably
not too bad. But could we flip things around so that we have some
CONFIG_ARCH_WANTS_ELF_GNU_PROPERTY (say), and have the ELF core code
call into the arch backend for each property found?
The arch could provide some hook
int arch_elf_has_gnu_property(const Elf_Gnu_Prophdr *prop,
const void *data);
to consume the properties as they are found.
This would effectively replace the arch_setup_property() hook you
currently have.
Cheers
---Dave
[1] https://github.com/hjl-tools/linux-abi/wiki/Linux-Extensions-to-gABI
>
> Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
> fs/Kconfig.binfmt | 4 +
> fs/Makefile | 1 +
> fs/binfmt_elf.c | 13 ++
> fs/gnu_property.c | 363 +++++++++++++++++++++++++++++++++++++++
> include/linux/elf.h | 12 ++
> include/uapi/linux/elf.h | 8 +
> 6 files changed, 401 insertions(+)
> create mode 100644 fs/gnu_property.c
>
> diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
> index b795f8da81f3..175a1f58e785 100644
> --- a/fs/Kconfig.binfmt
> +++ b/fs/Kconfig.binfmt
> @@ -35,6 +35,10 @@ config COMPAT_BINFMT_ELF
> config ARCH_BINFMT_ELF_STATE
> bool
>
> +config ARCH_USE_GNU_PROPERTY
> + bool
> + depends on 64BIT
> +
> config BINFMT_ELF_FDPIC
> bool "Kernel support for FDPIC ELF binaries"
> default y if !BINFMT_ELF
> diff --git a/fs/Makefile b/fs/Makefile
> index 427fec226fae..8a35abbebf8b 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -44,6 +44,7 @@ obj-$(CONFIG_BINFMT_ELF) += binfmt_elf.o
> obj-$(CONFIG_COMPAT_BINFMT_ELF) += compat_binfmt_elf.o
> obj-$(CONFIG_BINFMT_ELF_FDPIC) += binfmt_elf_fdpic.o
> obj-$(CONFIG_BINFMT_FLAT) += binfmt_flat.o
> +obj-$(CONFIG_ARCH_USE_GNU_PROPERTY) += gnu_property.o
>
> obj-$(CONFIG_FS_MBCACHE) += mbcache.o
> obj-$(CONFIG_FS_POSIX_ACL) += posix_acl.o
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 7d09d125f148..40aa4a4fd64d 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1076,6 +1076,19 @@ static int load_elf_binary(struct linux_binprm *bprm)
> goto out_free_dentry;
> }
>
> + if (interpreter) {
> + retval = arch_setup_property(&loc->interp_elf_ex,
> + interp_elf_phdata,
> + interpreter, true);
> + } else {
> + retval = arch_setup_property(&loc->elf_ex,
> + elf_phdata,
> + bprm->file, false);
> + }
> +
> + if (retval < 0)
> + goto out_free_dentry;
> +
> if (elf_interpreter) {
> unsigned long interp_map_addr = 0;
>
> diff --git a/fs/gnu_property.c b/fs/gnu_property.c
> new file mode 100644
> index 000000000000..656ea3951840
> --- /dev/null
> +++ b/fs/gnu_property.c
> @@ -0,0 +1,363 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Extract an ELF file's .note.gnu.property.
> + *
> + * The path from the ELF header to the note section is the following:
> + * elfhdr->elf_phdr->elf_note->property[].
> + */
> +
> +#include <uapi/linux/elf-em.h>
> +#include <linux/processor.h>
> +#include <linux/binfmts.h>
> +#include <linux/elf.h>
> +#include <linux/slab.h>
> +#include <linux/fs.h>
> +#include <linux/uaccess.h>
> +#include <linux/string.h>
> +#include <linux/compat.h>
> +
> +/*
> + * The .note.gnu.property layout:
> + *
> + * struct elf_note {
> + * u32 n_namesz; --> sizeof(n_name[]); always (4)
> + * u32 n_ndescsz;--> sizeof(property[])
> + * u32 n_type; --> always NT_GNU_PROPERTY_TYPE_0
> + * };
> + * char n_name[4]; --> always 'GNU\0'
> + *
> + * struct {
> + * struct gnu_property {
> + * u32 pr_type;
> + * u32 pr_datasz;
> + * };
> + * u8 pr_data[pr_datasz];
> + * }[];
> + */
> +
> +#define BUF_SIZE (PAGE_SIZE / 4)
> +
> +struct gnu_property {
> + u32 pr_type;
> + u32 pr_datasz;
> +};
> +
> +typedef bool (test_item_fn)(void *buf, u32 *arg, u32 type);
> +typedef void *(next_item_fn)(void *buf, u32 *arg, u32 type);
> +
> +static inline bool test_note_type(void *buf, u32 *align, u32 note_type)
> +{
> + struct elf_note *n = buf;
> +
> + return ((n->n_type == note_type) && (n->n_namesz == 4) &&
> + (memcmp(n + 1, "GNU", 4) == 0));
> +}
> +
> +static inline void *next_note(void *buf, u32 *align, u32 note_type)
> +{
> + struct elf_note *n = buf;
> + u64 size;
> +
> + if (check_add_overflow((u64)sizeof(*n), (u64)n->n_namesz, &size))
> + return NULL;
> +
> + size = round_up(size, *align);
> +
> + if (check_add_overflow(size, (u64)n->n_descsz, &size))
> + return NULL;
> +
> + size = round_up(size, *align);
> +
> + if (buf + size < buf)
> + return NULL;
> + else
> + return (buf + size);
> +}
> +
> +static inline bool test_property(void *buf, u32 *max_type, u32 pr_type)
> +{
> + struct gnu_property *pr = buf;
> +
> + /*
> + * Property types must be in ascending order.
> + * Keep track of the max when testing each.
> + */
> + if (pr->pr_type > *max_type)
> + *max_type = pr->pr_type;
> +
> + return (pr->pr_type == pr_type);
> +}
> +
> +static inline void *next_property(void *buf, u32 *max_type, u32 pr_type)
> +{
> + struct gnu_property *pr = buf;
> +
> + if ((buf + sizeof(*pr) + pr->pr_datasz < buf) ||
> + (pr->pr_type > pr_type) ||
> + (pr->pr_type > *max_type))
> + return NULL;
> + else
> + return (buf + sizeof(*pr) + pr->pr_datasz);
> +}
> +
> +/*
> + * Scan 'buf' for a pattern; return true if found.
> + * *pos is the distance from the beginning of buf to where
> + * the searched item or the next item is located.
> + */
> +static int scan(u8 *buf, u32 buf_size, int item_size, test_item_fn test_item,
> + next_item_fn next_item, u32 *arg, u32 type, u32 *pos)
> +{
> + int found = 0;
> + u8 *p, *max;
> +
> + max = buf + buf_size;
> + if (max < buf)
> + return 0;
> +
> + p = buf;
> +
> + while ((p + item_size < max) && (p + item_size > buf)) {
> + if (test_item(p, arg, type)) {
> + found = 1;
> + break;
> + }
> +
> + p = next_item(p, arg, type);
> + }
> +
> + *pos = (p + item_size <= buf) ? 0 : (u32)(p - buf);
> + return found;
> +}
> +
> +/*
> + * Search an NT_GNU_PROPERTY_TYPE_0 for the property of 'pr_type'.
> + */
> +static int find_property(struct file *file, unsigned long desc_size,
> + loff_t file_offset, u8 *buf,
> + u32 pr_type, u32 *property)
> +{
> + u32 buf_pos;
> + unsigned long read_size;
> + unsigned long done;
> + int found = 0;
> + int ret = 0;
> + u32 last_pr = 0;
> +
> + *property = 0;
> + buf_pos = 0;
> +
> + for (done = 0; done < desc_size; done += buf_pos) {
> + read_size = desc_size - done;
> + if (read_size > BUF_SIZE)
> + read_size = BUF_SIZE;
> +
> + ret = kernel_read(file, buf, read_size, &file_offset);
> +
> + if (ret != read_size)
> + return (ret < 0) ? ret : -EIO;
> +
> + ret = 0;
> + found = scan(buf, read_size, sizeof(struct gnu_property),
> + test_property, next_property,
> + &last_pr, pr_type, &buf_pos);
> +
> + if ((!buf_pos) || found)
> + break;
> +
> + file_offset += buf_pos - read_size;
> + }
> +
> + if (found) {
> + struct gnu_property *pr =
> + (struct gnu_property *)(buf + buf_pos);
> +
> + if (pr->pr_datasz == 4) {
> + u32 *max = (u32 *)(buf + read_size);
> + u32 *data = (u32 *)((u8 *)pr + sizeof(*pr));
> +
> + if (data + 1 <= max) {
> + *property = *data;
> + } else {
> + file_offset += buf_pos - read_size;
> + file_offset += sizeof(*pr);
> + ret = kernel_read(file, property, 4,
> + &file_offset);
> + }
> + }
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Search a PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
> + */
> +static int find_note_type_0(struct file *file, loff_t file_offset,
> + unsigned long note_size, u32 align,
> + u32 pr_type, u32 *property)
> +{
> + u8 *buf;
> + u32 buf_pos;
> + unsigned long read_size;
> + unsigned long done;
> + int found = 0;
> + int ret = 0;
> +
> + buf = kmalloc(BUF_SIZE, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + *property = 0;
> + buf_pos = 0;
> +
> + for (done = 0; done < note_size; done += buf_pos) {
> + read_size = note_size - done;
> + if (read_size > BUF_SIZE)
> + read_size = BUF_SIZE;
> +
> + ret = kernel_read(file, buf, read_size, &file_offset);
> +
> + if (ret != read_size) {
> + ret = (ret < 0) ? ret : -EIO;
> + kfree(buf);
> + return ret;
> + }
> +
> + /*
> + * item_size = sizeof(struct elf_note) + elf_note.n_namesz.
> + * n_namesz is 4 for the note type we look for.
> + */
> + ret = scan(buf, read_size, sizeof(struct elf_note) + 4,
> + test_note_type, next_note,
> + &align, NT_GNU_PROPERTY_TYPE_0, &buf_pos);
> +
> + file_offset += buf_pos - read_size;
> +
> + if (ret && !found) {
> + struct elf_note *n =
> + (struct elf_note *)(buf + buf_pos);
> + u64 start = round_up(sizeof(*n) + n->n_namesz, align);
> + u64 total = 0;
> +
> + if (check_add_overflow(start, (u64)n->n_descsz, &total)) {
> + ret = -EINVAL;
> + break;
> + }
> + total = round_up(total, align);
> +
> + ret = find_property(file, n->n_descsz,
> + file_offset + start,
> + buf, pr_type, property);
> + found++;
> + file_offset += total;
> + buf_pos += total;
> + } else if (!buf_pos || ret) {
> + ret = 0;
> + *property = 0;
> + break;
> + }
> + }
> +
> + kfree(buf);
> + return ret;
> +}
> +
> +/*
> + * Look at an ELF file's PT_NOTE segments, then NT_GNU_PROPERTY_TYPE_0, then
> + * the property of pr_type.
> + *
> + * Input:
> + * file: the file to search;
> + * phdr: the file's elf header;
> + * phnum: number of entries in phdr;
> + * pr_type: the property type.
> + *
> + * Output:
> + * The property found.
> + *
> + * Return:
> + * Zero or error.
> + */
> +static int scan_segments_64(struct file *file, struct elf64_phdr *phdr,
> + int phnum, u32 pr_type, u32 *property)
> +{
> + int i;
> + int err = 0;
> +
> + for (i = 0; i < phnum; i++, phdr++) {
> + if ((phdr->p_type != PT_NOTE) || (phdr->p_align != 8))
> + continue;
> +
> + /*
> + * Search the PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
> + */
> + err = find_note_type_0(file, phdr->p_offset, phdr->p_filesz,
> + phdr->p_align, pr_type, property);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static int scan_segments_32(struct file *file, struct elf32_phdr *phdr,
> + int phnum, u32 pr_type, u32 *property)
> +{
> + int i;
> + int err = 0;
> +
> + for (i = 0; i < phnum; i++, phdr++) {
> + if ((phdr->p_type != PT_NOTE) || (phdr->p_align != 4))
> + continue;
> +
> + /*
> + * Search the PT_NOTE segment for NT_GNU_PROPERTY_TYPE_0.
> + */
> + err = find_note_type_0(file, phdr->p_offset, phdr->p_filesz,
> + phdr->p_align, pr_type, property);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> +#endif
> +
> +int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
> + u32 pr_type, u32 *property)
> +{
> + struct elf64_hdr *ehdr64 = ehdr_p;
> + int err = 0;
> +
> + *property = 0;
> +
> + if (ehdr64->e_ident[EI_CLASS] == ELFCLASS64) {
> + struct elf64_phdr *phdr64 = phdr_p;
> +
> + err = scan_segments_64(f, phdr64, ehdr64->e_phnum,
> + pr_type, property);
> + if (err < 0)
> + goto out;
> + } else {
> +#ifdef CONFIG_COMPAT
> + struct elf32_hdr *ehdr32 = ehdr_p;
> +
> + if (ehdr32->e_ident[EI_CLASS] == ELFCLASS32) {
> + struct elf32_phdr *phdr32 = phdr_p;
> +
> + err = scan_segments_32(f, phdr32, ehdr32->e_phnum,
> + pr_type, property);
> + if (err < 0)
> + goto out;
> + }
> +#else
> + WARN_ONCE(1, "Exec of 32-bit app, but CONFIG_COMPAT is not enabled.\n");
> + return -ENOTSUPP;
> +#endif
> + }
> +
> +out:
> + return err;
> +}
> diff --git a/include/linux/elf.h b/include/linux/elf.h
> index e3649b3e970e..c15febebe7f2 100644
> --- a/include/linux/elf.h
> +++ b/include/linux/elf.h
> @@ -56,4 +56,16 @@ static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) {
> extern int elf_coredump_extra_notes_size(void);
> extern int elf_coredump_extra_notes_write(struct coredump_params *cprm);
> #endif
> +
> +#ifdef CONFIG_ARCH_USE_GNU_PROPERTY
> +extern int arch_setup_property(void *ehdr, void *phdr, struct file *f,
> + bool interp);
> +extern int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
> + u32 pr_type, u32 *feature);
> +#else
> +static inline int arch_setup_property(void *ehdr, void *phdr, struct file *f,
> + bool interp) { return 0; }
> +static inline int get_gnu_property(void *ehdr_p, void *phdr_p, struct file *f,
> + u32 pr_type, u32 *feature) { return 0; }
> +#endif
> #endif /* _LINUX_ELF_H */
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index 34c02e4290fe..7b7603a44cbc 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -372,6 +372,7 @@ typedef struct elf64_shdr {
> #define NT_PRFPREG 2
> #define NT_PRPSINFO 3
> #define NT_TASKSTRUCT 4
> +#define NT_GNU_PROPERTY_TYPE_0 5
> #define NT_AUXV 6
> /*
> * Note to userspace developers: size of NT_SIGINFO note may increase
> @@ -443,4 +444,11 @@ typedef struct elf64_note {
> Elf64_Word n_type; /* Content type */
> } Elf64_Nhdr;
>
> +/* .note.gnu.property types */
> +#define GNU_PROPERTY_X86_FEATURE_1_AND (0xc0000002)
> +
> +/* Bits of GNU_PROPERTY_X86_FEATURE_1_AND */
> +#define GNU_PROPERTY_X86_FEATURE_1_IBT (0x00000001)
> +#define GNU_PROPERTY_X86_FEATURE_1_SHSTK (0x00000002)
> +
> #endif /* _UAPI_LINUX_ELF_H */
> --
> 2.17.1
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox