Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [musl] Re: [PATCH] binfmt_elf: Update READ_IMPLIES_EXEC logic for modern CPUs
From: Rich Felker @ 2019-04-23 19:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Andrew Morton, Hector Marco-Gisbert, Jason Gunthorpe,
	kernel list, the arch/x86 maintainers, Thomas Gleixner,
	Andy Lutomirski, Kernel Hardening, Mark Rutland, linux-arm-kernel,
	musl, Linux API
In-Reply-To: <CAG48ez0CeTEGDuwr+qAGBwsqm+Drj0dkFfr6_UDc+g-xM4BpiA@mail.gmail.com>

On Tue, Apr 23, 2019 at 09:02:03PM +0200, Jann Horn wrote:
> +linux-api, +musl
> 
> On Tue, Apr 23, 2019 at 8:12 PM Kees Cook <keescook@chromium.org> wrote:
> > The READ_IMPLIES_EXEC work-around was designed for old CPUs lacking NX
> > (to have the visible permission flags on memory regions reflect reality:
> > they are all executable), and for old toolchains that lacked the ELF
> > PT_GNU_STACK marking (under the assumption than toolchains that couldn't
> > even specify memory protection flags may have it wrong for all memory
> > regions).
> >
> > This logic is sensible, but was implemented in a way that equated having
> > a PT_GNU_STACK marked executable as being as "broken" as lacking the
> > PT_GNU_STACK marking entirely. This is not a reasonable assumption
> > for CPUs that have had NX support from the start (or very close to
> > the start). This confusion has led to situations where modern 64-bit
> > programs with explicitly marked executable stack are forced into the
> > READ_IMPLIES_EXEC state when no such thing is needed. (And leads to
> > unexpected failures when mmap()ing regions of device driver memory that
> > wish to disallow VM_EXEC[1].)
> >
> > To fix this, elf_read_implies_exec() is adjusted on arm64 (where NX has
> > always existed and all toolchains include PT_GNU_STACK), and x86 is
> > adjusted to handle this combination of possible outcomes:
> >
> >               CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
> >  ELF:              |           |                  |                  |
> >  ------------------------------|------------------|------------------|
> >  missing GNU_STACK | needs RIE | needs RIE        | no RIE           |
> >  GNU_STACK == RWX  | needs RIE | no RIE: stack X  | no RIE: stack X  |
> >  GNU_STACK == RW   | needs RIE | no RIE: stack NX | no RIE: stack NX |
> >
> > This has the effect of making binfmt_elf's EXSTACK_DEFAULT actually take
> > on the correct architecture default of being non-executable on arm64 and
> > x86_64, and being executable on ia32.
> 
> It's probably worth going a bit more into detail in this description
> on how libraries typically allocate thread stacks.
> 
> It looks like glibc will be fine; before commit 54ee14b3882
> (https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=nptl/allocatestack.c;h=dc501650b8629eda4502f2016016f09106cfb526;hp=6ada1fe1381de104153c0627e27f09fe5ad02caa;hb=54ee14b3882;hpb=16a76cd23ce9d3924fa192395e730423e3dc8b36),
> thread stacks were always RWX, and since then, from what I can tell,
> thread stacks were executable depending on the executable's ELF
> headers (as parsed by glibc).
> 
> But e.g. musl's __pthread_create() seems to hardcode
> PROT_READ|PROT_WRITE, which I think would mean that if someone built a
> multithreaded program with nested functions and linked with musl, that
> program would stop working? Or maybe I'm just reading the code wrong.

musl intentionally/explicitly does not support executable stacks. If
you happen to get one from the kernel for the main thread, things
might happen to work, but it's unintended and unsupported
functionality.

Rich

^ permalink raw reply

* Re: [PATCH] binfmt_elf: Update READ_IMPLIES_EXEC logic for modern CPUs
From: Kees Cook @ 2019-04-23 19:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Hector Marco-Gisbert, Jason Gunthorpe, kernel list,
	the arch/x86 maintainers, Thomas Gleixner, Andy Lutomirski,
	Kernel Hardening, Mark Rutland, linux-arm-kernel, musl, Linux API
In-Reply-To: <CAG48ez0CeTEGDuwr+qAGBwsqm+Drj0dkFfr6_UDc+g-xM4BpiA@mail.gmail.com>

On Tue, Apr 23, 2019 at 12:02 PM Jann Horn <jannh@google.com> wrote:
> It's probably worth going a bit more into detail in this description
> on how libraries typically allocate thread stacks.
>
> It looks like glibc will be fine; before commit 54ee14b3882
> (https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=nptl/allocatestack.c;h=dc501650b8629eda4502f2016016f09106cfb526;hp=6ada1fe1381de104153c0627e27f09fe5ad02caa;hb=54ee14b3882;hpb=16a76cd23ce9d3924fa192395e730423e3dc8b36),
> thread stacks were always RWX, and since then, from what I can tell,
> thread stacks were executable depending on the executable's ELF
> headers (as parsed by glibc).

2003, which seems safely (?) in the past. :)

> But e.g. musl's __pthread_create() seems to hardcode
> PROT_READ|PROT_WRITE, which I think would mean that if someone built a
> multithreaded program with nested functions and linked with musl, that
> program would stop working? Or maybe I'm just reading the code wrong.

Rephrasing for myself: this could break multithread binaries linked
with musl and marked with PT_GNU_STACK to RWE since musl doesn't check
ELF headers to determine stack executable-ness when allocating stack
space in __pthread_create().

> Then again, I'm not sure whether anyone actually uses nested functions...

It is blissfully rare, but it seems common (?) for Fortran binaries.
Are there multithreaded fortran binaries linked with musl that will
break because of this? I guess it's possible. If that happens, we can
adjust the logic with notes of an actual case. :)

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH] binfmt_elf: Update READ_IMPLIES_EXEC logic for modern CPUs
From: Jann Horn @ 2019-04-23 19:02 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Hector Marco-Gisbert, Jason Gunthorpe, kernel list,
	the arch/x86 maintainers, Thomas Gleixner, Andy Lutomirski,
	Kernel Hardening, Mark Rutland, linux-arm-kernel, musl, Linux API
In-Reply-To: <20190423181210.GA2443@beast>

+linux-api, +musl

On Tue, Apr 23, 2019 at 8:12 PM Kees Cook <keescook@chromium.org> wrote:
> The READ_IMPLIES_EXEC work-around was designed for old CPUs lacking NX
> (to have the visible permission flags on memory regions reflect reality:
> they are all executable), and for old toolchains that lacked the ELF
> PT_GNU_STACK marking (under the assumption than toolchains that couldn't
> even specify memory protection flags may have it wrong for all memory
> regions).
>
> This logic is sensible, but was implemented in a way that equated having
> a PT_GNU_STACK marked executable as being as "broken" as lacking the
> PT_GNU_STACK marking entirely. This is not a reasonable assumption
> for CPUs that have had NX support from the start (or very close to
> the start). This confusion has led to situations where modern 64-bit
> programs with explicitly marked executable stack are forced into the
> READ_IMPLIES_EXEC state when no such thing is needed. (And leads to
> unexpected failures when mmap()ing regions of device driver memory that
> wish to disallow VM_EXEC[1].)
>
> To fix this, elf_read_implies_exec() is adjusted on arm64 (where NX has
> always existed and all toolchains include PT_GNU_STACK), and x86 is
> adjusted to handle this combination of possible outcomes:
>
>               CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
>  ELF:              |           |                  |                  |
>  ------------------------------|------------------|------------------|
>  missing GNU_STACK | needs RIE | needs RIE        | no RIE           |
>  GNU_STACK == RWX  | needs RIE | no RIE: stack X  | no RIE: stack X  |
>  GNU_STACK == RW   | needs RIE | no RIE: stack NX | no RIE: stack NX |
>
> This has the effect of making binfmt_elf's EXSTACK_DEFAULT actually take
> on the correct architecture default of being non-executable on arm64 and
> x86_64, and being executable on ia32.

It's probably worth going a bit more into detail in this description
on how libraries typically allocate thread stacks.

It looks like glibc will be fine; before commit 54ee14b3882
(https://sourceware.org/git/?p=glibc.git;a=blobdiff;f=nptl/allocatestack.c;h=dc501650b8629eda4502f2016016f09106cfb526;hp=6ada1fe1381de104153c0627e27f09fe5ad02caa;hb=54ee14b3882;hpb=16a76cd23ce9d3924fa192395e730423e3dc8b36),
thread stacks were always RWX, and since then, from what I can tell,
thread stacks were executable depending on the executable's ELF
headers (as parsed by glibc).

But e.g. musl's __pthread_create() seems to hardcode
PROT_READ|PROT_WRITE, which I think would mean that if someone built a
multithreaded program with nested functions and linked with musl, that
program would stop working? Or maybe I'm just reading the code wrong.

Then again, I'm not sure whether anyone actually uses nested functions...

^ permalink raw reply

* Re: WARNING in percpu_ref_kill_and_confirm
From: Dmitry Vyukov @ 2019-04-23 14:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, syzbot, Arnd Bergmann, Borislav Petkov,
	Darrick J. Wong, Greg Kroah-Hartman, Peter Anvin, Linux API,
	linux-arch, linux-block, linux-fsdevel, Linux List Kernel Mailing,
	Andrew Lutomirski, Mathieu Desnoyers, Ingo Molnar,
	Michael Ellerman, syzkaller-bugs, Thomas Gleixner, Al
In-Reply-To: <CAHk-=wh=kKEb6qbuhnSWh5S55SbQBOE33991ULSFUgVrEnzivA@mail.gmail.com>

On Mon, Apr 22, 2019 at 7:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Apr 22, 2019 at 9:38 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > With the mutex change in, I can trigger it in a second or so. Just ran
> > the reproducer with that change reverted, and I'm not seeing any badness.
> > So I do wonder if the bisect results are accurate?
>
> Looking at the syzbot report, it's syzbot being confused.
>
> The actual WARNING in percpu_ref_kill_and_confirm() only happens with
> recent kernels.
>
> But then syzbot mixes it up with a completely different bug:
>
>    crash: BUG: MAX_STACK_TRACE_ENTRIES too low!
>    BUG: MAX_STACK_TRACE_ENTRIES too low!
>
> and for some reason decides that *that* bug is the same thing entirely.
>
> So yeah, I think the simple percpu_ref_is_dying() check is sufficient,
> and that the syzbot bisection is completely bogus.

Using crashed/not-crashed predicate gives better results overall. More
than half kernel bugs have different manifestations due to different
reasons. And even if we can say for sure that we see a different bug,
we still don't know if the original bug is also there or not. See the
following threads for details:
https://groups.google.com/d/msg/syzkaller-bugs/nFeC8-UG1gg/y6gUEsvAAgAJ
https://groups.google.com/d/msg/syzkaller/sR8aAXaWEF4/tTWYRgvmAwAJ

Unrelated crashes is the most common cause of incorrect bisection
results (66%). To enable better bisection we would need to integrate
some meaningful precommit testing into kernel development process
(would be tremendously useful for other reasons too). E.g. this "BUG:
MAX_STACK_TRACE_ENTRIES too low!" is this:
https://syzkaller.appspot.com/bug?id=dbd70f0407487a061d2d46fdc6bccc94b95ce3c0
and the reproducer is simply opening /dev/infiniband/rdma_cm or
/dev/vhci or something equally simple with LOCKDEP enabled. None of
this was done in a testing environment for several weeks. And then it
took another month to propagate the fix through all distributed kernel
trees. For all that time simple programs crash and bisection can't be
done and we are spending time here...

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Mathieu Desnoyers @ 2019-04-23 12:36 UTC (permalink / raw)
  To: Ramana Radhakrishnan
  Cc: Szabolcs Nagy, nd, Joseph Myers, Will Deacon, carlos,
	Florian Weimer, libc-alpha, Thomas Gleixner, Ben Maurer,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, Dave Watson,
	Paul Turner, Rich Felker, linux-kernel, linux-api
In-Reply-To: <CAJA7tRZDKHxLDzuggSmNngfxSTD31EX3mv6YQpcnNXrOtF=9qQ@mail.gmail.com>

----- On Apr 23, 2019, at 7:59 AM, Ramana Radhakrishnan ramana.gcc@googlemail.com wrote:

> On Tue, Apr 23, 2019 at 12:16 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
>>
>> On 18/04/2019 19:17, Mathieu Desnoyers wrote:
>> > ----- On Apr 18, 2019, at 1:37 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>> >> you have to add a documentation comment somewhere
>> >> explaining if RSEQ_SIG is the value that's passed to
>> >> the kernel and then aarch64 asm code has to use
>> >>
>> >> .inst endianfixup(RSEQ_SIG) // or
>> >> .word RSEQ_SIG
>> >
>> > Using ".word" won't allow objdump to show the instruction it
>> > maps to. It will consider it as data. So .inst is preferred here.
>>
>> is there some specific reason you prefer .inst?
> 
> I believe the reasoning here is that in the disassembly you want to
> see an instruction pattern for an architecture rather than a magic bit
> pattern that appears to be an "inline" literal pool entry.  I would
> support the .inst variant so that the disassembler shows the
> instruction for what it is when debugging. If control reaches the
> marker instruction, something's gone wrong and thus from a user
> friendliness perspective I would prefer to see an instruction that
> clearly indicates that it's undefined .inst directive so that someone
> disassembling this in an assembly view in GDB sees the right thing
> (TM) instead of having to reach for the manual and disassembling this
> by hand.

That's my line of thinking exactly. I might add that having data in a
literal pool within the instruction stream might be unexpected in some
compilation environments, e.g. when compiling with -mno-text-section-literals .

So even though the signature may likely end up being placed in a literal
pool, it's preferable to ensure it is a valid uncommon trap instruction.

Thanks,

Mathieu

> 
>>
>> disassembling a canary value as data (that is
>> never executed, but loaded and compared by the
>> kernel as data) sounds more semantically correct
>> to me than showing it as an instruction.
>>
> 
> 
> 
> Ramana
> 
> 
>> i guess having it as an instruction can avoid
>> issues if some tools dislike .word in .text,
> > but otherwise .word seems better.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Ramana Radhakrishnan @ 2019-04-23 11:59 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Mathieu Desnoyers, nd, Joseph Myers, Will Deacon, carlos,
	Florian Weimer, libc-alpha, Thomas Gleixner, Ben Maurer,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, Dave Watson,
	Paul Turner, Rich Felker, linux-kernel, linux-api
In-Reply-To: <a1bd2595-b7ae-b06b-9c71-802901ed8587@arm.com>

On Tue, Apr 23, 2019 at 12:16 PM Szabolcs Nagy <Szabolcs.Nagy@arm.com> wrote:
>
> On 18/04/2019 19:17, Mathieu Desnoyers wrote:
> > ----- On Apr 18, 2019, at 1:37 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
> >> you have to add a documentation comment somewhere
> >> explaining if RSEQ_SIG is the value that's passed to
> >> the kernel and then aarch64 asm code has to use
> >>
> >> .inst endianfixup(RSEQ_SIG) // or
> >> .word RSEQ_SIG
> >
> > Using ".word" won't allow objdump to show the instruction it
> > maps to. It will consider it as data. So .inst is preferred here.
>
> is there some specific reason you prefer .inst?

I believe the reasoning here is that in the disassembly you want to
see an instruction pattern for an architecture rather than a magic bit
pattern that appears to be an "inline" literal pool entry.  I would
support the .inst variant so that the disassembler shows the
instruction for what it is when debugging. If control reaches the
marker instruction, something's gone wrong and thus from a user
friendliness perspective I would prefer to see an instruction that
clearly indicates that it's undefined .inst directive so that someone
disassembling this in an assembly view in GDB sees the right thing
(TM) instead of having to reach for the manual and disassembling this
by hand.

>
> disassembling a canary value as data (that is
> never executed, but loaded and compared by the
> kernel as data) sounds more semantically correct
> to me than showing it as an instruction.
>



Ramana


> i guess having it as an instruction can avoid
> issues if some tools dislike .word in .text,
> but otherwise .word seems better.

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v9)
From: Szabolcs Nagy @ 2019-04-23 11:23 UTC (permalink / raw)
  To: Mathieu Desnoyers, Carlos O'Donell
  Cc: nd, Florian Weimer, Joseph Myers, libc-alpha@sourceware.org,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, Rich Felker,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
In-Reply-To: <20190422175623.6134-2-mathieu.desnoyers@efficios.com>

On 22/04/2019 18:56, Mathieu Desnoyers wrote:
> diff --git a/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h b/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
> new file mode 100644
> index 0000000000..e538668612
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
> @@ -0,0 +1,44 @@
> +/* Restartable Sequences Linux aarch64 architecture header.
> +
> +   Copyright (C) 2019 Free Software Foundation, Inc.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _SYS_RSEQ_H
> +# error "Never use <bits/rseq.h> directly; include <sys/rseq.h> instead."
> +#endif
> +
> +/* RSEQ_SIG is a signature required before each abort handler code.
> +
> +   It is a 32-bit value that maps to actual architecture code compiled
> +   into applications and libraries. It needs to be defined for each
> +   architecture. When choosing this value, it needs to be taken into
> +   account that generating invalid instructions may have ill effects on
> +   tools like objdump, and may also have impact on the CPU speculative
> +   execution efficiency in some cases.
> +
> +   aarch64 -mbig-endian generates mixed endianness code vs data:
> +   little-endian code and big-endian data. Ensure the RSEQ_SIG signature
> +   matches code endianness.  */
> +
> +#define RSEQ_SIG_CODE	0xd428bc00	/* BRK #0x45E0.  */
> +
> +#ifdef __AARCH64EB__
> +#define RSEQ_SIG_DATA	0x00bc28d4	/* BRK #0x45E0.  */
> +#else
> +#define RSEQ_SIG_DATA	RSEQ_SIG_CODE
> +#endif
> +
> +#define RSEQ_SIG	RSEQ_SIG_DATA
> diff --git a/sysdeps/unix/sysv/linux/aarch64/libc.abilist b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> index 9c330f325e..331f39e41a 100644
> --- a/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> @@ -2141,3 +2141,5 @@ GLIBC_2.28 thrd_yield F
>  GLIBC_2.29 getcpu F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
> +GLIBC_2.30 __rseq_abi T 0x20
> +GLIBC_2.30 __rseq_handled D 0x4

the aarch64 changes are ok.

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Szabolcs Nagy @ 2019-04-23 11:16 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: nd, Joseph Myers, Will Deacon, carlos, Florian Weimer, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <1531569198.1389.1555611422188.JavaMail.zimbra@efficios.com>

On 18/04/2019 19:17, Mathieu Desnoyers wrote:
> ----- On Apr 18, 2019, at 1:37 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>> you have to add a documentation comment somewhere
>> explaining if RSEQ_SIG is the value that's passed to
>> the kernel and then aarch64 asm code has to use
>>
>> .inst endianfixup(RSEQ_SIG) // or
>> .word RSEQ_SIG
> 
> Using ".word" won't allow objdump to show the instruction it
> maps to. It will consider it as data. So .inst is preferred here.

is there some specific reason you prefer .inst?

disassembling a canary value as data (that is
never executed, but loaded and compared by the
kernel as data) sounds more semantically correct
to me than showing it as an instruction.

i guess having it as an instruction can avoid
issues if some tools dislike .word in .text,
but otherwise .word seems better.

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Neil Horman @ 2019-04-23 10:28 UTC (permalink / raw)
  To: Paul Moore
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm
In-Reply-To: <CAHC9VhQYPF2ma_W+hySbQtfTztf=K1LTFnxnyVK0y9VYxj-K=w@mail.gmail.com>

On Mon, Apr 22, 2019 at 09:49:05AM -0400, Paul Moore wrote:
> On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> > > Implement kernel audit container identifier.
> >
> > I'm sorry, I've lost track of this, where have we landed on it? Are we good for
> > inclusion?
> 
> I haven't finished going through this latest revision, but unless
> Richard made any significant changes outside of the feedback from the
> v5 patchset I'm guessing we are "close".
> 
> Based on discussions Richard and I had some time ago, I have always
> envisioned the plan as being get the kernel patchset, tests, docs
> ready (which Richard has been doing) and then run the actual
> implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> to make sure the actual implementation is sane from their perspective.
> They've already seen the design, so I'm not expecting any real
> surprises here, but sometimes opinions change when they have actual
> code in front of them to play with and review.
> 
> Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> whatever additional testing we can do would be a big win.  I'm
> thinking I'll pull it into a separate branch in the audit tree
> (audit/working-container ?) and include that in my secnext kernels
> that I build/test on a regular basis; this is also a handy way to keep
> it based against the current audit/next branch.  If any changes are
> needed Richard can either chose to base those changes on audit/next or
> the separate audit container ID branch; that's up to him.  I've done
> this with other big changes in other trees, e.g. SELinux, and it has
> worked well to get some extra testing in and keep the patchset "merge
> ready" while others outside the subsystem look things over.
> 

That all sounds good, thank you Paul.  I knew you and Richard were working on
it, but I somehow managed to loose track of exactly where we left this.  

Much Appreciated
Neil

> -- 
> paul moore
> www.paul-moore.com
> 

^ permalink raw reply

* Re: [PATCH v3 1/4] Make anon_inodes unconditional
From: Enrico Weigelt, metux IT consult @ 2019-04-23  7:13 UTC (permalink / raw)
  To: Christian Brauner, torvalds, viro, jannh, dhowells, oleg,
	linux-api, linux-kernel
  Cc: serge, luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm,
	cyphar, joel, dancol
In-Reply-To: <20190419120904.27502-2-christian@brauner.io>

On 19.04.19 14:09, Christian Brauner wrote:

Hi,

> From: David Howells <dhowells@redhat.com>
> 
> Make the anon_inodes facility unconditional so that it can be used by core
> VFS code and pidfd code.

I'm not convinced this is necessary. Instead the new features should
select ANON_INODES when enabled. And yes: these features should be
configurable.

Why ? Because embedded folks need to trim down the kernel and drop
everything that's not absolutely needed.


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply

* [PATCH 2/5] glibc: sched_getcpu(): use rseq cpu_id TLS on Linux (v3)
From: Mathieu Desnoyers @ 2019-04-22 17:56 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Florian Weimer, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Mathieu Desnoyers, Thomas Gleixner, Ben Maurer, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Will Deacon, Dave Watson,
	Paul Turner, linux-kernel, linux-api
In-Reply-To: <20190422175623.6134-1-mathieu.desnoyers@efficios.com>

When available, use the cpu_id field from __rseq_abi on Linux to
implement sched_getcpu(). Fall-back on the vgetcpu vDSO if unavailable.

Benchmarks:

x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

glibc sched_getcpu():                     13.7 ns (baseline)
glibc sched_getcpu() using rseq:           2.5 ns (speedup:  5.5x)
inline load cpuid from __rseq_abi TLS:     0.8 ns (speedup: 17.1x)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
CC: Joseph Myers <joseph@codesourcery.com>
CC: Szabolcs Nagy <szabolcs.nagy@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Maurer <bmaurer@fb.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Paul Turner <pjt@google.com>
CC: libc-alpha@sourceware.org
CC: linux-kernel@vger.kernel.org
CC: linux-api@vger.kernel.org
---
Changes since v1:
- rseq is only used if both __NR_rseq and RSEQ_SIG are defined.

Changes since v2:
- remove duplicated __rseq_abi extern declaration.
---
 sysdeps/unix/sysv/linux/sched_getcpu.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/sysdeps/unix/sysv/linux/sched_getcpu.c b/sysdeps/unix/sysv/linux/sched_getcpu.c
index fb0d317f83..167976fab7 100644
--- a/sysdeps/unix/sysv/linux/sched_getcpu.c
+++ b/sysdeps/unix/sysv/linux/sched_getcpu.c
@@ -24,8 +24,8 @@
 #endif
 #include <sysdep-vdso.h>
 
-int
-sched_getcpu (void)
+static int
+vsyscall_sched_getcpu (void)
 {
 #ifdef __NR_getcpu
   unsigned int cpu;
@@ -37,3 +37,23 @@ sched_getcpu (void)
   return -1;
 #endif
 }
+
+#ifdef __NR_rseq
+#include <sys/rseq.h>
+#endif
+
+#if defined __NR_rseq && defined RSEQ_SIG
+int
+sched_getcpu (void)
+{
+  int cpu_id = __rseq_abi.cpu_id;
+
+  return cpu_id >= 0 ? cpu_id : vsyscall_sched_getcpu ();
+}
+#else
+int
+sched_getcpu (void)
+{
+  return vsyscall_sched_getcpu ();
+}
+#endif
-- 
2.17.1

^ permalink raw reply related

* [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v9)
From: Mathieu Desnoyers @ 2019-04-22 17:56 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Florian Weimer, Joseph Myers, Szabolcs Nagy, libc-alpha,
	Mathieu Desnoyers, Thomas Gleixner, Ben Maurer, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Will Deacon, Dave Watson,
	Paul Turner, Rich Felker, linux-kernel, linux-api
In-Reply-To: <20190422175623.6134-1-mathieu.desnoyers@efficios.com>

Register rseq(2) TLS for each thread (including main), and unregister
for each thread (excluding main). "rseq" stands for Restartable
Sequences.

See the rseq(2) man page proposed here:
  https://lkml.org/lkml/2018/9/19/647

This patch is based on glibc-2.29. The rseq(2) system call was merged
into Linux 4.18.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
CC: Joseph Myers <joseph@codesourcery.com>
CC: Szabolcs Nagy <szabolcs.nagy@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Maurer <bmaurer@fb.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Paul Turner <pjt@google.com>
CC: Rich Felker <dalias@libc.org>
CC: libc-alpha@sourceware.org
CC: linux-kernel@vger.kernel.org
CC: linux-api@vger.kernel.org
---
Changes since v1:
- Move __rseq_refcount to an extra field at the end of __rseq_abi to
  eliminate one symbol.

  All libraries/programs which try to register rseq (glibc,
  early-adopter applications, early-adopter libraries) should use the
  rseq refcount. It becomes part of the ABI within a user-space
  process, but it's not part of the ABI shared with the kernel per se.

- Restructure how this code is organized so glibc keeps building on
  non-Linux targets.

- Use non-weak symbol for __rseq_abi.

- Move rseq registration/unregistration implementation into its own
  nptl/rseq.c compile unit.

- Move __rseq_abi symbol under GLIBC_2.29.

Changes since v2:
- Move __rseq_refcount to its own symbol, which is less ugly than
  trying to play tricks with the rseq uapi.
- Move __rseq_abi from nptl to csu (C start up), so it can be used
  across glibc, including memory allocator and sched_getcpu(). The
  __rseq_refcount symbol is kept in nptl, because there is no reason
  to use it elsewhere in glibc.

Changes since v3:
- Set __rseq_refcount TLS to 1 on register/set to 0 on unregister
  because glibc is the first/last user.
- Unconditionally register/unregister rseq at thread start/exit, because
  glibc is the first/last user.
- Add missing abilist items.
- Rebase on glibc master commit a502c5294.
- Add NEWS entry.

Changes since v4:
- Do not use "weak" symbols for __rseq_abi and __rseq_refcount. Based on
  "System V Application Binary Interface", weak only affects the link
  editor, not the dynamic linker.
- Install a new sys/rseq.h system header on Linux, which contains the
  RSEQ_SIG definition, __rseq_abi declaration and __rseq_refcount
  declaration. Move those definition/declarations from rseq-internal.h
  to the installed sys/rseq.h header.
- Considering that rseq is only available on Linux, move csu/rseq.c to
  sysdeps/unix/sysv/linux/rseq-sym.c.
- Move __rseq_refcount from nptl/rseq.c to
  sysdeps/unix/sysv/linux/rseq-sym.c, so it is only defined on Linux.
- Move both ABI definitions for __rseq_abi and __rseq_refcount to
  sysdeps/unix/sysv/linux/Versions, so they only appear on Linux.
- Document __rseq_abi and __rseq_refcount volatile.
- Document the RSEQ_SIG signature define.
- Move registration functions from rseq.c to rseq-internal.h static
  inline functions. Introduce empty stubs in misc/rseq-internal.h,
  which can be overridden by architecture code in
  sysdeps/unix/sysv/linux/rseq-internal.h.
- Rename __rseq_register_current_thread and __rseq_unregister_current_thread
  to rseq_register_current_thread and rseq_unregister_current_thread,
  now that those are only visible as internal static inline functions.
- Invoke rseq_register_current_thread() from libc-start.c LIBC_START_MAIN
  rather than nptl init, so applications not linked against
  libpthread.so have rseq registered for their main() thread. Note that
  it is invoked separately for SHARED and !SHARED builds.

Changes since v5:
- Replace __rseq_refcount by __rseq_lib_abi, which contains two
  uint32_t: register_state and refcount. The "register_state" field
  allows inhibiting rseq registration from signal handlers nested on top
  of glibc registration and occuring after rseq unregistration by glibc.
- Introduce enum rseq_register_state, which contains the states allowed
  for the struct rseq_lib_abi register_state field.

Changes since v6:
- Introduce bits/rseq.h to define RSEQ_SIG for each architecture.
  The generic bits/rseq.h does not define RSEQ_SIG, meaning that each
  architecture implementing rseq needs to implement bits/rseq.h.
- Rename enum item RSEQ_REGISTER_NESTED to RSEQ_REGISTER_ONGOING.
- Port to glibc-2.29.

Changes since v7:
- Remove __rseq_lib_abi symbol, including refcount and register_state
  fields.
- Remove reference counting and nested signals handling from
  registration/unregistration functions.
- Introduce new __rseq_handled exported symbol, which is set to 1
  by glibc on C startup when it handles restartable sequences.
  This allows glibc to coexist with early adopter libraries and
  applications wishing to register restartable sequences when it
  is not handled by glibc.
- Introduce rseq_init (), which sets __rseq_handled to 1 from
  C startup.
- Update NEWS entry.
- Update comments at the beginning of new files.
- Registration depends on both __NR_rseq and RSEQ_SIG.
- Remove ARM, powerpc, MIPS RSEQ_SIG until we agree with maintainers
  on the signature choice.
- Update x86, s390 RSEQ_SIG based on discussion with arch maintainers.
- Remove rseq-internal.h from headers list of misc/Makefile, so it
  it not installed by make install.

Changes since v8:
- Introduce RSEQ_SIG_CODE and RSEQ_SIG_DATA on aarch64 to handle
  compiling with -mbig-endian.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
CC: Joseph Myers <joseph@codesourcery.com>
CC: Szabolcs Nagy <szabolcs.nagy@arm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Maurer <bmaurer@fb.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Paul Turner <pjt@google.com>
CC: Rich Felker <dalias@libc.org>
CC: libc-alpha@sourceware.org
CC: linux-kernel@vger.kernel.org
CC: linux-api@vger.kernel.org
---
 NEWS                                          | 15 ++++
 csu/libc-start.c                              | 14 ++-
 misc/rseq-internal.h                          | 39 ++++++++
 nptl/pthread_create.c                         |  9 ++
 sysdeps/unix/sysv/linux/Makefile              |  4 +-
 sysdeps/unix/sysv/linux/Versions              |  4 +
 sysdeps/unix/sysv/linux/aarch64/bits/rseq.h   | 44 +++++++++
 sysdeps/unix/sysv/linux/aarch64/libc.abilist  |  2 +
 sysdeps/unix/sysv/linux/alpha/libc.abilist    |  2 +
 sysdeps/unix/sysv/linux/arm/libc.abilist      |  2 +
 sysdeps/unix/sysv/linux/bits/rseq.h           | 30 +++++++
 sysdeps/unix/sysv/linux/csky/libc.abilist     |  2 +
 sysdeps/unix/sysv/linux/hppa/libc.abilist     |  2 +
 sysdeps/unix/sysv/linux/i386/libc.abilist     |  2 +
 sysdeps/unix/sysv/linux/ia64/libc.abilist     |  2 +
 .../sysv/linux/m68k/coldfire/libc.abilist     |  2 +
 .../unix/sysv/linux/m68k/m680x0/libc.abilist  |  2 +
 .../unix/sysv/linux/microblaze/libc.abilist   |  2 +
 .../sysv/linux/mips/mips32/fpu/libc.abilist   |  2 +
 .../sysv/linux/mips/mips32/nofpu/libc.abilist |  2 +
 .../sysv/linux/mips/mips64/n32/libc.abilist   |  2 +
 .../sysv/linux/mips/mips64/n64/libc.abilist   |  2 +
 sysdeps/unix/sysv/linux/nios2/libc.abilist    |  2 +
 .../linux/powerpc/powerpc32/fpu/libc.abilist  |  2 +
 .../powerpc/powerpc32/nofpu/libc.abilist      |  2 +
 .../linux/powerpc/powerpc64/be/libc.abilist   |  2 +
 .../linux/powerpc/powerpc64/le/libc.abilist   |  2 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |  2 +
 sysdeps/unix/sysv/linux/rseq-internal.h       | 89 +++++++++++++++++++
 sysdeps/unix/sysv/linux/rseq-sym.c            | 64 +++++++++++++
 sysdeps/unix/sysv/linux/s390/bits/rseq.h      | 31 +++++++
 .../unix/sysv/linux/s390/s390-32/libc.abilist |  2 +
 .../unix/sysv/linux/s390/s390-64/libc.abilist |  2 +
 sysdeps/unix/sysv/linux/sh/libc.abilist       |  2 +
 .../sysv/linux/sparc/sparc32/libc.abilist     |  2 +
 .../sysv/linux/sparc/sparc64/libc.abilist     |  2 +
 sysdeps/unix/sysv/linux/sys/rseq.h            | 51 +++++++++++
 sysdeps/unix/sysv/linux/x86/bits/rseq.h       | 31 +++++++
 .../unix/sysv/linux/x86_64/64/libc.abilist    |  2 +
 .../unix/sysv/linux/x86_64/x32/libc.abilist   |  2 +
 40 files changed, 474 insertions(+), 5 deletions(-)
 create mode 100644 misc/rseq-internal.h
 create mode 100644 sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
 create mode 100644 sysdeps/unix/sysv/linux/bits/rseq.h
 create mode 100644 sysdeps/unix/sysv/linux/rseq-internal.h
 create mode 100644 sysdeps/unix/sysv/linux/rseq-sym.c
 create mode 100644 sysdeps/unix/sysv/linux/s390/bits/rseq.h
 create mode 100644 sysdeps/unix/sysv/linux/sys/rseq.h
 create mode 100644 sysdeps/unix/sysv/linux/x86/bits/rseq.h

diff --git a/NEWS b/NEWS
index 912a9bdc0f..7276a09b08 100644
--- a/NEWS
+++ b/NEWS
@@ -5,6 +5,21 @@ See the end for copying conditions.
 Please send GNU C library bug reports via <https://sourceware.org/bugzilla/>
 using `glibc' in the "product" field.
 \f
+Version 2.30
+
+Major new features:
+
+* Support for automatically registering threads with the Linux rseq(2)
+  system call has been added.  This system call is implemented starting
+  from Linux 4.18.  The Restartable Sequences ABI accelerates user-space
+  operations on per-cpu data.  It allows user-space to perform updates
+  on per-cpu data without requiring heavy-weight atomic operations.
+  Automatically registering threads allows all libraries, including libc,
+  to make immediate use of the rseq(2) support by using the documented ABI.
+  See 'man 2 rseq' for the details of the ABI shared between libc and the
+  kernel.
+
+\f
 Version 2.29
 
 Major new features:
diff --git a/csu/libc-start.c b/csu/libc-start.c
index 5d9c3675fa..e101196b0d 100644
--- a/csu/libc-start.c
+++ b/csu/libc-start.c
@@ -22,6 +22,7 @@
 #include <ldsodefs.h>
 #include <exit-thread.h>
 #include <libc-internal.h>
+#include <rseq-internal.h>
 
 #include <elf/dl-tunables.h>
 
@@ -140,7 +141,12 @@ LIBC_START_MAIN (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
 
   __libc_multiple_libcs = &_dl_starting_up && !_dl_starting_up;
 
-#ifndef SHARED
+  rseq_init ();
+
+#ifdef SHARED
+  /* Register rseq ABI to the kernel. */
+  (void) rseq_register_current_thread ();
+#else
   _dl_relocate_static_pie ();
 
   char **ev = &argv[argc + 1];
@@ -218,6 +224,9 @@ LIBC_START_MAIN (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
     }
 # endif
 
+  /* Register rseq ABI to the kernel. */
+  (void) rseq_register_current_thread ();
+
   /* Initialize libpthread if linked in.  */
   if (__pthread_initialize_minimal != NULL)
     __pthread_initialize_minimal ();
@@ -230,8 +239,7 @@ LIBC_START_MAIN (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
 # else
   __pointer_chk_guard_local = pointer_chk_guard;
 # endif
-
-#endif /* !SHARED  */
+#endif
 
   /* Register the destructor of the dynamic linker if there is any.  */
   if (__glibc_likely (rtld_fini != NULL))
diff --git a/misc/rseq-internal.h b/misc/rseq-internal.h
new file mode 100644
index 0000000000..b6159319c8
--- /dev/null
+++ b/misc/rseq-internal.h
@@ -0,0 +1,39 @@
+/* Restartable Sequences internal API. Stub version.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef RSEQ_INTERNAL_H
+#define RSEQ_INTERNAL_H
+
+static inline int
+rseq_register_current_thread (void)
+{
+  return -1;
+}
+
+static inline int
+rseq_unregister_current_thread (void)
+{
+  return -1;
+}
+
+static inline int
+rseq_init (void)
+{
+}
+
+#endif /* rseq-internal.h */
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index 2bd2b10727..90b3419390 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -33,6 +33,7 @@
 #include <default-sched.h>
 #include <futex-internal.h>
 #include <tls-setup.h>
+#include <rseq-internal.h>
 #include "libioP.h"
 
 #include <shlib-compat.h>
@@ -378,6 +379,7 @@ __free_tcb (struct pthread *pd)
 START_THREAD_DEFN
 {
   struct pthread *pd = START_THREAD_SELF;
+  bool has_rseq = false;
 
 #if HP_TIMING_AVAIL
   /* Remember the time when the thread was started.  */
@@ -396,6 +398,9 @@ START_THREAD_DEFN
   if (__glibc_unlikely (atomic_exchange_acq (&pd->setxid_futex, 0) == -2))
     futex_wake (&pd->setxid_futex, 1, FUTEX_PRIVATE);
 
+  /* Register rseq TLS to the kernel. */
+  has_rseq = !rseq_register_current_thread ();
+
 #ifdef __NR_set_robust_list
 # ifndef __ASSUME_SET_ROBUST_LIST
   if (__set_robust_list_avail >= 0)
@@ -573,6 +578,10 @@ START_THREAD_DEFN
     }
 #endif
 
+  /* Unregister rseq TLS from kernel. */
+  if (has_rseq && rseq_unregister_current_thread ())
+    abort();
+
   advise_stack_range (pd->stackblock, pd->stackblock_size, (uintptr_t) pd,
 		      pd->guardsize);
 
diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
index 5f8c2c7c7d..5b541469ec 100644
--- a/sysdeps/unix/sysv/linux/Makefile
+++ b/sysdeps/unix/sysv/linux/Makefile
@@ -1,5 +1,5 @@
 ifeq ($(subdir),csu)
-sysdep_routines += errno-loc
+sysdep_routines += errno-loc rseq-sym
 endif
 
 ifeq ($(subdir),assert)
@@ -48,7 +48,7 @@ sysdep_headers += sys/mount.h sys/acct.h sys/sysctl.h \
 		  bits/termios-c_iflag.h bits/termios-c_oflag.h \
 		  bits/termios-baud.h bits/termios-c_cflag.h \
 		  bits/termios-c_lflag.h bits/termios-tcflow.h \
-		  bits/termios-misc.h
+		  bits/termios-misc.h sys/rseq.h bits/rseq.h
 
 tests += tst-clone tst-clone2 tst-clone3 tst-fanotify tst-personality \
 	 tst-quota tst-sync_file_range tst-sysconf-iov_max tst-ttyname \
diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
index f1e12d9c69..bee3d727e5 100644
--- a/sysdeps/unix/sysv/linux/Versions
+++ b/sysdeps/unix/sysv/linux/Versions
@@ -174,6 +174,10 @@ libc {
   GLIBC_2.29 {
     getcpu;
   }
+  GLIBC_2.30 {
+    __rseq_abi;
+    __rseq_handled;
+  }
   GLIBC_PRIVATE {
     # functions used in other libraries
     __syscall_rt_sigqueueinfo;
diff --git a/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h b/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
new file mode 100644
index 0000000000..e538668612
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/aarch64/bits/rseq.h
@@ -0,0 +1,44 @@
+/* Restartable Sequences Linux aarch64 architecture header.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _SYS_RSEQ_H
+# error "Never use <bits/rseq.h> directly; include <sys/rseq.h> instead."
+#endif
+
+/* RSEQ_SIG is a signature required before each abort handler code.
+
+   It is a 32-bit value that maps to actual architecture code compiled
+   into applications and libraries. It needs to be defined for each
+   architecture. When choosing this value, it needs to be taken into
+   account that generating invalid instructions may have ill effects on
+   tools like objdump, and may also have impact on the CPU speculative
+   execution efficiency in some cases.
+
+   aarch64 -mbig-endian generates mixed endianness code vs data:
+   little-endian code and big-endian data. Ensure the RSEQ_SIG signature
+   matches code endianness.  */
+
+#define RSEQ_SIG_CODE	0xd428bc00	/* BRK #0x45E0.  */
+
+#ifdef __AARCH64EB__
+#define RSEQ_SIG_DATA	0x00bc28d4	/* BRK #0x45E0.  */
+#else
+#define RSEQ_SIG_DATA	RSEQ_SIG_CODE
+#endif
+
+#define RSEQ_SIG	RSEQ_SIG_DATA
diff --git a/sysdeps/unix/sysv/linux/aarch64/libc.abilist b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
index 9c330f325e..331f39e41a 100644
--- a/sysdeps/unix/sysv/linux/aarch64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
@@ -2141,3 +2141,5 @@ GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
diff --git a/sysdeps/unix/sysv/linux/alpha/libc.abilist b/sysdeps/unix/sysv/linux/alpha/libc.abilist
index f630fa4c6f..05dfdd3393 100644
--- a/sysdeps/unix/sysv/linux/alpha/libc.abilist
+++ b/sysdeps/unix/sysv/linux/alpha/libc.abilist
@@ -2204,6 +2204,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/arm/libc.abilist b/sysdeps/unix/sysv/linux/arm/libc.abilist
index b96f45590f..24e9b89a50 100644
--- a/sysdeps/unix/sysv/linux/arm/libc.abilist
+++ b/sysdeps/unix/sysv/linux/arm/libc.abilist
@@ -126,6 +126,8 @@ GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _Exit F
 GLIBC_2.4 _IO_2_1_stderr_ D 0xa0
 GLIBC_2.4 _IO_2_1_stdin_ D 0xa0
diff --git a/sysdeps/unix/sysv/linux/bits/rseq.h b/sysdeps/unix/sysv/linux/bits/rseq.h
new file mode 100644
index 0000000000..2f3e4c0e21
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/bits/rseq.h
@@ -0,0 +1,30 @@
+/* Restartable Sequences architecture header. Stub version.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _SYS_RSEQ_H
+# error "Never use <bits/rseq.h> directly; include <sys/rseq.h> instead."
+#endif
+
+/* RSEQ_SIG is a signature required before each abort handler code.
+
+   It is a 32-bit value that maps to actual architecture code compiled
+   into applications and libraries. It needs to be defined for each
+   architecture. When choosing this value, it needs to be taken into
+   account that generating invalid instructions may have ill effects on
+   tools like objdump, and may also have impact on the CPU speculative
+   execution efficiency in some cases.  */
diff --git a/sysdeps/unix/sysv/linux/csky/libc.abilist b/sysdeps/unix/sysv/linux/csky/libc.abilist
index 019044c3cd..e2b0538088 100644
--- a/sysdeps/unix/sysv/linux/csky/libc.abilist
+++ b/sysdeps/unix/sysv/linux/csky/libc.abilist
@@ -2085,3 +2085,5 @@ GLIBC_2.29 xdrstdio_create F
 GLIBC_2.29 xencrypt F
 GLIBC_2.29 xprt_register F
 GLIBC_2.29 xprt_unregister F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
diff --git a/sysdeps/unix/sysv/linux/hppa/libc.abilist b/sysdeps/unix/sysv/linux/hppa/libc.abilist
index 088a8ee369..263a91b97e 100644
--- a/sysdeps/unix/sysv/linux/hppa/libc.abilist
+++ b/sysdeps/unix/sysv/linux/hppa/libc.abilist
@@ -2037,6 +2037,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/i386/libc.abilist b/sysdeps/unix/sysv/linux/i386/libc.abilist
index f7ff2c57b9..18ce09d48a 100644
--- a/sysdeps/unix/sysv/linux/i386/libc.abilist
+++ b/sysdeps/unix/sysv/linux/i386/libc.abilist
@@ -2203,6 +2203,8 @@ GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 vm86 F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/ia64/libc.abilist b/sysdeps/unix/sysv/linux/ia64/libc.abilist
index becd8b1033..b61e2ee010 100644
--- a/sysdeps/unix/sysv/linux/ia64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/ia64/libc.abilist
@@ -2069,6 +2069,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
index 74e42a5209..e55792bb22 100644
--- a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
@@ -127,6 +127,8 @@ GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _Exit F
 GLIBC_2.4 _IO_2_1_stderr_ D 0x98
 GLIBC_2.4 _IO_2_1_stdin_ D 0x98
diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
index 4af5a74e8a..9845499048 100644
--- a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
@@ -2146,6 +2146,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/microblaze/libc.abilist b/sysdeps/unix/sysv/linux/microblaze/libc.abilist
index ccef673fd2..1aba8cb86c 100644
--- a/sysdeps/unix/sysv/linux/microblaze/libc.abilist
+++ b/sysdeps/unix/sysv/linux/microblaze/libc.abilist
@@ -2133,3 +2133,5 @@ GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
index 1054bb599e..df54e2adab 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
@@ -2120,6 +2120,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
index 4f5b5ffebf..ce95ae7e86 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
@@ -2118,6 +2118,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
index 943aee58d4..c9fb5d2096 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
@@ -2126,6 +2126,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
index 17a5d17ef9..6335df9acf 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
@@ -2120,6 +2120,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/nios2/libc.abilist b/sysdeps/unix/sysv/linux/nios2/libc.abilist
index 4d62a540fd..5465b96768 100644
--- a/sysdeps/unix/sysv/linux/nios2/libc.abilist
+++ b/sysdeps/unix/sysv/linux/nios2/libc.abilist
@@ -2174,3 +2174,5 @@ GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
index ecc2d6fa13..eb3808dbd4 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
@@ -2164,6 +2164,8 @@ GLIBC_2.3.4 siglongjmp F
 GLIBC_2.3.4 swapcontext F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
index f5830f9c33..6a49a7b718 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
@@ -2197,6 +2197,8 @@ GLIBC_2.3.4 siglongjmp F
 GLIBC_2.3.4 swapcontext F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
index 633d8f4792..83177dc75f 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
@@ -2027,6 +2027,8 @@ GLIBC_2.3.4 siglongjmp F
 GLIBC_2.3.4 swapcontext F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
index 2c712636ef..e714de994c 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
@@ -2231,3 +2231,5 @@ GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
index 195bc8b2cf..d190623993 100644
--- a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
@@ -2103,3 +2103,5 @@ GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
diff --git a/sysdeps/unix/sysv/linux/rseq-internal.h b/sysdeps/unix/sysv/linux/rseq-internal.h
new file mode 100644
index 0000000000..a27324ac28
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/rseq-internal.h
@@ -0,0 +1,89 @@
+/* Restartable Sequences internal API. Linux implementation.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef RSEQ_INTERNAL_H
+#define RSEQ_INTERNAL_H
+
+#include <sysdep.h>
+#include <errno.h>
+
+#ifdef __NR_rseq
+#include <sys/rseq.h>
+#endif
+
+#if defined __NR_rseq && defined RSEQ_SIG
+
+static inline int
+rseq_register_current_thread (void)
+{
+  int rc, ret = 0;
+  INTERNAL_SYSCALL_DECL (err);
+
+  if (__rseq_abi.cpu_id == RSEQ_CPU_ID_REGISTRATION_FAILED)
+    return -1;
+  rc = INTERNAL_SYSCALL_CALL (rseq, err, &__rseq_abi, sizeof (struct rseq),
+                              0, RSEQ_SIG);
+  if (!rc)
+    goto end;
+  if (INTERNAL_SYSCALL_ERRNO (rc, err) != EBUSY)
+    __rseq_abi.cpu_id = RSEQ_CPU_ID_REGISTRATION_FAILED;
+  ret = -1;
+end:
+  return ret;
+}
+
+static inline int
+rseq_unregister_current_thread (void)
+{
+  int rc, ret = 0;
+  INTERNAL_SYSCALL_DECL (err);
+
+  rc = INTERNAL_SYSCALL_CALL (rseq, err, &__rseq_abi, sizeof (struct rseq),
+                              RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
+  if (!rc)
+    goto end;
+  ret = -1;
+end:
+  return ret;
+}
+
+static inline void
+rseq_init (void)
+{
+  __rseq_handled = 1;
+}
+#else
+static inline int
+rseq_register_current_thread (void)
+{
+  return -1;
+}
+
+static inline int
+rseq_unregister_current_thread (void)
+{
+  return -1;
+}
+
+static inline void
+rseq_init (void)
+{
+}
+#endif
+
+#endif /* rseq-internal.h */
diff --git a/sysdeps/unix/sysv/linux/rseq-sym.c b/sysdeps/unix/sysv/linux/rseq-sym.c
new file mode 100644
index 0000000000..65403807c8
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/rseq-sym.c
@@ -0,0 +1,64 @@
+/* Restartable Sequences exported symbols. Linux Implementation.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sys/syscall.h>
+#include <stdint.h>
+
+#ifdef __NR_rseq
+#include <sys/rseq.h>
+#else
+
+enum rseq_cpu_id_state {
+  RSEQ_CPU_ID_UNINITIALIZED = -1,
+  RSEQ_CPU_ID_REGISTRATION_FAILED = -2,
+};
+
+/* linux/rseq.h defines struct rseq as aligned on 32 bytes. The kernel ABI
+   size is 20 bytes.  */
+struct rseq {
+  uint32_t cpu_id_start;
+  uint32_t cpu_id;
+  uint64_t rseq_cs;
+  uint32_t flags;
+} __attribute__ ((aligned(4 * sizeof(uint64_t))));
+
+#endif
+
+/* volatile because fields can be read/updated by the kernel.  */
+__thread volatile struct rseq __rseq_abi = {
+  .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+/* Advertise Restartable Sequences registration ownership across
+   application and shared libraries.
+
+   Libraries and applications must check whether this variable is zero or
+   non-zero if they wish to perform rseq registration on their own. If it
+   is zero, it means restartable sequence registration is not handled, and
+   the library or application is free to perform rseq registration. In
+   that case, the library or application is taking ownership of rseq
+   registration, and may set __rseq_handled to 1. It may then set it back
+   to 0 after it completes unregistering rseq.
+
+   If __rseq_handled is found to be non-zero, it means that another
+   library (or the application) is currently handling rseq registration.
+
+   Typical use of __rseq_handled is within library constructors and
+   destructors, or at program startup.  */
+
+int __rseq_handled;
diff --git a/sysdeps/unix/sysv/linux/s390/bits/rseq.h b/sysdeps/unix/sysv/linux/s390/bits/rseq.h
new file mode 100644
index 0000000000..7eba4042ea
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/s390/bits/rseq.h
@@ -0,0 +1,31 @@
+/* Restartable Sequences Linux s390 architecture header.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _SYS_RSEQ_H
+# error "Never use <bits/rseq.h> directly; include <sys/rseq.h> instead."
+#endif
+
+/* RSEQ_SIG is a signature required before each abort handler code.
+
+   RSEQ_SIG uses the trap4 instruction. As Linux does not make use of the
+   access-register mode nor the linkage stack this instruction will always
+   cause a special-operation exception (the trap-enabled bit in the DUCT
+   is and will stay 0). The instruction pattern is
+	b2 ff 0f ff	trap4   4095(%r0)  */
+
+#define RSEQ_SIG	0xB2FF0FFF
diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
index 334def033c..dacae17ec4 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
@@ -2159,6 +2159,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
index 536f4c4ced..c277b3bd90 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
@@ -2063,6 +2063,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/sh/libc.abilist b/sysdeps/unix/sysv/linux/sh/libc.abilist
index 30ae3b6ebb..5f70e5c53b 100644
--- a/sysdeps/unix/sysv/linux/sh/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sh/libc.abilist
@@ -2041,6 +2041,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
index 68b107d080..537da009d3 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
@@ -2153,6 +2153,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 _IO_fprintf F
 GLIBC_2.4 _IO_printf F
 GLIBC_2.4 _IO_sprintf F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
index e5b6a4da50..1fee8e34fc 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
@@ -2092,6 +2092,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/sys/rseq.h b/sysdeps/unix/sysv/linux/sys/rseq.h
new file mode 100644
index 0000000000..c48a4bf8ff
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/sys/rseq.h
@@ -0,0 +1,51 @@
+/* Restartable Sequences exported symbols. Linux header.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _SYS_RSEQ_H
+#define _SYS_RSEQ_H	1
+
+/* We use the structures declarations from the kernel headers.  */
+#include <linux/rseq.h>
+/* Architecture-specific rseq signature.  */
+#include <bits/rseq.h>
+#include <stdint.h>
+
+/* volatile because fields can be read/updated by the kernel.  */
+extern __thread volatile struct rseq __rseq_abi
+__attribute__ ((tls_model ("initial-exec")));
+
+/* Advertise Restartable Sequences registration ownership across
+   application and shared libraries.
+
+   Libraries and applications must check whether this variable is zero or
+   non-zero if they wish to perform rseq registration on their own. If it
+   is zero, it means restartable sequence registration is not handled, and
+   the library or application is free to perform rseq registration. In
+   that case, the library or application is taking ownership of rseq
+   registration, and may set __rseq_handled to 1. It may then set it back
+   to 0 after it completes unregistering rseq.
+
+   If __rseq_handled is found to be non-zero, it means that another
+   library (or the application) is currently handling rseq registration.
+
+   Typical use of __rseq_handled is within library constructors and
+   destructors, or at program startup.  */
+
+extern int __rseq_handled;
+
+#endif /* sys/rseq.h */
diff --git a/sysdeps/unix/sysv/linux/x86/bits/rseq.h b/sysdeps/unix/sysv/linux/x86/bits/rseq.h
new file mode 100644
index 0000000000..8064dda509
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/bits/rseq.h
@@ -0,0 +1,31 @@
+/* Restartable Sequences Linux x86 architecture header.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _SYS_RSEQ_H
+# error "Never use <bits/rseq.h> directly; include <sys/rseq.h> instead."
+#endif
+
+/* RSEQ_SIG is a signature required before each abort handler code.
+
+   RSEQ_SIG is used with the following reserved undefined instructions, which
+   trap in user-space:
+
+   x86-32:    0f b9 3d 53 30 05 53      ud1    0x53053053,%edi
+   x86-64:    0f b9 3d 53 30 05 53      ud1    0x53053053(%rip),%edi  */
+
+#define RSEQ_SIG	0x53053053
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
index 86dfb0c94d..a834f65383 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
@@ -2050,6 +2050,8 @@ GLIBC_2.3.4 setipv4sourcefilter F
 GLIBC_2.3.4 setsourcefilter F
 GLIBC_2.3.4 xdr_quad_t F
 GLIBC_2.3.4 xdr_u_quad_t F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
 GLIBC_2.4 __confstr_chk F
 GLIBC_2.4 __fgets_chk F
 GLIBC_2.4 __fgets_unlocked_chk F
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
index dd688263aa..fb8417bde7 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
@@ -2149,3 +2149,5 @@ GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
+GLIBC_2.30 __rseq_abi T 0x20
+GLIBC_2.30 __rseq_handled D 0x4
-- 
2.17.1

^ permalink raw reply related

* Re: WARNING in percpu_ref_kill_and_confirm
From: Jens Axboe @ 2019-04-22 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Arnd Bergmann, Borislav Petkov, Darrick J. Wong,
	Greg Kroah-Hartman, Peter Anvin, Linux API, linux-arch,
	linux-block, linux-fsdevel, Linux List Kernel Mailing,
	Andrew Lutomirski, Mathieu Desnoyers, Ingo Molnar,
	Michael Ellerman, syzkaller-bugs, Thomas Gleixner, Al Viro
In-Reply-To: <CAHk-=wh=kKEb6qbuhnSWh5S55SbQBOE33991ULSFUgVrEnzivA@mail.gmail.com>

On 4/22/19 10:48 AM, Linus Torvalds wrote:
> On Mon, Apr 22, 2019 at 9:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> With the mutex change in, I can trigger it in a second or so. Just ran
>> the reproducer with that change reverted, and I'm not seeing any badness.
>> So I do wonder if the bisect results are accurate?
> 
> Looking at the syzbot report, it's syzbot being confused.
> 
> The actual WARNING in percpu_ref_kill_and_confirm() only happens with
> recent kernels.
> 
> But then syzbot mixes it up with a completely different bug:
> 
>    crash: BUG: MAX_STACK_TRACE_ENTRIES too low!
>    BUG: MAX_STACK_TRACE_ENTRIES too low!
> 
> and for some reason decides that *that* bug is the same thing entirely.
> 
> So yeah, I think the simple percpu_ref_is_dying() check is sufficient,
> and that the syzbot bisection is completely bogus.

Ah good, that makes me feel better. I'll queue the fix up, thanks.

-- 
Jens Axboe

^ permalink raw reply

* Re: WARNING in percpu_ref_kill_and_confirm
From: Linus Torvalds @ 2019-04-22 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: syzbot, Arnd Bergmann, Borislav Petkov, Darrick J. Wong,
	Greg Kroah-Hartman, Peter Anvin, Linux API, linux-arch,
	linux-block, linux-fsdevel, Linux List Kernel Mailing,
	Andrew Lutomirski, Mathieu Desnoyers, Ingo Molnar,
	Michael Ellerman, syzkaller-bugs, Thomas Gleixner, Al Viro
In-Reply-To: <cfa6be6d-4125-6d1e-8993-e9ec9dbaa7bb@kernel.dk>

On Mon, Apr 22, 2019 at 9:38 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> With the mutex change in, I can trigger it in a second or so. Just ran
> the reproducer with that change reverted, and I'm not seeing any badness.
> So I do wonder if the bisect results are accurate?

Looking at the syzbot report, it's syzbot being confused.

The actual WARNING in percpu_ref_kill_and_confirm() only happens with
recent kernels.

But then syzbot mixes it up with a completely different bug:

   crash: BUG: MAX_STACK_TRACE_ENTRIES too low!
   BUG: MAX_STACK_TRACE_ENTRIES too low!

and for some reason decides that *that* bug is the same thing entirely.

So yeah, I think the simple percpu_ref_is_dying() check is sufficient,
and that the syzbot bisection is completely bogus.

                Linus

^ permalink raw reply

* Re: WARNING in percpu_ref_kill_and_confirm
From: Jens Axboe @ 2019-04-22 16:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Arnd Bergmann, Borislav Petkov, Darrick J. Wong,
	Greg Kroah-Hartman, Peter Anvin, Linux API, linux-arch,
	linux-block, linux-fsdevel, Linux List Kernel Mailing,
	Andrew Lutomirski, Mathieu Desnoyers, Ingo Molnar,
	Michael Ellerman, syzkaller-bugs, Thomas Gleixner, Al Viro
In-Reply-To: <53a17444-9539-5810-82a0-ceeefa742508@kernel.dk>

On 4/22/19 10:32 AM, Jens Axboe wrote:
> On 4/22/19 10:27 AM, Linus Torvalds wrote:
>> [ Crossed emails ]
>>
>> On Mon, Apr 22, 2019 at 9:23 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> I think the below should fix this. Very early versions of io_uring didn't
>>> have this issue, since we did the percpu ref tryget for io_uring_register().
>>
>> Ok, so I like your patch better than mine, but note how syzbot
>> bisected this to the initial merge of the io_uring code.
> 
> Yes, I did think about that too...
> 
>> I agree that code shouldn't have had this particular issue, but it
>> looks like it does.
>>
>> Is there some way to race with io_ring_ctx_wait_and_kill(), which
>> _also_ does that ref_kill() thing? I'm not seeing how that could
>> happen, but maybe if the file ref counts get screwed up you have
>> ->release() called early..
> 
> I just tried on the current code and it triggers easily, but that's
> with that mutex patch in there. I agree it should not trigger before
> that, unless something is wonky. I'll try and play around with it a bit
> and see what is going on (or if I can trigger it at all with the mutex
> change reverted).

With the mutex change in, I can trigger it in a second or so. Just ran
the reproducer with that change reverted, and I'm not seeing any badness.
So I do wonder if the bisect results are accurate?

I think the dying check should cover it, and then marked with fixing
that mutex commit.

-- 
Jens Axboe

^ permalink raw reply

* Re: WARNING in percpu_ref_kill_and_confirm
From: Jens Axboe @ 2019-04-22 16:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: syzbot, Arnd Bergmann, Borislav Petkov, Darrick J. Wong,
	Greg Kroah-Hartman, Peter Anvin, Linux API, linux-arch,
	linux-block, linux-fsdevel, Linux List Kernel Mailing,
	Andrew Lutomirski, Mathieu Desnoyers, Ingo Molnar,
	Michael Ellerman, syzkaller-bugs, Thomas Gleixner, Al Viro
In-Reply-To: <CAHk-=wgG=RqK8T5kWiZfH7da6rBuqp3kAesP=ETgceb6RzvZng@mail.gmail.com>

On 4/22/19 10:27 AM, Linus Torvalds wrote:
> [ Crossed emails ]
> 
> On Mon, Apr 22, 2019 at 9:23 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I think the below should fix this. Very early versions of io_uring didn't
>> have this issue, since we did the percpu ref tryget for io_uring_register().
> 
> Ok, so I like your patch better than mine, but note how syzbot
> bisected this to the initial merge of the io_uring code.

Yes, I did think about that too...

> I agree that code shouldn't have had this particular issue, but it
> looks like it does.
> 
> Is there some way to race with io_ring_ctx_wait_and_kill(), which
> _also_ does that ref_kill() thing? I'm not seeing how that could
> happen, but maybe if the file ref counts get screwed up you have
> ->release() called early..

I just tried on the current code and it triggers easily, but that's
with that mutex patch in there. I agree it should not trigger before
that, unless something is wonky. I'll try and play around with it a bit
and see what is going on (or if I can trigger it at all with the mutex
change reverted).

-- 
Jens Axboe

^ permalink raw reply

* Re: WARNING in percpu_ref_kill_and_confirm
From: Jens Axboe @ 2019-04-22 16:28 UTC (permalink / raw)
  To: Linus Torvalds, syzbot
  Cc: Arnd Bergmann, Borislav Petkov, Darrick J. Wong,
	Greg Kroah-Hartman, Peter Anvin, Linux API, linux-arch,
	linux-block, linux-fsdevel, Linux List Kernel Mailing,
	Andrew Lutomirski, Mathieu Desnoyers, Ingo Molnar,
	Michael Ellerman, syzkaller-bugs, Thomas Gleixner, Al Viro,
	the arch/x86 maintainers
In-Reply-To: <CAHk-=wgC5mNgPN5excXyTNtFpOyF4+9jk3tLiC+s-VezgGbVTA@mail.gmail.com>

On 4/22/19 10:23 AM, Linus Torvalds wrote:
> On Mon, Apr 22, 2019 at 9:06 AM syzbot
> <syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com> wrote:
>>
>>
>> The bug was bisected to:
>>
>> commit 38e7571c07be01f9f19b355a9306a4e3d5cb0f5b
>> Author: Linus Torvalds <torvalds@linux-foundation.org>
>> Date:   Fri Mar 8 22:48:40 2019 +0000
>>
>>      Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block
>>
>> percpu_ref_kill_and_confirm called more than once on io_ring_ctx_ref_free!
> 
> So I don't see how that happens in the original code (because
> __io_uring_register() is called with the uring_lock held), but let's
> see.
> 
> HOWEVER.
> 
> I do see how it happens now as of the latest kernel as of commit
> b19062a56726 ("io_uring: fix possible deadlock between
> io_uring_{enter,register}") where the code explicitly drops the mutex
> in order to wait for other uring users to finish.
> 
> So Jens, I think that commit was buggy. I suspect that
> io_uring_register() should perhaps do something like
> 
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2934,7 +2934,10 @@ static int __io_uring_register(struct
> io_ring_ctx *ctx, unsigned opcode,
>  {
>         int ret;
> 
> +       if (!percpu_ref_tryget(&ctx->refs))
> +               return -EBUSY;
>         percpu_ref_kill(&ctx->refs);
> +       percpu_ref_put(&ctx->refs);
> 
>         /*
>          * Drop uring mutex before waiting for references to exit. If another
> 
> to guarantee that it's the *only* case of io_uring_register() doing that kill.
> 
> Hmm?

Just sent out something as well. I think we can get by with just
checking if it's dying, or we can go the route of what you did which is
actually very similar to what the earlier versions did. Both versions
should fix the issue.

I'll test just to be totally sure.

-- 
Jens Axboe

^ permalink raw reply

* Re: WARNING in percpu_ref_kill_and_confirm
From: Linus Torvalds @ 2019-04-22 16:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: syzbot, Arnd Bergmann, Borislav Petkov, Darrick J. Wong,
	Greg Kroah-Hartman, Peter Anvin, Linux API, linux-arch,
	linux-block, linux-fsdevel, Linux List Kernel Mailing,
	Andrew Lutomirski, Mathieu Desnoyers, Ingo Molnar,
	Michael Ellerman, syzkaller-bugs, Thomas Gleixner, Al Viro
In-Reply-To: <ad884b18-0e03-d6be-02b0-4e5273f068ce@kernel.dk>

[ Crossed emails ]

On Mon, Apr 22, 2019 at 9:23 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I think the below should fix this. Very early versions of io_uring didn't
> have this issue, since we did the percpu ref tryget for io_uring_register().

Ok, so I like your patch better than mine, but note how syzbot
bisected this to the initial merge of the io_uring code.

I agree that code shouldn't have had this particular issue, but it
looks like it does.

Is there some way to race with io_ring_ctx_wait_and_kill(), which
_also_ does that ref_kill() thing? I'm not seeing how that could
happen, but maybe if the file ref counts get screwed up you have
->release() called early..

                  Linus

^ permalink raw reply

* Re: WARNING in percpu_ref_kill_and_confirm
From: Linus Torvalds @ 2019-04-22 16:23 UTC (permalink / raw)
  To: syzbot
  Cc: Arnd Bergmann, Jens Axboe, Borislav Petkov, Darrick J. Wong,
	Greg Kroah-Hartman, Peter Anvin, Linux API, linux-arch,
	linux-block, linux-fsdevel, Linux List Kernel Mailing,
	Andrew Lutomirski, Mathieu Desnoyers, Ingo Molnar,
	Michael Ellerman, syzkaller-bugs, Thomas Gleixner, Al Viro,
	the arch/x86 maintainers
In-Reply-To: <00000000000043fe9c058720a5d3@google.com>

On Mon, Apr 22, 2019 at 9:06 AM syzbot
<syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com> wrote:
>
>
> The bug was bisected to:
>
> commit 38e7571c07be01f9f19b355a9306a4e3d5cb0f5b
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Fri Mar 8 22:48:40 2019 +0000
>
>      Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block
>
> percpu_ref_kill_and_confirm called more than once on io_ring_ctx_ref_free!

So I don't see how that happens in the original code (because
__io_uring_register() is called with the uring_lock held), but let's
see.

HOWEVER.

I do see how it happens now as of the latest kernel as of commit
b19062a56726 ("io_uring: fix possible deadlock between
io_uring_{enter,register}") where the code explicitly drops the mutex
in order to wait for other uring users to finish.

So Jens, I think that commit was buggy. I suspect that
io_uring_register() should perhaps do something like

--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2934,7 +2934,10 @@ static int __io_uring_register(struct
io_ring_ctx *ctx, unsigned opcode,
 {
        int ret;

+       if (!percpu_ref_tryget(&ctx->refs))
+               return -EBUSY;
        percpu_ref_kill(&ctx->refs);
+       percpu_ref_put(&ctx->refs);

        /*
         * Drop uring mutex before waiting for references to exit. If another

to guarantee that it's the *only* case of io_uring_register() doing that kill.

Hmm?

                Linus

^ permalink raw reply

* Re: WARNING in percpu_ref_kill_and_confirm
From: Jens Axboe @ 2019-04-22 16:23 UTC (permalink / raw)
  To: syzbot, arnd, bp, darrick.wong, gregkh, hpa, linux-api,
	linux-arch, linux-block, linux-fsdevel, linux-kernel, luto,
	mathieu.desnoyers, mingo, mpe, syzkaller-bugs, tglx, torvalds,
	viro, x86
In-Reply-To: <00000000000043fe9c058720a5d3@google.com>

On 4/22/19 10:06 AM, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    9e5de623 Merge tag 'nfs-for-5.1-5' of git://git.linux-nfs...
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15624257200000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=856fc6d0fbbeede9
> dashboard link: https://syzkaller.appspot.com/bug?extid=10d25e23199614b7721f
> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> userspace arch: i386
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17ff39f3200000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15758647200000
> 
> The bug was bisected to:
> 
> commit 38e7571c07be01f9f19b355a9306a4e3d5cb0f5b
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Fri Mar 8 22:48:40 2019 +0000
> 
>      Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1736bc57200000
> final crash:    https://syzkaller.appspot.com/x/report.txt?x=14b6bc57200000
> console output: https://syzkaller.appspot.com/x/log.txt?x=10b6bc57200000
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com
> Fixes: 38e7571c07be ("Merge tag 'io_uring-2019-03-06' of  
> git://git.kernel.dk/linux-block")
> 
> ------------[ cut here ]------------
> percpu_ref_kill_and_confirm called more than once on io_ring_ctx_ref_free!
> WARNING: CPU: 1 PID: 7815 at lib/percpu-refcount.c:335  
> percpu_ref_kill_and_confirm+0x341/0x3b0 lib/percpu-refcount.c:335
> Kernel panic - not syncing: panic_on_warn set ...
> CPU: 1 PID: 7815 Comm: syz-executor269 Not tainted 5.1.0-rc5+ #77
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>   panic+0x2cb/0x65c kernel/panic.c:214
>   __warn.cold+0x20/0x45 kernel/panic.c:571
>   report_bug+0x263/0x2b0 lib/bug.c:186
>   fixup_bug arch/x86/kernel/traps.c:179 [inline]
>   fixup_bug arch/x86/kernel/traps.c:174 [inline]
>   do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
>   do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
>   invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
> RIP: 0010:percpu_ref_kill_and_confirm+0x341/0x3b0 lib/percpu-refcount.c:335
> Code: 42 e0 2a 06 01 48 89 fa 48 c1 ea 03 80 3c 02 00 75 76 49 8b 54 24 10  
> 48 c7 c6 a0 71 a1 87 48 c7 c7 40 71 a1 87 e8 ad 92 13 fe <0f> 0b 48 b8 00  
> 00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 80 3c 02
> RSP: 0018:ffff8880a96cfce0 EFLAGS: 00010086
> RAX: 0000000000000000 RBX: 0000607f5142e35b RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff815afcf6 RDI: ffffed10152d9f8e
> RBP: ffff8880a96cfd10 R08: ffff8880a85c40c0 R09: fffffbfff1133639
> R10: fffffbfff1133638 R11: ffffffff8899b1c3 R12: ffff88809ee571c0
> R13: ffff88809ee571c8 R14: 0000000000000286 R15: 0000000000000000
>   percpu_ref_kill include/linux/percpu-refcount.h:128 [inline]
>   __io_uring_register+0xa7/0x1fe0 fs/io_uring.c:2937
>   __do_sys_io_uring_register fs/io_uring.c:2998 [inline]
>   __se_sys_io_uring_register fs/io_uring.c:2980 [inline]
>   __ia32_sys_io_uring_register+0x193/0x1f0 fs/io_uring.c:2980
>   do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
>   do_fast_syscall_32+0x281/0xc98 arch/x86/entry/common.c:397
>   entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> RIP: 0023:0xf7f16869
> Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90  
> 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90  
> 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> RSP: 002b:00000000f7ef11ec EFLAGS: 00000296 ORIG_RAX: 00000000000001ab
> RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> Kernel Offset: disabled
> Rebooting in 86400 seconds..

I think the below should fix this. Very early versions of io_uring didn't
have this issue, since we did the percpu ref tryget for io_uring_register().
But I think we'll be just fine just checking if the ref is already dying
once inside the mutex. If it is, it's either going away, or someone else
is already doing io_uring_register() on it.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index f65f85d89217..a2f39faed6a7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2934,6 +2934,14 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 {
 	int ret;
 
+	/*
+	 * We're inside the ring mutex, if the ref is already dying, then
+	 * someone else killed the ctx or is already going through
+	 * io_uring_register().
+	 */
+	if (percpu_ref_is_dying(&ctx->refs))
+		return -ENXIO;
+
 	percpu_ref_kill(&ctx->refs);
 
 	/*

-- 
Jens Axboe

^ permalink raw reply related

* WARNING in percpu_ref_kill_and_confirm
From: syzbot @ 2019-04-22 16:06 UTC (permalink / raw)
  To: arnd, axboe, bp, darrick.wong, gregkh, hpa, linux-api, linux-arch,
	linux-block, linux-fsdevel, linux-kernel, luto, mathieu.desnoyers,
	mingo, mpe, syzkaller-bugs, tglx, torvalds, viro, x86

Hello,

syzbot found the following crash on:

HEAD commit:    9e5de623 Merge tag 'nfs-for-5.1-5' of git://git.linux-nfs...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15624257200000
kernel config:  https://syzkaller.appspot.com/x/.config?x=856fc6d0fbbeede9
dashboard link: https://syzkaller.appspot.com/bug?extid=10d25e23199614b7721f
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
userspace arch: i386
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17ff39f3200000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15758647200000

The bug was bisected to:

commit 38e7571c07be01f9f19b355a9306a4e3d5cb0f5b
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri Mar 8 22:48:40 2019 +0000

     Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1736bc57200000
final crash:    https://syzkaller.appspot.com/x/report.txt?x=14b6bc57200000
console output: https://syzkaller.appspot.com/x/log.txt?x=10b6bc57200000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+10d25e23199614b7721f@syzkaller.appspotmail.com
Fixes: 38e7571c07be ("Merge tag 'io_uring-2019-03-06' of  
git://git.kernel.dk/linux-block")

------------[ cut here ]------------
percpu_ref_kill_and_confirm called more than once on io_ring_ctx_ref_free!
WARNING: CPU: 1 PID: 7815 at lib/percpu-refcount.c:335  
percpu_ref_kill_and_confirm+0x341/0x3b0 lib/percpu-refcount.c:335
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 7815 Comm: syz-executor269 Not tainted 5.1.0-rc5+ #77
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2cb/0x65c kernel/panic.c:214
  __warn.cold+0x20/0x45 kernel/panic.c:571
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
RIP: 0010:percpu_ref_kill_and_confirm+0x341/0x3b0 lib/percpu-refcount.c:335
Code: 42 e0 2a 06 01 48 89 fa 48 c1 ea 03 80 3c 02 00 75 76 49 8b 54 24 10  
48 c7 c6 a0 71 a1 87 48 c7 c7 40 71 a1 87 e8 ad 92 13 fe <0f> 0b 48 b8 00  
00 00 00 00 fc ff df 4c 89 ea 48 c1 ea 03 80 3c 02
RSP: 0018:ffff8880a96cfce0 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000607f5142e35b RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815afcf6 RDI: ffffed10152d9f8e
RBP: ffff8880a96cfd10 R08: ffff8880a85c40c0 R09: fffffbfff1133639
R10: fffffbfff1133638 R11: ffffffff8899b1c3 R12: ffff88809ee571c0
R13: ffff88809ee571c8 R14: 0000000000000286 R15: 0000000000000000
  percpu_ref_kill include/linux/percpu-refcount.h:128 [inline]
  __io_uring_register+0xa7/0x1fe0 fs/io_uring.c:2937
  __do_sys_io_uring_register fs/io_uring.c:2998 [inline]
  __se_sys_io_uring_register fs/io_uring.c:2980 [inline]
  __ia32_sys_io_uring_register+0x193/0x1f0 fs/io_uring.c:2980
  do_syscall_32_irqs_on arch/x86/entry/common.c:326 [inline]
  do_fast_syscall_32+0x281/0xc98 arch/x86/entry/common.c:397
  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7f16869
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90  
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90  
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000f7ef11ec EFLAGS: 00000296 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Paul Moore @ 2019-04-22 13:49 UTC (permalink / raw)
  To: Neil Horman
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm
In-Reply-To: <20190422113810.GA27747@hmswarspite.think-freely.org>

On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> > Implement kernel audit container identifier.
>
> I'm sorry, I've lost track of this, where have we landed on it? Are we good for
> inclusion?

I haven't finished going through this latest revision, but unless
Richard made any significant changes outside of the feedback from the
v5 patchset I'm guessing we are "close".

Based on discussions Richard and I had some time ago, I have always
envisioned the plan as being get the kernel patchset, tests, docs
ready (which Richard has been doing) and then run the actual
implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
to make sure the actual implementation is sane from their perspective.
They've already seen the design, so I'm not expecting any real
surprises here, but sometimes opinions change when they have actual
code in front of them to play with and review.

Beyond that, while the cri-o/lxc/etc. folks are looking it over,
whatever additional testing we can do would be a big win.  I'm
thinking I'll pull it into a separate branch in the audit tree
(audit/working-container ?) and include that in my secnext kernels
that I build/test on a regular basis; this is also a handy way to keep
it based against the current audit/next branch.  If any changes are
needed Richard can either chose to base those changes on audit/next or
the separate audit container ID branch; that's up to him.  I've done
this with other big changes in other trees, e.g. SELinux, and it has
worked well to get some extra testing in and keep the patchset "merge
ready" while others outside the subsystem look things over.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Neil Horman @ 2019-04-22 11:38 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, Paul Moore, sgrubb, omosnace,
	dhowells, simo, eparis, serge, ebiederm
In-Reply-To: <cover.1554732921.git.rgb@redhat.com>

On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> Implement kernel audit container identifier.
> 
> This patchset is a fifth based on the proposal document (V3)
> posted:
> 	https://www.redhat.com/archives/linux-audit/2018-January/msg00014.html
> 
> The first patch was the last patch from ghak81 that was absorbed into
> this patchset since its primary justification is the rest of this
> patchset.
> 
> The second patch implements the proc fs write to set the audit container
> identifier of a process, emitting an AUDIT_CONTAINER_OP record to
> announce the registration of that audit container identifier on that
> process.  This patch requires userspace support for record acceptance
> and proper type display.
> 
> The third implements reading the audit container identifier from the
> proc filesystem for debugging.  This patch wasn't planned for upstream
> inclusion but is starting to become more likely.
> 
> The fourth implements the auxiliary record AUDIT_CONTAINER_ID if an audit
> container identifier is associated with an event.  This patch requires
> userspace support for proper type display.
> 
> The 5th adds audit daemon signalling provenance through audit_sig_info2.
> 
> The 6th creates a local audit context to be able to bind a standalone
> record with a locally created auxiliary record.
> 
> The 7th patch adds audit container identifier records to the user
> standalone records.
> 
> The 8th adds audit container identifier filtering to the exit,
> exclude and user lists.  This patch adds the AUDIT_CONTID field and
> requires auditctl userspace support for the --contid option.
> 
> The 9th adds network namespace audit container identifier labelling
> based on member tasks' audit container identifier labels.
> 
> The 10th adds audit container identifier support to standalone netfilter
> records that don't have a task context and lists each container to which
> that net namespace belongs.
> 
> Example: Set an audit container identifier of 123456 to the "sleep" task:
> 
>   sleep 2&
>   child=$!
>   echo 123456 > /proc/$child/audit_containerid; echo $?
>   ausearch -ts recent -m container_op
>   echo child:$child contid:$( cat /proc/$child/audit_containerid)
> 
> This should produce a record such as:
> 
>   type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes
> 
> 
> Example: Set a filter on an audit container identifier 123459 on /tmp/tmpcontainerid:
> 
>   contid=123459
>   key=tmpcontainerid
>   auditctl -a exit,always -F dir=/tmp -F perm=wa -F contid=$contid -F key=$key
>   perl -e "sleep 1; open(my \$tmpfile, '>', \"/tmp/$key\"); close(\$tmpfile);" &
>   child=$!
>   echo $contid > /proc/$child/audit_containerid
>   sleep 2
>   ausearch -i -ts recent -k $key
>   auditctl -d exit,always -F dir=/tmp -F perm=wa -F contid=$contid -F key=$key
>   rm -f /tmp/$key
> 
> This should produce an event such as:
> 
>   type=CONTAINER_ID msg=audit(2018-06-06 12:46:31.707:26953) : contid=123459
>   type=PROCTITLE msg=audit(2018-06-06 12:46:31.707:26953) : proctitle=perl -e sleep 1; open(my $tmpfile, '>', "/tmp/tmpcontainerid"); close($tmpfile);
>   type=PATH msg=audit(2018-06-06 12:46:31.707:26953) : item=1 name=/tmp/tmpcontainerid inode=25656 dev=00:26 mode=file,644 ouid=root ogid=root rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
>   type=PATH msg=audit(2018-06-06 12:46:31.707:26953) : item=0 name=/tmp/ inode=8985 dev=00:26 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0
>   type=CWD msg=audit(2018-06-06 12:46:31.707:26953) : cwd=/root
>   type=SYSCALL msg=audit(2018-06-06 12:46:31.707:26953) : arch=x86_64 syscall=openat success=yes exit=3 a0=0xffffffffffffff9c a1=0x5621f2b81900 a2=O_WRONLY|O_CREAT|O_TRUNC a3=0x1b6 items=2 ppid=628 pid=2232 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=tmpcontainerid
> 
> Example: Test multiple containers on one netns:
> 
>   sleep 5 &
>   child1=$!
>   containerid1=123451
>   echo $containerid1 > /proc/$child1/audit_containerid
>   sleep 5 &
>   child2=$!
>   containerid2=123452
>   echo $containerid2 > /proc/$child2/audit_containerid
>   iptables -I INPUT -i lo -p icmp --icmp-type echo-request -j AUDIT --type accept
>   iptables -I INPUT  -t mangle -i lo -p icmp --icmp-type echo-request -j MARK --set-mark 0x12345555
>   sleep 1;
>   bash -c "ping -q -c 1 127.0.0.1 >/dev/null 2>&1"
>   sleep 1;
>   ausearch -i -m NETFILTER_PKT -ts boot|grep mark=0x12345555
>   ausearch -i -m NETFILTER_PKT -ts boot|grep contid=|grep $containerid1|grep $containerid2
> 
> This should produce an event such as:
> 
>   type=NETFILTER_PKT msg=audit(03/15/2019 14:16:13.369:244) : mark=0x12345555 saddr=127.0.0.1 daddr=127.0.0.1 proto=icmp
>   type=CONTAINER_ID msg=audit(03/15/2019 14:16:13.369:244) : contid=123452,123451
> 
> 
> Includes the last patch of https://github.com/linux-audit/audit-kernel/issues/81
> Please see the github audit kernel issue for the main feature:
>   https://github.com/linux-audit/audit-kernel/issues/90
> and the kernel filter code:
>   https://github.com/linux-audit/audit-kernel/issues/91
> and the network support:
>   https://github.com/linux-audit/audit-kernel/issues/92
> Please see the github audit userspace issue for supporting record types:
>   https://github.com/linux-audit/audit-userspace/issues/51
> and filter code:
>   https://github.com/linux-audit/audit-userspace/issues/40
> Please see the github audit testsuiite issue for the test case:
>   https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
>   https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> 
> 
> Changelog:
> 
> v6
> - change TMPBUFLEN from 11 to 21 to cover the decimal value of contid
>   u64 (nhorman)
> - fix bug overwriting ctx in struct audit_sig_info, move cid above
>   ctx[0] (nhorman)
> - fix bug skipping remaining fields and not advancing bufp when copying
>   out contid in audit_krule_to_data (omosnacec)
> - add acks, tidy commit descriptions, other formatting fixes (checkpatch
>   wrong on audit_log_lost)
> - cast ull for u64 prints
> - target_cid tracking was moved from the ptrace/signal patch to
>   container_op
> - target ptrace and signal records were moved from the ptrace/signal
>   patch to container_id
> - auditd signaller tracking was moved to a new AUDIT_SIGNAL_INFO2
>   request and record
> - ditch unnecessary list_empty() checks
> - check for null net and aunet in audit_netns_contid_add()
> - swap CONTAINER_OP contid/old-contid order to ease parsing
> 
> v5
> - address loginuid and sessionid syscall scope in ghak104
> - address audit_context in CONFIG_AUDIT vs CONFIG_AUDITSYSCALL in ghak105
> - remove tty patch, addressed in ghak106
> - rebase on audit/next v5.0-rc1
>   w/ghak59/ghak104/ghak103/ghak100/ghak107/ghak105/ghak106/ghak105sup
> - update CONTAINER_ID to CONTAINER_OP in patch description
> - move audit_context in audit_task_info to CONFIG_AUDITSYSCALL
> - move audit_alloc() and audit_free() out of CONFIG_AUDITSYSCALL and into
>   CONFIG_AUDIT and create audit_{alloc,free}_syscall
> - use plain kmem_cache_alloc() rather than kmem_cache_zalloc() in audit_alloc()
> - fix audit_get_contid() declaration type error
> - move audit_set_contid() from auditsc.c to audit.c
> - audit_log_contid() returns void
> - audit_log_contid() handed contid rather than tsk
> - switch from AUDIT_CONTAINER to AUDIT_CONTAINER_ID for aux record
> - move audit_log_contid(tsk/contid) & audit_contid_set(tsk)/audit_contid_valid(contid)
> - switch from tsk to current
> - audit_alloc_local() calls audit_log_lost() on failure to allocate a context
> - add AUDIT_USER* non-syscall contid record
> - cosmetic cleanup double parens, goto out on err
> - ditch audit_get_ns_contid_list_lock(), fix aunet lock race
> - switch from all-cpu read spinlock to rcu, keep spinlock for write
> - update audit_alloc_local() to use ktime_get_coarse_real_ts64()
> - add nft_log support
> - add call from do_exit() in audit_free() to remove contid from netns
> - relegate AUDIT_CONTAINER ref= field (was op=) to debug patch
> 
> v4
> - preface set with ghak81:"collect audit task parameters"
> - add shallyn and sgrubb acks
> - rename feature bitmap macro
> - rename cid_valid() to audit_contid_valid()
> - rename AUDIT_CONTAINER_ID to AUDIT_CONTAINER_OP
> - delete audit_get_contid_list() from headers
> - move work into inner if, delete "found"
> - change netns contid list function names
> - move exports for audit_log_contid audit_alloc_local audit_free_context to non-syscall patch
> - list contids CSV
> - pass in gfp flags to audit_alloc_local() (fix audit_alloc_context callers)
> - use "local" in lieu of abusing in_syscall for auditsc_get_stamp()
> - read_lock(&tasklist_lock) around children and thread check
> - task_lock(tsk) should be taken before first check of tsk->audit
> - add spin lock to contid list in aunet
> - restrict /proc read to CAP_AUDIT_CONTROL
> - remove set again prohibition and inherited flag
> - delete contidion spelling fix from patchset, send to netdev/linux-wireless
> 
> v3
> - switched from containerid in task_struct to audit_task_info (depends on ghak81)
> - drop INVALID_CID in favour of only AUDIT_CID_UNSET
> - check for !audit_task_info, throw -ENOPROTOOPT on set
> - changed -EPERM to -EEXIST for parent check
> - return AUDIT_CID_UNSET if !audit_enabled
> - squash child/thread check patch into AUDIT_CONTAINER_ID patch
> - changed -EPERM to -EBUSY for child check
> - separate child and thread checks, use -EALREADY for latter
> - move addition of op= from ptrace/signal patch to AUDIT_CONTAINER patch
> - fix && to || bashism in ptrace/signal patch
> - uninline and export function for audit_free_context()
> - drop CONFIG_CHANGE, FEATURE_CHANGE, ANOM_ABEND, ANOM_SECCOMP patches
> - move audit_enabled check (xt_AUDIT)
> - switched from containerid list in struct net to net_generic's struct audit_net
> - move containerid list iteration into audit (xt_AUDIT)
> - create function to move namespace switch into audit
> - switched /proc/PID/ entry from containerid to audit_containerid
> - call kzalloc with GFP_ATOMIC on in_atomic() in audit_alloc_context()
> - call kzalloc with GFP_ATOMIC on in_atomic() in audit_log_container_info()
> - use xt_net(par) instead of sock_net(skb->sk) to get net
> - switched record and field names: initial CONTAINER_ID, aux CONTAINER, field CONTID
> - allow to set own contid
> - open code audit_set_containerid
> - add contid inherited flag
> - ccontainerid and pcontainerid eliminated due to inherited flag
> - change name of container list funcitons
> - rename containerid to contid
> - convert initial container record to syscall aux
> - fix spelling mistake of contidion in net/rfkill/core.c to avoid contid name collision
> 
> v2
> - add check for children and threads
> - add network namespace container identifier list
> - add NETFILTER_PKT audit container identifier logging
> - patch description and documentation clean-up and example
> - reap unused ppid
> 
> Richard Guy Briggs (10):
>   audit: collect audit task parameters
>   audit: add container id
>   audit: read container ID of a process
>   audit: log container info of syscalls
>   audit: add contid support for signalling the audit daemon
>   audit: add support for non-syscall auxiliary records
>   audit: add containerid support for user records
>   audit: add containerid filtering
>   audit: add support for containerid to network namespaces
>   audit: NETFILTER_PKT: record each container ID associated with a netNS
> 
>  fs/proc/base.c              |  57 +++++++-
>  include/linux/audit.h       | 113 +++++++++++++--
>  include/linux/sched.h       |   7 +-
>  include/uapi/linux/audit.h  |   9 +-
>  init/init_task.c            |   3 +-
>  init/main.c                 |   2 +
>  kernel/audit.c              | 325 ++++++++++++++++++++++++++++++++++++++++++--
>  kernel/audit.h              |   9 ++
>  kernel/auditfilter.c        |  47 +++++++
>  kernel/auditsc.c            |  90 ++++++++----
>  kernel/fork.c               |   1 -
>  kernel/nsproxy.c            |   4 +
>  net/netfilter/nft_log.c     |  11 +-
>  net/netfilter/xt_AUDIT.c    |  11 +-
>  security/selinux/nlmsgtab.c |   1 +
>  15 files changed, 627 insertions(+), 63 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> 
I'm sorry, I've lost track of this, where have we landed on it? Are we good for
inclusion?
Neil

^ permalink raw reply

* [PATCH v17 3/3] Documentation/filesystems/proc.txt: add arch_status file
From: Aubrey Li @ 2019-04-21 18:35 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, adobriyan, akpm, aubrey.li,
	linux-api, linux-kernel, Aubrey Li, Andy Lutomirski
In-Reply-To: <20190421183529.9141-1-aubrey.li@linux.intel.com>

Added /proc/<pid>/arch_status file, and added AVX512_elapsed_ms in
/proc/<pid>/arch_status. Report it in Documentation/filesystems/proc.txt

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linux API <linux-api@vger.kernel.org>
---
 Documentation/filesystems/proc.txt | 37 ++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..cf5114a8fb13 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -45,6 +45,7 @@ Table of Contents
   3.9   /proc/<pid>/map_files - Information about memory mapped files
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
+  3.12	/proc/<pid>/arch_status - Task architecture specific information
 
   4	Configuring procfs
   4.1	Mount options
@@ -1948,6 +1949,42 @@ patched.  If the patch is being enabled, then the task has already been
 patched.  If the patch is being disabled, then the task hasn't been
 unpatched yet.
 
+3.12 /proc/<pid>/arch_status - task architecture specific status
+-------------------------------------------------------------------
+When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
+architecture specific status of the task.
+
+Example
+-------
+ $ cat /proc/6753/arch_status
+ AVX512_elapsed_ms:      8
+
+Description
+-----------
+
+ AVX512_elapsed_ms:
+ ------------------
+  If AVX512 is supported on the machine, this entry shows the milliseconds
+  elapsed since the last time AVX512 usage was recorded. The recording
+  happens on a best effort basis when a task is scheduled out. This means
+  that the value depends on two factors:
+
+    1) The time which the task spent on the CPU without being scheduled
+       out. With CPU isolation and a single runnable task this can take
+       several seconds.
+
+    2) The time since the task was scheduled out last. Depending on the
+       reason for being scheduled out (time slice exhausted, syscall ...)
+       this can be arbitrary long time.
+
+  As a consequence the value cannot be considered precise and authoritative
+  information. The application which uses this information has to be aware
+  of the overall scenario on the system in order to determine whether a
+  task is a real AVX512 user or not.
+
+  A special value of '-1' indicates that no AVX512 usage was recorded, thus
+  the task is unlikely an AVX512 user, but depends on the workload and the
+  scheduling scenario, it also could be a false negative mentioned above.
 
 ------------------------------------------------------------------------------
 Configuring procfs
-- 
2.17.1

^ permalink raw reply related

* [PATCH v17 2/3] /proc/pid/arch_status: Add AVX-512 usage elapsed time
From: Aubrey Li @ 2019-04-21 18:35 UTC (permalink / raw)
  To: tglx, mingo, peterz, hpa
  Cc: ak, tim.c.chen, dave.hansen, arjan, adobriyan, akpm, aubrey.li,
	linux-api, linux-kernel, Aubrey Li, Andy Lutomirski
In-Reply-To: <20190421183529.9141-1-aubrey.li@linux.intel.com>

AVX-512 components use could cause core turbo frequency drop. So
it's useful to expose AVX-512 usage elapsed time as a heuristic hint
for the user space job scheduler to cluster the AVX-512 using tasks
together.

Tensorflow example:
$ while [ 1 ]; do cat /proc/tid/arch_status | grep AVX512; sleep 1; done
AVX512_elapsed_ms:      4
AVX512_elapsed_ms:      8
AVX512_elapsed_ms:      4

This means that 4 milliseconds have elapsed since the AVX512 usage
of tensorflow task was detected when the task was scheduled out.

Or:
$ cat /proc/tid/arch_status | grep AVX512
AVX512_elapsed_ms:      -1

The number '-1' indicates that no AVX512 usage recorded before
thus the task unlikely has frequency drop issue.

User space tools may want to further check by:

$ perf stat --pid <pid> -e core_power.lvl2_turbo_license -- sleep 1

 Performance counter stats for process id '3558':

     3,251,565,961      core_power.lvl2_turbo_license

       1.004031387 seconds time elapsed

Non-zero counter value confirms that the task causes frequency drop.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linux API <linux-api@vger.kernel.org>
---
 arch/x86/include/asm/processor.h |  6 +++++
 arch/x86/kernel/fpu/xstate.c     | 43 ++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2bb3a648fc12..0728848473a2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -991,4 +991,10 @@ enum l1tf_mitigations {
 
 extern enum l1tf_mitigations l1tf_mitigation;
 
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/* Add support for task architecture specific output in /proc/pid/arch_status */
+void task_arch_status(struct seq_file *m, struct task_struct *task);
+#define task_arch_status task_arch_status
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d7432c2b1051..a0dda11ab72e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -7,6 +7,7 @@
 #include <linux/cpu.h>
 #include <linux/mman.h>
 #include <linux/pkeys.h>
+#include <linux/seq_file.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -1243,3 +1244,45 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 
 	return 0;
 }
+
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+	unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+	long delta;
+
+	if (!timestamp) {
+		/*
+		 * Report -1 if no AVX512 usage
+		 */
+		delta = -1;
+	} else {
+		delta = (long)(jiffies - timestamp);
+		/*
+		 * Cap to LONG_MAX if time difference > LONG_MAX
+		 */
+		if (delta < 0)
+			delta = LONG_MAX;
+		delta = jiffies_to_msecs(delta);
+	}
+
+	seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+	seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+void task_arch_status(struct seq_file *m, struct task_struct *task)
+{
+	/*
+	 * Report AVX512 state if the processor and build option supported.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+		avx512_status(m, task);
+}
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
-- 
2.17.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox