Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v2 2/5] clone: add CLONE_PIDFD
From: Oleg Nesterov @ 2019-04-18 14:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190418132822.untjt7erfvbbiz7a@brauner.io>

On 04/18, Christian Brauner wrote:
>
> On Thu, Apr 18, 2019 at 03:12:07PM +0200, Oleg Nesterov wrote:
> > Should we allow CLONE_THREAD | CLONE_PIDFD ?
>
> I think so, yes. I have thought about this.

OK, I won't insist. But let me explain why did I ask.

> Yes, due to CLONE_FILES |
> CLONE_VM you'd necessarily hand the pidfd to the child but threads are
> no security boundary in the first place.

No, no, I am not not worried about security. CLONE_PARENT | CLONE_PIDFD
looks more problematic to me, but I see nothing dangerous security-wise..

I agree that CLONE_THREAD | CLONE_PIDFD may be usefule, but I am not sure
we should allow this from the very begining, until we have a "real" use-case.

IIUC, we are going to make it pollable soon. OK, but proc_tgid_base_poll()
(which should be turned into pidfd_poll) simply can't work if pid_task() is
not a group leader. poll(pidfd) will hang forever if pidfd was created by
CLONE_THREAD | CLONE_PIDFD.

Sure, we can (should?) improve pidfd_poll() but this will need more nasty
changes in the core kernel code. Do we really need/want this? Right now it
is not clear to me. Instead, we can simply disallow CLONE_THREAD|CLONE_PIDFD
until we decide that yes, we want to poll sub-threads.

But again, I am fine with CLONE_THREAD | CLONE_PIDFD.

Oleg.

^ permalink raw reply

* Re: [PATCH v2 2/5] clone: add CLONE_PIDFD
From: Christian Brauner @ 2019-04-18 14:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190418141019.GD13701@redhat.com>

On April 18, 2019 4:10:20 PM GMT+02:00, Oleg Nesterov <oleg@redhat.com> wrote:
>On 04/18, Christian Brauner wrote:
>>
>> On Thu, Apr 18, 2019 at 03:12:07PM +0200, Oleg Nesterov wrote:
>> > Should we allow CLONE_THREAD | CLONE_PIDFD ?
>>
>> I think so, yes. I have thought about this.
>
>OK, I won't insist. But let me explain why did I ask.
>
>> Yes, due to CLONE_FILES |
>> CLONE_VM you'd necessarily hand the pidfd to the child but threads
>are
>> no security boundary in the first place.
>
>No, no, I am not not worried about security. CLONE_PARENT | CLONE_PIDFD
>looks more problematic to me, but I see nothing dangerous
>security-wise..
>
>I agree that CLONE_THREAD | CLONE_PIDFD may be usefule, but I am not
>sure
>we should allow this from the very begining, until we have a "real"
>use-case.
>
>IIUC, we are going to make it pollable soon. OK, but
>proc_tgid_base_poll()
>(which should be turned into pidfd_poll) simply can't work if
>pid_task() is
>not a group leader. poll(pidfd) will hang forever if pidfd was created
>by
>CLONE_THREAD | CLONE_PIDFD.
>
>Sure, we can (should?) improve pidfd_poll() but this will need more
>nasty
>changes in the core kernel code. Do we really need/want this? Right now
>it
>is not clear to me. Instead, we can simply disallow
>CLONE_THREAD|CLONE_PIDFD
>until we decide that yes, we want to poll sub-threads.

If you think that makes the polling work simpler for Joel then for sure.
And yes, I have argued for "disable until someone needs it" often before so I can't really argue the other way around here. :)
I'll send an updated version soon.

Christian

>
>But again, I am fine with CLONE_THREAD | CLONE_PIDFD.
>
>Oleg.

^ permalink raw reply

* Re: [PATCH v2 4/5] signal: support CLONE_PIDFD with pidfd_send_signal
From: Oleg Nesterov @ 2019-04-18 14:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190418101841.4476-5-christian@brauner.io>

On 04/18, Christian Brauner wrote:
>
> +static struct pid *pidfd_to_pid(const struct file *file)
> +{
> +	if (file->f_op == &pidfd_fops)
> +		return file->private_data;
> +
> +	return tgid_pidfd_to_pid(file);
> +}

the patch looks obviously fine to me, but I have an absolutely off-topic
question... why tgid_pidfd_to_pid() has to check d_is_dir() ?

Oleg.

^ permalink raw reply

* Re: [PATCH v2] moduleparam: Save information about built-in modules in separate file
From: Lucas De Marchi @ 2019-04-18 14:36 UTC (permalink / raw)
  To: Jessica Yu
  Cc: Masahiro Yamada, Alexey Gladkov, Linux Kernel Mailing List,
	Linux Kbuild mailing list, linux-api, linux-modules,
	Kirill A . Shutemov, Gleb Fotengauer-Malinovskiy, Dmitry V. Levin,
	Michal Marek, Dmitry Torokhov, Rusty Russell
In-Reply-To: <20190418135238.GA5626@linux-8ccs>

On Thu, Apr 18, 2019 at 6:52 AM Jessica Yu <jeyu@kernel.org> wrote:
>
> +++ Masahiro Yamada [18/04/19 20:10 +0900]:
> >On Sat, Apr 6, 2019 at 9:15 PM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >>
> >> Problem:
> >>
> >> When a kernel module is compiled as a separate module, some important
> >> information about the kernel module is available via .modinfo section of
> >> the module.  In contrast, when the kernel module is compiled into the
> >> kernel, that information is not available.
> >>
> >> Information about built-in modules is necessary in the following cases:
> >>
> >> 1. When it is necessary to find out what additional parameters can be
> >> passed to the kernel at boot time.
> >>
> >> 2. When you need to know which module names and their aliases are in
> >> the kernel. This is very useful for creating an initrd image.
> >>
> >> Proposal:
> >>
> >> The proposed patch does not remove .modinfo section with module
> >> information from the vmlinux at the build time and saves it into a
> >> separate file after kernel linking. So, the kernel does not increase in
> >> size and no additional information remains in it. Information is stored
> >> in the same format as in the separate modules (null-terminated string
> >> array). Because the .modinfo section is already exported with a separate
> >> modules, we are not creating a new API.
> >>
> >> It can be easily read in the userspace:
> >>
> >> $ tr '\0' '\n' < kernel.builtin
> >> ext4.softdep=pre: crc32c
> >> ext4.license=GPL
> >> ext4.description=Fourth Extended Filesystem
> >> ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
> >> ext4.alias=fs-ext4
> >> ext4.alias=ext3
> >> ext4.alias=fs-ext3
> >> ext4.alias=ext2
> >> ext4.alias=fs-ext2
> >> md_mod.alias=block-major-9-*
> >> md_mod.alias=md
> >> md_mod.description=MD RAID framework
> >> md_mod.license=GPL
> >> md_mod.parmtype=create_on_open:bool
> >> md_mod.parmtype=start_dirty_degraded:int
> >> ...
> >>
> >> v2:
> >>  * Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
> >>  * Rename output file to kernel.builtin;
> >
> >Sorry, I do not get why you renamed
> >"kernel.builtin.modinfo" to "kernel.builtin".
> >
> >If you drop "modinfo", we do not understand
> >what kind information is contained in it.
> >
> >I think "kernel" and "builtin" have
> >a quite similar meaning here.
> >
> >How about "builtin.modinfo" for example?
> >
> >
> >It is shorter, and it is clear enough
> >that it contains module_info.
>
> I agree that the name kernel.builtin is unclear in what kind of
> information it contains. Apologies for not having clarified this in
> the previous review.
>
> Since kbuild already produces "modules.order" and "modules.builtin"
> files, why not just name it "modules.builtin.modinfo" to keep the

I agree with modules.builtin.modinfo

> names consistent with what is already there? It clearly conveys that
> this file contains modinfo for builtin modules.
>
> I'll leave it up to Lucas to decide if the file format is OK for kmod.
> With the modinfo dump, kmod could then decide what to do with the
> information, append to modules.alias{,.bin}, etc.

I think it's ok with the current format. It's simple enough. I would
delay merging it until
the userspace counterpart is implemented though. So to make sure we
have everything we need.
I can work on that in ~1 or 2 weeks.

thanks
Lucas De Marchi

>
> Also, I think this file needs to be documented in Documentation/kbuild/kbuild.txt.
>
> Thanks,
>
> Jessica



-- 
Lucas De Marchi

^ permalink raw reply

* Re: [PATCH] Linux: Define struct termios2 in <termios.h> under _GNU_SOURCE [BZ #10339]
From: hpa @ 2019-04-18 14:46 UTC (permalink / raw)
  To: Adhemerval Zanella, Florian Weimer; +Cc: libc-alpha, linux-api, linuxppc-dev
In-Reply-To: <4ae23f8e-c5e5-2d11-8508-82e5dd4c1e2b@linaro.org>

On April 18, 2019 4:09:07 AM PDT, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>
>
>On 17/04/2019 19:04, H. Peter Anvin wrote:
>> On 4/15/19 10:22 AM, Adhemerval Zanella wrote:
>>>>
>>>> New interfaces are only necessary for the handful of architectures
>that don't have the speed fields *and* to space to put them in. 
>>>
>>> Based on your WIP, it seems that both sparc and mips could be
>adapted.
>>> Do we still have glibc supported architecture that would require
>compat
>>> symbols?
>>>
>>>>
>>>> Using symbol versioning doesn't really help much since the real
>problem is that struct termios can be passed around in userspace, and
>the interfaces between user space libraries don't have any versioning.
>However, my POC code deals with that too by only seeing BOTHER when
>necessary, so if the structure is extended garbage in the extra fields
>will be ignored unless new baud rates are in use.
>>>
>>> Yeah, we discussed this earlier and if recall correctly it was not
>settled
>>> that all architectures would allow the use to extra space for the
>new
>>> fields. It seems the case, which makes the adaptation for termios2
>even easier.
>>>
>>> The question I have for kernel side is whether termios2 is fully
>compatible
>>> with termios, meaning that if there is conner cases we need to
>handle in
>>> userland.
>>>
>> 
>> It depends on what you mean with "fully compatible."
>> 
>> Functionality-wise, the termios2 interfaces are a strict superset.
>There
>> is not, however, any guarantee that struct kernel_termios2 *contains*
>a
>> contiguous binary equivalent to struct kernel_termios (in fact, on
>most
>> architectures, it doesn't.)
>
>I mean that we can fully implement termios1 using termios2 by adjusting
>the termios struct in syscall wrappers.  If it is a superset I think it
>is fine.
>
>> 
>>>>
>>>> My POC code deals with Alpha as well by falling back to the old
>interfaces if necessary and possible, otherwise return error.
>>>>
>>>> Exporting termios2 to user space feels a bit odd at this stage as
>it would only be usable as a fallback on old glibc. Call it
>kernel_termios2 at least. ioctls using struct termios *must* be changed
>to kernel_termios anyway!
>>>>
>>>
>>> I still prefer to avoid export it to userland and make it usable
>through
>>> default termios, as your wip does.  My understanding is new
>interfaces 
>>> should be semantic equal to current one with the only deviation that
>>> non-standard baudrates will handled as its values.  The only issue I
>
>>> can foresee is if POSIX starts to export new bauds value.
>>>
>> 
>> ... which will be easy to handle: just define a Bxxxx constant with
>the
>> value equal to the baud rate.
>> 
>> 	-hhpa
>
>Right.

termio, termios1 and termios2 are kernel ioctl interfaces ... there are no wrappers; it is an ioctl.

The glibc termios is different from all of these, and already is a wrapper around the kernel-provided ioctls.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Joseph Myers @ 2019-04-18 14:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Will Deacon, carlos, Florian Weimer, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <1066731871.915.1555593471194.JavaMail.zimbra@efficios.com>

On Thu, 18 Apr 2019, Mathieu Desnoyers wrote:

> The approach above should work for arm32 be8 vs be32 linker weirdness.
> 
> For aarch64, I think we can simply do:
> 
> /*
>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>  * matches code endianness.
>  */
> #define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */
> 
> #ifdef __ARM_BIG_ENDIAN
> #define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
> #else
> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
> #endif
> 
> #define RSEQ_SIG        RSEQ_SIG_DATA
> 
> Feedback is most welcome,

You'll also need __ASSEMBLER__ conditionals in the installed sys/rseq.h 
header so that it only defines constants and doesn't include any C 
declarations in that case, if RSEQ_SIG_CODE is meant to be usable in .S 
files rather than just inline asm in C files.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply

* Re: [PATCH v2 4/5] signal: support CLONE_PIDFD with pidfd_send_signal
From: Christian Brauner @ 2019-04-18 15:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, viro, jannh, dhowells, linux-api, linux-kernel, serge,
	luto, arnd, ebiederm, keescook, tglx, mtk.manpages, akpm, cyphar,
	joel, dancol
In-Reply-To: <20190418142559.GA30069@redhat.com>

On April 18, 2019 4:26:00 PM GMT+02:00, Oleg Nesterov <oleg@redhat.com> wrote:
>On 04/18, Christian Brauner wrote:
>>
>> +static struct pid *pidfd_to_pid(const struct file *file)
>> +{
>> +	if (file->f_op == &pidfd_fops)
>> +		return file->private_data;
>> +
>> +	return tgid_pidfd_to_pid(file);
>> +}
>
>the patch looks obviously fine to me, but I have an absolutely
>off-topic
>question... why tgid_pidfd_to_pid() has to check d_is_dir() ?

It doesn't have too; pure paranoia.

Christian

^ permalink raw reply

* Re: [PATCH v2] moduleparam: Save information about built-in modules in separate file
From: Masahiro Yamada @ 2019-04-18 15:26 UTC (permalink / raw)
  To: Jessica Yu
  Cc: Alexey Gladkov, Linux Kernel Mailing List,
	Linux Kbuild mailing list, linux-api, linux-modules,
	Kirill A . Shutemov, Gleb Fotengauer-Malinovskiy, Dmitry V. Levin,
	Michal Marek, Dmitry Torokhov, Rusty Russell, Lucas De Marchi
In-Reply-To: <20190418135238.GA5626@linux-8ccs>

On Thu, Apr 18, 2019 at 10:52 PM Jessica Yu <jeyu@kernel.org> wrote:
>
> +++ Masahiro Yamada [18/04/19 20:10 +0900]:
> >On Sat, Apr 6, 2019 at 9:15 PM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >>
> >> Problem:
> >>
> >> When a kernel module is compiled as a separate module, some important
> >> information about the kernel module is available via .modinfo section of
> >> the module.  In contrast, when the kernel module is compiled into the
> >> kernel, that information is not available.
> >>
> >> Information about built-in modules is necessary in the following cases:
> >>
> >> 1. When it is necessary to find out what additional parameters can be
> >> passed to the kernel at boot time.
> >>
> >> 2. When you need to know which module names and their aliases are in
> >> the kernel. This is very useful for creating an initrd image.
> >>
> >> Proposal:
> >>
> >> The proposed patch does not remove .modinfo section with module
> >> information from the vmlinux at the build time and saves it into a
> >> separate file after kernel linking. So, the kernel does not increase in
> >> size and no additional information remains in it. Information is stored
> >> in the same format as in the separate modules (null-terminated string
> >> array). Because the .modinfo section is already exported with a separate
> >> modules, we are not creating a new API.
> >>
> >> It can be easily read in the userspace:
> >>
> >> $ tr '\0' '\n' < kernel.builtin
> >> ext4.softdep=pre: crc32c
> >> ext4.license=GPL
> >> ext4.description=Fourth Extended Filesystem
> >> ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
> >> ext4.alias=fs-ext4
> >> ext4.alias=ext3
> >> ext4.alias=fs-ext3
> >> ext4.alias=ext2
> >> ext4.alias=fs-ext2
> >> md_mod.alias=block-major-9-*
> >> md_mod.alias=md
> >> md_mod.description=MD RAID framework
> >> md_mod.license=GPL
> >> md_mod.parmtype=create_on_open:bool
> >> md_mod.parmtype=start_dirty_degraded:int
> >> ...
> >>
> >> v2:
> >>  * Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
> >>  * Rename output file to kernel.builtin;
> >
> >Sorry, I do not get why you renamed
> >"kernel.builtin.modinfo" to "kernel.builtin".
> >
> >If you drop "modinfo", we do not understand
> >what kind information is contained in it.
> >
> >I think "kernel" and "builtin" have
> >a quite similar meaning here.
> >
> >How about "builtin.modinfo" for example?
> >
> >
> >It is shorter, and it is clear enough
> >that it contains module_info.
>
> I agree that the name kernel.builtin is unclear in what kind of
> information it contains. Apologies for not having clarified this in
> the previous review.
>
> Since kbuild already produces "modules.order" and "modules.builtin"
> files, why not just name it "modules.builtin.modinfo" to keep the
> names consistent with what is already there?


Is it consistent?

If we had "modules.order" and "modules.builtin.order" there,
I would agree with "modules.builtin.modinfo",
and also "modules.alias" vs "modules.builtin.alias".


We already have "modules.builtin", and probably impossible
to rename it, so we cannot keep consistency in any way.


"modules.builtin" is a weird name since
it actually contains "order", but its extension
does not express what kind of information is in it.
Hence, I doubt "modules.builtin" is a good precedent.

IMHO, "modules" and "builtin" are opposite
to each other. "modules.builtin" sounds iffy to me.





> It clearly conveys that
> this file contains modinfo for builtin modules.
>
> I'll leave it up to Lucas to decide if the file format is OK for kmod.
> With the modinfo dump, kmod could then decide what to do with the
> information, append to modules.alias{,.bin}, etc.
>
> Also, I think this file needs to be documented in Documentation/kbuild/kbuild.txt.
>
> Thanks,
>
> Jessica
--
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Mathieu Desnoyers @ 2019-04-18 15:31 UTC (permalink / raw)
  To: Alan Modra
  Cc: Michael Ellerman, Carlos O'Donell,
	Tulio Magno Quites Machado Filho, Florian Weimer,
	Michael Meissner, Peter Bergner, Paul Burton, Will Deacon,
	Boqun Feng, heiko carstens, gor, schwidefsky,
	Russell King, ARM Linux, Benjamin Herrenschmidt, Paul Mackerras,
	carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas
In-Reply-To: <20190409092948.GA14424@bubble.grove.modra.org>

----- On Apr 9, 2019, at 5:29 AM, Alan Modra amodra@gmail.com wrote:

> On Tue, Apr 09, 2019 at 02:23:53PM +1000, Michael Ellerman wrote:
>> I'd much rather we use a trap with a specific immediate value. Otherwise
>> someone's going to waste time one day puzzling over why userspace is
>> doing mtmsr.
> 
> It's data.  We have other data in executable sections.  Anyone who
> wonders about odd disassembly just hasn't realized they are
> disassembling data.
> 
>> It would also complicate things if we ever wanted to emulate mtmsr.
> 
> No, because it won't be executed.  If I understand correctly, the only
> reason to choose an illegal, trap or privileged insn is to halt
> execution earlier rather than later when a program goes off in the
> weeds.
> 
>> If we want something that is a trap rather than a nop then use 0x0fe50553.
>> 
>> That's "compare the value in r5 with 0x553 and then trap unconditionally".
>> 
>> It shows up in objdump as:
>> 
>>     10000000:	53 05 e5 0f 	twui    r5,1363
>> 
>> 
>> The immediate can be anything, I chose that value to mimic the x86 value
>> Mathieu mentioned.
>> 
>> There's no reason that instruction would ever be generated because the
>> immediate value serves no purpose. So it satisfies the "very unlikely
>> to appear" criteria AFAICS.
> 
> Yes, looks fine to me, except that in VLE mode (do we care?)
> ".long 0x0fe50553" disassembles as
>   0:	0f e5       	se_cmphl r5,r30
>   2:	05 53       	se_mullw r3,r5
> No illegal/trap/privileged insn there.
> 
> ".long 0x0fe5000b" might be better to cover VLE.

Can you share with us the objdump output of ".long 0x0fe5000b" in
VLE mode ? VLE mode support does not appear to be available in typical
toolchains. Also, is VLE mode only for powerpc 32 be, or also for
powerpc 64 be/le ?

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Szabolcs Nagy @ 2019-04-18 15:33 UTC (permalink / raw)
  To: Mathieu Desnoyers, Joseph Myers, Will Deacon
  Cc: nd, carlos, Florian Weimer, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Dave Watson, Paul Turner, Rich Felker, linux-kernel, linux-api
In-Reply-To: <1066731871.915.1555593471194.JavaMail.zimbra@efficios.com>

On 18/04/2019 14:17, Mathieu Desnoyers wrote:
> ----- On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
>> ----- On Apr 17, 2019, at 12:17 PM, Joseph Myers joseph@codesourcery.com wrote:
>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>
>>>>> +/* RSEQ_SIG is a signature required before each abort handler code.
>>>>> +
>>>>> +   It is a 32-bit value that maps to actual architecture code compiled
>>>>> +   into applications and libraries. It needs to be defined for each
>>>>> +   architecture. When choosing this value, it needs to be taken into
>>>>> +   account that generating invalid instructions may have ill effects on
>>>>> +   tools like objdump, and may also have impact on the CPU speculative
>>>>> +   execution efficiency in some cases.  */
>>>>> +
>>>>> +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
>>>>
>>>> After further investigation, we should probably do the following
>>>> to handle compiling with -mbig-endian on aarch64, which generates
>>>> binaries with mixed code vs data endianness (little endian code,
>>>> big endian data):
>>>
>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>> interpreted in the code or the data endianness.
>>
>> Right. The signature passed as argument to the rseq registration
>> system call needs to be in data endianness (currently exposed kernel
>> ABI).
>>
>> Ideally for userspace, we want to define a signature in code endianness
>> that happens to nicely match specific code patterns.
...
> For aarch64, I think we can simply do:
> 
> /*
>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>  * matches code endianness.
>  */
> #define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */
> 
> #ifdef __ARM_BIG_ENDIAN
> #define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
> #else
> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
> #endif
> 
> #define RSEQ_SIG        RSEQ_SIG_DATA
> 
> Feedback is most welcome,

so the RSEQ_SIG value is supposed to be used with .word
in asm instead of .inst?

i don't think we use __ARM_* in public headers currently,
but hopefully aarch64_be compilers implement it.

otherwise this looks ok to me.

(i think a rare palindrome instruction would work too, e.g.
0a5f5f0a 	and	w10, w24, wzr, lsr #23 // shifted 0
2a5f5f2a 	orr	w10, w25, wzr, lsr #23
eb9f9feb 	negs	x11, xzr, asr #39
c83f3fc8 	stxp	wzr, x8, x15, [x30]  // store to LR ignoring success
d9ffffd9 	stz2g	x25, [x30, #-16]!    // v8.5 tag+zero 2 granules around LR
etc. it does not need to be a guaranteed trap)

^ permalink raw reply

* Re: [PATCH 2/5] glibc: sched_getcpu(): use rseq cpu_id TLS on Linux (v2)
From: Szabolcs Nagy @ 2019-04-18 15:33 UTC (permalink / raw)
  To: Mathieu Desnoyers, Carlos O'Donell
  Cc: nd, Florian Weimer, Joseph Myers, libc-alpha@sourceware.org,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
In-Reply-To: <20190416173216.9028-3-mathieu.desnoyers@efficios.com>

On 16/04/2019 18:32, Mathieu Desnoyers wrote:
> --- a/sysdeps/unix/sysv/linux/sched_getcpu.c
> +++ b/sysdeps/unix/sysv/linux/sched_getcpu.c
> @@ -37,3 +37,26 @@ sched_getcpu (void)
>    return -1;
>  #endif
>  }
> +
> +#ifdef __NR_rseq
> +#include <sys/rseq.h>
> +#endif
> +
> +#if defined __NR_rseq && defined RSEQ_SIG
> +extern __attribute__ ((tls_model ("initial-exec")))
> +__thread volatile struct rseq __rseq_abi;

i'd expect sys/rseq.h to provide this declaration.

> +
> +int
> +sched_getcpu (void)
> +{
> +  int cpu_id = __rseq_abi.cpu_id;
> +
> +  return cpu_id >= 0 ? cpu_id : vsyscall_sched_getcpu ();
> +}
> +#else
> +int
> +sched_getcpu (void)
> +{
> +  return vsyscall_sched_getcpu ();
> +}
> +#endif
> -- 2.17.1
> 


^ permalink raw reply

* Re: [PATCH v2] moduleparam: Save information about built-in modules in separate file
From: Jessica Yu @ 2019-04-18 15:36 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Alexey Gladkov, Linux Kernel Mailing List,
	Linux Kbuild mailing list, linux-api, linux-modules,
	Kirill A . Shutemov, Gleb Fotengauer-Malinovskiy, Dmitry V. Levin,
	Michal Marek, Dmitry Torokhov, Rusty Russell, Lucas De Marchi
In-Reply-To: <CAK7LNARKHXLgniczKXY01JhxhBpL2p-X_hWg70AcsrXRaLyDSg@mail.gmail.com>

+++ Masahiro Yamada [19/04/19 00:26 +0900]:
>On Thu, Apr 18, 2019 at 10:52 PM Jessica Yu <jeyu@kernel.org> wrote:
>>
>> +++ Masahiro Yamada [18/04/19 20:10 +0900]:
>> >On Sat, Apr 6, 2019 at 9:15 PM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>> >>
>> >> Problem:
>> >>
>> >> When a kernel module is compiled as a separate module, some important
>> >> information about the kernel module is available via .modinfo section of
>> >> the module.  In contrast, when the kernel module is compiled into the
>> >> kernel, that information is not available.
>> >>
>> >> Information about built-in modules is necessary in the following cases:
>> >>
>> >> 1. When it is necessary to find out what additional parameters can be
>> >> passed to the kernel at boot time.
>> >>
>> >> 2. When you need to know which module names and their aliases are in
>> >> the kernel. This is very useful for creating an initrd image.
>> >>
>> >> Proposal:
>> >>
>> >> The proposed patch does not remove .modinfo section with module
>> >> information from the vmlinux at the build time and saves it into a
>> >> separate file after kernel linking. So, the kernel does not increase in
>> >> size and no additional information remains in it. Information is stored
>> >> in the same format as in the separate modules (null-terminated string
>> >> array). Because the .modinfo section is already exported with a separate
>> >> modules, we are not creating a new API.
>> >>
>> >> It can be easily read in the userspace:
>> >>
>> >> $ tr '\0' '\n' < kernel.builtin
>> >> ext4.softdep=pre: crc32c
>> >> ext4.license=GPL
>> >> ext4.description=Fourth Extended Filesystem
>> >> ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others
>> >> ext4.alias=fs-ext4
>> >> ext4.alias=ext3
>> >> ext4.alias=fs-ext3
>> >> ext4.alias=ext2
>> >> ext4.alias=fs-ext2
>> >> md_mod.alias=block-major-9-*
>> >> md_mod.alias=md
>> >> md_mod.description=MD RAID framework
>> >> md_mod.license=GPL
>> >> md_mod.parmtype=create_on_open:bool
>> >> md_mod.parmtype=start_dirty_degraded:int
>> >> ...
>> >>
>> >> v2:
>> >>  * Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
>> >>  * Rename output file to kernel.builtin;
>> >
>> >Sorry, I do not get why you renamed
>> >"kernel.builtin.modinfo" to "kernel.builtin".
>> >
>> >If you drop "modinfo", we do not understand
>> >what kind information is contained in it.
>> >
>> >I think "kernel" and "builtin" have
>> >a quite similar meaning here.
>> >
>> >How about "builtin.modinfo" for example?
>> >
>> >
>> >It is shorter, and it is clear enough
>> >that it contains module_info.
>>
>> I agree that the name kernel.builtin is unclear in what kind of
>> information it contains. Apologies for not having clarified this in
>> the previous review.
>>
>> Since kbuild already produces "modules.order" and "modules.builtin"
>> files, why not just name it "modules.builtin.modinfo" to keep the
>> names consistent with what is already there?
>
>
>Is it consistent?
>
>If we had "modules.order" and "modules.builtin.order" there,
>I would agree with "modules.builtin.modinfo",
>and also "modules.alias" vs "modules.builtin.alias".
>
>
>We already have "modules.builtin", and probably impossible
>to rename it, so we cannot keep consistency in any way.
>
>
>"modules.builtin" is a weird name since
>it actually contains "order", but its extension
>does not express what kind of information is in it.
>Hence, I doubt "modules.builtin" is a good precedent.
>
>IMHO, "modules" and "builtin" are opposite
>to each other. "modules.builtin" sounds iffy to me.

I've always interpreted "modules.builtin" to mean "this is a list of
modules that have been built-in into the kernel", no? So I thought the
name made sense. But you are the maintainer, so I do not have a strong
opinion on this either way :-)

Thanks,

Jessica

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Mathieu Desnoyers @ 2019-04-18 15:37 UTC (permalink / raw)
  To: Joseph Myers
  Cc: Will Deacon, carlos, Florian Weimer, Szabolcs Nagy, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <alpine.DEB.2.21.1904181445550.5356@digraph.polyomino.org.uk>

----- On Apr 18, 2019, at 10:48 AM, Joseph Myers joseph@codesourcery.com wrote:

> On Thu, 18 Apr 2019, Mathieu Desnoyers wrote:
> 
>> The approach above should work for arm32 be8 vs be32 linker weirdness.
>> 
>> For aarch64, I think we can simply do:
>> 
>> /*
>>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>>  * matches code endianness.
>>  */
>> #define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */
>> 
>> #ifdef __ARM_BIG_ENDIAN
>> #define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
>> #else
>> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
>> #endif
>> 
>> #define RSEQ_SIG        RSEQ_SIG_DATA
>> 
>> Feedback is most welcome,
> 
> You'll also need __ASSEMBLER__ conditionals in the installed sys/rseq.h
> header so that it only defines constants and doesn't include any C
> declarations in that case, if RSEQ_SIG_CODE is meant to be usable in .S
> files rather than just inline asm in C files.

Good point!

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v2 3/5] signal: use fdget() since we don't allow O_PATH
From: Linus Torvalds @ 2019-04-18 15:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Jann Horn, David Howells, Oleg Nesterov, Linux API,
	Linux List Kernel Mailing, Serge E. Hallyn, Andrew Lutomirski,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Thomas Gleixner,
	Michael Kerrisk-manpages, Andrew Morton, Aleksa Sarai,
	Joel Fernandes, Daniel Colascione, Jann Horn
In-Reply-To: <20190418101841.4476-4-christian@brauner.io>

On Thu, Apr 18, 2019 at 3:19 AM Christian Brauner <christian@brauner.io> wrote:
>
>     It's just semantically correct to use fdget()
> and return an error right from there instead of taking a reference and
> returning an error later.

I've applied this one directly, because it ends up affecting the
existing code, and I'd rather just have the initial release of
pidfd_send_signal() right.

The other patches I consider to be "future release" material.

                    Linus

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Mathieu Desnoyers @ 2019-04-18 15:41 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Joseph Myers, Will Deacon, nd, carlos, Florian Weimer, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <6cbfea7b-9d83-74a5-9cd2-af56a5d68818@arm.com>

----- On Apr 18, 2019, at 11:33 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:

> On 18/04/2019 14:17, Mathieu Desnoyers wrote:
>> ----- On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers
>> mathieu.desnoyers@efficios.com wrote:
>>> ----- On Apr 17, 2019, at 12:17 PM, Joseph Myers joseph@codesourcery.com wrote:
>>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>>
>>>>>> +/* RSEQ_SIG is a signature required before each abort handler code.
>>>>>> +
>>>>>> +   It is a 32-bit value that maps to actual architecture code compiled
>>>>>> +   into applications and libraries. It needs to be defined for each
>>>>>> +   architecture. When choosing this value, it needs to be taken into
>>>>>> +   account that generating invalid instructions may have ill effects on
>>>>>> +   tools like objdump, and may also have impact on the CPU speculative
>>>>>> +   execution efficiency in some cases.  */
>>>>>> +
>>>>>> +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
>>>>>
>>>>> After further investigation, we should probably do the following
>>>>> to handle compiling with -mbig-endian on aarch64, which generates
>>>>> binaries with mixed code vs data endianness (little endian code,
>>>>> big endian data):
>>>>
>>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>>> interpreted in the code or the data endianness.
>>>
>>> Right. The signature passed as argument to the rseq registration
>>> system call needs to be in data endianness (currently exposed kernel
>>> ABI).
>>>
>>> Ideally for userspace, we want to define a signature in code endianness
>>> that happens to nicely match specific code patterns.
> ...
>> For aarch64, I think we can simply do:
>> 
>> /*
>>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>>  * matches code endianness.
>>  */
>> #define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */
>> 
>> #ifdef __ARM_BIG_ENDIAN
>> #define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
>> #else
>> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
>> #endif
>> 
>> #define RSEQ_SIG        RSEQ_SIG_DATA
>> 
>> Feedback is most welcome,
> 
> so the RSEQ_SIG value is supposed to be used with .word
> in asm instead of .inst?

We want a .inst so it translates into a valid trap instruction.
It's better to trap in case program execution reaches this
by mistake (makes debugging easier).

> 
> i don't think we use __ARM_* in public headers currently,
> but hopefully aarch64_be compilers implement it.

Can I use #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__  then ?

> 
> otherwise this looks ok to me.
> 
> (i think a rare palindrome instruction would work too, e.g.
> 0a5f5f0a 	and	w10, w24, wzr, lsr #23 // shifted 0
> 2a5f5f2a 	orr	w10, w25, wzr, lsr #23
> eb9f9feb 	negs	x11, xzr, asr #39
> c83f3fc8 	stxp	wzr, x8, x15, [x30]  // store to LR ignoring success
> d9ffffd9 	stz2g	x25, [x30, #-16]!    // v8.5 tag+zero 2 granules around LR
> etc. it does not need to be a guaranteed trap)

Unfortunately it's not a trap :/

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 2/5] glibc: sched_getcpu(): use rseq cpu_id TLS on Linux (v2)
From: Mathieu Desnoyers @ 2019-04-18 15:45 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: carlos, nd, Florian Weimer, Joseph Myers, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Will Deacon, Dave Watson, Paul Turner, linux-kernel,
	linux-api
In-Reply-To: <c9c53d4b-ed9b-f2d9-60b0-e5e805de0324@arm.com>

----- On Apr 18, 2019, at 11:33 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:

> On 16/04/2019 18:32, Mathieu Desnoyers wrote:
>> --- a/sysdeps/unix/sysv/linux/sched_getcpu.c
>> +++ b/sysdeps/unix/sysv/linux/sched_getcpu.c
>> @@ -37,3 +37,26 @@ sched_getcpu (void)
>>    return -1;
>>  #endif
>>  }
>> +
>> +#ifdef __NR_rseq
>> +#include <sys/rseq.h>
>> +#endif
>> +
>> +#if defined __NR_rseq && defined RSEQ_SIG
>> +extern __attribute__ ((tls_model ("initial-exec")))
>> +__thread volatile struct rseq __rseq_abi;
> 
> i'd expect sys/rseq.h to provide this declaration.

And it actually does! Will remove this duplicate.

Thanks,

Mathieu

> 
>> +
>> +int
>> +sched_getcpu (void)
>> +{
>> +  int cpu_id = __rseq_abi.cpu_id;
>> +
>> +  return cpu_id >= 0 ? cpu_id : vsyscall_sched_getcpu ();
>> +}
>> +#else
>> +int
>> +sched_getcpu (void)
>> +{
>> +  return vsyscall_sched_getcpu ();
>> +}
>> +#endif
>> -- 2.17.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
From: Enrico Weigelt, metux IT consult @ 2019-04-18 15:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andy Lutomirski, Aleksa Sarai, Linus Torvalds, Al Viro, Jann Horn,
	David Howells, Linux API, LKML, Serge E. Hallyn, Arnd Bergmann,
	Eric W. Biederman, Kees Cook, Thomas Gleixner, Michael Kerrisk,
	Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
In-Reply-To: <20190417125422.bhyrmzom4kwuenwm@brauner.io>

On 17.04.19 14:54, Christian Brauner wrote:

>> Ah, that is a cool thing !>> I suppose that also works across namespaces ?> > Yes, it should. If
you hand off the pidfd to another pidns (e.g. via SCM> credentials) for
example.
I thought about things like sending the pidfd via unix socket.
It would be really cool if the receiving process could then control
the referred process (eg. send signals), even if it's in a different
pidns.

>> What other things can be done via pidfd ?
> 
> Very basic things right now and until CLONE_PIDFD is accepted (possibly
> for 5.2) we won't enable any more features.
> I'm not a fan of wild speculations and grand schemes that then don't
> come to fruition. :) Right now it's about focussing on somewhat cleanly
> covering the basics. Coming to a consensus there was hard enough. We
> have no intention in making this more complex right now as it needs to
> be.

IMHO, it would be good if it would support all operations that now can
be done via numerical PID, and w/ the permissions of the process who
created that pidfd. Certainly, that would also need some lockdown
mechanism, so the creating process can define what the holder of the
fd can actually do.

--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287

^ permalink raw reply

* Re: [PATCH v2 3/5] signal: use fdget() since we don't allow O_PATH
From: Christian Brauner @ 2019-04-18 15:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jann Horn, David Howells, Oleg Nesterov, Linux API,
	Linux List Kernel Mailing, Serge E. Hallyn, Andrew Lutomirski,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Thomas Gleixner,
	Michael Kerrisk-manpages, Andrew Morton, Aleksa Sarai,
	Joel Fernandes, Daniel Colascione, Jann Horn
In-Reply-To: <CAHk-=whB0A_WsBpr3MFujWKi7WW9XLxUpq=z_uYTv+J7uE0wRw@mail.gmail.com>

On Thu, Apr 18, 2019 at 08:37:28AM -0700, Linus Torvalds wrote:
> On Thu, Apr 18, 2019 at 3:19 AM Christian Brauner <christian@brauner.io> wrote:
> >
> >     It's just semantically correct to use fdget()
> > and return an error right from there instead of taking a reference and
> > returning an error later.
> 
> I've applied this one directly, because it ends up affecting the
> existing code, and I'd rather just have the initial release of
> pidfd_send_signal() right.

Perfect! Thank you!

> 
> The other patches I consider to be "future release" material.

Yes, indeed. Jann and I are targeting the 5.2 merge window if that's ok.

Christian

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Szabolcs Nagy @ 2019-04-18 16:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: nd, Joseph Myers, Will Deacon, carlos, Florian Weimer, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <1055153722.1072.1555602067220.JavaMail.zimbra@efficios.com>

On 18/04/2019 16:41, Mathieu Desnoyers wrote:
> ----- On Apr 18, 2019, at 11:33 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
> 
>> On 18/04/2019 14:17, Mathieu Desnoyers wrote:
>>> ----- On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers
>>> mathieu.desnoyers@efficios.com wrote:
>>>> ----- On Apr 17, 2019, at 12:17 PM, Joseph Myers joseph@codesourcery.com wrote:
>>>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>>>
>>>>>>> +/* RSEQ_SIG is a signature required before each abort handler code.
>>>>>>> +
>>>>>>> +   It is a 32-bit value that maps to actual architecture code compiled
>>>>>>> +   into applications and libraries. It needs to be defined for each
>>>>>>> +   architecture. When choosing this value, it needs to be taken into
>>>>>>> +   account that generating invalid instructions may have ill effects on
>>>>>>> +   tools like objdump, and may also have impact on the CPU speculative
>>>>>>> +   execution efficiency in some cases.  */
>>>>>>> +
>>>>>>> +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
>>>>>>
>>>>>> After further investigation, we should probably do the following
>>>>>> to handle compiling with -mbig-endian on aarch64, which generates
>>>>>> binaries with mixed code vs data endianness (little endian code,
>>>>>> big endian data):
>>>>>
>>>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>>>> interpreted in the code or the data endianness.
>>>>
>>>> Right. The signature passed as argument to the rseq registration
>>>> system call needs to be in data endianness (currently exposed kernel
>>>> ABI).
>>>>
>>>> Ideally for userspace, we want to define a signature in code endianness
>>>> that happens to nicely match specific code patterns.
>> ...
>>> For aarch64, I think we can simply do:
>>>
>>> /*
>>>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>>>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>>>  * matches code endianness.
>>>  */
>>> #define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */
>>>
>>> #ifdef __ARM_BIG_ENDIAN
>>> #define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
>>> #else
>>> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
>>> #endif
>>>
>>> #define RSEQ_SIG        RSEQ_SIG_DATA
>>>
>>> Feedback is most welcome,
>>
>> so the RSEQ_SIG value is supposed to be used with .word
>> in asm instead of .inst?
> 
> We want a .inst so it translates into a valid trap instruction.
> It's better to trap in case program execution reaches this
> by mistake (makes debugging easier).

that does not make sense to me.

".inst" is an asm directive that requires a the value to
be the same on BE and LE (normal insn encoding).

".word" is an asm directive that requires the value to
use swapped encoding on BE (if it's used in the instruction
stream it will create a data mapping symbol and disasm to
.word value instead of the instruction mnemonics).

so which one is it?

>> i don't think we use __ARM_* in public headers currently,
>> but hopefully aarch64_be compilers implement it.
> 
> Can I use #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__  then ?

hm, i'd either use #ifdef __AARCH64EB__ (since we already use it)
or the portable #include endian.h and __BYTE_ORDER == __BIG_ENDIAN

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Mathieu Desnoyers @ 2019-04-18 17:10 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: nd, Joseph Myers, Will Deacon, carlos, Florian Weimer, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <79996d13-2ba2-ed7d-b202-e7d38f1fd870@arm.com>


----- On Apr 18, 2019, at 12:07 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:

> On 18/04/2019 16:41, Mathieu Desnoyers wrote:
>> ----- On Apr 18, 2019, at 11:33 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>> 
>>> On 18/04/2019 14:17, Mathieu Desnoyers wrote:
>>>> ----- On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers
>>>> mathieu.desnoyers@efficios.com wrote:
>>>>> ----- On Apr 17, 2019, at 12:17 PM, Joseph Myers joseph@codesourcery.com wrote:
>>>>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>>>>
>>>>>>>> +/* RSEQ_SIG is a signature required before each abort handler code.
>>>>>>>> +
>>>>>>>> +   It is a 32-bit value that maps to actual architecture code compiled
>>>>>>>> +   into applications and libraries. It needs to be defined for each
>>>>>>>> +   architecture. When choosing this value, it needs to be taken into
>>>>>>>> +   account that generating invalid instructions may have ill effects on
>>>>>>>> +   tools like objdump, and may also have impact on the CPU speculative
>>>>>>>> +   execution efficiency in some cases.  */
>>>>>>>> +
>>>>>>>> +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
>>>>>>>
>>>>>>> After further investigation, we should probably do the following
>>>>>>> to handle compiling with -mbig-endian on aarch64, which generates
>>>>>>> binaries with mixed code vs data endianness (little endian code,
>>>>>>> big endian data):
>>>>>>
>>>>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>>>>> interpreted in the code or the data endianness.
>>>>>
>>>>> Right. The signature passed as argument to the rseq registration
>>>>> system call needs to be in data endianness (currently exposed kernel
>>>>> ABI).
>>>>>
>>>>> Ideally for userspace, we want to define a signature in code endianness
>>>>> that happens to nicely match specific code patterns.
>>> ...
>>>> For aarch64, I think we can simply do:
>>>>
>>>> /*
>>>>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>>>>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>>>>  * matches code endianness.
>>>>  */
>>>> #define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */
>>>>
>>>> #ifdef __ARM_BIG_ENDIAN
>>>> #define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
>>>> #else
>>>> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
>>>> #endif
>>>>
>>>> #define RSEQ_SIG        RSEQ_SIG_DATA
>>>>
>>>> Feedback is most welcome,
>>>
>>> so the RSEQ_SIG value is supposed to be used with .word
>>> in asm instead of .inst?
>> 
>> We want a .inst so it translates into a valid trap instruction.
>> It's better to trap in case program execution reaches this
>> by mistake (makes debugging easier).
> 
> that does not make sense to me.
> 
> ".inst" is an asm directive that requires a the value to
> be the same on BE and LE (normal insn encoding).
> 
> ".word" is an asm directive that requires the value to
> use swapped encoding on BE (if it's used in the instruction
> stream it will create a data mapping symbol and disasm to
> .word value instead of the instruction mnemonics).
> 
> so which one is it?

We declare the signature with ".inst" in assembler.

However, we also need to pass that 32-bit signature value as
argument to the rseq system call when registering rseq.

The signature comparison is performed by the kernel before
moving the instruction pointer to the abort handler. It compares
the signature received as parameter by sys_rseq (data) to the
4-byte signature preceding the abort IP.

On aarch64 big endian, AFAIU the signature in the code is in
little endian, and the signature value passed as argument to
the rseq system call is in big endian. One way to handle this
is to swap the byte order of the signature "data" representation
passed as argument to sys_rseq.

> 
>>> i don't think we use __ARM_* in public headers currently,
>>> but hopefully aarch64_be compilers implement it.
>> 
>> Can I use #if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__  then ?
> 
> hm, i'd either use #ifdef __AARCH64EB__ (since we already use it)
> or the portable #include endian.h and __BYTE_ORDER == __BIG_ENDIAN

I'll use #ifdef __AARCH64EB__ given this header is specific to aarch64.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Szabolcs Nagy @ 2019-04-18 17:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: nd, Joseph Myers, Will Deacon, carlos, Florian Weimer, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <604915684.1299.1555607449814.JavaMail.zimbra@efficios.com>

On 18/04/2019 18:10, Mathieu Desnoyers wrote:
> 
> ----- On Apr 18, 2019, at 12:07 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
> 
>> On 18/04/2019 16:41, Mathieu Desnoyers wrote:
>>> ----- On Apr 18, 2019, at 11:33 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>>
>>>> On 18/04/2019 14:17, Mathieu Desnoyers wrote:
>>>>> ----- On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers
>>>>> mathieu.desnoyers@efficios.com wrote:
>>>>>> ----- On Apr 17, 2019, at 12:17 PM, Joseph Myers joseph@codesourcery.com wrote:
>>>>>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>>>>>
>>>>>>>>> +/* RSEQ_SIG is a signature required before each abort handler code.
>>>>>>>>> +
>>>>>>>>> +   It is a 32-bit value that maps to actual architecture code compiled
>>>>>>>>> +   into applications and libraries. It needs to be defined for each
>>>>>>>>> +   architecture. When choosing this value, it needs to be taken into
>>>>>>>>> +   account that generating invalid instructions may have ill effects on
>>>>>>>>> +   tools like objdump, and may also have impact on the CPU speculative
>>>>>>>>> +   execution efficiency in some cases.  */
>>>>>>>>> +
>>>>>>>>> +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
>>>>>>>>
>>>>>>>> After further investigation, we should probably do the following
>>>>>>>> to handle compiling with -mbig-endian on aarch64, which generates
>>>>>>>> binaries with mixed code vs data endianness (little endian code,
>>>>>>>> big endian data):
>>>>>>>
>>>>>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>>>>>> interpreted in the code or the data endianness.
>>>>>>
>>>>>> Right. The signature passed as argument to the rseq registration
>>>>>> system call needs to be in data endianness (currently exposed kernel
>>>>>> ABI).
>>>>>>
>>>>>> Ideally for userspace, we want to define a signature in code endianness
>>>>>> that happens to nicely match specific code patterns.
>>>> ...
>>>>> For aarch64, I think we can simply do:
>>>>>
>>>>> /*
>>>>>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>>>>>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>>>>>  * matches code endianness.
>>>>>  */
>>>>> #define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */
>>>>>
>>>>> #ifdef __ARM_BIG_ENDIAN
>>>>> #define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
>>>>> #else
>>>>> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
>>>>> #endif
>>>>>
>>>>> #define RSEQ_SIG        RSEQ_SIG_DATA
>>>>>
>>>>> Feedback is most welcome,
>>>>
>>>> so the RSEQ_SIG value is supposed to be used with .word
>>>> in asm instead of .inst?
>>>
>>> We want a .inst so it translates into a valid trap instruction.
>>> It's better to trap in case program execution reaches this
>>> by mistake (makes debugging easier).
>>
>> that does not make sense to me.
>>
>> ".inst" is an asm directive that requires a the value to
>> be the same on BE and LE (normal insn encoding).
>>
>> ".word" is an asm directive that requires the value to
>> use swapped encoding on BE (if it's used in the instruction
>> stream it will create a data mapping symbol and disasm to
>> .word value instead of the instruction mnemonics).
>>
>> so which one is it?
> 
> We declare the signature with ".inst" in assembler.
> 
> However, we also need to pass that 32-bit signature value as
> argument to the rseq system call when registering rseq.
> 
> The signature comparison is performed by the kernel before
> moving the instruction pointer to the abort handler. It compares
> the signature received as parameter by sys_rseq (data) to the
> 4-byte signature preceding the abort IP.
> 
> On aarch64 big endian, AFAIU the signature in the code is in
> little endian, and the signature value passed as argument to
> the rseq system call is in big endian. One way to handle this
> is to swap the byte order of the signature "data" representation
> passed as argument to sys_rseq.

you have to add a documentation comment somewhere
explaining if RSEQ_SIG is the value that's passed to
the kernel and then aarch64 asm code has to use

 .inst endianfixup(RSEQ_SIG) // or
 .word RSEQ_SIG

or if RSEQ_SIG is used as

 .inst RSEQ_SIG

in aarch64 asm and then endianfixup(RSEQ_SIG) should
be passed to the syscall.

either way it can be a brk 0x45e0 on both LE and BE,
but in the latter case you have to document this in
arch independent way, since the syscall api must be
portable (i assume "RSEQ_SIG" is part of the api).

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v8)
From: Mathieu Desnoyers @ 2019-04-18 18:17 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: nd, Joseph Myers, Will Deacon, carlos, Florian Weimer, libc-alpha,
	Thomas Gleixner, Ben Maurer, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <846db7ef-f75e-53b8-3c4c-461ec730a17e@arm.com>

----- On Apr 18, 2019, at 1:37 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:

> On 18/04/2019 18:10, Mathieu Desnoyers wrote:
>> 
>> ----- On Apr 18, 2019, at 12:07 PM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>> 
>>> On 18/04/2019 16:41, Mathieu Desnoyers wrote:
>>>> ----- On Apr 18, 2019, at 11:33 AM, Szabolcs Nagy Szabolcs.Nagy@arm.com wrote:
>>>>
>>>>> On 18/04/2019 14:17, Mathieu Desnoyers wrote:
>>>>>> ----- On Apr 17, 2019, at 3:56 PM, Mathieu Desnoyers
>>>>>> mathieu.desnoyers@efficios.com wrote:
>>>>>>> ----- On Apr 17, 2019, at 12:17 PM, Joseph Myers joseph@codesourcery.com wrote:
>>>>>>>> On Wed, 17 Apr 2019, Mathieu Desnoyers wrote:
>>>>>>>>
>>>>>>>>>> +/* RSEQ_SIG is a signature required before each abort handler code.
>>>>>>>>>> +
>>>>>>>>>> +   It is a 32-bit value that maps to actual architecture code compiled
>>>>>>>>>> +   into applications and libraries. It needs to be defined for each
>>>>>>>>>> +   architecture. When choosing this value, it needs to be taken into
>>>>>>>>>> +   account that generating invalid instructions may have ill effects on
>>>>>>>>>> +   tools like objdump, and may also have impact on the CPU speculative
>>>>>>>>>> +   execution efficiency in some cases.  */
>>>>>>>>>> +
>>>>>>>>>> +#define RSEQ_SIG 0xd428bc00	/* BRK #0x45E0.  */
>>>>>>>>>
>>>>>>>>> After further investigation, we should probably do the following
>>>>>>>>> to handle compiling with -mbig-endian on aarch64, which generates
>>>>>>>>> binaries with mixed code vs data endianness (little endian code,
>>>>>>>>> big endian data):
>>>>>>>>
>>>>>>>> First, the comment on RSEQ_SIG should specify whether it is to be
>>>>>>>> interpreted in the code or the data endianness.
>>>>>>>
>>>>>>> Right. The signature passed as argument to the rseq registration
>>>>>>> system call needs to be in data endianness (currently exposed kernel
>>>>>>> ABI).
>>>>>>>
>>>>>>> Ideally for userspace, we want to define a signature in code endianness
>>>>>>> that happens to nicely match specific code patterns.
>>>>> ...
>>>>>> For aarch64, I think we can simply do:
>>>>>>
>>>>>> /*
>>>>>>  * aarch64 -mbig-endian generates mixed endianness code vs data:
>>>>>>  * little-endian code and big-endian data. Ensure the RSEQ_SIG signature
>>>>>>  * matches code endianness.
>>>>>>  */
>>>>>> #define RSEQ_SIG_CODE   0xd428bc00      /* BRK #0x45E0.  */
>>>>>>
>>>>>> #ifdef __ARM_BIG_ENDIAN
>>>>>> #define RSEQ_SIG_DATA   0x00bc28d4      /* BRK #0x45E0.  */
>>>>>> #else
>>>>>> #define RSEQ_SIG_DATA   RSEQ_SIG_CODE
>>>>>> #endif
>>>>>>
>>>>>> #define RSEQ_SIG        RSEQ_SIG_DATA
>>>>>>
>>>>>> Feedback is most welcome,
>>>>>
>>>>> so the RSEQ_SIG value is supposed to be used with .word
>>>>> in asm instead of .inst?
>>>>
>>>> We want a .inst so it translates into a valid trap instruction.
>>>> It's better to trap in case program execution reaches this
>>>> by mistake (makes debugging easier).
>>>
>>> that does not make sense to me.
>>>
>>> ".inst" is an asm directive that requires a the value to
>>> be the same on BE and LE (normal insn encoding).
>>>
>>> ".word" is an asm directive that requires the value to
>>> use swapped encoding on BE (if it's used in the instruction
>>> stream it will create a data mapping symbol and disasm to
>>> .word value instead of the instruction mnemonics).
>>>
>>> so which one is it?
>> 
>> We declare the signature with ".inst" in assembler.
>> 
>> However, we also need to pass that 32-bit signature value as
>> argument to the rseq system call when registering rseq.
>> 
>> The signature comparison is performed by the kernel before
>> moving the instruction pointer to the abort handler. It compares
>> the signature received as parameter by sys_rseq (data) to the
>> 4-byte signature preceding the abort IP.
>> 
>> On aarch64 big endian, AFAIU the signature in the code is in
>> little endian, and the signature value passed as argument to
>> the rseq system call is in big endian. One way to handle this
>> is to swap the byte order of the signature "data" representation
>> passed as argument to sys_rseq.
> 
> you have to add a documentation comment somewhere
> explaining if RSEQ_SIG is the value that's passed to
> the kernel and then aarch64 asm code has to use
> 
> .inst endianfixup(RSEQ_SIG) // or
> .word RSEQ_SIG

Using ".word" won't allow objdump to show the instruction it
maps to. It will consider it as data. So .inst is preferred here.

> 
> or if RSEQ_SIG is used as
> 
> .inst RSEQ_SIG
> 
> in aarch64 asm and then endianfixup(RSEQ_SIG) should
> be passed to the syscall.

At this stage, we control the meaning of the definitions we
publicly expose. They are part of glibc headers, not part of the
kernel uapi.

On architectures where data and code endianness match, RSEQ_SIG
can be used both as argument to sys_rseq and as value for
.inst in assembler.

On architectures where data and code endianness differ, I am
tempted to declare them separately:

* RSEQ_SIG_CODE: for use with .inst in assembly,
* RSEQ_SIG_DATA (mapping to RSEQ_SIG): to pass as parameter to sys_rseq.

So those specific architectures would use "RSEQ_SIG_CODE" with
.inst in assembly, and we can still pass the RSEQ_SIG as parameter
to sys_rseq in generic rseq registration code.

> either way it can be a brk 0x45e0 on both LE and BE,
> but in the latter case you have to document this in
> arch independent way, since the syscall api must be
> portable (i assume "RSEQ_SIG" is part of the api).

The RSEQ_SIG is defined by glibc bits/rseq.h which is included from
sys/rseq.h. It's therefore not part of the Linux kernel uapi. So
we can define whatever we need to at this point, but we won't be
able to change it after it has been exposed for a given
architecture.

All the kernel ABI expects is a data-endian value of the signature
it needs to compare to when it loads the 4 bytes prior to the abort
ip.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH 1/4] glibc: Perform rseq(2) registration at C startup and thread creation (v7)
From: Mathieu Desnoyers @ 2019-04-18 18:58 UTC (permalink / raw)
  To: Paul Burton
  Cc: Carlos O'Donell, Will Deacon, Boqun Feng, heiko carstens, gor,
	schwidefsky, Russell King, ARM Linux, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, carlos, Florian Weimer,
	Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Dave Watson
In-Reply-To: <20190404214151.6ogrm34dok52az4h@pburton-laptop>

----- On Apr 4, 2019, at 5:41 PM, Paul Burton paul.burton@mips.com wrote:
[...]

>> 2a. A uncommon TRAP hopefully with some immediate data encoded (maybe uncommon)
> 
> Our break instruction has a 19b immediate in nanoMIPS (20b for microMIPS
> & classic MIPS) so that could be something like:
> 
>  break 0x7273       # ASCII 'rs'
> 

Hi Paul,

I like this uncommon break instruction as signature choice.

However, if I try to compile assembler with a break 0x7273 instruction
with mips64 and mips32 toolchains (gcc version 8.2.0 (Ubuntu 8.2.0-1ubuntu2~18.04))
I get:

/tmp/ccVh9F7T.s: Assembler messages:
/tmp/ccVh9F7T.s:24: Error: operand 1 out of range `break 0x7273'

It works up to the value 0x3FF, which seems to use the top 10
code bits:

   a:	03ff 0007 	break	0x3ff

Would a "break 0x350" be a good choice as well ? 

Any idea why 0x7273 is not accepted by my assembler ?

I also tried crafting the assembler with values between 0x3FF and 0x7273
in the 20 code bits. It seems fine from an objdump perspective:

".long 0x03FFFC7\n\t"

generates:

  10:	003f ffc7 	break	0x3f,0x3ff

What I don't understand is why the instruction generated by my
toolchain ends with the last 6 bits "000111", whereas the mips32
instruction set specifies break as ending with "001101" [1].
What am I missing ?

Also, the nanomips break code [2] has a completely different
instruction layout. Should we use a different signature when
compiling for nanomips ? What #ifdef should we use ? Do I
need a special toolchain to generate nanomips binaries ?

Thanks,

Mathieu

[1] http://hades.mech.northwestern.edu/images/1/16/MIPS32_Architecture_Volume_II-A_Instruction_Set.pdf
[2] https://s3-eu-west-1.amazonaws.com/downloads-mips/I7200/I7200+product+launch/MIPS_nanomips32_ISA_TRM_01_01_MD01247.pdf

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH V32 01/27] Add the ability to lock down access to the running kernel image
From: Matthew Garrett @ 2019-04-18 19:35 UTC (permalink / raw)
  To: Andrew Donnellan
  Cc: James Morris, LSM List, Linux Kernel Mailing List, David Howells,
	Linux API, Andy Lutomirski, linuxppc-dev, Michael Ellerman,
	Daniel Axtens, cmr
In-Reply-To: <059c523e-926c-24ee-0935-198031712145@au1.ibm.com>

On Tue, Apr 16, 2019 at 1:40 AM Andrew Donnellan
<andrew.donnellan@au1.ibm.com> wrote:
> I'm thinking about whether we should lock down the powerpc xmon debug
> monitor - intuitively, I think the answer is yes if for no other reason
> than Least Astonishment, when lockdown is enabled you probably don't
> expect xmon to keep letting you access kernel memory.

The original patchset contained a sysrq hotkey to allow physically
present users to disable lockdown, so I'm not super concerned about
this case - I could definitely be convinced otherwise, though.

^ permalink raw reply

* [PATCH v5 00/16] fscrypt: key management improvements
From: Eric Biggers @ 2019-04-18 23:29 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: linux-ext4, linux-api, linux-f2fs-devel, keyrings, linux-mtd,
	linux-crypto, linux-fsdevel, Satya Tangirala, Paul Crowley

Hello,

This patchset makes major improvements to how keys are added, removed,
and derived in fscrypt, aka ext4/f2fs/ubifs encryption.  It does this by
adding new ioctls that add and remove encryption keys directly to/from
the filesystem, and by adding a new encryption policy version ("v2")
where the user-provided keys are only used as input to HKDF-SHA512 and
are identified by their cryptographic hash.

All new APIs and all cryptosystem changes are documented in
Documentation/filesystems/fscrypt.rst.  Userspace can use the new key
management ioctls with existing encrypted directories, but migrating to
v2 encryption policies is needed for the full benefits.

These changes solve four interrelated problems:

(1) Providing fscrypt keys via process-subscribed keyrings is abusing
    encryption as an OS-level access control mechanism, causing many
    bugs where processes don't get access to the keys they need -- e.g.,
    when a 'sudo' command or a system service needs to access encrypted
    files.  It's also inconsistent with the filesystem/VFS "view" of
    encrypted files which is global, so sometimes things randomly happen
    to work anyway due to caching.  Regardless, currently almost all
    fscrypt users actually do need global keys, so they're having to use
    workarounds that heavily abuse the session or user keyrings, e.g.
    Android and Chromium OS both use a systemwide "session keyring" and
    the 'fscrypt' tool links all user keyrings into root's user keyring.

(2) Currently there's no way to securely and efficiently remove a
    fscrypt key such that not only is the original key wiped, but also
    all files and directories protected by that key are "locked" and
    their per-file keys wiped.  Many users want this and are using
    'echo 2 > /proc/sys/vm/drop_caches' as a workaround, but this is
    root-only, and also is overkill so can be a performance disaster.

(3) The key derivation function (KDF) that fscrypt uses to derive
    per-file keys is nonstandard, inflexible, and has some weaknesses
    such as being reversible and not evenly distributing the entropy
    from the user-provided keys.

(4) fscrypt doesn't check that the correct key was supplied.  This can
    be a security vulnerability, since it allows malicious local users
    to associate the wrong key with files to which they have read-only
    access, causing other users' processes to read/write the wrong data.

Ultimately, the solutions to these problems all tie into each other.  By
adding a filesystem-level encryption keyring with ioctls to add/remove
keys to/from it, the keys are made usable filesystem-wide (solves
problem #1).  It also becomes easy to track the inodes that were
"unlocked" with each key, so they can be evicted when the key is removed
(solves problem #2).  Moreover, the filesystem-level keyring is a
natural place to store an HMAC transform keyed by each key, thus making
it easy and efficient to switch the KDF to HKDF (solves problem #3).

Finally, to check that the correct key was supplied, I use HKDF to
derive a cryptographically secure key_identifier for each key (solves
problem #4).  This in combination with key quotas and other careful
precautions also makes it safe to allow non-root users to add and remove
keys to/from the filesystem-level keyring.  Thus, all problems are
solved without having to restrict the fscrypt API to root only.

The patchset is organized as follows:

- Patches 1-8 add new ioctls FS_IOC_ADD_ENCRYPTION_KEY,
  FS_IOC_REMOVE_ENCRYPTION_KEY, and FS_IOC_GET_ENCRYPTION_KEY_STATUS.
  Adding a key logically "unlocks" all files on the filesystem that are
  protected by that key; removing a key "locks" them again.

- Patches 9-12 add support for v2 encryption policies.

- Patches 13-15 wire up the new ioctls to ext4, f2fs, and ubifs.

- Patch 16 updates the fscrypt documentation for all the changes.

Changes v4 => v5:
    - Simplify shrink_dcache_inode(), as suggested by Al Viro.
      Also move it into fs/crypto/.
    - Fix a build error on some architectures by calling
      copy_from_user() rather than get_user() with a __u64 pointer.

Changes v3 => v4:
    - Introduce fscrypt_sb_free() to avoid an extra #ifdef.
    - Fix UBIFS's ->drop_inode().
    - Add 'version' to union fscrypt_policy and union fscrypt_context.

Changes v2 => v3:
    - Use ->drop_inode() to trigger the inode eviction during/after
      FS_IOC_REMOVE_ENCRYPTION_KEY, as suggested by Dave Chinner.
    - A few small cleanups.

v1 of this patchset was sent in October 2017 with title "fscrypt:
filesystem-level keyring and v2 policy support".  This revived version
follows the same basic design but incorporates numerous improvements,
such as splitting keyinfo.c into multiple files for much better
understandability, and introducing "per-mode" encryption keys to
implement the semantics of the DIRECT_KEY encryption policy flag.

This patchset applies to the fscrypt tree with the additional patch
"[PATCH v2 2/2] fscrypt: cache decrypted symlink target in ->i_link"
(https://patchwork.kernel.org/patch/10894743/) applied.  You can also
get this patchset from git at:

	Repository:   https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git
	Branch:       fscrypt-key-mgmt-improvements-v5

I've started writing xfstests for the new APIs.  So far they cover basic
functionality.  They can be found at:

	Repository:   https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/xfstests-dev.git
	Branch:       fscrypt-key-mgmt-improvements

The xfstests depend on new xfs_io commands which can be found at:

	Repository:   https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/xfsprogs-dev.git
	Branch:       fscrypt-key-mgmt-improvements

I've also made proof-of-concept changes to the 'fscrypt' userspace
program (https://github.com/google/fscrypt) to make it support v2
encryption policies.  You can find these changes in git at:

	Repository:   https://github.com/ebiggers/fscrypt.git
	Branch:       fscrypt-key-mgmt-improvements

To make the 'fscrypt' userspace program experimentally use v2 encryption
policies on new encrypted directories, add the following to
/etc/fscrypt.conf within the "options" section:

	"policy_version": "2"

It's also planned for Android and Chromium OS to switch to the new
ioctls and eventually to v2 encryption policies.  Work-in-progress,
proof-of-concept changes by Satya Tangirala for AOSP can be found at
https://android-review.googlesource.com/q/topic:fscrypt-key-mgmt-improvements

Eric Biggers (16):
  fs, fscrypt: move uapi definitions to new header <linux/fscrypt.h>
  fscrypt: use FSCRYPT_ prefix for uapi constants
  fscrypt: use FSCRYPT_* definitions, not FS_*
  fscrypt: add ->ci_inode to fscrypt_info
  fscrypt: refactor v1 policy key setup into keysetup_legacy.c
  fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl
  fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
  fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl
  fscrypt: add an HKDF-SHA512 implementation
  fscrypt: v2 encryption policy support
  fscrypt: allow unprivileged users to add/remove keys for v2 policies
  fscrypt: require that key be added when setting a v2 encryption policy
  ext4: wire up new fscrypt ioctls
  f2fs: wire up new fscrypt ioctls
  ubifs: wire up new fscrypt ioctls
  fscrypt: document the new ioctls and policy version

 Documentation/filesystems/fscrypt.rst | 670 ++++++++++++++----
 MAINTAINERS                           |   1 +
 fs/crypto/Kconfig                     |   2 +
 fs/crypto/Makefile                    |  10 +-
 fs/crypto/crypto.c                    |  14 +-
 fs/crypto/fname.c                     |   5 +-
 fs/crypto/fscrypt_private.h           | 366 ++++++++--
 fs/crypto/hkdf.c                      | 188 ++++++
 fs/crypto/keyinfo.c                   | 613 -----------------
 fs/crypto/keyring.c                   | 931 ++++++++++++++++++++++++++
 fs/crypto/keysetup.c                  | 561 ++++++++++++++++
 fs/crypto/keysetup_legacy.c           | 340 ++++++++++
 fs/crypto/policy.c                    | 392 ++++++++---
 fs/ext4/ioctl.c                       |  24 +
 fs/ext4/super.c                       |   3 +
 fs/f2fs/file.c                        |  46 ++
 fs/f2fs/super.c                       |   2 +
 fs/super.c                            |   2 +
 fs/ubifs/ioctl.c                      |  16 +
 fs/ubifs/super.c                      |  11 +
 include/linux/fs.h                    |   1 +
 include/linux/fscrypt.h               |  49 +-
 include/uapi/linux/fs.h               |  54 +-
 include/uapi/linux/fscrypt.h          | 163 +++++
 24 files changed, 3521 insertions(+), 943 deletions(-)
 create mode 100644 fs/crypto/hkdf.c
 delete mode 100644 fs/crypto/keyinfo.c
 create mode 100644 fs/crypto/keyring.c
 create mode 100644 fs/crypto/keysetup.c
 create mode 100644 fs/crypto/keysetup_legacy.c
 create mode 100644 include/uapi/linux/fscrypt.h

-- 
2.21.0.392.gf8f6787159e-goog

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox