Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH] io_uring: use __kernel_timespec in timeout ABI
From: Arnd Bergmann @ 2019-10-01 15:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Hannes Reinecke, y2038 Mailman List, Linux API, Jackie Liu,
	linux-kernel@vger.kernel.org, linux-block, Stefan Bühler,
	Alexander Viro, Linux FS-devel Mailing List, Andrew Morton,
	Hristo Venev
In-Reply-To: <ca0a5bbe-c20e-d5be-110e-942c604ad2d7@kernel.dk>

On Tue, Oct 1, 2019 at 5:52 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 10/1/19 9:49 AM, Arnd Bergmann wrote:
> > On Tue, Oct 1, 2019 at 5:38 PM Jens Axboe <axboe@kernel.dk> wrote:

> > What's wrong with using __kernel_timespec? Just the name?
> > I suppose liburing could add a macro to give it a different name
> > for its users.
>
> Just that it seems I need to make it available through liburing on
> systems that don't have it yet. Not a big deal, though.

Ah, right. I t would not cover the case of building against kernel
headers earlier than linux-5.1 but running on a 5.4+ kernel.

I assumed that that you would require new kernel headers anyway,
but if you have a copy of the io_uring header, that is not necessary.

> One thing that struck me about this approach - we then lose the ability to
> differentiate between "don't want a timed timeout" with ts == NULL, vs
> tv_sec and tv_nsec both being 0.

You could always define a special constant such as
'#define IO_URING_TIMEOUT_NEVER -1ull' if you want to
support for 'never wait if it's not already done' and 'wait indefinitely'.

> I think I'll stuck with that you had and just use __kernel_timespec in
> liburing.

Ok.

       Arnd
_______________________________________________
Y2038 mailing list
Y2038@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/y2038

^ permalink raw reply

* Re: [PATCH v4 0/4] lib: introduce copy_struct_from_user() helper
From: Christian Brauner @ 2019-10-01 16:01 UTC (permalink / raw)
  To: Aleksa Sarai, Kees Cook
  Cc: Ingo Molnar, Peter Zijlstra, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Rasmus Villemoes, Al Viro, Linus Torvalds,
	libc-alpha, linux-api, linux-kernel
In-Reply-To: <20191001011055.19283-1-cyphar@cyphar.com>

On Tue, Oct 01, 2019 at 11:10:51AM +1000, Aleksa Sarai wrote:
> Patch changelog:
>  v4:
>   * __always_inline copy_struct_from_user(). [Kees Cook]
>   * Rework test_user_copy.ko changes. [Kees Cook]
>  v3: <https://lore.kernel.org/lkml/20190930182810.6090-1-cyphar@cyphar.com/>
>      <https://lore.kernel.org/lkml/20190930191526.19544-1-asarai@suse.de/>
>  v2: <https://lore.kernel.org/lkml/20190925230332.18690-1-cyphar@cyphar.com/>
>  v1: <https://lore.kernel.org/lkml/20190925165915.8135-1-cyphar@cyphar.com/>
> 
> This series was split off from the openat2(2) syscall discussion[1].
> However, the copy_struct_to_user() helper has been dropped, because
> after some discussion it appears that there is no really obvious
> semantics for how copy_struct_to_user() should work on mixed-vintages
> (for instance, whether [2] is the correct semantics for all syscalls).
> 
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> result in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases).
> 
> Previously there was no common lib/ function that implemented
> the necessary extension-checking semantics (and different syscalls
> implemented them slightly differently or incompletely[3]). This series
> implements the helper and ports several syscalls to use it.
> 
> Some in-kernel selftests are included in this patch. More complete
> self-tests for copy_struct_from_user() are included in the openat2()
> patchset.
> 
> [1]: https://lore.kernel.org/lkml/20190904201933.10736-1-cyphar@cyphar.com/
> 
> [2]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
>      robustify sched_read_attr() ABI logic and code")
> 
> [3]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
>      similar checks to copy_struct_from_user() while rt_sigprocmask(2)
>      always rejects differently-sized struct arguments.
> 
> Aleksa Sarai (4):
>   lib: introduce copy_struct_from_user() helper
>   clone3: switch to copy_struct_from_user()
>   sched_setattr: switch to copy_struct_from_user()
>   perf_event_open: switch to copy_struct_from_user()
> 
>  include/linux/bitops.h     |   7 ++
>  include/linux/uaccess.h    |  70 +++++++++++++++++++
>  include/uapi/linux/sched.h |   2 +
>  kernel/events/core.c       |  47 +++----------
>  kernel/fork.c              |  34 ++--------
>  kernel/sched/core.c        |  43 ++----------
>  lib/strnlen_user.c         |   8 +--
>  lib/test_user_copy.c       | 136 +++++++++++++++++++++++++++++++++++--
>  lib/usercopy.c             |  55 +++++++++++++++
>  9 files changed, 288 insertions(+), 114 deletions(-)

I've picked this up now and it's sitting in
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=copy_struct_from_user

It should show up in linux-next tomorrow. I will let this sit there for
a few days but overall this seems good to have in rc2.
If someone objects and prefers to take it through their tree I can drop
it.

Christian

^ permalink raw reply

* Re: [PATCH] io_uring: use __kernel_timespec in timeout ABI
From: Jens Axboe @ 2019-10-01 16:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: y2038 Mailman List, Linux API, Alexander Viro, Stefan Bühler,
	Hannes Reinecke, Jackie Liu, Andrew Morton, Hristo Venev,
	linux-block, Linux FS-devel Mailing List,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAK8P3a19TDk0uo1eu4CcaKHvQCPUJGMjBV8Txtpgvg1ifAgW_A@mail.gmail.com>

On 10/1/19 9:57 AM, Arnd Bergmann wrote:
> On Tue, Oct 1, 2019 at 5:52 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 10/1/19 9:49 AM, Arnd Bergmann wrote:
>>> On Tue, Oct 1, 2019 at 5:38 PM Jens Axboe <axboe@kernel.dk> wrote:
> 
>>> What's wrong with using __kernel_timespec? Just the name?
>>> I suppose liburing could add a macro to give it a different name
>>> for its users.
>>
>> Just that it seems I need to make it available through liburing on
>> systems that don't have it yet. Not a big deal, though.
> 
> Ah, right. I t would not cover the case of building against kernel
> headers earlier than linux-5.1 but running on a 5.4+ kernel.
> 
> I assumed that that you would require new kernel headers anyway,
> but if you have a copy of the io_uring header, that is not necessary.

Since I rely mostly on folks using liburing, we include the header as
well. So I'm just going to use __kernel_timespec in liburing, and have
a check to define it if we don't have it.

>> One thing that struck me about this approach - we then lose the ability to
>> differentiate between "don't want a timed timeout" with ts == NULL, vs
>> tv_sec and tv_nsec both being 0.
> 
> You could always define a special constant such as
> '#define IO_URING_TIMEOUT_NEVER -1ull' if you want to
> support for 'never wait if it's not already done' and 'wait indefinitely'.

That thought did occur to me, but that seems pretty ugly... The ts == NULL
vs ts != NULL and timeout set is a more well understood pattern.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH] io_uring: use __kernel_timespec in timeout ABI
From: Florian Weimer @ 2019-10-01 16:07 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, y2038 Mailman List, Linux API, Alexander Viro,
	Stefan Bühler, Hannes Reinecke, Jackie Liu, Andrew Morton,
	Hristo Venev, linux-block, Linux FS-devel Mailing List,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAK8P3a3AAFXNmpQwuirzM+jgEQGj9tMC_5oaSs4CfiEVGmTkZg@mail.gmail.com>

* Arnd Bergmann:

> On Tue, Oct 1, 2019 at 5:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 10/1/19 8:09 AM, Jens Axboe wrote:
>> > On 9/30/19 2:20 PM, Arnd Bergmann wrote:
>> >> All system calls use struct __kernel_timespec instead of the old struct
>> >> timespec, but this one was just added with the old-style ABI. Change it
>> >> now to enforce the use of __kernel_timespec, avoiding ABI confusion and
>> >> the need for compat handlers on 32-bit architectures.
>> >>
>> >> Any user space caller will have to use __kernel_timespec now, but this
>> >> is unambiguous and works for any C library regardless of the time_t
>> >> definition. A nicer way to specify the timeout would have been a less
>> >> ambiguous 64-bit nanosecond value, but I suppose it's too late now to
>> >> change that as this would impact both 32-bit and 64-bit users.
>> >
>> > Thanks for catching that, Arnd. Applied.
>>
>> On second thought - since there appears to be no good 64-bit timespec
>> available to userspace, the alternative here is including on in liburing.
>
> What's wrong with using __kernel_timespec? Just the name?
> I suppose liburing could add a macro to give it a different name
> for its users.

Yes, mostly the name.

__ names are reserved for the C/C++ implementation (which does not
include the kernel).  __kernel_timespec looks like an internal kernel
type to the uninitiated, not a UAPI type.

Once we have struct timespec64 in userspace, you also end up with
copying stuff around or introducing aliasing violations.

I'm not saying those concerns are valid, but you asked what's wrong with
it. 8-)

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH v4 1/4] lib: introduce copy_struct_from_user() helper
From: Kees Cook @ 2019-10-01 16:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Aleksa Sarai, Ingo Molnar, Peter Zijlstra, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Rasmus Villemoes, Al Viro,
	Linus Torvalds, libc-alpha, linux-api, linux-kernel
In-Reply-To: <20191001023126.qhzeiwmtoo4agy7t@wittgenstein>

On Tue, Oct 01, 2019 at 04:31:27AM +0200, Christian Brauner wrote:
> On Mon, Sep 30, 2019 at 06:58:39PM -0700, Kees Cook wrote:
> > On Tue, Oct 01, 2019 at 11:10:52AM +1000, Aleksa Sarai wrote:
> > > +static __always_inline
> > > +int copy_struct_from_user(void *dst, size_t ksize,
> > > +			  const void __user *src, size_t usize)
> > 
> > And of course I forgot to realize both this and check_zeroed_user()
> > should also have the __must_check attribute. Sorry for forgetting that
> > earlier!
> 
> Just said to Aleksa that I'll just fix this up when I apply so he
> doesn't have to resend. You ok with this, Kees?

Yup; that's totally fine. Thanks!

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH] io_uring: use __kernel_timespec in timeout ABI
From: Jens Axboe @ 2019-10-01 18:08 UTC (permalink / raw)
  To: Florian Weimer, Arnd Bergmann
  Cc: y2038 Mailman List, Linux API, Alexander Viro, Stefan Bühler,
	Hannes Reinecke, Jackie Liu, Andrew Morton, Hristo Venev,
	linux-block, Linux FS-devel Mailing List,
	linux-kernel@vger.kernel.org
In-Reply-To: <874l0stpog.fsf@oldenburg2.str.redhat.com>

On 10/1/19 10:07 AM, Florian Weimer wrote:
> * Arnd Bergmann:
> 
>> On Tue, Oct 1, 2019 at 5:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 10/1/19 8:09 AM, Jens Axboe wrote:
>>>> On 9/30/19 2:20 PM, Arnd Bergmann wrote:
>>>>> All system calls use struct __kernel_timespec instead of the old struct
>>>>> timespec, but this one was just added with the old-style ABI. Change it
>>>>> now to enforce the use of __kernel_timespec, avoiding ABI confusion and
>>>>> the need for compat handlers on 32-bit architectures.
>>>>>
>>>>> Any user space caller will have to use __kernel_timespec now, but this
>>>>> is unambiguous and works for any C library regardless of the time_t
>>>>> definition. A nicer way to specify the timeout would have been a less
>>>>> ambiguous 64-bit nanosecond value, but I suppose it's too late now to
>>>>> change that as this would impact both 32-bit and 64-bit users.
>>>>
>>>> Thanks for catching that, Arnd. Applied.
>>>
>>> On second thought - since there appears to be no good 64-bit timespec
>>> available to userspace, the alternative here is including on in liburing.
>>
>> What's wrong with using __kernel_timespec? Just the name?
>> I suppose liburing could add a macro to give it a different name
>> for its users.
> 
> Yes, mostly the name.
> 
> __ names are reserved for the C/C++ implementation (which does not
> include the kernel).  __kernel_timespec looks like an internal kernel
> type to the uninitiated, not a UAPI type.
> 
> Once we have struct timespec64 in userspace, you also end up with
> copying stuff around or introducing aliasing violations.
> 
> I'm not saying those concerns are valid, but you asked what's wrong with
> it. 8-)

FWIW, I do agree, __kernel_timespec sounds like an internal type, not
something apps should be using. timespec64 works a lot better for that.
Oh well.

-- 
Jens Axboe

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Kees Cook @ 2019-10-01 18:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-arch, linux-api, Linus Torvalds, Andi Kleen,
	Andi Kleen, Cong Wang, Luis Chamberlain, Alex Smith, Anders Berg,
	Apelete Seketeli, Arnd Bergmann, Chee Nouk Phoon, Chris Zankel,
	Christian Ruppert, Greg Ungerer, Harvey Hunt, Helge Deller,
	Hongliang Tao, Hua Yan, Huacai Chen, John Crispin <blo>
In-Reply-To: <8736gcjosv.fsf@x220.int.ebiederm.org>

On Tue, Oct 01, 2019 at 01:36:32PM -0500, Eric W. Biederman wrote:
> 
> This system call has been deprecated almost since it was introduced, and
> in a survey of the linux distributions I can no longer find any of them
> that enable CONFIG_SYSCTL_SYSCALL.  The only indication that I can find
> that anyone might care is that a few of the defconfigs in the kernel
> enable CONFIG_SYSCTL_SYSCALL.  However this appears in only 31 of 414
> defconfigs in the kernel, so I suspect this symbols presence is simply
> because it is harmless to include rather than because it is necessary.
> 
> As there appear to be no users of the sysctl system call, remove the
> code.  As this removes one of the few uses of the internal kernel mount
> of proc I hope this allows for even more simplifications of the proc
> filesystem.

I'm for it. :) I tripped over this being deprecated over a decade ago. :P

I think you can actually take this further and remove (or at least
empty) the uapi/linux/sysctl.h file too.

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-10-01 22:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kees Cook, Andy Lutomirski, Andy Lutomirski, Alexei Starovoitov,
	LSM List, James Morris, Jann Horn, Peter Zijlstra,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20191001012226.vwpe56won5r7gbrz@ast-mbp.dhcp.thefacebook.com>

On Mon, 30 Sep 2019 18:22:28 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> tracefs is a file system, so clearly file based acls are much better fit
> for all tracefs operations.
> But that is not the case for ftrace overall.
> bpf_trace_printk() calls trace_printk() that dumps into trace pipe.
> Technically it's ftrace operation, but it cannot be controlled by tracefs
> and by file permissions. That's the motivation to guard bpf_trace_printk()
> usage from bpf program with CAP_TRACING.

BTW, I'd rather have bpf use an event that records a string than using
trace printk itself.

Perhaps something like "bpf_print" event? That could be defined like:

TRACE_EVENT(bpf_print,
	TP_PROTO(const char *msg),
	TP_ARGS(msg),
	TP_STRUCT__entry(
		__string(msg, msg)
	),
	TP_fast_assign(
		__assign_str(msg, msg)
	),
	TP_printk("msg=%s", __get_str(msg))
);

And then you can just format the string from the bpf_trace_printk()
into msg, and then have:

	trace_bpf_print(msg);

The user could then just enable the trace event from the file system. I
could also work on making instances work like /tmp does (with the
sticky bit) in creation. That way people with write access to the
instances directory, can make their own buffers that they can use (and
others can't access).


> 
> Both 'trace' and 'trace_pipe' have quirky side effects.
> Like opening 'trace' file will make all parallel trace_printk() to be ignored.
> While reading 'trace_pipe' file will clear it.
> The point that traditional 'read' and 'write' ACLs don't map as-is
> to tracefs, so I would be careful categorizing things into
> confidentiality vs integrity only based on access type.

What exactly is the bpf_trace_printk() used for? I may have other ideas
that can help.

-- Steve

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-10-01 22:18 UTC (permalink / raw)
  To: Steven Rostedt, Alexei Starovoitov
  Cc: Kees Cook, Andy Lutomirski, Andy Lutomirski, Alexei Starovoitov,
	LSM List, James Morris, Jann Horn, Peter Zijlstra,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Network Development, bpf, Kernel Team, Linux API
In-Reply-To: <20191001181052.43c9fabb@gandalf.local.home>

On 10/1/19 3:10 PM, Steven Rostedt wrote:
> On Mon, 30 Sep 2019 18:22:28 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> tracefs is a file system, so clearly file based acls are much better fit
>> for all tracefs operations.
>> But that is not the case for ftrace overall.
>> bpf_trace_printk() calls trace_printk() that dumps into trace pipe.
>> Technically it's ftrace operation, but it cannot be controlled by tracefs
>> and by file permissions. That's the motivation to guard bpf_trace_printk()
>> usage from bpf program with CAP_TRACING.
> 
> BTW, I'd rather have bpf use an event that records a string than using
> trace printk itself.
> 
> Perhaps something like "bpf_print" event? That could be defined like:
> 
> TRACE_EVENT(bpf_print,
> 	TP_PROTO(const char *msg),
> 	TP_ARGS(msg),
> 	TP_STRUCT__entry(
> 		__string(msg, msg)
> 	),
> 	TP_fast_assign(
> 		__assign_str(msg, msg)
> 	),
> 	TP_printk("msg=%s", __get_str(msg))
> );
> 
> And then you can just format the string from the bpf_trace_printk()
> into msg, and then have:
> 
> 	trace_bpf_print(msg);

It's an interesting idea, but I don't think it can work.
Please see bpf_trace_printk implementation in kernel/trace/bpf_trace.c
It's a lot more than string printing.

> The user could then just enable the trace event from the file system. I
> could also work on making instances work like /tmp does (with the
> sticky bit) in creation. That way people with write access to the
> instances directory, can make their own buffers that they can use (and
> others can't access).

We tried instances in bcc in the past and eventually removed all the 
support. The overhead of instances is too high to be usable.

> 
> 
>>
>> Both 'trace' and 'trace_pipe' have quirky side effects.
>> Like opening 'trace' file will make all parallel trace_printk() to be ignored.
>> While reading 'trace_pipe' file will clear it.
>> The point that traditional 'read' and 'write' ACLs don't map as-is
>> to tracefs, so I would be careful categorizing things into
>> confidentiality vs integrity only based on access type.
> 
> What exactly is the bpf_trace_printk() used for? I may have other ideas
> that can help.

It's debugging of bpf programs. Same is what printk() is used for
by kernel developers.


^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-10-01 22:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Kees Cook, Andy Lutomirski, Andy Lutomirski,
	Alexei Starovoitov, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, David S. Miller,
	Daniel Borkmann, Network Development, bpf, Kernel Team, Linux API
In-Reply-To: <6e8b910c-a739-857d-4867-395bd369bc6a@fb.com>

On Tue, 1 Oct 2019 22:18:18 +0000
Alexei Starovoitov <ast@fb.com> wrote:

> > And then you can just format the string from the bpf_trace_printk()
> > into msg, and then have:
> > 
> > 	trace_bpf_print(msg);  
> 
> It's an interesting idea, but I don't think it can work.
> Please see bpf_trace_printk implementation in kernel/trace/bpf_trace.c
> It's a lot more than string printing.

Well, trace_printk() is just string printing. I was thinking that the
bpf_trace_printk() could just use a vsnprintf() into a temporary buffer
(like trace_printk() does), and then call the trace event to write it
out.

> 
> > The user could then just enable the trace event from the file system. I
> > could also work on making instances work like /tmp does (with the
> > sticky bit) in creation. That way people with write access to the
> > instances directory, can make their own buffers that they can use (and
> > others can't access).  
> 
> We tried instances in bcc in the past and eventually removed all the 
> support. The overhead of instances is too high to be usable.

What overhead? An ftrace instance should not have any more overhead than
the root one does (it's the same code). Or are you talking about memory
overhead?

> 
> > 
> >   
> >>
> >> Both 'trace' and 'trace_pipe' have quirky side effects.
> >> Like opening 'trace' file will make all parallel trace_printk() to be ignored.
> >> While reading 'trace_pipe' file will clear it.
> >> The point that traditional 'read' and 'write' ACLs don't map as-is
> >> to tracefs, so I would be careful categorizing things into
> >> confidentiality vs integrity only based on access type.  
> > 
> > What exactly is the bpf_trace_printk() used for? I may have other ideas
> > that can help.  
> 
> It's debugging of bpf programs. Same is what printk() is used for
> by kernel developers.
> 

How is it extracted? Just read from the trace or trace_pipe file?

-- Steve

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Eric W. Biederman @ 2019-10-01 22:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, linux-arch, linux-api, Linus Torvalds, Andi Kleen,
	Andi Kleen, Cong Wang, Luis Chamberlain, Apelete Seketeli,
	Arnd Bergmann, Chee Nouk Phoon, Chris Zankel, Christian Ruppert,
	Greg Ungerer, Helge Deller, Hongliang Tao, Huacai Chen,
	Jonas Jensen, Josh Boyer, Jun Nie, Lars-Peter Clausen
In-Reply-To: <201910011140.EA0181F13@keescook>

Kees Cook <keescook@chromium.org> writes:

> On Tue, Oct 01, 2019 at 01:36:32PM -0500, Eric W. Biederman wrote:
>> 
>> This system call has been deprecated almost since it was introduced, and
>> in a survey of the linux distributions I can no longer find any of them
>> that enable CONFIG_SYSCTL_SYSCALL.  The only indication that I can find
>> that anyone might care is that a few of the defconfigs in the kernel
>> enable CONFIG_SYSCTL_SYSCALL.  However this appears in only 31 of 414
>> defconfigs in the kernel, so I suspect this symbols presence is simply
>> because it is harmless to include rather than because it is necessary.
>> 
>> As there appear to be no users of the sysctl system call, remove the
>> code.  As this removes one of the few uses of the internal kernel mount
>> of proc I hope this allows for even more simplifications of the proc
>> filesystem.
>
> I'm for it. :) I tripped over this being deprecated over a decade ago. :P
>
> I think you can actually take this further and remove (or at least
> empty) the uapi/linux/sysctl.h file too.

I copied everyone who had put this into a defconfig and I will wait a
little more to see if anyone screams.  I think it is a safe guess that
several of the affected configurations are dead (or at least
unmaintained) as I received 17 bounces when copying everyone.

I would make it a followup that removes uapi/linux/sysctl.h.  I don't
see anything in it that isn't about the sysctl system call.  I will keep
it a separate patch as I can imagine something silly that needs the
header file to compile.  A separate patch would make a revert easier
if we find something like that.

Eric

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Jann Horn @ 2019-10-02  0:40 UTC (permalink / raw)
  To: Eric W. Biederman, Kostya Serebryany
  Cc: Kees Cook, kernel list, linux-arch, Linux API, Linus Torvalds,
	Andi Kleen, Andi Kleen, Cong Wang, Luis Chamberlain,
	Apelete Seketeli, Arnd Bergmann, Chee Nouk Phoon, Chris Zankel,
	Christian Ruppert, Greg Ungerer, Helge Deller, Hongliang Tao,
	Huacai Chen, Jonas Jensen, Josh Boyer, Jun
In-Reply-To: <87imp8hyc8.fsf@x220.int.ebiederm.org>

+Kostya (code owner for LLVM sanitizer_common) as FYI

On Wed, Oct 2, 2019 at 12:54 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
> > On Tue, Oct 01, 2019 at 01:36:32PM -0500, Eric W. Biederman wrote:
[...]
> > I think you can actually take this further and remove (or at least
> > empty) the uapi/linux/sysctl.h file too.
[...]
> I would make it a followup that removes uapi/linux/sysctl.h.  I don't
> see anything in it that isn't about the sysctl system call.  I will keep
> it a separate patch as I can imagine something silly that needs the
> header file to compile.  A separate patch would make a revert easier
> if we find something like that.

Unfortunately, I think that header (or at least parts of it) has to
stay around for now:

Looking through the search results for linux/sysctl.h (ignoring
glibc's sys/sysctl.h, which pulls in linux/sysctl.h, because almost
all of those hits are conditional includes for BSD systems) on
codesearch.debian.net, I noticed that e.g. the ASAN code that GCC and
LLVM use pulls in linux/sysctl.h and uses things from it:

https://github.com/llvm-mirror/compiler-rt/blob/124fd5d9aff57cf47bf077df81ad939b289acc6e/lib/sanitizer_common/sanitizer_platform_limits_posix.cpp#L1063

And indeed, after replacing /usr/include/linux/sysctl.h with an empty
file, a build of LLVM's runtime library component (compiler-rt) (git
HEAD version) falls over with error spew about __sysctl_args:

====================
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Release
-DLLVM_ENABLE_PROJECTS='clang;compiler-rt' -DCMAKE_C_COMPILER=clang-7
-DCMAKE_CXX_COMPILER=clang++-7 -DLLVM_TARGETS_TO_BUILD="X86"
-DLLVM_USE_LINKER=lld-7 -DBUILD_SHARED_LIBS=Off ../llvm
[...]
$ ninja -j64
FAILED: projects/compiler-rt/lib/sanitizer_common/CMakeFiles/RTSanitizerCommonNoTermination.i386.dir/sanitizer_platform_limits_posix.cpp.o
[...]
[...]/compiler-rt/lib/sanitizer_common/sanitizer_platform_limits_posix.cpp:1063:17:
error: use of undeclared identifier '__sysctl_args'
CHECK_TYPE_SIZE(__sysctl_args);
                ^
[...]/compiler-rt/lib/sanitizer_common/sanitizer_platform_limits_posix.cpp:1064:23:
error: use of undeclared identifier '__sysctl_args'
CHECK_SIZE_AND_OFFSET(__sysctl_args, name);
                      ^
[...]/compiler-rt/lib/sanitizer_common/sanitizer_platform_limits_posix.cpp:1064:1:
error: expected expression
CHECK_SIZE_AND_OFFSET(__sysctl_args, name);
^
[...]/compiler-rt/lib/sanitizer_common/sanitizer_platform_limits_posix.h:1438:34:
note: expanded from macro 'CHECK_SIZE_AND_OFFSET'
                 sizeof(((CLASS *)NULL)->MEMBER));                \
                                 ^
[...]/compiler-rt/lib/sanitizer_common/sanitizer_platform_limits_posix.cpp:1064:23:
error: unknown type name '__sysctl_args'
CHECK_SIZE_AND_OFFSET(__sysctl_args, name);
                      ^
====================

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Arnd Bergmann @ 2019-10-02  7:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kees Cook, linux-kernel@vger.kernel.org, linux-arch, Linux API,
	Linus Torvalds, Andi Kleen, Andi Kleen, Cong Wang,
	Luis Chamberlain, Apelete Seketeli, Chee Nouk Phoon, Chris Zankel,
	Christian Ruppert, Greg Ungerer, Helge Deller, Hongliang Tao,
	Huacai Chen, Jonas Jensen, Josh Boyer, Jun Nie
In-Reply-To: <87imp8hyc8.fsf@x220.int.ebiederm.org>

On Wed, Oct 2, 2019 at 12:54 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Kees Cook <keescook@chromium.org> writes:
> > On Tue, Oct 01, 2019 at 01:36:32PM -0500, Eric W. Biederman wrote:
> >
> > I think you can actually take this further and remove (or at least
> > empty) the uapi/linux/sysctl.h file too.
>
> I copied everyone who had put this into a defconfig and I will wait a
> little more to see if anyone screams.  I think it is a safe guess that
> several of the affected configurations are dead (or at least
> unmaintained) as I received 17 bounces when copying everyone.

Looking at the arm defconfigs:

> arch/arm/configs/axm55xx_defconfig:CONFIG_SYSCTL_SYSCALL=y

No notable work on this platform since it got sold to Intel in 2014.
I think they still use it but not with mainline kernels that lack support
for most drivers and the later chips.

> arch/arm/configs/keystone_defconfig:CONFIG_SYSCTL_SYSCALL=y

Not that old either, but this hardware is mostly obsoleted by newer variants
that we support with the arm64 defconfig.

> arch/arm/configs/lpc32xx_defconfig:CONFIG_SYSCTL_SYSCALL=y
> arch/arm/configs/moxart_defconfig:CONFIG_SYSCTL_SYSCALL=y

Ancient hardware, but still in active use. These tend to have very little
RAM, but they both enable CONFIG_PROC_FS.

> arch/arm/configs/qcom_defconfig:CONFIG_SYSCTL_SYSCALL=y
> arch/arm/configs/zx_defconfig:CONFIG_SYSCTL_SYSCALL=y

These are for older Qualcomm and LG chips that tend to be used
with Android rather than the defconfig here. Maybe double-check
if the official android-common tree enables SYSCTL_SYSCALL.

      Arnd

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Eric W. Biederman @ 2019-10-02 14:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kees Cook, linux-kernel@vger.kernel.org, linux-arch, Linux API,
	Linus Torvalds, Andi Kleen, Andi Kleen, Cong Wang,
	Luis Chamberlain, Apelete Seketeli, Chee Nouk Phoon, Chris Zankel,
	Christian Ruppert, Greg Ungerer, Helge Deller, Hongliang Tao,
	Huacai Chen, Jonas Jensen, Josh Boyer, Jun Nie
In-Reply-To: <CAK8P3a1zLATC7rzYxSpAK-z=NJ1rw7-3ZgHqCOJUUf6b9HwK1A@mail.gmail.com>

Arnd Bergmann <arnd@arndb.de> writes:

> On Wed, Oct 2, 2019 at 12:54 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Kees Cook <keescook@chromium.org> writes:
>> > On Tue, Oct 01, 2019 at 01:36:32PM -0500, Eric W. Biederman wrote:
>> >
>> > I think you can actually take this further and remove (or at least
>> > empty) the uapi/linux/sysctl.h file too.
>>
>> I copied everyone who had put this into a defconfig and I will wait a
>> little more to see if anyone screams.  I think it is a safe guess that
>> several of the affected configurations are dead (or at least
>> unmaintained) as I received 17 bounces when copying everyone.
>
> Looking at the arm defconfigs:
>
>> arch/arm/configs/axm55xx_defconfig:CONFIG_SYSCTL_SYSCALL=y
>
> No notable work on this platform since it got sold to Intel in 2014.
> I think they still use it but not with mainline kernels that lack support
> for most drivers and the later chips.
>
>> arch/arm/configs/keystone_defconfig:CONFIG_SYSCTL_SYSCALL=y
>
> Not that old either, but this hardware is mostly obsoleted by newer variants
> that we support with the arm64 defconfig.
>
>> arch/arm/configs/lpc32xx_defconfig:CONFIG_SYSCTL_SYSCALL=y
>> arch/arm/configs/moxart_defconfig:CONFIG_SYSCTL_SYSCALL=y
>
> Ancient hardware, but still in active use. These tend to have very little
> RAM, but they both enable CONFIG_PROC_FS.
>
>> arch/arm/configs/qcom_defconfig:CONFIG_SYSCTL_SYSCALL=y
>> arch/arm/configs/zx_defconfig:CONFIG_SYSCTL_SYSCALL=y
>
> These are for older Qualcomm and LG chips that tend to be used
> with Android rather than the defconfig here. Maybe double-check
> if the official android-common tree enables SYSCTL_SYSCALL.

I just looked quickly at:
https://android.googlesource.com/kernel/configs/

I don't see the string SYSCTL mentioned anywhere.  Much less
SYSCTL_SYSCALL.

Eric

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Eric W. Biederman @ 2019-10-02 14:49 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kees Cook, linux-kernel@vger.kernel.org, linux-arch, Linux API,
	Linus Torvalds, Andi Kleen, Andi Kleen, Cong Wang,
	Luis Chamberlain, Apelete Seketeli, Chee Nouk Phoon, Chris Zankel,
	Christian Ruppert, Greg Ungerer, Helge Deller, Hongliang Tao,
	Huacai Chen, Jonas Jensen, Josh Boyer, Jun Nie
In-Reply-To: <CAK8P3a1zLATC7rzYxSpAK-z=NJ1rw7-3ZgHqCOJUUf6b9HwK1A@mail.gmail.com>

Arnd Bergmann <arnd@arndb.de> writes:

> On Wed, Oct 2, 2019 at 12:54 AM Eric W. Biederman <ebiederm@xmission.com> wrote:

>> arch/arm/configs/lpc32xx_defconfig:CONFIG_SYSCTL_SYSCALL=y
>> arch/arm/configs/moxart_defconfig:CONFIG_SYSCTL_SYSCALL=y
>
> Ancient hardware, but still in active use. These tend to have very little
> RAM, but they both enable CONFIG_PROC_FS.

You actually have to enable CONFIG_PROC_FS to enable
CONFIG_SYSCTL_SYSCALL at this point.  CONFIG_SYSCTL_SYSCALL is just an
emulation of the old interface built on top of /proc.

Thank you for the feedback.

Eric

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-10-02 17:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alexei Starovoitov, Kees Cook, Andy Lutomirski, Andy Lutomirski,
	Alexei Starovoitov, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, David S. Miller,
	Daniel Borkmann, Network Development, bpf, Kernel Team, Linux API
In-Reply-To: <20191001184731.0ec98c7a@gandalf.local.home>

On 10/1/19 3:47 PM, Steven Rostedt wrote:
> On Tue, 1 Oct 2019 22:18:18 +0000
> Alexei Starovoitov <ast@fb.com> wrote:
> 
>>> And then you can just format the string from the bpf_trace_printk()
>>> into msg, and then have:
>>>
>>> 	trace_bpf_print(msg);
>>
>> It's an interesting idea, but I don't think it can work.
>> Please see bpf_trace_printk implementation in kernel/trace/bpf_trace.c
>> It's a lot more than string printing.
> 
> Well, trace_printk() is just string printing. I was thinking that the
> bpf_trace_printk() could just use a vsnprintf() into a temporary buffer
> (like trace_printk() does), and then call the trace event to write it
> out.

are you proposing to replicate get_trace_buf() functionality
into bpf_trace_printk?
So print into temp string buffer is done twice?
I'm not excited about such hack.
And what's the goal? so that trace_bpf_print(string_msg);
can go through _run-time_ check whether that particular trace event
was allowed in tracefs ?
That's not how file system acls are typically designed.
The permission check is at open(). Not at write().
If I understood you correctly you're proposing to check permissions
at bpf program run-time which is no good.

bpf_trace_printk() already has one small buffer for
probe_kernel_read-ing an unknown string to pass into %s.
That's not ftrace. That's core tracing. That aspect is covered by 
CAP_TRACING as well.


>>
>>> The user could then just enable the trace event from the file system. I
>>> could also work on making instances work like /tmp does (with the
>>> sticky bit) in creation. That way people with write access to the
>>> instances directory, can make their own buffers that they can use (and
>>> others can't access).
>>
>> We tried instances in bcc in the past and eventually removed all the
>> support. The overhead of instances is too high to be usable.
> 
> What overhead? An ftrace instance should not have any more overhead than
> the root one does (it's the same code). Or are you talking about memory
> overhead?

Yes. Memory overhead. Human users doing cat/echo into tracefs won't be
creating many instances, so that's the only practical usage of them.

> 
>>
>>>
>>>    
>>>>
>>>> Both 'trace' and 'trace_pipe' have quirky side effects.
>>>> Like opening 'trace' file will make all parallel trace_printk() to be ignored.
>>>> While reading 'trace_pipe' file will clear it.
>>>> The point that traditional 'read' and 'write' ACLs don't map as-is
>>>> to tracefs, so I would be careful categorizing things into
>>>> confidentiality vs integrity only based on access type.
>>>
>>> What exactly is the bpf_trace_printk() used for? I may have other ideas
>>> that can help.
>>
>> It's debugging of bpf programs. Same is what printk() is used for
>> by kernel developers.
>>
> 
> How is it extracted? Just read from the trace or trace_pipe file?

yep. Just like kernel devs look at dmesg when they sprinkle printk.
btw, if you can fix 'trace' file issue that stops all trace_printk
while 'trace' file is open that would be great.
Some users have been bitten by this behavior. We even documented it.

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Helge Deller @ 2019-10-02 18:52 UTC (permalink / raw)
  To: Eric W. Biederman, Kees Cook
  Cc: linux-kernel, linux-arch, linux-api, Linus Torvalds, Andi Kleen,
	Andi Kleen, Cong Wang, Luis Chamberlain, Apelete Seketeli,
	Arnd Bergmann, Chee Nouk Phoon, Chris Zankel, Christian Ruppert,
	Greg Ungerer, Hongliang Tao, Huacai Chen, Jonas Jensen,
	Josh Boyer, Jun Nie, Lars-Peter Clausen, Ley Foon Tan
In-Reply-To: <87imp8hyc8.fsf@x220.int.ebiederm.org>

On 02.10.19 00:53, Eric W. Biederman wrote:
> Kees Cook <keescook@chromium.org> writes:
>>> As there appear to be no users of the sysctl system call, remove the
>>> code....>...
> I copied everyone who had put this into a defconfig and I will wait a
> little more to see if anyone screams.  I think it is a safe guess that
> several of the affected configurations are dead (or at least
> unmaintained) as I received 17 bounces when copying everyone.

  arch/parisc/configs/c8000_defconfig         |    1 -
  arch/parisc/configs/generic-32bit_defconfig |    1 -

I'm not aware that we require the sysctl syscall somewhere on parisc,
so I think it's safe to remove the code.

Helge

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-10-02 23:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Kees Cook, Andy Lutomirski, Andy Lutomirski,
	Alexei Starovoitov, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, David S. Miller,
	Daniel Borkmann, Network Development, bpf, Kernel Team, Linux API
In-Reply-To: <a98725c6-a7db-1d9f-7033-5ecd96438c8d@fb.com>

On Wed, 2 Oct 2019 17:18:21 +0000
Alexei Starovoitov <ast@fb.com> wrote:

> >> It's an interesting idea, but I don't think it can work.
> >> Please see bpf_trace_printk implementation in kernel/trace/bpf_trace.c
> >> It's a lot more than string printing.  
> > 
> > Well, trace_printk() is just string printing. I was thinking that the
> > bpf_trace_printk() could just use a vsnprintf() into a temporary buffer
> > (like trace_printk() does), and then call the trace event to write it
> > out.  
> 
> are you proposing to replicate get_trace_buf() functionality
> into bpf_trace_printk?

No, do you need bpf_trace_printk() to run in all contexts?
trace_printk() does the get_trace_buf() dance so that it can be called
without locks and from any context including NMIs.

> So print into temp string buffer is done twice?
> I'm not excited about such hack.
> And what's the goal? so that trace_bpf_print(string_msg);
> can go through _run-time_ check whether that particular trace event
> was allowed in tracefs ?

No, just to use a standard event instead of hacking into
trace_printk().

> That's not how file system acls are typically designed.
> The permission check is at open(). Not at write().
> If I understood you correctly you're proposing to check permissions
> at bpf program run-time which is no good.
> 
> bpf_trace_printk() already has one small buffer for
> probe_kernel_read-ing an unknown string to pass into %s.
> That's not ftrace. That's core tracing. That aspect is covered by 
> CAP_TRACING as well.

Then use that buffer.

> 
> 
> >>  
> >>> The user could then just enable the trace event from the file system. I
> >>> could also work on making instances work like /tmp does (with the
> >>> sticky bit) in creation. That way people with write access to the
> >>> instances directory, can make their own buffers that they can use (and
> >>> others can't access).  
> >>
> >> We tried instances in bcc in the past and eventually removed all the
> >> support. The overhead of instances is too high to be usable.  
> > 
> > What overhead? An ftrace instance should not have any more overhead than
> > the root one does (it's the same code). Or are you talking about memory
> > overhead?  
> 
> Yes. Memory overhead. Human users doing cat/echo into tracefs won't be
> creating many instances, so that's the only practical usage of them.

If it's a real event, it can go into any of the ftrace buffers (top
level or instance), but it gives you the choice.

> 
> >   
> >>  
> >>>
> >>>      
> >>>>
> >>>> Both 'trace' and 'trace_pipe' have quirky side effects.
> >>>> Like opening 'trace' file will make all parallel trace_printk() to be ignored.
> >>>> While reading 'trace_pipe' file will clear it.
> >>>> The point that traditional 'read' and 'write' ACLs don't map as-is
> >>>> to tracefs, so I would be careful categorizing things into
> >>>> confidentiality vs integrity only based on access type.  
> >>>
> >>> What exactly is the bpf_trace_printk() used for? I may have other ideas
> >>> that can help.  
> >>
> >> It's debugging of bpf programs. Same is what printk() is used for
> >> by kernel developers.
> >>  
> > 
> > How is it extracted? Just read from the trace or trace_pipe file?  
> 
> yep. Just like kernel devs look at dmesg when they sprinkle printk.
> btw, if you can fix 'trace' file issue that stops all trace_printk
> while 'trace' file is open that would be great.
> Some users have been bitten by this behavior. We even documented it.

The behavior is documented as well in the ftrace documentation. That's
why we suggest the trace_pipe redirected into a file so that you don't
lose data (unless the writer goes too fast). If you prefer a producer
consumer where you lose newer events (like perf does), you can turn off
overwrite mode, and it will drop events when the buffer is full (see
options/overwrite).

-- Steve

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Masami Hiramatsu @ 2019-10-03  6:12 UTC (permalink / raw)
  To: Kees Cook
  Cc: Steven Rostedt, Alexei Starovoitov, Andy Lutomirski,
	Andy Lutomirski, Alexei Starovoitov, LSM List, James Morris,
	Jann Horn, Peter Zijlstra, Masami Hiramatsu, David S. Miller,
	Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <201909301129.5A1129C@keescook>

On Mon, 30 Sep 2019 11:31:29 -0700
Kees Cook <keescook@chromium.org> wrote:

> On Sat, Sep 28, 2019 at 07:37:27PM -0400, Steven Rostedt wrote:
> > On Wed, 28 Aug 2019 21:07:24 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > > 
> > > > This won’t make me much more comfortable, since CAP_BPF lets it do an ever-growing set of nasty things. I’d much rather one or both of two things happen:
> > > > 
> > > > 1. Give it CAP_TRACING only. It can leak my data, but it’s rather hard for it to crash my laptop, lose data, or cause other shenanigans.
> > > > 
> > > > 2. Improve it a bit do all the privileged ops are wrapped by capset().
> > > > 
> > > > Does this make sense?  I’m a security person on occasion. I find
> > > > vulnerabilities and exploit them deliberately and I break things by
> > > > accident on a regular basis. In my considered opinion, CAP_TRACING
> > > > alone, even extended to cover part of BPF as I’ve described, is
> > > > decently safe. Getting root with just CAP_TRACING will be decently
> > > > challenging, especially if I don’t get to read things like sshd’s
> > > > memory, and improvements to mitigate even that could be added.  I
> > > > am quite confident that attacks starting with CAP_TRACING will have
> > > > clear audit signatures if auditing is on.  I am also confident that
> > > > CAP_BPF *will* allow DoS and likely privilege escalation, and this
> > > > will only get more likely as BPF gets more widely used. And, if
> > > > BPF-based auditing ever becomes a thing, writing to the audit
> > > > daemon’s maps will be a great way to cover one’s tracks.  
> > > 
> > > CAP_TRACING, as I'm proposing it, will allow full tracefs access.
> > > I think Steven and Massami prefer that as well.
> > > That includes kprobe with probe_kernel_read.
> > > That also means mini-DoS by installing kprobes everywhere or running
> > > too much ftrace.
> > 
> > I was talking with Kees at Plumbers about this, and we were talking
> > about just using simple file permissions. I started playing with some
> > patches to allow the tracefs be visible but by default it would only be
> > visible by root.
> > 
> >  rwx------
> > 
> > Then a start up script (or perhaps mount options) could change the
> > group owner, and change this to:
> > 
> >  rwxrwx---
> > 
> > Where anyone in the group assigned (say "tracing") gets full access to
> > the file system.

Does it for "all" files under tracefs?

> > 
> > The more I was playing with this, the less I see the need for
> > CAP_TRACING for ftrace and reading the format files.
> 
> Nice! Thanks for playing with this. I like it because it gives us a way
> to push policy into userspace (group membership, etc), and provides a
> clean way (hopefully) do separate "read" (kernel memory confidentiality)
> from "write" (kernel memory integrity), which wouldn't have been possible
> with a single new CAP_...

 From the confidentiality point of view, if tracefs exposes traced data,
it might include in-kernel pointer and symbols, but the user still can't
see /proc/kallsyms. This means we still have several different confidentiality
for each interface.

Anyway, adding a tracefs mount option for allowing a user group to access
event format data will be a good idea. But even though, I  think we still
need the CAP_TRACING for allowing control of intrusive tracing, like kprobes
and bpf etc. (Or, do we keep those for CAP_SYS_ADMIN??)

BTW, should we request CAP_SYS_PTRACE for ftrace uprobe interface too?
It might break any user-space program (including init) if user puts a
probe on a wrong address (e.g. non instruction boundary on x86).

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: [RFC][PATCH] sysctl: Remove the sysctl system call
From: Florian Weimer @ 2019-10-03  6:56 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, linux-kernel, linux-arch, linux-api, Jann Horn,
	Arnd Bergmann, Helge Deller
In-Reply-To: <201910011140.EA0181F13@keescook>

Is anyone else getting a very incomplete set of messages in this
thread?

These changes likely matter to glibc, and I've yet to see the actual
patch.  Would someone please forward it to me?

The original message didn't make it into the lore.kernel.org archives
(the cross-post to linux-kernel should have taken care of that).

^ permalink raw reply

* [PATCH RFC 0/3] document openat2(2) patch series
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel

This is a first draft of the man-page changes for the openat2(2) patch
series I'm working on[1]. It includes information about the magic-link
changes as well as the primary new features (O_EMPTYPATH and openat2).

Let me know what you think. I might go into too much detail about how
extension of openat2(2) will work -- let me know if that section should
be dropped (while it is useful for userspace to understand, it isn't
really that necessary to explain exactly what the semantics are -- it
will usually just transparently work).

[1]: https://lore.kernel.org/lkml/20190930183316.10190-1-cyphar@cyphar.com/

Aleksa Sarai (3):
  symlink.7: document magic-links more completely
  open.2: add O_EMPTYPATH documentation
  openat2.2: document new openat2(2) syscall

 man2/open.2            |  47 ++++-
 man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
 man7/path_resolution.7 |  89 ++++++++--
 man7/symlink.7         |  39 ++++-
 4 files changed, 528 insertions(+), 28 deletions(-)
 create mode 100644 man2/openat2.2

-- 
2.23.0

^ permalink raw reply

* [PATCH RFC 1/3] symlink.7: document magic-links more completely
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191003145542.17490-1-cyphar@cyphar.com>

Traditionally, magic-links have not been a well-understood topic in
Linux. Given the new changes in their semantics (related to the link
mode of trailing magic-links), it seems like a good opportunity to shine
more light on magic-links and their semantics.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man7/path_resolution.7 | 15 +++++++++++++++
 man7/symlink.7         | 39 ++++++++++++++++++++++++++++++---------
 2 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 07664ed8faec..46f25ec4cdfa 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -136,6 +136,21 @@ we are just creating it.
 The details on the treatment
 of the final entry are described in the manual pages of the specific
 system calls.
+.PP
+Since Linux 5.FOO, if the final entry is a "magic-link" (see
+.BR symlink (7)),
+and the user is attempting to
+.BR open (2)
+it, then there is an additional permission-related restriction applied to the
+operation: the requested access mode must not exceed the "link mode" of the
+magic-link (unlike ordinary symlinks, magic-links have their own file mode.)
+For example, if
+.I /proc/[pid]/fd/[num]
+has a link mode of
+.BR 0500 ,
+unprivileged users are not permitted to
+.BR open ()
+the magic-link for writing.
 .SS . and ..
 By convention, every directory has the entries "." and "..",
 which refer to the directory itself and to its parent directory,
diff --git a/man7/symlink.7 b/man7/symlink.7
index 9f5bddd5dc21..33f0ec703acd 100644
--- a/man7/symlink.7
+++ b/man7/symlink.7
@@ -84,6 +84,25 @@ as they are implemented on Linux and other systems,
 are outlined here.
 It is important that site-local applications also conform to these rules,
 so that the user interface can be as consistent as possible.
+.SS Magic-links
+There is a special class of symlink-like objects known as "magic-links" which
+can be found in certain pseudo-filesystems such as
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Unlike normal symlinks, magic-links are not resolved through
+pathname-expansion, but instead act as direct references to the kernel's own
+representation of a file handle. As such, these magic-links allow users to
+access files which cannot be referenced with normal paths (such as unlinked
+files still referenced by a running program.)
+.PP
+Because they can bypass ordinary
+.BR mount_namespaces (7)-based
+restrictions, magic-links have been used as attack vectors in various exploits.
+As such (since Linux 5.FOO), there are additional restrictions placed on the
+re-opening of magic-links (see
+.BR path_resolution (7)
+for more details.)
 .SS Symbolic link ownership, permissions, and timestamps
 The owner and group of an existing symbolic link can be changed
 using
@@ -99,16 +118,18 @@ of a symbolic link can be changed using
 or
 .BR lutimes (3).
 .PP
-On Linux, the permissions of a symbolic link are not used
-in any operations; the permissions are always
-0777 (read, write, and execute for all user categories),
 .\" Linux does not currently implement an lchmod(2).
-and can't be changed.
-(Note that there are some "magic" symbolic links in the
-.I /proc
-directory tree\(emfor example, the
-.IR /proc/[pid]/fd/*
-files\(emthat have different permissions.)
+On Linux, the permissions of an ordinary symbolic link are not used in any
+operations; the permissions are always 0777 (read, write, and execute for all
+user categories), and can't be changed.
+.PP
+However, magic-links do not follow this rule. They can have a non-0777 mode,
+which is used for permission checks when the final
+component of an
+.BR open (2)'s
+path is a magic-link (see
+.BR path_resolution (7).)
+
 .\"
 .\" The
 .\" 4.4BSD
-- 
2.23.0

^ permalink raw reply related

* [PATCH RFC 2/3] open.2: add O_EMPTYPATH documentation
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191003145542.17490-1-cyphar@cyphar.com>

Some of the wording around empty paths in path_resolution(7) also needed
to be reworked since it's now legal (if you pass O_EMPTYPATH).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man2/open.2            | 42 +++++++++++++++++++++++++++++++++++++++++-
 man7/path_resolution.7 | 17 ++++++++++++++++-
 2 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/man2/open.2 b/man2/open.2
index b0f485b41589..7217fe056e5e 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -48,7 +48,7 @@
 .\" FIXME . Apr 08: The next POSIX revision has O_EXEC, O_SEARCH, and
 .\" O_TTYINIT.  Eventually these may need to be documented.  --mtk
 .\"
-.TH OPEN 2 2018-04-30 "Linux" "Linux Programmer's Manual"
+.TH OPEN 2 2019-10-03 "Linux" "Linux Programmer's Manual"
 .SH NAME
 open, openat, creat \- open and possibly create a file
 .SH SYNOPSIS
@@ -421,6 +421,21 @@ was followed by a call to
 .BR fdatasync (2)).
 .IR "See NOTES below" .
 .TP
+.BR O_EMPTYPATH " (since Linux 5.FOO)"
+If \fIpathname\fP is an empty string, re-open the the file descriptor given as
+the \fIdirfd\fP argument to
+.BR openat (2).
+This can be used with both ordinary (file and directory) and \fBO_PATH\fP file
+descriptors, but cannot be used with
+.BR AT_FDCWD
+(or as an argument to plain
+.BR open (2).) When re-opening an \fBO_PATH\fP file descriptor, the same "link
+mode" restrictions apply as with re-opening through
+.BR proc (5)
+(see
+.BR path_resolution "(7) and " symlink (7)
+for more details.)
+.TP
 .B O_EXCL
 Ensure that this call creates the file:
 if this flag is specified in conjunction with
@@ -668,6 +683,13 @@ with
 (or via procfs using
 .BR AT_SYMLINK_FOLLOW )
 even if the file is not a directory.
+You can even "re-open" (or upgrade) an
+.BR O_PATH
+file descriptor by using
+.BR O_EMPTYPATH
+(see the section for
+.BR O_EMPTYPATH
+for more details.)
 .IP *
 Passing the file descriptor to another process via a UNIX domain socket
 (see
@@ -958,6 +980,15 @@ is not allowed.
 (See also
 .BR path_resolution (7).)
 .TP
+.B EBADF
+.I pathname
+was an empty string (and
+.B O_EMPTYPATH
+was passed) with
+.BR open (2)
+(instead of
+.BR openat (2).)
+.TP
 .B EDQUOT
 Where
 .B O_CREAT
@@ -1203,6 +1234,15 @@ The following additional errors can occur for
 .I dirfd
 is not a valid file descriptor.
 .TP
+.B EBADF
+.I pathname
+was an empty string (and
+.B O_EMPTYPATH
+was passed), but the provided
+.I dirfd
+was an invalid file descriptor (or was
+.BR AT_FDCWD .)
+.TP
 .B ENOTDIR
 .I pathname
 is a relative pathname and
diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 46f25ec4cdfa..85dd354e9a93 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -22,7 +22,7 @@
 .\" the source, must acknowledge the copyright and authors of this work.
 .\" %%%LICENSE_END
 .\"
-.TH PATH_RESOLUTION 7 2017-11-26 "Linux" "Linux Programmer's Manual"
+.TH PATH_RESOLUTION 7 2019-10-03 "Linux" "Linux Programmer's Manual"
 .SH NAME
 path_resolution \- how a pathname is resolved to a file
 .SH DESCRIPTION
@@ -198,6 +198,21 @@ successfully.
 Linux returns
 .B ENOENT
 in this case.
+.PP
+As of Linux 5.FOO, an empty path argument can be used to indicate the "re-open"
+an existing file descriptor if
+.B O_EMPTYPATH
+is passed as a flag argument to
+.BR openat (2),
+with the
+.I dfd
+argument indicating which file descriptor to "re-open". This is approximately
+equivalent to opening
+.I /proc/self/fd/$fd
+where
+.I $fd
+is the open file descriptor to be "re-opened".
+
 .SS Permissions
 The permission bits of a file consist of three groups of three bits; see
 .BR chmod (1)
-- 
2.23.0

^ permalink raw reply related

* [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191003145542.17490-1-cyphar@cyphar.com>

Rather than trying to merge the new syscall documentation into open.2
(which would probably result in the man-page being incomprehensible),
instead the new syscall gets its own dedicated page with links between
open(2) and openat2(2) to avoid duplicating information such as the list
of O_* flags or common errors.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man2/open.2            |   5 +
 man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
 man7/path_resolution.7 |  57 ++++--
 3 files changed, 426 insertions(+), 17 deletions(-)
 create mode 100644 man2/openat2.2

diff --git a/man2/open.2 b/man2/open.2
index 7217fe056e5e..a0b43394bbee 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
 ", mode_t " mode );
+.PP
+/* Docuented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
 .fi
 .PP
 .in -4n
@@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
 .B O_DIRECTORY
 is ignored).
 .SH SEE ALSO
+.BR openat2 (2),
 .BR chmod (2),
 .BR chown (2),
 .BR close (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 000000000000..c43c76046243
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,381 @@
+.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.B #include <sys/stat.h>
+.B #include <fcntl.h>
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality. Rather than taking a single
+.I flag
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+seamless future extensions.
+.PP
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+\fBNOTES\fP for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flag " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+    uint32_t flags;              /* open(2)-style O_* flags. */
+    union {
+        uint16_t mode;           /* File mode bits for new file creation. */
+        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
+    };
+    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure, with the
+zero value of the new fields acting as though the extension were not present.
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation. All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+.RE
+
+.I upgrade_mask
+.RS
+Restrict with which
+.I access modes
+the returned
+.B O_PATH
+descriptor may be re-opened (either through
+.B O_EMPTYPATH
+or
+.IR /proc/self/fd/ .)
+This field may only be set to a non-zero value if
+.I flags
+contains
+.BR O_PATH .
+By default, an
+.B O_PATH
+file descriptor of an ordinary file may be re-opened with with any access mode (but an
+.B O_PATH
+file descriptor of a magic-link may only be re-opened with access modes that
+the original magic-link possessed). The full list of
+.I upgrade_mask
+flags is given below.
+.TP
+.B UPGRADE_NOREAD
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for reading (i.e.
+.BR O_RDONLY " or " O_RDWR .)
+.TP
+.B UPGRADE_NOWRITE
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for writing (i.e.
+.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
+.RE
+
+.I resolve
+.RS
+Change how the components of
+.I pathname
+will be resolved (see
+.BR path_resolution (7)
+for background information.) The primary use-case for these flags is to allow
+trusted programs to restrict how un-trusted paths (or paths inside un-trusted
+directories) are resolved. The full list of
+.I resolve
+flags is given below.
+.TP
+.B RESOLVE_NO_XDEV
+Disallow all mount-point crossings during path resolution (including
+all bind-mounts).
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as bind-mounts are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_SYMLINKS
+Disallow all symlink resolution during path resolution. If the trailing
+component is a symlink, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the symlink will be returned. This option implies
+.BR RESOLVE_NO_MAGICLINKS .
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as symlinks are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_MAGICLINKS
+Disallow all magic-link resolution during path resolution. If the trailing
+component is a magic-link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the magic-link will be returned.
+
+Magic-links are symlink-like objects that are most notably found in
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Due to the potential danger of unknowingly opening these magic-links, it may be
+preferable for users to disable their resolution entirely (see
+.BR symlink (7)
+for more details.)
+.TP
+.B RESOLVE_BENEATH
+Do not permit the path resolution to succeed if any component of the resolution
+is not a descendant of the directory indicated by
+.IR dirfd .
+This results in absolute symlinks (and absolute values of
+.IR pathname )
+to be rejected. Magic-link resolution is also not permitted.
+
+.TP
+.B RESOLVE_IN_ROOT
+Temporarily treat
+.I dirfd
+as the root of the filesystem (as though the user called
+.BR chroot (2)
+with
+.IR dirfd
+as the argument.) Absolute symlinks and ".." path components will be scoped to
+.IR dirfd . Magic-link resolution is also not permitted.
+
+However, unlike
+.BR chroot (2)
+(which changes the filesystem root persistently for an entire thread-group),
+.B RESOLVE_IN_ROOT
+allows a program to efficiently restrict path resolution for only certain
+operations. It also has several hardening features (such as not permitting
+magic-link resolution) which
+.BR chroot (2)
+does not.
+.RE
+
+.RE
+
+.PP
+Unlike
+.BR openat (2),
+any unknown flags set in fields of
+.I how
+will result in an error, rather than being ignored. In addition, an error will
+be returned if the value of the
+.IR mode " and " upgrade_mask
+union is non-zero unless:
+.RS
+.IP * 3
+.I flags
+indicates that a new file will be created (it contains
+.BR O_CREAT " or " O_TMPFILE ),
+in which case
+.I mode
+may be any valid file mode.
+.IP *
+.I flags
+contains
+.BR O_PATH ,
+in which case
+.I upgrade_mask
+must only contain valid
+.B UPGRADE_*
+flags.
+.RE
+
+.SH RETURN VALUE
+On success, a new file descriptor is returned. On error, -1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+The set of errors returned by
+.BR openat2 ()
+includes all of the errors returned by
+.BR openat (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+An unknown flag or invalid value was specified in
+.IR how .
+.TP
+.B EINVAL
+.I size
+was smaller than any known version of
+.IR "struct open_how" .
+.TP
+.B E2BIG
+An extension was specified in
+.IR how ,
+which the current kernel does not support (see the "Extensibility" section of
+the \fBNOTES\fP for more detail on how extensions are handled.)
+.TP
+.B EAGAIN
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and the kernel could not ensure that a ".." component didn't escape (due to a
+race condition or potential attack). Callers may choose to retry the
+.BR openat2 ()
+call.
+.TP
+.B EXDEV
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and a path component attempted to escape the root of the resolution.
+
+.TP
+.B EXDEV
+.I resolve
+contains
+.BR RESOLVE_NO_XDEV ,
+and a path component attempted to cross a mount-point.
+
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_SYMLINKS ,
+and one of the path components was a symlink.
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_MAGICLINKS ,
+and one of the path components was a magic-link.
+
+.SH VERSIONS
+.BR openat2 ()
+was added to Linux in kernel 5.FOO.
+
+.SH CONFORMING TO
+This system call is Linux-specific.
+
+The semantics of
+.B RESOLVE_BENEATH
+were modelled after FreeBSD's
+.BR O_BENEATH .
+
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+
+.SS Extensibility
+In order to allow for
+.I struct open_how
+to be extended in future kernel revisions,
+.BR openat2 ()
+requires userspace to specify what sized
+.I struct open_how
+structure they are passing. By providing this information, it is possible for
+.BR openat2 ()
+to provide both forwards- and backwards-compatibility \(em with
+.I size
+acting as an implicit version number (because new extension fields will always
+be appended, the size will always increase.) This extensibility design is very
+similar to other system calls such as
+.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
+
+If we let
+.I usize
+be the size of the structure according to userspace and
+.I ksize
+be the size of the structure which the kernel supports, then there are only
+three cases to consider:
+
+.RS
+.IP * 3
+If
+.IR ksize " equals " usize ,
+then there is no version mismatch and
+.I how
+can be used verbatim.
+.IP *
+If
+.IR ksize " is larger than " usize ,
+then there are some extensions the kernel supports which the userspace program
+is unaware of. Because all extensions must have their zero values be a no-op,
+the kernel treats all of the extension fields not set by userspace to have zero
+values. This provides backwards-compatibility.
+.IP *
+If
+.IR ksize " is smaller than " usize ,
+then there are some extensions which the userspace program is aware of but the
+kernel does not support. Because all extensions must have their zero values be
+a no-op, the kernel can safely ignore the unsupported extension fields if they
+are all-zero. If any unsupported extension fields are non-zero, then an error
+is returned. This provides forwards-compatibility.
+.RE
+
+Therefore, most userspace programs will not need to have any special handling
+of extensions. However, if a userspace program wishes to determine what
+extensions the running kernel supports, they may conduct a binary search on
+.IR size
+(to find the largest value which doesn't produce an error.)
+
+.SH SEE ALSO
+.BR openat (2),
+.BR path_resolution (7),
+.BR symlink (7)
diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 85dd354e9a93..3da3e5b614c8 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
 Some UNIX/Linux system calls have as parameter one or more filenames.
 A filename (or pathname) is resolved as follows.
 .SS Step 1: start of the resolution process
-If the pathname starts with the \(aq/\(aq character,
-the starting lookup directory
-is the root directory of the calling process.
-(A process inherits its
-root directory from its parent.
-Usually this will be the root directory
-of the file hierarchy.
-A process may get a different root directory
-by use of the
+If the pathname starts with the \(aq/\(aq character, the starting lookup
+directory is the root directory of the calling process. (A process inherits its
+root directory from its parent. Usually this will be the root directory of the
+file hierarchy. A process may get a different root directory by use of the
 .BR chroot (2)
-system call.
+system call, or may temporarily use a different root directory by using
+.BR openat2 (2)
+with the
+.B RESOLVE_IN_ROOT
+flag set.
+.PP
 A process may get an entirely private mount namespace in case
 it\(emor one of its ancestors\(emwas started by an invocation of the
 .BR clone (2)
@@ -48,16 +48,24 @@ system call that had the
 flag set.)
 This handles the \(aq/\(aq part of the pathname.
 .PP
-If the pathname does not start with the \(aq/\(aq character, the
-starting lookup directory of the resolution process is the current working
-directory of the process.
-(This is also inherited from the parent.
-It can be changed by use of the
+If the pathname does not start with the \(aq/\(aq character, the starting
+lookup directory of the resolution process is the current working directory of
+the process \(em or in the case of
+.BR openat (2)-style
+syscalls, the
+.I dfd
+argument (or the current working directory if
+.B AT_FDCWD
+is passed as the
+.I dfd
+argumnet). The current working directory is inherited from the parent, and can
+be changed by use of the
 .BR chdir (2)
-system call.)
+syscall.
 .PP
 Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
 Pathnames not starting with a \(aq/\(aq are called relative pathnames.
+
 .SS Step 2: walk along the path
 Set the current lookup directory to the starting lookup directory.
 Now, for each nonfinal component of the pathname, where a component
@@ -124,6 +132,13 @@ the kernel's pathname-resolution code
 was reworked to eliminate the use of recursion,
 so that the only limit that remains is the maximum of 40
 resolutions for the entire pathname.
+.PP
+The resolution of syscalls during this stage can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_SYMLINKS
+flag set.
+
 .SS Step 3: find the final entry
 The lookup of the final component of the pathname goes just like
 that of all other components, as described in the previous step,
@@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
 their conventional meanings, regardless of whether they are
 actually present in the physical filesystem.
 .PP
-One cannot walk down past the root: "/.." is the same as "/".
+One cannot walk up past the root: "/.." is the same as "/".
+
 .SS Mount points
 After a "mount dev path" command, the pathname "path" refers to
 the root of the filesystem hierarchy on the device "dev", and no
@@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
 One can walk out of a mounted filesystem: "path/.." refers to
 the parent directory of "path",
 outside of the filesystem hierarchy on "dev".
+.PP
+Mount-point crossings can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_XDEV
+flag set (though note that this also restricts bind-mount crossings).
+
 .SS Trailing slashes
 If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
 component as in Step 2: it has to exist and resolve to a directory.
-- 
2.23.0

^ permalink raw reply related

* [PATCH RFC 3/3] openat2.2: document new syscall
From: Aleksa Sarai @ 2019-10-03 14:55 UTC (permalink / raw)
  To: Al Viro, Michael Kerrisk
  Cc: Aleksa Sarai, Christian Brauner, Aleksa Sarai, linux-man,
	linux-api, linux-kernel
In-Reply-To: <20191003145542.17490-1-cyphar@cyphar.com>

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man2/open.2            |   5 +
 man2/openat2.2         | 381 +++++++++++++++++++++++++++++++++++++++++
 man7/path_resolution.7 |  57 ++++--
 3 files changed, 426 insertions(+), 17 deletions(-)
 create mode 100644 man2/openat2.2

diff --git a/man2/open.2 b/man2/open.2
index 7217fe056e5e..a0b43394bbee 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
 .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
 ", mode_t " mode );
+.PP
+/* Docuented separately, in \fBopenat2\fP(2). */
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
 .fi
 .PP
 .in -4n
@@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
 .B O_DIRECTORY
 is ignored).
 .SH SEE ALSO
+.BR openat2 (2),
 .BR chmod (2),
 .BR chown (2),
 .BR close (2),
diff --git a/man2/openat2.2 b/man2/openat2.2
new file mode 100644
index 000000000000..c43c76046243
--- /dev/null
+++ b/man2/openat2.2
@@ -0,0 +1,381 @@
+.\" Copyright (C) 2019 Aleksa Sarai <cyphar@cyphar.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
+.SH NAME
+openat2 \- open and possibly create a file (extended)
+.SH SYNOPSIS
+.nf
+.B #include <sys/types.h>
+.B #include <sys/stat.h>
+.B #include <fcntl.h>
+.PP
+.BI "int openat2(int " dirfd ", const char *" pathname ", \
+const struct open_how *" how ", size_t " size ");
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR openat2 ()
+system call is an extension of
+.BR openat (2)
+and provides a superset of its functionality. Rather than taking a single
+.I flag
+argument, an extensible structure (\fIhow\fP) is passed instead to allow for
+seamless future extensions.
+.PP
+.I size
+must be set to
+.IR "sizeof(struct open_how)" ,
+to facilitate future extensions (see the "Extensibility" section of the
+\fBNOTES\fP for more detail on how extensions are handled.)
+
+.SS The open_how structure
+The following structure indicates how
+.I pathname
+should be opened, and acts as a superset of the
+.IR flag " and " mode
+arguments to
+.BR openat (2).
+.PP
+.in +4n
+.EX
+struct open_how {
+    uint32_t flags;              /* open(2)-style O_* flags. */
+    union {
+        uint16_t mode;           /* File mode bits for new file creation. */
+        uint16_t upgrade_mask;   /* Restrict how O_PATHs may be re-opened. */
+    };
+    uint32_t resolve;            /* RESOLVE_* path-resolution flags. */
+};
+.EE
+.in
+.PP
+Any future extensions to
+.BR openat2 ()
+will be implemented as new fields appended to the above structure, with the
+zero value of the new fields acting as though the extension were not present.
+.PP
+The meaning of each field is as follows:
+.RS
+
+.I flags
+.RS
+The file creation and status flags to use for this operation. All of the
+.B O_*
+flags defined for
+.BR openat (2)
+are valid
+.BR openat2 ()
+flag values.
+.RE
+
+.I upgrade_mask
+.RS
+Restrict with which
+.I access modes
+the returned
+.B O_PATH
+descriptor may be re-opened (either through
+.B O_EMPTYPATH
+or
+.IR /proc/self/fd/ .)
+This field may only be set to a non-zero value if
+.I flags
+contains
+.BR O_PATH .
+By default, an
+.B O_PATH
+file descriptor of an ordinary file may be re-opened with with any access mode (but an
+.B O_PATH
+file descriptor of a magic-link may only be re-opened with access modes that
+the original magic-link possessed). The full list of
+.I upgrade_mask
+flags is given below.
+.TP
+.B UPGRADE_NOREAD
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for reading (i.e.
+.BR O_RDONLY " or " O_RDWR .)
+.TP
+.B UPGRADE_NOWRITE
+Do not permit the
+.B O_PATH
+file descriptor to be re-opened for writing (i.e.
+.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
+.RE
+
+.I resolve
+.RS
+Change how the components of
+.I pathname
+will be resolved (see
+.BR path_resolution (7)
+for background information.) The primary use-case for these flags is to allow
+trusted programs to restrict how un-trusted paths (or paths inside un-trusted
+directories) are resolved. The full list of
+.I resolve
+flags is given below.
+.TP
+.B RESOLVE_NO_XDEV
+Disallow all mount-point crossings during path resolution (including
+all bind-mounts).
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as bind-mounts are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_SYMLINKS
+Disallow all symlink resolution during path resolution. If the trailing
+component is a symlink, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the symlink will be returned. This option implies
+.BR RESOLVE_NO_MAGICLINKS .
+
+Users of this flag are encouraged to make its use configurable (unless it is
+used for a specific security purpose), as symlinks are very widely used by
+end-users and thus enabling this flag globally may result in spurious errors on
+some systems.
+.TP
+.B RESOLVE_NO_MAGICLINKS
+Disallow all magic-link resolution during path resolution. If the trailing
+component is a magic-link, and
+.I flags
+contains both
+.BR O_PATH " and " O_NOFOLLOW ","
+then an
+.B O_PATH
+file descriptor referencing the magic-link will be returned.
+
+Magic-links are symlink-like objects that are most notably found in
+.BR proc (5)
+(examples include
+.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
+Due to the potential danger of unknowingly opening these magic-links, it may be
+preferable for users to disable their resolution entirely (see
+.BR symlink (7)
+for more details.)
+.TP
+.B RESOLVE_BENEATH
+Do not permit the path resolution to succeed if any component of the resolution
+is not a descendant of the directory indicated by
+.IR dirfd .
+This results in absolute symlinks (and absolute values of
+.IR pathname )
+to be rejected. Magic-link resolution is also not permitted.
+
+.TP
+.B RESOLVE_IN_ROOT
+Temporarily treat
+.I dirfd
+as the root of the filesystem (as though the user called
+.BR chroot (2)
+with
+.IR dirfd
+as the argument.) Absolute symlinks and ".." path components will be scoped to
+.IR dirfd . Magic-link resolution is also not permitted.
+
+However, unlike
+.BR chroot (2)
+(which changes the filesystem root persistently for an entire thread-group),
+.B RESOLVE_IN_ROOT
+allows a program to efficiently restrict path resolution for only certain
+operations. It also has several hardening features (such as not permitting
+magic-link resolution) which
+.BR chroot (2)
+does not.
+.RE
+
+.RE
+
+.PP
+Unlike
+.BR openat (2),
+any unknown flags set in fields of
+.I how
+will result in an error, rather than being ignored. In addition, an error will
+be returned if the value of the
+.IR mode " and " upgrade_mask
+union is non-zero unless:
+.RS
+.IP * 3
+.I flags
+indicates that a new file will be created (it contains
+.BR O_CREAT " or " O_TMPFILE ),
+in which case
+.I mode
+may be any valid file mode.
+.IP *
+.I flags
+contains
+.BR O_PATH ,
+in which case
+.I upgrade_mask
+must only contain valid
+.B UPGRADE_*
+flags.
+.RE
+
+.SH RETURN VALUE
+On success, a new file descriptor is returned. On error, -1 is returned, and
+.I errno
+is set appropriately.
+
+.SH ERRORS
+The set of errors returned by
+.BR openat2 ()
+includes all of the errors returned by
+.BR openat (2),
+as well as the following additional errors:
+.TP
+.B EINVAL
+An unknown flag or invalid value was specified in
+.IR how .
+.TP
+.B EINVAL
+.I size
+was smaller than any known version of
+.IR "struct open_how" .
+.TP
+.B E2BIG
+An extension was specified in
+.IR how ,
+which the current kernel does not support (see the "Extensibility" section of
+the \fBNOTES\fP for more detail on how extensions are handled.)
+.TP
+.B EAGAIN
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and the kernel could not ensure that a ".." component didn't escape (due to a
+race condition or potential attack). Callers may choose to retry the
+.BR openat2 ()
+call.
+.TP
+.B EXDEV
+.I resolve
+contains either
+.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
+and a path component attempted to escape the root of the resolution.
+
+.TP
+.B EXDEV
+.I resolve
+contains
+.BR RESOLVE_NO_XDEV ,
+and a path component attempted to cross a mount-point.
+
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_SYMLINKS ,
+and one of the path components was a symlink.
+.TP
+.B ELOOP
+.I resolve
+contains
+.BR RESOLVE_NO_MAGICLINKS ,
+and one of the path components was a magic-link.
+
+.SH VERSIONS
+.BR openat2 ()
+was added to Linux in kernel 5.FOO.
+
+.SH CONFORMING TO
+This system call is Linux-specific.
+
+The semantics of
+.B RESOLVE_BENEATH
+were modelled after FreeBSD's
+.BR O_BENEATH .
+
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+
+.SS Extensibility
+In order to allow for
+.I struct open_how
+to be extended in future kernel revisions,
+.BR openat2 ()
+requires userspace to specify what sized
+.I struct open_how
+structure they are passing. By providing this information, it is possible for
+.BR openat2 ()
+to provide both forwards- and backwards-compatibility \(em with
+.I size
+acting as an implicit version number (because new extension fields will always
+be appended, the size will always increase.) This extensibility design is very
+similar to other system calls such as
+.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
+
+If we let
+.I usize
+be the size of the structure according to userspace and
+.I ksize
+be the size of the structure which the kernel supports, then there are only
+three cases to consider:
+
+.RS
+.IP * 3
+If
+.IR ksize " equals " usize ,
+then there is no version mismatch and
+.I how
+can be used verbatim.
+.IP *
+If
+.IR ksize " is larger than " usize ,
+then there are some extensions the kernel supports which the userspace program
+is unaware of. Because all extensions must have their zero values be a no-op,
+the kernel treats all of the extension fields not set by userspace to have zero
+values. This provides backwards-compatibility.
+.IP *
+If
+.IR ksize " is smaller than " usize ,
+then there are some extensions which the userspace program is aware of but the
+kernel does not support. Because all extensions must have their zero values be
+a no-op, the kernel can safely ignore the unsupported extension fields if they
+are all-zero. If any unsupported extension fields are non-zero, then an error
+is returned. This provides forwards-compatibility.
+.RE
+
+Therefore, most userspace programs will not need to have any special handling
+of extensions. However, if a userspace program wishes to determine what
+extensions the running kernel supports, they may conduct a binary search on
+.IR size
+(to find the largest value which doesn't produce an error.)
+
+.SH SEE ALSO
+.BR openat (2),
+.BR path_resolution (7),
+.BR symlink (7)
diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
index 85dd354e9a93..3da3e5b614c8 100644
--- a/man7/path_resolution.7
+++ b/man7/path_resolution.7
@@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
 Some UNIX/Linux system calls have as parameter one or more filenames.
 A filename (or pathname) is resolved as follows.
 .SS Step 1: start of the resolution process
-If the pathname starts with the \(aq/\(aq character,
-the starting lookup directory
-is the root directory of the calling process.
-(A process inherits its
-root directory from its parent.
-Usually this will be the root directory
-of the file hierarchy.
-A process may get a different root directory
-by use of the
+If the pathname starts with the \(aq/\(aq character, the starting lookup
+directory is the root directory of the calling process. (A process inherits its
+root directory from its parent. Usually this will be the root directory of the
+file hierarchy. A process may get a different root directory by use of the
 .BR chroot (2)
-system call.
+system call, or may temporarily use a different root directory by using
+.BR openat2 (2)
+with the
+.B RESOLVE_IN_ROOT
+flag set.
+.PP
 A process may get an entirely private mount namespace in case
 it\(emor one of its ancestors\(emwas started by an invocation of the
 .BR clone (2)
@@ -48,16 +48,24 @@ system call that had the
 flag set.)
 This handles the \(aq/\(aq part of the pathname.
 .PP
-If the pathname does not start with the \(aq/\(aq character, the
-starting lookup directory of the resolution process is the current working
-directory of the process.
-(This is also inherited from the parent.
-It can be changed by use of the
+If the pathname does not start with the \(aq/\(aq character, the starting
+lookup directory of the resolution process is the current working directory of
+the process \(em or in the case of
+.BR openat (2)-style
+syscalls, the
+.I dfd
+argument (or the current working directory if
+.B AT_FDCWD
+is passed as the
+.I dfd
+argumnet). The current working directory is inherited from the parent, and can
+be changed by use of the
 .BR chdir (2)
-system call.)
+syscall.
 .PP
 Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
 Pathnames not starting with a \(aq/\(aq are called relative pathnames.
+
 .SS Step 2: walk along the path
 Set the current lookup directory to the starting lookup directory.
 Now, for each nonfinal component of the pathname, where a component
@@ -124,6 +132,13 @@ the kernel's pathname-resolution code
 was reworked to eliminate the use of recursion,
 so that the only limit that remains is the maximum of 40
 resolutions for the entire pathname.
+.PP
+The resolution of syscalls during this stage can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_SYMLINKS
+flag set.
+
 .SS Step 3: find the final entry
 The lookup of the final component of the pathname goes just like
 that of all other components, as described in the previous step,
@@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
 their conventional meanings, regardless of whether they are
 actually present in the physical filesystem.
 .PP
-One cannot walk down past the root: "/.." is the same as "/".
+One cannot walk up past the root: "/.." is the same as "/".
+
 .SS Mount points
 After a "mount dev path" command, the pathname "path" refers to
 the root of the filesystem hierarchy on the device "dev", and no
@@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
 One can walk out of a mounted filesystem: "path/.." refers to
 the parent directory of "path",
 outside of the filesystem hierarchy on "dev".
+.PP
+Mount-point crossings can be blocked by using
+.BR openat2 (2),
+with the
+.B RESOLVE_NO_XDEV
+flag set (though note that this also restricts bind-mount crossings).
+
 .SS Trailing slashes
 If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
 component as in Step 2: it has to exist and resolve to a directory.
-- 
2.23.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox