Linux Security Modules development

Linux Security Modules development
 help / color / mirror / Atom feed

* Re: Why add the general notification queue and its sources
From: David Howells @ 2019-09-05 21:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wjcsxQ8QB_v=cwBQw4pkJg7pp-bBsdWyPivFO_OeF-y+g@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Also, what is the security model here? Open a special character
> device, and you get access to random notifications from random
> sources?
>
> That makes no sense. Do they have the same security permissions?

Sigh.  It doesn't work like that.  I tried to describe this in the manpages I
referred to in the cover note.  Obviously I didn't do a good enough job.  Let
me try and explain the general workings and the security model here.

 (1) /dev/watch_queue just implements locked-in-memory buffers.  It gets you
     no events by simply opening it.

     Each time you open it you get your own private buffer.  Buffers are not
     shares.  Creation of buffers is limited by ENFILE, EMFILE and
     RLIMIT_MEMLOCK.

 (2) A buffer is implemented as a pollable ring buffer, with the head pointer
     belonging to the kernel and the tail pointer belonging to userspace.
     Userspace mmaps the buffer.

     The kernel *only ever* reads the head and tail pointer from a buffer; it
     never reads anything else.

     When it wants to post a message to a buffer, the kernel reads the
     pointers and then does one of three things:

	(a) If the pointers were incoherent it drops the message.

	(b) If the buffer was full the kernel writes a flag to indicate this
	    and drops the message.

	(c) Otherwise, the kernel writes a message and maybe padding at the
     	    place(s) it expects and writes the head pointer.  If userspace was
     	    busy trashing the place, that should not cause a problem for the
     	    kernel.

     The buffer pointers are expected to run to the end and wrap naturally;
     they're only masked off at the point of actually accessing the buffer.

 (3) You connect event sources to your buffer, e.g.:

	fd = open("/dev/watch_queue", ...);
	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, ...);

     or:

	watch_mount(AT_FDCWD, "/net", 0, fd, ...);

     Security is checked at the point of connection to make sure you have
     permission to access that source.  You have to have View permission on a
     key/keyring for key events, for example, and you have to have execute
     permission on a directory for mount events.

     The LSM gets a look-in too: Smack checks you have read permission on a
     key for example.

 (4) You can connect multiple sources of different types to your buffer and a
     source can be connected to multiple buffers at a time.

 (5) Security is checked when an event is delivered to make sure the triggerer
     of the event has permission to give you that event.  Smack requires that
     the triggerer has write permission on the opener of the buffer for
     example.

 (6) poll() signals POLLIN|POLLRDNORM if there is stuff in the buffer and
     POLLERR if the pointers are incoherent.

David

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-05 22:08 UTC (permalink / raw)
  To: David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <5396.1567719164@warthog.procyon.org.uk>

On Thu, Sep 5, 2019 at 2:32 PM David Howells <dhowells@redhat.com> wrote:
>
>  (1) /dev/watch_queue just implements locked-in-memory buffers.  It gets you
>      no events by simply opening it.

Cool. In-memory buffers.

But I know - we *have* one of those. There's already a system call for
it, and has been forever. One that we then extended to allow people to
change the buffer size, and do a lot of other things with.

It's called "pipe()". And you can give the writing side to other user
space processes too, in case you are running an older kernel that
didn't have some "event pipe support". It comes with resource
management, because people already use those things.

If you want to make a message protocol on top of it, it has cool
atomicity guarantees for any message size less than PIPE_BUF, but to
make things simple, maybe just say "fixed record sizes of 64 bytes" or
something like that for events.

Then you can use them from things like perl scripts, not just magical
C programs.

Why do we need a new kind of super-fancy high-speed thing for event reporting?

If you have *so* many events that pipe handling is a performance
issue, you have something seriously wrong going on.

So no. I'm not interested in a magical memory-mapped pipe that is
actually more limited than the real thing.

                  Linus

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: David Howells @ 2019-09-05 23:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wgbCXea1a9OTWgMMvcsCGGiNiPp+ty-edZrBWn63NCYdw@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But I know - we *have* one of those. There's already a system call for
> it, and has been forever. One that we then extended to allow people to
> change the buffer size, and do a lot of other things with.
> 
> It's called "pipe()". And you can give the writing side to other user
> space processes too, in case you are running an older kernel that
> didn't have some "event pipe support". It comes with resource
> management, because people already use those things.

Can you write into a pipe from softirq context and/or with spinlocks held
and/or with the RCU read lock held?  That is a requirement.  Another is that
messages get inserted whole or not at all (or if they are truncated, the size
field gets updated).

Since one end would certainly be attached to an fd, it looks on the face of it
that writing into the pipe would require taking pipe->mutex.

David

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-06  0:07 UTC (permalink / raw)
  To: David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <14883.1567725508@warthog.procyon.org.uk>

On Thu, Sep 5, 2019 at 4:18 PM David Howells <dhowells@redhat.com> wrote:
>
> Can you write into a pipe from softirq context and/or with spinlocks held
> and/or with the RCU read lock held?  That is a requirement.  Another is that
> messages get inserted whole or not at all (or if they are truncated, the size
> field gets updated).

Right now we use a mutex for the buffer locking, so no, pipe buffers
are not irq-safe or atomic. That's due to the whole "we may block on
data from user space" when doing a write.

HOWEVER.

Pipes actually have buffers on two different levels: there's the
actual data buffers themselves (each described by a "struct
pipe_buffer"), and there's the circular queue of them (the
"pipe->buf[]" array, with pipe->curbuf/nrbufs) that points to
individual data buffers.

And we could easily separate out that data buffer management. Right
now it's not really all that separated: people just do things like

        int newbuf = (pipe->curbuf + bufs) & (pipe->buffers-1);
        struct pipe_buffer *buf = pipe->bufs + newbuf;
...
        pipe->nrbufs++;

to add a buffer into that circular array of buffers, but _that_ part
could be made separate.  It's just all protected by the pipe mutex
right now, so it has never been an issue.

And yes, atomicity of writes has actually been an integral part of
pipes since forever. It's actually the only unambiguous atomicity that
POSIX guarantees. It only holds for writes to pipes() of less than
PIPE_BUF blocks, but that's 4096 on Linux.

> Since one end would certainly be attached to an fd, it looks on the face of it
> that writing into the pipe would require taking pipe->mutex.

That's how the normal synchronization is done, yes. And changing that
in general would be pretty painful. For example, two concurrent
user-space writers might take page faults and just generally be
painful, and the pipe locking needs to serialize that.

So the mutex couldn't go away from pipes in general - it would remain
for read/write/splice mutual exclusion (and it's not just the data it
protects, it's the reader/writer logic for EPIPE etc).

But the low-level pipe->bufs[] handling is another issue entirely.
Even when a user space writer copies things from user space, it does
so into a pre-allocated buffer that is then attached to the list of
buffers somewhat separately (there's a magical special case where you
can re-use a buffer that is marked as "I can be reused" and append
into an already allocated buffer).

And adding new buffers *could* be done with it's own separate locking.
If you have a blocking writer (ie a user space data source), that
would still take the pipe mutex, and it would delay the user space
readers (because the readers also need the mutex), but it should not
be all that hard to just make the whole "curbuf/nrbufs" handling use
its own locking (maybe even some lockless atomics and cmpxchg).

So a kernel writer could "insert" a "struct pipe_buffer" atomically,
and wake up the reader atomically. No need for the other complexity
that is protected by the mutex.

The buggest problem is perhaps that the number of pipe buffers per
pipe is fairly limited by default. PIPE_DEF_BUFFERS is 16, and if we'd
insert using the ->bufs[] array, that would be the limit of "number of
messages". But each message could be any size (we've historically
limited pipe buffers to one page each, but that limit isn't all that
hard. You could put more data in there).

The number of pipe buffers _is_ dynamic, so the above PIPE_DEF_BUFFERS
isn't a hard limit, but it would be the default.

Would it be entirely trivial to do all the above? No. But it's
*literally* just finding the places that work with pipe->curbuf/nrbufs
and making them use atomic updates. You'd find all the places by just
renaming them (and making them atomic or whatever) and the compiler
will tell you "this area needs fixing".

We've actually used pipes for messages before: autofs uses a magic
packetized pipe buffer thing. It didn't need any extra atomicity,
though, so it stil all worked with the regular pipe->mutex thing.

And there is a big advantage from using pipes. They really would work
with almost anything. You could even mix-and-match "data generated by
kernel" and "data done by 'write()' or 'splice()' by a user process".

NOTE! I'm not at all saying that pipes are perfect. You'll find people
who swear by sockets instead. They have their own advantages (and
disadvantages). Most people who do packet-based stuff tend to prefer
sockets, because those have standard packet-based models (Linux pipes
have that packet mode too, but it's certainly not standard, and I'm
not even sure we ever exposed it to user space - it could be that it's
only used by the autofs daemon).

I have a soft spot for pipes, just because I think they are simpler
than sockets. But that soft spot might be misplaced.

                   Linus

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: David Howells @ 2019-09-06 10:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wjt2Eb+yEDOcQwCa0SrZ4cWu967OtQG8Vz21c=n5ZP1Nw@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But it's *literally* just finding the places that work with
> pipe->curbuf/nrbufs and making them use atomic updates.

No.  It really isn't.  That's two variables that describe the occupied section
of the buffer.  Unless you have something like a 68020 with CAS2, or put them
next to each other so you can use CMPXCHG8, you can't do that.

They need converting to head/tail pointers first.

> They really would work with almost anything. You could even mix-and-match
> "data generated by kernel" and "data done by 'write()' or 'splice()' by a
> user process".

Imagine that userspace writes a large message and takes the mutex.  At the
same time something in softirq context decides *it* wants to write a message -
it can't take the mutex and it can't wait, so the userspace write would have
to cause the kernel message to be dropped.

What I would have to do is make a write to a notification pipe go through
post_notification() and limit the size to the maximum for a single message.

Much easier to simply suppress writes and splices on pipes that have been set
up to be notification queues - at least for now.

David

^ permalink raw reply

* Re: [PATCH] tpm_tis_core: Set TPM_CHIP_FLAG_IRQ before probing for interrupts
From: Jan Lübbe @ 2019-09-06 12:37 UTC (permalink / raw)
  To: Stefan Berger, jarkko.sakkinen
  Cc: linux-security-module, linux-kernel, linux-integrity,
	Stefan Berger, linux-stable
In-Reply-To: <20190830000906.2369009-1-stefanb@linux.vnet.ibm.com>

On Thu, 2019-08-29 at 20:09 -0400, Stefan Berger wrote:
> From: Stefan Berger <stefanb@linux.ibm.com>
> 
> The tpm_tis_core has to set the TPM_CHIP_FLAG_IRQ before probing for
> interrupts since there is no other place in the code that would set
> it.

Thanks for this patch! I tested it to fix a
  [   13.198129] tpm tpm0: [Firmware Bug]: TPM interrupt not working, polling instead
we've been seeing, but received this scheduling error:
  [   13.241831] tpm_tis_spi spi0.0: 2.0 TPM (device-id 0x1B, rev-id 16)
  [   13.249450] BUG: scheduling while atomic: swapper/0/0/0x00010002
  [   13.255537] Modules linked in:
  [   13.258669] Preemption disabled at:
  [   13.258686] [<c07d10ec>] schedule_preempt_disabled+0x1c/0x20
  [   13.268050] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.21-20190905-1-development #1
  [   13.276106] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
  [   13.282788] [<c010fe00>] (unwind_backtrace) from [<c010cc04>] (show_stack+0x10/0x14)
  [   13.290621] [<c010cc04>] (show_stack) from [<c07b7bc8>] (dump_stack+0x78/0x8c)
  [   13.297994] [<c07b7bc8>] (dump_stack) from [<c0147f9c>] (__schedule_bug+0x88/0xd8)
  [   13.305711] [<c0147f9c>] (__schedule_bug) from [<c07d0948>] (__schedule+0x5b0/0x750)
  [   13.313535] [<c07d0948>] (__schedule) from [<c07d0b30>] (schedule+0x48/0xa0)
  [   13.320731] [<c07d0b30>] (schedule) from [<c07d452c>] (schedule_timeout+0x170/0x2a8)
  [   13.328622] [<c07d452c>] (schedule_timeout) from [<c07d1818>] (wait_for_common+0xa4/0x160)
  [   13.336974] [<c07d1818>] (wait_for_common) from [<c05126ec>] (spi_imx_transfer+0xc4/0x5c4)
  [   13.345385] [<c05126ec>] (spi_imx_transfer) from [<c0511644>] (spi_bitbang_transfer_one+0x50/0xa0)
  [   13.354496] [<c0511644>] (spi_bitbang_transfer_one) from [<c050a4a4>] (spi_transfer_one_message+0x18c/0x3cc)
  [   13.364474] [<c050a4a4>] (spi_transfer_one_message) from [<c050aa34>] (__spi_pump_messages+0x350/0x474)
  [   13.374012] [<c050aa34>] (__spi_pump_messages) from [<c050acfc>] (__spi_sync+0x198/0x1a0)
  [   13.382275] [<c050acfc>] (__spi_sync) from [<c04972d4>] (tpm_tis_spi_transfer+0x124/0x300)
  [   13.390683] [<c04972d4>] (tpm_tis_spi_transfer) from [<c04974e0>] (tpm_tis_spi_read_bytes+0x14/0x1c)
  [   13.399957] [<c04974e0>] (tpm_tis_spi_read_bytes) from [<c0497088>] (tpm_tis_spi_read32+0x30/0x58)
  [   13.409057] [<c0497088>] (tpm_tis_spi_read32) from [<c0496614>] (tis_int_handler+0x40/0x13c)
  [   13.417586] [<c0496614>] (tis_int_handler) from [<c01708e4>] (__handle_irq_event_percpu+0x50/0x11c)
  [   13.426783] [<c01708e4>] (__handle_irq_event_percpu) from [<c01709dc>] (handle_irq_event_percpu+0x2c/0x7c)
  [   13.436582] [<c01709dc>] (handle_irq_event_percpu) from [<c0170a64>] (handle_irq_event+0x38/0x5c)
  [   13.445603] [<c0170a64>] (handle_irq_event) from [<c0174b34>] (handle_level_irq+0xcc/0x170)
  [   13.454098] [<c0174b34>] (handle_level_irq) from [<c016fac8>] (generic_handle_irq+0x24/0x34)
  [   13.462625] [<c016fac8>] (generic_handle_irq) from [<c0453ce0>] (mxc_gpio_irq_handler+0x48/0x164)
  [   13.471643] [<c0453ce0>] (mxc_gpio_irq_handler) from [<c0453e5c>] (mx3_gpio_irq_handler+0x60/0xac)
  [   13.480748] [<c0453e5c>] (mx3_gpio_irq_handler) from [<c016fac8>] (generic_handle_irq+0x24/0x34) 
  [   13.489678] [<c016fac8>] (generic_handle_irq) from [<c01700a8>] (__handle_domain_irq+0x7c/0xec)  
  [   13.498459] [<c01700a8>] (__handle_domain_irq) from [<c0443744>] (gic_handle_irq+0x4c/0x90)
  [   13.506956] [<c0443744>] (gic_handle_irq) from [<c0101a8c>] (__irq_svc+0x6c/0xa8)
  [   13.514577] Exception stack(0xc0b01f30 to 0xc0b01f78)
  [   13.519708] 1f20:                                     00000000 00009e08 dfebb360 c0118e20
  [   13.528029] 1f40: ffffe000 c0b09cf0 c0b09d30 00000001 c0b3ff37 c08d0e78 00000001 c0a38a38
  [   13.536346] 1f60: 00000000 c0b01f80 c010988c c0109890 60010013 ffffffff
  [   13.543047] [<c0101a8c>] (__irq_svc) from [<c0109890>] (arch_cpu_idle+0x38/0x3c)
  [   13.550590] [<c0109890>] (arch_cpu_idle) from [<c014f9bc>] (do_idle+0xe0/0x150)
  [   13.558046] [<c014f9bc>] (do_idle) from [<c014fcf4>] (cpu_startup_entry+0x18/0x1c)
  [   13.565699] [<c014fcf4>] (cpu_startup_entry) from [<c0a00dd0>] (start_kernel+0x450/0x484)
  [   13.574017] [<c0a00dd0>] (start_kernel) from [<00000000>] (  (null))
  [   13.580513] bad: scheduling from the idle thread!

This is due to the SPI accesses performed by tis_int_handler (which
will sleep). Switching to devm_request_threaded_irq fixes this and
leads to a successful IRQ probe.

But: It seems that the IRQ is not acked correctly, as the interrupt
line stays low. I suspect this is because the tpm_chip_stop from
http://git.infradead.org/users/jjs/linux-tpmdd.git/commitdiff/9b558deab2c5d7dc23d5f7a4064892ede482ad32
happens before the threaded handler runs. I'm currently unable to
verify that though, as my build machine's disk just died. :/

Regards,
Jan

> Cc: linux-stable@vger.kernel.org
> Fixes: 570a36097f30 ("tpm: drop 'irq' from struct
> tpm_vendor_specific")
> Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
> ---
>  drivers/char/tpm/tpm_tis_core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/char/tpm/tpm_tis_core.c
> b/drivers/char/tpm/tpm_tis_core.c
> index ffa9048d8f6c..270f43acbb77 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -981,6 +981,7 @@ int tpm_tis_core_init(struct device *dev, struct
> tpm_tis_data *priv, int irq,
>  		}
>  
>  		tpm_chip_start(chip);
> +		chip->flags |= TPM_CHIP_FLAG_IRQ;
>  		if (irq) {
>  			tpm_tis_probe_irq_single(chip, intmask,
> IRQF_SHARED,
>  						 irq);
-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |


^ permalink raw reply

* Re: [PATCH v8 00/28] LSM: Module stacking for AppArmor
From: Stephen Smalley @ 2019-09-06 13:46 UTC (permalink / raw)
  To: Casey Schaufler, jmorris, linux-security-module, selinux
  Cc: casey.schaufler, keescook, john.johansen, penguin-kernel, paul
In-Reply-To: <13e1badc-12d2-8ae1-2b98-d9a6a80a615d@schaufler-ca.com>

On 9/4/19 3:13 PM, Casey Schaufler wrote:
> On 8/29/2019 4:29 PM, Casey Schaufler wrote:
>> This patchset provides the changes required for
>> the AppArmor security module to stack safely with any other.
> 
> What more needs doing before we can land this airship?
> The only outstanding issue I know about is the uses of
> current_security() in SELinux, and Stephen has a patch
> for that. I could see rebasing it, as security-next is
> a little bit behind the times, but that's just merge work.
> We could bikeshed the "compound" context format if we
> have to, and argue about what capabilities are required
> for /proc/self/attr/display if we must.
> 
> If there's objection to the number of patches (28 > 15)
> there are reasonable ways to squash them. If there's objection
> to the amount of change there are rational stages.
> 
> I'm heading off on holiday, and will be limiting my
> screen time for the next couple weeks. Think about it.

My $0.02, roughly in order from what I think will be the largest 
obstacles to merging to the smallest:

1) Changes outside of the security subsystem:
   1a) Userspace API changes e.g. /proc/self/attr/context, 
/proc/self/attr/display, and SO_PEERCONTEXT, likely needs vetting by 
linux-api and the latter by netdev, and audit record changes, needs 
vetting by linux-audit.
   1b) lsmblob changes to core kernel data structures and code, likely 
needs vetting by relevant subsystem maintainers e.g. netdev for 
unix_skb_parms and scm_cookie changes, linux-audit for audit changes, ...
    Whether or not these changes are deemed acceptable by their 
respective maintainers is key to whether this patch set is viable 
irrespective of what any LSM or security module maintainer may think.

2) Controlling setting of /proc/pid/attr/display:
    CAP_MAC_ADMIN isn't an appropriate check for this operation for 
SELinux. The security modules need to be able to provide their own 
access control logic for setting display just as with the other 
attributes, which they could do if you just called the hooks prior to 
setting the display.  At present, in SELinux, CAP_MAC_ADMIN is used only 
to control what processes can get or set file security labels that are 
unknown to the currently loaded policy, as a means of supporting package 
managers that might set down file security labels before loading new 
policy modules or OS image generators that might be generating a 
filesystem image with a policy that differs from the build host OS 
policy. The set of processes that need that ability are quite different 
from the set of processes that will need to set the display, and we do 
not want to conflate them in policy.  I think you are worried that if 
given the opportunity, the security module authors will just block 
setting the display to any value other than their own or setting it at 
all.  How about we just agree to not do that, and if you see a patch 
doing that you can NAK it?

3) Code review and testing by at least the major LSM maintainers:
   At least an Acked-by and preferably Reviewed-by and Tested-by lines 
from the major security module maintainers, not only for changes that 
directly touch the code of their module but all of the patches in the 
series.  Because any of these changes could break or undermine the 
security of any of the security modules.
   Testing also needs to cover new functionality that wasn't previously 
possible, e.g. the new userspace APIs, actual stacking of AA with other 
modules, etc.  So merely running existing testsuites isn't going to suffice.

> 
> Thank you.
> 
>>
>> v8: Incorporate feedback from v7
>>      - Minor clean-up in display value management
>>      - refactor "compound" context creation to use a common
>>        append_ctx() function.
>>
>> v7: Incorporate feedback from v6
>>      - Make setting the display a privileged operation. The
>>        availability of compound contexts reduces the need for
>>        setting the display.
>>
>> v6: Incorporate feedback from v5
>>      - Add subj_<lsm>= and obj_<lsm>= fields to audit records
>>      - Add /proc/.../attr/context to get the full context in
>>        lsmname\0value\0... format as suggested by Simon McVittie
>>      - Add SO_PEERCONTEXT for getsockopt() to get the full context
>>        in the same format, also suggested by Simon McVittie.
>>      - Add /sys/kernel/security/lsm_display_default to provide
>>        the display default value.
>>
>> v5: Incorporate feedback from v4
>>      - Initialize the lsmcontext in security_secid_to_secctx()
>>      - Clear the lsmcontext in all security_release_secctx() cases
>>      - Don't use the "display" on strictly internal context
>>        interfaces.
>>      - The SELinux binder hooks check for cases where the context
>>        "display" isn't compatible with SELinux.
>>
>> v4: Incorporate feedback from v3
>>      - Mark new lsm_<blob>_alloc functions static
>>      - Replace the lsm and slot fields of the security_hook_list
>>        with a pointer to a LSM allocated lsm_id structure. The
>>        LSM identifies if it needs a slot explicitly. Use the
>>        lsm_id rather than make security_add_hooks return the
>>        slot value.
>>      - Validate slot values used in security.c
>>      - Reworked the "display" process attribute handling so that
>>        it works right and doesn't use goofy list processing.
>>      - fix display value check in dentry_init_security
>>      - Replace audit_log of secids with '?' instead of deleting
>>        the audit log
>>
>> v3: Incorporate feedback from v2
>>      - Make lsmblob parameter and variable names more
>>        meaningful, changing "le" and "l" to "blob".
>>      - Improve consistency of constant naming.
>>      - Do more sanity checking during LSM initialization.
>>      - Be a bit clearer about what is temporary scaffolding.
>>      - Rather than clutter security_getpeersec_dgram with
>>        otherwise unnecessary checks remove the apparmor
>>        stub, which does nothing useful.
>>
>> Patches 0001-0003 complete the process of moving management
>> of security blobs that might be shared from the individual
>> modules to the infrastructure.
>>
>> Patches 0004-0014 replace system use of a "secid" with
>> a structure "lsmblob" containing information from the
>> security modules to be held and reused later. At this
>> point lsmblob contains an array of u32 secids, one "slot"
>> for each of the security modules compiled into the
>> kernel that used secids. A "slot" is allocated when
>> a security module requests one.
>> The infrastructure is changed to use the slot number
>> to pass the correct secid to or from the security module
>> hooks.
>>
>> It is important that the lsmblob be a fixed size entity
>> that does not have to be allocated. Several of the places
>> where it is used would have performance and/or locking
>> issues with dynamic allocation.
>>
>> Patch 0015 provides a mechanism for a process to
>> identify which security module's hooks should be used
>> when displaying or converting a security context string.
>> A new interface /proc/.../attr/display contains the name
>> of the security module to show. Reading from this file
>> will present the name of the module, while writing to
>> it will set the value. Only names of active security
>> modules are accepted. Internally, the name is translated
>> to the appropriate "slot" number for the module which
>> is then stored in the task security blob. Setting the
>> display requires CAP_MAC_ADMIN.
>>
>> Patch 0016 Starts the process of changing how a security
>> context is represented. Since it is possible for a
>> security context to have been generated by more than one
>> security module it is now necessary to note which module
>> created a security context so that the correct "release"
>> hook can be called. There are several places where the
>> module that created a security context cannot be inferred.
>>
>> This is achieved by introducing a "lsmcontext" structure
>> which contains the context string, its length and the
>> "slot" number of the security module that created it.
>> The security_release_secctx() interface is changed,
>> replacing the (string,len) pointer pair with a lsmcontext
>> pointer.
>>
>> Patches 0017-0019 convert the security interfaces from
>> (string,len) pointer pairs to a lsmcontext pointer.
>> The slot number identifying the creating module is
>> added by the infrastructure. Where the security context
>> is stored for extended periods the data type is changed.
>>
>> The Netlabel code is converted to save lsmblob structures
>> instead of secids in Patches 0020-21.
>>
>> Patch 0022 adds checks to the SELinux binder hooks
>> which verify that if either "display" used is SELinux
>> both are.
>>
>> Patches 0023-24 add addition data to the audit records
>> to identify the LSM specific data for all active
>> modules.
>>
>> Patches 0025-0027 add new interfaces for getting the
>> compound security contexts.
>>
>> Finally, with all interference on the AppArmor hooks
>> removed, Patch 0028 removes the exclusive bit from
>> AppArmor. An unnecessary stub hook was also removed.
>>
>> The Ubuntu project is using an earlier version of
>> this patchset in their distribution to enable stacking
>> for containers.
>>
>> Performance measurements to date have the change
>> within the "noise". The sockperf and dbench results
>> are on the order of 0.2% to 0.8% difference, with
>> better performance being as common as worse. The
>> benchmarks were run with AppArmor and Smack on Ubuntu.
>>
>> https://github.com/cschaufler/lsm-stacking.git#stack-5.2-v8-apparmor
>>
>> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
>> ---
>>   arch/alpha/include/uapi/asm/socket.h    |   1 +
>>   arch/mips/include/uapi/asm/socket.h     |   1 +
>>   arch/parisc/include/uapi/asm/socket.h   |   1 +
>>   arch/sparc/include/uapi/asm/socket.h    |   1 +
>>   drivers/android/binder.c                |  24 +-
>>   fs/kernfs/dir.c                         |   5 +-
>>   fs/kernfs/inode.c                       |  35 +-
>>   fs/kernfs/kernfs-internal.h             |   3 +-
>>   fs/nfs/nfs4proc.c                       |  22 +-
>>   fs/nfsd/nfs4xdr.c                       |  20 +-
>>   fs/proc/base.c                          |   2 +
>>   include/linux/audit.h                   |   1 +
>>   include/linux/cred.h                    |   3 +-
>>   include/linux/lsm_hooks.h               |  39 +-
>>   include/linux/security.h                | 184 ++++++++--
>>   include/net/af_unix.h                   |   2 +-
>>   include/net/netlabel.h                  |   8 +-
>>   include/net/scm.h                       |  15 +-
>>   include/uapi/asm-generic/socket.h       |   1 +
>>   kernel/audit.c                          |  70 +++-
>>   kernel/audit.h                          |   9 +-
>>   kernel/audit_fsnotify.c                 |   1 +
>>   kernel/auditfilter.c                    |  10 +-
>>   kernel/auditsc.c                        | 129 ++++---
>>   kernel/cred.c                           |  12 +-
>>   net/core/sock.c                         |   7 +-
>>   net/ipv4/cipso_ipv4.c                   |   6 +-
>>   net/ipv4/ip_sockglue.c                  |  12 +-
>>   net/netfilter/nf_conntrack_netlink.c    |  20 +-
>>   net/netfilter/nf_conntrack_standalone.c |  11 +-
>>   net/netfilter/nfnetlink_queue.c         |  26 +-
>>   net/netfilter/nft_meta.c                |  13 +-
>>   net/netfilter/xt_SECMARK.c              |   5 +-
>>   net/netlabel/netlabel_kapi.c            |   6 +-
>>   net/netlabel/netlabel_unlabeled.c       |  98 ++---
>>   net/netlabel/netlabel_unlabeled.h       |   2 +-
>>   net/netlabel/netlabel_user.c            |  13 +-
>>   net/netlabel/netlabel_user.h            |   6 +-
>>   net/unix/af_unix.c                      |   6 +-
>>   net/xfrm/xfrm_policy.c                  |   2 +
>>   net/xfrm/xfrm_state.c                   |   2 +
>>   security/apparmor/include/net.h         |   6 +-
>>   security/apparmor/lsm.c                 |  85 ++---
>>   security/commoncap.c                    |   7 +-
>>   security/inode.c                        |  22 +-
>>   security/integrity/ima/ima.h            |  14 +-
>>   security/integrity/ima/ima_api.c        |  10 +-
>>   security/integrity/ima/ima_appraise.c   |   6 +-
>>   security/integrity/ima/ima_main.c       |  36 +-
>>   security/integrity/ima/ima_policy.c     |  19 +-
>>   security/integrity/integrity_audit.c    |   1 +
>>   security/loadpin/loadpin.c              |   8 +-
>>   security/safesetid/lsm.c                |   8 +-
>>   security/security.c                     | 632 +++++++++++++++++++++++++++++---
>>   security/selinux/hooks.c                | 213 +++++------
>>   security/selinux/include/objsec.h       |  18 +
>>   security/selinux/include/security.h     |   1 +
>>   security/selinux/netlabel.c             |  25 +-
>>   security/selinux/ss/services.c          |   7 +-
>>   security/smack/smack.h                  |  19 +
>>   security/smack/smack_lsm.c              | 185 +++++-----
>>   security/smack/smack_netfilter.c        |   8 +-
>>   security/smack/smackfs.c                |  10 +-
>>   security/tomoyo/tomoyo.c                |   8 +-
>>   security/yama/yama_lsm.c                |   7 +-
>>   65 files changed, 1503 insertions(+), 686 deletions(-)
>>
> 


^ permalink raw reply

* [PATCH v2 3/5] fs: Enable to enforce noexec mounts or file exec through O_MAYEXEC
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

Enable to either propagate the mount options from the underlying VFS
mount to prevent execution, or to propagate the file execute permission.
This may allow a script interpreter to check execution permissions
before reading commands from a file.

The main goal is to be able to protect the kernel by restricting
arbitrary syscalls that an attacker could perform with a crafted binary
or certain script languages.  It also improves multilevel isolation
by reducing the ability of an attacker to use side channels with
specific code.  These restrictions can natively be enforced for ELF
binaries (with the noexec mount option) but require this kernel
extension to properly handle scripts (e.g., Python, Perl).

Add a new sysctl fs.open_mayexec_enforce to control this behavior.  A
following patch adds documentation.

Changes since v1:
* move code from Yama to the FS subsystem (suggested by Kees Cook)
* make omayexec_inode_permission() static (suggested by Jann Horn)
* use mode 0600 for the sysctl
* only match regular files (not directories nor other types), which
  follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow
  opening only regular files during execve()")

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
---
 fs/namei.c         | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  3 ++
 kernel/sysctl.c    |  7 +++++
 3 files changed, 78 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 0a6b9483d0cb..abd29a76ecef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -39,6 +39,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/sysctl.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -411,6 +412,34 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
 	return 0;
 }
 
+#define OMAYEXEC_ENFORCE_NONE	0
+#define OMAYEXEC_ENFORCE_MOUNT	(1 << 0)
+#define OMAYEXEC_ENFORCE_FILE	(1 << 1)
+#define _OMAYEXEC_LAST		OMAYEXEC_ENFORCE_FILE
+#define _OMAYEXEC_MASK		((_OMAYEXEC_LAST << 1) - 1)
+
+/**
+ * omayexec_inode_permission - check O_MAYEXEC before accessing an inode
+ * @inode: inode structure to check
+ * @mask: permission mask
+ *
+ * Return 0 if access is permitted, -EACCES otherwise.
+ */
+static int omayexec_inode_permission(struct inode *inode, int mask)
+{
+	if (!(mask & MAY_OPENEXEC))
+		return 0;
+
+	if ((sysctl_omayexec_enforce & OMAYEXEC_ENFORCE_MOUNT) &&
+			!(mask & MAY_EXECMOUNT))
+		return -EACCES;
+
+	if (sysctl_omayexec_enforce & OMAYEXEC_ENFORCE_FILE)
+		return generic_permission(inode, MAY_EXEC);
+
+	return 0;
+}
+
 /**
  * inode_permission - Check for access rights to a given inode
  * @inode: Inode to check permission on
@@ -454,10 +483,48 @@ int inode_permission(struct inode *inode, int mask)
 	if (retval)
 		return retval;
 
+	retval = omayexec_inode_permission(inode, mask);
+	if (retval)
+		return retval;
+
 	return security_inode_permission(inode, mask);
 }
 EXPORT_SYMBOL(inode_permission);
 
+/*
+ * Handle open_mayexec_enforce sysctl
+ */
+#ifdef CONFIG_SYSCTL
+int proc_omayexec(struct ctl_table *table, int write, void __user *buffer,
+		size_t *lenp, loff_t *ppos)
+{
+	int error;
+
+	if (write) {
+		struct ctl_table table_copy;
+		int tmp_mayexec_enforce;
+
+		if (!capable(CAP_MAC_ADMIN))
+			return -EPERM;
+		tmp_mayexec_enforce = *((int *)table->data);
+		table_copy = *table;
+		/* do not erase sysctl_omayexec_enforce */
+		table_copy.data = &tmp_mayexec_enforce;
+		error = proc_dointvec(&table_copy, write, buffer, lenp, ppos);
+		if (error)
+			return error;
+		if ((tmp_mayexec_enforce | _OMAYEXEC_MASK) != _OMAYEXEC_MASK)
+			return -EINVAL;
+		*((int *)table->data) = tmp_mayexec_enforce;
+	} else {
+		error = proc_dointvec(table, write, buffer, lenp, ppos);
+		if (error)
+			return error;
+	}
+	return 0;
+}
+#endif
+
 /**
  * path_get - get a reference to a path
  * @path: path to get the reference to
@@ -887,6 +954,7 @@ int sysctl_protected_symlinks __read_mostly = 0;
 int sysctl_protected_hardlinks __read_mostly = 0;
 int sysctl_protected_fifos __read_mostly;
 int sysctl_protected_regular __read_mostly;
+int sysctl_omayexec_enforce __read_mostly = OMAYEXEC_ENFORCE_NONE;
 
 /**
  * may_follow_link - Check symlink following for unsafe situations
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e57609dac8dd..735f5950cfed 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -81,6 +81,7 @@ extern int sysctl_protected_symlinks;
 extern int sysctl_protected_hardlinks;
 extern int sysctl_protected_fifos;
 extern int sysctl_protected_regular;
+extern int sysctl_omayexec_enforce;
 
 typedef __kernel_rwf_t rwf_t;
 
@@ -3452,6 +3453,8 @@ int proc_nr_dentry(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 int proc_nr_inodes(struct ctl_table *table, int write,
 		   void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_omayexec(struct ctl_table *table, int write, void __user *buffer,
+		size_t *lenp, loff_t *ppos);
 int __init get_filesystem_list(char *buf);
 
 #define __FMODE_EXEC		((__force int) FMODE_EXEC)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..eaaeb229a828 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1911,6 +1911,13 @@ static struct ctl_table fs_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &two,
 	},
+	{
+		.procname       = "open_mayexec_enforce",
+		.data           = &sysctl_omayexec_enforce,
+		.maxlen         = sizeof(int),
+		.mode           = 0600,
+		.proc_handler   = proc_omayexec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.procname	= "binfmt_misc",
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 0/5] Add support for O_MAYEXEC
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel

Hi,

The goal of this patch series is to control script interpretation.  A
new O_MAYEXEC flag used by sys_open() is added to enable userspace
script interpreter to delegate to the kernel (and thus the system
security policy) the permission to interpret/execute scripts or other
files containing what can be seen as commands.

This second series mainly differ from the previous one [1] by moving the
basic security policy from Yama to the filesystem subsystem.  This
policy can be enforced by the system administrator through a sysctl
configuration consistent with the mount points.

Furthermore, the security policy can also be delegated to an LSM, either
a MAC system or an integrity system.  For instance, the new kernel
MAY_OPENEXEC flag closes a major IMA measurement/appraisal interpreter
integrity gap by bringing the ability to check the use of scripts [2].
Other uses are expected, such as for openat2(2) [3], SGX integration
[4], and bpffs [5].

Userspace need to adapt to take advantage of this new feature.  For
example, the PEP 578 [6] (Runtime Audit Hooks) enables Python 3.8 to be
extended with policy enforcement points related to code interpretation,
which can be used to align with the PowerShell audit features.
Additional Python security improvements (e.g. a limited interpreter
withou -c, stdin piping of code) are on their way.

The initial idea come from CLIP OS and the original implementation has
been used for more than 10 years:
https://github.com/clipos-archive/clipos4_doc

An introduction to O_MAYEXEC was given at the Linux Security Summit
Europe 2018 - Linux Kernel Security Contributions by ANSSI:
https://www.youtube.com/watch?v=chNjCRtPKQY&t=17m15s
The "write xor execute" principle was explained at Kernel Recipes 2018 -
CLIP OS: a defense-in-depth OS:
https://www.youtube.com/watch?v=PjRE0uBtkHU&t=11m14s

This patch series can be applied on top of v5.3-rc7.  This can be tested
with CONFIG_SYSCTL.  I would really appreciate constructive comments on
this patch series.

# Changes since v1

* move code from Yama to the FS subsystem
* set __FMODE_EXEC when using O_MAYEXEC to make this information
  available through the new fanotify/FAN_OPEN_EXEC event
* only match regular files (not directories nor other types), which
  follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow
  opening only regular files during execve()")
* improve tests

[1] https://lore.kernel.org/lkml/20181212081712.32347-1-mic@digikod.net/
[2] https://lore.kernel.org/lkml/1544647356.4028.105.camel@linux.ibm.com/
[3] https://lore.kernel.org/lkml/20190904201933.10736-6-cyphar@cyphar.com/
[4] https://lore.kernel.org/lkml/CALCETrVovr8XNZSroey7pHF46O=kj_c5D9K8h=z2T_cNrpvMig@mail.gmail.com/
[5] https://lore.kernel.org/lkml/CALCETrVeZ0eufFXwfhtaG_j+AdvbzEWE0M3wjXMWVEO7pj+xkw@mail.gmail.com/
[6] https://www.python.org/dev/peps/pep-0578/

Regards,

Mickaël Salaün (5):
  fs: Add support for an O_MAYEXEC flag on sys_open()
  fs: Add a MAY_EXECMOUNT flag to infer the noexec mount propertie
  fs: Enable to enforce noexec mounts or file exec through O_MAYEXEC
  selftest/exec: Add tests for O_MAYEXEC enforcing
  doc: Add documentation for the fs.open_mayexec_enforce sysctl

 Documentation/admin-guide/sysctl/fs.rst     |  43 +++
 fs/fcntl.c                                  |   2 +-
 fs/namei.c                                  |  70 +++++
 fs/open.c                                   |   6 +
 include/linux/fcntl.h                       |   2 +-
 include/linux/fs.h                          |   7 +
 include/uapi/asm-generic/fcntl.h            |   3 +
 kernel/sysctl.c                             |   7 +
 tools/testing/selftests/exec/.gitignore     |   1 +
 tools/testing/selftests/exec/Makefile       |   4 +-
 tools/testing/selftests/exec/omayexec.c     | 317 ++++++++++++++++++++
 tools/testing/selftests/kselftest_harness.h |   3 +
 12 files changed, 462 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/exec/omayexec.c

-- 
2.23.0

^ permalink raw reply

* [PATCH v2 2/5] fs: Add a MAY_EXECMOUNT flag to infer the noexec mount propertie
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

An LSM doesn't get path information related to an access request to open
an inode.  This new (internal) MAY_EXECMOUNT flag enables an LSM to
check if the underlying mount point of an inode is marked as executable.
This is useful to implement a security policy taking advantage of the
noexec mount option.

This flag is set according to path_noexec(), which checks if a mount
point is mounted with MNT_NOEXEC or if the underlying superblock is
SB_I_NOEXEC.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
---
 fs/namei.c         | 2 ++
 include/linux/fs.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..0a6b9483d0cb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2968,6 +2968,8 @@ static int may_open(const struct path *path, int acc_mode, int flag)
 		break;
 	}
 
+	/* Pass the mount point executability. */
+	acc_mode |= path_noexec(path) ? 0 : MAY_EXECMOUNT;
 	error = inode_permission(inode, MAY_OPEN | acc_mode);
 	if (error)
 		return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 848f5711bdf0..e57609dac8dd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -101,6 +101,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define MAY_NOT_BLOCK		0x00000080
 /* the inode is opened with O_MAYEXEC */
 #define MAY_OPENEXEC		0x00000100
+/* the mount point is marked as executable */
+#define MAY_EXECMOUNT		0x00000200
 
 /*
  * flags in file.f_mode.  Note that FMODE_READ and FMODE_WRITE must correspond
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 4/5] selftest/exec: Add tests for O_MAYEXEC enforcing
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

Test propagation of noexec mount points or file executability through
files open with or without O_MAYEXEC.

Changes since v1:
* move tests from yama to exec
* fix _GNU_SOURCE in kselftest_harness.h
* add a new test sysctl_access_write to check if CAP_MAC_ADMIN is taken
  into account
* test directory execution which is always forbidden since commit
  73601ea5b7b1 ("fs/open.c: allow opening only regular files during
  execve()"), and also check that even the root user can not bypass file
  execution checks
* make sure delete_workspace() always as enough right to succeed
* cosmetic cleanup

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
Cc: Shuah Khan <shuah@kernel.org>
---
 tools/testing/selftests/exec/.gitignore     |   1 +
 tools/testing/selftests/exec/Makefile       |   4 +-
 tools/testing/selftests/exec/omayexec.c     | 317 ++++++++++++++++++++
 tools/testing/selftests/kselftest_harness.h |   3 +
 4 files changed, 324 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/exec/omayexec.c

diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
index b02279da6fa1..78487c987c07 100644
--- a/tools/testing/selftests/exec/.gitignore
+++ b/tools/testing/selftests/exec/.gitignore
@@ -1,3 +1,4 @@
+/omayexec
 subdir*
 script*
 execveat
diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
index 33339e31e365..a62b9ca306e7 100644
--- a/tools/testing/selftests/exec/Makefile
+++ b/tools/testing/selftests/exec/Makefile
@@ -3,7 +3,7 @@ CFLAGS = -Wall
 CFLAGS += -Wno-nonnull
 CFLAGS += -D_GNU_SOURCE
 
-TEST_GEN_PROGS := execveat
+TEST_GEN_PROGS := execveat omayexec
 TEST_GEN_FILES := execveat.symlink execveat.denatured script subdir
 # Makefile is a run-time dependency, since it's accessed by the execveat test
 TEST_FILES := Makefile
@@ -26,3 +26,5 @@ $(OUTPUT)/execveat.denatured: $(OUTPUT)/execveat
 	cp $< $@
 	chmod -x $@
 
+$(OUTPUT)/omayexec: omayexec.c ../kselftest_harness.h
+	$(CC) $(CFLAGS) -Wl,-no-as-needed $(LDFLAGS) -lcap $< -o $@
diff --git a/tools/testing/selftests/exec/omayexec.c b/tools/testing/selftests/exec/omayexec.c
new file mode 100644
index 000000000000..e4307b5a5417
--- /dev/null
+++ b/tools/testing/selftests/exec/omayexec.c
@@ -0,0 +1,317 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * O_MAYEXEC tests
+ *
+ * Copyright © 2018-2019 ANSSI
+ *
+ * Author: Mickaël Salaün <mic@digikod.net>
+ */
+
+#include <errno.h>
+#include <fcntl.h> /* O_CLOEXEC */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h> /* strlen */
+#include <sys/capability.h>
+#include <sys/mount.h>
+#include <sys/stat.h> /* mkdir */
+#include <unistd.h> /* unlink, rmdir */
+
+#include "../kselftest_harness.h"
+
+#ifndef O_MAYEXEC
+#define O_MAYEXEC      040000000
+#endif
+
+#define SYSCTL_MAYEXEC	"/proc/sys/fs/open_mayexec_enforce"
+
+#define BIN_DIR		"./test-mount"
+#define BIN_PATH	BIN_DIR "/file"
+#define DIR_PATH	BIN_DIR "/directory"
+
+#define ALLOWED		1
+#define DENIED		0
+
+static void ignore_dac(struct __test_metadata *_metadata, int override)
+{
+	cap_t caps;
+	const cap_value_t cap_val[2] = {
+		CAP_DAC_OVERRIDE,
+		CAP_DAC_READ_SEARCH,
+	};
+
+	caps = cap_get_proc();
+	ASSERT_TRUE(!!caps);
+	ASSERT_FALSE(cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_val,
+				override ? CAP_SET : CAP_CLEAR));
+	ASSERT_FALSE(cap_set_proc(caps));
+	EXPECT_FALSE(cap_free(caps));
+}
+
+static void ignore_mac(struct __test_metadata *_metadata, int override)
+{
+	cap_t caps;
+	const cap_value_t cap_val[1] = {
+		CAP_MAC_ADMIN,
+	};
+
+	caps = cap_get_proc();
+	ASSERT_TRUE(!!caps);
+	ASSERT_FALSE(cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_val,
+				override ? CAP_SET : CAP_CLEAR));
+	ASSERT_FALSE(cap_set_proc(caps));
+	EXPECT_FALSE(cap_free(caps));
+}
+
+static void test_omx(struct __test_metadata *_metadata,
+		const char *const path, const int exec_allowed)
+{
+	int fd;
+
+	/* without O_MAYEXEC */
+	fd = open(path, O_RDONLY | O_CLOEXEC);
+	ASSERT_NE(-1, fd);
+	EXPECT_FALSE(close(fd));
+
+	/* with O_MAYEXEC */
+	fd = open(path, O_RDONLY | O_CLOEXEC | O_MAYEXEC);
+	if (exec_allowed) {
+		/* open should succeed */
+		ASSERT_NE(-1, fd);
+		EXPECT_FALSE(close(fd));
+	} else {
+		/* open should return EACCES */
+		ASSERT_EQ(-1, fd);
+		ASSERT_EQ(EACCES, errno);
+	}
+}
+
+static void test_omx_dir_file(struct __test_metadata *_metadata,
+		const char *const dir_path, const char *const file_path,
+		const int exec_allowed)
+{
+	/*
+	 * directory execution is always denied since commit 73601ea5b7b1
+	 * ("fs/open.c: allow opening only regular files during execve()")
+	 */
+	test_omx(_metadata, dir_path, DENIED);
+	test_omx(_metadata, file_path, exec_allowed);
+}
+
+static void test_dir_file(struct __test_metadata *_metadata,
+		const char *const dir_path, const char *const file_path,
+		const int exec_allowed)
+{
+	/* test as root */
+	ignore_dac(_metadata, 1);
+	test_omx_dir_file(_metadata, dir_path, file_path, exec_allowed);
+
+	/* test without bypass */
+	ignore_dac(_metadata, 0);
+	test_omx_dir_file(_metadata, dir_path, file_path, exec_allowed);
+}
+
+static void sysctl_write(struct __test_metadata *_metadata,
+		const char *path, const char *value)
+{
+	int fd;
+	size_t len_value;
+	ssize_t len_wrote;
+
+	fd = open(path, O_WRONLY | O_CLOEXEC);
+	ASSERT_NE(-1, fd);
+	len_value = strlen(value);
+	len_wrote = write(fd, value, len_value);
+	ASSERT_EQ(len_wrote, len_value);
+	EXPECT_FALSE(close(fd));
+}
+
+static void create_workspace(struct __test_metadata *_metadata,
+		int mount_exec, int file_exec)
+{
+	int fd;
+
+	/*
+	 * Cleanup previous workspace if any error previously happened (don't
+	 * check errors).
+	 */
+	umount(BIN_DIR);
+	rmdir(BIN_DIR);
+
+	/* create a clean mount point */
+	ASSERT_FALSE(mkdir(BIN_DIR, 00700));
+	ASSERT_FALSE(mount("test", BIN_DIR, "tmpfs",
+				MS_MGC_VAL | (mount_exec ? 0 : MS_NOEXEC),
+				"mode=0700,size=4k"));
+
+	/* create a test file */
+	fd = open(BIN_PATH, O_CREAT | O_RDONLY | O_CLOEXEC,
+			file_exec ? 00500 : 00400);
+	ASSERT_NE(-1, fd);
+	EXPECT_NE(-1, close(fd));
+
+	/* create a test directory */
+	ASSERT_FALSE(mkdir(DIR_PATH, file_exec ? 00500 : 00400));
+}
+
+static void delete_workspace(struct __test_metadata *_metadata)
+{
+	ignore_mac(_metadata, 1);
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "0");
+
+	/* no need to unlink BIN_PATH nor DIR_PATH */
+	ASSERT_FALSE(umount(BIN_DIR));
+	ASSERT_FALSE(rmdir(BIN_DIR));
+}
+
+FIXTURE_DATA(mount_exec_file_exec) { };
+
+FIXTURE_SETUP(mount_exec_file_exec)
+{
+	create_workspace(_metadata, 1, 1);
+}
+
+FIXTURE_TEARDOWN(mount_exec_file_exec)
+{
+	delete_workspace(_metadata);
+}
+
+TEST_F(mount_exec_file_exec, mount)
+{
+	/* enforce mount exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "1");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+TEST_F(mount_exec_file_exec, file)
+{
+	/* enforce file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "2");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+TEST_F(mount_exec_file_exec, mount_file)
+{
+	/* enforce mount and file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "3");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+FIXTURE_DATA(mount_exec_file_noexec) { };
+
+FIXTURE_SETUP(mount_exec_file_noexec)
+{
+	create_workspace(_metadata, 1, 0);
+}
+
+FIXTURE_TEARDOWN(mount_exec_file_noexec)
+{
+	delete_workspace(_metadata);
+}
+
+TEST_F(mount_exec_file_noexec, mount)
+{
+	/* enforce mount exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "1");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+TEST_F(mount_exec_file_noexec, file)
+{
+	/* enforce file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "2");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST_F(mount_exec_file_noexec, mount_file)
+{
+	/* enforce mount and file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "3");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+FIXTURE_DATA(mount_noexec_file_exec) { };
+
+FIXTURE_SETUP(mount_noexec_file_exec)
+{
+	create_workspace(_metadata, 0, 1);
+}
+
+FIXTURE_TEARDOWN(mount_noexec_file_exec)
+{
+	delete_workspace(_metadata);
+}
+
+TEST_F(mount_noexec_file_exec, mount)
+{
+	/* enforce mount exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "1");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST_F(mount_noexec_file_exec, file)
+{
+	/* enforce file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "2");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+TEST_F(mount_noexec_file_exec, mount_file)
+{
+	/* enforce mount and file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "3");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+FIXTURE_DATA(mount_noexec_file_noexec) { };
+
+FIXTURE_SETUP(mount_noexec_file_noexec)
+{
+	create_workspace(_metadata, 0, 0);
+}
+
+FIXTURE_TEARDOWN(mount_noexec_file_noexec)
+{
+	delete_workspace(_metadata);
+}
+
+TEST_F(mount_noexec_file_noexec, mount)
+{
+	/* enforce mount exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "1");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST_F(mount_noexec_file_noexec, file)
+{
+	/* enforce file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "2");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST_F(mount_noexec_file_noexec, mount_file)
+{
+	/* enforce mount and file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "3");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST(sysctl_access_write)
+{
+	int fd;
+	ssize_t len_wrote;
+
+	ignore_mac(_metadata, 1);
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "0");
+
+	ignore_mac(_metadata, 0);
+	fd = open(SYSCTL_MAYEXEC, O_WRONLY | O_CLOEXEC);
+	ASSERT_NE(-1, fd);
+	len_wrote = write(fd, "0", 1);
+	ASSERT_EQ(len_wrote, -1);
+	EXPECT_FALSE(close(fd));
+
+	ignore_mac(_metadata, 1);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
index 5336b26506ab..6ae816fa2f62 100644
--- a/tools/testing/selftests/kselftest_harness.h
+++ b/tools/testing/selftests/kselftest_harness.h
@@ -50,7 +50,10 @@
 #ifndef __KSELFTEST_HARNESS_H
 #define __KSELFTEST_HARNESS_H
 
+#ifndef _GNU_SOURCE
 #define _GNU_SOURCE
+#endif
+
 #include <asm/types.h>
 #include <errno.h>
 #include <stdbool.h>
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

When the O_MAYEXEC flag is passed, sys_open() may be subject to
additional restrictions depending on a security policy implemented by an
LSM through the inode_permission hook.

The underlying idea is to be able to restrict scripts interpretation
according to a policy defined by the system administrator.  For this to
be possible, script interpreters must use the O_MAYEXEC flag
appropriately.  To be fully effective, these interpreters also need to
handle the other ways to execute code (for which the kernel can't help):
command line parameters (e.g., option -e for Perl), module loading
(e.g., option -m for Python), stdin, file sourcing, environment
variables, configuration files...  According to the threat model, it may
be acceptable to allow some script interpreters (e.g. Bash) to interpret
commands from stdin, may it be a TTY or a pipe, because it may not be
enough to (directly) perform syscalls.

A simple security policy implementation is available in a following
patch for Yama.

This is an updated subset of the patch initially written by Vincent
Strubel for CLIP OS:
https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
This patch has been used for more than 10 years with customized script
interpreters.  Some examples can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC

Changes since v1:
* set __FMODE_EXEC when using O_MAYEXEC to make this information
  available through the new fanotify/FAN_OPEN_EXEC event (suggested by
  Jan Kara and Matthew Bobrowski)

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
---
 fs/fcntl.c                       | 2 +-
 fs/open.c                        | 6 ++++++
 include/linux/fcntl.h            | 2 +-
 include/linux/fs.h               | 2 ++
 include/uapi/asm-generic/fcntl.h | 3 +++
 5 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..4cf05a2fd162 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/open.c b/fs/open.c
index a59abe3c669a..1b9b6fedf7cd 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -989,6 +989,12 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		acc_mode = 0;
 	}
 
+	/* Check execution permissions on open. */
+	if (flags & O_MAYEXEC) {
+		acc_mode |= MAY_OPENEXEC;
+		flags |= __FMODE_EXEC;
+	}
+
 	op->open_flag = flags;
 
 	/* O_TRUNC implies we need access checks for write permissions */
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..af88fb6c8313 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_MAYEXEC)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 997a530ff4e9..848f5711bdf0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -99,6 +99,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define MAY_CHDIR		0x00000040
 /* called from RCU mode, don't block */
 #define MAY_NOT_BLOCK		0x00000080
+/* the inode is opened with O_MAYEXEC */
+#define MAY_OPENEXEC		0x00000100
 
 /*
  * flags in file.f_mode.  Note that FMODE_READ and FMODE_WRITE must correspond
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..cbb9425d6e7c 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,9 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+/* command execution from file is intended, check exec permissions */
+#define O_MAYEXEC	040000000
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 5/5] doc: Add documentation for the fs.open_mayexec_enforce sysctl
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

Changes since v1:
* move from LSM/Yama to sysctl/fs

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
---
 Documentation/admin-guide/sysctl/fs.rst | 43 +++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 2a45119e3331..f2f5bbe428d6 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/fs:
 - inode-nr
 - inode-state
 - nr_open
+- open_mayexec_enforce
 - overflowuid
 - overflowgid
 - pipe-user-pages-hard
@@ -165,6 +166,48 @@ system needs to prune the inode list instead of allocating
 more.
 
 
+open_mayexec_enforce
+--------------------
+
+The ``O_MAYEXEC`` flag can be passed to :manpage:`open(2)` to only open regular
+files that are expected to be executable.  If the file is not identified as
+executable, then the syscall returns -EACCES.  This may allow a script
+interpreter to check executable permission before reading commands from a file.
+One interesting use case is to enforce a "write xor execute" policy through
+interpreters.
+
+Thanks to this flag, it is possible to enforce the ``noexec`` mount option
+(i.e.  the underlying mount point of the file is mounted with MNT_NOEXEC or its
+underlying superblock is SB_I_NOEXEC) not only on ELF binaries but also on
+scripts.  This may be possible thanks to script interpreters using the
+``O_MAYEXEC`` flag.  The executable permission is then checked before reading
+commands from a file, and thus can enforce the ``noexec`` at the interpreter
+level by propagating this security policy to the scripts.  To be fully
+effective, these interpreters also need to handle the other ways to execute
+code (for which the kernel can't help): command line parameters (e.g., option
+``-e`` for Perl), module loading (e.g., option ``-m`` for Python), stdin, file
+sourcing, environment variables, configuration files...  According to the
+threat model, it may be acceptable to allow some script interpreters (e.g.
+Bash) to interpret commands from stdin, may it be a TTY or a pipe, because it
+may not be enough to (directly) perform syscalls.
+
+There is two complementary security policies: enforce the ``noexec`` mount
+option, or enforce executable file permission.  These policies are handled by
+the ``fs.open_mayexec_enforce`` sysctl (writable only with ``CAP_MAC_ADMIN``)
+as a bitmask:
+
+1 - mount restriction:
+    check that the mount options for the underlying VFS mount do not prevent
+    execution.
+
+2 - file permission restriction:
+    check that the to-be-opened file is marked as executable for the current
+    process (e.g., POSIX permissions).
+
+Code samples can be found in tools/testing/selftests/exec/omayexec.c and
+https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC .
+
+
 overflowgid & overflowuid
 -------------------------
 
-- 
2.23.0


^ permalink raw reply related

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-06 15:35 UTC (permalink / raw)
  To: David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <27732.1567764557@warthog.procyon.org.uk>

On Fri, Sep 6, 2019 at 3:09 AM David Howells <dhowells@redhat.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > But it's *literally* just finding the places that work with
> > pipe->curbuf/nrbufs and making them use atomic updates.
>
> No.  It really isn't.  That's two variables that describe the occupied section
> of the buffer.  Unless you have something like a 68020 with CAS2, or put them
> next to each other so you can use CMPXCHG8, you can't do that.
>
> They need converting to head/tail pointers first.

You misunderstand - because I phrased it badly. I meant "atomic" in
the traditional kernel sense, as in "usable in not thread context" (eg
GFP_ATOMIC etc).

I'd start out just using a spinlock.

I do agree that we could try to be fancy and do it entirely locklessly
too, and I mentioned that in another part:

 "[..] it should not
  be all that hard to just make the whole "curbuf/nrbufs" handling use
  its own locking (maybe even some lockless atomics and cmpxchg)"

but I also very much agree that it's much more complex.

The main complexity of a lockless thing is actually almost certainly
not in curbuf/nrbufs, because those could easily be packed as two
16-bit values in a 32-bit entity and then regular cmpxchg works fine.

No, the complexity in the lockless model is that then you have to be
very careful with the "buf[]" array update too.  Maybe that's trivial
(just make sure that they are NULL when not used), but it just looks
less than wonderfully easy.

So a lockless update I'm sure is _doable_ with some cleverness, but is
probably not really worth it.

That's particularly true since we already *have* a spinlock that we
would take anyway: the we could strive to use the waitqueue spinlock
in pipe->wait, and not even really add any new locking. That would
require a bit of cleverness too and re-ordering things more, but we do
that in other places (eg completions, but the fs_pin code does it too,
and a few other cases.

Look for "wake_up_locked()" and friends, which is a sure-fire sign
that somebody is playing games and taking the wait-queue lock manually
for their own nefarious reasons.

> > They really would work with almost anything. You could even mix-and-match
> > "data generated by kernel" and "data done by 'write()' or 'splice()' by a
> > user process".
>
> Imagine that userspace writes a large message and takes the mutex.  At the
> same time something in softirq context decides *it* wants to write a message -
> it can't take the mutex and it can't wait, so the userspace write would have
> to cause the kernel message to be dropped.

No. You're missing the point entirely.

The mutex is entirely immaterial for the "insert a message". It is
only used for user-space synchronization. The "add message to the pipe
buffers" would only do the low-level buffer updates (whether using a
new spinlock, re-using the pipe waitqueue lock, or entirely
locklessly, ends up being then just an implementation detail).

Note that user-space writes are defined to be atomic, but they are (a)
not ordered and (b) only atomic up to a single buffer entry (which is
that PIPE_BUF limit). So you can always put in a new buffer entry at
any time.

Obviously if a user space write just fills up the whole queue (or
_other_ messages fill up the whole queue) you'd have to drop the
notification. But that's always true. That's true even in your thing.
The only difference is that we _allow_ other user spaces to write to
the notification queue too.

But if you don't want to allow that, then don't give out the write
side of the pipe to any untrusted user space.

But in *general*, allowing user space to write to the pipe is a great
feature: it means that your notification source *can* be a user space
daemon that you gave the write side of the pipe to (possibly using fd
passing, possibly by just forking your own user-space child or cloning
a thread).

So for example, from a consumer standpoint, you can start off doing
these things in user space with a helper thread that feeds the pipe
(for example, polling /proc/mounts every second), and then when you've
prototyped it and are happy with it, you can add the system call (or
ioctl or whatever) to make the kernel generate the messages so that
you don't have to poll.

But now, once you have the kernel patch, you already have a proven
user, and you can show numbers ("My user-space thing works, but it
uses up 0.1% CPU time and has that nasty up-to-one-second latency
because of polling"). Ta-daa!

End result: it's backwards compatible, it's prototypable, and it's
fairly easily extensible. Want to add a new source of events? Just
pass the pipe to any random piece of code you want. It needs kernel
support only when you've proven the concept _and_ you can show that
"yeah, this user space polling model is a real performance or
complexity problem" or whatever.

This is why I like pipes. You can use them today. They are simple, and
extensible, and you don't need to come up with a new subsystem and
some untested ad-hoc thing that nobody has actually used.

And they work automatically with all the existing infrastructure. They
work with whatever perl or shell scripts, they work with poll/select
loops, they work with user-space sources of events, they are just very
flexible.

                     Linus

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-06 15:53 UTC (permalink / raw)
  To: David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wiR1fpahgKuxSOQY6OfgjWD+MKz8UF6qUQ6V_y2TC_V6w@mail.gmail.com>

On Fri, Sep 6, 2019 at 8:35 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is why I like pipes. You can use them today. They are simple, and
> extensible, and you don't need to come up with a new subsystem and
> some untested ad-hoc thing that nobody has actually used.

The only _real_ complexity is to make sure that events are reliably parseable.

That's where you really want to use the Linux-only "packet pipe"
thing, becasue otherwise you have to have size markers or other things
to delineate events. But if you do that, then it really becomes
trivial.

And I checked, we made it available to user space, even if the
original reason for that code was kernel-only autofs use: you just
need to make the pipe be O_DIRECT.

This overly stupid program shows off the feature:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                int fd[2];
                char buf[10];

                pipe2(fd, O_DIRECT | O_NONBLOCK);
                write(fd[1], "hello", 5);
                write(fd[1], "hi", 2);
                read(fd[0], buf, sizeof(buf));
                read(fd[0], buf, sizeof(buf));
                return 0;
        }

and it you strace it (because I was too lazy to add error handling or
printing of results), you'll see

    write(4, "hello", 5)                    = 5
    write(4, "hi", 2)                       = 2
    read(3, "hello", 10)                    = 5
    read(3, "hi", 10)                       = 2

note how you got packets of data on the reader side, instead of
getting the traditional "just buffer it as a stream".

So now you can even have multiple readers of the same event pipe, and
packetization is obvious and trivial. Of course, I'm not sure why
you'd want to have multiple readers, and you'd lose _ordering_, but if
all events are independent, this _might_ be a useful thing in a
threaded environment. Maybe.

(Side note: a zero-sized write will not cause a zero-sized packet. It
will just be dropped).

               Linus

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Florian Weimer @ 2019-09-06 15:56 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
	James Morris, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk,
	Mickaël Salaün, Mimi Zohar, Philippe Trébuchet,
	Scott Shell, Sean Christopherson, Shuah Khan, Song Liu,
	Steve Dower, Steve Grubb, Thibaut Sautereau, Vincent Strubel,
	Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <20190906152455.22757-2-mic@digikod.net>

Let's assume I want to add support for this to the glibc dynamic loader,
while still being able to run on older kernels.

Is it safe to try the open call first, with O_MAYEXEC, and if that fails
with EINVAL, try again without O_MAYEXEC?

Or do I risk disabling this security feature if I do that?

Do we need a different way for recognizing kernel support.  (Note that
we cannot probe paths in /proc for various reasons.)

Thanks,
Florian

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Steven Whitehouse @ 2019-09-06 16:12 UTC (permalink / raw)
  To: Linus Torvalds, David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Nicolas Dichtel, raven, keyrings,
	linux-usb, linux-block, Christian Brauner, LSM List,
	linux-fsdevel, Linux API, Linux List Kernel Mailing, Al Viro,
	Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wioHmz69394xKRqFkhK8si86P_704KgcwjKxawLAYAiug@mail.gmail.com>

Hi,

On 06/09/2019 16:53, Linus Torvalds wrote:
> On Fri, Sep 6, 2019 at 8:35 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> This is why I like pipes. You can use them today. They are simple, and
>> extensible, and you don't need to come up with a new subsystem and
>> some untested ad-hoc thing that nobody has actually used.
> The only _real_ complexity is to make sure that events are reliably parseable.
>
> That's where you really want to use the Linux-only "packet pipe"
> thing, becasue otherwise you have to have size markers or other things
> to delineate events. But if you do that, then it really becomes
> trivial.
>
> And I checked, we made it available to user space, even if the
> original reason for that code was kernel-only autofs use: you just
> need to make the pipe be O_DIRECT.
>
> This overly stupid program shows off the feature:
>
>          #define _GNU_SOURCE
>          #include <fcntl.h>
>          #include <unistd.h>
>
>          int main(int argc, char **argv)
>          {
>                  int fd[2];
>                  char buf[10];
>
>                  pipe2(fd, O_DIRECT | O_NONBLOCK);
>                  write(fd[1], "hello", 5);
>                  write(fd[1], "hi", 2);
>                  read(fd[0], buf, sizeof(buf));
>                  read(fd[0], buf, sizeof(buf));
>                  return 0;
>          }
>
> and it you strace it (because I was too lazy to add error handling or
> printing of results), you'll see
>
>      write(4, "hello", 5)                    = 5
>      write(4, "hi", 2)                       = 2
>      read(3, "hello", 10)                    = 5
>      read(3, "hi", 10)                       = 2
>
> note how you got packets of data on the reader side, instead of
> getting the traditional "just buffer it as a stream".
>
> So now you can even have multiple readers of the same event pipe, and
> packetization is obvious and trivial. Of course, I'm not sure why
> you'd want to have multiple readers, and you'd lose _ordering_, but if
> all events are independent, this _might_ be a useful thing in a
> threaded environment. Maybe.
>
> (Side note: a zero-sized write will not cause a zero-sized packet. It
> will just be dropped).
>
>                 Linus

The events are generally not independent - we would need ordering either 
implicit in the protocol or explicit in the messages. We also need to 
know in case messages are dropped too - doesn't need to be anything 
fancy, just some idea that since we last did a read, there are messages 
that got lost, most likely due to buffer overrun.

That is why the initial idea was to use netlink, since it solves a lot 
of those issues. The downside was that the indirect nature of the 
netlink sockets resulted in making it tricky to know the namespace of 
the process to which the message was to be delivered (and hence whether 
it should be delivered at all),

Steve.


^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-06 16:06 UTC (permalink / raw)
  To: Florian Weimer, Mickaël Salaün
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
	James Morris, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <87ef0te7v3.fsf@oldenburg2.str.redhat.com>

On 06/09/2019 17:56, Florian Weimer wrote:
> Let's assume I want to add support for this to the glibc dynamic loader,
> while still being able to run on older kernels.
>
> Is it safe to try the open call first, with O_MAYEXEC, and if that fails
> with EINVAL, try again without O_MAYEXEC?

The kernel ignore unknown open(2) flags, so yes, it is safe even for
older kernel to use O_MAYEXEC.

>
> Or do I risk disabling this security feature if I do that?

It is only a security feature if the kernel support it, otherwise it is
a no-op.

>
> Do we need a different way for recognizing kernel support.  (Note that
> we cannot probe paths in /proc for various reasons.)

There is no need to probe for kernel support.

>
> Thanks,
> Florian
>

--
Mickaël Salaün

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Jeff Layton @ 2019-09-06 16:48 UTC (permalink / raw)
  To: Mickaël Salaün, Florian Weimer,
	Mickaël Salaün
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
	James Morris, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <75442f3b-a3d8-12db-579a-2c5983426b4d@ssi.gouv.fr>

On Fri, 2019-09-06 at 18:06 +0200, Mickaël Salaün wrote:
> On 06/09/2019 17:56, Florian Weimer wrote:
> > Let's assume I want to add support for this to the glibc dynamic loader,
> > while still being able to run on older kernels.
> > 
> > Is it safe to try the open call first, with O_MAYEXEC, and if that fails
> > with EINVAL, try again without O_MAYEXEC?
> 
> The kernel ignore unknown open(2) flags, so yes, it is safe even for
> older kernel to use O_MAYEXEC.
> 

Well...maybe. What about existing programs that are sending down bogus
open flags? Once you turn this on, they may break...or provide a way to
circumvent the protections this gives.

Maybe this should be a new flag that is only usable in the new openat2()
syscall that's still under discussion? That syscall will enforce that
all flags are recognized. You presumably wouldn't need the sysctl if you
went that route too.

Anyone that wants to use this will have to recompile anyway. If the
kernel doesn't support openat2 or if the flag is rejected then you know
that you have no O_MAYEXEC support and can decide what to do.

> > Or do I risk disabling this security feature if I do that?
> 
> It is only a security feature if the kernel support it, otherwise it is
> a no-op.
> 

With a security feature, I think we really want userland to aware of
whether it works.

> > Do we need a different way for recognizing kernel support.  (Note that
> > we cannot probe paths in /proc for various reasons.)
> 
> There is no need to probe for kernel support.
> 
> > Thanks,
> > Florian
> > 
> 
> --
> Mickaël Salaün
> 
> Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Aleksa Sarai @ 2019-09-06 17:07 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Florian Weimer, Mickaël Salaün, linux-kernel,
	Alexei Starovoitov, Al Viro, Andy Lutomirski, Christian Heimes,
	Daniel Borkmann, Eric Chiang, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mimi Zohar, Philippe Trébuchet, Scott Shell,
	Sean Christopherson, Shuah Khan, Song Liu, Steve Dower,
	Steve Grubb, Thibaut Sautereau, Vincent Strubel,
	Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <75442f3b-a3d8-12db-579a-2c5983426b4d@ssi.gouv.fr>

[-- Attachment #1: Type: text/plain, Size: 1064 bytes --]

On 2019-09-06, Mickaël Salaün <mickael.salaun@ssi.gouv.fr> wrote:
> 
> On 06/09/2019 17:56, Florian Weimer wrote:
> > Let's assume I want to add support for this to the glibc dynamic loader,
> > while still being able to run on older kernels.
> >
> > Is it safe to try the open call first, with O_MAYEXEC, and if that fails
> > with EINVAL, try again without O_MAYEXEC?
> 
> The kernel ignore unknown open(2) flags, so yes, it is safe even for
> older kernel to use O_MAYEXEC.

Depends on your definition of "safe" -- a security feature that you will
silently not enable on older kernels doesn't sound super safe to me.
Unfortunately this is a limitation of open(2) that we cannot change --
which is why the openat2(2) proposal I've been posting gives -EINVAL for
unknown O_* flags.

There is a way to probe for support (though unpleasant), by creating a
test O_MAYEXEC fd and then checking if the flag is present in
/proc/self/fdinfo/$n.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Aleksa Sarai @ 2019-09-06 17:13 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Mickaël Salaün, Florian Weimer,
	Mickaël Salaün, linux-kernel, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, James Morris, Jan Kara, Jann Horn, Jonathan Corbet,
	Kees Cook, Matthew Garrett, Matthew Wilcox, Michael Kerrisk,
	Mimi Zohar, Philippe Trébuchet, Scott Shell,
	Sean Christopherson, Shuah Khan, Song Liu, Steve Dower,
	Steve Grubb, Thibaut Sautereau, Vincent Strubel,
	Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <f53ec45fd253e96d1c8d0ea6f9cca7f68afa51e3.camel@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1776 bytes --]

On 2019-09-06, Jeff Layton <jlayton@kernel.org> wrote:
> On Fri, 2019-09-06 at 18:06 +0200, Mickaël Salaün wrote:
> > On 06/09/2019 17:56, Florian Weimer wrote:
> > > Let's assume I want to add support for this to the glibc dynamic loader,
> > > while still being able to run on older kernels.
> > > 
> > > Is it safe to try the open call first, with O_MAYEXEC, and if that fails
> > > with EINVAL, try again without O_MAYEXEC?
> > 
> > The kernel ignore unknown open(2) flags, so yes, it is safe even for
> > older kernel to use O_MAYEXEC.
> > 
> 
> Well...maybe. What about existing programs that are sending down bogus
> open flags? Once you turn this on, they may break...or provide a way to
> circumvent the protections this gives.

It should be noted that this has been a valid concern for every new O_*
flag introduced (and yet we still introduced new flags, despite the
concern) -- though to be fair, O_TMPFILE actually does have a
work-around with the O_DIRECTORY mask setup.

The openat2() set adds O_EMPTYPATH -- though in fairness it's also
backwards compatible because empty path strings have always given ENOENT
(or EINVAL?) while O_EMPTYPATH is a no-op non-empty strings.

> Maybe this should be a new flag that is only usable in the new openat2()
> syscall that's still under discussion? That syscall will enforce that
> all flags are recognized. You presumably wouldn't need the sysctl if you
> went that route too.

I'm also interested in whether we could add an UPGRADE_NOEXEC flag to
how->upgrade_mask for the openat2(2) patchset (I reserved a flag bit for
it, since I'd heard about this work through the grape-vine).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Andy Lutomirski @ 2019-09-06 17:14 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Linus Torvalds, David Howells, Ray Strode, Greg Kroah-Hartman,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <8e60555e-9247-e03f-e8b4-1d31f70f1221@redhat.com>



> On Sep 6, 2019, at 9:12 AM, Steven Whitehouse <swhiteho@redhat.com> wrote:
> 
> Hi,
> 
>> On 06/09/2019 16:53, Linus Torvalds wrote:
>> On Fri, Sep 6, 2019 at 8:35 AM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> This is why I like pipes. You can use them today. They are simple, and
>>> extensible, and you don't need to come up with a new subsystem and
>>> some untested ad-hoc thing that nobody has actually used.
>> The only _real_ complexity is to make sure that events are reliably parseable.
>> 
>> That's where you really want to use the Linux-only "packet pipe"
>> thing, becasue otherwise you have to have size markers or other things
>> to delineate events. But if you do that, then it really becomes
>> trivial.
>> 
>> And I checked, we made it available to user space, even if the
>> original reason for that code was kernel-only autofs use: you just
>> need to make the pipe be O_DIRECT.
>> 
>> This overly stupid program shows off the feature:
>> 
>>         #define _GNU_SOURCE
>>         #include <fcntl.h>
>>         #include <unistd.h>
>> 
>>         int main(int argc, char **argv)
>>         {
>>                 int fd[2];
>>                 char buf[10];
>> 
>>                 pipe2(fd, O_DIRECT | O_NONBLOCK);
>>                 write(fd[1], "hello", 5);
>>                 write(fd[1], "hi", 2);
>>                 read(fd[0], buf, sizeof(buf));
>>                 read(fd[0], buf, sizeof(buf));
>>                 return 0;
>>         }
>> 
>> and it you strace it (because I was too lazy to add error handling or
>> printing of results), you'll see
>> 
>>     write(4, "hello", 5)                    = 5
>>     write(4, "hi", 2)                       = 2
>>     read(3, "hello", 10)                    = 5
>>     read(3, "hi", 10)                       = 2
>> 
>> note how you got packets of data on the reader side, instead of
>> getting the traditional "just buffer it as a stream".
>> 
>> So now you can even have multiple readers of the same event pipe, and
>> packetization is obvious and trivial. Of course, I'm not sure why
>> you'd want to have multiple readers, and you'd lose _ordering_, but if
>> all events are independent, this _might_ be a useful thing in a
>> threaded environment. Maybe.
>> 
>> (Side note: a zero-sized write will not cause a zero-sized packet. It
>> will just be dropped).
>> 
>>                Linus
> 
> The events are generally not independent - we would need ordering either implicit in the protocol or explicit in the messages. We also need to know in case messages are dropped too - doesn't need to be anything fancy, just some idea that since we last did a read, there are messages that got lost, most likely due to buffer overrun.

This could be a bit fancier: if the pipe recorded the bitwise or of the first few bytes of dropped message, then the messages could set a bit in the header indicating the type, and readers could then learn which *types* of messages were dropped.

Or they could just use multiple pipes.

If this whole mechanism catches on, I wonder if implementing recvmmsg() on pipes would be worthwhile.

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-06 17:14 UTC (permalink / raw)
  To: Jeff Layton, Florian Weimer, Mickaël Salaün
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
	James Morris, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower, Steve Grubb, Thibaut Sautereau,
	Vincent Strubel, Yves-Alexis Perez, kernel-hardening, linux-api,
	linux-security-module, linux-fsdevel
In-Reply-To: <f53ec45fd253e96d1c8d0ea6f9cca7f68afa51e3.camel@kernel.org>


On 06/09/2019 18:48, Jeff Layton wrote:
> On Fri, 2019-09-06 at 18:06 +0200, Mickaël Salaün wrote:
>> On 06/09/2019 17:56, Florian Weimer wrote:
>>> Let's assume I want to add support for this to the glibc dynamic loader,
>>> while still being able to run on older kernels.
>>>
>>> Is it safe to try the open call first, with O_MAYEXEC, and if that fails
>>> with EINVAL, try again without O_MAYEXEC?
>>
>> The kernel ignore unknown open(2) flags, so yes, it is safe even for
>> older kernel to use O_MAYEXEC.
>>
>
> Well...maybe. What about existing programs that are sending down bogus
> open flags? Once you turn this on, they may break...or provide a way to
> circumvent the protections this gives.

Well, I don't think we should nor could care about bogus programs that
do not conform to the Linux ABI.

>
> Maybe this should be a new flag that is only usable in the new openat2()
> syscall that's still under discussion? That syscall will enforce that
> all flags are recognized. You presumably wouldn't need the sysctl if you
> went that route too.

Here is a thread about a new syscall:
https://lore.kernel.org/lkml/1544699060.6703.11.camel@linux.ibm.com/

I don't think it fit well with auditing nor integrity. Moreover using
the current open(2) behavior of ignoring unknown flags fit well with the
usage of O_MAYEXEC (because it is only a hint to the kernel about the
use of the *opened* file).

>
> Anyone that wants to use this will have to recompile anyway. If the
> kernel doesn't support openat2 or if the flag is rejected then you know
> that you have no O_MAYEXEC support and can decide what to do.

If we want to enforce a security policy, we need to either be the system
administrator or the distro developer. If a distro ship interpreters
using this flag, we don't need to recompile anything, but we need to be
able to control the enforcement according to the mount point
configuration (or an advanced MAC, or an IMA config). I don't see why an
userspace process should check if this flag is supported or not, it
should simply use it, and the sysadmin will enable an enforcement if it
makes sense for the whole system.

>
>>> Or do I risk disabling this security feature if I do that?
>>
>> It is only a security feature if the kernel support it, otherwise it is
>> a no-op.
>>
>
> With a security feature, I think we really want userland to aware of
> whether it works.

If userland would like to enforce something, it can already do it
without any kernel modification. The goal of the O_MAYEXEC flag is to
enable the kernel, hence sysadmins or system designers, to enforce a
global security policy that makes sense.

>
>>> Do we need a different way for recognizing kernel support.  (Note that
>>> we cannot probe paths in /proc for various reasons.)
>>
>> There is no need to probe for kernel support.
>>
>>> Thanks,
>>> Florian
>>>
Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-06 17:14 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: David Howells, Ray Strode, Greg Kroah-Hartman, Nicolas Dichtel,
	raven, keyrings, linux-usb, linux-block, Christian Brauner,
	LSM List, linux-fsdevel, Linux API, Linux List Kernel Mailing,
	Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wg6=qhw0-F=2_8y=VdT+fj8k7G1+t2XNSkRYimXhampVg@mail.gmail.com>

On Fri, Sep 6, 2019 at 10:07 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. Maybe somebody can come up with a good legacy signaling solution
> (and "just use another pipe for error notification and OOB data" for
> the first one may _work_, but that sounds pretty hacky and just not
> very convenient).

... actually, maybe the trivial solution for at least some prototyping
cases is to make any user mode writers never drop messages. Don't use
a non-blocking fd for the write direction.

That's obviously *not* acceptable for a kernel writer, and it's not
acceptable for an actual system daemon writer (that you could block by
just not reading the notifications), but it's certainly acceptable for
the "let's prototype having kernel support for /proc/mounts
notifications using a local thread that just polls for it every few
seconds".

So at least for _some_ prototypes you can probably just ignore the
overflow issue. It won't get you full test coverage, but it will get
you a working legacy solution and a "look, if we have kernel support
for this, we can do better".

                 Linus

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-06 17:07 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: David Howells, Ray Strode, Greg Kroah-Hartman, Nicolas Dichtel,
	raven, keyrings, linux-usb, linux-block, Christian Brauner,
	LSM List, linux-fsdevel, Linux API, Linux List Kernel Mailing,
	Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <8e60555e-9247-e03f-e8b4-1d31f70f1221@redhat.com>

On Fri, Sep 6, 2019 at 9:12 AM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> The events are generally not independent - we would need ordering either
> implicit in the protocol or explicit in the messages.

Note that pipes certainly would never re-order messages. It's just
that _if_ you have two independent and concurrent readers of the same
pipe, they could read one message each, and you couldn't tell which
was first in user space.

Of course, I would suggest that anything that actually has
non-independent messages should always use a sequence number or
something like that in the message anyway. But then it would have to
be on a protocol level.

And it's not clear that all notifications need it. If it's just a
random "things changed" notification, mnaybe that doesn't need a
sequence number or anything else. So being on a protocol/data stream
level might be the right thing regardless.

Another possibility is to just say "don't do that then". If you want
multiple concurrent readers, open multiple pipes for them and use
separate events, and be happy in the knowledge that you don't have any
complicated cases.

>  We also need to
> know in case messages are dropped too - doesn't need to be anything
> fancy, just some idea that since we last did a read, there are messages
> that got lost, most likely due to buffer overrun.

Pipes don't have that, but another flag certainly wouldn't be _hard_ to add.

But one problem (and this is fundamental) is that while O_DIRECT works
today (and works with kernels going back years), any new features like
overflow notification would obviously not work with legacy kernels.

On the user write side, with an O_NONBLOCK pipe, you currently just
get an -EAGAIN, so you _see_ the drop happening. But (again) there's
no sticky flag for it anywhere else, and there's no clean automatic
way for the reader to see "ok, the writer overflowed".

That's not a problem for any future extensions - the feature sounds
like a new flag and a couple of lines to do it - but it's a problem
for the whole "prototype in user space using existing pipe support"
that I personally find so nice, and which I think is such a good way
to prove the user space _need_ for anything like this.

But if people are ok with the pipe model in theory, _that_ kind of
small and directed feature I have absolutely no problem with adding.
It's just whole new untested character mode drivers with odd semantics
that I find troublesome.

Hmm. Maybe somebody can come up with a good legacy signaling solution
(and "just use another pipe for error notification and OOB data" for
the first one may _work_, but that sounds pretty hacky and just not
very convenient).

               Linus

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox