Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Aleksa Sarai @ 2019-09-06 17:07 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Florian Weimer, Mickaël Salaün, linux-kernel,
	Alexei Starovoitov, Al Viro, Andy Lutomirski, Christian Heimes,
	Daniel Borkmann, Eric Chiang, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mimi Zohar, Philippe Trébuchet, Scott Shell
In-Reply-To: <75442f3b-a3d8-12db-579a-2c5983426b4d@ssi.gouv.fr>

[-- Attachment #1: Type: text/plain, Size: 1064 bytes --]

On 2019-09-06, Mickaël Salaün <mickael.salaun@ssi.gouv.fr> wrote:
> 
> On 06/09/2019 17:56, Florian Weimer wrote:
> > Let's assume I want to add support for this to the glibc dynamic loader,
> > while still being able to run on older kernels.
> >
> > Is it safe to try the open call first, with O_MAYEXEC, and if that fails
> > with EINVAL, try again without O_MAYEXEC?
> 
> The kernel ignore unknown open(2) flags, so yes, it is safe even for
> older kernel to use O_MAYEXEC.

Depends on your definition of "safe" -- a security feature that you will
silently not enable on older kernels doesn't sound super safe to me.
Unfortunately this is a limitation of open(2) that we cannot change --
which is why the openat2(2) proposal I've been posting gives -EINVAL for
unknown O_* flags.

There is a way to probe for support (though unpleasant), by creating a
test O_MAYEXEC fd and then checking if the flag is present in
/proc/self/fdinfo/$n.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Jeff Layton @ 2019-09-06 16:48 UTC (permalink / raw)
  To: Mickaël Salaün, Florian Weimer,
	Mickaël Salaün
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
	James Morris, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson
In-Reply-To: <75442f3b-a3d8-12db-579a-2c5983426b4d@ssi.gouv.fr>

On Fri, 2019-09-06 at 18:06 +0200, Mickaël Salaün wrote:
> On 06/09/2019 17:56, Florian Weimer wrote:
> > Let's assume I want to add support for this to the glibc dynamic loader,
> > while still being able to run on older kernels.
> > 
> > Is it safe to try the open call first, with O_MAYEXEC, and if that fails
> > with EINVAL, try again without O_MAYEXEC?
> 
> The kernel ignore unknown open(2) flags, so yes, it is safe even for
> older kernel to use O_MAYEXEC.
> 

Well...maybe. What about existing programs that are sending down bogus
open flags? Once you turn this on, they may break...or provide a way to
circumvent the protections this gives.

Maybe this should be a new flag that is only usable in the new openat2()
syscall that's still under discussion? That syscall will enforce that
all flags are recognized. You presumably wouldn't need the sysctl if you
went that route too.

Anyone that wants to use this will have to recompile anyway. If the
kernel doesn't support openat2 or if the flag is rejected then you know
that you have no O_MAYEXEC support and can decide what to do.

> > Or do I risk disabling this security feature if I do that?
> 
> It is only a security feature if the kernel support it, otherwise it is
> a no-op.
> 

With a security feature, I think we really want userland to aware of
whether it works.

> > Do we need a different way for recognizing kernel support.  (Note that
> > we cannot probe paths in /proc for various reasons.)
> 
> There is no need to probe for kernel support.
> 
> > Thanks,
> > Florian
> > 
> 
> --
> Mickaël Salaün
> 
> Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your da
 ta, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 bpf-next 2/3] bpf: implement CAP_BPF
From: kbuild test robot @ 2019-09-06 16:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: kbuild-all, davem, daniel, peterz, luto, netdev, bpf, kernel-team,
	linux-api
In-Reply-To: <20190904184335.360074-2-ast@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2159 bytes --]

Hi Alexei,

I love your patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Alexei-Starovoitov/capability-introduce-CAP_BPF-and-CAP_TRACING/20190906-215814
base:   https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.4.0-11) 7.4.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   kernel//bpf/syscall.c: In function 'bpf_prog_test_run':
>> kernel//bpf/syscall.c:2087:6: warning: the address of 'capable_bpf_net_admin' will always evaluate as 'true' [-Waddress]
     if (!capable_bpf_net_admin)
         ^

vim +2087 kernel//bpf/syscall.c

  2080	
  2081	static int bpf_prog_test_run(const union bpf_attr *attr,
  2082				     union bpf_attr __user *uattr)
  2083	{
  2084		struct bpf_prog *prog;
  2085		int ret = -ENOTSUPP;
  2086	
> 2087		if (!capable_bpf_net_admin)
  2088			/* test_run callback is available for networking progs only.
  2089			 * Add capable_bpf_tracing() above when tracing progs become runable.
  2090			 */
  2091			return -EPERM;
  2092		if (CHECK_ATTR(BPF_PROG_TEST_RUN))
  2093			return -EINVAL;
  2094	
  2095		if ((attr->test.ctx_size_in && !attr->test.ctx_in) ||
  2096		    (!attr->test.ctx_size_in && attr->test.ctx_in))
  2097			return -EINVAL;
  2098	
  2099		if ((attr->test.ctx_size_out && !attr->test.ctx_out) ||
  2100		    (!attr->test.ctx_size_out && attr->test.ctx_out))
  2101			return -EINVAL;
  2102	
  2103		prog = bpf_prog_get(attr->test.prog_fd);
  2104		if (IS_ERR(prog))
  2105			return PTR_ERR(prog);
  2106	
  2107		if (prog->aux->ops->test_run)
  2108			ret = prog->aux->ops->test_run(prog, attr, uattr);
  2109	
  2110		bpf_prog_put(prog);
  2111		return ret;
  2112	}
  2113	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 70222 bytes --]

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Steven Whitehouse @ 2019-09-06 16:12 UTC (permalink / raw)
  To: Linus Torvalds, David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Nicolas Dichtel, raven, keyrings,
	linux-usb, linux-block, Christian Brauner, LSM List,
	linux-fsdevel, Linux API, Linux List Kernel Mailing, Al Viro,
	Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wioHmz69394xKRqFkhK8si86P_704KgcwjKxawLAYAiug@mail.gmail.com>

Hi,

On 06/09/2019 16:53, Linus Torvalds wrote:
> On Fri, Sep 6, 2019 at 8:35 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> This is why I like pipes. You can use them today. They are simple, and
>> extensible, and you don't need to come up with a new subsystem and
>> some untested ad-hoc thing that nobody has actually used.
> The only _real_ complexity is to make sure that events are reliably parseable.
>
> That's where you really want to use the Linux-only "packet pipe"
> thing, becasue otherwise you have to have size markers or other things
> to delineate events. But if you do that, then it really becomes
> trivial.
>
> And I checked, we made it available to user space, even if the
> original reason for that code was kernel-only autofs use: you just
> need to make the pipe be O_DIRECT.
>
> This overly stupid program shows off the feature:
>
>          #define _GNU_SOURCE
>          #include <fcntl.h>
>          #include <unistd.h>
>
>          int main(int argc, char **argv)
>          {
>                  int fd[2];
>                  char buf[10];
>
>                  pipe2(fd, O_DIRECT | O_NONBLOCK);
>                  write(fd[1], "hello", 5);
>                  write(fd[1], "hi", 2);
>                  read(fd[0], buf, sizeof(buf));
>                  read(fd[0], buf, sizeof(buf));
>                  return 0;
>          }
>
> and it you strace it (because I was too lazy to add error handling or
> printing of results), you'll see
>
>      write(4, "hello", 5)                    = 5
>      write(4, "hi", 2)                       = 2
>      read(3, "hello", 10)                    = 5
>      read(3, "hi", 10)                       = 2
>
> note how you got packets of data on the reader side, instead of
> getting the traditional "just buffer it as a stream".
>
> So now you can even have multiple readers of the same event pipe, and
> packetization is obvious and trivial. Of course, I'm not sure why
> you'd want to have multiple readers, and you'd lose _ordering_, but if
> all events are independent, this _might_ be a useful thing in a
> threaded environment. Maybe.
>
> (Side note: a zero-sized write will not cause a zero-sized packet. It
> will just be dropped).
>
>                 Linus

The events are generally not independent - we would need ordering either 
implicit in the protocol or explicit in the messages. We also need to 
know in case messages are dropped too - doesn't need to be anything 
fancy, just some idea that since we last did a read, there are messages 
that got lost, most likely due to buffer overrun.

That is why the initial idea was to use netlink, since it solves a lot 
of those issues. The downside was that the indirect nature of the 
netlink sockets resulted in making it tricky to know the namespace of 
the process to which the message was to be delivered (and hence whether 
it should be delivered at all),

Steve.

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-06 16:06 UTC (permalink / raw)
  To: Florian Weimer, Mickaël Salaün
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
	James Morris, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Song Liu, Steve Dower
In-Reply-To: <87ef0te7v3.fsf@oldenburg2.str.redhat.com>

On 06/09/2019 17:56, Florian Weimer wrote:
> Let's assume I want to add support for this to the glibc dynamic loader,
> while still being able to run on older kernels.
>
> Is it safe to try the open call first, with O_MAYEXEC, and if that fails
> with EINVAL, try again without O_MAYEXEC?

The kernel ignore unknown open(2) flags, so yes, it is safe even for
older kernel to use O_MAYEXEC.

>
> Or do I risk disabling this security feature if I do that?

It is only a security feature if the kernel support it, otherwise it is
a no-op.

>
> Do we need a different way for recognizing kernel support.  (Note that
> we cannot probe paths in /proc for various reasons.)

There is no need to probe for kernel support.

>
> Thanks,
> Florian
>

--
Mickaël Salaün

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* Re: [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Florian Weimer @ 2019-09-06 15:56 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andy Lutomirski, Christian Heimes, Daniel Borkmann, Eric Chiang,
	James Morris, Jan Kara, Jann Horn, Jonathan Corbet, Kees Cook,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk,
	Mickaël Salaün, Mimi Zohar, Philippe Trébuchet,
	Scott Shell
In-Reply-To: <20190906152455.22757-2-mic@digikod.net>

Let's assume I want to add support for this to the glibc dynamic loader,
while still being able to run on older kernels.

Is it safe to try the open call first, with O_MAYEXEC, and if that fails
with EINVAL, try again without O_MAYEXEC?

Or do I risk disabling this security feature if I do that?

Do we need a different way for recognizing kernel support.  (Note that
we cannot probe paths in /proc for various reasons.)

Thanks,
Florian

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-06 15:53 UTC (permalink / raw)
  To: David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wiR1fpahgKuxSOQY6OfgjWD+MKz8UF6qUQ6V_y2TC_V6w@mail.gmail.com>

On Fri, Sep 6, 2019 at 8:35 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is why I like pipes. You can use them today. They are simple, and
> extensible, and you don't need to come up with a new subsystem and
> some untested ad-hoc thing that nobody has actually used.

The only _real_ complexity is to make sure that events are reliably parseable.

That's where you really want to use the Linux-only "packet pipe"
thing, becasue otherwise you have to have size markers or other things
to delineate events. But if you do that, then it really becomes
trivial.

And I checked, we made it available to user space, even if the
original reason for that code was kernel-only autofs use: you just
need to make the pipe be O_DIRECT.

This overly stupid program shows off the feature:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                int fd[2];
                char buf[10];

                pipe2(fd, O_DIRECT | O_NONBLOCK);
                write(fd[1], "hello", 5);
                write(fd[1], "hi", 2);
                read(fd[0], buf, sizeof(buf));
                read(fd[0], buf, sizeof(buf));
                return 0;
        }

and it you strace it (because I was too lazy to add error handling or
printing of results), you'll see

    write(4, "hello", 5)                    = 5
    write(4, "hi", 2)                       = 2
    read(3, "hello", 10)                    = 5
    read(3, "hi", 10)                       = 2

note how you got packets of data on the reader side, instead of
getting the traditional "just buffer it as a stream".

So now you can even have multiple readers of the same event pipe, and
packetization is obvious and trivial. Of course, I'm not sure why
you'd want to have multiple readers, and you'd lose _ordering_, but if
all events are independent, this _might_ be a useful thing in a
threaded environment. Maybe.

(Side note: a zero-sized write will not cause a zero-sized packet. It
will just be dropped).

               Linus

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-06 15:35 UTC (permalink / raw)
  To: David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <27732.1567764557@warthog.procyon.org.uk>

On Fri, Sep 6, 2019 at 3:09 AM David Howells <dhowells@redhat.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > But it's *literally* just finding the places that work with
> > pipe->curbuf/nrbufs and making them use atomic updates.
>
> No.  It really isn't.  That's two variables that describe the occupied section
> of the buffer.  Unless you have something like a 68020 with CAS2, or put them
> next to each other so you can use CMPXCHG8, you can't do that.
>
> They need converting to head/tail pointers first.

You misunderstand - because I phrased it badly. I meant "atomic" in
the traditional kernel sense, as in "usable in not thread context" (eg
GFP_ATOMIC etc).

I'd start out just using a spinlock.

I do agree that we could try to be fancy and do it entirely locklessly
too, and I mentioned that in another part:

 "[..] it should not
  be all that hard to just make the whole "curbuf/nrbufs" handling use
  its own locking (maybe even some lockless atomics and cmpxchg)"

but I also very much agree that it's much more complex.

The main complexity of a lockless thing is actually almost certainly
not in curbuf/nrbufs, because those could easily be packed as two
16-bit values in a 32-bit entity and then regular cmpxchg works fine.

No, the complexity in the lockless model is that then you have to be
very careful with the "buf[]" array update too.  Maybe that's trivial
(just make sure that they are NULL when not used), but it just looks
less than wonderfully easy.

So a lockless update I'm sure is _doable_ with some cleverness, but is
probably not really worth it.

That's particularly true since we already *have* a spinlock that we
would take anyway: the we could strive to use the waitqueue spinlock
in pipe->wait, and not even really add any new locking. That would
require a bit of cleverness too and re-ordering things more, but we do
that in other places (eg completions, but the fs_pin code does it too,
and a few other cases.

Look for "wake_up_locked()" and friends, which is a sure-fire sign
that somebody is playing games and taking the wait-queue lock manually
for their own nefarious reasons.

> > They really would work with almost anything. You could even mix-and-match
> > "data generated by kernel" and "data done by 'write()' or 'splice()' by a
> > user process".
>
> Imagine that userspace writes a large message and takes the mutex.  At the
> same time something in softirq context decides *it* wants to write a message -
> it can't take the mutex and it can't wait, so the userspace write would have
> to cause the kernel message to be dropped.

No. You're missing the point entirely.

The mutex is entirely immaterial for the "insert a message". It is
only used for user-space synchronization. The "add message to the pipe
buffers" would only do the low-level buffer updates (whether using a
new spinlock, re-using the pipe waitqueue lock, or entirely
locklessly, ends up being then just an implementation detail).

Note that user-space writes are defined to be atomic, but they are (a)
not ordered and (b) only atomic up to a single buffer entry (which is
that PIPE_BUF limit). So you can always put in a new buffer entry at
any time.

Obviously if a user space write just fills up the whole queue (or
_other_ messages fill up the whole queue) you'd have to drop the
notification. But that's always true. That's true even in your thing.
The only difference is that we _allow_ other user spaces to write to
the notification queue too.

But if you don't want to allow that, then don't give out the write
side of the pipe to any untrusted user space.

But in *general*, allowing user space to write to the pipe is a great
feature: it means that your notification source *can* be a user space
daemon that you gave the write side of the pipe to (possibly using fd
passing, possibly by just forking your own user-space child or cloning
a thread).

So for example, from a consumer standpoint, you can start off doing
these things in user space with a helper thread that feeds the pipe
(for example, polling /proc/mounts every second), and then when you've
prototyped it and are happy with it, you can add the system call (or
ioctl or whatever) to make the kernel generate the messages so that
you don't have to poll.

But now, once you have the kernel patch, you already have a proven
user, and you can show numbers ("My user-space thing works, but it
uses up 0.1% CPU time and has that nasty up-to-one-second latency
because of polling"). Ta-daa!

End result: it's backwards compatible, it's prototypable, and it's
fairly easily extensible. Want to add a new source of events? Just
pass the pipe to any random piece of code you want. It needs kernel
support only when you've proven the concept _and_ you can show that
"yeah, this user space polling model is a real performance or
complexity problem" or whatever.

This is why I like pipes. You can use them today. They are simple, and
extensible, and you don't need to come up with a new subsystem and
some untested ad-hoc thing that nobody has actually used.

And they work automatically with all the existing infrastructure. They
work with whatever perl or shell scripts, they work with poll/select
loops, they work with user-space sources of events, they are just very
flexible.

                     Linus

^ permalink raw reply

* [PATCH v2 5/5] doc: Add documentation for the fs.open_mayexec_enforce sysctl
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

Changes since v1:
* move from LSM/Yama to sysctl/fs

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
---
 Documentation/admin-guide/sysctl/fs.rst | 43 +++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 2a45119e3331..f2f5bbe428d6 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/fs:
 - inode-nr
 - inode-state
 - nr_open
+- open_mayexec_enforce
 - overflowuid
 - overflowgid
 - pipe-user-pages-hard
@@ -165,6 +166,48 @@ system needs to prune the inode list instead of allocating
 more.
 
 
+open_mayexec_enforce
+--------------------
+
+The ``O_MAYEXEC`` flag can be passed to :manpage:`open(2)` to only open regular
+files that are expected to be executable.  If the file is not identified as
+executable, then the syscall returns -EACCES.  This may allow a script
+interpreter to check executable permission before reading commands from a file.
+One interesting use case is to enforce a "write xor execute" policy through
+interpreters.
+
+Thanks to this flag, it is possible to enforce the ``noexec`` mount option
+(i.e.  the underlying mount point of the file is mounted with MNT_NOEXEC or its
+underlying superblock is SB_I_NOEXEC) not only on ELF binaries but also on
+scripts.  This may be possible thanks to script interpreters using the
+``O_MAYEXEC`` flag.  The executable permission is then checked before reading
+commands from a file, and thus can enforce the ``noexec`` at the interpreter
+level by propagating this security policy to the scripts.  To be fully
+effective, these interpreters also need to handle the other ways to execute
+code (for which the kernel can't help): command line parameters (e.g., option
+``-e`` for Perl), module loading (e.g., option ``-m`` for Python), stdin, file
+sourcing, environment variables, configuration files...  According to the
+threat model, it may be acceptable to allow some script interpreters (e.g.
+Bash) to interpret commands from stdin, may it be a TTY or a pipe, because it
+may not be enough to (directly) perform syscalls.
+
+There is two complementary security policies: enforce the ``noexec`` mount
+option, or enforce executable file permission.  These policies are handled by
+the ``fs.open_mayexec_enforce`` sysctl (writable only with ``CAP_MAC_ADMIN``)
+as a bitmask:
+
+1 - mount restriction:
+    check that the mount options for the underlying VFS mount do not prevent
+    execution.
+
+2 - file permission restriction:
+    check that the to-be-opened file is marked as executable for the current
+    process (e.g., POSIX permissions).
+
+Code samples can be found in tools/testing/selftests/exec/omayexec.c and
+https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC .
+
+
 overflowgid & overflowuid
 -------------------------
 
-- 
2.23.0

^ permalink raw reply related

* [PATCH v2 4/5] selftest/exec: Add tests for O_MAYEXEC enforcing
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

Test propagation of noexec mount points or file executability through
files open with or without O_MAYEXEC.

Changes since v1:
* move tests from yama to exec
* fix _GNU_SOURCE in kselftest_harness.h
* add a new test sysctl_access_write to check if CAP_MAC_ADMIN is taken
  into account
* test directory execution which is always forbidden since commit
  73601ea5b7b1 ("fs/open.c: allow opening only regular files during
  execve()"), and also check that even the root user can not bypass file
  execution checks
* make sure delete_workspace() always as enough right to succeed
* cosmetic cleanup

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
Cc: Shuah Khan <shuah@kernel.org>
---
 tools/testing/selftests/exec/.gitignore     |   1 +
 tools/testing/selftests/exec/Makefile       |   4 +-
 tools/testing/selftests/exec/omayexec.c     | 317 ++++++++++++++++++++
 tools/testing/selftests/kselftest_harness.h |   3 +
 4 files changed, 324 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/exec/omayexec.c

diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
index b02279da6fa1..78487c987c07 100644
--- a/tools/testing/selftests/exec/.gitignore
+++ b/tools/testing/selftests/exec/.gitignore
@@ -1,3 +1,4 @@
+/omayexec
 subdir*
 script*
 execveat
diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
index 33339e31e365..a62b9ca306e7 100644
--- a/tools/testing/selftests/exec/Makefile
+++ b/tools/testing/selftests/exec/Makefile
@@ -3,7 +3,7 @@ CFLAGS = -Wall
 CFLAGS += -Wno-nonnull
 CFLAGS += -D_GNU_SOURCE
 
-TEST_GEN_PROGS := execveat
+TEST_GEN_PROGS := execveat omayexec
 TEST_GEN_FILES := execveat.symlink execveat.denatured script subdir
 # Makefile is a run-time dependency, since it's accessed by the execveat test
 TEST_FILES := Makefile
@@ -26,3 +26,5 @@ $(OUTPUT)/execveat.denatured: $(OUTPUT)/execveat
 	cp $< $@
 	chmod -x $@
 
+$(OUTPUT)/omayexec: omayexec.c ../kselftest_harness.h
+	$(CC) $(CFLAGS) -Wl,-no-as-needed $(LDFLAGS) -lcap $< -o $@
diff --git a/tools/testing/selftests/exec/omayexec.c b/tools/testing/selftests/exec/omayexec.c
new file mode 100644
index 000000000000..e4307b5a5417
--- /dev/null
+++ b/tools/testing/selftests/exec/omayexec.c
@@ -0,0 +1,317 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * O_MAYEXEC tests
+ *
+ * Copyright © 2018-2019 ANSSI
+ *
+ * Author: Mickaël Salaün <mic@digikod.net>
+ */
+
+#include <errno.h>
+#include <fcntl.h> /* O_CLOEXEC */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h> /* strlen */
+#include <sys/capability.h>
+#include <sys/mount.h>
+#include <sys/stat.h> /* mkdir */
+#include <unistd.h> /* unlink, rmdir */
+
+#include "../kselftest_harness.h"
+
+#ifndef O_MAYEXEC
+#define O_MAYEXEC      040000000
+#endif
+
+#define SYSCTL_MAYEXEC	"/proc/sys/fs/open_mayexec_enforce"
+
+#define BIN_DIR		"./test-mount"
+#define BIN_PATH	BIN_DIR "/file"
+#define DIR_PATH	BIN_DIR "/directory"
+
+#define ALLOWED		1
+#define DENIED		0
+
+static void ignore_dac(struct __test_metadata *_metadata, int override)
+{
+	cap_t caps;
+	const cap_value_t cap_val[2] = {
+		CAP_DAC_OVERRIDE,
+		CAP_DAC_READ_SEARCH,
+	};
+
+	caps = cap_get_proc();
+	ASSERT_TRUE(!!caps);
+	ASSERT_FALSE(cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_val,
+				override ? CAP_SET : CAP_CLEAR));
+	ASSERT_FALSE(cap_set_proc(caps));
+	EXPECT_FALSE(cap_free(caps));
+}
+
+static void ignore_mac(struct __test_metadata *_metadata, int override)
+{
+	cap_t caps;
+	const cap_value_t cap_val[1] = {
+		CAP_MAC_ADMIN,
+	};
+
+	caps = cap_get_proc();
+	ASSERT_TRUE(!!caps);
+	ASSERT_FALSE(cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_val,
+				override ? CAP_SET : CAP_CLEAR));
+	ASSERT_FALSE(cap_set_proc(caps));
+	EXPECT_FALSE(cap_free(caps));
+}
+
+static void test_omx(struct __test_metadata *_metadata,
+		const char *const path, const int exec_allowed)
+{
+	int fd;
+
+	/* without O_MAYEXEC */
+	fd = open(path, O_RDONLY | O_CLOEXEC);
+	ASSERT_NE(-1, fd);
+	EXPECT_FALSE(close(fd));
+
+	/* with O_MAYEXEC */
+	fd = open(path, O_RDONLY | O_CLOEXEC | O_MAYEXEC);
+	if (exec_allowed) {
+		/* open should succeed */
+		ASSERT_NE(-1, fd);
+		EXPECT_FALSE(close(fd));
+	} else {
+		/* open should return EACCES */
+		ASSERT_EQ(-1, fd);
+		ASSERT_EQ(EACCES, errno);
+	}
+}
+
+static void test_omx_dir_file(struct __test_metadata *_metadata,
+		const char *const dir_path, const char *const file_path,
+		const int exec_allowed)
+{
+	/*
+	 * directory execution is always denied since commit 73601ea5b7b1
+	 * ("fs/open.c: allow opening only regular files during execve()")
+	 */
+	test_omx(_metadata, dir_path, DENIED);
+	test_omx(_metadata, file_path, exec_allowed);
+}
+
+static void test_dir_file(struct __test_metadata *_metadata,
+		const char *const dir_path, const char *const file_path,
+		const int exec_allowed)
+{
+	/* test as root */
+	ignore_dac(_metadata, 1);
+	test_omx_dir_file(_metadata, dir_path, file_path, exec_allowed);
+
+	/* test without bypass */
+	ignore_dac(_metadata, 0);
+	test_omx_dir_file(_metadata, dir_path, file_path, exec_allowed);
+}
+
+static void sysctl_write(struct __test_metadata *_metadata,
+		const char *path, const char *value)
+{
+	int fd;
+	size_t len_value;
+	ssize_t len_wrote;
+
+	fd = open(path, O_WRONLY | O_CLOEXEC);
+	ASSERT_NE(-1, fd);
+	len_value = strlen(value);
+	len_wrote = write(fd, value, len_value);
+	ASSERT_EQ(len_wrote, len_value);
+	EXPECT_FALSE(close(fd));
+}
+
+static void create_workspace(struct __test_metadata *_metadata,
+		int mount_exec, int file_exec)
+{
+	int fd;
+
+	/*
+	 * Cleanup previous workspace if any error previously happened (don't
+	 * check errors).
+	 */
+	umount(BIN_DIR);
+	rmdir(BIN_DIR);
+
+	/* create a clean mount point */
+	ASSERT_FALSE(mkdir(BIN_DIR, 00700));
+	ASSERT_FALSE(mount("test", BIN_DIR, "tmpfs",
+				MS_MGC_VAL | (mount_exec ? 0 : MS_NOEXEC),
+				"mode=0700,size=4k"));
+
+	/* create a test file */
+	fd = open(BIN_PATH, O_CREAT | O_RDONLY | O_CLOEXEC,
+			file_exec ? 00500 : 00400);
+	ASSERT_NE(-1, fd);
+	EXPECT_NE(-1, close(fd));
+
+	/* create a test directory */
+	ASSERT_FALSE(mkdir(DIR_PATH, file_exec ? 00500 : 00400));
+}
+
+static void delete_workspace(struct __test_metadata *_metadata)
+{
+	ignore_mac(_metadata, 1);
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "0");
+
+	/* no need to unlink BIN_PATH nor DIR_PATH */
+	ASSERT_FALSE(umount(BIN_DIR));
+	ASSERT_FALSE(rmdir(BIN_DIR));
+}
+
+FIXTURE_DATA(mount_exec_file_exec) { };
+
+FIXTURE_SETUP(mount_exec_file_exec)
+{
+	create_workspace(_metadata, 1, 1);
+}
+
+FIXTURE_TEARDOWN(mount_exec_file_exec)
+{
+	delete_workspace(_metadata);
+}
+
+TEST_F(mount_exec_file_exec, mount)
+{
+	/* enforce mount exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "1");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+TEST_F(mount_exec_file_exec, file)
+{
+	/* enforce file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "2");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+TEST_F(mount_exec_file_exec, mount_file)
+{
+	/* enforce mount and file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "3");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+FIXTURE_DATA(mount_exec_file_noexec) { };
+
+FIXTURE_SETUP(mount_exec_file_noexec)
+{
+	create_workspace(_metadata, 1, 0);
+}
+
+FIXTURE_TEARDOWN(mount_exec_file_noexec)
+{
+	delete_workspace(_metadata);
+}
+
+TEST_F(mount_exec_file_noexec, mount)
+{
+	/* enforce mount exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "1");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+TEST_F(mount_exec_file_noexec, file)
+{
+	/* enforce file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "2");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST_F(mount_exec_file_noexec, mount_file)
+{
+	/* enforce mount and file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "3");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+FIXTURE_DATA(mount_noexec_file_exec) { };
+
+FIXTURE_SETUP(mount_noexec_file_exec)
+{
+	create_workspace(_metadata, 0, 1);
+}
+
+FIXTURE_TEARDOWN(mount_noexec_file_exec)
+{
+	delete_workspace(_metadata);
+}
+
+TEST_F(mount_noexec_file_exec, mount)
+{
+	/* enforce mount exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "1");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST_F(mount_noexec_file_exec, file)
+{
+	/* enforce file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "2");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, ALLOWED);
+}
+
+TEST_F(mount_noexec_file_exec, mount_file)
+{
+	/* enforce mount and file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "3");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+FIXTURE_DATA(mount_noexec_file_noexec) { };
+
+FIXTURE_SETUP(mount_noexec_file_noexec)
+{
+	create_workspace(_metadata, 0, 0);
+}
+
+FIXTURE_TEARDOWN(mount_noexec_file_noexec)
+{
+	delete_workspace(_metadata);
+}
+
+TEST_F(mount_noexec_file_noexec, mount)
+{
+	/* enforce mount exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "1");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST_F(mount_noexec_file_noexec, file)
+{
+	/* enforce file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "2");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST_F(mount_noexec_file_noexec, mount_file)
+{
+	/* enforce mount and file exec check */
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "3");
+	test_dir_file(_metadata, DIR_PATH, BIN_PATH, DENIED);
+}
+
+TEST(sysctl_access_write)
+{
+	int fd;
+	ssize_t len_wrote;
+
+	ignore_mac(_metadata, 1);
+	sysctl_write(_metadata, SYSCTL_MAYEXEC, "0");
+
+	ignore_mac(_metadata, 0);
+	fd = open(SYSCTL_MAYEXEC, O_WRONLY | O_CLOEXEC);
+	ASSERT_NE(-1, fd);
+	len_wrote = write(fd, "0", 1);
+	ASSERT_EQ(len_wrote, -1);
+	EXPECT_FALSE(close(fd));
+
+	ignore_mac(_metadata, 1);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
index 5336b26506ab..6ae816fa2f62 100644
--- a/tools/testing/selftests/kselftest_harness.h
+++ b/tools/testing/selftests/kselftest_harness.h
@@ -50,7 +50,10 @@
 #ifndef __KSELFTEST_HARNESS_H
 #define __KSELFTEST_HARNESS_H
 
+#ifndef _GNU_SOURCE
 #define _GNU_SOURCE
+#endif
+
 #include <asm/types.h>
 #include <errno.h>
 #include <stdbool.h>
-- 
2.23.0

^ permalink raw reply related

* [PATCH v2 3/5] fs: Enable to enforce noexec mounts or file exec through O_MAYEXEC
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

Enable to either propagate the mount options from the underlying VFS
mount to prevent execution, or to propagate the file execute permission.
This may allow a script interpreter to check execution permissions
before reading commands from a file.

The main goal is to be able to protect the kernel by restricting
arbitrary syscalls that an attacker could perform with a crafted binary
or certain script languages.  It also improves multilevel isolation
by reducing the ability of an attacker to use side channels with
specific code.  These restrictions can natively be enforced for ELF
binaries (with the noexec mount option) but require this kernel
extension to properly handle scripts (e.g., Python, Perl).

Add a new sysctl fs.open_mayexec_enforce to control this behavior.  A
following patch adds documentation.

Changes since v1:
* move code from Yama to the FS subsystem (suggested by Kees Cook)
* make omayexec_inode_permission() static (suggested by Jann Horn)
* use mode 0600 for the sysctl
* only match regular files (not directories nor other types), which
  follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow
  opening only regular files during execve()")

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
---
 fs/namei.c         | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  3 ++
 kernel/sysctl.c    |  7 +++++
 3 files changed, 78 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 0a6b9483d0cb..abd29a76ecef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -39,6 +39,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/sysctl.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -411,6 +412,34 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
 	return 0;
 }
 
+#define OMAYEXEC_ENFORCE_NONE	0
+#define OMAYEXEC_ENFORCE_MOUNT	(1 << 0)
+#define OMAYEXEC_ENFORCE_FILE	(1 << 1)
+#define _OMAYEXEC_LAST		OMAYEXEC_ENFORCE_FILE
+#define _OMAYEXEC_MASK		((_OMAYEXEC_LAST << 1) - 1)
+
+/**
+ * omayexec_inode_permission - check O_MAYEXEC before accessing an inode
+ * @inode: inode structure to check
+ * @mask: permission mask
+ *
+ * Return 0 if access is permitted, -EACCES otherwise.
+ */
+static int omayexec_inode_permission(struct inode *inode, int mask)
+{
+	if (!(mask & MAY_OPENEXEC))
+		return 0;
+
+	if ((sysctl_omayexec_enforce & OMAYEXEC_ENFORCE_MOUNT) &&
+			!(mask & MAY_EXECMOUNT))
+		return -EACCES;
+
+	if (sysctl_omayexec_enforce & OMAYEXEC_ENFORCE_FILE)
+		return generic_permission(inode, MAY_EXEC);
+
+	return 0;
+}
+
 /**
  * inode_permission - Check for access rights to a given inode
  * @inode: Inode to check permission on
@@ -454,10 +483,48 @@ int inode_permission(struct inode *inode, int mask)
 	if (retval)
 		return retval;
 
+	retval = omayexec_inode_permission(inode, mask);
+	if (retval)
+		return retval;
+
 	return security_inode_permission(inode, mask);
 }
 EXPORT_SYMBOL(inode_permission);
 
+/*
+ * Handle open_mayexec_enforce sysctl
+ */
+#ifdef CONFIG_SYSCTL
+int proc_omayexec(struct ctl_table *table, int write, void __user *buffer,
+		size_t *lenp, loff_t *ppos)
+{
+	int error;
+
+	if (write) {
+		struct ctl_table table_copy;
+		int tmp_mayexec_enforce;
+
+		if (!capable(CAP_MAC_ADMIN))
+			return -EPERM;
+		tmp_mayexec_enforce = *((int *)table->data);
+		table_copy = *table;
+		/* do not erase sysctl_omayexec_enforce */
+		table_copy.data = &tmp_mayexec_enforce;
+		error = proc_dointvec(&table_copy, write, buffer, lenp, ppos);
+		if (error)
+			return error;
+		if ((tmp_mayexec_enforce | _OMAYEXEC_MASK) != _OMAYEXEC_MASK)
+			return -EINVAL;
+		*((int *)table->data) = tmp_mayexec_enforce;
+	} else {
+		error = proc_dointvec(table, write, buffer, lenp, ppos);
+		if (error)
+			return error;
+	}
+	return 0;
+}
+#endif
+
 /**
  * path_get - get a reference to a path
  * @path: path to get the reference to
@@ -887,6 +954,7 @@ int sysctl_protected_symlinks __read_mostly = 0;
 int sysctl_protected_hardlinks __read_mostly = 0;
 int sysctl_protected_fifos __read_mostly;
 int sysctl_protected_regular __read_mostly;
+int sysctl_omayexec_enforce __read_mostly = OMAYEXEC_ENFORCE_NONE;
 
 /**
  * may_follow_link - Check symlink following for unsafe situations
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e57609dac8dd..735f5950cfed 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -81,6 +81,7 @@ extern int sysctl_protected_symlinks;
 extern int sysctl_protected_hardlinks;
 extern int sysctl_protected_fifos;
 extern int sysctl_protected_regular;
+extern int sysctl_omayexec_enforce;
 
 typedef __kernel_rwf_t rwf_t;
 
@@ -3452,6 +3453,8 @@ int proc_nr_dentry(struct ctl_table *table, int write,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
 int proc_nr_inodes(struct ctl_table *table, int write,
 		   void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_omayexec(struct ctl_table *table, int write, void __user *buffer,
+		size_t *lenp, loff_t *ppos);
 int __init get_filesystem_list(char *buf);
 
 #define __FMODE_EXEC		((__force int) FMODE_EXEC)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..eaaeb229a828 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1911,6 +1911,13 @@ static struct ctl_table fs_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &two,
 	},
+	{
+		.procname       = "open_mayexec_enforce",
+		.data           = &sysctl_omayexec_enforce,
+		.maxlen         = sizeof(int),
+		.mode           = 0600,
+		.proc_handler   = proc_omayexec,
+	},
 #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
 	{
 		.procname	= "binfmt_misc",
-- 
2.23.0

^ permalink raw reply related

* [PATCH v2 2/5] fs: Add a MAY_EXECMOUNT flag to infer the noexec mount propertie
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

An LSM doesn't get path information related to an access request to open
an inode.  This new (internal) MAY_EXECMOUNT flag enables an LSM to
check if the underlying mount point of an inode is marked as executable.
This is useful to implement a security policy taking advantage of the
noexec mount option.

This flag is set according to path_noexec(), which checks if a mount
point is mounted with MNT_NOEXEC or if the underlying superblock is
SB_I_NOEXEC.

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
---
 fs/namei.c         | 2 ++
 include/linux/fs.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..0a6b9483d0cb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2968,6 +2968,8 @@ static int may_open(const struct path *path, int acc_mode, int flag)
 		break;
 	}
 
+	/* Pass the mount point executability. */
+	acc_mode |= path_noexec(path) ? 0 : MAY_EXECMOUNT;
 	error = inode_permission(inode, MAY_OPEN | acc_mode);
 	if (error)
 		return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 848f5711bdf0..e57609dac8dd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -101,6 +101,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define MAY_NOT_BLOCK		0x00000080
 /* the inode is opened with O_MAYEXEC */
 #define MAY_OPENEXEC		0x00000100
+/* the mount point is marked as executable */
+#define MAY_EXECMOUNT		0x00000200
 
 /*
  * flags in file.f_mode.  Note that FMODE_READ and FMODE_WRITE must correspond
-- 
2.23.0

^ permalink raw reply related

* [PATCH v2 1/5] fs: Add support for an O_MAYEXEC flag on sys_open()
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar
In-Reply-To: <20190906152455.22757-1-mic@digikod.net>

When the O_MAYEXEC flag is passed, sys_open() may be subject to
additional restrictions depending on a security policy implemented by an
LSM through the inode_permission hook.

The underlying idea is to be able to restrict scripts interpretation
according to a policy defined by the system administrator.  For this to
be possible, script interpreters must use the O_MAYEXEC flag
appropriately.  To be fully effective, these interpreters also need to
handle the other ways to execute code (for which the kernel can't help):
command line parameters (e.g., option -e for Perl), module loading
(e.g., option -m for Python), stdin, file sourcing, environment
variables, configuration files...  According to the threat model, it may
be acceptable to allow some script interpreters (e.g. Bash) to interpret
commands from stdin, may it be a TTY or a pipe, because it may not be
enough to (directly) perform syscalls.

A simple security policy implementation is available in a following
patch for Yama.

This is an updated subset of the patch initially written by Vincent
Strubel for CLIP OS:
https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
This patch has been used for more than 10 years with customized script
interpreters.  Some examples can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC

Changes since v1:
* set __FMODE_EXEC when using O_MAYEXEC to make this information
  available through the new fanotify/FAN_OPEN_EXEC event (suggested by
  Jan Kara and Matthew Bobrowski)

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
Reviewed-by: Philippe Trébuchet <philippe.trebuchet@ssi.gouv.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mickaël Salaün <mickael.salaun@ssi.gouv.fr>
---
 fs/fcntl.c                       | 2 +-
 fs/open.c                        | 6 ++++++
 include/linux/fcntl.h            | 2 +-
 include/linux/fs.h               | 2 ++
 include/uapi/asm-generic/fcntl.h | 3 +++
 5 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..4cf05a2fd162 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/open.c b/fs/open.c
index a59abe3c669a..1b9b6fedf7cd 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -989,6 +989,12 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		acc_mode = 0;
 	}
 
+	/* Check execution permissions on open. */
+	if (flags & O_MAYEXEC) {
+		acc_mode |= MAY_OPENEXEC;
+		flags |= __FMODE_EXEC;
+	}
+
 	op->open_flag = flags;
 
 	/* O_TRUNC implies we need access checks for write permissions */
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index d019df946cb2..af88fb6c8313 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -9,7 +9,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_MAYEXEC)
 
 #ifndef force_o_largefile
 #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 997a530ff4e9..848f5711bdf0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -99,6 +99,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define MAY_CHDIR		0x00000040
 /* called from RCU mode, don't block */
 #define MAY_NOT_BLOCK		0x00000080
+/* the inode is opened with O_MAYEXEC */
+#define MAY_OPENEXEC		0x00000100
 
 /*
  * flags in file.f_mode.  Note that FMODE_READ and FMODE_WRITE must correspond
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..cbb9425d6e7c 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,9 @@
 #define O_NDELAY	O_NONBLOCK
 #endif
 
+/* command execution from file is intended, check exec permissions */
+#define O_MAYEXEC	040000000
+
 #define F_DUPFD		0	/* dup */
 #define F_GETFD		1	/* get close_on_exec */
 #define F_SETFD		2	/* set/clear close_on_exec */
-- 
2.23.0

^ permalink raw reply related

* [PATCH v2 0/5] Add support for O_MAYEXEC
From: Mickaël Salaün @ 2019-09-06 15:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
	Al Viro, Andy Lutomirski, Christian Heimes, Daniel Borkmann,
	Eric Chiang, Florian Weimer, James Morris, Jan Kara, Jann Horn,
	Jonathan Corbet, Kees Cook, Matthew Garrett, Matthew Wilcox,
	Michael Kerrisk, Mickaël Salaün, Mimi Zohar

Hi,

The goal of this patch series is to control script interpretation.  A
new O_MAYEXEC flag used by sys_open() is added to enable userspace
script interpreter to delegate to the kernel (and thus the system
security policy) the permission to interpret/execute scripts or other
files containing what can be seen as commands.

This second series mainly differ from the previous one [1] by moving the
basic security policy from Yama to the filesystem subsystem.  This
policy can be enforced by the system administrator through a sysctl
configuration consistent with the mount points.

Furthermore, the security policy can also be delegated to an LSM, either
a MAC system or an integrity system.  For instance, the new kernel
MAY_OPENEXEC flag closes a major IMA measurement/appraisal interpreter
integrity gap by bringing the ability to check the use of scripts [2].
Other uses are expected, such as for openat2(2) [3], SGX integration
[4], and bpffs [5].

Userspace need to adapt to take advantage of this new feature.  For
example, the PEP 578 [6] (Runtime Audit Hooks) enables Python 3.8 to be
extended with policy enforcement points related to code interpretation,
which can be used to align with the PowerShell audit features.
Additional Python security improvements (e.g. a limited interpreter
withou -c, stdin piping of code) are on their way.

The initial idea come from CLIP OS and the original implementation has
been used for more than 10 years:
https://github.com/clipos-archive/clipos4_doc

An introduction to O_MAYEXEC was given at the Linux Security Summit
Europe 2018 - Linux Kernel Security Contributions by ANSSI:
https://www.youtube.com/watch?v=chNjCRtPKQY&t=17m15s
The "write xor execute" principle was explained at Kernel Recipes 2018 -
CLIP OS: a defense-in-depth OS:
https://www.youtube.com/watch?v=PjRE0uBtkHU&t=11m14s

This patch series can be applied on top of v5.3-rc7.  This can be tested
with CONFIG_SYSCTL.  I would really appreciate constructive comments on
this patch series.

# Changes since v1

* move code from Yama to the FS subsystem
* set __FMODE_EXEC when using O_MAYEXEC to make this information
  available through the new fanotify/FAN_OPEN_EXEC event
* only match regular files (not directories nor other types), which
  follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow
  opening only regular files during execve()")
* improve tests

[1] https://lore.kernel.org/lkml/20181212081712.32347-1-mic@digikod.net/
[2] https://lore.kernel.org/lkml/1544647356.4028.105.camel@linux.ibm.com/
[3] https://lore.kernel.org/lkml/20190904201933.10736-6-cyphar@cyphar.com/
[4] https://lore.kernel.org/lkml/CALCETrVovr8XNZSroey7pHF46O=kj_c5D9K8h=z2T_cNrpvMig@mail.gmail.com/
[5] https://lore.kernel.org/lkml/CALCETrVeZ0eufFXwfhtaG_j+AdvbzEWE0M3wjXMWVEO7pj+xkw@mail.gmail.com/
[6] https://www.python.org/dev/peps/pep-0578/

Regards,

Mickaël Salaün (5):
  fs: Add support for an O_MAYEXEC flag on sys_open()
  fs: Add a MAY_EXECMOUNT flag to infer the noexec mount propertie
  fs: Enable to enforce noexec mounts or file exec through O_MAYEXEC
  selftest/exec: Add tests for O_MAYEXEC enforcing
  doc: Add documentation for the fs.open_mayexec_enforce sysctl

 Documentation/admin-guide/sysctl/fs.rst     |  43 +++
 fs/fcntl.c                                  |   2 +-
 fs/namei.c                                  |  70 +++++
 fs/open.c                                   |   6 +
 include/linux/fcntl.h                       |   2 +-
 include/linux/fs.h                          |   7 +
 include/uapi/asm-generic/fcntl.h            |   3 +
 kernel/sysctl.c                             |   7 +
 tools/testing/selftests/exec/.gitignore     |   1 +
 tools/testing/selftests/exec/Makefile       |   4 +-
 tools/testing/selftests/exec/omayexec.c     | 317 ++++++++++++++++++++
 tools/testing/selftests/kselftest_harness.h |   3 +
 12 files changed, 462 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/exec/omayexec.c

-- 
2.23.0

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: David Howells @ 2019-09-06 10:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wjt2Eb+yEDOcQwCa0SrZ4cWu967OtQG8Vz21c=n5ZP1Nw@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But it's *literally* just finding the places that work with
> pipe->curbuf/nrbufs and making them use atomic updates.

No.  It really isn't.  That's two variables that describe the occupied section
of the buffer.  Unless you have something like a 68020 with CAS2, or put them
next to each other so you can use CMPXCHG8, you can't do that.

They need converting to head/tail pointers first.

> They really would work with almost anything. You could even mix-and-match
> "data generated by kernel" and "data done by 'write()' or 'splice()' by a
> user process".

Imagine that userspace writes a large message and takes the mutex.  At the
same time something in softirq context decides *it* wants to write a message -
it can't take the mutex and it can't wait, so the userspace write would have
to cause the kernel message to be dropped.

What I would have to do is make a write to a notification pipe go through
post_notification() and limit the size to the maximum for a single message.

Much easier to simply suppress writes and splices on pipes that have been set
up to be notification queues - at least for now.

David

^ permalink raw reply

* Re: [PATCH 01/13] vfs: verify param type in vfs_parse_sb_flag()
From: Miklos Szeredi @ 2019-09-06  7:28 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Miklos Szeredi, Ian Kent, Linux API, linux-fsdevel,
	linux-kernel
In-Reply-To: <11485.1562257188@warthog.procyon.org.uk>

On Thu, Jul 4, 2019 at 6:20 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Ping?  Have you had a chance of looking at this series?
>
> Yeah, through due to time pressure, I haven't managed to do much with it.
>
> I don't agree with all your changes, and also I'd like them to wait till after
> the branch of mount API filesystem conversions that I've given to Al has had a
> chance to hopefully go in in this merge window, along with whatever changes Al
> has made to it.

Ping?

Thanks,
Miklos

^ permalink raw reply

* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Christian Brauner @ 2019-09-06  7:00 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Christian Brauner, Rasmus Villemoes,
	Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg
In-Reply-To: <20190905195618.pwzgvuzadkfpznfz@yavin.dot.cyphar.com>

On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> > 
> > > Because every caller of that function right now has that limit set
> > > anyway iirc. So we can either remove it from here and place it back for
> > > the individual callers or leave it in the helper.
> > > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > > bound on the size (for a long time probably) or are you disagreeing with
> > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > > bpf, and clone3 and in a few other places.
> > 
> > For a primitive that can be safely used with any size (OK, any within
> > the usual 2Gb limit)?  Why push the random policy into the place where
> > it doesn't belong?
> > 
> > Seriously, what's the point?  If they want to have a large chunk of
> > userland memory zeroed or checked for non-zeroes - why would that
> > be a problem?
> 
> Thinking about it some more, there isn't really any r/w amplification --
> so there isn't much to gain by passing giant structs. Though, if we are
> going to permit 2GB buffers, isn't that also an argument to use
> memchr_inv()? :P

I think we should just do a really dumb, easy to understand minimal
thing for now. It could even just be what every caller is doing right
now anyway with the get_user() loop.
The only way to settle memchr_inv() vs all the other clever ways
suggested here is to benchmark it and see if it matters *for the current
users* of this helper. If it does, great we can do it. If it doesn't why
bother having that argument right now?
Once we somehow end up in a possible world where we apparently have
decided it's a great idea to copy 2GB argument structures for a syscall
into or from the kernel we can start optimizing the hell out of this.
Before that and especially with current callers I honestly doubt it
matters whether we use memchr_inv() or while() {get_user()} loops.

Christian

^ permalink raw reply

* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Al Viro @ 2019-09-06  0:14 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Rasmus Villemoes, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Alexander
In-Reply-To: <20190905234944.GT1131@ZenIV.linux.org.uk>

On Fri, Sep 06, 2019 at 12:49:44AM +0100, Al Viro wrote:
> On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > > +			return -EFAULT;
> > > > +	}
> > > > +	/* Copy the interoperable parts of the struct. */
> > > > +	if (__copy_to_user(dst, src, size))
> > > > +		return -EFAULT;
> > > 
> > > Why not simply clear_user() and copy_to_user()?
> > 
> > I'm not sure I understand what you mean -- are you asking why we need to
> > do memchr_inv(src + size, 0, rest) earlier?
> 
> I'm asking why bother with __ and separate access_ok().
> 
> > > 	if ((unsigned long)addr & 1) {
> > > 		u8 v;
> > > 		if (get_user(v, (__u8 __user *)addr))
> > > 			return -EFAULT;
> > > 		if (v)
> > > 			return -E2BIG;
> > > 		addr++;
> > > 	}
> > > 	if ((unsigned long)addr & 2) {
> > > 		u16 v;
> > > 		if (get_user(v, (__u16 __user *)addr))
> > > 			return -EFAULT;
> > > 		if (v)
> > > 			return -E2BIG;
> > > 		addr +=2;
> > > 	}
> > > 	if ((unsigned long)addr & 4) {
> > > 		u32 v;
> > > 		if (get_user(v, (__u32 __user *)addr))
> > > 			return -EFAULT;
> > > 		if (v)
> > > 			return -E2BIG;
> > > 	}
> > > 	<read the rest like you currently do>
> 
> Actually, this is a dumb way to do it - page size on anything
> is going to be a multiple of 8, so you could just as well
> read 8 bytes from an address aligned down.  Then mask the
> bytes you don't want to check out and see if there's anything
> left.
> 
> You can have readability boundaries inside a page - it's either
> the entire page (let alone a single word) being readable, or
> it's EFAULT for all parts.
> 
> > > would be saner, and things like x86 could trivially add an
> > > asm variant - it's not hard.  Incidentally, memchr_inv() is
> > > an overkill in this case...
> > 
> > Why is memchr_inv() overkill?
> 
> Look at its implementation; you only care if there are
> non-zeroes, you don't give a damn where in the buffer
> the first one would be.  All you need is the same logics
> as in "from userland" case
> 	if (!count)
> 		return true;
> 	offset = (unsigned long)from & 7
> 	p = (u64 *)(from - offset);
> 	v = *p++;
> 	if (offset) {	// unaligned
> 		count += offset;
> 		v &= ~aligned_byte_mask(offset); // see strnlen_user.c
> 	}
> 	while (count > 8) {
> 		if (v)
> 			return false;
> 		v = *p++;
> 		count -= 8;
> 	}
> 	if (count != 8)
> 		v &= aligned_byte_mask(count);
> 	return v == 0;
> 
> All there is to it...

... and __user case would be pretty much this with
	if (user_access_begin(from, count)) {
		....
		user_access_end();
	}
wrapped around the damn thing - again, see strnlen_user.c, with
	unsafe_get_user(v, p++, efault);
instead of those
	v = *p++;

Calling conventions might need some thinking - it might be
	* all read, all zeroes
	* non-zero found
	* read failed
so we probably want to map the "all zeroes" case to 0,
"read failed" to -EFAULT and "non-zero found" to something
else.  Might be positive, might be some other -E.... - not
sure if E2BIG (or EFBIG) makes much sense here.  Need to
look at the users...

^ permalink raw reply

* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Aleksa Sarai @ 2019-09-06  0:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Rasmus Villemoes, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Alexander
In-Reply-To: <20190905234944.GT1131@ZenIV.linux.org.uk>

[-- Attachment #1: Type: text/plain, Size: 2932 bytes --]

On 2019-09-06, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > > +			return -EFAULT;
> > > > +	}
> > > > +	/* Copy the interoperable parts of the struct. */
> > > > +	if (__copy_to_user(dst, src, size))
> > > > +		return -EFAULT;
> > > 
> > > Why not simply clear_user() and copy_to_user()?
> > 
> > I'm not sure I understand what you mean -- are you asking why we need to
> > do memchr_inv(src + size, 0, rest) earlier?
> 
> I'm asking why bother with __ and separate access_ok().

Ah right, it was a dumb "optimisation" (since we need to do access_ok()
anyway since we should early -EFAULT in that case). I've dropped the __
usages in my working copy.

> > > 	if ((unsigned long)addr & 1) {
> > > 		u8 v;
> > > 		if (get_user(v, (__u8 __user *)addr))
> > > 			return -EFAULT;
> > > 		if (v)
> > > 			return -E2BIG;
> > > 		addr++;
> > > 	}
> > > 	if ((unsigned long)addr & 2) {
> > > 		u16 v;
> > > 		if (get_user(v, (__u16 __user *)addr))
> > > 			return -EFAULT;
> > > 		if (v)
> > > 			return -E2BIG;
> > > 		addr +=2;
> > > 	}
> > > 	if ((unsigned long)addr & 4) {
> > > 		u32 v;
> > > 		if (get_user(v, (__u32 __user *)addr))
> > > 			return -EFAULT;
> > > 		if (v)
> > > 			return -E2BIG;
> > > 	}
> > > 	<read the rest like you currently do>
> 
> Actually, this is a dumb way to do it - page size on anything
> is going to be a multiple of 8, so you could just as well
> read 8 bytes from an address aligned down.  Then mask the
> bytes you don't want to check out and see if there's anything
> left.
> 
> You can have readability boundaries inside a page - it's either
> the entire page (let alone a single word) being readable, or
> it's EFAULT for all parts.
> 
> > > would be saner, and things like x86 could trivially add an
> > > asm variant - it's not hard.  Incidentally, memchr_inv() is
> > > an overkill in this case...
> > 
> > Why is memchr_inv() overkill?
> 
> Look at its implementation; you only care if there are
> non-zeroes, you don't give a damn where in the buffer
> the first one would be.  All you need is the same logics
> as in "from userland" case
> 	if (!count)
> 		return true;
> 	offset = (unsigned long)from & 7
> 	p = (u64 *)(from - offset);
> 	v = *p++;
> 	if (offset) {	// unaligned
> 		count += offset;
> 		v &= ~aligned_byte_mask(offset); // see strnlen_user.c
> 	}
> 	while (count > 8) {
> 		if (v)
> 			return false;
> 		v = *p++;
> 		count -= 8;
> 	}
> 	if (count != 8)
> 		v &= aligned_byte_mask(count);
> 	return v == 0;
> 
> All there is to it...

Alright, will do (for some reason I hadn't made the connection that
memchr_inv() is doing effectively the same word-by-word comparison but
also detecting where the first byte is).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-06  0:07 UTC (permalink / raw)
  To: David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <14883.1567725508@warthog.procyon.org.uk>

On Thu, Sep 5, 2019 at 4:18 PM David Howells <dhowells@redhat.com> wrote:
>
> Can you write into a pipe from softirq context and/or with spinlocks held
> and/or with the RCU read lock held?  That is a requirement.  Another is that
> messages get inserted whole or not at all (or if they are truncated, the size
> field gets updated).

Right now we use a mutex for the buffer locking, so no, pipe buffers
are not irq-safe or atomic. That's due to the whole "we may block on
data from user space" when doing a write.

HOWEVER.

Pipes actually have buffers on two different levels: there's the
actual data buffers themselves (each described by a "struct
pipe_buffer"), and there's the circular queue of them (the
"pipe->buf[]" array, with pipe->curbuf/nrbufs) that points to
individual data buffers.

And we could easily separate out that data buffer management. Right
now it's not really all that separated: people just do things like

        int newbuf = (pipe->curbuf + bufs) & (pipe->buffers-1);
        struct pipe_buffer *buf = pipe->bufs + newbuf;
...
        pipe->nrbufs++;

to add a buffer into that circular array of buffers, but _that_ part
could be made separate.  It's just all protected by the pipe mutex
right now, so it has never been an issue.

And yes, atomicity of writes has actually been an integral part of
pipes since forever. It's actually the only unambiguous atomicity that
POSIX guarantees. It only holds for writes to pipes() of less than
PIPE_BUF blocks, but that's 4096 on Linux.

> Since one end would certainly be attached to an fd, it looks on the face of it
> that writing into the pipe would require taking pipe->mutex.

That's how the normal synchronization is done, yes. And changing that
in general would be pretty painful. For example, two concurrent
user-space writers might take page faults and just generally be
painful, and the pipe locking needs to serialize that.

So the mutex couldn't go away from pipes in general - it would remain
for read/write/splice mutual exclusion (and it's not just the data it
protects, it's the reader/writer logic for EPIPE etc).

But the low-level pipe->bufs[] handling is another issue entirely.
Even when a user space writer copies things from user space, it does
so into a pre-allocated buffer that is then attached to the list of
buffers somewhat separately (there's a magical special case where you
can re-use a buffer that is marked as "I can be reused" and append
into an already allocated buffer).

And adding new buffers *could* be done with it's own separate locking.
If you have a blocking writer (ie a user space data source), that
would still take the pipe mutex, and it would delay the user space
readers (because the readers also need the mutex), but it should not
be all that hard to just make the whole "curbuf/nrbufs" handling use
its own locking (maybe even some lockless atomics and cmpxchg).

So a kernel writer could "insert" a "struct pipe_buffer" atomically,
and wake up the reader atomically. No need for the other complexity
that is protected by the mutex.

The buggest problem is perhaps that the number of pipe buffers per
pipe is fairly limited by default. PIPE_DEF_BUFFERS is 16, and if we'd
insert using the ->bufs[] array, that would be the limit of "number of
messages". But each message could be any size (we've historically
limited pipe buffers to one page each, but that limit isn't all that
hard. You could put more data in there).

The number of pipe buffers _is_ dynamic, so the above PIPE_DEF_BUFFERS
isn't a hard limit, but it would be the default.

Would it be entirely trivial to do all the above? No. But it's
*literally* just finding the places that work with pipe->curbuf/nrbufs
and making them use atomic updates. You'd find all the places by just
renaming them (and making them atomic or whatever) and the compiler
will tell you "this area needs fixing".

We've actually used pipes for messages before: autofs uses a magic
packetized pipe buffer thing. It didn't need any extra atomicity,
though, so it stil all worked with the regular pipe->mutex thing.

And there is a big advantage from using pipes. They really would work
with almost anything. You could even mix-and-match "data generated by
kernel" and "data done by 'write()' or 'splice()' by a user process".

NOTE! I'm not at all saying that pipes are perfect. You'll find people
who swear by sockets instead. They have their own advantages (and
disadvantages). Most people who do packet-based stuff tend to prefer
sockets, because those have standard packet-based models (Linux pipes
have that packet mode too, but it's certainly not standard, and I'm
not even sure we ever exposed it to user space - it could be that it's
only used by the autofs daemon).

I have a soft spot for pipes, just because I think they are simpler
than sockets. But that soft spot might be misplaced.

                   Linus

^ permalink raw reply

* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Al Viro @ 2019-09-05 23:49 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Rasmus Villemoes, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Alexander
In-Reply-To: <20190905230003.bek7vqdvruzi4ybx@yavin.dot.cyphar.com>

On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > +			return -EFAULT;
> > > +	}
> > > +	/* Copy the interoperable parts of the struct. */
> > > +	if (__copy_to_user(dst, src, size))
> > > +		return -EFAULT;
> > 
> > Why not simply clear_user() and copy_to_user()?
> 
> I'm not sure I understand what you mean -- are you asking why we need to
> do memchr_inv(src + size, 0, rest) earlier?

I'm asking why bother with __ and separate access_ok().

> > 	if ((unsigned long)addr & 1) {
> > 		u8 v;
> > 		if (get_user(v, (__u8 __user *)addr))
> > 			return -EFAULT;
> > 		if (v)
> > 			return -E2BIG;
> > 		addr++;
> > 	}
> > 	if ((unsigned long)addr & 2) {
> > 		u16 v;
> > 		if (get_user(v, (__u16 __user *)addr))
> > 			return -EFAULT;
> > 		if (v)
> > 			return -E2BIG;
> > 		addr +=2;
> > 	}
> > 	if ((unsigned long)addr & 4) {
> > 		u32 v;
> > 		if (get_user(v, (__u32 __user *)addr))
> > 			return -EFAULT;
> > 		if (v)
> > 			return -E2BIG;
> > 	}
> > 	<read the rest like you currently do>

Actually, this is a dumb way to do it - page size on anything
is going to be a multiple of 8, so you could just as well
read 8 bytes from an address aligned down.  Then mask the
bytes you don't want to check out and see if there's anything
left.

You can have readability boundaries inside a page - it's either
the entire page (let alone a single word) being readable, or
it's EFAULT for all parts.

> > would be saner, and things like x86 could trivially add an
> > asm variant - it's not hard.  Incidentally, memchr_inv() is
> > an overkill in this case...
> 
> Why is memchr_inv() overkill?

Look at its implementation; you only care if there are
non-zeroes, you don't give a damn where in the buffer
the first one would be.  All you need is the same logics
as in "from userland" case
	if (!count)
		return true;
	offset = (unsigned long)from & 7
	p = (u64 *)(from - offset);
	v = *p++;
	if (offset) {	// unaligned
		count += offset;
		v &= ~aligned_byte_mask(offset); // see strnlen_user.c
	}
	while (count > 8) {
		if (v)
			return false;
		v = *p++;
		count -= 8;
	}
	if (count != 8)
		v &= aligned_byte_mask(count);
	return v == 0;

All there is to it...

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: David Howells @ 2019-09-05 23:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <CAHk-=wgbCXea1a9OTWgMMvcsCGGiNiPp+ty-edZrBWn63NCYdw@mail.gmail.com>

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> But I know - we *have* one of those. There's already a system call for
> it, and has been forever. One that we then extended to allow people to
> change the buffer size, and do a lot of other things with.
> 
> It's called "pipe()". And you can give the writing side to other user
> space processes too, in case you are running an older kernel that
> didn't have some "event pipe support". It comes with resource
> management, because people already use those things.

Can you write into a pipe from softirq context and/or with spinlocks held
and/or with the RCU read lock held?  That is a requirement.  Another is that
messages get inserted whole or not at all (or if they are truncated, the size
field gets updated).

Since one end would certainly be attached to an fd, it looks on the face of it
that writing into the pipe would require taking pipe->mutex.

David

^ permalink raw reply

* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Aleksa Sarai @ 2019-09-05 23:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, David Howells,
	Shuah Khan, Shuah Khan, Ingo Molnar, Peter Zijlstra,
	Christian Brauner, Rasmus Villemoes, Eric Biederman,
	Andy Lutomirski, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Alexander
In-Reply-To: <20190905180750.GQ1131@ZenIV.linux.org.uk>

[-- Attachment #1: Type: text/plain, Size: 3988 bytes --]

On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > +	const char zeros[BUFFER_SIZE] = {};
> > +	while (s > 0) {
> > +		size_t n = min(s, sizeof(zeros));
> > +
> > +		if (__copy_to_user(p, zeros, n))
> > +			return -EFAULT;
> > +
> > +		p += n;
> > +		s -= n;
> > +	}
> > +	return 0;
> > +}
> 
> That's called clear_user().

Already switched, I didn't know about clear_user() -- I assumed it
would've been called bzero_user() or memzero_user() and didn't find it
when looking.

> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > +			const void *src, size_t ksize)
> > +{
> > +	size_t size = min(ksize, usize);
> > +	size_t rest = abs(ksize - usize);
> > +
> > +	if (unlikely(usize > PAGE_SIZE))
> > +		return -EFAULT;
> 
> Why?
> 
> > +	} else if (usize > ksize) {
> > +		if (__memzero_user(dst + size, rest))
> > +			return -EFAULT;
> > +	}
> > +	/* Copy the interoperable parts of the struct. */
> > +	if (__copy_to_user(dst, src, size))
> > +		return -EFAULT;
> 
> Why not simply clear_user() and copy_to_user()?

I'm not sure I understand what you mean -- are you asking why we need to
do memchr_inv(src + size, 0, rest) earlier?

> 
> > +int copy_struct_from_user(void *dst, size_t ksize,
> > +			  const void __user *src, size_t usize)
> > +{
> > +	size_t size = min(ksize, usize);
> > +	size_t rest = abs(ksize - usize);
> 
> Cute, but... you would be just as well without that 'rest' thing.

I would argue it's harder to mess up using "rest" compared to getting
"ksize - usize" and "usize - ksize" mixed up (and it's a bit more
readable).

> > +
> > +	if (unlikely(usize > PAGE_SIZE))
> > +		return -EFAULT;
> 
> Again, why?

As discussed in a sister thread, I will leave this in the callers
(though I would argue callers should always do some kind of sanity check
like this).

> 
> > +	if (unlikely(!access_ok(src, usize)))
> > +		return -EFAULT;
> 
> Why not simply copy_from_user() here?
> 
> > +	/* Deal with trailing bytes. */
> > +	if (usize < ksize)
> > +		memset(dst + size, 0, rest);
> > +	else if (usize > ksize) {
> > +		const void __user *addr = src + size;
> > +		char buffer[BUFFER_SIZE] = {};
> > +
> > +		while (rest > 0) {
> > +			size_t bufsize = min(rest, sizeof(buffer));
> > +
> > +			if (__copy_from_user(buffer, addr, bufsize))
> > +				return -EFAULT;
> > +			if (memchr_inv(buffer, 0, bufsize))
> > +				return -E2BIG;
> 
> Frankly, that looks like a candidate for is_all_zeroes_user().
> With the loop like above serving as a dumb default.  And on
> badly alighed address it _will_ be dumb.  Probably too much
> so - something like
> 	if ((unsigned long)addr & 1) {
> 		u8 v;
> 		if (get_user(v, (__u8 __user *)addr))
> 			return -EFAULT;
> 		if (v)
> 			return -E2BIG;
> 		addr++;
> 	}
> 	if ((unsigned long)addr & 2) {
> 		u16 v;
> 		if (get_user(v, (__u16 __user *)addr))
> 			return -EFAULT;
> 		if (v)
> 			return -E2BIG;
> 		addr +=2;
> 	}
> 	if ((unsigned long)addr & 4) {
> 		u32 v;
> 		if (get_user(v, (__u32 __user *)addr))
> 			return -EFAULT;
> 		if (v)
> 			return -E2BIG;
> 	}
> 	<read the rest like you currently do>
> would be saner, and things like x86 could trivially add an
> asm variant - it's not hard.  Incidentally, memchr_inv() is
> an overkill in this case...

Why is memchr_inv() overkill?

But yes, breaking this out to an asm-generic is_all_zeroes_user()
wouldn't hurt -- and I'll put a cleaned-up version of the alignment
handling there too. Should I drop it in asm-generic/uaccess.h, or
somewhere else?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers
From: Al Viro @ 2019-09-05 22:31 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Christian Brauner, Rasmus Villemoes,
	Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min
In-Reply-To: <20190905195618.pwzgvuzadkfpznfz@yavin.dot.cyphar.com>

On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> > 
> > > Because every caller of that function right now has that limit set
> > > anyway iirc. So we can either remove it from here and place it back for
> > > the individual callers or leave it in the helper.
> > > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > > bound on the size (for a long time probably) or are you disagreeing with
> > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > > bpf, and clone3 and in a few other places.
> > 
> > For a primitive that can be safely used with any size (OK, any within
> > the usual 2Gb limit)?  Why push the random policy into the place where
> > it doesn't belong?
> > 
> > Seriously, what's the point?  If they want to have a large chunk of
> > userland memory zeroed or checked for non-zeroes - why would that
> > be a problem?
> 
> Thinking about it some more, there isn't really any r/w amplification --
> so there isn't much to gain by passing giant structs. Though, if we are
> going to permit 2GB buffers, isn't that also an argument to use
> memchr_inv()? :P

I'm not sure I understand the last bit.  If you look at what copy_from_user()
does on misaligned source/destination, especially on architectures that
really, really do not like unaligned access...

Case in point: alpha (and it's not unusual in that respect).  What it boils
down to is
	copy bytes until the destination is aligned
	if source and destination are both aligned
		copy word by word
	else
		read word by word, storing the mix of two adjacent words
	copy the rest byte by byte

The unpleasant case (to and from having different remainders modulo 8) is
basically

	if (count >= 8) {
		u64 *aligned = (u64 *)(from & ~7);
		u64 *dest = (u64 *)to;
		int bitshift = (from & 7) * 8;
		u64 prev, next;

		prev = aligned[0];
		do {   
			next = aligned[1];
			prev <<= bitshift;
			prev |= next >> (64 - bitshift);
			*dest++ = prev;
			aligned++;  
			prev = next;
			from += 8;
			to += 8;
			count -= 8;
		} while (count >= 8);
	}

Now, mix that with "... and do memchr_inv() on the copy to find if we'd
copied any non-zeroes, nevermind where" and it starts looking really
ridiculous.

We should just read the fscking source, aligned down to word boundary
and check each word being read.  The first and the last ones - masked.
All there is to it.  On almost all architectures that'll work well
enough; s390 might want something more elaborate (there even word-by-word
copies are costly, but I'd suggest talking to them for details).

Something like bool all_zeroes_user(const void __user *p, size_t count)
would probably be a sane API...

^ permalink raw reply

* Re: Why add the general notification queue and its sources
From: Linus Torvalds @ 2019-09-05 22:08 UTC (permalink / raw)
  To: David Howells
  Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse,
	Nicolas Dichtel, raven, keyrings, linux-usb, linux-block,
	Christian Brauner, LSM List, linux-fsdevel, Linux API,
	Linux List Kernel Mailing, Al Viro, Ray, Debarshi, Robbie Harwood
In-Reply-To: <5396.1567719164@warthog.procyon.org.uk>

On Thu, Sep 5, 2019 at 2:32 PM David Howells <dhowells@redhat.com> wrote:
>
>  (1) /dev/watch_queue just implements locked-in-memory buffers.  It gets you
>      no events by simply opening it.

Cool. In-memory buffers.

But I know - we *have* one of those. There's already a system call for
it, and has been forever. One that we then extended to allow people to
change the buffer size, and do a lot of other things with.

It's called "pipe()". And you can give the writing side to other user
space processes too, in case you are running an older kernel that
didn't have some "event pipe support". It comes with resource
management, because people already use those things.

If you want to make a message protocol on top of it, it has cool
atomicity guarantees for any message size less than PIPE_BUF, but to
make things simple, maybe just say "fixed record sizes of 64 bytes" or
something like that for events.

Then you can use them from things like perl scripts, not just magical
C programs.

Why do we need a new kind of super-fancy high-speed thing for event reporting?

If you have *so* many events that pipe handling is a performance
issue, you have something seriously wrong going on.

So no. I'm not interested in a magical memory-mapped pipe that is
actually more limited than the real thing.

                  Linus

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox