Linux userland API discussions

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Josh Triplett @ 2014-11-16 19:09 UTC
  To: Theodore Ts'o, Eric W. Biederman, Andy Lutomirski,
	Andrew Morton, Kees Cook, Michael Kerrisk-manpages, Linux API,
	linux-man, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20141116133230.GA32030-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>

On Sun, Nov 16, 2014 at 08:32:30AM -0500, Theodore Ts'o wrote:
> On Sat, Nov 15, 2014 at 09:08:07PM -0600, Eric W. Biederman wrote:
> > That may be a bug with the user namespace permission check.  Perhaps we
> > shouldn't allow dropping groups that aren't mapped in the user
> > namespace.
> 
> I'm not saying that we can't change the behavior of whether or not a
> user can drop a group permission.  I'm just saying that we need to do
> so consciously.

Agreed.

> The setgroups()/getgroups() ABI isn't part of
> POSIX/SuSv3 so we wouldn't be breaking POSIX compatibility, for those
> people who care about that.

POSIX.1-2001 actually specifies getgroups, but not setgroups.  In any
case, yes, POSIX doesn't say anything about this behavior.

> The bigger deal is that it's very different from how BSD 4.x has
> handled things, which means there are two decades of history that we're
> looking at here.  And there are times when taking away permissions in
> an unexpected fashion can cause security problems.  (As a silly example;
> some architect at Digital wrote a spec that said that setuid must
> return EINVAL for values greater than 32k --- back in the days when
> uid's were a signed short.  The junior programmer who implemented this
> for Ultrix made the check for 32,000 decimal.  Guess what happened
> when /bin/login got a failure with setuid when it wasn't expecting one
> --- since root could never get an error with that system call, right?

Ignored it and kept going, starting the user's shell as root?

I'd guess that a similar story motivated the note in the Linux manpages
for setuid, setresuid, and similar, saying "Note: there are cases where
setuid() can fail even when the caller is UID 0; it is a grave security
error to omit checking for a failure return from setuid().".

(Also, these days, glibc marks setuid and similar with the
warn_unused_result attribute.)
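
As a hedged illustration of the manpage's advice (not code taken from
login or from any program mentioned in this thread), here is a minimal C
sketch that treats any setuid() failure as a hard error.  The helper
name drop_to_uid is invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (name invented for this sketch): switch to `uid`
 * and report failure instead of silently continuing.
 * Returns 0 on success, -1 on failure.
 */
int drop_to_uid(uid_t uid)
{
	/* (uid_t)-1 is never a valid target ID. */
	assert(uid != (uid_t)-1);

	if (setuid(uid) != 0) {
		perror("setuid");
		return -1;	/* caller must not carry on as the old user */
	}

	/* Belt and braces: confirm real and effective UID both match. */
	if (getuid() != uid || geteuid() != uid)
		return -1;

	return 0;
}
```

A login-style caller would refuse to start the shell when this returns
-1, for example by calling _exit(1) right away.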

> And MIT Project Athena started running out of lower numbered uid's and
> froshlings started getting assigned uid's > 32,000....)
> 
> In this particular case, the change is probably a little less likely
> to cause serious problems, although the fact that sudo does allow
> negative group assignments is an example of another potential
> breakage.
> 
> OTOH, I'm aware of how this could cause major problems to the concept
> of allowing an untrusted user to set up their own containers to
> constrain what a program with a possibly untrusted provenance might be
> able to do.  I can see times when I might want to run in a container
> where the user didn't have access to groups that I have access to by
> default --- including groups such as disk, sudo, lpadmin, etc.
> 
> If we do want to make such a change, my suggestion is to keep things
> *very* simple.  Let it be a boot-time option whether or not users are
> allowed to drop group permissions, and let it affect all possible ways
> that users can drop groups.  And we can create a shell script that
> will search for the obvious ways that a user could get screwed by
> enabling this, which we can encourage distributions to package up for
> their end users.  And then we document the heck out of the fact that
> this option exists, and when/if we want to make it the default, so
> it's perfectly clear and transparent to all what is happening.

An option sounds sensible to me.  I think a sysctl makes more sense,
though.  I'll add one in v4.
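
For readers who want to see the operation under discussion, here is a
minimal userspace sketch of a process shrinking its own supplementary
group list.  It assumes only getgroups(2) and setgroups(2); the helper
names remove_gid and drop_supplementary_group are made up for this
example, and nothing here reflects the patch's kernel-side code.
Without the proposed change (or CAP_SETGID), the setgroups() call is
expected to fail with EPERM.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for setgroups() in <grp.h> */
#endif
#include <assert.h>
#include <grp.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Remove every occurrence of `victim` from list[0..n); returns new length. */
size_t remove_gid(gid_t *list, size_t n, gid_t victim)
{
	size_t kept = 0;

	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (list[i] != victim)
			list[kept++] = list[i];
	return kept;
}

/*
 * Hypothetical helper: drop one supplementary group from the caller.
 * Returns 0 on success, -1 on failure (errno set by the failing call).
 */
int drop_supplementary_group(gid_t victim)
{
	int n = getgroups(0, NULL);	/* size query only */
	if (n < 0)
		return -1;

	gid_t *list = calloc((size_t)n + 1, sizeof(*list));
	if (list == NULL)
		return -1;

	n = getgroups(n, list);
	if (n < 0) {
		free(list);
		return -1;
	}

	size_t kept = remove_gid(list, (size_t)n, victim);
	int rc = setgroups(kept, list);	/* EPERM today when unprivileged */
	free(list);
	return rc;
}
```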

What did you have in mind about the shell script? Something like:
grep -r '!%' /etc/sudoers /etc/sudoers.d
?

- Josh Triplett

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Josh Triplett @ 2014-11-16 19:12 UTC
  To: Andy Lutomirski
  Cc: Theodore Ts'o, Eric W. Biederman, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <CALCETrUPsH_So2Mgk38Fe_pjp5Y+cgjzCUe7fzFcnsFzivHeNA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Sun, Nov 16, 2014 at 07:42:30AM -0800, Andy Lutomirski wrote:
> On Sun, Nov 16, 2014 at 5:32 AM, Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org> wrote:
> > On Sat, Nov 15, 2014 at 09:08:07PM -0600, Eric W. Biederman wrote:
> >>
> >> That may be a bug with the user namespace permission check.  Perhaps we
> >> shouldn't allow dropping groups that aren't mapped in the user
> >> namespace.
> >
> > I'm not saying that we can't change the behavior of whether or not a
> > user can drop a group permission.  I'm just saying that we need to do
> > so consciously.  The setgroups()/getgroups() ABI isn't part of
> > POSIX/SuSv3 so we wouldn't be breaking POSIX compatibility, for those
> > people who care about that.
> 
> It may make sense to reach out to some place like oss-security.
> 
> FWIW, I think we should ask, at the same time, about:
> 
>  - Dropping supplementary groups.
>  - Switching gid/egid/sgid to a supplementary group.
>  - Denying ptrace of a process with supplementary groups that the
> tracer doesn't have.

I wonder how crazy it would be to just require either CAP_SYS_PTRACE or
cred1 == cred2 (as in, you have *exactly* the same credentials as the
target)?
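
To make the "exactly the same credentials" idea concrete, here is a
userspace sketch of such a comparison.  The struct cred_snapshot type
and both function names are hypothetical stand-ins invented for
illustration; this is not the kernel's struct cred and not a proposed
implementation of the ptrace check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define MAX_GROUPS 32	/* arbitrary bound for this sketch */

/* Userspace stand-in for a credential set; NOT the kernel's struct cred. */
struct cred_snapshot {
	uid_t uid, euid, suid;
	gid_t gid, egid, sgid;
	size_t ngroups;
	gid_t groups[MAX_GROUPS];	/* assumed sorted, no duplicates */
};

/* True only if every ID and the whole supplementary list match. */
bool creds_identical(const struct cred_snapshot *a,
		     const struct cred_snapshot *b)
{
	assert(a != NULL && b != NULL);
	assert(a->ngroups <= MAX_GROUPS && b->ngroups <= MAX_GROUPS);

	if (a->uid != b->uid || a->euid != b->euid || a->suid != b->suid)
		return false;
	if (a->gid != b->gid || a->egid != b->egid || a->sgid != b->sgid)
		return false;
	if (a->ngroups != b->ngroups)
		return false;
	return memcmp(a->groups, b->groups,
		      a->ngroups * sizeof(a->groups[0])) == 0;
}

/* The rule floated above: the capability, or an exact credential match. */
bool may_ptrace(const struct cred_snapshot *tracer,
		const struct cred_snapshot *target,
		bool has_cap_sys_ptrace)
{
	return has_cap_sys_ptrace || creds_identical(tracer, target);
}
```

Under this rule a tracer that has dropped even one supplementary group
no longer matches its former peers, so it would need the capability.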

> Also, I much prefer a sysctl to a boot option.  Boot options are nasty
> to configure in many distributions.

Agreed.

- Josh Triplett

* Re: [PATCHv8 3/4] sparc: Hook up execveat system call.
From: David Miller @ 2014-11-16 19:23 UTC
  To: drysdale-hpIqsD4AKlfQT0dZR+AlfA
  Cc: ebiederm-aS9lmoZGLiVWk0Htik3J/w, luto-kltTT9wpgjJwATOyAt5JVQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	meredydd-zPN50pYk8eUaUu29zAJCuw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, keescook-F7+t8E8rja9g9hUCZPvPmw,
	arnd-r2nGTMty4D4, dalias-/miJ2pyFWUyWIDz0JBNUog,
	hch-wEGCiKHe2LqWVfeAwA7xHQ, x86-DgEjT+Ai2ygdnm+yROfE0A,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	sparclinux-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1415982183-20525-4-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

From: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Date: Fri, 14 Nov 2014 16:23:02 +0000

> Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Acked-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

* Re: [RFC] Possible new execveat(2) Linux syscall
From: Rich Felker @ 2014-11-16 19:52 UTC
  To: David Drysdale
  Cc: libc-alpha, Andrew Morton, Christoph Hellwig, Linux API,
	Andy Lutomirski, musl
In-Reply-To: <CAHse=S8ccC2No5EYS0Pex=Ng3oXjfDB9woOBmMY_k+EgxtODZA@mail.gmail.com>

On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> Hi,
> 
> Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> and it would be good to hear a glibc perspective about it (and whether there
> are any interface changes that would make it easier to use from userspace).
> 
> The syscall prototype is:
>   int execveat(int fd, const char *pathname,
>                       char *const argv[],  char *const envp[],
>                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> and it works similarly to execve(2) except:
>  - the executable to run is identified by the combination of fd+pathname, like
>    other *at(2) syscalls
>  - there's an extra flags field to control behaviour.
> (I've attached a text version of the suggested man page below)
> 
> One particular benefit of this is that it allows an fexecve(3) implementation
> that doesn't rely on /proc being accessible, which is useful for sandboxed
> applications.  (However, that does only work for non-interpreted programs:
> the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> access to load the script file).
> 
> How does this sound from a glibc perspective?

I've been following the discussions so far and everything looks mostly
okay. There are still issues to be resolved with the different
semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
save the permissions at the time of open and cause them to be used in
place of the current file permissions at the time of execveat

One major issue however is FD_CLOEXEC with scripts. Last I checked,
this didn't work because the file is already closed by the time the
interpreter runs. The intended usage of fexecve is almost certainly to
call it with the file descriptor set close-on-exec; otherwise, there
would be no clean way to close it, since the program being executed
doesn't know that it's being executed via fexecve. So this is a
serious problem that needs to be solved if it hasn't already. I have
some ideas I could offer, but I'm not an expert on the kernel side
things so I'm not sure they'd be correct.
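
The usage pattern described here (a descriptor opened close-on-exec,
then executed through the descriptor) can be sketched on top of the
proposed syscall roughly as follows.  This assumes the prototype quoted
below is what ends up in the kernel and that the C library headers
define SYS_execveat; the wrapper name fexecve_via_execveat is invented
for this example, and the fallback value for AT_EMPTY_PATH is the one
used by the Linux UAPI headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for AT_EMPTY_PATH in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* value from the Linux UAPI headers */
#endif

/*
 * Sketch of fexecve(3) built on execveat(fd, "", ..., AT_EMPTY_PATH).
 * Like execve(2), it only returns on failure (-1, errno set).
 */
int fexecve_via_execveat(int fd, char *const argv[], char *const envp[])
{
	assert(argv != NULL && envp != NULL);
#ifdef SYS_execveat
	return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
#else
	(void)fd;
	errno = ENOSYS;		/* headers predate the syscall */
	return -1;
#endif
}
```

A caller would open the program with O_RDONLY | O_CLOEXEC and pass the
resulting descriptor in.  For a "#!" script that combination is exactly
the case the draft man page lists under ENOENT, which is the problem
raised above.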

Rich

> Thanks,
> David
> 
> [1] https://lkml.org/lkml/2014/11/7/512, with earlier discussions at
> https://lkml.org/lkml/2014/11/6/469, https://lkml.org/lkml/2014/10/22/275
> and https://lkml.org/lkml/2014/10/17/428
> 
> ----
> 
> EXECVEAT(2)              Linux Programmer's Manual             EXECVEAT(2)
> 
> NAME
>        execveat - execute program relative to a directory file descriptor
> 
> SYNOPSIS
>        #include <unistd.h>
> 
>        int execveat(int fd, const char *pathname,
>                     char *const argv[],  char *const envp[],
>                     int flags);
> 
> DESCRIPTION
>        The execveat() system call executes the program pointed to by the
>        combination of fd and pathname.  The execveat() system call
>        operates in exactly the same way as execve(2), except for the
>        differences described in this manual page.
> 
>        If the pathname given in pathname is relative, then it is
>        interpreted relative to the directory referred to by the file
>        descriptor fd (rather than relative to the current working
>        directory of the calling process, as is done by execve(2) for a
>        relative pathname).
> 
>        If pathname is relative and fd is the special value AT_FDCWD, then
>        pathname is interpreted relative to the current working directory
>        of the calling process (like execve(2)).
> 
>        If pathname is absolute, then fd is ignored.
> 
>        If pathname is an empty string and the AT_EMPTY_PATH flag is
>        specified, then the file descriptor fd specifies the file to be
>        executed.
> 
>        flags can either be 0, or include the following flags:
> 
>        AT_EMPTY_PATH
>               If pathname is an empty string, operate on the file referred
>               to by fd (which may have been  obtained  using  the  open(2)
>               O_PATH flag).
> 
>        AT_SYMLINK_NOFOLLOW
>               If  the  file  identified by fd and a non-NULL pathname is a
>               symbolic link, then the call fails with the error EINVAL.
> 
> RETURN VALUE
>        On success, execveat() does not return. On error  -1  is  returned,
>        and errno is set appropriately.
> 
> ERRORS
>        The  same  errors  that  occur  for  execve(2)  can  also occur for
>        execveat().   The  following  additional  errors  can   occur   for
>        execveat():
> 
>        EBADF  fd is not a valid file descriptor.
> 
>        ENOENT The  program  identified by fd and pathname requires the use
>               of an interpreter program (such as a  script  starting  with
>               "#!")  but  the  file  descriptor  fd  was  opened  with the
>               O_CLOEXEC flag and so the program file  is  inaccessible  to
>               the launched interpreter.
> 
>        EINVAL Invalid flag specified in flags.
> 
>        ENOTDIR
>               pathname  is  relative and fd is a file descriptor referring
>               to a file other than a directory.
> 
> VERSIONS
>        execveat() was added to Linux in kernel 3.???.
> 
> NOTES
>        In addition to the reasons explained in openat(2),  the  execveat()
>        system call is also needed to allow fexecve(3) to be implemented on
>        systems that do not have the /proc filesystem mounted.
> 
> SEE ALSO
>        execve(2), fexecve(3)
> 
> Linux                           2014-04-02                     EXECVEAT(2)

* Re: [RFC] Possible new execveat(2) Linux syscall
From: Andy Lutomirski @ 2014-11-16 21:20 UTC
  To: Rich Felker
  Cc: libc-alpha, musl-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Andrew Morton,
	David Drysdale, Linux API, Christoph Hellwig
In-Reply-To: <20141116195246.GX22465-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>

On Nov 16, 2014 11:53 AM, "Rich Felker" <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
>
> On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> > Hi,
> >
> > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> > and it would be good to hear a glibc perspective about it (and whether there
> > are any interface changes that would make it easier to use from userspace).
> >
> > The syscall prototype is:
> >   int execveat(int fd, const char *pathname,
> >                       char *const argv[],  char *const envp[],
> >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> > and it works similarly to execve(2) except:
> >  - the executable to run is identified by the combination of fd+pathname, like
> >    other *at(2) syscalls
> >  - there's an extra flags field to control behaviour.
> > (I've attached a text version of the suggested man page below)
> >
> > One particular benefit of this is that it allows an fexecve(3) implementation
> > that doesn't rely on /proc being accessible, which is useful for sandboxed
> > applications.  (However, that does only work for non-interpreted programs:
> > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> > access to load the script file).
> >
> > How does this sound from a glibc perspective?
>
> I've been following the discussions so far and everything looks mostly
> okay. There are still issues to be resolved with the different
> semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> save the permissions at the time of open and cause them to be used in
> place of the current file permissions at the time of execveat

Is something missing here?

FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
help would be appreciated.

>
> One major issue however is FD_CLOEXEC with scripts. Last I checked,
> this didn't work because the file is already closed by the time the
> interpreter runs. The intended usage of fexecve is almost certainly to
> call it with the file descriptor set close-on-exec; otherwise, there
> would be no clean way to close it, since the program being executed
> doesn't know that it's being executed via fexecve. So this is a
> serious problem that needs to be solved if it hasn't already. I have
> some ideas I could offer, but I'm not an expert on the kernel side
> things so I'm not sure they'd be correct.

Bring on the ideas.

FWIW, I've often thought that interpreter binaries should mark
themselves as such to enable better interactions with the kernel.

--Andy

* Re: [PATCH v2 net-next 6/7] bpf: allow eBPF programs to use maps
From: Alexei Starovoitov @ 2014-11-16 21:24 UTC
  To: David Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, Linux API,
	Network Development, LKML
In-Reply-To: <20141116.140422.570375628237589645.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Sun, Nov 16, 2014 at 11:04 AM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
> Date: Thu, 13 Nov 2014 17:36:49 -0800
>
>> +static u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
>> +{
>> +     /* verifier checked that R1 contains a valid pointer to bpf_map
>> +      * and R2 points to a program stack and map->key_size bytes were
>> +      * initialized
>> +      */
>> +     struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
>> +     void *key = (void *) (unsigned long) r2;
>> +     void *value;
>> +
>> +     WARN_ON_ONCE(!rcu_read_lock_held());
>> +
>> +     value = map->ops->map_lookup_elem(map, key);
>> +
>> +     /* lookup() returns either pointer to element value or NULL
>> +      * which is the meaning of PTR_TO_MAP_VALUE_OR_NULL type
>> +      */
>> +     return (unsigned long) value;
>> +}
>
> You should translate this into a true boolean '1' or '0' value so that
> kernel pointers don't propagate to the user or his eBPF programs.

That won't work. eBPF programs have to see all sorts of kernel
pointers. In this case it's a pointer to a map element value
or NULL. There are pointers to the stack, pointers to the map root,
pointers to context, etc. Programs can read pointers from
other data structures. And in the case of tracing they can
pretty much access any kernel memory in a read-only way,
just like 'perf probe' filters.
The requirement that _unprivileged_ programs should
not be able to pass all these pointers back to the user is
well understood and was discussed in detail several
months back. It's the verifier that will prevent leaking of
kernel addresses. Today, the whole thing is for root
only. When the infra is ready for non-root I will add
a pass to the verifier that will kick in only for unprivileged
programs. The verifier already tracks all pointers and
can prevent passing them to the user. In this case the
verifier knows that register R0 after a call to
bpf_map_lookup_elem() is
"either pointer to element value or NULL",
so it will prevent storing it into any memory or
doing arithmetic on it, so that user space cannot
see the pointer, whereas the eBPF program can use
it to access the map element value.
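
From the program's side, the "pointer to element value or NULL"
contract looks roughly like the sketch below.  This is a plain
userspace mock written for illustration: struct toy_map,
toy_map_lookup_elem and count_event are invented names, not the
kernel's struct bpf_map or helper, and nothing here runs under the
verifier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAP_SLOTS 4

/* Toy stand-in for an array-style map; not the kernel's struct bpf_map. */
struct toy_map {
	uint32_t max_entries;
	uint64_t values[TOY_MAP_SLOTS];
};

/* Mirrors the helper's contract: pointer to the element value, or NULL. */
void *toy_map_lookup_elem(struct toy_map *map, const uint32_t *key)
{
	assert(map != NULL && key != NULL);
	assert(map->max_entries <= TOY_MAP_SLOTS);

	if (*key >= map->max_entries)
		return NULL;
	return &map->values[*key];
}

/* The caller tests for NULL, then accesses the value through the pointer. */
int count_event(struct toy_map *map, uint32_t key)
{
	uint64_t *value = toy_map_lookup_elem(map, &key);

	if (value == NULL)
		return -1;	/* no such element */
	*value += 1;		/* in-place update of the map element */
	return 0;
}
```

The pointer is only ever used to reach the element; it is never stored
or turned into a number, which is the property described above for
unprivileged programs.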

* Re: [PATCH v2 net-next 6/7] bpf: allow eBPF programs to use maps
From: David Miller @ 2014-11-16 21:34 UTC
  To: ast-uqk4Ao+rVK5Wk0Htik3J/w
  Cc: mingo-DgEjT+Ai2ygdnm+yROfE0A, luto-kltTT9wpgjJwATOyAt5JVQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA,
	hannes-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CAMEtUuwrST6wGnBU6UU2NYEubskHYf1XZmZQpkgM+cUc8YD9OA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
Date: Sun, 16 Nov 2014 13:24:53 -0800

> The requirement that _unprivileged_ programs should
> not be able to pass all these pointers back to the user is
> well understood and was discussed in detail several
> months back. It's the verifier that will prevent leaking of
> kernel addresses.

Ok, fair enough.

* Re: [RFC] Possible new execveat(2) Linux syscall
From: Rich Felker @ 2014-11-16 22:08 UTC
  To: Andy Lutomirski
  Cc: libc-alpha, musl, Andrew Morton, David Drysdale, Linux API,
	Christoph Hellwig
In-Reply-To: <CALCETrWWUyizL8HxZKaYE+xuV5eGi8mQcequT9HPvvac=X-dLg@mail.gmail.com>

On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
> On Nov 16, 2014 11:53 AM, "Rich Felker" <dalias@aerifal.cx> wrote:
> >
> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> > > Hi,
> > >
> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> > > and it would be good to hear a glibc perspective about it (and whether there
> > > are any interface changes that would make it easier to use from userspace).
> > >
> > > The syscall prototype is:
> > >   int execveat(int fd, const char *pathname,
> > >                       char *const argv[],  char *const envp[],
> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> > > and it works similarly to execve(2) except:
> > >  - the executable to run is identified by the combination of fd+pathname, like
> > >    other *at(2) syscalls
> > >  - there's an extra flags field to control behaviour.
> > > (I've attached a text version of the suggested man page below)
> > >
> > > One particular benefit of this is that it allows an fexecve(3) implementation
> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
> > > applications.  (However, that does only work for non-interpreted programs:
> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> > > access to load the script file).
> > >
> > > How does this sound from a glibc perspective?
> >
> > I've been following the discussions so far and everything looks mostly
> > okay. There are still issues to be resolved with the different
> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> > save the permissions at the time of open and cause them to be used in
> > place of the current file permissions at the time of execveat
> 
> Is something missing here?
> 
> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
> help would be appreciated.

Yes. POSIX requires that permission checks for execution (fexecve with
O_EXEC file descriptors) and directory-search (*at functions with
O_SEARCH file descriptors) succeed if the open operation succeeded --
the permissions check is required to take place at open time rather
than at exec/search time. There's a separate discussion about how to
make this work on the kernel side.

> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
> > this didn't work because the file is already closed by the time the
> > interpreter runs. The intended usage of fexecve is almost certainly to
> > call it with the file descriptor set close-on-exec; otherwise, there
> > would be no clean way to close it, since the program being executed
> > doesn't know that it's being executed via fexecve. So this is a
> > serious problem that needs to be solved if it hasn't already. I have
> > some ideas I could offer, but I'm not an expert on the kernel side
> > things so I'm not sure they'd be correct.
> 
> Bring on the ideas.

My thought is that when the kernel opens the binary and sees that it's
a script that needs an interpreter, the kernel should not pass
/proc/self/fd/%d to the interpreter, but instead should pass the name
of a new magic symlink in /proc/self that's connected to the inode for
the script to be executed but that ceases to exist as soon as it's
opened. In theory this could also be used for suid scripts to make
them secure.

> FWIW, I've often thought that interpreter binaries should mark
> themselves as such to enable better interactions with the kernel.

That's hard since users expect to be able to use arbitrary
interpreters (and sometimes even pass through multiple ones, e.g.
#!/usr/bin/env perl).

Rich

* Re: [RFC] Possible new execveat(2) Linux syscall
From: Andy Lutomirski @ 2014-11-16 22:34 UTC
  To: Rich Felker
  Cc: libc-alpha, musl-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Andrew Morton,
	David Drysdale, Linux API, Christoph Hellwig
In-Reply-To: <20141116220859.GY22465-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>

On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
> On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
>> On Nov 16, 2014 11:53 AM, "Rich Felker" <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
>> >
>> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
>> > > Hi,
>> > >
>> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
>> > > and it would be good to hear a glibc perspective about it (and whether there
>> > > are any interface changes that would make it easier to use from userspace).
>> > >
>> > > The syscall prototype is:
>> > >   int execveat(int fd, const char *pathname,
>> > >                       char *const argv[],  char *const envp[],
>> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
>> > > and it works similarly to execve(2) except:
>> > >  - the executable to run is identified by the combination of fd+pathname, like
>> > >    other *at(2) syscalls
>> > >  - there's an extra flags field to control behaviour.
>> > > (I've attached a text version of the suggested man page below)
>> > >
>> > > One particular benefit of this is that it allows an fexecve(3) implementation
>> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
>> > > applications.  (However, that does only work for non-interpreted programs:
>> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
>> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
>> > > access to load the script file).
>> > >
>> > > How does this sound from a glibc perspective?
>> >
>> > I've been following the discussions so far and everything looks mostly
>> > okay. There are still issues to be resolved with the different
>> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
>> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
>> > save the permissions at the time of open and cause them to be used in
>> > place of the current file permissions at the time of execveat
>>
>> Is something missing here?
>>
>> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
>> help would be appreciated.
>
> Yes. POSIX requires that permission checks for execution (fexecve with
> O_EXEC file descriptors) and directory-search (*at functions with
> O_SEARCH file descriptors) succeed if the open operation succeeded --
> the permissions check is required to take place at open time rather
> than at exec/search time. There's a separate discussion about how to
> make this work on the kernel side.

It may be worth making this work as part of adding execveat to the
kernel.  Does the kernel even have O_EXEC right now?

>
>> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
>> > this didn't work because the file is already closed by the time the
>> > interpreter runs. The intended usage of fexecve is almost certainly to
>> > call it with the file descriptor set close-on-exec; otherwise, there
>> > would be no clean way to close it, since the program being executed
>> > doesn't know that it's being executed via fexecve. So this is a
>> > serious problem that needs to be solved if it hasn't already. I have
>> > some ideas I could offer, but I'm not an expert on the kernel side
>> > things so I'm not sure they'd be correct.
>>
>> Bring on the ideas.
>
> My thought is that when the kernel opens the binary and sees that it's
> a script that needs an interpreter, the kernel should not pass
> /proc/self/fd/%d to the interpreter, but instead should pass the name
> of a new magic symlink in /proc/self that's connected to the inode for
> the script to be executed but that ceases to exist as soon as it's
> opened. In theory this could also be used for suid scripts to make
> them secure.

This doesn't help if /proc is not mounted, which is an important use case.

>
>> FWIW, I've often thought that interpreter binaries should mark
>> themselves as such to enable better interactions with the kernel.
>
> That's hard since users expect to be able to use arbitrary
> interpreters (and sometimes even pass through multiple ones, e.g.
> #!/usr/bin/env perl).
>

Hmm.  I'd be okay with old interpreters having a somewhat degraded experience.

I guess that #!/some/interpreted/script isn't allowed, but maybe
#!/usr/bin/env some-interpreted-script should work.

It could be that all that's really needed is some convention to tell
an interpreter that it should use fd N as a script *and close it*.
Something like /dev/fd_and_close/N could work, but that has all kinds
of problems.

Alternatively, if we could have a way to mark an fd so that it's
close-on-exec after exec, that would solve the nesting problem, as
long as every interpreter in the chain does it.  And the kernel could
certainly implement execve on a close-on-exec fd by passing /dev/fd/N
where N is a close-on-exec fd, at least in the non-nested case.

--Andy

> Rich



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall
From: Rich Felker @ 2014-11-16 23:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: libc-alpha, musl, Andrew Morton, David Drysdale, Linux API,
	Christoph Hellwig
In-Reply-To: <CALCETrVtN73rTxGXV9Xt+sPOitAWCcyrfUWY_3_tAmd+n6V1gA@mail.gmail.com>

On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:
> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <dalias@aerifal.cx> wrote:
> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
> >> On Nov 16, 2014 11:53 AM, "Rich Felker" <dalias@aerifal.cx> wrote:
> >> >
> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
> >> > > Hi,
> >> > >
> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
> >> > > and it would be good to hear a glibc perspective about it (and whether there
> >> > > are any interface changes that would make it easier to use from userspace).
> >> > >
> >> > > The syscall prototype is:
> >> > >   int execveat(int fd, const char *pathname,
> >> > >                       char *const argv[],  char *const envp[],
> >> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
> >> > > and it works similarly to execve(2) except:
> >> > >  - the executable to run is identified by the combination of fd+pathname, like
> >> > >    other *at(2) syscalls
> >> > >  - there's an extra flags field to control behaviour.
> >> > > (I've attached a text version of the suggested man page below)
> >> > >
> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
> >> > > applications.  (However, that does only work for non-interpreted programs:
> >> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
> >> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
> >> > > access to load the script file).
> >> > >
> >> > > How does this sound from a glibc perspective?
> >> >
> >> > I've been following the discussions so far and everything looks mostly
> >> > okay. There are still issues to be resolved with the different
> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
> >> > save the permissions at the time of open and cause them to be used in
> >> > place of the current file permissions at the time of execveat
> >>
> >> Is something missing here?
> >>
> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
> >> help would be appreciated.
> >
> > Yes. POSIX requires that permission checks for execution (fexecve with
> > O_EXEC file descriptors) and directory-search (*at functions with
> > O_SEARCH file descriptors) succeed if the open operation succeeded --
> > the permissions check is required to take place at open time rather
> > than at exec/search time. There's a separate discussion about how to
> > make this work on the kernel side.
> 
> It may be worth making this work as part of adding execveat to the
> kernel.  Does the kernel even have O_EXEC right now?

No. The proposal is that O_EXEC and O_SEARCH would both be equal to
O_PATH|3 (3 being the rarely-used O_ACCMODE value for "neither read nor
write, but some weird ioctls are accepted") which gracefully falls
back for both current kernels with O_PATH (in which case the 3 is
ignored and the discrepancy from POSIX is just the time at which
permissions are checked) and for pre-O_PATH kernels (in which case the
access mode used is 3, and read/write ops fail on the fd, but it's
still usable for fexecve and *at functions with /proc-based fallback
implementations).

I would be happy to see this work get done at the same time.
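
To make the proposal concrete, here is a minimal sketch of how a libc
header might spell the fallback. It assumes the proposed value O_PATH|3
described above, which no kernel defines today, and open_for_exec() is a
hypothetical helper name used only for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc only exposes O_PATH under _GNU_SOURCE */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Assumption: the proposed encoding, O_PATH plus access mode 3. */
#ifndef O_EXEC
#define O_EXEC   (O_PATH | 3)
#endif
#ifndef O_SEARCH
#define O_SEARCH (O_PATH | 3)
#endif

/* Hypothetical helper: open a file only for later use with fexecve().
 * The descriptor is close-on-exec so it does not leak into the new image. */
static int open_for_exec(const char *path)
{
    assert(path != NULL);
    return open(path, O_EXEC | O_CLOEXEC);
}
```

On a kernel with O_PATH the access-mode bits are ignored, so this is
just an O_PATH open and the permission check still happens at exec time.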

> >> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
> >> > this didn't work because the file is already closed by the time the
> >> > interpreter runs. The intended usage of fexecve is almost certainly to
> >> > call it with the file descriptor set close-on-exec; otherwise, there
> >> > would be no clean way to close it, since the program being executed
> >> > doesn't know that it's being executed via fexecve. So this is a
> >> > serious problem that needs to be solved if it hasn't already. I have
> >> > some ideas I could offer, but I'm not an expert on the kernel side
> >> > things so I'm not sure they'd be correct.
> >>
> >> Bring on the ideas.
> >
> > My thought is that when the kernel opens the binary and sees that it's
> > a script that needs an interpreter, the kernel should not pass
> > /proc/self/fd/%d to the interpreter, but instead should pass the name
> > of a new magic symlink in /proc/self that's connected to the inode for
> > the script to be executed but that ceases to exist as soon as it's
> > opened. In theory this could also be used for suid scripts to make
> > them secure.
> 
> This doesn't help if /proc is not mounted, which is an important use case.

I don't know what can be done in this case short of some really ugly
hacks, like giving open() special behavior when the pathname points to
a magic address in the argv region, or having the kernel create temp
files in some magic path.

> >> FWIW, I've often thought that interpreter binaries should mark
> >> themselves as such to enable better interactions with the kernel.
> >
> > That's hard since users expect to be able to use arbitrary
> > interpreters (and sometimes even pass through multiple ones, e.g.
> > #!/usr/bin/env perl).
> 
> Hmm.  I'd be okay with old interpreters having a somewhat degraded experience.
> 
> I guess that #!/some/interpreted/script isn't allowed, but maybe
> #!/usr/bin/env some-interpreted-script should work.
> 
> It could be that all that's really needed is some convention to tell
> an interpreter that it should use fd N as a script *and close it*.
> Something like /dev/fd_and_close/N could work, but that has all kinds
> of problems.
> 
> Alternatively, if we could have a way to mark an fd so that it's
> close-on-exec after exec, that would solve the nesting problem, as
> long as every interpreter in the chain does it.  And the kernel could
> certainly implement execve on a close-on-exec fd by passing /dev/fd/N
> where N is a close-on-exec fd, at least in the non-nested case.

This doesn't solve the problem of needing /proc though (/dev/fd is
just a link to /proc/self/fd).

Rich


* Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall
From: Andy Lutomirski @ 2014-11-17  0:06 UTC (permalink / raw)
  To: Rich Felker
  Cc: libc-alpha, musl-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Andrew Morton,
	David Drysdale, Linux API, Christoph Hellwig
In-Reply-To: <20141116233202.GA22465-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>

On Sun, Nov 16, 2014 at 3:32 PM, Rich Felker <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
> On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:
>> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
>> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
>> >> On Nov 16, 2014 11:53 AM, "Rich Felker" <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
>> >> >
>> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
>> >> > > Hi,
>> >> > >
>> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
>> >> > > and it would be good to hear a glibc perspective about it (and whether there
>> >> > > are any interface changes that would make it easier to use from userspace).
>> >> > >
>> >> > > The syscall prototype is:
>> >> > >   int execveat(int fd, const char *pathname,
>> >> > >                       char *const argv[],  char *const envp[],
>> >> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
>> >> > > and it works similarly to execve(2) except:
>> >> > >  - the executable to run is identified by the combination of fd+pathname, like
>> >> > >    other *at(2) syscalls
>> >> > >  - there's an extra flags field to control behaviour.
>> >> > > (I've attached a text version of the suggested man page below)
>> >> > >
>> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
>> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
>> >> > > applications.  (However, that does only work for non-interpreted programs:
>> >> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
>> >> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
>> >> > > access to load the script file).
>> >> > >
>> >> > > How does this sound from a glibc perspective?
>> >> >
>> >> > I've been following the discussions so far and everything looks mostly
>> >> > okay. There are still issues to be resolved with the different
>> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
>> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
>> >> > save the permissions at the time of open and cause them to be used in
>> >> > place of the current file permissions at the time of execveat
>> >>
>> >> Is something missing here?
>> >>
>> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
>> >> help would be appreciated.
>> >
>> > Yes. POSIX requires that permission checks for execution (fexecve with
>> > O_EXEC file descriptors) and directory-search (*at functions with
>> > O_SEARCH file descriptors) succeed if the open operation succeeded --
>> > the permissions check is required to take place at open time rather
>> > than at exec/search time. There's a separate discussion about how to
>> > make this work on the kernel side.
>>
>> It may be worth making this work as part of adding execveat to the
>> kernel.  Does the kernel even have O_EXEC right now?
>
> No. The proposal is that O_EXEC and O_SEARCH would both be equal to
> O_PATH|3 (3 being the rarely-used O_ACCMODE value for "neither read nor
> write, but some weird ioctls are accepted") which gracefully falls
> back for both current kernels with O_PATH (in which case the 3 is
> ignored and the discrepancy from POSIX is just the time at which
> permissions are checked) and for pre-O_PATH kernels (in which case the
> access mode used is 3, and read/write ops fail on the fd, but it's
> still usable for fexecve and *at functions with /proc-based fallback
> implementations).
>
> I would be happy to see this work get done at the same time.
>
>> >> > One major issue however is FD_CLOEXEC with scripts. Last I checked,
>> >> > this didn't work because the file is already closed by the time the
>> >> > interpreter runs. The intended usage of fexecve is almost certainly to
>> >> > call it with the file descriptor set close-on-exec; otherwise, there
>> >> > would be no clean way to close it, since the program being executed
>> >> > doesn't know that it's being executed via fexecve. So this is a
>> >> > serious problem that needs to be solved if it hasn't already. I have
>> >> > some ideas I could offer, but I'm not an expert on the kernel side
>> >> > things so I'm not sure they'd be correct.
>> >>
>> >> Bring on the ideas.
>> >
>> > My thought is that when the kernel opens the binary and sees that it's
>> > a script that needs an interpreter, the kernel should not pass
>> > /proc/self/fd/%d to the interpreter, but instead should pass the name
>> > of a new magic symlink in /proc/self that's connected to the inode for
>> > the script to be executed but that ceases to exist as soon as it's
>> > opened. In theory this could also be used for suid scripts to make
>> > them secure.
>>
>> This doesn't help if /proc is not mounted, which is an important use case.
>
> I don't know what can be done in this case short of some really ugly
> hacks, like giving open() special behavior when the pathname points to
> a magic address in the argv region, or having the kernel create temp
> files in some magic path.
>
>> >> FWIW, I've often thought that interpreter binaries should mark
>> >> themselves as such to enable better interactions with the kernel.
>> >
>> > That's hard since users expect to be able to use arbitrary
>> > interpreters (and sometimes even pass through multiple ones, e.g.
>> > #!/usr/bin/env perl).
>>
>> Hmm.  I'd be okay with old interpreters having a somewhat degraded experience.
>>
>> I guess that #!/some/interpreted/script isn't allowed, but maybe
>> #!/usr/bin/env some-interpreted-script should work.
>>
>> It could be that all that's really needed is some convention to tell
>> an interpreter that it should use fd N as a script *and close it*.
>> Something like /dev/fd_and_close/N could work, but that has all kinds
>> of problems.
>>
>> Alternatively, if we could have a way to mark an fd so that it's
>> close-on-exec after exec, that would solve the nesting problem, as
>> long as every interpreter in the chain does it.  And the kernel could
>> certainly implement execve on a close-on-exec fd by passing /dev/fd/N
>> where N is a close-on-exec fd, at least in the non-nested case.
>
> This doesn't solve the problem of needing /proc though (/dev/fd is
> just a link to /proc/self/fd).
>

Al Viro was talking about having a special fs just for /dev/fd.  And
interpreters could special-case path names of a certain form.
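
As a rough sketch of that second idea, an interpreter could recognize
script names of the form /dev/fd/N and use descriptor N directly instead
of opening the path, which would avoid the /proc dependency.
fd_from_devfd_path() is a hypothetical helper, and the exact path form
is an assumption; nothing like this has been agreed on.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: if path is exactly "/dev/fd/<decimal>", return
 * that descriptor number; otherwise return -1 so the caller falls back
 * to a normal open(). */
static int fd_from_devfd_path(const char *path)
{
    static const char prefix[] = "/dev/fd/";
    const size_t plen = sizeof prefix - 1;
    char *end;
    long n;

    assert(path != NULL);
    if (strncmp(path, prefix, plen) != 0)
        return -1;
    /* Require a digit right after the prefix (no sign, no whitespace). */
    if (path[plen] < '0' || path[plen] > '9')
        return -1;
    errno = 0;
    n = strtol(path + plen, &end, 10);
    /* Reject trailing components such as "/dev/fd/3/foo" and overflow. */
    if (errno != 0 || *end != '\0' || n > INT_MAX)
        return -1;
    return (int)n;
}
```

The interpreter would then read the script from that descriptor and
close it itself, which is the "use fd N and close it" convention
discussed above.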

--Andy


* Re: [PATCH 1/5] kdbus: extend structures with security pointer for lsm
From: Karol Lewandowski @ 2014-11-17  1:47 UTC (permalink / raw)
  To: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r
  Cc: pmoore-H+wXaHxf7aLQT0dZR+AlfA, jkosina-AlSwsSmVLrQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, arnd-r2nGTMty4D4,
	tj-DgEjT+Ai2ygdnm+yROfE0A, desrt-0xnayjDhYQY,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	casey.schaufler-ral2JQCrhuEAvxtiuMwx3w,
	marcel-kz+m5ild9QBg9hUCZPvPmw, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	r.krypa-Sze3O3UU22JBDgjK7y7TUQ, Karol Lewandowski
In-Reply-To: <1414773397-26490-2-git-send-email-k.lewandowsk-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>

On Fri, Oct 31, 2014 at 05:36:33PM +0100, Karol Lewandowski wrote:
> Signed-off-by: Karol Lewandowski <k.lewandowsk-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
> ---
>  drivers/misc/kdbus/bus.h        | 2 ++
>  drivers/misc/kdbus/connection.h | 2 ++
>  drivers/misc/kdbus/domain.h     | 2 ++
>  3 files changed, 6 insertions(+)
> 
> diff --git a/drivers/misc/kdbus/bus.h b/drivers/misc/kdbus/bus.h
> index fd9d843..5c403ef 100644
> --- a/drivers/misc/kdbus/bus.h
> +++ b/drivers/misc/kdbus/bus.h
> @@ -49,6 +49,7 @@
>   * @conn_hash:		Map of connection IDs
>   * @monitors_list:	Connections that monitor this bus
>   * @meta:		Meta information about the bus creator
> + * @security:		LSM security blob
>   *
>   * A bus provides a "bus" endpoint / device node.
>   *
> @@ -84,6 +85,7 @@ struct kdbus_bus {
>  	struct list_head monitors_list;
>  
>  	struct kdbus_meta *meta;
> +	void *security;

One minor note: with the possibility of LSM modules trying to access
these structures, it would be worth having the kdbus headers in, say,
include/linux/kdbus or a similar globally accessible location.

Thanks


* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: One Thousand Gnomes @ 2014-11-17 11:37 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Theodore Ts'o, Andy Lutomirski, Eric W. Biederman,
	Andrew Morton, Kees Cook, Michael Kerrisk-manpages, Linux API,
	linux-man, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20141116045232.GB18880@thin>

> optional), I can do that too.  The security model of "having a group
> gives you less privilege than not having it" seems crazy, but
> nonetheless I can see a couple of easy ways that we can avoid breaking

It's an old pattern of use that makes complete sense in a traditional
Unix permission world because it's the only way to do "exclude {list}"
nicely. Our default IMHO shouldn't break this.

> that pattern, no_new_privs being one of them.  I'd like to make sure
> that nobody sees any other real-world corner case that unprivileged
> setgroups would break.

Barring the usual risk of people doing improper error checking I don't
see one immediately.

For containers I think it actually makes sense that the sysctl can be
applied per container anyway.

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v2] [media] Add RGB444_1X12 and RGB565_1X16 media bus formats
From: Sakari Ailus @ 2014-11-17 15:24 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Mauro Carvalho Chehab, Hans Verkuil, Laurent Pinchart,
	linux-media, linux-api, linux-kernel, linux-doc
In-Reply-To: <1416126278-17708-1-git-send-email-boris.brezillon@free-electrons.com>

Hi Boris,

On Sun, Nov 16, 2014 at 09:24:38AM +0100, Boris Brezillon wrote:
> Add RGB444_1X12 and RGB565_1X16 format definitions and update the
> documentation.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
> Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
> ---
> Changes since v1:
> - keep BPP and bits per sample ordering
> 
>  Documentation/DocBook/media/v4l/subdev-formats.xml | 40 ++++++++++++++++++++++
>  include/uapi/linux/media-bus-format.h              |  4 ++-
>  2 files changed, 43 insertions(+), 1 deletion(-)

Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>

-- 
Kind regards,

Sakari Ailus
e-mail: sakari.ailus@iki.fi	XMPP: sailus@retiisi.org.uk


* Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall
From: David Drysdale @ 2014-11-17 15:42 UTC (permalink / raw)
  To: Rich Felker
  Cc: Andy Lutomirski, libc-alpha,
	musl-ZwoEplunGu1jrUoiu81ncdBPR1lH4CV8, Andrew Morton, Linux API,
	Christoph Hellwig
In-Reply-To: <20141116233202.GA22465-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>

On Sun, Nov 16, 2014 at 11:32 PM, Rich Felker <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
> On Sun, Nov 16, 2014 at 02:34:32PM -0800, Andy Lutomirski wrote:
>> On Sun, Nov 16, 2014 at 2:08 PM, Rich Felker <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
>> > On Sun, Nov 16, 2014 at 01:20:39PM -0800, Andy Lutomirski wrote:
>> >> On Nov 16, 2014 11:53 AM, "Rich Felker" <dalias-/miJ2pyFWUyWIDz0JBNUog@public.gmane.org> wrote:
>> >> >
>> >> > On Fri, Nov 14, 2014 at 02:54:19PM +0000, David Drysdale wrote:
>> >> > > Hi,
>> >> > >
>> >> > > Over at the LKML[1] we've been discussing a possible new syscall, execveat(2),
>> >> > > and it would be good to hear a glibc perspective about it (and whether there
>> >> > > are any interface changes that would make it easier to use from userspace).
>> >> > >
>> >> > > The syscall prototype is:
>> >> > >   int execveat(int fd, const char *pathname,
>> >> > >                       char *const argv[],  char *const envp[],
>> >> > >                       int flags); /* AT_EMPTY_PATH, AT_SYMLINK_NOFOLLOW */
>> >> > > and it works similarly to execve(2) except:
>> >> > >  - the executable to run is identified by the combination of fd+pathname, like
>> >> > >    other *at(2) syscalls
>> >> > >  - there's an extra flags field to control behaviour.
>> >> > > (I've attached a text version of the suggested man page below)
>> >> > >
>> >> > > One particular benefit of this is that it allows an fexecve(3) implementation
>> >> > > that doesn't rely on /proc being accessible, which is useful for sandboxed
>> >> > > applications.  (However, that does only work for non-interpreted programs:
>> >> > > the name passed to a script interpreter is of the form "/dev/fd/<fd>/<path>"
>> >> > > or "/dev/fd/<fd>", so the executed interpreter will normally still need /proc
>> >> > > access to load the script file).
>> >> > >
>> >> > > How does this sound from a glibc perspective?
>> >> >
>> >> > I've been following the discussions so far and everything looks mostly
>> >> > okay. There are still issues to be resolved with the different
>> >> > semantics between Linux O_PATH and what POSIX requires for O_EXEC (and
>> >> > O_SEARCH) but as long as the intent is that, once O_EXEC is defined to
>> >> > save the permissions at the time of open and cause them to be used in
>> >> > place of the current file permissions at the time of execveat
>> >>
>> >> Is something missing here?
>> >>
>> >> FWIW, I don't understand O_PATH or O_EXEC very well, so from my POV,
>> >> help would be appreciated.
>> >
>> > Yes. POSIX requires that permission checks for execution (fexecve with
>> > O_EXEC file descriptors) and directory-search (*at functions with
>> > O_SEARCH file descriptors) succeed if the open operation succeeded --
>> > the permissions check is required to take place at open time rather
>> > than at exec/search time. There's a separate discussion about how to
>> > make this work on the kernel side.

I'm not familiar with O_EXEC either, I'm afraid, so to be clear -- does
O_EXEC mean the permission check is explicitly skipped later, at execute
time?  In other words, if you open(O_EXEC) an executable then remove the
execute bit from the file, does a subsequent fexecve() still work?

If it does, then from an implementation perspective that presumably implies
the need for a record of the permission check in the struct file (and that
this property would be inherited by any dup()ed file descriptors).  From a
security perspective, having a gap between time-of-check and time-of-use
always sounds worrying...

>>
>> It may be worth making this work as part of adding execveat to the
>> kernel.  Does the kernel even have O_EXEC right now?
>
> No. The proposal is that O_EXEC and O_SEARCH would both be equal to
> O_PATH|3 (3 being the rarely-used O_ACCMODE value for "neither read nor
> write, but some weird ioctls are accepted") which gracefully falls
> back for both current kernels with O_PATH (in which case the 3 is
> ignored and the discrepancy from POSIX is just the time at which
> permissions are checked) and for pre-O_PATH kernels (in which case the
> access mode used is 3, and read/write ops fail on the fd, but it's
> still usable for fexecve and *at functions with /proc-based fallback
> implementations).
>
> I would be happy to see this work get done at the same time.


* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Casey Schaufler @ 2014-11-17 18:06 UTC (permalink / raw)
  To: Josh Triplett, Andrew Morton, Eric W. Biederman, Kees Cook,
	mtk.manpages, linux-api, linux-man, linux-kernel, Casey Schaufler
In-Reply-To: <0895c1f268bc0b01cc6c8ed4607d7c3953f49728.1416041823.git.josh@joshtriplett.org>

On 11/15/2014 1:01 AM, Josh Triplett wrote:
> Currently, unprivileged processes (without CAP_SETGID) cannot call
> setgroups at all.  In particular, processes with a set of supplementary
> groups cannot further drop permissions without obtaining elevated
> permissions first.

Has anyone put any thought into how this will interact with
POSIX ACLs? I don't see that anywhere in the discussion.

Tizen takes advantage of the fact you can't drop groups. If
a process can drop itself out of groups without privilege
it can circumvent the system security policy.

Back when the LSM was introduced a choice was made between
authoritative hooks (which would have allowed this sort of thing)
and restrictive hooks (which would not). Authoritative hooks were
rejected because they would have "broken Linux". I hope that the
people who spoke up then will speak up now.

This is going to introduce a whole class of vulnerabilities.
Don't even think of arguing that the reduction in use of privilege
will make up for that. Developers have enough trouble with the
difference between setuid() and seteuid(); we can't expect them to use
group dropping properly.

>
> Allow unprivileged processes to call setgroups with a subset of their
> current groups; only require CAP_SETGID to add a group the process does
> not currently have.
>
> The kernel already maintains the list of supplementary group IDs in
> sorted order, and setgroups already needs to sort the new list, so this
> just requires a linear comparison of the two sorted lists.
>
> This moves the CAP_SETGID test from setgroups into set_current_groups.
>
> Tested via the following test program:
>
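> #define _GNU_SOURCE	/* for setresgid() and setresuid() */
> #include <err.h>
> #include <grp.h>
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/wait.h>	/* for waitpid() */
> #include <unistd.h>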
>
> void run_id(void)
> {
>     pid_t p = fork();
>     switch (p) {
>         case -1:
>             err(1, "fork");
>         case 0:
>             execl("/usr/bin/id", "id", NULL);
>             err(1, "exec");
>         default:
>             if (waitpid(p, NULL, 0) < 0)
>                 err(1, "waitpid");
>     }
> }
>
> int main(void)
> {
>     gid_t list1[] = { 1, 2, 3, 4, 5 };
>     gid_t list2[] = { 2, 3, 4 };
>     run_id();
>     if (setgroups(5, list1) < 0)
>         err(1, "setgroups 1");
>     run_id();
>     if (setresgid(1, 1, 1) < 0)
>         err(1, "setresgid");
>     if (setresuid(1, 1, 1) < 0)
>         err(1, "setresuid");
>     run_id();
>     if (setgroups(3, list2) < 0)
>         err(1, "setgroups 2");
>     run_id();
>     if (setgroups(5, list1) < 0)
>         err(1, "setgroups 3");
>     run_id();
>
>     return 0;
> }
>
> Without this patch, the test program gets EPERM from the second
> setgroups call, after dropping root privileges.  With this patch, the
> test program successfully drops groups 1 and 5, but then gets EPERM from
> the third setgroups call, since that call attempts to add groups the
> process does not currently have.
>
> Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> ---
>  kernel/groups.c | 33 ++++++++++++++++++++++++++++++---
>  kernel/uid16.c  |  2 --
>  2 files changed, 30 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/groups.c b/kernel/groups.c
> index f0667e7..fe7367d 100644
> --- a/kernel/groups.c
> +++ b/kernel/groups.c
> @@ -153,6 +153,29 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
>  	return 0;
>  }
>  
> +/* Compare two sorted group lists; return true if the first is a subset of the
> + * second.
> + */
> +static bool is_subset(const struct group_info *g1, const struct group_info *g2)
> +{
> +	unsigned int i, j;
> +
> +	for (i = 0, j = 0; i < g1->ngroups; i++) {
> +		kgid_t gid1 = GROUP_AT(g1, i);
> +		kgid_t gid2;
> +		for (; j < g2->ngroups; j++) {
> +			gid2 = GROUP_AT(g2, j);
> +			if (gid_lte(gid1, gid2))
> +				break;
> +		}
> +		if (j >= g2->ngroups || !gid_eq(gid1, gid2))
> +			return false;
> +		j++;
> +	}
> +
> +	return true;
> +}
> +
>  /**
>   * set_groups_sorted - Change a group subscription in a set of credentials
>   * @new: The newly prepared set of credentials to alter
> @@ -189,11 +212,17 @@ int set_current_groups(struct group_info *group_info)
>  {
>  	struct cred *new;
>  
> +	groups_sort(group_info);
>  	new = prepare_creds();
>  	if (!new)
>  		return -ENOMEM;
> +	if (!ns_capable(current_user_ns(), CAP_SETGID)
> +	    && !is_subset(group_info, new->group_info)) {
> +		abort_creds(new);
> +		return -EPERM;
> +	}
>  
> -	set_groups(new, group_info);
> +	set_groups_sorted(new, group_info);
>  	return commit_creds(new);
>  }
>  
> @@ -233,8 +262,6 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
>  	struct group_info *group_info;
>  	int retval;
>  
> -	if (!ns_capable(current_user_ns(), CAP_SETGID))
> -		return -EPERM;
>  	if ((unsigned)gidsetsize > NGROUPS_MAX)
>  		return -EINVAL;
>  
> diff --git a/kernel/uid16.c b/kernel/uid16.c
> index 602e5bb..b27e167 100644
> --- a/kernel/uid16.c
> +++ b/kernel/uid16.c
> @@ -176,8 +176,6 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
>  	struct group_info *group_info;
>  	int retval;
>  
> -	if (!ns_capable(current_user_ns(), CAP_SETGID))
> -		return -EPERM;
>  	if ((unsigned)gidsetsize > NGROUPS_MAX)
>  		return -EINVAL;
>  
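
For reference, the subset test in the patch above boils down to a single
linear pass over two sorted lists. A userspace sketch of the same idea,
using plain gid_t arrays instead of struct group_info (gids_subset() is
a hypothetical name, not a kernel function):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Return true if every gid in a[] also appears in b[].
 * Both arrays must be sorted in ascending order. */
static bool gids_subset(const gid_t *a, size_t na,
                        const gid_t *b, size_t nb)
{
    size_t i, j = 0;

    for (i = 0; i < na; i++) {
        assert(i == 0 || a[i - 1] <= a[i]);   /* precondition: a[] is sorted */
        /* Skip entries of b[] that are smaller than the current gid. */
        while (j < nb && b[j] < a[i])
            j++;
        if (j >= nb || b[j] != a[i])
            return false;
        j++;    /* each entry of b[] may be matched only once */
    }
    return true;
}
```

With the lists from the test program, { 2, 3, 4 } is a subset of
{ 1, 2, 3, 4, 5 } but not the other way round, which matches the
EPERM on the third setgroups call.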


* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Andy Lutomirski @ 2014-11-17 18:07 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: linux-man, Ted Ts'o, Michael Kerrisk-manpages, Josh Triplett,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andrew Morton, Eric W. Biederman, Linux API, Kees Cook
In-Reply-To: <20141117113734.396798e6-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org>

On Nov 17, 2014 3:37 AM, "One Thousand Gnomes"
<gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> wrote:
>
> > optional), I can do that too.  The security model of "having a group
> > gives you less privilege than not having it" seems crazy, but
> > nonetheless I can see a couple of easy ways that we can avoid breaking
>
> It's an old pattern of use that makes complete sense in a traditional
> Unix permission world because it's the only way to do "exclude {list}"
> nicely. Our default IMHO shouldn't break this.
>
> > that pattern, no_new_privs being one of them.  I'd like to make sure
> > that nobody sees any other real-world corner case that unprivileged
> > setgroups would break.
>
> Barring the usual risk of people doing improper error checking I don't
> see one immediately.
>
> For containers I think it actually makes sense that the sysctl can be
> applied per container anyway.

We'll probably need per container sysctls some day.

>
> Alan


* Re: [musl] Re: [RFC] Possible new execveat(2) Linux syscall
From: Rich Felker @ 2014-11-17 18:30 UTC (permalink / raw)
  To: David Drysdale
  Cc: Andy Lutomirski, libc-alpha, musl, Andrew Morton, Linux API,
	Christoph Hellwig
In-Reply-To: <CAHse=S8uceX-buoeFoA_Qthsr0TZ-nX7_x_098qqwr5pa_2r-w@mail.gmail.com>

On Mon, Nov 17, 2014 at 03:42:15PM +0000, David Drysdale wrote:
> I'm not familiar with O_EXEC either, I'm afraid, so to be clear -- does
> O_EXEC mean the permission check is explicitly skipped later, at execute
> time?  In other words, if you open(O_EXEC) an executable then remove the
> execute bit from the file, does a subsequent fexecve() still work?

Yes. It's just like how read and write permissions work. If you open a
file for read then remove read permissions, or open it for write then
remove write permissions, the existing permissions to the open file
are not lost. Of course open with O_EXEC/O_SEARCH needs to fail if the
caller does not have +x access to the file/directory at the time of
open.

> If it does, then from an implementation perspective that presumably implies
> the need for a record of the permission check in the struct file (and that
> this property would be inherited by any dup()ed file descriptors).  From a
> security perspective, having a gap between time-of-check and time-of-use
> always sounds worrying...

This record already exists for read and write. All that's needed is
for an extra bit to be added to record exec/search permission.

Rich
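
The read-permission analogy above can be tried from user space. The following is a minimal sketch, not code from the thread: the helper name read_after_chmod and the path passed to it are made up for illustration, and it only demonstrates the existing read case, since Linux does not define O_EXEC. It opens a file read-only, strips every permission bit, and then reads through the descriptor that was opened earlier.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): create `path`, open it
 * read-only, remove every permission bit, then read through the
 * descriptor that was opened before the chmod.  Returns the number of
 * bytes read, or -1 on any failure.
 */
static ssize_t read_after_chmod(const char *path)
{
    static const char payload[] = "hello";
    const ssize_t payload_len = (ssize_t)(sizeof(payload) - 1);
    char buf[sizeof(payload)];
    ssize_t n = -1;
    int wfd, rfd;

    assert(path != NULL);

    /* Create the file and fill it with a known payload. */
    wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0)
        return -1;
    if (write(wfd, payload, sizeof(payload) - 1) != payload_len) {
        close(wfd);
        unlink(path);
        return -1;
    }
    close(wfd);

    /* The permission check happens here, at open time. */
    rfd = open(path, O_RDONLY);
    if (rfd >= 0) {
        /* Dropping all mode bits does not revoke the open descriptor. */
        if (chmod(path, 0) == 0)
            n = read(rfd, buf, sizeof(buf));
        close(rfd);
    }
    unlink(path);
    return n;
}
```

Calling read_after_chmod() on a scratch path in a writable directory should return the payload length even though the file has mode 0 by the time read() runs. An O_EXEC-style open would presumably behave the same way for execute permission, but that is an extrapolation from the message above rather than something this sketch tests.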

^ permalink raw reply

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Andy Lutomirski @ 2014-11-17 18:31 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Josh Triplett, Andrew Morton, Eric W. Biederman, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <546A3942.5040906-iSGtlc1asvQWG2LlvL+J4A@public.gmane.org>

On Mon, Nov 17, 2014 at 10:06 AM, Casey Schaufler
<casey-iSGtlc1asvQWG2LlvL+J4A@public.gmane.org> wrote:
> On 11/15/2014 1:01 AM, Josh Triplett wrote:
>> Currently, unprivileged processes (without CAP_SETGID) cannot call
>> setgroups at all.  In particular, processes with a set of supplementary
>> groups cannot further drop permissions without obtaining elevated
>> permissions first.
>
> Has anyone put any thought into how this will interact with
> POSIX ACLs? I don't see that anywhere in the discussion.

That means that user namespaces are a problem, too, and we need to fix
it.  Or we should add some control to turn unprivileged user namespace
creation on and off and document that turning it on defeats POSIX ACLs
with a group entry that is more restrictive than the other entry.

--Andy

>
> Tizen takes advantage of the fact you can't drop groups. If
> a process can drop itself out of groups without privilege
> it can circumvent the system security policy.
>
> Back when the LSM was introduced a choice was made between
> authoritative hooks (which would have allowed this sort of thing)
> and restrictive hooks (which would not). Authoritative hooks were
> rejected because they would have "broken Linux". I hope that the
> people who spoke up then will speak up now.
>
> This is going to introduce a whole class of vulnerabilities.
> Don't even think of arguing that the reduction in use of privilege
> will make up for that. Developers have enough trouble with the
> difference between setuid() and seteuid() to expect them to use
> group dropping properly.
>
>>
>> Allow unprivileged processes to call setgroups with a subset of their
>> current groups; only require CAP_SETGID to add a group the process does
>> not currently have.
>>
>> The kernel already maintains the list of supplementary group IDs in
>> sorted order, and setgroups already needs to sort the new list, so this
>> just requires a linear comparison of the two sorted lists.
>>
>> This moves the CAP_SETGID test from setgroups into set_current_groups.
>>
>> Tested via the following test program:
>>
>> #define _GNU_SOURCE
>> #include <err.h>
>> #include <grp.h>
>> #include <stdio.h>
>> #include <sys/types.h>
>> #include <sys/wait.h>
>> #include <unistd.h>
>>
>> void run_id(void)
>> {
>>     pid_t p = fork();
>>     switch (p) {
>>         case -1:
>>             err(1, "fork");
>>         case 0:
>>             execl("/usr/bin/id", "id", NULL);
>>             err(1, "exec");
>>         default:
>>             if (waitpid(p, NULL, 0) < 0)
>>                 err(1, "waitpid");
>>     }
>> }
>>
>> int main(void)
>> {
>>     gid_t list1[] = { 1, 2, 3, 4, 5 };
>>     gid_t list2[] = { 2, 3, 4 };
>>     run_id();
>>     if (setgroups(5, list1) < 0)
>>         err(1, "setgroups 1");
>>     run_id();
>>     if (setresgid(1, 1, 1) < 0)
>>         err(1, "setresgid");
>>     if (setresuid(1, 1, 1) < 0)
>>         err(1, "setresuid");
>>     run_id();
>>     if (setgroups(3, list2) < 0)
>>         err(1, "setgroups 2");
>>     run_id();
>>     if (setgroups(5, list1) < 0)
>>         err(1, "setgroups 3");
>>     run_id();
>>
>>     return 0;
>> }
>>
>> Without this patch, the test program gets EPERM from the second
>> setgroups call, after dropping root privileges.  With this patch, the
>> test program successfully drops groups 1 and 5, but then gets EPERM from
>> the third setgroups call, since that call attempts to add groups the
>> process does not currently have.
>>
>> Signed-off-by: Josh Triplett <josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
>> ---
>>  kernel/groups.c | 33 ++++++++++++++++++++++++++++++---
>>  kernel/uid16.c  |  2 --
>>  2 files changed, 30 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/groups.c b/kernel/groups.c
>> index f0667e7..fe7367d 100644
>> --- a/kernel/groups.c
>> +++ b/kernel/groups.c
>> @@ -153,6 +153,29 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
>>       return 0;
>>  }
>>
>> +/* Compare two sorted group lists; return true if the first is a subset of the
>> + * second.
>> + */
>> +static bool is_subset(const struct group_info *g1, const struct group_info *g2)
>> +{
>> +     unsigned int i, j;
>> +
>> +     for (i = 0, j = 0; i < g1->ngroups; i++) {
>> +             kgid_t gid1 = GROUP_AT(g1, i);
>> +             kgid_t gid2;
>> +             for (; j < g2->ngroups; j++) {
>> +                     gid2 = GROUP_AT(g2, j);
>> +                     if (gid_lte(gid1, gid2))
>> +                             break;
>> +             }
>> +             if (j >= g2->ngroups || !gid_eq(gid1, gid2))
>> +                     return false;
>> +             j++;
>> +     }
>> +
>> +     return true;
>> +}
>> +
>>  /**
>>   * set_groups_sorted - Change a group subscription in a set of credentials
>>   * @new: The newly prepared set of credentials to alter
>> @@ -189,11 +212,17 @@ int set_current_groups(struct group_info *group_info)
>>  {
>>       struct cred *new;
>>
>> +     groups_sort(group_info);
>>       new = prepare_creds();
>>       if (!new)
>>               return -ENOMEM;
>> +     if (!ns_capable(current_user_ns(), CAP_SETGID)
>> +         && !is_subset(group_info, new->group_info)) {
>> +             abort_creds(new);
>> +             return -EPERM;
>> +     }
>>
>> -     set_groups(new, group_info);
>> +     set_groups_sorted(new, group_info);
>>       return commit_creds(new);
>>  }
>>
>> @@ -233,8 +262,6 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
>>       struct group_info *group_info;
>>       int retval;
>>
>> -     if (!ns_capable(current_user_ns(), CAP_SETGID))
>> -             return -EPERM;
>>       if ((unsigned)gidsetsize > NGROUPS_MAX)
>>               return -EINVAL;
>>
>> diff --git a/kernel/uid16.c b/kernel/uid16.c
>> index 602e5bb..b27e167 100644
>> --- a/kernel/uid16.c
>> +++ b/kernel/uid16.c
>> @@ -176,8 +176,6 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
>>       struct group_info *group_info;
>>       int retval;
>>
>> -     if (!ns_capable(current_user_ns(), CAP_SETGID))
>> -             return -EPERM;
>>       if ((unsigned)gidsetsize > NGROUPS_MAX)
>>               return -EINVAL;
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC
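
The linear comparison of two sorted lists described in the quoted patch can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel code: plain unsigned ints stand in for kgid_t, and the name sorted_is_subset is made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * User-space sketch of a linear subset test on two ascending arrays.
 * Plain unsigned ints stand in for the kernel's kgid_t; the function
 * name is hypothetical.
 */
static bool sorted_is_subset(const unsigned int *sub, size_t nsub,
                             const unsigned int *super, size_t nsuper)
{
    size_t i, j = 0;

    assert(sub != NULL || nsub == 0);
    assert(super != NULL || nsuper == 0);

    for (i = 0; i < nsub; i++) {
        /* Skip entries of super that are smaller than sub[i]. */
        while (j < nsuper && super[j] < sub[i])
            j++;
        /* Ran off the end, or landed on a larger value: not a member. */
        if (j == nsuper || super[j] != sub[i])
            return false;
        /* Consume the match, as the quoted patch does. */
        j++;
    }
    return true;
}
```

Because each match is consumed, a value that appears twice in sub must also appear twice in super; the quoted is_subset() appears to behave the same way. Each index only moves forward, so the cost is linear in the combined length of the two lists.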

^ permalink raw reply

* Re: [PATCH 1/5] kdbus: extend structures with security pointer for lsm
From: Greg KH @ 2014-11-17 18:37 UTC (permalink / raw)
  To: Karol Lewandowski
  Cc: pmoore-H+wXaHxf7aLQT0dZR+AlfA, jkosina-AlSwsSmVLrQ,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, arnd-r2nGTMty4D4,
	tj-DgEjT+Ai2ygdnm+yROfE0A, desrt-0xnayjDhYQY,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A, dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w,
	casey.schaufler-ral2JQCrhuEAvxtiuMwx3w,
	marcel-kz+m5ild9QBg9hUCZPvPmw, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	r.krypa-Sze3O3UU22JBDgjK7y7TUQ, Karol Lewandowski
In-Reply-To: <20141117014732.GA21745@pix>

On Mon, Nov 17, 2014 at 02:47:32AM +0100, Karol Lewandowski wrote:
> On Fri, Oct 31, 2014 at 05:36:33PM +0100, Karol Lewandowski wrote:
> > Signed-off-by: Karol Lewandowski <k.lewandowsk-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
> > ---
> >  drivers/misc/kdbus/bus.h        | 2 ++
> >  drivers/misc/kdbus/connection.h | 2 ++
> >  drivers/misc/kdbus/domain.h     | 2 ++
> >  3 files changed, 6 insertions(+)
> > 
> > diff --git a/drivers/misc/kdbus/bus.h b/drivers/misc/kdbus/bus.h
> > index fd9d843..5c403ef 100644
> > --- a/drivers/misc/kdbus/bus.h
> > +++ b/drivers/misc/kdbus/bus.h
> > @@ -49,6 +49,7 @@
> >   * @conn_hash:		Map of connection IDs
> >   * @monitors_list:	Connections that monitor this bus
> >   * @meta:		Meta information about the bus creator
> > + * @security:		LSM security blob
> >   *
> >   * A bus provides a "bus" endpoint / device node.
> >   *
> > @@ -84,6 +85,7 @@ struct kdbus_bus {
> >  	struct list_head monitors_list;
> >  
> >  	struct kdbus_meta *meta;
> > +	void *security;
> 
> One minor note: since LSM modules may need to access these
> structures, it would be worth placing the kdbus headers in, say,
> include/linux/kdbus or a similar globally accessible location.

Yes, that is a good idea.

^ permalink raw reply

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Andy Lutomirski @ 2014-11-17 18:46 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Josh Triplett, Andrew Morton, Eric W. Biederman, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel@vger.kernel.org
In-Reply-To: <CALCETrWy-3C1on3+2wE9aqcChBUCUY0i_4AL_Om8CSZ5QB18sQ@mail.gmail.com>

On Mon, Nov 17, 2014 at 10:31 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Nov 17, 2014 at 10:06 AM, Casey Schaufler
> <casey@schaufler-ca.com> wrote:
>> On 11/15/2014 1:01 AM, Josh Triplett wrote:
>>> Currently, unprivileged processes (without CAP_SETGID) cannot call
>>> setgroups at all.  In particular, processes with a set of supplementary
>>> groups cannot further drop permissions without obtaining elevated
>>> permissions first.
>>
>> Has anyone put any thought into how this will interact with
>> POSIX ACLs? I don't see that anywhere in the discussion.
>
> That means that user namespaces are a problem, too, and we need to fix
> it.  Or we should add some control to turn unprivileged user namespace
> creation on and off and document that turning it on defeats POSIX ACLs
> with a group entry that is more restrictive than the other entry.
>

This is a significant enough issue that I posted it to oss-security:

http://www.openwall.com/lists/oss-security/2014/11/17/19

It's not at all obvious to me how to fix it.  We could disallow userns
creation if any supplementary groups don't match fsuid, or we could
keep negative-only groups around in the userns.

It may be worth adding a sysctl to change the behavior, too.  IMO it's
absurd to use groups to deny permissions that are otherwise available.

--Andy

> --Andy
>
>>
>> Tizen takes advantage of the fact you can't drop groups. If
>> a process can drop itself out of groups without privilege
>> it can circumvent the system security policy.
>>
>> Back when the LSM was introduced a choice was made between
>> authoritative hooks (which would have allowed this sort of thing)
>> and restrictive hooks (which would not). Authoritative hooks were
>> rejected because they would have "broken Linux". I hope that the
>> people who spoke up then will speak up now.
>>
>> This is going to introduce a whole class of vulnerabilities.
>> Don't even think of arguing that the reduction in use of privilege
>> will make up for that. Developers have enough trouble with the
>> difference between setuid() and seteuid() to expect them to use
>> group dropping properly.
>>
>>>
>>> Allow unprivileged processes to call setgroups with a subset of their
>>> current groups; only require CAP_SETGID to add a group the process does
>>> not currently have.
>>>
>>> The kernel already maintains the list of supplementary group IDs in
>>> sorted order, and setgroups already needs to sort the new list, so this
>>> just requires a linear comparison of the two sorted lists.
>>>
>>> This moves the CAP_SETGID test from setgroups into set_current_groups.
>>>
>>> Tested via the following test program:
>>>
>>> #define _GNU_SOURCE
>>> #include <err.h>
>>> #include <grp.h>
>>> #include <stdio.h>
>>> #include <sys/types.h>
>>> #include <sys/wait.h>
>>> #include <unistd.h>
>>>
>>> void run_id(void)
>>> {
>>>     pid_t p = fork();
>>>     switch (p) {
>>>         case -1:
>>>             err(1, "fork");
>>>         case 0:
>>>             execl("/usr/bin/id", "id", NULL);
>>>             err(1, "exec");
>>>         default:
>>>             if (waitpid(p, NULL, 0) < 0)
>>>                 err(1, "waitpid");
>>>     }
>>> }
>>>
>>> int main(void)
>>> {
>>>     gid_t list1[] = { 1, 2, 3, 4, 5 };
>>>     gid_t list2[] = { 2, 3, 4 };
>>>     run_id();
>>>     if (setgroups(5, list1) < 0)
>>>         err(1, "setgroups 1");
>>>     run_id();
>>>     if (setresgid(1, 1, 1) < 0)
>>>         err(1, "setresgid");
>>>     if (setresuid(1, 1, 1) < 0)
>>>         err(1, "setresuid");
>>>     run_id();
>>>     if (setgroups(3, list2) < 0)
>>>         err(1, "setgroups 2");
>>>     run_id();
>>>     if (setgroups(5, list1) < 0)
>>>         err(1, "setgroups 3");
>>>     run_id();
>>>
>>>     return 0;
>>> }
>>>
>>> Without this patch, the test program gets EPERM from the second
>>> setgroups call, after dropping root privileges.  With this patch, the
>>> test program successfully drops groups 1 and 5, but then gets EPERM from
>>> the third setgroups call, since that call attempts to add groups the
>>> process does not currently have.
>>>
>>> Signed-off-by: Josh Triplett <josh@joshtriplett.org>
>>> ---
>>>  kernel/groups.c | 33 ++++++++++++++++++++++++++++++---
>>>  kernel/uid16.c  |  2 --
>>>  2 files changed, 30 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/groups.c b/kernel/groups.c
>>> index f0667e7..fe7367d 100644
>>> --- a/kernel/groups.c
>>> +++ b/kernel/groups.c
>>> @@ -153,6 +153,29 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
>>>       return 0;
>>>  }
>>>
>>> +/* Compare two sorted group lists; return true if the first is a subset of the
>>> + * second.
>>> + */
>>> +static bool is_subset(const struct group_info *g1, const struct group_info *g2)
>>> +{
>>> +     unsigned int i, j;
>>> +
>>> +     for (i = 0, j = 0; i < g1->ngroups; i++) {
>>> +             kgid_t gid1 = GROUP_AT(g1, i);
>>> +             kgid_t gid2;
>>> +             for (; j < g2->ngroups; j++) {
>>> +                     gid2 = GROUP_AT(g2, j);
>>> +                     if (gid_lte(gid1, gid2))
>>> +                             break;
>>> +             }
>>> +             if (j >= g2->ngroups || !gid_eq(gid1, gid2))
>>> +                     return false;
>>> +             j++;
>>> +     }
>>> +
>>> +     return true;
>>> +}
>>> +
>>>  /**
>>>   * set_groups_sorted - Change a group subscription in a set of credentials
>>>   * @new: The newly prepared set of credentials to alter
>>> @@ -189,11 +212,17 @@ int set_current_groups(struct group_info *group_info)
>>>  {
>>>       struct cred *new;
>>>
>>> +     groups_sort(group_info);
>>>       new = prepare_creds();
>>>       if (!new)
>>>               return -ENOMEM;
>>> +     if (!ns_capable(current_user_ns(), CAP_SETGID)
>>> +         && !is_subset(group_info, new->group_info)) {
>>> +             abort_creds(new);
>>> +             return -EPERM;
>>> +     }
>>>
>>> -     set_groups(new, group_info);
>>> +     set_groups_sorted(new, group_info);
>>>       return commit_creds(new);
>>>  }
>>>
>>> @@ -233,8 +262,6 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
>>>       struct group_info *group_info;
>>>       int retval;
>>>
>>> -     if (!ns_capable(current_user_ns(), CAP_SETGID))
>>> -             return -EPERM;
>>>       if ((unsigned)gidsetsize > NGROUPS_MAX)
>>>               return -EINVAL;
>>>
>>> diff --git a/kernel/uid16.c b/kernel/uid16.c
>>> index 602e5bb..b27e167 100644
>>> --- a/kernel/uid16.c
>>> +++ b/kernel/uid16.c
>>> @@ -176,8 +176,6 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
>>>       struct group_info *group_info;
>>>       int retval;
>>>
>>> -     if (!ns_capable(current_user_ns(), CAP_SETGID))
>>> -             return -EPERM;
>>>       if ((unsigned)gidsetsize > NGROUPS_MAX)
>>>               return -EINVAL;
>>>
>>
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Casey Schaufler @ 2014-11-17 18:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Triplett, Andrew Morton, Eric W. Biederman, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel@vger.kernel.org, LSM
In-Reply-To: <CALCETrVn4gVXp7F=5h-bkN5VWuRMG9BoxgeQfKhX4+ZXxGE=wQ@mail.gmail.com>

On 11/17/2014 10:46 AM, Andy Lutomirski wrote:
> On Mon, Nov 17, 2014 at 10:31 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Nov 17, 2014 at 10:06 AM, Casey Schaufler
>> <casey@schaufler-ca.com> wrote:
>>> On 11/15/2014 1:01 AM, Josh Triplett wrote:
>>>> Currently, unprivileged processes (without CAP_SETGID) cannot call
>>>> setgroups at all.  In particular, processes with a set of supplementary
>>>> groups cannot further drop permissions without obtaining elevated
>>>> permissions first.
>>> Has anyone put any thought into how this will interact with
>>> POSIX ACLs? I don't see that anywhere in the discussion.
>> That means that user namespaces are a problem, too, and we need to fix
>> it.  Or we should add some control to turn unprivileged user namespace
>> creation on and off and document that turning it on defeats POSIX ACLs
>> with a group entry that is more restrictive than the other entry.
>>
> This is a significant enough issue that I posted it to oss-security:
>
> http://www.openwall.com/lists/oss-security/2014/11/17/19
>
> It's not at all obvious to me how to fix it.  We could disallow userns
> creation if any supplementary groups don't match fsuid, or we could
> keep negative-only groups around in the userns.
>
> It may be worth adding a sysctl to change the behavior, too.  IMO it's
> absurd to use groups to deny permissions that are otherwise available.

Absurd or not, it's traditional behavior, and if you don't have ACLs it
is the best way to accomplish the security goal.
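
To make the "negative group" pattern concrete, here is a rough model of the classic owner/group/other read check. It is only a sketch: the struct and function names are invented for this example, and it ignores root, capabilities, ACLs and LSMs entirely. The point it illustrates is that the first matching class decides, so a file with mode 0604 denies group members what everyone else is allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ownership and mode bits of an inode. */
struct demo_file {
    unsigned int uid;
    unsigned int gid;
    unsigned int mode;   /* e.g. 0604 */
};

/* Linear search of the caller's supplementary groups. */
static bool in_groups(unsigned int gid, const unsigned int *groups, size_t n)
{
    size_t i;

    assert(groups != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (groups[i] == gid)
            return true;
    return false;
}

/*
 * Simplified read check: exactly one of owner, group or other applies,
 * in that order.  Root, capabilities and ACLs are deliberately left out.
 */
static bool may_read(const struct demo_file *f, unsigned int uid,
                     unsigned int primary_gid,
                     const unsigned int *groups, size_t ngroups)
{
    if (uid == f->uid)
        return (f->mode & 0400) != 0;
    if (primary_gid == f->gid || in_groups(f->gid, groups, ngroups))
        return (f->mode & 0040) != 0;
    return (f->mode & 0004) != 0;
}
```

With a file owned by group 100 and mode 0604, a caller whose supplementary groups include 100 is denied, while the same caller without group 100 falls through to the "other" bits and is allowed. That is the escalation an unprivileged group drop would open up in this model.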


>
> --Andy
>
>> --Andy
>>
>>> Tizen takes advantage of the fact you can't drop groups. If
>>> a process can drop itself out of groups without privilege
>>> it can circumvent the system security policy.
>>>
>>> Back when the LSM was introduced a choice was made between
>>> authoritative hooks (which would have allowed this sort of thing)
>>> and restrictive hooks (which would not). Authoritative hooks were
>>> rejected because they would have "broken Linux". I hope that the
>>> people who spoke up then will speak up now.
>>>
>>> This is going to introduce a whole class of vulnerabilities.
>>> Don't even think of arguing that the reduction in use of privilege
>>> will make up for that. Developers have enough trouble with the
>>> difference between setuid() and seteuid() to expect them to use
>>> group dropping properly.
>>>
>>>> Allow unprivileged processes to call setgroups with a subset of their
>>>> current groups; only require CAP_SETGID to add a group the process does
>>>> not currently have.
>>>>
>>>> The kernel already maintains the list of supplementary group IDs in
>>>> sorted order, and setgroups already needs to sort the new list, so this
>>>> just requires a linear comparison of the two sorted lists.
>>>>
>>>> This moves the CAP_SETGID test from setgroups into set_current_groups.
>>>>
>>>> Tested via the following test program:
>>>>
>>>> #define _GNU_SOURCE
>>>> #include <err.h>
>>>> #include <grp.h>
>>>> #include <stdio.h>
>>>> #include <sys/types.h>
>>>> #include <sys/wait.h>
>>>> #include <unistd.h>
>>>>
>>>> void run_id(void)
>>>> {
>>>>     pid_t p = fork();
>>>>     switch (p) {
>>>>         case -1:
>>>>             err(1, "fork");
>>>>         case 0:
>>>>             execl("/usr/bin/id", "id", NULL);
>>>>             err(1, "exec");
>>>>         default:
>>>>             if (waitpid(p, NULL, 0) < 0)
>>>>                 err(1, "waitpid");
>>>>     }
>>>> }
>>>>
>>>> int main(void)
>>>> {
>>>>     gid_t list1[] = { 1, 2, 3, 4, 5 };
>>>>     gid_t list2[] = { 2, 3, 4 };
>>>>     run_id();
>>>>     if (setgroups(5, list1) < 0)
>>>>         err(1, "setgroups 1");
>>>>     run_id();
>>>>     if (setresgid(1, 1, 1) < 0)
>>>>         err(1, "setresgid");
>>>>     if (setresuid(1, 1, 1) < 0)
>>>>         err(1, "setresuid");
>>>>     run_id();
>>>>     if (setgroups(3, list2) < 0)
>>>>         err(1, "setgroups 2");
>>>>     run_id();
>>>>     if (setgroups(5, list1) < 0)
>>>>         err(1, "setgroups 3");
>>>>     run_id();
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>> Without this patch, the test program gets EPERM from the second
>>>> setgroups call, after dropping root privileges.  With this patch, the
>>>> test program successfully drops groups 1 and 5, but then gets EPERM from
>>>> the third setgroups call, since that call attempts to add groups the
>>>> process does not currently have.
>>>>
>>>> Signed-off-by: Josh Triplett <josh@joshtriplett.org>
>>>> ---
>>>>  kernel/groups.c | 33 ++++++++++++++++++++++++++++++---
>>>>  kernel/uid16.c  |  2 --
>>>>  2 files changed, 30 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/groups.c b/kernel/groups.c
>>>> index f0667e7..fe7367d 100644
>>>> --- a/kernel/groups.c
>>>> +++ b/kernel/groups.c
>>>> @@ -153,6 +153,29 @@ int groups_search(const struct group_info *group_info, kgid_t grp)
>>>>       return 0;
>>>>  }
>>>>
>>>> +/* Compare two sorted group lists; return true if the first is a subset of the
>>>> + * second.
>>>> + */
>>>> +static bool is_subset(const struct group_info *g1, const struct group_info *g2)
>>>> +{
>>>> +     unsigned int i, j;
>>>> +
>>>> +     for (i = 0, j = 0; i < g1->ngroups; i++) {
>>>> +             kgid_t gid1 = GROUP_AT(g1, i);
>>>> +             kgid_t gid2;
>>>> +             for (; j < g2->ngroups; j++) {
>>>> +                     gid2 = GROUP_AT(g2, j);
>>>> +                     if (gid_lte(gid1, gid2))
>>>> +                             break;
>>>> +             }
>>>> +             if (j >= g2->ngroups || !gid_eq(gid1, gid2))
>>>> +                     return false;
>>>> +             j++;
>>>> +     }
>>>> +
>>>> +     return true;
>>>> +}
>>>> +
>>>>  /**
>>>>   * set_groups_sorted - Change a group subscription in a set of credentials
>>>>   * @new: The newly prepared set of credentials to alter
>>>> @@ -189,11 +212,17 @@ int set_current_groups(struct group_info *group_info)
>>>>  {
>>>>       struct cred *new;
>>>>
>>>> +     groups_sort(group_info);
>>>>       new = prepare_creds();
>>>>       if (!new)
>>>>               return -ENOMEM;
>>>> +     if (!ns_capable(current_user_ns(), CAP_SETGID)
>>>> +         && !is_subset(group_info, new->group_info)) {
>>>> +             abort_creds(new);
>>>> +             return -EPERM;
>>>> +     }
>>>>
>>>> -     set_groups(new, group_info);
>>>> +     set_groups_sorted(new, group_info);
>>>>       return commit_creds(new);
>>>>  }
>>>>
>>>> @@ -233,8 +262,6 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
>>>>       struct group_info *group_info;
>>>>       int retval;
>>>>
>>>> -     if (!ns_capable(current_user_ns(), CAP_SETGID))
>>>> -             return -EPERM;
>>>>       if ((unsigned)gidsetsize > NGROUPS_MAX)
>>>>               return -EINVAL;
>>>>
>>>> diff --git a/kernel/uid16.c b/kernel/uid16.c
>>>> index 602e5bb..b27e167 100644
>>>> --- a/kernel/uid16.c
>>>> +++ b/kernel/uid16.c
>>>> @@ -176,8 +176,6 @@ SYSCALL_DEFINE2(setgroups16, int, gidsetsize, old_gid_t __user *, grouplist)
>>>>       struct group_info *group_info;
>>>>       int retval;
>>>>
>>>> -     if (!ns_capable(current_user_ns(), CAP_SETGID))
>>>> -             return -EPERM;
>>>>       if ((unsigned)gidsetsize > NGROUPS_MAX)
>>>>               return -EINVAL;
>>>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>
>


^ permalink raw reply

* [PATCH] block: create ioctl to discard-or-zeroout a range of blocks
From: Darrick J. Wong @ 2014-11-17 19:28 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi, linux-ide, linux-fsdevel, neilb, linux-api
In-Reply-To: <20141111000433.GA10047@birch.djwong.org>

Create a new ioctl to expose the block layer's newfound ability to
issue either a zeroing discard, a WRITE SAME with a zero page, or a
regular write with the zero page.  This BLKZEROOUT2 ioctl takes
{start, length, flags} as parameters.  So far, the only flag available
is to enable the zeroing discard part -- without it, the call invokes
the old BLKZEROOUT behavior.  start and length have the same meaning
as in BLKZEROOUT.

Furthermore, because BLKZEROOUT2 issues commands directly to the
storage device, we must invalidate the page cache (as a regular
O_DIRECT write would do) to avoid returning stale cache contents at a
later time.

This patch depends on mkp's earlier patch "block: Introduce
blkdev_issue_zeroout_discard() function".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 block/ioctl.c           |   45 ++++++++++++++++++++++++++++++++++++++-------
 include/uapi/linux/fs.h |    7 +++++++
 2 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 7d8befd..ff623d5 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -186,19 +186,39 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 }
 
 static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
-			     uint64_t len)
+			     uint64_t len, uint32_t flags)
 {
+	int ret;
+	struct address_space *mapping;
+	uint64_t end = start + len - 1;
+
+	if (flags & ~BLKZEROOUT2_DISCARD_OK)
+		return -EINVAL;
 	if (start & 511)
 		return -EINVAL;
 	if (len & 511)
 		return -EINVAL;
-	start >>= 9;
-	len >>= 9;
-
-	if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+	if (end >= i_size_read(bdev->bd_inode))
 		return -EINVAL;
 
-	return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
+	/* Invalidate the page cache, including dirty pages */
+	mapping = bdev->bd_inode->i_mapping;
+	truncate_inode_pages_range(mapping, start, end);
+
+	ret = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+				   flags & BLKZEROOUT2_DISCARD_OK);
+	if (ret)
+		goto out;
+
+	/*
+	 * Invalidate again; if someone wandered in and dirtied a page,
+	 * the caller will be given -EBUSY.
+	 */
+	ret = invalidate_inode_pages2_range(mapping,
+					    start >> PAGE_CACHE_SHIFT,
+					    end >> PAGE_CACHE_SHIFT);
+out:
+	return ret;
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)
@@ -326,7 +346,18 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 		if (copy_from_user(range, (void __user *)arg, sizeof(range)))
 			return -EFAULT;
 
-		return blk_ioctl_zeroout(bdev, range[0], range[1]);
+		return blk_ioctl_zeroout(bdev, range[0], range[1], 0);
+	}
+	case BLKZEROOUT2: {
+		struct blkzeroout2 p;
+
+		if (!(mode & FMODE_WRITE))
+			return -EBADF;
+
+		if (copy_from_user(&p, (void __user *)arg, sizeof(p)))
+			return -EFAULT;
+
+		return blk_ioctl_zeroout(bdev, p.start, p.length, p.flags);
 	}
 
 	case HDIO_GETGEO: {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..54d24ea 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -150,6 +150,13 @@ struct inodes_stat_t {
 #define BLKSECDISCARD _IO(0x12,125)
 #define BLKROTATIONAL _IO(0x12,126)
 #define BLKZEROOUT _IO(0x12,127)
+struct blkzeroout2 {
+	__u64 start;
+	__u64 length;
+	__u32 flags;
+};
+#define BLKZEROOUT2_DISCARD_OK	1
+#define BLKZEROOUT2 _IOR(0x12, 127, struct blkzeroout2)
 
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
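
The argument checks in blk_ioctl_zeroout() above can be modelled in user space. The following is a sketch under assumptions, not the kernel code: check_zeroout_range and DEMO_DISCARD_OK are hypothetical names, with DEMO_DISCARD_OK standing in for the proposed BLKZEROOUT2_DISCARD_OK. It also deliberately differs from the patch in two ways: it rejects a zero-length range, and it compares len against the space remaining on the device rather than computing start + len - 1, so the sum cannot wrap.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for the proposed BLKZEROOUT2_DISCARD_OK flag (assumption). */
#define DEMO_DISCARD_OK 1u

/*
 * Hypothetical user-space model of the range checks.  Returns 0 when the
 * byte range [start, start + len) is acceptable, -EINVAL otherwise.
 */
static int check_zeroout_range(uint64_t start, uint64_t len,
                               uint32_t flags, uint64_t dev_size)
{
    /* This sketch assumes the device size is a whole number of sectors. */
    assert((dev_size & 511) == 0);

    /* Unknown flag bits are rejected, as in the patch. */
    if (flags & ~DEMO_DISCARD_OK)
        return -EINVAL;
    /* Both offset and length must be 512-byte aligned. */
    if ((start & 511) || (len & 511))
        return -EINVAL;
    /* Unlike the patch, refuse empty ranges and avoid start + len wrap. */
    if (len == 0 || start >= dev_size || len > dev_size - start)
        return -EINVAL;
    return 0;
}
```

Whether the kernel side should reject empty ranges or guard against wraparound in the same way is a question for review of the patch itself; this sketch only shows one way to express the checks without overflow.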

^ permalink raw reply related

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Eric W.Biederman @ 2014-11-17 22:11 UTC (permalink / raw)
  To: Andy Lutomirski, One Thousand Gnomes
  Cc: linux-man, Ted Ts'o, Michael Kerrisk-manpages, Josh Triplett,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andrew Morton, Linux API, Kees Cook
In-Reply-To: <CALCETrXi1qHyu4_U7cbROB74n461nBZ9R7=0kfhR8-VFAwOF1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>



On November 17, 2014 1:07:30 PM EST, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>On Nov 17, 2014 3:37 AM, "One Thousand Gnomes"
><gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> wrote:
>>
>> > optional), I can do that too.  The security model of "having a
>group
>> > gives you less privilege than not having it" seems crazy, but
>> > nonetheless I can see a couple of easy ways that we can avoid
>breaking
>>
>> It's an old pattern of use that makes complete sense in a traditional
>> Unix permission world because it's the only way to do "exclude
>{list}"
>> nicely. Our default IMHO shouldn't break this.
>>
>> > that pattern, no_new_privs being one of them.  I'd like to make
>sure
>> > that nobody sees any other real-world corner case that unprivileged
>> > setgroups would break.
>>
>> Barring the usual risk of people doing improper error checking I
>don't
>> see one immediately.
>>
>> For containers I think it actually makes sense that the sysctl can be
>> applied per container anyway.
>
>We'll probably need per container sysctls some day.

We already have a mess of per network namespace sysctls,
as well as a few for other namespaces.

We have the infrastructure; it is just a matter of using it for whatever purpose we need.

Eric


^ permalink raw reply

* Re: [PATCH 2/2] groups: Allow unprivileged processes to use setgroups to drop groups
From: Andy Lutomirski @ 2014-11-17 22:22 UTC (permalink / raw)
  To: Eric W.Biederman
  Cc: One Thousand Gnomes, linux-man, Ted Ts'o,
	Michael Kerrisk-manpages, Josh Triplett,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Andrew Morton, Linux API, Kees Cook
In-Reply-To: <0b65fd07-48ea-483b-8fd5-fd84d0bff881-2ueSQiBKiTY7tOexoI0I+QC/G2K4zDHf@public.gmane.org>

On Mon, Nov 17, 2014 at 2:11 PM, Eric W.Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
>
> On November 17, 2014 1:07:30 PM EST, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>On Nov 17, 2014 3:37 AM, "One Thousand Gnomes"
>><gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> wrote:
>>>
>>> > optional), I can do that too.  The security model of "having a
>>group
>>> > gives you less privilege than not having it" seems crazy, but
>>> > nonetheless I can see a couple of easy ways that we can avoid
>>breaking
>>>
>>> It's an old pattern of use that makes complete sense in a traditional
>>> Unix permission world because it's the only way to do "exclude
>>{list}"
>>> nicely. Our default IMHO shouldn't break this.
>>>
>>> > that pattern, no_new_privs being one of them.  I'd like to make
>>sure
>>> > that nobody sees any other real-world corner case that unprivileged
>>> > setgroups would break.
>>>
>>> Barring the usual risk of people doing improper error checking I
>>don't
>>> see one immediately.
>>>
>>> For containers I think it actually makes sense that the sysctl can be
>>> applied per container anyway.
>>
>>We'll probably need per container sysctls some day.
>
> We already have a mess of per network namespace sysctls,
> as well as a few for other namespaces.
>
> We have the infrastructure; it is just a matter of using it for whatever purpose we need.
>

A list of group id ranges that it's permissible to drop would do the
trick, both for setgroups and for unshare.  The downside would be that
users in those groups (i.e. everyone by default) would not be able to
unshare their user ns.

Better ideas welcome.

--Andy

^ permalink raw reply
