Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-05-29 16:03 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190529153427.GB8959@cisco>

On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > It is not permitted to unset the audit container identifier.
> > > > A child inherits its parent's audit container identifier.
> > >
> > > ...
> > >
> > > >  /**
> > > > + * audit_set_contid - set current task's audit contid
> > > > + * @contid: contid value
> > > > + *
> > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > + *
> > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > + */
> > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > +{
> > > > +     u64 oldcontid;
> > > > +     int rc = 0;
> > > > +     struct audit_buffer *ab;
> > > > +     uid_t uid;
> > > > +     struct tty_struct *tty;
> > > > +     char comm[sizeof(current->comm)];
> > > > +
> > > > +     task_lock(task);
> > > > +     /* Can't set if audit disabled */
> > > > +     if (!task->audit) {
> > > > +             task_unlock(task);
> > > > +             return -ENOPROTOOPT;
> > > > +     }
> > > > +     oldcontid = audit_get_contid(task);
> > > > +     read_lock(&tasklist_lock);
> > > > +     /* Don't allow the audit containerid to be unset */
> > > > +     if (!audit_contid_valid(contid))
> > > > +             rc = -EINVAL;
> > > > +     /* if we don't have caps, reject */
> > > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > > +             rc = -EPERM;
> > > > +     /* if task has children or is not single-threaded, deny */
> > > > +     else if (!list_empty(&task->children))
> > > > +             rc = -EBUSY;
> > > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > +             rc = -EALREADY;
> > > > +     read_unlock(&tasklist_lock);
> > > > +     if (!rc)
> > > > +             task->audit->contid = contid;
> > > > +     task_unlock(task);
> > > > +
> > > > +     if (!audit_enabled)
> > > > +             return rc;
> > >
> > > ...but it is allowed to change it (assuming
> > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > immediately useful since we still live in the world of majority
> > > privileged containers if we didn't allow changing it, in addition to
> > > un-setting it.
> >
> > The idea is that only container orchestrators should be able to
> > set/modify the audit container ID, and since setting the audit
> > container ID can have a significant effect on the records captured
> > (and their routing to multiple daemons when we get there) modifying
> > the audit container ID is akin to modifying the audit configuration
> > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > is that you would only change the audit container ID from one
> > set/inherited value to another if you were nesting containers, in
> > which case the nested container orchestrator would need to be granted
> > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > compromise).
>
> But then don't you want some kind of ns_capable() instead (probably
> not the obvious one, though...)? With capable(), you can't really nest
> using the audit-id and user namespaces together.

You want capable() and not ns_capable() because you want to ensure
that the orchestrator has the rights in the init_ns as changes to the
audit container ID could have an auditing impact that spans the entire
system.  Setting the audit container ID is equivalent to munging the
kernel's audit configuration, and the audit configuration is not
"namespaced" in any way.  The audit container ID work is about
providing the right "container context" (as defined by userspace) with
the audit records so that admins have a better understanding about
what is going on in the system; it is very explicitly not creating an
audit namespace.

At some point in the future we will want to support running multiple
audit daemons, and have a configurable way of routing audit records
based on the audit container ID, which will blur the line regarding
audit namespaces, but even then I would argue we are not creating an
audit namespace.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Casey Schaufler @ 2019-05-29 15:53 UTC (permalink / raw)
  To: David Howells, Jann Horn
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list
In-Reply-To: <14347.1559127657@warthog.procyon.org.uk>


[-- Attachment #1.1: Type: text/plain, Size: 1386 bytes --]

On 5/29/2019 4:00 AM, David Howells wrote:
> Jann Horn <jannh@google.com> wrote:
>
>>> +void post_mount_notification(struct mount *changed,
>>> +                            struct mount_notification *notify)
>>> +{
>>> +       const struct cred *cred = current_cred();
>> This current_cred() looks bogus to me. Can't mount topology changes
>> come from all sorts of places? For example, umount_mnt() from
>> umount_tree() from dissolve_on_fput() from __fput(), which could
>> happen pretty much anywhere depending on where the last reference gets
>> dropped?
> IIRC, that's what Casey argued is the right thing to do from a security PoV.
> Casey?

You need to identify the credential of the subject that triggered
the event. If it isn't current_cred(), the cred needs to be passed
in to post_mount_notification(), or derived by some other means.

> Maybe I should pass in NULL creds in the case that an event is being generated
> because an object is being destroyed due to the last usage[*] being removed.

You should pass the cred of the process that removed the
last usage. If the last usage was removed by something like
the power being turned off on a disk drive a system cred
should be used. Someone or something caused the event. It can
be important who it was.

>  [*] Usage, not ref - Superblocks are a bit weird in their accounting.
>
> David


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: Amir Goldstein @ 2019-05-29 15:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Howells, Al Viro, Ian Kent, linux-fsdevel, linux-api,
	linux-block, keyrings, LSM List, linux-kernel
In-Reply-To: <20190529142504.GC32147@quack2.suse.cz>

> > David,
> >
> > I am interested to know how you envision filesystem notifications would
> > look with this interface.
> >
> > fanotify can certainly benefit from providing a ring buffer interface to read
> > events.
> >
> > From what I have seen, a common practice of users is to monitor mounts
> > (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> > It'd be good if those users can use a single watch mechanism/API for
> > watching the mount namespace and filesystem events within mounts.
> >
> > A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> > It provides users with two complete different mechanisms to watch error
> > and filesystem events. That is generally not a good thing to have.
> >
> > I am not asking that you implement fs_notify() before merging sb_notify()
> > and I understand that you have a use case for sb_notify().
> > I am asking that you show me the path towards a unified API (how a
> > typical program would look like), so that we know before merging your
> > new API that it could be extended to accommodate fsnotify events
> > where the final result will look wholesome to users.
>
> Are you sure we want to combine notification about file changes etc. with
> administrator-type notifications about the filesystem? To me these two
> sound like rather different (although sometimes related) things.
>

Well I am sure that ring buffer for fanotify events would be useful, so
seeing that David is proposing a generic notification mechanism, I wanted
to know how that mechanism could best share infrastructure with fsnotify.

But apart from that I foresee the questions from users about why the
mount notification API and filesystem events API do not have better
integration.

The way I see it, the notification queue can serve several classes
of notifications and fsnotify could be one of those classes
(at least FAN_CLASS_NOTIF fits nicely to the model).

Thanks,
Amir.

^ permalink raw reply

* Re: [PATCH 1/5] glibc: Perform rseq(2) registration at C startup and thread creation (v10)
From: Mathieu Desnoyers @ 2019-05-29 15:45 UTC (permalink / raw)
  To: Florian Weimer
  Cc: carlos, Joseph Myers, Szabolcs Nagy, libc-alpha, Thomas Gleixner,
	Ben Maurer, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Will Deacon, Dave Watson, Paul Turner, Rich Felker, linux-kernel,
	linux-api
In-Reply-To: <1239705947.14878.1558985272873.JavaMail.zimbra@efficios.com>

----- On May 27, 2019, at 3:27 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On May 27, 2019, at 7:19 AM, Florian Weimer fweimer@redhat.com wrote:
> 

[...]

>> 
>> Furthermore, the reference to ELF constructors is misleading.  I believe
>> the code you added to __libc_start_main to initialize __rseq_handled and
>> register __seq_abi with the kernel runs *after* ELF constructors have
>> executed (and not at all if the main program is written in Go, alas).
>> All initialization activity for the shared case needs to happen in
>> elf/rtld.c or called from there, probably as part of the security
>> initialization code or thereabouts.
> 
> in elf/rtld.c:dl_main() we have the following code:
> 
>  /* We do not initialize any of the TLS functionality unless any of the
>     initial modules uses TLS.  This makes dynamic loading of modules with
>     TLS impossible, but to support it requires either eagerly doing setup
>     now or lazily doing it later.  Doing it now makes us incompatible with
>     an old kernel that can't perform TLS_INIT_TP, even if no TLS is ever
>     used.  Trying to do it lazily is too hairy to try when there could be
>     multiple threads (from a non-TLS-using libpthread).  */
>  bool was_tls_init_tp_called = tls_init_tp_called;
>  if (tcbp == NULL)
>    tcbp = init_tls ();
> 
> If I understand your point correctly, I should move the rseq_init() and
> rseq_register_current_thread() for the SHARED case just after this
> initialization, otherwise calling those from LIBC_START_MAIN() is too
> late and it runs after initial modules constructors (or not at all for
> Go). However, this means glibc will start using TLS internally. I'm
> concerned that this is not quite in line with the above comment which
> states that TLS is not initialized if no initial modules use TLS.
> 
> For the !SHARED use-case, if my understanding is correct, I should keep
> rseq_init() and rseq_register_current_thread() calls within LIBC_START_MAIN().

I've moved the rseq initialization for SHARED case to the very end of
elf/rtld.c:init_tls(), and get the following error on make check:

Generating locale am_ET.UTF-8: this might take a while...
Inconsistency detected by ld.so: get-dynamic-info.h: 143: elf_get_dynamic_info: Assertion `info[DT_FLAGS] == NULL || (info[DT_FLAGS]->d_un.d_val & ~DF_BIND_NOW) == 0' failed!
Charmap: "UTF-8" Inputfile: "am_ET" Outputdir: "am_ET.UTF-8" failed
/bin/sh: 4: cannot create /home/efficios/git/glibc-build/localedata/am_ET.UTF-8/LC_CTYPE.test-result: Directory nonexistent

This error goes away if I comment out the call to rseq_register_current_thread (),
which touches the __rseq_abi __thread variable and issues a system call.

Currently, the __rseq_abi __thread variable is within
sysdeps/unix/sysv/linux/rseq-sym.c, which is added to the
sysdep_routines within sysdeps/unix/sysv/linux/Makefile. I
suspect it may need to be moved elsewhere.

Any thoughts on how to solve this ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v1 1/2] fork: add clone3
From: Yann Droneaud @ 2019-05-29 15:42 UTC (permalink / raw)
  To: Christian Brauner, viro, linux-kernel, torvalds, jannh
  Cc: fweimer, oleg, arnd, dhowells, Pavel Emelyanov, Andrew Morton,
	Adrian Reber, Andrei Vagin, linux-api
In-Reply-To: <20190529152237.10719-1-christian@brauner.io>

Le mercredi 29 mai 2019 à 17:22 +0200, Christian Brauner a écrit :
> This adds the clone3 system call.
> 
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b4cba953040a..6bc3e3d17150 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2472,7 +2475,96 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
>  		 unsigned long, tls)
>  #endif
>  {
> -	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
> +	struct kernel_clone_args args = {
> +		.flags = clone_flags,
> +		.stack = newsp,
> +		.pidfd = parent_tidptr,
> +		.parent_tidptr = parent_tidptr,
> +		.tls = tls,
> +		.child_tidptr = child_tidptr,
> +	};
> +
> +	/* clone(CLONE_PIDFD) uses parent_tidptr to return a pidfd */
> +	if ((clone_flags & CLONE_PIDFD) && (clone_flags & CLONE_PARENT_SETTID))
> +		return -EINVAL;
> +
> +	return _do_fork(&args);
> +}
> +
> +static bool clone3_flags_valid(u64 flags)
> +{
> +	if (flags & CLONE_DETACHED)
> +		return false;
> +
> +	if (flags & ~CLONE_MAX)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> +				     struct clone_args __user *uargs,
> +				     size_t size)
> +{
> +	struct clone_args args;
> +
> +	if (unlikely(size > PAGE_SIZE))
> +		return -E2BIG;
> +
> +	if (unlikely(size < sizeof(struct clone_args)))
> +		return -EINVAL;
> +
> +	if (unlikely(!access_ok(uargs, size)))
> +		return -EFAULT;
> +
> +	if (size > sizeof(struct clone_args)) {
> +		unsigned char __user *addr;
> +		unsigned char __user *end;
> +		unsigned char val;
> +
> +		addr = (void __user *)uargs + sizeof(struct clone_args);
> +		end = (void __user *)uargs + size;
> +
> +		for (; addr < end; addr++) {
> +			if (get_user(val, addr))
> +				return -EFAULT;
> +			if (val)
> +				return -E2BIG;

Should be -EINVAL: having something after the structure should be
handled just like an invalid flags, while still allowing future
userspace program to probe for support for newer feature.

> +		}
> +
> +		size = sizeof(struct clone_args);
> +	}
> +
> +	if (copy_from_user(&args, uargs, size))
> +		return -EFAULT;
> +
> +	if (!clone3_flags_valid(args.flags))
> +		return -EINVAL;
> +
> +	memset(kargs, 0, sizeof(*kargs));
> +
> +	kargs->flags = args.flags;
> +	kargs->child_tidptr = u64_to_user_ptr(args.child_tidptr);
> +	kargs->parent_tidptr = u64_to_user_ptr(args.parent_tidptr);
> +	kargs->pidfd = u64_to_user_ptr(args.pidfd);
> +	kargs->stack = args.stack;
> +	kargs->stack_size = args.stack_size;
> +	kargs->tls = args.tls;
> +
> +	return 0;
> +}
> +
> +SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> +{
> +	int err;
> +
> +	struct kernel_clone_args kargs;
> +
> +	err = copy_clone_args_from_user(&kargs, uargs, size);
> +	if (err)
> +		return err;
> +
> +	return _do_fork(&kargs);
>  }
>  #endif
>  

Regards.

-- 
Yann Droneaud
OPTEYA

^ permalink raw reply

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: Casey Schaufler @ 2019-05-29 15:41 UTC (permalink / raw)
  To: David Howells, Greg KH
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel, casey
In-Reply-To: <31751.1559120984@warthog.procyon.org.uk>

On 5/29/2019 2:09 AM, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
>
>>>  (3) Letting users see events they shouldn't be able to see.
>> How are you handling namespaces then?  Are they determined by the
>> namespace of the process that opened the original device handle, or the
>> namespace that made the new syscall for the events to "start flowing"?
> So far I haven't had to deal directly with namespaces.
>
> mount_notify() requires you to have access to the mountpoint you want to watch
> - and the entire tree rooted there is in one namespace, so your event sources
> are restricted to that namespace.  Further, mount objects don't themselves
> have any other namespaces, not even a user_ns.
>
> sb_notify() requires you to have access to the superblock you want to watch.
> superblocks aren't directly namespaced as a class, though individual
> superblocks may participate in particular namespaces (ipc, net, etc.).  I'm
> thinking some of these should be marked unwatchable (all pseudo superblocks,
> kernfs-class, proc, for example).
>
> Superblocks, however, do each have a user_ns - but you were allowed to access
> the superblock by pathwalk, so you must have some access to the user_ns - I
> think.
>
> KEYCTL_NOTIFY requires you to have View access on the key you're watching.
> Currently, keys have no real namespace restrictions, though I have patches to
> include a namespace tag in the lookup criteria.
>
> block_notify() doesn't require any direct access since you're watching a
> global queue and there is no blockdev namespacing.  LSMs are given the option
> to filter events, though.  The thought here is that if you can access dmesg,
> you should be able to watch for blockdev events.
>
>
> Actually, thinking further on this, restricting access to events is trickier
> than I thought and than perhaps Casey was suggesting.
>
> Say you're watching a mount object and someone in a different user_ns
> namespace or with a different security label mounts on it.  What governs
> whether you are allowed to see the event?

Conceptually it should be simple, but we have a variety of different
policies in the core OS, never mind what goes on inside the LSMs.
If you want to treat a notification like a signal you would only deliver
it if the process that performed the action that triggered the event
has the same UID as the process receiving the notification. Should you
decide to treat it like an IP packet only the LSMs would filter delivery.
If there are mode bits on the thing being watched shouldn't you respect
them?

> You're watching the object for changes - and it *has* changed.  Further, you
> might be able to see the result of this change by other means (/proc/mounts,
> for instance).
>
> Should you be denied the event based on the security model?

From a subject/object model view there are two objects and one
subject involved. The subject (active entity) is the process that
changes the first object, triggering an event. The watching process
(that will receive the notification) is the second object, because
its state will change (be written to) when the notification is
delivered. For the watching process to receive the notification
the changing process needs write access to the watching process.
The indirection of the notification mechanism isn't relevant.
If the changing process couldn't directly notify the watching process
it shouldn't be able to do it indirectly, either.

> On the other hand, if you're watching a tree of mount objects, it could be
> argued that you should be denied access to events on any mount object you
> can't reach by pathwalk.
>
> On the third hand, if you can see it in /proc/mounts or by fsinfo(), you
> should get an event for it.

Right. We've done a pretty good job of muddling the security
landscape by adding spiffy features to make life easier for
particular use cases. /proc is chuck full of examples. Objects
that can be viewed in many different ways make for confusing
security models. Try explaining /proc/234/fd/2 to a security
theory student.

>> How are you handling namespaces then?
> So to go back to the original question.  At the moment they haven't impinged
> directly and I haven't had to deal with them directly.  There are indirect
> namespace restrictions that I get for free just due to pathwalk, for instance.
>
> David

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Tycho Andersen @ 2019-05-29 15:34 UTC (permalink / raw)
  To: Paul Moore
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhR4fudQanvZGYWMvCf7k2CU3q7e7n1Pi7hzC3v_zpVEdw@mail.gmail.com>

On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > It is not permitted to unset the audit container identifier.
> > > A child inherits its parent's audit container identifier.
> >
> > ...
> >
> > >  /**
> > > + * audit_set_contid - set current task's audit contid
> > > + * @contid: contid value
> > > + *
> > > + * Returns 0 on success, -EPERM on permission failure.
> > > + *
> > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > + */
> > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > +{
> > > +     u64 oldcontid;
> > > +     int rc = 0;
> > > +     struct audit_buffer *ab;
> > > +     uid_t uid;
> > > +     struct tty_struct *tty;
> > > +     char comm[sizeof(current->comm)];
> > > +
> > > +     task_lock(task);
> > > +     /* Can't set if audit disabled */
> > > +     if (!task->audit) {
> > > +             task_unlock(task);
> > > +             return -ENOPROTOOPT;
> > > +     }
> > > +     oldcontid = audit_get_contid(task);
> > > +     read_lock(&tasklist_lock);
> > > +     /* Don't allow the audit containerid to be unset */
> > > +     if (!audit_contid_valid(contid))
> > > +             rc = -EINVAL;
> > > +     /* if we don't have caps, reject */
> > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > +             rc = -EPERM;
> > > +     /* if task has children or is not single-threaded, deny */
> > > +     else if (!list_empty(&task->children))
> > > +             rc = -EBUSY;
> > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > +             rc = -EALREADY;
> > > +     read_unlock(&tasklist_lock);
> > > +     if (!rc)
> > > +             task->audit->contid = contid;
> > > +     task_unlock(task);
> > > +
> > > +     if (!audit_enabled)
> > > +             return rc;
> >
> > ...but it is allowed to change it (assuming
> > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > immediately useful since we still live in the world of majority
> > privileged containers if we didn't allow changing it, in addition to
> > un-setting it.
> 
> The idea is that only container orchestrators should be able to
> set/modify the audit container ID, and since setting the audit
> container ID can have a significant effect on the records captured
> (and their routing to multiple daemons when we get there) modifying
> the audit container ID is akin to modifying the audit configuration
> which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> is that you would only change the audit container ID from one
> set/inherited value to another if you were nesting containers, in
> which case the nested container orchestrator would need to be granted
> CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> compromise).

But then don't you want some kind of ns_capable() instead (probably
not the obvious one, though...)? With capable(), you can't really nest
using the audit-id and user namespaces together.

Tycho

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-05-29 15:29 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190529145742.GA8959@cisco>

On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > It is not permitted to unset the audit container identifier.
> > A child inherits its parent's audit container identifier.
>
> ...
>
> >  /**
> > + * audit_set_contid - set current task's audit contid
> > + * @contid: contid value
> > + *
> > + * Returns 0 on success, -EPERM on permission failure.
> > + *
> > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > + */
> > +int audit_set_contid(struct task_struct *task, u64 contid)
> > +{
> > +     u64 oldcontid;
> > +     int rc = 0;
> > +     struct audit_buffer *ab;
> > +     uid_t uid;
> > +     struct tty_struct *tty;
> > +     char comm[sizeof(current->comm)];
> > +
> > +     task_lock(task);
> > +     /* Can't set if audit disabled */
> > +     if (!task->audit) {
> > +             task_unlock(task);
> > +             return -ENOPROTOOPT;
> > +     }
> > +     oldcontid = audit_get_contid(task);
> > +     read_lock(&tasklist_lock);
> > +     /* Don't allow the audit containerid to be unset */
> > +     if (!audit_contid_valid(contid))
> > +             rc = -EINVAL;
> > +     /* if we don't have caps, reject */
> > +     else if (!capable(CAP_AUDIT_CONTROL))
> > +             rc = -EPERM;
> > +     /* if task has children or is not single-threaded, deny */
> > +     else if (!list_empty(&task->children))
> > +             rc = -EBUSY;
> > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > +             rc = -EALREADY;
> > +     read_unlock(&tasklist_lock);
> > +     if (!rc)
> > +             task->audit->contid = contid;
> > +     task_unlock(task);
> > +
> > +     if (!audit_enabled)
> > +             return rc;
>
> ...but it is allowed to change it (assuming
> capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> immediately useful since we still live in the world of majority
> privileged containers if we didn't allow changing it, in addition to
> un-setting it.

The idea is that only container orchestrators should be able to
set/modify the audit container ID, and since setting the audit
container ID can have a significant effect on the records captured
(and their routing to multiple daemons when we get there) modifying
the audit container ID is akin to modifying the audit configuration
which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
is that you would only change the audit container ID from one
set/inherited value to another if you were nesting containers, in
which case the nested container orchestrator would need to be granted
CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
compromise).  We did consider allowing for a chain of nested audit
container IDs, but the implications of doing so are significant
(implementation mess, runtime cost, etc.) so we are leaving that out
of this effort.

>From a practical perspective, un-setting the audit container ID is
pretty much the same as changing it from one set value to another so
most of the above applies to that case as well.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH v1 2/2] arch: wire-up clone3() syscall on x86
From: Christian Brauner @ 2019-05-29 15:22 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: fweimer, oleg, arnd, dhowells, Christian Brauner, Andrew Morton,
	Adrian Reber, linux-api, linux-arch, x86
In-Reply-To: <20190529152237.10719-1-christian@brauner.io>

Wire up the clone3() call on x86.

This patch only wires up clone3() on x86. Some of the arches look like they
need special assembly massaging and it is probably smarter if the
appropriate arch maintainers would do the actual wiring.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
---
v1: unchanged
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/uapi/asm-generic/unistd.h      | 4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..80e26211feff 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+436	i386	clone3			sys_clone3			__ia32_sys_clone3
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..7968f0b5b5e8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+436	common	clone3			__x64_sys_clone3/ptregs
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..45bc87687c47 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_clone3 436
+__SYSCALL(__NR_clone3, sys_clone3)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 437
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0

^ permalink raw reply related

* [PATCH v1 1/2] fork: add clone3
From: Christian Brauner @ 2019-05-29 15:22 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: fweimer, oleg, arnd, dhowells, Christian Brauner, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, linux-api

This adds the clone3 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again.

The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
  - use a separate kernel internal struct kernel_clone_args that is
    properly typed according to current kernel conventions in fork.c and is
    different from  the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave identical
  (Arnd suggested, that we can probably also port do_fork() itself in a
   separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on as the latter is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible

In accordance with Linus suggestions, clone3() has the following signature:

/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 parent_tidptr;
        __aligned_u64 child_tidptr;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *parent_tidptr;
        int __user *child_tidptr;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};

long sys_clone3(struct clone_args __user *uargs, size_t size)

clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
--
v1:
- Linus Torvalds <torvalds@linux-foundation.org>:
  - redesign based on Linus proposal
  - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
- Arnd Bergmann <arnd@arndb.de>:
  - use a single copy_from_user() instead of multiple get_user() calls
    since the latter have a constant overhead on some architectures
  - a range of other tweaks and suggestions
---
 arch/x86/ia32/sys_ia32.c   |  11 ++-
 include/linux/sched/task.h |  13 ++-
 include/linux/syscalls.h   |   6 ++
 include/uapi/linux/sched.h |  16 ++++
 kernel/fork.c              | 176 ++++++++++++++++++++++++++++---------
 5 files changed, 177 insertions(+), 45 deletions(-)

diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index a43212036257..1fd5c4594a8e 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -237,6 +237,13 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
 		       unsigned long, newsp, int __user *, parent_tidptr,
 		       unsigned long, tls_val, int __user *, child_tidptr)
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
-			tls_val);
+	struct kernel_clone_args args = {
+		.flags = clone_flags,
+		.stack = newsp,
+		.parent_tidptr = parent_tidptr,
+		.tls = tls_val,
+		.child_tidptr = child_tidptr,
+	};
+
+	return _do_fork(&args);
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index f1227f2c38a4..818696626a0d 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -8,11 +8,22 @@
  */
 
 #include <linux/sched.h>
+#include <linux/compiler_types.h>
 
 struct task_struct;
 struct rusage;
 union thread_union;
 
+struct kernel_clone_args {
+	u64 flags;
+	int __user *pidfd;
+	int __user *parent_tidptr;
+	int __user *child_tidptr;
+	unsigned long stack;
+	unsigned long stack_size;
+	unsigned long tls;
+};
+
 /*
  * This serializes "schedule()" and also protects
  * the run-queue from deletions/modifications (but
@@ -73,7 +84,7 @@ extern void do_group_exit(int);
 extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(struct kernel_clone_args *kargs);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..254db24af0cd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -70,6 +70,7 @@ struct sigaltstack;
 struct rseq;
 union bpf_attr;
 struct io_uring_params;
+struct clone_args;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -852,6 +853,11 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
 	       int __user *, unsigned long);
 #endif
 #endif
+
+#ifdef __ARCH_WANT_SYS_CLONE
+asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
+#endif
+
 asmlinkage long sys_execve(const char __user *filename,
 		const char __user *const __user *argv,
 		const char __user *const __user *envp);
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee170bee2..d426ef382a6e 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -2,6 +2,8 @@
 #ifndef _UAPI_LINUX_SCHED_H
 #define _UAPI_LINUX_SCHED_H
 
+#include <linux/types.h>
+
 /*
  * cloning flags:
  */
@@ -30,6 +32,20 @@
 #define CLONE_NEWPID		0x20000000	/* New pid namespace */
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
+#define CLONE_MAX ~0U
+
+/*
+ * Arguments for the clone3 syscall
+ */
+struct clone_args {
+	__aligned_u64 flags;
+	__aligned_u64 pidfd;
+	__aligned_u64 parent_tidptr;
+	__aligned_u64 child_tidptr;
+	__aligned_u64 stack;
+	__aligned_u64 stack_size;
+	__aligned_u64 tls;
+};
 
 /*
  * Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..6bc3e3d17150 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1760,19 +1760,19 @@ static __always_inline void delayed_free_task(struct task_struct *tsk)
  * flags). The actual kick-off is left to the caller.
  */
 static __latent_entropy struct task_struct *copy_process(
-					unsigned long clone_flags,
-					unsigned long stack_start,
-					unsigned long stack_size,
-					int __user *parent_tidptr,
-					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
-					unsigned long tls,
-					int node)
+					int node,
+					struct kernel_clone_args *args)
 {
 	int pidfd = -1, retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
+	u64 clone_flags = args->flags;
+	int __user *child_tidptr = args->child_tidptr;
+	unsigned long tls = args->tls;
+	unsigned long stack_start = args->stack;
+	unsigned long stack_size = args->stack_size;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1821,27 +1821,12 @@ static __latent_entropy struct task_struct *copy_process(
 	}
 
 	if (clone_flags & CLONE_PIDFD) {
-		int reserved;
-
 		/*
-		 * - CLONE_PARENT_SETTID is useless for pidfds and also
-		 *   parent_tidptr is used to return pidfds.
 		 * - CLONE_DETACHED is blocked so that we can potentially
 		 *   reuse it later for CLONE_PIDFD.
 		 * - CLONE_THREAD is blocked until someone really needs it.
 		 */
-		if (clone_flags &
-		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
-			return ERR_PTR(-EINVAL);
-
-		/*
-		 * Verify that parent_tidptr is sane so we can potentially
-		 * reuse it later.
-		 */
-		if (get_user(reserved, parent_tidptr))
-			return ERR_PTR(-EFAULT);
-
-		if (reserved != 0)
+		if (clone_flags & (CLONE_DETACHED | CLONE_THREAD))
 			return ERR_PTR(-EINVAL);
 	}
 
@@ -2062,7 +2047,7 @@ static __latent_entropy struct task_struct *copy_process(
 			goto bad_fork_free_pid;
 
 		pidfd = retval;
-		retval = put_user(pidfd, parent_tidptr);
+		retval = put_user(pidfd, args->pidfd);
 		if (retval)
 			goto bad_fork_put_pidfd;
 	}
@@ -2313,8 +2298,11 @@ static inline void init_idle_pids(struct task_struct *idle)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+	struct kernel_clone_args args = {
+		.flags = CLONE_VM,
+	};
+
+	task = copy_process(&init_struct_pid, 0, cpu_to_node(cpu), &args);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2334,18 +2322,15 @@ struct mm_struct *copy_init_mm(void)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long _do_fork(unsigned long clone_flags,
-	      unsigned long stack_start,
-	      unsigned long stack_size,
-	      int __user *parent_tidptr,
-	      int __user *child_tidptr,
-	      unsigned long tls)
+long _do_fork(struct kernel_clone_args *args)
 {
+	u64 clone_flags = args->flags;
 	struct completion vfork;
 	struct pid *pid;
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	int __user *parent_tidptr = args->parent_tidptr;
 
 	/*
 	 * Determine whether and which event to report to ptracer.  When
@@ -2365,8 +2350,7 @@ long _do_fork(unsigned long clone_flags,
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2414,8 +2398,15 @@ long do_fork(unsigned long clone_flags,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr)
 {
-	return _do_fork(clone_flags, stack_start, stack_size,
-			parent_tidptr, child_tidptr, 0);
+	struct kernel_clone_args args = {
+		.flags = clone_flags,
+		.stack = stack_start,
+		.stack_size = stack_size,
+		.parent_tidptr = parent_tidptr,
+		.child_tidptr = child_tidptr,
+	};
+
+	return _do_fork(&args);
 }
 #endif
 
@@ -2424,15 +2415,24 @@ long do_fork(unsigned long clone_flags,
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
-	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL, 0);
+	struct kernel_clone_args args = {
+		.flags = flags | CLONE_VM | CLONE_UNTRACED,
+		.stack = (unsigned long)fn,
+		.stack_size = (unsigned long)arg,
+	};
+
+	return _do_fork(&args);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+	struct kernel_clone_args args = {
+		.flags = SIGCHLD,
+	};
+
+	return _do_fork(&args);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -2443,8 +2443,11 @@ SYSCALL_DEFINE0(fork)
 #ifdef __ARCH_WANT_SYS_VFORK
 SYSCALL_DEFINE0(vfork)
 {
-	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL, 0);
+	struct kernel_clone_args args = {
+		.flags = CLONE_VFORK | CLONE_VM | SIGCHLD,
+	};
+
+	return _do_fork(&args);
 }
 #endif
 
@@ -2472,7 +2475,96 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	struct kernel_clone_args args = {
+		.flags = clone_flags,
+		.stack = newsp,
+		.pidfd = parent_tidptr,
+		.parent_tidptr = parent_tidptr,
+		.tls = tls,
+		.child_tidptr = child_tidptr,
+	};
+
+	/* clone(CLONE_PIDFD) uses parent_tidptr to return a pidfd */
+	if ((clone_flags & CLONE_PIDFD) && (clone_flags & CLONE_PARENT_SETTID))
+		return -EINVAL;
+
+	return _do_fork(&args);
+}
+
+static bool clone3_flags_valid(u64 flags)
+{
+	if (flags & CLONE_DETACHED)
+		return false;
+
+	if (flags & ~CLONE_MAX)
+		return false;
+
+	return true;
+}
+
+static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
+				     struct clone_args __user *uargs,
+				     size_t size)
+{
+	struct clone_args args;
+
+	if (unlikely(size > PAGE_SIZE))
+		return -E2BIG;
+
+	if (unlikely(size < sizeof(struct clone_args)))
+		return -EINVAL;
+
+	if (unlikely(!access_ok(uargs, size)))
+		return -EFAULT;
+
+	if (size > sizeof(struct clone_args)) {
+		unsigned char __user *addr;
+		unsigned char __user *end;
+		unsigned char val;
+
+		addr = (void __user *)uargs + sizeof(struct clone_args);
+		end = (void __user *)uargs + size;
+
+		for (; addr < end; addr++) {
+			if (get_user(val, addr))
+				return -EFAULT;
+			if (val)
+				return -E2BIG;
+		}
+
+		size = sizeof(struct clone_args);
+	}
+
+	if (copy_from_user(&args, uargs, size))
+		return -EFAULT;
+
+	if (!clone3_flags_valid(args.flags))
+		return -EINVAL;
+
+	memset(kargs, 0, sizeof(*kargs));
+
+	kargs->flags = args.flags;
+	kargs->child_tidptr = u64_to_user_ptr(args.child_tidptr);
+	kargs->parent_tidptr = u64_to_user_ptr(args.parent_tidptr);
+	kargs->pidfd = u64_to_user_ptr(args.pidfd);
+	kargs->stack = args.stack;
+	kargs->stack_size = args.stack_size;
+	kargs->tls = args.tls;
+
+	return 0;
+}
+
+SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
+{
+	int err;
+
+	struct kernel_clone_args kargs;
+
+	err = copy_clone_args_from_user(&kargs, uargs, size);
+	if (err)
+		return err;
+
+	return _do_fork(&kargs);
 }
 #endif
 
-- 
2.21.0

^ permalink raw reply related

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: Greg KH @ 2019-05-29 15:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, David Howells, Al Viro, Ian Kent, linux-fsdevel,
	linux-api, linux-block, keyrings, LSM List, linux-kernel
In-Reply-To: <20190529142504.GC32147@quack2.suse.cz>

On Wed, May 29, 2019 at 04:25:04PM +0200, Jan Kara wrote:
> > I am not asking that you implement fs_notify() before merging sb_notify()
> > and I understand that you have a use case for sb_notify().
> > I am asking that you show me the path towards a unified API (how a
> > typical program would look like), so that we know before merging your
> > new API that it could be extended to accommodate fsnotify events
> > where the final result will look wholesome to users.
> 
> Are you sure we want to combine notification about file changes etc. with
> administrator-type notifications about the filesystem? To me these two
> sound like rather different (although sometimes related) things.

This patchset is looking to create a "generic" kernel notification
system, so I think the question is valid.  It's up to the requestor to
ask for the specific type of notification.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH RFC v8 01/10] namei: obey trailing magic-link DAC permissions
From: Andy Lutomirski @ 2019-05-29 15:10 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Andy Lutomirski, Al Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, David Howells, Shuah Khan, Shuah Khan,
	Christian Brauner, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, Linux Containers <contai>
In-Reply-To: <20190524031109.v24r6typyug2rlto@yavin>



> On May 23, 2019, at 8:11 PM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
>> On 2019-05-23, Aleksa Sarai <cyphar@cyphar.com> wrote:
>>> On 2019-05-22, Andy Lutomirski <luto@kernel.org> wrote:
>>> What are actual examples of uses for this exception?  Breaking
>>> selftests is not, in and of itself, a huge problem.
>> 
>> Not as far as I know. All of the re-opening users I know of do re-opens
>> of O_PATH or are re-opening with the same (or fewer) privileges. I also
>> ran this for a few days on my laptop without this exception, and didn't
>> have any visible issues.
> 
> I have modified the patch to WARN_ON(may_open_magiclink() == -EACCES).
> 
> So far (in the past day on my openSUSE machines) I have only seen two
> programs which have hit this case: kbd[1]'s "loadkeys" and "kbd_mode"
> binaries. In addition to there not being any user-visible errors -- they
> actually handle permission errors gracefully!
> 
>  static int
>  open_a_console(const char *fnam)
>  {
>      int fd;
> 
>      /*
>       * For ioctl purposes we only need some fd and permissions
>       * do not matter. But setfont:activatemap() does a write.
>       */
>      fd = open(fnam, O_RDWR);
>      if (fd < 0)
>          fd = open(fnam, O_WRONLY);
>      if (fd < 0)
>          fd = open(fnam, O_RDONLY);
>      if (fd < 0)
>          return -1;
>      return fd;
>  }
> 
> The above gets called with "/proc/self/fd/0" as an argument (as well as
> other console candidates like "/dev/console"). And setfont:activatemap()
> actually does handle read-only fds:
> 
>  static void
>  send_escseq(int fd, const char *seq, int n)
>  {
>      if (write(fd, seq, n) != n) /* maybe fd is read-only */
>          printf("%s", seq);
>  }
> 
>  void activatemap(int fd)
>  {
>      send_escseq(fd, "\033(K", 3);
>  }
> 
> So, thus far, not only have I not seen anything go wrong -- the only
> program which actually hits this case handles the error gracefully.
> Obviously we got lucky here, but the lack of any users of this
> mis-feature leads me to have some hope that we can block it without
> anyone noticing.
> 
> But I emphatically do not want to break userspace here (except for
> attackers, obviously).

Hmm. This will break any script that does echo foo >/dev/stdin too.

Just to throw an idea out there, what if the open were allowed if the file mode is sufficient or if the magic link target is openable with the correct mode without magic?  In other words, first check as in your code but without the exception and, if that check fails, then walk the same path that d_path would return and see if it would work as a normal open?  Of course, that second attempt would need to disable magic links to avoid recursing.  I’m not sure I love this idea...

Otherwise, I imagine we can live with the exception, especially if the new open API turns it off by default.

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Tycho Andersen @ 2019-05-29 14:57 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, Paul Moore, sgrubb, omosnace,
	dhowells, simo, eparis, serge, ebiederm, nhorman
In-Reply-To: <9edad39c40671fb53f28d76862304cc2647029c6.1554732921.git.rgb@redhat.com>

On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> It is not permitted to unset the audit container identifier.
> A child inherits its parent's audit container identifier.

...

>  /**
> + * audit_set_contid - set current task's audit contid
> + * @contid: contid value
> + *
> + * Returns 0 on success, -EPERM on permission failure.
> + *
> + * Called (set) from fs/proc/base.c::proc_contid_write().
> + */
> +int audit_set_contid(struct task_struct *task, u64 contid)
> +{
> +	u64 oldcontid;
> +	int rc = 0;
> +	struct audit_buffer *ab;
> +	uid_t uid;
> +	struct tty_struct *tty;
> +	char comm[sizeof(current->comm)];
> +
> +	task_lock(task);
> +	/* Can't set if audit disabled */
> +	if (!task->audit) {
> +		task_unlock(task);
> +		return -ENOPROTOOPT;
> +	}
> +	oldcontid = audit_get_contid(task);
> +	read_lock(&tasklist_lock);
> +	/* Don't allow the audit containerid to be unset */
> +	if (!audit_contid_valid(contid))
> +		rc = -EINVAL;
> +	/* if we don't have caps, reject */
> +	else if (!capable(CAP_AUDIT_CONTROL))
> +		rc = -EPERM;
> +	/* if task has children or is not single-threaded, deny */
> +	else if (!list_empty(&task->children))
> +		rc = -EBUSY;
> +	else if (!(thread_group_leader(task) && thread_group_empty(task)))
> +		rc = -EALREADY;
> +	read_unlock(&tasklist_lock);
> +	if (!rc)
> +		task->audit->contid = contid;
> +	task_unlock(task);
> +
> +	if (!audit_enabled)
> +		return rc;

...but it is allowed to change it (assuming
capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
immediately useful since we still live in the world of majority
privileged containers if we didn't allow changing it, in addition to
un-setting it.

Tycho

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Paul Moore @ 2019-05-29 14:33 UTC (permalink / raw)
  To: Dan Walsh
  Cc: Richard Guy Briggs, Steve Grubb, Neil Horman, containers,
	linux-api, Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <9a9ccb28-3cbc-c0b1-71b2-26df08105b4a@redhat.com>

On Wed, May 29, 2019 at 10:07 AM Daniel Walsh <dwalsh@redhat.com> wrote:
> On 5/29/19 9:17 AM, Paul Moore wrote:
> > On Wed, May 29, 2019 at 8:03 AM Daniel Walsh <dwalsh@redhat.com> wrote:
> >> On 5/28/19 8:43 PM, Richard Guy Briggs wrote:
> >>> On 2019-05-28 19:00, Steve Grubb wrote:
> >>>> On Tuesday, May 28, 2019 6:26:47 PM EDT Paul Moore wrote:
> >>>>> On Tue, May 28, 2019 at 5:54 PM Daniel Walsh <dwalsh@redhat.com> wrote:
> >>>>>> On 4/22/19 9:49 AM, Paul Moore wrote:
> >>>>>>> On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com>
> >>>> wrote:
> >>>>>>>> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> >>>>>>>>> Implement kernel audit container identifier.
> >>>>>>>> I'm sorry, I've lost track of this, where have we landed on it? Are we
> >>>>>>>> good for inclusion?
> >>>>>>> I haven't finished going through this latest revision, but unless
> >>>>>>> Richard made any significant changes outside of the feedback from the
> >>>>>>> v5 patchset I'm guessing we are "close".
> >>>>>>>
> >>>>>>> Based on discussions Richard and I had some time ago, I have always
> >>>>>>> envisioned the plan as being get the kernel patchset, tests, docs
> >>>>>>> ready (which Richard has been doing) and then run the actual
> >>>>>>> implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> >>>>>>> to make sure the actual implementation is sane from their perspective.
> >>>>>>> They've already seen the design, so I'm not expecting any real
> >>>>>>> surprises here, but sometimes opinions change when they have actual
> >>>>>>> code in front of them to play with and review.
> >>>>>>>
> >>>>>>> Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> >>>>>>> whatever additional testing we can do would be a big win.  I'm
> >>>>>>> thinking I'll pull it into a separate branch in the audit tree
> >>>>>>> (audit/working-container ?) and include that in my secnext kernels
> >>>>>>> that I build/test on a regular basis; this is also a handy way to keep
> >>>>>>> it based against the current audit/next branch.  If any changes are
> >>>>>>> needed Richard can either chose to base those changes on audit/next or
> >>>>>>> the separate audit container ID branch; that's up to him.  I've done
> >>>>>>> this with other big changes in other trees, e.g. SELinux, and it has
> >>>>>>> worked well to get some extra testing in and keep the patchset "merge
> >>>>>>> ready" while others outside the subsystem look things over.
> >>>>>> Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
> >>>>>> believe this is something we can work on in the container runtimes team
> >>>>>> to implement the container auditing code in CRI-O and Podman.
> >>>>> Thanks Dan.  If I pulled this into a branch and built you some test
> >>>>> kernels to play with, any idea how long it might take to get a proof
> >>>>> of concept working on the cri-o side?
> >>>> We'd need to merge user space patches and let them use that instead of the
> >>>> raw interface. I'm not going to merge user space until we are pretty sure the
> >>>> patch is going into the kernel.
> >>> I have an f29 test rpm of the userspace bits if that helps for testing:
> >>>       http://people.redhat.com/~rbriggs/ghak90/git-1db7e21/
> >>>
> >>> Here's what it contains (minus the last patch):
> >>>       https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau40-containerid-filter.v7.0
> >>>
> >>>> -Steve
> >>>>
> >>>>> FWIW, I've also reached out to some of the LXC folks I know to get
> >>>>> their take on the API.  I think if we can get two different container
> >>>>> runtimes to give the API a thumbs-up then I think we are in good shape
> >>>>> with respect to the userspace interface.
> >>>>>
> >>>>> I just finished looking over the last of the pending audit kernel
> >>>>> patches that were queued waiting for the merge window to open so this
> >>>>> is next on my list to look at.  I plan to start doing that
> >>>>> tonight/tomorrow, and as long as the changes between v5/v6 are not
> >>>>> that big, it shouldn't take too long.
> >>> - RGB
> >>>
> >>> --
> >>> Richard Guy Briggs <rgb@redhat.com>
> >>> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> >>> Remote, Ottawa, Red Hat Canada
> >>> IRC: rgb, SunRaycer
> >>> Voice: +1.647.777.2635, Internal: (81) 32635
> >> Our current thoughts are to put the setting of the ID inside of conmon,
> >> and then launching the OCI Runtime.  In a perfect world this would
> >> happen in the OCI Runtime, but we have no controls over different OCI
> >> Runtimes.
> >>
> >> By putting it into conmon, then CRI-O and Podman will automatically get
> >> the container id support.  After we have this we have to plumb it back
> >> up through the contianer engines to be able to easily report the link
> >> between the Container UUID and The Kernel Container Audit ID.
> > I'm glad you guys have a plan, that's encouraging, but sadly I have no
> > idea about the level of complexity/difficulty involved in modifying
> > the various container bits for a proof-of-concept?  Are we talking a
> > week or two?  A month?  More?
> >
> If we had the kernel and the libaudit api, it would involve a small
> effort in conmon,  I would figure a few days for a POC.  Getting the
> hole wiring into CRI-O and Podman, would be a little more effort.

That's great.  Stay tuned ...

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-29 14:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <20190528161524.tn5sqzhmhgyuwrmy@box>

On 28.05.2019 19:15, Kirill A. Shutemov wrote:
> On Tue, May 28, 2019 at 12:15:16PM +0300, Kirill Tkhai wrote:
>> On 28.05.2019 02:30, Kirill A. Shutemov wrote:
>>> On Fri, May 24, 2019 at 05:00:32PM +0300, Kirill Tkhai wrote:
>>>> On 24.05.2019 14:52, Kirill A. Shutemov wrote:
>>>>> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
>>>>>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>>>>>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>>>>>>> This patchset adds a new syscall, which makes possible
>>>>>>>> to clone a VMA from a process to current process.
>>>>>>>> The syscall supplements the functionality provided
>>>>>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>>>>>> and it may be useful in many situation.
>>>>>>>
>>>>>>> Kirill, could you explain how the change affects rmap and how it is safe.
>>>>>>>
>>>>>>> My concern is that the patchset allows to map the same page multiple times
>>>>>>> within one process or even map page allocated by child to the parrent.
>>>>>>>
>>>>>>> It was not allowed before.
>>>>>>>
>>>>>>> In the best case it makes reasoning about rmap substantially more difficult.
>>>>>>>
>>>>>>> But I'm worry it will introduce hard-to-debug bugs, like described in
>>>>>>> https://lwn.net/Articles/383162/.
>>>>>>
>>>>>> Andy suggested to unmap PTEs from source page table, and this make the single
>>>>>> page never be mapped in the same process twice. This is OK for my use case,
>>>>>> and here we will just do a small step "allow to inherit VMA by a child process",
>>>>>> which we didn't have before this. If someone still needs to continue the work
>>>>>> to allow the same page be mapped twice in a single process in the future, this
>>>>>> person will have a supported basis we do in this small step. I believe, someone
>>>>>> like debugger may want to have this to make a fast snapshot of a process private
>>>>>> memory (when the task is stopped for a small time to get its memory). But for
>>>>>> me remapping is enough at the moment.
>>>>>>
>>>>>> What do you think about this?
>>>>>
>>>>> I don't think that unmapping alone will do. Consider the following
>>>>> scenario:
>>>>>
>>>>> 1. Task A creates and populates the mapping.
>>>>> 2. Task A forks. We have now Task B mapping the same pages, but
>>>>> write-protected.
>>>>> 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
>>>>>
>>>>> After this Task A will have the same anon pages mapped twice.
>>>>
>>>> Ah, sure.
>>>>
>>>>> One possible way out would be to force CoW on all pages in the mapping,
>>>>> before passing the mapping to the new process.
>>>>
>>>> This will pop all swapped pages up, which is the thing the patchset aims
>>>> to prevent.
>>>>
>>>> Hm, what about allow remapping only VMA, which anon_vma::rb_root contain
>>>> only chain and which vma->anon_vma_chain contains single entry? This is
>>>> a vma, which were faulted, but its mm never were duplicated (or which
>>>> forks already died).
>>>
>>> The requirement for the VMA to be faulted (have any pages mapped) looks
>>> excessive to me, but the general idea may work.
>>>
>>> One issue I see is that userspace may not have full control to create such
>>> VMA. vma_merge() can merge the VMA to the next one without any consent
>>> from userspace and you'll get anon_vma inherited from the VMA you've
>>> justed merged with.
>>>
>>> I don't have any valid idea on how to get around this.
>>
>> Technically it is possible by creating boundary 1-page VMAs with another protection:
>> one above and one below the desired region, then map the desired mapping. But this
>> is not comfortable.
>>
>> I don't think it's difficult to find a natural limitation, which prevents mapping
>> a single page twice if we want to avoid this at least on start. Another suggestion:
>>
>> prohibit to map a remote process's VMA only in case of its vm_area_struct::anon_vma::root
>> is the same as root of one of local process's VMA.
>>
>> What about this?
> 
> I don't see anything immediately wrong with this, but it's still going to
> produce puzzling errors for a user. How would you document such limitation
> in the way it makes sense for userspace developer?

It's difficult, since the limitation is artificial.

I just may to suggest more strict limitation.

Something like "VMA may be remapped only as a whole region,
and only in the case of there were not fork() after VMA
appeared in a process (by mmap or remapping from another
remote process). In case of VMA were merged with a neighbouring
VMA, the same rules are applied to the neighbours.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e8834ac32b7..0bcd6f598e73 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -287,13 +287,17 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#define VM_MAY_REMOTE_REMAP	VM_HIGH_ARCH_5
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
diff --git a/kernel/fork.c b/kernel/fork.c
index ff4efd16fd82..a3c758c8cd54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -584,8 +584,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
 			retval = copy_page_range(mm, oldmm, mpnt);
+			mpnt->vm_flags &= ~VM_MAY_REMOTE_REMAP;
+		}
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);

^ permalink raw reply related

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: Jan Kara @ 2019-05-29 14:25 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: David Howells, Al Viro, Ian Kent, linux-fsdevel, linux-api,
	linux-block, keyrings, LSM List, linux-kernel, Jan Kara
In-Reply-To: <CAOQ4uxjC1M7jwjd9zSaSa6UW2dbEjc+ZbFSo7j9F1YHAQxQ8LQ@mail.gmail.com>

On Wed 29-05-19 09:33:35, Amir Goldstein wrote:
> On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
> >
> >
> > Hi Al,
> >
> > Here's a set of patches to add a general variable-length notification queue
> > concept and to add sources of events for:
> >
> >  (1) Mount topology events, such as mounting, unmounting, mount expiry,
> >      mount reconfiguration.
> >
> >  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
> >      errors (not complete yet).
> >
> >  (3) Block layer events, such as I/O errors.
> >
> >  (4) Key/keyring events, such as creating, linking and removal of keys.
> >
> > One of the reasons for this is so that we can remove the issue of processes
> > having to repeatedly and regularly scan /proc/mounts, which has proven to
> > be a system performance problem.  To further aid this, the fsinfo() syscall
> > on which this patch series depends, provides a way to access superblock and
> > mount information in binary form without the need to parse /proc/mounts.
> >
> >
> > Design decisions:
> >
> >  (1) A misc chardev is used to create and open a ring buffer:
> >
> >         fd = open("/dev/watch_queue", O_RDWR);
> >
> >      which is then configured and mmap'd into userspace:
> >
> >         ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> >         ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> >         buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> >                    MAP_SHARED, fd, 0);
> >
> >      The fd cannot be read or written (though there is a facility to use
> >      write to inject records for debugging) and userspace just pulls data
> >      directly out of the buffer.
> >
> >  (2) The ring index pointers are stored inside the ring and are thus
> >      accessible to userspace.  Userspace should only update the tail
> >      pointer and never the head pointer or risk breaking the buffer.  The
> >      kernel checks that the pointers appear valid before trying to use
> >      them.  A 'skip' record is maintained around the pointers.
> >
> >  (3) poll() can be used to wait for data to appear in the buffer.
> >
> >  (4) Records in the buffer are binary, typed and have a length so that they
> >      can be of varying size.
> >
> >      This means that multiple heterogeneous sources can share a common
> >      buffer.  Tags may be specified when a watchpoint is created to help
> >      distinguish the sources.
> >
> >  (5) The queue is reusable as there are 16 million types available, of
> >      which I've used 4, so there is scope for others to be used.
> >
> >  (6) Records are filterable as types have up to 256 subtypes that can be
> >      individually filtered.  Other filtration is also available.
> >
> >  (7) Each time the buffer is opened, a new buffer is created - this means
> >      that there's no interference between watchers.
> >
> >  (8) When recording a notification, the kernel will not sleep, but will
> >      rather mark a queue as overrun if there's insufficient space, thereby
> >      avoiding userspace causing the kernel to hang.
> >
> >  (9) The 'watchpoint' should be specific where possible, meaning that you
> >      specify the object that you want to watch.
> >
> > (10) The buffer is created and then watchpoints are attached to it, using
> >      one of:
> >
> >         keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> >         mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> >         sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
> >
> >      where in all three cases, fd indicates the queue and the number after
> >      is a tag between 0 and 255.
> >
> > (11) The watch must be removed if either the watch buffer is destroyed or
> >      the watched object is destroyed.
> >
> >
> > Things I want to avoid:
> >
> >  (1) Introducing features that make the core VFS dependent on the network
> >      stack or networking namespaces (ie. usage of netlink).
> >
> >  (2) Dumping all this stuff into dmesg and having a daemon that sits there
> >      parsing the output and distributing it as this then puts the
> >      responsibility for security into userspace and makes handling
> >      namespaces tricky.  Further, dmesg might not exist or might be
> >      inaccessible inside a container.
> >
> >  (3) Letting users see events they shouldn't be able to see.
> >
> >
> > Further things that could be considered:
> >
> >  (1) Adding a keyctl call to allow a watch on a keyring to be extended to
> >      "children" of that keyring, such that the watch is removed from the
> >      child if it is unlinked from the keyring.
> >
> >  (2) Adding global superblock event queue.
> >
> >  (3) Propagating watches to child superblock over automounts.
> >
> 
> David,
> 
> I am interested to know how you envision filesystem notifications would
> look with this interface.
> 
> fanotify can certainly benefit from providing a ring buffer interface to read
> events.
> 
> From what I have seen, a common practice of users is to monitor mounts
> (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> It'd be good if those users can use a single watch mechanism/API for
> watching the mount namespace and filesystem events within mounts.
> 
> A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> It provides users with two complete different mechanisms to watch error
> and filesystem events. That is generally not a good thing to have.
> 
> I am not asking that you implement fs_notify() before merging sb_notify()
> and I understand that you have a use case for sb_notify().
> I am asking that you show me the path towards a unified API (how a
> typical program would look like), so that we know before merging your
> new API that it could be extended to accommodate fsnotify events
> where the final result will look wholesome to users.

Are you sure we want to combine notification about file changes etc. with
administrator-type notifications about the filesystem? To me these two
sound like rather different (although sometimes related) things.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 4/7] vfs: Add superblock notifications
From: Jann Horn @ 2019-05-29 14:16 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list
In-Reply-To: <24577.1559134719@warthog.procyon.org.uk>

On Wed, May 29, 2019 at 2:58 PM David Howells <dhowells@redhat.com> wrote:
> Jann Horn <jannh@google.com> wrote:
> > It might make sense to require that the path points to the root inode
> > of the superblock? That way you wouldn't be able to do this on a bind
> > mount that exposes part of a shared filesystem to a container.
>
> Why prevent that?  It doesn't prevent the container denizen from watching a
> bind mount that exposes the root of a shared filesystem into a container.

Well, yes, but if you expose the root of the shared filesystem to the
container, the container is probably meant to have a higher level of
access than if only a bind mount is exposed? But I don't know.

> It probably makes sense to permit the LSM to rule on whether a watch may be
> emplaced, however.

We should have some sort of reasonable policy outside of LSM code
though - the kernel should still be secure even if no LSMs are built
into it.

> > > +                       }
> > > +               }
> > > +               up_write(&s->s_umount);
> > > +               if (ret < 0)
> > > +                       kfree(watch);
> > > +       } else if (s->s_watchers) {
> >
> > This should probably have something like a READ_ONCE() for clarity?
>
> Note that I think I'll rearrange this to:
>
>         } else {
>                 ret = -EBADSLT;
>                 if (s->s_watchers) {
>                         down_write(&s->s_umount);
>                         ret = remove_watch_from_object(s->s_watchers, wqueue,
>                                                        s->s_unique_id, false);
>                         up_write(&s->s_umount);
>                 }
>         }
>
> I'm not sure READ_ONCE() is necessary, since s_watchers can only be
> instantiated once and the watch list then persists until the superblock is
> deactivated.  Furthermore, by the time deactivate_locked_super() is called, we
> can't be calling sb_notify() on it as it's become inaccessible.
>
> So if we see s->s_watchers as non-NULL, we should not see anything different
> inside the lock.  In fact, I should be able to rewrite the above to:
>
>         } else {
>                 ret = -EBADSLT;
>                 wlist = s->s_watchers;
>                 if (wlist) {
>                         down_write(&s->s_umount);
>                         ret = remove_watch_from_object(wlist, wqueue,
>                                                        s->s_unique_id, false);
>                         up_write(&s->s_umount);
>                 }
>         }

I'm extremely twitchy when it comes to code like this because AFAIK
gcc at least used to sometimes turn code that read a value from memory
and then used it multiple times into something with multiple memory
reads, leading to critical security vulnerabilities; see e.g. slide 36
of <https://www.blackhat.com/docs/us-16/materials/us-16-Wilhelm-Xenpwn-Breaking-Paravirtualized-Devices.pdf>.
I am not aware of any spec that requires the compiler to only perform
one read from the memory location in code like this.

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Daniel Walsh @ 2019-05-29 14:07 UTC (permalink / raw)
  To: Paul Moore
  Cc: Richard Guy Briggs, Steve Grubb, Neil Horman, containers,
	linux-api, Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <CAHC9VhT295iYu_uDcQ7eqVq8SSkYgEQAsoNrmpvbMR5ERcBzaA@mail.gmail.com>

On 5/29/19 9:17 AM, Paul Moore wrote:
> On Wed, May 29, 2019 at 8:03 AM Daniel Walsh <dwalsh@redhat.com> wrote:
>> On 5/28/19 8:43 PM, Richard Guy Briggs wrote:
>>> On 2019-05-28 19:00, Steve Grubb wrote:
>>>> On Tuesday, May 28, 2019 6:26:47 PM EDT Paul Moore wrote:
>>>>> On Tue, May 28, 2019 at 5:54 PM Daniel Walsh <dwalsh@redhat.com> wrote:
>>>>>> On 4/22/19 9:49 AM, Paul Moore wrote:
>>>>>>> On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com>
>>>> wrote:
>>>>>>>> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
>>>>>>>>> Implement kernel audit container identifier.
>>>>>>>> I'm sorry, I've lost track of this, where have we landed on it? Are we
>>>>>>>> good for inclusion?
>>>>>>> I haven't finished going through this latest revision, but unless
>>>>>>> Richard made any significant changes outside of the feedback from the
>>>>>>> v5 patchset I'm guessing we are "close".
>>>>>>>
>>>>>>> Based on discussions Richard and I had some time ago, I have always
>>>>>>> envisioned the plan as being get the kernel patchset, tests, docs
>>>>>>> ready (which Richard has been doing) and then run the actual
>>>>>>> implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
>>>>>>> to make sure the actual implementation is sane from their perspective.
>>>>>>> They've already seen the design, so I'm not expecting any real
>>>>>>> surprises here, but sometimes opinions change when they have actual
>>>>>>> code in front of them to play with and review.
>>>>>>>
>>>>>>> Beyond that, while the cri-o/lxc/etc. folks are looking it over,
>>>>>>> whatever additional testing we can do would be a big win.  I'm
>>>>>>> thinking I'll pull it into a separate branch in the audit tree
>>>>>>> (audit/working-container ?) and include that in my secnext kernels
>>>>>>> that I build/test on a regular basis; this is also a handy way to keep
>>>>>>> it based against the current audit/next branch.  If any changes are
>>>>>>> needed Richard can either chose to base those changes on audit/next or
>>>>>>> the separate audit container ID branch; that's up to him.  I've done
>>>>>>> this with other big changes in other trees, e.g. SELinux, and it has
>>>>>>> worked well to get some extra testing in and keep the patchset "merge
>>>>>>> ready" while others outside the subsystem look things over.
>>>>>> Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
>>>>>> believe this is something we can work on in the container runtimes team
>>>>>> to implement the container auditing code in CRI-O and Podman.
>>>>> Thanks Dan.  If I pulled this into a branch and built you some test
>>>>> kernels to play with, any idea how long it might take to get a proof
>>>>> of concept working on the cri-o side?
>>>> We'd need to merge user space patches and let them use that instead of the
>>>> raw interface. I'm not going to merge user space until we are pretty sure the
>>>> patch is going into the kernel.
>>> I have an f29 test rpm of the userspace bits if that helps for testing:
>>>       http://people.redhat.com/~rbriggs/ghak90/git-1db7e21/
>>>
>>> Here's what it contains (minus the last patch):
>>>       https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau40-containerid-filter.v7.0
>>>
>>>> -Steve
>>>>
>>>>> FWIW, I've also reached out to some of the LXC folks I know to get
>>>>> their take on the API.  I think if we can get two different container
>>>>> runtimes to give the API a thumbs-up then I think we are in good shape
>>>>> with respect to the userspace interface.
>>>>>
>>>>> I just finished looking over the last of the pending audit kernel
>>>>> patches that were queued waiting for the merge window to open so this
>>>>> is next on my list to look at.  I plan to start doing that
>>>>> tonight/tomorrow, and as long as the changes between v5/v6 are not
>>>>> that big, it shouldn't take too long.
>>> - RGB
>>>
>>> --
>>> Richard Guy Briggs <rgb@redhat.com>
>>> Sr. S/W Engineer, Kernel Security, Base Operating Systems
>>> Remote, Ottawa, Red Hat Canada
>>> IRC: rgb, SunRaycer
>>> Voice: +1.647.777.2635, Internal: (81) 32635
>> Our current thoughts are to put the setting of the ID inside of conmon,
>> and then launching the OCI Runtime.  In a perfect world this would
>> happen in the OCI Runtime, but we have no controls over different OCI
>> Runtimes.
>>
>> By putting it into conmon, then CRI-O and Podman will automatically get
>> the container id support.  After we have this we have to plumb it back
>> up through the contianer engines to be able to easily report the link
>> between the Container UUID and The Kernel Container Audit ID.
> I'm glad you guys have a plan, that's encouraging, but sadly I have no
> idea about the level of complexity/difficulty involved in modifying
> the various container bits for a proof-of-concept?  Are we talking a
> week or two?  A month?  More?
>
If we had the kernel and the libaudit api, it would involve a small
effort in conmon,  I would figure a few days for a POC.  Getting the
hole wiring into CRI-O and Podman, would be a little more effort.

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Paul Moore @ 2019-05-29 13:17 UTC (permalink / raw)
  To: Dan Walsh
  Cc: Richard Guy Briggs, Steve Grubb, Neil Horman, containers,
	linux-api, Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <31804653-7518-1a9c-83af-f6ce6a6ce408@redhat.com>

On Wed, May 29, 2019 at 8:03 AM Daniel Walsh <dwalsh@redhat.com> wrote:
>
> On 5/28/19 8:43 PM, Richard Guy Briggs wrote:
> > On 2019-05-28 19:00, Steve Grubb wrote:
> >> On Tuesday, May 28, 2019 6:26:47 PM EDT Paul Moore wrote:
> >>> On Tue, May 28, 2019 at 5:54 PM Daniel Walsh <dwalsh@redhat.com> wrote:
> >>>> On 4/22/19 9:49 AM, Paul Moore wrote:
> >>>>> On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com>
> >> wrote:
> >>>>>> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> >>>>>>> Implement kernel audit container identifier.
> >>>>>> I'm sorry, I've lost track of this, where have we landed on it? Are we
> >>>>>> good for inclusion?
> >>>>> I haven't finished going through this latest revision, but unless
> >>>>> Richard made any significant changes outside of the feedback from the
> >>>>> v5 patchset I'm guessing we are "close".
> >>>>>
> >>>>> Based on discussions Richard and I had some time ago, I have always
> >>>>> envisioned the plan as being get the kernel patchset, tests, docs
> >>>>> ready (which Richard has been doing) and then run the actual
> >>>>> implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> >>>>> to make sure the actual implementation is sane from their perspective.
> >>>>> They've already seen the design, so I'm not expecting any real
> >>>>> surprises here, but sometimes opinions change when they have actual
> >>>>> code in front of them to play with and review.
> >>>>>
> >>>>> Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> >>>>> whatever additional testing we can do would be a big win.  I'm
> >>>>> thinking I'll pull it into a separate branch in the audit tree
> >>>>> (audit/working-container ?) and include that in my secnext kernels
> >>>>> that I build/test on a regular basis; this is also a handy way to keep
> >>>>> it based against the current audit/next branch.  If any changes are
> >>>>> needed Richard can either chose to base those changes on audit/next or
> >>>>> the separate audit container ID branch; that's up to him.  I've done
> >>>>> this with other big changes in other trees, e.g. SELinux, and it has
> >>>>> worked well to get some extra testing in and keep the patchset "merge
> >>>>> ready" while others outside the subsystem look things over.
> >>>> Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
> >>>> believe this is something we can work on in the container runtimes team
> >>>> to implement the container auditing code in CRI-O and Podman.
> >>> Thanks Dan.  If I pulled this into a branch and built you some test
> >>> kernels to play with, any idea how long it might take to get a proof
> >>> of concept working on the cri-o side?
> >> We'd need to merge user space patches and let them use that instead of the
> >> raw interface. I'm not going to merge user space until we are pretty sure the
> >> patch is going into the kernel.
> > I have an f29 test rpm of the userspace bits if that helps for testing:
> >       http://people.redhat.com/~rbriggs/ghak90/git-1db7e21/
> >
> > Here's what it contains (minus the last patch):
> >       https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau40-containerid-filter.v7.0
> >
> >> -Steve
> >>
> >>> FWIW, I've also reached out to some of the LXC folks I know to get
> >>> their take on the API.  I think if we can get two different container
> >>> runtimes to give the API a thumbs-up then I think we are in good shape
> >>> with respect to the userspace interface.
> >>>
> >>> I just finished looking over the last of the pending audit kernel
> >>> patches that were queued waiting for the merge window to open so this
> >>> is next on my list to look at.  I plan to start doing that
> >>> tonight/tomorrow, and as long as the changes between v5/v6 are not
> >>> that big, it shouldn't take too long.
> > - RGB
> >
> > --
> > Richard Guy Briggs <rgb@redhat.com>
> > Sr. S/W Engineer, Kernel Security, Base Operating Systems
> > Remote, Ottawa, Red Hat Canada
> > IRC: rgb, SunRaycer
> > Voice: +1.647.777.2635, Internal: (81) 32635
>
> Our current thoughts are to put the setting of the ID inside of conmon,
> and then launching the OCI Runtime.  In a perfect world this would
> happen in the OCI Runtime, but we have no controls over different OCI
> Runtimes.
>
> By putting it into conmon, then CRI-O and Podman will automatically get
> the container id support.  After we have this we have to plumb it back
> up through the contianer engines to be able to easily report the link
> between the Container UUID and The Kernel Container Audit ID.

I'm glad you guys have a plan, that's encouraging, but sadly I have no
idea about the level of complexity/difficulty involved in modifying
the various container bits for a proof-of-concept?  Are we talking a
week or two?  A month?  More?

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Paul Moore @ 2019-05-29 13:14 UTC (permalink / raw)
  To: Richard Guy Briggs, Steve Grubb
  Cc: Dan Walsh, Neil Horman, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <20190529004352.vvicec7nnk6pvkwt@madcap2.tricolour.ca>

On Tue, May 28, 2019 at 8:44 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-28 19:00, Steve Grubb wrote:
> > On Tuesday, May 28, 2019 6:26:47 PM EDT Paul Moore wrote:
> > > On Tue, May 28, 2019 at 5:54 PM Daniel Walsh <dwalsh@redhat.com> wrote:

...

> > > > Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
> > > > believe this is something we can work on in the container runtimes team
> > > > to implement the container auditing code in CRI-O and Podman.
> > >
> > > Thanks Dan.  If I pulled this into a branch and built you some test
> > > kernels to play with, any idea how long it might take to get a proof
> > > of concept working on the cri-o side?
> >
> > We'd need to merge user space patches and let them use that instead of the
> > raw interface. I'm not going to merge user space until we are pretty sure the
> > patch is going into the kernel.
>
> I have an f29 test rpm of the userspace bits if that helps for testing:
>         http://people.redhat.com/~rbriggs/ghak90/git-1db7e21/
>
> Here's what it contains (minus the last patch):
>         https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau40-containerid-filter.v7.0

Yes, exactly.  Just as I plan to start making some test kernels for
people to play with (assuming v6 looks okay), I think it would be good
if Steve could make a test build of the latest audit userspace with
the audit container ID patches.  It really shouldn't be that hard, and
the benefits should far outweigh any time spent generating the
tree/builds.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer [ver #13]
From: David Howells @ 2019-05-29 13:12 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Al Viro, Ian Kent, Linux API, linux-fsdevel,
	linux-kernel, Miklos Szeredi
In-Reply-To: <CAJfpeguPTQ00zVjpwVQ4R8mEqE3aijCzNMAz6Wvr56xE-jfJag@mail.gmail.com>

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Would it make sense to use relayfs for the implementation of the
> mapped ring buffer?

Note that I reposted the notification patches under the correct cover note
later.  Could you repost your response there.

Subject: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications

David

^ permalink raw reply

* Re: [PATCH 4/7] vfs: Add superblock notifications
From: David Howells @ 2019-05-29 12:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list
In-Reply-To: <CAG48ez2o1egR13FDd3=CgdXP_MbBsZM4SX=+aqvR6eheWddhFg@mail.gmail.com>

Jann Horn <jannh@google.com> wrote:

> It might make sense to require that the path points to the root inode
> of the superblock? That way you wouldn't be able to do this on a bind
> mount that exposes part of a shared filesystem to a container.

Why prevent that?  It doesn't prevent the container denizen from watching a
bind mount that exposes the root of a shared filesystem into a container.

It probably makes sense to permit the LSM to rule on whether a watch may be
emplaced, however.

> > +                       ret = add_watch_to_object(watch, s->s_watchers);
> > +                       if (ret == 0) {
> > +                               spin_lock(&sb_lock);
> > +                               s->s_count++;
> > +                               spin_unlock(&sb_lock);
> 
> Why do watches hold references on the superblock they're watching?

Fair point.  It was necessary at one point, but I don't think it is now.  I'll
see if I can remove it.  Note that it doesn't stop a superblock from being
unmounted and destroyed.

> > +                       }
> > +               }
> > +               up_write(&s->s_umount);
> > +               if (ret < 0)
> > +                       kfree(watch);
> > +       } else if (s->s_watchers) {
> 
> This should probably have something like a READ_ONCE() for clarity?

Note that I think I'll rearrange this to:

	} else {
		ret = -EBADSLT;
		if (s->s_watchers) {
			down_write(&s->s_umount);
			ret = remove_watch_from_object(s->s_watchers, wqueue,
						       s->s_unique_id, false);
			up_write(&s->s_umount);
		}
	}

I'm not sure READ_ONCE() is necessary, since s_watchers can only be
instantiated once and the watch list then persists until the superblock is
deactivated.  Furthermore, by the time deactivate_locked_super() is called, we
can't be calling sb_notify() on it as it's become inaccessible.

So if we see s->s_watchers as non-NULL, we should not see anything different
inside the lock.  In fact, I should be able to rewrite the above to:

	} else {
		ret = -EBADSLT;
		wlist = s->s_watchers;
		if (wlist) {
			down_write(&s->s_umount);
			ret = remove_watch_from_object(wlist, wqueue,
						       s->s_unique_id, false);
			up_write(&s->s_umount);
		}
	}

David

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Daniel Walsh @ 2019-05-29 12:02 UTC (permalink / raw)
  To: Richard Guy Briggs, Steve Grubb
  Cc: Paul Moore, Neil Horman, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <20190529004352.vvicec7nnk6pvkwt@madcap2.tricolour.ca>

On 5/28/19 8:43 PM, Richard Guy Briggs wrote:
> On 2019-05-28 19:00, Steve Grubb wrote:
>> On Tuesday, May 28, 2019 6:26:47 PM EDT Paul Moore wrote:
>>> On Tue, May 28, 2019 at 5:54 PM Daniel Walsh <dwalsh@redhat.com> wrote:
>>>> On 4/22/19 9:49 AM, Paul Moore wrote:
>>>>> On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> 
>> wrote:
>>>>>> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
>>>>>>> Implement kernel audit container identifier.
>>>>>> I'm sorry, I've lost track of this, where have we landed on it? Are we
>>>>>> good for inclusion?
>>>>> I haven't finished going through this latest revision, but unless
>>>>> Richard made any significant changes outside of the feedback from the
>>>>> v5 patchset I'm guessing we are "close".
>>>>>
>>>>> Based on discussions Richard and I had some time ago, I have always
>>>>> envisioned the plan as being get the kernel patchset, tests, docs
>>>>> ready (which Richard has been doing) and then run the actual
>>>>> implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
>>>>> to make sure the actual implementation is sane from their perspective.
>>>>> They've already seen the design, so I'm not expecting any real
>>>>> surprises here, but sometimes opinions change when they have actual
>>>>> code in front of them to play with and review.
>>>>>
>>>>> Beyond that, while the cri-o/lxc/etc. folks are looking it over,
>>>>> whatever additional testing we can do would be a big win.  I'm
>>>>> thinking I'll pull it into a separate branch in the audit tree
>>>>> (audit/working-container ?) and include that in my secnext kernels
>>>>> that I build/test on a regular basis; this is also a handy way to keep
>>>>> it based against the current audit/next branch.  If any changes are
>>>>> needed Richard can either chose to base those changes on audit/next or
>>>>> the separate audit container ID branch; that's up to him.  I've done
>>>>> this with other big changes in other trees, e.g. SELinux, and it has
>>>>> worked well to get some extra testing in and keep the patchset "merge
>>>>> ready" while others outside the subsystem look things over.
>>>> Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
>>>> believe this is something we can work on in the container runtimes team
>>>> to implement the container auditing code in CRI-O and Podman.
>>> Thanks Dan.  If I pulled this into a branch and built you some test
>>> kernels to play with, any idea how long it might take to get a proof
>>> of concept working on the cri-o side?
>> We'd need to merge user space patches and let them use that instead of the 
>> raw interface. I'm not going to merge user space until we are pretty sure the 
>> patch is going into the kernel.
> I have an f29 test rpm of the userspace bits if that helps for testing:
> 	http://people.redhat.com/~rbriggs/ghak90/git-1db7e21/
>
> Here's what it contains (minus the last patch):
> 	https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau40-containerid-filter.v7.0
>
>> -Steve
>>
>>> FWIW, I've also reached out to some of the LXC folks I know to get
>>> their take on the API.  I think if we can get two different container
>>> runtimes to give the API a thumbs-up then I think we are in good shape
>>> with respect to the userspace interface.
>>>
>>> I just finished looking over the last of the pending audit kernel
>>> patches that were queued waiting for the merge window to open so this
>>> is next on my list to look at.  I plan to start doing that
>>> tonight/tomorrow, and as long as the changes between v5/v6 are not
>>> that big, it shouldn't take too long.
> - RGB
>
> --
> Richard Guy Briggs <rgb@redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635

Our current thoughts are to put the setting of the ID inside of conmon,
and then launching the OCI Runtime.  In a perfect world this would
happen in the OCI Runtime, but we have no controls over different OCI
Runtimes. 

By putting it into conmon, then CRI-O and Podman will automatically get
the container id support.  After we have this we have to plumb it back
up through the contianer engines to be able to easily report the link
between the Container UUID and The Kernel Container Audit ID.

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: David Howells @ 2019-05-29 11:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list
In-Reply-To: <CAG48ez2SAKbPeChAf06GMazMPPThFM+OR00abRZafAP7v+ptKw@mail.gmail.com>

Jann Horn <jannh@google.com> wrote:

> I don't really know. I guess it depends on how it's being used? If
> someone decides to e.g. make a file browser that installs watches for
> a bunch of mountpoints for some fancy sidebar showing the device
> mounts on the system, or something like that, that probably shouldn't
> inhibit unmounting... I don't know if that's a realistic use case.

In such a use case, I would envision the browser putting a watch on "/".  A
watch sees all events in the subtree rooted at that point and you must apply a
filter that filters them out if you're not interested (filter on
WATCH_INFO_IN_SUBTREE using info_filter and info_mask).

David

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: David Howells @ 2019-05-29 11:00 UTC (permalink / raw)
  To: Jann Horn, casey
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list
In-Reply-To: <CAG48ez2rRh2_Kq_EGJs5k-ZBNffGs_Q=vkQdinorBgo58tbGpg@mail.gmail.com>

Jann Horn <jannh@google.com> wrote:

> > +void post_mount_notification(struct mount *changed,
> > +                            struct mount_notification *notify)
> > +{
> > +       const struct cred *cred = current_cred();
> 
> This current_cred() looks bogus to me. Can't mount topology changes
> come from all sorts of places? For example, umount_mnt() from
> umount_tree() from dissolve_on_fput() from __fput(), which could
> happen pretty much anywhere depending on where the last reference gets
> dropped?

IIRC, that's what Casey argued is the right thing to do from a security PoV.
Casey?

Maybe I should pass in NULL creds in the case that an event is being generated
because an object is being destroyed due to the last usage[*] being removed.

 [*] Usage, not ref - Superblocks are a bit weird in their accounting.

David

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox