Linux userland API discussions

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-05-29 16:03 UTC
  To: Tycho Andersen
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190529153427.GB8959@cisco>

On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > It is not permitted to unset the audit container identifier.
> > > > A child inherits its parent's audit container identifier.
> > >
> > > ...
> > >
> > > >  /**
> > > > + * audit_set_contid - set current task's audit contid
> > > > + * @contid: contid value
> > > > + *
> > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > + *
> > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > + */
> > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > +{
> > > > +     u64 oldcontid;
> > > > +     int rc = 0;
> > > > +     struct audit_buffer *ab;
> > > > +     uid_t uid;
> > > > +     struct tty_struct *tty;
> > > > +     char comm[sizeof(current->comm)];
> > > > +
> > > > +     task_lock(task);
> > > > +     /* Can't set if audit disabled */
> > > > +     if (!task->audit) {
> > > > +             task_unlock(task);
> > > > +             return -ENOPROTOOPT;
> > > > +     }
> > > > +     oldcontid = audit_get_contid(task);
> > > > +     read_lock(&tasklist_lock);
> > > > +     /* Don't allow the audit containerid to be unset */
> > > > +     if (!audit_contid_valid(contid))
> > > > +             rc = -EINVAL;
> > > > +     /* if we don't have caps, reject */
> > > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > > +             rc = -EPERM;
> > > > +     /* if task has children or is not single-threaded, deny */
> > > > +     else if (!list_empty(&task->children))
> > > > +             rc = -EBUSY;
> > > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > +             rc = -EALREADY;
> > > > +     read_unlock(&tasklist_lock);
> > > > +     if (!rc)
> > > > +             task->audit->contid = contid;
> > > > +     task_unlock(task);
> > > > +
> > > > +     if (!audit_enabled)
> > > > +             return rc;
> > >
> > > ...but it is allowed to change it (assuming
> > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > immediately useful since we still live in the world of majority
> > > privileged containers if we didn't allow changing it, in addition to
> > > un-setting it.
> >
> > The idea is that only container orchestrators should be able to
> > set/modify the audit container ID, and since setting the audit
> > container ID can have a significant effect on the records captured
> > (and their routing to multiple daemons when we get there) modifying
> > the audit container ID is akin to modifying the audit configuration
> > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > is that you would only change the audit container ID from one
> > set/inherited value to another if you were nesting containers, in
> > which case the nested container orchestrator would need to be granted
> > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > compromise).
>
> But then don't you want some kind of ns_capable() instead (probably
> not the obvious one, though...)? With capable(), you can't really nest
> using the audit-id and user namespaces together.

You want capable() and not ns_capable() because you want to ensure
that the orchestrator has the rights in the init_ns as changes to the
audit container ID could have an auditing impact that spans the entire
system.  Setting the audit container ID is equivalent to munging the
kernel's audit configuration, and the audit configuration is not
"namespaced" in any way.  The audit container ID work is about
providing the right "container context" (as defined by userspace) with
the audit records so that admins have a better understanding about
what is going on in the system; it is very explicitly not creating an
audit namespace.

At some point in the future we will want to support running multiple
audit daemons, and have a configurable way of routing audit records
based on the audit container ID, which will blur the line regarding
audit namespaces, but even then I would argue we are not creating an
audit namespace.

-- 
paul moore
www.paul-moore.com

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: David Howells @ 2019-05-29 16:06 UTC
  To: Greg KH
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel
In-Reply-To: <20190528231218.GA28384@kroah.com>

Greg KH <gregkh@linuxfoundation.org> wrote:

> > kref_put() could potentially add an unnecessary extra stack frame and would
> > seem to be best avoided, though an optimising compiler ought to be able to
> > inline if it can.
> 
> If kref_put() is on your fast path, you have worse problems (kfree isn't
> fast, right?)
> 
> Anyway, it's an inline function, how can it add an extra stack frame?

The call to the function pointer.  Hopefully the compiler will optimise that
away for an inlineable function.

> > Are you now on the convert all refcounts to krefs path?
> 
> "now"?  Remember, I wrote kref all those years ago,

Yes - and I thought it wasn't a good idea at the time.  But this is the first
time you've mentioned it to me, let alone pushed to change to it, that I
recall.

> everyone should use
> it.  It saves us having to audit the same pattern over and over again.
> And, even nicer, it uses a refcount now, and as you are trying to
> reference count an object, it is exactly what this was written for.
> 
> So yes, I do think it should be used here, unless it is deemed to not
> fit the pattern/usage model.

kref_put() enforces a very specific destructor signature.  I know of places
where that doesn't work because the destructor takes more than one argument
(granted that this is not the case here).  So why does kref_put() exist at
all?  Why not kref_dec_and_test()?
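
To illustrate the destructor-signature point: a kref-style put hands the
release callback nothing but a pointer to the embedded counter, so the
callback has to recover the containing object itself and has no room for
extra arguments. A minimal userspace sketch, assuming C11 atomics; struct
ref, ref_get(), ref_put() and struct watch are hypothetical stand-ins, not
the kernel's struct kref or the code in this patch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kref: just an embedded counter. */
struct ref {
	atomic_int count;
};

static void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);
}

static void ref_get(struct ref *r)
{
	atomic_fetch_add_explicit(&r->count, 1, memory_order_relaxed);
}

/*
 * Drop one reference.  The release callback gets only the counter
 * pointer, which is the fixed signature being discussed.
 * Returns 1 if this call released the object, 0 otherwise.
 */
static int ref_put(struct ref *r, void (*release)(struct ref *))
{
	if (atomic_fetch_sub_explicit(&r->count, 1,
				      memory_order_acq_rel) == 1) {
		release(r);
		return 1;
	}
	return 0;
}

/* Recover the containing object from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical refcounted object. */
struct watch {
	int id;
	int released;	/* set by the destructor so callers can observe it */
	struct ref usage;
};

static void watch_release(struct ref *r)
{
	struct watch *w = container_of(r, struct watch, usage);

	w->released = 1;	/* a real destructor would free w here */
}
```

A destructor that needs a second argument cannot be passed to ref_put() as
written; the caller would have to stash that argument in the object first
or open-code a decrement-and-test.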

Why doesn't refcount_t get merged into kref, or vice versa?  Having both would
seem redundant.

Mind you, I've been gradually reverting atomic_t-to-refcount_t conversions
because it seems I'm not allowed refcount_inc/dec_return() and I want to get
at the refcount at that point for tracing purposes.

> > > > +module_exit(watch_queue_exit);
> > > 
> > > module_misc_device()?
> > 
> > 	warthog>git grep module_misc_device -- Documentation/
> > 	warthog1>
> 
> Do I have to document all helper macros?

If you add an API, documenting it is your privilege ;-)  It's an important
test of the API - if you can't describe it, it's probably wrong.

Now I will grant that you didn't add that function...

> Anyway, it saves you boilerplate code, but if built in, it's at the module
> init level, not the fs init level, like you are asking for here.  So that
> might not work, it's your call.

Actually, I probably shouldn't have a module exit function.  It can't be a
module as it's called by core code.  I'll switch to builtin_misc_device().

> And how does the tracing and perf ring buffers do this without needing
> volatile?  Why not use the same type of interface they provide, as it's
> always good to share code that has already had all of the nasty corner
> cases worked out.

I've no idea how trace does it - or even where - or even if.  As far as I can
see, grepping for mmap in kernel/trace/*, there's no mmap support.

Reading Documentation/trace/ring-buffer-design.txt the trace subsystem has
some sort of transient page fifo which is a lot more complicated than what I
want and doesn't look like it'll be mmap'able.

Looking at the perf ring buffer, there appears to be a missing barrier in
perf_aux_output_end():

	rb->user_page->aux_head = rb->aux_head;

should be:

	smp_store_release(&rb->user_page->aux_head, rb->aux_head);

It should also be using smp_load_acquire().  See
Documentation/core-api/circular-buffers.rst

And a (partial) patch has been proposed: https://lkml.org/lkml/2018/5/10/249
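
For illustration, the release/acquire pairing that circular-buffers.rst
describes can be sketched in userspace with C11 atomics. This is a minimal
single-producer/single-consumer sketch, not the perf code: struct ring,
ring_push() and ring_pop() are hypothetical names, and
memory_order_release/memory_order_acquire stand in for
smp_store_release()/smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 8u	/* must be a power of two */

struct ring {
	_Atomic unsigned int head;	/* written by the producer only */
	_Atomic unsigned int tail;	/* written by the consumer only */
	int slots[RING_SIZE];
};

/* Producer side: fill the slot first, then publish the new head. */
static bool ring_push(struct ring *r, int value)
{
	unsigned int head = atomic_load_explicit(&r->head,
						 memory_order_relaxed);
	/* Acquire pairs with the consumer's release store of tail. */
	unsigned int tail = atomic_load_explicit(&r->tail,
						 memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;	/* full */

	r->slots[head & (RING_SIZE - 1)] = value;
	/* Release: the slot write is visible before the new head is. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: observe head, read the slot, then publish the new tail. */
static bool ring_pop(struct ring *r, int *value)
{
	unsigned int tail = atomic_load_explicit(&r->tail,
						 memory_order_relaxed);
	/* Acquire pairs with the producer's release store of head. */
	unsigned int head = atomic_load_explicit(&r->head,
						 memory_order_acquire);

	if (head == tail)
		return false;	/* empty */

	*value = r->slots[tail & (RING_SIZE - 1)];
	/* Release: the slot read completes before the slot is handed back. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

A plain store to head, as in the perf line quoted above, would allow the
index update to become visible before the data it is meant to publish.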

David

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Jann Horn @ 2019-05-29 16:12 UTC
  To: Casey Schaufler
  Cc: David Howells, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Andy Lutomirski
In-Reply-To: <312a138c-e5b2-4bfb-b50b-40c82c55773f@schaufler-ca.com>

On Wed, May 29, 2019 at 5:53 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 5/29/2019 4:00 AM, David Howells wrote:
> > Jann Horn <jannh@google.com> wrote:
> >
> >>> +void post_mount_notification(struct mount *changed,
> >>> +                            struct mount_notification *notify)
> >>> +{
> >>> +       const struct cred *cred = current_cred();
> >> This current_cred() looks bogus to me. Can't mount topology changes
> >> come from all sorts of places? For example, umount_mnt() from
> >> umount_tree() from dissolve_on_fput() from __fput(), which could
> >> happen pretty much anywhere depending on where the last reference gets
> >> dropped?
> > IIRC, that's what Casey argued is the right thing to do from a security PoV.
> > Casey?
>
> You need to identify the credential of the subject that triggered
> the event. If it isn't current_cred(), the cred needs to be passed
> in to post_mount_notification(), or derived by some other means.
>
> > Maybe I should pass in NULL creds in the case that an event is being generated
> > because an object is being destroyed due to the last usage[*] being removed.
>
> You should pass the cred of the process that removed the
> last usage. If the last usage was removed by something like
> the power being turned off on a disk drive a system cred
> should be used. Someone or something caused the event. It can
> be important who it was.

The kernel's normal security model means that you should be able to
e.g. accept FDs that random processes send you and perform
read()/write() calls on them without acting as a subject in any
security checks; let alone close(). If you send a file descriptor over
a unix domain socket and the unix domain socket is garbage collected,
for example, I think the close() will just come from some random,
completely unrelated task that happens to trigger the garbage
collector?

Also, if someone does I/O via io_uring, I think the caller's
credentials for read/write operations will probably just be normal
kernel creds?

Here the checks probably aren't all that important, but in other
places, when people try to use an LSM as the primary line of defense,
checks that don't align with the kernel's normal security model might
lead to a bunch of problems.

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Casey Schaufler @ 2019-05-29 17:04 UTC
  To: Jann Horn
  Cc: David Howells, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Andy Lutomirski, casey
In-Reply-To: <CAG48ez2KMrTBFzO9p8GvduXruz+FNLPyhc2YivHePsgViEoT1g@mail.gmail.com>

On 5/29/2019 9:12 AM, Jann Horn wrote:
> On Wed, May 29, 2019 at 5:53 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 5/29/2019 4:00 AM, David Howells wrote:
>>> Jann Horn <jannh@google.com> wrote:
>>>
>>>>> +void post_mount_notification(struct mount *changed,
>>>>> +                            struct mount_notification *notify)
>>>>> +{
>>>>> +       const struct cred *cred = current_cred();
>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>> come from all sorts of places? For example, umount_mnt() from
>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>> happen pretty much anywhere depending on where the last reference gets
>>>> dropped?
>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>> Casey?
>> You need to identify the credential of the subject that triggered
>> the event. If it isn't current_cred(), the cred needs to be passed
>> in to post_mount_notification(), or derived by some other means.
>>
>>> Maybe I should pass in NULL creds in the case that an event is being generated
>>> because an object is being destroyed due to the last usage[*] being removed.
>> You should pass the cred of the process that removed the
>> last usage. If the last usage was removed by something like
>> the power being turned off on a disk drive a system cred
>> should be used. Someone or something caused the event. It can
>> be important who it was.
> The kernel's normal security model means that you should be able to
> e.g. accept FDs that random processes send you and perform
> read()/write() calls on them without acting as a subject in any
> security checks; let alone close().

Passed file descriptors are an anomaly in the security model
that (in this developer's opinion) should have never been
included. More than one of the "B" level UNIX systems disabled
them outright. 

>  If you send a file descriptor over
> a unix domain socket and the unix domain socket is garbage collected,
> for example, I think the close() will just come from some random,
> completely unrelated task that happens to trigger the garbage
> collector?

I never said this was going to be easy or pleasant.
Who destroyed the UDS? It didn't just spontaneously become
garbage. Well, not on modern Linux filesystems, anyway. 

> Also, if someone does I/O via io_uring, I think the caller's
> credentials for read/write operations will probably just be normal
> kernel creds?
>
> Here the checks probably aren't all that important, but in other
> places, when people try to use an LSM as the primary line of defense,
> checks that don't align with the kernel's normal security model might
> lead to a bunch of problems.

The kernel does not have a "normal security model". It has a
collection of disparate and almost but not quite contradictory
models for the various objects and mechanisms it implements.
It already has a bunch of problems, we're just used to them.

I can only send a signal to a process with the same UID. Why doesn't
a process have mode bits so that I could get signals from my group?

Why do IPC objects have creator bits, while files don't?

Why can I send a file descriptor over a UDS, but not a message queue?

Why can't I set the mode bits on a symlink?

What can go wrong if I don't map groups into a user namespace?

LSMs (in particular SELinux and Smack, which are classic mandatory access
control systems) are more consistent, but still have to deal with
some of these differences. A symlink gets a Smack label, for example.

The point being that it's very easy to add new mechanisms that do
wonderful things but that introduce unforeseen ways to bypass one
or more of the existing protections.

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Andy Lutomirski @ 2019-05-29 17:13 UTC
  To: Casey Schaufler
  Cc: David Howells, Jann Horn, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list
In-Reply-To: <312a138c-e5b2-4bfb-b50b-40c82c55773f@schaufler-ca.com>



> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 5/29/2019 4:00 AM, David Howells wrote:
>> Jann Horn <jannh@google.com> wrote:
>> 
>>>> +void post_mount_notification(struct mount *changed,
>>>> +                            struct mount_notification *notify)
>>>> +{
>>>> +       const struct cred *cred = current_cred();
>>> This current_cred() looks bogus to me. Can't mount topology changes
>>> come from all sorts of places? For example, umount_mnt() from
>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>> happen pretty much anywhere depending on where the last reference gets
>>> dropped?
>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>> Casey?
> 
> You need to identify the credential of the subject that triggered
> the event. If it isn't current_cred(), the cred needs to be passed
> in to post_mount_notification(), or derived by some other means.

Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.

(And receiver means whoever subscribed, presumably, not whoever called read() or mmap().)

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Casey Schaufler @ 2019-05-29 17:46 UTC
  To: Andy Lutomirski
  Cc: David Howells, Jann Horn, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, casey
In-Reply-To: <4552118F-BE9B-4905-BF0F-A53DC13D5A82@amacapital.net>

On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>
>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>> Jann Horn <jannh@google.com> wrote:
>>>
>>>>> +void post_mount_notification(struct mount *changed,
>>>>> +                            struct mount_notification *notify)
>>>>> +{
>>>>> +       const struct cred *cred = current_cred();
>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>> come from all sorts of places? For example, umount_mnt() from
>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>> happen pretty much anywhere depending on where the last reference gets
>>>> dropped?
>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>> Casey?
>> You need to identify the credential of the subject that triggered
>> the event. If it isn't current_cred(), the cred needs to be passed
>> in to post_mount_notification(), or derived by some other means.
> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.

There are two filesystems, "dot" and "dash". I am not allowed
to communicate with Fred on the system, and all precautions have
been taken to ensure I cannot. Fred asks for notifications on
all mount activity. I perform actions that result in notifications
on "dot" and "dash". Fred receives notifications and interprets
them using Morse code. This is not OK. If Wilma, who *is* allowed
to communicate with Fred, does the same actions, he should be
allowed to get the messages via Morse.

The event is information. The information is generated as a
result of my or Wilma's action. Fred is passive in this access.
Fred is not "reading" the event. The event is being written to
Fred. My process is the subject, and Fred's the object.

Other security modelers may disagree. The models they produce
are going to be *very* complicated and will introduce agents and
intermediate objects to justify Fred's reception of an event as
a read operation.

> (And receiver means whoever subscribed, presumably, not whoever called read() or mmap().)

The receiver is the process that gets the event. There may
be more than one receiver, and the receivers may have different
credentials. Each needs to be checked separately.

Isn't this starting to sound like the discussions on kdbus?
I'm not sure if that deserves a :) or a :( but probably one of the two.

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Jann Horn @ 2019-05-29 17:46 UTC
  To: David Howells
  Cc: Greg KH, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list, Kees Cook,
	Kernel Hardening
In-Reply-To: <31936.1559146000@warthog.procyon.org.uk>

On Wed, May 29, 2019 at 6:07 PM David Howells <dhowells@redhat.com> wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
> > everyone should use
> > it.  It saves us having to audit the same pattern over and over again.
> > And, even nicer, it uses a refcount now, and as you are trying to
> > reference count an object, it is exactly what this was written for.
> >
> > So yes, I do think it should be used here, unless it is deemed to not
> > fit the pattern/usage model.
>
> kref_put() enforces a very specific destructor signature.  I know of places
> where that doesn't work because the destructor takes more than one argument
> (granted that this is not the case here).  So why does kref_put() exist at
> all?  Why not kref_dec_and_test()?
>
> Why doesn't refcount_t get merged into kref, or vice versa?  Having both would
> seem redundant.
>
> Mind you, I've been gradually reverting atomic_t-to-refcount_t conversions
> because it seems I'm not allowed refcount_inc/dec_return() and I want to get
> at the refcount at that point for tracing purposes.

Yeeech, that's horrible, please don't do that.

Does this mean that refcount_read() isn't sufficient for what you want
to do with tracing (because for some reason you actually need to know
the values atomically at the time of increment/decrement)?

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Jann Horn @ 2019-05-29 18:11 UTC
  To: Casey Schaufler
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list
In-Reply-To: <058f227c-71ab-a6f4-00bf-b8782b3b2956@schaufler-ca.com>

On Wed, May 29, 2019 at 7:46 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
> >> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>> On 5/29/2019 4:00 AM, David Howells wrote:
> >>> Jann Horn <jannh@google.com> wrote:
> >>>
> >>>>> +void post_mount_notification(struct mount *changed,
> >>>>> +                            struct mount_notification *notify)
> >>>>> +{
> >>>>> +       const struct cred *cred = current_cred();
> >>>> This current_cred() looks bogus to me. Can't mount topology changes
> >>>> come from all sorts of places? For example, umount_mnt() from
> >>>> umount_tree() from dissolve_on_fput() from __fput(), which could
> >>>> happen pretty much anywhere depending on where the last reference gets
> >>>> dropped?
> >>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
> >>> Casey?
> >> You need to identify the credential of the subject that triggered
> >> the event. If it isn't current_cred(), the cred needs to be passed
> >> in to post_mount_notification(), or derived by some other means.
> > Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
>
> There are two filesystems, "dot" and "dash". I am not allowed
> to communicate with Fred on the system, and all precautions have
> been taken to ensure I cannot. Fred asks for notifications on
> all mount activity. I perform actions that result in notifications
> on "dot" and "dash". Fred receives notifications and interprets
> them using Morse code. This is not OK. If Wilma, who *is* allowed
> to communicate with Fred, does the same actions, he should be
> allowed to get the messages via Morse.

In other words, a classic covert channel. You can't really prevent two
cooperating processes from communicating through a covert channel on a
modern computer. You can transmit information through the scheduler,
through hyperthread resource sharing, through CPU data caches, through
disk contention, through page cache state, through RAM contention, and
probably dozens of other ways that I can't think of right now. There
have been plenty of papers that demonstrated things like an SSH
connection between two virtual machines without network access running
on the same physical host (<https://gruss.cc/files/hello.pdf>),
communication between a VM and a browser running on the host system,
and so on.

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Casey Schaufler @ 2019-05-29 19:28 UTC
  To: Jann Horn
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, casey
In-Reply-To: <CAG48ez2S+i2wxpWXVGpEAprgY9gtjxyejLfbZtrqu5YOkQ81Nw@mail.gmail.com>

On 5/29/2019 11:11 AM, Jann Horn wrote:
> On Wed, May 29, 2019 at 7:46 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>>>> Jann Horn <jannh@google.com> wrote:
>>>>>
>>>>>>> +void post_mount_notification(struct mount *changed,
>>>>>>> +                            struct mount_notification *notify)
>>>>>>> +{
>>>>>>> +       const struct cred *cred = current_cred();
>>>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>>>> come from all sorts of places? For example, umount_mnt() from
>>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>>>> happen pretty much anywhere depending on where the last reference gets
>>>>>> dropped?
>>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>>>> Casey?
>>>> You need to identify the credential of the subject that triggered
>>>> the event. If it isn't current_cred(), the cred needs to be passed
>>>> in to post_mount_notification(), or derived by some other means.
>>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
>> There are two filesystems, "dot" and "dash". I am not allowed
>> to communicate with Fred on the system, and all precautions have
>> been taken to ensure I cannot. Fred asks for notifications on
>> all mount activity. I perform actions that result in notifications
>> on "dot" and "dash". Fred receives notifications and interprets
>> them using Morse code. This is not OK. If Wilma, who *is* allowed
>> to communicate with Fred, does the same actions, he should be
>> allowed to get the messages via Morse.
> In other words, a classic covert channel. You can't really prevent two
> cooperating processes from communicating through a covert channel on a
> modern computer.

That doesn't give you permission to design them in.
Plus, the LSMs that implement mandatory access controls
are going to want to intervene. No unclassified user
should see notifications caused by Top Secret users.

>  You can transmit information through the scheduler,
> through hyperthread resource sharing, through CPU data caches, through
> disk contention, through page cache state, through RAM contention, and
> probably dozens of other ways that I can't think of right now.

Yeah, and there's been a lot of activity to reduce those,
which are hard to exploit, as opposed to this, which would
be trivial and obvious.

> There
> have been plenty of papers that demonstrated things like an SSH
> connection between two virtual machines without network access running
> on the same physical host (<https://gruss.cc/files/hello.pdf>),
> communication between a VM and a browser running on the host system,
> and so on.

So you're saying we shouldn't have mode bits on files because
spectre/meltdown makes them pointless?

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Jann Horn @ 2019-05-29 19:47 UTC
  To: Casey Schaufler
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list
In-Reply-To: <0cd823ca-4733-19ef-c13e-ed5ac8c63a0f@schaufler-ca.com>

On Wed, May 29, 2019 at 9:28 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 5/29/2019 11:11 AM, Jann Horn wrote:
> > On Wed, May 29, 2019 at 7:46 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
> >>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>>>> On 5/29/2019 4:00 AM, David Howells wrote:
> >>>>> Jann Horn <jannh@google.com> wrote:
> >>>>>
> >>>>>>> +void post_mount_notification(struct mount *changed,
> >>>>>>> +                            struct mount_notification *notify)
> >>>>>>> +{
> >>>>>>> +       const struct cred *cred = current_cred();
> >>>>>> This current_cred() looks bogus to me. Can't mount topology changes
> >>>>>> come from all sorts of places? For example, umount_mnt() from
> >>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
> >>>>>> happen pretty much anywhere depending on where the last reference gets
> >>>>>> dropped?
> >>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
> >>>>> Casey?
> >>>> You need to identify the credential of the subject that triggered
> >>>> the event. If it isn't current_cred(), the cred needs to be passed
> >>>> in to post_mount_notification(), or derived by some other means.
> >>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
> >> There are two filesystems, "dot" and "dash". I am not allowed
> >> to communicate with Fred on the system, and all precautions have
> >> been taken to ensure I cannot. Fred asks for notifications on
> >> all mount activity. I perform actions that result in notifications
> >> on "dot" and "dash". Fred receives notifications and interprets
> >> them using Morse code. This is not OK. If Wilma, who *is* allowed
> >> to communicate with Fred, does the same actions, he should be
> >> allowed to get the messages via Morse.
> > In other words, a classic covert channel. You can't really prevent two
> > cooperating processes from communicating through a covert channel on a
> > modern computer.
>
> That doesn't give you permission to design them in.
> Plus, the LSMs that implement mandatory access controls
> are going to want to intervene. No unclassified user
> should see notifications caused by Top Secret users.

But that's probably because they're worried about *side* channels, not
covert channels?

Talking about this in the context of (small) side channels: The
notification types introduced in this patch are mostly things that a
user would be able to observe anyway if they polled /proc/self/mounts,
right? It might make sense to align access controls based on that - if
you don't want it to be possible to observe events happening on some
mount points through this API, you should probably lock down
/proc/*/mounts equivalently, by introducing an LSM hook for "is @cred
allowed to see @mnt" or something like that - and if you want to
compare two cred structures, you could record the cred structure that
is responsible for the creation of the mount point, or something like
that.
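
For reference, the polling mentioned above needs no new API: per proc(5),
/proc/self/mounts is pollable and a mount or unmount is reported as a
priority event. A minimal sketch, assuming Linux with /proc mounted;
wait_for_mount_change() is a hypothetical helper name, and the caller is
expected to re-read the file afterwards to find out what changed:

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the mount table behind fd to change.
 * fd is expected to be /proc/self/mounts opened for reading.
 * Returns 1 on a change, 0 on timeout, -1 if poll() itself failed.
 */
static int wait_for_mount_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0)
		return -1;
	/* A changed mount table shows up as POLLERR | POLLPRI. */
	return (n > 0 && (pfd.revents & (POLLERR | POLLPRI))) ? 1 : 0;
}
```

Anything a process can learn this way it can already learn today, which is
the baseline the access-control comparison above starts from.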

For some of the other patches, I guess things get more tricky because
the notification exposes new information that wasn't really available
before.

> >  You can transmit information through the scheduler,
> > through hyperthread resource sharing, through CPU data caches, through
> > disk contention, through page cache state, through RAM contention, and
> > probably dozens of other ways that I can't think of right now.
>
> Yeah, and there's been a lot of activity to reduce those,
> which are hard to exploit, as opposed to this, which would
> be trivial and obvious.
>
> > There
> > have been plenty of papers that demonstrated things like an SSH
> > connection between two virtual machines without network access running
> > on the same physical host (<https://gruss.cc/files/hello.pdf>),
> > communication between a VM and a browser running on the host system,
> > and so on.
>
> So you're saying we shouldn't have mode bits on files because
> spectre/meltdown makes them pointless?

spectre/meltdown are vulnerabilities that are being mitigated.
Microarchitectural covert channels are an accepted fact and I haven't
heard of anyone seriously considering trying to get rid of them all.

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Casey Schaufler @ 2019-05-29 20:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, casey
In-Reply-To: <CAG48ez0X7rKw-qfZm9i+8OLq7YccBRtV3aF-7hkQsfWaiTbuXg@mail.gmail.com>

On 5/29/2019 12:47 PM, Jann Horn wrote:
> On Wed, May 29, 2019 at 9:28 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 5/29/2019 11:11 AM, Jann Horn wrote:
>>> On Wed, May 29, 2019 at 7:46 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>>>>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>>>>>> Jann Horn <jannh@google.com> wrote:
>>>>>>>
>>>>>>>>> +void post_mount_notification(struct mount *changed,
>>>>>>>>> +                            struct mount_notification *notify)
>>>>>>>>> +{
>>>>>>>>> +       const struct cred *cred = current_cred();
>>>>>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>>>>>> come from all sorts of places? For example, umount_mnt() from
>>>>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>>>>>> happen pretty much anywhere depending on where the last reference gets
>>>>>>>> dropped?
>>>>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>>>>>> Casey?
>>>>>> You need to identify the credential of the subject that triggered
>>>>>> the event. If it isn't current_cred(), the cred needs to be passed
>>>>>> in to post_mount_notification(), or derived by some other means.
>>>>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
>>>> There are two filesystems, "dot" and "dash". I am not allowed
>>>> to communicate with Fred on the system, and all precautions have
>>>> been taken to ensure I cannot. Fred asks for notifications on
>>>> all mount activity. I perform actions that result in notifications
>>>> on "dot" and "dash". Fred receives notifications and interprets
>>>> them using Morse code. This is not OK. If Wilma, who *is* allowed
>>>> to communicate with Fred, does the same actions, he should be
>>>> allowed to get the messages via Morse.
>>> In other words, a classic covert channel. You can't really prevent two
>>> cooperating processes from communicating through a covert channel on a
>>> modern computer.
>> That doesn't give you permission to design them in.
>> Plus, the LSMs that implement mandatory access controls
>> are going to want to intervene. No unclassified user
>> should see notifications caused by Top Secret users.
> But that's probably because they're worried about *side* channels, not
> covert channels?

The security evaluators from the 1990's considered any channel
with greater than 1 bit/second bandwidth a show-stopper. That was
true for covert and side channels. Further, if you knew that a
mechanism had a channel, as this one does, and you didn't fix it,
you didn't get your certificate. If you know about a problem
during the design/implementation phase it's really inexcusable not
to fix it before "completing" the code.

> Talking about this in the context of (small) side channels: The
> notification types introduced in this patch are mostly things that a
> user would be able to observe anyway if they polled /proc/self/mounts,
> right?

It's supposed to be a general mechanism. Of course it would
be simpler if it was restricted to things you can get at via
/proc/self.

>  It might make sense to align access controls based on that - if
> you don't want it to be possible to observe events happening on some
> mount points through this API, you should probably lock down
> /proc/*/mounts equivalently, by introducing an LSM hook for "is @cred
> allowed to see @mnt" or something like that - and if you want to
> compare two cred structures, you could record the cred structure that
> is responsible for the creation of the mount point, or something like
> that.

I'm not going to argue against that.

> For some of the other patches, I guess things get more tricky because
> the notification exposes new information that wasn't really available
> before.

We have to look not just at the information being available,
but the mechanism used. Being able to look at information about
a process in /proc doesn't mean I should be able to look at it
using ptrace(). Access control isn't done on data, it's done on
objects. That I can get information by looking in one object provides
no assurance that I can get it through a different object containing
the same information. This happens in /dev all over the place. A
file with hard links may be accessible by one path but not another.

>
>>>  You can transmit information through the scheduler,
>>> through hyperthread resource sharing, through CPU data caches, through
>>> disk contention, through page cache state, through RAM contention, and
>>> probably dozens of other ways that I can't think of right now.
>> Yeah, and there's been a lot of activity to reduce those,
>> which are hard to exploit, as opposed to this, which would
>> be trivial and obvious.
>>
>>> There
>>> have been plenty of papers that demonstrated things like an SSH
>>> connection between two virtual machines without network access running
>>> on the same physical host (<https://gruss.cc/files/hello.pdf>),
>>> communication between a VM and a browser running on the host system,
>>> and so on.
>> So you're saying we shouldn't have mode bits on files because
>> spectre/meltdown makes them pointless?
> spectre/meltdown are vulnerabilities that are being mitigated.
> Microarchitectural covert channels are an accepted fact and I haven't
> heard of anyone seriously considering trying to get rid of them all.

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: David Howells @ 2019-05-29 21:02 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Greg KH, Al Viro, raven, linux-fsdevel, Linux API,
	linux-block, keyrings, linux-security-module, kernel list,
	Kees Cook, Kernel Hardening
In-Reply-To: <CAG48ez0R-R3Xs+3Xg9T9qcV3Xv6r4pnx1Z2y=Ltx7RGOayte_w@mail.gmail.com>

Jann Horn <jannh@google.com> wrote:

> Does this mean that refcount_read() isn't sufficient for what you want
> to do with tracing (because for some reason you actually need to know
> the values atomically at the time of increment/decrement)?

Correct.  There's a gap and if an interrupt or something occurs, it's
sufficiently big for the refcount trace to go weird.

I've seen it in afs/rxrpc where the incoming network packets that are part of
the rxrpc call flow disrupt the refcounts noted in trace lines.
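
To illustrate the gap described above, here is a minimal userspace sketch
using C11 atomics rather than the kernel's refcount_t API.  The names
(struct obj, get_ref_racy, get_ref_traced) are hypothetical and do not come
from the patch.  The first helper increments and then reads the counter as
two separate steps, so another context can change the count in between; the
second takes the value from the atomic operation itself.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a refcounted object; not the kernel's refcount_t. */
struct obj {
	atomic_int usage;
};

/*
 * Racy for tracing: the increment and the read are two separate steps,
 * so an interrupt or another CPU may change usage between them.
 */
int get_ref_racy(struct obj *o)
{
	atomic_fetch_add(&o->usage, 1);
	return atomic_load(&o->usage);	/* may no longer be "old + 1" */
}

/*
 * The value comes from the atomic operation itself: atomic_fetch_add()
 * returns the previous value, so previous + 1 is what this call produced.
 */
int get_ref_traced(struct obj *o)
{
	return atomic_fetch_add(&o->usage, 1) + 1;
}
```

In a single-threaded run both helpers report the same number; the difference
only shows up when something else touches the counter between the two steps
of the racy variant.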

David

* Re: [PATCH ghak90 V6 04/10] audit: log container info of syscalls
From: Paul Moore @ 2019-05-29 22:15 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <f4a49f7c949e5df80c339a3fe5c4c2303b12bf23.1554732921.git.rgb@redhat.com>

On Mon, Apr 8, 2019 at 11:40 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> Create a new audit record AUDIT_CONTAINER_ID to document the audit
> container identifier of a process if it is present.
>
> Called from audit_log_exit(), syscalls are covered.
>
> A sample raw event:
> type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
> type=CWD msg=audit(1519924845.499:257): cwd="/root"
> type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
> type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
> type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
> type=CONTAINER_ID msg=audit(1519924845.499:257): contid=123458
>
> Please see the github audit kernel issue for the main feature:
>   https://github.com/linux-audit/audit-kernel/issues/90
> Please see the github audit userspace issue for supporting additions:
>   https://github.com/linux-audit/audit-userspace/issues/51
> Please see the github audit testsuite issue for the test case:
>   https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
>   https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Steve Grubb <sgrubb@redhat.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> ---
>  include/linux/audit.h      |  5 +++++
>  include/uapi/linux/audit.h |  1 +
>  kernel/audit.c             | 20 ++++++++++++++++++++
>  kernel/auditsc.c           | 20 ++++++++++++++------
>  4 files changed, 40 insertions(+), 6 deletions(-)

...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index 182b0f2c183d..3e0af53f3c4d 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -2127,6 +2127,26 @@ void audit_log_session_info(struct audit_buffer *ab)
>         audit_log_format(ab, "auid=%u ses=%u", auid, sessionid);
>  }
>
> +/*
> + * audit_log_contid - report container info
> + * @context: task or local context for record
> + * @contid: container ID to report
> + */
> +void audit_log_contid(struct audit_context *context, u64 contid)
> +{
> +       struct audit_buffer *ab;
> +
> +       if (!audit_contid_valid(contid))
> +               return;
> +       /* Generate AUDIT_CONTAINER_ID record with container ID */
> +       ab = audit_log_start(context, GFP_KERNEL, AUDIT_CONTAINER_ID);
> +       if (!ab)
> +               return;
> +       audit_log_format(ab, "contid=%llu", (unsigned long long)contid);

We have a consistency problem regarding how to output the u64 contid
values; this function uses an explicit cast, others do not.  According
to Documentation/core-api/printk-formats.rst the recommendation for
u64 is %llu (or %llx, if you want hex).  Looking quickly through the
printk code this appears to still be correct.  I suggest we get rid of
the cast (like it was in v5).
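
As a rough illustration of the cast-free form, here is a userspace sketch.
It assumes a local typedef that mirrors the kernel's u64 (unsigned long
long), and format_contid() is a hypothetical helper standing in for the
audit_log_format() call; it is not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-in for the kernel type: u64 is unsigned long long there. */
typedef unsigned long long u64;

/*
 * Hypothetical helper: formats the contid field with %llu and no cast,
 * since the argument already has the type that %llu expects.
 */
int format_contid(char *buf, size_t len, u64 contid)
{
	return snprintf(buf, len, "contid=%llu", contid);
}
```

With the value from the sample record above, the buffer would hold
"contid=123458".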

> +       audit_log_end(ab);
> +}
> +EXPORT_SYMBOL(audit_log_contid);

* Re: [PATCH ghak90 V6 08/10] audit: add containerid filtering
From: Paul Moore @ 2019-05-29 22:16 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <0785ee2644804f3ec6af1243cc0dcf89709c1fd4.1554732921.git.rgb@redhat.com>

On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> Implement audit container identifier filtering using the AUDIT_CONTID
> field name to send an 8-character string representing a u64 since the
> value field is only u32.
>
> Sending it as two u32 was considered, but gathering and comparing two
> fields was more complex.
>
> The feature indicator is AUDIT_FEATURE_BITMAP_CONTAINERID.
>
> Please see the github audit kernel issue for the contid filter feature:
>   https://github.com/linux-audit/audit-kernel/issues/91
> Please see the github audit userspace issue for filter additions:
>   https://github.com/linux-audit/audit-userspace/issues/40
> Please see the github audit testsuite issue for the test case:
>   https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
>   https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> ---
>  include/linux/audit.h      |  1 +
>  include/uapi/linux/audit.h |  5 ++++-
>  kernel/audit.h             |  1 +
>  kernel/auditfilter.c       | 47 ++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/auditsc.c           |  4 ++++
>  5 files changed, 57 insertions(+), 1 deletion(-)

...

> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> index 63f8b3f26fab..407b5bb3b4c6 100644
> --- a/kernel/auditfilter.c
> +++ b/kernel/auditfilter.c
> @@ -1206,6 +1224,31 @@ int audit_comparator(u32 left, u32 op, u32 right)
>         }
>  }
>
> +int audit_comparator64(u64 left, u32 op, u64 right)
> +{
> +       switch (op) {
> +       case Audit_equal:
> +               return (left == right);
> +       case Audit_not_equal:
> +               return (left != right);
> +       case Audit_lt:
> +               return (left < right);
> +       case Audit_le:
> +               return (left <= right);
> +       case Audit_gt:
> +               return (left > right);
> +       case Audit_ge:
> +               return (left >= right);
> +       case Audit_bitmask:
> +               return (left & right);
> +       case Audit_bittest:
> +               return ((left & right) == right);
> +       default:
> +               BUG();

A little birdy mentioned the BUG() here as a potential issue and while
I had ignored it in earlier patches because this is likely a
cut-n-paste from another audit comparator function, I took a closer
look this time.  It appears as though we will never have an invalid op
value as audit_data_to_entry()/audit_to_op() ensure that the op value
is a known good value.  Removing the BUG() from all the audit
comparators is a separate issue, but I think it would be good to
remove it from this newly added comparator; keeping it so that we
return "0" in the default case seems reasonable.
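
A userspace sketch of what that could look like follows.  The enum and the
function name (enum cmp_op, comparator64_sketch) are local stand-ins, not
the kernel's definitions.  One further choice made only in this sketch, and
not raised in the thread: the bitmask case is normalized to 0/1 so that a
match in bits above 31 is not lost when the 64-bit result is converted to
int.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the audit operators named in the patch. */
enum cmp_op {
	OP_EQUAL, OP_NOT_EQUAL, OP_LT, OP_LE, OP_GT, OP_GE,
	OP_BITMASK, OP_BITTEST,
};

int comparator64_sketch(uint64_t left, enum cmp_op op, uint64_t right)
{
	switch (op) {
	case OP_EQUAL:     return left == right;
	case OP_NOT_EQUAL: return left != right;
	case OP_LT:        return left < right;
	case OP_LE:        return left <= right;
	case OP_GT:        return left > right;
	case OP_GE:        return left >= right;
	/* Normalized to 0/1; a plain (left & right) would truncate to int. */
	case OP_BITMASK:   return (left & right) != 0;
	case OP_BITTEST:   return (left & right) == right;
	default:
		/* Unknown op: report "no match" instead of calling BUG(). */
		return 0;
	}
}
```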

> +               return 0;
> +       }
> +}

--
paul moore
www.paul-moore.com

* Re: [PATCH ghak90 V6 09/10] audit: add support for containerid to network namespaces
From: Paul Moore @ 2019-05-29 22:17 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <423ed5e5c5e4ed7c3e26ac7d2bd7c267aaae777c.1554732921.git.rgb@redhat.com>

On Mon, Apr 8, 2019 at 11:41 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> Audit events could happen in a network namespace outside of a task
> context due to packets received from the net that trigger an auditing
> rule prior to being associated with a running task.  The network
> namespace could be in use by multiple containers by association to the
> tasks in that network namespace.  We still want a way to attribute
> these events to any potential containers.  Keep a list per network
> namespace to track these audit container identifiers.
>
> Add/increment the audit container identifier on:
> - initial setting of the audit container identifier via /proc
> - clone/fork call that inherits an audit container identifier
> - unshare call that inherits an audit container identifier
> - setns call that inherits an audit container identifier
> Delete/decrement the audit container identifier on:
> - an inherited audit container identifier dropped when child set
> - process exit
> - unshare call that drops a net namespace
> - setns call that drops a net namespace
>
> Please see the github audit kernel issue for contid net support:
>   https://github.com/linux-audit/audit-kernel/issues/92
> Please see the github audit testsuite issue for the test case:
>   https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
>   https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> ---
>  include/linux/audit.h | 19 +++++++++++
>  kernel/audit.c        | 88 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/nsproxy.c      |  4 +++
>  3 files changed, 108 insertions(+), 3 deletions(-)

...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index 6c742da66b32..996213591617 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -376,6 +384,75 @@ static struct sock *audit_get_sk(const struct net *net)
>         return aunet->sk;
>  }
>
> +void audit_netns_contid_add(struct net *net, u64 contid)
> +{
> +       struct audit_net *aunet;
> +       struct list_head *contid_list;
> +       struct audit_contid *cont;
> +
> +       if (!net)
> +               return;
> +       if (!audit_contid_valid(contid))
> +               return;
> +       aunet = net_generic(net, audit_net_id);
> +       if (!aunet)
> +               return;
> +       contid_list = &aunet->contid_list;
> +       spin_lock(&aunet->contid_list_lock);
> +       list_for_each_entry_rcu(cont, contid_list, list)
> +               if (cont->id == contid) {
> +                       refcount_inc(&cont->refcount);
> +                       goto out;
> +               }
> +       cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC);
> +       if (cont) {
> +               INIT_LIST_HEAD(&cont->list);

I thought you were going to get rid of this INIT_LIST_HEAD() call?
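
For readers wondering why the initialisation is redundant, here is a
minimal userspace sketch of a doubly linked list insert.  It is a
simplified stand-in written for this note (list_init_sketch and
list_add_sketch are hypothetical names) and it leaves out the RCU publish
step of list_add_rcu(); it only shows that an insert overwrites both
pointers of the new node, whatever they held before.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Make an empty list: the head points at itself in both directions. */
void list_init_sketch(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Insert entry right after head.  Both entry->next and entry->prev are
 * assigned here, so their previous contents never matter.
 */
void list_add_sketch(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}
```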

> +               cont->id = contid;
> +               refcount_set(&cont->refcount, 1);
> +               list_add_rcu(&cont->list, contid_list);
> +       }
> +out:
> +       spin_unlock(&aunet->contid_list_lock);
> +}

* Re: [PATCH v1 1/2] fork: add clone3
From: Andrei Vagin @ 2019-05-29 22:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: viro, linux-kernel, torvalds, jannh, fweimer, oleg, arnd,
	dhowells, Pavel Emelyanov, Andrew Morton, Adrian Reber, linux-api
In-Reply-To: <20190529152237.10719-1-christian@brauner.io>

On Wed, May 29, 2019 at 05:22:36PM +0200, Christian Brauner wrote:
> This adds the clone3 system call.
> 
> As mentioned several times already (cf. [7], [8]) here's the promised
> patchset for clone3().
> 
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
> 
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumbers Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recyclable as we have identified at least two large or widely used
> codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> blocked. clone3() has the advantage that it will unblock this patchset
> again.

Hi Christian,

Thank you for thinking about time namespaces. I looked at this patch
quickly and I would suggest moving the termination signal out of flags. I
think we can add a separate field (exit_signal) into clone_args. Does it
make sense? For me, exit_signal in flags always looked weird...
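
A rough sketch of the idea, for illustration only.  With clone(), the
termination signal travels in the low byte of the flags word (the CSIGNAL
mask, 0x000000ff).  The struct and helper below are hypothetical and are
not the layout proposed in the patch; they only show how a legacy flags
value could be split into a pure flags field and a separate exit_signal
field.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as CSIGNAL in the uapi sched.h; redefined locally for the sketch. */
#define SKETCH_CSIGNAL 0x000000ffULL

/* Hypothetical two-field view; not the struct clone_args from the patch. */
struct clone_args_sketch {
	uint64_t flags;		/* CLONE_* bits only */
	uint64_t exit_signal;	/* signal sent to the parent on exit */
};

/* Split a legacy clone() flags word into flags and termination signal. */
struct clone_args_sketch split_legacy_flags(uint64_t legacy_flags)
{
	struct clone_args_sketch args = {
		.flags = legacy_flags & ~SKETCH_CSIGNAL,
		.exit_signal = legacy_flags & SKETCH_CSIGNAL,
	};

	return args;
}
```

For example, a legacy value of CLONE_VM (0x100) or-ed with signal number 17
would come out as flags 0x100 and exit_signal 17.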

I will look at this patch in more detail later this week. Thanks.

> 
> The idea is to keep clone3() very simple and close to the original clone(),
> specifically, to keep on supporting old clone()-based workloads.
> We know there have been various creative proposals how a new process
> creation syscall or even api is supposed to look like. Some people even
> going so far as to argue that the traditional fork()+exec() split should be
> abandoned in favor of an in-kernel version of spawn(). Independent of
> whether or not we personally think spawn() is a good idea this patchset has
> and does not want to have anything to do with this.
> One stance we take is that there's no real good alternative to
> clone()+exec() and we need and want to support this model going forward;
> independent of spawn().
> The following requirements guided clone3():
> - bump the number of available flags
> - move arguments that are currently passed as separate arguments
>   in clone() into a dedicated struct clone_args
>   - choose a struct layout that is easy to handle on 32 and on 64 bit
>   - choose a struct layout that is extensible
>   - give new flags that currently need to abuse another flag's dedicated
>     return argument in clone() their own dedicated return argument
>     (e.g. CLONE_PIDFD)
>   - use a separate kernel internal struct kernel_clone_args that is
>     properly typed according to current kernel conventions in fork.c and is
>     different from  the uapi struct clone_args
> - port _do_fork() to use kernel_clone_args so that all process creation
>   syscalls such as fork(), vfork(), clone(), and clone3() behave identically
>   (Arnd suggested, that we can probably also port do_fork() itself in a
>    separate patchset.)
> - ease of transition for userspace from clone() to clone3()
>   This very much means that we do *not* remove functionality that userspace
>   currently relies on as the latter is a good way of creating a syscall
>   that won't be adopted.
> - do not try to be clever or complex: keep clone3() as dumb as possible
> 
> In accordance with Linus' suggestions, clone3() has the following signature:
> 
> /* uapi */
> struct clone_args {
>         __aligned_u64 flags;
>         __aligned_u64 pidfd;
>         __aligned_u64 parent_tidptr;
>         __aligned_u64 child_tidptr;
>         __aligned_u64 stack;
>         __aligned_u64 stack_size;
>         __aligned_u64 tls;
> };
> 
> /* kernel internal */
> struct kernel_clone_args {
>         u64 flags;
>         int __user *pidfd;
>         int __user *parent_tidptr;
>         int __user *child_tidptr;
>         unsigned long stack;
>         unsigned long stack_size;
>         unsigned long tls;
> };
> 
> long sys_clone3(struct clone_args __user *uargs, size_t size)
> 
> clone3() cleanly supports all of the supported flags from clone() and thus
> all legacy workloads.
> The advantage of sticking close to the old clone() is the low cost for
> userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> pthreads) are based on the clone() syscall. With the new clone3() syscall
> supporting all of the old workloads and opening up the ability to add new
> features should make switching to it for userspace more appealing. In
> essence, glibc can just write a simple wrapper to switch from clone() to
> clone3().
> 
> There has been some interest in this patchset already. We have received a
> patch from the CRIU corner for clone3() that would set the PID/TID of a
> restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
> 
> /* References */
> [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
> [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
> [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
> [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
> [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
> [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
> [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
> [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Adrian Reber <adrian@lisas.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: linux-api@vger.kernel.org
> --
> v1:
> - Linus Torvalds <torvalds@linux-foundation.org>:
>   - redesign based on Linus proposal
>   - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
> - Arnd Bergmann <arnd@arndb.de>:
>   - use a single copy_from_user() instead of multiple get_user() calls
>     since the latter have a constant overhead on some architectures
>   - a range of other tweaks and suggestions
> ---
>  arch/x86/ia32/sys_ia32.c   |  11 ++-
>  include/linux/sched/task.h |  13 ++-
>  include/linux/syscalls.h   |   6 ++
>  include/uapi/linux/sched.h |  16 ++++
>  kernel/fork.c              | 176 ++++++++++++++++++++++++++++---------
>  5 files changed, 177 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
> index a43212036257..1fd5c4594a8e 100644
> --- a/arch/x86/ia32/sys_ia32.c
> +++ b/arch/x86/ia32/sys_ia32.c
> @@ -237,6 +237,13 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
>  		       unsigned long, newsp, int __user *, parent_tidptr,
>  		       unsigned long, tls_val, int __user *, child_tidptr)
>  {
> -	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
> -			tls_val);
> +	struct kernel_clone_args args = {
> +		.flags = clone_flags,
> +		.stack = newsp,
> +		.parent_tidptr = parent_tidptr,
> +		.tls = tls_val,
> +		.child_tidptr = child_tidptr,
> +	};
> +
> +	return _do_fork(&args);
>  }
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index f1227f2c38a4..818696626a0d 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -8,11 +8,22 @@
>   */
>  
>  #include <linux/sched.h>
> +#include <linux/compiler_types.h>
>  
>  struct task_struct;
>  struct rusage;
>  union thread_union;
>  
> +struct kernel_clone_args {
> +	u64 flags;
> +	int __user *pidfd;
> +	int __user *parent_tidptr;
> +	int __user *child_tidptr;
> +	unsigned long stack;
> +	unsigned long stack_size;
> +	unsigned long tls;
> +};
> +
>  /*
>   * This serializes "schedule()" and also protects
>   * the run-queue from deletions/modifications (but
> @@ -73,7 +84,7 @@ extern void do_group_exit(int);
>  extern void exit_files(struct task_struct *);
>  extern void exit_itimers(struct signal_struct *);
>  
> -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> +extern long _do_fork(struct kernel_clone_args *kargs);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>  struct task_struct *fork_idle(int);
>  struct mm_struct *copy_init_mm(void);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..254db24af0cd 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -70,6 +70,7 @@ struct sigaltstack;
>  struct rseq;
>  union bpf_attr;
>  struct io_uring_params;
> +struct clone_args;
>  
>  #include <linux/types.h>
>  #include <linux/aio_abi.h>
> @@ -852,6 +853,11 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
>  	       int __user *, unsigned long);
>  #endif
>  #endif
> +
> +#ifdef __ARCH_WANT_SYS_CLONE
> +asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
> +#endif
> +
>  asmlinkage long sys_execve(const char __user *filename,
>  		const char __user *const __user *argv,
>  		const char __user *const __user *envp);
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index ed4ee170bee2..d426ef382a6e 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -2,6 +2,8 @@
>  #ifndef _UAPI_LINUX_SCHED_H
>  #define _UAPI_LINUX_SCHED_H
>  
> +#include <linux/types.h>
> +
>  /*
>   * cloning flags:
>   */
> @@ -30,6 +32,20 @@
>  #define CLONE_NEWPID		0x20000000	/* New pid namespace */
>  #define CLONE_NEWNET		0x40000000	/* New network namespace */
>  #define CLONE_IO		0x80000000	/* Clone io context */
> +#define CLONE_MAX ~0U
> +
> +/*
> + * Arguments for the clone3 syscall
> + */
> +struct clone_args {
> +	__aligned_u64 flags;
> +	__aligned_u64 pidfd;
> +	__aligned_u64 parent_tidptr;
> +	__aligned_u64 child_tidptr;
> +	__aligned_u64 stack;
> +	__aligned_u64 stack_size;
> +	__aligned_u64 tls;
> +};
>  
>  /*
>   * Scheduling policies
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b4cba953040a..6bc3e3d17150 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1760,19 +1760,19 @@ static __always_inline void delayed_free_task(struct task_struct *tsk)
>   * flags). The actual kick-off is left to the caller.
>   */
>  static __latent_entropy struct task_struct *copy_process(
> -					unsigned long clone_flags,
> -					unsigned long stack_start,
> -					unsigned long stack_size,
> -					int __user *parent_tidptr,
> -					int __user *child_tidptr,
>  					struct pid *pid,
>  					int trace,
> -					unsigned long tls,
> -					int node)
> +					int node,
> +					struct kernel_clone_args *args)
>  {
>  	int pidfd = -1, retval;
>  	struct task_struct *p;
>  	struct multiprocess_signals delayed;
> +	u64 clone_flags = args->flags;
> +	int __user *child_tidptr = args->child_tidptr;
> +	unsigned long tls = args->tls;
> +	unsigned long stack_start = args->stack;
> +	unsigned long stack_size = args->stack_size;
>  
>  	/*
>  	 * Don't allow sharing the root directory with processes in a different
> @@ -1821,27 +1821,12 @@ static __latent_entropy struct task_struct *copy_process(
>  	}
>  
>  	if (clone_flags & CLONE_PIDFD) {
> -		int reserved;
> -
>  		/*
> -		 * - CLONE_PARENT_SETTID is useless for pidfds and also
> -		 *   parent_tidptr is used to return pidfds.
>  		 * - CLONE_DETACHED is blocked so that we can potentially
>  		 *   reuse it later for CLONE_PIDFD.
>  		 * - CLONE_THREAD is blocked until someone really needs it.
>  		 */
> -		if (clone_flags &
> -		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
> -			return ERR_PTR(-EINVAL);
> -
> -		/*
> -		 * Verify that parent_tidptr is sane so we can potentially
> -		 * reuse it later.
> -		 */
> -		if (get_user(reserved, parent_tidptr))
> -			return ERR_PTR(-EFAULT);
> -
> -		if (reserved != 0)
> +		if (clone_flags & (CLONE_DETACHED | CLONE_THREAD))
>  			return ERR_PTR(-EINVAL);
>  	}
>  
> @@ -2062,7 +2047,7 @@ static __latent_entropy struct task_struct *copy_process(
>  			goto bad_fork_free_pid;
>  
>  		pidfd = retval;
> -		retval = put_user(pidfd, parent_tidptr);
> +		retval = put_user(pidfd, args->pidfd);
>  		if (retval)
>  			goto bad_fork_put_pidfd;
>  	}
> @@ -2313,8 +2298,11 @@ static inline void init_idle_pids(struct task_struct *idle)
>  struct task_struct *fork_idle(int cpu)
>  {
>  	struct task_struct *task;
> -	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
> -			    cpu_to_node(cpu));
> +	struct kernel_clone_args args = {
> +		.flags = CLONE_VM,
> +	};
> +
> +	task = copy_process(&init_struct_pid, 0, cpu_to_node(cpu), &args);
>  	if (!IS_ERR(task)) {
>  		init_idle_pids(task);
>  		init_idle(task, cpu);
> @@ -2334,18 +2322,15 @@ struct mm_struct *copy_init_mm(void)
>   * It copies the process, and if successful kick-starts
>   * it and waits for it to finish using the VM if required.
>   */
> -long _do_fork(unsigned long clone_flags,
> -	      unsigned long stack_start,
> -	      unsigned long stack_size,
> -	      int __user *parent_tidptr,
> -	      int __user *child_tidptr,
> -	      unsigned long tls)
> +long _do_fork(struct kernel_clone_args *args)
>  {
> +	u64 clone_flags = args->flags;
>  	struct completion vfork;
>  	struct pid *pid;
>  	struct task_struct *p;
>  	int trace = 0;
>  	long nr;
> +	int __user *parent_tidptr = args->parent_tidptr;
>  
>  	/*
>  	 * Determine whether and which event to report to ptracer.  When
> @@ -2365,8 +2350,7 @@ long _do_fork(unsigned long clone_flags,
>  			trace = 0;
>  	}
>  
> -	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
> -			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
> +	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
>  	add_latent_entropy();
>  
>  	if (IS_ERR(p))
> @@ -2414,8 +2398,15 @@ long do_fork(unsigned long clone_flags,
>  	      int __user *parent_tidptr,
>  	      int __user *child_tidptr)
>  {
> -	return _do_fork(clone_flags, stack_start, stack_size,
> -			parent_tidptr, child_tidptr, 0);
> +	struct kernel_clone_args args = {
> +		.flags = clone_flags,
> +		.stack = stack_start,
> +		.stack_size = stack_size,
> +		.parent_tidptr = parent_tidptr,
> +		.child_tidptr = child_tidptr,
> +	};
> +
> +	return _do_fork(&args);
>  }
>  #endif
>  
> @@ -2424,15 +2415,24 @@ long do_fork(unsigned long clone_flags,
>   */
>  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
>  {
> -	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> -		(unsigned long)arg, NULL, NULL, 0);
> +	struct kernel_clone_args args = {
> +		.flags = flags | CLONE_VM | CLONE_UNTRACED,
> +		.stack = (unsigned long)fn,
> +		.stack_size = (unsigned long)arg,
> +	};
> +
> +	return _do_fork(&args);
>  }
>  
>  #ifdef __ARCH_WANT_SYS_FORK
>  SYSCALL_DEFINE0(fork)
>  {
>  #ifdef CONFIG_MMU
> -	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
> +	struct kernel_clone_args args = {
> +		.flags = SIGCHLD,
> +	};
> +
> +	return _do_fork(&args);
>  #else
>  	/* can not support in nommu mode */
>  	return -EINVAL;
> @@ -2443,8 +2443,11 @@ SYSCALL_DEFINE0(fork)
>  #ifdef __ARCH_WANT_SYS_VFORK
>  SYSCALL_DEFINE0(vfork)
>  {
> -	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
> -			0, NULL, NULL, 0);
> +	struct kernel_clone_args args = {
> +		.flags = CLONE_VFORK | CLONE_VM | SIGCHLD,
> +	};
> +
> +	return _do_fork(&args);
>  }
>  #endif
>  
> @@ -2472,7 +2475,96 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
>  		 unsigned long, tls)
>  #endif
>  {
> -	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
> +	struct kernel_clone_args args = {
> +		.flags = clone_flags,
> +		.stack = newsp,
> +		.pidfd = parent_tidptr,
> +		.parent_tidptr = parent_tidptr,
> +		.tls = tls,
> +		.child_tidptr = child_tidptr,
> +	};
> +
> +	/* clone(CLONE_PIDFD) uses parent_tidptr to return a pidfd */
> +	if ((clone_flags & CLONE_PIDFD) && (clone_flags & CLONE_PARENT_SETTID))
> +		return -EINVAL;
> +
> +	return _do_fork(&args);
> +}
> +
> +static bool clone3_flags_valid(u64 flags)
> +{
> +	if (flags & CLONE_DETACHED)
> +		return false;
> +
> +	if (flags & ~CLONE_MAX)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> +				     struct clone_args __user *uargs,
> +				     size_t size)
> +{
> +	struct clone_args args;
> +
> +	if (unlikely(size > PAGE_SIZE))
> +		return -E2BIG;
> +
> +	if (unlikely(size < sizeof(struct clone_args)))
> +		return -EINVAL;
> +
> +	if (unlikely(!access_ok(uargs, size)))
> +		return -EFAULT;
> +
> +	if (size > sizeof(struct clone_args)) {
> +		unsigned char __user *addr;
> +		unsigned char __user *end;
> +		unsigned char val;
> +
> +		addr = (void __user *)uargs + sizeof(struct clone_args);
> +		end = (void __user *)uargs + size;
> +
> +		for (; addr < end; addr++) {
> +			if (get_user(val, addr))
> +				return -EFAULT;
> +			if (val)
> +				return -E2BIG;
> +		}
> +
> +		size = sizeof(struct clone_args);
> +	}
> +
> +	if (copy_from_user(&args, uargs, size))
> +		return -EFAULT;
> +
> +	if (!clone3_flags_valid(args.flags))
> +		return -EINVAL;
> +
> +	memset(kargs, 0, sizeof(*kargs));
> +
> +	kargs->flags = args.flags;
> +	kargs->child_tidptr = u64_to_user_ptr(args.child_tidptr);
> +	kargs->parent_tidptr = u64_to_user_ptr(args.parent_tidptr);
> +	kargs->pidfd = u64_to_user_ptr(args.pidfd);
> +	kargs->stack = args.stack;
> +	kargs->stack_size = args.stack_size;
> +	kargs->tls = args.tls;
> +
> +	return 0;
> +}
> +
> +SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> +{
> +	int err;
> +
> +	struct kernel_clone_args kargs;
> +
> +	err = copy_clone_args_from_user(&kargs, uargs, size);
> +	if (err)
> +		return err;
> +
> +	return _do_fork(&kargs);
>  }
>  #endif
>  
> -- 
> 2.21.0
> 
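
The size handling in the quoted copy_clone_args_from_user() can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct demo_args, DEMO_MAX_SIZE and demo_copy_args() are hypothetical names standing in for struct clone_args, PAGE_SIZE and the kernel helper, and memcpy() replaces the user-copy primitives.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "known" layout; stands in for struct clone_args. */
struct demo_args {
	unsigned long long flags;
	unsigned long long stack;
};

#define DEMO_MAX_SIZE 4096	/* hypothetical stand-in for PAGE_SIZE */

/*
 * Copy a caller-supplied structure of 'size' bytes into 'out'.
 * Follows the order of checks in the quoted patch:
 *   - larger than the upper bound        -> -E2BIG
 *   - smaller than the known layout      -> -EINVAL
 *   - non-zero bytes past the known part -> -E2BIG
 * Only the known prefix is copied.
 */
static int demo_copy_args(struct demo_args *out, const void *src, size_t size)
{
	const unsigned char *bytes = src;
	size_t i;

	assert(out != NULL && src != NULL);

	if (size > DEMO_MAX_SIZE)
		return -E2BIG;
	if (size < sizeof(*out))
		return -EINVAL;

	/* A newer caller may pass a bigger struct, but only if the
	 * fields this version does not know about are all zero. */
	for (i = sizeof(*out); i < size; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	memcpy(out, src, sizeof(*out));
	return 0;
}
```

The trailing-zero rule is what lets a larger, newer structure be accepted as long as the extra fields are unused.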

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Paul Moore @ 2019-05-29 22:26 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, Neil Horman
In-Reply-To: <CAHC9VhQYPF2ma_W+hySbQtfTztf=K1LTFnxnyVK0y9VYxj-K=w@mail.gmail.com>

On Mon, Apr 22, 2019 at 9:49 AM Paul Moore <paul@paul-moore.com> wrote:
> On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> > > Implement kernel audit container identifier.
> >
> > I'm sorry, I've lost track of this, where have we landed on it? Are we good for
> > inclusion?
>
> I haven't finished going through this latest revision, but unless
> Richard made any significant changes outside of the feedback from the
> v5 patchset I'm guessing we are "close".
>
> Based on discussions Richard and I had some time ago, I have always
> envisioned the plan as being get the kernel patchset, tests, docs
> ready (which Richard has been doing) and then run the actual
> implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> to make sure the actual implementation is sane from their perspective.
> They've already seen the design, so I'm not expecting any real
> surprises here, but sometimes opinions change when they have actual
> code in front of them to play with and review.
>
> Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> whatever additional testing we can do would be a big win.  I'm
> thinking I'll pull it into a separate branch in the audit tree
> (audit/working-container ?) and include that in my secnext kernels
> that I build/test on a regular basis; this is also a handy way to keep
> it based against the current audit/next branch.  If any changes are
> needed Richard can either choose to base those changes on audit/next or
> the separate audit container ID branch; that's up to him.  I've done
> this with other big changes in other trees, e.g. SELinux, and it has
> worked well to get some extra testing in and keep the patchset "merge
> ready" while others outside the subsystem look things over.

I just sent my feedback on the v6 patchset, and it's small: basically
three patches with "one-liner" changes needed.

Richard, it's your call on how you want to proceed from here.  You can
post a v7 incorporating the feedback, or since the tweaks are so
minor, you can post fixup patches; the former being more
comprehensive, the latter being quicker to review and digest.
Regardless of that, while we are waiting on a prototype from the
container folks, I think it would be good to pull this into a working
branch in the audit repo (as mentioned above), unless you would prefer
to keep it as a patchset on the mailing list?  If you want to go with
the working branch approach, I'll keep the branch fresh and (re)based
against audit/next and if we notice any problems you can just submit
fixes against that branch (depending on the issue they can be fixup
patches, or proper patches).  My hope is that this will enable the
process to move quicker as we get near the finish line.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Tycho Andersen @ 2019-05-29 22:28 UTC (permalink / raw)
  To: Paul Moore
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, nhorman
In-Reply-To: <CAHC9VhSF3AjErX37+eeusJ7+XRw8yuPsmqBTRwc9EVoRBh_3Tw@mail.gmail.com>

On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > >
> > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > > It is not permitted to unset the audit container identifier.
> > > > > A child inherits its parent's audit container identifier.
> > > >
> > > > ...
> > > >
> > > > >  /**
> > > > > + * audit_set_contid - set current task's audit contid
> > > > > + * @contid: contid value
> > > > > + *
> > > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > > + *
> > > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > > + */
> > > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > > +{
> > > > > +     u64 oldcontid;
> > > > > +     int rc = 0;
> > > > > +     struct audit_buffer *ab;
> > > > > +     uid_t uid;
> > > > > +     struct tty_struct *tty;
> > > > > +     char comm[sizeof(current->comm)];
> > > > > +
> > > > > +     task_lock(task);
> > > > > +     /* Can't set if audit disabled */
> > > > > +     if (!task->audit) {
> > > > > +             task_unlock(task);
> > > > > +             return -ENOPROTOOPT;
> > > > > +     }
> > > > > +     oldcontid = audit_get_contid(task);
> > > > > +     read_lock(&tasklist_lock);
> > > > > +     /* Don't allow the audit containerid to be unset */
> > > > > +     if (!audit_contid_valid(contid))
> > > > > +             rc = -EINVAL;
> > > > > +     /* if we don't have caps, reject */
> > > > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > > > +             rc = -EPERM;
> > > > > +     /* if task has children or is not single-threaded, deny */
> > > > > +     else if (!list_empty(&task->children))
> > > > > +             rc = -EBUSY;
> > > > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > > +             rc = -EALREADY;
> > > > > +     read_unlock(&tasklist_lock);
> > > > > +     if (!rc)
> > > > > +             task->audit->contid = contid;
> > > > > +     task_unlock(task);
> > > > > +
> > > > > +     if (!audit_enabled)
> > > > > +             return rc;
> > > >
> > > > ...but it is allowed to change it (assuming
> > > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > > immediately useful since we still live in the world of majority
> > > > privileged containers if we didn't allow changing it, in addition to
> > > > un-setting it.
> > >
> > > The idea is that only container orchestrators should be able to
> > > set/modify the audit container ID, and since setting the audit
> > > container ID can have a significant effect on the records captured
> > > (and their routing to multiple daemons when we get there) modifying
> > > the audit container ID is akin to modifying the audit configuration
> > > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > > is that you would only change the audit container ID from one
> > > set/inherited value to another if you were nesting containers, in
> > > which case the nested container orchestrator would need to be granted
> > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > compromise).
> >
> > But then don't you want some kind of ns_capable() instead (probably
> > not the obvious one, though...)? With capable(), you can't really nest
> > using the audit-id and user namespaces together.
> 
> You want capable() and not ns_capable() because you want to ensure
> that the orchestrator has the rights in the init_ns as changes to the
> audit container ID could have an auditing impact that spans the entire
> system.

Ok but,

> > > The current thinking
> > > is that you would only change the audit container ID from one
> > > set/inherited value to another if you were nesting containers, in
> > > which case the nested container orchestrator would need to be granted
> > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > compromise).

won't work in user namespaced containers, because they will never be
capable(CAP_AUDIT_CONTROL); so I don't think this will work for
nesting as is. But maybe nobody cares :)
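
A toy userspace model of the point above, loosely following the shape of the kernel's capability check but not copied from it. All demo_* names are hypothetical, and the model assumes that capable() is equivalent to checking against the initial user namespace:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user namespace. */
struct demo_userns {
	const struct demo_userns *parent;	/* NULL for the initial namespace */
	unsigned int owner_uid;			/* uid that created this namespace */
};

/* Hypothetical stand-in for task credentials. */
struct demo_cred {
	const struct demo_userns *ns;	/* namespace the task lives in */
	unsigned int euid;
	unsigned long caps;		/* capabilities effective in cred->ns */
};

#define DEMO_CAP_AUDIT_CONTROL (1ul << 0)

static const struct demo_userns demo_init_ns = { NULL, 0 };

/* Does 'cred' hold 'cap' with respect to namespace 'target'? */
static bool demo_ns_capable(const struct demo_cred *cred,
			    const struct demo_userns *target,
			    unsigned long cap)
{
	const struct demo_userns *ns = target;

	assert(cred != NULL && target != NULL);

	for (;;) {
		/* Same namespace: the effective set decides. */
		if (ns == cred->ns)
			return (cred->caps & cap) != 0;
		/* Walked up to the initial namespace without a match. */
		if (ns->parent == NULL)
			return false;
		/* The creator of a namespace has all caps inside it. */
		if (ns->parent == cred->ns && ns->owner_uid == cred->euid)
			return true;
		ns = ns->parent;
	}
}

/* Modelled as a check against the initial namespace only. */
static bool demo_capable(const struct demo_cred *cred, unsigned long cap)
{
	return demo_ns_capable(cred, &demo_init_ns, cap);
}
```

In this model a task whose credentials live in a child namespace fails demo_capable() no matter which bits are set in its own namespace, which is the nesting limitation described above.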

Tycho

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-05-29 22:39 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190529222835.GD8959@cisco>

On Wed, May 29, 2019 at 6:28 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> > On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > >
> > > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > > > It is not permitted to unset the audit container identifier.
> > > > > > A child inherits its parent's audit container identifier.
> > > > >
> > > > > ...
> > > > >
> > > > > >  /**
> > > > > > + * audit_set_contid - set current task's audit contid
> > > > > > + * @contid: contid value
> > > > > > + *
> > > > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > > > + *
> > > > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > > > + */
> > > > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > > > +{
> > > > > > +     u64 oldcontid;
> > > > > > +     int rc = 0;
> > > > > > +     struct audit_buffer *ab;
> > > > > > +     uid_t uid;
> > > > > > +     struct tty_struct *tty;
> > > > > > +     char comm[sizeof(current->comm)];
> > > > > > +
> > > > > > +     task_lock(task);
> > > > > > +     /* Can't set if audit disabled */
> > > > > > +     if (!task->audit) {
> > > > > > +             task_unlock(task);
> > > > > > +             return -ENOPROTOOPT;
> > > > > > +     }
> > > > > > +     oldcontid = audit_get_contid(task);
> > > > > > +     read_lock(&tasklist_lock);
> > > > > > +     /* Don't allow the audit containerid to be unset */
> > > > > > +     if (!audit_contid_valid(contid))
> > > > > > +             rc = -EINVAL;
> > > > > > +     /* if we don't have caps, reject */
> > > > > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > > > > +             rc = -EPERM;
> > > > > > +     /* if task has children or is not single-threaded, deny */
> > > > > > +     else if (!list_empty(&task->children))
> > > > > > +             rc = -EBUSY;
> > > > > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > > > +             rc = -EALREADY;
> > > > > > +     read_unlock(&tasklist_lock);
> > > > > > +     if (!rc)
> > > > > > +             task->audit->contid = contid;
> > > > > > +     task_unlock(task);
> > > > > > +
> > > > > > +     if (!audit_enabled)
> > > > > > +             return rc;
> > > > >
> > > > > ...but it is allowed to change it (assuming
> > > > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > > > immediately useful since we still live in the world of majority
> > > > > privileged containers if we didn't allow changing it, in addition to
> > > > > un-setting it.
> > > >
> > > > The idea is that only container orchestrators should be able to
> > > > set/modify the audit container ID, and since setting the audit
> > > > container ID can have a significant effect on the records captured
> > > > (and their routing to multiple daemons when we get there) modifying
> > > > the audit container ID is akin to modifying the audit configuration
> > > > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > > > is that you would only change the audit container ID from one
> > > > set/inherited value to another if you were nesting containers, in
> > > > which case the nested container orchestrator would need to be granted
> > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > compromise).
> > >
> > > But then don't you want some kind of ns_capable() instead (probably
> > > not the obvious one, though...)? With capable(), you can't really nest
> > > using the audit-id and user namespaces together.
> >
> > You want capable() and not ns_capable() because you want to ensure
> > that the orchestrator has the rights in the init_ns as changes to the
> > audit container ID could have an auditing impact that spans the entire
> > system.
>
> Ok but,
>
> > > > The current thinking
> > > > is that you would only change the audit container ID from one
> > > > set/inherited value to another if you were nesting containers, in
> > > > which case the nested container orchestrator would need to be granted
> > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > compromise).
>
> won't work in user namespaced containers, because they will never be
> capable(CAP_AUDIT_CONTROL); so I don't think this will work for
> nesting as is. But maybe nobody cares :)

That's fun :)

To be honest, I've never been a big fan of supporting nested
containers from an audit perspective, so I'm not really too upset
about this.  The k8s/cri-o folks seem okay with this, or at least I
haven't heard any objections; lxc folks, what do you have to say?

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Greg KH @ 2019-05-29 23:09 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <31936.1559146000@warthog.procyon.org.uk>

On Wed, May 29, 2019 at 05:06:40PM +0100, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > > kref_put() could potentially add an unnecessary extra stack frame and would
> > > seem to be best avoided, though an optimising compiler ought to be able to
> > > inline if it can.
> > 
> > If kref_put() is on your fast path, you have worse problems (kfree isn't
> > fast, right?)
> > 
> > Anyway, it's an inline function, how can it add an extra stack frame?
> 
> The call to the function pointer.  Hopefully the compiler will optimise that
> away for an inlineable function.

The function pointer only gets called for the last "put", and then kfree
will be called so you should not have to worry about speed/stack frames
at that point in time.
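
To make the "only the last put calls the function pointer" point concrete, here is a minimal userspace sketch using C11 atomics. It is not the kernel's kref; struct demo_ref, demo_ref_put() and friends are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a kref-like counter. */
struct demo_ref {
	atomic_uint count;
};

static void demo_ref_init(struct demo_ref *r)
{
	atomic_init(&r->count, 1);
}

static void demo_ref_get(struct demo_ref *r)
{
	atomic_fetch_add_explicit(&r->count, 1, memory_order_relaxed);
}

/*
 * Drop one reference.  The release callback is invoked only when the
 * count reaches zero, so the indirect call sits on the teardown path,
 * not on every put.  Returns 1 if release() ran, 0 otherwise.
 */
static int demo_ref_put(struct demo_ref *r,
			void (*release)(struct demo_ref *))
{
	assert(release != NULL);

	if (atomic_fetch_sub_explicit(&r->count, 1,
				      memory_order_acq_rel) == 1) {
		release(r);
		return 1;
	}
	return 0;
}

/* Example object embedding the counter. */
struct demo_obj {
	struct demo_ref ref;
	int released;
};

static void demo_obj_release(struct demo_ref *r)
{
	/* Recover the containing object from the embedded member. */
	struct demo_obj *obj =
		(struct demo_obj *)((char *)r - offsetof(struct demo_obj, ref));

	obj->released = 1;	/* a real object would be freed here */
}
```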

> > > Are you now on the convert all refcounts to krefs path?
> > 
> > "now"?  Remember, I wrote kref all those years ago,
> 
> Yes - and I thought it wasn't a good idea at the time.  But this is the first
> time you've mentioned it to me, let alone pushed to change to it, that I
> recall.

I bring up using a kref any time I see a usage that could use it as it
makes it easier for people to understand and "know" you are doing your
reference counting for your object "correctly".  It's an abstraction
that is used to make it easier for us developers to understand.
Otherwise you have to hand-roll the same logic here.  Yes, refcounts
have made it easier to do it on your own (which was their goal), but you
still don't have to do it "on your own".

Anyway, I'll not push the issue here, if you want to stick to a
refcount_t, that's enough for now.  We can worry about changing this
later after you have debugged all the corner conditions :)

> > everyone should use
> > it.  It saves us having to audit the same pattern over and over again.
> > And, even nicer, it uses a refcount now, and as you are trying to
> > reference count an object, it is exactly what this was written for.
> > 
> > So yes, I do think it should be used here, unless it is deemed to not
> > fit the pattern/usage model.
> 
> kref_put() enforces a very specific destructor signature.  I know of places
> where that doesn't work because the destructor takes more than one argument
> (granted that this is not the case here).  So why does kref_put() exist at
> all?  Why not kref_dec_and_test()?

The destructor only takes one object pointer as you are finally freeing
that object.  What more do you need/want to "know" at that point in
time?

What would kref_dec_and_test() be needed for?

> Why doesn't refcount_t get merged into kref, or vice versa?  Having both would
> seem redundant.

kref uses refcount_t and provides a different functionality on top of
it.  Not all uses of a refcount in the kernel are for object lifecycle
reference counting, as you know :)

> Mind you, I've been gradually reverting atomic_t-to-refcount_t conversions
> because it seems I'm not allowed refcount_inc/dec_return() and I want to get
> at the point refcount for tracing purposes.

That's not good, we should address that independently as you are losing
functionality/protection when doing that.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Greg KH @ 2019-05-29 23:11 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <31936.1559146000@warthog.procyon.org.uk>

On Wed, May 29, 2019 at 05:06:40PM +0100, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
> > And how does the tracing and perf ring buffers do this without needing
> > volatile?  Why not use the same type of interface they provide, as it's
> > always good to share code that has already had all of the nasty corner
> > cases worked out.
> 
> I've no idea how trace does it - or even where - or even if.  As far as I can
> see, grepping for mmap in kernel/trace/*, there's no mmap support.
> 
> Reading Documentation/trace/ring-buffer-design.txt the trace subsystem has
> some sort of transient page fifo which is a lot more complicated than what I
> want and doesn't look like it'll be mmap'able.
> 
> Looking at the perf ring buffer, there appears to be a missing barrier in
> perf_aux_output_end():
> 
> 	rb->user_page->aux_head = rb->aux_head;
> 
> should be:
> 
> 	smp_store_release(&rb->user_page->aux_head, rb->aux_head);
> 
> It should also be using smp_load_acquire().  See
> Documentation/core-api/circular-buffers.rst
> 
> And a (partial) patch has been proposed: https://lkml.org/lkml/2018/5/10/249

So, if that's all that needs to be fixed, can you use the same
buffer/code if that patch is merged?
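
For reference, the release/acquire pairing discussed in the quoted text can be sketched with C11 atomics as a single-producer, single-consumer ring. This is a userspace illustration only, not the perf or watch-queue code; struct demo_ring and the demo_ring_* helpers are hypothetical, with memory_order_release and memory_order_acquire standing in for smp_store_release() and smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_RING_SIZE 8u	/* must be a power of two */

struct demo_ring {
	atomic_uint head;	/* advanced by the producer only */
	atomic_uint tail;	/* advanced by the consumer only */
	int slots[DEMO_RING_SIZE];
};

static void demo_ring_init(struct demo_ring *rb)
{
	assert(rb != NULL);
	atomic_init(&rb->head, 0);
	atomic_init(&rb->tail, 0);
}

/* Producer side.  Returns 1 on success, 0 if the ring is full. */
static int demo_ring_push(struct demo_ring *rb, int value)
{
	unsigned int head = atomic_load_explicit(&rb->head,
						 memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&rb->tail,
						 memory_order_acquire);

	if (head - tail == DEMO_RING_SIZE)
		return 0;

	rb->slots[head & (DEMO_RING_SIZE - 1)] = value;
	/* Release: the slot contents become visible before the new head. */
	atomic_store_explicit(&rb->head, head + 1, memory_order_release);
	return 1;
}

/* Consumer side.  Returns 1 on success, 0 if the ring is empty. */
static int demo_ring_pop(struct demo_ring *rb, int *value)
{
	unsigned int tail = atomic_load_explicit(&rb->tail,
						 memory_order_relaxed);
	/* Acquire: pairs with the producer's release store of head. */
	unsigned int head = atomic_load_explicit(&rb->head,
						 memory_order_acquire);

	if (head == tail)
		return 0;

	*value = rb->slots[tail & (DEMO_RING_SIZE - 1)];
	/* Release: the slot read completes before the slot is handed back. */
	atomic_store_explicit(&rb->tail, tail + 1, memory_order_release);
	return 1;
}
```

A plain store to head, as in the quoted perf_aux_output_end() line, would give the consumer no ordering guarantee between the index and the data it guards.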

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Andy Lutomirski @ 2019-05-29 23:12 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: David Howells, Jann Horn, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list
In-Reply-To: <058f227c-71ab-a6f4-00bf-b8782b3b2956@schaufler-ca.com>



> On May 29, 2019, at 10:46 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>> 
>>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> 
>>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>>> Jann Horn <jannh@google.com> wrote:
>>>> 
>>>>>> +void post_mount_notification(struct mount *changed,
>>>>>> +                            struct mount_notification *notify)
>>>>>> +{
>>>>>> +       const struct cred *cred = current_cred();
>>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>>> come from all sorts of places? For example, umount_mnt() from
>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>>> happen pretty much anywhere depending on where the last reference gets
>>>>> dropped?
>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>>> Casey?
>>> You need to identify the credential of the subject that triggered
>>> the event. If it isn't current_cred(), the cred needs to be passed
>>> in to post_mount_notification(), or derived by some other means.
>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
> 
> There are two filesystems, "dot" and "dash". I am not allowed
> to communicate with Fred on the system, and all precautions have
> been taken to ensure I cannot. Fred asks for notifications on
> all mount activity. I perform actions that result in notifications
> on "dot" and "dash". Fred receives notifications and interprets
> them using Morse code. This is not OK. If Wilma, who *is* allowed
> to communicate with Fred, does the same actions, he should be
> allowed to get the messages via Morse.

Under this scenario, Fred should not be allowed to enable these watches. If you give yourself and Fred unconstrained access to the same FS, then you can communicate.

> 
> Other security modelers may disagree. The models they produce
> are going to be *very* complicated and will introduce agents and
> intermediate objects to justify Fred's reception of an event as
> a read operation.

I disagree. They’ll model the watch as something to prevent if they want to restrict communication.

> 
>> (And receiver means whoever subscribed, presumably, not whoever called read() or mmap().)
> 
> The receiver is the process that gets the event. There may
> be more than one receiver, and the receivers may have different
> credentials. Each needs to be checked separately.

I think it’s a bit crazy to have the same event queue with two readers who read different things.

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Casey Schaufler @ 2019-05-29 23:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Jann Horn, Al Viro, raven, linux-fsdevel,
	Linux API, linux-block, keyrings, linux-security-module,
	kernel list, casey
In-Reply-To: <2FF92095-E5B1-4811-A7F8-B7D4C32F86DD@amacapital.net>

On 5/29/2019 4:12 PM, Andy Lutomirski wrote:
>
>> On May 29, 2019, at 10:46 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 5/29/2019 10:13 AM, Andy Lutomirski wrote:
>>>
>>>>> On May 29, 2019, at 8:53 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>
>>>>> On 5/29/2019 4:00 AM, David Howells wrote:
>>>>> Jann Horn <jannh@google.com> wrote:
>>>>>
>>>>>>> +void post_mount_notification(struct mount *changed,
>>>>>>> +                            struct mount_notification *notify)
>>>>>>> +{
>>>>>>> +       const struct cred *cred = current_cred();
>>>>>> This current_cred() looks bogus to me. Can't mount topology changes
>>>>>> come from all sorts of places? For example, umount_mnt() from
>>>>>> umount_tree() from dissolve_on_fput() from __fput(), which could
>>>>>> happen pretty much anywhere depending on where the last reference gets
>>>>>> dropped?
>>>>> IIRC, that's what Casey argued is the right thing to do from a security PoV.
>>>>> Casey?
>>>> You need to identify the credential of the subject that triggered
>>>> the event. If it isn't current_cred(), the cred needs to be passed
>>>> in to post_mount_notification(), or derived by some other means.
>>> Taking a step back, why do we care who triggered the event?  It seems to me that we should care whether the event happened and whether the *receiver* is permitted to know that.
>> There are two filesystems, "dot" and "dash". I am not allowed
>> to communicate with Fred on the system, and all precautions have
>> been taken to ensure I cannot. Fred asks for notifications on
>> all mount activity. I perform actions that result in notifications
>> on "dot" and "dash". Fred receives notifications and interprets
>> them using Morse code. This is not OK. If Wilma, who *is* allowed
>> to communicate with Fred, does the same actions, he should be
>> allowed to get the messages via Morse.
> Under this scenario, Fred should not be allowed to enable these watches. If you give yourself and Fred unconstrained access to the same FS, then you can communicate.

How are you going to determine at the time Fred tries to enable the watches
that I am going to do something that will trigger them? I'm not saying it isn't
possible, I'm curious how you would propose doing it. If you deny Fred the ability
to set watches because it is possible for me to trigger them, he can't use them
to get information from Wilma, either.

>
>> Other security modelers may disagree. The models they produce
>> are going to be *very* complicated and will introduce agents and
>> intermediate objects to justify Fred's reception of an event as
>> a read operation.
> I disagree. They’ll model the watch as something to prevent if they want to restrict communication.

Sorry, but that isn't sufficiently detailed to be meaningful.

>>> (And receiver means whoever subscribed, presumably, not whoever called read() or mmap().)
>> The receiver is the process that gets the event. There may
>> be more than one receiver, and the receivers may have different
>> credentials. Each needs to be checked separately.
> I think it’s a bit crazy to have the same event queue with two readers who read different things.

Look at killpg(3).

The process that creates the event has to be involved in the
access decision. Otherwise you have an uncontrolled data channel.
When the receiver reads the event queue it knows nothing about the
sender, and hence cannot make the decision unless the credential of
the sender is kept with the event message, and used when the
receiver tries to access it. I don't think that would work well with
the mechanism as designed.
 

^ permalink raw reply

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Minchan Kim @ 2019-05-30  2:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190529103352.GD18589@dhcp22.suse.cz>

On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > [Cc linux-api]
> > > > > > > >
> > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > Currently, the process_madvise syscall works on only one address
> > > > > > > > > > range, so the user has to call the syscall several times to give
> > > > > > > > > > hints for multiple address ranges.
> > > > > > > >
> > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > >
> > > > > > > > We easily have 2000+ VMAs, so the overhead is not trivial. I will
> > > > > > > > include numbers in the description at respin.
> > > > > >
> > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > own (wrt. the syscall entry/exit).
> > > > >
> > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > If the range needed to be covered is a lot, vector operation makes sense
> > > > > to me.
> > > >
> > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > that the benefit is not really clear so far. If you want to push it
> > > > through then you should better get some supporting data.
> > >
> > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > ranges on ARM64 modern device. Even though I saw 15% improvement but
> > > > absolute gain is just 1ms so I don't think it's worth to support.
> > > I will drop vector support at next revision.
> > 
> > Please do keep the vector support. Absolute timing is misleading,
> > since in a tight loop, you're not going to contend on mmap_sem. We've
> > seen tons of improvements in things like camera start come from
> > coalescing mprotect calls, with the gains coming from taking and
> > releasing various locks a lot less often and bouncing around less on
> > the contended lock paths. Raw throughput doesn't tell the whole story,
> > especially on mobile.
> 
> This will always be a double edge sword. Taking a lock for longer can
> improve a throughput of a single call but it would make a latency for
> anybody contending on the lock much worse.
> 
> Besides that, please do not overcomplicate the thing from the early
> beginning please. Let's start with a simple and well defined remote
> madvise alternative first and build a vector API on top with some
> numbers based on _real_ workloads.

At first, I didn't think about atomicity with respect to address range
races because MADV_COLD/PAGEOUT is not critical for the race.
However, you raised the atomicity issue because people could easily
extend the hints to destructive ones. I agree with that, and that's why
we discussed how to guard against the race, and Daniel came up with a
good idea.

  - vma configuration seq number via process_getinfo(2).

We discussed the race issue without _real_ workloads/requests because
it's common sense that people might extend the syscall later.

The same applies here. For the current workload, we don't need vector
support from a performance point of view, based on my experiment.
However, it was a rather limited experiment. Some configurations might
have 10000+ vmas or a really slow CPU.
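
For reference, the per-range approach boils down to a loop like the one
below. This is only a minimal userspace sketch against the existing
madvise(2), acting on the caller's own address space; the helper names
make_page_ranges() and advise_ranges() are made up for illustration and
are not part of this series. A vector interface would take the same
array in one call instead of vlen separate syscalls.

```c
#define _DEFAULT_SOURCE		/* for madvise() and MADV_* in glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Illustrative helper (not from the patch set): describe every second
 * page of a mapping starting at base as its own one-page range, so the
 * ranges are not contiguous. vec must have room for nranges entries.
 */
static void make_page_ranges(void *base, size_t nranges, struct iovec *vec)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t i;

	assert(base != NULL && vec != NULL);
	for (i = 0; i < nranges; i++) {
		vec[i].iov_base = (char *)base + 2 * i * page;
		vec[i].iov_len = page;
	}
}

/*
 * Illustrative helper: give the same hint to each range with one
 * madvise(2) call per range. Returns how many ranges were advised; on a
 * short count, errno holds the error from the failing call.
 */
static size_t advise_ranges(const struct iovec *vec, size_t vlen, int advice)
{
	size_t i;

	for (i = 0; i < vlen; i++) {
		if (madvise(vec[i].iov_base, vec[i].iov_len, advice) != 0)
			break;
	}
	return i;
}
```

Each iteration enters and leaves the kernel on its own, which is the
overhead the 1000-call measurement was looking at.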

Furthermore, I want to have vector support because of the atomicity
issue, if it's really one we should consider.
With vector support in the API and the vma configuration sequence number
from Daniel, we could support atomicity of address range operations.
However, since we are not introducing the vector at this moment, we will
need to introduce *another syscall* later to be able to handle multiple
ranges all at once atomically, if that's okay.
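
The sequence number idea can be modelled in a few lines of plain C.
This is a toy sketch only: struct as_model, as_snapshot(),
as_layout_changed() and as_hint_if_unchanged() are hypothetical names,
the -EAGAIN return is an assumption, and nothing here is kernel code or
an existing interface (process_getinfo(2) itself is only a proposal in
this thread).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a target's address space; not a kernel structure. */
struct as_model {
	uint64_t seq;		/* bumped on every layout change */
	size_t hinted_bytes;	/* total length of ranges hinted so far */
};

struct hint_range {
	uintptr_t start;
	size_t len;
};

/* What the monitor would read before deciding which ranges to hint. */
static uint64_t as_snapshot(const struct as_model *as)
{
	return as->seq;
}

/* Stand-in for mmap/munmap/mprotect etc. happening in the target. */
static void as_layout_changed(struct as_model *as)
{
	as->seq++;
}

/*
 * Apply all ranges only if the layout is still the one the caller
 * looked at; otherwise apply nothing and return -EAGAIN so the caller
 * can re-read the layout and retry.
 */
static int as_hint_if_unchanged(struct as_model *as, uint64_t seen_seq,
				const struct hint_range *r, size_t n)
{
	size_t i;

	assert(as != NULL && (r != NULL || n == 0));
	if (as->seq != seen_seq)
		return -EAGAIN;
	for (i = 0; i < n; i++)
		as->hinted_bytes += r[i].len;
	return 0;
}
```

The point of passing all ranges together is that a single sequence check
covers the whole batch; with one call per range, the layout could change
between calls. A real implementation would presumably have to do the
check and apply the hints under the same lock.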

Another thought:
Maybe we could extend the address range batch syscall to cover other MM
syscalls like mmap/munmap/madvise/mprotect and so on, because there
are multiple users that would benefit from this general batching
mechanism.

^ permalink raw reply

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Michal Hocko @ 2019-05-30  6:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190530021748.GE229459@google.com>

On Thu 30-05-19 11:17:48, Minchan Kim wrote:
> On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote:
> > On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > > > [Cc linux-api]
> > > > > > > > >
> > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > > > multiple address range.
> > > > > > > > >
> > > > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > > > >
> > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > > > with number in the description at respin.
> > > > > > >
> > > > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > > > own (wrt. the syscall entry/exit).
> > > > > >
> > > > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > > > > If the range needed to be covered is a lot, vector operation makes sense
> > > > > > to me.
> > > > >
> > > > > I am not saying that the vector API is wrong. All I am trying to say is
> > > > > that the benefit is not really clear so far. If you want to push it
> > > > > through then you should better get some supporting data.
> > > >
> > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > > > > ranges on ARM64 modern device. Even though I saw 15% improvement but
> > > > > absolute gain is just 1ms so I don't think it's worth to support.
> > > > I will drop vector support at next revision.
> > > 
> > > Please do keep the vector support. Absolute timing is misleading,
> > > since in a tight loop, you're not going to contend on mmap_sem. We've
> > > seen tons of improvements in things like camera start come from
> > > coalescing mprotect calls, with the gains coming from taking and
> > > releasing various locks a lot less often and bouncing around less on
> > > the contended lock paths. Raw throughput doesn't tell the whole story,
> > > especially on mobile.
> > 
> > This will always be a double edge sword. Taking a lock for longer can
> > improve a throughput of a single call but it would make a latency for
> > anybody contending on the lock much worse.
> > 
> > Besides that, please do not overcomplicate the thing from the early
> > beginning please. Let's start with a simple and well defined remote
> > madvise alternative first and build a vector API on top with some
> > numbers based on _real_ workloads.
> 
> At first, I didn't think about atomicity with respect to address range
> races because MADV_COLD/PAGEOUT is not critical for the race.
> However, you raised the atomicity issue because people could easily
> extend the hints to destructive ones. I agree with that, and that's why
> we discussed how to guard against the race, and Daniel came up with a
> good idea.

Just for clarification, I didn't really mean atomicity but rather
_consistency_ (essentially time-of-check to time-of-use consistency).
 
>   - vma configuration seq number via process_getinfo(2).
> 
> We discussed the race issue without _real_ workloads/requests because
> it's common sense that people might extend the syscall later.
> 
> The same applies here. For the current workload, we don't need vector
> support from a performance point of view, based on my experiment.
> However, it was a rather limited experiment. Some configurations might
> have 10000+ vmas or a really slow CPU.
> 
> Furthermore, I want to have vector support because of the atomicity
> issue, if it's really one we should consider.
> With vector support in the API and the vma configuration sequence number
> from Daniel, we could support atomicity of address range operations.

I am not sure what you mean here. Perform all ranges atomically wrt.
other address space modifications? If yes I am not sure we want that
semantic because it can cause really long stalls for other operations
but that is a discussion on its own and I would rather focus on a simple
interface first.

> However, since we are not introducing the vector at this moment, we will
> need to introduce *another syscall* later to be able to handle multiple
> ranges all at once atomically, if that's okay.

Agreed.

> Another thought:
> Maybe we could extend the address range batch syscall to cover other MM
> syscalls like mmap/munmap/madvise/mprotect and so on, because there
> are multiple users that would benefit from this general batching
> mechanism.

Again a discussion on its own ;)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply
