Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: David Howells @ 2019-05-29 11:00 UTC (permalink / raw)
  To: Jann Horn, casey
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list
In-Reply-To: <CAG48ez2rRh2_Kq_EGJs5k-ZBNffGs_Q=vkQdinorBgo58tbGpg@mail.gmail.com>

Jann Horn <jannh@google.com> wrote:

> > +void post_mount_notification(struct mount *changed,
> > +                            struct mount_notification *notify)
> > +{
> > +       const struct cred *cred = current_cred();
> 
> This current_cred() looks bogus to me. Can't mount topology changes
> come from all sorts of places? For example, umount_mnt() from
> umount_tree() from dissolve_on_fput() from __fput(), which could
> happen pretty much anywhere depending on where the last reference gets
> dropped?

IIRC, that's what Casey argued is the right thing to do from a security PoV.
Casey?

Maybe I should pass in NULL creds in the case that an event is being generated
because an object is being destroyed due to the last usage[*] being removed.

 [*] Usage, not ref - Superblocks are a bit weird in their accounting.

David

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: David Howells @ 2019-05-29 10:55 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list
In-Reply-To: <CAG48ez2rRh2_Kq_EGJs5k-ZBNffGs_Q=vkQdinorBgo58tbGpg@mail.gmail.com>

Jann Horn <jannh@google.com> wrote:

> > +                       /* Global root? */
> > +                       if (mnt != parent) {
> > +                               cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
> > +                               mnt = parent;
> > +                               cursor.mnt = &mnt->mnt;
> > +                               continue;
> > +                       }
> > +                       break;
> 
> (nit: this would look clearer if you inverted the condition and wrote
> it as "if (mnt == parent) break;", then you also wouldn't need that
> "continue" or the braces)

It does look better with the logic inverted, but you *do* still need the
continue.  After the if-statement, there is:

	cursor.dentry = cursor.dentry->d_parent;

which we need to skip.  It might make sense to move that into an
else-statement from an aesthetic point of view.

David

^ permalink raw reply

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Michal Hocko @ 2019-05-29 10:33 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <CAKOZuesK-8zrm1zua4dzqh4TEMivsZKiccySMvfBjOyDkg-MEw@mail.gmail.com>

On Wed 29-05-19 03:08:32, Daniel Colascione wrote:
> On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > > [Cc linux-api]
> > > > > > >
> > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > > so user should call the syscall several times to give hints to
> > > > > > > > multiple address range.
> > > > > > >
> > > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > > >
> > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > > with number in the description at respin.
> > > > >
> > > > > Does this really have to be a fast operation? I would expect the monitor
> > > > > is by no means a fast path. The system call overhead is not what it used
> > > > > to be, sigh, but still for something that is not a hot path it should be
> > > > > tolerable, especially when the whole operation is quite expensive on its
> > > > > own (wrt. the syscall entry/exit).
> > > >
> > > > What's different with process_vm_[readv|writev] and vmsplice?
> > > > If the range needed to be covered is a lot, vector operation makes senese
> > > > to me.
> > >
> > > I am not saying that the vector API is wrong. All I am trying to say is
> > > that the benefit is not really clear so far. If you want to push it
> > > through then you should better get some supporting data.
> >
> > I measured 1000 madvise syscall vs. a vector range syscall with 1000
> > ranges on ARM64 mordern device. Even though I saw 15% improvement but
> > absoluate gain is just 1ms so I don't think it's worth to support.
> > I will drop vector support at next revision.
> 
> Please do keep the vector support. Absolute timing is misleading,
> since in a tight loop, you're not going to contend on mmap_sem. We've
> seen tons of improvements in things like camera start come from
> coalescing mprotect calls, with the gains coming from taking and
> releasing various locks a lot less often and bouncing around less on
> the contended lock paths. Raw throughput doesn't tell the whole story,
> especially on mobile.

This will always be a double edge sword. Taking a lock for longer can
improve a throughput of a single call but it would make a latency for
anybody contending on the lock much worse.

Besides that, please do not overcomplicate the thing from the early
beginning please. Let's start with a simple and well defined remote
madvise alternative first and build a vector API on top with some
numbers based on _real_ workloads.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary
From: Daniel Colascione @ 2019-05-29 10:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt,
	Sonny Rao, Brian Geffon, Linux API
In-Reply-To: <20190527074940.GB6879@google.com>

On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 19:26:13, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote:
> > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote:
> > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote:
> > > > > > [Cc linux-api]
> > > > > >
> > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote:
> > > > > > > Currently, process_madvise syscall works for only one address range
> > > > > > > so user should call the syscall several times to give hints to
> > > > > > > multiple address range.
> > > > > >
> > > > > > Is that a problem? How big of a problem? Any numbers?
> > > > >
> > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up
> > > > > with number in the description at respin.
> > > >
> > > > Does this really have to be a fast operation? I would expect the monitor
> > > > is by no means a fast path. The system call overhead is not what it used
> > > > to be, sigh, but still for something that is not a hot path it should be
> > > > tolerable, especially when the whole operation is quite expensive on its
> > > > own (wrt. the syscall entry/exit).
> > >
> > > What's different with process_vm_[readv|writev] and vmsplice?
> > > If the range needed to be covered is a lot, vector operation makes senese
> > > to me.
> >
> > I am not saying that the vector API is wrong. All I am trying to say is
> > that the benefit is not really clear so far. If you want to push it
> > through then you should better get some supporting data.
>
> I measured 1000 madvise syscall vs. a vector range syscall with 1000
> ranges on ARM64 mordern device. Even though I saw 15% improvement but
> absoluate gain is just 1ms so I don't think it's worth to support.
> I will drop vector support at next revision.

Please do keep the vector support. Absolute timing is misleading,
since in a tight loop, you're not going to contend on mmap_sem. We've
seen tons of improvements in things like camera start come from
coalescing mprotect calls, with the gains coming from taking and
releasing various locks a lot less often and bouncing around less on
the contended lock paths. Raw throughput doesn't tell the whole story,
especially on mobile.

^ permalink raw reply

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: David Howells @ 2019-05-29  9:09 UTC (permalink / raw)
  To: Greg KH, casey
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel
In-Reply-To: <20190528235810.GA5776@kroah.com>

Greg KH <gregkh@linuxfoundation.org> wrote:

> >  (3) Letting users see events they shouldn't be able to see.
> 
> How are you handling namespaces then?  Are they determined by the
> namespace of the process that opened the original device handle, or the
> namespace that made the new syscall for the events to "start flowing"?

So far I haven't had to deal directly with namespaces.

mount_notify() requires you to have access to the mountpoint you want to watch
- and the entire tree rooted there is in one namespace, so your event sources
are restricted to that namespace.  Further, mount objects don't themselves
have any other namespaces, not even a user_ns.

sb_notify() requires you to have access to the superblock you want to watch.
superblocks aren't directly namespaced as a class, though individual
superblocks may participate in particular namespaces (ipc, net, etc.).  I'm
thinking some of these should be marked unwatchable (all pseudo superblocks,
kernfs-class, proc, for example).

Superblocks, however, do each have a user_ns - but you were allowed to access
the superblock by pathwalk, so you must have some access to the user_ns - I
think.

KEYCTL_NOTIFY requires you to have View access on the key you're watching.
Currently, keys have no real namespace restrictions, though I have patches to
include a namespace tag in the lookup criteria.

block_notify() doesn't require any direct access since you're watching a
global queue and there is no blockdev namespacing.  LSMs are given the option
to filter events, though.  The thought here is that if you can access dmesg,
you should be able to watch for blockdev events.


Actually, thinking further on this, restricting access to events is trickier
than I thought and than perhaps Casey was suggesting.

Say you're watching a mount object and someone in a different user_ns
namespace or with a different security label mounts on it.  What governs
whether you are allowed to see the event?

You're watching the object for changes - and it *has* changed.  Further, you
might be able to see the result of this change by other means (/proc/mounts,
for instance).

Should you be denied the event based on the security model?

On the other hand, if you're watching a tree of mount objects, it could be
argued that you should be denied access to events on any mount object you
can't reach by pathwalk.

On the third hand, if you can see it in /proc/mounts or by fsinfo(), you
should get an event for it.


> How are you handling namespaces then?

So to go back to the original question.  At the moment they haven't impinged
directly and I haven't had to deal with them directly.  There are indirect
namespace restrictions that I get for free just due to pathwalk, for instance.

David

^ permalink raw reply

* Re: [PATCH 04/25] vfs: Implement parameter value retrieval with fsinfo() [ver #13]
From: Miklos Szeredi @ 2019-05-29  8:08 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, Linux API, linux-fsdevel, linux-kernel,
	Miklos Szeredi
In-Reply-To: <155905629702.1662.7233272785972036117.stgit@warthog.procyon.org.uk>

On Tue, May 28, 2019 at 5:11 PM David Howells <dhowells@redhat.com> wrote:
>
> Implement parameter value retrieval with fsinfo() - akin to parsing
> /proc/mounts.
>
> This allows all the parameters to be retrieved in one go with:
>
>         struct fsinfo_params params = {
>                 .request        = FSINFO_ATTR_PARAMETER,
>         };

Ah, here it is.

>
> Each parameter comes as a pair of blobs with a length tacked on the front
> rather than using separators, since any printable character that could be
> used as a separator can be found in some value somewhere (including comma).
> In fact, cifs allows the separator to be set using the "sep=" option in
> parameter parsing.
>
> The length on the front of each blob is 1-3 bytes long.  Each byte has a
> flag in bit 7 that's set if there are more bytes and clear on the last
> byte; bits 0-6 should be shifted and OR'd into the length count.  The bytes
> are most-significant first.
>
> For example, 0x83 0xf5 0x06 is the length (0x03<<14 | 0x75<<7 | 0x06).

Sounds way too complicated.  What about fixed 4byte sizes?  Or using
the nul charater as separator (and binary blobs be damned)?

[...]

> +static void fsinfo_insert_sb_flag_parameters(struct path *path,
> +                                            struct fsinfo_kparams *params)
> +{
> +       int s_flags = READ_ONCE(path->dentry->d_sb->s_flags);
> +
> +       if (s_flags & SB_DIRSYNC)
> +               fsinfo_note_param(params, "dirsync", NULL);
> +       if (s_flags & SB_LAZYTIME)
> +               fsinfo_note_param(params, "lazytime", NULL);
> +       if (s_flags & SB_MANDLOCK)
> +               fsinfo_note_param(params, "mand", NULL);
> +       if (s_flags & SB_POSIXACL)
> +               fsinfo_note_param(params, "posixacl", NULL);
> +       if (s_flags & SB_RDONLY)
> +               fsinfo_note_param(params, "ro", NULL);
> +       if (s_flags & SB_SYNCHRONOUS)
> +               fsinfo_note_param(params, "sync", NULL);

Again, don't blindly transform s_flags into options, because some of
them may have been internally manipulated by the filesystem.

You could do a helper for filesystems that does the the common ones
(ro/sync/dirsync) but all of that *should* go through the filesystem.

Same goes for vfs_parse_sb_flag() btw.   It should be moved into each
filesystem's ->parse_param() and not be a mandatory thing.

Thanks,
Miklos

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer [ver #13]
From: Miklos Szeredi @ 2019-05-29  7:55 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, Linux API, linux-fsdevel, linux-kernel,
	Miklos Szeredi
In-Reply-To: <155905622921.1304.8775688192987027250.stgit@warthog.procyon.org.uk>

On Tue, May 28, 2019 at 5:10 PM David Howells <dhowells@redhat.com> wrote:
>
> Implement a misc device that implements a general notification queue as a
> ring buffer that can be mmap()'d from userspace.
>
> The way this is done is:
>
>  (1) An application opens the device and indicates the size of the ring
>      buffer that it wants to reserve in pages (this can only be set once):
>
>         fd = open("/dev/watch_queue", O_RDWR);
>         ioctl(fd, IOC_WATCH_QUEUE_NR_PAGES, nr_of_pages);
>
>  (2) The application should then map the pages that the device has
>      reserved.  Each instance of the device created by open() allocates
>      separate pages so that maps of different fds don't interfere with one
>      another.  Multiple mmap() calls on the same fd, however, will all work
>      together.
>
>         page_size = sysconf(_SC_PAGESIZE);
>         mapping_size = nr_of_pages * page_size;
>         char *buf = mmap(NULL, mapping_size, PROT_READ|PROT_WRITE,
>                          MAP_SHARED, fd, 0);

Would it make sense to use relayfs for the implementation of the
mapped ring buffer?

Thanks,
Miklos

^ permalink raw reply

* Re: [PATCH 01/25] vfs: syscall: Add fsinfo() to query filesystem information [ver #13]
From: Miklos Szeredi @ 2019-05-29  7:42 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, Linux API, linux-fsdevel, linux-kernel,
	Miklos Szeredi
In-Reply-To: <155905627049.1662.17033721577309385838.stgit@warthog.procyon.org.uk>

On Tue, May 28, 2019 at 5:11 PM David Howells <dhowells@redhat.com> wrote:
>
> Add a system call to allow filesystem information to be queried.  A request
> value can be given to indicate the desired attribute.  Support is provided
> for enumerating multi-value attributes.
>

[...]

> +static u32 calc_sb_flags(u32 s_flags)
> +{
> +       u32 flags = 0;
> +
> +       if (s_flags & SB_RDONLY)        flags |= MS_RDONLY;
> +       if (s_flags & SB_SYNCHRONOUS)   flags |= MS_SYNCHRONOUS;
> +       if (s_flags & SB_MANDLOCK)      flags |= MS_MANDLOCK;
> +       if (s_flags & SB_DIRSYNC)       flags |= MS_DIRSYNC;
> +       if (s_flags & SB_SILENT)        flags |= MS_SILENT;
> +       if (s_flags & SB_POSIXACL)      flags |= MS_POSIXACL;
> +       if (s_flags & SB_LAZYTIME)      flags |= MS_LAZYTIME;
> +       if (s_flags & SB_I_VERSION)     flags |= MS_I_VERSION;

Please don't resurrect MS_ flags.  They are from the old API and
shouldn't be used in the new one.  Some of them (e.g. MS_POSIXACL,
MS_I_VERSION) are actually internal flags despite being exported on
the old API.  And there's SB_SILENT which is simply not a superblock
flag and we might be better getting rid of it entirely.

The proper way to query mount options should be analogous to the way
they are set on the new API: list of {key, type, value, aux} tuples.

Thanks,
Miklos

^ permalink raw reply

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: Amir Goldstein @ 2019-05-29  7:40 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, linux-fsdevel, linux-api, linux-block,
	keyrings, LSM List, linux-kernel, Jan Kara
In-Reply-To: <22971.1559112346@warthog.procyon.org.uk>

On Wed, May 29, 2019 at 9:45 AM David Howells <dhowells@redhat.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I am interested to know how you envision filesystem notifications would
> > look with this interface.
>
> What sort of events are you thinking of by "filesystem notifications"?  You
> mean things like file changes?

I mean all the events provided by
http://man7.org/linux/man-pages/man7/fanotify.7.html

Which was recently (v4.20) extended to support watching a super block
and more recently (v5.1) extended to support watching directory entry
modifications.

Thanks,
Amir.

^ permalink raw reply

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: David Howells @ 2019-05-29  6:45 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: dhowells, Al Viro, Ian Kent, linux-fsdevel, linux-api,
	linux-block, keyrings, LSM List, linux-kernel, Jan Kara
In-Reply-To: <CAOQ4uxjC1M7jwjd9zSaSa6UW2dbEjc+ZbFSo7j9F1YHAQxQ8LQ@mail.gmail.com>

Amir Goldstein <amir73il@gmail.com> wrote:

> I am interested to know how you envision filesystem notifications would
> look with this interface.

What sort of events are you thinking of by "filesystem notifications"?  You
mean things like file changes?

David

^ permalink raw reply

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: Amir Goldstein @ 2019-05-29  6:33 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, linux-fsdevel, linux-api, linux-block,
	keyrings, LSM List, linux-kernel, Jan Kara
In-Reply-To: <155905930702.7587.7100265859075976147.stgit@warthog.procyon.org.uk>

On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
>
>
> Hi Al,
>
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
>
>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>      mount reconfiguration.
>
>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>      errors (not complete yet).
>
>  (3) Block layer events, such as I/O errors.
>
>  (4) Key/keyring events, such as creating, linking and removal of keys.
>
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
>
>
> Design decisions:
>
>  (1) A misc chardev is used to create and open a ring buffer:
>
>         fd = open("/dev/watch_queue", O_RDWR);
>
>      which is then configured and mmap'd into userspace:
>
>         ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
>         ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
>         buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
>                    MAP_SHARED, fd, 0);
>
>      The fd cannot be read or written (though there is a facility to use
>      write to inject records for debugging) and userspace just pulls data
>      directly out of the buffer.
>
>  (2) The ring index pointers are stored inside the ring and are thus
>      accessible to userspace.  Userspace should only update the tail
>      pointer and never the head pointer or risk breaking the buffer.  The
>      kernel checks that the pointers appear valid before trying to use
>      them.  A 'skip' record is maintained around the pointers.
>
>  (3) poll() can be used to wait for data to appear in the buffer.
>
>  (4) Records in the buffer are binary, typed and have a length so that they
>      can be of varying size.
>
>      This means that multiple heterogeneous sources can share a common
>      buffer.  Tags may be specified when a watchpoint is created to help
>      distinguish the sources.
>
>  (5) The queue is reusable as there are 16 million types available, of
>      which I've used 4, so there is scope for others to be used.
>
>  (6) Records are filterable as types have up to 256 subtypes that can be
>      individually filtered.  Other filtration is also available.
>
>  (7) Each time the buffer is opened, a new buffer is created - this means
>      that there's no interference between watchers.
>
>  (8) When recording a notification, the kernel will not sleep, but will
>      rather mark a queue as overrun if there's insufficient space, thereby
>      avoiding userspace causing the kernel to hang.
>
>  (9) The 'watchpoint' should be specific where possible, meaning that you
>      specify the object that you want to watch.
>
> (10) The buffer is created and then watchpoints are attached to it, using
>      one of:
>
>         keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
>         mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
>         sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>
>      where in all three cases, fd indicates the queue and the number after
>      is a tag between 0 and 255.
>
> (11) The watch must be removed if either the watch buffer is destroyed or
>      the watched object is destroyed.
>
>
> Things I want to avoid:
>
>  (1) Introducing features that make the core VFS dependent on the network
>      stack or networking namespaces (ie. usage of netlink).
>
>  (2) Dumping all this stuff into dmesg and having a daemon that sits there
>      parsing the output and distributing it as this then puts the
>      responsibility for security into userspace and makes handling
>      namespaces tricky.  Further, dmesg might not exist or might be
>      inaccessible inside a container.
>
>  (3) Letting users see events they shouldn't be able to see.
>
>
> Further things that could be considered:
>
>  (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>      "children" of that keyring, such that the watch is removed from the
>      child if it is unlinked from the keyring.
>
>  (2) Adding global superblock event queue.
>
>  (3) Propagating watches to child superblock over automounts.
>

David,

I am interested to know how you envision filesystem notifications would
look with this interface.

fanotify can certainly benefit from providing a ring buffer interface to read
events.

>From what I have seen, a common practice of users is to monitor mounts
(somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
It'd be good if those users can use a single watch mechanism/API for
watching the mount namespace and filesystem events within mounts.

A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
It provides users with two complete different mechanisms to watch error
and filesystem events. That is generally not a good thing to have.

I am not asking that you implement fs_notify() before merging sb_notify()
and I understand that you have a use case for sb_notify().
I am asking that you show me the path towards a unified API (how a
typical program would look like), so that we know before merging your
new API that it could be extended to accommodate fsnotify events
where the final result will look wholesome to users.

Thanks,
Amir.

^ permalink raw reply

* Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA
From: Amir Goldstein @ 2019-05-29  5:58 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Theodore Tso, Jan Kara, Dave Chinner, Chris Mason, Al Viro,
	linux-fsdevel, linux-xfs, Ext4, Linux Btrfs, linux-api
In-Reply-To: <20190528200659.GK5221@magnolia>

On Tue, May 28, 2019 at 11:07 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> > New link flags to request "atomic" link.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >
> > Hi Guys,
> >
> > Following our discussions on LSF/MM and beyond [1][2], here is
> > an RFC documentation patch.
> >
> > Ted, I know we discussed limiting the API for linking an O_TMPFILE
> > to avert the hardlinks issue, but I decided it would be better to
> > document the hardlinks non-guaranty instead. This will allow me to
> > replicate the same semantics and documentation to renameat(2).
> > Let me know how that works out for you.
> >
> > I also decided to try out two separate flags for data and metadata.
> > I do not find any of those flags very useful without the other, but
> > documenting them seprately was easier, because of the fsync/fdatasync
> > reference.  In the end, we are trying to solve a social engineering
> > problem, so this is the least confusing way I could think of to describe
> > the new API.
> >
> > First implementation of AT_ATOMIC_METADATA is expected to be
> > noop for xfs/ext4 and probably fsync for btrfs.
> >
> > First implementation of AT_ATOMIC_DATA is expected to be
> > filemap_write_and_wait() for xfs/ext4 and probably fdatasync for btrfs.
> >
> > Thoughts?
> >
> > Amir.
> >
> > [1] https://lwn.net/Articles/789038/
> > [2] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/
> >
> >  man2/link.2 | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> >
> > diff --git a/man2/link.2 b/man2/link.2
> > index 649ba00c7..15c24703e 100644
> > --- a/man2/link.2
> > +++ b/man2/link.2
> > @@ -184,6 +184,57 @@ See
> >  .BR openat (2)
> >  for an explanation of the need for
> >  .BR linkat ().
> > +.TP
> > +.BR AT_ATOMIC_METADATA " (since Linux 5.x)"
> > +By default, a link operation followed by a system crash, may result in the
> > +new file name being linked with old inode metadata, such as out dated time
> > +stamps or missing extended attributes.
> > +.BR fsync (2)
> > +before linking the inode, but that involves flushing of volatile disk caches.
> > +
> > +A filesystem that accepts this flag will guaranty, that old inode metadata
> > +will not be exposed in the new linked name.
> > +Some filesystems may internally perform
> > +.BR fsync (2)
> > +before linking the inode to provide this guaranty,
> > +but often, filesystems will have a more efficient method to provide this
> > +guaranty without flushing volatile disk caches.
> > +
> > +A filesystem that accepts this flag does
> > +.BR NOT
> > +guaranty that the new file name will exist after a system crash, nor that the
> > +current inode metadata is persisted to disk.
>
> Hmmm.  I think it would be much clearer to state the two expectations in
> the same place, e.g.:
>
> "A filesystem that accepts this flag guarantees that after a successful
> call completion, the filesystem will return either (a) the version of
> the metadata that was on disk at the time the call completed; (b) a
> newer version of that metadata; or (c) -ENOENT.  In other words, a
> subsequent access of the file path will never return metadata that was
> obsolete at the time that the call completed, even if the system crashes
> immediately afterwards."

Your feedback is along the same line as Ted's feedback.
I will try out the phrasing that Ted proposed, will see how that goes...

>
> Did I get that right?  I /think/ this means that one could implement Ye
> Olde Write And Rename as:
>
> fd = open(".", O_TMPFILE...);
> write(fd);
> fsync(fd);

Certainly not fsync(), that what my "counter-fsync()" phrasing was trying to
convey. The flags are meant as a "cheaper" replacement for that fsync().

> snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
> linkat(AT_FDCWD, path, AT_FDCWD, "file.txt",
>         AT_EMPTY_PATH | AT_ATOMIC_DATA | AT_ATOMIC_METADATA);
>
> (Still struggling to figure out what userspace programs would use this
> for...)
>

There are generally two use cases:

1. Flushing volatile disk caches is very expensive.
In this case sync_file_range(SYNC_FILE_RANGE_WRITE_AND_WAIT)
is cheaper than fdatasync() for a preallocated file and obviously a noop
for AT_ATOMIC_METADATA in "S.O.M.C" filesystem is cheaper than
a journal commit.

2. Highly concurrent metadata workloads
Many users running  AT_ATOMIC_METADATA concurrently are much
less likely to interfere each other than many users running fsync concurrently.

IWO, when I'm using fsync() to provide the AT_ATOMIC_METADATA guarantee
on a single journal fs, other users pay the penalty for a guaranty that I didn't
ask for (i.e. persistency).

Thanks,
Amir.

^ permalink raw reply

* Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA
From: Amir Goldstein @ 2019-05-29  5:38 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, Darrick J . Wong, Dave Chinner, Chris Mason, Al Viro,
	linux-fsdevel, linux-xfs, Ext4, Linux Btrfs, linux-api
In-Reply-To: <20190528202659.GA12412@mit.edu>

On Tue, May 28, 2019 at 11:27 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> >
> > Following our discussions on LSF/MM and beyond [1][2], here is
> > an RFC documentation patch.
> >
> > Ted, I know we discussed limiting the API for linking an O_TMPFILE
> > to avert the hardlinks issue, but I decided it would be better to
> > document the hardlinks non-guaranty instead. This will allow me to
> > replicate the same semantics and documentation to renameat(2).
> > Let me know how that works out for you.
> >
> > I also decided to try out two separate flags for data and metadata.
> > I do not find any of those flags very useful without the other, but
> > documenting them seprately was easier, because of the fsync/fdatasync
> > reference.  In the end, we are trying to solve a social engineering
> > problem, so this is the least confusing way I could think of to describe
> > the new API.
>
> The way you have stated thigs is very confusing, and prone to be
> misunderstood.  I think it would be helpful to state things in the
> positive, instead of the negative.
>
> Let's review what you had wanted:
>
>         *If* the filename is visible in the directory after the crash,
>         *then* all of the metadata/data that had been written to the file
>         before the linkat(2) would be visible.
>
>         HOWEVER, you did not want to necessarily force an fsync(2) in
>         order to provide that guarantee.  That is, the filename would
>         not necessarily be guaranteed to be visible after a crash when
>         linkat(2) returns, but if the existence of the filename is
>         persisted, then the data would be too.
>
>         Also, at least initially we talked about this only making
>         sense for O_TMPFILE file desacriptors.  I believe you were
>         trying to generalize things so it wouldn't necessarily have to
>         be a file created using O_TMPFILE.  Is that correct?

That is correct. I felt we were limiting ourselves only to avert the
hardlinks issue, so decided its better to explain that "nlink is not
part of the inode metadata" that this guarantee refers to.
I would be happy to get your feedback about the hardlink disclaimer
since you brought up the concern. It the disclaimer enough?
Not needed at all?

>
> So instead of saying "A filesystem that accepts this flag will
> guaranty, that old inode data will not be exposed in the new linked
> name."  It's much clearer to state this in the affirmative:
>
>         A filesystem which accepts this flag will guarantee that if
>         the new pathname exists after a crash, all of the data written
>         to the file at the time of the linkat(2) call will be visible.
>

Sounds good to me. I will take a swing at another patch.

Thanks for the review!
Amir.

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Richard Guy Briggs @ 2019-05-29  0:43 UTC (permalink / raw)
  To: Steve Grubb
  Cc: Paul Moore, Dan Walsh, Neil Horman, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <3299293.RYyUlNkVNy@x2>

On 2019-05-28 19:00, Steve Grubb wrote:
> On Tuesday, May 28, 2019 6:26:47 PM EDT Paul Moore wrote:
> > On Tue, May 28, 2019 at 5:54 PM Daniel Walsh <dwalsh@redhat.com> wrote:
> > > On 4/22/19 9:49 AM, Paul Moore wrote:
> > > > On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> 
> wrote:
> > > >> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> > > >>> Implement kernel audit container identifier.
> > > >> 
> > > >> I'm sorry, I've lost track of this, where have we landed on it? Are we
> > > >> good for inclusion?
> > > > 
> > > > I haven't finished going through this latest revision, but unless
> > > > Richard made any significant changes outside of the feedback from the
> > > > v5 patchset I'm guessing we are "close".
> > > > 
> > > > Based on discussions Richard and I had some time ago, I have always
> > > > envisioned the plan as being get the kernel patchset, tests, docs
> > > > ready (which Richard has been doing) and then run the actual
> > > > implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> > > > to make sure the actual implementation is sane from their perspective.
> > > > They've already seen the design, so I'm not expecting any real
> > > > surprises here, but sometimes opinions change when they have actual
> > > > code in front of them to play with and review.
> > > > 
> > > > Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> > > > whatever additional testing we can do would be a big win.  I'm
> > > > thinking I'll pull it into a separate branch in the audit tree
> > > > (audit/working-container ?) and include that in my secnext kernels
> > > > that I build/test on a regular basis; this is also a handy way to keep
> > > > it based against the current audit/next branch.  If any changes are
> > > > needed Richard can either chose to base those changes on audit/next or
> > > > the separate audit container ID branch; that's up to him.  I've done
> > > > this with other big changes in other trees, e.g. SELinux, and it has
> > > > worked well to get some extra testing in and keep the patchset "merge
> > > > ready" while others outside the subsystem look things over.
> > > 
> > > Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
> > > believe this is something we can work on in the container runtimes team
> > > to implement the container auditing code in CRI-O and Podman.
> > 
> > Thanks Dan.  If I pulled this into a branch and built you some test
> > kernels to play with, any idea how long it might take to get a proof
> > of concept working on the cri-o side?
> 
> We'd need to merge user space patches and let them use that instead of the 
> raw interface. I'm not going to merge user space until we are pretty sure the 
> patch is going into the kernel.

I have an f29 test rpm of the userspace bits if that helps for testing:
	http://people.redhat.com/~rbriggs/ghak90/git-1db7e21/

Here's what it contains (minus the last patch):
	https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau40-containerid-filter.v7.0

> -Steve
> 
> > FWIW, I've also reached out to some of the LXC folks I know to get
> > their take on the API.  I think if we can get two different container
> > runtimes to give the API a thumbs-up then I think we are in good shape
> > with respect to the userspace interface.
> > 
> > I just finished looking over the last of the pending audit kernel
> > patches that were queued waiting for the merge window to open so this
> > is next on my list to look at.  I plan to start doing that
> > tonight/tomorrow, and as long as the changes between v5/v6 are not
> > that big, it shouldn't take too long.

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [RFC][PATCH 0/7] Mount, FS, Block and Keyrings notifications
From: Greg KH @ 2019-05-28 23:58 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <155905930702.7587.7100265859075976147.stgit@warthog.procyon.org.uk>

On Tue, May 28, 2019 at 05:01:47PM +0100, David Howells wrote:
> Things I want to avoid:
> 
>  (1) Introducing features that make the core VFS dependent on the network
>      stack or networking namespaces (ie. usage of netlink).
> 
>  (2) Dumping all this stuff into dmesg and having a daemon that sits there
>      parsing the output and distributing it as this then puts the
>      responsibility for security into userspace and makes handling
>      namespaces tricky.  Further, dmesg might not exist or might be
>      inaccessible inside a container.
> 
>  (3) Letting users see events they shouldn't be able to see.

How are you handling namespaces then?  Are they determined by the
namespace of the process that opened the original device handle, or the
namespace that made the new syscall for the events to "start flowing"?

Am I missing the logic that determines this in the patches, or is that
not implemented yet?

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: Jann Horn @ 2019-05-28 23:23 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list
In-Reply-To: <10418.1559084686@warthog.procyon.org.uk>

On Wed, May 29, 2019 at 1:04 AM David Howells <dhowells@redhat.com> wrote:
> Jann Horn <jannh@google.com> wrote:
> > It might make sense to redesign this stuff so that watches don't hold
> > references on the object being watched.
>
> I explicitly made it hold a reference so that if you place a watch on an
> automounted mount it stops it from expiring.
>
> Further, if I create a watch on something, *should* it be unmountable, just as
> if I had a file open there or had chdir'd into there?

I don't really know. I guess it depends on how it's being used? If
someone decides to e.g. make a file browser that installs watches for
a bunch of mountpoints for some fancy sidebar showing the device
mounts on the system, or something like that, that probably shouldn't
inhibit unmounting... I don't know if that's a realistic use case.

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Jann Horn @ 2019-05-28 23:16 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, raven, linux-fsdevel, Linux API, linux-block, keyrings,
	linux-security-module, kernel list
In-Reply-To: <11466.1559082515@warthog.procyon.org.uk>

On Wed, May 29, 2019 at 12:28 AM David Howells <dhowells@redhat.com> wrote:
> Jann Horn <jannh@google.com> wrote:
> > I don't see you setting any special properties on the VMA that would
> > prevent userspace from extending its size via mremap() - no
> > VM_DONTEXPAND or VM_PFNMAP. So I think you might get an out-of-bounds
> > access here?
>
> Should I just set VM_DONTEXPAND in watch_queue_mmap()?  Like so:
>
>         vma->vm_flags |= VM_DONTEXPAND;

Yeah, that should work.

> > > +static void watch_queue_map_pages(struct vm_fault *vmf,
> > > +                                 pgoff_t start_pgoff, pgoff_t end_pgoff)
> > ...
> >
> > Same as above.
>
> Same solution as above?  Or do I need ot check start/end_pgoff too?

Same solution.

> > > +static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma)
> > > +{
> > > +       struct watch_queue *wqueue = file->private_data;
> > > +
> > > +       if (vma->vm_pgoff != 0 ||
> > > +           vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE ||
> > > +           !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED)))
> > > +               return -EINVAL;
> >
> > This thing should probably have locking against concurrent
> > watch_queue_set_size()?
>
> Yeah.
>
>         static int watch_queue_mmap(struct file *file,
>                                     struct vm_area_struct *vma)
>         {
>                 struct watch_queue *wqueue = file->private_data;
>                 struct inode *inode = file_inode(file);
>                 u8 nr_pages;
>
>                 inode_lock(inode);
>                 nr_pages = wqueue->nr_pages;
>                 inode_unlock(inode);
>
>                 if (nr_pages == 0 ||
>                 ...
>                         return -EINVAL;

Looks reasonable.

> > > +       smp_store_release(&buf->meta.head, len);
> >
> > Why is this an smp_store_release()? The entire buffer should not be visible to
> > userspace before this setup is complete, right?
>
> Yes - if I put the above locking in the mmap handler.  OTOH, it's a one-off
> barrier...
>
> > > +               if (wqueue->buffer)
> > > +                       return -EBUSY;
> >
> > The preceding check occurs without any locks held and therefore has no
> > reliable effect. It should probably be moved below the
> > inode_lock(...).
>
> Yeah, it can race.  I'll move it into watch_queue_set_size().

Sounds good.

> > > +static void free_watch(struct rcu_head *rcu)
> > > +{
> > > +       struct watch *watch = container_of(rcu, struct watch, rcu);
> > > +
> > > +       put_watch_queue(rcu_access_pointer(watch->queue));
> >
> > This should be rcu_dereference_protected(..., 1).
>
> That shouldn't be necessary.  rcu_access_pointer()'s description says:
>
>  * It is also permissible to use rcu_access_pointer() when read-side
>  * access to the pointer was removed at least one grace period ago, as
>  * is the case in the context of the RCU callback that is freeing up
>  * the data, ...
>
> It's in an rcu callback function, so accessing the __rcu pointers in the RCU'd
> struct should be fine with rcu_access_pointer().

Aah, whoops, you're right, I missed that paragraph in the
documentation of rcu_access_pointer().

> > > +       /* We don't need the watch list lock for the next bit as RCU is
> > > +        * protecting everything from being deallocated.
> >
> > Does "everything" mean "the wqueue" or more than that?
>
> Actually, just 'wqueue' and its buffer.  'watch' is held by us once we've
> dequeued it as we now own the ref 'wlist' had on it.  'wlist' and 'wq' must be
> pinned by the caller.
>
> > > +                       if (release) {
> > > +                               if (wlist->release_watch) {
> > > +                                       rcu_read_unlock();
> > > +                                       /* This might need to call dput(), so
> > > +                                        * we have to drop all the locks.
> > > +                                        */
> > > +                                       wlist->release_watch(wlist, watch);
> >
> > How are you holding a reference to `wlist` here? You got the reference through
> > rcu_dereference(), you've dropped the RCU read lock, and I don't see anything
> > that stabilizes the reference.
>
> The watch record must hold a ref on the watched object if the watch_list has a
> ->release_watch() method.  In the code snippet above, the watch record now
> belongs to us because we unlinked it under the wlist->lock some lines prior.

Ah, of course.

> However, you raise a good point, and I think the thing to do is to cache
> ->release_watch from it and not pass wlist into (*release_watch)().  We don't
> need to concern ourselves with cleaning up *wlist as it will be cleaned up
> when the target object is removed.
>
> Keyrings don't have a ->release_watch method and neither does the block-layer
> notification stuff.
>
> > > +       if (wqueue->pages && wqueue->pages[0])
> > > +               WARN_ON(page_ref_count(wqueue->pages[0]) != 1);
> >
> > Is there a reason why there couldn't still be references to the pages
> > from get_user_pages()/get_user_pages_fast()?
>
> I'm not sure.  I'm not sure what to do if there are.  What do you suggest?

I would use put_page() instead of manually freeing it; I think that
should be enough? I'm not entirely sure though.

> > > +       n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID);
> >
> > Should the non-atomic modification of n->info
>
> n's an unpublished copy of some userspace data that's local to the function
> instance.  There shouldn't be any way to race with it at this point.

Ah, derp. Yes.

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: Greg KH @ 2019-05-28 23:12 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-fsdevel, linux-api, linux-block, keyrings,
	linux-security-module, linux-kernel
In-Reply-To: <4031.1559064620@warthog.procyon.org.uk>

On Tue, May 28, 2019 at 06:30:20PM +0100, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > > Implement a misc device that implements a general notification queue as a
> > > ring buffer that can be mmap()'d from userspace.
> > 
> > "general" but just for filesystems, right?  :(
> 
> Whatever gave you that idea?  You can watch keyrings events, for example -
> they're not exactly filesystems.  I've added the ability to watch for mount
> topology changes and superblock events because those are something I've been
> asked to do.  I've added something for block events because I've recently had
> a problem with trying to recover data from a dodgy disk in that every time the
> disk goes offline, the ddrecover goes "wheeeee!" as it just sees a lot of
> EIO/ENODATA at a great rate of knots because it doesn't know the driver is now
> ignoring the disk.
> 
> I don't know what else people might want to watch, but I've tried to make it
> as generic as possible so as not to exclude it if possible.

Ok, let me try to dig up some older proposals to see if this fits into
the same model to work with them as well.

> > > +	refcount_t		usage;
> > 
> > Usage of what, this structure?  Or something else?
> 
> This is the number of usages of this struct (references to if you prefer).  I
> can add a comment to this effect.

I think you answer this later on with the kref comment :)

> > > +		return -EOPNOTSUPP;
> > 
> > -ENOTTY is the correct "not a valid ioctl" error value, right?
> 
> fs/ioctl.c does both, but I can switch it if it makes you happier.

It does.

> > > +void put_watch_queue(struct watch_queue *wqueue)
> > > +{
> > > +	if (refcount_dec_and_test(&wqueue->usage))
> > > +		kfree_rcu(wqueue, rcu);
> > 
> > Why not just use a kref?
> 
> Why use a kref?  It seems like an effort to be a C++ base class, but without
> the C++ inheritance bit.  Using kref doesn't seem to gain anything.  It's just
> a wrapper around refcount_t - so why not just use a refcount_t?
> 
> kref_put() could potentially add an unnecessary extra stack frame and would
> seem to be best avoided, though an optimising compiler ought to be able to
> inline if it can.

If kref_put() is on your fast path, you have worse problems (kfree isn't
fast, right?)

Anyway, it's an inline function, how can it add an extra stack frame?
Don't try to optimize something that isn't needed yet.

> Are you now on the convert all refcounts to krefs path?

"now"?  Remember, I wrote kref all those years ago, everyone should use
it.  It saves us having to audit the same pattern over and over again.
And, even nicer, it uses a refcount now, and as you are trying to
reference count an object, it is exactly what this was written for.

So yes, I do think it should be used here, unless it is deemed to not
fit the pattern/usage model.

> > > +EXPORT_SYMBOL(add_watch_to_object);
> > 
> > Naming nit, shouldn't the "prefix" all be the same for these new
> > functions?
> > 
> > watch_queue_add_object()?  watch_queue_put()?  And so on?
> 
> Naming is fun.  watch_queue_add_object - that suggests something different to
> what the function actually does.  I'll think about adjusting the names.

Ok, just had to say something.  It's your call, and yes, naming is hard.

> > > +module_exit(watch_queue_exit);
> > 
> > module_misc_device()?
> 
> 	warthog>git grep module_misc_device -- Documentation/
> 	warthog1>

Do I have to document all helper macros?  Anyway, it saves you
boilerplate code, but if built in, it's at the module init level, not
the fs init level, like you are asking for here.  So that might not
work, it's your call.

> > > +		struct {
> > > +			struct watch_notification watch; /* WATCH_TYPE_SKIP */
> > > +			volatile __u32	head;		/* Ring head index */
> > > +			volatile __u32	tail;		/* Ring tail index */
> > 
> > A uapi structure that has volatile in it?  Are you _SURE_ this is
> > correct?
> > 
> > That feels wrong to me...  This is not a backing-hardware register, it's
> > "just memory" and slapping volatile on it shouldn't be the correct
> > solution for telling the compiler to not to optimize away reads/flushes,
> > right?  You need a proper memory access type primitive for that to work
> > correctly everywhere I thought.
> > 
> > We only have 2 users of volatile in include/uapi, one for WMI structures
> > that are backed by firmware (seems correct), and one for DRM which I
> > have no idea how it works as it claims to be a lock.  Why is this new
> > addition the correct way to do this that no other ring-buffer that was
> > mmapped has needed to?
> 
> Yeah, I understand your concern with this.
> 
> The reason I put the volatiles in is that the kernel may be modifying the head
> pointer on one CPU simultaneously with userspace modifying the tail pointer on
> another CPU.
> 
> Note that userspace does not need to enter the kernel to find out if there's
> anything in the buffer or to read stuff out of the buffer.  Userspace only
> needs to enter the kernel, using poll() or similar, to wait for something to
> appear in the buffer.

And how does the tracing and perf ring buffers do this without needing
volatile?  Why not use the same type of interface they provide, as it's
always good to share code that has already had all of the nasty corner
cases worked out.

Anyway, I don't want you to think I don't like this code/idea overall, I
do.  I just want to see it be something that everyone can use, and use
easily, as it has been something lots of people have been asking for for
a long time.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: David Howells @ 2019-05-28 23:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list
In-Reply-To: <10418.1559084686@warthog.procyon.org.uk>

David Howells <dhowells@redhat.com> wrote:

> > It might make sense to redesign this stuff so that watches don't hold
> > references on the object being watched.
>
> I explicitly made it hold a reference so that if you place a watch on an
> automounted mount it stops it from expiring.
> 
> Further, if I create a watch on something, *should* it be unmountable, just as
> if I had a file open there or had chdir'd into there?

It gets trickier than that as I need a ref on the dentry on which the watch is
rooted to prevent it from getting culled.

David

^ permalink raw reply

* Re: [PATCH 3/7] vfs: Add a mount-notification facility
From: David Howells @ 2019-05-28 23:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list
In-Reply-To: <CAG48ez2rRh2_Kq_EGJs5k-ZBNffGs_Q=vkQdinorBgo58tbGpg@mail.gmail.com>

Jann Horn <jannh@google.com> wrote:

> It might make sense to redesign this stuff so that watches don't hold
> references on the object being watched.

I explicitly made it hold a reference so that if you place a watch on an
automounted mount it stops it from expiring.

Further, if I create a watch on something, *should* it be unmountable, just as
if I had a file open there or had chdir'd into there?

David

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Steve Grubb @ 2019-05-28 23:00 UTC (permalink / raw)
  To: Paul Moore
  Cc: Dan Walsh, Neil Horman, Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <CAHC9VhRW9f6GbhvvfifbOzd9p=PgdB2gq1E7tACcaqvfb85Y8A@mail.gmail.com>

On Tuesday, May 28, 2019 6:26:47 PM EDT Paul Moore wrote:
> On Tue, May 28, 2019 at 5:54 PM Daniel Walsh <dwalsh@redhat.com> wrote:
> > On 4/22/19 9:49 AM, Paul Moore wrote:
> > > On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> 
wrote:
> > >> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> > >>> Implement kernel audit container identifier.
> > >> 
> > >> I'm sorry, I've lost track of this, where have we landed on it? Are we
> > >> good for inclusion?
> > > 
> > > I haven't finished going through this latest revision, but unless
> > > Richard made any significant changes outside of the feedback from the
> > > v5 patchset I'm guessing we are "close".
> > > 
> > > Based on discussions Richard and I had some time ago, I have always
> > > envisioned the plan as being get the kernel patchset, tests, docs
> > > ready (which Richard has been doing) and then run the actual
> > > implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> > > to make sure the actual implementation is sane from their perspective.
> > > They've already seen the design, so I'm not expecting any real
> > > surprises here, but sometimes opinions change when they have actual
> > > code in front of them to play with and review.
> > > 
> > > Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> > > whatever additional testing we can do would be a big win.  I'm
> > > thinking I'll pull it into a separate branch in the audit tree
> > > (audit/working-container ?) and include that in my secnext kernels
> > > that I build/test on a regular basis; this is also a handy way to keep
> > > it based against the current audit/next branch.  If any changes are
> > > needed Richard can either chose to base those changes on audit/next or
> > > the separate audit container ID branch; that's up to him.  I've done
> > > this with other big changes in other trees, e.g. SELinux, and it has
> > > worked well to get some extra testing in and keep the patchset "merge
> > > ready" while others outside the subsystem look things over.
> > 
> > Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
> > believe this is something we can work on in the container runtimes team
> > to implement the container auditing code in CRI-O and Podman.
> 
> Thanks Dan.  If I pulled this into a branch and built you some test
> kernels to play with, any idea how long it might take to get a proof
> of concept working on the cri-o side?

We'd need to merge user space patches and let them use that instead of the 
raw interface. I'm not going to merge user space until we are pretty sure the 
patch is going into the kernel.

-Steve

> FWIW, I've also reached out to some of the LXC folks I know to get
> their take on the API.  I think if we can get two different container
> runtimes to give the API a thumbs-up then I think we are in good shape
> with respect to the userspace interface.
> 
> I just finished looking over the last of the pending audit kernel
> patches that were queued waiting for the merge window to open so this
> is next on my list to look at.  I plan to start doing that
> tonight/tomorrow, and as long as the changes between v5/v6 are not
> that big, it shouldn't take too long.

^ permalink raw reply

* Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer
From: David Howells @ 2019-05-28 22:28 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, raven, linux-fsdevel, Linux API, linux-block,
	keyrings, linux-security-module, kernel list
In-Reply-To: <CAG48ez3L5KzKyKMxUTaaB=r1E1ZXh=m6e9+CwYcXfRnUSjDvWA@mail.gmail.com>

Jann Horn <jannh@google.com> wrote:

> I don't see you setting any special properties on the VMA that would
> prevent userspace from extending its size via mremap() - no
> VM_DONTEXPAND or VM_PFNMAP. So I think you might get an out-of-bounds
> access here?

Should I just set VM_DONTEXPAND in watch_queue_mmap()?  Like so:

	vma->vm_flags |= VM_DONTEXPAND;

> > +static void watch_queue_map_pages(struct vm_fault *vmf,
> > +                                 pgoff_t start_pgoff, pgoff_t end_pgoff)
> ...
> 
> Same as above.

Same solution as above?  Or do I need ot check start/end_pgoff too?

> > +static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +       struct watch_queue *wqueue = file->private_data;
> > +
> > +       if (vma->vm_pgoff != 0 ||
> > +           vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE ||
> > +           !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED)))
> > +               return -EINVAL;
> 
> This thing should probably have locking against concurrent
> watch_queue_set_size()?

Yeah.

	static int watch_queue_mmap(struct file *file,
				    struct vm_area_struct *vma)
	{
		struct watch_queue *wqueue = file->private_data;
		struct inode *inode = file_inode(file);
		u8 nr_pages;

		inode_lock(inode);
		nr_pages = wqueue->nr_pages;
		inode_unlock(inode);

		if (nr_pages == 0 ||
		...
			return -EINVAL;

> > +       smp_store_release(&buf->meta.head, len);
> 
> Why is this an smp_store_release()? The entire buffer should not be visible to
> userspace before this setup is complete, right?

Yes - if I put the above locking in the mmap handler.  OTOH, it's a one-off
barrier...

> > +               if (wqueue->buffer)
> > +                       return -EBUSY;
> 
> The preceding check occurs without any locks held and therefore has no
> reliable effect. It should probably be moved below the
> inode_lock(...).

Yeah, it can race.  I'll move it into watch_queue_set_size().

> > +static void free_watch(struct rcu_head *rcu)
> > +{
> > +       struct watch *watch = container_of(rcu, struct watch, rcu);
> > +
> > +       put_watch_queue(rcu_access_pointer(watch->queue));
> 
> This should be rcu_dereference_protected(..., 1).

That shouldn't be necessary.  rcu_access_pointer()'s description says:

 * It is also permissible to use rcu_access_pointer() when read-side
 * access to the pointer was removed at least one grace period ago, as
 * is the case in the context of the RCU callback that is freeing up
 * the data, ...

It's in an rcu callback function, so accessing the __rcu pointers in the RCU'd
struct should be fine with rcu_access_pointer().

> > +       /* We don't need the watch list lock for the next bit as RCU is
> > +        * protecting everything from being deallocated.
> 
> Does "everything" mean "the wqueue" or more than that?

Actually, just 'wqueue' and its buffer.  'watch' is held by us once we've
dequeued it as we now own the ref 'wlist' had on it.  'wlist' and 'wq' must be
pinned by the caller.

> > +                       if (release) {
> > +                               if (wlist->release_watch) {
> > +                                       rcu_read_unlock();
> > +                                       /* This might need to call dput(), so
> > +                                        * we have to drop all the locks.
> > +                                        */
> > +                                       wlist->release_watch(wlist, watch);
> 
> How are you holding a reference to `wlist` here? You got the reference through
> rcu_dereference(), you've dropped the RCU read lock, and I don't see anything
> that stabilizes the reference.

The watch record must hold a ref on the watched object if the watch_list has a
->release_watch() method.  In the code snippet above, the watch record now
belongs to us because we unlinked it under the wlist->lock some lines prior.

However, you raise a good point, and I think the thing to do is to cache
->release_watch from it and not pass wlist into (*release_watch)().  We don't
need to concern ourselves with cleaning up *wlist as it will be cleaned up
when the target object is removed.

Keyrings don't have a ->release_watch method and neither does the block-layer
notification stuff.

> > +       if (wqueue->pages && wqueue->pages[0])
> > +               WARN_ON(page_ref_count(wqueue->pages[0]) != 1);
> 
> Is there a reason why there couldn't still be references to the pages
> from get_user_pages()/get_user_pages_fast()?

I'm not sure.  I'm not sure what to do if there are.  What do you suggest?

> > +       n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID);
> 
> Should the non-atomic modification of n->info

n's an unpublished copy of some userspace data that's local to the function
instance.  There shouldn't be any way to race with it at this point.

> (and perhaps also the
> following uses of ->debug) be protected by some lock?
> 
> > +       if (post_one_notification(wqueue, n, file->f_cred))
> > +               wqueue->debug = 0;
> > +       else
> > +               wqueue->debug++;
> > +       ret = len;
> > +       if (wqueue->debug > 20)
> > +               ret = -EIO;

It's for debugging purposes, so the non-atomiticity doesn't matter.  I'll
#undef the symbol to disable the code.

> > +#define IOC_WATCH_QUEUE_SET_SIZE       _IO('s', 0x01)  /* Set the size in pages */
> > +#define IOC_WATCH_QUEUE_SET_FILTER     _IO('s', 0x02)  /* Set the filter */
> 
> Should these ioctl numbers be registered in
> Documentation/ioctl/ioctl-number.txt?

Quite possibly.  I'm not sure where's best to actually allocate it.  I picked
's' out of the air.

David

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Paul Moore @ 2019-05-28 22:26 UTC (permalink / raw)
  To: Dan Walsh
  Cc: Neil Horman, Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <509ea6b0-1ac8-b809-98c2-37c34dd98ca3@redhat.com>

On Tue, May 28, 2019 at 5:54 PM Daniel Walsh <dwalsh@redhat.com> wrote:
>
> On 4/22/19 9:49 AM, Paul Moore wrote:
> > On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> >> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> >>> Implement kernel audit container identifier.
> >> I'm sorry, I've lost track of this, where have we landed on it? Are we good for
> >> inclusion?
> > I haven't finished going through this latest revision, but unless
> > Richard made any significant changes outside of the feedback from the
> > v5 patchset I'm guessing we are "close".
> >
> > Based on discussions Richard and I had some time ago, I have always
> > envisioned the plan as being get the kernel patchset, tests, docs
> > ready (which Richard has been doing) and then run the actual
> > implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> > to make sure the actual implementation is sane from their perspective.
> > They've already seen the design, so I'm not expecting any real
> > surprises here, but sometimes opinions change when they have actual
> > code in front of them to play with and review.
> >
> > Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> > whatever additional testing we can do would be a big win.  I'm
> > thinking I'll pull it into a separate branch in the audit tree
> > (audit/working-container ?) and include that in my secnext kernels
> > that I build/test on a regular basis; this is also a handy way to keep
> > it based against the current audit/next branch.  If any changes are
> > needed Richard can either chose to base those changes on audit/next or
> > the separate audit container ID branch; that's up to him.  I've done
> > this with other big changes in other trees, e.g. SELinux, and it has
> > worked well to get some extra testing in and keep the patchset "merge
> > ready" while others outside the subsystem look things over.
> >
> Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
> believe this is something we can work on in the container runtimes team
> to implement the container auditing code in CRI-O and Podman.

Thanks Dan.  If I pulled this into a branch and built you some test
kernels to play with, any idea how long it might take to get a proof
of concept working on the cri-o side?

FWIW, I've also reached out to some of the LXC folks I know to get
their take on the API.  I think if we can get two different container
runtimes to give the API a thumbs-up then I think we are in good shape
with respect to the userspace interface.

I just finished looking over the last of the pending audit kernel
patches that were queued waiting for the merge window to open so this
is next on my list to look at.  I plan to start doing that
tonight/tomorrow, and as long as the changes between v5/v6 are not
that big, it shouldn't take too long.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Richard Guy Briggs @ 2019-05-28 22:25 UTC (permalink / raw)
  To: Daniel Walsh
  Cc: Paul Moore, Neil Horman, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <509ea6b0-1ac8-b809-98c2-37c34dd98ca3@redhat.com>

On 2019-05-28 17:53, Daniel Walsh wrote:
> On 4/22/19 9:49 AM, Paul Moore wrote:
> > On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> >> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
> >>> Implement kernel audit container identifier.
> >> I'm sorry, I've lost track of this, where have we landed on it? Are we good for
> >> inclusion?
> > I haven't finished going through this latest revision, but unless
> > Richard made any significant changes outside of the feedback from the
> > v5 patchset I'm guessing we are "close".
> >
> > Based on discussions Richard and I had some time ago, I have always
> > envisioned the plan as being get the kernel patchset, tests, docs
> > ready (which Richard has been doing) and then run the actual
> > implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> > to make sure the actual implementation is sane from their perspective.
> > They've already seen the design, so I'm not expecting any real
> > surprises here, but sometimes opinions change when they have actual
> > code in front of them to play with and review.
> >
> > Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> > whatever additional testing we can do would be a big win.  I'm
> > thinking I'll pull it into a separate branch in the audit tree
> > (audit/working-container ?) and include that in my secnext kernels
> > that I build/test on a regular basis; this is also a handy way to keep
> > it based against the current audit/next branch.  If any changes are
> > needed Richard can either chose to base those changes on audit/next or
> > the separate audit container ID branch; that's up to him.  I've done
> > this with other big changes in other trees, e.g. SELinux, and it has
> > worked well to get some extra testing in and keep the patchset "merge
> > ready" while others outside the subsystem look things over.
> >
> Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
> believe this is something we can work on in the container runtimes team
> to implement the container auditing code in CRI-O and Podman.

Thanks Dan, Mrunal!

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH ghak90 V6 00/10] audit: implement container identifier
From: Daniel Walsh @ 2019-05-28 21:53 UTC (permalink / raw)
  To: Paul Moore, Neil Horman
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Mrunal Patel
In-Reply-To: <CAHC9VhQYPF2ma_W+hySbQtfTztf=K1LTFnxnyVK0y9VYxj-K=w@mail.gmail.com>

On 4/22/19 9:49 AM, Paul Moore wrote:
> On Mon, Apr 22, 2019 at 7:38 AM Neil Horman <nhorman@tuxdriver.com> wrote:
>> On Mon, Apr 08, 2019 at 11:39:07PM -0400, Richard Guy Briggs wrote:
>>> Implement kernel audit container identifier.
>> I'm sorry, I've lost track of this, where have we landed on it? Are we good for
>> inclusion?
> I haven't finished going through this latest revision, but unless
> Richard made any significant changes outside of the feedback from the
> v5 patchset I'm guessing we are "close".
>
> Based on discussions Richard and I had some time ago, I have always
> envisioned the plan as being get the kernel patchset, tests, docs
> ready (which Richard has been doing) and then run the actual
> implemented API by the userland container folks, e.g. cri-o/lxc/etc.,
> to make sure the actual implementation is sane from their perspective.
> They've already seen the design, so I'm not expecting any real
> surprises here, but sometimes opinions change when they have actual
> code in front of them to play with and review.
>
> Beyond that, while the cri-o/lxc/etc. folks are looking it over,
> whatever additional testing we can do would be a big win.  I'm
> thinking I'll pull it into a separate branch in the audit tree
> (audit/working-container ?) and include that in my secnext kernels
> that I build/test on a regular basis; this is also a handy way to keep
> it based against the current audit/next branch.  If any changes are
> needed Richard can either chose to base those changes on audit/next or
> the separate audit container ID branch; that's up to him.  I've done
> this with other big changes in other trees, e.g. SELinux, and it has
> worked well to get some extra testing in and keep the patchset "merge
> ready" while others outside the subsystem look things over.
>
Mrunal Patel (maintainer of CRI-O) and I have reviewed the API, and
believe this is something we can work on in the container runtimes team
to implement the container auditing code in CRI-O and Podman.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox