Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH man-pages] Document encoded I/O
From: Aleksa Sarai @ 2019-10-23  4:44 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Omar Sandoval, linux-fsdevel, Linux Btrfs, Dave Chinner,
	Jann Horn, Linux API, kernel-team, Theodore Tso
In-Reply-To: <CAOQ4uxgm6MWwCDO5stUwOKKSq7Ot4-Sc96F1Evc6ra5qBE+-wA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6433 bytes --]

On 2019-10-22, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Oct 21, 2019 at 9:54 PM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > On Mon, Oct 21, 2019 at 09:18:13AM +0300, Amir Goldstein wrote:
> > > CC: Ted
> > >
> > > What ever happened to read/write ext4 encrypted data API?
> > > https://marc.info/?l=linux-ext4&m=145030599010416&w=2
> > >
> > > Can we learn anything from the ext4 experience to improve
> > > the new proposed API?
> >
> > I wasn't aware of these patches, thanks for pointing them out. Ted, do
> > you have any thoughts about making this API work for fscrypt?
> >
> > > On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > >
> > > > From: Omar Sandoval <osandov@fb.com>
> > > >
> > > > This adds a new page, rwf_encoded(7), providing an overview of encoded
> > > > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > > > reference it.
> > > >
> > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > ---
> > > >  man2/fcntl.2       |  10 +-
> > > >  man2/open.2        |  13 ++
> > > >  man2/readv.2       |  46 +++++++
> > > >  man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  4 files changed, 365 insertions(+), 1 deletion(-)
> > > >  create mode 100644 man7/rwf_encoded.7
> > > >
> > > > diff --git a/man2/fcntl.2 b/man2/fcntl.2
> > > > index fce4f4c2b..76fe9cc6f 100644
> > > > --- a/man2/fcntl.2
> > > > +++ b/man2/fcntl.2
> > > > @@ -222,8 +222,9 @@ On Linux, this command can change only the
> > > >  .BR O_ASYNC ,
> > > >  .BR O_DIRECT ,
> > > >  .BR O_NOATIME ,
> > > > +.BR O_NONBLOCK ,
> > > >  and
> > > > -.B O_NONBLOCK
> > > > +.B O_ENCODED
> > > >  flags.
> > > >  It is not possible to change the
> > > >  .BR O_DSYNC
> > > > @@ -1803,6 +1804,13 @@ Attempted to clear the
> > > >  flag on a file that has the append-only attribute set.
> > > >  .TP
> > > >  .B EPERM
> > > > +Attempted to set the
> > > > +.B O_ENCODED
> > > > +flag and the calling process did not have the
> > > > +.B CAP_SYS_ADMIN
> > > > +capability.
> > > > +.TP
> > > > +.B EPERM
> > > >  .I cmd
> > > >  was
> > > >  .BR F_ADD_SEALS ,
> > > > diff --git a/man2/open.2 b/man2/open.2
> > > > index b0f485b41..cdd3c549c 100644
> > > > --- a/man2/open.2
> > > > +++ b/man2/open.2
> > > > @@ -421,6 +421,14 @@ was followed by a call to
> > > >  .BR fdatasync (2)).
> > > >  .IR "See NOTES below" .
> > > >  .TP
> > > > +.B O_ENCODED
> > > > +Open the file with encoded I/O permissions;
> > >
> > > 1. I find the name of the flag confusing.
> > > Yes, most people don't read documentation so carefully (or at all)
> > > so they will assume O_ENCODED will affect read/write or that it
> > > relates to RWF_ENCODED in a similar way that O_SYNC relates
> > > to RWF_SYNC (i.e. logical OR and not logical AND).
> > >
> > > I am not good at naming and to prove it I will propose:
> > > O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED
> >
> > Agreed, the name is misleading. I can't think of anything better than
> > O_ALLOW_ENCODED, so I'll go with that unless someone comes up with
> > something better :)
> >
> > > 2. While I see no harm in adding O_ flag to open(2) for this
> > > use case, I also don't see a major benefit in adding it.
> > > What if we only allowed setting the flag via fcntl(2) which returns
> > > an error on old kernels?
> > > Since unlike most O_ flags, O_ENCODED does NOT affect file
> > > i/o without additional opt-in flags, it is not standard anyway and
> > > therefore I find that setting it only via fcntl(2) is less error prone.
> >
> > If I make this fcntl-only, then it probably shouldn't be through
> > F_GETFL/F_SETFL (it'd be pretty awkward for an O_ flag to not be valid
> > for open(), and also awkward to mix some non-O_ flag with O_ flags for
> > F_GETFL/F_SETFL). So that leaves a couple of options:
> >
> > 1. Get/set it with F_GETFD/F_SETFD, which is currently only used for
> >    FD_CLOEXEC. That also silently ignores unknown flags, but as with the
> >    O_ flag option, I don't think that's a big deal for FD_ALLOW_ENCODED.
> > 2. Add a new fcntl command (F_GETFD2/F_SETFD2?). This seems like
> >    overkill to me.
> >
> > However, both of these options are annoying to implement. Ideally, we
> > wouldn't have to add another flags field to struct file. But, to reuse
> > f_flags, we'd need to make sure that FD_ALLOW_ENCODED doesn't collide
> > with other O_ flags, and we'd probably want to hide it from F_GETFL. At
> > that point, it might as well be an O_ flag.
> >
> > It seems to me that it's more trouble than it's worth to make this not
> > an O_ flag, but please let me know if you see a nice way to do so.
> >
> 
> No, I see why you choose to add the flag to open(2).
> I have no objection.
> 
> I once had a crazy thought how to add new open flags
> in a non racy manner without adding a new syscall,
> but as you wrote, this is not relevant for O_ALLOW_ENCODED.
> 
> Something like:
> 
> /*
>  * Old kernels silently ignore unsupported open flags.
>  * New kernels that gets __O_CHECK_NEWFLAGS do
>  * the proper checking for unsupported flags AND set the
>  * flag __O_HAVE_NEWFLAGS.
>  */
> #define O_FLAG1 __O_CHECK_NEWFLAGS|__O_FLAG1
> #define O_HAVE_FLAG1 __O_HAVE_NEWFLAGS|__O_FLAG1
> 
> fd = open(path, O_FLAG1);
> if (fd < 0)
>     return -errno;
> flags = fcntl(fd, F_GETFL, 0);
> if (flags < 0)
>     return flags;
> if ((flags & O_HAVE_FLAG1) != O_HAVE_FLAG1) {
>     close(fd);
>     return -EINVAL;
> }

You don't need to add __O_HAVE_NEWFLAGS to do this -- this already works
today for userspace to check whether a flag works properly
(specifically, __O_FLAG1 will only be set if __O_FLAG1 is supported --
otherwise it gets cleared during build_open_flags).

The problem with adding new flags is that an *old* program running on a
*new* kernel could pass a garbage flag (__O_CHECK_NEWFLAGS for instance)
that causes an error only on the new kernel.

The only real solution to this (and several other problems) is
openat2(). As for O_ALLOW_ENCODED -- the current semantics (-EPERM if it
is set without CAP_SYS_ADMIN) *will* cause backwards compatibility
issues for programs that have garbage flags set...

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andy Lutomirski @ 2019-10-23  4:11 UTC (permalink / raw)
  To: Andy Lutomirski, Pavel Emelyanov, Cyrill Gorcunov
  Cc: Daniel Colascione, Linus Torvalds, Jann Horn, Andrea Arcangeli,
	Linux API, LKML, Lokesh Gidra, Nick Kralevich, Nosh Minwalla,
	Tim Murray
In-Reply-To: <CALCETrUyq=J37gU-MYXqLdoi7uH7iNNVRjvcGUT11JA1QuTFyg@mail.gmail.com>

Trying again.  It looks like I used the wrong address for Pavel.

On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> [adding more people because this is going to be an ABI break, sigh]
>
> On Sat, Oct 12, 2019 at 5:52 PM Daniel Colascione <dancol@google.com> wrote:
> >
> > On Sat, Oct 12, 2019 at 4:10 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On Sat, Oct 12, 2019 at 12:16 PM Daniel Colascione <dancol@google.com> wrote:
> > > >
> > > > The new secure flag makes userfaultfd use a new "secure" anonymous
> > > > file object instead of the default one, letting security modules
> > > > supervise userfaultfd use.
> > > >
> > > > Requiring that users pass a new flag lets us avoid changing the
> > > > semantics for existing callers.
> > >
> > > Is there any good reason not to make this be the default?
> > >
> > >
> > > The only downside I can see is that it would increase the memory usage
> > > of userfaultfd(), but that doesn't seem like such a big deal.  A
> > > lighter-weight alternative would be to have a single inode shared by
> > > all userfaultfd instances, which would require a somewhat different
> > > internal anon_inode API.
> >
> > I'd also prefer to just make SELinux use mandatory, but there's a
> > nasty interaction with UFFD_EVENT_FORK. Adding a new UFFD_SECURE mode
> > which blocks UFFD_EVENT_FORK sidesteps this problem. Maybe you know a
> > better way to deal with it.
>
> ...
>
> > But maybe we can go further: let's separate authentication and
> > authorization, as we do in other LSM hooks. Let's split my
> > inode_init_security_anon into two hooks, inode_init_security_anon and
> > inode_create_anon. We'd define the former to just initialize the file
> > object's security information --- in the SELinux case, figuring out
> > its class and SID --- and define the latter to answer the yes/no
> > question of whether a particular anonymous inode creation should be
> > allowed. Normally, anon_inode_getfile2() would just call both hooks.
> > We'd add another anon_inode_getfd flag, ANON_INODE_SKIP_AUTHORIZATION
> > or something, that would tell anon_inode_getfile2() to skip calling
> > the authorization hook, effectively making the creation always
> > succeed. We can then make the UFFD code pass
> > ANON_INODE_SKIP_AUTHORIZATION when it's creating a file object in the
> > fork child while creating UFFD_EVENT_FORK messages.
>
> That sounds like an improvement.  Or maybe just teach SELinux that
> this particular fd creation is actually making an anon_inode that is a
> child of an existing anon inode and that the context should be copied
> or whatever SELinux wants to do.  Like this, maybe:
>
> static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
>                                   struct userfaultfd_ctx *new,
>                                   struct uffd_msg *msg)
> {
>         int fd;
>
> Change this:
>
>         fd = anon_inode_getfd("[userfaultfd]", &userfaultfd_fops, new,
>                               O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
>
> to something like:
>
>       fd = anon_inode_make_child_fd(..., ctx->inode, ...);
>
> where ctx->inode is the one context's inode.
>
> *** HOWEVER *** !!!
>
> Now that you've pointed this mechanism out, it is utterly and
> completely broken and should be removed from the kernel outright or at
> least severely restricted.  A .read implementation MUST NOT ACT ON THE
> CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> as stdin to a setuid program.
>
> So I think the right solution might be to attempt to *remove*
> UFFD_EVENT_FORK.  Maybe the solution is to say that, unless the
> creator of a userfaultfd() has global CAP_SYS_ADMIN, then it cannot
> use UFFD_FEATURE_EVENT_FORK) and print a warning (once) when
> UFFD_FEATURE_EVENT_FORK is allowed.  And, after some suitable
> deprecation period, just remove it.  If it's genuinely useful, it
> needs an entirely new API based on ioctl() or a syscall.  Or even
> recvmsg() :)
>
> And UFFD_SECURE should just become automatic, since you don't have a
> problem any more. :-p
>
> --Andy

^ permalink raw reply

* Re: [PATCH v2] usercopy: Avoid soft lockups in test_check_nonzero_user()
From: Michael Ellerman @ 2019-10-23  2:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: cyphar, mingo, peterz, alexander.shishkin, jolsa, namhyung,
	christian, keescook, linux, viro, torvalds, linux-api,
	linux-kernel
In-Reply-To: <20191017060938.p4tmr5ruv6frgse4@wittgenstein>

Christian Brauner <christian.brauner@ubuntu.com> writes:
> On Thu, Oct 17, 2019 at 09:00:48AM +1100, Michael Ellerman wrote:
>> Christian Brauner <christian.brauner@ubuntu.com> writes:
>> > On Wed, Oct 16, 2019 at 11:27:32PM +1100, Michael Ellerman wrote:
>> >> On a machine with a 64K PAGE_SIZE, the nested for loops in
>> >> test_check_nonzero_user() can lead to soft lockups, eg:
>> >> 
>> >>   watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [modprobe:611]
>> >>   Modules linked in: test_user_copy(+) vmx_crypto gf128mul crc32c_vpmsum virtio_balloon ip_tables x_tables autofs4
>> >>   CPU: 4 PID: 611 Comm: modprobe Tainted: G             L    5.4.0-rc1-gcc-8.2.0-00001-gf5a1a536fa14-dirty #1151
>> >>   ...
>> >>   NIP __might_sleep+0x20/0xc0
>> >>   LR  __might_fault+0x40/0x60
>> >>   Call Trace:
>> >>     check_zeroed_user+0x12c/0x200
>> >>     test_user_copy_init+0x67c/0x1210 [test_user_copy]
>> >>     do_one_initcall+0x60/0x340
>> >>     do_init_module+0x7c/0x2f0
>> >>     load_module+0x2d94/0x30e0
>> >>     __do_sys_finit_module+0xc8/0x150
>> >>     system_call+0x5c/0x68
>> >> 
>> >> Even with a 4K PAGE_SIZE the test takes multiple seconds. Instead
>> >> tweak it to only scan a 1024 byte region, but make it cross the
>> >> page boundary.
>> >> 
>> >> Fixes: f5a1a536fa14 ("lib: introduce copy_struct_from_user() helper")
>> >> Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
>> >> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>> >
>> > With Aleksa's Reviewed-by I've picked this up:
>> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=copy_struct_from_user
>> 
>> Thanks. Are you planning to send that to Linus for v5.4 or v5.5 ?
>
> This looks like a pretty straight bugfix to me since it's clearly
> causing an issue for you on power so v5.4-rc4 is what I'd aim for. I
> just want it to be in linux-next until tomorrow.

I see it in mainine now, thanks!

cheers

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Daniel Colascione @ 2019-10-22 21:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Jann Horn, Andrea Arcangeli, Pavel Emelyanov,
	Linux API, LKML, Lokesh Gidra, Nick Kralevich, Nosh Minwalla,
	Tim Murray
In-Reply-To: <CALCETrUyq=J37gU-MYXqLdoi7uH7iNNVRjvcGUT11JA1QuTFyg@mail.gmail.com>

On Sat, Oct 12, 2019 at 6:14 PM Andy Lutomirski <luto@kernel.org> wrote:
> [adding more people because this is going to be an ABI break, sigh]

Just pinging this thread. I'd like to rev my patch and I'm not sure
what we want to do about problem Andy identified. Are we removing
UFFD_EVENT_FORK?

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Richard Guy Briggs @ 2019-10-22 20:06 UTC (permalink / raw)
  To: Neil Horman
  Cc: Paul Moore, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, netfilter-devel, sgrubb, omosnace,
	dhowells, simo, Eric Paris, Serge Hallyn, ebiederm, Dan Walsh,
	mpatel
In-Reply-To: <20191022121302.GA9397@hmswarspite.think-freely.org>

On 2019-10-22 08:13, Neil Horman wrote:
> On Mon, Oct 21, 2019 at 08:31:37PM -0400, Paul Moore wrote:
> > On Mon, Oct 21, 2019 at 7:58 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-10-21 17:43, Paul Moore wrote:
> > > > On Mon, Oct 21, 2019 at 5:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2019-10-21 15:53, Paul Moore wrote:
> > > > > > On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > > > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > > > > process in a non-init user namespace the capability to set audit
> > > > > > > > container identifiers.
> > > > > > > >
> > > > > > > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > > > > > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > > > > > > structure:
> > > > > > > > struct audit_capcontid_status {
> > > > > > > >         pid_t   pid;
> > > > > > > >         u32     enable;
> > > > > > > > };
> > > > > > >
> > > > > > > Paul, can I get a review of the general idea here to see if you're ok
> > > > > > > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > > > > > > setting contid from beyond the init user namespace where capable() can't
> > > > > > > reach and ns_capable() is meaningless for these purposes?
> > > > > >
> > > > > > I think my previous comment about having both the procfs and netlink
> > > > > > interfaces apply here.  I don't see why we need two different APIs at
> > > > > > the start; explain to me why procfs isn't sufficient.  If the argument
> > > > > > is simply the desire to avoid mounting procfs in the container, how
> > > > > > many container orchestrators can function today without a valid /proc?
> > > > >
> > > > > Ok, sorry, I meant to address that question from a previous patch
> > > > > comment at the same time.
> > > > >
> > > > > It was raised by Eric Biederman that the proc filesystem interface for
> > > > > audit had its limitations and he had suggested an audit netlink
> > > > > interface made more sense.
> > > >
> > > > I'm sure you've got it handy, so I'm going to be lazy and ask: archive
> > > > pointer to Eric's comments?  Just a heads-up, I'm really *not* a fan
> > > > of using the netlink interface for this, so unless Eric presents a
> > > > super compelling reason for why we shouldn't use procfs I'm inclined
> > > > to stick with /proc.
> > >
> > > It was actually a video call with Eric and Steve where that was
> > > recommended, so I can't provide you with any first-hand communication
> > > about it.  I'll get more details...
> > 
> > Yeah, that sort of information really needs to be on the list.
> > 
> > > So, with that out of the way, could you please comment on the general
> > > idea of what was intended to be the central idea of this mechanism to be
> > > able to nest containers beyond the initial user namespace (knowing that
> > > a /proc interface is available and the audit netlink interface isn't
> > > necessary for it to work and the latter can be easily removed)?
> > 
> > I'm not entirely clear what you are asking about, are you asking why I
> > care about nesting container orchestrators?  Simply put, it is not
> > uncommon for the LXC/LXD folks to see nested container orchestrators,
> > so I felt it was important to support that use case.  When we
> > originally started this effort we probably should have done a better
> > job reaching out to the LXC/LXD folks, we may have caught this
> > earlier.  Regardless, we caught it, and it looks like we are on our
> > way to supporting it (that's good).
> > 
> > Are you asking why I prefer the procfs approach to setting/getting the
> > audit container ID?  For one, it makes it easier for a LSM to enforce
> > the audit container ID operations independent of the other audit
> > control APIs.  It also provides a simpler interface for container
> > orchestrators.  Both seem like desirable traits as far as I'm
> > concerned.
> 
> I agree that one api is probably the best approach here, but I actually
> think that the netlink interface is the more flexible approach.  Its a
> little more work for userspace (you have to marshal your data into a
> netlink message before sending it, and wait for an async response), but
> thats a well known pattern, and it provides significantly more
> flexibility for the kernel.  LSM already has a hook to audit netlink
> messages in sock_sendmsg, so thats not a problem, and if you use
> netlink, you get the advantage of being able to broadcast messages
> within your network namespaces, facilitating any needed orchestrator
> co-ordination.  To do the same thing with a filesystem api, you need to
> use the fanotify api, which IIRC doesn't work on proc.

One api was the intent, deprecating proc for loginuid and sessionid if
netlink was the chosen way to go.

I don't think we had discussed the possibility or need to use netlink
multicast for this purpose and see it as a liability to limiting access
to only those processes that need it.

> Neil
> 
> > > > > The intent was to switch to the audit netlink interface for contid,
> > > > > capcontid and to add the audit netlink interface for loginuid and
> > > > > sessionid while deprecating the proc interface for loginuid and
> > > > > sessionid.  This was alluded to in the cover letter, but not very clear,
> > > > > I'm afraid.  I have patches to remove the contid and loginuid/sessionid
> > > > > interfaces in another tree which is why I had forgotten to outline that
> > > > > plan more explicitly in the cover letter.
> > 
> > paul moore

- RGB

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Paul Moore @ 2019-10-22 14:34 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman, Dan Walsh, mpatel
In-Reply-To: <20191022142716.sgxcmc27w4uaqh3u@madcap2.tricolour.ca>

On Tue, Oct 22, 2019 at 10:27 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> I'd like your perspective on how the capcontid feature was implemented
> (aside from the proc/netlink api issue which was intended to be
> consistent across loginuid/sessionid/contid/capcontid).  Do you see this
> feature as potentially solving the nested container issue in child user
> namespaces?

The patchset is a bit messy at this point in the stack due to the
"fixup!" confusion and a few other things which I already mentioned so
I don't really want to comment too much on that until I can see
everything in a reasonable patch stack.  Let's leave that for the next
draft.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Richard Guy Briggs @ 2019-10-22 14:27 UTC (permalink / raw)
  To: Paul Moore
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman, Dan Walsh, mpatel
In-Reply-To: <CAHC9VhSiwnY-+2awxvGeO4a0NgfVkOPd8fzzBVujp=HtjskTuQ@mail.gmail.com>

On 2019-10-21 20:31, Paul Moore wrote:
> On Mon, Oct 21, 2019 at 7:58 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-10-21 17:43, Paul Moore wrote:
> > > On Mon, Oct 21, 2019 at 5:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2019-10-21 15:53, Paul Moore wrote:
> > > > > On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > > > process in a non-init user namespace the capability to set audit
> > > > > > > container identifiers.
> > > > > > >
> > > > > > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > > > > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > > > > > structure:
> > > > > > > struct audit_capcontid_status {
> > > > > > >         pid_t   pid;
> > > > > > >         u32     enable;
> > > > > > > };
> > > > > >
> > > > > > Paul, can I get a review of the general idea here to see if you're ok
> > > > > > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > > > > > setting contid from beyond the init user namespace where capable() can't
> > > > > > reach and ns_capable() is meaningless for these purposes?
> > > > >
> > > > > I think my previous comment about having both the procfs and netlink
> > > > > interfaces apply here.  I don't see why we need two different APIs at
> > > > > the start; explain to me why procfs isn't sufficient.  If the argument
> > > > > is simply the desire to avoid mounting procfs in the container, how
> > > > > many container orchestrators can function today without a valid /proc?
> > > >
> > > > Ok, sorry, I meant to address that question from a previous patch
> > > > comment at the same time.
> > > >
> > > > It was raised by Eric Biederman that the proc filesystem interface for
> > > > audit had its limitations and he had suggested an audit netlink
> > > > interface made more sense.
> > >
> > > I'm sure you've got it handy, so I'm going to be lazy and ask: archive
> > > pointer to Eric's comments?  Just a heads-up, I'm really *not* a fan
> > > of using the netlink interface for this, so unless Eric presents a
> > > super compelling reason for why we shouldn't use procfs I'm inclined
> > > to stick with /proc.
> >
> > It was actually a video call with Eric and Steve where that was
> > recommended, so I can't provide you with any first-hand communication
> > about it.  I'll get more details...
> 
> Yeah, that sort of information really needs to be on the list.
> 
> > So, with that out of the way, could you please comment on the general
> > idea of what was intended to be the central idea of this mechanism to be
> > able to nest containers beyond the initial user namespace (knowing that
> > a /proc interface is available and the audit netlink interface isn't
> > necessary for it to work and the latter can be easily removed)?
> 
> I'm not entirely clear what you are asking about, are you asking why I
> care about nesting container orchestrators?  Simply put, it is not
> uncommon for the LXC/LXD folks to see nested container orchestrators,
> so I felt it was important to support that use case.  When we
> originally started this effort we probably should have done a better
> job reaching out to the LXC/LXD folks, we may have caught this
> earlier.  Regardless, we caught it, and it looks like we are on our
> way to supporting it (that's good).

I'm not asking why you care about container orchestrators.

> Are you asking why I prefer the procfs approach to setting/getting the
> audit container ID?  For one, it makes it easier for a LSM to enforce
> the audit container ID operations independent of the other audit
> control APIs.  It also provides a simpler interface for container
> orchestrators.  Both seem like desirable traits as far as I'm
> concerned.

I'd like to leave the proc/netlink decision/debate out of this
discussion, though it does need to happen and I was hoping that would
happen on the loginuid/sessionid proc/netlink patch thread.

I'd like your perspective on how the capcontid feature was implemented
(aside from the proc/netlink api issue which was intended to be
consistent across loginuid/sessionid/contid/capcontid).  Do you see this
feature as potentially solving the nested container issue in child user
namespaces?

> > > > The intent was to switch to the audit netlink interface for contid,
> > > > capcontid and to add the audit netlink interface for loginuid and
> > > > sessionid while deprecating the proc interface for loginuid and
> > > > sessionid.  This was alluded to in the cover letter, but not very clear,
> > > > I'm afraid.  I have patches to remove the contid and loginuid/sessionid
> > > > interfaces in another tree which is why I had forgotten to outline that
> > > > plan more explicitly in the cover letter.
> 
> paul moore

- RGB

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Paul Moore @ 2019-10-22 14:04 UTC (permalink / raw)
  To: Neil Horman
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Dan Walsh, mpatel
In-Reply-To: <20191022121302.GA9397@hmswarspite.think-freely.org>

On Tue, Oct 22, 2019 at 8:13 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> On Mon, Oct 21, 2019 at 08:31:37PM -0400, Paul Moore wrote:
> > On Mon, Oct 21, 2019 at 7:58 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-10-21 17:43, Paul Moore wrote:
> > > > On Mon, Oct 21, 2019 at 5:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2019-10-21 15:53, Paul Moore wrote:
> > > > > > On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > > > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > > > > process in a non-init user namespace the capability to set audit
> > > > > > > > container identifiers.
> > > > > > > >
> > > > > > > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > > > > > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > > > > > > structure:
> > > > > > > > struct audit_capcontid_status {
> > > > > > > >         pid_t   pid;
> > > > > > > >         u32     enable;
> > > > > > > > };
> > > > > > >
> > > > > > > Paul, can I get a review of the general idea here to see if you're ok
> > > > > > > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > > > > > > setting contid from beyond the init user namespace where capable() can't
> > > > > > > reach and ns_capable() is meaningless for these purposes?
> > > > > >
> > > > > > I think my previous comment about having both the procfs and netlink
> > > > > > interfaces apply here.  I don't see why we need two different APIs at
> > > > > > the start; explain to me why procfs isn't sufficient.  If the argument
> > > > > > is simply the desire to avoid mounting procfs in the container, how
> > > > > > many container orchestrators can function today without a valid /proc?
> > > > >
> > > > > Ok, sorry, I meant to address that question from a previous patch
> > > > > comment at the same time.
> > > > >
> > > > > It was raised by Eric Biederman that the proc filesystem interface for
> > > > > audit had its limitations and he had suggested an audit netlink
> > > > > interface made more sense.
> > > >
> > > > I'm sure you've got it handy, so I'm going to be lazy and ask: archive
> > > > pointer to Eric's comments?  Just a heads-up, I'm really *not* a fan
> > > > of using the netlink interface for this, so unless Eric presents a
> > > > super compelling reason for why we shouldn't use procfs I'm inclined
> > > > to stick with /proc.
> > >
> > > It was actually a video call with Eric and Steve where that was
> > > recommended, so I can't provide you with any first-hand communication
> > > about it.  I'll get more details...
> >
> > Yeah, that sort of information really needs to be on the list.
> >
> > > So, with that out of the way, could you please comment on the general
> > > idea of what was intended to be the central idea of this mechanism to be
> > > able to nest containers beyond the initial user namespace (knowing that
> > > a /proc interface is available and the audit netlink interface isn't
> > > necessary for it to work and the latter can be easily removed)?
> >
> > I'm not entirely clear what you are asking about, are you asking why I
> > care about nesting container orchestrators?  Simply put, it is not
> > uncommon for the LXC/LXD folks to see nested container orchestrators,
> > so I felt it was important to support that use case.  When we
> > originally started this effort we probably should have done a better
> > job reaching out to the LXC/LXD folks, we may have caught this
> > earlier.  Regardless, we caught it, and it looks like we are on our
> > way to supporting it (that's good).
> >
> > Are you asking why I prefer the procfs approach to setting/getting the
> > audit container ID?  For one, it makes it easier for a LSM to enforce
> > the audit container ID operations independent of the other audit
> > control APIs.  It also provides a simpler interface for container
> > orchestrators.  Both seem like desirable traits as far as I'm
> > concerned.
> >
> I agree that one api is probably the best approach here, but I actually
> think that the netlink interface is the more flexible approach.  Its a
> little more work for userspace (you have to marshal your data into a
> netlink message before sending it, and wait for an async response), but
> thats a well known pattern, and it provides significantly more
> flexibility for the kernel.  LSM already has a hook to audit netlink
> messages in sock_sendmsg, so thats not a problem ...

Look closely at how the LSM controls for netlink work and you'll see a
number of problems; basically command level granularity it hard.  On
the other hand, per-file granularity it easy.

> ... and if you use
> netlink, you get the advantage of being able to broadcast messages
> within your network namespaces, facilitating any needed orchestrator
> co-ordination.

Please don't read this comment as support of the netlink approach, but
I don't think we want to use the multicast netlink; we would want it
to be more of client/server model so that we could enforce access
controls a bit easier.  Besides, is this even a use case?

> To do the same thing with a filesystem api, you need to
> use the fanotify api, which IIRC doesn't work on proc.

Once again, I'm not sure this is a problem we are trying to solve
(broadcasting audit container ID across multiple tasks), is it?
Access to the audit container ID in userspace is something I've always
thought needs to be tightly controlled to prevent abuse.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Neil Horman @ 2019-10-22 12:13 UTC (permalink / raw)
  To: Paul Moore
  Cc: Richard Guy Briggs, containers, linux-api,
	Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, sgrubb, omosnace, dhowells, simo, Eric Paris,
	Serge Hallyn, ebiederm, Dan Walsh, mpatel
In-Reply-To: <CAHC9VhSiwnY-+2awxvGeO4a0NgfVkOPd8fzzBVujp=HtjskTuQ@mail.gmail.com>

On Mon, Oct 21, 2019 at 08:31:37PM -0400, Paul Moore wrote:
> On Mon, Oct 21, 2019 at 7:58 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-10-21 17:43, Paul Moore wrote:
> > > On Mon, Oct 21, 2019 at 5:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2019-10-21 15:53, Paul Moore wrote:
> > > > > On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > > > process in a non-init user namespace the capability to set audit
> > > > > > > container identifiers.
> > > > > > >
> > > > > > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > > > > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > > > > > structure:
> > > > > > > struct audit_capcontid_status {
> > > > > > >         pid_t   pid;
> > > > > > >         u32     enable;
> > > > > > > };
> > > > > >
> > > > > > Paul, can I get a review of the general idea here to see if you're ok
> > > > > > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > > > > > setting contid from beyond the init user namespace where capable() can't
> > > > > > reach and ns_capable() is meaningless for these purposes?
> > > > >
> > > > > I think my previous comment about having both the procfs and netlink
> > > > > interfaces apply here.  I don't see why we need two different APIs at
> > > > > the start; explain to me why procfs isn't sufficient.  If the argument
> > > > > is simply the desire to avoid mounting procfs in the container, how
> > > > > many container orchestrators can function today without a valid /proc?
> > > >
> > > > Ok, sorry, I meant to address that question from a previous patch
> > > > comment at the same time.
> > > >
> > > > It was raised by Eric Biederman that the proc filesystem interface for
> > > > audit had its limitations and he had suggested an audit netlink
> > > > interface made more sense.
> > >
> > > I'm sure you've got it handy, so I'm going to be lazy and ask: archive
> > > pointer to Eric's comments?  Just a heads-up, I'm really *not* a fan
> > > of using the netlink interface for this, so unless Eric presents a
> > > super compelling reason for why we shouldn't use procfs I'm inclined
> > > to stick with /proc.
> >
> > It was actually a video call with Eric and Steve where that was
> > recommended, so I can't provide you with any first-hand communication
> > about it.  I'll get more details...
> 
> Yeah, that sort of information really needs to be on the list.
> 
> > So, with that out of the way, could you please comment on the general
> > idea of what was intended to be the central idea of this mechanism to be
> > able to nest containers beyond the initial user namespace (knowing that
> > a /proc interface is available and the audit netlink interface isn't
> > necessary for it to work and the latter can be easily removed)?
> 
> I'm not entirely clear what you are asking about, are you asking why I
> care about nesting container orchestrators?  Simply put, it is not
> uncommon for the LXC/LXD folks to see nested container orchestrators,
> so I felt it was important to support that use case.  When we
> originally started this effort we probably should have done a better
> job reaching out to the LXC/LXD folks, we may have caught this
> earlier.  Regardless, we caught it, and it looks like we are on our
> way to supporting it (that's good).
> 
> Are you asking why I prefer the procfs approach to setting/getting the
> audit container ID?  For one, it makes it easier for a LSM to enforce
> the audit container ID operations independent of the other audit
> control APIs.  It also provides a simpler interface for container
> orchestrators.  Both seem like desirable traits as far as I'm
> concerned.
> 
I agree that one api is probably the best approach here, but I actually
think that the netlink interface is the more flexible approach.  Its a
little more work for userspace (you have to marshal your data into a
netlink message before sending it, and wait for an async response), but
thats a well known pattern, and it provides significantly more
flexibility for the kernel.  LSM already has a hook to audit netlink
messages in sock_sendmsg, so thats not a problem, and if you use
netlink, you get the advantage of being able to broadcast messages
within your network namespaces, facilitating any needed orchestrator
co-ordination.  To do the same thing with a filesystem api, you need to
use the fanotify api, which IIRC doesn't work on proc.

Neil

> > > > The intent was to switch to the audit netlink interface for contid,
> > > > capcontid and to add the audit netlink interface for loginuid and
> > > > sessionid while deprecating the proc interface for loginuid and
> > > > sessionid.  This was alluded to in the cover letter, but not very clear,
> > > > I'm afraid.  I have patches to remove the contid and loginuid/sessionid
> > > > interfaces in another tree which is why I had forgotten to outline that
> > > > plan more explicitly in the cover letter.
> 
> -- 
> paul moore
> www.paul-moore.com
> 

^ permalink raw reply

* Re: [PATCHv7 00/33] kernel: Introduce Time Namespace
From: Andrei Vagin @ 2019-10-22  8:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Adrian Reber,
	Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <20191017234748.GA26011@gmail.com>

On Thu, Oct 17, 2019 at 04:47:48PM -0700, Andrei Vagin wrote:
> 
> In my table, the "before" column is actually for the upstream kernel
> with the 18-th patch. Here is the table with the real "before" column:
> 
>         | before    | with 18/33 | CONFIG_TIME_NS=n | host      | inside timens
> ------------------------------------------------------------------------------
> avg     | 150331408 | 153345935  | 153588088        | 150816637 | 139192114
> ------------------------------------------------------------------------------
> diff %  |       98  |      100   | 100.1            | 98.3      | 90.7
> ------------------------------------------------------------------------------
> stdev % |       0.3 |     0.09   | 0.15             | 0.25      | 0.13
> 
> If we compare numbers in "before", "host" and "inside timens" columns, we
> see the same results that you had. clock_gettime() works with the
> same performance in the host namespace and 7% slower in a time
> namespace.
> 

I played with this a bit more and I've found that we can speed up
clock_gettime on 5% more if we mark do_hres and do_coarse as
__always_inline.

With the unlikely hint in vdso_read_begin and noinline for
do_hres_timens and do_coarse_timens:
1..8
ok 1 host:      clock:  monotonic       cycles:  155278332
ok 2 host:      clock: monotonic-coarse cycles:  662067077
ok 3 host:      clock: monotonic-raw    cycles:  151218057
ok 4 host:      clock:   boottime       cycles:  154907635
ok 5 ns:        clock:  monotonic       cycles:  133100433
ok 6 host:      clock: monotonic-coarse cycles:  444170219
ok 7 host:      clock: monotonic-raw    cycles:  129550178
ok 8 ns:        clock:   boottime       cycles:  130167136


With __always_inline for do_hres and do_coarse:
1..8
ok 1 host:      clock:  monotonic       cycles:  163691015
ok 2 host:      clock: monotonic-coarse cycles:  641443397
ok 3 host:      clock: monotonic-raw    cycles:  163649270
ok 4 host:      clock:   boottime       cycles:  163682242
ok 5 ns:        clock:  monotonic       cycles:  138759212
ok 6 host:      clock: monotonic-coarse cycles:  486149502
ok 7 host:      clock: monotonic-raw    cycles:  134801053
ok 8 ns:        clock:   boottime       cycles:  138675460
# Pass 8 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0

With __always_inline for do_hres, do_coarse, do_hres_timens,
do_coarse_timens:
1..8
ok 1 host:	clock:  monotonic	cycles:	 158984538
ok 2 host:	clock: monotonic-coarse	cycles:	 594932695
ok 3 host:	clock: monotonic-raw	cycles:	 157834511
ok 4 host:	clock:   boottime	cycles:	 158297691
ok 5 ns:	clock:  monotonic	cycles:	 148559612
ok 6 host:	clock: monotonic-coarse	cycles:	 468505657
ok 7 host:	clock: monotonic-raw	cycles:	 146366575
ok 8 ns:	clock:   boottime	cycles:	 148573015
# Pass 8 Fail 0 Xfail 0 Xpass 0 Skip 0 Error 0

Thanks,
Andrei

^ permalink raw reply

* Re: [PATCH man-pages] Document encoded I/O
From: Amir Goldstein @ 2019-10-22  6:40 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, Linux Btrfs, Dave Chinner, Jann Horn, Linux API,
	kernel-team, Theodore Tso
In-Reply-To: <20191021185356.GB81648@vader>

On Mon, Oct 21, 2019 at 9:54 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> On Mon, Oct 21, 2019 at 09:18:13AM +0300, Amir Goldstein wrote:
> > CC: Ted
> >
> > What ever happened to read/write ext4 encrypted data API?
> > https://marc.info/?l=linux-ext4&m=145030599010416&w=2
> >
> > Can we learn anything from the ext4 experience to improve
> > the new proposed API?
>
> I wasn't aware of these patches, thanks for pointing them out. Ted, do
> you have any thoughts about making this API work for fscrypt?
>
> > On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval <osandov@osandov.com> wrote:
> > >
> > > From: Omar Sandoval <osandov@fb.com>
> > >
> > > This adds a new page, rwf_encoded(7), providing an overview of encoded
> > > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > > reference it.
> > >
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > >  man2/fcntl.2       |  10 +-
> > >  man2/open.2        |  13 ++
> > >  man2/readv.2       |  46 +++++++
> > >  man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
> > >  4 files changed, 365 insertions(+), 1 deletion(-)
> > >  create mode 100644 man7/rwf_encoded.7
> > >
> > > diff --git a/man2/fcntl.2 b/man2/fcntl.2
> > > index fce4f4c2b..76fe9cc6f 100644
> > > --- a/man2/fcntl.2
> > > +++ b/man2/fcntl.2
> > > @@ -222,8 +222,9 @@ On Linux, this command can change only the
> > >  .BR O_ASYNC ,
> > >  .BR O_DIRECT ,
> > >  .BR O_NOATIME ,
> > > +.BR O_NONBLOCK ,
> > >  and
> > > -.B O_NONBLOCK
> > > +.B O_ENCODED
> > >  flags.
> > >  It is not possible to change the
> > >  .BR O_DSYNC
> > > @@ -1803,6 +1804,13 @@ Attempted to clear the
> > >  flag on a file that has the append-only attribute set.
> > >  .TP
> > >  .B EPERM
> > > +Attempted to set the
> > > +.B O_ENCODED
> > > +flag and the calling process did not have the
> > > +.B CAP_SYS_ADMIN
> > > +capability.
> > > +.TP
> > > +.B EPERM
> > >  .I cmd
> > >  was
> > >  .BR F_ADD_SEALS ,
> > > diff --git a/man2/open.2 b/man2/open.2
> > > index b0f485b41..cdd3c549c 100644
> > > --- a/man2/open.2
> > > +++ b/man2/open.2
> > > @@ -421,6 +421,14 @@ was followed by a call to
> > >  .BR fdatasync (2)).
> > >  .IR "See NOTES below" .
> > >  .TP
> > > +.B O_ENCODED
> > > +Open the file with encoded I/O permissions;
> >
> > 1. I find the name of the flag confusing.
> > Yes, most people don't read documentation so carefully (or at all)
> > so they will assume O_ENCODED will affect read/write or that it
> > relates to RWF_ENCODED in a similar way that O_SYNC relates
> > to RWF_SYNC (i.e. logical OR and not logical AND).
> >
> > I am not good at naming and to prove it I will propose:
> > O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED
>
> Agreed, the name is misleading. I can't think of anything better than
> O_ALLOW_ENCODED, so I'll go with that unless someone comes up with
> something better :)
>
> > 2. While I see no harm in adding O_ flag to open(2) for this
> > use case, I also don't see a major benefit in adding it.
> > What if we only allowed setting the flag via fcntl(2) which returns
> > an error on old kernels?
> > Since unlike most O_ flags, O_ENCODED does NOT affect file
> > i/o without additional opt-in flags, it is not standard anyway and
> > therefore I find that setting it only via fcntl(2) is less error prone.
>
> If I make this fcntl-only, then it probably shouldn't be through
> F_GETFL/F_SETFL (it'd be pretty awkward for an O_ flag to not be valid
> for open(), and also awkward to mix some non-O_ flag with O_ flags for
> F_GETFL/F_SETFL). So that leaves a couple of options:
>
> 1. Get/set it with F_GETFD/F_SETFD, which is currently only used for
>    FD_CLOEXEC. That also silently ignores unknown flags, but as with the
>    O_ flag option, I don't think that's a big deal for FD_ALLOW_ENCODED.
> 2. Add a new fcntl command (F_GETFD2/F_SETFD2?). This seems like
>    overkill to me.
>
> However, both of these options are annoying to implement. Ideally, we
> wouldn't have to add another flags field to struct file. But, to reuse
> f_flags, we'd need to make sure that FD_ALLOW_ENCODED doesn't collide
> with other O_ flags, and we'd probably want to hide it from F_GETFL. At
> that point, it might as well be an O_ flag.
>
> It seems to me that it's more trouble than it's worth to make this not
> an O_ flag, but please let me know if you see a nice way to do so.
>

No, I see why you choose to add the flag to open(2).
I have no objection.

I once had a crazy thought how to add new open flags
in a non racy manner without adding a new syscall,
but as you wrote, this is not relevant for O_ALLOW_ENCODED.

Something like:

/*
 * Old kernels silently ignore unsupported open flags.
 * New kernels that gets __O_CHECK_NEWFLAGS do
 * the proper checking for unsupported flags AND set the
 * flag __O_HAVE_NEWFLAGS.
 */
#define O_FLAG1 __O_CHECK_NEWFLAGS|__O_FLAG1
#define O_HAVE_FLAG1 __O_HAVE_NEWFLAGS|__O_FLAG1

fd = open(path, O_FLAG1);
if (fd < 0)
    return -errno;
flags = fcntl(fd, F_GETFL, 0);
if (flags < 0)
    return flags;
if ((flags & O_HAVE_FLAG1) != O_HAVE_FLAG1) {
    close(fd);
    return -EINVAL;
}

Not pretty, but hidden inside libc, end-users won't need to
be aware of this.

Cheers,
Amir.

^ permalink raw reply

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Aleksa Sarai @ 2019-10-22  2:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team
In-Reply-To: <20191021190010.GC6726@magnolia>

[-- Attachment #1: Type: text/plain, Size: 11806 bytes --]

On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > Btrfs supports transparent compression: data written by the user can be
> > > > compressed when written to disk and decompressed when read back.
> > > > However, we'd like to add an interface to write pre-compressed data
> > > > directly to the filesystem, and the matching interface to read
> > > > compressed data without decompressing it. This adds support for
> > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > 
> > > > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > should be used (needed for truncated or hole-punched extents and when
> > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > this information; for writes, the caller provides it to the filesystem.
> > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > used to extend the interface in the future. The remaining iovecs contain
> > > > the encoded extent.
> > > > 
> > > > Filesystems must indicate that they support encoded writes by setting
> > > > FMODE_ENCODED_IO in ->file_open().
> > > > 
> > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > ---
> > > >  include/linux/fs.h      | 14 +++++++
> > > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index e0d909d35763..54681f21e05e 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > > >  /* File does not contribute to nr_files count */
> > > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > > >  
> > > > +/* File supports encoded IO */
> > > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > > +
> > > >  /*
> > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > >   * that indicates that they should check the contents of the iovec are
> > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > >  #define IOCB_SYNC		(1 << 5)
> > > >  #define IOCB_WRITE		(1 << 6)
> > > >  #define IOCB_NOWAIT		(1 << 7)
> > > > +#define IOCB_ENCODED		(1 << 8)
> > > >  
> > > >  struct kiocb {
> > > >  	struct file		*ki_filp;
> > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > +struct encoded_iov;
> > > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > +				struct iov_iter *);
> > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > >  				struct file *file_out, loff_t pos_out,
> > > >  				loff_t *count, unsigned int remap_flags);
> > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > > >  			return -EOPNOTSUPP;
> > > >  		ki->ki_flags |= IOCB_NOWAIT;
> > > >  	}
> > > > +	if (flags & RWF_ENCODED) {
> > > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > +			return -EOPNOTSUPP;
> > > > +		ki->ki_flags |= IOCB_ENCODED;
> > > > +	}
> > > >  	if (flags & RWF_HIPRI)
> > > >  		ki->ki_flags |= IOCB_HIPRI;
> > > >  	if (flags & RWF_DSYNC)
> > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > --- a/include/uapi/linux/fs.h
> > > > +++ b/include/uapi/linux/fs.h
> > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > >  
> > > >  typedef int __bitwise __kernel_rwf_t;
> > > >  
> > > > +enum {
> > > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +};
> > > > +
> > > > +enum {
> > > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > > +};
> > > > +
> > > > +struct encoded_iov {
> > > > +	__u64 len;
> > > > +	__u64 unencoded_len;
> > > > +	__u64 unencoded_offset;
> > > > +	__u32 compression;
> > > > +	__u32 encryption;
> > > 
> > > Can we add some must-be-zero padding space at the end here for whomever
> > > comes along next wanting to add more encoding info?
> > 
> > I would suggest to copy the extension design of copy_struct_from_user().
> > Adding must-be-zero padding is a less-ideal solution to the extension
> > problem than length-based extension.
> 
> Come to think of it, you /do/ have to specify iov_len so... yeah, do
> that instead; we can always extend the structure later.

Just to clarify -- if we want to make the interface forward-compatible
from the outset (programs built 4 years from now running on 5.5), we
will need to implement this in the original merge. Otherwise userspace
will need to handle backwards-compatibility themselves once new features
are added.

@Omar: If it'd make your life easier, I can send some draft patches
	   which port copy_struct_from_user() to iovec-land.

> > Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
> > with syscall structure arguments)?
> 
> <shrug> No idea, that's the first I've heard of that type and it doesn't
> seem to be used by the fs code.  Why would we care about alignment for
> an incore structure?
> 
> --D
> 
> > 
> > > (And maybe a manpage and some basic testing, to reiterate Dave...)
> > > 
> > > --D
> > > 
> > > > +};
> > > > +
> > > >  /* high priority request, poll if possible */
> > > >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> > > >  
> > > > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> > > >  /* per-IO O_APPEND */
> > > >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> > > >  
> > > > +/* encoded (e.g., compressed or encrypted) IO */
> > > > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > > > +
> > > >  /* mask of flags supported by the kernel */
> > > >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > > > -			 RWF_APPEND)
> > > > +			 RWF_APPEND | RWF_ENCODED)
> > > >  
> > > >  #endif /* _UAPI_LINUX_FS_H */
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 1146fcfa3215..d2e6d9caf353 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -/*
> > > > - * Performs necessary checks before doing a write
> > > > - *
> > > > - * Can adjust writing position or amount of bytes to write.
> > > > - * Returns appropriate error code that caller should return or
> > > > - * zero in case that write should be allowed.
> > > > - */
> > > > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> > > >  {
> > > >  	struct file *file = iocb->ki_filp;
> > > >  	struct inode *inode = file->f_mapping->host;
> > > > -	loff_t count;
> > > > -	int ret;
> > > >  
> > > >  	if (IS_SWAPFILE(inode))
> > > >  		return -ETXTBSY;
> > > >  
> > > > -	if (!iov_iter_count(from))
> > > > +	if (!*count)
> > > >  		return 0;
> > > >  
> > > >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > > > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> > > >  		return -EINVAL;
> > > >  
> > > > -	count = iov_iter_count(from);
> > > > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > > > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Performs necessary checks before doing a write
> > > > + *
> > > > + * Can adjust writing position or amount of bytes to write.
> > > > + * Returns a negative errno or the new number of bytes to write.
> > > > + */
> > > > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > > +{
> > > > +	loff_t count = iov_iter_count(from);
> > > > +	int ret;
> > > > +
> > > > +	ret = generic_write_checks_common(iocb, &count);
> > > >  	if (ret)
> > > >  		return ret;
> > > >  
> > > > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > >  }
> > > >  EXPORT_SYMBOL(generic_write_checks);
> > > >  
> > > > +int generic_encoded_write_checks(struct kiocb *iocb,
> > > > +				 struct encoded_iov *encoded)
> > > > +{
> > > > +	loff_t count = encoded->unencoded_len;
> > > > +	int ret;
> > > > +
> > > > +	ret = generic_write_checks_common(iocb, &count);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	if (count != encoded->unencoded_len) {
> > > > +		/*
> > > > +		 * The write got truncated by generic_write_checks_common(). We
> > > > +		 * can't do a partial encoded write.
> > > > +		 */
> > > > +		return -EFBIG;
> > > > +	}
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > > > +
> > > > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > > > +{
> > > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > > +		return -EPERM;
> > > > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > > > +		return -EINVAL;
> > > > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > > > +}
> > > > +EXPORT_SYMBOL(check_encoded_read);
> > > > +
> > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > +			 struct iov_iter *from)
> > > > +{
> > > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > > +		return -EPERM;
> > > > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > +		return -EINVAL;
> > > > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > +		return -EFAULT;
> > > > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > > > +		return -EINVAL;
> > > > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > +		return -EINVAL;
> > > > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > > > +		return -EINVAL;
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(import_encoded_write);
> > > > +
> > > >  /*
> > > >   * Performs necessary checks before doing a clone.
> > > >   *

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Aleksa Sarai @ 2019-10-22  1:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team
In-Reply-To: <20191021190010.GC6726@magnolia>

[-- Attachment #1: Type: text/plain, Size: 11816 bytes --]

On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > > 
> > > > Btrfs supports transparent compression: data written by the user can be
> > > > compressed when written to disk and decompressed when read back.
> > > > However, we'd like to add an interface to write pre-compressed data
> > > > directly to the filesystem, and the matching interface to read
> > > > compressed data without decompressing it. This adds support for
> > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > 
> > > > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > should be used (needed for truncated or hole-punched extents and when
> > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > this information; for writes, the caller provides it to the filesystem.
> > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > used to extend the interface in the future. The remaining iovecs contain
> > > > the encoded extent.
> > > > 
> > > > Filesystems must indicate that they support encoded writes by setting
> > > > FMODE_ENCODED_IO in ->file_open().
> > > > 
> > > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > > ---
> > > >  include/linux/fs.h      | 14 +++++++
> > > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index e0d909d35763..54681f21e05e 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > > >  /* File does not contribute to nr_files count */
> > > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > > >  
> > > > +/* File supports encoded IO */
> > > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > > +
> > > >  /*
> > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > >   * that indicates that they should check the contents of the iovec are
> > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > >  #define IOCB_SYNC		(1 << 5)
> > > >  #define IOCB_WRITE		(1 << 6)
> > > >  #define IOCB_NOWAIT		(1 << 7)
> > > > +#define IOCB_ENCODED		(1 << 8)
> > > >  
> > > >  struct kiocb {
> > > >  	struct file		*ki_filp;
> > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > +struct encoded_iov;
> > > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > +				struct iov_iter *);
> > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > >  				struct file *file_out, loff_t pos_out,
> > > >  				loff_t *count, unsigned int remap_flags);
> > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > > >  			return -EOPNOTSUPP;
> > > >  		ki->ki_flags |= IOCB_NOWAIT;
> > > >  	}
> > > > +	if (flags & RWF_ENCODED) {
> > > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > +			return -EOPNOTSUPP;
> > > > +		ki->ki_flags |= IOCB_ENCODED;
> > > > +	}
> > > >  	if (flags & RWF_HIPRI)
> > > >  		ki->ki_flags |= IOCB_HIPRI;
> > > >  	if (flags & RWF_DSYNC)
> > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > --- a/include/uapi/linux/fs.h
> > > > +++ b/include/uapi/linux/fs.h
> > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > >  
> > > >  typedef int __bitwise __kernel_rwf_t;
> > > >  
> > > > +enum {
> > > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +};
> > > > +
> > > > +enum {
> > > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > > +};
> > > > +
> > > > +struct encoded_iov {
> > > > +	__u64 len;
> > > > +	__u64 unencoded_len;
> > > > +	__u64 unencoded_offset;
> > > > +	__u32 compression;
> > > > +	__u32 encryption;
> > > 
> > > Can we add some must-be-zero padding space at the end here for whomever
> > > comes along next wanting to add more encoding info?
> > 
> > I would suggest to copy the extension design of copy_struct_from_user().
> > Adding must-be-zero padding is a less-ideal solution to the extension
> > problem than length-based extension.
> 
> Come to think of it, you /do/ have to specify iov_len so... yeah, do
> that instead; we can always extend the structure later.
> 
> > Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
> > with syscall structure arguments)?
> 
> <shrug> No idea, that's the first I've heard of that type and it doesn't
> seem to be used by the fs code.  Why would we care about alignment for
> an incore structure?

When passing u64s from userspace, it's generally considered a good idea
to use __aligned_u64 -- the main reason is that 32-bit userspace on a
64-bit kernel will use different structure alignment for 64-bit fields.

This means you'd need to implement a bunch of COMPAT_SYSCALL-like
handling for that case. It's much simpler to use __aligned_u64 (and on
the plus side I don't think you need to add any fields to ensure the
padding is zero).

> > 
> > > (And maybe a manpage and some basic testing, to reiterate Dave...)
> > > 
> > > --D
> > > 
> > > > +};
> > > > +
> > > >  /* high priority request, poll if possible */
> > > >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> > > >  
> > > > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> > > >  /* per-IO O_APPEND */
> > > >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> > > >  
> > > > +/* encoded (e.g., compressed or encrypted) IO */
> > > > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > > > +
> > > >  /* mask of flags supported by the kernel */
> > > >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > > > -			 RWF_APPEND)
> > > > +			 RWF_APPEND | RWF_ENCODED)
> > > >  
> > > >  #endif /* _UAPI_LINUX_FS_H */
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 1146fcfa3215..d2e6d9caf353 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > -/*
> > > > - * Performs necessary checks before doing a write
> > > > - *
> > > > - * Can adjust writing position or amount of bytes to write.
> > > > - * Returns appropriate error code that caller should return or
> > > > - * zero in case that write should be allowed.
> > > > - */
> > > > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> > > >  {
> > > >  	struct file *file = iocb->ki_filp;
> > > >  	struct inode *inode = file->f_mapping->host;
> > > > -	loff_t count;
> > > > -	int ret;
> > > >  
> > > >  	if (IS_SWAPFILE(inode))
> > > >  		return -ETXTBSY;
> > > >  
> > > > -	if (!iov_iter_count(from))
> > > > +	if (!*count)
> > > >  		return 0;
> > > >  
> > > >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > > > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> > > >  		return -EINVAL;
> > > >  
> > > > -	count = iov_iter_count(from);
> > > > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > > > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Performs necessary checks before doing a write
> > > > + *
> > > > + * Can adjust writing position or amount of bytes to write.
> > > > + * Returns a negative errno or the new number of bytes to write.
> > > > + */
> > > > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > > +{
> > > > +	loff_t count = iov_iter_count(from);
> > > > +	int ret;
> > > > +
> > > > +	ret = generic_write_checks_common(iocb, &count);
> > > >  	if (ret)
> > > >  		return ret;
> > > >  
> > > > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > >  }
> > > >  EXPORT_SYMBOL(generic_write_checks);
> > > >  
> > > > +int generic_encoded_write_checks(struct kiocb *iocb,
> > > > +				 struct encoded_iov *encoded)
> > > > +{
> > > > +	loff_t count = encoded->unencoded_len;
> > > > +	int ret;
> > > > +
> > > > +	ret = generic_write_checks_common(iocb, &count);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	if (count != encoded->unencoded_len) {
> > > > +		/*
> > > > +		 * The write got truncated by generic_write_checks_common(). We
> > > > +		 * can't do a partial encoded write.
> > > > +		 */
> > > > +		return -EFBIG;
> > > > +	}
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > > > +
> > > > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > > > +{
> > > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > > +		return -EPERM;
> > > > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > > > +		return -EINVAL;
> > > > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > > > +}
> > > > +EXPORT_SYMBOL(check_encoded_read);
> > > > +
> > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > +			 struct iov_iter *from)
> > > > +{
> > > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > > +		return -EPERM;
> > > > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > +		return -EINVAL;
> > > > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > +		return -EFAULT;
> > > > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > > > +		return -EINVAL;
> > > > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > +		return -EINVAL;
> > > > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > > > +		return -EINVAL;
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(import_encoded_write);
> > > > +
> > > >  /*
> > > >   * Performs necessary checks before doing a clone.
> > > >   *

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Paul Moore @ 2019-10-22  0:31 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman, Dan Walsh, mpatel
In-Reply-To: <20191021235734.mgcjotdqoe73e4ha@madcap2.tricolour.ca>

On Mon, Oct 21, 2019 at 7:58 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-10-21 17:43, Paul Moore wrote:
> > On Mon, Oct 21, 2019 at 5:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-10-21 15:53, Paul Moore wrote:
> > > > On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > > process in a non-init user namespace the capability to set audit
> > > > > > container identifiers.
> > > > > >
> > > > > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > > > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > > > > structure:
> > > > > > struct audit_capcontid_status {
> > > > > >         pid_t   pid;
> > > > > >         u32     enable;
> > > > > > };
> > > > >
> > > > > Paul, can I get a review of the general idea here to see if you're ok
> > > > > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > > > > setting contid from beyond the init user namespace where capable() can't
> > > > > reach and ns_capable() is meaningless for these purposes?
> > > >
> > > > I think my previous comment about having both the procfs and netlink
> > > > interfaces apply here.  I don't see why we need two different APIs at
> > > > the start; explain to me why procfs isn't sufficient.  If the argument
> > > > is simply the desire to avoid mounting procfs in the container, how
> > > > many container orchestrators can function today without a valid /proc?
> > >
> > > Ok, sorry, I meant to address that question from a previous patch
> > > comment at the same time.
> > >
> > > It was raised by Eric Biederman that the proc filesystem interface for
> > > audit had its limitations and he had suggested an audit netlink
> > > interface made more sense.
> >
> > I'm sure you've got it handy, so I'm going to be lazy and ask: archive
> > pointer to Eric's comments?  Just a heads-up, I'm really *not* a fan
> > of using the netlink interface for this, so unless Eric presents a
> > super compelling reason for why we shouldn't use procfs I'm inclined
> > to stick with /proc.
>
> It was actually a video call with Eric and Steve where that was
> recommended, so I can't provide you with any first-hand communication
> about it.  I'll get more details...

Yeah, that sort of information really needs to be on the list.

> So, with that out of the way, could you please comment on the general
> idea of what was intended to be the central idea of this mechanism to be
> able to nest containers beyond the initial user namespace (knowing that
> a /proc interface is available and the audit netlink interface isn't
> necessary for it to work and the latter can be easily removed)?

I'm not entirely clear what you are asking about, are you asking why I
care about nesting container orchestrators?  Simply put, it is not
uncommon for the LXC/LXD folks to see nested container orchestrators,
so I felt it was important to support that use case.  When we
originally started this effort we probably should have done a better
job reaching out to the LXC/LXD folks, we may have caught this
earlier.  Regardless, we caught it, and it looks like we are on our
way to supporting it (that's good).

Are you asking why I prefer the procfs approach to setting/getting the
audit container ID?  For one, it makes it easier for a LSM to enforce
the audit container ID operations independent of the other audit
control APIs.  It also provides a simpler interface for container
orchestrators.  Both seem like desirable traits as far as I'm
concerned.

> > > The intent was to switch to the audit netlink interface for contid,
> > > capcontid and to add the audit netlink interface for loginuid and
> > > sessionid while deprecating the proc interface for loginuid and
> > > sessionid.  This was alluded to in the cover letter, but not very clear,
> > > I'm afraid.  I have patches to remove the contid and loginuid/sessionid
> > > interfaces in another tree which is why I had forgotten to outline that
> > > plan more explicitly in the cover letter.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Richard Guy Briggs @ 2019-10-21 23:57 UTC (permalink / raw)
  To: Paul Moore
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman, Dan Walsh, mpatel
In-Reply-To: <CAHC9VhRdNXsY4neJpSoNyJoAVEoiEc2oW5kSscF99tjmoQAxFA@mail.gmail.com>

On 2019-10-21 17:43, Paul Moore wrote:
> On Mon, Oct 21, 2019 at 5:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-10-21 15:53, Paul Moore wrote:
> > > On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > process in a non-init user namespace the capability to set audit
> > > > > container identifiers.
> > > > >
> > > > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > > > structure:
> > > > > struct audit_capcontid_status {
> > > > >         pid_t   pid;
> > > > >         u32     enable;
> > > > > };
> > > >
> > > > Paul, can I get a review of the general idea here to see if you're ok
> > > > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > > > setting contid from beyond the init user namespace where capable() can't
> > > > reach and ns_capable() is meaningless for these purposes?
> > >
> > > I think my previous comment about having both the procfs and netlink
> > > interfaces apply here.  I don't see why we need two different APIs at
> > > the start; explain to me why procfs isn't sufficient.  If the argument
> > > is simply the desire to avoid mounting procfs in the container, how
> > > many container orchestrators can function today without a valid /proc?
> >
> > Ok, sorry, I meant to address that question from a previous patch
> > comment at the same time.
> >
> > It was raised by Eric Biederman that the proc filesystem interface for
> > audit had its limitations and he had suggested an audit netlink
> > interface made more sense.
> 
> I'm sure you've got it handy, so I'm going to be lazy and ask: archive
> pointer to Eric's comments?  Just a heads-up, I'm really *not* a fan
> of using the netlink interface for this, so unless Eric presents a
> super compelling reason for why we shouldn't use procfs I'm inclined
> to stick with /proc.

It was actually a video call with Eric and Steve where that was
recommended, so I can't provide you with any first-hand communication
about it.  I'll get more details...

So, with that out of the way, could you please comment on the general
idea of what was intended to be the central idea of this mechanism to be
able to nest containers beyond the initial user namespace (knowing that
a /proc interface is available and the audit netlink interface isn't
necessary for it to work and the latter can be easily removed)?

> > The intent was to switch to the audit netlink interface for contid,
> > capcontid and to add the audit netlink interface for loginuid and
> > sessionid while deprecating the proc interface for loginuid and
> > sessionid.  This was alluded to in the cover letter, but not very clear,
> > I'm afraid.  I have patches to remove the contid and loginuid/sessionid
> > interfaces in another tree which is why I had forgotten to outline that
> > plan more explicitly in the cover letter.
> 
> paul moore

- RGB

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Paul Moore @ 2019-10-21 21:43 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman, Dan Walsh, mpatel
In-Reply-To: <20191021213824.6zti5ndxu7sqs772@madcap2.tricolour.ca>

On Mon, Oct 21, 2019 at 5:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-10-21 15:53, Paul Moore wrote:
> > On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > process in a non-init user namespace the capability to set audit
> > > > container identifiers.
> > > >
> > > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > > structure:
> > > > struct audit_capcontid_status {
> > > >         pid_t   pid;
> > > >         u32     enable;
> > > > };
> > >
> > > Paul, can I get a review of the general idea here to see if you're ok
> > > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > > setting contid from beyond the init user namespace where capable() can't
> > > reach and ns_capable() is meaningless for these purposes?
> >
> > I think my previous comment about having both the procfs and netlink
> > interfaces apply here.  I don't see why we need two different APIs at
> > the start; explain to me why procfs isn't sufficient.  If the argument
> > is simply the desire to avoid mounting procfs in the container, how
> > many container orchestrators can function today without a valid /proc?
>
> Ok, sorry, I meant to address that question from a previous patch
> comment at the same time.
>
> It was raised by Eric Biederman that the proc filesystem interface for
> audit had its limitations and he had suggested an audit netlink
> interface made more sense.

I'm sure you've got it handy, so I'm going to be lazy and ask: archive
pointer to Eric's comments?  Just a heads-up, I'm really *not* a fan
of using the netlink interface for this, so unless Eric presents a
super compelling reason for why we shouldn't use procfs I'm inclined
to stick with /proc.

> The intent was to switch to the audit netlink interface for contid,
> capcontid and to add the audit netlink interface for loginuid and
> sessionid while deprecating the proc interface for loginuid and
> sessionid.  This was alluded to in the cover letter, but not very clear,
> I'm afraid.  I have patches to remove the contid and loginuid/sessionid
> interfaces in another tree which is why I had forgotten to outline that
> plan more explicitly in the cover letter.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Richard Guy Briggs @ 2019-10-21 21:38 UTC (permalink / raw)
  To: Paul Moore
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman, Dan Walsh, mpatel
In-Reply-To: <CAHC9VhRPygA=LsHLUqv+K=ouAiPFJ6fb2_As=OT-_zB7kGc_aQ@mail.gmail.com>

On 2019-10-21 15:53, Paul Moore wrote:
> On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > process in a non-init user namespace the capability to set audit
> > > container identifiers.
> > >
> > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > structure:
> > > struct audit_capcontid_status {
> > >         pid_t   pid;
> > >         u32     enable;
> > > };
> >
> > Paul, can I get a review of the general idea here to see if you're ok
> > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > setting contid from beyond the init user namespace where capable() can't
> > reach and ns_capable() is meaningless for these purposes?
> 
> I think my previous comment about having both the procfs and netlink
> interfaces apply here.  I don't see why we need two different APIs at
> the start; explain to me why procfs isn't sufficient.  If the argument
> is simply the desire to avoid mounting procfs in the container, how
> many container orchestrators can function today without a valid /proc?

Ok, sorry, I meant to address that question from a previous patch
comment at the same time.

It was raised by Eric Biederman that the proc filesystem interface for
audit had its limitations and he had suggested an audit netlink
interface made more sense.

The intent was to switch to the audit netlink interface for contid,
capcontid and to add the audit netlink interface for loginuid and
sessionid while deprecating the proc interface for loginuid and
sessionid.  This was alluded to in the cover letter, but not very clear,
I'm afraid.  I have patches to remove the contid and loginuid/sessionid
interfaces in another tree which is why I had forgotten to outline that
plan more explicitly in the cover letter.

> paul moore

- RGB

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Paul Moore @ 2019-10-21 19:53 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman, Dan Walsh, mpatel
In-Reply-To: <20191019013904.uevmrzbmztsbhpnh@madcap2.tricolour.ca>

On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > process in a non-init user namespace the capability to set audit
> > container identifiers.
> >
> > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > structure:
> > struct audit_capcontid_status {
> >         pid_t   pid;
> >         u32     enable;
> > };
>
> Paul, can I get a review of the general idea here to see if you're ok
> with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> setting contid from beyond the init user namespace where capable() can't
> reach and ns_capable() is meaningless for these purposes?

I think my previous comment about having both the procfs and netlink
interfaces apply here.  I don't see why we need two different APIs at
the start; explain to me why procfs isn't sufficient.  If the argument
is simply the desire to avoid mounting procfs in the container, how
many container orchestrators can function today without a valid /proc?

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH v3 1/2] clone3: add CLONE_CLEAR_SIGHAND
From: Christian Brauner @ 2019-10-21 19:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Florian Weimer, Arnd Bergmann, libc-alpha,
	David Howells, Jann Horn, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Shuah Khan, Andrew Morton, Michal Hocko,
	Elena Reshetova, Thomas Gleixner, Roman Gushchin,
	Andrea Arcangeli
In-Reply-To: <20191021144633.GA2720@redhat.com>

On Mon, Oct 21, 2019 at 04:46:33PM +0200, Oleg Nesterov wrote:
> On 10/14, Christian Brauner wrote:
> >
> > The child helper process on Linux posix_spawn must ensure that no signal
> > handlers are enabled, so the signal disposition must be either SIG_DFL
> > or SIG_IGN. However, it requires a sigprocmask to obtain the current
> > signal mask and at least _NSIG sigaction calls to reset the signal
> > handlers for each posix_spawn call
> 
> Plus the caller has to block/unblock all signals around clone(VM|VFORK).
> 
> Can this justify the new CLONE_ flag? Honestly, I have no idea. But the
> patch is simple and looks technically correct to me. FWIW,
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>

The problem is not just the number of syscalls but also that the
sigaction logic is already quite complicated and would need to be even
more complicated without this flag. That's covered mostly in the glibc
thread though. Even just the ability to avoid potentially _NSIG syscalls
is enough justification especially since were not scarce on flags.

Thanks! I'll pick this up for 5.5.
Christian

^ permalink raw reply

* Re: [PATCH v3 1/2] clone3: add CLONE_CLEAR_SIGHAND
From: Christian Brauner @ 2019-10-21 19:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Florian Weimer, Arnd Bergmann, libc-alpha,
	David Howells, Jann Horn, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Shuah Khan, Andrew Morton, Michal Hocko,
	Elena Reshetova, Thomas Gleixner, Roman Gushchin,
	Andrea Arcangeli
In-Reply-To: <20191021151255.GA3459@redhat.com>

On Mon, Oct 21, 2019 at 05:12:55PM +0200, Oleg Nesterov wrote:
> On 10/21, Oleg Nesterov wrote:
> >
> > On 10/14, Christian Brauner wrote:
> > >
> > > The child helper process on Linux posix_spawn must ensure that no signal
> > > handlers are enabled, so the signal disposition must be either SIG_DFL
> > > or SIG_IGN. However, it requires a sigprocmask to obtain the current
> > > signal mask and at least _NSIG sigaction calls to reset the signal
> > > handlers for each posix_spawn call
> >
> > Plus the caller has to block/unblock all signals around clone(VM|VFORK).
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> just in case... I meant that posix_spawn() has to block/unblock, not its
> caller.

Yeah, I think that is what's currently happening.

Christian

^ permalink raw reply

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Omar Sandoval @ 2019-10-21 19:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team
In-Reply-To: <20191021182806.GA6706@magnolia>

On Mon, Oct 21, 2019 at 11:28:06AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  include/linux/fs.h      | 14 +++++++
> >  include/uapi/linux/fs.h | 26 ++++++++++++-
> >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC		(1 << 5)
> >  #define IOCB_WRITE		(1 << 6)
> >  #define IOCB_NOWAIT		(1 << 7)
> > +#define IOCB_ENCODED		(1 << 8)
> >  
> >  struct kiocb {
> >  	struct file		*ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +				struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >  				struct file *file_out, loff_t pos_out,
> >  				loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> >  			return -EOPNOTSUPP;
> >  		ki->ki_flags |= IOCB_NOWAIT;
> >  	}
> > +	if (flags & RWF_ENCODED) {
> > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +			return -EOPNOTSUPP;
> > +		ki->ki_flags |= IOCB_ENCODED;
> > +	}
> >  	if (flags & RWF_HIPRI)
> >  		ki->ki_flags |= IOCB_HIPRI;
> >  	if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +	ENCODED_IOV_COMPRESSION_NONE,
> > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > +	ENCODED_IOV_COMPRESSION_LZO,
> > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +	ENCODED_IOV_ENCRYPTION_NONE,
> > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +	__u64 len;
> > +	__u64 unencoded_len;
> > +	__u64 unencoded_offset;
> > +	__u32 compression;
> > +	__u32 encryption;
> 
> Can we add some must-be-zero padding space at the end here for whomever
> comes along next wanting to add more encoding info?

>From the commit message:

iov_len must be set to sizeof(struct encoded_iov), which can be used to
extend the interface in the future.

> (And maybe a manpage and some basic testing, to reiterate Dave...)

I sent a man page as part of this thread:

https://lore.kernel.org/linux-btrfs/c7e8f93596fee7bb818dc0edf29f484036be1abb.1571164851.git.osandov@fb.com/

See my reply to Dave, I have tests in my xfstests repo that I haven't
sent yet.

> --D

^ permalink raw reply

* Re: [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data
From: Omar Sandoval @ 2019-10-21 19:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-btrfs, Jann Horn, linux-api, kernel-team
In-Reply-To: <20191020230501.GA8080@dread.disaster.area>

On Mon, Oct 21, 2019 at 10:05:01AM +1100, Dave Chinner wrote:
> On Tue, Oct 15, 2019 at 11:42:38AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Hello,
> > 
> > This series adds an API for reading compressed data on a filesystem
> > without decompressing it as well as support for writing compressed data
> > directly to the filesystem. It is based on my previous series which
> > added a Btrfs-specific ioctl [1], but it is now an extension to
> > preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
> > man page patch describing the API in detail. Test cases and examples
> > programs are available [3].
> > 
> > The use case that I have in mind is Btrfs send/receive: currently, when
> > sending data from one compressed filesystem to another, the sending side
> > decompresses the data and the receiving side recompresses it before
> > writing it out. This is wasteful and can be avoided if we can just send
> > and write compressed extents. The send part will be implemented in a
> > separate series, as this API can stand alone.
> > 
> > Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
> > Patch 4 implements encoded reads for Btrfs, and patch 5 implements
> > encoded writes.
> > 
> > Changes from v1 [4]:
> > 
> > - Encoded reads are now also implemented.
> > - The encoded_iov structure now includes metadata for referring to a
> >   subset of decoded data. This is required to handle certain cases where
> >   a compressed extent is truncated, hole punched, or otherwise sliced up
> >   and Btrfs chooses to reflect this in metadata instead of decompressing
> >   the whole extent and rewriting the pieces. We call these "bookend
> >   extents" in Btrfs, but any filesystem supporting transparent encoding
> >   is likely to have a similar concept.
> 
> Where's the in-kernel documentation for this API? You're encoding a
> specific set of behaviours into the user API, so this needs a whole
> heap of documentation in the generic code to describe how it works
> so that other filesystems implementing have a well defined guideline
> to what they need to support.

The man-page I sent is quite detailed, but sure, I can add the relevant
information to the generic code, as well.

> Also, I don't see any test code for this -

It's in the cover letter: https://github.com/osandov/xfstests/tree/rwf-encoded

I haven't sent those patches up because it's tedious to rework and
resend them for each little tweak we make to the API.

> can you please add
> support for RWF_ENCODED to xfs_io and write a suite of unit tests
> for fstests that exercise the user API fully?

Reading requires filesystem-specific decoding, and I wasn't sure if that
would be a good fit for xfs_io. Alternatively, it could dump the raw
buffer to stdout, but whatever interprets it also needs the metadata, so
there'd need to be some sort of protocol between xfs_io and whatever
interprets it. I added a btrfs_read_encoded program in my xfstests
branch above instead. It should be easy enough to move the encoded_write
test program to xfs_io pwrite, though.

> Given our history of
> screwing up new user APIs, this absolutely should not be merged
> until there is a full set of generic unit tests written and reviewed
> for it and support has been added to fsstress, fsx, and other test
> utilities to fuzz and stress the implementation as part of normal
> day-to-day filesystem development...

Sure thing, I'll add support to those tools once the API isn't in flux
so much.

^ permalink raw reply

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Darrick J. Wong @ 2019-10-21 19:00 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team
In-Reply-To: <20191021183831.mbe4q2beqo76fqxm@yavin.dot.cyphar.com>

On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > Btrfs supports transparent compression: data written by the user can be
> > > compressed when written to disk and decompressed when read back.
> > > However, we'd like to add an interface to write pre-compressed data
> > > directly to the filesystem, and the matching interface to read
> > > compressed data without decompressing it. This adds support for
> > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > 
> > > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > is used for metadata: namely, the compression algorithm, unencoded
> > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > should be used (needed for truncated or hole-punched extents and when
> > > reading in the middle of an extent). For reads, the filesystem returns
> > > this information; for writes, the caller provides it to the filesystem.
> > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > used to extend the interface in the future. The remaining iovecs contain
> > > the encoded extent.
> > > 
> > > Filesystems must indicate that they support encoded writes by setting
> > > FMODE_ENCODED_IO in ->file_open().
> > > 
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > >  include/linux/fs.h      | 14 +++++++
> > >  include/uapi/linux/fs.h | 26 ++++++++++++-
> > >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index e0d909d35763..54681f21e05e 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> > >  /* File does not contribute to nr_files count */
> > >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> > >  
> > > +/* File supports encoded IO */
> > > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > > +
> > >  /*
> > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > >   * that indicates that they should check the contents of the iovec are
> > > @@ -314,6 +317,7 @@ enum rw_hint {
> > >  #define IOCB_SYNC		(1 << 5)
> > >  #define IOCB_WRITE		(1 << 6)
> > >  #define IOCB_NOWAIT		(1 << 7)
> > > +#define IOCB_ENCODED		(1 << 8)
> > >  
> > >  struct kiocb {
> > >  	struct file		*ki_filp;
> > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > +struct encoded_iov;
> > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > +				struct iov_iter *);
> > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > >  				struct file *file_out, loff_t pos_out,
> > >  				loff_t *count, unsigned int remap_flags);
> > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > >  			return -EOPNOTSUPP;
> > >  		ki->ki_flags |= IOCB_NOWAIT;
> > >  	}
> > > +	if (flags & RWF_ENCODED) {
> > > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > +			return -EOPNOTSUPP;
> > > +		ki->ki_flags |= IOCB_ENCODED;
> > > +	}
> > >  	if (flags & RWF_HIPRI)
> > >  		ki->ki_flags |= IOCB_HIPRI;
> > >  	if (flags & RWF_DSYNC)
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index 379a612f8f1d..ed92a8a257cb 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -284,6 +284,27 @@ struct fsxattr {
> > >  
> > >  typedef int __bitwise __kernel_rwf_t;
> > >  
> > > +enum {
> > > +	ENCODED_IOV_COMPRESSION_NONE,
> > > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > > +	ENCODED_IOV_COMPRESSION_LZO,
> > > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > +};
> > > +
> > > +enum {
> > > +	ENCODED_IOV_ENCRYPTION_NONE,
> > > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > +};
> > > +
> > > +struct encoded_iov {
> > > +	__u64 len;
> > > +	__u64 unencoded_len;
> > > +	__u64 unencoded_offset;
> > > +	__u32 compression;
> > > +	__u32 encryption;
> > 
> > Can we add some must-be-zero padding space at the end here for whomever
> > comes along next wanting to add more encoding info?
> 
> I would suggest to copy the extension design of copy_struct_from_user().
> Adding must-be-zero padding is a less-ideal solution to the extension
> problem than length-based extension.

Come to think of it, you /do/ have to specify iov_len so... yeah, do
that instead; we can always extend the structure later.

> Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
> with syscall structure arguments)?

<shrug> No idea, that's the first I've heard of that type and it doesn't
seem to be used by the fs code.  Why would we care about alignment for
an incore structure?

--D

> 
> > (And maybe a manpage and some basic testing, to reiterate Dave...)
> > 
> > --D
> > 
> > > +};
> > > +
> > >  /* high priority request, poll if possible */
> > >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> > >  
> > > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> > >  /* per-IO O_APPEND */
> > >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> > >  
> > > +/* encoded (e.g., compressed or encrypted) IO */
> > > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > > +
> > >  /* mask of flags supported by the kernel */
> > >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > > -			 RWF_APPEND)
> > > +			 RWF_APPEND | RWF_ENCODED)
> > >  
> > >  #endif /* _UAPI_LINUX_FS_H */
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 1146fcfa3215..d2e6d9caf353 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> > >  	return 0;
> > >  }
> > >  
> > > -/*
> > > - * Performs necessary checks before doing a write
> > > - *
> > > - * Can adjust writing position or amount of bytes to write.
> > > - * Returns appropriate error code that caller should return or
> > > - * zero in case that write should be allowed.
> > > - */
> > > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> > >  {
> > >  	struct file *file = iocb->ki_filp;
> > >  	struct inode *inode = file->f_mapping->host;
> > > -	loff_t count;
> > > -	int ret;
> > >  
> > >  	if (IS_SWAPFILE(inode))
> > >  		return -ETXTBSY;
> > >  
> > > -	if (!iov_iter_count(from))
> > > +	if (!*count)
> > >  		return 0;
> > >  
> > >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> > >  		return -EINVAL;
> > >  
> > > -	count = iov_iter_count(from);
> > > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > > +}
> > > +
> > > +/*
> > > + * Performs necessary checks before doing a write
> > > + *
> > > + * Can adjust writing position or amount of bytes to write.
> > > + * Returns a negative errno or the new number of bytes to write.
> > > + */
> > > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > +{
> > > +	loff_t count = iov_iter_count(from);
> > > +	int ret;
> > > +
> > > +	ret = generic_write_checks_common(iocb, &count);
> > >  	if (ret)
> > >  		return ret;
> > >  
> > > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > >  }
> > >  EXPORT_SYMBOL(generic_write_checks);
> > >  
> > > +int generic_encoded_write_checks(struct kiocb *iocb,
> > > +				 struct encoded_iov *encoded)
> > > +{
> > > +	loff_t count = encoded->unencoded_len;
> > > +	int ret;
> > > +
> > > +	ret = generic_write_checks_common(iocb, &count);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (count != encoded->unencoded_len) {
> > > +		/*
> > > +		 * The write got truncated by generic_write_checks_common(). We
> > > +		 * can't do a partial encoded write.
> > > +		 */
> > > +		return -EFBIG;
> > > +	}
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > > +
> > > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > > +{
> > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > +		return -EPERM;
> > > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > > +		return -EINVAL;
> > > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > > +}
> > > +EXPORT_SYMBOL(check_encoded_read);
> > > +
> > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > +			 struct iov_iter *from)
> > > +{
> > > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > > +		return -EPERM;
> > > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > +		return -EINVAL;
> > > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > +		return -EFAULT;
> > > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > > +		return -EINVAL;
> > > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > +		return -EINVAL;
> > > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > > +		return -EINVAL;
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL(import_encoded_write);
> > > +
> > >  /*
> > >   * Performs necessary checks before doing a clone.
> > >   *
> > > -- 
> > > 2.23.0
> > > 
> 
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>

^ permalink raw reply

* Re: [PATCH man-pages] Document encoded I/O
From: Omar Sandoval @ 2019-10-21 18:53 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Linux Btrfs, Dave Chinner, Jann Horn, Linux API,
	kernel-team, Theodore Tso
In-Reply-To: <CAOQ4uxh_pZSiMmD=46Mc3o0GE+svXuoC155P_9FGJXdsE4cweg@mail.gmail.com>

On Mon, Oct 21, 2019 at 09:18:13AM +0300, Amir Goldstein wrote:
> CC: Ted
> 
> What ever happened to read/write ext4 encrypted data API?
> https://marc.info/?l=linux-ext4&m=145030599010416&w=2
> 
> Can we learn anything from the ext4 experience to improve
> the new proposed API?

I wasn't aware of these patches, thanks for pointing them out. Ted, do
you have any thoughts about making this API work for fscrypt?

> On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval <osandov@osandov.com> wrote:
> >
> > From: Omar Sandoval <osandov@fb.com>
> >
> > This adds a new page, rwf_encoded(7), providing an overview of encoded
> > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > reference it.
> >
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  man2/fcntl.2       |  10 +-
> >  man2/open.2        |  13 ++
> >  man2/readv.2       |  46 +++++++
> >  man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 365 insertions(+), 1 deletion(-)
> >  create mode 100644 man7/rwf_encoded.7
> >
> > diff --git a/man2/fcntl.2 b/man2/fcntl.2
> > index fce4f4c2b..76fe9cc6f 100644
> > --- a/man2/fcntl.2
> > +++ b/man2/fcntl.2
> > @@ -222,8 +222,9 @@ On Linux, this command can change only the
> >  .BR O_ASYNC ,
> >  .BR O_DIRECT ,
> >  .BR O_NOATIME ,
> > +.BR O_NONBLOCK ,
> >  and
> > -.B O_NONBLOCK
> > +.B O_ENCODED
> >  flags.
> >  It is not possible to change the
> >  .BR O_DSYNC
> > @@ -1803,6 +1804,13 @@ Attempted to clear the
> >  flag on a file that has the append-only attribute set.
> >  .TP
> >  .B EPERM
> > +Attempted to set the
> > +.B O_ENCODED
> > +flag and the calling process did not have the
> > +.B CAP_SYS_ADMIN
> > +capability.
> > +.TP
> > +.B EPERM
> >  .I cmd
> >  was
> >  .BR F_ADD_SEALS ,
> > diff --git a/man2/open.2 b/man2/open.2
> > index b0f485b41..cdd3c549c 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -421,6 +421,14 @@ was followed by a call to
> >  .BR fdatasync (2)).
> >  .IR "See NOTES below" .
> >  .TP
> > +.B O_ENCODED
> > +Open the file with encoded I/O permissions;
> 
> 1. I find the name of the flag confusing.
> Yes, most people don't read documentation so carefully (or at all)
> so they will assume O_ENCODED will affect read/write or that it
> relates to RWF_ENCODED in a similar way that O_SYNC relates
> to RWF_SYNC (i.e. logical OR and not logical AND).
> 
> I am not good at naming and to prove it I will propose:
> O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED

Agreed, the name is misleading. I can't think of anything better than
O_ALLOW_ENCODED, so I'll go with that unless someone comes up with
something better :)

> 2. While I see no harm in adding O_ flag to open(2) for this
> use case, I also don't see a major benefit in adding it.
> What if we only allowed setting the flag via fcntl(2) which returns
> an error on old kernels?
> Since unlike most O_ flags, O_ENCODED does NOT affect file
> i/o without additional opt-in flags, it is not standard anyway and
> therefore I find that setting it only via fcntl(2) is less error prone.

If I make this fcntl-only, then it probably shouldn't be through
F_GETFL/F_SETFL (it'd be pretty awkward for an O_ flag to not be valid
for open(), and also awkward to mix some non-O_ flag with O_ flags for
F_GETFL/F_SETFL). So that leaves a couple of options:

1. Get/set it with F_GETFD/F_SETFD, which is currently only used for
   FD_CLOEXEC. That also silently ignores unknown flags, but as with the
   O_ flag option, I don't think that's a big deal for FD_ALLOW_ENCODED.
2. Add a new fcntl command (F_GETFD2/F_SETFD2?). This seems like
   overkill to me.

However, both of these options are annoying to implement. Ideally, we
wouldn't have to add another flags field to struct file. But, to reuse
f_flags, we'd need to make sure that FD_ALLOW_ENCODED doesn't collide
with other O_ flags, and we'd probably want to hide it from F_GETFL. At
that point, it might as well be an O_ flag.

It seems to me that it's more trouble than it's worth to make this not
an O_ flag, but please let me know if you see a nice way to do so.

> > +see
> > +.BR rwf_encoded (7).
> > +The caller must have the
> > +.B CAP_SYS_ADMIN
> > +capabilty.
> > +.TP
> >  .B O_EXCL
> >  Ensure that this call creates the file:
> >  if this flag is specified in conjunction with
> > @@ -1168,6 +1176,11 @@ did not match the owner of the file and the caller was not privileged.
> >  The operation was prevented by a file seal; see
> >  .BR fcntl (2).
> >  .TP
> > +.B EPERM
> > +The
> > +.B O_ENCODED
> > +flag was specified, but the caller was not privileged.
> > +.TP
> >  .B EROFS
> >  .I pathname
> >  refers to a file on a read-only filesystem and write access was
> > diff --git a/man2/readv.2 b/man2/readv.2
> > index af27aa63e..aa60b980a 100644
> > --- a/man2/readv.2
> > +++ b/man2/readv.2
> > @@ -265,6 +265,11 @@ the data is always appended to the end of the file.
> >  However, if the
> >  .I offset
> >  argument is \-1, the current file offset is updated.
> > +.TP
> > +.BR RWF_ENCODED " (since Linux 5.6)"
> > +Read or write encoded (e.g., compressed) data.
> > +See
> > +.BR rwf_encoded (7).
> >  .SH RETURN VALUE
> >  On success,
> >  .BR readv (),
> > @@ -284,6 +289,13 @@ than requested (see
> >  and
> >  .BR write (2)).
> >  .PP
> > +If
> > +.B
> > +RWF_ENCODED
> > +was specified in
> > +.IR flags ,
> > +then the return value is the number of encoded bytes.
> > +.PP
> >  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
> >  .SH ERRORS
> >  The errors are as given for
> > @@ -314,6 +326,40 @@ is less than zero or greater than the permitted maximum.
> >  .TP
> >  .B EOPNOTSUPP
> >  An unknown flag is specified in \fIflags\fP.
> > +.TP
> > +.B EOPNOTSUPP
> > +.B RWF_ENCODED
> > +is specified in
> > +.I flags
> > +and the filesystem does not implement encoded I/O.
> > +.TP
> > +.B EPERM
> > +.B RWF_ENCODED
> > +is specified in
> > +.I flags
> > +and the file was not opened with the
> > +.B O_ENCODED
> > +flag.
> > +.PP
> > +.BR preadv2 ()
> > +can fail for the following reasons:
> > +.TP
> > +.B EFBIG
> > +.B RWF_ENCODED
> > +is specified in
> > +.I flags
> > +and buffers in
> > +.I iov
> > +were not big enough to return the encoded data.
> 
> I don't like it that EFBIG is returned for read.
> While EXXX values meaning is often very vague, EFBIG meaning
> is still quite consistent - it is always a write related error when trying
> to change i_size above fs/system limits.
> 
> In the case above, I find E2BIG much more appropriate.
> Although its original meaning was too long arg list, it already grew
> several cases where it generally means "buffer cannot hold the result"
> like the case with msgrcv(2) and listxattr(2).

Yes, E2BIG is a better fit, I'll change it.

^ permalink raw reply

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Aleksa Sarai @ 2019-10-21 18:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team
In-Reply-To: <20191021182806.GA6706@magnolia>

[-- Attachment #1: Type: text/plain, Size: 9945 bytes --]

On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  include/linux/fs.h      | 14 +++++++
> >  include/uapi/linux/fs.h | 26 ++++++++++++-
> >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC		(1 << 5)
> >  #define IOCB_WRITE		(1 << 6)
> >  #define IOCB_NOWAIT		(1 << 7)
> > +#define IOCB_ENCODED		(1 << 8)
> >  
> >  struct kiocb {
> >  	struct file		*ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +				struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >  				struct file *file_out, loff_t pos_out,
> >  				loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> >  			return -EOPNOTSUPP;
> >  		ki->ki_flags |= IOCB_NOWAIT;
> >  	}
> > +	if (flags & RWF_ENCODED) {
> > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +			return -EOPNOTSUPP;
> > +		ki->ki_flags |= IOCB_ENCODED;
> > +	}
> >  	if (flags & RWF_HIPRI)
> >  		ki->ki_flags |= IOCB_HIPRI;
> >  	if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +	ENCODED_IOV_COMPRESSION_NONE,
> > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > +	ENCODED_IOV_COMPRESSION_LZO,
> > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +	ENCODED_IOV_ENCRYPTION_NONE,
> > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +	__u64 len;
> > +	__u64 unencoded_len;
> > +	__u64 unencoded_offset;
> > +	__u32 compression;
> > +	__u32 encryption;
> 
> Can we add some must-be-zero padding space at the end here for whomever
> comes along next wanting to add more encoding info?

I would suggest to copy the extension design of copy_struct_from_user().
Adding must-be-zero padding is a less-ideal solution to the extension
problem than length-based extension.

Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
with syscall structure arguments)?

> (And maybe a manpage and some basic testing, to reiterate Dave...)
> 
> --D
> 
> > +};
> > +
> >  /* high priority request, poll if possible */
> >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> >  
> > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> >  /* per-IO O_APPEND */
> >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> >  
> > +/* encoded (e.g., compressed or encrypted) IO */
> > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > +
> >  /* mask of flags supported by the kernel */
> >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > -			 RWF_APPEND)
> > +			 RWF_APPEND | RWF_ENCODED)
> >  
> >  #endif /* _UAPI_LINUX_FS_H */
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 1146fcfa3215..d2e6d9caf353 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> >  	return 0;
> >  }
> >  
> > -/*
> > - * Performs necessary checks before doing a write
> > - *
> > - * Can adjust writing position or amount of bytes to write.
> > - * Returns appropriate error code that caller should return or
> > - * zero in case that write should be allowed.
> > - */
> > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> >  {
> >  	struct file *file = iocb->ki_filp;
> >  	struct inode *inode = file->f_mapping->host;
> > -	loff_t count;
> > -	int ret;
> >  
> >  	if (IS_SWAPFILE(inode))
> >  		return -ETXTBSY;
> >  
> > -	if (!iov_iter_count(from))
> > +	if (!*count)
> >  		return 0;
> >  
> >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> >  		return -EINVAL;
> >  
> > -	count = iov_iter_count(from);
> > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > +}
> > +
> > +/*
> > + * Performs necessary checks before doing a write
> > + *
> > + * Can adjust writing position or amount of bytes to write.
> > + * Returns a negative errno or the new number of bytes to write.
> > + */
> > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	loff_t count = iov_iter_count(from);
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  }
> >  EXPORT_SYMBOL(generic_write_checks);
> >  
> > +int generic_encoded_write_checks(struct kiocb *iocb,
> > +				 struct encoded_iov *encoded)
> > +{
> > +	loff_t count = encoded->unencoded_len;
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (count != encoded->unencoded_len) {
> > +		/*
> > +		 * The write got truncated by generic_write_checks_common(). We
> > +		 * can't do a partial encoded write.
> > +		 */
> > +		return -EFBIG;
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > +
> > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > +		return -EINVAL;
> > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > +}
> > +EXPORT_SYMBOL(check_encoded_read);
> > +
> > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > +			 struct iov_iter *from)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > +		return -EINVAL;
> > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > +		return -EFAULT;
> > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > +		return -EINVAL;
> > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > +		return -EINVAL;
> > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > +		return -EINVAL;
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(import_encoded_write);
> > +
> >  /*
> >   * Performs necessary checks before doing a clone.
> >   *
> > -- 
> > 2.23.0
> > 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox