Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Theodore Y. Ts'o @ 2019-09-25 15:08 UTC (permalink / raw)
  To: Colin Walters
  Cc: Dave Chinner, Jann Horn, Omar Sandoval, Aleksa Sarai, Jens Axboe,
	linux-fsdevel, linux-btrfs, Linux API, Kernel Team,
	Andy Lutomirski
In-Reply-To: <60c48ac5-b215-44e1-a628-6145d84a4ce3@www.fastmail.com>

On Wed, Sep 25, 2019 at 08:07:12AM -0400, Colin Walters wrote:
> 
> 
> On Wed, Sep 25, 2019, at 3:11 AM, Dave Chinner wrote:
> >
> > We're talking about user data read/write access here, not some
> > special security capability. Access to the data has already been
> > permission checked, so why should the format that the data is
> > supplied to the kernel in suddenly require new privilege checks?
> 
> What happens with BTRFS today if userspace provides invalid
> compressed data via this interface?  Does that show up as filesystem
> corruption later?  If the data is verified at write time, wouldn't
> that be losing most of the speed advantages of providing
> pre-compressed data?

Not necessarily, most compression algorithms are far more expensive to
compress than to decompress.

If there is a buggy decompressor, it's possible that invalid data
could result in a buffer overrun.  So that's an argument for verifying
the compressed code at write time.  OTOH, the verification could be
just as vulnerability to invalid data as the decompressor, so it
doesn't buy you that much.

> Ability for a user to cause fsck errors later would be a new thing
> that would argue for a privilege check I think.

Well, if it's only invalid data in a user file, there's no reason why
it should cause the kernel declare that the file system is corrupt; it
can just return EIO.

What fsck does is a different question, of course; it might be that
the fsck code isn't going to check compressed user data.  After all,
if all of the files on the file system are compressed, requiring fsck
to check all compressed data blocks is tantamount to requiring it to
read all of the blocks in the file system.  Much better would be some
kind of online scrub operation which validates data files while the
file system is mounted and the system can be in a serving state.

						- Ted

^ permalink raw reply

* Re: [RFC PATCH 2/3] add RWF_ENCODED for writing compressed data
From: Chris Mason @ 2019-09-25 14:56 UTC (permalink / raw)
  To: Colin Walters
  Cc: Dave Chinner, Jann Horn, Omar Sandoval, Aleksa Sarai, Jens Axboe,
	linux-fsdevel, linux-btrfs@vger.kernel.org, Linux API,
	Kernel Team, Andy Lutomirski
In-Reply-To: <60c48ac5-b215-44e1-a628-6145d84a4ce3@www.fastmail.com>

On 25 Sep 2019, at 8:07, Colin Walters wrote:

> On Wed, Sep 25, 2019, at 3:11 AM, Dave Chinner wrote:
>>
>> We're talking about user data read/write access here, not some
>> special security capability. Access to the data has already been
>> permission checked, so why should the format that the data is
>> supplied to the kernel in suddenly require new privilege checks?
>
> What happens with BTRFS today if userspace provides invalid compressed 
> data via this interface?  Does that show up as filesystem corruption 
> later?  If the data is verified at write time, wouldn't that be losing 
> most of the speed advantages of providing pre-compressed data?

The data is verified while being decompressed, but that's a fairly large 
fuzzing surface (all of zstd, zlib, and lzo).  A lot of people will 
correctly argue that we already have that fuzzing surface today, but I'd 
rather not make a really easy way to stuff arbitrary bytes through the 
kernel decompression code until all the projects involved sign off.

-chris

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-25 14:29 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Florian Weimer, Oleg Nesterov, Jann Horn,
	Eric W. Biederman, Daniel Colascione, Joel Fernandes, linux-man,
	Linux API, lkml
In-Reply-To: <20190925135308.rev5tczcigyuchae@wittgenstein>

Hello Christian,

On 9/25/19 3:53 PM, Christian Brauner wrote:
> On Wed, Sep 25, 2019 at 03:46:26PM +0200, Michael Kerrisk (man-pages) wrote:
>> On 9/24/19 11:53 PM, Christian Brauner wrote:
>>> On Tue, Sep 24, 2019 at 11:00:03PM +0200, Michael Kerrisk (man-pages) wrote:
>>>> Hello Christian,
>>>>
>>>>>>> If you're the parent of the process you can do this without CLONE_PIDFD:
>>>>>>> pid = fork();
>>>>>>> pidfd = pidfd_open();
>>>>>>> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
>>>>>>> if (ret < 0 && errno == ESRCH)
>>>>>>> 	/* pidfd refers to another, recycled process */
>>>>>>
>>>>>> Although there is still the race between the fork() and the
>>>>>> pidfd_open(), right?
>>>>>
>>>>> Actually no and my code is even too complex.
>>>>> If you are the parent, and this is really a sequence that obeys the
>>>>> ordering pidfd_open() before waiting:
>>>>>
>>>>> pid = fork();
>>>>> if (pid == 0)
>>>>> 	exit(EXIT_SUCCESS);
>>>>> pidfd = pidfd_open(pid, 0);
>>>>> waitid(pid, ...);
>>>>>
>>>>> Then you are guaranteed that pidfd will refer to pid. No recycling can
>>>>> happen since the process has not been waited upon yet (That is,
>>>>
>>>> D'oh! Yes, of course. 
>>>>
>>>>> excluding special cases such as where you have a mainloop where a
>>>>> callback reacts to a SIGCHLD event and waits on the child behind your
>>>>> back and your next callback in the mainloop calls pidfd_open() while the
>>>>> pid has been recycled etc.).
>>>>> A race could only appear in sequences where waiting happens before
>>>>> pidfd_open():
>>>>>
>>>>> pid = fork();
>>>>> if (pid == 0)
>>>>> 	exit(EXIT_SUCCESS);
>>>>> waitid(pid, ...);
>>>>> pidfd = pidfd_open(pid, 0);
>>>>>
>>>>> which honestly simply doesn't make any sense. So if you're the parent
>>>>> and you combine fork() + pidfd_open() correctly things should be fine
>>>>> without even having to verify via pidfd_send_signal() (I missed that in
>>>>> my first mail.).
>>>>
>>>> Thanks for the additional detail.
>>>
>>> You're very welcome.
>>>
>>>>
>>>> I added the following to the pidfd_open() page, to
>>>> prevent people making the same thinko as me:
>>>>
>>>>        The following code sequence can be used to obtain a file  descrip‐
>>>>        tor for the child of fork(2):
>>>>
>>>>            pid = fork();
>>>>            if (pid > 0) {     /* If parent */
>>>>                pidfd = pidfd_open(pid, 0);
>>>>                ...
>>>>            }
>>>>
>>>>        Even  if  the  child process has already terminated by the time of
>>>>        the pidfd_open() call, the returned file descriptor is  guaranteed
>>>>        to refer to the child because the parent has not yet waited on the
>>>>        child (and therefore, the child's ID has not been recycled).
>>>
>>> Thanks! I'm fine with the example. The code illustrates the basics. If
>>> you want to go overboard, you can mention my callback example and put my
>>> SIG_IGN code snippet from my earlier mails (cf. [1] and [2]) in there.
>>> But imho, that'll complicate the manpage and I'm not sure it's worth it.
>>
>> I agree that we should not complicate this discussion with more code,
>> but how about we refine the text as follows:
>>
>>        The following code sequence can be used to obtain a file  descrip‐
>>        tor for the child of fork(2):
>>
>>            pid = fork();
>>            if (pid > 0) {     /* If parent */
>>                pidfd = pidfd_open(pid, 0);
>>                ...
>>            }
>>
>>        Even  if  the  child  has  already  terminated  by the time of the
>>        pidfd_open() call, its PID will not have  been  recycled  and  the
>>        returned  file  descriptor  will  refer  to  the  resulting zombie
>>        process.  Note, however, that this is guaranteed only if the  fol‐
>>        lowing conditions hold true:
>>
>>        *  the  disposition  of  SIGCHLD  has  not  been explicitly set to
>>           SIG_IGN (see sigaction(2)); and
> 
> Ugh, I forgot a third one. There's also SA_NOCLDWAIT. When set and
> the SIGCHLD handler is set to SIG_DFL then no zombie processes are
> created and no SIGCHLD signal is sent. When an explicit handler for
> SIGCHLD is set then a SIGCHLD signal is generated but the process will
> still not be turned into a zombie...

Oh, yes. I added:

       *  the SA_NOCLDSTOP flag was not specified  while  establishing  a
          handler  for  SIGCHLD  or while setting the disposition of that
          signal to SIG_DFL (see sigaction(2));

>>        *  the zombie process was not  reaped  elsewhere  in  the  program
>>           (e.g.,  either  by an asynchronously executed signal handler or
>>           by wait(2) or similar in another thread).
>>
>>        If these conditions don't hold true, then the child process should
> 
> "If any of these conditions does not hold, the child process..."
> 
> That might be clearer. But I leave the call on that to you. :)

Yep, your wording is better. Fixed.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-25 14:02 UTC (permalink / raw)
  To: Florian Weimer
  Cc: mtk.manpages, Christian Brauner, Oleg Nesterov, Jann Horn,
	Eric W. Biederman, Daniel Colascione, Joel Fernandes, linux-man,
	Linux API, lkml
In-Reply-To: <87h850scux.fsf@mid.deneb.enyo.de>

On 9/25/19 3:51 PM, Florian Weimer wrote:
> * Michael Kerrisk:
> 
>>        If these conditions don't hold true, then the child process should
>>        instead be created using clone(2) with the CLONE_PID flag.
> 
> I think this should be CLONE_PIDFD.

Thanks Florian. Fixed.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Christian Brauner @ 2019-09-25 13:53 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Florian Weimer, Oleg Nesterov, Jann Horn, Eric W. Biederman,
	Daniel Colascione, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <331cc245-3f70-dd43-31f9-8c1680ca6b20@gmail.com>

On Wed, Sep 25, 2019 at 03:46:26PM +0200, Michael Kerrisk (man-pages) wrote:
> On 9/24/19 11:53 PM, Christian Brauner wrote:
> > On Tue, Sep 24, 2019 at 11:00:03PM +0200, Michael Kerrisk (man-pages) wrote:
> >> Hello Christian,
> >>
> >>>>> If you're the parent of the process you can do this without CLONE_PIDFD:
> >>>>> pid = fork();
> >>>>> pidfd = pidfd_open();
> >>>>> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
> >>>>> if (ret < 0 && errno == ESRCH)
> >>>>> 	/* pidfd refers to another, recycled process */
> >>>>
> >>>> Although there is still the race between the fork() and the
> >>>> pidfd_open(), right?
> >>>
> >>> Actually no and my code is even too complex.
> >>> If you are the parent, and this is really a sequence that obeys the
> >>> ordering pidfd_open() before waiting:
> >>>
> >>> pid = fork();
> >>> if (pid == 0)
> >>> 	exit(EXIT_SUCCESS);
> >>> pidfd = pidfd_open(pid, 0);
> >>> waitid(pid, ...);
> >>>
> >>> Then you are guaranteed that pidfd will refer to pid. No recycling can
> >>> happen since the process has not been waited upon yet (That is,
> >>
> >> D'oh! Yes, of course. 
> >>
> >>> excluding special cases such as where you have a mainloop where a
> >>> callback reacts to a SIGCHLD event and waits on the child behind your
> >>> back and your next callback in the mainloop calls pidfd_open() while the
> >>> pid has been recycled etc.).
> >>> A race could only appear in sequences where waiting happens before
> >>> pidfd_open():
> >>>
> >>> pid = fork();
> >>> if (pid == 0)
> >>> 	exit(EXIT_SUCCESS);
> >>> waitid(pid, ...);
> >>> pidfd = pidfd_open(pid, 0);
> >>>
> >>> which honestly simply doesn't make any sense. So if you're the parent
> >>> and you combine fork() + pidfd_open() correctly things should be fine
> >>> without even having to verify via pidfd_send_signal() (I missed that in
> >>> my first mail.).
> >>
> >> Thanks for the additional detail.
> > 
> > You're very welcome.
> > 
> >>
> >> I added the following to the pidfd_open() page, to
> >> prevent people making the same thinko as me:
> >>
> >>        The following code sequence can be used to obtain a file  descrip‐
> >>        tor for the child of fork(2):
> >>
> >>            pid = fork();
> >>            if (pid > 0) {     /* If parent */
> >>                pidfd = pidfd_open(pid, 0);
> >>                ...
> >>            }
> >>
> >>        Even  if  the  child process has already terminated by the time of
> >>        the pidfd_open() call, the returned file descriptor is  guaranteed
> >>        to refer to the child because the parent has not yet waited on the
> >>        child (and therefore, the child's ID has not been recycled).
> > 
> > Thanks! I'm fine with the example. The code illustrates the basics. If
> > you want to go overboard, you can mention my callback example and put my
> > SIG_IGN code snippet from my earlier mails (cf. [1] and [2]) in there.
> > But imho, that'll complicate the manpage and I'm not sure it's worth it.
> 
> I agree that we should not complicate this discussion with more code,
> but how about we refine the text as follows:
> 
>        The following code sequence can be used to obtain a file  descrip‐
>        tor for the child of fork(2):
> 
>            pid = fork();
>            if (pid > 0) {     /* If parent */
>                pidfd = pidfd_open(pid, 0);
>                ...
>            }
> 
>        Even  if  the  child  has  already  terminated  by the time of the
>        pidfd_open() call, its PID will not have  been  recycled  and  the
>        returned  file  descriptor  will  refer  to  the  resulting zombie
>        process.  Note, however, that this is guaranteed only if the  fol‐
>        lowing conditions hold true:
> 
>        *  the  disposition  of  SIGCHLD  has  not  been explicitly set to
>           SIG_IGN (see sigaction(2)); and

Ugh, I forgot a third one. There's also SA_NOCLDWAIT. When set and
the SIGCHLD handler is set to SIG_DFL then no zombie processes are
created and no SIGCHLD signal is sent. When an explicit handler for
SIGCHLD is set then a SIGCHLD signal is generated but the process will
still not be turned into a zombie...

> 
>        *  the zombie process was not  reaped  elsewhere  in  the  program
>           (e.g.,  either  by an asynchronously executed signal handler or
>           by wait(2) or similar in another thread).
> 
>        If these conditions don't hold true, then the child process should

"If any of these conditions does not hold, the child process..."

That might be clearer. But I leave the call on that to you. :)

Christian

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Florian Weimer @ 2019-09-25 13:51 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christian Brauner, Oleg Nesterov, Jann Horn, Eric W. Biederman,
	Daniel Colascione, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <331cc245-3f70-dd43-31f9-8c1680ca6b20@gmail.com>

* Michael Kerrisk:

>        If these conditions don't hold true, then the child process should
>        instead be created using clone(2) with the CLONE_PID flag.

I think this should be CLONE_PIDFD.

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-25 13:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Florian Weimer, Oleg Nesterov, Jann Horn,
	Eric W. Biederman, Daniel Colascione, Joel Fernandes, linux-man,
	Linux API, lkml
In-Reply-To: <20190924215316.gxev2anuqffegocw@wittgenstein>

On 9/24/19 11:53 PM, Christian Brauner wrote:
> On Tue, Sep 24, 2019 at 11:00:03PM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Christian,
>>
>>>>> If you're the parent of the process you can do this without CLONE_PIDFD:
>>>>> pid = fork();
>>>>> pidfd = pidfd_open();
>>>>> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
>>>>> if (ret < 0 && errno == ESRCH)
>>>>> 	/* pidfd refers to another, recycled process */
>>>>
>>>> Although there is still the race between the fork() and the
>>>> pidfd_open(), right?
>>>
>>> Actually no and my code is even too complex.
>>> If you are the parent, and this is really a sequence that obeys the
>>> ordering pidfd_open() before waiting:
>>>
>>> pid = fork();
>>> if (pid == 0)
>>> 	exit(EXIT_SUCCESS);
>>> pidfd = pidfd_open(pid, 0);
>>> waitid(pid, ...);
>>>
>>> Then you are guaranteed that pidfd will refer to pid. No recycling can
>>> happen since the process has not been waited upon yet (That is,
>>
>> D'oh! Yes, of course. 
>>
>>> excluding special cases such as where you have a mainloop where a
>>> callback reacts to a SIGCHLD event and waits on the child behind your
>>> back and your next callback in the mainloop calls pidfd_open() while the
>>> pid has been recycled etc.).
>>> A race could only appear in sequences where waiting happens before
>>> pidfd_open():
>>>
>>> pid = fork();
>>> if (pid == 0)
>>> 	exit(EXIT_SUCCESS);
>>> waitid(pid, ...);
>>> pidfd = pidfd_open(pid, 0);
>>>
>>> which honestly simply doesn't make any sense. So if you're the parent
>>> and you combine fork() + pidfd_open() correctly things should be fine
>>> without even having to verify via pidfd_send_signal() (I missed that in
>>> my first mail.).
>>
>> Thanks for the additional detail.
> 
> You're very welcome.
> 
>>
>> I added the following to the pidfd_open() page, to
>> prevent people making the same thinko as me:
>>
>>        The following code sequence can be used to obtain a file  descrip‐
>>        tor for the child of fork(2):
>>
>>            pid = fork();
>>            if (pid > 0) {     /* If parent */
>>                pidfd = pidfd_open(pid, 0);
>>                ...
>>            }
>>
>>        Even  if  the  child process has already terminated by the time of
>>        the pidfd_open() call, the returned file descriptor is  guaranteed
>>        to refer to the child because the parent has not yet waited on the
>>        child (and therefore, the child's ID has not been recycled).
> 
> Thanks! I'm fine with the example. The code illustrates the basics. If
> you want to go overboard, you can mention my callback example and put my
> SIG_IGN code snippet from my earlier mails (cf. [1] and [2]) in there.
> But imho, that'll complicate the manpage and I'm not sure it's worth it.

I agree that we should not complicate this discussion with more code,
but how about we refine the text as follows:

       The following code sequence can be used to obtain a file  descrip‐
       tor for the child of fork(2):

           pid = fork();
           if (pid > 0) {     /* If parent */
               pidfd = pidfd_open(pid, 0);
               ...
           }

       Even  if  the  child  has  already  terminated  by the time of the
       pidfd_open() call, its PID will not have  been  recycled  and  the
       returned  file  descriptor  will  refer  to  the  resulting zombie
       process.  Note, however, that this is guaranteed only if the  fol‐
       lowing conditions hold true:

       *  the  disposition  of  SIGCHLD  has  not  been explicitly set to
          SIG_IGN (see sigaction(2)); and

       *  the zombie process was not  reaped  elsewhere  in  the  program
          (e.g.,  either  by an asynchronously executed signal handler or
          by wait(2) or similar in another thread).

       If these conditions don't hold true, then the child process should
       instead be created using clone(2) with the CLONE_PID flag.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-25 13:46 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: mtk.manpages, Christian Brauner, Florian Weimer, Oleg Nesterov,
	Jann Horn, Eric W. Biederman, Joel Fernandes, linux-man,
	Linux API, lkml
In-Reply-To: <CAKOZueuJ=En9=mtFvhpmsnjqMAOjVmffaVwwzpe=ieHemxH3mw@mail.gmail.com>

Hello Daniel,

On 9/24/19 11:08 PM, Daniel Colascione wrote:
> On Tue, Sep 24, 2019 at 2:00 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>>
>> Hello Christian,
>>
>>>>> If you're the parent of the process you can do this without CLONE_PIDFD:
>>>>> pid = fork();
>>>>> pidfd = pidfd_open();
>>>>> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
>>>>> if (ret < 0 && errno == ESRCH)
>>>>>     /* pidfd refers to another, recycled process */
>>>>
>>>> Although there is still the race between the fork() and the
>>>> pidfd_open(), right?
>>>
>>> Actually no and my code is even too complex.
>>> If you are the parent, and this is really a sequence that obeys the
>>> ordering pidfd_open() before waiting:
>>>
>>> pid = fork();
>>> if (pid == 0)
>>>       exit(EXIT_SUCCESS);
>>> pidfd = pidfd_open(pid, 0);
>>> waitid(pid, ...);
>>>
>>> Then you are guaranteed that pidfd will refer to pid. No recycling can
>>> happen since the process has not been waited upon yet (That is,
>>
>> D'oh! Yes, of course.
> 
> You still have a race if you're the parent and you have SIGCHLD set to
> SIG_IGN though.

Yes, thanks for reminding me of that (as did Christian also).

>>> excluding special cases such as where you have a mainloop where a
>>> callback reacts to a SIGCHLD event and waits on the child behind your
>>> back and your next callback in the mainloop calls pidfd_open() while the
>>> pid has been recycled etc.).
> 
> That's a pretty common case though, especially if you're a library.

I presume that conventionally the only real resolution of this kind of
problem is that the mainloop SIGCHLD call back has a waitpid() loop
that (in a nonblocking fashion) loops checking each of the PIDs created
by the mainloop, right?

>>> A race could only appear in sequences where waiting happens before
>>> pidfd_open():
>>>
>>> pid = fork();
>>> if (pid == 0)
>>>       exit(EXIT_SUCCESS);
>>> waitid(pid, ...);
>>> pidfd = pidfd_open(pid, 0);
>>>
>>> which honestly simply doesn't make any sense. So if you're the parent
>>> and you combine fork() + pidfd_open() correctly things should be fine
>>> without even having to verify via pidfd_send_signal() (I missed that in
>>> my first mail.).
>>
>> Thanks for the additional detail.
>>
>> I added the following to the pidfd_open() page, to
>> prevent people making the same thinko as me:
>>
>>        The following code sequence can be used to obtain a file  descrip‐
>>        tor for the child of fork(2):
>>
>>            pid = fork();
>>            if (pid > 0) {     /* If parent */
>>                pidfd = pidfd_open(pid, 0);
>>                ...
>>            }
>>
>>        Even  if  the  child process has already terminated by the time of
>>        the pidfd_open() call, the returned file descriptor is  guaranteed
>>        to refer to the child because the parent has not yet waited on the
>>        child (and therefore, the child's ID has not been recycled).
> 
> I'd prefer that sample code be robust in all cases.

I'm not clear what you think is missing. Or do you mean that the code
can't be robust in the face of (1) waitpid(-1) in another thread or an
asynchronous SIGCHLD handler and (2) SIGCHLD disposition set to SIG_IGN?

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Colin Walters @ 2019-09-25 12:07 UTC (permalink / raw)
  To: Dave Chinner, Jann Horn
  Cc: Omar Sandoval, Aleksa Sarai, Jens Axboe, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Andy Lutomirski
In-Reply-To: <20190925071129.GB804@dread.disaster.area>

On Wed, Sep 25, 2019, at 3:11 AM, Dave Chinner wrote:
>
> We're talking about user data read/write access here, not some
> special security capability. Access to the data has already been
> permission checked, so why should the format that the data is
> supplied to the kernel in suddenly require new privilege checks?

What happens with BTRFS today if userspace provides invalid compressed data via this interface?  Does that show up as filesystem corruption later?  If the data is verified at write time, wouldn't that be losing most of the speed advantages of providing pre-compressed data?

Ability for a user to cause fsck errors later would be a new thing that would argue for a privilege check I think.

^ permalink raw reply

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Dave Chinner @ 2019-09-25  7:11 UTC (permalink / raw)
  To: Jann Horn
  Cc: Omar Sandoval, Aleksa Sarai, Jens Axboe, linux-fsdevel,
	linux-btrfs, Linux API, Kernel Team, Andy Lutomirski
In-Reply-To: <CAG48ez1NQBNR1XeVQYGoopEk=g_KedUr+7jxLQTaO+V8JCeweQ@mail.gmail.com>

On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> > On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > > like to add an interface to write pre-compressed data directly to the
> > > > > filesystem. This adds support for so-called "encoded writes" via
> > > > > pwritev2().
> > > > >
> > > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > contains metadata about the write: namely, the compression algorithm and
> > > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > > the interface in the future. The remaining iovecs contain the encoded
> > > > > extent.
> > > > >
> > > > > A similar interface for reading encoded data can be added to preadv2()
> > > > > in the future.
> > > > >
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > [...]
> > > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > > +                        struct iov_iter *from)
> > > > > +{
> > > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > > +               return -EINVAL;
> > > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > > +               return -EFAULT;
> > > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > > +               return 0;
> > > > > +       }
> > > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > > +               return -EINVAL;
> > > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > > +               return -EPERM;
> > > >
> > > > How does this capable() check interact with io_uring? Without having
> > > > looked at this in detail, I suspect that when an encoded write is
> > > > requested through io_uring, the capable() check might be executed on
> > > > something like a workqueue worker thread, which is probably running
> > > > with a full capability set.
> > >
> > > I discussed this more with Jens. You're right, per-IO permission checks
> > > aren't going to work. In fully-polled mode, we never get an opportunity
> > > to check capabilities in right context. So, this will probably require a
> > > new open flag.
> >
> > Actually, file_ns_capable() accomplishes the same thing without a new
> > open flag. Changing the capable() check to file_ns_capable() in
> > init_user_ns should be enough.
> 
> +Aleksa for openat2() and open() space
> 
> Mmh... but if the file descriptor has been passed through a privilege
> boundary, it isn't really clear whether the original opener of the
> file intended for this to be possible. For example, if (as a
> hypothetical example) the init process opens a service's logfile with
> root privileges, then passes the file descriptor to that logfile to
> the service on execve(), that doesn't mean that the service should be
> able to perform compressed writes into that file, I think.

Where's the privilege boundary that is being crossed?

We're talking about user data read/write access here, not some
special security capability. Access to the data has already been
permission checked, so why should the format that the data is
supplied to the kernel in suddenly require new privilege checks?

i.e. writing encoded data to a file requires exactly the same
access permissions as writing cleartext data to the file. The only
extra information here is whether the _filesystem_ supports encoded
data, and that doesn't change regardless of what the open file gets
passed to. Hence the capability is either there or it isn't, it
doesn't transform not matter what privilege boundary the file is
passed across. Similarly, we have permission to access the data
or we don't through the struct file, it doesn't transform either.

Hence I don't see why CAP_SYS_ADMIN or any special permissions are
needed for an application with access permissions to file data to
use these RWF_ENCODED IO interfaces. I am inclined to think the
permission check here is wrong and should be dropped, and then all
these issues go away.

Yes, the app that is going to use this needs root perms because it
accesses all data in the fs (it's a backup app!), but that doesn't
mean you can only use RWF_ENCODED if you have root perms.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Jann Horn @ 2019-09-25  1:48 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Michael Kerrisk (man-pages), Oleg Nesterov, Christian Brauner,
	Eric W. Biederman, Daniel Colascione, Joel Fernandes, linux-man,
	Linux API, lkml
In-Reply-To: <87pnjr9rth.fsf@mid.deneb.enyo.de>

On Mon, Sep 23, 2019 at 1:26 PM Florian Weimer <fw@deneb.enyo.de> wrote:
> * Michael Kerrisk:
> >        The  pidfd_send_signal()  system call allows the avoidance of race
> >        conditions that occur when using traditional interfaces  (such  as
> >        kill(2)) to signal a process.  The problem is that the traditional
> >        interfaces specify the target process via a process ID (PID), with
> >        the  result  that the sender may accidentally send a signal to the
> >        wrong process if the originally intended target process has termi‐
> >        nated  and its PID has been recycled for another process.  By con‐
> >        trast, a PID file descriptor is a stable reference to  a  specific
> >        process;  if  that  process  terminates,  then the file descriptor
> >        ceases to be  valid  and  the  caller  of  pidfd_send_signal()  is
> >        informed of this fact via an ESRCH error.
>
> It would be nice to explain somewhere how you can avoid the race using
> a PID descriptor.  Is there anything else besides CLONE_PIDFD?

My favorite example here is that you could implement "killall" without
PID reuse races. With /proc/$pid file descriptors, you could do it
like this (rough pseudocode with missing error handling and resource
leaks and such):

for each pid {
  procfs_pid_fd = open("/proc/"+pid);
  if (procfs_pid_fd == -1) continue;
  comm_fd = openat(procfs_pid_fd, "comm");
  if (comm_fd == -1) continue;
  char buf[1000];
  int n = read(comm_fd, buf, sizeof(buf)-1);
  buf[n] = 0;
  if (strcmp(buf, expected_comm) == 0) {
    pidfd_send_signal(procfs_pid_fd, SIGKILL, NULL, 0);
  }
}

If you want to avoid using a procfs fd for this, I think you can still
do it, the dance just gets more complicated:

for each pid {
  procfs_pid_fd = open("/proc/"+pid);
  if (procfs_pid_fd == -1) continue;
  pid_fd = pidfd_open(pid, 0);
  if (pid_fd == -1) continue;
  /* at this point procfs_pid_fd and pid_fd may refer to different processes */
  comm_fd = openat(procfs_pid_fd, "comm");
  if (comm_fd == -1) continue;
  /* at this point we know that procfs_pid_fd and pid_fd refer to the
same struct pid, because otherwise the procfs_pid_fd must point to a
directory that throws -ESRCH for everything */
  char buf[1000];
  int n = read(comm_fd, buf, sizeof(buf)-1);
  buf[n] = 0;
  if (strcmp(buf, expected_comm) == 0) {
    pidfd_send_signal(pid_fd, SIGKILL, NULL, 0);
  }
}

But I don't think anyone is actually interested in using pidfds for
this kind of usecase right now.

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Christian Brauner @ 2019-09-24 21:53 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Florian Weimer, Oleg Nesterov, Jann Horn, Eric W. Biederman,
	Daniel Colascione, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <b76adb4c-826b-6493-ba75-a9863066d3b1@gmail.com>

On Tue, Sep 24, 2019 at 11:00:03PM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> >>> If you're the parent of the process you can do this without CLONE_PIDFD:
> >>> pid = fork();
> >>> pidfd = pidfd_open();
> >>> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
> >>> if (ret < 0 && errno == ESRCH)
> >>> 	/* pidfd refers to another, recycled process */
> >>
> >> Although there is still the race between the fork() and the
> >> pidfd_open(), right?
> > 
> > Actually no and my code is even too complex.
> > If you are the parent, and this is really a sequence that obeys the
> > ordering pidfd_open() before waiting:
> > 
> > pid = fork();
> > if (pid == 0)
> > 	exit(EXIT_SUCCESS);
> > pidfd = pidfd_open(pid, 0);
> > waitid(pid, ...);
> > 
> > Then you are guaranteed that pidfd will refer to pid. No recycling can
> > happen since the process has not been waited upon yet (That is,
> 
> D'oh! Yes, of course. 
> 
> > excluding special cases such as where you have a mainloop where a
> > callback reacts to a SIGCHLD event and waits on the child behind your
> > back and your next callback in the mainloop calls pidfd_open() while the
> > pid has been recycled etc.).
> > A race could only appear in sequences where waiting happens before
> > pidfd_open():
> > 
> > pid = fork();
> > if (pid == 0)
> > 	exit(EXIT_SUCCESS);
> > waitid(pid, ...);
> > pidfd = pidfd_open(pid, 0);
> > 
> > which honestly simply doesn't make any sense. So if you're the parent
> > and you combine fork() + pidfd_open() correctly things should be fine
> > without even having to verify via pidfd_send_signal() (I missed that in
> > my first mail.).
> 
> Thanks for the additional detail.

You're very welcome.

> 
> I added the following to the pidfd_open() page, to
> prevent people making the same thinko as me:
> 
>        The following code sequence can be used to obtain a file  descrip‐
>        tor for the child of fork(2):
> 
>            pid = fork();
>            if (pid > 0) {     /* If parent */
>                pidfd = pidfd_open(pid, 0);
>                ...
>            }
> 
>        Even  if  the  child process has already terminated by the time of
>        the pidfd_open() call, the returned file descriptor is  guaranteed
>        to refer to the child because the parent has not yet waited on the
>        child (and therefore, the child's ID has not been recycled).

Thanks! I'm fine with the example. The code illustrates the basics. If
you want to go overboard, you can mention my callback example and put my
SIG_IGN code snippet from my earlier mails (cf. [1] and [2]) in there.
But imho, that'll complicate the manpage and I'm not sure it's worth it.

Thanks!
Christian

/* References */
[1]: https://lore.kernel.org/r/20190924195701.7pw2olbviieqsg5q@wittgenstein
[2]: https://lore.kernel.org/r/20190924200735.2dvqhan7ynnmfc7s@wittgenstein

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Daniel Colascione @ 2019-09-24 21:08 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Christian Brauner, Florian Weimer, Oleg Nesterov, Jann Horn,
	Eric W. Biederman, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <b76adb4c-826b-6493-ba75-a9863066d3b1@gmail.com>

On Tue, Sep 24, 2019 at 2:00 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello Christian,
>
> >>> If you're the parent of the process you can do this without CLONE_PIDFD:
> >>> pid = fork();
> >>> pidfd = pidfd_open();
> >>> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
> >>> if (ret < 0 && errno == ESRCH)
> >>>     /* pidfd refers to another, recycled process */
> >>
> >> Although there is still the race between the fork() and the
> >> pidfd_open(), right?
> >
> > Actually no and my code is even too complex.
> > If you are the parent, and this is really a sequence that obeys the
> > ordering pidfd_open() before waiting:
> >
> > pid = fork();
> > if (pid == 0)
> >       exit(EXIT_SUCCESS);
> > pidfd = pidfd_open(pid, 0);
> > waitid(pid, ...);
> >
> > Then you are guaranteed that pidfd will refer to pid. No recycling can
> > happen since the process has not been waited upon yet (That is,
>
> D'oh! Yes, of course.

You still have a race if you're the parent and you have SIGCHLD set to
SIG_IGN though.

> > excluding special cases such as where you have a mainloop where a
> > callback reacts to a SIGCHLD event and waits on the child behind your
> > back and your next callback in the mainloop calls pidfd_open() while the
> > pid has been recycled etc.).

That's a pretty common case though, especially if you're a library.

> > A race could only appear in sequences where waiting happens before
> > pidfd_open():
> >
> > pid = fork();
> > if (pid == 0)
> >       exit(EXIT_SUCCESS);
> > waitid(pid, ...);
> > pidfd = pidfd_open(pid, 0);
> >
> > which honestly simply doesn't make any sense. So if you're the parent
> > and you combine fork() + pidfd_open() correctly things should be fine
> > without even having to verify via pidfd_send_signal() (I missed that in
> > my first mail.).
>
> Thanks for the additional detail.
>
> I added the following to the pidfd_open() page, to
> prevent people making the same thinko as me:
>
>        The following code sequence can be used to obtain a file  descrip‐
>        tor for the child of fork(2):
>
>            pid = fork();
>            if (pid > 0) {     /* If parent */
>                pidfd = pidfd_open(pid, 0);
>                ...
>            }
>
>        Even  if  the  child process has already terminated by the time of
>        the pidfd_open() call, the returned file descriptor is  guaranteed
>        to refer to the child because the parent has not yet waited on the
>        child (and therefore, the child's ID has not been recycled).

I'd prefer that sample code be robust in all cases.

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-24 21:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Florian Weimer, Oleg Nesterov, Jann Horn,
	Eric W. Biederman, Daniel Colascione, Joel Fernandes, linux-man,
	Linux API, lkml
In-Reply-To: <20190924195701.7pw2olbviieqsg5q@wittgenstein>

Hello Christian,

>>> If you're the parent of the process you can do this without CLONE_PIDFD:
>>> pid = fork();
>>> pidfd = pidfd_open();
>>> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
>>> if (ret < 0 && errno == ESRCH)
>>> 	/* pidfd refers to another, recycled process */
>>
>> Although there is still the race between the fork() and the
>> pidfd_open(), right?
> 
> Actually no and my code is even too complex.
> If you are the parent, and this is really a sequence that obeys the
> ordering pidfd_open() before waiting:
> 
> pid = fork();
> if (pid == 0)
> 	exit(EXIT_SUCCESS);
> pidfd = pidfd_open(pid, 0);
> waitid(pid, ...);
> 
> Then you are guaranteed that pidfd will refer to pid. No recycling can
> happen since the process has not been waited upon yet (That is,

D'oh! Yes, of course. 

> excluding special cases such as where you have a mainloop where a
> callback reacts to a SIGCHLD event and waits on the child behind your
> back and your next callback in the mainloop calls pidfd_open() while the
> pid has been recycled etc.).
> A race could only appear in sequences where waiting happens before
> pidfd_open():
> 
> pid = fork();
> if (pid == 0)
> 	exit(EXIT_SUCCESS);
> waitid(pid, ...);
> pidfd = pidfd_open(pid, 0);
> 
> which honestly simply doesn't make any sense. So if you're the parent
> and you combine fork() + pidfd_open() correctly things should be fine
> without even having to verify via pidfd_send_signal() (I missed that in
> my first mail.).

Thanks for the additional detail.

I added the following to the pidfd_open() page, to
prevent people making the same thinko as me:

       The following code sequence can be used to obtain a file  descrip‐
       tor for the child of fork(2):

           pid = fork();
           if (pid > 0) {     /* If parent */
               pidfd = pidfd_open(pid, 0);
               ...
           }

       Even  if  the  child process has already terminated by the time of
       the pidfd_open() call, the returned file descriptor is  guaranteed
       to refer to the child because the parent has not yet waited on the
       child (and therefore, the child's ID has not been recycled).

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Matthew Wilcox @ 2019-09-24 20:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Omar Sandoval, Aleksa Sarai, Jens Axboe, linux-fsdevel,
	linux-btrfs, Dave Chinner, Linux API, Kernel Team,
	Andy Lutomirski
In-Reply-To: <20190924202229.mjvjigpnrskjtk5n@wittgenstein>

On Tue, Sep 24, 2019 at 10:22:29PM +0200, Christian Brauner wrote:
> On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> > Mmh... but if the file descriptor has been passed through a privilege
> > boundary, it isn't really clear whether the original opener of the
> > file intended for this to be possible. For example, if (as a
> > hypothetical example) the init process opens a service's logfile with
> > root privileges, then passes the file descriptor to that logfile to
> > the service on execve(), that doesn't mean that the service should be
> > able to perform compressed writes into that file, I think.
> 
> I think we should even generalize this: for most new properties a given
> file descriptor can carry we would want it to be explicitly enabled such
> that passing the fd around amounts to passing that property around. At
> least as soon as we consider it to be associated with some privilege
> boundary. I don't think we have done this generally. But I would very
> much support moving to such a model.

I think you've got this right.  This needs to be an fcntl() flag, which
is only settable by root.  Now, should it be an O_ flag, modifiable by
F_SETFL, or should it be a new F_ flag?

^ permalink raw reply

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Omar Sandoval @ 2019-09-24 20:38 UTC (permalink / raw)
  To: Jann Horn
  Cc: Aleksa Sarai, Jens Axboe, linux-fsdevel, linux-btrfs,
	Dave Chinner, Linux API, Kernel Team, Andy Lutomirski
In-Reply-To: <CAG48ez1NQBNR1XeVQYGoopEk=g_KedUr+7jxLQTaO+V8JCeweQ@mail.gmail.com>

On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> > On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > > like to add an interface to write pre-compressed data directly to the
> > > > > filesystem. This adds support for so-called "encoded writes" via
> > > > > pwritev2().
> > > > >
> > > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > contains metadata about the write: namely, the compression algorithm and
> > > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > > the interface in the future. The remaining iovecs contain the encoded
> > > > > extent.
> > > > >
> > > > > A similar interface for reading encoded data can be added to preadv2()
> > > > > in the future.
> > > > >
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > [...]
> > > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > > +                        struct iov_iter *from)
> > > > > +{
> > > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > > +               return -EINVAL;
> > > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > > +               return -EFAULT;
> > > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > > +               return 0;
> > > > > +       }
> > > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > > +               return -EINVAL;
> > > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > > +               return -EPERM;
> > > >
> > > > How does this capable() check interact with io_uring? Without having
> > > > looked at this in detail, I suspect that when an encoded write is
> > > > requested through io_uring, the capable() check might be executed on
> > > > something like a workqueue worker thread, which is probably running
> > > > with a full capability set.
> > >
> > > I discussed this more with Jens. You're right, per-IO permission checks
> > > aren't going to work. In fully-polled mode, we never get an opportunity
> > > to check capabilities in right context. So, this will probably require a
> > > new open flag.
> >
> > Actually, file_ns_capable() accomplishes the same thing without a new
> > open flag. Changing the capable() check to file_ns_capable() in
> > init_user_ns should be enough.
> 
> +Aleksa for openat2() and open() space
> 
> Mmh... but if the file descriptor has been passed through a privilege
> boundary, it isn't really clear whether the original opener of the
> file intended for this to be possible. For example, if (as a
> hypothetical example) the init process opens a service's logfile with
> root privileges, then passes the file descriptor to that logfile to
> the service on execve(), that doesn't mean that the service should be
> able to perform compressed writes into that file, I think.

Ahh, you're right.

> I think that an open flag (as you already suggested) or an fcntl()
> operation would do the job; but AFAIK the open() flag space has run
> out, so if you hook it up that way, I think you might have to wait for
> Aleksa Sarai to get something like his sys_openat2() suggestion
> (https://lore.kernel.org/lkml/20190904201933.10736-12-cyphar@cyphar.com/)
> merged?

If I counted correctly, there's still space for a new O_ flag. One of
the problems that Aleksa is solving is that unknown O_ flags are
silently ignored, which isn't an issue for an O_ENCODED flag. If the
kernel doesn't support it, it won't support RWF_ENCODED, either, so
you'll get EOPNOTSUPP from pwritev2(). So, open flag it is...

^ permalink raw reply

* Re: [PATCH v2 7/7] random: Remove kernel.random.read_wakeup_threshold
From: Jann Horn @ 2019-09-24 20:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Tso, LKML, Linux API, Kees Cook, Jason A. Donenfeld,
	Ahmed S. Darwish, Lennart Poettering, Eric W. Biederman,
	Alexander E. Patrakov, Michael Kerrisk, Willy Tarreau,
	Matthew Garrett, Ext4 Developers List, linux-man
In-Reply-To: <66b16acf2953fc033abc9641b9cf43d23e75a8e9.1568990048.git.luto@kernel.org>

On Fri, Sep 20, 2019 at 4:37 PM Andy Lutomirski <luto@kernel.org> wrote:
> It has no effect any more, so remove it.  We can revert this if
> there is some user code that expects to be able to set this sysctl.
>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  drivers/char/random.c | 18 +-----------------
>  1 file changed, 1 insertion(+), 17 deletions(-)
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
[...]
> -       {
> -               .procname       = "read_wakeup_threshold",

There's a line in bin_random_table in kernel/sysctl_binary.c that
refers to this sysctl, that should probably also be deleted?

^ permalink raw reply

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Christian Brauner @ 2019-09-24 20:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Omar Sandoval, Aleksa Sarai, Jens Axboe, linux-fsdevel,
	linux-btrfs, Dave Chinner, Linux API, Kernel Team,
	Andy Lutomirski
In-Reply-To: <CAG48ez1NQBNR1XeVQYGoopEk=g_KedUr+7jxLQTaO+V8JCeweQ@mail.gmail.com>

On Tue, Sep 24, 2019 at 10:01:41PM +0200, Jann Horn wrote:
> On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> > On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > > like to add an interface to write pre-compressed data directly to the
> > > > > filesystem. This adds support for so-called "encoded writes" via
> > > > > pwritev2().
> > > > >
> > > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > > contains metadata about the write: namely, the compression algorithm and
> > > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > > the interface in the future. The remaining iovecs contain the encoded
> > > > > extent.
> > > > >
> > > > > A similar interface for reading encoded data can be added to preadv2()
> > > > > in the future.
> > > > >
> > > > > Filesystems must indicate that they support encoded writes by setting
> > > > > FMODE_ENCODED_IO in ->file_open().
> > > > [...]
> > > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > > +                        struct iov_iter *from)
> > > > > +{
> > > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > > +               return -EINVAL;
> > > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > > +               return -EFAULT;
> > > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > > +               return 0;
> > > > > +       }
> > > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > > +               return -EINVAL;
> > > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > > +               return -EPERM;
> > > >
> > > > How does this capable() check interact with io_uring? Without having
> > > > looked at this in detail, I suspect that when an encoded write is
> > > > requested through io_uring, the capable() check might be executed on
> > > > something like a workqueue worker thread, which is probably running
> > > > with a full capability set.
> > >
> > > I discussed this more with Jens. You're right, per-IO permission checks
> > > aren't going to work. In fully-polled mode, we never get an opportunity
> > > to check capabilities in right context. So, this will probably require a
> > > new open flag.
> >
> > Actually, file_ns_capable() accomplishes the same thing without a new
> > open flag. Changing the capable() check to file_ns_capable() in
> > init_user_ns should be enough.
> 
> +Aleksa for openat2() and open() space
> 
> Mmh... but if the file descriptor has been passed through a privilege
> boundary, it isn't really clear whether the original opener of the
> file intended for this to be possible. For example, if (as a
> hypothetical example) the init process opens a service's logfile with
> root privileges, then passes the file descriptor to that logfile to
> the service on execve(), that doesn't mean that the service should be
> able to perform compressed writes into that file, I think.

I think we should even generalize this: for most new properties a given
file descriptor can carry we would want it to be explicitly enabled such
that passing the fd around amounts to passing that property around. At
least as soon as we consider it to be associated with some privilege
boundary. I don't think we have done this generally. But I would very
much support moving to such a model.

Christian

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Christian Brauner @ 2019-09-24 20:07 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Florian Weimer, Oleg Nesterov, Jann Horn, Eric W. Biederman,
	Daniel Colascione, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <20190924195701.7pw2olbviieqsg5q@wittgenstein>

On Tue, Sep 24, 2019 at 09:57:04PM +0200, Christian Brauner wrote:
> On Tue, Sep 24, 2019 at 09:44:49PM +0200, Michael Kerrisk (man-pages) wrote:
> > Hello Christian,
> > 
> > On 9/23/19 4:23 PM, Christian Brauner wrote:
> > > On Mon, Sep 23, 2019 at 01:26:34PM +0200, Florian Weimer wrote:
> > >> * Michael Kerrisk:
> > >>
> > >>> SYNOPSIS
> > >>>        int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
> > >>>                              unsigned int flags);
> > >>
> > >> This probably should reference a header for siginfo_t.
> > > 
> > > Agreed.
> > > 
> > >>
> > >>>        ESRCH  The target process does not exist.
> > >>
> > >> If the descriptor is valid, does this mean the process has been waited
> > >> for?  Maybe this can be made more explicit.
> > > 
> > > If by valid you mean "refers to a process/thread-group leader" aka is a
> > > pidfd then yes: Getting ESRCH means that the process has exited and has
> > > already been waited upon.
> > > If it had only exited but not waited upon aka is a zombie, then sending
> > > a signal will just work because that's currently how sending signals to
> > > zombies works, i.e. if you only send a signal and don't do any
> > > additional checks you won't notice a difference between a process being
> > > alive and a process being a zombie. The userspace visible behavior in
> > > terms of signaling them is identical.
> > 
> > (Thanks for the clarification. I added the text "(i.e., it has 
> > terminated and been waited on)" to the ESRCH error.)
> > 
> > >>>        The  pidfd_send_signal()  system call allows the avoidance of race
> > >>>        conditions that occur when using traditional interfaces  (such  as
> > >>>        kill(2)) to signal a process.  The problem is that the traditional
> > >>>        interfaces specify the target process via a process ID (PID), with
> > >>>        the  result  that the sender may accidentally send a signal to the
> > >>>        wrong process if the originally intended target process has termi‐
> > >>>        nated  and its PID has been recycled for another process.  By con‐
> > >>>        trast, a PID file descriptor is a stable reference to  a  specific
> > >>>        process;  if  that  process  terminates,  then the file descriptor
> > >>>        ceases to be  valid  and  the  caller  of  pidfd_send_signal()  is
> > >>>        informed of this fact via an ESRCH error.
> > >>
> > >> It would be nice to explain somewhere how you can avoid the race using
> > >> a PID descriptor.  Is there anything else besides CLONE_PIDFD?
> > > 
> > > If you're the parent of the process you can do this without CLONE_PIDFD:
> > > pid = fork();
> > > pidfd = pidfd_open();
> > > ret = pidfd_send_signal(pidfd, 0, NULL, 0);
> > > if (ret < 0 && errno == ESRCH)
> > > 	/* pidfd refers to another, recycled process */
> > 
> > Although there is still the race between the fork() and the
> > pidfd_open(), right?
> 
> Actually no and my code is even too complex.
> If you are the parent, and this is really a sequence that obeys the
> ordering pidfd_open() before waiting:
> 
> pid = fork();
> if (pid == 0)
> 	exit(EXIT_SUCCESS);
> pidfd = pidfd_open(pid, 0);
> waitid(pid, ...);
> 
> Then you are guaranteed that pidfd will refer to pid. No recycling can
> happen since the process has not been waited upon yet (That is,
> excluding special cases such as where you have a mainloop where a
> callback reacts to a SIGCHLD event and waits on the child behind your
> back and your next callback in the mainloop calls pidfd_open() while the
> pid has been recycled etc.).

If we wanted to be super nitpicky one could also get in that situation
where you do:

signal(SIGCHLD,SIG_IGN);

// or

struct sigaction sa;
sa.sa_handler = SIG_IGN;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
sigaction(SIGCHLD, &sa, 0)

pid = fork();
if (pid == 0)
	exit(EXIT_SUCCESS);
pidfd = pidfd_open();

because then the process gets autoreaped and can be recycled. But again,
that's just bad form and in that scenario one should again use
clone(CLONE_PIDFD) instead of fork().

Christian

^ permalink raw reply

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Jann Horn @ 2019-09-24 20:01 UTC (permalink / raw)
  To: Omar Sandoval, Aleksa Sarai
  Cc: Jens Axboe, linux-fsdevel, linux-btrfs, Dave Chinner, Linux API,
	Kernel Team, Andy Lutomirski
In-Reply-To: <20190924193513.GA45540@vader>

On Tue, Sep 24, 2019 at 9:35 PM Omar Sandoval <osandov@osandov.com> wrote:
> On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> > On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > > Btrfs can transparently compress data written by the user. However, we'd
> > > > like to add an interface to write pre-compressed data directly to the
> > > > filesystem. This adds support for so-called "encoded writes" via
> > > > pwritev2().
> > > >
> > > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > contains metadata about the write: namely, the compression algorithm and
> > > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > > the interface in the future. The remaining iovecs contain the encoded
> > > > extent.
> > > >
> > > > A similar interface for reading encoded data can be added to preadv2()
> > > > in the future.
> > > >
> > > > Filesystems must indicate that they support encoded writes by setting
> > > > FMODE_ENCODED_IO in ->file_open().
> > > [...]
> > > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > > +                        struct iov_iter *from)
> > > > +{
> > > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > > +               return -EINVAL;
> > > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > > +               return -EFAULT;
> > > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > > +               return 0;
> > > > +       }
> > > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > > +               return -EINVAL;
> > > > +       if (!capable(CAP_SYS_ADMIN))
> > > > +               return -EPERM;
> > >
> > > How does this capable() check interact with io_uring? Without having
> > > looked at this in detail, I suspect that when an encoded write is
> > > requested through io_uring, the capable() check might be executed on
> > > something like a workqueue worker thread, which is probably running
> > > with a full capability set.
> >
> > I discussed this more with Jens. You're right, per-IO permission checks
> > aren't going to work. In fully-polled mode, we never get an opportunity
> > to check capabilities in right context. So, this will probably require a
> > new open flag.
>
> Actually, file_ns_capable() accomplishes the same thing without a new
> open flag. Changing the capable() check to file_ns_capable() in
> init_user_ns should be enough.

+Aleksa for openat2() and open() space

Mmh... but if the file descriptor has been passed through a privilege
boundary, it isn't really clear whether the original opener of the
file intended for this to be possible. For example, if (as a
hypothetical example) the init process opens a service's logfile with
root privileges, then passes the file descriptor to that logfile to
the service on execve(), that doesn't mean that the service should be
able to perform compressed writes into that file, I think.

I think that an open flag (as you already suggested) or an fcntl()
operation would do the job; but AFAIK the open() flag space has run
out, so if you hook it up that way, I think you might have to wait for
Aleksa Sarai to get something like his sys_openat2() suggestion
(https://lore.kernel.org/lkml/20190904201933.10736-12-cyphar@cyphar.com/)
merged?

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Christian Brauner @ 2019-09-24 19:57 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Florian Weimer, Oleg Nesterov, Jann Horn, Eric W. Biederman,
	Daniel Colascione, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <90dd38d5-34b3-b72f-8e5a-b51f944f22fb@gmail.com>

On Tue, Sep 24, 2019 at 09:44:49PM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> On 9/23/19 4:23 PM, Christian Brauner wrote:
> > On Mon, Sep 23, 2019 at 01:26:34PM +0200, Florian Weimer wrote:
> >> * Michael Kerrisk:
> >>
> >>> SYNOPSIS
> >>>        int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
> >>>                              unsigned int flags);
> >>
> >> This probably should reference a header for siginfo_t.
> > 
> > Agreed.
> > 
> >>
> >>>        ESRCH  The target process does not exist.
> >>
> >> If the descriptor is valid, does this mean the process has been waited
> >> for?  Maybe this can be made more explicit.
> > 
> > If by valid you mean "refers to a process/thread-group leader" aka is a
> > pidfd then yes: Getting ESRCH means that the process has exited and has
> > already been waited upon.
> > If it had only exited but not waited upon aka is a zombie, then sending
> > a signal will just work because that's currently how sending signals to
> > zombies works, i.e. if you only send a signal and don't do any
> > additional checks you won't notice a difference between a process being
> > alive and a process being a zombie. The userspace visible behavior in
> > terms of signaling them is identical.
> 
> (Thanks for the clarification. I added the text "(i.e., it has 
> terminated and been waited on)" to the ESRCH error.)
> 
> >>>        The  pidfd_send_signal()  system call allows the avoidance of race
> >>>        conditions that occur when using traditional interfaces  (such  as
> >>>        kill(2)) to signal a process.  The problem is that the traditional
> >>>        interfaces specify the target process via a process ID (PID), with
> >>>        the  result  that the sender may accidentally send a signal to the
> >>>        wrong process if the originally intended target process has termi‐
> >>>        nated  and its PID has been recycled for another process.  By con‐
> >>>        trast, a PID file descriptor is a stable reference to  a  specific
> >>>        process;  if  that  process  terminates,  then the file descriptor
> >>>        ceases to be  valid  and  the  caller  of  pidfd_send_signal()  is
> >>>        informed of this fact via an ESRCH error.
> >>
> >> It would be nice to explain somewhere how you can avoid the race using
> >> a PID descriptor.  Is there anything else besides CLONE_PIDFD?
> > 
> > If you're the parent of the process you can do this without CLONE_PIDFD:
> > pid = fork();
> > pidfd = pidfd_open();
> > ret = pidfd_send_signal(pidfd, 0, NULL, 0);
> > if (ret < 0 && errno == ESRCH)
> > 	/* pidfd refers to another, recycled process */
> 
> Although there is still the race between the fork() and the
> pidfd_open(), right?

Actually no and my code is even too complex.
If you are the parent, and this is really a sequence that obeys the
ordering pidfd_open() before waiting:

pid = fork();
if (pid == 0)
	exit(EXIT_SUCCESS);
pidfd = pidfd_open(pid, 0);
waitid(pid, ...);

Then you are guaranteed that pidfd will refer to pid. No recycling can
happen since the process has not been waited upon yet (That is,
excluding special cases such as where you have a mainloop where a
callback reacts to a SIGCHLD event and waits on the child behind your
back and your next callback in the mainloop calls pidfd_open() while the
pid has been recycled etc.).
A race could only appear in sequences where waiting happens before
pidfd_open():

pid = fork();
if (pid == 0)
	exit(EXIT_SUCCESS);
waitid(pid, ...);
pidfd = pidfd_open(pid, 0);

which honestly simply doesn't make any sense. So if you're the parent
and you combine fork() + pidfd_open() correctly things should be fine
without even having to verify via pidfd_send_signal() (I missed that in
my first mail.).
(Now, it gets more hairy when one considers clone(CLONE_PARENT) but that
would be wildly esoteric because at that point you're using clone()
already and then you should simply pass clone(CLONE_PARENT | CLONE_PIDFD).)

If you're _not_ the parent then CLONE_PIDFD and sending around the pidfd
are your only option to avoiding the race imho.

> 
> >>>        static
> >>>        int pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
> >>>                unsigned int flags)
> >>>        {
> >>>            return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
> >>>        }
> >>
> >> Please use a different function name.  Thanks.
> 
> Covered in another thread. I await some further feedback from Florian.

Right, that wasn't my suggestion anyway. :)

Thanks!
Christian

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-24 19:44 UTC (permalink / raw)
  To: Christian Brauner, Florian Weimer
  Cc: mtk.manpages, Oleg Nesterov, Christian Brauner, Jann Horn,
	Eric W. Biederman, Daniel Colascione, Joel Fernandes, linux-man,
	Linux API, lkml
In-Reply-To: <20190923142325.jowzbnwjw7g7si7j@wittgenstein>

Hello Christian,

On 9/23/19 4:23 PM, Christian Brauner wrote:
> On Mon, Sep 23, 2019 at 01:26:34PM +0200, Florian Weimer wrote:
>> * Michael Kerrisk:
>>
>>> SYNOPSIS
>>>        int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
>>>                              unsigned int flags);
>>
>> This probably should reference a header for siginfo_t.
> 
> Agreed.
> 
>>
>>>        ESRCH  The target process does not exist.
>>
>> If the descriptor is valid, does this mean the process has been waited
>> for?  Maybe this can be made more explicit.
> 
> If by valid you mean "refers to a process/thread-group leader" aka is a
> pidfd then yes: Getting ESRCH means that the process has exited and has
> already been waited upon.
> If it had only exited but not waited upon aka is a zombie, then sending
> a signal will just work because that's currently how sending signals to
> zombies works, i.e. if you only send a signal and don't do any
> additional checks you won't notice a difference between a process being
> alive and a process being a zombie. The userspace visible behavior in
> terms of signaling them is identical.

(Thanks for the clarification. I added the text "(i.e., it has 
terminated and been waited on)" to the ESRCH error.)

>>>        The  pidfd_send_signal()  system call allows the avoidance of race
>>>        conditions that occur when using traditional interfaces  (such  as
>>>        kill(2)) to signal a process.  The problem is that the traditional
>>>        interfaces specify the target process via a process ID (PID), with
>>>        the  result  that the sender may accidentally send a signal to the
>>>        wrong process if the originally intended target process has termi‐
>>>        nated  and its PID has been recycled for another process.  By con‐
>>>        trast, a PID file descriptor is a stable reference to  a  specific
>>>        process;  if  that  process  terminates,  then the file descriptor
>>>        ceases to be  valid  and  the  caller  of  pidfd_send_signal()  is
>>>        informed of this fact via an ESRCH error.
>>
>> It would be nice to explain somewhere how you can avoid the race using
>> a PID descriptor.  Is there anything else besides CLONE_PIDFD?
> 
> If you're the parent of the process you can do this without CLONE_PIDFD:
> pid = fork();
> pidfd = pidfd_open();
> ret = pidfd_send_signal(pidfd, 0, NULL, 0);
> if (ret < 0 && errno == ESRCH)
> 	/* pidfd refers to another, recycled process */

Although there is still the race between the fork() and the
pidfd_open(), right?

>>>        static
>>>        int pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
>>>                unsigned int flags)
>>>        {
>>>            return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
>>>        }
>>
>> Please use a different function name.  Thanks.

Covered in another thread. I await some further feedback from Florian.

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-24 19:43 UTC (permalink / raw)
  To: Florian Weimer
  Cc: mtk.manpages, Oleg Nesterov, Christian Brauner, Jann Horn,
	Eric W. Biederman, Daniel Colascione, Joel Fernandes, linux-man,
	Linux API, lkml
In-Reply-To: <87pnjr9rth.fsf@mid.deneb.enyo.de>

Hello Florian,

On 9/23/19 1:26 PM, Florian Weimer wrote:
> * Michael Kerrisk:
> 
>> SYNOPSIS
>>        int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
>>                              unsigned int flags);
> 
> This probably should reference a header for siginfo_t.

Thanks. I added: #include <signal.h>

>>        ESRCH  The target process does not exist.
> 
> If the descriptor is valid, does this mean the process has been waited
> for?  Maybe this can be made more explicit.

Yes. I added "(i.e., it has terminated and been waited on)".

>>        The  pidfd_send_signal()  system call allows the avoidance of race
>>        conditions that occur when using traditional interfaces  (such  as
>>        kill(2)) to signal a process.  The problem is that the traditional
>>        interfaces specify the target process via a process ID (PID), with
>>        the  result  that the sender may accidentally send a signal to the
>>        wrong process if the originally intended target process has termi‐
>>        nated  and its PID has been recycled for another process.  By con‐
>>        trast, a PID file descriptor is a stable reference to  a  specific
>>        process;  if  that  process  terminates,  then the file descriptor
>>        ceases to be  valid  and  the  caller  of  pidfd_send_signal()  is
>>        informed of this fact via an ESRCH error.
> 
> It would be nice to explain somewhere how you can avoid the race using
> a PID descriptor.  Is there anything else besides CLONE_PIDFD?

Please see my comment in reply to Christian (which will be sent just
after this).

>>        static
>>        int pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
>>                unsigned int flags)
>>        {
>>            return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
>>        }
> 
> Please use a different function name.  Thanks.

Please see my open question in the thread on pidfd_open().

Thanks for the review, Florian.

Cheers,

Michael

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-24 19:42 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: mtk.manpages, Oleg Nesterov, Christian Brauner, Jann Horn,
	Eric W. Biederman, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <CAKOZuetMK0eRxBrR8wXo_qCaQ7OGKQHqAy15cX437+Q+cvbbvA@mail.gmail.com>

On 9/23/19 1:31 PM, Daniel Colascione wrote:
> On Mon, Sep 23, 2019 at 2:12 AM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>>        The  pidfd_send_signal()  system call allows the avoidance of race
>>        conditions that occur when using traditional interfaces  (such  as
>>        kill(2)) to signal a process.  The problem is that the traditional
>>        interfaces specify the target process via a process ID (PID), with
>>        the  result  that the sender may accidentally send a signal to the
>>        wrong process if the originally intended target process has termi‐
>>        nated  and its PID has been recycled for another process.  By con‐
>>        trast, a PID file descriptor is a stable reference to  a  specific
>>        process;  if  that  process  terminates,  then the file descriptor
>>        ceases to be  valid
> 
> The file *descriptor* remains valid even after the process to which it
> refers exits. You can close(2) the file descriptor without getting
> EBADF. I'd say, instead, that "a PID file descriptor is a stable
> reference to a specific process; process-related operations on a PID
> file descriptor fail after that process exits".

Thanks, Daniel. I like that rephrasing, but, since pidfd_send_signal()
is (so far as I know) currently the only relevant process-related
operation (and because this is the manual page describing that
syscall), I made it:

[[
By contrast, a PID file descriptor is a stable reference to a
specific process; if that process terminates, pidfd_send_signal()
fails with the error ESRCH.
]]

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Omar Sandoval @ 2019-09-24 19:35 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jens Axboe, linux-fsdevel, linux-btrfs, Dave Chinner, Linux API,
	Kernel Team, Andy Lutomirski
In-Reply-To: <20190924171513.GA39872@vader>

On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > Btrfs can transparently compress data written by the user. However, we'd
> > > like to add an interface to write pre-compressed data directly to the
> > > filesystem. This adds support for so-called "encoded writes" via
> > > pwritev2().
> > >
> > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > contains metadata about the write: namely, the compression algorithm and
> > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > the interface in the future. The remaining iovecs contain the encoded
> > > extent.
> > >
> > > A similar interface for reading encoded data can be added to preadv2()
> > > in the future.
> > >
> > > Filesystems must indicate that they support encoded writes by setting
> > > FMODE_ENCODED_IO in ->file_open().
> > [...]
> > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > +                        struct iov_iter *from)
> > > +{
> > > +       if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > +               return -EINVAL;
> > > +       if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > +               return -EFAULT;
> > > +       if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > +           encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > +               iocb->ki_flags &= ~IOCB_ENCODED;
> > > +               return 0;
> > > +       }
> > > +       if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > +           encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > +               return -EINVAL;
> > > +       if (!capable(CAP_SYS_ADMIN))
> > > +               return -EPERM;
> > 
> > How does this capable() check interact with io_uring? Without having
> > looked at this in detail, I suspect that when an encoded write is
> > requested through io_uring, the capable() check might be executed on
> > something like a workqueue worker thread, which is probably running
> > with a full capability set.
> 
> I discussed this more with Jens. You're right, per-IO permission checks
> aren't going to work. In fully-polled mode, we never get an opportunity
> to check capabilities in right context. So, this will probably require a
> new open flag.

Actually, file_ns_capable() accomplishes the same thing without a new
open flag. Changing the capable() check to file_ns_capable() in
init_user_ns should be enough.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox