* Re: [PATCH RFC v4 1/1] random: WARN on large getrandom() waits and introduce getrandom2()
From: Willy Tarreau @ 2019-09-21 3:05 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linus Torvalds, Ahmed S. Darwish, Lennart Poettering,
Theodore Y. Ts'o, Eric W. Biederman, Alexander E. Patrakov,
Michael Kerrisk, Matthew Garrett, lkml, Ext4 Developers List,
Linux API, linux-man
In-Reply-To: <CALCETrWCjGHKnKikj+YVw22Ufpmnh1TCdGPjG2RL-qzsF=wisA@mail.gmail.com>
On Fri, Sep 20, 2019 at 04:30:20PM -0700, Andy Lutomirski wrote:
> So I think that just improving the
> getrandom()-is-blocking-on-x86-and-arm behavior, adding GRND_INSECURE
> and GRND_SECURE_BLOCKING, and adding the warning if 0 is passed is
> good enough.
I think so as well. Anyway, keep in mind that *with a sane API*,
userland can improve very quickly (faster than kernel deployments in
field). But userland developers need reliable and testable support for
features. If it's enough to do #ifndef GRND_xxx/#define GRND_xxx and
call getrandom() with these flags to detect support, it's basically 5
reliable lines of code to add to userland to make a warning disappear
and/or to allow a system that previously failed to boot to now boot. So
this gives strong incentive to userland to adopt the new API, provided
there's a way for the developer to understand what's happening (which
the warning does).
If we do it right, all we'll hear are userland developers complaining
that those stupid kernel developers have changed their API again and
really don't know what they want. That will be a good sign that the
warning flows back to them and that adoption is taking.
And if the change is small enough, maybe it could make sense to backport
it to stable versions to fix boot issues. With a testable feature it
does make sense.
Willy
^ permalink raw reply
* Re: [PATCH RFC v4 1/1] random: WARN on large getrandom() waits and introduce getrandom2()
From: Florian Weimer @ 2019-09-21 6:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Ahmed S. Darwish, Lennart Poettering,
Theodore Y. Ts'o, Eric W. Biederman, Alexander E. Patrakov,
Michael Kerrisk, Willy Tarreau, Matthew Garrett, lkml,
Ext4 Developers List, Linux API, linux-man
In-Reply-To: <CAHk-=wjpTWgpo6d24pTv+ubfea_uEomX-sHjjOkdACfV-8Nmkg@mail.gmail.com>
* Linus Torvalds:
> Violently agreed. And that's kind of what the GRND_EXPLICIT is really
> aiming for.
>
> However, it's worth noting that nobody should ever use GRND_EXPLICIT
> directly. That's just the name for the bit. The actual users would use
> GRND_INSECURE or GRND_SECURE.
Should we switch glibc's getentropy to GRND_EXPLICIT? Or something
else?
I don't think we want to print a kernel warning for this function.
Thanks,
Florian
^ permalink raw reply
* For review: pidfd_open(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-23 9:11 UTC (permalink / raw)
To: Christian Brauner, Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes
Cc: mtk.manpages, Linux API, lkml, linux-man, Oleg Nesterov
Hello Christian and all,
Below, I have the rendered version of the current draft of
the pidfd_open(2) manual page that I have written.
The page source can be found in a Git branch at:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
I would be pleased to receive corrections and notes on any
details that should be added. (For example, are there error
cases that I have missed?)
Would you be able to review please?
Thanks,
Michael
NAME
pidfd_open - obtain a file descriptor that refers to a process
SYNOPSIS
int pidfd_open(pid_t pid, unsigned int flags);
DESCRIPTION
The pidfd_open() system creates a file descriptor that refers to
the process whose PID is specified in pid. The file descriptor is
returned as the function result; the close-on-exec flag is set on
the file descriptor.
The flags argument is reserved for future use; currently, this
argument must be specified as 0.
RETURN VALUE
On success, pidfd_open() returns a nonnegative file descriptor.
On success, -1 is returned and errno is set to indicate the cause
of the error.
ERRORS
EINVAL flags is not 0.
EINVAL pid is not valid.
ESRCH The process specified by pid does not exist.
VERSIONS
pidfd_open() first appeared in Linux 5.3.
CONFORMING TO
pidfd_open() is Linux specific.
NOTES
Currently, there is no glibc wrapper for this system call; call it
using syscall(2).
The pidfd_send_signal(2) system call can be used to send a signal
to the process referred to by a PID file descriptor.
A PID file descriptor can be monitored using poll(2), select(2),
and epoll(7). When the process that it refers to terminates, the
file descriptor indicates as readable. Note, however, that in the
current implementation, nothing can be read from the file descrip‐
tor.
The pidfd_open() system call is the preferred way of obtaining a
PID file descriptor. The alternative is to obtain a file descrip‐
tor by opening a /proc/[pid] directory. However, the latter tech‐
nique is possible only if the proc(5) file system is mounted; fur‐
thermore, the file descriptor obtained in this way is not pol‐
lable.
See also the discussion of the CLONE_PIDFD flag in clone(2).
EXAMPLE
The program below opens a PID file descriptor for the process
whose PID is specified as its command-line argument. It then mon‐
itors the file descriptor for readability (POLLIN) using poll(2).
When the process with the specified by PID terminates, poll(2)
returns, and indicates that the file descriptor is readable.
Program source
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#include <poll.h>
#include <stdlib.h>
#include <stdio.h>
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
static
int pidfd_open(pid_t pid, unsigned int flags)
{
return syscall(__NR_pidfd_open, pid, flags);
}
int
main(int argc, char *argv[])
{
struct pollfd pollfd;
int pidfd, ready;
if (argc != 2) {
fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
exit(EXIT_SUCCESS);
}
pidfd = pidfd_open(atoi(argv[1]), 0);
if (pidfd == -1) {
perror("pidfd_open");
exit(EXIT_FAILURE);
}
pollfd.fd = pidfd;
pollfd.events = POLLIN;
ready = poll(&pollfd, 1, -1);
if (ready == -1) {
perror("poll");
exit(EXIT_FAILURE);
}
printf("Events (0x%x): POLLIN is %sset\n", pollfd.revents,
(pollfd.revents & POLLIN) ? "" : "not ");
exit(EXIT_SUCCESS);
}
SEE ALSO
clone(2), kill(2), pidfd_send_signal(2), poll(2), select(2),
epoll(7)
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-23 9:12 UTC (permalink / raw)
To: Oleg Nesterov, Christian Brauner, Jann Horn, Eric W. Biederman,
Daniel Colascione, Joel Fernandes
Cc: mtk.manpages, linux-man, Linux API, lkml
Hello Christian and all,
Below, I have the rendered version of the current draft of
the pidfd_send_signal(2) manual page that I have written.
The page source can be found in a Git branch at:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
I would be pleased to receive corrections and notes on any
details that should be added. (For example, are there error
cases that I have missed?)
Would you be able to review please?
Thanks,
Michael
NAME
pidfd_send_signal - send a signal to a process specified by a file
descriptor
SYNOPSIS
int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
unsigned int flags);
DESCRIPTION
The pidfd_send_signal() system call sends the signal sig to the
target process referred to by pidfd, a PID file descriptor that
refers to a process.
If the info argument points to a siginfo_t buffer, that buffer
should be populated as described in rt_sigqueueinfo(2).
If the info argument is a NULL pointer, this is equivalent to
specifying a pointer to a siginfo_t buffer whose fields match the
values that are implicitly supplied when a signal is sent using
kill(2):
* si_signo is set to the signal number;
* si_errno is set to 0;
* si_code is set to SI_USER;
* si_pid is set to the caller's PID; and
* si_uid is set to the caller's real user ID.
The calling process must either be in the same PID namespace as
the process referred to by pidfd, or be in an ancestor of that
namespace.
The flags argument is reserved for future use; currently, this
argument must be specified as 0.
RETURN VALUE
On success, pidfd_send_signal() returns 0. On success, -1 is
returned and errno is set to indicate the cause of the error.
ERRORS
EBADF pidfd is not a valid PID file descriptor.
EINVAL sig is not a valid signal.
EINVAL The calling process is not in a PID namespace from which it
can send a signal to the target process.
EINVAL flags is not 0.
EPERM The calling process does not have permission to send the
signal to the target process.
EPERM pidfd doesn't refer to the calling process, and
info.si_code is invalid (see rt_sigqueueinfo(2)).
ESRCH The target process does not exist.
VERSIONS
pidfd_send_signal() first appeared in Linux 5.1.
CONFORMING TO
pidfd_send_signal() is Linux specific.
NOTES
Currently, there is no glibc wrapper for this system call; call it
using syscall(2).
PID file descriptors
The pidfd argument is a PID file descriptor, a file descriptor
that refers to process. Such a file descriptor can be obtained
in any of the following ways:
* by opening a /proc/[pid] directory;
* using pidfd_open(2); or
* via the PID file descriptor that is returned by a call to
clone(2) or clone3(2) that specifies the CLONE_PIDFD flag.
The pidfd_send_signal() system call allows the avoidance of race
conditions that occur when using traditional interfaces (such as
kill(2)) to signal a process. The problem is that the traditional
interfaces specify the target process via a process ID (PID), with
the result that the sender may accidentally send a signal to the
wrong process if the originally intended target process has termi‐
nated and its PID has been recycled for another process. By con‐
trast, a PID file descriptor is a stable reference to a specific
process; if that process terminates, then the file descriptor
ceases to be valid and the caller of pidfd_send_signal() is
informed of this fact via an ESRCH error.
EXAMPLE
#define _GNU_SOURCE
#include <limits.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
static
int pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
unsigned int flags)
{
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
}
int
main(int argc, char *argv[])
{
siginfo_t info;
char path[PATH_MAX];
int pidfd, sig;
if (argc != 3) {
fprintf(stderr, "Usage: %s <pid> <signal>\n", argv[0]);
exit(EXIT_FAILURE);
}
sig = atoi(argv[2]);
/* Obtain a PID file descriptor by opening the /proc/PID directory
of the target process */
snprintf(path, sizeof(path), "/proc/%s", argv[1]);
pidfd = open(path, O_RDONLY);
if (pidfd == -1) {
perror("open");
exit(EXIT_FAILURE);
}
/* Populate a 'siginfo_t' structure for use with
pidfd_send_signal() */
memset(&info, 0, sizeof(info));
info.si_code = SI_QUEUE;
info.si_signo = sig;
info.si_errno = 0;
info.si_uid = getuid();
info.si_pid = getpid();
info.si_value.sival_int = 1234;
/* Send the signal */
if (pidfd_send_signal(pidfd, sig, &info, 0) == -1) {
perror("pidfd_send_signal");
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
SEE ALSO
clone(2), kill(2), pidfd_open(2), rt_sigqueueinfo(2), sigac‐
tion(2), pid_namespaces(7), signal(7)
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Florian Weimer @ 2019-09-23 10:53 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Christian Brauner, Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <90399dee-53d8-a82c-3871-9ec8f94601ce@gmail.com>
* Michael Kerrisk:
> SYNOPSIS
> int pidfd_open(pid_t pid, unsigned int flags);
Should this mention <sys/types.h> for pid_t?
> ERRORS
> EINVAL flags is not 0.
>
> EINVAL pid is not valid.
>
> ESRCH The process specified by pid does not exist.
Presumably, EMFILE and ENFILE are also possible errors, and so is
ENOMEM.
> A PID file descriptor can be monitored using poll(2), select(2),
> and epoll(7). When the process that it refers to terminates, the
> file descriptor indicates as readable. Note, however, that in the
> current implementation, nothing can be read from the file descrip‐
> tor.
“is indicated as readable” or “becomes readable”? Will reading block?
> The pidfd_open() system call is the preferred way of obtaining a
> PID file descriptor. The alternative is to obtain a file descrip‐
> tor by opening a /proc/[pid] directory. However, the latter tech‐
> nique is possible only if the proc(5) file system is mounted; fur‐
> thermore, the file descriptor obtained in this way is not pol‐
> lable.
One question is whether the glibc wrapper should fall back back to the
/proc subdirectory if it is not available. Probably not.
> static
> int pidfd_open(pid_t pid, unsigned int flags)
> {
> return syscall(__NR_pidfd_open, pid, flags);
> }
Please call this function something else (not pidfd_open), so that the
example continues to work if glibc provides the system call wrapper.
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Daniel Colascione @ 2019-09-23 11:26 UTC (permalink / raw)
To: Florian Weimer
Cc: Michael Kerrisk (man-pages), Christian Brauner, Jann Horn,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <87tv939td6.fsf@mid.deneb.enyo.de>
On Mon, Sep 23, 2019 at 3:53 AM Florian Weimer <fw@deneb.enyo.de> wrote:
>
> * Michael Kerrisk:
>
> > SYNOPSIS
> > int pidfd_open(pid_t pid, unsigned int flags);
>
> Should this mention <sys/types.h> for pid_t?
>
> > ERRORS
> > EINVAL flags is not 0.
> >
> > EINVAL pid is not valid.
> >
> > ESRCH The process specified by pid does not exist.
>
> Presumably, EMFILE and ENFILE are also possible errors, and so is
> ENOMEM.
>
> > A PID file descriptor can be monitored using poll(2), select(2),
> > and epoll(7). When the process that it refers to terminates, the
> > file descriptor indicates as readable.
The phrase "becomes readable" is simpler than "indicates as readable"
and conveys the same meaning. I agree with Florian's comment on this
point below.
> > Note, however, that in the
> > current implementation, nothing can be read from the file descrip‐
> > tor.
>
> “is indicated as readable” or “becomes readable”? Will reading block?
>
> > The pidfd_open() system call is the preferred way of obtaining a
> > PID file descriptor. The alternative is to obtain a file descrip‐
> > tor by opening a /proc/[pid] directory. However, the latter tech‐
> > nique is possible only if the proc(5) file system is mounted; fur‐
> > thermore, the file descriptor obtained in this way is not pol‐
> > lable.
Referring to procfs directory FDs as pidfds will probably confuse
people. I'd just omit this paragraph.
> One question is whether the glibc wrapper should fall back back to the
> /proc subdirectory if it is not available. Probably not.
I'd prefer that glibc not provide this kind of fallback.
posix_fallocate-style emulation is, IMHO, too surprising.
^ permalink raw reply
* Re: For review: pidfd_send_signal(2) manual page
From: Florian Weimer @ 2019-09-23 11:26 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Oleg Nesterov, Christian Brauner, Jann Horn, Eric W. Biederman,
Daniel Colascione, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <f21dbd73-5ef4-fb5b-003f-ff4fec34a1de@gmail.com>
* Michael Kerrisk:
> SYNOPSIS
> int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
> unsigned int flags);
This probably should reference a header for siginfo_t.
> ESRCH The target process does not exist.
If the descriptor is valid, does this mean the process has been waited
for? Maybe this can be made more explicit.
> The pidfd_send_signal() system call allows the avoidance of race
> conditions that occur when using traditional interfaces (such as
> kill(2)) to signal a process. The problem is that the traditional
> interfaces specify the target process via a process ID (PID), with
> the result that the sender may accidentally send a signal to the
> wrong process if the originally intended target process has termi‐
> nated and its PID has been recycled for another process. By con‐
> trast, a PID file descriptor is a stable reference to a specific
> process; if that process terminates, then the file descriptor
> ceases to be valid and the caller of pidfd_send_signal() is
> informed of this fact via an ESRCH error.
It would be nice to explain somewhere how you can avoid the race using
a PID descriptor. Is there anything else besides CLONE_PIDFD?
> static
> int pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
> unsigned int flags)
> {
> return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
> }
Please use a different function name. Thanks.
^ permalink raw reply
* Re: For review: pidfd_send_signal(2) manual page
From: Daniel Colascione @ 2019-09-23 11:31 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Oleg Nesterov, Christian Brauner, Jann Horn, Eric W. Biederman,
Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <f21dbd73-5ef4-fb5b-003f-ff4fec34a1de@gmail.com>
On Mon, Sep 23, 2019 at 2:12 AM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> The pidfd_send_signal() system call allows the avoidance of race
> conditions that occur when using traditional interfaces (such as
> kill(2)) to signal a process. The problem is that the traditional
> interfaces specify the target process via a process ID (PID), with
> the result that the sender may accidentally send a signal to the
> wrong process if the originally intended target process has termi‐
> nated and its PID has been recycled for another process. By con‐
> trast, a PID file descriptor is a stable reference to a specific
> process; if that process terminates, then the file descriptor
> ceases to be valid
The file *descriptor* remains valid even after the process to which it
refers exits. You can close(2) the file descriptor without getting
EBADF. I'd say, instead, that "a PID file descriptor is a stable
reference to a specific process; process-related operations on a PID
file descriptor fail after that process exits".
^ permalink raw reply
* Re: For review: pidfd_send_signal(2) manual page
From: Christian Brauner @ 2019-09-23 14:23 UTC (permalink / raw)
To: Florian Weimer
Cc: Michael Kerrisk (man-pages), Oleg Nesterov, Christian Brauner,
Jann Horn, Eric W. Biederman, Daniel Colascione, Joel Fernandes,
linux-man, Linux API, lkml
In-Reply-To: <87pnjr9rth.fsf@mid.deneb.enyo.de>
On Mon, Sep 23, 2019 at 01:26:34PM +0200, Florian Weimer wrote:
> * Michael Kerrisk:
>
> > SYNOPSIS
> > int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
> > unsigned int flags);
>
> This probably should reference a header for siginfo_t.
Agreed.
>
> > ESRCH The target process does not exist.
>
> If the descriptor is valid, does this mean the process has been waited
> for? Maybe this can be made more explicit.
If by valid you mean "refers to a process/thread-group leader" aka is a
pidfd then yes: Getting ESRCH means that the process has exited and has
already been waited upon.
If it had only exited but not waited upon aka is a zombie, then sending
a signal will just work because that's currently how sending signals to
zombies works, i.e. if you only send a signal and don't do any
additional checks you won't notice a difference between a process being
alive and a process being a zombie. The userspace visible behavior in
terms of signaling them is identical.
>
> > The pidfd_send_signal() system call allows the avoidance of race
> > conditions that occur when using traditional interfaces (such as
> > kill(2)) to signal a process. The problem is that the traditional
> > interfaces specify the target process via a process ID (PID), with
> > the result that the sender may accidentally send a signal to the
> > wrong process if the originally intended target process has termi‐
> > nated and its PID has been recycled for another process. By con‐
> > trast, a PID file descriptor is a stable reference to a specific
> > process; if that process terminates, then the file descriptor
> > ceases to be valid and the caller of pidfd_send_signal() is
> > informed of this fact via an ESRCH error.
>
> It would be nice to explain somewhere how you can avoid the race using
> a PID descriptor. Is there anything else besides CLONE_PIDFD?
If you're the parent of the process you can do this without CLONE_PIDFD:
pid = fork();
pidfd = pidfd_open();
ret = pidfd_send_signal(pidfd, 0, NULL, 0);
if (ret < 0 && errno == ESRCH)
/* pidfd refers to another, recycled process */
>
> > static
> > int pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
> > unsigned int flags)
> > {
> > return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
> > }
>
> Please use a different function name. Thanks.
^ permalink raw reply
* Re: For review: pidfd_send_signal(2) manual page
From: Christian Brauner @ 2019-09-23 14:29 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Oleg Nesterov, Christian Brauner, Jann Horn, Eric W. Biederman,
Daniel Colascione, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <f21dbd73-5ef4-fb5b-003f-ff4fec34a1de@gmail.com>
On Mon, Sep 23, 2019 at 11:12:00AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian and all,
>
> Below, I have the rendered version of the current draft of
> the pidfd_send_signal(2) manual page that I have written.
> The page source can be found in a Git branch at:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
>
> I would be pleased to receive corrections and notes on any
> details that should be added. (For example, are there error
> cases that I have missed?)
>
> Would you be able to review please?
Michael,
A big big thank you for doing this! Really appreciated.
I'm happy to review this!
>
> Thanks,
>
> Michael
>
>
> NAME
> pidfd_send_signal - send a signal to a process specified by a file
> descriptor
>
> SYNOPSIS
> int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
> unsigned int flags);
>
> DESCRIPTION
> The pidfd_send_signal() system call sends the signal sig to the
> target process referred to by pidfd, a PID file descriptor that
> refers to a process.
>
> If the info argument points to a siginfo_t buffer, that buffer
> should be populated as described in rt_sigqueueinfo(2).
>
> If the info argument is a NULL pointer, this is equivalent to
> specifying a pointer to a siginfo_t buffer whose fields match the
> values that are implicitly supplied when a signal is sent using
> kill(2):
>
> * si_signo is set to the signal number;
> * si_errno is set to 0;
> * si_code is set to SI_USER;
> * si_pid is set to the caller's PID; and
> * si_uid is set to the caller's real user ID.
>
> The calling process must either be in the same PID namespace as
> the process referred to by pidfd, or be in an ancestor of that
> namespace.
>
> The flags argument is reserved for future use; currently, this
> argument must be specified as 0.
>
> RETURN VALUE
> On success, pidfd_send_signal() returns 0. On success, -1 is
This should probably be "On error, -1 is [...]".
> returned and errno is set to indicate the cause of the error.
>
> ERRORS
> EBADF pidfd is not a valid PID file descriptor.
>
> EINVAL sig is not a valid signal.
>
> EINVAL The calling process is not in a PID namespace from which it
> can send a signal to the target process.
>
> EINVAL flags is not 0.
>
> EPERM The calling process does not have permission to send the
> signal to the target process.
>
> EPERM pidfd doesn't refer to the calling process, and
> info.si_code is invalid (see rt_sigqueueinfo(2)).
>
> ESRCH The target process does not exist.
>
> VERSIONS
> pidfd_send_signal() first appeared in Linux 5.1.
>
> CONFORMING TO
> pidfd_send_signal() is Linux specific.
>
> NOTES
> Currently, there is no glibc wrapper for this system call; call it
> using syscall(2).
>
> PID file descriptors
> The pidfd argument is a PID file descriptor, a file descriptor
> that refers to process. Such a file descriptor can be obtained
> in any of the following ways:
>
> * by opening a /proc/[pid] directory;
>
> * using pidfd_open(2); or
>
> * via the PID file descriptor that is returned by a call to
> clone(2) or clone3(2) that specifies the CLONE_PIDFD flag.
>
> The pidfd_send_signal() system call allows the avoidance of race
> conditions that occur when using traditional interfaces (such as
> kill(2)) to signal a process. The problem is that the traditional
> interfaces specify the target process via a process ID (PID), with
> the result that the sender may accidentally send a signal to the
> wrong process if the originally intended target process has termi‐
> nated and its PID has been recycled for another process. By con‐
> trast, a PID file descriptor is a stable reference to a specific
> process; if that process terminates, then the file descriptor
> ceases to be valid and the caller of pidfd_send_signal() is
> informed of this fact via an ESRCH error.
>
> EXAMPLE
> #define _GNU_SOURCE
> #include <limits.h>
> #include <signal.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/syscall.h>
>
> #ifndef __NR_pidfd_send_signal
> #define __NR_pidfd_send_signal 424
> #endif
>
> static
> int pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
> unsigned int flags)
> {
> return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
> }
>
> int
> main(int argc, char *argv[])
> {
> siginfo_t info;
> char path[PATH_MAX];
> int pidfd, sig;
>
> if (argc != 3) {
> fprintf(stderr, "Usage: %s <pid> <signal>\n", argv[0]);
> exit(EXIT_FAILURE);
> }
>
> sig = atoi(argv[2]);
>
> /* Obtain a PID file descriptor by opening the /proc/PID directory
> of the target process */
>
> snprintf(path, sizeof(path), "/proc/%s", argv[1]);
>
> pidfd = open(path, O_RDONLY);
> if (pidfd == -1) {
> perror("open");
> exit(EXIT_FAILURE);
> }
>
> /* Populate a 'siginfo_t' structure for use with
> pidfd_send_signal() */
>
> memset(&info, 0, sizeof(info));
> info.si_code = SI_QUEUE;
> info.si_signo = sig;
> info.si_errno = 0;
> info.si_uid = getuid();
> info.si_pid = getpid();
> info.si_value.sival_int = 1234;
>
> /* Send the signal */
>
> if (pidfd_send_signal(pidfd, sig, &info, 0) == -1) {
> perror("pidfd_send_signal");
> exit(EXIT_FAILURE);
> }
>
> exit(EXIT_SUCCESS);
> }
>
> SEE ALSO
> clone(2), kill(2), pidfd_open(2), rt_sigqueueinfo(2), sigac‐
> tion(2), pid_namespaces(7), signal(7)
>
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Christian Brauner @ 2019-09-23 14:38 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Christian Brauner, Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <90399dee-53d8-a82c-3871-9ec8f94601ce@gmail.com>
On Mon, Sep 23, 2019 at 11:11:53AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian and all,
>
> Below, I have the rendered version of the current draft of
> the pidfd_open(2) manual page that I have written.
> The page source can be found in a Git branch at:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
>
> I would be pleased to receive corrections and notes on any
> details that should be added. (For example, are there error
> cases that I have missed?)
>
> Would you be able to review please?
Again, thank you Michael for doing this!
>
> Thanks,
>
> Michael
>
>
> NAME
> pidfd_open - obtain a file descriptor that refers to a process
>
> SYNOPSIS
> int pidfd_open(pid_t pid, unsigned int flags);
>
> DESCRIPTION
> The pidfd_open() system creates a file descriptor that refers to
s/system/system call/
> the process whose PID is specified in pid. The file descriptor is
> returned as the function result; the close-on-exec flag is set on
> the file descriptor.
>
> The flags argument is reserved for future use; currently, this
> argument must be specified as 0.
>
> RETURN VALUE
> On success, pidfd_open() returns a nonnegative file descriptor.
> On success, -1 is returned and errno is set to indicate the cause
s/On success/On error/g
> of the error.
>
> ERRORS
> EINVAL flags is not 0.
>
> EINVAL pid is not valid.
>
> ESRCH The process specified by pid does not exist.
>
> VERSIONS
> pidfd_open() first appeared in Linux 5.3.
>
> CONFORMING TO
> pidfd_open() is Linux specific.
>
> NOTES
> Currently, there is no glibc wrapper for this system call; call it
> using syscall(2).
>
> The pidfd_send_signal(2) system call can be used to send a signal
> to the process referred to by a PID file descriptor.
>
> A PID file descriptor can be monitored using poll(2), select(2),
> and epoll(7). When the process that it refers to terminates, the
> file descriptor indicates as readable. Note, however, that in the
Not a native English speaker but should this be "indicates it is
readable"?
> current implementation, nothing can be read from the file descrip‐
> tor.
>
> The pidfd_open() system call is the preferred way of obtaining a
> PID file descriptor. The alternative is to obtain a file descrip‐
> tor by opening a /proc/[pid] directory. However, the latter tech‐
> nique is possible only if the proc(5) file system is mounted; fur‐
> thermore, the file descriptor obtained in this way is not pol‐
> lable.
I mentioned this already in the CLONE_PIDFD manpage, we should probably
not make a big deal out of this and not mention /proc/<pid> here at all.
(Crazy idea, but we could also have a config option that allows you to
turn of proc-pid-dirfds as pidfds if we start to feel really strongly
about this or a sysctl whatever...)
>
> See also the discussion of the CLONE_PIDFD flag in clone(2).
>
> EXAMPLE
> The program below opens a PID file descriptor for the process
> whose PID is specified as its command-line argument. It then mon‐
> itors the file descriptor for readability (POLLIN) using poll(2).
Yeah, maybe say "monitors the file descriptor for process exit indicated
by an EPOLLIN event" or something. Readability might be confusing.
> When the process with the specified by PID terminates, poll(2)
> returns, and indicates that the file descriptor is readable.
See comment above "readable". (I'm on my phone and I think someone
pointed this out already.)
>
> Program source
>
> #define _GNU_SOURCE
> #include <sys/syscall.h>
> #include <unistd.h>
> #include <poll.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> #ifndef __NR_pidfd_open
> #define __NR_pidfd_open 434
> #endif
Alpha is special... (and not in a good way).
So you would need to special case Alpha since that's the only arch where
we haven't been able to unify syscall numbering. :D
But it's not super important.
I like the program example.
>
> static
> int pidfd_open(pid_t pid, unsigned int flags)
> {
> return syscall(__NR_pidfd_open, pid, flags);
> }
>
> int
> main(int argc, char *argv[])
> {
> struct pollfd pollfd;
> int pidfd, ready;
>
> if (argc != 2) {
> fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
> exit(EXIT_SUCCESS);
> }
>
> pidfd = pidfd_open(atoi(argv[1]), 0);
> if (pidfd == -1) {
> perror("pidfd_open");
> exit(EXIT_FAILURE);
> }
>
> pollfd.fd = pidfd;
> pollfd.events = POLLIN;
>
> ready = poll(&pollfd, 1, -1);
> if (ready == -1) {
> perror("poll");
> exit(EXIT_FAILURE);
> }
>
> printf("Events (0x%x): POLLIN is %sset\n", pollfd.revents,
> (pollfd.revents & POLLIN) ? "" : "not ");
>
> exit(EXIT_SUCCESS);
> }
>
> SEE ALSO
> clone(2), kill(2), pidfd_send_signal(2), poll(2), select(2),
> epoll(7)
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Christian Brauner @ 2019-09-23 14:47 UTC (permalink / raw)
To: Florian Weimer
Cc: Michael Kerrisk (man-pages), Christian Brauner, Jann Horn,
Daniel Colascione, Eric W. Biederman, Joel Fernandes, Linux API,
lkml, linux-man, Oleg Nesterov
In-Reply-To: <87tv939td6.fsf@mid.deneb.enyo.de>
On Mon, Sep 23, 2019 at 12:53:09PM +0200, Florian Weimer wrote:
> * Michael Kerrisk:
>
> > SYNOPSIS
> > int pidfd_open(pid_t pid, unsigned int flags);
>
> Should this mention <sys/types.h> for pid_t?
>
> > ERRORS
> > EINVAL flags is not 0.
> >
> > EINVAL pid is not valid.
> >
> > ESRCH The process specified by pid does not exist.
>
> Presumably, EMFILE and ENFILE are also possible errors, and so is
> ENOMEM.
So, error codes that could surface are:
EMFILE: too many open files
ENODEV: the anon inode filesystem is not available in this kernel (unlikely)
ENOMEM: not enough memory (to allocate the backing struct file)
ENFILE: you're over the max_files limit which can be set through proc
I think that should be it.
>
> > A PID file descriptor can be monitored using poll(2), select(2),
> > and epoll(7). When the process that it refers to terminates, the
> > file descriptor indicates as readable. Note, however, that in the
> > current implementation, nothing can be read from the file descrip‐
> > tor.
>
> “is indicated as readable” or “becomes readable”? Will reading block?
>
> > The pidfd_open() system call is the preferred way of obtaining a
> > PID file descriptor. The alternative is to obtain a file descrip‐
> > tor by opening a /proc/[pid] directory. However, the latter tech‐
> > nique is possible only if the proc(5) file system is mounted; fur‐
> > thermore, the file descriptor obtained in this way is not pol‐
> > lable.
>
> One question is whether the glibc wrapper should fall back back to the
> /proc subdirectory if it is not available. Probably not.
No, that would not be transparent to userspace. Especially because both
fds differ in what can be done with them.
>
> > static
> > int pidfd_open(pid_t pid, unsigned int flags)
> > {
> > return syscall(__NR_pidfd_open, pid, flags);
> > }
>
> Please call this function something else (not pidfd_open), so that the
> example continues to work if glibc provides the system call wrapper.
Agreed!
^ permalink raw reply
* Re: [PATCH RFC v4 1/1] random: WARN on large getrandom() waits and introduce getrandom2()
From: Andy Lutomirski @ 2019-09-23 18:33 UTC (permalink / raw)
To: Florian Weimer
Cc: Linus Torvalds, Andy Lutomirski, Ahmed S. Darwish,
Lennart Poettering, Theodore Y. Ts'o, Eric W. Biederman,
Alexander E. Patrakov, Michael Kerrisk, Willy Tarreau,
Matthew Garrett, lkml, Ext4 Developers List, Linux API, linux-man
In-Reply-To: <87blvefai7.fsf@oldenburg2.str.redhat.com>
On Fri, Sep 20, 2019 at 11:07 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Linus Torvalds:
>
> > Violently agreed. And that's kind of what the GRND_EXPLICIT is really
> > aiming for.
> >
> > However, it's worth noting that nobody should ever use GRND_EXPLICIT
> > directly. That's just the name for the bit. The actual users would use
> > GRND_INSECURE or GRND_SECURE.
>
> Should we switch glibc's getentropy to GRND_EXPLICIT? Or something
> else?
>
> I don't think we want to print a kernel warning for this function.
>
Contemplating this question, I think the answer is that we should just
not introduce GRND_EXPLICIT or anything like it. glibc is going to
have to do *something*, and getentropy() is unlikely to just go away.
The explicitly documented semantics are that it blocks if the RNG
isn't seeded.
Similarly, FreeBSD has getrandom():
https://www.freebsd.org/cgi/man.cgi?query=getrandom&sektion=2&manpath=freebsd-release-ports
and if we make getrandom(..., 0) warn, then we have a situation where
the *correct* (if regrettable) way to use the function on FreeBSD
causes a warning on Linux.
Let's just add GRND_INSECURE, make the blocking mode work better, and,
if we're feeling a bit more adventurous, add GRND_SECURE_BLOCKING as a
better replacement for 0, convince FreeBSD to add it too, and then
worry about deprecating 0 once we at least get some agreement from the
FreeBSD camp.
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-23 20:20 UTC (permalink / raw)
To: Florian Weimer
Cc: mtk.manpages, Christian Brauner, Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <87tv939td6.fsf@mid.deneb.enyo.de>
Hello Florian,
Thanks for taking a look at this page.
On 9/23/19 12:53 PM, Florian Weimer wrote:
> * Michael Kerrisk:
>
>> SYNOPSIS
>> int pidfd_open(pid_t pid, unsigned int flags);
>
> Should this mention <sys/types.h> for pid_t?
Seems reasonable. I added this.
>> ERRORS
>> EINVAL flags is not 0.
>>
>> EINVAL pid is not valid.
>>
>> ESRCH The process specified by pid does not exist.
>
> Presumably, EMFILE and ENFILE are also possible errors, and so is
> ENOMEM.
Thanks. I've added those.
>> A PID file descriptor can be monitored using poll(2), select(2),
>> and epoll(7). When the process that it refers to terminates, the
>> file descriptor indicates as readable. Note, however, that in the
>> current implementation, nothing can be read from the file descrip‐
>> tor.
>
> “is indicated as readable” or “becomes readable”? Will reading block?
It won't block. Reads from a pidfd always fail with the error EINVAL
(regardless of whether the target process has terminated).
I specifically wanted to avoid "becomes readable" to avoid any
suggestion that read() does something for a pidfd. I thought
"indicates as readable" was fine, but you, Christian and Joel
all called this wording out, so I changed this to:
"When the process that it refers to terminates,
these interfaces indicate the file descriptor as readable."
>> The pidfd_open() system call is the preferred way of obtaining a
>> PID file descriptor. The alternative is to obtain a file descrip‐
>> tor by opening a /proc/[pid] directory. However, the latter tech‐
>> nique is possible only if the proc(5) file system is mounted; fur‐
>> thermore, the file descriptor obtained in this way is not pol‐
>> lable.
>
> One question is whether the glibc wrapper should fall back back to the
> /proc subdirectory if it is not available. Probably not.
No, since the FD returned by opening /proc/PID is less functional
(it is not pollable) than the one returned by pidfd_open().
>> static
>> int pidfd_open(pid_t pid, unsigned int flags)
>> {
>> return syscall(__NR_pidfd_open, pid, flags);
>> }
>
> Please call this function something else (not pidfd_open), so that the
> example continues to work if glibc provides the system call wrapper.
I figured that if the syscall does get added to glibc, then I would
modify the example. In the meantime, this does seem the most natural
way of doing things, since the example then uses the real syscall
name as it would be used if there were a wrapper function.
But, this leads to the question: what do you think the likelihood
is that this system call will land in glibc?
Thanks for your feedback, Florian. I've pushed various changes
to the Git branch at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-23 20:21 UTC (permalink / raw)
To: Christian Brauner
Cc: mtk.manpages, Christian Brauner, Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <20190923143846.u7miwgmszecankof@wittgenstein>
Hello Christian,
On 9/23/19 4:38 PM, Christian Brauner wrote:
> On Mon, Sep 23, 2019 at 11:11:53AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Christian and all,
>>
>> Below, I have the rendered version of the current draft of
>> the pidfd_open(2) manual page that I have written.
>> The page source can be found in a Git branch at:
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
>>
>> I would be pleased to receive corrections and notes on any
>> details that should be added. (For example, are there error
>> cases that I have missed?)
>>
>> Would you be able to review please?
>
> Again, thank you Michael for doing this!
>
>>
>> Thanks,
>>
>> Michael
>>
>>
>> NAME
>> pidfd_open - obtain a file descriptor that refers to a process
>>
>> SYNOPSIS
>> int pidfd_open(pid_t pid, unsigned int flags);
>>
>> DESCRIPTION
>> The pidfd_open() system creates a file descriptor that refers to
>
> s/system/system call/
Fixed.
>> the process whose PID is specified in pid. The file descriptor is
>> returned as the function result; the close-on-exec flag is set on
>> the file descriptor.
>>
>> The flags argument is reserved for future use; currently, this
>> argument must be specified as 0.
>>
>> RETURN VALUE
>> On success, pidfd_open() returns a nonnegative file descriptor.
>> On success, -1 is returned and errno is set to indicate the cause
>
> s/On success/On error/g
Fixed.
>> of the error.
>>
>> ERRORS
>> EINVAL flags is not 0.
>>
>> EINVAL pid is not valid.
>>
>> ESRCH The process specified by pid does not exist.
>>
>> VERSIONS
>> pidfd_open() first appeared in Linux 5.3.
>>
>> CONFORMING TO
>> pidfd_open() is Linux specific.
>>
>> NOTES
>> Currently, there is no glibc wrapper for this system call; call it
>> using syscall(2).
>>
>> The pidfd_send_signal(2) system call can be used to send a signal
>> to the process referred to by a PID file descriptor.
>>
>> A PID file descriptor can be monitored using poll(2), select(2),
>> and epoll(7). When the process that it refers to terminates, the
>> file descriptor indicates as readable. Note, however, that in the
>
> Not a native English speaker but should this be "indicates it is
> readable"?
See my reply to Florian.
>> current implementation, nothing can be read from the file descrip‐
>> tor.
>>
>> The pidfd_open() system call is the preferred way of obtaining a
>> PID file descriptor. The alternative is to obtain a file descrip‐
>> tor by opening a /proc/[pid] directory. However, the latter tech‐
>> nique is possible only if the proc(5) file system is mounted; fur‐
>> thermore, the file descriptor obtained in this way is not pol‐
>> lable.
>
> I mentioned this already in the CLONE_PIDFD manpage, we should probably
> not make a big deal out of this and not mention /proc/<pid> here at all.
The thing is, people *will* learn about these two different types
of FDs, whether we document them or not. So, I think it's better to
be up front about what's available, and make a suitably strong
recommendation about the preferred technique.
Reading between the lines, it sounds like just a couple of releases
after it was implemented, you're saying that implementing
open(/proc/PID) was a mistake?
> (Crazy idea, but we could also have a config option that allows you to
> turn of proc-pid-dirfds as pidfds if we start to feel really strongly
> about this or a sysctl whatever...)
>
>>
>> See also the discussion of the CLONE_PIDFD flag in clone(2).
>>
>> EXAMPLE
>> The program below opens a PID file descriptor for the process
>> whose PID is specified as its command-line argument. It then mon‐
>> itors the file descriptor for readability (POLLIN) using poll(2).
>
> Yeah, maybe say "monitors the file descriptor for process exit indicated
> by an EPOLLIN event" or something. Readability might be confusing.
I like that suggestion! I reworded to something close to what you suggest.
>> When the process with the specified by PID terminates, poll(2)
>> returns, and indicates that the file descriptor is readable.
>
> See comment above "readable". (I'm on my phone and I think someone
> pointed this out already.)
Actually, I think I can just remove that sentence. It doesn't really
add much.
>> Program source
>>
>> #define _GNU_SOURCE
>> #include <sys/syscall.h>
>> #include <unistd.h>
>> #include <poll.h>
>> #include <stdlib.h>
>> #include <stdio.h>
>>
>> #ifndef __NR_pidfd_open
>> #define __NR_pidfd_open 434
>> #endif
>
> Alpha is special... (and not in a good way).
> So you would need to special case Alpha since that's the only arch where
> we haven't been able to unify syscall numbering. :D
> But it's not super important.
Okay.
> I like the program example.
Good.
Thanks for reviewing! I've pushed various changes
to the Git branch at
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
Cheers,
Michael
>>
>> static
>> int pidfd_open(pid_t pid, unsigned int flags)
>> {
>> return syscall(__NR_pidfd_open, pid, flags);
>> }
>>
>> int
>> main(int argc, char *argv[])
>> {
>> struct pollfd pollfd;
>> int pidfd, ready;
>>
>> if (argc != 2) {
>> fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
>> exit(EXIT_SUCCESS);
>> }
>>
>> pidfd = pidfd_open(atoi(argv[1]), 0);
>> if (pidfd == -1) {
>> perror("pidfd_open");
>> exit(EXIT_FAILURE);
>> }
>>
>> pollfd.fd = pidfd;
>> pollfd.events = POLLIN;
>>
>> ready = poll(&pollfd, 1, -1);
>> if (ready == -1) {
>> perror("poll");
>> exit(EXIT_FAILURE);
>> }
>>
>> printf("Events (0x%x): POLLIN is %sset\n", pollfd.revents,
>> (pollfd.revents & POLLIN) ? "" : "not ");
>>
>> exit(EXIT_SUCCESS);
>> }
>>
>> SEE ALSO
>> clone(2), kill(2), pidfd_send_signal(2), poll(2), select(2),
>> epoll(7)
>>
>>
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
>
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-23 20:22 UTC (permalink / raw)
To: Christian Brauner, Florian Weimer
Cc: mtk.manpages, Christian Brauner, Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <20190923144711.ssbrg6bdquhewo7q@wittgenstein>
Hello Christian,
On 9/23/19 4:47 PM, Christian Brauner wrote:
> On Mon, Sep 23, 2019 at 12:53:09PM +0200, Florian Weimer wrote:
>> * Michael Kerrisk:
>>
>>> SYNOPSIS
>>> int pidfd_open(pid_t pid, unsigned int flags);
>>
>> Should this mention <sys/types.h> for pid_t?
>>
>>> ERRORS
>>> EINVAL flags is not 0.
>>>
>>> EINVAL pid is not valid.
>>>
>>> ESRCH The process specified by pid does not exist.
>>
>> Presumably, EMFILE and ENFILE are also possible errors, and so is
>> ENOMEM.
>
> So, error codes that could surface are:
> EMFILE: too many open files
> ENODEV: the anon inode filesystem is not available in this kernel (unlikely)
> ENOMEM: not enough memory (to allocate the backing struct file)
> ENFILE: you're over the max_files limit which can be set through proc
>
> I think that should be it.
Thanks. I've added those.
>>> A PID file descriptor can be monitored using poll(2), select(2),
>>> and epoll(7). When the process that it refers to terminates, the
>>> file descriptor indicates as readable. Note, however, that in the
>>> current implementation, nothing can be read from the file descrip‐
>>> tor.
>>
>> “is indicated as readable” or “becomes readable”? Will reading block?
>>
>>> The pidfd_open() system call is the preferred way of obtaining a
>>> PID file descriptor. The alternative is to obtain a file descrip‐
>>> tor by opening a /proc/[pid] directory. However, the latter tech‐
>>> nique is possible only if the proc(5) file system is mounted; fur‐
>>> thermore, the file descriptor obtained in this way is not pol‐
>>> lable.
>>
>> One question is whether the glibc wrapper should fall back back to the
>> /proc subdirectory if it is not available. Probably not.
>
> No, that would not be transparent to userspace. Especially because both
> fds differ in what can be done with them.
>
>>
>>> static
>>> int pidfd_open(pid_t pid, unsigned int flags)
>>> {
>>> return syscall(__NR_pidfd_open, pid, flags);
>>> }
>>
>> Please call this function something else (not pidfd_open), so that the
>> example continues to work if glibc provides the system call wrapper.
>
> Agreed!
See my reply to Florian. (So far, I didn't change anything here.)
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-23 20:22 UTC (permalink / raw)
To: Daniel Colascione, Florian Weimer
Cc: mtk.manpages, Christian Brauner, Jann Horn, Eric W. Biederman,
Joel Fernandes, Linux API, lkml, linux-man, Oleg Nesterov
In-Reply-To: <CAKOZuetTgKjgWZpCaBz8q662MwVQ-UhrV4oWFqKEWr35mQTFLw@mail.gmail.com>
Hello Daniel,
Than you for reviewing the page!
On 9/23/19 1:26 PM, Daniel Colascione wrote:
> On Mon, Sep 23, 2019 at 3:53 AM Florian Weimer <fw@deneb.enyo.de> wrote:
>>
>> * Michael Kerrisk:
>>
>>> SYNOPSIS
>>> int pidfd_open(pid_t pid, unsigned int flags);
>>
>> Should this mention <sys/types.h> for pid_t?
>>
>>> ERRORS
>>> EINVAL flags is not 0.
>>>
>>> EINVAL pid is not valid.
>>>
>>> ESRCH The process specified by pid does not exist.
>>
>> Presumably, EMFILE and ENFILE are also possible errors, and so is
>> ENOMEM.
>>
>>> A PID file descriptor can be monitored using poll(2), select(2),
>>> and epoll(7). When the process that it refers to terminates, the
>>> file descriptor indicates as readable.
>
> The phrase "becomes readable" is simpler than "indicates as readable"
> and conveys the same meaning. I agree with Florian's comment on this
> point below.
See my reply to Florian. (I did change the text here.)
>>> Note, however, that in the
>>> current implementation, nothing can be read from the file descrip‐
>>> tor.
>>
>> “is indicated as readable” or “becomes readable”? Will reading block?
>>
>>> The pidfd_open() system call is the preferred way of obtaining a
>>> PID file descriptor. The alternative is to obtain a file descrip‐
>>> tor by opening a /proc/[pid] directory. However, the latter tech‐
>>> nique is possible only if the proc(5) file system is mounted; fur‐
>>> thermore, the file descriptor obtained in this way is not pol‐
>>> lable.
>
> Referring to procfs directory FDs as pidfds will probably confuse
> people. I'd just omit this paragraph.
See my reply to Christian (and feel free to argue the point, please).
So far, I have made no change here.
>> One question is whether the glibc wrapper should fall back back to the
>> /proc subdirectory if it is not available. Probably not.
>
> I'd prefer that glibc not provide this kind of fallback.
> posix_fallocate-style emulation is, IMHO, too surprising.
Agreed.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-23 20:27 UTC (permalink / raw)
To: Christian Brauner
Cc: mtk.manpages, Oleg Nesterov, Christian Brauner, Jann Horn,
Eric W. Biederman, Daniel Colascione, Joel Fernandes, linux-man,
Linux API, lkml
In-Reply-To: <20190923142932.2gujbddnzyp4ujeu@wittgenstein>
Hello Christian,
On 9/23/19 4:29 PM, Christian Brauner wrote:
> On Mon, Sep 23, 2019 at 11:12:00AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Christian and all,
>>
>> Below, I have the rendered version of the current draft of
>> the pidfd_send_signal(2) manual page that I have written.
>> The page source can be found in a Git branch at:
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
>>
>> I would be pleased to receive corrections and notes on any
>> details that should be added. (For example, are there error
>> cases that I have missed?)
>>
>> Would you be able to review please?
>
> Michael,
>
> A big big thank you for doing this! Really appreciated.
> I'm happy to review this!
>
>>
>> Thanks,
>>
>> Michael
>>
>>
>> NAME
>> pidfd_send_signal - send a signal to a process specified by a file
>> descriptor
>>
>> SYNOPSIS
>> int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
>> unsigned int flags);
>>
>> DESCRIPTION
>> The pidfd_send_signal() system call sends the signal sig to the
>> target process referred to by pidfd, a PID file descriptor that
>> refers to a process.
>>
>> If the info argument points to a siginfo_t buffer, that buffer
>> should be populated as described in rt_sigqueueinfo(2).
>>
>> If the info argument is a NULL pointer, this is equivalent to
>> specifying a pointer to a siginfo_t buffer whose fields match the
>> values that are implicitly supplied when a signal is sent using
>> kill(2):
>>
>> * si_signo is set to the signal number;
>> * si_errno is set to 0;
>> * si_code is set to SI_USER;
>> * si_pid is set to the caller's PID; and
>> * si_uid is set to the caller's real user ID.
>>
>> The calling process must either be in the same PID namespace as
>> the process referred to by pidfd, or be in an ancestor of that
>> namespace.
>>
>> The flags argument is reserved for future use; currently, this
>> argument must be specified as 0.
>>
>> RETURN VALUE
>> On success, pidfd_send_signal() returns 0. On success, -1 is
>
> This should probably be "On error, -1 is [...]".
Thanks. Fixed.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Florian Weimer @ 2019-09-23 20:41 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Christian Brauner, Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <63566f1f-667d-50ca-ae85-784924d09af4@gmail.com>
* Michael Kerrisk:
>>> static
>>> int pidfd_open(pid_t pid, unsigned int flags)
>>> {
>>> return syscall(__NR_pidfd_open, pid, flags);
>>> }
>>
>> Please call this function something else (not pidfd_open), so that the
>> example continues to work if glibc provides the system call wrapper.
>
> I figured that if the syscall does get added to glibc, then I would
> modify the example. In the meantime, this does seem the most natural
> way of doing things, since the example then uses the real syscall
> name as it would be used if there were a wrapper function.
The problem is that programs do this as well, so they fail to build
once they are built on a newer glibc version.
> But, this leads to the question: what do you think the likelihood
> is that this system call will land in glibc?
Quite likely. It's easy enough to document, there are no P&C issues,
and it doesn't need any new types.
pidfd_send_signal is slightly more difficult because we probably need
to add rt_sigqueueinfo first, for consistency.
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-23 20:57 UTC (permalink / raw)
To: Florian Weimer
Cc: mtk.manpages, Christian Brauner, Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <874l12924w.fsf@mid.deneb.enyo.de>
Hello Florian,
On 9/23/19 10:41 PM, Florian Weimer wrote:
> * Michael Kerrisk:
>
>>>> static
>>>> int pidfd_open(pid_t pid, unsigned int flags)
>>>> {
>>>> return syscall(__NR_pidfd_open, pid, flags);
>>>> }
>>>
>>> Please call this function something else (not pidfd_open), so that the
>>> example continues to work if glibc provides the system call wrapper.
>>
>> I figured that if the syscall does get added to glibc, then I would
>> modify the example. In the meantime, this does seem the most natural
>> way of doing things, since the example then uses the real syscall
>> name as it would be used if there were a wrapper function.
>
> The problem is that programs do this as well, so they fail to build
> once they are built on a newer glibc version.
But isn't such a failure a good thing? I mean: it encourages
people to rid their programs of uses of syscall(2).
>> But, this leads to the question: what do you think the likelihood
>> is that this system call will land in glibc?
>
> Quite likely. It's easy enough to document, there are no P&C issues,
> and it doesn't need any new types.
Okay.
> pidfd_send_signal is slightly more difficult because we probably need
> to add rt_sigqueueinfo first, for consistency.
Okay. I see that's a little more problematic.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: For review: pidfd_send_signal(2) manual page
From: Eric W. Biederman @ 2019-09-23 21:27 UTC (permalink / raw)
To: Michael Kerrisk (man-pages)
Cc: Oleg Nesterov, Christian Brauner, Jann Horn, Daniel Colascione,
Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <f21dbd73-5ef4-fb5b-003f-ff4fec34a1de@gmail.com>
"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> Hello Christian and all,
>
> Below, I have the rendered version of the current draft of
> the pidfd_send_signal(2) manual page that I have written.
> The page source can be found in a Git branch at:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
>
> I would be pleased to receive corrections and notes on any
> details that should be added. (For example, are there error
> cases that I have missed?)
>
> Would you be able to review please?
>
> Thanks,
>
> Michael
>
>
> NAME
> pidfd_send_signal - send a signal to a process specified by a file
> descriptor
>
> SYNOPSIS
> int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
This needs to be "siginfo_t *info," -----------------------^
> unsigned int flags);
>
Eric
^ permalink raw reply
* Re: For review: pidfd_open(2) manual page
From: Christian Brauner @ 2019-09-24 7:38 UTC (permalink / raw)
To: Florian Weimer
Cc: Michael Kerrisk (man-pages), Jann Horn, Daniel Colascione,
Eric W. Biederman, Joel Fernandes, Linux API, lkml, linux-man,
Oleg Nesterov
In-Reply-To: <874l12924w.fsf@mid.deneb.enyo.de>
On Mon, Sep 23, 2019 at 10:41:19PM +0200, Florian Weimer wrote:
> * Michael Kerrisk:
>
> >>> static
> >>> int pidfd_open(pid_t pid, unsigned int flags)
> >>> {
> >>> return syscall(__NR_pidfd_open, pid, flags);
> >>> }
> >>
> >> Please call this function something else (not pidfd_open), so that the
> >> example continues to work if glibc provides the system call wrapper.
> >
> > I figured that if the syscall does get added to glibc, then I would
> > modify the example. In the meantime, this does seem the most natural
> > way of doing things, since the example then uses the real syscall
> > name as it would be used if there were a wrapper function.
>
> The problem is that programs do this as well, so they fail to build
> once they are built on a newer glibc version.
>
> > But, this leads to the question: what do you think the likelihood
> > is that this system call will land in glibc?
>
> Quite likely. It's easy enough to document, there are no P&C issues,
> and it doesn't need any new types.
My previous mail probably didn't make it so here it is again: I think
especially with the recently established glibc consensus to provide
wrappers for all new system calls (with some sensible exceptions) I'd
expect this to be the case.
>
> pidfd_send_signal is slightly more difficult because we probably need
> to add rt_sigqueueinfo first, for consistency.
Oh, huh. Somehow I thought we already provide that.
Christian
^ permalink raw reply
* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Omar Sandoval @ 2019-09-24 17:15 UTC (permalink / raw)
To: Jann Horn
Cc: Jens Axboe, linux-fsdevel, linux-btrfs, Dave Chinner, Linux API,
Kernel Team, Andy Lutomirski
In-Reply-To: <CAG48ez2GKv15Uj6Wzv0sG5v2bXyrSaCtRTw5Ok_ovja_CiO_fQ@mail.gmail.com>
On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > Btrfs can transparently compress data written by the user. However, we'd
> > like to add an interface to write pre-compressed data directly to the
> > filesystem. This adds support for so-called "encoded writes" via
> > pwritev2().
> >
> > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > contains metadata about the write: namely, the compression algorithm and
> > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > must be set to sizeof(struct encoded_iov), which can be used to extend
> > the interface in the future. The remaining iovecs contain the encoded
> > extent.
> >
> > A similar interface for reading encoded data can be added to preadv2()
> > in the future.
> >
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> [...]
> > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > + struct iov_iter *from)
> > +{
> > + if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > + return -EINVAL;
> > + if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > + return -EFAULT;
> > + if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > + encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > + iocb->ki_flags &= ~IOCB_ENCODED;
> > + return 0;
> > + }
> > + if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > + encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > + return -EINVAL;
> > + if (!capable(CAP_SYS_ADMIN))
> > + return -EPERM;
>
> How does this capable() check interact with io_uring? Without having
> looked at this in detail, I suspect that when an encoded write is
> requested through io_uring, the capable() check might be executed on
> something like a workqueue worker thread, which is probably running
> with a full capability set.
I discussed this more with Jens. You're right, per-IO permission checks
aren't going to work. In fully-polled mode, we never get an opportunity
to check capabilities in right context. So, this will probably require a
new open flag.
^ permalink raw reply
* Re: For review: pidfd_send_signal(2) manual page
From: Michael Kerrisk (man-pages) @ 2019-09-24 19:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: mtk.manpages, Oleg Nesterov, Christian Brauner, Jann Horn,
Daniel Colascione, Joel Fernandes, linux-man, Linux API, lkml
In-Reply-To: <87ftkmu2i6.fsf@x220.int.ebiederm.org>
On 9/23/19 11:27 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
>
>> Hello Christian and all,
>>
>> Below, I have the rendered version of the current draft of
>> the pidfd_send_signal(2) manual page that I have written.
>> The page source can be found in a Git branch at:
>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_pidfd
>>
>> I would be pleased to receive corrections and notes on any
>> details that should be added. (For example, are there error
>> cases that I have missed?)
>>
>> Would you be able to review please?
>>
>> Thanks,
>>
>> Michael
>>
>>
>> NAME
>> pidfd_send_signal - send a signal to a process specified by a file
>> descriptor
>>
>> SYNOPSIS
>> int pidfd_send_signal(int pidfd, int sig, siginfo_t info,
>
> This needs to be "siginfo_t *info," -----------------------^
Thanks, Eric. Fixed.
Cheers,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply
* Re: [RFC PATCH 2/3] fs: add RWF_ENCODED for writing compressed data
From: Omar Sandoval @ 2019-09-24 19:35 UTC (permalink / raw)
To: Jann Horn
Cc: Jens Axboe, linux-fsdevel, linux-btrfs, Dave Chinner, Linux API,
Kernel Team, Andy Lutomirski
In-Reply-To: <20190924171513.GA39872@vader>
On Tue, Sep 24, 2019 at 10:15:13AM -0700, Omar Sandoval wrote:
> On Thu, Sep 19, 2019 at 05:44:12PM +0200, Jann Horn wrote:
> > On Thu, Sep 19, 2019 at 8:54 AM Omar Sandoval <osandov@osandov.com> wrote:
> > > Btrfs can transparently compress data written by the user. However, we'd
> > > like to add an interface to write pre-compressed data directly to the
> > > filesystem. This adds support for so-called "encoded writes" via
> > > pwritev2().
> > >
> > > A new RWF_ENCODED flags indicates that a write is "encoded". If this
> > > flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > contains metadata about the write: namely, the compression algorithm and
> > > the unencoded (i.e., decompressed) length of the extent. iov[0].iov_len
> > > must be set to sizeof(struct encoded_iov), which can be used to extend
> > > the interface in the future. The remaining iovecs contain the encoded
> > > extent.
> > >
> > > A similar interface for reading encoded data can be added to preadv2()
> > > in the future.
> > >
> > > Filesystems must indicate that they support encoded writes by setting
> > > FMODE_ENCODED_IO in ->file_open().
> > [...]
> > > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > > + struct iov_iter *from)
> > > +{
> > > + if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > > + return -EINVAL;
> > > + if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > > + return -EFAULT;
> > > + if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > > + encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) {
> > > + iocb->ki_flags &= ~IOCB_ENCODED;
> > > + return 0;
> > > + }
> > > + if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > > + encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > > + return -EINVAL;
> > > + if (!capable(CAP_SYS_ADMIN))
> > > + return -EPERM;
> >
> > How does this capable() check interact with io_uring? Without having
> > looked at this in detail, I suspect that when an encoded write is
> > requested through io_uring, the capable() check might be executed on
> > something like a workqueue worker thread, which is probably running
> > with a full capability set.
>
> I discussed this more with Jens. You're right, per-IO permission checks
> aren't going to work. In fully-polled mode, we never get an opportunity
> to check capabilities in right context. So, this will probably require a
> new open flag.
Actually, file_ns_capable() accomplishes the same thing without a new
open flag. Changing the capable() check to file_ns_capable() in
init_user_ns should be enough.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox