From: josh@joshtriplett.org
To: Thiago Macieira <thiago.macieira@intel.com>
Cc: David Drysdale <drysdale@google.com>,
Al Viro <viro@zeniv.linux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@amacapital.net>,
Ingo Molnar <mingo@redhat.com>, Kees Cook <keescook@chromium.org>,
Oleg Nesterov <oleg@redhat.com>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
"H. Peter Anvin" <hpa@zytor.com>, Rik van Riel <riel@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Michael Kerrisk <mtk.manpages@gmail.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Linux API <linux-api@vger.kernel.org>,
linux-fsdevel@vger.kernel.org, X86 ML <x86@kernel.org>
Subject: Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
Date: Fri, 13 Mar 2015 14:44:47 -0700 [thread overview]
Message-ID: <20150313214447.GA10954@cloud> (raw)
In-Reply-To: <8029771.gA9WzoLePv@tjmaciei-mobl4>
On Fri, Mar 13, 2015 at 02:16:07PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > > Hi Josh,
> > >
> > > From the overall description (i.e. I haven't looked at the code yet)
> > > this looks very interesting. However, it seems to cover a lot of the
> > > same ground as the process descriptor feature that was added to FreeBSD
> > >
> > > in 9.x/10.x:
> > > https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> >
> > Interesting.
>
> I wasn't aware of the FreeBSD implementation of pdfork(). It is actually
> exactly what I need in userspace.
Right; libqt should be able to use pdfork on FreeBSD and CLONE_FD on
Linux.
> The only difference between pdfork() and and
> my proposed forkfd() is where the PID and where the file descriptor are
> returned (meaning, which is optional and which isn't).
>
> Josh and I opted to return the file descriptor in the regular return value in
> forkfd and in clone4 because getting the file descriptor the whole objective of
> using the forkfd or clone4-with-CLONE_FD in the first place: the file descriptor
> is not optional, but the PID is.
And as long as you can get the fd, where it's returned really doesn't
matter.
> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those. I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
> >
> > For example, pdfork can just call clone4 with CLONE_FD and return the
> > resulting file descriptor.
>
> Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().
>
> I'm not too attached to the forkfd() interface, but I find it slightly superior
> for the reasons above.
Agreed.
> If we want the PD_DAEMON flag, it will have to translate to a clone flag, like
> CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.
I think the inverted version makes more sense, so that the default
behavior just changes exit notification without adding the kill-on-close
behavior. And that kill-on-close behavior can come in a later patch. :)
> > In the future, I plan to add an fd-based equivalent of
> > rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> > whether to kill a process or thread) which is a superset of pdkill.
> > pdkill could then call that and just not pass the extra info.
> >
> > A fair bit of pdwait4 could be implemented on top of read(), other than
> > the full rusage information (see below), and the ability to wait for
> > STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> > but it'd have to be set via a flag at clone time).
> >
> > I think it's a feature to use read() rather than an additional magic
> > system call.
>
> Indeed, even if the libc provides a wrapper for you, like glibc does for
> eventfd (eventfd_read, eventfd_write).
>
> Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in
> the initial patch set, but it was part of the plans.
>
> > > > clone4() will never return a file descriptor in the range
> > > > 0-2 to
> > > > the caller, to avoid ambiguity with the return of 0 in the
> > > > child
> > > > process. Only the calling process will have the new
> > > > file
> > > > descriptor open; the child process will not.
> > >
> > > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > > return the file descriptor separately, which avoids the need for special
> > > case processing for low FD values (and means that POSIX's "lowest file
> > > descriptor not currently open" behaviour can be preserved if desired).
> >
> > That'd be easy to implement if desired, by adding an outbound pointer to
> > clone4_args.
> >
> > The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> > syscalls that use the fd as an identifier, PIDs can hopefully become
> > mostly unnecessary. However, I'm not that attached to changing the
> > return value; it'd be trivial to switch to an outbound parameter
> > instead, and then drop the "not 0-2".
>
> See above for more motivation on making the PID optional.
>
> As for the file descriptor range, if we need to be able to return 0, we can
> implement a magic constant to mean the child process, like the userspace
> forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on
> Linux, since that is neither a valid file descriptor nor a valid errno value.
I don't think that logic is worth implementing, though, since it would
require changing all the architecture-specific copy_thread
implementations. If we really want to go this path, we should just
return the fd via an out parameter in the clone4_args structure.
> > > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> > > process descriptor, including rusage retrieval. However, I don't think
> > >
> > > they actually implemented it:
> > > http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
> >
> > That's a pretty good argument that we don't need to either, at least not
> > yet.
>
> pdwait4() can be implemented on top of read(), with the WNOHANG flag being just
> toggling the O_NONBLOCK bit. The problem is with the rest of the flags. We
> could implement it via more ioctls to be done prior to read() if we don't want
> to add a syscall...
>
> Another alternative is to add a P_PD flag that can be passed as the first
> argument to waitid(), making the second argument a file descriptor instead of a
> PID or pgrp.
Or a flag that can be added to the options argument of wait4 to indicate
that the first argument is really a file descriptor.
> > > FreeBSD also implements fstat(2) for its process descriptors, although
> > > only a few of the fields get filled in.
> >
> > I looked at what they provide, and that seems like more of a novelty
> > than something particularly useful (since most of the stat fields aren't
> > meaningful), but if that's useful for compatibility then adding it seems
> > fine.
>
> I don't think we need to do anything: anon_inode will do it for us.
>
> If I stat an eventfd:
>
> stat("/proc/107751/fd/4", {st_dev=makedev(0, 9), st_ino=3943,
> st_mode=0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
>
> And just out of curiosity, in the following order: epoll, signalfd, timerfd
> and inotify:
>
> stat("/proc/1462/fd/4", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/5", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/7", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
> stat("/proc/1462/fd/8", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600,
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0,
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00,
> st_ctime=2015/03/12-16:12:00}) = 0
>
> (that process is systemd --user)
Interesting. What does stat on a CLONE_FD file descriptor return?
> > > > poll(2), select(2), epoll(7) (and similar)
> > > >
> > > > The file descriptor is readable (the select(2)
> > > > readfds
> > > > argument; the poll(2) POLLIN flag) if the new
> > > > process has
> > > > exited.
> > >
> > > FreeBSD uses POLLHUP here.
> >
> > That makes sense given that they provide the information via a separate
> > call rather than read. Since the CLONE_FD file descriptor uses read, it
> > needs to provide POLLIN, but I have no objection to using *both* POLLIN
> > and POLLHUP if that'd be at all useful.
>
> I think we should provide both, since we're notifying that there are things to
> be read and that the file descriptor has closed.
"closed" in the "other end of the not-quite-a-pipe" sense, sure. I'll
add that in a v2.
> > > FreeBSD has two different behaviours for close(2), depending on a flag
> > > value (PD_DAEMON). With the flag set it's roughly like this, but
> > > without PD_DAEMON a close(2) operation on the (last open) file
> > > descriptor terminates the child process.
> > >
> > > This can be quite useful, particularly for the use case where some
> > > userspace library has an FD-controlled subprocess -- if the application
> > > using the library terminates, the process descriptor is closed and so
> > > the subprocess is automatically terminated.
> >
> > That's an interesting idea. I don't think it makes sense for that to be
> > the default behavior, but if someone wanted to add an additional flag
> > to implement that behavior, that seems fine. A FreeBSD-compatible
> > pdfork could then use that flag when not passed PD_DAEMON and not use it
> > when passed PD_DAEMON.
> >
> > How does it kill the process when the last open descriptor closes?
> > SIGKILL? SIGTERM? The former seems unfriendly (preventing graceful
> > termination), and the latter blockable. There's a reason init systems
> > send TERM, then wait, then KILL.
>
> I was wondering if it shouldn't be a SIGHUP, since we're talking about a file
> descriptor closing. We could make it configurable too, but I'd rather not use
> the current CSIGNAL field -- better move to the arguments structure, just in
> case someone is passing SIGCHLD there, they should get EINVAL instead of
> silently sending SIGCHLD to the child process to ask it to terminate.
That sounds like several good reasons right there to defer "kill on
close" to a future patch, the author of which should research how
FreeBSD implements this.
- Josh Triplett
next prev parent reply other threads:[~2015-03-13 21:44 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-03-13 1:40 [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor Josh Triplett
2015-03-13 1:40 ` [PATCH 1/6] clone: Support passing tls argument via C rather than pt_regs magic Josh Triplett
2015-03-13 1:40 ` [PATCH 3/6] Introduce a new clone4 syscall with more flag bits and extensible arguments Josh Triplett
2015-03-13 1:40 ` [PATCH 4/6] signal: Factor out a helper function to process task_struct exit_code Josh Triplett
2015-03-13 1:41 ` [PATCH 6/6] clone4: Introduce new CLONE_FD flag to get task exit notification via fd Josh Triplett
2015-03-13 16:21 ` Oleg Nesterov
2015-03-13 19:57 ` josh
2015-03-13 21:34 ` Andy Lutomirski
2015-03-13 22:20 ` josh
2015-03-13 22:28 ` Andy Lutomirski
[not found] ` <CALCETrVRqgsbpi9pPRwy42cuXiDzyPgWRJmfSRSQM7eGFfsZYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-13 22:34 ` josh-iaAMLnmF4UmaiuxdJuQwMA
2015-03-13 22:38 ` Andy Lutomirski
2015-03-14 14:14 ` Oleg Nesterov
[not found] ` <20150314141414.GA11062-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-14 14:32 ` Oleg Nesterov
2015-03-14 18:38 ` Thiago Macieira
2015-03-14 18:54 ` Oleg Nesterov
[not found] ` <20150314185424.GA6813-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-14 22:03 ` Josh Triplett
2015-03-14 22:26 ` Thiago Macieira
2015-03-14 19:01 ` Josh Triplett
2015-03-14 19:18 ` Oleg Nesterov
[not found] ` <20150314191836.GA8416-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-14 19:47 ` Oleg Nesterov
[not found] ` <20150314194721.GA9654-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-14 20:14 ` Josh Triplett
2015-03-14 20:30 ` Oleg Nesterov
[not found] ` <20150314203029.GA11656-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-14 22:14 ` Josh Triplett
2015-03-14 20:03 ` Josh Triplett
2015-03-14 20:20 ` Oleg Nesterov
2015-03-14 22:09 ` Josh Triplett
[not found] ` <9c39c576e1d9a9912b4aec54d833a73a84d2f592.1426180120.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
2015-03-14 14:35 ` Oleg Nesterov
[not found] ` <20150314143558.GB12086-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-14 19:15 ` Josh Triplett
2015-03-14 19:24 ` Oleg Nesterov
[not found] ` <20150314192456.GA8707-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-14 19:48 ` Josh Triplett
2015-03-13 1:41 ` [PATCH] clone4.2: New manpage documenting clone4(2) Josh Triplett
[not found] ` <cover.1426180120.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
2015-03-13 1:40 ` [PATCH 2/6] x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit Josh Triplett
[not found] ` <cf79b9f0c40314e6bfda7c634e378015bd7ba037.1426180120.git.josh-iaAMLnmF4UmaiuxdJuQwMA@public.gmane.org>
2015-03-13 22:01 ` Andy Lutomirski
2015-03-13 22:31 ` josh
2015-03-13 22:38 ` Andy Lutomirski
[not found] ` <CALCETrWspvNcEYxwbo1+ifXSj7Qj7YdcRgmNvZ1RaBrUAK12Zw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-13 22:43 ` josh-iaAMLnmF4UmaiuxdJuQwMA
2015-03-13 22:45 ` Andy Lutomirski
[not found] ` <CALCETrW3kJSVz4ffVC6YdB+ELukhOHNgPFKZSziMq5nn_Nq3Zg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-13 23:01 ` josh-iaAMLnmF4UmaiuxdJuQwMA
2015-03-13 1:40 ` [PATCH 5/6] fs: Make alloc_fd non-private Josh Triplett
2015-03-13 2:07 ` [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor Thiago Macieira
2015-03-13 16:05 ` David Drysdale
2015-03-13 19:42 ` Josh Triplett
2015-03-13 21:16 ` Thiago Macieira
2015-03-13 21:44 ` josh [this message]
2015-03-13 21:33 ` Andy Lutomirski
[not found] ` <CALCETrXH+Ui1XqVZjFB=8vDpJLCYeVf+XNUPGWNfwDsNKi_nKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-13 21:45 ` josh-iaAMLnmF4UmaiuxdJuQwMA
2015-03-13 21:51 ` Andy Lutomirski
[not found] ` <CALCETrWwkdWkNCsrcSAn+7f9SJCuYA-TV9=AygWMhXCC9Njp9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-14 1:11 ` Thiago Macieira
2015-03-14 19:03 ` Thiago Macieira
2015-03-14 19:29 ` Josh Triplett
2015-03-15 10:18 ` David Drysdale
2015-03-15 10:59 ` Josh Triplett
2015-03-15 8:55 ` David Drysdale
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150313214447.GA10954@cloud \
--to=josh@joshtriplett.org \
--cc=akpm@linux-foundation.org \
--cc=drysdale@google.com \
--cc=hpa@zytor.com \
--cc=keescook@chromium.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=mingo@redhat.com \
--cc=mtk.manpages@gmail.com \
--cc=oleg@redhat.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=riel@redhat.com \
--cc=tglx@linutronix.de \
--cc=thiago.macieira@intel.com \
--cc=viro@zeniv.linux.org.uk \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).