From: josh@joshtriplett.org
To: Thiago Macieira <thiago.macieira@intel.com>
Cc: David Drysdale <drysdale@google.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andy Lutomirski <luto@amacapital.net>,
	Ingo Molnar <mingo@redhat.com>, Kees Cook <keescook@chromium.org>,
	Oleg Nesterov <oleg@redhat.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Rik van Riel <riel@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org, X86 ML <x86@kernel.org>
Subject: Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor
Date: Fri, 13 Mar 2015 14:44:47 -0700
Message-ID: <20150313214447.GA10954@cloud>
In-Reply-To: <8029771.gA9WzoLePv@tjmaciei-mobl4>

On Fri, Mar 13, 2015 at 02:16:07PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > > Hi Josh,
> > > 
> > > From the overall description (i.e. I haven't looked at the code yet)
> > > this looks very interesting.  However, it seems to cover a lot of the
> > > same ground as the process descriptor feature that was added to FreeBSD
> > > in 9.x/10.x:
> > >   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> > 
> > Interesting.
> 
> I wasn't aware of the FreeBSD implementation of pdfork(). It is actually 
> exactly what I need in userspace.

Right; libqt should be able to use pdfork on FreeBSD and CLONE_FD on
Linux.

> The only difference between pdfork() and 
> my proposed forkfd() is where the PID and where the file descriptor are 
> returned (meaning, which is optional and which isn't).
> 
> Josh and I opted to return the file descriptor in the regular return value in 
> forkfd and in clone4 because getting the file descriptor is the whole objective of 
> using the forkfd or clone4-with-CLONE_FD in the first place: the file descriptor 
> is not optional, but the PID is.

And as long as you can get the fd, where it's returned really doesn't
matter.

> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those.  I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
> > 
> > For example, pdfork can just call clone4 with CLONE_FD and return the
> > resulting file descriptor.
> 
> Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().
> 
> I'm not too attached to the forkfd() interface, but I find it slightly superior 
> for the reasons above.

Agreed.
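
To illustrate the layering, a pdfork()-shaped wrapper over an
fd-returning primitive could look roughly like the sketch below.
Everything prefixed sketch_/SKETCH_ is hypothetical.  The primitive is
emulated with fork() and a pipe purely so the example runs on any POSIX
system; it is not how CLONE_FD works internally, and the pipe's write
end leaks into anything the child execs.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Sentinel returned in the child.  Name and value are assumptions,
 * modelled on the FFD_CHILD_PROCESS idea discussed in this thread. */
#define SKETCH_CHILD (-4096)

/*
 * Hypothetical stand-in for forkfd()/clone4(CLONE_FD): returns a file
 * descriptor in the parent, SKETCH_CHILD in the child, -1 on error.
 * Emulated with a pipe: only the child keeps the write end, so the
 * read end sees end-of-file once the child has exited.  The PID is
 * optional, reported through pid_out when it is non-NULL.
 */
static int sketch_forkfd(pid_t *pid_out)
{
    int p[2];
    pid_t pid;

    if (pipe(p) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);            /* child keeps only the write end */
        return SKETCH_CHILD;
    }
    close(p[1]);                /* parent keeps only the read end */
    if (pid_out != NULL)
        *pid_out = pid;
    return p[0];
}

/* pdfork()-shaped wrapper: PID in the return value, fd via *fdp. */
static pid_t sketch_pdfork(int *fdp)
{
    pid_t pid = 0;
    int fd;

    assert(fdp != NULL);
    fd = sketch_forkfd(&pid);
    if (fd == SKETCH_CHILD)
        return 0;               /* in the child, as with fork() */
    if (fd < 0)
        return -1;
    *fdp = fd;
    return pid;
}
```

The wrapper only moves values between the return slot and the out
parameter, which is why the choice of which one is "primary" matters
little at the kernel level.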

> If we want the PD_DAEMON flag, it will have to translate to a clone flag, like 
> CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.

I think the inverted version makes more sense, so that the default
behavior just changes exit notification without adding the kill-on-close
behavior.  And that kill-on-close behavior can come in a later patch. :)
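
With the inverted polarity, a FreeBSD-compatible pdfork() in libc would
translate its flags along these lines.  This is only a sketch: both
flag names and both values below are placeholders, not a real ABI, and
only the one pdfork flag is modelled.

```c
#include <assert.h>

/* Placeholder values, for illustration only. */
#define SKETCH_PD_DAEMON             0x1u  /* stand-in for PD_DAEMON */
#define SKETCH_CLONEFD_KILL_ON_CLOSE 0x1u  /* stand-in for the proposed flag */

/*
 * Translate pdfork-style flags into clone-side flags: kill-on-close is
 * requested unless the caller asked for daemon behaviour.
 */
static unsigned int sketch_pd_to_clonefd_flags(unsigned int pd_flags)
{
    unsigned int clonefd_flags = 0;

    /* This sketch models only the daemon flag. */
    assert((pd_flags & ~SKETCH_PD_DAEMON) == 0);

    if (!(pd_flags & SKETCH_PD_DAEMON))
        clonefd_flags |= SKETCH_CLONEFD_KILL_ON_CLOSE;
    return clonefd_flags;
}
```

A caller that passes no flags at all to the clone-side interface gets
plain exit notification, which is the default argued for above.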

> > In the future, I plan to add an fd-based equivalent of
> > rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> > whether to kill a process or thread) which is a superset of pdkill.
> > pdkill could then call that and just not pass the extra info.
> > 
> > A fair bit of pdwait4 could be implemented on top of read(), other than
> > the full rusage information (see below), and the ability to wait for
> > STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> > but it'd have to be set via a flag at clone time).
> > 
> > I think it's a feature to use read() rather than an additional magic
> > system call.
> 
> Indeed, even if the libc provides a wrapper for you, like glibc does for 
> eventfd (eventfd_read, eventfd_write).
> 
> Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in 
> the initial patch set, but it was part of the plans.
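
As a rough picture of what such a libc wrapper over read() might look
like, in the spirit of eventfd_read(): the sketch below reads one
fixed-size record and treats anything else as an error.  The record
layout and the function name are invented for the example; the real
structure is whatever the patch series defines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record layout, for illustration only. */
struct sketch_exit_info {
    uint32_t code;
    uint32_t status;
};

/*
 * Read exactly one record from fd.  Returns 0 on success, -1 on error
 * with errno set (EIO if the read was short or hit end-of-file).
 */
static int sketch_exitfd_read(int fd, struct sketch_exit_info *out)
{
    ssize_t n;

    assert(out != NULL);
    do {
        n = read(fd, out, sizeof(*out));
    } while (n < 0 && errno == EINTR);   /* retry if interrupted */

    if (n == (ssize_t)sizeof(*out))
        return 0;
    if (n >= 0)
        errno = EIO;                     /* short read or EOF */
    return -1;
}
```

Because the wrapper is just read(), the same descriptor works
unchanged with poll(), select() and epoll.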
> 
> > > >               clone4() will never return a file descriptor in the range
> > > >               0-2 to
> > > >               the caller, to avoid ambiguity with the return of 0 in the
> > > >               child
> > > >               process.  Only the  calling  process  will  have  the  new
> > > >                file
> > > >               descriptor open; the child process will not.
> > > 
> > > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > > return the file descriptor separately, which avoids the need for special
> > > case processing for low FD values (and means that POSIX's "lowest file
> > > descriptor not currently open" behaviour can be preserved if desired).
> > 
> > That'd be easy to implement if desired, by adding an outbound pointer to
> > clone4_args.
> >
> > The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> > syscalls that use the fd as an identifier, PIDs can hopefully become
> > mostly unnecessary.  However, I'm not that attached to changing the
> > return value; it'd be trivial to switch to an outbound parameter
> > instead, and then drop the "not 0-2".
> 
> See above for more motivation on making the PID optional.
> 
> As for the file descriptor range, if we need to be able to return 0, we can 
> implement a magic constant to mean the child process, like the userspace 
> forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on 
> Linux, since that is neither a valid file descriptor nor a valid errno value.

I don't think that logic is worth implementing, though, since it would
require changing all the architecture-specific copy_thread
implementations.  If we really want to go this path, we should just
return the fd via an out parameter in the clone4_args structure.
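
For concreteness, the sentinel scheme would have callers decode the raw
return value roughly as follows.  This is a sketch under the usual
Linux convention that raw syscall returns in -4095..-1 are negated
errno values; the names are made up for the example.

```c
#include <assert.h>

#define SKETCH_MAX_ERRNO 4095       /* -4095..-1 are negated errno values */
#define SKETCH_CHILD     (-4096L)   /* proposed "you are the child" marker */

enum sketch_ret_kind {
    SKETCH_RET_FD,      /* parent: value is a file descriptor */
    SKETCH_RET_ERRNO,   /* failure: value is a negated errno */
    SKETCH_RET_CHILD    /* child process */
};

/* Classify the raw return value of an fd-returning clone-like call. */
static enum sketch_ret_kind sketch_classify(long raw)
{
    /* An fd-returning call has no reason to yield anything lower. */
    assert(raw >= SKETCH_CHILD);

    if (raw >= 0)
        return SKETCH_RET_FD;       /* 0 is now a usable descriptor */
    if (raw >= -SKETCH_MAX_ERRNO)
        return SKETCH_RET_ERRNO;
    return SKETCH_RET_CHILD;
}
```

The decoding itself is cheap in userspace; the cost is on the kernel
side, where each architecture decides what the child sees as its
return value.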

> > > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> > > process descriptor, including rusage retrieval.  However, I don't think
> > > they actually implemented it:
> > >   http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
> > 
> > That's a pretty good argument that we don't need to either, at least not
> > yet.
> 
> pdwait4() can be implemented on top of read(), with the WNOHANG flag just 
> toggling the O_NONBLOCK bit. The problem is with the rest of the flags. We 
> could implement it via more ioctls to be done prior to read() if we don't want 
> to add a syscall...
> 
> Another alternative is to add a P_PD flag that can be passed as the first 
> argument to waitid(), making the second argument a file descriptor instead of a 
> PID or pgrp.

Or a flag that can be added to the options argument of wait4 to indicate
that the first argument is really a file descriptor.
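
The WNOHANG-as-O_NONBLOCK idea can be sketched with plain fcntl() and
read().  The helper name is hypothetical, and note one caveat the
sketch does not solve: O_NONBLOCK lives on the open file description,
so toggling it is visible to anyone sharing the descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * WNOHANG-style read: returns 1 if read() completed (*nread holds its
 * result, 0 meaning end-of-file), 0 if nothing is available yet, and
 * -1 on error with errno set.  The caller's blocking mode is restored
 * before returning.
 */
static int sketch_read_nohang(int fd, void *buf, size_t len, ssize_t *nread)
{
    int saved, err;
    ssize_t n;

    assert(nread != NULL);
    saved = fcntl(fd, F_GETFL);
    if (saved < 0)
        return -1;
    if (fcntl(fd, F_SETFL, saved | O_NONBLOCK) < 0)
        return -1;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    err = errno;

    (void)fcntl(fd, F_SETFL, saved);    /* restore original flags */

    if (n >= 0) {
        *nread = n;
        return 1;
    }
    if (err == EAGAIN || err == EWOULDBLOCK)
        return 0;                       /* would have blocked */
    errno = err;
    return -1;
}
```

The remaining wait4 options have no such natural mapping, which is
where the extra flag or extra syscall comes in.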

> > > FreeBSD also implements fstat(2) for its process descriptors, although
> > > only a few of the fields get filled in.
> > 
> > I looked at what they provide, and that seems like more of a novelty
> > than something particularly useful (since most of the stat fields aren't
> > meaningful), but if that's useful for compatibility then adding it seems
> > fine.
> 
> I don't think we need to do anything: anon_inode will do it for us.
> 
> If I stat an eventfd:
> 
> 	stat("/proc/107751/fd/4", {st_dev=makedev(0, 9), st_ino=3943, 
> st_mode=0600, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, 
> st_size=0, st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 
> And just out of curiosity, in the following order: epoll, signalfd, timerfd 
> and inotify:
> 
> 	stat("/proc/1462/fd/4", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 	stat("/proc/1462/fd/5", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 	stat("/proc/1462/fd/7", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 	stat("/proc/1462/fd/8", {st_dev=makedev(0, 9), st_ino=3943, st_mode=0600, 
> st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, 
> st_atime=2015/03/07-16:40:28, st_mtime=2015/03/12-16:12:00, 
> st_ctime=2015/03/12-16:12:00}) = 0
> 
> (that process is systemd --user)

Interesting.  What does stat on a CLONE_FD file descriptor return?

> > > >               poll(2), select(2), epoll(7) (and similar)
> > > >               
> > > >                      The  file  descriptor  is readable (the select(2)
> > > >                      readfds
> > > >                      argument; the poll(2) POLLIN flag) if the new
> > > >                      process has
> > > >                      exited.
> > > 
> > > FreeBSD uses POLLHUP here.
> > 
> > That makes sense given that they provide the information via a separate
> > call rather than read.  Since the CLONE_FD file descriptor uses read, it
> > needs to provide POLLIN, but I have no objection to using *both* POLLIN
> > and POLLHUP if that'd be at all useful.
> 
> I think we should provide both, since we're notifying that there are things to 
> be read and that the file descriptor has closed.

"closed" in the "other end of the not-quite-a-pipe" sense, sure.  I'll
add that in a v2.
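
An ordinary pipe already behaves this way on Linux, which makes it a
convenient model: with data queued and the writer gone, poll() reports
POLLIN and POLLHUP together.  The sketch below demonstrates that with a
pipe only; it says nothing about the CLONE_FD implementation itself,
and the function names are invented for the example.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Poll fd for readability.  Returns the revents bits (> 0) if
 * anything was reported, 0 on timeout, -1 on error.  POLLHUP is
 * reported by poll() whether or not it was requested.
 */
static int sketch_poll_events(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    assert(fd >= 0);
    rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;
    return pfd.revents;
}

/*
 * Pipe demonstration: queue one byte, close the write end, then poll
 * the read end.  Returns the revents bits, or -1 on setup failure.
 */
static int sketch_demo_pipe_events(void)
{
    int p[2];
    int ev;

    if (pipe(p) < 0)
        return -1;
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    close(p[1]);                    /* "other end" goes away */
    ev = sketch_poll_events(p[0], 0);
    close(p[0]);
    return ev;
}
```

A caller that only checks POLLIN keeps working; one that checks
POLLHUP, as FreeBSD code would, works too.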

> > > FreeBSD has two different behaviours for close(2), depending on a flag
> > > value (PD_DAEMON).  With the flag set it's roughly like this, but
> > > without PD_DAEMON a close(2) operation on the (last open) file
> > > descriptor terminates the child process.
> > > 
> > > This can be quite useful, particularly for the use case where some
> > > userspace library has an FD-controlled subprocess -- if the application
> > > using the library terminates, the process descriptor is closed and so
> > > the subprocess is automatically terminated.
> > 
> > That's an interesting idea.  I don't think it makes sense for that to be
> > the default behavior, but if someone wanted to add an additional flag
> > to implement that behavior, that seems fine.  A FreeBSD-compatible
> > pdfork could then use that flag when not passed PD_DAEMON and not use it
> > when passed PD_DAEMON.
> > 
> > How does it kill the process when the last open descriptor closes?
> > SIGKILL?  SIGTERM?  The former seems unfriendly (preventing graceful
> > termination), and the latter blockable.  There's a reason init systems
> > send TERM, then wait, then KILL.
> 
> I was wondering if it shouldn't be a SIGHUP, since we're talking about a file 
> descriptor closing. We could make it configurable too, but I'd rather not use 
> the current CSIGNAL field. Better to move it to the arguments structure: if 
> someone is passing SIGCHLD in CSIGNAL, they should get EINVAL instead of 
> silently sending SIGCHLD to the child process to ask it to terminate.

That sounds like several good reasons right there to defer "kill on
close" to a future patch, the author of which should research how
FreeBSD implements this.

- Josh Triplett
