From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1M1ezQ-00013k-75
	for qemu-devel@nongnu.org; Wed, 06 May 2009 07:08:40 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1M1ezK-00011O-Q3
	for qemu-devel@nongnu.org; Wed, 06 May 2009 07:08:39 -0400
Received: from [199.232.76.173] (port=35446 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1M1ezK-000117-BM
	for qemu-devel@nongnu.org; Wed, 06 May 2009 07:08:34 -0400
Received: from mail2.shareable.org ([80.68.89.115]:44620)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from ) id 1M1ezJ-0002Lf-It
	for qemu-devel@nongnu.org; Wed, 06 May 2009 07:08:34 -0400
Date: Wed, 6 May 2009 12:08:32 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH] linux-user: implement pipe2 syscall
Message-ID: <20090506110832.GC23364@shareable.org>
References: <20090505133048.GA29646@kos.to> <20090505225809.GJ7574@shareable.org> <20090506080023.GA7230@kos.to>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090506080023.GA7230@kos.to>
List-Id: qemu-devel.nongnu.org
To: Riku Voipio
Cc: qemu-devel@nongnu.org

Riku Voipio wrote:
> > The point of pipe2() with FD_CLOEXEC is to be atomic: make sure
> > another thread can never see the file descriptor with FD_CLOEXEC not set.
>
> > If you can't guarantee that, it's better to return ENOSYS as every
> > application using pipe2() like this has a fallback to use pipe() and
> > FD_CLOEXEC itself, and probably has application logic to protect
> > against the race condition.
>
> > If there's only one thread, or if you can arrange to block any
> > concurrent clone/fork/execve calls in other threads (in QEMU) during
> > the race window, then it's fine to emulate it with fcntl.
>
> We haven't returned from the pipe2 syscall when setting the flag with fcntl.
> Before returning from the syscall, the pipe file descriptors could point
> to anything (uninitialized memory, zeros, ...)

That's not possible with file descriptors.  A user program never sees
an uninitialized descriptor, because descriptors aren't visible to the
user program (in any thread) until they are stored into the file
descriptor table for the process.  That happens once the descriptor is
completely initialised, and for pipe2() that means _after_ FD_CLOEXEC
is set.

Of course it's usually an application bug to use a specific file
descriptor from another thread when that descriptor is still being
created :-)

But it's not a bug to call execve(), or fork() then execve(), in
another thread at the same time as descriptors are being created.
Those calls scan the whole file descriptor table and look at the
FD_CLOEXEC flags.

The bug is that execve() in parallel with pipe()+fcntl() can result in
the file descriptor getting copied to a child process, because
execve() can scan it in the window after pipe() returns and before
fcntl() has set FD_CLOEXEC.  That's why pipe2() exists: to fix that
bug properly by making it impossible.

I haven't looked too closely at how guest file descriptors are handled
in QEMU these days.  In an older version I'm looking at, guest file
descriptors are simply host file descriptors, so the pipe2 emulation is
broken in this way.

If QEMU maintained a guest file descriptor table internally, emulating
what a kernel does, this would be solved automatically, but it doesn't.

You can solve it quite simply for any host kernel with the lock
solution I just posted in another mail on this thread.  The same
method works for all the other syscalls taking *_CLOEXEC flags, so
it's probably a good idea :-)

-- 
Jamie
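
P.S.  For illustration only, here is a minimal sketch of the general
idea of closing the race window with a lock.  It is not necessarily
the scheme from the other mail, and it is not QEMU code: the names
fd_flags_lock, emulated_pipe2_cloexec(), locked_fork() and
locked_execve() are all made up for this example.  It assumes every
fork/execve in the process goes through the wrappers, and it uses one
plain mutex for simplicity.

```c
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical lock: held while a descriptor exists without its final
 * FD_CLOEXEC flag, and held around fork()/execve(). */
pthread_mutex_t fd_flags_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: pipe() + fcntl() made atomic with respect to
 * locked_fork() and locked_execve() below.  Returns 0 or -1 (errno set). */
int emulated_pipe2_cloexec(int fds[2])
{
    int ret = -1;

    pthread_mutex_lock(&fd_flags_lock);
    if (pipe(fds) == 0) {
        if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) == 0 &&
            fcntl(fds[1], F_SETFD, FD_CLOEXEC) == 0) {
            ret = 0;
        } else {
            int saved = errno;      /* keep the fcntl() error */
            close(fds[0]);
            close(fds[1]);
            errno = saved;
        }
    }
    pthread_mutex_unlock(&fd_flags_lock);
    return ret;
}

/* Hypothetical wrapper: fork() cannot run inside the race window. */
pid_t locked_fork(void)
{
    pid_t pid;

    pthread_mutex_lock(&fd_flags_lock);
    pid = fork();
    /* Runs in both parent and child; each releases its own copy. */
    pthread_mutex_unlock(&fd_flags_lock);
    return pid;
}

/* Hypothetical wrapper: execve() cannot run inside the race window.
 * Only returns on failure, in which case the lock is released. */
int locked_execve(const char *path, char *const argv[], char *const envp[])
{
    int saved;

    pthread_mutex_lock(&fd_flags_lock);
    execve(path, argv, envp);
    saved = errno;
    pthread_mutex_unlock(&fd_flags_lock);
    errno = saved;
    return -1;
}
```

A single mutex serialises all descriptor creation against each other
as well as against fork/execve; a reader/writer lock would relax that,
at the cost of a little more care in the child after fork().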