From: Alan Burlison <Alan.Burlison@oracle.com>
To: Al Viro <viro@ZenIV.linux.org.uk>, Casper.Dik@oracle.com
Cc: Eric Dumazet <eric.dumazet@gmail.com>,
Stephen Hemminger <stephen@networkplumber.org>,
netdev@vger.kernel.org, dholland-tech@netbsd.org
Subject: Re: Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)
Date: Thu, 22 Oct 2015 11:55:42 +0100 [thread overview]
Message-ID: <5628C0AE.2020508@oracle.com> (raw)
In-Reply-To: <20151022042100.GO22011@ZenIV.linux.org.uk>
On 22/10/2015 05:21, Al Viro wrote:
>> Most of the work on using a file descriptor is local to the thread.
>
> Using - sure, but what of cacheline dirtied every time you resolve a
> descriptor to file reference?
Don't you have to do that anyway, to do anything useful with the file?
> How much does it cover and to what
> degree is that local to thread? When it's a refcount inside struct file -
> no big deal, we'll be reading the same cacheline anyway and unless several
> threads are pounding on the same file with syscalls at the same time,
> that won't be a big deal. But when refcounts are associated with
> descriptors...
There is a refcount in the struct held in the per-process list of open
files and the 'slow path' processing is only taken if there's more than
one LWP in the process that's accessing the file.
> In case of Linux we have two bitmaps and an array of pointers associated
> with descriptor table. They grow on demand (in parallel)
> * reserving a descriptor is done under ->file_lock (dropped/regained
> around memory allocation if we end up expanding the sucker, actual reassignment
> of pointers to array/bitmaps is under that spinlock)
> * installing a pointer is lockless (we wait for ongoing resize to
> settle, RCU takes care of the rest)
> * grabbing a file by index is lockless as well
> * removing a pointer is under ->file_lock, so's replacing it by dup2().
Is that table per-process or global?
> The point is, dup2() over _unused_ descriptor is inherently racy, but dup2()
> over a descriptor we'd opened and kept open should be safe. As it is,
> their implementation boils down to "userland must do process-wide exclusion
> between *any* dup2() and all syscalls that might create a new descriptor -
> open()/pipe()/socket()/accept()/recvmsg()/dup()/etc". At the very least,
> it's a big QoI problem, especially since such userland exclusion would have
> to be taken around the operations that can block for a long time. Sure,
> POSIX wording regarding dup2 is so weak that this behaviour can be argued
> to be compliant, but... replacement of the opened file associated with
> newfd really ought to be atomic to be of any use for multithreaded processes.
There's existing language in the Issue 7 dup2() description that says
dup2() has to be atomic:
"the dup2( ) function provides unique services, as no other
interface is able to atomically replace an existing file descriptor."
And there is some new language in Issue 7 Technical Corrigenda 2 that
reinforces that, when it's talking about reassignment of
stdin/stdout/stderr:
"Furthermore, a close() followed by a reopen operation (e.g. open(),
dup() etc) is not atomic; dup2() should be used to change standard file
descriptors."
I don't think that it's possible to claim that a non-atomic dup2() is
POSIX-compliant.
> IOW, if newfd is an opened descriptor prior to dup2() and no other thread
> attempts to close it by any means, there should be no window during which
> it would appear to be not opened. Linux and FreeBSD satisfy that; OpenBSD
> seems to do the same, from the quick look. NetBSD doesn't, no idea about
> Solaris. FWIW, older NetBSD implementation (prior to "File descriptor changes,
> discussed on tech-kern:" back in 2008) used to behave like OpenBSD one; it
> had fixed a lot of crap, so it's entirely possible that OpenBSD simply has
> kept the old implementation, with tons of other races in that area, but this
> dup2() race got introduced in that rewrite.
Related to dup2(), there's some rather surprising behaviour on Linux.
Here's the scenario:
----------
ThreadA opens, listens and accepts on socket fd1, waiting for incoming
connections.
ThreadB waits for a while, then opens normal file fd2 for read/write.
ThreadB uses dup2 to make fd1 a clone of fd2.
ThreadB closes fd2.
ThreadA remains sat in accept on fd1 which is now a plain file, not a
socket.
ThreadB writes to fd1, the result of which appears in the file, so fd1
is indeed operating as a plain file.
ThreadB exits. ThreadA is still sat in accept on fd1.
A connection is made to fd1 by another process. The accept call succeeds
and returns the incoming connection. fd1 is still operating as a
socket, even though it's now actually a plain file.
----------
I assume this is another consequence of the fact that threads waiting in
accept don't get a notification if the fd they are using is closed,
either directly by a call to close or by a syscall such as dup2. Not
waking up other threads on a fd when it is closed seems like it's
undesirable behaviour.
I can see the reasoning behind allowing shutdown to be used to do such a
wakeup even if that's not POSIX-compliant - it may make it slightly
easier for applications avoid fd recycling races. However the current
situation is that shutdown is the *only* way to perform such a wakeup -
simply closing the fd has no effect on any other threads. That seems
incorrect.
--
Alan Burlison
--
next prev parent reply other threads:[~2015-10-22 10:55 UTC|newest]
Thread overview: 138+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-10-19 16:59 Fw: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Stephen Hemminger
2015-10-19 23:33 ` Eric Dumazet
2015-10-20 1:12 ` Alan Burlison
2015-10-20 1:45 ` Eric Dumazet
2015-10-20 9:59 ` Alan Burlison
2015-10-20 11:24 ` David Miller
2015-10-20 11:39 ` Alan Burlison
2015-10-20 13:19 ` Fw: " Eric Dumazet
2015-10-20 13:45 ` Alan Burlison
2015-10-20 15:30 ` Eric Dumazet
2015-10-20 18:31 ` Alan Burlison
2015-10-20 18:42 ` Eric Dumazet
2015-10-21 10:25 ` David Laight
2015-10-21 10:49 ` Alan Burlison
2015-10-21 11:28 ` Eric Dumazet
2015-10-21 13:03 ` Alan Burlison
2015-10-21 13:29 ` Eric Dumazet
2015-10-21 3:49 ` Al Viro
2015-10-21 14:38 ` Alan Burlison
2015-10-21 15:30 ` David Miller
2015-10-21 16:04 ` Casper.Dik
2015-10-21 21:18 ` Eric Dumazet
2015-10-21 21:28 ` Al Viro
2015-10-21 16:32 ` Fw: " Eric Dumazet
2015-10-21 18:51 ` Al Viro
2015-10-21 20:33 ` Casper.Dik
2015-10-22 4:21 ` Al Viro
2015-10-22 10:55 ` Alan Burlison [this message]
2015-10-22 18:16 ` Al Viro
2015-10-22 20:15 ` Alan Burlison
2015-11-02 10:03 ` David Laight
2015-11-02 10:29 ` Al Viro
2015-10-21 22:28 ` Alan Burlison
2015-10-22 1:29 ` David Miller
2015-10-22 4:17 ` Alan Burlison
2015-10-22 4:44 ` Al Viro
2015-10-22 6:03 ` Al Viro
2015-10-22 6:34 ` Casper.Dik
2015-10-22 17:21 ` Al Viro
2015-10-22 18:24 ` Casper.Dik
2015-10-22 19:07 ` Al Viro
2015-10-22 19:51 ` Casper.Dik
2015-10-22 21:57 ` Al Viro
2015-10-23 9:52 ` Casper.Dik
2015-10-23 13:02 ` Eric Dumazet
2015-10-23 13:20 ` Casper.Dik
2015-10-23 13:48 ` Eric Dumazet
2015-10-23 14:13 ` Eric Dumazet
2015-10-23 13:35 ` Alan Burlison
2015-10-23 14:21 ` Eric Dumazet
2015-10-23 15:46 ` Alan Burlison
2015-10-23 16:00 ` Eric Dumazet
2015-10-23 16:07 ` Alan Burlison
2015-10-23 16:19 ` Eric Dumazet
2015-10-23 16:40 ` Alan Burlison
2015-10-23 17:47 ` Eric Dumazet
2015-10-23 17:59 ` [PATCH net-next] af_unix: do not report POLLOUT on listeners Eric Dumazet
2015-10-25 13:45 ` David Miller
2015-10-24 2:30 ` [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) Al Viro
2015-10-27 9:08 ` Casper.Dik
2015-10-27 10:52 ` Alan Burlison
2015-10-27 12:01 ` Eric Dumazet
2015-10-27 12:27 ` Alan Burlison
2015-10-27 12:44 ` Eric Dumazet
2015-10-27 13:42 ` David Miller
2015-10-27 13:37 ` Alan Burlison
2015-10-27 13:59 ` David Miller
2015-10-27 14:13 ` Alan Burlison
2015-10-27 14:39 ` David Miller
2015-10-27 14:39 ` Alan Burlison
2015-10-27 15:04 ` David Miller
2015-10-27 15:53 ` Alan Burlison
2015-10-27 23:17 ` Al Viro
2015-10-28 0:13 ` Eric Dumazet
2015-10-28 12:35 ` Al Viro
2015-10-28 13:24 ` Eric Dumazet
2015-10-28 14:47 ` Eric Dumazet
2015-10-28 21:13 ` Al Viro
2015-10-28 21:44 ` Eric Dumazet
2015-10-28 22:33 ` Al Viro
2015-10-28 23:08 ` Eric Dumazet
2015-10-29 0:15 ` Al Viro
2015-10-29 3:29 ` Eric Dumazet
2015-10-29 4:16 ` Al Viro
2015-10-29 12:35 ` Eric Dumazet
2015-10-29 13:48 ` Eric Dumazet
2015-10-30 17:18 ` Linus Torvalds
2015-10-30 21:02 ` Al Viro
2015-10-30 21:23 ` Linus Torvalds
2015-10-30 21:50 ` Linus Torvalds
2015-10-30 22:33 ` Al Viro
2015-10-30 23:52 ` Linus Torvalds
2015-10-31 0:09 ` Al Viro
2015-10-31 15:59 ` Eric Dumazet
2015-10-31 19:34 ` Al Viro
2015-10-31 19:54 ` Linus Torvalds
2015-10-31 20:29 ` Al Viro
2015-11-02 0:24 ` Al Viro
2015-11-02 0:59 ` Linus Torvalds
2015-11-02 2:14 ` Eric Dumazet
2015-11-02 6:22 ` Al Viro
2015-10-31 20:45 ` Eric Dumazet
2015-10-31 21:23 ` Linus Torvalds
2015-10-31 21:51 ` Al Viro
2015-10-31 22:34 ` Eric Dumazet
2015-10-31 1:07 ` Eric Dumazet
2015-10-28 16:04 ` Alan Burlison
2015-10-29 14:58 ` David Holland
2015-10-29 15:18 ` Alan Burlison
2015-10-29 16:01 ` David Holland
2015-10-29 16:15 ` Alan Burlison
2015-10-29 17:07 ` Al Viro
2015-10-29 17:12 ` Alan Burlison
2015-10-30 1:54 ` David Miller
2015-10-30 1:55 ` David Miller
2015-10-30 5:44 ` David Holland
2015-10-30 17:43 ` David Laight
2015-10-30 21:09 ` Al Viro
2015-11-04 15:54 ` David Laight
2015-11-04 16:27 ` Al Viro
2015-11-06 15:07 ` David Laight
2015-11-06 19:31 ` Al Viro
2015-10-22 6:51 ` Casper.Dik
2015-10-22 11:18 ` Alan Burlison
2015-10-22 11:15 ` Alan Burlison
2015-10-22 6:15 ` Casper.Dik
2015-10-22 11:30 ` Eric Dumazet
2015-10-22 11:58 ` Alan Burlison
2015-10-22 12:10 ` Eric Dumazet
2015-10-22 13:12 ` David Miller
2015-10-22 13:14 ` Alan Burlison
2015-10-22 17:05 ` Al Viro
2015-10-22 17:39 ` Alan Burlison
2015-10-22 18:56 ` Al Viro
2015-10-22 19:50 ` Casper.Dik
2015-10-23 17:09 ` Al Viro
2015-10-23 18:30 ` Fw: " David Holland
2015-10-23 19:51 ` Al Viro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5628C0AE.2020508@oracle.com \
--to=alan.burlison@oracle.com \
--cc=Casper.Dik@oracle.com \
--cc=dholland-tech@netbsd.org \
--cc=eric.dumazet@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=stephen@networkplumber.org \
--cc=viro@ZenIV.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).