* O_NONBLOCK is broken
@ 2007-08-14 11:41 Denys Vlasenko
2007-08-14 12:33 ` Alan Cox
2007-08-14 21:59 ` O_NONBLOCK is broken David Schwartz
0 siblings, 2 replies; 6+ messages in thread
From: Denys Vlasenko @ 2007-08-14 11:41 UTC (permalink / raw)
To: linux-kernel
Hi folks,
I apologize for a provocative subject. O_NONBLOCK in Linux is not broken.
It is working as designed. But the design itself is suffering from a flaw.
Suppose I want to write block of data to stdout, without blocking.
I will do the classic thing:
int fl = fcntl(1, F_GETFL, 0);
fcntl(1, F_SETFL, fl | O_NONBLOCK);
r = write(1, buf, size);
fcntl(1, F_SETFL, fl); /* restore ASAP! */
The problem is, O_NONBLOCK flag is not attached to file *descriptor*,
but to a "file description" mentioned in fcntl manpage:
"Each open file description has certain associated status flags, initialized
by open(2) and possibly modified by fcntl(2). Duplicated file descriptors
(made with dup(), fcntl(F_DUPFD), fork(), etc.) refer to the same open file
description, and thus share the same file status flags."
We don't know whether our stdout descriptor #1 is shared with anyone or not,
and if we were started from shell, it typically is. That's why we try to
restore flags ASAP.
But "ASAP" isn't soon enough. Between setting and clearing O_NONBLOCK,
other process which share fd #1 with us may well be affected
by file suddenly becoming O_NONBLOCK under its feet.
Worse, other process can do the same
fcntl(1, F_SETFL, fl | O_NONBLOCK);
...
fcntl(1, F_SETFL, fl);
sequence, and first fcntl can return flags with O_NONBLOCK set (because of
us), and then second fcntl will set O_NONBLOCK permanently, which is not
what was intended!
Other failure mode is that process can be killed by a signal
between fcntl's, leaving file in O_NONBLOCK mode.
This isn't theoretical problem, it actually happens not-so-rarely, for
example, with pagers.
Possible solutions:
a) Introduce *per-fd* flag, so that one can use
fcntl(1, F_SETFD, fdflag | O_NONBLOCK) instead of
fcntl(1, F_SETFL, flflag | O_NONBLOCK) instead of
Currently there is only one per-fd flag - O_CLOEXEC with value of 1.
O_NONBLOCK is 0x4000.
b) Make recv(fd, buf, size, flags) and send(fd, buf, size, flags);
work with non-socket fds too, for flags==0 or flags==MSG_DONTWAIT.
(it's ok to fail with "socket op on non-socket fd" for other values
of flags)
Both things are non-standard, but portable programs can test for errors
and fall back to "standard" UNIX way of doing it.
P.S. Hmm, it seems fcntl GETFL/SETFL interface seems to be racy:
int fl = fcntl(fd, F_GETFL, 0);
/* other process can muck with file flags here */
fcntl(fd, F_SETFL, fl | SOME_BITS);
How can I *atomically* add or remove bits from file flags?
--
vda
^ permalink raw reply [flat|nested] 6+ messages in thread* Re: O_NONBLOCK is broken 2007-08-14 11:41 O_NONBLOCK is broken Denys Vlasenko @ 2007-08-14 12:33 ` Alan Cox 2007-08-14 17:28 ` Jan Engelhardt 2007-08-19 12:50 ` [PATCH] allow send/recv(MSG_DONTWAIT) on non-sockets (was Re: O_NONBLOCK is broken) Denys Vlasenko 2007-08-14 21:59 ` O_NONBLOCK is broken David Schwartz 1 sibling, 2 replies; 6+ messages in thread From: Alan Cox @ 2007-08-14 12:33 UTC (permalink / raw) To: Denys Vlasenko; +Cc: linux-kernel > b) Make recv(fd, buf, size, flags) and send(fd, buf, size, flags); > work with non-socket fds too, for flags==0 or flags==MSG_DONTWAIT. > (it's ok to fail with "socket op on non-socket fd" for other values > of flags) I think that makes a lot of sense, and to be honest other MSG_ flags make useful sense and have meaningful semantics that might be helpful elsewhere if ever coded that way. If you want to do this the first job is going to be to sort out the way non-block is propogated to device driver read/write handlers. At the moment they all check filp->f_flags Alan ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: O_NONBLOCK is broken 2007-08-14 12:33 ` Alan Cox @ 2007-08-14 17:28 ` Jan Engelhardt 2007-08-19 12:50 ` [PATCH] allow send/recv(MSG_DONTWAIT) on non-sockets (was Re: O_NONBLOCK is broken) Denys Vlasenko 1 sibling, 0 replies; 6+ messages in thread From: Jan Engelhardt @ 2007-08-14 17:28 UTC (permalink / raw) To: Alan Cox; +Cc: Denys Vlasenko, linux-kernel On Aug 14 2007 13:33, Alan Cox wrote: > >> b) Make recv(fd, buf, size, flags) and send(fd, buf, size, flags); >> work with non-socket fds too, for flags==0 or flags==MSG_DONTWAIT. >> (it's ok to fail with "socket op on non-socket fd" for other values >> of flags) > >I think that makes a lot of sense, and to be honest other MSG_ flags make >useful sense and have meaningful semantics that might be helpful >elsewhere if ever coded that way. > >If you want to do this the first job is going to be to sort out the way >non-block is propogated to device driver read/write handlers. At the >moment they all check filp->f_flags And a side effect, kernel code (kthreads) rarely allocate a file descriptor. Jan -- ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH] allow send/recv(MSG_DONTWAIT) on non-sockets (was Re: O_NONBLOCK is broken) 2007-08-14 12:33 ` Alan Cox 2007-08-14 17:28 ` Jan Engelhardt @ 2007-08-19 12:50 ` Denys Vlasenko 1 sibling, 0 replies; 6+ messages in thread From: Denys Vlasenko @ 2007-08-19 12:50 UTC (permalink / raw) To: Alan Cox; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 1568 bytes --] On Tuesday 14 August 2007 13:33, Alan Cox wrote: > > b) Make recv(fd, buf, size, flags) and send(fd, buf, size, flags); > > work with non-socket fds too, for flags==0 or flags==MSG_DONTWAIT. > > (it's ok to fail with "socket op on non-socket fd" for other values > > of flags) > > I think that makes a lot of sense, and to be honest other MSG_ flags make > useful sense and have meaningful semantics that might be helpful > elsewhere if ever coded that way. Yes, that's my feeling too. > If you want to do this the first job is going to be to sort out the way > non-block is propogated to device driver read/write handlers. At the > moment they all check filp->f_flags ...which happens in ~250 files. I'd rather not touch that much of code, if possible. Attached patch detects send/recv(fd, buf, size, MSG_DONTWAIT) on non-sockets and turns them into non-blocking write/read. Since filp->f_flags appear to be read and modified without any locking, I cannot modify it without potentially affecting other processes accessing the same file through shared struct file. Therefore I simply make a temporary copy of struct file, set O_NONBLOCK in it and pass it to vfs_read/write. Is this heresy? ;) I see only one spinlock in struct file: #ifdef CONFIG_EPOLL spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ Do I need to take it? Also attached is ndelaytest.c which can be used to test that send(MSG_DONTWAIT) indeed is failing with EAGAIN if write would block and that other processes never see O_NONBLOCK set. Comments? -- vda [-- Attachment #2: nonblock_linux-2.6.22-rc6.patch --] [-- Type: text/x-diff, Size: 2902 bytes --] --- linux-2.6.22-rc6.src/fs/read_write.c Fri Jun 15 19:30:05 2007 +++ linux-2.6.22-rc6_ndelay/fs/read_write.c Sun Aug 19 10:43:24 2007 @@ -15,6 +15,7 @@ #include <linux/module.h> #include <linux/syscalls.h> #include <linux/pagemap.h> +#include <linux/socket.h> #include "read_write.h" #include <asm/uaccess.h> @@ -351,6 +352,36 @@ static inline void file_pos_write(struct file *file, loff_t pos) { file->f_pos = pos; +} + +/* Helper for send/recv on non-sockets */ +ssize_t rw_with_flags(struct file *file, int fput_needed, void __user *buf, size_t count, unsigned flags) +{ + int err; + loff_t pos; + struct file *file_copy; + + file_copy = file; + if (flags & MSG_DONTWAIT) { + /* We make copy even if O_NONBLOCK is already set. */ + /* We don't want it to change under our feet. */ + file_copy = kmalloc(sizeof(*file_copy), GFP_KERNEL); + memcpy(file_copy, file, sizeof(*file_copy)); + file_copy->f_flags |= O_NONBLOCK; + } + + pos = file_pos_read(file); + if (flags & MSG_OOB) /* MSG_OOB is reused to mean 'write' */ + err = vfs_write(file_copy, buf, count, &pos); + else + err = vfs_read(file_copy, buf, count, &pos); + file_pos_write(file, pos); + + if (flags & MSG_DONTWAIT) { + kfree(file_copy); + } + fput_light(file, fput_needed); + return err; } asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count) --- linux-2.6.22-rc6.src/include/linux/fs.h Wed Jun 27 21:24:18 2007 +++ linux-2.6.22-rc6_ndelay/include/linux/fs.h Sun Aug 19 10:32:20 2007 @@ -1154,6 +1154,9 @@ extern ssize_t vfs_writev(struct file *, const struct iovec __user *, unsigned long, loff_t *); +extern ssize_t rw_with_flags(struct file *, int, void __user *, size_t, + unsigned); + /* * NOTE: write_inode, delete_inode, clear_inode, put_inode can be called * without the big kernel lock held in all filesystems. --- linux-2.6.22-rc6.src/net/socket.c Fri Jun 15 19:30:08 2007 +++ linux-2.6.22-rc6_ndelay/net/socket.c Sun Aug 19 11:34:07 2007 @@ -1585,8 +1585,17 @@ goto out; sock = sock_from_file(sock_file, &err); - if (!sock) - goto out_put; + if (!sock) { + if (addr) + goto out_put; + if (flags & ~MSG_DONTWAIT) + goto out_put; + /* it's not a socket, but we support a special case: + * send(fd, buf, count, MSG_DONTWAIT) + * (MSG_OOB is reused to mean 'write') */ + return rw_with_flags(sock_file, fput_needed, buff, len, flags | MSG_OOB); + } + iov.iov_base = buff; iov.iov_len = len; msg.msg_name = NULL; @@ -1646,8 +1655,15 @@ goto out; sock = sock_from_file(sock_file, &err); - if (!sock) - goto out_put; + if (!sock) { + if (addr) + goto out_put; + if (flags & ~MSG_DONTWAIT) + goto out_put; + /* it's not a socket, but we support a special case: + * recv(fd, ubuf, size, MSG_DONTWAIT) */ + return rw_with_flags(sock_file, fput_needed, ubuf, size, flags); + } msg.msg_control = NULL; msg.msg_controllen = 0; [-- Attachment #3: ndelaytest.c --] [-- Type: text/x-csrc, Size: 1463 bytes --] #include <sys/types.h> #include <sys/socket.h> #include <errno.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> #include <time.h> #include <signal.h> #define SECONDS 10 #define STR "." //#define STR "123456789 123456789 123456789 123456789 " /* To see send() resulting in EAGAIN: * strace -ff -o log ndelaytest | while sleep 11; do break; done * log.$PID: * send(1, "123456789 123456789 123456789 12"..., 40, MSG_DONTWAIT) * = -1 EAGAIN (Resource temporarily unavailable) */ int main() { pid_t pid; time_t t; int fl; puts("starting"); t = time(0); pid = fork(); if (pid == 0) { /* child */ while ((time(0) - t) < SECONDS-1) { #if 0 /* Uncomment this part and simply run the executable * to see race detection code in action */ #define OP "write" fcntl(1, F_SETFL, fcntl(1, F_GETFL) | O_NONBLOCK); fl = write(1, STR, sizeof(STR) - 1); fcntl(1, F_SETFL, fcntl(1, F_GETFL) & ~O_NONBLOCK); #else /* This part tests whether send(MSG_DONTWAIT) * is racy or not */ #define OP "send" fl = send(1, STR, sizeof(STR) - 1, MSG_DONTWAIT); #endif if (fl < 0) { perror(OP); kill(getppid(), SIGKILL); return 1; } } return 0; } while ((time(0) - t) < SECONDS) { fl = fcntl(1, F_GETFL); if (fl & O_NONBLOCK) { fprintf(stderr, "NONBLOCK:1\n"); kill(pid, SIGKILL); fcntl(1, F_SETFL, fl & ~O_NONBLOCK); return 1; } } fprintf(stderr, "NONBLOCK:0\n"); return 0; } ^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: O_NONBLOCK is broken 2007-08-14 11:41 O_NONBLOCK is broken Denys Vlasenko 2007-08-14 12:33 ` Alan Cox @ 2007-08-14 21:59 ` David Schwartz 2007-08-19 12:50 ` Denys Vlasenko 1 sibling, 1 reply; 6+ messages in thread From: David Schwartz @ 2007-08-14 21:59 UTC (permalink / raw) To: linux-kernel > The problem is, O_NONBLOCK flag is not attached to file *descriptor*, > but to a "file description" mentioned in fcntl manpage: [snip] > We don't know whether our stdout descriptor #1 is shared with > anyone or not, > and if we were started from shell, it typically is. That's why we try to > restore flags ASAP. > But "ASAP" isn't soon enough. Between setting and clearing O_NONBLOCK, > other process which share fd #1 with us may well be affected > by file suddenly becoming O_NONBLOCK under its feet. > > Worse, other process can do the same > fcntl(1, F_SETFL, fl | O_NONBLOCK); > ... > fcntl(1, F_SETFL, fl); > sequence, and first fcntl can return flags with O_NONBLOCK set > (because of > us), and then second fcntl will set O_NONBLOCK permanently, which is not > what was intended! [snip] > P.S. Hmm, it seems fcntl GETFL/SETFL interface seems to be racy: > > int fl = fcntl(fd, F_GETFL, 0); > /* other process can muck with file flags here */ > fcntl(fd, F_SETFL, fl | SOME_BITS); > > How can I *atomically* add or remove bits from file flags? Simply put, you cannot change file flags on a shared descriptor. It is a bug to do so, a bug that is sadly present in many common programs. I like the idea of being able to specify blocking or non-blocking behavior in the operation. It is not too uncommon to want to perform blocking operations sometimes and non-blocking operations other times for the same object and having to keep changing modes, even if it wasn't racy, is a pain. However, there's a much more fundamental problem here. Processes need a good way to get exclusive use of their stdin, stdout, and stderr streams and there is no good way. Perhaps an "exclusive lock" that blocked all other process' attempts to use the terminal until it was released would be a good thing. DS ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: O_NONBLOCK is broken 2007-08-14 21:59 ` O_NONBLOCK is broken David Schwartz @ 2007-08-19 12:50 ` Denys Vlasenko 0 siblings, 0 replies; 6+ messages in thread From: Denys Vlasenko @ 2007-08-19 12:50 UTC (permalink / raw) To: davids; +Cc: linux-kernel On Tuesday 14 August 2007 22:59, David Schwartz wrote: > > The problem is, O_NONBLOCK flag is not attached to file *descriptor*, > > but to a "file description" mentioned in fcntl manpage: > > [snip] > > > We don't know whether our stdout descriptor #1 is shared with > > anyone or not, > > and if we were started from shell, it typically is. That's why we try to > > restore flags ASAP. > > > > But "ASAP" isn't soon enough. Between setting and clearing O_NONBLOCK, > > other process which share fd #1 with us may well be affected > > by file suddenly becoming O_NONBLOCK under its feet. > > > > Worse, other process can do the same > > fcntl(1, F_SETFL, fl | O_NONBLOCK); > > ... > > fcntl(1, F_SETFL, fl); > > sequence, and first fcntl can return flags with O_NONBLOCK set > > (because of > > us), and then second fcntl will set O_NONBLOCK permanently, which is not > > what was intended! > > [snip] > > > P.S. Hmm, it seems fcntl GETFL/SETFL interface seems to be racy: > > > > int fl = fcntl(fd, F_GETFL, 0); > > /* other process can muck with file flags here */ > > fcntl(fd, F_SETFL, fl | SOME_BITS); > > > > How can I *atomically* add or remove bits from file flags? > > Simply put, you cannot change file flags on a shared descriptor. It is a > bug to do so, a bug that is sadly present in many common programs. It means that the design is flawed and if done right, file flags which are changeable by fcntl (O_NONBLOCK, O_APPEND, O_ASYNC, O_DIRECT, O_NOATIME) shouldn't be shared, they are useless as shared. IOW, they should be file _descriptor_ flags. It's unlikely that kernel tribe leaders will agree to violate POSIX and make fcntl(F_SETFL) be per-fd thing. There can be users of this (mis)feature. Making fcntl(F_SETFD) accept those same flags and making it override F_SETFL flags may fare slightly better, but may require propagation of these flags into *a lot* of kernel codepaths. > I like the idea of being able to specify blocking or non-blocking behavior > in the operation. It is not too uncommon to want to perform blocking > operations sometimes and non-blocking operations other times for the same > object and having to keep changing modes, even if it wasn't racy, is a > pain. I am submitting a patch witch allows this. Let's see what people will say. Yet another way to fix this problem is to add a new fcntl operation "duplicate an open file": fd = fcntl(fd, F_DUPFL, min_fd); which is analogous to F_DUPFD, but produces _unshared_ file descriptor. You can F_SETFL it as you want, no one else will be affected. > However, there's a much more fundamental problem here. Processes need a > good way to get exclusive use of their stdin, stdout, and stderr streams > and there is no good way. Perhaps an "exclusive lock" that blocked all > other process' attempts to use the terminal until it was released would be > a good thing. Yep, maybe. But this is a different problem. IOW: there are cases where one doesn't want this kind of locking, but simply needs to do unblocked read/write. That's what I'm trying to solve. -- vda ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2007-08-19 12:51 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-08-14 11:41 O_NONBLOCK is broken Denys Vlasenko 2007-08-14 12:33 ` Alan Cox 2007-08-14 17:28 ` Jan Engelhardt 2007-08-19 12:50 ` [PATCH] allow send/recv(MSG_DONTWAIT) on non-sockets (was Re: O_NONBLOCK is broken) Denys Vlasenko 2007-08-14 21:59 ` O_NONBLOCK is broken David Schwartz 2007-08-19 12:50 ` Denys Vlasenko
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox