From: Al Viro <viro@ZenIV.linux.org.uk>
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Trond Myklebust <trond.myklebust@primarydata.com>,
Christoph Hellwig <hch@infradead.org>,
Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
Miklos Szeredi <miklos@szeredi.hu>,
Oleg Drokin <oleg.drokin@intel.com>
Subject: [RFC] write(2) semantics wrt return values and current position
Date: Mon, 6 Apr 2015 17:02:31 +0100 [thread overview]
Message-ID: <20150406153641.GL889@ZenIV.linux.org.uk> (raw)
There are several questions regarding the write(2) semantics, and
I'd like to see comments on those. All of that is for regular files.
1) should we ever update the current position when write returns
an error? As it is, write(2) explicitly ignores any changes of position
when ->write() has returned an error, but some other callers of vfs_write()
are not so careful.
2) should we ever update the current position when write() returns 0?
IOW, what effect should zero-length write() on O_APPEND file have upon its
current position? POSIX seems to imply that it should do nothing, and
generally that's what happens, but e.g. ext4 *does* update position to
the EOF, whether we will write anything or not. So does FUSE when server
requests to bypass the page cache. AFAICS, lustre is the same way,
but I might be missing something; everything else definitely does not
update position in that case. IMO the common behaviour is correct and
ext4 one is a bug.
3) pwrite(2): POSIX seems to require ignoring the O_APPEND completely
for that syscall. We definitely do not. It's arguable whether this is
desired or not, but it's an existing behaviour that had been that way since
we'd got pwrite(2) in the kernel (2.1.60). Probably too late to do anything
about that.
4) at lower level, there's a nasty case when short (but non-empty)
O_DIRECT write followed by success of fallback to buffered write and a failure
of filemap_write_and_wait_range() yields a return of the amount written by
->direct_IO() *and* update of current position by that plus the amount
reported by buffered write. IOW, we shift the offset by amount different
from (positive) value we'll be returning from write(2). That's a direct
POSIX violation and I would expect the userland to be very surprised by
running into that. IMO it's a bug and we would be better off by shifting
position by the amount we'll be returning.
5) somewhat related: nfs_direct_IO() ends up calling
nfs_file_direct_write(), which calls generic_write_checks();
it's triggered by swap-over-NFS (normal O_DIRECT writes go directly to
nfs_file_direct_write()), and it ends up being subject to rlimit of
caller. Which might be anyone who calls alloc_pages(), AFAICS. Almost
certainly a bug.
6) XFS seems to have fun bugs in O_DIRECT handling. Consider
the following scenario:
* O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
* we check alignment and make decision whether to do
xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
not). Suppose it takes that shared.
* we call xfs_file_aio_write_checks(), which, for starters, might
modify position (on O_APPEND) and size (on rlimit). Which renders the
alignment checks useless, of course, but what's worse, it proceeds to
calling xfs_break_layouts(), which might drop and retake XFS part of what's
taken by xfs_rw_iolock(). Retake it exclusive, and update the iolock flag
passed to it by reference accordingly. And when we return to
xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
I might be misreading that code (it sure as hell wouldn't be
the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
to me...
My preference would be to have new_sync_write() and vfs_iter_write()
to ignore iocb.ki_pos when ->write_iter() returns negative or zero (would
take care of (1) and (2)) and have __generic_file_write_iter() to do
->ki_pos update in sync with what it'll be returning (takes care of (4)).
(3) is probably too old to fix, (5) should have generic_write_checks() done
outside of fs/nfs/direct.c. No idea on (6) and I would really like to hear
from XFS folks before doing anything to that one.
Comments?
next reply other threads:[~2015-04-06 16:02 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-06 16:02 Al Viro [this message]
2015-04-06 18:13 ` [RFC] write(2) semantics wrt return values and current position Linus Torvalds
2015-04-06 19:29 ` Al Viro
2015-04-06 19:50 ` Al Viro
2015-04-06 20:04 ` Drokin, Oleg
2015-04-06 20:09 ` Al Viro
2015-04-06 20:39 ` Drokin, Oleg
2015-04-07 15:25 ` Christoph Hellwig
2015-04-08 19:24 ` Al Viro
2015-04-08 20:57 ` Al Viro
2015-04-08 21:20 ` Al Viro
2015-04-09 4:48 ` Junxiao Bi
2015-04-09 11:23 ` Al Viro
2015-04-09 11:42 ` Al Viro
2015-04-10 14:31 ` Junxiao Bi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150406153641.GL889@ZenIV.linux.org.uk \
--to=viro@zeniv.linux.org.uk \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=oleg.drokin@intel.com \
--cc=torvalds@linux-foundation.org \
--cc=trond.myklebust@primarydata.com \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).