From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Trond Myklebust <trond.myklebust@primarydata.com>,
Christoph Hellwig <hch@infradead.org>,
Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
Miklos Szeredi <miklos@szeredi.hu>,
Oleg Drokin <oleg.drokin@intel.com>
Subject: Re: [RFC] write(2) semantics wrt return values and current position
Date: Mon, 6 Apr 2015 20:29:03 +0100
Message-ID: <20150406192903.GN889@ZenIV.linux.org.uk>
In-Reply-To: <CA+55aFwtTL2WKR_-fabmh3Cw3s+srSDC3T7Fcgz+WxNVb8P+hg@mail.gmail.com>
On Mon, Apr 06, 2015 at 11:13:14AM -0700, Linus Torvalds wrote:
> On Mon, Apr 6, 2015 at 9:02 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > 1) should we ever update the current position when write returns
> > an error? As it is, write(2) explicitly ignores any changes of position
> > when ->write() has returned an error, but some other callers of vfs_write()
> > are not so careful.
>
> I think that question is the wrong way around.
>
> If the write has ever been even partially successful, we should never
> return an error. We should return the partial success.
Check what happens if generic_write_sync() (from generic_file_write_iter())
fails. That's the kind of thing I'm worried about. Sure, an error halfway
through => short write, no problem with that. We do handle that.
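To make the worry concrete, here is a small userspace model of that pattern. It is a sketch, not kernel code: model_file, model_write_iter() and model_write() are made-up names, and the assumption is that the sync error simply replaces the byte count once the data has already been written. The write(2)-level caller then ignores the position change, even though the file has grown.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a file: its size and its current position. */
struct model_file {
	off_t size;	/* bytes that reached the (pretend) page cache */
	off_t pos;	/* current position as seen by write(2) */
};

/*
 * Models a ->write_iter() that always writes 'len' bytes at *ki_pos and
 * then syncs.  'sync_err' is 0 or a negative errno such as -EIO; when it
 * is negative it replaces the byte count, although the data is in place.
 */
ssize_t model_write_iter(struct model_file *f, off_t *ki_pos,
			 size_t len, int sync_err)
{
	ssize_t ret = (ssize_t)len;
	off_t end = *ki_pos + (off_t)len;

	assert(sync_err <= 0);
	if (end > f->size)
		f->size = end;		/* the data did get written */
	*ki_pos = end;			/* ... and the position did move */
	if (ret > 0 && sync_err < 0)
		ret = sync_err;		/* but the caller only sees the error */
	return ret;
}

/* Models the write(2) side: position changes are dropped on error. */
ssize_t model_write(struct model_file *f, size_t len, int sync_err)
{
	off_t pos = f->pos;
	ssize_t ret = model_write_iter(f, &pos, len, sync_err);

	if (ret >= 0)
		f->pos = pos;
	return ret;
}
```

With a file at position 100, a 50-byte write whose sync fails returns the error and leaves the position at 100, while the modelled file size has moved to 150. A caller that did not drop the position would disagree with write(2) here.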
> > 2) should we ever update the current position when write() returns 0?
>
> I think the ext4 behavior is fine, although there's some noise in
> POSIX about zero-sized writes being no-ops. I think the POSIX wording
> is simply because some systems used to check for zero length before
> even doing anything. In fact, I think Linux used to do that too, but
> then we had special packetized formats that we wanted to support with
> write() too (not just sendmsg), so that got removed.
>
> I really don't think we should worry about it.
No, it's just that it would be a lot more convenient to have ki_pos
discarded (in new_sync_write() and vfs_iter_write()) when ->write_iter()
returns 0 or negative. As it is, we do rather clumsy dances in
generic_write_checks() and it would be much nicer if we could simply
pass iocb and iter to it and have it update ->ki_pos (in O_APPEND case)
and do iov_iter_truncate(). And yes, it needs massage to get iocb to
all callers; the main obstacle used to be v9fs_file_write(), but that's
got dealt with in my tree.
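Roughly what that would look like, as a userspace model rather than actual kernel code. Every model_* name below is hypothetical, and the limit handling is simplified to a single per-file byte limit. The checks take the iocb and the iter, move ki_pos to EOF in the append case and truncate the iter; the sync-write wrapper copies ki_pos back only on a positive return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

struct model_file {
	off_t size;	/* current file size */
	off_t pos;	/* current position */
	bool append;	/* stands in for O_APPEND */
	off_t limit;	/* stands in for the file size limit */
};
struct model_kiocb { struct model_file *file; off_t ki_pos; };
struct model_iter { size_t count; };

/* Hypothetical checks: may move ki_pos and shorten iter in place. */
ssize_t model_write_checks(struct model_kiocb *iocb, struct model_iter *iter)
{
	struct model_file *f = iocb->file;

	if (f->append)
		iocb->ki_pos = f->size;
	if (iter->count > 0 && iocb->ki_pos >= f->limit)
		return -EFBIG;
	if (iter->count > 0 && (off_t)iter->count > f->limit - iocb->ki_pos)
		iter->count = (size_t)(f->limit - iocb->ki_pos);
	return (ssize_t)iter->count;
}

/* Hypothetical ->write_iter(): run the checks, then "write" what is left. */
ssize_t model_write_iter(struct model_kiocb *iocb, struct model_iter *iter)
{
	ssize_t n = model_write_checks(iocb, iter);

	if (n <= 0)
		return n;	/* ki_pos may already have been moved */
	iocb->ki_pos += n;
	if (iocb->ki_pos > iocb->file->size)
		iocb->file->size = iocb->ki_pos;
	iter->count = 0;
	return n;
}

/* The wrapper discards ki_pos unless the return value is positive. */
ssize_t model_sync_write(struct model_file *f, size_t len)
{
	struct model_kiocb iocb = { f, f->pos };
	struct model_iter iter = { len };
	ssize_t ret = model_write_iter(&iocb, &iter);

	if (ret > 0)
		f->pos = iocb.ki_pos;
	return ret;
}
```

The point of the model is that model_write_checks() can scribble on ki_pos freely: a zero-length or failing write on an append-mode file leaves the file's own position untouched because the wrapper never copies the value back.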
> > 4) at lower level, there's a nasty case when short (but non-empty)
> > O_DIRECT write followed by success of fallback to buffered write and a failure
> > of filemap_write_and_wait_range() yields a return of the amount written by
> > ->direct_IO() *and* update of current position by that plus the amount
> > reported by buffered write. IOW, we shift the offset by amount different
> > from (positive) value we'll be returning from write(2). That's a direct
> > POSIX violation and I would expect the userland to be very surprised by
> > running into that. IMO it's a bug and we would be better off by shifting
> > position by the amount we'll be returning.
>
> That does sound like a bug. If we return a success value, and it's a
> normal file (ie not the FAT "translate NL into NLCR" magic, or some
> /proc seqfile etc), then I agree that the position should update by
> the value we returned from write.
ext* and friends. It's in __generic_file_write_iter().
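A toy model of the arithmetic, assuming the sequence is exactly as described in 4) above; the struct and function names are invented for illustration and this is not the kernel code. The position is advanced past both the direct and the buffered part before the flush result is known, while the return value covers only the direct part if the flush fails. The pos_follows_ret flag models the suggested fix of shifting the position by the amount returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

struct model_result {
	ssize_t ret;	/* what write(2) would return */
	off_t ki_pos;	/* where the current position ends up */
};

/*
 * direct:    bytes written by the short O_DIRECT part
 * buffered:  bytes written by the buffered fallback
 * flush_err: 0, or a negative errno (e.g. -EIO) from flushing that range
 */
struct model_result model_dio_fallback(off_t pos, size_t direct,
				       size_t buffered, int flush_err,
				       bool pos_follows_ret)
{
	struct model_result r;
	ssize_t written = (ssize_t)direct;

	assert(flush_err <= 0);
	/* Position moves past both parts before the flush result is known. */
	r.ki_pos = pos + (off_t)direct + (off_t)buffered;
	if (flush_err == 0)
		written += (ssize_t)buffered;
	/* On flush failure only the direct part is reported. */
	if (pos_follows_ret)
		r.ki_pos = pos + (off_t)written;
	r.ret = written;
	return r;
}
```

For a 4096-byte direct part, an 8192-byte buffered part and a failed flush, the unfixed variant returns 4096 but leaves the position at 12288; with pos_follows_ret set, both agree on 4096.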
> > 6) XFS seems to have fun bugs in O_DIRECT handling. Consider
> > the following scenario:
> > * O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
> > * we check alignment and make decision whether to do
> > xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
> > not). Suppose it takes that shared.
> > * we call xfs_file_aio_write_checks(), which, for starters, might
> > modify position (on O_APPEND) and size (on rlimit). Which renders the
> > alignment checks useless, of course, but what's worse, it proceeds to
> > calling xfs_break_layouts(), which might drop and retake XFS part of what's
> > taken by xfs_rw_ilock(). Retake it exclusive, and update the iolock flag
> > passed to it by reference accordingly. And when we return to
> > xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
> > exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
> > I might be misreading that code (it sure as hell wouldn't be
> > the first time when xfs_{rw_,}ilock() is involved), but it looks dubious
> > to me...
>
> I don't think aio_write() makes sense on an O_APPEND file (for the
> same reason pwrite() doesn't), but we might be stuck with it. People
> who do that are insane and probably deserve whatever crazy semantics
> they get (and if they rely on them, we shouldn't change them in the
> name of "make things sane").
>
> If the lack of proper locking causes coherence problems, that's an XFS
> bug, of course.
It doesn't have to be O_DIRECT; setrlimit(2) from another thread is enough
to screw the alignment to hell and back and mutex_unlock() of something
we hadn't done mutex_lock() to is definitely a bug (don't need O_APPEND to
trigger that either; needs pNFS involved, AFAICS). I'd really like comments
from Christoph and Dave on that one...
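For reference, the lock imbalance in the scenario quoted above can be modelled in userspace like this. It is only a sketch of the sequence as described, under the assumption that the reading of the code is right; the model_* names and the counters are invented and nothing here is the XFS implementation. The lock is taken shared (so no i_mutex), the layout-breaking step drops and retakes only the XFS part exclusively and rewrites the flag through the pointer, and the final unlock trusts the flag.

```c
#include <assert.h>

enum { MODEL_IOLOCK_SHARED = 1, MODEL_IOLOCK_EXCL = 2 };

struct model_inode {
	int i_mutex_held;	/* 1 while the modelled i_mutex is held */
	int iolock_mode;	/* 0, MODEL_IOLOCK_SHARED or MODEL_IOLOCK_EXCL */
	int bad_unlocks;	/* unlocks of an i_mutex that was never taken */
};

/* Exclusive mode also takes i_mutex; shared mode does not. */
void model_rw_ilock(struct model_inode *ip, int type)
{
	if (type == MODEL_IOLOCK_EXCL)
		ip->i_mutex_held = 1;
	ip->iolock_mode = type;
}

/* Unlock trusts 'type' to say whether i_mutex is held. */
void model_rw_iunlock(struct model_inode *ip, int type)
{
	ip->iolock_mode = 0;
	if (type == MODEL_IOLOCK_EXCL) {
		if (!ip->i_mutex_held)
			ip->bad_unlocks++;	/* unlocking what we never locked */
		ip->i_mutex_held = 0;
	}
}

/* Drops and retakes only the iolock part, exclusive, and updates *iolock. */
void model_break_layouts(struct model_inode *ip, int *iolock, int must_retry)
{
	if (!must_retry)
		return;
	ip->iolock_mode = 0;
	*iolock = MODEL_IOLOCK_EXCL;
	ip->iolock_mode = *iolock;	/* i_mutex is not touched here */
}

/* Returns the number of bogus i_mutex unlocks seen so far. */
int model_dio_write(struct model_inode *ip, int must_retry)
{
	int iolock = MODEL_IOLOCK_SHARED;

	model_rw_ilock(ip, iolock);
	model_break_layouts(ip, &iolock, must_retry);
	model_rw_iunlock(ip, iolock);
	return ip->bad_unlocks;
}
```

When the layout-breaking step does not need to retry, lock and unlock stay balanced; when it does, the unlock path sees the exclusive flag and releases an i_mutex that was never acquired.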