linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Oleg Drokin <oleg.drokin@intel.com>
Subject: Re: [RFC] write(2) semantics wrt return values and current position
Date: Mon, 6 Apr 2015 20:29:03 +0100	[thread overview]
Message-ID: <20150406192903.GN889@ZenIV.linux.org.uk> (raw)
In-Reply-To: <CA+55aFwtTL2WKR_-fabmh3Cw3s+srSDC3T7Fcgz+WxNVb8P+hg@mail.gmail.com>

On Mon, Apr 06, 2015 at 11:13:14AM -0700, Linus Torvalds wrote:
> On Mon, Apr 6, 2015 at 9:02 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         1) should we ever update the current position when write returns
> > an error?  As it is, write(2) explicitly ignores any changes of position
> > when ->write() has returned an error, but some other callers of vfs_write()
> > are not so careful.
> 
> I think that question is the wrong way around.
> 
> If the write has ever been even partially successful, we should never
> return an error. We should return the partial success.

Check what happens if generic_write_sync() (from generic_file_write_iter())
fails.  That's the kind of thing I'm worried about.  Sure, an error halfway
through => short write, no problem with that.  We do handle that.

> >         2) should we ever update the current position when write() returns 0?
> 
> I think the ext4 behavior is fine, although there's some noise in
> POSIX about zero-sized writes being no-ops. I think the POSIX wording
> is simply because some systems used to check for zero length before
> even doing anything. In fact, I think Linux used to do that too, but
> then we had special packetized formats that we wanted to syupport with
> write() too (not just sendmsg), so that got removed.
> 
> I really don't think we should worry about it.

No, it's just that it would be a lot more convenient to have it ki_pos
discarded (in new_sync_write() and vfs_iter_write()) when ->write_iter()
returns 0 or negative.  As it is, we do rather clumsy dances in
generic_write_checks() and it would be much nicer if we could simply
pass iocb and iter to it and have it update ->ki_pos (in O_APPEND case)
and do iov_iter_turncate().  And yes, it needs massage to get iocb to
all callers; the main obstacle used to be v9fs_file_write(), but that's
got dealt with in my tree.

> >         4) at lower level, there's a nasty case when short (but non-empty)
> > O_DIRECT write followed by success of fallback to buffered write and a failure
> > of filemap_write_and_wait_range() yields a return of the amount written by
> > ->direct_IO() *and* update of current position by that plus the amount
> > reported by buffered write.  IOW, we shift the offset by amount different
> > from (positive) value we'll be returning from write(2).  That's a direct
> > POSIX violation and I would expect the userland to be very surprised by
> > running into that.  IMO it's a bug and we would be better off by shifting
> > position by the amount we'll be returning.
> 
> That does sound like a bug. If we return a success value, and it's a
> normal file (ie not the FAT "translate NL into NLCR" magic, or some
> /proc seqfile etc), then I agree that the position should update by
> the value we returned from write.

ext* and friends.  It's in __generic_file_write_iter().

> >         6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
> > the following scenario:
> >         * O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
> >         * we check alignment and make decision whether to do
> > xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
> > not).  Suppose it takes that shared.
> >         * we call xfs_file_aio_write_checks(), which, for starters, might
> > modify position (on O_APPEND) and size (on rlimit).  Which renders the
> > alignment checks useless, of course, but what's worse, it proceeds to
> > calling xfs_break_layouts(), which might drop and retake XFS part of what's
> > taken by xfs_rw_iolock().  Retake it exclusive, and update the iolock flag
> > passed to it by reference accordingly.  And when we return to
> > xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
> > exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
> >         I might be misreading that code (it sure as hell wouldn't be
> > the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
> > to me...
> 
> I don't think aio_write() makes sense on an O_APPEND file (for the
> same reason pwrite() doesn't), but we might be stuck with it.  People
> who do that are insane and probably deserve whatever crazy semantics
> they get (and if they rely on them, we shouldn't change them in the
> name of "make things sane").
> 
> If the lack of proper locking causes coherence problems, that's a XFS
> bug, of course.

It doesn't have to be O_DIRECT; setrlimit(2) from another thread is enough
to screw the alignment to hell and back and mutex_unlock() of something
we hadn't done mutex_lock() to is definitely a bug (don't need O_APPEND to
trigger that either; needs pNFS involved, AFAICS).  I'd really like comments
from Christoph and Dave on that on...

  reply	other threads:[~2015-04-06 19:29 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-06 16:02 [RFC] write(2) semantics wrt return values and current position Al Viro
2015-04-06 18:13 ` Linus Torvalds
2015-04-06 19:29   ` Al Viro [this message]
2015-04-06 19:50     ` Al Viro
2015-04-06 20:04 ` Drokin, Oleg
2015-04-06 20:09   ` Al Viro
2015-04-06 20:39     ` Drokin, Oleg
2015-04-07 15:25 ` Christoph Hellwig
2015-04-08 19:24 ` Al Viro
2015-04-08 20:57   ` Al Viro
2015-04-08 21:20     ` Al Viro
2015-04-09  4:48       ` Junxiao Bi
2015-04-09 11:23         ` Al Viro
2015-04-09 11:42           ` Al Viro
2015-04-10 14:31             ` Junxiao Bi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150406192903.GN889@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=oleg.drokin@intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=trond.myklebust@primarydata.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).