Re: [RFC] write(2) semantics wrt return values and current position

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Oleg Drokin <oleg.drokin@intel.com>
Subject: Re: [RFC] write(2) semantics wrt return values and current position
Date: Mon, 6 Apr 2015 20:29:03 +0100	[thread overview]
Message-ID: <20150406192903.GN889@ZenIV.linux.org.uk> (raw)
In-Reply-To: <CA+55aFwtTL2WKR_-fabmh3Cw3s+srSDC3T7Fcgz+WxNVb8P+hg@mail.gmail.com>

On Mon, Apr 06, 2015 at 11:13:14AM -0700, Linus Torvalds wrote:
> On Mon, Apr 6, 2015 at 9:02 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         1) should we ever update the current position when write returns
> > an error?  As it is, write(2) explicitly ignores any changes of position
> > when ->write() has returned an error, but some other callers of vfs_write()
> > are not so careful.
> 
> I think that question is the wrong way around.
> 
> If the write has ever been even partially successful, we should never
> return an error. We should return the partial success.

Check what happens if generic_write_sync() (from generic_file_write_iter())
fails.  That's the kind of thing I'm worried about.  Sure, an error halfway
through => short write, no problem with that.  We do handle that.

> >         2) should we ever update the current position when write() returns 0?
> 
> I think the ext4 behavior is fine, although there's some noise in
> POSIX about zero-sized writes being no-ops. I think the POSIX wording
> is simply because some systems used to check for zero length before
> even doing anything. In fact, I think Linux used to do that too, but
> then we had special packetized formats that we wanted to syupport with
> write() too (not just sendmsg), so that got removed.
> 
> I really don't think we should worry about it.

No, it's just that it would be a lot more convenient to have it ki_pos
discarded (in new_sync_write() and vfs_iter_write()) when ->write_iter()
returns 0 or negative.  As it is, we do rather clumsy dances in
generic_write_checks() and it would be much nicer if we could simply
pass iocb and iter to it and have it update ->ki_pos (in O_APPEND case)
and do iov_iter_turncate().  And yes, it needs massage to get iocb to
all callers; the main obstacle used to be v9fs_file_write(), but that's
got dealt with in my tree.

> >         4) at lower level, there's a nasty case when short (but non-empty)
> > O_DIRECT write followed by success of fallback to buffered write and a failure
> > of filemap_write_and_wait_range() yields a return of the amount written by
> > ->direct_IO() *and* update of current position by that plus the amount
> > reported by buffered write.  IOW, we shift the offset by amount different
> > from (positive) value we'll be returning from write(2).  That's a direct
> > POSIX violation and I would expect the userland to be very surprised by
> > running into that.  IMO it's a bug and we would be better off by shifting
> > position by the amount we'll be returning.
> 
> That does sound like a bug. If we return a success value, and it's a
> normal file (ie not the FAT "translate NL into NLCR" magic, or some
> /proc seqfile etc), then I agree that the position should update by
> the value we returned from write.

ext* and friends.  It's in __generic_file_write_iter().

> >         6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
> > the following scenario:
> >         * O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
> >         * we check alignment and make decision whether to do
> > xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
> > not).  Suppose it takes that shared.
> >         * we call xfs_file_aio_write_checks(), which, for starters, might
> > modify position (on O_APPEND) and size (on rlimit).  Which renders the
> > alignment checks useless, of course, but what's worse, it proceeds to
> > calling xfs_break_layouts(), which might drop and retake XFS part of what's
> > taken by xfs_rw_iolock().  Retake it exclusive, and update the iolock flag
> > passed to it by reference accordingly.  And when we return to
> > xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
> > exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
> >         I might be misreading that code (it sure as hell wouldn't be
> > the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
> > to me...
> 
> I don't think aio_write() makes sense on an O_APPEND file (for the
> same reason pwrite() doesn't), but we might be stuck with it.  People
> who do that are insane and probably deserve whatever crazy semantics
> they get (and if they rely on them, we shouldn't change them in the
> name of "make things sane").
> 
> If the lack of proper locking causes coherence problems, that's a XFS
> bug, of course.

It doesn't have to be O_DIRECT; setrlimit(2) from another thread is enough
to screw the alignment to hell and back and mutex_unlock() of something
we hadn't done mutex_lock() to is definitely a bug (don't need O_APPEND to
trigger that either; needs pNFS involved, AFAICS).  I'd really like comments
from Christoph and Dave on that on...

next prev parent reply	other threads:[~2015-04-06 19:29 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-06 16:02 [RFC] write(2) semantics wrt return values and current position Al Viro
2015-04-06 18:13 ` Linus Torvalds
2015-04-06 19:29   ` Al Viro [this message]
2015-04-06 19:50     ` Al Viro
2015-04-06 20:04 ` Drokin, Oleg
2015-04-06 20:09   ` Al Viro
2015-04-06 20:39     ` Drokin, Oleg
2015-04-07 15:25 ` Christoph Hellwig
2015-04-08 19:24 ` Al Viro
2015-04-08 20:57   ` Al Viro
2015-04-08 21:20     ` Al Viro
2015-04-09  4:48       ` Junxiao Bi
2015-04-09 11:23         ` Al Viro
2015-04-09 11:42           ` Al Viro
2015-04-10 14:31             ` Junxiao Bi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150406192903.GN889@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=oleg.drokin@intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=trond.myklebust@primarydata.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.