linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jamie Lokier <jamie@shareable.org>
To: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>, Nick Piggin <npiggin@suse.de>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] direct I/O fallback sync simplification
Date: Tue, 29 Sep 2009 22:30:02 +0100	[thread overview]
Message-ID: <20090929213001.GA5896@shareable.org> (raw)
In-Reply-To: <20090926150842.GB11623@lst.de>

Christoph Hellwig wrote:
> On Wed, Sep 23, 2009 at 03:04:30PM +0100, Jamie Lokier wrote:
> > Christoph Hellwig wrote:
> > > In the case of direct I/O falling back to buffered I/O we sync data
> > > twice currently: once at the end of generic_file_buffered_write using
> > > filemap_write_and_wait_range and once a little later in
> > > __generic_file_aio_write using do_sync_mapping_range with all flags set.
> > > 
> > > The wait before write of the do_sync_mapping_range call does not make
> > > any sense, so just keep the filemap_write_and_wait_range call and move
> > > it to the right spot.
> > 
> > Are you sure this is an expectation of O_DIRECT?
> > 
> > A few notes from the net, including some documentation from IBM,
> > advise using O_DIRECT|O_DSYNC if you need sync when direct I/O falls
> > back to buffered on some other OSes.
> 
> Not sure if we're on the same page here.  Before the patch we do the
> following in case of a fallback buffered write when opened with
> O_DIRECT:
> 
>  - a filemap_write_and_wait_range in generic_file_buffered_write for
>    (with a too long length, btw)
>  - a do_sync_mapping_range(wait,write,wait) from
>    __generic_file_aio_write
> 
> which expands to
> 
>  (1) __filemap_fdatawrite_range(SYNC_ALL)
>  (2) wait_on_page_writeback_range
>  (3) wait_on_page_writeback_range
>  (4) __filemap_fdatawrite_range(SYNC_ALL)
>  (5) wait_on_page_writeback_range
> 
> which clearly doesn't make any sense.  The patch reduces it to
>  
>  (1) __filemap_fdatawrite_range(SYNC_ALL)
>  (2) wait_on_page_writeback_range
> 
> which does exactly the same, that is assuring the _data_ is on disk.
> That's still not O_DSNC or O_SYNC semantics, but that's an entirely
> different discussion.

I agree that the patch is a great improvement, and it's obviously good
regardless of further discussions.

What I'm suggesting is that there is no need to commit the data to the
disk, and sometimes it's an unwanted pessimisation.  So those calls
may be removed entirely.

I'll describe the O_DIRECT, non-O_DSYNC case only.  (O_DIRECT|O_DSYNC
should of course do fdatasync_range properly after every write).

Apps which need the data to reach the disk with any assurance that
they'll be able to read it after a crash _must_ use fdatasync (or
O_DSYNC) to get that assurance.

Your explanation makes that even clearer: Historically what Linux has
done is not O_DSYNC, and this is not intended to change.  Because the
fallback happens when extending a file or filling a hole, it's even
more likely that the metadata needed to retrieve the data won't be
committed when the write returns.

Regarding performance, when the fallback happens, since good apps
should be using fdatasync (or O_DSYNC) already, and bad apps already
don't have any _useful_ guarantee, it's preferable that the buffered
writes have normal buffering behaviour and can be sensibly reordered
and batched like normal writes.  This is particularly relevant to the
common case of extending a file, and when writing to a filesystem
which doesn't support O_DIRECT at all.

In a nutshell, I'm saying the only useful behaviours for direct I/O
fallback are "like normal writes" or "like O_DSYNC" - that "sort of
kind of in between" doesn't help anything.  It's a historical,
unnecessary relic which should go to the same recycling bin as O_SYNC
meaning O_DSYNC on Linux (and probably the change should go alongside
your sensible fixes to O_SYNC/O_DSYNC).

But I'll readily agree your patch is a trivial improvement that
doesn't change anything, if you want to get it in before a bigger
decision is made. :-)

-- Jamie

  reply	other threads:[~2009-09-29 21:30 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-23 13:07 [PATCH] direct I/O fallback sync simplification Christoph Hellwig
2009-09-23 14:04 ` Jamie Lokier
2009-09-26 15:08   ` Christoph Hellwig
2009-09-29 21:30     ` Jamie Lokier [this message]
2009-09-30 12:05       ` Christoph Hellwig
2009-09-30 18:13         ` Jamie Lokier
2009-09-26 19:37 ` Nick Piggin
2009-09-29 13:08 ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090929213001.GA5896@shareable.org \
    --to=jamie@shareable.org \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).