[PATCH] direct I/O fallback sync simplification

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] direct I/O fallback sync simplification
@ 2009-09-23 13:07 Christoph Hellwig
  2009-09-23 14:04 ` Jamie Lokier
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Christoph Hellwig @ 2009-09-23 13:07 UTC (permalink / raw)
  To: Al Viro, Nick Piggin, Jan Kara; +Cc: linux-fsdevel

In the case of direct I/O falling back to buffered I/O we sync data
twice currently: once at the end of generic_file_buffered_write using
filemap_write_and_wait_range and once a little later in
__generic_file_aio_write using do_sync_mapping_range with all flags set.

The wait before write of the do_sync_mapping_range call does not make
any sense, so just keep the filemap_write_and_wait_range call and move
it to the right spot.


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: vfs-2.6.git/mm/filemap.c
===================================================================
--- vfs-2.6.git.orig/mm/filemap.c	2009-09-22 14:20:59.917761567 -0300
+++ vfs-2.6.git/mm/filemap.c	2009-09-22 14:28:01.833832530 -0300
@@ -2265,15 +2265,6 @@ generic_file_buffered_write(struct kiocb
 		*ppos = pos + status;
   	}
 	
-	/*
-	 * If we get here for O_DIRECT writes then we must have fallen through
-	 * to buffered writes (block instantiation inside i_size).  So we sync
-	 * the file data here, to try to honour O_DIRECT expectations.
-	 */
-	if (unlikely(file->f_flags & O_DIRECT) && written)
-		status = filemap_write_and_wait_range(mapping,
-					pos, pos + written - 1);
-
 	return written ? written : status;
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
@@ -2372,10 +2363,7 @@ ssize_t __generic_file_aio_write(struct 
 		 * semantics.
 		 */
 		endbyte = pos + written_buffered - written - 1;
-		err = do_sync_mapping_range(file->f_mapping, pos, endbyte,
-					    SYNC_FILE_RANGE_WAIT_BEFORE|
-					    SYNC_FILE_RANGE_WRITE|
-					    SYNC_FILE_RANGE_WAIT_AFTER);
+		err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
 		if (err == 0) {
 			written = written_buffered;
 			invalidate_mapping_pages(mapping,

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] direct I/O fallback sync simplification
  2009-09-23 13:07 [PATCH] direct I/O fallback sync simplification Christoph Hellwig
@ 2009-09-23 14:04 ` Jamie Lokier
  2009-09-26 15:08   ` Christoph Hellwig
  2009-09-26 19:37 ` Nick Piggin
  2009-09-29 13:08 ` Jan Kara
  2 siblings, 1 reply; 8+ messages in thread
From: Jamie Lokier @ 2009-09-23 14:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, Nick Piggin, Jan Kara, linux-fsdevel

Christoph Hellwig wrote:
> In the case of direct I/O falling back to buffered I/O we sync data
> twice currently: once at the end of generic_file_buffered_write using
> filemap_write_and_wait_range and once a little later in
> __generic_file_aio_write using do_sync_mapping_range with all flags set.
> 
> The wait before write of the do_sync_mapping_range call does not make
> any sense, so just keep the filemap_write_and_wait_range call and move
> it to the right spot.

Are you sure this is an expectation of O_DIRECT?

A few notes from the net, including some documentation from IBM,
advise using O_DIRECT|O_DSYNC if you need sync when direct I/O falls
back to buffered on some other OSes.

IBM (about AIX I believe):

    http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc/doc/genprogc/fileio.htm

    Direct I/O and Data I/O Integrity Completion

    Although direct I/O writes are done synchronously, they do not
    provide synchronized I/O data integrity completion, as defined by
    POSIX. Applications that need this feature should use O_DSYNC in
    addition to O_DIRECT. O_DSYNC guarantees that all of the data and
    enough of the metadata (for example, indirect blocks) have written
    to the stable store to be able to retrieve the data after a system
    crash. O_DIRECT only writes the data; it does not write the
    metadata.

>From an earlier thread, "O_DIRECT and barriers":

Theodore Tso wrote:
> On Fri, Aug 21, 2009 at 10:26:35AM -0400, Christoph Hellwig wrote:
> > > It turns out that applications needing integrity must use fdatasync or
> > > O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
> > > choose to use buffered writes at any time, with no signal to the
> > > application.
> >
> > The fallback was a relatively recent addition to the O_DIRECT semantics
> > for broken filesystems that can't handle holes very well.  Fortunately
> > enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
> > semantics for that already.
> 
> Um, actually, we don't.  If we did that, we would have to wait for a
> journal commit to complete before allowing the write(2) to complete,
> which would be especially painfully slow for ext3.

There's no point in a "half-sync".  Nobody expects or can usefully
depend on it.  So imho we should drop the filemap_write_and_wait_range
entirely when O_DSYNC is not set.

O_DIRECT without syncing in the buffered fallback will be a useful
performance optimisation for applications (including virtual machines)
which do sequences of writes interspersed with fdatasync calls on
sparse files, or when extending files, or to filesystems which don't
implement O_DIRECT.

Since they need fdatasync anyway, even with direct I/O to get
integrity on some hardware, that's a sensible coding pattern.

-- Jamie

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] direct I/O fallback sync simplification
  2009-09-23 14:04 ` Jamie Lokier
@ 2009-09-26 15:08   ` Christoph Hellwig
  2009-09-29 21:30     ` Jamie Lokier
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2009-09-26 15:08 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Al Viro, Nick Piggin, Jan Kara, linux-fsdevel

On Wed, Sep 23, 2009 at 03:04:30PM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> > In the case of direct I/O falling back to buffered I/O we sync data
> > twice currently: once at the end of generic_file_buffered_write using
> > filemap_write_and_wait_range and once a little later in
> > __generic_file_aio_write using do_sync_mapping_range with all flags set.
> > 
> > The wait before write of the do_sync_mapping_range call does not make
> > any sense, so just keep the filemap_write_and_wait_range call and move
> > it to the right spot.
> 
> Are you sure this is an expectation of O_DIRECT?
> 
> A few notes from the net, including some documentation from IBM,
> advise using O_DIRECT|O_DSYNC if you need sync when direct I/O falls
> back to buffered on some other OSes.

Not sure if we're on the same page here.  Before the patch we do the
following in case of a fallback buffered write when opened with
O_DIRECT:

 - a filemap_write_and_wait_range in generic_file_buffered_write for
   (with a too long length, btw)
 - a do_sync_mapping_range(wait,write,wait) from
   __generic_file_aio_write

which expands to

 (1) __filemap_fdatawrite_range(SYNC_ALL)
 (2) wait_on_page_writeback_range
 (3) wait_on_page_writeback_range
 (4) __filemap_fdatawrite_range(SYNC_ALL)
 (5) wait_on_page_writeback_range

which clearly doesn't make any sense.  The patch reduces it to
 
 (1) __filemap_fdatawrite_range(SYNC_ALL)
 (2) wait_on_page_writeback_range

which does exactly the same, that is assuring the _data_ is on disk.
That's still not O_DSNC or O_SYNC semantics, but that's an entirely
different discussion.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] direct I/O fallback sync simplification
  2009-09-23 13:07 [PATCH] direct I/O fallback sync simplification Christoph Hellwig
  2009-09-23 14:04 ` Jamie Lokier
@ 2009-09-26 19:37 ` Nick Piggin
  2009-09-29 13:08 ` Jan Kara
  2 siblings, 0 replies; 8+ messages in thread
From: Nick Piggin @ 2009-09-26 19:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, Jan Kara, linux-fsdevel

On Wed, Sep 23, 2009 at 03:07:30PM +0200, Christoph Hellwig wrote:
> In the case of direct I/O falling back to buffered I/O we sync data
> twice currently: once at the end of generic_file_buffered_write using
> filemap_write_and_wait_range and once a little later in
> __generic_file_aio_write using do_sync_mapping_range with all flags set.
> 
> The wait before write of the do_sync_mapping_range call does not make
> any sense, so just keep the filemap_write_and_wait_range call and move
> it to the right spot.

Looks much cleaner to me.

Acked-by: Nick Piggin <npiggin@suse.de>

> 
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: vfs-2.6.git/mm/filemap.c
> ===================================================================
> --- vfs-2.6.git.orig/mm/filemap.c	2009-09-22 14:20:59.917761567 -0300
> +++ vfs-2.6.git/mm/filemap.c	2009-09-22 14:28:01.833832530 -0300
> @@ -2265,15 +2265,6 @@ generic_file_buffered_write(struct kiocb
>  		*ppos = pos + status;
>    	}
>  	
> -	/*
> -	 * If we get here for O_DIRECT writes then we must have fallen through
> -	 * to buffered writes (block instantiation inside i_size).  So we sync
> -	 * the file data here, to try to honour O_DIRECT expectations.
> -	 */
> -	if (unlikely(file->f_flags & O_DIRECT) && written)
> -		status = filemap_write_and_wait_range(mapping,
> -					pos, pos + written - 1);
> -
>  	return written ? written : status;
>  }
>  EXPORT_SYMBOL(generic_file_buffered_write);
> @@ -2372,10 +2363,7 @@ ssize_t __generic_file_aio_write(struct 
>  		 * semantics.
>  		 */
>  		endbyte = pos + written_buffered - written - 1;
> -		err = do_sync_mapping_range(file->f_mapping, pos, endbyte,
> -					    SYNC_FILE_RANGE_WAIT_BEFORE|
> -					    SYNC_FILE_RANGE_WRITE|
> -					    SYNC_FILE_RANGE_WAIT_AFTER);
> +		err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
>  		if (err == 0) {
>  			written = written_buffered;
>  			invalidate_mapping_pages(mapping,

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] direct I/O fallback sync simplification
  2009-09-23 13:07 [PATCH] direct I/O fallback sync simplification Christoph Hellwig
  2009-09-23 14:04 ` Jamie Lokier
  2009-09-26 19:37 ` Nick Piggin
@ 2009-09-29 13:08 ` Jan Kara
  2 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2009-09-29 13:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, Nick Piggin, Jan Kara, linux-fsdevel

On Wed 23-09-09 15:07:30, Christoph Hellwig wrote:
> In the case of direct I/O falling back to buffered I/O we sync data
> twice currently: once at the end of generic_file_buffered_write using
> filemap_write_and_wait_range and once a little later in
> __generic_file_aio_write using do_sync_mapping_range with all flags set.
> 
> The wait before write of the do_sync_mapping_range call does not make
> any sense, so just keep the filemap_write_and_wait_range call and move
> it to the right spot.
> 
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
  Looks much better. Acked-by: Jan Kara <jack@suse.cz>

									Honza
> 
> Index: vfs-2.6.git/mm/filemap.c
> ===================================================================
> --- vfs-2.6.git.orig/mm/filemap.c	2009-09-22 14:20:59.917761567 -0300
> +++ vfs-2.6.git/mm/filemap.c	2009-09-22 14:28:01.833832530 -0300
> @@ -2265,15 +2265,6 @@ generic_file_buffered_write(struct kiocb
>  		*ppos = pos + status;
>    	}
>  	
> -	/*
> -	 * If we get here for O_DIRECT writes then we must have fallen through
> -	 * to buffered writes (block instantiation inside i_size).  So we sync
> -	 * the file data here, to try to honour O_DIRECT expectations.
> -	 */
> -	if (unlikely(file->f_flags & O_DIRECT) && written)
> -		status = filemap_write_and_wait_range(mapping,
> -					pos, pos + written - 1);
> -
>  	return written ? written : status;
>  }
>  EXPORT_SYMBOL(generic_file_buffered_write);
> @@ -2372,10 +2363,7 @@ ssize_t __generic_file_aio_write(struct 
>  		 * semantics.
>  		 */
>  		endbyte = pos + written_buffered - written - 1;
> -		err = do_sync_mapping_range(file->f_mapping, pos, endbyte,
> -					    SYNC_FILE_RANGE_WAIT_BEFORE|
> -					    SYNC_FILE_RANGE_WRITE|
> -					    SYNC_FILE_RANGE_WAIT_AFTER);
> +		err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
>  		if (err == 0) {
>  			written = written_buffered;
>  			invalidate_mapping_pages(mapping,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] direct I/O fallback sync simplification
  2009-09-26 15:08   ` Christoph Hellwig
@ 2009-09-29 21:30     ` Jamie Lokier
  2009-09-30 12:05       ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Jamie Lokier @ 2009-09-29 21:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, Nick Piggin, Jan Kara, linux-fsdevel

Christoph Hellwig wrote:
> On Wed, Sep 23, 2009 at 03:04:30PM +0100, Jamie Lokier wrote:
> > Christoph Hellwig wrote:
> > > In the case of direct I/O falling back to buffered I/O we sync data
> > > twice currently: once at the end of generic_file_buffered_write using
> > > filemap_write_and_wait_range and once a little later in
> > > __generic_file_aio_write using do_sync_mapping_range with all flags set.
> > > 
> > > The wait before write of the do_sync_mapping_range call does not make
> > > any sense, so just keep the filemap_write_and_wait_range call and move
> > > it to the right spot.
> > 
> > Are you sure this is an expectation of O_DIRECT?
> > 
> > A few notes from the net, including some documentation from IBM,
> > advise using O_DIRECT|O_DSYNC if you need sync when direct I/O falls
> > back to buffered on some other OSes.
> 
> Not sure if we're on the same page here.  Before the patch we do the
> following in case of a fallback buffered write when opened with
> O_DIRECT:
> 
>  - a filemap_write_and_wait_range in generic_file_buffered_write for
>    (with a too long length, btw)
>  - a do_sync_mapping_range(wait,write,wait) from
>    __generic_file_aio_write
> 
> which expands to
> 
>  (1) __filemap_fdatawrite_range(SYNC_ALL)
>  (2) wait_on_page_writeback_range
>  (3) wait_on_page_writeback_range
>  (4) __filemap_fdatawrite_range(SYNC_ALL)
>  (5) wait_on_page_writeback_range
> 
> which clearly doesn't make any sense.  The patch reduces it to
>  
>  (1) __filemap_fdatawrite_range(SYNC_ALL)
>  (2) wait_on_page_writeback_range
> 
> which does exactly the same, that is assuring the _data_ is on disk.
> That's still not O_DSNC or O_SYNC semantics, but that's an entirely
> different discussion.

I agree that the patch is a great improvement, and it's obviously good
regardless of further discussions.

What I'm suggesting is that there is no need to commit the data to the
disk, and sometimes it's an unwanted pessimisation.  So those calls
may be removed entirely.

I'll describe the O_DIRECT, non-O_DSYNC case only.  (O_DIRECT|O_DSYNC
should of course do fdatasync_range properly after every write).

Apps which need the data to reach the disk with any assurance that
they'll be able to read it after a crash _must_ use fdatasync (or
O_DSYNC) to get that assurance.

Your explanation makes that even clearer: Historically what Linux has
done is not O_DSYNC, and this is not intended to change.  Because the
fallback happens when extending a file or filling a hole, it's even
more likely that the metadata needed to retrieve the data won't be
committed when the write returns.

Regarding performance, when the fallback happens, since good apps
should be using fdatasync (or O_DSYNC) already, and bad apps already
don't have any _useful_ guarantee, it's preferable that the buffered
writes have normal buffering behaviour and can be sensibly reordered
and batched like normal writes.  This is particularly relevant to the
common case of extending a file, and when writing to a filesystem
which doesn't support O_DIRECT at all.

In a nutshell, I'm saying the only useful behaviours for direct I/O
fallback are "like normal writes" or "like O_DSYNC" - that "sort of
kind of in between" doesn't help anything.  It's a historical,
unnecessary relic which should go to the same recycling bin as O_SYNC
meaning O_DSYNC on Linux (and probably the change should go alongside
your sensible fixes to O_SYNC/O_DSYNC).

But I'll readily agree your patch is a trivial improvement that
doesn't change anything, if you want to get it in before a bigger
decision is made. :-)

-- Jamie

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] direct I/O fallback sync simplification
  2009-09-29 21:30     ` Jamie Lokier
@ 2009-09-30 12:05       ` Christoph Hellwig
  2009-09-30 18:13         ` Jamie Lokier
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2009-09-30 12:05 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Christoph Hellwig, Al Viro, Nick Piggin, Jan Kara, linux-fsdevel

On Tue, Sep 29, 2009 at 10:30:02PM +0100, Jamie Lokier wrote:
> What I'm suggesting is that there is no need to commit the data to the
> disk, and sometimes it's an unwanted pessimisation.  So those calls
> may be removed entirely.

I'm not going to losen up these semantics.  They might be utterly wrong,
but it's what we have given to users for a long time.  If you want to
losen it up send a patch with a good rational for use case that it
really matters, and extended version of the proof below and argue for
it to get included.  And take all the blamer later in case something
breaks anyway.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] direct I/O fallback sync simplification
  2009-09-30 12:05       ` Christoph Hellwig
@ 2009-09-30 18:13         ` Jamie Lokier
  0 siblings, 0 replies; 8+ messages in thread
From: Jamie Lokier @ 2009-09-30 18:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Al Viro, Nick Piggin, Jan Kara, linux-fsdevel

Christoph Hellwig wrote:
> On Tue, Sep 29, 2009 at 10:30:02PM +0100, Jamie Lokier wrote:
> > What I'm suggesting is that there is no need to commit the data to the
> > disk, and sometimes it's an unwanted pessimisation.  So those calls
> > may be removed entirely.
> 
> I'm not going to losen up these semantics.  They might be utterly wrong,
> but it's what we have given to users for a long time.  If you want to
> losen it up send a patch with a good rational for use case that it
> really matters, and extended version of the proof below and argue for
> it to get included.  And take all the blamer later in case something
> breaks anyway.

Tbh, I'm not sure what the semantics _are_.  If I understand right,
the data is written (without a barrier), but the metadata needed to
reach the data is not written at all by that point, and (for the
filesystems we care about) this branch is used only for filling holes
and extending files - precisely the cases where lack of metadata
is most likely to occur.

So the written data is simply unreachable if a crash occurs shortly
after a write.  Later it's fine, but normal writeback would take care
of that anyway.

Do I misunderstand that?  If I got that right, I'll see about a test
to find out of the data is really inaccessible when a crash occurs
after writing.

-- Jamie

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-09-30 18:13 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-09-23 13:07 [PATCH] direct I/O fallback sync simplification Christoph Hellwig
2009-09-23 14:04 ` Jamie Lokier
2009-09-26 15:08   ` Christoph Hellwig
2009-09-29 21:30     ` Jamie Lokier
2009-09-30 12:05       ` Christoph Hellwig
2009-09-30 18:13         ` Jamie Lokier
2009-09-26 19:37 ` Nick Piggin
2009-09-29 13:08 ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).