* Small O_SYNC writes are no longer NFS_DATA_SYNC
@ 2011-02-16 6:15 NeilBrown
2011-02-16 13:11 ` Jeff Layton
2011-03-17 23:53 ` Trond Myklebust
0 siblings, 2 replies; 15+ messages in thread
From: NeilBrown @ 2011-02-16 6:15 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs
Hi Trond,
I wonder if I might get your help/advice on an issue with NFS.
It seems that NFS_DATA_SYNC is hardly used at all currently. It is used for
O_DIRECT writes and for writes 'for_reclaim', and for handling some error
conditions, but that is about it.
This appears to be a regression.
Back in 2005, commit ab0a3dbedc5 in 2.6.13 says:
[PATCH] NFS: Write optimization for short files and small O_SYNC writes.
Use stable writes if we can see that we are only going to put a single
write on the wire.
which seems like a sensible optimisation, and we have a customer who values
it. Very roughly, they have an NFS server which optimises 'unstable' writes
for throughput and 'stable' writes for latency - this seems like a
reasonable approach.
With a 2.6.16 kernel an application which generates many small sync writes
gets adequate performance. In 2.6.32 they see unstable writes followed by
commits, which cannot be (or at least aren't) optimised as well.
It seems this was changed by commit c63c7b0513953
NFS: Fix a race when doing NFS write coalescing
in 2.6.22.
Is it possible/easy/desirable to get this behaviour back? I.e. to use
NFS_DATA_SYNC at least on sub-page writes triggered by a write to an
O_SYNC file.
My (possibly naive) attempt is as follows. It appears to work as I expect
(though it still uses SYNC for 1-page writes) but I'm not confident that it
is "right".
Thanks,
NeilBrown
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 10d648e..392bfa8 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -178,6 +178,9 @@ static int wb_priority(struct writeback_control *wbc)
return FLUSH_HIGHPRI | FLUSH_STABLE;
if (wbc->for_kupdate || wbc->for_background)
return FLUSH_LOWPRI;
+ if (wbc->sync_mode == WB_SYNC_ALL &&
+ (wbc->range_end - wbc->range_start) < PAGE_SIZE)
+ return FLUSH_STABLE;
return 0;
}
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-02-16 6:15 Small O_SYNC writes are no longer NFS_DATA_SYNC NeilBrown
@ 2011-02-16 13:11 ` Jeff Layton
2011-02-16 20:26 ` NeilBrown
2011-03-17 23:53 ` Trond Myklebust
1 sibling, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2011-02-16 13:11 UTC (permalink / raw)
To: NeilBrown; +Cc: Trond Myklebust, linux-nfs
On Wed, 16 Feb 2011 17:15:55 +1100
NeilBrown <neilb@suse.de> wrote:
> [...]
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 10d648e..392bfa8 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -178,6 +178,9 @@ static int wb_priority(struct writeback_control *wbc)
> return FLUSH_HIGHPRI | FLUSH_STABLE;
> if (wbc->for_kupdate || wbc->for_background)
> return FLUSH_LOWPRI;
> + if (wbc->sync_mode == WB_SYNC_ALL &&
> + (wbc->range_end - wbc->range_start) < PAGE_SIZE)
> + return FLUSH_STABLE;
> return 0;
> }
>
I'm not so sure about this change. wb_priority is called from
nfs_wb_page. The comments there say:
/*
* Write back all requests on one page - we do this before reading it.
*/
...do we really need those writes to be NFS_FILE_SYNC?
I think that the difficulty here is determining when we really are
going to just be doing a single write. In that case, then clearly a
FILE_SYNC write is better than an unstable + COMMIT.
This is very workload dependent though. It's hard to know beforehand
whether a page that we intend to write will be redirtied soon
afterward. If it is, then FILE_SYNC writes may be worse than letting
the server cache the writes until a COMMIT comes in.
--
Jeff Layton <jlayton@redhat.com>
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-02-16 13:11 ` Jeff Layton
@ 2011-02-16 20:26 ` NeilBrown
2011-02-16 20:50 ` Jeff Layton
0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2011-02-16 20:26 UTC (permalink / raw)
To: Jeff Layton; +Cc: Trond Myklebust, linux-nfs
On Wed, 16 Feb 2011 08:11:50 -0500 Jeff Layton <jlayton@redhat.com> wrote:
> On Wed, 16 Feb 2011 17:15:55 +1100
> NeilBrown <neilb@suse.de> wrote:
>
> > [...]
>
> I'm not so sure about this change. wb_priority is called from
> nfs_wb_page. The comments there say:
>
> /*
> * Write back all requests on one page - we do this before reading it.
> */
>
> ...do we really need those writes to be NFS_FILE_SYNC?
Thanks for taking a look.
wb_priority is called from several places - yes.
In the nfs_wb_page case, I think we *do* want NFS_FILE_SYNC.
nfs_wb_page calls nfs_writepage_locked and then nfs_commit_inode which calls
nfs_scan_commit to send a COMMIT request for the page (if the write wasn't
stable).
By using NFS_FILE_SYNC we can avoid that COMMIT, and lose nothing (that I can
see).
>
> I think that the difficulty here is determining when we really are
> going to just be doing a single write. In that case, then clearly a
> FILE_SYNC write is better than an unstable + COMMIT.
>
> This is very workload dependent though. It's hard to know beforehand
> whether a page that we intend to write will be redirtied soon
> afterward. If it is, then FILE_SYNC writes may be worse than letting
> the server cache the writes until a COMMIT comes in.
>
The hope is that sync_mode == WB_SYNC_ALL combined with a short 'range' is
sufficient.
In particular, WB_SYNC_ALL essentially says that we want this page out to
storage *now*, so a 'flush' of some sort is likely to follow.
BTW, I'm wondering if the length of 'range' that we test should be related to
'wsize' rather than PAGE_SIZE. Any thoughts on that?
Thanks,
NeilBrown
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-02-16 20:26 ` NeilBrown
@ 2011-02-16 20:50 ` Jeff Layton
2011-02-16 21:00 ` NeilBrown
0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2011-02-16 20:50 UTC (permalink / raw)
To: NeilBrown; +Cc: Trond Myklebust, linux-nfs
On Thu, 17 Feb 2011 07:26:18 +1100
NeilBrown <neilb@suse.de> wrote:
> On Wed, 16 Feb 2011 08:11:50 -0500 Jeff Layton <jlayton@redhat.com> wrote:
>
> > On Wed, 16 Feb 2011 17:15:55 +1100
> > NeilBrown <neilb@suse.de> wrote:
> >
> > > [...]
> >
> > I'm not so sure about this change. wb_priority is called from
> > nfs_wb_page. The comments there say:
> >
> > /*
> > * Write back all requests on one page - we do this before reading it.
> > */
> >
> > ...do we really need those writes to be NFS_FILE_SYNC?
>
> Thanks for taking a look.
> wb_priority is called from several places - yes.
>
> In the nfs_wb_page case, I think we *do* want NFS_FILE_SYNC.
> nfs_wb_page calls nfs_writepage_locked and then nfs_commit_inode which calls
> nfs_scan_commit to send a COMMIT request for the page (if the write wasn't
> stable).
>
> By using NFS_FILE_SYNC we can avoid that COMMIT, and lose nothing (that I can
> see).
>
Good point.
> >
> > I think that the difficulty here is determining when we really are
> > going to just be doing a single write. In that case, then clearly a
> > FILE_SYNC write is better than an unstable + COMMIT.
> >
> > This is very workload dependent though. It's hard to know beforehand
> > whether a page that we intend to write will be redirtied soon
> > afterward. If it is, then FILE_SYNC writes may be worse than letting
> > the server cache the writes until a COMMIT comes in.
> >
>
> The hope is that sync_mode == WB_SYNC_ALL combined with a short 'range' is
> sufficient.
>
> In particular, WB_SYNC_ALL essentially says that we want this page out to
> storage *now*, so a 'flush' of some sort is likely to follow.
>
Also a good point.
I guess my main worry is that, at least in the Linux NFS server case,
these get translated to O_SYNC writes which can be very slow. If we
know though that we're not likely to have follow-on writes that could
be batched up then this is probably fine.
I guess I just don't have a good feel for how often (if ever) we do
multiple writes to the same file with different wbc ranges without an
intervening commit. If that happens and we end up doing FILE_SYNC
writes rather than unstable ones, then this change is probably not a
good one.
Do you have any idea as to why unstable writes get worse performance
for this customer? I know that base 2.6.32 did a *lot* more COMMITs
than were really necessary. Perhaps they are seeing the effect of that?
Or have they also tested something more recent and seen bad
performance?
>
> BTW, I'm wondering if the length of 'range' that we test should be related to
> 'wsize' rather than PAGE_SIZE. Any thoughts on that?
>
Yeah, that seems like a more meaningful test than page size.
--
Jeff Layton <jlayton@redhat.com>
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-02-16 20:50 ` Jeff Layton
@ 2011-02-16 21:00 ` NeilBrown
0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2011-02-16 21:00 UTC (permalink / raw)
To: Jeff Layton; +Cc: Trond Myklebust, linux-nfs
On Wed, 16 Feb 2011 15:50:42 -0500 Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 17 Feb 2011 07:26:18 +1100
> NeilBrown <neilb@suse.de> wrote:
>
> > On Wed, 16 Feb 2011 08:11:50 -0500 Jeff Layton <jlayton@redhat.com> wrote:
> >
> > > On Wed, 16 Feb 2011 17:15:55 +1100
> > > NeilBrown <neilb@suse.de> wrote:
> > >
> > > > [...]
> > >
> > > I'm not so sure about this change. wb_priority is called from
> > > nfs_wb_page. The comments there say:
> > >
> > > /*
> > > * Write back all requests on one page - we do this before reading it.
> > > */
> > >
> > > ...do we really need those writes to be NFS_FILE_SYNC?
> >
> > Thanks for taking a look.
> > wb_priority is called from several places - yes.
> >
> > In the nfs_wb_page case, I think we *do* want NFS_FILE_SYNC.
> > nfs_wb_page calls nfs_writepage_locked and then nfs_commit_inode which calls
> > nfs_scan_commit to send a COMMIT request for the page (if the write wasn't
> > stable).
> >
> > By using NFS_FILE_SYNC we can avoid that COMMIT, and lose nothing (that I can
> > see).
> >
>
> Good point.
>
> > >
> > > I think that the difficulty here is determining when we really are
> > > going to just be doing a single write. In that case, then clearly a
> > > FILE_SYNC write is better than an unstable + COMMIT.
> > >
> > > This is very workload dependent though. It's hard to know beforehand
> > > whether a page that we intend to write will be redirtied soon
> > > afterward. If it is, then FILE_SYNC writes may be worse than letting
> > > the server cache the writes until a COMMIT comes in.
> > >
> >
> > The hope is that sync_mode == WB_SYNC_ALL combined with a short 'range' is
> > sufficient.
> >
> > In particular, WB_SYNC_ALL essentially says that we want this page out to
> > storage *now*, so a 'flush' of some sort is likely to follow.
> >
>
> Also a good point.
>
> I guess my main worry is that, at least in the Linux NFS server case,
> these get translated to O_SYNC writes which can be very slow. If we
> know though that we're not likely to have follow-on writes that could
> be batched up then this is probably fine.
My aim is to only trigger NFS_DATA_SYNC when the app performs an O_SYNC write,
so having the server do an O_SYNC write in that case seems correct.
I suspect the heuristic I chose isn't quite that focussed. I'm hoping someone
can either agree that it is probably focussed enough, or suggest how to
encode a heuristic that really does just get O_SYNC writes...
>
> I guess I just don't have a good feel for how often (if ever) we do
> multiple writes to the same file with different wbc ranges without an
> intervening commit. If that happens and we end up doing FILE_SYNC
> writes rather than unstable ones, then this change is probably not a
> good one.
>
> Do you have any idea as to why unstable writes get worse performance
> for this customer? I know that base 2.6.32 did a *lot* more COMMITs
> than were really necessary. Perhaps they are seeing the effect of that?
> Or have they also tested something more recent and seen bad
> performance?
To quote:
------------------------
The customer's email application is doing primarily small sync writes.
We use a clustered filesystem in the backend to store the data exported via
NFS.
If we see an incoming write request with the stable flag set, we optimise the
buffer handling and update in place very efficiently. If there is no stable
flag, we have to assume there is more data to come, and either delay the
operation (speculating that there are more writes to the block before we
should acknowledge) or allocate a bigger buffer that needs to be written
multiple times (with each subsequent 4k write). Both solutions slow down the
application a lot, as you either introduce even more latency or generate more
I/O activity to your disks than needed (by flushing the same filesystem block
multiple times).
So if we know the client sends data with direct I/O or O_SYNC set by the
client application, the data packet sent to the server should have the stable
flag turned on.
So it seems to be the "unstable" writes rather than the excess commits that
trigger the unfortunate behaviour.
They have only tried SLES11-SP1, which is 2.6.32 based.
>
> >
> > BTW, I'm wondering if the length of 'range' that we test should be related to
> > 'wsize' rather than PAGE_SIZE. Any thoughts on that?
> >
>
> Yeah, that seems like a more meaningful test than page size.
>
Thanks,
NeilBrown
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-02-16 6:15 Small O_SYNC writes are no longer NFS_DATA_SYNC NeilBrown
2011-02-16 13:11 ` Jeff Layton
@ 2011-03-17 23:53 ` Trond Myklebust
2011-03-18 1:04 ` NeilBrown
1 sibling, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2011-03-17 23:53 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-nfs
On Wed, 2011-02-16 at 17:15 +1100, NeilBrown wrote:
> [...]
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 10d648e..392bfa8 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -178,6 +178,9 @@ static int wb_priority(struct writeback_control *wbc)
> return FLUSH_HIGHPRI | FLUSH_STABLE;
> if (wbc->for_kupdate || wbc->for_background)
> return FLUSH_LOWPRI;
> + if (wbc->sync_mode == WB_SYNC_ALL &&
> + (wbc->range_end - wbc->range_start) < PAGE_SIZE)
> + return FLUSH_STABLE;
> return 0;
> }
Would it ever be wrong to set the FILE_SYNC flag for the very last rpc
call in a writeback series? I'm thinking that we might want to set
FLUSH_STABLE before the call to pageio_complete in
nfs_writepage_locked() and nfs_writepages().
The only thing that makes me uncomfortable with that idea is the
possible repercussions for pNFS.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-17 23:53 ` Trond Myklebust
@ 2011-03-18 1:04 ` NeilBrown
2011-03-18 1:49 ` Trond Myklebust
0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2011-03-18 1:04 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs
On Thu, 17 Mar 2011 19:53:07 -0400 Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Wed, 2011-02-16 at 17:15 +1100, NeilBrown wrote:
> > [...]
>
> Would it ever be wrong to set the FILE_SYNC flag for the very last rpc
> call in a writeback series? I'm thinking that we might want to set
> FLUSH_STABLE before the call to pageio_complete in
> nfs_writepage_locked() and nfs_writepages().
Interesting idea.
Am I correct in assuming you only mean if wbc->sync_mode == WB_SYNC_ALL?
It wouldn't seem to make any sense for WB_SYNC_NONE.
In that case the last RPC would be immediately followed by a COMMIT, so it
could be reasonable to make it NFS_DATA_SYNC.
However the server would be seeing something a bit odd - a sequence of
unstable writes, then a stable write, then a commit. This could cause it to
'sync' things in the 'wrong' order which might be less than optimal. It
would depend a lot on the particular server and filesystem of course, but it
seems to be mis-communicating. So I think I would avoid this approach
(assuming I understand it correctly).
>
> The only thing that makes me uncomfortable with that idea is the
> possible repercussions for pNFS.
>
Cannot comment there - I don't have a deep enough understanding of the issues
with pNFS.
Thanks,
NeilBrown
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-18 1:04 ` NeilBrown
@ 2011-03-18 1:49 ` Trond Myklebust
2011-03-18 2:12 ` NeilBrown
0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2011-03-18 1:49 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-nfs
On Fri, 2011-03-18 at 12:04 +1100, NeilBrown wrote:
> On Thu, 17 Mar 2011 19:53:07 -0400 Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
> > Would it ever be wrong to set the FILE_SYNC flag for the very last rpc
> > call in a writeback series? I'm thinking that we might want to set
> > FLUSH_STABLE before the call to pageio_complete in
> > nfs_writepage_locked() and nfs_writepages().
>
> Interesting idea.
>
> Am I correct in assuming you only mean if wbc->sync_mode == WB_SYNC_ALL.
> It wouldn't seem to make any sense for WB_SYNC_NONE.
>
> In that case that last RPC would be immediately followed by a COMMIT. So it
> could be reasonable to make it NFS_DATA_SYNC.
No. DATA_SYNC _requires_ a COMMIT to ensure that metadata is synced too,
so it makes little sense to use it here.
> However the server would be seeing something a bit odd - a sequence of
> unstable writes, then a stable write, then a commit. This could cause it to
> 'sync' things in the 'wrong' order which might be less than optimal. It
> would depend a lot on the particular server and filesystem of course, but it
> seems to be mis-communicating. So I think I would avoid this approach
> (assuming I understand it correctly).
Yes. Thinking a bit more about it, the latest versions of the Linux
server actually do use vfs_fsync_range(), so it no longer makes sense to
replace COMMIT with a FILE_SYNC write.
However we could adopt the Solaris convention of always starting
writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the
second rpc call and all subsequent calls...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-18 1:49 ` Trond Myklebust
@ 2011-03-18 2:12 ` NeilBrown
2011-03-18 2:25 ` Trond Myklebust
0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2011-03-18 2:12 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs
On Thu, 17 Mar 2011 21:49:26 -0400 Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> However we could adopt the Solaris convention of always starting
> writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the
> second rpc call and all subsequent calls...
>
That approach certainly has merit.
However, as we know from the wbc info whether the write is small and sync -
which is the only case where I think a STABLE write is needed - I cannot see
why you wouldn't just use that information to guide the choice of 'stable'
or not?
NeilBrown
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-18 2:12 ` NeilBrown
@ 2011-03-18 2:25 ` Trond Myklebust
2011-03-18 3:52 ` NeilBrown
0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2011-03-18 2:25 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-nfs
On Fri, 2011-03-18 at 13:12 +1100, NeilBrown wrote:
> On Thu, 17 Mar 2011 21:49:26 -0400 Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
>
> > However we could adopt the Solaris convention of always starting
> > writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the
> > second rpc call and all subsequent calls...
> >
>
> That approach certainly has merit.
>
> However, as we know from the wbc info whether the write is small and sync -
> which is the only case where I think a STABLE write is needed - I cannot see
> why you don't want to just use that information to guide the choice of
> 'stable' or not ???
By far the most common case we would want to optimise for is the sync at
close() or fsync() when you have written a small file (<= wsize). If we
can't optimise for that case, then the optimisation isn't worth doing at
all.
The point is that in that particular case, the wbc doesn't help you at
all since the limits are set at 0 and LLONG_MAX (see nfs_wb_all(),
write_inode_now(),...)
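To spell out why the range test fails there (illustrative model, not the kernel code): since nfs_wb_all() and write_inode_now() issue the writeback with range_start = 0 and range_end = LLONG_MAX, any "does the range fit in one RPC?" check sees the whole address space:

```c
#include <assert.h>
#include <limits.h>

/*
 * A naive range-size check. On the fsync()/close() path the wbc
 * limits are 0 and LLONG_MAX, so this returns false even for a
 * one-byte file - exactly the small-file case we want to catch.
 */
static int range_fits_in_one_rpc(long long range_start,
				 long long range_end,
				 long long wsize)
{
	return range_end - range_start < wsize;
}
```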
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-18 2:25 ` Trond Myklebust
@ 2011-03-18 3:52 ` NeilBrown
2011-03-21 21:02 ` Trond Myklebust
0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2011-03-18 3:52 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs
On Thu, 17 Mar 2011 22:25:08 -0400 Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Fri, 2011-03-18 at 13:12 +1100, NeilBrown wrote:
> > On Thu, 17 Mar 2011 21:49:26 -0400 Trond Myklebust
> > <Trond.Myklebust@netapp.com> wrote:
> >
> > > However we could adopt the Solaris convention of always starting
> > > writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the
> > > second rpc call and all subsequent calls...
> > >
> >
> > That approach certainly has merit.
> >
> > However, as we know from the wbc info whether the write is small and sync -
> > which is the only case where I think a STABLE write is needed - I cannot see
> > why you don't want to just use that information to guide the choice of
> > 'stable' or not ???
>
> By far the most common case we would want to optimise for is the sync at
> close() or fsync() when you have written a small file (<= wsize). If we
> can't optimise for that case, then the optimisation isn't worth doing at
> all.
Fair point. I hadn't thought of that.
>
> The point is that in that particular case, the wbc doesn't help you at
> all since the limits are set at 0 and LLONG_MAX (see nfs_wb_all(),
> write_inode_now(),...)
>
It would be trivial to use min(wbc->range_end, i_size_read(inode)) as the
upper bound when assessing the size of the range to compare with 'wsize'.
However that wouldn't address the case of a small append to a large file
which would also be good to optimise.
If you can detect the 'first' RPC reliably at the same time that you still
have access to the wbc information, then:
if this is the first request in a writeback, and the difference between
the address of this page, and min(wbc->range_end, i_size_read(inode))
is less than wsize, then make it a STABLE write
might be a heuristic that catches most interesting cases.
It might be a bit complex though.
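Spelled out as a stand-alone function (hypothetical names, and detecting the 'first' request is the part the kernel would still have to provide):

```c
#include <assert.h>

static long long llmin(long long a, long long b)
{
	return a < b ? a : b;
}

/*
 * The heuristic above: only the first RPC of a writeback may be
 * stable, and only if everything from this page to the effective end
 * of the range - capped at i_size, so a small append to a large file
 * still qualifies - fits within one wsize-sized write.
 */
static int use_stable_write(int first_request, long long page_offset,
			    long long range_end, long long i_size,
			    long long wsize)
{
	long long end = llmin(range_end, i_size);
	return first_request && end - page_offset < wsize;
}
```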
I think we should in general err on the side of not using a STABLE write
when it might be useful rather than using a STABLE write when it is not
necessary as, while there are costs each way, I think the cost of incorrectly
using STABLE would be higher.
Thanks for the clarification.
NeilBrown
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-18 3:52 ` NeilBrown
@ 2011-03-21 21:02 ` Trond Myklebust
2011-03-21 22:17 ` NeilBrown
0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2011-03-21 21:02 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-nfs
On Fri, 2011-03-18 at 14:52 +1100, NeilBrown wrote:
> On Thu, 17 Mar 2011 22:25:08 -0400 Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
>
> > On Fri, 2011-03-18 at 13:12 +1100, NeilBrown wrote:
> > > On Thu, 17 Mar 2011 21:49:26 -0400 Trond Myklebust
> > > <Trond.Myklebust@netapp.com> wrote:
> > >
> > > > However we could adopt the Solaris convention of always starting
> > > > writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the
> > > > second rpc call and all subsequent calls...
> > > >
> > >
> > > That approach certainly has merit.
> > >
> > > However, as we know from the wbc info whether the write is small and sync -
> > > which is the only case where I think a STABLE write is needed - I cannot see
> > > why you don't want to just use that information to guide the choice of
> > > 'stable' or not ???
> >
> > By far the most common case we would want to optimise for is the sync at
> > close() or fsync() when you have written a small file (<= wsize). If we
> > can't optimise for that case, then the optimisation isn't worth doing at
> > all.
>
> Fair point. I hadn't thought of that.
>
> >
> > The point is that in that particular case, the wbc doesn't help you at
> > all since the limits are set at 0 and LLONG_MAX (see nfs_wb_all(),
> > write_inode_now(),...)
> >
>
> It would be trivial to use min(wbc->range_end, i_size_read(inode)) as the
> upper bound when assessing the size of the range to compare with 'wsize'.
>
> However that wouldn't address the case of a small append to a large file
> which would also be good to optimise.
>
> If you can detect the 'first' RPC reliably at the same time that you still
> have access to the wbc information, then:
>
> if this is the first request in a writeback, and the difference between
> the address of this page, and min(wbc->range_end, i_size_read(inode))
> is less than wsize, then make it a STABLE write
>
> might be a heuristic that catches most interesting cases.
> It might be a bit complex though.
>
> I think we should in general err on the side of not using a STABLE write
> when it might be useful rather than using a STABLE write when it is not
> necessary as, while there are costs each way, I think the cost of incorrectly
> using STABLE would be higher.
How about something like the following (as of yet untested) patch?
Cheers
Trond
8<-------------------------------------------------------------------------------
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-21 21:02 ` Trond Myklebust
@ 2011-03-21 22:17 ` NeilBrown
2011-03-21 22:54 ` Trond Myklebust
0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2011-03-21 22:17 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs
On Mon, 21 Mar 2011 17:02:00 -0400 Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Fri, 2011-03-18 at 14:52 +1100, NeilBrown wrote:
> > On Thu, 17 Mar 2011 22:25:08 -0400 Trond Myklebust
> > <Trond.Myklebust@netapp.com> wrote:
> >
> > > On Fri, 2011-03-18 at 13:12 +1100, NeilBrown wrote:
> > > > On Thu, 17 Mar 2011 21:49:26 -0400 Trond Myklebust
> > > > <Trond.Myklebust@netapp.com> wrote:
> > > >
> > > > > However we could adopt the Solaris convention of always starting
> > > > > writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the
> > > > > second rpc call and all subsequent calls...
> > > > >
> > > >
> > > > That approach certainly has merit.
> > > >
> > > > However, as we know from the wbc info whether the write is small and sync -
> > > > which is the only case where I think a STABLE write is needed - I cannot see
> > > > why you don't want to just use that information to guide the choice of
> > > > 'stable' or not ???
> > >
> > > By far the most common case we would want to optimise for is the sync at
> > > close() or fsync() when you have written a small file (<= wsize). If we
> > > can't optimise for that case, then the optimisation isn't worth doing at
> > > all.
> >
> > Fair point. I hadn't thought of that.
> >
> > >
> > > The point is that in that particular case, the wbc doesn't help you at
> > > all since the limits are set at 0 and LLONG_MAX (see nfs_wb_all(),
> > > write_inode_now(),...)
> > >
> >
> > It would be trivial to use min(wbc->range_end, i_size_read(inode)) as the
> > upper bound when assessing the size of the range to compare with 'wsize'.
> >
> > However that wouldn't address the case of a small append to a large file
> > which would also be good to optimise.
> >
> > If you can detect the 'first' RPC reliably at the same time that you still
> > have access to the wbc information, then:
> >
> > if this is the first request in a writeback, and the difference between
> > the address of this page, and min(wbc->range_end, i_size_read(inode))
> > is less than wsize, then make it a STABLE write
> >
> > might be a heuristic that catches most interesting cases.
> > It might be a bit complex though.
> >
> > I think we should in general err on the side of not using a STABLE write
> > when it might be useful rather than using a STABLE write when it is not
> > necessary as, while there are costs each way, I think the cost of incorrectly
> > using STABLE would be higher.
>
> How about something like the following (as of yet untested) patch?
Looks good.
I tried it out and it works as expected on large writes (uses UNSTABLE),
O_SYNC writes (uses FILE_SYNC) and small writes (uses FILE_SYNC).
All these were simple open/write/close. I didn't watch what happens if we
wait for writeback, but I don't expect it to be any different.
I must admit that I found the terminology a bit confusing for
FLUSH_COND_STABLE.
What it means is "if this turns out to be part of a larger request, then clear
FLUSH_STABLE" - which sounds a bit more like "FLUSH_COND_UNSTABLE" - i.e. if
a condition is met, then make it unstable.
But I don't think that change would help at all.
How about changing the test in nfs_write_rpcsetup to
	if (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
		data->args.stable = NFS_DATA_SYNC;
		if (!nfs_need_commit(NFS_I(inode)))
			data->args.stable = NFS_FILE_SYNC;
	}
and then just set either FLUSH_STABLE or FLUSH_COND_STABLE - never both -
and when you test FLUSH_COND_STABLE and then some other condition, just
clear FLUSH_COND_STABLE.
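As a toy model of that scheme (stub flag values and plain ints rather than the kernel structs), the intended interplay of the two flags is:

```c
#include <assert.h>

#define FLUSH_STABLE      4
#define FLUSH_COND_STABLE 32

enum nfs_stable_how { NFS_UNSTABLE, NFS_DATA_SYNC, NFS_FILE_SYNC };

/* Callers set either FLUSH_STABLE or FLUSH_COND_STABLE, never both.
 * The pageio layer clears the conditional flag as soon as the write
 * turns out to need more than one RPC (moreio) or there are already
 * uncommitted unstable writes outstanding (ncommit). */
static int resolve_cond_stable(int how, int moreio, int ncommit)
{
	if ((how & FLUSH_COND_STABLE) && (moreio || ncommit))
		how &= ~FLUSH_COND_STABLE;
	return how;
}

/* nfs_write_rpcsetup() then treats whatever flag survives as
 * mandatory when picking the stable_how argument. */
static enum nfs_stable_how rpc_stable(int how, int need_commit)
{
	if (how & (FLUSH_STABLE | FLUSH_COND_STABLE))
		return need_commit ? NFS_DATA_SYNC : NFS_FILE_SYNC;
	return NFS_UNSTABLE;
}
```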
I would find that quite a bit more readable.
i.e. the following - in which I also enhanced a comment slightly.
Thanks,
NeilBrown
commit 8283f834711942d808a5a9056893c46f59ad4770
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Mon Mar 21 17:02:00 2011 -0400
NFS: Use stable writes when not doing a bulk flush
If we're only doing a single write, and there are no other unstable
writes being queued up, we might want to just flip to using a stable
write RPC call.
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 23e7944..fd85618 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -223,6 +223,7 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
desc->pg_count = 0;
desc->pg_bsize = bsize;
desc->pg_base = 0;
+ desc->pg_moreio = 0;
desc->pg_inode = inode;
desc->pg_doio = doio;
desc->pg_ioflags = io_flags;
@@ -335,9 +336,11 @@ int nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
struct nfs_page *req)
{
while (!nfs_pageio_do_add_request(desc, req)) {
+ desc->pg_moreio = 1;
nfs_pageio_doio(desc);
if (desc->pg_error < 0)
return 0;
+ desc->pg_moreio = 0;
}
return 1;
}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 47a3ad6..4d686ee 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -179,8 +179,8 @@ static int wb_priority(struct writeback_control *wbc)
if (wbc->for_reclaim)
return FLUSH_HIGHPRI | FLUSH_STABLE;
if (wbc->for_kupdate || wbc->for_background)
- return FLUSH_LOWPRI;
- return 0;
+ return FLUSH_LOWPRI | FLUSH_COND_STABLE;
+ return FLUSH_COND_STABLE;
}
/*
@@ -863,7 +863,7 @@ static int nfs_write_rpcsetup(struct nfs_page *req,
data->args.context = get_nfs_open_context(req->wb_context);
data->args.lock_context = req->wb_lock_context;
data->args.stable = NFS_UNSTABLE;
- if (how & FLUSH_STABLE) {
+ if (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
data->args.stable = NFS_DATA_SYNC;
if (!nfs_need_commit(NFS_I(inode)))
data->args.stable = NFS_FILE_SYNC;
@@ -912,6 +912,12 @@ static int nfs_flush_multi(struct nfs_pageio_descriptor *desc)
nfs_list_remove_request(req);
+ if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
+ (desc->pg_moreio || NFS_I(desc->pg_inode)->ncommit ||
+ desc->pg_count > wsize))
+ desc->pg_ioflags &= ~FLUSH_COND_STABLE;
+
+
nbytes = desc->pg_count;
do {
size_t len = min(nbytes, wsize);
@@ -1002,6 +1008,10 @@ static int nfs_flush_one(struct nfs_pageio_descriptor *desc)
if ((!lseg) && list_is_singular(&data->pages))
lseg = pnfs_update_layout(desc->pg_inode, req->wb_context, IOMODE_RW);
+ if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
+ (desc->pg_moreio || NFS_I(desc->pg_inode)->ncommit))
+ desc->pg_ioflags &= ~FLUSH_COND_STABLE;
+
/* Set up the argument struct */
ret = nfs_write_rpcsetup(req, data, &nfs_write_full_ops, desc->pg_count, 0, lseg, desc->pg_ioflags);
out:
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index f88522b..cb2add4 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -33,6 +33,8 @@
#define FLUSH_STABLE 4 /* commit to stable storage */
#define FLUSH_LOWPRI 8 /* low priority background flush */
#define FLUSH_HIGHPRI 16 /* high priority memory reclaim flush */
+#define FLUSH_COND_STABLE 32 /* conditional stable write - only stable
+ * if everything fits in one RPC */
#ifdef __KERNEL__
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 90907ad..92d54c8 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -57,6 +57,7 @@ struct nfs_pageio_descriptor {
size_t pg_count;
size_t pg_bsize;
unsigned int pg_base;
+ char pg_moreio;
struct inode *pg_inode;
int (*pg_doio)(struct nfs_pageio_descriptor *);
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-21 22:17 ` NeilBrown
@ 2011-03-21 22:54 ` Trond Myklebust
2011-03-21 23:47 ` NeilBrown
0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2011-03-21 22:54 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-nfs
On Tue, 2011-03-22 at 09:17 +1100, NeilBrown wrote:
> On Mon, 21 Mar 2011 17:02:00 -0400 Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
> I must admit that I found the terminology a bit confusing for
> FLUSH_COND_STABLE.
> What it means is "if this turns out to be part of a larger request, then clear
> FLUSH_STABLE" - which sounds a bit more like "FLUSH_COND_UNSTABLE" - i.e. if
> a condition is met, then make it unstable.
> But I don't think that change would help at all.
>
> How about changing the test in nfs_write_rpcsetup to
> 	if (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
> 		data->args.stable = NFS_DATA_SYNC;
> 		if (!nfs_need_commit(NFS_I(inode)))
> 			data->args.stable = NFS_FILE_SYNC;
> 	}
>
> and then just set either FLUSH_STABLE or FLUSH_COND_STABLE - never both -
> and when you test FLUSH_COND_STABLE and then some other condition, just
> clear FLUSH_COND_STABLE.
>
> I would find that quite a bit more readable.
The reason why I opted not to go for that approach was because
nfs_write_rpcsetup() doesn't really know about whether or not there are
any more RPCs pending (and so having a 'conditional' flag being
interpreted there didn't appear to make sense), but if you feel that is
easier on the eyes then I'm happy to change my opinion.
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
2011-03-21 22:54 ` Trond Myklebust
@ 2011-03-21 23:47 ` NeilBrown
0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2011-03-21 23:47 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs
On Mon, 21 Mar 2011 18:54:55 -0400 Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Tue, 2011-03-22 at 09:17 +1100, NeilBrown wrote:
> > On Mon, 21 Mar 2011 17:02:00 -0400 Trond Myklebust
> > <Trond.Myklebust@netapp.com> wrote:
> > I must admit that I found the terminology a bit confusing for
> > FLUSH_COND_STABLE.
> > What it means is "if this turns out to be part of a larger request, then clear
> > FLUSH_STABLE" - which sounds a bit more like "FLUSH_COND_UNSTABLE" - i.e. if
> > a condition is met, then make it unstable.
> > But I don't think that change would help at all.
> >
> > How about changing the test in nfs_write_rpcsetup to
> > 	if (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
> > 		data->args.stable = NFS_DATA_SYNC;
> > 		if (!nfs_need_commit(NFS_I(inode)))
> > 			data->args.stable = NFS_FILE_SYNC;
> > 	}
> >
> > and then just set either FLUSH_STABLE or FLUSH_COND_STABLE - never both -
> > and when you test FLUSH_COND_STABLE and then some other condition, just
> > clear FLUSH_COND_STABLE.
> >
> > I would find that quite a bit more readable.
>
> The reason why I opted not to go for that approach was because
> nfs_write_rpcsetup() doesn't really know about whether or not there are
> any more RPCs pending (and so having a 'conditional' flag being
> interpreted there didn't appear to make sense), but if you feel that is
> easier on the eyes then I'm happy to change my opinion.
>
I do see your point - by the time we get to nfs_write_rpcsetup it isn't
really conditional any more - it is now mandatory.
Maybe one cannot get the names perfect. However I find that having
interdependencies between different flag bits can easily become confusing.
So I have a slight preference for my version, but I concede that it isn't a
clear winner.
Thanks,
NeilBrown
Thread overview: 15+ messages
2011-02-16 6:15 Small O_SYNC writes are no longer NFS_DATA_SYNC NeilBrown
2011-02-16 13:11 ` Jeff Layton
2011-02-16 20:26 ` NeilBrown
2011-02-16 20:50 ` Jeff Layton
2011-02-16 21:00 ` NeilBrown
2011-03-17 23:53 ` Trond Myklebust
2011-03-18 1:04 ` NeilBrown
2011-03-18 1:49 ` Trond Myklebust
2011-03-18 2:12 ` NeilBrown
2011-03-18 2:25 ` Trond Myklebust
2011-03-18 3:52 ` NeilBrown
2011-03-21 21:02 ` Trond Myklebust
2011-03-21 22:17 ` NeilBrown
2011-03-21 22:54 ` Trond Myklebust
2011-03-21 23:47 ` NeilBrown