linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff Layton <jlayton@redhat.com>
To: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>, linux-nfs@vger.kernel.org
Subject: Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
Date: Wed, 16 Feb 2011 15:50:42 -0500	[thread overview]
Message-ID: <20110216155042.383b420f@corrin.poochiereds.net> (raw)
In-Reply-To: <20110217072618.26a6c02a@notabene.brown>

On Thu, 17 Feb 2011 07:26:18 +1100
NeilBrown <neilb@suse.de> wrote:

> On Wed, 16 Feb 2011 08:11:50 -0500 Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Wed, 16 Feb 2011 17:15:55 +1100
> > NeilBrown <neilb@suse.de> wrote:
> > 
> > > 
> > > Hi Trond,
> > >  I wonder if I might get your help/advice on an issue with NFS.
> > > 
> > >  It seems that NFS_DATA_SYNC is hardly used at all currently.  It is used for
> > >  O_DIRECT writes and for writes 'for_reclaim', and for handling some error
> > >  conditions, but that is about it.
> > > 
> > >  This appears to be a regression.
> > > 
> > >  Back in 2005, commit ab0a3dbedc5 in 2.6.13 says:
> > > 
> > >     [PATCH] NFS: Write optimization for short files and small O_SYNC writes.
> > >     
> > >      Use stable writes if we can see that we are only going to put a single
> > >      write on the wire.
> > > 
> > >  which seems like a sensible optimisation, and we have a customer which
> > >  values it.  Very roughly, they have an NFS server which optimises 'unstable'
> > >  writes for throughput and 'stable' writes for latency - these seems like a
> > >  reasonable approach.
> > >  With a 2.6.16 kernel an application which generates many small sync writes
> > >  gets adequate performance.  In 2.6.32 they see unstable writes followed by
> > >  commits, which cannot be (or at least aren't) optimised as well.
> > > 
> > >  It seems this was changed by commit c63c7b0513953
> > > 
> > >     NFS: Fix a race when doing NFS write coalescing
> > >     
> > >  in 2.6.22.
> > > 
> > >  Is it possible/easy/desirable to get this behaviour back.  i.e. to use
> > >  NFS_DATA_SYNC at least on sub-page writes triggered by a write to an
> > >  O_SYNC file.
> > > 
> > >  My (possibly naive) attempt is as follows.  It appears to work as I expect
> > >  (though it still uses SYNC for 1-page writes) but I'm not confident that it
> > >  is "right".
> > > 
> > > Thanks,
> > > 
> > > NeilBrown
> > > 
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index 10d648e..392bfa8 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -178,6 +178,9 @@ static int wb_priority(struct writeback_control *wbc)
> > >  		return FLUSH_HIGHPRI | FLUSH_STABLE;
> > >  	if (wbc->for_kupdate || wbc->for_background)
> > >  		return FLUSH_LOWPRI;
> > > +	if (wbc->sync_mode == WB_SYNC_ALL &&
> > > +	    (wbc->range_end - wbc->range_start) < PAGE_SIZE)
> > > +		return FLUSH_STABLE;
> > >  	return 0;
> > >  }
> > >  
> > 
> > I'm not so sure about this change. wb_priority is called from
> > nfs_wb_page. The comments there say:
> > 
> > /*
> >  * Write back all requests on one page - we do this before reading it.
> >  */
> > 
> > ...do we really need those writes to be NFS_FILE_SYNC?
> 
> Thanks for taking a look.
> wb_priority is called from several places - yes.
> 
> In the nfs_wb_page case, I think we *do* want NFS_FILE_SYNC.
> nfs_wb_page calls nfs_writepage_locked and then nfs_commit_inode which calls
> nfs_scan_commit to send a COMMIT request for the page (if the write wasn't
> stable).
> 
> By using NFS_FILE_SYNC we can avoid that COMMIT, and lose nothing (that I can
> see).
> 

Good point.

> > 
> > I think that the difficulty here is determining when we really are
> > going to just be doing a single write. In that case, then clearly a
> > FILE_SYNC write is better than an unstable + COMMIT.
> > 
> > This is very workload dependent though. It's hard to know beforehand
> > whether a page that we intend to write will be redirtied soon
> > afterward. If it is, then FILE_SYNC writes may be worse than letting
> > the server cache the writes until a COMMIT comes in.
> > 
> 
> The hope is that sync_mode == WB_SYNC_ALL combined with a short 'range' are
> sufficient.
> 
> In particular, WB_SYNC_ALL essentially says that we want this page out to
> storage *now* so a 'flush' of some sort is likely to following.
> 

Also a good point.

I guess my main worry is that, at least in the Linux NFS server case,
these get translated to O_SYNC writes which can be very slow. If we
know though that we're not likely to have follow-on writes that could
be batched up then this is probably fine.

I guess I just don't have a good feel for how often (if ever) we do
multiple writes to the same file with different wbc ranges without an
intervening commit. If that happens and we end up doing FILE_SYNC
writes rather than unstable ones, then this change is probably not a
good one.

Do you have any idea as to why unstable writes get worse performance
for this customer? I know that base 2.6.32 did a *lot* more COMMITs
than were really necessary. Perhaps they are seeing the effect of that?
Or have they also tested something more recent and seen bad
performance?

> 
> BTW, I'm wondering if the length of 'range' that we test should be related to
> 'wsize' rather than PAGE_SIZE.  Any thoughts on that?
> 

Yeah, that seems like a more meaningful test than page size.

-- 
Jeff Layton <jlayton@redhat.com>

  reply	other threads:[~2011-02-16 20:46 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-02-16  6:15 Small O_SYNC writes are no longer NFS_DATA_SYNC NeilBrown
2011-02-16 13:11 ` Jeff Layton
2011-02-16 20:26   ` NeilBrown
2011-02-16 20:50     ` Jeff Layton [this message]
2011-02-16 21:00       ` NeilBrown
2011-03-17 23:53 ` Trond Myklebust
2011-03-18  1:04   ` NeilBrown
2011-03-18  1:49     ` Trond Myklebust
2011-03-18  2:12       ` NeilBrown
2011-03-18  2:25         ` Trond Myklebust
2011-03-18  3:52           ` NeilBrown
2011-03-21 21:02             ` Trond Myklebust
2011-03-21 22:17               ` NeilBrown
2011-03-21 22:54                 ` Trond Myklebust
2011-03-21 23:47                   ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110216155042.383b420f@corrin.poochiereds.net \
    --to=jlayton@redhat.com \
    --cc=Trond.Myklebust@netapp.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).