From: NeilBrown <neilb@suse.de>
To: Jeff Layton <jlayton@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>, linux-nfs@vger.kernel.org
Subject: Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
Date: Thu, 17 Feb 2011 07:26:18 +1100 [thread overview]
Message-ID: <20110217072618.26a6c02a@notabene.brown> (raw)
In-Reply-To: <20110216081150.49b50ffe@tlielax.poochiereds.net>
On Wed, 16 Feb 2011 08:11:50 -0500 Jeff Layton <jlayton@redhat.com> wrote:
> On Wed, 16 Feb 2011 17:15:55 +1100
> NeilBrown <neilb@suse.de> wrote:
>
> >
> > Hi Trond,
> > I wonder if I might get your help/advice on an issue with NFS.
> >
> > It seems that NFS_DATA_SYNC is hardly used at all currently. It is used for
> > O_DIRECT writes and for writes 'for_reclaim', and for handling some error
> > conditions, but that is about it.
> >
> > This appears to be a regression.
> >
> > Back in 2005, commit ab0a3dbedc5 in 2.6.13 says:
> >
> > [PATCH] NFS: Write optimization for short files and small O_SYNC writes.
> >
> > Use stable writes if we can see that we are only going to put a single
> > write on the wire.
> >
> > which seems like a sensible optimisation, and we have a customer which
> > values it. Very roughly, they have an NFS server which optimises 'unstable'
> > writes for throughput and 'stable' writes for latency - these seems like a
> > reasonable approach.
> > With a 2.6.16 kernel an application which generates many small sync writes
> > gets adequate performance. In 2.6.32 they see unstable writes followed by
> > commits, which cannot be (or at least aren't) optimised as well.
> >
> > It seems this was changed by commit c63c7b0513953
> >
> > NFS: Fix a race when doing NFS write coalescing
> >
> > in 2.6.22.
> >
> > Is it possible/easy/desirable to get this behaviour back. i.e. to use
> > NFS_DATA_SYNC at least on sub-page writes triggered by a write to an
> > O_SYNC file.
> >
> > My (possibly naive) attempt is as follows. It appears to work as I expect
> > (though it still uses SYNC for 1-page writes) but I'm not confident that it
> > is "right".
> >
> > Thanks,
> >
> > NeilBrown
> >
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 10d648e..392bfa8 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -178,6 +178,9 @@ static int wb_priority(struct writeback_control *wbc)
> > return FLUSH_HIGHPRI | FLUSH_STABLE;
> > if (wbc->for_kupdate || wbc->for_background)
> > return FLUSH_LOWPRI;
> > + if (wbc->sync_mode == WB_SYNC_ALL &&
> > + (wbc->range_end - wbc->range_start) < PAGE_SIZE)
> > + return FLUSH_STABLE;
> > return 0;
> > }
> >
>
> I'm not so sure about this change. wb_priority is called from
> nfs_wb_page. The comments there say:
>
> /*
> * Write back all requests on one page - we do this before reading it.
> */
>
> ...do we really need those writes to be NFS_FILE_SYNC?
Thanks for taking a look.
wb_priority is called from several places - yes.
In the nfs_wb_page case, I think we *do* want NFS_FILE_SYNC.
nfs_wb_page calls nfs_writepage_locked and then nfs_commit_inode which calls
nfs_scan_commit to send a COMMIT request for the page (if the write wasn't
stable).
By using NFS_FILE_SYNC we can avoid that COMMIT, and lose nothing (that I can
see).
>
> I think that the difficulty here is determining when we really are
> going to just be doing a single write. In that case, then clearly a
> FILE_SYNC write is better than an unstable + COMMIT.
>
> This is very workload dependent though. It's hard to know beforehand
> whether a page that we intend to write will be redirtied soon
> afterward. If it is, then FILE_SYNC writes may be worse than letting
> the server cache the writes until a COMMIT comes in.
>
The hope is that sync_mode == WB_SYNC_ALL combined with a short 'range' are
sufficient.
In particular, WB_SYNC_ALL essentially says that we want this page out to
storage *now* so a 'flush' of some sort is likely to following.
BTW, I'm wondering if the length of 'range' that we test should be related to
'wsize' rather than PAGE_SIZE. Any thoughts on that?
Thanks,
NeilBrown
next prev parent reply other threads:[~2011-02-16 20:26 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-02-16 6:15 Small O_SYNC writes are no longer NFS_DATA_SYNC NeilBrown
2011-02-16 13:11 ` Jeff Layton
2011-02-16 20:26 ` NeilBrown [this message]
2011-02-16 20:50 ` Jeff Layton
2011-02-16 21:00 ` NeilBrown
2011-03-17 23:53 ` Trond Myklebust
2011-03-18 1:04 ` NeilBrown
2011-03-18 1:49 ` Trond Myklebust
2011-03-18 2:12 ` NeilBrown
2011-03-18 2:25 ` Trond Myklebust
2011-03-18 3:52 ` NeilBrown
2011-03-21 21:02 ` Trond Myklebust
2011-03-21 22:17 ` NeilBrown
2011-03-21 22:54 ` Trond Myklebust
2011-03-21 23:47 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110217072618.26a6c02a@notabene.brown \
--to=neilb@suse.de \
--cc=Trond.Myklebust@netapp.com \
--cc=jlayton@redhat.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).