From: NeilBrown <neilb@suse.de>
To: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org
Subject: Re: Small O_SYNC writes are no longer NFS_DATA_SYNC
Date: Tue, 22 Mar 2011 09:17:57 +1100 [thread overview]
Message-ID: <20110322091757.7bf56d80@notabene.brown> (raw)
In-Reply-To: <1300741320.13307.50.camel@lade.trondhjem.org>
On Mon, 21 Mar 2011 17:02:00 -0400 Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Fri, 2011-03-18 at 14:52 +1100, NeilBrown wrote:
> > On Thu, 17 Mar 2011 22:25:08 -0400 Trond Myklebust
> > <Trond.Myklebust@netapp.com> wrote:
> >
> > > On Fri, 2011-03-18 at 13:12 +1100, NeilBrown wrote:
> > > > On Thu, 17 Mar 2011 21:49:26 -0400 Trond Myklebust
> > > > <Trond.Myklebust@netapp.com> wrote:
> > > >
> > > > > However we could adopt the Solaris convention of always starting
> > > > > writebacks with a FILE_SYNC, and then falling back to UNSTABLE for the
> > > > > second rpc call and all subsequent calls...
> > > > >
> > > >
> > > > That approach certainly has merit.
> > > >
> > > > However, as we know from the wbc info whether the write is small and sync -
> > > > which is the only case where I think a STABLE write is needed - I cannot see
> > > > why you don't want to just use that information to guide the choice of
> > > > 'stable' or not ???
> > >
> > > By far the most common case we would want to optimise for is the sync at
> > > close() or fsync() when you have written a small file (<= wsize). If we
> > > can't optimise for that case, then the optimisation isn't worth doing at
> > > all.
> >
> > Fair point. I hadn't thought of that.
> >
> > >
> > > The point is that in that particular case, the wbc doesn't help you at
> > > all since the limits are set at 0 and LLONG_MAX (see nfs_wb_all(),
> > > write_inode_now(),...)
> > >
> >
> > I would be trivial to use min(wbc->range_end, i_size_read(inode)) as the
> > upper bound when assessing the size of the range to compare with 'wsize'.
> >
> > However that wouldn't address the case of a small append to a large file
> > which would also be good to optimise.
> >
> > If you can detect the 'first' RPC reliably at the same time that you still
> > have access to the wbc information, then:
> >
> > if this is the first request in a writeback, and the difference beween
> > the address of this page, and min(wbc->range_end, i_size_read(inode))
> > is less than wsize, then make it a STABLE write
> >
> > might be a heuristic that catches most interesting cases.
> > It might be a bit complex though.
> >
> > I think we should in general err on the size of not using a STABLE write
> > when it might be useful rather than using a STABLE write when it is not
> > necessary as, while there a costs each way, I think the cost of incorrectly
> > using STABLE would be higher.
>
> How about something like the following (as of yet untested) patch?
Looks good.
I tried it out and it works as expected on large writes (uses UNSTABLE),
O_SYNC writes (uses FILE_SYNC) and small writes (uses FILE_SYNC).
All these were simple open/write/close. I didn't watch what happens if we
wait for writeback, but I don't expect it to be any different.
I must admit that I found the terminology a bit confusing for
FLUSH_COND_STABLE.
What is means is "if this turns out to be part of a larger request, then clear
FLUSH_STABLE" - which sounds a bit more like "FLUSH_COND_UNSTABLE" - i.e. if
a condition is met, then make it unstable.
But I don't think that change would help at all.
How about changing the test in nfs_write_rpcsetup to
if (how & (FLUSH_STABLE|FLUSH_CONDSTABLE)) {
data->args.stable = NFS_DATA_SYNC;
if (!nfs_need_commit(NFS_I(inode)))
data->args.stable = NFS_FILE_SYNC;
}
and then just set either FLUSH_STABLE or FLUSH_COND_STABLE - never both -
and when you test FLUSH_COND_STABLE and then some other condition, just
clear FLUSH_COND_STABLE.
I would find that quite a bit more readable.
i.e. the following - in which I also enhanced a comment slightly.
Thanks,
NeilBrown
commit 8283f834711942d808a5a9056893c46f59ad4770
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Mon Mar 21 17:02:00 2011 -0400
NFS: Use stable writes when not doing a bulk flush
If we're only doing a single write, and there are no other unstable
writes being queued up, we might want to just flip to using a stable
write RPC call.
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 23e7944..fd85618 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -223,6 +223,7 @@ void nfs_pageio_init(struct nfs_pageio_descriptor *desc,
desc->pg_count = 0;
desc->pg_bsize = bsize;
desc->pg_base = 0;
+ desc->pg_moreio = 0;
desc->pg_inode = inode;
desc->pg_doio = doio;
desc->pg_ioflags = io_flags;
@@ -335,9 +336,11 @@ int nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
struct nfs_page *req)
{
while (!nfs_pageio_do_add_request(desc, req)) {
+ desc->pg_moreio = 1;
nfs_pageio_doio(desc);
if (desc->pg_error < 0)
return 0;
+ desc->pg_moreio = 0;
}
return 1;
}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 47a3ad6..4d686ee 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -179,8 +179,8 @@ static int wb_priority(struct writeback_control *wbc)
if (wbc->for_reclaim)
return FLUSH_HIGHPRI | FLUSH_STABLE;
if (wbc->for_kupdate || wbc->for_background)
- return FLUSH_LOWPRI;
- return 0;
+ return FLUSH_LOWPRI | FLUSH_COND_STABLE;
+ return FLUSH_COND_STABLE;
}
/*
@@ -863,7 +863,7 @@ static int nfs_write_rpcsetup(struct nfs_page *req,
data->args.context = get_nfs_open_context(req->wb_context);
data->args.lock_context = req->wb_lock_context;
data->args.stable = NFS_UNSTABLE;
- if (how & FLUSH_STABLE) {
+ if (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
data->args.stable = NFS_DATA_SYNC;
if (!nfs_need_commit(NFS_I(inode)))
data->args.stable = NFS_FILE_SYNC;
@@ -912,6 +912,12 @@ static int nfs_flush_multi(struct nfs_pageio_descriptor *desc)
nfs_list_remove_request(req);
+ if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
+ (desc->pg_moreio || NFS_I(desc->pg_inode)->ncommit ||
+ desc->pg_count > wsize))
+ desc->pg_ioflags &= ~FLUSH_COND_STABLE;
+
+
nbytes = desc->pg_count;
do {
size_t len = min(nbytes, wsize);
@@ -1002,6 +1008,10 @@ static int nfs_flush_one(struct nfs_pageio_descriptor *desc)
if ((!lseg) && list_is_singular(&data->pages))
lseg = pnfs_update_layout(desc->pg_inode, req->wb_context, IOMODE_RW);
+ if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
+ (desc->pg_moreio || NFS_I(desc->pg_inode)->ncommit))
+ desc->pg_ioflags &= ~FLUSH_COND_STABLE;
+
/* Set up the argument struct */
ret = nfs_write_rpcsetup(req, data, &nfs_write_full_ops, desc->pg_count, 0, lseg, desc->pg_ioflags);
out:
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index f88522b..cb2add4 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -33,6 +33,8 @@
#define FLUSH_STABLE 4 /* commit to stable storage */
#define FLUSH_LOWPRI 8 /* low priority background flush */
#define FLUSH_HIGHPRI 16 /* high priority memory reclaim flush */
+#define FLUSH_COND_STABLE 32 /* conditional stable write - only stable
+ * if everything fits in one RPC */
#ifdef __KERNEL__
diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
index 90907ad..92d54c8 100644
--- a/include/linux/nfs_page.h
+++ b/include/linux/nfs_page.h
@@ -57,6 +57,7 @@ struct nfs_pageio_descriptor {
size_t pg_count;
size_t pg_bsize;
unsigned int pg_base;
+ char pg_moreio;
struct inode *pg_inode;
int (*pg_doio)(struct nfs_pageio_descriptor *);
next prev parent reply other threads:[~2011-03-21 22:18 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-02-16 6:15 Small O_SYNC writes are no longer NFS_DATA_SYNC NeilBrown
2011-02-16 13:11 ` Jeff Layton
2011-02-16 20:26 ` NeilBrown
2011-02-16 20:50 ` Jeff Layton
2011-02-16 21:00 ` NeilBrown
2011-03-17 23:53 ` Trond Myklebust
2011-03-18 1:04 ` NeilBrown
2011-03-18 1:49 ` Trond Myklebust
2011-03-18 2:12 ` NeilBrown
2011-03-18 2:25 ` Trond Myklebust
2011-03-18 3:52 ` NeilBrown
2011-03-21 21:02 ` Trond Myklebust
2011-03-21 22:17 ` NeilBrown [this message]
2011-03-21 22:54 ` Trond Myklebust
2011-03-21 23:47 ` NeilBrown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110322091757.7bf56d80@notabene.brown \
--to=neilb@suse.de \
--cc=Trond.Myklebust@netapp.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).