From: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
Steve Rago <sar-a+KepyhlMvJWk0Htik3J/w@public.gmane.org>,
Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
"linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
"linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
"Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org"
<Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>,
"jens.axboe" <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>,
Peter Staubach <staubach-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
Arjan van de Ven <arjan-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org>,
"linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: [PATCH] improve the performance of large sequential write NFS workloads
Date: Thu, 24 Dec 2009 13:25:15 +0800
Message-ID: <20091224052515.GA9698@localhost>
In-Reply-To: <20091223133244.GB3159-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
On Wed, Dec 23, 2009 at 09:32:44PM +0800, Jan Kara wrote:
> On Wed 23-12-09 03:43:02, Christoph Hellwig wrote:
> > On Tue, Dec 22, 2009 at 01:35:39PM +0100, Jan Kara wrote:
> > > > nfsd_sync:
> > > > [take i_mutex]
> > > > filemap_fdatawrite => can also be blocked, but less a problem
> > > > [drop i_mutex]
> > > > filemap_fdatawait
> > > >
> > > > Maybe it's a dumb question, but what's the purpose of i_mutex here?
> > > > For correctness or to prevent livelock? I can imagine some livelock
> > > > problem here (current implementation can easily wait for extra
> > > > pages), however not too hard to fix.
> > > Generally, most filesystems take i_mutex during fsync to
> > > a) avoid all sorts of livelocking problems
> > > b) serialize fsyncs for one inode (mostly for simplicity)
> > > I don't see what advantage would it bring that we get rid of i_mutex
> > > for fdatawait - only that maybe writers could proceed while we are
> > > waiting but is that really the problem?
> >
> > It would match what we do in vfs_fsync for the non-nfsd path, so it's
> > a no-brainer to do it. In fact I did switch it over to vfs_fsync a
> > while ago but that got reverted because it caused deadlocks for
> > nfsd_sync_dir which for some reason can't take the i_mutex (I'd have to
> > check the archives why).
> >
> > Here's a RFC patch to make some more sense of the fsync callers in nfsd,
> > including fixing up the data write/wait calling conventions to match the
> > regular fsync path (which might make this a -stable candidate):
> The patch looks good to me from general soundness point of view :).
> Someone with more NFS knowledge should tell whether dropping i_mutex for
> fdatawrite_and_wait is fine for NFS.

I believe it's safe to drop i_mutex for fdatawrite_and_wait(), because
an NFS commit works like this:

1) client: collect all unstable pages (those the server has ACKed as
   having reached its page cache)
2) client: send COMMIT
3) server: fdatawrite_and_wait(), which makes sure the pages in 1) get
   cleaned
4) client: put all pages collected in 1) into the clean state

So there's no need to take i_mutex to prevent concurrent
writes/commits.
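
To make the argument concrete, here is a toy userspace model of the
four steps. It is only a sketch, not kernel code: the structs, the
per-page sequence counter, and the function names are all made up for
illustration. The point it tries to show is that a write racing with
the commit simply leaves its page unstable for the next COMMIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical client-side page states, loosely following the steps above. */
enum page_state { PAGE_CLEAN, PAGE_DIRTY, PAGE_UNSTABLE };

struct server {
    bool dirty[NPAGES];          /* in the server page cache, not yet on disk */
};

struct client {
    enum page_state state[NPAGES];
    unsigned seq[NPAGES];        /* bumped on every write to the page */
};

/* Unstable WRITE: the server ACKs once the data is in its page cache. */
void client_write(struct client *c, struct server *s, size_t idx)
{
    assert(idx < NPAGES);
    s->dirty[idx] = true;
    c->seq[idx]++;
    c->state[idx] = PAGE_UNSTABLE;
}

/* Step 1: remember which unstable pages (and which version) we commit. */
void commit_collect(const struct client *c, unsigned snap[NPAGES])
{
    for (size_t i = 0; i < NPAGES; i++)
        snap[i] = (c->state[i] == PAGE_UNSTABLE) ? c->seq[i] : 0;
}

/* Steps 2+3: COMMIT arrives; the server writes out and waits, no i_mutex. */
void server_commit(struct server *s)
{
    for (size_t i = 0; i < NPAGES; i++)
        s->dirty[i] = false;
}

/*
 * Step 4: only pages collected in step 1 and not rewritten since then
 * become clean. Returns the number of pages cleaned.
 */
size_t commit_finish(struct client *c, const unsigned snap[NPAGES])
{
    size_t cleaned = 0;

    for (size_t i = 0; i < NPAGES; i++) {
        if (snap[i] != 0 && c->seq[i] == snap[i]) {
            c->state[i] = PAGE_CLEAN;
            cleaned++;
        }
    }
    return cleaned;
}
```

If a page is rewritten between server_commit() and commit_finish(), its
sequence number no longer matches the snapshot, so it stays unstable
and will be covered by a later COMMIT.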

If someone else concurrently truncates and then extends i_size, the NFS
write verifier (verf) will change, and thus the client will resend the
pages? (Whether it should overwrite the pages is another problem.)
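
For reference, the client-side check behind the verf remark can be
sketched as below. In NFSv3 the client compares the verifier returned
with each unstable WRITE against the one returned by COMMIT and resends
on a mismatch. Whether a truncate followed by an extend actually
changes the verifier is the open question above and is not modeled
here; struct page_req and commit_check() are hypothetical names used
only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page request record kept by the client. */
struct page_req {
    bool unstable;        /* written with an unstable WRITE, not yet committed */
    uint64_t write_verf;  /* verifier the server returned with that WRITE */
};

/*
 * Apply a COMMIT reply carrying commit_verf to n requests.
 * Pages whose WRITE verifier matches become clean; the others stay
 * unstable and are counted so the caller can resend them.
 */
size_t commit_check(struct page_req *reqs, size_t n, uint64_t commit_verf)
{
    size_t resend = 0;

    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!reqs[i].unstable)
            continue;                    /* nothing pending for this page */
        if (reqs[i].write_verf == commit_verf)
            reqs[i].unstable = false;    /* data is known to be stable */
        else
            resend++;                    /* verifier changed: must resend */
    }
    return resend;
}
```

A return value of 0 means every pending page was committed; anything
else is the number of pages the client has to write again.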
Thanks,
Fengguang
>
> > Index: linux-2.6/fs/nfsd/vfs.c
> > ===================================================================
> > --- linux-2.6.orig/fs/nfsd/vfs.c 2009-12-23 09:32:45.693170043 +0100
> > +++ linux-2.6/fs/nfsd/vfs.c 2009-12-23 09:39:47.627170082 +0100
> > @@ -769,45 +769,27 @@ nfsd_close(struct file *filp)
> > }
> >
> > /*
> > - * Sync a file
> > - * As this calls fsync (not fdatasync) there is no need for a write_inode
> > - * after it.
> > + * Sync a directory to disk.
> > + *
> > + * This is odd compared to all other fsync callers because we
> > + *
> > + * a) do not have a file struct available
> > + * b) expect to have i_mutex already held by the caller
> > */
> > -static inline int nfsd_dosync(struct file *filp, struct dentry *dp,
> > - const struct file_operations *fop)
> > +int
> > +nfsd_sync_dir(struct dentry *dentry)
> > {
> > - struct inode *inode = dp->d_inode;
> > - int (*fsync) (struct file *, struct dentry *, int);
> > + struct inode *inode = dentry->d_inode;
> > int err;
> >
> > - err = filemap_fdatawrite(inode->i_mapping);
> > - if (err == 0 && fop && (fsync = fop->fsync))
> > - err = fsync(filp, dp, 0);
> > - if (err == 0)
> > - err = filemap_fdatawait(inode->i_mapping);
> > + WARN_ON(!mutex_is_locked(&inode->i_mutex));
> >
> > + err = filemap_write_and_wait(inode->i_mapping);
> > + if (err == 0 && inode->i_fop->fsync)
> > + err = inode->i_fop->fsync(NULL, dentry, 0);
> > return err;
> > }
> >
> > -static int
> > -nfsd_sync(struct file *filp)
> > -{
> > - int err;
> > - struct inode *inode = filp->f_path.dentry->d_inode;
> > - dprintk("nfsd: sync file %s\n", filp->f_path.dentry->d_name.name);
> > - mutex_lock(&inode->i_mutex);
> > - err=nfsd_dosync(filp, filp->f_path.dentry, filp->f_op);
> > - mutex_unlock(&inode->i_mutex);
> > -
> > - return err;
> > -}
> > -
> > -int
> > -nfsd_sync_dir(struct dentry *dp)
> > -{
> > - return nfsd_dosync(NULL, dp, dp->d_inode->i_fop);
> > -}
> > -
> > /*
> > * Obtain the readahead parameters for the file
> > * specified by (dev, ino).
> > @@ -1011,7 +993,7 @@ static int wait_for_concurrent_writes(st
> >
> > if (inode->i_state & I_DIRTY) {
> > dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > - err = nfsd_sync(file);
> > + err = vfs_fsync(file, file->f_path.dentry, 0);
> > }
> > last_ino = inode->i_ino;
> > last_dev = inode->i_sb->s_dev;
> > @@ -1180,7 +1162,7 @@ nfsd_commit(struct svc_rqst *rqstp, stru
> > return err;
> > if (EX_ISSYNC(fhp->fh_export)) {
> > if (file->f_op && file->f_op->fsync) {
> > - err = nfserrno(nfsd_sync(file));
> > + err = nfserrno(vfs_fsync(file, file->f_path.dentry, 0));
> > } else {
> > err = nfserr_notsupp;
> > }
> --
> Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
> SUSE Labs, CR