public inbox for linux-kernel@vger.kernel.org
From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>,
	Steve Rago <sar@nec-labs.com>,
	Peter Zijlstra <peterz@infradead.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Trond.Myklebust@netapp.com" <Trond.Myklebust@netapp.com>,
	"jens.axboe" <jens.axboe@oracle.com>,
	Peter Staubach <staubach@redhat.com>,
	Arjan van de Ven <arjan@infradead.org>,
	Ingo Molnar <mingo@elte.hu>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH] improve the performance of large sequential write NFS workloads
Date: Thu, 24 Dec 2009 13:25:15 +0800
Message-ID: <20091224052515.GA9698@localhost>
In-Reply-To: <20091223133244.GB3159@quack.suse.cz>

On Wed, Dec 23, 2009 at 09:32:44PM +0800, Jan Kara wrote:
> On Wed 23-12-09 03:43:02, Christoph Hellwig wrote:
> > On Tue, Dec 22, 2009 at 01:35:39PM +0100, Jan Kara wrote:
> > > >    nfsd_sync:
> > > >      [take i_mutex]
> > > >        filemap_fdatawrite  => can also be blocked, but less a problem
> > > >      [drop i_mutex]
> > > >        filemap_fdatawait
> > > >  
> > > >    Maybe it's a dumb question, but what's the purpose of i_mutex here?
> > > >    For correctness or to prevent livelock? I can imagine some livelock
> > > >    problem here (current implementation can easily wait for extra
> > > >    pages), however not too hard to fix.
> > >   Generally, most filesystems take i_mutex during fsync to
> > > a) avoid all sorts of livelocking problems
> > > b) serialize fsyncs for one inode (mostly for simplicity)
> > >   I don't see what advantage would it bring that we get rid of i_mutex
> > > for fdatawait - only that maybe writers could proceed while we are
> > > waiting but is that really the problem?
> > 
> > It would match what we do in vfs_fsync for the non-nfsd path, so it's
> > a no-brainer to do it.  In fact I did switch it over to vfs_fsync a
> > while ago but that got reverted because it caused deadlocks for
> > nfsd_sync_dir which for some reason can't take the i_mutex (I'd have to
> > check the archives why).
> > 
> > Here's a RFC patch to make some more sense of the fsync callers in nfsd,
> > including fixing up the data write/wait calling conventions to match the
> > regular fsync path (which might make this a -stable candidate):
>   The patch looks good to me from general soundness point of view :).
> Someone with more NFS knowledge should tell whether dropping i_mutex for
> fdatawrite_and_wait is fine for NFS.

I believe it's safe to drop i_mutex for fdatawrite_and_wait(), because
the NFS commit sequence is:
1) client: collect all unstable pages (those the server has ACKed as
   having reached its page cache)
2) client: send COMMIT
3) server: fdatawrite_and_wait(), which makes sure the pages in 1) get cleaned
4) client: put all pages collected in 1) into the clean state

So there's no need to take i_mutex to prevent concurrent
write/commits.
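
As a rough illustration of the four steps above, here is a user-space toy
model. It is not kernel or NFS code; every name in it (struct model,
unstable_write(), collect_unstable(), and so on) is hypothetical. It only
shows the point being argued: the client cleans exactly the pages it
collected before sending COMMIT, so a write that races with the COMMIT
just stays unstable until the next one. It assumes the racing write
touches a page outside the collected set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical client-side view of a page. */
enum client_state { CL_UNTOUCHED, CL_UNSTABLE, CL_CLEAN };

struct model {
	enum client_state client[NPAGES]; /* client's view of each page */
	bool server_cached[NPAGES];       /* data is in the server page cache */
	bool server_on_disk[NPAGES];      /* data reached stable storage */
};

void model_init(struct model *m)
{
	for (size_t i = 0; i < NPAGES; i++) {
		m->client[i] = CL_UNTOUCHED;
		m->server_cached[i] = false;
		m->server_on_disk[i] = false;
	}
}

/* UNSTABLE write: the server ACKs once the data is in its page cache. */
void unstable_write(struct model *m, size_t i)
{
	m->server_cached[i] = true;
	m->server_on_disk[i] = false;
	m->client[i] = CL_UNSTABLE;
}

/* Step 1: the client snapshots the set of unstable pages. */
size_t collect_unstable(const struct model *m, bool snap[NPAGES])
{
	size_t n = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		snap[i] = (m->client[i] == CL_UNSTABLE);
		if (snap[i])
			n++;
	}
	return n;
}

/* Step 3: the server writes out and waits on everything cached so far. */
void server_commit(struct model *m)
{
	for (size_t i = 0; i < NPAGES; i++)
		if (m->server_cached[i])
			m->server_on_disk[i] = true;
}

/* Step 4: the client cleans only the pages from its snapshot. */
void client_finish_commit(struct model *m, const bool snap[NPAGES])
{
	for (size_t i = 0; i < NPAGES; i++) {
		if (!snap[i])
			continue;
		assert(m->server_on_disk[i]); /* guaranteed by step 3 */
		m->client[i] = CL_CLEAN;
	}
}

/* Runs steps 1-4 with one racing write; returns pages left unstable. */
size_t commit_demo(void)
{
	struct model m;
	bool snap[NPAGES];
	size_t left = 0;

	model_init(&m);
	unstable_write(&m, 0);
	unstable_write(&m, 1);
	unstable_write(&m, 2);

	collect_unstable(&m, snap);     /* step 1 (step 2 is the COMMIT RPC) */
	server_commit(&m);              /* step 3 */
	unstable_write(&m, 5);          /* write racing with the COMMIT */
	client_finish_commit(&m, snap); /* step 4 */

	for (size_t i = 0; i < NPAGES; i++)
		if (m.client[i] == CL_UNSTABLE)
			left++;
	return left;
}
```

Calling commit_demo() from a small main() leaves pages 0-2 clean and the
racing page 5 unstable, without any lock serializing the write against
the commit.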

If someone else concurrently truncates and then extends i_size, the NFS
write verifier (verf) will be changed and thus the client will resend
the pages? (Whether it should overwrite the pages is another problem.)
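
For reference, the client-side check that a changed verf would trigger can
be sketched as below. This is a minimal user-space illustration with
hypothetical names (struct write_verf, must_resend()), not the Linux NFS
client code. It assumes the 8-byte NFSv3 write verifier and only shows the
comparison; whether a truncate/extend on the server actually changes the
verifier is the open question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VERF_SIZE 8 /* NFSv3 write verifier size in bytes */

/* Hypothetical container for an opaque write verifier. */
struct write_verf {
	uint8_t data[VERF_SIZE];
};

/*
 * Compare the verifier returned with the UNSTABLE WRITE replies against
 * the one returned by COMMIT. A mismatch means the server may have lost
 * the uncommitted data, so the client has to send those pages again.
 */
bool must_resend(const struct write_verf *at_write,
		 const struct write_verf *at_commit)
{
	assert(at_write && at_commit);
	return memcmp(at_write->data, at_commit->data, VERF_SIZE) != 0;
}
```

The verifier is opaque to the client, so a byte-wise equality test is the
only meaningful operation on it.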

Thanks,
Fengguang


>  
> > Index: linux-2.6/fs/nfsd/vfs.c
> > ===================================================================
> > --- linux-2.6.orig/fs/nfsd/vfs.c	2009-12-23 09:32:45.693170043 +0100
> > +++ linux-2.6/fs/nfsd/vfs.c	2009-12-23 09:39:47.627170082 +0100
> > @@ -769,45 +769,27 @@ nfsd_close(struct file *filp)
> >  }
> >  
> >  /*
> > - * Sync a file
> > - * As this calls fsync (not fdatasync) there is no need for a write_inode
> > - * after it.
> > + * Sync a directory to disk.
> > + *
> > + * This is odd compared to all other fsync callers because we
> > + *
> > + *  a) do not have a file struct available
> > + *  b) expect to have i_mutex already held by the caller
> >   */
> > -static inline int nfsd_dosync(struct file *filp, struct dentry *dp,
> > -			      const struct file_operations *fop)
> > +int
> > +nfsd_sync_dir(struct dentry *dentry)
> >  {
> > -	struct inode *inode = dp->d_inode;
> > -	int (*fsync) (struct file *, struct dentry *, int);
> > +	struct inode *inode = dentry->d_inode;
> >  	int err;
> >  
> > -	err = filemap_fdatawrite(inode->i_mapping);
> > -	if (err == 0 && fop && (fsync = fop->fsync))
> > -		err = fsync(filp, dp, 0);
> > -	if (err == 0)
> > -		err = filemap_fdatawait(inode->i_mapping);
> > +	WARN_ON(!mutex_is_locked(&inode->i_mutex));
> >  
> > +	err = filemap_write_and_wait(inode->i_mapping);
> > +	if (err == 0 && inode->i_fop->fsync)
> > +		err = inode->i_fop->fsync(NULL, dentry, 0);
> >  	return err;
> >  }
> >  
> > -static int
> > -nfsd_sync(struct file *filp)
> > -{
> > -        int err;
> > -	struct inode *inode = filp->f_path.dentry->d_inode;
> > -	dprintk("nfsd: sync file %s\n", filp->f_path.dentry->d_name.name);
> > -	mutex_lock(&inode->i_mutex);
> > -	err=nfsd_dosync(filp, filp->f_path.dentry, filp->f_op);
> > -	mutex_unlock(&inode->i_mutex);
> > -
> > -	return err;
> > -}
> > -
> > -int
> > -nfsd_sync_dir(struct dentry *dp)
> > -{
> > -	return nfsd_dosync(NULL, dp, dp->d_inode->i_fop);
> > -}
> > -
> >  /*
> >   * Obtain the readahead parameters for the file
> >   * specified by (dev, ino).
> > @@ -1011,7 +993,7 @@ static int wait_for_concurrent_writes(st
> >  
> >  	if (inode->i_state & I_DIRTY) {
> >  		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
> > -		err = nfsd_sync(file);
> > +		err = vfs_fsync(file, file->f_path.dentry, 0);
> >  	}
> >  	last_ino = inode->i_ino;
> >  	last_dev = inode->i_sb->s_dev;
> > @@ -1180,7 +1162,7 @@ nfsd_commit(struct svc_rqst *rqstp, stru
> >  		return err;
> >  	if (EX_ISSYNC(fhp->fh_export)) {
> >  		if (file->f_op && file->f_op->fsync) {
> > -			err = nfserrno(nfsd_sync(file));
> > +			err = nfserrno(vfs_fsync(file, file->f_path.dentry, 0));
> >  		} else {
> >  			err = nfserr_notsupp;
> >  		}
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
