* Re: [PATCH] improve the performance of large sequential write NFS workloads
  [not found] ` <1261232747.1947.194.camel@serenity>
@ 2009-12-22  1:59 ` Wu Fengguang
  2009-12-22 12:35   ` Jan Kara
                      ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Wu Fengguang @ 2009-12-22  1:59 UTC (permalink / raw)
  To: Steve Rago
  Cc: Peter Zijlstra, linux-nfs@vger.kernel.org,
      linux-kernel@vger.kernel.org, Trond.Myklebust@netapp.com,
      jens.axboe, Peter Staubach, Jan Kara, Arjan van de Ven,
      Ingo Molnar, linux-fsdevel

Steve,

On Sat, Dec 19, 2009 at 10:25:47PM +0800, Steve Rago wrote:
> On Sat, 2009-12-19 at 20:20 +0800, Wu Fengguang wrote:
> > On Thu, Dec 17, 2009 at 04:17:57PM +0800, Peter Zijlstra wrote:
> > > On Wed, 2009-12-16 at 21:03 -0500, Steve Rago wrote:
> > > > Eager Writeback for NFS Clients
> > > > -------------------------------
> > > > Prevent applications that write large sequential streams of data
> > > > (like backup, for example) from entering into a memory pressure
> > > > state, which degrades performance by falling back to synchronous
> > > > operations (both synchronous writes and additional commits).
> >
> > What exactly is the "memory pressure state" condition?  What's the
> > code to do the "synchronous writes and additional commits" and maybe
> > how they are triggered?
>
> Memory pressure occurs when most of the client pages have been dirtied
> by an application (think backup server writing multi-gigabyte files that
> exceed the size of main memory).  The system works harder to be able to
> free dirty pages so that they can be reused.  For a local file system,
> this means writing the pages to disk.  For NFS, however, the writes
> leave the pages in an "unstable" state until the server responds to a
> commit request.  Generally speaking, commit processing is far more
> expensive than write processing on the server; both are done with the
> inode locked, but since the commit takes so long, all writes are
> blocked, which stalls the pipeline.
Let me try to reiterate the problem with code; please correct me if I'm
wrong.

1) A normal fs sets I_DIRTY_DATASYNC when extending i_size; NFS, however,
   will set the flag for any pages written -- why this trick? To
   guarantee the call of nfs_commit_inode()? Which unfortunately turns
   almost every server-side NFS write into a sync write..

	writeback_single_inode:
	  do_writepages
	    nfs_writepages
	      nfs_writepage ----[short time later]---> nfs_writeback_release*
	                                                 nfs_mark_request_commit
	                                                   __mark_inode_dirty(I_DIRTY_DATASYNC);

	  if (I_DIRTY_SYNC || I_DIRTY_DATASYNC)  <---- so this will be true most of the time
	    write_inode
	      nfs_write_inode
	        nfs_commit_inode

2) NFS commit stops the pipeline because it sleeps and waits inside
   i_mutex, which blocks all other NFSDs trying to write/writeback the
   inode.

	nfsd_sync:
	  take i_mutex
	  filemap_fdatawrite
	  filemap_fdatawait
	  drop i_mutex

   If filemap_fdatawait() can be moved out of i_mutex (or the lock just
   removed), we solve the root problem:

	nfsd_sync:
	  [take i_mutex]
	  filemap_fdatawrite  => can also be blocked, but less of a problem
	  [drop i_mutex]
	  filemap_fdatawait

Maybe it's a dumb question, but what's the purpose of i_mutex here? For
correctness or to prevent livelock? I can imagine some livelock problem
here (the current implementation can easily wait for extra pages),
however it is not too hard to fix.

The proposed patch essentially takes two actions in nfs_file_write():
- start writeback when the per-file nr_dirty goes high, without committing
- throttle dirtying when the per-file nr_writeback goes high

I guess this effectively prevents pdflush from kicking in with its bad
committing behavior.

In general it's reasonable to keep NFS per-file nr_dirty low, but
questionable to do per-file nr_writeback throttling. This does not work
well with the global limits -- e.g. when there are many dirty files, the
summed-up nr_writeback will still grow out of control. And it's more
likely to impact user-visible responsiveness than a global limit.
But my opinion may be biased -- I have a patch to do a global NFS
nr_writeback limit ;)

Thanks,
Fengguang
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  2009-12-22  1:59 ` [PATCH] improve the performance of large sequential write NFS workloads Wu Fengguang
@ 2009-12-22 12:35   ` Jan Kara
  [not found]     ` <20091222123538.GB604-jyMamyUUXNJG4ohzP4jBZS1Fcj925eT/@public.gmane.org>
  2009-12-24  1:26   ` Wu Fengguang
  2009-12-22 16:41 ` Steve Rago
  2009-12-23 14:21 ` Trond Myklebust
  2 siblings, 2 replies; 66+ messages in thread
From: Jan Kara @ 2009-12-22 12:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Steve Rago, Peter Zijlstra, linux-nfs@vger.kernel.org,
      linux-kernel@vger.kernel.org, Trond.Myklebust@netapp.com,
      jens.axboe, Peter Staubach, Jan Kara, Arjan van de Ven,
      Ingo Molnar, linux-fsdevel

> 2) NFS commit stops the pipeline because it sleeps and waits inside
>    i_mutex, which blocks all other NFSDs trying to write/writeback the
>    inode.
>
>	nfsd_sync:
>	  take i_mutex
>	  filemap_fdatawrite
>	  filemap_fdatawait
>	  drop i_mutex

I believe this is unrelated to the problem Steve is trying to solve. By
the time we get to doing sync writes, performance is already busted, so
we'd better avoid getting there in the first place (unless the user
asked for it, of course).

> If filemap_fdatawait() can be moved out of i_mutex (or the lock just
> removed), we solve the root problem:
>
>	nfsd_sync:
>	  [take i_mutex]
>	  filemap_fdatawrite  => can also be blocked, but less of a problem
>	  [drop i_mutex]
>	  filemap_fdatawait
>
> Maybe it's a dumb question, but what's the purpose of i_mutex here?
> For correctness or to prevent livelock? I can imagine some livelock
> problem here (the current implementation can easily wait for extra
> pages), however it is not too hard to fix.
Generally, most filesystems take i_mutex during fsync to
 a) avoid all sorts of livelocking problems
 b) serialize fsyncs for one inode (mostly for simplicity)
I don't see what advantage we would gain by dropping i_mutex for
fdatawait -- only that writers could proceed while we are waiting, but
is that really the problem?

								Honza
--
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SuSE CR Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  [not found] ` <20091222123538.GB604-jyMamyUUXNJG4ohzP4jBZS1Fcj925eT/@public.gmane.org>
@ 2009-12-23  8:43   ` Christoph Hellwig
  [not found]     ` <20091223084302.GA14912-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2009-12-23  8:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Steve Rago, Peter Zijlstra,
      linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org,
      Trond.Myklebust@netapp.com, jens.axboe, Peter Staubach,
      Arjan van de Ven, Ingo Molnar, linux-fsdevel

On Tue, Dec 22, 2009 at 01:35:39PM +0100, Jan Kara wrote:
> > nfsd_sync:
> >   [take i_mutex]
> >   filemap_fdatawrite  => can also be blocked, but less of a problem
> >   [drop i_mutex]
> >   filemap_fdatawait
> >
> > Maybe it's a dumb question, but what's the purpose of i_mutex here?
> > For correctness or to prevent livelock? I can imagine some livelock
> > problem here (the current implementation can easily wait for extra
> > pages), however it is not too hard to fix.
>
> Generally, most filesystems take i_mutex during fsync to
>  a) avoid all sorts of livelocking problems
>  b) serialize fsyncs for one inode (mostly for simplicity)
> I don't see what advantage we would gain by dropping i_mutex for
> fdatawait -- only that writers could proceed while we are waiting, but
> is that really the problem?

It would match what we do in vfs_fsync for the non-nfsd path, so it's a
no-brainer to do it.  In fact I did switch it over to vfs_fsync a while
ago, but that got reverted because it caused deadlocks for
nfsd_sync_dir, which for some reason can't take the i_mutex (I'd have to
check the archives why).
Here's a RFC patch to make some more sense of the fsync callers in nfsd,
including fixing up the data write/wait calling conventions to match the
regular fsync path (which might make this a -stable candidate):

Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c	2009-12-23 09:32:45.693170043 +0100
+++ linux-2.6/fs/nfsd/vfs.c	2009-12-23 09:39:47.627170082 +0100
@@ -769,45 +769,27 @@ nfsd_close(struct file *filp)
 }
 
 /*
- * Sync a file
- * As this calls fsync (not fdatasync) there is no need for a write_inode
- * after it.
+ * Sync a directory to disk.
+ *
+ * This is odd compared to all other fsync callers because we
+ *
+ * a) do not have a file struct available
+ * b) expect to have i_mutex already held by the caller
  */
-static inline int nfsd_dosync(struct file *filp, struct dentry *dp,
-			      const struct file_operations *fop)
+int
+nfsd_sync_dir(struct dentry *dentry)
 {
-	struct inode *inode = dp->d_inode;
-	int (*fsync) (struct file *, struct dentry *, int);
+	struct inode *inode = dentry->d_inode;
 	int err;
 
-	err = filemap_fdatawrite(inode->i_mapping);
-	if (err == 0 && fop && (fsync = fop->fsync))
-		err = fsync(filp, dp, 0);
-	if (err == 0)
-		err = filemap_fdatawait(inode->i_mapping);
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
 
+	err = filemap_write_and_wait(inode->i_mapping);
+	if (err == 0 && inode->i_fop->fsync)
+		err = inode->i_fop->fsync(NULL, dentry, 0);
 	return err;
 }
 
-static int
-nfsd_sync(struct file *filp)
-{
-	int err;
-	struct inode *inode = filp->f_path.dentry->d_inode;
-	dprintk("nfsd: sync file %s\n", filp->f_path.dentry->d_name.name);
-	mutex_lock(&inode->i_mutex);
-	err=nfsd_dosync(filp, filp->f_path.dentry, filp->f_op);
-	mutex_unlock(&inode->i_mutex);
-
-	return err;
-}
-
-int
-nfsd_sync_dir(struct dentry *dp)
-{
-	return nfsd_dosync(NULL, dp, dp->d_inode->i_fop);
-}
-
 /*
  * Obtain the readahead parameters for the file
  * specified by (dev, ino).
@@ -1011,7 +993,7 @@ static int wait_for_concurrent_writes(st
 
 	if (inode->i_state & I_DIRTY) {
 		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
-		err = nfsd_sync(file);
+		err = vfs_fsync(file, file->f_path.dentry, 0);
 	}
 	last_ino = inode->i_ino;
 	last_dev = inode->i_sb->s_dev;
@@ -1180,7 +1162,7 @@ nfsd_commit(struct svc_rqst *rqstp, stru
 		return err;
 	if (EX_ISSYNC(fhp->fh_export)) {
 		if (file->f_op && file->f_op->fsync) {
-			err = nfserrno(nfsd_sync(file));
+			err = nfserrno(vfs_fsync(file, file->f_path.dentry, 0));
 		} else {
 			err = nfserr_notsupp;
 		}
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  [not found] ` <20091223084302.GA14912-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2009-12-23 13:32   ` Jan Kara
  [not found]     ` <20091223133244.GB3159-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-12-23 13:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Wu Fengguang, Steve Rago, Peter Zijlstra,
      linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org,
      Trond.Myklebust@netapp.com, jens.axboe, Peter Staubach,
      Arjan van de Ven, Ingo Molnar, linux-fsdevel

On Wed 23-12-09 03:43:02, Christoph Hellwig wrote:
> On Tue, Dec 22, 2009 at 01:35:39PM +0100, Jan Kara wrote:
> > > nfsd_sync:
> > >   [take i_mutex]
> > >   filemap_fdatawrite  => can also be blocked, but less of a problem
> > >   [drop i_mutex]
> > >   filemap_fdatawait
> > >
> > > Maybe it's a dumb question, but what's the purpose of i_mutex here?
> > > For correctness or to prevent livelock? I can imagine some livelock
> > > problem here (the current implementation can easily wait for extra
> > > pages), however it is not too hard to fix.
> >
> > Generally, most filesystems take i_mutex during fsync to
> >  a) avoid all sorts of livelocking problems
> >  b) serialize fsyncs for one inode (mostly for simplicity)
> > I don't see what advantage we would gain by dropping i_mutex for
> > fdatawait -- only that writers could proceed while we are waiting,
> > but is that really the problem?
>
> It would match what we do in vfs_fsync for the non-nfsd path, so it's a
> no-brainer to do it.  In fact I did switch it over to vfs_fsync a while
> ago, but that got reverted because it caused deadlocks for
> nfsd_sync_dir, which for some reason can't take the i_mutex (I'd have
> to check the archives why).
> Here's a RFC patch to make some more sense of the fsync callers in nfsd,
> including fixing up the data write/wait calling conventions to match the
> regular fsync path (which might make this a -stable candidate):

The patch looks good to me from a general soundness point of view :).
Someone with more NFS knowledge should tell whether dropping i_mutex for
fdatawrite_and_wait is fine for NFS.

								Honza

> [...]

--
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  [not found] ` <20091223133244.GB3159-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
@ 2009-12-24  5:25   ` Wu Fengguang
  0 siblings, 0 replies; 66+ messages in thread
From: Wu Fengguang @ 2009-12-24  5:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Steve Rago, Peter Zijlstra,
      linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org,
      Trond.Myklebust@netapp.com, jens.axboe, Peter Staubach,
      Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org

On Wed, Dec 23, 2009 at 09:32:44PM +0800, Jan Kara wrote:
> On Wed 23-12-09 03:43:02, Christoph Hellwig wrote:
> > On Tue, Dec 22, 2009 at 01:35:39PM +0100, Jan Kara wrote:
> > > Generally, most filesystems take i_mutex during fsync to
> > >  a) avoid all sorts of livelocking problems
> > >  b) serialize fsyncs for one inode (mostly for simplicity)
> > > I don't see what advantage we would gain by dropping i_mutex for
> > > fdatawait -- only that writers could proceed while we are waiting,
> > > but is that really the problem?
> >
> > It would match what we do in vfs_fsync for the non-nfsd path, so it's
> > a no-brainer to do it.  In fact I did switch it over to vfs_fsync a
> > while ago, but that got reverted because it caused deadlocks for
> > nfsd_sync_dir, which for some reason can't take the i_mutex (I'd have
> > to check the archives why).
> > Here's a RFC patch to make some more sense of the fsync callers in
> > nfsd, including fixing up the data write/wait calling conventions to
> > match the regular fsync path (which might make this a -stable
> > candidate):
>
> The patch looks good to me from a general soundness point of view :).
> Someone with more NFS knowledge should tell whether dropping i_mutex for
> fdatawrite_and_wait is fine for NFS.

I believe it's safe to drop i_mutex for fdatawrite_and_wait(), because
NFS does:

1) client: collect all unstable pages (which the server ACKed have
   reached its page cache)
2) client: send COMMIT
3) server: fdatawrite_and_wait(), which makes sure the pages in 1) get
   cleaned
4) client: put all pages collected in 1) into the clean state

So there's no need to take i_mutex to prevent concurrent write/commits.

If someone else concurrently truncates and then extends i_size, the NFS
verf will be changed and thus the client will resend the pages? (Whether
it should overwrite the pages is another problem..)

Thanks,
Fengguang

> > [...]
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  2009-12-22 12:35 ` Jan Kara
  [not found]   ` <20091222123538.GB604-jyMamyUUXNJG4ohzP4jBZS1Fcj925eT/@public.gmane.org>
@ 2009-12-24  1:26   ` Wu Fengguang
  1 sibling, 0 replies; 66+ messages in thread
From: Wu Fengguang @ 2009-12-24  1:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Steve Rago, Peter Zijlstra, linux-nfs@vger.kernel.org,
      linux-kernel@vger.kernel.org, Trond.Myklebust@netapp.com,
      jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar,
      linux-fsdevel@vger.kernel.org

On Tue, Dec 22, 2009 at 08:35:39PM +0800, Jan Kara wrote:
> > 2) NFS commit stops the pipeline because it sleeps and waits inside
> >    i_mutex, which blocks all other NFSDs trying to write/writeback
> >    the inode.
> >
> >	nfsd_sync:
> >	  take i_mutex
> >	  filemap_fdatawrite
> >	  filemap_fdatawait
> >	  drop i_mutex
>
> I believe this is unrelated to the problem Steve is trying to solve.
> When we get to doing sync writes the performance is busted, so we'd
> better not get there in the first place (unless the user asked for it,
> of course).

Yes, the first priority is always to reduce the COMMITs and the number
of writeback pages they submit under WB_SYNC_ALL. And I guess the
"increase write chunk beyond 128MB" patches can serve it well.

The i_mutex should impact NFS write performance for a single big copy in
this way: pdflush submits many (4MB write, 1 commit) pairs; because the
write and commit each take i_mutex, this effectively limits the server
side IO queue depth to <=4MB: the next 4MB of dirty data won't reach the
page cache until the previous 4MB is completely synced to disk.

There are two kinds of inefficiency here:
- the small queue depth
- the interleaved use of CPU/DISK:

	loop {
		write 4MB	=> normally only CPU
		writeback 4MB	=> mostly disk
	}

When writing many small dirty files _plus_ one big file, there will
still be interleaved write/writeback: the 4MB write will be broken into
8 NFS writes with the default wsize=524288.

So there may be one nfsd doing COMMIT, and another 7 nfsds waiting for
the big file's i_mutex. All 8 nfsds are "busy" and the pipeline is
destroyed. Just a possibility.

> > If filemap_fdatawait() can be moved out of i_mutex (or the lock just
> > removed), we solve the root problem:
> >
> >	nfsd_sync:
> >	  [take i_mutex]
> >	  filemap_fdatawrite  => can also be blocked, but less of a problem
> >	  [drop i_mutex]
> >	  filemap_fdatawait
>
> Generally, most filesystems take i_mutex during fsync to
>  a) avoid all sorts of livelocking problems
>  b) serialize fsyncs for one inode (mostly for simplicity)
> I don't see what advantage we would gain by dropping i_mutex for
> fdatawait -- only that writers could proceed while we are waiting, but
> is that really the problem?

The i_mutex at least has some performance impact. Another one would be
WB_SYNC_ALL. Both are related to the COMMIT/sync-write behavior. Are
there some other _direct_ causes?

Thanks,
Fengguang
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  2009-12-22  1:59 ` [PATCH] improve the performance of large sequential write NFS workloads Wu Fengguang
  2009-12-22 12:35 ` Jan Kara
@ 2009-12-22 16:41 ` Steve Rago
  2009-12-24  1:21   ` Wu Fengguang
  2009-12-23 14:21 ` Trond Myklebust
  2 siblings, 1 reply; 66+ messages in thread
From: Steve Rago @ 2009-12-22 16:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-nfs@vger.kernel.org,
      linux-kernel@vger.kernel.org, Trond.Myklebust@netapp.com,
      jens.axboe, Peter Staubach, Jan Kara, Arjan van de Ven,
      Ingo Molnar, linux-fsdevel

On Tue, 2009-12-22 at 09:59 +0800, Wu Fengguang wrote:
> Let me try to reiterate the problem with code; please correct me if I'm
> wrong.
>
> 1) normal fs sets I_DIRTY_DATASYNC when extending i_size, however NFS
>    will set the flag for any pages written -- why this trick? To
>    guarantee the call of nfs_commit_inode()? Which unfortunately turns
>    almost every server side NFS write into sync writes..

Not really.  The commit needs to be sent, but the writes are still
asynchronous.  It's just that the pages can't be recycled until they are
on stable storage.

> 2) NFS commit stops the pipeline because it sleeps and waits inside
>    i_mutex, which blocks all other NFSDs trying to write/writeback the
>    inode.
>
>	nfsd_sync:
>	  take i_mutex
>	  filemap_fdatawrite
>	  filemap_fdatawait
>	  drop i_mutex
>
> [...]
>
> Maybe it's a dumb question, but what's the purpose of i_mutex here?
> For correctness or to prevent livelock? I can imagine some livelock
> problem here (the current implementation can easily wait for extra
> pages), however it is not too hard to fix.

Commits and writes on the same inode need to be serialized for
consistency (a write can change the data and metadata; a commit [fsync]
needs to provide guarantees that the written data are stable).  The
performance problem arises because NFS writes are fast (they generally
just deposit data into the server's page cache), but commits can take a
long time, especially if there is a lot of cached data to flush to
stable storage.

> The proposed patch essentially takes two actions in nfs_file_write():
> - start writeback when the per-file nr_dirty goes high, without
>   committing
> - throttle dirtying when the per-file nr_writeback goes high
>
> I guess this effectively prevents pdflush from kicking in with its bad
> committing behavior.
>
> In general it's reasonable to keep NFS per-file nr_dirty low, but
> questionable to do per-file nr_writeback throttling. This does not
> work well with the global limits -- e.g. when there are many dirty
> files, the summed-up nr_writeback will still grow out of control.

Not with the eager writeback patch.  The nr_writeback for NFS is limited
by the woutstanding tunable parameter multiplied by the number of active
NFS files being written.

> And it's more likely to impact user-visible responsiveness than a
> global limit. But my opinion may be biased -- I have a patch to do a
> global NFS nr_writeback limit ;)

What affects user-visible responsiveness is avoiding long delays and
avoiding delays that vary widely.  Whether the limit is global or
per-file is less important (but I'd be happy to be convinced otherwise).

Steve
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  2009-12-22 16:41 ` Steve Rago
@ 2009-12-24  1:21   ` Wu Fengguang
  2009-12-24 14:49     ` Steve Rago
  0 siblings, 1 reply; 66+ messages in thread
From: Wu Fengguang @ 2009-12-24  1:21 UTC (permalink / raw)
  To: Steve Rago
  Cc: Peter Zijlstra, linux-nfs@vger.kernel.org,
      linux-kernel@vger.kernel.org, Trond.Myklebust@netapp.com,
      jens.axboe, Peter Staubach, Jan Kara, Arjan van de Ven,
      Ingo Molnar, linux-fsdevel@vger.kernel.org

On Wed, Dec 23, 2009 at 12:41:53AM +0800, Steve Rago wrote:
> On Tue, 2009-12-22 at 09:59 +0800, Wu Fengguang wrote:
> > 1) normal fs sets I_DIRTY_DATASYNC when extending i_size, however NFS
> >    will set the flag for any pages written -- why this trick? To
> >    guarantee the call of nfs_commit_inode()? Which unfortunately turns
> >    almost every server side NFS write into sync writes..

Ah, sorry for the typo. Here I mean: the commits by pdflush turn most
server-side NFS _writeback_ into sync writeback (ie, datawrite+datawait,
with WB_SYNC_ALL).

Just to clarify it:
	write     = from user buffer to page cache
	writeback = from page cache to disk

> Not really.  The commit needs to be sent, but the writes are still
> asynchronous.  It's just that the pages can't be recycled until they
> are on stable storage.

Right.

> Commits and writes on the same inode need to be serialized for
> consistency (a write can change the data and metadata; a commit [fsync]
> needs to provide guarantees that the written data are stable).  The
> performance problem arises because NFS writes are fast (they generally
> just deposit data into the server's page cache), but commits can take a

Right.

> long time, especially if there is a lot of cached data to flush to
> stable storage.

"A lot of cached data to flush" is not likely with pdflush, since it
roughly sends one COMMIT per 4MB of WRITEs. So on average each COMMIT
syncs 4MB at the server side.

Your patch adds another pre-pdflush async write logic, which greatly
reduced the number of COMMITs by pdflush. Can this be the major factor
of the performance gain?

Jan has been proposing to change the pdflush logic from

	loop over dirty files {
		writeback 4MB
		write_inode
	}

to

	loop over dirty files {
		writeback all its dirty pages
		write_inode
	}

This should also be able to reduce the COMMIT numbers. I wonder if this
(more general) approach can achieve the same performance gain.

> > In general it's reasonable to keep NFS per-file nr_dirty low, however
> > questionable to do per-file nr_writeback throttling. This does not
> > work well with the global limits - eg. when there are many dirty
> > files, the summed-up nr_writeback will still grow out of control.
>
> Not with the eager writeback patch.
The nr_writeback for NFS is limited > by the woutstanding tunable parameter multiplied by the number of active > NFS files being written. Ah yes - _active_ files. That makes it less likely, but still possible. Imagine the summed-up nr_dirty exceeds global limit, and pdflush wakes up. It will cycle through all dirty files and make them all in active NFS write.. It's only a possibility though - NFS writes are fast in normal. > > And it's more likely to impact user visible responsiveness than > > a global limit. But my opinion can be biased -- me have a patch to > > do global NFS nr_writeback limit ;) > > What affects user-visible responsiveness is avoiding long delays and > avoiding delays that vary widely. Whether the limit is global or > per-file is less important (but I'd be happy to be convinced otherwise). For example, one solution is to have max_global_writeback and another is to have max_file_writeback. Then their default values may be max_file_writeback = max_global_writeback / 10 Obviously the smaller max_global_writeback is more likely to block users when active write files < 10, which is the common case. Or, in this fake workload (spike writes from time to time), for i in `seq 1 100` do cp 10MB-$i /nfs/ sleep 1s done When you have 5MB max_file_writeback, the copies will be bumpy, while the max_global_writeback will never kick in.. Note that there is another difference: your per-file nr_writeback throttles _dirtying_ process, while my per-NFS-mount nr_writeback throttles pdflush (then indirectly throttles application). Thanks, Fengguang ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2009-12-24 1:21 ` Wu Fengguang @ 2009-12-24 14:49 ` Steve Rago 2009-12-25 7:37 ` Wu Fengguang 0 siblings, 1 reply; 66+ messages in thread From: Steve Rago @ 2009-12-24 14:49 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org, jens.axboe, Peter Staubach, Jan Kara, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, 2009-12-24 at 09:21 +0800, Wu Fengguang wrote: > > Commits and writes on the same inode need to be serialized for > > consistency (write can change the data and metadata; commit [fsync] > > needs to provide guarantees that the written data are stable). The > > performance problem arises because NFS writes are fast (they generally > > just deposit data into the server's page cache), but commits can take a > > Right. > > > long time, especially if there is a lot of cached data to flush to > > stable storage. > > "a lot of cached data to flush" is not likely with pdflush, since it > roughly send one COMMIT per 4MB WRITEs. So in average each COMMIT > syncs 4MB at the server side. Maybe on paper, but empirically I see anywhere from one commit per 8MB to one commit per 64 MB. > > Your patch adds another pre-pdlush async write logic, which greatly > reduced the number of COMMITs by pdflush. Can this be the major factor > of the performance gain? My patch removes pdflush from the picture almost entirely. See my comments below. > > Jan has been proposing to change the pdflush logic from > > loop over dirty files { > writeback 4MB > write_inode > } > to > loop over dirty files { > writeback all its dirty pages > write_inode > } > > This should also be able to reduce the COMMIT numbers. I wonder if > this (more general) approach can achieve the same performance gain. 
The pdflush mechanism is fine for random writes and small sequential writes, because it promotes concurrency -- instead of the application blocking while it tries to write and commit its data, the application can go on doing other more useful things, and the data gets flushed in the background. There is also a benefit if the application makes another modification to a page that is already dirty, because then multiple modifications are coalesced into a single write. However, the pdflush mechanism is wrong for large sequential writes (like a backup stream, for example). First, there is no concurrency to exploit -- the application is only going to dirty more pages, so removing the need for it to block writing the pages out only adds to the problem of memory pressure. Second, the application is not going to go back and modify a page it has already written, so leaving it in the cache for someone else to write provides no additional benefit. Note that this assumes the application actually cares about the consistency of its data and will call fsync() when it is done. If the application doesn't call fsync(), then it doesn't matter when the pages are written to backing store, because the interface makes no guarantees in this case. Thanks, Steve -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2009-12-24 14:49 ` Steve Rago @ 2009-12-25 7:37 ` Wu Fengguang 0 siblings, 0 replies; 66+ messages in thread From: Wu Fengguang @ 2009-12-25 7:37 UTC (permalink / raw) To: Steve Rago Cc: Peter Zijlstra, linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org, Trond.Myklebust@netapp.com, jens.axboe, Peter Staubach, Jan Kara, Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org On Thu, Dec 24, 2009 at 10:49:40PM +0800, Steve Rago wrote: > > On Thu, 2009-12-24 at 09:21 +0800, Wu Fengguang wrote: > > > > Commits and writes on the same inode need to be serialized for > > > consistency (write can change the data and metadata; commit [fsync] > > > needs to provide guarantees that the written data are stable). The > > > performance problem arises because NFS writes are fast (they generally > > > just deposit data into the server's page cache), but commits can take a > > > > Right. > > > > > long time, especially if there is a lot of cached data to flush to > > > stable storage. > > > > "a lot of cached data to flush" is not likely with pdflush, since it > > roughly send one COMMIT per 4MB WRITEs. So in average each COMMIT > > syncs 4MB at the server side. > > Maybe on paper, but empirically I see anywhere from one commit per 8MB > to one commit per 64 MB. Thanks for the data. It seems that your CPU works faster than network, so that non of the NFS writes (submitted by L543) return by the time we try to COMMIT at L547. 543 ret = do_writepages(mapping, wbc); 544 545 /* Don't write the inode if only I_DIRTY_PAGES was set */ 546 if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { 547 int err = write_inode(inode, wait); Thus pdflush is able to do several rounds of do_writepages() before write_inode() can actually collect some pages to be COMMITed. > > > > Your patch adds another pre-pdlush async write logic, which greatly > > reduced the number of COMMITs by pdflush. 
Can this be the major factor > > of the performance gain? > > My patch removes pdflush from the picture almost entirely. See my > comments below. Yes for sequential async writes, so I said "pre-pdflush" :) > > > > Jan has been proposing to change the pdflush logic from > > > > loop over dirty files { > > writeback 4MB > > write_inode > > } > > to > > loop over dirty files { > > writeback all its dirty pages > > write_inode > > } > > > > This should also be able to reduce the COMMIT numbers. I wonder if > > this (more general) approach can achieve the same performance gain. > > The pdflush mechanism is fine for random writes and small sequential > writes, because it promotes concurrency -- instead of the application > blocking while it tries to write and commit its data, the application > can go on doing other more useful things, and the data gets flushed in > the background. There is also a benefit if the application makes > another modification to a page that is already dirty, because then > multiple modifications are coalesced into a single write. Right. > However, the pdflush mechanism is wrong for large sequential writes > (like a backup stream, for example). First, there is no concurrency to > exploit -- the application is only going to dirty more pages, so > removing the need for it to block writing the pages out only adds to the > problem of memory pressure. Second, the application is not going to go > back and modify a page it has already written, so leaving it in the > cache for someone else to write provides no additional benefit. Well, in general pdflush does more good than bad, that's why we need it. The above two reasons are about "pdflush is not as helpful", but not that it is wrong. That said, I do agree to limit the per-file dirty pages for NFS -- because it tends to flush before simple stat/read operations, which could be costly. 
> Note that this assumes the application actually cares about the > consistency of its data and will call fsync() when it is done. If the > application doesn't call fsync(), then it doesn't matter when the pages > are written to backing store, because the interface makes no guarantees > in this case. Thanks, Fengguang ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2009-12-22 1:59 ` [PATCH] improve the performance of large sequential write NFS workloads Wu Fengguang 2009-12-22 12:35 ` Jan Kara 2009-12-22 16:41 ` Steve Rago @ 2009-12-23 14:21 ` Trond Myklebust 2009-12-23 18:05 ` Jan Kara 2 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2009-12-23 14:21 UTC (permalink / raw) To: Wu Fengguang Cc: Steve Rago, Peter Zijlstra, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Jan Kara, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Tue, 2009-12-22 at 09:59 +0800, Wu Fengguang wrote: > 1) normal fs sets I_DIRTY_DATASYNC when extending i_size, however NFS > will set the flag for any pages written -- why this trick? To > guarantee the call of nfs_commit_inode()? Which unfortunately turns > almost every server side NFS write into sync writes.. > > writeback_single_inode: > do_writepages > nfs_writepages > nfs_writepage ----[short time later]---> nfs_writeback_release* > nfs_mark_request_commit > __mark_inode_dirty(I_DIRTY_DATASYNC); > > if (I_DIRTY_SYNC || I_DIRTY_DATASYNC) <---- so this will be true for most time > write_inode > nfs_write_inode > nfs_commit_inode I have been working on a fix for this. We basically do want to ensure that NFS calls commit (otherwise we're not finished cleaning the dirty pages), but we want to do it _after_ we've waited for all the writes to complete. See below... Trond ------------------------------------------------------------------------------------------------------ VFS: Add a new inode state: I_UNSTABLE_PAGES From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Add a new inode state to enable the vfs to commit the nfs unstable pages to stable storage once the write back of dirty pages is done. 
Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> --- fs/fs-writeback.c | 24 ++++++++++++++++++++++-- fs/nfs/inode.c | 13 +++++-------- fs/nfs/write.c | 2 +- include/linux/fs.h | 7 +++++++ 4 files changed, 35 insertions(+), 11 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 49bc1b8..c035efe 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -388,6 +388,14 @@ static int write_inode(struct inode *inode, int sync) } /* + * Commit the NFS unstable pages. + */ +static int commit_unstable_pages(struct inode *inode, int wait) +{ + return write_inode(inode, wait); +} + +/* * Wait for writeback on an inode to complete. */ static void inode_wait_for_writeback(struct inode *inode) @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) } spin_lock(&inode_lock); + /* + * Special state for cleaning NFS unstable pages + */ + if (inode->i_state & I_UNSTABLE_PAGES) { + int err; + inode->i_state &= ~I_UNSTABLE_PAGES; + spin_unlock(&inode_lock); + err = commit_unstable_pages(inode, wait); + if (ret == 0) + ret = err; + spin_lock(&inode_lock); + } inode->i_state &= ~I_SYNC; if (!(inode->i_state & (I_FREEING | I_CLEAR))) { if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) { @@ -481,7 +501,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) * More pages get dirtied by a fast dirtier. */ goto select_queue; - } else if (inode->i_state & I_DIRTY) { + } else if (inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES)) { /* * At least XFS will redirty the inode during the * writeback (delalloc) and on io completion (isize).
@@ -1050,7 +1070,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) spin_lock(&inode_lock); if ((inode->i_state & flags) != flags) { - const int was_dirty = inode->i_state & I_DIRTY; + const int was_dirty = inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES); inode->i_state |= flags; diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index faa0918..4f129b3 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -99,17 +99,14 @@ u64 nfs_compat_user_ino64(u64 fileid) int nfs_write_inode(struct inode *inode, int sync) { + int flags = 0; int ret; - if (sync) { - ret = filemap_fdatawait(inode->i_mapping); - if (ret == 0) - ret = nfs_commit_inode(inode, FLUSH_SYNC); - } else - ret = nfs_commit_inode(inode, 0); - if (ret >= 0) + if (sync) + flags = FLUSH_SYNC; + ret = nfs_commit_inode(inode, flags); + if (ret > 0) return 0; - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); return ret; } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index d171696..2f74e44 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) spin_unlock(&inode->i_lock); inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); + mark_inode_unstable_pages(inode); } static int diff --git a/include/linux/fs.h b/include/linux/fs.h index cca1919..ab01af0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1637,6 +1637,8 @@ struct super_operations { #define I_CLEAR 64 #define __I_SYNC 7 #define I_SYNC (1 << __I_SYNC) +#define __I_UNSTABLE_PAGES 9 +#define I_UNSTABLE_PAGES (1 << __I_UNSTABLE_PAGES) #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) @@ -1651,6 +1653,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode) __mark_inode_dirty(inode, I_DIRTY_SYNC); } +static inline void mark_inode_unstable_pages(struct inode *inode) +{ + __mark_inode_dirty(inode, I_UNSTABLE_PAGES); +} + /** * inc_nlink - directly increment 
an inode's link count * @inode: inode ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2009-12-23 14:21 ` Trond Myklebust @ 2009-12-23 18:05 ` Jan Kara [not found] ` <20091223180551.GD3159-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Jan Kara @ 2009-12-23 18:05 UTC (permalink / raw) To: Trond Myklebust Cc: Wu Fengguang, Steve Rago, Peter Zijlstra, linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org, jens.axboe, Peter Staubach, Jan Kara, Arjan van de Ven, Ingo Molnar, linux-fsdevel On Wed 23-12-09 15:21:47, Trond Myklebust wrote: > On Tue, 2009-12-22 at 09:59 +0800, Wu Fengguang wrote: > > 1) normal fs sets I_DIRTY_DATASYNC when extending i_size, however NFS > > will set the flag for any pages written -- why this trick? To > > guarantee the call of nfs_commit_inode()? Which unfortunately turns > > almost every server side NFS write into sync writes.. > > > > writeback_single_inode: > > do_writepages > > nfs_writepages > > nfs_writepage ----[short time later]---> nfs_writeback_release* > > nfs_mark_request_commit > > __mark_inode_dirty(I_DIRTY_DATASYNC); > > > > if (I_DIRTY_SYNC || I_DIRTY_DATASYNC) <---- so this will be true for most time > > write_inode > > nfs_write_inode > > nfs_commit_inode > > > I have been working on a fix for this. We basically do want to ensure > that NFS calls commit (otherwise we're not finished cleaning the dirty > pages), but we want to do it _after_ we've waited for all the writes to > complete. See below... > > Trond > > ------------------------------------------------------------------------------------------------------ > VFS: Add a new inode state: I_UNSTABLE_PAGES > > From: Trond Myklebust <Trond.Myklebust@netapp.com> > > Add a new inode state to enable the vfs to commit the nfs unstable pages to > stable storage once the write back of dirty pages is done. Hmm, does your patch really help? 
> @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > } > > spin_lock(&inode_lock); > + /* > + * Special state for cleaning NFS unstable pages > + */ > + if (inode->i_state & I_UNSTABLE_PAGES) { > + int err; > + inode->i_state &= ~I_UNSTABLE_PAGES; > + spin_unlock(&inode_lock); > + err = commit_unstable_pages(inode, wait); > + if (ret == 0) > + ret = err; > + spin_lock(&inode_lock); > + } I don't quite understand this chunk: We've called writeback_single_inode because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few lines above your chunk, we've called nfs_write_inode which sent commit to the server. Now here you sometimes send the commit again? What's the purpose? > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c > index faa0918..4f129b3 100644 > --- a/fs/nfs/inode.c > +++ b/fs/nfs/inode.c > @@ -99,17 +99,14 @@ u64 nfs_compat_user_ino64(u64 fileid) > > int nfs_write_inode(struct inode *inode, int sync) > { > + int flags = 0; > int ret; > > - if (sync) { > - ret = filemap_fdatawait(inode->i_mapping); > - if (ret == 0) > - ret = nfs_commit_inode(inode, FLUSH_SYNC); > - } else > - ret = nfs_commit_inode(inode, 0); > - if (ret >= 0) > + if (sync) > + flags = FLUSH_SYNC; > + ret = nfs_commit_inode(inode, flags); > + if (ret > 0) > return 0; > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > return ret; > } > Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads [not found] ` <20091223180551.GD3159-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> @ 2009-12-23 19:12 ` Trond Myklebust 2009-12-24 2:52 ` Wu Fengguang 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2009-12-23 19:12 UTC (permalink / raw) To: Jan Kara Cc: Wu Fengguang, Steve Rago, Peter Zijlstra, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote: > On Wed 23-12-09 15:21:47, Trond Myklebust wrote: > > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > > } > > > > spin_lock(&inode_lock); > > + /* > > + * Special state for cleaning NFS unstable pages > > + */ > > + if (inode->i_state & I_UNSTABLE_PAGES) { > > + int err; > > + inode->i_state &= ~I_UNSTABLE_PAGES; > > + spin_unlock(&inode_lock); > > + err = commit_unstable_pages(inode, wait); > > + if (ret == 0) > > + ret = err; > > + spin_lock(&inode_lock); > > + } > I don't quite understand this chunk: We've called writeback_single_inode > because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few > lines above your chunk, we've called nfs_write_inode which sent commit to > the server. Now here you sometimes send the commit again? What's the > purpose? We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later I_UNSTABLE_PAGES). The point is that we now do the commit only _after_ we've sent all the dirty pages, and waited for writeback to complete, whereas previously we did it in the wrong order. 
Cheers Trond ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2009-12-23 19:12 ` Trond Myklebust @ 2009-12-24 2:52 ` Wu Fengguang 2009-12-24 12:04 ` Trond Myklebust 0 siblings, 1 reply; 66+ messages in thread From: Wu Fengguang @ 2009-12-24 2:52 UTC (permalink / raw) To: Trond Myklebust Cc: Jan Kara, Steve Rago, Peter Zijlstra, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Trond, On Thu, Dec 24, 2009 at 03:12:54AM +0800, Trond Myklebust wrote: > On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote: > > On Wed 23-12-09 15:21:47, Trond Myklebust wrote: > > > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > > > } > > > > > > spin_lock(&inode_lock); > > > + /* > > > + * Special state for cleaning NFS unstable pages > > > + */ > > > + if (inode->i_state & I_UNSTABLE_PAGES) { > > > + int err; > > > + inode->i_state &= ~I_UNSTABLE_PAGES; > > > + spin_unlock(&inode_lock); > > > + err = commit_unstable_pages(inode, wait); > > > + if (ret == 0) > > > + ret = err; > > > + spin_lock(&inode_lock); > > > + } > > I don't quite understand this chunk: We've called writeback_single_inode > > because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few > > lines above your chunk, we've called nfs_write_inode which sent commit to > > the server. Now here you sometimes send the commit again? What's the > > purpose? > > We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later > I_UNSTABLE_PAGES). > > The point is that we now do the commit only _after_ we've sent all the > dirty pages, and waited for writeback to complete, whereas previously we > did it in the wrong order. Sorry I still don't get it. The timing used to be: write 4MB ==> WRITE block 0 (ie. 
first 512KB) WRITE block 1 WRITE block 2 WRITE block 3 ack from server for WRITE block 0 => mark 0 as unstable (inode marked need-commit) WRITE block 4 ack from server for WRITE block 1 => mark 1 as unstable WRITE block 5 ack from server for WRITE block 2 => mark 2 as unstable WRITE block 6 ack from server for WRITE block 3 => mark 3 as unstable WRITE block 7 ack from server for WRITE block 4 => mark 4 as unstable ack from server for WRITE block 5 => mark 5 as unstable write_inode ==> COMMIT blocks 0-5 ack from server for WRITE block 6 => mark 6 as unstable (inode marked need-commit) ack from server for WRITE block 7 => mark 7 as unstable ack from server for COMMIT blocks 0-5 => mark 0-5 as clean write_inode ==> COMMIT blocks 6-7 ack from server for COMMIT blocks 6-7 => mark 6-7 as clean Note that the first COMMIT is submitted before receiving all ACKs for the previous writes, hence the second COMMIT is necessary. It seems that your patch does not improve the timing at all. Thanks, Fengguang -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2009-12-24 2:52 ` Wu Fengguang @ 2009-12-24 12:04 ` Trond Myklebust 2009-12-25 5:56 ` Wu Fengguang 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2009-12-24 12:04 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, Steve Rago, Peter Zijlstra, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, 2009-12-24 at 10:52 +0800, Wu Fengguang wrote: > Trond, > > On Thu, Dec 24, 2009 at 03:12:54AM +0800, Trond Myklebust wrote: > > On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote: > > > On Wed 23-12-09 15:21:47, Trond Myklebust wrote: > > > > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > > > > } > > > > > > > > spin_lock(&inode_lock); > > > > + /* > > > > + * Special state for cleaning NFS unstable pages > > > > + */ > > > > + if (inode->i_state & I_UNSTABLE_PAGES) { > > > > + int err; > > > > + inode->i_state &= ~I_UNSTABLE_PAGES; > > > > + spin_unlock(&inode_lock); > > > > + err = commit_unstable_pages(inode, wait); > > > > + if (ret == 0) > > > > + ret = err; > > > > + spin_lock(&inode_lock); > > > > + } > > > I don't quite understand this chunk: We've called writeback_single_inode > > > because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few > > > lines above your chunk, we've called nfs_write_inode which sent commit to > > > the server. Now here you sometimes send the commit again? What's the > > > purpose? > > > > We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later > > I_UNSTABLE_PAGES). > > > > The point is that we now do the commit only _after_ we've sent all the > > dirty pages, and waited for writeback to complete, whereas previously we > > did it in the wrong order. > > Sorry I still don't get it. 
The timing used to be: > > write 4MB ==> WRITE block 0 (ie. first 512KB) > WRITE block 1 > WRITE block 2 > WRITE block 3 ack from server for WRITE block 0 => mark 0 as unstable (inode marked need-commit) > WRITE block 4 ack from server for WRITE block 1 => mark 1 as unstable > WRITE block 5 ack from server for WRITE block 2 => mark 2 as unstable > WRITE block 6 ack from server for WRITE block 3 => mark 3 as unstable > WRITE block 7 ack from server for WRITE block 4 => mark 4 as unstable > ack from server for WRITE block 5 => mark 5 as unstable > write_inode ==> COMMIT blocks 0-5 > ack from server for WRITE block 6 => mark 6 as unstable (inode marked need-commit) > ack from server for WRITE block 7 => mark 7 as unstable > > ack from server for COMMIT blocks 0-5 => mark 0-5 as clean > > write_inode ==> COMMIT blocks 6-7 > > ack from server for COMMIT blocks 6-7 => mark 6-7 as clean > > Note that the first COMMIT is submitted before receiving all ACKs for > the previous writes, hence the second COMMIT is necessary. It seems > that your patch does not improve the timing at all. That would indicate that we're cycling through writeback_single_inode() more than once. Why? Trond -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2009-12-24 12:04 ` Trond Myklebust @ 2009-12-25 5:56 ` Wu Fengguang 2009-12-30 16:22 ` Trond Myklebust 0 siblings, 1 reply; 66+ messages in thread From: Wu Fengguang @ 2009-12-25 5:56 UTC (permalink / raw) To: Trond Myklebust Cc: Jan Kara, Steve Rago, Peter Zijlstra, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Dec 24, 2009 at 08:04:41PM +0800, Trond Myklebust wrote: > On Thu, 2009-12-24 at 10:52 +0800, Wu Fengguang wrote: > > Trond, > > > > On Thu, Dec 24, 2009 at 03:12:54AM +0800, Trond Myklebust wrote: > > > On Wed, 2009-12-23 at 19:05 +0100, Jan Kara wrote: > > > > On Wed 23-12-09 15:21:47, Trond Myklebust wrote: > > > > > @@ -474,6 +482,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > > > > > } > > > > > > > > > > spin_lock(&inode_lock); > > > > > + /* > > > > > + * Special state for cleaning NFS unstable pages > > > > > + */ > > > > > + if (inode->i_state & I_UNSTABLE_PAGES) { > > > > > + int err; > > > > > + inode->i_state &= ~I_UNSTABLE_PAGES; > > > > > + spin_unlock(&inode_lock); > > > > > + err = commit_unstable_pages(inode, wait); > > > > > + if (ret == 0) > > > > > + ret = err; > > > > > + spin_lock(&inode_lock); > > > > > + } > > > > I don't quite understand this chunk: We've called writeback_single_inode > > > > because it had some dirty pages. Thus it has I_DIRTY_DATASYNC set and a few > > > > lines above your chunk, we've called nfs_write_inode which sent commit to > > > > the server. Now here you sometimes send the commit again? What's the > > > > purpose? > > > > > > We no longer set I_DIRTY_DATASYNC. We only set I_DIRTY_PAGES (and later > > > I_UNSTABLE_PAGES). 
> > >
> > > The point is that we now do the commit only _after_ we've sent all the
> > > dirty pages, and waited for writeback to complete, whereas previously we
> > > did it in the wrong order.
> >
> > Sorry I still don't get it. The timing used to be:
> >
> > write 4MB ==> WRITE block 0 (ie. first 512KB)
> >               WRITE block 1
> >               WRITE block 2
> >               WRITE block 3     ack from server for WRITE block 0 => mark 0 as unstable (inode marked need-commit)
> >               WRITE block 4     ack from server for WRITE block 1 => mark 1 as unstable
> >               WRITE block 5     ack from server for WRITE block 2 => mark 2 as unstable
> >               WRITE block 6     ack from server for WRITE block 3 => mark 3 as unstable
> >               WRITE block 7     ack from server for WRITE block 4 => mark 4 as unstable
> >                                 ack from server for WRITE block 5 => mark 5 as unstable
> > write_inode ==> COMMIT blocks 0-5
> >                                 ack from server for WRITE block 6 => mark 6 as unstable (inode marked need-commit)
> >                                 ack from server for WRITE block 7 => mark 7 as unstable
> >
> > ack from server for COMMIT blocks 0-5 => mark 0-5 as clean
> >
> > write_inode ==> COMMIT blocks 6-7
> >
> > ack from server for COMMIT blocks 6-7 => mark 6-7 as clean
> >
> > Note that the first COMMIT is submitted before receiving all ACKs for
> > the previous writes, hence the second COMMIT is necessary. It seems
> > that your patch does not improve the timing at all.
>
> That would indicate that we're cycling through writeback_single_inode()
> more than once. Why?

Yes. The above sequence can happen for a 4MB sized dirty file.
The first COMMIT is submitted at L547 below, while the second COMMIT
will be scheduled either by __mark_inode_dirty() or by the requeue at
L583, depending on when the ACKs for the WRITEs issued at L543 (but
missed by the COMMIT at L547) arrive: if an ACK arrives after the check
at L578, the inode will be queued into the b_dirty list; if any ACK
arrives between L547 and L578, the inode will enter b_more_io_wait,
a to-be-introduced new dirty list.
        537         dirty = inode->i_state & I_DIRTY;
        538         inode->i_state |= I_SYNC;
        539         inode->i_state &= ~I_DIRTY;
        540
        541         spin_unlock(&inode_lock);
        542
    ==> 543         ret = do_writepages(mapping, wbc);
        544
        545         /* Don't write the inode if only I_DIRTY_PAGES was set */
        546         if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
    ==> 547                 int err = write_inode(inode, wait);
        548                 if (ret == 0)
        549                         ret = err;
        550         }
        551
        552         if (wait) {
        553                 int err = filemap_fdatawait(mapping);
        554                 if (ret == 0)
        555                         ret = err;
        556         }
        557
        558         spin_lock(&inode_lock);
        559         inode->i_state &= ~I_SYNC;
        560         if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
        561                 if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
        562                         /*
        563                          * We didn't write back all the pages.  nfs_writepages()
        564                          * sometimes bales out without doing anything.
        565                          */
        566                         inode->i_state |= I_DIRTY_PAGES;
        567                         if (wbc->nr_to_write <= 0) {
        568                                 /*
        569                                  * slice used up: queue for next turn
        570                                  */
        571                                 requeue_io(inode);
        572                         } else {
        573                                 /*
        574                                  * somehow blocked: retry later
        575                                  */
        576                                 requeue_io_wait(inode);
        577                         }
    ==> 578                 } else if (inode->i_state & I_DIRTY) {
        579                         /*
        580                          * At least XFS will redirty the inode during the
        581                          * writeback (delalloc) and on io completion (isize).
        582                          */
    ==> 583                         requeue_io_wait(inode);

Thanks,
Fengguang
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  2009-12-25  5:56 ` Wu Fengguang
@ 2009-12-30 16:22 ` Trond Myklebust
  2009-12-31  5:04 ` Wu Fengguang
  0 siblings, 1 reply; 66+ messages in thread
From: Trond Myklebust @ 2009-12-30 16:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Steve Rago, Peter Zijlstra, linux-nfs@vger.kernel.org,
    linux-kernel@vger.kernel.org, jens.axboe, Peter Staubach,
    Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org

On Fri, 2009-12-25 at 13:56 +0800, Wu Fengguang wrote:
> On Thu, Dec 24, 2009 at 08:04:41PM +0800, Trond Myklebust wrote:
> > That would indicate that we're cycling through writeback_single_inode()
> > more than once. Why?
>
> Yes. The above sequence can happen for a 4MB sized dirty file.
> The first COMMIT is done by L547, while the second COMMIT will be
> scheduled either by __mark_inode_dirty(), or scheduled by L583
> (depending on the time ACKs for L543 but missed L547 arrives:
> if an ACK missed L578, the inode will be queued into b_dirty list,
> but if any ACK arrives between L547 and L578, the inode will enter
> b_more_io_wait, which is a to-be-introduced new dirty list).
>
>         537         dirty = inode->i_state & I_DIRTY;
>         538         inode->i_state |= I_SYNC;
>         539         inode->i_state &= ~I_DIRTY;
>         540
>         541         spin_unlock(&inode_lock);
>         542
>     ==> 543         ret = do_writepages(mapping, wbc);
>         544
>         545         /* Don't write the inode if only I_DIRTY_PAGES was set */
>         546         if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>     ==> 547                 int err = write_inode(inode, wait);
>         548                 if (ret == 0)
>         549                         ret = err;
>         550         }
>         551
>         552         if (wait) {
>         553                 int err = filemap_fdatawait(mapping);
>         554                 if (ret == 0)
>         555                         ret = err;
>         556         }
>         557
>         558         spin_lock(&inode_lock);
>         559         inode->i_state &= ~I_SYNC;
>         560         if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
>         561                 if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>         562                         /*
>         563                          * We didn't write back all the pages.  nfs_writepages()
>         564                          * sometimes bales out without doing anything.
>         565                          */
>         566                         inode->i_state |= I_DIRTY_PAGES;
>         567                         if (wbc->nr_to_write <= 0) {
>         568                                 /*
>         569                                  * slice used up: queue for next turn
>         570                                  */
>         571                                 requeue_io(inode);
>         572                         } else {
>         573                                 /*
>         574                                  * somehow blocked: retry later
>         575                                  */
>         576                                 requeue_io_wait(inode);
>         577                         }
>     ==> 578                 } else if (inode->i_state & I_DIRTY) {
>         579                         /*
>         580                          * At least XFS will redirty the inode during the
>         581                          * writeback (delalloc) and on io completion (isize).
>         582                          */
>     ==> 583                         requeue_io_wait(inode);

Hi Fengguang,

Apologies for having taken time over this. Do you see any improvement
with the appended variant instead? It adds a new address_space_operation
in order to do the commit. Furthermore, it ignores the commit request if
the caller is just doing a WB_SYNC_NONE background flush, waiting
instead for the ensuing WB_SYNC_ALL request...

Cheers
  Trond

--------------------------------------------------------------------------------------------------------
VFS: Add a new inode state: I_UNSTABLE_PAGES

From: Trond Myklebust <Trond.Myklebust@netapp.com>

Add a new inode state to enable the vfs to commit the nfs unstable pages
to stable storage once the write back of dirty pages is done.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/fs-writeback.c  |   27 +++++++++++++++++++++++++--
 fs/nfs/file.c      |    1 +
 fs/nfs/inode.c     |   16 ----------------
 fs/nfs/internal.h  |    3 ++-
 fs/nfs/super.c     |    2 --
 fs/nfs/write.c     |   29 ++++++++++++++++++++++++++++-
 include/linux/fs.h |    9 +++++++++
 7 files changed, 65 insertions(+), 22 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 49bc1b8..24bc817 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -388,6 +388,17 @@ static int write_inode(struct inode *inode, int sync)
 }
 
 /*
+ * Commit the NFS unstable pages.
+ */
+static int commit_unstable_pages(struct address_space *mapping,
+		struct writeback_control *wbc)
+{
+	if (mapping->a_ops && mapping->a_ops->commit_unstable_pages)
+		return mapping->a_ops->commit_unstable_pages(mapping, wbc);
+	return 0;
+}
+
+/*
  * Wait for writeback on an inode to complete.
  */
 static void inode_wait_for_writeback(struct inode *inode)
@@ -474,6 +485,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	}
 
 	spin_lock(&inode_lock);
+	/*
+	 * Special state for cleaning NFS unstable pages
+	 */
+	if (inode->i_state & I_UNSTABLE_PAGES) {
+		int err;
+		inode->i_state &= ~I_UNSTABLE_PAGES;
+		spin_unlock(&inode_lock);
+		err = commit_unstable_pages(mapping, wbc);
+		if (ret == 0)
+			ret = err;
+		spin_lock(&inode_lock);
+	}
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
 		if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
@@ -481,7 +504,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * More pages get dirtied by a fast dirtier.
 			 */
 			goto select_queue;
-		} else if (inode->i_state & I_DIRTY) {
+		} else if (inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES)) {
 			/*
 			 * At least XFS will redirty the inode during the
 			 * writeback (delalloc) and on io completion (isize).
@@ -1050,7 +1073,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 
 	spin_lock(&inode_lock);
 	if ((inode->i_state & flags) != flags) {
-		const int was_dirty = inode->i_state & I_DIRTY;
+		const int was_dirty = inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES);
 
 		inode->i_state |= flags;
 
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 6b89132..67e50ac 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -526,6 +526,7 @@ const struct address_space_operations nfs_file_aops = {
 	.migratepage = nfs_migrate_page,
 	.launder_page = nfs_launder_page,
 	.error_remove_page = generic_error_remove_page,
+	.commit_unstable_pages = nfs_commit_unstable_pages,
 };
 
 /*
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index faa0918..8341709 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -97,22 +97,6 @@ u64 nfs_compat_user_ino64(u64 fileid)
 	return ino;
 }
 
-int nfs_write_inode(struct inode *inode, int sync)
-{
-	int ret;
-
-	if (sync) {
-		ret = filemap_fdatawait(inode->i_mapping);
-		if (ret == 0)
-			ret = nfs_commit_inode(inode, FLUSH_SYNC);
-	} else
-		ret = nfs_commit_inode(inode, 0);
-	if (ret >= 0)
-		return 0;
-	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-	return ret;
-}
-
 void nfs_clear_inode(struct inode *inode)
 {
 	/*
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 29e464d..7bb326f 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -211,7 +211,6 @@ extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask);
 extern struct workqueue_struct *nfsiod_workqueue;
 extern struct inode *nfs_alloc_inode(struct super_block *sb);
 extern void nfs_destroy_inode(struct inode *);
-extern int nfs_write_inode(struct inode *,int);
 extern void nfs_clear_inode(struct inode *);
 #ifdef CONFIG_NFS_V4
 extern void nfs4_clear_inode(struct inode *);
@@ -253,6 +252,8 @@ extern int nfs4_path_walk(struct nfs_server *server,
 extern void nfs_read_prepare(struct rpc_task *task, void *calldata);
 
 /* write.c */
+extern int nfs_commit_unstable_pages(struct address_space *mapping,
+		struct writeback_control *wbc);
 extern void nfs_write_prepare(struct rpc_task *task, void *calldata);
 #ifdef CONFIG_MIGRATION
 extern int nfs_migrate_page(struct address_space *,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index ce907ef..805c1a0 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -265,7 +265,6 @@ struct file_system_type nfs_xdev_fs_type = {
 static const struct super_operations nfs_sops = {
 	.alloc_inode	= nfs_alloc_inode,
 	.destroy_inode	= nfs_destroy_inode,
-	.write_inode	= nfs_write_inode,
 	.statfs		= nfs_statfs,
 	.clear_inode	= nfs_clear_inode,
 	.umount_begin	= nfs_umount_begin,
@@ -334,7 +333,6 @@ struct file_system_type nfs4_referral_fs_type = {
 static const struct super_operations nfs4_sops = {
 	.alloc_inode	= nfs_alloc_inode,
 	.destroy_inode	= nfs_destroy_inode,
-	.write_inode	= nfs_write_inode,
 	.statfs		= nfs_statfs,
 	.clear_inode	= nfs4_clear_inode,
 	.umount_begin	= nfs_umount_begin,
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d171696..187f3a9 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 	spin_unlock(&inode->i_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
-	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	mark_inode_unstable_pages(inode);
 }
 
 static int
@@ -1406,11 +1406,38 @@ int nfs_commit_inode(struct inode *inode, int how)
 	}
 	return res;
 }
+
+int nfs_commit_unstable_pages(struct address_space *mapping,
+		struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	int flags = FLUSH_SYNC;
+	int ret;
+
+	/* Don't commit if this is just a non-blocking flush */
+	if (wbc->sync_mode != WB_SYNC_ALL) {
+		mark_inode_unstable_pages(inode);
+		return 0;
+	}
+	if (wbc->nonblocking)
+		flags = 0;
+	ret = nfs_commit_inode(inode, flags);
+	if (ret > 0)
+		return 0;
+	return ret;
+}
+
 #else
 static inline int nfs_commit_list(struct inode *inode,
 		struct list_head *head, int how)
 {
 	return 0;
 }
+
+int nfs_commit_unstable_pages(struct address_space *mapping,
+		struct writeback_control *wbc)
+{
+	return 0;
+}
 #endif
 
 long nfs_sync_mapping_wait(struct address_space *mapping,
		struct writeback_control *wbc, int how)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9147ca8..ea0b7a3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -602,6 +602,8 @@ struct address_space_operations {
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
+	int (*commit_unstable_pages)(struct address_space *,
+			struct writeback_control *);
 };
 
 /*
@@ -1635,6 +1637,8 @@ struct super_operations {
 #define I_CLEAR			64
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
+#define __I_UNSTABLE_PAGES	9
+#define I_UNSTABLE_PAGES	(1 << __I_UNSTABLE_PAGES)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
@@ -1649,6 +1653,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
 	__mark_inode_dirty(inode, I_DIRTY_SYNC);
 }
 
+static inline void mark_inode_unstable_pages(struct inode *inode)
+{
+	__mark_inode_dirty(inode, I_UNSTABLE_PAGES);
+}
+
 /**
  * inc_nlink - directly increment an inode's link count
  * @inode: inode
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  2009-12-30 16:22 ` Trond Myklebust
@ 2009-12-31  5:04 ` Wu Fengguang
  2009-12-31 19:13 ` Trond Myklebust
  0 siblings, 1 reply; 66+ messages in thread
From: Wu Fengguang @ 2009-12-31  5:04 UTC (permalink / raw)
To: Trond Myklebust
Cc: Jan Kara, Steve Rago, Peter Zijlstra,
    linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
    jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar,
    linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Trond,

On Thu, Dec 31, 2009 at 12:22:48AM +0800, Trond Myklebust wrote:
> it ignores the commit request if the caller is just doing a
> WB_SYNC_NONE background flush, waiting instead for the ensuing
> WB_SYNC_ALL request...

I'm afraid this will block balance_dirty_pages() until explicit
sync/fsync calls: COMMITs are expensive, but if we don't send them
regularly, NR_UNSTABLE_NFS will grow large and block
balance_dirty_pages() as well as throttle_vm_writeout()..

>     +int nfs_commit_unstable_pages(struct address_space *mapping,
>     +		struct writeback_control *wbc)
>     +{
>     +	struct inode *inode = mapping->host;
>     +	int flags = FLUSH_SYNC;
>     +	int ret;
>     +
> ==> +	/* Don't commit if this is just a non-blocking flush */
> ==> +	if (wbc->sync_mode != WB_SYNC_ALL) {
> ==> +		mark_inode_unstable_pages(inode);
> ==> +		return 0;
> ==> +	}
>     +	if (wbc->nonblocking)
>     +		flags = 0;
>     +	ret = nfs_commit_inode(inode, flags);
>     +	if (ret > 0)
>     +		return 0;
>     +	return ret;
>     +}

The NFS protocol provides no painless way to reclaim unstable pages
other than the COMMIT (or sync write).. This leaves us in a dilemma.

We may reasonably reduce the number of COMMITs, and possibly even delay
them for a while (hoping the server has written back the pages before
the COMMIT arrives, which is somewhat fragile).

What we can obviously do is to avoid sending a COMMIT
- when there is already an ongoing COMMIT for the same inode
- or when there are ongoing WRITEs for the inode
  (is there an easy way to detect this?)

What do you think?

Thanks,
Fengguang
---
 fs/nfs/inode.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- linux.orig/fs/nfs/inode.c	2009-12-25 09:25:38.000000000 +0800
+++ linux/fs/nfs/inode.c	2009-12-25 10:13:06.000000000 +0800
@@ -105,8 +105,11 @@ int nfs_write_inode(struct inode *inode,
 		ret = filemap_fdatawait(inode->i_mapping);
 		if (ret == 0)
 			ret = nfs_commit_inode(inode, FLUSH_SYNC);
-	} else
+	} else if (!radix_tree_tagged(&NFS_I(inode)->nfs_page_tree,
+				      NFS_PAGE_TAG_LOCKED))
 		ret = nfs_commit_inode(inode, 0);
+	else
+		ret = -EAGAIN;
 	if (ret >= 0)
 		return 0;
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  2009-12-31  5:04 ` Wu Fengguang
@ 2009-12-31 19:13 ` Trond Myklebust
  2010-01-06  3:03 ` Wu Fengguang
  0 siblings, 1 reply; 66+ messages in thread
From: Trond Myklebust @ 2009-12-31 19:13 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Steve Rago, Peter Zijlstra, linux-nfs@vger.kernel.org,
    linux-kernel@vger.kernel.org, jens.axboe, Peter Staubach,
    Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org

On Thu, 2009-12-31 at 13:04 +0800, Wu Fengguang wrote:
> ---
>  fs/nfs/inode.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> --- linux.orig/fs/nfs/inode.c	2009-12-25 09:25:38.000000000 +0800
> +++ linux/fs/nfs/inode.c	2009-12-25 10:13:06.000000000 +0800
> @@ -105,8 +105,11 @@ int nfs_write_inode(struct inode *inode,
>  		ret = filemap_fdatawait(inode->i_mapping);
>  		if (ret == 0)
>  			ret = nfs_commit_inode(inode, FLUSH_SYNC);
> -	} else
> +	} else if (!radix_tree_tagged(&NFS_I(inode)->nfs_page_tree,
> +				      NFS_PAGE_TAG_LOCKED))
>  		ret = nfs_commit_inode(inode, 0);
> +	else
> +		ret = -EAGAIN;
>  	if (ret >= 0)
>  		return 0;
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);

The above change improves on the existing code, but doesn't solve the
problem that write_inode() isn't a good match for COMMIT. We need to
wait for all the unstable WRITE rpc calls to return before we can know
whether or not a COMMIT is needed (some commercial servers never require
commit, even if the client requested an unstable write). That was the
other reason for the change.

I do, however, agree that the above can provide a nice heuristic for the
WB_SYNC_NONE case (minus the -EAGAIN error). Mind if I integrate it?

Cheers (and Happy New Year!)
  Trond

------------------------------------------------------------------------------------------------------------
VFS: Ensure that writeback_single_inode() commits unstable writes

From: Trond Myklebust <Trond.Myklebust@netapp.com>

If the call to do_writepages() succeeded in starting writeback, we do
not know whether or not we will need to COMMIT any unstable writes until
after the write RPC calls are finished. Currently, we assume that at
least one write RPC call will have finished, and set I_DIRTY_DATASYNC by
the time do_writepages is done, so that write_inode() is triggered.

In order to ensure reliable operation (i.e. ensure that a single call to
writeback_single_inode() with WB_SYNC_ALL set suffices to ensure that
pages are on disk) we need to first wait for filemap_fdatawait() to
complete, then test for unstable pages.

Since NFS is currently the only filesystem that has unstable pages, we
can add a new inode state I_UNSTABLE_PAGES that NFS alone will set. When
set, this will trigger a callback to a new address_space_operation to
call the COMMIT.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/fs-writeback.c  |   31 ++++++++++++++++++++++++++++++-
 fs/nfs/file.c      |    1 +
 fs/nfs/inode.c     |   16 ----------------
 fs/nfs/internal.h  |    3 ++-
 fs/nfs/super.c     |    2 --
 fs/nfs/write.c     |   33 ++++++++++++++++++++++++++++++++-
 include/linux/fs.h |    9 +++++++++
 7 files changed, 74 insertions(+), 21 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f6c2155..b25efbb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -388,6 +388,17 @@ static int write_inode(struct inode *inode, int sync)
 }
 
 /*
+ * Commit the NFS unstable pages.
+ */
+static int commit_unstable_pages(struct address_space *mapping,
+		struct writeback_control *wbc)
+{
+	if (mapping->a_ops && mapping->a_ops->commit_unstable_pages)
+		return mapping->a_ops->commit_unstable_pages(mapping, wbc);
+	return 0;
+}
+
+/*
  * Wait for writeback on an inode to complete.
  */
 static void inode_wait_for_writeback(struct inode *inode)
@@ -474,6 +485,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	}
 
 	spin_lock(&inode_lock);
+	/*
+	 * Special state for cleaning NFS unstable pages
+	 */
+	if (inode->i_state & I_UNSTABLE_PAGES) {
+		int err;
+		inode->i_state &= ~I_UNSTABLE_PAGES;
+		spin_unlock(&inode_lock);
+		err = commit_unstable_pages(mapping, wbc);
+		if (ret == 0)
+			ret = err;
+		spin_lock(&inode_lock);
+	}
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
 		if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
@@ -532,6 +555,12 @@ select_queue:
 				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
+		} else if (inode->i_state & I_UNSTABLE_PAGES) {
+			/*
+			 * The inode has got yet more unstable pages to
+			 * commit. Requeue on b_more_io
+			 */
+			requeue_io(inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse
@@ -1050,7 +1079,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 
 	spin_lock(&inode_lock);
 	if ((inode->i_state & flags) != flags) {
-		const int was_dirty = inode->i_state & I_DIRTY;
+		const int was_dirty = inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES);
 
 		inode->i_state |= flags;
 
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 6b89132..67e50ac 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -526,6 +526,7 @@ const struct address_space_operations nfs_file_aops = {
 	.migratepage = nfs_migrate_page,
 	.launder_page = nfs_launder_page,
 	.error_remove_page = generic_error_remove_page,
+	.commit_unstable_pages = nfs_commit_unstable_pages,
 };
 
 /*
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index faa0918..8341709 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -97,22 +97,6 @@ u64 nfs_compat_user_ino64(u64 fileid)
 	return ino;
 }
 
-int nfs_write_inode(struct inode *inode, int sync)
-{
-	int ret;
-
-	if (sync) {
-		ret = filemap_fdatawait(inode->i_mapping);
-		if (ret == 0)
-			ret = nfs_commit_inode(inode, FLUSH_SYNC);
-	} else
-		ret = nfs_commit_inode(inode, 0);
-	if (ret >= 0)
-		return 0;
-	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
-	return ret;
-}
-
 void nfs_clear_inode(struct inode *inode)
 {
 	/*
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 29e464d..7bb326f 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -211,7 +211,6 @@ extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask);
 extern struct workqueue_struct *nfsiod_workqueue;
 extern struct inode *nfs_alloc_inode(struct super_block *sb);
 extern void nfs_destroy_inode(struct inode *);
-extern int nfs_write_inode(struct inode *,int);
 extern void nfs_clear_inode(struct inode *);
 #ifdef CONFIG_NFS_V4
 extern void nfs4_clear_inode(struct inode *);
@@ -253,6 +252,8 @@ extern int nfs4_path_walk(struct nfs_server *server,
 extern void nfs_read_prepare(struct rpc_task *task, void *calldata);
 
 /* write.c */
+extern int nfs_commit_unstable_pages(struct address_space *mapping,
+		struct writeback_control *wbc);
 extern void nfs_write_prepare(struct rpc_task *task, void *calldata);
 #ifdef CONFIG_MIGRATION
 extern int nfs_migrate_page(struct address_space *,
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index ce907ef..805c1a0 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -265,7 +265,6 @@ struct file_system_type nfs_xdev_fs_type = {
 static const struct super_operations nfs_sops = {
 	.alloc_inode	= nfs_alloc_inode,
 	.destroy_inode	= nfs_destroy_inode,
-	.write_inode	= nfs_write_inode,
 	.statfs		= nfs_statfs,
 	.clear_inode	= nfs_clear_inode,
 	.umount_begin	= nfs_umount_begin,
@@ -334,7 +333,6 @@ struct file_system_type nfs4_referral_fs_type = {
 static const struct super_operations nfs4_sops = {
 	.alloc_inode	= nfs_alloc_inode,
 	.destroy_inode	= nfs_destroy_inode,
-	.write_inode	= nfs_write_inode,
 	.statfs		= nfs_statfs,
 	.clear_inode	= nfs4_clear_inode,
 	.umount_begin	= nfs_umount_begin,
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d171696..910be28 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 	spin_unlock(&inode->i_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
-	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
+	mark_inode_unstable_pages(inode);
 }
 
 static int
@@ -1406,11 +1406,42 @@ int nfs_commit_inode(struct inode *inode, int how)
 	}
 	return res;
 }
+
+int nfs_commit_unstable_pages(struct address_space *mapping,
+		struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	int flags = FLUSH_SYNC;
+	int ret;
+
+	/* Don't commit yet if this is a non-blocking flush and there are
+	 * outstanding writes for this mapping.
+	 */
+	if (wbc->sync_mode != WB_SYNC_ALL &&
+	    radix_tree_tagged(&NFS_I(inode)->nfs_page_tree,
+		    NFS_PAGE_TAG_LOCKED)) {
+		mark_inode_unstable_pages(inode);
+		return 0;
+	}
+	if (wbc->nonblocking)
+		flags = 0;
+	ret = nfs_commit_inode(inode, flags);
+	if (ret > 0)
+		ret = 0;
+	return ret;
+}
+
 #else
 static inline int nfs_commit_list(struct inode *inode,
 		struct list_head *head, int how)
 {
 	return 0;
 }
+
+int nfs_commit_unstable_pages(struct address_space *mapping,
+		struct writeback_control *wbc)
+{
+	return 0;
+}
 #endif
 
 long nfs_sync_mapping_wait(struct address_space *mapping,
		struct writeback_control *wbc, int how)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9147ca8..ea0b7a3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -602,6 +602,8 @@ struct address_space_operations {
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
+	int (*commit_unstable_pages)(struct address_space *,
+			struct writeback_control *);
 };
 
 /*
@@ -1635,6 +1637,8 @@ struct super_operations {
 #define I_CLEAR			64
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
+#define __I_UNSTABLE_PAGES	9
+#define I_UNSTABLE_PAGES	(1 << __I_UNSTABLE_PAGES)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
@@ -1649,6 +1653,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode)
 	__mark_inode_dirty(inode, I_DIRTY_SYNC);
 }
 
+static inline void mark_inode_unstable_pages(struct inode *inode)
+{
+	__mark_inode_dirty(inode, I_UNSTABLE_PAGES);
+}
+
 /**
  * inc_nlink - directly increment an inode's link count
  * @inode: inode
* Re: [PATCH] improve the performance of large sequential write NFS workloads
  2009-12-31 19:13 ` Trond Myklebust
@ 2010-01-06  3:03 ` Wu Fengguang
  2010-01-06 16:56 ` Trond Myklebust
  0 siblings, 1 reply; 66+ messages in thread
From: Wu Fengguang @ 2010-01-06  3:03 UTC (permalink / raw)
To: Trond Myklebust
Cc: Jan Kara, Steve Rago, Peter Zijlstra, linux-nfs@vger.kernel.org,
    linux-kernel@vger.kernel.org, jens.axboe, Peter Staubach,
    Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org

Trond,

On Fri, Jan 01, 2010 at 03:13:48AM +0800, Trond Myklebust wrote:
> On Thu, 2009-12-31 at 13:04 +0800, Wu Fengguang wrote:
>
> > ---
> >  fs/nfs/inode.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > --- linux.orig/fs/nfs/inode.c	2009-12-25 09:25:38.000000000 +0800
> > +++ linux/fs/nfs/inode.c	2009-12-25 10:13:06.000000000 +0800
> > @@ -105,8 +105,11 @@ int nfs_write_inode(struct inode *inode,
> >  		ret = filemap_fdatawait(inode->i_mapping);
> >  		if (ret == 0)
> >  			ret = nfs_commit_inode(inode, FLUSH_SYNC);
> > -	} else
> > +	} else if (!radix_tree_tagged(&NFS_I(inode)->nfs_page_tree,
> > +				      NFS_PAGE_TAG_LOCKED))
> >  		ret = nfs_commit_inode(inode, 0);
> > +	else
> > +		ret = -EAGAIN;
> >  	if (ret >= 0)
> >  		return 0;
> >  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
>
> The above change improves on the existing code, but doesn't solve the
> problem that write_inode() isn't a good match for COMMIT. We need to
> wait for all the unstable WRITE rpc calls to return before we can know
> whether or not a COMMIT is needed (some commercial servers never require
> commit, even if the client requested an unstable write). That was the
> other reason for the change.

Ah, good to know that reason. However, we cannot wait on the ongoing
WRITEs for an unlimited time or number of pages; otherwise nr_unstable
grows large, squeezes nr_dirty and nr_writeback to zero, and stalls the
cp process for a long time, as demonstrated by the trace (more
reasoning in my previous email).
> I do, however, agree that the above can provide a nice heuristic for the
> WB_SYNC_NONE case (minus the -EAGAIN error). Mind if I integrate it?

Sure, thank you.

Here is the trace I collected with this patch. The pipeline is often
stalled and throughput is poor..

Thanks,
Fengguang

% vmmon -d 1 nr_writeback nr_dirty nr_unstable

nr_writeback   nr_dirty   nr_unstable
        0             0             0
        0             0             0
        0             0             0
    31609         71540           146
    45293         60500          2832
    44418         58964          5246
    44927         55903          7806
    44672         55901          8064
    44159         52840         11646
    43120         51317         14224
    43556         48256         16857
    42532         46728         19417
    43044         43672         21977
    42093         42144         24464
    40999         40621         27097
    41508         37560         29657
    40612         36032         32089
    41600         34509         32640
    41600         34509         32640
    41600         34509         32640
    41454         32976         34319
    40466         31448         36843
nr_writeback   nr_dirty   nr_unstable
    39699         29920         39146
    40210         26864         41707
    39168         25336         44285
    38126         25341         45330
    38144         25341         45312
    37779         23808         47210
    38254         20752         49807
    37358         19224         52239
    36334         19229         53266
    36352         17696         54781
    35438         16168         57231
    35496         13621         59736
    47463             0         61420
    47421             0         61440
    44389             0         64472
    41829             0         67032
    39342             0         69519
    39357             0         69504
    36656             0         72205
    34131             0         74730
    31717             0         77144
    31165             0         77696
    28975             0         79886
    26451             0         82410
nr_writeback   nr_dirty   nr_unstable
    23873             0         84988
    22992             0         85869
    21586             0         87275
    19027             0         89834
    16467             0         92394
    14765             0         94096
    14781             0         94080
    12080             0         96781
     9391             0         99470
     6831             0        102030
     6589             0        102272
     6589             0        102272
     3669             0        105192
     1089             0        107772
       44             0        108817
        0             0        108861
        0             0        108861
    35186         71874          1679
    32626         71913          4238
    30121         71913          6743
    28802         71913          8062
    26610         71913         10254
    36953         59138         12686
    34473         59114         15191
nr_writeback   nr_dirty   nr_unstable
    33446         59114         16218
    33408         59114         16256
    30707         59114         18957
    28183         59114         21481
    25988         59114         23676
    25253         59114         24411
    25216         59114         24448
    22953         59114         26711
    35351         44274         29161
    32645         44274         31867
    32384         44274         32128
    32384         44274         32128
    32384         44274         32128
    28928         44274         35584
    26350         44274         38162
    26112         44274         38400
    26112         44274         38400
    26112         44274         38400
    22565         44274         41947
    36989         27364         44434
    35440         27379         45968
    32805         27379         48603
    30245         27379         51163
    28672         27379         52736
nr_writeback   nr_dirty   nr_unstable
    56047             4         52736
    56051             0         52736
    56051             0         52736
    56051             0         52736
    56051             0         52736
    54279             0         54508
    51846             0         56941
    49158             0         59629
    47987             0         60800
    47987             0         60800
    47987             0         60800
    47987             0         60800
    47987             0         60800
    47987             0         60800
    44612             0         62976
    42228             0         62976
    39650             0         62976
    37236             0         62976
    34658             0         62976
    32226             0         62976
    29722             0         62976
    27161             0         62976
    24674             0         62976
    22242             0         62976
nr_writeback   nr_dirty   nr_unstable
    19737             0         62976
    17306             0         62976
    14745             0         62976
    12313             0         62976
     9753             0         62976
     7321             0         62976
     4743             0         62976
     2329             0         62976
       43             0         14139
        0             0             0
        0             0             0
        0             0             0

wfg ~% dstat
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  2   9  89   0   0   0|   0     0 | 729B  720B|   0     0 | 875  2136
  6   9  76   8   0   1|   0   352k|9532B 4660B|   0     0 |1046  2091
  3   8  89   0   0   0|   0     0 |1153B  426B|   0     0 | 870  1870
  1   9  89   0   0   0|   0    72k|1218B  246B|   0     0 | 853  1757
  3   8  89   0   0   0|   0     0 | 844B   66B|   0     0 | 865  1695
  2   7  91   0   0   0|   0     0 | 523B   66B|   0     0 | 818  1576
  3   7  90   0   0   0|   0     0 | 901B   66B|   0     0 | 820  1590
  6  11  68  11   0   4|   0   456k|2028k   51k|   0     0 |1560  2756
  7  21  52   0   0  20|   0     0 |  11M  238k|   0     0 |4627  7423
  2  22  51   0   0  24|   0    80k|  10M  230k|   0     0 |4200  6469
  4  19  54   0   0  23|   0     0 |  10M  236k|   0     0 |4277  6629
  3  15  37  31   0  14|   0    64M|5377k  115k|   0     0 |2229  2972
  3  27  45   0   0  26|   0     0 |  10M  237k|   0     0 |4416  6743
  3  20  51   0   0  27|   0  1024k|  10M  233k|   0     0 |4284  6694
^C
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  5   9  84   2   0   1| 225k  443k|   0     0 |   0     0 | 950  1985
  4  28  25  22   0  21|   0    62M|  10M  235k|   0     0 |4529  6686
  5  23  30  11   0  31|   0    23M|  10M  239k|   0     0 |4570  6948
  2  24  48   0   0  26|   0     0 |  10M  234k|   0     0 |4334  6796
  2  25  34  17   0  22|   0    50M|  10M  236k|   0     0 |4546  6944
  2  29  46   7   0  18|   0    14M|  10M  236k|   0     0 |4411  6998
  2  23  53   0   0  22|   0     0 |  10M  232k|   0     0 |4100  6595
  3  19  20  32   0  26|   0    39M|9466k  207k|   0     0 |3455  4617
  2  13  40  43   0   1|   0    41M| 930B  264B|   0     0 | 906  1545
  3   7  45  43   0   1|   0    57M| 713B  132B|   0     0 | 859  1669
  3   9  47  40   0   1|   0    54M| 376B   66B|   0     0 | 944  1741
  5  25  47   0   0  21|   0    16k|9951k  222k|   0     0 |4227  6697
  5  20  38  14   0  23|   0    36M|9388k  204k|   0     0 |3650  5135
  3  28  46   0   0  24|   0  8192B|  11M  241k|   0     0 |4612  7115
  2  24  49   0   0  25|   0     0 |  10M  234k|   0     0 |4120  6477
  2  25  37  12   0  23|   0    56M|  11M  239k|   0     0 |4406  6237
  3   7  38  44   0   7|   0    48M|1529k   32k|   0     0 |1071  1635
  3   8  41  45   0   2|   0    58M| 602B  198B|   0     0 | 886  1613
  2  25  45   2   0  27|   0  2056k|  10M  228k|   0     0 |4233  6623
  2  24  49   0   0  24|   0     0 |  10M  235k|   0     0 |4292  6815
  2  27  41   8   0  22|   0    50M|  10M  234k|   0     0 |4381  6394
  1   9  41  41   0   7|   0    59M|1790k   38k|   0     0 |1226  1823
  2  26  40  10   0  22|   0    17M|8185k  183k|   0     0 |3584  5410
  1  23  54   0   0  22|   0     0 |  10M  228k|   0     0 |4153  6672
  1  22  49   0   0  28|   0    37M|  11M  239k|   0     0 |4499  6938
  2  15  37  32   0  13|   0    57M|5078k  110k|   0     0 |2154  2903
  3  20  45  21   0  10|   0    31M|4268k   96k|   0     0 |2338  3712
  2  21  55   0   0  21|   0     0 |  10M  231k|   0     0 |4292  6940
  2  22  49   0   0  27|   0    25M|  11M  238k|   0     0 |4338  6677
  2  17  42  19   0  19|   0    53M|8269k  180k|   0     0 |3341  4501
  3  17  45  33   0   2|   0    50M|2083k   49k|   0     0 |1778  2733
  2  23  53   0   0  22|   0     0 |  11M  240k|   0     0 |4482  7108
  2  23  51   0   0  25|   0  9792k|  10M  230k|   0     0 |4220  6563
  3  21  38  15   0  24|   0    53M|  11M  240k|   0     0 |4038  5697
  3  10  41  43   0   3|   0    65M|  80k  660B|   0     0 | 984  1725
  1  23  51   0   0  25|   0  8192B|  10M  230k|   0     0 |4301  6652
  2  21  48   0   0  29|   0     0 |  10M  237k|   0     0 |4267  6956
  2  26  43   5   0  23|   0    52M|  10M  236k|   0     0 |4553  6764
  7   7  34  41   0  10|   0    57M|2596k   56k|   0     0 |1210  1680
  6  21  44  12   0  17|   0    19M|7053k  158k|   0     0 |3194  4902
  4  24  51   0   0  21|   0     0 |  10M  237k|   0     0 |4406  6724
  4  22  53   0   0  21|   0    31M|  10M  237k|   0     0 |4752  7286
  4  15  32  32   0  17|   0    49M|5777k  125k|   0     0 |2379  3015
  5  14  43  34   0   3|   0    48M|1781k   42k|   0     0 |1578  2492
  4  22  42   0   0  32|   0     0 |  10M  236k|   0     0 |4318  6763
  3  22  50   4   0  21|   0  7072k|  10M  236k|   0     0 |4509  6859
  6  21  28  16   0  28|   0    41M|  11M  241k|   0     0 |4289  5928
  7   8  39  44   0   2|   0    40M| 217k 3762B|   0     0 |1024  1763
  4  15  46  28   0   6|   0    39M|2377k   55k|   0     0 |1683  2678
  4  24  45   0   0  26|   0     0 |  10M  232k|   0     0 |4207  6596
  3  24  50   5   0  19|   0    10M|9472k  210k|   0     0 |3976  6122
  5   7  40  46   0   1|   0    32M|1230B   66B|   0     0 | 967
1676 ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 5 7 47 40 0 1| 0 39M| 651B 66B| 0 0 | 916 1583 4 12 54 22 0 7| 0 35M|1815k 41k| 0 0 |1448 2383 4 22 52 0 0 21| 0 0 | 10M 233k| 0 0 |4258 6705 4 22 52 0 0 22| 0 24M| 10M 236k| 0 0 |4480 7097 3 23 48 0 0 26| 0 28M| 10M 234k| 0 0 |4402 6798 5 12 36 29 0 19| 0 59M|5464k 118k| 0 0 |2358 2963 4 26 47 4 0 19| 0 5184k|8684k 194k| 0 0 |3786 5852 4 22 43 0 0 32| 0 0 | 10M 233k| 0 0 |4350 6779 3 26 44 0 0 27| 0 36M| 10M 233k| 0 0 |4360 6619 4 11 39 33 0 13| 0 46M|4545k 98k| 0 0 |2159 2600 3 14 40 40 0 2| 0 46M| 160k 4198B| 0 0 |1070 1610 4 25 45 0 0 27| 0 0 | 10M 236k| 0 0 |4435 6760 4 25 48 0 0 24| 0 3648k| 10M 235k| 0 0 |4595 6950 3 24 29 22 0 21| 0 37M| 10M 236k| 0 0 |4335 6461 5 11 42 36 0 6| 0 45M|2257k 48k| 0 0 |1440 1755 5 6 41 47 0 1| 0 43M| 768B 198B| 0 0 | 989 1592 5 30 47 3 0 15| 0 24k|8598k 192k| 0 0 |3694 5580 2 23 49 0 0 26| 0 0 | 10M 229k| 0 0 |4319 6805 4 22 32 20 0 22| 0 26M| 10M 234k| 0 0 |4487 6751 4 11 24 53 0 8| 0 32M|2503k 55k| 0 0 |1287 1654 8 10 42 39 0 0| 0 43M|1783B 132B| 0 0 |1054 1900 6 16 43 27 0 8| 0 24M|2790k 64k| 0 0 |2150 3370 4 24 51 0 0 21| 0 0 | 10M 231k| 0 0 |4308 6589 3 24 36 13 0 24| 0 9848k| 10M 231k| 0 0 |4394 6742 6 10 11 62 0 9| 0 27M|2519k 55k| 0 0 |1482 1723 3 12 23 61 0 2| 0 34M| 608B 132B| 0 0 | 927 1623 3 15 38 38 0 6| 0 36M|2077k 48k| 0 0 |1801 2651 7 25 45 6 0 17| 0 3000k| 11M 241k| 0 0 |5071 7687 3 26 45 3 0 23| 0 13M| 11M 238k| 0 0 |4473 6650 4 17 40 21 0 17| 0 37M|6253k 139k| 0 0 |2891 3746 3 24 48 0 0 25| 0 0 | 10M 238k| 0 0 |4736 7189 1 28 38 7 0 25| 0 9160k| 10M 232k| 0 0 |4689 7026 4 17 26 35 0 18| 0 21M|8707k 190k| 0 0 |3346 4488 4 11 12 72 0 1| 0 29M|1459B 264B| 0 0 | 947 1643 4 10 20 64 0 1| 0 28M| 728B 132B| 0 0 |1010 1531 6 8 7 78 0 1| 0 25M| 869B 66B| 0 0 | 945 1620 5 10 15 69 0 1| 0 27M| 647B 132B| 0 0 |1052 1553 5 11 0 82 0 1| 0 16M| 724B 66B| 0 0 |1063 1679 3 22 18 49 0 9| 
0 14M|4560k 103k| 0 0 |2931 4039 3 24 44 0 0 29| 0 0 | 10M 236k| 0 0 |4863 7497 3 30 42 0 0 24| 0 4144k| 11M 250k| 0 0 |5505 7945 3 18 13 45 0 20| 0 15M|7234k 157k| 0 0 |3197 4021 7 9 0 82 0 1| 0 23M| 356B 198B| 0 0 | 979 1738 3 11 9 77 0 0| 0 22M| 802B 132B| 0 0 | 994 1635 5 9 1 84 0 2| 0 31M| 834B 66B| 0 0 | 996 1534 4 10 14 71 0 1| 0 20M| 288B 132B| 0 0 | 976 1627 4 14 22 59 0 1| 0 8032k| 865k 20k| 0 0 |1222 1589 4 23 46 0 0 26| 0 0 | 10M 239k| 0 0 |3791 5035 5 17 43 6 0 29| 0 17M| 10M 233k| 0 0 |3198 4372 4 19 50 0 0 27| 0 0 | 10M 231k| 0 0 |2952 4447 5 19 37 14 0 26| 0 8568k| 10M 227k| 0 0 |3562 5251 3 21 23 25 0 28| 0 9560k| 10M 230k| 0 0 |3390 5038 ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 5 19 24 26 0 26| 0 11M| 10M 229k| 0 0 |3282 4749 4 20 8 39 0 28| 0 7992k| 10M 230k| 0 0 |3302 4488 4 17 3 47 0 30| 0 8616k| 10M 231k| 0 0 |3440 4909 5 16 22 25 0 31| 0 6556k| 10M 227k| 0 0 |3291 4671 3 18 22 24 0 32| 0 5588k| 10M 230k| 0 0 |3345 4822 4 16 26 25 0 29| 0 4744k| 10M 230k| 0 0 |3331 4854 3 18 16 37 0 26| 0 4296k| 10M 228k| 0 0 |3056 4139 3 17 18 25 0 36| 0 3016k| 10M 230k| 0 0 |3239 4623 4 19 23 26 0 27| 0 2216k| 10M 229k| 0 0 |3331 4777 4 20 41 8 0 26| 0 8584k| 10M 228k| 0 0 |3434 5114 4 17 50 0 0 29| 0 1000k| 10M 229k| 0 0 |3151 4878 2 18 50 1 0 29| 0 32k| 10M 232k| 0 0 |3176 4951 3 19 51 0 0 28| 0 0 | 10M 232k| 0 0 |3014 4567 4 17 53 1 0 24| 0 32k|8787k 195k| 0 0 |2768 4382 3 8 89 0 0 0| 0 0 |4013B 2016B| 0 0 | 866 1653 3 8 88 0 0 0| 0 16k|1017B 0 | 0 0 | 828 1660 6 8 86 0 0 0| 0 0 |1320B 66B| 0 0 | 821 1713 4 8 88 0 0 0| 0 0 | 692B 66B| 0 0 | 806 1665 > ------------------------------------------------------------------------------------------------------------ > VFS: Ensure that writeback_single_inode() commits unstable writes > > From: Trond Myklebust <Trond.Myklebust@netapp.com> > > If the call to do_writepages() succeeded in starting writeback, we do not > know 
whether or not we will need to COMMIT any unstable writes until after > the write RPC calls are finished. Currently, we assume that at least one > write RPC call will have finished, and set I_DIRTY_DATASYNC by the time > do_writepages is done, so that write_inode() is triggered. > > In order to ensure reliable operation (i.e. ensure that a single call to > writeback_single_inode() with WB_SYNC_ALL set suffices to ensure that pages > are on disk) we need to first wait for filemap_fdatawait() to complete, > then test for unstable pages. > > Since NFS is currently the only filesystem that has unstable pages, we can > add a new inode state I_UNSTABLE_PAGES that NFS alone will set. When set, > this will trigger a callback to a new address_space_operation to call the > COMMIT. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > --- > > fs/fs-writeback.c | 31 ++++++++++++++++++++++++++++++- > fs/nfs/file.c | 1 + > fs/nfs/inode.c | 16 ---------------- > fs/nfs/internal.h | 3 ++- > fs/nfs/super.c | 2 -- > fs/nfs/write.c | 33 ++++++++++++++++++++++++++++++++- > include/linux/fs.h | 9 +++++++++ > 7 files changed, 74 insertions(+), 21 deletions(-) > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index f6c2155..b25efbb 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -388,6 +388,17 @@ static int write_inode(struct inode *inode, int sync) > } > > /* > + * Commit the NFS unstable pages. > + */ > +static int commit_unstable_pages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + if (mapping->a_ops && mapping->a_ops->commit_unstable_pages) > + return mapping->a_ops->commit_unstable_pages(mapping, wbc); > + return 0; > +} > + > +/* > * Wait for writeback on an inode to complete. 
> */ > static void inode_wait_for_writeback(struct inode *inode) > @@ -474,6 +485,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > } > > spin_lock(&inode_lock); > + /* > + * Special state for cleaning NFS unstable pages > + */ > + if (inode->i_state & I_UNSTABLE_PAGES) { > + int err; > + inode->i_state &= ~I_UNSTABLE_PAGES; > + spin_unlock(&inode_lock); > + err = commit_unstable_pages(mapping, wbc); > + if (ret == 0) > + ret = err; > + spin_lock(&inode_lock); > + } > inode->i_state &= ~I_SYNC; > if (!(inode->i_state & (I_FREEING | I_CLEAR))) { > if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) { > @@ -532,6 +555,12 @@ select_queue: > inode->i_state |= I_DIRTY_PAGES; > redirty_tail(inode); > } > + } else if (inode->i_state & I_UNSTABLE_PAGES) { > + /* > + * The inode has got yet more unstable pages to > + * commit. Requeue on b_more_io > + */ > + requeue_io(inode); > } else if (atomic_read(&inode->i_count)) { > /* > * The inode is clean, inuse > @@ -1050,7 +1079,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) > > spin_lock(&inode_lock); > if ((inode->i_state & flags) != flags) { > - const int was_dirty = inode->i_state & I_DIRTY; > + const int was_dirty = inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES); > > inode->i_state |= flags; > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c > index 6b89132..67e50ac 100644 > --- a/fs/nfs/file.c > +++ b/fs/nfs/file.c > @@ -526,6 +526,7 @@ const struct address_space_operations nfs_file_aops = { > .migratepage = nfs_migrate_page, > .launder_page = nfs_launder_page, > .error_remove_page = generic_error_remove_page, > + .commit_unstable_pages = nfs_commit_unstable_pages, > }; > > /* > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c > index faa0918..8341709 100644 > --- a/fs/nfs/inode.c > +++ b/fs/nfs/inode.c > @@ -97,22 +97,6 @@ u64 nfs_compat_user_ino64(u64 fileid) > return ino; > } > > -int nfs_write_inode(struct inode *inode, int sync) > -{ > - int ret; > - > - if (sync) { > - ret = 
filemap_fdatawait(inode->i_mapping); > - if (ret == 0) > - ret = nfs_commit_inode(inode, FLUSH_SYNC); > - } else > - ret = nfs_commit_inode(inode, 0); > - if (ret >= 0) > - return 0; > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > - return ret; > -} > - > void nfs_clear_inode(struct inode *inode) > { > /* > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h > index 29e464d..7bb326f 100644 > --- a/fs/nfs/internal.h > +++ b/fs/nfs/internal.h > @@ -211,7 +211,6 @@ extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask); > extern struct workqueue_struct *nfsiod_workqueue; > extern struct inode *nfs_alloc_inode(struct super_block *sb); > extern void nfs_destroy_inode(struct inode *); > -extern int nfs_write_inode(struct inode *,int); > extern void nfs_clear_inode(struct inode *); > #ifdef CONFIG_NFS_V4 > extern void nfs4_clear_inode(struct inode *); > @@ -253,6 +252,8 @@ extern int nfs4_path_walk(struct nfs_server *server, > extern void nfs_read_prepare(struct rpc_task *task, void *calldata); > > /* write.c */ > +extern int nfs_commit_unstable_pages(struct address_space *mapping, > + struct writeback_control *wbc); > extern void nfs_write_prepare(struct rpc_task *task, void *calldata); > #ifdef CONFIG_MIGRATION > extern int nfs_migrate_page(struct address_space *, > diff --git a/fs/nfs/super.c b/fs/nfs/super.c > index ce907ef..805c1a0 100644 > --- a/fs/nfs/super.c > +++ b/fs/nfs/super.c > @@ -265,7 +265,6 @@ struct file_system_type nfs_xdev_fs_type = { > static const struct super_operations nfs_sops = { > .alloc_inode = nfs_alloc_inode, > .destroy_inode = nfs_destroy_inode, > - .write_inode = nfs_write_inode, > .statfs = nfs_statfs, > .clear_inode = nfs_clear_inode, > .umount_begin = nfs_umount_begin, > @@ -334,7 +333,6 @@ struct file_system_type nfs4_referral_fs_type = { > static const struct super_operations nfs4_sops = { > .alloc_inode = nfs_alloc_inode, > .destroy_inode = nfs_destroy_inode, > - .write_inode = nfs_write_inode, > .statfs = nfs_statfs, 
> .clear_inode = nfs4_clear_inode, > .umount_begin = nfs_umount_begin, > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index d171696..910be28 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) > spin_unlock(&inode->i_lock); > inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); > inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > + mark_inode_unstable_pages(inode); > } > > static int > @@ -1406,11 +1406,42 @@ int nfs_commit_inode(struct inode *inode, int how) > } > return res; > } > + > +int nfs_commit_unstable_pages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + struct inode *inode = mapping->host; > + int flags = FLUSH_SYNC; > + int ret; > + > + /* Don't commit yet if this is a non-blocking flush and there are > + * outstanding writes for this mapping. > + */ > + if (wbc->sync_mode != WB_SYNC_ALL && > + radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, > + NFS_PAGE_TAG_LOCKED)) { > + mark_inode_unstable_pages(inode); > + return 0; > + } > + if (wbc->nonblocking) > + flags = 0; > + ret = nfs_commit_inode(inode, flags); > + if (ret > 0) > + ret = 0; > + return ret; > +} > + > #else > static inline int nfs_commit_list(struct inode *inode, struct list_head *head, int how) > { > return 0; > } > + > +int nfs_commit_unstable_pages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + return 0; > +} > #endif > > long nfs_sync_mapping_wait(struct address_space *mapping, struct writeback_control *wbc, int how) > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9147ca8..ea0b7a3 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -602,6 +602,8 @@ struct address_space_operations { > int (*is_partially_uptodate) (struct page *, read_descriptor_t *, > unsigned long); > int (*error_remove_page)(struct address_space *, struct page *); > + int 
(*commit_unstable_pages)(struct address_space *, > + struct writeback_control *); > }; > > /* > @@ -1635,6 +1637,8 @@ struct super_operations { > #define I_CLEAR 64 > #define __I_SYNC 7 > #define I_SYNC (1 << __I_SYNC) > +#define __I_UNSTABLE_PAGES 9 > +#define I_UNSTABLE_PAGES (1 << __I_UNSTABLE_PAGES) > > #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) > > @@ -1649,6 +1653,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode) > __mark_inode_dirty(inode, I_DIRTY_SYNC); > } > > +static inline void mark_inode_unstable_pages(struct inode *inode) > +{ > + __mark_inode_dirty(inode, I_UNSTABLE_PAGES); > +} > + > /** > * inc_nlink - directly increment an inode's link count > * @inode: inode > ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-06 3:03 ` Wu Fengguang @ 2010-01-06 16:56 ` Trond Myklebust 2010-01-06 18:26 ` Trond Myklebust 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 16:56 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, Steve Rago, Peter Zijlstra, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2010-01-06 at 11:03 +0800, Wu Fengguang wrote: > Trond, > > On Fri, Jan 01, 2010 at 03:13:48AM +0800, Trond Myklebust wrote: > > The above change improves on the existing code, but doesn't solve the > > problem that write_inode() isn't a good match for COMMIT. We need to > > wait for all the unstable WRITE rpc calls to return before we can know > > whether or not a COMMIT is needed (some commercial servers never require > > commit, even if the client requested an unstable write). That was the > > other reason for the change. > > Ah good to know that reason. However we cannot wait for ongoing WRITEs > for unlimited time or pages, otherwise nr_unstable goes up and squeeze > nr_dirty and nr_writeback to zero, and stall the cp process for a long > time, as demonstrated by the trace (more reasoning in previous email). OK. I think we need a mechanism to allow balance_dirty_pages() to communicate to the filesystem that it really is holding too many unstable pages. Currently, all we do is say that 'your total is too big', and then let the filesystem figure out what it needs to do. So how about if we modify your heuristic to do something like this? It applies on top of the previous patch. 
Cheers Trond --------------------------------------------------------------------------------------------------------- VM/NFS: The VM must tell the filesystem when to free reclaimable pages From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> balance_dirty_pages() should really tell the filesystem whether or not it has an excess of actual dirty pages, or whether it would be more useful to start freeing up the reclaimable pages. Assume that if the number of dirty pages associated with this backing-dev is less than 1/2 the number of reclaimable pages, then we should concentrate on freeing up the latter. Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> --- fs/nfs/write.c | 9 +++++++-- include/linux/backing-dev.h | 6 ++++++ mm/page-writeback.c | 7 +++++-- 3 files changed, 18 insertions(+), 4 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 910be28..36113e6 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1420,8 +1420,10 @@ int nfs_commit_unstable_pages(struct address_space *mapping, if (wbc->sync_mode != WB_SYNC_ALL && radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, NFS_PAGE_TAG_LOCKED)) { - mark_inode_unstable_pages(inode); - return 0; + if (wbc->bdi == NULL) + goto out_nocommit; + if (wbc->bdi->dirty_exceeded != BDI_RECLAIMABLE_EXCEEDED) + goto out_nocommit; } if (wbc->nonblocking) flags = 0; @@ -1429,6 +1431,9 @@ int nfs_commit_unstable_pages(struct address_space *mapping, if (ret > 0) ret = 0; return ret; +out_nocommit: + mark_inode_unstable_pages(inode); + return 0; } #else diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index fcbc26a..cd1645e 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -94,6 +94,12 @@ struct backing_dev_info { #endif }; +enum bdi_dirty_exceeded_state { + BDI_NO_DIRTY_EXCESS = 0, + BDI_DIRTY_EXCEEDED, + BDI_RECLAIMABLE_EXCEEDED, +}; + int bdi_init(struct backing_dev_info *bdi); void bdi_destroy(struct 
backing_dev_info *bdi); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0b19943..0133c8f 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -524,8 +524,11 @@ static void balance_dirty_pages(struct address_space *mapping, (background_thresh + dirty_thresh) / 2) break; - if (!bdi->dirty_exceeded) - bdi->dirty_exceeded = 1; + if (bdi_nr_writeback > bdi_nr_reclaimable / 2) { + if (bdi->dirty_exceeded != BDI_DIRTY_EXCEEDED) + bdi->dirty_exceeded = BDI_DIRTY_EXCEEDED; + } else if (bdi->dirty_exceeded != BDI_RECLAIMABLE_EXCEEDED) + bdi->dirty_exceeded = BDI_RECLAIMABLE_EXCEEDED; /* Note: nr_reclaimable denotes nr_dirty + nr_unstable. * Unstable writes are a feature of certain networked -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-06 16:56 ` Trond Myklebust @ 2010-01-06 18:26 ` Trond Myklebust 2010-01-06 18:37 ` Peter Zijlstra 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 18:26 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, Steve Rago, Peter Zijlstra, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2010-01-06 at 11:56 -0500, Trond Myklebust wrote: > On Wed, 2010-01-06 at 11:03 +0800, Wu Fengguang wrote: > > Trond, > > > > On Fri, Jan 01, 2010 at 03:13:48AM +0800, Trond Myklebust wrote: > > > The above change improves on the existing code, but doesn't solve the > > > problem that write_inode() isn't a good match for COMMIT. We need to > > > wait for all the unstable WRITE rpc calls to return before we can know > > > whether or not a COMMIT is needed (some commercial servers never require > > > commit, even if the client requested an unstable write). That was the > > > other reason for the change. > > > > Ah good to know that reason. However we cannot wait for ongoing WRITEs > > for unlimited time or pages, otherwise nr_unstable goes up and squeeze > > nr_dirty and nr_writeback to zero, and stall the cp process for a long > > time, as demonstrated by the trace (more reasoning in previous email). > > OK. I think we need a mechanism to allow balance_dirty_pages() to > communicate to the filesystem that it really is holding too many > unstable pages. Currently, all we do is say that 'your total is too > big', and then let the filesystem figure out what it needs to do. > > So how about if we modify your heuristic to do something like this? It > applies on top of the previous patch. Gah! I misread the definitions of bdi_nr_reclaimable and bdi_nr_writeback. Please ignore the previous patch. OK. 
It looks as if the only key to finding out how many unstable writes we have is to use global_page_state(NR_UNSTABLE_NFS), so we can't specifically target our own backing-dev. Also, on reflection, I think it might be more helpful to use the writeback control to signal when we want to force a commit. That makes it a more general mechanism. There is one thing that we might still want to do here. Currently we do not update wbc->nr_to_write inside nfs_commit_unstable_pages(), which again means that we don't update 'pages_written' if the only effect of the writeback_inodes_wbc() was to commit pages. Perhaps it might not be a bad idea to do this (but that should be in a separate patch)... Cheers Trond ------------------------------------------------------------------------------------- VM/NFS: The VM must tell the filesystem when to free reclaimable pages From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> balance_dirty_pages() should really tell the filesystem whether or not it has an excess of actual dirty pages, or whether it would be more useful to start freeing up the unstable writes. Assume that if the number of unstable writes is more than 1/2 the number of reclaimable pages, then we should force NFS to free up the former. Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> --- fs/nfs/write.c | 2 +- include/linux/writeback.h | 5 +++++ mm/page-writeback.c | 9 ++++++++- 3 files changed, 14 insertions(+), 2 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 910be28..ee3daf4 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1417,7 +1417,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, /* Don't commit yet if this is a non-blocking flush and there are * outstanding writes for this mapping. 
*/ - if (wbc->sync_mode != WB_SYNC_ALL && + if (!wbc->force_commit && wbc->sync_mode != WB_SYNC_ALL && radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, NFS_PAGE_TAG_LOCKED)) { mark_inode_unstable_pages(inode); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 76e8903..3fd5c3e 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -62,6 +62,11 @@ struct writeback_control { * so we use a single control to update them */ unsigned no_nrwrite_index_update:1; + /* + * The following is used by balance_dirty_pages() to + * force NFS to commit unstable pages. + */ + unsigned force_commit:1; }; /* diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0b19943..ede5356 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -485,6 +485,7 @@ static void balance_dirty_pages(struct address_space *mapping, { long nr_reclaimable, bdi_nr_reclaimable; long nr_writeback, bdi_nr_writeback; + long nr_unstable_nfs; unsigned long background_thresh; unsigned long dirty_thresh; unsigned long bdi_thresh; @@ -505,8 +506,9 @@ static void balance_dirty_pages(struct address_space *mapping, get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); + nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS); nr_reclaimable = global_page_state(NR_FILE_DIRTY) + - global_page_state(NR_UNSTABLE_NFS); + nr_unstable_nfs; nr_writeback = global_page_state(NR_WRITEBACK); bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); @@ -537,6 +539,11 @@ static void balance_dirty_pages(struct address_space *mapping, * up. */ if (bdi_nr_reclaimable > bdi_thresh) { + wbc.force_commit = 0; + /* Force NFS to also free up unstable writes. 
*/ + if (nr_unstable_nfs > nr_reclaimable / 2) + wbc.force_commit = 1; + writeback_inodes_wbc(&wbc); pages_written += write_chunk - wbc.nr_to_write; get_dirty_limits(&background_thresh, &dirty_thresh, -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-06 18:26 ` Trond Myklebust @ 2010-01-06 18:37 ` Peter Zijlstra 2010-01-06 18:52 ` Trond Myklebust 0 siblings, 1 reply; 66+ messages in thread From: Peter Zijlstra @ 2010-01-06 18:37 UTC (permalink / raw) To: Trond Myklebust Cc: Wu Fengguang, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2010-01-06 at 13:26 -0500, Trond Myklebust wrote: > OK. It looks as if the only key to finding out how many unstable writes > we have is to use global_page_state(NR_UNSTABLE_NFS), so we can't > specifically target our own backing-dev. Would be a simple matter of splitting BDI_UNSTABLE out from BDI_RECLAIMABLE, no? Something like --- fs/nfs/write.c | 6 +++--- include/linux/backing-dev.h | 3 ++- mm/backing-dev.c | 6 ++++-- mm/filemap.c | 2 +- mm/page-writeback.c | 16 ++++++++++------ mm/truncate.c | 2 +- 6 files changed, 21 insertions(+), 14 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index d171696..7ba56f8 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -440,7 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req) NFS_PAGE_TAG_COMMIT); spin_unlock(&inode->i_lock); inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); - inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); + inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } @@ -451,7 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) { dec_zone_page_state(page, NR_UNSTABLE_NFS); - dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE); + dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE); return 1; } return 0; @@ -1322,7 +1322,7 @@ nfs_commit_list(struct inode *inode, struct 
list_head *head, int how) nfs_mark_request_commit(req); dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req->wb_page->mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_UNSTABLE); nfs_clear_page_tag_locked(req); } return -ENOMEM; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index fcbc26a..1ef1e5c 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -36,7 +36,8 @@ enum bdi_state { typedef int (congested_fn)(void *, int); enum bdi_stat_item { - BDI_RECLAIMABLE, + BDI_DIRTY, + BDI_UNSTABLE, BDI_WRITEBACK, NR_BDI_STAT_ITEMS }; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 0e8ca03..88f3655 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -88,7 +88,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) #define K(x) ((x) << (PAGE_SHIFT - 10)) seq_printf(m, "BdiWriteback: %8lu kB\n" - "BdiReclaimable: %8lu kB\n" + "BdiDirty: %8lu kB\n" + "BdiUnstable: %8lu kB\n" "BdiDirtyThresh: %8lu kB\n" "DirtyThresh: %8lu kB\n" "BackgroundThresh: %8lu kB\n" @@ -102,7 +103,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "wb_list: %8u\n" "wb_cnt: %8u\n", (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), - (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)), + (unsigned long) K(bdi_stat(bdi, BDI_DIRTY)), + (unsigned long) K(bdi_stat(bdi, BDI_UNSTABLE)), K(bdi_thresh), K(dirty_thresh), K(background_thresh), nr_wb, nr_dirty, nr_io, nr_more_io, !list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask, diff --git a/mm/filemap.c b/mm/filemap.c index 96ac6b0..458387d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -136,7 +136,7 @@ void __remove_from_page_cache(struct page *page) */ if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY); } } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0b19943..b1d31be 100644 ---
a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -272,7 +272,8 @@ static void clip_bdi_dirty_limit(struct backing_dev_info *bdi, else avail_dirty = 0; - avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) + + avail_dirty += bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE) + bdi_stat(bdi, BDI_WRITEBACK); *pbdi_dirty = min(*pbdi_dirty, avail_dirty); @@ -509,7 +510,8 @@ static void balance_dirty_pages(struct address_space *mapping, global_page_state(NR_UNSTABLE_NFS); nr_writeback = global_page_state(NR_WRITEBACK); - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -554,10 +556,12 @@ static void balance_dirty_pages(struct address_space *mapping, * deltas. */ if (bdi_thresh < 2*bdi_stat_error(bdi)) { - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY) + + bdi_stat_sum(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK); } else if (bdi_nr_reclaimable) { - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); } @@ -1079,7 +1083,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) { if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); - __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY); task_dirty_inc(current); task_io_account_write(PAGE_CACHE_SIZE); } @@ -1255,7 +1259,7 @@ int clear_page_dirty_for_io(struct page *page) if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_DIRTY); return 1; } return 0; diff --git a/mm/truncate.c b/mm/truncate.c index 342deee..b0ce8fb
100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -75,7 +75,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size) if (mapping && mapping_cap_account_dirty(mapping)) { dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_DIRTY); if (account_size) task_io_account_cancelled_write(account_size); } -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-06 18:37 ` Peter Zijlstra @ 2010-01-06 18:52 ` Trond Myklebust 2010-01-06 19:07 ` Peter Zijlstra 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 18:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Wu Fengguang, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2010-01-06 at 19:37 +0100, Peter Zijlstra wrote: > On Wed, 2010-01-06 at 13:26 -0500, Trond Myklebust wrote: > > OK. It looks as if the only key to finding out how many unstable writes > > we have is to use global_page_state(NR_UNSTABLE_NFS), so we can't > > specifically target our own backing-dev. > > Would be a simple matter of splitting BDI_UNSTABLE out from > BDI_RECLAIMABLE, no? > > Something like OK. How about if we also add in a bdi->capabilities flag to tell that we might have BDI_UNSTABLE? That would allow us to avoid the potentially expensive extra calls to bdi_stat() and bdi_stat_sum() for the non-nfs case? Cheers Trond ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-06 18:52 ` Trond Myklebust @ 2010-01-06 19:07 ` Peter Zijlstra 2010-01-06 19:21 ` Trond Myklebust 0 siblings, 1 reply; 66+ messages in thread From: Peter Zijlstra @ 2010-01-06 19:07 UTC (permalink / raw) To: Trond Myklebust Cc: Wu Fengguang, Jan Kara, Steve Rago, linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org On Wed, 2010-01-06 at 13:52 -0500, Trond Myklebust wrote: > On Wed, 2010-01-06 at 19:37 +0100, Peter Zijlstra wrote: > > On Wed, 2010-01-06 at 13:26 -0500, Trond Myklebust wrote: > > > OK. It looks as if the only key to finding out how many unstable writes > > > we have is to use global_page_state(NR_UNSTABLE_NFS), so we can't > > > specifically target our own backing-dev. > > > > Would be a simple matter of splitting BDI_UNSTABLE out from > > BDI_RECLAIMABLE, no? > > > > Something like > > OK. How about if we also add in a bdi->capabilities flag to tell that we > might have BDI_UNSTABLE? That would allow us to avoid the potentially > expensive extra calls to bdi_stat() and bdi_stat_sum() for the non-nfs > case? The bdi_stat_sum() in the error limit is basically the only such expensive op, but I suspect we might hit that more than enough. So sure that sounds like a plan. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-06 19:07 ` Peter Zijlstra @ 2010-01-06 19:21 ` Trond Myklebust 2010-01-06 19:53 ` Trond Myklebust 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 19:21 UTC (permalink / raw) To: Peter Zijlstra Cc: Wu Fengguang, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2010-01-06 at 20:07 +0100, Peter Zijlstra wrote: > On Wed, 2010-01-06 at 13:52 -0500, Trond Myklebust wrote: > > On Wed, 2010-01-06 at 19:37 +0100, Peter Zijlstra wrote: > > > On Wed, 2010-01-06 at 13:26 -0500, Trond Myklebust wrote: > > > > OK. It looks as if the only key to finding out how many unstable writes > > > > we have is to use global_page_state(NR_UNSTABLE_NFS), so we can't > > > > specifically target our own backing-dev. > > > > > > Would be a simple matter of splitting BDI_UNSTABLE out from > > > BDI_RECLAIMABLE, no? > > > > > > Something like > > > > OK. How about if we also add in a bdi->capabilities flag to tell that we > > might have BDI_UNSTABLE? That would allow us to avoid the potentially > > expensive extra calls to bdi_stat() and bdi_stat_sum() for the non-nfs > > case? > > The bdi_stat_sum() in the error limit is basically the only such > expensive op, but I suspect we might hit that more than enough. So sure > that sounds like a plan. > This should apply on top of your patch.... Cheers Trond ------------------------------------------------------------------------------------------------ VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Speeds up the accounting in balance_dirty_pages() for non-nfs devices. 
Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> --- fs/nfs/client.c | 1 + include/linux/backing-dev.h | 6 ++++++ mm/page-writeback.c | 16 +++++++++++----- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/fs/nfs/client.c b/fs/nfs/client.c index ee77713..d0b060a 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -890,6 +890,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo * server->backing_dev_info.name = "nfs"; server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD; + server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE; if (server->wsize > max_rpc_payload) server->wsize = max_rpc_payload; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 42c3e2a..8b45166 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -232,6 +232,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); #define BDI_CAP_EXEC_MAP 0x00000040 #define BDI_CAP_NO_ACCT_WB 0x00000080 #define BDI_CAP_SWAP_BACKED 0x00000100 +#define BDI_CAP_ACCT_UNSTABLE 0x00000200 #define BDI_CAP_VMFLAGS \ (BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP) @@ -311,6 +312,11 @@ static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi) return bdi == &default_backing_dev_info; } +static inline bool bdi_cap_account_unstable(struct backing_dev_info *bdi) +{ + return bdi->capabilities & BDI_CAP_ACCT_UNSTABLE; +} + static inline bool mapping_cap_writeback_dirty(struct address_space *mapping) { return bdi_cap_writeback_dirty(mapping->backing_dev_info); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index aa26b0f..d90a0db 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -273,8 +273,9 @@ static void clip_bdi_dirty_limit(struct backing_dev_info *bdi, avail_dirty = 0; avail_dirty += bdi_stat(bdi, BDI_DIRTY) + - bdi_stat(bdi, BDI_UNSTABLE) + bdi_stat(bdi, BDI_WRITEBACK); + if 
(bdi_cap_account_unstable(bdi)) + avail_dirty += bdi_stat(bdi, BDI_UNSTABLE); *pbdi_dirty = min(*pbdi_dirty, avail_dirty); } @@ -512,8 +513,9 @@ static void balance_dirty_pages(struct address_space *mapping, nr_unstable_nfs; nr_writeback = global_page_state(NR_WRITEBACK); - bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + - bdi_stat(bdi, BDI_UNSTABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -563,11 +565,15 @@ static void balance_dirty_pages(struct address_space *mapping, * deltas. */ if (bdi_thresh < 2*bdi_stat_error(bdi)) { - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY) + + bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat_sum(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK); } else if (bdi_nr_reclaimable) { - bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); } ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-06 19:21 ` Trond Myklebust @ 2010-01-06 19:53 ` Trond Myklebust 2010-01-06 20:09 ` Jan Kara 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 19:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Wu Fengguang, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2010-01-06 at 14:21 -0500, Trond Myklebust wrote: > On Wed, 2010-01-06 at 20:07 +0100, Peter Zijlstra wrote: > > On Wed, 2010-01-06 at 13:52 -0500, Trond Myklebust wrote: > > > On Wed, 2010-01-06 at 19:37 +0100, Peter Zijlstra wrote: > > > > On Wed, 2010-01-06 at 13:26 -0500, Trond Myklebust wrote: > > > > > OK. It looks as if the only key to finding out how many unstable writes > > > > > we have is to use global_page_state(NR_UNSTABLE_NFS), so we can't > > > > > specifically target our own backing-dev. > > > > > > > > Would be a simple matter of splitting BDI_UNSTABLE out from > > > > BDI_RECLAIMABLE, no? > > > > > > > > Something like > > > > > > OK. How about if we also add in a bdi->capabilities flag to tell that we > > > might have BDI_UNSTABLE? That would allow us to avoid the potentially > > > expensive extra calls to bdi_stat() and bdi_stat_sum() for the non-nfs > > > case? > > > > The bdi_stat_sum() in the error limit is basically the only such > > expensive op, but I suspect we might hit that more than enough. So sure > > that sounds like a plan. > > > > This should apply on top of your patch.... ...and finally, this should convert the previous NFS patch to use the per-bdi accounting. 
Cheers Trond -------------------------------------------------------------------------------------- VM: Use per-bdi unstable accounting to improve use of wbc->force_commit From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> --- mm/page-writeback.c | 13 +++++++------ 1 files changed, 7 insertions(+), 6 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index d90a0db..c537543 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -487,7 +487,6 @@ static void balance_dirty_pages(struct address_space *mapping, { long nr_reclaimable, bdi_nr_reclaimable; long nr_writeback, bdi_nr_writeback; - long nr_unstable_nfs; unsigned long background_thresh; unsigned long dirty_thresh; unsigned long bdi_thresh; @@ -504,18 +503,20 @@ static void balance_dirty_pages(struct address_space *mapping, .nr_to_write = write_chunk, .range_cyclic = 1, }; + long bdi_nr_unstable = 0; get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); - nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS); nr_reclaimable = global_page_state(NR_FILE_DIRTY) + - nr_unstable_nfs; + global_page_state(NR_UNSTABLE_NFS); nr_writeback = global_page_state(NR_WRITEBACK); bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); - if (bdi_cap_account_unstable(bdi)) - bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); + if (bdi_cap_account_unstable(bdi)) { + bdi_nr_unstable = bdi_stat(bdi, BDI_UNSTABLE); + bdi_nr_reclaimable += bdi_nr_unstable; + } bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -545,7 +546,7 @@ static void balance_dirty_pages(struct address_space *mapping, if (bdi_nr_reclaimable > bdi_thresh) { wbc.force_commit = 0; /* Force NFS to also free up unstable writes. 
*/ - if (nr_unstable_nfs > nr_reclaimable / 2) + if (bdi_nr_unstable > bdi_nr_reclaimable / 2) wbc.force_commit = 1; writeback_inodes_wbc(&wbc); ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-06 19:53 ` Trond Myklebust @ 2010-01-06 20:09 ` Jan Kara [not found] ` <20100106200928.GB22781-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Jan Kara @ 2010-01-06 20:09 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Wu Fengguang, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed 06-01-10 14:53:14, Trond Myklebust wrote: > ...and finally, this should convert the previous NFS patch to use the > per-bdi accounting. > > Cheers > Trond > > -------------------------------------------------------------------------------------- > VM: Use per-bdi unstable accounting to improve use of wbc->force_commit > > From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > > Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> I like this. You can add Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> to this patch as well as to the previous patches adding unstable pages accounting. 
Honza > --- > > mm/page-writeback.c | 13 +++++++------ > 1 files changed, 7 insertions(+), 6 deletions(-) > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index d90a0db..c537543 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -487,7 +487,6 @@ static void balance_dirty_pages(struct address_space *mapping, > { > long nr_reclaimable, bdi_nr_reclaimable; > long nr_writeback, bdi_nr_writeback; > - long nr_unstable_nfs; > unsigned long background_thresh; > unsigned long dirty_thresh; > unsigned long bdi_thresh; > @@ -504,18 +503,20 @@ static void balance_dirty_pages(struct address_space *mapping, > .nr_to_write = write_chunk, > .range_cyclic = 1, > }; > + long bdi_nr_unstable = 0; > > get_dirty_limits(&background_thresh, &dirty_thresh, > &bdi_thresh, bdi); > > - nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS); > nr_reclaimable = global_page_state(NR_FILE_DIRTY) + > - nr_unstable_nfs; > + global_page_state(NR_UNSTABLE_NFS); > nr_writeback = global_page_state(NR_WRITEBACK); > > bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); > - if (bdi_cap_account_unstable(bdi)) > - bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); > + if (bdi_cap_account_unstable(bdi)) { > + bdi_nr_unstable = bdi_stat(bdi, BDI_UNSTABLE); > + bdi_nr_reclaimable += bdi_nr_unstable; > + } > bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); > > if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) > @@ -545,7 +546,7 @@ static void balance_dirty_pages(struct address_space *mapping, > if (bdi_nr_reclaimable > bdi_thresh) { > wbc.force_commit = 0; > /* Force NFS to also free up unstable writes. 
*/ > - if (nr_unstable_nfs > nr_reclaimable / 2) > + if (bdi_nr_unstable > bdi_nr_reclaimable / 2) > wbc.force_commit = 1; > > writeback_inodes_wbc(&wbc); > -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 0/6] Re: [PATCH] improve the performance of large sequential write NFS workloads [not found] ` <20100106200928.GB22781-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> @ 2010-01-06 20:51 ` Trond Myklebust [not found] ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-07 8:16 ` Peter Zijlstra 0 siblings, 2 replies; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 20:51 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org OK, here is the full series so far. I'm resending because I had to fix up a couple of BDI_UNSTABLE typos in Peter's patch... Cheers Trond --- Peter Zijlstra (1): VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE Trond Myklebust (5): NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set VM: Use per-bdi unstable accounting to improve use of wbc->force_commit VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices VM/NFS: The VM must tell the filesystem when to free reclaimable pages VFS: Ensure that writeback_single_inode() commits unstable writes fs/fs-writeback.c | 31 ++++++++++++++++++++++++++++++- fs/nfs/client.c | 1 + fs/nfs/file.c | 1 + fs/nfs/inode.c | 16 ---------------- fs/nfs/internal.h | 3 ++- fs/nfs/super.c | 2 -- fs/nfs/write.c | 39 +++++++++++++++++++++++++++++++++++---- include/linux/backing-dev.h | 9 ++++++++- include/linux/fs.h | 9 +++++++++ include/linux/writeback.h | 5 +++++ mm/backing-dev.c | 6 ++++-- mm/filemap.c | 2 +- mm/page-writeback.c | 30 ++++++++++++++++++++++++------ mm/truncate.c | 2 +- 14 files changed, 121 insertions(+), 35 deletions(-) ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 5/6] VM: Use per-bdi unstable accounting to improve use of wbc->force_commit [not found] ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2010-01-06 20:51 ` Trond Myklebust [not found] ` <20100106205110.22547.32584.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-06 20:51 ` [PATCH 4/6] VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices Trond Myklebust ` (5 subsequent siblings) 6 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 20:51 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- mm/page-writeback.c | 13 +++++++------ 1 files changed, 7 insertions(+), 6 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index d90a0db..c537543 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -487,7 +487,6 @@ static void balance_dirty_pages(struct address_space *mapping, { long nr_reclaimable, bdi_nr_reclaimable; long nr_writeback, bdi_nr_writeback; - long nr_unstable_nfs; unsigned long background_thresh; unsigned long dirty_thresh; unsigned long bdi_thresh; @@ -504,18 +503,20 @@ static void balance_dirty_pages(struct address_space *mapping, .nr_to_write = write_chunk, .range_cyclic = 1, }; + long bdi_nr_unstable = 0; get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); - nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS); nr_reclaimable = global_page_state(NR_FILE_DIRTY) + - nr_unstable_nfs; + global_page_state(NR_UNSTABLE_NFS); nr_writeback = global_page_state(NR_WRITEBACK); bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); - if (bdi_cap_account_unstable(bdi)) - bdi_nr_reclaimable += 
bdi_stat(bdi, BDI_UNSTABLE); + if (bdi_cap_account_unstable(bdi)) { + bdi_nr_unstable = bdi_stat(bdi, BDI_UNSTABLE); + bdi_nr_reclaimable += bdi_nr_unstable; + } bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -545,7 +546,7 @@ static void balance_dirty_pages(struct address_space *mapping, if (bdi_nr_reclaimable > bdi_thresh) { wbc.force_commit = 0; /* Force NFS to also free up unstable writes. */ - if (nr_unstable_nfs > nr_reclaimable / 2) + if (bdi_nr_unstable > bdi_nr_reclaimable / 2) wbc.force_commit = 1; writeback_inodes_wbc(&wbc); ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 5/6] VM: Use per-bdi unstable accounting to improve use of wbc->force_commit [not found] ` <20100106205110.22547.32584.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2010-01-07 2:34 ` Wu Fengguang 0 siblings, 0 replies; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 2:34 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Trond, I'm with Jan that this patch can be folded :) Thanks, Fengguang On Thu, Jan 07, 2010 at 04:51:10AM +0800, Trond Myklebust wrote: > Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > --- > > mm/page-writeback.c | 13 +++++++------ > 1 files changed, 7 insertions(+), 6 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index d90a0db..c537543 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -487,7 +487,6 @@ static void balance_dirty_pages(struct address_space *mapping, > { > long nr_reclaimable, bdi_nr_reclaimable; > long nr_writeback, bdi_nr_writeback; > - long nr_unstable_nfs; > unsigned long background_thresh; > unsigned long dirty_thresh; > unsigned long bdi_thresh; > @@ -504,18 +503,20 @@ static void balance_dirty_pages(struct address_space *mapping, > .nr_to_write = write_chunk, > .range_cyclic = 1, > }; > + long bdi_nr_unstable = 0; > > get_dirty_limits(&background_thresh, &dirty_thresh, > &bdi_thresh, bdi); > > - nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS); > nr_reclaimable = global_page_state(NR_FILE_DIRTY) + > - nr_unstable_nfs; > + global_page_state(NR_UNSTABLE_NFS); > nr_writeback = global_page_state(NR_WRITEBACK); > > bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); > - if (bdi_cap_account_unstable(bdi)) > - bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); > + if 
(bdi_cap_account_unstable(bdi)) { > + bdi_nr_unstable = bdi_stat(bdi, BDI_UNSTABLE); > + bdi_nr_reclaimable += bdi_nr_unstable; > + } > bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); > > if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) > @@ -545,7 +546,7 @@ static void balance_dirty_pages(struct address_space *mapping, > if (bdi_nr_reclaimable > bdi_thresh) { > wbc.force_commit = 0; > /* Force NFS to also free up unstable writes. */ > - if (nr_unstable_nfs > nr_reclaimable / 2) > + if (bdi_nr_unstable > bdi_nr_reclaimable / 2) > wbc.force_commit = 1; > > writeback_inodes_wbc(&wbc); ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 4/6] VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices [not found] ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-06 20:51 ` [PATCH 5/6] VM: Use per-bdi unstable accounting to improve use of wbc->force_commit Trond Myklebust @ 2010-01-06 20:51 ` Trond Myklebust 2010-01-07 1:56 ` Wu Fengguang 2010-01-06 20:51 ` [PATCH 6/6] NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set Trond Myklebust ` (4 subsequent siblings) 6 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 20:51 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Speeds up the accounting in balance_dirty_pages() for non-nfs devices. Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/nfs/client.c | 1 + include/linux/backing-dev.h | 6 ++++++ mm/page-writeback.c | 16 +++++++++++----- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/fs/nfs/client.c b/fs/nfs/client.c index ee77713..d0b060a 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -890,6 +890,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo * server->backing_dev_info.name = "nfs"; server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD; + server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE; if (server->wsize > max_rpc_payload) server->wsize = max_rpc_payload; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 42c3e2a..8b45166 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -232,6 +232,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); #define BDI_CAP_EXEC_MAP 0x00000040 #define 
BDI_CAP_NO_ACCT_WB 0x00000080 #define BDI_CAP_SWAP_BACKED 0x00000100 +#define BDI_CAP_ACCT_UNSTABLE 0x00000200 #define BDI_CAP_VMFLAGS \ (BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP) @@ -311,6 +312,11 @@ static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi) return bdi == &default_backing_dev_info; } +static inline bool bdi_cap_account_unstable(struct backing_dev_info *bdi) +{ + return bdi->capabilities & BDI_CAP_ACCT_UNSTABLE; +} + static inline bool mapping_cap_writeback_dirty(struct address_space *mapping) { return bdi_cap_writeback_dirty(mapping->backing_dev_info); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index aa26b0f..d90a0db 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -273,8 +273,9 @@ static void clip_bdi_dirty_limit(struct backing_dev_info *bdi, avail_dirty = 0; avail_dirty += bdi_stat(bdi, BDI_DIRTY) + - bdi_stat(bdi, BDI_UNSTABLE) + bdi_stat(bdi, BDI_WRITEBACK); + if (bdi_cap_account_unstable(bdi)) + avail_dirty += bdi_stat(bdi, BDI_UNSTABLE); *pbdi_dirty = min(*pbdi_dirty, avail_dirty); } @@ -512,8 +513,9 @@ static void balance_dirty_pages(struct address_space *mapping, nr_unstable_nfs; nr_writeback = global_page_state(NR_WRITEBACK); - bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + - bdi_stat(bdi, BDI_UNSTABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -563,11 +565,15 @@ static void balance_dirty_pages(struct address_space *mapping, * deltas. 
*/ if (bdi_thresh < 2*bdi_stat_error(bdi)) { - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY) + + bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat_sum(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK); } else if (bdi_nr_reclaimable) { - bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); } ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 4/6] VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices 2010-01-06 20:51 ` [PATCH 4/6] VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices Trond Myklebust @ 2010-01-07 1:56 ` Wu Fengguang 0 siblings, 0 replies; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 1:56 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs@vger.kernel.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org On Thu, Jan 07, 2010 at 04:51:10AM +0800, Trond Myklebust wrote: > avail_dirty += bdi_stat(bdi, BDI_DIRTY) + > - bdi_stat(bdi, BDI_UNSTABLE) + > bdi_stat(bdi, BDI_WRITEBACK); > + if (bdi_cap_account_unstable(bdi)) > + avail_dirty += bdi_stat(bdi, BDI_UNSTABLE); It seems that not changing the bdi_stat()s makes more readable code, otherwise looks OK to me. Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 6/6] NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set [not found] ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-06 20:51 ` [PATCH 5/6] VM: Use per-bdi unstable accounting to improve use of wbc->force_commit Trond Myklebust 2010-01-06 20:51 ` [PATCH 4/6] VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices Trond Myklebust @ 2010-01-06 20:51 ` Trond Myklebust [not found] ` <20100106205110.22547.31434.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-06 20:51 ` [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages Trond Myklebust ` (3 subsequent siblings) 6 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 20:51 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> --- fs/nfs/write.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 978de7f..d6d8048 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1423,7 +1423,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, mark_inode_unstable_pages(inode); return 0; } - if (wbc->nonblocking) + if (wbc->nonblocking || wbc->for_background) flags = 0; ret = nfs_commit_inode(inode, flags); if (ret > 0) ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 6/6] NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set [not found] ` <20100106205110.22547.31434.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2010-01-07 2:32 ` Wu Fengguang 0 siblings, 0 replies; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 2:32 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 07, 2010 at 04:51:10AM +0800, Trond Myklebust wrote: > @@ -1423,7 +1423,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, > mark_inode_unstable_pages(inode); > return 0; > } > - if (wbc->nonblocking) > + if (wbc->nonblocking || wbc->for_background) > flags = 0; > ret = nfs_commit_inode(inode, flags); > if (ret > 0) Acked-by: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages [not found] ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> ` (2 preceding siblings ...) 2010-01-06 20:51 ` [PATCH 6/6] NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set Trond Myklebust @ 2010-01-06 20:51 ` Trond Myklebust 2010-01-07 2:29 ` Wu Fengguang 2010-01-06 20:51 ` [PATCH 3/6] VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE Trond Myklebust ` (2 subsequent siblings) 6 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 20:51 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org balance_dirty_pages() should really tell the filesystem whether or not it has an excess of actual dirty pages, or whether it would be more useful to start freeing up the unstable writes. Assume that if the number of unstable writes is more than 1/2 the number of reclaimable pages, then we should force NFS to free up the former. Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/nfs/write.c | 2 +- include/linux/writeback.h | 5 +++++ mm/page-writeback.c | 9 ++++++++- 3 files changed, 14 insertions(+), 2 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 910be28..ee3daf4 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1417,7 +1417,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, /* Don't commit yet if this is a non-blocking flush and there are * outstanding writes for this mapping. 
*/ - if (wbc->sync_mode != WB_SYNC_ALL && + if (!wbc->force_commit && wbc->sync_mode != WB_SYNC_ALL && radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, NFS_PAGE_TAG_LOCKED)) { mark_inode_unstable_pages(inode); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 76e8903..3fd5c3e 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -62,6 +62,11 @@ struct writeback_control { * so we use a single control to update them */ unsigned no_nrwrite_index_update:1; + /* + * The following is used by balance_dirty_pages() to + * force NFS to commit unstable pages. + */ + unsigned force_commit:1; }; /* diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0b19943..ede5356 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -485,6 +485,7 @@ static void balance_dirty_pages(struct address_space *mapping, { long nr_reclaimable, bdi_nr_reclaimable; long nr_writeback, bdi_nr_writeback; + long nr_unstable_nfs; unsigned long background_thresh; unsigned long dirty_thresh; unsigned long bdi_thresh; @@ -505,8 +506,9 @@ static void balance_dirty_pages(struct address_space *mapping, get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); + nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS); nr_reclaimable = global_page_state(NR_FILE_DIRTY) + - global_page_state(NR_UNSTABLE_NFS); + nr_unstable_nfs; nr_writeback = global_page_state(NR_WRITEBACK); bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); @@ -537,6 +539,11 @@ static void balance_dirty_pages(struct address_space *mapping, * up. */ if (bdi_nr_reclaimable > bdi_thresh) { + wbc.force_commit = 0; + /* Force NFS to also free up unstable writes. 
*/ + if (nr_unstable_nfs > nr_reclaimable / 2) + wbc.force_commit = 1; + writeback_inodes_wbc(&wbc); pages_written += write_chunk - wbc.nr_to_write; get_dirty_limits(&background_thresh, &dirty_thresh, -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages 2010-01-06 20:51 ` [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages Trond Myklebust @ 2010-01-07 2:29 ` Wu Fengguang 2010-01-07 4:49 ` Trond Myklebust 0 siblings, 1 reply; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 2:29 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs@vger.kernel.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org On Thu, Jan 07, 2010 at 04:51:10AM +0800, Trond Myklebust wrote: > balance_dirty_pages() should really tell the filesystem whether or not it > has an excess of actual dirty pages, or whether it would be more useful to > start freeing up the unstable writes. > > Assume that if the number of unstable writes is more than 1/2 the number of > reclaimable pages, then we should force NFS to free up the former. > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > Acked-by: Jan Kara <jack@suse.cz> > --- > > fs/nfs/write.c | 2 +- > include/linux/writeback.h | 5 +++++ > mm/page-writeback.c | 9 ++++++++- > 3 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index 910be28..ee3daf4 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -1417,7 +1417,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, > /* Don't commit yet if this is a non-blocking flush and there are > * outstanding writes for this mapping. 
> */ > - if (wbc->sync_mode != WB_SYNC_ALL && > + if (!wbc->force_commit && wbc->sync_mode != WB_SYNC_ALL && > radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, > NFS_PAGE_TAG_LOCKED)) { > mark_inode_unstable_pages(inode); > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > index 76e8903..3fd5c3e 100644 > --- a/include/linux/writeback.h > +++ b/include/linux/writeback.h > @@ -62,6 +62,11 @@ struct writeback_control { > * so we use a single control to update them > */ > unsigned no_nrwrite_index_update:1; > + /* > + * The following is used by balance_dirty_pages() to > + * force NFS to commit unstable pages. > + */ In fact it may be too late to force commit at balance_dirty_pages() time: commit takes time and the application has already been blocked. If not convenient for now, I can make the change -- I'll remove the writeback_inodes_wbc() call altogether from balance_dirty_pages(). > + unsigned force_commit:1; > }; nfs_commit may be a more newbie friendly name? Thanks, Fengguang > /* > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 0b19943..ede5356 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -485,6 +485,7 @@ static void balance_dirty_pages(struct address_space *mapping, > { > long nr_reclaimable, bdi_nr_reclaimable; > long nr_writeback, bdi_nr_writeback; > + long nr_unstable_nfs; > unsigned long background_thresh; > unsigned long dirty_thresh; > unsigned long bdi_thresh; > @@ -505,8 +506,9 @@ static void balance_dirty_pages(struct address_space *mapping, > get_dirty_limits(&background_thresh, &dirty_thresh, > &bdi_thresh, bdi); > > + nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS); > nr_reclaimable = global_page_state(NR_FILE_DIRTY) + > - global_page_state(NR_UNSTABLE_NFS); > + nr_unstable_nfs; > nr_writeback = global_page_state(NR_WRITEBACK); > > bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); > @@ -537,6 +539,11 @@ static void balance_dirty_pages(struct address_space *mapping, > * up. 
> */ > if (bdi_nr_reclaimable > bdi_thresh) { > + wbc.force_commit = 0; > + /* Force NFS to also free up unstable writes. */ > + if (nr_unstable_nfs > nr_reclaimable / 2) > + wbc.force_commit = 1; > + > writeback_inodes_wbc(&wbc); > pages_written += write_chunk - wbc.nr_to_write; > get_dirty_limits(&background_thresh, &dirty_thresh, > ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages 2010-01-07 2:29 ` Wu Fengguang @ 2010-01-07 4:49 ` Trond Myklebust 2010-01-07 5:03 ` Wu Fengguang 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 4:49 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, 2010-01-07 at 10:29 +0800, Wu Fengguang wrote: > On Thu, Jan 07, 2010 at 04:51:10AM +0800, Trond Myklebust wrote: > > balance_dirty_pages() should really tell the filesystem whether or not it > > has an excess of actual dirty pages, or whether it would be more useful to > > start freeing up the unstable writes. > > > > Assume that if the number of unstable writes is more than 1/2 the number of > > reclaimable pages, then we should force NFS to free up the former. > > > > Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > > Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > --- > > > > fs/nfs/write.c | 2 +- > > include/linux/writeback.h | 5 +++++ > > mm/page-writeback.c | 9 ++++++++- > > 3 files changed, 14 insertions(+), 2 deletions(-) > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > index 910be28..ee3daf4 100644 > > --- a/fs/nfs/write.c > > +++ b/fs/nfs/write.c > > @@ -1417,7 +1417,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, > > /* Don't commit yet if this is a non-blocking flush and there are > > * outstanding writes for this mapping. 
> > */ > > - if (wbc->sync_mode != WB_SYNC_ALL && > > + if (!wbc->force_commit && wbc->sync_mode != WB_SYNC_ALL && > > radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, > > NFS_PAGE_TAG_LOCKED)) { > > mark_inode_unstable_pages(inode); > > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > > index 76e8903..3fd5c3e 100644 > > --- a/include/linux/writeback.h > > +++ b/include/linux/writeback.h > > @@ -62,6 +62,11 @@ struct writeback_control { > > * so we use a single control to update them > > */ > > unsigned no_nrwrite_index_update:1; > > + /* > > + * The following is used by balance_dirty_pages() to > > + * force NFS to commit unstable pages. > > + */ > > In fact it may be too late to force commit at balance_dirty_pages() > time: commit takes time and the application has already been blocked. > > If not convenient for now, I can make the change -- I'll remove the > writeback_inodes_wbc() call altogether from balance_dirty_pages(). You could always set the 'for_background' flag instead. > > + unsigned force_commit:1; > > }; > > nfs_commit may be a more newbie friendly name? We could possibly rename it to something like 'force_nfs_commit', but the comment above the declaration should really be sufficient. Trond -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages 2010-01-07 4:49 ` Trond Myklebust @ 2010-01-07 5:03 ` Wu Fengguang 2010-01-07 5:30 ` Trond Myklebust 0 siblings, 1 reply; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 5:03 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 07, 2010 at 12:49:23PM +0800, Trond Myklebust wrote: > On Thu, 2010-01-07 at 10:29 +0800, Wu Fengguang wrote: > > On Thu, Jan 07, 2010 at 04:51:10AM +0800, Trond Myklebust wrote: > > > balance_dirty_pages() should really tell the filesystem whether or not it > > > has an excess of actual dirty pages, or whether it would be more useful to > > > start freeing up the unstable writes. > > > > > > Assume that if the number of unstable writes is more than 1/2 the number of > > > reclaimable pages, then we should force NFS to free up the former. > > > > > > Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > > > Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > --- > > > > > > fs/nfs/write.c | 2 +- > > > include/linux/writeback.h | 5 +++++ > > > mm/page-writeback.c | 9 ++++++++- > > > 3 files changed, 14 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > > index 910be28..ee3daf4 100644 > > > --- a/fs/nfs/write.c > > > +++ b/fs/nfs/write.c > > > @@ -1417,7 +1417,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, > > > /* Don't commit yet if this is a non-blocking flush and there are > > > * outstanding writes for this mapping. 
> > > */ > > > - if (wbc->sync_mode != WB_SYNC_ALL && > > > + if (!wbc->force_commit && wbc->sync_mode != WB_SYNC_ALL && > > > radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, > > > NFS_PAGE_TAG_LOCKED)) { > > > mark_inode_unstable_pages(inode); > > > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > > > index 76e8903..3fd5c3e 100644 > > > --- a/include/linux/writeback.h > > > +++ b/include/linux/writeback.h > > > @@ -62,6 +62,11 @@ struct writeback_control { > > > * so we use a single control to update them > > > */ > > > unsigned no_nrwrite_index_update:1; > > > + /* > > > + * The following is used by balance_dirty_pages() to > > > + * force NFS to commit unstable pages. > > > + */ > > > > In fact it may be too late to force commit at balance_dirty_pages() > > time: commit takes time and the application has already been blocked. > > > > If not convenient for now, I can make the change -- I'll remove the > > writeback_inodes_wbc() call altogether from balance_dirty_pages(). > > You could always set the 'for_background' flag instead. Please this is misusing ->for_background.. Anyway it's not a big problem. I'll set the force_nfs_commit flag in background writeback. > > > + unsigned force_commit:1; > > > }; > > > > nfs_commit may be a more newbie friendly name? > > We could possibly rename it to something like 'force_nfs_commit', but > the comment above the declaration should really be sufficient. "commit" could also be misread as "commit a transaction"? Anyway I think adding an "nfs" limits the scope to NFS thus makes code reading somehow easier. Just a personal feeling. Thanks, Fengguang -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages 2010-01-07 5:03 ` Wu Fengguang @ 2010-01-07 5:30 ` Trond Myklebust 2010-01-07 14:37 ` Wu Fengguang 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 5:30 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, 2010-01-07 at 13:03 +0800, Wu Fengguang wrote: > > "commit" could also be misread as "commit a transaction"? > Anyway I think adding an "nfs" limits the scope to NFS thus makes code > reading somehow easier. Just a personal feeling. How about 'force_commit_unstable' instead? That ties it up to the unstable writes rather than NFS. Cheers Trond
* Re: [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages 2010-01-07 5:30 ` Trond Myklebust @ 2010-01-07 14:37 ` Wu Fengguang 0 siblings, 0 replies; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 14:37 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 07, 2010 at 01:30:43PM +0800, Trond Myklebust wrote: > On Thu, 2010-01-07 at 13:03 +0800, Wu Fengguang wrote: > > > "commit" could also be misread as "commit a transaction"? > > Anyway I think adding an "nfs" limits the scope to NFS thus makes code > > reading somehow easier. Just a personal feeling. > > How about 'force_commit_unstable' instead? That ties it up to the > unstable writes rather than NFS. That would be good, thanks! Fengguang
* [PATCH 3/6] VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE [not found] ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> ` (3 preceding siblings ...) 2010-01-06 20:51 ` [PATCH 2/6] VM/NFS: The VM must tell the filesystem when to free reclaimable pages Trond Myklebust @ 2010-01-06 20:51 ` Trond Myklebust [not found] ` <20100106205110.22547.93554.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-06 20:51 ` [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes Trond Myklebust 2010-01-06 21:44 ` [PATCH 0/6] Re: [PATCH] improve the performance of large sequential write NFS workloads Jan Kara 6 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 20:51 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org From: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Signed-off-by: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/nfs/write.c | 6 +++--- include/linux/backing-dev.h | 3 ++- mm/backing-dev.c | 6 ++++-- mm/filemap.c | 2 +- mm/page-writeback.c | 16 ++++++++++------ mm/truncate.c | 2 +- 6 files changed, 21 insertions(+), 14 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index ee3daf4..978de7f 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -440,7 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req) NFS_PAGE_TAG_COMMIT); spin_unlock(&inode->i_lock); inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); - inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); + inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE); 
mark_inode_unstable_pages(inode); } @@ -451,7 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) { dec_zone_page_state(page, NR_UNSTABLE_NFS); - dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE); + dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE); return 1; } return 0; @@ -1322,7 +1322,7 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) nfs_mark_request_commit(req); dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req->wb_page->mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_UNSTABLE); nfs_clear_page_tag_locked(req); } return -ENOMEM; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index fcbc26a..42c3e2a 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -36,7 +36,8 @@ enum bdi_state { typedef int (congested_fn)(void *, int); enum bdi_stat_item { - BDI_RECLAIMABLE, + BDI_DIRTY, + BDI_UNSTABLE, BDI_WRITEBACK, NR_BDI_STAT_ITEMS }; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 0e8ca03..88f3655 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -88,7 +88,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) #define K(x) ((x) << (PAGE_SHIFT - 10)) seq_printf(m, "BdiWriteback: %8lu kB\n" - "BdiReclaimable: %8lu kB\n" + "BdiDirty: %8lu kB\n" + "BdiUnstable: %8lu kB\n" "BdiDirtyThresh: %8lu kB\n" "DirtyThresh: %8lu kB\n" "BackgroundThresh: %8lu kB\n" @@ -102,7 +103,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "wb_list: %8u\n" "wb_cnt: %8u\n", (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), - (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)), + (unsigned long) K(bdi_stat(bdi, BDI_DIRTY)), + (unsigned long) K(bdi_stat(bdi, BDI_UNSTABLE)), K(bdi_thresh), K(dirty_thresh), K(background_thresh), nr_wb, nr_dirty, nr_io, nr_more_io, !list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask, diff --git a/mm/filemap.c b/mm/filemap.c index 96ac6b0..458387d 100644 --- 
a/mm/filemap.c +++ b/mm/filemap.c @@ -136,7 +136,7 @@ void __remove_from_page_cache(struct page *page) */ if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY); } } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ede5356..aa26b0f 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -272,7 +272,8 @@ static void clip_bdi_dirty_limit(struct backing_dev_info *bdi, else avail_dirty = 0; - avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) + + avail_dirty += bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE) + bdi_stat(bdi, BDI_WRITEBACK); *pbdi_dirty = min(*pbdi_dirty, avail_dirty); @@ -511,7 +512,8 @@ static void balance_dirty_pages(struct address_space *mapping, nr_unstable_nfs; nr_writeback = global_page_state(NR_WRITEBACK); - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -561,10 +563,12 @@ static void balance_dirty_pages(struct address_space *mapping, * deltas. 
*/ if (bdi_thresh < 2*bdi_stat_error(bdi)) { - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY) + + bdi_stat_sum(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK); } else if (bdi_nr_reclaimable) { - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); } @@ -1086,7 +1090,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) { if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); - __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY); task_dirty_inc(current); task_io_account_write(PAGE_CACHE_SIZE); } @@ -1262,7 +1266,7 @@ int clear_page_dirty_for_io(struct page *page) if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_DIRTY); return 1; } return 0; diff --git a/mm/truncate.c b/mm/truncate.c index 342deee..b0ce8fb 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -75,7 +75,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size) if (mapping && mapping_cap_account_dirty(mapping)) { dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_DIRTY); if (account_size) task_io_account_cancelled_write(account_size); } -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 3/6] VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE [not found] ` <20100106205110.22547.93554.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2010-01-07 1:48 ` Wu Fengguang 0 siblings, 0 replies; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 1:48 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > seq_printf(m, > "BdiWriteback: %8lu kB\n" > - "BdiReclaimable: %8lu kB\n" > + "BdiDirty: %8lu kB\n" > + "BdiUnstable: %8lu kB\n" > "BdiDirtyThresh: %8lu kB\n" > "DirtyThresh: %8lu kB\n" > "BackgroundThresh: %8lu kB\n" This also reduces one synthetic concept ;) Thanks! Reviewed-by: Wu Fengguang <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
* [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes [not found] ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> ` (4 preceding siblings ...) 2010-01-06 20:51 ` [PATCH 3/6] VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE Trond Myklebust @ 2010-01-06 20:51 ` Trond Myklebust [not found] ` <20100106205110.22547.17971.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-06 21:44 ` [PATCH 0/6] Re: [PATCH] improve the performance of large sequential write NFS workloads Jan Kara 6 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 20:51 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org If the call to do_writepages() succeeded in starting writeback, we do not know whether or not we will need to COMMIT any unstable writes until after the write RPC calls are finished. Currently, we assume that at least one write RPC call will have finished, and set I_DIRTY_DATASYNC by the time do_writepages is done, so that write_inode() is triggered. In order to ensure reliable operation (i.e. ensure that a single call to writeback_single_inode() with WB_SYNC_ALL set suffices to ensure that pages are on disk) we need to first wait for filemap_fdatawait() to complete, then test for unstable pages. Since NFS is currently the only filesystem that has unstable pages, we can add a new inode state I_UNSTABLE_PAGES that NFS alone will set. When set, this will trigger a callback to a new address_space_operation to call the COMMIT. 
Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/fs-writeback.c | 31 ++++++++++++++++++++++++++++++- fs/nfs/file.c | 1 + fs/nfs/inode.c | 16 ---------------- fs/nfs/internal.h | 3 ++- fs/nfs/super.c | 2 -- fs/nfs/write.c | 33 ++++++++++++++++++++++++++++++++- include/linux/fs.h | 9 +++++++++ 7 files changed, 74 insertions(+), 21 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 1a7c42c..3bc0a96 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -389,6 +389,17 @@ static int write_inode(struct inode *inode, int sync) } /* + * Commit the NFS unstable pages. + */ +static int commit_unstable_pages(struct address_space *mapping, + struct writeback_control *wbc) +{ + if (mapping->a_ops && mapping->a_ops->commit_unstable_pages) + return mapping->a_ops->commit_unstable_pages(mapping, wbc); + return 0; +} + +/* * Wait for writeback on an inode to complete. */ static void inode_wait_for_writeback(struct inode *inode) @@ -475,6 +486,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) } spin_lock(&inode_lock); + /* + * Special state for cleaning NFS unstable pages + */ + if (inode->i_state & I_UNSTABLE_PAGES) { + int err; + inode->i_state &= ~I_UNSTABLE_PAGES; + spin_unlock(&inode_lock); + err = commit_unstable_pages(mapping, wbc); + if (ret == 0) + ret = err; + spin_lock(&inode_lock); + } inode->i_state &= ~I_SYNC; if (!(inode->i_state & (I_FREEING | I_CLEAR))) { if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) { @@ -533,6 +556,12 @@ select_queue: inode->i_state |= I_DIRTY_PAGES; redirty_tail(inode); } + } else if (inode->i_state & I_UNSTABLE_PAGES) { + /* + * The inode has got yet more unstable pages to + * commit. 
Requeue on b_more_io + */ + requeue_io(inode); } else if (atomic_read(&inode->i_count)) { /* * The inode is clean, inuse @@ -1051,7 +1080,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) spin_lock(&inode_lock); if ((inode->i_state & flags) != flags) { - const int was_dirty = inode->i_state & I_DIRTY; + const int was_dirty = inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES); inode->i_state |= flags; diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 6b89132..67e50ac 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -526,6 +526,7 @@ const struct address_space_operations nfs_file_aops = { .migratepage = nfs_migrate_page, .launder_page = nfs_launder_page, .error_remove_page = generic_error_remove_page, + .commit_unstable_pages = nfs_commit_unstable_pages, }; /* diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index faa0918..8341709 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -97,22 +97,6 @@ u64 nfs_compat_user_ino64(u64 fileid) return ino; } -int nfs_write_inode(struct inode *inode, int sync) -{ - int ret; - - if (sync) { - ret = filemap_fdatawait(inode->i_mapping); - if (ret == 0) - ret = nfs_commit_inode(inode, FLUSH_SYNC); - } else - ret = nfs_commit_inode(inode, 0); - if (ret >= 0) - return 0; - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); - return ret; -} - void nfs_clear_inode(struct inode *inode) { /* diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 29e464d..7bb326f 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -211,7 +211,6 @@ extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask); extern struct workqueue_struct *nfsiod_workqueue; extern struct inode *nfs_alloc_inode(struct super_block *sb); extern void nfs_destroy_inode(struct inode *); -extern int nfs_write_inode(struct inode *,int); extern void nfs_clear_inode(struct inode *); #ifdef CONFIG_NFS_V4 extern void nfs4_clear_inode(struct inode *); @@ -253,6 +252,8 @@ extern int nfs4_path_walk(struct nfs_server *server, extern void nfs_read_prepare(struct rpc_task *task, 
void *calldata); /* write.c */ +extern int nfs_commit_unstable_pages(struct address_space *mapping, + struct writeback_control *wbc); extern void nfs_write_prepare(struct rpc_task *task, void *calldata); #ifdef CONFIG_MIGRATION extern int nfs_migrate_page(struct address_space *, diff --git a/fs/nfs/super.c b/fs/nfs/super.c index ce907ef..805c1a0 100644 --- a/fs/nfs/super.c +++ b/fs/nfs/super.c @@ -265,7 +265,6 @@ struct file_system_type nfs_xdev_fs_type = { static const struct super_operations nfs_sops = { .alloc_inode = nfs_alloc_inode, .destroy_inode = nfs_destroy_inode, - .write_inode = nfs_write_inode, .statfs = nfs_statfs, .clear_inode = nfs_clear_inode, .umount_begin = nfs_umount_begin, @@ -334,7 +333,6 @@ struct file_system_type nfs4_referral_fs_type = { static const struct super_operations nfs4_sops = { .alloc_inode = nfs_alloc_inode, .destroy_inode = nfs_destroy_inode, - .write_inode = nfs_write_inode, .statfs = nfs_statfs, .clear_inode = nfs4_clear_inode, .umount_begin = nfs_umount_begin, diff --git a/fs/nfs/write.c b/fs/nfs/write.c index d171696..910be28 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) spin_unlock(&inode->i_lock); inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); + mark_inode_unstable_pages(inode); } static int @@ -1406,11 +1406,42 @@ int nfs_commit_inode(struct inode *inode, int how) } return res; } + +int nfs_commit_unstable_pages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *inode = mapping->host; + int flags = FLUSH_SYNC; + int ret; + + /* Don't commit yet if this is a non-blocking flush and there are + * outstanding writes for this mapping. 
+ */ + if (wbc->sync_mode != WB_SYNC_ALL && + radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, + NFS_PAGE_TAG_LOCKED)) { + mark_inode_unstable_pages(inode); + return 0; + } + if (wbc->nonblocking) + flags = 0; + ret = nfs_commit_inode(inode, flags); + if (ret > 0) + ret = 0; + return ret; +} + #else static inline int nfs_commit_list(struct inode *inode, struct list_head *head, int how) { return 0; } + +int nfs_commit_unstable_pages(struct address_space *mapping, + struct writeback_control *wbc) +{ + return 0; +} #endif long nfs_sync_mapping_wait(struct address_space *mapping, struct writeback_control *wbc, int how) diff --git a/include/linux/fs.h b/include/linux/fs.h index 9147ca8..ea0b7a3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -602,6 +602,8 @@ struct address_space_operations { int (*is_partially_uptodate) (struct page *, read_descriptor_t *, unsigned long); int (*error_remove_page)(struct address_space *, struct page *); + int (*commit_unstable_pages)(struct address_space *, + struct writeback_control *); }; /* @@ -1635,6 +1637,8 @@ struct super_operations { #define I_CLEAR 64 #define __I_SYNC 7 #define I_SYNC (1 << __I_SYNC) +#define __I_UNSTABLE_PAGES 9 +#define I_UNSTABLE_PAGES (1 << __I_UNSTABLE_PAGES) #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) @@ -1649,6 +1653,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode) __mark_inode_dirty(inode, I_DIRTY_SYNC); } +static inline void mark_inode_unstable_pages(struct inode *inode) +{ + __mark_inode_dirty(inode, I_UNSTABLE_PAGES); +} + /** * inc_nlink - directly increment an inode's link count * @inode: inode -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes [not found] ` <20100106205110.22547.17971.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2010-01-06 21:38 ` Jan Kara [not found] ` <20100106213843.GD22781-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> 2010-01-07 2:18 ` Wu Fengguang 1 sibling, 1 reply; 66+ messages in thread From: Jan Kara @ 2010-01-06 21:38 UTC (permalink / raw) To: Trond Myklebust Cc: Wu Fengguang, Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed 06-01-10 15:51:10, Trond Myklebust wrote: ... > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9147ca8..ea0b7a3 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1635,6 +1637,8 @@ struct super_operations { > #define I_CLEAR 64 > #define __I_SYNC 7 > #define I_SYNC (1 << __I_SYNC) > +#define __I_UNSTABLE_PAGES 9 Hum, why isn't this 8? Honza -- Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes [not found] ` <20100106213843.GD22781-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org> @ 2010-01-06 21:48 ` Trond Myklebust 0 siblings, 0 replies; 66+ messages in thread From: Trond Myklebust @ 2010-01-06 21:48 UTC (permalink / raw) To: Jan Kara Cc: Wu Fengguang, Peter Zijlstra, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Wed, 2010-01-06 at 22:38 +0100, Jan Kara wrote: > On Wed 06-01-10 15:51:10, Trond Myklebust wrote: > ... > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 9147ca8..ea0b7a3 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -1635,6 +1637,8 @@ struct super_operations { > > #define I_CLEAR 64 > > #define __I_SYNC 7 > > #define I_SYNC (1 << __I_SYNC) > > +#define __I_UNSTABLE_PAGES 9 > Hum, why isn't this 8? > > Honza I missed Christoph's patch that got rid of I_LOCK. I think that was merged after I started work on these patches. I'd be quite OK with changing the above value to 8 if that is preferable. Cheers Trond -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes [not found] ` <20100106205110.22547.17971.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-06 21:38 ` Jan Kara @ 2010-01-07 2:18 ` Wu Fengguang [not found] ` <1262839082.2185.15.camel@localhost> 1 sibling, 1 reply; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 2:18 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 07, 2010 at 04:51:10AM +0800, Trond Myklebust wrote: > If the call to do_writepages() succeeded in starting writeback, we do not > know whether or not we will need to COMMIT any unstable writes until after > the write RPC calls are finished. Currently, we assume that at least one > write RPC call will have finished, and set I_DIRTY_DATASYNC by the time > do_writepages is done, so that write_inode() is triggered. > > In order to ensure reliable operation (i.e. ensure that a single call to > writeback_single_inode() with WB_SYNC_ALL set suffices to ensure that pages > are on disk) we need to first wait for filemap_fdatawait() to complete, > then test for unstable pages. > > Since NFS is currently the only filesystem that has unstable pages, we can > add a new inode state I_UNSTABLE_PAGES that NFS alone will set. When set, > this will trigger a callback to a new address_space_operation to call the > COMMIT. 
> > Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > --- > > fs/fs-writeback.c | 31 ++++++++++++++++++++++++++++++- > fs/nfs/file.c | 1 + > fs/nfs/inode.c | 16 ---------------- > fs/nfs/internal.h | 3 ++- > fs/nfs/super.c | 2 -- > fs/nfs/write.c | 33 ++++++++++++++++++++++++++++++++- > include/linux/fs.h | 9 +++++++++ > 7 files changed, 74 insertions(+), 21 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 1a7c42c..3bc0a96 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -389,6 +389,17 @@ static int write_inode(struct inode *inode, int sync) > } > > /* > + * Commit the NFS unstable pages. > + */ > +static int commit_unstable_pages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + if (mapping->a_ops && mapping->a_ops->commit_unstable_pages) > + return mapping->a_ops->commit_unstable_pages(mapping, wbc); > + return 0; > +} > + > +/* > * Wait for writeback on an inode to complete. > */ > static void inode_wait_for_writeback(struct inode *inode) > @@ -475,6 +486,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) > } > > spin_lock(&inode_lock); > + /* > + * Special state for cleaning NFS unstable pages > + */ > + if (inode->i_state & I_UNSTABLE_PAGES) { > + int err; > + inode->i_state &= ~I_UNSTABLE_PAGES; > + spin_unlock(&inode_lock); > + err = commit_unstable_pages(mapping, wbc); > + if (ret == 0) > + ret = err; > + spin_lock(&inode_lock); > + } > inode->i_state &= ~I_SYNC; > if (!(inode->i_state & (I_FREEING | I_CLEAR))) { > if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) { > @@ -533,6 +556,12 @@ select_queue: > inode->i_state |= I_DIRTY_PAGES; > redirty_tail(inode); > } > + } else if (inode->i_state & I_UNSTABLE_PAGES) { > + /* > + * The inode has got yet more unstable pages to > + * commit. 
Requeue on b_more_io > + */ > + requeue_io(inode); This risks "busy retrying" inodes with unstable pages, when - nfs_commit_unstable_pages() don't think it's time to commit - NFS server somehow response slowly The workaround is to use redirty_tail() for now. But that risks delay the COMMIT for up to 30s, which obviously might stuck applications in balance_dirty_pages() for too long. I have a patch to shorten the retry time to 1s (or other constant) by introducing b_more_io_wait. It currently sits in my writeback queue series whose main blocking issue is the constantly broken NFS pipeline.. > } else if (atomic_read(&inode->i_count)) { > /* > * The inode is clean, inuse > @@ -1051,7 +1080,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) > > spin_lock(&inode_lock); > if ((inode->i_state & flags) != flags) { > - const int was_dirty = inode->i_state & I_DIRTY; > + const int was_dirty = inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES); > > inode->i_state |= flags; > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c > index 6b89132..67e50ac 100644 > --- a/fs/nfs/file.c > +++ b/fs/nfs/file.c > @@ -526,6 +526,7 @@ const struct address_space_operations nfs_file_aops = { > .migratepage = nfs_migrate_page, > .launder_page = nfs_launder_page, > .error_remove_page = generic_error_remove_page, > + .commit_unstable_pages = nfs_commit_unstable_pages, > }; > > /* > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c > index faa0918..8341709 100644 > --- a/fs/nfs/inode.c > +++ b/fs/nfs/inode.c > @@ -97,22 +97,6 @@ u64 nfs_compat_user_ino64(u64 fileid) > return ino; > } > > -int nfs_write_inode(struct inode *inode, int sync) > -{ > - int ret; > - > - if (sync) { > - ret = filemap_fdatawait(inode->i_mapping); > - if (ret == 0) > - ret = nfs_commit_inode(inode, FLUSH_SYNC); > - } else > - ret = nfs_commit_inode(inode, 0); > - if (ret >= 0) > - return 0; > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > - return ret; > -} > - > void nfs_clear_inode(struct inode *inode) > { > /* > diff --git 
a/fs/nfs/internal.h b/fs/nfs/internal.h > index 29e464d..7bb326f 100644 > --- a/fs/nfs/internal.h > +++ b/fs/nfs/internal.h > @@ -211,7 +211,6 @@ extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask); > extern struct workqueue_struct *nfsiod_workqueue; > extern struct inode *nfs_alloc_inode(struct super_block *sb); > extern void nfs_destroy_inode(struct inode *); > -extern int nfs_write_inode(struct inode *,int); > extern void nfs_clear_inode(struct inode *); > #ifdef CONFIG_NFS_V4 > extern void nfs4_clear_inode(struct inode *); > @@ -253,6 +252,8 @@ extern int nfs4_path_walk(struct nfs_server *server, > extern void nfs_read_prepare(struct rpc_task *task, void *calldata); > > /* write.c */ > +extern int nfs_commit_unstable_pages(struct address_space *mapping, > + struct writeback_control *wbc); > extern void nfs_write_prepare(struct rpc_task *task, void *calldata); > #ifdef CONFIG_MIGRATION > extern int nfs_migrate_page(struct address_space *, > diff --git a/fs/nfs/super.c b/fs/nfs/super.c > index ce907ef..805c1a0 100644 > --- a/fs/nfs/super.c > +++ b/fs/nfs/super.c > @@ -265,7 +265,6 @@ struct file_system_type nfs_xdev_fs_type = { > static const struct super_operations nfs_sops = { > .alloc_inode = nfs_alloc_inode, > .destroy_inode = nfs_destroy_inode, > - .write_inode = nfs_write_inode, > .statfs = nfs_statfs, > .clear_inode = nfs_clear_inode, > .umount_begin = nfs_umount_begin, > @@ -334,7 +333,6 @@ struct file_system_type nfs4_referral_fs_type = { > static const struct super_operations nfs4_sops = { > .alloc_inode = nfs_alloc_inode, > .destroy_inode = nfs_destroy_inode, > - .write_inode = nfs_write_inode, > .statfs = nfs_statfs, > .clear_inode = nfs4_clear_inode, > .umount_begin = nfs_umount_begin, > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index d171696..910be28 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) > spin_unlock(&inode->i_lock); > 
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); > inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > + mark_inode_unstable_pages(inode); Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size. > } > > static int > @@ -1406,11 +1406,42 @@ int nfs_commit_inode(struct inode *inode, int how) > } > return res; > } > + > +int nfs_commit_unstable_pages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + struct inode *inode = mapping->host; > + int flags = FLUSH_SYNC; > + int ret; > + > + /* Don't commit yet if this is a non-blocking flush and there are > + * outstanding writes for this mapping. > + */ > + if (wbc->sync_mode != WB_SYNC_ALL && > + radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, > + NFS_PAGE_TAG_LOCKED)) { > + mark_inode_unstable_pages(inode); > + return 0; > + } A dumb question: does NFS_PAGE_TAG_LOCKED means either flying COMMITs or WRITEs? As an NFS newbie, I'm only confident on the COMMIT part :) > + if (wbc->nonblocking) > + flags = 0; > + ret = nfs_commit_inode(inode, flags); > + if (ret > 0) > + ret = 0; > + return ret; > +} > + > #else > static inline int nfs_commit_list(struct inode *inode, struct list_head *head, int how) > { > return 0; > } > + > +int nfs_commit_unstable_pages(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + return 0; > +} > #endif > > long nfs_sync_mapping_wait(struct address_space *mapping, struct writeback_control *wbc, int how) > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9147ca8..ea0b7a3 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -602,6 +602,8 @@ struct address_space_operations { > int (*is_partially_uptodate) (struct page *, read_descriptor_t *, > unsigned long); > int (*error_remove_page)(struct address_space *, struct page *); > + int (*commit_unstable_pages)(struct address_space *, > + struct writeback_control *); > }; > > /* > @@ -1635,6 
+1637,8 @@ struct super_operations { > #define I_CLEAR 64 > #define __I_SYNC 7 > #define I_SYNC (1 << __I_SYNC) > +#define __I_UNSTABLE_PAGES 9 > +#define I_UNSTABLE_PAGES (1 << __I_UNSTABLE_PAGES) > > #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) > > @@ -1649,6 +1653,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode) > __mark_inode_dirty(inode, I_DIRTY_SYNC); > } > > +static inline void mark_inode_unstable_pages(struct inode *inode) > +{ > + __mark_inode_dirty(inode, I_UNSTABLE_PAGES); > +} > + > /** > * inc_nlink - directly increment an inode's link count > * @inode: inode > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes [not found] ` <1262839082.2185.15.camel@localhost> @ 2010-01-07 4:48 ` Wu Fengguang 2010-01-07 4:53 ` [PATCH 0/5] Re: [PATCH] improve the performance of large sequential write NFS workloads Trond Myklebust 2010-01-07 14:56 ` [PATCH 1/6] " Wu Fengguang 1 sibling, 1 reply; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 4:48 UTC (permalink / raw) To: Myklebust, Trond Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 07, 2010 at 12:38:02PM +0800, Myklebust, Trond wrote: > On Thu, 2010-01-07 at 10:18 +0800, Wu Fengguang wrote: > > On Thu, Jan 07, 2010 at 04:51:10AM +0800, Trond Myklebust wrote: > > > @@ -533,6 +556,12 @@ select_queue: > > > inode->i_state |= I_DIRTY_PAGES; > > > redirty_tail(inode); > > > } > > > + } else if (inode->i_state & I_UNSTABLE_PAGES) { > > > + /* > > > + * The inode has got yet more unstable pages to > > > + * commit. Requeue on b_more_io > > > + */ > > > + requeue_io(inode); > > > > This risks "busy retrying" inodes with unstable pages, when > > > > - nfs_commit_unstable_pages() don't think it's time to commit > > - NFS server somehow response slowly > > > > The workaround is to use redirty_tail() for now. But that risks delay > > the COMMIT for up to 30s, which obviously might stuck applications in > > balance_dirty_pages() for too long. > > > > I have a patch to shorten the retry time to 1s (or other constant) > > by introducing b_more_io_wait. It currently sits in my writeback queue > > series whose main blocking issue is the constantly broken NFS pipeline.. > > > OK. Should I use redirty_tail() for the moment then, and assume you will > fix when you introduce the new state? 
OK, I'll change your redirty_tail() then :) > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > > index d171696..910be28 100644 > > > --- a/fs/nfs/write.c > > > +++ b/fs/nfs/write.c > > > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) > > > spin_unlock(&inode->i_lock); > > > inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); > > > inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); > > > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > > > + mark_inode_unstable_pages(inode); > > > > Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size. > > Why? The NFS client itself shouldn't ever set I_DIRTY_DATASYNC after > this patch is applied. We won't ever need it. > > If the VM or VFS is doing it, then they ought to be fixed: there is no > reason to assume that all filesystems need to sync their inodes on > i_size changes. Ah OK, I took it for certain.. > > > + /* Don't commit yet if this is a non-blocking flush and there are > > > + * outstanding writes for this mapping. > > > + */ > > > + if (wbc->sync_mode != WB_SYNC_ALL && > > > + radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, > > > + NFS_PAGE_TAG_LOCKED)) { > > > + mark_inode_unstable_pages(inode); > > > + return 0; > > > + } > > > > A dumb question: does NFS_PAGE_TAG_LOCKED means either flying COMMITs > > or WRITEs? As an NFS newbie, I'm only confident on the COMMIT part :) > > Both writebacks and commits will cause NFS_PAGE_TAG_LOCKED to be set, as > will attempts to change the page contents. See the calls to > nfs_set_page_tag_locked()... Thanks for the tip! > IOW: the above code will fail to trigger if there are outstanding WRITE > RPC calls, or if there is a second process that happens to be writing to > this inode's page cache... IOW: for a busy "cp", the commit of inode may be delayed until "nr_unstable > nr_dirty / 2"? OK that's what we want. 
Thanks, Fengguang ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 0/5] Re: [PATCH] improve the performance of large sequential write NFS workloads 2010-01-07 4:48 ` Wu Fengguang @ 2010-01-07 4:53 ` Trond Myklebust [not found] ` <20100107045330.5986.55090.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 4:53 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Take 2 of this series, incorporating the suggested changes from Jan and Fengguang... Cheers Trond --- Peter Zijlstra (1): VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE Trond Myklebust (4): NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set VM/NFS: The VM must tell the filesystem when to free reclaimable pages VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices VFS: Ensure that writeback_single_inode() commits unstable writes fs/fs-writeback.c | 31 ++++++++++++++++++++++++++++++- fs/nfs/client.c | 1 + fs/nfs/file.c | 1 + fs/nfs/inode.c | 16 ---------------- fs/nfs/internal.h | 3 ++- fs/nfs/super.c | 2 -- fs/nfs/write.c | 39 +++++++++++++++++++++++++++++++++++---- include/linux/backing-dev.h | 9 ++++++++- include/linux/fs.h | 9 +++++++++ include/linux/writeback.h | 5 +++++ mm/backing-dev.c | 6 ++++-- mm/filemap.c | 2 +- mm/page-writeback.c | 30 ++++++++++++++++++++++++------ mm/truncate.c | 2 +- 14 files changed, 121 insertions(+), 35 deletions(-) -- Signature -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 2/5] VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE [not found] ` <20100107045330.5986.55090.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2010-01-07 4:53 ` Trond Myklebust 2010-01-07 4:53 ` [PATCH 4/5] VM/NFS: The VM must tell the filesystem when to free reclaimable pages Trond Myklebust ` (3 subsequent siblings) 4 siblings, 0 replies; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 4:53 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org From: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Signed-off-by: Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/nfs/write.c | 6 +++--- include/linux/backing-dev.h | 3 ++- mm/backing-dev.c | 6 ++++-- mm/filemap.c | 2 +- mm/page-writeback.c | 16 ++++++++++------ mm/truncate.c | 2 +- 6 files changed, 21 insertions(+), 14 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 910be28..36549b1 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -440,7 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req) NFS_PAGE_TAG_COMMIT); spin_unlock(&inode->i_lock); inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); - inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); + inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE); mark_inode_unstable_pages(inode); } @@ -451,7 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) { dec_zone_page_state(page, NR_UNSTABLE_NFS); - dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE); + dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE); return 1; } return 0; @@ 
-1322,7 +1322,7 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) nfs_mark_request_commit(req); dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req->wb_page->mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_UNSTABLE); nfs_clear_page_tag_locked(req); } return -ENOMEM; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index fcbc26a..42c3e2a 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -36,7 +36,8 @@ enum bdi_state { typedef int (congested_fn)(void *, int); enum bdi_stat_item { - BDI_RECLAIMABLE, + BDI_DIRTY, + BDI_UNSTABLE, BDI_WRITEBACK, NR_BDI_STAT_ITEMS }; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 0e8ca03..88f3655 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -88,7 +88,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) #define K(x) ((x) << (PAGE_SHIFT - 10)) seq_printf(m, "BdiWriteback: %8lu kB\n" - "BdiReclaimable: %8lu kB\n" + "BdiDirty: %8lu kB\n" + "BdiUnstable: %8lu kB\n" "BdiDirtyThresh: %8lu kB\n" "DirtyThresh: %8lu kB\n" "BackgroundThresh: %8lu kB\n" @@ -102,7 +103,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "wb_list: %8u\n" "wb_cnt: %8u\n", (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), - (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)), + (unsigned long) K(bdi_stat(bdi, BDI_DIRTY)), + (unsigned long) K(bdi_stat(bdi, BDI_UNSTABLE)), K(bdi_thresh), K(dirty_thresh), K(background_thresh), nr_wb, nr_dirty, nr_io, nr_more_io, !list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask, diff --git a/mm/filemap.c b/mm/filemap.c index 96ac6b0..458387d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -136,7 +136,7 @@ void __remove_from_page_cache(struct page *page) */ if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY); } } diff --git 
a/mm/page-writeback.c b/mm/page-writeback.c index 0b19943..23d3fc6 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -272,7 +272,8 @@ static void clip_bdi_dirty_limit(struct backing_dev_info *bdi, else avail_dirty = 0; - avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) + + avail_dirty += bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE) + bdi_stat(bdi, BDI_WRITEBACK); *pbdi_dirty = min(*pbdi_dirty, avail_dirty); @@ -509,7 +510,8 @@ static void balance_dirty_pages(struct address_space *mapping, global_page_state(NR_UNSTABLE_NFS); nr_writeback = global_page_state(NR_WRITEBACK); - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -554,10 +556,12 @@ static void balance_dirty_pages(struct address_space *mapping, * deltas. */ if (bdi_thresh < 2*bdi_stat_error(bdi)) { - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY) + + bdi_stat_sum(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK); } else if (bdi_nr_reclaimable) { - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); } @@ -1079,7 +1083,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) { if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); - __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); + __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY); task_dirty_inc(current); task_io_account_write(PAGE_CACHE_SIZE); } @@ -1255,7 +1259,7 @@ int clear_page_dirty_for_io(struct page *page) if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_DIRTY); return 1; 
} return 0; diff --git a/mm/truncate.c b/mm/truncate.c index 342deee..b0ce8fb 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -75,7 +75,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size) if (mapping && mapping_cap_account_dirty(mapping)) { dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, - BDI_RECLAIMABLE); + BDI_DIRTY); if (account_size) task_io_account_cancelled_write(account_size); } -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 4/5] VM/NFS: The VM must tell the filesystem when to free reclaimable pages [not found] ` <20100107045330.5986.55090.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-07 4:53 ` [PATCH 2/5] VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE Trond Myklebust @ 2010-01-07 4:53 ` Trond Myklebust 2010-01-07 4:53 ` [PATCH 3/5] VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices Trond Myklebust ` (2 subsequent siblings) 4 siblings, 0 replies; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 4:53 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org balance_dirty_pages() should really tell the filesystem whether or not it has an excess of actual dirty pages, or whether it would be more useful to start freeing up the unstable writes. Assume that if the number of unstable writes is more than 1/2 the number of reclaimable pages, then we should force NFS to free up the former. Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/nfs/write.c | 2 +- include/linux/writeback.h | 5 +++++ mm/page-writeback.c | 12 ++++++++++-- 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 36549b1..978de7f 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1417,7 +1417,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, /* Don't commit yet if this is a non-blocking flush and there are * outstanding writes for this mapping. 
*/ - if (wbc->sync_mode != WB_SYNC_ALL && + if (!wbc->force_commit && wbc->sync_mode != WB_SYNC_ALL && radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, NFS_PAGE_TAG_LOCKED)) { mark_inode_unstable_pages(inode); diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 76e8903..3fd5c3e 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -62,6 +62,11 @@ struct writeback_control { * so we use a single control to update them */ unsigned no_nrwrite_index_update:1; + /* + * The following is used by balance_dirty_pages() to + * force NFS to commit unstable pages. + */ + unsigned force_commit:1; }; /* diff --git a/mm/page-writeback.c b/mm/page-writeback.c index c06739b..c537543 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -503,6 +503,7 @@ static void balance_dirty_pages(struct address_space *mapping, .nr_to_write = write_chunk, .range_cyclic = 1, }; + long bdi_nr_unstable = 0; get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); @@ -512,8 +513,10 @@ static void balance_dirty_pages(struct address_space *mapping, nr_writeback = global_page_state(NR_WRITEBACK); bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); - if (bdi_cap_account_unstable(bdi)) - bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); + if (bdi_cap_account_unstable(bdi)) { + bdi_nr_unstable = bdi_stat(bdi, BDI_UNSTABLE); + bdi_nr_reclaimable += bdi_nr_unstable; + } bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -541,6 +544,11 @@ static void balance_dirty_pages(struct address_space *mapping, * up. */ if (bdi_nr_reclaimable > bdi_thresh) { + wbc.force_commit = 0; + /* Force NFS to also free up unstable writes. 
*/ + if (bdi_nr_unstable > bdi_nr_reclaimable / 2) + wbc.force_commit = 1; + writeback_inodes_wbc(&wbc); pages_written += write_chunk - wbc.nr_to_write; get_dirty_limits(&background_thresh, &dirty_thresh, -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 3/5] VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices [not found] ` <20100107045330.5986.55090.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2010-01-07 4:53 ` [PATCH 2/5] VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE Trond Myklebust 2010-01-07 4:53 ` [PATCH 4/5] VM/NFS: The VM must tell the filesystem when to free reclaimable pages Trond Myklebust @ 2010-01-07 4:53 ` Trond Myklebust 2010-01-07 4:53 ` [PATCH 5/5] NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set Trond Myklebust 2010-01-07 4:53 ` [PATCH 1/5] VFS: Ensure that writeback_single_inode() commits unstable writes Trond Myklebust 4 siblings, 0 replies; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 4:53 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Speeds up the accounting in balance_dirty_pages() for non-nfs devices. 
Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/nfs/client.c | 1 + include/linux/backing-dev.h | 6 ++++++ mm/page-writeback.c | 16 +++++++++++----- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/fs/nfs/client.c b/fs/nfs/client.c index ee77713..d0b060a 100644 --- a/fs/nfs/client.c +++ b/fs/nfs/client.c @@ -890,6 +890,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo * server->backing_dev_info.name = "nfs"; server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD; + server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE; if (server->wsize > max_rpc_payload) server->wsize = max_rpc_payload; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 42c3e2a..8b45166 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -232,6 +232,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); #define BDI_CAP_EXEC_MAP 0x00000040 #define BDI_CAP_NO_ACCT_WB 0x00000080 #define BDI_CAP_SWAP_BACKED 0x00000100 +#define BDI_CAP_ACCT_UNSTABLE 0x00000200 #define BDI_CAP_VMFLAGS \ (BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP) @@ -311,6 +312,11 @@ static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi) return bdi == &default_backing_dev_info; } +static inline bool bdi_cap_account_unstable(struct backing_dev_info *bdi) +{ + return bdi->capabilities & BDI_CAP_ACCT_UNSTABLE; +} + static inline bool mapping_cap_writeback_dirty(struct address_space *mapping) { return bdi_cap_writeback_dirty(mapping->backing_dev_info); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 23d3fc6..c06739b 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -273,8 +273,9 @@ static void clip_bdi_dirty_limit(struct backing_dev_info *bdi, avail_dirty = 0; avail_dirty += bdi_stat(bdi, BDI_DIRTY) + - bdi_stat(bdi, BDI_UNSTABLE) + 
bdi_stat(bdi, BDI_WRITEBACK); + if (bdi_cap_account_unstable(bdi)) + avail_dirty += bdi_stat(bdi, BDI_UNSTABLE); *pbdi_dirty = min(*pbdi_dirty, avail_dirty); } @@ -510,8 +511,9 @@ static void balance_dirty_pages(struct address_space *mapping, global_page_state(NR_UNSTABLE_NFS); nr_writeback = global_page_state(NR_WRITEBACK); - bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + - bdi_stat(bdi, BDI_UNSTABLE); + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh) @@ -556,11 +558,15 @@ static void balance_dirty_pages(struct address_space *mapping, * deltas. */ if (bdi_thresh < 2*bdi_stat_error(bdi)) { - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY) + + bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat_sum(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK); } else if (bdi_nr_reclaimable) { - bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) + + bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY); + if (bdi_cap_account_unstable(bdi)) + bdi_nr_reclaimable += bdi_stat(bdi, BDI_UNSTABLE); bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK); } -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
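The accounting change in the patch above can be sketched as a small userspace C model. The struct and field names below are illustrative stand-ins for the kernel's backing_dev_info and its per-BDI counters, not real kernel API; only the BDI_CAP_ACCT_UNSTABLE flag value is taken from the patch:

```c
/* Userspace sketch of the BDI_CAP_ACCT_UNSTABLE gating from the patch
 * above. struct bdi_model stands in for the kernel's backing_dev_info;
 * the nr_* fields stand in for bdi_stat() counters. */
#include <assert.h>

#define BDI_CAP_ACCT_UNSTABLE 0x00000200u

struct bdi_model {
    unsigned int capabilities;
    long nr_dirty;     /* BDI_DIRTY     */
    long nr_unstable;  /* BDI_UNSTABLE  */
    long nr_writeback; /* BDI_WRITEBACK */
};

static int bdi_cap_account_unstable(const struct bdi_model *bdi)
{
    return (bdi->capabilities & BDI_CAP_ACCT_UNSTABLE) != 0;
}

/* Mirrors the patched balance_dirty_pages() accounting: only devices
 * that set the capability (NFS) pay for the BDI_UNSTABLE counter read;
 * ordinary block devices skip it entirely. */
static long bdi_nr_reclaimable(const struct bdi_model *bdi)
{
    long n = bdi->nr_dirty;

    if (bdi_cap_account_unstable(bdi))
        n += bdi->nr_unstable;
    return n;
}
```

The point of the patch is visible in the model: for a non-NFS device the unstable counter never enters the sum, which is what speeds up balance_dirty_pages().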
* [PATCH 5/5] NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set [not found] ` <20100107045330.5986.55090.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> ` (2 preceding siblings ...) 2010-01-07 4:53 ` [PATCH 3/5] VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices Trond Myklebust @ 2010-01-07 4:53 ` Trond Myklebust 2010-01-07 4:53 ` [PATCH 1/5] VFS: Ensure that writeback_single_inode() commits unstable writes Trond Myklebust 4 siblings, 0 replies; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 4:53 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> --- fs/nfs/write.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 978de7f..d6d8048 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1423,7 +1423,7 @@ int nfs_commit_unstable_pages(struct address_space *mapping, mark_inode_unstable_pages(inode); return 0; } - if (wbc->nonblocking) + if (wbc->nonblocking || wbc->for_background) flags = 0; ret = nfs_commit_inode(inode, flags); if (ret > 0) -- ^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 1/5] VFS: Ensure that writeback_single_inode() commits unstable writes [not found] ` <20100107045330.5986.55090.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> ` (3 preceding siblings ...) 2010-01-07 4:53 ` [PATCH 5/5] NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set Trond Myklebust @ 2010-01-07 4:53 ` Trond Myklebust 4 siblings, 0 replies; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 4:53 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org If the call to do_writepages() succeeded in starting writeback, we do not know whether or not we will need to COMMIT any unstable writes until after the write RPC calls are finished. Currently, we assume that at least one write RPC call will have finished, and set I_DIRTY_DATASYNC by the time do_writepages is done, so that write_inode() is triggered. In order to ensure reliable operation (i.e. ensure that a single call to writeback_single_inode() with WB_SYNC_ALL set suffices to ensure that pages are on disk) we need to first wait for filemap_fdatawait() to complete, then test for unstable pages. Since NFS is currently the only filesystem that has unstable pages, we can add a new inode state I_UNSTABLE_PAGES that NFS alone will set. When set, this will trigger a callback to a new address_space_operation to call the COMMIT. 
Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> Acked-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/fs-writeback.c | 31 ++++++++++++++++++++++++++++++- fs/nfs/file.c | 1 + fs/nfs/inode.c | 16 ---------------- fs/nfs/internal.h | 3 ++- fs/nfs/super.c | 2 -- fs/nfs/write.c | 33 ++++++++++++++++++++++++++++++++- include/linux/fs.h | 9 +++++++++ 7 files changed, 74 insertions(+), 21 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 1a7c42c..3640769 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -389,6 +389,17 @@ static int write_inode(struct inode *inode, int sync) } /* + * Commit the NFS unstable pages. + */ +static int commit_unstable_pages(struct address_space *mapping, + struct writeback_control *wbc) +{ + if (mapping->a_ops && mapping->a_ops->commit_unstable_pages) + return mapping->a_ops->commit_unstable_pages(mapping, wbc); + return 0; +} + +/* * Wait for writeback on an inode to complete. */ static void inode_wait_for_writeback(struct inode *inode) @@ -475,6 +486,18 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc) } spin_lock(&inode_lock); + /* + * Special state for cleaning NFS unstable pages + */ + if (inode->i_state & I_UNSTABLE_PAGES) { + int err; + inode->i_state &= ~I_UNSTABLE_PAGES; + spin_unlock(&inode_lock); + err = commit_unstable_pages(mapping, wbc); + if (ret == 0) + ret = err; + spin_lock(&inode_lock); + } inode->i_state &= ~I_SYNC; if (!(inode->i_state & (I_FREEING | I_CLEAR))) { if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) { @@ -533,6 +556,12 @@ select_queue: inode->i_state |= I_DIRTY_PAGES; redirty_tail(inode); } + } else if (inode->i_state & I_UNSTABLE_PAGES) { + /* + * The inode has got yet more unstable pages to + * commit. Requeue... 
+ */ + redirty_tail(inode); } else if (atomic_read(&inode->i_count)) { /* * The inode is clean, inuse @@ -1051,7 +1080,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) spin_lock(&inode_lock); if ((inode->i_state & flags) != flags) { - const int was_dirty = inode->i_state & I_DIRTY; + const int was_dirty = inode->i_state & (I_DIRTY|I_UNSTABLE_PAGES); inode->i_state |= flags; diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 6b89132..67e50ac 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -526,6 +526,7 @@ const struct address_space_operations nfs_file_aops = { .migratepage = nfs_migrate_page, .launder_page = nfs_launder_page, .error_remove_page = generic_error_remove_page, + .commit_unstable_pages = nfs_commit_unstable_pages, }; /* diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index faa0918..8341709 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -97,22 +97,6 @@ u64 nfs_compat_user_ino64(u64 fileid) return ino; } -int nfs_write_inode(struct inode *inode, int sync) -{ - int ret; - - if (sync) { - ret = filemap_fdatawait(inode->i_mapping); - if (ret == 0) - ret = nfs_commit_inode(inode, FLUSH_SYNC); - } else - ret = nfs_commit_inode(inode, 0); - if (ret >= 0) - return 0; - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); - return ret; -} - void nfs_clear_inode(struct inode *inode) { /* diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 29e464d..7bb326f 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -211,7 +211,6 @@ extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask); extern struct workqueue_struct *nfsiod_workqueue; extern struct inode *nfs_alloc_inode(struct super_block *sb); extern void nfs_destroy_inode(struct inode *); -extern int nfs_write_inode(struct inode *,int); extern void nfs_clear_inode(struct inode *); #ifdef CONFIG_NFS_V4 extern void nfs4_clear_inode(struct inode *); @@ -253,6 +252,8 @@ extern int nfs4_path_walk(struct nfs_server *server, extern void nfs_read_prepare(struct rpc_task *task, void *calldata); /* 
write.c */ +extern int nfs_commit_unstable_pages(struct address_space *mapping, + struct writeback_control *wbc); extern void nfs_write_prepare(struct rpc_task *task, void *calldata); #ifdef CONFIG_MIGRATION extern int nfs_migrate_page(struct address_space *, diff --git a/fs/nfs/super.c b/fs/nfs/super.c index ce907ef..805c1a0 100644 --- a/fs/nfs/super.c +++ b/fs/nfs/super.c @@ -265,7 +265,6 @@ struct file_system_type nfs_xdev_fs_type = { static const struct super_operations nfs_sops = { .alloc_inode = nfs_alloc_inode, .destroy_inode = nfs_destroy_inode, - .write_inode = nfs_write_inode, .statfs = nfs_statfs, .clear_inode = nfs_clear_inode, .umount_begin = nfs_umount_begin, @@ -334,7 +333,6 @@ struct file_system_type nfs4_referral_fs_type = { static const struct super_operations nfs4_sops = { .alloc_inode = nfs_alloc_inode, .destroy_inode = nfs_destroy_inode, - .write_inode = nfs_write_inode, .statfs = nfs_statfs, .clear_inode = nfs4_clear_inode, .umount_begin = nfs_umount_begin, diff --git a/fs/nfs/write.c b/fs/nfs/write.c index d171696..910be28 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) spin_unlock(&inode->i_lock); inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); + mark_inode_unstable_pages(inode); } static int @@ -1406,11 +1406,42 @@ int nfs_commit_inode(struct inode *inode, int how) } return res; } + +int nfs_commit_unstable_pages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *inode = mapping->host; + int flags = FLUSH_SYNC; + int ret; + + /* Don't commit yet if this is a non-blocking flush and there are + * outstanding writes for this mapping. 
+ */ + if (wbc->sync_mode != WB_SYNC_ALL && + radix_tree_tagged(&NFS_I(inode)->nfs_page_tree, + NFS_PAGE_TAG_LOCKED)) { + mark_inode_unstable_pages(inode); + return 0; + } + if (wbc->nonblocking) + flags = 0; + ret = nfs_commit_inode(inode, flags); + if (ret > 0) + ret = 0; + return ret; +} + #else static inline int nfs_commit_list(struct inode *inode, struct list_head *head, int how) { return 0; } + +int nfs_commit_unstable_pages(struct address_space *mapping, + struct writeback_control *wbc) +{ + return 0; +} #endif long nfs_sync_mapping_wait(struct address_space *mapping, struct writeback_control *wbc, int how) diff --git a/include/linux/fs.h b/include/linux/fs.h index 9147ca8..de594b3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -602,6 +602,8 @@ struct address_space_operations { int (*is_partially_uptodate) (struct page *, read_descriptor_t *, unsigned long); int (*error_remove_page)(struct address_space *, struct page *); + int (*commit_unstable_pages)(struct address_space *, + struct writeback_control *); }; /* @@ -1635,6 +1637,8 @@ struct super_operations { #define I_CLEAR 64 #define __I_SYNC 7 #define I_SYNC (1 << __I_SYNC) +#define __I_UNSTABLE_PAGES 8 +#define I_UNSTABLE_PAGES (1 << __I_UNSTABLE_PAGES) #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) @@ -1649,6 +1653,11 @@ static inline void mark_inode_dirty_sync(struct inode *inode) __mark_inode_dirty(inode, I_DIRTY_SYNC); } +static inline void mark_inode_unstable_pages(struct inode *inode) +{ + __mark_inode_dirty(inode, I_UNSTABLE_PAGES); +} + /** * inc_nlink - directly increment an inode's link count * @inode: inode -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 66+ messages in thread
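The I_UNSTABLE_PAGES mechanism introduced by this patch can be modeled in userspace C. This is a simplified sketch, not kernel code: struct inode_model and the commits counter are illustrative stand-ins, and the flag values merely echo the ones added to include/linux/fs.h above:

```c
/* Simplified model of the new I_UNSTABLE_PAGES inode state: NFS marks
 * the inode when a write leaves pages in the unstable state, and the
 * tail of writeback_single_inode() turns that mark into a COMMIT. */
#include <assert.h>

#define I_DIRTY_SYNC     (1 << 0)
#define I_DIRTY_DATASYNC (1 << 1)
#define I_DIRTY_PAGES    (1 << 2)
#define I_UNSTABLE_PAGES (1 << 8)

struct inode_model {
    unsigned int i_state;
    int commits; /* stands in for ->commit_unstable_pages() calls */
};

static void mark_inode_unstable_pages(struct inode_model *inode)
{
    inode->i_state |= I_UNSTABLE_PAGES;
}

/* Tail of writeback_single_inode() as changed by the patch: the commit
 * callback runs only when unstable pages exist, and the flag is cleared
 * before committing so a new write can re-mark the inode. */
static void writeback_tail(struct inode_model *inode)
{
    if (inode->i_state & I_UNSTABLE_PAGES) {
        inode->i_state &= ~I_UNSTABLE_PAGES;
        inode->commits++;
    }
}
```

Note how this differs from the old scheme: the inode is never marked I_DIRTY_DATASYNC, so write_inode() is never involved; the commit is driven purely by the new state bit.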
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes [not found] ` <1262839082.2185.15.camel@localhost> 2010-01-07 4:48 ` Wu Fengguang @ 2010-01-07 14:56 ` Wu Fengguang 2010-01-07 15:10 ` Trond Myklebust 1 sibling, 1 reply; 66+ messages in thread From: Wu Fengguang @ 2010-01-07 14:56 UTC (permalink / raw) To: Myklebust, Trond Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 07, 2010 at 12:38:02PM +0800, Myklebust, Trond wrote: > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > > index d171696..910be28 100644 > > > --- a/fs/nfs/write.c > > > +++ b/fs/nfs/write.c > > > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) > > > spin_unlock(&inode->i_lock); > > > inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); > > > inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); > > > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > > > + mark_inode_unstable_pages(inode); > > > > Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size. > > Why? The NFS client itself shouldn't ever set I_DIRTY_DATASYNC after > this patch is applied. We won't ever need it. > > If the VM or VFS is doing it, then they ought to be fixed: there is no > reason to assume that all filesystems need to sync their inodes on > i_size changes. Sorry, one more question. It seems to me that you are replacing I_DIRTY_DATASYNC => write_inode() with I_UNSTABLE_PAGES => commit_unstable_pages() Is that change for the sake of clarity? Or to fix some problem? (This patch does fix some problems, but do they inherently require the above change?) 
Thanks, Fengguang ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes 2010-01-07 14:56 ` [PATCH 1/6] " Wu Fengguang @ 2010-01-07 15:10 ` Trond Myklebust 2010-01-08 1:17 ` Wu Fengguang 2010-01-08 9:25 ` Christoph Hellwig 0 siblings, 2 replies; 66+ messages in thread From: Trond Myklebust @ 2010-01-07 15:10 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, 2010-01-07 at 22:56 +0800, Wu Fengguang wrote: > On Thu, Jan 07, 2010 at 12:38:02PM +0800, Myklebust, Trond wrote: > > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > > > index d171696..910be28 100644 > > > > --- a/fs/nfs/write.c > > > > +++ b/fs/nfs/write.c > > > > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) > > > > spin_unlock(&inode->i_lock); > > > > inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); > > > > inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); > > > > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > > > > + mark_inode_unstable_pages(inode); > > > > > > Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size. > > > > Why? The NFS client itself shouldn't ever set I_DIRTY_DATASYNC after > > this patch is applied. We won't ever need it. > > > > If the VM or VFS is doing it, then they ought to be fixed: there is no > > reason to assume that all filesystems need to sync their inodes on > > i_size changes. > > Sorry, one more question. > > It seems to me that you are replacing > > I_DIRTY_DATASYNC => write_inode() > with > I_UNSTABLE_PAGES => commit_unstable_pages() > > Is that change for the sake of clarity? Or to fix some problem? > (This patch does fix some problems, but do they inherently require > the above change?) 
As I said previously, the write_inode() call is done _before_ you sync the dirty pages to the server, whereas commit_unstable_pages() wants to be done _after_ syncing. So the two are not the same, and we cannot replace commit_unstable_pages() with write_inode(). Replacing I_DIRTY_DATASYNC with I_UNSTABLE_PAGES is more for the sake of clarity. The difference between the two is that in the I_UNSTABLE_PAGES case, the inode itself isn't actually dirty; it just contains pages that are not guaranteed to be on permanent storage until we commit. Cheers Trond ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes 2010-01-07 15:10 ` Trond Myklebust @ 2010-01-08 1:17 ` Wu Fengguang 2010-01-08 1:37 ` Trond Myklebust 2010-01-08 9:25 ` Christoph Hellwig 1 sibling, 1 reply; 66+ messages in thread From: Wu Fengguang @ 2010-01-08 1:17 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 07, 2010 at 11:10:22PM +0800, Trond Myklebust wrote: > On Thu, 2010-01-07 at 22:56 +0800, Wu Fengguang wrote: > > On Thu, Jan 07, 2010 at 12:38:02PM +0800, Myklebust, Trond wrote: > > > > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > > > > index d171696..910be28 100644 > > > > > --- a/fs/nfs/write.c > > > > > +++ b/fs/nfs/write.c > > > > > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) > > > > > spin_unlock(&inode->i_lock); > > > > > inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); > > > > > inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); > > > > > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > > > > > + mark_inode_unstable_pages(inode); > > > > > > > > Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size. > > > > > > Why? The NFS client itself shouldn't ever set I_DIRTY_DATASYNC after > > > this patch is applied. We won't ever need it. > > > > > > If the VM or VFS is doing it, then they ought to be fixed: there is no > > > reason to assume that all filesystems need to sync their inodes on > > > i_size changes. > > > > Sorry, one more question. > > > > It seems to me that you are replacing > > > > I_DIRTY_DATASYNC => write_inode() > > with > > I_UNSTABLE_PAGES => commit_unstable_pages() > > > > Is that change for the sake of clarity? Or to fix some problem? 
> > (This patch does fix some problems, but do they inherently require > > the above change?) > > As I said previously, the write_inode() call is done _before_ you sync > the dirty pages to the server, whereas commit_unstable_pages() wants to > be done _after_ syncing. So the two are not the same, and we cannot > replace commit_unstable_pages() with write_inode(). This is the ordering:

0 do_writepages()
1 if (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
2     write_inode()
3 if (wait)
4     filemap_fdatawait()
5 if (I_UNSTABLE_PAGES)
6     commit_unstable_pages()

The page is synced to NFS server in line 0. The only difference is write_inode() is called before filemap_fdatawait(), while commit_unstable_pages() is called after it. Note that filemap_fdatawait() will only be called on WB_SYNC_ALL, so I still cannot understand the difference.. > Replacing I_DIRTY_DATASYNC with I_UNSTABLE_PAGES is more for the sake of > clarity. The difference between the two is that in the I_UNSTABLE_PAGES > case, the inode itself isn't actually dirty; it just contains pages that > are not guaranteed to be on permanent storage until we commit. And I_UNSTABLE_PAGES is necessary for calling commit_unstable_pages() :) Thanks, Fengguang ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes 2010-01-08 1:17 ` Wu Fengguang @ 2010-01-08 1:37 ` Trond Myklebust 2010-01-08 1:53 ` Wu Fengguang 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-08 1:37 UTC (permalink / raw) To: Wu Fengguang Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri, 2010-01-08 at 09:17 +0800, Wu Fengguang wrote: > On Thu, Jan 07, 2010 at 11:10:22PM +0800, Trond Myklebust wrote: > > On Thu, 2010-01-07 at 22:56 +0800, Wu Fengguang wrote: > > > On Thu, Jan 07, 2010 at 12:38:02PM +0800, Myklebust, Trond wrote: > > > > > > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > > > > > index d171696..910be28 100644 > > > > > > --- a/fs/nfs/write.c > > > > > > +++ b/fs/nfs/write.c > > > > > > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) > > > > > > spin_unlock(&inode->i_lock); > > > > > > inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); > > > > > > inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); > > > > > > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > > > > > > + mark_inode_unstable_pages(inode); > > > > > > > > > > Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size. > > > > > > > > Why? The NFS client itself shouldn't ever set I_DIRTY_DATASYNC after > > > > this patch is applied. We won't ever need it. > > > > > > > > If the VM or VFS is doing it, then they ought to be fixed: there is no > > > > reason to assume that all filesystems need to sync their inodes on > > > > i_size changes. > > > > > > Sorry, one more question. > > > > > > It seems to me that you are replacing > > > > > > I_DIRTY_DATASYNC => write_inode() > > > with > > > I_UNSTABLE_PAGES => commit_unstable_pages() > > > > > > Is that change for the sake of clarity? Or to fix some problem? 
> > > (This patch does fix some problems, but do they inherently require > > > the above change?) > > > > As I said previously, the write_inode() call is done _before_ you sync > > the dirty pages to the server, whereas commit_unstable_pages() wants to > > be done _after_ syncing. So the two are not the same, and we cannot > > replace commit_unstable_pages() with write_inode(). > > This is the ordering: > > 0 do_writepages() > 1 if (I_DIRTY_SYNC | I_DIRTY_DATASYNC) > 2 write_inode() > 3 if (wait) > 4 filemap_fdatawait() > 5 if (I_UNSTABLE_PAGES) > 6 commit_unstable_pages() > > The page is synced to NFS server in line 0. > > The only difference is write_inode() is called before filemap_fdatawait(), > while commit_unstable_pages() is called after it. > > Note that filemap_fdatawait() will only be called on WB_SYNC_ALL, so I > still cannot understand the difference.. The difference is precisely that... In the case of WB_SYNC_ALL we want the call to filemap_fdatawait() to occur before we call commit_unstable_pages(), so that we know that all the in-flight write rpc calls are done before we ask that they be committed to stable storage. In the case of WB_SYNC_NONE, there is no wait, and so we are forced to play games with heuristics and/or add the force_commit_unstable flag because we don't wait for the dirty pages to be cleaned. I don't like this, but those are the semantics that we've defined for WB_SYNC_NONE. Cheers Trond -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 66+ messages in thread
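The WB_SYNC_ALL vs. WB_SYNC_NONE distinction Trond describes can be traced with a small userspace C model. The enum values and the single-letter trace are illustrative only; the sequencing (wait before commit under WB_SYNC_ALL, heuristic bail-out under WB_SYNC_NONE) is what the discussion is about:

```c
/* Sketch of the ordering: under WB_SYNC_ALL, filemap_fdatawait() runs
 * before the COMMIT so all in-flight WRITE rpcs are done first; under
 * WB_SYNC_NONE there is no wait, so the commit is skipped while writes
 * are outstanding and retried on a later pass. */
#include <assert.h>
#include <string.h>

enum { WB_SYNC_NONE, WB_SYNC_ALL };

/* Appends one letter per step: W=do_writepages, F=filemap_fdatawait,
 * C=commit_unstable_pages. 'trace' must have room for 4 chars. */
static void writeback_order(int sync_mode, int writes_in_flight, char *trace)
{
    strcat(trace, "W");                     /* do_writepages() */
    if (sync_mode == WB_SYNC_ALL)
        strcat(trace, "F");                 /* filemap_fdatawait() */
    if (sync_mode == WB_SYNC_ALL || !writes_in_flight)
        strcat(trace, "C");                 /* commit_unstable_pages() */
    /* else: bail out, redirty the inode, retry the commit later */
}
```

So a data-integrity sync always produces WRITE, wait, COMMIT in that order, while background writeback either commits immediately (nothing in flight) or defers.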
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes 2010-01-08 1:37 ` Trond Myklebust @ 2010-01-08 1:53 ` Wu Fengguang 0 siblings, 0 replies; 66+ messages in thread From: Wu Fengguang @ 2010-01-08 1:53 UTC (permalink / raw) To: Trond Myklebust Cc: Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Fri, Jan 08, 2010 at 09:37:31AM +0800, Trond Myklebust wrote: > On Fri, 2010-01-08 at 09:17 +0800, Wu Fengguang wrote: > > On Thu, Jan 07, 2010 at 11:10:22PM +0800, Trond Myklebust wrote: > > > On Thu, 2010-01-07 at 22:56 +0800, Wu Fengguang wrote: > > > > On Thu, Jan 07, 2010 at 12:38:02PM +0800, Myklebust, Trond wrote: > > > > > > > > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > > > > > > > index d171696..910be28 100644 > > > > > > > --- a/fs/nfs/write.c > > > > > > > +++ b/fs/nfs/write.c > > > > > > > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req) > > > > > > > spin_unlock(&inode->i_lock); > > > > > > > inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS); > > > > > > > inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE); > > > > > > > - __mark_inode_dirty(inode, I_DIRTY_DATASYNC); > > > > > > > + mark_inode_unstable_pages(inode); > > > > > > > > > > > > Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size. > > > > > > > > > > Why? The NFS client itself shouldn't ever set I_DIRTY_DATASYNC after > > > > > this patch is applied. We won't ever need it. > > > > > > > > > > If the VM or VFS is doing it, then they ought to be fixed: there is no > > > > > reason to assume that all filesystems need to sync their inodes on > > > > > i_size changes. > > > > > > > > Sorry, one more question. 
> > > > > > > > It seems to me that you are replacing > > > > > > > > I_DIRTY_DATASYNC => write_inode() > > > > with > > > > I_UNSTABLE_PAGES => commit_unstable_pages() > > > > > > > > Is that change for the sake of clarity? Or to fix some problem? > > > > (This patch does fix some problems, but do they inherently require > > > > the above change?) > > > > > > As I said previously, the write_inode() call is done _before_ you sync > > > the dirty pages to the server, whereas commit_unstable_pages() wants to > > > be done _after_ syncing. So the two are not the same, and we cannot > > > replace commit_unstable_pages() with write_inode(). > > > > This is the ordering: > > > > 0 do_writepages() > > 1 if (I_DIRTY_SYNC | I_DIRTY_DATASYNC) > > 2 write_inode() > > 3 if (wait) > > 4 filemap_fdatawait() > > 5 if (I_UNSTABLE_PAGES) > > 6 commit_unstable_pages() > > > > The page is synced to NFS server in line 0. > > > > The only difference is write_inode() is called before filemap_fdatawait(), > > while commit_unstable_pages() is called after it. > > > > Note that filemap_fdatawait() will only be called on WB_SYNC_ALL, so I > > still cannot understand the difference.. > > The difference is precisely that... Thanks, I got it. > In the case of WB_SYNC_ALL we want the call to filemap_fdatawait() to > occur before we call commit_unstable_pages(), so that we know that all > the in-flight write rpc calls are done before we ask that they be > committed to stable storage. That's good order for WB_SYNC_ALL. However this is optimizing a minor case, and what I cared is WB_SYNC_NONE :) > In the case of WB_SYNC_NONE, there is no wait, and so we are forced to > play games with heuristics and/or add the force_commit_unstable flag > because we don't wait for the dirty pages to be cleaned. I don't like > this, but those are the semantics that we've defined for WB_SYNC_NONE. 
For WB_SYNC_NONE we will now also wait for WRITE completion with the combination of NFS_PAGE_TAG_LOCKED-based-bail-out and redirty_tail(). This is retry based, so less elegant. But that's not the whole story. The I_UNSTABLE_PAGES+commit_unstable_pages() scheme seems elegant for WB_SYNC_ALL, however it may break the pipeline for big files in a perfect loop:

loop {
    WRITE 4MB
    COMMIT 4MB
}

While the retry based WB_SYNC_ALL will keep backing off COMMITs because do_writepages() keeps submitting new WRITEs. So its loop would be

loop {
    WRITE 4MB <skip COMMIT>
    WRITE 4MB <skip COMMIT>
    WRITE 4MB <skip COMMIT>
    WRITE 4MB <skip COMMIT>
    ...
    <redirty_tail timeout>
    COMMIT 400MB
}

That can be improved by lifting the writeback chunk size from 4MB to >=128MB. Thanks, Fengguang ^ permalink raw reply [flat|nested] 66+ messages in thread
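Wu's pipelining point reduces to simple RPC arithmetic. A back-of-envelope userspace C model, using the 400MB file and 4MB chunk from the example above (the function and struct names are illustrative, and the counts are idealized, not measured):

```c
/* Model of COMMIT amplification: per-chunk commits issue one COMMIT for
 * every WRITE rpc, while batching defers to a single COMMIT at the end. */
#include <assert.h>

struct rpc_cost {
    long writes;
    long commits;
};

static struct rpc_cost sync_file(long file_mb, long chunk_mb,
                                 int commit_each_chunk)
{
    struct rpc_cost c;

    /* one WRITE rpc batch per writeback chunk, rounded up */
    c.writes = (file_mb + chunk_mb - 1) / chunk_mb;
    c.commits = commit_each_chunk ? c.writes : 1;
    return c;
}
```

For the 400MB example this gives 100 COMMITs when committing every 4MB chunk versus 1 when batching, and raising the chunk to 128MB cuts the per-chunk scheme to 4 COMMITs, which is the improvement Wu suggests.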
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes 2010-01-07 15:10 ` Trond Myklebust 2010-01-08 1:17 ` Wu Fengguang @ 2010-01-08 9:25 ` Christoph Hellwig 2010-01-08 13:46 ` Trond Myklebust 1 sibling, 1 reply; 66+ messages in thread From: Christoph Hellwig @ 2010-01-08 9:25 UTC (permalink / raw) To: Trond Myklebust Cc: Wu Fengguang, Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Thu, Jan 07, 2010 at 10:10:22AM -0500, Trond Myklebust wrote: > As I said previously, the write_inode() call is done _before_ you sync > the dirty pages to the server, whereas commit_unstable_pages() wants to > be done _after_ syncing. So the two are not the same, and we cannot > replace commit_unstable_pages() with write_inode(). But that's more an accident of how this code was written. The right order needs to be to write the pages first, then call write_inode. Most modern filesystems have to work around this in their write_inode method by waiting for the pages themselves. I already fixed the same ordering issue in fsync, and the writeback code is next on the agenda. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes 2010-01-08 9:25 ` Christoph Hellwig @ 2010-01-08 13:46 ` Trond Myklebust 2010-01-08 13:54 ` Christoph Hellwig 0 siblings, 1 reply; 66+ messages in thread From: Trond Myklebust @ 2010-01-08 13:46 UTC (permalink / raw) To: Christoph Hellwig Cc: Wu Fengguang, Peter Zijlstra, Jan Kara, Steve Rago, linux-nfs@vger.kernel.org, jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org On Fri, 2010-01-08 at 04:25 -0500, Christoph Hellwig wrote: > On Thu, Jan 07, 2010 at 10:10:22AM -0500, Trond Myklebust wrote: > > As I said previously, the write_inode() call is done _before_ you sync > > the dirty pages to the server, whereas commit_unstable_pages() wants to > > be done _after_ syncing. So the two are not the same, and we cannot > > replace commit_unstable_pages() with write_inode(). > > But that's more an accident of how this code was written. The right > order nees to be to write the pages first, then call write_inode. Most > modern filesystems have to work around this in their write_inode method > by waiting for the pages themselves. I already fixed the same ordering > issue in fsync, and the writeback code is next on the agenda. > Could we in that case replace write_inode() with something that takes a struct writeback_control? It is very useful to have full information about the write range and flags as it allows us to tweak the COMMIT RPC call. Trond ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes
  2010-01-08 13:46               ` Trond Myklebust
@ 2010-01-08 13:54                 ` Christoph Hellwig
  2010-01-08 14:15                   ` Trond Myklebust
  0 siblings, 1 reply; 66+ messages in thread
From: Christoph Hellwig @ 2010-01-08 13:54 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Christoph Hellwig, Wu Fengguang, Peter Zijlstra, Jan Kara,
      Steve Rago, linux-nfs@vger.kernel.org, jens.axboe, Peter Staubach,
      Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org

On Fri, Jan 08, 2010 at 08:46:46AM -0500, Trond Myklebust wrote:
> Could we in that case replace write_inode() with something that takes a
> struct writeback_control? It is very useful to have full information
> about the write range and flags as it allows us to tweak the COMMIT RPC
> call.

At this point I do not plan to change the write_inode interface. But
changing the ->write_inode operation to take a writeback control instead
of the sync flag should be a pretty easy change if you want to do it.
* Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes
  2010-01-08 13:54                 ` Christoph Hellwig
@ 2010-01-08 14:15                   ` Trond Myklebust
  0 siblings, 0 replies; 66+ messages in thread
From: Trond Myklebust @ 2010-01-08 14:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, Peter Zijlstra, Jan Kara, Steve Rago,
      linux-nfs@vger.kernel.org, jens.axboe, Peter Staubach,
      Arjan van de Ven, Ingo Molnar, linux-fsdevel@vger.kernel.org

On Fri, 2010-01-08 at 08:54 -0500, Christoph Hellwig wrote:
> On Fri, Jan 08, 2010 at 08:46:46AM -0500, Trond Myklebust wrote:
> > Could we in that case replace write_inode() with something that takes a
> > struct writeback_control? It is very useful to have full information
> > about the write range and flags as it allows us to tweak the COMMIT RPC
> > call.
>
> At this point I do not plan to change the write_inode interface. But
> changing the ->write_inode operation to take a writeback control instead
> of the sync flag should be a pretty easy change if you want to do it.

OK. Please could you let me know when you're done with your changes so
that I can adapt this patch?

Cheers
  Trond
* Re: [PATCH 0/6] Re: [PATCH] improve the performance of large sequential write NFS workloads
  [not found] ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  ` (5 preceding siblings ...)
  2010-01-06 20:51 ` [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes Trond Myklebust
@ 2010-01-06 21:44 ` Jan Kara
  2010-01-06 22:03   ` Trond Myklebust
  6 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2010-01-06 21:44 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Wu Fengguang, Peter Zijlstra, Jan Kara, Steve Rago,
      linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, jens.axboe,
      Peter Staubach, Arjan van de Ven, Ingo Molnar,
      linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Wed 06-01-10 15:51:10, Trond Myklebust wrote:
> Peter Zijlstra (1):
>       VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE
>
> Trond Myklebust (5):
>       NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set
>       VM: Use per-bdi unstable accounting to improve use of wbc->force_commit
>       VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices
>       VM/NFS: The VM must tell the filesystem when to free reclaimable pages
>       VFS: Ensure that writeback_single_inode() commits unstable writes

I think the series would be nicer if you made Peter's patch #2 and
joined your patches
  "VM/NFS: The VM must tell the filesystem when to free reclaimable pages"
and
  "VM: Use per-bdi unstable accounting to improve use of wbc->force_commit"

								Honza
--
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR
* Re: [PATCH 0/6] Re: [PATCH] improve the performance of large sequential write NFS workloads
  2010-01-06 21:44 ` [PATCH 0/6] Re: [PATCH] improve the performance of large sequential write NFS workloads Jan Kara
@ 2010-01-06 22:03   ` Trond Myklebust
  0 siblings, 0 replies; 66+ messages in thread
From: Trond Myklebust @ 2010-01-06 22:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Peter Zijlstra, Steve Rago, linux-nfs@vger.kernel.org,
      jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar,
      linux-fsdevel@vger.kernel.org

On Wed, 2010-01-06 at 22:44 +0100, Jan Kara wrote:
> On Wed 06-01-10 15:51:10, Trond Myklebust wrote:
> > Peter Zijlstra (1):
> >       VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE
> >
> > Trond Myklebust (5):
> >       NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set
> >       VM: Use per-bdi unstable accounting to improve use of wbc->force_commit
> >       VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices
> >       VM/NFS: The VM must tell the filesystem when to free reclaimable pages
> >       VFS: Ensure that writeback_single_inode() commits unstable writes
>
> I think the series would be nicer if you made Peter's patch #2 and
> joined your patches
>   "VM/NFS: The VM must tell the filesystem when to free reclaimable pages"
> and
>   "VM: Use per-bdi unstable accounting to improve use of wbc->force_commit"
>
> Honza

Indeed, and if everyone is OK with Peter's patch, I'll do that.

Cheers
  Trond
* Re: [PATCH 0/6] Re: [PATCH] improve the performance of large sequential write NFS workloads
  2010-01-06 20:51 ` [PATCH 0/6] " Trond Myklebust
  [not found]     ` <20100106205110.22547.85345.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2010-01-07  8:16 ` Peter Zijlstra
  1 sibling, 0 replies; 66+ messages in thread
From: Peter Zijlstra @ 2010-01-07 8:16 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Wu Fengguang, Jan Kara, Steve Rago, linux-nfs@vger.kernel.org,
      jens.axboe, Peter Staubach, Arjan van de Ven, Ingo Molnar,
      linux-fsdevel@vger.kernel.org

On Wed, 2010-01-06 at 15:51 -0500, Trond Myklebust wrote:
> OK, here is the full series so far. I'm resending because I had to fix
> up a couple of BDI_UNSTABLE typos in Peter's patch...

Looks good and thanks for fixing things up!

Acked-by: Peter Zijlstra <peterz@infradead.org>

> Peter Zijlstra (1):
>       VM: Split out the accounting of unstable writes from BDI_RECLAIMABLE
>
> Trond Myklebust (5):
>       NFS: Run COMMIT as an asynchronous RPC call when wbc->for_background is set
>       VM: Use per-bdi unstable accounting to improve use of wbc->force_commit
>       VM: Don't call bdi_stat(BDI_UNSTABLE) on non-nfs backing-devices
>       VM/NFS: The VM must tell the filesystem when to free reclaimable pages
>       VFS: Ensure that writeback_single_inode() commits unstable writes
>
>  fs/fs-writeback.c           |   31 ++++++++++++++++++++++++++++++-
>  fs/nfs/client.c             |    1 +
>  fs/nfs/file.c               |    1 +
>  fs/nfs/inode.c              |   16 ----------------
>  fs/nfs/internal.h           |    3 ++-
>  fs/nfs/super.c              |    2 --
>  fs/nfs/write.c              |   39 +++++++++++++++++++++++++++++++++++----
>  include/linux/backing-dev.h |    9 ++++++++-
>  include/linux/fs.h          |    9 +++++++++
>  include/linux/writeback.h   |    5 +++++
>  mm/backing-dev.c            |    6 ++++--
>  mm/filemap.c                |    2 +-
>  mm/page-writeback.c         |   30 ++++++++++++++++++++++++------
>  mm/truncate.c               |    2 +-
>  14 files changed, 121 insertions(+), 35 deletions(-)