linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 1/2] writeback: Improve busyloop prevention
@ 2011-09-08  0:44 Jan Kara
  2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
  2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2011-09-08  0:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Wu Fengguang, Dave Chinner, Jan Kara, Christoph Hellwig

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through b_io list,
just wait for a moment and try again.
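
The "wait for a moment" in the diff below is an exponential backoff: pause
starts at one jiffy, doubles after each pass that writes nothing, and is
capped at HZ / 10. A minimal userspace simulation of the resulting sleep
sequence (a sketch only -- HZ = 100 is an assumption and this is obviously
not the kernel code) is:

#include <stdio.h>

#define HZ 100			/* assumed; one jiffy = 10 ms */

int main(void)
{
	long pause = 1;		/* jiffies; reset to 1 whenever progress is made */
	int pass;

	for (pass = 1; pass <= 8; pass++) {
		printf("idle pass %d: sleep %ld jiffies (%ld ms)\n",
		       pass, pause, pause * 1000 / HZ);
		pause <<= 1;			/* double the wait ... */
		if (pause > HZ / 10)
			pause = HZ / 10;	/* ... but cap it at 100 ms */
	}
	return 0;
}

So consecutive zero-progress passes sleep 10, 20, 40, 80 ms and then stay at
the 100 ms cap, while any pass that makes progress resets pause back to one
jiffy.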

CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..f506542 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -755,25 +755,27 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).  Wait for a while to avoid busylooping.
 		 */
-		if (!list_empty(&wb->b_more_io))  {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
-		}
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		schedule_timeout(pause);
+		pause <<= 1;
+		if (pause > HZ / 10)
+			pause = HZ / 10;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-08  0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
@ 2011-09-08  0:44 ` Jan Kara
  2011-09-08  1:22   ` Wu Fengguang
  2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-09-08  0:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Wu Fengguang, Dave Chinner, Jan Kara, Christoph Hellwig

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily long delay in
some cases, and in other cases it is a really bad thing. In particular, XFS
tries to be nice to writeback and when ->write_inode is called for an inode
with a locked ilock, it just redirties the inode and returns EAGAIN. That
currently causes writeback_single_inode() to redirty_tail() the inode. As a
contended ilock is common with XFS while extending files, the result can be
that inode writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback() we can
call requeue_io() in cases where quick retry is required without fear of
raising CPU consumption too much.
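
To put rough numbers on the two choices, here is a small self-contained
calculation (plain userspace C; HZ = 100 and the 3000 default of
dirty_expire_centisecs are assumptions, and the queue descriptions in the
comments are a simplified sketch of fs/fs-writeback.c, not an excerpt from
it):

#include <stdio.h>

#define HZ 100				/* assumed kernel config */
#define DIRTY_EXPIRE_CENTISECS 3000	/* assumed default of dirty_expire_centisecs */

int main(void)
{
	/*
	 * redirty_tail(): the inode goes back to b_dirty with a refreshed
	 * timestamp, so it is not reconsidered until it expires again.
	 */
	printf("redirty_tail(): up to %d ms before the next attempt\n",
	       DIRTY_EXPIRE_CENTISECS * 10);

	/*
	 * requeue_io(): the inode stays on b_more_io and is spliced back into
	 * b_io for the next wb_writeback() loop, which patch 1 bounds with a
	 * capped backoff sleep.
	 */
	printf("requeue_io():   at most ~%d ms until the next attempt\n",
	       (HZ / 10) * (1000 / HZ));

	return 0;
}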

CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   61 ++++++++++++++++++++++++----------------------------
 1 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f506542..9bb4e96 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. nfs_writepages()
+			 * sometimes bales out without doing anything or we
+			 * just ran out of our writeback slice.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode, wb);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode, wb);
-			}
+			requeue_io(inode, wb);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-09-08  0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
  2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
@ 2011-09-08  0:57 ` Wu Fengguang
  2011-09-08 13:49   ` Jan Kara
  1 sibling, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-09-08  0:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig

On Thu, Sep 08, 2011 at 08:44:43AM +0800, Jan Kara wrote:
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through b_io list,
> just wait for a moment and try again.
 
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

with comments below.

> +		trace_writeback_wait(wb->bdi, work);
> +		spin_unlock(&wb->list_lock);

__set_current_state(TASK_INTERRUPTIBLE);

> +		schedule_timeout(pause);

> +		pause <<= 1;
> +		if (pause > HZ / 10)
> +			pause = HZ / 10;

It's a bit safer to do

                if (pause < HZ / 10)
                        pause <<= 1;

in case someone hacked HZ=1.
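
For illustration, a quick userspace check of the two orderings at HZ=1 (a
sketch only; HZ is hard-coded here and this is not the kernel loop) shows
why:

#include <stdio.h>

#define HZ 1	/* the hypothetical "hacked" configuration */

int main(void)
{
	long pause;

	/* ordering in the patch: double first, then clamp to HZ / 10 (== 0) */
	pause = 1;
	pause <<= 1;
	if (pause > HZ / 10)
		pause = HZ / 10;
	printf("double-then-clamp: pause = %ld jiffies\n", pause);	/* 0 */

	/* suggested ordering: only grow while still below the cap */
	pause = 1;
	if (pause < HZ / 10)
		pause <<= 1;
	printf("grow-below-cap:    pause = %ld jiffies\n", pause);	/* 1 */

	return 0;
}

With the ordering in the patch the sleep collapses to 0 jiffies and the loop
busyspins again; with the check done first it never drops below one jiffy.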

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
@ 2011-09-08  1:22   ` Wu Fengguang
  2011-09-08 15:03     ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-09-08  1:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig

Jan,

> @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>  		int err = write_inode(inode, wbc);
> +		if (!err)
> +			inode_written = true;
>  		if (ret == 0)
>  			ret = err;
>  	}

write_inode() typically returns an error after redirtying the inode.
So the conditions inode_written=false and (inode->i_state & I_DIRTY)
are mostly on/off together. For the cases they disagree, it's probably
a filesystem bug -- at least I don't think some FS will deliberately 
return success while redirtying the inode, or the reverse.

>  		} else if (inode->i_state & I_DIRTY) {
>  			/*
>  			 * Filesystems can dirty the inode during writeback
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
> -			 * completion.
> +			 * completion. Also inode could have been dirtied by
> +			 * some process aggressively touching metadata.
> +			 * Finally, filesystem could just fail to write the
> +			 * inode for some reason. We have to distinguish the
> +			 * last case from the previous ones - in the last case
> +			 * we want to give the inode quick retry, in the
> +			 * other cases we want to put it back to the dirty list
> +			 * to avoid livelocking of writeback.
>  			 */
> -			redirty_tail(inode, wb);
> +			if (inode_written)
> +				redirty_tail(inode, wb);

Can you elaborate the livelock in the below inode_written=true case?
Why the sleep in the wb_writeback() loop is not enough?

> +			else
> +				requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean.  At this point we either have

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
@ 2011-09-08 13:49   ` Jan Kara
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Kara @ 2011-09-08 13:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Dave Chinner,
	Christoph Hellwig

On Thu 08-09-11 08:57:50, Wu Fengguang wrote:
> On Thu, Sep 08, 2011 at 08:44:43AM +0800, Jan Kara wrote:
> > Writeback of an inode can be stalled by things like internal fs locks being
> > held. So in case we didn't write anything during a pass through b_io list,
> > just wait for a moment and try again.
>  
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> with comments below.
> 
> > +		trace_writeback_wait(wb->bdi, work);
> > +		spin_unlock(&wb->list_lock);
> 
> __set_current_state(TASK_INTERRUPTIBLE);
  Ah, right. Thanks for catching this.
> 
> > +		schedule_timeout(pause);
> 
> > +		pause <<= 1;
> > +		if (pause > HZ / 10)
> > +			pause = HZ / 10;
> 
> It's a bit safer to do
> 
>                 if (pause < HZ / 10)
>                         pause <<= 1;
> 
> in case someone hacked HZ=1.
  Good idea. Done. Thanks for review.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-08  1:22   ` Wu Fengguang
@ 2011-09-08 15:03     ` Jan Kara
  2011-09-18 14:07       ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-09-08 15:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Dave Chinner,
	Christoph Hellwig

On Thu 08-09-11 09:22:36, Wu Fengguang wrote:
> Jan,
> 
> > @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> >  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> >  		int err = write_inode(inode, wbc);
> > +		if (!err)
> > +			inode_written = true;
> >  		if (ret == 0)
> >  			ret = err;
> >  	}
> 
> write_inode() typically returns an error after redirtying the inode.
> So the conditions inode_written=false and (inode->i_state & I_DIRTY)
> are mostly on/off together. For the cases they disagree, it's probably
> a filesystem bug -- at least I don't think some FS will deliberately 
> return success while redirtying the inode, or the reverse.
  There is a possibility someone else redirties the inode between the moment
I_DIRTY bits are cleared in writeback_single_inode() and the check for
I_DIRTY is done after ->write_inode() is called. Especially when
write_inode() blocks waiting for some IO this isn't that hard to happen. So
there are valid (although relatively rare) cases when inode_written is
different from the result of I_DIRTY check.
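
The window can be pictured with a tiny sequential model (userspace C; the
interleaving that real tasks would produce is just written out by hand, and
the I_DIRTY value is a stand-in, so treat it as a sketch only):

#include <stdbool.h>
#include <stdio.h>

#define I_DIRTY 0x7	/* stand-in for I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES */

int main(void)
{
	unsigned int i_state = I_DIRTY;
	bool inode_written = false;

	/* flusher: writeback_single_inode() clears the dirty bits under i_lock */
	i_state &= ~I_DIRTY;

	/*
	 * flusher: i_lock is dropped, ->writepages() and ->write_inode() run;
	 * ->write_inode() may block on IO and, in this run, succeeds
	 */
	inode_written = true;

	/* meanwhile: another task dirties the inode again */
	i_state |= I_DIRTY;

	/* flusher: the final I_DIRTY check now disagrees with the write result */
	printf("inode_written=%d, I_DIRTY still set=%d\n",
	       inode_written, !!(i_state & I_DIRTY));
	return 0;
}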

> >  		} else if (inode->i_state & I_DIRTY) {
> >  			/*
> >  			 * Filesystems can dirty the inode during writeback
> >  			 * operations, such as delayed allocation during
> >  			 * submission or metadata updates after data IO
> > -			 * completion.
> > +			 * completion. Also inode could have been dirtied by
> > +			 * some process aggressively touching metadata.
> > +			 * Finally, filesystem could just fail to write the
> > +			 * inode for some reason. We have to distinguish the
> > +			 * last case from the previous ones - in the last case
> > +			 * we want to give the inode quick retry, in the
> > +			 * other cases we want to put it back to the dirty list
> > +			 * to avoid livelocking of writeback.
> >  			 */
> > -			redirty_tail(inode, wb);
> > +			if (inode_written)
> > +				redirty_tail(inode, wb);
> 
> Can you elaborate the livelock in the below inode_written=true case?
> Why the sleep in the wb_writeback() loop is not enough?
  In case someone would be able to consistently trigger the race window and
redirty the inode before we check here, we would loop for a long time
always writing just this inode and thus effectivelly stalling other
writeback. That's why I push redirtied inode behind other inodes in the
dirty list.

								Honza
> 
> > +			else
> > +				requeue_io(inode, wb);
> >  		} else {
> >  			/*
> >  			 * The inode is clean.  At this point we either have
> 
> Thanks,
> Fengguang
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-08 15:03     ` Jan Kara
@ 2011-09-18 14:07       ` Wu Fengguang
  2011-10-05 17:39         ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-09-18 14:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig

On Thu, Sep 08, 2011 at 11:03:40PM +0800, Jan Kara wrote:
> On Thu 08-09-11 09:22:36, Wu Fengguang wrote:
> > Jan,
> > 
> > > @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> > >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> > >  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> > >  		int err = write_inode(inode, wbc);
> > > +		if (!err)
> > > +			inode_written = true;
> > >  		if (ret == 0)
> > >  			ret = err;
> > >  	}
> > 
> > write_inode() typically returns an error after redirtying the inode.
> > So the conditions inode_written=false and (inode->i_state & I_DIRTY)
> > are mostly on/off together. For the cases they disagree, it's probably
> > a filesystem bug -- at least I don't think some FS will deliberately 
> > return success while redirtying the inode, or the reverse.
>   There is a possibility someone else redirties the inode between the moment
> I_DIRTY bits are cleared in writeback_single_inode() and the check for
> I_DIRTY is done after ->write_inode() is called. Especially when
> write_inode() blocks waiting for some IO this isn't that hard to happen. So
> there are valid (although relatively rare) cases when inode_written is
> different from the result of I_DIRTY check.

Ah yes, that's a good point.

> > >  		} else if (inode->i_state & I_DIRTY) {
> > >  			/*
> > >  			 * Filesystems can dirty the inode during writeback
> > >  			 * operations, such as delayed allocation during
> > >  			 * submission or metadata updates after data IO
> > > -			 * completion.
> > > +			 * completion. Also inode could have been dirtied by
> > > +			 * some process aggressively touching metadata.
> > > +			 * Finally, filesystem could just fail to write the
> > > +			 * inode for some reason. We have to distinguish the
> > > +			 * last case from the previous ones - in the last case
> > > +			 * we want to give the inode quick retry, in the
> > > +			 * other cases we want to put it back to the dirty list
> > > +			 * to avoid livelocking of writeback.
> > >  			 */
> > > -			redirty_tail(inode, wb);
> > > +			if (inode_written)
> > > +				redirty_tail(inode, wb);
> > 
> > Can you elaborate the livelock in the below inode_written=true case?
> > Why the sleep in the wb_writeback() loop is not enough?
>   In case someone would be able to consistently trigger the race window and
> redirty the inode before we check here, we would loop for a long time
> always writing just this inode and thus effectively stalling other
> writeback. That's why I push the redirtied inode behind other inodes in the
> dirty list.

Agreed. All that's left to do is to confirm whether this addresses
Christoph's original problem.

Acked-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-18 14:07       ` Wu Fengguang
@ 2011-10-05 17:39         ` Jan Kara
  2011-10-07 13:43           ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-10-05 17:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Dave Chinner,
	Christoph Hellwig

On Sun 18-09-11 22:07:37, Wu Fengguang wrote:
> On Thu, Sep 08, 2011 at 11:03:40PM +0800, Jan Kara wrote:
> > On Thu 08-09-11 09:22:36, Wu Fengguang wrote:
> > > Jan,
> > > 
> > > > @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> > > >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> > > >  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> > > >  		int err = write_inode(inode, wbc);
> > > > +		if (!err)
> > > > +			inode_written = true;
> > > >  		if (ret == 0)
> > > >  			ret = err;
> > > >  	}
> > > 
> > > write_inode() typically returns an error after redirtying the inode.
> > > So the conditions inode_written=false and (inode->i_state & I_DIRTY)
> > > are mostly on/off together. For the cases they disagree, it's probably
> > > a filesystem bug -- at least I don't think some FS will deliberately 
> > > return success while redirtying the inode, or the reverse.
> >   There is a possibility someone else redirties the inode between the moment
> > I_DIRTY bits are cleared in writeback_single_inode() and the check for
> > I_DIRTY is done after ->write_inode() is called. Especially when
> > write_inode() blocks waiting for some IO this isn't that hard to happen. So
> > there are valid (although relatively rare) cases when inode_written is
> > different from the result of I_DIRTY check.
> 
> Ah yes, that's a good point.
> 
> > > >  		} else if (inode->i_state & I_DIRTY) {
> > > >  			/*
> > > >  			 * Filesystems can dirty the inode during writeback
> > > >  			 * operations, such as delayed allocation during
> > > >  			 * submission or metadata updates after data IO
> > > > -			 * completion.
> > > > +			 * completion. Also inode could have been dirtied by
> > > > +			 * some process aggressively touching metadata.
> > > > +			 * Finally, filesystem could just fail to write the
> > > > +			 * inode for some reason. We have to distinguish the
> > > > +			 * last case from the previous ones - in the last case
> > > > +			 * we want to give the inode quick retry, in the
> > > > +			 * other cases we want to put it back to the dirty list
> > > > +			 * to avoid livelocking of writeback.
> > > >  			 */
> > > > -			redirty_tail(inode, wb);
> > > > +			if (inode_written)
> > > > +				redirty_tail(inode, wb);
> > > 
> > > Can you elaborate the livelock in the below inode_written=true case?
> > > Why the sleep in the wb_writeback() loop is not enough?
> >   In case someone would be able to consistently trigger the race window and
> > redirty the inode before we check here, we would loop for a long time
> > always writing just this inode and thus effectively stalling other
> > writeback. That's why I push the redirtied inode behind other inodes in the
> > dirty list.
> 
> Agreed. All that's left to do is to confirm whether this addresses
> Christoph's original problem.
> 
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
  Great, thanks for review! I'll resend the two patches to Christoph so
that he can try them.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-05 17:39         ` Jan Kara
@ 2011-10-07 13:43           ` Wu Fengguang
  2011-10-07 14:22             ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-07 13:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig

>   Great, thanks for review! I'll resend the two patches to Christoph so
> that he can try them.

Jan, I'd like to test out your updated patches with my stupid dd
workloads. Would you (re)send them publicly?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 13:43           ` Wu Fengguang
@ 2011-10-07 14:22             ` Jan Kara
  2011-10-07 14:29               ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-10-07 14:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Dave Chinner,
	Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> >   Great, thanks for review! I'll resend the two patches to Christoph so
> > that he can try them.
> 
> Jan, I'd like to test out your updated patches with my stupid dd
> workloads. Would you (re)send them publicly?
  Ah, I resent them publicly on Wednesday
(http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
apparently does not include emails from Acked-by into list of recipients so
you didn't get them. Sorry for that. The patches are attached for your
convenience.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Improve-busyloop-prevention.patch --]
[-- Type: text/x-patch, Size: 2241 bytes --]

From a042c2a839ad3cf89d8ee158b2bb4b94b573f578 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Sep 2011 01:05:25 +0200
Subject: [PATCH 1/2] writeback: Improve busyloop prevention

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through b_io list,
just wait for a moment and try again.

CC: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..bdeb26a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -755,25 +755,27 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).  Wait for a while to avoid busylooping.
 		 */
-		if (!list_empty(&wb->b_more_io))  {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
-		}
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(pause);
+		if (pause < HZ / 10)
+			pause <<= 1;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1


[-- Attachment #3: 0002-writeback-Replace-some-redirty_tail-calls-with-reque.patch --]
[-- Type: text/x-patch, Size: 5390 bytes --]

From 0a4a2cb4d5432f5446215b1e6e44f7d83032dba3 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Sep 2011 01:46:42 +0200
Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily big delay in some
cases and in other cases it is a really bad thing. In particular XFS tries to
be nice to writeback and when ->write_inode is called for an inode with locked
ilock, it just redirties the inode and returns EAGAIN. That currently causes
writeback_single_inode() to redirty_tail() the inode. As contended ilock is
common thing with XFS while extending files the result can be that inode
writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback() we can
call requeue_io() in cases where quick retry is required without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   61 ++++++++++++++++++++++++----------------------------
 1 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bdeb26a..c786023 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. nfs_writepages()
+			 * sometimes bales out without doing anything or we
+			 * just ran out of our writeback slice.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode, wb);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode, wb);
-			}
+			requeue_io(inode, wb);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 14:22             ` Jan Kara
@ 2011-10-07 14:29               ` Wu Fengguang
  2011-10-07 14:45                 ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-07 14:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig

On Fri, Oct 07, 2011 at 10:22:01PM +0800, Jan Kara wrote:
> On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> > >   Great, thanks for review! I'll resend the two patches to Christoph so
> > > that he can try them.
> > 
> > Jan, I'd like to test out your updated patches with my stupid dd
> > workloads. Would you (re)send them publicly?
>   Ah, I resent them publicly on Wednesday
> (http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
> apparently does not include emails from Acked-by into list of recipients so
> you didn't get them. Sorry for that. The patches are attached for your
> convenience.

OK thanks. I only checked the linux-fsdevel list before asking..
The results should be ready tomorrow.

Thanks,
Fengguang

> From a042c2a839ad3cf89d8ee158b2bb4b94b573f578 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Thu, 8 Sep 2011 01:05:25 +0200
> Subject: [PATCH 1/2] writeback: Improve busyloop prevention
> 
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through b_io list,
> just wait for a moment and try again.
> 
> CC: Christoph Hellwig <hch@infradead.org>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   26 ++++++++++++++------------
>  1 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 04cf3b9..bdeb26a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -699,8 +699,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
> -	struct inode *inode;
>  	long progress;
> +	long pause = 1;
>  
>  	oldest_jif = jiffies;
>  	work->older_than_this = &oldest_jif;
> @@ -755,25 +755,27 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * mean the overall work is done. So we keep looping as long
>  		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (progress)
> +		if (progress) {
> +			pause = 1;
>  			continue;
> +		}
>  		/*
>  		 * No more inodes for IO, bail
>  		 */
>  		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
> -		 * Nothing written. Wait for some inode to
> -		 * become available for writeback. Otherwise
> -		 * we'll just busyloop.
> +		 * Nothing written (some internal fs locks were unavailable or
> +		 * inode was under writeback from balance_dirty_pages() or
> +		 * similar conditions).  Wait for a while to avoid busylooping.
>  		 */
> -		if (!list_empty(&wb->b_more_io))  {
> -			trace_writeback_wait(wb->bdi, work);
> -			inode = wb_inode(wb->b_more_io.prev);
> -			spin_lock(&inode->i_lock);
> -			inode_wait_for_writeback(inode, wb);
> -			spin_unlock(&inode->i_lock);
> -		}
> +		trace_writeback_wait(wb->bdi, work);
> +		spin_unlock(&wb->list_lock);
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +		schedule_timeout(pause);
> +		if (pause < HZ / 10)
> +			pause <<= 1;
> +		spin_lock(&wb->list_lock);
>  	}
>  	spin_unlock(&wb->list_lock);
>  
> -- 
> 1.7.1
> 

> From 0a4a2cb4d5432f5446215b1e6e44f7d83032dba3 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Thu, 8 Sep 2011 01:46:42 +0200
> Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
> 
> Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
> whatever dirty_expire_centisecs is). This is an unnecessarily big delay in some
> cases and in other cases it is a really bad thing. In particular XFS tries to
> be nice to writeback and when ->write_inode is called for an inode with locked
> ilock, it just redirties the inode and returns EAGAIN. That currently causes
> writeback_single_inode() to redirty_tail() the inode. As contended ilock is
> common thing with XFS while extending files the result can be that inode
> writeout is put off for a really long time.
> 
> Now that we have more robust busyloop prevention in wb_writeback() we can
> call requeue_io() in cases where quick retry is required without fear of
> raising CPU consumption too much.
> 
> CC: Christoph Hellwig <hch@infradead.org>
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   61 ++++++++++++++++++++++++----------------------------
>  1 files changed, 28 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index bdeb26a..c786023 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	long nr_to_write = wbc->nr_to_write;
>  	unsigned dirty;
>  	int ret;
> +	bool inode_written = false;
>  
>  	assert_spin_locked(&wb->list_lock);
>  	assert_spin_locked(&inode->i_lock);
> @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>  		int err = write_inode(inode, wbc);
> +		if (!err)
> +			inode_written = true;
>  		if (ret == 0)
>  			ret = err;
>  	}
> @@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	if (!(inode->i_state & I_FREEING)) {
>  		/*
>  		 * Sync livelock prevention. Each inode is tagged and synced in
> -		 * one shot. If still dirty, it will be redirty_tail()'ed below.
> -		 * Update the dirty time to prevent enqueue and sync it again.
> +		 * one shot. If still dirty, update dirty time and put it back
> +		 * to dirty list to prevent enqueue and syncing it again.
>  		 */
>  		if ((inode->i_state & I_DIRTY) &&
> -		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
>  			inode->dirtied_when = jiffies;
> -
> -		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +			redirty_tail(inode, wb);
> +		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> -			 * We didn't write back all the pages.  nfs_writepages()
> -			 * sometimes bales out without doing anything.
> +			 * We didn't write back all the pages. nfs_writepages()
> +			 * sometimes bales out without doing anything or we
> +			 * just ran out of our writeback slice.
>  			 */
>  			inode->i_state |= I_DIRTY_PAGES;
> -			if (wbc->nr_to_write <= 0) {
> -				/*
> -				 * slice used up: queue for next turn
> -				 */
> -				requeue_io(inode, wb);
> -			} else {
> -				/*
> -				 * Writeback blocked by something other than
> -				 * congestion. Delay the inode for some time to
> -				 * avoid spinning on the CPU (100% iowait)
> -				 * retrying writeback of the dirty page/inode
> -				 * that cannot be performed immediately.
> -				 */
> -				redirty_tail(inode, wb);
> -			}
> +			requeue_io(inode, wb);
>  		} else if (inode->i_state & I_DIRTY) {
>  			/*
>  			 * Filesystems can dirty the inode during writeback
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
> -			 * completion.
> +			 * completion. Also inode could have been dirtied by
> +			 * some process aggressively touching metadata.
> +			 * Finally, filesystem could just fail to write the
> +			 * inode for some reason. We have to distinguish the
> +			 * last case from the previous ones - in the last case
> +			 * we want to give the inode quick retry, in the
> +			 * other cases we want to put it back to the dirty list
> +			 * to avoid livelocking of writeback.
>  			 */
> -			redirty_tail(inode, wb);
> +			if (inode_written)
> +				redirty_tail(inode, wb);
> +			else
> +				requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean.  At this point we either have
> @@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);
> @@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  		struct super_block *sb = inode->i_sb;
>  
>  		if (!grab_super_passive(sb)) {
> -			/*
> -			 * grab_super_passive() may fail consistently due to
> -			 * s_umount being grabbed by someone else. Don't use
> -			 * requeue_io() to avoid busy retrying the inode/sb.
> -			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  			continue;
>  		}
>  		wrote += writeback_sb_inodes(sb, wb, work);
> -- 
> 1.7.1
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 14:29               ` Wu Fengguang
@ 2011-10-07 14:45                 ` Jan Kara
  2011-10-07 15:29                   ` Wu Fengguang
  2011-10-08  4:00                   ` Wu Fengguang
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-07 14:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Dave Chinner,
	Christoph Hellwig

On Fri 07-10-11 22:29:28, Wu Fengguang wrote:
> On Fri, Oct 07, 2011 at 10:22:01PM +0800, Jan Kara wrote:
> > On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> > > >   Great, thanks for review! I'll resend the two patches to Christoph so
> > > > that he can try them.
> > > 
> > > Jan, I'd like to test out your updated patches with my stupid dd
> > > workloads. Would you (re)send them publicly?
> >   Ah, I resent them publicly on Wednesday
> > (http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
> > apparently does not include emails from Acked-by into list of recipients so
> > you didn't get them. Sorry for that. The patches are attached for your
> > convenience.
> 
> OK thanks. I only checked the linux-fsdevel list before asking..
> The results should be ready tomorrow.
  Thanks for testing!

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 14:45                 ` Jan Kara
@ 2011-10-07 15:29                   ` Wu Fengguang
  2011-10-08  4:00                   ` Wu Fengguang
  1 sibling, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-07 15:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig

On Fri, Oct 07, 2011 at 10:45:04PM +0800, Jan Kara wrote:
> On Fri 07-10-11 22:29:28, Wu Fengguang wrote:
> > On Fri, Oct 07, 2011 at 10:22:01PM +0800, Jan Kara wrote:
> > > On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> > > > >   Great, thanks for review! I'll resend the two patches to Christoph so
> > > > > that he can try them.
> > > > 
> > > > Jan, I'd like to test out your updated patches with my stupid dd
> > > > workloads. Would you (re)send them publicly?
> > >   Ah, I resent them publicly on Wednesday
> > > (http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
> > > apparently does not include emails from Acked-by into list of recipients so
> > > you didn't get them. Sorry for that. The patches are attached for your
> > > convenience.
> > 
> > OK thanks. I only checked the linux-fsdevel list before asking..
> > The results should be ready tomorrow.
>   Thanks for testing!

You are welcome! Not to mention it's hands down work :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 14:45                 ` Jan Kara
  2011-10-07 15:29                   ` Wu Fengguang
@ 2011-10-08  4:00                   ` Wu Fengguang
  2011-10-08 11:52                     ` Wu Fengguang
  2011-10-10 11:21                     ` Jan Kara
  1 sibling, 2 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-08  4:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

Hi Jan,

The test results do not look good: btrfs is heavily impacted and the
other filesystems are slightly impacted.

I'll send you the detailed logs in private emails (too large for the
mailing list). Basically I noticed many writeback_wait traces that
never appear w/o this patch. In the btrfs cases that see larger
regressions, I see large fluctuations in the writeout bandwidth and
long disk idle periods. It's still a bit puzzling how all these
happen..

      3.1.0-rc8-ioless6+  3.1.0-rc8-ioless6-requeue+
------------------------  ------------------------
                   59.39       -82.9%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.68       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.92       -80.0%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   38.02        -1.0%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   45.20        +1.7%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   42.50        -0.8%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   47.50        -2.5%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   58.18        -3.0%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   55.79        -2.1%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   44.89       -19.3%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   58.06        -4.2%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.94        -1.1%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   60.29       -35.9%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   58.80       -33.2%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   58.53       -21.5%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   31.96        -4.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   36.19        -1.0%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   45.03        -2.7%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   51.47        -2.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.19        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.41        -1.0%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   43.44        -8.4%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   49.83        -3.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.70        -0.8%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   57.12       -85.5%         8.27  thresh=8M/btrfs-10dd-4k-8p-4096M-8M:10-X
                   59.29       -84.7%         9.05  thresh=8M/btrfs-1dd-4k-8p-4096M-8M:10-X
                   59.23       -84.9%         8.97  thresh=8M/btrfs-2dd-4k-8p-4096M-8M:10-X
                   33.63        -3.3%        32.51  thresh=8M/ext3-10dd-4k-8p-4096M-8M:10-X
                   48.30        -4.7%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   46.77        -4.5%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   36.58        -2.2%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   57.35        -0.3%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   52.82        -1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                   32.19        -4.5%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.86        -1.4%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   48.96       -33.1%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 1810.02       -22.1%      1410.49  TOTAL

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-08  4:00                   ` Wu Fengguang
@ 2011-10-08 11:52                     ` Wu Fengguang
  2011-10-08 13:49                       ` Wu Fengguang
  2011-10-10 11:21                     ` Jan Kara
  1 sibling, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-08 11:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> Hi Jan,
> 
> The test results do not look good: btrfs is heavily impacted and the
> other filesystems are slightly impacted.
> 
> I'll send you the detailed logs in private emails (too large for the
> mailing list). Basically I noticed many writeback_wait traces that
> never appear w/o this patch. In the btrfs cases that see larger
> regressions, I see large fluctuations in the writeout bandwidth and
> long disk idle periods. It's still a bit puzzling how all these
> happen..

Sorry, I find that part of the regressions (about 2-3%) is caused by a
recent change in my test scripts. Here are fairer comparisons, and they
show regressions only in btrfs and xfs:

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
------------------------  ------------------------
                   37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                  787.25        +0.7%       792.47  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
------------------------  ------------------------
                   44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                  428.07        -6.1%       401.99  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
------------------------  ------------------------
                   58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                  344.06       -54.3%       157.24  TOTAL

I'm now bisecting the patches to find out the root cause.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-08 11:52                     ` Wu Fengguang
@ 2011-10-08 13:49                       ` Wu Fengguang
  2011-10-09  0:27                         ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-08 13:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > Hi Jan,
> > 
> > The test results do not look good: btrfs is heavily impacted and the
> > other filesystems are slightly impacted.
> > 
> > I'll send you the detailed logs in private emails (too large for the
> > mailing list). Basically I noticed many writeback_wait traces that
> > never appear w/o this patch. In the btrfs cases that see larger
> > regressions, I see large fluctuations in the writeout bandwidth and
> > long disk idle periods. It's still a bit puzzling how all these
> > happen..
> 
> Sorry, I find that part of the regressions (about 2-3%) is caused by a
> recent change in my test scripts. Here are fairer comparisons, and they
> show regressions only in btrfs and xfs:
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> ------------------------  ------------------------
>                    37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
>                    44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
>                    41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
>                    46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
>                    56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
>                    54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
>                    30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
>                    35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
>                    43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
>                    50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
>                    56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
>                    58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
>                    45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
>                    43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
>                    35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
>                    56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
>                    51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
>                   787.25        +0.7%       792.47  TOTAL
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> ------------------------  ------------------------
>                    44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
>                    55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
>                    51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
>                    41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
>                    48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
>                    52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
>                    31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
>                    55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
>                    47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
>                   428.07        -6.1%       401.99  TOTAL
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> ------------------------  ------------------------
>                    58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
>                    58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
>                    58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
>                    56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
>                    56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
>                    56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
>                   344.06       -54.3%       157.24  TOTAL
> 
> I'm now bisecting the patches to find out the root cause.

Current findings are that with only the first patch applied, or with the
second patch reduced to the one below, the btrfs regressions go away:

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue2+  
------------------------  ------------------------  
                   58.23        -0.3%        58.06  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43        -0.4%        58.19  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53        -0.5%        58.25  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   56.55        -0.4%        56.30  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        +0.1%        56.19  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        -0.2%        56.12  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -2.1%        49.36  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -2.2%        55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -2.2%        56.82  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   41.76        +1.6%        42.42  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -1.0%        47.85  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -1.5%        51.57  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                  651.29        -0.8%       646.12  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue3+
------------------------  ------------------------
                   56.55        -3.3%        54.70  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        -0.4%        55.91  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        +0.7%        56.58  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                  168.87        -1.0%       167.20  TOTAL

--- linux-next.orig/fs/fs-writeback.c	2011-10-08 20:49:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-08 20:51:22.000000000 +0800
@@ -370,6 +370,7 @@ writeback_single_inode(struct inode *ino
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -434,6 +435,8 @@ writeback_single_inode(struct inode *ino
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -477,9 +480,19 @@ writeback_single_inode(struct inode *ino
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-08 13:49                       ` Wu Fengguang
@ 2011-10-09  0:27                         ` Wu Fengguang
  2011-10-09  8:44                           ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-09  0:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

On Sat, Oct 08, 2011 at 09:49:27PM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> > On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > > Hi Jan,
> > > 
> > > The test results look not good: btrfs is heavily impacted and the
> > > other filesystems are slightly impacted.
> > > 
> > > I'll send you the detailed logs in private emails (too large for the
> > > mailing list). Basically I noticed many writeback_wait traces that
> > > never appear w/o this patch. In the btrfs cases that see larger
> > > regressions, I see large fluctuations in the writeout bandwidth and
> > > long disk idle periods. It's still a bit puzzling how all these
> > > happen..
> > 
> > Sorry I find that part of the regressions (about 2-3%) are caused by
> > change of my test scripts recently. Here are the more fair compares
> > and they show only regressions in btrfs and xfs:
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >                    44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >                    41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >                    46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >                    56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >                    54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >                    30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
> >                    35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
> >                    43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
> >                    50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
> >                    56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
> >                    58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
> >                    45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
> >                    43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
> >                    35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
> >                    56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
> >                    51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
> >                   787.25        +0.7%       792.47  TOTAL
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
> >                    55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
> >                    51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
> >                    41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
> >                    48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
> >                    52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
> >                    31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
> >                    55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
> >                    47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
> >                   428.07        -6.1%       401.99  TOTAL
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
> >                    58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
> >                    58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
> >                    56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
> >                    56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
> >                    56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
> >                   344.06       -54.3%       157.24  TOTAL
> > 
> > I'm now bisecting the patches to find out the root cause.
> 
> Current findings are, when only applying the first patch, or reduce the second
> patch to the below one, the btrfs regressions are restored:

And the reduced patch below is also OK:

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue4+
------------------------  ------------------------
                   58.23        -0.4%        57.98  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43        -2.2%        57.13  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53        -1.2%        57.83  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   37.34        -0.7%        37.07  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   44.44        +0.2%        44.52  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   41.70        +0.0%        41.72  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   46.45        -0.7%        46.10  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   56.60        -0.8%        56.15  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   54.14        +0.3%        54.33  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   44.53        -7.3%        41.29  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   55.89        +0.9%        56.39  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.11        +1.0%        51.60  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   56.55        -1.0%        55.97  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        -1.5%        55.28  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        -1.9%        55.16  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   30.66        -2.7%        29.82  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   35.24        -0.7%        35.00  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   43.58        -2.1%        42.65  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -2.4%        49.21  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -2.2%        55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -1.8%        57.08  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   41.76        -5.1%        39.61  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -2.6%        47.06  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -3.3%        50.64  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   45.37        +0.7%        45.70  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   43.71        +0.7%        44.00  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.58        +0.7%        35.82  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   56.39        -1.1%        55.77  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   51.26        -0.6%        50.94  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                   31.07       -13.3%        26.94  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.44        +0.5%        55.72  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   47.59        +1.6%        48.33  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 1559.39        -1.4%      1537.83  TOTAL

Subject: writeback: Replace some redirty_tail() calls with requeue_io()
Date: Thu, 8 Sep 2011 01:46:42 +0200

From: Jan Kara <jack@suse.cz>

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily big delay in
some cases and in other cases it is a really bad thing. In particular, XFS
tries to be nice to writeback and when ->write_inode is called for an inode
with a locked ilock, it just redirties the inode and returns EAGAIN. That
currently causes writeback_single_inode() to redirty_tail() the inode. As a
contended ilock is a common thing with XFS while extending files, the result
can be that inode writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback() we can
call requeue_io() in cases where quick retry is required without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-10-08 20:49:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-08 21:51:00.000000000 +0800
@@ -370,6 +370,7 @@ writeback_single_inode(struct inode *ino
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -434,6 +435,8 @@ writeback_single_inode(struct inode *ino
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -477,9 +480,19 @@ writeback_single_inode(struct inode *ino
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -597,10 +610,10 @@ static long writeback_sb_inodes(struct s
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -632,12 +645,7 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-09  0:27                         ` Wu Fengguang
@ 2011-10-09  8:44                           ` Wu Fengguang
  0 siblings, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-09  8:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

On Sun, Oct 09, 2011 at 08:27:36AM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 09:49:27PM +0800, Wu Fengguang wrote:
> > On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> > > On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > > > Hi Jan,
> > > > 
> > > > The test results look not good: btrfs is heavily impacted and the
> > > > other filesystems are slightly impacted.
> > > > 
> > > > I'll send you the detailed logs in private emails (too large for the
> > > > mailing list). Basically I noticed many writeback_wait traces that
> > > > never appear w/o this patch. In the btrfs cases that see larger
> > > > regressions, I see large fluctuations in the writeout bandwidth and
> > > > long disk idle periods. It's still a bit puzzling how all these
> > > > happen..
> > > 
> > > Sorry I find that part of the regressions (about 2-3%) are caused by
> > > change of my test scripts recently. Here are the more fair compares
> > > and they show only regressions in btrfs and xfs:
> > > 
> > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > > ------------------------  ------------------------
> > >                    37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> > >                    44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> > >                    41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> > >                    46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> > >                    56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> > >                    54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> > >                    30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
> > >                    35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
> > >                    43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
> > >                    50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
> > >                    56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
> > >                    58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
> > >                    45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
> > >                    43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
> > >                    35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
> > >                    56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
> > >                    51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
> > >                   787.25        +0.7%       792.47  TOTAL
> > > 
> > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > > ------------------------  ------------------------
> > >                    44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
> > >                    55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
> > >                    51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
> > >                    41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
> > >                    48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
> > >                    52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
> > >                    31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
> > >                    55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
> > >                    47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
> > >                   428.07        -6.1%       401.99  TOTAL
> > > 
> > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > > ------------------------  ------------------------
> > >                    58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
> > >                    58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
> > >                    58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
> > >                    56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
> > >                    56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
> > >                    56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
> > >                   344.06       -54.3%       157.24  TOTAL
> > > 
> > > I'm now bisecting the patches to find out the root cause.
> > 
> > Current findings are, when only applying the first patch, or reduce the second
> > patch to the below one, the btrfs regressions are restored:
> 
> And the below reduced patch is also OK:
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue4+
> ------------------------  ------------------------
>                    58.23        -0.4%        57.98  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
>                    58.43        -2.2%        57.13  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
>                    58.53        -1.2%        57.83  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
>                    37.34        -0.7%        37.07  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
>                    44.44        +0.2%        44.52  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
>                    41.70        +0.0%        41.72  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
>                    46.45        -0.7%        46.10  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
>                    56.60        -0.8%        56.15  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
>                    54.14        +0.3%        54.33  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
>                    44.53        -7.3%        41.29  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
>                    55.89        +0.9%        56.39  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
>                    51.11        +1.0%        51.60  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
>                    56.55        -1.0%        55.97  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
>                    56.11        -1.5%        55.28  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
>                    56.21        -1.9%        55.16  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
>                    30.66        -2.7%        29.82  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
>                    35.24        -0.7%        35.00  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
>                    43.58        -2.1%        42.65  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
>                    50.42        -2.4%        49.21  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
>                    56.23        -2.2%        55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
>                    58.12        -1.8%        57.08  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
>                    41.76        -5.1%        39.61  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
>                    48.34        -2.6%        47.06  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
>                    52.36        -3.3%        50.64  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
>                    45.37        +0.7%        45.70  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
>                    43.71        +0.7%        44.00  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
>                    35.58        +0.7%        35.82  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
>                    56.39        -1.1%        55.77  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
>                    51.26        -0.6%        50.94  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
>                    31.07       -13.3%        26.94  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
>                    55.44        +0.5%        55.72  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
>                    47.59        +1.6%        48.33  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
>                  1559.39        -1.4%      1537.83  TOTAL

Now I got slightly better results with the incremental patch below.

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue5+
------------------------  ------------------------
                   58.23        +0.9%        58.76  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43        -0.4%        58.19  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53        -0.1%        58.48  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   37.34        +1.4%        37.88  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   44.44        -0.3%        44.30  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   41.70        +1.3%        42.24  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   46.45        +0.3%        46.59  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   56.60        -0.7%        56.22  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   54.14        +0.7%        54.50  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   44.53        -2.4%        43.45  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   55.89        +0.2%        56.00  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.11        -0.0%        51.10  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   56.55        +0.0%        56.56  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        +0.5%        56.38  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        +0.1%        56.29  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   30.66        +0.2%        30.72  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   35.24        +0.8%        35.54  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   43.58        +0.1%        43.64  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -0.9%        49.99  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -0.0%        56.21  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -1.9%        57.02  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   41.76        -5.9%        39.30  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -1.4%        47.67  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -1.8%        51.41  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   54.37        +0.6%        54.71  thresh=8M/btrfs-10dd-4k-8p-4096M-8M:10-X
                   56.11        +1.7%        57.08  thresh=8M/btrfs-1dd-4k-8p-4096M-8M:10-X
                   56.22        +0.8%        56.67  thresh=8M/btrfs-2dd-4k-8p-4096M-8M:10-X
                   32.21        -0.2%        32.15  thresh=8M/ext3-10dd-4k-8p-4096M-8M:10-X
                   45.37        +0.2%        45.45  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   43.71        +0.0%        43.72  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.58        +0.1%        35.61  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   56.39        +1.0%        56.98  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   51.26        +0.4%        51.44  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                   31.07       -11.5%        27.49  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.44        +0.7%        55.81  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   47.59        +1.0%        48.05  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 1758.29        -0.3%      1753.58  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue5+
------------------------  ------------------------
                   62.64        -0.8%        62.13  3G-UKEY-HDD/ext4-1dd-4k-8p-4096M-20:10-X
                   59.80        -0.1%        59.76  3G-UKEY-HDD/ext4-2dd-4k-8p-4096M-20:10-X
                   53.42        -1.1%        52.84  3G-UKEY-HDD/xfs-10dd-4k-8p-4096M-20:10-X
                   59.00        -0.1%        58.92  3G-UKEY-HDD/xfs-1dd-4k-8p-4096M-20:10-X
                   56.14        -0.3%        55.95  3G-UKEY-HDD/xfs-2dd-4k-8p-4096M-20:10-X
                  291.00        -0.5%       289.60  TOTAL

> @@ -597,10 +610,10 @@ static long writeback_sb_inodes(struct s
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);

The above change to requeue_io() is not desirable when the inode was already
redirty_tail()ed in writeback_single_inode(). So I'd propose to either leave
the original code as it is, or to remove the code as in the patch below.
Either way, we get slightly better performance, as demonstrated by the larger
bandwidth in 3.1.0-rc8-ioless6-requeue3+ and 3.1.0-rc8-ioless6-requeue5+.

Thanks,
Fengguang
---

--- linux-next.orig/fs/fs-writeback.c	2011-10-09 08:34:38.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-09 08:35:06.000000000 +0800
@@ -608,13 +608,6 @@ static long writeback_sb_inodes(struct s
 		wrote += write_chunk - wbc.nr_to_write;
 		if (!(inode->i_state & I_DIRTY))
 			wrote++;
-		if (wbc.pages_skipped) {
-			/*
-			 * Writeback is not making progress due to unavailable
-			 * fs locks or similar condition. Retry in next round.
-			 */
-			requeue_io(inode, wb);
-		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
 		iput(inode);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-08  4:00                   ` Wu Fengguang
  2011-10-08 11:52                     ` Wu Fengguang
@ 2011-10-10 11:21                     ` Jan Kara
  2011-10-10 11:31                       ` Wu Fengguang
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-10-10 11:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Dave Chinner,
	Christoph Hellwig, Chris Mason

  Hi Fengguang,

On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> The test results look not good: btrfs is heavily impacted and the
> other filesystems are slightly impacted.
>
> I'll send you the detailed logs in private emails (too large for the
> mailing list).  Basically I noticed many writeback_wait traces that never
> appear w/o this patch.
  OK, thanks for running these tests. I'll have a look at the detailed logs.
I guess the difference could be caused by the changes in the redirty/requeue
logic in the second patch (the changes in the first patch could at most split
one writeback_wait event into several, but could never introduce new events).

I guess I'll also try to reproduce the problem, since it should be pretty
easy given that you see such a huge regression even with 1 dd process on a
btrfs filesystem.

> In the btrfs cases that see larger regressions, I see large fluctuations
> in the writeout bandwidth and long disk idle periods. It's still a bit
> puzzling how all these happen..
  Yes, I don't understand it yet either...

									Honza

> 
>       3.1.0-rc8-ioless6+  3.1.0-rc8-ioless6-requeue+
> ------------------------  ------------------------
>                    59.39       -82.9%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
>                    58.68       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
>                    58.92       -80.0%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
>                    38.02        -1.0%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
>                    45.20        +1.7%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
>                    42.50        -0.8%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
>                    47.50        -2.5%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
>                    58.18        -3.0%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
>                    55.79        -2.1%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
>                    44.89       -19.3%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
>                    58.06        -4.2%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
>                    51.94        -1.1%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
>                    60.29       -35.9%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
>                    58.80       -33.2%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
>                    58.53       -21.5%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
>                    31.96        -4.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
>                    36.19        -1.0%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
>                    45.03        -2.7%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
>                    51.47        -2.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
>                    56.19        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
>                    58.41        -1.0%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
>                    43.44        -8.4%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
>                    49.83        -3.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
>                    52.70        -0.8%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
>                    57.12       -85.5%         8.27  thresh=8M/btrfs-10dd-4k-8p-4096M-8M:10-X
>                    59.29       -84.7%         9.05  thresh=8M/btrfs-1dd-4k-8p-4096M-8M:10-X
>                    59.23       -84.9%         8.97  thresh=8M/btrfs-2dd-4k-8p-4096M-8M:10-X
>                    33.63        -3.3%        32.51  thresh=8M/ext3-10dd-4k-8p-4096M-8M:10-X
>                    48.30        -4.7%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
>                    46.77        -4.5%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
>                    36.58        -2.2%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
>                    57.35        -0.3%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
>                    52.82        -1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
>                    32.19        -4.5%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
>                    55.86        -1.4%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
>                    48.96       -33.1%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
>                  1810.02       -22.1%      1410.49  TOTAL
> 
> Thanks,
> Fengguang
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-10 11:21                     ` Jan Kara
@ 2011-10-10 11:31                       ` Wu Fengguang
  2011-10-10 23:30                         ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-10 11:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > The test results look not good: btrfs is heavily impacted and the
> > other filesystems are slightly impacted.
> >
> > I'll send you the detailed logs in private emails (too large for the
> > mailing list).  Basically I noticed many writeback_wait traces that never
> > appear w/o this patch.
>   OK, thanks for running these tests. I'll have a look at detailed logs.
> I guess the difference can be caused by changes in redirty/requeue logic in
> the second patch (the changes in the first patch could possibly make
> several writeback_wait events from one event but never could introduce new
> events).
> 
> I guess I'll also try to reproduce the problem since it should be pretty
> easy when you see such a huge regression even with 1 dd process on btrfs
> filesystem.
> 
> > In the btrfs cases that see larger regressions, I see large fluctuations
> > in the writeout bandwidth and long disk idle periods. It's still a bit
> > puzzling how all these happen..
>   Yes, I don't understand it yet either...

Jan, it's obviously caused by this chunk, which is not really
necessary for fixing Christoph's problem. So the easy way is to go
ahead without this chunk.

The remaining problem is that the simple dd tests may not be suitable
workloads for demonstrating the patches' usefulness to XFS.

Thanks,
Fengguang
---

                if ((inode->i_state & I_DIRTY) &&
-                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
                        inode->dirtied_when = jiffies;
-
-               if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+                       redirty_tail(inode, wb);
+               } else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
                        /*
-                        * We didn't write back all the pages.  nfs_writepages()
-                        * sometimes bales out without doing anything.
+                        * We didn't write back all the pages. nfs_writepages()
+                        * sometimes bales out without doing anything or we
+                        * just run our of our writeback slice.
                         */
                        inode->i_state |= I_DIRTY_PAGES;
-                       if (wbc->nr_to_write <= 0) {
-                               /*
-                                * slice used up: queue for next turn
-                                */
-                               requeue_io(inode, wb);
-                       } else {
-                               /*
-                                * Writeback blocked by something other than
-                                * congestion. Delay the inode for some time to
-                                * avoid spinning on the CPU (100% iowait)
-                                * retrying writeback of the dirty page/inode
-                                * that cannot be performed immediately.
-                                */
-                               redirty_tail(inode, wb);
-                       }
+                       requeue_io(inode, wb);
                } else if (inode->i_state & I_DIRTY) {

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-10 11:31                       ` Wu Fengguang
@ 2011-10-10 23:30                         ` Jan Kara
  2011-10-11  2:36                           ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-10-10 23:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Dave Chinner,
	Christoph Hellwig, Chris Mason

On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> >   Hi Fengguang,
> > 
> > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > The test results look not good: btrfs is heavily impacted and the
> > > other filesystems are slightly impacted.
> > >
> > > I'll send you the detailed logs in private emails (too large for the
> > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > appear w/o this patch.
> >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > I guess the difference can be caused by changes in redirty/requeue logic in
> > the second patch (the changes in the first patch could possibly make
> > several writeback_wait events from one event but never could introduce new
> > events).
> > 
> > I guess I'll also try to reproduce the problem since it should be pretty
> > easy when you see such a huge regression even with 1 dd process on btrfs
> > filesystem.
> > 
> > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > puzzling how all these happen..
> >   Yes, I don't understand it yet either...
> 
> Jan, it's obviously caused by this chunk, which is not really
> necessary for fixing Christoph's problem. So the easy way is to go
> ahead without this chunk.
  Yes, thanks a lot for debugging this! I'd still like to understand why
the hunk below is causing such a big problem for btrfs. I was looking into
the traces and all I could find so far was that for some reason the relevant
inode (ino 257) was not getting queued for writeback for a long time (e.g.
20 seconds), which introduced disk idle times and thus bad throughput. But I
don't understand yet why the inode was not queued for such a long time...
It's too late today, but I'll continue my investigation tomorrow.

> The remaining problems is, the simple dd tests may not be the suitable
> workloads to demonstrate the patches' usefulness to XFS.
  Maybe; hopefully Christoph will tell us whether the patches work for him
or not.

								Honza
> 
> Thanks,
> Fengguang
> ---
> 
>                 if ((inode->i_state & I_DIRTY) &&
> -                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> +                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
>                         inode->dirtied_when = jiffies;
> -
> -               if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +                       redirty_tail(inode, wb);
> +               } else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>                         /*
> -                        * We didn't write back all the pages.  nfs_writepages()
> -                        * sometimes bales out without doing anything.
> +                        * We didn't write back all the pages. nfs_writepages()
> +                        * sometimes bales out without doing anything or we
> +                        * just run our of our writeback slice.
>                          */
>                         inode->i_state |= I_DIRTY_PAGES;
> -                       if (wbc->nr_to_write <= 0) {
> -                               /*
> -                                * slice used up: queue for next turn
> -                                */
> -                               requeue_io(inode, wb);
> -                       } else {
> -                               /*
> -                                * Writeback blocked by something other than
> -                                * congestion. Delay the inode for some time to
> -                                * avoid spinning on the CPU (100% iowait)
> -                                * retrying writeback of the dirty page/inode
> -                                * that cannot be performed immediately.
> -                                */
> -                               redirty_tail(inode, wb);
> -                       }
> +                       requeue_io(inode, wb);
>                 } else if (inode->i_state & I_DIRTY) {
> 
> Thanks,
> Fengguang
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-10 23:30                         ` Jan Kara
@ 2011-10-11  2:36                           ` Wu Fengguang
  2011-10-11 21:53                             ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-11  2:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > >   Hi Fengguang,
> > > 
> > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > The test results look not good: btrfs is heavily impacted and the
> > > > other filesystems are slightly impacted.
> > > >
> > > > I'll send you the detailed logs in private emails (too large for the
> > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > appear w/o this patch.
> > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > the second patch (the changes in the first patch could possibly make
> > > several writeback_wait events from one event but never could introduce new
> > > events).
> > > 
> > > I guess I'll also try to reproduce the problem since it should be pretty
> > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > filesystem.
> > > 
> > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > puzzling how all these happen..
> > >   Yes, I don't understand it yet either...
> > 
> > Jan, it's obviously caused by this chunk, which is not really
> > necessary for fixing Christoph's problem. So the easy way is to go
> > ahead without this chunk.
>   Yes, thanks a lot for debugging this! I'd still like to understand why
> the hunk below is causing such a big problem to btrfs. I was looking into
> the traces and all I could find so far was that for some reason relevant
> inode (ino 257) was not getting queued for writeback for a long time (e.g.
> 20 seconds) which introduced disk idle times and thus bad throughput. But I
> don't understand why the inode was not queue for such a long time yet...
> Today it's too late but I'll continue with my investigation tomorrow.

Yeah, I have exactly the same observation and puzzlement..

> > The remaining problems is, the simple dd tests may not be the suitable
> > workloads to demonstrate the patches' usefulness to XFS.
>   Maybe, hopefully Christoph will tell use whether patches work for him or
> not.

The explanation could be that there are only negligible differences between
redirty_tail() and requeue_io() for XFS background writeback, because
background writeback simply ignores inode->dirtied_when (see the sketch
below).
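
For reference, here is a rough sketch of what the two helpers being compared
do (paraphrased from memory from fs/fs-writeback.c of this period, not a
verbatim copy, and relying on kernel-internal types and list helpers):
redirty_tail() puts the inode back on b_dirty and refreshes dirtied_when when
needed, while requeue_io() just moves it to b_more_io for a quick retry.
Background writeback picks up inodes regardless of age, so the dirtied_when
update makes little difference there.

static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail = wb_inode(wb->b_dirty.next);

		/*
		 * Do not let the inode look older than the most recently
		 * dirtied inode already queued, otherwise it would be
		 * considered expired too early.
		 */
		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	/* Back to the dirty list; kupdate writeback won't pick it up until it expires again. */
	list_move(&inode->i_wb_list, &wb->b_dirty);
}

static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	/* b_more_io is spliced back into b_io by queue_io(), so this retries soon. */
	list_move(&inode->i_wb_list, &wb->b_more_io);
}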

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-11  2:36                           ` Wu Fengguang
@ 2011-10-11 21:53                             ` Jan Kara
  2011-10-12  2:44                               ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-10-11 21:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

[-- Attachment #1: Type: text/plain, Size: 4339 bytes --]

On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > >   Hi Fengguang,
> > > > 
> > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > The test results look not good: btrfs is heavily impacted and the
> > > > > other filesystems are slightly impacted.
> > > > >
> > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > > appear w/o this patch.
> > > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > > the second patch (the changes in the first patch could possibly make
> > > > several writeback_wait events from one event but never could introduce new
> > > > events).
> > > > 
> > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > filesystem.
> > > > 
> > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > puzzling how all these happen..
> > > >   Yes, I don't understand it yet either...
> > > 
> > > Jan, it's obviously caused by this chunk, which is not really
> > > necessary for fixing Christoph's problem. So the easy way is to go
> > > ahead without this chunk.
> >   Yes, thanks a lot for debugging this! I'd still like to understand why
> > the hunk below is causing such a big problem to btrfs. I was looking into
> > the traces and all I could find so far was that for some reason relevant
> > inode (ino 257) was not getting queued for writeback for a long time (e.g.
> > 20 seconds) which introduced disk idle times and thus bad throughput. But I
> > don't understand why the inode was not queue for such a long time yet...
> > Today it's too late but I'll continue with my investigation tomorrow.
> 
> Yeah, I have exactly the same observation and puzzle..
  OK, I dug more into this and I think I found an explanation. The problem
starts at
   flush-btrfs-1-1336  [005]    20.688011: writeback_start: bdi btrfs-1:
sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic
  in the btrfs trace you sent me, when we start "kupdate" style writeback
for bdi "btrfs-1". This work then blocks the flusher thread up to the moment:
   flush-btrfs-1-1336  [007]    45.707479: writeback_start: bdi btrfs-1:
sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic
   flush-btrfs-1-1336  [007]    45.707479: writeback_written: bdi btrfs-1:
sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic

  (i.e. for 25 seconds). The reason why this work blocks the flusher thread
for so long is that btrfs has a "btree inode" - essentially an inode holding
filesystem metadata - and btrfs ignores any ->writepages() request for this
inode coming from kupdate style writeback. So we always try to write this
inode, make no progress, requeue the inode (as its mapping is still tagged as
dirty), see that b_more_io is nonempty, sleep for a while and then retry. We
do not include inode 257 with the real dirty data in the writeback because
this is kupdate style writeback and inode 257's dirty timestamp is not old
enough. This loop would break either after 30s, when the inode with data
becomes old enough, or - as we see above - at the moment when btrfs decides
to do a transaction commit and cleans the metadata inode by its own methods.
In either case this is far too late...
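
  For reference, the btrfs check described above looks roughly like the
sketch below (paraphrased from btree_writepages() in fs/btrfs/disk-io.c of
this period; the exact names and the threshold value are from memory and may
not match the tested tree precisely):

static int btree_writepages(struct address_space *mapping,
			    struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_NONE) {
		struct btrfs_root *root = BTRFS_I(mapping->host)->root;
		unsigned long thresh = 32 * 1024 * 1024;

		/*
		 * kupdate style writeback is refused outright, so the flusher
		 * never makes progress on the btree inode from this path.
		 */
		if (wbc->for_kupdate)
			return 0;

		/*
		 * Other non-integrity writeback is skipped as well until
		 * enough dirty metadata has accumulated; btrfs prefers to
		 * write its metadata at transaction commit time.
		 */
		if (root->fs_info->dirty_metadata_bytes < thresh)
			return 0;
	}
	return extent_writepages(&BTRFS_I(mapping->host)->io_tree, mapping,
				 btree_get_extent, wbc);
}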
 
  So for now I don't see a better alternative than to revert to the old
behavior in writeback_single_inode() as you suggested earlier. That way we
would redirty_tail() inodes which we cannot clean, and thus they won't cause
livelocking of the kupdate work. Longer term we might want to be more clever
about switching away from kupdate style writeback to pure background
writeback, but it's not yet clear to me what the logic should be so that we
still give enough preference to old inodes...

  A new version of the second patch is attached.

								Honza

[-- Attachment #2: 0002-writeback-Replace-some-redirty_tail-calls-with-reque.patch --]
[-- Type: text/x-patch, Size: 5656 bytes --]

>From 97fb1a2f4d334787e9dd23ef58ead1cf4e439cc2 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Sep 2011 01:46:42 +0200
Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily big delay in
some cases and in other cases it is a really bad thing. In particular, XFS
tries to be nice to writeback and when ->write_inode is called for an inode
with a locked ilock, it just redirties the inode and returns EAGAIN. That
currently causes writeback_single_inode() to redirty_tail() the inode. As a
contended ilock is a common thing with XFS while extending files, the result
can be that inode writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback() we can
call requeue_io() in cases where quick retry is required without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   55 ++++++++++++++++++++++++++++++----------------------
 1 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bdeb26a..b03878a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,17 +433,20 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. We may have just
+			 * run out of our writeback slice, or nfs_writepages()
+			 * sometimes bales out without doing anything, or e.g.
+			 * btrfs ignores for_kupdate writeback requests for
+			 * metadata inodes.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
 			if (wbc->nr_to_write <= 0) {
@@ -450,11 +456,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 				requeue_io(inode, wb);
 			} else {
 				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
+				 * Writeback blocked by something. Put inode
+				 * back to dirty list to prevent livelocking of
+				 * writeback.
 				 */
 				redirty_tail(inode, wb);
 			}
@@ -463,9 +467,19 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -618,12 +632,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-11 21:53                             ` Jan Kara
@ 2011-10-12  2:44                               ` Wu Fengguang
  2011-10-12 19:34                                 ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-12  2:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Dave Chinner, Christoph Hellwig,
	Chris Mason

On Wed, Oct 12, 2011 at 05:53:59AM +0800, Jan Kara wrote:
> On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> > On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > > >   Hi Fengguang,
> > > > > 
> > > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > > The test results look not good: btrfs is heavily impacted and the
> > > > > > other filesystems are slightly impacted.
> > > > > >
> > > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > > > appear w/o this patch.
> > > > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > > > the second patch (the changes in the first patch could possibly make
> > > > > several writeback_wait events from one event but never could introduce new
> > > > > events).
> > > > > 
> > > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > > filesystem.
> > > > > 
> > > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > > puzzling how all these happen..
> > > > >   Yes, I don't understand it yet either...
> > > > 
> > > > Jan, it's obviously caused by this chunk, which is not really
> > > > necessary for fixing Christoph's problem. So the easy way is to go
> > > > ahead without this chunk.
> > >   Yes, thanks a lot for debugging this! I'd still like to understand why
> > > the hunk below is causing such a big problem to btrfs. I was looking into
> > > the traces and all I could find so far was that for some reason relevant
> > > inode (ino 257) was not getting queued for writeback for a long time (e.g.
> > > 20 seconds) which introduced disk idle times and thus bad throughput. But I
> > > don't understand why the inode was not queue for such a long time yet...
> > > Today it's too late but I'll continue with my investigation tomorrow.
> > 
> > Yeah, I have exactly the same observation and puzzle..
>   OK, I dug more into this and I think I found an explanation. The problem
> starts at
>    flush-btrfs-1-1336  [005]    20.688011: writeback_start: bdi btrfs-1:
> sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
>   in the btrfs trace you sent me when we start "kupdate" style writeback
> for bdi "btrfs-1". This work then blocks flusher thread upto moment:
>    flush-btrfs-1-1336  [007]    45.707479: writeback_start: bdi btrfs-1:
> sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
>    flush-btrfs-1-1336  [007]    45.707479: writeback_written: bdi btrfs-1:
> sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
> 
>   (i.e. for 25 seconds). The reason why this work blocks flusher thread for
> so long is that btrfs has "btree inode" - essentially an inode holding
> filesystem metadata and btrfs ignores any ->writepages() request for this
> inode coming from kupdate style writeback. So we always try to write this
> inode, make no progress, requeue inode (as it has still mapping tagged as
> dirty), see that b_more_io is nonempty so we sleep for a while and then
> retry. We do not include inode 257 with real dirty data into writeback
> because this is kupdate style writeback and inode 257 does not have dirty
> timestamp old enough. This loop would break either after 30s when inode
> with data becomes old enough or - as we see above - at the moment when
> btrfs decided to do transaction commit and cleaned metadata inode by it's
> own methods. In either case this is far too late...

Yes indeed. Good catch! 

The implication of this case is: never put an inode onto b_more_io
unless we made some progress on cleaning some pages or the metadata.

Failing to do so will lead to

- busy looping (which can be fixed by patch 1/2 "writeback: Improve busyloop prevention")

- blocking the current work (and in turn the other queued works) for a long
  time, where the other pending works may tend to work on a different set of
  inodes or have different criteria for the FS to make progress. The
  existing examples are the for_kupdate test in btrfs and the SYNC vs
  ASYNC tests in general. And I'm planning to send writeback works
  from the vmscan code to write a specific inode...

In this sense, converting the redirty_tail() calls to requeue_io() does
not look like the right direction.

If we change redirty_tail() to the earlier proposed requeue_io_wait(),
all the known problems can be solved nicely.
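As a rough sketch of that idea (the list name b_more_io_wait and the exact
splice point are assumptions here, not necessarily what the earlier patch
did): an inode we could not write right now is parked on a separate list that
is only pulled back in on the next queueing pass, so it neither busyloops nor
gets its dirtied_when refreshed:

	/* sketch only: park the inode until the next queueing pass */
	static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
	{
		assert_spin_locked(&wb->list_lock);
		list_move(&inode->i_wb_list, &wb->b_more_io_wait);
	}

	/* ...and when queueing inodes for the next pass, retry them: */
	list_splice_init(&wb->b_more_io_wait, &wb->b_io);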

>   So for now I don't see a better alternative than to revert to old
> behavior in writeback_single_inode() as you suggested earlier. That way we
> would redirty_tail() inodes which we cannot clean and thus they won't cause
> livelocking of kupdate work.

requeue_io_wait() can equally avoid touching inode->dirtied_when :)

> Longer term we might want to be more clever in
> switching away from kupdate style writeback to pure background writeback
> but it's not yet clear to me what the logic should be so that we give
> enough preference to old inodes...

We'll need to adequately update older_than_this in the wb_writeback()
loop for background work. Then we can make the switch.
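Something along these lines in the wb_writeback() loop, for instance (just a
sketch of the direction, not a tested patch):

	if (work->for_kupdate) {
		oldest_jif = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
	} else if (work->for_background)
		/* let background writeback pick up newly dirtied inodes too */
		oldest_jif = jiffies;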

>   New version of the second patch is attached.
> 
> 								Honza

> @@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);

In the case where writeback_single_inode() just redirty_tail()ed the inode,
it's not good to requeue_io() it here. So I'd suggest keeping the
original code, or removing the if (pages_skipped) {} block entirely.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-12  2:44                               ` Wu Fengguang
@ 2011-10-12 19:34                                 ` Jan Kara
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-12 19:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Dave Chinner,
	Christoph Hellwig, Chris Mason

On Wed 12-10-11 10:44:36, Wu Fengguang wrote:
> On Wed, Oct 12, 2011 at 05:53:59AM +0800, Jan Kara wrote:
> > On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> > > On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > > > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > > > >   Hi Fengguang,
> > > > > > 
> > > > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > > > The test results look not good: btrfs is heavily impacted and the
> > > > > > > other filesystems are slightly impacted.
> > > > > > >
> > > > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > > > > appear w/o this patch.
> > > > > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > > > > the second patch (the changes in the first patch could possibly make
> > > > > > several writeback_wait events from one event but never could introduce new
> > > > > > events).
> > > > > > 
> > > > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > > > filesystem.
> > > > > > 
> > > > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > > > puzzling how all these happen..
> > > > > >   Yes, I don't understand it yet either...
> > > > > 
> > > > > Jan, it's obviously caused by this chunk, which is not really
> > > > > necessary for fixing Christoph's problem. So the easy way is to go
> > > > > ahead without this chunk.
> > > >   Yes, thanks a lot for debugging this! I'd still like to understand why
> > > > the hunk below is causing such a big problem to btrfs. I was looking into
> > > > the traces and all I could find so far was that for some reason relevant
> > > > inode (ino 257) was not getting queued for writeback for a long time (e.g.
> > > > 20 seconds) which introduced disk idle times and thus bad throughput. But I
> > > > don't understand why the inode was not queue for such a long time yet...
> > > > Today it's too late but I'll continue with my investigation tomorrow.
> > > 
> > > Yeah, I have exactly the same observation and puzzle..
> >   OK, I dug more into this and I think I found an explanation. The problem
> > starts at
> >    flush-btrfs-1-1336  [005]    20.688011: writeback_start: bdi btrfs-1:
> > sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> > reason=periodic
> >   in the btrfs trace you sent me when we start "kupdate" style writeback
> > for bdi "btrfs-1". This work then blocks flusher thread upto moment:
> >    flush-btrfs-1-1336  [007]    45.707479: writeback_start: bdi btrfs-1:
> > sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> > reason=periodic
> >    flush-btrfs-1-1336  [007]    45.707479: writeback_written: bdi btrfs-1:
> > sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> > reason=periodic
> > 
> >   (i.e. for 25 seconds). The reason why this work blocks flusher thread for
> > so long is that btrfs has "btree inode" - essentially an inode holding
> > filesystem metadata and btrfs ignores any ->writepages() request for this
> > inode coming from kupdate style writeback. So we always try to write this
> > inode, make no progress, requeue inode (as it has still mapping tagged as
> > dirty), see that b_more_io is nonempty so we sleep for a while and then
> > retry. We do not include inode 257 with real dirty data into writeback
> > because this is kupdate style writeback and inode 257 does not have dirty
> > timestamp old enough. This loop would break either after 30s when inode
> > with data becomes old enough or - as we see above - at the moment when
> > btrfs decided to do transaction commit and cleaned metadata inode by it's
> > own methods. In either case this is far too late...
> 
> Yes indeed. Good catch! 
> 
> The implication of this case is, never put an inode to b_more_io
> unless made some progress on cleaning some pages or the metadata.
  Well, it is not so simple. The problem really is a lack of information
propagation from the filesystem to the writeback code. It may well happen
(similar to the XFS case Christoph described) that we failed to write anything
because of lock contention or a similar, rather temporary, reason. In that
case, retrying via b_more_io would be fine. On the other hand, it may be the
case that we never succeed in writing anything from the inode, as btrfs shows,
and finally, it may even be the case that even though we succeed in writing
something, the inode remains dirty, e.g. because it is a metadata inode which
gets dirtied on each filesystem change. In this light, even the current code
which does requeue_io() when nr_to_write <= 0 is in theory livelockable by
a constantly dirty inode which always has enough pages to write. But so far
no one has observed this condition in practice so I wouldn't be too worried...

> Failing to do so will lead to
> 
> - busy looping (which can be fixed by patch 1/2 "writeback: Improve busyloop prevention")
> 
> - block the current work (and in turn the other queued works) for long time,
>   where the other pending works may tend to work on a different set of
>   inodes or have different criteria for the FS to make progress. The
>   existing examples are the for_kupdate test in btrfs and the SYNC vs
>   ASYNC tests in general. And I'm planning to send writeback works
>   from the vmscan code to write a specific inode..
> 
> In this sense, it looks not the right direction to convert the
> redirty_tail() calls to requeue_io().
  Well, there has to be some balance. Surely, giving up on inodes where we
have problems doing writeback easily prevents livelocks, but on the
other hand it introduces the type of problems Christoph saw with XFS - some
inodes can get starved of writeback.

> If we change redirty_tail() to the earlier proposed requeue_io_wait(),
> all the known problems can be solved nicely.
  Right, so I see that using requeue_io_wait() will avoid livelocks and the
possibility of starvation will be lower than with redirty_tail(). But what
I don't like about your implementation is the uncertainty about when writeback
will be retried (using requeue_io_wait() may effectively mean putting off
writeback of the inode for dirty_writeback_interval = 5 seconds) and also
the more subtle list handling logic. But you inspired me as to how we could
maybe implement the requeue_io_wait() logic without these problems: when we
have writeback work which cannot make progress, we could check whether there's
other work to do (plus, if we are doing for_kupdate work, we'll check
whether for_background work would be needed) and if yes, we just finish the
current work and continue with the new work. Then requeue_io() for an inode
which cannot make progress will effectively work like requeue_io_wait(), but
we'll get well-defined retry times and no additional list logic. What do
you think? I'll send updated patches in a moment...
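  In code, the check in the wb_writeback() retry loop would have roughly this
shape (just a sketch of the idea above; the updated patch posted below is the
authoritative version):

	/* If there's some other work to do, proceed with it... */
	if (!list_empty(&wb->bdi->work_list) ||
	    (!work->for_background && over_bground_thresh()))
		break;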

> >   So for now I don't see a better alternative than to revert to old
> > behavior in writeback_single_inode() as you suggested earlier. That way we
> > would redirty_tail() inodes which we cannot clean and thus they won't cause
> > livelocking of kupdate work.
> 
> requeue_io_wait() can equally avoid touching inode->dirtied_when :)
> 
> > Longer term we might want to be more clever in
> > switching away from kupdate style writeback to pure background writeback
> > but it's not yet clear to me what the logic should be so that we give
> > enough preference to old inodes...
> 
> We'll need to adequately update older_than_this in the wb_writeback()
> loop for background work. Then we can make the switch.
> 
> > @@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
> >  			wrote++;
> >  		if (wbc.pages_skipped) {
> >  			/*
> > -			 * writeback is not making progress due to locked
> > -			 * buffers.  Skip this inode for now.
> > +			 * Writeback is not making progress due to unavailable
> > +			 * fs locks or similar condition. Retry in next round.
> >  			 */
> > -			redirty_tail(inode, wb);
> > +			requeue_io(inode, wb);
> >  		}
> >  		spin_unlock(&inode->i_lock);
> >  		spin_unlock(&wb->list_lock);
> 
> In the case writeback_single_inode() just redirty_tail()ed the inode,
> it's not good to requeue_io() it here. So I'd suggest to keep the
> original code, or remove the if(pages_skipped){} block totally.
  Good point. I'll just remove that if. Hmm, which will make pages_skipped
unused so we can remove them altogether. I can post that as a separate
patch.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-12 20:57 [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Jan Kara
@ 2011-10-12 20:57 ` Jan Kara
  2011-10-13 14:26   ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-10-12 20:57 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-fsdevel, Christoph Hellwig, Dave Chinner, Jan Kara

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through b_io list,
just wait for a moment and try again. When retrying is fruitless for a long
time, or we have some other work to do, we just stop current work to avoid
blocking flusher thread.

CC: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   39 +++++++++++++++++++++++++++------------
 1 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..b619f3a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
+	long max_pause = dirty_writeback_interval ?
+			   msecs_to_jiffies(dirty_writeback_interval * 10) :
+			   HZ;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -755,25 +758,37 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).
 		 */
-		if (!list_empty(&wb->b_more_io))  {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
-		}
+		/* If there's some other work to do, proceed with it... */
+		if (!list_empty(&wb->bdi->work_list) ||
+		    (!work->for_background && over_bground_thresh()))
+			break;
+		/*
+		 * Wait for a while to avoid busylooping unless we waited for
+		 * so long it does not make sense to retry anymore.
+		 */
+		if (pause > max_pause)
+			break;
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(pause);
+		if (pause < max_pause)
+			pause <<= 1;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-12 20:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
@ 2011-10-13 14:26   ` Wu Fengguang
  2011-10-13 20:13     ` Jan Kara
       [not found]     ` <20111013143939.GA9691@localhost>
  0 siblings, 2 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-13 14:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Thu, Oct 13, 2011 at 04:57:22AM +0800, Jan Kara wrote:
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through b_io list,
> just wait for a moment and try again. When retrying is fruitless for a long
> time, or we have some other work to do, we just stop current work to avoid
> blocking flusher thread.
> 
> CC: Christoph Hellwig <hch@infradead.org>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   39 +++++++++++++++++++++++++++------------
>  1 files changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 04cf3b9..b619f3a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
> -	struct inode *inode;
>  	long progress;
> +	long pause = 1;
> +	long max_pause = dirty_writeback_interval ?
> +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> +			   HZ;

It's better not to put the flusher to sleep for more than 10ms at a time, so
that when the condition changes, we don't risk keeping the storage idle for
too long.

So let's distinguish between the accumulated and the one-shot max pause time
in the code below?
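For example, something like this (only a sketch; the names max_sleep and
waited are made up): each individual sleep is capped at roughly 10ms so a
changed condition is noticed quickly, while the doubling backoff and the
accumulated budget still bound how long we keep retrying in total:

	long pause = 1;
	long waited = 0;			/* accumulated sleep time */
	long max_sleep = max(HZ / 100, 1);	/* one-shot cap, ~10ms */
	long max_pause = dirty_writeback_interval ?
			   msecs_to_jiffies(dirty_writeback_interval * 10) :
			   HZ;
	...
		if (waited > max_pause)
			break;
		trace_writeback_wait(wb->bdi, work);
		spin_unlock(&wb->list_lock);
		__set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(min(pause, max_sleep));
		waited += min(pause, max_sleep);
		if (pause < max_pause)
			pause <<= 1;
		spin_lock(&wb->list_lock);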

The other changes look fine to me.

Thanks,
Fengguang

>  	oldest_jif = jiffies;
>  	work->older_than_this = &oldest_jif;
> @@ -755,25 +758,37 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * mean the overall work is done. So we keep looping as long
>  		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (progress)
> +		if (progress) {
> +			pause = 1;
>  			continue;
> +		}
>  		/*
>  		 * No more inodes for IO, bail
>  		 */
>  		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
> -		 * Nothing written. Wait for some inode to
> -		 * become available for writeback. Otherwise
> -		 * we'll just busyloop.
> +		 * Nothing written (some internal fs locks were unavailable or
> +		 * inode was under writeback from balance_dirty_pages() or
> +		 * similar conditions).
>  		 */
> -		if (!list_empty(&wb->b_more_io))  {
> -			trace_writeback_wait(wb->bdi, work);
> -			inode = wb_inode(wb->b_more_io.prev);
> -			spin_lock(&inode->i_lock);
> -			inode_wait_for_writeback(inode, wb);
> -			spin_unlock(&inode->i_lock);
> -		}
> +		/* If there's some other work to do, proceed with it... */
> +		if (!list_empty(&wb->bdi->work_list) ||
> +		    (!work->for_background && over_bground_thresh()))
> +			break;
> +		/*
> +		 * Wait for a while to avoid busylooping unless we waited for
> +		 * so long it does not make sense to retry anymore.
> +		 */
> +		if (pause > max_pause)
> +			break;
> +		trace_writeback_wait(wb->bdi, work);
> +		spin_unlock(&wb->list_lock);
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +		schedule_timeout(pause);
> +		if (pause < max_pause)
> +			pause <<= 1;
> +		spin_lock(&wb->list_lock);
>  	}
>  	spin_unlock(&wb->list_lock);
>  
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-13 14:26   ` Wu Fengguang
@ 2011-10-13 20:13     ` Jan Kara
  2011-10-14  7:18       ` Christoph Hellwig
       [not found]     ` <20111013143939.GA9691@localhost>
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-10-13 20:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 4174 bytes --]

On Thu 13-10-11 22:26:38, Wu Fengguang wrote:
> On Thu, Oct 13, 2011 at 04:57:22AM +0800, Jan Kara wrote:
> > Writeback of an inode can be stalled by things like internal fs locks being
> > held. So in case we didn't write anything during a pass through b_io list,
> > just wait for a moment and try again. When retrying is fruitless for a long
> > time, or we have some other work to do, we just stop current work to avoid
> > blocking flusher thread.
> > 
> > CC: Christoph Hellwig <hch@infradead.org>
> > Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/fs-writeback.c |   39 +++++++++++++++++++++++++++------------
> >  1 files changed, 27 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 04cf3b9..b619f3a 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  	unsigned long wb_start = jiffies;
> >  	long nr_pages = work->nr_pages;
> >  	unsigned long oldest_jif;
> > -	struct inode *inode;
> >  	long progress;
> > +	long pause = 1;
> > +	long max_pause = dirty_writeback_interval ?
> > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > +			   HZ;
> 
> It's better not to put the flusher to sleeps more than 10ms, so that
> when the condition changes, we don't risk making the storage idle for
> too long time.
> 
> So let's distinguish between accumulated and one-shot max pause time
> in the below code?
  I was thinking about this as well but then I realized that when some work
is queued or when background writeback is necessary, we always wake up the
flusher thread so these conditions will be noticed promptly. And regarding
locks we potentially blocked on, we always wait only as long as we have already
waited in the previous waits together, so that doesn't look too defensive
to me. Or are you concerned about something else? I've just noticed the wait
was unnecessarily racy wrt wakeups, so attached is a new version which
should be safe in this regard.

  Specifically, I didn't want to wake up every 10 ms because if there are
inodes which are unwriteable for a long time, as in the btrfs case, we
would just wake up the flusher thread (and thus the CPU) every 10 ms on an
otherwise idle system, and that draws a considerable amount of power on a laptop.

> >  	oldest_jif = jiffies;
> >  	work->older_than_this = &oldest_jif;
> > @@ -755,25 +758,37 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  		 * mean the overall work is done. So we keep looping as long
> >  		 * as made some progress on cleaning pages or inodes.
> >  		 */
> > -		if (progress)
> > +		if (progress) {
> > +			pause = 1;
> >  			continue;
> > +		}
> >  		/*
> >  		 * No more inodes for IO, bail
> >  		 */
> >  		if (list_empty(&wb->b_more_io))
> >  			break;
> >  		/*
> > -		 * Nothing written. Wait for some inode to
> > -		 * become available for writeback. Otherwise
> > -		 * we'll just busyloop.
> > +		 * Nothing written (some internal fs locks were unavailable or
> > +		 * inode was under writeback from balance_dirty_pages() or
> > +		 * similar conditions).
> >  		 */
> > -		if (!list_empty(&wb->b_more_io))  {
> > -			trace_writeback_wait(wb->bdi, work);
> > -			inode = wb_inode(wb->b_more_io.prev);
> > -			spin_lock(&inode->i_lock);
> > -			inode_wait_for_writeback(inode, wb);
> > -			spin_unlock(&inode->i_lock);
> > -		}
> > +		/* If there's some other work to do, proceed with it... */
> > +		if (!list_empty(&wb->bdi->work_list) ||
> > +		    (!work->for_background && over_bground_thresh()))
> > +			break;
> > +		/*
> > +		 * Wait for a while to avoid busylooping unless we waited for
> > +		 * so long it does not make sense to retry anymore.
> > +		 */
> > +		if (pause > max_pause)
> > +			break;
> > +		trace_writeback_wait(wb->bdi, work);
> > +		spin_unlock(&wb->list_lock);
> > +		__set_current_state(TASK_INTERRUPTIBLE);
> > +		schedule_timeout(pause);
> > +		if (pause < max_pause)
> > +			pause <<= 1;
> > +		spin_lock(&wb->list_lock);
> >  	}
> >  	spin_unlock(&wb->list_lock);
> >  
> > -- 
> > 1.7.1
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Improve-busyloop-prevention.patch --]
[-- Type: text/x-patch, Size: 3007 bytes --]

>From 7d59989e38af2e10101f7d9f0c98343fe551c536 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Sep 2011 01:05:25 +0200
Subject: [PATCH 1/2] writeback: Improve busyloop prevention

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through b_io list,
just wait for a moment and try again. When retrying is fruitless for a long
time, or we have some other work to do, we just stop current work to avoid
blocking flusher thread.

CC: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   43 ++++++++++++++++++++++++++++++++-----------
 1 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..4ffc07f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
+	long max_pause = dirty_writeback_interval ?
+			   msecs_to_jiffies(dirty_writeback_interval * 10) :
+			   HZ;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -755,25 +758,43 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).
+		 *
+		 * Wait for a while to avoid busylooping unless we waited for
+		 * so long it does not make sense to retry anymore.
 		 */
-		if (!list_empty(&wb->b_more_io))  {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
+		if (pause > max_pause)
+			break;
+		/*
+		 * Set state here to prevent races with someone waking us up
+		 * (because new work is queued or because background limit is
+		 * exceeded).
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* If there's some other work to do, proceed with it... */
+		if (!list_empty(&wb->bdi->work_list) ||
+		    (!work->for_background && over_bground_thresh())) {
+			__set_current_state(TASK_RUNNING);
+			break;
 		}
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		schedule_timeout(pause);
+		if (pause < max_pause)
+			pause <<= 1;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
       [not found]     ` <20111013143939.GA9691@localhost>
@ 2011-10-13 20:18       ` Jan Kara
  2011-10-14 16:00         ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-10-13 20:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > +	long pause = 1;
> > > +	long max_pause = dirty_writeback_interval ?
> > > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > +			   HZ;
> > 
> > It's better not to put the flusher to sleeps more than 10ms, so that
> > when the condition changes, we don't risk making the storage idle for
> > too long time.
> 
> Yeah, the one big regression case
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+  
> ------------------------  ------------------------  
>                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> 
> is exactly caused by the large sleep: the attached graphs are showing
> one period of no-progress on the number of written pages.
  Thanks for the tests! Interesting. Do you have a trace file from that run?
I see the writeback stalled for 20s or so, which is more than
dirty_writeback_centisecs, so I think something more complicated must have
happened.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-13 20:13     ` Jan Kara
@ 2011-10-14  7:18       ` Christoph Hellwig
  2011-10-14 19:31         ` Chris Mason
  0 siblings, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2011-10-14  7:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner, linux-btrfs

What btrfs does for the btree inode is insane, and I'm pretty sure I
already complained about it.  It really needs to stop registering that
inode with the writeback code and just drive it manually.  Same as
other filesystems do for their "micro-managed" metadata.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-13 20:18       ` Jan Kara
@ 2011-10-14 16:00         ` Wu Fengguang
  2011-10-14 16:28           ` Wu Fengguang
  2011-10-15 12:41           ` Wu Fengguang
  0 siblings, 2 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-14 16:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > +	long pause = 1;
> > > > +	long max_pause = dirty_writeback_interval ?
> > > > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > +			   HZ;
> > > 
> > > It's better not to put the flusher to sleeps more than 10ms, so that
> > > when the condition changes, we don't risk making the storage idle for
> > > too long time.
> > 
> > Yeah, the one big regression case
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+  
> > ------------------------  ------------------------  
> >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > 
> > is exactly caused by the large sleep: the attached graphs are showing
> > one period of no-progress on the number of written pages.
>   Thanks for the tests! Interesting. Do you have trace file from that run?
> I see the writeback stalled for 20s or so which is more than
> dirty_writeback_centisecs so I think something more complicated must have
> happened.

I noticed that

1) the global dirty limit is exceeded (dirty=286, limit=256), hence
   the dd tasks are hard blocked in balance_dirty_pages().

       flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447

2) the flusher thread is not woken up because we test writeback_in_progress()
   in balance_dirty_pages().

                if (unlikely(!writeback_in_progress(bdi)))
                        bdi_start_background_writeback(bdi);

Thus the flusher thread waits and waits, as in the trace below.

       flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
       flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
       flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
       flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
       flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
       flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background

       flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
       flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
       flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
       flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
       flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
       flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background

       flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
       flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
       flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
       flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
       flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background

This should be fixable by removing the BDI_writeback_running flag
before doing the wait sleep.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-14 16:00         ` Wu Fengguang
@ 2011-10-14 16:28           ` Wu Fengguang
  2011-10-18  0:51             ` Jan Kara
  2011-10-15 12:41           ` Wu Fengguang
  1 sibling, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-14 16:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > +	long pause = 1;
> > > > > +	long max_pause = dirty_writeback_interval ?
> > > > > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > +			   HZ;
> > > > 
> > > > It's better not to put the flusher to sleeps more than 10ms, so that
> > > > when the condition changes, we don't risk making the storage idle for
> > > > too long time.
> > > 
> > > Yeah, the one big regression case
> > > 
> > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+  
> > > ------------------------  ------------------------  
> > >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > 
> > > is exactly caused by the large sleep: the attached graphs are showing
> > > one period of no-progress on the number of written pages.
> >   Thanks for the tests! Interesting. Do you have trace file from that run?
> > I see the writeback stalled for 20s or so which is more than
> > dirty_writeback_centisecs so I think something more complicated must have
> > happened.
> 
> I noticed that
> 
> 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
>    the dd tasks are hard blocked in balance_dirty_pages().
> 
>        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> 
> 2) the flusher thread is not woken up because we test writeback_in_progress()
>    in balance_dirty_pages().
> 
>                 if (unlikely(!writeback_in_progress(bdi)))
>                         bdi_start_background_writeback(bdi);
> 
> Thus the flusher thread wait and wait as in below trace.
> 
>        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
>        flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
>        flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>        flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
>        flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>        flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> 
>        flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
>        flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
>        flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>        flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
>        flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>        flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> 
>        flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
>        flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
>        flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>        flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
>        flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background

It's still puzzling why the dirty pages remain at 286 and do not get
cleaned by either of the flusher threads (for the local XFS and for NFSROOT)
for such a long time...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-14  7:18       ` Christoph Hellwig
@ 2011-10-14 19:31         ` Chris Mason
  0 siblings, 0 replies; 60+ messages in thread
From: Chris Mason @ 2011-10-14 19:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Wu Fengguang, linux-fsdevel@vger.kernel.org,
	Dave Chinner, linux-btrfs

Excerpts from Christoph Hellwig's message of 2011-10-14 03:18:02 -0400:
> What btrfs does for the btree inode is insane, and I'm pretty sure I
> already complained about it.  It really needs to stop registering that
> inode with the writeback code and just driver it manually.  Same as
> other filesystems do for their "micro-managed" metadata.
> 

So I think you probably don't like the inode, and the part where we
actively decide not to write it back when there isn't much dirty.

Yes, it would be different if btrfs had its own LRU for the btrees, and
if it maintained them such that the LRU understood it was better to kick
out leaves than roots.

I've really wanted to play with this for a while.

-chris

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-14 16:00         ` Wu Fengguang
  2011-10-14 16:28           ` Wu Fengguang
@ 2011-10-15 12:41           ` Wu Fengguang
  1 sibling, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-15 12:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

> 2) the flusher thread is not woken up because we test writeback_in_progress()
>    in balance_dirty_pages().
> 
>                 if (unlikely(!writeback_in_progress(bdi)))
>                         bdi_start_background_writeback(bdi);
> 
[...]
> This should be fixable by removing the BDI_writeback_running flag
> before doing the wait sleep.

I tried adding the patch below, and this time it shows a 1.8%
performance improvement.  Not bad :-)

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue7+
------------------------  ------------------------
                   32.09        +1.1%        32.45  thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X
                   51.36        +3.2%        53.00  thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X
                   46.93        +1.9%        47.80  thresh=1M/ext4-2dd-4k-8p-4096M-1M:10-X
                   28.68        +1.6%        29.15  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
                   51.95        +4.2%        54.13  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
                   47.07        +1.5%        47.80  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
                   54.37        +2.2%        55.58  thresh=8M/btrfs-10dd-4k-8p-4096M-8M:10-X
                   56.12        +4.2%        58.49  thresh=8M/btrfs-1dd-4k-8p-4096M-8M:10-X
                   56.22        +2.4%        57.55  thresh=8M/btrfs-2dd-4k-8p-4096M-8M:10-X
                   32.21        +1.7%        32.77  thresh=8M/ext3-10dd-4k-8p-4096M-8M:10-X
                   45.37        +2.7%        46.58  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   43.71        +2.6%        44.83  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.58        +1.3%        36.06  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   56.39        +0.4%        56.61  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   51.26        +1.4%        51.98  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                   31.07        +0.3%        31.16  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.44        -2.0%        54.33  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   47.59        +0.6%        47.87  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                  823.40        +1.8%       838.14  TOTAL

--- linux-next.orig/fs/fs-writeback.c	2011-10-15 08:30:48.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-15 08:31:36.000000000 +0800
@@ -809,8 +809,10 @@ static long wb_writeback(struct bdi_writ
 			break;
 		trace_writeback_wait(wb->bdi, work);
 		spin_unlock(&wb->list_lock);
+		clear_bit(BDI_writeback_running, &wb->bdi->state);
 		__set_current_state(TASK_INTERRUPTIBLE);
 		schedule_timeout(pause);
+		set_bit(BDI_writeback_running, &wb->bdi->state);
 		if (pause < max_pause)
 			pause <<= 1;
 		spin_lock(&wb->list_lock);

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-14 16:28           ` Wu Fengguang
@ 2011-10-18  0:51             ` Jan Kara
  2011-10-18 14:35               ` Wu Fengguang
  2011-10-20  9:46               ` Christoph Hellwig
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-18  0:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 6118 bytes --]

On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > +	long pause = 1;
> > > > > > +	long max_pause = dirty_writeback_interval ?
> > > > > > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > +			   HZ;
> > > > > 
> > > > > It's better not to put the flusher to sleeps more than 10ms, so that
> > > > > when the condition changes, we don't risk making the storage idle for
> > > > > too long time.
> > > > 
> > > > Yeah, the one big regression case
> > > > 
> > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+  
> > > > ------------------------  ------------------------  
> > > >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > > 
> > > > is exactly caused by the large sleep: the attached graphs are showing
> > > > one period of no-progress on the number of written pages.
> > >   Thanks for the tests! Interesting. Do you have trace file from that run?
> > > I see the writeback stalled for 20s or so which is more than
> > > dirty_writeback_centisecs so I think something more complicated must have
> > > happened.
> > 
> > I noticed that
> > 
> > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> >    the dd tasks are hard blocked in balance_dirty_pages().
> > 
> >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > 
> > 2) the flusher thread is not woken up because we test writeback_in_progress()
> >    in balance_dirty_pages().
> > 
> >                 if (unlikely(!writeback_in_progress(bdi)))
> >                         bdi_start_background_writeback(bdi);
> > 
> > Thus the flusher thread wait and wait as in below trace.
> > 
> >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> >        flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> >        flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >        flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> >        flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >        flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > 
> >        flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> >        flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> >        flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >        flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> >        flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >        flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > 
> >        flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> >        flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> >        flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> >        flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> >        flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> 
> It's still puzzling why dirty pages remain at 286 and does not get
> cleaned by either flusher threads for local XFS and NFSROOT for so
> long time..
  I was looking at this as well. So the reason why pages were not cleaned
by the flusher thread is that there were 2 dirty inodes and the inode with
dirty pages had i_dirtied_whan newer than the time when we started this
background writeback. Thus the running background writeback work always
included only the other inode which has no dirty pages but I_DIRTY_SYNC set.
Apparently XFS is stubborn and refuses to write the inode although we try
rather hard. That is probably because dd writing to this inode is stuck in
balance_dirty_pages() and holds ilock - which is a bit unfortunate behavior
but what can we do...
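
  For reference, the non-blocking behaviour in question looks roughly like
this on the XFS side (a from-memory sketch for illustration only, not the
literal fs/xfs code of this kernel):

static int
xfs_fs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
	struct xfs_inode *ip = XFS_I(inode);

	if (wbc->sync_mode != WB_SYNC_ALL) {
		/*
		 * Never block the flusher thread: if the ilock is held
		 * (e.g. by a writer sitting in balance_dirty_pages()),
		 * give up, redirty the inode and report EAGAIN.
		 */
		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) {
			xfs_mark_inode_dirty_sync(ip);
			return -EAGAIN;
		}
		/* ... flush the inode log item with trylock semantics ... */
		xfs_iunlock(ip, XFS_ILOCK_SHARED);
	}
	return 0;
}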

  I think the patch you suggest in the other email does not fix the above
scenario (although it is useful for reducing latency so I'll include it -
thanks for it!).  Probably you were just lucky enough not to hit it in your
next run. What I'd suggest is to refresh oldest_jif in wb_writeback() when
we do not make any progress with writeback. Thus we allow freshly dirtied
inodes to be queued when we cannot make progress with the current set of
inodes. The resulting patch is attached.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Improve-busyloop-prevention.patch --]
[-- Type: text/x-patch, Size: 4098 bytes --]

>From b85b7cdaf5fafec2d850304c1d0b1813c1c122a3 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Sep 2011 01:05:25 +0200
Subject: [PATCH 1/2] writeback: Improve busyloop prevention

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through b_io list, just
wait for a moment and try again. Also allow newly dirtied inodes to be queued
during the retry. When retrying is fruitless for a long time, or we have some
other work to do, we just stop current work to avoid blocking flusher thread.

CC: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   61 +++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..6e909a9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
+	long max_pause = dirty_writeback_interval ?
+			   msecs_to_jiffies(dirty_writeback_interval * 10) :
+			   HZ;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -730,11 +733,19 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		/*
+		 * Refresh oldest timestamp. In kupdate style we include
+		 * freshly expired inodes in each round. For other types of
+		 * writeback we don't want to include newly dirtied inodes
+		 * to avoid livelocking. But in case we made no progress in
+		 * the last writeback round, allow new inodes to be queued
+		 * so that we don't block flusher thread unnecessarily.
+		 */
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			work->older_than_this = &oldest_jif;
-		}
+		} else if (pause > 1)
+			oldest_jif = jiffies;
 
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
@@ -755,25 +766,49 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).
+		 *
+		 * Wait for a while to avoid busylooping unless we waited for
+		 * so long it does not make sense to retry anymore.
 		 */
-		if (!list_empty(&wb->b_more_io))  {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
+		if (pause > max_pause)
+			break;
+		/*
+		 * Set state here to prevent races with someone waking us up
+		 * (because new work is queued or because background limit is
+		 * exceeded).
+		 */
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* If there's some other work to do, proceed with it... */
+		if (!list_empty(&wb->bdi->work_list) ||
+		    (!work->for_background && over_bground_thresh())) {
+			__set_current_state(TASK_RUNNING);
+			break;
 		}
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		/*
+		 * Clear writeback_running so that we properly indicate that
+		 * writeback is currently stalled
+		 */
+		clear_bit(BDI_writeback_running, &wb->bdi->state);
+		schedule_timeout(pause);
+		set_bit(BDI_writeback_running, &wb->bdi->state);
+		if (pause < max_pause)
+			pause <<= 1;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-18  0:51             ` Jan Kara
@ 2011-10-18 14:35               ` Wu Fengguang
  2011-10-19 11:56                 ` Jan Kara
  2011-10-20  9:46               ` Christoph Hellwig
  1 sibling, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-18 14:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Tue, Oct 18, 2011 at 08:51:28AM +0800, Jan Kara wrote:
> On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> > On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > > +	long pause = 1;
> > > > > > > +	long max_pause = dirty_writeback_interval ?
> > > > > > > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > > +			   HZ;
> > > > > > 
> > > > > > It's better not to put the flusher to sleeps more than 10ms, so that
> > > > > > when the condition changes, we don't risk making the storage idle for
> > > > > > too long time.
> > > > > 
> > > > > Yeah, the one big regression case
> > > > > 
> > > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+  
> > > > > ------------------------  ------------------------  
> > > > >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > > > 
> > > > > is exactly caused by the large sleep: the attached graphs are showing
> > > > > one period of no-progress on the number of written pages.
> > > >   Thanks for the tests! Interesting. Do you have trace file from that run?
> > > > I see the writeback stalled for 20s or so which is more than
> > > > dirty_writeback_centisecs so I think something more complicated must have
> > > > happened.
> > > 
> > > I noticed that
> > > 
> > > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> > >    the dd tasks are hard blocked in balance_dirty_pages().
> > > 
> > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > 
> > > 2) the flusher thread is not woken up because we test writeback_in_progress()
> > >    in balance_dirty_pages().
> > > 
> > >                 if (unlikely(!writeback_in_progress(bdi)))
> > >                         bdi_start_background_writeback(bdi);
> > > 
> > > Thus the flusher thread wait and wait as in below trace.
> > > 
> > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > >        flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > >        flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> > >        flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > 
> > >        flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > >        flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > >        flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> > >        flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > 
> > >        flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > >        flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > >        flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > >        flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> > >        flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > 
> > It's still puzzling why dirty pages remain at 286 and does not get
> > cleaned by either flusher threads for local XFS and NFSROOT for so
> > long time..
>   I was looking at this as well. So the reason why pages were not cleaned
> by the flusher thread is that there were 2 dirty inodes and the inode with
> dirty pages had i_dirtied_when newer than the time when we started this
> background writeback. Thus the running background writeback work always
> included only the other inode which has no dirty pages but I_DIRTY_SYNC set.

Yes, that's very likely, given that the background work can run for a very
long time and there are still code paths that redirty_tail() the inode.

This sounds horrible -- as time goes by, more and more inodes could be
excluded from the background writeback due to inode->dirtied_when
being touched undesirably.

> Apparently XFS is stubborn and refuses to write the inode although we try
> rather hard. That is probably because dd writing to this inode is stuck in
> balance_dirty_pages() and holds ilock - which is a bit unfortunate behavior
> but what can we do...

Yeah, and there may well be other known or unknown long-lived blocking
cases. What we can do in the VFS is to not be so picky about these conditions...

To be frank I still like the requeue_io_wait() approach. It can
trivially replace _all_ redirty_tail() calls and hence avoid touching
inode->dirtied_when -- keeping that time stamp sane is the one thing I
would really like to do.

It also narrows the set of possibly blocked inodes, and hence the need
for heuristic wait-and-retry, down to a minimum.
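
For readers following the thread, the helper being discussed is tiny --
roughly the following (a sketch using the naming from this discussion; the
actual patch lives elsewhere in the requeue_io_wait series):

static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	/*
	 * Park the blocked inode on a dedicated list instead of putting it
	 * back on b_dirty, so inode->dirtied_when is left untouched.
	 */
	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
}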

It does introduce one more list, which IMHO is more tolerable than the
problems it fixes. And its maximum delay can be reduced explicitly
if necessary: when there are heavy dirtiers, the flusher will quickly be
woken up by balance_dirty_pages(); in other cases we can let the kupdate
work check b_more_io_wait before aborting, to ensure a 5s maximum delay:

@@ -863,7 +863,7 @@ static long wb_check_old_data_flush(stru
 
        expired = wb->last_old_flush +
                        msecs_to_jiffies(dirty_writeback_interval * 10);
-       if (time_before(jiffies, expired))
+       if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
                return 0;
 
        wb->last_old_flush = jiffies;

>   I think the patch you suggest in the other email does not fix the above
> scenario (although it is useful for reducing latency so I'll include it -
> thanks for it!).  Probably you were just lucky enough not to hit it in your
> next run. What I'd suggest is to refresh oldest_jif in wb_writeback() when
> we do not make any progress with writeback. Thus we allow freshly dirtied
> inodes to be queued when we cannot make progress with the current set of
> inodes. The resulting patch is attached.

Refreshing oldest_jif should be safe for the background work. However,
there is a risk of livelock for other works, as the oldest_jif test is
there precisely to prevent livelocks..

For one thing, it will break this code:

                /*
                 * Sync livelock prevention. Each inode is tagged and synced in
                 * one shot. If still dirty, it will be redirty_tail()'ed below.
                 * Update the dirty time to prevent enqueue and sync it again.
                 */
                if ((inode->i_state & I_DIRTY) &&
                    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
                        inode->dirtied_when = jiffies;


> From b85b7cdaf5fafec2d850304c1d0b1813c1c122a3 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Thu, 8 Sep 2011 01:05:25 +0200
> Subject: [PATCH 1/2] writeback: Improve busyloop prevention
> 
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through b_io list, just
> wait for a moment and try again. Also allow newly dirtied inodes to be queued
> during the retry. When retrying is fruitless for a long time, or we have some
> other work to do, we just stop current work to avoid blocking flusher thread.
> 
> CC: Christoph Hellwig <hch@infradead.org>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   61 +++++++++++++++++++++++++++++++++++++++++-----------
>  1 files changed, 48 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 04cf3b9..6e909a9 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
> -	struct inode *inode;
>  	long progress;
> +	long pause = 1;
> +	long max_pause = dirty_writeback_interval ?
> +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> +			   HZ;

There seem to be no strong reasons to prefer dirty_writeback_interval over
HZ. So how about using the simpler form "max_pause = HZ"?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-18 14:35               ` Wu Fengguang
@ 2011-10-19 11:56                 ` Jan Kara
  2011-10-19 13:25                   ` Wu Fengguang
                                     ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-19 11:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 12579 bytes --]

On Tue 18-10-11 22:35:04, Wu Fengguang wrote:
> On Tue, Oct 18, 2011 at 08:51:28AM +0800, Jan Kara wrote:
> > On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> > > On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > > > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > > > +	long pause = 1;
> > > > > > > > +	long max_pause = dirty_writeback_interval ?
> > > > > > > > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > > > +			   HZ;
> > > > > > > 
> > > > > > > It's better not to put the flusher to sleeps more than 10ms, so that
> > > > > > > when the condition changes, we don't risk making the storage idle for
> > > > > > > too long time.
> > > > > > 
> > > > > > Yeah, the one big regression case
> > > > > > 
> > > > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+  
> > > > > > ------------------------  ------------------------  
> > > > > >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > > > > 
> > > > > > is exactly caused by the large sleep: the attached graphs are showing
> > > > > > one period of no-progress on the number of written pages.
> > > > >   Thanks for the tests! Interesting. Do you have trace file from that run?
> > > > > I see the writeback stalled for 20s or so which is more than
> > > > > dirty_writeback_centisecs so I think something more complicated must have
> > > > > happened.
> > > > 
> > > > I noticed that
> > > > 
> > > > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> > > >    the dd tasks are hard blocked in balance_dirty_pages().
> > > > 
> > > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > 
> > > > 2) the flusher thread is not woken up because we test writeback_in_progress()
> > > >    in balance_dirty_pages().
> > > > 
> > > >                 if (unlikely(!writeback_in_progress(bdi)))
> > > >                         bdi_start_background_writeback(bdi);
> > > > 
> > > > Thus the flusher thread wait and wait as in below trace.
> > > > 
> > > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > >        flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > >        flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> > > >        flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > 
> > > >        flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > >        flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > >        flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> > > >        flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > 
> > > >        flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > >        flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > >        flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >        flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> > > >        flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > 
> > > It's still puzzling why dirty pages remain at 286 and does not get
> > > cleaned by either flusher threads for local XFS and NFSROOT for so
> > > long time..
> >   I was looking at this as well. So the reason why pages were not cleaned
> > by the flusher thread is that there were 2 dirty inodes and the inode with
> > dirty pages had i_dirtied_when newer than the time when we started this
> > background writeback. Thus the running background writeback work always
> > included only the other inode which has no dirty pages but I_DIRTY_SYNC set.
> 
> Yes that's very likely, given that the background work can run for very
> long time and there are still code paths to redirty_tail() the inode.
> 
> This sounds horrible -- as time goes by, more and more inodes could be
> excluded from the background writeback due to inode->dirtied_when
> being touched undesirably.
  Yes, but it's not only about touching inode->i_dirtied_when. It can also
be the case that the inode started to be dirty only after we started the
writeback. For example in your run, I'm pretty confident from the traces
that that's what happened. This behavior has been possible for a long time
(I think I introduced it in 7624ee72aa09334af072853457a5d46d9901c3f8) and is
a result of how we do livelock avoidance using older_than_this (or wb_start
previously). It's just that when you don't have a long-running work that
cannot make progress, it is mostly hidden (it might only be observable as a
small inefficiency of background writeback).
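
  To make that concrete, the expired-inode queueing of this era looks roughly
as follows (an abridged sketch, not the exact fs/fs-writeback.c text), which
is why an inode dirtied after oldest_jif never makes it into b_io for the
running work:

static int move_expired_inodes(struct list_head *delaying_queue,
			       struct list_head *dispatch_queue,
			       unsigned long *older_than_this)
{
	struct inode *inode;
	int moved = 0;

	while (!list_empty(delaying_queue)) {
		inode = wb_inode(delaying_queue->prev);
		/* Anything dirtied after *older_than_this is not queued at all. */
		if (older_than_this &&
		    inode_dirtied_after(inode, *older_than_this))
			break;
		list_move(&inode->i_wb_list, dispatch_queue);
		moved++;
	}
	return moved;
}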

> > Apparently XFS is stubborn and refuses to write the inode although we try
> > rather hard. That is probably because dd writing to this inode is stuck in
> > balance_dirty_pages() and holds ilock - which is a bit unfortunate behavior
> > but what can we do...
> 
> Yeah, and there may well be other known or unknown long lived blocking
> cases. What we can do in VFS is to be not so picky on these conditions...
  True.

> To be frank I still like the requeue_io_wait() approach. It can
> trivially replace _all_ redirty_tail() calls and hence avoid touching
> inode->dirtied_when -- keeping that time stamp sane is the one thing I
> would really like to do.
  As I pointed out above, it's not only about redirty_tail(), so I believe we
should solve this problem regardless of whether we use requeue_io_wait() or
some other busyloop prevention - do you agree with the attached patch? If yes,
please add it to your patch queue for the next merge window.

  BTW: I think you cannot avoid _all_ redirty_tail() calls - in a situation
where the inode was really redirtied you need to really redirty_tail() it. But
I agree you can remove all redirty_tail() calls which are there because we
failed to make any progress with the inode.

> It also narrows down the perhaps blocked inodes and hence the need to
> do heuristic wait-and-retry to the minimal.
> 
> It does introduce one more list, which IMHO is more tolerable than the
> problems it fixed. And its max delay time can be reduced explicitly
> if necessary: when there are heavy dirtiers, it will quickly be woken
> up by balance_dirty_pages(); in other cases we can let the kupdate
> work check s_more_io_wait before aborting to ensure 5s max delay:
> 
> @@ -863,7 +863,7 @@ static long wb_check_old_data_flush(stru
>  
>         expired = wb->last_old_flush +
>                         msecs_to_jiffies(dirty_writeback_interval * 10);
> -       if (time_before(jiffies, expired))
> +       if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
>                 return 0;
>  
>         wb->last_old_flush = jiffies;
  So after seeing complications with my code I was also reconsidering your
approach. Having an additional list should be OK if we document the logic of
all the lists in one place in detail. So the remaining problem I have is
the uncertainty about when writeback will be retried. Your change above
guarantees one immediate retry in case we were not doing for_kupdate
writeback when the filesystem refused to write the inode, with the next retry
only after dirty_writeback_interval, which seems too late. But if we trigger
the kupdate work earlier to retry blocked inodes, your method would be a
viable alternative - something like the (completely untested) second attached
patch? With that I find both approaches mostly equivalent, so if yours passes
testing and you like it more, then I'm fine with that.
 
> >   I think the patch you suggest in the other email does not fix the above
> > scenario (although it is useful for reducing latency so I'll include it -
> > thanks for it!).  Probably you were just lucky enough not to hit it in your
> > next run. What I'd suggest is to refresh oldest_jif in wb_writeback() when
> > we do not make any progress with writeback. Thus we allow freshly dirtied
> > inodes to be queued when we cannot make progress with the current set of
> > inodes. The resulting patch is attached.
> 
> Refreshing oldest_jif should be safe for the background work. However
> there is risk of livelock for other works, as the oldest_jif test is
> there right for preventing livelocks..
  Yes, I know, I just figured that refreshing oldest_jif only when we could
not make progress (and thus we know there is no other work to do because we
would break out of the loop in that case) is safe. But it is subtle and
actually I've realized that refreshing oldest_jif for background writeback
is enough so let's just do that.

> For one thing, it will break this code:
> 
>                 /*
>                  * Sync livelock prevention. Each inode is tagged and synced in
>                  * one shot. If still dirty, it will be redirty_tail()'ed below.
>                  * Update the dirty time to prevent enqueue and sync it again.
>                  */
>                 if ((inode->i_state & I_DIRTY) &&
>                     (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
>                         inode->dirtied_when = jiffies;
  Hum, right, but for WB_SYNC_ALL writeback the filesystem had better not
refuse to write back any inode. Other things would break horribly. And it
certainly is not an issue when we refresh the timestamp only for background
writeback.

> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 04cf3b9..6e909a9 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  	unsigned long wb_start = jiffies;
> >  	long nr_pages = work->nr_pages;
> >  	unsigned long oldest_jif;
> > -	struct inode *inode;
> >  	long progress;
> > +	long pause = 1;
> > +	long max_pause = dirty_writeback_interval ?
> > +			   msecs_to_jiffies(dirty_writeback_interval * 10) :
> > +			   HZ;
> 
> There seems no strong reasons to prefer dirty_writeback_interval over
> HZ. So how about use the simpler form "max_pause = HZ"?
  Well, dirty_writeback_interval actually makes some sense here. It is the
interval in which the user wants the flusher thread to recheck the dirtiness
situation. So it defines the latency the user expects from the flusher
thread. Thus after this time we definitely want to break out of any work
which cannot make progress. If dirty_writeback_interval is shorter than 1s,
we would break this by using HZ in the above code. Also I wanted to avoid
introducing another magic value into the writeback code when there is a
tunable which makes sense and can be used... I can add a comment explaining
this if you want.
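
  Something along these lines, for example (just a suggested wording, not
part of the posted patch):

	/*
	 * dirty_writeback_centisecs is how often the user wants the flusher
	 * thread to recheck the dirty state, so never keep retrying a stuck
	 * work item for longer than that. Fall back to 1 second when
	 * periodic writeback is disabled (dirty_writeback_interval == 0).
	 */
	long max_pause = dirty_writeback_interval ?
			   msecs_to_jiffies(dirty_writeback_interval * 10) :
			   HZ;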

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Include-all-dirty-inodes-in-background-wri.patch --]
[-- Type: text/x-patch, Size: 1688 bytes --]

>From 7b559f1cea41cdba7b39138ad1637f8000e218b9 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Oct 2011 11:44:41 +0200
Subject: [PATCH] writeback: Include all dirty inodes in background writeback

Current livelock avoidance code makes background work include only inodes
that were dirtied before background writeback started. However background
writeback can be running for a long time and thus excluding newly dirtied
inodes can eventually exclude a significant portion of dirty inodes, making
background writeback inefficient. Since background writeback avoids livelocking
the flusher thread by yielding to any other work, there is no real reason why
background work should not include all dirty inodes, so change the logic in
wb_writeback().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..8314241 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -730,11 +730,17 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (work->for_background && !over_bground_thresh())
 			break;
 
+		/*
+		 * Kupdate and background works are special and we want to
+		 * include all inodes that need writing. Livelock avoidance is
+		 * handled by these works yielding to any other work so we are
+		 * safe.
+		 */
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			work->older_than_this = &oldest_jif;
-		}
+		} else if (work->for_background)
+			oldest_jif = jiffies;
 
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-- 
1.7.1


[-- Attachment #3: 0001-writeback-Retry-kupdate-work-early-if-we-need-to-ret.patch --]
[-- Type: text/x-patch, Size: 2973 bytes --]

>From 595677f8efcaa0d9f675bf74a7048739323afd06 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 19 Oct 2011 13:44:46 +0200
Subject: [PATCH] writeback: Retry kupdate work early if we need to retry some inode writeback

In case we could not do any writeback for some inodes, trigger next kupdate
work early so that writeback on these inodes is not delayed for the whole
dirty_writeback_interval.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   25 ++++++++++++++++++-------
 1 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8314241..e48da04 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -701,6 +701,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long oldest_jif;
 	struct inode *inode;
 	long progress;
+	long total_progress = 0;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -750,6 +751,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		else
 			progress = __writeback_inodes_wb(wb, work);
 		trace_writeback_written(wb->bdi, work);
+		total_progress += progress;
 
 		wb_update_bandwidth(wb, wb_start);
 
@@ -783,7 +785,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	}
 	spin_unlock(&wb->list_lock);
 
-	return nr_pages - work->nr_pages;
+	return total_progress;
 }
 
 /*
@@ -845,7 +847,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 
 	expired = wb->last_old_flush +
 			msecs_to_jiffies(dirty_writeback_interval * 10);
-	if (time_before(jiffies, expired))
+	if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
 		return 0;
 
 	wb->last_old_flush = jiffies;
@@ -915,7 +917,11 @@ int bdi_writeback_thread(void *data)
 {
 	struct bdi_writeback *wb = data;
 	struct backing_dev_info *bdi = wb->bdi;
-	long pages_written;
+	long progress;
+	unsigned int pause = 1;
+	unsigned int max_pause = dirty_writeback_interval ?
+			msecs_to_jiffies(dirty_writeback_interval * 10) :
+			HZ;
 
 	current->flags |= PF_SWAPWRITE;
 	set_freezable();
@@ -935,12 +941,14 @@ int bdi_writeback_thread(void *data)
 		 */
 		del_timer(&wb->wakeup_timer);
 
-		pages_written = wb_do_writeback(wb, 0);
+		progress = wb_do_writeback(wb, 0);
 
 		trace_writeback_pages_written(pages_written);
 
-		if (pages_written)
+		if (progress) {
 			wb->last_active = jiffies;
+			pause = 1;
+		}
 
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (!list_empty(&bdi->work_list) || kthread_should_stop()) {
@@ -948,8 +956,11 @@ int bdi_writeback_thread(void *data)
 			continue;
 		}
 
-		if (wb_has_dirty_io(wb) && dirty_writeback_interval)
-			schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
+		if (!list_empty(&wb->b_more_io_wait) && pause < max_pause) {
+			schedule_timeout(pause);
+			pause <<= 1;
+		} else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
+			schedule_timeout(max_pause);
 		else {
 			/*
 			 * We have nothing to do, so can go sleep without any
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-19 11:56                 ` Jan Kara
@ 2011-10-19 13:25                   ` Wu Fengguang
  2011-10-19 13:30                   ` Wu Fengguang
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-19 13:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Wed, Oct 19, 2011 at 07:56:30PM +0800, Jan Kara wrote:
> On Tue 18-10-11 22:35:04, Wu Fengguang wrote:
> > On Tue, Oct 18, 2011 at 08:51:28AM +0800, Jan Kara wrote:
> > > On Sat 15-10-11 00:28:07, Wu Fengguang wrote:
> > > > On Sat, Oct 15, 2011 at 12:00:47AM +0800, Wu Fengguang wrote:
> > > > > On Fri, Oct 14, 2011 at 04:18:35AM +0800, Jan Kara wrote:
> > > > > > On Thu 13-10-11 22:39:39, Wu Fengguang wrote:
> > > > > > > > > +       long pause = 1;
> > > > > > > > > +       long max_pause = dirty_writeback_interval ?
> > > > > > > > > +                          msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > > > > > > > +                          HZ;
> > > > > > > >
> > > > > > > > It's better not to put the flusher to sleeps more than 10ms, so that
> > > > > > > > when the condition changes, we don't risk making the storage idle for
> > > > > > > > too long time.
> > > > > > >
> > > > > > > Yeah, the one big regression case
> > > > > > >
> > > > > > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue6+
> > > > > > > ------------------------  ------------------------
> > > > > > >                    47.07       -15.5%        39.78  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
> > > > > > >
> > > > > > > is exactly caused by the large sleep: the attached graphs are showing
> > > > > > > one period of no-progress on the number of written pages.
> > > > > >   Thanks for the tests! Interesting. Do you have trace file from that run?
> > > > > > I see the writeback stalled for 20s or so which is more than
> > > > > > dirty_writeback_centisecs so I think something more complicated must have
> > > > > > happened.
> > > > >
> > > > > I noticed that
> > > > >
> > > > > 1) the global dirty limit is exceeded (dirty=286, limit=256), hence
> > > > >    the dd tasks are hard blocked in balance_dirty_pages().
> > > > >
> > > > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > >
> > > > > 2) the flusher thread is not woken up because we test writeback_in_progress()
> > > > >    in balance_dirty_pages().
> > > > >
> > > > >                 if (unlikely(!writeback_in_progress(bdi)))
> > > > >                         bdi_start_background_writeback(bdi);
> > > > >
> > > > > Thus the flusher thread wait and wait as in below trace.
> > > > >
> > > > >        flush-8:0-1170  [004]   211.068427: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > >        flush-8:0-1170  [004]   211.068428: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > > >        flush-8:0-1170  [004]   211.068428: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   211.068440: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=9 index=0 to_write=1024 wrote=0
> > > > >        flush-8:0-1170  [004]   211.068442: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   211.068443: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >
> > > > >        flush-8:0-1170  [004]   213.110122: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > >        flush-8:0-1170  [004]   213.110126: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > > >        flush-8:0-1170  [004]   213.110126: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   213.110134: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=11 index=0 to_write=1024 wrote=0
> > > > >        flush-8:0-1170  [004]   213.110135: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   213.110135: writeback_wait: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >
> > > > >        flush-8:0-1170  [004]   217.193470: global_dirty_state: dirty=286 writeback=0 unstable=0 bg_thresh=128 thresh=256 limit=256 dirtied=2084879 written=2081447
> > > > >        flush-8:0-1170  [004]   217.193471: task_io: read=9216 write=12873728 cancelled_write=0 nr_dirtied=0 nr_dirtied_pause=32
> > > > >        flush-8:0-1170  [004]   217.193471: writeback_start: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > > >        flush-8:0-1170  [004]   217.193483: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC dirtied_when=4294869658 age=15 index=0 to_write=1024 wrote=0
> > > > >        flush-8:0-1170  [004]   217.193485: writeback_written: bdi 8:0: sb_dev 0:0 nr_pages=9223372036854774848 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
> > > >
> > > > It's still puzzling why dirty pages remain at 286 and does not get
> > > > cleaned by either flusher threads for local XFS and NFSROOT for so
> > > > long time..
> > >   I was looking at this as well. So the reason why pages were not cleaned
> > > by the flusher thread is that there were 2 dirty inodes and the inode with
> > > dirty pages had i_dirtied_when newer than the time when we started this
> > > background writeback. Thus the running background writeback work always
> > > included only the other inode which has no dirty pages but I_DIRTY_SYNC set.
> >
> > Yes that's very likely, given that the background work can run for very
> > long time and there are still code paths to redirty_tail() the inode.
> >
> > This sounds horrible -- as time goes by, more and more inodes could be
> > excluded from the background writeback due to inode->dirtied_when
> > being touched undesirably.
>   Yes, but it's not only about touching inode->i_dirtied_when. It can also
> be the case that the inode started to be dirty only after we started the
> writeback. For example in your run, I'm pretty confident from the traces
> that that's what happened.

Yeah, that's possible because the dirty threshold is merely 1MB, so
one inode could be cleaned completely (while the dd is throttled in the
meantime) and then made dirty again.

> This behavior can happen for a long time (I
> think I introduced it by 7624ee72aa09334af072853457a5d46d9901c3f8) and is a
> result of how we do livelock avoidance using older_than_this (or wb_start
> previously).

I'd say that commit itself is a step in the right direction.
Unfortunately the writeback logic is a bit more twisted than we envisioned..

> Just when you don't have long-running work that cannot
> progress, it will be mostly hidden (only might be observable as a small
> inefficiency of background writeback).

Yes.

> > > Apparently XFS is stubborn and refuses to write the inode although we try
> > > rather hard. That is probably because dd writing to this inode is stuck in
> > > balance_dirty_pages() and holds ilock - which is a bit unfortunate behavior
> > > but what can we do...
> >
> > Yeah, and there may well be other known or unknown long lived blocking
> > cases. What we can do in VFS is to be not so picky on these conditions...
>   True.
> 
> > To be frank I still like the requeue_io_wait() approach. It can
> > trivially replace _all_ redirty_tail() calls and hence avoid touching
> > inode->dirtied_when -- keeping that time stamp sane is the one thing I
> > would really like to do.
>   As I pointed out above, it's not only about redirty_tail() so I belive we
> should solve this problem regardless whether we use requeue_io_wait() or

Good point. Yeah, we need to make the background work accept newly
dirtied inodes. This could make the flusher less focused on the older
pages, but there seems to be no obviously right solution...

> other busyloop prevention - do you agree with the attached patch? If yes,
> add it to your patch queue for the next merge window please.

OK, I'll think about it.

>   BTW: I think you cannot avoid _all_ redirty_tail() calls - in situation
> where inode was really redirtied you need to really redirty_tail() it. But
> I agree you can remove all redirty_tail() calls which are there because we
> failed to make any progress with the inode.

That's right. We only convert redirty_tail(possibly blocked inode) to
requeue_io_wait(). Well, in general. In the case of a failed
grab_super_passive(), it's not even the current inode that is blocked,
but _something_ gets blocked.  We still have to requeue_io_wait()
it to avoid busy retrying.

In the below case,

               if (inode->i_sb != sb) {
                        if (work->sb) {
                                /*
                                 * We only want to write back data for this
                                 * superblock, move all inodes not belonging
                                 * to it back onto the dirty list.
                                 */
                                redirty_tail(inode, wb);
                                continue;
                        }

It's better to _not_ enqueue that inode in the first place, inside
move_expired_inodes(). Then we can safely convert that redirty_tail()
to requeue_io() to skip the irrelevant inodes already in b_io/b_more_io.

> > It also narrows down the perhaps blocked inodes and hence the need to
> > do heuristic wait-and-retry to the minimal.
> >
> > It does introduce one more list, which IMHO is more tolerable than the
> > problems it fixed. And its max delay time can be reduced explicitly
> > if necessary: when there are heavy dirtiers, it will quickly be woken
> > up by balance_dirty_pages(); in other cases we can let the kupdate
> > work check s_more_io_wait before aborting to ensure 5s max delay:
> >
> > @@ -863,7 +863,7 @@ static long wb_check_old_data_flush(stru
> >
> >         expired = wb->last_old_flush +
> >                         msecs_to_jiffies(dirty_writeback_interval * 10);
> > -       if (time_before(jiffies, expired))
> > +       if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
> >                 return 0;
> >
> >         wb->last_old_flush = jiffies;
>   So after seeing complications with my code I was also reconsidering your
> approach. Having additional list should be OK if we document the logic of
> all the lists in one place in detail.

Actually, without the "requeue on I_SYNC" case, we could do without
b_more_io_wait and still be able to prevent the unnecessary changes
to dirtied_when. Because if we do requeue_io() for all inodes and
still find !progress after a full iteration of b_io, we can be mostly
sure that the inodes remaining inside b_more_io are blocked on some
condition, and can modify the loop break condition to take that into
account.

> So the remaining problem I have is
> the uncertainty when writeback will be retried. Your change above
> guarantees one immediate retry in case we were not doing for_kupdate
> writeback when filesystem refused to write the inode and the next retry in
> dirty_writeback_interval which seems too late. But if we trigger kupdate
> work earlier to retry blocked inodes, your method would be a viable
> alternative - something like (completely untested) second attached patch?

Yeah! I have a similar patch for shortening the retry interval from
dirty_writeback_interval=5s to dirty_writeback_interval/10=500ms:

@@ -967,9 +967,14 @@ int bdi_writeback_thread(void *data)
                        continue;
                }

-               if (wb_has_dirty_io(wb) && dirty_writeback_interval)
-                       schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
-               else {
+               if (wb_has_dirty_io(wb) && dirty_writeback_interval) {
+                       unsigned long t;
+                       if (!list_empty(&wb->b_more_io_wait))
+                               t = msecs_to_jiffies(dirty_writeback_interval);
+                       else
+                               t = msecs_to_jiffies(dirty_writeback_interval * 10);
+                       schedule_timeout(t);
+               } else {
                        /*
                         * We have nothing to do, so can go sleep without any
                         * timeout and save power. When a work is queued or

However your adaptive sleep seems better, and it also covers the
dirty_writeback_interval=0 case.

> With that I find both approaches mostly equivalent so if your passes
> testing and you like it more, then I'm fine with that.

OK, thanks!

> > >   I think the patch you suggest in the other email does not fix the above
> > > scenario (although it is useful for reducing latency so I'll include it -
> > > thanks for it!).  Probably you were just lucky enough not to hit it in your
> > > next run. What I'd suggest is to refresh oldest_jif in wb_writeback() when
> > > we do not make any progress with writeback. Thus we allow freshly dirtied
> > > inodes to be queued when we cannot make progress with the current set of
> > > inodes. The resulting patch is attached.
> >
> > Refreshing oldest_jif should be safe for the background work. However
> > there is risk of livelock for other works, as the oldest_jif test is
> > there right for preventing livelocks..
>   Yes, I know, I just figured that refreshing oldest_jif only when we could
> not make progress (and thus we know there is no other work to do because we
> would break out of the loop in that case) is safe. But it is subtle and
> actually I've realized that refreshing oldest_jif for background writeback
> is enough so let's just do that.

OK.

> > For one thing, it will break this code:
> >
> >                 /*
> >                  * Sync livelock prevention. Each inode is tagged and synced in
> >                  * one shot. If still dirty, it will be redirty_tail()'ed below.
> >                  * Update the dirty time to prevent enqueue and sync it again.
> >                  */
> >                 if ((inode->i_state & I_DIRTY) &&
> >                     (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> >                         inode->dirtied_when = jiffies;
>   Hum, right, but for WB_SYNC_ALL writeback filesystem should better not
> refuse to writeback any inode. Other things would break horribly. And it
> certainly is not an issue when we refresh the timestamp only for background
> writeback.

Ah yes!

> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 04cf3b9..6e909a9 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -699,8 +699,11 @@ static long wb_writeback(struct bdi_writeback *wb,
> > >     unsigned long wb_start = jiffies;
> > >     long nr_pages = work->nr_pages;
> > >     unsigned long oldest_jif;
> > > -   struct inode *inode;
> > >     long progress;
> > > +   long pause = 1;
> > > +   long max_pause = dirty_writeback_interval ?
> > > +                      msecs_to_jiffies(dirty_writeback_interval * 10) :
> > > +                      HZ;
> >
> > There seems no strong reasons to prefer dirty_writeback_interval over
> > HZ. So how about use the simpler form "max_pause = HZ"?
>   Well, dirty_writeback_interval actually makes some sense here. It is the
> interval in which user wants flusher thread to recheck dirtiness situation.
> So it defines latency user wants from flusher thread. Thus after this time
> we definitely want to break out of any work which cannot progress. If
> dirty_writeback_interval is shorter than 1s, we would break this by using
> HZ in the above code. Also I wanted to avoid introducing another magic
> value to writeback code when there is tunable which makes sense and can
> be used... I can add a comment explaining this if you want.

Good point.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-19 11:56                 ` Jan Kara
  2011-10-19 13:25                   ` Wu Fengguang
@ 2011-10-19 13:30                   ` Wu Fengguang
  2011-10-19 13:35                   ` Wu Fengguang
  2011-10-20 12:09                   ` Wu Fengguang
  3 siblings, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-19 13:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

> From 7b559f1cea41cdba7b39138ad1637f8000e218b9 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 19 Oct 2011 11:44:41 +0200
> Subject: [PATCH] writeback: Include all dirty inodes in background writeback
> 
> Current livelock avoidance code makes background work include only inodes
> that were dirtied before background writeback started. However background
> writeback can be running for a long time and thus excluding newly dirtied
> inodes can eventually exclude a significant portion of dirty inodes, making
> background writeback inefficient. Since background writeback avoids livelocking
> the flusher thread by yielding to any other work, there is no real reason why
> background work should not include all dirty inodes, so change the logic in
> wb_writeback().

Looks good to me. Thanks!

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   10 ++++++++--
>  1 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 04cf3b9..8314241 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -730,11 +730,17 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		if (work->for_background && !over_bground_thresh())
>  			break;
>  
> +		/*
> +		 * Kupdate and background works are special and we want to
> +		 * include all inodes that need writing. Livelock avoidance is
> +		 * handled by these works yielding to any other work so we are
> +		 * safe.
> +		 */
>  		if (work->for_kupdate) {
>  			oldest_jif = jiffies -
>  				msecs_to_jiffies(dirty_expire_interval * 10);
> -			work->older_than_this = &oldest_jif;
> -		}
> +		} else if (work->for_background)
> +			oldest_jif = jiffies;
>  
>  		trace_writeback_start(wb->bdi, work);
>  		if (list_empty(&wb->b_io))
> -- 
> 1.7.1
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-19 11:56                 ` Jan Kara
  2011-10-19 13:25                   ` Wu Fengguang
  2011-10-19 13:30                   ` Wu Fengguang
@ 2011-10-19 13:35                   ` Wu Fengguang
  2011-10-20 12:09                   ` Wu Fengguang
  3 siblings, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-10-19 13:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

> From 595677f8efcaa0d9f675bf74a7048739323afd06 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 19 Oct 2011 13:44:46 +0200
> Subject: [PATCH] writeback: Retry kupdate work early if we need to retry some inode writeback
> 
> In case we could not do any writeback for some inodes, trigger next kupdate
> work early so that writeback on these inodes is not delayed for the whole
> dirty_writeback_interval.

Looks good, too.

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   25 ++++++++++++++++++-------
>  1 files changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 8314241..e48da04 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -701,6 +701,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	unsigned long oldest_jif;
>  	struct inode *inode;
>  	long progress;
> +	long total_progress = 0;
>  
>  	oldest_jif = jiffies;
>  	work->older_than_this = &oldest_jif;
> @@ -750,6 +751,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		else
>  			progress = __writeback_inodes_wb(wb, work);
>  		trace_writeback_written(wb->bdi, work);
> +		total_progress += progress;
>  
>  		wb_update_bandwidth(wb, wb_start);
>  
> @@ -783,7 +785,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	}
>  	spin_unlock(&wb->list_lock);
>  
> -	return nr_pages - work->nr_pages;
> +	return total_progress;
>  }
>  
>  /*
> @@ -845,7 +847,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
>  
>  	expired = wb->last_old_flush +
>  			msecs_to_jiffies(dirty_writeback_interval * 10);
> -	if (time_before(jiffies, expired))
> +	if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
>  		return 0;
>  
>  	wb->last_old_flush = jiffies;
> @@ -915,7 +917,11 @@ int bdi_writeback_thread(void *data)
>  {
>  	struct bdi_writeback *wb = data;
>  	struct backing_dev_info *bdi = wb->bdi;
> -	long pages_written;
> +	long progress;

I'd like to separate out the pages_written=>progress changes,
which would make for two clearer patches.

Thanks,
Fengguang

> +	unsigned int pause = 1;
> +	unsigned int max_pause = dirty_writeback_interval ?
> +			msecs_to_jiffies(dirty_writeback_interval * 10) :
> +			HZ;
>  
>  	current->flags |= PF_SWAPWRITE;
>  	set_freezable();
> @@ -935,12 +941,14 @@ int bdi_writeback_thread(void *data)
>  		 */
>  		del_timer(&wb->wakeup_timer);
>  
> -		pages_written = wb_do_writeback(wb, 0);
> +		progress = wb_do_writeback(wb, 0);
>  
>  		trace_writeback_pages_written(pages_written);
>  
> -		if (pages_written)
> +		if (progress) {
>  			wb->last_active = jiffies;
> +			pause = 1;
> +		}
>  
>  		set_current_state(TASK_INTERRUPTIBLE);
>  		if (!list_empty(&bdi->work_list) || kthread_should_stop()) {
> @@ -948,8 +956,11 @@ int bdi_writeback_thread(void *data)
>  			continue;
>  		}
>  
> -		if (wb_has_dirty_io(wb) && dirty_writeback_interval)
> -			schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
> +		if (!list_empty(&wb->b_more_io_wait) && pause < max_pause) {
> +			schedule_timeout(pause);
> +			pause <<= 1;
> +		} else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
> +			schedule_timeout(max_pause);
>  		else {
>  			/*
>  			 * We have nothing to do, so can go sleep without any
> -- 
> 1.7.1
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-18  0:51             ` Jan Kara
  2011-10-18 14:35               ` Wu Fengguang
@ 2011-10-20  9:46               ` Christoph Hellwig
  2011-10-20 15:32                 ` Jan Kara
  1 sibling, 1 reply; 60+ messages in thread
From: Christoph Hellwig @ 2011-10-20  9:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Tue, Oct 18, 2011 at 02:51:28AM +0200, Jan Kara wrote:
> > It's still puzzling why dirty pages remain at 286 and does not get
> > cleaned by either flusher threads for local XFS and NFSROOT for so
> > long time..
>   I was looking at this as well. So the reason why pages were not cleaned
> by the flusher thread is that there were 2 dirty inodes and the inode with
> dirty pages had i_dirtied_when newer than the time when we started this
> background writeback. Thus the running background writeback work always
> included only the other inode which has no dirty pages but I_DIRTY_SYNC set.
> Apparently XFS is stubborn and refuses to write the inode although we try
> rather hard. That is probably because dd writing to this inode is stuck in
> balance_dirty_pages() and holds ilock - which is a bit unfortunate behavior
> but what can we do...

Stop writing data from balance_dirty_pages()?

Anyway, XFS tries very hard to not block in a non-block ->write_inode
or ->writepages, which generally is a good thing to avoid getting stuck
in the flusher thread.  For cases like this where an inode is long
beyond its due time it might make sense to simply do a synchronous
write_inode from the flusher thread to force the inode out.
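
A minimal sketch of how such a forced writeout could look (purely
illustrative, not from any patch in this thread; the helper name and the
30 second cutoff are assumptions):

#include <linux/fs.h>
#include <linux/jiffies.h>
#include <linux/writeback.h>

/*
 * Hypothetical helper: treat an inode as "long beyond its due time" when
 * it was dirtied more than 30 seconds ago (cutoff assumed for this sketch).
 */
static bool inode_long_overdue(struct inode *inode)
{
	return time_after(jiffies,
			  inode->dirtied_when + msecs_to_jiffies(30 * 1000));
}

/*
 * Inside writeback_single_inode(), before ->write_inode() is invoked, the
 * flusher could then escalate to a blocking write for such an inode:
 *
 *	if (inode_long_overdue(inode))
 *		wbc->sync_mode = WB_SYNC_ALL;
 */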


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-19 11:56                 ` Jan Kara
                                     ` (2 preceding siblings ...)
  2011-10-19 13:35                   ` Wu Fengguang
@ 2011-10-20 12:09                   ` Wu Fengguang
  2011-10-20 12:33                     ` Wu Fengguang
  3 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-20 12:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

Jan,

I tried the below combined patch over the ioless one, and find some
minor regressions. I studied the thresh=1G/ext3-1dd case in particular
and find that nr_writeback and the iostat avgrq-sz drops from time to time.

I'll try to bisect the changeset.

3.1.0-rc9-ioless-full-next-20111014+  3.1.0-rc9-ioless-full-more_io_wait-next-20111014+
------------------------  ------------------------
                   56.47        -0.6%        56.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   56.28        -0.4%        56.07  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   56.11        -0.1%        56.05  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   37.86        +1.8%        38.54  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   45.91        +0.7%        46.22  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   41.87        +0.8%        42.19  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   45.68        -0.4%        45.50  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   55.74        -2.2%        54.51  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   46.20        -4.8%        43.98  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   55.72        +0.1%        55.76  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   54.01        -2.0%        52.94  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   55.08        -1.0%        54.52  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   55.49        -1.0%        54.94  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   55.38        -2.7%        53.91  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   36.70        -1.5%        36.15  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   40.64        -5.9%        38.25  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   48.65        -6.9%        45.30  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   49.84        -3.2%        48.23  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.03        -3.3%        54.21  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   57.42        -2.3%        56.07  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   45.74        -1.4%        45.12  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   54.19        -0.5%        53.94  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   55.93        -0.5%        55.66  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                    2.77       +27.8%         3.54  thresh=1M/btrfs-10dd-4k-8p-4096M-1M:10-X
                    2.20       -15.5%         1.86  thresh=1M/btrfs-1dd-4k-8p-4096M-1M:10-X
                    2.42        -1.3%         2.39  thresh=1M/btrfs-2dd-4k-8p-4096M-1M:10-X
                   28.91        +1.9%        29.47  thresh=1M/ext3-10dd-4k-8p-4096M-1M:10-X
                   45.02        +1.1%        45.50  thresh=1M/ext3-1dd-4k-8p-4096M-1M:10-X
                   40.91        +0.4%        41.09  thresh=1M/ext3-2dd-4k-8p-4096M-1M:10-X
                   31.82        +2.3%        32.56  thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X
                   52.33        -0.9%        51.85  thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X
                   28.43        +1.2%        28.77  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
                   52.93        -3.8%        50.90  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
                   46.87        -0.0%        46.85  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
                   54.54        -1.3%        53.82  thresh=8M/btrfs-10dd-4k-8p-4096M-8M:10-X
                   56.60        -1.4%        55.80  thresh=8M/btrfs-1dd-4k-8p-4096M-8M:10-X
                   56.21        -0.4%        55.96  thresh=8M/btrfs-2dd-4k-8p-4096M-8M:10-X
                   32.54        +0.2%        32.62  thresh=8M/ext3-10dd-4k-8p-4096M-8M:10-X
                   46.01        -1.0%        45.55  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   44.13        -0.6%        43.87  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.78        -0.4%        35.63  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   55.29        +0.2%        55.38  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   31.21        -0.8%        30.96  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   54.10        -0.3%        53.95  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   46.97        +0.5%        47.20  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 2010.92        -1.1%      1989.70  TOTAL write_bw

--- linux-next.orig/fs/fs-writeback.c	2011-10-20 19:26:37.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-20 20:00:22.000000000 +0800
@@ -234,6 +234,15 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
+/*
+ * The inode should be retried in an opportunistic way.
+ */
+static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
+{
+	assert_spin_locked(&wb->list_lock);
+	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -321,6 +330,7 @@ static void queue_io(struct bdi_writebac
 	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
+	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
 	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
 	trace_writeback_queue_io(wb, work, moved);
 }
@@ -470,7 +480,7 @@ writeback_single_inode(struct inode *ino
 				 * retrying writeback of the dirty page/inode
 				 * that cannot be performed immediately.
 				 */
-				redirty_tail(inode, wb);
+				requeue_io_wait(inode, wb);
 			}
 		} else if (inode->i_state & I_DIRTY) {
 			/*
@@ -478,8 +488,18 @@ writeback_single_inode(struct inode *ino
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
 			 * completion.
+			 *
+			 * For the latter case it is very important to give
+			 * the inode another turn on b_more_io instead of
+			 * redirtying it.  Constantly moving dirtied_when
+			 * forward will prevent us from ever writing out
+			 * the metadata dirtied in the I/O completion handler.
+			 *
+			 * For files on XFS that constantly get appended to
+			 * calling redirty_tail means they will never get
+			 * their updated i_size written out.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io_wait(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -600,7 +620,7 @@ static long writeback_sb_inodes(struct s
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io_wait(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -637,7 +657,7 @@ static long __writeback_inodes_wb(struct
 			 * s_umount being grabbed by someone else. Don't use
 			 * requeue_io() to avoid busy retrying the inode/sb.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io_wait(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
@@ -720,10 +740,10 @@ static long wb_writeback(struct bdi_writ
 			 struct wb_writeback_work *work)
 {
 	unsigned long wb_start = jiffies;
-	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
 	struct inode *inode;
 	long progress;
+	long total_progress = 0;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -753,11 +773,17 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh(wb->bdi))
 			break;
 
+		/*
+		 * Kupdate and background works are special and we want to
+		 * include all inodes that need writing. Livelock avoidance is
+		 * handled by these works yielding to any other work so we are
+		 * safe.
+		 */
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			work->older_than_this = &oldest_jif;
-		}
+		} else if (work->for_background)
+			oldest_jif = jiffies;
 
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
@@ -767,6 +793,7 @@ static long wb_writeback(struct bdi_writ
 		else
 			progress = __writeback_inodes_wb(wb, work);
 		trace_writeback_written(wb->bdi, work);
+		total_progress += progress;
 
 		wb_update_bandwidth(wb, wb_start);
 
@@ -800,7 +827,7 @@ static long wb_writeback(struct bdi_writ
 	}
 	spin_unlock(&wb->list_lock);
 
-	return nr_pages - work->nr_pages;
+	return total_progress;
 }
 
 /*
@@ -863,7 +890,7 @@ static long wb_check_old_data_flush(stru
 
 	expired = wb->last_old_flush +
 			msecs_to_jiffies(dirty_writeback_interval * 10);
-	if (time_before(jiffies, expired))
+	if (time_before(jiffies, expired) && list_empty(&wb->b_more_io_wait))
 		return 0;
 
 	wb->last_old_flush = jiffies;
@@ -934,7 +961,11 @@ int bdi_writeback_thread(void *data)
 {
 	struct bdi_writeback *wb = data;
 	struct backing_dev_info *bdi = wb->bdi;
-	long pages_written;
+	long progress;
+	unsigned int pause = 1;
+	unsigned int max_pause = dirty_writeback_interval ?
+			msecs_to_jiffies(dirty_writeback_interval * 10) :
+			HZ;
 
 	current->flags |= PF_SWAPWRITE;
 	set_freezable();
@@ -954,12 +985,12 @@ int bdi_writeback_thread(void *data)
 		 */
 		del_timer(&wb->wakeup_timer);
 
-		pages_written = wb_do_writeback(wb, 0);
+		progress = wb_do_writeback(wb, 0);
 
-		trace_writeback_pages_written(pages_written);
-
-		if (pages_written)
+		if (progress) {
 			wb->last_active = jiffies;
+			pause = 1;
+		}
 
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (!list_empty(&bdi->work_list) || kthread_should_stop()) {
@@ -967,8 +998,11 @@ int bdi_writeback_thread(void *data)
 			continue;
 		}
 
-		if (wb_has_dirty_io(wb) && dirty_writeback_interval)
-			schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
+		if (!list_empty(&wb->b_more_io_wait) && pause < max_pause) {
+			schedule_timeout(pause);
+			pause <<= 1;
+		} else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
+			schedule_timeout(max_pause);
 		else {
 			/*
 			 * We have nothing to do, so can go sleep without any
--- linux-next.orig/include/linux/backing-dev.h	2011-10-20 19:26:37.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-10-20 19:29:39.000000000 +0800
@@ -59,6 +59,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_more_io_wait;/* opportunistic retry io */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -133,9 +134,10 @@ extern struct list_head bdi_pending_list
 
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&wb->b_dirty) ||
-	       !list_empty(&wb->b_io) ||
-	       !list_empty(&wb->b_more_io);
+	return !list_empty(&wb->b_dirty)	||
+	       !list_empty(&wb->b_io)		||
+	       !list_empty(&wb->b_more_io)	||
+	       !list_empty(&wb->b_more_io_wait);
 }
 
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
--- linux-next.orig/mm/backing-dev.c	2011-10-20 19:26:37.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-10-20 19:29:39.000000000 +0800
@@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
 	struct inode *inode;
 
-	nr_dirty = nr_io = nr_more_io = 0;
+	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
+		nr_more_io_wait++;
 	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
 		   "b_more_io:          %10lu\n"
+		   "b_more_io_wait:     %10lu\n"
 		   "bdi_list:           %10u\n"
 		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
@@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s
 		   nr_dirty,
 		   nr_io,
 		   nr_more_io,
+		   nr_more_io_wait,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
@@ -651,6 +655,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_more_io_wait);
 	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
@@ -718,6 +723,7 @@ void bdi_destroy(struct backing_dev_info
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
 		spin_unlock(&bdi->wb.list_lock);
 		spin_unlock(&dst->list_lock);
 	}

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-20 12:09                   ` Wu Fengguang
@ 2011-10-20 12:33                     ` Wu Fengguang
  2011-10-20 13:39                       ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-20 12:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> Jan,
> 
> I tried the below combined patch over the ioless one, and find some
> minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> 
> I'll try to bisect the changeset.

The current finding is that performance is restored if only the following is applied:

--- linux-next.orig/fs/fs-writeback.c	2011-10-20 19:26:37.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-20 20:31:18.000000000 +0800
@@ -234,6 +234,15 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
+/*
+ * The inode should be retried in an opportunistic way.
+ */
+static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
+{
+	assert_spin_locked(&wb->list_lock);
+	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -321,6 +330,7 @@ static void queue_io(struct bdi_writebac
 	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
+	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
 	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
 	trace_writeback_queue_io(wb, work, moved);
 }
@@ -478,8 +488,18 @@ writeback_single_inode(struct inode *ino
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
 			 * completion.
+			 *
+			 * For the latter case it is very important to give
+			 * the inode another turn on b_more_io instead of
+			 * redirtying it.  Constantly moving dirtied_when
+			 * forward will prevent us from ever writing out
+			 * the metadata dirtied in the I/O completion handler.
+			 *
+			 * For files on XFS that constantly get appended to
+			 * calling redirty_tail means they will never get
+			 * their updated i_size written out.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io_wait(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
--- linux-next.orig/include/linux/backing-dev.h	2011-10-20 19:26:37.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-10-20 19:29:39.000000000 +0800
@@ -59,6 +59,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_more_io_wait;/* opportunistic retry io */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -133,9 +134,10 @@ extern struct list_head bdi_pending_list
 
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&wb->b_dirty) ||
-	       !list_empty(&wb->b_io) ||
-	       !list_empty(&wb->b_more_io);
+	return !list_empty(&wb->b_dirty)	||
+	       !list_empty(&wb->b_io)		||
+	       !list_empty(&wb->b_more_io)	||
+	       !list_empty(&wb->b_more_io_wait);
 }
 
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
--- linux-next.orig/mm/backing-dev.c	2011-10-20 19:26:37.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-10-20 19:29:39.000000000 +0800
@@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
 	struct inode *inode;
 
-	nr_dirty = nr_io = nr_more_io = 0;
+	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
+		nr_more_io_wait++;
 	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
 		   "b_more_io:          %10lu\n"
+		   "b_more_io_wait:     %10lu\n"
 		   "bdi_list:           %10u\n"
 		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
@@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s
 		   nr_dirty,
 		   nr_io,
 		   nr_more_io,
+		   nr_more_io_wait,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
@@ -651,6 +655,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_more_io_wait);
 	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
@@ -718,6 +723,7 @@ void bdi_destroy(struct backing_dev_info
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
 		spin_unlock(&bdi->wb.list_lock);
 		spin_unlock(&dst->list_lock);
 	}

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-20 12:33                     ` Wu Fengguang
@ 2011-10-20 13:39                       ` Wu Fengguang
  2011-10-20 22:26                         ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-20 13:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > Jan,
> > 
> > I tried the below combined patch over the ioless one, and find some
> > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > 
> > I'll try to bisect the changeset.

This is interesting, the culprit is found to be patch 1, which is
simply
                if (work->for_kupdate) {
                        oldest_jif = jiffies -
                                msecs_to_jiffies(dirty_expire_interval * 10);
-                       work->older_than_this = &oldest_jif;
-               }
+               } else if (work->for_background)
+                       oldest_jif = jiffies;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-20  9:46               ` Christoph Hellwig
@ 2011-10-20 15:32                 ` Jan Kara
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-20 15:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Wu Fengguang, linux-fsdevel@vger.kernel.org,
	Dave Chinner

On Thu 20-10-11 05:46:49, Christoph Hellwig wrote:
> On Tue, Oct 18, 2011 at 02:51:28AM +0200, Jan Kara wrote:
> > > It's still puzzling why dirty pages remain at 286 and does not get
> > > cleaned by either flusher threads for local XFS and NFSROOT for so
> > > long time..
> >   I was looking at this as well. So the reason why pages were not cleaned
> > by the flusher thread is that there were 2 dirty inodes and the inode with
> > dirty pages had i_dirtied_when newer than the time when we started this
> > background writeback. Thus the running background writeback work always
> > included only the other inode which has no dirty pages but I_DIRTY_SYNC set.
> > Apparently XFS is stubborn and refuses to write the inode although we try
> > rather hard. That is probably because dd writing to this inode is stuck in
> > balance_dirty_pages() and holds ilock - which is a bit unfortunate behavior
> > but what can we do...
> 
> Stop writing data from balance_dirty_pages()?
  Well, this was also with Fengguang's IO-less patches so writing from
balance_dirty_pages() was not an issue.

> Anyway, XFS tries very hard to not block in a non-block ->write_inode
> or ->writepages, which generally is a good thing to avoid getting stuck
> in the flusher thread.  For cases like this where an inode is long
> beyond its due time it might make sense to simply do a synchronous
> write_inode from the flusher thread to force the inode out.
  Yeah, that might be a good option.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-20 13:39                       ` Wu Fengguang
@ 2011-10-20 22:26                         ` Jan Kara
  2011-10-22  4:20                           ` Wu Fengguang
                                             ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-20 22:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > Jan,
> > > 
> > > I tried the below combined patch over the ioless one, and find some
> > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > 
> > > I'll try to bisect the changeset.
> 
> This is interesting, the culprit is found to be patch 1, which is
> simply
>                 if (work->for_kupdate) {
>                         oldest_jif = jiffies -
>                                 msecs_to_jiffies(dirty_expire_interval * 10);
> -                       work->older_than_this = &oldest_jif;
> -               }
> +               } else if (work->for_background)
> +                       oldest_jif = jiffies;
  Yeah. I had a look into the trace and you can notice that during the
whole dd run, we were running a single background writeback work (you can
verify that by work->nr_pages decreasing steadily). Without refreshing
oldest_jif, we'd write block device inode for /dev/sda (you can identify
that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
every 5 seconds (kjournald dirties the device inode after committing a
transaction by dirtying metadata buffers which were just committed and can
now be checkpointed either by kjournald or flusher thread). So although the
performance is slightly reduced, I'd say that the behavior is a desired
one.

Also if you observed the performance on a really long run, the difference
should get smaller because eventually, kjournald has to flush the metadata
blocks when the journal fills up and we need to free some journal space and
at that point flushing is even more expensive because we have to do a
blocking write during which all transaction operations, thus effectively
the whole filesystem, are blocked.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-20 22:26                         ` Jan Kara
@ 2011-10-22  4:20                           ` Wu Fengguang
  2011-10-24 15:45                             ` Jan Kara
       [not found]                           ` <20111027063133.GA10146@localhost>
       [not found]                           ` <20111027064745.GA14017@localhost>
  2 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-10-22  4:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > Jan,
> > > > 
> > > > I tried the below combined patch over the ioless one, and find some
> > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > 
> > > > I'll try to bisect the changeset.
> > 
> > This is interesting, the culprit is found to be patch 1, which is
> > simply
> >                 if (work->for_kupdate) {
> >                         oldest_jif = jiffies -
> >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > -                       work->older_than_this = &oldest_jif;
> > -               }
> > +               } else if (work->for_background)
> > +                       oldest_jif = jiffies;
>   Yeah. I had a look into the trace and you can notice that during the
> whole dd run, we were running a single background writeback work (you can
> verify that by work->nr_pages decreasing steadily).

Yes, it is.

> Without refreshing
> oldest_jif, we'd write block device inode for /dev/sda (you can identify
> that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> every 5 seconds (kjournald dirties the device inode after committing a
> transaction by dirtying metadata buffers which were just committed and can
> now be checkpointed either by kjournald or flusher thread).

OK, now I understand the regular drops of nr_writeback and avgrq-sz:
every 5s, it takes _some time_ to write inode 0, during which the
flusher is blocked and the IO queue runs low.

> So although the performance is slightly reduced, I'd say that the
> behavior is a desired one.

OK. However it's sad to see the flusher get blocked from time to time...

> Also if you observed the performance on a really long run, the difference
> should get smaller because eventually, kjournald has to flush the metadata
> blocks when the journal fills up and we need to free some journal space and
> at that point flushing is even more expensive because we have to do a
> blocking write during which all transaction operations, thus effectively
> the whole filesystem, are blocked.

OK. The dd test time was 300s, I'll increase it to 900s (cannot do
more because it's a 90GB disk partition).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-10-22  4:20                           ` Wu Fengguang
@ 2011-10-24 15:45                             ` Jan Kara
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-24 15:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Sat 22-10-11 12:20:19, Wu Fengguang wrote:
> On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > Jan,
> > > > > 
> > > > > I tried the below combined patch over the ioless one, and find some
> > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > 
> > > > > I'll try to bisect the changeset.
> > > 
> > > This is interesting, the culprit is found to be patch 1, which is
> > > simply
> > >                 if (work->for_kupdate) {
> > >                         oldest_jif = jiffies -
> > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > -                       work->older_than_this = &oldest_jif;
> > > -               }
> > > +               } else if (work->for_background)
> > > +                       oldest_jif = jiffies;
> >   Yeah. I had a look into the trace and you can notice that during the
> > whole dd run, we were running a single background writeback work (you can
> > verify that by work->nr_pages decreasing steadily).
> 
> Yes, it is.
> 
> > Without refreshing
> > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > every 5 seconds (kjournald dirties the device inode after committing a
> > transaction by dirtying metadata buffers which were just committed and can
> > now be checkpointed either by kjournald or flusher thread).
> 
> OK, now I understand the regular drops of nr_writeback and avgrq-sz:
> on every 5s, it takes _some time_ to write inode 0, during which the
> flusher is blocked and the IO queue runs low.
> 
> > So although the performance is slightly reduced, I'd say that the
> > behavior is a desired one.
> 
> OK. However it's sad to see the flusher get blocked from time to time...
  Well, it doesn't get blocked. It just has to write out an inode which cannot
be written out as efficiently. But that's nothing we can really solve...

> > Also if you observed the performance on a really long run, the difference
> > should get smaller because eventually, kjournald has to flush the metadata
> > blocks when the journal fills up and we need to free some journal space and
> > at that point flushing is even more expensive because we have to do a
> > blocking write during which all transaction operations, thus effectively
> > the whole filesystem, are blocked.
> 
> OK. The dd test time was 300s, I'll increase it to 900s (cannot do
> more because it's a 90GB disk partition).
  Yes, that might be an interesting try...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
       [not found]                           ` <20111027063133.GA10146@localhost>
@ 2011-10-27 20:31                             ` Jan Kara
       [not found]                               ` <20111101134231.GA31718@localhost>
       [not found]                               ` <20111102185603.GA4034@localhost>
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-27 20:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > Jan,
> > > > > 
> > > > > I tried the below combined patch over the ioless one, and find some
> > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > 
> > > > > I'll try to bisect the changeset.
> > > 
> > > This is interesting, the culprit is found to be patch 1, which is
> > > simply
> > >                 if (work->for_kupdate) {
> > >                         oldest_jif = jiffies -
> > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > -                       work->older_than_this = &oldest_jif;
> > > -               }
> > > +               } else if (work->for_background)
> > > +                       oldest_jif = jiffies;
> >   Yeah. I had a look into the trace and you can notice that during the
> > whole dd run, we were running a single background writeback work (you can
> > verify that by work->nr_pages decreasing steadily). Without refreshing
> > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > every 5 seconds (kjournald dirties the device inode after committing a
> > transaction by dirtying metadata buffers which were just committed and can
> > now be checkpointed either by kjournald or flusher thread). So although the
> > performance is slightly reduced, I'd say that the behavior is a desired
> > one.
> > 
> > Also if you observed the performance on a really long run, the difference
> > should get smaller because eventually, kjournald has to flush the metadata
> > blocks when the journal fills up and we need to free some journal space and
> > at that point flushing is even more expensive because we have to do a
> > blocking write during which all transaction operations, thus effectively
> > the whole filesystem, are blocked.
> 
> Jan, I got figures for test case
> 
> ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> 
> There is no single drop of nr_writeback in the longer 1200s run, which
> wrote ~60GB data.
  I did some calculations. Default journal size for a filesystem of your
size is 128 MB which allows recording of around 128 GB of data. So your
test probably didn't hit the point where the journal is recycled yet. An
easy way to make sure journal gets recycled is to set its size to a lower
value when creating the filesystem by
  mke2fs -J size=8

  Then at latest after writing 8 GB the effect of journal recycling should
be visible (I suggest writing at least 16 or so so that we can see some
pattern). Also note that without the patch altering background writeback,
kjournald will do all the writeback of the metadata, and kjournald works with
buffer heads. Thus the IO it does is *not* accounted in mm statistics. You will
observe its effects only by a sudden increase in await or svctm because the
disk got busy by IO you don't see. Also secondarily you could probably
observe that as a hiccup in the number of dirtied/written pages.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
       [not found]                           ` <20111027064745.GA14017@localhost>
@ 2011-10-27 20:50                             ` Jan Kara
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Kara @ 2011-10-27 20:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Thu 27-10-11 14:47:45, Wu Fengguang wrote:
> On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > Jan,
> > > > > 
> > > > > I tried the below combined patch over the ioless one, and find some
> > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > 
> > > > > I'll try to bisect the changeset.
> > > 
> > > This is interesting, the culprit is found to be patch 1, which is
> > > simply
> > >                 if (work->for_kupdate) {
> > >                         oldest_jif = jiffies -
> > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > -                       work->older_than_this = &oldest_jif;
> > > -               }
> > > +               } else if (work->for_background)
> > > +                       oldest_jif = jiffies;
> >   Yeah. I had a look into the trace and you can notice that during the
> > whole dd run, we were running a single background writeback work (you can
> > verify that by work->nr_pages decreasing steadily). Without refreshing
> > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > every 5 seconds (kjournald dirties the device inode after committing a
> > transaction by dirtying metadata buffers which were just committed and can
> > now be checkpointed either by kjournald or flusher thread). So although the
> > performance is slightly reduced, I'd say that the behavior is a desired
> > one.
> 
> 
> This is part of
> ext3-1dd-4k-8p-2941M-1024M:10-3.1.0-rc9-ioless-full-more_io_wait-next-20111014+/trace.bz2
> 
>        flush-8:0-3133  [000]   357.418668: writeback_single_inode: bdi 8:0: ino=12 state=I_DIRTY_PAGES dirtied_when=4294728197 age=297 index=3443711 to_write=6144 wrote=6144
>        flush-8:0-3133  [000]   357.954817: writeback_single_inode: bdi 8:0: ino=0 state= dirtied_when=4295024858 age=1 index=3708770 to_write=6144 wrote=59
>        flush-8:0-3133  [000]   358.509053: writeback_single_inode: bdi 8:0: ino=12 state=I_DIRTY_PAGES dirtied_when=4294728197 age=298 index=3449855 to_write=6144 wrote=6144
>        flush-8:0-3133  [002]   358.774390: writeback_single_inode: bdi 8:0: ino=12 state=I_DIRTY_PAGES dirtied_when=4294728197 age=298 index=3455999 to_write=6144 wrote=6144
>        flush-8:0-3133  [002]   358.783747: writeback_single_inode: bdi 8:0: ino=12 state=I_DIRTY_PAGES dirtied_when=4294728197 age=298 index=3462143 to_write=6144 wrote=6144
> 
> I noticed that there is nothing else blocking the flusher because the
> write_begin trace events always start immediately after each
> writeback_single_inode event.
> 
> And the writeback for ino 0 took 357.954817-357.418668 = ~500ms but
> only writes 59 pages, which is the major reason nr_writeback drops
> from ~80MB to ~50MB.
  Yes, this is not a big surprise. Metadata (indirect blocks) are
interleaved with data, so all the metadata IO is just single-block (4 KB)
writes, unlike data IO which comes in 4 MB chunks. This is also why you see
the request size drop in the graphs you attached to the next email.  Also,
since you have to seek to fulfil each 4 KB IO request, and seek times are
on the order of 5-15 ms, you easily get to metadata writeback of 59 pages
taking around 500 ms. The filesystem could do better and avoid these seeks
but it does not, and it is beyond the task of writeback to fix it. Writeback
just has to reckon with the fact that some pages are fast to write (e.g.
sequential writes as you generate them) and some pages are hard to write
(in this case metadata pages).
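
(As a rough cross-check of those numbers: at an assumed average of ~8.5 ms
per seek, within the 5-15 ms range above, 59 single-block writes come to
about 59 * 8.5 ms ~= 500 ms, which matches the ~500 ms you measured for
ino 0 in the trace.)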

> What's more, nr_writeback is only able to restore
> after this event (perhaps ext3 is cleaning up something during the
> time, which blocks dd in write_begin() or whatever):
> 
>               dd-3556  [000]   358.768670: balance_dirty_pages: bdi 8:0: limit=262144 setpoint=213489 dirty=212003 bdi_setpoint=213389 bdi_dirty=211640 dirty_ratelimit=46580 task_ratelimit=47444 dirtied=149 dirtied_pause=149 period=12 think=196 pause=-184 paused=0
  Well, my bet would be that dd needed to access some metadata block (such
as a free block bitmap) but it was under IO, so dd was blocked until the IO
finished.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
       [not found]                               ` <20111101134231.GA31718@localhost>
@ 2011-11-01 21:53                                 ` Jan Kara
  2011-11-02 17:25                                   ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-11-01 21:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Tue 01-11-11 21:42:31, Wu Fengguang wrote:
> On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > Jan,
> > > > > > > 
> > > > > > > I tried the below combined patch over the ioless one, and find some
> > > > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > > > 
> > > > > > > I'll try to bisect the changeset.
> > > > > 
> > > > > This is interesting, the culprit is found to be patch 1, which is
> > > > > simply
> > > > >                 if (work->for_kupdate) {
> > > > >                         oldest_jif = jiffies -
> > > > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > -                       work->older_than_this = &oldest_jif;
> > > > > -               }
> > > > > +               } else if (work->for_background)
> > > > > +                       oldest_jif = jiffies;
> > > >   Yeah. I had a look into the trace and you can notice that during the
> > > > whole dd run, we were running a single background writeback work (you can
> > > > verify that by work->nr_pages decreasing steadily). Without refreshing
> > > > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > > > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > > > every 5 seconds (kjournald dirties the device inode after committing a
> > > > transaction by dirtying metadata buffers which were just committed and can
> > > > now be checkpointed either by kjournald or flusher thread). So although the
> > > > performance is slightly reduced, I'd say that the behavior is a desired
> > > > one.
> > > > 
> > > > Also if you observed the performance on a really long run, the difference
> > > > should get smaller because eventually, kjournald has to flush the metadata
> > > > blocks when the journal fills up and we need to free some journal space and
> > > > at that point flushing is even more expensive because we have to do a
> > > > blocking write during which all transaction operations, thus effectively
> > > > the whole filesystem, are blocked.
> > > 
> > > Jan, I got figures for test case
> > > 
> > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > > 
> > > There is no single drop of nr_writeback in the longer 1200s run, which
> > > wrote ~60GB data.
> >   I did some calculations. Default journal size for a filesystem of your
> > size is 128 MB which allows recording of around 128 GB of data. So your
> > test probably didn't hit the point where the journal is recycled yet. An
> > easy way to make sure journal gets recycled is to set its size to a lower
> > value when creating the filesystem by
> >   mke2fs -J size=8
> 
> I tried the "-J size=8" and get similar interesting results for
> ext3/4, before/after this change:
> 
>                 if (work->for_kupdate) {
>                         oldest_jif = jiffies -
>                                 msecs_to_jiffies(dirty_expire_interval * 10);
> -                       work->older_than_this = &oldest_jif;
> -               }
> +               } else if (work->for_background)
> +                       oldest_jif = jiffies;
> 
> So I only attach the graphs for one case:
> 
> ext4-1dd-4k-8p-2941M-1000M:10-3.1.0-ioless-full-next-20111025+
> 
> Two of the graphs are very interesting. balance_dirty_pages-pause.png
> shows increasingly large negative pause times, which indicates large
> delays inside some ext4's routines.
  Likely we are hanging waiting for a transaction start. An 8 MB journal puts
rather big pressure on journal space, so we end up waiting on kjournald a
lot. But I'm not sure why the wait times would increase on a large scale - with
ext4 it's harder to estimate the used journal space because it uses extents, so
the amount of metadata written depends on fragmentation. If you could post
ext3 graphs, maybe I could make some sense of them... 

> And iostat-util.png shows very large CPU utilization...Oh well the
> lock_stat has the rcu_torture_timer on the top.  I'd better retest
> without the rcu torture test option...
  Yes, I guess this might be just a debugging artefact.

> >   Then at latest after writing 8 GB the effect of journal recycling should
> > be visible (I suggest writing at least 16 or so so that we can see some
> > pattern). Also note that without the patch altering background writeback,
> > kjournald will do all the writeback of the metadata and kjournal works with
> > buffer heads. Thus IO it does is *not* accounted in mm statistics. You will
> > observe its effects only by a sudden increase in await or svctm because the
> > disk got busy by IO you don't see. Also secondarily you could probably
> > observe that as a hiccup in the number of dirtied/written pages.
> 
> Ah good to know that. It could explain the drops of IO size.
> 
> iostat should still be reporting the journal IO, is it?
  Yes.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-11-01 21:53                                 ` Jan Kara
@ 2011-11-02 17:25                                   ` Wu Fengguang
  0 siblings, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-11-02 17:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

> > >   I did some calculations. Default journal size for a filesystem of your
> > > size is 128 MB which allows recording of around 128 GB of data. So your
> > > test probably didn't hit the point where the journal is recycled yet. An
> > > easy way to make sure journal gets recycled is to set its size to a lower
> > > value when creating the filesystem by
> > >   mke2fs -J size=8
> > 
> > I tried the "-J size=8" and get similar interesting results for
> > ext3/4, before/after this change:
> > 
> >                 if (work->for_kupdate) {
> >                         oldest_jif = jiffies -
> >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > -                       work->older_than_this = &oldest_jif;
> > -               }
> > +               } else if (work->for_background)
> > +                       oldest_jif = jiffies;
> > 
> > So I only attach the graphs for one case:
> > 
> > ext4-1dd-4k-8p-2941M-1000M:10-3.1.0-ioless-full-next-20111025+
> > 
> > Two of the graphs are very interesting. balance_dirty_pages-pause.png
> > shows increasingly large negative pause times, which indicates large
> > delays inside some ext4's routines.
>   Likely we are hanging waiting for transaction start. 8 MB journal puts
> rather big pressure on journal space so we end up waiting on kjournald a
> lot. But I'm not sure why wait times would increase on large scale - with
> ext4 it's harder to estimate used journal space because it uses extents so
> the amount of metadata written depends on fragmentation. If you could post
> ext3 graphs, maybe I could make some sense from it... 

Oops, it's me that messed it up again -- two chunks were lost when
rebasing the patchset, which leads to the misbehavior -- big sorry!

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
       [not found]                               ` <20111102185603.GA4034@localhost>
@ 2011-11-03  1:51                                 ` Jan Kara
  2011-11-03 14:52                                   ` Wu Fengguang
       [not found]                                   ` <20111104152054.GA11577@localhost>
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2011-11-03  1:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Thu 03-11-11 02:56:03, Wu Fengguang wrote:
> On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > Jan,
> > > > > > > 
> > > > > > > I tried the below combined patch over the ioless one, and find some
> > > > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > > > 
> > > > > > > I'll try to bisect the changeset.
> > > > > 
> > > > > This is interesting, the culprit is found to be patch 1, which is
> > > > > simply
> > > > >                 if (work->for_kupdate) {
> > > > >                         oldest_jif = jiffies -
> > > > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > -                       work->older_than_this = &oldest_jif;
> > > > > -               }
> > > > > +               } else if (work->for_background)
> > > > > +                       oldest_jif = jiffies;
> > > >   Yeah. I had a look into the trace and you can notice that during the
> > > > whole dd run, we were running a single background writeback work (you can
> > > > verify that by work->nr_pages decreasing steadily). Without refreshing
> > > > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > > > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > > > every 5 seconds (kjournald dirties the device inode after committing a
> > > > transaction by dirtying metadata buffers which were just committed and can
> > > > now be checkpointed either by kjournald or flusher thread). So although the
> > > > performance is slightly reduced, I'd say that the behavior is a desired
> > > > one.
> > > > 
> > > > Also if you observed the performance on a really long run, the difference
> > > > should get smaller because eventually, kjournald has to flush the metadata
> > > > blocks when the journal fills up and we need to free some journal space and
> > > > at that point flushing is even more expensive because we have to do a
> > > > blocking write during which all transaction operations, thus effectively
> > > > the whole filesystem, are blocked.
> > > 
> > > Jan, I got figures for test case
> > > 
> > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > > 
> > > There is no single drop of nr_writeback in the longer 1200s run, which
> > > wrote ~60GB data.
> >   I did some calculations. Default journal size for a filesystem of your
> > size is 128 MB which allows recording of around 128 GB of data. So your
> > test probably didn't hit the point where the journal is recycled yet. An
> > easy way to make sure journal gets recycled is to set its size to a lower
> > value when creating the filesystem by
> >   mke2fs -J size=8
> > 
> >   Then at latest after writing 8 GB the effect of journal recycling should
> > be visible (I suggest writing at least 16 or so so that we can see some
> > pattern). Also note that without the patch altering background writeback,
> > kjournald will do all the writeback of the metadata and kjournal works with
> > buffer heads. Thus IO it does is *not* accounted in mm statistics. You will
> > observe its effects only by a sudden increase in await or svctm because the
> > disk got busy by IO you don't see. Also secondarily you could probably
> > observe that as a hiccup in the number of dirtied/written pages.
> 
> Jan, finally the `correct' results for "-J size=8" w/o the patch
> altering background writeback.
> 
> I noticed the periodic small drops of nr_writeback in
> global_dirty_state.png, other than that it looks pretty good.
  If you look at the iostat graphs, you'll notice periodic increases in await
time at roughly 100 s intervals. I believe this could be checkpointing
that's going on in the background. Also there are (negative) peaks in the
"paused" graph. Anyway, the main question is - do you see any throughput
difference with/without the background writeback patch with the small
journal?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-11-03  1:51                                 ` Jan Kara
@ 2011-11-03 14:52                                   ` Wu Fengguang
       [not found]                                   ` <20111104152054.GA11577@localhost>
  1 sibling, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-11-03 14:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Thu, Nov 03, 2011 at 09:51:36AM +0800, Jan Kara wrote:
> On Thu 03-11-11 02:56:03, Wu Fengguang wrote:
> > On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > > Jan,
> > > > > > > > 
> > > > > > > > I tried the below combined patch over the ioless one, and find some
> > > > > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > > > > 
> > > > > > > > I'll try to bisect the changeset.
> > > > > > 
> > > > > > This is interesting, the culprit is found to be patch 1, which is
> > > > > > simply
> > > > > >                 if (work->for_kupdate) {
> > > > > >                         oldest_jif = jiffies -
> > > > > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > > -                       work->older_than_this = &oldest_jif;
> > > > > > -               }
> > > > > > +               } else if (work->for_background)
> > > > > > +                       oldest_jif = jiffies;
> > > > >   Yeah. I had a look into the trace and you can notice that during the
> > > > > whole dd run, we were running a single background writeback work (you can
> > > > > verify that by work->nr_pages decreasing steadily). Without refreshing
> > > > > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > > > > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > > > > every 5 seconds (kjournald dirties the device inode after committing a
> > > > > transaction by dirtying metadata buffers which were just committed and can
> > > > > now be checkpointed either by kjournald or flusher thread). So although the
> > > > > performance is slightly reduced, I'd say that the behavior is a desired
> > > > > one.
> > > > > 
> > > > > Also if you observed the performance on a really long run, the difference
> > > > > should get smaller because eventually, kjournald has to flush the metadata
> > > > > blocks when the journal fills up and we need to free some journal space and
> > > > > at that point flushing is even more expensive because we have to do a
> > > > > blocking write during which all transaction operations, thus effectively
> > > > > the whole filesystem, are blocked.
> > > > 
> > > > Jan, I got figures for test case
> > > > 
> > > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > > > 
> > > > There is no single drop of nr_writeback in the longer 1200s run, which
> > > > wrote ~60GB data.
> > >   I did some calculations. Default journal size for a filesystem of your
> > > size is 128 MB which allows recording of around 128 GB of data. So your
> > > test probably didn't hit the point where the journal is recycled yet. An
> > > easy way to make sure journal gets recycled is to set its size to a lower
> > > value when creating the filesystem by
> > >   mke2fs -J size=8
> > > 
> > >   Then at latest after writing 8 GB the effect of journal recycling should
> > > be visible (I suggest writing at least 16 or so so that we can see some
> > > pattern). Also note that without the patch altering background writeback,
> > > kjournald will do all the writeback of the metadata and kjournal works with
> > > buffer heads. Thus IO it does is *not* accounted in mm statistics. You will
> > > observe its effects only by a sudden increase in await or svctm because the
> > > disk got busy by IO you don't see. Also secondarily you could probably
> > > observe that as a hiccup in the number of dirtied/written pages.
> > 
> > Jan, finally the `correct' results for "-J size=8" w/o the patch
> > altering background writeback.
> > 
> > I noticed the periodic small drops of nr_writeback in
> > global_dirty_state.png, other than that it looks pretty good.
>   If you look at iostat graphs, you'll notice periodic increases in await
> time in roughly 100 s intervals. I believe this could be checkpointing

Yes it is. And there are frequent drops of the IO queue size.

> that's going on in the background. Also there are (negative) peaks in the
> "paused" graph.

Yeah, this happens on all ext3/4 workloads and I'm kind of used to it ;)

> Anyway, the main question is - do you see any throughput
> difference with/without the background writeback patch with the small
> journal?

Here are the comparisons w/o the patch. The results w/ the patch
should be available tomorrow.

wfg@bee /export/writeback% ./compare.rb -g ext4 -c fs -e io_wkB_s thresh*/*-20111102+
                    ext4              ext4:jsize=8
------------------------  ------------------------
                47684.35        -0.3%     47546.62  thresh=1000M/X-100dd-4k-8p-4096M-1000M:10-3.1.0-ioless-full-next-20111102+
                54015.86        -1.6%     53166.76  thresh=1000M/X-10dd-4k-8p-4096M-1000M:10-3.1.0-ioless-full-next-20111102+
                55320.03        +0.6%     55657.48  thresh=1000M/X-1dd-4k-8p-4096M-1000M:10-3.1.0-ioless-full-next-20111102+
                44271.29        +2.6%     45443.23  thresh=100M/X-10dd-4k-8p-4096M-100M:10-3.1.0-ioless-full-next-20111102+
                54334.22        -1.0%     53801.15  thresh=100M/X-1dd-4k-8p-4096M-100M:10-3.1.0-ioless-full-next-20111102+
                52563.67        -0.7%     52207.05  thresh=100M/X-2dd-4k-8p-4096M-100M:10-3.1.0-ioless-full-next-20111102+
               308189.41        -0.1%    307822.30  TOTAL io_wkB_s

wfg@bee /export/writeback% ./compare.rb -g ext3 -c fs -e io_wkB_s thresh*/*-20111102+
                    ext3              ext3:jsize=8
------------------------  ------------------------
                36231.89        -1.6%     35659.34  thresh=1000M/X-100dd-4k-8p-4096M-1000M:10-3.1.0-ioless-full-next-20111102+
                41115.07        -6.2%     38564.52  thresh=1000M/X-10dd-4k-8p-4096M-1000M:10-3.1.0-ioless-full-next-20111102+
                48025.75        -3.8%     46213.55  thresh=1000M/X-1dd-4k-8p-4096M-1000M:10-3.1.0-ioless-full-next-20111102+
                45317.31        +1.6%     46023.21  thresh=100M/X-1dd-4k-8p-4096M-100M:10-3.1.0-ioless-full-next-20111102+
                40552.64        +4.0%     42182.84  thresh=100M/X-2dd-4k-8p-4096M-100M:10-3.1.0-ioless-full-next-20111102+
               211242.67        -1.2%    208643.45  TOTAL io_wkB_s

I'm currently rewriting the test scripts to make them easier for others
to understand and make use of. They will also gain the ability to run
each test 2+ times to get a better idea of the fluctuations :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
       [not found]                                   ` <20111104152054.GA11577@localhost>
@ 2011-11-08 23:52                                     ` Jan Kara
  2011-11-09 13:51                                       ` Wu Fengguang
  2011-11-10 14:50                                       ` Jan Kara
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2011-11-08 23:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Fri 04-11-11 23:20:55, Wu Fengguang wrote:
> On Thu, Nov 03, 2011 at 09:51:36AM +0800, Jan Kara wrote:
> > On Thu 03-11-11 02:56:03, Wu Fengguang wrote:
> > > On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > > > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > Jan,
> > > > > > > > > 
> > > > > > > > > I tried the below combined patch over the ioless one, and find some
> > > > > > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > > > > > 
> > > > > > > > > I'll try to bisect the changeset.
> > > > > > > 
> > > > > > > This is interesting, the culprit is found to be patch 1, which is
> > > > > > > simply
> > > > > > >                 if (work->for_kupdate) {
> > > > > > >                         oldest_jif = jiffies -
> > > > > > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > > > -                       work->older_than_this = &oldest_jif;
> > > > > > > -               }
> > > > > > > +               } else if (work->for_background)
> > > > > > > +                       oldest_jif = jiffies;
> > > > > >   Yeah. I had a look into the trace and you can notice that during the
> > > > > > whole dd run, we were running a single background writeback work (you can
> > > > > > verify that by work->nr_pages decreasing steadily). Without refreshing
> > > > > > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > > > > > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > > > > > every 5 seconds (kjournald dirties the device inode after committing a
> > > > > > transaction by dirtying metadata buffers which were just committed and can
> > > > > > now be checkpointed either by kjournald or flusher thread). So although the
> > > > > > performance is slightly reduced, I'd say that the behavior is a desired
> > > > > > one.
> > > > > > 
> > > > > > Also if you observed the performance on a really long run, the difference
> > > > > > should get smaller because eventually, kjournald has to flush the metadata
> > > > > > blocks when the journal fills up and we need to free some journal space and
> > > > > > at that point flushing is even more expensive because we have to do a
> > > > > > blocking write during which all transaction operations, thus effectively
> > > > > > the whole filesystem, are blocked.
> > > > > 
> > > > > Jan, I got figures for test case
> > > > > 
> > > > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > > > > 
> > > > > There is no single drop of nr_writeback in the longer 1200s run, which
> > > > > wrote ~60GB data.
> > > >   I did some calculations. Default journal size for a filesystem of your
> > > > size is 128 MB which allows recording of around 128 GB of data. So your
> > > > test probably didn't hit the point where the journal is recycled yet. An
> > > > easy way to make sure journal gets recycled is to set its size to a lower
> > > > value when creating the filesystem by
> > > >   mke2fs -J size=8
> > > > 
> > > >   Then at latest after writing 8 GB the effect of journal recycling should
> > > > be visible (I suggest writing at least 16 or so so that we can see some
> > > > pattern). Also note that without the patch altering background writeback,
> > > > kjournald will do all the writeback of the metadata and kjournal works with
> > > > buffer heads. Thus IO it does is *not* accounted in mm statistics. You will
> > > > observe its effects only by a sudden increase in await or svctm because the
> > > > disk got busy by IO you don't see. Also secondarily you could probably
> > > > observe that as a hiccup in the number of dirtied/written pages.
> > > 
> > > Jan, finally the `correct' results for "-J size=8" w/o the patch
> > > altering background writeback.
> > > 
> > > I noticed the periodic small drops of nr_writeback in
> > > global_dirty_state.png, other than that it looks pretty good.
> >   If you look at iostat graphs, you'll notice periodic increases in await
> > time in roughly 100 s intervals. I believe this could be checkpointing
> > that's going on in the background. Also there are (negative) peaks in the
> > "paused" graph. Anyway, the main question is - do you see any throughput
> > difference with/without the background writeback patch with the small
> > journal?
> 
> Jan, I got the results before/after patch -- there is small
> performance drops either with plain mkfs or mkfs "-J size=8",
> while the latter does see smaller drops.
> 
> To make it more accurate, I use the average wkB/s value reported by
> iostat for the comparison.
> 
> wfg@bee /export/writeback% ./compare.rb -g jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> ------------------------  ------------------------
>                 35659.34        -0.8%     35377.54  thresh=1000M/ext3:jsize=8-100dd-4k-8p-4096M-1000M:10-X
>                 38564.52        -1.9%     37839.55  thresh=1000M/ext3:jsize=8-10dd-4k-8p-4096M-1000M:10-X
>                 46213.55        -3.1%     44784.05  thresh=1000M/ext3:jsize=8-1dd-4k-8p-4096M-1000M:10-X
>                 47546.62        +0.5%     47790.81  thresh=1000M/ext4:jsize=8-100dd-4k-8p-4096M-1000M:10-X
>                 53166.76        +0.6%     53512.28  thresh=1000M/ext4:jsize=8-10dd-4k-8p-4096M-1000M:10-X
>                 55657.48        -0.2%     55530.27  thresh=1000M/ext4:jsize=8-1dd-4k-8p-4096M-1000M:10-X
>                 38868.18        -1.9%     38146.89  thresh=100M/ext3:jsize=8-10dd-4k-8p-4096M-100M:10-X
>                 46023.21        -0.2%     45908.73  thresh=100M/ext3:jsize=8-1dd-4k-8p-4096M-100M:10-X
>                 42182.84        -1.5%     41556.99  thresh=100M/ext3:jsize=8-2dd-4k-8p-4096M-100M:10-X
>                 45443.23        -0.9%     45038.84  thresh=100M/ext4:jsize=8-10dd-4k-8p-4096M-100M:10-X
>                 53801.15        -0.9%     53315.74  thresh=100M/ext4:jsize=8-1dd-4k-8p-4096M-100M:10-X
>                 52207.05        -0.6%     51913.22  thresh=100M/ext4:jsize=8-2dd-4k-8p-4096M-100M:10-X
>                 33389.88        -3.5%     32226.18  thresh=10M/ext3:jsize=8-10dd-4k-8p-4096M-10M:10-X
>                 45430.23        -3.5%     43846.57  thresh=10M/ext3:jsize=8-1dd-4k-8p-4096M-10M:10-X
>                 44186.72        -4.5%     42185.16  thresh=10M/ext3:jsize=8-2dd-4k-8p-4096M-10M:10-X
>                 36237.34        -3.1%     35128.90  thresh=10M/ext4:jsize=8-10dd-4k-8p-4096M-10M:10-X
>                 54633.30        -2.7%     53135.13  thresh=10M/ext4:jsize=8-1dd-4k-8p-4096M-10M:10-X
>                 50767.63        -1.9%     49800.59  thresh=10M/ext4:jsize=8-2dd-4k-8p-4096M-10M:10-X
>                 49654.38        -4.8%     47274.27  thresh=1M/ext4:jsize=8-1dd-4k-8p-4096M-1M:10-X
>                 45142.01        -5.3%     42745.49  thresh=1M/ext4:jsize=8-2dd-4k-8p-4096M-1M:10-X
>                914775.42        -1.9%    897057.21  TOTAL io_wkB_s
  These differences look negligible except when thresh <= 10M, where flushing
becomes rather aggressive, I'd say, and thus the fact that background
writeback can switch inodes is more noticeable. OTOH thresh <= 10M doesn't
look like a case which needs optimizing for.

> wfg@bee /export/writeback% ./compare.rb -v jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> ------------------------  ------------------------
>                 36231.89        -3.8%     34855.10  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
>                 41115.07       -12.7%     35886.36  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
>                 48025.75       -14.3%     41146.57  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
>                 47684.35        -6.4%     44644.30  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
>                 54015.86        -4.0%     51851.01  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
>                 55320.03        -2.6%     53867.63  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
>                 37400.51        +1.6%     38012.57  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
>                 45317.31        -4.5%     43272.16  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
>                 40552.64        +0.8%     40884.60  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
>                 44271.29        -5.6%     41789.76  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
>                 54334.22        -3.5%     52435.69  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
>                 52563.67        -6.1%     49341.84  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
>                 45027.95        -1.0%     44599.37  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
>                 42478.40        +0.3%     42608.48  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
>                 35178.47        -0.2%     35103.56  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
>                 54079.64        -0.5%     53834.85  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
>                 49982.11        -0.4%     49803.44  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
>                783579.17        -3.8%    753937.28  TOTAL io_wkB_s
  Here I can see some noticeable drops in the realistic thresh=100M case
(the thresh=1000M case is unrealistic, but it still surprises me that there
are drops there as well). I'll try to reproduce your results so that I can
look into this more effectively.

								Honza

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-11-08 23:52                                     ` Jan Kara
@ 2011-11-09 13:51                                       ` Wu Fengguang
  2011-11-10 14:50                                       ` Jan Kara
  1 sibling, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-11-09 13:51 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

On Wed, Nov 09, 2011 at 07:52:07AM +0800, Jan Kara wrote:
> On Fri 04-11-11 23:20:55, Wu Fengguang wrote:
> > On Thu, Nov 03, 2011 at 09:51:36AM +0800, Jan Kara wrote:
> > > On Thu 03-11-11 02:56:03, Wu Fengguang wrote:
> > > > On Fri, Oct 28, 2011 at 04:31:04AM +0800, Jan Kara wrote:
> > > > > On Thu 27-10-11 14:31:33, Wu Fengguang wrote:
> > > > > > On Fri, Oct 21, 2011 at 06:26:16AM +0800, Jan Kara wrote:
> > > > > > > On Thu 20-10-11 21:39:38, Wu Fengguang wrote:
> > > > > > > > On Thu, Oct 20, 2011 at 08:33:00PM +0800, Wu Fengguang wrote:
> > > > > > > > > On Thu, Oct 20, 2011 at 08:09:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > > Jan,
> > > > > > > > > > 
> > > > > > > > > > I tried the below combined patch over the ioless one, and find some
> > > > > > > > > > minor regressions. I studied the thresh=1G/ext3-1dd case in particular
> > > > > > > > > > and find that nr_writeback and the iostat avgrq-sz drops from time to time.
> > > > > > > > > > 
> > > > > > > > > > I'll try to bisect the changeset.
> > > > > > > > 
> > > > > > > > This is interesting, the culprit is found to be patch 1, which is
> > > > > > > > simply
> > > > > > > >                 if (work->for_kupdate) {
> > > > > > > >                         oldest_jif = jiffies -
> > > > > > > >                                 msecs_to_jiffies(dirty_expire_interval * 10);
> > > > > > > > -                       work->older_than_this = &oldest_jif;
> > > > > > > > -               }
> > > > > > > > +               } else if (work->for_background)
> > > > > > > > +                       oldest_jif = jiffies;
> > > > > > >   Yeah. I had a look into the trace and you can notice that during the
> > > > > > > whole dd run, we were running a single background writeback work (you can
> > > > > > > verify that by work->nr_pages decreasing steadily). Without refreshing
> > > > > > > oldest_jif, we'd write block device inode for /dev/sda (you can identify
> > > > > > > that by bdi=8:0, ino=0) only once. When refreshing oldest_jif, we write it
> > > > > > > every 5 seconds (kjournald dirties the device inode after committing a
> > > > > > > transaction by dirtying metadata buffers which were just committed and can
> > > > > > > now be checkpointed either by kjournald or flusher thread). So although the
> > > > > > > performance is slightly reduced, I'd say that the behavior is a desired
> > > > > > > one.
> > > > > > > 
> > > > > > > Also if you observed the performance on a really long run, the difference
> > > > > > > should get smaller because eventually, kjournald has to flush the metadata
> > > > > > > blocks when the journal fills up and we need to free some journal space and
> > > > > > > at that point flushing is even more expensive because we have to do a
> > > > > > > blocking write during which all transaction operations, thus effectively
> > > > > > > the whole filesystem, are blocked.
> > > > > > 
> > > > > > Jan, I got figures for test case
> > > > > > 
> > > > > > ext3-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> > > > > > 
> > > > > > There is no single drop of nr_writeback in the longer 1200s run, which
> > > > > > wrote ~60GB data.
> > > > >   I did some calculations. Default journal size for a filesystem of your
> > > > > size is 128 MB which allows recording of around 128 GB of data. So your
> > > > > test probably didn't hit the point where the journal is recycled yet. An
> > > > > easy way to make sure journal gets recycled is to set its size to a lower
> > > > > value when creating the filesystem by
> > > > >   mke2fs -J size=8
> > > > > 
> > > > >   Then at latest after writing 8 GB the effect of journal recycling should
> > > > > be visible (I suggest writing at least 16 or so so that we can see some
> > > > > pattern). Also note that without the patch altering background writeback,
> > > > > kjournald will do all the writeback of the metadata and kjournal works with
> > > > > buffer heads. Thus IO it does is *not* accounted in mm statistics. You will
> > > > > observe its effects only by a sudden increase in await or svctm because the
> > > > > disk got busy by IO you don't see. Also secondarily you could probably
> > > > > observe that as a hiccup in the number of dirtied/written pages.
> > > > 
> > > > Jan, finally the `correct' results for "-J size=8" w/o the patch
> > > > altering background writeback.
> > > > 
> > > > I noticed the periodic small drops of nr_writeback in
> > > > global_dirty_state.png, other than that it looks pretty good.
> > >   If you look at iostat graphs, you'll notice periodic increases in await
> > > time in roughly 100 s intervals. I believe this could be checkpointing
> > > that's going on in the background. Also there are (negative) peaks in the
> > > "paused" graph. Anyway, the main question is - do you see any throughput
> > > difference with/without the background writeback patch with the small
> > > journal?
> > 
> > Jan, I got the results before/after patch -- there is small
> > performance drops either with plain mkfs or mkfs "-J size=8",
> > while the latter does see smaller drops.
> > 
> > To make it more accurate, I use the average wkB/s value reported by
> > iostat for the comparison.
> > 
> > wfg@bee /export/writeback% ./compare.rb -g jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > ------------------------  ------------------------
> >                 35659.34        -0.8%     35377.54  thresh=1000M/ext3:jsize=8-100dd-4k-8p-4096M-1000M:10-X
> >                 38564.52        -1.9%     37839.55  thresh=1000M/ext3:jsize=8-10dd-4k-8p-4096M-1000M:10-X
> >                 46213.55        -3.1%     44784.05  thresh=1000M/ext3:jsize=8-1dd-4k-8p-4096M-1000M:10-X
> >                 47546.62        +0.5%     47790.81  thresh=1000M/ext4:jsize=8-100dd-4k-8p-4096M-1000M:10-X
> >                 53166.76        +0.6%     53512.28  thresh=1000M/ext4:jsize=8-10dd-4k-8p-4096M-1000M:10-X
> >                 55657.48        -0.2%     55530.27  thresh=1000M/ext4:jsize=8-1dd-4k-8p-4096M-1000M:10-X
> >                 38868.18        -1.9%     38146.89  thresh=100M/ext3:jsize=8-10dd-4k-8p-4096M-100M:10-X
> >                 46023.21        -0.2%     45908.73  thresh=100M/ext3:jsize=8-1dd-4k-8p-4096M-100M:10-X
> >                 42182.84        -1.5%     41556.99  thresh=100M/ext3:jsize=8-2dd-4k-8p-4096M-100M:10-X
> >                 45443.23        -0.9%     45038.84  thresh=100M/ext4:jsize=8-10dd-4k-8p-4096M-100M:10-X
> >                 53801.15        -0.9%     53315.74  thresh=100M/ext4:jsize=8-1dd-4k-8p-4096M-100M:10-X
> >                 52207.05        -0.6%     51913.22  thresh=100M/ext4:jsize=8-2dd-4k-8p-4096M-100M:10-X
> >                 33389.88        -3.5%     32226.18  thresh=10M/ext3:jsize=8-10dd-4k-8p-4096M-10M:10-X
> >                 45430.23        -3.5%     43846.57  thresh=10M/ext3:jsize=8-1dd-4k-8p-4096M-10M:10-X
> >                 44186.72        -4.5%     42185.16  thresh=10M/ext3:jsize=8-2dd-4k-8p-4096M-10M:10-X
> >                 36237.34        -3.1%     35128.90  thresh=10M/ext4:jsize=8-10dd-4k-8p-4096M-10M:10-X
> >                 54633.30        -2.7%     53135.13  thresh=10M/ext4:jsize=8-1dd-4k-8p-4096M-10M:10-X
> >                 50767.63        -1.9%     49800.59  thresh=10M/ext4:jsize=8-2dd-4k-8p-4096M-10M:10-X
> >                 49654.38        -4.8%     47274.27  thresh=1M/ext4:jsize=8-1dd-4k-8p-4096M-1M:10-X
> >                 45142.01        -5.3%     42745.49  thresh=1M/ext4:jsize=8-2dd-4k-8p-4096M-1M:10-X
> >                914775.42        -1.9%    897057.21  TOTAL io_wkB_s
>   These differences look negligible unless thresh <= 10M when flushing
> becomes rather aggressive I'd say and thus the fact that background
> writeback can switch inodes is more noticeable. OTOH thresh <= 10M doesn't
> look like a case which needs optimizing for.

Agreed in principle.

> > wfg@bee /export/writeback% ./compare.rb -v jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > ------------------------  ------------------------
> >                 36231.89        -3.8%     34855.10  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
> >                 41115.07       -12.7%     35886.36  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
> >                 48025.75       -14.3%     41146.57  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
> >                 47684.35        -6.4%     44644.30  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
> >                 54015.86        -4.0%     51851.01  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
> >                 55320.03        -2.6%     53867.63  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
> >                 37400.51        +1.6%     38012.57  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >                 45317.31        -4.5%     43272.16  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >                 40552.64        +0.8%     40884.60  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >                 44271.29        -5.6%     41789.76  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >                 54334.22        -3.5%     52435.69  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >                 52563.67        -6.1%     49341.84  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >                 45027.95        -1.0%     44599.37  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
> >                 42478.40        +0.3%     42608.48  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
> >                 35178.47        -0.2%     35103.56  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
> >                 54079.64        -0.5%     53834.85  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
> >                 49982.11        -0.4%     49803.44  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
> >                783579.17        -3.8%    753937.28  TOTAL io_wkB_s
>   Here I can see some noticeable drops in the realistic thresh=100M case
> (case thresh=1000M is unrealistic but it still surprise me that there are
> drops as well). I'll try to reproduce your results so that I can look into
> this more effectively.

OK. I'm trying to get the test scripts out in a usable form, so as to
make it easier for you to do comparisons/analyses more freely :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-11-08 23:52                                     ` Jan Kara
  2011-11-09 13:51                                       ` Wu Fengguang
@ 2011-11-10 14:50                                       ` Jan Kara
  2011-12-05  8:02                                         ` Wu Fengguang
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-11-10 14:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

On Wed 09-11-11 00:52:07, Jan Kara wrote:
> > wfg@bee /export/writeback% ./compare.rb -v jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > ------------------------  ------------------------
> >                 36231.89        -3.8%     34855.10  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
> >                 41115.07       -12.7%     35886.36  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
> >                 48025.75       -14.3%     41146.57  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
> >                 47684.35        -6.4%     44644.30  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
> >                 54015.86        -4.0%     51851.01  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
> >                 55320.03        -2.6%     53867.63  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
> >                 37400.51        +1.6%     38012.57  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >                 45317.31        -4.5%     43272.16  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >                 40552.64        +0.8%     40884.60  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >                 44271.29        -5.6%     41789.76  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >                 54334.22        -3.5%     52435.69  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >                 52563.67        -6.1%     49341.84  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >                 45027.95        -1.0%     44599.37  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
> >                 42478.40        +0.3%     42608.48  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
> >                 35178.47        -0.2%     35103.56  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
> >                 54079.64        -0.5%     53834.85  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
> >                 49982.11        -0.4%     49803.44  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
> >                783579.17        -3.8%    753937.28  TOTAL io_wkB_s
>   Here I can see some noticeable drops in the realistic thresh=100M case
> (case thresh=1000M is unrealistic but it still surprise me that there are
> drops as well). I'll try to reproduce your results so that I can look into
> this more effectively.
  So I've run a test on a machine with 1G of memory, thresh=184M (so
something similar to your 4G-1G test). I've used tiobench with 10 threads,
each thread writing a 1.6G file. I ran the test 10 times to get an idea of
the fluctuations. The result is:

                  without patch           with patch
  AVG +- STDDEV   199.884820 +- 1.32268   200.466003 +- 0.377405

The numbers are times to completion, so lower is better. Summary: no
statistically meaningful difference. I'll run more tests with different
dirty thresholds to see whether I can observe some difference...
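
A minimal sketch of how such an AVG +- STDDEV over repeated runs could be
computed ("run_one_test" is a hypothetical wrapper that prints one
time-to-completion in seconds per run, and the awk line computes the
population standard deviation):

  for i in $(seq 10); do
          run_one_test    # hypothetical: e.g. time a single tiobench run
  done > times.txt

  awk '{ s += $1; ss += $1 * $1; n++ }
       END { avg = s / n; printf "%f +- %f\n", avg, sqrt(ss / n - avg * avg) }' times.txt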

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-11-10 14:50                                       ` Jan Kara
@ 2011-12-05  8:02                                         ` Wu Fengguang
  2011-12-07 10:13                                           ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Wu Fengguang @ 2011-12-05  8:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

Jan,

Sorry for the long delay!

On Thu, Nov 10, 2011 at 10:50:44PM +0800, Jan Kara wrote:
> On Wed 09-11-11 00:52:07, Jan Kara wrote:
> > > wfg@bee /export/writeback% ./compare.rb -v jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > > ------------------------  ------------------------
> > >                 36231.89        -3.8%     34855.10  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
> > >                 41115.07       -12.7%     35886.36  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
> > >                 48025.75       -14.3%     41146.57  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
> > >                 47684.35        -6.4%     44644.30  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
> > >                 54015.86        -4.0%     51851.01  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
> > >                 55320.03        -2.6%     53867.63  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
> > >                 37400.51        +1.6%     38012.57  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> > >                 45317.31        -4.5%     43272.16  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> > >                 40552.64        +0.8%     40884.60  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> > >                 44271.29        -5.6%     41789.76  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> > >                 54334.22        -3.5%     52435.69  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> > >                 52563.67        -6.1%     49341.84  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> > >                 45027.95        -1.0%     44599.37  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
> > >                 42478.40        +0.3%     42608.48  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
> > >                 35178.47        -0.2%     35103.56  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
> > >                 54079.64        -0.5%     53834.85  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
> > >                 49982.11        -0.4%     49803.44  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
> > >                783579.17        -3.8%    753937.28  TOTAL io_wkB_s
> >   Here I can see some noticeable drops in the realistic thresh=100M case
> > (case thresh=1000M is unrealistic but it still surprise me that there are
> > drops as well). I'll try to reproduce your results so that I can look into
> > this more effectively.
>   So I've run a test on a machine with 1G of memory, thresh=184M (so
> something similar to your 4G-1G test). I've used tiobench using 10 threads,
> each thread writing 1.6G file. I have run the test 10 times to get an idea
> of fluctuations. The result is:
>   without patch			with patch
>    AVG         STDDEV      AVG         STDDEV
> 199.884820 +- 1.32268	200.466003 +- 0.377405
> 
> The numbers are time-to-completion so lower is better. Summary is: No
> statistically meaningful difference. I'll run more tests with different
> dirty thresholds to see whether I won't be able to observe some
> difference...

I carried out some tests on ext3/ext4 before/after the patch. Most tests
were repeated 3 times so as to get an idea of the variations.

The ":jsize=8" notion means "-J size=8" in action.

The overall deltas are -0.3% for ext4 and -3.3% for ext3.  I noticed
that the regressions mostly happen in the "-J size=8" cases.  For a
normal mkfs, ext4 actually sees a +2.5% increase and ext3 sees only a
-0.8% drop.

I don't see any misbehavior in the graphs,
so in general I think the test results are acceptable.

Thanks,
Fengguang
---

compare-io.sh can be found here
https://github.com/fengguang/writeback-tests/

all cases:

wfg@bee /export/writeback% ./compare-io.sh  snb/*/{*-3.2.0-rc1,*-3.2.0-rc1-bg-all+} | grep TOTAL
                10288.05        -1.7%     10112.32  TOTAL write_bw
             10515330.05        -1.7%  10340244.65  TOTAL io_wkB_s
                22691.02        -0.1%     22675.29  TOTAL io_w_s
              1177578.21        -3.3%   1138850.24  TOTAL io_wrqm_s
               115888.13        -0.6%    115170.24  TOTAL io_avgrq_sz
                17112.93        -0.5%     17032.84  TOTAL io_avgqu_sz
                96998.62        +0.7%     97649.86  TOTAL io_await
                  838.95        -1.8%       823.97  TOTAL io_svctm
                12493.72        +0.0%     12493.90  TOTAL io_util
                   32.06        -0.1%        32.03  TOTAL cpu_user
                    0.00        +nan%         0.00  TOTAL cpu_nice
                  514.20        -0.7%       510.85  TOTAL cpu_system
                 4168.71     +5402.9%    229401.29  TOTAL cpu_iowait
                    0.00        +nan%         0.00  TOTAL cpu_steal
                 7784.93        -2.5%      7587.00  TOTAL cpu_idle

normal mkfs:

wfg@bee /export/writeback% ./compare-io.sh -v jsize snb/*/{*-3.2.0-rc1,*-3.2.0-rc1-bg-all+} | grep TOTAL
                 5309.47        +0.9%      5357.55  TOTAL write_bw
              5426694.86        +1.0%   5480854.55  TOTAL io_wkB_s
                11532.74        +4.1%     12010.72  TOTAL io_w_s
               599951.24        -0.9%    594789.05  TOTAL io_wrqm_s
                60481.46        -2.0%     59242.21  TOTAL io_avgrq_sz
                 8727.64        -0.5%      8687.29  TOTAL io_avgqu_sz
                49759.66        -3.6%     47945.50  TOTAL io_await
                  419.03        -6.6%       391.51  TOTAL io_svctm
                 6396.66        +0.0%      6397.08  TOTAL io_util
                   16.86        +2.1%        17.22  TOTAL cpu_user
                    0.00        +nan%         0.00  TOTAL cpu_nice
                  263.49        +0.0%       263.59  TOTAL cpu_system
                 2188.46        +4.3%      2283.60  TOTAL cpu_iowait
                    0.00        +nan%         0.00  TOTAL cpu_steal
                 3931.13        -2.4%      3835.53  TOTAL cpu_idle

ext3 normal mkfs:

wfg@bee /export/writeback% ./compare-io.sh -v jsize -g ext3 snb/*/{*-3.2.0-rc1,*-3.2.0-rc1-bg-all+} | grep TOTAL
                 2365.00        -0.8%      2345.04  TOTAL write_bw
              2420077.26        -0.8%   2400572.22  TOTAL io_wkB_s
                 5494.17        +5.9%      5820.50  TOTAL io_w_s
               599733.51        -0.9%    594590.82  TOTAL io_wrqm_s
                28648.41        -4.3%     27409.62  TOTAL io_avgrq_sz
                 4393.66        -0.3%      4380.82  TOTAL io_avgqu_sz
                26496.62        -4.1%     25409.98  TOTAL io_await
                  241.60        -8.8%       220.24  TOTAL io_svctm
                 3197.57        -0.0%      3197.26  TOTAL io_util
                    8.40        +1.7%         8.54  TOTAL cpu_user
                    0.00        +nan%         0.00  TOTAL cpu_nice
                  132.00        -0.5%       131.32  TOTAL cpu_system
                 1050.19        +5.0%      1102.49  TOTAL cpu_iowait
                    0.00        +nan%         0.00  TOTAL cpu_steal
                 2009.39        -2.6%      1957.63  TOTAL cpu_idle

ext4 normal mkfs:

wfg@bee /export/writeback% ./compare-io.sh -v jsize -g ext4 snb/*/{*-3.2.0-rc1,*-3.2.0-rc1-bg-all+} | grep TOTAL
                 2944.47        +2.3%      3012.52  TOTAL write_bw
              3006617.60        +2.5%   3080282.33  TOTAL io_wkB_s
                 6038.57        +2.5%      6190.22  TOTAL io_w_s
                  217.73        -9.0%       198.23  TOTAL io_wrqm_s
                31833.06        -0.0%     31832.59  TOTAL io_avgrq_sz
                 4333.98        -0.6%      4306.47  TOTAL io_avgqu_sz
                23263.05        -3.1%     22535.51  TOTAL io_await
                  177.43        -3.5%       171.28  TOTAL io_svctm
                 3199.09        +0.0%      3199.83  TOTAL io_util
                    8.46        +2.6%         8.68  TOTAL cpu_user
                    0.00        +nan%         0.00  TOTAL cpu_nice
                  131.49        +0.6%       132.28  TOTAL cpu_system
                 1138.27        +3.8%      1181.12  TOTAL cpu_iowait
                    0.00        +nan%         0.00  TOTAL cpu_steal
                 1921.74        -2.3%      1877.90  TOTAL cpu_idle

Some details for ext4/ext3:

wfg@bee /export/writeback% ./compare.rb -g ext4 -e io_wkB_s snb/*/{*-3.2.0-rc1,*-3.2.0-rc1-bg-all+}
               3.2.0-rc1         3.2.0-rc1-bg-all+
------------------------  ------------------------
                84473.27       +14.6%     96810.30  snb/thresh=1000M/ext4-100dd-1-3.2.0-rc1
                96971.63        -0.2%     96817.43  snb/thresh=1000M/ext4-100dd-2-3.2.0-rc1
                93518.20       +13.0%    105681.79  snb/thresh=1000M/ext4-10dd-1-3.2.0-rc1
                91381.67       +15.4%    105428.50  snb/thresh=1000M/ext4-10dd-2-3.2.0-rc1
               105734.03        -7.6%     97746.25  snb/thresh=1000M/ext4-10dd-3-3.2.0-rc1
               106967.36        -9.7%     96610.37  snb/thresh=1000M/ext4-1dd-1-3.2.0-rc1
               105721.10        +0.1%    105821.17  snb/thresh=1000M/ext4-1dd-2-3.2.0-rc1
                93144.96       +15.1%    107163.72  snb/thresh=1000M/ext4-1dd-3-3.2.0-rc1
                91008.48        -5.9%     85667.88  snb/thresh=1000M/ext4:jsize=8-100dd-1-3.2.0-rc1
                83903.83        +1.7%     85332.89  snb/thresh=1000M/ext4:jsize=8-100dd-2-3.2.0-rc1
                91294.62        +2.2%     93291.84  snb/thresh=1000M/ext4:jsize=8-10dd-1-3.2.0-rc1
                99669.41        -6.4%     93324.27  snb/thresh=1000M/ext4:jsize=8-10dd-2-3.2.0-rc1
                97860.10        -2.1%     95845.66  snb/thresh=1000M/ext4:jsize=8-1dd-1-3.2.0-rc1
                94028.20        +1.2%     95164.13  snb/thresh=1000M/ext4:jsize=8-1dd-2-3.2.0-rc1
               106718.79       -11.1%     94867.78  snb/thresh=1000M/ext4:jsize=8-1dd-3-3.2.0-rc1
                67377.61       +13.9%     76717.96  snb/thresh=100M/ext4-100dd-1-3.2.0-rc1
                76879.86        -2.9%     74629.73  snb/thresh=100M/ext4-100dd-2-3.2.0-rc1
                81428.25       +13.0%     92024.19  snb/thresh=100M/ext4-10dd-1-3.2.0-rc1
                81262.62        +6.9%     86877.99  snb/thresh=100M/ext4-10dd-2-3.2.0-rc1
                92194.34        -9.5%     83411.11  snb/thresh=100M/ext4-10dd-3-3.2.0-rc1
                94225.13       +12.7%    106187.25  snb/thresh=100M/ext4-1dd-1-3.2.0-rc1
                97244.07        -0.5%     96772.82  snb/thresh=100M/ext4-1dd-2-3.2.0-rc1
                91910.86        +7.9%     99214.46  snb/thresh=100M/ext4-1dd-3-3.2.0-rc1
                69473.36        -2.2%     67959.60  snb/thresh=100M/ext4:jsize=8-100dd-1-3.2.0-rc1
                69488.99        -2.4%     67832.74  snb/thresh=100M/ext4:jsize=8-100dd-2-3.2.0-rc1
                92434.46       -10.9%     82384.58  snb/thresh=100M/ext4:jsize=8-10dd-1-3.2.0-rc1
                85331.01        -3.4%     82406.72  snb/thresh=100M/ext4:jsize=8-10dd-2-3.2.0-rc1
               106049.14       -11.7%     93615.96  snb/thresh=100M/ext4:jsize=8-1dd-1-3.2.0-rc1
               106210.30       -11.9%     93573.97  snb/thresh=100M/ext4:jsize=8-1dd-2-3.2.0-rc1
                95406.63        -2.4%     93129.65  snb/thresh=100M/ext4:jsize=8-1dd-3-3.2.0-rc1
                89037.60        +0.5%     89483.46  snb/thresh=2G/ext4-100dd-1-3.2.0-rc1
                88633.10        +8.8%     96438.86  snb/thresh=2G/ext4-100dd-2-3.2.0-rc1
               105652.96       -11.6%     93410.17  snb/thresh=2G/ext4-10dd-1-3.2.0-rc1
                91111.88       +16.7%    106299.99  snb/thresh=2G/ext4-10dd-2-3.2.0-rc1
                90829.10       +16.4%    105759.23  snb/thresh=2G/ext4-10dd-3-3.2.0-rc1
               108116.09       -11.4%     95782.88  snb/thresh=2G/ext4-1dd-1-3.2.0-rc1
               108440.24        +0.1%    108508.84  snb/thresh=2G/ext4-1dd-2-3.2.0-rc1
                95097.15        +9.4%    104081.54  snb/thresh=2G/ext4-1dd-3-3.2.0-rc1
               100816.94        -9.9%     90836.14  snb/thresh=2G/ext4:jsize=8-100dd-1-3.2.0-rc1
                92018.44        -3.7%     88608.54  snb/thresh=2G/ext4:jsize=8-100dd-2-3.2.0-rc1
                91911.35        +4.0%     95545.77  snb/thresh=2G/ext4:jsize=8-10dd-1-3.2.0-rc1
               105958.52       -11.2%     94135.56  snb/thresh=2G/ext4:jsize=8-10dd-2-3.2.0-rc1
                94454.87        +1.7%     96052.07  snb/thresh=2G/ext4:jsize=8-1dd-1-3.2.0-rc1
                94129.58        +2.1%     96079.46  snb/thresh=2G/ext4:jsize=8-1dd-2-3.2.0-rc1
               107918.43       -11.6%     95372.06  snb/thresh=2G/ext4:jsize=8-1dd-3-3.2.0-rc1
                93440.85        -7.4%     86511.28  snb/thresh=8G/ext4-100dd-1-3.2.0-rc1
                90258.33        -2.9%     87628.67  snb/thresh=8G/ext4-100dd-2-3.2.0-rc1
               106366.23       -13.6%     91884.53  snb/thresh=8G/ext4-10dd-1-3.2.0-rc1
                92778.89        -0.6%     92213.84  snb/thresh=8G/ext4-10dd-2-3.2.0-rc1
                91049.68       +15.8%    105478.65  snb/thresh=8G/ext4-10dd-3-3.2.0-rc1
                94623.79        +1.1%     95704.38  snb/thresh=8G/ext4-1dd-1-3.2.0-rc1
                99052.74        -1.0%     98097.38  snb/thresh=8G/ext4-1dd-2-3.2.0-rc1
               101694.01        -6.5%     95083.58  snb/thresh=8G/ext4-1dd-3-3.2.0-rc1
               104697.47        -9.7%     94582.62  snb/thresh=8G/ext4:jsize=8-100dd-1-3.2.0-rc1
                99339.28        -3.1%     96299.59  snb/thresh=8G/ext4:jsize=8-100dd-2-3.2.0-rc1
                92131.87        +6.6%     98239.08  snb/thresh=8G/ext4:jsize=8-10dd-1-3.2.0-rc1
               106453.21       -10.0%     95858.09  snb/thresh=8G/ext4:jsize=8-10dd-2-3.2.0-rc1
                95807.93        -3.1%     92818.24  snb/thresh=8G/ext4:jsize=8-10dd-3-3.2.0-rc1
                95341.79       +13.5%    108206.19  snb/thresh=8G/ext4:jsize=8-1dd-1-3.2.0-rc1
                94420.66        +2.6%     96839.02  snb/thresh=8G/ext4:jsize=8-1dd-2-3.2.0-rc1
                93483.45        +7.3%    100283.10  snb/thresh=8G/ext4:jsize=8-1dd-3-3.2.0-rc1
              5764378.71        -0.3%   5744435.53  TOTAL io_wkB_s

wfg@bee /export/writeback% ./compare.rb -g ext3 -e io_wkB_s snb/*/{*-3.2.0-rc1,*-3.2.0-rc1-bg-all+}
               3.2.0-rc1         3.2.0-rc1-bg-all+
------------------------  ------------------------
                67942.37       +10.7%     75241.90  snb/thresh=1000M/ext3-100dd-1-3.2.0-rc1
                70086.88        +9.1%     76474.81  snb/thresh=1000M/ext3-100dd-2-3.2.0-rc1
                70151.68       +14.8%     80558.96  snb/thresh=1000M/ext3-10dd-1-3.2.0-rc1
                73726.08        +7.7%     79387.64  snb/thresh=1000M/ext3-10dd-2-3.2.0-rc1
                74274.58        +8.2%     80379.63  snb/thresh=1000M/ext3-10dd-3-3.2.0-rc1
                77488.67        -3.9%     74429.64  snb/thresh=1000M/ext3-1dd-1-3.2.0-rc1
                87472.78        -5.3%     82842.31  snb/thresh=1000M/ext3-1dd-2-3.2.0-rc1
                80476.31        +2.6%     82599.50  snb/thresh=1000M/ext3-1dd-3-3.2.0-rc1
                74607.08       -21.1%     58885.80  snb/thresh=1000M/ext3:jsize=8-100dd-1-3.2.0-rc1
                66799.19       -14.1%     57388.42  snb/thresh=1000M/ext3:jsize=8-100dd-2-3.2.0-rc1
                70984.70        +1.4%     72010.71  snb/thresh=1000M/ext3:jsize=8-10dd-1-3.2.0-rc1
                79739.20       -10.3%     71495.41  snb/thresh=1000M/ext3:jsize=8-10dd-2-3.2.0-rc1
                73494.35        -2.1%     71919.37  snb/thresh=1000M/ext3:jsize=8-10dd-3-3.2.0-rc1
                82797.80        -9.2%     75185.91  snb/thresh=1000M/ext3:jsize=8-1dd-1-3.2.0-rc1
                73225.25        -2.3%     71531.99  snb/thresh=1000M/ext3:jsize=8-1dd-2-3.2.0-rc1
                80920.52       -11.5%     71591.89  snb/thresh=1000M/ext3:jsize=8-1dd-3-3.2.0-rc1
                55866.74       +14.5%     63983.45  snb/thresh=100M/ext3-100dd-1-3.2.0-rc1
                63516.92       -10.8%     56657.25  snb/thresh=100M/ext3-100dd-2-3.2.0-rc1
                79772.17        +0.2%     79962.32  snb/thresh=100M/ext3-10dd-1-3.2.0-rc1
                71211.04        +3.2%     73469.61  snb/thresh=100M/ext3-10dd-2-3.2.0-rc1
                75960.20        +2.3%     77684.90  snb/thresh=100M/ext3-10dd-3-3.2.0-rc1
                74090.84       +12.2%     83154.21  snb/thresh=100M/ext3-1dd-1-3.2.0-rc1
                82997.81        -0.3%     82755.64  snb/thresh=100M/ext3-1dd-2-3.2.0-rc1
                73987.01       +10.0%     81398.04  snb/thresh=100M/ext3-1dd-3-3.2.0-rc1
                60477.45       -22.0%     47175.66  snb/thresh=100M/ext3:jsize=8-100dd-1-3.2.0-rc1
                51895.73        -9.1%     47196.06  snb/thresh=100M/ext3:jsize=8-100dd-2-3.2.0-rc1
                70934.34       -11.1%     63057.75  snb/thresh=100M/ext3:jsize=8-10dd-1-3.2.0-rc1
                73631.00       -13.8%     63444.69  snb/thresh=100M/ext3:jsize=8-10dd-2-3.2.0-rc1
                70808.72        -9.7%     63915.61  snb/thresh=100M/ext3:jsize=8-10dd-3-3.2.0-rc1
                81741.64        -8.0%     75201.74  snb/thresh=100M/ext3:jsize=8-1dd-1-3.2.0-rc1
                73874.82        -6.8%     68886.61  snb/thresh=100M/ext3:jsize=8-1dd-2-3.2.0-rc1
                82993.12       -16.5%     69303.45  snb/thresh=100M/ext3:jsize=8-1dd-3-3.2.0-rc1
                68866.74        -0.8%     68311.58  snb/thresh=2G/ext3-100dd-1-3.2.0-rc1
                67564.11       +13.6%     76719.46  snb/thresh=2G/ext3-100dd-2-3.2.0-rc1
                75070.89        -5.4%     71030.23  snb/thresh=2G/ext3-10dd-1-3.2.0-rc1
                81434.15        -2.7%     79194.71  snb/thresh=2G/ext3-10dd-2-3.2.0-rc1
                73935.70        +7.7%     79592.27  snb/thresh=2G/ext3-10dd-3-3.2.0-rc1
                79820.91        -7.2%     74061.53  snb/thresh=2G/ext3-1dd-1-3.2.0-rc1
                77685.01        -2.1%     76026.55  snb/thresh=2G/ext3-1dd-2-3.2.0-rc1
                85991.41        -9.9%     77493.53  snb/thresh=2G/ext3-1dd-3-3.2.0-rc1
                67874.30        -0.5%     67532.29  snb/thresh=2G/ext3:jsize=8-100dd-1-3.2.0-rc1
                68509.10        -4.4%     65490.60  snb/thresh=2G/ext3:jsize=8-100dd-2-3.2.0-rc1
                70306.67        +2.4%     72015.48  snb/thresh=2G/ext3:jsize=8-10dd-1-3.2.0-rc1
                79750.15       -11.7%     70393.85  snb/thresh=2G/ext3:jsize=8-10dd-2-3.2.0-rc1
                77675.53        -8.9%     70737.37  snb/thresh=2G/ext3:jsize=8-10dd-3-3.2.0-rc1
                79488.16        -8.0%     73142.17  snb/thresh=2G/ext3:jsize=8-1dd-1-3.2.0-rc1
                74246.27        +0.8%     74833.84  snb/thresh=2G/ext3:jsize=8-1dd-2-3.2.0-rc1
                72685.30        +1.7%     73896.29  snb/thresh=2G/ext3:jsize=8-1dd-3-3.2.0-rc1
                77685.67       -13.3%     67340.38  snb/thresh=8G/ext3-100dd-1-3.2.0-rc1
                68913.88        -2.0%     67541.52  snb/thresh=8G/ext3-100dd-2-3.2.0-rc1
                77486.78        -9.4%     70175.15  snb/thresh=8G/ext3-10dd-1-3.2.0-rc1
                83617.81       -13.4%     72390.80  snb/thresh=8G/ext3-10dd-2-3.2.0-rc1
                73818.10        -5.3%     69890.46  snb/thresh=8G/ext3-10dd-3-3.2.0-rc1
                85714.30       -14.3%     73452.90  snb/thresh=8G/ext3-1dd-1-3.2.0-rc1
                76820.46        -4.7%     73246.99  snb/thresh=8G/ext3-1dd-2-3.2.0-rc1
                86629.27       -15.6%     73124.34  snb/thresh=8G/ext3-1dd-3-3.2.0-rc1
                62348.10       +12.7%     70243.05  snb/thresh=8G/ext3:jsize=8-100dd-1-3.2.0-rc1
                68753.14        -6.8%     64044.01  snb/thresh=8G/ext3:jsize=8-100dd-2-3.2.0-rc1
                70475.06        +1.7%     71693.23  snb/thresh=8G/ext3:jsize=8-10dd-1-3.2.0-rc1
                69322.48        +1.2%     70162.89  snb/thresh=8G/ext3:jsize=8-10dd-2-3.2.0-rc1
                78733.88        -8.0%     72443.10  snb/thresh=8G/ext3:jsize=8-10dd-3-3.2.0-rc1
                72872.62       +10.8%     80718.93  snb/thresh=8G/ext3:jsize=8-1dd-1-3.2.0-rc1
                76446.68        -2.5%     74534.80  snb/thresh=8G/ext3:jsize=8-1dd-2-3.2.0-rc1
                72461.74        +3.7%     75163.91  snb/thresh=8G/ext3:jsize=8-1dd-3-3.2.0-rc1
              4750951.34        -3.3%   4595809.11  TOTAL io_wkB_s

wfg@bee /export/writeback% ./compare.rb -g ext3- -e io_wkB_s snb/*/{*-3.2.0-rc1,*-3.2.0-rc1-bg-all+}
               3.2.0-rc1         3.2.0-rc1-bg-all+  
------------------------  ------------------------  
                67942.37       +10.7%     75241.90  snb/thresh=1000M/ext3-100dd-1-3.2.0-rc1
                70086.88        +9.1%     76474.81  snb/thresh=1000M/ext3-100dd-2-3.2.0-rc1
                70151.68       +14.8%     80558.96  snb/thresh=1000M/ext3-10dd-1-3.2.0-rc1
                73726.08        +7.7%     79387.64  snb/thresh=1000M/ext3-10dd-2-3.2.0-rc1
                74274.58        +8.2%     80379.63  snb/thresh=1000M/ext3-10dd-3-3.2.0-rc1
                77488.67        -3.9%     74429.64  snb/thresh=1000M/ext3-1dd-1-3.2.0-rc1
                87472.78        -5.3%     82842.31  snb/thresh=1000M/ext3-1dd-2-3.2.0-rc1
                80476.31        +2.6%     82599.50  snb/thresh=1000M/ext3-1dd-3-3.2.0-rc1
                55866.74       +14.5%     63983.45  snb/thresh=100M/ext3-100dd-1-3.2.0-rc1
                63516.92       -10.8%     56657.25  snb/thresh=100M/ext3-100dd-2-3.2.0-rc1
                79772.17        +0.2%     79962.32  snb/thresh=100M/ext3-10dd-1-3.2.0-rc1
                71211.04        +3.2%     73469.61  snb/thresh=100M/ext3-10dd-2-3.2.0-rc1
                75960.20        +2.3%     77684.90  snb/thresh=100M/ext3-10dd-3-3.2.0-rc1
                74090.84       +12.2%     83154.21  snb/thresh=100M/ext3-1dd-1-3.2.0-rc1
                82997.81        -0.3%     82755.64  snb/thresh=100M/ext3-1dd-2-3.2.0-rc1
                73987.01       +10.0%     81398.04  snb/thresh=100M/ext3-1dd-3-3.2.0-rc1
                68866.74        -0.8%     68311.58  snb/thresh=2G/ext3-100dd-1-3.2.0-rc1
                67564.11       +13.6%     76719.46  snb/thresh=2G/ext3-100dd-2-3.2.0-rc1
                75070.89        -5.4%     71030.23  snb/thresh=2G/ext3-10dd-1-3.2.0-rc1
                81434.15        -2.7%     79194.71  snb/thresh=2G/ext3-10dd-2-3.2.0-rc1
                73935.70        +7.7%     79592.27  snb/thresh=2G/ext3-10dd-3-3.2.0-rc1
                79820.91        -7.2%     74061.53  snb/thresh=2G/ext3-1dd-1-3.2.0-rc1
                77685.01        -2.1%     76026.55  snb/thresh=2G/ext3-1dd-2-3.2.0-rc1
                85991.41        -9.9%     77493.53  snb/thresh=2G/ext3-1dd-3-3.2.0-rc1
                77685.67       -13.3%     67340.38  snb/thresh=8G/ext3-100dd-1-3.2.0-rc1
                68913.88        -2.0%     67541.52  snb/thresh=8G/ext3-100dd-2-3.2.0-rc1
                77486.78        -9.4%     70175.15  snb/thresh=8G/ext3-10dd-1-3.2.0-rc1
                83617.81       -13.4%     72390.80  snb/thresh=8G/ext3-10dd-2-3.2.0-rc1
                73818.10        -5.3%     69890.46  snb/thresh=8G/ext3-10dd-3-3.2.0-rc1
                85714.30       -14.3%     73452.90  snb/thresh=8G/ext3-1dd-1-3.2.0-rc1
                76820.46        -4.7%     73246.99  snb/thresh=8G/ext3-1dd-2-3.2.0-rc1
                86629.27       -15.6%     73124.34  snb/thresh=8G/ext3-1dd-3-3.2.0-rc1
              2420077.26        -0.8%   2400572.22  TOTAL io_wkB_s

wfg@bee /export/writeback% ./compare.rb -g ext4- -e io_wkB_s snb/*/{*-3.2.0-rc1,*-3.2.0-rc1-bg-all+}
               3.2.0-rc1         3.2.0-rc1-bg-all+  
------------------------  ------------------------  
                84473.27       +14.6%     96810.30  snb/thresh=1000M/ext4-100dd-1-3.2.0-rc1
                96971.63        -0.2%     96817.43  snb/thresh=1000M/ext4-100dd-2-3.2.0-rc1
                93518.20       +13.0%    105681.79  snb/thresh=1000M/ext4-10dd-1-3.2.0-rc1
                91381.67       +15.4%    105428.50  snb/thresh=1000M/ext4-10dd-2-3.2.0-rc1
               105734.03        -7.6%     97746.25  snb/thresh=1000M/ext4-10dd-3-3.2.0-rc1
               106967.36        -9.7%     96610.37  snb/thresh=1000M/ext4-1dd-1-3.2.0-rc1
               105721.10        +0.1%    105821.17  snb/thresh=1000M/ext4-1dd-2-3.2.0-rc1
                93144.96       +15.1%    107163.72  snb/thresh=1000M/ext4-1dd-3-3.2.0-rc1
                67377.61       +13.9%     76717.96  snb/thresh=100M/ext4-100dd-1-3.2.0-rc1
                76879.86        -2.9%     74629.73  snb/thresh=100M/ext4-100dd-2-3.2.0-rc1
                81428.25       +13.0%     92024.19  snb/thresh=100M/ext4-10dd-1-3.2.0-rc1
                81262.62        +6.9%     86877.99  snb/thresh=100M/ext4-10dd-2-3.2.0-rc1
                92194.34        -9.5%     83411.11  snb/thresh=100M/ext4-10dd-3-3.2.0-rc1
                94225.13       +12.7%    106187.25  snb/thresh=100M/ext4-1dd-1-3.2.0-rc1
                97244.07        -0.5%     96772.82  snb/thresh=100M/ext4-1dd-2-3.2.0-rc1
                91910.86        +7.9%     99214.46  snb/thresh=100M/ext4-1dd-3-3.2.0-rc1
                89037.60        +0.5%     89483.46  snb/thresh=2G/ext4-100dd-1-3.2.0-rc1
                88633.10        +8.8%     96438.86  snb/thresh=2G/ext4-100dd-2-3.2.0-rc1
               105652.96       -11.6%     93410.17  snb/thresh=2G/ext4-10dd-1-3.2.0-rc1
                91111.88       +16.7%    106299.99  snb/thresh=2G/ext4-10dd-2-3.2.0-rc1
                90829.10       +16.4%    105759.23  snb/thresh=2G/ext4-10dd-3-3.2.0-rc1
               108116.09       -11.4%     95782.88  snb/thresh=2G/ext4-1dd-1-3.2.0-rc1
               108440.24        +0.1%    108508.84  snb/thresh=2G/ext4-1dd-2-3.2.0-rc1
                95097.15        +9.4%    104081.54  snb/thresh=2G/ext4-1dd-3-3.2.0-rc1
                93440.85        -7.4%     86511.28  snb/thresh=8G/ext4-100dd-1-3.2.0-rc1
                90258.33        -2.9%     87628.67  snb/thresh=8G/ext4-100dd-2-3.2.0-rc1
               106366.23       -13.6%     91884.53  snb/thresh=8G/ext4-10dd-1-3.2.0-rc1
                92778.89        -0.6%     92213.84  snb/thresh=8G/ext4-10dd-2-3.2.0-rc1
                91049.68       +15.8%    105478.65  snb/thresh=8G/ext4-10dd-3-3.2.0-rc1
                94623.79        +1.1%     95704.38  snb/thresh=8G/ext4-1dd-1-3.2.0-rc1
                99052.74        -1.0%     98097.38  snb/thresh=8G/ext4-1dd-2-3.2.0-rc1
               101694.01        -6.5%     95083.58  snb/thresh=8G/ext4-1dd-3-3.2.0-rc1
              3006617.60        +2.5%   3080282.33  TOTAL io_wkB_s
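
For reference, the delta columns above are plain relative differences; the
TOTAL row is assumed here to compare the sums of the two columns (that is an
assumption about compare.rb's aggregation, not taken from the tool itself).
A small self-contained C sketch of the computation, reusing two ext4 rows:

/* illustration only -- reproduces the "+x.y%" columns and a TOTAL row */
#include <stdio.h>

struct sample {
	const char *name;
	double base;		/* io_wkB_s for 3.2.0-rc1 */
	double patched;		/* io_wkB_s for 3.2.0-rc1-bg-all+ */
};

int main(void)
{
	struct sample s[] = {
		{ "snb/thresh=1000M/ext4-100dd-1", 84473.27, 96810.30 },
		{ "snb/thresh=1000M/ext4-100dd-2", 96971.63, 96817.43 },
	};
	double base_sum = 0, patched_sum = 0;

	for (unsigned i = 0; i < sizeof(s) / sizeof(s[0]); i++) {
		double delta = (s[i].patched - s[i].base) / s[i].base * 100;

		printf("%12.2f  %+6.1f%% %12.2f  %s\n",
		       s[i].base, delta, s[i].patched, s[i].name);
		base_sum += s[i].base;
		patched_sum += s[i].patched;
	}
	printf("%12.2f  %+6.1f%% %12.2f  TOTAL io_wkB_s\n", base_sum,
	       (patched_sum - base_sum) / base_sum * 100, patched_sum);
	return 0;
}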

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-12-05  8:02                                         ` Wu Fengguang
@ 2011-12-07 10:13                                           ` Jan Kara
  2011-12-07 11:45                                             ` Wu Fengguang
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2011-12-07 10:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Christoph Hellwig,
	Dave Chinner

  Hello Fengguang,

On Mon 05-12-11 16:02:43, Wu Fengguang wrote:
> On Thu, Nov 10, 2011 at 10:50:44PM +0800, Jan Kara wrote:
> > On Wed 09-11-11 00:52:07, Jan Kara wrote:
> > > > wfg@bee /export/writeback% ./compare.rb -v jsize -e io_wkB_s thresh*/*-ioless-full-next-20111102+ thresh*/*-20111102+
> > > > 3.1.0-ioless-full-next-20111102+  3.1.0-ioless-full-bg-all-next-20111102+
> > > > ------------------------  ------------------------
> > > >                 36231.89        -3.8%     34855.10  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
> > > >                 41115.07       -12.7%     35886.36  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
> > > >                 48025.75       -14.3%     41146.57  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
> > > >                 47684.35        -6.4%     44644.30  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
> > > >                 54015.86        -4.0%     51851.01  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
> > > >                 55320.03        -2.6%     53867.63  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
> > > >                 37400.51        +1.6%     38012.57  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> > > >                 45317.31        -4.5%     43272.16  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> > > >                 40552.64        +0.8%     40884.60  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> > > >                 44271.29        -5.6%     41789.76  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> > > >                 54334.22        -3.5%     52435.69  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> > > >                 52563.67        -6.1%     49341.84  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> > > >                 45027.95        -1.0%     44599.37  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
> > > >                 42478.40        +0.3%     42608.48  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
> > > >                 35178.47        -0.2%     35103.56  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
> > > >                 54079.64        -0.5%     53834.85  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
> > > >                 49982.11        -0.4%     49803.44  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
> > > >                783579.17        -3.8%    753937.28  TOTAL io_wkB_s
> > >   Here I can see some noticeable drops in the realistic thresh=100M case
> > > (the thresh=1000M case is unrealistic, but it still surprises me that there
> > > are drops there as well). I'll try to reproduce your results so that I can
> > > look into this more effectively.
> >   So I've run a test on a machine with 1G of memory, thresh=184M (so
> > something similar to your 4G-1G test). I used tiobench with 10 threads,
> > each thread writing a 1.6G file. I ran the test 10 times to get an idea
> > of the fluctuations. The result is:
> >   without patch			with patch
> >    AVG         STDDEV      AVG         STDDEV
> > 199.884820 +- 1.32268	200.466003 +- 0.377405
> > 
> > The numbers are time-to-completion, so lower is better. Summary: no
> > statistically meaningful difference. I'll run more tests with different
> > dirty thresholds to see whether I can observe some difference there...
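
For reference, AVG +- STDDEV pairs like the ones quoted above can be produced
with a few lines of self-contained C; the ten run times below are invented for
the example, and whether the original figures use the sample (n-1) or the
population (n) standard deviation is an assumption:

/* mean and sample standard deviation of repeated runs (illustration only) */
#include <stdio.h>
#include <math.h>

int main(void)
{
	double runs[] = { 199.1, 201.3, 198.7, 200.2, 199.9,
			  201.0, 198.5, 200.8, 199.6, 200.5 };
	int n = sizeof(runs) / sizeof(runs[0]);
	double sum = 0, sumsq = 0;

	for (int i = 0; i < n; i++) {
		sum += runs[i];
		sumsq += runs[i] * runs[i];
	}
	double avg = sum / n;
	double stddev = sqrt((sumsq - n * avg * avg) / (n - 1));

	printf("%f +- %f\n", avg, stddev);	/* build with -lm */
	return 0;
}
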
> 
> I carried out some tests on ext3/ext4 before/after the patch. Most tests
> were repeated 3 times so as to get an idea of the variations.
> 
> The ":jsize=8" notation in the test names means "-J size=8" was used at
> mkfs time.
> 
> The overall deltas are -0.3% for ext4 and -3.3% for ext3.  I noticed
> that the regressions mostly happen in the "-J size=8" cases.  For a
> normal mkfs, ext4 actually sees a +2.5% increase and ext3 sees only a
> -0.8% drop.
> 
> I don't see any misbehavior in the graphs.
> So in general I think the test results are acceptable.
  Thanks for running the tests. I looked through the results and, given the
variation, I would be happy with them. Will you merge the patch or should I
resend it?

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-12-07 10:13                                           ` Jan Kara
@ 2011-12-07 11:45                                             ` Wu Fengguang
  0 siblings, 0 replies; 60+ messages in thread
From: Wu Fengguang @ 2011-12-07 11:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig, Dave Chinner

Jan,

> > I don't see any misbehavior in the graphs.
> > So in general I think the test results are acceptable.
>   Thanks for running the tests. I looked through the results and given the
> variation I would be happy with them. Will you merge the patch or should I
> resend it?

The patch I tested is this one. It's obviously better than w/ patch.

Looking back a bit more, the change is related to the old topic of
"how can the flush order align better with the LRU order". The general
idea was to make background work behave more like kupdate work, which I
still think is a good idea. However, the patch I proposed at the time
had some unpleasant/convoluted "goto"s and didn't make it in eventually.
The code was then left in this bad state...
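
To make the "behave more like kupdate work" point concrete: kupdate-style
writeback only takes inodes whose dirty timestamp has expired, while the
change below lets background writeback take everything dirtied up to "now".
A minimal userspace sketch of the two cutoff policies; all names and values
here are invented for illustration and this is not the kernel code:

/* Illustration only: which dirty inodes one writeback pass would pick up.
 * The kernel uses wrap-safe jiffies comparisons and real list handling;
 * plain integers are enough to show the policy difference. */
#include <stdio.h>
#include <stdbool.h>

#define HZ			1000
#define DIRTY_EXPIRE_JIFFIES	(30 * HZ)	/* ~30s, like the default */

struct demo_inode {
	const char *name;
	unsigned long dirtied_when;		/* "jiffies" when dirtied */
};

static bool eligible(const struct demo_inode *i, unsigned long oldest_jif)
{
	/* an inode qualifies when it was dirtied at or before the cutoff */
	return i->dirtied_when <= oldest_jif;
}

int main(void)
{
	unsigned long now = 100 * HZ;
	struct demo_inode inodes[] = {
		{ "old-file",   now - 60 * HZ },	/* expired long ago */
		{ "young-file", now - 1 * HZ },		/* dirtied a second ago */
	};
	unsigned long kupdate_cutoff = now - DIRTY_EXPIRE_JIFFIES;
	unsigned long background_cutoff = now;	/* include all dirty inodes */

	for (unsigned k = 0; k < sizeof(inodes) / sizeof(inodes[0]); k++)
		printf("%-10s kupdate:%d background:%d\n", inodes[k].name,
		       eligible(&inodes[k], kupdate_cutoff),
		       eligible(&inodes[k], background_cutoff));
	return 0;
}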

I'll merge this patch.

Thanks,
Fengguang
---
Subject: writeback: Include all dirty inodes in background writeback
Date: Wed, 19 Oct 2011 11:44:41 +0200

From: Jan Kara <jack@suse.cz>

The current livelock avoidance code makes background work include only inodes
that were dirtied before background writeback started. However, background
writeback can run for a long time, and excluding newly dirtied inodes can thus
eventually exclude a significant portion of the dirty inodes, making
background writeback inefficient. Since background writeback avoids livelocking
the flusher thread by yielding to any other work, there is no real reason why
background work should not include all dirty inodes, so change the logic in
wb_writeback().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-10-31 00:14:14.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-31 21:59:37.000000000 +0800
@@ -780,11 +780,17 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh(wb->bdi))
 			break;
 
+		/*
+		 * Kupdate and background works are special and we want to
+		 * include all inodes that need writing. Livelock avoidance is
+		 * handled by these works yielding to any other work so we are
+		 * safe.
+		 */
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			work->older_than_this = &oldest_jif;
-		}
+		} else if (work->for_background)
+			oldest_jif = jiffies;
 
 		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
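
As a side note on the livelock argument in the changelog above: the safety
comes from background work giving up its pass as soon as any other work item
is queued, rather than from the dirtied-before cutoff. A compact, purely
illustrative userspace sketch of that shape (all names invented, not the
kernel code):

/* Illustration only: a background pass that yields to other queued work
 * instead of running to completion, and simply resumes on a later pass. */
#include <stdio.h>
#include <stdbool.h>

static bool other_work_pending;		/* stand-in for bdi->work_list */
static int dirty_inodes = 5;		/* stand-in for the dirty lists */

static long background_pass(void)
{
	long written = 0;

	while (dirty_inodes > 0) {
		if (other_work_pending)	/* yield: let sync/kupdate/... run */
			break;
		dirty_inodes--;		/* "write back" one inode */
		written++;
		if (written == 2)	/* pretend a sync work arrives now */
			other_work_pending = true;
	}
	return written;
}

int main(void)
{
	printf("first pass wrote %ld inodes\n", background_pass());
	other_work_pending = false;	/* the other work has finished */
	printf("second pass wrote %ld inodes\n", background_pass());
	return 0;
}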

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2011-12-07 11:55 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-08  0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-09-08  1:22   ` Wu Fengguang
2011-09-08 15:03     ` Jan Kara
2011-09-18 14:07       ` Wu Fengguang
2011-10-05 17:39         ` Jan Kara
2011-10-07 13:43           ` Wu Fengguang
2011-10-07 14:22             ` Jan Kara
2011-10-07 14:29               ` Wu Fengguang
2011-10-07 14:45                 ` Jan Kara
2011-10-07 15:29                   ` Wu Fengguang
2011-10-08  4:00                   ` Wu Fengguang
2011-10-08 11:52                     ` Wu Fengguang
2011-10-08 13:49                       ` Wu Fengguang
2011-10-09  0:27                         ` Wu Fengguang
2011-10-09  8:44                           ` Wu Fengguang
2011-10-10 11:21                     ` Jan Kara
2011-10-10 11:31                       ` Wu Fengguang
2011-10-10 23:30                         ` Jan Kara
2011-10-11  2:36                           ` Wu Fengguang
2011-10-11 21:53                             ` Jan Kara
2011-10-12  2:44                               ` Wu Fengguang
2011-10-12 19:34                                 ` Jan Kara
2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
2011-09-08 13:49   ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2011-10-12 20:57 [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Jan Kara
2011-10-12 20:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-10-13 14:26   ` Wu Fengguang
2011-10-13 20:13     ` Jan Kara
2011-10-14  7:18       ` Christoph Hellwig
2011-10-14 19:31         ` Chris Mason
     [not found]     ` <20111013143939.GA9691@localhost>
2011-10-13 20:18       ` Jan Kara
2011-10-14 16:00         ` Wu Fengguang
2011-10-14 16:28           ` Wu Fengguang
2011-10-18  0:51             ` Jan Kara
2011-10-18 14:35               ` Wu Fengguang
2011-10-19 11:56                 ` Jan Kara
2011-10-19 13:25                   ` Wu Fengguang
2011-10-19 13:30                   ` Wu Fengguang
2011-10-19 13:35                   ` Wu Fengguang
2011-10-20 12:09                   ` Wu Fengguang
2011-10-20 12:33                     ` Wu Fengguang
2011-10-20 13:39                       ` Wu Fengguang
2011-10-20 22:26                         ` Jan Kara
2011-10-22  4:20                           ` Wu Fengguang
2011-10-24 15:45                             ` Jan Kara
     [not found]                           ` <20111027063133.GA10146@localhost>
2011-10-27 20:31                             ` Jan Kara
     [not found]                               ` <20111101134231.GA31718@localhost>
2011-11-01 21:53                                 ` Jan Kara
2011-11-02 17:25                                   ` Wu Fengguang
     [not found]                               ` <20111102185603.GA4034@localhost>
2011-11-03  1:51                                 ` Jan Kara
2011-11-03 14:52                                   ` Wu Fengguang
     [not found]                                   ` <20111104152054.GA11577@localhost>
2011-11-08 23:52                                     ` Jan Kara
2011-11-09 13:51                                       ` Wu Fengguang
2011-11-10 14:50                                       ` Jan Kara
2011-12-05  8:02                                         ` Wu Fengguang
2011-12-07 10:13                                           ` Jan Kara
2011-12-07 11:45                                             ` Wu Fengguang
     [not found]                           ` <20111027064745.GA14017@localhost>
2011-10-27 20:50                             ` Jan Kara
2011-10-20  9:46               ` Christoph Hellwig
2011-10-20 15:32                 ` Jan Kara
2011-10-15 12:41           ` Wu Fengguang
