linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Theodore Tso <tytso@mit.edu>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Chris Mason <chris.mason@oracle.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	"Li, Shaohua" <shaohua.li@intel.com>,
	Myklebust Trond <Trond.Myklebust@netapp.com>,
	"jens.axboe@oracle.com" <jens.axboe@oracle.com>,
	Nick Piggin <npiggin@suse.de>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Richard Kennedy <richard@rsk.demon.co.uk>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
Date: Tue, 13 Oct 2009 20:12:14 +0200	[thread overview]
Message-ID: <20091013181214.GA31440@duck.suse.cz> (raw)
In-Reply-To: <20091013032405.GA20405@localhost>

On Tue 13-10-09 11:24:05, Wu Fengguang wrote:
> On Tue, Oct 13, 2009 at 05:18:38AM +0800, Jan Kara wrote:
> > On Sun 11-10-09 05:33:39, Wu Fengguang wrote:
> > > On Fri, Oct 09, 2009 at 11:12:31PM +0800, Jan Kara wrote:
> > > > > +			/* don't wait if we've done enough */
> > > > > +			if (pages_written >= write_chunk)
> > > > > +				break;
> > > > >  		}
> > > > > -
> > > > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > > -			break;
> > > > > -		if (pages_written >= write_chunk)
> > > > > -			break;		/* We've done our duty */
> > > > > -
> > > >   Here, we had an opportunity to break from the loop even if we didn't
> > > > manage to write everything (for example because per-bdi thread managed to
> > > > write enough or because enough IO has completed while we were trying to
> > > > write). After the patch, we will sleep. IMHO that's not good...
> > > 
> > > Note that the pages_written check is moved several lines up in the patch :)
> > > 
> > > >   I'd think that if we did all that work in writeback_inodes_wbc we could
> > > > spend the effort on regetting and rechecking the stats...
> > > 
> > > Yes maybe. I didn't care it because the later throttle queue patch totally
> > > removed the loop and hence to need to reget the stats :)
> >   Yes, since the loop gets removed in the end, this does not matter. You
> > are right.
> 
> You are right too :) I followed you and Peter's advice to do the loop
> and the recheck of stats as follows:
> 
> static void balance_dirty_pages(struct address_space *mapping,
> 				unsigned long write_chunk)
> {
> 	long nr_reclaimable, bdi_nr_reclaimable;
> 	long nr_writeback, bdi_nr_writeback;
> 	unsigned long background_thresh;
> 	unsigned long dirty_thresh;
> 	unsigned long bdi_thresh;
> 	int dirty_exceeded;
> 	struct backing_dev_info *bdi = mapping->backing_dev_info;
> 
> 	/*
> 	 * If sync() is in progress, curb the to-be-synced inodes regardless
> 	 * of dirty limits, so that a fast dirtier won't livelock the sync.
> 	 */
> 	if (unlikely(bdi->sync_time &&
> 		     S_ISREG(mapping->host->i_mode) &&
> 		     time_after_eq(bdi->sync_time,
> 				   mapping->host->dirtied_when))) {
> 		write_chunk *= 2;
> 		bdi_writeback_wait(bdi, write_chunk);
> 	}
> 
> 	for (;;) {
> 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> 				 global_page_state(NR_UNSTABLE_NFS);
> 		nr_writeback = global_page_state(NR_WRITEBACK) +
> 			       global_page_state(NR_WRITEBACK_TEMP);
> 
> 		global_dirty_thresh(&background_thresh, &dirty_thresh);
> 
> 		/*
> 		 * Throttle it only when the background writeback cannot
> 		 * catch-up. This avoids (excessively) small writeouts
> 		 * when the bdi limits are ramping up.
> 		 */
> 		if (nr_reclaimable + nr_writeback <
> 		    (background_thresh + dirty_thresh) / 2)
> 			break;
> 
> 		bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
> 
> 		/*
> 		 * In order to avoid the stacked BDI deadlock we need
> 		 * to ensure we accurately count the 'dirty' pages when
> 		 * the threshold is low.
> 		 *
> 		 * Otherwise it would be possible to get thresh+n pages
> 		 * reported dirty, even though there are thresh-m pages
> 		 * actually dirty; with m+n sitting in the percpu
> 		 * deltas.
> 		 */
> 		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> 			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> 			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> 		} else {
> 			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> 		}
> 
> 		/*
> 		 * The bdi thresh is somehow "soft" limit derived from the
> 		 * global "hard" limit. The former helps to prevent heavy IO
> 		 * bdi or process from holding back light ones; The latter is
> 		 * the last resort safeguard.
> 		 */
> 		dirty_exceeded =
> 			(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> 			|| (nr_reclaimable + nr_writeback >= dirty_thresh);
> 
> 		if (!dirty_exceeded)
> 			break;
> 
> 		bdi->dirty_exceed_time = jiffies;
> 
> 		bdi_writeback_wait(bdi, write_chunk);
  Hmm, probably you've discussed this in some other email but why do we
cycle in this loop until we get below dirty limit? We used to leave the
loop after writing write_chunk... So the time we spend in
balance_dirty_pages() is no longer limited, right?

> 	}
> 
> 	/*
> 	 * In laptop mode, we wait until hitting the higher threshold before
> 	 * starting background writeout, and then write out all the way down
> 	 * to the lower threshold.  So slow writers cause minimal disk activity.
> 	 *
> 	 * In normal mode, we start background writeout at the lower
> 	 * background_thresh, to keep the amount of dirty memory low.
> 	 */
> 	if (!laptop_mode && (nr_reclaimable > background_thresh) &&
> 	    can_submit_background_writeback(bdi))
> 		bdi_start_writeback(bdi, NULL, WB_FOR_BACKGROUND);
> }
> 
> > > > >  		schedule_timeout_interruptible(pause);
> > > > >
> > > > >  		/*
> > > > > @@ -577,8 +547,7 @@ static void balance_dirty_pages(struct a
> > > > >  			pause = HZ / 10;
> > > > >  	}
> > > > >
> > > > > -	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > > > -			bdi->dirty_exceeded)
> > > > > +	if (!dirty_exceeded && bdi->dirty_exceeded)
> > > > >  		bdi->dirty_exceeded = 0;
> > > >   Here we fail to clear dirty_exceeded if we are over global dirty limit
> > > > but not over per-bdi dirty limit...
> > > 
> > > You must be mistaken: dirty_exceeded = (over bdi limit || over global limit),
> > > so !dirty_exceeded = (!over bdi limit && !over global limit).
> >   Exactly. Previously, the check was:
> > if (!over bdi limit)
> >   bdi->dirty_exceeded = 0;
> > 
> >   Now it is
> > if (!over bdi limit && !over global limit)
> >   bdi->dirty_exceeded = 0;
> > 
> >   That's clearly not equivalent which is what I was trying to point out.
> > But looking at where dirty_exceeded is used, your new way is probably more
> > useful. It's just a bit counterintuitive that bdi->dirty_exceeded is set
> > even if the per-bdi limit is not exceeded...
> 
> Yeah good point. Since the per-bdi limits are more about "soft" limits
> which are derived from the global "hard" limit, the code makes sense
> with some comments and updated changelog :)
> 
>         This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
>         with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
>         to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
>         accurate we don't need to do routinely clip. A simple dirty limit check
>         would be enough.
> 
>         The check is necessary because, in principle we should throttle
>         everything calling balance_dirty_pages() when we're over the total
>         limit, as said by Peter.
> 
>         We now set and clear dirty_exceeded not only based on bdi dirty limits,
>         but also on the global dirty limits. This is a bit counterintuitive, but
>         the global limits are the ultimate goal and shall be always imposed.
> 
>         We may now start background writeback work based on outdated conditions.
>         That's safe because the bdi flush thread will (and have to) double check
>         the states. It reduces overall overheads because the test based on old
>         states still have good chance to be right.
  The new description is good, thanks!

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

  parent reply	other threads:[~2009-10-13 18:12 UTC|newest]

Thread overview: 116+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-10-07  7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
2009-10-07  7:38 ` [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
2009-10-09 15:12   ` Jan Kara
2009-10-09 15:18     ` Peter Zijlstra
2009-10-09 15:47       ` Jan Kara
2009-10-11  2:28         ` Wu Fengguang
2009-10-11  7:44           ` Peter Zijlstra
2009-10-11 10:50             ` Wu Fengguang
2009-10-11 10:58               ` Peter Zijlstra
2009-10-11 11:25               ` Peter Zijlstra
2009-10-12  1:26                 ` Wu Fengguang
2009-10-12  9:07                   ` Peter Zijlstra
2009-10-12  9:24                     ` Wu Fengguang
2009-10-10 21:33     ` Wu Fengguang
2009-10-12 21:18       ` Jan Kara
2009-10-13  3:24         ` Wu Fengguang
2009-10-13  8:41           ` Peter Zijlstra
2009-10-13 18:12           ` Jan Kara [this message]
2009-10-13 18:28             ` Peter Zijlstra
2009-10-14  1:38               ` Wu Fengguang
2009-10-14 11:22                 ` Peter Zijlstra
2009-10-17  5:30                   ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 02/45] writeback: reduce calculation of bdi dirty thresholds Wu Fengguang
2009-10-07  7:38 ` [PATCH 03/45] ext4: remove unused parameter wbc from __ext4_journalled_writepage() Wu Fengguang
2009-10-07  7:38 ` [PATCH 04/45] writeback: remove unused nonblocking and congestion checks Wu Fengguang
2009-10-09 15:26   ` Jan Kara
2009-10-10 13:47     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 05/45] writeback: remove the always false bdi_cap_writeback_dirty() test Wu Fengguang
2009-10-07  7:38 ` [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded Wu Fengguang
2009-10-07  8:53   ` Peter Zijlstra
2009-10-07  9:17     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 07/45] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
2009-10-09 15:45   ` Jan Kara
2009-10-07  7:38 ` [PATCH 08/45] writeback: quit on wrap for .range_cyclic (write_cache_pages) Wu Fengguang
2009-10-07  7:38 ` [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs) Wu Fengguang
2009-10-07 12:32   ` Evgeniy Polyakov
2009-10-07 14:23     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 10/45] writeback: quit on wrap for .range_cyclic (btrfs) Wu Fengguang
2009-10-07  7:38 ` [PATCH 11/45] writeback: quit on wrap for .range_cyclic (cifs) Wu Fengguang
2009-10-07  7:38 ` [PATCH 12/45] writeback: quit on wrap for .range_cyclic (ext4) Wu Fengguang
2009-10-07  7:38 ` [PATCH 13/45] writeback: quit on wrap for .range_cyclic (gfs2) Wu Fengguang
2009-10-07  7:38 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) Wu Fengguang
2009-10-07  7:38 ` [PATCH 15/45] writeback: fix queue_io() ordering Wu Fengguang
2009-10-07  7:38 ` [PATCH 16/45] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
2009-10-07  7:38 ` [PATCH 17/45] writeback: only allow two background writeback works Wu Fengguang
2009-10-07  7:38 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Wu Fengguang
2009-10-08  1:01   ` KAMEZAWA Hiroyuki
2009-10-08  1:58     ` Wu Fengguang
2009-10-08  2:40       ` KAMEZAWA Hiroyuki
2009-10-08  4:01         ` Wu Fengguang
2009-10-08  5:59           ` KAMEZAWA Hiroyuki
2009-10-08  6:07             ` Wu Fengguang
2009-10-08  6:28             ` Wu Fengguang
2009-10-08  6:39               ` KAMEZAWA Hiroyuki
2009-10-08  8:08       ` Peter Zijlstra
2009-10-08  8:11         ` KAMEZAWA Hiroyuki
2009-10-08  8:36         ` Jens Axboe
2009-10-09  2:52           ` [PATCH] writeback: account IO throttling wait as iowait Wu Fengguang
2009-10-09 10:41             ` Jens Axboe
2009-10-09 10:58               ` Wu Fengguang
2009-10-09 11:01                 ` Jens Axboe
2009-10-08  8:05     ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Peter Zijlstra
2009-10-07  7:38 ` [PATCH 19/45] writeback: remove the loop in balance_dirty_pages() Wu Fengguang
2009-10-07  7:38 ` [PATCH 20/45] NFS: introduce writeback wait queue Wu Fengguang
2009-10-07  8:53   ` Peter Zijlstra
2009-10-07  9:07     ` Wu Fengguang
2009-10-07  9:15       ` Peter Zijlstra
2009-10-07  9:19         ` Wu Fengguang
2009-10-07  9:17       ` Nick Piggin
2009-10-07  9:52         ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 21/45] writeback: estimate bdi write bandwidth Wu Fengguang
2009-10-07  8:53   ` Peter Zijlstra
2009-10-07  9:39     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 22/45] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2009-10-07  7:38 ` [PATCH 23/45] writeback: kill space in debugfs item name Wu Fengguang
2009-10-07  7:38 ` [PATCH 24/45] writeback: remove global nr_to_write and use timeout instead Wu Fengguang
2009-10-07  7:38 ` [PATCH 25/45] writeback: convert wbc.nr_to_write to per-file parameter Wu Fengguang
2009-10-07  7:38 ` [PATCH 26/45] block: pass the non-rotational queue flag to backing_dev_info Wu Fengguang
2009-10-07  7:38 ` [PATCH 27/45] writeback: introduce wbc.for_background Wu Fengguang
2009-10-07  7:38 ` [PATCH 28/45] writeback: introduce wbc.nr_segments Wu Fengguang
2009-10-07  7:38 ` [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case Wu Fengguang
2009-10-07 11:57   ` Hugh Dickins
2009-10-07 14:00     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 30/45] vmscan: lumpy pageout Wu Fengguang
2009-10-07  7:38 ` [PATCH 31/45] writeback: sync old inodes first in background writeback Wu Fengguang
2010-07-12  3:01   ` Christoph Hellwig
2010-07-12 15:24     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 32/45] writeback: update kupdate expire timestamp on each scan of b_io Wu Fengguang
2009-10-07  7:38 ` [PATCH 34/45] writeback: sync livelock - kick background writeback Wu Fengguang
2009-10-07  7:38 ` [PATCH 35/45] writeback: sync livelock - use single timestamp for whole sync work Wu Fengguang
2009-10-07  7:38 ` [PATCH 36/45] writeback: sync livelock - curb dirty speed for inodes to be synced Wu Fengguang
2009-10-07  7:38 ` [PATCH 37/45] writeback: use timestamp to indicate dirty exceeded Wu Fengguang
2009-10-07  7:38 ` [PATCH 38/45] writeback: introduce queue b_more_io_wait Wu Fengguang
2009-10-07  7:38 ` [PATCH 39/45] writeback: remove wbc.more_io Wu Fengguang
2009-10-07  7:38 ` [PATCH 40/45] writeback: requeue_io_wait() on I_SYNC locked inode Wu Fengguang
2009-10-07  7:38 ` [PATCH 41/45] writeback: requeue_io_wait() on pages_skipped inode Wu Fengguang
2009-10-07  7:39 ` [PATCH 42/45] writeback: requeue_io_wait() on blocked inode Wu Fengguang
2009-10-07  7:39 ` [PATCH 43/45] writeback: requeue_io_wait() on fs redirtied inode Wu Fengguang
2009-10-07  7:39 ` [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock Wu Fengguang
2009-10-07 13:11   ` Peter Staubach
2009-10-07 13:32     ` Wu Fengguang
2009-10-07 13:59       ` Peter Staubach
2009-10-08  1:44         ` Wu Fengguang
2009-10-07  7:39 ` [PATCH 45/45] btrfs: fix race on syncing the btree inode Wu Fengguang
2009-10-07  8:53 ` [PATCH 00/45] some writeback experiments Peter Zijlstra
2009-10-07 10:17 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) David Howells
2009-10-07 10:21   ` Nick Piggin
2009-10-07 10:47     ` Wu Fengguang
2009-10-07 11:23       ` Nick Piggin
2009-10-07 12:21         ` Wu Fengguang
2009-10-07 13:47 ` [PATCH 00/45] some writeback experiments Peter Staubach
2009-10-07 15:18   ` Wu Fengguang
2009-10-08  5:33     ` Wu Fengguang
2009-10-08  5:44       ` Wu Fengguang
2009-10-07 14:26 ` Theodore Tso
2009-10-07 14:45   ` Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20091013181214.GA31440@duck.suse.cz \
    --to=jack@suse.cz \
    --cc=Trond.Myklebust@netapp.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=hch@infradead.org \
    --cc=jens.axboe@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=richard@rsk.demon.co.uk \
    --cc=shaohua.li@intel.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).