All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wu Fengguang <fengguang.wu@intel.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Staubach <staubach@redhat.com>,
	Myklebust Trond <Trond.Myklebust@netapp.com>,
	Jan Kara <jack@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Theodore Tso <tytso@mit.edu>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Chris Mason <chris.mason@oracle.com>,
	"Li, Shaohua" <shaohua.li@intel.com>,
	"jens.axboe@oracle.com" <jens.axboe@oracle.com>,
	Nick Piggin <npiggin@suse.de>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	Richard Kennedy <richard@rsk.demon.co.uk>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
Date: Sat, 17 Oct 2009 13:30:16 +0800	[thread overview]
Message-ID: <20091017053016.GA20086@localhost> (raw)
In-Reply-To: <1255519348.8392.412.camel@twins>

On Wed, Oct 14, 2009 at 07:22:28PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-10-14 at 09:38 +0800, Wu Fengguang wrote:
> > > >   Hmm, probably you've discussed this in some other email but why do we
> > > > cycle in this loop until we get below dirty limit? We used to leave the
> > > > loop after writing write_chunk... So the time we spend in
> > > > balance_dirty_pages() is no longer limited, right?
> > 
> > Right, this is a legitimate concern.
> 
> Quite.
> 
> > > Wu was saying that without the loop nr_writeback wasn't limited, but
> > > since bdi_writeback_wakeup() is driven from writeout completion, I'm not
> > > sure how again that was so.
> > 
> > Let me summarize the ideas :)
> > 
> > There are two cases:
> > 
> > - there are no bdi or block io queue to limit nr_writeback
> >   This must be fixed. It either let nr_writeback grow to dirty_thresh
> >   (with loop) and thus squeeze nr_dirty, or grow out of control
> >   totally (without loop). Current state is, the nr_writeback wait
> >   queue for NFS is there; the one for btrfs is still missing.
> > 
> > - there is a nr_writeback limit, but is larger than dirty_thresh
> >   In this case nr_dirty will be close to 0 regardless of the loop.
> >   The loop will help to keep
> >           nr_dirty + nr_writeback + nr_unstable < dirty_thresh
> >   Without the loop, the "real" dirty threshold would be larger
> >   (determined by the nr_writeback limit).
> > 
> > > We can move all of bdi_dirty to bdi_writeout, if the bdi writeout queue
> > > permits, but it cannot grow beyond the total limit, since we're actually
> > > waiting for writeout completion.
> > 
> > Yes, this explains the second case. It's some trade-off like: the
> > nr_writeback limit can not be trusted in small memory systems, so do
> > the loop to impose the dirty_thresh, which unfortunately can hurt
> > responsiveness on all systems with prolonged wait time..
> 
> Ok, so I'm still puzzled.

Big sorry - it's me that was confused (by some buggy tests).

>   set_page_dirty()
>   balance_dirty_pages_ratelimited()
>     balance_dirty_pages_ratelimited_nr(1)
>       balance_dirty_pages(nr);
> 
> So we call balance_dirty_pages() with an appropriate count for each
> set_page_dirty() successful invocation, right?

Right.

> balance_dirty_pages() guarantees that:
> 
>   nr_dirty + nr_writeback + nr_unstable < dirty_thresh &&
>   (nr_dirty + nr_writeback + nr_unstable < 
> 	(dirty_thresh + background_thresh)/2 ||
>    bdi_dirty + bdi_writeback + bdi_unstable < bdi_thresh)
> 
> Now without loop, without writeback limit, I still see no way to
> actually generate more 'dirty' pages than dirty_thresh.
>
> As soon as we hit dirty_thresh a process will wait for exactly the same
> amount of pages to get cleaned (writeback completed) as were dirtied
> (+/- the ratelimit fuzz which should even out over processes).

Ah yes, we now wait for writeback _completed_ in bdi_writeback_wait(),
instead of _start_ writeback in the old fashioned writeback_inodes().

> That should bound things to dirty_thresh -- the wait is on writeback
> complete, so nr_writeback is bounded too.

Right. It was not bounded in the tests because bdi_writeback_wait()
quits _prematurely_, because the background writeback finds it was
already under background threshold, and so wakeup the throttled tasks
and then quit. Fixed by simply removing the wakeup-all in background
writeback and this change:

                if (args->for_background && !over_bground_thresh() &&
+                   list_empty(&wb->bdi->throttle_list))
                        break;

So now
- the throttled tasks are guaranteed to be wakeup
- it will only be wakeup in __bdi_writeout_inc()
- once wakeup, at least write_chunk pages have been written on behalf of it

> [ I forgot the exact semantics of unstable, if we clear writeback before
> unstable, we need to fix something ]

New tests show that NFS is working fine without loop and without NFS
nr_writeback limit:

      $ dd if=/dev/zero of=/mnt/test/zero3 bs=1M count=200 &
      $ vmmon -d 1 nr_writeback nr_dirty nr_unstable

     nr_writeback         nr_dirty      nr_unstable
                0                2                0
                0                2                0
                0            22477               65
                2            20849             1697
                2            19153             3393
                2            17420             5126
            27825                7             5979
            27816                0               41
            26925                0              907
            31064              286              159
            32531                0              213
            32548                0               89
            32405                0              155
            32464                0               98
            32517                0               45
            32560                0              194
            32534                0              220
            32601                0              222
            32490                0               72
            32447                0              115
            32511                0               48
            32535                0              216
            32535                0              216
            32535                0              216
            32535                0              216
            31555                0             1180
            29732                0             3003
            29277                0             3458
            27721                0             5014
            25955                0             6780
            24356                0             8379
            22763                0             9972
            21083                0            11652
            19371                0            13364
            17564                0            15171
            15781                0            16954
            14005                0            18730
            12230                0            20505
            12177                0            20558
            11383                0            21352
             9489                0            23246
             7621                0            25115
             5866                0            26870
             4790                0            27947
             2962                0            29773
             1089                0            31646
                0                0            32735
                0                0            32735
                0                0                0
                0                0                0
 
> Now, a nr_writeback queue that limits writeback will still be useful,
> esp for high speed devices. Once they ramp up and bdi_thresh exceeds the
> queue size, it'll take effect. So you reap the benefits when needed.

Right, the nr_writeback limit avoids

        nr_writeback => dirty_thresh
+
        nr_dirty + nr_writeback < dirty_thresh
=>
        nr_dirty => 0

Thanks for the clarification, it looks less obscure now :)

Thanks,
Fengguang

  reply	other threads:[~2009-10-17  5:30 UTC|newest]

Thread overview: 118+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-10-07  7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
2009-10-07  7:38 ` [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
2009-10-09 15:12   ` Jan Kara
2009-10-09 15:18     ` Peter Zijlstra
2009-10-09 15:47       ` Jan Kara
2009-10-11  2:28         ` Wu Fengguang
2009-10-11  7:44           ` Peter Zijlstra
2009-10-11 10:50             ` Wu Fengguang
2009-10-11 10:58               ` Peter Zijlstra
2009-10-11 11:25               ` Peter Zijlstra
2009-10-12  1:26                 ` Wu Fengguang
2009-10-12  9:07                   ` Peter Zijlstra
2009-10-12  9:24                     ` Wu Fengguang
2009-10-10 21:33     ` Wu Fengguang
2009-10-12 21:18       ` Jan Kara
2009-10-13  3:24         ` Wu Fengguang
2009-10-13  8:41           ` Peter Zijlstra
2009-10-13 18:12           ` Jan Kara
2009-10-13 18:28             ` Peter Zijlstra
2009-10-14  1:38               ` Wu Fengguang
2009-10-14 11:22                 ` Peter Zijlstra
2009-10-17  5:30                   ` Wu Fengguang [this message]
2009-10-07  7:38 ` [PATCH 02/45] writeback: reduce calculation of bdi dirty thresholds Wu Fengguang
2009-10-07  7:38 ` [PATCH 03/45] ext4: remove unused parameter wbc from __ext4_journalled_writepage() Wu Fengguang
2009-10-07  7:38 ` [PATCH 04/45] writeback: remove unused nonblocking and congestion checks Wu Fengguang
2009-10-09 15:26   ` Jan Kara
2009-10-10 13:47     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 05/45] writeback: remove the always false bdi_cap_writeback_dirty() test Wu Fengguang
2009-10-07  7:38 ` [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded Wu Fengguang
2009-10-07  8:53   ` Peter Zijlstra
2009-10-07  9:17     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 07/45] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
2009-10-09 15:45   ` Jan Kara
2009-10-07  7:38 ` [PATCH 08/45] writeback: quit on wrap for .range_cyclic (write_cache_pages) Wu Fengguang
2009-10-07  7:38 ` [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs) Wu Fengguang
2009-10-07 12:32   ` Evgeniy Polyakov
2009-10-07 14:23     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 10/45] writeback: quit on wrap for .range_cyclic (btrfs) Wu Fengguang
2009-10-07  7:38 ` [PATCH 11/45] writeback: quit on wrap for .range_cyclic (cifs) Wu Fengguang
2009-10-07  7:38 ` [PATCH 12/45] writeback: quit on wrap for .range_cyclic (ext4) Wu Fengguang
2009-10-07  7:38 ` [PATCH 13/45] writeback: quit on wrap for .range_cyclic (gfs2) Wu Fengguang
2009-10-07  7:38 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) Wu Fengguang
2009-10-07 10:17   ` David Howells
2009-10-07 10:21     ` Nick Piggin
2009-10-07 10:47       ` Wu Fengguang
2009-10-07 11:23         ` Nick Piggin
2009-10-07 12:21           ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 15/45] writeback: fix queue_io() ordering Wu Fengguang
2009-10-07  7:38 ` [PATCH 16/45] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
2009-10-07  7:38 ` [PATCH 17/45] writeback: only allow two background writeback works Wu Fengguang
2009-10-07  7:38 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Wu Fengguang
2009-10-08  1:01   ` KAMEZAWA Hiroyuki
2009-10-08  1:58     ` Wu Fengguang
2009-10-08  2:40       ` KAMEZAWA Hiroyuki
2009-10-08  4:01         ` Wu Fengguang
2009-10-08  5:59           ` KAMEZAWA Hiroyuki
2009-10-08  6:07             ` Wu Fengguang
2009-10-08  6:28             ` Wu Fengguang
2009-10-08  6:39               ` KAMEZAWA Hiroyuki
2009-10-08  8:08       ` Peter Zijlstra
2009-10-08  8:11         ` KAMEZAWA Hiroyuki
2009-10-08  8:36         ` Jens Axboe
2009-10-09  2:52           ` [PATCH] writeback: account IO throttling wait as iowait Wu Fengguang
2009-10-09 10:41             ` Jens Axboe
2009-10-09 10:58               ` Wu Fengguang
2009-10-09 11:01                 ` Jens Axboe
2009-10-08  8:05     ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Peter Zijlstra
2009-10-07  7:38 ` [PATCH 19/45] writeback: remove the loop in balance_dirty_pages() Wu Fengguang
2009-10-07  7:38 ` [PATCH 20/45] NFS: introduce writeback wait queue Wu Fengguang
2009-10-07  8:53   ` Peter Zijlstra
2009-10-07  9:07     ` Wu Fengguang
2009-10-07  9:15       ` Peter Zijlstra
2009-10-07  9:19         ` Wu Fengguang
2009-10-07  9:17       ` Nick Piggin
2009-10-07  9:52         ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 21/45] writeback: estimate bdi write bandwidth Wu Fengguang
2009-10-07  8:53   ` Peter Zijlstra
2009-10-07  9:39     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 22/45] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2009-10-07  7:38 ` [PATCH 23/45] writeback: kill space in debugfs item name Wu Fengguang
2009-10-07  7:38 ` [PATCH 24/45] writeback: remove global nr_to_write and use timeout instead Wu Fengguang
2009-10-07  7:38 ` [PATCH 25/45] writeback: convert wbc.nr_to_write to per-file parameter Wu Fengguang
2009-10-07  7:38 ` [PATCH 26/45] block: pass the non-rotational queue flag to backing_dev_info Wu Fengguang
2009-10-07  7:38 ` [PATCH 27/45] writeback: introduce wbc.for_background Wu Fengguang
2009-10-07  7:38 ` [PATCH 28/45] writeback: introduce wbc.nr_segments Wu Fengguang
2009-10-07  7:38 ` [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case Wu Fengguang
2009-10-07 11:57   ` Hugh Dickins
2009-10-07 14:00     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 30/45] vmscan: lumpy pageout Wu Fengguang
2009-10-07  7:38 ` [PATCH 31/45] writeback: sync old inodes first in background writeback Wu Fengguang
2010-07-12  3:01   ` Christoph Hellwig
2010-07-12 15:24     ` Wu Fengguang
2009-10-07  7:38 ` [PATCH 32/45] writeback: update kupdate expire timestamp on each scan of b_io Wu Fengguang
2009-10-07  7:38 ` [PATCH 33/45] writeback: sync livelock - introduce wbc.for_sync Wu Fengguang
2009-10-07  7:38 ` [PATCH 34/45] writeback: sync livelock - kick background writeback Wu Fengguang
2009-10-07  7:38 ` [PATCH 35/45] writeback: sync livelock - use single timestamp for whole sync work Wu Fengguang
2009-10-07  7:38 ` [PATCH 36/45] writeback: sync livelock - curb dirty speed for inodes to be synced Wu Fengguang
2009-10-07  7:38 ` [PATCH 37/45] writeback: use timestamp to indicate dirty exceeded Wu Fengguang
2009-10-07  7:38 ` [PATCH 38/45] writeback: introduce queue b_more_io_wait Wu Fengguang
2009-10-07  7:38 ` [PATCH 39/45] writeback: remove wbc.more_io Wu Fengguang
2009-10-07  7:38 ` [PATCH 40/45] writeback: requeue_io_wait() on I_SYNC locked inode Wu Fengguang
2009-10-07  7:38 ` [PATCH 41/45] writeback: requeue_io_wait() on pages_skipped inode Wu Fengguang
2009-10-07  7:39 ` [PATCH 42/45] writeback: requeue_io_wait() on blocked inode Wu Fengguang
2009-10-07  7:39 ` [PATCH 43/45] writeback: requeue_io_wait() on fs redirtied inode Wu Fengguang
2009-10-07  7:39 ` [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock Wu Fengguang
2009-10-07 13:11   ` Peter Staubach
2009-10-07 13:32     ` Wu Fengguang
2009-10-07 13:59       ` Peter Staubach
2009-10-08  1:44         ` Wu Fengguang
2009-10-07  7:39 ` [PATCH 45/45] btrfs: fix race on syncing the btree inode Wu Fengguang
2009-10-07  8:53 ` [PATCH 00/45] some writeback experiments Peter Zijlstra
2009-10-07 13:47 ` Peter Staubach
2009-10-07 15:18   ` Wu Fengguang
2009-10-08  5:33     ` Wu Fengguang
2009-10-08  5:44       ` Wu Fengguang
2009-10-07 14:26 ` Theodore Tso
2009-10-07 14:45   ` Wu Fengguang
2009-10-07 14:45     ` Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20091017053016.GA20086@localhost \
    --to=fengguang.wu@intel.com \
    --cc=Trond.Myklebust@netapp.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=jens.axboe@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=richard@rsk.demon.co.uk \
    --cc=shaohua.li@intel.com \
    --cc=staubach@redhat.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.