From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Chuck Ebbert <cebbert@redhat.com>, Greg KH <gregkh@suse.de>,
Chakri n <chakriin5@gmail.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Krzysztof Oledzki <olel@ans.pl>,
linux-pm <linux-pm@lists.linux-foundation.org>,
lkml <linux-kernel@vger.kernel.org>,
richard kennedy <richard@rsk.demon.co.uk>,
Ingo Molnar <mingo@elte.hu>
Subject: Re: [PATCH] writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
Date: Tue, 2 Oct 2007 20:13:27 +0800
Message-ID: <391327210.01048@ustc.edu.cn>
Message-ID: <20071002121327.GA5718@mail.ustc.edu.cn>
In-Reply-To: <20071001191457.2f7c7538.akpm@linux-foundation.org>
On Mon, Oct 01, 2007 at 07:14:57PM -0700, Andrew Morton wrote:
> On Tue, 2 Oct 2007 10:00:40 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
>
> > writeback: avoid possible balance_dirty_pages() lockup on a light-load bdi
> >
> > On a busy-writing system, a writer could be held up indefinitely on a
> > light-load device. It will be trying to sync more dirty data than is available.
> >
> > The problem case:
> >
> > 0. sda/nr_dirty >= dirty_limit;
> > sdb/nr_dirty == 0
> > 1. dd writes 32 pages on sdb
> > 2. balance_dirty_pages() blocks dd, and tries to write 6MB.
> > 3. it never gets there: there's only 128KB dirty data.
> > 4. dd may be blocked for a loooong time
>
> Please quantify loooong.
There are only two 'break' conditions in the loop:
1. nr_dirty + nr_unstable + nr_writeback < dirty_limit
   => *mostly* FALSE on a busy system
   => *always* FALSE in Chakri's stuck NFS case
2. nr_written >= 6MB
   for a light-load bdi:
   => *never* TRUE until many new writers arrive, contributing
      more dirty pages to sync
   => worse, those new writers will also get stuck here...
The obvious imbalance here is:
   each writer contributes only 32KB of new dirty pages, but
   wants to consume 6MB (which is not necessarily available)
So loooong = min(global-less-busy-time, bdi-many-new-writers-arrival-time).
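
The two 'break' conditions above can be illustrated with a toy userspace
model. This is only a sketch, not the kernel loop: the function name, the
page counts, and the assumption that another device keeps the global dirty
count pinned above the limit are all made up for illustration.

```c
#include <assert.h>

/*
 * Toy model of the pre-patch throttling loop (NOT kernel code).
 * All counts are in pages. Assumption: some other busy device keeps
 * global_dirty constant, so only this bdi's own pages can be written.
 * Returns the pass on which a 'break' fires, or -1 if none fires
 * within max_passes.
 */
int toy_passes_until_break(long global_dirty, long dirty_limit,
                           long bdi_dirty, long write_chunk,
                           int max_passes)
{
    long pages_written = 0;

    assert(write_chunk > 0 && max_passes > 0);

    for (int pass = 1; pass <= max_passes; pass++) {
        /* Break 1: the system as a whole is under the dirty limit. */
        if (global_dirty < dirty_limit)
            return pass;

        /* Write what this bdi has, up to the rest of the chunk. */
        long want = write_chunk - pages_written;
        long done = bdi_dirty < want ? bdi_dirty : want;

        bdi_dirty -= done;
        pages_written += done;

        /* Break 2: this writer has written its full chunk. */
        if (pages_written >= write_chunk)
            return pass;

        /* The real loop would sleep in congestion_wait() here. */
    }
    return -1;
}
```

With 4KB pages, 6MB is 1536 pages. A call such as
toy_passes_until_break(2000, 1000, 32, 1536, 100) models the dd case: the
bdi runs dry after 32 pages and neither condition ever fires.
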
> > Fix it by returning on 'zero dirty inodes' in the current bdi.
> > (In fact there are slight differences between 'dirty inodes' and 'dirty pages'.
> > But there are no counters available for 'dirty pages'.)
> >
> > But the newly introduced 'break' could make the nr_writeback drift away
> > above the dirty limit. The workaround is to limit the error under 1MB.
>
> I'm still not sure that we fully understand this yet.
>
> If the sdb writer is stuck in balance_dirty_pages() then all sda writers
> will be in balance_dirty_pages() too, madly writing stuff out to sda. And
> pdflush will be writing out sda as well. All this writeout to sda should
> release the sdb writer.
>
> Why isn't this happening?
Your reasoning is right. The exact consequence is:
the light-load sdb is made as _unresponsive_ as the busy sda.
Hence Chakri's case: whenever NFS is stuck, every device gets stuck.
>
> > Cc: Chuck Ebbert <cebbert@redhat.com>
> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > ---
> > mm/page-writeback.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > --- linux-2.6.22.orig/mm/page-writeback.c
> > +++ linux-2.6.22/mm/page-writeback.c
> > @@ -250,6 +250,11 @@ static void balance_dirty_pages(struct a
> > pages_written += write_chunk - wbc.nr_to_write;
> > if (pages_written >= write_chunk)
> > break; /* We've done our duty */
> > + if (list_empty(&mapping->host->i_sb->s_dirty) &&
> > + list_empty(&mapping->host->i_sb->s_io) &&
> > + nr_reclaimable + global_page_state(NR_WRITEBACK) <=
> > + dirty_thresh + (1 << (20-PAGE_CACHE_SHIFT)))
> > + break;
> > }
> > congestion_wait(WRITE, HZ/10);
> > }
>
> Well that has a nice safety net. Perhaps it could fail a bit later on,
> but that depends on why it's failing.
In theory, every CPU/parallel writer could contribute 8 pages of error.
Hence we get 1MB/32KB = 32 (CPUs/writers).
A more serious problem is that a busy writer could also drain all the
dirty pages and make (nr_writeback == dirty_limit+1MB). In that case,
I suspect the light-load sdb writer still has a good chance to
make progress (needs confirmation).
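
The arithmetic and the added exit test can be checked with a small
userspace sketch. It assumes 4KB pages (a page shift of 12); the function
names are hypothetical, and the boolean parameter merely stands in for the
s_dirty/s_io list checks in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* 1MB expressed in pages, mirroring 1 << (20 - PAGE_CACHE_SHIFT). */
long slack_pages(int page_shift)
{
    assert(page_shift > 0 && page_shift <= 20);
    return 1L << (20 - page_shift);
}

/* How many writers fit in the slack if each overshoots by N pages. */
long writers_covered(int page_shift, long pages_per_writer)
{
    assert(pages_per_writer > 0);
    return slack_pages(page_shift) / pages_per_writer;
}

/*
 * Model of the new early exit (NOT kernel code): leave the loop when
 * no dirty inodes are queued for this writer and the global count is
 * within dirty_thresh plus the 1MB slack. All counts are in pages.
 */
bool may_break_early(bool has_dirty_inodes, long nr_reclaimable,
                     long nr_writeback, long dirty_thresh,
                     int page_shift)
{
    return !has_dirty_inodes &&
           nr_reclaimable + nr_writeback <=
               dirty_thresh + slack_pages(page_shift);
}
```

With a page shift of 12, slack_pages() gives 256 pages, and 256 / 8 gives
the 32 writers mentioned above. Once the global count exceeds
dirty_thresh by more than 256 pages, may_break_early() stops allowing the
exit, which is the safety net.
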
> How well tested was this?
Not well tested so far. My system becomes unusable soon after
starting the NFS write (even before plugging the network). I'm seeing
large latencies in try_to_wake_up(). I hope Ingo can help with that.
> If we merge this for 2.6.23 then I expect that we'll immediately unmerge it
> for 2.6.24 because Peter's stuff fixes this problem by other means.
>
> Do we all agree with the above sentence?
Yeah, Peter and I were both aware of the timing.
This patch is only meant for 2.6.23 and 2.6.22.10.
Fengguang