From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: a.p.zijlstra@chello.nl, miklos@szeredi.hu, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] remove throttle_vm_writeout()
Date: Sat, 6 Oct 2007 10:32:24 +0800 [thread overview]
Message-ID: <391637948.09008@ustc.edu.cn> (raw)
Message-ID: <20071006023224.GA7526@mail.ustc.edu.cn> (raw)
In-Reply-To: <20071005102005.134f07fa.akpm@linux-foundation.org>
On Fri, Oct 05, 2007 at 10:20:05AM -0700, Andrew Morton wrote:
> On Fri, 5 Oct 2007 20:30:28 +0800
> Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
>
> > > commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> > > Author: akpm <akpm>
> > > Date: Sun Dec 22 01:07:33 2002 +0000
> > >
> > > [PATCH] Give kswapd writeback higher priority than pdflush
> > >
> > > The `low latency page reclaim' design works by preventing page
> > > allocators from blocking on request queues (and by preventing them from
> > > blocking against writeback of individual pages, but that is immaterial
> > > here).
> > >
> > > This has a problem under some situations. pdflush (or a write(2)
> > > caller) could be saturating the queue with highmem pages. This
> > > prevents anyone from writing back ZONE_NORMAL pages. We end up doing
> > > enormous amounts of scenning.
> >
> > Sorry, I cannot not understand it. We now have balanced aging between
> > zones. So the page allocations are expected to distribute proportionally
> > between ZONE_HIGHMEM and ZONE_NORMAL?
>
> Sure, but we don't have one disk queue per disk per zone! The queue is
> shared by all the zones. So if writeback from one zone has filled the
> queue up, the kernel can't write back data from another zone.
Hmm, that's a problem. But I guess when one zone is full, other zones
will not be far away... It's a "sooner or later" problem.
> (Well, it can, by blocking in get_request_wait(), but that causes long and
> uncontrollable latencies).
I guess PF_SWAPWRITE processes still have good probability to stuck in
get_request_wait(). Because balance_dirty_pages() are allowed to
disregard the congestion. It will be exhausting the available request
slots all the time.
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
mm/page-writeback.c | 1 +
1 file changed, 1 insertion(+)
--- linux-2.6.23-rc8-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc8-mm2/mm/page-writeback.c
@@ -400,6 +400,7 @@ static void balance_dirty_pages(struct a
.sync_mode = WB_SYNC_NONE,
.older_than_this = NULL,
.nr_to_write = write_chunk,
+ .nonblocking = 1,
.range_cyclic = 1,
};
> > > A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
> > > then kill the mmapping applications. The machine instantly goes from
> > > 0% of memory dirty to 95% or more. pdflush kicks in and starts writing
> > > the least-recently-dirtied pages, which are all highmem. The queue is
> > > congested so nobody will write back ZONE_NORMAL pages. kswapd chews
> > > 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
> > > efficiency (pages_reclaimed/pages_scanned) falls to 2%.
> > >
> > > So this patch changes the policy for kswapd. kswapd may use all of a
> > > request queue, and is prepared to block on request queues.
> > >
> > > What will now happen in the above scenario is:
> > >
> > > 1: The page alloctor scans some pages, fails to reclaim enough
> > > memory and takes a nap in blk_congetion_wait().
> > >
> > > 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
> > > back pages. (These pages will be rotated to the tail of the
> > > inactive list at IO-completion interrupt time).
> > >
> > > This writeback will saturate the queue with ZONE_NORMAL pages.
> > > Conveniently, pdflush will avoid the congested queues. So we end up
> > > writing the correct pages.
> > >
> > > In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
> > > efficiency rises from 2% to 40% and things are generally a lot happier.
> >
> > We may see the same problem and improvement in the absent of 'all
> > writeback goes to one zone' assumption.
> >
> > The problem could be:
> > - dirty_thresh is exceeded, so balance_dirty_pages() starts syncing
> > data and quickly _congests_ the queue;
>
> Or someone ran fsync(), or pdflush is writing back data because it exceeded
> dirty_writeback_centisecs, etc.
Ah, yes.
> > - dirty pages are slowly but continuously turned into clean pages by
> > balance_dirty_pages(), but they still stay in the same place in LRU;
> > - the zones are mostly dirty/writeback pages, kswapd has a hard time
> > finding the randomly distributed clean pages;
> > - kswapd cannot do the writeout because the queue is congested!
> >
> > The improvement could be:
> > - kswapd is now explicitly preferred to do the writeout;
> > - the pages written by kswapd will be rotated and easy for kswapd to reclaim;
> > - it becomes possible for kswapd to wait for the congested queue,
> > instead of doing the vmscan like mad.
>
> Yeah. In 2.4 and early 2.5, page-reclaim (both direct reclaim and kswapd,
> iirc) would throttle by waiting on writeout of a particular page. This was
> a poor design, because writeback against a *particular* page can take
> anywhere from one millisecond to thirty seconds to complete, depending upon
> where the disk head is and all that stuff.
>
> The critical change I made was to switch the throttling algorithm from
> "wait for one page to get written" to "wait for _any_ page to get written".
> Becaue reclaim really doesn't care _which_ page got written: we want to
> wake up and start scanning again when _any_ page got written.
That must be a big improvement!
> That's what congestion_wait() does.
>
> It is pretty crude. It could be that writeback completed against pages which
> aren't in the correct zone, or it could be that some other task went and
> allocated the just-cleaned pages before this task can get running and
> reclaim them, or it could be that the just-written-back pages weren't
> reclaimable after all, etc.
>
> It would take a mind-boggling amount of logic and locking to make all this
> 100% accurate and the need has never been demonstrated. So page reclaim
> presently should be viewed as a polling algorithm, where the rate of
> polling is paced by the rate at which the IO system can retire writes.
Yeah. So the polling overheads are limited.
> > The congestion wait looks like a pretty natural way to throttle the kswapd.
> > Instead of doing the vmscan at 1000MB/s and actually freeing pages at
> > 60MB/s(about the write throughput), kswapd will be relaxed to do vmscan at
> > maybe 150MB/s.
>
> Something like that.
>
> The critical numbers to watch are /proc/vmstat's *scan* and *steal*. Look:
>
> akpm:/usr/src/25> uptime
> 10:08:14 up 10 days, 16:46, 15 users, load average: 0.02, 0.05, 0.04
> akpm:/usr/src/25> grep steal /proc/vmstat
> pgsteal_dma 0
> pgsteal_dma32 0
> pgsteal_normal 0
> pgsteal_high 0
> pginodesteal 0
> kswapd_steal 1218698
> kswapd_inodesteal 266847
> akpm:/usr/src/25> grep scan /proc/vmstat
> pgscan_kswapd_dma 0
> pgscan_kswapd_dma32 1246816
> pgscan_kswapd_normal 0
> pgscan_kswapd_high 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 448
> pgscan_direct_normal 0
> pgscan_direct_high 0
> slabs_scanned 2881664
>
> Ignore kswapd_inodesteal and slabs_scanned. We see that this machine has
> scanned 1246816+448 pages and has reclaimed (stolen) 1218698 pages. That's
> a reclaim success rate of 97.7%, which is pretty damn good - this machine
> is just a lightly-loaded 3GB desktop.
>
> When testing reclaim, it is critical that this ratio be monitored (vmmon.c
> from ext3-tools is a vmstat-like interface to /proc/vmstat). If the
> reclaim efficiency falls below, umm, 25% then things are getting into some
> trouble.
Nice tool!
I've been watching the raw numbers by writing scripts. This one can be
written as:
#!/bin/bash
while true; do
while read a b
do
eval $a=$b
done < /proc/vmstat
uptime=$(</proc/uptime)
scan=$((
pgscan_kswapd_dma +
pgscan_kswapd_dma32 +
pgscan_kswapd_normal +
pgscan_kswapd_high +
pgscan_direct_dma +
pgscan_direct_dma32 +
pgscan_direct_normal +
pgscan_direct_high
))
steal=$((
pgsteal_dma +
pgsteal_dma32 +
pgsteal_normal +
pgsteal_high
))
ratio=$((100*steal/(scan+1)))
echo -e "$uptime\t$ratio%\t$steal\t$scan"
sleep 1;
done
Not surprisingly, I see nice numbers on my desktop:
9517.99 9368.60 96% 1898452 1961536
> Actually, 25% is still pretty good. We scan 4 pages for each reclaimed
> page, but the amount of wall time which that takes is vastly less than the
> time to write one page, bearing in mind that these things tend to be seeky
> as hell. But still, keeping an eye on the reclaim efficiency is just your
> basic starting point for working on page reclaim.
Thank you for the nice tip :-)
Fengguang
WARNING: multiple messages have this Message-ID (diff)
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: a.p.zijlstra@chello.nl, miklos@szeredi.hu, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] remove throttle_vm_writeout()
Date: Sat, 6 Oct 2007 10:32:24 +0800 [thread overview]
Message-ID: <391637948.09008@ustc.edu.cn> (raw)
Message-ID: <20071006023224.GA7526@mail.ustc.edu.cn> (raw)
In-Reply-To: <20071005102005.134f07fa.akpm@linux-foundation.org>
On Fri, Oct 05, 2007 at 10:20:05AM -0700, Andrew Morton wrote:
> On Fri, 5 Oct 2007 20:30:28 +0800
> Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
>
> > > commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> > > Author: akpm <akpm>
> > > Date: Sun Dec 22 01:07:33 2002 +0000
> > >
> > > [PATCH] Give kswapd writeback higher priority than pdflush
> > >
> > > The `low latency page reclaim' design works by preventing page
> > > allocators from blocking on request queues (and by preventing them from
> > > blocking against writeback of individual pages, but that is immaterial
> > > here).
> > >
> > > This has a problem under some situations. pdflush (or a write(2)
> > > caller) could be saturating the queue with highmem pages. This
> > > prevents anyone from writing back ZONE_NORMAL pages. We end up doing
> > > enormous amounts of scenning.
> >
> > Sorry, I cannot not understand it. We now have balanced aging between
> > zones. So the page allocations are expected to distribute proportionally
> > between ZONE_HIGHMEM and ZONE_NORMAL?
>
> Sure, but we don't have one disk queue per disk per zone! The queue is
> shared by all the zones. So if writeback from one zone has filled the
> queue up, the kernel can't write back data from another zone.
Hmm, that's a problem. But I guess when one zone is full, other zones
will not be far away... It's a "sooner or later" problem.
> (Well, it can, by blocking in get_request_wait(), but that causes long and
> uncontrollable latencies).
I guess PF_SWAPWRITE processes still have good probability to stuck in
get_request_wait(). Because balance_dirty_pages() are allowed to
disregard the congestion. It will be exhausting the available request
slots all the time.
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
mm/page-writeback.c | 1 +
1 file changed, 1 insertion(+)
--- linux-2.6.23-rc8-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc8-mm2/mm/page-writeback.c
@@ -400,6 +400,7 @@ static void balance_dirty_pages(struct a
.sync_mode = WB_SYNC_NONE,
.older_than_this = NULL,
.nr_to_write = write_chunk,
+ .nonblocking = 1,
.range_cyclic = 1,
};
> > > A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
> > > then kill the mmapping applications. The machine instantly goes from
> > > 0% of memory dirty to 95% or more. pdflush kicks in and starts writing
> > > the least-recently-dirtied pages, which are all highmem. The queue is
> > > congested so nobody will write back ZONE_NORMAL pages. kswapd chews
> > > 50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
> > > efficiency (pages_reclaimed/pages_scanned) falls to 2%.
> > >
> > > So this patch changes the policy for kswapd. kswapd may use all of a
> > > request queue, and is prepared to block on request queues.
> > >
> > > What will now happen in the above scenario is:
> > >
> > > 1: The page alloctor scans some pages, fails to reclaim enough
> > > memory and takes a nap in blk_congetion_wait().
> > >
> > > 2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
> > > back pages. (These pages will be rotated to the tail of the
> > > inactive list at IO-completion interrupt time).
> > >
> > > This writeback will saturate the queue with ZONE_NORMAL pages.
> > > Conveniently, pdflush will avoid the congested queues. So we end up
> > > writing the correct pages.
> > >
> > > In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
> > > efficiency rises from 2% to 40% and things are generally a lot happier.
> >
> > We may see the same problem and improvement in the absent of 'all
> > writeback goes to one zone' assumption.
> >
> > The problem could be:
> > - dirty_thresh is exceeded, so balance_dirty_pages() starts syncing
> > data and quickly _congests_ the queue;
>
> Or someone ran fsync(), or pdflush is writing back data because it exceeded
> dirty_writeback_centisecs, etc.
Ah, yes.
> > - dirty pages are slowly but continuously turned into clean pages by
> > balance_dirty_pages(), but they still stay in the same place in LRU;
> > - the zones are mostly dirty/writeback pages, kswapd has a hard time
> > finding the randomly distributed clean pages;
> > - kswapd cannot do the writeout because the queue is congested!
> >
> > The improvement could be:
> > - kswapd is now explicitly preferred to do the writeout;
> > - the pages written by kswapd will be rotated and easy for kswapd to reclaim;
> > - it becomes possible for kswapd to wait for the congested queue,
> > instead of doing the vmscan like mad.
>
> Yeah. In 2.4 and early 2.5, page-reclaim (both direct reclaim and kswapd,
> iirc) would throttle by waiting on writeout of a particular page. This was
> a poor design, because writeback against a *particular* page can take
> anywhere from one millisecond to thirty seconds to complete, depending upon
> where the disk head is and all that stuff.
>
> The critical change I made was to switch the throttling algorithm from
> "wait for one page to get written" to "wait for _any_ page to get written".
> Becaue reclaim really doesn't care _which_ page got written: we want to
> wake up and start scanning again when _any_ page got written.
That must be a big improvement!
> That's what congestion_wait() does.
>
> It is pretty crude. It could be that writeback completed against pages which
> aren't in the correct zone, or it could be that some other task went and
> allocated the just-cleaned pages before this task can get running and
> reclaim them, or it could be that the just-written-back pages weren't
> reclaimable after all, etc.
>
> It would take a mind-boggling amount of logic and locking to make all this
> 100% accurate and the need has never been demonstrated. So page reclaim
> presently should be viewed as a polling algorithm, where the rate of
> polling is paced by the rate at which the IO system can retire writes.
Yeah. So the polling overheads are limited.
> > The congestion wait looks like a pretty natural way to throttle the kswapd.
> > Instead of doing the vmscan at 1000MB/s and actually freeing pages at
> > 60MB/s(about the write throughput), kswapd will be relaxed to do vmscan at
> > maybe 150MB/s.
>
> Something like that.
>
> The critical numbers to watch are /proc/vmstat's *scan* and *steal*. Look:
>
> akpm:/usr/src/25> uptime
> 10:08:14 up 10 days, 16:46, 15 users, load average: 0.02, 0.05, 0.04
> akpm:/usr/src/25> grep steal /proc/vmstat
> pgsteal_dma 0
> pgsteal_dma32 0
> pgsteal_normal 0
> pgsteal_high 0
> pginodesteal 0
> kswapd_steal 1218698
> kswapd_inodesteal 266847
> akpm:/usr/src/25> grep scan /proc/vmstat
> pgscan_kswapd_dma 0
> pgscan_kswapd_dma32 1246816
> pgscan_kswapd_normal 0
> pgscan_kswapd_high 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 448
> pgscan_direct_normal 0
> pgscan_direct_high 0
> slabs_scanned 2881664
>
> Ignore kswapd_inodesteal and slabs_scanned. We see that this machine has
> scanned 1246816+448 pages and has reclaimed (stolen) 1218698 pages. That's
> a reclaim success rate of 97.7%, which is pretty damn good - this machine
> is just a lightly-loaded 3GB desktop.
>
> When testing reclaim, it is critical that this ratio be monitored (vmmon.c
> from ext3-tools is a vmstat-like interface to /proc/vmstat). If the
> reclaim efficiency falls below, umm, 25% then things are getting into some
> trouble.
Nice tool!
I've been watching the raw numbers by writing scripts. This one can be
written as:
#!/bin/bash
while true; do
while read a b
do
eval $a=$b
done < /proc/vmstat
uptime=$(</proc/uptime)
scan=$((
pgscan_kswapd_dma +
pgscan_kswapd_dma32 +
pgscan_kswapd_normal +
pgscan_kswapd_high +
pgscan_direct_dma +
pgscan_direct_dma32 +
pgscan_direct_normal +
pgscan_direct_high
))
steal=$((
pgsteal_dma +
pgsteal_dma32 +
pgsteal_normal +
pgsteal_high
))
ratio=$((100*steal/(scan+1)))
echo -e "$uptime\t$ratio%\t$steal\t$scan"
sleep 1;
done
Not surprisingly, I see nice numbers on my desktop:
9517.99 9368.60 96% 1898452 1961536
> Actually, 25% is still pretty good. We scan 4 pages for each reclaimed
> page, but the amount of wall time which that takes is vastly less than the
> time to write one page, bearing in mind that these things tend to be seeky
> as hell. But still, keeping an eye on the reclaim efficiency is just your
> basic starting point for working on page reclaim.
Thank you for the nice tip :-)
Fengguang
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-10-06 2:32 UTC|newest]
Thread overview: 70+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-10-04 12:25 [PATCH] remove throttle_vm_writeout() Miklos Szeredi
2007-10-04 12:25 ` Miklos Szeredi
2007-10-04 12:40 ` Peter Zijlstra
2007-10-04 13:00 ` Miklos Szeredi
2007-10-04 13:00 ` Miklos Szeredi
2007-10-04 13:23 ` Peter Zijlstra
2007-10-04 13:49 ` Miklos Szeredi
2007-10-04 13:49 ` Miklos Szeredi
2007-10-04 16:47 ` Peter Zijlstra
2007-10-04 16:47 ` Peter Zijlstra
2007-10-04 17:46 ` Andrew Morton
2007-10-04 17:46 ` Andrew Morton
2007-10-04 18:10 ` Peter Zijlstra
2007-10-04 18:10 ` Peter Zijlstra
2007-10-04 18:54 ` Andrew Morton
2007-10-04 18:54 ` Andrew Morton
2007-10-05 12:30 ` Fengguang Wu
2007-10-05 12:30 ` Fengguang Wu
2007-10-05 12:30 ` Fengguang Wu
2007-10-05 17:20 ` Andrew Morton
2007-10-05 17:20 ` Andrew Morton
2007-10-06 2:32 ` Fengguang Wu [this message]
2007-10-06 2:32 ` Fengguang Wu
2007-10-06 2:32 ` Fengguang Wu
2007-10-07 23:54 ` David Chinner
2007-10-07 23:54 ` David Chinner
2007-10-08 0:33 ` Fengguang Wu
2007-10-08 0:33 ` Fengguang Wu
2007-10-08 0:33 ` Fengguang Wu
2007-10-04 21:07 ` Miklos Szeredi
2007-10-04 21:07 ` Miklos Szeredi
2007-10-04 21:56 ` Andrew Morton
2007-10-04 21:56 ` Andrew Morton
2007-10-04 22:39 ` Miklos Szeredi
2007-10-04 22:39 ` Miklos Szeredi
2007-10-04 23:09 ` Andrew Morton
2007-10-04 23:09 ` Andrew Morton
2007-10-04 23:26 ` Miklos Szeredi
2007-10-04 23:26 ` Miklos Szeredi
2007-10-04 23:48 ` Andrew Morton
2007-10-04 23:48 ` Andrew Morton
2007-10-05 0:12 ` Miklos Szeredi
2007-10-05 0:12 ` Miklos Szeredi
2007-10-05 0:48 ` Andrew Morton
2007-10-05 0:48 ` Andrew Morton
2007-10-05 8:22 ` Peter Zijlstra
2007-10-05 9:22 ` Miklos Szeredi
2007-10-05 9:22 ` Miklos Szeredi
2007-10-05 9:47 ` Peter Zijlstra
2007-10-05 10:27 ` Miklos Szeredi
2007-10-05 10:27 ` Miklos Szeredi
2007-10-05 10:32 ` Miklos Szeredi
2007-10-05 10:32 ` Miklos Szeredi
2007-10-05 15:43 ` John Stoffel
2007-10-05 15:43 ` John Stoffel
2007-10-05 10:57 ` Peter Zijlstra
2007-10-05 11:27 ` Miklos Szeredi
2007-10-05 11:27 ` Miklos Szeredi
2007-10-05 17:50 ` Trond Myklebust
2007-10-05 17:50 ` Trond Myklebust
2007-10-05 18:32 ` Peter Zijlstra
2007-10-05 18:32 ` Peter Zijlstra
2007-10-05 19:20 ` Trond Myklebust
2007-10-05 19:20 ` Trond Myklebust
2007-10-05 19:23 ` Trond Myklebust
2007-10-05 19:23 ` Trond Myklebust
2007-10-05 21:07 ` Peter Zijlstra
2007-10-05 21:07 ` Peter Zijlstra
2007-10-06 0:40 ` Fengguang Wu
2007-10-06 0:40 ` Fengguang Wu
2007-10-06 0:40 ` Fengguang Wu
2007-10-05 7:32 ` Peter Zijlstra
2007-10-05 19:54 ` Rik van Riel
2007-10-05 19:54 ` Rik van Riel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=391637948.09008@ustc.edu.cn \
--to=wfg@mail.ustc.edu.cn \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=miklos@szeredi.hu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.