From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: David Chinner <dgc@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Miklos Szeredi <miklos@szeredi.hu>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] remove throttle_vm_writeout()
Date: Mon, 8 Oct 2007 08:33:49 +0800
Message-ID: <391803633.20978@ustc.edu.cn>
Message-ID: <20071008003349.GA5455@mail.ustc.edu.cn>
In-Reply-To: <20071007235433.GW995458@sgi.com>
On Mon, Oct 08, 2007 at 09:54:33AM +1000, David Chinner wrote:
> On Fri, Oct 05, 2007 at 08:30:28PM +0800, Fengguang Wu wrote:
> > The improvement could be:
> > - kswapd is now explicitly preferred to do the writeout;
>
> Careful. kswapd is much less efficient at writeout than pdflush
> because it does not do low->high offset writeback per address space.
> It just flushes the pages in LRU order and that turns writeback into
> a non-sequential mess. I/O sizes decrease substantially and
> throughput falls through the floor.
>
> So if you want kswapd to take over all the writeback, it needs to do
> writeback in the same manner as the background flushes. i.e. by
> grabbing page->mapping and flushing that in sequential order rather
> than just the page on the end of the LRU....
>
> I documented the effect of kswapd taking over writeback in this
> paper (section 5.3):
>
> http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
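
The cost of LRU-order writeout described in the quoted text can be illustrated with a small user-space sketch. It is not kernel code: count_io_runs() and sort_by_offset() are hypothetical helpers, and the model assumes that two pages merge into one I/O only when they are submitted back to back at adjacent file offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Count the separate contiguous I/O runs produced by submitting pages
 * in the given order. An entry extends the current run only if its
 * offset is exactly one page past the previous entry; otherwise it
 * starts a new run (modelled here as a new, smaller I/O).
 */
static size_t count_io_runs(const unsigned long *offsets, size_t n)
{
    size_t runs = 0;

    assert(offsets != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || offsets[i] != offsets[i - 1] + 1)
            runs++;
    }
    return runs;
}

/* qsort() comparator for unsigned long page offsets, ascending. */
static int cmp_offset(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    return (x > y) - (x < y);
}

/* Sort in place to model low->high offset writeback of one mapping. */
static void sort_by_offset(unsigned long *offsets, size_t n)
{
    qsort(offsets, n, sizeof(*offsets), cmp_offset);
}
```

With the eight offsets {7, 2, 5, 3, 6, 4, 1, 0} submitted in that (LRU-like) order, count_io_runs() reports 8 runs; after sort_by_offset() the same pages form a single run.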

Ah, indeed. That means introducing a new "really congested" threshold
for kswapd is *dangerous*. I realized this later on, and am now
heading in another direction.

The basic idea is to
- rotate pdflush-issued writeback pages for kswapd;
- use the more precise zone_rotate_wait() to throttle kswapd.

The code is a quick hack and has not been tested yet.
Early comments are more than welcome.
Fengguang
---
 include/linux/mmzone.h |    1 +
 mm/filemap.c           |    5 ++++-
 mm/page_alloc.c        |    1 +
 mm/swap.c              |   13 +++++++++++++
 mm/vmscan.c            |   12 ++++++++++--
 5 files changed, 29 insertions(+), 3 deletions(-)
--- linux-2.6.23-rc8-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.23-rc8-mm2/include/linux/mmzone.h
@@ -316,6 +316,7 @@ struct zone {
wait_queue_head_t * wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
+ wait_queue_head_t wait_rotate;
/*
* Discontig memory support fields.
--- linux-2.6.23-rc8-mm2.orig/mm/filemap.c
+++ linux-2.6.23-rc8-mm2/mm/filemap.c
@@ -558,12 +558,15 @@ EXPORT_SYMBOL(unlock_page);
*/
void end_page_writeback(struct page *page)
{
- if (!TestClearPageReclaim(page) || rotate_reclaimable_page(page)) {
+ int r = 1;
+ if (!TestClearPageReclaim(page) || (r = rotate_reclaimable_page(page))) {
if (!test_clear_page_writeback(page))
BUG();
}
smp_mb__after_clear_bit();
wake_up_page(page, PG_writeback);
+ if (!r)
+ wake_up(&page_zone(page)->wait_rotate);
}
EXPORT_SYMBOL(end_page_writeback);
--- linux-2.6.23-rc8-mm2.orig/mm/page_alloc.c
+++ linux-2.6.23-rc8-mm2/mm/page_alloc.c
@@ -3482,6 +3482,7 @@ static void __meminit free_area_init_cor
zone->prev_priority = DEF_PRIORITY;
zone_pcp_init(zone);
+ init_waitqueue_head(&zone->wait_rotate);
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_list);
zone->nr_scan_active = 0;
--- linux-2.6.23-rc8-mm2.orig/mm/vmscan.c
+++ linux-2.6.23-rc8-mm2/mm/vmscan.c
@@ -50,6 +50,7 @@
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
+ unsigned long nr_dirty_writeback;
/* This context's GFP mask */
gfp_t gfp_mask;
@@ -558,8 +559,10 @@ static unsigned long shrink_page_list(st
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
- if (PageWriteback(page) || PageDirty(page))
+ if (PageWriteback(page) || PageDirty(page)) {
+ sc->nr_dirty_writeback++;
goto keep;
+ }
/*
* A synchronous write - probably a ramdisk. Go
* ahead and try to reclaim the page.
@@ -620,6 +623,10 @@ keep_locked:
keep:
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page));
+ if (PageLocked(page) && PageWriteback(page)) {
+ SetPageReclaim(page);
+ sc->nr_dirty_writeback++;
+ }
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
@@ -1184,7 +1191,8 @@ static unsigned long shrink_zone(int pri
}
}
- throttle_vm_writeout(sc->gfp_mask);
+ if (!nr_reclaimed && sc->nr_dirty_writeback)
+ zone_rotate_wait(zone, HZ/100);
return nr_reclaimed;
}
--- linux-2.6.23-rc8-mm2.orig/mm/swap.c
+++ linux-2.6.23-rc8-mm2/mm/swap.c
@@ -174,6 +174,19 @@ int rotate_reclaimable_page(struct page
return 0;
}
+long zone_rotate_wait(struct zone* z, long timeout)
+{
+ long ret;
+ DEFINE_WAIT(wait);
+ wait_queue_head_t *wqh = &z->wait_rotate;
+
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ ret = io_schedule_timeout(timeout);
+ finish_wait(wqh, &wait);
+ return ret;
+}
+EXPORT_SYMBOL(zone_rotate_wait);
+
/*
* FIXME: speed this up?
*/
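
The wait/wake pairing between zone_rotate_wait() and end_page_writeback() can be approximated in user space with a POSIX condition variable. This is only an analogy: struct rotate_waitq and its three functions are hypothetical, the timeout is in milliseconds rather than jiffies, and the function returns a boolean instead of the remaining time. Unlike the patch, the sketch rechecks a counter after every wakeup, because a condition variable may wake spuriously.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical user-space stand-in for zone->wait_rotate. */
struct rotate_waitq {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    unsigned long rotations; /* incremented on every wake event */
};

static void rotate_waitq_init(struct rotate_waitq *q)
{
    int rc = pthread_mutex_init(&q->lock, NULL);

    assert(rc == 0);
    rc = pthread_cond_init(&q->cond, NULL);
    assert(rc == 0);
    (void)rc;
    q->rotations = 0;
}

/* Analog of wake_up(&zone->wait_rotate): record the event, wake all waiters. */
static void rotate_waitq_wake(struct rotate_waitq *q)
{
    pthread_mutex_lock(&q->lock);
    q->rotations++;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/*
 * Analog of zone_rotate_wait(): block until a rotation is signalled or
 * timeout_ms elapses. Returns true if a rotation happened while waiting,
 * false if the wait timed out first.
 */
static bool rotate_waitq_wait(struct rotate_waitq *q, long timeout_ms)
{
    struct timespec deadline;
    unsigned long seen;
    bool woken;

    /* Condition variables use CLOCK_REALTIME deadlines by default. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&q->lock);
    seen = q->rotations;
    while (q->rotations == seen) {
        if (pthread_cond_timedwait(&q->cond, &q->lock, &deadline) == ETIMEDOUT)
            break;
    }
    woken = (q->rotations != seen);
    pthread_mutex_unlock(&q->lock);
    return woken;
}
```

A reclaim thread would call rotate_waitq_wait(&q, 10) where the patch calls zone_rotate_wait(zone, HZ/100), and the writeback completion path would call rotate_waitq_wake(&q).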