* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-20 20:24 ` [patch 3/3][rfc] vmscan: batched swap slot allocation Johannes Weiner
@ 2009-04-20 20:31 ` Johannes Weiner
2009-04-20 20:53 ` Andrew Morton
2009-04-21 0:58 ` KAMEZAWA Hiroyuki
2009-04-22 20:37 ` Hugh Dickins
2 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2009-04-20 20:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel, Hugh Dickins
[-- Attachment #1: Type: text/plain, Size: 929 bytes --]
A test program creates an anonymous memory mapping the size of the
system's RAM (2G). It faults all pages of it linearly, then kicks off
128 reclaimers (on 4 cores) that map, fault and unmap 2G in sum and
parallel, thereby evicting the first mapping onto swap.
The time is then taken for the initial mapping to get faulted in from
swap linearly again, thus measuring how bad the 128 reclaimers
distributed the pages on the swap space.
Average over 5 runs, standard deviation in parens:
        swap-in          user            system            total
old:    74.97s (0.38s)   0.52s (0.02s)   291.07s (3.28s)   2m52.66s (0m1.32s)
new:    45.26s (0.68s)   0.53s (0.01s)   250.47s (5.17s)   2m45.93s (0m2.63s)
where old is current mmotm snapshot 2009-04-17-15-19 and new is these
three patches applied to it.
Test program attached. Kernbench didn't show any differences on my
single core x86 laptop with 256mb ram (poor thing).
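To put the table in relative terms, here is a small sketch that computes the percentage reduction per column. The helper names are made up for illustration, and the totals are converted to seconds by hand (2m52.66s = 172.66s, 2m45.93s = 165.93s). By this arithmetic the swap-in time drops by roughly 40%, system time by roughly 14% and total time by roughly 4%.

```c
#include <assert.h>
#include <stdio.h>

/* Relative reduction, in percent, going from old_val to new_val. */
double reduction_percent(double old_val, double new_val)
{
	assert(old_val > 0.0);
	return (old_val - new_val) / old_val * 100.0;
}

/* Print the reductions for the swap-in, system and total columns. */
void print_reductions(void)
{
	printf("swap-in: %.1f%%\n", reduction_percent(74.97, 45.26));
	printf("system:  %.1f%%\n", reduction_percent(291.07, 250.47));
	/* totals converted to seconds by hand, see the note above */
	printf("total:   %.1f%%\n", reduction_percent(172.66, 165.93));
}
```

Calling print_reductions() from a main() prints the three figures. The user column is left out because its difference is within the reported standard deviation.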
[-- Attachment #2: contswap2.c --]
[-- Type: text/plain, Size: 1515 bytes --]
/*
 * contswap benchmark
 */
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

#define MEMORY		(1650 << 20)
#define RECLAIMERS	128
#define PAGE_SIZE	4096
#define PART		(MEMORY / RECLAIMERS)

/* Create a private anonymous mapping of the given size. */
static void *anonmap(unsigned long size)
{
	void *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);

	assert(map != MAP_FAILED);
	return map;
}

/* Fault in every page of the mapping in ascending order. */
static void touch_linear(char *map, unsigned long size)
{
	unsigned long off;

	for (off = 0; off < size; off += PAGE_SIZE)
		if (map[off])
			puts("huh?");
}

/* One reclaimer: map, fault and unmap its share of memory. */
static void __claim(unsigned long size)
{
	char *map = anonmap(size);

	touch_linear(map, size);
	sleep(5);
	munmap(map, size);
}

/* Fork a reclaimer that stops itself until the parent sends SIGCONT. */
static pid_t claim(unsigned long size)
{
	pid_t pid;

	switch (pid = fork()) {
	case -1:
		puts("fork failed");
		exit(1);
	case 0:
		kill(getpid(), SIGSTOP);
		__claim(size);
		exit(0);
	default:
		return pid;
	}
}

int main(void)
{
	struct timeval start, stop, diff;
	pid_t pids[RECLAIMERS];
	int nr, crap;
	char *one;

	one = anonmap(MEMORY);
	touch_linear(one, MEMORY);

	/* Start all reclaimers at once so they evict the first mapping. */
	for (nr = 0; nr < RECLAIMERS; nr++)
		pids[nr] = claim(PART);
	for (nr = 0; nr < RECLAIMERS; nr++)
		kill(pids[nr], SIGCONT);
	for (nr = 0; nr < RECLAIMERS; nr++)
		waitpid(pids[nr], &crap, 0);

	/* Time the linear swap-in of the first mapping. */
	gettimeofday(&start, NULL);
	touch_linear(one, MEMORY);
	gettimeofday(&stop, NULL);
	munmap(one, MEMORY);

	timersub(&stop, &start, &diff);
	/* Zero-pad the microseconds so the output is a valid decimal. */
	printf("%ld.%06ld\n", (long)diff.tv_sec, (long)diff.tv_usec);
	return 0;
}
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-20 20:31 ` Johannes Weiner
@ 2009-04-20 20:53 ` Andrew Morton
2009-04-20 21:38 ` Johannes Weiner
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-04-20 20:53 UTC (permalink / raw)
To: Johannes Weiner; +Cc: linux-mm, linux-kernel, riel, hugh
On Mon, 20 Apr 2009 22:31:19 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:
> A test program creates an anonymous memory mapping the size of the
> system's RAM (2G). It faults all pages of it linearly, then kicks off
> 128 reclaimers (on 4 cores) that map, fault and unmap 2G in sum and
> parallel, thereby evicting the first mapping onto swap.
>
> The time is then taken for the initial mapping to get faulted in from
> swap linearly again, thus measuring how bad the 128 reclaimers
> distributed the pages on the swap space.
>
> Average over 5 runs, standard deviation in parens:
>
>         swap-in          user            system            total
>
> old:    74.97s (0.38s)   0.52s (0.02s)   291.07s (3.28s)   2m52.66s (0m1.32s)
> new:    45.26s (0.68s)   0.53s (0.01s)   250.47s (5.17s)   2m45.93s (0m2.63s)
>
> where old is current mmotm snapshot 2009-04-17-15-19 and new is these
> three patches applied to it.
>
> Test program attached. Kernbench didn't show any differences on my
> single core x86 laptop with 256mb ram (poor thing).
qsbench is pretty good at fragmenting swapspace. It would be vaguely
interesting to see what effect you've had on its runtime.
I've found that qsbench's runtimes are fairly chaotic when it's
operating at the transition point between all-in-core and
madly-swapping, so a bit of thought and caution is needed.
I used to run it with
./qsbench -p 4 -m 96
on a 256MB machine and it had sufficiently repeatable runtimes to be
useful.
There's a copy of qsbench in
http://userweb.kernel.org/~akpm/stuff/ext3-tools.tar.gz
I wonder what effect this patch has upon hibernate/resume performance.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-20 20:53 ` Andrew Morton
@ 2009-04-20 21:38 ` Johannes Weiner
0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2009-04-20 21:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, linux-kernel, riel, hugh
On Mon, Apr 20, 2009 at 01:53:03PM -0700, Andrew Morton wrote:
> On Mon, 20 Apr 2009 22:31:19 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > A test program creates an anonymous memory mapping the size of the
> > system's RAM (2G). It faults all pages of it linearly, then kicks off
> > 128 reclaimers (on 4 cores) that map, fault and unmap 2G in sum and
> > parallel, thereby evicting the first mapping onto swap.
> >
> > The time is then taken for the initial mapping to get faulted in from
> > swap linearly again, thus measuring how bad the 128 reclaimers
> > distributed the pages on the swap space.
> >
> > Average over 5 runs, standard deviation in parens:
> >
> >         swap-in          user            system            total
> >
> > old:    74.97s (0.38s)   0.52s (0.02s)   291.07s (3.28s)   2m52.66s (0m1.32s)
> > new:    45.26s (0.68s)   0.53s (0.01s)   250.47s (5.17s)   2m45.93s (0m2.63s)
> >
> > where old is current mmotm snapshot 2009-04-17-15-19 and new is these
> > three patches applied to it.
> >
> > Test program attached. Kernbench didn't show any differences on my
> > single core x86 laptop with 256mb ram (poor thing).
>
> qsbench is pretty good at fragmenting swapspace. It would be vaguely
> interesting to see what effect you've had on its runtime.
>
> I've found that qsbench's runtimes are fairly chaotic when it's
> operating at the transition point between all-in-core and
> madly-swapping, so a bit of thought and caution is needed.
>
> I used to run it with
>
> ./qsbench -p 4 -m 96
>
> on a 256MB machine and it had sufficiently repeatable runtimes to be
> useful.
>
> There's a copy of qsbench in
> http://userweb.kernel.org/~akpm/stuff/ext3-tools.tar.gz
Thanks a lot.
> I wonder what effect this patch has upon hibernate/resume performance.
Good point, I will test this.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-20 20:24 ` [patch 3/3][rfc] vmscan: batched swap slot allocation Johannes Weiner
2009-04-20 20:31 ` Johannes Weiner
@ 2009-04-21 0:58 ` KAMEZAWA Hiroyuki
2009-04-21 8:52 ` Johannes Weiner
2009-04-22 20:37 ` Hugh Dickins
2 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-21 0:58 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins
On Mon, 20 Apr 2009 22:24:45 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:
> Every swap slot allocation tries to be subsequent to the previous one
> to help keeping the LRU order of anon pages intact when they are
> swapped out.
>
> With an increasing number of concurrent reclaimers, the average
> distance between two subsequent slot allocations of one reclaimer
> increases as well. The contiguous LRU list chunks each reclaimer
> swaps out get 'multiplexed' on the swap space as they allocate the
> slots concurrently.
>
> 2 processes isolating 15 pages each and allocating swap slots
> concurrently:
>
>     #0                    #1
>
>     page 0 slot 0         page 15 slot 1
>     page 1 slot 2         page 16 slot 3
>     page 2 slot 4         page 17 slot 5
>     ...
>
> -> average slot distance of 2
>
> All reclaimers being equally fast, this becomes a problem when the
> total number of concurrent reclaimers gets so high that even equal
> distribution makes the average distance between the slots of one
> reclaimer too wide for optimistic swap-in to compensate.
>
> But right now, one reclaimer can take much longer than another one
> because its pages are mapped into more page tables and it has thus
> more work to do and the faster reclaimer will allocate multiple swap
> slots between two slot allocations of the slower one.
>
> This patch makes shrink_page_list() allocate swap slots in batches,
> collecting all the anonymous memory pages in a list without
> rescheduling and actual reclaim in between. And only after all anon
> pages are swap cached, unmap and write-out starts for them.
>
> While this does not fix the fundamental issue of slot distribution
> increasing with reclaimers, it mitigates the problem by balancing the
> resulting fragmentation equally between the allocators.
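The interleaving described in the changelog can be reproduced with a small user-space model. This is only a sketch under simplifying assumptions: all reclaimers are equally fast, slots are handed out strictly in ascending order, and the function name and its batch parameter are invented for illustration. It does not model the unequal reclaimer speeds that the changelog names as the actual trigger.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hand out slots 0, 1, 2, ... to `reclaimers` reclaimers that each swap
 * out `pages` pages. `batch` is how many slots one reclaimer takes before
 * the next one gets its turn (1 = fully interleaved, pages = one batch).
 * Returns the average distance between consecutive slots of a reclaimer.
 */
double avg_slot_distance(unsigned int reclaimers, unsigned int pages,
			 unsigned int batch)
{
	unsigned long next_slot = 0, total = 0, gaps = 0;
	unsigned long *last;
	unsigned int *done;
	unsigned int r, i, remaining = reclaimers;

	assert(reclaimers > 0 && pages > 1 && batch > 0);
	last = calloc(reclaimers, sizeof(*last));
	done = calloc(reclaimers, sizeof(*done));
	assert(last && done);

	while (remaining) {
		for (r = 0; r < reclaimers; r++) {
			for (i = 0; i < batch && done[r] < pages; i++) {
				unsigned long slot = next_slot++;

				/* distance to this reclaimer's previous slot */
				if (done[r] > 0) {
					total += slot - last[r];
					gaps++;
				}
				last[r] = slot;
				if (++done[r] == pages)
					remaining--;
			}
		}
	}
	free(last);
	free(done);
	return (double)total / gaps;
}
```

With 2 reclaimers, 15 pages each and a batch of 1 the model gives the average distance of 2 from the example above. With a batch of 15 every reclaimer gets a contiguous run and the distance is 1.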
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Hugh Dickins <hugh@veritas.com>
> ---
> mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 70092fa..b3823fe 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -592,24 +592,42 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> enum pageout_io sync_writeback)
> {
> LIST_HEAD(ret_pages);
> + LIST_HEAD(swap_pages);
> struct pagevec freed_pvec;
> - int pgactivate = 0;
> + int pgactivate = 0, restart = 0;
> unsigned long nr_reclaimed = 0;
>
> cond_resched();
>
> pagevec_init(&freed_pvec, 1);
> +restart:
> while (!list_empty(page_list)) {
> struct address_space *mapping;
> struct page *page;
> int may_enter_fs;
> int referenced;
>
> - cond_resched();
> + if (list_empty(&swap_pages))
> + cond_resched();
>
Why this ?
> page = lru_to_page(page_list);
> list_del(&page->lru);
>
> + if (restart) {
> + /*
> + * We are allowed to do IO when we restart for
> + * swap pages.
> + */
> + may_enter_fs = 1;
> + /*
> + * Referenced pages will be sorted out by
> + * try_to_unmap() and unmapped (anon!) pages
> + * are not to be referenced anymore.
> + */
> + referenced = 0;
> + goto reclaim;
> + }
> +
> if (!trylock_page(page))
> goto keep;
>
Keeping multiple pages locked while they stay on private list ?
BTW, isn't it better to add "allocate multiple swap space at once" function
like
- void get_swap_pages(nr, swp_entry_array[])
? "nr" will not be bigger than SWAP_CLUSTER_MAX.
Regards,
-Kame
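A user-space mock of what such a batched interface could look like, purely for illustration. Everything here is hypothetical: the mock_ names, the toy swap map and its size are assumptions, and no locking, swap cache insertion or page-lock handling is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_SWAP_SLOTS 64	/* size of the toy swap area (assumption) */

static unsigned char mock_swap_map[MOCK_SWAP_SLOTS];	/* 0 = free, 1 = used */
static size_t mock_cursor;				/* where the next scan starts */

/*
 * Take up to nr free slots in one go, scanning from the cursor, and store
 * their offsets in entries[]. Returns the number of slots handed out,
 * which is less than nr only when the toy swap area runs out of free slots.
 */
size_t mock_get_swap_pages(size_t nr, size_t entries[])
{
	size_t got = 0, scanned;

	assert(entries != NULL);
	for (scanned = 0; scanned < MOCK_SWAP_SLOTS && got < nr; scanned++) {
		size_t off = (mock_cursor + scanned) % MOCK_SWAP_SLOTS;

		if (mock_swap_map[off])
			continue;
		mock_swap_map[off] = 1;
		entries[got++] = off;
	}
	mock_cursor = (mock_cursor + scanned) % MOCK_SWAP_SLOTS;
	return got;
}
```

Two back-to-back calls with nr = 4 on an empty map return slots 0-3 and 4-7, so each caller ends up with a contiguous run instead of alternating slots.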
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-21 0:58 ` KAMEZAWA Hiroyuki
@ 2009-04-21 8:52 ` Johannes Weiner
2009-04-21 9:23 ` KAMEZAWA Hiroyuki
2009-04-21 9:27 ` KOSAKI Motohiro
0 siblings, 2 replies; 22+ messages in thread
From: Johannes Weiner @ 2009-04-21 8:52 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins
On Tue, Apr 21, 2009 at 09:58:57AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 20 Apr 2009 22:24:45 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > Every swap slot allocation tries to be subsequent to the previous one
> > to help keeping the LRU order of anon pages intact when they are
> > swapped out.
> >
> > With an increasing number of concurrent reclaimers, the average
> > distance between two subsequent slot allocations of one reclaimer
> > increases as well. The contiguous LRU list chunks each reclaimer
> > swaps out get 'multiplexed' on the swap space as they allocate the
> > slots concurrently.
> >
> > 2 processes isolating 15 pages each and allocating swap slots
> > concurrently:
> >
> >     #0                    #1
> >
> >     page 0 slot 0         page 15 slot 1
> >     page 1 slot 2         page 16 slot 3
> >     page 2 slot 4         page 17 slot 5
> >     ...
> >
> > -> average slot distance of 2
> >
> > All reclaimers being equally fast, this becomes a problem when the
> > total number of concurrent reclaimers gets so high that even equal
> > distribution makes the average distance between the slots of one
> > reclaimer too wide for optimistic swap-in to compensate.
> >
> > But right now, one reclaimer can take much longer than another one
> > because its pages are mapped into more page tables and it has thus
> > more work to do and the faster reclaimer will allocate multiple swap
> > slots between two slot allocations of the slower one.
> >
> > This patch makes shrink_page_list() allocate swap slots in batches,
> > collecting all the anonymous memory pages in a list without
> > rescheduling and actual reclaim in between. And only after all anon
> > pages are swap cached, unmap and write-out starts for them.
> >
> > While this does not fix the fundamental issue of slot distribution
> > increasing with reclaimers, it mitigates the problem by balancing the
> > resulting fragmentation equally between the allocators.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Hugh Dickins <hugh@veritas.com>
> > ---
> > mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
> > 1 files changed, 41 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 70092fa..b3823fe 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -592,24 +592,42 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > enum pageout_io sync_writeback)
> > {
> > LIST_HEAD(ret_pages);
> > + LIST_HEAD(swap_pages);
> > struct pagevec freed_pvec;
> > - int pgactivate = 0;
> > + int pgactivate = 0, restart = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> >
> > pagevec_init(&freed_pvec, 1);
> > +restart:
> > while (!list_empty(page_list)) {
> > struct address_space *mapping;
> > struct page *page;
> > int may_enter_fs;
> > int referenced;
> >
> > - cond_resched();
> > + if (list_empty(&swap_pages))
> > + cond_resched();
> >
> Why this ?
It shouldn't schedule anymore when it's allocated the first swap slot.
Another reclaimer could e.g. sleep on the cond_resched() before the
loop and when we schedule while having swap slots allocated, we might
continue further allocations multiple slots ahead.
> > page = lru_to_page(page_list);
> > list_del(&page->lru);
> >
> > + if (restart) {
> > + /*
> > + * We are allowed to do IO when we restart for
> > + * swap pages.
> > + */
> > + may_enter_fs = 1;
> > + /*
> > + * Referenced pages will be sorted out by
> > + * try_to_unmap() and unmapped (anon!) pages
> > + * are not to be referenced anymore.
> > + */
> > + referenced = 0;
> > + goto reclaim;
> > + }
> > +
> > if (!trylock_page(page))
> > goto keep;
> >
> Keeping multiple pages locked while they stay on private list ?
Yeah, it's a bit suboptimal but I don't see a way around it.
> BTW, isn't it better to add "allocate multiple swap space at once" function
> like
> - void get_swap_pages(nr, swp_entry_array[])
> ? "nr" will not be bigger than SWAP_CLUSTER_MAX.
It will sometimes be, see __zone_reclaim().
I had such a function once. The interesting part is: how and when do
you call it? If you drop the page lock in between, you need to redo
the checks for unevictability and whether the page has become mapped
etc.
You also need to have the pages in swap cache as soon as possible or
optimistic swap-in will 'steal' your swap slots. See add_to_swap()
when the cache radix tree says -EEXIST.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-21 8:52 ` Johannes Weiner
@ 2009-04-21 9:23 ` KAMEZAWA Hiroyuki
2009-04-21 9:54 ` Johannes Weiner
2009-04-21 9:27 ` KOSAKI Motohiro
1 sibling, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-21 9:23 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins
On Tue, 21 Apr 2009 10:52:31 +0200
Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Keeping multiple pages locked while they stay on private list ?
>
> Yeah, it's a bit suboptimal but I don't see a way around it.
>
Hmm, seems to increase stale swap cache dramatically under memcg ;)
> > BTW, isn't it better to add "allocate multiple swap space at once" function
> > like
> > - void get_swap_pages(nr, swp_entry_array[])
> > ? "nr" will not be bigger than SWAP_CLUSTER_MAX.
>
> It will sometimes be, see __zone_reclaim().
>
Hm ? If I read the code correctly, __zone_reclaim() just call shrink_zone() and
"nr" to shrink_page_list() is SWAP_CLUSTER_MAX, at most.
> I had such a function once. The interesting part is: how and when do
> you call it? If you drop the page lock in between, you need to redo
> the checks for unevictability and whether the page has become mapped
> etc.
>
> You also need to have the pages in swap cache as soon as possible or
> optimistic swap-in will 'steal' your swap slots. See add_to_swap()
> when the cache radix tree says -EEXIST.
>
If I was you, modify "offset" calculation of
get_swap_pages()
-> scan_swap_map()
to allow that a cpu tends to find a continuous swap page cluster.
Too difficult ?
Regards,
-Kame
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-21 9:23 ` KAMEZAWA Hiroyuki
@ 2009-04-21 9:54 ` Johannes Weiner
0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2009-04-21 9:54 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel, Hugh Dickins
On Tue, Apr 21, 2009 at 06:23:31PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 21 Apr 2009 10:52:31 +0200
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > > Keeping multiple pages locked while they stay on private list ?
> >
> > Yeah, it's a bit suboptimal but I don't see a way around it.
> >
> Hmm, seems to increase stale swap cache dramatically under memcg ;)
Hmpf, not good.
> > > BTW, isn't it better to add "allocate multiple swap space at once" function
> > > like
> > > - void get_swap_pages(nr, swp_entry_array[])
> > > ? "nr" will not be bigger than SWAP_CLUSTER_MAX.
> >
> > It will sometimes be, see __zone_reclaim().
> >
> Hm ? If I read the code correctly, __zone_reclaim() just call shrink_zone() and
> "nr" to shrink_page_list() is SWAP_CLUSTER_MAX, at most.
shrink_zone() and shrink_inactive_list() use whatever is set in
sc->swap_cluster_max and for __zone_reclaim() this is:
.swap_cluster_max = max_t(unsigned long, nr_pages, SWAP_CLUSTER_MAX)
SWAP_CLUSTER_MAX is 32 (2^5), so if you have an order 6 allocation
doing reclaim, you end up with sc->swap_cluster_max == 64 already.
Not common, but it happens.
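The arithmetic can be checked with a few lines of C. This is a stand-alone sketch, not kernel code: the function name is made up, and it assumes nr_pages is 1 << order, which is what the order-6 example implies.

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL	/* value quoted above */

/* Mirrors the max_t() expression for an allocation of the given order. */
unsigned long zone_reclaim_cluster_max(unsigned int order)
{
	unsigned long nr_pages;

	assert(order < 8 * sizeof(unsigned long));
	nr_pages = 1UL << order;	/* assumption: pages in an order-N block */
	return nr_pages > SWAP_CLUSTER_MAX ? nr_pages : SWAP_CLUSTER_MAX;
}
```

Orders 0 through 5 stay at 32, order 6 gives 64, and each further order doubles the value.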
> > I had such a function once. The interesting part is: how and when do
> > you call it? If you drop the page lock in between, you need to redo
> > the checks for unevictability and whether the page has become mapped
> > etc.
> >
> > You also need to have the pages in swap cache as soon as possible or
> > optimistic swap-in will 'steal' your swap slots. See add_to_swap()
> > when the cache radix tree says -EEXIST.
> >
>
> If I was you, modify "offset" calculation of
> get_swap_pages()
> -> scan_swap_map()
> to allow that a cpu tends to find a continuous swap page cluster.
> Too difficult ?
This goes in the direction of extent-based allocations. I tried that
once by providing every reclaimer with a cookie that is passed in for
swap allocations and used to find per-reclaimer offsets.
Something went wrong, I can not quite remember. Will have another
look at this.
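For readers unfamiliar with the idea, a per-reclaimer cookie could look roughly like the user-space sketch below. This is not the code that was tried back then. The struct, the function name and the extent size are assumptions for illustration, and freeing slots or running out of swap is not handled.

```c
#include <assert.h>
#include <stddef.h>

#define EXTENT_SLOTS 32	/* slots reserved per refill (assumption) */

struct reclaimer_cookie {
	unsigned long next;	/* next slot to hand out from this extent */
	unsigned long end;	/* one past the last slot of this extent */
};

static unsigned long next_free_extent;	/* global cursor, in slots */

/*
 * Return the next slot for this reclaimer. When its extent is used up
 * (or it never had one), reserve a fresh extent from the global cursor.
 */
unsigned long cookie_alloc_slot(struct reclaimer_cookie *c)
{
	assert(c != NULL);
	if (c->next == c->end) {
		c->next = next_free_extent;
		c->end = c->next + EXTENT_SLOTS;
		next_free_extent = c->end;
	}
	return c->next++;
}
```

Two zero-initialized cookies that allocate alternately get slots 0, 1, 2, ... and 32, 33, 34, ... respectively, so each reclaimer stays contiguous no matter how the allocations interleave in time.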
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-21 8:52 ` Johannes Weiner
2009-04-21 9:23 ` KAMEZAWA Hiroyuki
@ 2009-04-21 9:27 ` KOSAKI Motohiro
2009-04-21 9:38 ` Johannes Weiner
1 sibling, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 9:27 UTC (permalink / raw)
To: Johannes Weiner
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm,
linux-kernel, Rik van Riel, Hugh Dickins
> > > - cond_resched();
> > > + if (list_empty(&swap_pages))
> > > + cond_resched();
> > >
> > Why this ?
>
> It shouldn't schedule anymore when it's allocated the first swap slot.
> Another reclaimer could e.g. sleep on the cond_resched() before the
> loop and when we schedule while having swap slots allocated, we might
> continue further allocations multiple slots ahead.
Oops, this looks like a regression. This cond_resched() is intended to give
	cond_resched();
	pageout();
	cond_resched();
	pageout();
	cond_resched();
	pageout();
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-21 9:27 ` KOSAKI Motohiro
@ 2009-04-21 9:38 ` Johannes Weiner
2009-04-21 9:41 ` KOSAKI Motohiro
0 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2009-04-21 9:38 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-mm, linux-kernel,
Rik van Riel, Hugh Dickins
On Tue, Apr 21, 2009 at 06:27:08PM +0900, KOSAKI Motohiro wrote:
> > > > - cond_resched();
> > > > + if (list_empty(&swap_pages))
> > > > + cond_resched();
> > > >
> > > Why this ?
> >
> > It shouldn't schedule anymore when it's allocated the first swap slot.
> > Another reclaimer could e.g. sleep on the cond_resched() before the
> > loop and when we schedule while having swap slots allocated, we might
> > continue further allocations multiple slots ahead.
>
> Oops, It seems regression. this cond_resched() intent to
>
> cond_resched();
> pageout();
> cond_resched();
> pageout();
> cond_resched();
> pageout();
It still does that. While it collects swap pages (swap_pages list is
non-empty), it doesn't page out. And if it restarts for unmap and
page-out, the swap_pages list is empty and cond_resched() is called.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-21 9:38 ` Johannes Weiner
@ 2009-04-21 9:41 ` KOSAKI Motohiro
0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-04-21 9:41 UTC (permalink / raw)
To: Johannes Weiner
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton, linux-mm,
linux-kernel, Rik van Riel, Hugh Dickins
> On Tue, Apr 21, 2009 at 06:27:08PM +0900, KOSAKI Motohiro wrote:
> > > > > - cond_resched();
> > > > > + if (list_empty(&swap_pages))
> > > > > + cond_resched();
> > > > >
> > > > Why this ?
> > >
> > > It shouldn't schedule anymore when it's allocated the first swap slot.
> > > Another reclaimer could e.g. sleep on the cond_resched() before the
> > > loop and when we schedule while having swap slots allocated, we might
> > > continue further allocations multiple slots ahead.
> >
> > Oops, It seems regression. this cond_resched() intent to
> >
> > cond_resched();
> > pageout();
> > cond_resched();
> > pageout();
> > cond_resched();
> > pageout();
>
> It still does that. While it collects swap pages (swap_pages list is
> non-empty), it doesn't page out. And if it restarts for unmap and
> page-out, the swap_pages list is empty and cond_resched() is called.
Ah, ok.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-20 20:24 ` [patch 3/3][rfc] vmscan: batched swap slot allocation Johannes Weiner
2009-04-20 20:31 ` Johannes Weiner
2009-04-21 0:58 ` KAMEZAWA Hiroyuki
@ 2009-04-22 20:37 ` Hugh Dickins
2009-04-27 7:46 ` Johannes Weiner
2 siblings, 1 reply; 22+ messages in thread
From: Hugh Dickins @ 2009-04-22 20:37 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel
On Mon, 20 Apr 2009, Johannes Weiner wrote:
> Every swap slot allocation tries to be subsequent to the previous one
> to help keeping the LRU order of anon pages intact when they are
> swapped out.
>
> With an increasing number of concurrent reclaimers, the average
> distance between two subsequent slot allocations of one reclaimer
> increases as well. The contiguous LRU list chunks each reclaimer
> swaps out get 'multiplexed' on the swap space as they allocate the
> slots concurrently.
>
> 2 processes isolating 15 pages each and allocating swap slots
> concurrently:
>
>     #0                    #1
>
>     page 0 slot 0         page 15 slot 1
>     page 1 slot 2         page 16 slot 3
>     page 2 slot 4         page 17 slot 5
>     ...
>
> -> average slot distance of 2
>
> All reclaimers being equally fast, this becomes a problem when the
> total number of concurrent reclaimers gets so high that even equal
> distribution makes the average distance between the slots of one
> reclaimer too wide for optimistic swap-in to compensate.
>
> But right now, one reclaimer can take much longer than another one
> because its pages are mapped into more page tables and it has thus
> more work to do and the faster reclaimer will allocate multiple swap
> slots between two slot allocations of the slower one.
>
> This patch makes shrink_page_list() allocate swap slots in batches,
> collecting all the anonymous memory pages in a list without
> rescheduling and actual reclaim in between. And only after all anon
> pages are swap cached, unmap and write-out starts for them.
>
> While this does not fix the fundamental issue of slot distribution
> increasing with reclaimers, it mitigates the problem by balancing the
> resulting fragmentation equally between the allocators.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Hugh Dickins <hugh@veritas.com>
You're right to be thinking along these lines, and probing for
improvements to be made here, but I don't think this patch is
what we want.
Its spaghetti just about defeated me. If it were what we wanted,
I think it ought to be restructured. Thanks to KAMEZAWA-san for
pointing out the issue of multiple locked pages, I'm not keen on
that either. And I don't like the
> + if (list_empty(&swap_pages))
> + cond_resched();
because that kind of thing only makes a difference on !CONFIG_PREEMPT
(which may cover most distros, but still seems regrettable).
Your testing looked good, but wasn't it precisely the test that
would be improved by these changes? Linear touching, some memory
pressure chaos, then repeated linear touching.
I think you're placing too much emphasis on the expectation that
the pages which come off the bottom of the LRU are linear and
belonging to a single object. Isn't it more realistic that
they'll come from scattered locations within independent objects
of different lifetimes? Or, the single linear without the chaos.
There may well be changes you can make here to reflect that better,
yet still keep your advantage in the exceptional case that there's
just the one linear.
An experiment I've never made, maybe you'd like to try, is to have
a level of indirection between the swap entries inserted into ptes
and the actual offsets on swap: assigning the actual offset on swap
at the last moment in swap_writepage, so the writes are in sequence
and merged at the block layer (whichever CPU they come from). Whether
swapins will be bunched together we cannot know, but we do know that
bunching the writes together should pay off (both on HDD and SSD).
Hugh
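The indirection experiment described above can be pictured with a toy user-space table. This is a sketch of the idea only, under stated assumptions: all names and the table size are invented, and nothing here reflects existing kernel interfaces. A virtual entry is what would go into the pte. The on-disk offset is bound only when the page is written.

```c
#include <assert.h>

#define NR_ENTRIES 16		/* toy table size (assumption) */
#define UNASSIGNED (-1L)

static long phys_offset[NR_ENTRIES];	/* indexed by virtual entry */
static long next_virtual;		/* next virtual entry to hand out */
static long next_physical;		/* next disk offset, in write order */

void indirection_init(void)
{
	int i;

	for (i = 0; i < NR_ENTRIES; i++)
		phys_offset[i] = UNASSIGNED;
	next_virtual = 0;
	next_physical = 0;
}

/* Reclaim time: hand out a virtual entry; no disk location yet. */
long alloc_virtual_entry(void)
{
	assert(next_virtual < NR_ENTRIES);
	return next_virtual++;
}

/* Write time: bind the entry to the next sequential disk offset. */
long assign_offset_at_write(long ventry)
{
	assert(ventry >= 0 && ventry < next_virtual);
	if (phys_offset[ventry] == UNASSIGNED)
		phys_offset[ventry] = next_physical++;
	return phys_offset[ventry];
}
```

If entries are allocated in the order a, b, c but written in the order c, a, b, the disk offsets come out as 0, 1, 2 in write order, which is the property that would let the block layer merge the writes.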
> ---
> mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
> 1 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 70092fa..b3823fe 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -592,24 +592,42 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> enum pageout_io sync_writeback)
> {
> LIST_HEAD(ret_pages);
> + LIST_HEAD(swap_pages);
> struct pagevec freed_pvec;
> - int pgactivate = 0;
> + int pgactivate = 0, restart = 0;
> unsigned long nr_reclaimed = 0;
>
> cond_resched();
>
> pagevec_init(&freed_pvec, 1);
> +restart:
> while (!list_empty(page_list)) {
> struct address_space *mapping;
> struct page *page;
> int may_enter_fs;
> int referenced;
>
> - cond_resched();
> + if (list_empty(&swap_pages))
> + cond_resched();
>
> page = lru_to_page(page_list);
> list_del(&page->lru);
>
> + if (restart) {
> + /*
> + * We are allowed to do IO when we restart for
> + * swap pages.
> + */
> + may_enter_fs = 1;
> + /*
> + * Referenced pages will be sorted out by
> + * try_to_unmap() and unmapped (anon!) pages
> + * are not to be referenced anymore.
> + */
> + referenced = 0;
> + goto reclaim;
> + }
> +
> if (!trylock_page(page))
> goto keep;
>
> @@ -655,14 +673,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> * Anonymous process memory has backing store?
> * Try to allocate it some swap space here.
> */
> - if (PageAnon(page) && !PageSwapCache(page)) {
> - if (!(sc->gfp_mask & __GFP_IO))
> - goto keep_locked;
> - if (!add_to_swap(page))
> - goto activate_locked;
> - may_enter_fs = 1;
> + if (PageAnon(page)) {
> + if (!PageSwapCache(page)) {
> + if (!(sc->gfp_mask & __GFP_IO))
> + goto keep_locked;
> + if (!add_to_swap(page))
> + goto activate_locked;
> + } else if (!may_enter_fs)
> + /*
> + * It's no use to batch when we are
> + * not allocating swap for this GFP
> + * mask.
> + */
> + goto reclaim;
> + list_add(&page->lru, &swap_pages);
> + continue;
> }
>
> + reclaim:
> mapping = page_mapping(page);
>
> /*
> @@ -794,6 +822,11 @@ keep:
> list_add(&page->lru, &ret_pages);
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
> + if (!list_empty(&swap_pages)) {
> + list_splice_init(&swap_pages, page_list);
> + restart = 1;
> + goto restart;
> + }
> list_splice(&ret_pages, page_list);
> if (pagevec_count(&freed_pvec))
> __pagevec_free(&freed_pvec);
> --
> 1.6.2.1.135.gde769

* Re: [patch 3/3][rfc] vmscan: batched swap slot allocation
2009-04-22 20:37 ` Hugh Dickins
@ 2009-04-27 7:46 ` Johannes Weiner
0 siblings, 0 replies; 22+ messages in thread
From: Johannes Weiner @ 2009-04-27 7:46 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel

On Wed, Apr 22, 2009 at 09:37:09PM +0100, Hugh Dickins wrote:
> On Mon, 20 Apr 2009, Johannes Weiner wrote:
>
> > Every swap slot allocation tries to be subsequent to the previous one
> > to help keeping the LRU order of anon pages intact when they are
> > swapped out.
> >
> > With an increasing number of concurrent reclaimers, the average
> > distance between two subsequent slot allocations of one reclaimer
> > increases as well. The contiguous LRU list chunks each reclaimer
> > swaps out get 'multiplexed' on the swap space as they allocate the
> > slots concurrently.
> >
> > 2 processes isolating 15 pages each and allocating swap slots
> > concurrently:
> >
> > #0 #1
> >
> > page 0 slot 0 page 15 slot 1
> > page 1 slot 2 page 16 slot 3
> > page 2 slot 4 page 17 slot 5
> > ...
> >
> > -> average slot distance of 2
> >
> > All reclaimers being equally fast, this becomes a problem when the
> > total number of concurrent reclaimers gets so high that even equal
> > distribution makes the average distance between the slots of one
> > reclaimer too wide for optimistic swap-in to compensate.
> >
> > But right now, one reclaimer can take much longer than another one
> > because its pages are mapped into more page tables and it has thus
> > more work to do and the faster reclaimer will allocate multiple swap
> > slots between two slot allocations of the slower one.
> >
> > This patch makes shrink_page_list() allocate swap slots in batches,
> > collecting all the anonymous memory pages in a list without
> > rescheduling and actual reclaim in between. And only after all anon
> > pages are swap cached, unmap and write-out starts for them.
> >
> > While this does not fix the fundamental issue of slot distribution
> > increasing with reclaimers, it mitigates the problem by balancing the
> > resulting fragmentation equally between the allocators.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Hugh Dickins <hugh@veritas.com>
>
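
To put numbers on the interleaving described in the changelog above,
here is a minimal user-space sketch. It is not kernel code, and
avg_slot_distance() is a made-up helper. It assumes equally fast
reclaimers that take strict turns at one shared slot counter, each
grabbing 'batch' subsequent slots per turn, which is the best case the
changelog describes rather than what happens under real contention.

```c
#include <assert.h>

/*
 * Hypothetical model: 'nr' reclaimers take turns at a shared slot
 * counter and each grabs 'batch' subsequent slots per turn.  Returns
 * the average distance between subsequent slots of reclaimer #0 once
 * it has allocated 'pages' slots.
 */
static double avg_slot_distance(unsigned int nr, unsigned int batch,
				unsigned int pages)
{
	unsigned long prev = 0, total = 0;
	unsigned int i;

	assert(nr > 0 && batch > 0 && pages > 1 && pages % batch == 0);

	for (i = 0; i < pages; i++) {
		/* A full round of turns consumes nr * batch slots. */
		unsigned long slot =
			(unsigned long)(i / batch) * nr * batch + i % batch;

		if (i > 0)
			total += slot - prev;
		prev = slot;
	}
	return (double)total / (pages - 1);
}
```

With 2 reclaimers, 15 pages each and a batch of 1, this gives the
average distance of 2 from the table above. With a batch of 15 it
drops to 1. In this model the unbatched distance equals the number of
reclaimers, so 128 reclaimers with a batch of 1 end up 128 slots apart.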
> You're right to be thinking along these lines, and probing for
> improvements to be made here, but I don't think this patch is
> what we want.
>
> Its spaghetti just about defeated me. If it were what we wanted,
> I think it ought to be restructured. Thanks to KAMEZAWA-san for
> pointing out the issue of multiple locked pages, I'm not keen on
> that either. And I don't like the
> > + if (list_empty(&swap_pages))
> > + cond_resched();
> because that kind of thing only makes a difference on !CONFIG_PREEMPT
> (which may cover most distros, but still seems regrettable).
>
> Your testing looked good, but wasn't it precisely the test that
> would be improved by these changes? Linear touching, some memory
> pressure chaos, then repeated linear touching.

Agreed, it was. I started playing around with qsbench per Andrew's
suggestion, but I have stopped for now and gone back to the scratch
pad. I agree that these patches are not the solution.

> I think you're placing too much emphasis on the expectation that
> the pages which come off the bottom of the LRU are linear and
> belonging to a single object. Isn't it more realistic that
> they'll come from scattered locations within independent objects
> of different lifetimes? Or, the single linear without the chaos.

Oh, that is certainly great input, thank you!

> There may well be changes you can make here to reflect that better,
> yet still keep your advantage in the exceptional case that there's
> just the one linear.
>
> An experiment I've never made, maybe you'd like to try, is to have
> a level of indirection between the swap entries inserted into ptes
> and the actual offsets on swap: assigning the actual offset on swap
> at the last moment in swap_writepage, so the writes are in sequence
> and merged at the block layer (whichever CPU they come from). Whether
> swapins will be bunched together we cannot know, but we do know that
> bunching the writes together should pay off (both on HDD and SSD).

I thought about indirect ptes as well, but I wasn't sure about them
and hoped we could get away with less invasive changes. That might not
be the case. Thanks for poking in that direction, I will see what I
can come up with.

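
For illustration only, a minimal user-space sketch of the indirection
idea quoted above: the pte would carry an entry id, and the offset on
swap is bound to that id only at write-out time. None of these names
(struct swap_indirect, entry_alloc(), entry_writeout()) exist in the
kernel. The fixed-size table and the purely sequential offset counter
are assumptions made to keep the example small.

```c
#include <assert.h>

#define NR_ENTRIES	8		/* hypothetical table size */
#define UNASSIGNED	(-1L)

/* Hypothetical: maps the entry id stored in a pte to an offset on swap. */
struct swap_indirect {
	long offset[NR_ENTRIES];	/* UNASSIGNED until written out */
	long next_entry;		/* next free entry id */
	long next_offset;		/* next sequential offset on swap */
};

static void swap_indirect_init(struct swap_indirect *si)
{
	long i;

	for (i = 0; i < NR_ENTRIES; i++)
		si->offset[i] = UNASSIGNED;
	si->next_entry = 0;
	si->next_offset = 0;
}

/* Reclaim time: hand out an entry id, but no place on swap yet. */
static long entry_alloc(struct swap_indirect *si)
{
	assert(si->next_entry < NR_ENTRIES);
	return si->next_entry++;
}

/* Write-out time: bind the entry to the next sequential offset. */
static long entry_writeout(struct swap_indirect *si, long entry)
{
	assert(entry >= 0 && entry < si->next_entry);
	if (si->offset[entry] == UNASSIGNED)
		si->offset[entry] = si->next_offset++;
	return si->offset[entry];
}
```

Whatever order the entries were handed out in, the offsets follow the
order of the writes, so pages written back to back end up adjacent on
swap. A repeated write-out of the same entry keeps its offset. Freeing
and reusing entries is left out entirely.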
Hannes