Re: Userland swsusp failure (mm-related)

public inbox for linux-kernel@vger.kernel.org

* Re: Userland swsusp failure (mm-related)
       [not found] <b637ec0b0604080537s55e63544r8bb63c887e81ecaf@mail.gmail.com>
@ 2006-04-08 15:16 ` Rafael J. Wysocki
  2006-04-08 16:15   ` Pavel Machek
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2006-04-08 15:16 UTC (permalink / raw)
  To: Fabio Comolli; +Cc: linux-kernel, Pavel Machek, Nick Piggin

Hi,

On Saturday 08 April 2006 14:37, Fabio Comolli wrote:
> This is my first (and unique) failure since I began testing uswsusp
> (2.6.17-rc1 version). It happened (I think) because more than 50% of
> physical memory was occupied at suspend time (about 550 megs out of
> 1G) and that was what I was trying to test. After freeing some memory
> suspend worked (there was no need to reboot).

Well, it looks like we didn't free enough RAM for suspend in this case.
Unfortunately we were below the min watermark for ZONE_NORMAL and
we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
ZONE_DMA in this case?).

I think we can safely ignore the watermarks in swsusp, so probably
we can set PF_MEMALLOC for the current task temporarily and reset
it when we have allocated memory.  Pavel, what do you think?

> I did not set the image size limit. Of course if I set the image size
> to 500M everything works.
> 
> I use fglrx; however this has never proved to be a problem:
> suspend-resume always worked perfectly, DRI was functioning after
> resume without any noticeable difference.
> 
> In my normal activity uswsusp works fine with both compression and
> encryption. Good work, guys.

Thanks. :-)

Greetings,
Rafael


> -----------------------
> Apr  8 14:04:09 tycho kernel: Stopping tasks:
> ====================================================================================================|
> Apr  8 14:04:09 tycho kernel: Shrinking memory...  ^H-^H\^Hdone (19786
> pages freed)
> Apr  8 14:04:09 tycho kernel: eth1: Going into suspend...
> Apr  8 14:04:09 tycho kernel: swsusp: Need to copy 113906 pages
> Apr  8 14:04:09 tycho kernel: suspend: page allocation failure.
> order:0, mode:0x8120
> Apr  8 14:04:09 tycho kernel:  <c0131ccf> __alloc_pages+0x249/0x25d  
> <c0131d52> get_zeroed_page+0x31/0x4c
> Apr  8 14:04:09 tycho kernel:  <c0129e9d> alloc_data_pages+0x5b/0xb8  
> <c012a730> swsusp_save+0xea/0x244
> Apr  8 14:04:09 tycho kernel:  <c02265ea>
> swsusp_arch_suspend+0x2a/0x2c   <c0129852> swsusp_suspend+0x31/0x6b
> Apr  8 14:04:09 tycho kernel:  <c012b5bf> snapshot_ioctl+0x1ab/0x3f9  
> <c012b414> snapshot_ioctl+0x0/0x3f9
> Apr  8 14:04:09 tycho kernel:  <c0151da5> do_ioctl+0x39/0x48  
> <c0151fb3> vfs_ioctl+0x1ff/0x216
> Apr  8 14:04:09 tycho kernel:  <c0151ff6> sys_ioctl+0x2c/0x42  
> <c01028db> sysenter_past_esp+0x54/0x75
> Apr  8 14:04:09 tycho kernel: Mem-info:
> Apr  8 14:04:09 tycho kernel: DMA per-cpu:
> Apr  8 14:04:09 tycho kernel: cpu 0 hot: high 0, batch 1 used:0
> Apr  8 14:04:09 tycho kernel: cpu 0 cold: high 0, batch 1 used:0
> Apr  8 14:04:09 tycho kernel: DMA32 per-cpu: empty
> Apr  8 14:04:09 tycho kernel: Normal per-cpu:
> Apr  8 14:04:09 tycho kernel: cpu 0 hot: high 186, batch 31 used:0
> Apr  8 14:04:09 tycho kernel: cpu 0 cold: high 62, batch 15 used:14
> Apr  8 14:04:09 tycho kernel: HighMem per-cpu: empty
> Apr  8 14:04:09 tycho kernel: Free pages:        4952kB (0kB HighMem)
> Apr  8 14:04:09 tycho kernel: Active:45901 inactive:54562 dirty:0
> writeback:0 unstable:0 free:1238 slab:4155 mapped:43513 pagetables:472
> Apr  8 14:04:09 tycho kernel: DMA free:3548kB min:68kB low:84kB
> high:100kB active:0kB inactive:0kB present:16384kB pages_scanned:0
> all_unreclaimable? no
> Apr  8 14:04:09 tycho kernel: lowmem_reserve[]: 0 0 880 880
> Apr  8 14:04:09 tycho kernel: DMA32 free:0kB min:0kB low:0kB high:0kB
> active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable?
> no
> Apr  8 14:04:09 tycho kernel: lowmem_reserve[]: 0 0 880 880
> Apr  8 14:04:09 tycho kernel: Normal free:1404kB min:3756kB low:4692kB
> high:5632kB active:183604kB inactive:218248kB present:901120kB
> pages_scanned:0 all_unreclaimable? no
> Apr  8 14:04:09 tycho kernel: lowmem_reserve[]: 0 0 0 0
> Apr  8 14:04:09 tycho kernel: HighMem free:0kB min:128kB low:128kB
> high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0
> all_unreclaimable? no
> Apr  8 14:04:09 tycho kernel: lowmem_reserve[]: 0 0 0 0
> Apr  8 14:04:09 tycho kernel: DMA: 1*4kB 1*8kB 1*16kB 0*32kB 1*64kB
> 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3548kB
> Apr  8 14:04:09 tycho kernel: DMA32: empty
> Apr  8 14:04:09 tycho kernel: Normal: 1*4kB 1*8kB 1*16kB 1*32kB 1*64kB
> 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1404kB
> Apr  8 14:04:09 tycho kernel: HighMem: empty
> Apr  8 14:04:09 tycho kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
> Apr  8 14:04:09 tycho kernel: Free swap  = 1052248kB
> Apr  8 14:04:09 tycho kernel: Total swap = 1052248kB
> Apr  8 14:04:09 tycho kernel: Free swap:       1052248kB
> Apr  8 14:04:09 tycho kernel: 229376 pages of RAM
> Apr  8 14:04:09 tycho kernel: 0 pages of HIGHMEM
> Apr  8 14:04:09 tycho kernel: 2732 reserved pages
> Apr  8 14:04:09 tycho kernel: 93657 pages shared
> Apr  8 14:04:09 tycho kernel: 0 pages swap cached
> Apr  8 14:04:09 tycho kernel: 0 pages dirty
> Apr  8 14:04:09 tycho kernel: 0 pages writeback
> Apr  8 14:04:09 tycho kernel: 43513 pages mapped
> Apr  8 14:04:09 tycho kernel: 4155 pages slab
> Apr  8 14:04:09 tycho kernel: 472 pages pagetables
> Apr  8 14:04:09 tycho kernel: suspend: Allocating image pages failed.
> Apr  8 14:04:09 tycho kernel: Error -12 suspending
> 

* Re: Userland swsusp failure (mm-related)
  2006-04-08 15:16 ` Userland swsusp failure (mm-related) Rafael J. Wysocki
@ 2006-04-08 16:15   ` Pavel Machek
  2006-04-08 22:47     ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Machek @ 2006-04-08 16:15 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Fabio Comolli, linux-kernel, Nick Piggin

Hi!
 
> > This is my first (and unique) failure since I began testing uswsusp
> > (2.6.17-rc1 version). It happened (I think) because more than 50% of
> > physical memory was occupied at suspend time (about 550 megs out of
> > 1G) and that was what I was trying to test. After freeing some memory
> > suspend worked (there was no need to reboot).
> 
> Well, it looks like we didn't free enough RAM for suspend in this case.
> Unfortunately we were below the min watermark for ZONE_NORMAL and
> we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
> ZONE_DMA in this case?).
> 
> I think we can safely ignore the watermarks in swsusp, so probably
> we can set PF_MEMALLOC for the current task temporarily and reset
> it when we have allocated memory.  Pavel, what do you think?

Seems a little hacky, but okay to me.

Shouldn't fixing the "how much to free" computation so that it frees a bit
more be enough to handle this?
								Pavel
-- 
Thanks for all the (sleeping) penguins.

* Re: Userland swsusp failure (mm-related)
  2006-04-08 16:15   ` Pavel Machek
@ 2006-04-08 22:47     ` Rafael J. Wysocki
  2006-04-08 23:24       ` Con Kolivas
  2006-04-09  1:51       ` Userland swsusp failure (mm-related) Nick Piggin
  0 siblings, 2 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2006-04-08 22:47 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Fabio Comolli, linux-kernel, Nick Piggin

Hi,

On Saturday 08 April 2006 18:15, Pavel Machek wrote:
> > > This is my first (and unique) failure since I began testing uswsusp
> > > (2.6.17-rc1 version). It happened (I think) because more than 50% of
> > > physical memory was occupied at suspend time (about 550 megs out of
> > > 1G) and that was what I was trying to test. After freeing some memory
> > > suspend worked (there was no need to reboot).
> > 
> > Well, it looks like we didn't free enough RAM for suspend in this case.
> > Unfortunately we were below the min watermark for ZONE_NORMAL and
> > we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
> > ZONE_DMA in this case?).
> > 
> > I think we can safely ignore the watermarks in swsusp, so probably
> > we can set PF_MEMALLOC for the current task temporarily and reset
> > it when we have allocated memory.  Pavel, what do you think?
> 
> Seems little hacky but okay to me.
> 
> Should not fixing "how much to free" computation to free a bit more be
> enough to handle this?

Yes, but in that case we'll leave some memory unused. ;-)

Rafael

* Re: Userland swsusp failure (mm-related)
  2006-04-08 22:47     ` Rafael J. Wysocki
@ 2006-04-08 23:24       ` Con Kolivas
  2006-04-09 20:36         ` shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related)) Rafael J. Wysocki
  2006-04-09  1:51       ` Userland swsusp failure (mm-related) Nick Piggin
  1 sibling, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2006-04-08 23:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rafael J. Wysocki, Pavel Machek, Fabio Comolli, Nick Piggin

On Sunday 09 April 2006 08:47, Rafael J. Wysocki wrote:
> Hi,
>
> On Saturday 08 April 2006 18:15, Pavel Machek wrote:
> > > > This is my first (and unique) failure since I began testing uswsusp
> > > > (2.6.17-rc1 version). It happened (I think) because more than 50% of
> > > > physical memory was occupied at suspend time (about 550 megs out of
> > > > 1G) and that was what I was trying to test. After freeing some memory
> > > > suspend worked (there was no need to reboot).
> > >
> > > Well, it looks like we didn't free enough RAM for suspend in this case.
> > > Unfortunately we were below the min watermark for ZONE_NORMAL and
> > > we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
> > > ZONE_DMA in this case?).
> > >
> > > I think we can safely ignore the watermarks in swsusp, so probably
> > > we can set PF_MEMALLOC for the current task temporarily and reset
> > > it when we have allocated memory.  Pavel, what do you think?
> >
> > Seems little hacky but okay to me.
> >
> > Should not fixing "how much to free" computation to free a bit more be
> > enough to handle this?
>
> Yes, but in that case we'll leave some memory unused. ;-)

How are the shrink_all_memory tweaks I sent performing for you, Rafael? They
may theoretically be prone to the same issue, but I tried to make it less likely.

-- 
-ck

* Re: Userland swsusp failure (mm-related)
  2006-04-08 22:47     ` Rafael J. Wysocki
  2006-04-08 23:24       ` Con Kolivas
@ 2006-04-09  1:51       ` Nick Piggin
  2006-04-11 21:33         ` Rafael J. Wysocki
  1 sibling, 1 reply; 15+ messages in thread
From: Nick Piggin @ 2006-04-09  1:51 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Pavel Machek, Fabio Comolli, linux-kernel

Rafael J. Wysocki wrote:
> Hi,
> 

>>>Well, it looks like we didn't free enough RAM for suspend in this case.
>>>Unfortunately we were below the min watermark for ZONE_NORMAL and
>>>we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
>>>ZONE_DMA in this case?).
>>>
>>>I think we can safely ignore the watermarks in swsusp, so probably
>>>we can set PF_MEMALLOC for the current task temporarily and reset
>>>it when we have allocated memory.  Pavel, what do you think?
>>
>>Seems little hacky but okay to me.
>>
>>Should not fixing "how much to free" computation to free a bit more be
>>enough to handle this?
> 
> 
> Yes, but in that case we'll leave some memory unused. ;-)
> 

It probably doesn't fall back to ZONE_DMA because of the lowmem reserve.
Yes, PF_MEMALLOC sounds like it might do what you want. A little
hackish perhaps, but better than putting swsusp special cases
into page_alloc.c.

-- 
SUSE Labs, Novell Inc.

* shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related))
  2006-04-08 23:24       ` Con Kolivas
@ 2006-04-09 20:36         ` Rafael J. Wysocki
  2006-04-09 23:23           ` Con Kolivas
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2006-04-09 20:36 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, Pavel Machek, Fabio Comolli, Nick Piggin

Hi Con,

On Sunday 09 April 2006 01:24, Con Kolivas wrote:
> On Sunday 09 April 2006 08:47, Rafael J. Wysocki wrote:
> > On Saturday 08 April 2006 18:15, Pavel Machek wrote:
> > > > > This is my first (and unique) failure since I began testing uswsusp
> > > > > (2.6.17-rc1 version). It happened (I think) because more than 50% of
> > > > > physical memory was occupied at suspend time (about 550 megs out of
> > > > > 1G) and that was what I was trying to test. After freeing some memory
> > > > > suspend worked (there was no need to reboot).
> > > >
> > > > Well, it looks like we didn't free enough RAM for suspend in this case.
> > > > Unfortunately we were below the min watermark for ZONE_NORMAL and
> > > > we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
> > > > ZONE_DMA in this case?).
> > > >
> > > > I think we can safely ignore the watermarks in swsusp, so probably
> > > > we can set PF_MEMALLOC for the current task temporarily and reset
> > > > it when we have allocated memory.  Pavel, what do you think?
> > >
> > > Seems little hacky but okay to me.
> > >
> > > Should not fixing "how much to free" computation to free a bit more be
> > > enough to handle this?
> >
> > Yes, but in that case we'll leave some memory unused. ;-)
> 
> How are the shrink_all_memory tweaks I sent performing for you, Rafael? They
> may theoretically be prone to the same issue, but I tried to make it less likely.

Well, I don't think it would help in this particular case.  The memory got divided
almost ideally in swsusp_shrink_memory() and we were hit by the lowmem
reserve in ZONE_DMA, apparently.

Still I've been doing a crash course in mm internals recently and I can say a
bit more about your patch now. ;-)

First, I agree that using balance_pgdat() for freeing memory by swsusp is
overkill, so the removal of its second argument seems to be a good idea to
me.  However, I'd rather avoid modifying struct scan_control and shrink_zone()
and reimplement the shrink_zone()'s logic directly in shrink_all_memory(),
with some modifications (eg. we can explicitly avoid shrinking of the active
list until we decide it's worth it) -- or we can define a separate function for
this purpose.

Second, there are a couple of details I'd do in a different way.  For example,
I think we should call shrink_slab() with a non-zero first argument
(otherwise it'll use SWAP_CLUSTER_MAX), and instead of setting
zone->prev_priority to 0 I'd set vm_swappiness to 100 temporarily
(or maybe I'd leave it to the user to set swappiness before suspend?).

Also I think we can try to avoid slab shrinking until we start to shrink the
active list or, IOW, until we can't get any more pages from the inactive
list alone.

If you don't mind, I'll try to rework your patch a bit in accordance with
the above remarks in the next couple of days.

Greetings,
Rafael

* Re: shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related))
  2006-04-09 20:36         ` shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related)) Rafael J. Wysocki
@ 2006-04-09 23:23           ` Con Kolivas
  2006-04-11 17:06             ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2006-04-09 23:23 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-kernel, Pavel Machek, Fabio Comolli, Nick Piggin

On Monday 10 April 2006 06:36, Rafael J. Wysocki wrote:
> Still I've been doing a crash course in mm internals recently and I can say
> a bit more about your patch now. ;-)

Great.
>
> First, I agree that using balance_pgdat() for freeing memory by swsusp is
> overkill, so the removal of its second argument seems to be a good idea to
> me.  However, I'd rather avoid modifying struct scan_control and
> shrink_zone() and reimplement the shrink_zone()'s logic directly in
> shrink_all_memory(), with some modifications (eg. we can explicitly avoid
> shrinking of the active list until we decide it's worth it) -- or we can
> define a separate function for this purpose.

I was trying to reuse as much code as possible.

> Second, there are a couple of details I'd do in a different way.  For
> example I think we should call shrink_slab() with the non-zero first
> argument (otherwise it'll use SWAP_CLUSTER_MAX)

Sounds good.

> and instead of setting 
> zone->prev_priority to 0 I'd set vm_swappiness to 100 temporarily
> (or maybe I'd leave it to the user to set swappiness before suspend?).

Probably can't rely on just the user setting. Setting priority to 0 is 
explicit and overrides any swappiness setting which is a tunable. Priority 
will recover by itself unlike swappiness which needs to be set and reset.

> Also I think we can try to avoid slab shrinking until we start to shrink
> the active zone or IOW until we can't get any more pages from the inactive
> list alone.

I tried that and it didn't shrink enough, but then that's because of the 
SWAP_CLUSTER_MAX limit you mentioned above. But slab can be massive if you do 
for example a lot of 'find's and shrinking slab doesn't affect the user 
experience as much as shrinking the active/inactive lists.

> If you don't mind, I'll try to rework your patch a bit in accordance with
> the above remarks in the next couple of days.

By all means :)

-- 
-ck

* Re: shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related))
  2006-04-09 23:23           ` Con Kolivas
@ 2006-04-11 17:06             ` Rafael J. Wysocki
  2006-04-13 12:42               ` Con Kolivas
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2006-04-11 17:06 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, Pavel Machek, Fabio Comolli, Nick Piggin

Hi,

On Monday 10 April 2006 01:23, Con Kolivas wrote:
> On Monday 10 April 2006 06:36, Rafael J. Wysocki wrote:
> > Still I've been doing a crash course in mm internals recently and I can say
> > a bit more about your patch now. ;-)
> 
> Great.
> >
> > First, I agree that using balance_pgdat() for freeing memory by swsusp is
> > overkill, so the removal of its second argument seems to be a good idea to
> > me.  However, I'd rather avoid modifying struct scan_control and
> > shrink_zone() and reimplement the shrink_zone()'s logic directly in
> > shrink_all_memory(), with some modifications (eg. we can explicitly avoid
> > shrinking of the active list until we decide it's worth it) -- or we can
> > define a separate function for this purpose.
> 
> I was trying to reuse as much code as possible.
> 
> > Second, there are a couple of details I'd do in a different way.  For
> > example I think we should call shrink_slab() with the non-zero first
> > argument (otherwise it'll use SWAP_CLUSTER_MAX)
> 
> Sounds good.
> 
> > and instead of setting 
> > zone->prev_priority to 0 I'd set vm_swappiness to 100 temporarily
> > (or maybe I'd leave it to the user to set swappiness before suspend?).
> 
> Probably can't rely on just the user setting. Setting priority to 0 is 
> explicit and overrides any swappiness setting which is a tunable. Priority 
> will recover by itself unlike swappiness which needs to be set and reset.
> 
> > Also I think we can try to avoid slab shrinking until we start to shrink
> > the active zone or IOW until we can't get any more pages from the inactive
> > list alone.
> 
> I tried that and it didn't shrink enough, but then that's because of the 
> SWAP_CLUSTER_MAX limit you mentioned above. But slab can be massive if you do 
> for example a lot of 'find's and shrinking slab doesn't affect the user 
> experience as much as shrinking the active/inactive lists.
> 
> > If you don't mind, I'll try to rework your patch a bit in accordance with
> > the above remarks in the next couple of days.
> 
> By all means :)

The patch is appended.

In shrink_all_memory() I try to free exactly as many pages as the caller asks
for, preferably in one shot, starting from easier targets.  If slabs are huge,
they are most likely to have enough pages to reclaim.  The inactive lists
are next (the zones with more inactive pages go first) etc.  However, since
each pass potentially requires more work, the number of pages to scan is
decreased as the pages are reclaimed which seems to make the shrinking
of memory go more smoothly.

I've been testing it on an x86_64 box for some time and it seems to behave
quite reasonably, eg. it usually makes the actual image size very close to
the value of image_size and if you set image_size to 0, it shrinks everything
almost totally.

Greetings,
Rafael

---
 kernel/power/swsusp.c |   10 +-
 mm/vmscan.c           |  211 ++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 164 insertions(+), 57 deletions(-)

Index: linux-2.6.17-rc1-mm2/mm/vmscan.c
===================================================================
--- linux-2.6.17-rc1-mm2.orig/mm/vmscan.c
+++ linux-2.6.17-rc1-mm2/mm/vmscan.c
@@ -1031,10 +1031,6 @@ out:
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at pages_high.
  *
- * If `nr_pages' is non-zero then it is the number of pages which are to be
- * reclaimed, regardless of the zone occupancies.  This is a software suspend
- * special.
- *
  * Returns the number of pages which were actually freed.
  *
  * There is special handling here for zones which are full of pinned pages.
@@ -1052,10 +1048,8 @@ out:
  * the page allocator fallback scheme to ensure that aging of pages is balanced
  * across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, unsigned long nr_pages,
-				int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
-	unsigned long to_free = nr_pages;
 	int all_zones_ok;
 	int priority;
 	int i;
@@ -1065,7 +1059,7 @@ static unsigned long balance_pgdat(pg_da
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.may_swap = 1,
-		.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX,
+		.swap_cluster_max = SWAP_CLUSTER_MAX,
 	};
 
 loop_again:
@@ -1092,31 +1086,27 @@ loop_again:
 
 		all_zones_ok = 1;
 
-		if (nr_pages == 0) {
-			/*
-			 * Scan in the highmem->dma direction for the highest
-			 * zone which needs scanning
-			 */
-			for (i = pgdat->nr_zones - 1; i >= 0; i--) {
-				struct zone *zone = pgdat->node_zones + i;
+		/*
+		 * Scan in the highmem->dma direction for the highest
+		 * zone which needs scanning
+		 */
+		for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+			struct zone *zone = pgdat->node_zones + i;
 
-				if (!populated_zone(zone))
-					continue;
+			if (!populated_zone(zone))
+				continue;
 
-				if (zone->all_unreclaimable &&
-						priority != DEF_PRIORITY)
-					continue;
-
-				if (!zone_watermark_ok(zone, order,
-						zone->pages_high, 0, 0)) {
-					end_zone = i;
-					goto scan;
-				}
+			if (zone->all_unreclaimable &&
+					priority != DEF_PRIORITY)
+				continue;
+
+			if (!zone_watermark_ok(zone, order, zone->pages_high,
+					       0, 0)) {
+				end_zone = i;
+				goto scan;
 			}
-			goto out;
-		} else {
-			end_zone = pgdat->nr_zones - 1;
 		}
+		goto out;
 scan:
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
@@ -1143,11 +1133,9 @@ scan:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			if (nr_pages == 0) {	/* Not software suspend */
-				if (!zone_watermark_ok(zone, order,
-						zone->pages_high, end_zone, 0))
-					all_zones_ok = 0;
-			}
+			if (!zone_watermark_ok(zone, order, zone->pages_high,
+					       end_zone, 0))
+				all_zones_ok = 0;
 			zone->temp_priority = priority;
 			if (zone->prev_priority > priority)
 				zone->prev_priority = priority;
@@ -1172,8 +1160,6 @@ scan:
 			    total_scanned > nr_reclaimed + nr_reclaimed / 2)
 				sc.may_writepage = 1;
 		}
-		if (nr_pages && to_free > nr_reclaimed)
-			continue;	/* swsusp: need to do more work */
 		if (all_zones_ok)
 			break;		/* kswapd: all done */
 		/*
@@ -1189,7 +1175,7 @@ scan:
 		 * matches the direct reclaim path behaviour in terms of impact
 		 * on zone->*_priority.
 		 */
-		if ((nr_reclaimed >= SWAP_CLUSTER_MAX) && !nr_pages)
+		if (nr_reclaimed >= SWAP_CLUSTER_MAX)
 			break;
 	}
 out:
@@ -1271,7 +1257,7 @@ static int kswapd(void *p)
 		}
 		finish_wait(&pgdat->kswapd_wait, &wait);
 
-		balance_pgdat(pgdat, 0, order);
+		balance_pgdat(pgdat, order);
 	}
 	return 0;
 }
@@ -1300,37 +1286,152 @@ void wakeup_kswapd(struct zone *zone, in
 
 #ifdef CONFIG_PM
 /*
- * Try to free `nr_pages' of memory, system-wide.  Returns the number of freed
- * pages.
+ * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
+ * from LRU lists system-wide, for given pass and priority, and returns the
+ * number of reclaimed pages
+ *
+ * For pass > 3 we also try to shrink the LRU lists that contain a few pages
+ */
+unsigned long shrink_all_zones(unsigned long nr_pages, int prio, int pass,
+				struct scan_control *sc)
+{
+	struct zone *zone;
+	unsigned long nr_to_scan, ret = 0;
+
+	for_each_zone(zone) {
+
+		if (!populated_zone(zone))
+			continue;
+
+		if (zone->all_unreclaimable && prio != DEF_PRIORITY)
+			continue;
+
+		/* For pass = 0 we don't shrink the active list */
+		if (pass > 0) {
+			zone->nr_scan_active += (zone->nr_active >> prio) + 1;
+			if (zone->nr_scan_active >= nr_pages || pass > 3) {
+				zone->nr_scan_active = 0;
+				nr_to_scan = min(nr_pages, zone->nr_active);
+				shrink_active_list(nr_to_scan, zone, sc);
+			}
+		}
+
+		zone->nr_scan_inactive += (zone->nr_inactive >> prio) + 1;
+		if (zone->nr_scan_inactive >= nr_pages || pass > 3) {
+			zone->nr_scan_inactive = 0;
+			nr_to_scan = min(nr_pages, zone->nr_inactive);
+			ret += shrink_inactive_list(nr_to_scan, zone, sc);
+			if (ret >= nr_pages)
+				return ret;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Try to free `nr_pages' of memory, system-wide, and return the number of
+ * freed pages.
+ *
+ * Rather than trying to age LRUs the aim is to preserve the overall
+ * LRU order by reclaiming preferentially
+ * inactive > active > active referenced > active mapped
  */
 unsigned long shrink_all_memory(unsigned long nr_pages)
 {
-	pg_data_t *pgdat;
-	unsigned long nr_to_free = nr_pages;
+	unsigned long lru_pages, nr_slab;
 	unsigned long ret = 0;
-	unsigned retry = 2;
-	struct reclaim_state reclaim_state = {
-		.reclaimed_slab = 0,
+	int swappiness = vm_swappiness, pass;
+	struct reclaim_state reclaim_state;
+	struct zone *zone;
+	struct scan_control sc = {
+		.gfp_mask = GFP_KERNEL,
+		.may_swap = 1,
+		.swap_cluster_max = nr_pages,
+		.may_writepage = 1,
 	};
 
 	delay_swap_prefetch();
 
 	current->reclaim_state = &reclaim_state;
-repeat:
-	for_each_online_pgdat(pgdat) {
-		unsigned long freed;
 
-		freed = balance_pgdat(pgdat, nr_to_free, 0);
-		ret += freed;
-		nr_to_free -= freed;
-		if ((long)nr_to_free <= 0)
+	lru_pages = 0;
+	for_each_zone(zone)
+		lru_pages += zone->nr_active + zone->nr_inactive;
+	nr_slab = read_page_state(nr_slab);
+	/* If slab caches are huge, it's better to hit them first */
+	while (nr_slab >= lru_pages) {
+		reclaim_state.reclaimed_slab = 0;
+		shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+		if (!reclaim_state.reclaimed_slab)
 			break;
+
+		ret += reclaim_state.reclaimed_slab;
+		if (ret >= nr_pages)
+			goto out;
+
+		nr_slab -= reclaim_state.reclaimed_slab;
 	}
-	if (retry-- && ret < nr_pages) {
-		blk_congestion_wait(WRITE, HZ/5);
-		goto repeat;
+
+	/*
+	 * We try to shrink LRUs in 5 passes:
+	 * 0 = Reclaim from inactive_list only
+	 * 1 = Reclaim from active list but don't reclaim mapped
+	 * 2 = 2nd pass of type 1
+	 * 3 = Reclaim mapped (normal reclaim)
+	 * 4 = 2nd pass of type 3
+	 */
+	for (pass = 0; pass < 5; pass++) {
+		int prio;
+
+		/* Needed for shrinking slab caches later on */
+		if (!lru_pages)
+			for_each_zone(zone) {
+				lru_pages += zone->nr_active;
+				lru_pages += zone->nr_inactive;
+			}
+
+		/* Force reclaiming mapped pages in the passes #3 and #4 */
+		if (pass > 2)
+			vm_swappiness = 100;
+
+		for (prio = DEF_PRIORITY; prio >= 0; prio--) {
+			unsigned long nr_to_scan = nr_pages - ret;
+
+			sc.nr_mapped = read_page_state(nr_mapped);
+			sc.nr_scanned = 0;
+
+			ret += shrink_all_zones(nr_to_scan, prio, pass, &sc);
+			if (ret >= nr_pages)
+				goto out;
+
+			reclaim_state.reclaimed_slab = 0;
+			shrink_slab(sc.nr_scanned, sc.gfp_mask, lru_pages);
+			ret += reclaim_state.reclaimed_slab;
+			if (ret >= nr_pages)
+				goto out;
+
+			if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
+				blk_congestion_wait(WRITE, HZ / 10);
+		}
+
+		lru_pages = 0;
 	}
+
+	/*
+	 * If ret = 0, we could not shrink LRUs, but there may be something
+	 * in slab caches
+	 */
+	if (!ret)
+		do {
+			reclaim_state.reclaimed_slab = 0;
+			shrink_slab(nr_pages, sc.gfp_mask, lru_pages);
+			ret += reclaim_state.reclaimed_slab;
+		} while (ret < nr_pages && reclaim_state.reclaimed_slab > 0);
+
+out:
 	current->reclaim_state = NULL;
+	vm_swappiness = swappiness;
 	return ret;
 }
 #endif
Index: linux-2.6.17-rc1-mm2/kernel/power/swsusp.c
===================================================================
--- linux-2.6.17-rc1-mm2.orig/kernel/power/swsusp.c
+++ linux-2.6.17-rc1-mm2/kernel/power/swsusp.c
@@ -175,6 +175,12 @@ void free_all_swap_pages(int swap, struc
  */
 
 #define SHRINK_BITE	10000
+static inline unsigned long __shrink_memory(long tmp)
+{
+	if (tmp > SHRINK_BITE)
+		tmp = SHRINK_BITE;
+	return shrink_all_memory(tmp);
+}
 
 int swsusp_shrink_memory(void)
 {
@@ -195,12 +201,12 @@ int swsusp_shrink_memory(void)
 			if (!is_highmem(zone))
 				tmp -= zone->free_pages;
 		if (tmp > 0) {
-			tmp = shrink_all_memory(SHRINK_BITE);
+			tmp = __shrink_memory(tmp);
 			if (!tmp)
 				return -ENOMEM;
 			pages += tmp;
 		} else if (size > image_size / PAGE_SIZE) {
-			tmp = shrink_all_memory(SHRINK_BITE);
+			tmp = __shrink_memory(size - (image_size / PAGE_SIZE));
 			pages += tmp;
 		}
 		printk("\b%c", p[i++%4]);

* Re: Userland swsusp failure (mm-related)
  2006-04-09  1:51       ` Userland swsusp failure (mm-related) Nick Piggin
@ 2006-04-11 21:33         ` Rafael J. Wysocki
  2006-04-11 21:36           ` Pavel Machek
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2006-04-11 21:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pavel Machek, Fabio Comolli, linux-kernel

Hi,

On Sunday 09 April 2006 03:51, Nick Piggin wrote:
> Rafael J. Wysocki wrote:
> >>>Well, it looks like we didn't free enough RAM for suspend in this case.
> >>>Unfortunately we were below the min watermark for ZONE_NORMAL and
> >>>we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
> >>>ZONE_DMA in this case?).
> >>>
> >>>I think we can safely ignore the watermarks in swsusp, so probably
> >>>we can set PF_MEMALLOC for the current task temporarily and reset
> >>>it when we have allocated memory.  Pavel, what do you think?
> >>
> >>Seems little hacky but okay to me.
> >>
> >>Should not fixing "how much to free" computation to free a bit more be
> >>enough to handle this?
> > 
> > 
> > Yes, but in that case we'll leave some memory unused. ;-)
> > 
> 
> Probably doesn't fall back to ZONE_DMA because of lowmem reserve.
> Yes, PF_MEMALLOC sounds like it might do what you want. A little
> hackish perhaps, but better than putting swsusp special cases
> into page_alloc.c.

The appended patch contains the changes I'd like to make.  Pavel, is that
acceptable?

Rafael

---
 kernel/power/snapshot.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6.17-rc1-mm2/kernel/power/snapshot.c
===================================================================
--- linux-2.6.17-rc1-mm2.orig/kernel/power/snapshot.c	2006-04-08 21:29:55.000000000 +0200
+++ linux-2.6.17-rc1-mm2/kernel/power/snapshot.c	2006-04-11 22:09:28.000000000 +0200
@@ -461,17 +461,23 @@ static struct pbe *swsusp_alloc(unsigned
 {
 	struct pbe *pblist;
 
+	/* We don't want to be affected by zone watermarks etc. */
+	current->flags |= PF_MEMALLOC;
+
 	if (!(pblist = alloc_pagedir(nr_pages, GFP_ATOMIC | __GFP_COLD, 0))) {
 		printk(KERN_ERR "suspend: Allocating pagedir failed.\n");
-		return NULL;
+		goto out;
 	}
 
 	if (alloc_data_pages(pblist, GFP_ATOMIC | __GFP_COLD, 0)) {
 		printk(KERN_ERR "suspend: Allocating image pages failed.\n");
 		swsusp_free();
-		return NULL;
+		pblist = NULL;
 	}
 
+out:
+	current->flags &= ~PF_MEMALLOC;
+
 	return pblist;
 }
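
The set/clear pattern in the patch above can be tried outside the kernel. What follows is only a user-space sketch, not the kernel code: fake_current, the two *_stub allocators, the fail_* switches and the PF_MEMALLOC value are stand-ins invented for illustration. It shows the one property the goto-based restructuring is after, namely that the flag is dropped on every exit path, including both failure paths.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in value for this sketch; the real flag lives in <linux/sched.h>. */
#define PF_MEMALLOC 0x00000800u

struct task_sketch {
	unsigned int flags;
};

/* Hypothetical replacement for the kernel's `current` task pointer. */
static struct task_sketch fake_current;

/* Test switches: make either allocation step fail on demand. */
static int fail_pagedir;
static int fail_data;

/* Records whether the flag was visible while "allocating". */
static int saw_memalloc;

/* Hypothetical allocator stubs: return 0 on success, -1 on failure. */
static int alloc_pagedir_stub(void)
{
	saw_memalloc = (fake_current.flags & PF_MEMALLOC) != 0;
	return fail_pagedir ? -1 : 0;
}

static int alloc_data_stub(void)
{
	return fail_data ? -1 : 0;
}

static int image_token;

/* Mirrors the control flow of the patched swsusp_alloc(): single exit. */
static void *alloc_image_sketch(void)
{
	void *result = NULL;

	/* Ask the (imaginary) allocator to ignore watermarks. */
	fake_current.flags |= PF_MEMALLOC;

	if (alloc_pagedir_stub())
		goto out;

	if (alloc_data_stub())
		goto out;

	result = &image_token;
out:
	/* Reached on success and on both failure paths. */
	fake_current.flags &= ~PF_MEMALLOC;
	return result;
}
```

Like the patch, the sketch clears the flag unconditionally, so it assumes the caller did not have PF_MEMALLOC set already.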
 


* Re: Userland swsusp failure (mm-related)
  2006-04-11 21:33         ` Rafael J. Wysocki
@ 2006-04-11 21:36           ` Pavel Machek
  2006-04-11 22:10             ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Machek @ 2006-04-11 21:36 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Nick Piggin, Fabio Comolli, linux-kernel

Hi!

> > Rafael J. Wysocki wrote:
> > >>>Well, it looks like we didn't free enough RAM for suspend in this case.
> > >>>Unfortunately we were below the min watermark for ZONE_NORMAL and
> > >>>we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
> > >>>ZONE_DMA in this case?).
> > >>>
> > >>>I think we can safely ignore the watermarks in swsusp, so probably
> > >>>we can set PF_MEMALLOC for the current task temporarily and reset
> > >>>it when we have allocated memory.  Pavel, what do you think?
> > >>
> > >>Seems little hacky but okay to me.
> > >>
> > >>Should not fixing "how much to free" computation to free a bit more be
> > >>enough to handle this?
> > > 
> > > 
> > > Yes, but in that case we'll leave some memory unused. ;-)
> > > 
> > 
> > Probably doesn't fall back to ZONE_DMA because of lowmem reserve.
> > Yes, PF_MEMALLOC sounds like it might do what you want. A little
> > hackish perhaps, but better than putting swsusp special cases
> > into page_alloc.c.
> 
> The appended patch contains the changes I'd like to make.  Pavel, is that
> acceptable?

Why is PF_MEMALLOC only necessary for pagedir allocations, and not
for normal page allocations, too?


> Rafael
> 
> ---
>  kernel/power/snapshot.c |   10 ++++++++--
>  1 files changed, 8 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6.17-rc1-mm2/kernel/power/snapshot.c
> ===================================================================
> --- linux-2.6.17-rc1-mm2.orig/kernel/power/snapshot.c	2006-04-08 21:29:55.000000000 +0200
> +++ linux-2.6.17-rc1-mm2/kernel/power/snapshot.c	2006-04-11 22:09:28.000000000 +0200
> @@ -461,17 +461,23 @@ static struct pbe *swsusp_alloc(unsigned
>  {
>  	struct pbe *pblist;
>  
> +	/* We don't want to be affected by zone watermarks etc. */
> +	current->flags |= PF_MEMALLOC;
> +
>  	if (!(pblist = alloc_pagedir(nr_pages, GFP_ATOMIC | __GFP_COLD, 0))) {
>  		printk(KERN_ERR "suspend: Allocating pagedir failed.\n");
> -		return NULL;
> +		goto out;
>  	}
>  
>  	if (alloc_data_pages(pblist, GFP_ATOMIC | __GFP_COLD, 0)) {
>  		printk(KERN_ERR "suspend: Allocating image pages failed.\n");
>  		swsusp_free();
> -		return NULL;
> +		pblist = NULL;
>  	}
>  
> +out:
> +	current->flags &= ~PF_MEMALLOC;
> +
>  	return pblist;
>  }
>  

-- 
Thanks, Sharp!


* Re: Userland swsusp failure (mm-related)
  2006-04-11 21:36           ` Pavel Machek
@ 2006-04-11 22:10             ` Rafael J. Wysocki
  2006-04-12  5:29               ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2006-04-11 22:10 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Nick Piggin, Fabio Comolli, linux-kernel

Hi,

On Tuesday 11 April 2006 23:36, Pavel Machek wrote:
> > > Rafael J. Wysocki wrote:
> > > >>>Well, it looks like we didn't free enough RAM for suspend in this case.
> > > >>>Unfortunately we were below the min watermark for ZONE_NORMAL and
> > > >>>we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
> > > >>>ZONE_DMA in this case?).
> > > >>>
> > > >>>I think we can safely ignore the watermarks in swsusp, so probably
> > > >>>we can set PF_MEMALLOC for the current task temporarily and reset
> > > >>>it when we have allocated memory.  Pavel, what do you think?
> > > >>
> > > >>Seems little hacky but okay to me.
> > > >>
> > > >>Should not fixing "how much to free" computation to free a bit more be
> > > >>enough to handle this?
> > > > 
> > > > 
> > > > Yes, but in that case we'll leave some memory unused. ;-)
> > > > 
> > > 
> > > Probably doesn't fall back to ZONE_DMA because of lowmem reserve.
> > > Yes, PF_MEMALLOC sounds like it might do what you want. A little
> > > hackish perhaps, but better than putting swsusp special cases
> > > into page_alloc.c.
> > 
> > The appended patch contains the changes I'd like to make.  Pavel, is that
> > acceptable?
> 
> Why is PF_MEMALLOC only necessary for pagedir allocations, and not
> for normal page allocations, too?

Right, we'll need it until we finally free the image, so I think it should be
set/reset in disk.c:pm_suspend_disk().

However, there's a problem with this approach wrt the userland suspend, because
we'd have to keep PF_MEMALLOC set across ioctls and I wouldn't like to do
this.

Well, the alternative solution would be to take the ZONE_DMA's lowmem reserve
into account in our free memory computations.

Greetings,
Rafael


* Re: Userland swsusp failure (mm-related)
  2006-04-11 22:10             ` Rafael J. Wysocki
@ 2006-04-12  5:29               ` Rafael J. Wysocki
  0 siblings, 0 replies; 15+ messages in thread
From: Rafael J. Wysocki @ 2006-04-12  5:29 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Nick Piggin, Fabio Comolli, linux-kernel

Hi,

On Wednesday 12 April 2006 00:10, Rafael J. Wysocki wrote:
> On Tuesday 11 April 2006 23:36, Pavel Machek wrote:
> > > > Rafael J. Wysocki wrote:
> > > > >>>Well, it looks like we didn't free enough RAM for suspend in this case.
> > > > >>>Unfortunately we were below the min watermark for ZONE_NORMAL and
> > > > >>>we tried to allocate with GFP_ATOMIC (Nick, shouldn't we fall back to
> > > > >>>ZONE_DMA in this case?).
> > > > >>>
> > > > >>>I think we can safely ignore the watermarks in swsusp, so probably
> > > > >>>we can set PF_MEMALLOC for the current task temporarily and reset
> > > > >>>it when we have allocated memory.  Pavel, what do you think?
> > > > >>
> > > > >>Seems little hacky but okay to me.
> > > > >>
> > > > >>Should not fixing "how much to free" computation to free a bit more be
> > > > >>enough to handle this?
> > > > > 
> > > > > 
> > > > > Yes, but in that case we'll leave some memory unused. ;-)
> > > > > 
> > > > 
> > > > Probably doesn't fall back to ZONE_DMA because of lowmem reserve.
> > > > Yes, PF_MEMALLOC sounds like it might do what you want. A little
> > > > hackish perhaps, but better than putting swsusp special cases
> > > > into page_alloc.c.
> > > 
> > > The appended patch contains the changes I'd like to make.  Pavel, is that
> > > acceptable?
> > 
> > Why is PF_MEMALLOC only necessary for pagedir allocations, and not
> > for normal page allocations, too?
> 
> Right, we'll need it until we finally free the image, so I think it should be
> set/reset in disk.c:pm_suspend_disk().
> 
> However, there's a problem with this approach wrt the userland suspend, because
> we'd have to keep PF_MEMALLOC set across ioctls and I wouldn't like to do
> this.
> 
> Well, the alternative solution would be to take the ZONE_DMA's lowmem reserve
> into account in our free memory computations.

OK, the appended patch subtracts each zone's lowmem reserve for ZONE_NORMAL
from the number of free pages because we are going to allocate from the normal
zone and won't be able to use the lowmem reserves.

Please have a look.

Greetings,
Rafael

---
 kernel/power/swsusp.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.17-rc1-mm2/kernel/power/swsusp.c
===================================================================
--- linux-2.6.17-rc1-mm2.orig/kernel/power/swsusp.c	2006-04-12 07:09:20.000000000 +0200
+++ linux-2.6.17-rc1-mm2/kernel/power/swsusp.c	2006-04-12 07:11:09.000000000 +0200
@@ -192,8 +192,10 @@ int swsusp_shrink_memory(void)
 			PAGES_FOR_IO;
 		tmp = size;
 		for_each_zone (zone)
-			if (!is_highmem(zone))
+			if (!is_highmem(zone) && zone->present_pages > 0) {
 				tmp -= zone->free_pages;
+				tmp += zone->lowmem_reserve[ZONE_NORMAL];
+			}
 		if (tmp > 0) {
 			tmp = shrink_all_memory(SHRINK_BITE);
 			if (!tmp)
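
To make the effect of the change above concrete, here is a user-space sketch of the adjusted computation. It is an illustration under assumptions, not the kernel code: struct zone_sketch, its highmem field and pages_to_free_sketch() are made-up names, and only present_pages, free_pages and lowmem_reserve[] correspond to fields used in the patch. For every populated non-highmem zone, the free pages are subtracted from the target and the part reserved against ZONE_NORMAL allocations is added back, since those pages cannot be used.

```c
#include <assert.h>

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

/* Hypothetical, cut-down stand-in for the kernel's struct zone. */
struct zone_sketch {
	unsigned long present_pages;
	unsigned long free_pages;
	unsigned long lowmem_reserve[MAX_NR_ZONES];
	int highmem;	/* replaces the is_highmem(zone) test */
};

/*
 * Returns how many pages still have to be freed to fit `size` pages
 * into the usable free lowmem; a result <= 0 means there is enough.
 */
static long pages_to_free_sketch(long size, const struct zone_sketch *zones,
				 int nr_zones)
{
	long tmp = size;
	int i;

	for (i = 0; i < nr_zones; i++) {
		const struct zone_sketch *z = &zones[i];

		/* Skip highmem zones and zones with no pages at all. */
		if (z->highmem || z->present_pages == 0)
			continue;

		tmp -= (long)z->free_pages;
		/* These pages are held back from ZONE_NORMAL allocations. */
		tmp += (long)z->lowmem_reserve[ZONE_NORMAL];
	}
	return tmp;
}
```

With purely illustrative numbers, say 3000 free pages in ZONE_DMA of which 2000 are reserved, 10000 free pages in ZONE_NORMAL and a target of 12000 pages, the result is 1000 pages still to free. Without the reserve term the same input would give -1000 and the shrinking would stop too early.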


* Re: shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related))
  2006-04-11 17:06             ` Rafael J. Wysocki
@ 2006-04-13 12:42               ` Con Kolivas
  2006-04-13 13:54                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2006-04-13 12:42 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-kernel, Pavel Machek, Fabio Comolli, Nick Piggin

On Wednesday 12 April 2006 03:06, Rafael J. Wysocki wrote:
> The patch is appended.
>
> In shrink_all_memory() I try to free exactly as many pages as the caller
> asks for, preferably in one shot, starting from easier targets.  If slabs
> are huge, they are most likely to have enough pages to reclaim.  The
> inactive lists are next (the zones with more inactive pages go first) etc. 
> However, since each pass potentially requires more work, the number of
> pages to scan is decreased as the pages are reclaimed which seems to make
> the shrinking of memory go more smoothly.
>
> I've been testing it on an x86_64 box for some time and it seems to behave
> quite reasonably, eg. it usually makes the actual image size very close to
> the value of image_size and if you set image_size to 0, it shrinks
> everything almost totally.

Great. Looks pretty good. See comments.

> ---

>  #ifdef CONFIG_PM
>  /*
> - * Try to free `nr_pages' of memory, system-wide.  Returns the number of freed
> - * pages.
> + * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
> + * from LRU lists system-wide, for given pass and priority, and returns the
> + * number of reclaimed pages
> + *
> + * For pass > 3 we also try to shrink the LRU lists that contain a few pages
> + */
> +unsigned long shrink_all_zones(unsigned long nr_pages, int pass, int prio,
> +				struct scan_control *sc)

I like how this moves all suspend vm functions out of the generic functions 
even more than I managed to.

> +	int swappiness = vm_swappiness, pass;
> +	struct reclaim_state reclaim_state;
> +	struct zone *zone;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.may_swap = 1,
> +		.swap_cluster_max = nr_pages,
> +		.may_writepage = 1,
>  	};

This is not quite right at maintaining the original semantics I was proposing. 
Since you are iterating over all priorities, setting may_swap means you will 
reclaim mapped ram on the earlier passes once priority gets low enough. 
Setting vm_swappiness temporarily to 100 is unnecessary. You should set 
may_swap to 0 and set it to 1 on passes 3+.

Otherwise, looks good, thanks!

-- 
-ck


* Re: shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related))
  2006-04-13 12:42               ` Con Kolivas
@ 2006-04-13 13:54                 ` Rafael J. Wysocki
  2006-04-13 14:01                   ` Con Kolivas
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2006-04-13 13:54 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, Pavel Machek, Fabio Comolli, Nick Piggin

On Thursday 13 April 2006 14:42, Con Kolivas wrote:
> On Wednesday 12 April 2006 03:06, Rafael J. Wysocki wrote:
> > The patch is appended.
> >
> > In shrink_all_memory() I try to free exactly as many pages as the caller
> > asks for, preferably in one shot, starting from easier targets.  If slabs
> > are huge, they are most likely to have enough pages to reclaim.  The
> > inactive lists are next (the zones with more inactive pages go first) etc. 
> > However, since each pass potentially requires more work, the number of
> > pages to scan is decreased as the pages are reclaimed which seems to make
> > the shrinking of memory go more smoothly.
> >
> > I've been testing it on an x86_64 box for some time and it seems to behave
> > quite reasonably, eg. it usually makes the actual image size very close to
> > the value of image_size and if you set image_size to 0, it shrinks
> > everything almost totally.
> 
> Great. Looks pretty good. See comments.
> 
> > ---
> 
> >  #ifdef CONFIG_PM
> >  /*
> > - * Try to free `nr_pages' of memory, system-wide.  Returns the number of freed
> > - * pages.
> > + * Helper function for shrink_all_memory().  Tries to reclaim 'nr_pages' pages
> > + * from LRU lists system-wide, for given pass and priority, and returns the
> > + * number of reclaimed pages
> > + *
> > + * For pass > 3 we also try to shrink the LRU lists that contain a few pages
> > + */
> > +unsigned long shrink_all_zones(unsigned long nr_pages, int pass, int prio,
> > +				struct scan_control *sc)
> 
> I like how this moves all suspend vm functions out of the generic functions 
> even more than I managed to.
> 
> > +	int swappiness = vm_swappiness, pass;
> > +	struct reclaim_state reclaim_state;
> > +	struct zone *zone;
> > +	struct scan_control sc = {
> > +		.gfp_mask = GFP_KERNEL,
> > +		.may_swap = 1,
> > +		.swap_cluster_max = nr_pages,
> > +		.may_writepage = 1,
> >  	};
> 
> This is not quite right at maintaining the original semantics I was proposing. 
> Since you are iterating over all priorities, setting may_swap means you will 
> reclaim mapped ram on the earlier passes once priority gets low enough.

No, I won't, because I don't update zone->prev_priority which is necessary
to trigger this.  Unless of course zone->prev_priority is already low enough ...

> Setting vm_swappiness temporarily to 100 is unnecessary. You should set 
> may_swap to 0 and set it to 1 on passes 3+.

... which can be dealt with by setting may_swap like you're saying.

I'll make this change and repost as an RFC in a separate thread.
 
> Otherwise, looks good, thanks!

Thanks a lot for the comments.

Greetings,
Rafael
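
The may_swap change agreed on above could look roughly like the following user-space sketch. The real patch is not part of this message, so everything here is an assumption made for illustration: struct scan_control_sketch, prepare_pass_sketch() and the pass numbering starting at 0 are invented, and only the rule itself ("set may_swap to 0 and set it to 1 on passes 3+") comes from the discussion.

```c
#include <assert.h>

/* Hypothetical subset of the scan_control fields quoted in the thread. */
struct scan_control_sketch {
	int may_swap;
	int may_writepage;
};

/*
 * Configure the scan for one pass: mapped memory is left alone on the
 * early passes and only becomes eligible from pass 3 onwards.
 */
static void prepare_pass_sketch(struct scan_control_sketch *sc, int pass)
{
	sc->may_writepage = 1;
	sc->may_swap = (pass >= 3) ? 1 : 0;
}
```

A caller would invoke prepare_pass_sketch() at the top of each pass, before walking the priorities, so that may_swap never has to be toggled inside the inner loop.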


* Re: shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related))
  2006-04-13 13:54                 ` Rafael J. Wysocki
@ 2006-04-13 14:01                   ` Con Kolivas
  0 siblings, 0 replies; 15+ messages in thread
From: Con Kolivas @ 2006-04-13 14:01 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-kernel, Pavel Machek, Fabio Comolli, Nick Piggin

On Thursday 13 April 2006 23:54, Rafael J. Wysocki wrote:
> On Thursday 13 April 2006 14:42, Con Kolivas wrote:
> > This is not quite right at maintaining the original semantics I was
> > proposing. Since you are iterating over all priorities, setting may_swap
> > means you will reclaim mapped ram on the earlier passes once priority
> > gets low enough.
>
> No, I won't, because I don't update zone->prev_priority which is necessary
> to trigger this.  Unless of course zone->prev_priority is already low
> enough ...

Ah yes of course, that explains why I didn't need to either :P

> > Setting vm_swappiness temporarily to 100 is unnecessary. You should set
> > may_swap to 0 and set it to 1 on passes 3+.
>
> ... which can be dealt with by setting may_swap like you're saying.
>
> I'll make this change and repost as an RFC in a separate thread.

Great

-- 
-ck


end of thread, other threads:[~2006-04-13 14:02 UTC | newest]

Thread overview: 15+ messages
     [not found] <b637ec0b0604080537s55e63544r8bb63c887e81ecaf@mail.gmail.com>
2006-04-08 15:16 ` Userland swsusp failure (mm-related) Rafael J. Wysocki
2006-04-08 16:15   ` Pavel Machek
2006-04-08 22:47     ` Rafael J. Wysocki
2006-04-08 23:24       ` Con Kolivas
2006-04-09 20:36         ` shrink_all_memory tweaks (was: Re: Userland swsusp failure (mm-related)) Rafael J. Wysocki
2006-04-09 23:23           ` Con Kolivas
2006-04-11 17:06             ` Rafael J. Wysocki
2006-04-13 12:42               ` Con Kolivas
2006-04-13 13:54                 ` Rafael J. Wysocki
2006-04-13 14:01                   ` Con Kolivas
2006-04-09  1:51       ` Userland swsusp failure (mm-related) Nick Piggin
2006-04-11 21:33         ` Rafael J. Wysocki
2006-04-11 21:36           ` Pavel Machek
2006-04-11 22:10             ` Rafael J. Wysocki
2006-04-12  5:29               ` Rafael J. Wysocki
