public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Rik van Riel <riel@redhat.com>
Cc: linux-kernel@vger.kernel.org, lee.schermerhorn@hp.com,
	kosaki.motohiro@jp.fujitsu.com
Subject: Re: [PATCH -mm 07/25] second chance replacement for anonymous pages
Date: Fri, 6 Jun 2008 18:04:43 -0700	[thread overview]
Message-ID: <20080606180443.43f782e2.akpm@linux-foundation.org> (raw)
In-Reply-To: <20080606202858.744030156@redhat.com>

On Fri, 06 Jun 2008 16:28:45 -0400
Rik van Riel <riel@redhat.com> wrote:

> From: Rik van Riel <riel@redhat.com>
> 
> We avoid evicting and scanning anonymous pages for the most part, but
> under some workloads we can end up with most of memory filled with
> anonymous pages.  At that point, we suddenly need to clear the referenced
> bits on all of memory, which can take ages on very large memory systems.
> 
> We can reduce the maximum number of pages that need to be scanned by
> not taking the referenced state into account when deactivating an
> anonymous page.  After all, every anonymous page starts out referenced,
> so why check?
> 
> If an anonymous page gets referenced again before it reaches the end
> of the inactive list, we move it back to the active list.
> 
> To keep the maximum amount of necessary work reasonable, we scale the
> active to inactive ratio with the size of memory, using the formula
> active:inactive ratio = sqrt(memory in GB * 10).

Should be scaled by PAGE_SIZE?

> Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
> instead of by the amount of memory present in the system.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> ---
>  include/linux/mm_inline.h |   12 ++++++++++++
>  include/linux/mmzone.h    |    5 +++++
>  mm/page_alloc.c           |   40 ++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c               |   38 +++++++++++++++++++++++++++++++-------
>  mm/vmstat.c               |    6 ++++--
>  5 files changed, 92 insertions(+), 9 deletions(-)
> 
> Index: linux-2.6.26-rc2-mm1/include/linux/mm_inline.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/mm_inline.h	2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/mm_inline.h	2008-05-28 12:09:06.000000000 -0400
> @@ -97,4 +97,16 @@ del_page_from_lru(struct zone *zone, str
>  	__dec_zone_state(zone, NR_INACTIVE_ANON + l);
>  }
>  
> +static inline int inactive_anon_low(struct zone *zone)
> +{
> +	unsigned long active, inactive;
> +
> +	active = zone_page_state(zone, NR_ACTIVE_ANON);
> +	inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> +
> +	if (inactive * zone->inactive_ratio < active)
> +		return 1;
> +
> +	return 0;
> +}

inactive_anon_low: "number of inactive anonymous pages which are in lowmem"?

Nope.

Needs a comment.  And maybe a better name, like inactive_anon_is_low. 
Although making the return type a bool kind-of does that.

>  #endif
> Index: linux-2.6.26-rc2-mm1/include/linux/mmzone.h
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/include/linux/mmzone.h	2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/include/linux/mmzone.h	2008-05-28 12:09:06.000000000 -0400
> @@ -311,6 +311,11 @@ struct zone {
>  	 */
>  	int prev_priority;
>  
> +	/*
> +	 * The ratio of active to inactive pages.
> +	 */
> +	unsigned int inactive_ratio;

That comment needs a lot of help please.  For a start, it's plain wrong
- inactive_ratio would need to be a float to be able to record that ratio.

The comment should describe the units too.

Now poor-old-reviewer has to go off and work out what this thing is.

>  
>  	ZONE_PADDING(_pad2_)
>  	/* Rarely used or read-mostly fields */
> Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c	2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/page_alloc.c	2008-05-28 12:09:06.000000000 -0400
> @@ -4269,6 +4269,45 @@ void setup_per_zone_pages_min(void)
>  	calculate_totalreserve_pages();
>  }
>  
> +/**
> + * setup_per_zone_inactive_ratio - called when min_free_kbytes changes.
> + *
> + * The inactive anon list should be small enough that the VM never has to
> + * do too much work, but large enough that each inactive page has a chance
> + * to be referenced again before it is swapped out.
> + *
> + * The inactive_anon ratio is the ratio of active to inactive anonymous

target ratio?  Desired ratio?

> + * pages.  Ie. a ratio of 3 means 3:1 or 25% of the anonymous pages are
> + * on the inactive list.
> + *
> + * total     return    max
> + * memory    value     inactive anon

This function doesn't "return" a "value".

> + * -------------------------------------
> + *   10MB       1         5MB
> + *  100MB       1        50MB
> + *    1GB       3       250MB
> + *   10GB      10       0.9GB
> + *  100GB      31         3GB
> + *    1TB     101        10GB
> + *   10TB     320        32GB
> + */
> +void setup_per_zone_inactive_ratio(void)
> +{
> +	struct zone *zone;
> +
> +	for_each_zone(zone) {
> +		unsigned int gb, ratio;
> +
> +		/* Zone size in gigabytes */
> +		gb = zone->present_pages >> (30 - PAGE_SHIFT);
> +		ratio = int_sqrt(10 * gb);
> +		if (!ratio)
> +			ratio = 1;
> +
> +		zone->inactive_ratio = ratio;
> +	}
> +}

OK, so inactive_ratio is an integer 1 ..  N which determines our target
number of inactive pages according to the formula

	nr_inactive = nr_active / inactive_ratio

yes?

Can nr_inactive get larger than this?  I assume so.  I guess that
doesn't matter much.  Except the problems which you're trying to sovle
here can reoccur.   What would I need to do to trigger that?

>  /*
>   * Initialise min_free_kbytes.
>   *
> @@ -4306,6 +4345,7 @@ static int __init init_per_zone_pages_mi
>  		min_free_kbytes = 65536;
>  	setup_per_zone_pages_min();
>  	setup_per_zone_lowmem_reserve();
> +	setup_per_zone_inactive_ratio();
>  	return 0;
>  }
>  module_init(init_per_zone_pages_min)
> Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
> ===================================================================
> --- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c	2008-05-23 14:21:34.000000000 -0400
> +++ linux-2.6.26-rc2-mm1/mm/vmscan.c	2008-05-28 12:11:38.000000000 -0400
> @@ -114,7 +114,7 @@ struct scan_control {
>  /*
>   * From 0 .. 100.  Higher means more swappy.
>   */
> -int vm_swappiness = 60;
> +int vm_swappiness = 20;

<goes back to check the changelog>

Whoa.  Where'd this come from?

>  long vm_total_pages;	/* The total number of pages which the VM controls */
>  
>  static LIST_HEAD(shrinker_list);
> @@ -1008,7 +1008,7 @@ static inline int zone_is_near_oom(struc
>  static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>  			struct scan_control *sc, int priority, int file)
>  {
> -	unsigned long pgmoved;
> +	unsigned long pgmoved = 0;
>  	int pgdeactivate = 0;
>  	unsigned long pgscanned;
>  	LIST_HEAD(l_hold);	/* The pages which were snipped off */
> @@ -1036,17 +1036,32 @@ static void shrink_active_list(unsigned 
>  		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
>  	spin_unlock_irq(&zone->lru_lock);
>  
> +	pgmoved = 0;

didn't we just do that?

>  	while (!list_empty(&l_hold)) {
>  		cond_resched();
>  		page = lru_to_page(&l_hold);
>  		list_del(&page->lru);
> -		if (page_referenced(page, 0, sc->mem_cgroup))
> -			list_add(&page->lru, &l_active);
> -		else
> +		if (page_referenced(page, 0, sc->mem_cgroup)) {
> +			if (file) {
> +				/* Referenced file pages stay active. */
> +				list_add(&page->lru, &l_active);
> +			} else {
> +				/* Anonymous pages always get deactivated. */

hm.  That's going to make the machine swap like hell.  I guess I don't
understand all this yet.

> +				list_add(&page->lru, &l_inactive);
> +				pgmoved++;
> +			}
> +		} else
>  			list_add(&page->lru, &l_inactive);
>  	}
>  
>  	/*
> +	 * Count the referenced anon pages as rotated, to balance pageout
> +	 * scan pressure between file and anonymous pages in get_sacn_ratio.

tpyo



  reply	other threads:[~2008-06-07  1:06 UTC|newest]

Thread overview: 102+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-06 20:28 [PATCH -mm 00/25] VM pageout scalability improvements (V10) Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 01/25] move isolate_lru_page() to vmscan.c Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 02/25] Use an indexed array for LRU variables Rik van Riel, Rik van Riel
2008-06-07  1:04   ` Andrew Morton
2008-06-07  5:43     ` KOSAKI Motohiro
2008-06-07 14:47       ` Rik van Riel
2008-06-08 11:22         ` KOSAKI Motohiro
2008-06-07 18:42     ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 03/25] use an array for the LRU pagevecs Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 04/25] free swap space on swap-in/activation Rik van Riel, Rik van Riel
2008-06-07  1:04   ` Andrew Morton
2008-06-07 19:56     ` Rik van Riel
2008-06-09  2:14     ` MinChan Kim
2008-06-09  2:42       ` Rik van Riel
2008-06-09 13:38       ` KOSAKI Motohiro
2008-06-10  2:30         ` MinChan Kim
2008-06-06 20:28 ` [PATCH -mm 05/25] define page_file_cache() function Rik van Riel, Rik van Riel
2008-06-07  1:04   ` Andrew Morton
2008-06-07 23:38     ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 06/25] split LRU lists into anon & file sets Rik van Riel, Rik van Riel
2008-06-07  1:04   ` Andrew Morton
2008-06-07  1:22     ` Rik van Riel
2008-06-07  1:52       ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 07/25] second chance replacement for anonymous pages Rik van Riel, Rik van Riel
2008-06-07  1:04   ` Andrew Morton [this message]
2008-06-07  6:03     ` KOSAKI Motohiro
2008-06-07  6:43       ` Andrew Morton
2008-06-08 15:04     ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 08/25] add some sanity checks to get_scan_ratio Rik van Riel, Rik van Riel
2008-06-07  1:04   ` Andrew Morton
2008-06-08 15:11     ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 09/25] fix pagecache reclaim referenced bit check Rik van Riel, Rik van Riel
2008-06-07  1:04   ` Andrew Morton
2008-06-07  1:08     ` Rik van Riel
2008-06-08 10:02       ` Peter Zijlstra
2008-06-06 20:28 ` [PATCH -mm 10/25] add newly swapped in pages to the inactive list Rik van Riel, Rik van Riel
2008-06-07  1:04   ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 11/25] more aggressively use lumpy reclaim Rik van Riel, Rik van Riel
2008-06-07  1:05   ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 12/25] pageflag helpers for configed-out flags Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Rik van Riel, Rik van Riel
2008-06-07  1:05   ` Andrew Morton
2008-06-08 20:34     ` Rik van Riel
2008-06-08 20:57       ` Andrew Morton
2008-06-08 21:32         ` Rik van Riel
2008-06-08 21:43           ` Ray Lee
2008-06-08 23:22           ` Andrew Morton
2008-06-08 23:34             ` Rik van Riel
2008-06-08 23:54               ` Andrew Morton
2008-06-09  0:56                 ` Rik van Riel
2008-06-09  6:10                   ` Andrew Morton
2008-06-09 13:44                     ` Rik van Riel
2008-06-09  2:58                 ` Rik van Riel
2008-06-09  5:44                   ` Andrew Morton
2008-06-10 19:17                 ` Christoph Lameter
2008-06-10 19:37                   ` Rik van Riel
2008-06-10 21:33                     ` Andrew Morton
2008-06-10 21:48                       ` Andi Kleen
2008-06-10 22:05                       ` Dave Hansen
2008-06-11  5:09                       ` Paul Mundt
2008-06-11  6:16                         ` Andrew Morton
2008-06-11  6:29                           ` Paul Mundt
2008-06-11 12:06                           ` Andi Kleen
2008-06-11 14:09                           ` Removing node flags from page->flags was Re: [PATCH -mm 13/25] Noreclaim LRU Infrastructure II Andi Kleen
2008-06-11 19:03                       ` [PATCH -mm 13/25] Noreclaim LRU Infrastructure Andy Whitcroft
2008-06-11 20:52                         ` Andi Kleen
2008-06-11 23:25                         ` Christoph Lameter
2008-06-08 22:03         ` Rik van Riel
2008-06-08 21:07       ` KOSAKI Motohiro
2008-06-10 20:09     ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 14/25] Noreclaim LRU Page Statistics Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 15/25] Ramfs and Ram Disk pages are non-reclaimable Rik van Riel, Rik van Riel
2008-06-07  1:05   ` Andrew Morton
2008-06-08  4:32     ` Greg KH
2008-06-06 20:28 ` [PATCH -mm 16/25] SHM_LOCKED " Rik van Riel, Rik van Riel
2008-06-07  1:05   ` Andrew Morton
2008-06-07  5:21     ` KOSAKI Motohiro
2008-06-10 21:03     ` Rik van Riel
2008-06-10 21:22       ` Lee Schermerhorn
2008-06-10 21:49         ` Andrew Morton
2008-06-06 20:28 ` [PATCH -mm 17/25] Mlocked Pages " Rik van Riel, Rik van Riel
2008-06-07  1:07   ` Andrew Morton
2008-06-07  5:38     ` KOSAKI Motohiro
2008-06-10  3:31     ` Nick Piggin
2008-06-10 12:50       ` Rik van Riel
2008-06-10 21:14       ` Rik van Riel
2008-06-10 21:43         ` Lee Schermerhorn
2008-06-10 21:57           ` Andrew Morton
2008-06-11 16:01             ` Lee Schermerhorn
2008-06-10 23:48           ` Rik van Riel
2008-06-11 15:29             ` Lee Schermerhorn
2008-06-11  1:00     ` Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 18/25] Downgrade mmap sem while populating mlocked regions Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 19/25] Handle mlocked pages during map, remap, unmap Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 20/25] Mlocked Pages statistics Rik van Riel, Rik van Riel
2008-06-06 20:28 ` [PATCH -mm 21/25] Cull non-reclaimable pages in fault path Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 22/25] Noreclaim and Mlocked pages vm events Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 23/25] Noreclaim LRU scan sysctl Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 24/25] Mlocked Pages: count attempts to free mlocked page Rik van Riel, Rik van Riel
2008-06-06 20:29 ` [PATCH -mm 25/25] Noreclaim LRU and Mlocked Pages Documentation Rik van Riel, Rik van Riel
2008-06-06 21:02 ` [PATCH -mm 00/25] VM pageout scalability improvements (V10) Andrew Morton
2008-06-06 21:08   ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080606180443.43f782e2.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=lee.schermerhorn@hp.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox