Subject: Re: [PATCH V4 06/10] Per-memcg background reclaim.
From: Ying Han <yinghan@google.com>
Date: Thu, 14 Apr 2011 23:08:40 -0700
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
    Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
    Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
    Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm@kvack.org
In-Reply-To: <20110415101148.80cb6721.kamezawa.hiroyu@jp.fujitsu.com>
References: <1302821669-29862-1-git-send-email-yinghan@google.com>
    <1302821669-29862-7-git-send-email-yinghan@google.com>
    <20110415101148.80cb6721.kamezawa.hiroyu@jp.fujitsu.com>

On Thu, Apr 14, 2011 at 6:11 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 14 Apr 2011 15:54:25 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > This is the main loop of per-memcg background reclaim, which is implemented
> > in function balance_mem_cgroup_pgdat().
> >
> > The function performs a priority loop similar to global reclaim. During each
> > iteration it invokes balance_pgdat_node() for all nodes on the system, which
> > is another new function that performs background reclaim per node. After
> > reclaiming each node, it checks mem_cgroup_watermark_ok() and breaks the
> > priority loop if it returns true.
> >
> > changelog v4..v3:
> > 1. split the select_victim_node and zone_unreclaimable changes into
> > separate patches.
> > 2. remove the logic that tries to do zone balancing.
> >
> > changelog v3..v2:
> > 1. change mz->all_unreclaimable to be boolean.
> > 2. define ZONE_RECLAIMABLE_RATE macro shared by zone and per-memcg reclaim.
> > 3. some more clean-up.
> >
> > changelog v2..v1:
> > 1. move the per-memcg per-zone clear_unreclaimable into the uncharge stage.
> > 2. share kswapd_run/kswapd_stop between per-memcg and global background
> > reclaim.
> > 3. name the per-memcg kswapd as "memcg-id" (css->id); the global kswapd
> > keeps the same name.
> > 4. fix a race in kswapd_stop where the per-memcg per-zone info could be
> > accessed after freeing.
> > 5. add fairness in the zonelist, where the memcg remembers the last zone
> > reclaimed from.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  mm/vmscan.c |  161 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 161 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 4deb9c8..b8345d2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -47,6 +47,8 @@
> >
> >  #include <linux/swapops.h>
> >
> > +#include <linux/res_counter.h>
> > +
> >  #include "internal.h"
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -111,6 +113,8 @@ struct scan_control {
> >        * are scanned.
> >        */
> >       nodemask_t      *nodemask;
> > +
> > +     int priority;
> >  };
> >
> >  #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
> > @@ -2632,11 +2636,168 @@ static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
> >       finish_wait(wait_h, &wait);
> >  }
> >
> > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > +/*
> > + * The function is used for per-memcg LRU. It scanns all the zones of the
> > + * node and returns the nr_scanned and nr_reclaimed.
> > + */
> > +static void balance_pgdat_node(pg_data_t *pgdat, int order,
> > +                               struct scan_control *sc)
> > +{
> > +        int i;
> > +        unsigned long total_scanned = 0;
> > +        struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > +        int priority = sc->priority;
> > +
> > +        /*
> > +         * Now scan the zone in the dma->highmem direction, and we scan
> > +         * every zones for each node.
> > +         *
> > +         * We do this because the page allocator works in the opposite
> > +         * direction.  This prevents the page allocator from allocating
> > +         * pages behind kswapd's direction of progress, which would
> > +         * cause too much scanning of the lower zones.
> > +         */
>
> I guess this comment is a cut-n-paste from global kswapd. It works when
> alloc_page() stalls....hmm, I'd like to think about whether the dma->highmem
> direction is good in this case.

This is a legacy comment; the actual zone-balancing logic has been removed
from this patch.

> As you know, memcg works against the user's memory, and that memory should
> be in the highmem zone.
> Memcg-kswapd is not for memory shortage, but for voluntary page dropping by
> the _user_.

In some sense, yes, but it is also related to memory shortage on fully
packed machines.

> If this memcg-kswapd drops pages from lower zones first, ah, ok, it's good
> for the system, because memcg's pages should be in the higher zones if we
> have free memory.
>
> So, I think the reason for dma->highmem is different from global kswapd.

Yes, I agree that the reason for the dma->highmem ordering is not exactly
the same for per-memcg kswapd and per-node kswapd. But the page allocation
still happens from the other direction, so this is still good for the
system, as you pointed out.

> > +        for (i = 0; i < pgdat->nr_zones; i++) {
> > +                struct zone *zone = pgdat->node_zones + i;
> > +
> > +                if (!populated_zone(zone))
> > +                        continue;
> > +
> > +                sc->nr_scanned = 0;
> > +                shrink_zone(priority, zone, sc);
> > +                total_scanned += sc->nr_scanned;
> > +
> > +                /*
> > +                 * If we've done a decent amount of scanning and
> > +                 * the reclaim ratio is low, start doing writepage
> > +                 * even in laptop mode
> > +                 */
> > +                if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > +                    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> > +                        sc->may_writepage = 1;
> > +                }
> > +        }
> > +
> > +        sc->nr_scanned = total_scanned;
> > +        return;
> > +}
> > +
> > +/*
> > + * Per cgroup background reclaim.
> > + * TODO: Take off the order since memcg always do order 0
> > + */
> > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> > +                                              int order)
> > +{
> > +        int i, nid;
> > +        int start_node;
> > +        int priority;
> > +        bool wmark_ok;
> > +        int loop;
> > +        pg_data_t *pgdat;
> > +        nodemask_t do_nodes;
> > +        unsigned long total_scanned;
> > +        struct scan_control sc = {
> > +                .gfp_mask = GFP_KERNEL,
> > +                .may_unmap = 1,
> > +                .may_swap = 1,
> > +                .nr_to_reclaim = ULONG_MAX,
> > +                .swappiness = vm_swappiness,
> > +                .order = order,
> > +                .mem_cgroup = mem_cont,
> > +        };
> > +
> > +loop_again:
> > +        do_nodes = NODE_MASK_NONE;
> > +        sc.may_writepage = !laptop_mode;
>
> I think may_writepage should start from '0' always. We're not sure
> the system is in memory shortage...we just want to release memory
> voluntarily. writepage will add huge costs, I guess.
>
> For example,
>         sc.may_writepage = !!loop
> may be better for memcg.
>
> BTW, you set nr_to_reclaim to ULONG_MAX here and don't modify it later.
>
> I think you should add some logic to fix it to the right value.
>
> For example, before calling shrink_zone(),
>
>         sc->nr_to_reclaim = min(SWAP_CLUSTER_MAX, memcg_usage_in_this_zone() / 100);  # 1% in this zone
>
> if we love 'fair pressure for each zone'.

Hmm, I don't get it. Leaving nr_to_reclaim as ULONG_MAX in the kswapd case
is intended to apply equal memory pressure to each zone. So in shrink_zone()
we won't bail out in the following condition:

        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                                        nr[LRU_INACTIVE_FILE]) {
                if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
                        break;
        }

--Ying
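
To make the suggestion above concrete, here is a rough sketch of the per-zone
cap placed in balance_pgdat_node()'s zone loop, right before shrink_zone().
memcg_usage_in_this_zone() is kept from the suggestion as a placeholder for
"this memcg's usage in this zone, in pages"; no such helper exists in this
patch, so its name and signature are assumptions. min_t() is used because
SWAP_CLUSTER_MAX and the usage have different types:

        for (i = 0; i < pgdat->nr_zones; i++) {
                struct zone *zone = pgdat->node_zones + i;
                unsigned long usage;

                if (!populated_zone(zone))
                        continue;

                /* Placeholder: this memcg's usage in this zone, in pages. */
                usage = memcg_usage_in_this_zone(mem_cont, zone);

                /*
                 * Cap the per-zone target at roughly 1% of the memcg's usage
                 * in this zone, and at most one batch, so shrink_zone() can
                 * bail out early instead of chasing an unlimited target.
                 */
                sc->nr_to_reclaim = min_t(unsigned long,
                                          SWAP_CLUSTER_MAX, usage / 100);

                sc->nr_scanned = 0;
                shrink_zone(priority, zone, sc);
                total_scanned += sc->nr_scanned;
        }

With this shape, a zone where the memcg has little memory gets a tiny target
and is left alone quickly, which is the "fair pressure for each zone" idea;
the ULONG_MAX setting instead disables the bail-out entirely.
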
> > +        sc.nr_reclaimed = 0;
> > +        total_scanned = 0;
> > +
> > +        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > +                sc.priority = priority;
> > +                wmark_ok = false;
> > +                loop = 0;
> > +
> > +                /* The swap token gets in the way of swapout... */
> > +                if (!priority)
> > +                        disable_swap_token();
> > +
> > +                if (priority == DEF_PRIORITY)
> > +                        do_nodes = node_states[N_ONLINE];
> > +
> > +                while (1) {
> > +                        nid = mem_cgroup_select_victim_node(mem_cont,
> > +                                                            &do_nodes);
> > +
> > +                        /* Indicate we have cycled the nodelist once
> > +                         * TODO: we might add MAX_RECLAIM_LOOP for preventing
> > +                         * kswapd burning cpu cycles.
> > +                         */
> > +                        if (loop == 0) {
> > +                                start_node = nid;
> > +                                loop++;
> > +                        } else if (nid == start_node)
> > +                                break;
> > +
> > +                        pgdat = NODE_DATA(nid);
> > +                        balance_pgdat_node(pgdat, order, &sc);
> > +                        total_scanned += sc.nr_scanned;
> > +
> > +                        /* Set the node which has at least
> > +                         * one reclaimable zone
> > +                         */
> > +                        for (i = pgdat->nr_zones - 1; i >= 0; i--) {
> > +                                struct zone *zone = pgdat->node_zones + i;
> > +
> > +                                if (!populated_zone(zone))
> > +                                        continue;
>
> How about checking whether memcg has pages on this node ?
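
One shape that check could take, sketched against the loop above.
mem_cgroup_zone_lru_pages() is a placeholder for "number of LRU pages this
memcg has in this zone"; the exact helper name and signature are assumptions,
not an API this patch provides:

                        for (i = pgdat->nr_zones - 1; i >= 0; i--) {
                                struct zone *zone = pgdat->node_zones + i;

                                if (!populated_zone(zone))
                                        continue;

                                /*
                                 * Treat the zone as reclaimable only if the
                                 * memcg actually has LRU pages here; stop at
                                 * the first such zone.
                                 */
                                if (mem_cgroup_zone_lru_pages(mem_cont, zone))
                                        break;
                        }
                        /* No zone on this node holds memcg pages: drop it. */
                        if (i < 0)
                                node_clear(nid, do_nodes);

The break is what makes the i < 0 test meaningful: i stays >= 0 when a usable
zone is found, so node_clear() only fires for nodes where the memcg has
nothing left to reclaim.
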
> > +                        }
> > +                        if (i < 0)
> > +                                node_clear(nid, do_nodes);
> > +
> > +                        if (mem_cgroup_watermark_ok(mem_cont,
> > +                                                    CHARGE_WMARK_HIGH)) {
> > +                                wmark_ok = true;
> > +                                goto out;
> > +                        }
> > +
> > +                        if (nodes_empty(do_nodes)) {
> > +                                wmark_ok = true;
> > +                                goto out;
> > +                        }
> > +                }
> > +
> > +                /* All the nodes are unreclaimable, kswapd is done */
> > +                if (nodes_empty(do_nodes)) {
> > +                        wmark_ok = true;
> > +                        goto out;
> > +                }
>
> Can this happen ?

> > +
> > +                if (total_scanned && priority < DEF_PRIORITY - 2)
> > +                        congestion_wait(WRITE, HZ/10);
> > +
> > +                if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
> > +                        break;
> > +        }
> > +out:
> > +        if (!wmark_ok) {
> > +                cond_resched();
> > +
> > +                try_to_freeze();
> > +
> > +                goto loop_again;
> > +        }
> > +
> > +        return sc.nr_reclaimed;
> > +}
> > +#else
> >  static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> >                                                int order)
> >  {
> >       return 0;
> >  }
> > +#endif
> >
>
> Thanks,
> -Kame