From: Ying Han <yinghan@google.com>
Date: Thu, 14 Apr 2011 20:35:00 -0700
Subject: Re: [PATCH V4 01/10] Add kswapd descriptor
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
 Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
 Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
 Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm@kvack.org

On Thu, Apr 14, 2011 at 5:04 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 14 Apr 2011 15:54:20 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > There is a kswapd kernel thread for each NUMA node. We will add a
> > different kswapd for each memcg. The kswapd sleeps in the wait queue
> > headed at the kswapd_wait field of a kswapd descriptor. The kswapd
> > descriptor stores information about the node or memcg, and it allows
> > the global and per-memcg background reclaim to share common reclaim
> > algorithms.
> >
> > This patch adds the kswapd descriptor and moves the per-node kswapd
> > to use the new structure.
>
> No objections to your direction but some comments.
>
> > changelog v2..v1:
> > 1. dynamically allocate the kswapd descriptor and initialize the
> >    wait_queue_head of pgdat at kswapd_run.
> > 2. add a helper macro is_node_kswapd to distinguish per-node/per-cgroup
> >    kswapd descriptors.
> >
> > changelog v3..v2:
> > 1. move the struct mem_cgroup *kswapd_mem in the kswapd struct to a
> >    later patch.
> > 2. rename thr in kswapd_run to something else.
> >
> > Signed-off-by: Ying Han <yinghan@google.com>
> > ---
> >  include/linux/mmzone.h |    3 +-
> >  include/linux/swap.h   |    7 ++++
> >  mm/page_alloc.c        |    1 -
> >  mm/vmscan.c            |   95 ++++++++++++++++++++++++++++++++------------
> >  4 files changed, 80 insertions(+), 26 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 628f07b..6cba7d2 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -640,8 +640,7 @@ typedef struct pglist_data {
> >       unsigned long node_spanned_pages; /* total size of physical page
> >                                            range, including holes */
> >       int node_id;
> > -     wait_queue_head_t kswapd_wait;
> > -     struct task_struct *kswapd;
> > +     wait_queue_head_t *kswapd_wait;
> >       int kswapd_max_order;
> >       enum zone_type classzone_idx;
>
> I think pg_data_t should include struct kswapd in it, as
>
>         struct pglist_data {
>         .....
>                 struct kswapd   kswapd;
>         };
>
> and you can add a macro as
>
> #define kswapd_waitqueue(kswapd)        (&(kswapd)->kswapd_wait)
>
> if it looks better.
>
> The reason I recommend this is that I think it's better to have
> 'struct kswapd' on the same page as pg_data_t or struct memcg.
> Do you have benefits to kmalloc() struct kswapd on demand?

So we don't end up having a kswapd struct on memcgs which don't have
per-memcg kswapd enabled. I don't see either approach as strongly better
than the other. If ok, I would like to keep it as it is for this version.
Hope this is ok for now.
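[Editorial note: for readers weighing the trade-off above, here is a
minimal C sketch of the two layouts under discussion. It is illustrative
only: pglist_data is abbreviated, the helper name kswapd_from_wait() is
hypothetical, and variant (a) is KAMEZAWA's suggestion rather than
anything in the patch. Variant (b) recovers the descriptor from the
wait-queue head with container_of(), the same pattern the patch uses in
cpu_callback() and kswapd_stop().]

    /* Sketch only -- not part of the patch. */

    /*
     * (a) KAMEZAWA's suggestion: embed the descriptor.  Every pg_data_t
     *     (and, later, every memcg) carries a struct kswapd, but it
     *     lives on the same page as its owner and needs no allocation.
     */
    struct pglist_data {
            /* ... other fields ... */
            struct kswapd kswapd;           /* embedded descriptor */
    };
    #define kswapd_waitqueue(kswapd)        (&(kswapd)->kswapd_wait)

    /*
     * (b) The patch's choice: keep only a wait-queue pointer in
     *     pg_data_t and kzalloc() the descriptor in kswapd_run(), so a
     *     memcg that never enables per-memcg kswapd pays nothing.  The
     *     descriptor is recovered from the pointer when needed
     *     (hypothetical helper name):
     */
    static inline struct kswapd *kswapd_from_wait(wait_queue_head_t *wait)
    {
            return container_of(wait, struct kswapd, kswapd_wait);
    }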
> >  } pg_data_t;
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index ed6ebe6..f43d406 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
> >       return current->flags & PF_KSWAPD;
> >  }
> >
> > +struct kswapd {
> > +     struct task_struct *kswapd_task;
> > +     wait_queue_head_t kswapd_wait;
> > +     pg_data_t *kswapd_pgdat;
> > +};
> > +
> > +int kswapd(void *p);
> >  /*
> >   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
> >   * be swapped to.  The swap type and the offset into that swap type are
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6e1b52a..6340865 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> >
> >       pgdat_resize_init(pgdat);
> >       pgdat->nr_zones = 0;
> > -     init_waitqueue_head(&pgdat->kswapd_wait);
> >       pgdat->kswapd_max_order = 0;
> >       pgdat_page_cgroup_init(pgdat);
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 060e4c1..77ac74f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2241,13 +2241,16 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
> >       return balanced_pages > (present_pages >> 2);
> >  }
> >
> > +static DEFINE_SPINLOCK(kswapds_spinlock);
> > +
>
> Maybe better to explain what this lock is for.
>
> It seems we need this because we allocate the kswapd descriptor after the
> NODE is online, right?

True. I will put a comment there.

--Ying

> Thanks,
> -Kame
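[Editorial note: the comment promised above might read something like
the following. The wording is a hypothetical sketch based on how the
lock is used in this patch, not text from any later revision. The rest
of the quoted patch follows below without further review comments.]

    /*
     * kswapds_spinlock protects pgdat->kswapd_wait and the kswapd_task
     * field of the descriptor it points to.  The descriptor is
     * allocated in kswapd_run() only after the node comes online and is
     * freed in kswapd_stop(), so lookups such as the one in
     * cpu_callback() must take the lock to avoid racing with teardown.
     */
    static DEFINE_SPINLOCK(kswapds_spinlock);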
> >  /* is kswapd sleeping prematurely? */
> > -static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> > -                                  int classzone_idx)
> > +static int sleeping_prematurely(struct kswapd *kswapd, int order,
> > +                                long remaining, int classzone_idx)
> >  {
> >       int i;
> >       unsigned long balanced = 0;
> >       bool all_zones_ok = true;
> > +     pg_data_t *pgdat = kswapd->kswapd_pgdat;
> >
> >       /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >       if (remaining)
> > @@ -2570,28 +2573,31 @@ out:
> >       return order;
> >  }
> >
> > -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> > +static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
> > +                               int classzone_idx)
> >  {
> >       long remaining = 0;
> >       DEFINE_WAIT(wait);
> > +     pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +     wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> >
> >       if (freezing(current) || kthread_should_stop())
> >               return;
> >
> > -     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > +     prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> >
> >       /* Try to sleep for a short interval */
> > -     if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
> > +     if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
> >               remaining = schedule_timeout(HZ/10);
> > -             finish_wait(&pgdat->kswapd_wait, &wait);
> > -             prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > +             finish_wait(wait_h, &wait);
> > +             prepare_to_wait(wait_h, &wait, TASK_INTERRUPTIBLE);
> >       }
> >
> >       /*
> >        * After a short sleep, check if it was a premature sleep. If not, then
> >        * go fully to sleep until explicitly woken up.
> >        */
> > -     if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
> > +     if (!sleeping_prematurely(kswapd_p, order, remaining, classzone_idx)) {
> >               trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
> >
> >               /*
> > @@ -2611,7 +2617,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >               else
> >                       count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> >       }
> > -     finish_wait(&pgdat->kswapd_wait, &wait);
> > +     finish_wait(wait_h, &wait);
> >  }
> >
> >  /*
> > @@ -2627,20 +2633,24 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >   * If there are applications that are active memory-allocators
> >   * (most normal use), this basically shouldn't matter.
> >   */
> > -static int kswapd(void *p)
> > +int kswapd(void *p)
> >  {
> >       unsigned long order;
> >       int classzone_idx;
> > -     pg_data_t *pgdat = (pg_data_t *)p;
> > +     struct kswapd *kswapd_p = (struct kswapd *)p;
> > +     pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
> > +     wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
> >       struct task_struct *tsk = current;
> >
> >       struct reclaim_state reclaim_state = {
> >               .reclaimed_slab = 0,
> >       };
> > -     const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> > +     const struct cpumask *cpumask;
> >
> >       lockdep_set_current_reclaim_state(GFP_KERNEL);
> >
> > +     BUG_ON(pgdat->kswapd_wait != wait_h);
> > +     cpumask = cpumask_of_node(pgdat->node_id);
> >       if (!cpumask_empty(cpumask))
> >               set_cpus_allowed_ptr(tsk, cpumask);
> >       current->reclaim_state = &reclaim_state;
> > @@ -2679,7 +2689,7 @@ static int kswapd(void *p)
> >                       order = new_order;
> >                       classzone_idx = new_classzone_idx;
> >               } else {
> > -                     kswapd_try_to_sleep(pgdat, order, classzone_idx);
> > +                     kswapd_try_to_sleep(kswapd_p, order, classzone_idx);
> >                       order = pgdat->kswapd_max_order;
> >                       classzone_idx = pgdat->classzone_idx;
> >                       pgdat->kswapd_max_order = 0;
> > @@ -2719,13 +2729,13 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
> >               pgdat->kswapd_max_order = order;
> >               pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> >       }
> > -     if (!waitqueue_active(&pgdat->kswapd_wait))
> > +     if (!waitqueue_active(pgdat->kswapd_wait))
> >               return;
> >       if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
> >               return;
> >
> >       trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
> > -     wake_up_interruptible(&pgdat->kswapd_wait);
> > +     wake_up_interruptible(pgdat->kswapd_wait);
> >  }
> >
> >  /*
> > @@ -2817,12 +2827,23 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
> >               for_each_node_state(nid, N_HIGH_MEMORY) {
> >                       pg_data_t *pgdat = NODE_DATA(nid);
> >                       const struct cpumask *mask;
> > +                     struct kswapd *kswapd_p;
> > +                     struct task_struct *kswapd_thr;
> > +                     wait_queue_head_t *wait;
> >
> >                       mask = cpumask_of_node(pgdat->node_id);
> >
> > +                     spin_lock(&kswapds_spinlock);
> > +                     wait = pgdat->kswapd_wait;
> > +                     kswapd_p = container_of(wait, struct kswapd,
> > +                                             kswapd_wait);
> > +                     kswapd_thr = kswapd_p->kswapd_task;
> > +                     spin_unlock(&kswapds_spinlock);
> > +
> >                       if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
> >                               /* One of our CPUs online: restore mask */
> > -                             set_cpus_allowed_ptr(pgdat->kswapd, mask);
> > +                             if (kswapd_thr)
> > +                                     set_cpus_allowed_ptr(kswapd_thr, mask);
> >               }
> >       }
> >       return NOTIFY_OK;
> > @@ -2835,18 +2856,31 @@ static int __devinit cpu_callback(struct notifier_block *nfb,
> >  int kswapd_run(int nid)
> >  {
> >       pg_data_t *pgdat = NODE_DATA(nid);
> > +     struct task_struct *kswapd_thr;
> > +     struct kswapd *kswapd_p;
> >       int ret = 0;
> >
> > -     if (pgdat->kswapd)
> > +     if (pgdat->kswapd_wait)
> >               return 0;
> >
> > -     pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
> > -     if (IS_ERR(pgdat->kswapd)) {
> > +     kswapd_p = kzalloc(sizeof(struct kswapd), GFP_KERNEL);
> > +     if (!kswapd_p)
> > +             return -ENOMEM;
> > +
> > +     init_waitqueue_head(&kswapd_p->kswapd_wait);
> > +     pgdat->kswapd_wait = &kswapd_p->kswapd_wait;
> > +     kswapd_p->kswapd_pgdat = pgdat;
> > +
> > +     kswapd_thr = kthread_run(kswapd, kswapd_p, "kswapd%d", nid);
> > +     if (IS_ERR(kswapd_thr)) {
> >               /* failure at boot is fatal */
> >               BUG_ON(system_state == SYSTEM_BOOTING);
> >               printk("Failed to start kswapd on node %d\n", nid);
%d\n",nid); > > + pgdat->kswapd_wait = NULL; > > + kfree(kswapd_p); > > ret = -1; > > - } > > + } else > > + kswapd_p->kswapd_task = kswapd_thr; > > return ret; > > } > > > > @@ -2855,10 +2889,25 @@ int kswapd_run(int nid) > > */ > > void kswapd_stop(int nid) > > { > > - struct task_struct *kswapd = NODE_DATA(nid)->kswapd; > > + struct task_struct *kswapd_thr = NULL; > > + struct kswapd *kswapd_p = NULL; > > + wait_queue_head_t *wait; > > + > > + pg_data_t *pgdat = NODE_DATA(nid); > > + > > + spin_lock(&kswapds_spinlock); > > + wait = pgdat->kswapd_wait; > > + if (wait) { > > + kswapd_p = container_of(wait, struct kswapd, kswapd_wait); > > + kswapd_thr = kswapd_p->kswapd_task; > > + kswapd_p->kswapd_task = NULL; > > + } > > + spin_unlock(&kswapds_spinlock); > > + > > + if (kswapd_thr) > > + kthread_stop(kswapd_thr); > > > > - if (kswapd) > > - kthread_stop(kswapd); > > + kfree(kswapd_p); > > } > > > > static int __init kswapd_init(void) > > -- > > 1.7.3.1 > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Fight unfair telecom internet charges in Canada: sign > http://stopthemeter.ca/ > > Don't email: email@kvack.org > > > > --00248c6a84ca09d7c104a0ecba6d Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable