[RFC][PATCH] the proposal of improve page reclaim by throttle

linux-mm.kvack.org archive mirror

* [RFC][PATCH] the proposal of improve page reclaim by throttle
@ 2008-02-19  5:44 KOSAKI Motohiro
  2008-02-19  6:34 ` Nick Piggin
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2008-02-19  5:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel, Lee Schermerhorn,
	linux-mm, linux-kernel
  Cc: kosaki.motohiro

background
========================================
The current VM implementation places no limit on the number of parallel
reclaimers. Under a heavy workload this causes two problems:
  - heavy lock contention
  - unnecessary swap-out

About two months ago, KAMEZAWA Hiroyuki proposed a page reclaim
throttle patch and explained that it improves reclaim time.
	http://marc.info/?l=linux-mm&m=119667465917215&w=2

Unfortunately, it works only for memcgroup reclaim.
Today I reimplemented it to support global reclaim as well, and measured it.
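
As a rough illustration of the throttling idea (not the patch itself),
the admission check can be sketched in userspace C with C11 atomics.
add_unless() is a hand-written stand-in for the kernel's
atomic_add_unless(); reclaim_enter(), reclaim_exit() and the limit
argument are hypothetical names used only for this sketch. A task that
fails reclaim_enter() would sleep until a reclaimer leaves; the sketch
omits that waiting step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add `a` to *v unless *v == u. Returns true if the add happened. */
static bool add_unless(atomic_int *v, int a, int u)
{
    int cur = atomic_load(v);

    while (cur != u) {
        /* On failure, cur is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &cur, cur + a))
            return true;
    }
    return false;
}

/* Number of tasks currently inside reclaim. */
static atomic_int nr_reclaimers = 0;

/* Try to become a reclaimer; fails once `limit` tasks are inside. */
static bool reclaim_enter(int limit)
{
    return add_unless(&nr_reclaimers, 1, limit);
}

/* Leave reclaim, freeing one slot for a waiting task. */
static void reclaim_exit(void)
{
    int prev = atomic_fetch_sub(&nr_reclaimers, 1);

    assert(prev > 0);   /* exit without a matching enter is a bug */
}
```

With a limit of 2, the first two callers of reclaim_enter(2) succeed and
the third fails until someone calls reclaim_exit().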


test machine, method and result
==================================================
<test machine>
	CPU:  IA64 x8
	MEM:  8GB
	SWAP: 2GB

<test method>
	got hackbench from
		http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

	$ /usr/bin/time hackbench 120 process 1000

	With these parameters, the run consumes all physical memory and
	1GB of swap space in my test environment.

<test result (average of 3 measurements)>

before:
	hackbench result:		282.30
	/usr/bin/time result
		user:			14.16
		sys:			1248.47
		elapse:			432.93
		major fault:		29026
	max parallel reclaim tasks:	1298
	max consumption time of
	 try_to_free_pages():		70394 

after:
	hackbench result:		30.36
	/usr/bin/time result
		user:			14.26
		sys:			294.44
		elapse:			118.01
		major fault:		3064
	max parallel reclaim tasks:	4
	max consumption time of
	 try_to_free_pages():		12234 


conclusion
=========================================
This patch improves 4 things:
1. reduces unnecessary swap
   (see major faults above: about 90% fewer)
2. improves throughput
   (see hackbench result above: about 90% lower)
3. improves interactive performance
   (see max time spent in try_to_free_pages() above:
    about 80% lower)
4. reduces lock contention
   (see sys time above: about 75% lower)


With this, hackbench completes about 9 times faster (282.30 -> 30.36) :)
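
For anyone who wants to recheck the figures quoted in the conclusion,
here is a small C sketch that recomputes them from the before/after
table. The helper names percent_reduced() and speedup() are
hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before`. */
static double percent_reduced(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* How many times smaller `after` is than `before`. */
static double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}
```

Fed with the table values, percent_reduced() gives roughly 89.4 for
major faults (29026 -> 3064), 89.2 for the hackbench result
(282.30 -> 30.36), 82.6 for try_to_free_pages() (70394 -> 12234) and
76.4 for sys time (1248.47 -> 294.44). speedup(282.30, 30.36) is
roughly 9.3.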



future work
==========================================================
 - more discussion with memory controller guys.



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

---
 include/linux/nodemask.h |    1 
 mm/vmscan.c              |   49 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 48 insertions(+), 2 deletions(-)

Index: b/include/linux/nodemask.h
===================================================================
--- a/include/linux/nodemask.h	2008-02-19 13:58:05.000000000 +0900
+++ b/include/linux/nodemask.h	2008-02-19 13:58:23.000000000 +0900
@@ -431,6 +431,7 @@ static inline int num_node_state(enum no
 
 #define num_online_nodes()	num_node_state(N_ONLINE)
 #define num_possible_nodes()	num_node_state(N_POSSIBLE)
+#define num_highmem_nodes()	num_node_state(N_HIGH_MEMORY)
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c	2008-02-19 13:58:05.000000000 +0900
+++ b/mm/vmscan.c	2008-02-19 14:04:06.000000000 +0900
@@ -127,6 +127,11 @@ long vm_total_pages;	/* The total number
 static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);
 
+static atomic_t nr_reclaimers = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(reclaim_throttle_waitq);
+#define RECLAIM_LIMIT (2 * num_highmem_nodes())
+
+
 #ifdef CONFIG_CGROUP_MEM_CONT
 #define scan_global_lru(sc)	(!(sc)->mem_cgroup)
 #else
@@ -1421,6 +1426,46 @@ out:
 	return ret;
 }
 
+static unsigned long try_to_free_pages_throttled(struct zone **zones,
+						 int order,
+						 gfp_t gfp_mask,
+						 struct scan_control *sc)
+{
+	unsigned long nr_reclaimed = 0;
+	unsigned long start_time;
+	int i;
+
+	start_time = jiffies;
+
+	wait_event(reclaim_throttle_waitq,
+		   atomic_add_unless(&nr_reclaimers, 1, RECLAIM_LIMIT));
+
+	/* after a long wait, is reclaim still needed? */
+	if (unlikely(time_after(jiffies, start_time + HZ))) {
+		for (i = 0; zones[i] != NULL; i++) {
+			struct zone *zone = zones[i];
+			int classzone_idx = zone_idx(zones[0]);
+
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_watermark_ok(zone, order, 4*zone->pages_high,
+					      classzone_idx, 0)) {
+				nr_reclaimed = 1;
+				goto out;
+			}
+		}
+	}
+
+	nr_reclaimed = do_try_to_free_pages(zones, gfp_mask, sc);
+
+out:
+	atomic_dec(&nr_reclaimers);
+	wake_up_all(&reclaim_throttle_waitq);
+
+	return nr_reclaimed;
+}
+
 unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
 {
 	struct scan_control sc = {
@@ -1434,7 +1479,7 @@ unsigned long try_to_free_pages(struct z
 		.isolate_pages = isolate_pages_global,
 	};
 
-	return do_try_to_free_pages(zones, gfp_mask, &sc);
+	return try_to_free_pages_throttled(zones, order, gfp_mask, &sc);
 }
 
 #ifdef CONFIG_CGROUP_MEM_CONT
@@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
 
 	zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
-	if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
+	if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
 		return 1;
 	return 0;
 }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-19  5:44 [RFC][PATCH] the proposal of improve page reclaim by throttle KOSAKI Motohiro
@ 2008-02-19  6:34 ` Nick Piggin
  2008-02-19  7:09   ` KOSAKI Motohiro
  2008-02-19 13:31   ` Rik van Riel
  2008-02-20  8:56 ` minchan Kim
  2008-02-21  9:48 ` Balbir Singh
  2 siblings, 2 replies; 15+ messages in thread
From: Nick Piggin @ 2008-02-19  6:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel, Lee Schermerhorn,
	linux-mm, linux-kernel

On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote:
> background
> ========================================
> The current VM implementation places no limit on the number of parallel
> reclaimers. Under a heavy workload this causes two problems:
>   - heavy lock contention
>   - unnecessary swap-out
>
> About two months ago, KAMEZAWA Hiroyuki proposed a page reclaim
> throttle patch and explained that it improves reclaim time.
> 	http://marc.info/?l=linux-mm&m=119667465917215&w=2
>
> Unfortunately, it works only for memcgroup reclaim.
> Today I reimplemented it to support global reclaim as well, and measured it.
>
>
> test machine, method and result
> ==================================================
> <test machine>
> 	CPU:  IA64 x8
> 	MEM:  8GB
> 	SWAP: 2GB
>
> <test method>
> 	got hackbench from
> 		http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
>
> 	$ /usr/bin/time hackbench 120 process 1000
>
> 	With these parameters, the run consumes all physical memory and
> 	1GB of swap space in my test environment.
>
> <test result (average of 3 measurements)>
>
> before:
> 	hackbench result:		282.30
> 	/usr/bin/time result
> 		user:			14.16
> 		sys:			1248.47
> 		elapse:			432.93
> 		major fault:		29026
> 	max parallel reclaim tasks:	1298
> 	max consumption time of
> 	 try_to_free_pages():		70394
>
> after:
> 	hackbench result:		30.36
> 	/usr/bin/time result
> 		user:			14.26
> 		sys:			294.44
> 		elapse:			118.01
> 		major fault:		3064
> 	max parallel reclaim tasks:	4
> 	max consumption time of
> 	 try_to_free_pages():		12234
>
>
> conclusion
> =========================================
> This patch improves 4 things:
> 1. reduces unnecessary swap
>    (see major faults above: about 90% fewer)
> 2. improves throughput
>    (see hackbench result above: about 90% lower)
> 3. improves interactive performance
>    (see max time spent in try_to_free_pages() above:
>     about 80% lower)
> 4. reduces lock contention
>    (see sys time above: about 75% lower)
>
>
> With this, hackbench completes about 9 times faster (282.30 -> 30.36) :)
>
>
>
> future work
> ==========================================================
>  - more discussion with memory controller guys.

Hi,

Yeah this is definitely needed and a nice result.

I'm worried about a) placing a global limit on parallelism, and b)
placing a limit on parallelism at all.

I think it should maybe be a per-zone thing...

What happens if you make it a per-zone mutex, and allow just a single
process to reclaim pages from a given zone at a time? I guess that is
going to slow down throughput a little bit in some cases though...
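
To make the per-zone idea concrete, here is a minimal userspace C
sketch in which at most one task reclaims from a given zone at a time.
It uses a per-zone busy flag as a try-lock in place of a real mutex,
and a task that finds the zone busy simply skips it; both choices are
assumptions of this sketch, as are the names zone_sketch, NR_ZONES and
reclaim_zone(). A blocking mutex would make later tasks wait instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_ZONES 4  /* hypothetical zone count for this sketch */

struct zone_sketch {
    atomic_bool reclaim_busy;   /* true while one task reclaims here */
    unsigned long nr_scanned;   /* stand-in for real reclaim work */
};

/* Static storage is zero-initialised: no zone is being reclaimed. */
static struct zone_sketch zones[NR_ZONES];

static bool zone_reclaim_trylock(struct zone_sketch *z)
{
    /* atomic_exchange returns the old value; false means we won. */
    return !atomic_exchange(&z->reclaim_busy, true);
}

static void zone_reclaim_unlock(struct zone_sketch *z)
{
    bool was_busy = atomic_exchange(&z->reclaim_busy, false);

    assert(was_busy);   /* unlock without a matching lock is a bug */
}

/* Reclaim from one zone unless another task is already doing so. */
static bool reclaim_zone(struct zone_sketch *z)
{
    if (!zone_reclaim_trylock(z))
        return false;   /* zone busy: skip it */
    z->nr_scanned++;    /* placeholder for the actual reclaim */
    zone_reclaim_unlock(z);
    return true;
}
```

Because each zone carries its own flag, reclaim in one zone never
blocks reclaim in another.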


* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-19  6:34 ` Nick Piggin
@ 2008-02-19  7:09   ` KOSAKI Motohiro
  2008-02-19 13:31   ` Rik van Riel
  1 sibling, 0 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2008-02-19  7:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel,
	Lee Schermerhorn, linux-mm, linux-kernel

Hi Nick,

> Yeah this is definitely needed and a nice result.
> 
> I'm worried about a) placing a global limit on parallelism, and b)
> placing a limit on parallelism at all.

Sorry, I don't quite understand yet.
How are a) and b) related?

> 
> I think it should maybe be a per-zone thing...
> 
> What happens if you make it a per-zone mutex, and allow just a single
> process to reclaim pages from a given zone at a time? I guess that is
> going to slow down throughput a little bit in some cases though...

That makes sense.

OK.
I'll repost in 2-3 days.

Thanks.

- kosaki



* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-19  6:34 ` Nick Piggin
  2008-02-19  7:09   ` KOSAKI Motohiro
@ 2008-02-19 13:31   ` Rik van Riel
  1 sibling, 0 replies; 15+ messages in thread
From: Rik van Riel @ 2008-02-19 13:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Balbir Singh,
	Lee Schermerhorn, linux-mm, linux-kernel

On Tue, 19 Feb 2008 17:34:59 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Tuesday 19 February 2008 16:44, KOSAKI Motohiro wrote:
> > background
> > ========================================
> > The current VM implementation places no limit on the number of parallel
> > reclaimers. Under a heavy workload this causes two problems:
> >   - heavy lock contention
> >   - unnecessary swap out

> I think it should maybe be a per-zone thing...
> 
> What happens if you make it a per-zone mutex, and allow just a single
> process to reclaim pages from a given zone at a time? I guess that is
> going to slow down throughput a little bit in some cases though...

I agree, doing things per zone will probably work better, because
that way one process can do page reclaim on every NUMA node at
the same time.

-- 
All rights reversed.


* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-19  5:44 [RFC][PATCH] the proposal of improve page reclaim by throttle KOSAKI Motohiro
  2008-02-19  6:34 ` Nick Piggin
@ 2008-02-20  8:56 ` minchan Kim
  2008-02-20  9:24   ` KOSAKI Motohiro
  2008-02-21  9:48 ` Balbir Singh
  2 siblings, 1 reply; 15+ messages in thread
From: minchan Kim @ 2008-02-20  8:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel, Lee Schermerhorn,
	linux-mm, linux-kernel

Hi, KOSAKI.

I am very interested in your patch, so I want to test it with exactly
the same method you used.
I will test it in an embedded environment (ARM 920T, 32M RAM) and on my
desktop machine (Core2Duo 2.2G, 2G RAM).

I guess this patch won't help much in an embedded environment,
since many embedded boards have just one processor and no
swap device.

What I want to know is whether this patch causes a regression on UP
systems without a swap device, such as embedded boards.
I don't think I can get these figures from top or freemem alone,
because top and freemem won't work well when the system is under
heavy overhead from page reclaim and swapping.
So, how do I measure the following fields as you did?

 * elapse (what do you mean by this?)
 * major fault
 * max parallel reclaim tasks:
 * max consumption time of
        try_to_free_pages():

If you have a patch for testing, please send it to me.




-- 
Thanks,
barrios


* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-20  8:56 ` minchan Kim
@ 2008-02-20  9:24   ` KOSAKI Motohiro
  2008-02-20  9:49     ` minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2008-02-20  9:24 UTC (permalink / raw)
  To: minchan Kim
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel,
	Lee Schermerhorn, linux-mm, linux-kernel

Hi Kim-san

Did you adjust the hackbench parameters?
My parameters are tuned for my test machine (8GB of memory);
if you use them unchanged, the test may not work for lack of memory.

> I am very interested in your patch, so I want to test it with exactly
> the same method you used.
> I will test it in an embedded environment (ARM 920T, 32M RAM) and on my
> desktop machine (Core2Duo 2.2G, 2G RAM).

Hm
I don't have an embedded test machine,
but I can test on a desktop.
I will test it around the weekend.
If you don't mind, could you please send me your .config file
and tell me your test kernel version?

Thanks for the interesting report.


> I guess this patch won't help much in an embedded environment,
> since many embedded boards have just one processor and no
> swap device.

Reclaim conflicts rarely happen on UP,
so I expect no improvement from my patch there.

But (of course) I will fix any regression.

> So, how do I measure the following fields as you did?
> 
>  * elapse (what do you mean by this?)
>  * major fault

The /usr/bin/time command outputs those.


>  * max parallel reclaim tasks:
>  * max consumption time of
>         try_to_free_pages():

Sorry, I had inserted debug code into my patch at that time.




* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-20  9:24   ` KOSAKI Motohiro
@ 2008-02-20  9:49     ` minchan Kim
  2008-02-20 10:09       ` KOSAKI Motohiro
  0 siblings, 1 reply; 15+ messages in thread
From: minchan Kim @ 2008-02-20  9:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel, Lee Schermerhorn,
	linux-mm, linux-kernel

On Feb 20, 2008 6:24 PM, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Kim-san
>
> Did you adjust the hackbench parameters?
> My parameters are tuned for my test machine (8GB of memory);
> if you use them unchanged, the test may not work for lack of memory.

I already adjusted them. :-)
But on my desktop, I couldn't get more than half of my swap device
consumed. (My swap device is 512M.)
My kernel nearly hung before much swapping happened.
Perhaps it was not actually a hang, but even though I waited a very long
time, my box did not respond at all.
I will keep trying.

> > I am very interested in your patch, so I want to test it with exactly
> > the same method you used.
> > I will test it in an embedded environment (ARM 920T, 32M RAM) and on my
> > desktop machine (Core2Duo 2.2G, 2G RAM).
>
> Hm
> I don't have an embedded test machine,
> but I can test on a desktop.
> I will test it around the weekend.
> If you don't mind, could you please send me your .config file
> and tell me your test kernel version?

I mean I will test your patch myself,
because I already have an embedded board and a desktop.

> Thanks for the interesting report.
>
>
> > I guess this patch won't help much in an embedded environment,
> > since many embedded boards have just one processor and no
> > swap device.
>
> Reclaim conflicts rarely happen on UP,
> so I expect no improvement from my patch there.

I agree with you.

> But (of course) I will fix any regression.

I didn't say your patch had a regression.
I just mean that I am concerned about it.
Actually, many VM guys work on server environments.
They don't run performance tests on embedded systems,
and such patches are submitted to mainline.

That is what I am concerned about.

> > So, how do I measure the following fields as you did?
> >
> >  * elapse (what do you mean by this?)
> >  * major fault
>
> The /usr/bin/time command outputs those.
>
>
> >  * max parallel reclaim tasks:
> >  * max consumption time of
> >         try_to_free_pages():
>
> Sorry, I had inserted debug code into my patch at that time.
>

Could you send me that debug code?
If you send it to me, I will test it in my environments (ARM-920T, Core2Duo)
and report the results.

-- 
Thanks,
barrios


* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-20  9:49     ` minchan Kim
@ 2008-02-20 10:09       ` KOSAKI Motohiro
  2008-02-21  9:38         ` minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2008-02-20 10:09 UTC (permalink / raw)
  To: minchan Kim
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel,
	Lee Schermerhorn, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 959 bytes --]

Hi

> > >  * max parallel reclaim tasks:
> > >  * max consumption time of
> > >         try_to_free_pages():
> >
> > Sorry, I had inserted debug code into my patch at that time.
> 
> Could you send me that debug code?
> If you send it to me, I will test it in my environments (ARM-920T, Core2Duo)
> and report the results.

Attached.
But it is very messy ;-)

usage:
./benchloop.sh

sample output
=========================================================
max reclaim 2
Running with 120*40 (== 4800) tasks.
Time: 34.177
14.17user 284.38system 1:43.85elapsed 287%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3813major+148922minor)pagefaults 0swaps
max prepare time: 4599 0
max reclaim time: 2350 5781
total
8271
max reclaimer
4
max overkill
62131
max saved overkill
9740


"max reclaimer" is the max number of parallel reclaim tasks.
"total" is the max consumption time of try_to_free_pages().
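
For reference, a "max reclaimer" style high-water mark can also be kept
without a lock. The sketch below is a userspace C11 alternative using a
compare-and-swap loop; it is a swapped-in technique for illustration
and not what the attached debug patch does (the patch updates its
counters under a spinlock). record_max() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdatomic.h>

/* Highest number of parallel reclaimers seen so far. */
static atomic_int max_reclaimer = 0;

/* Raise *max to val if val is larger; never lowers it. */
static void record_max(atomic_int *max, int val)
{
    int cur = atomic_load(max);

    assert(val >= 0);
    /* On CAS failure, cur is refreshed and the test is repeated. */
    while (val > cur && !atomic_compare_exchange_weak(max, &cur, val))
        ;
}
```

Each task would call record_max(&max_reclaimer, n) with the reclaimer
count it observed on entry; smaller values leave the maximum untouched.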

Thanks


[-- Attachment #2: reclaim-throttle-3.patch --]
[-- Type: application/octet-stream, Size: 7106 bytes --]

---
 include/linux/mmzone.h   |    1 
 include/linux/nodemask.h |    2 
 kernel/sysctl.c          |   78 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c              |   95 ++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 173 insertions(+), 3 deletions(-)

Index: b/kernel/sysctl.c
===================================================================
--- a/kernel/sysctl.c	2008-02-15 20:14:40.000000000 +0900
+++ b/kernel/sysctl.c	2008-02-16 16:45:58.000000000 +0900
@@ -187,6 +187,18 @@ int sysctl_legacy_va_layout;
 extern int prove_locking;
 extern int lock_stat;
 
+extern int max_reclaimer;
+extern unsigned long max_reclaim_time;
+extern unsigned long max_reclaim_prepare_time;
+extern int reclaim_limit;
+extern unsigned long max_overkill_reclaim;
+
+extern unsigned long max_reclaim_time_aux;
+extern unsigned long max_reclaim_prepare_time_aux;
+extern unsigned long max_total_time;
+
+
+
 /* The default sysctl tables: */
 
 static struct ctl_table root_table[] = {
@@ -1155,6 +1167,72 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_reclaimer",
+		.data		= &max_reclaimer,
+		.maxlen		= sizeof(max_reclaimer),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_reclaim_time",
+		.data		= &max_reclaim_time,
+		.maxlen		= sizeof(max_reclaim_time),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_reclaim_prepare_time",
+		.data		= &max_reclaim_prepare_time,
+		.maxlen		= sizeof(max_reclaim_prepare_time),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "reclaim_limit",
+		.data		= &reclaim_limit,
+		.maxlen		= sizeof(reclaim_limit),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_overkill_reclaim",
+		.data		= &max_overkill_reclaim,
+		.maxlen		= sizeof(max_overkill_reclaim),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_reclaim_time_aux",
+		.data		= &max_reclaim_time_aux,
+		.maxlen		= sizeof(max_reclaim_time_aux),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_reclaim_prepare_time_aux",
+		.data		= &max_reclaim_prepare_time_aux,
+		.maxlen		= sizeof(max_reclaim_prepare_time_aux),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_total_time",
+		.data		= &max_total_time,
+		.maxlen		= sizeof(max_total_time),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c	2008-02-15 20:14:40.000000000 +0900
+++ b/mm/vmscan.c	2008-02-17 11:50:50.000000000 +0900
@@ -1421,6 +1421,23 @@ out:
 	return ret;
 }
 
+static DEFINE_SPINLOCK(research_reclaim_max_lock);
+static atomic_t nr_reclaimers = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(reclaim_throttle_waitq);
+
+// limit 
+int reclaim_limit = 2;
+#define RECLAIM_LIMIT (reclaim_limit * num_highmem_nodes())
+
+// record
+int max_reclaimer = 0;
+unsigned long max_reclaim_time = 0;
+unsigned long max_reclaim_time_aux = 0;
+unsigned long max_reclaim_prepare_time = 0;
+unsigned long max_reclaim_prepare_time_aux = 0;
+unsigned long max_overkill_reclaim = 0;
+unsigned long max_total_time = 0;
+
 unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
 {
 	struct scan_control sc = {
@@ -1433,8 +1450,82 @@ unsigned long try_to_free_pages(struct z
 		.mem_cgroup = NULL,
 		.isolate_pages = isolate_pages_global,
 	};
-
-	return do_try_to_free_pages(zones, gfp_mask, &sc);
+	unsigned long nr_reclaimed;
+	u64 start_time;
+	u64 prepared_time;
+	u64 end_time;
+	u64 preparing_time;
+	u64 reclaiming_time;
+	unsigned long free_mem;
+	int record_max_prepare_time = 0;
+	unsigned long total_time;
+
+	start_time = jiffies_64;
+	
+	if (unlikely(!atomic_add_unless(&nr_reclaimers, 1, RECLAIM_LIMIT)))
+		wait_event(reclaim_throttle_waitq,
+			   atomic_add_unless(&nr_reclaimers, 1, RECLAIM_LIMIT));
+	
+	spin_lock(&research_reclaim_max_lock);
+	if (atomic_read(&nr_reclaimers) > max_reclaimer)
+		max_reclaimer = atomic_read(&nr_reclaimers);
+	
+	prepared_time = jiffies_64;
+	preparing_time = prepared_time - start_time;
+	if (preparing_time > max_reclaim_time) {
+		record_max_prepare_time = 1;
+		max_reclaim_prepare_time = preparing_time;
+	}
+	spin_unlock(&research_reclaim_max_lock);
+	
+	/* after a long wait, is reclaim still needed? */
+	if (preparing_time > HZ) {
+		int i;
+		
+		for (i = 0; zones[i] != NULL; i++) {
+			struct zone *zone = zones[i];
+			int classzone_idx = zone_idx(zones[0]);
+			
+			if (!populated_zone(zone))
+				continue;
+			
+			if (zone_watermark_ok(zone, order, 4*zone->pages_high,
+					      classzone_idx, 0)) {
+				nr_reclaimed = 1;
+				goto out;
+			}
+		}
+	}
+	
+	nr_reclaimed = do_try_to_free_pages(zones, gfp_mask, &sc);
+	
+	spin_lock(&research_reclaim_max_lock);
+	end_time = jiffies_64;
+	reclaiming_time = end_time - prepared_time;
+
+	if (record_max_prepare_time)
+		max_reclaim_prepare_time_aux = reclaiming_time;
+
+	if (reclaiming_time > max_reclaim_time) {
+		max_reclaim_time_aux = preparing_time;
+		max_reclaim_time = reclaiming_time;
+	}
+
+	total_time = preparing_time + reclaiming_time;
+	if (total_time > max_total_time) {
+		max_total_time = total_time;
+	}
+
+	free_mem = global_page_state(NR_FREE_PAGES);
+	if (free_mem > max_overkill_reclaim)
+		max_overkill_reclaim = free_mem;
+	spin_unlock(&research_reclaim_max_lock);
+	
+out:
+	atomic_dec(&nr_reclaimers);
+	wake_up_all(&reclaim_throttle_waitq);
+	
+	return nr_reclaimed;
 }
 
 #ifdef CONFIG_CGROUP_MEM_CONT
Index: b/include/linux/mmzone.h
===================================================================
--- a/include/linux/mmzone.h	2008-02-15 20:14:40.000000000 +0900
+++ b/include/linux/mmzone.h	2008-02-15 20:14:49.000000000 +0900
@@ -334,7 +334,6 @@ struct zone {
 	 */
 	unsigned long		spanned_pages;	/* total size, including holes */
 	unsigned long		present_pages;	/* amount of memory (excluding holes) */
-
 	/*
 	 * rarely used fields:
 	 */
Index: b/include/linux/nodemask.h
===================================================================
--- a/include/linux/nodemask.h	2008-02-15 20:14:40.000000000 +0900
+++ b/include/linux/nodemask.h	2008-02-15 20:14:49.000000000 +0900
@@ -431,6 +431,8 @@ static inline int num_node_state(enum no
 
 #define num_online_nodes()	num_node_state(N_ONLINE)
 #define num_possible_nodes()	num_node_state(N_POSSIBLE)
+#define num_highmem_nodes()	num_node_state(N_HIGH_MEMORY)
+
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 

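For readers following the mm/vmscan.c hunk above: the throttle hinges on
atomic_add_unless(), which increments nr_reclaimers unless it already equals
RECLAIM_LIMIT. The following is a minimal user-space sketch of that admission
step using C11 atomics, not the kernel code itself. The names
model_add_unless(), reclaim_slot_tryget() and reclaim_slot_put() are
hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * User-space model of atomic_add_unless(v, a, u):
 * add `a` to `*v` unless `*v` already equals `u`.
 * Returns true if the addition was performed.
 */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, `old` is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* Hypothetical helper: try to take one of `limit` reclaim slots. */
static bool reclaim_slot_tryget(atomic_int *nr_reclaimers, int limit)
{
	assert(limit > 0);
	return model_add_unless(nr_reclaimers, 1, limit);
}

/* Hypothetical helper: give the slot back when reclaim is finished. */
static void reclaim_slot_put(atomic_int *nr_reclaimers)
{
	atomic_fetch_sub(nr_reclaimers, 1);
}
```

In the patch, a task that fails to get a slot sleeps in wait_event() and
retries the same increment when a finishing reclaimer calls wake_up_all().
The sketch leaves the sleeping part out and only models the counter.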
[-- Attachment #3: benchloop.sh --]
[-- Type: application/octet-stream, Size: 1515 bytes --]

#!/bin/sh

for i in 2 10000; do
	for j in 1 2 3; do
		sudo sh -c "echo $i > /proc/sys/vm/reclaim_limit"
		sudo sh -c "echo 0 > /proc/sys/vm/max_reclaim_prepare_time"
		sudo sh -c "echo 0 > /proc/sys/vm/max_reclaim_time"
		sudo sh -c "echo 0 > /proc/sys/vm/max_reclaim_prepare_time_aux"
		sudo sh -c "echo 0 > /proc/sys/vm/max_reclaim_time_aux"
		sudo sh -c "echo 0 > /proc/sys/vm/max_total_time"
		sudo sh -c "echo 0 > /proc/sys/vm/max_reclaimer"
		sudo sh -c "echo 0 > /proc/sys/vm/max_overkill_reclaim"
		sudo sh -c "echo 0 > /proc/sys/vm/max_saved_overkill_reclaim"
		sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

		echo "max reclaim $i"
		/usr/bin/time ./hackbench 120 process 1000 2>&1 | uniq

		prepare_time=`cat /proc/sys/vm/max_reclaim_prepare_time`
		prepare_time_aux=`cat /proc/sys/vm/max_reclaim_prepare_time_aux`
		echo "max prepare time: $prepare_time $prepare_time_aux"

		reclaim_time_aux=`cat /proc/sys/vm/max_reclaim_time_aux`
		reclaim_time=`cat /proc/sys/vm/max_reclaim_time`
		echo "max reclaim time: $reclaim_time_aux $reclaim_time"

		echo "total"
		cat /proc/sys/vm/max_total_time

		echo "max reclaimer"
		cat /proc/sys/vm/max_reclaimer

		echo "max overkill"
		cat /proc/sys/vm/max_overkill_reclaim

		echo "max saved overkill"
		cat /proc/sys/vm/max_saved_overkill_reclaim

		sudo sh -c "echo 1000 > /proc/sys/vm/reclaim_limit"
	done
done

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-20 10:09       ` KOSAKI Motohiro
@ 2008-02-21  9:38         ` minchan Kim
  2008-02-21 10:55           ` KOSAKI Motohiro
  0 siblings, 1 reply; 15+ messages in thread
From: minchan Kim @ 2008-02-21  9:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel, Lee Schermerhorn,
	linux-mm, linux-kernel

I missed the CCs, so I am resending.


First of all, I tested it on an embedded board.

---
<test machine>
      CPU:  200MHz(ARM926EJ-S)
      MEM:  32M
      SWAP: none
      KERNEL : 2.6.25-rc1

<test 1> - NO SWAP

before :

Running with 5*40 (== 200) tasks.

Time: 12.591
       Command being timed: "./hackbench.arm 5 process 100"
       User time (seconds): 0.78
       System time (seconds): 13.39
       Percent of CPU this job got: 99%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 14.22s
       Major (requiring I/O) page faults: 20
       max parallel reclaim tasks:     30
       max consumption time of
        try_to_free_pages():           789

after:

Running with 5*40 (== 200) tasks.
Time: 11.535
       Command being timed: "./hackbench.arm 5 process 100"
       User time (seconds): 0.69
       System time (seconds): 12.42
       Percent of CPU this job got: 99%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 13.16s
       Major (requiring I/O) page faults: 18
       max parallel reclaim tasks:     4
       max consumption time of
       try_to_free_pages():           740

<test 2> - SWAP
before:
Running with 6*40 (== 240) tasks.
Time: 121.686
       Command being timed: "./hackbench.arm 6 process 100"
       User time (seconds): 1.89
       System time (seconds): 44.95
       Percent of CPU this job got: 37%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 2m 3.79s
       Major (requiring I/O) page faults: 230
       max parallel reclaim tasks:     56
       max consumption time of
       try_to_free_pages():           10811


after :
Running with 6*40 (== 240) tasks.
Time: 67.757
       Command being timed: "./hackbench.arm 6 process 100"
       User time (seconds): 1.56
       System time (seconds): 35.41
       Percent of CPU this job got: 52%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 1m 9.87s
       Major (requiring I/O) page faults: 16
       max parallel reclaim tasks:     4
       max consumption time of
         try_to_free_pages():           6419

<test 3> - NO SWAP

before:

'OOM killer killed hackbench!!!'

after :
Time: 16.578
       Command being timed: "./hackbench.arm 6 process 100"
       User time (seconds): 0.71
       System time (seconds): 17.92
       Percent of CPU this job got: 99%
       Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 18.69s
       Major (requiring I/O) page faults: 22
       max parallel reclaim tasks:     4
       max consumption time of
        try_to_free_pages():           1785

===============================

It was a very interesting result.
On an embedded system, your patch improves performance a little in the
no-swap case (the normal case for embedded systems).
But the more important thing is that OOM occurred when I ran 240 processes
without a swap device on the vanilla kernel.
After I applied your patch, it worked very well without OOM.

I think the reason is that the zone's pages_scanned was six times greater
than the number of LRU pages.
As a result, OOM happened.

So, I think your patch also improves performance on embedded systems.

When OOM didn't occur, reclaim performance without a swap device
was better than with a swap device.
Now, I think we need to improve the reclaim procedure on embedded
systems (UP and no swap).

On Wed, Feb 20, 2008 at 7:09 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
>
>  > > >  * max parallel reclaim tasks:
>  > > >  *  max consumption time of
>  > > >         try_to_free_pages():
>  > >
>  > > sorry, I inserted debug code to my patch at that time.
>  >
>  > Could you send me that debug code ?
>  > If you will send it to me, I will test it my environment (ARM-920T, Core2Duo).
>  > And I will report test result.
>
>  attached it.
>  but it is very messy ;-)
>
>  usage:
>  ./benchloop.sh
>
>  sample output
>  =========================================================
>  max reclaim 2
>  Running with 120*40 (== 4800) tasks.
>  Time: 34.177
>  14.17user 284.38system 1:43.85elapsed 287%CPU (0avgtext+0avgdata 0maxresident)k
>  0inputs+0outputs (3813major+148922minor)pagefaults 0swaps
>  max prepare time: 4599 0
>  max reclaim time: 2350 5781
>  total
>  8271
>  max reclaimer
>  4
>  max overkill
>  62131
>  max saved overkill
>  9740
>
>
>  "max reclaimer" represents max parallel reclaim tasks.
>  "total" represents max consumption time of try_to_free_pages().
>
>  Thanks
>
>



-- 
Thanks,
barrios

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-21  9:38         ` minchan Kim
@ 2008-02-21 10:55           ` KOSAKI Motohiro
  2008-02-21 12:29             ` minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2008-02-21 10:55 UTC (permalink / raw)
  To: minchan Kim
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel, Lee Schermerhorn,
	linux-mm, linux-kernel

Hi Kim-san,

Thank you very much.
By the way, what is the difference between <test 1> and <test 2>?

>  It was a very interesting result.
>  In embedded system, your patch improve performance a little in case
>  without noswap(normal case in embedded system).
>  But, more important thing is OOM occured when I made 240 process
>  without swap device and vanilla kernel.
>  Then, I applied your patch, it worked very well without OOM.

Wow, that is a very interesting result!
I am very happy.

>  I think that's why zone's page_scanned was six times greater than
>  number of lru pages.
>  At result, OOM happened.

Please repost the question with a changed subject.
I don't know the reason for the vanilla kernel's behavior, sorry.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-21 10:55           ` KOSAKI Motohiro
@ 2008-02-21 12:29             ` minchan Kim
  2008-02-21 12:41               ` KOSAKI Motohiro
  0 siblings, 1 reply; 15+ messages in thread
From: minchan Kim @ 2008-02-21 12:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel, Lee Schermerhorn,
	linux-mm, linux-kernel

On Thu, Feb 21, 2008 at 7:55 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi Kim-san,
>
>  Thank you very much.
>  btw, what different between <test 1> and <test 2>?

<test 1> has no swap device and runs 200 hackbench tasks.
<test 2> has a swap device (32M) and runs 240 hackbench tasks.
If <test 2> has no swap device and your patch is not applied, hackbench is
killed by the OOM killer.

<test 1> - NO SWAP
Running with 5*40 (== 200) tasks.
...
<test 2> - SWAP
Running with 6*40 (== 240) tasks.
...

>
>  >  It was a very interesting result.
>  >  In embedded system, your patch improve performance a little in case
>  >  without noswap(normal case in embedded system).
>  >  But, more important thing is OOM occured when I made 240 process
>  >  without swap device and vanilla kernel.
>  >  Then, I applied your patch, it worked very well without OOM.
>
>  Wow, it is very interesting result!
>  I am very happy.
>
>
>  >  I think that's why zone's page_scanned was six times greater than
>  >  number of lru pages.
>  >  At result, OOM happened.
>
>  please repost question with change subject.
>  i don't know reason of vanilla kernel behavior, sorry.

Normally, embedded Linux has only one zone (DMA).

If your patch isn't applied, several processes can reclaim memory in parallel.
Then the DMA zone's pages_scanned increases suddenly and sharply. Because
embedded Linux has no swap device, the kernel can't stop scanning the LRU
list until it meets a page cache page. So if zone->pages_scanned becomes
more than six times the number of LRU list pages, the kernel marks the zone
as unreclaimable. As a result, OOM will kill it, too.
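The heuristic described above can be sketched as a small user-space model.
It is only an illustration under stated assumptions: struct zone_model and
its helpers are hypothetical, the factor of six comes from the description
above, and the exact comparison the kernel uses is assumed rather than quoted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-zone counters. */
struct zone_model {
	unsigned long pages_scanned;	/* pages scanned since the last free */
	unsigned long nr_active;	/* pages on the active LRU list */
	unsigned long nr_inactive;	/* pages on the inactive LRU list */
};

/* Every reclaimer adds the pages it scanned to the shared counter. */
static void zone_account_scan(struct zone_model *z, unsigned long nr_scanned)
{
	assert(z);
	z->pages_scanned += nr_scanned;
}

/*
 * Assumed form of the check: the zone is treated as unreclaimable once
 * the scanned count reaches six times the number of LRU pages.
 */
static bool zone_looks_unreclaimable(const struct zone_model *z)
{
	unsigned long lru_pages = z->nr_active + z->nr_inactive;

	return z->pages_scanned >= lru_pages * 6;
}
```

For example, on a zone with 1000 LRU pages, 30 parallel reclaimers that each
scan 200 pages push pages_scanned to 6000 and trip the check, while 4
reclaimers reach only 800. The figure of 200 pages per reclaimer is made up
for illustration; 30 and 4 are the "max parallel reclaim tasks" values from
<test 1>.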

-- 
Thanks,
barrios

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-21 12:29             ` minchan Kim
@ 2008-02-21 12:41               ` KOSAKI Motohiro
  0 siblings, 0 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2008-02-21 12:41 UTC (permalink / raw)
  To: minchan Kim
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Rik van Riel, Lee Schermerhorn,
	linux-mm, linux-kernel

>  >  please repost question with change subject.
>  >  i don't know reason of vanilla kernel behavior, sorry.
>
>  Normally, embedded linux have only one zone(DMA).
>
>  If your patch isn't applied, several processes can reclaim memory in parallel.
>  then, DMA zone's pages_scanned is suddenly increased largely. Because
>  embedded linux have no swap device,  kernel can't stop to scan lru
>  list until meeting page cache page. so if zone->pages_scanned is
>  greater six time than lru list pages, kernel make the zone with
>  unreclaimable state, As a result, OOM will kill it, too.

Sorry, my last mail was confusing.
If you want to discuss the vanilla kernel bug, you should post it in a separate thread.
Otherwise, your mail will be read by only a few people.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-19  5:44 [RFC][PATCH] the proposal of improve page reclaim by throttle KOSAKI Motohiro
  2008-02-19  6:34 ` Nick Piggin
  2008-02-20  8:56 ` minchan Kim
@ 2008-02-21  9:48 ` Balbir Singh
  2008-02-21 11:01   ` KOSAKI Motohiro
  2 siblings, 1 reply; 15+ messages in thread
From: Balbir Singh @ 2008-02-21  9:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Rik van Riel, Lee Schermerhorn, linux-mm,
	linux-kernel

KOSAKI Motohiro wrote:
> background
> ========================================
> current VM implementation doesn't has limit of # of parallel reclaim.
> when heavy workload, it bring to 2 bad things
>   - heavy lock contention
>   - unnecessary swap out
> 
> abount 2 month ago, KAMEZA Hiroyuki proposed the patch of page 
> reclaim throttle and explain it improve reclaim time.
> 	http://marc.info/?l=linux-mm&m=119667465917215&w=2
> 
> but unfortunately it works only memcgroup reclaim.
> Today, I implement it again for support global reclaim and mesure it.
> 

Hi, Kosaki,

It's good to keep the main reclaim code and the memory controller reclaim in
sync, so this is a nice effort.

> @@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
> 
>  	zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
> -	if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
> +	if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
>  		return 1;
>  	return 0;
>  }
> 

try_to_free_pages_throttled checks zone_watermark_ok(), which will not work
when we are reclaiming from a cgroup that is over its limit. We need
a different check, to see whether the mem_cgroup is still over its limit or not.
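As a rough illustration of the kind of check meant here, the following
user-space sketch decides whether reclaim is still needed by looking at the
group's own usage and limit instead of the zone watermarks. Everything in it
is hypothetical: struct memcg_model, memcg_under_limit() and
memcg_reclaim_still_needed() are illustrative names, not existing kernel
interfaces, and "under limit" is assumed to mean usage < limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a memory cgroup's usage counter. */
struct memcg_model {
	unsigned long long usage;	/* bytes currently charged */
	unsigned long long limit;	/* configured limit in bytes */
};

/* Assumed definition: the group is under its limit while usage < limit. */
static bool memcg_under_limit(const struct memcg_model *mc)
{
	assert(mc);
	return mc->usage < mc->limit;
}

/*
 * After a reclaimer has waited in the throttle, decide whether reclaim
 * is still needed. For cgroup reclaim the question is about the group's
 * own limit, not about the zone watermarks.
 */
static bool memcg_reclaim_still_needed(const struct memcg_model *mc)
{
	return !memcg_under_limit(mc);
}
```

A throttled cgroup reclaimer would then skip the reclaim pass when
memcg_reclaim_still_needed() returns false, in the same place where the
global path consults zone_watermark_ok().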

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-21  9:48 ` Balbir Singh
@ 2008-02-21 11:01   ` KOSAKI Motohiro
  2008-02-21 11:02     ` Balbir Singh
  0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2008-02-21 11:01 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, Rik van Riel, Lee Schermerhorn, linux-mm,
	linux-kernel

Hi balbir-san

>  It's good to keep the main reclaim code and the memory controller reclaim in
>  sync, so this is a nice effort.

Thank you.
I will repost the next version (addressing Nick's comments) within a few days.

>  > @@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
>  >       int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
>  >
>  >       zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
>  > -     if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
>  > +     if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
>  >               return 1;
>  >       return 0;
>  >  }
>
>  try_to_free_pages_throttled checks for zone_watermark_ok(), that will not work
>  in the case that we are reclaiming from a cgroup which over it's limit. We need
>  a different check, to see if the mem_cgroup is still over it's limit or not.

That makes sense.

Unfortunately, I don't know mem-cgroup very well.
Which function should I use instead?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC][PATCH] the proposal of improve page reclaim by throttle
  2008-02-21 11:01   ` KOSAKI Motohiro
@ 2008-02-21 11:02     ` Balbir Singh
  0 siblings, 0 replies; 15+ messages in thread
From: Balbir Singh @ 2008-02-21 11:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Rik van Riel, Lee Schermerhorn, linux-mm,
	linux-kernel

KOSAKI Motohiro wrote:
> Hi balbir-san
> 
>>  It's good to keep the main reclaim code and the memory controller reclaim in
>>  sync, so this is a nice effort.
> 
> thank you.
> I will repost next version (fixed nick's opinion) while a few days.
> 
>>  > @@ -1456,7 +1501,7 @@ unsigned long try_to_free_mem_cgroup_pag
>>  >       int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
>>  >
>>  >       zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
>>  > -     if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
>>  > +     if (try_to_free_pages_throttled(zones, 0, sc.gfp_mask, &sc))
>>  >               return 1;
>>  >       return 0;
>>  >  }
>>
>>  try_to_free_pages_throttled checks for zone_watermark_ok(), that will not work
>>  in the case that we are reclaiming from a cgroup which over it's limit. We need
>>  a different check, to see if the mem_cgroup is still over it's limit or not.
> 
> That makes sense.
> 
> unfortunately, I don't know mem-cgroup so much.
> What do i use function, instead?

One option could be that once the memory controller has this feature, we'll need
no changes in try_to_free_mem_cgroup_pages.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 15+ messages in thread

Thread overview: 15+ messages
2008-02-19  5:44 [RFC][PATCH] the proposal of improve page reclaim by throttle KOSAKI Motohiro
2008-02-19  6:34 ` Nick Piggin
2008-02-19  7:09   ` KOSAKI Motohiro
2008-02-19 13:31   ` Rik van Riel
2008-02-20  8:56 ` minchan Kim
2008-02-20  9:24   ` KOSAKI Motohiro
2008-02-20  9:49     ` minchan Kim
2008-02-20 10:09       ` KOSAKI Motohiro
2008-02-21  9:38         ` minchan Kim
2008-02-21 10:55           ` KOSAKI Motohiro
2008-02-21 12:29             ` minchan Kim
2008-02-21 12:41               ` KOSAKI Motohiro
2008-02-21  9:48 ` Balbir Singh
2008-02-21 11:01   ` KOSAKI Motohiro
2008-02-21 11:02     ` Balbir Singh
