linux-mm.kvack.org archive mirror
* [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3
@ 2010-12-09 11:18 Mel Gorman
  2010-12-09 11:18 ` [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-09 11:18 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, Johannes Weiner,
	Andrew Morton, linux-mm, linux-kernel, Mel Gorman

There was a minor bug in V2 that led to this release.  I'm hopeful it'll
stop kswapd going mad on Simon's machine and might also alleviate some of
the "too much free memory" problem.

Changelog since V2
  o Add clarifying comments
  o Properly check that the zone is balanced for order-0
  o Treat zone->all_unreclaimable properly

Changelog since V1
  o Take classzone into account
  o Ensure that kswapd always balances at order-0
  o Reset classzone and order after reading
  o Require a percentage of a node be balanced for high-order allocations,
    not just any zone as ZONE_DMA could be balanced when the node in general
    is a mess

Simon Kirby reported the following problem

   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4
   GB of memory that never seem to have less than 1.5 GB free, even with
   a constantly-active VM.  In some cases, these servers also swap out
   while this happens, even though they are constantly reading the working
   set into memory.  We have been seeing this happening for a long time;
   I don't think it's anything recent, and it still happens on 2.6.36.

After some debugging work by Simon, Dave Hansen and others, the prevailing
theory became that kswapd is reclaiming order-3 pages requested by SLUB
but is being too aggressive about it.

There are two apparent problems here. On the target machine, there is a small
Normal zone in comparison to DMA32. As kswapd tries to balance all zones, it
would continually try reclaiming for Normal even though DMA32 was balanced
enough for callers. The second problem is that sleeping_prematurely() does
not use the same logic as balance_pgdat() when deciding whether to sleep
or not.  This keeps kswapd artificially awake.

A number of tests were run and the figures from previous postings will look
very different for a few reasons. One, the old figures were forcing my network
card to use GFP_ATOMIC in an attempt to replicate Simon's problem. Second, I
previously specified slub_min_order=3, again in an attempt to reproduce Simon's
problem. In this posting, I'm depending on Simon to say whether his problem is
fixed or not, and these figures are to show the impact on the ordinary cases.
Finally, the "vmscan" figures are taken from /proc/vmstat instead of the
tracepoints.  There is less information but recording is less disruptive.

The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks. The
objective was to vaguely simulate what was happening on Simon's machine,
and it is memory-intensive enough to keep kswapd awake.

POSTMARK
                                            traceonly          kanyzone
Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)         16.58      17.4
Total Elapsed Time (seconds)                218.48    222.47

VMstat Reclaim Statistics: vmscan
Direct reclaims                                  0          4
Direct reclaim pages scanned                     0        203
Direct reclaim pages reclaimed                   0        184
Kswapd pages scanned                        326631     322018
Kswapd pages reclaimed                      312632     309784
Kswapd low wmark quickly                         1          4
Kswapd high wmark quickly                      122        475
Kswapd skip congestion_wait                      1          0
Pages activated                             700040     705317
Pages deactivated                           212113     203922
Pages written                                 9875       6363

Total pages scanned                         326631    322221
Total pages reclaimed                       312632    309968
%age total pages scanned/reclaimed          95.71%    96.20%
%age total pages scanned/written             3.02%     1.97%

proc vmstat: Faults
Major Faults                                   300       254
Minor Faults                                645183    660284
Page ins                                    493588    486704
Page outs                                  4960088   4986704
Swap ins                                      1230       661
Swap outs                                     9869      6355

Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way. Note
that kswapd scanned and reclaimed fewer pages as it's less aggressive and
overall fewer pages were scanned and reclaimed. Swap in/out is particularly
reduced, again reflecting kswapd throwing out fewer pages.

The slight performance impact is unfortunate here but it looks like a direct
result of kswapd being less aggressive. As the bug report is about too many
pages being freed by kswapd, it may have to be accepted for now.

The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.

MICRO
					 traceonly  kanyzone
User/Sys Time Running Test (seconds)         29.29     28.87
Total Elapsed Time (seconds)                492.18    488.79

VMstat Reclaim Statistics: vmscan
Direct reclaims                               2128       1460
Direct reclaim pages scanned               2284822    1496067
Direct reclaim pages reclaimed              148919     110937
Kswapd pages scanned                      15450014   16202876
Kswapd pages reclaimed                     8503697    8537897
Kswapd low wmark quickly                      3100       3397
Kswapd high wmark quickly                     1860       7243
Kswapd skip congestion_wait                    708        801
Pages activated                               9635       9573
Pages deactivated                             1432       1271
Pages written                                  223       1130

Total pages scanned                       17734836  17698943
Total pages reclaimed                      8652616   8648834
%age total pages scanned/reclaimed          48.79%    48.87%
%age total pages scanned/written             0.00%     0.01%

proc vmstat: Faults
Major Faults                                   165       221
Minor Faults                               9655785   9656506
Page ins                                      3880      7228
Page outs                                 37692940  37480076
Swap ins                                         0        69
Swap outs                                       19        15

Again, fewer pages are scanned and reclaimed as expected, and this time the
test completed faster. Note that kswapd is hitting its watermarks faster (low
and high wmark quickly), which I expect is due to kswapd reclaiming fewer
pages.

I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures. Performance is not significantly changed and the
reclaim statistics look reasonable.

 include/linux/mmzone.h |    3 +-
 mm/page_alloc.c        |    8 ++-
 mm/vmscan.c            |  132 +++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 120 insertions(+), 23 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-09 11:18 [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
@ 2010-12-09 11:18 ` Mel Gorman
  2010-12-09 15:21   ` Minchan Kim
                     ` (2 more replies)
  2010-12-09 11:18 ` [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node " Mel Gorman
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-09 11:18 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, Johannes Weiner,
	Andrew Morton, linux-mm, linux-kernel, Mel Gorman

When the allocator enters its slow path, kswapd is woken up to balance the
node. It continues working until all zones within the node are balanced. For
order-0 allocations, this makes perfect sense but for higher orders it can
have unintended side-effects. If the zone sizes are imbalanced, kswapd may
reclaim heavily within a smaller zone discarding an excessive number of
pages. The user-visible behaviour is that kswapd is awake and reclaiming
even though plenty of pages are free from a suitable zone.

This patch alters the "balance" logic for high-order reclaim allowing kswapd
to stop if any suitable zone becomes balanced to reduce the number of pages
it reclaims from other zones. kswapd still tries to ensure that order-0
watermarks for all zones are met before sleeping.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    3 +-
 mm/page_alloc.c        |    8 +++--
 mm/vmscan.c            |   68 +++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..7177f51 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -645,6 +645,7 @@ typedef struct pglist_data {
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+	enum zone_type classzone_idx;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -660,7 +661,7 @@ typedef struct pglist_data {
 
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(void *data);
-void wakeup_kswapd(struct zone *zone, int order);
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e409270..82e3499 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1915,13 +1915,14 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 
 static inline
 void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
-						enum zone_type high_zoneidx)
+						enum zone_type high_zoneidx,
+						enum zone_type classzone_idx)
 {
 	struct zoneref *z;
 	struct zone *zone;
 
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+		wakeup_kswapd(zone, order, classzone_idx);
 }
 
 static inline int
@@ -1998,7 +1999,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 restart:
-	wake_all_kswapd(order, zonelist, high_zoneidx);
+	wake_all_kswapd(order, zonelist, high_zoneidx,
+						zone_idx(preferred_zone));
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..25cb373 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2165,11 +2165,14 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+							int classzone_idx)
 {
 	int all_zones_ok;
+	int any_zone_ok;
 	int priority;
 	int i;
+	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
@@ -2192,7 +2195,6 @@ loop_again:
 	count_vm_event(PAGEOUTRUN);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
 		int has_under_min_watermark_zone = 0;
 
@@ -2201,6 +2203,7 @@ loop_again:
 			disable_swap_token();
 
 		all_zones_ok = 1;
+		any_zone_ok = 0;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest
@@ -2310,10 +2313,12 @@ loop_again:
 				 * spectulatively avoid congestion waits
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
+				if (i <= classzone_idx)
+					any_zone_ok = 1;
 			}
 
 		}
-		if (all_zones_ok)
+		if (all_zones_ok || (order && any_zone_ok))
 			break;		/* kswapd: all done */
 		/*
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2336,7 +2341,13 @@ loop_again:
 			break;
 	}
 out:
-	if (!all_zones_ok) {
+
+	/*
+	 * order-0: All zones must meet high watermark for a balanced node
+	 * high-order: Any zone below pgdats classzone_idx must meet the high
+	 *             watermark for a balanced node
+	 */
+	if (!(all_zones_ok || (order && any_zone_ok))) {
 		cond_resched();
 
 		try_to_freeze();
@@ -2361,6 +2372,36 @@ out:
 		goto loop_again;
 	}
 
+	/*
+	 * If kswapd was reclaiming at a higher order, it has the option of
+	 * sleeping without all zones being balanced. Before it does, it must
+	 * ensure that the watermarks for order-0 on *all* zones are met and
+	 * that the congestion flags are cleared. The congestion flag must
+	 * be cleared as kswapd is the only mechanism that clears the flag
+	 * and it is potentially going to sleep here.
+	 */
+	if (order) {
+		for (i = 0; i <= end_zone; i++) {
+			struct zone *zone = pgdat->node_zones + i;
+
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+				continue;
+
+			/* Confirm the zone is balanced for order-0 */
+			if (!zone_watermark_ok(zone, 0,
+					high_wmark_pages(zone), 0, 0)) {
+				order = sc.order = 0;
+				goto loop_again;
+			}
+
+			/* If balanced, clear the congested flag */
+			zone_clear_flag(zone, ZONE_CONGESTED);
+		}
+	}
+
 	return sc.nr_reclaimed;
 }
 
@@ -2380,6 +2421,7 @@ out:
 static int kswapd(void *p)
 {
 	unsigned long order;
+	int classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 	DEFINE_WAIT(wait);
@@ -2410,19 +2452,24 @@ static int kswapd(void *p)
 	set_freezable();
 
 	order = 0;
+	classzone_idx = MAX_NR_ZONES - 1;
 	for ( ; ; ) {
 		unsigned long new_order;
+		int new_classzone_idx;
 		int ret;
 
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		new_order = pgdat->kswapd_max_order;
+		new_classzone_idx = pgdat->classzone_idx;
 		pgdat->kswapd_max_order = 0;
-		if (order < new_order) {
+		pgdat->classzone_idx = MAX_NR_ZONES - 1;
+		if (order < new_order || classzone_idx > new_classzone_idx) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
-			 * allocation
+			 * allocation or has tighter zone constraints
 			 */
 			order = new_order;
+			classzone_idx = new_classzone_idx;
 		} else {
 			if (!freezing(current) && !kthread_should_stop()) {
 				long remaining = 0;
@@ -2451,6 +2498,7 @@ static int kswapd(void *p)
 			}
 
 			order = pgdat->kswapd_max_order;
+			classzone_idx = pgdat->classzone_idx;
 		}
 		finish_wait(&pgdat->kswapd_wait, &wait);
 
@@ -2464,7 +2512,7 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balance_pgdat(pgdat, order);
+			balance_pgdat(pgdat, order, classzone_idx);
 		}
 	}
 	return 0;
@@ -2473,7 +2521,7 @@ static int kswapd(void *p)
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.
  */
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
 
@@ -2483,8 +2531,10 @@ void wakeup_kswapd(struct zone *zone, int order)
 	pgdat = zone->zone_pgdat;
 	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
-	if (pgdat->kswapd_max_order < order)
+	if (pgdat->kswapd_max_order < order) {
 		pgdat->kswapd_max_order = order;
+		pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
+	}
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced
  2010-12-09 11:18 [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
  2010-12-09 11:18 ` [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
@ 2010-12-09 11:18 ` Mel Gorman
  2010-12-09 15:42   ` Minchan Kim
                     ` (2 more replies)
  2010-12-09 11:18 ` [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely() Mel Gorman
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-09 11:18 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, Johannes Weiner,
	Andrew Morton, linux-mm, linux-kernel, Mel Gorman

When reclaiming at high orders, kswapd is responsible for balancing a
node, but it should not reclaim excessively. It avoids excessive reclaim by
considering the node balanced if any zone in it is balanced. In cases where
zone sizes are imbalanced (e.g. a small ZONE_DMA alongside both ZONE_DMA32
and ZONE_NORMAL), kswapd can go to sleep prematurely because just one small
zone was balanced.

This patch alters the sleep logic of kswapd slightly. It counts the number
of pages that make up the balanced zones. If the total number of balanced
pages is more than a quarter of the node, kswapd will go back to sleep. This
should keep the node balanced without reclaiming an excessive number of
pages.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   43 ++++++++++++++++++++++++++++++++++---------
 1 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25cb373..b4472a1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2117,10 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 }
 #endif
 
+/*
+ * pgdat_balanced is used when checking if a node is balanced for high-order
+ * allocations. Only zones that meet watermarks make up "balanced".
+ * The total of balanced pages must be at least 25% of the node for the
+ * node to be considered balanced. Forcing all zones to be balanced for high
+ * orders can cause excessive reclaim when there are imbalanced zones.
+ * Similarly, we do not want kswapd to go to sleep because ZONE_DMA happens
+ * to be balanced when ZONE_DMA32 is huge in comparison and unbalanced
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
+{
+	return balanced > pgdat->node_present_pages / 4;
+}
+
 /* is kswapd sleeping prematurely? */
 static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
 	int i;
+	unsigned long balanced = 0;
+	bool all_zones_ok = true;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
@@ -2138,10 +2154,19 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
 								0, 0))
-			return 1;
+			all_zones_ok = false;
+		else
+			balanced += zone->present_pages;
 	}
 
-	return 0;
+	/*
+	 * For high-order requests, any zone meeting the watermark allows
+	 * kswapd to sleep. For order-0, all zones must be balanced
+	 */
+	if (order)
+		return pgdat_balanced(pgdat, balanced);
+	else
+		return !all_zones_ok;
 }
 
 /*
@@ -2169,7 +2194,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 							int classzone_idx)
 {
 	int all_zones_ok;
-	int any_zone_ok;
+	unsigned long balanced;
 	int priority;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
@@ -2203,7 +2228,7 @@ loop_again:
 			disable_swap_token();
 
 		all_zones_ok = 1;
-		any_zone_ok = 0;
+		balanced = 0;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest
@@ -2314,11 +2339,11 @@ loop_again:
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
 				if (i <= classzone_idx)
-					any_zone_ok = 1;
+					balanced += zone->present_pages;
 			}
 
 		}
-		if (all_zones_ok || (order && any_zone_ok))
+		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))
 			break;		/* kswapd: all done */
 		/*
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2344,10 +2369,10 @@ out:
 
 	/*
 	 * order-0: All zones must meet high watermark for a balanced node
-	 * high-order: Any zone below pgdats classzone_idx must meet the high
-	 *             watermark for a balanced node
+	 * high-order: Balanced zones must make up at least 25% of the node
+	 *             for the node to be balanced
 	 */
-	if (!(all_zones_ok || (order && any_zone_ok))) {
+	if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))) {
 		cond_resched();
 
 		try_to_freeze();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely()
  2010-12-09 11:18 [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
  2010-12-09 11:18 ` [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
  2010-12-09 11:18 ` [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node " Mel Gorman
@ 2010-12-09 11:18 ` Mel Gorman
  2010-12-10  1:16   ` Minchan Kim
  2010-12-10  1:18   ` KAMEZAWA Hiroyuki
  2010-12-09 11:18 ` [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading Mel Gorman
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-09 11:18 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, Johannes Weiner,
	Andrew Morton, linux-mm, linux-kernel, Mel Gorman

Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
there was a race pushing a zone below its watermark. If the race
happened, it stays awake. However, balance_pgdat() can drop to a lower
order if it finds that high-order reclaim is not working as expected. This
information is not passed back to sleeping_prematurely(). The impact is
that kswapd remains awake reclaiming pages long after it should have gone
to sleep. This patch passes the adjusted order to sleeping_prematurely(),
which uses the same logic as balance_pgdat() to decide if it is OK to go
to sleep.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   14 ++++++++++----
 1 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b4472a1..52e229e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2132,7 +2132,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
 }
 
 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
 	int i;
 	unsigned long balanced = 0;
@@ -2142,7 +2142,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 	if (remaining)
 		return 1;
 
-	/* If after HZ/10, a zone is below the high mark, it's premature */
+	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -2427,7 +2427,13 @@ out:
 		}
 	}
 
-	return sc.nr_reclaimed;
+	/*
+	 * Return the order we were reclaiming at so sleeping_prematurely()
+	 * makes a decision on the order we were last reclaiming at. However,
+	 * if another caller entered the allocator slow path while kswapd
+	 * was awake, order will remain at the higher level
+	 */
+	return order;
 }
 
 /*
@@ -2537,7 +2543,7 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balance_pgdat(pgdat, order, classzone_idx);
+			order = balance_pgdat(pgdat, order, classzone_idx);
 		}
 	}
 	return 0;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading
  2010-12-09 11:18 [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
                   ` (2 preceding siblings ...)
  2010-12-09 11:18 ` [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely() Mel Gorman
@ 2010-12-09 11:18 ` Mel Gorman
  2010-12-09 15:59   ` Minchan Kim
  2010-12-10  1:19   ` KAMEZAWA Hiroyuki
  2010-12-09 11:18 ` [PATCH 5/6] mm: kswapd: Treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() Mel Gorman
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-09 11:18 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, Johannes Weiner,
	Andrew Morton, linux-mm, linux-kernel, Mel Gorman

When kswapd wakes up, it reads its order and classzone from pgdat and
calls balance_pgdat(). While it is awake, it potentially reclaims at a
high order and a low classzone index. This might have been a once-off
request that is not required by subsequent callers. However, because the
pgdat values were not reset, they remain artificially high while
balance_pgdat() is running and kswapd potentially enters a second
unnecessary reclaim cycle. Reset the pgdat order and classzone index
after reading them.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 52e229e..bc233d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2530,6 +2530,8 @@ static int kswapd(void *p)
 
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
+			pgdat->kswapd_max_order = 0;
+			pgdat->classzone_idx = MAX_NR_ZONES - 1;
 		}
 		finish_wait(&pgdat->kswapd_wait, &wait);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 5/6] mm: kswapd: Treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat()
  2010-12-09 11:18 [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
                   ` (3 preceding siblings ...)
  2010-12-09 11:18 ` [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading Mel Gorman
@ 2010-12-09 11:18 ` Mel Gorman
  2010-12-10  1:23   ` KAMEZAWA Hiroyuki
  2010-12-09 11:18 ` [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely() Mel Gorman
  2010-12-09 11:19 ` [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
  6 siblings, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2010-12-09 11:18 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, Johannes Weiner,
	Andrew Morton, linux-mm, linux-kernel, Mel Gorman

After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
be balanced, but sleeping_prematurely() does not. This can force kswapd to
stay awake longer than it should. This patch fixes that.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc233d8..d7b0a3c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2149,8 +2149,16 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone->all_unreclaimable)
+		/*
+		 * balance_pgdat() skips over all_unreclaimable after
+		 * DEF_PRIORITY. Effectively, it considers them balanced so
+		 * they must be considered balanced here as well if kswapd
+		 * is to sleep
+		 */
+		if (zone->all_unreclaimable) {
+			balanced += zone->present_pages;
 			continue;
+		}
 
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
 								0, 0))
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely()
  2010-12-09 11:18 [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
                   ` (4 preceding siblings ...)
  2010-12-09 11:18 ` [PATCH 5/6] mm: kswapd: Treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() Mel Gorman
@ 2010-12-09 11:18 ` Mel Gorman
  2010-12-10  2:38   ` Minchan Kim
  2010-12-09 11:19 ` [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
  6 siblings, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2010-12-09 11:18 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, Johannes Weiner,
	Andrew Morton, linux-mm, linux-kernel, Mel Gorman

When kswapd is woken for a high-order allocation, it takes into
account the highest zone usable by the caller (the classzone idx).
During allocation, this index is used to select the lowmem_reserve[]
that should be applied to the watermark calculation in
zone_watermark_ok().

When balancing a node, kswapd considers the highest unbalanced zone to
be the classzone index. This will always be at least the caller's
classzone_idx and can be higher. However, sleeping_prematurely() always
considers the lowest zone (e.g. ZONE_DMA) to be the classzone index.
This means that sleeping_prematurely() can consider a zone to be
balanced that is unusable by the allocation request that originally
woke kswapd. This patch changes sleeping_prematurely() to use a
classzone_idx matching the value it used in balance_pgdat().
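
[Editorial aside, not part of the patch: the role classzone_idx plays in
the watermark check can be sketched in miniature. The struct, function
and numbers below are illustrative only and deliberately much simpler
than the kernel's actual zone_watermark_ok().]

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES 3	/* e.g. DMA, NORMAL, HIGHMEM */

/* Hypothetical, heavily simplified stand-in for struct zone */
struct zone_sketch {
	unsigned long free_pages;
	unsigned long high_wmark;
	unsigned long lowmem_reserve[MAX_NR_ZONES];
};

/*
 * A zone passes only if its free pages exceed the watermark plus the
 * reserve it holds back against allocations whose classzone is
 * classzone_idx. A higher classzone therefore means a stricter check.
 */
static bool watermark_ok(const struct zone_sketch *z, int classzone_idx)
{
	return z->free_pages > z->high_wmark + z->lowmem_reserve[classzone_idx];
}
```

With made-up numbers, a low zone can look balanced against classzone 0
(the always-0 value sleeping_prematurely() used before this patch) yet
fail against the classzone kswapd actually balanced for, which is
exactly the discrepancy the patch removes.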

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   19 +++++++++++--------
 1 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d7b0a3c..e25b3ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2132,7 +2132,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
 }
 
 /* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
+					int classzone_idx)
 {
 	int i;
 	unsigned long balanced = 0;
@@ -2140,7 +2141,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
-		return 1;
+		return true;
 
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
@@ -2161,7 +2162,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 		}
 
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
-								0, 0))
+							classzone_idx, 0))
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
@@ -2199,7 +2200,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
  * of pages is balanced across the zones.
  */
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
-							int classzone_idx)
+							int *classzone_idx)
 {
 	int all_zones_ok;
 	unsigned long balanced;
@@ -2262,6 +2263,7 @@ loop_again:
 			if (!zone_watermark_ok(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
+				*classzone_idx = i;
 				break;
 			}
 		}
@@ -2346,7 +2348,7 @@ loop_again:
 				 * spectulatively avoid congestion waits
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
-				if (i <= classzone_idx)
+				if (i <= *classzone_idx)
 					balanced += zone->present_pages;
 			}
 
@@ -2441,6 +2443,7 @@ out:
 	 * if another caller entered the allocator slow path while kswapd
 	 * was awake, order will remain at the higher level
 	 */
+	*classzone_idx = end_zone;
 	return order;
 }
 
@@ -2514,7 +2517,7 @@ static int kswapd(void *p)
 				long remaining = 0;
 
 				/* Try to sleep for a short interval */
-				if (!sleeping_prematurely(pgdat, order, remaining)) {
+				if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
 					remaining = schedule_timeout(HZ/10);
 					finish_wait(&pgdat->kswapd_wait, &wait);
 					prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -2525,7 +2528,7 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining)) {
+				if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
 					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 					schedule();
 				} else {
@@ -2553,7 +2556,7 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			order = balance_pgdat(pgdat, order, classzone_idx);
+			order = balance_pgdat(pgdat, order, &classzone_idx);
 		}
 	}
 	return 0;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3
  2010-12-09 11:18 [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
                   ` (5 preceding siblings ...)
  2010-12-09 11:18 ` [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely() Mel Gorman
@ 2010-12-09 11:19 ` Mel Gorman
  6 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-09 11:19 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, Johannes Weiner,
	Andrew Morton, linux-mm, linux-kernel

Bah, this should have been PATCH 0/6 of course :(

On Thu, Dec 09, 2010 at 11:18:14AM +0000, Mel Gorman wrote:
> There was a minor bug in V2 that led to this release.  I'm hopeful it'll
> stop kswapd going mad on Simon's machine and might also alleviate some of
> the "too much free memory" problem.
> 
> Changelog since V2
>   o Add clarifying comments
>   o Properly check that the zone is balanced for order-0
>   o Treat zone->all_unreclaimable properly
> 
> Changelog since V1
>   o Take classzone into account
>   o Ensure that kswapd always balances at order-0
>   o Reset classzone and order after reading
>   o Require a percentage of a node be balanced for high-order allocations,
>     not just any zone as ZONE_DMA could be balanced when the node in general
>     is a mess
> 
> <SNIP>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-09 11:18 ` [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
@ 2010-12-09 15:21   ` Minchan Kim
  2010-12-10  1:06   ` KAMEZAWA Hiroyuki
  2010-12-13 16:54   ` Eric B Munson
  2 siblings, 0 replies; 29+ messages in thread
From: Minchan Kim @ 2010-12-09 15:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu, Dec 09, 2010 at 11:18:15AM +0000, Mel Gorman wrote:
> When the allocator enters its slow path, kswapd is woken up to balance the
> node. It continues working until all zones within the node are balanced. For
> order-0 allocations, this makes perfect sense but for higher orders it can
> have unintended side-effects. If the zone sizes are imbalanced, kswapd may
> reclaim heavily within a smaller zone discarding an excessive number of
> pages. The user-visible behaviour is that kswapd is awake and reclaiming
> even though plenty of pages are free from a suitable zone.
> 
> This patch alters the "balance" logic for high-order reclaim allowing kswapd
> to stop if any suitable zone becomes balanced to reduce the number of pages
> it reclaims from other zones. kswapd still tries to ensure that order-0
> watermarks for all zones are met before sleeping.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Looks good to me.

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced
  2010-12-09 11:18 ` [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node " Mel Gorman
@ 2010-12-09 15:42   ` Minchan Kim
  2010-12-10 10:19     ` Mel Gorman
  2010-12-10  1:16   ` KAMEZAWA Hiroyuki
  2010-12-13 17:00   ` Eric B Munson
  2 siblings, 1 reply; 29+ messages in thread
From: Minchan Kim @ 2010-12-09 15:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu, Dec 09, 2010 at 11:18:16AM +0000, Mel Gorman wrote:
> When reclaiming for high-orders, kswapd is responsible for balancing a
> node but it should not reclaim excessively. It avoids excessive reclaim by
> considering if any zone in a node is balanced then the node is balanced. In
> the cases where there are imbalanced zone sizes (e.g. ZONE_DMA with both
> ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep prematurely as just
> one small zone was balanced.
> 
> This alters the sleep logic of kswapd slightly. It counts the number of pages
> that make up the balanced zones. If the total number of balanced pages is
> more than a quarter of the zone, kswapd will go back to sleep. This should
> keep a node balanced without reclaiming an excessive number of pages.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Just a few trivial comments below.

> ---
>  mm/vmscan.c |   43 ++++++++++++++++++++++++++++++++++---------
>  1 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 25cb373..b4472a1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2117,10 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  }
>  #endif
>  
> +/*
> + * pgdat_balanced is used when checking if a node is balanced for high-order
> + * allocations. Only zones that meet watermarks make up "balanced".
> + * The total of balanced pages must be at least 25% of the node for the

Could you write down in the description why you selected 25%?
It could help someone who tries to change that watermark in future.

> + * node to be considered balanced. Forcing all zones to be balanced for high
> + * orders can cause excessive reclaim when there are imbalanced zones.
> + * Similarly, we do not want kswapd to go to sleep because ZONE_DMA happens
> + * to be balanced when ZONE_DMA32 is huge in comparison and unbalanced
> + */
> +static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)

I think balanced_pages would be clearer,
but it's just a preference.

> +{
> +	return balanced > pgdat->node_present_pages / 4;
> +}
> +
>  /* is kswapd sleeping prematurely? */
>  static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>  {
>  	int i;
> +	unsigned long balanced = 0;
> +	bool all_zones_ok = true;
>  
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
> @@ -2138,10 +2154,19 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>  
>  		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
>  								0, 0))
> -			return 1;
> +			all_zones_ok = false;
> +		else
> +			balanced += zone->present_pages;
>  	}
>  
> -	return 0;
> +	/*
> +	 * For high-order requests, any zone meeting the watermark allows
> +	 * kswapd to sleep. For order-0, all zones must be balanced

The comment isn't exact.
Perhaps "could allow"?

> +	 */
> +	if (order)
> +		return pgdat_balanced(pgdat, balanced);
> +	else
> +		return !all_zones_ok;
>  }
>  
>  /*
> @@ -2169,7 +2194,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  							int classzone_idx)
>  {
>  	int all_zones_ok;
> -	int any_zone_ok;
> +	unsigned long balanced;
>  	int priority;
>  	int i;
>  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> @@ -2203,7 +2228,7 @@ loop_again:
>  			disable_swap_token();
>  
>  		all_zones_ok = 1;
> -		any_zone_ok = 0;
> +		balanced = 0;
>  
>  		/*
>  		 * Scan in the highmem->dma direction for the highest
> @@ -2314,11 +2339,11 @@ loop_again:
>  				 */
>  				zone_clear_flag(zone, ZONE_CONGESTED);
>  				if (i <= classzone_idx)
> -					any_zone_ok = 1;
> +					balanced += zone->present_pages;
>  			}
>  
>  		}
> -		if (all_zones_ok || (order && any_zone_ok))
> +		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))
>  			break;		/* kswapd: all done */
>  		/*
>  		 * OK, kswapd is getting into trouble.  Take a nap, then take
> @@ -2344,10 +2369,10 @@ out:
>  
>  	/*
>  	 * order-0: All zones must meet high watermark for a balanced node
> -	 * high-order: Any zone below pgdats classzone_idx must meet the high
> -	 *             watermark for a balanced node
> +	 * high-order: Balanced zones must make up at least 25% of the node
> +	 *             for the node to be balanced
>  	 */
> -	if (!(all_zones_ok || (order && any_zone_ok))) {
> +	if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))) {
>  		cond_resched();
>  
>  		try_to_freeze();
> -- 
> 1.7.1
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading
  2010-12-09 11:18 ` [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading Mel Gorman
@ 2010-12-09 15:59   ` Minchan Kim
  2010-12-09 16:03     ` Minchan Kim
  2010-12-10  1:19   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 29+ messages in thread
From: Minchan Kim @ 2010-12-09 15:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu, Dec 09, 2010 at 11:18:18AM +0000, Mel Gorman wrote:
> When kswapd wakes up, it reads its order and classzone from pgdat and
> calls balance_pgdat. While it's awake, it potentially reclaims at a high
> order and a low classzone index. This might have been a once-off that
> was not required by subsequent callers. However, because the pgdat
> values were not reset, they remain artificially high while
> balance_pgdat() is running and potentially kswapd enters a second
> unnecessary reclaim cycle. Reset the pgdat order and classzone index
> after reading.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading
  2010-12-09 15:59   ` Minchan Kim
@ 2010-12-09 16:03     ` Minchan Kim
  2010-12-10 10:42       ` Mel Gorman
  0 siblings, 1 reply; 29+ messages in thread
From: Minchan Kim @ 2010-12-09 16:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 10, 2010 at 12:59 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, Dec 09, 2010 at 11:18:18AM +0000, Mel Gorman wrote:
>> When kswapd wakes up, it reads its order and classzone from pgdat and
>> calls balance_pgdat. While it's awake, it potentially reclaims at a high
>> order and a low classzone index. This might have been a once-off that
>> was not required by subsequent callers. However, because the pgdat
>> values were not reset, they remain artificially high while
>> balance_pgdat() is running and potentially kswapd enters a second
>> unnecessary reclaim cycle. Reset the pgdat order and classzone index
>> after reading.
>>
>> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Nitpick.
Shouldn't this patch be merged with 1/6?

>
> --
> Kind regards,
> Minchan Kim
>



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-09 11:18 ` [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
  2010-12-09 15:21   ` Minchan Kim
@ 2010-12-10  1:06   ` KAMEZAWA Hiroyuki
  2010-12-13 16:54   ` Eric B Munson
  2 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-10  1:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu,  9 Dec 2010 11:18:15 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> When the allocator enters its slow path, kswapd is woken up to balance the
> node. It continues working until all zones within the node are balanced. For
> order-0 allocations, this makes perfect sense but for higher orders it can
> have unintended side-effects. If the zone sizes are imbalanced, kswapd may
> reclaim heavily within a smaller zone discarding an excessive number of
> pages. The user-visible behaviour is that kswapd is awake and reclaiming
> even though plenty of pages are free from a suitable zone.
> 
> This patch alters the "balance" logic for high-order reclaim allowing kswapd
> to stop if any suitable zone becomes balanced to reduce the number of pages
> it reclaims from other zones. kswapd still tries to ensure that order-0
> watermarks for all zones are met before sleeping.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely()
  2010-12-09 11:18 ` [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely() Mel Gorman
@ 2010-12-10  1:16   ` Minchan Kim
  2010-12-10 10:36     ` Mel Gorman
  2010-12-10  1:18   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 29+ messages in thread
From: Minchan Kim @ 2010-12-10  1:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu, Dec 9, 2010 at 8:18 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
> there was a race pushing a zone below its watermark. If the race
> happened, it stays awake. However, balance_pgdat() can decide to reclaim
> at a lower order if it decides that high-order reclaim is not working as

Could you specify "order-0" explicitly instead of "a lower order"?
It would make it clearer to me.

> expected. This information is not passed back to sleeping_prematurely().
> The impact is that kswapd remains awake reclaiming pages long after it
> should have gone to sleep. This patch passes the adjusted order to
> sleeping_prematurely and uses the same logic as balance_pgdat to decide
> if it's ok to go to sleep.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

A comment below.

> ---
>  mm/vmscan.c |   14 ++++++++++----
>  1 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b4472a1..52e229e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2132,7 +2132,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
>  }
>
>  /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> +static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>  {
>        int i;
>        unsigned long balanced = 0;
> @@ -2142,7 +2142,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>        if (remaining)
>                return 1;
>
> -       /* If after HZ/10, a zone is below the high mark, it's premature */
> +       /* Check the watermark levels */
>        for (i = 0; i < pgdat->nr_zones; i++) {
>                struct zone *zone = pgdat->node_zones + i;
>
> @@ -2427,7 +2427,13 @@ out:
>                }
>        }
>
> -       return sc.nr_reclaimed;
> +       /*
> +        * Return the order we were reclaiming at so sleeping_prematurely()
> +        * makes a decision on the order we were last reclaiming at. However,
> +        * if another caller entered the allocator slow path while kswapd
> +        * was awake, order will remain at the higher level
> +        */
> +       return order;
>  }

Please change the return value description of balance_pgdat, which still says:
"Returns the number of pages which were actually freed"


-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced
  2010-12-09 11:18 ` [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node " Mel Gorman
  2010-12-09 15:42   ` Minchan Kim
@ 2010-12-10  1:16   ` KAMEZAWA Hiroyuki
  2010-12-10 10:25     ` Mel Gorman
  2010-12-13 17:00   ` Eric B Munson
  2 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-10  1:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu,  9 Dec 2010 11:18:16 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> When reclaiming for high-orders, kswapd is responsible for balancing a
> node but it should not reclaim excessively. It avoids excessive reclaim by
> considering if any zone in a node is balanced then the node is balanced. In
> the cases where there are imbalanced zone sizes (e.g. ZONE_DMA with both
> ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep prematurely as just
> one small zone was balanced.
> 
> This alters the sleep logic of kswapd slightly. It counts the number of pages
> that make up the balanced zones. If the total number of balanced pages is
> more than a quarter of the zone, kswapd will go back to sleep. This should
> keep a node balanced without reclaiming an excessive number of pages.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Hmm, does this work well on, for example, x86-32 with:
	DMA: 16MB
	NORMAL: 700MB
	HIGHMEM: 11G
?

At first glance, it's balanced when HIGHMEM has enough free pages...
This is not good for NICs which request high-order allocations.

Can't we take classzone_idx into account in the check rather than
node->present_pages?

as
	balanced > present_pages_below_classzone_idx(node, classzone_idx)/4

?
Thanks,
-Kame
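
[Editorial sketch: the concern above can be made concrete. The zone
sizes below are hypothetical x86-32-like values in 4K pages, and both
helpers are simplified stand-ins, not kernel code.]

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up zone sizes in pages: DMA 16MB, NORMAL 700MB, HIGHMEM 11GB */
enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };
static const unsigned long present[NR_ZONES] = { 4096, 179200, 2883584 };

/* The patch's node-wide check: balanced pages > 25% of the node */
static bool pgdat_balanced(unsigned long balanced, unsigned long node_present)
{
	return balanced > node_present / 4;
}

/* Kame's suggestion: only count pages in zones usable by the classzone */
static unsigned long present_below_classzone(int classzone_idx)
{
	unsigned long sum = 0;

	for (int i = 0; i <= classzone_idx; i++)
		sum += present[i];
	return sum;
}
```

On this layout HIGHMEM alone is ~94% of the node, so the node-wide check
passes when only HIGHMEM is balanced, while the classzone-limited
denominator correctly fails for a GFP_KERNEL (classzone NORMAL) request.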

> ---
>  mm/vmscan.c |   43 ++++++++++++++++++++++++++++++++++---------
>  1 files changed, 34 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 25cb373..b4472a1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2117,10 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  }
>  #endif
>  
> +/*
> + * pgdat_balanced is used when checking if a node is balanced for high-order
> + * allocations. Only zones that meet watermarks make up "balanced".
> + * The total of balanced pages must be at least 25% of the node for the
> + * node to be considered balanced. Forcing all zones to be balanced for high
> + * orders can cause excessive reclaim when there are imbalanced zones.
> + * Similarly, we do not want kswapd to go to sleep because ZONE_DMA happens
> + * to be balanced when ZONE_DMA32 is huge in comparison and unbalanced
> + */
> +static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
> +{
> +	return balanced > pgdat->node_present_pages / 4;
> +}
> +
>  /* is kswapd sleeping prematurely? */
>  static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>  {
>  	int i;
> +	unsigned long balanced = 0;
> +	bool all_zones_ok = true;
>  
>  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>  	if (remaining)
> @@ -2138,10 +2154,19 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>  
>  		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
>  								0, 0))
> -			return 1;
> +			all_zones_ok = false;
> +		else
> +			balanced += zone->present_pages;
>  	}
>  
> -	return 0;
> +	/*
> +	 * For high-order requests, any zone meeting the watermark allows
> +	 * kswapd to sleep. For order-0, all zones must be balanced
> +	 */
> +	if (order)
> +		return pgdat_balanced(pgdat, balanced);
> +	else
> +		return !all_zones_ok;
>  }
>  
>  /*
> @@ -2169,7 +2194,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
>  							int classzone_idx)
>  {
>  	int all_zones_ok;
> -	int any_zone_ok;
> +	unsigned long balanced;
>  	int priority;
>  	int i;
>  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> @@ -2203,7 +2228,7 @@ loop_again:
>  			disable_swap_token();
>  
>  		all_zones_ok = 1;
> -		any_zone_ok = 0;
> +		balanced = 0;
>  
>  		/*
>  		 * Scan in the highmem->dma direction for the highest
> @@ -2314,11 +2339,11 @@ loop_again:
>  				 */
>  				zone_clear_flag(zone, ZONE_CONGESTED);
>  				if (i <= classzone_idx)
> -					any_zone_ok = 1;
> +					balanced += zone->present_pages;
>  			}
>  
>  		}
> -		if (all_zones_ok || (order && any_zone_ok))
> +		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))
>  			break;		/* kswapd: all done */
>  		/*
>  		 * OK, kswapd is getting into trouble.  Take a nap, then take
> @@ -2344,10 +2369,10 @@ out:
>  
>  	/*
>  	 * order-0: All zones must meet high watermark for a balanced node
> -	 * high-order: Any zone below pgdats classzone_idx must meet the high
> -	 *             watermark for a balanced node
> +	 * high-order: Balanced zones must make up at least 25% of the node
> +	 *             for the node to be balanced
>  	 */
> -	if (!(all_zones_ok || (order && any_zone_ok))) {
> +	if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))) {
>  		cond_resched();
>  
>  		try_to_freeze();
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely()
  2010-12-09 11:18 ` [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely() Mel Gorman
  2010-12-10  1:16   ` Minchan Kim
@ 2010-12-10  1:18   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-10  1:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu,  9 Dec 2010 11:18:17 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
> there was a race pushing a zone below its watermark. If the race
> happened, it stays awake. However, balance_pgdat() can decide to reclaim
> at a lower order if it decides that high-order reclaim is not working as
> expected. This information is not passed back to sleeping_prematurely().
> The impact is that kswapd remains awake reclaiming pages long after it
> should have gone to sleep. This patch passes the adjusted order to
> sleeping_prematurely and uses the same logic as balance_pgdat to decide
> if it's ok to go to sleep.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading
  2010-12-09 11:18 ` [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading Mel Gorman
  2010-12-09 15:59   ` Minchan Kim
@ 2010-12-10  1:19   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-10  1:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu,  9 Dec 2010 11:18:18 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> When kswapd wakes up, it reads its order and classzone from pgdat and
> calls balance_pgdat. While it's awake, it potentially reclaims at a high
> order and a low classzone index. This might have been a once-off that
> was not required by subsequent callers. However, because the pgdat
> values were not reset, they remain artificially high while
> balance_pgdat() is running and potentially kswapd enters a second
> unnecessary reclaim cycle. Reset the pgdat order and classzone index
> after reading.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] mm: kswapd: Treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat()
  2010-12-09 11:18 ` [PATCH 5/6] mm: kswapd: Treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() Mel Gorman
@ 2010-12-10  1:23   ` KAMEZAWA Hiroyuki
  2010-12-10 10:55     ` Mel Gorman
  0 siblings, 1 reply; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-10  1:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu,  9 Dec 2010 11:18:19 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
> be balanced but sleeping_prematurely does not. This can force kswapd to
> stay awake longer than it should. This patch fixes it.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Hmm, maybe the logic works well, but I don't like it very much.

How about adding the below instead of using pgdat->node_present_pages?

static unsigned long required_balanced_pages(pgdat, classzone_idx)
{
	unsigned long present = 0;

	for_each_zone_in_node(zone, pgdat) {
		if (zone->all_unreclaimable) /* Ignore unreclaimable zone at checking balance */
			continue;
		if (zone_idx(zone) > classzone_idx)
			continue;
		present += zone->present_pages;
	}
	return present;
}

> ---
>  mm/vmscan.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bc233d8..d7b0a3c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2149,8 +2149,16 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>  		if (!populated_zone(zone))
>  			continue;
>  
> -		if (zone->all_unreclaimable)
> +		/*
> +		 * balance_pgdat() skips over all_unreclaimable after
> +		 * DEF_PRIORITY. Effectively, it considers them balanced so
> +		 * they must be considered balanced here as well if kswapd
> +		 * is to sleep
> +		 */
> +		if (zone->all_unreclaimable) {
> +			balanced += zone->present_pages;
>  			continue;
> +		}
>  
>  		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
>  								0, 0))
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely()
  2010-12-09 11:18 ` [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely() Mel Gorman
@ 2010-12-10  2:38   ` Minchan Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Minchan Kim @ 2010-12-10  2:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Thu, Dec 9, 2010 at 8:18 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> When kswapd is woken up for a high-order allocation, it takes account of
> the highest usable zone by the caller (the classzone idx). During
> allocation, this index is used to select the lowmem_reserve[] that
> should be applied to the watermark calculation in zone_watermark_ok().
>
> When balancing a node, kswapd considers the highest unbalanced zone to be the
> classzone index. This will always be at least the caller's classzone_idx
> and can be higher. However, sleeping_prematurely() always considers the
> lowest zone (e.g. ZONE_DMA) to be the classzone index. This means that
> sleeping_prematurely() can consider a zone to be balanced that is unusable
> by the allocation request that originally woke kswapd. This patch changes
> sleeping_prematurely() to use a classzone_idx matching the value it used
> in balance_pgdat().
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Nice catch! And it does make sense to me.

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced
  2010-12-09 15:42   ` Minchan Kim
@ 2010-12-10 10:19     ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-10 10:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 10, 2010 at 12:42:58AM +0900, Minchan Kim wrote:
> On Thu, Dec 09, 2010 at 11:18:16AM +0000, Mel Gorman wrote:
> > When reclaiming for high-orders, kswapd is responsible for balancing a
> > node but it should not reclaim excessively. It avoids excessive reclaim by
> > considering the node balanced if any zone in it is balanced. In
> > the cases where there are imbalanced zone sizes (e.g. ZONE_DMA with both
> > ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep prematurely as just
> > one small zone was balanced.
> > 
> > This alters the sleep logic of kswapd slightly. It counts the number of pages
> > that make up the balanced zones. If the total number of balanced pages is
> > more than a quarter of the node, kswapd will go back to sleep. This should
> > keep a node balanced without reclaiming an excessive number of pages.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> Just trivial comments below.
> 
> > ---
> >  mm/vmscan.c |   43 ++++++++++++++++++++++++++++++++++---------
> >  1 files changed, 34 insertions(+), 9 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 25cb373..b4472a1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2117,10 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >  }
> >  #endif
> >  
> > +/*
> > + * pgdat_balanced is used when checking if a node is balanced for high-order
> > + * allocations. Only zones that meet watermarks make up "balanced".
> > + * The total of balanced pages must be at least 25% of the node for the
> 
> Could you write down why you selected 25% in the description?
> It could help someone who tries to change that watermark in the future.
> 

Sure, how about;

+/*
+ * pgdat_balanced is used when checking if a node is balanced for high-order
+ * allocations. Only zones that meet watermarks make up "balanced".
+ * The total of balanced pages must be at least 25% of the node for the
+ * node to be considered balanced. Forcing all zones to be balanced for high
+ * orders can cause excessive reclaim when there are imbalanced zones.
+ * The choice of 25% is due to
+ *   o a 16M DMA zone that is balanced will not balance a zone on any
+ *     reasonable sized machine
+ *   o On all other machines, the top zone must be at least a reasonable
+ *     percentage of the middle zones. For example, on 32-bit x86, highmem
+ *     would need to be at least 256M for it to balance a whole node.
+ *     Similarly, on x86-64 the Normal zone would need to be at least 1G
+ *     to balance a node on its own. These seemed like reasonable ratios.
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
+{
+       return balanced > pgdat->node_present_pages / 4;
+}

?
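As a hedged illustration of the arithmetic behind that comment (a userspace sketch, not kernel code; the zone and node sizes below are assumptions for the example):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the proposed pgdat_balanced() check: the node is
 * considered balanced for high-order reclaim once the zones meeting
 * their watermarks hold more than 25% of the node's present pages.
 */
static bool pgdat_balanced_model(unsigned long balanced_pages,
				 unsigned long node_present_pages)
{
	return balanced_pages > node_present_pages / 4;
}
```

With 4K pages, a balanced 16M DMA zone contributes 4096 pages; on a 4G node (1048576 pages) that is far short of the 262144-page threshold, so kswapd keeps working, matching the first bullet in the comment.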

> > + * node to be considered balanced. Forcing all zones to be balanced for high
> > + * orders can cause excessive reclaim when there are imbalanced zones.
> > + * Similarly, we do not want kswapd to go to sleep because ZONE_DMA happens
> > + * to be balanced when ZONE_DMA32 is huge in comparison and unbalanced
> > + */
> > +static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
> 
> I think balanced_pages is clearer.
> But it's a preference.
> 

Updated.

> > +{
> > +	return balanced > pgdat->node_present_pages / 4;
> > +}
> > +
> >  /* is kswapd sleeping prematurely? */
> >  static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >  {
> >  	int i;
> > +	unsigned long balanced = 0;
> > +	bool all_zones_ok = true;
> >  
> >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >  	if (remaining)
> > @@ -2138,10 +2154,19 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >  
> >  		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> >  								0, 0))
> > -			return 1;
> > +			all_zones_ok = false;
> > +		else
> > +			balanced += zone->present_pages;
> >  	}
> >  
> > -	return 0;
> > +	/*
> > +	 * For high-order requests, any zone meeting the watermark allows
> > +	 * kswapd to sleep. For order-0, all zones must be balanced
> 
> The comment isn't exact.
> could allow?
> 

Correct;

+       /*
+        * For high-order requests, the balanced zones must contain at least
+        * 25% of the node's pages for kswapd to sleep. For order-0, all zones
+        * must be balanced
+        */
+       if (order)
+               return pgdat_balanced(pgdat, balanced);
+       else
+               return !all_zones_ok;

Thanks

> > +	 */
> > +	if (order)
> > +		return pgdat_balanced(pgdat, balanced);
> > +	else
> > +		return !all_zones_ok;
> >  }
> >  
> >  /*
> > @@ -2169,7 +2194,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
> >  							int classzone_idx)
> >  {
> >  	int all_zones_ok;
> > -	int any_zone_ok;
> > +	unsigned long balanced;
> >  	int priority;
> >  	int i;
> >  	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
> > @@ -2203,7 +2228,7 @@ loop_again:
> >  			disable_swap_token();
> >  
> >  		all_zones_ok = 1;
> > -		any_zone_ok = 0;
> > +		balanced = 0;
> >  
> >  		/*
> >  		 * Scan in the highmem->dma direction for the highest
> > @@ -2314,11 +2339,11 @@ loop_again:
> >  				 */
> >  				zone_clear_flag(zone, ZONE_CONGESTED);
> >  				if (i <= classzone_idx)
> > -					any_zone_ok = 1;
> > +					balanced += zone->present_pages;
> >  			}
> >  
> >  		}
> > -		if (all_zones_ok || (order && any_zone_ok))
> > +		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))
> >  			break;		/* kswapd: all done */
> >  		/*
> >  		 * OK, kswapd is getting into trouble.  Take a nap, then take
> > @@ -2344,10 +2369,10 @@ out:
> >  
> >  	/*
> >  	 * order-0: All zones must meet high watermark for a balanced node
> > -	 * high-order: Any zone below pgdats classzone_idx must meet the high
> > -	 *             watermark for a balanced node
> > +	 * high-order: Balanced zones must make up at least 25% of the node
> > +	 *             for the node to be balanced
> >  	 */
> > -	if (!(all_zones_ok || (order && any_zone_ok))) {
> > +	if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))) {
> >  		cond_resched();
> >  
> >  		try_to_freeze();

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced
  2010-12-10  1:16   ` KAMEZAWA Hiroyuki
@ 2010-12-10 10:25     ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-10 10:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 10, 2010 at 10:16:49AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu,  9 Dec 2010 11:18:16 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > When reclaiming for high-orders, kswapd is responsible for balancing a
> > node but it should not reclaim excessively. It avoids excessive reclaim by
> > considering the node balanced if any zone in it is balanced. In
> > the cases where there are imbalanced zone sizes (e.g. ZONE_DMA with both
> > ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep prematurely as just
> > one small zone was balanced.
> > 
> > This alters the sleep logic of kswapd slightly. It counts the number of pages
> > that make up the balanced zones. If the total number of balanced pages is
> > more than a quarter of the node, kswapd will go back to sleep. This should
> > keep a node balanced without reclaiming an excessive number of pages.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Hmm, does this work well in
> 
> for example, x86-32,
> 	DMA: 16MB
> 	NORMAL: 700MB
> 	HIGHMEM: 11G
> ?
> 
> At first look, it's balanced when HIGHMEM has enough free pages...
> This is not good for NICs which request high-order allocations.
> 

Good question.

In this case, the classzone_idx for the NICs high-order allocation will
be the Normal zone. In balance_pgdat(), this check is made

                                if (i <= classzone_idx)
                                        balanced += zone->present_pages;

Highmem will be too high and so the pages will not be counted and the node
will not be balanced.
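The effect of that `i <= classzone_idx` check can be sketched in userspace (zone sizes taken from the example above, converted to 4K pages; the helper below is illustrative, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

/*
 * Illustrative zone sizes in 4K pages, matching the example:
 * 16MB DMA, 700MB Normal, 11GB Highmem.
 */
static const unsigned long present[NR_ZONES] = { 4096, 179200, 2883584 };

/*
 * Model of the accounting in balance_pgdat(): only zones at or below
 * the classzone index contribute to the balanced-page count.
 */
static unsigned long balanced_pages(const bool zone_ok[NR_ZONES],
				    int classzone_idx)
{
	unsigned long balanced = 0;
	int i;

	for (i = 0; i < NR_ZONES; i++)
		if (zone_ok[i] && i <= classzone_idx)
			balanced += present[i];
	return balanced;
}
```

With classzone_idx at ZONE_NORMAL, a balanced highmem zone contributes nothing, so a huge free highmem alone cannot put kswapd to sleep for the NIC's allocation.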

> Can't we take claszone_idx into account at checking rather than
> node->present_pages ?
> 
> as
> 	balanced > present_pages_below_classzone_idx(node, classzone_idx)/4
> 
> ?

We can, but not for the reasons you list above. With a heavily
imbalanced highmem zone like this, the node might never be considered
balanced, as the sum of DMA and Normal is less than 25% of the node.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely()
  2010-12-10  1:16   ` Minchan Kim
@ 2010-12-10 10:36     ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-10 10:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 10, 2010 at 10:16:19AM +0900, Minchan Kim wrote:
> On Thu, Dec 9, 2010 at 8:18 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
> > there was a race pushing a zone below its watermark. If the race
> > happened, it stays awake. However, balance_pgdat() can decide to reclaim
> > at a lower order if it decides that high-order reclaim is not working as
> 
> Could you specify "order-0" explicitly instead of "a lower order"?
> It makes it clearer to me.
> 

Done.

> > expected. This information is not passed back to sleeping_prematurely().
> > The impact is that kswapd remains awake reclaiming pages long after it
> > should have gone to sleep. This patch passes the adjusted order to
> > sleeping_prematurely and uses the same logic as balance_pgdat to decide
> > if it's ok to go to sleep.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> A comment below.
> 
> > ---
> >  mm/vmscan.c |   14 ++++++++++----
> >  1 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index b4472a1..52e229e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2132,7 +2132,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
> >  }
> >
> >  /* is kswapd sleeping prematurely? */
> > -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> > +static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >  {
> >        int i;
> >        unsigned long balanced = 0;
> > @@ -2142,7 +2142,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >        if (remaining)
> >                return 1;
> >
> > -       /* If after HZ/10, a zone is below the high mark, it's premature */
> > +       /* Check the watermark levels */
> >        for (i = 0; i < pgdat->nr_zones; i++) {
> >                struct zone *zone = pgdat->node_zones + i;
> >
> > @@ -2427,7 +2427,13 @@ out:
> >                }
> >        }
> >
> > -       return sc.nr_reclaimed;
> > +       /*
> > +        * Return the order we were reclaiming at so sleeping_prematurely()
> > +        * makes a decision on the order we were last reclaiming at. However,
> > +        * if another caller entered the allocator slow path while kswapd
> > +        * was awake, order will remain at the higher level
> > +        */
> > +       return order;
> >  }
> 
> Please change the return value description of balance_pgdat.
> "Returns the number of pages which were actually freed"
> 

Oops, done. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading
  2010-12-09 16:03     ` Minchan Kim
@ 2010-12-10 10:42       ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2010-12-10 10:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 10, 2010 at 01:03:01AM +0900, Minchan Kim wrote:
> On Fri, Dec 10, 2010 at 12:59 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> > On Thu, Dec 09, 2010 at 11:18:18AM +0000, Mel Gorman wrote:
> >> When kswapd wakes up, it reads its order and classzone from pgdat and
> >> calls balance_pgdat. While it's awake, it potentially reclaims at a high
> >> order and a low classzone index. This might have been a once-off that
> >> was not required by subsequent callers. However, because the pgdat
> >> values were not reset, they remain artificially high while
> >> balance_pgdat() is running and potentially kswapd enters a second
> >> unnecessary reclaim cycle. Reset the pgdat order and classzone index
> >> after reading.
> >>
> >> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> Nitpick.
> Shouldn't this patch be merged with 1/6?
> 

I don't think so as it's a standalone fix. For example, if this was
merged on its own, the "order" should still be reset after reading.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] mm: kswapd: Treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat()
  2010-12-10  1:23   ` KAMEZAWA Hiroyuki
@ 2010-12-10 10:55     ` Mel Gorman
  2010-12-12 23:58       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2010-12-10 10:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Fri, Dec 10, 2010 at 10:23:37AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu,  9 Dec 2010 11:18:19 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
> > be balanced but sleeping_prematurely does not. This can force kswapd to
> > stay awake longer than it should. This patch fixes it.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Hmm, maybe the logic works well, but I don't like it very much.
> 
> How about adding the below instead of using pgdat->node_present_pages?
> 
> static unsigned long required_balanced_pages(pgdat, classzone_idx)
> {
> 	unsigned long present = 0;
> 
> 	for_each_zone_in_node(zone, pgdat) {
> 		if (zone->all_unreclaimable) /* Ignore unreclaimable zone at checking balance */
> 			continue;
> 		if (zone_idx(zone) > classzone_idx)
> 			continue;
> 		present += zone->present_pages;
> 	}
> 	return present;
> }
> 

I'm afraid I do not really understand. After your earlier comments,
pgdat_balanced() now looks like

static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
                                                int classzone_idx)
{
        unsigned long present_pages = 0;
        int i;

        for (i = 0; i <= classzone_idx; i++)
                present_pages += pgdat->node_zones[i].present_pages;

        return balanced_pages > (present_pages >> 2);
}

so the classzone is being taken into account. I'm not sure what you're
asking for it to be changed to. Maybe it'll be clearer after V4 comes
out rebased on top of mmotm.
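For reference, the function quoted above compiles and behaves as described in a userspace harness (the struct definitions are mocked stand-ins for the kernel's, and the zone sizes are assumptions for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES 3

/* Minimal mock of the kernel structures, just enough for the check. */
struct zone { unsigned long present_pages; };
struct pgdat { struct zone node_zones[MAX_NR_ZONES]; };

/*
 * Mirror of the pgdat_balanced() shown above: sum the present pages of
 * zones up to and including classzone_idx and require the balanced
 * pages to exceed a quarter of that total.
 */
static bool pgdat_balanced(struct pgdat *pgdat, unsigned long balanced_pages,
			   int classzone_idx)
{
	unsigned long present_pages = 0;
	int i;

	for (i = 0; i <= classzone_idx; i++)
		present_pages += pgdat->node_zones[i].present_pages;

	return balanced_pages > (present_pages >> 2);
}
```

With zones of 4096, 179200, and 2883584 pages and classzone_idx 1, the threshold is (4096 + 179200) >> 2 = 45824 pages, so a balanced Normal zone alone is enough while highmem is ignored.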


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely()
  2010-12-10 15:46 [PATCH 0/6] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V4 Mel Gorman
@ 2010-12-10 15:46 ` Mel Gorman
  2010-12-13 19:43   ` Eric B Munson
  0 siblings, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2010-12-10 15:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, linux-mm, linux-kernel, Mel Gorman

When kswapd is woken up for a high-order allocation, it takes account of
the highest usable zone by the caller (the classzone idx). During
allocation, this index is used to select the lowmem_reserve[] that
should be applied to the watermark calculation in zone_watermark_ok().

When balancing a node, kswapd considers the highest unbalanced zone to be the
classzone index. This will always be at least the caller's classzone_idx
and can be higher. However, sleeping_prematurely() always considers the
lowest zone (e.g. ZONE_DMA) to be the classzone index. This means that
sleeping_prematurely() can consider a zone to be balanced that is unusable
by the allocation request that originally woke kswapd. This patch changes
sleeping_prematurely() to use a classzone_idx matching the value it used
in balance_pgdat().

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/vmscan.c |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5995121..cf03a11 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2220,7 +2220,8 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
 }
 
 /* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
+					int classzone_idx)
 {
 	int i;
 	unsigned long balanced = 0;
@@ -2228,7 +2229,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
-		return 1;
+		return true;
 
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
@@ -2249,7 +2250,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 		}
 
 		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-								0, 0))
+							classzone_idx, 0))
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
@@ -2261,7 +2262,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, 0);
+		return pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }
@@ -2288,7 +2289,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
  * of pages is balanced across the zones.
  */
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
-							int classzone_idx)
+							int *classzone_idx)
 {
 	int all_zones_ok;
 	unsigned long balanced;
@@ -2351,6 +2352,7 @@ loop_again:
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
+				*classzone_idx = i;
 				break;
 			}
 		}
@@ -2444,12 +2446,12 @@ loop_again:
 				 * spectulatively avoid congestion waits
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
-				if (i <= classzone_idx)
+				if (i <= *classzone_idx)
 					balanced += zone->present_pages;
 			}
 
 		}
-		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, classzone_idx)))
+		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
 			break;		/* kswapd: all done */
 		/*
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2478,7 +2480,7 @@ out:
 	 * high-order: Balanced zones must make up at least 25% of the node
 	 *             for the node to be balanced
 	 */
-	if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, classzone_idx)))) {
+	if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) {
 		cond_resched();
 
 		try_to_freeze();
@@ -2539,10 +2541,11 @@ out:
 	 * if another caller entered the allocator slow path while kswapd
 	 * was awake, order will remain at the higher level
 	 */
+	*classzone_idx = end_zone;
 	return order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
@@ -2553,7 +2556,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (!sleeping_prematurely(pgdat, order, remaining)) {
+	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -2563,7 +2566,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (!sleeping_prematurely(pgdat, order, remaining)) {
+	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -2651,7 +2654,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, order);
+			kswapd_try_to_sleep(pgdat, order, classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
@@ -2668,7 +2671,7 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			order = balance_pgdat(pgdat, order, classzone_idx);
+			order = balance_pgdat(pgdat, order, &classzone_idx);
 		}
 	}
 	return 0;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/6] mm: kswapd: Treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat()
  2010-12-10 10:55     ` Mel Gorman
@ 2010-12-12 23:58       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 29+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-12 23:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

On Fri, 10 Dec 2010 10:55:32 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Fri, Dec 10, 2010 at 10:23:37AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu,  9 Dec 2010 11:18:19 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
> > > be balanced but sleeping_prematurely does not. This can force kswapd to
> > > stay awake longer than it should. This patch fixes it.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > Hmm, maybe the logic works well, but I don't like it very much.
> > 
> > How about adding the below instead of using pgdat->node_present_pages?
> > 
> > static unsigned long required_balanced_pages(pgdat, classzone_idx)
> > {
> > 	unsigned long present = 0;
> > 
> > 	for_each_zone_in_node(zone, pgdat) {
> > 		if (zone->all_unreclaimable) /* Ignore unreclaimable zone at checking balance */
> > 			continue;
> > 		if (zone_idx(zone) > classzone_idx)
> > 			continue;
> > 		present += zone->present_pages;
> > 	}
> > 	return present;
> > }
> > 
> 
> I'm afraid I do not really understand. After your earlier comments,
> pgdat_balanced() now looks like
> 
> static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
>                                                 int classzone_idx)
> {
>         unsigned long present_pages = 0;
>         int i;
> 
>         for (i = 0; i <= classzone_idx; i++)
>                 present_pages += pgdat->node_zones[i].present_pages;
> 
>         return balanced_pages > (present_pages >> 2);
> }
> 
> so the classzone is being taken into account. I'm not sure what you're
> asking for it to be changed to. Maybe it'll be clearer after V4 comes
> out rebased on top of mmotm.
> 

Ah, this seems okay to me.

Thank you.
-Kame




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-09 11:18 ` [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
  2010-12-09 15:21   ` Minchan Kim
  2010-12-10  1:06   ` KAMEZAWA Hiroyuki
@ 2010-12-13 16:54   ` Eric B Munson
  2 siblings, 0 replies; 29+ messages in thread
From: Eric B Munson @ 2010-12-13 16:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 980 bytes --]

On Thu, 09 Dec 2010, Mel Gorman wrote:

> When the allocator enters its slow path, kswapd is woken up to balance the
> node. It continues working until all zones within the node are balanced. For
> order-0 allocations, this makes perfect sense but for higher orders it can
> have unintended side-effects. If the zone sizes are imbalanced, kswapd may
> reclaim heavily within a smaller zone discarding an excessive number of
> pages. The user-visible behaviour is that kswapd is awake and reclaiming
> even though plenty of pages are free from a suitable zone.
> 
> This patch alters the "balance" logic for high-order reclaim, allowing kswapd
> to stop once any suitable zone becomes balanced; this reduces the number of
> pages it reclaims from other zones. kswapd still tries to ensure that order-0
> watermarks for all zones are met before sleeping.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Looks good to me.
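A toy model of the behaviour described in the changelog may make the asymmetry explicit. This is not the kernel code: the struct, the watermark comparison, and the numbers in the usage note are all simplified stand-ins (the real check goes through zone_watermark_ok() with lowmem_reserve applied).

```c
#include <stdbool.h>

struct zone_model {
	unsigned long free_pages;	/* pages currently free in the zone */
	unsigned long high_wmark;	/* "high" watermark the zone must reach */
};

static bool zone_is_balanced(const struct zone_model *z)
{
	return z->free_pages >= z->high_wmark;
}

/* Model of the patched balance logic: for order-0, every zone up to
 * classzone_idx must be balanced; for high-order reclaim, kswapd may
 * stop once ANY suitable zone is balanced. */
static bool kswapd_may_stop(const struct zone_model *zones,
			    int classzone_idx, int order)
{
	bool any = false, all = true;
	int i;

	for (i = 0; i <= classzone_idx; i++) {
		if (zone_is_balanced(&zones[i]))
			any = true;
		else
			all = false;
	}
	return order > 0 ? any : all;
}
```

For example, with one balanced zone and one unbalanced zone, a high-order wakeup may stop reclaiming while an order-0 wakeup keeps working.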

Reviewed-by: Eric B Munson <emunson@mgebm.net>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]


* Re: [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced
  2010-12-09 11:18 ` [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node " Mel Gorman
  2010-12-09 15:42   ` Minchan Kim
  2010-12-10  1:16   ` KAMEZAWA Hiroyuki
@ 2010-12-13 17:00   ` Eric B Munson
  2 siblings, 0 replies; 29+ messages in thread
From: Eric B Munson @ 2010-12-13 17:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Shaohua Li, Dave Hansen,
	Johannes Weiner, Andrew Morton, linux-mm, linux-kernel


On Thu, 09 Dec 2010, Mel Gorman wrote:

> When reclaiming for high orders, kswapd is responsible for balancing a
> node, but it should not reclaim excessively. It avoids excessive reclaim by
> treating the node as balanced once any zone in it is balanced. In cases
> where zone sizes are imbalanced (e.g. ZONE_DMA alongside both ZONE_DMA32
> and ZONE_NORMAL), kswapd can go to sleep prematurely because just one
> small zone is balanced.
> 
> This alters the sleep logic of kswapd slightly. It counts the number of pages
> that make up the balanced zones. If the total number of balanced pages is
> more than a quarter of the node's pages, kswapd will go back to sleep. This
> should keep a node balanced without reclaiming an excessive number of pages.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

With Minchan's requests this looks good to me.

Reviewed-by: Eric B Munson <emunson@mgebm.net>



* Re: [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely()
  2010-12-10 15:46 ` [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely() Mel Gorman
@ 2010-12-13 19:43   ` Eric B Munson
  0 siblings, 0 replies; 29+ messages in thread
From: Eric B Munson @ 2010-12-13 19:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Simon Kirby, KOSAKI Motohiro, Shaohua Li,
	Dave Hansen, Johannes Weiner, linux-mm, linux-kernel


On Fri, 10 Dec 2010, Mel Gorman wrote:

> When kswapd is woken up for a high-order allocation, it takes account of
> the highest zone usable by the caller (the classzone idx). During
> allocation, this index is used to select the lowmem_reserve[] that
> should be applied to the watermark calculation in zone_watermark_ok().
> 
> When balancing a node, kswapd considers the highest unbalanced zone to be the
> classzone index. This will always be at least the caller's classzone_idx
> and can be higher. However, sleeping_prematurely() always considers the
> lowest zone (e.g. ZONE_DMA) to be the classzone index. This means that
> sleeping_prematurely() can consider a zone to be balanced that is unusable
> by the allocation request that originally woke kswapd. This patch changes
> sleeping_prematurely() to use a classzone_idx matching the value it used
> in balance_pgdat().
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Reviewed-by: Eric B Munson <emunson@mgebm.net>
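The mismatch this patch fixes can be illustrated with a simplified balance check. This is a sketch, not the kernel code: the struct and the zone numbers below are invented, standing in for the real watermark test.

```c
#include <stdbool.h>

struct zone_model {
	unsigned long free_pages;	/* pages currently free in the zone */
	unsigned long high_wmark;	/* watermark the zone must reach */
};

/* Balance verdict restricted to the zones the caller can actually use
 * (indices 0..classzone_idx). Checking with classzone_idx == 0, as the
 * unpatched sleeping_prematurely() effectively did, only looks at the
 * lowest zone. */
static bool balanced_up_to(const struct zone_model *zones, int classzone_idx)
{
	int i;

	for (i = 0; i <= classzone_idx; i++)
		if (zones[i].free_pages < zones[i].high_wmark)
			return false;
	return true;
}
```

With a balanced ZONE_DMA but an unbalanced higher zone, the index-0 check reports "balanced" while the check at the caller's classzone correctly does not, so kswapd could sleep even though the waking allocation remained unsatisfiable.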



end of thread, other threads:[~2010-12-13 19:43 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-12-09 11:18 [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
2010-12-09 11:18 ` [PATCH 1/6] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
2010-12-09 15:21   ` Minchan Kim
2010-12-10  1:06   ` KAMEZAWA Hiroyuki
2010-12-13 16:54   ` Eric B Munson
2010-12-09 11:18 ` [PATCH 2/6] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node " Mel Gorman
2010-12-09 15:42   ` Minchan Kim
2010-12-10 10:19     ` Mel Gorman
2010-12-10  1:16   ` KAMEZAWA Hiroyuki
2010-12-10 10:25     ` Mel Gorman
2010-12-13 17:00   ` Eric B Munson
2010-12-09 11:18 ` [PATCH 3/6] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely() Mel Gorman
2010-12-10  1:16   ` Minchan Kim
2010-12-10 10:36     ` Mel Gorman
2010-12-10  1:18   ` KAMEZAWA Hiroyuki
2010-12-09 11:18 ` [PATCH 4/6] mm: kswapd: Reset kswapd_max_order and classzone_idx after reading Mel Gorman
2010-12-09 15:59   ` Minchan Kim
2010-12-09 16:03     ` Minchan Kim
2010-12-10 10:42       ` Mel Gorman
2010-12-10  1:19   ` KAMEZAWA Hiroyuki
2010-12-09 11:18 ` [PATCH 5/6] mm: kswapd: Treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() Mel Gorman
2010-12-10  1:23   ` KAMEZAWA Hiroyuki
2010-12-10 10:55     ` Mel Gorman
2010-12-12 23:58       ` KAMEZAWA Hiroyuki
2010-12-09 11:18 ` [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely() Mel Gorman
2010-12-10  2:38   ` Minchan Kim
2010-12-09 11:19 ` [PATCH 0/5] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V3 Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2010-12-10 15:46 [PATCH 0/6] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations V4 Mel Gorman
2010-12-10 15:46 ` [PATCH 6/6] mm: kswapd: Use the classzone idx that kswapd was using for sleeping_prematurely() Mel Gorman
2010-12-13 19:43   ` Eric B Munson

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).