* [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
@ 2011-04-28 22:37 Ying Han
2011-04-28 22:37 ` [PATCH 1/2] Add " Ying Han
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Ying Han @ 2011-04-28 22:37 UTC (permalink / raw)
To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai
Cc: linux-mm
We recently added a change to global background reclaim that counts the
return value of soft_limit reclaim. This patch adds similar logic to
global direct reclaim.
We should skip scanning the global LRU in shrink_zone if soft_limit reclaim
does enough work. This is the first step, where we start by counting the
nr_scanned and nr_reclaimed from soft_limit reclaim into the global
scan_control.
The patch is based on mmotm-04-14, and I triggered a kernel BUG at mm/vmscan.c:1058:
[ 938.242033] kernel BUG at mm/vmscan.c:1058!
[ 938.242033] invalid opcode: 0000 [#1] SMP
[ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
[ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
[ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
[ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
[ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
[ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
[ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
[ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
[ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
[ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
[ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
[ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
[ 938.242033] Stack:
[ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
[ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
[ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
[ 938.242033] Call Trace:
[ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
[ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
[ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
[ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
[ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
[ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
[ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
[ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
[ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
[ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
[ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
Thank you Minchan for the pointer. I reverted the following commit and
haven't seen the problem with the same operations since. I haven't looked
deeply into the patch yet, but figured it would be a good idea to post the
dump. The dump does not look directly related to this patchset, but people
can use it to reproduce the problem.
commit 278df9f451dc71dcd002246be48358a473504ad0
Author: Minchan Kim <minchan.kim@gmail.com>
Date: Tue Mar 22 16:32:54 2011 -0700
mm: reclaim invalidated page ASAP
How to reproduce it, on my 32G machine:
1. Create two memcgs and set their hard_limit and soft_limit:
$echo 20g >A/memory.limit_in_bytes
$echo 20g >B/memory.limit_in_bytes
$echo 3g >A/memory.soft_limit_in_bytes
$echo 3g >B/memory.soft_limit_in_bytes
2. Read a 20g file in each container:
$echo $$ >A/tasks
$cat /export/hdc3/dd_A/tf0 > /dev/zero
$echo $$ >B/tasks
$cat /export/hdc3/dd_B/tf0 > /dev/zero
3. Add memory pressure by allocating anon memory + mlock, and trigger
global reclaim.
Ying Han (2):
Add the soft_limit reclaim in global direct reclaim.
Add stats to monitor soft_limit reclaim
Documentation/cgroups/memory.txt | 10 ++++-
mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------
mm/vmscan.c | 16 ++++++++-
3 files changed, 72 insertions(+), 22 deletions(-)
--
1.7.3.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 1/2] Add the soft_limit reclaim in global direct reclaim.
2011-04-28 22:37 [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
@ 2011-04-28 22:37 ` Ying Han
2011-04-28 23:25 ` Ying Han
` (2 more replies)
2011-04-28 22:37 ` [PATCH 2/2] Add stats to monitor soft_limit reclaim Ying Han
` (3 subsequent siblings)
4 siblings, 3 replies; 24+ messages in thread
From: Ying Han @ 2011-04-28 22:37 UTC (permalink / raw)
To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai
Cc: linux-mm
We recently added a change to global background reclaim that counts
the return value of soft_limit reclaim. This patch adds similar logic
to global direct reclaim.
We should skip scanning the global LRU in shrink_zone if soft_limit reclaim
does enough work. This is the first step, where we start by counting
the nr_scanned and nr_reclaimed from soft_limit reclaim into the global
scan_control.
Signed-off-by: Ying Han <yinghan@google.com>
---
mm/vmscan.c | 16 ++++++++++++++--
1 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b3a569f..84003cc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1959,11 +1959,14 @@ restart:
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
-static void shrink_zones(int priority, struct zonelist *zonelist,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
+ unsigned long nr_soft_reclaimed;
+ unsigned long nr_soft_scanned;
+ unsigned long total_scanned = 0;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -1980,8 +1983,17 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
continue; /* Let kswapd poll it */
}
+ nr_soft_scanned = 0;
+ nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
+ sc->order, sc->gfp_mask,
+ &nr_soft_scanned);
+ sc->nr_reclaimed += nr_soft_reclaimed;
+ total_scanned += nr_soft_scanned;
+
shrink_zone(priority, zone, sc);
}
+
+ return total_scanned;
}
static bool zone_reclaimable(struct zone *zone)
@@ -2045,7 +2057,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
sc->nr_scanned = 0;
if (!priority)
disable_swap_token();
- shrink_zones(priority, zonelist, sc);
+ total_scanned += shrink_zones(priority, zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
--
1.7.3.1
* [PATCH 2/2] Add stats to monitor soft_limit reclaim
2011-04-28 22:37 [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
2011-04-28 22:37 ` [PATCH 1/2] Add " Ying Han
@ 2011-04-28 22:37 ` Ying Han
2011-04-28 23:26 ` Ying Han
2011-04-28 23:24 ` [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
` (2 subsequent siblings)
4 siblings, 1 reply; 24+ messages in thread
From: Ying Han @ 2011-04-28 22:37 UTC (permalink / raw)
To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai
Cc: linux-mm
This patch extends the soft_limit reclaim stats to cover both global
background reclaim and global direct reclaim.
There is a thread discussing the naming of some of these stats; both
KAMEZAWA and Johannes posted proposals. The following stats are based
on what I had before that thread. I will make the corresponding changes
in the next post once we reach a decision.
$cat /dev/cgroup/memory/A/memory.stat
kswapd_soft_steal 1053626
kswapd_soft_scan 1053693
direct_soft_steal 1481810
direct_soft_scan 1481996
Signed-off-by: Ying Han <yinghan@google.com>
---
Documentation/cgroups/memory.txt | 10 ++++-
mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------
2 files changed, 58 insertions(+), 20 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 0c40dab..fedc107 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -387,8 +387,14 @@ pgpgout - # of pages paged out (equivalent to # of uncharging events).
swap - # of bytes of swap usage
pgfault - # of page faults.
pgmajfault - # of major page faults.
-soft_steal - # of pages reclaimed from global hierarchical reclaim
-soft_scan - # of pages scanned from global hierarchical reclaim
+soft_kswapd_steal- # of pages reclaimed in global hierarchical reclaim from
+ background reclaim
+soft_kswapd_scan - # of pages scanned in global hierarchical reclaim from
+ background reclaim
+soft_direct_steal- # of pages reclaimed in global hierarchical reclaim from
+ direct reclaim
+soft_direct_scan- # of pages scanned in global hierarchical reclaim from
+ direct reclaim
inactive_anon - # of bytes of anonymous memory and swap cache memory on
LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c2776f1..392ed9c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,10 +96,14 @@ enum mem_cgroup_events_index {
MEM_CGROUP_EVENTS_COUNT, /* # of pages paged in/out */
MEM_CGROUP_EVENTS_PGFAULT, /* # of page-faults */
MEM_CGROUP_EVENTS_PGMAJFAULT, /* # of major page-faults */
- MEM_CGROUP_EVENTS_SOFT_STEAL, /* # of pages reclaimed from */
- /* soft reclaim */
- MEM_CGROUP_EVENTS_SOFT_SCAN, /* # of pages scanned from */
- /* soft reclaim */
+ MEM_CGROUP_EVENTS_SOFT_KSWAPD_STEAL, /* # of pages reclaimed from */
+ /* soft reclaim in background reclaim */
+ MEM_CGROUP_EVENTS_SOFT_KSWAPD_SCAN, /* # of pages scanned from */
+ /* soft reclaim in background reclaim */
+ MEM_CGROUP_EVENTS_SOFT_DIRECT_STEAL, /* # of pages reclaimed from */
+ /* soft reclaim in direct reclaim */
+ MEM_CGROUP_EVENTS_SOFT_DIRECT_SCAN, /* # of pages scanned from */
+ /* soft reclaim in direct reclaim */
MEM_CGROUP_EVENTS_NSTATS,
};
/*
@@ -640,14 +644,30 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
preempt_enable();
}
-static void mem_cgroup_soft_steal(struct mem_cgroup *mem, int val)
+static void mem_cgroup_soft_steal(struct mem_cgroup *mem, bool is_kswapd,
+ int val)
{
- this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_SOFT_STEAL], val);
+ if (is_kswapd)
+ this_cpu_add(
+ mem->stat->events[MEM_CGROUP_EVENTS_SOFT_KSWAPD_STEAL],
+ val);
+ else
+ this_cpu_add(
+ mem->stat->events[MEM_CGROUP_EVENTS_SOFT_DIRECT_STEAL],
+ val);
}
-static void mem_cgroup_soft_scan(struct mem_cgroup *mem, int val)
+static void mem_cgroup_soft_scan(struct mem_cgroup *mem, bool is_kswapd,
+ int val)
{
- this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_SOFT_SCAN], val);
+ if (is_kswapd)
+ this_cpu_add(
+ mem->stat->events[MEM_CGROUP_EVENTS_SOFT_KSWAPD_SCAN],
+ val);
+ else
+ this_cpu_add(
+ mem->stat->events[MEM_CGROUP_EVENTS_SOFT_DIRECT_SCAN],
+ val);
}
static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
@@ -1495,6 +1515,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+ bool is_kswapd = false;
unsigned long excess;
unsigned long nr_scanned;
@@ -1504,6 +1525,9 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
if (root_mem->memsw_is_minimum)
noswap = true;
+ if (current_is_kswapd())
+ is_kswapd = true;
+
while (1) {
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem) {
@@ -1544,8 +1568,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
noswap, get_swappiness(victim), zone,
&nr_scanned);
*total_scanned += nr_scanned;
- mem_cgroup_soft_steal(victim, ret);
- mem_cgroup_soft_scan(victim, nr_scanned);
+ mem_cgroup_soft_steal(victim, is_kswapd, ret);
+ mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
noswap, get_swappiness(victim));
@@ -3840,8 +3864,10 @@ enum {
MCS_SWAP,
MCS_PGFAULT,
MCS_PGMAJFAULT,
- MCS_SOFT_STEAL,
- MCS_SOFT_SCAN,
+ MCS_SOFT_KSWAPD_STEAL,
+ MCS_SOFT_KSWAPD_SCAN,
+ MCS_SOFT_DIRECT_STEAL,
+ MCS_SOFT_DIRECT_SCAN,
MCS_INACTIVE_ANON,
MCS_ACTIVE_ANON,
MCS_INACTIVE_FILE,
@@ -3866,8 +3892,10 @@ struct {
{"swap", "total_swap"},
{"pgfault", "total_pgfault"},
{"pgmajfault", "total_pgmajfault"},
- {"soft_steal", "total_soft_steal"},
- {"soft_scan", "total_soft_scan"},
+ {"kswapd_soft_steal", "total_kswapd_soft_steal"},
+ {"kswapd_soft_scan", "total_kswapd_soft_scan"},
+ {"direct_soft_steal", "total_direct_soft_steal"},
+ {"direct_soft_scan", "total_direct_soft_scan"},
{"inactive_anon", "total_inactive_anon"},
{"active_anon", "total_active_anon"},
{"inactive_file", "total_inactive_file"},
@@ -3896,10 +3924,14 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
s->stat[MCS_SWAP] += val * PAGE_SIZE;
}
- val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_STEAL);
- s->stat[MCS_SOFT_STEAL] += val;
- val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_SCAN);
- s->stat[MCS_SOFT_SCAN] += val;
+ val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_KSWAPD_STEAL);
+ s->stat[MCS_SOFT_KSWAPD_STEAL] += val;
+ val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_KSWAPD_SCAN);
+ s->stat[MCS_SOFT_KSWAPD_SCAN] += val;
+ val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_DIRECT_STEAL);
+ s->stat[MCS_SOFT_DIRECT_STEAL] += val;
+ val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_DIRECT_SCAN);
+ s->stat[MCS_SOFT_DIRECT_SCAN] += val;
val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGFAULT);
s->stat[MCS_PGFAULT] += val;
val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGMAJFAULT);
--
1.7.3.1
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-28 22:37 [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
2011-04-28 22:37 ` [PATCH 1/2] Add " Ying Han
2011-04-28 22:37 ` [PATCH 2/2] Add stats to monitor soft_limit reclaim Ying Han
@ 2011-04-28 23:24 ` Ying Han
2011-04-29 10:23 ` Balbir Singh
2011-04-29 16:44 ` Minchan Kim
4 siblings, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-04-28 23:24 UTC (permalink / raw)
To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai,
kamezawa.hiroyuki
Cc: linux-mm
On Thu, Apr 28, 2011 at 3:37 PM, Ying Han <yinghan@google.com> wrote:
> We recently added the change in global background reclaim which counts the
> return value of soft_limit reclaim. Now this patch adds the similar logic
> on global direct reclaim.
>
> We should skip scanning global LRU on shrink_zone if soft_limit reclaim does
> enough work. This is the first step where we start with counting the nr_scanned
> and nr_reclaimed from soft_limit reclaim into global scan_control.
>
> The patch is based on mmotm-04-14 and i triggered kernel BUG at mm/vmscan.c:1058!
>
> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
> [ 938.242033] invalid opcode: 0000 [#1] SMP
> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
> [ 938.242033] Stack:
> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
> [ 938.242033] Call Trace:
> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
>
> Thank you Minchan for the pointer. I reverted the following commit and I
> haven't seen the problem with the same operation. I haven't looked deeply
> on the patch yet, but figured it would be a good idea to post the dump.
> The dump looks not directly related to this patchset, but ppl can use it to
> reproduce the problem.
>
> commit 278df9f451dc71dcd002246be48358a473504ad0
> Author: Minchan Kim <minchan.kim@gmail.com>
> Date: Tue Mar 22 16:32:54 2011 -0700
>
> mm: reclaim invalidated page ASAP
>
> How to reproduce it, On my 32G of machine
> 1. I create two memcgs and set their hard_limit and soft_limit:
> $echo 20g >A/memory.limit_in_bytes
> $echo 20g >B/memory.limit_in_bytes
> $echo 3g >A/memory.soft_limit_in_bytes
> $echo 3g >B/memory.soft_limit_in_bytes
>
> 2. Reading a 20g file on each container
> $echo $$ >A/tasks
> $cat /export/hdc3/dd_A/tf0 > /dev/zero
>
> $echo $$ >B/tasks
> $cat /export/hdc3/dd_B/tf0 > /dev/zero
>
> 3. Add memory pressure by allocating anon + mlock. And trigger global
> reclaim.
>
> Ying Han (2):
> Add the soft_limit reclaim in global direct reclaim.
> Add stats to monitor soft_limit reclaim
>
> Documentation/cgroups/memory.txt | 10 ++++-
> mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------
> mm/vmscan.c | 16 ++++++++-
> 3 files changed, 72 insertions(+), 22 deletions(-)
>
> --
> 1.7.3.1
>
>
* Re: [PATCH 1/2] Add the soft_limit reclaim in global direct reclaim.
2011-04-28 22:37 ` [PATCH 1/2] Add " Ying Han
@ 2011-04-28 23:25 ` Ying Han
2011-04-29 10:26 ` Balbir Singh
2011-04-29 13:05 ` Michal Hocko
2 siblings, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-04-28 23:25 UTC (permalink / raw)
To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai,
kamezawa.hiroyuki
Cc: linux-mm
On Thu, Apr 28, 2011 at 3:37 PM, Ying Han <yinghan@google.com> wrote:
> We recently added the change in global background reclaim which
> counts the return value of soft_limit reclaim. Now this patch adds
> the similar logic on global direct reclaim.
>
> We should skip scanning global LRU on shrink_zone if soft_limit reclaim
> does enough work. This is the first step where we start with counting
> the nr_scanned and nr_reclaimed from soft_limit reclaim into global
> scan_control.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
> mm/vmscan.c | 16 ++++++++++++++--
> 1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b3a569f..84003cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1959,11 +1959,14 @@ restart:
> * If a zone is deemed to be full of pinned pages then just give it a light
> * scan then give up on it.
> */
> -static void shrink_zones(int priority, struct zonelist *zonelist,
> +static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
> struct scan_control *sc)
> {
> struct zoneref *z;
> struct zone *zone;
> + unsigned long nr_soft_reclaimed;
> + unsigned long nr_soft_scanned;
> + unsigned long total_scanned = 0;
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> gfp_zone(sc->gfp_mask), sc->nodemask) {
> @@ -1980,8 +1983,17 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> continue; /* Let kswapd poll it */
> }
>
> + nr_soft_scanned = 0;
> + nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
> + sc->order, sc->gfp_mask,
> + &nr_soft_scanned);
> + sc->nr_reclaimed += nr_soft_reclaimed;
> + total_scanned += nr_soft_scanned;
> +
> shrink_zone(priority, zone, sc);
> }
> +
> + return total_scanned;
> }
>
> static bool zone_reclaimable(struct zone *zone)
> @@ -2045,7 +2057,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> sc->nr_scanned = 0;
> if (!priority)
> disable_swap_token();
> - shrink_zones(priority, zonelist, sc);
> + total_scanned += shrink_zones(priority, zonelist, sc);
> /*
> * Don't shrink slabs when reclaiming memory from
> * over limit cgroups
> --
> 1.7.3.1
>
>
* Re: [PATCH 2/2] Add stats to monitor soft_limit reclaim
2011-04-28 22:37 ` [PATCH 2/2] Add stats to monitor soft_limit reclaim Ying Han
@ 2011-04-28 23:26 ` Ying Han
2011-04-28 23:51 ` Hiroyuki Kamezawa
0 siblings, 1 reply; 24+ messages in thread
From: Ying Han @ 2011-04-28 23:26 UTC (permalink / raw)
To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai,
kamezawa.hiroyuki
Cc: linux-mm
On Thu, Apr 28, 2011 at 3:37 PM, Ying Han <yinghan@google.com> wrote:
> This patch extend the soft_limit reclaim stats to both global background
> reclaim and global direct reclaim.
>
> We have a thread discussing the naming of some of the stats. Both
> KAMEZAWA and Johannes posted the proposals. The following stats are based
> on what i had before that thread. I will make the corresponding change on
> the next post when we make decision.
>
> $cat /dev/cgroup/memory/A/memory.stat
> kswapd_soft_steal 1053626
> kswapd_soft_scan 1053693
> direct_soft_steal 1481810
> direct_soft_scan 1481996
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
> Documentation/cgroups/memory.txt | 10 ++++-
> mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------
> 2 files changed, 58 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 0c40dab..fedc107 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -387,8 +387,14 @@ pgpgout - # of pages paged out (equivalent to # of uncharging events).
> swap - # of bytes of swap usage
> pgfault - # of page faults.
> pgmajfault - # of major page faults.
> -soft_steal - # of pages reclaimed from global hierarchical reclaim
> -soft_scan - # of pages scanned from global hierarchical reclaim
> +soft_kswapd_steal- # of pages reclaimed in global hierarchical reclaim from
> + background reclaim
> +soft_kswapd_scan - # of pages scanned in global hierarchical reclaim from
> + background reclaim
> +soft_direct_steal- # of pages reclaimed in global hierarchical reclaim from
> + direct reclaim
> +soft_direct_scan- # of pages scanned in global hierarchical reclaim from
> + direct reclaim
> inactive_anon - # of bytes of anonymous memory and swap cache memory on
> LRU list.
> active_anon - # of bytes of anonymous and swap cache memory on active
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c2776f1..392ed9c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -96,10 +96,14 @@ enum mem_cgroup_events_index {
> MEM_CGROUP_EVENTS_COUNT, /* # of pages paged in/out */
> MEM_CGROUP_EVENTS_PGFAULT, /* # of page-faults */
> MEM_CGROUP_EVENTS_PGMAJFAULT, /* # of major page-faults */
> - MEM_CGROUP_EVENTS_SOFT_STEAL, /* # of pages reclaimed from */
> - /* soft reclaim */
> - MEM_CGROUP_EVENTS_SOFT_SCAN, /* # of pages scanned from */
> - /* soft reclaim */
> + MEM_CGROUP_EVENTS_SOFT_KSWAPD_STEAL, /* # of pages reclaimed from */
> + /* soft reclaim in background reclaim */
> + MEM_CGROUP_EVENTS_SOFT_KSWAPD_SCAN, /* # of pages scanned from */
> + /* soft reclaim in background reclaim */
> + MEM_CGROUP_EVENTS_SOFT_DIRECT_STEAL, /* # of pages reclaimed from */
> + /* soft reclaim in direct reclaim */
> + MEM_CGROUP_EVENTS_SOFT_DIRECT_SCAN, /* # of pages scanned from */
> + /* soft reclaim in direct reclaim */
> MEM_CGROUP_EVENTS_NSTATS,
> };
> /*
> @@ -640,14 +644,30 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> preempt_enable();
> }
>
> -static void mem_cgroup_soft_steal(struct mem_cgroup *mem, int val)
> +static void mem_cgroup_soft_steal(struct mem_cgroup *mem, bool is_kswapd,
> + int val)
> {
> - this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_SOFT_STEAL], val);
> + if (is_kswapd)
> + this_cpu_add(
> + mem->stat->events[MEM_CGROUP_EVENTS_SOFT_KSWAPD_STEAL],
> + val);
> + else
> + this_cpu_add(
> + mem->stat->events[MEM_CGROUP_EVENTS_SOFT_DIRECT_STEAL],
> + val);
> }
>
> -static void mem_cgroup_soft_scan(struct mem_cgroup *mem, int val)
> +static void mem_cgroup_soft_scan(struct mem_cgroup *mem, bool is_kswapd,
> + int val)
> {
> - this_cpu_add(mem->stat->events[MEM_CGROUP_EVENTS_SOFT_SCAN], val);
> + if (is_kswapd)
> + this_cpu_add(
> + mem->stat->events[MEM_CGROUP_EVENTS_SOFT_KSWAPD_SCAN],
> + val);
> + else
> + this_cpu_add(
> + mem->stat->events[MEM_CGROUP_EVENTS_SOFT_DIRECT_SCAN],
> + val);
> }
>
> static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
> @@ -1495,6 +1515,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> + bool is_kswapd = false;
> unsigned long excess;
> unsigned long nr_scanned;
>
> @@ -1504,6 +1525,9 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> if (root_mem->memsw_is_minimum)
> noswap = true;
>
> + if (current_is_kswapd())
> + is_kswapd = true;
> +
> while (1) {
> victim = mem_cgroup_select_victim(root_mem);
> if (victim == root_mem) {
> @@ -1544,8 +1568,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> noswap, get_swappiness(victim), zone,
> &nr_scanned);
> *total_scanned += nr_scanned;
> - mem_cgroup_soft_steal(victim, ret);
> - mem_cgroup_soft_scan(victim, nr_scanned);
> + mem_cgroup_soft_steal(victim, is_kswapd, ret);
> + mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
> } else
> ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> noswap, get_swappiness(victim));
> @@ -3840,8 +3864,10 @@ enum {
> MCS_SWAP,
> MCS_PGFAULT,
> MCS_PGMAJFAULT,
> - MCS_SOFT_STEAL,
> - MCS_SOFT_SCAN,
> + MCS_SOFT_KSWAPD_STEAL,
> + MCS_SOFT_KSWAPD_SCAN,
> + MCS_SOFT_DIRECT_STEAL,
> + MCS_SOFT_DIRECT_SCAN,
> MCS_INACTIVE_ANON,
> MCS_ACTIVE_ANON,
> MCS_INACTIVE_FILE,
> @@ -3866,8 +3892,10 @@ struct {
> {"swap", "total_swap"},
> {"pgfault", "total_pgfault"},
> {"pgmajfault", "total_pgmajfault"},
> - {"soft_steal", "total_soft_steal"},
> - {"soft_scan", "total_soft_scan"},
> + {"kswapd_soft_steal", "total_kswapd_soft_steal"},
> + {"kswapd_soft_scan", "total_kswapd_soft_scan"},
> + {"direct_soft_steal", "total_direct_soft_steal"},
> + {"direct_soft_scan", "total_direct_soft_scan"},
> {"inactive_anon", "total_inactive_anon"},
> {"active_anon", "total_active_anon"},
> {"inactive_file", "total_inactive_file"},
> @@ -3896,10 +3924,14 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
> val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
> s->stat[MCS_SWAP] += val * PAGE_SIZE;
> }
> - val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_STEAL);
> - s->stat[MCS_SOFT_STEAL] += val;
> - val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_SCAN);
> - s->stat[MCS_SOFT_SCAN] += val;
> + val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_KSWAPD_STEAL);
> + s->stat[MCS_SOFT_KSWAPD_STEAL] += val;
> + val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_KSWAPD_SCAN);
> + s->stat[MCS_SOFT_KSWAPD_SCAN] += val;
> + val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_DIRECT_STEAL);
> + s->stat[MCS_SOFT_DIRECT_STEAL] += val;
> + val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_SOFT_DIRECT_SCAN);
> + s->stat[MCS_SOFT_DIRECT_SCAN] += val;
> val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGFAULT);
> s->stat[MCS_PGFAULT] += val;
> val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGMAJFAULT);
> --
> 1.7.3.1
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/2] Add stats to monitor soft_limit reclaim
2011-04-28 23:26 ` Ying Han
@ 2011-04-28 23:51 ` Hiroyuki Kamezawa
2011-04-29 3:28 ` Ying Han
0 siblings, 1 reply; 24+ messages in thread
From: Hiroyuki Kamezawa @ 2011-04-28 23:51 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai,
linux-mm
2011/4/29 Ying Han <yinghan@google.com>:
> On Thu, Apr 28, 2011 at 3:37 PM, Ying Han <yinghan@google.com> wrote:
>> This patch extends the soft_limit reclaim stats to both global background
>> reclaim and global direct reclaim.
>>
>> We have a thread discussing the naming of some of the stats. Both
>> KAMEZAWA and Johannes posted proposals. The following stats are based
>> on what I had before that thread. I will make the corresponding change in
>> the next post once we make a decision.
>>
>> $cat /dev/cgroup/memory/A/memory.stat
>> kswapd_soft_steal 1053626
>> kswapd_soft_scan 1053693
>> direct_soft_steal 1481810
>> direct_soft_scan 1481996
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>> Documentation/cgroups/memory.txt | 10 ++++-
>> mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------
>> 2 files changed, 58 insertions(+), 20 deletions(-)
>>
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index 0c40dab..fedc107 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -387,8 +387,14 @@ pgpgout - # of pages paged out (equivalent to # of uncharging events).
>> swap - # of bytes of swap usage
>> pgfault - # of page faults.
>> pgmajfault - # of major page faults.
>> -soft_steal - # of pages reclaimed from global hierarchical reclaim
>> -soft_scan - # of pages scanned from global hierarchical reclaim
>> +soft_kswapd_steal- # of pages reclaimed in global hierarchical reclaim from
>> + background reclaim
>> +soft_kswapd_scan - # of pages scanned in global hierarchical reclaim from
>> + background reclaim
>> +soft_direct_steal- # of pages reclaimed in global hierarchical reclaim from
>> + direct reclaim
>> +soft_direct_scan- # of pages scanned in global hierarchical reclaim from
>> + direct reclaim
Thank you for the CC.
I don't have a strong opinion, but once we add interfaces to mainline
it's hard to rename them. So it's better to make a list of what names
we'll need in the future.
Now, your naming follows the format [Reason]-[Who reclaims]-[What is counted?]
soft_kswapd_steal
soft_kswapd_scan
soft_direct_steal
soft_direct_scan
Ok, we can make a name for wmark and limit reclaim as
limit_direct_steal/scan
wmark_bg_steal/scan
Then, assuming we finally do a round-robin scan of memcgs regardless of
soft limit by removing the global LRU, what names do we have? Hmm,
kernel_kswapd_scan/steal
kernel_direct_scan/steal
?
BTW, your changelog uses different counter names; please fix.
And I'm sorry I'll not be very active for a while.
Thanks,
-Kame
* Re: [PATCH 2/2] Add stats to monitor soft_limit reclaim
2011-04-28 23:51 ` Hiroyuki Kamezawa
@ 2011-04-29 3:28 ` Ying Han
2011-04-29 10:30 ` Balbir Singh
0 siblings, 1 reply; 24+ messages in thread
From: Ying Han @ 2011-04-29 3:28 UTC (permalink / raw)
To: Hiroyuki Kamezawa
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai,
linux-mm
On Thu, Apr 28, 2011 at 4:51 PM, Hiroyuki Kamezawa
<kamezawa.hiroyuki@gmail.com> wrote:
> 2011/4/29 Ying Han <yinghan@google.com>:
>> On Thu, Apr 28, 2011 at 3:37 PM, Ying Han <yinghan@google.com> wrote:
>>> This patch extends the soft_limit reclaim stats to both global background
>>> reclaim and global direct reclaim.
>>>
>>> We have a thread discussing the naming of some of the stats. Both
>>> KAMEZAWA and Johannes posted proposals. The following stats are based
>>> on what I had before that thread. I will make the corresponding change in
>>> the next post once we make a decision.
>>>
>>> $cat /dev/cgroup/memory/A/memory.stat
>>> kswapd_soft_steal 1053626
>>> kswapd_soft_scan 1053693
>>> direct_soft_steal 1481810
>>> direct_soft_scan 1481996
>>>
>>> Signed-off-by: Ying Han <yinghan@google.com>
>>> ---
>>> Documentation/cgroups/memory.txt | 10 ++++-
>>> mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------
>>> 2 files changed, 58 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>>> index 0c40dab..fedc107 100644
>>> --- a/Documentation/cgroups/memory.txt
>>> +++ b/Documentation/cgroups/memory.txt
>>> @@ -387,8 +387,14 @@ pgpgout - # of pages paged out (equivalent to # of uncharging events).
>>> swap - # of bytes of swap usage
>>> pgfault - # of page faults.
>>> pgmajfault - # of major page faults.
>>> -soft_steal - # of pages reclaimed from global hierarchical reclaim
>>> -soft_scan - # of pages scanned from global hierarchical reclaim
>>> +soft_kswapd_steal- # of pages reclaimed in global hierarchical reclaim from
>>> + background reclaim
>>> +soft_kswapd_scan - # of pages scanned in global hierarchical reclaim from
>>> + background reclaim
>>> +soft_direct_steal- # of pages reclaimed in global hierarchical reclaim from
>>> + direct reclaim
>>> +soft_direct_scan- # of pages scanned in global hierarchical reclaim from
>>> + direct reclaim
>
> Thank you for CC.
>
> I don't have strong opinion but once we add interfaces to mainline,
> it's hard to rename them. So, it's better to make a list of what name
> we'll need in future.
>
> Now, your naming has a format as [Reason]-[Who reclaim]-[What count?]
> soft_kswapd_steal
> soft_kswapd_scan
> soft_direct_steal
> soft_direct_scan
>
> Ok, we can make a name for wmark and limit reclaim as
>
> limit_direct_steal/scan
> wmark_bg_steal/scan
>
> Then, assume we finally do round-robin scan of memcg regardless of softlimit by
> removing global LRU, what name do we have ? Hmm,
>
> kernel_kswapd_scan/steal
> kernel_direct_scan/steal
>
> ?
Johannes has a proposal to separate out reclaim triggered internally
and externally to the memcg, and then apply the format
[Reason]-[Who reclaims]-[What is counted?]. I also added the 4th item.
1. when the memcg hits its hard_limit
> limit_direct_steal
> limit_direct_scan
2. when the memcg hits its wmark
> wmark_kswapd_steal
> wmark_kswapd_scan
3. the global direct reclaim triggers soft_limit pushback
> soft_direct_steal
> soft_direct_scan
4. hierarchy-triggered direct reclaim
> limit_hierarchy_steal
> limit_hierarchy_scan
5. the global bg reclaim triggers soft_limit pushback
> soft_kswapd_steal
> soft_kswapd_scan
For both soft_limit reclaim and per-memcg background reclaim, we don't
have hierarchical reclaim. Only hitting the hard_limit of the parent
will trigger direct reclaim hierarchically on the children.
> BTW, your changelog has different name of counters. please fix.
sure, will fix that.
>
> And I'm sorry I'll not be very active for a while.
np.
--Ying
>
> Thanks,
> -Kame
>
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-28 22:37 [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
` (2 preceding siblings ...)
2011-04-28 23:24 ` [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
@ 2011-04-29 10:23 ` Balbir Singh
2011-04-29 17:17 ` Ying Han
2011-04-29 16:44 ` Minchan Kim
4 siblings, 1 reply; 24+ messages in thread
From: Balbir Singh @ 2011-04-29 10:23 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
* Ying Han <yinghan@google.com> [2011-04-28 15:37:04]:
> We recently added a change to global background reclaim which counts the
> return value of soft_limit reclaim. Now this patch adds similar logic
> to global direct reclaim.
>
Sorry, I missed much of that discussion, I was away. I'll try and
catch up with them soon.
> We should skip scanning global LRU on shrink_zone if soft_limit reclaim does
> enough work. This is the first step where we start with counting the nr_scanned
> and nr_reclaimed from soft_limit reclaim into global scan_control.
>
> The patch is based on mmotm-04-14, and I triggered a kernel BUG at mm/vmscan.c:1058!
>
> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
> [ 938.242033] invalid opcode: 0000 [#1] SMP.
> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
> [ 938.242033] Stack:
> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
> [ 938.242033] Call Trace:
> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
What is gs_change()?
>
> Thank you Minchan for the pointer. I reverted the following commit and I
> haven't seen the problem with the same operation. I haven't looked deeply
> at the patch yet, but figured it would be a good idea to post the dump.
> The dump doesn't look directly related to this patchset, but people can
> use it to reproduce the problem.
>
> commit 278df9f451dc71dcd002246be48358a473504ad0
> Author: Minchan Kim <minchan.kim@gmail.com>
> Date: Tue Mar 22 16:32:54 2011 -0700
>
> mm: reclaim invalidated page ASAP
>
> How to reproduce it on my 32G machine:
> 1. I create two memcgs and set their hard_limit and soft_limit:
> $echo 20g >A/memory.limit_in_bytes
> $echo 20g >B/memory.limit_in_bytes
> $echo 3g >A/memory.soft_limit_in_bytes
> $echo 3g >B/memory.soft_limit_in_bytes
>
> 2. Reading a 20g file on each container
> $echo $$ >A/tasks
> $cat /export/hdc3/dd_A/tf0 > /dev/zero
>
> $echo $$ >B/tasks
> $cat /export/hdc3/dd_B/tf0 > /dev/zero
>
> 3. Add memory pressure by allocating anon + mlock. And trigger global
> reclaim.
>
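[Editorial note: step 3 above is the only reproduction step without concrete commands. A minimal sketch of such an anon + mlock pressure generator is shown below; this is an illustrative guess, not the tool Ying actually used — the size, the `create_pressure` name, and the fallback when mlock() fails are all assumptions.]

```c
/* Illustrative anon + mlock pressure generator (NOT the original tool).
 * Allocates `bytes` of anonymous memory, touches every byte so the pages
 * are actually populated, and tries to pin them with mlock(). If mlock()
 * fails (e.g. RLIMIT_MEMLOCK is too low for an unprivileged user), the
 * memory is still resident after the memset, which is enough to create
 * reclaim pressure. Returns 0 on success, -1 on allocation failure.
 * The buffer is deliberately never freed: the point is to hold memory.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int create_pressure(size_t bytes)
{
	char *buf = malloc(bytes);
	if (!buf)
		return -1;
	memset(buf, 0xaa, bytes);	/* fault in every page */
	if (mlock(buf, bytes) != 0)
		perror("mlock (continuing with unpinned memory)");
	return 0;			/* intentional leak: keep pressure */
}
```

Running a few instances of this with `bytes` set to a large fraction of RAM, while the two containers stream their 20g files, should push the host into global reclaim.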
I am sorry, but the summary leaves me confused about the patchset. You
mentioned adding memcg scan and reclaim, but then quickly shift focus
to the stacktrace.
--
Three Cheers,
Balbir
* Re: [PATCH 1/2] Add the soft_limit reclaim in global direct reclaim.
2011-04-28 22:37 ` [PATCH 1/2] Add " Ying Han
2011-04-28 23:25 ` Ying Han
@ 2011-04-29 10:26 ` Balbir Singh
2011-04-29 17:42 ` Ying Han
2011-04-29 13:05 ` Michal Hocko
2 siblings, 1 reply; 24+ messages in thread
From: Balbir Singh @ 2011-04-29 10:26 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
* Ying Han <yinghan@google.com> [2011-04-28 15:37:05]:
> We recently added a change to global background reclaim which
> counts the return value of soft_limit reclaim. Now this patch adds
> similar logic to global direct reclaim.
>
> We should skip scanning global LRU on shrink_zone if soft_limit reclaim
> does enough work. This is the first step where we start with counting
> the nr_scanned and nr_reclaimed from soft_limit reclaim into global
> scan_control.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
> mm/vmscan.c | 16 ++++++++++++++--
> 1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b3a569f..84003cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1959,11 +1959,14 @@ restart:
> * If a zone is deemed to be full of pinned pages then just give it a light
> * scan then give up on it.
> */
> -static void shrink_zones(int priority, struct zonelist *zonelist,
> +static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
> struct scan_control *sc)
> {
> struct zoneref *z;
> struct zone *zone;
> + unsigned long nr_soft_reclaimed;
> + unsigned long nr_soft_scanned;
> + unsigned long total_scanned = 0;
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> gfp_zone(sc->gfp_mask), sc->nodemask) {
> @@ -1980,8 +1983,17 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> continue; /* Let kswapd poll it */
> }
>
> + nr_soft_scanned = 0;
> + nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
> + sc->order, sc->gfp_mask,
> + &nr_soft_scanned);
> + sc->nr_reclaimed += nr_soft_reclaimed;
> + total_scanned += nr_soft_scanned;
> +
> shrink_zone(priority, zone, sc);
> }
> +
> + return total_scanned;
> }
>
> static bool zone_reclaimable(struct zone *zone)
> @@ -2045,7 +2057,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> sc->nr_scanned = 0;
> if (!priority)
> disable_swap_token();
> - shrink_zones(priority, zonelist, sc);
> + total_scanned += shrink_zones(priority, zonelist, sc);
> /*
> * Don't shrink slabs when reclaiming memory from
> * over limit cgroups
Seems reasonable to me. Are you able to see the benefit of setting
soft limits and then adding back the stats on the global LRU scan when
soft limits did a good job?
--
Three Cheers,
Balbir
* Re: [PATCH 2/2] Add stats to monitor soft_limit reclaim
2011-04-29 3:28 ` Ying Han
@ 2011-04-29 10:30 ` Balbir Singh
2011-04-29 19:12 ` Ying Han
0 siblings, 1 reply; 24+ messages in thread
From: Balbir Singh @ 2011-04-29 10:30 UTC (permalink / raw)
To: Ying Han
Cc: Hiroyuki Kamezawa, KOSAKI Motohiro, Minchan Kim,
Daisuke Nishimura, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
Dave Hansen, Zhu Yanhai, linux-mm
* Ying Han <yinghan@google.com> [2011-04-28 20:28:54]:
> On Thu, Apr 28, 2011 at 4:51 PM, Hiroyuki Kamezawa
> <kamezawa.hiroyuki@gmail.com> wrote:
> > 2011/4/29 Ying Han <yinghan@google.com>:
> >> On Thu, Apr 28, 2011 at 3:37 PM, Ying Han <yinghan@google.com> wrote:
> >>> This patch extends the soft_limit reclaim stats to both global background
> >>> reclaim and global direct reclaim.
> >>>
> >>> We have a thread discussing the naming of some of the stats. Both
> >>> KAMEZAWA and Johannes posted proposals. The following stats are based
> >>> on what I had before that thread. I will make the corresponding change in
> >>> the next post once we make a decision.
> >>>
> >>> $cat /dev/cgroup/memory/A/memory.stat
> >>> kswapd_soft_steal 1053626
> >>> kswapd_soft_scan 1053693
> >>> direct_soft_steal 1481810
> >>> direct_soft_scan 1481996
> >>>
> >>> Signed-off-by: Ying Han <yinghan@google.com>
> >>> ---
> >>> Documentation/cgroups/memory.txt | 10 ++++-
> >>> mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------
> >>> 2 files changed, 58 insertions(+), 20 deletions(-)
> >>>
> >>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> >>> index 0c40dab..fedc107 100644
> >>> --- a/Documentation/cgroups/memory.txt
> >>> +++ b/Documentation/cgroups/memory.txt
> >>> @@ -387,8 +387,14 @@ pgpgout - # of pages paged out (equivalent to # of uncharging events).
> >>> swap - # of bytes of swap usage
> >>> pgfault - # of page faults.
> >>> pgmajfault - # of major page faults.
> >>> -soft_steal - # of pages reclaimed from global hierarchical reclaim
> >>> -soft_scan - # of pages scanned from global hierarchical reclaim
> >>> +soft_kswapd_steal- # of pages reclaimed in global hierarchical reclaim from
> >>> + background reclaim
> >>> +soft_kswapd_scan - # of pages scanned in global hierarchical reclaim from
> >>> + background reclaim
> >>> +soft_direct_steal- # of pages reclaimed in global hierarchical reclaim from
> >>> + direct reclaim
> >>> +soft_direct_scan- # of pages scanned in global hierarchical reclaim from
> >>> + direct reclaim
> >
> > Thank you for CC.
> >
> > I don't have strong opinion but once we add interfaces to mainline,
> > it's hard to rename them. So, it's better to make a list of what name
> > we'll need in future.
> >
> > Now, your naming has a format as [Reason]-[Who reclaim]-[What count?]
> > soft_kswapd_steal
> > soft_kswapd_scan
> > soft_direct_steal
> > soft_direct_scan
> >
> > Ok, we can make a name for wmark and limit reclaim as
> >
> > limit_direct_steal/scan
> > wmark_bg_steal/scan
> >
> > Then, assume we finally do round-robin scan of memcg regardless of softlimit by
> > removing global LRU, what name do we have ? Hmm,
> >
> > kernel_kswapd_scan/steal
> > kernel_direct_scan/steal
> >
> > ?
>
> Johannes has a proposal to separate out reclaim triggered internally
> and externally to the memcg, and then apply the format
> [Reason]-[Who reclaims]-[What is counted?]. I also added the 4th item.
>
> 1. when the memcg hits its hard_limit
> > limit_direct_steal
> > limit_direct_scan
>
> 2. when the memcg hits its wmark
> > wmark_kswapd_steal
> > wmark_kswapd_scan
>
> 3. the global direct reclaim triggers soft_limit pushback
> > soft_direct_steal
> > soft_direct_scan
>
> 4. hierarchy-triggered direct reclaim
> > limit_hierarchy_steal
> > limit_hierarchy_scan
>
> 5. the global bg reclaim triggers soft_limit pushback
> > soft_kswapd_steal
> > soft_kswapd_scan
>
I like these names, but they are more developer-friendly than
end-user-friendly.
--
Three Cheers,
Balbir
* Re: [PATCH 1/2] Add the soft_limit reclaim in global direct reclaim.
2011-04-28 22:37 ` [PATCH 1/2] Add " Ying Han
2011-04-28 23:25 ` Ying Han
2011-04-29 10:26 ` Balbir Singh
@ 2011-04-29 13:05 ` Michal Hocko
2011-04-29 17:44 ` Ying Han
2 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2011-04-29 13:05 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm
On Thu 28-04-11 15:37:05, Ying Han wrote:
> We recently added a change to global background reclaim which
> counts the return value of soft_limit reclaim. Now this patch adds
> similar logic to global direct reclaim.
>
> We should skip scanning global LRU on shrink_zone if soft_limit reclaim
> does enough work. This is the first step where we start with counting
> the nr_scanned and nr_reclaimed from soft_limit reclaim into global
> scan_control.
Makes sense.
>
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
> mm/vmscan.c | 16 ++++++++++++++--
> 1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b3a569f..84003cc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1959,11 +1959,14 @@ restart:
> * If a zone is deemed to be full of pinned pages then just give it a light
> * scan then give up on it.
> */
> -static void shrink_zones(int priority, struct zonelist *zonelist,
> +static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
> struct scan_control *sc)
> {
> struct zoneref *z;
> struct zone *zone;
> + unsigned long nr_soft_reclaimed;
> + unsigned long nr_soft_scanned;
> + unsigned long total_scanned = 0;
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> gfp_zone(sc->gfp_mask), sc->nodemask) {
> @@ -1980,8 +1983,17 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> continue; /* Let kswapd poll it */
> }
>
> + nr_soft_scanned = 0;
> + nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
> + sc->order, sc->gfp_mask,
> + &nr_soft_scanned);
> + sc->nr_reclaimed += nr_soft_reclaimed;
> + total_scanned += nr_soft_scanned;
> +
> shrink_zone(priority, zone, sc);
This can cause more aggressive reclaiming, right? Shouldn't we check
whether shrink_zone is still needed?
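[Editorial note: the gating Michal asks about can be modeled in plain userspace C as below. This is a sketch of the idea, not a kernel patch — the field names mirror the kernel's struct scan_control, but the helper names and the skip test itself are hypothetical.]

```c
/* Userspace model of the suggested gating: fold the soft_limit reclaim
 * result into scan_control first, then only fall through to the global
 * LRU scan (shrink_zone) when the reclaim target has not yet been met.
 */
#include <assert.h>

struct scan_control {
	unsigned long nr_to_reclaim;	/* reclaim target for this pass */
	unsigned long nr_reclaimed;	/* pages reclaimed so far */
};

/* Account the soft_limit reclaim result into the global scan_control. */
static void account_soft_reclaim(struct scan_control *sc,
				 unsigned long nr_soft_reclaimed)
{
	sc->nr_reclaimed += nr_soft_reclaimed;
}

/* Did soft_limit reclaim already do enough work for this zone? */
static int soft_limit_did_enough(const struct scan_control *sc)
{
	return sc->nr_reclaimed >= sc->nr_to_reclaim;
}
```

In shrink_zones() this would read roughly: account the soft reclaim result, then `if (soft_limit_did_enough(sc)) continue;` before calling shrink_zone() — which is presumably what the cover letter means by skipping the global LRU as a later step.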
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-28 22:37 [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
` (3 preceding siblings ...)
2011-04-29 10:23 ` Balbir Singh
@ 2011-04-29 16:44 ` Minchan Kim
2011-04-29 17:19 ` Ying Han
4 siblings, 1 reply; 24+ messages in thread
From: Minchan Kim @ 2011-04-29 16:44 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
Hi Ying,
On Thu, Apr 28, 2011 at 03:37:04PM -0700, Ying Han wrote:
> We recently added a change to global background reclaim which counts the
> return value of soft_limit reclaim. Now this patch adds similar logic
> to global direct reclaim.
>
> We should skip scanning global LRU on shrink_zone if soft_limit reclaim does
> enough work. This is the first step where we start with counting the nr_scanned
> and nr_reclaimed from soft_limit reclaim into global scan_control.
>
> The patch is based on mmotm-04-14, and I triggered a kernel BUG at mm/vmscan.c:1058!
Could you tell me the exact patches?
mmotm-04-14 + just the 2 patches of this series? Or + something else?
These days, you and Kame produce many patches.
Do I have to apply any of them?
>
> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
> [ 938.242033] invalid opcode: 0000 [#1] SMP.
> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
> [ 938.242033] Stack:
> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
> [ 938.242033] Call Trace:
> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
>
It seems there is an active page on the inactive list.
As I read deactivate_page, lru_deactivate_fn clears PageActive before
add_page_to_lru_list, and it should be protected by zone->lru_lock.
In addition, PageLRU should protect against races with the isolation
functions. Hmm, I don't have any clue for now.
Is it easily reproducible?
Could you apply the debugging patch below and report the result?
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 8f7d247..f39b53a 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -25,6 +25,8 @@ static inline void
__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
struct list_head *head)
{
+ VM_BUG_ON(PageActive(page) && (
+ l == LRU_INACTIVE_ANON || l == LRU_INACTIVE_FILE));
list_add(&page->lru, head);
__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
mem_cgroup_add_lru_list(page, l);
diff --git a/mm/swap.c b/mm/swap.c
index a83ec5a..5f7c3c8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -454,6 +454,8 @@ static void lru_deactivate_fn(struct page *page, void *arg)
* The page's writeback ends up during pagevec
* We moves tha page into tail of inactive.
*/
+ VM_BUG_ON(PageActive(page) && (
+ lru == LRU_INACTIVE_ANON || lru == LRU_INACTIVE_FILE));
list_move_tail(&page->lru, &zone->lru[lru].list);
mem_cgroup_rotate_reclaimable_page(page);
__count_vm_event(PGROTATED);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b3a569f..3415896 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -963,7 +963,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
/* Only take pages on the LRU. */
if (!PageLRU(page))
- return ret;
+ return 1;
/*
* When checking the active state, we need to be sure we are
@@ -971,10 +971,10 @@ int __isolate_lru_page(struct page *page, int mode, int file)
* of each.
*/
if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
- return ret;
+ return 2;
if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
- return ret;
+ return 3;
/*
* When this function is being called for lumpy reclaim, we
@@ -982,7 +982,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
* unevictable; only give shrink_page_list evictable pages.
*/
if (PageUnevictable(page))
- return ret;
+ return 4;
ret = -EBUSY;
@@ -1035,13 +1035,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
unsigned long end_pfn;
unsigned long page_pfn;
int zone_id;
+ int ret;
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);
VM_BUG_ON(!PageLRU(page));
- switch (__isolate_lru_page(page, mode, file)) {
+ switch (ret = __isolate_lru_page(page, mode, file)) {
case 0:
list_move(&page->lru, dst);
mem_cgroup_del_lru(page);
@@ -1055,6 +1056,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
continue;
default:
+ printk(KERN_ERR "ret %d\n", ret);
BUG();
}
> Thank you Minchan for the pointer. I reverted the following commit and I
> haven't seen the problem with the same operation. I haven't looked deeply
> at the patch yet, but figured it would be a good idea to post the dump.
> The dump doesn't look directly related to this patchset, but people can
> use it to reproduce the problem.
I tested the patch with rsync + fadvise several times
on my machine (2P, 2G DRAM) but I have never seen the BUG.
But I didn't test it with memcg. As I look at the dump, it seems not
related to memcg. Anyway, I will try to reproduce it on my machine.
Maybe I will start testing after next week. Sorry.
I hope my debugging patch gives some clues.
Thanks for the reporting, Ying.
--
Kind regards,
Minchan Kim
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-29 10:23 ` Balbir Singh
@ 2011-04-29 17:17 ` Ying Han
0 siblings, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-04-29 17:17 UTC (permalink / raw)
To: balbir
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 3:23 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Ying Han <yinghan@google.com> [2011-04-28 15:37:04]:
>
>> We recently added the change in global background reclaim which counts the
>> return value of soft_limit reclaim. Now this patch adds the similar logic
>> on global direct reclaim.
>>
>
> Sorry, I missed much of that discussion, I was away. I'll try and
> catch up with them soon.
>
>> We should skip scanning global LRU on shrink_zone if soft_limit reclaim does
>> enough work. This is the first step where we start with counting the nr_scanned
>> and nr_reclaimed from soft_limit reclaim into global scan_control.
>>
>> The patch is based on mmotm-04-14 and I triggered a kernel BUG at mm/vmscan.c:1058!
>>
>> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
>> [ 938.242033] invalid opcode: 0000 [#1] SMP·
>> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
>> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
>> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
>> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
>> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
>> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
>> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
>> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
>> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
>> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
>> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
>> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
>> [ 938.242033] Stack:
>> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
>> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
>> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
>> [ 938.242033] Call Trace:
>> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
>> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
>> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
>> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
>> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
>> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
>> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
>> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
>> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
>> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
>> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
>
> What is gs_change()?
[ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W
2.6.39-smp-direct_reclaim #25 Intel Greencreek,ESB2/Ilium_IN_03
[ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>]
isolate_pages_global+0x18c/0x34f
[ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
[ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
[ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
[ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
[ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
[ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
[ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000)
knlGS:0000000000000000
[ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
[ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000,
task ffff88082fe52080)
[ 938.242033] Stack:
[ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021
0000000000000000
[ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80
ffffea00108eecb8
[ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc
0000000000000020
[ 938.242033] Call Trace:
[ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
[ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
[ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
[ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
[ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
[ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
[ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
[ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
[ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
[ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
[ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
[ 938.242033] Code: 45 d8 25 00 00 08 00 48 83 f8 01 49 8b 45 d8 19
f6 83 e6 02 83 e0 40 48 83 f8 01 83 de ff 4c 89 ff e8 4c 15 03 00 e9
2e 01 00 00 <0f> 0b eb fe 49 8b 45 d8 48 b9 00 00 00 00 00 16 00 00 4c
8b 75·
[ 938.242033] RIP [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
[ 938.242033] RSP <ffff88082f83bb50>
[ 938.242033] ---[ end trace 8af2d95d2c95696c ]---
[ 938.242033] Kernel panic - not syncing: Fatal exception
[ 938.242033] Pid: 546, comm: kswapd0 Tainted: G D W
2.6.39-smp-direct_reclaim #25
[ 938.242033] Call Trace:
[ 938.242033] [<ffffffff814118d8>] panic+0x91/0x194
[ 938.242033] [<ffffffff81414888>] oops_end+0xae/0xbe
[ 938.242033] [<ffffffff81039906>] die+0x5a/0x63
[ 938.242033] [<ffffffff81414321>] do_trap+0x121/0x130
[ 938.242033] [<ffffffff81037e85>] do_invalid_op+0x96/0x9f
[ 938.242033] [<ffffffff810ed174>] ? isolate_pages_global+0x18c/0x34f
[ 938.242033] [<ffffffff810ed677>] ? free_page_list+0xcc/0xda
[ 938.242033] [<ffffffff8141af55>] invalid_op+0x15/0x20
[ 938.242033] [<ffffffff810ed174>] ? isolate_pages_global+0x18c/0x34f
[ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
[ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
[ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
[ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
[ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
[ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
[ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
[ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
[ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
[ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
[ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
[ 938.242033] Rebooting in 10 seconds..
>
>>
>> Thank you Minchan for the pointer. I reverted the following commit and I
>> haven't seen the problem with the same operation. I haven't looked deeply
>> on the patch yet, but figured it would be a good idea to post the dump.
>> The dump looks not directly related to this patchset, but ppl can use it to
>> reproduce the problem.
>>
>> commit 278df9f451dc71dcd002246be48358a473504ad0
>> Author: Minchan Kim <minchan.kim@gmail.com>
>> Date: Tue Mar 22 16:32:54 2011 -0700
>>
>> mm: reclaim invalidated page ASAP
>>
>> How to reproduce it, On my 32G of machine
>> 1. I create two memcgs and set their hard_limit and soft_limit:
>> $echo 20g >A/memory.limit_in_bytes
>> $echo 20g >B/memory.limit_in_bytes
>> $echo 3g >A/memory.soft_limit_in_bytes
>> $echo 3g >B/memory.soft_limit_in_bytes
>>
>> 2. Reading a 20g file on each container
>> $echo $$ >A/tasks
>> $cat /export/hdc3/dd_A/tf0 > /dev/zero
>>
>> $echo $$ >B/tasks
>> $cat /export/hdc3/dd_B/tf0 > /dev/zero
>>
>> 3. Add memory pressure by allocating anon + mlock. And trigger global
>> reclaim.
>>
>
> I am sorry, but the summary leaves me confused about the patchset. You
> mentioned adding memcg scan and reclaim, but then quickly shift focus
> to the stacktrace.
Sorry about the confusion. I wasn't quite sure of the best way to report
the problem. I saw the BUG while testing my patch, but didn't get time to
reproduce it without the patch. Meanwhile, I feel the BUG has nothing to do
with the patch itself. So I ended up posting both the BUG and the patch,
which can help to reproduce the BUG.
--Ying
>
> --
> Three Cheers,
> Balbir
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-29 16:44 ` Minchan Kim
@ 2011-04-29 17:19 ` Ying Han
2011-04-29 17:48 ` Ying Han
2011-04-29 18:58 ` Ying Han
0 siblings, 2 replies; 24+ messages in thread
From: Ying Han @ 2011-04-29 17:19 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 9:44 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> Hi Ying,
>
> On Thu, Apr 28, 2011 at 03:37:04PM -0700, Ying Han wrote:
>> We recently added the change in global background reclaim which counts the
>> return value of soft_limit reclaim. Now this patch adds the similar logic
>> on global direct reclaim.
>>
>> We should skip scanning global LRU on shrink_zone if soft_limit reclaim does
>> enough work. This is the first step where we start with counting the nr_scanned
>> and nr_reclaimed from soft_limit reclaim into global scan_control.
>>
>> The patch is based on mmotm-04-14 and I triggered a kernel BUG at mm/vmscan.c:1058!
>
> Could you tell me exact patches?
> mmtom-04-14 + just 2 patch of this? or + something?
>
> These day, You and Kame produces many patches.
> Do I have to apply something of them?
No, I applied my patch on top of mmotm and here is the last commit
before my patch.
commit 66a3827927351e0f88dc391919cf0cda10d42dd7
Author: Andrew Morton <akpm@linux-foundation.org>
Date: Thu Apr 14 15:51:34 2011 -0700
>
>>
>> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
>> [ 938.242033] invalid opcode: 0000 [#1] SMP·
>> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
>> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
>> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
>> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
>> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
>> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
>> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
>> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
>> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
>> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
>> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
>> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
>> [ 938.242033] Stack:
>> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
>> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
>> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
>> [ 938.242033] Call Trace:
>> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
>> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
>> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
>> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
>> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
>> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
>> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
>> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
>> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
>> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
>> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
>>
>
> It seems there is an active page on the inactive list.
> As I look at deactivate_page, lru_deactivate_fn clears PageActive before
> add_page_to_lru_list, and it should be protected by zone->lru_lock.
> In addition, PageLRU should protect against races with the isolation functions.
>
> Hmm, I don't have any clue now.
> Is it reproducible easily?
I can manage to reproduce it on my host by adding lots of memory
pressure and then triggering global reclaim.
>
> Could you apply below debugging patch and report the result?
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 8f7d247..f39b53a 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -25,6 +25,8 @@ static inline void
> __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> struct list_head *head)
> {
> + VM_BUG_ON(PageActive(page) && (
> + l == LRU_INACTIVE_ANON || l == LRU_INACTIVE_FILE));
> list_add(&page->lru, head);
> __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
> mem_cgroup_add_lru_list(page, l);
> diff --git a/mm/swap.c b/mm/swap.c
> index a83ec5a..5f7c3c8 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -454,6 +454,8 @@ static void lru_deactivate_fn(struct page *page, void *arg)
> * The page's writeback ends up during pagevec
> * We moves tha page into tail of inactive.
> */
> + VM_BUG_ON(PageActive(page) && (
> + lru == LRU_INACTIVE_ANON || lru == LRU_INACTIVE_FILE));
> list_move_tail(&page->lru, &zone->lru[lru].list);
> mem_cgroup_rotate_reclaimable_page(page);
> __count_vm_event(PGROTATED);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b3a569f..3415896 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -963,7 +963,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>
> /* Only take pages on the LRU. */
> if (!PageLRU(page))
> - return ret;
> + return 1;
>
> /*
> * When checking the active state, we need to be sure we are
> @@ -971,10 +971,10 @@ int __isolate_lru_page(struct page *page, int mode, int file)
> * of each.
> */
> if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
> - return ret;
> + return 2;
>
> if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
> - return ret;
> + return 3;
>
> /*
> * When this function is being called for lumpy reclaim, we
> @@ -982,7 +982,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
> * unevictable; only give shrink_page_list evictable pages.
> */
> if (PageUnevictable(page))
> - return ret;
> + return 4;
>
> ret = -EBUSY;
>
> @@ -1035,13 +1035,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> unsigned long end_pfn;
> unsigned long page_pfn;
> int zone_id;
> + int ret;
>
> page = lru_to_page(src);
> prefetchw_prev_lru_page(page, src, flags);
>
> VM_BUG_ON(!PageLRU(page));
>
> - switch (__isolate_lru_page(page, mode, file)) {
> + switch (ret = __isolate_lru_page(page, mode, file)) {
> case 0:
> list_move(&page->lru, dst);
> mem_cgroup_del_lru(page);
> @@ -1055,6 +1056,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> continue;
>
> default:
> + printk(KERN_ERR "ret %d\n", ret);
> BUG();
> }
>
>> Thank you Minchan for the pointer. I reverted the following commit and I
>> haven't seen the problem with the same operation. I haven't looked deeply
>> on the patch yet, but figured it would be a good idea to post the dump.
>> The dump looks not directly related to this patchset, but ppl can use it to
>> reproduce the problem.
>
> I tested the patch with rsync + fadvise several times
> on my machine (2P, 2G DRAM) but I have never seen the BUG.
> But I didn't test it with memcg. As I look at the dump, it seems not related to memcg.
> Anyway, I tried to reproduce it on my machine.
> Maybe I will start testing after next week. Sorry.
>
> I hope my debugging patch gives some clues.
> Thanks for the report, Ying.
Sure, I will try the patch and post the result.
--Ying
> --
> Kind regards,
> Minchan Kim
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/2] Add the soft_limit reclaim in global direct reclaim.
2011-04-29 10:26 ` Balbir Singh
@ 2011-04-29 17:42 ` Ying Han
0 siblings, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-04-29 17:42 UTC (permalink / raw)
To: balbir
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 3:26 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Ying Han <yinghan@google.com> [2011-04-28 15:37:05]:
>
>> We recently added the change in global background reclaim which
>> counts the return value of soft_limit reclaim. Now this patch adds
>> the similar logic on global direct reclaim.
>>
>> We should skip scanning global LRU on shrink_zone if soft_limit reclaim
>> does enough work. This is the first step where we start with counting
>> the nr_scanned and nr_reclaimed from soft_limit reclaim into global
>> scan_control.
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>> mm/vmscan.c | 16 ++++++++++++++--
>> 1 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b3a569f..84003cc 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1959,11 +1959,14 @@ restart:
>> * If a zone is deemed to be full of pinned pages then just give it a light
>> * scan then give up on it.
>> */
>> -static void shrink_zones(int priority, struct zonelist *zonelist,
>> +static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
>> struct scan_control *sc)
>> {
>> struct zoneref *z;
>> struct zone *zone;
>> + unsigned long nr_soft_reclaimed;
>> + unsigned long nr_soft_scanned;
>> + unsigned long total_scanned = 0;
>>
>> for_each_zone_zonelist_nodemask(zone, z, zonelist,
>> gfp_zone(sc->gfp_mask), sc->nodemask) {
>> @@ -1980,8 +1983,17 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
>> continue; /* Let kswapd poll it */
>> }
>>
>> + nr_soft_scanned = 0;
>> + nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
>> + sc->order, sc->gfp_mask,
>> + &nr_soft_scanned);
>> + sc->nr_reclaimed += nr_soft_reclaimed;
>> + total_scanned += nr_soft_scanned;
>> +
>> shrink_zone(priority, zone, sc);
>> }
>> +
>> + return total_scanned;
>> }
>>
>> static bool zone_reclaimable(struct zone *zone)
>> @@ -2045,7 +2057,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>> sc->nr_scanned = 0;
>> if (!priority)
>> disable_swap_token();
>> - shrink_zones(priority, zonelist, sc);
>> + total_scanned += shrink_zones(priority, zonelist, sc);
>> /*
>> * Don't shrink slabs when reclaiming memory from
>> * over limit cgroups
>
> Seems reasonable to me, are you able to see the benefits of setting
> soft limits and then adding back the stats on global LRU scan if
> soft limits did a good job?
I can list the stats in my next post, showing how many pages were reclaimed
from soft_limit reclaim vs. per-zone reclaim, before and after the patch.
Thanks
--Ying
>
> --
> Three Cheers,
> Balbir
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/2] Add the soft_limit reclaim in global direct reclaim.
2011-04-29 13:05 ` Michal Hocko
@ 2011-04-29 17:44 ` Ying Han
2011-05-02 7:22 ` Johannes Weiner
0 siblings, 1 reply; 24+ messages in thread
From: Ying Han @ 2011-04-29 17:44 UTC (permalink / raw)
To: Michal Hocko
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 6:05 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 28-04-11 15:37:05, Ying Han wrote:
>> We recently added the change in global background reclaim which
>> counts the return value of soft_limit reclaim. Now this patch adds
>> the similar logic on global direct reclaim.
>>
>> We should skip scanning global LRU on shrink_zone if soft_limit reclaim
>> does enough work. This is the first step where we start with counting
>> the nr_scanned and nr_reclaimed from soft_limit reclaim into global
>> scan_control.
>
> Makes sense.
>
>>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>> mm/vmscan.c | 16 ++++++++++++++--
>> 1 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b3a569f..84003cc 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1959,11 +1959,14 @@ restart:
>> * If a zone is deemed to be full of pinned pages then just give it a light
>> * scan then give up on it.
>> */
>> -static void shrink_zones(int priority, struct zonelist *zonelist,
>> +static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
>> struct scan_control *sc)
>> {
>> struct zoneref *z;
>> struct zone *zone;
>> + unsigned long nr_soft_reclaimed;
>> + unsigned long nr_soft_scanned;
>> + unsigned long total_scanned = 0;
>>
>> for_each_zone_zonelist_nodemask(zone, z, zonelist,
>> gfp_zone(sc->gfp_mask), sc->nodemask) {
>> @@ -1980,8 +1983,17 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
>> continue; /* Let kswapd poll it */
>> }
>>
>> + nr_soft_scanned = 0;
>> + nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
>> + sc->order, sc->gfp_mask,
>> + &nr_soft_scanned);
>> + sc->nr_reclaimed += nr_soft_reclaimed;
>> + total_scanned += nr_soft_scanned;
>> +
>> shrink_zone(priority, zone, sc);
>
> This can cause more aggressive reclaiming, right? Shouldn't we check
> whether shrink_zone is still needed?
We decided to leave the shrink_zone call for now, before making further
changes to soft_limit reclaim. It is the same approach I took last time for
global background reclaim. It is safer to do this step-by-step :)
--Ying
>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-29 17:19 ` Ying Han
@ 2011-04-29 17:48 ` Ying Han
2011-04-29 18:58 ` Ying Han
1 sibling, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-04-29 17:48 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm,
Greg Thelen
On Fri, Apr 29, 2011 at 10:19 AM, Ying Han <yinghan@google.com> wrote:
> On Fri, Apr 29, 2011 at 9:44 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
>> Hi Ying,
>>
>> On Thu, Apr 28, 2011 at 03:37:04PM -0700, Ying Han wrote:
>>> We recently added the change in global background reclaim which counts the
>>> return value of soft_limit reclaim. Now this patch adds the similar logic
>>> on global direct reclaim.
>>>
>>> We should skip scanning global LRU on shrink_zone if soft_limit reclaim does
>>> enough work. This is the first step where we start with counting the nr_scanned
>>> and nr_reclaimed from soft_limit reclaim into global scan_control.
>>>
>>> The patch is based on mmotm-04-14 and I triggered a kernel BUG at mm/vmscan.c:1058!
>>
>> Could you tell me exact patches?
>> mmtom-04-14 + just 2 patch of this? or + something?
>>
>> These day, You and Kame produces many patches.
>> Do I have to apply something of them?
> No, I applied my patch on top of mmotm and here is the last commit
> before my patch.
>
> commit 66a3827927351e0f88dc391919cf0cda10d42dd7
> Author: Andrew Morton <akpm@linux-foundation.org>
> Date: Thu Apr 14 15:51:34 2011 -0700
Sorry, please ignore the last post. I learned that the mmotm I posted is
based on tag: mmotm-2011-04-14-15-08.
--Ying
>
>>
>>>
>>> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
>>> [ 938.242033] invalid opcode: 0000 [#1] SMP·
>>> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
>>> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
>>> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
>>> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
>>> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
>>> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
>>> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
>>> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
>>> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
>>> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
>>> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
>>> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
>>> [ 938.242033] Stack:
>>> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
>>> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
>>> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
>>> [ 938.242033] Call Trace:
>>> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
>>> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
>>> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
>>> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
>>> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
>>> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
>>> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
>>> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
>>> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
>>> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
>>> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
>>>
>>
>> It seems there is an active page on the inactive list.
>> As I look at deactivate_page, lru_deactivate_fn clears PageActive before
>> add_page_to_lru_list, and it should be protected by zone->lru_lock.
>> In addition, PageLRU should protect against races with the isolation functions.
>>
>> Hmm, I don't have any clue now.
>> Is it reproducible easily?
> I can manage to reproduce it on my host by adding lots of memory
> pressure and then triggering global reclaim.
>
>>
>> Could you apply below debugging patch and report the result?
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index 8f7d247..f39b53a 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -25,6 +25,8 @@ static inline void
>> __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
>> struct list_head *head)
>> {
>> + VM_BUG_ON(PageActive(page) && (
>> + l == LRU_INACTIVE_ANON || l == LRU_INACTIVE_FILE));
>> list_add(&page->lru, head);
>> __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
>> mem_cgroup_add_lru_list(page, l);
>> diff --git a/mm/swap.c b/mm/swap.c
>> index a83ec5a..5f7c3c8 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -454,6 +454,8 @@ static void lru_deactivate_fn(struct page *page, void *arg)
>> * The page's writeback ends up during pagevec
>> * We moves tha page into tail of inactive.
>> */
>> + VM_BUG_ON(PageActive(page) && (
>> + lru == LRU_INACTIVE_ANON || lru == LRU_INACTIVE_FILE));
>> list_move_tail(&page->lru, &zone->lru[lru].list);
>> mem_cgroup_rotate_reclaimable_page(page);
>> __count_vm_event(PGROTATED);
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b3a569f..3415896 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -963,7 +963,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>>
>> /* Only take pages on the LRU. */
>> if (!PageLRU(page))
>> - return ret;
>> + return 1;
>>
>> /*
>> * When checking the active state, we need to be sure we are
>> @@ -971,10 +971,10 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>> * of each.
>> */
>> if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
>> - return ret;
>> + return 2;
>>
>> if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
>> - return ret;
>> + return 3;
>>
>> /*
>> * When this function is being called for lumpy reclaim, we
>> @@ -982,7 +982,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>> * unevictable; only give shrink_page_list evictable pages.
>> */
>> if (PageUnevictable(page))
>> - return ret;
>> + return 4;
>>
>> ret = -EBUSY;
>>
>> @@ -1035,13 +1035,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> unsigned long end_pfn;
>> unsigned long page_pfn;
>> int zone_id;
>> + int ret;
>>
>> page = lru_to_page(src);
>> prefetchw_prev_lru_page(page, src, flags);
>>
>> VM_BUG_ON(!PageLRU(page));
>>
>> - switch (__isolate_lru_page(page, mode, file)) {
>> + switch (ret = __isolate_lru_page(page, mode, file)) {
>> case 0:
>> list_move(&page->lru, dst);
>> mem_cgroup_del_lru(page);
>> @@ -1055,6 +1056,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> continue;
>>
>> default:
>> + printk(KERN_ERR "ret %d\n", ret);
>> BUG();
>> }
>>
>>> Thank you Minchan for the pointer. I reverted the following commit and I
>>> haven't seen the problem with the same operation. I haven't looked deeply
>>> on the patch yet, but figured it would be a good idea to post the dump.
>>> The dump looks not directly related to this patchset, but ppl can use it to
>>> reproduce the problem.
>>
>> I tested the patch with rsync + fadvise several times
>> on my machine (2P, 2G DRAM) but I have never seen the BUG.
>> But I didn't test it with memcg. As I look at the dump, it seems not related to memcg.
>> Anyway, I tried to reproduce it on my machine.
>> Maybe I will start testing after next week. Sorry.
>>
>> I hope my debugging patch gives some clues.
>> Thanks for the report, Ying.
>
> Sure, I will try the patch and post the result.
>
> --Ying
>
>> --
>> Kind regards,
>> Minchan Kim
>>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-29 17:19 ` Ying Han
2011-04-29 17:48 ` Ying Han
@ 2011-04-29 18:58 ` Ying Han
2011-04-29 23:20 ` Minchan Kim
1 sibling, 1 reply; 24+ messages in thread
From: Ying Han @ 2011-04-29 18:58 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 10:19 AM, Ying Han <yinghan@google.com> wrote:
> On Fri, Apr 29, 2011 at 9:44 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
>> Hi Ying,
>>
>> On Thu, Apr 28, 2011 at 03:37:04PM -0700, Ying Han wrote:
>>> We recently added a change to global background reclaim which counts the
>>> return value of soft_limit reclaim. Now this patch adds similar logic
>>> to global direct reclaim.
>>>
>>> We should skip scanning the global LRU in shrink_zone if soft_limit reclaim
>>> does enough work. This is the first step, where we start by counting the
>>> nr_scanned and nr_reclaimed from soft_limit reclaim into the global
>>> scan_control.
>>>
>>> The patch is based on mmotm-04-14 and I triggered kernel BUG at mm/vmscan.c:1058!
>>
>> Could you tell me the exact patches?
>> mmotm-04-14 + just the 2 patches of this series? Or + something else?
>>
>> These days, you and Kame produce many patches.
>> Do I have to apply any of them?
> No, I applied my patch on top of mmotm and here is the last commit
> before my patch.
>
> commit 66a3827927351e0f88dc391919cf0cda10d42dd7
> Author: Andrew Morton <akpm@linux-foundation.org>
> Date: Thu Apr 14 15:51:34 2011 -0700
>
>>
>>>
>>> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
>>> [ 938.242033] invalid opcode: 0000 [#1] SMP
>>> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
>>> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
>>> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
>>> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
>>> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
>>> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
>>> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
>>> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
>>> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
>>> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
>>> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
>>> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
>>> [ 938.242033] Stack:
>>> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
>>> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
>>> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
>>> [ 938.242033] Call Trace:
>>> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
>>> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
>>> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
>>> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
>>> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
>>> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
>>> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
>>> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
>>> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
>>> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
>>> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
>>>
>>
>> It seems there is an active page on the inactive list.
>> As I look at deactivate_page, lru_deactivate_fn clears PageActive before
>> add_page_to_lru_list, and it should be protected by zone->lru_lock.
>> In addition, PageLRU should protect against races with the isolation functions.
>>
>> Hmm, I don't have any clue now.
>> Is it reproducible easily?
> I can reproduce it on my host by adding lots of memory pressure and then
> triggering global reclaim.
>
>>
>> Could you apply the debugging patch below and report the result?
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index 8f7d247..f39b53a 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -25,6 +25,8 @@ static inline void
>> __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
>> struct list_head *head)
>> {
>> + VM_BUG_ON(PageActive(page) && (
>> + l == LRU_INACTIVE_ANON || l == LRU_INACTIVE_FILE));
>> list_add(&page->lru, head);
>> __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
>> mem_cgroup_add_lru_list(page, l);
>> diff --git a/mm/swap.c b/mm/swap.c
>> index a83ec5a..5f7c3c8 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -454,6 +454,8 @@ static void lru_deactivate_fn(struct page *page, void *arg)
>> * The page's writeback ends up during pagevec
>> * We moves tha page into tail of inactive.
>> */
>> + VM_BUG_ON(PageActive(page) && (
>> + lru == LRU_INACTIVE_ANON || lru == LRU_INACTIVE_FILE));
>> list_move_tail(&page->lru, &zone->lru[lru].list);
>> mem_cgroup_rotate_reclaimable_page(page);
>> __count_vm_event(PGROTATED);
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b3a569f..3415896 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -963,7 +963,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>>
>> /* Only take pages on the LRU. */
>> if (!PageLRU(page))
>> - return ret;
>> + return 1;
>>
>> /*
>> * When checking the active state, we need to be sure we are
>> @@ -971,10 +971,10 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>> * of each.
>> */
>> if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
>> - return ret;
>> + return 2;
>>
>> if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
>> - return ret;
>> + return 3;
>>
>> /*
>> * When this function is being called for lumpy reclaim, we
>> @@ -982,7 +982,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>> * unevictable; only give shrink_page_list evictable pages.
>> */
>> if (PageUnevictable(page))
>> - return ret;
>> + return 4;
>>
>> ret = -EBUSY;
>>
>> @@ -1035,13 +1035,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> unsigned long end_pfn;
>> unsigned long page_pfn;
>> int zone_id;
>> + int ret;
>>
>> page = lru_to_page(src);
>> prefetchw_prev_lru_page(page, src, flags);
>>
>> VM_BUG_ON(!PageLRU(page));
>>
>> - switch (__isolate_lru_page(page, mode, file)) {
>> + switch (ret = __isolate_lru_page(page, mode, file)) {
>> case 0:
>> list_move(&page->lru, dst);
>> mem_cgroup_del_lru(page);
>> @@ -1055,6 +1056,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> continue;
>>
>> default:
>> + printk(KERN_ERR "ret %d\n", ret);
>> BUG();
>> }
>>
>>> Thank you Minchan for the pointer. I reverted the following commit and I
>>> haven't seen the problem with the same operation. I haven't looked deeply
>>> into the patch yet, but figured it would be a good idea to post the dump.
>>> The dump does not look directly related to this patchset, but people can
>>> use it to reproduce the problem.
>>
>> I tested the patch with rsync + fadvise several times
>> on my machine (2P, 2G DRAM), but I have never seen the BUG.
>> I didn't test it with memcg, though. Looking at the dump, it seems not
>> related to memcg. Anyway, I will try to reproduce it on my machine.
>> Maybe I will start testing after next week. Sorry.
>>
>> I hope my debugging patch gives some clues.
>> Thanks for the report, Ying.
>
> Sure, I will try the patch and post the result.
Minchan:

Here is the stack trace after applying your patch. We used trace_printk
instead, since printk didn't give me the message. ret == 4, so it looks
like we are failing at the "if (PageUnevictable(page))" check.

The kernel is based on tag mmotm-2011-04-14-15-08 plus my two memcg
patches in this thread, and also the debugging patch.
[ 426.696004] kernel BUG at mm/vmscan.c:1061!
[ 426.696004] invalid opcode: 0000 [#1] SMP
[ 426.696004] Dumping ftrace buffer:
[ 426.696004] ---------------------------------
[ 426.696004] <...>-546 4d... 426442418us : isolate_pages_global: ret 4
[ 426.696004] ---------------------------------
[ 426.696004] RIP: 0010:[<ffffffff810ed1b2>] [<ffffffff810ed1b2>]
isolate_pages_global+0x1ba/0x37d
[ 426.696004] RSP: 0000:ffff88082f8dfb50 EFLAGS: 00010086
[ 426.696004] RAX: 0000000000000001 RBX: ffff88082f8dfc90 RCX: 0000000000000000
[ 426.696004] RDX: 0000000000000006 RSI: 0000000000000046 RDI: ffff88085f805f80
[ 426.696004] RBP: ffff88082f8dfc20 R08: 0000000000000000 R09: 0000000000000007
[ 426.696004] R10: 0000000000000005 R11: 0000000000000000 R12: ffff88085ffb6e00
[ 426.696004] R13: ffffea001ca66c58 R14: 0000000000000004 R15: ffffea001ca66c30
[ 426.696004] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000)
knlGS:0000000000000000
[ 426.696004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 426.696004] CR2: 00007f0c65c6f320 CR3: 000000082b66f000 CR4: 00000000000006e0
[ 426.696004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 426.696004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 426.696004] Process kswapd0 (pid: 546, threadinfo ffff88082f8de000,
task ffff88082f83b8e0)
[ 426.696004] Stack:
[ 426.696004] ffff88085ffb6e00 ffffea0000000002 0000000000000020
0000000000000000
[ 426.696004] 0000000000000000 ffff88082f8dfcb8 ffffea00158f58d8
ffffea00158f5868
[ 426.696004] ffffea00158f5de0 0000000000000001 ffffffffffffffff
0000000000000020
[ 426.696004] Call Trace:
[ 426.696004] [<ffffffff810ee8e7>] shrink_inactive_list+0x185/0x3c9
[ 426.696004] [<ffffffff8107a3fc>] ? lock_timer_base+0x2c/0x52
[ 426.696004] [<ffffffff810e8b2d>] ? determine_dirtyable_memory+0x1a/0x2c
[ 426.696004] [<ffffffff810ef17c>] shrink_zone+0x380/0x44d
[ 426.696004] [<ffffffff810e5180>] ? zone_watermark_ok_safe+0xa1/0xae
[ 426.696004] [<ffffffff810efbb9>] kswapd+0x41b/0x76b
[ 426.696004] [<ffffffff810ef79e>] ? zone_reclaim+0x2fb/0x2fb
[ 426.696004] [<ffffffff81088561>] kthread+0x82/0x8a
[ 426.696004] [<ffffffff8141af54>] kernel_thread_helper+0x4/0x10
[ 426.696004] [<ffffffff810884df>] ? kthread_worker_fn+0x112/0x112
[ 426.696004] [<ffffffff8141af50>] ? gs_change+0xb/0xb
[ 426.696004] Code: 01 00 00 89 c6 48 c7 c7 69 52 70 81 31 c0 e8 c1
46 32 00 48 8b 35 37 2b 79 00 44 89 f2 48 c7 c7 8a d1 0e 81 31 c0 e8
09 e2 fd ff <0f> 0b eb fe 49 8b 45 d8 48 b9 00 00 00 00 00 16 00 00 4c
8b 75
[ 426.696004] RIP [<ffffffff810ed1b2>] isolate_pages_global+0x1ba/0x37d
[ 426.696004] RSP <ffff88082f8dfb50>
[ 426.696004] ---[ end trace fbb25b41a0373361 ]---
[ 426.696004] Kernel panic - not syncing: Fatal exception
[ 426.696004] Pid: 546, comm: kswapd0 Tainted: G D W
2.6.39-smp-Minchan #28
[ 426.696004] Call Trace:
[ 426.696004] [<ffffffff81411758>] panic+0x91/0x194
[ 426.696004] [<ffffffff81414708>] oops_end+0xae/0xbe
[ 426.696004] [<ffffffff81039906>] die+0x5a/0x63
[ 426.696004] [<ffffffff814141a1>] do_trap+0x121/0x130
[ 426.696004] [<ffffffff81037e85>] do_invalid_op+0x96/0x9f
[ 426.696004] [<ffffffff810ed1b2>] ? isolate_pages_global+0x1ba/0x37d
[ 426.696004] [<ffffffff810c414a>] ? ring_buffer_lock_reserve+0x6a/0x78
[ 426.696004] [<ffffffff810c2e3e>] ? rb_commit+0x76/0x78
[ 426.696004] [<ffffffff810c2eab>] ? ring_buffer_unlock_commit+0x21/0x25
[ 426.696004] [<ffffffff8141add5>] invalid_op+0x15/0x20
[ 426.696004] [<ffffffff810ed1b2>] ? isolate_pages_global+0x1ba/0x37d
[ 426.696004] [<ffffffff810ee8e7>] shrink_inactive_list+0x185/0x3c9
[ 426.696004] [<ffffffff8107a3fc>] ? lock_timer_base+0x2c/0x52
[ 426.696004] [<ffffffff810e8b2d>] ? determine_dirtyable_memory+0x1a/0x2c
[ 426.696004] [<ffffffff810ef17c>] shrink_zone+0x380/0x44d
[ 426.696004] [<ffffffff810e5180>] ? zone_watermark_ok_safe+0xa1/0xae
[ 426.696004] [<ffffffff810efbb9>] kswapd+0x41b/0x76b
[ 426.696004] [<ffffffff810ef79e>] ? zone_reclaim+0x2fb/0x2fb
[ 426.696004] [<ffffffff81088561>] kthread+0x82/0x8a
[ 426.696004] [<ffffffff8141af54>] kernel_thread_helper+0x4/0x10
[ 426.696004] [<ffffffff810884df>] ? kthread_worker_fn+0x112/0x112
[ 426.696004] [<ffffffff8141af50>] ? gs_change+0xb/0xb
--Ying
>
> --Ying
>
>> --
>> Kind regards,
>> Minchan Kim
>>
>
* Re: [PATCH 2/2] Add stats to monitor soft_limit reclaim
2011-04-29 10:30 ` Balbir Singh
@ 2011-04-29 19:12 ` Ying Han
0 siblings, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-04-29 19:12 UTC (permalink / raw)
To: balbir
Cc: Hiroyuki Kamezawa, KOSAKI Motohiro, Minchan Kim,
Daisuke Nishimura, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
Johannes Weiner, Rik van Riel, Hugh Dickins, Michal Hocko,
Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 3:30 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Ying Han <yinghan@google.com> [2011-04-28 20:28:54]:
>
>> On Thu, Apr 28, 2011 at 4:51 PM, Hiroyuki Kamezawa
>> <kamezawa.hiroyuki@gmail.com> wrote:
>> > 2011/4/29 Ying Han <yinghan@google.com>:
>> >> On Thu, Apr 28, 2011 at 3:37 PM, Ying Han <yinghan@google.com> wrote:
>> >>> This patch extend the soft_limit reclaim stats to both global background
>> >>> reclaim and global direct reclaim.
>> >>>
>> >>> We have a thread discussing the naming of some of the stats. Both
>> >>> KAMEZAWA and Johannes posted the proposals. The following stats are based
>> >>> on what i had before that thread. I will make the corresponding change on
>> >>> the next post when we make decision.
>> >>>
>> >>> $cat /dev/cgroup/memory/A/memory.stat
>> >>> kswapd_soft_steal 1053626
>> >>> kswapd_soft_scan 1053693
>> >>> direct_soft_steal 1481810
>> >>> direct_soft_scan 1481996
>> >>>
>> >>> Signed-off-by: Ying Han <yinghan@google.com>
>> >>> ---
>> >>> Documentation/cgroups/memory.txt | 10 ++++-
>> >>> mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------
>> >>> 2 files changed, 58 insertions(+), 20 deletions(-)
>> >>>
>> >>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> >>> index 0c40dab..fedc107 100644
>> >>> --- a/Documentation/cgroups/memory.txt
>> >>> +++ b/Documentation/cgroups/memory.txt
>> >>> @@ -387,8 +387,14 @@ pgpgout - # of pages paged out (equivalent to # of uncharging events).
>> >>> swap - # of bytes of swap usage
>> >>> pgfault - # of page faults.
>> >>> pgmajfault - # of major page faults.
>> >>> -soft_steal - # of pages reclaimed from global hierarchical reclaim
>> >>> -soft_scan - # of pages scanned from global hierarchical reclaim
>> >>> +soft_kswapd_steal- # of pages reclaimed in global hierarchical reclaim from
>> >>> + background reclaim
>> >>> +soft_kswapd_scan - # of pages scanned in global hierarchical reclaim from
>> >>> + background reclaim
>> >>> +soft_direct_steal- # of pages reclaimed in global hierarchical reclaim from
>> >>> + direct reclaim
>> >>> +soft_direct_scan- # of pages scanned in global hierarchical reclaim from
>> >>> + direct reclaim
>> >
>> > Thank you for CC.
>> >
>> > I don't have a strong opinion, but once we add interfaces to mainline,
>> > it's hard to rename them. So it's better to make a list of what names
>> > we'll need in the future.
>> >
>> > Now, your naming has the format [Reason]-[Who reclaim]-[What count?]
>> > soft_kswapd_steal
>> > soft_kswapd_scan
>> > soft_direct_steal
>> > soft_direct_scan
>> >
>> > Ok, we can make a name for wmark and limit reclaim as
>> >
>> > limit_direct_steal/scan
>> > wmark_bg_steal/scan
>> >
>> > Then, assuming we finally do a round-robin scan of memcgs regardless of
>> > softlimit by removing the global LRU, what names do we have? Hmm,
>> >
>> > kernel_kswapd_scan/steal
>> > kernel_direct_scan/steal
>> >
>> > ?
>>
>> Johannes proposed separating out reclaims triggered on the memcg
>> internally and externally, and then applying the format
>> [Reason]-[Who reclaim]-[What count?]; I also added a 4th item.
>>
>> 1. when the memcg hits its hard_limit
>> > limit_direct_steal
>> > limit_direct_scan
>>
>> 2. when the memcg hits its wmark
>> > wmark_kswapd_steal
>> > wmark_kswapd_scan
>>
>> 3. the global direct reclaim triggers soft_limit pushback
>> > soft_direct_steal
>> > soft_direct_scan
>>
>> 4. hierarchy-triggered direct reclaim
>> > limit_hierarchy_steal
>> > limit_hierarchy_scan
>>
>> 5. the global bg reclaim triggers soft_limit pushback
>> > soft_kswapd_steal
>> > soft_kswapd_scan
>>
>
> I like these names, but they are more developer-friendly than
> end-user-friendly.
Thank you for reviewing. One thing that I can think of to help is
better documentation. :)
--Ying
Thank you
>
> --
> Three Cheers,
> Balbir
>
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-29 18:58 ` Ying Han
@ 2011-04-29 23:20 ` Minchan Kim
2011-04-29 23:41 ` Ying Han
2011-04-30 1:33 ` Ying Han
0 siblings, 2 replies; 24+ messages in thread
From: Minchan Kim @ 2011-04-29 23:20 UTC (permalink / raw)
To: Ying Han
Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 11:58:34AM -0700, Ying Han wrote:
> On Fri, Apr 29, 2011 at 10:19 AM, Ying Han <yinghan@google.com> wrote:
> > On Fri, Apr 29, 2011 at 9:44 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> >> Hi Ying,
> >>
> >> On Thu, Apr 28, 2011 at 03:37:04PM -0700, Ying Han wrote:
> >>> We recently added a change to global background reclaim which counts the
> >>> return value of soft_limit reclaim. Now this patch adds similar logic
> >>> to global direct reclaim.
> >>>
> >>> We should skip scanning the global LRU in shrink_zone if soft_limit reclaim
> >>> does enough work. This is the first step, where we start by counting the
> >>> nr_scanned and nr_reclaimed from soft_limit reclaim into the global
> >>> scan_control.
> >>>
> >>> The patch is based on mmotm-04-14 and I triggered kernel BUG at mm/vmscan.c:1058!
> >>
> >> Could you tell me the exact patches?
> >> mmotm-04-14 + just the 2 patches of this series? Or + something else?
> >>
> >> These days, you and Kame produce many patches.
> >> Do I have to apply any of them?
> > No, I applied my patch on top of mmotm and here is the last commit
> > before my patch.
> >
> > commit 66a3827927351e0f88dc391919cf0cda10d42dd7
> > Author: Andrew Morton <akpm@linux-foundation.org>
> > Date: Thu Apr 14 15:51:34 2011 -0700
> >
> >>
> >>>
> >>> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
> >>> [ 938.242033] invalid opcode: 0000 [#1] SMP.
> >>> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
> >>> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
> >>> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
> >>> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
> >>> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
> >>> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
> >>> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
> >>> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
> >>> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
> >>> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
> >>> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
> >>> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >>> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
> >>> [ 938.242033] Stack:
> >>> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
> >>> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
> >>> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
> >>> [ 938.242033] Call Trace:
> >>> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
> >>> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
> >>> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
> >>> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
> >>> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
> >>> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
> >>> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
> >>> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
> >>> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
> >>> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
> >>> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
> >>>
> >>
> >> It seems there is an active page on the inactive list.
> >> As I look at deactivate_page, lru_deactivate_fn clears PageActive before
> >> add_page_to_lru_list, and it should be protected by zone->lru_lock.
> >> In addition, PageLRU should protect against races with the isolation functions.
> >>
> >> Hmm, I don't have any clue now.
> >> Is it reproducible easily?
> > I can reproduce it on my host by adding lots of memory pressure and
> > then triggering global reclaim.
> >
> >>
> >> Could you apply the debugging patch below and report the result?
> >>
> >> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> >> index 8f7d247..f39b53a 100644
> >> --- a/include/linux/mm_inline.h
> >> +++ b/include/linux/mm_inline.h
> >> @@ -25,6 +25,8 @@ static inline void
> >> __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> >> struct list_head *head)
> >> {
> >> + VM_BUG_ON(PageActive(page) && (
> >> + l == LRU_INACTIVE_ANON || l == LRU_INACTIVE_FILE));
> >> list_add(&page->lru, head);
> >> __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
> >> mem_cgroup_add_lru_list(page, l);
> >> diff --git a/mm/swap.c b/mm/swap.c
> >> index a83ec5a..5f7c3c8 100644
> >> --- a/mm/swap.c
> >> +++ b/mm/swap.c
> >> @@ -454,6 +454,8 @@ static void lru_deactivate_fn(struct page *page, void *arg)
> >> * The page's writeback ends up during pagevec
> >> * We moves tha page into tail of inactive.
> >> */
> >> + VM_BUG_ON(PageActive(page) && (
> >> + lru == LRU_INACTIVE_ANON || lru == LRU_INACTIVE_FILE));
> >> list_move_tail(&page->lru, &zone->lru[lru].list);
> >> mem_cgroup_rotate_reclaimable_page(page);
> >> __count_vm_event(PGROTATED);
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index b3a569f..3415896 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -963,7 +963,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
> >>
> >> /* Only take pages on the LRU. */
> >> if (!PageLRU(page))
> >> - return ret;
> >> + return 1;
> >>
> >> /*
> >> * When checking the active state, we need to be sure we are
> >> @@ -971,10 +971,10 @@ int __isolate_lru_page(struct page *page, int mode, int file)
> >> * of each.
> >> */
> >> if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
> >> - return ret;
> >> + return 2;
> >>
> >> if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
> >> - return ret;
> >> + return 3;
> >>
> >> /*
> >> * When this function is being called for lumpy reclaim, we
> >> @@ -982,7 +982,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
> >> * unevictable; only give shrink_page_list evictable pages.
> >> */
> >> if (PageUnevictable(page))
> >> - return ret;
> >> + return 4;
> >>
> >> ret = -EBUSY;
> >>
> >> @@ -1035,13 +1035,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >> unsigned long end_pfn;
> >> unsigned long page_pfn;
> >> int zone_id;
> >> + int ret;
> >>
> >> page = lru_to_page(src);
> >> prefetchw_prev_lru_page(page, src, flags);
> >>
> >> VM_BUG_ON(!PageLRU(page));
> >>
> >> - switch (__isolate_lru_page(page, mode, file)) {
> >> + switch (ret = __isolate_lru_page(page, mode, file)) {
> >> case 0:
> >> list_move(&page->lru, dst);
> >> mem_cgroup_del_lru(page);
> >> @@ -1055,6 +1056,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >> continue;
> >>
> >> default:
> >> + printk(KERN_ERR "ret %d\n", ret);
> >> BUG();
> >> }
> >>
> >>> Thank you Minchan for the pointer. I reverted the following commit and I
> >>> haven't seen the problem with the same operation. I haven't looked deeply
> >>> into the patch yet, but figured it would be a good idea to post the dump.
> >>> The dump does not look directly related to this patchset, but people can
> >>> use it to reproduce the problem.
> >>
> >> I tested the patch with rsync + fadvise several times
> >> on my machine (2P, 2G DRAM), but I have never seen the BUG.
> >> I didn't test it with memcg, though. Looking at the dump, it seems not
> >> related to memcg. Anyway, I will try to reproduce it on my machine.
> >> Maybe I will start testing after next week. Sorry.
> >>
> >> I hope my debugging patch gives some clues.
> >> Thanks for the report, Ying.
> >
> > Sure, I will try the patch and post the result.
>
> Minchan:
>
> Here is the stack trace after applying your patch. We used trace_printk
> instead, since printk didn't give me the message. ret == 4, so it looks
> like we are failing at the "if (PageUnevictable(page))" check.
>
> The kernel is based on tag mmotm-2011-04-14-15-08 plus my two memcg
> patches in this thread, and also the debugging patch.
>
> [ 426.696004] kernel BUG at mm/vmscan.c:1061!
> [ 426.696004] invalid opcode: 0000 [#1] SMP.
> [ 426.696004] Dumping ftrace buffer:
> [ 426.696004] ---------------------------------
> [ 426.696004] <...>-546 4d... 426442418us : isolate_pages_global: ret 4
> [ 426.696004] ---------------------------------
> [ 426.696004] RIP: 0010:[<ffffffff810ed1b2>] [<ffffffff810ed1b2>]
> isolate_pages_global+0x1ba/0x37d
> [ 426.696004] RSP: 0000:ffff88082f8dfb50 EFLAGS: 00010086
> [ 426.696004] RAX: 0000000000000001 RBX: ffff88082f8dfc90 RCX: 0000000000000000
> [ 426.696004] RDX: 0000000000000006 RSI: 0000000000000046 RDI: ffff88085f805f80
> [ 426.696004] RBP: ffff88082f8dfc20 R08: 0000000000000000 R09: 0000000000000007
> [ 426.696004] R10: 0000000000000005 R11: 0000000000000000 R12: ffff88085ffb6e00
> [ 426.696004] R13: ffffea001ca66c58 R14: 0000000000000004 R15: ffffea001ca66c30
> [ 426.696004] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000)
> knlGS:0000000000000000
> [ 426.696004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 426.696004] CR2: 00007f0c65c6f320 CR3: 000000082b66f000 CR4: 00000000000006e0
> [ 426.696004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 426.696004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 426.696004] Process kswapd0 (pid: 546, threadinfo ffff88082f8de000,
> task ffff88082f83b8e0)
> [ 426.696004] Stack:
> [ 426.696004] ffff88085ffb6e00 ffffea0000000002 0000000000000020
> 0000000000000000
> [ 426.696004] 0000000000000000 ffff88082f8dfcb8 ffffea00158f58d8
> ffffea00158f5868
> [ 426.696004] ffffea00158f5de0 0000000000000001 ffffffffffffffff
> 0000000000000020
> [ 426.696004] Call Trace:
> [ 426.696004] [<ffffffff810ee8e7>] shrink_inactive_list+0x185/0x3c9
> [ 426.696004] [<ffffffff8107a3fc>] ? lock_timer_base+0x2c/0x52
> [ 426.696004] [<ffffffff810e8b2d>] ? determine_dirtyable_memory+0x1a/0x2c
> [ 426.696004] [<ffffffff810ef17c>] shrink_zone+0x380/0x44d
> [ 426.696004] [<ffffffff810e5180>] ? zone_watermark_ok_safe+0xa1/0xae
> [ 426.696004] [<ffffffff810efbb9>] kswapd+0x41b/0x76b
> [ 426.696004] [<ffffffff810ef79e>] ? zone_reclaim+0x2fb/0x2fb
> [ 426.696004] [<ffffffff81088561>] kthread+0x82/0x8a
> [ 426.696004] [<ffffffff8141af54>] kernel_thread_helper+0x4/0x10
> [ 426.696004] [<ffffffff810884df>] ? kthread_worker_fn+0x112/0x112
> [ 426.696004] [<ffffffff8141af50>] ? gs_change+0xb/0xb
> [ 426.696004] Code: 01 00 00 89 c6 48 c7 c7 69 52 70 81 31 c0 e8 c1
> 46 32 00 48 8b 35 37 2b 79 00 44 89 f2 48 c7 c7 8a d1 0e 81 31 c0 e8
> 09 e2 fd ff <0f> 0b eb fe 49 8b 45 d8 48 b9 00 00 00 00 00 16 00 00 4c
> 8b 75.
> [ 426.696004] RIP [<ffffffff810ed1b2>] isolate_pages_global+0x1ba/0x37d
> [ 426.696004] RSP <ffff88082f8dfb50>
> [ 426.696004] ---[ end trace fbb25b41a0373361 ]---
> [ 426.696004] Kernel panic - not syncing: Fatal exception
> [ 426.696004] Pid: 546, comm: kswapd0 Tainted: G D W
> 2.6.39-smp-Minchan #28
> [ 426.696004] Call Trace:
> [ 426.696004] [<ffffffff81411758>] panic+0x91/0x194
> [ 426.696004] [<ffffffff81414708>] oops_end+0xae/0xbe
> [ 426.696004] [<ffffffff81039906>] die+0x5a/0x63
> [ 426.696004] [<ffffffff814141a1>] do_trap+0x121/0x130
> [ 426.696004] [<ffffffff81037e85>] do_invalid_op+0x96/0x9f
> [ 426.696004] [<ffffffff810ed1b2>] ? isolate_pages_global+0x1ba/0x37d
> [ 426.696004] [<ffffffff810c414a>] ? ring_buffer_lock_reserve+0x6a/0x78
> [ 426.696004] [<ffffffff810c2e3e>] ? rb_commit+0x76/0x78
> [ 426.696004] [<ffffffff810c2eab>] ? ring_buffer_unlock_commit+0x21/0x25
> [ 426.696004] [<ffffffff8141add5>] invalid_op+0x15/0x20
> [ 426.696004] [<ffffffff810ed1b2>] ? isolate_pages_global+0x1ba/0x37d
> [ 426.696004] [<ffffffff810ee8e7>] shrink_inactive_list+0x185/0x3c9
> [ 426.696004] [<ffffffff8107a3fc>] ? lock_timer_base+0x2c/0x52
> [ 426.696004] [<ffffffff810e8b2d>] ? determine_dirtyable_memory+0x1a/0x2c
> [ 426.696004] [<ffffffff810ef17c>] shrink_zone+0x380/0x44d
> [ 426.696004] [<ffffffff810e5180>] ? zone_watermark_ok_safe+0xa1/0xae
> [ 426.696004] [<ffffffff810efbb9>] kswapd+0x41b/0x76b
> [ 426.696004] [<ffffffff810ef79e>] ? zone_reclaim+0x2fb/0x2fb
> [ 426.696004] [<ffffffff81088561>] kthread+0x82/0x8a
> [ 426.696004] [<ffffffff8141af54>] kernel_thread_helper+0x4/0x10
> [ 426.696004] [<ffffffff810884df>] ? kthread_worker_fn+0x112/0x112
> [ 426.696004] [<ffffffff8141af50>] ? gs_change+0xb/0xb
>
> --Ying
Thanks for the testing.

I missed the mprotect case in your scenario. Yes, I didn't test it at
that time. :( So it wasn't related to your patch or memcg.

mprotect makes many unevictable pages, and it seems my deactivate_page
could move them onto the inactive list. It's entirely my fault.

Could you test the patch below?
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-29 23:20 ` Minchan Kim
@ 2011-04-29 23:41 ` Ying Han
2011-04-30 1:33 ` Ying Han
1 sibling, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-04-29 23:41 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 4:20 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Fri, Apr 29, 2011 at 11:58:34AM -0700, Ying Han wrote:
>> On Fri, Apr 29, 2011 at 10:19 AM, Ying Han <yinghan@google.com> wrote:
>> > On Fri, Apr 29, 2011 at 9:44 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
>> >> Hi Ying,
>> >>
>> >> On Thu, Apr 28, 2011 at 03:37:04PM -0700, Ying Han wrote:
>> >>> We recently added a change to global background reclaim that counts the
>> >>> return value of soft_limit reclaim. This patch now adds similar logic
>> >>> to global direct reclaim.
>> >>>
>> >>> We should skip scanning the global LRU in shrink_zone if soft_limit reclaim
>> >>> does enough work. This is the first step, where we start by counting the
>> >>> nr_scanned and nr_reclaimed from soft_limit reclaim into the global scan_control.
>> >>>
>> >>> The patch is based on mmotm-04-14, and I triggered a kernel BUG at mm/vmscan.c:1058!
>> >>
>> >> Could you tell me the exact patches?
>> >> mmotm-04-14 + just the 2 patches of this series? Or + something else?
>> >>
>> >> These days, you and Kame produce many patches.
>> >> Do I have to apply any of them?
>> > No, I applied my patch on top of mmotm and here is the last commit
>> > before my patch.
>> >
>> > commit 66a3827927351e0f88dc391919cf0cda10d42dd7
>> > Author: Andrew Morton <akpm@linux-foundation.org>
>> > Date: Thu Apr 14 15:51:34 2011 -0700
>> >
>> >>
>> >>>
>> >>> [ 938.242033] kernel BUG at mm/vmscan.c:1058!
>> >>> [ 938.242033] invalid opcode: 0000 [#1] SMP
>> >>> [ 938.242033] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/device
>> >>> [ 938.242033] Pid: 546, comm: kswapd0 Tainted: G W 2.6.39-smp-direct_reclaim
>> >>> [ 938.242033] RIP: 0010:[<ffffffff810ed174>] [<ffffffff810ed174>] isolate_pages_global+0x18c/0x34f
>> >>> [ 938.242033] RSP: 0018:ffff88082f83bb50 EFLAGS: 00010082
>> >>> [ 938.242033] RAX: 00000000ffffffea RBX: ffff88082f83bc90 RCX: 0000000000000401
>> >>> [ 938.242033] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea001ca653e8
>> >>> [ 938.242033] RBP: ffff88082f83bc20 R08: 0000000000000000 R09: ffff88085ffb6e00
>> >>> [ 938.242033] R10: ffff88085ffb73d0 R11: ffff88085ffb6e00 R12: ffff88085ffb6e00
>> >>> [ 938.242033] R13: ffffea001ca65410 R14: 0000000000000001 R15: ffffea001ca653e8
>> >>> [ 938.242033] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000) knlGS:0000000000000000
>> >>> [ 938.242033] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> >>> [ 938.242033] CR2: 00007f5c3405c320 CR3: 0000000001803000 CR4: 00000000000006e0
>> >>> [ 938.242033] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> >>> [ 938.242033] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> >>> [ 938.242033] Process kswapd0 (pid: 546, threadinfo ffff88082f83a000, task ffff88082fe52080)
>> >>> [ 938.242033] Stack:
>> >>> [ 938.242033] ffff88085ffb6e00 ffffea0000000002 0000000000000021 0000000000000000
>> >>> [ 938.242033] 0000000000000000 ffff88082f83bcb8 ffffea00108eec80 ffffea00108eecb8
>> >>> [ 938.242033] ffffea00108eecf0 0000000000000004 fffffffffffffffc 0000000000000020
>> >>> [ 938.242033] Call Trace:
>> >>> [ 938.242033] [<ffffffff810ee8a5>] shrink_inactive_list+0x185/0x418
>> >>> [ 938.242033] [<ffffffff810366cc>] ? __switch_to+0xea/0x212
>> >>> [ 938.242033] [<ffffffff810e8b35>] ? determine_dirtyable_memory+0x1a/0x2c
>> >>> [ 938.242033] [<ffffffff810ef19b>] shrink_zone+0x380/0x44d
>> >>> [ 938.242033] [<ffffffff810e5188>] ? zone_watermark_ok_safe+0xa1/0xae
>> >>> [ 938.242033] [<ffffffff810efbd8>] kswapd+0x41b/0x76b
>> >>> [ 938.242033] [<ffffffff810ef7bd>] ? zone_reclaim+0x2fb/0x2fb
>> >>> [ 938.242033] [<ffffffff81088569>] kthread+0x82/0x8a
>> >>> [ 938.242033] [<ffffffff8141b0d4>] kernel_thread_helper+0x4/0x10
>> >>> [ 938.242033] [<ffffffff810884e7>] ? kthread_worker_fn+0x112/0x112
>> >>> [ 938.242033] [<ffffffff8141b0d0>] ? gs_change+0xb/0xb
>> >>>
>> >>
>> >> It seems there is an active page on the inactive list.
>> >> As I look at deactivate_page, lru_deactivate_fn clears PageActive before
>> >> add_page_to_lru_list, and it should be protected by zone->lru_lock.
>> >> In addition, PageLRU should protect against races with the isolation functions.
>> >>
>> >> Hmm, I don't have any clue now.
>> >> Is it reproducible easily?
>> > I can reproduce it on my host by adding lots of memory
>> > pressure and then triggering global reclaim.
>> >
>> >>
>> >> Could you apply below debugging patch and report the result?
>> >>
>> >> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> >> index 8f7d247..f39b53a 100644
>> >> --- a/include/linux/mm_inline.h
>> >> +++ b/include/linux/mm_inline.h
>> >> @@ -25,6 +25,8 @@ static inline void
>> >> __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
>> >> struct list_head *head)
>> >> {
>> >> + VM_BUG_ON(PageActive(page) && (
>> >> + l == LRU_INACTIVE_ANON || l == LRU_INACTIVE_FILE));
>> >> list_add(&page->lru, head);
>> >> __mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
>> >> mem_cgroup_add_lru_list(page, l);
>> >> diff --git a/mm/swap.c b/mm/swap.c
>> >> index a83ec5a..5f7c3c8 100644
>> >> --- a/mm/swap.c
>> >> +++ b/mm/swap.c
>> >> @@ -454,6 +454,8 @@ static void lru_deactivate_fn(struct page *page, void *arg)
>> >> * The page's writeback ends up during pagevec
>> >> * We moves tha page into tail of inactive.
>> >> */
>> >> + VM_BUG_ON(PageActive(page) && (
>> >> + lru == LRU_INACTIVE_ANON || lru == LRU_INACTIVE_FILE));
>> >> list_move_tail(&page->lru, &zone->lru[lru].list);
>> >> mem_cgroup_rotate_reclaimable_page(page);
>> >> __count_vm_event(PGROTATED);
>> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> index b3a569f..3415896 100644
>> >> --- a/mm/vmscan.c
>> >> +++ b/mm/vmscan.c
>> >> @@ -963,7 +963,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>> >>
>> >> /* Only take pages on the LRU. */
>> >> if (!PageLRU(page))
>> >> - return ret;
>> >> + return 1;
>> >>
>> >> /*
>> >> * When checking the active state, we need to be sure we are
>> >> @@ -971,10 +971,10 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>> >> * of each.
>> >> */
>> >> if (mode != ISOLATE_BOTH && (!PageActive(page) != !mode))
>> >> - return ret;
>> >> + return 2;
>> >>
>> >> if (mode != ISOLATE_BOTH && page_is_file_cache(page) != file)
>> >> - return ret;
>> >> + return 3;
>> >>
>> >> /*
>> >> * When this function is being called for lumpy reclaim, we
>> >> @@ -982,7 +982,7 @@ int __isolate_lru_page(struct page *page, int mode, int file)
>> >> * unevictable; only give shrink_page_list evictable pages.
>> >> */
>> >> if (PageUnevictable(page))
>> >> - return ret;
>> >> + return 4;
>> >>
>> >> ret = -EBUSY;
>> >>
>> >> @@ -1035,13 +1035,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> >> unsigned long end_pfn;
>> >> unsigned long page_pfn;
>> >> int zone_id;
>> >> + int ret;
>> >>
>> >> page = lru_to_page(src);
>> >> prefetchw_prev_lru_page(page, src, flags);
>> >>
>> >> VM_BUG_ON(!PageLRU(page));
>> >>
>> >> - switch (__isolate_lru_page(page, mode, file)) {
>> >> + switch (ret = __isolate_lru_page(page, mode, file)) {
>> >> case 0:
>> >> list_move(&page->lru, dst);
>> >> mem_cgroup_del_lru(page);
>> >> @@ -1055,6 +1056,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>> >> continue;
>> >>
>> >> default:
>> >> + printk(KERN_ERR "ret %d\n", ret);
>> >> BUG();
>> >> }
>> >>
>> >>> Thank you, Minchan, for the pointer. I reverted the following commit and I
>> >>> haven't seen the problem with the same operation. I haven't looked deeply
>> >>> at the patch yet, but figured it would be a good idea to post the dump.
>> >>> The dump looks not directly related to this patchset, but people can use it
>> >>> to reproduce the problem.
>> >>
>> >> I tested the patch with rsync + fadvise several times
>> >> on my machine (2P, 2G DRAM), but I have never seen the BUG.
>> >> I didn't test it with memcg, though as I look at the dump, it seems unrelated to memcg.
>> >> Anyway, I will try to reproduce it on my machine.
>> >> I may not start testing until after next week. Sorry.
>> >>
>> >> I hope my debugging patch gives some clues.
>> >> Thanks for the report, Ying.
>> >
>> > Sure, I will try the patch and post the result.
>>
>> Minchan:
>>
>> Here is the stack trace after applying your patch. We used
>> trace_printk instead, since plain printk didn't give us the message.
>> ret == 4, so it looks like we are failing the
>> if (PageUnevictable(page)) check.
>>
>> The kernel is based on tag mmotm-2011-04-14-15-08 plus my two memcg
>> patches in this thread, and also the debugging patch.
>>
>> [ 426.696004] kernel BUG at mm/vmscan.c:1061!
>> [ 426.696004] invalid opcode: 0000 [#1] SMP
>> [ 426.696004] Dumping ftrace buffer:
>> [ 426.696004] ---------------------------------
>> [ 426.696004] <...>-546 4d... 426442418us : isolate_pages_global: ret 4
>> [ 426.696004] ---------------------------------
>> [ 426.696004] RIP: 0010:[<ffffffff810ed1b2>] [<ffffffff810ed1b2>]
>> isolate_pages_global+0x1ba/0x37d
>> [ 426.696004] RSP: 0000:ffff88082f8dfb50 EFLAGS: 00010086
>> [ 426.696004] RAX: 0000000000000001 RBX: ffff88082f8dfc90 RCX: 0000000000000000
>> [ 426.696004] RDX: 0000000000000006 RSI: 0000000000000046 RDI: ffff88085f805f80
>> [ 426.696004] RBP: ffff88082f8dfc20 R08: 0000000000000000 R09: 0000000000000007
>> [ 426.696004] R10: 0000000000000005 R11: 0000000000000000 R12: ffff88085ffb6e00
>> [ 426.696004] R13: ffffea001ca66c58 R14: 0000000000000004 R15: ffffea001ca66c30
>> [ 426.696004] FS: 0000000000000000(0000) GS:ffff88085fd00000(0000)
>> knlGS:0000000000000000
>> [ 426.696004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 426.696004] CR2: 00007f0c65c6f320 CR3: 000000082b66f000 CR4: 00000000000006e0
>> [ 426.696004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 426.696004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 426.696004] Process kswapd0 (pid: 546, threadinfo ffff88082f8de000,
>> task ffff88082f83b8e0)
>> [ 426.696004] Stack:
>> [ 426.696004] ffff88085ffb6e00 ffffea0000000002 0000000000000020
>> 0000000000000000
>> [ 426.696004] 0000000000000000 ffff88082f8dfcb8 ffffea00158f58d8
>> ffffea00158f5868
>> [ 426.696004] ffffea00158f5de0 0000000000000001 ffffffffffffffff
>> 0000000000000020
>> [ 426.696004] Call Trace:
>> [ 426.696004] [<ffffffff810ee8e7>] shrink_inactive_list+0x185/0x3c9
>> [ 426.696004] [<ffffffff8107a3fc>] ? lock_timer_base+0x2c/0x52
>> [ 426.696004] [<ffffffff810e8b2d>] ? determine_dirtyable_memory+0x1a/0x2c
>> [ 426.696004] [<ffffffff810ef17c>] shrink_zone+0x380/0x44d
>> [ 426.696004] [<ffffffff810e5180>] ? zone_watermark_ok_safe+0xa1/0xae
>> [ 426.696004] [<ffffffff810efbb9>] kswapd+0x41b/0x76b
>> [ 426.696004] [<ffffffff810ef79e>] ? zone_reclaim+0x2fb/0x2fb
>> [ 426.696004] [<ffffffff81088561>] kthread+0x82/0x8a
>> [ 426.696004] [<ffffffff8141af54>] kernel_thread_helper+0x4/0x10
>> [ 426.696004] [<ffffffff810884df>] ? kthread_worker_fn+0x112/0x112
>> [ 426.696004] [<ffffffff8141af50>] ? gs_change+0xb/0xb
>> [ 426.696004] Code: 01 00 00 89 c6 48 c7 c7 69 52 70 81 31 c0 e8 c1
>> 46 32 00 48 8b 35 37 2b 79 00 44 89 f2 48 c7 c7 8a d1 0e 81 31 c0 e8
>> 09 e2 fd ff <0f> 0b eb fe 49 8b 45 d8 48 b9 00 00 00 00 00 16 00 00 4c
>> 8b 75
>> [ 426.696004] RIP [<ffffffff810ed1b2>] isolate_pages_global+0x1ba/0x37d
>> [ 426.696004] RSP <ffff88082f8dfb50>
>> [ 426.696004] ---[ end trace fbb25b41a0373361 ]---
>> [ 426.696004] Kernel panic - not syncing: Fatal exception
>> [ 426.696004] Pid: 546, comm: kswapd0 Tainted: G D W
>> 2.6.39-smp-Minchan #28
>> [ 426.696004] Call Trace:
>> [ 426.696004] [<ffffffff81411758>] panic+0x91/0x194
>> [ 426.696004] [<ffffffff81414708>] oops_end+0xae/0xbe
>> [ 426.696004] [<ffffffff81039906>] die+0x5a/0x63
>> [ 426.696004] [<ffffffff814141a1>] do_trap+0x121/0x130
>> [ 426.696004] [<ffffffff81037e85>] do_invalid_op+0x96/0x9f
>> [ 426.696004] [<ffffffff810ed1b2>] ? isolate_pages_global+0x1ba/0x37d
>> [ 426.696004] [<ffffffff810c414a>] ? ring_buffer_lock_reserve+0x6a/0x78
>> [ 426.696004] [<ffffffff810c2e3e>] ? rb_commit+0x76/0x78
>> [ 426.696004] [<ffffffff810c2eab>] ? ring_buffer_unlock_commit+0x21/0x25
>> [ 426.696004] [<ffffffff8141add5>] invalid_op+0x15/0x20
>> [ 426.696004] [<ffffffff810ed1b2>] ? isolate_pages_global+0x1ba/0x37d
>> [ 426.696004] [<ffffffff810ee8e7>] shrink_inactive_list+0x185/0x3c9
>> [ 426.696004] [<ffffffff8107a3fc>] ? lock_timer_base+0x2c/0x52
>> [ 426.696004] [<ffffffff810e8b2d>] ? determine_dirtyable_memory+0x1a/0x2c
>> [ 426.696004] [<ffffffff810ef17c>] shrink_zone+0x380/0x44d
>> [ 426.696004] [<ffffffff810e5180>] ? zone_watermark_ok_safe+0xa1/0xae
>> [ 426.696004] [<ffffffff810efbb9>] kswapd+0x41b/0x76b
>> [ 426.696004] [<ffffffff810ef79e>] ? zone_reclaim+0x2fb/0x2fb
>> [ 426.696004] [<ffffffff81088561>] kthread+0x82/0x8a
>> [ 426.696004] [<ffffffff8141af54>] kernel_thread_helper+0x4/0x10
>> [ 426.696004] [<ffffffff810884df>] ? kthread_worker_fn+0x112/0x112
>> [ 426.696004] [<ffffffff8141af50>] ? gs_change+0xb/0xb
>>
>> --Ying
>
> Thanks for the testing.
> I missed the mprotect case in your scenario.
> Yes, I didn't test it at that time. :(
> So it wasn't related to your patch or memcg.
> mprotect creates many unevictable pages, and it seems my deactivate_page() could
> move them onto the inactive list. It's entirely my fault.
> Could you test the patch below?
>
> From b852da870d3b8bcfed567a8dd224a60b7552abc4 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Sat, 30 Apr 2011 08:04:18 +0900
> Subject: [PATCH] Check PageUnevictable in lru_deactivate_fn
>
> lru_deactivate_fn should not move a page that is on the unevictable LRU
> onto the inactive list. Otherwise, we can hit a BUG in isolate_lru_pages,
> as __isolate_lru_page could return -EINVAL.
> It's a real BUG.
>
> Reported-by: Ying Han <yinghan@google.com>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> ---
> mm/swap.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index a83ec5a..298f372 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -426,6 +426,9 @@ static void lru_deactivate_fn(struct page *page, void *arg)
> bool active;
> struct zone *zone = page_zone(page);
>
> + if (PageUnevictable(page))
> + return;
> +
> if (!PageLRU(page))
> return;
Sure, let me test it and I will report the result shortly after. Thank
you for looking into it.

--Ying
> --
> 1.7.1
>
> --
> Kind regards,
> Minchan Kim
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <dont@kvack.org>
* Re: [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim
2011-04-29 23:20 ` Minchan Kim
2011-04-29 23:41 ` Ying Han
@ 2011-04-30 1:33 ` Ying Han
1 sibling, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-04-30 1:33 UTC (permalink / raw)
To: Minchan Kim
Cc: KOSAKI Motohiro, Daisuke Nishimura, Balbir Singh, Tejun Heo,
Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton, Li Zefan,
Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 4:20 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> [quoted text identical to the previous message snipped]
I tested the patch with the same workload, and the BUG doesn't happen,
whereas it normally triggers quickly. So I believe the patch fixes the
problem. Please go ahead and post the patch, and I will give a Tested-by.

Thanks

--Ying
> --
> 1.7.1
>
> --
> Kind regards,
> Minchan Kim
>
* Re: [PATCH 1/2] Add the soft_limit reclaim in global direct reclaim.
2011-04-29 17:44 ` Ying Han
@ 2011-05-02 7:22 ` Johannes Weiner
0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2011-05-02 7:22 UTC (permalink / raw)
To: Ying Han
Cc: Michal Hocko, KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura,
Balbir Singh, Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki,
Andrew Morton, Li Zefan, Mel Gorman, Christoph Lameter,
Rik van Riel, Hugh Dickins, Dave Hansen, Zhu Yanhai, linux-mm
On Fri, Apr 29, 2011 at 10:44:16AM -0700, Ying Han wrote:
> On Fri, Apr 29, 2011 at 6:05 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Thu 28-04-11 15:37:05, Ying Han wrote:
> >> We recently added a change to global background reclaim that
> >> counts the return value of soft_limit reclaim. This patch now adds
> >> similar logic to global direct reclaim.
The changelog is a bit misleading: you don't just add something that
counts something. You add code that can result in actual page
reclamation.
> >> @@ -1980,8 +1983,17 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
> >> continue; /* Let kswapd poll it */
> >> }
> >>
> >> + nr_soft_scanned = 0;
> >> + nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
> >> + sc->order, sc->gfp_mask,
> >> + &nr_soft_scanned);
> >> + sc->nr_reclaimed += nr_soft_reclaimed;
> >> + total_scanned += nr_soft_scanned;
> >> +
> >> shrink_zone(priority, zone, sc);
> >
> > This can cause more aggressive reclaiming, right? Shouldn't we check
> > whether shrink_zone is still needed?
>
> We decided to leave shrink_zone as-is for now, before making further
> changes to soft_limit reclaim. I did the same in my earlier patch for
> global background reclaim. It is safer to do this step-by-step :)
I am sorry, but I kinda lost track of what's going on because there
are so many patches and concurrent discussions... who is we and do
you have a pointer to the email where this conclusion was reached?
And safe how? Do you want to trade a potential regression against a
certain one (overreclaim)?
end of thread, other threads:[~2011-05-02  7:22 UTC | newest]
Thread overview: 24+ messages
-- links below jump to the message on this page --
2011-04-28 22:37 [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
2011-04-28 22:37 ` [PATCH 1/2] Add " Ying Han
2011-04-28 23:25 ` Ying Han
2011-04-29 10:26 ` Balbir Singh
2011-04-29 17:42 ` Ying Han
2011-04-29 13:05 ` Michal Hocko
2011-04-29 17:44 ` Ying Han
2011-05-02 7:22 ` Johannes Weiner
2011-04-28 22:37 ` [PATCH 2/2] Add stats to monitor soft_limit reclaim Ying Han
2011-04-28 23:26 ` Ying Han
2011-04-28 23:51 ` Hiroyuki Kamezawa
2011-04-29 3:28 ` Ying Han
2011-04-29 10:30 ` Balbir Singh
2011-04-29 19:12 ` Ying Han
2011-04-28 23:24 ` [PATCH 0/2] memcg: add the soft_limit reclaim in global direct reclaim Ying Han
2011-04-29 10:23 ` Balbir Singh
2011-04-29 17:17 ` Ying Han
2011-04-29 16:44 ` Minchan Kim
2011-04-29 17:19 ` Ying Han
2011-04-29 17:48 ` Ying Han
2011-04-29 18:58 ` Ying Han
2011-04-29 23:20 ` Minchan Kim
2011-04-29 23:41 ` Ying Han
2011-04-30 1:33 ` Ying Han