* [RFC][PATCH 1/7] memcg: check margin to limit for async reclaim
2011-05-10 10:02 [RFC][PATCH 0/7] memcg async reclaim KAMEZAWA Hiroyuki
@ 2011-05-10 10:04 ` KAMEZAWA Hiroyuki
2011-05-10 10:05 ` [RFC][PATCH 2/7] memcg: count reclaimable pages per zone KAMEZAWA Hiroyuki
` (6 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-10 10:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
Now, the kernel supports transparent hugepages and uses them at each page fault
if configured. Then, if a THP allocation hits the limit of a memcg, it needs to
reclaim HPAGE_SIZE of memory. This tends to require a much larger scan than
SWAP_CLUSTER_MAX and increases latency. Other allocations also see some latency
from page scanning when the limit is hit.
This patch adds logic to keep a usage margin to the limit in an asynchronous way.
When the usage goes over some threshold (determined automatically), asynchronous
memory reclaim runs and shrinks memory to limit - MEMCG_ASYNC_STOP_MARGIN.
With this, the total amount of cpu used to scan the LRU does not change, but we
get a chance to make use of the time applications spend waiting in order to free
memory. For example, when an application reads from a file or socket to fill
newly allocated memory, it needs to wait. Async reclaim can make use of that
time and gives background work a chance to reduce latency.
This patch only includes the hooks required to trigger async reclaim. The core
logic is in the following patches.
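For intuition, a minimal user-space sketch of the start/stop hysteresis
described above (illustrative only: the struct and function names are made up,
and only the 4MB/8MB margins correspond to the MEMCG_ASYNC_START_MARGIN /
MEMCG_ASYNC_STOP_MARGIN defines added below):
==
#include <stdbool.h>
#include <stdio.h>

#define START_MARGIN (4UL << 20) /* async reclaim starts below this margin */
#define STOP_MARGIN  (8UL << 20) /* async reclaim stops above this margin  */

struct toy_memcg {
	unsigned long limit;
	unsigned long usage;
	bool reclaiming;
};

/* Called from the event-check path; decides whether background reclaim
 * should run, with hysteresis between the two margins. */
static void toy_check_margin(struct toy_memcg *m)
{
	unsigned long margin = m->limit - m->usage;

	if (!m->reclaiming && margin <= START_MARGIN)
		m->reclaiming = true;	/* would queue the async work */
	else if (m->reclaiming && margin >= STOP_MARGIN)
		m->reclaiming = false;	/* enough margin recovered */
}

int main(void)
{
	struct toy_memcg m = { .limit = 400UL << 20, .usage = 398UL << 20 };

	toy_check_margin(&m);
	printf("margin  2MB -> reclaiming=%d\n", m.reclaiming);	/* 1 */
	m.usage = 390UL << 20;
	toy_check_margin(&m);
	printf("margin 10MB -> reclaiming=%d\n", m.reclaiming);	/* 0 */
	return 0;
}
==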
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
Index: mmotm-May6/mm/memcontrol.c
===================================================================
--- mmotm-May6.orig/mm/memcontrol.c
+++ mmotm-May6/mm/memcontrol.c
@@ -115,10 +115,12 @@ enum mem_cgroup_events_index {
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
MEM_CGROUP_TARGET_SOFTLIMIT,
+ MEM_CGROUP_TARGET_ASYNC,
MEM_CGROUP_NTARGETS,
};
#define THRESHOLDS_EVENTS_TARGET (128)
#define SOFTLIMIT_EVENTS_TARGET (1024)
+#define ASYNC_EVENTS_TARGET (512) /* assume x86-64's hpagesize */
struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
@@ -211,6 +213,31 @@ static void mem_cgroup_threshold(struct
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
/*
+ * For example, with transparent hugepages, a memory reclaim scan at hitting
+ * the limit can be very long because it must reclaim HPAGE_SIZE of memory.
+ * This increases page fault latency and may cause fallback. At usual page
+ * allocation, we'll see some (shorter) latency, too. To reduce latency, it
+ * helps to free memory in the background and keep a margin to the limit.
+ * This consumes cpu, but we get a chance to make use of the time applications
+ * spend waiting (reading from disk etc.) by reclaiming asynchronously.
+ *
+ * This async reclaim tries to reclaim HPAGE_SIZE * 2 worth of pages when the
+ * margin to the limit is smaller than HPAGE_SIZE * 2. This will be enabled
+ * automatically when the limit is set and is greater than the threshold.
+ */
+#if HPAGE_SIZE != PAGE_SIZE
+#define MEMCG_ASYNC_LIMIT_THRESH (HPAGE_SIZE * 64)
+#define MEMCG_ASYNC_START_MARGIN (HPAGE_SIZE * 2)
+#define MEMCG_ASYNC_STOP_MARGIN (HPAGE_SIZE * 4)
+#else /* make the margin as 4M bytes */
+#define MEMCG_ASYNC_LIMIT_THRESH (128 * 1024 * 1024)
+#define MEMCG_ASYNC_START_MARGIN (4 * 1024 * 1024)
+#define MEMCG_ASYNC_STOP_MARGIN (8 * 1024 * 1024)
+#endif
+
+static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem);
+
+/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -259,6 +286,7 @@ struct mem_cgroup {
/* set when res.limit == memsw.limit */
bool memsw_is_minimum;
+ bool need_async_reclaim;
/* protect arrays of thresholds */
struct mutex thresholds_lock;
@@ -722,6 +750,9 @@ static void __mem_cgroup_target_update(s
case MEM_CGROUP_TARGET_SOFTLIMIT:
next = val + SOFTLIMIT_EVENTS_TARGET;
break;
+ case MEM_CGROUP_TARGET_ASYNC:
+ next = val + ASYNC_EVENTS_TARGET;
+ break;
default:
return;
}
@@ -745,6 +776,11 @@ static void memcg_check_events(struct me
__mem_cgroup_target_update(mem,
MEM_CGROUP_TARGET_SOFTLIMIT);
}
+ if (__memcg_event_check(mem, MEM_CGROUP_TARGET_ASYNC)) {
+ mem_cgroup_may_async_reclaim(mem);
+ __mem_cgroup_target_update(mem,
+ MEM_CGROUP_TARGET_ASYNC);
+ }
}
}
@@ -3376,6 +3412,11 @@ static int mem_cgroup_resize_limit(struc
memcg->memsw_is_minimum = true;
else
memcg->memsw_is_minimum = false;
+
+ if (val >= MEMCG_ASYNC_LIMIT_THRESH)
+ memcg->need_async_reclaim = true;
+ else
+ memcg->need_async_reclaim = false;
}
mutex_unlock(&set_limit_mutex);
@@ -3553,6 +3594,15 @@ unsigned long mem_cgroup_soft_limit_recl
return nr_reclaimed;
}
+static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
+{
+ if (!mem->need_async_reclaim)
+ return;
+ if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_START_MARGIN) {
+ /* Fill here */
+ }
+}
+
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
--
* [RFC][PATCH 2/7] memcg: count reclaimable pages per zone
2011-05-10 10:02 [RFC][PATCH 0/7] memcg async reclaim KAMEZAWA Hiroyuki
2011-05-10 10:04 ` [RFC][PATCH 1/7] memcg: check margin to limit for " KAMEZAWA Hiroyuki
@ 2011-05-10 10:05 ` KAMEZAWA Hiroyuki
2011-05-10 10:07 ` [RFC][PATCH 3/7] memcg: export memcg swappiness KAMEZAWA Hiroyuki
` (5 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-10 10:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
From: Ying Han <yinghan@google.com>
The number of reclaimable pages per zone is useful information for
controlling the memory reclaim schedule. This patch exports it.
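For reference, a tiny user-space sketch of what "reclaimable" means here
(file pages always count; anon pages count only when swap is available).
Everything below is illustrative, the names and numbers are made up:
==
#include <stdio.h>

/* toy per-zone LRU counters, mirroring what the new helper sums up */
struct toy_zone_stat {
	unsigned long active_file, inactive_file;
	unsigned long active_anon, inactive_anon;
};

static unsigned long toy_reclaimable(const struct toy_zone_stat *z,
				     long nr_swap_pages)
{
	unsigned long nr = z->active_file + z->inactive_file;

	/* anon pages are only reclaimable when swap space is available */
	if (nr_swap_pages > 0)
		nr += z->active_anon + z->inactive_anon;
	return nr;
}

int main(void)
{
	struct toy_zone_stat z = { 100, 300, 50, 150 };

	printf("with swap:    %lu\n", toy_reclaimable(&z, 1024));	/* 600 */
	printf("without swap: %lu\n", toy_reclaimable(&z, 0));		/* 400 */
	return 0;
}
==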
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 14 ++++++++++++++
2 files changed, 16 insertions(+)
Index: mmotm-May6/mm/memcontrol.c
===================================================================
--- mmotm-May6.orig/mm/memcontrol.c
+++ mmotm-May6/mm/memcontrol.c
@@ -1198,6 +1198,20 @@ unsigned long mem_cgroup_zone_nr_pages(s
return MEM_CGROUP_ZSTAT(mz, lru);
}
+unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
+ int nid, int zid)
+{
+ unsigned long nr;
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
+ MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
+ if (nr_swap_pages > 0)
+ nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
+ MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
+ return nr;
+}
+
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
struct zone *zone)
{
Index: mmotm-May6/include/linux/memcontrol.h
===================================================================
--- mmotm-May6.orig/include/linux/memcontrol.h
+++ mmotm-May6/include/linux/memcontrol.h
@@ -108,6 +108,8 @@ extern void mem_cgroup_end_migration(str
*/
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
+unsigned long
+mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,
--
* [RFC][PATCH 3/7] memcg: export memcg swappiness
2011-05-10 10:02 [RFC][PATCH 0/7] memcg async reclaim KAMEZAWA Hiroyuki
2011-05-10 10:04 ` [RFC][PATCH 1/7] memcg: check margin to limit for " KAMEZAWA Hiroyuki
2011-05-10 10:05 ` [RFC][PATCH 2/7] memcg: count reclaimable pages per zone KAMEZAWA Hiroyuki
@ 2011-05-10 10:07 ` KAMEZAWA Hiroyuki
2011-05-10 10:08 ` [RFC][PATCH 4/7] memcg : test a memcg is reclaimable KAMEZAWA Hiroyuki
` (4 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-10 10:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
From: Ying Han <yinghan@google.com>
Change mem_cgroup's swappiness interface.
Now, memcg's swappiness helper is defined as 'static' and
the value is passed as an argument to try_to_free_xxxx...
This patch adds a function mem_cgroup_swappiness(), exports it, and
reduces the number of arguments. This interface will be used in async reclaim later.
I think a function is better than passing arguments because it makes it
clearer where the swappiness in scan_control comes from.
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 1 +
include/linux/swap.h | 4 +---
mm/memcontrol.c | 14 ++++++--------
mm/vmscan.c | 9 ++++-----
4 files changed, 12 insertions(+), 16 deletions(-)
Index: mmotm-May6/include/linux/memcontrol.h
===================================================================
--- mmotm-May6.orig/include/linux/memcontrol.h
+++ mmotm-May6/include/linux/memcontrol.h
@@ -111,6 +111,7 @@ int mem_cgroup_inactive_file_is_low(stru
unsigned long
mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
struct zone *zone,
enum lru_list lru);
Index: mmotm-May6/mm/memcontrol.c
===================================================================
--- mmotm-May6.orig/mm/memcontrol.c
+++ mmotm-May6/mm/memcontrol.c
@@ -1321,7 +1321,7 @@ static unsigned long mem_cgroup_margin(s
return margin >> PAGE_SHIFT;
}
-static unsigned int get_swappiness(struct mem_cgroup *memcg)
+unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
struct cgroup *cgrp = memcg->css.cgroup;
@@ -1704,14 +1704,13 @@ static int mem_cgroup_hierarchical_recla
/* we use swappiness of local cgroup */
if (check_soft) {
ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
- noswap, get_swappiness(victim), zone,
- &nr_scanned);
+ noswap, zone, &nr_scanned);
*total_scanned += nr_scanned;
mem_cgroup_soft_steal(victim, is_kswapd, ret);
mem_cgroup_soft_scan(victim, is_kswapd, nr_scanned);
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap, get_swappiness(victim));
+ noswap);
css_put(&victim->css);
/*
* At shrinking usage, we can't check we should stop here or
@@ -3748,8 +3747,7 @@ try_to_free:
ret = -EINTR;
goto out;
}
- progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
- false, get_swappiness(mem));
+ progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, false);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -4181,7 +4179,7 @@ static u64 mem_cgroup_swappiness_read(st
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
- return get_swappiness(memcg);
+ return mem_cgroup_swappiness(memcg);
}
static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
@@ -4867,7 +4865,7 @@ mem_cgroup_create(struct cgroup_subsys *
INIT_LIST_HEAD(&mem->oom_notify);
if (parent)
- mem->swappiness = get_swappiness(parent);
+ mem->swappiness = mem_cgroup_swappiness(parent);
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
Index: mmotm-May6/include/linux/swap.h
===================================================================
--- mmotm-May6.orig/include/linux/swap.h
+++ mmotm-May6/include/linux/swap.h
@@ -252,11 +252,9 @@ static inline void lru_cache_add_file(st
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- unsigned int swappiness);
+ gfp_t gfp_mask, bool noswap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned);
extern int __isolate_lru_page(struct page *page, int mode, int file);
Index: mmotm-May6/mm/vmscan.c
===================================================================
--- mmotm-May6.orig/mm/vmscan.c
+++ mmotm-May6/mm/vmscan.c
@@ -2178,7 +2178,6 @@ unsigned long try_to_free_pages(struct z
unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
- unsigned int swappiness,
struct zone *zone,
unsigned long *nr_scanned)
{
@@ -2188,7 +2187,6 @@ unsigned long mem_cgroup_shrink_node_zon
.may_writepage = !laptop_mode,
.may_unmap = 1,
.may_swap = !noswap,
- .swappiness = swappiness,
.order = 0,
.mem_cgroup = mem,
};
@@ -2196,6 +2194,8 @@ unsigned long mem_cgroup_shrink_node_zon
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
+ sc.swappiness = mem_cgroup_swappiness(mem);
+
trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
sc.may_writepage,
sc.gfp_mask);
@@ -2217,8 +2217,7 @@ unsigned long mem_cgroup_shrink_node_zon
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
- bool noswap,
- unsigned int swappiness)
+ bool noswap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
@@ -2228,7 +2227,6 @@ unsigned long try_to_free_mem_cgroup_pag
.may_unmap = 1,
.may_swap = !noswap,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .swappiness = swappiness,
.order = 0,
.mem_cgroup = mem_cont,
.nodemask = NULL, /* we don't care the placement */
@@ -2245,6 +2243,7 @@ unsigned long try_to_free_mem_cgroup_pag
* scan does not need to be the current node.
*/
nid = mem_cgroup_select_victim_node(mem_cont);
+ sc.swappiness = mem_cgroup_swappiness(mem_cont);
zonelist = NODE_DATA(nid)->node_zonelists;
--
* [RFC][PATCH 4/7] memcg : test a memcg is reclaimable
2011-05-10 10:02 [RFC][PATCH 0/7] memcg async reclaim KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2011-05-10 10:07 ` [RFC][PATCH 3/7] memcg: export memcg swappiness KAMEZAWA Hiroyuki
@ 2011-05-10 10:08 ` KAMEZAWA Hiroyuki
2011-05-10 10:09 ` [RFC][PATCH 5/7] memcg : export select victim memcg KAMEZAWA Hiroyuki
` (3 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-10 10:08 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
Add a function for checking whether a memcg has reclaimable pages. This makes
use of mem->scan_nodes when CONFIG_NUMA=y.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)
Index: mmotm-May6/mm/memcontrol.c
===================================================================
--- mmotm-May6.orig/mm/memcontrol.c
+++ mmotm-May6/mm/memcontrol.c
@@ -1623,11 +1623,30 @@ int mem_cgroup_select_victim_node(struct
return node;
}
+bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg)
+{
+ mem_cgroup_may_update_nodemask(memcg);
+ return !nodes_empty(memcg->scan_nodes);
+}
+
#else
int mem_cgroup_select_victim_node(struct mem_cgroup *mem)
{
return 0;
}
+
+bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg)
+{
+ unsigned long nr;
+ int zid;
+
+ for (zid = NODE_DATA(0)->nr_zones - 1; zid >= 0; zid--)
+ if (mem_cgroup_zone_reclaimable_pages(memcg, 0, zid))
+ break;
+ if (zid < 0)
+ return false;
+ return true;
+}
#endif
/*
Index: mmotm-May6/include/linux/memcontrol.h
===================================================================
--- mmotm-May6.orig/include/linux/memcontrol.h
+++ mmotm-May6/include/linux/memcontrol.h
@@ -110,6 +110,7 @@ int mem_cgroup_inactive_anon_is_low(stru
int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
unsigned long
mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zid);
+bool mem_cgroup_test_reclaimable(struct mem_cgroup *memcg);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
--
* [RFC][PATCH 5/7] memcg : export select victim memcg
2011-05-10 10:02 [RFC][PATCH 0/7] memcg async reclaim KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2011-05-10 10:08 ` [RFC][PATCH 4/7] memcg : test a memcg is reclaimable KAMEZAWA Hiroyuki
@ 2011-05-10 10:09 ` KAMEZAWA Hiroyuki
2011-05-10 10:13 ` [RFC][PATCH 6/7] memcg : static scan for async reclaim KAMEZAWA Hiroyuki
` (2 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-10 10:09 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
A later change will call mem_cgroup_select_victim() from vmscan.c
to do hierarchical reclaim. This needs the interface to be exported and
adds mem_cgroup_release_victim().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 13 +++++++++----
2 files changed, 11 insertions(+), 4 deletions(-)
Index: mmotm-May6/mm/memcontrol.c
===================================================================
--- mmotm-May6.orig/mm/memcontrol.c
+++ mmotm-May6/mm/memcontrol.c
@@ -1555,6 +1555,11 @@ mem_cgroup_select_victim(struct mem_cgro
return ret;
}
+void mem_cgroup_release_victim(struct mem_cgroup *mem)
+{
+ css_put(&mem->css);
+}
+
#if MAX_NUMNODES > 1
/*
@@ -1699,7 +1704,7 @@ static int mem_cgroup_hierarchical_recla
* no reclaimable pages under this hierarchy
*/
if (!check_soft || !total) {
- css_put(&victim->css);
+ mem_cgroup_release_victim(victim);
break;
}
/*
@@ -1710,14 +1715,14 @@ static int mem_cgroup_hierarchical_recla
*/
if (total >= (excess >> 2) ||
(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS)) {
- css_put(&victim->css);
+ mem_cgroup_release_victim(victim);
break;
}
}
}
if (!mem_cgroup_local_usage(victim)) {
/* this cgroup's local usage == 0 */
- css_put(&victim->css);
+ mem_cgroup_release_victim(victim);
continue;
}
/* we use swappiness of local cgroup */
@@ -1730,7 +1735,7 @@ static int mem_cgroup_hierarchical_recla
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
noswap);
- css_put(&victim->css);
+ mem_cgroup_release_victim(victim);
/*
* At shrinking usage, we can't check we should stop here or
* reclaim more. It's depends on callers. last_scanned_child
Index: mmotm-May6/include/linux/memcontrol.h
===================================================================
--- mmotm-May6.orig/include/linux/memcontrol.h
+++ mmotm-May6/include/linux/memcontrol.h
@@ -122,6 +122,8 @@ struct zone_reclaim_stat*
mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+struct mem_cgroup *mem_cgroup_select_victim(struct mem_cgroup *mem);
+void mem_cgroup_release_victim(struct mem_cgroup *mem);
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
--
* [RFC][PATCH 6/7] memcg : static scan for async reclaim
2011-05-10 10:02 [RFC][PATCH 0/7] memcg async reclaim KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2011-05-10 10:09 ` [RFC][PATCH 5/7] memcg : export select victim memcg KAMEZAWA Hiroyuki
@ 2011-05-10 10:13 ` KAMEZAWA Hiroyuki
2011-05-10 10:13 ` [RFC][PATCH 7/7] memcg: workqueue " KAMEZAWA Hiroyuki
2011-05-12 1:28 ` [RFC][PATCH 0/7] memcg " Andrew Morton
7 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-10 10:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
The major change is adding scan_limit to scan_control. I'll need to add more
comments in the code...
==
static scan rate async memory reclaim for memcg.
This patch implements a routine for asynchronous memory reclaim for memory
cgroups, which will be triggered when the usage is near the limit.
This patch includes only the core code for memory freeing.
Asynchronous memory reclaim can help reduce latency because
memory reclaim proceeds while an application needs to wait or compute something.
To do memory reclaim asynchronously, we need some thread or worker.
Unlike nodes or zones, memcgs can be created on demand and a system may
have thousands of memcgs. So, the number of jobs for memcg asynchronous
memory reclaim can in theory be large, the node kswapd code doesn't fit well,
and some scheduling at the memcg layer is needed.
This patch implements static scan rate memory reclaim.
When mem_cgroup_shrink_static_scan() is called, it scans at most
MEMCG_ASYNCSCAN_LIMIT (2048) pages and returns whether memory shrinking
was hard. When the function returns true, the caller can assume memory
reclaim on the memcg was difficult and can add some scheduling delay
to the job.
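To illustrate the static scan rate adjustment done in get_scan_count() below,
here is a rough user-space sketch of the proportional scaling. It is
illustrative only: the array size and the numbers are made up; only
SWAP_CLUSTER_MAX (32) and the 2048-page limit mirror the patch. Note the result
can slightly exceed the limit because small targets are clamped up to
SWAP_CLUSTER_MAX, just as in the real code.
==
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL
#define NR_LRU 4

/* Scale per-LRU scan targets so their sum fits into scan_limit,
 * leaving targets already below SWAP_CLUSTER_MAX untouched. */
static void scale_to_limit(unsigned long nr[NR_LRU], unsigned long scan_limit)
{
	unsigned long total = 0;
	int l;

	for (l = 0; l < NR_LRU; l++)
		total += nr[l];
	if (total <= scan_limit)
		return;
	for (l = 0; l < NR_LRU; l++) {
		if (nr[l] < SWAP_CLUSTER_MAX)
			continue;
		nr[l] = nr[l] * scan_limit / total;
		if (nr[l] < SWAP_CLUSTER_MAX)
			nr[l] = SWAP_CLUSTER_MAX;
	}
}

int main(void)
{
	/* e.g. inactive_anon, active_anon, inactive_file, active_file */
	unsigned long nr[NR_LRU] = { 5000, 40, 9000, 16 };
	int l;

	scale_to_limit(nr, 2048);
	for (l = 0; l < NR_LRU; l++)
		printf("nr[%d] = %lu\n", l, nr[l]);
	return 0;
}
==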
Note:
- I think this concept can be used for enhancing softlimit, too.
But it needs more study.
Changes (since pilot version)
- add scan_limit to scan_control
- support memcg's hierarchy
- removed nodemask on stack
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 2
include/linux/swap.h | 2
mm/memcontrol.c | 7 +
mm/vmscan.c | 164 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 173 insertions(+), 2 deletions(-)
Index: mmotm-May6/mm/vmscan.c
===================================================================
--- mmotm-May6.orig/mm/vmscan.c
+++ mmotm-May6/mm/vmscan.c
@@ -106,6 +106,7 @@ struct scan_control {
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
+ unsigned long scan_limit; /* async reclaim uses static scan rate */
/*
* Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -1717,7 +1718,7 @@ static unsigned long shrink_list(enum lr
static void get_scan_count(struct zone *zone, struct scan_control *sc,
unsigned long *nr, int priority)
{
- unsigned long anon, file, free;
+ unsigned long anon, file, free, total_scan;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
@@ -1807,6 +1808,8 @@ static void get_scan_count(struct zone *
fraction[1] = fp;
denominator = ap + fp + 1;
out:
+ total_scan = 0;
+
for_each_evictable_lru(l) {
int file = is_file_lru(l);
unsigned long scan;
@@ -1833,6 +1836,20 @@ out:
scan = SWAP_CLUSTER_MAX;
}
nr[l] = scan;
+ total_scan += nr[l];
+ }
+ /*
+ * Asynchronous reclaim for memcg uses static scan rate for avoiding
+ * too much cpu consumption. Adjust the scan number to fit scan count
+ * into scan_limit.
+ */
+ if (total_scan > sc->scan_limit) {
+ for_each_evictable_lru(l) {
+ if (nr[l] < SWAP_CLUSTER_MAX)
+ continue;
+ nr[l] = div64_u64(nr[l] * sc->scan_limit, total_scan);
+ nr[l] = max((unsigned long)SWAP_CLUSTER_MAX, nr[l]);
+ }
}
}
@@ -1938,6 +1955,11 @@ restart:
*/
if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
break;
+ /*
+ * static scan rate memory reclaim ?
+ */
+ if (sc->nr_scanned > sc->scan_limit)
+ break;
}
sc->nr_reclaimed += nr_reclaimed;
@@ -2158,6 +2180,7 @@ unsigned long try_to_free_pages(struct z
.order = order,
.mem_cgroup = NULL,
.nodemask = nodemask,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2189,6 +2212,7 @@ unsigned long mem_cgroup_shrink_node_zon
.may_swap = !noswap,
.order = 0,
.mem_cgroup = mem,
+ .scan_limit = ULONG_MAX,
};
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2232,6 +2256,7 @@ unsigned long try_to_free_mem_cgroup_pag
.nodemask = NULL, /* we don't care the placement */
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2257,6 +2282,140 @@ unsigned long try_to_free_mem_cgroup_pag
return nr_reclaimed;
}
+
+/*
+ * Routines for static scan rate memory reclaim for memory cgroup.
+ *
+ * Because asynchronous memory reclaim is served by the kernel as a background
+ * service to reduce latency, we don't want to scan as much as a priority=0
+ * scan by kswapd. We scan at most MEMCG_ASYNCSCAN_LIMIT pages per iteration
+ * and try to free MEMCG_ASYNCSCAN_LIMIT/2 pages. Then, we check our success
+ * rate and return the information to the caller.
+ */
+
+static void shrink_mem_cgroup_node(int nid, int priority, struct scan_control *sc)
+{
+ unsigned long total_scanned = 0;
+ int i;
+
+ for (i = 0; i < NODE_DATA(nid)->nr_zones; i++) {
+ struct zone *zone = NODE_DATA(nid)->node_zones + i;
+ struct zone_reclaim_stat *zrs;
+ unsigned long scan;
+ unsigned long rotate;
+
+ if (!populated_zone(zone))
+ continue;
+ if (!mem_cgroup_zone_reclaimable_pages(sc->mem_cgroup, nid, i))
+ continue;
+ /* If recent scans didn't go well, allow writepage */
+ zrs = get_reclaim_stat(zone, sc);
+ scan = zrs->recent_scanned[0] + zrs->recent_scanned[1];
+ rotate = zrs->recent_rotated[0] + zrs->recent_rotated[1];
+ if (rotate > scan/2)
+ sc->may_writepage = 1;
+
+ sc->nr_scanned = 0;
+ shrink_zone(priority, zone, sc);
+ total_scanned += sc->nr_scanned;
+ sc->may_writepage = 0;
+ if (sc->nr_reclaimed >= sc->nr_to_reclaim)
+ break;
+ }
+ sc->nr_scanned = total_scanned;
+}
+
+#define MEMCG_ASYNCSCAN_LIMIT (2048)
+
+bool mem_cgroup_shrink_static_scan(struct mem_cgroup *mem, long required)
+{
+ int nid, priority, next_prio, noscan, loop_check;
+ unsigned long total_scanned, progress;
+ struct scan_control sc = {
+ .gfp_mask = GFP_HIGHUSER_MOVABLE,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .order = 0,
+ };
+ struct mem_cgroup *victim;
+ bool congested = true;
+
+ /* this param will be set per zone */
+ sc.may_writepage = 0;
+ sc.nr_reclaimed = 0;
+ total_scanned = 0;
+ progress = 0;
+ loop_check = 0;
+ sc.nr_to_reclaim = min(required, MEMCG_ASYNCSCAN_LIMIT/2L);
+ sc.swappiness = mem_cgroup_swappiness(mem);
+
+ current->flags |= PF_SWAPWRITE;
+ /*
+ * We always scan a static number of pages (unlike kswapd) while visiting
+ * victim nodes/zones. next_prio is used to emulate priority.
+ */
+ next_prio = MEMCG_ASYNCSCAN_LIMIT/8;
+ priority = DEF_PRIORITY;
+ noscan = 0;
+ while ((total_scanned < MEMCG_ASYNCSCAN_LIMIT) &&
+ (sc.nr_to_reclaim > sc.nr_reclaimed)) {
+ /* select a victim from hierarchy */
+ victim = mem_cgroup_select_victim(mem);
+ /*
+ * If a memcg was selected twice without making any
+ * progress, break to avoid an endless loop.
+ */
+ if (victim == mem) {
+ if (loop_check && total_scanned == progress) {
+ mem_cgroup_release_victim(victim);
+ break;
+ }
+ progress = total_scanned;
+ loop_check = 1;
+ }
+
+ if (!mem_cgroup_test_reclaimable(victim)) {
+ mem_cgroup_release_victim(victim);
+ continue;
+ }
+ /* select a node to scan */
+ nid = mem_cgroup_select_victim_node(victim);
+
+ sc.mem_cgroup = victim;
+ sc.scan_limit = MEMCG_ASYNCSCAN_LIMIT - total_scanned;
+ shrink_mem_cgroup_node(nid, priority, &sc);
+ if (sc.nr_scanned) {
+ total_scanned += sc.nr_scanned;
+ noscan = 0;
+ } else
+ noscan++;
+ mem_cgroup_release_victim(victim);
+ if (mem_cgroup_async_should_stop(mem))
+ break;
+ if (total_scanned > next_prio) {
+ priority--;
+ next_prio <<= 1;
+ }
+ /* If memory reclaim seems heavy, return that we're congested */
+ if (total_scanned > MEMCG_ASYNCSCAN_LIMIT/4 &&
+ total_scanned > sc.nr_reclaimed*8)
+ break;
+ /*
+ * The whole system is busy or some status update
+ * is not synched. It's better to wait for a while.
+ */
+ if (noscan > 1)
+ break;
+ }
+ current->flags &= ~PF_SWAPWRITE;
+ /*
+ * If we successfully freed half of the target, report that
+ * memory reclaim went smoothly.
+ */
+ if (sc.nr_reclaimed > sc.nr_to_reclaim/2)
+ congested = false;
+ return congested;
+}
#endif
/*
@@ -2380,6 +2539,7 @@ static unsigned long balance_pgdat(pg_da
.swappiness = vm_swappiness,
.order = order,
.mem_cgroup = NULL,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2839,6 +2999,7 @@ unsigned long shrink_all_memory(unsigned
.hibernation_mode = 1,
.swappiness = vm_swappiness,
.order = 0,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -3026,6 +3187,7 @@ static int __zone_reclaim(struct zone *z
.gfp_mask = gfp_mask,
.swappiness = vm_swappiness,
.order = order,
+ .scan_limit = ULONG_MAX,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
Index: mmotm-May6/mm/memcontrol.c
===================================================================
--- mmotm-May6.orig/mm/memcontrol.c
+++ mmotm-May6/mm/memcontrol.c
@@ -1523,7 +1523,7 @@ u64 mem_cgroup_get_limit(struct mem_cgro
* of the cgroup list, since we track last_scanned_child) of @mem and use
* that to reclaim free pages from.
*/
-static struct mem_cgroup *
+struct mem_cgroup *
mem_cgroup_select_victim(struct mem_cgroup *root_mem)
{
struct mem_cgroup *ret = NULL;
@@ -3631,6 +3631,11 @@ unsigned long mem_cgroup_soft_limit_recl
return nr_reclaimed;
}
+bool mem_cgroup_async_should_stop(struct mem_cgroup *mem)
+{
+ return res_counter_margin(&mem->res) >= MEMCG_ASYNC_STOP_MARGIN;
+}
+
static void mem_cgroup_may_async_reclaim(struct mem_cgroup *mem)
{
if (!mem->need_async_reclaim)
Index: mmotm-May6/include/linux/memcontrol.h
===================================================================
--- mmotm-May6.orig/include/linux/memcontrol.h
+++ mmotm-May6/include/linux/memcontrol.h
@@ -124,6 +124,8 @@ extern void mem_cgroup_print_oom_info(st
struct task_struct *p);
struct mem_cgroup *mem_cgroup_select_victim(struct mem_cgroup *mem);
void mem_cgroup_release_victim(struct mem_cgroup *mem);
+bool mem_cgroup_async_should_stop(struct mem_cgroup *mem);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
Index: mmotm-May6/include/linux/swap.h
===================================================================
--- mmotm-May6.orig/include/linux/swap.h
+++ mmotm-May6/include/linux/swap.h
@@ -257,6 +257,8 @@ extern unsigned long mem_cgroup_shrink_n
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned);
+extern bool
+mem_cgroup_shrink_static_scan(struct mem_cgroup *mem, long required);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
--
* [RFC][PATCH 7/7] memcg: workqueue for async reclaim
2011-05-10 10:02 [RFC][PATCH 0/7] memcg async reclaim KAMEZAWA Hiroyuki
` (5 preceding siblings ...)
2011-05-10 10:13 ` [RFC][PATCH 6/7] memcg : static scan for async reclaim KAMEZAWA Hiroyuki
@ 2011-05-10 10:13 ` KAMEZAWA Hiroyuki
2011-05-12 1:28 ` [RFC][PATCH 0/7] memcg " Andrew Morton
7 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-10 10:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
workqueue for memory cgroup asynchronous memory shrinker.
This patch implements the workqueue for the async shrinker routine. Each
memcg has a work item and only one work item can be scheduled at a time.
If shrinking memory doesn't go well, a delay is added to the work.
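For intuition, a user-space sketch of the "only one work item at a time"
exclusion this patch implements with ASYNC_RUNNING/ASYNC_NORESCHED. It is
illustrative only: the names are made up and C11 atomics stand in for the
kernel's bit operations and workqueue calls:
==
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* toy model of the per-memcg flags: only one work item may be
 * queued/running at a time, and noresched stops re-queueing */
struct toy_memcg {
	atomic_bool running;   /* like ASYNC_RUNNING   */
	atomic_bool noresched; /* like ASYNC_NORESCHED */
};

/* caller side: try to start background reclaim for this memcg */
static bool toy_run_shrinker(struct toy_memcg *m)
{
	if (atomic_load(&m->noresched))
		return false;              /* shrinker is being stopped */
	if (atomic_exchange(&m->running, true))
		return false;              /* a work item is already scheduled */
	/* ... queueing the delayed work would go here ... */
	return true;
}

/* worker side: called when the work item will not re-queue itself */
static void toy_finish_shrinker(struct toy_memcg *m)
{
	atomic_store(&m->running, false);
}

int main(void)
{
	struct toy_memcg m = { false, false };

	printf("first start:  %d\n", toy_run_shrinker(&m)); /* 1 */
	printf("second start: %d\n", toy_run_shrinker(&m)); /* 0 */
	toy_finish_shrinker(&m);
	printf("third start:  %d\n", toy_run_shrinker(&m)); /* 1 */
	return 0;
}
==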
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 79 insertions(+), 3 deletions(-)
Index: mmotm-May6/mm/memcontrol.c
===================================================================
--- mmotm-May6.orig/mm/memcontrol.c
+++ mmotm-May6/mm/memcontrol.c
@@ -305,6 +305,12 @@ struct mem_cgroup {
* mem_cgroup ? And what type of charges should we move ?
*/
unsigned long move_charge_at_immigrate;
+
+ /* For asynchronous memory reclaim */
+ struct delayed_work async_work;
+ unsigned long async_work_flags;
+#define ASYNC_NORESCHED (0) /* need to stop scanning */
+#define ASYNC_RUNNING (1) /* a work is in schedule or running. */
/*
* percpu counter.
*/
@@ -3631,6 +3637,74 @@ unsigned long mem_cgroup_soft_limit_recl
return nr_reclaimed;
}
+struct workqueue_struct *memcg_async_shrinker;
+
+static int memcg_async_shrinker_init(void)
+{
+ memcg_async_shrinker = alloc_workqueue("memcg_async",
+ WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_FREEZABLE, 0);
+ return 0;
+}
+module_init(memcg_async_shrinker_init);
+
+static void mem_cgroup_async_shrink(struct work_struct *work)
+{
+ struct delayed_work *dw = to_delayed_work(work);
+ struct mem_cgroup *mem = container_of(dw,
+ struct mem_cgroup, async_work);
+ bool congested = false;
+ int delay = 0;
+ unsigned long long required, usage, limit, shrink_to;
+
+ limit = res_counter_read_u64(&mem->res, RES_LIMIT);
+ shrink_to = limit - MEMCG_ASYNC_STOP_MARGIN - PAGE_SIZE;
+ usage = res_counter_read_u64(&mem->res, RES_USAGE);
+ if (shrink_to <= usage) {
+ required = usage - shrink_to;
+ required = (required >> PAGE_SHIFT) + 1;
+ /*
+ * This scans some number of pages and returns whether memory
+ * reclaim was slow or not. If slow, we add a delay like
+ * congestion_wait() in vmscan.c.
+ */
+ congested = mem_cgroup_shrink_static_scan(mem, (long)required);
+ }
+ if (test_bit(ASYNC_NORESCHED, &mem->async_work_flags)
+ || mem_cgroup_async_should_stop(mem))
+ goto finish_scan;
+ /* If memory reclaim couldn't go well, add delay */
+ if (congested)
+ delay = HZ/10;
+
+ if (queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay))
+ return;
+finish_scan:
+ cgroup_release_and_wakeup_rmdir(&mem->css);
+ clear_bit(ASYNC_RUNNING, &mem->async_work_flags);
+ return;
+}
+
+static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
+{
+ if (test_bit(ASYNC_NORESCHED, &mem->async_work_flags))
+ return;
+ if (test_and_set_bit(ASYNC_RUNNING, &mem->async_work_flags))
+ return;
+ cgroup_exclude_rmdir(&mem->css);
+ if (!queue_delayed_work(memcg_async_shrinker, &mem->async_work, 0)) {
+ cgroup_release_and_wakeup_rmdir(&mem->css);
+ clear_bit(ASYNC_RUNNING, &mem->async_work_flags);
+ }
+ return;
+}
+
+static void stop_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
+{
+ set_bit(ASYNC_NORESCHED, &mem->async_work_flags);
+ flush_delayed_work(&mem->async_work);
+ clear_bit(ASYNC_NORESCHED, &mem->async_work_flags);
+}
+
bool mem_cgroup_async_should_stop(struct mem_cgroup *mem)
{
return res_counter_margin(&mem->res) >= MEMCG_ASYNC_STOP_MARGIN;
@@ -3640,9 +3714,8 @@ static void mem_cgroup_may_async_reclaim
{
if (!mem->need_async_reclaim)
return;
- if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_START_MARGIN) {
- /* Fill here */
- }
+ if (res_counter_margin(&mem->res) <= MEMCG_ASYNC_START_MARGIN)
+ run_mem_cgroup_async_shrinker(mem);
}
/*
@@ -3727,6 +3800,7 @@ move_account:
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
goto out;
ret = -EINTR;
+ stop_mem_cgroup_async_shrinker(mem);
if (signal_pending(current))
goto out;
/* This is for making all *used* pages to be on LRU. */
@@ -4897,6 +4971,7 @@ mem_cgroup_create(struct cgroup_subsys *
mem->swappiness = mem_cgroup_swappiness(parent);
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
+ INIT_DELAYED_WORK(&mem->async_work, mem_cgroup_async_shrink);
mutex_init(&mem->thresholds_lock);
return &mem->css;
free_out:
--
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-10 10:02 [RFC][PATCH 0/7] memcg async reclaim KAMEZAWA Hiroyuki
` (6 preceding siblings ...)
2011-05-10 10:13 ` [RFC][PATCH 7/7] memcg: workqueue " KAMEZAWA Hiroyuki
@ 2011-05-12 1:28 ` Andrew Morton
2011-05-12 1:35 ` KAMEZAWA Hiroyuki
7 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2011-05-12 1:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Tue, 10 May 2011 19:02:16 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Hi, thank you for all comments on previous patches for watermarks for memcg.
>
> This is a new series as 'async reclaim', no watermark.
> This version is a RFC again and I don't ask anyone to test this...but
> comments/review are appreciated.
>
> Major changes are
> - no configurable watermark
> - hierarchy support
> - more fix for static scan rate round robin scanning of memcg.
>
> (assume x86-64 in following.)
>
> 'async reclaim' works when
> - usage > limit - 4MB.
> until
> - usage < limit - 8MB.
>
> when the limit is larger than 128MB. This value of margin to limit
> has some purpose for helping to reduce page fault latency at using
> Transparent hugepage.
>
> Considering THP, we need to reclaim HPAGE_SIZE(2MB) of pages when we hit
> limit and consume HPAGE_SIZE(2MB) immediately. Then, the application need to
> scan 2MB per each page fault and get big latency. So, some margin > HPAGE_SIZE
> is required. I set it as 2*HPAGE_SIZE/4*HPAGE_SIZE, here. The kernel
> will do async reclaim and reduce usage to limit - 8MB in background.
>
> BTW, when an application gets a page, it tend to do some action to fill the
> gotton page. For example, reading data from file/network and fill buffer.
> This implies the application will have a wait or consumes cpu other than
> reclaiming memory. So, if the kernel can help memory freeing in background
> while application does another jobs, application latency can be reduced.
> Then, this kind of asyncronous reclaim of memory will be a help for reduce
> memory reclaim latency by memcg. But the total amount of cpu time consumed
> will not have any difference.
>
> This patch series implements
> - a logic for trigger async reclaim
> - help functions for async reclaim
> - core logic for async reclaim, considering memcg's hierarchy.
> - static scan rate memcg reclaim.
> - workqueue for async reclaim.
>
> Some concern is that I didn't implement a code for handle the case
> most of pages are mlocked or anon memory in swapless system. I need some
> detection logic to avoid hopless async reclaim.
>
What (user-visible) problem is this patchset solving?
IOW, what is the current behaviour, what is wrong with that behaviour
and what effects does the patchset have upon that behaviour?
The sole answer from the above is "latency spikes". Anything else?
Have these spikes been observed and measured? We should have a
testcase/worload with quantitative results to demonstrate and measure
the problem(s), so the effectiveness of the proposed solution can be
understood.
--
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-12 1:28 ` [RFC][PATCH 0/7] memcg " Andrew Morton
@ 2011-05-12 1:35 ` KAMEZAWA Hiroyuki
2011-05-12 2:11 ` Ying Han
2011-05-12 3:51 ` Andrew Morton
0 siblings, 2 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-12 1:35 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Wed, 11 May 2011 18:28:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 10 May 2011 19:02:16 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > Hi, thank you for all comments on previous patches for watermarks for memcg.
> >
> > This is a new series as 'async reclaim', no watermark.
> > This version is a RFC again and I don't ask anyone to test this...but
> > comments/review are appreciated.
> >
> > Major changes are
> > - no configurable watermark
> > - hierarchy support
> > - more fix for static scan rate round robin scanning of memcg.
> >
> > (assume x86-64 in following.)
> >
> > 'async reclaim' works when
> > - usage > limit - 4MB.
> > until
> > - usage < limit - 8MB.
> >
> > when the limit is larger than 128MB. This value of margin to limit
> > has some purpose for helping to reduce page fault latency at using
> > Transparent hugepage.
> >
> > Considering THP, we need to reclaim HPAGE_SIZE(2MB) of pages when we hit
> > limit and consume HPAGE_SIZE(2MB) immediately. Then, the application need to
> > scan 2MB per each page fault and get big latency. So, some margin > HPAGE_SIZE
> > is required. I set it as 2*HPAGE_SIZE/4*HPAGE_SIZE, here. The kernel
> > will do async reclaim and reduce usage to limit - 8MB in background.
> >
> > BTW, when an application gets a page, it tend to do some action to fill the
> > gotton page. For example, reading data from file/network and fill buffer.
> > This implies the application will have a wait or consumes cpu other than
> > reclaiming memory. So, if the kernel can help memory freeing in background
> > while application does another jobs, application latency can be reduced.
> > Then, this kind of asyncronous reclaim of memory will be a help for reduce
> > memory reclaim latency by memcg. But the total amount of cpu time consumed
> > will not have any difference.
> >
> > This patch series implements
> > - a logic for trigger async reclaim
> > - help functions for async reclaim
> > - core logic for async reclaim, considering memcg's hierarchy.
> > - static scan rate memcg reclaim.
> > - workqueue for async reclaim.
> >
> > Some concern is that I didn't implement a code for handle the case
> > most of pages are mlocked or anon memory in swapless system. I need some
> > detection logic to avoid hopless async reclaim.
> >
>
> What (user-visible) problem is this patchset solving?
>
> IOW, what is the current behaviour, what is wrong with that behaviour
> and what effects does the patchset have upon that behaviour?
>
> The sole answer from the above is "latency spikes". Anything else?
>
I think this set has the possibility to fix latency spikes.
For example, with the previous set (which has tuning knobs), do a copy
of a 400M file under a 400M limit.
==
1) == hard limit = 400M ==
[root@rhel6-test hilow]# time cp ./tmpfile xxx
real 0m7.353s
user 0m0.009s
sys 0m3.280s
2) == hard limit 500M/ hi_watermark = 400M ==
[root@rhel6-test hilow]# time cp ./tmpfile xxx
real 0m6.421s
user 0m0.059s
sys 0m2.707s
==
and in both case, memory usage after test was 400M.
IIUC, this speedup is because memory reclaim runs in the background while 'cp'
reads/writes files. But the above test uses 100MB of margin. I guess we don't need
100MB of margin as above, but we will not get full speed with an 8MB margin. There
will be a trade-off because users may want to use memory up to the limit.
So, this set tries to set some 'default' margin, which is not too big, with the
idea of implementing async reclaim without tuning knobs. I'll measure
some more and report it in the next post.
> Have these spikes been observed and measured? We should have a
> testcase/worload with quantitative results to demonstrate and measure
> the problem(s), so the effectiveness of the proposed solution can be
> understood.
>
>
Yes, you're right, of course.
This set just shows the design changes caused by removing the tuning knobs as
a result of the long discussion.
As an output of it, we
1. implement async reclaim without tuning knobs.
2. add some on-demand background reclaim as 'active softlimit', which means
a mode of softlimit that always shrinks memory even if the system has plenty of
free memory. The current softlimit, which works only when memory is short,
will be called 'passive softlimit'.
Thanks,
-Kame
--
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-12 1:35 ` KAMEZAWA Hiroyuki
@ 2011-05-12 2:11 ` Ying Han
2011-05-12 3:51 ` Andrew Morton
1 sibling, 0 replies; 19+ messages in thread
From: Ying Han @ 2011-05-12 2:11 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Wed, May 11, 2011 at 6:35 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 11 May 2011 18:28:44 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Tue, 10 May 2011 19:02:16 +0900 KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > Hi, thank you for all comments on previous patches for watermarks for
> memcg.
> > >
> > > This is a new series as 'async reclaim', no watermark.
> > > This version is a RFC again and I don't ask anyone to test this...but
> > > comments/review are appreciated.
> > >
> > > Major changes are
> > > - no configurable watermark
> > > - hierarchy support
> > > - more fix for static scan rate round robin scanning of memcg.
> > >
> > > (assume x86-64 in following.)
> > >
> > > 'async reclaim' works when
> > > - usage > limit - 4MB.
> > > until
> > > - usage < limit - 8MB.
> > >
> > > when the limit is larger than 128MB. This value of margin to limit
> > > has some purpose for helping to reduce page fault latency at using
> > > Transparent hugepage.
> > >
> > > Considering THP, we need to reclaim HPAGE_SIZE(2MB) of pages when we
> hit
> > > limit and consume HPAGE_SIZE(2MB) immediately. Then, the application
> need to
> > > scan 2MB per each page fault and get big latency. So, some margin >
> HPAGE_SIZE
> > > is required. I set it as 2*HPAGE_SIZE/4*HPAGE_SIZE, here. The kernel
> > > will do async reclaim and reduce usage to limit - 8MB in background.
> > >
> > > BTW, when an application gets a page, it tend to do some action to fill
> the
> > > gotton page. For example, reading data from file/network and fill
> buffer.
> > > This implies the application will have a wait or consumes cpu other
> than
> > > reclaiming memory. So, if the kernel can help memory freeing in
> background
> > > while application does another jobs, application latency can be
> reduced.
> > > Then, this kind of asyncronous reclaim of memory will be a help for
> reduce
> > > memory reclaim latency by memcg. But the total amount of cpu time
> consumed
> > > will not have any difference.
> > >
> > > This patch series implements
> > > - a logic for trigger async reclaim
> > > - help functions for async reclaim
> > > - core logic for async reclaim, considering memcg's hierarchy.
> > > - static scan rate memcg reclaim.
> > > - workqueue for async reclaim.
> > >
> > > Some concern is that I didn't implement a code for handle the case
> > > most of pages are mlocked or anon memory in swapless system. I need
> some
> > > detection logic to avoid hopless async reclaim.
> > >
> >
> > What (user-visible) problem is this patchset solving?
> >
> > IOW, what is the current behaviour, what is wrong with that behaviour
> > and what effects does the patchset have upon that behaviour?
> >
> > The sole answer from the above is "latency spikes". Anything else?
> >
>
> I think this set has possibility to fix latency spike.
>
> For example, in previous set, (which has tuning knobs), do a file copy
> of 400M file under 400M limit.
> ==
> 1) == hard limit = 400M ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx
> real 0m7.353s
> user 0m0.009s
> sys 0m3.280s
>
> 2) == hard limit 500M/ hi_watermark = 400M ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx
>
> real 0m6.421s
> user 0m0.059s
> sys 0m2.707s
> ==
> and in both case, memory usage after test was 400M.
>
> IIUC, this speed up is because memory reclaim runs in background file 'cp'
> read/write files. But above test uses 100MB of margin. I gues we don't need
> 100MB of margin as above but will not get full speed with 8MB margin. There
> will be trade-off because users may want to use memory up to the limit.
>
> So, this set tries to set some 'default' margin, which is not too big and
> has
> idea that implements async reclaim without tuning knobs. I'll measure
> some more and report it in the next post.
>
I can also try to run some workload to measure the performance impact.
Kame, just let me know when you have the patch ready for testing.
> > Have these spikes been observed and measured? We should have a
> > testcase/worload with quantitative results to demonstrate and measure
> > the problem(s), so the effectiveness of the proposed solution can be
> > understood.
> >
> >
>
> Yes, you're right, of course.
> This set just shows the design changes caused by removing tuning knobs as
> a result of long discussion.
>
> As an output of it, we do
> 1. impleimenting async reclaim without tuning knobs.
> 2. add some on-demand background reclaim as 'active softlimit', which
> means
> a mode of softlimit, shrinking memory always even if the system has
> plenty of
> free memory. And current softlimit, which works only when memory are in
> short,
> will be called as 'passive softlimit'.
>
The second one is a useful feature: not only doing watermark-based reclaim,
but proactively reclaiming pages down to the soft_limit. We at Google are
talking about adopting it to guarantee more predictability for applications,
where the soft_limit could be configured to the actual working_set_size.
--Ying
>
> Thanks,
> -Kame
>
>
>
>
>
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-12 1:35 ` KAMEZAWA Hiroyuki
2011-05-12 2:11 ` Ying Han
@ 2011-05-12 3:51 ` Andrew Morton
2011-05-12 4:22 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2011-05-12 3:51 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Thu, 12 May 2011 10:35:03 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > What (user-visible) problem is this patchset solving?
> >
> > IOW, what is the current behaviour, what is wrong with that behaviour
> > and what effects does the patchset have upon that behaviour?
> >
> > The sole answer from the above is "latency spikes". Anything else?
> >
>
> I think this set has possibility to fix latency spike.
>
> For example, in previous set, (which has tuning knobs), do a file copy
> of 400M file under 400M limit.
> ==
> 1) == hard limit = 400M ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx
> real 0m7.353s
> user 0m0.009s
> sys 0m3.280s
>
> 2) == hard limit 500M/ hi_watermark = 400M ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx
>
> real 0m6.421s
> user 0m0.059s
> sys 0m2.707s
> ==
> and in both case, memory usage after test was 400M.
I'm surprised that reclaim consumed so much CPU. But I guess that's a
200,000 page/sec reclaim rate which sounds high(?) but it's - what -
15,000 CPU clocks per page? I don't recall anyone spending much effort
on instrumenting and reducing CPU consumption in reclaim.
Presumably there will be no improvement in CPU consumption on
uniprocessor kernels or in single-CPU containers. More likely a
deterioration.
ahem.
Copying a 400MB file in a non-containered kernel on this 8GB machine
with old, slow CPUs takes 0.64 seconds systime, 0.66 elapsed. Five
times less than your machine. Where the heck did all that CPU time go?
--
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-12 3:51 ` Andrew Morton
@ 2011-05-12 4:22 ` KAMEZAWA Hiroyuki
2011-05-12 8:17 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-12 4:22 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ying Han,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
On Wed, 11 May 2011 20:51:10 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 12 May 2011 10:35:03 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > > What (user-visible) problem is this patchset solving?
> > >
> > > IOW, what is the current behaviour, what is wrong with that behaviour
> > > and what effects does the patchset have upon that behaviour?
> > >
> > > The sole answer from the above is "latency spikes". Anything else?
> > >
> >
> > I think this set has possibility to fix latency spike.
> >
> > For example, in previous set, (which has tuning knobs), do a file copy
> > of 400M file under 400M limit.
> > ==
> > 1) == hard limit = 400M ==
> > [root@rhel6-test hilow]# time cp ./tmpfile xxx
> > real 0m7.353s
> > user 0m0.009s
> > sys 0m3.280s
> >
> > 2) == hard limit 500M/ hi_watermark = 400M ==
> > [root@rhel6-test hilow]# time cp ./tmpfile xxx
> >
> > real 0m6.421s
> > user 0m0.059s
> > sys 0m2.707s
> > ==
> > and in both case, memory usage after test was 400M.
>
> I'm surprised that reclaim consumed so much CPU. But I guess that's a
> 200,000 page/sec reclaim rate which sounds high(?) but it's - what -
> 15,000 CPU clocks per page? I don't recall anyone spending much effort
> on instrumenting and reducing CPU consumption in reclaim.
>
Maybe I need to count the number of congestion_wait() calls in the direct reclaim path.
"priority" may go very high too early.....
(I don't like 'priority' in vmscan.c very much ;)
> Presumably there will be no improvement in CPU consumption on
> uniprocessor kernels or in single-CPU containers. More likely a
> deterioration.
>
Yes, there is no improvement in CPU consumption (as I've repeatedly written);
it just moves when the cpu is consumed.
I wanted a switch to control that, so the admin can schedule the freeing of pages
when he knows the system is idle. But this version drops the knob for
simplicity and checks the 'default' & 'automatic' behaviour. I'll add a knob
again later, and then a knob to turn this feature off, in a natural way.
This is a result from the previous set, which had elapsed-time statistics:
==
# cat /cgroup/memory/A/memory.stat
....
direct_elapsed_ns 0
soft_elapsed_ns 0
wmark_elapsed_ns 103566424
direct_scanned 0
soft_scanned 0
wmark_scanned 29303
direct_freed 0
soft_freed 0
wmark_freed 29290
==
In this run (maybe not a copy, just a 'cat'), async reclaim scanned about 29,000 pages
and consumed about 100 ms in total, i.e. 103566424 ns / 29303 pages, or roughly
3,500 ns per scanned page.
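A minimal userspace sketch (not part of the patch set) of how that per-page cost
can be derived from the wmark_* counters above; the /cgroup/memory/A path and the
stat names are simply the ones used in this example:
==
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/cgroup/memory/A/memory.stat", "r");
	char name[64];
	unsigned long long val, elapsed_ns = 0, scanned = 0;

	if (!f) {
		perror("memory.stat");
		return 1;
	}
	/* memory.stat is a list of "name value" pairs */
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "wmark_elapsed_ns"))
			elapsed_ns = val;
		else if (!strcmp(name, "wmark_scanned"))
			scanned = val;
	}
	fclose(f);
	if (scanned)
		printf("async reclaim cost: %llu ns / %llu pages = %llu ns/page\n",
		       elapsed_ns, scanned, elapsed_ns / scanned);
	return 0;
}
==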
>
> ahem.
>
> Copying a 400MB file in a non-containered kernel on this 8GB machine
> with old, slow CPUs takes 0.64 seconds systime, 0.66 elapsed. Five
> times less than your machine. Where the heck did all that CPU time go?
>
Ah, sorry, the numbers above were on KVM. Without a container:
==
[root@rhel6-test hilow]# time cp ./tmpfile xxx
real 0m5.197s
user 0m0.006s
sys 0m2.599s
==
Hmm, still slow. I'll use real hardware for the next post.
Maybe it's also worth testing with a more complex workload that uses file cache.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-12 4:22 ` KAMEZAWA Hiroyuki
@ 2011-05-12 8:17 ` KAMEZAWA Hiroyuki
2011-05-13 3:03 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-12 8:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Ying Han, Johannes Weiner, Michal Hocko,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
On Thu, 12 May 2011 13:22:37 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 11 May 2011 20:51:10 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> Ah, sorry. above was on KVM. without container.
> ==
> [root@rhel6-test hilow]# time cp ./tmpfile xxx
>
> real 0m5.197s
> user 0m0.006s
> sys 0m2.599s
> ==
> Hmm, still slow. I'll use real hardware in the next post.
>
I'm now testing on a real machine with some fixes, and I see:
== without async reclaim ==
real 0m6.569s
user 0m0.006s
sys 0m0.976s
== with async reclaim ==
real 0m6.305s
user 0m0.007s
sys 0m0.907s
...in general, sys time is always reduced, but 'real' is within the error range ;)
Yes, no gain.
I'll check what code in vmscan.c or mm/ affects memcg and post the required
fixes step by step. I think I've found some.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-12 8:17 ` KAMEZAWA Hiroyuki
@ 2011-05-13 3:03 ` KAMEZAWA Hiroyuki
2011-05-13 5:10 ` Ying Han
0 siblings, 1 reply; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-13 3:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Ying Han, Johannes Weiner, Michal Hocko,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
On Thu, 12 May 2011 17:17:25 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 12 May 2011 13:22:37 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> I'll check what codes in vmscan.c or /mm affects memcg and post a
> required fix in step by step. I think I found some..
>
After some tests, I now suspect this 'automatic' reclaim is unnecessary until
memcg's dirty_ratio is supported. And, as Andrew pointed out, total cpu
consumption is unchanged and I don't have a workload which shows me a
meaningful speedup.
But I guess that with dirty_ratio, the amount of dirty pages in a memcg is
limited and background reclaim can work well enough without the noise of
write_page() while applications are throttled by dirty_ratio.
Hmm, I'll study this for a while, but it seems better to start with an active
soft limit (or some threshold users can set) first.
Anyway, this work made me look at vmscan.c carefully and I think I can
post some patches for fixes and tuning.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-13 3:03 ` KAMEZAWA Hiroyuki
@ 2011-05-13 5:10 ` Ying Han
2011-05-13 9:04 ` KAMEZAWA Hiroyuki
2011-05-14 0:25 ` Ying Han
0 siblings, 2 replies; 19+ messages in thread
From: Ying Han @ 2011-05-13 5:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp, Greg Thelen
On Thu, May 12, 2011 at 8:03 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 12 May 2011 17:17:25 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Thu, 12 May 2011 13:22:37 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > I'll check what codes in vmscan.c or /mm affects memcg and post a
> > required fix in step by step. I think I found some..
> >
>
> After some tests, I doubt that 'automatic' one is unnecessary until
> memcg's dirty_ratio is supported. And as Andrew pointed out,
> total cpu consumption is unchanged and I don't have workloads which
> shows me meaningful speed up.
>
Total cpu consumption is one way to measure the background reclaim;
another thing I would like to measure is a histogram of page fault latency
for a heavy page-allocation application. I would expect that with background
reclaim we will get less variation in page fault latency than without it.
Sorry, I haven't got a chance to run tests to back that up yet. I will try to
get some data.
> But I guess...with dirty_ratio, amount of dirty pages in memcg is
> limited and background reclaim can work enough without noise of
> write_page() while applications are throttled by dirty_ratio.
>
Definitely. I ran into that issue while debugging the soft_limit reclaim:
background reclaim became very inefficient when the amount of dirty pages was
greater than the soft_limit. Talking with Greg about it regarding his
per-memcg dirty page limit effort, we should consider setting the dirty
ratio so that dirty pages cannot grow beyond the reclaim watermarks
(here, the soft_limit).
--Ying
> Hmm, I'll study for a while but it seems better to start active soft limit,
> (or some threshold users can set) first.
>
> Anyway, this work makes me to see vmscan.c carefully and I think I can
> post some patches for fix, tunes.
>
> Thanks,
> -Kame
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-13 5:10 ` Ying Han
@ 2011-05-13 9:04 ` KAMEZAWA Hiroyuki
2011-05-14 0:25 ` Ying Han
1 sibling, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-05-13 9:04 UTC (permalink / raw)
To: Ying Han
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Johannes Weiner, Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp, Greg Thelen
On Thu, 12 May 2011 22:10:30 -0700
Ying Han <yinghan@google.com> wrote:
> On Thu, May 12, 2011 at 8:03 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Thu, 12 May 2011 17:17:25 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Thu, 12 May 2011 13:22:37 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > I'll check what codes in vmscan.c or /mm affects memcg and post a
> > > required fix in step by step. I think I found some..
> > >
> >
> > After some tests, I doubt that 'automatic' one is unnecessary until
> > memcg's dirty_ratio is supported. And as Andrew pointed out,
> > total cpu consumption is unchanged and I don't have workloads which
> > shows me meaningful speed up.
> >
>
> The total cpu consumption is one way to measure the background reclaim,
> another thing I would like to measure is a histogram of page fault latency
> for a heavy page allocation application. I would expect with background
> reclaim, we will get less variation on the page fault latency than w/o it.
>
> Sorry i haven't got chance to run some tests to back it up. I will try to
> get some data.
>
My posted set needs some tweaks and fixes. I'll post a re-tuned one
next week. (But I'll be busy until Wednesday.)
>
> > But I guess...with dirty_ratio, amount of dirty pages in memcg is
> > limited and background reclaim can work enough without noise of
> > write_page() while applications are throttled by dirty_ratio.
> >
>
> Definitely. I have run into the issue while debugging the soft_limit
> reclaim. The background reclaim became very inefficient if we have dirty
> pages greater than the soft_limit. Talking w/ Greg about it regarding his
> per-memcg dirty page limit effort, we should consider setting the dirty
> ratio which not allowing the dirty pages greater the reclaim watermarks
> (here is the soft_limit).
>
I think I got a positive result...in some situations.
On an 8-cpu, 24GB RAM system, under a 300MB memcg, I run 2 programs:
Program 1) while true; do cat ./test/1G > /dev/null; done
           This fills the memcg with clean file cache.
Program 2) malloc(200MB), page-fault it in, and free it, 200 times
           (a rough sketch of this program is shown below).
And measure Program 2's time.
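A rough sketch of Program 2 (not the actual catch_and_release source, just a
minimal equivalent of "malloc 200MB, fault it in, free, repeat 200 times"):
==
#include <stdlib.h>

#define SIZE	(200UL * 1024 * 1024)
#define PAGE	4096UL
#define LOOPS	200

int main(void)
{
	unsigned long off;
	int i;

	for (i = 0; i < LOOPS; i++) {
		char *buf = malloc(SIZE);

		if (!buf)
			return 1;
		/* touch every page so the memcg is charged at fault time */
		for (off = 0; off < SIZE; off += PAGE)
			buf[off] = 1;
		free(buf);
	}
	return 0;
}
==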
Case 1) running only Program2
real 0m17.086s
user 0m0.057s
sys 0m17.257s
Case 2) running Program 1 and 2 without async reclaim.
[kamezawa@bluextal test]$ time ./catch_and_release > /dev/null
real 0m26.182s
user 0m0.115s
sys 0m19.075s
[kamezawa@bluextal test]$ time ./catch_and_release > /dev/null
real 0m23.155s
user 0m0.096s
sys 0m18.175s
[kamezawa@bluextal test]$ time ./catch_and_release > /dev/null
real 0m24.667s
user 0m0.108s
sys 0m18.804s
Case 3) running Programs 1 and 2 with async reclaim keeping an 8MB margin to the limit.
[kamezawa@bluextal test]$ time ./catch_and_release > /dev/null
real 0m21.438s
user 0m0.083s
sys 0m17.864s
[kamezawa@bluextal test]$ time ./catch_and_release > /dev/null
real 0m23.010s
user 0m0.079s
sys 0m17.819s
[kamezawa@bluextal test]$ time ./catch_and_release > /dev/null
real 0m19.596s
user 0m0.108s
sys 0m18.053s
If my test is correct, there is some meaningful positive effect.
But I suspect there may also be cases with negative results.
I guess that to see a positive value, the application shouldn't do 'write' ;)
Anyway, I'll give it another try next week.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-13 5:10 ` Ying Han
2011-05-13 9:04 ` KAMEZAWA Hiroyuki
@ 2011-05-14 0:25 ` Ying Han
2011-05-14 0:29 ` Ying Han
1 sibling, 1 reply; 19+ messages in thread
From: Ying Han @ 2011-05-14 0:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki, Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner,
Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp, Greg Thelen
Here are some tests I ran and the results.
On a 32G machine, I created a memcg with a 4G hard limit (limit_in_bytes)
and ran cat on a 20G file. Then I used getdelays to measure the
ttfp (try_to_free_pages) "delay average" under RECLAIM. When the workload
reaches its hard limit without background reclaim, each ttfp is triggered by a
page fault. I would like to demonstrate the average ttfp delay (and thus the
page fault latency) for the streaming read/write workload and compare it with
per-memcg background reclaim enabled.
Note:
1. I applied a patch to getdelays.c from Fengguang which shows
average CPU/IO/SWAP/RECLAIM delays in ns.
2. I used my latest version of the per-memcg-per-kswapd patch for the
following test. The patch may have been improved since then, and I can run
the same test again when Kame has his patch ready.
Configuration:
$ cat /proc/meminfo
MemTotal: 33045832 kB
$ cat /dev/cgroup/memory/A/memory.limit_in_bytes
4294967296
$ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
low_wmark 4137680896
high_wmark 4085252096
Test:
$ echo $$ >/dev/cgroup/memory/A/tasks
$ cat /export/hdc3/dd_A/tf0 > /dev/zero
Without per-memcg background reclaim:
CPU        count        real total     virtual total   delay total    delay average
           176589       17248377848    27344548685     1093693318     6193.440ns
IO         count        delay total    delay average
           160704       242072632962   1506326ns
SWAP       count        delay total    delay average
           0            0              0ns
RECLAIM    count        delay total    delay average
           15944        3512140153     220279ns
cat: read=20947877888, write=0, cancelled_write=0

real    4m26.912s
user    0m0.227s
sys     0m27.823s
With per-memcg background reclaim:
$ ps -ef | grep memcg
root 5803 2 2 13:56 ? 00:04:20 [memcg_4]
CPU        count        real total     virtual total   delay total    delay average
           161085       13185995424    23863858944     72902585       452.572ns
IO         count        delay total    delay average
           160915       246145533109   1529661ns
SWAP       count        delay total    delay average
           0            0              0ns
RECLAIM    count        delay total    delay average
           0            0              0ns
cat: read=20974891008, write=0, cancelled_write=0

real    4m26.572s
user    0m0.246s
sys     0m24.192s
memcg_4 cputime: 2.86sec
Observations:
1. Without background reclaim, cat hits ttfp heavily and the "delay
average" goes above 220 microseconds.
2. With background reclaim, the ttfp delay average is always 0. Since
ttfp happens synchronously, its delay directly adds to the latency of the
application over time.
3. The real time is slightly better with background reclaim and the sys time
is about the same (adding the memcg_4 time on top of cat's sys time). But I
don't expect a big cpu benefit. Async reclaim uses spare cputime to
proactively reclaim pages on the side, which guarantees less latency
variation for the application over time.
--Ying
On Thu, May 12, 2011 at 10:10 PM, Ying Han <yinghan@google.com> wrote:
>
>
> On Thu, May 12, 2011 at 8:03 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Thu, 12 May 2011 17:17:25 +0900
>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> > On Thu, 12 May 2011 13:22:37 +0900
>> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > I'll check what codes in vmscan.c or /mm affects memcg and post a
>> > required fix in step by step. I think I found some..
>> >
>>
>> After some tests, I doubt that 'automatic' one is unnecessary until
>> memcg's dirty_ratio is supported. And as Andrew pointed out,
>> total cpu consumption is unchanged and I don't have workloads which
>> shows me meaningful speed up.
>>
>
> The total cpu consumption is one way to measure the background reclaim,
> another thing I would like to measure is a histogram of page fault latency
> for a heavy page allocation application. I would expect with background
> reclaim, we will get less variation on the page fault latency than w/o it.
>
> Sorry i haven't got chance to run some tests to back it up. I will try to
> get some data.
>
>
>> But I guess...with dirty_ratio, amount of dirty pages in memcg is
>> limited and background reclaim can work enough without noise of
>> write_page() while applications are throttled by dirty_ratio.
>>
>
> Definitely. I have run into the issue while debugging the soft_limit
> reclaim. The background reclaim became very inefficient if we have dirty
> pages greater than the soft_limit. Talking w/ Greg about it regarding his
> per-memcg dirty page limit effort, we should consider setting the dirty
> ratio which not allowing the dirty pages greater the reclaim watermarks
> (here is the soft_limit).
>
> --Ying
>
>
>> Hmm, I'll study for a while but it seems better to start active soft
>> limit,
>> (or some threshold users can set) first.
>>
>> Anyway, this work makes me to see vmscan.c carefully and I think I can
>> post some patches for fix, tunes.
>>
>> Thanks,
>> -Kame
>>
>>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC][PATCH 0/7] memcg async reclaim
2011-05-14 0:25 ` Ying Han
@ 2011-05-14 0:29 ` Ying Han
0 siblings, 0 replies; 19+ messages in thread
From: Ying Han @ 2011-05-14 0:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki, Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Johannes Weiner,
Michal Hocko, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp, Greg Thelen
Sorry, I forgot to post the script I used to capture the result:
# move this shell into memcg A and start the workload there
echo $$ >/dev/cgroup/memory/A/tasks
time cat /export/hdc3/dd_A/tf0 > /dev/zero &
sleep 10
# move the shell back to the root memcg so the monitoring loop is not charged to A
echo $$ >/dev/cgroup/memory/tasks
# sample cat's delay accounting every 10 seconds
(
while /root/getdelays -dip `pidof cat`;
do
sleep 10;
done
)
--Ying
On Fri, May 13, 2011 at 5:25 PM, Ying Han <yinghan@google.com> wrote:
> Here I ran some tests and the result.
>
> On a 32G machine, I created a memcg with 4G hard_limit (limit_in_bytes)
> and and ran cat on a 20g file. Then I use getdelays to measure the
> ttfp "delay average" under RECLAIM. When the workload is reaching its
> hard_limit and
> without background reclaim, each ttfp is triggered by a pagefault. I would
> like to demostrate the average delay average for ttfp (thus page fault
> latency) on the streaming read/write workload and compare it w/ per-memcg bg
> reclaim enabled.
>
> Note:
> 1. I applied a patch on getdelays.c from fengguang which shows
> average CPU/IO/SWAP/RECLAIM delays in ns.
>
> 2. I used my latest version of per-memcg-per-kswapd patch for the
> following test. The patch could have been improved since then and I can run
> the same test when Kame has his patch ready.
>
> Configuration:
> $ cat /proc/meminfo
> MemTotal: 33045832 kB
>
> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
> 4294967296
>
> $ cat /dev/cgroup/memory/A/memory.reclaim_wmarks
> low_wmark 4137680896
> high_wmark 4085252096
>
> Test:
> $ echo $$ >/dev/cgroup/memory/A/tasks
> $ cat /export/hdc3/dd_A/tf0 > /dev/zero
>
> Without per-memcg background reclaim:
>
> CPU        count        real total     virtual total   delay total    delay average
>            176589       17248377848    27344548685     1093693318     6193.440ns
> IO         count        delay total    delay average
>            160704       242072632962   1506326ns
> SWAP       count        delay total    delay average
>            0            0              0ns
> RECLAIM    count        delay total    delay average
>            15944        3512140153     220279ns
> cat: read=20947877888, write=0, cancelled_write=0
>
> real    4m26.912s
> user    0m0.227s
> sys     0m27.823s
>
> With per-memcg background reclaim:
>
> $ ps -ef | grep memcg
> root 5803 2 2 13:56 ? 00:04:20 [memcg_4]
>
> CPU        count        real total     virtual total   delay total    delay average
>            161085       13185995424    23863858944     72902585       452.572ns
> IO         count        delay total    delay average
>            160915       246145533109   1529661ns
> SWAP       count        delay total    delay average
>            0            0              0ns
> RECLAIM    count        delay total    delay average
>            0            0              0ns
> cat: read=20974891008, write=0, cancelled_write=0
>
> real    4m26.572s
> user    0m0.246s
> sys     0m24.192s
>
> memcg_4 cputime: 2.86sec
>
> Observation:
> 1. Without the background reclaim, the cat hit ttfp heavely and the "delay
> average" goes above 220 microsec.
>
> 2. With background reclaim, the ttfp delay average is always 0. Since the
> ttfp happens synchronously and that implies the latency of the application
> overtime.
>
> 3. The real time goes slighly better w/ bg reclaim and the sys time is
> about the same ( adding the memcg_4 time on top of sys time of cat). But i
> don't expect big cpu benefit. The async reclaim uses spare cputime to
> proactivly reclaim pages on the side which gurantees less latency variation
> of application over time.
>
> --Ying
>
> On Thu, May 12, 2011 at 10:10 PM, Ying Han <yinghan@google.com> wrote:
>
>>
>>
>> On Thu, May 12, 2011 at 8:03 PM, KAMEZAWA Hiroyuki <
>> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>>> On Thu, 12 May 2011 17:17:25 +0900
>>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>
>>> > On Thu, 12 May 2011 13:22:37 +0900
>>> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>> > I'll check what codes in vmscan.c or /mm affects memcg and post a
>>> > required fix in step by step. I think I found some..
>>> >
>>>
>>> After some tests, I doubt that 'automatic' one is unnecessary until
>>> memcg's dirty_ratio is supported. And as Andrew pointed out,
>>> total cpu consumption is unchanged and I don't have workloads which
>>> shows me meaningful speed up.
>>>
>>
>> The total cpu consumption is one way to measure the background reclaim,
>> another thing I would like to measure is a histogram of page fault latency
>> for a heavy page allocation application. I would expect with background
>> reclaim, we will get less variation on the page fault latency than w/o it.
>>
>> Sorry i haven't got chance to run some tests to back it up. I will try to
>> get some data.
>>
>>
>>> But I guess...with dirty_ratio, amount of dirty pages in memcg is
>>> limited and background reclaim can work enough without noise of
>>> write_page() while applications are throttled by dirty_ratio.
>>>
>>
>> Definitely. I have run into the issue while debugging the soft_limit
>> reclaim. The background reclaim became very inefficient if we have dirty
>> pages greater than the soft_limit. Talking w/ Greg about it regarding his
>> per-memcg dirty page limit effort, we should consider setting the dirty
>> ratio which not allowing the dirty pages greater the reclaim watermarks
>> (here is the soft_limit).
>>
>> --Ying
>>
>>
>>> Hmm, I'll study for a while but it seems better to start active soft
>>> limit,
>>> (or some threshold users can set) first.
>>>
>>> Anyway, this work makes me to see vmscan.c carefully and I think I can
>>> post some patches for fix, tunes.
>>>
>>> Thanks,
>>> -Kame
>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 19+ messages in thread