* [PATCH 0/3] memcg softlimit reclaim rework
@ 2011-12-06 23:59 Ying Han
2011-12-06 23:59 ` [PATCH 1/3] memcg: rework softlimit reclaim Ying Han
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Ying Han @ 2011-12-06 23:59 UTC (permalink / raw)
To: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov
Cc: linux-mm
The "soft_limit" was introduced in memcg to support over-committing the
memory resource on the host. Each cgroup configures its "hard_limit" where
it will be throttled or OOM killed by going over the limit. However, the
cgroup can go above the "soft_limit" as long as there is no system-wide
memory contention. So, the "soft_limit" is the kernel mechanism for
re-distributng system spare memory among cgroups.
This patchset reworks softlimit reclaim by hooking it into the new global
reclaim scheme, so that the global reclaim path, including direct reclaim and
background reclaim, will respect the memcg softlimit. At the same time,
per-memcg reclaim will by default scan all the memcgs under the
hierarchy.
On a 64G host, create 12 memcgs, each with a 512M hard limit (limit_in_bytes)
and each reading from a 512M ramdisk. At the same time, set the soft limit of
the last 6 memcgs to 512M, leaving the first 6 at the default of 0. Under
global memory pressure, only the first 6 memcgs (the ones above their soft
limit) get scanned and reclaimed.
$ for ((i=0; i<12; i++)); do cat /path/$i/memory.limit_in_bytes; done
536870912
536870912
536870912
536870912
536870912
536870912
536870912
536870912
536870912
536870912
536870912
536870912
$ for ((i=0; i<12; i++)); do cat /path/$i/memory.soft_limit_in_bytes; done
0
0
0
0
0
0
536870912
536870912
536870912
536870912
536870912
536870912
$ for ((i=0; i<12; i++)); do cat /path/$i/memory.vmscan_stat; done
total_scanned_file_pages_by_system_under_hierarchy 1992169
total_scanned_file_pages_by_system_under_hierarchy 2065410
total_scanned_file_pages_by_system_under_hierarchy 2056609
total_scanned_file_pages_by_system_under_hierarchy 1974422
total_scanned_file_pages_by_system_under_hierarchy 1835338
total_scanned_file_pages_by_system_under_hierarchy 1729919
total_scanned_file_pages_by_system_under_hierarchy 0
total_scanned_file_pages_by_system_under_hierarchy 0
total_scanned_file_pages_by_system_under_hierarchy 0
total_scanned_file_pages_by_system_under_hierarchy 0
total_scanned_file_pages_by_system_under_hierarchy 0
total_scanned_file_pages_by_system_under_hierarchy 0
Note:
1. The vmscan_stat API was reverted upstream, and I am not asking for its
inclusion here. The only reason to include it is to demonstrate the result
of the softlimit reclaim patchset.
2. The patchset is based on next-20111201.
Ying Han (3):
memcg: rework softlimit reclaim
memcg: revert current soft limit reclaim implementation
memcg: track reclaim stats in memory.vmscan_stat
include/linux/memcontrol.h | 36 ++-
include/linux/swap.h | 4 -
kernel/res_counter.c | 1 -
mm/memcontrol.c | 541 +++++++++++++-------------------------------
mm/vmscan.c | 116 ++++------
5 files changed, 233 insertions(+), 465 deletions(-)
--
1.7.3.1
* [PATCH 1/3] memcg: rework softlimit reclaim
2011-12-06 23:59 [PATCH 0/3] memcg softlimit reclaim rework Ying Han
@ 2011-12-06 23:59 ` Ying Han
2011-12-07 2:13 ` KAMEZAWA Hiroyuki
2011-12-06 23:59 ` [PATCH 2/3] memcg: revert current soft limit reclaim implementation Ying Han
2011-12-06 23:59 ` [PATCH 3/3] memcg: track reclaim stats in memory.vmscan_stat Ying Han
2 siblings, 1 reply; 7+ messages in thread
From: Ying Han @ 2011-12-06 23:59 UTC (permalink / raw)
To: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov
Cc: linux-mm
Under shrink_zone(), we examine whether or not to reclaim from a memcg
based on its softlimit. We skip scanning the memcg for the first 3 priority
iterations. This is to balance between isolation and efficiency: we don't
want to stall the system by skipping memcgs with low-hanging fruit forever.
Another change is to set soft_limit_in_bytes to 0 by default. This is needed
for both functional and performance reasons:
1. If all soft limits are set to MAX, the first three priority iterations
are wasted without scanning anything.
2. By default every memcg is eligible for softlimit reclaim, and we can also
set the value to MAX for a special memcg that should be immune to soft limit
reclaim.
Signed-off-by: Ying Han <yinghan@google.com>
---
include/linux/memcontrol.h | 7 ++++
kernel/res_counter.c | 1 -
mm/memcontrol.c | 8 +++++
mm/vmscan.c | 67 ++++++++++++++++++++++++++-----------------
4 files changed, 55 insertions(+), 28 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 81aabfb..53d483b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -107,6 +107,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
struct mem_cgroup_reclaim_cookie *);
void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *);
+
/*
* For memory reclaim.
*/
@@ -293,6 +295,11 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
{
}
+static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *mem)
+{
+ return true;
+}
+
static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *memcg)
{
return 0;
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index b814d6c..92afdc1 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -18,7 +18,6 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
{
spin_lock_init(&counter->lock);
counter->limit = RESOURCE_MAX;
- counter->soft_limit = RESOURCE_MAX;
counter->parent = parent;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4425f62..7c6cade 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -926,6 +926,14 @@ out:
}
EXPORT_SYMBOL(mem_cgroup_count_vm_event);
+bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *mem)
+{
+ if (mem_cgroup_disabled() || mem_cgroup_is_root(mem))
+ return true;
+
+ return res_counter_soft_limit_excess(&mem->res) > 0;
+}
+
/**
* mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
* @zone: zone of the wanted lruvec
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0ba7d35..b36d91b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2091,6 +2091,17 @@ restart:
throttle_vm_writeout(sc->gfp_mask);
}
+static bool should_reclaim_mem_cgroup(struct scan_control *sc,
+ struct mem_cgroup *mem,
+ int priority)
+{
+ if (!global_reclaim(sc) || priority <= DEF_PRIORITY - 3 ||
+ mem_cgroup_soft_limit_exceeded(mem))
+ return true;
+
+ return false;
+}
+
static void shrink_zone(int priority, struct zone *zone,
struct scan_control *sc)
{
@@ -2108,7 +2119,9 @@ static void shrink_zone(int priority, struct zone *zone,
.zone = zone,
};
- shrink_mem_cgroup_zone(priority, &mz, sc);
+ if (should_reclaim_mem_cgroup(sc, memcg, priority))
+ shrink_mem_cgroup_zone(priority, &mz, sc);
+
/*
* Limit reclaim has historically picked one memcg and
* scanned it with decreasing priority levels until
@@ -2152,8 +2165,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
{
struct zoneref *z;
struct zone *zone;
- unsigned long nr_soft_reclaimed;
- unsigned long nr_soft_scanned;
+// unsigned long nr_soft_reclaimed;
+// unsigned long nr_soft_scanned;
bool should_abort_reclaim = false;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -2186,19 +2199,19 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
continue;
}
}
- /*
- * This steals pages from memory cgroups over softlimit
- * and returns the number of reclaimed pages and
- * scanned pages. This works for global memory pressure
- * and balancing, not for a memcg's limit.
- */
- nr_soft_scanned = 0;
- nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
- sc->order, sc->gfp_mask,
- &nr_soft_scanned);
- sc->nr_reclaimed += nr_soft_reclaimed;
- sc->nr_scanned += nr_soft_scanned;
- /* need some check for avoid more shrink_zone() */
+// /*
+// * This steals pages from memory cgroups over softlimit
+// * and returns the number of reclaimed pages and
+// * scanned pages. This works for global memory pressure
+// * and balancing, not for a memcg's limit.
+// */
+// nr_soft_scanned = 0;
+// nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
+// sc->order, sc->gfp_mask,
+// &nr_soft_scanned);
+// sc->nr_reclaimed += nr_soft_reclaimed;
+// sc->nr_scanned += nr_soft_scanned;
+// /* need some check for avoid more shrink_zone() */
}
shrink_zone(priority, zone, sc);
@@ -2590,8 +2603,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long total_scanned;
struct reclaim_state *reclaim_state = current->reclaim_state;
- unsigned long nr_soft_reclaimed;
- unsigned long nr_soft_scanned;
+// unsigned long nr_soft_reclaimed;
+// unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
@@ -2683,15 +2696,15 @@ loop_again:
sc.nr_scanned = 0;
- nr_soft_scanned = 0;
- /*
- * Call soft limit reclaim before calling shrink_zone.
- */
- nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
- order, sc.gfp_mask,
- &nr_soft_scanned);
- sc.nr_reclaimed += nr_soft_reclaimed;
- total_scanned += nr_soft_scanned;
+// nr_soft_scanned = 0;
+// /*
+// * Call soft limit reclaim before calling shrink_zone.
+// */
+// nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
+// order, sc.gfp_mask,
+// &nr_soft_scanned);
+// sc.nr_reclaimed += nr_soft_reclaimed;
+// total_scanned += nr_soft_scanned;
/*
* We put equal pressure on every zone, unless
--
1.7.3.1
* [PATCH 2/3] memcg: revert current soft limit reclaim implementation
2011-12-06 23:59 [PATCH 0/3] memcg softlimit reclaim rework Ying Han
2011-12-06 23:59 ` [PATCH 1/3] memcg: rework softlimit reclaim Ying Han
@ 2011-12-06 23:59 ` Ying Han
2011-12-07 2:15 ` KAMEZAWA Hiroyuki
2011-12-06 23:59 ` [PATCH 3/3] memcg: track reclaim stats in memory.vmscan_stat Ying Han
2 siblings, 1 reply; 7+ messages in thread
From: Ying Han @ 2011-12-06 23:59 UTC (permalink / raw)
To: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov
Cc: linux-mm
This patch reverts the existing softlimit reclaim implementation, and
should be merged together with the previous patch.
Signed-off-by: Ying Han <yinghan@google.com>
---
include/linux/memcontrol.h | 11 --
include/linux/swap.h | 4 -
mm/memcontrol.c | 380 +-------------------------------------------
mm/vmscan.c | 68 --------
4 files changed, 2 insertions(+), 461 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53d483b..25c4170 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -153,9 +153,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned);
u64 mem_cgroup_get_limit(struct mem_cgroup *memcg);
void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
@@ -368,14 +365,6 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
}
static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- return 0;
-}
-
-static inline
u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
{
return 0;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1aded491..64cfbf8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -256,10 +256,6 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- struct zone *zone,
- unsigned long *nr_scanned);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7c6cade..35bf664 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -35,7 +35,6 @@
#include <linux/limits.h>
#include <linux/export.h>
#include <linux/mutex.h>
-#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -107,12 +106,10 @@ enum mem_cgroup_events_index {
*/
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
- MEM_CGROUP_TARGET_SOFTLIMIT,
MEM_CGROUP_TARGET_NUMAINFO,
MEM_CGROUP_NTARGETS,
};
#define THRESHOLDS_EVENTS_TARGET (128)
-#define SOFTLIMIT_EVENTS_TARGET (1024)
#define NUMAINFO_EVENTS_TARGET (1024)
struct mem_cgroup_stat_cpu {
@@ -138,12 +135,6 @@ struct mem_cgroup_per_zone {
struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
struct zone_reclaim_stat reclaim_stat;
- struct rb_node tree_node; /* RB tree node */
- unsigned long long usage_in_excess;/* Set to the value by which */
- /* the soft limit is exceeded*/
- bool on_tree;
- struct mem_cgroup *mem; /* Back pointer, we cannot */
- /* use container_of */
};
/* Macro for accessing counter */
#define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)])
@@ -156,26 +147,6 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
};
-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
- struct rb_root rb_root;
- spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
- struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
- struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
struct mem_cgroup_threshold {
struct eventfd_ctx *eventfd;
u64 threshold;
@@ -327,12 +298,7 @@ static bool move_file(void)
&mc.to->move_charge_at_immigrate);
}
-/*
- * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
- */
#define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
-#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -387,164 +353,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page)
return mem_cgroup_zoneinfo(memcg, nid, zid);
}
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
-
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz,
- unsigned long long new_usage_in_excess)
-{
- struct rb_node **p = &mctz->rb_root.rb_node;
- struct rb_node *parent = NULL;
- struct mem_cgroup_per_zone *mz_node;
-
- if (mz->on_tree)
- return;
-
- mz->usage_in_excess = new_usage_in_excess;
- if (!mz->usage_in_excess)
- return;
- while (*p) {
- parent = *p;
- mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
- tree_node);
- if (mz->usage_in_excess < mz_node->usage_in_excess)
- p = &(*p)->rb_left;
- /*
- * We can't avoid mem cgroups that are over their soft
- * limit by the same amount
- */
- else if (mz->usage_in_excess >= mz_node->usage_in_excess)
- p = &(*p)->rb_right;
- }
- rb_link_node(&mz->tree_node, parent, p);
- rb_insert_color(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- if (!mz->on_tree)
- return;
- rb_erase(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- spin_lock(&mctz->lock);
- __mem_cgroup_remove_exceeded(memcg, mz, mctz);
- spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
-{
- unsigned long long excess;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
- mctz = soft_limit_tree_from_page(page);
-
- /*
- * Necessary to update all ancestors when hierarchy is used.
- * because their event counter is not touched.
- */
- for (; memcg; memcg = parent_mem_cgroup(memcg)) {
- mz = mem_cgroup_zoneinfo(memcg, nid, zid);
- excess = res_counter_soft_limit_excess(&memcg->res);
- /*
- * We have to update the tree if mz is on RB-tree or
- * mem is over its softlimit.
- */
- if (excess || mz->on_tree) {
- spin_lock(&mctz->lock);
- /* if on-tree, remove it */
- if (mz->on_tree)
- __mem_cgroup_remove_exceeded(memcg, mz, mctz);
- /*
- * Insert again. mz->usage_in_excess will be updated.
- * If excess is 0, no tree ops.
- */
- __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- }
- }
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
-{
- int node, zone;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
-
- for_each_node_state(node, N_POSSIBLE) {
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- mz = mem_cgroup_zoneinfo(memcg, node, zone);
- mctz = soft_limit_tree_node_zone(node, zone);
- mem_cgroup_remove_exceeded(memcg, mz, mctz);
- }
- }
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct rb_node *rightmost = NULL;
- struct mem_cgroup_per_zone *mz;
-
-retry:
- mz = NULL;
- rightmost = rb_last(&mctz->rb_root);
- if (!rightmost)
- goto done; /* Nothing to reclaim from */
-
- mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
- /*
- * Remove the node now but someone else can add it back,
- * we will to add it back at the end of reclaim to its correct
- * position in the tree.
- */
- __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
- if (!res_counter_soft_limit_excess(&mz->mem->res) ||
- !css_tryget(&mz->mem->css))
- goto retry;
-done:
- return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct mem_cgroup_per_zone *mz;
-
- spin_lock(&mctz->lock);
- mz = __mem_cgroup_largest_soft_limit_node(mctz);
- spin_unlock(&mctz->lock);
- return mz;
-}
-
/*
* Implementation Note: reading percpu statistics for memcg.
*
@@ -695,9 +503,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
case MEM_CGROUP_TARGET_THRESH:
next = val + THRESHOLDS_EVENTS_TARGET;
break;
- case MEM_CGROUP_TARGET_SOFTLIMIT:
- next = val + SOFTLIMIT_EVENTS_TARGET;
- break;
case MEM_CGROUP_TARGET_NUMAINFO:
next = val + NUMAINFO_EVENTS_TARGET;
break;
@@ -720,10 +525,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
/* threshold event is triggered in finer grain than soft limit */
if (unlikely(mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_THRESH))) {
- bool do_softlimit, do_numainfo;
+ bool do_numainfo;
- do_softlimit = mem_cgroup_event_ratelimit(memcg,
- MEM_CGROUP_TARGET_SOFTLIMIT);
#if MAX_NUMNODES > 1
do_numainfo = mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_NUMAINFO);
@@ -731,8 +534,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
preempt_enable();
mem_cgroup_threshold(memcg);
- if (unlikely(do_softlimit))
- mem_cgroup_update_tree(memcg, page);
#if MAX_NUMNODES > 1
if (unlikely(do_numainfo))
atomic_inc(&memcg->numainfo_events);
@@ -1515,6 +1316,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
break;
if (mem_cgroup_margin(memcg))
break;
+
/*
* If nothing was reclaimed after two attempts, there
* may be no reclaimable pages in this hierarchy.
@@ -1662,59 +1464,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
}
#endif
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
- struct zone *zone,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- struct mem_cgroup *victim = NULL;
- int total = 0;
- int loop = 0;
- unsigned long excess;
- unsigned long nr_scanned;
- struct mem_cgroup_reclaim_cookie reclaim = {
- .zone = zone,
- .priority = 0,
- };
-
- excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
-
- while (1) {
- victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
- if (!victim) {
- loop++;
- if (loop >= 2) {
- /*
- * If we have not been able to reclaim
- * anything, it might because there are
- * no reclaimable pages under this hierarchy
- */
- if (!total)
- break;
- /*
- * We want to do more targeted reclaim.
- * excess >> 2 is not to excessive so as to
- * reclaim too much, nor too less that we keep
- * coming back to reclaim from this cgroup
- */
- if (total >= (excess >> 2) ||
- (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
- break;
- }
- continue;
- }
- if (!mem_cgroup_reclaimable(victim, false))
- continue;
- total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
- zone, &nr_scanned);
- *total_scanned += nr_scanned;
- if (!res_counter_soft_limit_excess(&root_memcg->res))
- break;
- }
- mem_cgroup_iter_break(root_memcg, victim);
- return total;
-}
-
/*
* Check OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
@@ -2480,8 +2229,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
unlock_page_cgroup(pc);
/*
* "charge_statistics" updated event counter. Then, check it.
- * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
- * if they exceeds softlimit.
*/
memcg_check_events(memcg, page);
}
@@ -3512,98 +3259,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
return ret;
}
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- unsigned long nr_reclaimed = 0;
- struct mem_cgroup_per_zone *mz, *next_mz = NULL;
- unsigned long reclaimed;
- int loop = 0;
- struct mem_cgroup_tree_per_zone *mctz;
- unsigned long long excess;
- unsigned long nr_scanned;
-
- if (order > 0)
- return 0;
-
- mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
- /*
- * This loop can run a while, specially if mem_cgroup's continuously
- * keep exceeding their soft limit and putting the system under
- * pressure
- */
- do {
- if (next_mz)
- mz = next_mz;
- else
- mz = mem_cgroup_largest_soft_limit_node(mctz);
- if (!mz)
- break;
-
- nr_scanned = 0;
- reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone,
- gfp_mask, &nr_scanned);
- nr_reclaimed += reclaimed;
- *total_scanned += nr_scanned;
- spin_lock(&mctz->lock);
-
- /*
- * If we failed to reclaim anything from this memory cgroup
- * it is time to move on to the next cgroup
- */
- next_mz = NULL;
- if (!reclaimed) {
- do {
- /*
- * Loop until we find yet another one.
- *
- * By the time we get the soft_limit lock
- * again, someone might have aded the
- * group back on the RB tree. Iterate to
- * make sure we get a different mem.
- * mem_cgroup_largest_soft_limit_node returns
- * NULL if no other cgroup is present on
- * the tree
- */
- next_mz =
- __mem_cgroup_largest_soft_limit_node(mctz);
- if (next_mz == mz)
- css_put(&next_mz->mem->css);
- else /* next_mz == NULL or other memcg */
- break;
- } while (1);
- }
- __mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
- excess = res_counter_soft_limit_excess(&mz->mem->res);
- /*
- * One school of thought says that we should not add
- * back the node to the tree if reclaim returns 0.
- * But our reclaim could return 0, simply because due
- * to priority we are exposing a smaller subset of
- * memory to reclaim from. Consider this as a longer
- * term TODO.
- */
- /* If excess == 0, no tree ops */
- __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- css_put(&mz->mem->css);
- loop++;
- /*
- * Could not reclaim anything and there are no more
- * mem cgroups to try or we seem to be looping without
- * reclaiming anything.
- */
- if (!nr_reclaimed &&
- (next_mz == NULL ||
- loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
- break;
- } while (!nr_reclaimed);
- if (next_mz)
- css_put(&next_mz->mem->css);
- return nr_reclaimed;
-}
-
/*
* This routine traverse page_cgroup in given list and drop them all.
* *And* this routine doesn't reclaim page itself, just removes page_cgroup.
@@ -4718,9 +4373,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
mz = &pn->zoneinfo[zone];
for_each_lru(l)
INIT_LIST_HEAD(&mz->lruvec.lists[l]);
- mz->usage_in_excess = 0;
- mz->on_tree = false;
- mz->mem = memcg;
}
memcg->info.nodeinfo[node] = pn;
return 0;
@@ -4774,7 +4426,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
{
int node;
- mem_cgroup_remove_from_trees(memcg);
free_css_id(&mem_cgroup_subsys, &memcg->css);
for_each_node_state(node, N_POSSIBLE)
@@ -4829,31 +4480,6 @@ static void __init enable_swap_cgroup(void)
}
#endif
-static int mem_cgroup_soft_limit_tree_init(void)
-{
- struct mem_cgroup_tree_per_node *rtpn;
- struct mem_cgroup_tree_per_zone *rtpz;
- int tmp, node, zone;
-
- for_each_node_state(node, N_POSSIBLE) {
- tmp = node;
- if (!node_state(node, N_NORMAL_MEMORY))
- tmp = -1;
- rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
- if (!rtpn)
- return 1;
-
- soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- rtpz = &rtpn->rb_tree_per_zone[zone];
- rtpz->rb_root = RB_ROOT;
- spin_lock_init(&rtpz->lock);
- }
- }
- return 0;
-}
-
static struct cgroup_subsys_state * __ref
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
@@ -4875,8 +4501,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
enable_swap_cgroup();
parent = NULL;
root_mem_cgroup = memcg;
- if (mem_cgroup_soft_limit_tree_init())
- goto free_out;
for_each_possible_cpu(cpu) {
struct memcg_stock_pcp *stock =
&per_cpu(memcg_stock, cpu);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b36d91b..b5e81b7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2165,8 +2165,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
{
struct zoneref *z;
struct zone *zone;
-// unsigned long nr_soft_reclaimed;
-// unsigned long nr_soft_scanned;
bool should_abort_reclaim = false;
for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -2199,19 +2197,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
continue;
}
}
-// /*
-// * This steals pages from memory cgroups over softlimit
-// * and returns the number of reclaimed pages and
-// * scanned pages. This works for global memory pressure
-// * and balancing, not for a memcg's limit.
-// */
-// nr_soft_scanned = 0;
-// nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-// sc->order, sc->gfp_mask,
-// &nr_soft_scanned);
-// sc->nr_reclaimed += nr_soft_reclaimed;
-// sc->nr_scanned += nr_soft_scanned;
-// /* need some check for avoid more shrink_zone() */
}
shrink_zone(priority, zone, sc);
@@ -2388,47 +2373,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- struct zone *zone,
- unsigned long *nr_scanned)
-{
- struct scan_control sc = {
- .nr_scanned = 0,
- .nr_to_reclaim = SWAP_CLUSTER_MAX,
- .may_writepage = !laptop_mode,
- .may_unmap = 1,
- .may_swap = !noswap,
- .order = 0,
- .target_mem_cgroup = mem,
- };
- struct mem_cgroup_zone mz = {
- .mem_cgroup = mem,
- .zone = zone,
- };
-
- sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
- (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
-
- trace_mm_vmscan_memcg_softlimit_reclaim_begin(0,
- sc.may_writepage,
- sc.gfp_mask);
-
- /*
- * NOTE: Although we can get the priority field, using it
- * here is not a good idea, since it limits the pages we can scan.
- * if we don't reclaim here, the shrink_zone from balance_pgdat
- * will pick up pages from other mem cgroup's as well. We hack
- * the priority and make it zero.
- */
- shrink_mem_cgroup_zone(0, &mz, &sc);
-
- trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
-
- *nr_scanned = sc.nr_scanned;
- return sc.nr_reclaimed;
-}
-
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
bool noswap)
@@ -2603,8 +2547,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long total_scanned;
struct reclaim_state *reclaim_state = current->reclaim_state;
-// unsigned long nr_soft_reclaimed;
-// unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
@@ -2696,16 +2638,6 @@ loop_again:
sc.nr_scanned = 0;
-// nr_soft_scanned = 0;
-// /*
-// * Call soft limit reclaim before calling shrink_zone.
-// */
-// nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
-// order, sc.gfp_mask,
-// &nr_soft_scanned);
-// sc.nr_reclaimed += nr_soft_reclaimed;
-// total_scanned += nr_soft_scanned;
-
/*
* We put equal pressure on every zone, unless
* one zone has way too many pages free
--
1.7.3.1
* [PATCH 3/3] memcg: track reclaim stats in memory.vmscan_stat
2011-12-06 23:59 [PATCH 0/3] memcg softlimit reclaim rework Ying Han
2011-12-06 23:59 ` [PATCH 1/3] memcg: rework softlimit reclaim Ying Han
2011-12-06 23:59 ` [PATCH 2/3] memcg: revert current soft limit reclaim implementation Ying Han
@ 2011-12-06 23:59 ` Ying Han
2 siblings, 0 replies; 7+ messages in thread
From: Ying Han @ 2011-12-06 23:59 UTC (permalink / raw)
To: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
Johannes Weiner, Mel Gorman, KAMEZAWA Hiroyuki, Pavel Emelyanov
Cc: linux-mm
Not asking for inclusion; this is for testing purposes only.
The API tracks the number of scanned and freed pages during page reclaim,
as well as the total time spent in shrink_zone(). Counts are broken down
by context (system vs. limit reclaim, target vs. under hierarchy) and by
type (anon vs. file).
"_by_limit": per-memcg reclaim and memcg is the target
"_by_system": global reclaim and memcg is the target
"_by_limit_under_hierarchy": per-memcg reclaim and memcg is under the hierarchy
"_by_system_under_hierarchy": global reclaim and memcg is under the hierarchy
Sample output:
$ cat /.../memory.vmscan_stat
...
scanned_pages_by_limit 3954818
scanned_anon_pages_by_limit 0
scanned_file_pages_by_limit 3954818
freed_pages_by_limit 3929770
freed_anon_pages_by_limit 0
freed_file_pages_by_limit 3929770
elapsed_ns_by_limit 3386358102
...
Signed-off-by: Ying Han <yinghan@google.com>
---
include/linux/memcontrol.h | 18 +++++
mm/memcontrol.c | 153 +++++++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 35 ++++++++++-
3 files changed, 203 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 25c4170..4afc144 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -38,6 +38,12 @@ struct mem_cgroup_reclaim_cookie {
unsigned int generation;
};
+struct memcg_scan_record {
+ unsigned long nr_scanned[2]; /* the number of scanned pages */
+ unsigned long nr_freed[2]; /* the number of freed pages */
+ unsigned long elapsed; /* nsec of time elapsed while scanning */
+};
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
/*
* All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -126,6 +132,10 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+void mem_cgroup_record_scanstat(struct mem_cgroup *mem,
+ struct memcg_scan_record *rec,
+ bool global, bool hierarchy);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -378,6 +388,14 @@ static inline
void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline void
+mem_cgroup_record_scanstat(struct mem_cgroup *mem,
+ struct memcg_scan_record *rec,
+ bool global, bool hierarchy)
+{
+}
+
#endif /* CONFIG_CGROUP_MEM_CONT */
#if !defined(CONFIG_CGROUP_MEM_RES_CTLR) || !defined(CONFIG_DEBUG_VM)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35bf664..894e0d2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -112,10 +112,30 @@ enum mem_cgroup_events_target {
#define THRESHOLDS_EVENTS_TARGET (128)
#define NUMAINFO_EVENTS_TARGET (1024)
+enum mem_cgroup_scan_context {
+ SCAN_BY_SYSTEM,
+ SCAN_BY_SYSTEM_UNDER_HIERARCHY,
+ SCAN_BY_LIMIT,
+ SCAN_BY_LIMIT_UNDER_HIERARCHY,
+ NR_SCAN_CONTEXT,
+};
+
+enum mem_cgroup_scan_stat {
+ SCANNED,
+ SCANNED_ANON,
+ SCANNED_FILE,
+ FREED,
+ FREED_ANON,
+ FREED_FILE,
+ ELAPSED,
+ NR_SCAN_STAT,
+};
+
struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
unsigned long targets[MEM_CGROUP_NTARGETS];
+ unsigned long scanstats[NR_SCAN_CONTEXT][NR_SCAN_STAT];
};
struct mem_cgroup_reclaim_iter {
@@ -542,6 +562,58 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
preempt_enable();
}
+void mem_cgroup_record_scanstat(struct mem_cgroup *mem,
+ struct memcg_scan_record *rec,
+ bool global, bool hierarchy)
+{
+ int context;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ if (global)
+ context = SCAN_BY_SYSTEM;
+ else
+ context = SCAN_BY_LIMIT;
+ if (hierarchy)
+ context++;
+
+ this_cpu_add(mem->stat->scanstats[context][SCANNED],
+ rec->nr_scanned[0] + rec->nr_scanned[1]);
+ this_cpu_add(mem->stat->scanstats[context][SCANNED_ANON],
+ rec->nr_scanned[0]);
+ this_cpu_add(mem->stat->scanstats[context][SCANNED_FILE],
+ rec->nr_scanned[1]);
+
+ this_cpu_add(mem->stat->scanstats[context][FREED],
+ rec->nr_freed[0] + rec->nr_freed[1]);
+ this_cpu_add(mem->stat->scanstats[context][FREED_ANON],
+ rec->nr_freed[0]);
+ this_cpu_add(mem->stat->scanstats[context][FREED_FILE],
+ rec->nr_freed[1]);
+
+ this_cpu_add(mem->stat->scanstats[context][ELAPSED],
+ rec->elapsed);
+}
+
+static long mem_cgroup_read_scan_stat(struct mem_cgroup *mem,
+ int context, int stat)
+{
+ long val = 0;
+ int cpu;
+
+ get_online_cpus();
+ for_each_online_cpu(cpu)
+ val += per_cpu(mem->stat->scanstats[context][stat], cpu);
+#ifdef CONFIG_HOTPLUG_CPU
+ spin_lock(&mem->pcp_counter_lock);
+ val += mem->nocpu_base.scanstats[context][stat];
+ spin_unlock(&mem->pcp_counter_lock);
+#endif
+ put_online_cpus();
+ return val;
+}
+
static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
{
return container_of(cgroup_subsys_state(cont,
@@ -3672,10 +3744,12 @@ struct mcs_total_stat {
s64 stat[NR_MCS_STAT];
};
-struct {
+struct mem_cgroup_stat_name {
char *local_name;
char *total_name;
-} memcg_stat_strings[NR_MCS_STAT] = {
+};
+
+struct mem_cgroup_stat_name memcg_stat_strings[NR_MCS_STAT] = {
{"cache", "total_cache"},
{"rss", "total_rss"},
{"mapped_file", "total_mapped_file"},
@@ -4234,6 +4308,77 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
}
#endif /* CONFIG_NUMA */
+struct scan_stat {
+ unsigned long stats[NR_SCAN_CONTEXT][NR_SCAN_STAT];
+};
+
+struct mem_cgroup_stat_name scan_stat_strings[NR_SCAN_STAT] = {
+ {"scanned_pages", "total_scanned_pages"},
+ {"scanned_anon_pages", "total_scanned_anon_pages"},
+ {"scanned_file_pages", "total_scanned_file_pages"},
+ {"freed_pages", "total_freed_pages"},
+ {"freed_anon_pages", "total_freed_anon_pages"},
+ {"freed_file_pages", "total_freed_file_pages"},
+ {"elapsed_ns", "total_elapsed_ns"},
+};
+
+static const char *scan_context_strings[NR_SCAN_CONTEXT] = {
+ "_by_system",
+ "_by_system_under_hierarchy",
+ "_by_limit",
+ "_by_limit_under_hierarchy",
+};
+
+static void mem_cgroup_get_scan_stat(struct mem_cgroup *mem,
+ struct scan_stat *s)
+{
+ int i, j;
+
+ for (i = 0; i < NR_SCAN_CONTEXT; i++)
+ for (j = 0; j < NR_SCAN_STAT; j++)
+ s->stats[i][j] += mem_cgroup_read_scan_stat(mem, i, j);
+}
+
+static void mem_cgroup_get_total_scan_stat(struct mem_cgroup *mem,
+ struct scan_stat *s)
+{
+ struct mem_cgroup *iter;
+
+ for_each_mem_cgroup_tree(iter, mem)
+ mem_cgroup_get_scan_stat(iter, s);
+}
+
+static int mem_cgroup_scan_stat_show(struct cgroup *cont, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ struct scan_stat s;
+ char string[64];
+ int i, j;
+
+ memset(&s, 0, sizeof(s));
+ mem_cgroup_get_scan_stat(mem, &s);
+ for (i = 0; i < NR_SCAN_CONTEXT; i++) {
+ for (j = 0; j < NR_SCAN_STAT; j++) {
+ strcpy(string, scan_stat_strings[j].local_name);
+ strcat(string, scan_context_strings[i]);
+ cb->fill(cb, string, s.stats[i][j]);
+ }
+ }
+
+ memset(&s, 0, sizeof(s));
+ mem_cgroup_get_total_scan_stat(mem, &s);
+ for (i = 0; i < NR_SCAN_CONTEXT; i++) {
+ for (j = 0; j < NR_SCAN_STAT; j++) {
+ strcpy(string, scan_stat_strings[j].total_name);
+ strcat(string, scan_context_strings[i]);
+ cb->fill(cb, string, s.stats[i][j]);
+ }
+ }
+
+ return 0;
+}
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -4304,6 +4449,10 @@ static struct cftype mem_cgroup_files[] = {
.mode = S_IRUGO,
},
#endif
+ {
+ .name = "vmscan_stat",
+ .read_map = mem_cgroup_scan_stat_show,
+ },
};
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b5e81b7..669d8c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -110,6 +110,11 @@ struct scan_control {
struct mem_cgroup *target_mem_cgroup;
/*
+ * Stats tracked during page reclaim.
+ */
+ struct memcg_scan_record *memcg_record;
+
+ /*
* Nodemask of nodes allowed by the caller. If NULL, all nodes
* are scanned.
*/
@@ -1522,6 +1527,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
nr_taken = isolate_pages(nr_to_scan, mz, &page_list,
&nr_scanned, sc->order,
reclaim_mode, 0, file);
+
+ sc->memcg_record->nr_scanned[file] += nr_scanned;
+
if (global_reclaim(sc)) {
zone->pages_scanned += nr_scanned;
if (current_is_kswapd())
@@ -1551,6 +1559,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
priority, &nr_dirty, &nr_writeback);
}
+ sc->memcg_record->nr_freed[file] += nr_reclaimed;
+
local_irq_disable();
if (current_is_kswapd())
__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
@@ -1675,6 +1685,9 @@ static void shrink_active_list(unsigned long nr_pages,
&pgscanned, sc->order,
reclaim_mode, 1, file);
+ if (sc->memcg_record)
+ sc->memcg_record->nr_scanned[file] += pgscanned;
+
if (global_reclaim(sc))
zone->pages_scanned += pgscanned;
@@ -2111,6 +2124,9 @@ static void shrink_zone(int priority, struct zone *zone,
.priority = priority,
};
struct mem_cgroup *memcg;
+ struct memcg_scan_record rec;
+
+ sc->memcg_record = &rec;
memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
@@ -2119,9 +2135,21 @@ static void shrink_zone(int priority, struct zone *zone,
.zone = zone,
};
- if (should_reclaim_mem_cgroup(sc, memcg, priority))
+ if (should_reclaim_mem_cgroup(sc, memcg, priority)) {
+ unsigned long start, end;
+
+ memset(&rec, 0, sizeof(rec));
+ start = sched_clock();
+
shrink_mem_cgroup_zone(priority, &mz, sc);
+ end = sched_clock();
+ rec.elapsed = end - start;
+ mem_cgroup_record_scanstat(memcg, &rec,
+ global_reclaim(sc),
+ root != memcg);
+ }
+
/*
* Limit reclaim has historically picked one memcg and
* scanned it with decreasing priority levels until
@@ -2355,6 +2383,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.order = order,
.target_mem_cgroup = NULL,
.nodemask = nodemask,
+ .memcg_record = NULL,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2390,6 +2419,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.nodemask = NULL, /* we don't care the placement */
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+ .memcg_record = NULL,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -2558,6 +2588,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
.nr_to_reclaim = ULONG_MAX,
.order = order,
.target_mem_cgroup = NULL,
+ .memcg_record = NULL,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -3029,6 +3060,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.nr_to_reclaim = nr_to_reclaim,
.hibernation_mode = 1,
.order = 0,
+ .memcg_record = NULL,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
@@ -3215,6 +3247,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
.order = order,
+ .memcg_record = NULL,
};
struct shrink_control shrink = {
.gfp_mask = sc.gfp_mask,
--
1.7.3.1
* Re: [PATCH 1/3] memcg: rework softlimit reclaim
2011-12-06 23:59 ` [PATCH 1/3] memcg: rework softlimit reclaim Ying Han
@ 2011-12-07 2:13 ` KAMEZAWA Hiroyuki
2011-12-07 17:39 ` Ying Han
0 siblings, 1 reply; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-07 2:13 UTC (permalink / raw)
To: Ying Han
Cc: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
Johannes Weiner, Mel Gorman, Pavel Emelyanov, linux-mm
On Tue, 6 Dec 2011 15:59:57 -0800
Ying Han <yinghan@google.com> wrote:
> Under shrink_zone(), we examine whether or not to reclaim from a memcg
> based on its softlimit. We skip scanning the memcg for the first 3 priority
> iterations. This is to balance between isolation and efficiency: we don't
> want to stall the system by skipping memcgs with low-hanging fruit forever.
>
> Another change is to set soft_limit_in_bytes to 0 by default. This is needed
> for both functional and performance reasons:
>
> 1. If all soft limits are set to MAX, the first three priority iterations
> are wasted without scanning anything.
>
> 2. By default every memcg is eligible for softlimit reclaim, and we can also
> set the value to MAX for a special memcg that should be immune to soft limit
> reclaim.
>
Could you update the softlimit doc?
> Signed-off-by: Ying Han <yinghan@google.com>
> ---
> include/linux/memcontrol.h | 7 ++++
> kernel/res_counter.c | 1 -
> mm/memcontrol.c | 8 +++++
> mm/vmscan.c | 67 ++++++++++++++++++++++++++-----------------
> 4 files changed, 55 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 81aabfb..53d483b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -107,6 +107,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
> struct mem_cgroup_reclaim_cookie *);
> void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
>
> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *);
> +
> /*
> * For memory reclaim.
> */
> @@ -293,6 +295,11 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
> {
> }
>
> +static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *mem)
> +{
> + return true;
> +}
> +
> static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *memcg)
> {
> return 0;
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index b814d6c..92afdc1 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -18,7 +18,6 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
> {
> spin_lock_init(&counter->lock);
> counter->limit = RESOURCE_MAX;
> - counter->soft_limit = RESOURCE_MAX;
> counter->parent = parent;
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4425f62..7c6cade 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -926,6 +926,14 @@ out:
> }
> EXPORT_SYMBOL(mem_cgroup_count_vm_event);
>
> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *mem)
> +{
> + if (mem_cgroup_disabled() || mem_cgroup_is_root(mem))
> + return true;
> +
> + return res_counter_soft_limit_excess(&mem->res) > 0;
> +}
> +
> /**
> * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
> * @zone: zone of the wanted lruvec
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0ba7d35..b36d91b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2091,6 +2091,17 @@ restart:
> throttle_vm_writeout(sc->gfp_mask);
> }
>
> +static bool should_reclaim_mem_cgroup(struct scan_control *sc,
> + struct mem_cgroup *mem,
> + int priority)
> +{
> + if (!global_reclaim(sc) || priority <= DEF_PRIORITY - 3 ||
> + mem_cgroup_soft_limit_exceeded(mem))
> + return true;
> +
> + return false;
> +}
> +
Why "priority <= DEF_PRIORTY - 3" is selected ?
It seems there is no reason. Could you justify this check ?
Thinking briefly, can't we calculate the ratio as
  number of pages in reclaimable memcgs / number of reclaimable pages
and use 'priority'? If
  total_reclaimable_pages >> priority > number of pages in reclaimable memcgs
then memcgs under their soft limit should be scanned too; that way we can
avoid scanning pages twice.
Hmm, please give a reason for the magic value here, anyway.
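Just for illustration, a check along those lines might look roughly like the
sketch below; zone_reclaimable_pages_over_softlimit() is a hypothetical
helper, not something that exists in this patch:

static bool should_reclaim_mem_cgroup(struct scan_control *sc,
				      struct mem_cgroup *mem,
				      struct zone *zone,
				      int priority)
{
	unsigned long total, over_soft;

	if (!global_reclaim(sc) || mem_cgroup_soft_limit_exceeded(mem))
		return true;

	total = zone_reclaimable_pages(zone);
	/* pages charged to memcgs already over their soft limit (hypothetical) */
	over_soft = zone_reclaimable_pages_over_softlimit(zone);

	/*
	 * If the scan window at this priority cannot be satisfied from
	 * over-soft-limit memcgs alone, widen reclaim to memcgs that are
	 * still below their soft limit.
	 */
	return (total >> priority) > over_soft;
}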
> static void shrink_zone(int priority, struct zone *zone,
> struct scan_control *sc)
> {
> @@ -2108,7 +2119,9 @@ static void shrink_zone(int priority, struct zone *zone,
> .zone = zone,
> };
>
> - shrink_mem_cgroup_zone(priority, &mz, sc);
> + if (should_reclaim_mem_cgroup(sc, memcg, priority))
> + shrink_mem_cgroup_zone(priority, &mz, sc);
> +
> /*
> * Limit reclaim has historically picked one memcg and
> * scanned it with decreasing priority levels until
> @@ -2152,8 +2165,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
> {
> struct zoneref *z;
> struct zone *zone;
> - unsigned long nr_soft_reclaimed;
> - unsigned long nr_soft_scanned;
> +// unsigned long nr_soft_reclaimed;
> +// unsigned long nr_soft_scanned;
Why do you leave these things?
Hmm, but the whole logic seems clean to me except for the magic number.
Thanks,
-Kame
* Re: [PATCH 2/3] memcg: revert current soft limit reclaim implementation
2011-12-06 23:59 ` [PATCH 2/3] memcg: revert current soft limit reclaim implementation Ying Han
@ 2011-12-07 2:15 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 7+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-07 2:15 UTC (permalink / raw)
To: Ying Han
Cc: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
Johannes Weiner, Mel Gorman, Pavel Emelyanov, linux-mm
On Tue, 6 Dec 2011 15:59:58 -0800
Ying Han <yinghan@google.com> wrote:
> This patch reverts the existing softlimit reclaim implementation, and
> should be merged together with the previous patch.
>
> Signed-off-by: Ying Han <yinghan@google.com>
I'm OK with this. Because of the changes in vmscan.c, it is no longer worth
keeping the per-zone softlimit statistics; all memcgs under the tree will be
visited anyway.
Thanks,
-Kame
* Re: [PATCH 1/3] memcg: rework softlimit reclaim
2011-12-07 2:13 ` KAMEZAWA Hiroyuki
@ 2011-12-07 17:39 ` Ying Han
0 siblings, 0 replies; 7+ messages in thread
From: Ying Han @ 2011-12-07 17:39 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Michal Hocko, Balbir Singh, Rik van Riel, Hugh Dickins,
Johannes Weiner, Mel Gorman, Pavel Emelyanov, linux-mm
On Tue, Dec 6, 2011 at 6:13 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 6 Dec 2011 15:59:57 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> Under shrink_zone(), we examine whether or not to reclaim from a memcg
>> based on its softlimit. We skip scanning the memcg for the first 3 priority
>> iterations. This is to balance between isolation and efficiency: we don't
>> want to stall the system by skipping memcgs with low-hanging fruit forever.
>>
>> Another change is to set soft_limit_in_bytes to 0 by default. This is needed
>> for both functional and performance reasons:
>>
>> 1. If all soft limits are set to MAX, the first three priority iterations
>> are wasted without scanning anything.
>>
>> 2. By default every memcg is eligible for softlimit reclaim, and we can also
>> set the value to MAX for a special memcg that should be immune to soft limit
>> reclaim.
>>
>
> Could you update the softlimit doc?
Will do.
>
>
>
>> Signed-off-by: Ying Han <yinghan@google.com>
>> ---
>> include/linux/memcontrol.h | 7 ++++
>> kernel/res_counter.c | 1 -
>> mm/memcontrol.c | 8 +++++
>> mm/vmscan.c | 67 ++++++++++++++++++++++++++-----------------
>> 4 files changed, 55 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 81aabfb..53d483b 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -107,6 +107,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
>> struct mem_cgroup_reclaim_cookie *);
>> void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
>>
>> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *);
>> +
>> /*
>> * For memory reclaim.
>> */
>> @@ -293,6 +295,11 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
>> {
>> }
>>
>> +static inline bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *mem)
>> +{
>> + return true;
>> +}
>> +
>> static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *memcg)
>> {
>> return 0;
>> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
>> index b814d6c..92afdc1 100644
>> --- a/kernel/res_counter.c
>> +++ b/kernel/res_counter.c
>> @@ -18,7 +18,6 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
>> {
>> spin_lock_init(&counter->lock);
>> counter->limit = RESOURCE_MAX;
>> - counter->soft_limit = RESOURCE_MAX;
>> counter->parent = parent;
>> }
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4425f62..7c6cade 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -926,6 +926,14 @@ out:
>> }
>> EXPORT_SYMBOL(mem_cgroup_count_vm_event);
>>
>> +bool mem_cgroup_soft_limit_exceeded(struct mem_cgroup *mem)
>> +{
>> + if (mem_cgroup_disabled() || mem_cgroup_is_root(mem))
>> + return true;
>> +
>> + return res_counter_soft_limit_excess(&mem->res) > 0;
>> +}
>> +
>> /**
>> * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
>> * @zone: zone of the wanted lruvec
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 0ba7d35..b36d91b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2091,6 +2091,17 @@ restart:
>> throttle_vm_writeout(sc->gfp_mask);
>> }
>>
>> +static bool should_reclaim_mem_cgroup(struct scan_control *sc,
>> + struct mem_cgroup *mem,
>> + int priority)
>> +{
>> + if (!global_reclaim(sc) || priority <= DEF_PRIORITY - 3 ||
>> + mem_cgroup_soft_limit_exceeded(mem))
>> + return true;
>> +
>> + return false;
>> +}
>> +
>
> Why "priority <= DEF_PRIORTY - 3" is selected ?
> It seems there is no reason. Could you justify this check ?
There is no particular reason for this magic "3", and the plan is to leave
it open for further tuning later, after we see real problems.
The idea here is to balance performance vs. isolation. We don't want to keep
retrying "over-softlimit" memcgs with hard-to-reclaim memory while leaving
"under-softlimit" memcgs with low-hanging fruit behind; that hurts system
performance as a whole.
>
> Thinking briefly, can't we calculate the ratio as
>
>   number of pages in reclaimable memcgs / number of reclaimable pages
>
> and use 'priority'? If
>
>   total_reclaimable_pages >> priority > number of pages in reclaimable memcgs
>
> then memcgs under their soft limit should be scanned too; that way we can
> avoid scanning pages twice.
Another thing we talked about during the summit is reclaiming pages
proportionally, based on how much each memcg exceeds its softlimit, and the
calculation above seems related to that.
I am pretty sure we will tune how the memcg to reclaim from is selected and
how much to reclaim from it once we start running into problems, and there
are different ways to tune that. This patchset is the very first step, and
its main purpose is to get rid of the big old softlimit reclaim
implementation.
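Just as a rough sketch of where that could go (not part of these patches,
and the helper below is hypothetical), the per-memcg scan target could be
sized by each memcg's share of the total soft-limit excess:

static unsigned long memcg_soft_limit_scan_target(struct mem_cgroup *memcg,
						  unsigned long nr_to_reclaim,
						  u64 total_excess)
{
	u64 excess = res_counter_soft_limit_excess(&memcg->res);

	if (!excess || !total_excess)
		return 0;

	/* scan this memcg in proportion to its share of the total excess */
	return div64_u64((u64)nr_to_reclaim * excess, total_excess);
}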
> Hmm, please give a reason for the magic value here, anyway.
>
>> static void shrink_zone(int priority, struct zone *zone,
>> struct scan_control *sc)
>> {
>> @@ -2108,7 +2119,9 @@ static void shrink_zone(int priority, struct zone *zone,
>> .zone = zone,
>> };
>>
>> - shrink_mem_cgroup_zone(priority, &mz, sc);
>> + if (should_reclaim_mem_cgroup(sc, memcg, priority))
>> + shrink_mem_cgroup_zone(priority, &mz, sc);
>> +
>> /*
>> * Limit reclaim has historically picked one memcg and
>> * scanned it with decreasing priority levels until
>> @@ -2152,8 +2165,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
>> {
>> struct zoneref *z;
>> struct zone *zone;
>> - unsigned long nr_soft_reclaimed;
>> - unsigned long nr_soft_scanned;
>> +// unsigned long nr_soft_reclaimed;
>> +// unsigned long nr_soft_scanned;
>
> Why do you leave these things?
I took this idea from Johannes's last posted softlimit rework patch.
My understanding is that it makes bisecting easier later, but maybe I am
wrong.
> Hmm, but the whole logic seems clean to me except for the magic number.
Thanks.
--Ying
>
> Thanks,
> -Kame
>