* [RFC][PATCH 0/5] memcg: reduce lock contention
@ 2009-08-28 4:20 KAMEZAWA Hiroyuki
2009-08-28 4:23 ` [RFC][PATCH 1/5] memcg: change for softlimit KAMEZAWA Hiroyuki
` (5 more replies)
0 siblings, 6 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:20 UTC (permalink / raw)
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
Hi,
Recently, memcg's res_counter->lock contention on big servers was reported, and
Balbir wrote a workaround for the root memcg.
That's good, but we need some fix for children, too.
This set is for reducing lock contention in memcg's child cgroups; it is based on mmotm-Aug27.
I'm sorry that I have only an 8-cpu machine and can't reproduce the really troublesome lock contention.
Here is lock_stat for "make -j 12" on my 8-cpu box, before and after this patch series.
[Before] time make -j 12 (best time of 3 runs)
real 2m55.170s
user 4m38.351s
sys 6m40.694s
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
&counter->lock: 1793728 1824383 0.90 16599.78 1255869.40 24879507 44909568 0.45 31183.88 19505982.15
--------------
&counter->lock 999561 [<ffffffff81099224>] res_counter_charge+0x94/0x140
&counter->lock 824822 [<ffffffff8109911c>] res_counter_uncharge+0x3c/0xb0
--------------
&counter->lock 835597 [<ffffffff8109911c>] res_counter_uncharge+0x3c/0xb0
&counter->lock 988786 [<ffffffff81099224>] res_counter_charge+0x94/0x140
You can see this just with "head" ;)
[After] time make -j 12 (best time of 3 runs, but the scores were very stable)
real 2m52.612s
user 4m45.450s
sys 6m4.422s
&counter->lock: 11159 11406 1.02 30.35 6707.74 1097940 3957860 0.47 17652.17 1534430.74
--------------
&counter->lock 2016 [<ffffffff810991bd>] res_counter_charge+0x4d/0x110
&counter->lock 9390 [<ffffffff81099115>] res_counter_uncharge+0x35/0x90
--------------
&counter->lock 8962 [<ffffffff81099115>] res_counter_uncharge+0x35/0x90
&counter->lock 2444 [<ffffffff810991bd>] res_counter_charge+0x4d/0x110
dcache_lock, zone->lru_lock, etc. are much heavier than this now.
I expect good results on big servers.
But this patch series is a "big change". I (and the memcg folks) have to be careful...
Thanks,
-Kame
* [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 4:20 [RFC][PATCH 0/5] memcg: reduce lock contention KAMEZAWA Hiroyuki
@ 2009-08-28 4:23 ` KAMEZAWA Hiroyuki
2009-08-28 7:20 ` Balbir Singh
2009-08-28 4:24 ` [RFC][PATCH 2/5] memcg: uncharge in batched manner KAMEZAWA Hiroyuki
` (4 subsequent siblings)
5 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:23 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
This patch modifies softlimit handling in memcg/res_counter.
There are two reasons for this.
1. soft_limit can be used only against a sub-hierarchy root.
Because the softlimit tree is sorted by usage, putting plural groups
under a hierarchy (which shares usage) just adds noise and unnecessary
mess. This patch limits the softlimit feature to the hierarchy root.
This will make softlimit-tree maintenance better.
2. Recently, it has been reported that res_counter can be a bottleneck in
massively parallel environments. We need to reduce the work done under the spinlock.
The reason we check the softlimit at res_counter_charge() is that any member
in a hierarchy can have a softlimit.
But with the change in "1", only the hierarchy root has a soft_limit, so we can
omit the hierarchical check in res_counter.
After this patch, the soft limit is available only for the root of a sub-hierarchy.
(Anyway, a softlimit on hierarchy children just confuses users and is hard to use.)
This patch modifies the following:
- drop unnecessary checks from res_counter_charge()/uncharge()
- add mem->sub_hierarchy_root
- only a hierarchy-root memcg can be on the softlimit tree
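For illustration only (this function is not part of the patch, and charge_path_sketch
is a made-up name), the charge-side flow after this change looks roughly like this;
reclaim and error handling are left out:

/*
 * Sketch: after this patch, res_counter_charge() takes no soft-limit
 * argument, and the soft-limit tree update is driven by an event-counter
 * check against mem->sub_hierarchy_root only.
 */
static int charge_path_sketch(struct mem_cgroup *mem, struct page *page)
{
	struct res_counter *fail_res;
	struct mem_cgroup *root;

	/* plain hierarchical charge, done under res_counter->lock */
	if (res_counter_charge(&mem->res, PAGE_SIZE, &fail_res))
		return -ENOMEM;		/* the real code reclaims and retries */

	/* soft-limit bookkeeping, outside the spinlock, root only */
	root = mem_cgroup_soft_limit_check(mem);
	if (root)
		mem_cgroup_update_tree(root, page);
	return 0;
}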
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Documentation/cgroups/memory.txt | 2
include/linux/res_counter.h | 6 --
kernel/res_counter.c | 14 ----
mm/memcontrol.c | 113 +++++++++++++++++++++++----------------
4 files changed, 74 insertions(+), 61 deletions(-)
Index: mmotm-2.6.31-Aug27/include/linux/res_counter.h
===================================================================
--- mmotm-2.6.31-Aug27.orig/include/linux/res_counter.h
+++ mmotm-2.6.31-Aug27/include/linux/res_counter.h
@@ -114,8 +114,7 @@ void res_counter_init(struct res_counter
int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
- unsigned long val, struct res_counter **limit_fail_at,
- struct res_counter **soft_limit_at);
+ unsigned long val, struct res_counter **limit_fail_at);
/*
* uncharge - tell that some portion of the resource is released
@@ -128,8 +127,7 @@ int __must_check res_counter_charge(stru
*/
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
- bool *was_soft_limit_excess);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val);
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
Index: mmotm-2.6.31-Aug27/kernel/res_counter.c
===================================================================
--- mmotm-2.6.31-Aug27.orig/kernel/res_counter.c
+++ mmotm-2.6.31-Aug27/kernel/res_counter.c
@@ -37,16 +37,13 @@ int res_counter_charge_locked(struct res
}
int res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at,
- struct res_counter **soft_limit_fail_at)
+ struct res_counter **limit_fail_at)
{
int ret;
unsigned long flags;
struct res_counter *c, *u;
*limit_fail_at = NULL;
- if (soft_limit_fail_at)
- *soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
@@ -55,9 +52,6 @@ int res_counter_charge(struct res_counte
* With soft limits, we return the highest ancestor
* that exceeds its soft limit
*/
- if (soft_limit_fail_at &&
- !res_counter_soft_limit_check_locked(c))
- *soft_limit_fail_at = c;
spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
@@ -85,8 +79,7 @@ void res_counter_uncharge_locked(struct
counter->usage -= val;
}
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
- bool *was_soft_limit_excess)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
{
unsigned long flags;
struct res_counter *c;
@@ -94,9 +87,6 @@ void res_counter_uncharge(struct res_cou
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
- if (was_soft_limit_excess)
- *was_soft_limit_excess =
- !res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
}
Index: mmotm-2.6.31-Aug27/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Aug27.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Aug27/mm/memcontrol.c
@@ -221,6 +221,8 @@ struct mem_cgroup {
atomic_t refcnt;
unsigned int swappiness;
+ /* sub hierarchy root cgroup */
+ struct mem_cgroup *sub_hierarchy_root;
/* set when res.limit == memsw.limit */
bool memsw_is_minimum;
@@ -372,22 +374,28 @@ mem_cgroup_remove_exceeded(struct mem_cg
spin_unlock(&mctz->lock);
}
-static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
+/*
+ * Check subhierarchy root's event counter. If event counter is over threshold,
+ * return the root (and the caller will trigger a status-check event).
+ */
+static struct mem_cgroup * mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
{
- bool ret = false;
int cpu;
s64 val;
+ struct mem_cgroup *softlimit_root = mem->sub_hierarchy_root;
struct mem_cgroup_stat_cpu *cpustat;
+ if (!softlimit_root)
+ return NULL;
cpu = get_cpu();
- cpustat = &mem->stat.cpustat[cpu];
+ cpustat = &softlimit_root->stat.cpustat[cpu];
val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
- if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
+ if (unlikely(val > SOFTLIMIT_EVENTS_THRESH))
__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
- ret = true;
- }
+ else
+ softlimit_root = NULL;
put_cpu();
- return ret;
+ return softlimit_root;
}
static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
@@ -1268,7 +1276,7 @@ static int __mem_cgroup_try_charge(struc
{
struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct res_counter *fail_res, *soft_fail_res = NULL;
+ struct res_counter *fail_res;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -1300,17 +1308,17 @@ static int __mem_cgroup_try_charge(struc
if (mem_cgroup_is_root(mem))
goto done;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
- &soft_fail_res);
+ ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+
if (likely(!ret)) {
if (!do_swap_account)
break;
ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res, NULL);
+ &fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1348,17 +1356,14 @@ static int __mem_cgroup_try_charge(struc
goto nomem;
}
}
+
/*
- * Insert just the ancestor, we should trickle down to the correct
- * cgroup for reclaim, since the other nodes will be below their
- * soft limit
- */
- if (soft_fail_res) {
- mem_over_soft_limit =
- mem_cgroup_from_res_counter(soft_fail_res, res);
- if (mem_cgroup_soft_limit_check(mem_over_soft_limit))
- mem_cgroup_update_tree(mem_over_soft_limit, page);
- }
+ * check hierarchy root's event counter and modify softlimit-tree
+ * if necessary.
+ */
+ mem_over_soft_limit = mem_cgroup_soft_limit_check(mem);
+ if (mem_over_soft_limit)
+ mem_cgroup_update_tree(mem_over_soft_limit, page);
done:
return 0;
nomem:
@@ -1433,10 +1438,9 @@ static void __mem_cgroup_commit_charge(s
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE,
- NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
css_put(&mem->css);
return;
@@ -1515,7 +1519,7 @@ static int mem_cgroup_move_account(struc
goto out;
if (!mem_cgroup_is_root(from))
- res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
mem_cgroup_charge_statistics(from, pc, false);
page = pc->page;
@@ -1535,7 +1539,7 @@ static int mem_cgroup_move_account(struc
}
if (do_swap_account && !mem_cgroup_is_root(from))
- res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
css_put(&from->css);
css_get(&to->css);
@@ -1606,9 +1610,9 @@ uncharge:
css_put(&parent->css);
/* uncharge if move fails */
if (!mem_cgroup_is_root(parent)) {
- res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
}
return ret;
}
@@ -1799,8 +1803,7 @@ __mem_cgroup_commit_charge_swapin(struct
* calling css_tryget
*/
if (!mem_cgroup_is_root(memcg))
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE,
- NULL);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_put(memcg);
}
@@ -1827,9 +1830,9 @@ void mem_cgroup_cancel_charge_swapin(str
if (!mem)
return;
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
css_put(&mem->css);
}
@@ -1844,7 +1847,7 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
- bool soft_limit_excess = false;
+ struct mem_cgroup *soft_limit_excess;
if (mem_cgroup_disabled())
return NULL;
@@ -1884,10 +1887,10 @@ __mem_cgroup_uncharge_common(struct page
}
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account &&
(ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
@@ -1904,8 +1907,9 @@ __mem_cgroup_uncharge_common(struct page
mz = page_cgroup_zoneinfo(pc);
unlock_page_cgroup(pc);
- if (soft_limit_excess && mem_cgroup_soft_limit_check(mem))
- mem_cgroup_update_tree(mem, page);
+ soft_limit_excess = mem_cgroup_soft_limit_check(mem);
+ if (soft_limit_excess)
+ mem_cgroup_update_tree(soft_limit_excess, page);
/* at swapout, this memcg will be accessed to record to swap */
if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
css_put(&mem->css);
@@ -1982,7 +1986,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
* This memcg can be obsolete one. We avoid calling css_tryget
*/
if (!mem_cgroup_is_root(memcg))
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_put(memcg);
}
@@ -2475,9 +2479,13 @@ static int mem_cgroup_hierarchy_write(st
*/
if ((!parent_mem || !parent_mem->use_hierarchy) &&
(val == 1 || val == 0)) {
- if (list_empty(&cont->children))
+ if (list_empty(&cont->children)) {
mem->use_hierarchy = val;
- else
+ if (val)
+ mem->sub_hierarchy_root = mem;
+ else
+ mem->sub_hierarchy_root = NULL;
+ } else
retval = -EBUSY;
} else
retval = -EINVAL;
@@ -2587,12 +2595,21 @@ static int mem_cgroup_write(struct cgrou
/*
* For memsw, soft limits are hard to implement in terms
* of semantics, for now, we support soft limits for
- * control without swap
+ * control without swap. And, softlimit is hard to handle
+ * under hierarchy (the softlimit-excess tree handling would
+ * be corrupted). We limit the softlimit feature only to the
+ * hierarchy root.
*/
- if (type == _MEM)
- ret = res_counter_set_soft_limit(&memcg->res, val);
- else
+ if (!memcg->sub_hierarchy_root ||
+ memcg->sub_hierarchy_root != memcg)
ret = -EINVAL;
+ else {
+ if (type == _MEM)
+ ret = res_counter_set_soft_limit(&memcg->res,
+ val);
+ else
+ ret = -EINVAL;
+ }
break;
default:
ret = -EINVAL; /* should be BUG() ? */
@@ -3118,9 +3135,15 @@ mem_cgroup_create(struct cgroup_subsys *
* mem_cgroup(see mem_cgroup_put).
*/
mem_cgroup_get(parent);
+ /*
+ * we don't need to grab a refcnt on the hierarchy root,
+ * because it's our ancestor and the parent is alive.
+ */
+ mem->sub_hierarchy_root = parent->sub_hierarchy_root;
} else {
res_counter_init(&mem->res, NULL);
res_counter_init(&mem->memsw, NULL);
+ mem->sub_hierarchy_root = NULL;
}
mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);
Index: mmotm-2.6.31-Aug27/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-2.6.31-Aug27.orig/Documentation/cgroups/memory.txt
+++ mmotm-2.6.31-Aug27/Documentation/cgroups/memory.txt
@@ -398,6 +398,8 @@ heavily contended for, memory is allocat
hints/setup. Currently soft limit based reclaim is setup such that
it gets invoked from balance_pgdat (kswapd).
+A soft limit can be set only on the root of a subtree (sub-hierarchy).
+
7.1 Interface
Soft limits can be setup by using the following commands (in this example we
* [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-28 4:20 [RFC][PATCH 0/5] memcg: reduce lock contention KAMEZAWA Hiroyuki
2009-08-28 4:23 ` [RFC][PATCH 1/5] memcg: change for softlimit KAMEZAWA Hiroyuki
@ 2009-08-28 4:24 ` KAMEZAWA Hiroyuki
2009-08-28 4:53 ` KAMEZAWA Hiroyuki
` (2 more replies)
2009-08-28 4:25 ` [RFC][PATCH 3/5] memcg: unmap, truncate, invalidate uncharge in batch KAMEZAWA Hiroyuki
` (3 subsequent siblings)
5 siblings, 3 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:24 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
In a massively parallel environment, res_counter can be a performance bottleneck.
This patch is a trial for reducing lock contention.
One strong technique for reducing lock contention is to reduce the number of calls
by batching several calls into one.
Considering charge/uncharge characteristics:
- charge is done one by one via demand paging.
- uncharge is done
  - in chunks at munmap, truncate, exit, execve...
  - one by one via vmscan/paging.
It seems we have a chance to do batched uncharge.
This patch is the base patch for batched uncharge. To avoid
scattering memcg's structure, this patch adds memcg batch-uncharge
information to the task. Please see the start/end usage in the next patch.
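As a rough illustration of the intended calling pattern (not code from this series;
release_pages_sketch, pages and nr are made-up names, the real call sites are the
ones added in the next patch):

static void release_pages_sketch(struct page **pages, int nr)
{
	int i;

	mem_cgroup_uncharge_batch_start();
	for (i = 0; i < nr; i++)
		/* uncharges against one memcg are only accumulated here */
		mem_cgroup_uncharge_page(pages[i]);
	/* one res_counter_uncharge() per counter for the whole batch */
	mem_cgroup_uncharge_batch_end();
}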
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 12 +++++++
include/linux/sched.h | 7 ++++
mm/memcontrol.c | 70 +++++++++++++++++++++++++++++++++++++++++----
3 files changed, 83 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Aug27/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Aug27.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Aug27/include/linux/memcontrol.h
@@ -54,6 +54,10 @@ extern void mem_cgroup_rotate_lru_list(s
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
enum lru_list from, enum lru_list to);
+
+extern void mem_cgroup_uncharge_batch_start(void);
+extern void mem_cgroup_uncharge_batch_end(void);
+
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +155,14 @@ static inline void mem_cgroup_cancel_cha
{
}
+static inline void mem_cgroup_uncharge_batch_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_batch_end(void)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.31-Aug27/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Aug27.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Aug27/mm/memcontrol.c
@@ -1837,7 +1837,35 @@ void mem_cgroup_cancel_charge_swapin(str
css_put(&mem->css);
}
+static bool
+__do_batch_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+ struct memcg_batch_info *batch = NULL;
+ bool uncharge_memsw;
+ /* If swapout, usage of swap doesn't decrease */
+ if (do_swap_account && (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+ uncharge_memsw = false;
+ else
+ uncharge_memsw = true;
+ if (current->memcg_batch.do_batch) {
+ batch = &current->memcg_batch;
+ if (batch->memcg == NULL) {
+ batch->memcg = mem;
+ css_get(&mem->css);
+ }
+ }
+ if (!batch || batch->memcg != mem) {
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (uncharge_memsw)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ } else {
+ batch->pages += PAGE_SIZE;
+ if (uncharge_memsw)
+ batch->memsw += PAGE_SIZE;
+ }
+ return soft_limit_excess;
+}
/*
* uncharge if !page_mapped(page)
*/
@@ -1886,12 +1914,8 @@ __mem_cgroup_uncharge_common(struct page
break;
}
- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account &&
- (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
+ if (!mem_cgroup_is_root(mem))
+ __do_batch_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -1938,6 +1962,40 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+void mem_cgroup_uncharge_batch_start(void)
+{
+ VM_BUG_ON(current->memcg_batch.do_batch);
+ /* avoid batch if killed by OOM */
+ if (test_thread_flag(TIF_MEMDIE))
+ return;
+ current->memcg_batch.do_batch = 1;
+ current->memcg_batch.memcg = NULL;
+ current->memcg_batch.pages = 0;
+ current->memcg_batch.memsw = 0;
+}
+
+void mem_cgroup_uncharge_batch_end(void)
+{
+ struct mem_cgroup *mem;
+
+ if (!current->memcg_batch.do_batch)
+ return;
+
+ current->memcg_batch.do_batch = 0;
+
+ mem = current->memcg_batch.memcg;
+ if (!mem)
+ return;
+ if (current->memcg_batch.pages)
+ res_counter_uncharge(&mem->res,
+ current->memcg_batch.pages, NULL);
+ if (current->memcg_batch.memsw)
+ res_counter_uncharge(&mem->memsw,
+ current->memcg_batch.memsw, NULL);
+ /* we got css's refcnt */
+ cgroup_release_and_wakeup_rmdir(&mem->css);
+}
+
#ifdef CONFIG_SWAP
/*
* called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Aug27/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Aug27.orig/include/linux/sched.h
+++ mmotm-2.6.31-Aug27/include/linux/sched.h
@@ -1540,6 +1540,13 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+ struct memcg_batch_info {
+ bool do_batch;
+ struct mem_cgroup *memcg;
+ long pages, memsw;
+ } memcg_batch;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
* [RFC][PATCH 3/5] memcg: unmap, truncate, invalidate uncharge in batch
2009-08-28 4:20 [RFC][PATCH 0/5] memcg: reduce lock contention KAMEZAWA Hiroyuki
2009-08-28 4:23 ` [RFC][PATCH 1/5] memcg: change for softlimit KAMEZAWA Hiroyuki
2009-08-28 4:24 ` [RFC][PATCH 2/5] memcg: uncharge in batched manner KAMEZAWA Hiroyuki
@ 2009-08-28 4:25 ` KAMEZAWA Hiroyuki
2009-08-31 11:02 ` Balbir Singh
2009-08-28 4:27 ` [RFC][PATCH 4/5] memcg: per-cpu charge stock KAMEZAWA Hiroyuki
` (2 subsequent siblings)
5 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
We can do batched uncharge when we
- invalidate/truncate a file
- unmap a range of pages.
This means we don't do "batched" uncharge in the memory reclaim path,
which I think is reasonable.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memory.c | 2 ++
mm/truncate.c | 6 ++++++
2 files changed, 8 insertions(+)
Index: mmotm-2.6.31-Aug27/mm/memory.c
===================================================================
--- mmotm-2.6.31-Aug27.orig/mm/memory.c
+++ mmotm-2.6.31-Aug27/mm/memory.c
@@ -909,6 +909,7 @@ static unsigned long unmap_page_range(st
details = NULL;
BUG_ON(addr >= end);
+ mem_cgroup_uncharge_batch_start();
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -921,6 +922,7 @@ static unsigned long unmap_page_range(st
zap_work, details);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
+ mem_cgroup_uncharge_batch_end();
return addr;
}
Index: mmotm-2.6.31-Aug27/mm/truncate.c
===================================================================
--- mmotm-2.6.31-Aug27.orig/mm/truncate.c
+++ mmotm-2.6.31-Aug27/mm/truncate.c
@@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
pagevec_release(&pvec);
break;
}
+ mem_cgroup_uncharge_batch_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_batch_end();
}
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
pagevec_init(&pvec, 0);
while (next <= end &&
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ mem_cgroup_uncharge_batch_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t index;
@@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
break;
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_batch_end();
cond_resched();
}
return ret;
@@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
while (next <= end && !wrapped &&
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ mem_cgroup_uncharge_batch_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
@@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_batch_end();
cond_resched();
}
return ret;
* [RFC][PATCH 4/5] memcg: per-cpu charge stock
2009-08-28 4:20 [RFC][PATCH 0/5] memcg: reduce lock contention KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2009-08-28 4:25 ` [RFC][PATCH 3/5] memcg: unmap, truncate, invalidate uncharge in batch KAMEZAWA Hiroyuki
@ 2009-08-28 4:27 ` KAMEZAWA Hiroyuki
2009-08-31 11:10 ` Balbir Singh
2009-08-28 4:28 ` [RFC][PATCH 5/5] memcg: drain per cpu stock KAMEZAWA Hiroyuki
2009-08-28 4:28 ` [RFC][PATCH 0/5] memcg: reduce lock contention Balbir Singh
5 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:27 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
To avoid frequent access to res_counter at charge time, add a per-cpu
local charge stock. Compared with modifying res_counter itself (e.g. with
percpu_counter), this approach has:
Pros.
- we don't have to touch res_counter's cache line
- we don't have to walk res_counter's hierarchy
- we don't have to call res_counter functions.
Cons.
- we need our own code.
Considering the trade-off, I think this is worth doing.
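A rough illustration of the resulting fast path (not the exact patch code;
try_charge_sketch is a made-up name, and reclaim, the swap counter and error
handling are omitted):

static int try_charge_sketch(struct mem_cgroup *mem)
{
	struct res_counter *fail_res;

	/* fast path: consume one page from the per-cpu stock, no counter->lock */
	if (consume_local_stock(mem))
		return 0;

	/* slow path: charge CHARGE_SIZE worth of pages at once ... */
	if (res_counter_charge(&mem->res, CHARGE_SIZE, &fail_res))
		return -ENOMEM;		/* the real code reclaims and retries */

	/* ... and keep the surplus on this cpu for later charges */
	do_local_stock(mem, CHARGE_SIZE - PAGE_SIZE);
	return 0;
}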
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 46 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 37 insertions(+), 9 deletions(-)
Index: mmotm-2.6.31-Aug27/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Aug27.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Aug27/mm/memcontrol.c
@@ -71,7 +71,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
-
+ MEM_CGROUP_STAT_STOCK, /* # of private charges pre-allocated */
MEM_CGROUP_STAT_NSTATS,
};
@@ -1266,6 +1266,32 @@ done:
unlock_page_cgroup(pc);
}
+#define CHARGE_SIZE (4 * ((NR_CPUS >> 5) + 1) * PAGE_SIZE)
+
+bool consume_local_stock(struct mem_cgroup *mem)
+{
+ struct mem_cgroup_stat_cpu *cstat;
+ int cpu = get_cpu();
+ bool ret = true;
+
+ cstat = &mem->stat.cpustat[cpu];
+ if (cstat->count[MEM_CGROUP_STAT_STOCK])
+ cstat->count[MEM_CGROUP_STAT_STOCK] -= PAGE_SIZE;
+ else
+ ret = false;
+ put_cpu();
+ return ret;
+}
+
+void do_local_stock(struct mem_cgroup *mem, int val)
+{
+ struct mem_cgroup_stat_cpu *cstat;
+ int cpu = get_cpu();
+ cstat = &mem->stat.cpustat[cpu];
+ __mem_cgroup_stat_add_safe(cstat, MEM_CGROUP_STAT_STOCK, val);
+ put_cpu();
+}
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -1297,28 +1323,30 @@ static int __mem_cgroup_try_charge(struc
} else {
css_get(&mem->css);
}
- if (unlikely(!mem))
+ /* css_get() against root cgroup is NOOP. we can ignore it */
+ if (!mem || mem_cgroup_is_root(mem))
return 0;
VM_BUG_ON(css_is_removed(&mem->css));
+ if (consume_local_stock(mem))
+ goto got;
+
while (1) {
int ret = 0;
unsigned long flags = 0;
- if (mem_cgroup_is_root(mem))
- goto done;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+ ret = res_counter_charge(&mem->res, CHARGE_SIZE, &fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
- ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
+ ret = res_counter_charge(&mem->memsw, CHARGE_SIZE,
&fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, CHARGE_SIZE);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1356,7 +1384,8 @@ static int __mem_cgroup_try_charge(struc
goto nomem;
}
}
-
+ do_local_stock(mem, CHARGE_SIZE - PAGE_SIZE);
+got:
/*
* check hierarchy root's event counter and modify softlimit-tree
* if necessary.
@@ -1364,7 +1393,6 @@ static int __mem_cgroup_try_charge(struc
mem_over_soft_limit = mem_cgroup_soft_limit_check(mem);
if (mem_over_soft_limit)
mem_cgroup_update_tree(mem_over_soft_limit, page);
-done:
return 0;
nomem:
css_put(&mem->css);
* [RFC][PATCH 5/5] memcg: drain per cpu stock
2009-08-28 4:20 [RFC][PATCH 0/5] memcg: reduce lock contention KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2009-08-28 4:27 ` [RFC][PATCH 4/5] memcg: per-cpu charge stock KAMEZAWA Hiroyuki
@ 2009-08-28 4:28 ` KAMEZAWA Hiroyuki
2009-08-31 11:11 ` Balbir Singh
2009-08-28 4:28 ` [RFC][PATCH 0/5] memcg: reduce lock contention Balbir Singh
5 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
Add a function for dropping the per-cpu stock of charges.
It is called when
- a cpu is unplugged,
- force_empty runs,
- reclaim does not seem to be making progress easily.
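For illustration only (drain_one_cpu_sketch is a made-up name mirroring
force_drain_local_stock in the patch below): a drain simply hands a cpu's unused
stock back to the res_counter, and schedule_drain_stock_all() runs this on every
online cpu via a work item.

/* Sketch: what draining the stock means, for a single memcg and cpu. */
static void drain_one_cpu_sketch(struct mem_cgroup *mem, int cpu)
{
	struct mem_cgroup_stat_cpu *cstat = &mem->stat.cpustat[cpu];
	unsigned long stock;

	/* give this cpu's unused stock back to the res_counter */
	stock = cstat->count[MEM_CGROUP_STAT_STOCK];
	cstat->count[MEM_CGROUP_STAT_STOCK] = 0;
	res_counter_uncharge(&mem->res, stock);
}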
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 70 insertions(+), 1 deletion(-)
Index: mmotm-2.6.31-Aug27/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Aug27.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Aug27/mm/memcontrol.c
@@ -38,6 +38,8 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
#include "internal.h"
#include <asm/uaccess.h>
@@ -77,6 +79,8 @@ enum mem_cgroup_stat_index {
struct mem_cgroup_stat_cpu {
s64 count[MEM_CGROUP_STAT_NSTATS];
+ struct work_struct work;
+ struct mem_cgroup *mem;
} ____cacheline_aligned_in_smp;
struct mem_cgroup_stat {
@@ -277,6 +281,7 @@ enum charge_type {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
+static void schedule_drain_stock_all(struct mem_cgroup *mem, bool sync);
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1195,6 +1200,9 @@ static int mem_cgroup_hierarchical_recla
return total;
} else if (mem_cgroup_check_under_limit(root_mem))
return 1 + total;
+
+ if (loop > 0)
+ schedule_drain_stock_all(victim, false);
}
return total;
}
@@ -1292,6 +1300,48 @@ void do_local_stock(struct mem_cgroup *m
put_cpu();
}
+/* called by cpu hotplug and workqueue */
+int force_drain_local_stock(struct mem_cgroup *mem, void *data)
+{
+ struct mem_cgroup_stat_cpu *cstat;
+ int cpu = *(unsigned long *)data;
+ unsigned long stock;
+
+ cstat = &mem->stat.cpustat[cpu];
+ stock = cstat->count[MEM_CGROUP_STAT_STOCK];
+ cstat->count[MEM_CGROUP_STAT_STOCK] = 0;
+ res_counter_uncharge(&mem->res, stock);
+ return 0;
+}
+
+
+void drain_local_stock(struct work_struct *work)
+{
+ struct mem_cgroup_stat_cpu *cstat;
+ struct mem_cgroup *mem;
+ unsigned long cpu;
+
+ cpu = get_cpu();
+ cstat = container_of(work, struct mem_cgroup_stat_cpu, work);
+ mem = cstat->mem;
+ force_drain_local_stock(mem, &cpu);
+ put_cpu();
+}
+
+
+void schedule_drain_stock_all(struct mem_cgroup *mem, bool sync)
+{
+ struct mem_cgroup_stat_cpu *cstat;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ cstat = &mem->stat.cpustat[cpu];
+ schedule_work_on(cpu, &cstat->work);
+ if (sync)
+ flush_work(&cstat->work);
+ }
+}
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -2471,6 +2521,7 @@ move_account:
if (signal_pending(current))
goto out;
/* This is for making all *used* pages to be on LRU. */
+ schedule_drain_stock_all(mem, true);
lru_add_drain_all();
ret = 0;
for_each_node_state(node, N_HIGH_MEMORY) {
@@ -3081,6 +3132,7 @@ static struct mem_cgroup *mem_cgroup_all
{
struct mem_cgroup *mem;
int size = mem_cgroup_size();
+ int i;
if (size < PAGE_SIZE)
mem = kmalloc(size, GFP_KERNEL);
@@ -3089,9 +3141,26 @@ static struct mem_cgroup *mem_cgroup_all
if (mem)
memset(mem, 0, size);
+ for (i = 0; i < nr_cpu_ids; i++)
+ INIT_WORK(&mem->stat.cpustat[i].work, drain_local_stock);
+
return mem;
}
+static int __cpuinit percpu_memcg_hotcpu_callback(struct notifier_block *nb,
+ unsigned long action, void *hcpu)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+ if (action != CPU_DEAD)
+ return NOTIFY_OK;
+ if (!root_mem_cgroup)
+ return NOTIFY_OK;
+ mem_cgroup_walk_tree(root_mem_cgroup, hcpu, force_drain_local_stock);
+#endif
+ return NOTIFY_OK;
+}
+
+
/*
* At destroying mem_cgroup, references from swap_cgroup can remain.
* (scanning all at force_empty is too costly...)
@@ -3203,7 +3272,7 @@ mem_cgroup_create(struct cgroup_subsys *
root_mem_cgroup = mem;
if (mem_cgroup_soft_limit_tree_init())
goto free_out;
-
+ hotcpu_notifier(percpu_memcg_hotcpu_callback, 0);
} else {
parent = mem_cgroup_from_cont(cont->parent);
mem->use_hierarchy = parent->use_hierarchy;
* Re: [RFC][PATCH 0/5] memcg: reduce lock contention
2009-08-28 4:20 [RFC][PATCH 0/5] memcg: reduce lock contention KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2009-08-28 4:28 ` [RFC][PATCH 5/5] memcg: drain per cpu stock KAMEZAWA Hiroyuki
@ 2009-08-28 4:28 ` Balbir Singh
2009-08-28 4:33 ` KAMEZAWA Hiroyuki
5 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 4:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:20:15]:
> Hi,
>
> Recently, memcg's res_counter->lock contention on big servers was reported, and
> Balbir wrote a workaround for the root memcg.
> That's good, but we need some fix for children, too.
>
> This set is for reducing lock contention in memcg's child cgroups; it is based on mmotm-Aug27.
>
> I'm sorry that I have only an 8-cpu machine and can't reproduce the really troublesome lock contention.
> Here is lock_stat for "make -j 12" on my 8-cpu box, before and after this patch series.
>
Kamezawa-San,
I've been unable to get mmotm to boot (24th August, I'll try the 27th
Aug and debug). Once that is done, I'll test on a large machine.
--
Balbir
* Re: [RFC][PATCH 0/5] memcg: reduce lock contention
2009-08-28 4:28 ` [RFC][PATCH 0/5] memcg: reduce lock contention Balbir Singh
@ 2009-08-28 4:33 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:33 UTC (permalink / raw)
To: balbir
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
On Fri, 28 Aug 2009 09:58:36 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:20:15]:
>
> > Hi,
> >
> > Recently, memcg's res_counter->lock contention on big servers was reported, and
> > Balbir wrote a workaround for the root memcg.
> > That's good, but we need some fix for children, too.
> >
> > This set is for reducing lock contention in memcg's child cgroups; it is based on mmotm-Aug27.
> >
> > I'm sorry that I have only an 8-cpu machine and can't reproduce the really troublesome lock contention.
> > Here is lock_stat for "make -j 12" on my 8-cpu box, before and after this patch series.
> >
>
> Kamezawa-San,
>
> I've been unable to get mmotm to boot (24th August, I'll try the 27th
> Aug and debug). Once that is done, I'll test on a large machine.
>
Yep, take it easy. I'm very active this weekend anyway.
BTW, have you tried this?
http://marc.info/?l=linux-kernel&m=125136796932491&w=2
Thanks,
-Kame
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-28 4:24 ` [RFC][PATCH 2/5] memcg: uncharge in batched manner KAMEZAWA Hiroyuki
@ 2009-08-28 4:53 ` KAMEZAWA Hiroyuki
2009-08-28 4:55 ` KAMEZAWA Hiroyuki
2009-08-28 15:10 ` Balbir Singh
2009-08-31 11:02 ` Balbir Singh
2 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
On Fri, 28 Aug 2009 13:24:38 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> In a massively parallel environment, res_counter can be a performance bottleneck.
> This patch is a trial for reducing lock contention.
> One strong technique for reducing lock contention is to reduce the number of calls
> by batching several calls into one.
>
> Considering charge/uncharge characteristics:
> - charge is done one by one via demand paging.
> - uncharge is done
>   - in chunks at munmap, truncate, exit, execve...
>   - one by one via vmscan/paging.
>
> It seems we have a chance to do batched uncharge.
> This patch is the base patch for batched uncharge. To avoid
> scattering memcg's structure, this patch adds memcg batch-uncharge
> information to the task. Please see the start/end usage in the next patch.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/memcontrol.h | 12 +++++++
> include/linux/sched.h | 7 ++++
> mm/memcontrol.c | 70 +++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 83 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Aug27/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.31-Aug27.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.31-Aug27/include/linux/memcontrol.h
> @@ -54,6 +54,10 @@ extern void mem_cgroup_rotate_lru_list(s
> extern void mem_cgroup_del_lru(struct page *page);
> extern void mem_cgroup_move_lists(struct page *page,
> enum lru_list from, enum lru_list to);
> +
> +extern void mem_cgroup_uncharge_batch_start(void);
> +extern void mem_cgroup_uncharge_batch_end(void);
> +
> extern void mem_cgroup_uncharge_page(struct page *page);
> extern void mem_cgroup_uncharge_cache_page(struct page *page);
> extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> @@ -151,6 +155,14 @@ static inline void mem_cgroup_cancel_cha
> {
> }
>
> +static inline void mem_cgroup_uncharge_batch_start(void)
> +{
> +}
> +
> +static inline void mem_cgroup_uncharge_batch_end(void)
> +{
> +}
> +
> static inline void mem_cgroup_uncharge_page(struct page *page)
> {
> }
> Index: mmotm-2.6.31-Aug27/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Aug27.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Aug27/mm/memcontrol.c
> @@ -1837,7 +1837,35 @@ void mem_cgroup_cancel_charge_swapin(str
> css_put(&mem->css);
> }
>
> +static bool
> +__do_batch_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> + struct memcg_batch_info *batch = NULL;
> + bool uncharge_memsw;
> + /* If swapout, usage of swap doesn't decrease */
> + if (do_swap_account && (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> + uncharge_memsw = false;
> + else
> + uncharge_memsw = true;
>
> + if (current->memcg_batch.do_batch) {
> + batch = &current->memcg_batch;
> + if (batch->memcg == NULL) {
> + batch->memcg = mem;
> + css_get(&mem->css);
> + }
> + }
> + if (!batch || batch->memcg != mem) {
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + if (uncharge_memsw)
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> + } else {
> + batch->pages += PAGE_SIZE;
> + if (uncharge_memsw)
> + batch->memsw += PAGE_SIZE;
> + }
> + return soft_limit_excess;
> +}
> /*
> * uncharge if !page_mapped(page)
> */
> @@ -1886,12 +1914,8 @@ __mem_cgroup_uncharge_common(struct page
> break;
> }
>
> - if (!mem_cgroup_is_root(mem)) {
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - if (do_swap_account &&
> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> - }
> + if (!mem_cgroup_is_root(mem))
> + __do_batch_uncharge(mem, ctype);
> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> mem_cgroup_swap_statistics(mem, true);
> mem_cgroup_charge_statistics(mem, pc, false);
> @@ -1938,6 +1962,40 @@ void mem_cgroup_uncharge_cache_page(stru
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> }
>
> +void mem_cgroup_uncharge_batch_start(void)
> +{
> + VM_BUG_ON(current->memcg_batch.do_batch);
> + /* avoid batch if killed by OOM */
> + if (test_thread_flag(TIF_MEMDIE))
> + return;
> + current->memcg_batch.do_batch = 1;
> + current->memcg_batch.memcg = NULL;
> + current->memcg_batch.pages = 0;
> + current->memcg_batch.memsw = 0;
> +}
> +
> +void mem_cgroup_uncharge_batch_end(void)
> +{
> + struct mem_cgroup *mem;
> +
> + if (!current->memcg_batch.do_batch)
> + return;
> +
> + current->memcg_batch.do_batch = 0;
> +
> + mem = current->memcg_batch.memcg;
> + if (!mem)
> + return;
> + if (current->memcg_batch.pages)
> + res_counter_uncharge(&mem->res,
> + current->memcg_batch.pages, NULL);
> + if (current->memcg_batch.memsw)
> + res_counter_uncharge(&mem->memsw,
> + current->memcg_batch.memsw, NULL);
quilt refresh miss.... the two NULLs above are unnecessary.
sorry,
-Kame
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-28 4:53 ` KAMEZAWA Hiroyuki
@ 2009-08-28 4:55 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 4:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
refreshed..
==
In a massively parallel environment, res_counter can be a performance bottleneck.
This patch is a trial for reducing lock contention.
One strong technique for reducing lock contention is to reduce the number of calls
by batching several calls into one.
Considering charge/uncharge characteristics:
- charge is done one by one via demand paging.
- uncharge is done
  - in chunks at munmap, truncate, exit, execve...
  - one by one via vmscan/paging.
It seems we have a chance to do batched uncharge.
This patch is the base patch for batched uncharge. To avoid
scattering memcg's structure, this patch adds memcg batch-uncharge
information to the task. Please see the start/end usage in the next patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 12 +++++++
include/linux/sched.h | 7 ++++
mm/memcontrol.c | 70 +++++++++++++++++++++++++++++++++++++++++----
3 files changed, 83 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Aug27/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Aug27.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Aug27/include/linux/memcontrol.h
@@ -54,6 +54,10 @@ extern void mem_cgroup_rotate_lru_list(s
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
enum lru_list from, enum lru_list to);
+
+extern void mem_cgroup_uncharge_batch_start(void);
+extern void mem_cgroup_uncharge_batch_end(void);
+
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +155,14 @@ static inline void mem_cgroup_cancel_cha
{
}
+static inline void mem_cgroup_uncharge_batch_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_batch_end(void)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.31-Aug27/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Aug27.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Aug27/mm/memcontrol.c
@@ -1837,7 +1837,35 @@ void mem_cgroup_cancel_charge_swapin(str
css_put(&mem->css);
}
+static void
+__do_batch_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+ struct memcg_batch_info *batch = NULL;
+ bool uncharge_memsw;
+ /* If swapout, usage of swap doesn't decrease */
+ if (do_swap_account && (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
+ uncharge_memsw = false;
+ else
+ uncharge_memsw = true;
+ if (current->memcg_batch.do_batch) {
+ batch = &current->memcg_batch;
+ if (batch->memcg == NULL) {
+ batch->memcg = mem;
+ css_get(&mem->css);
+ }
+ }
+ if (!batch || batch->memcg != mem) {
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (uncharge_memsw)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ } else {
+ batch->pages += PAGE_SIZE;
+ if (uncharge_memsw)
+ batch->memsw += PAGE_SIZE;
+ }
+ return;
+}
/*
* uncharge if !page_mapped(page)
*/
@@ -1886,12 +1914,8 @@ __mem_cgroup_uncharge_common(struct page
break;
}
- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account &&
- (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
+ if (!mem_cgroup_is_root(mem))
+ __do_batch_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -1938,6 +1962,38 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+void mem_cgroup_uncharge_batch_start(void)
+{
+ VM_BUG_ON(current->memcg_batch.do_batch);
+ /* avoid batch if killed by OOM */
+ if (test_thread_flag(TIF_MEMDIE))
+ return;
+ current->memcg_batch.do_batch = 1;
+ current->memcg_batch.memcg = NULL;
+ current->memcg_batch.pages = 0;
+ current->memcg_batch.memsw = 0;
+}
+
+void mem_cgroup_uncharge_batch_end(void)
+{
+ struct mem_cgroup *mem;
+
+ if (!current->memcg_batch.do_batch)
+ return;
+
+ current->memcg_batch.do_batch = 0;
+
+ mem = current->memcg_batch.memcg;
+ if (!mem)
+ return;
+ if (current->memcg_batch.pages)
+ res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ if (current->memcg_batch.memsw)
+ res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
+ /* we got css's refcnt */
+ cgroup_release_and_wakeup_rmdir(&mem->css);
+}
+
#ifdef CONFIG_SWAP
/*
* called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Aug27/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Aug27.orig/include/linux/sched.h
+++ mmotm-2.6.31-Aug27/include/linux/sched.h
@@ -1540,6 +1540,13 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+ struct memcg_batch_info {
+ bool do_batch;
+ struct mem_cgroup *memcg;
+ long pages, memsw;
+ } memcg_batch;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 4:23 ` [RFC][PATCH 1/5] memcg: change for softlimit KAMEZAWA Hiroyuki
@ 2009-08-28 7:20 ` Balbir Singh
2009-08-28 7:35 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 7:20 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:23:21]:
> This patch modifies softlimit handling in memcg/res_counter.
> There are two reasons for this.
>
> 1. soft_limit can be used only against a sub-hierarchy root.
> Because the softlimit tree is sorted by usage, putting plural groups
> under a hierarchy (which shares usage) just adds noise and unnecessary
> mess. This patch limits the softlimit feature to the hierarchy root.
> This will make softlimit-tree maintenance better.
>
> 2. Recently, it has been reported that res_counter can be a bottleneck in
> massively parallel environments. We need to reduce the work done under the spinlock.
> The reason we check the softlimit at res_counter_charge() is that any member
> in a hierarchy can have a softlimit.
> But with the change in "1", only the hierarchy root has a soft_limit, so we can
> omit the hierarchical check in res_counter.
>
> After this patch, the soft limit is available only for the root of a sub-hierarchy.
> (Anyway, a softlimit on hierarchy children just confuses users and is hard to use.)
>
I need some time to digest this change. If the root is a hierarchy root,
then only the root can support soft limits? I think the change makes it
harder to use soft limits. Please help me understand better.
--
Balbir
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 7:20 ` Balbir Singh
@ 2009-08-28 7:35 ` KAMEZAWA Hiroyuki
2009-08-28 13:26 ` Balbir Singh
0 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 7:35 UTC (permalink / raw)
To: balbir
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
On Fri, 28 Aug 2009 12:50:08 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:23:21]:
>
> > This patch modifies softlimit handling in memcg/res_counter.
> > There are two reasons for this.
> >
> > 1. soft_limit can be used only against a sub-hierarchy root.
> > Because the softlimit tree is sorted by usage, putting plural groups
> > under a hierarchy (which shares usage) just adds noise and unnecessary
> > mess. This patch limits the softlimit feature to the hierarchy root.
> > This will make softlimit-tree maintenance better.
> >
> > 2. Recently, it has been reported that res_counter can be a bottleneck in
> > massively parallel environments. We need to reduce the work done under the spinlock.
> > The reason we check the softlimit at res_counter_charge() is that any member
> > in a hierarchy can have a softlimit.
> > But with the change in "1", only the hierarchy root has a soft_limit, so we can
> > omit the hierarchical check in res_counter.
> >
> > After this patch, the soft limit is available only for the root of a sub-hierarchy.
> > (Anyway, a softlimit on hierarchy children just confuses users and is hard to use.)
> >
>
>
> I need some time to digest this change. If the root is a hierarchy root,
> then only the root can support soft limits? I think the change makes it
> harder to use soft limits. Please help me understand better.
>
I pointed out this issue many times while you were writing the patch.
memcg has "sub trees"; hierarchy here means a "sub tree" with use_hierarchy=1.
Assume:
  /cgroup/Users/      use_hierarchy=0
          Gold/       use_hierarchy=1
              Bob
              Mike
          Silver/     use_hierarchy=1
  /cgroup/System/     use_hierarchy=1
In flat terms, there are 3 sub trees:
  /cgroup/Users/Gold   (Gold has /cgroup/Users/Gold/Bob and /cgroup/Users/Gold/Mike)
  /cgroup/Users/Silver .....
  /cgroup/System       .....
So a "sub tree" means a group which inherits charges via use_hierarchy=1.
In the current implementation, a softlimit can be set on an arbitrary cgroup.
Then, the following settings are allowed:
==
/cgroup/Users/Gold softlimit= 1G
/cgroup/Users/Gold/Bob softlimit=800M
/cgroup/Users/Gold/Mike softlimit=800M
==
Then, how does your RB-tree for softlimit management work?
When softlimit finds /cgroup/Users/Gold/, it will reclaim memory from
all 3 groups by hierarchical_reclaim. If softlimit finds
/cgroup/Users/Gold/Bob, reclaim from Bob means reclaim from Gold.
Then, to keep the RB-tree neat, you have to extract all related cgroups and
re-insert them all, every time.
(But current code doesn't do that. It's broken.)
Current soft-limit RB-tree will be easily broken i.e. not-sorted correctly
if used under use_hierarchy=1.
My patch disallows setting a softlimit on Bob and Mike, and only allows it against Gold,
because they can be considered the same class, one hierarchy.
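For example, with illustrative numbers (not taken from any report): suppose Bob uses
900M and Mike 700M, so Gold's usage is at least 1.6G. Then Gold (excess 600M over its
1G soft limit), Bob (excess 100M) and Mike (excess -100M) all sit in the tree, and a
single charge in Bob changes Bob's and Gold's excess at the same time, so keeping the
tree sorted would require removing and re-inserting several entries on every
charge/uncharge.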
Thanks,
-Kame
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 7:35 ` KAMEZAWA Hiroyuki
@ 2009-08-28 13:26 ` Balbir Singh
2009-08-28 14:29 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 13:26 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 16:35:23]:
> On Fri, 28 Aug 2009 12:50:08 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:23:21]:
> >
> > > This patch modifies softlimit handling in memcg/res_counter.
> > > There are two reasons for this.
> > >
> > > 1. soft_limit can be used only against a sub-hierarchy root.
> > > Because the softlimit tree is sorted by usage, putting plural groups
> > > under a hierarchy (which shares usage) just adds noise and unnecessary
> > > mess. This patch limits the softlimit feature to the hierarchy root.
> > > This will make softlimit-tree maintenance better.
> > >
> > > 2. Recently, it has been reported that res_counter can be a bottleneck in
> > > massively parallel environments. We need to reduce the work done under the spinlock.
> > > The reason we check the softlimit at res_counter_charge() is that any member
> > > in a hierarchy can have a softlimit.
> > > But with the change in "1", only the hierarchy root has a soft_limit, so we can
> > > omit the hierarchical check in res_counter.
> > >
> > > After this patch, the soft limit is available only for the root of a sub-hierarchy.
> > > (Anyway, a softlimit on hierarchy children just confuses users and is hard to use.)
> > >
> >
> >
> > I need some time to digest this change. If the root is a hierarchy root,
> > then only the root can support soft limits? I think the change makes it
> > harder to use soft limits. Please help me understand better.
> >
> I poitned out this issue many many times while you wrote patch.
>
> memcg has "sub tree". hierarchy here means "sub tree" with use_hierarchy =1.
>
> Assume
>
>
> /cgroup/Users/use_hierarchy=0
> Gold/ use_hierarchy=1
> Bob
> Mike
> Silver/use_hierarchy=1
>
> /System/use_hierarchy=1
>
> In flat, there are 3 sub trees.
> /cgroup/Users/Gold (Gold has /cgroup/Users/Gold/Bog, /cgroup/Users/Gold/Mike)
> /cgroup/Users/Silver .....
> /cgroup/System .....
>
> Then, subtrees means a group which inherits charges by use_hierarchy=1
>
> In current implementation, softlimit can be set to arbitrary cgroup.
> Then, following ops are allowed.
> ==
> /cgroup/Users/Gold softlimit= 1G
> /cgroup/Users/Gold/Bob softlimit=800M
> /cgroup/Users/Gold/Mike softlimit=800M
> ==
>
> Then, how your RB-tree for softlimit management works ?
>
> When softlimit finds /cgroup/Users/Gold/, it will reclaim memory from
> all 3 groups by hierarchical_reclaim. If softlimit finds
> /cgroup/Users/Gold/Bob, reclaim from Bob means recalaim from Gold.
By "reclaim from Bob means reclaim from Gold", are you referring to the
uncharging part? If so, yes. But if you look at the tasks part, we don't
reclaim anything from the tasks in Gold.
>
> Then, to keep the RB-tree neat, you have to extract all related cgroups and
> re-insert them all, every time.
> (But current code doesn't do that. It's broken.)
The earlier time-dependent code used to catch that, since it was time
based. Now that it is based on activity, it takes a while before the
group is updated. I don't think it is broken, but updates can lag a bit
before showing up.
>
> Current soft-limit RB-tree will be easily broken i.e. not-sorted correctly
> if used under use_hierarchy=1.
>
Not true; I think the sorted-ness is just delayed and is checked when we
pick a tree for reclaim. Think of it as being lazy :)
> My patch disallows set softlimit to Bob and Mike, just allows against Gold
> because there can be considered as the same class, hierarchy.
>
But Bob and Mike might need soft limits set between themselves. If the
soft limit of Gold is 1G and Bob needs to be close to 750M and Mike 250M,
how do we do that without supporting what we have today?
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 13:26 ` Balbir Singh
@ 2009-08-28 14:29 ` KAMEZAWA Hiroyuki
2009-08-28 14:40 ` KAMEZAWA Hiroyuki
2009-08-28 14:45 ` Balbir Singh
0 siblings, 2 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 14:29 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> 16:35:23]:
>
>>
>> Current soft-limit RB-tree will be easily broken i.e. not-sorted
>> correctly
>> if used under use_hierarchy=1.
>>
>
> Not true, I think the sorted-ness is delayed and is seen when we pick
> a tree for reclaim. Think of it as being lazy :)
>
Please explain how an unexpectedly unsorted RB-tree can work sanely.
>> My patch disallows set softlimit to Bob and Mike, just allows against
>> Gold
>> because there can be considered as the same class, hierarchy.
>>
>
> But Bob and Mike might need to set soft limits between themselves. if
> soft limit of gold is 1G and bob needs to be close to 750M and mike
> 250M, how do we do it without supporting what we have today?
>
Don't use hierarchy, or don't use softlimit.
(I don't think fine-grained soft limits are useful.)
Anyway, I have to remove the unnecessary softlimit hacks from res_counter;
please allow that modification, the current situation is bad.
I'll postpone the RB-tree breakage problem; please explain it or fix it
yourself.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 14:29 ` KAMEZAWA Hiroyuki
@ 2009-08-28 14:40 ` KAMEZAWA Hiroyuki
2009-08-28 14:46 ` Balbir Singh
2009-08-28 14:45 ` Balbir Singh
1 sibling, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 14:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
KAMEZAWA Hiroyuki wrote:
> Balbir Singh wrote:
>> But Bob and Mike might need to set soft limits between themselves. if
>> soft limit of gold is 1G and bob needs to be close to 750M and mike
>> 250M, how do we do it without supporting what we have today?
>>
> Don't use hierarchy or don't use softlimit.
> (I never think fine-grain soft limit can be useful.)
>
> Anyway, I have to modify unnecessary hacks for res_counter of softlimit.
> plz allow modification. that's bad.
> I postpone RB-tree breakage problem, plz explain it or fix it by yourself.
>
I changed my mind... the per-zone RB-tree is also broken ;)
The reason I don't like a broken system is that a feature whose behaviour a
user can't know or calculate is of no use in mission-critical systems.
I'd like to think about how to fix it with a better algorithm. Maybe an
RB-tree is not the right choice.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 14:29 ` KAMEZAWA Hiroyuki
2009-08-28 14:40 ` KAMEZAWA Hiroyuki
@ 2009-08-28 14:45 ` Balbir Singh
2009-08-28 14:58 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 14:45 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 23:29:09]:
> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> > 16:35:23]:
> >
>
> >>
> >> Current soft-limit RB-tree will be easily broken i.e. not-sorted
> >> correctly
> >> if used under use_hierarchy=1.
> >>
> >
> > Not true, I think the sorted-ness is delayed and is seen when we pick
> > a tree for reclaim. Think of it as being lazy :)
> >
> plz explain how enexpectedly unsorted RB-tree can work sanely.
>
>
There are two checks built in:
1. In the reclaim path (we see how much to reclaim, compared to the
soft limit)
2. In the dequeue path, where we check whether we really are over the
soft limit
I did a lot of testing with the time-based approach and found no broken
cases. I've been testing with the mmotm (event-based) approach and have
yet to see a broken case so far.
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 14:40 ` KAMEZAWA Hiroyuki
@ 2009-08-28 14:46 ` Balbir Singh
2009-08-28 15:06 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 14:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 23:40:56]:
> KAMEZAWA Hiroyuki wrote:
> > Balbir Singh wrote:
> >> But Bob and Mike might need to set soft limits between themselves. if
> >> soft limit of gold is 1G and bob needs to be close to 750M and mike
> >> 250M, how do we do it without supporting what we have today?
> >>
> > Don't use hierarchy or don't use softlimit.
> > (I never think fine-grain soft limit can be useful.)
> >
> > Anyway, I have to modify unnecessary hacks for res_counter of softlimit.
> > plz allow modification. that's bad.
> > I postpone RB-tree breakage problem, plz explain it or fix it by yourself.
> >
> I changed my mind....per-zone RB-tree is also broken ;)
>
> Why I don't like broken system is a function which a user can't
> know/calculate how-it-works is of no use in mission critical systems.
>
> I'd like to think how-to-fix it with better algorithm. Maybe RB-tree
> is not a choice.
>
Soft limits are not meant for mission-critical work :-) Soft limits are
best effort, not a guaranteed resource allocation mechanism. I've
mentioned in previous emails how we recover if we find the data is
stale.
--
Balbir
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 14:45 ` Balbir Singh
@ 2009-08-28 14:58 ` KAMEZAWA Hiroyuki
2009-08-28 15:07 ` Balbir Singh
0 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 14:58 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> 23:29:09]:
>
>> Balbir Singh wrote:
>> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
>> > 16:35:23]:
>> >
>>
>> >>
>> >> Current soft-limit RB-tree will be easily broken i.e. not-sorted
>> >> correctly
>> >> if used under use_hierarchy=1.
>> >>
>> >
>> > Not true, I think the sorted-ness is delayed and is seen when we pick
>> > a tree for reclaim. Think of it as being lazy :)
>> >
>> plz explain how enexpectedly unsorted RB-tree can work sanely.
>>
>>
>
> There are two checks built-in
>
> 1. In the reclaim path (we see how much to reclaim, compared to the
> soft limit)
> 2. In the dequeue path where we check if we really are over soft limit
>
That's not the point.
> I did lot of testing with the time based approach and found no broken
> cases, I;ve been testing it with the mmotm (event based approach and I
> am yet to see a broken case so far).
>
I'm sorry if I don't understand RB-trees.
I think an RB-tree is a structure that sorts inputs passed by the caller
one by one, and that it ends up in a broken state if a node's value is
changed while it is in the tree. Wrong?
While a subtree is

        7
       / \
      3   9

and, by some magic, a node's value is changed without extracting it:

        7
       / \
     13   9

the biggest value is 13, but the biggest value that will be selected is "9".
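As a minimal sketch (illustrative only; the names are made up and this is
not code from either patch set), the only safe way to change a node's key
with the kernel rbtree API is to erase it and re-insert it, which is the
"extract all related cgroups and re-insert them" step mentioned above:
==
#include <linux/rbtree.h>

struct usage_node {
	struct rb_node rb;
	unsigned long usage;	/* sort key */
};

static void usage_tree_insert(struct rb_root *root, struct usage_node *un)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	while (*link) {
		struct usage_node *cur;

		parent = *link;
		cur = rb_entry(parent, struct usage_node, rb);
		if (un->usage < cur->usage)
			link = &parent->rb_left;
		else
			link = &parent->rb_right;
	}
	rb_link_node(&un->rb, parent, link);
	rb_insert_color(&un->rb, root);
}

/* A key change must go through erase + re-insert, never in place. */
static void usage_tree_update(struct rb_root *root, struct usage_node *un,
			      unsigned long new_usage)
{
	rb_erase(&un->rb, root);
	un->usage = new_usage;
	usage_tree_insert(root, un);
}
==
If a node's key changes in place (as it does when a child's charge is
propagated to the hierarchy root while the root sits in the tree), later
walks of that tree can take the wrong branch, which is the breakage
described above.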
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 14:46 ` Balbir Singh
@ 2009-08-28 15:06 ` KAMEZAWA Hiroyuki
2009-08-28 15:08 ` Balbir Singh
0 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 15:06 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> 23:40:56]:
>
>> KAMEZAWA Hiroyuki wrote:
>> > Balbir Singh wrote:
>> >> But Bob and Mike might need to set soft limits between themselves. if
>> >> soft limit of gold is 1G and bob needs to be close to 750M and mike
>> >> 250M, how do we do it without supporting what we have today?
>> >>
>> > Don't use hierarchy or don't use softlimit.
>> > (I never think fine-grain soft limit can be useful.)
>> >
>> > Anyway, I have to modify unnecessary hacks for res_counter of
>> softlimit.
>> > plz allow modification. that's bad.
>> > I postpone RB-tree breakage problem, plz explain it or fix it by
>> yourself.
>> >
>> I changed my mind....per-zone RB-tree is also broken ;)
>>
>> Why I don't like broken system is a function which a user can't
>> know/calculate how-it-works is of no use in mission critical systems.
>>
>> I'd like to think how-to-fix it with better algorithm. Maybe RB-tree
>> is not a choice.
>>
>
> Soft limits are not meant for mission critical work :-) Soft limits is
> best effort and not a guaranteed resource allocation mechanism. I've
> mentioned in previous emails how we recover if we find the data is
> stale
>
Yes, but can you explain to users how the selection will be done?
I can't.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 14:58 ` KAMEZAWA Hiroyuki
@ 2009-08-28 15:07 ` Balbir Singh
0 siblings, 0 replies; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 15:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 23:58:39]:
> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> > 23:29:09]:
> >
> >> Balbir Singh wrote:
> >> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> >> > 16:35:23]:
> >> >
> >>
> >> >>
> >> >> Current soft-limit RB-tree will be easily broken i.e. not-sorted
> >> >> correctly
> >> >> if used under use_hierarchy=1.
> >> >>
> >> >
> >> > Not true, I think the sorted-ness is delayed and is seen when we pick
> >> > a tree for reclaim. Think of it as being lazy :)
> >> >
> >> plz explain how enexpectedly unsorted RB-tree can work sanely.
> >>
> >>
> >
> > There are two checks built-in
> >
> > 1. In the reclaim path (we see how much to reclaim, compared to the
> > soft limit)
> > 2. In the dequeue path where we check if we really are over soft limit
> >
> that's not a point.
>
> > I did lot of testing with the time based approach and found no broken
> > cases, I;ve been testing it with the mmotm (event based approach and I
> > am yet to see a broken case so far).
> >
> I'm sorry if I don't understand RB-tree.
> I think RB-tree is a system which can sort inputs passed by caller
> one by one and will be in broken state if value of nodes changed
> while it's in tree. Wrong ?
> While a subtree is
> 7
> / \
> 3 9
> And, by some magic, the value can be changed without extract
> 7
> / \
> 13 9
> The biggest is 13. But the biggest number which will be selecte will be "9".
>
This cannot happen today; we keep the values the same until we update
the tree. I hope that clarifies things.
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 15:06 ` KAMEZAWA Hiroyuki
@ 2009-08-28 15:08 ` Balbir Singh
2009-08-28 15:12 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 15:08 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-29 00:06:23]:
> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> > 23:40:56]:
> >
> >> KAMEZAWA Hiroyuki wrote:
> >> > Balbir Singh wrote:
> >> >> But Bob and Mike might need to set soft limits between themselves. if
> >> >> soft limit of gold is 1G and bob needs to be close to 750M and mike
> >> >> 250M, how do we do it without supporting what we have today?
> >> >>
> >> > Don't use hierarchy or don't use softlimit.
> >> > (I never think fine-grain soft limit can be useful.)
> >> >
> >> > Anyway, I have to modify unnecessary hacks for res_counter of
> >> softlimit.
> >> > plz allow modification. that's bad.
> >> > I postpone RB-tree breakage problem, plz explain it or fix it by
> >> yourself.
> >> >
> >> I changed my mind....per-zone RB-tree is also broken ;)
> >>
> >> Why I don't like broken system is a function which a user can't
> >> know/calculate how-it-works is of no use in mission critical systems.
> >>
> >> I'd like to think how-to-fix it with better algorithm. Maybe RB-tree
> >> is not a choice.
> >>
> >
> > Soft limits are not meant for mission critical work :-) Soft limits is
> > best effort and not a guaranteed resource allocation mechanism. I've
> > mentioned in previous emails how we recover if we find the data is
> > stale
> >
> yes. but can you explain how selection will be done to users ?
> I can't.
>
From a user's point of view, we get what we set, but the timelines can be
a little longer.
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-28 4:24 ` [RFC][PATCH 2/5] memcg: uncharge in batched manner KAMEZAWA Hiroyuki
2009-08-28 4:53 ` KAMEZAWA Hiroyuki
@ 2009-08-28 15:10 ` Balbir Singh
2009-08-28 15:21 ` KAMEZAWA Hiroyuki
2009-08-31 11:02 ` Balbir Singh
2 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 15:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:24:38]:
>
> In massive parallel enviroment, res_counter can be a performance bottleneck.
> This patch is a trial for reducing lock contention.
> One strong techinque to reduce lock contention is reducing calls by
> batching some amount of calls int one.
>
> Considering charge/uncharge chatacteristic,
> - charge is done one by one via demand-paging.
> - uncharge is done by
> - in chunk at munmap, truncate, exit, execve...
> - one by one via vmscan/paging.
>
> It seems we hace a chance to batched-uncharge.
> This patch is a base patch for batched uncharge. For avoiding
> scattering memcg's structure, this patch adds memcg batch uncharge
> information to the task. please see start/end usage in next patch.
>
Overall it is a very good idea. Can't we do the uncharge at the point
of unmap_vmas, exit_mmap, etc., so that we don't have to keep
additional data structures around?
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 15:08 ` Balbir Singh
@ 2009-08-28 15:12 ` KAMEZAWA Hiroyuki
2009-08-28 15:15 ` Balbir Singh
0 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 15:12 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-29
> 00:06:23]:
>
>> Balbir Singh wrote:
>> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
>> > 23:40:56]:
>> >
>> >> KAMEZAWA Hiroyuki wrote:
>> >> > Balbir Singh wrote:
>> >> >> But Bob and Mike might need to set soft limits between themselves.
>> if
>> >> >> soft limit of gold is 1G and bob needs to be close to 750M and
>> mike
>> >> >> 250M, how do we do it without supporting what we have today?
>> >> >>
>> >> > Don't use hierarchy or don't use softlimit.
>> >> > (I never think fine-grain soft limit can be useful.)
>> >> >
>> >> > Anyway, I have to modify unnecessary hacks for res_counter of
>> >> softlimit.
>> >> > plz allow modification. that's bad.
>> >> > I postpone RB-tree breakage problem, plz explain it or fix it by
>> >> yourself.
>> >> >
>> >> I changed my mind....per-zone RB-tree is also broken ;)
>> >>
>> >> Why I don't like broken system is a function which a user can't
>> >> know/calculate how-it-works is of no use in mission critical systems.
>> >>
>> >> I'd like to think how-to-fix it with better algorithm. Maybe RB-tree
>> >> is not a choice.
>> >>
>> >
>> > Soft limits are not meant for mission critical work :-) Soft limits is
>> > best effort and not a guaranteed resource allocation mechanism. I've
>> > mentioned in previous emails how we recover if we find the data is
>> > stale
>> >
>> yes. but can you explain how selection will be done to users ?
>> I can't.
>>
>
> From a user point, we get what we set, but the timelines can be a
> little longer.
>
I'll drop this patch, anyway, but I will still modify res_counter.
We have to reduce the work done under the lock now that we have seen the
spinlock explode system time.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 1/5] memcg: change for softlimit.
2009-08-28 15:12 ` KAMEZAWA Hiroyuki
@ 2009-08-28 15:15 ` Balbir Singh
0 siblings, 0 replies; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 15:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-29 00:12:26]:
> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-29
> > 00:06:23]:
> >
> >> Balbir Singh wrote:
> >> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> >> > 23:40:56]:
> >> >
> >> >> KAMEZAWA Hiroyuki wrote:
> >> >> > Balbir Singh wrote:
> >> >> >> But Bob and Mike might need to set soft limits between themselves.
> >> if
> >> >> >> soft limit of gold is 1G and bob needs to be close to 750M and
> >> mike
> >> >> >> 250M, how do we do it without supporting what we have today?
> >> >> >>
> >> >> > Don't use hierarchy or don't use softlimit.
> >> >> > (I never think fine-grain soft limit can be useful.)
> >> >> >
> >> >> > Anyway, I have to modify unnecessary hacks for res_counter of
> >> >> softlimit.
> >> >> > plz allow modification. that's bad.
> >> >> > I postpone RB-tree breakage problem, plz explain it or fix it by
> >> >> yourself.
> >> >> >
> >> >> I changed my mind....per-zone RB-tree is also broken ;)
> >> >>
> >> >> Why I don't like broken system is a function which a user can't
> >> >> know/calculate how-it-works is of no use in mission critical systems.
> >> >>
> >> >> I'd like to think how-to-fix it with better algorithm. Maybe RB-tree
> >> >> is not a choice.
> >> >>
> >> >
> >> > Soft limits are not meant for mission critical work :-) Soft limits is
> >> > best effort and not a guaranteed resource allocation mechanism. I've
> >> > mentioned in previous emails how we recover if we find the data is
> >> > stale
> >> >
> >> yes. but can you explain how selection will be done to users ?
> >> I can't.
> >>
> >
> > From a user point, we get what we set, but the timelines can be a
> > little longer.
> >
> I'll drop this patch, anyway. But will modify res_counter.
> We have to reduce ops under lock after we see spinlock can explode
> system time.
>
Thanks! I'll review the other patches as well and test them sometime
over the weekend.
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-28 15:10 ` Balbir Singh
@ 2009-08-28 15:21 ` KAMEZAWA Hiroyuki
2009-08-28 16:03 ` Balbir Singh
0 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-28 15:21 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> 13:24:38]:
>
>>
>> In massive parallel enviroment, res_counter can be a performance
>> bottleneck.
>> This patch is a trial for reducing lock contention.
>> One strong techinque to reduce lock contention is reducing calls by
>> batching some amount of calls int one.
>>
>> Considering charge/uncharge chatacteristic,
>> - charge is done one by one via demand-paging.
>> - uncharge is done by
>> - in chunk at munmap, truncate, exit, execve...
>> - one by one via vmscan/paging.
>>
>> It seems we hace a chance to batched-uncharge.
>> This patch is a base patch for batched uncharge. For avoiding
>> scattering memcg's structure, this patch adds memcg batch uncharge
>> information to the task. please see start/end usage in next patch.
>>
>
> Overall it is a very good idea, can't we do the uncharge at the poin
> tof unmap_vmas, exit_mmap, etc so that we don't have to keep
> additional data structures around.
>
We can't. We uncharge when page->mapcount goes down to 0, and that is
unknown until page_remove_rmap() decrements page->mapcount with an
"atomic" op.
My first version allocated memcg_batch_info on the stack... and I had to
pass an extra argument to page_remove_rmap() etc.
That was very ugly ;(
Now I add a per-task memcg_batch_info to the task struct.
Because it will always be used at exit() and makes the exit() path
much faster, it's not very costly.
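To make the intended start/end usage concrete, here is a rough sketch
(the function is a made-up stand-in for the real unmap/truncate call
sites touched in patch 3/5; the real callers uncharge indirectly via
page_remove_rmap() and page-cache removal rather than calling
mem_cgroup_uncharge_page() directly):
==
#include <linux/mm.h>
#include <linux/memcontrol.h>

/* Sketch only: bracket a run of per-page uncharges so res_counter is
 * touched once per batch instead of once per page. */
static void zap_pages_batched(struct page **pages, int nr)
{
	int i;

	mem_cgroup_uncharge_batch_start();
	for (i = 0; i < nr; i++)
		/* accumulates into current->memcg_batch while batching is on */
		mem_cgroup_uncharge_page(pages[i]);
	mem_cgroup_uncharge_batch_end();	/* one res_counter_uncharge() here */
}
==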
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-28 15:21 ` KAMEZAWA Hiroyuki
@ 2009-08-28 16:03 ` Balbir Singh
0 siblings, 0 replies; 37+ messages in thread
From: Balbir Singh @ 2009-08-28 16:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-29 00:21:50]:
> > tof unmap_vmas, exit_mmap, etc so that we don't have to keep
> > additional data structures around.
> >
> We can't. We uncharge when page->mapcount goes down to 0.
> This is unknown until page_remove_rmap() decrement page->mapcount
> by "atomic" ops.
>
> My first version allocated memcg_batch_info on stack ...and..
> I had to pass an extra argument to page_remove_rmap() etc....
> That was very ugly ;(
> Now, I adds per-task memcg_batch_info to task struct.
> Because it will be always used at exit() and make exit() path
> much faster, it's not very costly.
>
Aaah.. I see that makes a lot of sense. Thanks for the clarification.
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-28 4:24 ` [RFC][PATCH 2/5] memcg: uncharge in batched manner KAMEZAWA Hiroyuki
2009-08-28 4:53 ` KAMEZAWA Hiroyuki
2009-08-28 15:10 ` Balbir Singh
@ 2009-08-31 11:02 ` Balbir Singh
2009-08-31 11:59 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-31 11:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:24:38]:
>
> In massive parallel enviroment, res_counter can be a performance bottleneck.
> This patch is a trial for reducing lock contention.
> One strong techinque to reduce lock contention is reducing calls by
> batching some amount of calls int one.
>
> Considering charge/uncharge chatacteristic,
> - charge is done one by one via demand-paging.
> - uncharge is done by
> - in chunk at munmap, truncate, exit, execve...
> - one by one via vmscan/paging.
>
> It seems we hace a chance to batched-uncharge.
> This patch is a base patch for batched uncharge. For avoiding
> scattering memcg's structure, this patch adds memcg batch uncharge
> information to the task. please see start/end usage in next patch.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/memcontrol.h | 12 +++++++
> include/linux/sched.h | 7 ++++
> mm/memcontrol.c | 70 +++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 83 insertions(+), 6 deletions(-)
>
> Index: mmotm-2.6.31-Aug27/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.31-Aug27.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.31-Aug27/include/linux/memcontrol.h
> @@ -54,6 +54,10 @@ extern void mem_cgroup_rotate_lru_list(s
> extern void mem_cgroup_del_lru(struct page *page);
> extern void mem_cgroup_move_lists(struct page *page,
> enum lru_list from, enum lru_list to);
> +
> +extern void mem_cgroup_uncharge_batch_start(void);
> +extern void mem_cgroup_uncharge_batch_end(void);
> +
> extern void mem_cgroup_uncharge_page(struct page *page);
> extern void mem_cgroup_uncharge_cache_page(struct page *page);
> extern int mem_cgroup_shmem_charge_fallback(struct page *page,
> @@ -151,6 +155,14 @@ static inline void mem_cgroup_cancel_cha
> {
> }
>
> +static inline void mem_cgroup_uncharge_batch_start(void)
> +{
> +}
> +
> +static inline void mem_cgroup_uncharge_batch_end(void)
> +{
> +}
> +
> static inline void mem_cgroup_uncharge_page(struct page *page)
> {
> }
> Index: mmotm-2.6.31-Aug27/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Aug27.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Aug27/mm/memcontrol.c
> @@ -1837,7 +1837,35 @@ void mem_cgroup_cancel_charge_swapin(str
> css_put(&mem->css);
> }
>
> +static bool
> +__do_batch_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +{
> + struct memcg_batch_info *batch = NULL;
> + bool uncharge_memsw;
> + /* If swapout, usage of swap doesn't decrease */
> + if (do_swap_account && (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> + uncharge_memsw = false;
> + else
> + uncharge_memsw = true;
>
> + if (current->memcg_batch.do_batch) {
> + batch = ¤t->memcg_batch;
> + if (batch->memcg == NULL) {
> + batch->memcg = mem;
> + css_get(&mem->css);
> + }
> + }
> + if (!batch || batch->memcg != mem) {
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + if (uncharge_memsw)
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
Could you please add a comment stating that if the memcg is different we
do a direct uncharge, otherwise we batch?
> + } else {
> + batch->pages += PAGE_SIZE;
> + if (uncharge_memsw)
> + batch->memsw += PAGE_SIZE;
> + }
> + return soft_limit_excess;
> +}
> /*
> * uncharge if !page_mapped(page)
> */
> @@ -1886,12 +1914,8 @@ __mem_cgroup_uncharge_common(struct page
> break;
> }
>
> - if (!mem_cgroup_is_root(mem)) {
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> - if (do_swap_account &&
> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> - }
> + if (!mem_cgroup_is_root(mem))
> + __do_batch_uncharge(mem, ctype);
Now I am beginning to think we need a cond_mem_cgroup_is_not_root()
function.
> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> mem_cgroup_swap_statistics(mem, true);
> mem_cgroup_charge_statistics(mem, pc, false);
> @@ -1938,6 +1962,40 @@ void mem_cgroup_uncharge_cache_page(stru
> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> }
>
> +void mem_cgroup_uncharge_batch_start(void)
> +{
> + VM_BUG_ON(current->memcg_batch.do_batch);
> + /* avoid batch if killed by OOM */
> + if (test_thread_flag(TIF_MEMDIE))
> + return;
> + current->memcg_batch.do_batch = 1;
> + current->memcg_batch.memcg = NULL;
> + current->memcg_batch.pages = 0;
> + current->memcg_batch.memsw = 0;
> +}
> +
> +void mem_cgroup_uncharge_batch_end(void)
> +{
> + struct mem_cgroup *mem;
> +
> + if (!current->memcg_batch.do_batch)
> + return;
> +
> + current->memcg_batch.do_batch = 0;
> +
> + mem = current->memcg_batch.memcg;
> + if (!mem)
> + return;
> + if (current->memcg_batch.pages)
> + res_counter_uncharge(&mem->res,
> + current->memcg_batch.pages, NULL);
> + if (current->memcg_batch.memsw)
> + res_counter_uncharge(&mem->memsw,
> + current->memcg_batch.memsw, NULL);
> + /* we got css's refcnt */
> + cgroup_release_and_wakeup_rmdir(&mem->css);
Does this affect deletion of a group and delay it by a large amount?
> +}
> +
> #ifdef CONFIG_SWAP
> /*
> * called after __delete_from_swap_cache() and drop "page" account.
> Index: mmotm-2.6.31-Aug27/include/linux/sched.h
> ===================================================================
> --- mmotm-2.6.31-Aug27.orig/include/linux/sched.h
> +++ mmotm-2.6.31-Aug27/include/linux/sched.h
> @@ -1540,6 +1540,13 @@ struct task_struct {
> unsigned long trace_recursion;
> #endif /* CONFIG_TRACING */
> unsigned long stack_start;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
> + struct memcg_batch_info {
> + bool do_batch;
> + struct mem_cgroup *memcg;
> + long pages, memsw;
> + } memcg_batch;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 3/5] memcg: unmap, truncate, invalidate uncharege in batch
2009-08-28 4:25 ` [RFC][PATCH 3/5] memcg: unmap, truncate, invalidate uncharege in batch KAMEZAWA Hiroyuki
@ 2009-08-31 11:02 ` Balbir Singh
0 siblings, 0 replies; 37+ messages in thread
From: Balbir Singh @ 2009-08-31 11:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:25:42]:
>
> We can do batched uncharge when
> - invalidate/truncte file
> - unmap range of pages.
>
> This means we don't do "batched" uncharge in memory reclaim path.
> I think it's reasonable.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memory.c | 2 ++
> mm/truncate.c | 6 ++++++
> 2 files changed, 8 insertions(+)
>
> Index: mmotm-2.6.31-Aug27/mm/memory.c
> ===================================================================
> --- mmotm-2.6.31-Aug27.orig/mm/memory.c
> +++ mmotm-2.6.31-Aug27/mm/memory.c
> @@ -909,6 +909,7 @@ static unsigned long unmap_page_range(st
> details = NULL;
>
> BUG_ON(addr >= end);
> + mem_cgroup_uncharge_batch_start();
> tlb_start_vma(tlb, vma);
> pgd = pgd_offset(vma->vm_mm, addr);
> do {
> @@ -921,6 +922,7 @@ static unsigned long unmap_page_range(st
> zap_work, details);
> } while (pgd++, addr = next, (addr != end && *zap_work > 0));
> tlb_end_vma(tlb, vma);
> + mem_cgroup_uncharge_batch_end();
>
> return addr;
> }
> Index: mmotm-2.6.31-Aug27/mm/truncate.c
> ===================================================================
> --- mmotm-2.6.31-Aug27.orig/mm/truncate.c
> +++ mmotm-2.6.31-Aug27/mm/truncate.c
> @@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
> pagevec_release(&pvec);
> break;
> }
> + mem_cgroup_uncharge_batch_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
>
> @@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_batch_end();
> }
> }
> EXPORT_SYMBOL(truncate_inode_pages_range);
> @@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
> pagevec_init(&pvec, 0);
> while (next <= end &&
> pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> + mem_cgroup_uncharge_batch_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t index;
> @@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
> break;
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_batch_end();
> cond_resched();
> }
> return ret;
> @@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
> while (next <= end && !wrapped &&
> pagevec_lookup(&pvec, mapping, next,
> min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> + mem_cgroup_uncharge_batch_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> pgoff_t page_index;
> @@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
> unlock_page(page);
> }
> pagevec_release(&pvec);
> + mem_cgroup_uncharge_batch_end();
> cond_resched();
> }
> return ret;
>
This looks good to me
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 4/5] memcg: per-cpu charge stock
2009-08-28 4:27 ` [RFC][PATCH 4/5] memcg: per-cpu charge stock KAMEZAWA Hiroyuki
@ 2009-08-31 11:10 ` Balbir Singh
2009-08-31 12:07 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-31 11:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:27:06]:
>
> For avoiding frequent access to res_counter at charge, add per-cpu
> local charge. Comparing with modifing res_coutner (with percpu_counter),
> this approach
> Pros.
> - we don't have to touch res_counter's cache line
> - we don't have to chase res_counter's hierarchy
> - we don't have to call res_counter function.
> Cons.
> - we need our own code.
>
> Considering trade-off, I think this is worth to do.
I prefer the percpu_counter approach, due to:
1. Code reuse (any enhancements made will benefit us)
2. Custom batching that can be done easily
3. Remember, hierarchy is explicitly enabled and we've documented that
it is expensive
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 46 +++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 37 insertions(+), 9 deletions(-)
>
> Index: mmotm-2.6.31-Aug27/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.31-Aug27.orig/mm/memcontrol.c
> +++ mmotm-2.6.31-Aug27/mm/memcontrol.c
> @@ -71,7 +71,7 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -
> + MEM_CGROUP_STAT_STOCK, /* # of private charges pre-allocated */
> MEM_CGROUP_STAT_NSTATS,
> };
>
> @@ -1266,6 +1266,32 @@ done:
> unlock_page_cgroup(pc);
> }
>
> +#define CHARGE_SIZE (4 * ((NR_CPUS >> 5) + 1) * PAGE_SIZE)
> +
> +bool consume_local_stock(struct mem_cgroup *mem)
> +{
> + struct mem_cgroup_stat_cpu *cstat;
> + int cpu = get_cpu();
> + bool ret = true;
> +
> + cstat = &mem->stat.cpustat[cpu];
> + if (cstat->count[MEM_CGROUP_STAT_STOCK])
> + cstat->count[MEM_CGROUP_STAT_STOCK] -= PAGE_SIZE;
> + else
> + ret = false;
> + put_cpu();
> + return ret;
> +}
> +
> +void do_local_stock(struct mem_cgroup *mem, int val)
> +{
> + struct mem_cgroup_stat_cpu *cstat;
> + int cpu = get_cpu();
> + cstat = &mem->stat.cpustat[cpu];
> + __mem_cgroup_stat_add_safe(cstat, MEM_CGROUP_STAT_STOCK, val);
> + put_cpu();
> +}
> +
> /*
> * Unlike exported interface, "oom" parameter is added. if oom==true,
> * oom-killer can be invoked.
> @@ -1297,28 +1323,30 @@ static int __mem_cgroup_try_charge(struc
> } else {
> css_get(&mem->css);
> }
> - if (unlikely(!mem))
> + /* css_get() against root cgroup is NOOP. we can ignore it */
> + if (!mem || mem_cgroup_is_root(mem))
> return 0;
>
> VM_BUG_ON(css_is_removed(&mem->css));
>
> + if (consume_local_stock(mem))
> + goto got;
> +
> while (1) {
> int ret = 0;
> unsigned long flags = 0;
>
> - if (mem_cgroup_is_root(mem))
> - goto done;
> - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> + ret = res_counter_charge(&mem->res, CHARGE_SIZE, &fail_res);
>
> if (likely(!ret)) {
> if (!do_swap_account)
> break;
> - ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> + ret = res_counter_charge(&mem->memsw, CHARGE_SIZE,
> &fail_res);
> if (likely(!ret))
> break;
> /* mem+swap counter fails */
> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->res, CHARGE_SIZE);
> flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> memsw);
> @@ -1356,7 +1384,8 @@ static int __mem_cgroup_try_charge(struc
> goto nomem;
> }
> }
> -
> + do_local_stock(mem, CHARGE_SIZE - PAGE_SIZE);
> +got:
> /*
> * check hierarchy root's event counter and modify softlimit-tree
> * if necessary.
> @@ -1364,7 +1393,6 @@ static int __mem_cgroup_try_charge(struc
> mem_over_soft_limit = mem_cgroup_soft_limit_check(mem);
> if (mem_over_soft_limit)
> mem_cgroup_update_tree(mem_over_soft_limit, page);
> -done:
> return 0;
> nomem:
> css_put(&mem->css);
>
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 5/5] memcg: drain per cpu stock
2009-08-28 4:28 ` [RFC][PATCH 5/5] memcg: drain per cpu stock KAMEZAWA Hiroyuki
@ 2009-08-31 11:11 ` Balbir Singh
2009-08-31 12:09 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-31 11:11 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28 13:28:09]:
>
> Add function for dropping per-cpu stock of charges.
> This is called when
> - cpu is unplugged.
> - force_empty
> - recalim seems to be not easy.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The complexity of this patch and the additional code make percpu_counter
more attractive. Why not work on percpu_counter if it is not as good as
we expect it to be, and in turn help its other users as well?
--
Balbir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-31 11:02 ` Balbir Singh
@ 2009-08-31 11:59 ` KAMEZAWA Hiroyuki
2009-08-31 12:10 ` Balbir Singh
0 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-31 11:59 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> 13:24:38]:
>> + }
>> + if (!batch || batch->memcg != mem) {
>> + res_counter_uncharge(&mem->res, PAGE_SIZE);
>> + if (uncharge_memsw)
>> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
>
> Could you please add a comment stating that if memcg is different that
> we do a direct uncharge else we batch.
>
Is that really necessary? OK, I'll do it.
>> + } else {
>> + batch->pages += PAGE_SIZE;
>> + if (uncharge_memsw)
>> + batch->memsw += PAGE_SIZE;
>> + }
>> + return soft_limit_excess;
>> +}
>> /*
>> * uncharge if !page_mapped(page)
>> */
>> @@ -1886,12 +1914,8 @@ __mem_cgroup_uncharge_common(struct page
>> break;
>> }
>>
>> - if (!mem_cgroup_is_root(mem)) {
>> - res_counter_uncharge(&mem->res, PAGE_SIZE);
>> - if (do_swap_account &&
>> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
>> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
>> - }
>> + if (!mem_cgroup_is_root(mem))
>> + __do_batch_uncharge(mem, ctype);
>
> Now I am beginning to think we need a cond_mem_cgroup_is_not_root()
> function.
>
I can't catch what cond_mem_cgroup_is_not_root() means.
>> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
>> mem_cgroup_swap_statistics(mem, true);
>> mem_cgroup_charge_statistics(mem, pc, false);
>> @@ -1938,6 +1962,40 @@ void mem_cgroup_uncharge_cache_page(stru
>> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
>> }
>>
>> +void mem_cgroup_uncharge_batch_start(void)
>> +{
>> + VM_BUG_ON(current->memcg_batch.do_batch);
>> + /* avoid batch if killed by OOM */
>> + if (test_thread_flag(TIF_MEMDIE))
>> + return;
>> + current->memcg_batch.do_batch = 1;
>> + current->memcg_batch.memcg = NULL;
>> + current->memcg_batch.pages = 0;
>> + current->memcg_batch.memsw = 0;
>> +}
>> +
>> +void mem_cgroup_uncharge_batch_end(void)
>> +{
>> + struct mem_cgroup *mem;
>> +
>> + if (!current->memcg_batch.do_batch)
>> + return;
>> +
>> + current->memcg_batch.do_batch = 0;
>> +
>> + mem = current->memcg_batch.memcg;
>> + if (!mem)
>> + return;
>> + if (current->memcg_batch.pages)
>> + res_counter_uncharge(&mem->res,
>> + current->memcg_batch.pages, NULL);
>> + if (current->memcg_batch.memsw)
>> + res_counter_uncharge(&mem->memsw,
>> + current->memcg_batch.memsw, NULL);
>> + /* we got css's refcnt */
>> + cgroup_release_and_wakeup_rmdir(&mem->css);
>
>
> Does this effect deleting of a group and delay it by a large amount?
>
Please see what cgroup_release_and_xxxx() fixed. This is not about delay
but about a race condition which makes rmdir sleep permanently.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 4/5] memcg: per-cpu charge stock
2009-08-31 11:10 ` Balbir Singh
@ 2009-08-31 12:07 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-31 12:07 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> 13:27:06]:
>
>>
>> For avoiding frequent access to res_counter at charge, add per-cpu
>> local charge. Comparing with modifing res_coutner (with percpu_counter),
>> this approach
>> Pros.
>> - we don't have to touch res_counter's cache line
>> - we don't have to chase res_counter's hierarchy
>> - we don't have to call res_counter function.
>> Cons.
>> - we need our own code.
>>
>> Considering trade-off, I think this is worth to do.
>
> I prefer the other part due to
>
> 1. Code reuse (any enhancements made will benefit us)
> 2. Custom batching that can be done easily
> 3. Remember hierarchy is explicitly enabled and we've documented that
> it is expensive
Hmm. The important point is that we don't touch res_counter's cache line
in the fast path. And if we don't use memcg's own per-cpu counters, more
cache lines/TLB entries will be needed. (I think percpu_counter is slow.)
Please rewrite memcg's per-cpu counters yourself if you want something
generic.
I can't understand what you mean by (3).
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 5/5] memcg: drain per cpu stock
2009-08-31 11:11 ` Balbir Singh
@ 2009-08-31 12:09 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-31 12:09 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> 13:28:09]:
>
>>
>> Add function for dropping per-cpu stock of charges.
>> This is called when
>> - cpu is unplugged.
>> - force_empty
>> - recalim seems to be not easy.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> The complexity of this patch and additional code make percpu_counter
> more attractive. Why not work on percpu_counter if that is not as good
> as we expect it to be and in turn help other exploiters as well.
- percpu_counter is slow.
- percpu_counter is just a "counter". We use res_counter not as a counter
but for accounting against a "limit"; borrowing charges against that limit
is the core of this patch (see the sketch below).
- Adding a "flush" op to percpu_counter would be much more of a mess.
- This implementation handles mem->res and mem->memsw at the same time,
which reduces the overhead a lot.
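A sketch of the "borrow" idea, using the names from patch 4/5
(simplified: memsw accounting, reclaim fallback and error handling are
omitted, and the function name is made up):
==
static int charge_one_page_sketch(struct mem_cgroup *mem)
{
	struct res_counter *fail_res;

	/* fast path: take a pre-charged PAGE_SIZE piece, no res_counter access */
	if (consume_local_stock(mem))
		return 0;

	/* slow path: charge one big chunk from res_counter ... */
	if (res_counter_charge(&mem->res, CHARGE_SIZE, &fail_res))
		return -ENOMEM;		/* the real code falls back to reclaim */

	/* ... and keep the remainder as per-cpu stock for later charges */
	do_local_stock(mem, CHARGE_SIZE - PAGE_SIZE);
	return 0;
}
==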
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-31 11:59 ` KAMEZAWA Hiroyuki
@ 2009-08-31 12:10 ` Balbir Singh
2009-08-31 12:14 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-31 12:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-31 20:59:18]:
> Balbir Singh wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-28
> > 13:24:38]:
>
> >> + }
> >> + if (!batch || batch->memcg != mem) {
> >> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> >> + if (uncharge_memsw)
> >> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> >
> > Could you please add a comment stating that if memcg is different that
> > we do a direct uncharge else we batch.
> >
> really necessary ?. ok. I'll do.
>
I think it will help new readers of the code.
> >> + } else {
> >> + batch->pages += PAGE_SIZE;
> >> + if (uncharge_memsw)
> >> + batch->memsw += PAGE_SIZE;
> >> + }
> >> + return soft_limit_excess;
> >> +}
> >> /*
> >> * uncharge if !page_mapped(page)
> >> */
> >> @@ -1886,12 +1914,8 @@ __mem_cgroup_uncharge_common(struct page
> >> break;
> >> }
> >>
> >> - if (!mem_cgroup_is_root(mem)) {
> >> - res_counter_uncharge(&mem->res, PAGE_SIZE);
> >> - if (do_swap_account &&
> >> - (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> >> - res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> >> - }
> >> + if (!mem_cgroup_is_root(mem))
> >> + __do_batch_uncharge(mem, ctype);
> >
> > Now I am beginning to think we need a cond_mem_cgroup_is_not_root()
> > function.
> >
> I can't catch waht cond_mem_cgroup_is_not_root() means.
>
It is something like cond_resched(): it checks whether the mem_cgroup is
not root and, if so, executes the operation. Just a nit-pick.
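A hypothetical sketch of what such a helper could look like (the name is
made up and this is not in any tree; it just wraps the call from patch 2/5):
==
static inline void mem_cgroup_cond_uncharge(struct mem_cgroup *mem,
					    const enum charge_type ctype)
{
	/* fold the root check into the call, the way cond_resched()
	 * folds the need-resched check */
	if (!mem_cgroup_is_root(mem))
		__do_batch_uncharge(mem, ctype);
}
==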
>
> >> if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> >> mem_cgroup_swap_statistics(mem, true);
> >> mem_cgroup_charge_statistics(mem, pc, false);
> >> @@ -1938,6 +1962,40 @@ void mem_cgroup_uncharge_cache_page(stru
> >> __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> >> }
> >>
> >> +void mem_cgroup_uncharge_batch_start(void)
> >> +{
> >> + VM_BUG_ON(current->memcg_batch.do_batch);
> >> + /* avoid batch if killed by OOM */
> >> + if (test_thread_flag(TIF_MEMDIE))
> >> + return;
> >> + current->memcg_batch.do_batch = 1;
> >> + current->memcg_batch.memcg = NULL;
> >> + current->memcg_batch.pages = 0;
> >> + current->memcg_batch.memsw = 0;
> >> +}
> >> +
> >> +void mem_cgroup_uncharge_batch_end(void)
> >> +{
> >> + struct mem_cgroup *mem;
> >> +
> >> + if (!current->memcg_batch.do_batch)
> >> + return;
> >> +
> >> + current->memcg_batch.do_batch = 0;
> >> +
> >> + mem = current->memcg_batch.memcg;
> >> + if (!mem)
> >> + return;
> >> + if (current->memcg_batch.pages)
> >> + res_counter_uncharge(&mem->res,
> >> + current->memcg_batch.pages, NULL);
> >> + if (current->memcg_batch.memsw)
> >> + res_counter_uncharge(&mem->memsw,
> >> + current->memcg_batch.memsw, NULL);
> >> + /* we got css's refcnt */
> >> + cgroup_release_and_wakeup_rmdir(&mem->css);
> >
> >
> > Does this effect deleting of a group and delay it by a large amount?
> >
> plz see what cgroup_release_and_xxxx fixed. This is not for delay
> but for race-condition, which makes rmdir sleep permanently.
>
I've seen those patches, where rmdir() can hang. My concern was the time
that elapses between the css_get() and the cgroup_release_and_wakeup_rmdir().
--
Balbir
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-31 12:10 ` Balbir Singh
@ 2009-08-31 12:14 ` KAMEZAWA Hiroyuki
2009-08-31 12:23 ` Balbir Singh
0 siblings, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-31 12:14 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
>> > Does this effect deleting of a group and delay it by a large amount?
>> >
>> plz see what cgroup_release_and_xxxx fixed. This is not for delay
>> but for race-condition, which makes rmdir sleep permanently.
>>
>
> I've seen those patches, where rmdir() can hang. My conern was time
> elapsed since we do css_get() and do a cgroup_release_and_wake_rmdir()
>
Please read the unmap() and truncate() code.
The number of pages handled without cond_resched() is limited.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-31 12:14 ` KAMEZAWA Hiroyuki
@ 2009-08-31 12:23 ` Balbir Singh
2009-08-31 14:36 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 37+ messages in thread
From: Balbir Singh @ 2009-08-31 12:23 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-31 21:14:10]:
> Balbir Singh wrote:
> >> > Does this effect deleting of a group and delay it by a large amount?
> >> >
> >> plz see what cgroup_release_and_xxxx fixed. This is not for delay
> >> but for race-condition, which makes rmdir sleep permanently.
> >>
> >
> > I've seen those patches, where rmdir() can hang. My conern was time
> > elapsed since we do css_get() and do a cgroup_release_and_wake_rmdir()
> >
> plz read unmap() and truncate() code.
> The number of pages handled without cond_resched() is limited.
>
>
I understand that part. I was referring to tasks stuck doing rmdir()
while we do a batched uncharge; will it be very visible to the end user?
cond_resched() is bad in this case, since it means we'll stay longer
before we release the cgroup.
--
Balbir
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC][PATCH 2/5] memcg: uncharge in batched manner
2009-08-31 12:23 ` Balbir Singh
@ 2009-08-31 14:36 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-31 14:36 UTC (permalink / raw)
To: balbir
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp
Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-31
> 21:14:10]:
>
>> Balbir Singh wrote:
>> >> > Does this effect deleting of a group and delay it by a large
>> amount?
>> >> >
>> >> plz see what cgroup_release_and_xxxx fixed. This is not for delay
>> >> but for race-condition, which makes rmdir sleep permanently.
>> >>
>> >
>> > I've seen those patches, where rmdir() can hang. My conern was time
>> > elapsed since we do css_get() and do a cgroup_release_and_wake_rmdir()
>> >
>> plz read unmap() and truncate() code.
>> The number of pages handled without cond_resched() is limited.
>>
>>
>
> I understand that part, I was referring to tasks stuck doing rmdir()
> while we do batched uncharge, will it be very visible to the end user?
truncate/invalidate etc. is done in chunks of pagevec size.
Right now that is 14, so batched uncharge is done per 14 pages, IIUC.
> cond_resched() is bad in this case.. since it means we'll stay longer
> before we release the cgroup.
cond_resched() is the caller's business; it is not related to memcg
because we don't call it.
Thanks,
-Kame
>
>
> --
> Balbir
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 37+ messages in thread
End of thread (newest message: 2009-08-31 14:36 UTC)
Thread overview: 37+ messages
2009-08-28 4:20 [RFC][PATCH 0/5] memcg: reduce lock conetion KAMEZAWA Hiroyuki
2009-08-28 4:23 ` [RFC][PATCH 1/5] memcg: change for softlimit KAMEZAWA Hiroyuki
2009-08-28 7:20 ` Balbir Singh
2009-08-28 7:35 ` KAMEZAWA Hiroyuki
2009-08-28 13:26 ` Balbir Singh
2009-08-28 14:29 ` KAMEZAWA Hiroyuki
2009-08-28 14:40 ` KAMEZAWA Hiroyuki
2009-08-28 14:46 ` Balbir Singh
2009-08-28 15:06 ` KAMEZAWA Hiroyuki
2009-08-28 15:08 ` Balbir Singh
2009-08-28 15:12 ` KAMEZAWA Hiroyuki
2009-08-28 15:15 ` Balbir Singh
2009-08-28 14:45 ` Balbir Singh
2009-08-28 14:58 ` KAMEZAWA Hiroyuki
2009-08-28 15:07 ` Balbir Singh
2009-08-28 4:24 ` [RFC][PATCH 2/5] memcg: uncharge in batched manner KAMEZAWA Hiroyuki
2009-08-28 4:53 ` KAMEZAWA Hiroyuki
2009-08-28 4:55 ` KAMEZAWA Hiroyuki
2009-08-28 15:10 ` Balbir Singh
2009-08-28 15:21 ` KAMEZAWA Hiroyuki
2009-08-28 16:03 ` Balbir Singh
2009-08-31 11:02 ` Balbir Singh
2009-08-31 11:59 ` KAMEZAWA Hiroyuki
2009-08-31 12:10 ` Balbir Singh
2009-08-31 12:14 ` KAMEZAWA Hiroyuki
2009-08-31 12:23 ` Balbir Singh
2009-08-31 14:36 ` KAMEZAWA Hiroyuki
2009-08-28 4:25 ` [RFC][PATCH 3/5] memcg: unmap, truncate, invalidate uncharege in batch KAMEZAWA Hiroyuki
2009-08-31 11:02 ` Balbir Singh
2009-08-28 4:27 ` [RFC][PATCH 4/5] memcg: per-cpu charge stock KAMEZAWA Hiroyuki
2009-08-31 11:10 ` Balbir Singh
2009-08-31 12:07 ` KAMEZAWA Hiroyuki
2009-08-28 4:28 ` [RFC][PATCH 5/5] memcg: drain per cpu stock KAMEZAWA Hiroyuki
2009-08-31 11:11 ` Balbir Singh
2009-08-31 12:09 ` KAMEZAWA Hiroyuki
2009-08-28 4:28 ` [RFC][PATCH 0/5] memcg: reduce lock conetion Balbir Singh
2009-08-28 4:33 ` KAMEZAWA Hiroyuki