* [RFC PATCH 0/3] Disable swap cgroup allocation at system boot stage
@ 2012-12-04 8:35 Jeff Liu
2012-12-04 8:35 ` [RFC PATCH 1/3] memcg: refactor pages allocation/free for swap_cgroup Jeff Liu
From: Jeff Liu @ 2012-12-04 8:35 UTC (permalink / raw)
To: linux-mm; +Cc: cgroups, Glauber Costa
Hello,
Currently, pages for the swap cgroup are allocated while the system is booting up,
which means that a chunk of memory (proportional to the total size of the enabled
swap files/partitions) is pre-allocated and sits wasted as long as no child memcg
is alive.
This patch set defers the memory allocation for the swap cgroup until the first
child memcg is created. It was inspired by Glauber's previously proposed patch set,
"memcg: do not call page_cgroup_init at system_boot", which can be found at:
http://lwn.net/Articles/517562/
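For illustration, here is a minimal userspace sketch of the deferral idea. The names are purely illustrative (not the kernel API): each "type" keeps a map of per-page buffers whose array is set up early, as swapon does, while the buffers themselves are only allocated once the first consumer appears and are released when the last one goes away.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_TYPES 4
#define PAGE_SZ   4096

struct ctrl {
	void **map;
	unsigned long length;
};

static struct ctrl ctrls[MAX_TYPES];
static int nconsumers;	/* the kernel series uses an atomic_t for this */

/* "swapon": set up the map array, but defer the per-page buffers. */
static void model_swapon(int type, unsigned long length)
{
	ctrls[type].length = length;
	ctrls[type].map = calloc(length, sizeof(void *));
}

static int fill_pages(struct ctrl *c)
{
	for (unsigned long i = 0; i < c->length; i++) {
		c->map[i] = calloc(1, PAGE_SZ);
		if (!c->map[i])
			return -1;
	}
	return 0;
}

static void drop_pages(struct ctrl *c)
{
	for (unsigned long i = 0; i < c->length; i++) {
		free(c->map[i]);
		c->map[i] = NULL;	/* keep map/length for a possible refill */
	}
}

/* First consumer (a child memcg, in the series) triggers the allocation. */
static int consumer_get(void)
{
	if (++nconsumers != 1)
		return 0;
	for (int t = 0; t < MAX_TYPES; t++)
		if (ctrls[t].map && fill_pages(&ctrls[t]) < 0)
			return -1;
	return 0;
}

/* Last consumer releases everything again. */
static void consumer_put(void)
{
	if (--nconsumers)
		return;
	for (int t = 0; t < MAX_TYPES; t++)
		if (ctrls[t].map)
			drop_pages(&ctrls[t]);
}
```

In the series, the "consumer" is a non-root memcg and the per-type structure is struct swap_cgroup_ctrl.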
These patches work for me with some sanity checks. There must be issues I am not yet
aware of; at the very least, performing swapon/swapoff while child memcgs are alive
can hit race conditions that end up in the bad_page() path... but I'd like to post the
series early to get some direction, so that I can continue to improve it.
Any comments are appreciated. Thanks in advance!
-Jeff
[PATCH 1/3] memcg: refactor pages allocation/free for swap_cgroup
[PATCH 2/3] memcg: disable pages allocation for swap cgroup on system booting up
[PATCH 3/3] memcg: allocate pages for swap cgroup until the first child memcg is alive
include/linux/page_cgroup.h | 12 ++++
mm/memcontrol.c | 3 +
mm/page_cgroup.c | 160 +++++++++++++++++++++++++++++++++++++------
3 files changed, 153 insertions(+), 22 deletions(-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* [RFC PATCH 1/3] memcg: refactor pages allocation/free for swap_cgroup
@ 2012-12-04 8:35 ` Jeff Liu
2012-12-04 10:11 ` Michal Hocko
From: Jeff Liu @ 2012-12-04 8:35 UTC (permalink / raw)
To: linux-mm; +Cc: cgroups, Glauber Costa
- Rename swap_cgroup_prepare() to swap_cgroup_alloc_pages().
- Introduce a new helper, swap_cgroup_free_pages(), to free the pages
  of the swap cgroup for a given type.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
mm/page_cgroup.c | 44 ++++++++++++++++++++++++++++++++------------
1 file changed, 32 insertions(+), 12 deletions(-)
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 5ddad0c..76b1344 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -348,31 +348,51 @@ struct swap_cgroup {
*/
/*
- * allocate buffer for swap_cgroup.
+ * Allocate pages for swap_cgroup upon a given type.
*/
-static int swap_cgroup_prepare(int type)
+static int swap_cgroup_alloc_pages(int type)
{
- struct page *page;
struct swap_cgroup_ctrl *ctrl;
- unsigned long idx, max;
+ unsigned long i, length, max;
ctrl = &swap_cgroup_ctrl[type];
-
- for (idx = 0; idx < ctrl->length; idx++) {
- page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ length = ctrl->length;
+ for (i = 0; i < length; i++) {
+ struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page)
goto not_enough_page;
- ctrl->map[idx] = page;
+ ctrl->map[i] = page;
}
+
return 0;
+
not_enough_page:
- max = idx;
- for (idx = 0; idx < max; idx++)
- __free_page(ctrl->map[idx]);
+ max = i;
+ for (i = 0; i < max; i++)
+ __free_page(ctrl->map[i]);
return -ENOMEM;
}
+static void swap_cgroup_free_pages(int type)
+{
+ struct swap_cgroup_ctrl *ctrl;
+ struct page **map;
+
+ ctrl = &swap_cgroup_ctrl[type];
+ map = ctrl->map;
+ if (map) {
+ unsigned long length = ctrl->length;
+ unsigned long i;
+
+ for (i = 0; i < length; i++) {
+ struct page *page = map[i];
+ if (page)
+ __free_page(page);
+ }
+ }
+}
+
static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
struct swap_cgroup_ctrl **ctrlp)
{
@@ -477,7 +497,7 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
ctrl->length = length;
ctrl->map = array;
spin_lock_init(&ctrl->lock);
- if (swap_cgroup_prepare(type)) {
+ if (swap_cgroup_alloc_pages(type)) {
/* memory shortage */
ctrl->map = NULL;
ctrl->length = 0;
--
1.7.9.5
* [RFC PATCH 2/3] memcg: disable pages allocation for swap cgroup on system booting up
@ 2012-12-04 8:36 ` Jeff Liu
2012-12-04 11:17 ` Michal Hocko
From: Jeff Liu @ 2012-12-04 8:36 UTC (permalink / raw)
To: linux-mm; +Cc: cgroups, Glauber Costa
- Disable page allocation for the swap cgroup at the system boot stage.
- Perform the page allocation at swapon time if child memcgs are alive, because the
  user might have disabled one or more swap files/partitions for some reason.
- Introduce a couple of helpers to deal with page allocation/free for the swap cgroup.
- Introduce a new static variable to track child memcg creation/removal.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/page_cgroup.h | 12 +++++
mm/page_cgroup.c | 109 ++++++++++++++++++++++++++++++++++++++-----
2 files changed, 110 insertions(+), 11 deletions(-)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524..2b94fc0 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -113,6 +113,8 @@ extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
extern int swap_cgroup_swapon(int type, unsigned long max_pages);
extern void swap_cgroup_swapoff(int type);
+extern int swap_cgroup_init(void);
+extern void swap_cgroup_destroy(void);
#else
static inline
@@ -138,6 +140,16 @@ static inline void swap_cgroup_swapoff(int type)
return;
}
+static inline int swap_cgroup_init(void)
+{
+ return 0;
+}
+
+static inline void swap_cgroup_destroy(void)
+{
+ return;
+}
+
#endif /* CONFIG_MEMCG_SWAP */
#endif /* !__GENERATING_BOUNDS_H */
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 76b1344..f1b257b 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -321,8 +321,8 @@ void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
static DEFINE_MUTEX(swap_cgroup_mutex);
struct swap_cgroup_ctrl {
- struct page **map;
- unsigned long length;
+ struct page **map;
+ unsigned long length;
spinlock_t lock;
};
@@ -410,6 +410,8 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
return sc + offset % SC_PER_PAGE;
}
+static atomic_t swap_cgroup_initialized = ATOMIC_INIT(0);
+
/**
* swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
* @ent: swap entry to be cmpxchged
@@ -497,17 +499,36 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
ctrl->length = length;
ctrl->map = array;
spin_lock_init(&ctrl->lock);
- if (swap_cgroup_alloc_pages(type)) {
- /* memory shortage */
- ctrl->map = NULL;
- ctrl->length = 0;
- mutex_unlock(&swap_cgroup_mutex);
- vfree(array);
- goto nomem;
+
+ /*
+ * If swapon(2) happens at the system boot phase, we delay the
+ * page allocation for the swap cgroup until the first non-root
+ * memcg is created.
+ *
+ * However, we might run into the following scenarios:
+ * 1) one or more new swap partitions/files are enabled while
+ *    a non-root memcg (and thus the swap cgroup) is active.
+ * 2) memcg and the swap cgroup stay active, but the user has
+ *    performed swapoff(2) against the given type of swap
+ *    partition or file for some reason, and then turns
+ *    it on again.
+ * In those cases, we have to allocate the pages at swapon(2)
+ * time, since swap_cgroup_init() will get no chance to do it
+ * until a new child memcg is created.
+ */
+ if (atomic_read(&swap_cgroup_initialized)) {
+ if (swap_cgroup_alloc_pages(type)) {
+ /* memory shortage */
+ ctrl->map = NULL;
+ ctrl->length = 0;
+ mutex_unlock(&swap_cgroup_mutex);
+ vfree(array);
+ goto nomem;
+ }
}
mutex_unlock(&swap_cgroup_mutex);
-
return 0;
+
nomem:
printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
printk(KERN_INFO
@@ -515,10 +536,74 @@ nomem:
return -ENOMEM;
}
+/*
+ * This function is called per child memcg created, so we might
+ * arrive here multiple times. But we only allocate pages for swap
+ * cgroup when the first child memcg is created.
+ */
+int swap_cgroup_init(void)
+{
+ int type;
+
+ if (!do_swap_account)
+ return 0;
+
+ if (atomic_add_return(1, &swap_cgroup_initialized) != 1)
+ return 0;
+
+ mutex_lock(&swap_cgroup_mutex);
+ for (type = 0; type < MAX_SWAPFILES; type++) {
+ if (swap_cgroup_alloc_pages(type) < 0) {
+ struct swap_cgroup_ctrl *ctrl;
+
+ ctrl = &swap_cgroup_ctrl[type];
+ mutex_unlock(&swap_cgroup_mutex);
+ ctrl->length = 0;
+ if (ctrl->map) {
+ vfree(ctrl->map);
+ ctrl->map = NULL;
+ }
+ goto nomem;
+ }
+ }
+ mutex_unlock(&swap_cgroup_mutex);
+
+ return 0;
+
+nomem:
+ pr_info("couldn't initialize swap_cgroup, not enough memory.\n");
+ pr_info("swap_cgroup can be disabled by swapaccount=0 boot option\n");
+ return -ENOMEM;
+}
+
+/*
+ * This function is called per memcg removed, so we might arrive
+ * here multiple times, but we only free the pages when the last
+ * memcg is removed. Note that:
+ * We do not clear the map pointer and the length that were set up
+ * at swapon(2) time, because we need that information to
+ * re-allocate the pages if a child memcg is created again.
+ */
+void swap_cgroup_destroy(void)
+{
+ int type;
+
+ if (!do_swap_account)
+ return;
+
+ if (atomic_sub_return(1, &swap_cgroup_initialized))
+ return;
+
+ mutex_lock(&swap_cgroup_mutex);
+ for (type = 0; type < MAX_SWAPFILES; type++)
+ swap_cgroup_free_pages(type);
+ mutex_unlock(&swap_cgroup_mutex);
+}
+
void swap_cgroup_swapoff(int type)
{
struct page **map;
- unsigned long i, length;
+ unsigned long length;
struct swap_cgroup_ctrl *ctrl;
if (!do_swap_account)
@@ -533,6 +618,8 @@ void swap_cgroup_swapoff(int type)
mutex_unlock(&swap_cgroup_mutex);
if (map) {
+ unsigned long i;
+
for (i = 0; i < length; i++) {
struct page *page = map[i];
if (page)
--
1.7.9.5
* [RFC PATCH 3/3] memcg: allocate pages for swap cgroup until the first child memcg is alive
@ 2012-12-04 8:36 ` Jeff Liu
2012-12-04 12:54 ` Michal Hocko
From: Jeff Liu @ 2012-12-04 8:36 UTC (permalink / raw)
To: linux-mm; +Cc: cgroups, Glauber Costa
- Call swap_cgroup_init() when the first child memcg is created.
- Free the pages of the swap cgroup once the last child memcg is removed.
- Teach swap_cgroup_record()/swap_cgroup_cmpxchg()/lookup_swap_cgroup_id()
  to be aware of the new static variable, i.e. swap_cgroup_initialized, so that
  they do not perform any further work if the swap cgroup is not active.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
mm/memcontrol.c | 3 +++
mm/page_cgroup.c | 9 +++++++++
2 files changed, 12 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dd39ba0..1f14375 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4982,6 +4982,8 @@ mem_cgroup_create(struct cgroup *cont)
}
hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
} else {
+ if (swap_cgroup_init())
+ goto free_out;
parent = mem_cgroup_from_cont(cont->parent);
memcg->use_hierarchy = parent->use_hierarchy;
memcg->oom_kill_disable = parent->oom_kill_disable;
@@ -5046,6 +5048,7 @@ static void mem_cgroup_destroy(struct cgroup *cont)
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
kmem_cgroup_destroy(memcg);
+ swap_cgroup_destroy();
mem_cgroup_put(memcg);
}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index f1b257b..63c6789 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -429,6 +429,9 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
unsigned long flags;
unsigned short retval;
+ if (!atomic_read(&swap_cgroup_initialized))
+ return 0;
+
sc = lookup_swap_cgroup(ent, &ctrl);
spin_lock_irqsave(&ctrl->lock, flags);
@@ -456,6 +459,9 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
unsigned short old;
unsigned long flags;
+ if (!atomic_read(&swap_cgroup_initialized))
+ return 0;
+
sc = lookup_swap_cgroup(ent, &ctrl);
spin_lock_irqsave(&ctrl->lock, flags);
@@ -474,6 +480,9 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
*/
unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
{
+ if (!atomic_read(&swap_cgroup_initialized))
+ return 0;
+
return lookup_swap_cgroup(ent, NULL)->id;
}
--
1.7.9.5
* Re: [RFC PATCH 1/3] memcg: refactor pages allocation/free for swap_cgroup
2012-12-04 8:35 ` [RFC PATCH 1/3] memcg: refactor pages allocation/free for swap_cgroup Jeff Liu
@ 2012-12-04 10:11 ` Michal Hocko
2012-12-04 10:46 ` Jeff Liu
From: Michal Hocko @ 2012-12-04 10:11 UTC (permalink / raw)
To: Jeff Liu; +Cc: linux-mm, cgroups, Glauber Costa
On Tue 04-12-12 16:35:55, Jeff Liu wrote:
[...]
> /*
> - * allocate buffer for swap_cgroup.
> + * Allocate pages for swap_cgroup upon a given type.
> */
> -static int swap_cgroup_prepare(int type)
> +static int swap_cgroup_alloc_pages(int type)
I am not sure this name is better. The whole point of the function is
the prepare swap accounting internals. Yeah we are allocating here as
well but this is not that important.
It also feels strange that the function name suggests we allocate pages
but none of them are returned.
> {
> - struct page *page;
> struct swap_cgroup_ctrl *ctrl;
> - unsigned long idx, max;
> + unsigned long i, length, max;
>
> ctrl = &swap_cgroup_ctrl[type];
> -
> - for (idx = 0; idx < ctrl->length; idx++) {
> - page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> + length = ctrl->length;
> + for (i = 0; i < length; i++) {
> + struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> if (!page)
> goto not_enough_page;
> - ctrl->map[idx] = page;
> + ctrl->map[i] = page;
> }
> +
> return 0;
> +
> not_enough_page:
> - max = idx;
> - for (idx = 0; idx < max; idx++)
> - __free_page(ctrl->map[idx]);
> + max = i;
> + for (i = 0; i < max; i++)
> + __free_page(ctrl->map[i]);
>
> return -ENOMEM;
> }
Is there any reason for the local variable rename exercise?
I really do not like it.
>
> +static void swap_cgroup_free_pages(int type)
> +{
> + struct swap_cgroup_ctrl *ctrl;
> + struct page **map;
> +
> + ctrl = &swap_cgroup_ctrl[type];
> + map = ctrl->map;
> + if (map) {
> + unsigned long length = ctrl->length;
> + unsigned long i;
> +
> + for (i = 0; i < length; i++) {
> + struct page *page = map[i];
> + if (page)
> + __free_page(page);
> + }
> + }
> +}
> +
This function is not used in this patch so I would suggest moving it
into the #2.
> static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> struct swap_cgroup_ctrl **ctrlp)
> {
> @@ -477,7 +497,7 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
> ctrl->length = length;
> ctrl->map = array;
> spin_lock_init(&ctrl->lock);
> - if (swap_cgroup_prepare(type)) {
> + if (swap_cgroup_alloc_pages(type)) {
> /* memory shortage */
> ctrl->map = NULL;
> ctrl->length = 0;
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 1/3] memcg: refactor pages allocation/free for swap_cgroup
2012-12-04 10:11 ` Michal Hocko
@ 2012-12-04 10:46 ` Jeff Liu
From: Jeff Liu @ 2012-12-04 10:46 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, cgroups, Glauber Costa
Hi Michal,
Thanks for your prompt response.
On 12/04/2012 06:11 PM, Michal Hocko wrote:
> On Tue 04-12-12 16:35:55, Jeff Liu wrote:
> [...]
>> /*
>> - * allocate buffer for swap_cgroup.
>> + * Allocate pages for swap_cgroup upon a given type.
>> */
>> -static int swap_cgroup_prepare(int type)
>> +static int swap_cgroup_alloc_pages(int type)
>
> I am not sure this name is better. The whole point of the function is
> the prepare swap accounting internals. Yeah we are allocating here as
> well but this is not that important.
I spent nearly half of the time on these patches trying to figure out a couple
of meaningful names for the page allocation/free helpers, but...
Maybe I should keep the old name (swap_cgroup_prepare()) unchanged and
not introduce the corresponding swap_cgroup_free_pages(), since that
helper is only called by swap_cgroup_destroy() in my current implementation.
> It also feels strange that the function name suggests we allocate pages
> but none of them are returned.
Yep, generally only functions that allocate and return pages should be
named xxx_alloc_pages_xxx(), so the name really does look strange here.
>
>> {
>> - struct page *page;
>> struct swap_cgroup_ctrl *ctrl;
>> - unsigned long idx, max;
>> + unsigned long i, length, max;
>>
>> ctrl = &swap_cgroup_ctrl[type];
>> -
>> - for (idx = 0; idx < ctrl->length; idx++) {
>> - page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> + length = ctrl->length;
>> + for (i = 0; i < length; i++) {
>> + struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> if (!page)
>> goto not_enough_page;
>> - ctrl->map[idx] = page;
>> + ctrl->map[i] = page;
>> }
>> +
>> return 0;
>> +
>> not_enough_page:
>> - max = idx;
>> - for (idx = 0; idx < max; idx++)
>> - __free_page(ctrl->map[idx]);
>> + max = i;
>> + for (i = 0; i < max; i++)
>> + __free_page(ctrl->map[i]);
>>
>> return -ENOMEM;
>> }
>
> Is there any reason for the local variables rename exercise?
> I really do not like it.
Ah, I cannot recall why I renamed them at the time; I will change them
back in the next round of the post.
>
>>
>> +static void swap_cgroup_free_pages(int type)
>> +{
>> + struct swap_cgroup_ctrl *ctrl;
>> + struct page **map;
>> +
>> + ctrl = &swap_cgroup_ctrl[type];
>> + map = ctrl->map;
>> + if (map) {
>> + unsigned long length = ctrl->length;
>> + unsigned long i;
>> +
>> + for (i = 0; i < length; i++) {
>> + struct page *page = map[i];
>> + if (page)
>> + __free_page(page);
>> + }
>> + }
>> +}
>> +
>
> This function is not used in this patch so I would suggest moving it
> into the #2.
OK. Maybe it will be merged into swap_cgroup_destroy(), as mentioned
above.
>
>> static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
>> struct swap_cgroup_ctrl **ctrlp)
>> {
>> @@ -477,7 +497,7 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
>> ctrl->length = length;
>> ctrl->map = array;
>> spin_lock_init(&ctrl->lock);
>> - if (swap_cgroup_prepare(type)) {
>> + if (swap_cgroup_alloc_pages(type)) {
>> /* memory shortage */
>> ctrl->map = NULL;
>> ctrl->length = 0;
>
Thanks Again!
-Jeff
* Re: [RFC PATCH 2/3] memcg: disable pages allocation for swap cgroup on system booting up
2012-12-04 8:36 ` [RFC PATCH 2/3] memcg: disable pages allocation for swap cgroup on system booting up Jeff Liu
@ 2012-12-04 11:17 ` Michal Hocko
2012-12-04 11:22 ` Michal Hocko
2012-12-04 12:34 ` Michal Hocko
From: Michal Hocko @ 2012-12-04 11:17 UTC (permalink / raw)
To: Jeff Liu; +Cc: linux-mm, cgroups, Glauber Costa
On Tue 04-12-12 16:36:11, Jeff Liu wrote:
> - Disable page allocation for the swap cgroup at the system boot stage.
> - Perform the page allocation at swapon time if child memcgs are alive, because the
>   user might have disabled one or more swap files/partitions for some reason.
> - Introduce a couple of helpers to deal with page allocation/free for the swap cgroup.
> - Introduce a new static variable to track child memcg creation/removal.
This approach doesn't work (unless I missed something). See below.
> Signed-off-by: Jie Liu <jeff.liu@oracle.com>
> CC: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
> include/linux/page_cgroup.h | 12 +++++
> mm/page_cgroup.c | 109 ++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 110 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 777a524..2b94fc0 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -113,6 +113,8 @@ extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
> extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
> extern int swap_cgroup_swapon(int type, unsigned long max_pages);
> extern void swap_cgroup_swapoff(int type);
> +extern int swap_cgroup_init(void);
> +extern void swap_cgroup_destroy(void);
> #else
>
> static inline
> @@ -138,6 +140,16 @@ static inline void swap_cgroup_swapoff(int type)
> return;
> }
>
> +static inline int swap_cgroup_init(void)
> +{
> + return 0;
> +}
> +
> +static inline void swap_cgroup_destroy(void)
> +{
> + return;
> +}
> +
> #endif /* CONFIG_MEMCG_SWAP */
>
> #endif /* !__GENERATING_BOUNDS_H */
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 76b1344..f1b257b 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -321,8 +321,8 @@ void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
>
> static DEFINE_MUTEX(swap_cgroup_mutex);
> struct swap_cgroup_ctrl {
> - struct page **map;
> - unsigned long length;
> + struct page **map;
> + unsigned long length;
> spinlock_t lock;
No need to play with whitespaces here. It is only distracting.
> };
>
> @@ -410,6 +410,8 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> return sc + offset % SC_PER_PAGE;
> }
>
> +static atomic_t swap_cgroup_initialized = ATOMIC_INIT(0);
The name is a bit confusing. Maybe something like memsw_accounting_users?
> +
> /**
> * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
> * @ent: swap entry to be cmpxchged
> @@ -497,17 +499,36 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
> ctrl->length = length;
> ctrl->map = array;
> spin_lock_init(&ctrl->lock);
> - if (swap_cgroup_alloc_pages(type)) {
> - /* memory shortage */
> - ctrl->map = NULL;
> - ctrl->length = 0;
> - mutex_unlock(&swap_cgroup_mutex);
> - vfree(array);
> - goto nomem;
> +
> + /*
> + * We would delay page allocation for swap cgroup if swapon(2)
> + * is occurred at system boot phase until the first none-parent
> + * memcg was created.
> + *
> + * However, we might run into the following scenarios:
> + * 1) one/more new swap partitions/files are being enabled
> + * with non-parent memcg+swap_cgroup is/are active.
> + * 2) keep memcg+swap_cgroup being active, but the user has
> + * performed swapoff(2) against the given type of swap
> + * partition or file for some reason, and then the user
> + * turn it on again.
> + * In those cases, we have to allocate the pages in swapon(2)
> + * stage since we have no chance to make it in swap_cgroup_init()
> + * until a new child memcg was created.
> + */
The comment gives more confusion than useful information in my opinion
(especially the swapoff & swapon part).
> + if (atomic_read(&swap_cgroup_initialized)) {
Hmm, OK we cannot race with cgroup creation here because you are already
holding the swap_cgroup_mutex while other path takes it after
atomic_add_return. This is rather fragile and it would deserve a
comment.
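To make that ordering concrete, here is a hedged userspace sketch of the pattern under discussion. The names are illustrative only: a tiny test-and-set lock stands in for swap_cgroup_mutex, and a bool stands in for the actual page allocation. The point is that the counter is bumped before the lock is taken, while the swapon-side check only reads it under the same lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag ctrl_lock = ATOMIC_FLAG_INIT;  /* models swap_cgroup_mutex */
static atomic_int init_count;                     /* models swap_cgroup_initialized */
static bool pages_allocated;

static void lock_acquire(void) { while (atomic_flag_test_and_set(&ctrl_lock)) ; }
static void lock_release(void) { atomic_flag_clear(&ctrl_lock); }

/* Called when a non-root memcg is created: only the first caller
 * performs the allocation, and it does so under the lock. */
static void first_memcg_created(void)
{
	if (atomic_fetch_add(&init_count, 1) != 0)
		return;		/* not the first: nothing to do */
	lock_acquire();
	pages_allocated = true;	/* stands in for allocating all the pages */
	lock_release();
}

/* swapon side: the counter is only observed inside the same critical
 * section, which is the (fragile) property being pointed at above --
 * a concurrent memcg creation is serialized either fully before or
 * fully after this check. */
static bool swapon_should_allocate(void)
{
	bool ret;

	lock_acquire();
	ret = atomic_load(&init_count) != 0;
	lock_release();
	return ret;
}
```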
> + if (swap_cgroup_alloc_pages(type)) {
> + /* memory shortage */
> + ctrl->map = NULL;
> + ctrl->length = 0;
> + mutex_unlock(&swap_cgroup_mutex);
> + vfree(array);
> + goto nomem;
> + }
> }
> mutex_unlock(&swap_cgroup_mutex);
> -
> return 0;
> +
Pointless new lines juggling
> nomem:
> printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
> printk(KERN_INFO
> @@ -515,10 +536,74 @@ nomem:
> return -ENOMEM;
> }
>
> +/*
> + * This function is called per child memcg created so that we might
s/per child/per non-root/
> + * arrive here multiple times. But we only allocate pages for swap
> + * cgroup when the first child memcg was created.
> + */
> +int swap_cgroup_init(void)
> +{
> + int type;
> +
> + if (!do_swap_account)
> + return 0;
> +
> + if (atomic_add_return(1, &swap_cgroup_initialized) != 1)
> + return 0;
> +
> + mutex_lock(&swap_cgroup_mutex);
> + for (type = 0; type < MAX_SWAPFILES; type++) {
> + if (swap_cgroup_alloc_pages(type) < 0) {
Why do you initialize MAX_SWAPFILES rather than nr_swapfiles?
Besides that swap_cgroup_alloc_pages is not sufficient because it
doesn't allocate ctrl->map but it tries to put pages in it. So
swap_cgroup_swapon sounds like a better fit (modulo locking). But that
one needs to know maxpages for the partition/file. And that is a real
issue here because you do not have that information during the cgroup
creation time. Or am I missing something?
> + struct swap_cgroup_ctrl *ctrl;
> +
> + ctrl = &swap_cgroup_ctrl[type];
> + mutex_unlock(&swap_cgroup_mutex);
> + ctrl->length = 0;
> + if (ctrl->map) {
> + vfree(ctrl->map);
> + ctrl->map = NULL;
> + }
> + goto nomem;
> + }
> + }
> + mutex_unlock(&swap_cgroup_mutex);
> +
> + return 0;
> +
> +nomem:
> + pr_info("couldn't initialize swap_cgroup, no enough memory.\n");
> + pr_info("swap_cgroup can be disabled by swapaccount=0 boot option\n");
> + return -ENOMEM;
This is duplicating a message from swapon. Can we do it at a single place?
> +}
> +
> +/*
> + * This function is called per memcg removed so that we might arrive
> + * here multiple times, but we only free pages when the last memcg
> + * was removed. Note that:
> + * We won't clean the map pointer and the length which were calculated
> + * at swapon(2) stage because of that we need those info to re-allocate
> + * pages if a child memcg was created again.
> + */
> +void swap_cgroup_destroy(void)
> +{
> + int type;
> +
> + if (!do_swap_account)
> + return;
> +
> + if (atomic_sub_return(1, &swap_cgroup_initialized))
> + return;
> +
> + mutex_lock(&swap_cgroup_mutex);
> + for (type = 0; type < MAX_SWAPFILES; type++)
> + swap_cgroup_free_pages(type);
> + mutex_unlock(&swap_cgroup_mutex);
> +}
> +
What if there are still some pages left on the swap? Please note that
a memcg can outlive its cgroup (we just take a reference on it until
the page is freed).
> void swap_cgroup_swapoff(int type)
> {
> struct page **map;
> - unsigned long i, length;
> + unsigned long length;
> struct swap_cgroup_ctrl *ctrl;
>
> if (!do_swap_account)
> @@ -533,6 +618,8 @@ void swap_cgroup_swapoff(int type)
> mutex_unlock(&swap_cgroup_mutex);
>
> if (map) {
> + unsigned long i;
> +
> for (i = 0; i < length; i++) {
> struct page *page = map[i];
> if (page)
Pointless rename.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 2/3] memcg: disable pages allocation for swap cgroup on system booting up
2012-12-04 11:17 ` Michal Hocko
@ 2012-12-04 11:22 ` Michal Hocko
2012-12-04 12:34 ` Michal Hocko
From: Michal Hocko @ 2012-12-04 11:22 UTC (permalink / raw)
To: Jeff Liu; +Cc: linux-mm, cgroups, Glauber Costa
On Tue 04-12-12 12:17:21, Michal Hocko wrote:
> On Tue 04-12-12 16:36:11, Jeff Liu wrote:
> > - Disable page allocation for the swap cgroup at the system boot stage.
> > - Perform the page allocation at swapon time if child memcgs are alive, because the
> >   user might have disabled one or more swap files/partitions for some reason.
> > - Introduce a couple of helpers to deal with page allocation/free for the swap cgroup.
> > - Introduce a new static variable to track child memcg creation/removal.
>
> This approach doesn't work (unless I missed something). See below.
Wait a second, I have missed that ctrl->map is already initialized.
Will get back to you...
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 2/3] memcg: disable pages allocation for swap cgroup on system booting up
2012-12-04 11:17 ` Michal Hocko
2012-12-04 11:22 ` Michal Hocko
@ 2012-12-04 12:34 ` Michal Hocko
2012-12-04 12:51 ` Jeff Liu
From: Michal Hocko @ 2012-12-04 12:34 UTC (permalink / raw)
To: Jeff Liu; +Cc: linux-mm, cgroups, Glauber Costa
On Tue 04-12-12 12:17:21, Michal Hocko wrote:
> On Tue 04-12-12 16:36:11, Jeff Liu wrote:
[...]
> > + * arrive here multiple times. But we only allocate pages for swap
> > + * cgroup when the first child memcg was created.
> > + */
> > +int swap_cgroup_init(void)
> > +{
> > + int type;
> > +
> > + if (!do_swap_account)
> > + return 0;
> > +
> > + if (atomic_add_return(1, &swap_cgroup_initialized) != 1)
> > + return 0;
> > +
> > + mutex_lock(&swap_cgroup_mutex);
> > + for (type = 0; type < MAX_SWAPFILES; type++) {
> > + if (swap_cgroup_alloc_pages(type) < 0) {
>
> Why do you initialize MAX_SWAPFILES rather than nr_swapfiles?
>
> Besides that swap_cgroup_alloc_pages is not sufficient because it
> doesn't allocate ctrl->map but it tries to put pages in it.
Sorry, I had missed that you kept the ctrl->map initialization in
swap_cgroup_swapon, so this is not an issue.
I think you can do better if swap_cgroup_swapon only initializes
ctrl->length and defers all the rest to swap_cgroup_alloc_pages (or
its original name, which suits better), including the map allocation
that is currently done in swap_cgroup_swapon.
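A rough userspace model of that suggestion (illustrative names only; calloc() stands in for both vzalloc() of the map array and alloc_page() of the entries): swapon records only the length, and the map array together with all of its pages is allocated in one deferred step, with full rollback on failure.

```c
#include <assert.h>
#include <stdlib.h>

struct ctrl {
	void **map;		/* deferred: stays NULL until first use */
	unsigned long length;	/* the only thing "swapon" records */
};

static void swapon_record(struct ctrl *c, unsigned long length)
{
	c->length = length;	/* no allocation here at all */
	c->map = NULL;
}

/* The deferred step: allocate the map array and all of its pages
 * together, rolling everything back on failure. */
static int deferred_prepare(struct ctrl *c)
{
	unsigned long i;

	c->map = calloc(c->length, sizeof(*c->map));
	if (!c->map)
		return -1;
	for (i = 0; i < c->length; i++) {
		c->map[i] = calloc(1, 4096);
		if (!c->map[i])
			goto rollback;
	}
	return 0;

rollback:
	while (i--)
		free(c->map[i]);
	free(c->map);
	c->map = NULL;
	return -1;
}
```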
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 2/3] memcg: disable pages allocation for swap cgroup on system booting up
2012-12-04 12:34 ` Michal Hocko
@ 2012-12-04 12:51 ` Jeff Liu
From: Jeff Liu @ 2012-12-04 12:51 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, cgroups, Glauber Costa
On 12/04/2012 08:34 PM, Michal Hocko wrote:
> On Tue 04-12-12 12:17:21, Michal Hocko wrote:
>> On Tue 04-12-12 16:36:11, Jeff Liu wrote:
> [...]
>>> + * arrive here multiple times. But we only allocate pages for swap
>>> + * cgroup when the first child memcg was created.
>>> + */
>>> +int swap_cgroup_init(void)
>>> +{
>>> + int type;
>>> +
>>> + if (!do_swap_account)
>>> + return 0;
>>> +
>>> + if (atomic_add_return(1, &swap_cgroup_initialized) != 1)
>>> + return 0;
>>> +
>>> + mutex_lock(&swap_cgroup_mutex);
>>> + for (type = 0; type < MAX_SWAPFILES; type++) {
>>> + if (swap_cgroup_alloc_pages(type) < 0) {
>>
>> Why do you initialize MAX_SWAPFILES rather than nr_swapfiles?
>>
>> Besides that swap_cgroup_alloc_pages is not sufficient because it
>> doesn't allocate ctrl->map but it tries to put pages in it.
>
> Sorry, I missed that you kept the ctrl->map initialization in
> swap_cgroup_swapon, so this is not an issue.
> I think you could do better if swap_cgroup_swapon only initialized
> ctrl->length and deferred all the rest to swap_cgroup_alloc_pages (or
> its original name, which suits better), including the map allocation
> which is currently done in swap_cgroup_swapon.
Definitely, it's better to defer the ctrl->map array allocation to that
phase; thank you for pointing it out. :)
I'll fix those issues according to your comments in another email.
-Jeff
* Re: [RFC PATCH 3/3] memcg: allocate pages for swap cgroup until the first child memcg is alive
2012-12-04 8:36 ` [RFC PATCH 3/3] memcg: allocate pages for swap cgroup until the first child memcg is alive Jeff Liu
@ 2012-12-04 12:54 ` Michal Hocko
2012-12-04 13:14 ` Jeff Liu
0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2012-12-04 12:54 UTC (permalink / raw)
To: Jeff Liu; +Cc: linux-mm, cgroups, Glauber Costa
On Tue 04-12-12 16:36:17, Jeff Liu wrote:
> - Call swap_cgroup_init() when the first child memcg was created.
> - Free pages from swap cgroup once the latest child memcg was removed.
This should be a separate patch.
> - Teach swap_cgroup_record()/swap_cgroup_cmpxchg()/swap_cgroup_lookup_id()
> to be aware of the new static variable, i.e. swap_cgroup_initialized, so that
> they do not perform further work if swap cgroup is not active.
This should be a preparatory patch.
>
> Signed-off-by: Jie Liu <jeff.liu@oracle.com>
> CC: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
> mm/memcontrol.c | 3 +++
> mm/page_cgroup.c | 9 +++++++++
> 2 files changed, 12 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index dd39ba0..1f14375 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4982,6 +4982,8 @@ mem_cgroup_create(struct cgroup *cont)
> }
> hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
> } else {
> + if (swap_cgroup_init())
> + goto free_out;
> parent = mem_cgroup_from_cont(cont->parent);
> memcg->use_hierarchy = parent->use_hierarchy;
> memcg->oom_kill_disable = parent->oom_kill_disable;
> @@ -5046,6 +5048,7 @@ static void mem_cgroup_destroy(struct cgroup *cont)
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>
> kmem_cgroup_destroy(memcg);
> + swap_cgroup_destroy();
>
> mem_cgroup_put(memcg);
> }
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index f1b257b..63c6789 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -429,6 +429,9 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
> unsigned long flags;
> unsigned short retval;
>
> + if (!atomic_read(&swap_cgroup_initialized))
> + return 0;
> +
> sc = lookup_swap_cgroup(ent, &ctrl);
>
> spin_lock_irqsave(&ctrl->lock, flags);
> @@ -456,6 +459,9 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
> unsigned short old;
> unsigned long flags;
>
> + if (!atomic_read(&swap_cgroup_initialized))
> + return 0;
> +
> sc = lookup_swap_cgroup(ent, &ctrl);
>
> spin_lock_irqsave(&ctrl->lock, flags);
> @@ -474,6 +480,9 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
> */
> unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> {
> + if (!atomic_read(&swap_cgroup_initialized))
> + return 0;
> +
> return lookup_swap_cgroup(ent, NULL)->id;
I think that the swap_cgroup_initialized test should be stuffed into
lookup_swap_cgroup because it would be less fragile.
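[Editor's note: folding the check into the lookup helper, as suggested, might
look roughly like this userspace sketch. Types, the fixed-size table, and the
atomic flag are simplified illustrations, not the kernel code:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct swap_cgroup { unsigned short id; };

static atomic_int swap_cgroup_initialized;
static struct swap_cgroup table[16];   /* stand-in for the page map */

/* Single check point: every caller goes through here, so none of
 * them can forget to test the flag themselves. */
static struct swap_cgroup *lookup_swap_cgroup(unsigned long offset)
{
    if (!atomic_load(&swap_cgroup_initialized))
        return NULL;
    return &table[offset];
}

static unsigned short lookup_swap_cgroup_id(unsigned long offset)
{
    struct swap_cgroup *sc = lookup_swap_cgroup(offset);

    return sc ? sc->id : 0;     /* 0 == "not tracked" */
}
```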
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 3/3] memcg: allocate pages for swap cgroup until the first child memcg is alive
2012-12-04 12:54 ` Michal Hocko
@ 2012-12-04 13:14 ` Jeff Liu
0 siblings, 0 replies; 14+ messages in thread
From: Jeff Liu @ 2012-12-04 13:14 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, cgroups, Glauber Costa
On 12/04/2012 08:54 PM, Michal Hocko wrote:
> On Tue 04-12-12 16:36:17, Jeff Liu wrote:
>> - Call swap_cgroup_init() when the first child memcg was created.
>> - Free pages from swap cgroup once the latest child memcg was removed.
>
> This should be a separate patch
>
>> - Teach swap_cgroup_record()/swap_cgroup_cmpxchg()/swap_cgroup_lookup_id()
>> to be aware of the new static variable, i.e. swap_cgroup_initialized, so that
>> they do not perform further work if swap cgroup is not active.
>
> This should be a preparatory patch.
OK, these changes will be reflected in the next round of posting.
>
>>
>> Signed-off-by: Jie Liu <jeff.liu@oracle.com>
>> CC: Glauber Costa <glommer@parallels.com>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> CC: Mel Gorman <mgorman@suse.de>
>> CC: Andrew Morton <akpm@linux-foundation.org>
>> ---
>> mm/memcontrol.c | 3 +++
>> mm/page_cgroup.c | 9 +++++++++
>> 2 files changed, 12 insertions(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index dd39ba0..1f14375 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4982,6 +4982,8 @@ mem_cgroup_create(struct cgroup *cont)
>> }
>> hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>> } else {
>> + if (swap_cgroup_init())
>> + goto free_out;
>> parent = mem_cgroup_from_cont(cont->parent);
>> memcg->use_hierarchy = parent->use_hierarchy;
>> memcg->oom_kill_disable = parent->oom_kill_disable;
>> @@ -5046,6 +5048,7 @@ static void mem_cgroup_destroy(struct cgroup *cont)
>> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>>
>> kmem_cgroup_destroy(memcg);
>> + swap_cgroup_destroy();
>>
>> mem_cgroup_put(memcg);
>> }
>> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
>> index f1b257b..63c6789 100644
>> --- a/mm/page_cgroup.c
>> +++ b/mm/page_cgroup.c
>> @@ -429,6 +429,9 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
>> unsigned long flags;
>> unsigned short retval;
>>
>> + if (!atomic_read(&swap_cgroup_initialized))
>> + return 0;
>> +
>> sc = lookup_swap_cgroup(ent, &ctrl);
>>
>> spin_lock_irqsave(&ctrl->lock, flags);
>> @@ -456,6 +459,9 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
>> unsigned short old;
>> unsigned long flags;
>>
>> + if (!atomic_read(&swap_cgroup_initialized))
>> + return 0;
>> +
>> sc = lookup_swap_cgroup(ent, &ctrl);
>>
>> spin_lock_irqsave(&ctrl->lock, flags);
>> @@ -474,6 +480,9 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
>> */
>> unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>> {
>> + if (!atomic_read(&swap_cgroup_initialized))
>> + return 0;
>> +
>> return lookup_swap_cgroup(ent, NULL)->id;
>
> I think that the swap_cgroup_initialized test should be stuffed into
> lookup_swap_cgroup because it would be less fragile.
Originally, I spread this test over those functions in order to avoid
adding an additional check against the return value of
lookup_swap_cgroup(), but that was misguided; I will follow your suggestion.
Thanks,
-Jeff
* Re: [RFC PATCH 0/3] Disable swap cgroup allocation at system boot stage
2012-12-04 8:35 [RFC PATCH 0/3] Disable swap cgroup allocation at system boot stage Jeff Liu
` (2 preceding siblings ...)
2012-12-04 8:36 ` [RFC PATCH 3/3] memcg: allocate pages for swap cgroup until the first child memcg is alive Jeff Liu
@ 2012-12-04 13:18 ` Michal Hocko
2012-12-04 14:00 ` Jeff Liu
3 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2012-12-04 13:18 UTC (permalink / raw)
To: Jeff Liu; +Cc: linux-mm, cgroups, Glauber Costa
On Tue 04-12-12 16:35:44, Jeff Liu wrote:
> Hello,
Hi,
>
> Currently, we allocate pages for swap cgroup when the system is booting up.
That is not precise. We do that at _swapon_ time, which usually
happens during boot, but still...
> Which means that a particular size of pre-allocated memory(depending
> on the total size of the enabled swap files/partitions) would be
> wasted if there is no child memcg being alive.
>
> This patch set is intended to defer the memory allocation for swap
> cgroup until the first children of memcg was created. Actually, it was
> totally inspired by Glabuer's previous proposed patch set, which can
> be found at: "memcg: do not call page_cgroup_init at system_boot".
> http://lwn.net/Articles/517562/
>
> These patches works to me with some sanity check up. There must
> have some issues I am not aware of for now, at least, performing
> swapon/swapoff when there have child memcg alives can run into some
> potential race conditions that would end up go into bad_page() path...
The locking is kind of awkward because there are indirect locking
dependencies which are not described anywhere.
> but I'd like to post it early to seek any directions if possible, so
> that I can continue to improve it.
Besides small things I have commented on already there is one bigger
issue. You are assuming that the swap tracking is applicable only
if there is at least one cgroup but the root. This is usually true
because we cannot set up any limit for the root cgroup. But this
doesn't consider that we do _tracking_ even though there is no limit
for the root cgroup. This is important as soon as a task is moved
to another group (with immigrate_on_move enabled). Your series would
mean that swapped-out pages are not migrated (have a look at
swap_cgroup_cmpxchg). Maybe this can be worked around somehow (e.g.
return the root cgroup id if the swap_cgroup is 0).
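[Editor's note: the root-id fallback floated above could be as small as this
sketch. ROOT_MEM_CGROUP_ID is a made-up constant for illustration; the real
root cgroup id would come from the memcg code:]

```c
#include <assert.h>

#define ROOT_MEM_CGROUP_ID 1    /* hypothetical value */

/* Under the lazy scheme, a recorded id of 0 can mean "swapped out
 * before tracking was enabled"; report it as owned by the root
 * cgroup so charge moving still migrates the entry. */
static unsigned short fixup_swap_id(unsigned short raw)
{
    return raw ? raw : ROOT_MEM_CGROUP_ID;
}
```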
The patches should be also cleaned up a bit and changelogs enhanced to
be more descriptive. I would suggest the following split up
- swap_cgroup_swapon should only setup ctrl->length and move the
rest to swap_cgroup_prepare
- introduce memsw_accounting_users (have it 1 by default now)
and call swap_cgroup_prepare only if it is > 1. Same applies
to lookup_swap_cgroup
- hackaround charge moving from the root cgroup
- make memsw_accounting_users 0 by default and increment it on
a first non-root cgroup creation + call swap_cgroup_swapon.
- (optionally) deallocate swap accounting structures on the last
non-root cgroup removal - I am not sure this is really
necessary but if it is then it should be done only after all
references to the memcg are gone rather than from
mem_cgroup_destroy
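[Editor's note: the memsw_accounting_users counter from the split-up above,
in its final form (0 by default), could behave roughly like this userspace
sketch. All names are hypothetical and the allocation is a stand-in:]

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int memsw_accounting_users;
static int structures_allocated;

static int swap_cgroup_prepare(void)
{
    structures_allocated = 1;   /* stand-in for the real allocation */
    return 0;
}

/* Called when a non-root memcg is created: the first user turns
 * swap accounting structures on. */
static int memcg_created(void)
{
    if (atomic_fetch_add(&memsw_accounting_users, 1) == 0)
        return swap_cgroup_prepare();
    return 0;
}

/* Called when a non-root memcg goes away: the last user may
 * (optionally) tear the structures down again. */
static void memcg_destroyed(void)
{
    if (atomic_fetch_sub(&memsw_accounting_users, 1) == 1)
        structures_allocated = 0;
}
```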
Thanks!
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Disable swap cgroup allocation at system boot stage
2012-12-04 13:18 ` [RFC PATCH 0/3] Disable swap cgroup allocation at system boot stage Michal Hocko
@ 2012-12-04 14:00 ` Jeff Liu
0 siblings, 0 replies; 14+ messages in thread
From: Jeff Liu @ 2012-12-04 14:00 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, cgroups, Glauber Costa
On 12/04/2012 09:18 PM, Michal Hocko wrote:
> On Tue 04-12-12 16:35:44, Jeff Liu wrote:
>> Hello,
>
> Hi,
>
>>
>> Currently, we allocate pages for swap cgroup when the system is booting up.
>
> That is not precise. We do that at _swapon_ time, which usually
> happens during boot, but still...
Ah! I should have indicated that it happens at any _swapon_ time. :-P
>
>> Which means that a particular size of pre-allocated memory(depending
>> on the total size of the enabled swap files/partitions) would be
>> wasted if there is no child memcg being alive.
>>
>> This patch set is intended to defer the memory allocation for swap
>> cgroup until the first children of memcg was created. Actually, it was
>> totally inspired by Glabuer's previous proposed patch set, which can
>> be found at: "memcg: do not call page_cgroup_init at system_boot".
>> http://lwn.net/Articles/517562/
>>
>> These patches works to me with some sanity check up. There must
>> have some issues I am not aware of for now, at least, performing
>> swapon/swapoff when there have child memcg alives can run into some
>> potential race conditions that would end up go into bad_page() path...
>
> The locking is kind of awkward because there are indirect locking
> dependencies which are not described anywhere.
Yes. Anyway, I need to do further investigation and testing to
avoid such situations.
>
>> but I'd like to post it early to seek any directions if possible, so
>> that I can continue to improve it.
>
> Besides small things I have commented on already there is one bigger
> issue. You are assuming that the swap tracking is applicable only
> if there is at least one cgroup but the root. This is usually true
> because we cannot set up any limit for the root cgroup. But this
> doesn't consider that we do _tracking_ even though there is no limit
> for the root cgroup. This is important as soon as a task is moved
> to another group (with immigrate_on_move enabled). Your series would
> mean that swapped-out pages are not migrated (have a look at
> swap_cgroup_cmpxchg).
> Maybe this can be worked around somehow (e.g. return the root cgroup
> id if the swap_cgroup is 0).
>
> The patches should be also cleaned up a bit and changelogs enhanced to
> be more descriptive. I would suggest the following split up
> - swap_cgroup_swapon should only setup ctrl->length and move the
> rest to swap_cgroup_prepare
> - introduce memsw_accounting_users (have it 1 by default now)
> and call swap_cgroup_prepare only if it is > 1. Same applies
> to lookup_swap_cgroup
> - hackaround charge moving from the root cgroup
> - make memsw_accounting_users 0 by default and increment it on
> a first non-root cgroup creation + call swap_cgroup_swapon.
> - (optionally) deallocate swap accounting structures on the last
> non-root cgroup removal - I am not sure this is really
> necessary but if it is then it should be done only after all
> references to the memcg are gone rather than from
> mem_cgroup_destroy
This was a fuzzy point for me when writing these patches; I'll try to
make it better.
Thank you for pointing out my mistakes with such detailed directions!
-Jeff
end of thread, other threads:[~2012-12-04 14:00 UTC | newest]
Thread overview: 14+ messages
2012-12-04 8:35 [RFC PATCH 0/3] Disable swap cgroup allocation at system boot stage Jeff Liu
2012-12-04 8:35 ` [RFC PATCH 1/3] memcg: refactor pages allocation/free for swap_cgroup Jeff Liu
2012-12-04 10:11 ` Michal Hocko
2012-12-04 10:46 ` Jeff Liu
2012-12-04 8:36 ` [RFC PATCH 2/3] memcg: disable pages allocation for swap cgroup on system booting up Jeff Liu
2012-12-04 11:17 ` Michal Hocko
2012-12-04 11:22 ` Michal Hocko
2012-12-04 12:34 ` Michal Hocko
2012-12-04 12:51 ` Jeff Liu
2012-12-04 8:36 ` [RFC PATCH 3/3] memcg: allocate pages for swap cgroup until the first child memcg is alive Jeff Liu
2012-12-04 12:54 ` Michal Hocko
2012-12-04 13:14 ` Jeff Liu
2012-12-04 13:18 ` [RFC PATCH 0/3] Disable swap cgroup allocation at system boot stage Michal Hocko
2012-12-04 14:00 ` Jeff Liu