[RFC 0/3] Implementation of cgroup isolation

linux-mm.kvack.org archive mirror

* [RFC 0/3] Implementation of cgroup isolation
@ 2011-03-28  9:39 Michal Hocko
  2011-03-28  9:39 ` [RFC 1/3] Add mem_cgroup->isolated and configuration knob Michal Hocko
                   ` (4 more replies)
  0 siblings, 5 replies; 35+ messages in thread
From: Michal Hocko @ 2011-03-28  9:39 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

Hi all,

Memory cgroups can currently be used to throttle the memory usage of a
group of processes. They cannot, however, be used to isolate processes
from the rest of the system, because all pages that belong to the group
are also placed on the global LRU lists and so are eligible for global
memory reclaim.

This patchset aims to provide opt-in memory cgroup isolation: a cgroup
can be configured to be isolated from the rest of the system through the
cgroup virtual filesystem (/dev/memctl/group/memory.isolated).
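From userspace, toggling the knob is just a write of 0 or 1 to that file.
A minimal sketch, assuming the controller is mounted under /dev/memctl as
in the example path above; set_isolated() is a hypothetical helper name,
not part of the series:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write 0 or 1 to a memory.isolated style knob.
 * Returns 0 on success and -1 on failure. A write rejected by the kernel
 * (the series returns -EINVAL for the root group) may only show up when
 * the buffered data is flushed, so the fclose() result is checked too.
 */
int set_isolated(const char *knob_path, int enable)
{
	FILE *f = fopen(knob_path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", enable ? 1 : 0) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would pass the knob path of the group in question, e.g.
set_isolated("/dev/memctl/group/memory.isolated", 1).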

An isolated mem cgroup can be particularly helpful in deployments where a
primary service (e.g. a database server) needs certain guarantees for
memory resources and we want to shield it from the rest of the system
(e.g. a burst of memory activity in another group). This is currently
possible only by mlocking the memory that is essential for the
application(s), or with a rather hacky configuration where the primary
app is in the root mem cgroup while all other system activity happens in
other groups.

mlocking is not always an ideal solution because the working set can be
very large and depends on the workload (e.g. the number of incoming
requests), so it can end up not fitting into memory (and trigger the OOM
killer). If we use mem cgroup isolation instead, we keep the memory
resident, and if the working set goes wild we can still do per-cgroup
reclaim, so the service is less prone to being OOM killed.

The series is split into 3 patches. The first one adds a new flag to the
mem_cgroup structure which controls whether the group is isolated (false
by default) and a cgroup fs interface to set it.
The second patch implements the interaction with the global LRU. The
current semantics are that we put a page on a global LRU only if the mem
cgroup LRU functions say they do not want the page for themselves.
The last patch prevents soft reclaim of a group if it is isolated.
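The decision made on the LRU add path can be modelled in userspace roughly
as follows. This is only a sketch of the semantics described above: the
toy_* types and functions are hypothetical stand-ins, and the LRU lists are
reduced to counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for mem_cgroup, zone and page. */
struct toy_memcg {
	bool is_root;
	bool isolated;
	size_t nr_lru;		/* pages on the per-cgroup LRU */
};

struct toy_zone {
	size_t nr_global_lru;	/* pages on the global LRU */
};

struct toy_page {
	struct toy_memcg *memcg;	/* NULL if the page is not charged */
	bool on_global_lru;
};

/*
 * Models the new return value of the memcg LRU add function:
 * true means "the page should also go on the global LRU".
 */
bool toy_memcg_add_lru(struct toy_page *page)
{
	struct toy_memcg *memcg = page->memcg;

	if (!memcg || memcg->is_root)
		return true;
	memcg->nr_lru++;
	return !memcg->isolated;
}

/* Models the caller: link into the global LRU only when told to. */
void toy_add_page_to_lru(struct toy_zone *zone, struct toy_page *page)
{
	page->on_global_lru = toy_memcg_add_lru(page);
	if (page->on_global_lru)
		zone->nr_global_lru++;
}
```

A page charged to an isolated group ends up on the per-cgroup LRU only,
while pages of non-isolated groups and uncharged pages stay visible to the
global reclaim.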

I have tested the patches with a simple memory consumer (allocating
private and shared anon memory and SYSV SHM).

One instance (call it the big consumer) runs in the group, pages in its
memory (>90% of the cgroup limit) and sleeps for the rest of its life.
Then I had a pool of consumers running in the same cgroup, each paging in
a smaller amount of memory in a loop to simulate in-group memory pressure
(call them sharks).
The sum of the consumed memory is more than memory.limit_in_bytes, so
some portion of the memory is swapped out.
One more consumer runs in parallel in the root cgroup and puts pressure
on memory (to trigger background reclaim).

Rss+cache of the group drops significantly (~66% of the limit) if the
group is not isolated. If we isolate the group, on the other hand, it
stays saturated (~97% of the limit). I can show more comprehensive
results if somebody is interested.
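For readers who want to reproduce a similar setup, the core of such a
consumer might look like the sketch below. It is not the test program used
for the numbers above; consume_anon() is a hypothetical name and only the
private anonymous case is shown.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of private anonymous memory and write to every page so
 * that it is really paged in and charged to the caller's cgroup.
 * Returns the number of pages touched (0 on failure) and stores the
 * mapping in *out so the caller can keep it or munmap() it.
 */
size_t consume_anon(size_t len, unsigned char **out)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t off, touched = 0;

	if (page <= 0 || len == 0)
		return 0;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 0;
	for (off = 0; off < len; off += (size_t)page) {
		p[off] = 1;
		touched++;
	}
	*out = p;
	return touched;
}
```

A "big consumer" would call this once and then sleep, while a "shark"
would call it repeatedly in a loop, unmapping between iterations.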

Thanks for comments.

---
 include/linux/memcontrol.h |   24 ++++++++------
 include/linux/mm_inline.h  |   10 ++++-
 mm/memcontrol.c            |   76 ++++++++++++++++++++++++++++++++++++---------
 mm/swap.c                  |   12 ++++---
 mm/vmscan.c                |   43 +++++++++++++++----------
 5 files changed, 118 insertions(+), 47 deletions(-)

-- 
Michal Hocko

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [RFC 1/3] Add mem_cgroup->isolated and configuration knob
  2011-03-28  9:39 [RFC 0/3] Implementation of cgroup isolation Michal Hocko
@ 2011-03-28  9:39 ` Michal Hocko
  2011-03-28  9:39 ` [RFC 2/3] Implement isolated LRU cgroups Michal Hocko
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2011-03-28  9:39 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: memcg_add_isolated_lru_knob.patch --]
[-- Type: text/plain, Size: 3428 bytes --]

This is the first patch in the series. It just adds an isolated boolean to
the mem_cgroup structure. The value says whether pages charged to this
group should be isolated from the rest of the system (they are not by
default).

The patch also adds a cgroup fs interface to modify the isolation status
of a group: the memory.isolated knob in the group's directory (e.g.
/dev/memctl/group/memory.isolated). The root group cannot be isolated.

Signed-off-by: Michal Hocko <mhocko@suse.cz>

--- 
 include/linux/memcontrol.h |    2 ++
 mm/memcontrol.c            |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

Index: linux-2.6.38-rc8/mm/memcontrol.c
===================================================================
--- linux-2.6.38-rc8.orig/mm/memcontrol.c	2011-03-28 11:13:27.000000000 +0200
+++ linux-2.6.38-rc8/mm/memcontrol.c	2011-03-28 11:25:00.000000000 +0200
@@ -245,6 +245,10 @@ struct mem_cgroup {
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
+	/* is the group isolated from the global LRU? */
+	/* TODO can we place it into a hole */
+	bool		isolated;
+
 	/* protect arrays of thresholds */
 	struct mutex thresholds_lock;
 
@@ -4295,6 +4299,32 @@ static int mem_cgroup_oom_control_write(
 	return 0;
 }
 
+static int mem_cgroup_isolated_write(struct cgroup *cgrp, struct cftype *cft,
+				       u64 val)
+{
+	int ret = -EINVAL;
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	/* We are not allowing isolation of the root memory cgroup as it has
+	 * a special purpose to collect all pages that do not belong to any
+	 * group.
+	 */
+	if (mem_cgroup_is_root(mem))
+		goto out;
+
+	mem->isolated = !!val;
+	ret = 0;
+out:
+	return ret;
+}
+
+static u64 mem_cgroup_isolated_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+	return is_mem_cgroup_isolated(mem);
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4358,6 +4388,11 @@ static struct cftype mem_cgroup_files[]
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "isolated",
+		.write_u64 = mem_cgroup_isolated_write,
+		.read_u64 = mem_cgroup_isolated_read,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -5168,6 +5203,11 @@ static void mem_cgroup_move_task(struct
 }
 #endif
 
+bool is_mem_cgroup_isolated(struct mem_cgroup *mem)
+{
+	return mem->isolated;
+}
+
 struct cgroup_subsys mem_cgroup_subsys = {
 	.name = "memory",
 	.subsys_id = mem_cgroup_subsys_id,
Index: linux-2.6.38-rc8/include/linux/memcontrol.h
===================================================================
--- linux-2.6.38-rc8.orig/include/linux/memcontrol.h	2011-03-28 11:13:27.000000000 +0200
+++ linux-2.6.38-rc8/include/linux/memcontrol.h	2011-03-28 11:25:00.000000000 +0200
@@ -155,6 +155,8 @@ void mem_cgroup_split_huge_fixup(struct
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+bool is_mem_cgroup_isolated(struct mem_cgroup *mem);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 



* [RFC 2/3] Implement isolated LRU cgroups
  2011-03-28  9:39 [RFC 0/3] Implementation of cgroup isolation Michal Hocko
  2011-03-28  9:39 ` [RFC 1/3] Add mem_cgroup->isolated and configuration knob Michal Hocko
@ 2011-03-28  9:39 ` Michal Hocko
  2011-03-28  9:40 ` [RFC 3/3] Do not shrink isolated groups from the global reclaim Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2011-03-28  9:39 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: memcg_handle_global_lru_isolated_pages.patch --]
[-- Type: text/plain, Size: 12138 bytes --]

The primary idea behind isolated pages is better isolation of a group from
the activity of the global system and of other groups. At the moment,
memory cgroups are mainly used to throttle processes in a group by placing
a cap on their memory usage. However, mem cgroups don't protect their
(charged) memory from being evicted by the global reclaim, as all their
pages are on the global LRU.

This feature provides an easy way to set up an application in a
memory-isolated environment without the need for mlock to keep its pages
in memory. Thanks to per-cgroup reclaim, we can eliminate interference
between unrelated cgroups when one of them exhibits a spike in memory
usage.

A similar setup could be achieved with the current implementation as well,
by placing the critical application into the root group while all other
processes are placed in another group (or groups). This is, however, much
harder to configure, and there can be only one such "exclusive" group on
the system, which is quite limiting.

The goal is achieved by isolating those pages from the global LRU and
keeping them on a per-cgroup LRU only, so the memory cgroup is not
affected by the global reclaim at all.

If we isolate mem cgroup pages from the global LRU we can still do
per-cgroup reclaim, so isolation is not the same thing as mlocking that
memory.

is_mem_cgroup_isolated is not called directly by the code that adds
(__add_page_to_lru_list) or moves (isolate_lru_pages,
move_active_pages_to_lru, check_move_unevictable_page, pagevec_move_tail,
lru_deactivate) pages on an LRU, because we would need to find the
page_cgroup for the page and this would add overhead. Instead, we changed
the semantics of the memcg LRU functions (which add or move pages on a mem
cgroup LRU) to return a flag saying whether the page is global (true) or
mem cgroup isolated (false).

page->lru is initialized to an empty list whenever the page is not on the
global LRU, so that the LRU removal path works without modification. The
page is still marked PageLRU, so nobody else will misuse page->lru for
other purposes.
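Why an empty (self-linked) page->lru keeps the removal path unchanged can
be seen with a small userspace model of a circular doubly linked list.
This is a sketch only: the helpers are simplified re-implementations with
hypothetical names, and unlike the kernel's list_del() they do not poison
the removed entry.

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

/* An initialized head or entry points at itself: the list is empty. */
void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert entry right after head. */
void list_add_head(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlink entry from whatever list it is on. For a self-linked entry this
 * only rewrites the entry's own pointers, so no other list is touched.
 */
void list_del_entry(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}
```

Deleting a self-linked entry is therefore harmless, which is what lets the
common removal code run for isolated pages as well.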

Signed-off-by: Michal Hocko <mhocko@suse.cz>

---
 include/linux/memcontrol.h |   22 ++++++++++++----------
 include/linux/mm_inline.h  |   10 ++++++++--
 mm/memcontrol.c            |   36 +++++++++++++++++++++---------------
 mm/swap.c                  |   12 ++++++++----
 mm/vmscan.c                |   25 +++++++++++++++++--------
 5 files changed, 66 insertions(+), 39 deletions(-)

Index: linux-2.6.38-rc8/include/linux/memcontrol.h
===================================================================
--- linux-2.6.38-rc8.orig/include/linux/memcontrol.h	2011-03-28 11:23:58.000000000 +0200
+++ linux-2.6.38-rc8/include/linux/memcontrol.h	2011-03-28 11:24:20.000000000 +0200
@@ -60,12 +60,12 @@ extern void mem_cgroup_cancel_charge_swa
 
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
-extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
+extern bool mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
-extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
-extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
+extern bool mem_cgroup_rotate_reclaimable_page(struct page *page);
+extern bool mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru(struct page *page);
-extern void mem_cgroup_move_lists(struct page *page,
+extern bool mem_cgroup_move_lists(struct page *page,
 				  enum lru_list from, enum lru_list to);
 
 /* For coalescing uncharge for reducing memcg' overhead*/
@@ -209,13 +209,14 @@ static inline int mem_cgroup_shmem_charg
 	return 0;
 }
 
-static inline void mem_cgroup_add_lru_list(struct page *page, int lru)
+static inline bool mem_cgroup_add_lru_list(struct page *page, int lru)
 {
+	return true;
 }
 
-static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
+static inline bool mem_cgroup_del_lru_list(struct page *page, int lru)
 {
-	return ;
+	return true;
 }
 
 static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
@@ -223,9 +224,9 @@ static inline inline void mem_cgroup_rot
 	return ;
 }
 
-static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
+static inline bool mem_cgroup_rotate_lru_list(struct page *page, int lru)
 {
-	return ;
+	return true;
 }
 
 static inline void mem_cgroup_del_lru(struct page *page)
@@ -233,9 +234,10 @@ static inline void mem_cgroup_del_lru(st
 	return ;
 }
 
-static inline void
+static inline bool
 mem_cgroup_move_lists(struct page *page, enum lru_list from, enum lru_list to)
 {
+	return true;
 }
 
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
Index: linux-2.6.38-rc8/include/linux/mm_inline.h
===================================================================
--- linux-2.6.38-rc8.orig/include/linux/mm_inline.h	2011-03-28 11:23:58.000000000 +0200
+++ linux-2.6.38-rc8/include/linux/mm_inline.h	2011-03-28 11:24:20.000000000 +0200
@@ -25,9 +25,15 @@ static inline void
 __add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
 		       struct list_head *head)
 {
-	list_add(&page->lru, head);
 	__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
-	mem_cgroup_add_lru_list(page, l);
+
+	/* Add to the global LRU only if cgroup doesn't want the page 
+	 * exclusively 
+	 */
+	if (mem_cgroup_add_lru_list(page, l))
+		list_add(&page->lru, head);
+	else
+		INIT_LIST_HEAD(&page->lru);
 }
 
 static inline void
Index: linux-2.6.38-rc8/mm/memcontrol.c
===================================================================
--- linux-2.6.38-rc8.orig/mm/memcontrol.c	2011-03-28 11:23:58.000000000 +0200
+++ linux-2.6.38-rc8/mm/memcontrol.c	2011-03-28 11:24:20.000000000 +0200
@@ -866,58 +866,62 @@ void mem_cgroup_del_lru(struct page *pag
  * reclaim.  If it still appears to be reclaimable, move it to the tail of the
  * inactive list.
  */
-void mem_cgroup_rotate_reclaimable_page(struct page *page)
+bool mem_cgroup_rotate_reclaimable_page(struct page *page)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct page_cgroup *pc;
 	enum lru_list lru = page_lru(page);
 
 	if (mem_cgroup_disabled())
-		return;
+		return true;
 
 	pc = lookup_page_cgroup(page);
 	/* unused or root page is not rotated. */
 	if (!PageCgroupUsed(pc))
-		return;
+		return true;
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
 	smp_rmb();
 	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
+		return true;
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
 	list_move_tail(&pc->lru, &mz->lists[lru]);
+
+	return !is_mem_cgroup_isolated(pc->mem_cgroup);
 }
 
-void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
+bool mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct page_cgroup *pc;
 
 	if (mem_cgroup_disabled())
-		return;
+		return true;
 
 	pc = lookup_page_cgroup(page);
 	/* unused or root page is not rotated. */
 	if (!PageCgroupUsed(pc))
-		return;
+		return true;
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
 	smp_rmb();
 	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
+		return true;
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
 	list_move(&pc->lru, &mz->lists[lru]);
+
+	return !is_mem_cgroup_isolated(pc->mem_cgroup);
 }
 
-void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
+bool mem_cgroup_add_lru_list(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup_per_zone *mz;
 
 	if (mem_cgroup_disabled())
-		return;
+		return true;
 	pc = lookup_page_cgroup(page);
 	VM_BUG_ON(PageCgroupAcctLRU(pc));
 	if (!PageCgroupUsed(pc))
-		return;
+		return true;
 	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
 	smp_rmb();
 	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
@@ -925,8 +929,10 @@ void mem_cgroup_add_lru_list(struct page
 	MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
 	SetPageCgroupAcctLRU(pc);
 	if (mem_cgroup_is_root(pc->mem_cgroup))
-		return;
+		return true;
 	list_add(&pc->lru, &mz->lists[lru]);
+
+	return !is_mem_cgroup_isolated(pc->mem_cgroup);
 }
 
 /*
@@ -979,13 +985,13 @@ static void mem_cgroup_lru_add_after_com
 }
 
 
-void mem_cgroup_move_lists(struct page *page,
+bool mem_cgroup_move_lists(struct page *page,
 			   enum lru_list from, enum lru_list to)
 {
 	if (mem_cgroup_disabled())
-		return;
+		return true;
 	mem_cgroup_del_lru_list(page, from);
-	mem_cgroup_add_lru_list(page, to);
+	return mem_cgroup_add_lru_list(page, to);
 }
 
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
Index: linux-2.6.38-rc8/mm/vmscan.c
===================================================================
--- linux-2.6.38-rc8.orig/mm/vmscan.c	2011-03-28 11:23:58.000000000 +0200
+++ linux-2.6.38-rc8/mm/vmscan.c	2011-03-28 11:24:57.000000000 +0200
@@ -1049,8 +1049,10 @@ static unsigned long isolate_lru_pages(u
 
 		case -EBUSY:
 			/* else it is being freed elsewhere */
-			list_move(&page->lru, src);
-			mem_cgroup_rotate_lru_list(page, page_lru(page));
+			if (mem_cgroup_rotate_lru_list(page, page_lru(page)))
+				list_move(&page->lru, src);
+			else
+				list_del_init(&page->lru);
 			continue;
 
 		default:
@@ -1482,8 +1484,11 @@ static void move_active_pages_to_lru(str
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
 
-		list_move(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_add_lru_list(page, lru);
+		if (mem_cgroup_add_lru_list(page, lru))
+			list_move(&page->lru, &zone->lru[lru].list);
+		else
+			list_del_init(&page->lru);
+
 		pgmoved += hpage_nr_pages(page);
 
 		if (!pagevec_add(&pvec, page) || list_empty(list)) {
@@ -3133,8 +3138,10 @@ retry:
 		enum lru_list l = page_lru_base_type(page);
 
 		__dec_zone_state(zone, NR_UNEVICTABLE);
-		list_move(&page->lru, &zone->lru[l].list);
-		mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l);
+		if (mem_cgroup_move_lists(page, LRU_UNEVICTABLE, l))
+			list_move(&page->lru, &zone->lru[l].list);
+		else
+			list_del_init(&page->lru);
 		__inc_zone_state(zone, NR_INACTIVE_ANON + l);
 		__count_vm_event(UNEVICTABLE_PGRESCUED);
 	} else {
@@ -3142,8 +3149,10 @@ retry:
 		 * rotate unevictable list
 		 */
 		SetPageUnevictable(page);
-		list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
-		mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE);
+		if (mem_cgroup_rotate_lru_list(page, LRU_UNEVICTABLE))
+			list_move(&page->lru, &zone->lru[LRU_UNEVICTABLE].list);
+		else
+			list_del_init(&page->lru);
 		if (page_evictable(page, NULL))
 			goto retry;
 	}
Index: linux-2.6.38-rc8/mm/swap.c
===================================================================
--- linux-2.6.38-rc8.orig/mm/swap.c	2011-03-28 11:23:58.000000000 +0200
+++ linux-2.6.38-rc8/mm/swap.c	2011-03-28 11:24:20.000000000 +0200
@@ -201,8 +201,10 @@ static void pagevec_move_tail(struct pag
 		}
 		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 			enum lru_list lru = page_lru_base_type(page);
-			list_move_tail(&page->lru, &zone->lru[lru].list);
-			mem_cgroup_rotate_reclaimable_page(page);
+			if (mem_cgroup_rotate_reclaimable_page(page))
+				list_move_tail(&page->lru, &zone->lru[lru].list);
+			else
+				list_del_init(&page->lru);
 			pgmoved++;
 		}
 	}
@@ -402,8 +404,10 @@ static void lru_deactivate(struct page *
 		 * The page's writeback ends up during pagevec
 		 * We moves tha page into tail of inactive.
 		 */
-		list_move_tail(&page->lru, &zone->lru[lru].list);
-		mem_cgroup_rotate_reclaimable_page(page);
+		if (mem_cgroup_rotate_reclaimable_page(page))
+			list_move_tail(&page->lru, &zone->lru[lru].list);
+		else
+			list_del_init(&page->lru);
 		__count_vm_event(PGROTATED);
 	}
 



* [RFC 3/3] Do not shrink isolated groups from the global reclaim
  2011-03-28  9:39 [RFC 0/3] Implementation of cgroup isolation Michal Hocko
  2011-03-28  9:39 ` [RFC 1/3] Add mem_cgroup->isolated and configuration knob Michal Hocko
  2011-03-28  9:39 ` [RFC 2/3] Implement isolated LRU cgroups Michal Hocko
@ 2011-03-28  9:40 ` Michal Hocko
  2011-03-28 11:03 ` [RFC 0/3] Implementation of cgroup isolation KAMEZAWA Hiroyuki
  2011-03-28 18:01 ` Ying Han
  4 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2011-03-28  9:40 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel

[-- Attachment #1: memcg-do_not_reclaim_isolated_groups.patch --]
[-- Type: text/plain, Size: 2289 bytes --]

Pages charged to isolated mem cgroups are not placed on the global LRU
lists, so they are protected from reclaim in general. This is still not
enough, as they can still get reclaimed during the soft hierarchical
reclaim:

balance_pgdat
	mem_cgroup_soft_limit_reclaim
		mem_cgroup_hierarchical_reclaim
			mem_cgroup_shrink_node_zone

Let's prevent soft reclaim if the group is isolated and defer its
balancing to try_to_free_mem_cgroup_pages, called from the charging paths.
This will probably make allocations for the group more OOM-prone, but the
group wanted to be isolated, so we should give it as much isolation as we
can and leave proper memory usage to the group's user.

Signed-off-by: Michal Hocko <mhocko@suse.cz>

---
 vmscan.c |   18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

Index: linux-2.6.38-rc8/mm/vmscan.c
===================================================================
--- linux-2.6.38-rc8.orig/mm/vmscan.c	2011-03-28 11:24:20.000000000 +0200
+++ linux-2.6.38-rc8/mm/vmscan.c	2011-03-28 11:24:38.000000000 +0200
@@ -2170,14 +2170,16 @@ unsigned long mem_cgroup_shrink_node_zon
 						      sc.may_writepage,
 						      sc.gfp_mask);
 
-	/*
-	 * NOTE: Although we can get the priority field, using it
-	 * here is not a good idea, since it limits the pages we can scan.
-	 * if we don't reclaim here, the shrink_zone from balance_pgdat
-	 * will pick up pages from other mem cgroup's as well. We hack
-	 * the priority and make it zero.
-	 */
-	shrink_zone(0, zone, &sc);
+	if (!is_mem_cgroup_isolated(mem)) {
+		/*
+		 * NOTE: Although we can get the priority field, using it
+		 * here is not a good idea, since it limits the pages we can scan.
+		 * if we don't reclaim here, the shrink_zone from balance_pgdat
+		 * will pick up pages from other mem cgroup's as well. We hack
+		 * the priority and make it zero.
+		 */
+		shrink_zone(0, zone, &sc);
+	}
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 



* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-28  9:39 [RFC 0/3] Implementation of cgroup isolation Michal Hocko
                   ` (2 preceding siblings ...)
  2011-03-28  9:40 ` [RFC 3/3] Do not shrink isolated groups from the global reclaim Michal Hocko
@ 2011-03-28 11:03 ` KAMEZAWA Hiroyuki
  2011-03-28 11:44   ` Michal Hocko
  2011-03-29 15:53   ` Balbir Singh
  2011-03-28 18:01 ` Ying Han
  4 siblings, 2 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-28 11:03 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel

On Mon, 28 Mar 2011 11:39:57 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> Hi all,
> 
> Memory cgroups can currently be used to throttle the memory usage of a
> group of processes. They cannot, however, be used to isolate processes
> from the rest of the system, because all pages that belong to the group
> are also placed on the global LRU lists and so are eligible for global
> memory reclaim.
> 
> This patchset aims to provide opt-in memory cgroup isolation: a cgroup
> can be configured to be isolated from the rest of the system through the
> cgroup virtual filesystem (/dev/memctl/group/memory.isolated).
> 
> An isolated mem cgroup can be particularly helpful in deployments where a
> primary service (e.g. a database server) needs certain guarantees for
> memory resources and we want to shield it from the rest of the system
> (e.g. a burst of memory activity in another group). This is currently
> possible only by mlocking the memory that is essential for the
> application(s), or with a rather hacky configuration where the primary
> app is in the root mem cgroup while all other system activity happens in
> other groups.
> 
> mlocking is not always an ideal solution because the working set can be
> very large and depends on the workload (e.g. the number of incoming
> requests), so it can end up not fitting into memory (and trigger the OOM
> killer). If we use mem cgroup isolation instead, we keep the memory
> resident, and if the working set goes wild we can still do per-cgroup
> reclaim, so the service is less prone to being OOM killed.
> 
> The series is split into 3 patches. The first one adds a new flag to the
> mem_cgroup structure which controls whether the group is isolated (false
> by default) and a cgroup fs interface to set it.
> The second patch implements the interaction with the global LRU. The
> current semantics are that we put a page on a global LRU only if the mem
> cgroup LRU functions say they do not want the page for themselves.
> The last patch prevents soft reclaim of a group if it is isolated.
> 
> I have tested the patches with a simple memory consumer (allocating
> private and shared anon memory and SYSV SHM).
> 
> One instance (call it the big consumer) runs in the group, pages in its
> memory (>90% of the cgroup limit) and sleeps for the rest of its life.
> Then I had a pool of consumers running in the same cgroup, each paging in
> a smaller amount of memory in a loop to simulate in-group memory pressure
> (call them sharks).
> The sum of the consumed memory is more than memory.limit_in_bytes, so
> some portion of the memory is swapped out.
> One more consumer runs in parallel in the root cgroup and puts pressure
> on memory (to trigger background reclaim).
> 
> Rss+cache of the group drops significantly (~66% of the limit) if the
> group is not isolated. If we isolate the group, on the other hand, it
> stays saturated (~97% of the limit). I can show more comprehensive
> results if somebody is interested.
> 

Isn't it the same result as in the case where no cgroup is used?
What is the problem?
Why isn't it a problem of configuration?
IIUC, you can put all logins into some cgroup by using cgroupd/libcgroup.

> Thanks for comments.
> 


Maybe you just want a "guarantee".
At first thought, this approach has 3 problems. And memcg is designed
never to prevent global VM scans.

1. This cannot be used as a "guarantee". It is just a way to say
   "don't steal from me!!!"
   This just implements a "first come, first served" system.
   I guess this can be used for server designs... only with very, very
   careful play.
   If an application exits and loses its memory, there is no guarantee
   anymore.

2. Even with isolation, a task in memcg can be killed by the OOM killer at
   global memory shortage.

3. It seems this will add more page fragmentation if implemented poorly.
   IOW, can this work with compaction?



I can think of other approaches.

1. cpuset+nodehotplug enhancements.
   At boot, hide most of the memory from the system with a boot option.
   You can rename the node-id of "all unused memory" and create arbitrary
   nodes if the kernel has an interface. You can add virtual nodes and
   move pages between nodes by renaming it.

   This will allow you to create a safe box dynamically. If you move pages
   in the order of MAX_ORDER, you don't add any fragmentation.
   (But this way, you need to avoid tasks in the root cgroup, too.)


2. Allow a mount option to link the ROOT cgroup's LRU and add a limit for
   the root cgroup. Then, softlimit will work well.
   (If softlimit doesn't work, it's a bug. That will be an enhancement
   point.)


Thanks,
-Kame

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-28 11:03 ` [RFC 0/3] Implementation of cgroup isolation KAMEZAWA Hiroyuki
@ 2011-03-28 11:44   ` Michal Hocko
  2011-03-29  0:09     ` KAMEZAWA Hiroyuki
  2011-03-29 15:53   ` Balbir Singh
  1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2011-03-28 11:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel

On Mon 28-03-11 20:03:32, KAMEZAWA Hiroyuki wrote:
> On Mon, 28 Mar 2011 11:39:57 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
[...]
> 
> Isn't it the same result as in the case where no cgroup is used?

Yes, and that is the point of the patchset. With respect to global memory
activity, memory cgroups give you nothing but an upper limit.

> What is the problem?

That we cannot prevent the memory of process(es) from being paged out
because of unrelated memory activity, even though we have intentionally
isolated them in a group (read: we have no other means of isolation).

> Why isn't it a problem of configuration?
> IIUC, you can put all logins into some cgroup by using cgroupd/libcgroup.

Yes, but this still doesn't bring the isolation.

> Maybe you just want a "guarantee".
> At first thought, this approach has 3 problems. And memcg is designed
> never to prevent global VM scans.
> 
> 1. This cannot be used as a "guarantee". It is just a way to say
>    "don't steal from me!!!"
>    This just implements a "first come, first served" system.
>    I guess this can be used for server designs... only with very, very
>    careful play.
>    If an application exits and loses its memory, there is no guarantee
>    anymore.

Yes, but once it has got the memory and needs it, or benefits from having
it resident whatever happens around it, there is no other solution than
mlocking the memory, which is not always ideal, as I have already
described.

> 
> 2. Even with isolation, a task in memcg can be killed by OOM-killer at
>    global memory shortage.

Yes, it can, but I think this is a different problem. Once you are that
short of memory you can hardly ask for any guarantees.
There is no 100% guarantee about anything in the system.

> 
> 3. it seems this will add more page fragmentation if implemented poorly, IOW,
>    can this be work with compaction ?

Why would it add any fragmentation? We are compacting memory based on
pfn range scanning rather than walking the global LRU list, aren't we?

> I think of other approaches.
> 
> 1. cpuset+nodehotplug enhances.
>    At boot, hide most of memory from the system by boot option.
>    You can rename node-id of "all unused memory" and create arbitrary nodes
>    if the kernel has an interface. You can add a virtual nodes and move
>    pages between nodes by renaming it.
> 
>    This will allow you to create a safe box dynamically. 

This sounds as if it requires a completely new infrastructure for many
parts of the VM code.

>    If you move pages in
>    the order of MAX_ORDER, you don't add any fragmentation.
>    (But with this way, you need to avoid tasks in root cgrou, too.)
> 
> 
> 2. allow a mount option to link ROOT cgroup's LRU and add limit for
>    root cgroup. Then, softlimit will work well.
>    (If softlimit doesn't work, it's bug. That will be an enhancement point.)

So you mean that the root cgroup would be a normal group like any other?

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-28 11:44   ` Michal Hocko
@ 2011-03-29  0:09     ` KAMEZAWA Hiroyuki
  2011-03-29  7:32       ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-29  0:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel

On Mon, 28 Mar 2011 13:44:30 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Mon 28-03-11 20:03:32, KAMEZAWA Hiroyuki wrote:
> > On Mon, 28 Mar 2011 11:39:57 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> [...]
> > 
> > Isn't it the same result with the case where no cgroup is used ?
> 
> Yes and that is the point of the patchset. Memory cgroups will not give
> you anything else but the top limit wrt. to the global memory activity.
> 
> > What is the problem ?
> 
> That we cannot prevent from paging out memory of process(es), even though
> we have intentionaly isolated them in a group (read as we do not have
> any other possibility for the isolation), because of unrelated memory
> activity.
> 
Because the design of memory cgroup is not for "defending" but for 
"never attack some other guys".


> > Why it's not a problem of configuration ?
> > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> 
> Yes, but this still doesn't bring the isolation.
> 

Please explain this more.
Why don't you move all tasks under /root/default <- this has some limit ?


> > Maybe you just want "guarantee".
> > At 1st thought, this approarch has 3 problems. And memcg is desgined
> > never to prevent global vm scans,
> > 
> > 1. This cannot be used as "guarantee". Just a way for "don't steal from me!!!"
> >    This just implements a "first come, first served" system.
> >    I guess this can be used for server desgines.....only with very very careful play.
> >    If an application exits and lose its memory, there is no guarantee anymore.
> 
> Yes, but once it got the memory and it needs to have it or benefits from
> having it resindent what-ever happens around then there is no other
> solution than mlocking the memory which is not ideal solution all the
> time as I have described already.
> 

Yes, and then almost all mm guys' answer has been "please use mlock".



> > 
> > 2. Even with isolation, a task in memcg can be killed by OOM-killer at
> >    global memory shortage.
> 
> Yes it can but I think this is a different problem. Once you are that
> short of memory you can hardly ask from any guarantees.
> There is no 100% guarantee about anything in the system.
> 

I think you should move the tasks in the root cgroup somewhere else. That
works perfectly against OOM. And if memory is hidden by isolation, OOM
will happen more easily.


> > 
> > 3. it seems this will add more page fragmentation if implemented poorly, IOW,
> >    can this be work with compaction ?
> 
> Why would it add any fragmentation. We are compacting memory based on
> the pfn range scanning rather than walking global LRU list, aren't we?
> 

Please forget, I misunderstood.




> > I think of other approaches.
> > 
> > 1. cpuset+nodehotplug enhances.
> >    At boot, hide most of memory from the system by boot option.
> >    You can rename node-id of "all unused memory" and create arbitrary nodes
> >    if the kernel has an interface. You can add a virtual nodes and move
> >    pages between nodes by renaming it.
> > 
> >    This will allow you to create a safe box dynamically. 
> 
> This sounds as it requires a completely new infrastructure for many
> parts of VM code. 
> 

Not so many parts, I guess. I think I can write a prototype in a week,
if I have time.


> >    If you move pages in
> >    the order of MAX_ORDER, you don't add any fragmentation.
> >    (But with this way, you need to avoid tasks in root cgrou, too.)
> > 
> > 
> > 2. allow a mount option to link ROOT cgroup's LRU and add limit for
> >    root cgroup. Then, softlimit will work well.
> >    (If softlimit doesn't work, it's bug. That will be an enhancement point.)
> 
> So you mean that the root cgroup would be a normal group like any other?
> 

If necessary. The root cgroup has no limit/LRU/etc., just to gain performance.
If the admin can accept the cost (2-5% now?), I think we can add knobs as a
boot option or similar.

Anyway, for softlimit etc. to work in an ideal way, the admin should put all
tasks into some memcg which has limits.

Thanks,
-Kame







^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  0:09     ` KAMEZAWA Hiroyuki
@ 2011-03-29  7:32       ` Michal Hocko
  2011-03-29  7:51         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2011-03-29  7:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel

On Tue 29-03-11 09:09:24, KAMEZAWA Hiroyuki wrote:
> On Mon, 28 Mar 2011 13:44:30 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Mon 28-03-11 20:03:32, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 28 Mar 2011 11:39:57 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > [...]
> > > 
> > > Isn't it the same result with the case where no cgroup is used ?
> > 
> > Yes and that is the point of the patchset. Memory cgroups will not give
> > you anything else but the top limit wrt. to the global memory activity.
> > 
> > > What is the problem ?
> > 
> > That we cannot prevent from paging out memory of process(es), even though
> > we have intentionaly isolated them in a group (read as we do not have
> > any other possibility for the isolation), because of unrelated memory
> > activity.
> > 
> Because the design of memory cgroup is not for "defending" but for 
> "never attack some other guys".

Yes, I am aware of the current state of the implementation. But as the
patchset shows, it is not quite trivial to implement the other
(defending) part as well.

> 
> 
> > > Why it's not a problem of configuration ?
> > > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> > 
> > Yes, but this still doesn't bring the isolation.
> > 
> 
> Please explain this more.
> Why don't you move all tasks under /root/default <- this has some limit ?

OK, I have tried to explain that in the description of one of the
patches (the 2nd). If I move all tasks from the root group to other
group(s) and keep the primary application in the root group, I would
achieve some isolation as well. That is very much true. But then there
is only one such group. What if we need more such groups? I see this
solution more as a misuse of the current implementation of the
(special) root cgroup.

> > > Maybe you just want "guarantee".
> > > At 1st thought, this approarch has 3 problems. And memcg is desgined
> > > never to prevent global vm scans,
> > > 
> > > 1. This cannot be used as "guarantee". Just a way for "don't steal from me!!!"
> > >    This just implements a "first come, first served" system.
> > >    I guess this can be used for server desgines.....only with very very careful play.
> > >    If an application exits and lose its memory, there is no guarantee anymore.
> > 
> > Yes, but once it got the memory and it needs to have it or benefits from
> > having it resindent what-ever happens around then there is no other
> > solution than mlocking the memory which is not ideal solution all the
> > time as I have described already.
> > 
> 
> Yes, then, almost all mm guys answer has been "please use mlock".

Yes. As I already tried to explain, mlock is not always the remedy. It
gets very tricky when you balance on the edge of the available memory
or of the cgroup limit. Sometimes you would rather have something
swapped out than be killed (or fail with ENOMEM).
The important thing about the swapping out above is that with the
isolation it happens only per-cgroup.
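
To make the mlock trade-off above concrete, here is a minimal sketch. It
is an illustration only, not part of the patchset. It assumes a Linux
libc reachable through ctypes, and the helper name try_mlock is made up
for this example. It pins an anonymous mapping and returns the errno,
showing that mlock() fails outright instead of letting anything be
swapped out.

```python
import ctypes
import errno
import mmap


def try_mlock(size):
    """Pin `size` bytes of anonymous memory; return 0 or the errno from mlock()."""
    libc = ctypes.CDLL(None, use_errno=True)  # assumes a libc with mlock/munlock
    buf = mmap.mmap(-1, size)                 # anonymous, page-aligned mapping
    view = ctypes.c_char.from_buffer(buf)     # only used to obtain the address
    addr = ctypes.c_void_p(ctypes.addressof(view))
    length = ctypes.c_size_t(size)
    try:
        if libc.mlock(addr, length) != 0:
            # Typically ENOMEM (over RLIMIT_MEMLOCK or not enough memory)
            # or EPERM; nothing is swapped out instead, the call just fails.
            return ctypes.get_errno()
        libc.munlock(addr, length)
        return 0
    finally:
        del view     # release the buffer export so the mapping can be closed
        buf.close()


if __name__ == "__main__":
    err = try_mlock(1 << 20)
    if err == 0:
        print("pinned 1 MiB")
    else:
        print("mlock failed:", errno.errorcode.get(err, str(err)))
```

An application that sizes its pinned region from the incoming load has
to handle that failure path itself, which is the tricky part mentioned
above.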

> > > 2. Even with isolation, a task in memcg can be killed by OOM-killer at
> > >    global memory shortage.
> > 
> > Yes it can but I think this is a different problem. Once you are that
> > short of memory you can hardly ask from any guarantees.
> > There is no 100% guarantee about anything in the system.
> > 
> 
> I think you should put tasks in root cgroup to somewhere. It works perfect
> against OOM. And if memory are hidden by isolation, OOM will happen easier.

Why do you think that it would happen more easily? Isn't it similar
(from the OOM POV) to somebody having mlocked that memory?

Thanks for comments
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  7:32       ` Michal Hocko
@ 2011-03-29  7:51         ` KAMEZAWA Hiroyuki
  2011-03-29  8:59           ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-29  7:51 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel

On Tue, 29 Mar 2011 09:32:32 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 29-03-11 09:09:24, KAMEZAWA Hiroyuki wrote:
> > On Mon, 28 Mar 2011 13:44:30 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Mon 28-03-11 20:03:32, KAMEZAWA Hiroyuki wrote:
> > > > On Mon, 28 Mar 2011 11:39:57 +0200
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > [...]
> > > > 
> > > > Isn't it the same result with the case where no cgroup is used ?
> > > 
> > > Yes and that is the point of the patchset. Memory cgroups will not give
> > > you anything else but the top limit wrt. to the global memory activity.
> > > 
> > > > What is the problem ?
> > > 
> > > That we cannot prevent from paging out memory of process(es), even though
> > > we have intentionaly isolated them in a group (read as we do not have
> > > any other possibility for the isolation), because of unrelated memory
> > > activity.
> > > 
> > Because the design of memory cgroup is not for "defending" but for 
> > "never attack some other guys".
> 
> Yes, I am aware of the current state of implementation. But as the
> patchset show there is not quite trivial to implement also the other
> (defending) part.
> 

My opinion is that enhancing softlimit is better.


> > 
> > 
> > > > Why it's not a problem of configuration ?
> > > > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> > > 
> > > Yes, but this still doesn't bring the isolation.
> > > 
> > 
> > Please explain this more.
> > Why don't you move all tasks under /root/default <- this has some limit ?
> 
> OK, I have tried to explain that in one of the (2nd) patch description.
> If I move all task from the root group to other group(s) and keep the
> primary application in the root group I would achieve some isolation as
> well. That is very much true. 

Okay, then the current implementation works well.

> But then there is only one such a group.

I can't catch what you mean. You can create a limitless cgroup anywhere,
can't you?

> What if we need more such groups? I see this solution more as a misuse
> of the current implementation of the (special) root cgroup.
> 

Make a limitless cgroup and set softlimit properly, if necessary.
But as said in the other e-mail, softlimit should be improved.
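
As an illustration of that suggestion, a minimal sketch, assuming the
cgroup v1 memory controller is mounted at /dev/memctl (the mount point
used in the cover letter). The function name and the group name are
hypothetical. memory.limit_in_bytes and memory.soft_limit_in_bytes are
the v1 control files; writing -1 removes the hard limit.

```python
import os


def make_soft_limited_group(name, soft_limit_bytes, root="/dev/memctl"):
    """Create memcg `name` with no hard limit and the given soft limit."""
    group = os.path.join(root, name)
    os.makedirs(group, exist_ok=True)
    # -1 means "no hard limit" in the cgroup v1 memory controller.
    with open(os.path.join(group, "memory.limit_in_bytes"), "w") as f:
        f.write("-1")
    # The soft limit only matters once the system is under memory pressure.
    with open(os.path.join(group, "memory.soft_limit_in_bytes"), "w") as f:
        f.write(str(soft_limit_bytes))
    return group


if __name__ == "__main__":
    # Hypothetical group for a primary service with a 4 GiB soft limit.
    print(make_soft_limited_group("primary", 4 << 30))
```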


> > > > Maybe you just want "guarantee".
> > > > At 1st thought, this approarch has 3 problems. And memcg is desgined
> > > > never to prevent global vm scans,
> > > > 
> > > > 1. This cannot be used as "guarantee". Just a way for "don't steal from me!!!"
> > > >    This just implements a "first come, first served" system.
> > > >    I guess this can be used for server desgines.....only with very very careful play.
> > > >    If an application exits and lose its memory, there is no guarantee anymore.
> > > 
> > > Yes, but once it got the memory and it needs to have it or benefits from
> > > having it resindent what-ever happens around then there is no other
> > > solution than mlocking the memory which is not ideal solution all the
> > > time as I have described already.
> > > 
> > 
> > Yes, then, almost all mm guys answer has been "please use mlock".
> 
> Yes. As I already tried to explain, mlock is not the remedy all the
> time. It gets very tricky when you balance on the edge of the limit of
> the available memory resp. cgroup limit. Sometimes you rather want to
> have something swapped out than being killed (or fail due to ENOMEM).
> The important thing about swapped out above is that with the isolation
> it is only per-cgroup.
> 

IMHO, doing isolation by hiding is not a good idea. Because we're kernel
engineers, we should do isolation by scheduling. The kernel is the art of
scheduling, not separation. I think we should start from some scheduling
such as softlimit. Then, as an extreme case of scheduling, 'complete
isolation' should be achieved. If it seems impossible after trying to make
softlimit better, okay, we should consider something else.

BTW, if you want, please post a patch to enable limit/softlimit on the ROOT
cgroup with performance measurements.
I myself have no requirements...


> > > > 2. Even with isolation, a task in memcg can be killed by OOM-killer at
> > > >    global memory shortage.
> > > 
> > > Yes it can but I think this is a different problem. Once you are that
> > > short of memory you can hardly ask from any guarantees.
> > > There is no 100% guarantee about anything in the system.
> > > 
> > 
> > I think you should put tasks in root cgroup to somewhere. It works perfect
> > against OOM. And if memory are hidden by isolation, OOM will happen easier.
> 
> Why do you think that it would happen easier? Isn't it similar (from OOM
> POV) as if somebody mlocked that memory?
> 

If the global LRU scan cannot find victim memory, OOM happens.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  7:51         ` KAMEZAWA Hiroyuki
@ 2011-03-29  8:59           ` Michal Hocko
  2011-03-29  9:41             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2011-03-29  8:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel

On Tue 29-03-11 16:51:17, KAMEZAWA Hiroyuki wrote:
> On Tue, 29 Mar 2011 09:32:32 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Tue 29-03-11 09:09:24, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 28 Mar 2011 13:44:30 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > > 
> > > > On Mon 28-03-11 20:03:32, KAMEZAWA Hiroyuki wrote:
> > > > > On Mon, 28 Mar 2011 11:39:57 +0200
> > > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > [...]
> > > > > 
> > > > > Isn't it the same result with the case where no cgroup is used ?
> > > > 
> > > > Yes and that is the point of the patchset. Memory cgroups will not give
> > > > you anything else but the top limit wrt. to the global memory activity.
> > > > 
> > > > > What is the problem ?
> > > > 
> > > > That we cannot prevent from paging out memory of process(es), even though
> > > > we have intentionaly isolated them in a group (read as we do not have
> > > > any other possibility for the isolation), because of unrelated memory
> > > > activity.
> > > > 
> > > Because the design of memory cgroup is not for "defending" but for 
> > > "never attack some other guys".
> > 
> > Yes, I am aware of the current state of implementation. But as the
> > patchset show there is not quite trivial to implement also the other
> > (defending) part.
> > 
> 
> My opinions is to enhance softlimit is better.

I will look at how softlimit can be enhanced to match the expectations,
but I doubt it can handle workloads where heuristics simply cannot guess
that the resident memory is important even though it hasn't been touched
for a long time.

> > > > > Why it's not a problem of configuration ?
> > > > > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> > > > 
> > > > Yes, but this still doesn't bring the isolation.
> > > > 
> > > 
> > > Please explain this more.
> > > Why don't you move all tasks under /root/default <- this has some limit ?
> > 
> > OK, I have tried to explain that in one of the (2nd) patch description.
> > If I move all task from the root group to other group(s) and keep the
> > primary application in the root group I would achieve some isolation as
> > well. That is very much true. 
> 
> Okay, then, current works well.
> 
> > But then there is only one such a group.
> 
> I can't catch what you mean. you can create limitless cgroup, anywhere.
> Can't you ?

This is not about limits. This is about global vs. per-cgroup reclaim
and how much they interact together. 

The everything-in-groups approach with the "primary" service in the root
group (or call it unlimited) works just because all the memory activity
(except for the primary service) is capped by the limits, so the rest of
the memory can be used by the service. Moreover, for this to work, the
limit for the other groups would have to be smaller than the working set
of the primary service.
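
The everything-in-groups setup described above can be sketched as
follows. This is a rough illustration under assumptions: the cgroup v1
memory controller is mounted at /dev/memctl, and the function name and
the "others" group are made up for this example. Every task except the
primary service is moved from the root group into one limited group.

```python
import os


def confine_all_but(primary_pids, limit_bytes, root="/dev/memctl", group="others"):
    """Move every task except `primary_pids` from the root group into a limited group."""
    target = os.path.join(root, group)
    os.makedirs(target, exist_ok=True)
    with open(os.path.join(target, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    with open(os.path.join(root, "tasks")) as f:
        pids = [int(line) for line in f if line.strip()]
    moved = []
    for pid in pids:
        if pid in primary_pids:
            continue  # the primary service stays in the root group
        try:
            # One PID per write, as the tasks file expects.
            with open(os.path.join(target, "tasks"), "a") as f:
                f.write("%d\n" % pid)
            moved.append(pid)
        except OSError:
            pass  # e.g. the task has exited or cannot be moved
    return moved
```

Note that this yields exactly one protected group (the root), which is
the limitation discussed in this thread.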

Even if you created a limitless group for another important service, the
two would still interact, and if one went wild the other would suffer
from that.

But, well, I might be wrong about this. I will play with it to see how it
works.

[...]
> > > Yes, then, almost all mm guys answer has been "please use mlock".
> > 
> > Yes. As I already tried to explain, mlock is not the remedy all the
> > time. It gets very tricky when you balance on the edge of the limit of
> > the available memory resp. cgroup limit. Sometimes you rather want to
> > have something swapped out than being killed (or fail due to ENOMEM).
> > The important thing about swapped out above is that with the isolation
> > it is only per-cgroup.
> > 
> 
> IMHO, doing isolation by hiding is not good idea. 

It depends on what you want to guarantee.

> Because we're kernel engineer, we should do isolation by
> scheduling. The kernel is art of shceduling, not separation.

Well, I would disagree with this statement (to some extent, of course).
Cgroups are quite often used for separation (e.g. cpusets basically
hide tasks from CPUs that are not configured for them).

You are certainly right that memory management is about proper
scheduling and balancing needs vs. demands. And it has turned out to
work fine for many (maybe even most) workloads (modulo bugs, which are
fixed over time). But if an application has more specific requirements
for its memory usage, then it is quite limited in the ways it can
achieve them (mlock is one way to pin the memory, but there are cases
where it is not appropriate).
The kernel will simply never know the complete picture and has to rely
on heuristics, which will never suit everybody.

> I think we should start from some scheduling as softlimit. Then,
> as an extreme case of scheduling, 'complete isolation' should be
> archived. If it seems impossible after trial of making softlimit
> better, okay, we should consider some.

As I have already tried to point out, whatever the scheduling does, it
has no way to guess that somebody needs to be isolated unless he tells
the kernel so.
Anyway, I will have a look at whether softlimit can be used and how
helpful it would be.

[...]
> > > I think you should put tasks in root cgroup to somewhere. It works perfect
> > > against OOM. And if memory are hidden by isolation, OOM will happen easier.
> > 
> > Why do you think that it would happen easier? Isn't it similar (from OOM
> > POV) as if somebody mlocked that memory?
> > 
> 
> if global lru scan cannot find victim memory, oom happens.

Yes, but this will happen with mlocked memory as well, right?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  8:59           ` Michal Hocko
@ 2011-03-29  9:41             ` KAMEZAWA Hiroyuki
  2011-03-29 11:18               ` Michal Hocko
  2011-03-30  5:32               ` Ying Han
  0 siblings, 2 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-29  9:41 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel

On Tue, 29 Mar 2011 10:59:43 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 29-03-11 16:51:17, KAMEZAWA Hiroyuki wrote:
> > On Tue, 29 Mar 2011 09:32:32 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Tue 29-03-11 09:09:24, KAMEZAWA Hiroyuki wrote:
> > > > On Mon, 28 Mar 2011 13:44:30 +0200
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > 
> > > > > On Mon 28-03-11 20:03:32, KAMEZAWA Hiroyuki wrote:
> > > > > > On Mon, 28 Mar 2011 11:39:57 +0200
> > > > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > > [...]
> > > > > > 
> > > > > > Isn't it the same result with the case where no cgroup is used ?
> > > > > 
> > > > > Yes and that is the point of the patchset. Memory cgroups will not give
> > > > > you anything else but the top limit wrt. to the global memory activity.
> > > > > 
> > > > > > What is the problem ?
> > > > > 
> > > > > That we cannot prevent from paging out memory of process(es), even though
> > > > > we have intentionaly isolated them in a group (read as we do not have
> > > > > any other possibility for the isolation), because of unrelated memory
> > > > > activity.
> > > > > 
> > > > Because the design of memory cgroup is not for "defending" but for 
> > > > "never attack some other guys".
> > > 
> > > Yes, I am aware of the current state of implementation. But as the
> > > patchset show there is not quite trivial to implement also the other
> > > (defending) part.
> > > 
> > 
> > My opinions is to enhance softlimit is better.
> 
> I will look how softlimit can be enhanced to match the expectations but
> I'm kind of suspicious it can handle workloads where heuristics simply
> cannot guess that the resident memory is important even though it wasn't
> touched for a long time.
> 

I think we usually recommend mlock() or hugetlbfs to pin an application's
work area. And the mm guys have worked hard to make mm work better under
realistic workloads even without memory cgroup.

If your workload is realistic but _important_ anonymous memory is swapped out,
it's a problem of the global VM rather than memcg.

If you add 'isolate' per process, okay, I'll agree to add isolate per memcg.



> > > > > > Why it's not a problem of configuration ?
> > > > > > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> > > > > 
> > > > > Yes, but this still doesn't bring the isolation.
> > > > > 
> > > > 
> > > > Please explain this more.
> > > > Why don't you move all tasks under /root/default <- this has some limit ?
> > > 
> > > OK, I have tried to explain that in one of the (2nd) patch description.
> > > If I move all task from the root group to other group(s) and keep the
> > > primary application in the root group I would achieve some isolation as
> > > well. That is very much true. 
> > 
> > Okay, then, current works well.
> > 
> > > But then there is only one such a group.
> > 
> > I can't catch what you mean. you can create limitless cgroup, anywhere.
> > Can't you ?
> 
> This is not about limits. This is about global vs. per-cgroup reclaim
> and how much they interact together. 
> 
> The everything-in-groups approach with the "primary" service in the root
> group (or call it unlimited) works just because all the memory activity
> (but the primary service) is caped with the limits so the rest of the
> memory can be used by the service. Moreover, in order this to work the
> limit for other groups would be smaller then the working set of the
> primary service.
> 
> Even if you created a limitless group for other important service they
> would still interact together and if one goes wild the other would
> suffer from that.
> 

...I can't understand what the problem is when global reclaim
runs just because an application wasn't limited, or memory is
overcommitted.




> [...]
> > > > Yes, then, almost all mm guys answer has been "please use mlock".
> > > 
> > > Yes. As I already tried to explain, mlock is not the remedy all the
> > > time. It gets very tricky when you balance on the edge of the limit of
> > > the available memory resp. cgroup limit. Sometimes you rather want to
> > > have something swapped out than being killed (or fail due to ENOMEM).
> > > The important thing about swapped out above is that with the isolation
> > > it is only per-cgroup.
> > > 
> > 
> > IMHO, doing isolation by hiding is not good idea. 
> 
> It depends on what you want to guarantee.
> 
> > Because we're kernel engineer, we should do isolation by
> > scheduling. The kernel is art of shceduling, not separation.
> 
> Well, I would disagree with this statement (to some extend of course).
> Cgroups are quite often used for separation (e.g. cpusets basically
> hide tasks from CPUs that are not configured for them).
> 
> You are certainly right that the memory management is about proper
> scheduling and balancing needs vs. demands. And it turned out to be
> working fine in many (maybe even most of) workloads (modulo bugs
> which are fixed over time). But if an application has more specific
> requirements for its memory usage then it is quite limited in ways how
> it can achieve them (mlock is one way how to pin the memory but there
> are cases where it is not appropriate).
> Kernel will simply never know the complete picture and have to rely on
> heuristics which will never fit in with everybody.
> 

That's what the MM guys are trying to do.

IIUC, there have been many papers on 'hinting LRU' in OS research,
but none has been added to Linux successfully. I'm not sure whether
nobody tried or the attempts were rejected.



> 
> > I think we should start from some scheduling as softlimit. Then,
> > as an extreme case of scheduling, 'complete isolation' should be
> > archived. If it seems impossible after trial of making softlimit
> > better, okay, we should consider some.
> 
> As I already tried to point out what-ever will scheduling do it has no
> way to guess that somebody needs to be isolated unless he says that to
> kernel.
> Anyway, I will have a look whether softlimit can be used and how helpful
> it would be.
> 

If softlimit (after some improvement) isn't enough, please add something else.

What I think of is

1. need to "guarantee" memory usages in future.
   "first come, first served" is not good for admins.

2. need to handle zone memory shortage. Using memory migration
   between zones will be necessary to avoid pageout.

3. need a knob to say "please reclaim from my own cgroup rather than
   affecting others (if usage > some(soft)limit)." 
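
Point 3 can be pictured with a toy model. This is not kernel code; the
function name and the policy (reclaim only from groups above their soft
limit, largest excess first) are assumptions made for illustration.

```python
def pick_reclaim_victims(groups):
    """groups: {name: (usage_bytes, soft_limit_bytes)}.

    Return the names of groups above their soft limit, largest excess first.
    Groups at or below their soft limit are left alone.
    """
    excess = {name: usage - soft
              for name, (usage, soft) in groups.items()
              if usage > soft}
    return sorted(excess, key=excess.get, reverse=True)


if __name__ == "__main__":
    GiB = 1 << 30
    groups = {
        "db": (6 * GiB, 8 * GiB),     # under its soft limit: not touched
        "batch": (3 * GiB, 1 * GiB),  # 2 GiB over
        "login": (1 * GiB, GiB // 2), # 0.5 GiB over
    }
    print(pick_reclaim_victims(groups))
```

In this model a group that stays below its soft limit is never a victim
as long as some other group is above its own.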


> [...]
> > > > I think you should put tasks in root cgroup to somewhere. It works perfect
> > > > against OOM. And if memory are hidden by isolation, OOM will happen easier.
> > > 
> > > Why do you think that it would happen easier? Isn't it similar (from OOM
> > > POV) as if somebody mlocked that memory?
> > > 
> > 
> > if global lru scan cannot find victim memory, oom happens.
> 
> Yes, but this will happen with mlocked memory as well, right?
> 
Yes, of course.

Anyway, I'll NACK a simple "first come, first served" isolation.
Please implement a guarantee which is reliable and which an admin can use safely.

mlock() has a similar problem, so I recommend hugetlbfs to customers;
the admin can set it up at boot time.
(The number of users of hugetlbfs tends to be one app (Oracle).)
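
For the boot-time reservation mentioned above, a small worked example.
The kernel takes the pool size through the hugepages= boot parameter;
the helper name below is made up, and the 2 MiB default huge page size
is an assumption that holds for typical x86_64 setups only.

```python
def hugepages_boot_param(working_set_bytes, hugepage_bytes=2 << 20):
    """Return the hugepages= boot parameter that covers the working set."""
    pages = -(-working_set_bytes // hugepage_bytes)  # ceiling division
    return "hugepages=%d" % pages


if __name__ == "__main__":
    # A 1 GiB work area needs 512 huge pages of 2 MiB each.
    print(hugepages_boot_param(1 << 30))
```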

I'll be absent tomorrow.

I think you'll come to the LSF/MM summit and, from the schedule, you'll have
a joint session with Ying on "Memcg LRU management and isolation".

IIUC, "LRU management" is Google's performance improvement topic.

It's OK for me to talk only about 'isolation' first, in the earlier session.
If you want, please ask James to move the session and overlay the 1st memory
cgroup session. (I think you saw the e-mail from James.)

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  9:41             ` KAMEZAWA Hiroyuki
@ 2011-03-29 11:18               ` Michal Hocko
  2011-03-29 13:15                 ` Zhu Yanhai
  2011-03-30  5:32               ` Ying Han
  1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2011-03-29 11:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel

On Tue 29-03-11 18:41:19, KAMEZAWA Hiroyuki wrote:
> On Tue, 29 Mar 2011 10:59:43 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Tue 29-03-11 16:51:17, KAMEZAWA Hiroyuki wrote:
[...]
> > > My opinions is to enhance softlimit is better.
> > 
> > I will look how softlimit can be enhanced to match the expectations but
> > I'm kind of suspicious it can handle workloads where heuristics simply
> > cannot guess that the resident memory is important even though it wasn't
> > touched for a long time.
> > 
> 
> I think we recommend mlock() or hugepagefs to pin application's work area
> in usual. And mm guyes have did hardwork to work mm better even without
> memory cgroup under realisitic workloads.

Agreed. Whenever this approach is possible, we recommend the same thing.

> If your worload is realistic but _important_ anonymous memory is swapped out,
> it's problem of global VM rather than memcg.

I would disagree with you on that. The thing is that "important" can be
defined from many perspectives. One is the kernel's, which considers
long-unused memory as not _that_ important. And that makes perfect sense
for most workloads.
Memory that is important to an application can be something whose paging
out (be it to swap or to storage) would considerably increase latency,
because it contains pre-computed data that has a big initial cost.
As you can see, time is not mentioned at all from the application POV,
because it can depend on the incoming requests, which you cannot
control.

> If you add 'isolate' per process, okay, I'll agree to add isolate per memcg.

What do you mean by isolate per process?

[...]
> > > > OK, I have tried to explain that in one of the (2nd) patch description.
> > > > If I move all task from the root group to other group(s) and keep the
> > > > primary application in the root group I would achieve some isolation as
> > > > well. That is very much true. 
> > > 
> > > Okay, then, current works well.
> > > 
> > > > But then there is only one such a group.
> > > 
> > > I can't catch what you mean. you can create limitless cgroup, anywhere.
> > > Can't you ?
> > 
> > This is not about limits. This is about global vs. per-cgroup reclaim
> > and how much they interact together. 
> > 
> > The everything-in-groups approach with the "primary" service in the root
> > group (or call it unlimited) works just because all the memory activity
> > (but the primary service) is capped with the limits so the rest of the
> > memory can be used by the service. Moreover, in order for this to work the
> > limit for other groups would be smaller than the working set of the
> > primary service.
> >
> > Even if you created a limitless group for another important service they
> > would still interact together and if one goes wild the other would
> > suffer from that.
> >
>
> ...I can't understand what the problem is when global reclaim
> runs just because an application wasn't limited... or memory is
> overcommitted.

I am not sure I understand but what I see as a problem is when unrelated
memory activity triggers reclaim and it pushes out the memory of a
process group just because the heuristics done by the reclaim algorithm
do not pick up the right memory - and honestly, no heuristic will fit
all requirements. Isolation can protect from an unrelated activity
without new heuristics.

[...]
> If softlimit (after some improvement) isn't enough, please add some other.
> 
> What I think of is
> 
> 1. need to "guarantee" memory usages in future.
>    "first come, first served" is not good for admins.

This is not in the scope of this patchset, but I agree that it would be
nice to have this guarantee.

> 2. need to handle zone memory shortage. Using memory migration
>    between zones will be necessary to avoid pageout.

I am not sure I understand.

> 
> 3. need a knob to say "please reclaim from my own cgroup rather than
>    affecting others (if usage > some(soft)limit)." 

Isn't this handled already and enhanced by the per-cgroup background
reclaim patches?

> 
> > [...]
> > > > > I think you should move the tasks in the root cgroup somewhere. It works
> > > > > perfectly against OOM. And if memory is hidden by isolation, OOM will
> > > > > happen more easily.
> > > >
> > > > Why do you think that it would happen more easily? Isn't it similar (from
> > > > the OOM POV) to somebody having mlocked that memory?
> > > >
> > >
> > > If the global LRU scan cannot find victim memory, OOM happens.
> >
> > Yes, but this will happen with mlocked memory as well, right?
> >
> Yes, of course.
>
> Anyway, I'll NACK simple "first come, first served" isolation.
> Please implement a guarantee that is reliable and that admins can use safely.

Isolation is not about a future guarantee. Rather, once you have the memory,
you can rely on it staying resident unless in-group activity pushes it out.

> mlock() has a similar problem, so I recommend hugetlbfs to customers;
> the admin can schedule it at boot time.
> (the number of hugetlbfs users tends to be one app: Oracle)

What if we decide that hugetlbfs won't be pinned into memory in future?

> 
> I'll be absent tomorrow.
>
> I think you'll come to the LSF/MM summit and, according to the schedule,
> you'll have a joint session with Ying titled "Memcg LRU management and
> isolation".

I hadn't planned to actively run a session, but I can certainly join
the talk and will be happy to discuss this topic.

>
> IIUC, "LRU management" is Google's performance improvement topic.
>
> It's OK with me to talk only about 'isolation' first, in an earlier session.
> If you want, please ask James to move the session and overlay it on the 1st
> memory cgroup session. (I think you saw the e-mail from James.)

Yeah, I can do that.

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29 11:18               ` Michal Hocko
@ 2011-03-29 13:15                 ` Zhu Yanhai
  2011-03-29 13:42                   ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Zhu Yanhai @ 2011-03-29 13:15 UTC (permalink / raw)
  To: Michal Hocko; +Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel

Michal,
Maybe what we need here is some kind of trade-off?
Let's say we add a new configurable parameter, reserve_limit, for the
cgroups that want some guarantee of memory resources, such that:

limit_in_bytes > soft_limit > reserve_limit

MEM[limit_in_bytes..soft_limit] are the bytes that I'm willing to contribute
to the others if they are short of memory.

MEM[soft_limit..reserve_limit] are the bytes that I can afford to give up if
the others are still eager for memory after I gave them
MEM[limit_in_bytes..soft_limit].

MEM[reserve_limit..0] are the bytes that I must keep to guarantee QoS.
Nobody is allowed to steal them.

And reserve_limit is 0 by default for the cgroups that don't care about QoS.

Then the reclaim path also needs some changes, i.e. in balance_pgdat():
1) call mem_cgroup_soft_limit_reclaim(); if nr_reclaimed is met, goto finish.
2) shrink the global LRU list, skipping the pages that belong to cgroups
that have set a reserve_limit; if nr_reclaimed is met, goto finish.
3) shrink the cgroups that have set a reserve_limit, leaving them with only
the reserve_limit bytes they need; if nr_reclaimed is met, goto finish.
4) OOM

Does it make sense?
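
For illustration, the three tiers above can be sketched in user-space C.
This is only a sketch under stated assumptions: reserve_limit is the proposed
knob and does not exist in the kernel, the struct and function names are
hypothetical, and the ordering checks use >= instead of the strict > so that
equal values are accepted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cgroup settings; reserve_limit is the proposed knob. */
struct memcg_limits {
	uint64_t limit_in_bytes;
	uint64_t soft_limit;
	uint64_t reserve_limit;	/* 0 means "no guarantee requested" */
};

/* MEM[limit_in_bytes..soft_limit]: given up first, in step 1. */
uint64_t tier_above_soft(const struct memcg_limits *l, uint64_t usage)
{
	assert(l->limit_in_bytes >= l->soft_limit);
	return usage > l->soft_limit ? usage - l->soft_limit : 0;
}

/* MEM[soft_limit..reserve_limit]: given up only as a last resort, in step 3. */
uint64_t tier_soft_to_reserve(const struct memcg_limits *l, uint64_t usage)
{
	uint64_t capped = usage < l->soft_limit ? usage : l->soft_limit;

	assert(l->soft_limit >= l->reserve_limit);
	return capped > l->reserve_limit ? capped - l->reserve_limit : 0;
}

/* MEM[reserve_limit..0]: never taken by anybody else. */
uint64_t tier_reserved(const struct memcg_limits *l, uint64_t usage)
{
	return usage < l->reserve_limit ? usage : l->reserve_limit;
}
```

With limits of 20/15/10 and a usage of 18 (in any unit), the tiers come out
as 3, 5 and 10, which add up to the usage again.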

Thanks,
Zhu Yanhai



* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29 13:15                 ` Zhu Yanhai
@ 2011-03-29 13:42                   ` Michal Hocko
  2011-03-29 14:02                     ` Zhu Yanhai
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2011-03-29 13:42 UTC (permalink / raw)
  To: Zhu Yanhai; +Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Tue 29-03-11 21:15:59, Zhu Yanhai wrote:
> Michal,

Hi,

> Maybe what we need here is some kind of trade-off?
> Let's say we add a new configurable parameter, reserve_limit, for the
> cgroups that want some guarantee of memory resources, such that:
>
> limit_in_bytes > soft_limit > reserve_limit
>
> MEM[limit_in_bytes..soft_limit] are the bytes that I'm willing to contribute
> to the others if they are short of memory.
>
> MEM[soft_limit..reserve_limit] are the bytes that I can afford to give up if
> the others are still eager for memory after I gave them
> MEM[limit_in_bytes..soft_limit].
>
> MEM[reserve_limit..0] are the bytes that I must keep to guarantee QoS.
> Nobody is allowed to steal them.
>
> And reserve_limit is 0 by default for the cgroups that don't care about QoS.
>
> Then the reclaim path also needs some changes, i.e. in balance_pgdat():
> 1) call mem_cgroup_soft_limit_reclaim(); if nr_reclaimed is met, goto finish.
> 2) shrink the global LRU list, skipping the pages that belong to cgroups
> that have set a reserve_limit; if nr_reclaimed is met, goto finish.

Isn't this an overhead that would slow the whole thing down? Consider
that you would need to look up page_cgroup for every page and touch
mem_cgroup to get the limit.
The point of the isolation is to not touch the global reclaim path at
all.

> 3) shrink the cgroups that have set a reserve_limit, leaving them with only
> the reserve_limit bytes they need; if nr_reclaimed is met, goto finish.
> 4) OOM
>
> Does it make sense?

It sounds like a good thing - in that regard it is more generic than
a simple flag - but I am afraid it wouldn't be easy to implement while
preserving performance and keeping the balance between groups. But maybe
it can be done without too much cost.

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29 13:42                   ` Michal Hocko
@ 2011-03-29 14:02                     ` Zhu Yanhai
  2011-03-29 14:08                       ` Zhu Yanhai
  2011-03-30  7:42                       ` Michal Hocko
  0 siblings, 2 replies; 35+ messages in thread
From: Zhu Yanhai @ 2011-03-29 14:02 UTC (permalink / raw)
  To: Michal Hocko; +Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel

Hi,

2011/3/29 Michal Hocko <mhocko@suse.cz>:
> Isn't this an overhead that would slow the whole thing down? Consider
> that you would need to look up page_cgroup for every page and touch
> mem_cgroup to get the limit.

The current code already does almost the same thing; take the direct reclaim
path, for example:
shrink_inactive_list()
   ->isolate_pages_global()
      ->isolate_lru_pages()
         ->mem_cgroup_del_lru(for each page it wants to isolate)
            and in mem_cgroup_del_lru() we have:
[code]
	pc = lookup_page_cgroup(page);
	/*
	 * Used bit is set without atomic ops but after smp_wmb().
	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
	 */
	smp_rmb();
	/* unused or root page is not rotated. */
	if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
		return;
[/code]
By calling mem_cgroup_is_root(pc->mem_cgroup) we have already brought the
struct mem_cgroup into cache.
So things probably won't get any worse, at least.

Thanks,
Zhu Yanhai



* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29 14:02                     ` Zhu Yanhai
@ 2011-03-29 14:08                       ` Zhu Yanhai
  2011-03-30  7:42                       ` Michal Hocko
  1 sibling, 0 replies; 35+ messages in thread
From: Zhu Yanhai @ 2011-03-29 14:08 UTC (permalink / raw)
  To: Michal Hocko; +Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel

2011/3/29 Zhu Yanhai <zhu.yanhai@gmail.com>:
> Hi,
>
> 2011/3/29 Michal Hocko <mhocko@suse.cz>:
>> Isn't this an overhead that would slow the whole thing down? Consider
>> that you would need to look up page_cgroup for every page and touch
>> mem_cgroup to get the limit.
>
> The current code already does almost the same thing; take the direct reclaim
> path, for example:
> shrink_inactive_list()
>   ->isolate_pages_global()
>      ->isolate_lru_pages()
>         ->mem_cgroup_del_lru(for each page it wants to isolate)
>            and in mem_cgroup_del_lru() we have:
Oops, the code below is from mem_cgroup_rotate_lru_list(), not
mem_cgroup_del_lru(); the correct one should be:
[code]
	pc = lookup_page_cgroup(page);
	/* can happen while we handle swapcache. */
	if (!TestClearPageCgroupAcctLRU(pc))
		return;
	VM_BUG_ON(!pc->mem_cgroup);
	/*
	 * We don't check PCG_USED bit. It's cleared when the "page" is finally
	 * removed from global LRU.
	 */
	mz = page_cgroup_zoneinfo(pc);
	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
	if (mem_cgroup_is_root(pc->mem_cgroup))
		return;
[/code]
Anyway, the point still stands.

-zyh


* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29 14:02                     ` Zhu Yanhai
  2011-03-29 14:08                       ` Zhu Yanhai
@ 2011-03-30  7:42                       ` Michal Hocko
  1 sibling, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2011-03-30  7:42 UTC (permalink / raw)
  To: Zhu Yanhai; +Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Tue 29-03-11 22:02:23, Zhu Yanhai wrote:
> Hi,
> 
> 2011/3/29 Michal Hocko <mhocko@suse.cz>:
> > Isn't this an overhead that would slow the whole thing down? Consider
> > that you would need to look up page_cgroup for every page and touch
> > mem_cgroup to get the limit.
>
> The current code already does almost the same thing; take the direct reclaim
> path, for example:
> shrink_inactive_list()
>    ->isolate_pages_global()
>       ->isolate_lru_pages()
>          ->mem_cgroup_del_lru(for each page it wants to isolate)
>             and in mem_cgroup_del_lru() we have:
> [code]
> 	pc = lookup_page_cgroup(page);
> 	/*
> 	 * Used bit is set without atomic ops but after smp_wmb().
> 	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> 	 */
> 	smp_rmb();
> 	/* unused or root page is not rotated. */
> 	if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
> 		return;
> [/code]
> By calling mem_cgroup_is_root(pc->mem_cgroup) we have already brought the
> struct mem_cgroup into cache.
> So things probably won't get any worse, at least.

But we would still potentially have to isolate and put back a lot of
pages. If those pages are not on the list in the first place, we skip them
automatically.
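
A toy model of the difference, as a sketch only: the LRU is a plain array,
the cgroup ids and all names are hypothetical, and nothing here is kernel
code. It just counts how many pages a scan has to look at and put back when
protected pages stay on the shared list, compared with a list that never
contained them.

```c
#include <assert.h>
#include <stddef.h>

struct scan_stats {
	size_t scanned;		/* pages looked at */
	size_t put_back;	/* pages isolated and then returned to the list */
	size_t reclaimed;	/* pages actually freed */
};

/*
 * Walk a toy LRU where owner[i] is the (hypothetical) cgroup id owning
 * page i, trying to reclaim "want" pages while skipping protected_id.
 */
struct scan_stats scan_toy_lru(const int *owner, size_t n, int protected_id,
			       size_t want)
{
	struct scan_stats st = { 0, 0, 0 };

	for (size_t i = 0; i < n && st.reclaimed < want; i++) {
		st.scanned++;
		if (owner[i] == protected_id)
			st.put_back++;	/* wasted work on a protected page */
		else
			st.reclaimed++;
	}
	return st;
}
```

On a shared list { 1, 1, 1, 0, 1, 0 } with group 1 protected, reclaiming two
pages means scanning six and putting four back. If the pages of group 1 are
not on the list at all, the same two pages are reclaimed after two scans.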

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  9:41             ` KAMEZAWA Hiroyuki
  2011-03-29 11:18               ` Michal Hocko
@ 2011-03-30  5:32               ` Ying Han
  1 sibling, 0 replies; 35+ messages in thread
From: Ying Han @ 2011-03-30  5:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Michal Hocko, linux-mm, linux-kernel

On Tue, Mar 29, 2011 at 2:41 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 29 Mar 2011 10:59:43 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
>
>> On Tue 29-03-11 16:51:17, KAMEZAWA Hiroyuki wrote:
>> > On Tue, 29 Mar 2011 09:32:32 +0200
>> > Michal Hocko <mhocko@suse.cz> wrote:
>> >
>> > > On Tue 29-03-11 09:09:24, KAMEZAWA Hiroyuki wrote:
>> > > > On Mon, 28 Mar 2011 13:44:30 +0200
>> > > > Michal Hocko <mhocko@suse.cz> wrote:
>> > > >
>> > > > > On Mon 28-03-11 20:03:32, KAMEZAWA Hiroyuki wrote:
>> > > > > > On Mon, 28 Mar 2011 11:39:57 +0200
>> > > > > > Michal Hocko <mhocko@suse.cz> wrote:
>> > > > > [...]
>> > > > > >
>> > > > > > Isn't it the same result with the case where no cgroup is used ?
>> > > > >
>> > > > > Yes and that is the point of the patchset. Memory cgroups will not give
>> > > > > you anything else but the top limit wrt. to the global memory activity.
>> > > > >
>> > > > > > What is the problem ?
>> > > > >
>> > > > > That we cannot prevent from paging out memory of process(es), even though
>> > > > > we have intentionaly isolated them in a group (read as we do not have
>> > > > > any other possibility for the isolation), because of unrelated memory
>> > > > > activity.
>> > > > >
>> > > > Because the design of memory cgroup is not for "defending" but for
>> > > > "never attack some other guys".
>> > >
>> > > Yes, I am aware of the current state of implementation. But as the
>> > > patchset show there is not quite trivial to implement also the other
>> > > (defending) part.
>> > >
>> >
>> > My opinion is that enhancing softlimit is better.
>>
>> I will look at how softlimit can be enhanced to match the expectations but
>> I'm skeptical that it can handle workloads where heuristics simply
>> cannot guess that the resident memory is important even though it wasn't
>> touched for a long time.
>>
>
> I think we usually recommend mlock() or hugetlbfs to pin an application's
> work area. And mm guys have done hard work to make mm work better even
> without memory cgroup under realistic workloads.
>
> If your workload is realistic but _important_ anonymous memory is swapped out,
> it's a problem of the global VM rather than memcg.
>
> If you add 'isolate' per process, okay, I'll agree to add isolate per memcg.
>
>
>
>> > > > > > Why it's not a problem of configuration ?
>> > > > > > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
>> > > > >
>> > > > > Yes, but this still doesn't bring the isolation.
>> > > > >
>> > > >
>> > > > Please explain this more.
>> > > > Why don't you move all tasks under /root/default <- this has some limit ?
>> > >
>> > > OK, I have tried to explain that in one of the (2nd) patch description.
>> > > If I move all task from the root group to other group(s) and keep the
>> > > primary application in the root group I would achieve some isolation as
>> > > well. That is very much true.
>> >
>> > Okay, then, current works well.
>> >
>> > > But then there is only one such a group.
>> >
>> > I can't catch what you mean. you can create limitless cgroup, anywhere.
>> > Can't you ?
>>
>> This is not about limits. This is about global vs. per-cgroup reclaim
>> and how much they interact together.
>>
>> The everything-in-groups approach with the "primary" service in the root
>> group (or call it unlimited) works just because all the memory activity
>> (but the primary service) is capped with the limits so the rest of the
>> memory can be used by the service. Moreover, in order for this to work the
>> limit for other groups would be smaller than the working set of the
>> primary service.
>>
>> Even if you created a limitless group for another important service they
>> would still interact together and if one goes wild the other would
>> suffer from that.
>>
>
> ...I can't understand what the problem is when global reclaim
> runs just because an application wasn't limited... or memory is
> overcommitted.

I guess the problem here is not that global reclaim is triggered, but
rather what its expected outcome is. We cannot prevent global memory
pressure from happening in an over-committed environment; however, we
should do only targeted reclaim when that happens.

Hopefully an example helps explain the problem we are trying to solve here.

Here are the currently supported memcg limit mechanisms:
1. limit_in_bytes:
If usage_in_bytes goes over the limit, the memcg gets throttled or
OOM killed.

2. soft_limit_in_bytes:
If usage_in_bytes goes over the limit, the memory above it is only
best-effort. Otherwise, no memory pressure is expected in the memcg.
This serves as a "guarantee" in some sense.

Here is a configuration memcg users might consider:
On a host with 32G of RAM, we would like to over-commit the machine
but also provide guarantees to individual memcgs.

memcg-A/ -- limit_in_bytes = 20G, soft_limit_in_bytes = 15G
memcg-B/ -- limit_in_bytes = 20G, soft_limit_in_bytes = 15G

The expectation of this configuration is:
a) Either memcg-A or memcg-B can grow usage_in_bytes up to 20G as long
as there is no system memory contention.
b) Both memcg-A and memcg-B have a memory guarantee of 15G, and no
memory pressure should be applied while usage_in_bytes is below that
value.
c) If there is global memory pressure, whoever has allocated memory above
its guarantee (soft_limit) needs to push pages out.
d) Either memcg-A or memcg-B will be throttled or OOM killed if its
usage_in_bytes goes above the limit_in_bytes.
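
As a sketch of expectations a) to d), the following user-space C models what
is supposed to happen to one memcg for a given usage. The enum and function
names are hypothetical and only illustrate the wished-for behaviour; this is
not how the kernel decides today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GIB (1024ULL * 1024 * 1024)

/* Hypothetical labels for the expectations a) to d) above. */
enum memcg_expectation {
	MEMCG_NO_PRESSURE,	/* b) usage within the soft limit */
	MEMCG_MAY_GROW,		/* a) above the soft limit, no global contention */
	MEMCG_PUSH_OUT,		/* c) above the soft limit under global pressure */
	MEMCG_THROTTLE_OR_OOM	/* d) usage went above limit_in_bytes */
};

enum memcg_expectation memcg_expect(uint64_t usage, uint64_t soft_limit,
				    uint64_t limit, bool global_pressure)
{
	assert(soft_limit <= limit);
	if (usage > limit)
		return MEMCG_THROTTLE_OR_OOM;
	if (usage <= soft_limit)
		return MEMCG_NO_PRESSURE;
	return global_pressure ? MEMCG_PUSH_OUT : MEMCG_MAY_GROW;
}

/* The guarantees can all be honoured only if the soft limits fit in RAM. */
bool guarantees_fit(const uint64_t *soft_limits, int n, uint64_t ram)
{
	uint64_t sum = 0;

	for (int i = 0; i < n; i++)
		sum += soft_limits[i];
	return sum <= ram;
}
```

For the configuration above, the soft limits sum to 30G and fit into 32G of
RAM, while the hard limits sum to 40G, which is what makes the host
over-committed.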

In order to achieve that, we need the following:
a) Improve the current soft_limit reclaim mechanism. Right now it is
designed to work on a best-effort basis with global background reclaim. I
can easily generate scenarios where it does not pick the "right"
cgroup to reclaim from each time ("right" here refers to the
efficiency of the reclaim).

b) When global reclaim happens (both background and ttfp), we need
to rely on soft_limit targeted reclaim instead of picking pages from the
global LRU. The latter just blindly throws pages away regardless of
the configuration of the cgroup. In this case, the configuration means
"guarantee".

c) Of course, we will have the per-memcg background reclaim patch. It will
do more targeted reclaim proactively, before global memory contention
arises.

Overall, I don't see why we should scan the global LRU, especially
once the things above are improved and supported.
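
To make point b) concrete, here is a minimal user-space sketch of soft_limit
targeted reclaim. The names are hypothetical, and picking the group with the
largest excess first is an assumption made for the sketch, not a statement
about what the final policy should be. The key property is that groups within
their soft limit are never touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a memcg for this sketch. */
struct memcg_view {
	uint64_t usage;
	uint64_t soft_limit;
};

uint64_t soft_excess(const struct memcg_view *m)
{
	return m->usage > m->soft_limit ? m->usage - m->soft_limit : 0;
}

/* Index of the group furthest above its soft limit, or -1 if none is above. */
int pick_reclaim_target(const struct memcg_view *groups, size_t n)
{
	int best = -1;
	uint64_t best_excess = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t excess = soft_excess(&groups[i]);

		if (excess > best_excess) {
			best_excess = excess;
			best = (int)i;
		}
	}
	return best;
}

/*
 * Reclaim up to "need" units, only from groups above their soft limit.
 * Returns the amount actually reclaimed, which may be less than "need".
 */
uint64_t targeted_reclaim(struct memcg_view *groups, size_t n, uint64_t need)
{
	uint64_t reclaimed = 0;

	while (reclaimed < need) {
		int idx = pick_reclaim_target(groups, n);
		uint64_t take;

		if (idx < 0)
			break;	/* everybody is within the guarantee */
		take = soft_excess(&groups[idx]);
		if (take > need - reclaimed)
			take = need - reclaimed;
		groups[idx].usage -= take;
		reclaimed += take;
	}
	return reclaimed;
}
```

When every group is at or below its soft limit the function reclaims nothing,
so a caller would still need some fallback for that case; what that fallback
should be is exactly the open question in this thread.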

--Ying

>
>
>
>
>> [...]
>> > > > Yes, then, almost all mm guys answer has been "please use mlock".
>> > >
>> > > Yes. As I already tried to explain, mlock is not the remedy all the
>> > > time. It gets very tricky when you balance on the edge of the limit of
>> > > the available memory resp. cgroup limit. Sometimes you rather want to
>> > > have something swapped out than being killed (or fail due to ENOMEM).
>> > > The important thing about swapped out above is that with the isolation
>> > > it is only per-cgroup.
>> > >
>> >
>> > IMHO, doing isolation by hiding is not good idea.
>>
>> It depends on what you want to guarantee.
>>
>> > Because we're kernel engineers, we should do isolation by
>> > scheduling. The kernel is the art of scheduling, not separation.
>>
>> Well, I would disagree with this statement (to some extent of course).
>> Cgroups are quite often used for separation (e.g. cpusets basically
>> hide tasks from CPUs that are not configured for them).
>>
>> You are certainly right that the memory management is about proper
>> scheduling and balancing needs vs. demands. And it turned out to be
>> working fine in many (maybe even most of) workloads (modulo bugs
>> which are fixed over time). But if an application has more specific
>> requirements for its memory usage then it is quite limited in ways how
>> it can achieve them (mlock is one way how to pin the memory but there
>> are cases where it is not appropriate).
>> Kernel will simply never know the complete picture and have to rely on
>> heuristics which will never fit in with everybody.
>>
>
> That's what MM guys are trying.
>
> IIUC, there has been many papers on 'hinting LRU' in OS study,
> but none has been added to Linux successfully. I'm not sure there has
> been no trial or they were rejected.
>
>
>
>>
>> > I think we should start from some scheduling as softlimit. Then,
>> > as an extreme case of scheduling, 'complete isolation' should be
>> > achieved. If it seems impossible after trial of making softlimit
>> > better, okay, we should consider some.
>>
>> As I already tried to point out what-ever will scheduling do it has no
>> way to guess that somebody needs to be isolated unless he says that to
>> kernel.
>> Anyway, I will have a look whether softlimit can be used and how helpful
>> it would be.
>>
>
> If softlimit (after some improvement) isn't enough, please add some other.
>
> What I think of is
>
> 1. need to "guarantee" memory usages in future.
>   "first come, first served" is not good for admins.
>
> 2. need to handle zone memory shortage. Using memory migration
>   between zones will be necessary to avoid pageout.
>
> 3. need a knob to say "please reclaim from my own cgroup rather than
>   affecting others (if usage > some(soft)limit)."
>
>
>> [...]
>> > > > I think you should put tasks in root cgroup to somewhere. It works perfect
>> > > > against OOM. And if memory are hidden by isolation, OOM will happen easier.
>> > >
>> > > Why do you think that it would happen easier? Isn't it similar (from OOM
>> > > POV) as if somebody mlocked that memory?
>> > >
>> >
>> > if global lru scan cannot find victim memory, oom happens.
>>
>> Yes, but this will happen with mlocked memory as well, right?
>>
> Yes, of course.
>
> Anyway, I'll Nack to simple "first come, first served" isolation.
> Please implement a guarantee which is reliable and which admins can use safely.
>
> mlock() has similar problem, So, I recommend hugetlbfs to customers,
> admin can schedule it at boot time.
> (the number of users of hugetlbfs is tend to be one app. (oracle))
>
> I'll be absent, tomorrow.
>
> I think you'll come LSF/MM summit and from the schedule, you'll have
> a joint session with Ying as "Memcg LRU management and isolation".
>
> IIUC, "LRU management" is a google's performance improvement topic.
>
> It's ok for me to talk only about 'isolation'  1st in earlier session.
> If you want, please ask James to move session and overlay 1st memory
> cgroup session. (I think you saw e-mail from James.)
>
> Thanks,
> -Kame
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-28 11:03 ` [RFC 0/3] Implementation of cgroup isolation KAMEZAWA Hiroyuki
  2011-03-28 11:44   ` Michal Hocko
@ 2011-03-29 15:53   ` Balbir Singh
  2011-03-30  8:18     ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: Balbir Singh @ 2011-03-29 15:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Michal Hocko, linux-mm, linux-kernel

On 03/28/11 16:33, KAMEZAWA Hiroyuki wrote:
> On Mon, 28 Mar 2011 11:39:57 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
>> Hi all,
>>
>> Memory cgroups can be currently used to throttle memory usage of a group of
>> processes. It, however, cannot be used for an isolation of processes from
>> the rest of the system because all the pages that belong to the group are
>> also placed on the global LRU lists and so they are eligible for the global
>> memory reclaim.
>>
>> This patchset aims at providing an opt-in memory cgroup isolation. This
>> means that a cgroup can be configured to be isolated from the rest of the
>> system by means of cgroup virtual filesystem (/dev/memctl/group/memory.isolated).
>>
>> Isolated mem cgroup can be particularly helpful in deployments where we have
>> a primary service which needs to have a certain guarantees for memory
>> resources (e.g. a database server) and we want to shield it off the
>> rest of the system (e.g. a burst memory activity in another group). This is
>> currently possible only with mlocking memory that is essential for the
>> application(s) or a rather hacky configuration where the primary app is in
>> the root mem cgroup while all the other system activity happens in other
>> groups.
>>
>> mlocking is not an ideal solution all the time because sometimes the working
>> set is very large and it depends on the workload (e.g. number of incoming
>> requests) so it can end up not fitting in into memory (leading to a OOM
>> killer). If we use mem. cgroup isolation instead we are keeping memory resident
>> and if the working set goes wild we can still do per-cgroup reclaim so the
>> service is less prone to be OOM killed.
>>
>> The patch series is split into 3 patches. First one adds a new flag into
>> mem_cgroup structure which controls whether the group is isolated (false by
>> default) and a cgroup fs interface to set it.
>> The second patch implements interaction with the global LRU. The current
>> semantic is that we are putting a page into a global LRU only if mem cgroup
>> LRU functions say they do not want the page for themselves.
>> The last patch prevents from soft reclaim if the group is isolated.
>>
>> I have tested the patches with the simple memory consumer (allocating
>> private and shared anon memory and SYSV SHM). 
>>
>> One instance (call it big consumer) running in the group and paging in the
>> memory (>90% of cgroup limit) and sleeping for the rest of its life. Then I
>> had a pool of consumers running in the same cgroup which page in smaller
>> amount of memory and paging them in the loop to simulate in group memory
>> pressure (call them sharks).
>> The sum of consumed memory is more than memory.limit_in_bytes so some
>> portion of the memory is swapped out.
>> There is one consumer running in the root cgroup running in parallel which
>> makes a pressure on the memory (to trigger background reclaim).
>>
>> Rss+cache of the group drops down significantly (~66% of the limit) if the
>> group is not isolated. On the other hand if we isolate the group we are
>> still saturating the group (~97% of the limit). I can show more
>> comprehensive results if somebody is interested.
>>
> 
> Isn't it the same result with the case where no cgroup is used ?
> What is the problem ?
> Why it's not a problem of configuration ?
> IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> 

I agree with Kame. I am still at a loss in terms of understanding the use
case; I should probably see the rest of the patches.

>> Thanks for comments.
>>
> 
> 
> Maybe you just want "guarantee".
> At 1st thought, this approach has 3 problems. And memcg is designed
> never to prevent global vm scans,
> 
> 1. This cannot be used as "guarantee". Just a way for "don't steal from me!!!"
>    This just implements a "first come, first served" system.
>    I guess this can be used for server designs.....only with very very careful play.
>    If an application exits and loses its memory, there is no guarantee anymore.
> 
> 2. Even with isolation, a task in memcg can be killed by OOM-killer at
>    global memory shortage.
> 
> 3. it seems this will add more page fragmentation if implemented poorly, IOW,
>    can this work with compaction ?
> 

Good points

Balbir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29 15:53   ` Balbir Singh
@ 2011-03-30  8:18     ` Michal Hocko
  2011-03-30 17:59       ` Ying Han
  2011-03-31 10:01       ` Balbir Singh
  0 siblings, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2011-03-30  8:18 UTC (permalink / raw)
  To: Balbir Singh; +Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Tue 29-03-11 21:23:10, Balbir Singh wrote:
> On 03/28/11 16:33, KAMEZAWA Hiroyuki wrote:
> > On Mon, 28 Mar 2011 11:39:57 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
[...]
> > Isn't it the same result with the case where no cgroup is used ?
> > What is the problem ?
> > Why it's not a problem of configuration ?
> > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> > 
> 
> I agree with Kame, I am still at loss in terms of understand the use
> case, I should probably see the rest of the patches

OK, it looks like I am really bad at explaining the use case. Let's try
again then (hopefully in a better way).

Consider a service which serves requests based on in-memory
precomputed or preprocessed data.
Let's assume that getting the data into memory is a rather costly operation
which considerably increases the latency of request processing. Memory
access can be considered random from the system POV because we never
know which requests will come from outside.
This workflow will benefit from having the memory resident as long as
and as much as possible, because the data then has a higher chance of
being used again and so the initial cost pays off.
Why is mlock not the right thing to do here? Well, if the memory were
locked and the working set grew (again, this depends on the
incoming requests), then the application would have to unlock some
portions of the memory or risk OOM, because it basically cannot
overcommit.
On the other hand, if the memory is not mlocked and there is global
memory pressure, some part of the costly memory can be swapped or
paged out, which will increase request latencies. If the application is
placed into an isolated cgroup, though, global (or other cgroups')
activity doesn't influence its cgroup and thus the working set of the
application.
Compared to mlock, we benefit from per-group reclaim when
we get over the limit (or soft limit). So we do not start evicting the
memory unless somebody really puts pressure on the _application_.
Cgroup limits would, of course, need to be selected carefully.
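
The difference described above can be sketched with a toy model in
Python. This is not kernel code: the ToyMm class, its method names and
the page labels are hypothetical, and the only rule it models is the
one from the cover letter, namely that pages of an isolated group are
not put on the global LRU while per-cgroup reclaim keeps working.

```python
from collections import OrderedDict


class ToyMm:
    """Toy model: pages of isolated groups are kept off the global LRU."""

    def __init__(self):
        self.global_lru = OrderedDict()  # page -> group name, oldest first
        self.group_lru = {}              # group name -> OrderedDict of pages
        self.isolated = set()            # names of isolated groups

    def add_group(self, name, isolated=False):
        self.group_lru[name] = OrderedDict()
        if isolated:
            self.isolated.add(name)

    def charge(self, group, page):
        # Every page is tracked on its group's LRU; only pages of
        # non-isolated groups are also visible to global reclaim.
        self.group_lru[group][page] = True
        if group not in self.isolated:
            self.global_lru[page] = group

    def global_reclaim(self, nr_pages):
        """Evict up to nr_pages of the oldest pages on the global LRU."""
        evicted = []
        while self.global_lru and len(evicted) < nr_pages:
            page, group = self.global_lru.popitem(last=False)
            del self.group_lru[group][page]
            evicted.append(page)
        return evicted

    def group_reclaim(self, group, nr_pages):
        """Per-cgroup reclaim, e.g. when the group goes over its own limit."""
        evicted = []
        lru = self.group_lru[group]
        while lru and len(evicted) < nr_pages:
            page, _ = lru.popitem(last=False)
            self.global_lru.pop(page, None)
            evicted.append(page)
        return evicted


mm = ToyMm()
mm.add_group("service", isolated=True)  # the latency-sensitive application
mm.add_group("batch")                   # everything else
for i in range(4):
    mm.charge("service", f"s{i}")
    mm.charge("batch", f"b{i}")

# Global pressure asks for 6 pages but can only see the batch pages.
evicted_global = mm.global_reclaim(6)
# Pressure from the application itself still reclaims within its group.
evicted_group = mm.group_reclaim("service", 1)
```

In this model global pressure can only take the four "batch" pages, no
matter how many it asks for, while the isolated group still gives up
its oldest page once it is reclaimed on its own behalf. With mlock the
second step would not be possible at all.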

There might be other examples where the kernel simply cannot know which
memory is important for the process, and long-unused memory is not
the ideal choice.

Makes sense?
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-30  8:18     ` Michal Hocko
@ 2011-03-30 17:59       ` Ying Han
  2011-03-31  9:53         ` Michal Hocko
  2011-03-31 10:01       ` Balbir Singh
  1 sibling, 1 reply; 35+ messages in thread
From: Ying Han @ 2011-03-30 17:59 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Wed, Mar 30, 2011 at 1:18 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 29-03-11 21:23:10, Balbir Singh wrote:
>> On 03/28/11 16:33, KAMEZAWA Hiroyuki wrote:
>> > On Mon, 28 Mar 2011 11:39:57 +0200
>> > Michal Hocko <mhocko@suse.cz> wrote:
> [...]
>> > Isn't it the same result with the case where no cgroup is used ?
>> > What is the problem ?
>> > Why it's not a problem of configuration ?
>> > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
>> >
>>
>> I agree with Kame, I am still at loss in terms of understand the use
>> case, I should probably see the rest of the patches
>
> OK, it looks that I am really bad at explaining the usecase. Let's try
> it again then (hopefully in a better way).
>
> Consider a service which serves requests based on the in-memory
> precomputed or preprocessed data.
> Let's assume that getting data into memory is rather costly operation
> which considerably increases latency of the request processing. Memory
> access can be considered random from the system POV because we never
> know which requests will come from outside.
> This workflow will benefit from having the memory resident as long as
> and as much as possible because we have higher chances to be used more
> often and so the initial costs would pay off.
> Why is mlock not the right thing to do here? Well, if the memory would
> be locked and the working set would grow (again this depends on the
> incoming requests) then the application would have to unlock some
> portions of the memory or to risk OOM because it basically cannot
> overcommit.
> On the other hand, if the memory is not mlocked and there is a global
> memory pressure we can have some part of the costly memory swapped or
> paged out which will increase requests latencies. If the application is
> placed into an isolated cgroup, though, the global (or other cgroups)
> activity doesn't influence its cgroup thus the working set of the
> application.

> If we compare that to mlock we will benefit from per-group reclaim when
> we get over the limit (or soft limit). So we do not start evicting the
> memory unless somebody makes really pressure on the _application_.
> Cgroup limits would, of course, need to be selected carefully.
>
> There might be other examples when simply kernel cannot know which
> memory is important for the process and the long unused memory is not
> the ideal choice.

Michal,

Reading through your example, it sounds to me like you can accomplish the
"guarantee" for the high-priority service using existing
memcg mechanisms.

Assume you have a service named cgroup-A which needs a memory
"guarantee". Meanwhile, we want to launch cgroup-B with no memory
"guarantee". What you want is for cgroup-B to use the slack memory
(not allocated by cgroup-A), but also to give it up voluntarily under
system memory pressure.

So, continuing from my previous post, consider the following
configuration on a 32G machine. The resident size of cgroup-A can be
at most the machine capacity.

cgroup-A: limit_in_bytes = 32G  soft_limit_in_bytes = 32G
cgroup-B: limit_in_bytes = 20G  soft_limit_in_bytes = 0G

To be a little bit extreme, there shouldn't be memory pressure on
cgroup-A unless it grows above the machine capacity. If the global
memory contention is triggered by cgroup-B, we should steal pages from
it always.
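
A minimal sketch of the behaviour this configuration is meant to
produce, written as a toy Python model rather than kernel code. The
Memcg class, pick_reclaim_victim() and the usage figures are
hypothetical; only the limit values come from the example above.
Picking the group that exceeds its soft limit by the most is an
assumption of the sketch, not a statement about what the current
soft_limit implementation does.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass
class Memcg:
    name: str
    limit_in_bytes: int
    soft_limit_in_bytes: int
    usage_in_bytes: int = 0

    def soft_limit_excess(self) -> int:
        """Bytes by which usage exceeds the soft limit (0 if under it)."""
        return max(0, self.usage_in_bytes - self.soft_limit_in_bytes)


def pick_reclaim_victim(groups):
    """Return the group furthest over its soft limit, or None if none is."""
    over = [g for g in groups if g.soft_limit_excess() > 0]
    return max(over, key=Memcg.soft_limit_excess) if over else None


cgroup_a = Memcg("cgroup-A", limit_in_bytes=32 * GIB, soft_limit_in_bytes=32 * GIB)
cgroup_b = Memcg("cgroup-B", limit_in_bytes=20 * GIB, soft_limit_in_bytes=0)

# Hypothetical usage: A holds 24G, B has grabbed 8G of slack memory.
cgroup_a.usage_in_bytes = 24 * GIB
cgroup_b.usage_in_bytes = 8 * GIB

victim = pick_reclaim_victim([cgroup_a, cgroup_b])
print(victim.name)  # cgroup-B
```

With a soft limit of 0G any usage by cgroup-B counts as excess, so it
is always the first candidate, while cgroup-A is never selected as
long as it stays below 32G.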

However, the current implementation of soft_limit needs to be improved
for the example above. Especially once we start having lots of cgroups
running w/ different limit settings, we need soft_limit to be
efficient, and then we can eliminate the global LRU scanning. The latter one
breaks the isolation.

--Ying

> Makes sense?
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-30 17:59       ` Ying Han
@ 2011-03-31  9:53         ` Michal Hocko
  2011-03-31 18:10           ` Ying Han
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2011-03-31  9:53 UTC (permalink / raw)
  To: Ying Han; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Wed 30-03-11 10:59:21, Ying Han wrote:
> On Wed, Mar 30, 2011 at 1:18 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Tue 29-03-11 21:23:10, Balbir Singh wrote:
> >> On 03/28/11 16:33, KAMEZAWA Hiroyuki wrote:
> >> > On Mon, 28 Mar 2011 11:39:57 +0200
> >> > Michal Hocko <mhocko@suse.cz> wrote:
> > [...]
> >> > Isn't it the same result with the case where no cgroup is used ?
> >> > What is the problem ?
> >> > Why it's not a problem of configuration ?
> >> > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> >> >
> >>
> >> I agree with Kame, I am still at loss in terms of understand the use
> >> case, I should probably see the rest of the patches
> >
> > OK, it looks that I am really bad at explaining the usecase. Let's try
> > it again then (hopefully in a better way).
> >
> > Consider a service which serves requests based on the in-memory
> > precomputed or preprocessed data.
> > Let's assume that getting data into memory is rather costly operation
> > which considerably increases latency of the request processing. Memory
> > access can be considered random from the system POV because we never
> > know which requests will come from outside.
> > This workflow will benefit from having the memory resident as long as
> > and as much as possible because we have higher chances to be used more
> > often and so the initial costs would pay off.
> > Why is mlock not the right thing to do here? Well, if the memory would
> > be locked and the working set would grow (again this depends on the
> > incoming requests) then the application would have to unlock some
> > portions of the memory or to risk OOM because it basically cannot
> > overcommit.
> > On the other hand, if the memory is not mlocked and there is a global
> > memory pressure we can have some part of the costly memory swapped or
> > paged out which will increase requests latencies. If the application is
> > placed into an isolated cgroup, though, the global (or other cgroups)
> > activity doesn't influence its cgroup thus the working set of the
> > application.
> 
> > If we compare that to mlock we will benefit from per-group reclaim when
> > we get over the limit (or soft limit). So we do not start evicting the
> > memory unless somebody makes really pressure on the _application_.
> > Cgroup limits would, of course, need to be selected carefully.
> >
> > There might be other examples when simply kernel cannot know which
> > memory is important for the process and the long unused memory is not
> > the ideal choice.
> 
> Michal,
> 
> Reading through your example, sounds to me you can accomplish the
> "guarantee" of the high priority service using existing
> memcg mechanisms.
> 
> Assume you have the service named cgroup-A which needs memory
> "guarantee". Meantime we want to launch cgroup-B with no memory
> "guarantee". What you want is to have cgroup-B uses the slack memory
> (not being allocated by cgroup-A), but also volunteer to give up under
> system memory pressure.

This would require a "guarantee" that no pages are reclaimed from a
group if that group is under its soft limit, right? I am wondering whether
we can achieve that without too many corner cases where cgroups (processes'
accounted memory) don't leave much for other memory used by the
kernel.
That was my concern so I made that isolation rather opt-in without
modifying the current reclaim logic too much (there are, of course,
parts that can be improved).

> So continue w/ my previous post, you can consider the following
> configuration in 32G machine. We can only have resident size of
> cgroup-A as much as the machine capacity.
> 
> cgroup-A :  limit_in_bytes =32G soft_limit_in_bytes = 32G
> cgroup-B : limit_in_bytes =20G  soft_limit_in_bytes = 0G
> 
> To be a little bit extreme, there shouldn't be memory pressure on
> cgroup-A unless it grows above the machine capacity. If the global
> memory contention is triggered by cgroup-B, we should steal pages from
> it always.
> 
> However, the current implementation of soft_limit needs to be improved
> for the example above. Especially when we start having lots of cgroups
> running w/ different limit setting, we need to have soft_limit being
> efficient and we can eliminate the global lru scanning. 

Having lots of groups really is an issue, because we can end up in a
situation where everybody is under the limit while there is not much memory
left for the kernel. Maybe a sum(soft_limit) < kernel_treshold condition
would solve this.

> The later one breaks the isolation.

Sorry, I don't understand. Why would elimination of the global lru
scanning break isolation? Or am I misreading you?

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-31  9:53         ` Michal Hocko
@ 2011-03-31 18:10           ` Ying Han
  2011-04-01 14:04             ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Ying Han @ 2011-03-31 18:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Thu, Mar 31, 2011 at 2:53 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 30-03-11 10:59:21, Ying Han wrote:
>> On Wed, Mar 30, 2011 at 1:18 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > On Tue 29-03-11 21:23:10, Balbir Singh wrote:
>> >> On 03/28/11 16:33, KAMEZAWA Hiroyuki wrote:
>> >> > On Mon, 28 Mar 2011 11:39:57 +0200
>> >> > Michal Hocko <mhocko@suse.cz> wrote:
>> > [...]
>> >> > Isn't it the same result with the case where no cgroup is used ?
>> >> > What is the problem ?
>> >> > Why it's not a problem of configuration ?
>> >> > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
>> >> >
>> >>
>> >> I agree with Kame, I am still at loss in terms of understand the use
>> >> case, I should probably see the rest of the patches
>> >
>> > OK, it looks that I am really bad at explaining the usecase. Let's try
>> > it again then (hopefully in a better way).
>> >
>> > Consider a service which serves requests based on the in-memory
>> > precomputed or preprocessed data.
>> > Let's assume that getting data into memory is rather costly operation
>> > which considerably increases latency of the request processing. Memory
>> > access can be considered random from the system POV because we never
>> > know which requests will come from outside.
>> > This workflow will benefit from having the memory resident as long as
>> > and as much as possible because we have higher chances to be used more
>> > often and so the initial costs would pay off.
>> > Why is mlock not the right thing to do here? Well, if the memory would
>> > be locked and the working set would grow (again this depends on the
>> > incoming requests) then the application would have to unlock some
>> > portions of the memory or to risk OOM because it basically cannot
>> > overcommit.
>> > On the other hand, if the memory is not mlocked and there is a global
>> > memory pressure we can have some part of the costly memory swapped or
>> > paged out which will increase requests latencies. If the application is
>> > placed into an isolated cgroup, though, the global (or other cgroups)
>> > activity doesn't influence its cgroup thus the working set of the
>> > application.
>>
>> > If we compare that to mlock we will benefit from per-group reclaim when
>> > we get over the limit (or soft limit). So we do not start evicting the
>> > memory unless somebody makes really pressure on the _application_.
>> > Cgroup limits would, of course, need to be selected carefully.
>> >
>> > There might be other examples when simply kernel cannot know which
>> > memory is important for the process and the long unused memory is not
>> > the ideal choice.
>>
>> Michal,
>>
>> Reading through your example, sounds to me you can accomplish the
>> "guarantee" of the high priority service using existing
>> memcg mechanisms.
>>
>> Assume you have the service named cgroup-A which needs memory
>> "guarantee". Meantime we want to launch cgroup-B with no memory
>> "guarantee". What you want is to have cgroup-B uses the slack memory
>> (not being allocated by cgroup-A), but also volunteer to give up under
>> system memory pressure.
>
> This would require a "guarantee" that no pages are reclaimed from a
> group if that group is under its soft limit, right?

yes.

> I am thinking if we
> can achieve that without too many corner cases when cgroups (process's
> accounted memory) don't leave out much for other memory used by the
> kernel.

> That was my concern so I made that isolation rather opt-in without
> modifying the current reclaim logic too much (there are, of course,
> parts that can be improved).

So far we are discussing the memory limit only for user pages. Later
we will definitely need kernel memory (slab) accounting, and reclaim for it
as well. If we put them together, do you still have the concern? Sorry,
I guess I am just trying to understand the concern w/ an example.

>
>> So continue w/ my previous post, you can consider the following
>> configuration in 32G machine. We can only have resident size of
>> cgroup-A as much as the machine capacity.
>>
>> cgroup-A :  limit_in_bytes =32G soft_limit_in_bytes = 32G
>> cgroup-B : limit_in_bytes =20G  soft_limit_in_bytes = 0G
>>
>> To be a little bit extreme, there shouldn't be memory pressure on
>> cgroup-A unless it grows above the machine capacity. If the global
>> memory contention is triggered by cgroup-B, we should steal pages from
>> it always.
>>
>> However, the current implementation of soft_limit needs to be improved
>> for the example above. Especially when we start having lots of cgroups
>> running w/ different limit setting, we need to have soft_limit being
>> efficient and we can eliminate the global lru scanning.
>
> Lots of groups is really an issue because we can end up in a situation
> when everybody is under the limit while there is not much memory left
> for the kernel. Maybe sum(soft_limit) < kernel_treshold condition would
> solve this.

Most of the kernel memory is allocated on behalf of processes in
cgroups. One way of doing that (after we have kernel memory accounting)
is to count kernel memory into usage_in_bytes. So we have the
following:

1) limit_in_bytes: cap of memory allocation (user + kernel) for cgroup-A
2) soft_limit_in_bytes: guarantee of memory allocation  (user +
kernel) for cgroup-A
3) usage_in_bytes: user pages + kernel pages (allocated on behalf of the memcg)

The above needs kernel memory accounting and targeted reclaim. Then we
have sum(soft_limit) < machine capacity. I hope we can talk a bit about
this at LSF too.
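
The three quantities and the capacity condition can be written down as
a small Python sketch. It is only an illustration under assumptions:
kernel memory accounting does not exist yet, so the user_bytes and
kernel_bytes split, the MemcgAccounting class and guarantees_fit() are
hypothetical names, and the sizes are made up.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass
class MemcgAccounting:
    name: str
    limit_in_bytes: int        # 1) cap on user + kernel memory
    soft_limit_in_bytes: int   # 2) "guarantee" for user + kernel memory
    user_bytes: int = 0
    kernel_bytes: int = 0      # allocated on behalf of this memcg

    @property
    def usage_in_bytes(self) -> int:
        # 3) usage counts user pages and kernel pages together
        return self.user_bytes + self.kernel_bytes

    def within_guarantee(self) -> bool:
        """True while the group should not be a reclaim target."""
        return self.usage_in_bytes <= self.soft_limit_in_bytes


def guarantees_fit(groups, machine_capacity: int) -> bool:
    """Check the condition sum(soft_limit) < machine capacity."""
    return sum(g.soft_limit_in_bytes for g in groups) < machine_capacity


a = MemcgAccounting("cgroup-A", 32 * GIB, 24 * GIB,
                    user_bytes=20 * GIB, kernel_bytes=2 * GIB)
b = MemcgAccounting("cgroup-B", 20 * GIB, 4 * GIB,
                    user_bytes=5 * GIB, kernel_bytes=1 * GIB)

fits = guarantees_fit([a, b], machine_capacity=32 * GIB)
```

Here the guarantees add up to 28G on a 32G machine, so the condition
holds; cgroup-A is within its guarantee (22G of 24G) while cgroup-B is
over it (6G against 4G) and would be the one to reclaim from.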





>> The later one breaks the isolation.
>
> Sorry, I don't understand. Why would elimination of the global lru
> scanning break isolation? Or am I misreading you?

Sorry, I meant the other way around. So we agree on this.

--Ying
>
> Thanks
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-31 18:10           ` Ying Han
@ 2011-04-01 14:04             ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2011-04-01 14:04 UTC (permalink / raw)
  To: Ying Han; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Thu 31-03-11 11:10:00, Ying Han wrote:
> On Thu, Mar 31, 2011 at 2:53 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Wed 30-03-11 10:59:21, Ying Han wrote:
[...]
> > That was my concern so I made that isolation rather opt-in without
> > modifying the current reclaim logic too much (there are, of course,
> > parts that can be improved).
> 
> So far we are discussing the memory limit only for user pages. Later
> we definitely need a kernel memory slab accounting and also for
> reclaim. If we put them together, do you still have the concern? Sorry
> guess I am just trying to understand the concern w/ example.

If we account the kernel memory then it should be less problematic, I
guess.

[...]
> > Lots of groups is really an issue because we can end up in a situation
> > when everybody is under the limit while there is not much memory left
> > for the kernel. Maybe sum(soft_limit) < kernel_treshold condition would
> > solve this.
> most of the kernel memory are allocated on behalf of processes in
> cgroup. One way of doing that (after having kernel memory accounting)
> is to count in kernel memory into usage_in_bytes. So we have the
> following:
> 
> 1) limit_in_bytes: cap of memory allocation (user + kernel) for cgroup-A
> 2) soft_limit_in_bytes: guarantee of memory allocation  (user +
> kernel) for cgroup-A
> 3) usage_in_bytes: user pages + kernel pages (allocated on behalf of the memcg)
> 
> The above need kernel memory accounting and targeting reclaim. Then we
> have sum(soft_limit) < machine capacity. Hope we can talk a bit in the
> LSF on this too.

Sure. I am looking forward to it.

> >> The later one breaks the isolation.
> >
> > Sorry, I don't understand. Why would elimination of the global lru
> > scanning break isolation? Or am I misreading you?
> 
> Sorry, i meant the other way around. So we agree on this .

Makes more sense now ;)

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-30  8:18     ` Michal Hocko
  2011-03-30 17:59       ` Ying Han
@ 2011-03-31 10:01       ` Balbir Singh
  1 sibling, 0 replies; 35+ messages in thread
From: Balbir Singh @ 2011-03-31 10:01 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

* Michal Hocko <mhocko@suse.cz> [2011-03-30 10:18:53]:

> On Tue 29-03-11 21:23:10, Balbir Singh wrote:
> > On 03/28/11 16:33, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 28 Mar 2011 11:39:57 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> [...]
> > > Isn't it the same result with the case where no cgroup is used ?
> > > What is the problem ?
> > > Why it's not a problem of configuration ?
> > > IIUC, you can put all logins to some cgroup by using cgroupd/libgcgroup.
> > > 
> > 
> > I agree with Kame, I am still at loss in terms of understand the use
> > case, I should probably see the rest of the patches
> 
> OK, it looks that I am really bad at explaining the usecase. Let's try
> it again then (hopefully in a better way).
> 
> Consider a service which serves requests based on the in-memory
> precomputed or preprocessed data. 
> Let's assume that getting data into memory is rather costly operation
> which considerably increases latency of the request processing. Memory
> access can be considered random from the system POV because we never
> know which requests will come from outside.
> This workflow will benefit from having the memory resident as long as
> and as much as possible because we have higher chances to be used more
> often and so the initial costs would pay off.
> Why is mlock not the right thing to do here? Well, if the memory would
> be locked and the working set would grow (again this depends on the
> incoming requests) then the application would have to unlock some
> portions of the memory or to risk OOM because it basically cannot
> overcommit.
> On the other hand, if the memory is not mlocked and there is a global
> memory pressure we can have some part of the costly memory swapped or
> paged out which will increase requests latencies. If the application is
> placed into an isolated cgroup, though, the global (or other cgroups)
> activity doesn't influence its cgroup thus the working set of the
> application.

I think one important aspect is what percentage of the memory needs to
be isolated/locked? If you expect really large parts, then we are in
trouble, unless we are aware of the exact requirements for memory and
know what else will run on the system.

> If we compare that to mlock we will benefit from per-group reclaim when
> we get over the limit (or soft limit). So we do not start evicting the
> memory unless somebody makes really pressure on the _application_.
> Cgroup limits would, of course, need to be selected carefully.
> 
> There might be other examples when simply kernel cannot know which
> memory is important for the process and the long unused memory is not
> the ideal choice.
>

There are other watermark-based approaches that would work better,
given that memory management is already complicated by topology and
zones, and that we have non-reclaimable memory being used in the kernel
on behalf of applications. I am not ruling out a solution, just sharing
ideas.
NOTE: In the longer run, we want to account for kernel usage and look
at potential reclaim of slab pages.

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-28  9:39 [RFC 0/3] Implementation of cgroup isolation Michal Hocko
                   ` (3 preceding siblings ...)
  2011-03-28 11:03 ` [RFC 0/3] Implementation of cgroup isolation KAMEZAWA Hiroyuki
@ 2011-03-28 18:01 ` Ying Han
  2011-03-29  0:12   ` KAMEZAWA Hiroyuki
  2011-03-29  7:53   ` Michal Hocko
  4 siblings, 2 replies; 35+ messages in thread
From: Ying Han @ 2011-03-28 18:01 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, linux-kernel, Hugh Dickins, Suleiman Souhlal

On Mon, Mar 28, 2011 at 2:39 AM, Michal Hocko <mhocko@suse.cz> wrote:
> Hi all,
>
> Memory cgroups can be currently used to throttle memory usage of a group of
> processes. It, however, cannot be used for an isolation of processes from
> the rest of the system because all the pages that belong to the group are
> also placed on the global LRU lists and so they are eligible for the global
> memory reclaim.
>
> This patchset aims at providing an opt-in memory cgroup isolation. This
> means that a cgroup can be configured to be isolated from the rest of the
> system by means of cgroup virtual filesystem (/dev/memctl/group/memory.isolated).

Thanks to Hugh for pointing me to this thread. We are currently working
on a similar problem in memcg.

Here is the problem we see:
1. In memcg, a page is on both the per-memcg-per-zone LRU and the global LRU.
2. Global memory reclaim will throw pages away regardless of cgroup.
3. The zone->lru_lock is shared between the per-memcg-per-zone LRU and the
   global LRU.

And we know:
1. We shouldn't do global reclaim since it breaks memory isolation.
2. There is no need for a page to be on both LRU lists, especially
after having per-memcg background reclaim.

So our approach is to take a page off the global LRU after it is
charged to a memcg. Only pages allocated in the root cgroup remain on
the global LRU, and each memcg reclaims pages from its own isolated LRU.

By doing this, we can also address the lock contention mentioned in
3) by having a per-memcg-per-zone lock. I can post the patch later if
that helps understanding.
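
The effect of items 1 and 2 above can be illustrated with a toy model.
The sketch below is not kernel code: the ToyLRU class, its methods, and
the page names are all hypothetical, and reclaim is reduced to "evict the
oldest entries on the global list". With exclusive=False every charged
page is also on the global LRU, so global reclaim can evict cgroup A's
pages; with exclusive=True only root cgroup pages remain there.

```python
from collections import OrderedDict


class ToyLRU:
    """Toy model: a page sits on its memcg LRU and, optionally, the global LRU."""

    def __init__(self, exclusive):
        # exclusive=True models the proposal: charged pages leave the global LRU.
        self.exclusive = exclusive
        self.global_lru = OrderedDict()  # page -> owning cgroup, oldest first
        self.memcg_lru = {}              # cgroup -> OrderedDict of its pages

    def charge(self, page, cgroup):
        """Account a page to a cgroup and link it onto the relevant lists."""
        self.memcg_lru.setdefault(cgroup, OrderedDict())[page] = True
        if cgroup == "root" or not self.exclusive:
            self.global_lru[page] = cgroup

    def global_reclaim(self, nr_pages):
        """Evict the oldest pages on the global LRU, regardless of owner."""
        evicted = []
        while self.global_lru and len(evicted) < nr_pages:
            page, cgroup = self.global_lru.popitem(last=False)
            del self.memcg_lru[cgroup][page]
            evicted.append((page, cgroup))
        return evicted


def pages_lost_by(cgroup, exclusive):
    """Count how many pages of `cgroup` a global reclaim of 3 pages evicts."""
    lru = ToyLRU(exclusive)
    for i in range(4):
        lru.charge(f"A{i}", "A")     # cgroup A's pages are the oldest
    for i in range(4):
        lru.charge(f"R{i}", "root")  # root cgroup pages are younger
    evicted = lru.global_reclaim(3)
    return sum(1 for _, owner in evicted if owner == cgroup)
```

In this model, a global reclaim of three pages takes all three from
cgroup A in the shared case and none in the exclusive case, where only
per-memcg reclaim could shrink A.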

Thanks

--Ying

>
> Isolated mem cgroup can be particularly helpful in deployments where we have
> a primary service which needs to have a certain guarantees for memory
> resources (e.g. a database server) and we want to shield it off the
> rest of the system (e.g. a burst memory activity in another group). This is
> currently possible only with mlocking memory that is essential for the
> application(s) or a rather hacky configuration where the primary app is in
> the root mem cgroup while all the other system activity happens in other
> groups.
>
> mlocking is not an ideal solution all the time because sometimes the working
> set is very large and it depends on the workload (e.g. number of incoming
> requests) so it can end up not fitting in into memory (leading to a OOM
> killer). If we use mem. cgroup isolation instead we are keeping memory resident
> and if the working set goes wild we can still do per-cgroup reclaim so the
> service is less prone to be OOM killed.
>
> The patch series is split into 3 patches. First one adds a new flag into
> mem_cgroup structure which controls whether the group is isolated (false by
> default) and a cgroup fs interface to set it.
> The second patch implements interaction with the global LRU. The current
> semantic is that we are putting a page into a global LRU only if mem cgroup
> LRU functions say they do not want the page for themselves.
> The last patch prevents from soft reclaim if the group is isolated.
>
> I have tested the patches with the simple memory consumer (allocating
> private and shared anon memory and SYSV SHM).
>
> One instance (call it big consumer) running in the group and paging in the
> memory (>90% of cgroup limit) and sleeping for the rest of its life. Then I
> had a pool of consumers running in the same cgroup which page in smaller
> amount of memory and paging them in the loop to simulate in group memory
> pressure (call them sharks).
> The sum of consumed memory is more than memory.limit_in_bytes so some
> portion of the memory is swapped out.
> There is one consumer running in the root cgroup running in parallel which
> makes a pressure on the memory (to trigger background reclaim).
>
> Rss+cache of the group drops down significantly (~66% of the limit) if the
> group is not isolated. On the other hand if we isolate the group we are
> still saturating the group (~97% of the limit). I can show more
> comprehensive results if somebody is interested.
>
> Thanks for comments.
>
> ---
>  include/linux/memcontrol.h |   24 ++++++++------
>  include/linux/mm_inline.h  |   10 ++++-
>  mm/memcontrol.c            |   76 ++++++++++++++++++++++++++++++++++++---------
>  mm/swap.c                  |   12 ++++---
>  mm/vmscan.c                |   43 +++++++++++++++----------
>  5 files changed, 118 insertions(+), 47 deletions(-)
>
> --
> Michal Hocko
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-28 18:01 ` Ying Han
@ 2011-03-29  0:12   ` KAMEZAWA Hiroyuki
  2011-03-29  0:37     ` Ying Han
  2011-03-29  7:53   ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-29  0:12 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, linux-mm, linux-kernel, Hugh Dickins,
	Suleiman Souhlal

On Mon, 28 Mar 2011 11:01:18 -0700
Ying Han <yinghan@google.com> wrote:

> On Mon, Mar 28, 2011 at 2:39 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > Hi all,
> >
> > Memory cgroups can be currently used to throttle memory usage of a group of
> > processes. It, however, cannot be used for an isolation of processes from
> > the rest of the system because all the pages that belong to the group are
> > also placed on the global LRU lists and so they are eligible for the global
> > memory reclaim.
> >
> > This patchset aims at providing an opt-in memory cgroup isolation. This
> > means that a cgroup can be configured to be isolated from the rest of the
> > system by means of cgroup virtual filesystem (/dev/memctl/group/memory.isolated).
> 
> Thank you Hugh pointing me to the thread. We are working on similar
> problem in memcg currently
> 
> Here is the problem we see:
> 1. In memcg, a page is both on per-memcg-per-zone lru and global-lru.
> 2. Global memory reclaim will throw page away regardless of cgroup.
> 3. The zone->lru_lock is shared between per-memcg-per-zone lru and global-lru.
> 
> And we know:
> 1. We shouldn't do global reclaim since it breaks memory isolation.
> 2. There is no need for a page to be on both LRU list, especially
> after having per-memcg background reclaim.
> 
> So our approach is to take off page from global lru after it is
> charged to a memcg. Only pages allocated at root cgroup remains in
> global LRU, and each memcg reclaims pages on its isolated LRU.
> 

Why don't you use cpuset and virtual nodes? It's what you want.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  0:12   ` KAMEZAWA Hiroyuki
@ 2011-03-29  0:37     ` Ying Han
  2011-03-29  0:47       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 35+ messages in thread
From: Ying Han @ 2011-03-29  0:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-mm, linux-kernel, Hugh Dickins,
	Suleiman Souhlal

On Mon, Mar 28, 2011 at 5:12 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 28 Mar 2011 11:01:18 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Mon, Mar 28, 2011 at 2:39 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> > Hi all,
>> >
>> > Memory cgroups can be currently used to throttle memory usage of a group of
>> > processes. It, however, cannot be used for an isolation of processes from
>> > the rest of the system because all the pages that belong to the group are
>> > also placed on the global LRU lists and so they are eligible for the global
>> > memory reclaim.
>> >
>> > This patchset aims at providing an opt-in memory cgroup isolation. This
>> > means that a cgroup can be configured to be isolated from the rest of the
>> > system by means of cgroup virtual filesystem (/dev/memctl/group/memory.isolated).
>>
>> Thank you Hugh pointing me to the thread. We are working on similar
>> problem in memcg currently
>>
>> Here is the problem we see:
>> 1. In memcg, a page is both on per-memcg-per-zone lru and global-lru.
>> 2. Global memory reclaim will throw page away regardless of cgroup.
>> 3. The zone->lru_lock is shared between per-memcg-per-zone lru and global-lru.
>>
>> And we know:
>> 1. We shouldn't do global reclaim since it breaks memory isolation.
>> 2. There is no need for a page to be on both LRU list, especially
>> after having per-memcg background reclaim.
>>
>> So our approach is to take off page from global lru after it is
>> charged to a memcg. Only pages allocated at root cgroup remains in
>> global LRU, and each memcg reclaims pages on its isolated LRU.
>>
>
> Why you don't use cpuset and virtual nodes ? It's what you want.

We've been running a cpuset + fake-NUMA node configuration at Google to
provide memory isolation. Configuring the virtual box is complex: the
user needs to know in great detail which node to assign to which
cgroup. That is one of the motivations for us moving towards the
memory controller, which simply does memory accounting no matter
where pages are allocated.

That said, memcg simplifies per-cgroup memory accounting, but
memory isolation is broken. This is one example, where pages
are shared between the global LRU and the per-memcg LRU. It is easy to
get cgroup-A's pages evicted by adding memory pressure to cgroup-B.

The approach we are considering, making page->lru exclusive, should
solve the problem, and we should also be able to break up the
zone->lru_lock sharing.

--Ying




>
> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  0:37     ` Ying Han
@ 2011-03-29  0:47       ` KAMEZAWA Hiroyuki
  2011-03-29  2:29         ` KAMEZAWA Hiroyuki
  2011-03-29  2:46         ` Ying Han
  0 siblings, 2 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-29  0:47 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, linux-mm, linux-kernel, Hugh Dickins,
	Suleiman Souhlal

On Mon, 28 Mar 2011 17:37:02 -0700
Ying Han <yinghan@google.com> wrote:

> On Mon, Mar 28, 2011 at 5:12 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Mon, 28 Mar 2011 11:01:18 -0700
> > Ying Han <yinghan@google.com> wrote:
> >
> >> On Mon, Mar 28, 2011 at 2:39 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >> > Hi all,
> >> >
> >> > Memory cgroups can be currently used to throttle memory usage of a group of
> >> > processes. It, however, cannot be used for an isolation of processes from
> >> > the rest of the system because all the pages that belong to the group are
> >> > also placed on the global LRU lists and so they are eligible for the global
> >> > memory reclaim.
> >> >
> >> > This patchset aims at providing an opt-in memory cgroup isolation. This
> >> > means that a cgroup can be configured to be isolated from the rest of the
> >> > system by means of cgroup virtual filesystem (/dev/memctl/group/memory.isolated).
> >>
> >> Thank you Hugh pointing me to the thread. We are working on similar
> >> problem in memcg currently
> >>
> >> Here is the problem we see:
> >> 1. In memcg, a page is both on per-memcg-per-zone lru and global-lru.
> >> 2. Global memory reclaim will throw page away regardless of cgroup.
> >> 3. The zone->lru_lock is shared between per-memcg-per-zone lru and global-lru.
> >>
> >> And we know:
> >> 1. We shouldn't do global reclaim since it breaks memory isolation.
> >> 2. There is no need for a page to be on both LRU list, especially
> >> after having per-memcg background reclaim.
> >>
> >> So our approach is to take off page from global lru after it is
> >> charged to a memcg. Only pages allocated at root cgroup remains in
> >> global LRU, and each memcg reclaims pages on its isolated LRU.
> >>
> >
> > Why you don't use cpuset and virtual nodes ? It's what you want.
> 
> We've been running cpuset + fakenuma nodes configuration in google to
> provide memory isolation. The configuration of having the virtual box
> is complex which user needs to know great details of the which node to
> assign to which cgroup. That is one of the motivations for us moving
> towards to memory controller which simply do memory accounting no
> matter where pages are allocated.
> 

I think the current fake-NUMA is not useful because it can only be set up at boot time.

> By saying that, memcg simplified the memory accounting per-cgroup but
> the memory isolation is broken. This is one of examples where pages
> are shared between global LRU and per-memcg LRU. It is easy to get
> cgroup-A's page evicted by adding memory pressure to cgroup-B.
> 
If you overcommit... right?


> The approach we are thinking to make the page->lru exclusive solve the
> problem. and also we should be able to break the zone->lru_lock
> sharing.
> 
Is zone->lru_lock a problem even with the help of pagevecs?

If the LRU management guys ack isolating the LRUs and making kswapd etc.
more complex, okay, we'll go that way. This will _change_ the whole
memcg design and concepts. Maybe memcg should have some kind of balloon
driver to work happily with an isolated LRU.

But my current position is "never have bad effects on global reclaim".
So, I'm not very happy with the solution.

If we go that way, I guess we should think about having pseudo nodes/zones,
which were proposed in the early days of resource controls (not cgroup).

Thanks,
-Kame









^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  0:47       ` KAMEZAWA Hiroyuki
@ 2011-03-29  2:29         ` KAMEZAWA Hiroyuki
  2011-03-29  3:02           ` Ying Han
  2011-03-29  2:46         ` Ying Han
  1 sibling, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-29  2:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ying Han, Michal Hocko, linux-mm, linux-kernel, Hugh Dickins,
	Suleiman Souhlal

On Tue, 29 Mar 2011 09:47:56 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 28 Mar 2011 17:37:02 -0700
> Ying Han <yinghan@google.com> wrote:

> > The approach we are thinking to make the page->lru exclusive solve the
> > problem. and also we should be able to break the zone->lru_lock
> > sharing.
> > 
> Is zone->lru_lock is a problem even with the help of pagevecs ?
> 
> If LRU management guys acks you to isolate LRUs and to make kswapd etc..
> more complex, okay, we'll go that way. This will _change_ the whole
> memcg design and concepts Maybe memcg should have some kind of balloon driver to
> work happy with isolated lru.
> 
> But my current standing position is "never bad effects global reclaim".
> So, I'm not very happy with the solution.
> 
> If we go that way, I guess we'll think we should have pseudo nodes/zones, which
> was proposed in early days of resource controls.(not cgroup).
> 

BTW, regarding isolation, I have one thought.

Currently, soft_limit_reclaim is not called in the direct-reclaim path just
because we thought kswapd works well enough. If necessary, I think we can put
a soft-reclaim call in the generic do_try_to_free_pages(order=0).

So, the isolation problem can be reduced to some extent, can't it?
The soft limit algorithm _should_ be updated. I guess it's not a heavily tested
feature.
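
A rough sketch of that ordering, as a toy Python model rather than kernel
code: direct reclaim first takes pages from groups above their soft
limit, and only then falls back to reclaiming from everyone. The function
name, the dictionary layout, the "largest excess first" ordering and the
simple fallback loop are all assumptions made for illustration.

```python
def direct_reclaim(groups, nr_to_reclaim):
    """Reclaim pages, preferring groups that exceed their soft limit.

    groups: dict mapping name -> {"usage": int, "soft_limit": int} in pages.
    Returns a dict mapping name -> pages reclaimed; usage is updated in place.
    """
    reclaimed = {}
    remaining = nr_to_reclaim

    # Phase 1: soft-limit reclaim, visiting the largest excess first
    # (an assumed policy for this toy model).
    over = sorted(
        (name for name, g in groups.items() if g["usage"] > g["soft_limit"]),
        key=lambda name: groups[name]["usage"] - groups[name]["soft_limit"],
        reverse=True,
    )
    for name in over:
        if remaining == 0:
            break
        excess = groups[name]["usage"] - groups[name]["soft_limit"]
        take = min(excess, remaining)
        groups[name]["usage"] -= take
        reclaimed[name] = reclaimed.get(name, 0) + take
        remaining -= take

    # Phase 2: fallback standing in for global reclaim, which does not
    # care who owns the pages; here it simply walks the groups in order.
    for name in groups:
        if remaining == 0:
            break
        take = min(groups[name]["usage"], remaining)
        if take:
            groups[name]["usage"] -= take
            reclaimed[name] = reclaimed.get(name, 0) + take
            remaining -= take

    return reclaimed
```

As long as the groups above their soft limit can cover the request, a
group that stays below its soft limit is left alone in this model; it is
touched only once the excess elsewhere is exhausted.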

About the ROOT cgroup, I think some daemon application should put _all_
processes into some controlled cgroup. So, I don't want to think about
limiting the ROOT cgroup without any justification.

I'd like you to separate 'the talk on performance' from 'the talk on feature'.

"This makes performance better! ...and adds a feature" sounds bad to me.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  2:29         ` KAMEZAWA Hiroyuki
@ 2011-03-29  3:02           ` Ying Han
  0 siblings, 0 replies; 35+ messages in thread
From: Ying Han @ 2011-03-29  3:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-mm, linux-kernel, Hugh Dickins,
	Suleiman Souhlal, Greg Thelen

On Mon, Mar 28, 2011 at 7:29 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 29 Mar 2011 09:47:56 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Mon, 28 Mar 2011 17:37:02 -0700
>> Ying Han <yinghan@google.com> wrote:
>
>> > The approach we are thinking to make the page->lru exclusive solve the
>> > problem. and also we should be able to break the zone->lru_lock
>> > sharing.
>> >
>> Is zone->lru_lock is a problem even with the help of pagevecs ?
>>
>> If LRU management guys acks you to isolate LRUs and to make kswapd etc..
>> more complex, okay, we'll go that way. This will _change_ the whole
>> memcg design and concepts Maybe memcg should have some kind of balloon driver to
>> work happy with isolated lru.
>>
>> But my current standing position is "never bad effects global reclaim".
>> So, I'm not very happy with the solution.
>>
>> If we go that way, I guess we'll think we should have pseudo nodes/zones, which
>> was proposed in early days of resource controls.(not cgroup).
>>
>
> BTW, against isolation, I have one thought.
>
> Now, soft_limit_reclaim is not called in direct-reclaim path just because we thought
> kswapd works enough well. If necessary, I think we can put soft-reclaim call in
> generic do_try_to_free_pages(order=0).

We were talking about that internally, and it definitely makes sense to add.

>
> So, isolation problem can be reduced to some extent, isn't it ?
> Algorithm of softlimit _should_ be updated. I guess it's not heavily tested feature.

Agreed, and that is something we might want to go and fix. soft_limit in
general provides a nice way to overcommit the machine while still
keeping control through targeted reclaim under system memory pressure.

>
> About ROOT cgroup, I think some daemon application should put _all_ process to
> some controled cgroup. So, I don't want to think about limiting on ROOT cgroup
> without any justification.
>
> I'd like you to devide 'the talk on performance' and 'the talk on feature'.
>
> "This makes makes performance better! ...and add an feature" sounds bad to me.

OK, then let's stick to the memory isolation feature for now :)

--Ying
>
> Thanks,
> -Kame
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  0:47       ` KAMEZAWA Hiroyuki
  2011-03-29  2:29         ` KAMEZAWA Hiroyuki
@ 2011-03-29  2:46         ` Ying Han
  2011-03-29  2:45           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 35+ messages in thread
From: Ying Han @ 2011-03-29  2:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-mm, linux-kernel, Hugh Dickins,
	Suleiman Souhlal

On Mon, Mar 28, 2011 at 5:47 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 28 Mar 2011 17:37:02 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Mon, Mar 28, 2011 at 5:12 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Mon, 28 Mar 2011 11:01:18 -0700
>> > Ying Han <yinghan@google.com> wrote:
>> >
>> >> On Mon, Mar 28, 2011 at 2:39 AM, Michal Hocko <mhocko@suse.cz> wrote:
>> >> > Hi all,
>> >> >
>> >> > Memory cgroups can be currently used to throttle memory usage of a group of
>> >> > processes. It, however, cannot be used for an isolation of processes from
>> >> > the rest of the system because all the pages that belong to the group are
>> >> > also placed on the global LRU lists and so they are eligible for the global
>> >> > memory reclaim.
>> >> >
>> >> > This patchset aims at providing an opt-in memory cgroup isolation. This
>> >> > means that a cgroup can be configured to be isolated from the rest of the
>> >> > system by means of cgroup virtual filesystem (/dev/memctl/group/memory.isolated).
>> >>
>> >> Thank you Hugh pointing me to the thread. We are working on similar
>> >> problem in memcg currently
>> >>
>> >> Here is the problem we see:
>> >> 1. In memcg, a page is both on per-memcg-per-zone lru and global-lru.
>> >> 2. Global memory reclaim will throw page away regardless of cgroup.
>> >> 3. The zone->lru_lock is shared between per-memcg-per-zone lru and global-lru.
>> >>
>> >> And we know:
>> >> 1. We shouldn't do global reclaim since it breaks memory isolation.
>> >> 2. There is no need for a page to be on both LRU list, especially
>> >> after having per-memcg background reclaim.
>> >>
>> >> So our approach is to take off page from global lru after it is
>> >> charged to a memcg. Only pages allocated at root cgroup remains in
>> >> global LRU, and each memcg reclaims pages on its isolated LRU.
>> >>
>> >
>> > Why you don't use cpuset and virtual nodes ? It's what you want.
>>
>> We've been running cpuset + fakenuma nodes configuration in google to
>> provide memory isolation. The configuration of having the virtual box
>> is complex which user needs to know great details of the which node to
>> assign to which cgroup. That is one of the motivations for us moving
>> towards to memory controller which simply do memory accounting no
>> matter where pages are allocated.
>>
>
> I think current fake-numa is not useful because it works only at boot time.

Yes, and the big hassle is managing the nodes after boot-up.

>
>> By saying that, memcg simplified the memory accounting per-cgroup but
>> the memory isolation is broken. This is one of examples where pages
>> are shared between global LRU and per-memcg LRU. It is easy to get
>> cgroup-A's page evicted by adding memory pressure to cgroup-B.
>>
> If you overcommit....Right ?

Yes, we want to support configurations that overcommit the
machine with limit_in_bytes.
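
For concreteness, over-committing with limit_in_bytes here means that the
hard limits of all groups add up to more than the machine has. A small
sketch of that arithmetic (the function names and the example sizes are
made up for illustration):

```python
GIB = 1 << 30


def is_overcommitted(limits_in_bytes, machine_capacity):
    """True if the sum of per-cgroup hard limits exceeds physical memory."""
    return sum(limits_in_bytes.values()) > machine_capacity


def worst_case_shortfall(limits_in_bytes, machine_capacity):
    """Bytes that must be reclaimed from someone if every group fills its limit."""
    return max(0, sum(limits_in_bytes.values()) - machine_capacity)


# Hypothetical example: two groups limited to 6 GiB each on an 8 GiB machine.
limits = {"cgroup-A": 6 * GIB, "cgroup-B": 6 * GIB}
capacity = 8 * GIB
```

With these numbers, up to 4 GiB has to come from one group or the other
under pressure, which is where the question of whose pages get evicted
arises.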

>
>
>> The approach we are thinking to make the page->lru exclusive solve the
>> problem. and also we should be able to break the zone->lru_lock
>> sharing.
>>
> Is zone->lru_lock is a problem even with the help of pagevecs ?

> If LRU management guys acks you to isolate LRUs and to make kswapd etc..
> more complex, okay, we'll go that way.

I would assume the change only applies to memcg users; otherwise
everything stays on the global LRU list.

> This will _change_ the whole memcg design and concepts Maybe memcg
> should have some kind of balloon driver to
> work happy with isolated lru.

We have soft_limit hierarchical reclaim for system memory pressure,
and we will also add per-memcg background reclaim. Both of them do
targeted reclaim on per-memcg LRUs, so where is the balloon driver
needed?

Thanks

--Ying

> But my current standing position is "never bad effects global reclaim".
> So, I'm not very happy with the solution.
>
> If we go that way, I guess we'll think we should have pseudo nodes/zones, which
> was proposed in early days of resource controls.(not cgroup).
>
> Thanks,
> -Kame
>
>
>
>
>
>
>
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  2:46         ` Ying Han
@ 2011-03-29  2:45           ` KAMEZAWA Hiroyuki
  2011-03-29  4:03             ` Ying Han
  0 siblings, 1 reply; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-29  2:45 UTC (permalink / raw)
  To: Ying Han
  Cc: Michal Hocko, linux-mm, linux-kernel, Hugh Dickins,
	Suleiman Souhlal

On Mon, 28 Mar 2011 19:46:41 -0700
Ying Han <yinghan@google.com> wrote:

> On Mon, Mar 28, 2011 at 5:47 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> >
> >> By saying that, memcg simplified the memory accounting per-cgroup but
> >> the memory isolation is broken. This is one of examples where pages
> >> are shared between global LRU and per-memcg LRU. It is easy to get
> >> cgroup-A's page evicted by adding memory pressure to cgroup-B.
> >>
> > If you overcommit....Right ?
> 
> yes, we want to support the configuration of over-committing the
> machine w/ limit_in_bytes.
> 

Then soft_limit is the feature for fixing the problem. If you have problems
with soft_limit, let's fix them.


> >
> >
> >> The approach we are thinking to make the page->lru exclusive solve the
> >> problem. and also we should be able to break the zone->lru_lock
> >> sharing.
> >>
> > Is zone->lru_lock is a problem even with the help of pagevecs ?
> 
> > If LRU management guys acks you to isolate LRUs and to make kswapd etc..
> > more complex, okay, we'll go that way.
> 
> I would assume the change only apply to memcg users , otherwise
> everything is leaving in the global LRU list.
> 
> This will _change_ the whole memcg design and concepts Maybe memcg
> should have some kind of balloon driver to
> > work happy with isolated lru.
> 
> We have soft_limit hierarchical reclaim for system memory pressure,
> and also we will add per-memcg background reclaim. Both of them do
> targeting reclaim on per-memcg LRUs, and where is the balloon driver
> needed?
> 

Only if soft_limit is _not_ enough. And I think your background reclaim
should work with soft_limit and be triggered by global memory pressure.

As I wrote in the other mail, it's not called via direct reclaim.
Maybe that's the first thing to fix rather than trying a big change.




Thanks,
-Kame


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-29  2:45           ` KAMEZAWA Hiroyuki
@ 2011-03-29  4:03             ` Ying Han
  0 siblings, 0 replies; 35+ messages in thread
From: Ying Han @ 2011-03-29  4:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-mm, linux-kernel, Hugh Dickins,
	Suleiman Souhlal

On Mon, Mar 28, 2011 at 7:45 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 28 Mar 2011 19:46:41 -0700
> Ying Han <yinghan@google.com> wrote:
>
>> On Mon, Mar 28, 2011 at 5:47 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> >
>> >> By saying that, memcg simplified the memory accounting per-cgroup but
>> >> the memory isolation is broken. This is one of examples where pages
>> >> are shared between global LRU and per-memcg LRU. It is easy to get
>> >> cgroup-A's page evicted by adding memory pressure to cgroup-B.
>> >>
>> > If you overcommit....Right ?
>>
>> yes, we want to support the configuration of over-committing the
>> machine w/ limit_in_bytes.
>>
>
> Then, soft_limit is a feature for fixing the problem. If you have problem
> with soft_limit, let's fix it.

The current implementation of soft_limit is best-effort and some
improvements are needed. Without distracting too much from this thread,
put simply: it is not optimized in how it picks a cgroup from the
per-zone RB-tree.

>
>
>> >
>> >
>> >> The approach we are thinking to make the page->lru exclusive solve the
>> >> problem. and also we should be able to break the zone->lru_lock
>> >> sharing.
>> >>
>> > Is zone->lru_lock is a problem even with the help of pagevecs ?
>>
>> > If LRU management guys acks you to isolate LRUs and to make kswapd etc..
>> > more complex, okay, we'll go that way.
>>
>> I would assume the change only apply to memcg users , otherwise
>> everything is leaving in the global LRU list.
>>
>> > This will _change_ the whole memcg design and concepts. Maybe memcg
>> > should have some kind of balloon driver to work happy with isolated lru.
>>
>> We have soft_limit hierarchical reclaim for system memory pressure,
>> and also we will add per-memcg background reclaim. Both of them do
>> targeting reclaim on per-memcg LRUs, and where is the balloon driver
>> needed?
>>
>
> If soft_limit is _not_ enough. And I think your background reclaim should
> work with soft_limit and be triggered by global memory pressure.

This is something I can think about. Also, I think we agree that we
should have efficient targeted reclaim, so the global LRU scanning
should be eliminated.

>
> As I wrote in another mail, it's not called via direct reclaim.
> Maybe that is the first point to fix rather than attempting a big change.

Agree on this.

--Ying

>
>
>
>
> Thanks,
> -Kame
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/3] Implementation of cgroup isolation
  2011-03-28 18:01 ` Ying Han
  2011-03-29  0:12   ` KAMEZAWA Hiroyuki
@ 2011-03-29  7:53   ` Michal Hocko
  1 sibling, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2011-03-29  7:53 UTC (permalink / raw)
  To: Ying Han; +Cc: linux-mm, linux-kernel, Hugh Dickins, Suleiman Souhlal

Hi,

On Mon 28-03-11 11:01:18, Ying Han wrote:
> On Mon, Mar 28, 2011 at 2:39 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > Hi all,
> >
> > Memory cgroups can be currently used to throttle memory usage of a group of
> > processes. It, however, cannot be used for an isolation of processes from
> > the rest of the system because all the pages that belong to the group are
> > also placed on the global LRU lists and so they are eligible for the global
> > memory reclaim.
> >
> > This patchset aims at providing an opt-in memory cgroup isolation. This
> > means that a cgroup can be configured to be isolated from the rest of the
> > system by means of cgroup virtual filesystem (/dev/memctl/group/memory.isolated).
> 
> Thank you Hugh pointing me to the thread. We are working on similar
> problem in memcg currently
> 
> Here is the problem we see:
> 1. In memcg, a page is both on per-memcg-per-zone lru and global-lru.
> 2. Global memory reclaim will throw page away regardless of cgroup.
> 3. The zone->lru_lock is shared between per-memcg-per-zone lru and global-lru.

This is the primary motivation for the patchset, except that I do not
insist on strict isolation. I find the opt-in approach less invasive
because you have to know what you are doing when you set up a group.
If isolation is enabled by default, I am afraid we could see many
side effects during reclaim.

> And we know:
> 1. We shouldn't do global reclaim since it breaks memory isolation.
> 2. There is no need for a page to be on both LRU list, especially
> after having per-memcg background reclaim.
> 
> So our approach is to take off page from global lru after it is
> charged to a memcg. Only pages allocated at root cgroup remains in
> global LRU, and each memcg reclaims pages on its isolated LRU.

This sounds like the case where all cgroups are isolated by default
(which can be achieved by setting mem_cgroup->isolated = 1).

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2011-04-01 14:04 UTC | newest]

Thread overview: 35+ messages
2011-03-28  9:39 [RFC 0/3] Implementation of cgroup isolation Michal Hocko
2011-03-28  9:39 ` [RFC 1/3] Add mem_cgroup->isolated and configuration knob Michal Hocko
2011-03-28  9:39 ` [RFC 2/3] Implement isolated LRU cgroups Michal Hocko
2011-03-28  9:40 ` [RFC 3/3] Do not shrink isolated groups from the global reclaim Michal Hocko
2011-03-28 11:03 ` [RFC 0/3] Implementation of cgroup isolation KAMEZAWA Hiroyuki
2011-03-28 11:44   ` Michal Hocko
2011-03-29  0:09     ` KAMEZAWA Hiroyuki
2011-03-29  7:32       ` Michal Hocko
2011-03-29  7:51         ` KAMEZAWA Hiroyuki
2011-03-29  8:59           ` Michal Hocko
2011-03-29  9:41             ` KAMEZAWA Hiroyuki
2011-03-29 11:18               ` Michal Hocko
2011-03-29 13:15                 ` Zhu Yanhai
2011-03-29 13:42                   ` Michal Hocko
2011-03-29 14:02                     ` Zhu Yanhai
2011-03-29 14:08                       ` Zhu Yanhai
2011-03-30  7:42                       ` Michal Hocko
2011-03-30  5:32               ` Ying Han
2011-03-29 15:53   ` Balbir Singh
2011-03-30  8:18     ` Michal Hocko
2011-03-30 17:59       ` Ying Han
2011-03-31  9:53         ` Michal Hocko
2011-03-31 18:10           ` Ying Han
2011-04-01 14:04             ` Michal Hocko
2011-03-31 10:01       ` Balbir Singh
2011-03-28 18:01 ` Ying Han
2011-03-29  0:12   ` KAMEZAWA Hiroyuki
2011-03-29  0:37     ` Ying Han
2011-03-29  0:47       ` KAMEZAWA Hiroyuki
2011-03-29  2:29         ` KAMEZAWA Hiroyuki
2011-03-29  3:02           ` Ying Han
2011-03-29  2:46         ` Ying Han
2011-03-29  2:45           ` KAMEZAWA Hiroyuki
2011-03-29  4:03             ` Ying Han
2011-03-29  7:53   ` Michal Hocko