* [PATCH 0/2 v2] memcg: oom locking updates
@ 2011-07-15 12:26 Michal Hocko

From: Michal Hocko @ 2011-07-15 12:26 UTC (permalink / raw)
To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

Hi,
this is the second version of a small patch series with two patches.
The first one is a bug fix; the other one is a cleanup which might be a
bit controversial, and I have no problem dropping it if there is
resistance against it.

I have experienced serious starvation due to the way we currently
handle the oom_lock counter, and the first patch aims at fixing it. The
issue can be reproduced quite easily on a machine with many CPUs and
many tasks fighting for memory (e.g. a 16-CPU machine and 100 tasks,
each allocating and touching 10MB of anonymous memory in a tight loop,
within a 200MB group with swap disabled and oom_control=0).

The other patch changes memcg_oom_mutex to a spinlock. I have no hard
numbers to support why a spinlock is better than a mutex, but it feels
more suitable for the code paths we are using it in at the moment. It
should also reduce the context switch count when there are many
contenders.

Changes since v1:
- reimplemented the lock in cooperation with Kamezawa so that we have
  two entities for checking the current OOM state: oom_lock guarantees
  a single OOM in the current subtree, and under_oom marks all groups
  that are under OOM, for OOM notification
- updated changelogs with test cases

Michal Hocko (2):
  memcg: make oom_lock 0 and 1 based rather than counter
  memcg: change memcg_oom_mutex to spinlock

 mm/memcontrol.c |  104 +++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 79 insertions(+), 25 deletions(-)

--
1.7.5.4
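For reference, a minimal sketch of a stress program matching the
description above (allocate N MB of anonymous memory and touch every
page in a tight loop). The "malloc" program name used in the testcases
below, its argument handling, and the exact loop are assumptions; only
the behavior is described in the changelogs:

	/* malloc.c - allocate N MB of anonymous memory and keep
	 * touching every page so the memcg charge stays claimed.
	 * Sketch only; the real test program is not part of this series. */
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 10;
		size_t size = mb << 20;
		long page = sysconf(_SC_PAGESIZE);
		char *buf = malloc(size);
		size_t off;

		if (!buf)
			return 1;
		for (;;) {
			/* touch each page to force it resident */
			for (off = 0; off < size; off += page)
				buf[off] = 1;
		}
		return 0;
	}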
* [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-07-13 11:05 Michal Hocko

From: Michal Hocko @ 2011-07-13 11:05 UTC (permalink / raw)
To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

Commit 867578cb "memcg: fix oom kill behavior" introduced an oom_lock
counter which is incremented by mem_cgroup_oom_lock when we are about
to handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a
sleep if oom_lock > 1 to prevent multiple OOM kills at the same time.
The counter is then decremented by mem_cgroup_oom_unlock, called from
the same function.

This works correctly, but it can lead to serious starvation when we
have many processes triggering OOM and many CPUs available for them (I
have tested with 16 CPUs).

Consider a process (call it A) which gets the oom_lock (the first one
that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and
other processes that are blocked on the mutex. While A releases the
mutex and calls mem_cgroup_out_of_memory, the others will wake up (one
after another), increase the counter and fall asleep (on
memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory, it takes
the mutex again, decreases oom_lock and wakes the other tasks (if
releasing memory by somebody else - e.g. the killed process - hasn't
done it yet).

A testcase would look like this (assume malloc XXX is a program
allocating XXX megabytes of memory which touches all allocated pages in
a tight loop):
# swapoff SWAP_DEVICE
# cgcreate -g memory:A
# cgset -r memory.oom_control=0 A
# cgset -r memory.limit_in_bytes=200M
# for i in `seq 100`
# do
#	cgexec -g memory:A malloc 10 &
# done

The main problem here is that all processes still race for the mutex
and there is no guarantee that the counter gets back to 0 for those
that got back to mem_cgroup_handle_oom. In the end the whole convoy
increases/decreases the counter, but we never get down to 1, which
would enable killing, so nothing useful can be done. The time is
basically unbounded because it highly depends on scheduling and
ordering on the mutex (I have seen this take hours...).

This patch replaces the counter with simple {un}lock semantics. As
mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to
make sure that nobody else races with us, which is guaranteed by
memcg_oom_mutex.

We have to be careful while locking subtrees because we can encounter a
subtree which is already locked. Consider this hierarchy:

	        A
	       / \
	      B   \
	     / \   \
	    C   D   E

The B - C - D subtree might already be locked. While we want to allow
locking the E subtree, because OOM situations there cannot influence
each other, we definitely do not want to allow locking A. Therefore we
have to refuse the lock if any subtree is already locked, and clear the
lock for all nodes that have been set up to the failure point.

On the other hand, we have to make sure that the rest of the world
recognizes that a group is under OOM even though it doesn't hold the
lock. Therefore we introduce an under_oom variable which is incremented
resp. decremented for the whole subtree when we enter resp. leave
mem_cgroup_handle_oom. under_oom, unlike oom_lock, doesn't need to be
updated under memcg_oom_mutex because its users only check a single
group and they use atomic operations for that.

This can be checked easily by the following test case:
# cgcreate -g memory:A
# cgset -r memory.use_hierarchy=1 A
# cgset -r memory.oom_control=1 A
# cgset -r memory.limit_in_bytes=100M
# cgset -r memory.memsw.limit_in_bytes=100M
# cgcreate -g memory:A/B
# cgset -r memory.oom_control=1 A/B
# cgset -r memory.limit_in_bytes=20M
# cgset -r memory.memsw.limit_in_bytes=20M
# cgexec -g memory:A/B malloc 30 &	# will be blocked by OOM of group B
# cgexec -g memory:A malloc 80 &	# will be blocked by OOM of group A

While B gets the oom_lock, A will not get it. Both of them go to sleep
and wait for an external action. We can raise the limit for A to force
waking it up:
# cgset -r memory.memsw.limit_in_bytes=300M A
# cgset -r memory.limit_in_bytes=300M A

malloc in A has to wake up even though it doesn't hold the oom_lock.

Finally, the unlock path is very easy because we always unlock only the
subtree we have locked previously, while we always decrement under_oom.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changes since v1:
- under_oom added to mark that a subtree is under an OOM condition even
  if it doesn't hold the lock. memcg_oom_recover,
  mem_cgroup_oom_register_event and mem_cgroup_oom_control_read use
  this flag to find out whether a group is under OOM.
- under_oom is handled separately from oom_lock because it doesn't need
  memcg_oom_mutex
- updated changelog with test cases
---
 mm/memcontrol.c |   86 ++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e013b8e..de1702c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -246,7 +246,10 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	atomic_t	oom_lock;
+
+	bool		oom_lock;
+	atomic_t	under_oom;
+
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -1803,37 +1806,83 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
+ * Has to be called with memcg_oom_mutex
  */
 static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 {
-	int x, lock_count = 0;
-	struct mem_cgroup *iter;
+	int lock_count = -1;
+	struct mem_cgroup *iter, *failed = NULL;
+	bool cond = true;
 
-	for_each_mem_cgroup_tree(iter, mem) {
-		x = atomic_inc_return(&iter->oom_lock);
-		lock_count = max(x, lock_count);
+	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+		bool locked = iter->oom_lock;
+
+		iter->oom_lock = true;
+		if (lock_count == -1)
+			lock_count = iter->oom_lock;
+		else if (lock_count != locked) {
+			/*
+			 * this subtree of our hierarchy is already locked
+			 * so we cannot give a lock.
+			 */
+			lock_count = 0;
+			failed = iter;
+			cond = false;
+		}
 	}
-	if (lock_count == 1)
-		return true;
-	return false;
+
+	if (!failed)
+		goto done;
+
+	/*
+	 * OK, we failed to lock the whole subtree so we have to clean up
+	 * what we set up to the failing subtree
+	 */
+	cond = true;
+	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
+		if (iter == failed) {
+			cond = false;
+			continue;
+		}
+		iter->oom_lock = false;
+	}
+done:
+	return lock_count;
 }
 
+/*
+ * Has to be called with memcg_oom_mutex
+ */
 static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 {
 	struct mem_cgroup *iter;
 
+	for_each_mem_cgroup_tree(iter, mem)
+		iter->oom_lock = false;
+	return 0;
+}
+
+static void mem_cgroup_mark_under_oom(struct mem_cgroup *mem)
+{
+	struct mem_cgroup *iter;
+
+	for_each_mem_cgroup_tree(iter, mem)
+		atomic_inc(&iter->under_oom);
+}
+
+static void mem_cgroup_unmark_under_oom(struct mem_cgroup *mem)
+{
+	struct mem_cgroup *iter;
+
 	/*
 	 * When a new child is created while the hierarchy is under oom,
 	 * mem_cgroup_oom_lock() may not be called. We have to use
 	 * atomic_add_unless() here.
 	 */
 	for_each_mem_cgroup_tree(iter, mem)
-		atomic_add_unless(&iter->oom_lock, -1, 0);
-	return 0;
+		atomic_add_unless(&iter->under_oom, -1, 0);
 }
 
-
 static DEFINE_MUTEX(memcg_oom_mutex);
 static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
 
@@ -1875,7 +1924,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
 
 static void memcg_oom_recover(struct mem_cgroup *mem)
 {
-	if (mem && atomic_read(&mem->oom_lock))
+	if (mem && atomic_read(&mem->under_oom))
 		memcg_wakeup_oom(mem);
 }
 
@@ -1893,6 +1942,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	owait.wait.private = current;
 	INIT_LIST_HEAD(&owait.wait.task_list);
 	need_to_kill = true;
+	mem_cgroup_mark_under_oom(mem);
+
 	/* At first, try to OOM lock hierarchy under mem.*/
 	mutex_lock(&memcg_oom_mutex);
 	locked = mem_cgroup_oom_lock(mem);
@@ -1916,10 +1967,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 	mutex_lock(&memcg_oom_mutex);
-	mem_cgroup_oom_unlock(mem);
+	if (locked)
+		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
 	mutex_unlock(&memcg_oom_mutex);
 
+	mem_cgroup_unmark_under_oom(mem);
+
 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
 		return false;
 	/* Give chance to dying process */
@@ -4584,7 +4638,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	list_add(&event->list, &memcg->oom_notify);
 
 	/* already in OOM ? */
-	if (atomic_read(&memcg->oom_lock))
+	if (atomic_read(&memcg->under_oom))
 		eventfd_signal(eventfd, 1);
 	mutex_unlock(&memcg_oom_mutex);
 
@@ -4619,7 +4673,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 
 	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
 
-	if (atomic_read(&mem->oom_lock))
+	if (atomic_read(&mem->under_oom))
 		cb->fill(cb, "under_oom", 1);
 	else
 		cb->fill(cb, "under_oom", 0);
--
1.7.5.4
* Re: [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-07-21 20:58 Andrew Morton

From: Andrew Morton @ 2011-07-21 20:58 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, linux-kernel

On Wed, 13 Jul 2011 13:05:49 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> @@ -1893,6 +1942,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)

does:

: 	memcg_wakeup_oom(mem);
: 	mutex_unlock(&memcg_oom_mutex);
:
: 	mem_cgroup_unmark_under_oom(mem);
:
: 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
: 		return false;
: 	/* Give chance to dying process */
: 	schedule_timeout(1);
: 	return true;
: }

Calling schedule_timeout() in state TASK_RUNNING is equivalent to
calling schedule() and then pointlessly wasting some CPU cycles.

Someone might want to take a look at that, and wonder why this bug
wasn't detected in testing ;)
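For context: schedule_timeout() only actually sleeps for the given
number of ticks when the task state has been set to something other
than TASK_RUNNING beforehand. A minimal sketch of one possible fix for
the quoted tail of mem_cgroup_handle_oom() (an illustration, not
necessarily the fix that was eventually merged):

	/* Give chance to dying process; really sleep for one tick. */
	schedule_timeout_uninterruptible(1);

which is equivalent to spelling out the task state explicitly:

	set_current_state(TASK_UNINTERRUPTIBLE);
	schedule_timeout(1);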
* Re: [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-07-22  0:15 KAMEZAWA Hiroyuki

From: KAMEZAWA Hiroyuki @ 2011-07-22 0:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: Michal Hocko, linux-mm, Balbir Singh, linux-kernel

On Thu, 21 Jul 2011 13:58:17 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 13 Jul 2011 13:05:49 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
>
> > @@ -1893,6 +1942,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>
> does:
>
> : 	memcg_wakeup_oom(mem);
> : 	mutex_unlock(&memcg_oom_mutex);
> :
> : 	mem_cgroup_unmark_under_oom(mem);
> :
> : 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> : 		return false;
> : 	/* Give chance to dying process */
> : 	schedule_timeout(1);
> : 	return true;
> : }
>
> Calling schedule_timeout() in state TASK_RUNNING is equivalent to
> calling schedule() and then pointlessly wasting some CPU cycles.

Ouch (--;

> Someone might want to take a look at that, and wonder why this bug
> wasn't detected in testing ;)

I wonder whether just removing this is okay... because we didn't notice
this in our recent OOM tests. I'll do some.

Thanks,
-Kame
* Re: [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-08-09 14:03 Johannes Weiner

From: Johannes Weiner @ 2011-08-09 14:03 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Wed, Jul 13, 2011 at 01:05:49PM +0200, Michal Hocko wrote:
> @@ -1803,37 +1806,83 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
>   * If someone is running, return false.
> + * Has to be called with memcg_oom_mutex
>   */
>  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  {
> -	int x, lock_count = 0;
> -	struct mem_cgroup *iter;
> +	int lock_count = -1;
> +	struct mem_cgroup *iter, *failed = NULL;
> +	bool cond = true;
>
> -	for_each_mem_cgroup_tree(iter, mem) {
> -		x = atomic_inc_return(&iter->oom_lock);
> -		lock_count = max(x, lock_count);
> +	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> +		bool locked = iter->oom_lock;
> +
> +		iter->oom_lock = true;
> +		if (lock_count == -1)
> +			lock_count = iter->oom_lock;
> +		else if (lock_count != locked) {
> +			/*
> +			 * this subtree of our hierarchy is already locked
> +			 * so we cannot give a lock.
> +			 */
> +			lock_count = 0;
> +			failed = iter;
> +			cond = false;
> +		}

I noticed system-wide hangs during a parallel/hierarchical memcg test
and found that a single task with a central i_mutex held was sleeping
on the memcg OOM waitqueue, stalling everyone else contending for that
same inode.

The problem is the above code, which never succeeds in hierarchies
with more than one member. The first task going OOM tries to oom lock
the hierarchy, fails, goes to sleep on the OOM waitqueue with the
mutex held, without anybody actually OOM killing anything to make
progress.

Here is a patch that rectified things for me.

---
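To spell out the failure mode described above, here is the quoted loop
again with a worked two-member walk-through in the comments (the
annotations are added for illustration; the code itself is exactly the
hunk quoted above):

	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
		bool locked = iter->oom_lock;	/* OLD value */

		iter->oom_lock = true;
		if (lock_count == -1)
			/* first member: oom_lock was just set to true,
			 * so lock_count unconditionally becomes 1 */
			lock_count = iter->oom_lock;
		else if (lock_count != locked) {
			/* second member, previously unlocked: locked is
			 * the old value 0 while lock_count is 1, so
			 * 1 != 0 and we bail out even though nobody
			 * held the lock before us */
			lock_count = 0;
			failed = iter;
			cond = false;
		}
	}

Hence any hierarchy with at least two memcgs always takes the rollback
path and mem_cgroup_oom_lock() never returns true.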
* Re: [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-08-09 15:22 Michal Hocko

From: Michal Hocko @ 2011-08-09 15:22 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Tue 09-08-11 16:03:12, Johannes Weiner wrote:
[...]
> I noticed system-wide hangs during a parallel/hierarchical memcg test
> and found that a single task with a central i_mutex held was sleeping
> on the memcg OOM waitqueue, stalling everyone else contending for that
> same inode.

Nasty. Thanks for reporting and fixing this. The condition is screwed
up totally :/

> The problem is the above code, which never succeeds in hierarchies
> with more than one member. The first task going OOM tries to oom lock
> the hierarchy, fails, goes to sleep on the OOM waitqueue with the
> mutex held, without anybody actually OOM killing anything to make
> progress.
>
> Here is a patch that rectified things for me.
>
> ---
> From c4b52cbe01ed67d6487a96850400cdf5a9de91aa Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <jweiner@redhat.com>
> Date: Tue, 9 Aug 2011 15:31:30 +0200
> Subject: [patch] memcg: fix hierarchical oom locking
>
> Commit "79dfdac memcg: make oom_lock 0 and 1 based rather than
> counter" tried to oom lock the hierarchy and roll back upon
> encountering an already locked memcg.
>
> The code is pretty confused when it comes to detecting a locked
> memcg, though, so it would fail and rollback after locking one memcg
> and encountering an unlocked second one.
>
> The result is that oom-locking hierarchies fails unconditionally and
> that every oom killer invocation simply goes to sleep on the oom
> waitqueue forever. The tasks practically hang forever without anyone
> intervening, possibly holding locks that trip up unrelated tasks, too.
>
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>

Looks good. Thanks! Just a minor nit about the done label below.

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c |   14 ++++----------
>  1 files changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 930de94..649c568 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1841,25 +1841,19 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>   */
>  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  {
> -	int lock_count = -1;

Yes, the whole lock_count thingy is just stupid. We care only about all
or nothing, and the state of the first member is really not important.

>  	struct mem_cgroup *iter, *failed = NULL;
>  	bool cond = true;
>
>  	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> -		bool locked = iter->oom_lock;
> -
> -		iter->oom_lock = true;
> -		if (lock_count == -1)
> -			lock_count = iter->oom_lock;
> -		else if (lock_count != locked) {
> +		if (iter->oom_lock) {
>  			/*
>  			 * this subtree of our hierarchy is already locked
>  			 * so we cannot give a lock.
>  			 */
> -			lock_count = 0;
>  			failed = iter;
>  			cond = false;
> -		}
> +		} else
> +			iter->oom_lock = true;
>  	}
>
>  	if (!failed)

We can return here and get rid of the done label.

> @@ -1878,7 +1872,7 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  		iter->oom_lock = false;
>  	}
>  done:
> -	return lock_count;
> +	return failed == NULL;
>  }
>
>  /*
> --
> 1.7.6

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-08-09 15:37 Johannes Weiner

From: Johannes Weiner @ 2011-08-09 15:37 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Tue, Aug 09, 2011 at 05:22:18PM +0200, Michal Hocko wrote:
> On Tue 09-08-11 16:03:12, Johannes Weiner wrote:
> >  	struct mem_cgroup *iter, *failed = NULL;
> >  	bool cond = true;
> >
> >  	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> > -		bool locked = iter->oom_lock;
> > -
> > -		iter->oom_lock = true;
> > -		if (lock_count == -1)
> > -			lock_count = iter->oom_lock;
> > -		else if (lock_count != locked) {
> > +		if (iter->oom_lock) {
> >  			/*
> >  			 * this subtree of our hierarchy is already locked
> >  			 * so we cannot give a lock.
> >  			 */
> > -			lock_count = 0;
> >  			failed = iter;
> >  			cond = false;
> > -		}
> > +		} else
> > +			iter->oom_lock = true;
> >  	}
> >
> >  	if (!failed)
>
> We can return here and get rid of the done label.

Ah, right you are. Here is an update.

---
* Re: [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-08-09 15:43 Michal Hocko

From: Michal Hocko @ 2011-08-09 15:43 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Tue 09-08-11 17:37:32, Johannes Weiner wrote:
> On Tue, Aug 09, 2011 at 05:22:18PM +0200, Michal Hocko wrote:
> > We can return here and get rid of the done label.
>
> Ah, right you are. Here is an update.

Thanks!

> ---
> From 86b36904033e6c6a1af4716e9deef13ebd31e64c Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <jweiner@redhat.com>
> Date: Tue, 9 Aug 2011 15:31:30 +0200
> Subject: [patch] memcg: fix hierarchical oom locking
>
> Commit "79dfdac memcg: make oom_lock 0 and 1 based rather than
> counter" tried to oom lock the hierarchy and roll back upon
> encountering an already locked memcg.
>
> The code is confused when it comes to detecting a locked memcg,
> though, so it would fail and rollback after locking one memcg and
> encountering an unlocked second one.

It is actually worse than that. The way it is broken also allows
locking a hierarchy which already contains a locked subtree...

> The result is that oom-locking hierarchies fails unconditionally and
> that every oom killer invocation simply goes to sleep on the oom
> waitqueue forever. The tasks practically hang forever without anyone
> intervening, possibly holding locks that trip up unrelated tasks, too.
>
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   17 +++++------------
>  1 files changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c6faa32..f39c8fb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1843,29 +1843,23 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>   */
>  static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  {
> -	int lock_count = -1;
>  	struct mem_cgroup *iter, *failed = NULL;
>  	bool cond = true;
>
>  	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
> -		bool locked = iter->oom_lock;
> -
> -		iter->oom_lock = true;
> -		if (lock_count == -1)
> -			lock_count = iter->oom_lock;
> -		else if (lock_count != locked) {
> +		if (iter->oom_lock) {
>  			/*
>  			 * this subtree of our hierarchy is already locked
>  			 * so we cannot give a lock.
>  			 */
> -			lock_count = 0;
>  			failed = iter;
>  			cond = false;
> -		}
> +		} else
> +			iter->oom_lock = true;
>  	}
>
>  	if (!failed)
> -		goto done;
> +		return true;
>
>  	/*
>  	 * OK, we failed to lock the whole subtree so we have to clean up
> @@ -1879,8 +1873,7 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
>  		}
>  		iter->oom_lock = false;
>  	}
> -done:
> -	return lock_count;
> +	return false;
>  }
>
>  /*
> --
> 1.7.6

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
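For readability, here is the whole function as it reads with the update
applied, reassembled from the quoted hunks (the surrounding context and
exact indentation are assumptions based on the diff):

	/*
	 * Check OOM-Killer is already running under our hierarchy.
	 * If someone is running, return false.
	 * Has to be called with memcg_oom_mutex
	 */
	static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
	{
		struct mem_cgroup *iter, *failed = NULL;
		bool cond = true;

		for_each_mem_cgroup_tree_cond(iter, mem, cond) {
			if (iter->oom_lock) {
				/*
				 * this subtree of our hierarchy is already locked
				 * so we cannot give a lock.
				 */
				failed = iter;
				cond = false;
			} else
				iter->oom_lock = true;
		}

		if (!failed)
			return true;

		/*
		 * OK, we failed to lock the whole subtree so we have to
		 * clean up what we set up to the failing subtree
		 */
		cond = true;
		for_each_mem_cgroup_tree_cond(iter, mem, cond) {
			if (iter == failed) {
				cond = false;
				continue;
			}
			iter->oom_lock = false;
		}
		return false;
	}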
* Re: [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter
@ 2011-08-10  0:22 KAMEZAWA Hiroyuki

From: KAMEZAWA Hiroyuki @ 2011-08-10 0:22 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Tue, 9 Aug 2011 17:37:32 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> Ah, right you are. Here is an update.
>
> ---
> From 86b36904033e6c6a1af4716e9deef13ebd31e64c Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <jweiner@redhat.com>
> Date: Tue, 9 Aug 2011 15:31:30 +0200
> Subject: [patch] memcg: fix hierarchical oom locking
>
> Commit "79dfdac memcg: make oom_lock 0 and 1 based rather than
> counter" tried to oom lock the hierarchy and roll back upon
> encountering an already locked memcg.
>
> The code is confused when it comes to detecting a locked memcg,
> though, so it would fail and rollback after locking one memcg and
> encountering an unlocked second one.
>
> The result is that oom-locking hierarchies fails unconditionally and
> that every oom killer invocation simply goes to sleep on the oom
> waitqueue forever. The tasks practically hang forever without anyone
> intervening, possibly holding locks that trip up unrelated tasks, too.
>
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks,
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock
@ 2011-07-14 15:29 Michal Hocko

From: Michal Hocko @ 2011-07-14 15:29 UTC (permalink / raw)
To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

memcg_oom_mutex is used to protect the memcg OOM path and the eventfd
interface for oom_control. None of the critical sections which it
protects sleep (eventfd_signal works from atomic context, and the rest
are simple linked list resp. oom_lock atomic operations).

A mutex is also too heavyweight for those code paths because it
triggers a lot of scheduling. It also makes convoying effects more
visible when there is a large number of OOM kills, because we take the
lock multiple times during mem_cgroup_handle_oom, so there are multiple
places where many processes can sleep.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   22 +++++++++++-----------
 1 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index de1702c..f11f198 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1806,7 +1806,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 /*
  * Check OOM-Killer is already running under our hierarchy.
  * If someone is running, return false.
- * Has to be called with memcg_oom_mutex
+ * Has to be called with memcg_oom_lock
  */
 static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
 {
@@ -1851,7 +1851,7 @@ done:
 }
 
 /*
- * Has to be called with memcg_oom_mutex
+ * Has to be called with memcg_oom_lock
  */
 static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
 {
@@ -1883,7 +1883,7 @@ static void mem_cgroup_unmark_under_oom(struct mem_cgroup *mem)
 		atomic_add_unless(&iter->under_oom, -1, 0);
 }
 
-static DEFINE_MUTEX(memcg_oom_mutex);
+static DEFINE_SPINLOCK(memcg_oom_lock);
 static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
 
 struct oom_wait_info {
@@ -1945,7 +1945,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 	mem_cgroup_mark_under_oom(mem);
 
 	/* At first, try to OOM lock hierarchy under mem.*/
-	mutex_lock(&memcg_oom_mutex);
+	spin_lock(&memcg_oom_lock);
 	locked = mem_cgroup_oom_lock(mem);
 	/*
 	 * Even if signal_pending(), we can't quit charge() loop without
@@ -1957,7 +1957,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		need_to_kill = false;
 	if (locked)
 		mem_cgroup_oom_notify(mem);
-	mutex_unlock(&memcg_oom_mutex);
+	spin_unlock(&memcg_oom_lock);
 
 	if (need_to_kill) {
 		finish_wait(&memcg_oom_waitq, &owait.wait);
@@ -1966,11 +1966,11 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
 		schedule();
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
-	mutex_lock(&memcg_oom_mutex);
+	spin_lock(&memcg_oom_lock);
 	if (locked)
 		mem_cgroup_oom_unlock(mem);
 	memcg_wakeup_oom(mem);
-	mutex_unlock(&memcg_oom_mutex);
+	spin_unlock(&memcg_oom_lock);
 
 	mem_cgroup_unmark_under_oom(mem);
 
@@ -4632,7 +4632,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	if (!event)
 		return -ENOMEM;
 
-	mutex_lock(&memcg_oom_mutex);
+	spin_lock(&memcg_oom_lock);
 
 	event->eventfd = eventfd;
 	list_add(&event->list, &memcg->oom_notify);
@@ -4640,7 +4640,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
 	/* already in OOM ? */
 	if (atomic_read(&memcg->under_oom))
 		eventfd_signal(eventfd, 1);
-	mutex_unlock(&memcg_oom_mutex);
+	spin_unlock(&memcg_oom_lock);
 
 	return 0;
 }
@@ -4654,7 +4654,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 
 	BUG_ON(type != _OOM_TYPE);
 
-	mutex_lock(&memcg_oom_mutex);
+	spin_lock(&memcg_oom_lock);
 
 	list_for_each_entry_safe(ev, tmp, &mem->oom_notify, list) {
 		if (ev->eventfd == eventfd) {
@@ -4663,7 +4663,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp,
 		}
 	}
 
-	mutex_unlock(&memcg_oom_mutex);
+	spin_unlock(&memcg_oom_lock);
 }
 
 static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
--
1.7.5.4
* Re: [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock
@ 2011-07-20  5:55 KAMEZAWA Hiroyuki

From: KAMEZAWA Hiroyuki @ 2011-07-20 5:55 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Thu, 14 Jul 2011 17:29:51 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> memcg_oom_mutex is used to protect the memcg OOM path and the eventfd
> interface for oom_control. None of the critical sections which it
> protects sleep (eventfd_signal works from atomic context, and the rest
> are simple linked list resp. oom_lock atomic operations).
>
> A mutex is also too heavyweight for those code paths because it
> triggers a lot of scheduling. It also makes convoying effects more
> visible when there is a large number of OOM kills, because we take the
> lock multiple times during mem_cgroup_handle_oom, so there are
> multiple places where many processes can sleep.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock
@ 2011-07-20  7:01 Michal Hocko

From: Michal Hocko @ 2011-07-20 7:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Wed 20-07-11 14:55:53, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 17:29:51 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> [...]
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks!

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock
@ 2011-07-20  6:34 Balbir Singh

From: Balbir Singh @ 2011-07-20 6:34 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Thu, Jul 14, 2011 at 8:59 PM, Michal Hocko <mhocko@suse.cz> wrote:
> memcg_oom_mutex is used to protect the memcg OOM path and the eventfd
> interface for oom_control. None of the critical sections which it
> protects sleep (eventfd_signal works from atomic context, and the rest
> are simple linked list resp. oom_lock atomic operations).
>
> A mutex is also too heavyweight for those code paths because it
> triggers a lot of scheduling. It also makes convoying effects more
> visible when there is a large number of OOM kills, because we take the
> lock multiple times during mem_cgroup_handle_oom, so there are
> multiple places where many processes can sleep.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Quick question: how long do we expect this lock to be held? What
happens under OOM? Any tests? Numbers?

Balbir Singh
* Re: [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock
@ 2011-07-20  7:00 Michal Hocko

From: Michal Hocko @ 2011-07-20 7:00 UTC (permalink / raw)
To: Balbir Singh; +Cc: linux-mm, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Wed 20-07-11 12:04:17, Balbir Singh wrote:
> On Thu, Jul 14, 2011 at 8:59 PM, Michal Hocko <mhocko@suse.cz> wrote:
> [...]
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
>
> Quick question: how long do we expect this lock to be held?

The lock is taken in:
* mem_cgroup_handle_oom, at 2 places
  - to protect mem_cgroup_oom_lock and mem_cgroup_oom_notify
  - to protect mem_cgroup_oom_unlock and memcg_wakeup_oom
  mem_cgroup_oom_{un}lock as well as mem_cgroup_oom_notify scale with
  the number of groups in the hierarchy. memcg_wakeup_oom scales with
  the number of all blocked tasks on the memcg_oom_waitq (which is not
  mem_cgroup specific), and memcg_oom_wake_function can go up the
  hierarchy for all of them in the worst case.
* mem_cgroup_oom_register_event uses it to protect notifier
  registration (one list_add operation) plus notification in case the
  group is already under OOM - we can consider both operations to be
  constant time
* mem_cgroup_oom_unregister_event protects unregistration, so it scales
  with the number of notifiers. I guess this is potentially unlimited,
  but I wouldn't be afraid of that, as we just call list_del on each
  one.

> What happens under oom?

Could you be more specific? Does the above explain it?

> Any tests? Numbers?

I was testing with the test mentioned in the other patch and I couldn't
measure any significant difference. That is why I noted that I do not
have any hard numbers to base my argumentation on. It is just that the
mutex doesn't _feel_ right in the code paths we are using it in now.

> Balbir Singh

Thanks
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
Thread overview: 14+ messages

2011-07-15 12:26 [PATCH 0/2 v2] memcg: oom locking updates  Michal Hocko
2011-07-13 11:05 ` [PATCH 1/2 v2] memcg: make oom_lock 0 and 1 based rather than counter  Michal Hocko
2011-07-21 20:58   ` Andrew Morton
2011-07-22  0:15     ` KAMEZAWA Hiroyuki
2011-08-09 14:03   ` Johannes Weiner
2011-08-09 15:22     ` Michal Hocko
2011-08-09 15:37       ` Johannes Weiner
2011-08-09 15:43         ` Michal Hocko
2011-08-10  0:22         ` KAMEZAWA Hiroyuki
2011-07-14 15:29 ` [PATCH 2/2] memcg: change memcg_oom_mutex to spinlock  Michal Hocko
2011-07-20  5:55   ` KAMEZAWA Hiroyuki
2011-07-20  7:01     ` Michal Hocko
2011-07-20  6:34   ` Balbir Singh
2011-07-20  7:00     ` Michal Hocko