Date: Tue, 9 Feb 2010 10:24:45 +0900
Subject: Re: [BUGFIX][PATCH] memcg: fix oom killer kills a task in other cgroup
From: Minchan Kim
To: KAMEZAWA Hiroyuki
Cc: "linux-mm@kvack.org", "balbir@linux.vnet.ibm.com", "nishimura@mxp.nes.nec.co.jp", "akpm@linux-foundation.org", rientjes@google.com

On Tue, Feb 9, 2010 at 9:32 AM, KAMEZAWA Hiroyuki wrote:
> On Sat, 6 Feb 2010 01:30:49 +0900
> Minchan Kim wrote:
>
>> Hi, Kame.
>>
>> On Fri, Feb 5, 2010 at 9:39 AM, KAMEZAWA Hiroyuki
>> wrote:
>> > Please take this patch in a different context from the recent discussion.
>> > This is a quick fix for a terrible bug.
>> >
>> > This patch itself is against mmotm but can easily be applied to mainline or
>> > a stable tree, I think. (But I won't CC the stable tree until I get an ack.)
>> >
>> > ==
>> > Currently, the oom killer kills a process's children first. But this means
>> > a child in another cgroup can be killed, and that is not checked now.
>> >
>> > This patch fixes that.
>> >
>> > CC: Balbir Singh
>> > CC: Daisuke Nishimura
>> > Signed-off-by: KAMEZAWA Hiroyuki
>> > ---
>> >  mm/oom_kill.c |    3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> > Index: mmotm-2.6.33-Feb03/mm/oom_kill.c
>> > ===================================================================
>> > --- mmotm-2.6.33-Feb03.orig/mm/oom_kill.c
>> > +++ mmotm-2.6.33-Feb03/mm/oom_kill.c
>> > @@ -459,6 +459,9 @@ static int oom_kill_process(struct task_
>> >         list_for_each_entry(c, &p->children, sibling) {
>> >                 if (c->mm == p->mm)
>> >                         continue;
>> > +               /* Children may be in other cgroup */
>> > +               if (mem && !task_in_mem_cgroup(c, mem))
>> > +                       continue;
>> >                 if (!oom_kill_task(c))
>> >                         return 0;
>> >         }
>> >
>> > --
>>
>> I am worried about the OOM latency in the worst case.
>> I mean that task_in_mem_cgroup() takes the child's task_lock.
>> We use task_lock in many places, and some of them take task_lock and then
>> other locks. For example, exit_fs() holds task_lock and then tries to take
>> the write lock of fs->lock.
>> If the child already holds task_lock and is waiting for the write lock of
>> fs->lock, the OOM latency depends on fs->lock.
>>
>> I am not sure how many other users also depend on other locks this way.
>> Even if none do today, we can't make sure of it in the future.
>>
>> So how about a try_task_in_mem_cgroup()?
>> If we can't take task_lock, let's continue to the next child.
>>
> It's recommended not to use trylock in unclear cases.
>
> Then, I think a possible replacement is not to use any lock in
> task_in_mem_cgroup. On brief consideration, I don't think task_lock is
> necessary if we add some tricks and a memory barrier.
>
> Please let this patch go in as it is, because it is an obvious bug fix,
> and give me time.

I think it's not only an OOM latency problem; it is also a deadlock problem.
We can't know the child's lock state in oom_kill_process().
So if you can remove the lock as in your suggestion below, I am okay with it.
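To make that lock dependency concrete, here is a minimal userspace sketch
(an illustration only, not kernel code: child_task_lock and fs_lock are just
stand-ins for task_lock() and fs->lock, and the trylock branch models the
try_task_in_mem_cgroup() idea of skipping a child whose task_lock is
contended):

==
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t child_task_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

/* models a child in exit_fs(): its task lock held, waiting on fs->lock */
static void *child_fn(void *arg)
{
        pthread_mutex_lock(&child_task_lock);
        pthread_mutex_lock(&fs_lock);           /* blocks while fs_lock is busy */
        pthread_mutex_unlock(&fs_lock);
        pthread_mutex_unlock(&child_task_lock);
        return NULL;
}

/* models the oom path deciding whether it may inspect this child */
static void *oom_fn(void *arg)
{
        /*
         * A plain lock on child_task_lock here would make the oom path's
         * latency depend on whoever holds fs_lock; trylock lets it skip
         * the contended child instead.
         */
        if (pthread_mutex_trylock(&child_task_lock) != 0) {
                puts("task lock contended, skip this child");
                return NULL;
        }
        puts("safe to check this child's cgroup");
        pthread_mutex_unlock(&child_task_lock);
        return NULL;
}

int main(void)
{
        pthread_t child, oom;

        pthread_mutex_lock(&fs_lock);           /* keep fs_lock busy */
        pthread_create(&child, NULL, child_fn, NULL);
        usleep(100 * 1000);                     /* let the child take its task lock */
        pthread_create(&oom, NULL, oom_fn, NULL);
        pthread_join(oom, NULL);
        pthread_mutex_unlock(&fs_lock);         /* now the child can finish */
        pthread_join(child, NULL);
        return 0;
}
==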
> Now, I think of the following.
> This makes use of the fact that mm->owner is changed only at _exit() of the
> owner. If there is a race with _exit() and mm->owner is racy, the oom
> selection itself was already racy and bad.

It seems to make sense to me.

> ==
> int task_in_mem_cgroup_oom(struct task_struct *tsk, struct mem_cgroup *mem)
> {
>         struct mm_struct *mm;
>         int ret = 0;
>
>         mm = tsk->mm;
>         if (!mm)
>                 return ret;
>         /*
>          * we are not interested in tasks other than the owner. mm->owner is
>          * updated when the owner task exits. If the owner is exiting now
>          * (and races with us), we may miss it.
>          */
>         if (rcu_dereference(mm->owner) != tsk)
>                 return ret;

Yes. In this case, the OOM killer can wait a few seconds until this task has
exited. If we don't do that, we could kill another, innocent task.

>         rcu_read_lock();
>         /* while this task is alive, this task is the owner */
>         if (mem == mem_cgroup_from_task(tsk))
>                 ret = 1;
>         rcu_read_unlock();
>         return ret;
> }
> ==
> Hmm, it seems no memory barrier is necessary.
>
> Does anyone have another idea?
>
> Thanks,
> -Kame
>

-- 
Kind regards,
Minchan Kim