From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 1F9EF6B002B for ; Mon, 8 Oct 2012 11:09:54 -0400 (EDT)
Date: Mon, 8 Oct 2012 11:09:49 -0400
From: Dave Jones
Subject: mpol_to_str revisited.
Message-ID: <20121008150949.GA15130@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linux Kernel
Cc: bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton

Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
a user triggerable BUG in mempolicy.

Ben Hutchings pointed out to me that my change introduced a potential leak
of stack contents to userspace, because none of the callers check the return value.

This patch adds the missing return checking, and also clears the buffer beforehand.

Reported-by: Ben Hutchings
Cc: stable@kernel.org
Signed-off-by: Dave Jones

---
unanswered question: why are the buffer sizes here different ? which is correct?

diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
--- src/git-trees/kernel/linux/fs/proc/task_mmu.c	2012-05-31 22:32:46.778150675 -0400
+++ linux-dj/fs/proc/task_mmu.c	2012-10-04 19:31:41.269988984 -0400
@@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
 	struct mm_walk walk = {};
 	struct mempolicy *pol;
 	int n;
+	int ret;
 	char buffer[50];
 
 	if (!mm)
@@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
 	walk.mm = mm;
 
 	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
-	mpol_to_str(buffer, sizeof(buffer), pol, 0);
+	memset(buffer, 0, sizeof(buffer));
+	ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
+	if (ret < 0)
+		return 0;
+
 	mpol_cond_put(pol);
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm/shmem.c linux-dj/mm/shmem.c
--- src/git-trees/kernel/linux/mm/shmem.c	2012-10-02 15:49:51.977277944 -0400
+++ linux-dj/mm/shmem.c	2012-10-04 19:32:28.862949907 -0400
@@ -885,13 +885,15 @@ redirty:
 static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
 {
 	char buffer[64];
+	int ret;
 
 	if (!mpol || mpol->mode == MPOL_DEFAULT)
 		return;		/* show nothing */
 
-	mpol_to_str(buffer, sizeof(buffer), mpol, 1);
-
-	seq_printf(seq, ",mpol=%s", buffer);
+	memset(buffer, 0, sizeof(buffer));
+	ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1);
+	if (ret > 0)
+		seq_printf(seq, ",mpol=%s", buffer);
 }
 
 static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
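
The caller-side pattern in the patch above can be sketched in plain userspace C: clear the buffer first, then print it only when the formatter reports success. This is a minimal sketch, not kernel code. policy_to_str(), show_policy(), struct policy and the mode table are hypothetical stand-ins for mpol_to_str(), its seq_file callers and struct mempolicy, and the error codes are assumptions made for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the mempolicy modes; not the kernel's enum. */
enum policy_mode {
	POLICY_DEFAULT,
	POLICY_PREFERRED,
	POLICY_BIND,
	POLICY_INTERLEAVE,
	POLICY_MAX
};

/* Hypothetical stand-in for struct mempolicy: only the mode matters here. */
struct policy {
	int mode;
};

static const char *const policy_names[POLICY_MAX] = {
	"default", "prefer", "bind", "interleave",
};

/*
 * Hypothetical formatter, loosely modelled on mpol_to_str(). Returns the
 * string length on success, -EINVAL for an unknown mode and -ENOSPC if
 * the buffer is too small (both error codes are assumptions). On error
 * the buffer is left untouched, which is why callers must not print it
 * blindly.
 */
static int policy_to_str(char *buf, size_t len, const struct policy *pol)
{
	const char *name;
	size_t n;

	if (pol->mode < 0 || pol->mode >= POLICY_MAX)
		return -EINVAL;
	name = policy_names[pol->mode];
	n = strlen(name);
	if (n + 1 > len)
		return -ENOSPC;
	memcpy(buf, name, n + 1);
	return (int)n;
}

/*
 * Caller pattern: zero the buffer so a failed call can never expose stale
 * stack bytes, and emit the string only when formatting succeeded.
 * Returns the number of characters written to out, or 0 if nothing was.
 */
static int show_policy(char *out, size_t outlen, const struct policy *pol)
{
	char buffer[64];
	int ret;

	memset(buffer, 0, sizeof(buffer));
	ret = policy_to_str(buffer, sizeof(buffer), pol);
	if (ret < 0)
		return 0;
	return snprintf(out, outlen, ",mpol=%s", buffer);
}
```

Passing a mode that is out of range (for example a poison-like value such as 0x6b6b) makes policy_to_str() fail, and show_policy() then writes nothing instead of echoing whatever happened to be in the buffer.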
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx121.postini.com [74.125.245.121]) by kanga.kvack.org (Postfix) with SMTP id 29E9C6B0044 for ; Mon, 8 Oct 2012 11:15:56 -0400 (EDT) Date: Mon, 8 Oct 2012 11:15:52 -0400 From: Dave Jones Subject: Re: mpol_to_str revisited. Message-ID: <20121008151552.GA10881@redhat.com> References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121008150949.GA15130@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Mon, Oct 08, 2012 at 11:09:49AM -0400, Dave Jones wrote: > Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove > a user triggerable BUG in mempolicy. > > Ben Hutchings pointed out to me that my change introduced a potential leak > of stack contents to userspace, because none of the callers check the return value. > > This patch adds the missing return checking, and also clears the buffer beforehand. > > Reported-by: Ben Hutchings > Cc: stable@kernel.org > Signed-off-by: Dave Jones > > --- > unanswered question: why are the buffer sizes here different ? which is correct? A further unanswered question is how the state got so screwed up that we hit that default case at all. Looking at the original report: https://lkml.org/lkml/2012/9/6/356 What's in RAX looks suspiciously like left-over slab poison. If pol->mode was poisoned, that smells like we have a race where policy is getting freed while another process is reading it. Am I missing something, or is there no locking around that at all ? Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx130.postini.com [74.125.245.130]) by kanga.kvack.org (Postfix) with SMTP id 3C08B6B0044 for ; Mon, 8 Oct 2012 16:35:45 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so4860033pad.14 for ; Mon, 08 Oct 2012 13:35:44 -0700 (PDT) Date: Mon, 8 Oct 2012 13:35:42 -0700 (PDT) From: David Rientjes Subject: Re: mpol_to_str revisited. In-Reply-To: <20121008150949.GA15130@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Mon, 8 Oct 2012, Dave Jones wrote: > unanswered question: why are the buffer sizes here different ? which is correct? > Given the current set of mempolicy modes and flags, it's 34, but this can change if new modes or flags are added with longer names. I see no reason why shmem shouldn't round up to the nearest power-of-2 of 64 like it already does, but 50 is certainly safe as well in task_mmu.c. 
> diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400 > +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400 > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file > struct mm_walk walk = {}; > struct mempolicy *pol; > int n; > + int ret; > char buffer[50]; > > if (!mm) > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file > walk.mm = mm; > > pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > - mpol_to_str(buffer, sizeof(buffer), pol, 0); > + memset(buffer, 0, sizeof(buffer)); > + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0); > + if (ret < 0) > + return 0; We should need the mpol_cond_put(pol) here before returning. > + > mpol_cond_put(pol); > > seq_printf(m, "%08lx %s", vma->vm_start, buffer); > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm/shmem.c linux-dj/mm/shmem.c > --- src/git-trees/kernel/linux/mm/shmem.c 2012-10-02 15:49:51.977277944 -0400 > +++ linux-dj/mm/shmem.c 2012-10-04 19:32:28.862949907 -0400 > @@ -885,13 +885,15 @@ redirty: > static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) > { > char buffer[64]; > + int ret; > > if (!mpol || mpol->mode == MPOL_DEFAULT) > return; /* show nothing */ > > - mpol_to_str(buffer, sizeof(buffer), mpol, 1); > - > - seq_printf(seq, ",mpol=%s", buffer); > + memset(buffer, 0, sizeof(buffer)); > + ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1); > + if (ret > 0) > + seq_printf(seq, ",mpol=%s", buffer); > } > > static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx137.postini.com [74.125.245.137]) by kanga.kvack.org (Postfix) with SMTP id 5432A6B002B for ; Mon, 8 Oct 2012 16:46:41 -0400 (EDT) Received: by mail-da0-f41.google.com with SMTP id i14so2093461dad.14 for ; Mon, 08 Oct 2012 13:46:40 -0700 (PDT) Date: Mon, 8 Oct 2012 13:46:38 -0700 (PDT) From: David Rientjes Subject: Re: mpol_to_str revisited. In-Reply-To: <20121008151552.GA10881@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121008151552.GA10881@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Mon, 8 Oct 2012, Dave Jones wrote: > If pol->mode was poisoned, that smells like we have a race where policy is getting freed > while another process is reading it. > > Am I missing something, or is there no locking around that at all ? > The only thing that is held during the read() is a reference to the task_struct so it doesn't disappear from under us. The protection needed for a task's mempolicy, however, is task_lock() and that is not held. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx202.postini.com [74.125.245.202]) by kanga.kvack.org (Postfix) with SMTP id E90E66B002B for ; Mon, 8 Oct 2012 16:52:17 -0400 (EDT) Date: Mon, 8 Oct 2012 16:52:13 -0400 From: Dave Jones Subject: Re: mpol_to_str revisited. 
Message-ID: <20121008205213.GA23211@redhat.com> References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Mon, Oct 08, 2012 at 01:35:42PM -0700, David Rientjes wrote: > > unanswered question: why are the buffer sizes here different ? which is correct? > > > Given the current set of mempolicy modes and flags, it's 34, but this can > change if new modes or flags are added with longer names. I see no reason > why shmem shouldn't round up to the nearest power-of-2 of 64 like it > already does, but 50 is certainly safe as well in task_mmu.c. Ok. I'll leave that for now. > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400 > > +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400 > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file > > struct mm_walk walk = {}; > > struct mempolicy *pol; > > int n; > > + int ret; > > char buffer[50]; > > > > if (!mm) > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file > > walk.mm = mm; > > > > pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > > - mpol_to_str(buffer, sizeof(buffer), pol, 0); > > + memset(buffer, 0, sizeof(buffer)); > > + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0); > > + if (ret < 0) > > + return 0; > > We should need the mpol_cond_put(pol) here before returning. good catch. I'll respin the patch later with this changed. thanks, Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161]) by kanga.kvack.org (Postfix) with SMTP id 96EE26B002B for ; Mon, 8 Oct 2012 20:33:22 -0400 (EDT)
Message-ID: <1349742791.6336.11.camel@deadeye.wl.decadent.org.uk>
Subject: Re: mpol_to_str revisited.
From: Ben Hutchings
Date: Tue, 09 Oct 2012 01:33:11 +0100
In-Reply-To: <20121008150949.GA15130@redhat.com>
References: <20121008150949.GA15130@redhat.com>
Content-Type: multipart/signed; micalg="pgp-sha512"; protocol="application/pgp-signature"; boundary="=-Y9MwhCVWrtXjuOnyH+rX"
Mime-Version: 1.0
Sender: owner-linux-mm@kvack.org
List-ID:
To: Dave Jones
Cc: Linux Kernel , linux-mm@kvack.org, Linus Torvalds , Andrew Morton

--=-Y9MwhCVWrtXjuOnyH+rX
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Mon, 2012-10-08 at 11:09 -0400, Dave Jones wrote:
> Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
> a user triggerable BUG in mempolicy.
>
> Ben Hutchings pointed out to me that my change introduced a potential leak
> of stack contents to userspace, because none of the callers check the return value.
>
> This patch adds the missing return checking, and also clears the buffer beforehand.
>
> Reported-by: Ben Hutchings

I was wearing my other hat at the time (ben@decadent.org.uk).

> Cc: stable@kernel.org
> Signed-off-by: Dave Jones
>
> ---
> unanswered question: why are the buffer sizes here different ? which is correct?
[...]

Further question: why even use an intermediate buffer on the stack?
Both callers want to write the result to a seq_file. Should mpol_str()
then be replaced with a seq_mpol()?

Ben.

--
Ben Hutchings
Who are all these weirdos?
 - David Bowie, about L-Space IRC channel #afp --=-Y9MwhCVWrtXjuOnyH+rX-- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx166.postini.com [74.125.245.166]) by kanga.kvack.org (Postfix) with SMTP id C9ADB6B005A for ; Mon, 15 Oct 2012 20:48:35 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so6078821pad.14 for ; Mon, 15 Oct 2012 17:48:35 -0700 (PDT) Date: Mon, 15 Oct 2012 17:48:33 -0700 (PDT) From: David Rientjes Subject: Re: mpol_to_str revisited. 
In-Reply-To: <20121008205213.GA23211@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121008205213.GA23211@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Mon, 8 Oct 2012, Dave Jones wrote: > > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c > > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400 > > > +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400 > > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file > > > struct mm_walk walk = {}; > > > struct mempolicy *pol; > > > int n; > > > + int ret; > > > char buffer[50]; > > > > > > if (!mm) > > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file > > > walk.mm = mm; > > > > > > pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > > > - mpol_to_str(buffer, sizeof(buffer), pol, 0); > > > + memset(buffer, 0, sizeof(buffer)); > > > + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0); > > > + if (ret < 0) > > > + return 0; > > > > We should need the mpol_cond_put(pol) here before returning. > > good catch. I'll respin the patch later with this changed. > Did you get a chance to fix this issue? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx126.postini.com [74.125.245.126]) by kanga.kvack.org (Postfix) with SMTP id 194CB6B002B for ; Mon, 15 Oct 2012 22:35:15 -0400 (EDT) Received: by mail-oa0-f41.google.com with SMTP id k14so7018225oag.14 for ; Mon, 15 Oct 2012 19:35:14 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <20121008150949.GA15130@redhat.com> References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Mon, 15 Oct 2012 22:34:53 -0400 Message-ID: Subject: Re: mpol_to_str revisited. Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Mon, Oct 8, 2012 at 11:09 AM, Dave Jones wrote: > Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove > a user triggerable BUG in mempolicy. > > Ben Hutchings pointed out to me that my change introduced a potential leak > of stack contents to userspace, because none of the callers check the return value. > > This patch adds the missing return checking, and also clears the buffer beforehand. I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. we should close a race (or kill remain ref count leak) if we still have. Because of, this patch makes unstable /proc output and might lead to userland confusing. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id 800536B002B for ; Mon, 15 Oct 2012 23:58:36 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so6215958pad.14 for ; Mon, 15 Oct 2012 20:58:35 -0700 (PDT) Date: Mon, 15 Oct 2012 20:58:33 -0700 (PDT) From: David Rientjes Subject: Re: mpol_to_str revisited. In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: KOSAKI Motohiro Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Mon, 15 Oct 2012, KOSAKI Motohiro wrote: > I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. It's certainly not a complete fix, but I think it's a much better result of the race, i.e. we don't panic anymore, we simply fail the read() instead. > we should > close a race (or kill remain ref count leak) if we still have. As I mentioned earlier in the thread, the read() is done here on a task while only a reference to the task_struct is taken and we do not hold task_lock() which is required for task->mempolicy. Once that is fixed, mpol_to_str() should never be called for !task->mempolicy so it will never need to return -EINVAL in such a condition. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
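
The locking rule described above (anyone reading task->mempolicy must hold task_lock()) can be illustrated with a small userspace analogue. This is only a sketch under stated assumptions: struct task, struct policy, task_policy_mode() and task_drop_policy() are hypothetical names, and a pthread mutex stands in for task_lock(). It is not the kernel's implementation.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for struct mempolicy and task_struct. */
struct policy {
	int mode;
};

struct task {
	pthread_mutex_t lock;	/* plays the role of task_lock(task) */
	struct policy *policy;	/* plays the role of task->mempolicy */
};

/*
 * Reader side: take the lock, then look at the policy. Because the exit
 * path below clears the pointer under the same lock before freeing it,
 * the reader sees either a live policy or NULL, never freed memory.
 * Returns the mode, or -1 if the task no longer has a policy.
 */
static int task_policy_mode(struct task *t)
{
	int mode = -1;

	pthread_mutex_lock(&t->lock);
	if (t->policy)
		mode = t->policy->mode;
	pthread_mutex_unlock(&t->lock);
	return mode;
}

/*
 * Exit side: detach the policy while holding the lock, then free it.
 * Freeing after the unlock is safe here because no reader can obtain
 * the pointer once it has been set to NULL under the lock.
 */
static void task_drop_policy(struct task *t)
{
	struct policy *old;

	pthread_mutex_lock(&t->lock);
	old = t->policy;
	t->policy = NULL;
	pthread_mutex_unlock(&t->lock);
	free(old);
}
```

A reader that skipped the lock could load the pointer just before task_drop_policy() frees it and then dereference freed memory, which is the kind of race suspected in this thread.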
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx171.postini.com [74.125.245.171]) by kanga.kvack.org (Postfix) with SMTP id 6624C6B0062 for ; Tue, 16 Oct 2012 01:10:55 -0400 (EDT) Received: by mail-oa0-f41.google.com with SMTP id k14so7126402oag.14 for ; Mon, 15 Oct 2012 22:10:54 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 01:10:34 -0400 Message-ID: Subject: Re: mpol_to_str revisited. Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Mon, Oct 15, 2012 at 11:58 PM, David Rientjes wrote: > On Mon, 15 Oct 2012, KOSAKI Motohiro wrote: > >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. > > It's certainly not a complete fix, but I think it's a much better result > of the race, i.e. we don't panic anymore, we simply fail the read() > instead. Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring to caller complex. That's not good and have no worth. >> we should >> close a race (or kill remain ref count leak) if we still have. > > As I mentioned earlier in the thread, the read() is done here on a task > while only a reference to the task_struct is taken and we do not hold > task_lock() which is required for task->mempolicy. Once that is fixed, > mpol_to_str() should never be called for !task->mempolicy so it will never > need to return -EINVAL in such a condition. I agree that's obviously a bug and we should fix it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx190.postini.com [74.125.245.190]) by kanga.kvack.org (Postfix) with SMTP id CE4846B002B for ; Tue, 16 Oct 2012 02:10:12 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so6371790pbb.14 for ; Mon, 15 Oct 2012 23:10:12 -0700 (PDT) Date: Mon, 15 Oct 2012 23:10:09 -0700 (PDT) From: David Rientjes Subject: Re: mpol_to_str revisited. In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: KOSAKI Motohiro Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. > > > > It's certainly not a complete fix, but I think it's a much better result > > of the race, i.e. we don't panic anymore, we simply fail the read() > > instead. > > Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring > to caller complex. That's not good and have no worth. > Before: the kernel panics, all workloads cease. After: the file shows garbage, all workloads continue. This is better, in my opinion, but at best it's only a judgment call and has no effect on anything. I agree it would be better to respect the return value of mpol_to_str() since there are other possible error conditions other than a freed mempolicy, but let's not consider reverting 80de7c3138. It is obviously not a full solution to the problem, though, and we need to serialize with task_lock(). Dave, are you interested in coming up with a patch? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx128.postini.com [74.125.245.128]) by kanga.kvack.org (Postfix) with SMTP id 61A956B002B for ; Tue, 16 Oct 2012 19:39:50 -0400 (EDT) Received: by mail-oa0-f41.google.com with SMTP id k14so8418997oag.14 for ; Tue, 16 Oct 2012 16:39:49 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 19:39:29 -0400 Message-ID: Subject: Re: mpol_to_str revisited. Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Tue, Oct 16, 2012 at 2:10 AM, David Rientjes wrote: > On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. >> > >> > It's certainly not a complete fix, but I think it's a much better result >> > of the race, i.e. we don't panic anymore, we simply fail the read() >> > instead. >> >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring >> to caller complex. That's not good and have no worth. >> > > Before: the kernel panics, all workloads cease. > After: the file shows garbage, all workloads continue. > > This is better, in my opinion, but at best it's only a judgment call and > has no effect on anything. Kernel panics help to find our serious mistake. > I agree it would be better to respect the return value of mpol_to_str() > since there are other possible error conditions other than a freed > mempolicy, but let's not consider reverting 80de7c3138. It is obviously > not a full solution to the problem, though, and we need to serialize with > task_lock(). Sorry no. I will have to revert it. mempolicy have already a lot of meaningless complex and bring us a lot of problems. I haven't seen any reason adding more. 
> Dave, are you interested in coming up with a patch? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id 3F77F6B002B for ; Tue, 16 Oct 2012 20:12:53 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so7346699pbb.14 for ; Tue, 16 Oct 2012 17:12:52 -0700 (PDT) Date: Tue, 16 Oct 2012 17:12:50 -0700 (PDT) From: David Rientjes Subject: Re: mpol_to_str revisited. In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: KOSAKI Motohiro Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring > >> to caller complex. That's not good and have no worth. > >> > > > > Before: the kernel panics, all workloads cease. > > After: the file shows garbage, all workloads continue. > > > > This is better, in my opinion, but at best it's only a judgment call and > > has no effect on anything. > > Kernel panics help to find our serious mistake. > Kernel panics are not your little debugging tool to let users suffer through for non-fatal issues. > > I agree it would be better to respect the return value of mpol_to_str() > > since there are other possible error conditions other than a freed > > mempolicy, but let's not consider reverting 80de7c3138. It is obviously > > not a full solution to the problem, though, and we need to serialize with > > task_lock(). > > Sorry no. I will have to revert it. Feel free to revert anything you wish in your own tree, I couldn't care less. 
If you try to propose it upstream, Andrew will surely ask you to justify the BUG(), good luck on that. I'll reply to this message with the fix that I think is best.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx164.postini.com [74.125.245.164]) by kanga.kvack.org (Postfix) with SMTP id 0DA2C6B002B for ; Tue, 16 Oct 2012 20:31:26 -0400 (EDT)
Received: by mail-pa0-f41.google.com with SMTP id fa10so7319653pad.14 for ; Tue, 16 Oct 2012 17:31:26 -0700 (PDT)
Date: Tue, 16 Oct 2012 17:31:23 -0700 (PDT)
From: David Rientjes
Subject: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
In-Reply-To:
Message-ID:
References: <20121008150949.GA15130@redhat.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrew Morton , Linus Torvalds
Cc: Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org

When reading /proc/pid/numa_maps, it's possible to return the contents of
the stack where the mempolicy string should be printed if the policy gets
freed from beneath us.

This happens because mpol_to_str() may return an error, and the
stack-allocated buffer is then printed without ever being written to.

There are two possible error conditions in mpol_to_str():

 - if the buffer allocated is insufficient for the string to be stored,
   and

 - if the mempolicy has an invalid mode.

The first error condition is not triggered in any of the callers to
mpol_to_str(): at least 50 bytes is always allocated on the stack and this
is sufficient for the string to be written. A future patch should convert
this into BUILD_BUG_ON() since we know the maximum strlen possible, but
that's not -rc material.

The second error condition is possible if a race occurs in dropping a
reference to a task's mempolicy causing it to be freed during the read().
The slab poison value is then used for the mode and mpol_to_str() returns
-EINVAL.

This race is only possible because get_vma_policy() believes that
mm->mmap_sem protects task->mempolicy, which isn't true. The exit path
does not hold mm->mmap_sem when dropping the reference or setting
task->mempolicy to NULL: it uses task_lock(task) instead.

Thus, it's required for the caller of a task mempolicy to hold
task_lock(task) while grabbing the mempolicy and reading it. Callers with
a vma policy store their mempolicy earlier and can simply increment the
reference count so it's guaranteed not to be freed.

Reported-by: Dave Jones
Signed-off-by: David Rientjes
---
 fs/proc/task_mmu.c |    7 +++++--
 mm/mempolicy.c     |    5 ++---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	struct vm_area_struct *vma = v;
 	struct numa_maps *md = &numa_priv->md;
 	struct file *file = vma->vm_file;
+	struct task_struct *task = proc_priv->task;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mm_walk walk = {};
 	struct mempolicy *pol;
@@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	walk.private = md;
 	walk.mm = mm;
 
-	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
+	task_lock(task);
+	pol = get_vma_policy(task, vma, vma->vm_start);
 	mpol_to_str(buffer, sizeof(buffer), pol, 0);
 	mpol_cond_put(pol);
+	task_unlock(task);
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
 
@@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
 		seq_printf(m, " heap");
 	} else {
-		pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid);
+		pid_t tid = vm_is_stack(task, vma, is_pid);
 		if (tid != 0) {
 			/*
 			 * Thread stack in /proc/PID/task/TID/maps or
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b78fb9..d04a8a5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies
- * are protected by the task's mmap_sem, which must be held for read by
- * the caller.
+ * Current or other task's task mempolicy and non-shared vma policies must be
+ * protected by task_lock(task) by the caller.
  * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
  * count--added by the get_policy() vm_op, as appropriate--to protect against
  * freeing by another task. It is the caller's responsibility to free the

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx186.postini.com [74.125.245.186]) by kanga.kvack.org (Postfix) with SMTP id 1F9F26B002B for ; Tue, 16 Oct 2012 21:34:16 -0400 (EDT) Received: by mail-ob0-f169.google.com with SMTP id va7so8350620obc.14 for ; Tue, 16 Oct 2012 18:34:15 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 21:33:55 -0400 Message-ID: Subject: Re: mpol_to_str revisited. 
Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton On Tue, Oct 16, 2012 at 8:12 PM, David Rientjes wrote: > On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring >> >> to caller complex. That's not good and have no worth. >> >> >> > >> > Before: the kernel panics, all workloads cease. >> > After: the file shows garbage, all workloads continue. >> > >> > This is better, in my opinion, but at best it's only a judgment call and >> > has no effect on anything. >> >> Kernel panics help to find our serious mistake. > > Kernel panics are not your little debugging tool to let users suffer > through for non-fatal issues. use after free is fatal, no doubt. > >> > I agree it would be better to respect the return value of mpol_to_str() >> > since there are other possible error conditions other than a freed >> > mempolicy, but let's not consider reverting 80de7c3138. It is obviously >> > not a full solution to the problem, though, and we need to serialize with >> > task_lock(). >> >> Sorry no. I will have to revert it. > > Feel free to revert anything you wish in your own tree, I couldn't care > less. If you try to propose it upstream, Andrew will surely ask you to > justify the BUG(), good luck on that. Yeah. I'm ok just remove both BUG() and EINVAL, but current situation (i.e. ignoring EINVAL by caller) is surely bad. So, just revert is best IMHO. > > I'll reply to this message with the fix that I think is best. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id 5F3EE6B005D for ; Tue, 16 Oct 2012 21:38:48 -0400 (EDT) Received: by mail-oa0-f41.google.com with SMTP id k14so8510708oag.14 for ; Tue, 16 Oct 2012 18:38:47 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 21:38:26 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Andrew Morton , Linus Torvalds , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Tue, Oct 16, 2012 at 8:31 PM, David Rientjes wrote: > When reading /proc/pid/numa_maps, it's possible to return the contents of > the stack where the mempolicy string should be printed if the policy gets > freed from beneath us. > > This happens because mpol_to_str() may return an error the > stack-allocated buffer is then printed without ever being stored. > > There are two possible error conditions in mpol_to_str(): > > - if the buffer allocated is insufficient for the string to be stored, > and > > - if the mempolicy has an invalid mode. > > The first error condition is not triggered in any of the callers to > mpol_to_str(): at least 50 bytes is always allocated on the stack and this > is sufficient for the string to be written. A future patch should convert > this into BUILD_BUG_ON() since we know the maximum strlen possible, but > that's not -rc material. > > The second error condition is possible if a race occurs in dropping a > reference to a task's mempolicy causing it to be freed during the read(). 
> The slab poison value is then used for the mode and mpol_to_str() returns > -EINVAL. > > This race is only possible because get_vma_policy() believes that > mm->mmap_sem protects task->mempolicy, which isn't true. The exit path > does not hold mm->mmap_sem when dropping the reference or setting > task->mempolicy to NULL: it uses task_lock(task) instead. > > Thus, it's required for the caller of a task mempolicy to hold > task_lock(task) while grabbing the mempolicy and reading it. Callers with > a vma policy store their mempolicy earlier and can simply increment the > reference count so it's guaranteed not to be freed. > > Reported-by: Dave Jones > Signed-off-by: David Rientjes > --- > fs/proc/task_mmu.c | 7 +++++-- > mm/mempolicy.c | 5 ++--- > 2 files changed, 7 insertions(+), 5 deletions(-) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) > struct vm_area_struct *vma = v; > struct numa_maps *md = &numa_priv->md; > struct file *file = vma->vm_file; > + struct task_struct *task = proc_priv->task; > struct mm_struct *mm = vma->vm_mm; > struct mm_walk walk = {}; > struct mempolicy *pol; > @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) > walk.private = md; > walk.mm = mm; > > - pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > + task_lock(task); > + pol = get_vma_policy(task, vma, vma->vm_start); > mpol_to_str(buffer, sizeof(buffer), pol, 0); > mpol_cond_put(pol); > + task_unlock(task); > > seq_printf(m, "%08lx %s", vma->vm_start, buffer); > > @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) > } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { > seq_printf(m, " heap"); > } else { > - pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid); > + pid_t tid = vm_is_stack(task, vma, is_pid); > if (tid != 0) { > /* 
> * Thread stack in /proc/PID/task/TID/maps or > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 0b78fb9..d04a8a5 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, > * > * Returns effective policy for a VMA at specified address. > * Falls back to @task or system default policy, as necessary. > - * Current or other task's task mempolicy and non-shared vma policies > - * are protected by the task's mmap_sem, which must be held for read by > - * the caller. > + * Current or other task's task mempolicy and non-shared vma policies must be > + * protected by task_lock(task) by the caller. This is not correct. mmap_sem is needed for protecting vma. task_lock() is needed to close vs exit race only when task != current. In other word, caller must held both mmap_sem and task_lock if task != current. > * Shared policies [those marked as MPOL_F_SHARED] require an extra reference > * count--added by the get_policy() vm_op, as appropriate--to protect against > * freeing by another task. It is the caller's responsibility to free the -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id 621126B002B for ; Tue, 16 Oct 2012 21:49:04 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so7430802pbb.14 for ; Tue, 16 Oct 2012 18:49:03 -0700 (PDT) Date: Tue, 16 Oct 2012 18:49:00 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: KOSAKI Motohiro Cc: Andrew Morton , Linus Torvalds , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 0b78fb9..d04a8a5 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, > > * > > * Returns effective policy for a VMA at specified address. > > * Falls back to @task or system default policy, as necessary. > > - * Current or other task's task mempolicy and non-shared vma policies > > - * are protected by the task's mmap_sem, which must be held for read by > > - * the caller. > > + * Current or other task's task mempolicy and non-shared vma policies must be > > + * protected by task_lock(task) by the caller. > > This is not correct. mmap_sem is needed for protecting vma. task_lock() > is needed to close vs exit race only when task != current. In other word, > caller must held both mmap_sem and task_lock if task != current. > The comment is specifically addressing non-shared vma policies, you do not need to hold mmap_sem to access another thread's mempolicy. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id EAAEA6B002B for ; Tue, 16 Oct 2012 21:53:22 -0400 (EDT) Received: by mail-ob0-f169.google.com with SMTP id va7so8363891obc.14 for ; Tue, 16 Oct 2012 18:53:22 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 21:53:02 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Andrew Morton , Linus Torvalds , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Tue, Oct 16, 2012 at 9:49 PM, David Rientjes wrote: > On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> > index 0b78fb9..d04a8a5 100644 >> > --- a/mm/mempolicy.c >> > +++ b/mm/mempolicy.c >> > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, >> > * >> > * Returns effective policy for a VMA at specified address. >> > * Falls back to @task or system default policy, as necessary. >> > - * Current or other task's task mempolicy and non-shared vma policies >> > - * are protected by the task's mmap_sem, which must be held for read by >> > - * the caller. >> > + * Current or other task's task mempolicy and non-shared vma policies must be >> > + * protected by task_lock(task) by the caller. >> >> This is not correct. mmap_sem is needed for protecting vma. task_lock() >> is needed to close vs exit race only when task != current. 
In other word, >> caller must held both mmap_sem and task_lock if task != current. > > The comment is specifically addressing non-shared vma policies, you do not > need to hold mmap_sem to access another thread's mempolicy. I didn't say the old comment is true. I'm only saying your new comment is also false. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx166.postini.com [74.125.245.166]) by kanga.kvack.org (Postfix) with SMTP id 33B7E6B002B for ; Wed, 17 Oct 2012 00:05:27 -0400 (EDT) Date: Wed, 17 Oct 2012 00:05:15 -0400 From: Dave Jones Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Message-ID: <20121017040515.GA13505@redhat.com> References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Tue, Oct 16, 2012 at 05:31:23PM -0700, David Rientjes wrote: > - pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > + task_lock(task); > + pol = get_vma_policy(task, vma, vma->vm_start); > mpol_to_str(buffer, sizeof(buffer), pol, 0); > mpol_cond_put(pol); > + task_unlock(task); This seems to cause some fallout for me..
BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [] seq_lseek+0x3f/0x120 #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [] __might_sleep+0x14c/0x200 [] mutex_lock_nested+0x2e/0x50 [] mpol_shared_policy_lookup+0x33/0x90 [] shmem_get_policy+0x33/0x40 [] get_vma_policy+0x3a/0x90 [] show_numa_map+0x163/0x610 [] ? pid_maps_open+0x20/0x20 [] ? pagemap_hugetlb_range+0xf0/0xf0 [] show_pid_numa_map+0x13/0x20 [] traverse+0xf2/0x230 [] seq_lseek+0xab/0x120 [] sys_lseek+0x7b/0xb0 [] tracesys+0xe1/0xe6 same problem, different syscall.. BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 21996, name: trinity-child3 3 locks on stack by trinity-child3/21996: #0: held: (&p->lock){+.+.+.}, instance: ffff88008d712c08, at: [] seq_read+0x3d/0x3e0 #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 Pid: 21996, comm: trinity-child3 Not tainted 3.7.0-rc1+ #32 Call Trace: [] __might_sleep+0x14c/0x200 [] mutex_lock_nested+0x2e/0x50 [] mpol_shared_policy_lookup+0x33/0x90 [] shmem_get_policy+0x33/0x40 [] get_vma_policy+0x3a/0x90 [] show_numa_map+0x163/0x610 [] ? pid_maps_open+0x20/0x20 [] ? pagemap_hugetlb_range+0xf0/0xf0 [] show_pid_numa_map+0x13/0x20 [] traverse+0xf2/0x230 [] seq_read+0x34b/0x3e0 [] ? seq_lseek+0x120/0x120 [] do_loop_readv_writev+0x5a/0x90 [] do_readv_writev+0x1c1/0x1e0 [] ? 
get_parent_ip+0x11/0x50 [] vfs_readv+0x35/0x60 [] sys_preadv+0xc2/0xe0 [] tracesys+0xe1/0xe6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx148.postini.com [74.125.245.148]) by kanga.kvack.org (Postfix) with SMTP id 65A316B005D for ; Wed, 17 Oct 2012 01:24:36 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so7626235pbb.14 for ; Tue, 16 Oct 2012 22:24:35 -0700 (PDT) Date: Tue, 16 Oct 2012 22:24:32 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: <20121017040515.GA13505@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 17 Oct 2012, Dave Jones wrote: > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 > 3 locks on stack by trinity-child2/8558: > #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [] seq_lseek+0x3f/0x120 > #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 > #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 > Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 > Call Trace: > [] __might_sleep+0x14c/0x200 > [] mutex_lock_nested+0x2e/0x50 > [] mpol_shared_policy_lookup+0x33/0x90 > [] shmem_get_policy+0x33/0x40 > [] get_vma_policy+0x3a/0x90 > [] 
show_numa_map+0x163/0x610 > [] ? pid_maps_open+0x20/0x20 > [] ? pagemap_hugetlb_range+0xf0/0xf0 > [] show_pid_numa_map+0x13/0x20 > [] traverse+0xf2/0x230 > [] seq_lseek+0xab/0x120 > [] sys_lseek+0x7b/0xb0 > [] tracesys+0xe1/0xe6 > Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This works if get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? --- fs/proc/task_mmu.c | 4 +--- include/linux/mm.h | 3 +-- mm/hugetlb.c | 4 ++-- mm/mempolicy.c | 41 ++++++++++++++++++++++------------------- 4 files changed, 26 insertions(+), 26 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - task_unlock(task); + __mpol_put(pol); seq_printf(m, "%08lx %s", vma->vm_start, buffer); diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -216,8 +216,7 @@ struct vm_operations_struct { * get_policy() op must add reference [mpol_get()] to any policy at * (vma,addr) marked as MPOL_SHARED. The shared policy infrastructure * in mm/mempolicy.c will do this automatically. - * get_policy() must NOT add a ref if the policy at (vma,addr) is not - * marked as MPOL_SHARED. vma policies are protected by the mmap_sem. + * vma policies are protected by the mmap_sem. * If no [shared/vma] mempolicy exists at the addr, get_policy() op * must return NULL--i.e., do not "fallback" to task or system default * policy.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -568,13 +568,13 @@ retry_cpuset: } } - mpol_cond_put(mpol); + __mpol_put(mpol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; err: - mpol_cond_put(mpol); + __mpol_put(mpol); return NULL; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1536,39 +1536,41 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies must be - * protected by task_lock(task) by the caller. - * Shared policies [those marked as MPOL_F_SHARED] require an extra reference - * count--added by the get_policy() vm_op, as appropriate--to protect against - * freeing by another task. It is the caller's responsibility to free the - * extra reference for shared policies. + * Increments the reference count of the returned mempolicy, it is the caller's + * responsibility to decrement with __mpol_put(). + * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes + * task_lock(task) for task policy fallback. */ struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { - struct mempolicy *pol = task->mempolicy; + struct mempolicy *pol; + + task_lock(task); + pol = task->mempolicy; + mpol_get(pol); + task_unlock(task); if (vma) { if (vma->vm_ops && vma->vm_ops->get_policy) { struct mempolicy *vpol = vma->vm_ops->get_policy(vma, addr); - if (vpol) + if (vpol) { + mpol_put(pol); pol = vpol; + if (!mpol_needs_cond_ref(pol)) + mpol_get(pol); + } } else if (vma->vm_policy) { + mpol_put(pol); pol = vma->vm_policy; - - /* - * shmem_alloc_page() passes MPOL_F_SHARED policy with - * a pseudo vma whose vma->vm_ops=NULL. 
Take a reference - * count on these policies which will be dropped by - * mpol_cond_put() later - */ - if (mpol_needs_cond_ref(pol)) - mpol_get(pol); + mpol_get(pol); } } - if (!pol) + if (!pol) { pol = &default_policy; + mpol_get(pol); + } return pol; } @@ -1919,7 +1921,7 @@ retry_cpuset: unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); - mpol_cond_put(pol); + __mpol_put(pol); page = alloc_page_interleave(gfp, order, nid); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; @@ -1943,6 +1945,7 @@ retry_cpuset: */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); + __mpol_put(pol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx135.postini.com [74.125.245.135]) by kanga.kvack.org (Postfix) with SMTP id E45676B005A for ; Wed, 17 Oct 2012 01:43:30 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp (unknown [10.0.50.74]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id 747F03EE0C8 for ; Wed, 17 Oct 2012 14:43:28 +0900 (JST) Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id EBFBA45DE50 for ; Wed, 17 Oct 2012 14:43:27 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id EBC3645DE5F for ; Wed, 17 Oct 2012 14:43:25 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 70BE41DB8040 for ; Wed, 17 Oct 2012 14:43:25 +0900 (JST) Received: from m1001.s.css.fujitsu.com (m1001.s.css.fujitsu.com [10.240.81.139]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id E0C4D1DB803E for ; Wed, 17 Oct 2012 14:43:24 
+0900 (JST) Message-ID: <507E4531.1070700@jp.fujitsu.com> Date: Wed, 17 Oct 2012 14:42:09 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org (2012/10/17 14:24), David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > >> BUG: sleeping function called from invalid context at kernel/mutex.c:269 >> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 >> 3 locks on stack by trinity-child2/8558: >> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [] seq_lseek+0x3f/0x120 >> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 >> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 >> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 >> Call Trace: >> [] __might_sleep+0x14c/0x200 >> [] mutex_lock_nested+0x2e/0x50 >> [] mpol_shared_policy_lookup+0x33/0x90 >> [] shmem_get_policy+0x33/0x40 >> [] get_vma_policy+0x3a/0x90 >> [] show_numa_map+0x163/0x610 >> [] ? pid_maps_open+0x20/0x20 >> [] ? pagemap_hugetlb_range+0xf0/0xf0 >> [] show_pid_numa_map+0x13/0x20 >> [] traverse+0xf2/0x230 >> [] seq_lseek+0xab/0x120 >> [] sys_lseek+0x7b/0xb0 >> [] tracesys+0xe1/0xe6 >> > > Hmm, looks like we need to change the refcount semantics entirely. We'll > need to make get_vma_policy() always take a reference and then drop it > accordingly. 
This works if get_vma_policy() can grab a reference while > holding task_lock() for the task policy fallback case. > > Comments on this approach? I think this refcounting is better than using task_lock(). Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx119.postini.com [74.125.245.119]) by kanga.kvack.org (Postfix) with SMTP id CBCF16B002B for ; Wed, 17 Oct 2012 04:49:23 -0400 (EDT) Received: by mail-oa0-f41.google.com with SMTP id k14so8825447oag.14 for ; Wed, 17 Oct 2012 01:49:23 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <507E4531.1070700@jp.fujitsu.com> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <507E4531.1070700@jp.fujitsu.com> From: KOSAKI Motohiro Date: Wed, 17 Oct 2012 04:49:02 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: David Rientjes , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, Oct 17, 2012 at 1:42 AM, Kamezawa Hiroyuki wrote: > (2012/10/17 14:24), David Rientjes wrote: >> >> On Wed, 17 Oct 2012, Dave Jones wrote: >> >>> BUG: sleeping function called from invalid context at kernel/mutex.c:269 >>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 >>> 3 locks on stack by trinity-child2/8558: >>> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: >>> [] seq_lseek+0x3f/0x120 >>> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: >>> [] m_start+0xa7/0x190 >>> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: >>>
ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 >>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 >>> Call Trace: >>> [] __might_sleep+0x14c/0x200 >>> [] mutex_lock_nested+0x2e/0x50 >>> [] mpol_shared_policy_lookup+0x33/0x90 >>> [] shmem_get_policy+0x33/0x40 >>> [] get_vma_policy+0x3a/0x90 >>> [] show_numa_map+0x163/0x610 >>> [] ? pid_maps_open+0x20/0x20 >>> [] ? pagemap_hugetlb_range+0xf0/0xf0 >>> [] show_pid_numa_map+0x13/0x20 >>> [] traverse+0xf2/0x230 >>> [] seq_lseek+0xab/0x120 >>> [] sys_lseek+0x7b/0xb0 >>> [] tracesys+0xe1/0xe6 >>> >> >> Hmm, looks like we need to change the refcount semantics entirely. We'll >> need to make get_vma_policy() always take a reference and then drop it >> accordingly. This works if get_vma_policy() can grab a reference while >> holding task_lock() for the task policy fallback case. >> >> Comments on this approach? > > > I think this refcounting is better than using task_lock(). I don't think so. get_vma_policy() is used in the fast path. In other words, allocation performance is sensitive to the number of atomic ops. Instead, I'd like to use a spinlock for the shared mempolicy instead of a mutex. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
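The "always take a reference" approach under discussion can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: struct policy, struct owner, policy_get(), policy_put(), owner_init(), owner_exit() and freed_count are hypothetical names, a pthread mutex stands in for task_lock(task), and C11 atomics stand in for the kernel's atomic_t refcount. The reader takes its reference while the lock is held, drops the lock, and can then use (or sleep on) the object safely even if the owner exits in the meantime. Each get/put pair costs two atomic operations, which is the fast-path overhead raised in the objection above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted policy; not struct mempolicy. */
struct policy {
	atomic_int refcnt;
	int mode;
};

/* Hypothetical stand-in for a task; lock plays the role of task_lock(). */
struct owner {
	pthread_mutex_t lock;
	struct policy *pol;	/* cleared by the exit path under lock */
};

/* Counts frees so the behaviour of the sketch can be checked. */
static atomic_int freed_count;

/* Drop one reference; free the object when the last one goes away. */
static void policy_put(struct policy *p)
{
	int old = atomic_fetch_sub(&p->refcnt, 1);

	assert(old > 0);
	if (old == 1) {
		free(p);
		atomic_fetch_add(&freed_count, 1);
	}
}

/*
 * Take a reference while holding the owner's lock, then drop the lock.
 * The caller may block afterwards; the object cannot be freed until the
 * caller calls policy_put().
 */
static struct policy *policy_get(struct owner *o)
{
	struct policy *p;

	pthread_mutex_lock(&o->lock);
	p = o->pol;
	if (p)
		atomic_fetch_add(&p->refcnt, 1);
	pthread_mutex_unlock(&o->lock);
	return p;
}

/* Create an owner holding the initial reference to a new policy. */
static int owner_init(struct owner *o, int mode)
{
	struct policy *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	atomic_init(&p->refcnt, 1);
	p->mode = mode;
	pthread_mutex_init(&o->lock, NULL);
	o->pol = p;
	return 0;
}

/* Exit path: clear the pointer under the lock, then drop the owner's ref. */
static void owner_exit(struct owner *o)
{
	struct policy *p;

	pthread_mutex_lock(&o->lock);
	p = o->pol;
	o->pol = NULL;
	pthread_mutex_unlock(&o->lock);
	if (p)
		policy_put(p);
}
```

In this sketch a reader that called policy_get() before owner_exit() still sees a valid object afterwards, and the free happens only on the reader's policy_put().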
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id 00A7F6B002B for ; Wed, 17 Oct 2012 14:14:25 -0400 (EDT) Date: Wed, 17 Oct 2012 14:14:13 -0400 From: Dave Jones Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Message-ID: <20121017181413.GA16805@redhat.com> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > > Hmm, looks like we need to change the refcount semantics entirely. We'll > need to make get_vma_policy() always take a reference and then drop it > accordingly. This works if get_vma_policy() can grab a reference while > holding task_lock() for the task policy fallback case. > > Comments on this approach? Seems to be surviving my testing at least.. Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx120.postini.com [74.125.245.120]) by kanga.kvack.org (Postfix) with SMTP id 615FF6B005A for ; Wed, 17 Oct 2012 15:21:23 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so8485578pad.14 for ; Wed, 17 Oct 2012 12:21:22 -0700 (PDT) Date: Wed, 17 Oct 2012 12:21:10 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: <20121017181413.GA16805@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 17 Oct 2012, Dave Jones wrote: > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > > On Wed, 17 Oct 2012, Dave Jones wrote: > > > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > > > > Hmm, looks like we need to change the refcount semantics entirely. We'll > > need to make get_vma_policy() always take a reference and then drop it > > accordingly. This works if get_vma_policy() can grab a reference while > > holding task_lock() for the task policy fallback case. > > > > Comments on this approach? > > Seems to be surviving my testing at least.. > Sounds good. Is it possible to verify that policy_cache isn't getting larger than normal in /proc/slabinfo, i.e. when all processes with a task mempolicy or shared vma policy have exited, are there still a significant number of active objects? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx112.postini.com [74.125.245.112]) by kanga.kvack.org (Postfix) with SMTP id 8AC426B005A for ; Wed, 17 Oct 2012 15:32:38 -0400 (EDT) Date: Wed, 17 Oct 2012 15:32:29 -0400 From: Dave Jones Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Message-ID: <20121017193229.GC16805@redhat.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, Oct 17, 2012 at 12:21:10PM -0700, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > > > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > > > On Wed, 17 Oct 2012, Dave Jones wrote: > > > > > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > > > > > > Hmm, looks like we need to change the refcount semantics entirely. We'll > > > need to make get_vma_policy() always take a reference and then drop it > > > accordingly. This works if get_vma_policy() can grab a reference while > > > holding task_lock() for the task policy fallback case. > > > > > > Comments on this approach? > > > > Seems to be surviving my testing at least.. > > > > Sounds good. Is it possible to verify that policy_cache isn't getting > larger than normal in /proc/slabinfo, i.e. when all processes with a > task mempolicy or shared vma policy have exited, are there still a > significant number of active objects? Killing the fuzzer caused it to drop dramatically.
Before: (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 2931 2967 376 43 4 : tunables 0 0 0 : slabdata 69 69 0 numa_policy 2971 6545 464 35 4 : tunables 0 0 0 : slabdata 187 187 0 After: (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 0 215 376 43 4 : tunables 0 0 0 : slabdata 5 5 0 numa_policy 15 175 464 35 4 : tunables 0 0 0 : slabdata 5 5 0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx137.postini.com [74.125.245.137]) by kanga.kvack.org (Postfix) with SMTP id 381166B002B for ; Wed, 17 Oct 2012 15:38:58 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so8503097pad.14 for ; Wed, 17 Oct 2012 12:38:57 -0700 (PDT) Date: Wed, 17 Oct 2012 12:38:55 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: <20121017193229.GC16805@redhat.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 17 Oct 2012, Dave Jones wrote: > > Sounds good. Is it possible to verify that policy_cache isn't getting > > larger than normal in /proc/slabinfo, i.e. when all processes with a > > task mempolicy or shared vma policy have exited, are there still a > > significant number of active objects? > > Killing the fuzzer caused it to drop dramatically. 
> > Before: > (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy > shared_policy_node 2931 2967 376 43 4 : tunables 0 0 0 : slabdata 69 69 0 > numa_policy 2971 6545 464 35 4 : tunables 0 0 0 : slabdata 187 187 0 > > After: > (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy > shared_policy_node 0 215 376 43 4 : tunables 0 0 0 : slabdata 5 5 0 > numa_policy 15 175 464 35 4 : tunables 0 0 0 : slabdata 5 5 0 > Excellent, thanks. This shows that the refcounting is working properly and we're not leaking any references as a result of this change causing the mempolicies to never be freed. ("numa_policy" turns out to be policy_cache in the code, so thanks for checking both of them.) Could I add your tested-by? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx164.postini.com [74.125.245.164]) by kanga.kvack.org (Postfix) with SMTP id 1800B6B002B for ; Wed, 17 Oct 2012 15:45:12 -0400 (EDT) Date: Wed, 17 Oct 2012 15:45:01 -0400 From: Dave Jones Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Message-ID: <20121017194501.GA24400@redhat.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, Oct 17, 2012 at 12:38:55PM -0700, David Rientjes wrote: > > > Sounds good. 
Is it possible to verify that policy_cache isn't getting > > > larger than normal in /proc/slabinfo, i.e. when all processes with a > > > task mempolicy or shared vma policy have exited, are there still a > > > significant number of active objects? > > > > Killing the fuzzer caused it to drop dramatically. > > > Excellent, thanks. This shows that the refcounting is working properly > and we're not leaking any references as a result of this change causing > the mempolicies to never be freed. ("numa_policy" turns out to be > policy_cache in the code, so thanks for checking both of them.) > > Could I add your tested-by? Sure. Here's a fresh one I just baked. Tested-by: Dave Jones Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx152.postini.com [74.125.245.152]) by kanga.kvack.org (Postfix) with SMTP id D1FE56B002B for ; Wed, 17 Oct 2012 15:50:23 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so8557105pbb.14 for ; Wed, 17 Oct 2012 12:50:23 -0700 (PDT) Date: Wed, 17 Oct 2012 12:50:21 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <507E4531.1070700@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: KOSAKI Motohiro Cc: Kamezawa Hiroyuki , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: > > I think this refcounting is better than using task_lock(). > > I don't think so. get_vma_policy() is used from fast path. 
In other > words, number of > atomic ops is sensible for allocation performance. There are enhancements that we can make with refcounting: for instance, we may want to avoid doing it in the super-fast path when the policy is default_policy and then just do if (mpol != &default_policy) mpol_put(mpol); > Instead, I'd like > to use spinlock > for shared mempolicy instead of mutex. > Um, this was just changed to a mutex last week in commit b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() can be done with GFP_KERNEL, so I didn't consider reverting that behavior. Are you nacking that patch, which you acked, now? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx176.postini.com [74.125.245.176]) by kanga.kvack.org (Postfix) with SMTP id 3F60E6B002B for ; Wed, 17 Oct 2012 16:28:51 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so8594179pbb.14 for ; Wed, 17 Oct 2012 13:28:50 -0700 (PDT) Date: Wed, 17 Oct 2012 13:28:47 -0700 (PDT) From: David Rientjes Subject: [patch for-3.7] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps In-Reply-To: <20121017194501.GA24400@redhat.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds , Andrew Morton Cc: Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack contents in numa_maps"), the mutex protecting a 
shared policy can be inadvertently taken while holding task_lock(task). Recently, commit b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()") switched the spinlock within a shared policy to a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on all mempolicies returned by get_vma_policy() so it isn't freed while being passed to mpol_to_str() when reading /proc/pid/numa_maps. This patch only takes task_lock() while dereferencing task->mempolicy in get_vma_policy() to increment its refcount. This ensures it will remain in memory until dropped by __mpol_put() after mpol_to_str() is called. Refcounts of shared policies are grabbed by the ->get_policy() function of the vma, all others will be grabbed directly in get_vma_policy(). Now that this is done, all callers now unconditionally drop the refcount. Tested-by: Dave Jones Signed-off-by: David Rientjes --- fs/proc/task_mmu.c | 4 +-- include/linux/mempolicy.h | 12 +------ mm/hugetlb.c | 4 +-- mm/mempolicy.c | 79 +++++++++++++++++++-------------------------- 4 files changed, 38 insertions(+), 61 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - task_unlock(task); + __mpol_put(pol); seq_printf(m, "%08lx %s", vma->vm_start, buffer); diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -73,13 +73,7 @@ static inline void mpol_put(struct mempolicy *pol) */ static inline int mpol_needs_cond_ref(struct mempolicy *pol) { - return (pol && (pol->flags & MPOL_F_SHARED)); -} - -static inline void mpol_cond_put(struct mempolicy *pol) -{ - if (mpol_needs_cond_ref(pol)) - __mpol_put(pol); + return pol->flags & 
MPOL_F_SHARED;
 }
 
 extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol,
@@ -211,10 +205,6 @@ static inline void mpol_put(struct mempolicy *p)
 {
 }
 
-static inline void mpol_cond_put(struct mempolicy *pol)
-{
-}
-
 static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to,
 						struct mempolicy *from)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
 		}
 	}
 
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
 	return page;
 
 err:
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	return NULL;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -906,7 +906,8 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 	}
 
  out:
-	mpol_cond_put(pol);
+	if (mpol_needs_cond_ref(pol))
+		__mpol_put(pol);
 	if (vma)
 		up_read(&current->mm->mmap_sem);
 	return err;
@@ -1527,48 +1528,52 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 }
 #endif
 
-
-/*
- * get_vma_policy(@task, @vma, @addr)
- * @task - task for fallback if vma policy == default
- * @vma - virtual memory area whose policy is sought
- * @addr - address in @vma for shared policy lookup
+/**
+ * get_vma_policy() - return effective policy for a vma at specified address
+ * @task: task for fallback if vma policy == default_policy
+ * @vma: virtual memory area whose policy is sought
+ * @addr: address in @vma for shared policy lookup
  *
- * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task.
It is the caller's responsibility to free the - * extra reference for shared policies. + * Increments the reference count of the returned mempolicy, it is the caller's + * responsibility to decrement with __mpol_put(). + * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes + * task_lock(task) for task policy fallback. */ struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { - struct mempolicy *pol = task->mempolicy; + struct mempolicy *pol; + + /* + * Grab a reference before task has the potential to exit and free its + * mempolicy. + */ + task_lock(task); + pol = task->mempolicy; + mpol_get(pol); + task_unlock(task); if (vma) { if (vma->vm_ops && vma->vm_ops->get_policy) { struct mempolicy *vpol = vma->vm_ops->get_policy(vma, addr); - if (vpol) + if (vpol) { + mpol_put(pol); pol = vpol; + if (!mpol_needs_cond_ref(pol)) + mpol_get(pol); + } } else if (vma->vm_policy) { + mpol_put(pol); pol = vma->vm_policy; - - /* - * shmem_alloc_page() passes MPOL_F_SHARED policy with - * a pseudo vma whose vma->vm_ops=NULL. 
Take a reference - * count on these policies which will be dropped by - * mpol_cond_put() later - */ - if (mpol_needs_cond_ref(pol)) - mpol_get(pol); + mpol_get(pol); } } - if (!pol) + if (!pol) { pol = &default_policy; + mpol_get(pol); + } return pol; } @@ -1919,30 +1924,14 @@ retry_cpuset: unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); - mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) - goto retry_cpuset; - - return page; + goto out; } zl = policy_zonelist(gfp, pol, node); - if (unlikely(mpol_needs_cond_ref(pol))) { - /* - * slow path: ref counted shared policy - */ - struct page *page = __alloc_pages_nodemask(gfp, order, - zl, policy_nodemask(gfp, pol)); - __mpol_put(pol); - if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) - goto retry_cpuset; - return page; - } - /* - * fast path: default or task policy - */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); +out: + __mpol_put(pol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
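The refcounting rule described in the patch above (get_vma_policy() always hands back a policy with one reference held, and every caller drops it unconditionally) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct policy`, `policy_get()`, `policy_put()` and `lookup_policy()` are hypothetical stand-ins rather than kernel interfaces, the lock is a plain flag instead of a spinlock, and the counter is a plain int where the kernel would use an atomic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct mempolicy, task_struct and the vma. */
struct policy { int refcnt; };
struct task   { int locked; struct policy *policy; };
struct vma    { struct policy *policy; };

/* Static fallback object; it is never freed, it only carries a counter. */
static struct policy default_policy = { 1 };

static void policy_get(struct policy *p) { if (p) p->refcnt++; }
static void policy_put(struct policy *p) { if (p) p->refcnt--; }

/* Flag-based model of task_lock(); asserts catch unbalanced use. */
static void task_lock(struct task *t)   { assert(!t->locked); t->locked = 1; }
static void task_unlock(struct task *t) { assert(t->locked);  t->locked = 0; }

/* Always returns a policy that holds one extra reference for the caller. */
static struct policy *lookup_policy(struct task *t, struct vma *v)
{
	struct policy *pol;

	/* Pin the task policy while the lock keeps it from being freed. */
	task_lock(t);
	pol = t->policy;
	policy_get(pol);
	task_unlock(t);

	if (v && v->policy) {
		policy_put(pol);	/* vma policy wins: drop the task reference */
		pol = v->policy;
		policy_get(pol);
	}
	if (!pol) {
		pol = &default_policy;
		policy_get(pol);
	}
	return pol;
}
```

A caller pairs every lookup_policy() with exactly one policy_put(), whichever of the three sources the policy came from. That symmetry is the simplification the patch description refers to.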
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx197.postini.com [74.125.245.197]) by kanga.kvack.org (Postfix) with SMTP id 1F5356B002B for ; Wed, 17 Oct 2012 17:05:58 -0400 (EDT) Received: by mail-ob0-f169.google.com with SMTP id va7so9563669obc.14 for ; Wed, 17 Oct 2012 14:05:57 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <507E4531.1070700@jp.fujitsu.com> From: KOSAKI Motohiro Date: Wed, 17 Oct 2012 17:05:37 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Kamezawa Hiroyuki , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, Oct 17, 2012 at 3:50 PM, David Rientjes wrote: > On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: > >> > I think this refcounting is better than using task_lock(). >> >> I don't think so. get_vma_policy() is used from fast path. In other >> words, number of >> atomic ops is sensible for allocation performance. > > There are enhancements that we can make with refcounting: for instance, we > may want to avoid doing it in the super-fast path when the policy is > default_policy and then just do > > if (mpol != &default_policy) > mpol_put(mpol); > >> Instead, I'd like >> to use spinlock >> for shared mempolicy instead of mutex. >> > > Um, this was just changed to a mutex last week in commit b22d127a39dd > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() > can be done with GFP_KERNEL, so I didn't consider reverting that behavior. > Are you nacking that patch, which you acked, now? Yes, sadly. /proc usage is a corner case issue. It's not worth to strike main path. 
See commit 52cd3b0740 and the surrounding patches. They explain why we
avoided your approach.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx130.postini.com [74.125.245.130])
	by kanga.kvack.org (Postfix) with SMTP id 826066B0068
	for ; Wed, 17 Oct 2012 17:27:38 -0400 (EDT)
Received: by mail-pb0-f41.google.com with SMTP id rq2so8649273pbb.14
	for ; Wed, 17 Oct 2012 14:27:37 -0700 (PDT)
Date: Wed, 17 Oct 2012 14:27:35 -0700 (PDT)
From: David Rientjes
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
In-Reply-To:
Message-ID:
References: <20121008150949.GA15130@redhat.com>
	<20121017040515.GA13505@redhat.com>
	<507E4531.1070700@jp.fujitsu.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: KOSAKI Motohiro
Cc: Kamezawa Hiroyuki , Dave Jones , Andrew Morton , Linus Torvalds ,
	bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi ,
	Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

> > Um, this was just changed to a mutex last week in commit b22d127a39dd
> > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc()
> > can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
> > Are you nacking that patch, which you acked, now?
>
> Yes, sadly. /proc usage is a corner case issue. It's not worth to
> strike main path.

It also simplifies the fastpath since we can now unconditionally drop the
reference.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
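The enhancement floated earlier in this exchange, skipping the refcount for default_policy so the common case costs no atomic operation, could look roughly like the following. This is a hedged userspace sketch, not code from any posted patch: `policy_ref()` and `policy_unref()` are hypothetical names, and the plain int counter stands in for the kernel's atomic refcount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mempolicy. */
struct policy { int refcnt; };

/* Static object: it is never freed, so it does not need to be pinned. */
static struct policy default_policy = { 1 };

/* Returns the effective policy; only non-default policies are counted. */
static struct policy *policy_ref(struct policy *p)
{
	if (!p)
		return &default_policy;	/* fast path: no counter update */
	p->refcnt++;
	return p;
}

/* Mirror image of policy_ref(): the default policy is skipped here too. */
static void policy_unref(struct policy *p)
{
	if (p != &default_policy)
		p->refcnt--;
}
```

The trade-off is the one argued over in the thread: the put side regains a comparison, but the default case no longer touches a shared counter on every allocation.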
Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179])
	by kanga.kvack.org (Postfix) with SMTP id EDE576B0068
	for ; Wed, 17 Oct 2012 17:31:11 -0400 (EDT)
Received: by mail-pb0-f41.google.com with SMTP id rq2so8652332pbb.14
	for ; Wed, 17 Oct 2012 14:31:11 -0700 (PDT)
Date: Wed, 17 Oct 2012 14:31:09 -0700 (PDT)
From: David Rientjes
Subject: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
In-Reply-To:
Message-ID:
References: <20121017040515.GA13505@redhat.com>
	<20121017181413.GA16805@redhat.com>
	<20121017193229.GC16805@redhat.com>
	<20121017194501.GA24400@redhat.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linus Torvalds , Andrew Morton
Cc: Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com,
	Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins ,
	KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org

As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
contents in numa_maps"), the mutex protecting a shared policy can be
inadvertently taken while holding task_lock(task).

Recently, commit b22d127a39dd ("mempolicy: fix a race in
shared_policy_replace()") switched the spinlock within a shared policy to
a mutex so sp_alloc() could block.  Thus, a refcount must be grabbed on
all mempolicies returned by get_vma_policy() so it isn't freed while being
passed to mpol_to_str() when reading /proc/pid/numa_maps.

This patch only takes task_lock() while dereferencing task->mempolicy in
get_vma_policy() if it's non-NULL in the lockless check to increment its
refcount.  This ensures it will remain in memory until dropped by
__mpol_put() after mpol_to_str() is called.

Refcounts of shared policies are grabbed by the ->get_policy() function of
the vma, all others will be grabbed directly in get_vma_policy().
Now that this is done, all callers now unconditionally drop the refcount. Tested-by: Dave Jones Signed-off-by: David Rientjes --- v2: optimized task_lock() in get_vma_policy(): test for a non-NULL task->mempolicy before taking task_lock() and grabbing the reference so we don't take the lock unnecessarily. fs/proc/task_mmu.c | 4 +-- include/linux/mempolicy.h | 12 +------ mm/hugetlb.c | 4 +-- mm/mempolicy.c | 79 ++++++++++++++++++++------------------------- 4 files changed, 39 insertions(+), 60 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 14df880..5709e70 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - task_unlock(task); + __mpol_put(pol); seq_printf(m, "%08lx %s", vma->vm_start, buffer); diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index e5ccb9d..f76f7e0 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -73,13 +73,7 @@ static inline void mpol_put(struct mempolicy *pol) */ static inline int mpol_needs_cond_ref(struct mempolicy *pol) { - return (pol && (pol->flags & MPOL_F_SHARED)); -} - -static inline void mpol_cond_put(struct mempolicy *pol) -{ - if (mpol_needs_cond_ref(pol)) - __mpol_put(pol); + return pol->flags & MPOL_F_SHARED; } extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol, @@ -211,10 +205,6 @@ static inline void mpol_put(struct mempolicy *p) { } -static inline void mpol_cond_put(struct mempolicy *pol) -{ -} - static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to, struct mempolicy *from) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 59a0059..5080808 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -568,13 +568,13 @@ retry_cpuset: } } - mpol_cond_put(mpol); + __mpol_put(mpol); if 
(unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
 	return page;
 
 err:
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	return NULL;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..a0bb463 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -906,7 +906,8 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 	}
 
  out:
-	mpol_cond_put(pol);
+	if (mpol_needs_cond_ref(pol))
+		__mpol_put(pol);
 	if (vma)
 		up_read(&current->mm->mmap_sem);
 	return err;
@@ -1527,48 +1528,54 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 }
 #endif
 
-
-/*
- * get_vma_policy(@task, @vma, @addr)
- * @task - task for fallback if vma policy == default
- * @vma - virtual memory area whose policy is sought
- * @addr - address in @vma for shared policy lookup
+/**
+ * get_vma_policy() - return effective policy for a vma at specified address
+ * @task: task for fallback if vma policy == default_policy
+ * @vma: virtual memory area whose policy is sought
+ * @addr: address in @vma for shared policy lookup
  *
- * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task.  It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
*/ struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { struct mempolicy *pol = task->mempolicy; + /* + * Grab a reference before task has the potential to exit and free its + * mempolicy. + */ + if (pol) { + task_lock(task); + pol = task->mempolicy; + mpol_get(pol); + task_unlock(task); + } + if (vma) { if (vma->vm_ops && vma->vm_ops->get_policy) { struct mempolicy *vpol = vma->vm_ops->get_policy(vma, addr); - if (vpol) + if (vpol) { + mpol_put(pol); pol = vpol; + if (!mpol_needs_cond_ref(pol)) + mpol_get(pol); + } } else if (vma->vm_policy) { + mpol_put(pol); pol = vma->vm_policy; - - /* - * shmem_alloc_page() passes MPOL_F_SHARED policy with - * a pseudo vma whose vma->vm_ops=NULL. Take a reference - * count on these policies which will be dropped by - * mpol_cond_put() later - */ - if (mpol_needs_cond_ref(pol)) - mpol_get(pol); + mpol_get(pol); } } - if (!pol) + if (!pol) { pol = &default_policy; + mpol_get(pol); + } return pol; } @@ -1919,30 +1926,14 @@ retry_cpuset: unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); - mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) - goto retry_cpuset; - - return page; + goto out; } zl = policy_zonelist(gfp, pol, node); - if (unlikely(mpol_needs_cond_ref(pol))) { - /* - * slow path: ref counted shared policy - */ - struct page *page = __alloc_pages_nodemask(gfp, order, - zl, policy_nodemask(gfp, pol)); - __mpol_put(pol); - if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) - goto retry_cpuset; - return page; - } - /* - * fast path: default or task policy - */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); +out: + __mpol_put(pol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161]) by kanga.kvack.org (Postfix) with SMTP id 713EF6B002B for ; Thu, 18 Oct 2012 00:07:13 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp (unknown [10.0.50.71]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 2121F3EE0C5 for ; Thu, 18 Oct 2012 13:07:11 +0900 (JST) Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 077C845DE5E for ; Thu, 18 Oct 2012 13:07:11 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id E324845DE5A for ; Thu, 18 Oct 2012 13:07:10 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id CF2071DB8058 for ; Thu, 18 Oct 2012 13:07:10 +0900 (JST) Received: from m1001.s.css.fujitsu.com (m1001.s.css.fujitsu.com [10.240.81.139]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 62BB81DB8054 for ; Thu, 18 Oct 2012 13:07:10 +0900 (JST) Message-ID: <507F803A.8000900@jp.fujitsu.com> Date: Thu, 18 Oct 2012 13:06:18 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org (2012/10/18 6:31), David Rientjes wrote: > As a result of commit 32f8516a8c73 ("mm, 
mempolicy: fix printing stack
> contents in numa_maps"), the mutex protecting a shared policy can be
> inadvertently taken while holding task_lock(task).
>
> Recently, commit b22d127a39dd ("mempolicy: fix a race in
> shared_policy_replace()") switched the spinlock within a shared policy to
> a mutex so sp_alloc() could block.  Thus, a refcount must be grabbed on
> all mempolicies returned by get_vma_policy() so it isn't freed while being
> passed to mpol_to_str() when reading /proc/pid/numa_maps.
>
> This patch only takes task_lock() while dereferencing task->mempolicy in
> get_vma_policy() if it's non-NULL in the lockless check to increment its
> refcount.  This ensures it will remain in memory until dropped by
> __mpol_put() after mpol_to_str() is called.
>
> Refcounts of shared policies are grabbed by the ->get_policy() function of
> the vma, all others will be grabbed directly in get_vma_policy().  Now
> that this is done, all callers now unconditionally drop the refcount.
>

Please add the original problem description from your first patch:

> When reading /proc/pid/numa_maps, it's possible to return the contents of
> the stack where the mempolicy string should be printed if the policy gets
> freed from beneath us.
>
> This happens because mpol_to_str() may return an error and the
> stack-allocated buffer is then printed without ever being stored.
.....

Hmm, I've read the whole thread again, and I'm sorry if I've misunderstood
something.

I think Kosaki was referring to commit 52cd3b0740. It avoids refcounting in
get_vma_policy() because that function is called every time
alloc_pages_vma() is called, i.e. at every page fault.
So it seems he doesn't agree with this fix because of performance concerns
on big NUMA systems.

Can't we fix it another way, like this? Is it too ugly?
Again, I'm sorry if I've misunderstood the points.
==
From bfe7e2ab1c1375b134ec12efce6517149318f75d Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Thu, 18 Oct 2012 13:17:25 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

/proc/pid/numa_maps scans vmas and shows the mempolicy under mmap_sem.
It sometimes accesses task->mempolicy, which can be freed without
mmap_sem, so numa_maps can show garbage while scanning.

This patch takes a reference count on task->mempolicy when reading
numa_maps, before calling get_vma_policy(). This way, task->mempolicy
will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki
---
 fs/proc/task_mmu.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..d92e868 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -94,6 +94,11 @@ static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
 {
 	if (vma && vma != priv->tail_vma) {
 		struct mm_struct *mm = vma->vm_mm;
+#ifdef CONFIG_NUMA
+		task_lock(priv->task);
+		__mpol_put(priv->task->mempolicy);
+		task_unlock(priv->task);
+#endif
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
@@ -130,6 +135,16 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 		return mm;
 	down_read(&mm->mmap_sem);
 
+	/*
+	 * task->mempolicy can be freed even if mmap_sem is down (see kernel/exit.c)
+	 * We grab refcount for stable access.
+	 * replacement of task->mempolicy is guarded by mmap_sem.
+	 */
+#ifdef CONFIG_NUMA
+	task_lock(priv->task);
+	mpol_get(priv->task->mempolicy);
+	task_unlock(priv->task);
+#endif
 	tail_vma = get_gate_vma(priv->task->mm);
 	priv->tail_vma = tail_vma;
 
@@ -161,6 +176,11 @@ out:
 
 	/* End of vmas has been reached */
 	m->version = (tail_vma != NULL)?
0: -1UL; +#ifdef CONFIG_NUMA + task_lock(priv->task); + __mpol_put(priv->task->mempolicy); + task_unlock(priv->task); +#endif up_read(&mm->mmap_sem); mmput(mm); return tail_vma; -- 1.7.10.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx165.postini.com [74.125.245.165]) by kanga.kvack.org (Postfix) with SMTP id 6E6FD6B002B for ; Thu, 18 Oct 2012 00:15:00 -0400 (EDT) Received: by mail-wi0-f173.google.com with SMTP id hm4so1281223wib.8 for ; Wed, 17 Oct 2012 21:14:58 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <507F803A.8000900@jp.fujitsu.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> From: Linus Torvalds Date: Wed, 17 Oct 2012 21:14:38 -0700 Message-ID: Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: David Rientjes , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, Oct 17, 2012 at 9:06 PM, Kamezawa Hiroyuki wrote: > if (vma && vma != priv->tail_vma) { > struct mm_struct *mm = vma->vm_mm; > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + __mpol_put(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > up_read(&mm->mmap_sem); > mmput(mm); Please don't put #ifdef's inside code. It makes things really ugly and hard to read. 
And that is *especially* true in this case, since there's a pattern to
all these things:

> +#ifdef CONFIG_NUMA
> +       task_lock(priv->task);
> +       mpol_get(priv->task->mempolicy);
> +       task_unlock(priv->task);
> +#endif

> +#ifdef CONFIG_NUMA
> +       task_lock(priv->task);
> +       __mpol_put(priv->task->mempolicy);
> +       task_unlock(priv->task);
> +#endif

it really sounds like what you want to do is to just abstract a
"numa_policy_get/put(priv)" operation.

So you could make it be something like

  #ifdef CONFIG_NUMA
  static inline void numa_policy_get(struct proc_maps_private *priv)
  {
      task_lock(priv->task);
      mpol_get(priv->task->mempolicy);
      task_unlock(priv->task);
  }
  .. same for the "put" function ..
  #else
    #define numa_policy_get(priv) do { } while (0)
    #define numa_policy_put(priv) do { } while (0)
  #endif

and then you wouldn't have to have the #ifdef's in the middle of code,
and I think it will be more readable in general.

Sure, it is going to be a few more actual lines of patch, but there's
no duplicated code sequence, and the added lines are just the syntax
that makes it look better.

                 Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
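[Illustrative sketch, not part of the original message.] The suggestion above
is to hide the configuration switch behind a get/put pair so that callers
carry no #ifdef. The following userspace C sketch shows one way that can look.
Everything in it is an assumption made for illustration: DEMO_NUMA stands in
for CONFIG_NUMA, and demo_policy, demo_priv, demo_policy_get(),
demo_policy_put() and demo_scan() are hypothetical names, not kernel APIs. It
uses empty static inline stubs rather than the do { } while (0) macros from
the message, so the argument is still type-checked when the feature is off.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUMA 1 /* hypothetical switch standing in for CONFIG_NUMA */

/* Hypothetical stand-ins; not the kernel's struct mempolicy or proc_maps_private. */
struct demo_policy {
	int refcnt; /* plain int: this sketch is single-threaded only */
};

struct demo_priv {
	struct demo_policy *policy;
};

#ifdef DEMO_NUMA
/* Feature enabled: take and drop a reference on the policy, if there is one. */
static inline void demo_policy_get(struct demo_priv *priv)
{
	if (priv->policy)
		priv->policy->refcnt++;
}

static inline void demo_policy_put(struct demo_priv *priv)
{
	if (priv->policy) {
		assert(priv->policy->refcnt > 0);
		priv->policy->refcnt--;
	}
}
#else
/* Feature disabled: same signatures, empty bodies. */
static inline void demo_policy_get(struct demo_priv *priv) { (void)priv; }
static inline void demo_policy_put(struct demo_priv *priv) { (void)priv; }
#endif

/*
 * The caller contains no #ifdef. It returns the reference count observed
 * while the reference is held, or 0 if there is no policy.
 */
static int demo_scan(struct demo_priv *priv)
{
	int seen;

	demo_policy_get(priv);
	seen = priv->policy ? priv->policy->refcnt : 0;
	demo_policy_put(priv);
	return seen;
}
```

Removing the DEMO_NUMA define builds the stub variant; demo_scan() compiles
unchanged in both configurations, which is the point of the pattern.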
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id D3B776B005D for ; Thu, 18 Oct 2012 00:34:39 -0400 (EDT) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 6812C3EE0C1 for ; Thu, 18 Oct 2012 13:34:38 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 4E52645DEBA for ; Thu, 18 Oct 2012 13:34:38 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id 332FF45DEB7 for ; Thu, 18 Oct 2012 13:34:38 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 1FFDD1DB803E for ; Thu, 18 Oct 2012 13:34:38 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.240.81.134]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id BFC741DB803B for ; Thu, 18 Oct 2012 13:34:37 +0900 (JST) Message-ID: <507F86BD.7070201@jp.fujitsu.com> Date: Thu, 18 Oct 2012 13:34:05 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> In-Reply-To: <507F803A.8000900@jp.fujitsu.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org (2012/10/18 13:06), Kamezawa Hiroyuki wrote: > (2012/10/18 6:31), David Rientjes 
wrote: >> As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack >> contents in numa_maps"), the mutex protecting a shared policy can be >> inadvertently taken while holding task_lock(task). >> >> Recently, commit b22d127a39dd ("mempolicy: fix a race in >> shared_policy_replace()") switched the spinlock within a shared policy to >> a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on >> all mempolicies returned by get_vma_policy() so it isn't freed while being >> passed to mpol_to_str() when reading /proc/pid/numa_maps. >> >> This patch only takes task_lock() while dereferencing task->mempolicy in >> get_vma_policy() if it's non-NULL in the lockess check to increment its >> refcount. This ensures it will remain in memory until dropped by >> __mpol_put() after mpol_to_str() is called. >> >> Refcounts of shared policies are grabbed by the ->get_policy() function of >> the vma, all others will be grabbed directly in get_vma_policy(). Now >> that this is done, all callers now unconditionally drop the refcount. >> > > please add original problem description.... > > from your 1st patch. >> When reading /proc/pid/numa_maps, it's possible to return the contents of >> the stack where the mempolicy string should be printed if the policy gets >> freed from beneath us. >> >> This happens because mpol_to_str() may return an error the >> stack-allocated buffer is then printed without ever being stored. > ..... > > Hmm, I've read the whole thread again...and, I'm sorry if I misunderstand something. > > I think Kosaki mentioned the commit 52cd3b0740. It avoids refcounting in get_vma_policy() > because it's called every time alloc_pages_vma() is called, at every page fault. > So, it seems he doesn't agree this fix because of performance concern on big NUMA, > > > Can't we have another way to fix ? like this ? too ugly ? > Again, I'm sorry if I misunderstand the points. > Sorry this patch itself may be buggy. please don't test.. 
I missed that kernel/exit.c sets task->mempolicy to NULL. Fixed version here.

--
From 5581c71e68a7f50e52fd67cca00148911023f9f5 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Thu, 18 Oct 2012 13:50:29 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

/proc//numa_maps scans vmas and shows their mempolicy under
mmap_sem. It sometimes accesses task->mempolicy, which can
be freed without mmap_sem, so numa_maps can show some
garbage while scanning.

This patch takes a reference count on task->mempolicy when reading
numa_maps, before calling get_vma_policy(). With this, task->mempolicy
will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki

V1->V2
 - access task->mempolicy only once and remember it, because kernel/exit.c
   can overwrite it.

Signed-off-by: KAMEZAWA Hiroyuki
---
 fs/proc/internal.h |    4 ++++
 fs/proc/task_mmu.c |   33 ++++++++++++++++++++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cceaab0..43973b0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -12,6 +12,7 @@
 #include
 #include
 struct ctl_table_header;
+struct mempolicy;
 
 extern struct proc_dir_entry proc_root;
 #ifdef CONFIG_PROC_SYSCTL
@@ -74,6 +75,9 @@ struct proc_maps_private {
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
 #endif
+#ifdef CONFIG_NUMA
+	struct mempolicy *task_mempolicy;
+#endif
 };
 
 void proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..624927d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,11 +89,41 @@ static void pad_len_spaces(struct seq_file *m, int len)
 		len = 1;
 	seq_printf(m, "%*c", len, ' ');
 }
+#ifdef CONFIG_NUMA
+/*
+ * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
+ * But task->mempolicy is not guarded by mmap_sem; it can be cleared/freed
+ * under task_lock() (see kernel/exit.c). Replacement of it is guarded by
+ * mmap_sem. So, take a reference count under task_lock() before we start
+ * scanning and drop it when numa_maps reaches the end.
+ */
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+	struct task_struct *task = priv->task;
+
+	task_lock(task);
+	priv->task_mempolicy = task->mempolicy;
+	mpol_get(priv->task_mempolicy);
+	task_unlock(task);
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+	mpol_put(priv->task_mempolicy);
+}
+#else
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+#endif
 
 static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
 {
 	if (vma && vma != priv->tail_vma) {
 		struct mm_struct *mm = vma->vm_mm;
+		release_task_mempolicy(priv);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
@@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 
 	tail_vma = get_gate_vma(priv->task->mm);
 	priv->tail_vma = tail_vma;
-
+	hold_task_mempolicy(priv);
 	/* Start with last addr hint */
 	vma = find_vma(mm, last_addr);
 	if (last_addr && vma) {
@@ -159,6 +189,7 @@ out:
 	if (vma)
 		return vma;
 
+	release_task_mempolicy(priv);
 	/* End of vmas has been reached */
 	m->version = (tail_vma != NULL)? 0: -1UL;
 	up_read(&mm->mmap_sem);
-- 
1.7.10.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
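[Illustrative sketch, not part of the original message.] The V1->V2 note says
the pointer is read once and remembered because the exit path can overwrite
task->mempolicy. The userspace C model below tries to show why that matters:
the reader snapshots the pointer under the lock, takes its reference on the
snapshot, and later drops the reference through the snapshot rather than
re-reading the task's field. All names here (demo_lock, demo_policy,
demo_task, demo_priv, demo_hold(), demo_release(), demo_task_exit()) are
hypothetical stand-ins, the lock is a toy flag rather than task_lock(), and
the counter is a plain int, so this is a single-threaded model of the
bookkeeping and not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock standing in for task_lock()/task_unlock(); it only checks pairing. */
struct demo_lock { int held; };

static void demo_lock_acquire(struct demo_lock *l) { assert(!l->held); l->held = 1; }
static void demo_lock_release(struct demo_lock *l) { assert(l->held); l->held = 0; }

/* Hypothetical stand-ins for struct mempolicy and task_struct. */
struct demo_policy { int refcnt; };

struct demo_task {
	struct demo_lock lock;
	struct demo_policy *mempolicy; /* may be cleared by the exit path */
};

struct demo_priv {
	struct demo_task *task;
	struct demo_policy *task_mempolicy; /* snapshot taken at hold time */
};

/* NULL-tolerant get/put helpers for this sketch. */
static void demo_policy_get(struct demo_policy *pol)
{
	if (pol)
		pol->refcnt++;
}

static void demo_policy_put(struct demo_policy *pol)
{
	if (pol) {
		assert(pol->refcnt > 0);
		pol->refcnt--;
	}
}

/* Read the pointer once under the lock, remember it, and pin it. */
static void demo_hold(struct demo_priv *priv)
{
	demo_lock_acquire(&priv->task->lock);
	priv->task_mempolicy = priv->task->mempolicy;
	demo_policy_get(priv->task_mempolicy);
	demo_lock_release(&priv->task->lock);
}

/* Drop the reference through the snapshot, never through task->mempolicy. */
static void demo_release(struct demo_priv *priv)
{
	demo_policy_put(priv->task_mempolicy);
	priv->task_mempolicy = NULL;
}

/* Models the exit path: clear the task's pointer and drop the task's own reference. */
static void demo_task_exit(struct demo_task *task)
{
	struct demo_policy *pol;

	demo_lock_acquire(&task->lock);
	pol = task->mempolicy;
	task->mempolicy = NULL;
	demo_lock_release(&task->lock);
	demo_policy_put(pol);
}
```

If demo_release() re-read priv->task->mempolicy instead of using the snapshot,
a demo_task_exit() between hold and release would leave it looking at NULL,
and the reference taken in demo_hold() would never be dropped.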
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id 009B36B0062 for ; Thu, 18 Oct 2012 00:35:55 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so8954702pad.14 for ; Wed, 17 Oct 2012 21:35:55 -0700 (PDT) Date: Wed, 17 Oct 2012 21:35:53 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps In-Reply-To: <507F803A.8000900@jp.fujitsu.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote: > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 14df880..d92e868 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -94,6 +94,11 @@ static void vma_stop(struct proc_maps_private *priv, struct > vm_area_struct *vma) > { > if (vma && vma != priv->tail_vma) { > struct mm_struct *mm = vma->vm_mm; > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + __mpol_put(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > up_read(&mm->mmap_sem); > mmput(mm); > } > @@ -130,6 +135,16 @@ static void *m_start(struct seq_file *m, loff_t *pos) > return mm; > down_read(&mm->mmap_sem); > + /* > + * task->mempolicy can be freed even if mmap_sem is down (see > kernel/exit.c) > + * We grab refcount for stable access. > + * repleacement of task->mmpolicy is guarded by mmap_sem. 
> + */ > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + mpol_get(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > tail_vma = get_gate_vma(priv->task->mm); > priv->tail_vma = tail_vma; > @@ -161,6 +176,11 @@ out: > /* End of vmas has been reached */ > m->version = (tail_vma != NULL)? 0: -1UL; > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + __mpol_put(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > up_read(&mm->mmap_sem); > mmput(mm); > return tail_vma; Yes, I must admit that this is better than my version and it looks like all the ->show() functions that use these start, next, stop functions don't take task_lock() and this would generally be useful: we already hold current->mm->mmap_sem so there is little harm in holding task_lock(current) when reading these files as long as we're not touching the fastpath. These routines seem like it would nicely be added to mempolicy.h since we depend on CONFIG_NUMA there already. Please fix up the mess I made in show_numa_map() in 32f8516a8c73 ("mm, mempolicy: fix printing stack contents in numa_maps") by simply removing the task_lock() and task_unlock() as part of your patch. Thanks Kame! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id D98EA6B005D for ; Thu, 18 Oct 2012 00:41:32 -0400 (EDT) Received: from m1.gw.fujitsu.co.jp (unknown [10.0.50.71]) by fgwmail6.fujitsu.co.jp (Postfix) with ESMTP id 5EF993EE0AE for ; Thu, 18 Oct 2012 13:41:31 +0900 (JST) Received: from smail (m1 [127.0.0.1]) by outgoing.m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 3DBD845DE61 for ; Thu, 18 Oct 2012 13:41:31 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (s1.gw.fujitsu.co.jp [10.0.50.91]) by m1.gw.fujitsu.co.jp (Postfix) with ESMTP id 143B945DE53 for ; Thu, 18 Oct 2012 13:41:31 +0900 (JST) Received: from s1.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 00AE01DB804B for ; Thu, 18 Oct 2012 13:41:31 +0900 (JST) Received: from m1001.s.css.fujitsu.com (m1001.s.css.fujitsu.com [10.240.81.139]) by s1.gw.fujitsu.co.jp (Postfix) with ESMTP id 910D91DB8043 for ; Thu, 18 Oct 2012 13:41:30 +0900 (JST) Message-ID: <507F8864.1070203@jp.fujitsu.com> Date: Thu, 18 Oct 2012 13:41:08 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: David Rientjes , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org (2012/10/18 13:14), Linus Torvalds wrote: > On Wed, Oct 17, 2012 at 9:06 PM, Kamezawa Hiroyuki > wrote: >> if (vma 
&& vma != priv->tail_vma) { >> struct mm_struct *mm = vma->vm_mm; >> +#ifdef CONFIG_NUMA >> + task_lock(priv->task); >> + __mpol_put(priv->task->mempolicy); >> + task_unlock(priv->task); >> +#endif >> up_read(&mm->mmap_sem); >> mmput(mm); > > Please don't put #ifdef's inside code. It makes things really ugly and > hard to read. > > And that is *especially* true in this case, since there's a pattern to > all these things: > >> +#ifdef CONFIG_NUMA >> + task_lock(priv->task); >> + mpol_get(priv->task->mempolicy); >> + task_unlock(priv->task); >> +#endif > >> +#ifdef CONFIG_NUMA >> + task_lock(priv->task); >> + __mpol_put(priv->task->mempolicy); >> + task_unlock(priv->task); >> +#endif > > it really sounds like what you want to do is to just abstract a > "numa_policy_get/put(priv)" operation. > > So you could make it be something like > > #ifdef CONFIG_NUMA > static inline numa_policy_get(struct proc_maps_private *priv) > { > task_lock(priv->task); > mpol_get(priv->task->mempolicy); > task_unlock(priv->task); > } > .. same for the "put" function .. > #else > #define numa_policy_get(priv) do { } while (0) > #define numa_policy_put(priv) do { } while (0) > #endif > > and then you wouldn't have to have the #ifdef's in the middle of code, > and I think it will be more readable in general. > > Sure, it is going to be a few more actual lines of patch, but there's > no duplicated code sequence, and the added lines are just the syntax > that makes it look better. > you're right, I shouldn't send an ugly patch. I'm sorry. V2 uses suggested style, I think. Regards, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id A65B66B0044 for ; Thu, 18 Oct 2012 16:03:43 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so9813842pbb.14 for ; Thu, 18 Oct 2012 13:03:43 -0700 (PDT) Date: Thu, 18 Oct 2012 13:03:38 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps In-Reply-To: <507F86BD.7070201@jp.fujitsu.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote: > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > index cceaab0..43973b0 100644 > --- a/fs/proc/internal.h > +++ b/fs/proc/internal.h > @@ -12,6 +12,7 @@ > #include > #include > struct ctl_table_header; > +struct mempolicy; > extern struct proc_dir_entry proc_root; > #ifdef CONFIG_PROC_SYSCTL > @@ -74,6 +75,9 @@ struct proc_maps_private { > #ifdef CONFIG_MMU > struct vm_area_struct *tail_vma; > #endif > +#ifdef CONFIG_NUMA > + struct mempolicy *task_mempolicy; > +#endif > }; > void proc_init_inodecache(void); > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 14df880..624927d 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -89,11 +89,41 @@ static void pad_len_spaces(struct seq_file *m, int len) > len = 1; > seq_printf(m, "%*c", len, ' '); > } > +#ifdef CONFIG_NUMA 
> +/* > + * numa_maps scans all vmas under mmap_sem and checks their mempolicy. Doesn't only affect numa_maps, it also affects maps and smaps although they don't need the refcounts. > + * But task->mempolicy is not guarded by mmap_sem, it can be cleared/freed > + * under task_lock() (see kernel/exit.c) replacement of it is guarded by > + * mmap_sem. I think this should be a little more verbose making it clear that task->mempolicy can be cleared and freed if its refcount drops to 0 and is only protected by task_lock() and that we're safe from task->mempolicy changing between ->start(), ->next(), and ->stop() because task->mm->mmap_sem is held for the duration. > So, take referenceount under task_lock() before we start > + * scanning and drop it when numa_maps reaches the end. > + */ > +static void hold_task_mempolicy(struct proc_maps_private *priv) > +{ > + struct task_struct *task = priv->task; > + > + task_lock(task); > + priv->task_mempolicy = task->mempolicy; > + mpol_get(priv->task_mempolicy); > + task_unlock(task); > +} > +static void release_task_mempolicy(struct proc_maps_private *priv) > +{ > + mpol_put(priv->task_mempolicy); > +} > +#else > +static void hold_task_mempolicy(struct proc_maps_private *priv) > +{ > +} > +static void release_task_mempolicy(struct proc_maps_private *priv) > +{ > +} > +#endif > static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct > *vma) > { > if (vma && vma != priv->tail_vma) { > struct mm_struct *mm = vma->vm_mm; > + release_task_mempolicy(priv); > up_read(&mm->mmap_sem); > mmput(mm); > } > @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) > tail_vma = get_gate_vma(priv->task->mm); > priv->tail_vma = tail_vma; > - > + hold_task_mempolicy(priv); > /* Start with last addr hint */ > vma = find_vma(mm, last_addr); > if (last_addr && vma) { > @@ -159,6 +189,7 @@ out: > if (vma) > return vma; > + release_task_mempolicy(priv); > /* End of vmas has been reached */ > m->version = (tail_vma 
!= NULL)? 0: -1UL; > up_read(&mm->mmap_sem); Otherwise looks good, but please remove the two task_lock()'s in show_numa_map() that I added as part of this since you're replacing the need for locking. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id 8CFDF6B0044 for ; Fri, 19 Oct 2012 02:51:50 -0400 (EDT) Received: by mail-ob0-f169.google.com with SMTP id va7so165614obc.14 for ; Thu, 18 Oct 2012 23:51:49 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <507F86BD.7070201@jp.fujitsu.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> From: KOSAKI Motohiro Date: Fri, 19 Oct 2012 02:51:29 -0400 Message-ID: Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: David Rientjes , Linus Torvalds , Andrew Morton , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org >> Can't we have another way to fix ? like this ? too ugly ? >> Again, I'm sorry if I misunderstand the points. >> > Sorry this patch itself may be buggy. please don't test.. > I missed that kernel/exit.c sets task->mempolicy to be NULL. > fixed one here. > > -- > From 5581c71e68a7f50e52fd67cca00148911023f9f5 Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki > Date: Thu, 18 Oct 2012 13:50:29 +0900 > > Subject: [PATCH] hold task->mempolicy while numa_maps scans. 
> > /proc//numa_maps scans vma and show mempolicy under > mmap_sem. It sometimes accesses task->mempolicy which can > be freed without mmap_sem and numa_maps can show some > garbage while scanning. > > This patch tries to take reference count of task->mempolicy at reading > numa_maps before calling get_vma_policy(). By this, task->mempolicy > will not be freed until numa_maps reaches its end. > > Signed-off-by: KAMEZAWA Hiroyuki > > V1->V2 > - access task->mempolicy only once and remember it. Becase kernel/exit.c > can overwrite it. > > Signed-off-by: KAMEZAWA Hiroyuki Ok, this is acceptable to me. go ahead. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 9B35B6B0062 for ; Fri, 19 Oct 2012 04:36:01 -0400 (EDT) Received: from m4.gw.fujitsu.co.jp (unknown [10.0.50.74]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id CB46F3EE0C1 for ; Fri, 19 Oct 2012 17:35:59 +0900 (JST) Received: from smail (m4 [127.0.0.1]) by outgoing.m4.gw.fujitsu.co.jp (Postfix) with ESMTP id AD27045DE51 for ; Fri, 19 Oct 2012 17:35:59 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (s4.gw.fujitsu.co.jp [10.0.50.94]) by m4.gw.fujitsu.co.jp (Postfix) with ESMTP id 93B8945DE4E for ; Fri, 19 Oct 2012 17:35:59 +0900 (JST) Received: from s4.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 8204A1DB803E for ; Fri, 19 Oct 2012 17:35:59 +0900 (JST) Received: from m1000.s.css.fujitsu.com (m1000.s.css.fujitsu.com [10.240.81.136]) by s4.gw.fujitsu.co.jp (Postfix) with ESMTP id 1DFC9E38003 for ; Fri, 19 Oct 2012 17:35:59 +0900 (JST) Message-ID: <508110C4.6030805@jp.fujitsu.com> Date: Fri, 19 Oct 2012 17:35:16 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 
Subject: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org (2012/10/19 5:03), David Rientjes wrote: > On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote: >> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) >> tail_vma = get_gate_vma(priv->task->mm); >> priv->tail_vma = tail_vma; >> - >> + hold_task_mempolicy(priv); >> /* Start with last addr hint */ >> vma = find_vma(mm, last_addr); >> if (last_addr && vma) { >> @@ -159,6 +189,7 @@ out: >> if (vma) >> return vma; >> + release_task_mempolicy(priv); >> /* End of vmas has been reached */ >> m->version = (tail_vma != NULL)? 0: -1UL; >> up_read(&mm->mmap_sem); > > Otherwise looks good, but please remove the two task_lock()'s in > show_numa_map() that I added as part of this since you're replacing the > need for locking. > Thank you for your review. How about this ? == From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Fri, 19 Oct 2012 17:00:55 +0900 Subject: [PATCH] hold task->mempolicy while numa_maps scans. /proc//numa_maps scans vma and show mempolicy under mmap_sem. It sometimes accesses task->mempolicy which can be freed without mmap_sem and numa_maps can show some garbage while scanning. This patch tries to take reference count of task->mempolicy at reading numa_maps before calling get_vma_policy(). 
By this, task->mempolicy will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki

V2->v3
 - updated comments to be more verbose.
 - removed task_lock() in numa_maps code.
V1->V2
 - access task->mempolicy only once and remember it, because kernel/exit.c
   can overwrite it.

Signed-off-by: KAMEZAWA Hiroyuki
---
 fs/proc/internal.h |    4 ++++
 fs/proc/task_mmu.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cceaab0..43973b0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -12,6 +12,7 @@
 #include
 #include
 struct ctl_table_header;
+struct mempolicy;
 
 extern struct proc_dir_entry proc_root;
 #ifdef CONFIG_PROC_SYSCTL
@@ -74,6 +75,9 @@ struct proc_maps_private {
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
 #endif
+#ifdef CONFIG_NUMA
+	struct mempolicy *task_mempolicy;
+#endif
 };
 
 void proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..2371fea 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,11 +89,55 @@ static void pad_len_spaces(struct seq_file *m, int len)
 		len = 1;
 	seq_printf(m, "%*c", len, ' ');
 }
+#ifdef CONFIG_NUMA
+/*
+ * These functions are for numa_maps but called in generic **maps seq_file
+ * ->start(), ->stop() ops.
+ *
+ * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
+ * Each mempolicy object is controlled by reference counting. The problem here
+ * is how to avoid accessing a dead mempolicy object.
+ *
+ * Because we're holding mmap_sem while reading seq_file, it's safe to access
+ * each vma's mempolicy; no vma object will ever drop its ref to mempolicy.
+ *
+ * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy
+ * is set and replaced under mmap_sem but unrefed and cleared under task_lock().
+ * So, without task_lock(), we cannot trust get_vma_policy() because we cannot
+ * guarantee the task never exits under us. But taking task_lock() around
+ * get_vma_policy() causes a lock order problem.
+ *
+ * To access task->mempolicy without lock, we hold a reference count of an
+ * object pointed to by task->mempolicy and remember it. This will guarantee
+ * that task->mempolicy points to an alive object or NULL in numa_maps accesses.
+ */
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+	struct task_struct *task = priv->task;
+
+	task_lock(task);
+	priv->task_mempolicy = task->mempolicy;
+	mpol_get(priv->task_mempolicy);
+	task_unlock(task);
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+	mpol_put(priv->task_mempolicy);
+}
+#else
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+#endif
 
 static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
 {
 	if (vma && vma != priv->tail_vma) {
 		struct mm_struct *mm = vma->vm_mm;
+		release_task_mempolicy(priv);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
@@ -132,7 +176,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 
 	tail_vma = get_gate_vma(priv->task->mm);
 	priv->tail_vma = tail_vma;
-
+	hold_task_mempolicy(priv);
 	/* Start with last addr hint */
 	vma = find_vma(mm, last_addr);
 	if (last_addr && vma) {
@@ -159,6 +203,7 @@ out:
 	if (vma)
 		return vma;
 
+	release_task_mempolicy(priv);
 	/* End of vmas has been reached */
 	m->version = (tail_vma != NULL)?
0: -1UL; up_read(&mm->mmap_sem); @@ -1178,11 +1223,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); mpol_cond_put(pol); - task_unlock(task); seq_printf(m, "%08lx %s", vma->vm_start, buffer); -- 1.7.10.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id DA7056B0070 for ; Fri, 19 Oct 2012 05:28:45 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so334128pbb.14 for ; Fri, 19 Oct 2012 02:28:45 -0700 (PDT) Date: Fri, 19 Oct 2012 02:28:42 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. In-Reply-To: <508110C4.6030805@jp.fujitsu.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 19 Oct 2012, Kamezawa Hiroyuki wrote: > From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki > Date: Fri, 19 Oct 2012 17:00:55 +0900 > Subject: [PATCH] hold task->mempolicy while numa_maps scans. 
> > /proc//numa_maps scans vma and show mempolicy under > mmap_sem. It sometimes accesses task->mempolicy which can > be freed without mmap_sem and numa_maps can show some > garbage while scanning. > > This patch tries to take reference count of task->mempolicy at reading > numa_maps before calling get_vma_policy(). By this, task->mempolicy > will not be freed until numa_maps reaches its end. > > Signed-off-by: KAMEZAWA Hiroyuki Looks good, but the patch is whitespace damaged so it doesn't apply. When that's fixed: Acked-by: David Rientjes Thanks for following through on this! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx110.postini.com [74.125.245.110]) by kanga.kvack.org (Postfix) with SMTP id 29E256B005A for ; Fri, 19 Oct 2012 15:15:39 -0400 (EDT) Received: by mail-ob0-f169.google.com with SMTP id va7so928684obc.14 for ; Fri, 19 Oct 2012 12:15:38 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <508110C4.6030805@jp.fujitsu.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> From: KOSAKI Motohiro Date: Fri, 19 Oct 2012 15:15:18 -0400 Message-ID: Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. 
Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: David Rientjes , Linus Torvalds , Andrew Morton , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, Oct 19, 2012 at 4:35 AM, Kamezawa Hiroyuki wrote: > (2012/10/19 5:03), David Rientjes wrote: >> >> On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote: >>> >>> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) >>> tail_vma = get_gate_vma(priv->task->mm); >>> priv->tail_vma = tail_vma; >>> - >>> + hold_task_mempolicy(priv); >>> /* Start with last addr hint */ >>> vma = find_vma(mm, last_addr); >>> if (last_addr && vma) { >>> @@ -159,6 +189,7 @@ out: >>> if (vma) >>> return vma; >>> + release_task_mempolicy(priv); >>> /* End of vmas has been reached */ >>> m->version = (tail_vma != NULL)? 0: -1UL; >>> up_read(&mm->mmap_sem); >> >> >> Otherwise looks good, but please remove the two task_lock()'s in >> show_numa_map() that I added as part of this since you're replacing the >> need for locking. >> > Thank you for your review. > How about this ? > > == > From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki > Date: Fri, 19 Oct 2012 17:00:55 +0900 > Subject: [PATCH] hold task->mempolicy while numa_maps scans. > > /proc//numa_maps scans vma and show mempolicy under > mmap_sem. It sometimes accesses task->mempolicy which can > be freed without mmap_sem and numa_maps can show some > garbage while scanning. > > This patch tries to take reference count of task->mempolicy at reading > numa_maps before calling get_vma_policy(). By this, task->mempolicy > will not be freed until numa_maps reaches its end. > > Signed-off-by: KAMEZAWA Hiroyuki > > V2->v3 > - updated comments to be more verbose. > - removed task_lock() in numa_maps code. > V1->V2 > - access task->mempolicy only once and remember it. 
Because kernel/exit.c > can overwrite it. > > Signed-off-by: KAMEZAWA Hiroyuki Acked-by: KOSAKI Motohiro -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id 27ABC6B0062 for ; Sun, 21 Oct 2012 22:47:57 -0400 (EDT) Received: from m3.gw.fujitsu.co.jp (unknown [10.0.50.73]) by fgwmail5.fujitsu.co.jp (Postfix) with ESMTP id E0E6F3EE0BD for ; Mon, 22 Oct 2012 11:47:54 +0900 (JST) Received: from smail (m3 [127.0.0.1]) by outgoing.m3.gw.fujitsu.co.jp (Postfix) with ESMTP id C663045DEBC for ; Mon, 22 Oct 2012 11:47:54 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (s3.gw.fujitsu.co.jp [10.0.50.93]) by m3.gw.fujitsu.co.jp (Postfix) with ESMTP id AB92445DEB6 for ; Mon, 22 Oct 2012 11:47:54 +0900 (JST) Received: from s3.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 990241DB8041 for ; Mon, 22 Oct 2012 11:47:54 +0900 (JST) Received: from ml14.s.css.fujitsu.com (ml14.s.css.fujitsu.com [10.240.81.134]) by s3.gw.fujitsu.co.jp (Postfix) with ESMTP id 487571DB803F for ; Mon, 22 Oct 2012 11:47:54 +0900 (JST) Message-ID: <5084B3C3.3070906@jp.fujitsu.com> Date: Mon, 22 Oct 2012 11:47:31 +0900 From: Kamezawa Hiroyuki MIME-Version: 1.0 Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.
References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org (2012/10/19 18:28), David Rientjes wrote: > Looks good, but the patch is whitespace damaged so it doesn't apply. When > that's fixed: > > Acked-by: David Rientjes Sorry, I hope this one is not broken... == From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Fri, 19 Oct 2012 17:00:55 +0900 Subject: [PATCH] hold task->mempolicy while numa_maps scans. /proc/<pid>/numa_maps scans vma and show mempolicy under mmap_sem. It sometimes accesses task->mempolicy which can be freed without mmap_sem and numa_maps can show some garbage while scanning. This patch tries to take reference count of task->mempolicy at reading numa_maps before calling get_vma_policy(). By this, task->mempolicy will not be freed until numa_maps reaches its end. Acked-by: David Rientjes Acked-by: KOSAKI Motohiro Signed-off-by: KAMEZAWA Hiroyuki V2->v3 - updated comments to be more verbose. - removed task_lock() in numa_maps code. V1->V2 - access task->mempolicy only once and remember it. Because kernel/exit.c can overwrite it.
--- fs/proc/internal.h | 4 ++++ fs/proc/task_mmu.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 50 insertions(+), 3 deletions(-) diff --git a/fs/proc/internal.h b/fs/proc/internal.h index cceaab0..43973b0 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -12,6 +12,7 @@ #include #include struct ctl_table_header; +struct mempolicy; extern struct proc_dir_entry proc_root; #ifdef CONFIG_PROC_SYSCTL @@ -74,6 +75,9 @@ struct proc_maps_private { #ifdef CONFIG_MMU struct vm_area_struct *tail_vma; #endif +#ifdef CONFIG_NUMA + struct mempolicy *task_mempolicy; +#endif }; void proc_init_inodecache(void); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 14df880..2371fea 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -89,11 +89,55 @@ static void pad_len_spaces(struct seq_file *m, int len) len = 1; seq_printf(m, "%*c", len, ' '); } +#ifdef CONFIG_NUMA +/* + * These functions are for numa_maps but called in generic **maps seq_file + * ->start(), ->stop() ops. + * + * numa_maps scans all vmas under mmap_sem and checks their mempolicy. + * Each mempolicy object is controlled by reference counting. The problem here + * is how to avoid accessing dead mempolicy object. + * + * Because we're holding mmap_sem while reading seq_file, it's safe to access + * each vma's mempolicy, no vma objects will never drop refs to mempolicy. + * + * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy + * is set and replaced under mmap_sem but unrefed and cleared under task_lock(). + * So, without task_lock(), we cannot trust get_vma_policy() because we cannot + * gurantee the task never exits under us. But taking task_lock() around + * get_vma_plicy() causes lock order problem. + * + * To access task->mempolicy without lock, we hold a reference count of an + * object pointed by task->mempolicy and remember it. This will guarantee + * that task->mempolicy points to an alive object or NULL in numa_maps accesses. 
+ */ +static void hold_task_mempolicy(struct proc_maps_private *priv) +{ + struct task_struct *task = priv->task; + + task_lock(task); + priv->task_mempolicy = task->mempolicy; + mpol_get(priv->task_mempolicy); + task_unlock(task); +} +static void release_task_mempolicy(struct proc_maps_private *priv) +{ + mpol_put(priv->task_mempolicy); +} +#else +static void hold_task_mempolicy(struct proc_maps_private *priv) +{ +} +static void release_task_mempolicy(struct proc_maps_private *priv) +{ +} +#endif static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma) { if (vma && vma != priv->tail_vma) { struct mm_struct *mm = vma->vm_mm; + release_task_mempolicy(priv); up_read(&mm->mmap_sem); mmput(mm); } @@ -132,7 +176,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) tail_vma = get_gate_vma(priv->task->mm); priv->tail_vma = tail_vma; - + hold_task_mempolicy(priv); /* Start with last addr hint */ vma = find_vma(mm, last_addr); if (last_addr && vma) { @@ -159,6 +203,7 @@ out: if (vma) return vma; + release_task_mempolicy(priv); /* End of vmas has been reached */ m->version = (tail_vma != NULL)? 0: -1UL; up_read(&mm->mmap_sem); @@ -1178,11 +1223,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); mpol_cond_put(pol); - task_unlock(task); seq_printf(m, "%08lx %s", vma->vm_start, buffer); -- 1.7.10.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx164.postini.com [74.125.245.164]) by kanga.kvack.org (Postfix) with SMTP id 843B56B0073 for ; Mon, 22 Oct 2012 16:56:01 -0400 (EDT) Date: Mon, 22 Oct 2012 13:55:59 -0700 From: Andrew Morton Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. Message-Id: <20121022135559.1ccb14bc.akpm@linux-foundation.org> In-Reply-To: <5084B3C3.3070906@jp.fujitsu.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> <5084B3C3.3070906@jp.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: David Rientjes , Linus Torvalds , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, 22 Oct 2012 11:47:31 +0900 Kamezawa Hiroyuki wrote: > (2012/10/19 18:28), David Rientjes wrote: > > > Looks good, but the patch is whitespace damaged so it doesn't apply. When > > that's fixed: > > > > Acked-by: David Rientjes > > Sorry, I hope this one is not broken... > > ... > > --- a/fs/proc/internal.h > +++ b/fs/proc/internal.h > @@ -12,6 +12,7 @@ > #include > #include > struct ctl_table_header; > +struct mempolicy; > > extern struct proc_dir_entry proc_root; > #ifdef CONFIG_PROC_SYSCTL > @@ -74,6 +75,9 @@ struct proc_maps_private { > #ifdef CONFIG_MMU > struct vm_area_struct *tail_vma; > #endif > +#ifdef CONFIG_NUMA > + struct mempolicy *task_mempolicy; > +#endif > }; The mail client space-stuffed it. We merged this three days ago, in 9e7814404b77c3e8920b. 
Please check that it landed OK - there's a newline fixup in there but it looks good to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx187.postini.com [74.125.245.187]) by kanga.kvack.org (Postfix) with SMTP id D0BC56B0078 for ; Mon, 22 Oct 2012 16:57:00 -0400 (EDT) Received: by mail-da0-f41.google.com with SMTP id i14so1670611dad.14 for ; Mon, 22 Oct 2012 13:57:00 -0700 (PDT) Date: Mon, 22 Oct 2012 13:56:56 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. In-Reply-To: <5084B3C3.3070906@jp.fujitsu.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> <5084B3C3.3070906@jp.fujitsu.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Kamezawa Hiroyuki Cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, 22 Oct 2012, Kamezawa Hiroyuki wrote: > > Looks good, but the patch is whitespace damaged so it doesn't apply. When > > that's fixed: > > > > Acked-by: David Rientjes > > Sorry, I hope this one is not broken... Looks like Linus picked this up directly, thanks Kame! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id 141EC6B0068 for ; Wed, 24 Oct 2012 19:30:50 -0400 (EDT) Received: by mail-ia0-f169.google.com with SMTP id h37so1068672iak.14 for ; Wed, 24 Oct 2012 16:30:49 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> From: Sasha Levin Date: Wed, 24 Oct 2012 19:30:29 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, Oct 17, 2012 at 1:24 AM, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > >> BUG: sleeping function called from invalid context at kernel/mutex.c:269 >> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 >> 3 locks on stack by trinity-child2/8558: >> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [] seq_lseek+0x3f/0x120 >> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 >> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 >> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 >> Call Trace: >> [] __might_sleep+0x14c/0x200 >> [] mutex_lock_nested+0x2e/0x50 >> [] mpol_shared_policy_lookup+0x33/0x90 >> [] shmem_get_policy+0x33/0x40 >> [] get_vma_policy+0x3a/0x90 >> [] show_numa_map+0x163/0x610 >> [] ? pid_maps_open+0x20/0x20 >> [] ? 
pagemap_hugetlb_range+0xf0/0xf0 >> [] show_pid_numa_map+0x13/0x20 >> [] traverse+0xf2/0x230 >> [] seq_lseek+0xab/0x120 >> [] sys_lseek+0x7b/0xb0 >> [] tracesys+0xe1/0xe6 >> > > Hmm, looks like we need to change the refcount semantics entirely. We'll > need to make get_vma_policy() always take a reference and then drop it > accordingly. This works if get_vma_policy() can grab a reference while > holding task_lock() for the task policy fallback case. > > Comments on this approach? > --- [snip] I'm not sure about the status of the patch, but it doesn't apply on top of -next, and I still see the warnings when fuzzing on -next. Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx184.postini.com [74.125.245.184]) by kanga.kvack.org (Postfix) with SMTP id EB4E36B0068 for ; Wed, 24 Oct 2012 19:34:53 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so1717752pbb.14 for ; Wed, 24 Oct 2012 16:34:53 -0700 (PDT) Date: Wed, 24 Oct 2012 16:34:50 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 24 Oct 2012, Sasha Levin wrote: > I'm not sure about the status of the patch, but it doesn't apply on > top of -next, and I still > see the warnings when fuzzing on -next.
> This should be fixed by 9e7814404b77 ("hold task->mempolicy while numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading /proc/pid/numa_maps on that kernel? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx128.postini.com [74.125.245.128]) by kanga.kvack.org (Postfix) with SMTP id D65516B0068 for ; Wed, 24 Oct 2012 19:45:28 -0400 (EDT) Received: by mail-ie0-f169.google.com with SMTP id 10so1898027ied.14 for ; Wed, 24 Oct 2012 16:45:28 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> From: Sasha Levin Date: Wed, 24 Oct 2012 19:37:08 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, Oct 24, 2012 at 7:34 PM, David Rientjes wrote: > On Wed, 24 Oct 2012, Sasha Levin wrote: > >> I'm not sure about the status of the patch, but it doesn't apply on >> top of -next, and I still >> see the warnings when fuzzing on -next. >> > > This should be fixed by 9e7814404b77 ("hold task->mempolicy while > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading > /proc/pid/numa_maps on that kernel? I was actually referring to the warnings Dave Jones saw when fuzzing with trinity after the original patch was applied. 
I still see the following when fuzzing: [ 338.467156] BUG: sleeping function called from invalid context at kernel/mutex.c:269 [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main [ 338.481199] 2 locks held by trinity-main/6361: [ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x1e4/0x4f0 [ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [] handle_pte_fault+0x3f7/0x6a0 [ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 [ 338.530318] Call Trace: [ 338.534088] [] __might_sleep+0x1c3/0x1e0 [ 338.539358] [] mutex_lock_nested+0x29/0x50 [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 [ 338.545258] [] shmem_get_policy+0x2e/0x30 [ 338.545264] [] get_vma_policy+0x5a/0xa0 [ 338.545267] [] mpol_misplaced+0x41/0x1d0 [ 338.545272] [] handle_pte_fault+0x465/0x6a0 [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 [ 338.545282] [] handle_mm_fault+0x32a/0x360 [ 338.545286] [] __do_page_fault+0x480/0x4f0 [ 338.545293] [] ? del_timer+0x26/0x80 [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 [ 338.545309] [] ? rcu_eqs_exit+0x9c/0xb0 [ 338.545312] [] do_page_fault+0x26/0x40 [ 338.545317] [] do_async_page_fault+0x30/0xa0 [ 338.545321] [] async_page_fault+0x28/0x30 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx150.postini.com [74.125.245.150]) by kanga.kvack.org (Postfix) with SMTP id 0FEE36B0071 for ; Wed, 24 Oct 2012 20:08:14 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so805589pad.14 for ; Wed, 24 Oct 2012 17:08:13 -0700 (PDT) Date: Wed, 24 Oct 2012 17:08:11 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin , Mel Gorman , Peter Zijlstra , Rik van Riel Cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 24 Oct 2012, Sasha Levin wrote: > > This should be fixed by 9e7814404b77 ("hold task->mempolicy while > > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading > > /proc/pid/numa_maps on that kernel? > > I was actually referring to the warnings Dave Jones saw when fuzzing > with trinity after the > original patch was applied. 
> > I still see the following when fuzzing: > > [ 338.467156] BUG: sleeping function called from invalid context at > kernel/mutex.c:269 > [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main > [ 338.481199] 2 locks held by trinity-main/6361: > [ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [] > __do_page_fault+0x1e4/0x4f0 > [ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: > [] handle_pte_fault+0x3f7/0x6a0 > [ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W > 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 > [ 338.530318] Call Trace: > [ 338.534088] [] __might_sleep+0x1c3/0x1e0 > [ 338.539358] [] mutex_lock_nested+0x29/0x50 > [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 > [ 338.545258] [] shmem_get_policy+0x2e/0x30 > [ 338.545264] [] get_vma_policy+0x5a/0xa0 > [ 338.545267] [] mpol_misplaced+0x41/0x1d0 > [ 338.545272] [] handle_pte_fault+0x465/0x6a0 > [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 > [ 338.545282] [] handle_mm_fault+0x32a/0x360 > [ 338.545286] [] __do_page_fault+0x480/0x4f0 > [ 338.545293] [] ? del_timer+0x26/0x80 > [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 > [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 > [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 > [ 338.545309] [] ? rcu_eqs_exit+0x9c/0xb0 > [ 338.545312] [] do_page_fault+0x26/0x40 > [ 338.545317] [] do_async_page_fault+0x30/0xa0 > [ 338.545321] [] async_page_fault+0x28/0x30 > Ok, this looks the same but it's actually a different issue: mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, calls get_vma_policy() which may take the shared policy mutex. This happens while holding page_table_lock from do_huge_pmd_numa_page() but also from do_numa_page() while holding a spinlock on the ptl, which is coming from the sched/numa branch. 
Is there any way that we can avoid changing the shared policy mutex back into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race in shared_policy_replace()"])? Adding Peter, Rik, and Mel to the cc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx178.postini.com [74.125.245.178]) by kanga.kvack.org (Postfix) with SMTP id B56E16B0068 for ; Wed, 24 Oct 2012 20:54:54 -0400 (EDT) Received: by mail-oa0-f41.google.com with SMTP id k14so1349772oag.14 for ; Wed, 24 Oct 2012 17:54:54 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> From: KOSAKI Motohiro Date: Wed, 24 Oct 2012 20:54:33 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Sasha Levin , Mel Gorman , Peter Zijlstra , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, Oct 24, 2012 at 8:08 PM, David Rientjes wrote: > On Wed, 24 Oct 2012, Sasha Levin wrote: > >> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while >> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading >> > /proc/pid/numa_maps on that kernel? >> >> I was actually referring to the warnings Dave Jones saw when fuzzing >> with trinity after the >> original patch was applied.
>> >> I still see the following when fuzzing: >> >> [ 338.467156] BUG: sleeping function called from invalid context at >> kernel/mutex.c:269 >> [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main >> [ 338.481199] 2 locks held by trinity-main/6361: >> [ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [] >> __do_page_fault+0x1e4/0x4f0 >> [ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: >> [] handle_pte_fault+0x3f7/0x6a0 >> [ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W >> 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 >> [ 338.530318] Call Trace: >> [ 338.534088] [] __might_sleep+0x1c3/0x1e0 >> [ 338.539358] [] mutex_lock_nested+0x29/0x50 >> [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 >> [ 338.545258] [] shmem_get_policy+0x2e/0x30 >> [ 338.545264] [] get_vma_policy+0x5a/0xa0 >> [ 338.545267] [] mpol_misplaced+0x41/0x1d0 >> [ 338.545272] [] handle_pte_fault+0x465/0x6a0 >> [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 >> [ 338.545282] [] handle_mm_fault+0x32a/0x360 >> [ 338.545286] [] __do_page_fault+0x480/0x4f0 >> [ 338.545293] [] ? del_timer+0x26/0x80 >> [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 >> [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 >> [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 >> [ 338.545309] [] ? rcu_eqs_exit+0x9c/0xb0 >> [ 338.545312] [] do_page_fault+0x26/0x40 >> [ 338.545317] [] do_async_page_fault+0x30/0xa0 >> [ 338.545321] [] async_page_fault+0x28/0x30 >> > > Ok, this looks the same but it's actually a different issue: > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > calls get_vma_policy() which may take the shared policy mutex. This > happens while holding page_table_lock from do_huge_pmd_numa_page() but > also from do_numa_page() while holding a spinlock on the ptl, which is > coming from the sched/numa branch. 
> > Is there any way that we can avoid changing the shared policy mutex back > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > in shared_policy_replace()"])? > > Adding Peter, Rik, and Mel to the cc. Hrm. I hadn't noticed there is mpol_misplaced() in linux-next. Peter, I guess you committed it, right? If so, may I review your mempolicy changes? Now mempolicy has a lot of horrible buggy code and I hope to maintain it carefully. Which tree should I look at? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id 84DC46B0068 for ; Wed, 24 Oct 2012 21:15:14 -0400 (EDT) Received: by mail-pb0-f41.google.com with SMTP id rq2so1767795pbb.14 for ; Wed, 24 Oct 2012 18:15:13 -0700 (PDT) Date: Wed, 24 Oct 2012 18:15:11 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: KOSAKI Motohiro Cc: Sasha Levin , Mel Gorman , Peter Zijlstra , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 24 Oct 2012, KOSAKI Motohiro wrote: > Hrm. I hadn't noticed there is mpol_misplaced() in linux-next. Peter, > I guess you committed it, right? If so, may I review your mempolicy > changes? Now mempolicy has a lot of horrible buggy code and I hope to > maintain it carefully. Which tree should I look at?
> Check out sched/numa from git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git $ git diff v3.7-rc2.. mm/mempolicy.c | diffstat mempolicy.c | 444 +++++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 277 insertions(+), 167 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id E78C66B0070 for ; Thu, 25 Oct 2012 08:20:09 -0400 (EDT) Message-ID: <1351167554.23337.14.camel@twins> Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps From: Peter Zijlstra Date: Thu, 25 Oct 2012 14:19:14 +0200 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: > Ok, this looks the same but it's actually a different issue: > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > calls get_vma_policy() which may take the shared policy mutex. This > happens while holding page_table_lock from do_huge_pmd_numa_page() but > also from do_numa_page() while holding a spinlock on the ptl, which is > coming from the sched/numa branch.
> > Is there any way that we can avoid changing the shared policy mutex back > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > in shared_policy_replace()"])? > > Adding Peter, Rik, and Mel to the cc. Urgh, crud I totally missed that. So the problem is that we need to compute if the current page is placed 'right' while holding pte_lock in order to avoid multiple pte_lock acquisitions on the 'fast' path. I'll look into this in a bit, but one thing that comes to mind is having both a spinlock and a mutex and require holding both for modification while either one is sufficient for read. That would allow sp_lookup() to use the spinlock, while insert and replace can hold both. Not sure it will work for this, need to stare at this code a little more. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id 5E8AD6B0071 for ; Thu, 25 Oct 2012 10:39:53 -0400 (EDT) Message-ID: <1351175972.12171.14.camel@twins> Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps From: Peter Zijlstra Date: Thu, 25 Oct 2012 16:39:32 +0200 In-Reply-To: <1351167554.23337.14.camel@twins> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: David Rientjes Cc: Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On
Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote: > On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: > > Ok, this looks the same but it's actually a different issue: > > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > > calls get_vma_policy() which may take the shared policy mutex. This > > happens while holding page_table_lock from do_huge_pmd_numa_page() but > > also from do_numa_page() while holding a spinlock on the ptl, which is > > coming from the sched/numa branch. > > > > Is there any way that we can avoid changing the shared policy mutex back > > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > > in shared_policy_replace()"])? > > > > Adding Peter, Rik, and Mel to the cc. > > Urgh, crud I totally missed that. > > So the problem is that we need to compute if the current page is placed > 'right' while holding pte_lock in order to avoid multiple pte_lock > acquisitions on the 'fast' path. > > I'll look into this in a bit, but one thing that comes to mind is having > both a spinlock and a mutex and require holding both for modification > while either one is sufficient for read. > > That would allow sp_lookup() to use the spinlock, while insert and > replace can hold both. > > Not sure it will work for this, need to stare at this code a little > more. So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments? 
---
 include/linux/mempolicy.h |    1 +
 mm/mempolicy.c            |   23 ++++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,7 @@ struct sp_node {
 
 struct shared_policy {
 	struct rb_root root;
+	spinlock_t lock;
 	struct mutex mutex;
 };
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2099,12 +2099,20 @@ bool __mpol_equal(struct mempolicy *a, s
  *
  * Remember policies even when nobody has shared memory mapped.
  * The policies are kept in Red-Black tree linked from the inode.
- * They are protected by the sp->lock spinlock, which should be held
- * for any accesses to the tree.
+ *
+ * The rb-tree is locked using both a mutex and a spinlock. Every modification
+ * to the tree must hold both the mutex and the spinlock, lookups can hold
+ * either to observe a stable tree.
+ *
+ * In particular, sp_insert() and sp_delete() take the spinlock, whereas
+ * sp_lookup() doesn't, this so users have choice.
+ *
+ * shared_policy_replace() and mpol_free_shared_policy() take the mutex
+ * and call sp_insert(), sp_delete().
  */
 
 /* lookup first element intersecting start-end */
-/* Caller holds sp->mutex */
+/* Caller holds either sp->lock and/or sp->mutex */
 static struct sp_node *
 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
 {
@@ -2143,6 +2151,7 @@ static void sp_insert(struct shared_poli
 	struct rb_node *parent = NULL;
 	struct sp_node *nd;
 
+	spin_lock(&sp->lock);
 	while (*p) {
 		parent = *p;
 		nd = rb_entry(parent, struct sp_node, nd);
@@ -2155,6 +2164,7 @@ static void sp_insert(struct shared_poli
 	}
 	rb_link_node(&new->nd, parent, p);
 	rb_insert_color(&new->nd, &sp->root);
+	spin_unlock(&sp->lock);
 	pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
 		 new->policy ? new->policy->mode : 0);
 }
@@ -2168,13 +2178,13 @@ mpol_shared_policy_lookup(struct shared_
 
 	if (!sp->root.rb_node)
 		return NULL;
-	mutex_lock(&sp->mutex);
+	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
 	if (sn) {
 		mpol_get(sn->policy);
 		pol = sn->policy;
 	}
-	mutex_unlock(&sp->mutex);
+	spin_unlock(&sp->lock);
 	return pol;
 }
 
@@ -2295,8 +2305,10 @@ int mpol_misplaced(struct page *page, st
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
+	spin_lock(&sp->lock);
 	rb_erase(&n->nd, &sp->root);
 	sp_free(n);
+	spin_unlock(&sp->lock);
 }
 
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
@@ -2381,6 +2393,7 @@ void mpol_shared_policy_init(struct shar
 	int ret;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
+	spin_lock_init(&sp->lock);
 	mutex_init(&sp->mutex);
 
 	if (mpol) {
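[Editorial note] The locking rule in the patch above (every modification holds both the mutex and the spinlock, a lookup may hold either one) can be sketched in user space. This is only an illustration under stated assumptions, not kernel code: struct shared_table, SHARED_TABLE_INIT and the table_*() helpers are hypothetical names, a C11 atomic_flag stands in for the kernel spinlock, and a pthread mutex stands in for sp->mutex.

```c
#include <pthread.h>
#include <stdatomic.h>

/*
 * Hypothetical stand-in for struct shared_policy: "lock" plays the role of
 * sp->lock (a spinlock), "mutex" the role of sp->mutex, and "value" the
 * role of the rb-tree that both locks protect.
 */
struct shared_table {
	atomic_flag lock;
	pthread_mutex_t mutex;
	int value;
};

#define SHARED_TABLE_INIT { ATOMIC_FLAG_INIT, PTHREAD_MUTEX_INITIALIZER, 0 }

static void table_spin_lock(struct shared_table *t)
{
	/* Busy-wait until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		;
}

static void table_spin_unlock(struct shared_table *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Writer: holds BOTH locks around the actual modification. */
static void table_update(struct shared_table *t, int v)
{
	pthread_mutex_lock(&t->mutex);
	/* Work that may block (allocation, etc.) belongs here, before spinning. */
	table_spin_lock(t);
	t->value = v;
	table_spin_unlock(t);
	pthread_mutex_unlock(&t->mutex);
}

/* Reader that must not sleep: the spinlock alone gives a stable view. */
static int table_lookup_atomic(struct shared_table *t)
{
	int v;

	table_spin_lock(t);
	v = t->value;
	table_spin_unlock(t);
	return v;
}

/* Reader that may sleep: the mutex alone gives a stable view. */
static int table_lookup_sleepable(struct shared_table *t)
{
	int v;

	pthread_mutex_lock(&t->mutex);
	v = t->value;
	pthread_mutex_unlock(&t->mutex);
	return v;
}
```

Because a writer always holds both locks, either lock on its own excludes writers, which is why the non-sleeping lookup path can get by with only the spinlock.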
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx146.postini.com [74.125.245.146]) by kanga.kvack.org (Postfix) with SMTP id F04826B0072 for ; Thu, 25 Oct 2012 13:24:03 -0400 (EDT) Received: by mail-vc0-f169.google.com with SMTP id fl17so2561418vcb.14 for ; Thu, 25 Oct 2012 10:24:02 -0700 (PDT) Message-ID: <508975A4.50203@gmail.com> Date: Thu, 25 Oct 2012 13:23:48 -0400 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> In-Reply-To: <1351175972.12171.14.camel@twins> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: David Rientjes , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On 10/25/2012 10:39 AM, Peter Zijlstra wrote: > On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote: >> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: >>> Ok, this looks the same but it's actually a different issue: >>> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, >>> calls get_vma_policy() which may take the shared policy mutex. This >>> happens while holding page_table_lock from do_huge_pmd_numa_page() but >>> also from do_numa_page() while holding a spinlock on the ptl, which is >>> coming from the sched/numa branch. >>> >>> Is there anyway that we can avoid changing the shared policy mutex back >>> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race >>> in shared_policy_replace()"])? >>> >>> Adding Peter, Rik, and Mel to the cc. 
>> >> Urgh, crud I totally missed that. >> >> So the problem is that we need to compute if the current page is placed >> 'right' while holding pte_lock in order to avoid multiple pte_lock >> acquisitions on the 'fast' path. >> >> I'll look into this in a bit, but one thing that comes to mind is having >> both a spinlock and a mutex and require holding both for modification >> while either one is sufficient for read. >> >> That would allow sp_lookup() to use the spinlock, while insert and >> replace can hold both. >> >> Not sure it will work for this, need to stare at this code a little >> more. > > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? > > --- It made the warnings I've reported go away. Thanks, Sasha 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx170.postini.com [74.125.245.170]) by kanga.kvack.org (Postfix) with SMTP id 043CB6B0071 for ; Thu, 25 Oct 2012 16:22:02 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa10so1576790pad.14 for ; Thu, 25 Oct 2012 13:22:02 -0700 (PDT) Date: Thu, 25 Oct 2012 13:22:00 -0700 (PDT) From: David Rientjes Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: <1351175972.12171.14.camel@twins> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, 25 Oct 2012, Peter Zijlstra wrote: > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? > It's rather unfortunate that we need to protect modification with a spinlock and a mutex but since sharing was removed in commit 869833f2c5c6 ("mempolicy: remove mempolicy sharing") it requires that sp_alloc() is blockable to do the whole mpol_new() and rebind if necessary, which could require mm->mmap_sem; it's not as simple as just converting all the allocations to GFP_ATOMIC. It looks as though there is no other alternative other than protecting modification with both the spinlock and mutex, which is a clever solution, so it looks good to me, thanks! 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx150.postini.com [74.125.245.150]) by kanga.kvack.org (Postfix) with SMTP id 0A2086B0072 for ; Thu, 25 Oct 2012 19:10:09 -0400 (EDT) Received: by mail-wg0-f45.google.com with SMTP id dq12so1459170wgb.26 for ; Thu, 25 Oct 2012 16:10:08 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <1351175972.12171.14.camel@twins> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> From: Linus Torvalds Date: Thu, 25 Oct 2012 16:09:48 -0700 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: David Rientjes , Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: > > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? Looks reasonable, if annoyingly complex for something that shouldn't be important enough for this. Oh well. However, please check me on this: the need for this is only for linux-next right now, correct? All the current users in my tree are ok with just the mutex, no? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx168.postini.com [74.125.245.168]) by kanga.kvack.org (Postfix) with SMTP id 15A166B0071 for ; Fri, 26 Oct 2012 04:49:05 -0400 (EDT) Message-ID: <1351241323.12171.43.camel@twins> Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps From: Peter Zijlstra Date: Fri, 26 Oct 2012 10:48:43 +0200 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: quoted-printable Mime-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Linus Torvalds Cc: David Rientjes , Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: > On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: > > > > So I think the below should work, we hold the spinlock over both rb-tree > > modification as sp free, this makes mpol_shared_policy_lookup() which > > returns the policy with an incremented refcount work with just the > > spinlock. > > > > Comments? > > Looks reasonable, if annoyingly complex for something that shouldn't > be important enough for this. Oh well. I agree with that.. It's just that when doing numa placement one needs to respect the pre-existing placement constraints. I've not seen a way around this. > However, please check me on this: the need for this is only for > linux-next right now, correct? All the current users in my tree are ok > with just the mutex, no? Yes, the need comes from the numa stuff and I'll stick this patch in there. 
I completely missed Mel's patch turning it into a mutex, but I guess that's what -next is for :-). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx195.postini.com [74.125.245.195]) by kanga.kvack.org (Postfix) with SMTP id 5A4E66B0062 for ; Wed, 31 Oct 2012 14:30:20 -0400 (EDT) Received: by mail-ie0-f169.google.com with SMTP id 10so3152099ied.14 for ; Wed, 31 Oct 2012 11:30:19 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <1351241323.12171.43.camel@twins> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> <1351241323.12171.43.camel@twins> From: Sasha Levin Date: Wed, 31 Oct 2012 14:29:59 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: Linus Torvalds , David Rientjes , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra wrote: > On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: >> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: >> > >> > So I think the below should work, we hold the spinlock over both rb-tree >> > modification as sp free, this makes mpol_shared_policy_lookup() which >> > returns the policy with an incremented refcount work with just the >> > spinlock. >> > >> > Comments? >> >> Looks reasonable, if annoyingly complex for something that shouldn't >> be important enough for this. Oh well. > > I agree with that.. 
Its just that when doing numa placement one needs to > respect the pre-existing placement constraints. I've not seen a way > around this. > >> However, please check me on this: the need for this is only for >> linux-next right now, correct? All the current users in my tree are ok >> with just the mutex, no? > > Yes, the need comes from the numa stuff and I'll stick this patch in > there. > > I completely missed Mel's patch turning it into a mutex, but I guess > that's what -next is for :-). So I've been fuzzing with it for the past couple of days and it's been looking fine with it. Can someone grab it into his tree please? Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx189.postini.com [74.125.245.189]) by kanga.kvack.org (Postfix) with SMTP id EE3B86B007D for ; Tue, 20 Nov 2012 20:00:22 -0500 (EST) Received: by mail-ia0-f169.google.com with SMTP id r4so5804952iaj.14 for ; Tue, 20 Nov 2012 17:00:22 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> <1351241323.12171.43.camel@twins> From: Sasha Levin Date: Tue, 20 Nov 2012 19:59:57 -0500 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: Linus Torvalds , David Rientjes , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Ping? Can someone take it before it's lost? 
On Wed, Oct 31, 2012 at 2:29 PM, Sasha Levin wrote: > On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra wrote: >> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: >>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: >>> > >>> > So I think the below should work, we hold the spinlock over both rb-tree >>> > modification as sp free, this makes mpol_shared_policy_lookup() which >>> > returns the policy with an incremented refcount work with just the >>> > spinlock. >>> > >>> > Comments? >>> >>> Looks reasonable, if annoyingly complex for something that shouldn't >>> be important enough for this. Oh well. >> >> I agree with that.. Its just that when doing numa placement one needs to >> respect the pre-existing placement constraints. I've not seen a way >> around this. >> >>> However, please check me on this: the need for this is only for >>> linux-next right now, correct? All the current users in my tree are ok >>> with just the mutex, no? >> >> Yes, the need comes from the numa stuff and I'll stick this patch in >> there. >> >> I completely missed Mel's patch turning it into a mutex, but I guess >> that's what -next is for :-). > > So I've been fuzzing with it for the past couple of days and it's been > looking fine with it. Can someone grab it into his tree please? > > > Thanks, > Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753826Ab2JHPKA (ORCPT ); Mon, 8 Oct 2012 11:10:00 -0400 Received: from mx1.redhat.com ([209.132.183.28]:39407 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752065Ab2JHPJ5 (ORCPT ); Mon, 8 Oct 2012 11:09:57 -0400 Date: Mon, 8 Oct 2012 11:09:49 -0400 From: Dave Jones To: Linux Kernel Cc: bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: mpol_to_str revisited. Message-ID: <20121008150949.GA15130@redhat.com> Mail-Followup-To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove a user triggerable BUG in mempolicy. Ben Hutchings pointed out to me that my change introduced a potential leak of stack contents to userspace, because none of the callers check the return value. This patch adds the missing return checking, and also clears the buffer beforehand. Reported-by: Ben Hutchings Cc: stable@kernel.org Signed-off-by: Dave Jones --- unanswered question: why are the buffer sizes here different ? which is correct? 
diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400 +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400 @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file struct mm_walk walk = {}; struct mempolicy *pol; int n; + int ret; char buffer[50]; if (!mm) @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file walk.mm = mm; pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); - mpol_to_str(buffer, sizeof(buffer), pol, 0); + memset(buffer, 0, sizeof(buffer)); + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0); + if (ret < 0) + return 0; + mpol_cond_put(pol); seq_printf(m, "%08lx %s", vma->vm_start, buffer); diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm/shmem.c linux-dj/mm/shmem.c --- src/git-trees/kernel/linux/mm/shmem.c 2012-10-02 15:49:51.977277944 -0400 +++ linux-dj/mm/shmem.c 2012-10-04 19:32:28.862949907 -0400 @@ -885,13 +885,15 @@ redirty: static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) { char buffer[64]; + int ret; if (!mpol || mpol->mode == MPOL_DEFAULT) return; /* show nothing */ - mpol_to_str(buffer, sizeof(buffer), mpol, 1); - - seq_printf(seq, ",mpol=%s", buffer); + memset(buffer, 0, sizeof(buffer)); + ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1); + if (ret > 0) + seq_printf(seq, ",mpol=%s", buffer); } static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753922Ab2JHPQC (ORCPT ); Mon, 8 Oct 2012 11:16:02 -0400 Received: from mx1.redhat.com ([209.132.183.28]:37372 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753857Ab2JHPQA (ORCPT ); Mon, 8 Oct 2012 11:16:00 -0400 Date: Mon, 8 Oct 2012 11:15:52 -0400 From: Dave Jones To: 
Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: Re: mpol_to_str revisited. Message-ID: <20121008151552.GA10881@redhat.com> Mail-Followup-To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121008150949.GA15130@redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Oct 08, 2012 at 11:09:49AM -0400, Dave Jones wrote: > Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove > a user triggerable BUG in mempolicy. > > Ben Hutchings pointed out to me that my change introduced a potential leak > of stack contents to userspace, because none of the callers check the return value. > > This patch adds the missing return checking, and also clears the buffer beforehand. > > Reported-by: Ben Hutchings > Cc: stable@kernel.org > Signed-off-by: Dave Jones > > --- > unanswered question: why are the buffer sizes here different ? which is correct? A further unanswered question is how the state got so screwed up that we hit that default case at all. Looking at the original report: https://lkml.org/lkml/2012/9/6/356 What's in RAX looks suspiciously like left-over slab poison. If pol->mode was poisoned, that smells like we have a race where policy is getting freed while another process is reading it. Am I missing something, or is there no locking around that at all ? 
Dave From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754833Ab2JHUf5 (ORCPT ); Mon, 8 Oct 2012 16:35:57 -0400 Received: from mail-da0-f46.google.com ([209.85.210.46]:40996 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754790Ab2JHUfo (ORCPT ); Mon, 8 Oct 2012 16:35:44 -0400 Date: Mon, 8 Oct 2012 13:35:42 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: Re: mpol_to_str revisited. In-Reply-To: <20121008150949.GA15130@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 8 Oct 2012, Dave Jones wrote: > unanswered question: why are the buffer sizes here different ? which is correct? > Given the current set of mempolicy modes and flags, it's 34, but this can change if new modes or flags are added with longer names. I see no reason why shmem shouldn't round up to the nearest power-of-2 of 64 like it already does, but 50 is certainly safe as well in task_mmu.c. 
> diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400 > +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400 > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file > struct mm_walk walk = {}; > struct mempolicy *pol; > int n; > + int ret; > char buffer[50]; > > if (!mm) > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file > walk.mm = mm; > > pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > - mpol_to_str(buffer, sizeof(buffer), pol, 0); > + memset(buffer, 0, sizeof(buffer)); > + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0); > + if (ret < 0) > + return 0; We should need the mpol_cond_put(pol) here before returning. > + > mpol_cond_put(pol); > > seq_printf(m, "%08lx %s", vma->vm_start, buffer); > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm/shmem.c linux-dj/mm/shmem.c > --- src/git-trees/kernel/linux/mm/shmem.c 2012-10-02 15:49:51.977277944 -0400 > +++ linux-dj/mm/shmem.c 2012-10-04 19:32:28.862949907 -0400 > @@ -885,13 +885,15 @@ redirty: > static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol) > { > char buffer[64]; > + int ret; > > if (!mpol || mpol->mode == MPOL_DEFAULT) > return; /* show nothing */ > > - mpol_to_str(buffer, sizeof(buffer), mpol, 1); > - > - seq_printf(seq, ",mpol=%s", buffer); > + memset(buffer, 0, sizeof(buffer)); > + ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1); > + if (ret > 0) > + seq_printf(seq, ",mpol=%s", buffer); > } > > static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754796Ab2JHUqo (ORCPT ); Mon, 8 Oct 2012 16:46:44 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:43750 "EHLO mail-pa0-f46.google.com" 
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753163Ab2JHUqk (ORCPT ); Mon, 8 Oct 2012 16:46:40 -0400 Date: Mon, 8 Oct 2012 13:46:38 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: Re: mpol_to_str revisited. In-Reply-To: <20121008151552.GA10881@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121008151552.GA10881@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 8 Oct 2012, Dave Jones wrote: > If pol->mode was poisoned, that smells like we have a race where policy is getting freed > while another process is reading it. > > Am I missing something, or is there no locking around that at all ? > The only thing that is held during the read() is a reference to the task_struct so it doesn't disappear from under us. The protection needed for a task's mempolicy, however, is task_lock() and that is not held. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754864Ab2JHUwY (ORCPT ); Mon, 8 Oct 2012 16:52:24 -0400 Received: from mx1.redhat.com ([209.132.183.28]:34758 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754753Ab2JHUwV (ORCPT ); Mon, 8 Oct 2012 16:52:21 -0400 Date: Mon, 8 Oct 2012 16:52:13 -0400 From: Dave Jones To: David Rientjes Cc: Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: Re: mpol_to_str revisited. 
Message-ID: <20121008205213.GA23211@redhat.com> Mail-Followup-To: Dave Jones , David Rientjes , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Oct 08, 2012 at 01:35:42PM -0700, David Rientjes wrote: > > unanswered question: why are the buffer sizes here different ? which is correct? > > > Given the current set of mempolicy modes and flags, it's 34, but this can > change if new modes or flags are added with longer names. I see no reason > why shmem shouldn't round up to the nearest power-of-2 of 64 like it > already does, but 50 is certainly safe as well in task_mmu.c. Ok. I'll leave that for now. > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400 > > +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400 > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file > > struct mm_walk walk = {}; > > struct mempolicy *pol; > > int n; > > + int ret; > > char buffer[50]; > > > > if (!mm) > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file > > walk.mm = mm; > > > > pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > > - mpol_to_str(buffer, sizeof(buffer), pol, 0); > > + memset(buffer, 0, sizeof(buffer)); > > + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0); > > + if (ret < 0) > > + return 0; > > We should need the mpol_cond_put(pol) here before returning. good catch. I'll respin the patch later with this changed. 
thanks, Dave From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755373Ab2JPAsg (ORCPT ); Mon, 15 Oct 2012 20:48:36 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:36676 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751248Ab2JPAsf (ORCPT ); Mon, 15 Oct 2012 20:48:35 -0400 Date: Mon, 15 Oct 2012 17:48:33 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: Re: mpol_to_str revisited. In-Reply-To: <20121008205213.GA23211@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121008205213.GA23211@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 8 Oct 2012, Dave Jones wrote: > > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c > > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c 2012-05-31 22:32:46.778150675 -0400 > > > +++ linux-dj/fs/proc/task_mmu.c 2012-10-04 19:31:41.269988984 -0400 > > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file > > > struct mm_walk walk = {}; > > > struct mempolicy *pol; > > > int n; > > > + int ret; > > > char buffer[50]; > > > > > > if (!mm) > > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file > > > walk.mm = mm; > > > > > > pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > > > - mpol_to_str(buffer, sizeof(buffer), pol, 0); > > > + memset(buffer, 0, sizeof(buffer)); > > > + ret = mpol_to_str(buffer, sizeof(buffer), pol, 0); > > > + if (ret < 0) > > > + return 0; > > > > We should need the mpol_cond_put(pol) here before returning. > > good catch. 
I'll respin the patch later with this changed. > Did you get a chance to fix this issue? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753415Ab2JPCfQ (ORCPT ); Mon, 15 Oct 2012 22:35:16 -0400 Received: from mail-oa0-f46.google.com ([209.85.219.46]:40729 "EHLO mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751094Ab2JPCfP (ORCPT ); Mon, 15 Oct 2012 22:35:15 -0400 MIME-Version: 1.0 In-Reply-To: <20121008150949.GA15130@redhat.com> References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Mon, 15 Oct 2012 22:34:53 -0400 Message-ID: Subject: Re: mpol_to_str revisited. To: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Oct 8, 2012 at 11:09 AM, Dave Jones wrote: > Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove > a user triggerable BUG in mempolicy. > > Ben Hutchings pointed out to me that my change introduced a potential leak > of stack contents to userspace, because none of the callers check the return value. > > This patch adds the missing return checking, and also clears the buffer beforehand. I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. we should close a race (or kill remain ref count leak) if we still have. Because of, this patch makes unstable /proc output and might lead to userland confusing. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755689Ab2JPD6h (ORCPT ); Mon, 15 Oct 2012 23:58:37 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:61651 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755655Ab2JPD6g (ORCPT ); Mon, 15 Oct 2012 23:58:36 -0400 Date: Mon, 15 Oct 2012 20:58:33 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: KOSAKI Motohiro cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: Re: mpol_to_str revisited. In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 15 Oct 2012, KOSAKI Motohiro wrote: > I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. It's certainly not a complete fix, but I think it's a much better result of the race, i.e. we don't panic anymore, we simply fail the read() instead. > we should > close a race (or kill remain ref count leak) if we still have. As I mentioned earlier in the thread, the read() is done here on a task while only a reference to the task_struct is taken and we do not hold task_lock() which is required for task->mempolicy. Once that is fixed, mpol_to_str() should never be called for !task->mempolicy so it will never need to return -EINVAL in such a condition. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755840Ab2JPFK4 (ORCPT ); Tue, 16 Oct 2012 01:10:56 -0400 Received: from mail-oa0-f46.google.com ([209.85.219.46]:61342 "EHLO mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755652Ab2JPFKz (ORCPT ); Tue, 16 Oct 2012 01:10:55 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 01:10:34 -0400 Message-ID: Subject: Re: mpol_to_str revisited. To: David Rientjes Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Oct 15, 2012 at 11:58 PM, David Rientjes wrote: > On Mon, 15 Oct 2012, KOSAKI Motohiro wrote: > >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. > > It's certainly not a complete fix, but I think it's a much better result > of the race, i.e. we don't panic anymore, we simply fail the read() > instead. Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring to caller complex. That's not good and have no worth. >> we should >> close a race (or kill remain ref count leak) if we still have. > > As I mentioned earlier in the thread, the read() is done here on a task > while only a reference to the task_struct is taken and we do not hold > task_lock() which is required for task->mempolicy. Once that is fixed, > mpol_to_str() should never be called for !task->mempolicy so it will never > need to return -EINVAL in such a condition. I agree that's obviously a bug and we should fix it. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753548Ab2JPGKO (ORCPT ); Tue, 16 Oct 2012 02:10:14 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:37705 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751317Ab2JPGKM (ORCPT ); Tue, 16 Oct 2012 02:10:12 -0400 Date: Mon, 15 Oct 2012 23:10:09 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: KOSAKI Motohiro cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: Re: mpol_to_str revisited. In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. > > > > It's certainly not a complete fix, but I think it's a much better result > > of the race, i.e. we don't panic anymore, we simply fail the read() > > instead. > > Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring > to caller complex. That's not good and have no worth. > Before: the kernel panics, all workloads cease. After: the file shows garbage, all workloads continue. This is better, in my opinion, but at best it's only a judgment call and has no effect on anything. I agree it would be better to respect the return value of mpol_to_str() since there are other possible error conditions other than a freed mempolicy, but let's not consider reverting 80de7c3138. It is obviously not a full solution to the problem, though, and we need to serialize with task_lock(). Dave, are you interested in coming up with a patch? 
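The caller-side pattern argued for above (clear the buffer, respect the return value, and drop the policy reference on every path, including the error path) can be sketched in userspace C. This is only an illustration under stated assumptions: `struct policy`, `policy_to_str()`, `policy_put()` and `show_policy()` are hypothetical stand-ins invented here, not the kernel's `struct mempolicy`, `mpol_to_str()` or `mpol_cond_put()`, and the mode table is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a policy object: a mode and a reference count. */
struct policy {
    int mode;      /* index into mode_names; garbage if the object was freed */
    int refcount;
};

/* Made-up mode table, for illustration only. */
static const char *const mode_names[] = { "default", "prefer", "bind", "interleave" };
#define NR_MODES ((int)(sizeof(mode_names) / sizeof(mode_names[0])))

/*
 * Hypothetical analogue of a "policy to string" helper: returns the number
 * of characters written, or a negative errno. On error, buf is not written.
 */
int policy_to_str(char *buf, size_t len, const struct policy *pol)
{
    size_t need;

    assert(buf != NULL && pol != NULL);
    if (pol->mode < 0 || pol->mode >= NR_MODES)
        return -EINVAL;           /* invalid (possibly poisoned) mode */
    need = strlen(mode_names[pol->mode]);
    if (need + 1 > len)
        return -ENOSPC;           /* buffer too small */
    memcpy(buf, mode_names[pol->mode], need + 1);
    return (int)need;
}

/* Hypothetical reference drop. */
void policy_put(struct policy *pol)
{
    pol->refcount--;
}

/*
 * Caller: formats "mpol=<mode>" into out. The local buffer is zeroed first,
 * the reference is dropped before any return, and nothing is printed when
 * the conversion fails, so uninitialized stack bytes never reach out.
 */
int show_policy(char *out, size_t outlen, struct policy *pol)
{
    char buffer[64];
    int ret;

    memset(buffer, 0, sizeof(buffer));
    ret = policy_to_str(buffer, sizeof(buffer), pol);
    policy_put(pol);              /* dropped on success and on error alike */
    if (ret < 0)
        return ret;
    return snprintf(out, outlen, "mpol=%s", buffer);
}
```

Calling `show_policy()` with a valid mode fills `out`; calling it with a mode such as 0x6b6b6b6b (resembling a slab-poisoned object) returns -EINVAL, leaves `out` alone, and still drops the reference. This does nothing about the underlying race, which still needs proper serialization.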
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932091Ab2JPXjv (ORCPT ); Tue, 16 Oct 2012 19:39:51 -0400 Received: from mail-ob0-f174.google.com ([209.85.214.174]:42845 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755824Ab2JPXju (ORCPT ); Tue, 16 Oct 2012 19:39:50 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 19:39:29 -0400 Message-ID: Subject: Re: mpol_to_str revisited. To: David Rientjes Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 16, 2012 at 2:10 AM, David Rientjes wrote: > On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. >> > >> > It's certainly not a complete fix, but I think it's a much better result >> > of the race, i.e. we don't panic anymore, we simply fail the read() >> > instead. >> >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring >> to caller complex. That's not good and have no worth. >> > > Before: the kernel panics, all workloads cease. > After: the file shows garbage, all workloads continue. > > This is better, in my opinion, but at best it's only a judgment call and > has no effect on anything. Kernel panics help to find our serious mistake. > I agree it would be better to respect the return value of mpol_to_str() > since there are other possible error conditions other than a freed > mempolicy, but let's not consider reverting 80de7c3138. It is obviously > not a full solution to the problem, though, and we need to serialize with > task_lock(). Sorry no. I will have to revert it. 
mempolicy have already a lot of meaningless complex and bring us a lot of problems. I haven't seen any reason adding more. > Dave, are you interested in coming up with a patch? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755887Ab2JQAMx (ORCPT ); Tue, 16 Oct 2012 20:12:53 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:59752 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755476Ab2JQAMw (ORCPT ); Tue, 16 Oct 2012 20:12:52 -0400 Date: Tue, 16 Oct 2012 17:12:50 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: KOSAKI Motohiro cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Subject: Re: mpol_to_str revisited. In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring > >> to caller complex. That's not good and have no worth. > >> > > > > Before: the kernel panics, all workloads cease. > > After: the file shows garbage, all workloads continue. > > > > This is better, in my opinion, but at best it's only a judgment call and > > has no effect on anything. > > Kernel panics help to find our serious mistake. > Kernel panics are not your little debugging tool to let users suffer through for non-fatal issues. > > I agree it would be better to respect the return value of mpol_to_str() > > since there are other possible error conditions other than a freed > > mempolicy, but let's not consider reverting 80de7c3138. 
It is obviously > > not a full solution to the problem, though, and we need to serialize with > > task_lock(). > > Sorry no. I will have to revert it. Feel free to revert anything you wish in your own tree, I couldn't care less. If you try to propose it upstream, Andrew will surely ask you to justify the BUG(), good luck on that. I'll reply to this message with the fix that I think is best. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755912Ab2JQAb2 (ORCPT ); Tue, 16 Oct 2012 20:31:28 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:32998 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755858Ab2JQAb0 (ORCPT ); Tue, 16 Oct 2012 20:31:26 -0400 Date: Tue, 16 Oct 2012 17:31:23 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Andrew Morton , Linus Torvalds cc: Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When reading /proc/pid/numa_maps, it's possible to return the contents of the stack where the mempolicy string should be printed if the policy gets freed from beneath us. This happens because mpol_to_str() may return an error, and the stack-allocated buffer is then printed without ever being stored to. There are two possible error conditions in mpol_to_str(): - if the buffer allocated is insufficient for the string to be stored, and - if the mempolicy has an invalid mode. 
The first error condition is not triggered in any of the callers to mpol_to_str(): at least 50 bytes is always allocated on the stack and this is sufficient for the string to be written. A future patch should convert this into BUILD_BUG_ON() since we know the maximum strlen possible, but that's not -rc material. The second error condition is possible if a race occurs in dropping a reference to a task's mempolicy causing it to be freed during the read(). The slab poison value is then used for the mode and mpol_to_str() returns -EINVAL. This race is only possible because get_vma_policy() believes that mm->mmap_sem protects task->mempolicy, which isn't true. The exit path does not hold mm->mmap_sem when dropping the reference or setting task->mempolicy to NULL: it uses task_lock(task) instead. Thus, it's required for the caller of a task mempolicy to hold task_lock(task) while grabbing the mempolicy and reading it. Callers with a vma policy store their mempolicy earlier and can simply increment the reference count so it's guaranteed not to be freed. 
Reported-by: Dave Jones Signed-off-by: David Rientjes --- fs/proc/task_mmu.c | 7 +++++-- mm/mempolicy.c | 5 ++--- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) struct vm_area_struct *vma = v; struct numa_maps *md = &numa_priv->md; struct file *file = vma->vm_file; + struct task_struct *task = proc_priv->task; struct mm_struct *mm = vma->vm_mm; struct mm_walk walk = {}; struct mempolicy *pol; @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); + task_lock(task); + pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); mpol_cond_put(pol); + task_unlock(task); seq_printf(m, "%08lx %s", vma->vm_start, buffer); @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { seq_printf(m, " heap"); } else { - pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid); + pid_t tid = vm_is_stack(task, vma, is_pid); if (tid != 0) { /* * Thread stack in /proc/PID/task/TID/maps or diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 0b78fb9..d04a8a5 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies - * are protected by the task's mmap_sem, which must be held for read by - * the caller. + * Current or other task's task mempolicy and non-shared vma policies must be + * protected by task_lock(task) by the caller. 
* Shared policies [those marked as MPOL_F_SHARED] require an extra reference * count--added by the get_policy() vm_op, as appropriate--to protect against * freeing by another task. It is the caller's responsibility to free the From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755938Ab2JQBeQ (ORCPT ); Tue, 16 Oct 2012 21:34:16 -0400 Received: from mail-ob0-f174.google.com ([209.85.214.174]:47718 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755895Ab2JQBeP (ORCPT ); Tue, 16 Oct 2012 21:34:15 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 21:33:55 -0400 Message-ID: Subject: Re: mpol_to_str revisited. To: David Rientjes Cc: Dave Jones , Linux Kernel , bhutchings@solarflare.com, linux-mm@kvack.org, Linus Torvalds , Andrew Morton Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 16, 2012 at 8:12 PM, David Rientjes wrote: > On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring >> >> to caller complex. That's not good and have no worth. >> >> >> > >> > Before: the kernel panics, all workloads cease. >> > After: the file shows garbage, all workloads continue. >> > >> > This is better, in my opinion, but at best it's only a judgment call and >> > has no effect on anything. >> >> Kernel panics help to find our serious mistake. > > Kernel panics are not your little debugging tool to let users suffer > through for non-fatal issues. use after free is fatal, no doubt. > >> > I agree it would be better to respect the return value of mpol_to_str() >> > since there are other possible error conditions other than a freed >> > mempolicy, but let's not consider reverting 80de7c3138. 
It is obviously >> > not a full solution to the problem, though, and we need to serialize with >> > task_lock(). >> >> Sorry no. I will have to revert it. > > Feel free to revert anything you wish in your own tree, I couldn't care > less. If you try to propose it upstream, Andrew will surely ask you to > justify the BUG(), good luck on that. Yeah. I'm ok just remove both BUG() and EINVAL, but current situation (i.e. ignoring EINVAL by caller) is surely bad. So, just revert is best IMHO. > > I'll reply to this message with the fix that I think is best. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756058Ab2JQBiv (ORCPT ); Tue, 16 Oct 2012 21:38:51 -0400 Received: from mail-ob0-f174.google.com ([209.85.214.174]:61537 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755928Ab2JQBis (ORCPT ); Tue, 16 Oct 2012 21:38:48 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 21:38:26 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: David Rientjes Cc: Andrew Morton , Linus Torvalds , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 16, 2012 at 8:31 PM, David Rientjes wrote: > When reading /proc/pid/numa_maps, it's possible to return the contents of > the stack where the mempolicy string should be printed if the policy gets > freed from beneath us. > > This happens because mpol_to_str() may return an error the > stack-allocated buffer is then printed without ever being stored. 
> > There are two possible error conditions in mpol_to_str(): > > - if the buffer allocated is insufficient for the string to be stored, > and > > - if the mempolicy has an invalid mode. > > The first error condition is not triggered in any of the callers to > mpol_to_str(): at least 50 bytes is always allocated on the stack and this > is sufficient for the string to be written. A future patch should convert > this into BUILD_BUG_ON() since we know the maximum strlen possible, but > that's not -rc material. > > The second error condition is possible if a race occurs in dropping a > reference to a task's mempolicy causing it to be freed during the read(). > The slab poison value is then used for the mode and mpol_to_str() returns > -EINVAL. > > This race is only possible because get_vma_policy() believes that > mm->mmap_sem protects task->mempolicy, which isn't true. The exit path > does not hold mm->mmap_sem when dropping the reference or setting > task->mempolicy to NULL: it uses task_lock(task) instead. > > Thus, it's required for the caller of a task mempolicy to hold > task_lock(task) while grabbing the mempolicy and reading it. Callers with > a vma policy store their mempolicy earlier and can simply increment the > reference count so it's guaranteed not to be freed. 
> > Reported-by: Dave Jones > Signed-off-by: David Rientjes > --- > fs/proc/task_mmu.c | 7 +++++-- > mm/mempolicy.c | 5 ++--- > 2 files changed, 7 insertions(+), 5 deletions(-) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) > struct vm_area_struct *vma = v; > struct numa_maps *md = &numa_priv->md; > struct file *file = vma->vm_file; > + struct task_struct *task = proc_priv->task; > struct mm_struct *mm = vma->vm_mm; > struct mm_walk walk = {}; > struct mempolicy *pol; > @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) > walk.private = md; > walk.mm = mm; > > - pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > + task_lock(task); > + pol = get_vma_policy(task, vma, vma->vm_start); > mpol_to_str(buffer, sizeof(buffer), pol, 0); > mpol_cond_put(pol); > + task_unlock(task); > > seq_printf(m, "%08lx %s", vma->vm_start, buffer); > > @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) > } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) { > seq_printf(m, " heap"); > } else { > - pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid); > + pid_t tid = vm_is_stack(task, vma, is_pid); > if (tid != 0) { > /* > * Thread stack in /proc/PID/task/TID/maps or > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 0b78fb9..d04a8a5 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, > * > * Returns effective policy for a VMA at specified address. > * Falls back to @task or system default policy, as necessary. > - * Current or other task's task mempolicy and non-shared vma policies > - * are protected by the task's mmap_sem, which must be held for read by > - * the caller. 
> + * Current or other task's task mempolicy and non-shared vma policies must be > + * protected by task_lock(task) by the caller. This is not correct. mmap_sem is needed for protecting vma. task_lock() is needed to close vs exit race only when task != current. In other word, caller must held both mmap_sem and task_lock if task != current. > * Shared policies [those marked as MPOL_F_SHARED] require an extra reference > * count--added by the get_policy() vm_op, as appropriate--to protect against > * freeing by another task. It is the caller's responsibility to free the From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756043Ab2JQBtG (ORCPT ); Tue, 16 Oct 2012 21:49:06 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:56180 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755923Ab2JQBtE (ORCPT ); Tue, 16 Oct 2012 21:49:04 -0400 Date: Tue, 16 Oct 2012 18:49:00 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: KOSAKI Motohiro cc: Andrew Morton , Linus Torvalds , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 0b78fb9..d04a8a5 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, > > * > > * Returns effective policy for a VMA at 
specified address. > > * Falls back to @task or system default policy, as necessary. > > - * Current or other task's task mempolicy and non-shared vma policies > > - * are protected by the task's mmap_sem, which must be held for read by > > - * the caller. > > + * Current or other task's task mempolicy and non-shared vma policies must be > > + * protected by task_lock(task) by the caller. > > This is not correct. mmap_sem is needed for protecting vma. task_lock() > is needed to close vs exit race only when task != current. In other word, > caller must held both mmap_sem and task_lock if task != current. > The comment is specifically addressing non-shared vma policies, you do not need to hold mmap_sem to access another thread's mempolicy. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756074Ab2JQBxX (ORCPT ); Tue, 16 Oct 2012 21:53:23 -0400 Received: from mail-oa0-f46.google.com ([209.85.219.46]:32966 "EHLO mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755962Ab2JQBxW (ORCPT ); Tue, 16 Oct 2012 21:53:22 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> From: KOSAKI Motohiro Date: Tue, 16 Oct 2012 21:53:02 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: David Rientjes Cc: Andrew Morton , Linus Torvalds , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 16, 2012 at 9:49 PM, David Rientjes wrote: > On Tue, 16 Oct 2012, KOSAKI Motohiro wrote: > >> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> > index 0b78fb9..d04a8a5 100644 >> > --- a/mm/mempolicy.c >> > +++ b/mm/mempolicy.c >> > @@ -1536,9 
+1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, >> > * >> > * Returns effective policy for a VMA at specified address. >> > * Falls back to @task or system default policy, as necessary. >> > - * Current or other task's task mempolicy and non-shared vma policies >> > - * are protected by the task's mmap_sem, which must be held for read by >> > - * the caller. >> > + * Current or other task's task mempolicy and non-shared vma policies must be >> > + * protected by task_lock(task) by the caller. >> >> This is not correct. mmap_sem is needed for protecting vma. task_lock() >> is needed to close vs exit race only when task != current. In other word, >> caller must held both mmap_sem and task_lock if task != current. > > The comment is specifically addressing non-shared vma policies, you do not > need to hold mmap_sem to access another thread's mempolicy. I didn't say old comment is true. I just only your new comment also false. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751501Ab2JQEFu (ORCPT ); Wed, 17 Oct 2012 00:05:50 -0400 Received: from mx1.redhat.com ([209.132.183.28]:33420 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750798Ab2JQEFs (ORCPT ); Wed, 17 Oct 2012 00:05:48 -0400 Date: Wed, 17 Oct 2012 00:05:15 -0400 From: Dave Jones To: David Rientjes Cc: Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Message-ID: <20121017040515.GA13505@redhat.com> Mail-Followup-To: Dave Jones , David Rientjes , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , 
linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20121008150949.GA15130@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 16, 2012 at 05:31:23PM -0700, David Rientjes wrote: > - pol = get_vma_policy(proc_priv->task, vma, vma->vm_start); > + task_lock(task); > + pol = get_vma_policy(task, vma, vma->vm_start); > mpol_to_str(buffer, sizeof(buffer), pol, 0); > mpol_cond_put(pol); > + task_unlock(task); This seems to cause some fallout for me.. BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 3 locks on stack by trinity-child2/8558: #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [] seq_lseek+0x3f/0x120 #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 Call Trace: [] __might_sleep+0x14c/0x200 [] mutex_lock_nested+0x2e/0x50 [] mpol_shared_policy_lookup+0x33/0x90 [] shmem_get_policy+0x33/0x40 [] get_vma_policy+0x3a/0x90 [] show_numa_map+0x163/0x610 [] ? pid_maps_open+0x20/0x20 [] ? pagemap_hugetlb_range+0xf0/0xf0 [] show_pid_numa_map+0x13/0x20 [] traverse+0xf2/0x230 [] seq_lseek+0xab/0x120 [] sys_lseek+0x7b/0xb0 [] tracesys+0xe1/0xe6 same problem, different syscall.. 
BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 21996, name: trinity-child3 3 locks on stack by trinity-child3/21996: #0: held: (&p->lock){+.+.+.}, instance: ffff88008d712c08, at: [] seq_read+0x3d/0x3e0 #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 Pid: 21996, comm: trinity-child3 Not tainted 3.7.0-rc1+ #32 Call Trace: [] __might_sleep+0x14c/0x200 [] mutex_lock_nested+0x2e/0x50 [] mpol_shared_policy_lookup+0x33/0x90 [] shmem_get_policy+0x33/0x40 [] get_vma_policy+0x3a/0x90 [] show_numa_map+0x163/0x610 [] ? pid_maps_open+0x20/0x20 [] ? pagemap_hugetlb_range+0xf0/0xf0 [] show_pid_numa_map+0x13/0x20 [] traverse+0xf2/0x230 [] seq_read+0x34b/0x3e0 [] ? seq_lseek+0x120/0x120 [] do_loop_readv_writev+0x5a/0x90 [] do_readv_writev+0x1c1/0x1e0 [] ? get_parent_ip+0x11/0x50 [] vfs_readv+0x35/0x60 [] sys_preadv+0xc2/0xe0 [] tracesys+0xe1/0xe6 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752090Ab2JQFYh (ORCPT ); Wed, 17 Oct 2012 01:24:37 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:44341 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751368Ab2JQFYg (ORCPT ); Wed, 17 Oct 2012 01:24:36 -0400 Date: Tue, 16 Oct 2012 22:24:32 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: <20121017040515.GA13505@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> 
<20121017040515.GA13505@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 17 Oct 2012, Dave Jones wrote: > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 > 3 locks on stack by trinity-child2/8558: > #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [] seq_lseek+0x3f/0x120 > #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 > #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 > Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 > Call Trace: > [] __might_sleep+0x14c/0x200 > [] mutex_lock_nested+0x2e/0x50 > [] mpol_shared_policy_lookup+0x33/0x90 > [] shmem_get_policy+0x33/0x40 > [] get_vma_policy+0x3a/0x90 > [] show_numa_map+0x163/0x610 > [] ? pid_maps_open+0x20/0x20 > [] ? pagemap_hugetlb_range+0xf0/0xf0 > [] show_pid_numa_map+0x13/0x20 > [] traverse+0xf2/0x230 > [] seq_lseek+0xab/0x120 > [] sys_lseek+0x7b/0xb0 > [] tracesys+0xe1/0xe6 > Hmm, looks like we need to change the refcount semantics entirely. We'll need to make get_vma_policy() always take a reference and then drop it accordingly. This works if get_vma_policy() can grab a reference while holding task_lock() for the task policy fallback case. Comments on this approach? 
--- fs/proc/task_mmu.c | 4 +--- include/linux/mm.h | 3 +-- mm/hugetlb.c | 4 ++-- mm/mempolicy.c | 41 ++++++++++++++++++++++------------------- 4 files changed, 26 insertions(+), 26 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - task_unlock(task); + __mpol_put(pol); seq_printf(m, "%08lx %s", vma->vm_start, buffer); diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -216,8 +216,7 @@ struct vm_operations_struct { * get_policy() op must add reference [mpol_get()] to any policy at * (vma,addr) marked as MPOL_SHARED. The shared policy infrastructure * in mm/mempolicy.c will do this automatically. - * get_policy() must NOT add a ref if the policy at (vma,addr) is not - * marked as MPOL_SHARED. vma policies are protected by the mmap_sem. + * vma policies are protected by the mmap_sem. * If no [shared/vma] mempolicy exists at the addr, get_policy() op * must return NULL--i.e., do not "fallback" to task or system default * policy. diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -568,13 +568,13 @@ retry_cpuset: } } - mpol_cond_put(mpol); + __mpol_put(mpol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; err: - mpol_cond_put(mpol); + __mpol_put(mpol); return NULL; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1536,39 +1536,41 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, * * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. 
- * Current or other task's task mempolicy and non-shared vma policies must be - * protected by task_lock(task) by the caller. - * Shared policies [those marked as MPOL_F_SHARED] require an extra reference - * count--added by the get_policy() vm_op, as appropriate--to protect against - * freeing by another task. It is the caller's responsibility to free the - * extra reference for shared policies. + * Increments the reference count of the returned mempolicy, it is the caller's + * responsibility to decrement with __mpol_put(). + * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes + * task_lock(task) for task policy fallback. */ struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { - struct mempolicy *pol = task->mempolicy; + struct mempolicy *pol; + + task_lock(task); + pol = task->mempolicy; + mpol_get(pol); + task_unlock(task); if (vma) { if (vma->vm_ops && vma->vm_ops->get_policy) { struct mempolicy *vpol = vma->vm_ops->get_policy(vma, addr); - if (vpol) + if (vpol) { + mpol_put(pol); pol = vpol; + if (!mpol_needs_cond_ref(pol)) + mpol_get(pol); + } } else if (vma->vm_policy) { + mpol_put(pol); pol = vma->vm_policy; - - /* - * shmem_alloc_page() passes MPOL_F_SHARED policy with - * a pseudo vma whose vma->vm_ops=NULL. 
Take a reference - * count on these policies which will be dropped by - * mpol_cond_put() later - */ - if (mpol_needs_cond_ref(pol)) - mpol_get(pol); + mpol_get(pol); } } - if (!pol) + if (!pol) { pol = &default_policy; + mpol_get(pol); + } return pol; } @@ -1919,7 +1921,7 @@ retry_cpuset: unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); - mpol_cond_put(pol); + __mpol_put(pol); page = alloc_page_interleave(gfp, order, nid); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; @@ -1943,6 +1945,7 @@ retry_cpuset: */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); + __mpol_put(pol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752743Ab2JQFnb (ORCPT ); Wed, 17 Oct 2012 01:43:31 -0400 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:46974 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751974Ab2JQFna (ORCPT ); Wed, 17 Oct 2012 01:43:30 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <507E4531.1070700@jp.fujitsu.com> Date: Wed, 17 Oct 2012 14:42:09 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121010 Thunderbird/16.0.1 MIME-Version: 1.0 To: David Rientjes CC: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org (2012/10/17 14:24), David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > >> BUG: sleeping function called from invalid context at kernel/mutex.c:269 >> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 >> 3 locks on stack by trinity-child2/8558: >> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [] seq_lseek+0x3f/0x120 >> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 >> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 >> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 >> Call Trace: >> [] __might_sleep+0x14c/0x200 >> [] mutex_lock_nested+0x2e/0x50 >> [] mpol_shared_policy_lookup+0x33/0x90 >> [] shmem_get_policy+0x33/0x40 >> [] get_vma_policy+0x3a/0x90 >> [] show_numa_map+0x163/0x610 >> [] ? pid_maps_open+0x20/0x20 >> [] ? pagemap_hugetlb_range+0xf0/0xf0 >> [] show_pid_numa_map+0x13/0x20 >> [] traverse+0xf2/0x230 >> [] seq_lseek+0xab/0x120 >> [] sys_lseek+0x7b/0xb0 >> [] tracesys+0xe1/0xe6 >> > > Hmm, looks like we need to change the refcount semantics entirely. We'll > need to make get_vma_policy() always take a reference and then drop it > accordingly. This works if get_vma_policy() can grab a reference while > holding task_lock() for the task policy fallback case. > > Comments on this approach? I think this refcounting is better than using task_lock(). 
Thanks, -Kame From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756311Ab2JQItZ (ORCPT ); Wed, 17 Oct 2012 04:49:25 -0400 Received: from mail-ob0-f174.google.com ([209.85.214.174]:60022 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751784Ab2JQItX (ORCPT ); Wed, 17 Oct 2012 04:49:23 -0400 MIME-Version: 1.0 In-Reply-To: <507E4531.1070700@jp.fujitsu.com> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <507E4531.1070700@jp.fujitsu.com> From: KOSAKI Motohiro Date: Wed, 17 Oct 2012 04:49:02 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: Kamezawa Hiroyuki Cc: David Rientjes , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 1:42 AM, Kamezawa Hiroyuki wrote: > (2012/10/17 14:24), David Rientjes wrote: >> >> On Wed, 17 Oct 2012, Dave Jones wrote: >> >>> BUG: sleeping function called from invalid context at kernel/mutex.c:269 >>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 >>> 3 locks on stack by trinity-child2/8558: >>> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: >>> [] seq_lseek+0x3f/0x120 >>> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: >>> [] m_start+0xa7/0x190 >>> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: >>> ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 >>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 >>> Call Trace: >>> [] __might_sleep+0x14c/0x200 >>> [] mutex_lock_nested+0x2e/0x50 >>> [] mpol_shared_policy_lookup+0x33/0x90 >>> [] shmem_get_policy+0x33/0x40 >>> [] 
get_vma_policy+0x3a/0x90 >>> [] show_numa_map+0x163/0x610 >>> [] ? pid_maps_open+0x20/0x20 >>> [] ? pagemap_hugetlb_range+0xf0/0xf0 >>> [] show_pid_numa_map+0x13/0x20 >>> [] traverse+0xf2/0x230 >>> [] seq_lseek+0xab/0x120 >>> [] sys_lseek+0x7b/0xb0 >>> [] tracesys+0xe1/0xe6 >>> >> >> Hmm, looks like we need to change the refcount semantics entirely. We'll >> need to make get_vma_policy() always take a reference and then drop it >> accordingly. This works if get_vma_policy() can grab a reference while >> holding task_lock() for the task policy fallback case. >> >> Comments on this approach? > > > I think this refcounting is better than using task_lock(). I don't think so. get_vma_policy() is used from fast path. In other words, number of atomic ops is sensible for allocation performance. Instead, I'd like to use spinlock for shared mempolicy instead of mutex. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757598Ab2JQSOn (ORCPT ); Wed, 17 Oct 2012 14:14:43 -0400 Received: from mx1.redhat.com ([209.132.183.28]:2443 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754716Ab2JQSOm (ORCPT ); Wed, 17 Oct 2012 14:14:42 -0400 Date: Wed, 17 Oct 2012 14:14:13 -0400 From: Dave Jones To: David Rientjes Cc: Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Message-ID: <20121017181413.GA16805@redhat.com> Mail-Followup-To: Dave Jones , David Rientjes , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> 
MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > > Hmm, looks like we need to change the refcount semantics entirely. We'll > need to make get_vma_policy() always take a reference and then drop it > accordingly. This works if get_vma_policy() can grab a reference while > holding task_lock() for the task policy fallback case. > > Comments on this approach? Seems to be surviving my testing at least.. Dave From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932537Ab2JQTVY (ORCPT ); Wed, 17 Oct 2012 15:21:24 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:48240 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932098Ab2JQTVX (ORCPT ); Wed, 17 Oct 2012 15:21:23 -0400 Date: Wed, 17 Oct 2012 12:21:10 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: <20121017181413.GA16805@redhat.com> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 17 Oct 2012, 
Dave Jones wrote: > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > > On Wed, 17 Oct 2012, Dave Jones wrote: > > > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > > > > Hmm, looks like we need to change the refcount semantics entirely. We'll > > need to make get_vma_policy() always take a reference and then drop it > > accordingly. This works if get_vma_policy() can grab a reference while > > holding task_lock() for the task policy fallback case. > > > > Comments on this approach? > > Seems to be surviving my testing at least.. > Sounds good. Is it possible to verify that policy_cache isn't getting larger than normal in /proc/slabinfo, i.e. when all processes with a task mempolicy or shared vma policy have exited, are there still a significant number of active objects? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757698Ab2JQTcp (ORCPT ); Wed, 17 Oct 2012 15:32:45 -0400 Received: from mx1.redhat.com ([209.132.183.28]:53065 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756564Ab2JQTcn (ORCPT ); Wed, 17 Oct 2012 15:32:43 -0400 Date: Wed, 17 Oct 2012 15:32:29 -0400 From: Dave Jones To: David Rientjes Cc: Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Message-ID: <20121017193229.GC16805@redhat.com> Mail-Followup-To: Dave Jones , David Rientjes , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> MIME-Version: 1.0 Content-Type: 
text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 12:21:10PM -0700, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > > > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote: > > > On Wed, 17 Oct 2012, Dave Jones wrote: > > > > > > > BUG: sleeping function called from invalid context at kernel/mutex.c:269 > > > > > > Hmm, looks like we need to change the refcount semantics entirely. We'll > > > need to make get_vma_policy() always take a reference and then drop it > > > accordingly. This works if get_vma_policy() can grab a reference while > > > holding task_lock() for the task policy fallback case. > > > > > > Comments on this approach? > > > > Seems to be surviving my testing at least.. > > > > Sounds good. Is it possible to verify that policy_cache isn't getting > larger than normal in /proc/slabinfo, i.e. when all processes with a > task mempolicy or shared vma policy have exited, are there still a > significant number of active objects? Killing the fuzzer caused it to drop dramatically. 
Before: (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 2931 2967 376 43 4 : tunables 0 0 0 : slabdata 69 69 0 numa_policy 2971 6545 464 35 4 : tunables 0 0 0 : slabdata 187 187 0 After: (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy shared_policy_node 0 215 376 43 4 : tunables 0 0 0 : slabdata 5 5 0 numa_policy 15 175 464 35 4 : tunables 0 0 0 : slabdata 5 5 0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757781Ab2JQTi7 (ORCPT ); Wed, 17 Oct 2012 15:38:59 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:62782 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757568Ab2JQTi5 (ORCPT ); Wed, 17 Oct 2012 15:38:57 -0400 Date: Wed, 17 Oct 2012 12:38:55 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: <20121017193229.GC16805@redhat.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 17 Oct 2012, Dave Jones wrote: > > Sounds good. Is it possible to verify that policy_cache isn't getting > > larger than normal in /proc/slabinfo, i.e. when all processes with a > > task mempolicy or shared vma policy have exited, are there still a > > significant number of active objects? 
> > Killing the fuzzer caused it to drop dramatically. > > Before: > (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy > shared_policy_node 2931 2967 376 43 4 : tunables 0 0 0 : slabdata 69 69 0 > numa_policy 2971 6545 464 35 4 : tunables 0 0 0 : slabdata 187 187 0 > > After: > (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo | grep policy > shared_policy_node 0 215 376 43 4 : tunables 0 0 0 : slabdata 5 5 0 > numa_policy 15 175 464 35 4 : tunables 0 0 0 : slabdata 5 5 0 > Excellent, thanks. This shows that the refcounting is working properly and we're not leaking any references as a result of this change causing the mempolicies to never be freed. ("numa_policy" turns out to be policy_cache in the code, so thanks for checking both of them.) Could I add your tested-by? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932357Ab2JQTpU (ORCPT ); Wed, 17 Oct 2012 15:45:20 -0400 Received: from mx1.redhat.com ([209.132.183.28]:4505 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752794Ab2JQTpT (ORCPT ); Wed, 17 Oct 2012 15:45:19 -0400 Date: Wed, 17 Oct 2012 15:45:01 -0400 From: Dave Jones To: David Rientjes Cc: Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Message-ID: <20121017194501.GA24400@redhat.com> Mail-Followup-To: Dave Jones , David Rientjes , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> 
MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 12:38:55PM -0700, David Rientjes wrote: > > > Sounds good. Is it possible to verify that policy_cache isn't getting > > > larger than normal in /proc/slabinfo, i.e. when all processes with a > > > task mempolicy or shared vma policy have exited, are there still a > > > significant number of active objects? > > > > Killing the fuzzer caused it to drop dramatically. > > > Excellent, thanks. This shows that the refcounting is working properly > and we're not leaking any references as a result of this change causing > the mempolicies to never be freed. ("numa_policy" turns out to be > policy_cache in the code, so thanks for checking both of them.) > > Could I add your tested-by? Sure. Here's a fresh one I just baked. Tested-by: Dave Jones Dave From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757251Ab2JQTuY (ORCPT ); Wed, 17 Oct 2012 15:50:24 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:57418 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756140Ab2JQTuX (ORCPT ); Wed, 17 Oct 2012 15:50:23 -0400 Date: Wed, 17 Oct 2012 12:50:21 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: KOSAKI Motohiro cc: Kamezawa Hiroyuki , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <507E4531.1070700@jp.fujitsu.com> User-Agent: Alpine 2.00 (DEB 
1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: > > I think this refcounting is better than using task_lock(). > > I don't think so. get_vma_policy() is used from fast path. In other > words, number of > atomic ops is sensible for allocation performance. There are enhancements that we can make with refcounting: for instance, we may want to avoid doing it in the super-fast path when the policy is default_policy and then just do if (mpol != &default_policy) mpol_put(mpol); > Instead, I'd like > to use spinlock > for shared mempolicy instead of mutex. > Um, this was just changed to a mutex last week in commit b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() can be done with GFP_KERNEL, so I didn't consider reverting that behavior. Are you nacking that patch, which you acked, now? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757110Ab2JQU2x (ORCPT ); Wed, 17 Oct 2012 16:28:53 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:55656 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753275Ab2JQU2v (ORCPT ); Wed, 17 Oct 2012 16:28:51 -0400 Date: Wed, 17 Oct 2012 13:28:47 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Linus Torvalds , Andrew Morton cc: Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [patch for-3.7] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps In-Reply-To: <20121017194501.GA24400@redhat.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> 
<20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack contents in numa_maps"), the mutex protecting a shared policy can be inadvertently taken while holding task_lock(task). Recently, commit b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()") switched the spinlock within a shared policy to a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on all mempolicies returned by get_vma_policy() so it isn't freed while being passed to mpol_to_str() when reading /proc/pid/numa_maps. This patch only takes task_lock() while dereferencing task->mempolicy in get_vma_policy() to increment its refcount. This ensures it will remain in memory until dropped by __mpol_put() after mpol_to_str() is called. Refcounts of shared policies are grabbed by the ->get_policy() function of the vma, all others will be grabbed directly in get_vma_policy(). Now that this is done, all callers now unconditionally drop the refcount. 
Tested-by: Dave Jones Signed-off-by: David Rientjes --- fs/proc/task_mmu.c | 4 +-- include/linux/mempolicy.h | 12 +------ mm/hugetlb.c | 4 +-- mm/mempolicy.c | 79 +++++++++++++++++++-------------------------- 4 files changed, 38 insertions(+), 61 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - task_unlock(task); + __mpol_put(pol); seq_printf(m, "%08lx %s", vma->vm_start, buffer); diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -73,13 +73,7 @@ static inline void mpol_put(struct mempolicy *pol) */ static inline int mpol_needs_cond_ref(struct mempolicy *pol) { - return (pol && (pol->flags & MPOL_F_SHARED)); -} - -static inline void mpol_cond_put(struct mempolicy *pol) -{ - if (mpol_needs_cond_ref(pol)) - __mpol_put(pol); + return pol->flags & MPOL_F_SHARED; } extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol, @@ -211,10 +205,6 @@ static inline void mpol_put(struct mempolicy *p) { } -static inline void mpol_cond_put(struct mempolicy *pol) -{ -} - static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to, struct mempolicy *from) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -568,13 +568,13 @@ retry_cpuset: } } - mpol_cond_put(mpol); + __mpol_put(mpol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; err: - mpol_cond_put(mpol); + __mpol_put(mpol); return NULL; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -906,7 +906,8 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, } out: - mpol_cond_put(pol); + 
if (mpol_needs_cond_ref(pol)) + __mpol_put(pol); if (vma) up_read(&current->mm->mmap_sem); return err; @@ -1527,48 +1528,52 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, } #endif - -/* - * get_vma_policy(@task, @vma, @addr) - * @task - task for fallback if vma policy == default - * @vma - virtual memory area whose policy is sought - * @addr - address in @vma for shared policy lookup +/** + * get_vma_policy() - return effective policy for a vma at specified address + * @task: task for fallback if vma policy == default_policy + * @vma: virtual memory area whose policy is sought + * @addr: address in @vma for shared policy lookup * - * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies must be - * protected by task_lock(task) by the caller. - * Shared policies [those marked as MPOL_F_SHARED] require an extra reference - * count--added by the get_policy() vm_op, as appropriate--to protect against - * freeing by another task. It is the caller's responsibility to free the - * extra reference for shared policies. + * Increments the reference count of the returned mempolicy, it is the caller's + * responsibility to decrement with __mpol_put(). + * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes + * task_lock(task) for task policy fallback. */ struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { - struct mempolicy *pol = task->mempolicy; + struct mempolicy *pol; + + /* + * Grab a reference before task has the potential to exit and free its + * mempolicy. 
+ */ + task_lock(task); + pol = task->mempolicy; + mpol_get(pol); + task_unlock(task); if (vma) { if (vma->vm_ops && vma->vm_ops->get_policy) { struct mempolicy *vpol = vma->vm_ops->get_policy(vma, addr); - if (vpol) + if (vpol) { + mpol_put(pol); pol = vpol; + if (!mpol_needs_cond_ref(pol)) + mpol_get(pol); + } } else if (vma->vm_policy) { + mpol_put(pol); pol = vma->vm_policy; - - /* - * shmem_alloc_page() passes MPOL_F_SHARED policy with - * a pseudo vma whose vma->vm_ops=NULL. Take a reference - * count on these policies which will be dropped by - * mpol_cond_put() later - */ - if (mpol_needs_cond_ref(pol)) - mpol_get(pol); + mpol_get(pol); } } - if (!pol) + if (!pol) { pol = &default_policy; + mpol_get(pol); + } return pol; } @@ -1919,30 +1924,14 @@ retry_cpuset: unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); - mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) - goto retry_cpuset; - - return page; + goto out; } zl = policy_zonelist(gfp, pol, node); - if (unlikely(mpol_needs_cond_ref(pol))) { - /* - * slow path: ref counted shared policy - */ - struct page *page = __alloc_pages_nodemask(gfp, order, - zl, policy_nodemask(gfp, pol)); - __mpol_put(pol); - if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) - goto retry_cpuset; - return page; - } - /* - * fast path: default or task policy - */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); +out: + __mpol_put(pol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932539Ab2JQVF7 (ORCPT ); Wed, 17 Oct 2012 17:05:59 -0400 Received: from mail-ob0-f174.google.com ([209.85.214.174]:42401 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932117Ab2JQVF5 (ORCPT ); Wed, 17 Oct 
2012 17:05:57 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <507E4531.1070700@jp.fujitsu.com> From: KOSAKI Motohiro Date: Wed, 17 Oct 2012 17:05:37 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: David Rientjes Cc: Kamezawa Hiroyuki , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 3:50 PM, David Rientjes wrote: > On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: > >> > I think this refcounting is better than using task_lock(). >> >> I don't think so. get_vma_policy() is used from fast path. In other >> words, number of >> atomic ops is sensible for allocation performance. > > There are enhancements that we can make with refcounting: for instance, we > may want to avoid doing it in the super-fast path when the policy is > default_policy and then just do > > if (mpol != &default_policy) > mpol_put(mpol); > >> Instead, I'd like >> to use spinlock >> for shared mempolicy instead of mutex. >> > > Um, this was just changed to a mutex last week in commit b22d127a39dd > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() > can be done with GFP_KERNEL, so I didn't consider reverting that behavior. > Are you nacking that patch, which you acked, now? Yes, sadly. /proc usage is a corner case issue. It's not worth to strike main path. see commit 52cd3b0740 and around patches. That explain why we avoided your approach. 
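The trade-off argued above (atomic refcount operations on every allocation versus skipping them for the static default policy) can be illustrated with a minimal userspace sketch. This is not kernel code: struct policy, policy_get(), policy_put() and the atomic_ops counter are hypothetical stand-ins, and the only idea taken from the thread is the "if (mpol != &default_policy) mpol_put(mpol);" test.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for struct mempolicy; not kernel code. */
struct policy {
	atomic_int refcnt;
};

/* Statically allocated fallback: never freed, so it needs no refcounting. */
static struct policy default_policy = { 1 };

/* Counts how many atomic read-modify-write operations were issued. */
static int atomic_ops;

/* Take a reference, skipping the atomic op for the static default policy. */
static struct policy *policy_get(struct policy *pol)
{
	if (pol != &default_policy) {
		atomic_fetch_add(&pol->refcnt, 1);
		atomic_ops++;
	}
	return pol;
}

/* Drop a reference, again skipping the static default policy. */
static void policy_put(struct policy *pol)
{
	if (pol != &default_policy) {
		atomic_fetch_sub(&pol->refcnt, 1);
		atomic_ops++;
	}
}
```

In this model a get/put pair on default_policy costs no atomic operations, while a pair on any other policy costs two. Whether that saving is enough for the page-fault path is exactly the point under dispute in the thread.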
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751260Ab2JQV1j (ORCPT ); Wed, 17 Oct 2012 17:27:39 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:60545 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750744Ab2JQV1i (ORCPT ); Wed, 17 Oct 2012 17:27:38 -0400 Date: Wed, 17 Oct 2012 14:27:35 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: KOSAKI Motohiro cc: Kamezawa Hiroyuki , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <507E4531.1070700@jp.fujitsu.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 17 Oct 2012, KOSAKI Motohiro wrote: > > Um, this was just changed to a mutex last week in commit b22d127a39dd > > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() > > can be done with GFP_KERNEL, so I didn't consider reverting that behavior. > > Are you nacking that patch, which you acked, now? > > Yes, sadly. /proc usage is a corner case issue. It's not worth to > strike main path. It also simplifies the fastpath since we can now unconditionally drop the reference. 
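As a rough illustration of the "unconditionally drop the reference" point, the following userspace sketch models an allocation path with one exit label instead of separate shared and non-shared branches. All names here (struct policy, policy_lookup(), alloc_with_policy()) are hypothetical and only mirror the shape of the patch, not its actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mempolicy; only what the sketch needs. */
struct policy {
	int refcnt;		/* atomic_t in the kernel, a plain int here */
	bool interleave;	/* models an interleave-mode policy */
};

/* Models a lookup that always hands the caller a counted reference. */
static struct policy *policy_lookup(struct policy *pol)
{
	pol->refcnt++;
	return pol;
}

/* Returns a fake "page" id: 1 for the interleave path, 2 otherwise. */
static int alloc_with_policy(struct policy *pol)
{
	int page;

	pol = policy_lookup(pol);
	if (pol->interleave) {
		page = 1;
		goto out;
	}
	page = 2;
out:
	pol->refcnt--;	/* single, unconditional drop for both paths */
	return page;
}
```

Because every lookup takes a reference, the exit path needs no flag check. The cost is one extra increment and decrement per call, which is the performance objection raised earlier in the thread.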
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751447Ab2JQVbN (ORCPT ); Wed, 17 Oct 2012 17:31:13 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:54320 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751045Ab2JQVbL (ORCPT ); Wed, 17 Oct 2012 17:31:11 -0400 Date: Wed, 17 Oct 2012 14:31:09 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Linus Torvalds , Andrew Morton cc: Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps In-Reply-To: Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack contents in numa_maps"), the mutex protecting a shared policy can be inadvertently taken while holding task_lock(task). Recently, commit b22d127a39dd ("mempolicy: fix a race in shared_policy_replace()") switched the spinlock within a shared policy to a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on all mempolicies returned by get_vma_policy() so it isn't freed while being passed to mpol_to_str() when reading /proc/pid/numa_maps. This patch only takes task_lock() while dereferencing task->mempolicy in get_vma_policy() if it's non-NULL in the lockess check to increment its refcount. This ensures it will remain in memory until dropped by __mpol_put() after mpol_to_str() is called. 
Refcounts of shared policies are grabbed by the ->get_policy() function of the vma, all others will be grabbed directly in get_vma_policy(). Now that this is done, all callers now unconditionally drop the refcount. Tested-by: Dave Jones Signed-off-by: David Rientjes --- v2: optimized task_lock() in get_vma_policy(): test for a non-NULL task->mempolicy before taking task_lock() and grabbing the reference so we don't take the lock unnecessarily. fs/proc/task_mmu.c | 4 +-- include/linux/mempolicy.h | 12 +------ mm/hugetlb.c | 4 +-- mm/mempolicy.c | 79 ++++++++++++++++++++------------------------- 4 files changed, 39 insertions(+), 60 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 14df880..5709e70 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); - mpol_cond_put(pol); - task_unlock(task); + __mpol_put(pol); seq_printf(m, "%08lx %s", vma->vm_start, buffer); diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index e5ccb9d..f76f7e0 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -73,13 +73,7 @@ static inline void mpol_put(struct mempolicy *pol) */ static inline int mpol_needs_cond_ref(struct mempolicy *pol) { - return (pol && (pol->flags & MPOL_F_SHARED)); -} - -static inline void mpol_cond_put(struct mempolicy *pol) -{ - if (mpol_needs_cond_ref(pol)) - __mpol_put(pol); + return pol->flags & MPOL_F_SHARED; } extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol, @@ -211,10 +205,6 @@ static inline void mpol_put(struct mempolicy *p) { } -static inline void mpol_cond_put(struct mempolicy *pol) -{ -} - static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to, struct mempolicy *from) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 59a0059..5080808 
100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -568,13 +568,13 @@ retry_cpuset: } } - mpol_cond_put(mpol); + __mpol_put(mpol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; err: - mpol_cond_put(mpol); + __mpol_put(mpol); return NULL; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d04a8a5..a0bb463 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -906,7 +906,8 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, } out: - mpol_cond_put(pol); + if (mpol_needs_cond_ref(pol)) + __mpol_put(pol); if (vma) up_read(&current->mm->mmap_sem); return err; @@ -1527,48 +1528,54 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len, } #endif - -/* - * get_vma_policy(@task, @vma, @addr) - * @task - task for fallback if vma policy == default - * @vma - virtual memory area whose policy is sought - * @addr - address in @vma for shared policy lookup +/** + * get_vma_policy() - return effective policy for a vma at specified address + * @task: task for fallback if vma policy == default_policy + * @vma: virtual memory area whose policy is sought + * @addr: address in @vma for shared policy lookup * - * Returns effective policy for a VMA at specified address. * Falls back to @task or system default policy, as necessary. - * Current or other task's task mempolicy and non-shared vma policies must be - * protected by task_lock(task) by the caller. - * Shared policies [those marked as MPOL_F_SHARED] require an extra reference - * count--added by the get_policy() vm_op, as appropriate--to protect against - * freeing by another task. It is the caller's responsibility to free the - * extra reference for shared policies. + * Increments the reference count of the returned mempolicy, it is the caller's + * responsibility to decrement with __mpol_put(). + * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes + * task_lock(task) for task policy fallback. 
*/ struct mempolicy *get_vma_policy(struct task_struct *task, struct vm_area_struct *vma, unsigned long addr) { struct mempolicy *pol = task->mempolicy; + /* + * Grab a reference before task has the potential to exit and free its + * mempolicy. + */ + if (pol) { + task_lock(task); + pol = task->mempolicy; + mpol_get(pol); + task_unlock(task); + } + if (vma) { if (vma->vm_ops && vma->vm_ops->get_policy) { struct mempolicy *vpol = vma->vm_ops->get_policy(vma, addr); - if (vpol) + if (vpol) { + mpol_put(pol); pol = vpol; + if (!mpol_needs_cond_ref(pol)) + mpol_get(pol); + } } else if (vma->vm_policy) { + mpol_put(pol); pol = vma->vm_policy; - - /* - * shmem_alloc_page() passes MPOL_F_SHARED policy with - * a pseudo vma whose vma->vm_ops=NULL. Take a reference - * count on these policies which will be dropped by - * mpol_cond_put() later - */ - if (mpol_needs_cond_ref(pol)) - mpol_get(pol); + mpol_get(pol); } } - if (!pol) + if (!pol) { pol = &default_policy; + mpol_get(pol); + } return pol; } @@ -1919,30 +1926,14 @@ retry_cpuset: unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order); - mpol_cond_put(pol); page = alloc_page_interleave(gfp, order, nid); - if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) - goto retry_cpuset; - - return page; + goto out; } zl = policy_zonelist(gfp, pol, node); - if (unlikely(mpol_needs_cond_ref(pol))) { - /* - * slow path: ref counted shared policy - */ - struct page *page = __alloc_pages_nodemask(gfp, order, - zl, policy_nodemask(gfp, pol)); - __mpol_put(pol); - if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) - goto retry_cpuset; - return page; - } - /* - * fast path: default or task policy - */ page = __alloc_pages_nodemask(gfp, order, zl, policy_nodemask(gfp, pol)); +out: + __mpol_put(pol); if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) goto retry_cpuset; return page; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via 
listexpand id S1750916Ab2JREHO (ORCPT ); Thu, 18 Oct 2012 00:07:14 -0400 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:41061 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750703Ab2JREHM (ORCPT ); Thu, 18 Oct 2012 00:07:12 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <507F803A.8000900@jp.fujitsu.com> Date: Thu, 18 Oct 2012 13:06:18 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121010 Thunderbird/16.0.1 MIME-Version: 1.0 To: David Rientjes CC: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/10/18 6:31), David Rientjes wrote: > As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack > contents in numa_maps"), the mutex protecting a shared policy can be > inadvertently taken while holding task_lock(task). > > Recently, commit b22d127a39dd ("mempolicy: fix a race in > shared_policy_replace()") switched the spinlock within a shared policy to > a mutex so sp_alloc() could block. Thus, a refcount must be grabbed on > all mempolicies returned by get_vma_policy() so it isn't freed while being > passed to mpol_to_str() when reading /proc/pid/numa_maps. > > This patch only takes task_lock() while dereferencing task->mempolicy in > get_vma_policy() if it's non-NULL in the lockess check to increment its > refcount. 
This ensures it will remain in memory until dropped by > __mpol_put() after mpol_to_str() is called. > > Refcounts of shared policies are grabbed by the ->get_policy() function of > the vma, all others will be grabbed directly in get_vma_policy(). Now > that this is done, all callers now unconditionally drop the refcount. > Please add the original problem description from your 1st patch: > When reading /proc/pid/numa_maps, it's possible to return the contents of > the stack where the mempolicy string should be printed if the policy gets > freed from beneath us. > > This happens because mpol_to_str() may return an error the > stack-allocated buffer is then printed without ever being stored. ..... Hmm, I've read the whole thread again, and I'm sorry if I've misunderstood something. I think Kosaki was referring to commit 52cd3b0740. It avoids refcounting in get_vma_policy() because that function is called every time alloc_pages_vma() is called, i.e. at every page fault. So it seems he doesn't agree with this fix because of performance concerns on big NUMA systems. Can't we fix this another way? Like this? Too ugly? Again, I'm sorry if I've misunderstood the points. == From bfe7e2ab1c1375b134ec12efce6517149318f75d Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Thu, 18 Oct 2012 13:17:25 +0900 Subject: [PATCH] hold task->mempolicy while numa_maps scans. /proc/pid/numa_maps scans vmas and shows their mempolicy under mmap_sem. It sometimes accesses task->mempolicy, which can be freed without mmap_sem held, so numa_maps can show garbage while scanning. This patch takes a reference on task->mempolicy when reading numa_maps, before calling get_vma_policy(). With this, task->mempolicy will not be freed until numa_maps reaches its end. 
Signed-off-by: KAMEZAWA Hiroyuki --- fs/proc/task_mmu.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 14df880..d92e868 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -94,6 +94,11 @@ static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma) { if (vma && vma != priv->tail_vma) { struct mm_struct *mm = vma->vm_mm; +#ifdef CONFIG_NUMA + task_lock(priv->task); + __mpol_put(priv->task->mempolicy); + task_unlock(priv->task); +#endif up_read(&mm->mmap_sem); mmput(mm); } @@ -130,6 +135,16 @@ static void *m_start(struct seq_file *m, loff_t *pos) return mm; down_read(&mm->mmap_sem); + /* + * task->mempolicy can be freed even if mmap_sem is down (see kernel/exit.c) + * We grab refcount for stable access. + * repleacement of task->mmpolicy is guarded by mmap_sem. + */ +#ifdef CONFIG_NUMA + task_lock(priv->task); + mpol_get(priv->task->mempolicy); + task_unlock(priv->task); +#endif tail_vma = get_gate_vma(priv->task->mm); priv->tail_vma = tail_vma; @@ -161,6 +176,11 @@ out: /* End of vmas has been reached */ m->version = (tail_vma != NULL)? 
0: -1UL; +#ifdef CONFIG_NUMA + task_lock(priv->task); + __mpol_put(priv->task->mempolicy); + task_unlock(priv->task); +#endif up_read(&mm->mmap_sem); mmput(mm); return tail_vma; -- 1.7.10.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751222Ab2JREPB (ORCPT ); Thu, 18 Oct 2012 00:15:01 -0400 Received: from mail-we0-f174.google.com ([74.125.82.174]:45574 "EHLO mail-we0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750790Ab2JREPA (ORCPT ); Thu, 18 Oct 2012 00:15:00 -0400 MIME-Version: 1.0 In-Reply-To: <507F803A.8000900@jp.fujitsu.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> From: Linus Torvalds Date: Wed, 17 Oct 2012 21:14:38 -0700 X-Google-Sender-Auth: XRoJry7vVSV5A3QVZbUwy5BerkM Message-ID: Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps To: Kamezawa Hiroyuki Cc: David Rientjes , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 9:06 PM, Kamezawa Hiroyuki wrote: > if (vma && vma != priv->tail_vma) { > struct mm_struct *mm = vma->vm_mm; > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + __mpol_put(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > up_read(&mm->mmap_sem); > mmput(mm); Please don't put #ifdef's inside code. It makes things really ugly and hard to read. 
And that is *especially* true in this case, since there's a pattern to all these things: > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + mpol_get(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + __mpol_put(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif it really sounds like what you want to do is to just abstract a "numa_policy_get/put(priv)" operation. So you could make it be something like #ifdef CONFIG_NUMA static inline void numa_policy_get(struct proc_maps_private *priv) { task_lock(priv->task); mpol_get(priv->task->mempolicy); task_unlock(priv->task); } .. same for the "put" function .. #else #define numa_policy_get(priv) do { } while (0) #define numa_policy_put(priv) do { } while (0) #endif and then you wouldn't have to have the #ifdef's in the middle of code, and I think it will be more readable in general. Sure, it is going to be a few more actual lines of patch, but there's no duplicated code sequence, and the added lines are just the syntax that makes it look better. 
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751811Ab2JREel (ORCPT ); Thu, 18 Oct 2012 00:34:41 -0400 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:52669 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751298Ab2JREek (ORCPT ); Thu, 18 Oct 2012 00:34:40 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <507F86BD.7070201@jp.fujitsu.com> Date: Thu, 18 Oct 2012 13:34:05 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121010 Thunderbird/16.0.1 MIME-Version: 1.0 To: David Rientjes CC: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> In-Reply-To: <507F803A.8000900@jp.fujitsu.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/10/18 13:06), Kamezawa Hiroyuki wrote: > (2012/10/18 6:31), David Rientjes wrote: >> As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack >> contents in numa_maps"), the mutex protecting a shared policy can be >> inadvertently taken while holding task_lock(task). >> >> Recently, commit b22d127a39dd ("mempolicy: fix a race in >> shared_policy_replace()") switched the spinlock within a shared policy to >> a mutex so sp_alloc() could block. 
Thus, a refcount must be grabbed on >> all mempolicies returned by get_vma_policy() so it isn't freed while being >> passed to mpol_to_str() when reading /proc/pid/numa_maps. >> >> This patch only takes task_lock() while dereferencing task->mempolicy in >> get_vma_policy() if it's non-NULL in the lockess check to increment its >> refcount. This ensures it will remain in memory until dropped by >> __mpol_put() after mpol_to_str() is called. >> >> Refcounts of shared policies are grabbed by the ->get_policy() function of >> the vma, all others will be grabbed directly in get_vma_policy(). Now >> that this is done, all callers now unconditionally drop the refcount. >> > > please add original problem description.... > > from your 1st patch. >> When reading /proc/pid/numa_maps, it's possible to return the contents of >> the stack where the mempolicy string should be printed if the policy gets >> freed from beneath us. >> >> This happens because mpol_to_str() may return an error the >> stack-allocated buffer is then printed without ever being stored. > ..... > > Hmm, I've read the whole thread again...and, I'm sorry if I misunderstand something. > > I think Kosaki mentioned the commit 52cd3b0740. It avoids refcounting in get_vma_policy() > because it's called every time alloc_pages_vma() is called, at every page fault. > So, it seems he doesn't agree this fix because of performance concern on big NUMA, > > > Can't we have another way to fix ? like this ? too ugly ? > Again, I'm sorry if I misunderstand the points. > Sorry, this patch itself may be buggy; please don't test it. I missed that kernel/exit.c sets task->mempolicy to NULL. A fixed version is below. -- From 5581c71e68a7f50e52fd67cca00148911023f9f5 Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Thu, 18 Oct 2012 13:50:29 +0900 Subject: [PATCH] hold task->mempolicy while numa_maps scans. /proc/pid/numa_maps scans vmas and shows their mempolicy under mmap_sem. 
It sometimes accesses task->mempolicy, which can be freed without mmap_sem held, so numa_maps can show garbage while scanning. This patch takes a reference on task->mempolicy when reading numa_maps, before calling get_vma_policy(). With this, task->mempolicy will not be freed until numa_maps reaches its end. Signed-off-by: KAMEZAWA Hiroyuki V1->V2 - access task->mempolicy only once and remember it, because kernel/exit.c can overwrite it. Signed-off-by: KAMEZAWA Hiroyuki --- fs/proc/internal.h | 4 ++++ fs/proc/task_mmu.c | 33 ++++++++++++++++++++++++++++++++- 2 files changed, 36 insertions(+), 1 deletion(-) diff --git a/fs/proc/internal.h b/fs/proc/internal.h index cceaab0..43973b0 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -12,6 +12,7 @@ #include #include struct ctl_table_header; +struct mempolicy; extern struct proc_dir_entry proc_root; #ifdef CONFIG_PROC_SYSCTL @@ -74,6 +75,9 @@ struct proc_maps_private { #ifdef CONFIG_MMU struct vm_area_struct *tail_vma; #endif +#ifdef CONFIG_NUMA + struct mempolicy *task_mempolicy; +#endif }; void proc_init_inodecache(void); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 14df880..624927d 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -89,11 +89,41 @@ static void pad_len_spaces(struct seq_file *m, int len) len = 1; seq_printf(m, "%*c", len, ' '); } +#ifdef CONFIG_NUMA +/* + * numa_maps scans all vmas under mmap_sem and checks their mempolicy. + * But task->mempolicy is not guarded by mmap_sem, it can be cleared/freed + * under task_lock() (see kernel/exit.c); replacement of it is guarded by + * mmap_sem. So, take a reference count under task_lock() before we start + * scanning and drop it when numa_maps reaches the end. 
+ */ +static void hold_task_mempolicy(struct proc_maps_private *priv) +{ + struct task_struct *task = priv->task; + + task_lock(task); + priv->task_mempolicy = task->mempolicy; + mpol_get(priv->task_mempolicy); + task_unlock(task); +} +static void release_task_mempolicy(struct proc_maps_private *priv) +{ + mpol_put(priv->task_mempolicy); +} +#else +static void hold_task_mempolicy(struct proc_maps_private *priv) +{ +} +static void release_task_mempolicy(struct proc_maps_private *priv) +{ +} +#endif static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma) { if (vma && vma != priv->tail_vma) { struct mm_struct *mm = vma->vm_mm; + release_task_mempolicy(priv); up_read(&mm->mmap_sem); mmput(mm); } @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) tail_vma = get_gate_vma(priv->task->mm); priv->tail_vma = tail_vma; - + hold_task_mempolicy(priv); /* Start with last addr hint */ vma = find_vma(mm, last_addr); if (last_addr && vma) { @@ -159,6 +189,7 @@ out: if (vma) return vma; + release_task_mempolicy(priv); /* End of vmas has been reached */ m->version = (tail_vma != NULL)? 
0: -1UL; up_read(&mm->mmap_sem); -- 1.7.10.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751836Ab2JREf5 (ORCPT ); Thu, 18 Oct 2012 00:35:57 -0400 Received: from mail-pa0-f46.google.com ([209.85.220.46]:42817 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751137Ab2JREfz (ORCPT ); Thu, 18 Oct 2012 00:35:55 -0400 Date: Wed, 17 Oct 2012 21:35:53 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Kamezawa Hiroyuki cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps In-Reply-To: <507F803A.8000900@jp.fujitsu.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote: > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 14df880..d92e868 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -94,6 +94,11 @@ static void vma_stop(struct proc_maps_private *priv, struct > vm_area_struct *vma) > { > if (vma && vma != priv->tail_vma) { > struct mm_struct *mm = vma->vm_mm; > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + __mpol_put(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > up_read(&mm->mmap_sem); > mmput(mm); > } > @@ -130,6 +135,16 @@ static void *m_start(struct seq_file *m, loff_t *pos) > return mm; > down_read(&mm->mmap_sem); > + /* > 
+ * task->mempolicy can be freed even if mmap_sem is down (see > kernel/exit.c) > + * We grab refcount for stable access. > + * repleacement of task->mmpolicy is guarded by mmap_sem. > + */ > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + mpol_get(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > tail_vma = get_gate_vma(priv->task->mm); > priv->tail_vma = tail_vma; > @@ -161,6 +176,11 @@ out: > /* End of vmas has been reached */ > m->version = (tail_vma != NULL)? 0: -1UL; > +#ifdef CONFIG_NUMA > + task_lock(priv->task); > + __mpol_put(priv->task->mempolicy); > + task_unlock(priv->task); > +#endif > up_read(&mm->mmap_sem); > mmput(mm); > return tail_vma; Yes, I must admit that this is better than my version and it looks like all the ->show() functions that use these start, next, stop functions don't take task_lock() and this would generally be useful: we already hold current->mm->mmap_sem so there is little harm in holding task_lock(current) when reading these files as long as we're not touching the fastpath. These routines seem like it would nicely be added to mempolicy.h since we depend on CONFIG_NUMA there already. Please fix up the mess I made in show_numa_map() in 32f8516a8c73 ("mm, mempolicy: fix printing stack contents in numa_maps") by simply removing the task_lock() and task_unlock() as part of your patch. Thanks Kame! 
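The hold/release approach discussed in this thread depends on reading task->mempolicy once, remembering that pointer, and later dropping the reference through the remembered pointer rather than through task->mempolicy, which exit may have cleared in the meantime. A minimal userspace sketch of that idea follows. It is not kernel code: the structs, hold_task_policy(), release_task_policy() and task_exit() are hypothetical models, and the locking is only noted in comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; refcnt is an atomic_t in the kernel. */
struct policy {
	int refcnt;
	bool freed;	/* set instead of actually freeing memory */
};

struct task {
	struct policy *mempolicy;
};

struct reader {
	struct policy *held;	/* models priv->task_mempolicy */
};

/* Drop one reference; mark the policy freed when the count hits zero. */
static void policy_put(struct policy *pol)
{
	if (pol && --pol->refcnt == 0)
		pol->freed = true;
}

/* Snapshot the task policy once and pin it (task_lock() held in the kernel). */
static void hold_task_policy(struct reader *r, struct task *t)
{
	r->held = t->mempolicy;
	if (r->held)
		r->held->refcnt++;
}

/* Release via the snapshot, never via task->mempolicy. */
static void release_task_policy(struct reader *r)
{
	policy_put(r->held);
	r->held = NULL;
}

/* Models exit clearing the task policy and dropping the task's reference. */
static void task_exit(struct task *t)
{
	struct policy *pol = t->mempolicy;

	t->mempolicy = NULL;
	policy_put(pol);
}
```

In this model, if the task exits between hold and release, the policy stays alive until the reader lets go. Re-reading task->mempolicy at release time would instead see NULL and leak the pinned reference, which is the bug the first version of the patch had.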
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751936Ab2JREle (ORCPT ); Thu, 18 Oct 2012 00:41:34 -0400 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:57555 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751298Ab2JREld (ORCPT ); Thu, 18 Oct 2012 00:41:33 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <507F8864.1070203@jp.fujitsu.com> Date: Thu, 18 Oct 2012 13:41:08 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121010 Thunderbird/16.0.1 MIME-Version: 1.0 To: Linus Torvalds CC: David Rientjes , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/10/18 13:14), Linus Torvalds wrote: > On Wed, Oct 17, 2012 at 9:06 PM, Kamezawa Hiroyuki > wrote: >> if (vma && vma != priv->tail_vma) { >> struct mm_struct *mm = vma->vm_mm; >> +#ifdef CONFIG_NUMA >> + task_lock(priv->task); >> + __mpol_put(priv->task->mempolicy); >> + task_unlock(priv->task); >> +#endif >> up_read(&mm->mmap_sem); >> mmput(mm); > > Please don't put #ifdef's inside code. It makes things really ugly and > hard to read. 
> > And that is *especially* true in this case, since there's a pattern to > all these things: > >> +#ifdef CONFIG_NUMA >> + task_lock(priv->task); >> + mpol_get(priv->task->mempolicy); >> + task_unlock(priv->task); >> +#endif > >> +#ifdef CONFIG_NUMA >> + task_lock(priv->task); >> + __mpol_put(priv->task->mempolicy); >> + task_unlock(priv->task); >> +#endif > > it really sounds like what you want to do is to just abstract a > "numa_policy_get/put(priv)" operation. > > So you could make it be something like > > #ifdef CONFIG_NUMA > static inline numa_policy_get(struct proc_maps_private *priv) > { > task_lock(priv->task); > mpol_get(priv->task->mempolicy); > task_unlock(priv->task); > } > .. same for the "put" function .. > #else > #define numa_policy_get(priv) do { } while (0) > #define numa_policy_put(priv) do { } while (0) > #endif > > and then you wouldn't have to have the #ifdef's in the middle of code, > and I think it will be more readable in general. > > Sure, it is going to be a few more actual lines of patch, but there's > no duplicated code sequence, and the added lines are just the syntax > that makes it look better. > you're right, I shouldn't send an ugly patch. I'm sorry. V2 uses suggested style, I think. 
Regards,
-Kame

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754542Ab2JRUDo (ORCPT ); Thu, 18 Oct 2012 16:03:44 -0400
Received: from mail-pa0-f46.google.com ([209.85.220.46]:35364 "EHLO mail-pa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753226Ab2JRUDn (ORCPT ); Thu, 18 Oct 2012 16:03:43 -0400
Date: Thu, 18 Oct 2012 13:03:38 -0700 (PDT)
From: David Rientjes
X-X-Sender: rientjes@chino.kir.corp.google.com
To: Kamezawa Hiroyuki
cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
In-Reply-To: <507F86BD.7070201@jp.fujitsu.com>
Message-ID:
References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com>
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:

> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index cceaab0..43973b0 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -12,6 +12,7 @@
>  #include
>  #include
>  struct ctl_table_header;
> +struct mempolicy;
>  extern struct proc_dir_entry proc_root;
>  #ifdef CONFIG_PROC_SYSCTL
> @@ -74,6 +75,9 @@ struct proc_maps_private {
>  #ifdef CONFIG_MMU
>  	struct vm_area_struct *tail_vma;
>  #endif
> +#ifdef CONFIG_NUMA
> +	struct mempolicy *task_mempolicy;
> +#endif
>  };
>  void proc_init_inodecache(void);
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 14df880..624927d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -89,11 +89,41 @@ static void pad_len_spaces(struct seq_file *m, int len)
>  		len = 1;
>  	seq_printf(m, "%*c", len, ' ');
>  }
> +#ifdef CONFIG_NUMA
> +/*
> + * numa_maps scans all vmas under mmap_sem and checks their mempolicy.

This doesn't only affect numa_maps, it also affects maps and smaps
although they don't need the refcounts.

> + * But task->mempolicy is not guarded by mmap_sem, it can be cleared/freed
> + * under task_lock() (see kernel/exit.c) replacement of it is guarded by
> + * mmap_sem.

I think this should be a little more verbose, making it clear that
task->mempolicy can be cleared and freed if its refcount drops to 0 and is
only protected by task_lock(), and that we're safe from task->mempolicy
changing between ->start(), ->next(), and ->stop() because
task->mm->mmap_sem is held for the duration.

> So, take reference count under task_lock() before we start
> + * scanning and drop it when numa_maps reaches the end.
> + */
> +static void hold_task_mempolicy(struct proc_maps_private *priv)
> +{
> +	struct task_struct *task = priv->task;
> +
> +	task_lock(task);
> +	priv->task_mempolicy = task->mempolicy;
> +	mpol_get(priv->task_mempolicy);
> +	task_unlock(task);
> +}
> +static void release_task_mempolicy(struct proc_maps_private *priv)
> +{
> +	mpol_put(priv->task_mempolicy);
> +}
> +#else
> +static void hold_task_mempolicy(struct proc_maps_private *priv)
> +{
> +}
> +static void release_task_mempolicy(struct proc_maps_private *priv)
> +{
> +}
> +#endif
>  static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
>  {
>  	if (vma && vma != priv->tail_vma) {
>  		struct mm_struct *mm = vma->vm_mm;
> +		release_task_mempolicy(priv);
>  		up_read(&mm->mmap_sem);
>  		mmput(mm);
>  	}
> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>  	tail_vma = get_gate_vma(priv->task->mm);
>  	priv->tail_vma = tail_vma;
> -
> +	hold_task_mempolicy(priv);
>  	/* Start with last addr hint */
>  	vma = find_vma(mm, last_addr);
>  	if (last_addr && vma) {
> @@ -159,6 +189,7 @@ out:
>  	if (vma)
>  		return vma;
> +	release_task_mempolicy(priv);
>  	/* End of vmas has been reached */
>  	m->version = (tail_vma != NULL)? 0: -1UL;
>  	up_read(&mm->mmap_sem);

Otherwise looks good, but please remove the two task_lock()'s in
show_numa_map() that I added as part of this since you're replacing the
need for locking.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753004Ab2JSGvv (ORCPT ); Fri, 19 Oct 2012 02:51:51 -0400
Received: from mail-oa0-f46.google.com ([209.85.219.46]:33489 "EHLO mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752713Ab2JSGvu (ORCPT ); Fri, 19 Oct 2012 02:51:50 -0400
MIME-Version: 1.0
In-Reply-To: <507F86BD.7070201@jp.fujitsu.com>
References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com>
From: KOSAKI Motohiro
Date: Fri, 19 Oct 2012 02:51:29 -0400
Message-ID:
Subject: Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
To: Kamezawa Hiroyuki
Cc: David Rientjes , Linus Torvalds , Andrew Morton , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

>> Can't we have another way to fix ? like this ? too ugly ?
>> Again, I'm sorry if I misunderstand the points.
>>
> Sorry this patch itself may be buggy. please don't test..
> I missed that kernel/exit.c sets task->mempolicy to be NULL.
> fixed one here.
> > -- > From 5581c71e68a7f50e52fd67cca00148911023f9f5 Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki > Date: Thu, 18 Oct 2012 13:50:29 +0900 > > Subject: [PATCH] hold task->mempolicy while numa_maps scans. > > /proc//numa_maps scans vma and show mempolicy under > mmap_sem. It sometimes accesses task->mempolicy which can > be freed without mmap_sem and numa_maps can show some > garbage while scanning. > > This patch tries to take reference count of task->mempolicy at reading > numa_maps before calling get_vma_policy(). By this, task->mempolicy > will not be freed until numa_maps reaches its end. > > Signed-off-by: KAMEZAWA Hiroyuki > > V1->V2 > - access task->mempolicy only once and remember it. Becase kernel/exit.c > can overwrite it. > > Signed-off-by: KAMEZAWA Hiroyuki Ok, this is acceptable to me. go ahead. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932115Ab2JSIgF (ORCPT ); Fri, 19 Oct 2012 04:36:05 -0400 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:46452 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757736Ab2JSIgB (ORCPT ); Fri, 19 Oct 2012 04:36:01 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <508110C4.6030805@jp.fujitsu.com> Date: Fri, 19 Oct 2012 17:35:16 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121010 Thunderbird/16.0.1 MIME-Version: 1.0 To: David Rientjes CC: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. 
References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/10/19 5:03), David Rientjes wrote: > On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote: >> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) >> tail_vma = get_gate_vma(priv->task->mm); >> priv->tail_vma = tail_vma; >> - >> + hold_task_mempolicy(priv); >> /* Start with last addr hint */ >> vma = find_vma(mm, last_addr); >> if (last_addr && vma) { >> @@ -159,6 +189,7 @@ out: >> if (vma) >> return vma; >> + release_task_mempolicy(priv); >> /* End of vmas has been reached */ >> m->version = (tail_vma != NULL)? 0: -1UL; >> up_read(&mm->mmap_sem); > > Otherwise looks good, but please remove the two task_lock()'s in > show_numa_map() that I added as part of this since you're replacing the > need for locking. > Thank you for your review. How about this ? == From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Fri, 19 Oct 2012 17:00:55 +0900 Subject: [PATCH] hold task->mempolicy while numa_maps scans. /proc//numa_maps scans vma and show mempolicy under mmap_sem. It sometimes accesses task->mempolicy which can be freed without mmap_sem and numa_maps can show some garbage while scanning. This patch tries to take reference count of task->mempolicy at reading numa_maps before calling get_vma_policy(). By this, task->mempolicy will not be freed until numa_maps reaches its end. Signed-off-by: KAMEZAWA Hiroyuki V2->v3 - updated comments to be more verbose. - removed task_lock() in numa_maps code. V1->V2 - access task->mempolicy only once and remember it. 
Becase kernel/exit.c can overwrite it. Signed-off-by: KAMEZAWA Hiroyuki --- fs/proc/internal.h | 4 ++++ fs/proc/task_mmu.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 50 insertions(+), 3 deletions(-) diff --git a/fs/proc/internal.h b/fs/proc/internal.h index cceaab0..43973b0 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -12,6 +12,7 @@ #include #include struct ctl_table_header; +struct mempolicy; extern struct proc_dir_entry proc_root; #ifdef CONFIG_PROC_SYSCTL @@ -74,6 +75,9 @@ struct proc_maps_private { #ifdef CONFIG_MMU struct vm_area_struct *tail_vma; #endif +#ifdef CONFIG_NUMA + struct mempolicy *task_mempolicy; +#endif }; void proc_init_inodecache(void); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 14df880..2371fea 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -89,11 +89,55 @@ static void pad_len_spaces(struct seq_file *m, int len) len = 1; seq_printf(m, "%*c", len, ' '); } +#ifdef CONFIG_NUMA +/* + * These functions are for numa_maps but called in generic **maps seq_file + * ->start(), ->stop() ops. + * + * numa_maps scans all vmas under mmap_sem and checks their mempolicy. + * Each mempolicy object is controlled by reference counting. The problem here + * is how to avoid accessing dead mempolicy object. + * + * Because we're holding mmap_sem while reading seq_file, it's safe to access + * each vma's mempolicy, no vma objects will never drop refs to mempolicy. + * + * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy + * is set and replaced under mmap_sem but unrefed and cleared under task_lock(). + * So, without task_lock(), we cannot trust get_vma_policy() because we cannot + * gurantee the task never exits under us. But taking task_lock() around + * get_vma_plicy() causes lock order problem. + * + * To access task->mempolicy without lock, we hold a reference count of an + * object pointed by task->mempolicy and remember it. 
This will guarantee + * that task->mempolicy points to an alive object or NULL in numa_maps accesses. + */ +static void hold_task_mempolicy(struct proc_maps_private *priv) +{ + struct task_struct *task = priv->task; + + task_lock(task); + priv->task_mempolicy = task->mempolicy; + mpol_get(priv->task_mempolicy); + task_unlock(task); +} +static void release_task_mempolicy(struct proc_maps_private *priv) +{ + mpol_put(priv->task_mempolicy); +} +#else +static void hold_task_mempolicy(struct proc_maps_private *priv) +{ +} +static void release_task_mempolicy(struct proc_maps_private *priv) +{ +} +#endif static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma) { if (vma && vma != priv->tail_vma) { struct mm_struct *mm = vma->vm_mm; + release_task_mempolicy(priv); up_read(&mm->mmap_sem); mmput(mm); } @@ -132,7 +176,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) tail_vma = get_gate_vma(priv->task->mm); priv->tail_vma = tail_vma; - + hold_task_mempolicy(priv); /* Start with last addr hint */ vma = find_vma(mm, last_addr); if (last_addr && vma) { @@ -159,6 +203,7 @@ out: if (vma) return vma; + release_task_mempolicy(priv); /* End of vmas has been reached */ m->version = (tail_vma != NULL)? 
0: -1UL; up_read(&mm->mmap_sem); @@ -1178,11 +1223,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); mpol_cond_put(pol); - task_unlock(task); seq_printf(m, "%08lx %s", vma->vm_start, buffer); -- 1.7.10.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754177Ab2JSJ2s (ORCPT ); Fri, 19 Oct 2012 05:28:48 -0400 Received: from mail-da0-f46.google.com ([209.85.210.46]:53787 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753231Ab2JSJ2p (ORCPT ); Fri, 19 Oct 2012 05:28:45 -0400 Date: Fri, 19 Oct 2012 02:28:42 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Kamezawa Hiroyuki cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. In-Reply-To: <508110C4.6030805@jp.fujitsu.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 19 Oct 2012, Kamezawa Hiroyuki wrote: > From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki > Date: Fri, 19 Oct 2012 17:00:55 +0900 > Subject: [PATCH] hold task->mempolicy while numa_maps scans. 
> > /proc//numa_maps scans vma and show mempolicy under > mmap_sem. It sometimes accesses task->mempolicy which can > be freed without mmap_sem and numa_maps can show some > garbage while scanning. > > This patch tries to take reference count of task->mempolicy at reading > numa_maps before calling get_vma_policy(). By this, task->mempolicy > will not be freed until numa_maps reaches its end. > > Signed-off-by: KAMEZAWA Hiroyuki Looks good, but the patch is whitespace damaged so it doesn't apply. When that's fixed: Acked-by: David Rientjes Thanks for following through on this! From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757677Ab2JSTPk (ORCPT ); Fri, 19 Oct 2012 15:15:40 -0400 Received: from mail-ob0-f174.google.com ([209.85.214.174]:33144 "EHLO mail-ob0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756835Ab2JSTPj (ORCPT ); Fri, 19 Oct 2012 15:15:39 -0400 MIME-Version: 1.0 In-Reply-To: <508110C4.6030805@jp.fujitsu.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> From: KOSAKI Motohiro Date: Fri, 19 Oct 2012 15:15:18 -0400 Message-ID: Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. 
To: Kamezawa Hiroyuki Cc: David Rientjes , Linus Torvalds , Andrew Morton , Dave Jones , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 19, 2012 at 4:35 AM, Kamezawa Hiroyuki wrote: > (2012/10/19 5:03), David Rientjes wrote: >> >> On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote: >>> >>> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) >>> tail_vma = get_gate_vma(priv->task->mm); >>> priv->tail_vma = tail_vma; >>> - >>> + hold_task_mempolicy(priv); >>> /* Start with last addr hint */ >>> vma = find_vma(mm, last_addr); >>> if (last_addr && vma) { >>> @@ -159,6 +189,7 @@ out: >>> if (vma) >>> return vma; >>> + release_task_mempolicy(priv); >>> /* End of vmas has been reached */ >>> m->version = (tail_vma != NULL)? 0: -1UL; >>> up_read(&mm->mmap_sem); >> >> >> Otherwise looks good, but please remove the two task_lock()'s in >> show_numa_map() that I added as part of this since you're replacing the >> need for locking. >> > Thank you for your review. > How about this ? > > == > From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001 > From: KAMEZAWA Hiroyuki > Date: Fri, 19 Oct 2012 17:00:55 +0900 > Subject: [PATCH] hold task->mempolicy while numa_maps scans. > > /proc//numa_maps scans vma and show mempolicy under > mmap_sem. It sometimes accesses task->mempolicy which can > be freed without mmap_sem and numa_maps can show some > garbage while scanning. > > This patch tries to take reference count of task->mempolicy at reading > numa_maps before calling get_vma_policy(). By this, task->mempolicy > will not be freed until numa_maps reaches its end. > > Signed-off-by: KAMEZAWA Hiroyuki > > V2->v3 > - updated comments to be more verbose. > - removed task_lock() in numa_maps code. 
> V1->V2 > - access task->mempolicy only once and remember it. Becase kernel/exit.c > can overwrite it. > > Signed-off-by: KAMEZAWA Hiroyuki Acked-by: KOSAKI Motohiro From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932839Ab2JVCr6 (ORCPT ); Sun, 21 Oct 2012 22:47:58 -0400 Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:48381 "EHLO fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932795Ab2JVCr4 (ORCPT ); Sun, 21 Oct 2012 22:47:56 -0400 X-SecurityPolicyCheck: OK by SHieldMailChecker v1.8.4 Message-ID: <5084B3C3.3070906@jp.fujitsu.com> Date: Mon, 22 Oct 2012 11:47:31 +0900 From: Kamezawa Hiroyuki User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20121010 Thunderbird/16.0.1 MIME-Version: 1.0 To: David Rientjes CC: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (2012/10/19 18:28), David Rientjes wrote: > Looks good, but the patch is whitespace damaged so it doesn't apply. When > that's fixed: > > Acked-by: David Rientjes Sorry, I hope this one is not broken... 
== From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Fri, 19 Oct 2012 17:00:55 +0900 Subject: [PATCH] hold task->mempolicy while numa_maps scans. /proc//numa_maps scans vma and show mempolicy under mmap_sem. It sometimes accesses task->mempolicy which can be freed without mmap_sem and numa_maps can show some garbage while scanning. This patch tries to take reference count of task->mempolicy at reading numa_maps before calling get_vma_policy(). By this, task->mempolicy will not be freed until numa_maps reaches its end. Acked-by: David Rientjes Acked-by: KOSAKI Motohiro Signed-off-by: KAMEZAWA Hiroyuki V2->v3 - updated comments to be more verbose. - removed task_lock() in numa_maps code. V1->V2 - access task->mempolicy only once and remember it. Becase kernel/exit.c can overwrite it. --- fs/proc/internal.h | 4 ++++ fs/proc/task_mmu.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 50 insertions(+), 3 deletions(-) diff --git a/fs/proc/internal.h b/fs/proc/internal.h index cceaab0..43973b0 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -12,6 +12,7 @@ #include #include struct ctl_table_header; +struct mempolicy; extern struct proc_dir_entry proc_root; #ifdef CONFIG_PROC_SYSCTL @@ -74,6 +75,9 @@ struct proc_maps_private { #ifdef CONFIG_MMU struct vm_area_struct *tail_vma; #endif +#ifdef CONFIG_NUMA + struct mempolicy *task_mempolicy; +#endif }; void proc_init_inodecache(void); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 14df880..2371fea 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -89,11 +89,55 @@ static void pad_len_spaces(struct seq_file *m, int len) len = 1; seq_printf(m, "%*c", len, ' '); } +#ifdef CONFIG_NUMA +/* + * These functions are for numa_maps but called in generic **maps seq_file + * ->start(), ->stop() ops. + * + * numa_maps scans all vmas under mmap_sem and checks their mempolicy. 
+ * Each mempolicy object is controlled by reference counting. The problem here + * is how to avoid accessing dead mempolicy object. + * + * Because we're holding mmap_sem while reading seq_file, it's safe to access + * each vma's mempolicy, no vma objects will never drop refs to mempolicy. + * + * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy + * is set and replaced under mmap_sem but unrefed and cleared under task_lock(). + * So, without task_lock(), we cannot trust get_vma_policy() because we cannot + * gurantee the task never exits under us. But taking task_lock() around + * get_vma_plicy() causes lock order problem. + * + * To access task->mempolicy without lock, we hold a reference count of an + * object pointed by task->mempolicy and remember it. This will guarantee + * that task->mempolicy points to an alive object or NULL in numa_maps accesses. + */ +static void hold_task_mempolicy(struct proc_maps_private *priv) +{ + struct task_struct *task = priv->task; + + task_lock(task); + priv->task_mempolicy = task->mempolicy; + mpol_get(priv->task_mempolicy); + task_unlock(task); +} +static void release_task_mempolicy(struct proc_maps_private *priv) +{ + mpol_put(priv->task_mempolicy); +} +#else +static void hold_task_mempolicy(struct proc_maps_private *priv) +{ +} +static void release_task_mempolicy(struct proc_maps_private *priv) +{ +} +#endif static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma) { if (vma && vma != priv->tail_vma) { struct mm_struct *mm = vma->vm_mm; + release_task_mempolicy(priv); up_read(&mm->mmap_sem); mmput(mm); } @@ -132,7 +176,7 @@ static void *m_start(struct seq_file *m, loff_t *pos) tail_vma = get_gate_vma(priv->task->mm); priv->tail_vma = tail_vma; - + hold_task_mempolicy(priv); /* Start with last addr hint */ vma = find_vma(mm, last_addr); if (last_addr && vma) { @@ -159,6 +203,7 @@ out: if (vma) return vma; + release_task_mempolicy(priv); /* End of vmas has been reached */ 
m->version = (tail_vma != NULL)? 0: -1UL; up_read(&mm->mmap_sem); @@ -1178,11 +1223,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid) walk.private = md; walk.mm = mm; - task_lock(task); pol = get_vma_policy(task, vma, vma->vm_start); mpol_to_str(buffer, sizeof(buffer), pol, 0); mpol_cond_put(pol); - task_unlock(task); seq_printf(m, "%08lx %s", vma->vm_start, buffer); -- 1.7.10.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756282Ab2JVU4V (ORCPT ); Mon, 22 Oct 2012 16:56:21 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:41431 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755085Ab2JVU4B (ORCPT ); Mon, 22 Oct 2012 16:56:01 -0400 Date: Mon, 22 Oct 2012 13:55:59 -0700 From: Andrew Morton To: Kamezawa Hiroyuki Cc: David Rientjes , Linus Torvalds , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. 
Message-Id: <20121022135559.1ccb14bc.akpm@linux-foundation.org> In-Reply-To: <5084B3C3.3070906@jp.fujitsu.com> References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> <5084B3C3.3070906@jp.fujitsu.com> X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 22 Oct 2012 11:47:31 +0900 Kamezawa Hiroyuki wrote: > (2012/10/19 18:28), David Rientjes wrote: > > > Looks good, but the patch is whitespace damaged so it doesn't apply. When > > that's fixed: > > > > Acked-by: David Rientjes > > Sorry, I hope this one is not broken... > > ... > > --- a/fs/proc/internal.h > +++ b/fs/proc/internal.h > @@ -12,6 +12,7 @@ > #include > #include > struct ctl_table_header; > +struct mempolicy; > > extern struct proc_dir_entry proc_root; > #ifdef CONFIG_PROC_SYSCTL > @@ -74,6 +75,9 @@ struct proc_maps_private { > #ifdef CONFIG_MMU > struct vm_area_struct *tail_vma; > #endif > +#ifdef CONFIG_NUMA > + struct mempolicy *task_mempolicy; > +#endif > }; The mail client space-stuffed it. We merged this three days ago, in 9e7814404b77c3e8920b. Please check that it landed OK - there's a newline fixup in there but it looks good to me. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756355Ab2JVU5F (ORCPT ); Mon, 22 Oct 2012 16:57:05 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:34184 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756306Ab2JVU5A (ORCPT ); Mon, 22 Oct 2012 16:57:00 -0400 Date: Mon, 22 Oct 2012 13:56:56 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Kamezawa Hiroyuki cc: Linus Torvalds , Andrew Morton , Dave Jones , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps. In-Reply-To: <5084B3C3.3070906@jp.fujitsu.com> Message-ID: References: <20121017040515.GA13505@redhat.com> <20121017181413.GA16805@redhat.com> <20121017193229.GC16805@redhat.com> <20121017194501.GA24400@redhat.com> <507F803A.8000900@jp.fujitsu.com> <507F86BD.7070201@jp.fujitsu.com> <508110C4.6030805@jp.fujitsu.com> <5084B3C3.3070906@jp.fujitsu.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 22 Oct 2012, Kamezawa Hiroyuki wrote: > > Looks good, but the patch is whitespace damaged so it doesn't apply. When > > that's fixed: > > > > Acked-by: David Rientjes > > Sorry, I hope this one is not broken... Looks like Linus picked this up directly, thanks Kame! 
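
The approach that was merged (read task->mempolicy once and pin it under
task_lock() in ->start(), use the remembered pointer without the lock,
drop the reference in ->stop()) can be illustrated with a userspace sketch.
This is a minimal sketch under assumptions, not the kernel code: struct
policy, struct task, struct reader and every function below are
hypothetical stand-ins, an atomic_flag spinlock plays the role of
task_lock(), and policy_put() returns whether it freed the object purely so
the demo can observe it (the kernel's mpol_put() returns nothing).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mempolicy. */
struct policy {
    atomic_int refcnt;
    int mode;
};

/* Hypothetical stand-in for struct task_struct. */
struct task {
    atomic_flag lock;         /* plays the role of task_lock() */
    struct policy *policy;    /* plays the role of task->mempolicy */
};

/* Hypothetical stand-in for struct proc_maps_private. */
struct reader {
    struct task *task;
    struct policy *held;      /* like priv->task_mempolicy */
};

static void lock_task(struct task *t)
{
    while (atomic_flag_test_and_set(&t->lock))
        ;    /* spin */
}

static void unlock_task(struct task *t)
{
    atomic_flag_clear(&t->lock);
}

static void policy_get(struct policy *p)
{
    if (p)
        atomic_fetch_add(&p->refcnt, 1);
}

/* Returns 1 if this call dropped the last reference and freed p. */
static int policy_put(struct policy *p)
{
    if (p && atomic_fetch_sub(&p->refcnt, 1) == 1) {
        free(p);
        return 1;
    }
    return 0;
}

/* Like hold_task_mempolicy(): read the pointer once, pin it under the lock. */
static void reader_hold(struct reader *r)
{
    lock_task(r->task);
    r->held = r->task->policy;
    policy_get(r->held);
    unlock_task(r->task);
}

/* Like release_task_mempolicy(): drop the pinned reference. */
static int reader_release(struct reader *r)
{
    int freed = policy_put(r->held);

    r->held = NULL;
    return freed;
}

/* Like the exit path: unref and clear the pointer under the lock. */
static int task_drop_policy(struct task *t)
{
    int freed;

    lock_task(t);
    freed = policy_put(t->policy);
    t->policy = NULL;
    unlock_task(t);
    return freed;
}

/* start -> task exits -> show -> stop; returns the mode the reader saw. */
static int demo(void)
{
    struct task t = { .lock = ATOMIC_FLAG_INIT, .policy = NULL };
    struct reader r = { .task = &t, .held = NULL };
    struct policy *p = malloc(sizeof(*p));
    int mode, freed_by_exit, freed_by_reader;

    if (!p)
        return -1;
    atomic_init(&p->refcnt, 1);    /* the task's own reference */
    p->mode = 2;
    t.policy = p;

    reader_hold(&r);
    assert(atomic_load(&p->refcnt) == 2);

    freed_by_exit = task_drop_policy(&t);    /* the reader still pins p */
    mode = r.held->mode;                     /* safe without the lock */
    freed_by_reader = reader_release(&r);    /* last reference, p freed */

    return (freed_by_exit == 0 && freed_by_reader == 1) ? mode : -1;
}
```

The reader never dereferences task->policy after the hold step. It only
uses its own remembered pointer, which is why the V1->V2 change notes
accessing task->mempolicy once and remembering it.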
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756087Ab2JXXav (ORCPT ); Wed, 24 Oct 2012 19:30:51 -0400 Received: from mail-ie0-f174.google.com ([209.85.223.174]:55871 "EHLO mail-ie0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750934Ab2JXXat (ORCPT ); Wed, 24 Oct 2012 19:30:49 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> From: Sasha Levin Date: Wed, 24 Oct 2012 19:30:29 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: David Rientjes Cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 17, 2012 at 1:24 AM, David Rientjes wrote: > On Wed, 17 Oct 2012, Dave Jones wrote: > >> BUG: sleeping function called from invalid context at kernel/mutex.c:269 >> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2 >> 3 locks on stack by trinity-child2/8558: >> #0: held: (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [] seq_lseek+0x3f/0x120 >> #1: held: (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [] m_start+0xa7/0x190 >> #2: held: (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [] show_numa_map+0x14f/0x610 >> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32 >> Call Trace: >> [] __might_sleep+0x14c/0x200 >> [] mutex_lock_nested+0x2e/0x50 >> [] mpol_shared_policy_lookup+0x33/0x90 >> [] shmem_get_policy+0x33/0x40 >> [] get_vma_policy+0x3a/0x90 >> [] show_numa_map+0x163/0x610 >> [] ? pid_maps_open+0x20/0x20 >> [] ? 
pagemap_hugetlb_range+0xf0/0xf0
>> [] show_pid_numa_map+0x13/0x20
>> [] traverse+0xf2/0x230
>> [] seq_lseek+0xab/0x120
>> [] sys_lseek+0x7b/0xb0
>> [] tracesys+0xe1/0xe6
>>
>
> Hmm, looks like we need to change the refcount semantics entirely. We'll
> need to make get_vma_policy() always take a reference and then drop it
> accordingly. This works if get_vma_policy() can grab a reference while
> holding task_lock() for the task policy fallback case.
>
> Comments on this approach?
> ---

[snip]

I'm not sure about the status of the patch, but it doesn't apply on
top of -next, and I still see the warnings when fuzzing on -next.


Thanks,
Sasha

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757670Ab2JXXez (ORCPT ); Wed, 24 Oct 2012 19:34:55 -0400
Received: from mail-da0-f46.google.com ([209.85.210.46]:56973 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750802Ab2JXXex (ORCPT ); Wed, 24 Oct 2012 19:34:53 -0400
Date: Wed, 24 Oct 2012 16:34:50 -0700 (PDT)
From: David Rientjes
X-X-Sender: rientjes@chino.kir.corp.google.com
To: Sasha Levin
cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
In-Reply-To:
Message-ID:
References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com>
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 24 Oct 2012, Sasha Levin wrote:

> I'm not sure about the status of the patch, but it doesn't apply on
> top of -next, and I still
> see the warnings when fuzzing on -next.
> This should be fixed by 9e7814404b77 ("hold task->mempolicy while numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading /proc/pid/numa_maps on that kernel? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758922Ab2JXXpa (ORCPT ); Wed, 24 Oct 2012 19:45:30 -0400 Received: from mail-ia0-f174.google.com ([209.85.210.174]:54188 "EHLO mail-ia0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757777Ab2JXXp2 (ORCPT ); Wed, 24 Oct 2012 19:45:28 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> From: Sasha Levin Date: Wed, 24 Oct 2012 19:37:08 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: David Rientjes Cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 24, 2012 at 7:34 PM, David Rientjes wrote: > On Wed, 24 Oct 2012, Sasha Levin wrote: > >> I'm not sure about the status of the patch, but it doesn't apply on >> top of -next, and I still >> see the warnings when fuzzing on -next. >> > > This should be fixed by 9e7814404b77 ("hold task->mempolicy while > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading > /proc/pid/numa_maps on that kernel? I was actually referring to the warnings Dave Jones saw when fuzzing with trinity after the original patch was applied. 
I still see the following when fuzzing: [ 338.467156] BUG: sleeping function called from invalid context at kernel/mutex.c:269 [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main [ 338.481199] 2 locks held by trinity-main/6361: [ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x1e4/0x4f0 [ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [] handle_pte_fault+0x3f7/0x6a0 [ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 [ 338.530318] Call Trace: [ 338.534088] [] __might_sleep+0x1c3/0x1e0 [ 338.539358] [] mutex_lock_nested+0x29/0x50 [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 [ 338.545258] [] shmem_get_policy+0x2e/0x30 [ 338.545264] [] get_vma_policy+0x5a/0xa0 [ 338.545267] [] mpol_misplaced+0x41/0x1d0 [ 338.545272] [] handle_pte_fault+0x465/0x6a0 [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 [ 338.545282] [] handle_mm_fault+0x32a/0x360 [ 338.545286] [] __do_page_fault+0x480/0x4f0 [ 338.545293] [] ? del_timer+0x26/0x80 [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 [ 338.545309] [] ? 
rcu_eqs_exit+0x9c/0xb0 [ 338.545312] [] do_page_fault+0x26/0x40 [ 338.545317] [] do_async_page_fault+0x30/0xa0 [ 338.545321] [] async_page_fault+0x28/0x30 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934246Ab2JYAIQ (ORCPT ); Wed, 24 Oct 2012 20:08:16 -0400 Received: from mail-da0-f46.google.com ([209.85.210.46]:53698 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932178Ab2JYAIO (ORCPT ); Wed, 24 Oct 2012 20:08:14 -0400 Date: Wed, 24 Oct 2012 17:08:11 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Sasha Levin , Mel Gorman , Peter Zijlstra , Rik van Riel cc: Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 24 Oct 2012, Sasha Levin wrote: > > This should be fixed by 9e7814404b77 ("hold task->mempolicy while > > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading > > /proc/pid/numa_maps on that kernel? > > I was actually referring to the warnings Dave Jones saw when fuzzing > with trinity after the > original patch was applied. 
> > I still see the following when fuzzing: > > [ 338.467156] BUG: sleeping function called from invalid context at > kernel/mutex.c:269 > [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main > [ 338.481199] 2 locks held by trinity-main/6361: > [ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [] > __do_page_fault+0x1e4/0x4f0 > [ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: > [] handle_pte_fault+0x3f7/0x6a0 > [ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W > 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 > [ 338.530318] Call Trace: > [ 338.534088] [] __might_sleep+0x1c3/0x1e0 > [ 338.539358] [] mutex_lock_nested+0x29/0x50 > [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 > [ 338.545258] [] shmem_get_policy+0x2e/0x30 > [ 338.545264] [] get_vma_policy+0x5a/0xa0 > [ 338.545267] [] mpol_misplaced+0x41/0x1d0 > [ 338.545272] [] handle_pte_fault+0x465/0x6a0 > [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 > [ 338.545282] [] handle_mm_fault+0x32a/0x360 > [ 338.545286] [] __do_page_fault+0x480/0x4f0 > [ 338.545293] [] ? del_timer+0x26/0x80 > [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 > [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 > [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 > [ 338.545309] [] ? rcu_eqs_exit+0x9c/0xb0 > [ 338.545312] [] do_page_fault+0x26/0x40 > [ 338.545317] [] do_async_page_fault+0x30/0xa0 > [ 338.545321] [] async_page_fault+0x28/0x30 > Ok, this looks the same but it's actually a different issue: mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, calls get_vma_policy() which may take the shared policy mutex. This happens while holding page_table_lock from do_huge_pmd_numa_page() but also from do_numa_page() while holding a spinlock on the ptl, which is coming from the sched/numa branch. 
Is there any way that we can avoid changing the shared policy mutex back into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race in shared_policy_replace()"])? Adding Peter, Rik, and Mel to the cc. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759027Ab2JYAy4 (ORCPT ); Wed, 24 Oct 2012 20:54:56 -0400 Received: from mail-oa0-f46.google.com ([209.85.219.46]:36705 "EHLO mail-oa0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758959Ab2JYAyy (ORCPT ); Wed, 24 Oct 2012 20:54:54 -0400 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> From: KOSAKI Motohiro Date: Wed, 24 Oct 2012 20:54:33 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: David Rientjes Cc: Sasha Levin , Mel Gorman , Peter Zijlstra , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 24, 2012 at 8:08 PM, David Rientjes wrote: > On Wed, 24 Oct 2012, Sasha Levin wrote: > >> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while >> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading >> > /proc/pid/numa_maps on that kernel? >> >> I was actually referring to the warnings Dave Jones saw when fuzzing >> with trinity after the >> original patch was applied.
>> >> I still see the following when fuzzing: >> >> [ 338.467156] BUG: sleeping function called from invalid context at >> kernel/mutex.c:269 >> [ 338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main >> [ 338.481199] 2 locks held by trinity-main/6361: >> [ 338.486629] #0: (&mm->mmap_sem){++++++}, at: [] >> __do_page_fault+0x1e4/0x4f0 >> [ 338.498783] #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: >> [] handle_pte_fault+0x3f7/0x6a0 >> [ 338.511409] Pid: 6361, comm: trinity-main Tainted: G W >> 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 >> [ 338.530318] Call Trace: >> [ 338.534088] [] __might_sleep+0x1c3/0x1e0 >> [ 338.539358] [] mutex_lock_nested+0x29/0x50 >> [ 338.545253] [] mpol_shared_policy_lookup+0x2e/0x90 >> [ 338.545258] [] shmem_get_policy+0x2e/0x30 >> [ 338.545264] [] get_vma_policy+0x5a/0xa0 >> [ 338.545267] [] mpol_misplaced+0x41/0x1d0 >> [ 338.545272] [] handle_pte_fault+0x465/0x6a0 >> [ 338.545278] [] ? __rcu_read_unlock+0x44/0xb0 >> [ 338.545282] [] handle_mm_fault+0x32a/0x360 >> [ 338.545286] [] __do_page_fault+0x480/0x4f0 >> [ 338.545293] [] ? del_timer+0x26/0x80 >> [ 338.545298] [] ? rcu_cleanup_after_idle+0x23/0x170 >> [ 338.545302] [] ? rcu_eqs_exit_common+0x64/0x3a0 >> [ 338.545305] [] ? rcu_eqs_enter_common+0x7c6/0x970 >> [ 338.545309] [] ? rcu_eqs_exit+0x9c/0xb0 >> [ 338.545312] [] do_page_fault+0x26/0x40 >> [ 338.545317] [] do_async_page_fault+0x30/0xa0 >> [ 338.545321] [] async_page_fault+0x28/0x30 >> > > Ok, this looks the same but it's actually a different issue: > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > calls get_vma_policy() which may take the shared policy mutex. This > happens while holding page_table_lock from do_huge_pmd_numa_page() but > also from do_numa_page() while holding a spinlock on the ptl, which is > coming from the sched/numa branch. 
> > Is there any way that we can avoid changing the shared policy mutex back > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > in shared_policy_replace()"])? > > Adding Peter, Rik, and Mel to the cc. Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter, I guess you committed it, right? If so, may I review your mempolicy changes? Now mempolicy has a lot of horrible buggy code and I hope to maintain it carefully. Which tree should I look at? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759029Ab2JYBPR (ORCPT ); Wed, 24 Oct 2012 21:15:17 -0400 Received: from mail-da0-f46.google.com ([209.85.210.46]:65411 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756807Ab2JYBPO (ORCPT ); Wed, 24 Oct 2012 21:15:14 -0400 Date: Wed, 24 Oct 2012 18:15:11 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: KOSAKI Motohiro cc: Sasha Levin , Mel Gorman , Peter Zijlstra , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 24 Oct 2012, KOSAKI Motohiro wrote: > Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter, > I guess you committed it, right? If so, may I review your mempolicy > changes? Now mempolicy has a lot of horrible buggy code and I hope to > maintain it carefully. Which tree should I look at?
> Check out sched/numa from git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git $ git diff v3.7-rc2.. mm/mempolicy.c | diffstat mempolicy.c | 444 +++++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 277 insertions(+), 167 deletions(-) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161055Ab2JYMUa (ORCPT ); Thu, 25 Oct 2012 08:20:30 -0400 Received: from casper.infradead.org ([85.118.1.10]:50729 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757887Ab2JYMU0 convert rfc822-to-8bit (ORCPT ); Thu, 25 Oct 2012 08:20:26 -0400 Message-ID: <1351167554.23337.14.camel@twins> Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps From: Peter Zijlstra To: David Rientjes Cc: Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Thu, 25 Oct 2012 14:19:14 +0200 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Mailer: Evolution 3.2.2- Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: > Ok, this looks the same but it's actually a different issue: > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > calls get_vma_policy() which may take the shared policy mutex. This > happens while holding page_table_lock from do_huge_pmd_numa_page() but > also from do_numa_page() while holding a spinlock on the ptl, which is > coming from the sched/numa branch. 
> > Is there any way that we can avoid changing the shared policy mutex back > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > in shared_policy_replace()"])? > > Adding Peter, Rik, and Mel to the cc. Urgh, crud I totally missed that. So the problem is that we need to compute if the current page is placed 'right' while holding pte_lock in order to avoid multiple pte_lock acquisitions on the 'fast' path. I'll look into this in a bit, but one thing that comes to mind is having both a spinlock and a mutex and require holding both for modification while either one is sufficient for read. That would allow sp_lookup() to use the spinlock, while insert and replace can hold both. Not sure it will work for this, need to stare at this code a little more. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933921Ab2JYOkA (ORCPT ); Thu, 25 Oct 2012 10:40:00 -0400 Received: from casper.infradead.org ([85.118.1.10]:53806 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932930Ab2JYOj6 convert rfc822-to-8bit (ORCPT ); Thu, 25 Oct 2012 10:39:58 -0400 Message-ID: <1351175972.12171.14.camel@twins> Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps From: Peter Zijlstra To: David Rientjes Cc: Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Thu, 25 Oct 2012 16:39:32 +0200 In-Reply-To: <1351167554.23337.14.camel@twins> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Mailer: Evolution 3.2.2- Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote: > On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: > > Ok, this looks the same but it's actually a different issue: > > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, > > calls get_vma_policy() which may take the shared policy mutex. This > > happens while holding page_table_lock from do_huge_pmd_numa_page() but > > also from do_numa_page() while holding a spinlock on the ptl, which is > > coming from the sched/numa branch. > > > > Is there any way that we can avoid changing the shared policy mutex back > > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race > > in shared_policy_replace()"])? > > > > Adding Peter, Rik, and Mel to the cc. > > Urgh, crud I totally missed that. > > So the problem is that we need to compute if the current page is placed > 'right' while holding pte_lock in order to avoid multiple pte_lock > acquisitions on the 'fast' path. > > I'll look into this in a bit, but one thing that comes to mind is having > both a spinlock and a mutex and require holding both for modification > while either one is sufficient for read. > > That would allow sp_lookup() to use the spinlock, while insert and > replace can hold both. > > Not sure it will work for this, need to stare at this code a little > more. So I think the below should work, we hold the spinlock over both rb-tree modification as sp free, this makes mpol_shared_policy_lookup() which returns the policy with an incremented refcount work with just the spinlock. Comments?
---
 include/linux/mempolicy.h |  1 +
 mm/mempolicy.c            | 23 ++++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,7 @@ struct sp_node {
 
 struct shared_policy {
 	struct rb_root root;
+	spinlock_t lock;
 	struct mutex mutex;
 };
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2099,12 +2099,20 @@ bool __mpol_equal(struct mempolicy *a, s
  *
  * Remember policies even when nobody has shared memory mapped.
  * The policies are kept in Red-Black tree linked from the inode.
- * They are protected by the sp->lock spinlock, which should be held
- * for any accesses to the tree.
+ *
+ * The rb-tree is locked using both a mutex and a spinlock. Every modification
+ * to the tree must hold both the mutex and the spinlock, lookups can hold
+ * either to observe a stable tree.
+ *
+ * In particular, sp_insert() and sp_delete() take the spinlock, whereas
+ * sp_lookup() doesn't, this so users have choice.
+ *
+ * shared_policy_replace() and mpol_free_shared_policy() take the mutex
+ * and call sp_insert(), sp_delete().
  */
 
 /* lookup first element intersecting start-end */
-/* Caller holds sp->mutex */
+/* Caller holds either sp->lock and/or sp->mutex */
 static struct sp_node *
 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
 {
@@ -2143,6 +2151,7 @@ static void sp_insert(struct shared_poli
 	struct rb_node *parent = NULL;
 	struct sp_node *nd;
 
+	spin_lock(&sp->lock);
 	while (*p) {
 		parent = *p;
 		nd = rb_entry(parent, struct sp_node, nd);
@@ -2155,6 +2164,7 @@ static void sp_insert(struct shared_poli
 	}
 	rb_link_node(&new->nd, parent, p);
 	rb_insert_color(&new->nd, &sp->root);
+	spin_unlock(&sp->lock);
 	pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
 		 new->policy ? new->policy->mode : 0);
 }
@@ -2168,13 +2178,13 @@ mpol_shared_policy_lookup(struct shared_
 
 	if (!sp->root.rb_node)
 		return NULL;
-	mutex_lock(&sp->mutex);
+	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
 	if (sn) {
 		mpol_get(sn->policy);
 		pol = sn->policy;
 	}
-	mutex_unlock(&sp->mutex);
+	spin_unlock(&sp->lock);
 	return pol;
 }
 
@@ -2295,8 +2305,10 @@ int mpol_misplaced(struct page *page, st
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
+	spin_lock(&sp->lock);
 	rb_erase(&n->nd, &sp->root);
 	sp_free(n);
+	spin_unlock(&sp->lock);
 }
 
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
@@ -2381,6 +2393,7 @@ void mpol_shared_policy_init(struct shar
 	int ret;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
+	spin_lock_init(&sp->lock);
 	mutex_init(&sp->mutex);
 
 	if (mpol) {

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2992489Ab2JYRYG (ORCPT ); Thu, 25 Oct 2012 13:24:06 -0400 Received: from mail-vb0-f46.google.com ([209.85.212.46]:46078 "EHLO mail-vb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935912Ab2JYRYD (ORCPT ); Thu, 25 Oct 2012 13:24:03 -0400 Message-ID: <508975A4.50203@gmail.com> Date: Thu, 25 Oct 2012 13:23:48 -0400 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121024 Thunderbird/16.0.1 MIME-Version: 1.0 To: Peter Zijlstra CC: David Rientjes , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins>
In-Reply-To: <1351175972.12171.14.camel@twins> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/25/2012 10:39 AM, Peter Zijlstra wrote: > On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote: >> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote: >>> Ok, this looks the same but it's actually a different issue: >>> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, >>> calls get_vma_policy() which may take the shared policy mutex. This >>> happens while holding page_table_lock from do_huge_pmd_numa_page() but >>> also from do_numa_page() while holding a spinlock on the ptl, which is >>> coming from the sched/numa branch. >>> >>> Is there any way that we can avoid changing the shared policy mutex back >>> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race >>> in shared_policy_replace()"])? >>> >>> Adding Peter, Rik, and Mel to the cc. >> >> Urgh, crud I totally missed that. >> >> So the problem is that we need to compute if the current page is placed >> 'right' while holding pte_lock in order to avoid multiple pte_lock >> acquisitions on the 'fast' path. >> >> I'll look into this in a bit, but one thing that comes to mind is having >> both a spinlock and a mutex and require holding both for modification >> while either one is sufficient for read. >> >> That would allow sp_lookup() to use the spinlock, while insert and >> replace can hold both. >> >> Not sure it will work for this, need to stare at this code a little >> more. > > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? > > --- It made the warnings I've reported go away.
Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2992673Ab2JYUWG (ORCPT ); Thu, 25 Oct 2012 16:22:06 -0400 Received: from mail-da0-f46.google.com ([209.85.210.46]:44994 "EHLO mail-da0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2992660Ab2JYUWC (ORCPT ); Thu, 25 Oct 2012 16:22:02 -0400 Date: Thu, 25 Oct 2012 13:22:00 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Peter Zijlstra cc: Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , Linus Torvalds , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps In-Reply-To: <1351175972.12171.14.camel@twins> Message-ID: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 25 Oct 2012, Peter Zijlstra wrote: > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? > It's rather unfortunate that we need to protect modification with a spinlock and a mutex but since sharing was removed in commit 869833f2c5c6 ("mempolicy: remove mempolicy sharing") it requires that sp_alloc() is blockable to do the whole mpol_new() and rebind if necessary, which could require mm->mmap_sem; it's not as simple as just converting all the allocations to GFP_ATOMIC. 
It looks as though there is no other alternative other than protecting modification with both the spinlock and mutex, which is a clever solution, so it looks good to me, thanks! From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752029Ab2JYXKL (ORCPT ); Thu, 25 Oct 2012 19:10:11 -0400 Received: from mail-we0-f174.google.com ([74.125.82.174]:34749 "EHLO mail-we0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751568Ab2JYXKJ (ORCPT ); Thu, 25 Oct 2012 19:10:09 -0400 MIME-Version: 1.0 In-Reply-To: <1351175972.12171.14.camel@twins> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> From: Linus Torvalds Date: Thu, 25 Oct 2012 16:09:48 -0700 X-Google-Sender-Auth: Fnc2V7B0ASym2GUwQWE1z-grVRQ Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: Peter Zijlstra Cc: David Rientjes , Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: > > So I think the below should work, we hold the spinlock over both rb-tree > modification as sp free, this makes mpol_shared_policy_lookup() which > returns the policy with an incremented refcount work with just the > spinlock. > > Comments? Looks reasonable, if annoyingly complex for something that shouldn't be important enough for this. Oh well. However, please check me on this: the need for this is only for linux-next right now, correct? All the current users in my tree are ok with just the mutex, no? 
Linus From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757508Ab2JZItU (ORCPT ); Fri, 26 Oct 2012 04:49:20 -0400 Received: from casper.infradead.org ([85.118.1.10]:39208 "EHLO casper.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755496Ab2JZItQ convert rfc822-to-8bit (ORCPT ); Fri, 26 Oct 2012 04:49:16 -0400 Message-ID: <1351241323.12171.43.camel@twins> Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps From: Peter Zijlstra To: Linus Torvalds Cc: David Rientjes , Sasha Levin , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Date: Fri, 26 Oct 2012 10:48:43 +0200 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Mailer: Evolution 3.2.2- Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: > On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: > > > > So I think the below should work, we hold the spinlock over both rb-tree > > modification as sp free, this makes mpol_shared_policy_lookup() which > > returns the policy with an incremented refcount work with just the > > spinlock. > > > > Comments? > > Looks reasonable, if annoyingly complex for something that shouldn't > be important enough for this. Oh well. I agree with that.. It's just that when doing numa placement one needs to respect the pre-existing placement constraints. I've not seen a way around this.
> However, please check me on this: the need for this is only for > linux-next right now, correct? All the current users in my tree are ok > with just the mutex, no? Yes, the need comes from the numa stuff and I'll stick this patch in there. I completely missed Mel's patch turning it into a mutex, but I guess that's what -next is for :-). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759543Ab2JaSaW (ORCPT ); Wed, 31 Oct 2012 14:30:22 -0400 Received: from mail-ia0-f174.google.com ([209.85.210.174]:55142 "EHLO mail-ia0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751269Ab2JaSaU (ORCPT ); Wed, 31 Oct 2012 14:30:20 -0400 MIME-Version: 1.0 In-Reply-To: <1351241323.12171.43.camel@twins> References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> <1351241323.12171.43.camel@twins> From: Sasha Levin Date: Wed, 31 Oct 2012 14:29:59 -0400 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: Peter Zijlstra Cc: Linus Torvalds , David Rientjes , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra wrote: > On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: >> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: >> > >> > So I think the below should work, we hold the spinlock over both rb-tree >> > modification as sp free, this makes mpol_shared_policy_lookup() which >> > returns the policy with an incremented refcount work with just the >> > spinlock. 
>> > >> > Comments? >> >> Looks reasonable, if annoyingly complex for something that shouldn't >> be important enough for this. Oh well. > > I agree with that.. It's just that when doing numa placement one needs to > respect the pre-existing placement constraints. I've not seen a way > around this. > >> However, please check me on this: the need for this is only for >> linux-next right now, correct? All the current users in my tree are ok >> with just the mutex, no? > > Yes, the need comes from the numa stuff and I'll stick this patch in > there. > > I completely missed Mel's patch turning it into a mutex, but I guess > that's what -next is for :-). So I've been fuzzing with it for the past couple of days and it's been looking fine with it. Can someone grab it into his tree please? Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753343Ab2KUBAY (ORCPT ); Tue, 20 Nov 2012 20:00:24 -0500 Received: from mail-ie0-f174.google.com ([209.85.223.174]:35088 "EHLO mail-ie0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752414Ab2KUBAW (ORCPT ); Tue, 20 Nov 2012 20:00:22 -0500 MIME-Version: 1.0 In-Reply-To: References: <20121008150949.GA15130@redhat.com> <20121017040515.GA13505@redhat.com> <1351167554.23337.14.camel@twins> <1351175972.12171.14.camel@twins> <1351241323.12171.43.camel@twins> From: Sasha Levin Date: Tue, 20 Nov 2012 19:59:57 -0500 Message-ID: Subject: Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps To: Peter Zijlstra Cc: Linus Torvalds , David Rientjes , Mel Gorman , Rik van Riel , Dave Jones , Andrew Morton , KOSAKI Motohiro , bhutchings@solarflare.com, Konstantin Khlebnikov , Naoya Horiguchi , Hugh Dickins , KAMEZAWA Hiroyuki , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List:
linux-kernel@vger.kernel.org Ping? Can someone take it before it's lost? On Wed, Oct 31, 2012 at 2:29 PM, Sasha Levin wrote: > On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra wrote: >> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote: >>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra wrote: >>> > >>> > So I think the below should work, we hold the spinlock over both rb-tree >>> > modification as sp free, this makes mpol_shared_policy_lookup() which >>> > returns the policy with an incremented refcount work with just the >>> > spinlock. >>> > >>> > Comments? >>> >>> Looks reasonable, if annoyingly complex for something that shouldn't >>> be important enough for this. Oh well. >> >> I agree with that.. It's just that when doing numa placement one needs to >> respect the pre-existing placement constraints. I've not seen a way >> around this. >> >>> However, please check me on this: the need for this is only for >>> linux-next right now, correct? All the current users in my tree are ok >>> with just the mutex, no? >> >> Yes, the need comes from the numa stuff and I'll stick this patch in >> there. >> >> I completely missed Mel's patch turning it into a mutex, but I guess >> that's what -next is for :-). > > So I've been fuzzing with it for the past couple of days and it's been > looking fine with it. Can someone grab it into his tree please? > > > Thanks, > Sasha
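The locking rule discussed in this thread (every modification holds both the mutex and the spinlock, while a lookup may hold either one) can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: struct range_node, struct shared_ranges, ranges_init(), ranges_set() and ranges_lookup() are hypothetical names, a linked list stands in for the rb-tree, a C11 atomic_flag plays the role of sp->lock, and a pthread_mutex_t plays the role of sp->mutex. Node freeing and reference counting are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sp_node; a list replaces the rb-tree. */
struct range_node {
	unsigned long start, end;	/* covers [start, end) */
	int policy;			/* stands in for struct mempolicy * */
	struct range_node *next;
};

/* Hypothetical stand-in for struct shared_policy. */
struct shared_ranges {
	struct range_node *head;
	atomic_flag lock;		/* role of sp->lock: never sleeps */
	pthread_mutex_t mutex;		/* role of sp->mutex: may sleep */
};

void spin_acquire(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait instead of sleeping */
}

void spin_release(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

void ranges_init(struct shared_ranges *sr)
{
	sr->head = NULL;
	atomic_flag_clear(&sr->lock);
	pthread_mutex_init(&sr->mutex, NULL);
}

/*
 * Writer: the mutex serializes writers and may be held across a sleeping
 * allocation; the spinlock is taken only around the pointer update, so
 * the structure changes only while both locks are held.
 */
int ranges_set(struct shared_ranges *sr, unsigned long start,
	       unsigned long end, int policy)
{
	struct range_node *n;

	assert(start < end);
	pthread_mutex_lock(&sr->mutex);
	n = malloc(sizeof(*n));		/* may block; fine under a mutex */
	if (!n) {
		pthread_mutex_unlock(&sr->mutex);
		return -1;
	}
	n->start = start;
	n->end = end;
	n->policy = policy;

	spin_acquire(&sr->lock);
	n->next = sr->head;
	sr->head = n;
	spin_release(&sr->lock);

	pthread_mutex_unlock(&sr->mutex);
	return 0;
}

/* Reader: spinlock only, so it is usable where sleeping is not allowed. */
int ranges_lookup(struct shared_ranges *sr, unsigned long idx, int dflt)
{
	struct range_node *n;
	int policy = dflt;

	spin_acquire(&sr->lock);
	for (n = sr->head; n; n = n->next) {
		if (idx >= n->start && idx < n->end) {
			policy = n->policy;
			break;
		}
	}
	spin_release(&sr->lock);
	return policy;
}
```

The design point is the same as in the thread: the reader path never touches the sleeping lock, so it can run from a context that already holds a spinlock, while writers keep the mutex for the parts that may block.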