All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Davidlohr Bueso <davidlohr@hp.com>, Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Michel Lespinasse <walken@google.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Rik van Riel <riel@redhat.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	aswin@hp.com, linux-mm <linux-mm@kvack.org>,
	linux-kernel@vger.kernel.org,
	Russell King <linux@arm.linux.org.uk>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	Richard Kuo <rkuo@codeaurora.org>,
	Ralf Baechle <ralf@linux-mips.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Paul Mundt <lethal@linux-sh.org>,
	Chris Metcalf <cmetcalf@tilera.com>,
	Jeff Dike <jdike@addtoit.com>,
	Richard Weinberger <richard@nod.at>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>
Subject: [PATCH 4/3] x86/vdso: Optimize setup_additional_pages()
Date: Fri, 18 Oct 2013 08:05:01 +0200	[thread overview]
Message-ID: <20131018060501.GA3411@gmail.com> (raw)
In-Reply-To: <1382057438-3306-4-git-send-email-davidlohr@hp.com>


* Davidlohr Bueso <davidlohr@hp.com> wrote:

> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -154,12 +154,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
>  				  unsigned size)
>  {
>  	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma;
>  	unsigned long addr;
>  	int ret;
>  
>  	if (!vdso_enabled)
>  		return 0;
>  
> +	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> +	if (unlikely(!vma))
> +		return -ENOMEM;
> +
>  	down_write(&mm->mmap_sem);
>  	addr = vdso_addr(mm->start_stack, size);
>  	addr = get_unmapped_area(NULL, addr, size, 0, 0);
> @@ -173,14 +178,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
>  	ret = install_special_mapping(mm, addr, size,
>  				      VM_READ|VM_EXEC|
>  				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> -				      pages);
> +				      pages, &vma);
>  	if (ret) {
>  		current->mm->context.vdso = NULL;
>  		goto up_fail;
>  	}
>  
> +	up_write(&mm->mmap_sem);
> +	return ret;
>  up_fail:
>  	up_write(&mm->mmap_sem);
> +	kmem_cache_free(vm_area_cachep, vma);
>  	return ret;
>  }
>  

1)

Beyond the simplification that Linus suggested, why not introduce a new 
function, named 'install_special_mapping_vma()' or so, and convert 
architectures one by one, without pressure to get it all done (and all 
correct) in a single patch?

2)

I don't see the justification: this code gets executed in exec() where a 
new mm has just been allocated. There's only a single user of the mm and 
thus the critical section width of mmap_sem is more or less irrelevant.

mmap_sem critical section size only matters for codepaths that threaded 
programs can hit.

3)

But, if we do all that, a couple of other (micro-)optimizations are 
possible in setup_additional_pages() as well:

 - vdso_addr(), which is actually much _more_ expensive than kmalloc() 
   because on most distros it will call into the RNG, can also be done 
   outside the mmap_sem.

 - the error paths can all be merged and the common case can be made 
   fall-through.

 - use 'mm' consistently instead of repeating 'current->mm'

 - set 'mm->context.vdso' only once we know it's all a success, and do it 
   outside the lock

 - add a few comments about which operations are locked, which unlocked, 
   and why. Please double check the assumptions I documented there.

See the diff attached below. (Totally untested and all that.)

Also note that I think, in theory, if exec() guaranteed the privacy and 
single threadedness of the new mm, we could probably do _all_ of this 
unlocked. Right now I don't think this is guaranteed: ptrace() users might 
look up the new PID and might interfere on the MM via 
PTRACE_PEEK*/PTRACE_POKE*.

( Furthermore, /proc/ might also give early access to aspects of the mm - 
  although no manipulation of the mm is possible there. )

If such privacy of the new mm was guaranteed then that would also remove 
the need to move the allocation out of install_special_mapping().

But, I don't think it all matters, due to #2 - and your changes actively 
complicate setup_pages(), which makes this security sensitive code a bit 
more fragile. We can still do it out of sheer principle, I just don't see 
where it's supposed to help scale better.

Thanks,

	Ingo

 arch/x86/vdso/vma.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 431e875..c590387 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -157,30 +157,44 @@ static int setup_additional_pages(struct linux_binprm *bprm,
 	unsigned long addr;
 	int ret;
 
-	if (!vdso_enabled)
+	if (unlikely(!vdso_enabled))
 		return 0;
 
-	down_write(&mm->mmap_sem);
+	/*
+	 * Do this outside the MM lock - we are in exec() with a new MM,
+	 * nobody else can use these fields of the mm:
+	 */
 	addr = vdso_addr(mm->start_stack, size);
-	addr = get_unmapped_area(NULL, addr, size, 0, 0);
-	if (IS_ERR_VALUE(addr)) {
-		ret = addr;
-		goto up_fail;
-	}
 
-	current->mm->context.vdso = (void *)addr;
+	/*
+	 * This must be done under the MM lock - there might be parallel
+	 * accesses to this mm, such as ptrace().
+	 *
+	 * [ This could be further optimized if exec() reliably inhibited
+	 *   all early access to the mm. ]
+	 */
+	down_write(&mm->mmap_sem);
+	addr = get_unmapped_area(NULL, addr, size, 0, 0);
+	if (IS_ERR_VALUE(addr))
+		goto up_fail_addr;
 
 	ret = install_special_mapping(mm, addr, size,
 				      VM_READ|VM_EXEC|
 				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
 				      pages);
-	if (ret) {
-		current->mm->context.vdso = NULL;
-		goto up_fail;
-	}
+	up_write(&mm->mmap_sem);
+	if (ret)
+		goto fail;
 
-up_fail:
+	mm->context.vdso = (void *)addr;
+	return ret;
+
+up_fail_addr:
+	ret = addr;
 	up_write(&mm->mmap_sem);
+fail:
+	mm->context.vdso = NULL;
+
 	return ret;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Ingo Molnar <mingo@kernel.org>
To: Davidlohr Bueso <davidlohr@hp.com>, Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Michel Lespinasse <walken@google.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Rik van Riel <riel@redhat.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	aswin@hp.com, linux-mm <linux-mm@kvack.org>,
	linux-kernel@vger.kernel.org,
	Russell King <linux@arm.linux.org.uk>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	Richard Kuo <rkuo@codeaurora.org>,
	Ralf Baechle <ralf@linux-mips.org>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Paul Mundt <lethal@linux-sh.org>,
	Chris Metcalf <cmetcalf@tilera.com>,
	Jeff Dike <jdike@addtoit.com>,
	Richard Weinberger <richard@nod.at>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>
Subject: [PATCH 4/3] x86/vdso: Optimize setup_additional_pages()
Date: Fri, 18 Oct 2013 08:05:01 +0200	[thread overview]
Message-ID: <20131018060501.GA3411@gmail.com> (raw)
In-Reply-To: <1382057438-3306-4-git-send-email-davidlohr@hp.com>


* Davidlohr Bueso <davidlohr@hp.com> wrote:

> --- a/arch/x86/vdso/vma.c
> +++ b/arch/x86/vdso/vma.c
> @@ -154,12 +154,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
>  				  unsigned size)
>  {
>  	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma;
>  	unsigned long addr;
>  	int ret;
>  
>  	if (!vdso_enabled)
>  		return 0;
>  
> +	vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> +	if (unlikely(!vma))
> +		return -ENOMEM;
> +
>  	down_write(&mm->mmap_sem);
>  	addr = vdso_addr(mm->start_stack, size);
>  	addr = get_unmapped_area(NULL, addr, size, 0, 0);
> @@ -173,14 +178,17 @@ static int setup_additional_pages(struct linux_binprm *bprm,
>  	ret = install_special_mapping(mm, addr, size,
>  				      VM_READ|VM_EXEC|
>  				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> -				      pages);
> +				      pages, &vma);
>  	if (ret) {
>  		current->mm->context.vdso = NULL;
>  		goto up_fail;
>  	}
>  
> +	up_write(&mm->mmap_sem);
> +	return ret;
>  up_fail:
>  	up_write(&mm->mmap_sem);
> +	kmem_cache_free(vm_area_cachep, vma);
>  	return ret;
>  }
>  

1)

Beyond the simplification that Linus suggested, why not introduce a new 
function, named 'install_special_mapping_vma()' or so, and convert 
architectures one by one, without pressure to get it all done (and all 
correct) in a single patch?

2)

I don't see the justification: this code gets executed in exec() where a 
new mm has just been allocated. There's only a single user of the mm and 
thus the critical section width of mmap_sem is more or less irrelevant.

mmap_sem critical section size only matters for codepaths that threaded 
programs can hit.

3)

But, if we do all that, a couple of other (micro-)optimizations are 
possible in setup_additional_pages() as well:

 - vdso_addr(), which is actually much _more_ expensive than kmalloc() 
   because on most distros it will call into the RNG, can also be done 
   outside the mmap_sem.

 - the error paths can all be merged and the common case can be made 
   fall-through.

 - use 'mm' consistently instead of repeating 'current->mm'

 - set 'mm->context.vdso' only once we know it's all a success, and do it 
   outside the lock

 - add a few comments about which operations are locked, which unlocked, 
   and why. Please double check the assumptions I documented there.

See the diff attached below. (Totally untested and all that.)

Also note that I think, in theory, if exec() guaranteed the privacy and 
single threadedness of the new mm, we could probably do _all_ of this 
unlocked. Right now I don't think this is guaranteed: ptrace() users might 
look up the new PID and might interfere on the MM via 
PTRACE_PEEK*/PTRACE_POKE*.

( Furthermore, /proc/ might also give early access to aspects of the mm - 
  although no manipulation of the mm is possible there. )

If such privacy of the new mm was guaranteed then that would also remove 
the need to move the allocation out of install_special_mapping().

But, I don't think it all matters, due to #2 - and your changes actively 
complicate setup_pages(), which makes this security sensitive code a bit 
more fragile. We can still do it out of sheer principle, I just don't see 
where it's supposed to help scale better.

Thanks,

	Ingo

 arch/x86/vdso/vma.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 431e875..c590387 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -157,30 +157,44 @@ static int setup_additional_pages(struct linux_binprm *bprm,
 	unsigned long addr;
 	int ret;
 
-	if (!vdso_enabled)
+	if (unlikely(!vdso_enabled))
 		return 0;
 
-	down_write(&mm->mmap_sem);
+	/*
+	 * Do this outside the MM lock - we are in exec() with a new MM,
+	 * nobody else can use these fields of the mm:
+	 */
 	addr = vdso_addr(mm->start_stack, size);
-	addr = get_unmapped_area(NULL, addr, size, 0, 0);
-	if (IS_ERR_VALUE(addr)) {
-		ret = addr;
-		goto up_fail;
-	}
 
-	current->mm->context.vdso = (void *)addr;
+	/*
+	 * This must be done under the MM lock - there might be parallel
+	 * accesses to this mm, such as ptrace().
+	 *
+	 * [ This could be further optimized if exec() reliably inhibited
+	 *   all early access to the mm. ]
+	 */
+	down_write(&mm->mmap_sem);
+	addr = get_unmapped_area(NULL, addr, size, 0, 0);
+	if (IS_ERR_VALUE(addr))
+		goto up_fail_addr;
 
 	ret = install_special_mapping(mm, addr, size,
 				      VM_READ|VM_EXEC|
 				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
 				      pages);
-	if (ret) {
-		current->mm->context.vdso = NULL;
-		goto up_fail;
-	}
+	up_write(&mm->mmap_sem);
+	if (ret)
+		goto fail;
 
-up_fail:
+	mm->context.vdso = (void *)addr;
+	return ret;
+
+up_fail_addr:
+	ret = addr;
 	up_write(&mm->mmap_sem);
+fail:
+	mm->context.vdso = NULL;
+
 	return ret;
 }
 

  parent reply	other threads:[~2013-10-18  6:05 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-18  0:50 [PATCH 0/3] mm,vdso: preallocate new vmas Davidlohr Bueso
2013-10-18  0:50 ` Davidlohr Bueso
2013-10-18  0:50 ` [PATCH 1/3] mm: add mlock_future_check helper Davidlohr Bueso
2013-10-18  0:50   ` Davidlohr Bueso
2013-10-23  9:30   ` walken
2013-10-23  9:30     ` walken
2013-10-18  0:50 ` [PATCH 2/3] mm/mlock: prepare params outside critical region Davidlohr Bueso
2013-10-18  0:50   ` Davidlohr Bueso
2013-10-23  9:33   ` walken
2013-10-23  9:33     ` walken
2013-10-23  9:46   ` Vlastimil Babka
2013-10-23  9:46     ` Vlastimil Babka
2013-10-18  0:50 ` [PATCH 3/3] vdso: preallocate new vmas Davidlohr Bueso
2013-10-18  0:50   ` Davidlohr Bueso
2013-10-18  1:17   ` Linus Torvalds
2013-10-18  1:17     ` Linus Torvalds
2013-10-18  5:59   ` Richard Weinberger
2013-10-18  5:59     ` Richard Weinberger
2013-10-18  6:05   ` Ingo Molnar [this message]
2013-10-18  6:05     ` [PATCH 4/3] x86/vdso: Optimize setup_additional_pages() Ingo Molnar
2013-10-21  3:52     ` Davidlohr Bueso
2013-10-21  3:52       ` Davidlohr Bueso
2013-10-21  5:27       ` Ingo Molnar
2013-10-21  5:27         ` Ingo Molnar
2013-10-21  3:26   ` [PATCH 3/3] vdso: preallocate new vmas Davidlohr Bueso
2013-10-21  3:26     ` Davidlohr Bueso
2013-10-23  9:53     ` walken
2013-10-23  9:53       ` walken
2013-10-25  0:55       ` Davidlohr Bueso
2013-10-25  0:55         ` Davidlohr Bueso
2013-10-22 15:48 ` [PATCH 0/3] mm,vdso: " walken
2013-10-22 15:48   ` walken
2013-10-22 16:20   ` Linus Torvalds
2013-10-22 16:20     ` Linus Torvalds
2013-10-22 17:04     ` Michel Lespinasse
2013-10-22 17:04       ` Michel Lespinasse
2013-10-22 17:54   ` Andy Lutomirski
2013-10-22 17:54     ` Andy Lutomirski
2013-10-23 10:13     ` Michel Lespinasse
2013-10-23 10:13       ` Michel Lespinasse
2013-10-23 21:42       ` Andy Lutomirski
2013-10-23 21:42         ` Andy Lutomirski
2013-10-23  2:46   ` Davidlohr Bueso
2013-10-23  2:46     ` Davidlohr Bueso
2013-11-05  0:39 ` Davidlohr Bueso
2013-11-05  0:39   ` Davidlohr Bueso

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131018060501.GA3411@gmail.com \
    --to=mingo@kernel.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=aswin@hp.com \
    --cc=benh@kernel.crashing.org \
    --cc=catalin.marinas@arm.com \
    --cc=cmetcalf@tilera.com \
    --cc=davidlohr@hp.com \
    --cc=heiko.carstens@de.ibm.com \
    --cc=hpa@zytor.com \
    --cc=jdike@addtoit.com \
    --cc=lethal@linux-sh.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux@arm.linux.org.uk \
    --cc=mingo@redhat.com \
    --cc=paulus@samba.org \
    --cc=ralf@linux-mips.org \
    --cc=richard@nod.at \
    --cc=riel@redhat.com \
    --cc=rkuo@codeaurora.org \
    --cc=schwidefsky@de.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=walken@google.com \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.