[patch 1/2] x86: implement pte

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch 1/2] x86: implement pte_special
@ 2008-05-21 11:59 Nick Piggin
  2008-05-21 12:11 ` [patch 2/2] mm: lockless get_user_pages Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2008-05-21 11:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: shaggy, axboe, torvalds, linux-mm, linux-arch

Implement the pte_special bit for x86. This is required to support lockless
get_user_pages, because we need to know whether or not we can refcount a
particular page given only its pte (and no vma).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: axboe@oracle.com
Cc: torvalds@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
---
Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h
+++ linux-2.6/include/asm-x86/pgtable.h
@@ -17,6 +17,7 @@
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
+#define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
@@ -39,6 +40,8 @@
 #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
 #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
 #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
+#define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
+#define __HAVE_ARCH_PTE_SPECIAL
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
@@ -198,7 +201,7 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-	return 0;
+	return pte_val(pte) & _PAGE_SPECIAL;
 }
 
 static inline int pmd_large(pmd_t pte)
@@ -264,7 +267,7 @@ static inline pte_t pte_clrglobal(pte_t 
 
 static inline pte_t pte_mkspecial(pte_t pte)
 {
-	return pte;
+	return __pte(pte_val(pte) | _PAGE_SPECIAL);
 }
 
 extern pteval_t __supported_pte_mask;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 2/2] mm: lockless get_user_pages
  2008-05-21 11:59 [patch 1/2] x86: implement pte_special Nick Piggin
@ 2008-05-21 12:11 ` Nick Piggin
  2008-05-22 10:28   ` Andy Whitcroft
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2008-05-21 12:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: shaggy, axboe, torvalds, linux-mm, linux-arch

I'd like to propose lockless get_user_pages for merge in -mm. Since last
posted, this includes Linus's scheme for loading the pte on 32-bit PAE
(with comment), and an access_ok() check. I'd like if somebody can verify
both these additions (and the patch in general). It wouldn't be hard to
have a big security hole here if any of the checks are wrong...

I do have a powerpc patch which demonstrates that fast_gup isn't just a
crazy x86 only hack... but it is a little more complex (requires speculative
page references in the VM), and not as well tested as the x86 version. 

Anyway, I think the x86 code is now pretty well complete and has no
more known issues.
--

Introduce a new "fast_gup" (for want of a better name right now) which
is basically a get_user_pages with a less general API (but still tends to
be suited to the common case):

- task and mm are always current and current->mm
- force is always 0
- pages is always non-NULL
- don't pass back vmas

This restricted API can be implemented in a much more scalable way when
the ptes are present, by walking the page tables locklessly.

Before anybody asks; it is impossible to just "detect" such a situation in the
existing get_user_pages call, and branch to a fastpath in that case, because
get_user_pages requres the mmap_sem is held over the call, wheras fast_gup does
not.

This patch implements fast_gup on x86, and converts a number of key callsites
to use it.

On x86, we do an optimistic lockless pagetable walk, without taking any page
table locks or even mmap_sem. Page table existence is guaranteed by turning
interrupts off (combined with the fact that we're always looking up the current
mm, means we can do the lockless page table walk within the constraints of the
TLB shootdown design). Basically we can do this lockless pagetable walk in a
similar manner to the way the CPU's pagetable walker does not have to take any
locks to find present ptes.

Many other architectures could do the same thing. Those that don't IPI
could potentially RCU free the page tables and do speculative references
on the pages (a la lockless pagecache) to achieve a lockless fast_gup. I
have actually got an implementation of this for powerpc.

This patch was found to give about 10% performance improvement on a 2 socket
8 core Intel Xeon system running an OLTP workload on DB2 v9.5

 "To test the effects of the patch, an OLTP workload was run on an IBM
 x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
 2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
 runs with and without the patch resulted in an overall performance
 benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
 __up_read and __down_read routines that is seen during thread contention
 for system resources was reduced from 2.8% down to .05%. Monitoring
 the /proc/vmstat output from the patched run showed that the counter for
 fast_gup contained a very high number while the fast_gup_slow value was
 zero."

(fast_gup_slow is a counter we had for the number of times the slowpath
was invoked).

The main reason for the improvement is that DB2 has multiple threads each
issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
contend the mmap_sem cacheline, and can also contend on page table locks.

I would anticipate larger performance gains on larger systems, however I
think DB2 uses an adaptive mix of threads and processes, so it could be
that thread contention remains pretty constant as machine size increases.
In which case, we stuck with "only" a 10% gain.

Lots of other code could use this too (eg. grep drivers/).

The downside of using fast_gup is that if there is not a pte with the
correct permissions for the access, we end up falling back to get_user_pages
and so the fast_gup is just extra work. This should not be the common case
in performance critical code, I'd hope.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: axboe@oracle.com
Cc: torvalds@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org

---
 arch/x86/mm/Makefile      |    2 
 arch/x86/mm/gup.c         |  193 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/bio.c                  |    8 -
 fs/direct-io.c            |   10 --
 fs/splice.c               |   41 ---------
 include/asm-x86/uaccess.h |    5 +
 include/linux/mm.h        |   19 ++++
 7 files changed, 225 insertions(+), 53 deletions(-)

Index: linux-2.6/fs/bio.c
===================================================================
--- linux-2.6.orig/fs/bio.c
+++ linux-2.6/fs/bio.c
@@ -713,12 +713,8 @@ static struct bio *__bio_map_user_iov(st
 		const int local_nr_pages = end - start;
 		const int page_limit = cur_page + local_nr_pages;
 		
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(current, current->mm, uaddr,
-				     local_nr_pages,
-				     write_to_vm, 0, &pages[cur_page], NULL);
-		up_read(&current->mm->mmap_sem);
-
+		ret = fast_gup(uaddr, local_nr_pages,
+				write_to_vm, &pages[cur_page]);
 		if (ret < local_nr_pages) {
 			ret = -EFAULT;
 			goto out_unmap;
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c
+++ linux-2.6/fs/direct-io.c
@@ -150,17 +150,11 @@ static int dio_refill_pages(struct dio *
 	int nr_pages;
 
 	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(
-		current,			/* Task for fault acounting */
-		current->mm,			/* whose pages? */
+	ret = fast_gup(
 		dio->curr_user_address,		/* Where from? */
 		nr_pages,			/* How many pages? */
 		dio->rw == READ,		/* Write to memory? */
-		0,				/* force (?) */
-		&dio->pages[0],
-		NULL);				/* vmas */
-	up_read(&current->mm->mmap_sem);
+		&dio->pages[0]);		/* Put results here */
 
 	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
 		struct page *page = ZERO_PAGE(0);
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -1147,36 +1147,6 @@ static long do_splice(struct file *in, l
 }
 
 /*
- * Do a copy-from-user while holding the mmap_semaphore for reading, in a
- * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem
- * for writing) and page faulting on the user memory pointed to by src.
- * This assumes that we will very rarely hit the partial != 0 path, or this
- * will not be a win.
- */
-static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n)
-{
-	int partial;
-
-	if (!access_ok(VERIFY_READ, src, n))
-		return -EFAULT;
-
-	pagefault_disable();
-	partial = __copy_from_user_inatomic(dst, src, n);
-	pagefault_enable();
-
-	/*
-	 * Didn't copy everything, drop the mmap_sem and do a faulting copy
-	 */
-	if (unlikely(partial)) {
-		up_read(&current->mm->mmap_sem);
-		partial = copy_from_user(dst, src, n);
-		down_read(&current->mm->mmap_sem);
-	}
-
-	return partial;
-}
-
-/*
  * Map an iov into an array of pages and offset/length tupples. With the
  * partial_page structure, we can map several non-contiguous ranges into
  * our ones pages[] map instead of splitting that operation into pieces.
@@ -1189,8 +1159,6 @@ static int get_iovec_page_array(const st
 {
 	int buffers = 0, error = 0;
 
-	down_read(&current->mm->mmap_sem);
-
 	while (nr_vecs) {
 		unsigned long off, npages;
 		struct iovec entry;
@@ -1199,7 +1167,7 @@ static int get_iovec_page_array(const st
 		int i;
 
 		error = -EFAULT;
-		if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
+		if (copy_from_user(&entry, iov, sizeof(entry)))
 			break;
 
 		base = entry.iov_base;
@@ -1233,9 +1201,8 @@ static int get_iovec_page_array(const st
 		if (npages > PIPE_BUFFERS - buffers)
 			npages = PIPE_BUFFERS - buffers;
 
-		error = get_user_pages(current, current->mm,
-				       (unsigned long) base, npages, 0, 0,
-				       &pages[buffers], NULL);
+		error = fast_gup((unsigned long)base, npages,
+					0, &pages[buffers]);
 
 		if (unlikely(error <= 0))
 			break;
@@ -1274,8 +1241,6 @@ static int get_iovec_page_array(const st
 		iov++;
 	}
 
-	up_read(&current->mm->mmap_sem);
-
 	if (buffers)
 		return buffers;
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -12,6 +12,7 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
+#include <linux/uaccess.h> /* for __HAVE_ARCH_FAST_GUP */
 
 struct mempolicy;
 struct anon_vma;
@@ -830,6 +831,24 @@ extern int mprotect_fixup(struct vm_area
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
 
+#ifndef __HAVE_ARCH_FAST_GUP
+/* Should be moved to asm-generic, and architectures can include it if they
+ * don't implement their own fast_gup.
+ */
+#define fast_gup(start, nr_pages, write, pages)			\
+({								\
+	struct mm_struct *mm = current->mm;			\
+	int ret;						\
+								\
+	down_read(&mm->mmap_sem);				\
+	ret = get_user_pages(current, mm, start, nr_pages,	\
+					write, 0, pages, NULL);	\
+	up_read(&mm->mmap_sem);					\
+								\
+	ret;							\
+})
+#endif
+
 /*
  * A callback you can register to apply pressure to ageable caches.
  *
Index: linux-2.6/arch/x86/mm/Makefile
===================================================================
--- linux-2.6.orig/arch/x86/mm/Makefile
+++ linux-2.6/arch/x86/mm/Makefile
@@ -1,5 +1,5 @@
 obj-y	:=  init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o
+	    pat.o pgtable.o gup.o
 
 obj-$(CONFIG_X86_32)		+= pgtable_32.o
 
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/mm/gup.c
@@ -0,0 +1,243 @@
+/*
+ * Lockless fast_gup for x86
+ *
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <asm/pgtable.h>
+
+static inline pte_t gup_get_pte(pte_t *ptep)
+{
+#ifndef CONFIG_X86_PAE
+	return *ptep;
+#else
+	/*
+	 * With fast_gup, we walk down the pagetables without taking any locks.
+	 * For this we would like to load the pointers atoimcally, but that is
+	 * not possible (without expensive cmpxchg8b) on PAE.  What we do have
+	 * is the guarantee that a pte will only either go from not present to
+	 * present, or present to not present or both -- it will not switch to
+	 * a completely different present page without a TLB flush in between;
+	 * something that we are blocking by holding interrupts off.
+	 *
+	 * Setting ptes from not present to present goes:
+	 * ptep->pte_high = h;
+	 * smp_wmb();
+	 * ptep->pte_low = l;
+	 *
+	 * And present to not present goes:
+	 * ptep->pte_low = 0;
+	 * smp_wmb();
+	 * ptep->pte_high = 0;
+	 *
+	 * We must ensure here that the load of pte_low sees l iff
+	 * pte_high sees h. We load pte_high *after* loading pte_low,
+	 * which ensures we don't see an older value of pte_high.
+	 * *Then* we recheck pte_low, which ensures that we haven't
+	 * picked up a changed pte high. We might have got rubbish values
+	 * from pte_low and pte_high, but we are guaranteed that pte_low
+	 * will not have the present bit set *unless* it is 'l'. And
+	 * fast_gup only operates on present ptes, so we're safe.
+	 *
+	 * gup_get_pte should not be used or copied outside gup.c without
+	 * being very careful -- it does not atomically load the pte or
+	 * anything that is likely to be useful for you.
+	 */
+	pte_t pte;
+
+retry:
+	pte.pte_low = ptep->pte_low;
+	smp_rmb();
+	pte.pte_high = ptep->pte_high;
+	smp_rmb();
+	if (unlikely(pte.pte_low != ptep->pte_low))
+		goto retry;
+
+#endif
+}
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask, result;
+	pte_t *ptep;
+
+	result = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		result |= _PAGE_RW;
+	mask = result | _PAGE_SPECIAL;
+
+	ptep = pte_offset_map(&pmd, addr);
+	do {
+		pte_t pte = gup_get_pte(ptep);
+		struct page *page;
+
+		if ((pte_val(pte) & mask) != result)
+			return 0;
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+		get_page(page);
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(ptep - 1);
+
+	return 1;
+}
+
+static inline void get_head_page_multiple(struct page *page, int nr)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_add(nr, &page->_count);
+}
+
+static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	pte_t pte = *(pte_t *)&pmd;
+	struct page *head, *page;
+	int refs;
+
+	mask = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+	page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+	get_head_page_multiple(head, refs);
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = *pmdp;
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd))
+			return 0;
+		if (unlikely(pmd_large(pmd))) {
+			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(&pgd, addr);
+	do {
+		pud_t pud = *pudp;
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long end = start + (nr_pages << PAGE_SHIFT);
+	unsigned long addr = start;
+	unsigned long next;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, nr_pages*PAGE_SIZE)))
+		goto slow_irqon;
+
+	/*
+	 * XXX: batch / limit 'nr', to avoid large irq off latency
+	 * needs some instrumenting to determine the common sizes used by
+	 * important workloads (eg. DB2), and whether limiting the batch size
+	 * will decrease performance.
+	 *
+	 * It seems like we're in the clear for the moment. Direct-IO is
+	 * the main guy that batches up lots of get_user_pages, and even
+	 * they are limited to 64-at-a-time which is not so many.
+	 */
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables and pages from being freed on x86.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86, with the above PAE exception), we can follow the
+	 * address down to the the page and take a ref on it.
+	 */
+	local_irq_disable();
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_enable();
+
+	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+	return nr;
+
+	{
+		int i, ret;
+
+slow:
+		local_irq_enable();
+slow_irqon:
+		/* Could optimise this more by keeping what we've already got */
+		for (i = 0; i < nr; i++)
+			put_page(pages[i]);
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		return ret;
+	}
+}
Index: linux-2.6/include/asm-x86/uaccess.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess.h
+++ linux-2.6/include/asm-x86/uaccess.h
@@ -3,3 +3,8 @@
 #else
 # include "uaccess_64.h"
 #endif
+
+#define __HAVE_ARCH_FAST_GUP
+struct page;
+int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages);
+

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 2/2] mm: lockless get_user_pages
  2008-05-21 12:11 ` [patch 2/2] mm: lockless get_user_pages Nick Piggin
@ 2008-05-22 10:28   ` Andy Whitcroft
  2008-05-23  2:27     ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Whitcroft @ 2008-05-22 10:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, shaggy, axboe, torvalds, linux-mm, linux-arch

On Wed, May 21, 2008 at 02:11:14PM +0200, Nick Piggin wrote:
> I'd like to propose lockless get_user_pages for merge in -mm. Since last
> posted, this includes Linus's scheme for loading the pte on 32-bit PAE
> (with comment), and an access_ok() check. I'd like if somebody can verify
> both these additions (and the patch in general). It wouldn't be hard to
> have a big security hole here if any of the checks are wrong...
> 
> I do have a powerpc patch which demonstrates that fast_gup isn't just a
> crazy x86 only hack... but it is a little more complex (requires speculative
> page references in the VM), and not as well tested as the x86 version. 
> 
> Anyway, I think the x86 code is now pretty well complete and has no
> more known issues.
> --
> 
> Introduce a new "fast_gup" (for want of a better name right now) which
> is basically a get_user_pages with a less general API (but still tends to
> be suited to the common case):
> 
> - task and mm are always current and current->mm
> - force is always 0
> - pages is always non-NULL
> - don't pass back vmas
> 
> This restricted API can be implemented in a much more scalable way when
> the ptes are present, by walking the page tables locklessly.
> 
> Before anybody asks; it is impossible to just "detect" such a situation in the
> existing get_user_pages call, and branch to a fastpath in that case, because
> get_user_pages requres the mmap_sem is held over the call, wheras fast_gup does
> not.
> 
> This patch implements fast_gup on x86, and converts a number of key callsites
> to use it.
> 
> On x86, we do an optimistic lockless pagetable walk, without taking any page
> table locks or even mmap_sem. Page table existence is guaranteed by turning
> interrupts off (combined with the fact that we're always looking up the current
> mm, means we can do the lockless page table walk within the constraints of the
> TLB shootdown design). Basically we can do this lockless pagetable walk in a
> similar manner to the way the CPU's pagetable walker does not have to take any
> locks to find present ptes.
> 
> Many other architectures could do the same thing. Those that don't IPI
> could potentially RCU free the page tables and do speculative references
> on the pages (a la lockless pagecache) to achieve a lockless fast_gup. I
> have actually got an implementation of this for powerpc.
> 
> This patch was found to give about 10% performance improvement on a 2 socket
> 8 core Intel Xeon system running an OLTP workload on DB2 v9.5
> 
>  "To test the effects of the patch, an OLTP workload was run on an IBM
>  x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
>  2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
>  runs with and without the patch resulted in an overall performance
>  benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
>  __up_read and __down_read routines that is seen during thread contention
>  for system resources was reduced from 2.8% down to .05%. Monitoring
>  the /proc/vmstat output from the patched run showed that the counter for
>  fast_gup contained a very high number while the fast_gup_slow value was
>  zero."
> 
> (fast_gup_slow is a counter we had for the number of times the slowpath
> was invoked).
> 
> The main reason for the improvement is that DB2 has multiple threads each
> issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
> contend the mmap_sem cacheline, and can also contend on page table locks.
> 
> I would anticipate larger performance gains on larger systems, however I
> think DB2 uses an adaptive mix of threads and processes, so it could be
> that thread contention remains pretty constant as machine size increases.
> In which case, we stuck with "only" a 10% gain.
> 
> Lots of other code could use this too (eg. grep drivers/).
> 
> The downside of using fast_gup is that if there is not a pte with the
> correct permissions for the access, we end up falling back to get_user_pages
> and so the fast_gup is just extra work. This should not be the common case
> in performance critical code, I'd hope.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Cc: shaggy@austin.ibm.com
> Cc: axboe@oracle.com
> Cc: torvalds@linux-foundation.org
> Cc: linux-mm@kvack.org
> Cc: linux-arch@vger.kernel.org
> 
> ---
>  arch/x86/mm/Makefile      |    2 
>  arch/x86/mm/gup.c         |  193 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/bio.c                  |    8 -
>  fs/direct-io.c            |   10 --
>  fs/splice.c               |   41 ---------
>  include/asm-x86/uaccess.h |    5 +
>  include/linux/mm.h        |   19 ++++
>  7 files changed, 225 insertions(+), 53 deletions(-)
> 
> Index: linux-2.6/fs/bio.c
> ===================================================================
> --- linux-2.6.orig/fs/bio.c
> +++ linux-2.6/fs/bio.c
> @@ -713,12 +713,8 @@ static struct bio *__bio_map_user_iov(st
>  		const int local_nr_pages = end - start;
>  		const int page_limit = cur_page + local_nr_pages;
>  		
> -		down_read(&current->mm->mmap_sem);
> -		ret = get_user_pages(current, current->mm, uaddr,
> -				     local_nr_pages,
> -				     write_to_vm, 0, &pages[cur_page], NULL);
> -		up_read(&current->mm->mmap_sem);
> -
> +		ret = fast_gup(uaddr, local_nr_pages,
> +				write_to_vm, &pages[cur_page]);
>  		if (ret < local_nr_pages) {
>  			ret = -EFAULT;
>  			goto out_unmap;
> Index: linux-2.6/fs/direct-io.c
> ===================================================================
> --- linux-2.6.orig/fs/direct-io.c
> +++ linux-2.6/fs/direct-io.c
> @@ -150,17 +150,11 @@ static int dio_refill_pages(struct dio *
>  	int nr_pages;
>  
>  	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
> -	down_read(&current->mm->mmap_sem);
> -	ret = get_user_pages(
> -		current,			/* Task for fault acounting */
> -		current->mm,			/* whose pages? */
> +	ret = fast_gup(
>  		dio->curr_user_address,		/* Where from? */
>  		nr_pages,			/* How many pages? */
>  		dio->rw == READ,		/* Write to memory? */
> -		0,				/* force (?) */

I worry a little about this, not for the conversion but for the original
writers lack of confidence that they should be using force if the
comments are to be believed.

> -		&dio->pages[0],
> -		NULL);				/* vmas */
> -	up_read(&current->mm->mmap_sem);
> +		&dio->pages[0]);		/* Put results here */
>  
>  	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
>  		struct page *page = ZERO_PAGE(0);
> Index: linux-2.6/fs/splice.c
> ===================================================================
> --- linux-2.6.orig/fs/splice.c
> +++ linux-2.6/fs/splice.c
> @@ -1147,36 +1147,6 @@ static long do_splice(struct file *in, l
>  }
>  
>  /*
> - * Do a copy-from-user while holding the mmap_semaphore for reading, in a
> - * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem
> - * for writing) and page faulting on the user memory pointed to by src.
> - * This assumes that we will very rarely hit the partial != 0 path, or this
> - * will not be a win.
> - */
> -static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n)
> -{
> -	int partial;
> -
> -	if (!access_ok(VERIFY_READ, src, n))
> -		return -EFAULT;
> -
> -	pagefault_disable();
> -	partial = __copy_from_user_inatomic(dst, src, n);
> -	pagefault_enable();
> -
> -	/*
> -	 * Didn't copy everything, drop the mmap_sem and do a faulting copy
> -	 */
> -	if (unlikely(partial)) {
> -		up_read(&current->mm->mmap_sem);
> -		partial = copy_from_user(dst, src, n);
> -		down_read(&current->mm->mmap_sem);
> -	}
> -
> -	return partial;
> -}

Why is this optimisation taken out as part of this patch.  From what I
an see the caller below does not move to fast_gup so it appears
(naively) that this would still be applicable, though the locking may be
hard.

> -
> -/*
>   * Map an iov into an array of pages and offset/length tupples. With the
>   * partial_page structure, we can map several non-contiguous ranges into
>   * our ones pages[] map instead of splitting that operation into pieces.
> @@ -1189,8 +1159,6 @@ static int get_iovec_page_array(const st
>  {
>  	int buffers = 0, error = 0;
>  
> -	down_read(&current->mm->mmap_sem);
> -
>  	while (nr_vecs) {
>  		unsigned long off, npages;
>  		struct iovec entry;
> @@ -1199,7 +1167,7 @@ static int get_iovec_page_array(const st
>  		int i;
>  
>  		error = -EFAULT;
> -		if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
> +		if (copy_from_user(&entry, iov, sizeof(entry)))
>  			break;
>  
>  		base = entry.iov_base;
> @@ -1233,9 +1201,8 @@ static int get_iovec_page_array(const st
>  		if (npages > PIPE_BUFFERS - buffers)
>  			npages = PIPE_BUFFERS - buffers;
>  
> -		error = get_user_pages(current, current->mm,
> -				       (unsigned long) base, npages, 0, 0,
> -				       &pages[buffers], NULL);
> +		error = fast_gup((unsigned long)base, npages,
> +					0, &pages[buffers]);
>  
>  		if (unlikely(error <= 0))
>  			break;
> @@ -1274,8 +1241,6 @@ static int get_iovec_page_array(const st
>  		iov++;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> -
>  	if (buffers)
>  		return buffers;
>  
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -12,6 +12,7 @@
>  #include <linux/prio_tree.h>
>  #include <linux/debug_locks.h>
>  #include <linux/mm_types.h>
> +#include <linux/uaccess.h> /* for __HAVE_ARCH_FAST_GUP */
>  
>  struct mempolicy;
>  struct anon_vma;
> @@ -830,6 +831,24 @@ extern int mprotect_fixup(struct vm_area
>  			  struct vm_area_struct **pprev, unsigned long start,
>  			  unsigned long end, unsigned long newflags);
>  
> +#ifndef __HAVE_ARCH_FAST_GUP
> +/* Should be moved to asm-generic, and architectures can include it if they
> + * don't implement their own fast_gup.
> + */
> +#define fast_gup(start, nr_pages, write, pages)			\
> +({								\
> +	struct mm_struct *mm = current->mm;			\
> +	int ret;						\
> +								\
> +	down_read(&mm->mmap_sem);				\
> +	ret = get_user_pages(current, mm, start, nr_pages,	\
> +					write, 0, pages, NULL);	\
> +	up_read(&mm->mmap_sem);					\
> +								\
> +	ret;							\
> +})
> +#endif
> +
>  /*
>   * A callback you can register to apply pressure to ageable caches.
>   *
> Index: linux-2.6/arch/x86/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/Makefile
> +++ linux-2.6/arch/x86/mm/Makefile
> @@ -1,5 +1,5 @@
>  obj-y	:=  init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
> -	    pat.o pgtable.o
> +	    pat.o pgtable.o gup.o
>  
>  obj-$(CONFIG_X86_32)		+= pgtable_32.o
>  
> Index: linux-2.6/arch/x86/mm/gup.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/arch/x86/mm/gup.c
> @@ -0,0 +1,243 @@
> +/*
> + * Lockless fast_gup for x86
> + *
> + * Copyright (C) 2008 Nick Piggin
> + * Copyright (C) 2008 Novell Inc.
> + */
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <asm/pgtable.h>
> +
> +static inline pte_t gup_get_pte(pte_t *ptep)
> +{

This is the gup form of pte_val, I wonder if it would be better named
gup_pte_val.

> +#ifndef CONFIG_X86_PAE
> +	return *ptep;
> +#else
> +	/*
> +	 * With fast_gup, we walk down the pagetables without taking any locks.
> +	 * For this we would like to load the pointers atoimcally, but that is
> +	 * not possible (without expensive cmpxchg8b) on PAE.  What we do have
> +	 * is the guarantee that a pte will only either go from not present to
> +	 * present, or present to not present or both -- it will not switch to
> +	 * a completely different present page without a TLB flush in between;
> +	 * something that we are blocking by holding interrupts off.
> +	 *
> +	 * Setting ptes from not present to present goes:
> +	 * ptep->pte_high = h;
> +	 * smp_wmb();
> +	 * ptep->pte_low = l;
> +	 *
> +	 * And present to not present goes:
> +	 * ptep->pte_low = 0;
> +	 * smp_wmb();
> +	 * ptep->pte_high = 0;

native_set_pte_present uses the below form, which does also seem to follow
both of these patterns sequentially, so thats ok:

        ptep->pte_low = 0;
	smp_wmb();
	ptep->pte_high = pte.pte_high;
	smp_wmb();
	ptep->pte_low = pte.pte_low;

> +	 *
> +	 * We must ensure here that the load of pte_low sees l iff
> +	 * pte_high sees h. We load pte_high *after* loading pte_low,
> +	 * which ensures we don't see an older value of pte_high.
> +	 * *Then* we recheck pte_low, which ensures that we haven't
> +	 * picked up a changed pte high. We might have got rubbish values
> +	 * from pte_low and pte_high, but we are guaranteed that pte_low
> +	 * will not have the present bit set *unless* it is 'l'. And
> +	 * fast_gup only operates on present ptes, so we're safe.

To my mind this is only valid because of the requirement stated on
native_set_pte that we may only set a pte if it is not-present or
invalid, ie that we would never see the following:

	set_pte(N, valid);
	set_pte(N, valid2);

If you rely on that here you should mention it directly, though I guess
your interrupt based locking out half mentions it.

> +	 *
> +	 * gup_get_pte should not be used or copied outside gup.c without
> +	 * being very careful -- it does not atomically load the pte or
> +	 * anything that is likely to be useful for you.
> +	 */
> +	pte_t pte;
> +
> +retry:
> +	pte.pte_low = ptep->pte_low;
> +	smp_rmb();
> +	pte.pte_high = ptep->pte_high;
> +	smp_rmb();
> +	if (unlikely(pte.pte_low != ptep->pte_low))
> +		goto retry;
> +

Seems to be missing a return pte; here.

> +#endif
> +}
> +
> +/*
> + * The performance critical leaf functions are made noinline otherwise gcc
> + * inlines everything into a single function which results in too much
> + * register pressure.
> + */
> +static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	unsigned long mask, result;
> +	pte_t *ptep;
> +
> +	result = _PAGE_PRESENT|_PAGE_USER;
> +	if (write)
> +		result |= _PAGE_RW;
> +	mask = result | _PAGE_SPECIAL;
> +
> +	ptep = pte_offset_map(&pmd, addr);
> +	do {
> +		pte_t pte = gup_get_pte(ptep);
> +		struct page *page;
> +
> +		if ((pte_val(pte) & mask) != result)
> +			return 0;
> +		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> +		page = pte_page(pte);
> +		get_page(page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +
> +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> +	pte_unmap(ptep - 1);

Is this pte_unmap right.  I thought you had to unmap the same address
that was returned from pte_offset_map, not an incremented version?

> +
> +	return 1;
> +}
> +
> +static inline void get_head_page_multiple(struct page *page, int nr)
> +{
> +	VM_BUG_ON(page != compound_head(page));
> +	VM_BUG_ON(page_count(page) == 0);
> +	atomic_add(nr, &page->_count);

I see that follow_hugetlb_page does this as multiple get_page's.  It
seems that it could benefit from this optimisation.

> +}
> +
> +static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	unsigned long mask;
> +	pte_t pte = *(pte_t *)&pmd;
> +	struct page *head, *page;
> +	int refs;
> +
> +	mask = _PAGE_PRESENT|_PAGE_USER;
> +	if (write)
> +		mask |= _PAGE_RW;
> +	if ((pte_val(pte) & mask) != mask)
> +		return 0;
> +	/* hugepages are never "special" */
> +	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));

As we am not expecting them to be special could we not check for that anyhow as you do in the normal case.  Safe against anyone changing hugepages.

> +
> +	refs = 0;
> +	head = pte_page(pte);
> +	page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
> +	do {
> +		VM_BUG_ON(compound_head(page) != head);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +	get_head_page_multiple(head, refs);
> +
> +	return 1;
> +}
> +
> +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pmd_t *pmdp;
> +
> +	pmdp = pmd_offset(&pud, addr);
> +	do {
> +		pmd_t pmd = *pmdp;
> +
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(pmd))
> +			return 0;
> +		if (unlikely(pmd_large(pmd))) {
> +			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
> +				return 0;
> +		} else {
> +			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
> +				return 0;
> +		}
> +	} while (pmdp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pud_t *pudp;
> +
> +	pudp = pud_offset(&pgd, addr);
> +	do {
> +		pud_t pud = *pudp;
> +
> +		next = pud_addr_end(addr, end);
> +		if (pud_none(pud))
> +			return 0;
> +		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
> +			return 0;
> +	} while (pudp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long end = start + (nr_pages << PAGE_SHIFT);
> +	unsigned long addr = start;
> +	unsigned long next;
> +	pgd_t *pgdp;
> +	int nr = 0;
> +
> +	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> +					start, nr_pages*PAGE_SIZE)))
> +		goto slow_irqon;
> +
> +	/*
> +	 * XXX: batch / limit 'nr', to avoid large irq off latency
> +	 * needs some instrumenting to determine the common sizes used by
> +	 * important workloads (eg. DB2), and whether limiting the batch size
> +	 * will decrease performance.
> +	 *
> +	 * It seems like we're in the clear for the moment. Direct-IO is
> +	 * the main guy that batches up lots of get_user_pages, and even
> +	 * they are limited to 64-at-a-time which is not so many.
> +	 */
> +	/*
> +	 * This doesn't prevent pagetable teardown, but does prevent
> +	 * the pagetables and pages from being freed on x86.
> +	 *
> +	 * So long as we atomically load page table pointers versus teardown
> +	 * (which we do on x86, with the above PAE exception), we can follow the
> +	 * address down to the the page and take a ref on it.
> +	 */
> +	local_irq_disable();
> +	pgdp = pgd_offset(mm, addr);
> +	do {
> +		pgd_t pgd = *pgdp;
> +
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(pgd))
> +			goto slow;
> +		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> +			goto slow;
> +	} while (pgdp++, addr = next, addr != end);
> +	local_irq_enable();
> +
> +	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
> +	return nr;
> +
> +	{
> +		int i, ret;
> +
> +slow:
> +		local_irq_enable();
> +slow_irqon:
> +		/* Could optimise this more by keeping what we've already got */
> +		for (i = 0; i < nr; i++)
> +			put_page(pages[i]);

Perhaps we could have some vm_stats's for these cases so we can see just
how often we do fail, and how many we wasted.

> +
> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);
> +
> +		return ret;
> +	}
> +}
> Index: linux-2.6/include/asm-x86/uaccess.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/uaccess.h
> +++ linux-2.6/include/asm-x86/uaccess.h
> @@ -3,3 +3,8 @@
>  #else
>  # include "uaccess_64.h"
>  #endif
> +
> +#define __HAVE_ARCH_FAST_GUP
> +struct page;
> +int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages);
> +

Looking at this from a protection point of view it seems to me to make
sane checks, checking that the address is valid first, and then that
each PTE has the minimum permissions requested.  Where there is any
ambiguity falling back.

I did wonder if we could also check _PAGE_BIT_USER bit as by my reading
that would only ever be set on user pages, and by rejecting pages without
that set we could prevent any kernel pages being returned basically
for free.

Reviewed-by: Andy Whitcroft <apw@shadowen.org>

-apw

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 2/2] mm: lockless get_user_pages
  2008-05-22 10:28   ` Andy Whitcroft
@ 2008-05-23  2:27     ` Nick Piggin
  2008-05-23 12:31       ` apw
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2008-05-23  2:27 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andrew Morton, shaggy, axboe, torvalds, linux-mm, linux-arch

On Thu, May 22, 2008 at 11:28:03AM +0100, Andy Whitcroft wrote:
> On Wed, May 21, 2008 at 02:11:14PM +0200, Nick Piggin wrote:
> > I'd like to propose lockless get_user_pages for merge in -mm. Since last
> > posted, this includes Linus's scheme for loading the pte on 32-bit PAE
> > (with comment), and an access_ok() check. I'd like if somebody can verify
> > both these additions (and the patch in general). It wouldn't be hard to
> > have a big security hole here if any of the checks are wrong...
> > 
> > I do have a powerpc patch which demonstrates that fast_gup isn't just a
> > crazy x86 only hack... but it is a little more complex (requires speculative
> > page references in the VM), and not as well tested as the x86 version. 
> > 
> > Anyway, I think the x86 code is now pretty well complete and has no
> > more known issues.
> > --
> > 
> > Introduce a new "fast_gup" (for want of a better name right now) which
> > is basically a get_user_pages with a less general API (but still tends to
> > be suited to the common case):
> > 
> > - task and mm are always current and current->mm
> > - force is always 0
> > - pages is always non-NULL
> > - don't pass back vmas
> > 
> > This restricted API can be implemented in a much more scalable way when
> > the ptes are present, by walking the page tables locklessly.
> > 
> > Before anybody asks; it is impossible to just "detect" such a situation in the
> > existing get_user_pages call, and branch to a fastpath in that case, because
> > get_user_pages requres the mmap_sem is held over the call, wheras fast_gup does
> > not.
> > 
> > This patch implements fast_gup on x86, and converts a number of key callsites
> > to use it.
> > 
> > On x86, we do an optimistic lockless pagetable walk, without taking any page
> > table locks or even mmap_sem. Page table existence is guaranteed by turning
> > interrupts off (combined with the fact that we're always looking up the current
> > mm, means we can do the lockless page table walk within the constraints of the
> > TLB shootdown design). Basically we can do this lockless pagetable walk in a
> > similar manner to the way the CPU's pagetable walker does not have to take any
> > locks to find present ptes.
> > 
> > Many other architectures could do the same thing. Those that don't IPI
> > could potentially RCU free the page tables and do speculative references
> > on the pages (a la lockless pagecache) to achieve a lockless fast_gup. I
> > have actually got an implementation of this for powerpc.
> > 
> > This patch was found to give about 10% performance improvement on a 2 socket
> > 8 core Intel Xeon system running an OLTP workload on DB2 v9.5
> > 
> >  "To test the effects of the patch, an OLTP workload was run on an IBM
> >  x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
> >  2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
> >  runs with and without the patch resulted in an overall performance
> >  benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
> >  __up_read and __down_read routines that is seen during thread contention
> >  for system resources was reduced from 2.8% down to .05%. Monitoring
> >  the /proc/vmstat output from the patched run showed that the counter for
> >  fast_gup contained a very high number while the fast_gup_slow value was
> >  zero."
> > 
> > (fast_gup_slow is a counter we had for the number of times the slowpath
> > was invoked).
> > 
> > The main reason for the improvement is that DB2 has multiple threads each
> > issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
> > contend the mmap_sem cacheline, and can also contend on page table locks.
> > 
> > I would anticipate larger performance gains on larger systems, however I
> > think DB2 uses an adaptive mix of threads and processes, so it could be
> > that thread contention remains pretty constant as machine size increases.
> > In which case, we stuck with "only" a 10% gain.
> > 
> > Lots of other code could use this too (eg. grep drivers/).
> > 
> > The downside of using fast_gup is that if there is not a pte with the
> > correct permissions for the access, we end up falling back to get_user_pages
> > and so the fast_gup is just extra work. This should not be the common case
> > in performance critical code, I'd hope.
> > 
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > Cc: shaggy@austin.ibm.com
> > Cc: axboe@oracle.com
> > Cc: torvalds@linux-foundation.org
> > Cc: linux-mm@kvack.org
> > Cc: linux-arch@vger.kernel.org
> > 
> > ---
> >  arch/x86/mm/Makefile      |    2 
> >  arch/x86/mm/gup.c         |  193 ++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/bio.c                  |    8 -
> >  fs/direct-io.c            |   10 --
> >  fs/splice.c               |   41 ---------
> >  include/asm-x86/uaccess.h |    5 +
> >  include/linux/mm.h        |   19 ++++
> >  7 files changed, 225 insertions(+), 53 deletions(-)
> > 
> > Index: linux-2.6/fs/bio.c
> > ===================================================================
> > --- linux-2.6.orig/fs/bio.c
> > +++ linux-2.6/fs/bio.c
> > @@ -713,12 +713,8 @@ static struct bio *__bio_map_user_iov(st
> >  		const int local_nr_pages = end - start;
> >  		const int page_limit = cur_page + local_nr_pages;
> >  		
> > -		down_read(&current->mm->mmap_sem);
> > -		ret = get_user_pages(current, current->mm, uaddr,
> > -				     local_nr_pages,
> > -				     write_to_vm, 0, &pages[cur_page], NULL);
> > -		up_read(&current->mm->mmap_sem);
> > -
> > +		ret = fast_gup(uaddr, local_nr_pages,
> > +				write_to_vm, &pages[cur_page]);
> >  		if (ret < local_nr_pages) {
> >  			ret = -EFAULT;
> >  			goto out_unmap;
> > Index: linux-2.6/fs/direct-io.c
> > ===================================================================
> > --- linux-2.6.orig/fs/direct-io.c
> > +++ linux-2.6/fs/direct-io.c
> > @@ -150,17 +150,11 @@ static int dio_refill_pages(struct dio *
> >  	int nr_pages;
> >  
> >  	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
> > -	down_read(&current->mm->mmap_sem);
> > -	ret = get_user_pages(
> > -		current,			/* Task for fault acounting */
> > -		current->mm,			/* whose pages? */
> > +	ret = fast_gup(
> >  		dio->curr_user_address,		/* Where from? */
> >  		nr_pages,			/* How many pages? */
> >  		dio->rw == READ,		/* Write to memory? */
> > -		0,				/* force (?) */
> 
> I worry a little about this, not for the conversion but for the original
> writers lack of confidence that they should be using force if the
> comments are to be believed.

It should not use force. Force will bypass the mmap permissions and
do funny things. ptrace is really one of the few valid users I think.

 
> > -		&dio->pages[0],
> > -		NULL);				/* vmas */
> > -	up_read(&current->mm->mmap_sem);
> > +		&dio->pages[0]);		/* Put results here */
> >  
> >  	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
> >  		struct page *page = ZERO_PAGE(0);
> > Index: linux-2.6/fs/splice.c
> > ===================================================================
> > --- linux-2.6.orig/fs/splice.c
> > +++ linux-2.6/fs/splice.c
> > @@ -1147,36 +1147,6 @@ static long do_splice(struct file *in, l
> >  }
> >  
> >  /*
> > - * Do a copy-from-user while holding the mmap_semaphore for reading, in a
> > - * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem
> > - * for writing) and page faulting on the user memory pointed to by src.
> > - * This assumes that we will very rarely hit the partial != 0 path, or this
> > - * will not be a win.
> > - */
> > -static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n)
> > -{
> > -	int partial;
> > -
> > -	if (!access_ok(VERIFY_READ, src, n))
> > -		return -EFAULT;
> > -
> > -	pagefault_disable();
> > -	partial = __copy_from_user_inatomic(dst, src, n);
> > -	pagefault_enable();
> > -
> > -	/*
> > -	 * Didn't copy everything, drop the mmap_sem and do a faulting copy
> > -	 */
> > -	if (unlikely(partial)) {
> > -		up_read(&current->mm->mmap_sem);
> > -		partial = copy_from_user(dst, src, n);
> > -		down_read(&current->mm->mmap_sem);
> > -	}
> > -
> > -	return partial;
> > -}
> 
> Why is this optimisation taken out as part of this patch.  From what I
> an see the caller below does not move to fast_gup so it appears
> (naively) that this would still be applicable, though the locking may be
> hard.

The get_iovec_page_array caller AFAIKS does move to fast_gup.

That funny optimisation is an attempt to avoid taking and dropping the
mmap_sem for each page if we can perform the copy without taking a
fault. If the same conditions exist for fast_gup, then it can perform
the get_user_pages operation without taking mmap_sem at all.

 
> > -/*
> >   * Map an iov into an array of pages and offset/length tupples. With the
> >   * partial_page structure, we can map several non-contiguous ranges into
> >   * our ones pages[] map instead of splitting that operation into pieces.
> > @@ -1189,8 +1159,6 @@ static int get_iovec_page_array(const st
> >  {
> >  	int buffers = 0, error = 0;
> >  
> > -	down_read(&current->mm->mmap_sem);
> > -
> >  	while (nr_vecs) {
> >  		unsigned long off, npages;
> >  		struct iovec entry;
> > @@ -1199,7 +1167,7 @@ static int get_iovec_page_array(const st
> >  		int i;
> >  
> >  		error = -EFAULT;
> > -		if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
> > +		if (copy_from_user(&entry, iov, sizeof(entry)))
> >  			break;
> >  
> >  		base = entry.iov_base;
> > @@ -1233,9 +1201,8 @@ static int get_iovec_page_array(const st
> >  		if (npages > PIPE_BUFFERS - buffers)
> >  			npages = PIPE_BUFFERS - buffers;
> >  
> > -		error = get_user_pages(current, current->mm,
> > -				       (unsigned long) base, npages, 0, 0,
> > -				       &pages[buffers], NULL);
> > +		error = fast_gup((unsigned long)base, npages,
> > +					0, &pages[buffers]);
> >  
> >  		if (unlikely(error <= 0))
> >  			break;
> > @@ -1274,8 +1241,6 @@ static int get_iovec_page_array(const st
> >  		iov++;
> >  	}
> >  
> > -	up_read(&current->mm->mmap_sem);
> > -
> >  	if (buffers)
> >  		return buffers;
> >  
> > Index: linux-2.6/arch/x86/mm/gup.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/arch/x86/mm/gup.c
> > @@ -0,0 +1,243 @@
> > +/*
> > + * Lockless fast_gup for x86
> > + *
> > + * Copyright (C) 2008 Nick Piggin
> > + * Copyright (C) 2008 Novell Inc.
> > + */
> > +#include <linux/sched.h>
> > +#include <linux/mm.h>
> > +#include <linux/vmstat.h>
> > +#include <asm/pgtable.h>
> > +
> > +static inline pte_t gup_get_pte(pte_t *ptep)
> > +{
> 
> This is the gup form of pte_val, I wonder if it would be better named
> gup_pte_val.

Sure, will do.

 
> > +#ifndef CONFIG_X86_PAE
> > +	return *ptep;
> > +#else
> > +	/*
> > +	 * With fast_gup, we walk down the pagetables without taking any locks.
> > +	 * For this we would like to load the pointers atoimcally, but that is
> > +	 * not possible (without expensive cmpxchg8b) on PAE.  What we do have
> > +	 * is the guarantee that a pte will only either go from not present to
> > +	 * present, or present to not present or both -- it will not switch to
> > +	 * a completely different present page without a TLB flush in between;
> > +	 * something that we are blocking by holding interrupts off.
> > +	 *
> > +	 * Setting ptes from not present to present goes:
> > +	 * ptep->pte_high = h;
> > +	 * smp_wmb();
> > +	 * ptep->pte_low = l;
> > +	 *
> > +	 * And present to not present goes:
> > +	 * ptep->pte_low = 0;
> > +	 * smp_wmb();
> > +	 * ptep->pte_high = 0;
> 
> native_set_pte_present uses the below form, which does also seem to follow
> both of these patterns sequentially, so thats ok:
> 
>         ptep->pte_low = 0;
> 	smp_wmb();
> 	ptep->pte_high = pte.pte_high;
> 	smp_wmb();
> 	ptep->pte_low = pte.pte_low;

That's true... I think this is just for kernel ptes though? So the
access_ok should keep us away from any of those issues.


> > +	 *
> > +	 * We must ensure here that the load of pte_low sees l iff
> > +	 * pte_high sees h. We load pte_high *after* loading pte_low,
> > +	 * which ensures we don't see an older value of pte_high.
> > +	 * *Then* we recheck pte_low, which ensures that we haven't
> > +	 * picked up a changed pte high. We might have got rubbish values
> > +	 * from pte_low and pte_high, but we are guaranteed that pte_low
> > +	 * will not have the present bit set *unless* it is 'l'. And
> > +	 * fast_gup only operates on present ptes, so we're safe.
> 
> To my mind this is only valid because of the requirement stated on
> native_set_pte that we may only set a pte if it is not-present or
> invalid, ie that we would never see the following:
> 
> 	set_pte(N, valid);
> 	set_pte(N, valid2);
> 
> If you rely on that here you should mention it directly, though I guess
> your interrupt based locking out half mentions it.

Well... I did say that it can't go from present to present (implicitly
I guess, because I actually listed what are the valid transitions rather
than what aren't. I could emphasize it).

 
> > +	 *
> > +	 * gup_get_pte should not be used or copied outside gup.c without
> > +	 * being very careful -- it does not atomically load the pte or
> > +	 * anything that is likely to be useful for you.
> > +	 */
> > +	pte_t pte;
> > +
> > +retry:
> > +	pte.pte_low = ptep->pte_low;
> > +	smp_rmb();
> > +	pte.pte_high = ptep->pte_high;
> > +	smp_rmb();
> > +	if (unlikely(pte.pte_low != ptep->pte_low))
> > +		goto retry;
> > +
> 
> Seems to be missing a return pte; here.

Oh, I actually have to *compile* with PAE too? Sorry, forgot to do that,
thanks.

 
> > +#endif
> > +}
> > +
> > +/*
> > + * The performance critical leaf functions are made noinline otherwise gcc
> > + * inlines everything into a single function which results in too much
> > + * register pressure.
> > + */
> > +static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
> > +		unsigned long end, int write, struct page **pages, int *nr)
> > +{
> > +	unsigned long mask, result;
> > +	pte_t *ptep;
> > +
> > +	result = _PAGE_PRESENT|_PAGE_USER;
> > +	if (write)
> > +		result |= _PAGE_RW;
> > +	mask = result | _PAGE_SPECIAL;
> > +
> > +	ptep = pte_offset_map(&pmd, addr);
> > +	do {
> > +		pte_t pte = gup_get_pte(ptep);
> > +		struct page *page;
> > +
> > +		if ((pte_val(pte) & mask) != result)
> > +			return 0;
> > +		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> > +		page = pte_page(pte);
> > +		get_page(page);
> > +		pages[*nr] = page;
> > +		(*nr)++;
> > +
> > +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> > +	pte_unmap(ptep - 1);
> 
> Is this pte_unmap right.  I thought you had to unmap the same address
> that was returned from pte_offset_map, not an incremented version?

So long as it is within the same pte page, it's OK. Check some of
the similar loops in mm/memory.c

 
> > +
> > +	return 1;
> > +}
> > +
> > +static inline void get_head_page_multiple(struct page *page, int nr)
> > +{
> > +	VM_BUG_ON(page != compound_head(page));
> > +	VM_BUG_ON(page_count(page) == 0);
> > +	atomic_add(nr, &page->_count);
> 
> I see that follow_hugetlb_page does this as multiple get_page's.  It
> seems that it could benefit from this optimisation.

Sure, this could go into mm/internal.h or include/linux/mm.h.

 
> > +}
> > +
> > +static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
> > +		unsigned long end, int write, struct page **pages, int *nr)
> > +{
> > +	unsigned long mask;
> > +	pte_t pte = *(pte_t *)&pmd;
> > +	struct page *head, *page;
> > +	int refs;
> > +
> > +	mask = _PAGE_PRESENT|_PAGE_USER;
> > +	if (write)
> > +		mask |= _PAGE_RW;
> > +	if ((pte_val(pte) & mask) != mask)
> > +		return 0;
> > +	/* hugepages are never "special" */
> > +	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> 
> As we am not expecting them to be special could we not check for that anyhow as you do in the normal case.  Safe against anyone changing hugepages.

Couldn't quite parse this ;) Do you mean to put the same VM_BUG_ON in the
follow_hugetlb_page path? Sure we could do that, although in that case
we have everything locked down and have looked up the vma etc. so I was
just being paranoid here really.


> > +int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	unsigned long end = start + (nr_pages << PAGE_SHIFT);
> > +	unsigned long addr = start;
> > +	unsigned long next;
> > +	pgd_t *pgdp;
> > +	int nr = 0;
> > +
> > +	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> > +					start, nr_pages*PAGE_SIZE)))
> > +		goto slow_irqon;
> > +
> > +	/*
> > +	 * XXX: batch / limit 'nr', to avoid large irq off latency
> > +	 * needs some instrumenting to determine the common sizes used by
> > +	 * important workloads (eg. DB2), and whether limiting the batch size
> > +	 * will decrease performance.
> > +	 *
> > +	 * It seems like we're in the clear for the moment. Direct-IO is
> > +	 * the main guy that batches up lots of get_user_pages, and even
> > +	 * they are limited to 64-at-a-time which is not so many.
> > +	 */
> > +	/*
> > +	 * This doesn't prevent pagetable teardown, but does prevent
> > +	 * the pagetables and pages from being freed on x86.
> > +	 *
> > +	 * So long as we atomically load page table pointers versus teardown
> > +	 * (which we do on x86, with the above PAE exception), we can follow the
> > +	 * address down to the the page and take a ref on it.
> > +	 */
> > +	local_irq_disable();
> > +	pgdp = pgd_offset(mm, addr);
> > +	do {
> > +		pgd_t pgd = *pgdp;
> > +
> > +		next = pgd_addr_end(addr, end);
> > +		if (pgd_none(pgd))
> > +			goto slow;
> > +		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> > +			goto slow;
> > +	} while (pgdp++, addr = next, addr != end);
> > +	local_irq_enable();
> > +
> > +	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
> > +	return nr;
> > +
> > +	{
> > +		int i, ret;
> > +
> > +slow:
> > +		local_irq_enable();
> > +slow_irqon:
> > +		/* Could optimise this more by keeping what we've already got */
> > +		for (i = 0; i < nr; i++)
> > +			put_page(pages[i]);
> 
> Perhaps we could have some vm_stats's for these cases so we can see just
> how often we do fail, and how many we wasted.

I did have some stats in place when it was run on db2... zero slowpaths
hit ;) although I guess that's a slightly special case. Definitely there
would be some slowpaths hit, but I don't expect it to be a huge number
in any performance critical code: such code should not be changing the
virtual address space too often.

Not sure whether it should go in... I doubt anyone will check it or really
know what is a bad ratio anyway will they? The best we can do IMO is
look out for regressions reported.

> 
> > +
> > +		down_read(&mm->mmap_sem);
> > +		ret = get_user_pages(current, mm, start,
> > +			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> > +		up_read(&mm->mmap_sem);
> > +
> > +		return ret;
> > +	}
> > +}
> > Index: linux-2.6/include/asm-x86/uaccess.h
> > ===================================================================
> > --- linux-2.6.orig/include/asm-x86/uaccess.h
> > +++ linux-2.6/include/asm-x86/uaccess.h
> > @@ -3,3 +3,8 @@
> >  #else
> >  # include "uaccess_64.h"
> >  #endif
> > +
> > +#define __HAVE_ARCH_FAST_GUP
> > +struct page;
> > +int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages);
> > +
> 
> Looking at this from a protection point of view it seems to me to make
> sane checks, checking that the address is valid first, and then that
> each PTE has the minimum permissions requested.  Where there is any
> ambiguity falling back.

Thanks for reviewing. Some good suggestions I will add.

 
> I did wonder if we could also check _PAGE_BIT_USER bit as by my reading
> that would only ever be set on user pages, and by rejecting pages without
> that set we could prevent any kernel pages being returned basically
> for free.

I still do want the access_ok check to avoid any possible issues with
kernel page table modifications... but checking for the user bit would
be another good sanity check, good idea. 


> Reviewed-by: Andy Whitcroft <apw@shadowen.org>

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 2/2] mm: lockless get_user_pages
  2008-05-23  2:27     ` Nick Piggin
@ 2008-05-23 12:31       ` apw
  2008-05-23 23:44         ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: apw @ 2008-05-23 12:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, shaggy, axboe, torvalds, linux-mm, linux-arch

On Fri, May 23, 2008 at 04:27:33AM +0200, Nick Piggin wrote:
[...]
> > >  /*
> > > - * Do a copy-from-user while holding the mmap_semaphore for reading, in a
> > > - * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem
> > > - * for writing) and page faulting on the user memory pointed to by src.
> > > - * This assumes that we will very rarely hit the partial != 0 path, or this
> > > - * will not be a win.
> > > - */
> > > -static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n)
> > > -{
> > > -	int partial;
> > > -
> > > -	if (!access_ok(VERIFY_READ, src, n))
> > > -		return -EFAULT;
> > > -
> > > -	pagefault_disable();
> > > -	partial = __copy_from_user_inatomic(dst, src, n);
> > > -	pagefault_enable();
> > > -
> > > -	/*
> > > -	 * Didn't copy everything, drop the mmap_sem and do a faulting copy
> > > -	 */
> > > -	if (unlikely(partial)) {
> > > -		up_read(&current->mm->mmap_sem);
> > > -		partial = copy_from_user(dst, src, n);
> > > -		down_read(&current->mm->mmap_sem);
> > > -	}
> > > -
> > > -	return partial;
> > > -}
> > 
> > Why is this optimisation taken out as part of this patch.  From what I
> > an see the caller below does not move to fast_gup so it appears
> > (naively) that this would still be applicable, though the locking may be
> > hard.
> 
> The get_iovec_page_array caller AFAIKS does move to fast_gup.
> 
> That funny optimisation is an attempt to avoid taking and dropping the
> mmap_sem for each page if we can perform the copy without taking a
> fault. If the same conditions exist for fast_gup, then it can perform
> the get_user_pages operation without taking mmap_sem at all.

Yep see below though, the call site for this optimisation seems to go
to copy_from_user ...

> > > -/*
> > >   * Map an iov into an array of pages and offset/length tupples. With the
> > >   * partial_page structure, we can map several non-contiguous ranges into
> > >   * our ones pages[] map instead of splitting that operation into pieces.
> > > @@ -1189,8 +1159,6 @@ static int get_iovec_page_array(const st
> > >  {
> > >  	int buffers = 0, error = 0;
> > >  
> > > -	down_read(&current->mm->mmap_sem);
> > > -
> > >  	while (nr_vecs) {
> > >  		unsigned long off, npages;
> > >  		struct iovec entry;
> > > @@ -1199,7 +1167,7 @@ static int get_iovec_page_array(const st
> > >  		int i;
> > >  
> > >  		error = -EFAULT;
> > > -		if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
> > > +		if (copy_from_user(&entry, iov, sizeof(entry)))
> > >  			break;
> > >  
> > >  		base = entry.iov_base;
> > > @@ -1233,9 +1201,8 @@ static int get_iovec_page_array(const st
> > >  		if (npages > PIPE_BUFFERS - buffers)
> > >  			npages = PIPE_BUFFERS - buffers;
> > >  
> > > -		error = get_user_pages(current, current->mm,
> > > -				       (unsigned long) base, npages, 0, 0,
> > > -				       &pages[buffers], NULL);
> > > +		error = fast_gup((unsigned long)base, npages,
> > > +					0, &pages[buffers]);
> > >  
> > >  		if (unlikely(error <= 0))
> > >  			break;

Its not that clear from these two deltas that it does move to fast_gup,
yes this second one does but the first which used the old optimisation
goes straight to copy_from_user.

[...]
> > native_set_pte_present uses the below form, which does also seem to follow
> > both of these patterns sequentially, so thats ok:
> > 
> >         ptep->pte_low = 0;
> > 	smp_wmb();
> > 	ptep->pte_high = pte.pte_high;
> > 	smp_wmb();
> > 	ptep->pte_low = pte.pte_low;
> 
> That's true... I think this is just for kernel ptes though? So the
> access_ok should keep us away from any of those issues.

Yes I was more saying there is this third case, but it also seems
covered so we don't need to worry on it.

[...]
> > > +static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
> > > +		unsigned long end, int write, struct page **pages, int *nr)
> > > +{
> > > +	unsigned long mask, result;
> > > +	pte_t *ptep;
> > > +
> > > +	result = _PAGE_PRESENT|_PAGE_USER;
> > > +	if (write)
> > > +		result |= _PAGE_RW;
> > > +	mask = result | _PAGE_SPECIAL;
> > > +
> > > +	ptep = pte_offset_map(&pmd, addr);
> > > +	do {
> > > +		pte_t pte = gup_get_pte(ptep);
> > > +		struct page *page;
> > > +
> > > +		if ((pte_val(pte) & mask) != result)
> > > +			return 0;
> > > +		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> > > +		page = pte_page(pte);
> > > +		get_page(page);
> > > +		pages[*nr] = page;
> > > +		(*nr)++;
> > > +
> > > +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> > > +	pte_unmap(ptep - 1);
> > 
> > Is this pte_unmap right.  I thought you had to unmap the same address
> > that was returned from pte_offset_map, not an incremented version?
> 
> So long as it is within the same pte page, it's OK. Check some of
> the similar loops in mm/memory.c

Bah, yes I missed this in kunmap_atomic()... I hate code mixed into the
declarations:

        unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;

so yes its safe.

[...]
> > > +static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
> > > +		unsigned long end, int write, struct page **pages, int *nr)
> > > +{
> > > +	unsigned long mask;
> > > +	pte_t pte = *(pte_t *)&pmd;
> > > +	struct page *head, *page;
> > > +	int refs;
> > > +
> > > +	mask = _PAGE_PRESENT|_PAGE_USER;
> > > +	if (write)
> > > +		mask |= _PAGE_RW;
> > > +	if ((pte_val(pte) & mask) != mask)
> > > +		return 0;
> > > +	/* hugepages are never "special" */
> > > +	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> > 
> > As we am not expecting them to be special could we not check for that anyhow as you do in the normal case.  Safe against anyone changing hugepages.
> 
> Couldn't quite parse this ;) Do you mean to put the same VM_BUG_ON in the
> follow_hugetlb_page path? Sure we could do that, although in that case
> we have everything locked down and have looked up the vma etc. so I was
> just being paranoid here really.

I was more saying, as we know that _PAGE_SPECIAL isn't currently used and
we can only handle the !_PAGE_SPECIAL pages in these paths it might be
prudent to check for the absence of _PAGE_SPECIAL here _anyhow_ exactly
as you did for small pages.  That prevents this code false triggering
should someone later add _PAGE_SPECIAL for hugepages, as they would not
be naturally drawn here to fix it.

[...]
> > I did wonder if we could also check _PAGE_BIT_USER bit as by my reading
> > that would only ever be set on user pages, and by rejecting pages without
> > that set we could prevent any kernel pages being returned basically
> > for free.
> 
> I still do want the access_ok check to avoid any possible issues with
> kernel page table modifications... but checking for the user bit would
> be another good sanity check, good idea. 

Definatly not advocating removing any checks at all.  Just thinking the
addition is one more safety net should any one of the checks be flawed.
Security being a pig to prove at the best of times.

-apw

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 2/2] mm: lockless get_user_pages
  2008-05-23 12:31       ` apw
@ 2008-05-23 23:44         ` Nick Piggin
  2008-05-24  0:04           ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2008-05-23 23:44 UTC (permalink / raw)
  To: apw; +Cc: Andrew Morton, shaggy, axboe, torvalds, linux-mm, linux-arch

On Fri, May 23, 2008 at 01:31:12PM +0100, apw@shadowen.org wrote:
> On Fri, May 23, 2008 at 04:27:33AM +0200, Nick Piggin wrote:
> [...]
> > > Why is this optimisation taken out as part of this patch.  From what I
> > > an see the caller below does not move to fast_gup so it appears
> > > (naively) that this would still be applicable, though the locking may be
> > > hard.
> > 
> > The get_iovec_page_array caller AFAIKS does move to fast_gup.
> > 
> > That funny optimisation is an attempt to avoid taking and dropping the
> > mmap_sem for each page if we can perform the copy without taking a
> > fault. If the same conditions exist for fast_gup, then it can perform
> > the get_user_pages operation without taking mmap_sem at all.
> 
> Yep see below though, the call site for this optimisation seems to go
> to copy_from_user ...
> 
> > > > -/*
> > > >   * Map an iov into an array of pages and offset/length tupples. With the
> > > >   * partial_page structure, we can map several non-contiguous ranges into
> > > >   * our ones pages[] map instead of splitting that operation into pieces.
> > > > @@ -1189,8 +1159,6 @@ static int get_iovec_page_array(const st
> > > >  {
> > > >  	int buffers = 0, error = 0;
> > > >  
> > > > -	down_read(&current->mm->mmap_sem);
> > > > -
> > > >  	while (nr_vecs) {
> > > >  		unsigned long off, npages;
> > > >  		struct iovec entry;
> > > > @@ -1199,7 +1167,7 @@ static int get_iovec_page_array(const st
> > > >  		int i;
> > > >  
> > > >  		error = -EFAULT;
> > > > -		if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
> > > > +		if (copy_from_user(&entry, iov, sizeof(entry)))
> > > >  			break;
> > > >  
> > > >  		base = entry.iov_base;
> > > > @@ -1233,9 +1201,8 @@ static int get_iovec_page_array(const st
> > > >  		if (npages > PIPE_BUFFERS - buffers)
> > > >  			npages = PIPE_BUFFERS - buffers;
> > > >  
> > > > -		error = get_user_pages(current, current->mm,
> > > > -				       (unsigned long) base, npages, 0, 0,
> > > > -				       &pages[buffers], NULL);
> > > > +		error = fast_gup((unsigned long)base, npages,
> > > > +					0, &pages[buffers]);
> > > >  
> > > >  		if (unlikely(error <= 0))
> > > >  			break;
> 
> Its not that clear from these two deltas that it does move to fast_gup,
> yes this second one does but the first which used the old optimisation
> goes straight to copy_from_user.

Oh, the copy_from_user is OK, it is just copying the iovec out from
userspace.

 
> > > > +	pte_unmap(ptep - 1);
> > > 
> > > Is this pte_unmap right.  I thought you had to unmap the same address
> > > that was returned from pte_offset_map, not an incremented version?
> > 
> > So long as it is within the same pte page, it's OK. Check some of
> > the similar loops in mm/memory.c
> 
> Bah, yes I missed this in kunmap_atomic()... I hate code mixed into the
> declarations:
> 
>         unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
> 
> so yes its safe.

Yeah, the convention is a bit ugly, but that's what we've got :P

 
> [...]
> > > > +static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
> > > > +		unsigned long end, int write, struct page **pages, int *nr)
> > > > +{
> > > > +	unsigned long mask;
> > > > +	pte_t pte = *(pte_t *)&pmd;
> > > > +	struct page *head, *page;
> > > > +	int refs;
> > > > +
> > > > +	mask = _PAGE_PRESENT|_PAGE_USER;
> > > > +	if (write)
> > > > +		mask |= _PAGE_RW;
> > > > +	if ((pte_val(pte) & mask) != mask)
> > > > +		return 0;
> > > > +	/* hugepages are never "special" */
> > > > +	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> > > 
> > > As we am not expecting them to be special could we not check for that anyhow as you do in the normal case.  Safe against anyone changing hugepages.
> > 
> > Couldn't quite parse this ;) Do you mean to put the same VM_BUG_ON in the
> > follow_hugetlb_page path? Sure we could do that, although in that case
> > we have everything locked down and have looked up the vma etc. so I was
> > just being paranoid here really.
> 
> I was more saying, as we know that _PAGE_SPECIAL isn't currently used and
> we can only handle the !_PAGE_SPECIAL pages in these paths it might be
> prudent to check for the absence of _PAGE_SPECIAL here _anyhow_ exactly
> as you did for small pages.  That prevents this code false triggering
> should someone later add _PAGE_SPECIAL for hugepages, as they would not
> be naturally drawn here to fix it.

I see. Yes that might be a good idea.

 
> [...]
> > > I did wonder if we could also check _PAGE_BIT_USER bit as by my reading
> > > that would only ever be set on user pages, and by rejecting pages without
> > > that set we could prevent any kernel pages being returned basically
> > > for free.
> > 
> > I still do want the access_ok check to avoid any possible issues with
> > kernel page table modifications... but checking for the user bit would
> > be another good sanity check, good idea. 
> 
> Definatly not advocating removing any checks at all.  Just thinking the
> addition is one more safety net should any one of the checks be flawed.
> Security being a pig to prove at the best of times.

It isn't a bad idea at all. I'll see what I can do.

Thanks
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 2/2] mm: lockless get_user_pages
  2008-05-23 23:44         ` Nick Piggin
@ 2008-05-24  0:04           ` Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2008-05-24  0:04 UTC (permalink / raw)
  To: apw; +Cc: Andrew Morton, shaggy, axboe, torvalds, linux-mm, linux-arch

On Sat, May 24, 2008 at 01:44:32AM +0200, Nick Piggin wrote:
> On Fri, May 23, 2008 at 01:31:12PM +0100, apw@shadowen.org wrote:
> > On Fri, May 23, 2008 at 04:27:33AM +0200, Nick Piggin wrote:
> 
>  
> > [...]
> > > > I did wonder if we could also check _PAGE_BIT_USER bit as by my reading
> > > > that would only ever be set on user pages, and by rejecting pages without
> > > > that set we could prevent any kernel pages being returned basically
> > > > for free.
> > > 
> > > I still do want the access_ok check to avoid any possible issues with
> > > kernel page table modifications... but checking for the user bit would
> > > be another good sanity check, good idea. 
> > 
> > Definatly not advocating removing any checks at all.  Just thinking the
> > addition is one more safety net should any one of the checks be flawed.
> > Security being a pig to prove at the best of times.
> 
> It isn't a bad idea at all. I'll see what I can do.

Oh, hmm, I was already checking the _PAGE_USER bit anyway ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 1/2] x86: implement pte_special
@ 2008-05-25 14:48 Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2008-05-25 14:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: shaggy, jens.axboe, torvalds, linux-mm, linux-arch, apw

Implement the pte_special bit for x86. This is required to support lockless
get_user_pages, because we need to know whether or not we can refcount a
particular page given only its pte (and no vma).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h
+++ linux-2.6/include/asm-x86/pgtable.h
@@ -17,6 +17,7 @@
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
+#define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
@@ -39,6 +40,8 @@
 #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
 #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
 #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
+#define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
+#define __HAVE_ARCH_PTE_SPECIAL
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
@@ -199,7 +202,7 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-	return 0;
+	return pte_val(pte) & _PAGE_SPECIAL;
 }
 
 static inline int pmd_large(pmd_t pte)
@@ -265,7 +268,7 @@ static inline pte_t pte_clrglobal(pte_t 
 
 static inline pte_t pte_mkspecial(pte_t pte)
 {
-	return pte;
+	return __pte(pte_val(pte) | _PAGE_SPECIAL);
 }
 
 extern pteval_t __supported_pte_mask;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 0/2]: lockless get_user_pages patchset
@ 2008-03-28  2:54 Nick Piggin
  2008-03-28  2:55 ` [patch 1/2]: x86: implement pte_special Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2008-03-28  2:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: shaggy, axboe, linux-mm, linux-arch, torvalds

Hi,

This patchset It impoves our well known oltp performance thingy by 10% on DB2
on a modest 2 socket x86 system. For a sense of scale, remember numbers being
tossed around like direct-IO giving a 12% improvement; or hugepages giving a 9%
improvement... heh.

I have a powerpc patch as well, but it needs benh to find me a bit in their pte
to use.

These patches are on top of the previous 7 patches just sent.

Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 1/2]: x86: implement pte_special
  2008-03-28  2:54 [patch 0/2]: lockless get_user_pages patchset Nick Piggin
@ 2008-03-28  2:55 ` Nick Piggin
  2008-03-28  3:23   ` David Miller, Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2008-03-28  2:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: shaggy, axboe, linux-mm, linux-arch, torvalds

Implement the pte_special bit for x86. This is required to support lockless
get_user_pages, because we need to know whether or not we can refcount a
particular page given only its pte (and no vma).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h
+++ linux-2.6/include/asm-x86/pgtable.h
@@ -18,6 +18,7 @@
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
+#define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
@@ -40,6 +41,8 @@
 #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
 #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
 #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
+#define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
+#define __HAVE_ARCH_PTE_SPECIAL
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
@@ -149,7 +152,7 @@ static inline int pte_file(pte_t pte)		{
 static inline int pte_huge(pte_t pte)		{ return pte_val(pte) & _PAGE_PSE; }
 static inline int pte_global(pte_t pte) 	{ return pte_val(pte) & _PAGE_GLOBAL; }
 static inline int pte_exec(pte_t pte)		{ return !(pte_val(pte) & _PAGE_NX); }
-static inline int pte_special(pte_t pte)	{ return 0; }
+static inline int pte_special(pte_t pte)	{ return pte_val(pte) & _PAGE_SPECIAL; }
 
 static inline int pmd_large(pmd_t pte) {
 	return (pmd_val(pte) & (_PAGE_PSE|_PAGE_PRESENT)) ==
@@ -167,7 +170,7 @@ static inline pte_t pte_mkhuge(pte_t pte
 static inline pte_t pte_clrhuge(pte_t pte)	{ return __pte(pte_val(pte) & ~(pteval_t)_PAGE_PSE); }
 static inline pte_t pte_mkglobal(pte_t pte)	{ return __pte(pte_val(pte) | _PAGE_GLOBAL); }
 static inline pte_t pte_clrglobal(pte_t pte)	{ return __pte(pte_val(pte) & ~(pteval_t)_PAGE_GLOBAL); }
-static inline pte_t pte_mkspecial(pte_t pte)	{ return pte; }
+static inline pte_t pte_mkspecial(pte_t pte)	{ return __pte(pte_val(pte) | _PAGE_SPECIAL); }
 
 extern pteval_t __supported_pte_mask;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  2:55 ` [patch 1/2]: x86: implement pte_special Nick Piggin
@ 2008-03-28  3:23   ` David Miller, Nick Piggin
  2008-03-28  3:31     ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller, Nick Piggin @ 2008-03-28  3:23 UTC (permalink / raw)
  To: npiggin; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

> @@ -40,6 +41,8 @@
>  #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
>  #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
>  #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
> +#define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
> +#define __HAVE_ARCH_PTE_SPECIAL

What tests __HAVE_ARCH_PTE_SPECIAL?

> @@ -167,7 +170,7 @@ static inline pte_t pte_mkhuge(pte_t pte
>  static inline pte_t pte_clrhuge(pte_t pte)	{ return __pte(pte_val(pte) & ~(pteval_t)_PAGE_PSE); }
>  static inline pte_t pte_mkglobal(pte_t pte)	{ return __pte(pte_val(pte) | _PAGE_GLOBAL); }
>  static inline pte_t pte_clrglobal(pte_t pte)	{ return __pte(pte_val(pte) & ~(pteval_t)_PAGE_GLOBAL); }
> -static inline pte_t pte_mkspecial(pte_t pte)	{ return pte; }
> +static inline pte_t pte_mkspecial(pte_t pte)	{ return __pte(pte_val(pte) | _PAGE_SPECIAL); }

And what calls pte_mkspecial?

I don't see any code that sets the special bit anywhere
in these two patches.

What am I missing?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  3:23   ` David Miller, Nick Piggin
@ 2008-03-28  3:31     ` Nick Piggin
  2008-03-28  3:44       ` David Miller, Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2008-03-28  3:31 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

On Thu, Mar 27, 2008 at 08:23:34PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Fri, 28 Mar 2008 03:55:41 +0100
> 
> > @@ -40,6 +41,8 @@
> >  #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
> >  #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
> >  #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
> > +#define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
> > +#define __HAVE_ARCH_PTE_SPECIAL
> 
> What tests __HAVE_ARCH_PTE_SPECIAL?
> 
> > @@ -167,7 +170,7 @@ static inline pte_t pte_mkhuge(pte_t pte
> >  static inline pte_t pte_clrhuge(pte_t pte)	{ return __pte(pte_val(pte) & ~(pteval_t)_PAGE_PSE); }
> >  static inline pte_t pte_mkglobal(pte_t pte)	{ return __pte(pte_val(pte) | _PAGE_GLOBAL); }
> >  static inline pte_t pte_clrglobal(pte_t pte)	{ return __pte(pte_val(pte) & ~(pteval_t)_PAGE_GLOBAL); }
> > -static inline pte_t pte_mkspecial(pte_t pte)	{ return pte; }
> > +static inline pte_t pte_mkspecial(pte_t pte)	{ return __pte(pte_val(pte) | _PAGE_SPECIAL); }
> 
> And what calls pte_mkspecial?
> 
> I don't see any code that sets the special bit anywhere
> in these two patches.
> 
> What am I missing?

Oh, sorry these 2 patches are on top of the previous 7 that I sent out.
Hmm, only patch 2/7 went to linux-arch, which is the main consumer of
this pte_special stuff, however if you want a more coherent view of those
7 patches, they are on linux-mm.

Basically, the pfn-based mapping insertion (vm_insert_pfn, remap_pfn_range)
calls pte_mkspecial. And that tells fast_gup "hands off".

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  3:31     ` Nick Piggin
@ 2008-03-28  3:44       ` David Miller, Nick Piggin
  2008-03-28  4:04         ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller, Nick Piggin @ 2008-03-28  3:44 UTC (permalink / raw)
  To: npiggin; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

> Basically, the pfn-based mapping insertion (vm_insert_pfn, remap_pfn_range)
> calls pte_mkspecial. And that tells fast_gup "hands off".

I don't think it's wise to allocate a "soft PTE bit" for this on every
platform, especially for such a limited use case.

Is it feasible to test the page instead?  Or are we talking about
cases where there may not be a backing page?

If the issue is to discern things like I/O mappings and such vs. real
pages, there are ways a platform can handle that without a special
bit.

That would leave us with real memory that does not have backing
page structs, and we have a way to test that too.

The special PTE bit seems superfluous to me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  3:44       ` David Miller, Nick Piggin
@ 2008-03-28  4:04         ` Nick Piggin
  2008-03-28  4:09           ` David Miller, Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Piggin @ 2008-03-28  4:04 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

On Thu, Mar 27, 2008 at 08:44:31PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Fri, 28 Mar 2008 04:31:50 +0100
> 
> > Basically, the pfn-based mapping insertion (vm_insert_pfn, remap_pfn_range)
> > calls pte_mkspecial. And that tells fast_gup "hands off".
> 
> I don't think it's wise to allocate a "soft PTE bit" for this on every
> platform, especially for such a limited use case.

Sure, it will be up to the arch to decide whether or not to use it. If
you can't spare a bit, then nothing changes, you just don't get a speedup ;)

But it is not limited to direct IO. If you have a look through
drivers, there are quite a few things using it that might see improvement.

And even if it were limited to direct IO... there will be some platforms
that want it, I suspect.

> Is it feasible to test the page instead?  Or are we talking about
> cases where there may not be a backing page?

Yes. Or there may be one but you are not allowed to touch it.

> If the issue is to discern things like I/O mappings and such vs. real
> pages, there are ways a platform can handle that without a special
> bit.
> 
> That would leave us with real memory that does not have backing
> page structs, and we have a way to test that too.
> 
> The special PTE bit seems superfluous to me.

It has to be quite fast.

There are some platforms that can't really do that easily, so the pte
bit isn't useless on all architectures. s390 for example, which is
why it was introduced in the other patchset.

x86 has 3 bits which are usable by system software, so are not likely
to be ever used by hardware. And in the entire life of Linux, nobody
has ever wanted to use one :) So I think we're safe to use one on x86.

If bits ever run out, and there are as you say other ways to implement
this, then it could be switched over then. Because nothing is going to
be as fast as testing the pte (which we already have loaded in a
register).

For sparc, if you can propose another method, we could look at that.
But if you think it is such a limited use case, then cleanest and eaisest
will just be to not implement it. I doubt many people to serious oltp
on sparc (on Linux), so for that platform you are probably right. If
another compelling use case comes up, then you can always reevaluate.

BTW. if you are still interested, then the powerpc64 patch might be a
better starting point for you. I don't know how the sparc tlb flush
design looks like, but if it doesn't do a synchronous IPI to invalidate
other threads, then you can't use the x86 approach.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  4:04         ` Nick Piggin
@ 2008-03-28  4:09           ` David Miller, Nick Piggin
  2008-03-28  4:15             ` Nick Piggin
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller, Nick Piggin @ 2008-03-28  4:09 UTC (permalink / raw)
  To: npiggin; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

> BTW. if you are still interested, then the powerpc64 patch might be a
> better starting point for you. I don't know how the sparc tlb flush
> design looks like, but if it doesn't do a synchronous IPI to invalidate
> other threads, then you can't use the x86 approach.

I have soft bits available on sparc64, that's not my issue.

My issue is that if you implemented this differently, every platform
would get the optimization, without having to do anything special
at all, and I think that's such a much nicer way.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  4:09           ` David Miller, Nick Piggin
@ 2008-03-28  4:15             ` Nick Piggin
  2008-03-28  4:16               ` David Miller, Nick Piggin
  2008-03-28  4:17               ` Nick Piggin
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Piggin @ 2008-03-28  4:15 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

On Thu, Mar 27, 2008 at 09:09:10PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Fri, 28 Mar 2008 05:04:42 +0100
> 
> > BTW. if you are still interested, then the powerpc64 patch might be a
> > better starting point for you. I don't know how the sparc tlb flush
> > design looks like, but if it doesn't do a synchronous IPI to invalidate
> > other threads, then you can't use the x86 approach.
> 
> I have soft bits available on sparc64, that's not my issue.
> 
> My issue is that if you implemented this differently, every platform
> would get the optimization, without having to do anything special
> at all, and I think that's such a much nicer way.

Oh, they wouldn't. It is completely tied to the low level details of
their TLB and pagetable teardown design. That's the unfortunate part
about it.

The other thing is that the "how do I know if I can refcount the page
behind this (mm,vaddr,pte) tuple" can be quite arch specific as well.
And it is also non-trivial to do because that information can be dynamic
depending on what driver mapped in that given tuple.

It is *possible*, but not trivial.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  4:15             ` Nick Piggin
@ 2008-03-28  4:16               ` David Miller, Nick Piggin
  2008-03-28  4:19                 ` Nick Piggin
  2008-03-28  4:17               ` Nick Piggin
  1 sibling, 1 reply; 18+ messages in thread
From: David Miller, Nick Piggin @ 2008-03-28  4:16 UTC (permalink / raw)
  To: npiggin; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

> The other thing is that the "how do I know if I can refcount the page
> behind this (mm,vaddr,pte) tuple" can be quite arch specific as well.
> And it is also non-trivial to do because that information can be dynamic
> depending on what driver mapped in that given tuple.

Those are good points.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  4:16               ` David Miller, Nick Piggin
@ 2008-03-28  4:19                 ` Nick Piggin
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2008-03-28  4:19 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

On Thu, Mar 27, 2008 at 09:16:32PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Fri, 28 Mar 2008 05:15:20 +0100
> 
> > The other thing is that the "how do I know if I can refcount the page
> > behind this (mm,vaddr,pte) tuple" can be quite arch specific as well.
> > And it is also non-trivial to do because that information can be dynamic
> > depending on what driver mapped in that given tuple.
> 
> Those are good points.

I know what you mean though... it doesn't feel like it is perfect code
just yet. However, given that it works on x86 and gives such good
results today, I feel that I'd rather get this merged first, and then
maybe when we get some more platforms on board and the code is a bit
more mature then we can try to make it "nicer"...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/2]: x86: implement pte_special
  2008-03-28  4:15             ` Nick Piggin
  2008-03-28  4:16               ` David Miller, Nick Piggin
@ 2008-03-28  4:17               ` Nick Piggin
  1 sibling, 0 replies; 18+ messages in thread
From: Nick Piggin @ 2008-03-28  4:17 UTC (permalink / raw)
  To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds

On Fri, Mar 28, 2008 at 05:15:20AM +0100, Nick Piggin wrote:
> On Thu, Mar 27, 2008 at 09:09:10PM -0700, David Miller wrote:
> > From: Nick Piggin <npiggin@suse.de>
> > Date: Fri, 28 Mar 2008 05:04:42 +0100
> > 
> > > BTW. if you are still interested, then the powerpc64 patch might be a
> > > better starting point for you. I don't know how the sparc tlb flush
> > > design looks like, but if it doesn't do a synchronous IPI to invalidate
> > > other threads, then you can't use the x86 approach.
> > 
> > I have soft bits available on sparc64, that's not my issue.
> > 
> > My issue is that if you implemented this differently, every platform
> > would get the optimization, without having to do anything special
> > at all, and I think that's such a much nicer way.
> 
> Oh, they wouldn't. It is completely tied to the low level details of
> their TLB and pagetable teardown design. That's the unfortunate part
> about it.
> 
> The other thing is that the "how do I know if I can refcount the page
> behind this (mm,vaddr,pte) tuple" can be quite arch specific as well.
> And it is also non-trivial to do because that information can be dynamic
> depending on what driver mapped in that given tuple.
> 
> It is *possible*, but not trivial.

And, btw, you'd still have to implement the actual fast_gup completely
in arch code. So once you do that, you are free not to use pte_special
for it anyway.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2008-05-25 14:48 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-21 11:59 [patch 1/2] x86: implement pte_special Nick Piggin
2008-05-21 12:11 ` [patch 2/2] mm: lockless get_user_pages Nick Piggin
2008-05-22 10:28   ` Andy Whitcroft
2008-05-23  2:27     ` Nick Piggin
2008-05-23 12:31       ` apw
2008-05-23 23:44         ` Nick Piggin
2008-05-24  0:04           ` Nick Piggin
  -- strict thread matches above, loose matches on Subject: below --
2008-05-25 14:48 [patch 1/2] x86: implement pte_special Nick Piggin
2008-03-28  2:54 [patch 0/2]: lockless get_user_pages patchset Nick Piggin
2008-03-28  2:55 ` [patch 1/2]: x86: implement pte_special Nick Piggin
2008-03-28  3:23   ` David Miller, Nick Piggin
2008-03-28  3:31     ` Nick Piggin
2008-03-28  3:44       ` David Miller, Nick Piggin
2008-03-28  4:04         ` Nick Piggin
2008-03-28  4:09           ` David Miller, Nick Piggin
2008-03-28  4:15             ` Nick Piggin
2008-03-28  4:16               ` David Miller, Nick Piggin
2008-03-28  4:19                 ` Nick Piggin
2008-03-28  4:17               ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).