* [patch 0/2]: lockless get_user_pages patchset
@ 2008-03-28 2:54 Nick Piggin
2008-03-28 2:55 ` [patch 1/2]: x86: implement pte_special Nick Piggin
2008-03-28 3:00 ` [patch 2/2]: introduce fast_gup Nick Piggin
0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2008-03-28 2:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, axboe, linux-mm, linux-arch, torvalds
Hi,
This patchset improves our well known OLTP performance thingy by 10% on DB2
on a modest 2 socket x86 system. For a sense of scale, remember numbers being
tossed around like direct-IO giving a 12% improvement, or hugepages giving a 9%
improvement... heh.
I have a powerpc patch as well, but it needs benh to find me a bit in their pte
to use.
These patches are on top of the previous 7 patches just sent.
Nick
* [patch 1/2]: x86: implement pte_special
2008-03-28 2:54 [patch 0/2]: lockless get_user_pages patchset Nick Piggin
@ 2008-03-28 2:55 ` Nick Piggin
2008-03-28 3:23 ` David Miller
2008-03-28 3:00 ` [patch 2/2]: introduce fast_gup Nick Piggin
1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2008-03-28 2:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, axboe, linux-mm, linux-arch, torvalds
Implement the pte_special bit for x86. This is required to support lockless
get_user_pages, because we need to know whether or not we can refcount a
particular page given only its pte (and no vma).
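As a sketch of how the bit gets consumed (patch 2/2 actually folds
_PAGE_SPECIAL into a single mask test; the open-coded equivalent would be):

	/*
	 * In a lockless walk we only have the pte and no vma, so the
	 * special bit is the only thing telling us whether the page
	 * behind this pte may be refcounted.
	 */
	if (pte_special(pte))
		return 0;	/* bail to the locked slow path */
	page = pte_page(pte);
	get_page(page);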
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h
+++ linux-2.6/include/asm-x86/pgtable.h
@@ -18,6 +18,7 @@
#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_UNUSED3 11
+#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
@@ -40,6 +41,8 @@
#define _PAGE_UNUSED3 (_AC(1, L)<<_PAGE_BIT_UNUSED3)
#define _PAGE_PAT (_AC(1, L)<<_PAGE_BIT_PAT)
#define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
+#define _PAGE_SPECIAL (_AC(1, L)<<_PAGE_BIT_SPECIAL)
+#define __HAVE_ARCH_PTE_SPECIAL
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AC(1, ULL) << _PAGE_BIT_NX)
@@ -149,7 +152,7 @@ static inline int pte_file(pte_t pte) {
static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_PSE; }
static inline int pte_global(pte_t pte) { return pte_val(pte) & _PAGE_GLOBAL; }
static inline int pte_exec(pte_t pte) { return !(pte_val(pte) & _PAGE_NX); }
-static inline int pte_special(pte_t pte) { return 0; }
+static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; }
static inline int pmd_large(pmd_t pte) {
return (pmd_val(pte) & (_PAGE_PSE|_PAGE_PRESENT)) ==
@@ -167,7 +170,7 @@ static inline pte_t pte_mkhuge(pte_t pte
static inline pte_t pte_clrhuge(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_PSE); }
static inline pte_t pte_mkglobal(pte_t pte) { return __pte(pte_val(pte) | _PAGE_GLOBAL); }
static inline pte_t pte_clrglobal(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_GLOBAL); }
-static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
+static inline pte_t pte_mkspecial(pte_t pte) { return __pte(pte_val(pte) | _PAGE_SPECIAL); }
extern pteval_t __supported_pte_mask;
* [patch 2/2]: introduce fast_gup
2008-03-28 2:54 [patch 0/2]: lockless get_user_pages patchset Nick Piggin
2008-03-28 2:55 ` [patch 1/2]: x86: implement pte_special Nick Piggin
@ 2008-03-28 3:00 ` Nick Piggin
2008-03-28 10:01 ` Jens Axboe
` (2 more replies)
1 sibling, 3 replies; 38+ messages in thread
From: Nick Piggin @ 2008-03-28 3:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, axboe, linux-mm, linux-arch, torvalds
Introduce a new "fast_gup" (for want of a better name right now) which
is basically a get_user_pages with a less general API (but still tends to
be suited to the common case):
- task and mm are always current and current->mm
- force is always 0
- pages is always non-NULL
- don't pass back vmas
This restricted API can be implemented in a much more scalable way when
the ptes are present, by walking the page tables locklessly.
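As a sketch, a typical call site conversion (the real conversions are in the
diff below) looks like:

	/* before: */
	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, start, nr_pages,
				write, 0, pages, NULL);
	up_read(&current->mm->mmap_sem);

	/* after: */
	ret = fast_gup(start, nr_pages, write, pages);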
Before anybody asks: it is impossible to just "detect" such a situation in
the existing get_user_pages call and branch to a fastpath in that case,
because get_user_pages requires that mmap_sem is held over the call, whereas
fast_gup does not.
This patch implements fast_gup as a generic function that falls back to
get_user_pages. It converts key direct IO (and splice, for fun) callsites.
It also implements an optimised version on x86, which does not take any
locks in the fastpath.
On x86, we do an optimistic lockless pagetable walk, without taking any page
table locks or even mmap_sem. Page table existence is guaranteed by turning
interrupts off (which, combined with the fact that we're always looking up
the current mm, means we can do the lockless page table walk within the
constraints of the TLB shootdown design). Basically we can do this lockless
pagetable walk in a similar manner to the way the CPU's pagetable walker does
not have to take any locks to find present ptes.
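To sketch why that works (assuming, as on x86, that pagetable pages are only
freed after a synchronous TLB shootdown IPI has been acked by all other CPUs):

	CPU#1 (fast_gup)		CPU#2 (munmap et al)
	local_irq_disable();
	pte = *ptep;			clear ptes, queue pages for freeing
	get_page(pte_page(pte));	send shootdown IPI, wait for ack...
	local_irq_enable();		<- the IPI is only serviced here, so
					   nothing we walked or refcounted
					   could have been freed under us.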
Many other architectures could do the same thing. Those that don't IPI
could potentially RCU free the page tables and do speculative references
on the pages (a la lockless pagecache) to achieve a lockless fast_gup. I
have actually got an implementation of this for powerpc.
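On such architectures the reference itself would also have to be speculative,
roughly like this (a sketch reusing the lockless pagecache helper):

	if (!get_page_unless_zero(page))
		return 0;	/* page was concurrently freed; bail */
	/* re-check the pte: the page may have been reused meanwhile */
	if (unlikely(pte_val(*ptep) != pte_val(pte))) {
		put_page(page);
		return 0;
	}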
This patch was found to give about a 10% performance improvement on a 2 socket
8 core Intel Xeon system running an OLTP workload on DB2 v9.5:
"To test the effects of the patch, an OLTP workload was run on an IBM
x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
2.93 GHz) using IBM DB2 v9.5 running a Linux 2.6.24-rc7 kernel. Comparing
runs with and without the patch resulted in an overall performance
benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
__up_read and __down_read routines that are seen during thread contention
for system resources were reduced from 2.8% down to .05%. Monitoring
the /proc/vmstat output from the patched run showed that the counter for
fast_gup contained a very high number while the fast_gup_slow value was
zero."
(fast_gup_slow is a counter we had for the number of times the slowpath
was invoked).
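(Sketch of where that debug-only instrumentation sat; it is not part of
this patch, and the event names here are made up:)

	count_vm_event(FAST_GUP);	/* end of the fastpath */
	...
	count_vm_event(FAST_GUP_SLOW);	/* top of the slow: fallback */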
The main reason for the improvement is that DB2 has multiple threads each
issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
contend on the mmap_sem cacheline, and can also contend on page table locks.
I would anticipate larger performance gains on larger systems, however I
think DB2 uses an adaptive mix of threads and processes, so it could be
that thread contention remains pretty constant as machine size increases.
In which case, we're stuck with "only" a 10% gain.
Lots of other code could use this too (eg. grep drivers/).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
arch/x86/mm/Makefile | 2
arch/x86/mm/gup.c | 193 ++++++++++++++++++++++++++++++++++++++++++++++
fs/bio.c | 8 -
fs/direct-io.c | 10 --
fs/splice.c | 41 ---------
include/asm-x86/uaccess.h | 5 +
include/linux/mm.h | 19 ++++
7 files changed, 225 insertions(+), 53 deletions(-)
Index: linux-2.6/fs/bio.c
===================================================================
--- linux-2.6.orig/fs/bio.c
+++ linux-2.6/fs/bio.c
@@ -637,12 +637,8 @@ static struct bio *__bio_map_user_iov(st
const int local_nr_pages = end - start;
const int page_limit = cur_page + local_nr_pages;
- down_read(¤t->mm->mmap_sem);
- ret = get_user_pages(current, current->mm, uaddr,
- local_nr_pages,
- write_to_vm, 0, &pages[cur_page], NULL);
- up_read(¤t->mm->mmap_sem);
-
+ ret = fast_gup(uaddr, local_nr_pages,
+ write_to_vm, &pages[cur_page]);
if (ret < local_nr_pages) {
ret = -EFAULT;
goto out_unmap;
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c
+++ linux-2.6/fs/direct-io.c
@@ -150,17 +150,11 @@ static int dio_refill_pages(struct dio *
int nr_pages;
nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
- down_read(¤t->mm->mmap_sem);
- ret = get_user_pages(
- current, /* Task for fault acounting */
- current->mm, /* whose pages? */
+ ret = fast_gup(
dio->curr_user_address, /* Where from? */
nr_pages, /* How many pages? */
dio->rw == READ, /* Write to memory? */
- 0, /* force (?) */
- &dio->pages[0],
- NULL); /* vmas */
- up_read(¤t->mm->mmap_sem);
+ &dio->pages[0]); /* Put results here */
if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
struct page *page = ZERO_PAGE(0);
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -1170,36 +1170,6 @@ static long do_splice(struct file *in, l
}
/*
- * Do a copy-from-user while holding the mmap_semaphore for reading, in a
- * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem
- * for writing) and page faulting on the user memory pointed to by src.
- * This assumes that we will very rarely hit the partial != 0 path, or this
- * will not be a win.
- */
-static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n)
-{
- int partial;
-
- if (!access_ok(VERIFY_READ, src, n))
- return -EFAULT;
-
- pagefault_disable();
- partial = __copy_from_user_inatomic(dst, src, n);
- pagefault_enable();
-
- /*
- * Didn't copy everything, drop the mmap_sem and do a faulting copy
- */
- if (unlikely(partial)) {
- up_read(¤t->mm->mmap_sem);
- partial = copy_from_user(dst, src, n);
- down_read(¤t->mm->mmap_sem);
- }
-
- return partial;
-}
-
-/*
* Just copy the data to user space
*/
static int pipe_to_user_copy(struct pipe_inode_info *pipe,
@@ -1424,8 +1394,6 @@ static int get_iovec_page_array(const st
{
int buffers = 0, error = 0;
- down_read(¤t->mm->mmap_sem);
-
while (nr_vecs) {
unsigned long off, npages;
struct iovec entry;
@@ -1434,7 +1402,7 @@ static int get_iovec_page_array(const st
int i;
error = -EFAULT;
- if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
+ if (copy_from_user(&entry, iov, sizeof(entry)))
break;
base = entry.iov_base;
@@ -1468,9 +1436,8 @@ static int get_iovec_page_array(const st
if (npages > PIPE_BUFFERS - buffers)
npages = PIPE_BUFFERS - buffers;
- error = get_user_pages(current, current->mm,
- (unsigned long) base, npages, 0, 0,
- &pages[buffers], NULL);
+ error = fast_gup((unsigned long)base, npages,
+ 0, &pages[buffers]);
if (unlikely(error <= 0))
break;
@@ -1509,8 +1476,6 @@ static int get_iovec_page_array(const st
iov++;
}
- up_read(¤t->mm->mmap_sem);
-
if (buffers)
return buffers;
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -12,6 +12,7 @@
#include <linux/prio_tree.h>
#include <linux/debug_locks.h>
#include <linux/mm_types.h>
+#include <linux/uaccess.h> /* for __HAVE_ARCH_FAST_GUP */
struct mempolicy;
struct anon_vma;
@@ -795,6 +796,24 @@ extern int mprotect_fixup(struct vm_area
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);
+#ifndef __HAVE_ARCH_FAST_GUP
+/* Should be moved to asm-generic, and architectures can include it if they
+ * don't implement their own fast_gup.
+ */
+#define fast_gup(start, nr_pages, write, pages) \
+({ \
+ struct mm_struct *mm = current->mm; \
+ int ret; \
+ \
+ down_read(&mm->mmap_sem); \
+ ret = get_user_pages(current, mm, start, nr_pages, \
+ write, 0, pages, NULL); \
+ up_read(&mm->mmap_sem); \
+ \
+ ret; \
+})
+#endif
+
/*
* A callback you can register to apply pressure to ageable caches.
*
Index: linux-2.6/arch/x86/mm/Makefile
===================================================================
--- linux-2.6.orig/arch/x86/mm/Makefile
+++ linux-2.6/arch/x86/mm/Makefile
@@ -1,4 +1,4 @@
-obj-y := init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o
+obj-y := init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o gup.o
obj-$(CONFIG_X86_32) += pgtable_32.o
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/mm/gup.c
@@ -0,0 +1,198 @@
+/*
+ * Lockless fast_gup for x86
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <asm/pgtable.h>
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
+ unsigned long end, int write, struct page **pages, int *nr)
+{
+ unsigned long mask, result;
+ pte_t *ptep;
+
+ result = _PAGE_PRESENT|_PAGE_USER;
+ if (write)
+ result |= _PAGE_RW;
+ mask = result | _PAGE_SPECIAL;
+
+ ptep = pte_offset_map(&pmd, addr);
+ do {
+ /*
+ * XXX: careful. On 3-level 32-bit, the pte is 64 bits, and
+ * we need to make sure we load the low word first, then the
+ * high. This means _PAGE_PRESENT should be clear if the high
+ * word was not valid. Currently, the C compiler can issue
+ * the loads in any order, and I don't know of a wrapper
+ * function that will do this properly, so it is broken on
+ * 32-bit 3-level for the moment.
+ */
+ pte_t pte = *ptep;
+ struct page *page;
+
+ if ((pte_val(pte) & mask) != result)
+ return 0;
+ VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ page = pte_page(pte);
+ get_page(page);
+ pages[*nr] = page;
+ (*nr)++;
+
+ } while (ptep++, addr += PAGE_SIZE, addr != end);
+ pte_unmap(ptep - 1);
+
+ return 1;
+}
+
+static inline void get_head_page_multiple(struct page *page, int nr)
+{
+ VM_BUG_ON(page != compound_head(page));
+ VM_BUG_ON(page_count(page) == 0);
+ atomic_add(nr, &page->_count);
+}
+
+static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
+ unsigned long end, int write, struct page **pages, int *nr)
+{
+ unsigned long mask;
+ pte_t pte = *(pte_t *)&pmd;
+ struct page *head, *page;
+ int refs;
+
+ mask = _PAGE_PRESENT|_PAGE_USER;
+ if (write)
+ mask |= _PAGE_RW;
+ if ((pte_val(pte) & mask) != mask)
+ return 0;
+ /* hugepages are never "special" */
+ VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+ refs = 0;
+ head = pte_page(pte);
+ page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+ do {
+ VM_BUG_ON(compound_head(page) != head);
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ refs++;
+ } while (addr += PAGE_SIZE, addr != end);
+ get_head_page_multiple(head, refs);
+
+ return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+ int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ pmd_t *pmdp;
+
+ pmdp = pmd_offset(&pud, addr);
+ do {
+ pmd_t pmd = *pmdp;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_none(pmd))
+ return 0;
+ if (unlikely(pmd_large(pmd))) {
+ if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
+ return 0;
+ } else {
+ if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+ return 0;
+ }
+ } while (pmdp++, addr = next, addr != end);
+
+ return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+ unsigned long next;
+ pud_t *pudp;
+
+ pudp = pud_offset(&pgd, addr);
+ do {
+ pud_t pud = *pudp;
+
+ next = pud_addr_end(addr, end);
+ if (pud_none(pud))
+ return 0;
+ if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+ return 0;
+ } while (pudp++, addr = next, addr != end);
+
+ return 1;
+}
+
+int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages)
+{
+ struct mm_struct *mm = current->mm;
+ unsigned long end = start + (nr_pages << PAGE_SHIFT);
+ unsigned long addr = start;
+ unsigned long next;
+ pgd_t *pgdp;
+ int nr = 0;
+
+ /*
+ * XXX: batch / limit 'nr', to avoid huge latency. This
+ * needs some instrumenting to determine the common sizes used by
+ * important workloads (eg. DB2), and whether limiting the batch size
+ * will decrease performance.
+ *
+ * At the moment we're pretty well in the clear: direct-IO only does
+ * 64 pages at a time here, so this is only a theoretical concern. It
+ * is possible that for simplicity here, we should just ask that
+ * callers limit their nr_pages?
+ */
+ /*
+ * This doesn't prevent pagetable teardown, but does prevent
+ * the pagetables and pages from being freed on x86.
+ *
+ * So long as we atomically load page table pointers versus teardown
+ * (which we do on x86), we can follow the address down to the
+ * page and take a ref on it.
+ */
+ local_irq_disable();
+ pgdp = pgd_offset(mm, addr);
+ do {
+ pgd_t pgd = *pgdp;
+
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(pgd))
+ goto slow;
+ if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+ goto slow;
+ } while (pgdp++, addr = next, addr != end);
+ local_irq_enable();
+
+ VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+ return nr;
+
+slow:
+ {
+ int i, ret;
+
+ local_irq_enable();
+ /* Could optimise this more by keeping what we've already got */
+ for (i = 0; i < nr; i++)
+ put_page(pages[i]);
+
+ down_read(&mm->mmap_sem);
+ ret = get_user_pages(current, mm, start,
+ (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+ up_read(&mm->mmap_sem);
+
+ return ret;
+ }
+}
Index: linux-2.6/include/asm-x86/uaccess.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess.h
+++ linux-2.6/include/asm-x86/uaccess.h
@@ -3,3 +3,8 @@
#else
# include "uaccess_64.h"
#endif
+
+#define __HAVE_ARCH_FAST_GUP
+struct page;
+int fast_gup(unsigned long start, int nr_pages, int write, struct page **pages);
+
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 2:55 ` [patch 1/2]: x86: implement pte_special Nick Piggin
@ 2008-03-28 3:23 ` David Miller, Nick Piggin
2008-03-28 3:31 ` Nick Piggin
0 siblings, 1 reply; 38+ messages in thread
From: David Miller, Nick Piggin @ 2008-03-28 3:23 UTC (permalink / raw)
To: npiggin; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
> @@ -40,6 +41,8 @@
> #define _PAGE_UNUSED3 (_AC(1, L)<<_PAGE_BIT_UNUSED3)
> #define _PAGE_PAT (_AC(1, L)<<_PAGE_BIT_PAT)
> #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
> +#define _PAGE_SPECIAL (_AC(1, L)<<_PAGE_BIT_SPECIAL)
> +#define __HAVE_ARCH_PTE_SPECIAL
What tests __HAVE_ARCH_PTE_SPECIAL?
> @@ -167,7 +170,7 @@ static inline pte_t pte_mkhuge(pte_t pte
> static inline pte_t pte_clrhuge(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_PSE); }
> static inline pte_t pte_mkglobal(pte_t pte) { return __pte(pte_val(pte) | _PAGE_GLOBAL); }
> static inline pte_t pte_clrglobal(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_GLOBAL); }
> -static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
> +static inline pte_t pte_mkspecial(pte_t pte) { return __pte(pte_val(pte) | _PAGE_SPECIAL); }
And what calls pte_mkspecial?
I don't see any code that sets the special bit anywhere
in these two patches.
What am I missing?
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 3:23 ` David Miller
@ 2008-03-28 3:31 ` Nick Piggin
2008-03-28 3:44 ` David Miller
0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2008-03-28 3:31 UTC (permalink / raw)
To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
On Thu, Mar 27, 2008 at 08:23:34PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Fri, 28 Mar 2008 03:55:41 +0100
>
> > @@ -40,6 +41,8 @@
> > #define _PAGE_UNUSED3 (_AC(1, L)<<_PAGE_BIT_UNUSED3)
> > #define _PAGE_PAT (_AC(1, L)<<_PAGE_BIT_PAT)
> > #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
> > +#define _PAGE_SPECIAL (_AC(1, L)<<_PAGE_BIT_SPECIAL)
> > +#define __HAVE_ARCH_PTE_SPECIAL
>
> What tests __HAVE_ARCH_PTE_SPECIAL?
>
> > @@ -167,7 +170,7 @@ static inline pte_t pte_mkhuge(pte_t pte
> > static inline pte_t pte_clrhuge(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_PSE); }
> > static inline pte_t pte_mkglobal(pte_t pte) { return __pte(pte_val(pte) | _PAGE_GLOBAL); }
> > static inline pte_t pte_clrglobal(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_GLOBAL); }
> > -static inline pte_t pte_mkspecial(pte_t pte) { return pte; }
> > +static inline pte_t pte_mkspecial(pte_t pte) { return __pte(pte_val(pte) | _PAGE_SPECIAL); }
>
> And what calls pte_mkspecial?
>
> I don't see any code that sets the special bit anywhere
> in these two patches.
>
> What am I missing?
Oh, sorry, these 2 patches are on top of the previous 7 that I sent out.
Hmm, only patch 2/7 went to linux-arch, which is the main consumer of
this pte_special stuff; however, if you want a more coherent view of those
7 patches, they are on linux-mm.
Basically, the pfn-based mapping insertion (vm_insert_pfn, remap_pfn_range)
calls pte_mkspecial. And that tells fast_gup "hands off".
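(Sketch of that producer side; this is from the earlier series, not these
two patches:)

	/* e.g. in insert_pfn() / remap_pfn_range(): */
	entry = pte_mkspecial(pfn_pte(pfn, prot));
	set_pte_at(mm, addr, ptep, entry);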
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 3:31 ` Nick Piggin
@ 2008-03-28 3:44 ` David Miller, Nick Piggin
2008-03-28 4:04 ` Nick Piggin
0 siblings, 1 reply; 38+ messages in thread
From: David Miller @ 2008-03-28 3:44 UTC (permalink / raw)
To: npiggin; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
> Basically, the pfn-based mapping insertion (vm_insert_pfn, remap_pfn_range)
> calls pte_mkspecial. And that tells fast_gup "hands off".
I don't think it's wise to allocate a "soft PTE bit" for this on every
platform, especially for such a limited use case.
Is it feasible to test the page instead? Or are we talking about
cases where there may not be a backing page?
If the issue is to discern things like I/O mappings and such vs. real
pages, there are ways a platform can handle that without a special
bit.
That would leave us with real memory that does not have backing
page structs, and we have a way to test that too.
The special PTE bit seems superfluous to me.
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 3:44 ` David Miller
@ 2008-03-28 4:04 ` Nick Piggin
2008-03-28 4:09 ` David Miller
0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2008-03-28 4:04 UTC (permalink / raw)
To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
On Thu, Mar 27, 2008 at 08:44:31PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Fri, 28 Mar 2008 04:31:50 +0100
>
> > Basically, the pfn-based mapping insertion (vm_insert_pfn, remap_pfn_range)
> > calls pte_mkspecial. And that tells fast_gup "hands off".
>
> I don't think it's wise to allocate a "soft PTE bit" for this on every
> platform, especially for such a limited use case.
Sure, it will be up to the arch to decide whether or not to use it. If
you can't spare a bit, then nothing changes, you just don't get a speedup ;)
But it is not limited to direct IO. If you have a look through
drivers, there are quite a few things using it that might see improvement.
And even if it were limited to direct IO... there will be some platforms
that want it, I suspect.
> Is it feasible to test the page instead? Or are we talking about
> cases where there may not be a backing page?
Yes. Or there may be one but you are not allowed to touch it.
> If the issue is to discern things like I/O mappings and such vs. real
> pages, there are ways a platform can handle that without a special
> bit.
>
> That would leave us with real memory that does not have backing
> page structs, and we have a way to test that too.
>
> The special PTE bit seems superfluous to me.
It has to be quite fast.
There are some platforms that can't really do that easily, so the pte
bit isn't useless on all architectures. s390 for example, which is
why it was introduced in the other patchset.
x86 has 3 bits which are usable by system software, so are not likely
to be ever used by hardware. And in the entire life of Linux, nobody
has ever wanted to use one :) So I think we're safe to use one on x86.
If bits ever run out, and there are as you say other ways to implement
this, then it could be switched over then. Because nothing is going to
be as fast as testing the pte (which we already have loaded in a
register).
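(That is, the fastpath test stays a mask on a value in a register:

	if ((pte_val(pte) & mask) != result)	/* one and + compare */
		return 0;

whereas any lookaside structure, say a hypothetical
page_is_refcountable(pte_pfn(pte)), means extra memory traffic in the
hottest loop.)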
For sparc, if you can propose another method, we could look at that.
But if you think it is such a limited use case, then the cleanest and
easiest thing will just be to not implement it. I doubt many people do
serious OLTP on sparc (on Linux), so for that platform you are probably
right. If
another compelling use case comes up, then you can always reevaluate.
BTW. if you are still interested, then the powerpc64 patch might be a
better starting point for you. I don't know what the sparc tlb flush
design looks like, but if it doesn't do a synchronous IPI to invalidate
other threads, then you can't use the x86 approach.
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 4:04 ` Nick Piggin
@ 2008-03-28 4:09 ` David Miller, Nick Piggin
2008-03-28 4:15 ` Nick Piggin
0 siblings, 1 reply; 38+ messages in thread
From: David Miller @ 2008-03-28 4:09 UTC (permalink / raw)
To: npiggin; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
> BTW. if you are still interested, then the powerpc64 patch might be a
> better starting point for you. I don't know how the sparc tlb flush
> design looks like, but if it doesn't do a synchronous IPI to invalidate
> other threads, then you can't use the x86 approach.
I have soft bits available on sparc64, that's not my issue.
My issue is that if you implemented this differently, every platform
would get the optimization, without having to do anything special
at all, and I think that's such a much nicer way.
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 4:09 ` David Miller
@ 2008-03-28 4:15 ` Nick Piggin
2008-03-28 4:16 ` David Miller
2008-03-28 4:17 ` Nick Piggin
0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2008-03-28 4:15 UTC (permalink / raw)
To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
On Thu, Mar 27, 2008 at 09:09:10PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Fri, 28 Mar 2008 05:04:42 +0100
>
> > BTW. if you are still interested, then the powerpc64 patch might be a
> > better starting point for you. I don't know how the sparc tlb flush
> > design looks like, but if it doesn't do a synchronous IPI to invalidate
> > other threads, then you can't use the x86 approach.
>
> I have soft bits available on sparc64, that's not my issue.
>
> My issue is that if you implemented this differently, every platform
> would get the optimization, without having to do anything special
> at all, and I think that's such a much nicer way.
Oh, they wouldn't. It is completely tied to the low level details of
their TLB and pagetable teardown design. That's the unfortunate part
about it.
The other thing is that the "how do I know if I can refcount the page
behind this (mm,vaddr,pte) tuple" can be quite arch specific as well.
And it is also non-trivial to do because that information can be dynamic
depending on what driver mapped in that given tuple.
It is *possible*, but not trivial.
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 4:15 ` Nick Piggin
@ 2008-03-28 4:16 ` David Miller, Nick Piggin
2008-03-28 4:19 ` Nick Piggin
2008-03-28 4:17 ` Nick Piggin
1 sibling, 1 reply; 38+ messages in thread
From: David Miller @ 2008-03-28 4:16 UTC (permalink / raw)
To: npiggin; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
> The other thing is that the "how do I know if I can refcount the page
> behind this (mm,vaddr,pte) tuple" can be quite arch specific as well.
> And it is also non-trivial to do because that information can be dynamic
> depending on what driver mapped in that given tuple.
Those are good points.
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 4:15 ` Nick Piggin
2008-03-28 4:16 ` David Miller
@ 2008-03-28 4:17 ` Nick Piggin
1 sibling, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2008-03-28 4:17 UTC (permalink / raw)
To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
On Fri, Mar 28, 2008 at 05:15:20AM +0100, Nick Piggin wrote:
> On Thu, Mar 27, 2008 at 09:09:10PM -0700, David Miller wrote:
> > From: Nick Piggin <npiggin@suse.de>
> > Date: Fri, 28 Mar 2008 05:04:42 +0100
> >
> > > BTW. if you are still interested, then the powerpc64 patch might be a
> > > better starting point for you. I don't know how the sparc tlb flush
> > > design looks like, but if it doesn't do a synchronous IPI to invalidate
> > > other threads, then you can't use the x86 approach.
> >
> > I have soft bits available on sparc64, that's not my issue.
> >
> > My issue is that if you implemented this differently, every platform
> > would get the optimization, without having to do anything special
> > at all, and I think that's such a much nicer way.
>
> Oh, they wouldn't. It is completely tied to the low level details of
> their TLB and pagetable teardown design. That's the unfortunate part
> about it.
>
> The other thing is that the "how do I know if I can refcount the page
> behind this (mm,vaddr,pte) tuple" can be quite arch specific as well.
> And it is also non-trivial to do because that information can be dynamic
> depending on what driver mapped in that given tuple.
>
> It is *possible*, but not trivial.
And, btw, you'd still have to implement the actual fast_gup completely
in arch code. So once you do that, you are free not to use pte_special
for it anyway.
* Re: [patch 1/2]: x86: implement pte_special
2008-03-28 4:16 ` David Miller
@ 2008-03-28 4:19 ` Nick Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2008-03-28 4:19 UTC (permalink / raw)
To: David Miller; +Cc: akpm, shaggy, axboe, linux-mm, linux-arch, torvalds
On Thu, Mar 27, 2008 at 09:16:32PM -0700, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Fri, 28 Mar 2008 05:15:20 +0100
>
> > The other thing is that the "how do I know if I can refcount the page
> > behind this (mm,vaddr,pte) tuple" can be quite arch specific as well.
> > And it is also non-trivial to do because that information can be dynamic
> > depending on what driver mapped in that given tuple.
>
> Those are good points.
I know what you mean though... it doesn't feel like it is perfect code
just yet. However, given that it works on x86 and gives such good
results today, I feel that I'd rather get this merged first, and then
maybe when we get some more platforms on board and the code is a bit
more mature then we can try to make it "nicer"...
* Re: [patch 2/2]: introduce fast_gup
2008-03-28 3:00 ` [patch 2/2]: introduce fast_gup Nick Piggin
@ 2008-03-28 10:01 ` Jens Axboe
2008-04-17 15:03 ` Peter Zijlstra
2008-04-22 9:42 ` Peter Zijlstra
2 siblings, 0 replies; 38+ messages in thread
From: Jens Axboe @ 2008-03-28 10:01 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, shaggy, linux-mm, linux-arch, torvalds
On Fri, Mar 28 2008, Nick Piggin wrote:
>
> Introduce a new "fast_gup" (for want of a better name right now) which
> is basically a get_user_pages with a less general API (but still tends to
> be suited to the common case):
I did some quick tests here with two kernels - baseline was current -git
with the io cpu affinity stuff, nick is that same base but with your
fast_gup() applied. The test run was meant to show the best possible
scenario, 1 thread per disk (11 in total) with 4kb block size. Each
thread used async O_DIRECT reads, queue depth was 64.
For each kernel, a=0 means that completions were done normally and
a=1 means that completions were moved to the submitter. Total
runtime for each iteration is ~20 seconds, each test was run 3 times and
the scores averaged (very little deviation was seen between runs).
Kernel          bw         usr(sec)   sys(sec)   bw/sys
----------------------------------------------------------------------
baseline,a=0    306MiB/s    3.490     14.308     21.39
baseline,a=1    309MiB/s    3.717     13.718     22.53
nick,a=0        310MiB/s    3.669     13.804     22.46
nick,a=1        311MiB/s    3.686     13.279     23.42
That last number is just bandwidth/systime. So baseline vs your patch
gets about 5% better bw/sys utilization. fast_gup() + io affinity is
about 9.5% better bw/sys.
The system is just a puny 2-way x86-64, two sockets with HT enabled. So
I'd say the results look quite good!
--
Jens Axboe
* Re: [patch 2/2]: introduce fast_gup
2008-03-28 3:00 ` [patch 2/2]: introduce fast_gup Nick Piggin
2008-03-28 10:01 ` Jens Axboe
@ 2008-04-17 15:03 ` Peter Zijlstra
2008-04-17 15:25 ` Linus Torvalds
2008-04-22 9:42 ` Peter Zijlstra
2 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-04-17 15:03 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, shaggy, axboe, linux-mm, linux-arch, torvalds,
Clark Williams, Ingo Molnar
On Fri, 2008-03-28 at 04:00 +0100, Nick Piggin wrote:
> +++ linux-2.6/arch/x86/mm/gup.c
> @@ -0,0 +1,198 @@
> +/*
> + * Lockless fast_gup for x86
> + *
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.
> + */
> +#include <linux/sched.h>
> +#include <linux/mm.h>
> +#include <linux/vmstat.h>
> +#include <asm/pgtable.h>
> +
> +/*
> + * The performance critical leaf functions are made noinline otherwise gcc
> + * inlines everything into a single function which results in too much
> + * register pressure.
> + */
> +static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
> + unsigned long end, int write, struct page **pages, int *nr)
> +{
> + unsigned long mask, result;
> + pte_t *ptep;
> +
> + result = _PAGE_PRESENT|_PAGE_USER;
> + if (write)
> + result |= _PAGE_RW;
> + mask = result | _PAGE_SPECIAL;
> +
> + ptep = pte_offset_map(&pmd, addr);
> + do {
> + /*
> + * XXX: careful. On 3-level 32-bit, the pte is 64 bits, and
> + * we need to make sure we load the low word first, then the
> + * high. This means _PAGE_PRESENT should be clear if the high
> + * word was not valid. Currently, the C compiler can issue
> + * the loads in any order, and I don't know of a wrapper
> + * function that will do this properly, so it is broken on
> + * 32-bit 3-level for the moment.
> + */
> + pte_t pte = *ptep;
> + struct page *page;
> +
> + if ((pte_val(pte) & mask) != result)
> + return 0;
> + VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> + page = pte_page(pte);
> + get_page(page);
> + pages[*nr] = page;
> + (*nr)++;
> +
> + } while (ptep++, addr += PAGE_SIZE, addr != end);
> + pte_unmap(ptep - 1);
> +
> + return 1;
> +}
Would this be sufficient to address that comment's concern?
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c
+++ linux-2.6/arch/x86/mm/gup.c
@@ -36,8 +36,16 @@ static noinline int gup_pte_range(pmd_t
* function that will do this properly, so it is broken on
* 32-bit 3-level for the moment.
*/
- pte_t pte = *ptep;
struct page *page;
+ pte_t pte;
+
+#ifdef CONFIG_X86_PAE
+ pte.pte_low = ptep->pte_low;
+ barrier();
+ pte.pte_high = ptep->pte_high;
+#else
+ pte = *ptep;
+#endif
if ((pte_val(pte) & mask) != result)
return 0;
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 15:03 ` Peter Zijlstra
@ 2008-04-17 15:25 ` Linus Torvalds
2008-04-17 16:12 ` Peter Zijlstra
2008-04-21 12:00 ` Avi Kivity
0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-04-17 15:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Andrew Morton, shaggy, axboe, linux-mm, linux-arch,
Clark Williams, Ingo Molnar
On Thu, 17 Apr 2008, Peter Zijlstra wrote:
>
> Would this be sufficient to address that comment's concern?
It would be nicer to just add a "native_get_pte()" to x86, to match the
already-existing "native_set_pte()".
And that "barrier()" should b "smp_rmb()". They may be the same code
sequence, but from a conceptual angle, "smp_rmb()" makes a whole lot more
sense.
Finally, I don't think that comment is correct in the first place. It's
not that simple. The thing is, even *with* the memory barrier in place, we
may have:
CPU#1 CPU#2
===== =====
fast_gup:
- read low word
native_set_pte_present:
- set low word to 0
- set high word to new value
- read high word
- set low word to new value
and so you read a low word that is associated with a *different* high
word! Notice?
So trivial memory ordering is _not_ enough.
So I think the code literally needs to be something like this
#ifdef CONFIG_X86_PAE
static inline pte_t native_get_pte(pte_t *ptep)
{
pte_t pte;
retry:
pte.pte_low = ptep->pte_low;
smp_rmb();
pte.pte_high = ptep->pte_high;
smp_rmb();
if (unlikely(pte.pte_low != ptep->pte_low))
goto retry;
return pte;
}
#else
#define native_get_pte(ptep) (*(ptep))
#endif
but I have admittedly not really thought it fully through.
Linus
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 15:25 ` Linus Torvalds
@ 2008-04-17 16:12 ` Peter Zijlstra
2008-04-17 16:18 ` Linus Torvalds
2008-04-18 9:58 ` Jeremy Fitzhardinge
2008-04-21 12:00 ` Avi Kivity
1 sibling, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2008-04-17 16:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Andrew Morton, shaggy, axboe, linux-mm, linux-arch,
Clark Williams, Ingo Molnar, Jeremy Fitzhardinge
On Thu, 2008-04-17 at 08:25 -0700, Linus Torvalds wrote:
>
> On Thu, 17 Apr 2008, Peter Zijlstra wrote:
> >
> > Would this be sufficient to address that comment's concern?
>
> It would be nicer to just add a "native_get_pte()" to x86, to match the
> already-existing "native_set_pte()".
See, I _knew_ I was missing something obvious :-/
> And that "barrier()" should b "smp_rmb()". They may be the same code
> sequence, but from a conceptual angle, "smp_rmb()" makes a whole lot more
> sense.
>
> Finally, I don't think that comment is correct in the first place. It's
> not that simple. The thing is, even *with* the memory barrier in place, we
> may have:
>
> CPU#1 CPU#2
> ===== =====
>
> fast_gup:
> - read low word
>
> native_set_pte_present:
> - set low word to 0
> - set high word to new value
>
> - read high word
>
> - set low word to new value
>
> and so you read a low word that is associated with a *different* high
> word! Notice?
>
> So trivial memory ordering is _not_ enough.
>
> So I think the code literally needs to be something like this
>
> #ifdef CONFIG_X86_PAE
>
> static inline pte_t native_get_pte(pte_t *ptep)
> {
> pte_t pte;
>
> retry:
> pte.pte_low = ptep->pte_low;
> smp_rmb();
> pte.pte_high = ptep->pte_high;
> smp_rmb();
> if (unlikely(pte.pte_low != ptep->pte_low))
> goto retry;
> return pte;
> }
>
> #else
>
> #define native_get_pte(ptep) (*(ptep))
>
> #endif
>
> but I have admittedly not really thought it fully through.
Looks sane here; Clark, can you give this a spin?
Jeremy, did I get the paravirt stuff right?
---
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c
+++ linux-2.6/arch/x86/mm/gup.c
@@ -36,7 +36,7 @@ static noinline int gup_pte_range(pmd_t
* function that will do this properly, so it is broken on
* 32-bit 3-level for the moment.
*/
- pte_t pte = *ptep;
+ pte_t pte = get_pte(ptep);
struct page *page;
if ((pte_val(pte) & mask) != result)
Index: linux-2.6/arch/x86/kernel/paravirt_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/paravirt_32.c
+++ linux-2.6/arch/x86/kernel/paravirt_32.c
@@ -450,6 +450,7 @@ struct pv_mmu_ops pv_mmu_ops = {
#ifdef CONFIG_X86_PAE
.set_pte_atomic = native_set_pte_atomic,
.set_pte_present = native_set_pte_present,
+ .get_pte = native_get_pte,
.set_pud = native_set_pud,
.pte_clear = native_pte_clear,
.pmd_clear = native_pmd_clear,
Index: linux-2.6/include/asm-x86/paravirt.h
===================================================================
--- linux-2.6.orig/include/asm-x86/paravirt.h
+++ linux-2.6/include/asm-x86/paravirt.h
@@ -216,6 +216,7 @@ struct pv_mmu_ops {
void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
void (*set_pte_present)(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte);
+ pte_t (*get_pte)(pte_t *ptep);
void (*set_pud)(pud_t *pudp, pud_t pudval);
void (*pte_clear)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
void (*pmd_clear)(pmd_t *pmdp);
@@ -886,6 +887,13 @@ static inline void set_pte_present(struc
pv_mmu_ops.set_pte_present(mm, addr, ptep, pte);
}
+static inline pte_t get_pte(pte_t *ptep)
+{
+ unsigned long long ret = PVOP_CALL1(unsigned long long, pv_mmu_ops.get_pte, ptep);
+
+ return (pte_t) { ret, ret >> 32 };
+}
+
static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
{
PVOP_VCALL3(pv_mmu_ops.set_pmd, pmdp,
@@ -941,6 +949,11 @@ static inline void set_pte_at(struct mm_
PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pteval.pte_low);
}
+static inline pte_t get_pte(pte_t *ptep)
+{
+ return (pte_t) { PVOP_CALL1(unsigned long, pv_mmu_ops.get_pte, ptep) };
+}
+
static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
{
PVOP_VCALL2(pv_mmu_ops.set_pmd, pmdp, pmdval.pud.pgd.pgd);
Index: linux-2.6/include/asm-x86/pgtable-2level.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable-2level.h
+++ linux-2.6/include/asm-x86/pgtable-2level.h
@@ -20,6 +20,10 @@ static inline void native_set_pte_at(str
{
native_set_pte(ptep, pte);
}
+static inline pte_t native_get_pte(pte_t *ptep)
+{
+ return *ptep;
+}
static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
*pmdp = pmd;
@@ -33,6 +37,8 @@ static inline void native_set_pmd(pmd_t
#define set_pte_atomic(pteptr, pteval) set_pte(pteptr,pteval)
#define set_pte_present(mm,addr,ptep,pteval) set_pte_at(mm,addr,ptep,pteval)
+#define get_pte(ptep) native_get_pte(ptep)
+
#define pte_clear(mm,addr,xp) do { set_pte_at(mm, addr, xp, __pte(0)); } while (0)
#define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0)
Index: linux-2.6/include/asm-x86/pgtable-3level.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable-3level.h
+++ linux-2.6/include/asm-x86/pgtable-3level.h
@@ -61,6 +61,20 @@ static inline void native_set_pte_presen
ptep->pte_low = pte.pte_low;
}
+static inline pte_t native_get_pte(pte_t *ptep)
+{
+ pte_t pte;
+
+retry:
+ pte.pte_low = ptep->pte_low;
+ smp_rmb();
+ pte.pte_high = ptep->pte_high;
+ smp_rmb();
+ if (unlikely(pte.pte_low != ptep->pte_low))
+ goto retry;
+ return pte;
+}
+
static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
{
set_64bit((unsigned long long *)(ptep),native_pte_val(pte));
@@ -99,6 +113,7 @@ static inline void native_pmd_clear(pmd_
#define set_pte_at(mm, addr, ptep, pte) native_set_pte_at(mm, addr, ptep, pte)
#define set_pte_present(mm, addr, ptep, pte) native_set_pte_present(mm, addr, ptep, pte)
#define set_pte_atomic(ptep, pte) native_set_pte_atomic(ptep, pte)
+#define get_pte(ptep) native_get_pte(ptep)
#define set_pmd(pmdp, pmd) native_set_pmd(pmdp, pmd)
#define set_pud(pudp, pud) native_set_pud(pudp, pud)
#define pte_clear(mm, addr, ptep) native_pte_clear(mm, addr, ptep)
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 16:12 ` Peter Zijlstra
@ 2008-04-17 16:18 ` Linus Torvalds
2008-04-17 16:35 ` Peter Zijlstra
2008-04-18 9:58 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2008-04-17 16:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Andrew Morton, shaggy, axboe, linux-mm, linux-arch,
Clark Williams, Ingo Molnar, Jeremy Fitzhardinge
On Thu, 17 Apr 2008, Peter Zijlstra wrote:
>
> Jeremy, did I get the paravirt stuff right?
I don't think this is worth it to virtualize.
We access the page tables directly in any number of places, having a
"get_pte()" indirection here is not going to help anything.
Just make it an x86-only inline function. In fact, you can keep it inside
arch/x86/mm/gup.c, because nobody else is likely to ever even need it,
since normal accesses are all supposed to be done under the page table
spinlock, so they do not have this issue at all.
The indirection and virtualization thing is just going to complicate
matters for no good reason.
Linus
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 16:18 ` Linus Torvalds
@ 2008-04-17 16:35 ` Peter Zijlstra
2008-04-17 16:40 ` Linus Torvalds
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-04-17 16:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Andrew Morton, shaggy, axboe, linux-mm, linux-arch,
Clark Williams, Ingo Molnar, Jeremy Fitzhardinge
On Thu, 2008-04-17 at 09:18 -0700, Linus Torvalds wrote:
>
> On Thu, 17 Apr 2008, Peter Zijlstra wrote:
> >
> > Jeremy, did I get the paravirt stuff right?
Still wanting to know if I got it right.
> I don't think this is worth it to virtualize.
>
> We access the page tables directly in any number of places, having a
> "get_pte()" indirection here is not going to help anything.
>
> Just make it an x86-only inline function. In fact, you can keep it inside
> arch/x86/mm/gup.c, because nobody else is likely to ever even need it,
> since normal accesses are all supposed to be done under the page table
> spinlock, so they do not have this issue at all.
>
> The indirection and virtualization thing is just going to complicate
> matters for no good reason.
Here you go ;-)
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c
+++ linux-2.6/arch/x86/mm/gup.c
@@ -9,6 +9,49 @@
#include <linux/vmstat.h>
#include <asm/pgtable.h>
+#ifdef CONFIG_X86_PAE
+
+/*
+ * Companion to native_set_pte_present(); normal access takes the pte_lock
+ * and thus doesn't need it.
+ *
+ * This closes the race:
+ *
+ * CPU#1 CPU#2
+ * ===== =====
+ *
+ * fast_gup:
+ * - read low word
+ *
+ * native_set_pte_present:
+ * - set low word to 0
+ * - set high word to new value
+ *
+ * - read high word
+ *
+ * - set low word to new value
+ *
+ */
+static inline pte_t native_get_pte(pte_t *ptep)
+{
+ pte_t pte;
+
+retry:
+ pte.pte_low = ptep->pte_low;
+ smp_rmb();
+ pte.pte_high = ptep->pte_high;
+ smp_rmb();
+ if (unlikely(pte.pte_low != ptep->pte_low))
+ goto retry;
+ return pte;
+}
+
+#else
+
+#define native_get_pte(ptep) (*(ptep))
+
+#endif
+
/*
* The performance critical leaf functions are made noinline otherwise gcc
* inlines everything into a single function which results in too much
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 16:35 ` Peter Zijlstra
@ 2008-04-17 16:40 ` Linus Torvalds
2008-04-17 17:23 ` Peter Zijlstra
0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2008-04-17 16:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Andrew Morton, shaggy, axboe, linux-mm, linux-arch,
Clark Williams, Ingo Molnar, Jeremy Fitzhardinge
On Thu, 17 Apr 2008, Peter Zijlstra wrote:
>
> Here you go ;-)
I think you should _use_ the new functions too ;)
Linus
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 16:40 ` Linus Torvalds
@ 2008-04-17 17:23 ` Peter Zijlstra
2008-04-17 18:28 ` Linus Torvalds
2008-04-18 6:31 ` Geert Uytterhoeven
0 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2008-04-17 17:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Andrew Morton, shaggy, axboe, linux-mm, linux-arch,
Clark Williams, Ingo Molnar, Jeremy Fitzhardinge
On Thu, 2008-04-17 at 09:40 -0700, Linus Torvalds wrote:
>
> On Thu, 17 Apr 2008, Peter Zijlstra wrote:
> >
> > Here you go ;-)
>
> I think you should _use_ the new functions too ;)
D'0h - clearly not my day today...
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c
+++ linux-2.6/arch/x86/mm/gup.c
@@ -9,6 +9,49 @@
#include <linux/vmstat.h>
#include <asm/pgtable.h>
+#ifdef CONFIG_X86_PAE
+
+/*
+ * Companion to native_set_pte_present(); normal access takes the pte_lock
+ * and thus doesn't need it.
+ *
+ * This closes the race:
+ *
+ * CPU#1 CPU#2
+ * ===== =====
+ *
+ * fast_gup:
+ * - read low word
+ *
+ * native_set_pte_present:
+ * - set low word to 0
+ * - set high word to new value
+ *
+ * - read high word
+ *
+ * - set low word to new value
+ *
+ */
+static inline pte_t native_get_pte(pte_t *ptep)
+{
+ pte_t pte;
+
+retry:
+ pte.pte_low = ptep->pte_low;
+ smp_rmb();
+ pte.pte_high = ptep->pte_high;
+ smp_rmb();
+ if (unlikely(pte.pte_low != ptep->pte_low))
+ goto retry;
+ return pte;
+}
+
+#else
+
+#define native_get_pte(ptep) (*(ptep))
+
+#endif
+
/*
* The performance critical leaf functions are made noinline otherwise gcc
* inlines everything into a single function which results in too much
@@ -36,7 +79,7 @@ static noinline int gup_pte_range(pmd_t
* function that will do this properly, so it is broken on
* 32-bit 3-level for the moment.
*/
- pte_t pte = *ptep;
+ pte_t pte = native_get_pte(ptep);
struct page *page;
if ((pte_val(pte) & mask) != result)
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 17:23 ` Peter Zijlstra
@ 2008-04-17 18:28 ` Linus Torvalds
2008-04-22 3:14 ` Nick Piggin
2008-04-18 6:31 ` Geert Uytterhoeven
1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2008-04-17 18:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nick Piggin, Andrew Morton, shaggy, axboe, linux-mm, linux-arch,
Clark Williams, Ingo Molnar, Jeremy Fitzhardinge
On Thu, 17 Apr 2008, Peter Zijlstra wrote:
>
> D'0h - clearly not my day today...
Ok, I'm acking this one ;)
And yes, it would be nice if the gup patches would go in early, since I
wouldn't be entirely surprised if other architectures didn't have some
other subtle issues here. We've never accessed the page tables without
locking before, so we've only had races with hardware, never software.
Linus
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 17:23 ` Peter Zijlstra
2008-04-17 18:28 ` Linus Torvalds
@ 2008-04-18 6:31 ` Geert Uytterhoeven
2008-04-18 14:40 ` Linus Torvalds
1 sibling, 1 reply; 38+ messages in thread
From: Geert Uytterhoeven @ 2008-04-18 6:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Nick Piggin, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar,
Jeremy Fitzhardinge
On Thu, 17 Apr 2008, Peter Zijlstra wrote:
> +retry:
> + pte.pte_low = ptep->pte_low;
> + smp_rmb();
> + pte.pte_high = ptep->pte_high;
> + smp_rmb();
> + if (unlikely(pte.pte_low != ptep->pte_low))
> + goto retry;
What about using `do { ... } while (...)' instead?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 16:12 ` Peter Zijlstra
2008-04-17 16:18 ` Linus Torvalds
@ 2008-04-18 9:58 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 38+ messages in thread
From: Jeremy Fitzhardinge @ 2008-04-18 9:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Nick Piggin, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar
Peter Zijlstra wrote:
> On Thu, 2008-04-17 at 08:25 -0700, Linus Torvalds wrote:
>
>> On Thu, 17 Apr 2008, Peter Zijlstra wrote:
>>
>>> Would this be sufficient to address that comment's concern?
>>>
>> It would be nicer to just add a "native_get_pte()" to x86, to match the
>> already-existing "native_set_pte()".
>>
>
> See, I _knew_ I was missing something obvious :-/
>
>
>> And that "barrier()" should be "smp_rmb()". They may be the same code
>> sequence, but from a conceptual angle, "smp_rmb()" makes a whole lot more
>> sense.
>>
>> Finally, I don't think that comment is correct in the first place. It's
>> not that simple. The thing is, even *with* the memory barrier in place, we
>> may have:
>>
>> CPU#1 CPU#2
>> ===== =====
>>
>> fast_gup:
>> - read low word
>>
>> native_set_pte_present:
>> - set low word to 0
>> - set high word to new value
>>
>> - read high word
>>
>> - set low word to new value
>>
>> and so you read a low word that is associated with a *different* high
>> word! Notice?
>>
>> So trivial memory ordering is _not_ enough.
>>
>> So I think the code literally needs to be something like this
>>
>> #ifdef CONFIG_X86_PAE
>>
>> static inline pte_t native_get_pte(pte_t *ptep)
>> {
>> pte_t pte;
>>
>> retry:
>> pte.pte_low = ptep->pte_low;
>> smp_rmb();
>> pte.pte_high = ptep->pte_high;
>> smp_rmb();
>> if (unlikely(pte.pte_low != ptep->pte_low))
>> goto retry;
>> return pte;
>> }
>>
>> #else
>>
>> #define native_get_pte(ptep) (*(ptep))
>>
>> #endif
>>
>> but I have admittedly not really thought it fully through.
>>
>
> Looks sane here; Clark can you give this a spin?
>
> Jeremy, did I get the paravirt stuff right?
>
You shouldn't need to do anything special for paravirt. set_pte is
necessary because it may have side-effects (like a hypervisor call), but
get_pte should be side-effect free. There's no other need for it; any
special processing on the pte value itself is done in pte_val().
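Schematically, the asymmetry being described looks like this (a simplified
sketch only; the struct and field names are illustrative, not the real
paravirt_ops interface):

	struct pv_mmu_ops_sketch {
		/* writes can have side effects (e.g. a hypervisor call),
		 * so they must go through a hook: */
		void		(*set_pte)(pte_t *ptep, pte_t pte);
		/* reads are plain loads; any decoding of the raw bits
		 * happens only when the value is inspected: */
		pteval_t	(*pte_val)(pte_t pte);
	};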
J
* Re: [patch 2/2]: introduce fast_gup
2008-04-18 6:31 ` Geert Uytterhoeven
@ 2008-04-18 14:40 ` Linus Torvalds
0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-04-18 14:40 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Peter Zijlstra, Nick Piggin, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar,
Jeremy Fitzhardinge
On Fri, 18 Apr 2008, Geert Uytterhoeven wrote:
> On Thu, 17 Apr 2008, Peter Zijlstra wrote:
> > +retry:
> > + pte.pte_low = ptep->pte_low;
> > + smp_rmb();
> > + pte.pte_high = ptep->pte_high;
> > + smp_rmb();
> > + if (unlikely(pte.pte_low != ptep->pte_low))
> > + goto retry;
>
> What about using `do { ... } while (...)' instead?
Partly because it's not a loop. It's an error and retry event, and it
generally happens zero times (and sometimes once). I don't think it should
even _look_ like a loop.
So personally, I just tend to think that it's more readable when it's
written as being obviously not a regular loop. I'm not in the "gotos are
harmful" camp.
And partly because I tend to distrust loops for these things: historically
gcc sometimes did stupid things with loops (it assumed that
loops were hot and tried to align them etc, and the same didn't happen for
branch targets). Of course, these days I suspect gcc can't even tell the
difference any more (it probably turns it into the same internal
representation), but old habits die hard.
Linus
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 15:25 ` Linus Torvalds
2008-04-17 16:12 ` Peter Zijlstra
@ 2008-04-21 12:00 ` Avi Kivity
2008-04-21 12:30 ` Peter Zijlstra
1 sibling, 1 reply; 38+ messages in thread
From: Avi Kivity @ 2008-04-21 12:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Nick Piggin, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar
Linus Torvalds wrote:
> Finally, I don't think that comment is correct in the first place. It's
> not that simple. The thing is, even *with* the memory barrier in place, we
> may have:
>
> CPU#1 CPU#2
> ===== =====
>
> fast_gup:
> - read low word
>
> native_set_pte_present:
> - set low word to 0
> - set high word to new value
>
> - read high word
>
> - set low word to new value
>
> and so you read a low word that is associated with a *different* high
> word! Notice?
>
> So trivial memory ordering is _not_ enough.
>
> So I think the code literally needs to be something like this
>
> #ifdef CONFIG_X86_PAE
>
> static inline pte_t native_get_pte(pte_t *ptep)
> {
> pte_t pte;
>
> retry:
> pte.pte_low = ptep->pte_low;
> smp_rmb();
> pte.pte_high = ptep->pte_high;
> smp_rmb();
> if (unlikely(pte.pte_low != ptep->pte_low))
> goto retry;
> return pte;
> }
>
>
I think this is still broken. Suppose that after reading pte_high
native_set_pte() is called again on another cpu, changing pte_low back
to the original value (but with a different pte_high). You now have
pte_low from second native_set_pte() but pte_high from the first
native_set_pte().
You could use cmpxchg8b to atomically load the pte, but at the expense
of taking the cacheline for exclusive access.
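The trick here is that cmpxchg8b with old == new == 0 never changes the
pte's value (the store only happens if the pte was already all-zero, and
then it stores zero), yet it always returns the current 64-bit contents
atomically. A minimal sketch, assuming the kernel's cmpxchg64() and
native_make_pte() helpers:

	static inline pte_t atomic_pte_read(pte_t *ptep)
	{
		/* returns the full 64-bit pte in one atomic operation,
		 * at the cost of taking the cacheline exclusive */
		return native_make_pte(cmpxchg64((u64 *)ptep, 0ULL, 0ULL));
	}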
--
error compiling committee.c: too many arguments to function
* Re: [patch 2/2]: introduce fast_gup
2008-04-21 12:00 ` Avi Kivity
@ 2008-04-21 12:30 ` Peter Zijlstra
2008-04-21 13:26 ` Avi Kivity
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-04-21 12:30 UTC (permalink / raw)
To: Avi Kivity
Cc: Linus Torvalds, Nick Piggin, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar
On Mon, 2008-04-21 at 15:00 +0300, Avi Kivity wrote:
> Linus Torvalds wrote:
> > Finally, I don't think that comment is correct in the first place. It's
> > not that simple. The thing is, even *with* the memory barrier in place, we
> > may have:
> >
> > CPU#1 CPU#2
> > ===== =====
> >
> > fast_gup:
> > - read low word
> >
> > native_set_pte_present:
> > - set low word to 0
> > - set high word to new value
> >
> > - read high word
> >
> > - set low word to new value
> >
> > and so you read a low word that is associated with a *different* high
> > word! Notice?
> >
> > So trivial memory ordering is _not_ enough.
> >
> > So I think the code literally needs to be something like this
> >
> > #ifdef CONFIG_X86_PAE
> >
> > static inline pte_t native_get_pte(pte_t *ptep)
> > {
> > pte_t pte;
> >
> > retry:
> > pte.pte_low = ptep->pte_low;
> > smp_rmb();
> > pte.pte_high = ptep->pte_high;
> > smp_rmb();
> > if (unlikely(pte.pte_low != ptep->pte_low))
> > goto retry;
> > return pte;
> > }
> >
> >
>
> I think this is still broken. Suppose that after reading pte_high
> native_set_pte() is called again on another cpu, changing pte_low back
> to the original value (but with a different pte_high). You now have
> pte_low from second native_set_pte() but pte_high from the first
> native_set_pte().
I think the idea was that for user pages we only use set_pte_present()
which does the low=0 thing first.
* Re: [patch 2/2]: introduce fast_gup
2008-04-21 12:30 ` Peter Zijlstra
@ 2008-04-21 13:26 ` Avi Kivity
2008-04-21 14:35 ` Peter Zijlstra
0 siblings, 1 reply; 38+ messages in thread
From: Avi Kivity @ 2008-04-21 13:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Nick Piggin, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar
Peter Zijlstra wrote:
> On Mon, 2008-04-21 at 15:00 +0300, Avi Kivity wrote:
>
>> Linus Torvalds wrote:
>>
>>> Finally, I don't think that comment is correct in the first place. It's
>>> not that simple. The thing is, even *with* the memory barrier in place, we
>>> may have:
>>>
>>> CPU#1 CPU#2
>>> ===== =====
>>>
>>> fast_gup:
>>> - read low word
>>>
>>> native_set_pte_present:
>>> - set low word to 0
>>> - set high word to new value
>>>
>>> - read high word
>>>
>>> - set low word to new value
>>>
>>> and so you read a low word that is associated with a *different* high
>>> word! Notice?
>>>
>>> So trivial memory ordering is _not_ enough.
>>>
>>> So I think the code literally needs to be something like this
>>>
>>> #ifdef CONFIG_X86_PAE
>>>
>>> static inline pte_t native_get_pte(pte_t *ptep)
>>> {
>>> pte_t pte;
>>>
>>> retry:
>>> pte.pte_low = ptep->pte_low;
>>> smp_rmb();
>>> pte.pte_high = ptep->pte_high;
>>> smp_rmb();
>>> if (unlikely(pte.pte_low != ptep->pte_low))
>>> goto retry;
>>> return pte;
>>> }
>>>
>>>
>>>
>> I think this is still broken. Suppose that after reading pte_high
>> native_set_pte() is called again on another cpu, changing pte_low back
>> to the original value (but with a different pte_high). You now have
>> pte_low from second native_set_pte() but pte_high from the first
>> native_set_pte().
>>
>
> I think the idea was that for user pages we only use set_pte_present()
> which does the low=0 thing first.
>
Doesn't matter. The second native_set_pte() (or set_pte_present())
executes atomically:
fast_gup:
- read low word (l0)
native_set_pte_present:
- set low word to 0
- set high word to new value (h1)
- set low word to new value (l1)
- read high word (h1)
native_set_pte_present:
- set low word to 0
- set high word to new value (h2)
- set low word to new value (l2)
- re-read low word (l2)
If l2 happens to be equal to l0, then the check succeeds and we have a
splintered pte h1:l0.
--
error compiling committee.c: too many arguments to function
* Re: [patch 2/2]: introduce fast_gup
2008-04-21 13:26 ` Avi Kivity
@ 2008-04-21 14:35 ` Peter Zijlstra
2008-04-22 3:23 ` Nick Piggin
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-04-21 14:35 UTC (permalink / raw)
To: Avi Kivity
Cc: Linus Torvalds, Nick Piggin, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar, H. Peter Anvin
On Mon, 2008-04-21 at 16:26 +0300, Avi Kivity wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-04-21 at 15:00 +0300, Avi Kivity wrote:
> >
> >> Linus Torvalds wrote:
> >>
> >>> Finally, I don't think that comment is correct in the first place. It's
> >>> not that simple. The thing is, even *with* the memory barrier in place, we
> >>> may have:
> >>>
> >>> CPU#1 CPU#2
> >>> ===== =====
> >>>
> >>> fast_gup:
> >>> - read low word
> >>>
> >>> native_set_pte_present:
> >>> - set low word to 0
> >>> - set high word to new value
> >>>
> >>> - read high word
> >>>
> >>> - set low word to new value
> >>>
> >>> and so you read a low word that is associated with a *different* high
> >>> word! Notice?
> >>>
> >>> So trivial memory ordering is _not_ enough.
> >>>
> >>> So I think the code literally needs to be something like this
> >>>
> >>> #ifdef CONFIG_X86_PAE
> >>>
> >>> static inline pte_t native_get_pte(pte_t *ptep)
> >>> {
> >>> pte_t pte;
> >>>
> >>> retry:
> >>> pte.pte_low = ptep->pte_low;
> >>> smp_rmb();
> >>> pte.pte_high = ptep->pte_high;
> >>> smp_rmb();
> >>> if (unlikely(pte.pte_low != ptep->pte_low))
> >>> goto retry;
> >>> return pte;
> >>> }
> >>>
> >>>
> >>>
> >> I think this is still broken. Suppose that after reading pte_high
> >> native_set_pte() is called again on another cpu, changing pte_low back
> >> to the original value (but with a different pte_high). You now have
> >> pte_low from second native_set_pte() but pte_high from the first
> >> native_set_pte().
> >>
> >
> > I think the idea was that for user pages we only use set_pte_present()
> > which does the low=0 thing first.
> >
>
> Doesn't matter. The second native_set_pte() (or set_pte_present())
> executes atomically:
>
>
> fast_gup:
> - read low word (l0)
>
> native_set_pte_present:
> - set low word to 0
> - set high word to new value (h1)
> - set low word to new value (l1)
>
>
> - read high word (h1)
>
> native_set_pte_present:
> - set low word to 0
> - set high word to new value (h2)
> - set low word to new value (l2)
>
> - re-read low word (l2)
>
>
> If l2 happens to be equal to l0, then the check succeeds and we have a
> splintered pte h1:l0.
OK, so let's use cmpxchg8b.
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c
+++ linux-2.6/arch/x86/mm/gup.c
@@ -9,6 +9,19 @@
#include <linux/vmstat.h>
#include <asm/pgtable.h>
+#ifdef CONFIG_X86_PAE
+
+static inline pte_t native_get_pte(pte_t *ptep)
+{
+ return native_make_pte(cmpxchg64((u64 *)ptep, 0ULL, 0ULL));
+}
+
+#else
+
+#define native_get_pte(ptep) (*(ptep))
+
+#endif
+
/*
* The performance critical leaf functions are made noinline otherwise gcc
* inlines everything into a single function which results in too much
@@ -27,16 +40,7 @@ static noinline int gup_pte_range(pmd_t
ptep = pte_offset_map(&pmd, addr);
do {
- /*
- * XXX: careful. On 3-level 32-bit, the pte is 64 bits, and
- * we need to make sure we load the low word first, then the
- * high. This means _PAGE_PRESENT should be clear if the high
- * word was not valid. Currently, the C compiler can issue
- * the loads in any order, and I don't know of a wrapper
- * function that will do this properly, so it is broken on
- * 32-bit 3-level for the moment.
- */
- pte_t pte = *ptep;
+ pte_t pte = native_get_pte(ptep);
struct page *page;
if ((pte_val(pte) & mask) != result)
* Re: [patch 2/2]: introduce fast_gup
2008-04-17 18:28 ` Linus Torvalds
@ 2008-04-22 3:14 ` Nick Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2008-04-22 3:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Andrew Morton, shaggy, axboe, linux-mm,
linux-arch, Clark Williams, Ingo Molnar, Jeremy Fitzhardinge
On Thu, Apr 17, 2008 at 11:28:45AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 17 Apr 2008, Peter Zijlstra wrote:
> >
> > D'oh - clearly not my day today...
>
> Ok, I'm acking this one ;)
>
> And yes, it would be nice if the gup patches would go in early, since I
> wouldn't be entirely surprised if other architectures didn't have some
> other subtle issues here. We've never accessed the page tables without
> locking before, so we've only had races with hardware, never software.
Well I'd love them to go in 2.6.26. Andrew will be sending you some
precursor patches soon, and then I can rediff the x86 fast_gup.
We actually do access the page tables without the traditional linux vm
locking in architectures like powerpc that do software pagetable walks.
That's why their pagetables are RCUed for example.
So the concept is actually more foreign to x86 than it is to some others.
BTW, we do have powerpc patches for fast_gup. The "problem" with that
is that it requires my speculative page references from the lockless
pagecache patches (basically fast_gup for powerpc is exactly like
lockless pagecache but substitute the pagecache radix tree for the
page tables). But that might have to wait for 2.6.27. At least we'll
have the x86 fast_gup.
But the upshot is that I now have "real world" benchmark results to
justify adding the complexity of speculative references. After that,
adding lockless pagecache is more or less a noop ;) Everything's
falling into place, mwa ha ha.
* Re: [patch 2/2]: introduce fast_gup
2008-04-21 14:35 ` Peter Zijlstra
@ 2008-04-22 3:23 ` Nick Piggin
2008-04-22 7:19 ` Avi Kivity
2008-04-22 8:07 ` Ingo Molnar
0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2008-04-22 3:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Avi Kivity, Linus Torvalds, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar, H. Peter Anvin
On Mon, Apr 21, 2008 at 04:35:47PM +0200, Peter Zijlstra wrote:
> On Mon, 2008-04-21 at 16:26 +0300, Avi Kivity wrote:
> > Peter Zijlstra wrote:
> > > On Mon, 2008-04-21 at 15:00 +0300, Avi Kivity wrote:
> > >
> > >> Linus Torvalds wrote:
> > >>
> > >>> Finally, I don't think that comment is correct in the first place. It's
> > >>> not that simple. The thing is, even *with* the memory barrier in place, we
> > >>> may have:
> > >>>
> > >>> CPU#1 CPU#2
> > >>> ===== =====
> > >>>
> > >>> fast_gup:
> > >>> - read low word
> > >>>
> > >>> native_set_pte_present:
> > >>> - set low word to 0
> > >>> - set high word to new value
> > >>>
> > >>> - read high word
> > >>>
> > >>> - set low word to new value
> > >>>
> > >>> and so you read a low word that is associated with a *different* high
> > >>> word! Notice?
> > >>>
> > >>> So trivial memory ordering is _not_ enough.
> > >>>
> > >>> So I think the code literally needs to be something like this
> > >>>
> > >>> #ifdef CONFIG_X86_PAE
> > >>>
> > >>> static inline pte_t native_get_pte(pte_t *ptep)
> > >>> {
> > >>> pte_t pte;
> > >>>
> > >>> retry:
> > >>> pte.pte_low = ptep->pte_low;
> > >>> smp_rmb();
> > >>> pte.pte_high = ptep->pte_high;
> > >>> smp_rmb();
> > >>> if (unlikely(pte.pte_low != ptep->pte_low))
> > >>> goto retry;
> > >>> return pte;
> > >>> }
> > >>>
> > >>>
> > >>>
> > >> I think this is still broken. Suppose that after reading pte_high
> > >> native_set_pte() is called again on another cpu, changing pte_low back
> > >> to the original value (but with a different pte_high). You now have
> > >> pte_low from second native_set_pte() but pte_high from the first
> > >> native_set_pte().
> > >>
> > >
> > > I think the idea was that for user pages we only use set_pte_present()
> > > which does the low=0 thing first.
> > >
> >
> > Doesn't matter. The second native_set_pte() (or set_pte_present())
> > executes atomically:
> >
> >
> > fast_gup:
> > - read low word (l0)
> >
> > native_set_pte_present:
> > - set low word to 0
> > - set high word to new value (h1)
> > - set low word to new value (l1)
> >
> >
> > - read high word (h1)
> >
> > native_set_pte_present:
> > - set low word to 0
> > - set high word to new value (h2)
> > - set low word to new value (l2)
> >
> > - re-read low word (l2)
> >
> >
> > If l2 happens to be equal to l0, then the check succeeds and we have a
> > splintered pte h1:l0.
>
> OK, so let's use cmpxchg8b.
That's horrible ;)
Anyway guys you are missing the other side of the equation -- that whenever
_PAGE_PRESENT is cleared, all CPUs where current->mm might be == mm have to
have a tlb flush. And we're holding off tlb flushes in fast_gup, that's the
whole reason why it all works.
Indeed we do need Linus's loop, though, because it seems I wasn't thinking
of the teardown side when writing that comment (teardowns under mmu_gather
can and do set the pte to arbitrary values before the IPI goes out -- but
we can be sure those values will never contain _PAGE_PRESENT).
Linus's loop I will use for PAE. I'd love to know whether the hardware
walker actually does an atomic 64-bit load or not, though.
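In skeleton form, the mechanism being described is roughly this (a sketch
assuming an IPI-based TLB flush; gup_pgd_range() is a hypothetical
stand-in for the actual page-table walk):

	int fast_gup(unsigned long start, int nr_pages, int write,
			struct page **pages)
	{
		unsigned long addr = start & PAGE_MASK;
		unsigned long end = addr + nr_pages * PAGE_SIZE;
		unsigned long flags;
		int nr = 0;

		/*
		 * With interrupts off, this CPU cannot acknowledge the
		 * TLB-flush IPI, so a remote munmap/teardown cannot
		 * complete and free pages out from under the walk: any
		 * pte seen with _PAGE_PRESENT set is safe to refcount.
		 */
		local_irq_save(flags);
		gup_pgd_range(current->mm, addr, end, write, pages, &nr);
		local_irq_restore(flags);

		return nr;
	}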
* Re: [patch 2/2]: introduce fast_gup
2008-04-22 3:23 ` Nick Piggin
@ 2008-04-22 7:19 ` Avi Kivity
2008-04-22 8:07 ` Ingo Molnar
1 sibling, 0 replies; 38+ messages in thread
From: Avi Kivity @ 2008-04-22 7:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, shaggy, axboe,
linux-mm, linux-arch, Clark Williams, Ingo Molnar, H. Peter Anvin
Nick Piggin wrote:
> That's horrible ;)
>
> Anyway guys you are missing the other side of the equation -- that whenever
> _PAGE_PRESENT is cleared, all CPUs where current->mm might be == mm have to
> have a tlb flush. And we're holding off tlb flushes in fast_gup, that's the
> whole reason why it all works.
>
>
Ah, clever. Thanks.
--
error compiling committee.c: too many arguments to function
* Re: [patch 2/2]: introduce fast_gup
2008-04-22 3:23 ` Nick Piggin
2008-04-22 7:19 ` Avi Kivity
@ 2008-04-22 8:07 ` Ingo Molnar
1 sibling, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2008-04-22 8:07 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Avi Kivity, Linus Torvalds, Andrew Morton, shaggy,
axboe, linux-mm, linux-arch, Clark Williams, H. Peter Anvin,
Thomas Gleixner
* Nick Piggin <npiggin@suse.de> wrote:
> Linus's loop I will use for PAE. I'd love to know whether the hardware
> walker actually does an atomic 64-bit load or not, though.
all x86 natural accesses (done by instructions) are MESI-atomic as long
as they lie on a natural word boundary (which they do in the PTE case).
While the hardware walker is not an instruction, it would be highly
unusual (and i'd claim, inherently broken) for the hardware walker to
fetch a 64-bit pte value via two 32-bit accesses from two different
versions of the same cacheline.
Ingo
* Re: [patch 2/2]: introduce fast_gup
2008-03-28 3:00 ` [patch 2/2]: introduce fast_gup Nick Piggin
2008-03-28 10:01 ` Jens Axboe
2008-04-17 15:03 ` Peter Zijlstra
@ 2008-04-22 9:42 ` Peter Zijlstra
2008-04-22 9:46 ` Nick Piggin
2 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2008-04-22 9:42 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, shaggy, axboe, linux-mm, linux-arch, torvalds
On Fri, 2008-03-28 at 04:00 +0100, Nick Piggin wrote:
> +static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
> + unsigned long end, int write, struct page **pages, int *nr)
> +{
> + unsigned long mask, result;
> + pte_t *ptep;
> +
> + result = _PAGE_PRESENT|_PAGE_USER;
> + if (write)
> + result |= _PAGE_RW;
> + mask = result | _PAGE_SPECIAL;
> +
> + ptep = pte_offset_map(&pmd, addr);
> + do {
> + /*
> + * XXX: careful. On 3-level 32-bit, the pte is 64 bits, and
> + * we need to make sure we load the low word first, then the
> + * high. This means _PAGE_PRESENT should be clear if the high
> + * word was not valid. Currently, the C compiler can issue
> + * the loads in any order, and I don't know of a wrapper
> + * function that will do this properly, so it is broken on
> + * 32-bit 3-level for the moment.
> + */
> + pte_t pte = *ptep;
> + struct page *page;
> +
> + if ((pte_val(pte) & mask) != result)
> + return 0;
This return path fails to unmap the pmd.
> + VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> + page = pte_page(pte);
> + get_page(page);
> + pages[*nr] = page;
> + (*nr)++;
> +
> + } while (ptep++, addr += PAGE_SIZE, addr != end);
> + pte_unmap(ptep - 1);
> +
> + return 1;
> +}
* Re: [patch 2/2]: introduce fast_gup
2008-04-22 9:42 ` Peter Zijlstra
@ 2008-04-22 9:46 ` Nick Piggin
2008-05-14 18:33 ` Dave Kleikamp
0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2008-04-22 9:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, shaggy, axboe, linux-mm, linux-arch, torvalds
On Tue, Apr 22, 2008 at 11:42:36AM +0200, Peter Zijlstra wrote:
> On Fri, 2008-03-28 at 04:00 +0100, Nick Piggin wrote:
>
> > +static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
> > + unsigned long end, int write, struct page **pages, int *nr)
> > +{
> > + unsigned long mask, result;
> > + pte_t *ptep;
> > +
> > + result = _PAGE_PRESENT|_PAGE_USER;
> > + if (write)
> > + result |= _PAGE_RW;
> > + mask = result | _PAGE_SPECIAL;
> > +
> > + ptep = pte_offset_map(&pmd, addr);
> > + do {
> > + /*
> > + * XXX: careful. On 3-level 32-bit, the pte is 64 bits, and
> > + * we need to make sure we load the low word first, then the
> > + * high. This means _PAGE_PRESENT should be clear if the high
> > + * word was not valid. Currently, the C compiler can issue
> > + * the loads in any order, and I don't know of a wrapper
> > + * function that will do this properly, so it is broken on
> > + * 32-bit 3-level for the moment.
> > + */
> > + pte_t pte = *ptep;
> > + struct page *page;
> > +
> > + if ((pte_val(pte) & mask) != result)
> > + return 0;
>
> This return path fails to unmap the pmd.
Ah good catch. As you can see I haven't done any highmem testing ;)
Which I will do before sending upstream.
* Re: [patch 2/2]: introduce fast_gup
2008-04-22 9:46 ` Nick Piggin
@ 2008-05-14 18:33 ` Dave Kleikamp
2008-05-15 1:13 ` Nick Piggin
0 siblings, 1 reply; 38+ messages in thread
From: Dave Kleikamp @ 2008-05-14 18:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Andrew Morton, axboe, linux-mm, linux-arch,
torvalds
On Tue, 2008-04-22 at 11:46 +0200, Nick Piggin wrote:
> On Tue, Apr 22, 2008 at 11:42:36AM +0200, Peter Zijlstra wrote:
> > On Fri, 2008-03-28 at 04:00 +0100, Nick Piggin wrote:
> >
> > > +static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
> > > + unsigned long end, int write, struct page **pages, int *nr)
> > > +{
> > > + unsigned long mask, result;
> > > + pte_t *ptep;
> > > +
> > > + result = _PAGE_PRESENT|_PAGE_USER;
> > > + if (write)
> > > + result |= _PAGE_RW;
> > > + mask = result | _PAGE_SPECIAL;
> > > +
> > > + ptep = pte_offset_map(&pmd, addr);
> > > + do {
> > > + /*
> > > + * XXX: careful. On 3-level 32-bit, the pte is 64 bits, and
> > > + * we need to make sure we load the low word first, then the
> > > + * high. This means _PAGE_PRESENT should be clear if the high
> > > + * word was not valid. Currently, the C compiler can issue
> > > + * the loads in any order, and I don't know of a wrapper
> > > + * function that will do this properly, so it is broken on
> > > + * 32-bit 3-level for the moment.
> > > + */
> > > + pte_t pte = *ptep;
> > > + struct page *page;
> > > +
> > > + if ((pte_val(pte) & mask) != result)
> > > + return 0;
> >
> > This return path fails to unmap the pmd.
>
> Ah good catch. As you can see I haven't done any highmem testing ;)
> Which I will do before sending upstream.
Which will be when? We'd really like to see this in mainline as soon as
possible and in -mm in the meantime.
Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center
* Re: [patch 2/2]: introduce fast_gup
2008-05-14 18:33 ` Dave Kleikamp
@ 2008-05-15 1:13 ` Nick Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2008-05-15 1:13 UTC (permalink / raw)
To: Dave Kleikamp
Cc: Peter Zijlstra, Andrew Morton, axboe, linux-mm, linux-arch,
torvalds
On Wed, May 14, 2008 at 01:33:14PM -0500, Dave Kleikamp wrote:
> > Ah good catch. As you can see I haven't done any highmem testing ;)
> > Which I will do before sending upstream.
>
> Which will be when? We'd really like to see this in mainline as soon as
> possible and in -mm in the meanwhile.
Well I just got all the "hard" core mm stuff past Linus in this merge
window, and got a couple of preexisting memory ordering bugs fixed.
So I am planning to get it into -mm, ready for the next merge window.
I'm a little concerned about Peter's instability reports, but maybe
they're just an -rt thing. But I don't think we've found any holes in
fast_gup yet (although I think I need to add one last check to ensure
it won't pick up kernel addresses).
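That last check would presumably look something like this (a sketch with a
hypothetical helper; the real test might go through access_ok() instead):

	static int gup_range_ok(unsigned long start, unsigned long end)
	{
		/* reject wrapped ranges and anything reaching into
		 * kernel space; TASK_SIZE is the user address limit */
		return end >= start && end <= TASK_SIZE;
	}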
Thanks,
Nick
* [patch 1/2] x86: implement pte_special
@ 2008-05-21 11:59 Nick Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2008-05-21 11:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, axboe, torvalds, linux-mm, linux-arch
Implement the pte_special bit for x86. This is required to support lockless
get_user_pages, because we need to know whether or not we can refcount a
particular page given only its pte (and no vma).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: axboe@oracle.com
Cc: torvalds@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
---
Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h
+++ linux-2.6/include/asm-x86/pgtable.h
@@ -17,6 +17,7 @@
#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_UNUSED3 11
+#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
@@ -39,6 +40,8 @@
#define _PAGE_UNUSED3 (_AC(1, L)<<_PAGE_BIT_UNUSED3)
#define _PAGE_PAT (_AC(1, L)<<_PAGE_BIT_PAT)
#define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
+#define _PAGE_SPECIAL (_AC(1, L)<<_PAGE_BIT_SPECIAL)
+#define __HAVE_ARCH_PTE_SPECIAL
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AC(1, ULL) << _PAGE_BIT_NX)
@@ -198,7 +201,7 @@ static inline int pte_exec(pte_t pte)
static inline int pte_special(pte_t pte)
{
- return 0;
+ return pte_val(pte) & _PAGE_SPECIAL;
}
static inline int pmd_large(pmd_t pte)
@@ -264,7 +267,7 @@ static inline pte_t pte_clrglobal(pte_t
static inline pte_t pte_mkspecial(pte_t pte)
{
- return pte;
+ return __pte(pte_val(pte) | _PAGE_SPECIAL);
}
extern pteval_t __supported_pte_mask;
* [patch 1/2] x86: implement pte_special
@ 2008-05-25 14:48 Nick Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2008-05-25 14:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: shaggy, jens.axboe, torvalds, linux-mm, linux-arch, apw
Implement the pte_special bit for x86. This is required to support lockless
get_user_pages, because we need to know whether or not we can refcount a
particular page given only its pte (and no vma).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h
+++ linux-2.6/include/asm-x86/pgtable.h
@@ -17,6 +17,7 @@
#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_UNUSED3 11
+#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
@@ -39,6 +40,8 @@
#define _PAGE_UNUSED3 (_AC(1, L)<<_PAGE_BIT_UNUSED3)
#define _PAGE_PAT (_AC(1, L)<<_PAGE_BIT_PAT)
#define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
+#define _PAGE_SPECIAL (_AC(1, L)<<_PAGE_BIT_SPECIAL)
+#define __HAVE_ARCH_PTE_SPECIAL
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AC(1, ULL) << _PAGE_BIT_NX)
@@ -199,7 +202,7 @@ static inline int pte_exec(pte_t pte)
static inline int pte_special(pte_t pte)
{
- return 0;
+ return pte_val(pte) & _PAGE_SPECIAL;
}
static inline int pmd_large(pmd_t pte)
@@ -265,7 +268,7 @@ static inline pte_t pte_clrglobal(pte_t
static inline pte_t pte_mkspecial(pte_t pte)
{
- return pte;
+ return __pte(pte_val(pte) | _PAGE_SPECIAL);
}
extern pteval_t __supported_pte_mask;
end of thread, other threads:[~2008-05-25 14:48 UTC | newest]
Thread overview: 38+ messages
2008-03-28 2:54 [patch 0/2]: lockless get_user_pages patchset Nick Piggin
2008-03-28 2:55 ` [patch 1/2]: x86: implement pte_special Nick Piggin
2008-03-28 3:23 ` David Miller, Nick Piggin
2008-03-28 3:31 ` Nick Piggin
2008-03-28 3:44 ` David Miller, Nick Piggin
2008-03-28 4:04 ` Nick Piggin
2008-03-28 4:09 ` David Miller, Nick Piggin
2008-03-28 4:15 ` Nick Piggin
2008-03-28 4:16 ` David Miller, Nick Piggin
2008-03-28 4:19 ` Nick Piggin
2008-03-28 4:17 ` Nick Piggin
2008-03-28 3:00 ` [patch 2/2]: introduce fast_gup Nick Piggin
2008-03-28 10:01 ` Jens Axboe
2008-04-17 15:03 ` Peter Zijlstra
2008-04-17 15:25 ` Linus Torvalds
2008-04-17 16:12 ` Peter Zijlstra
2008-04-17 16:18 ` Linus Torvalds
2008-04-17 16:35 ` Peter Zijlstra
2008-04-17 16:40 ` Linus Torvalds
2008-04-17 17:23 ` Peter Zijlstra
2008-04-17 18:28 ` Linus Torvalds
2008-04-22 3:14 ` Nick Piggin
2008-04-18 6:31 ` Geert Uytterhoeven
2008-04-18 14:40 ` Linus Torvalds
2008-04-18 9:58 ` Jeremy Fitzhardinge
2008-04-21 12:00 ` Avi Kivity
2008-04-21 12:30 ` Peter Zijlstra
2008-04-21 13:26 ` Avi Kivity
2008-04-21 14:35 ` Peter Zijlstra
2008-04-22 3:23 ` Nick Piggin
2008-04-22 7:19 ` Avi Kivity
2008-04-22 8:07 ` Ingo Molnar
2008-04-22 9:42 ` Peter Zijlstra
2008-04-22 9:46 ` Nick Piggin
2008-05-14 18:33 ` Dave Kleikamp
2008-05-15 1:13 ` Nick Piggin
-- strict thread matches above, loose matches on Subject: below --
2008-05-21 11:59 [patch 1/2] x86: implement pte_special Nick Piggin
2008-05-25 14:48 Nick Piggin