[patch 1/5] x86: implement pte

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [patch 1/5] x86: implement pte_special
       [not found] <20080529122050.823438000@nick.local0.net>
@ 2008-05-29 12:20 ` npiggin
  2008-06-02 23:58   ` Andrew Morton
  2008-06-06 21:35   ` Peter Zijlstra
  2008-05-29 12:20 ` [patch 2/5] mm: introduce get_user_pages_fast npiggin
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: npiggin @ 2008-05-29 12:20 UTC (permalink / raw)
  To: akpm; +Cc: shaggy, linux-mm, linux-arch, apw

[-- Attachment #1: x86-implement-pte_special.patch --]
[-- Type: text/plain, Size: 1985 bytes --]

Implement the pte_special bit for x86. This is required to support lockless
get_user_pages, because we need to know whether or not we can refcount a
particular page given only its pte (and no vma).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: apw@shadowen.org

---
 include/asm-x86/pgtable.h |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/include/asm-x86/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable.h
+++ linux-2.6/include/asm-x86/pgtable.h
@@ -17,6 +17,7 @@
 #define _PAGE_BIT_UNUSED1	9	/* available for programmer */
 #define _PAGE_BIT_UNUSED2	10
 #define _PAGE_BIT_UNUSED3	11
+#define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
@@ -39,6 +40,8 @@
 #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
 #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
 #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
+#define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
+#define __HAVE_ARCH_PTE_SPECIAL
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
@@ -199,7 +202,7 @@ static inline int pte_exec(pte_t pte)
 
 static inline int pte_special(pte_t pte)
 {
-	return 0;
+	return pte_val(pte) & _PAGE_SPECIAL;
 }
 
 static inline int pmd_large(pmd_t pte)
@@ -265,7 +268,7 @@ static inline pte_t pte_clrglobal(pte_t 
 
 static inline pte_t pte_mkspecial(pte_t pte)
 {
-	return pte;
+	return __pte(pte_val(pte) | _PAGE_SPECIAL);
 }
 
 extern pteval_t __supported_pte_mask;

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 2/5] mm: introduce get_user_pages_fast
       [not found] <20080529122050.823438000@nick.local0.net>
  2008-05-29 12:20 ` [patch 1/5] x86: implement pte_special npiggin
@ 2008-05-29 12:20 ` npiggin
  2008-06-09 10:29   ` Andrew Morton
  2008-05-29 12:20 ` [patch 3/5] x86: lockless get_user_pages_fast npiggin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: npiggin @ 2008-05-29 12:20 UTC (permalink / raw)
  To: akpm; +Cc: shaggy, linux-mm, linux-arch, apw

[-- Attachment #1: mm-get_user_pages-fast.patch --]
[-- Type: text/plain, Size: 3145 bytes --]

Introduce a new get_user_pages_fast mm API, which is basically a get_user_pages
with a less general API (but still tends to be suited to the common case):

- task and mm are always current and current->mm
- force is always 0
- pages is always non-NULL
- don't pass back vmas

This restricted API can be implemented in a much more scalable way on
many architectures when the ptes are present, by walking the page tables
locklessly (no mmap_sem or page table locks). When the ptes are not
populated, get_user_pages_fast() could be slower.

This is implemented locklessly on x86, and used in some key direct IO call
sites, in later patches, which provides nearly 10% performance improvement
on a threaded database workload.

Lots of other code could use this too, depending on use cases (eg. grep
drivers/). And it might inspire some new and clever ways to use it.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: apw@shadowen.org

---
 include/linux/mm.h |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -12,6 +12,7 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
+#include <linux/uaccess.h> /* for __HAVE_ARCH_GET_USER_PAGES_FAST */
 
 struct mempolicy;
 struct anon_vma;
@@ -830,6 +831,38 @@ extern int mprotect_fixup(struct vm_area
 			  struct vm_area_struct **pprev, unsigned long start,
 			  unsigned long end, unsigned long newflags);
 
+#ifdef __HAVE_ARCH_GET_USER_PAGES_FAST
+/*
+ * get_user_pages_fast provides equivalent functionality to get_user_pages,
+ * operating on current and current->mm (force=0 and doesn't return any vmas).
+ *
+ * get_user_pages_fast may take mmap_sem and page tables, so no assumptions
+ * can be made about locking. get_user_pages_fast is to be implemented in a
+ * way that is advantageous (vs get_user_pages()) when the user memory area is
+ * already faulted in and present in ptes. However if the pages have to be
+ * faulted in, it may turn out to be slightly slower).
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages);
+
+#else
+/*
+ * Should probably be moved to asm-generic, and architectures can include it if
+ * they don't implement their own get_user_pages_fast.
+ */
+#define get_user_pages_fast(start, nr_pages, write, pages)	\
+({								\
+	struct mm_struct *mm = current->mm;			\
+	int ret;						\
+								\
+	down_read(&mm->mmap_sem);				\
+	ret = get_user_pages(current, mm, start, nr_pages,	\
+					write, 0, pages, NULL);	\
+	up_read(&mm->mmap_sem);					\
+								\
+	ret;							\
+})
+#endif
+
 /*
  * A callback you can register to apply pressure to ageable caches.
  *

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 3/5] x86: lockless get_user_pages_fast
       [not found] <20080529122050.823438000@nick.local0.net>
  2008-05-29 12:20 ` [patch 1/5] x86: implement pte_special npiggin
  2008-05-29 12:20 ` [patch 2/5] mm: introduce get_user_pages_fast npiggin
@ 2008-05-29 12:20 ` npiggin
  2008-05-29 17:20   ` Dave Kleikamp
  2008-05-29 12:20 ` [patch 4/5] dio: use get_user_pages_fast npiggin
  2008-05-29 12:20 ` [patch 5/5] splice: " npiggin
  4 siblings, 1 reply; 17+ messages in thread
From: npiggin @ 2008-05-29 12:20 UTC (permalink / raw)
  To: akpm; +Cc: shaggy, linux-mm, linux-arch, apw

[-- Attachment #1: x86-lockless-get_user_pages_fast.patch --]
[-- Type: text/plain, Size: 10554 bytes --]

Implement get_user_pages_fast without locking in the fastpath on x86.

Do an optimistic lockless pagetable walk, without taking mmap_sem or any page
table locks or even mmap_sem. Page table existence is guaranteed by turning
interrupts off (combined with the fact that we're always looking up the current
mm, means we can do the lockless page table walk within the constraints of the
TLB shootdown design). Basically we can do this lockless pagetable walk in a
similar manner to the way the CPU's pagetable walker does not have to take any
locks to find present ptes.

This patch (combined with the subsequent ones to convert direct IO to use it)
was found to give about 10% performance improvement on a 2 socket 8 core Intel
Xeon system running an OLTP workload on DB2 v9.5

 "To test the effects of the patch, an OLTP workload was run on an IBM
 x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
 2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
 runs with and without the patch resulted in an overall performance
 benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
 __up_read and __down_read routines that is seen during thread contention
 for system resources was reduced from 2.8% down to .05%. Monitoring
 the /proc/vmstat output from the patched run showed that the counter for
 fast_gup contained a very high number while the fast_gup_slow value was
 zero."

(fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a counter
we had for the number of times the slowpath was invoked).

The main reason for the improvement is that DB2 has multiple threads each
issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
contend the mmap_sem cacheline, and can also contend on page table locks.

I would anticipate larger performance gains on larger systems, however I
think DB2 uses an adaptive mix of threads and processes, so it could be
that thread contention remains pretty constant as machine size increases.
In which case, we stuck with "only" a 10% gain.

The downside of using get_user_pages_fast is that if there is not a pte with
the correct permissions for the access, we end up falling back to
get_user_pages and so the get_user_pages_fast is a bit of extra work. However
this should not be the common case in most performance critical code.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: apw@shadowen.org

---
 arch/x86/mm/Makefile      |    2 
 arch/x86/mm/gup.c         |  254 ++++++++++++++++++++++++++++++++++++++++++++++
 include/asm-x86/uaccess.h |    3 
 3 files changed, 258 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/Makefile
===================================================================
--- linux-2.6.orig/arch/x86/mm/Makefile
+++ linux-2.6/arch/x86/mm/Makefile
@@ -1,5 +1,5 @@
 obj-y	:=  init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o
+	    pat.o pgtable.o gup.o
 
 obj-$(CONFIG_X86_32)		+= pgtable_32.o
 
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/mm/gup.c
@@ -0,0 +1,254 @@
+/*
+ * Lockless get_user_pages_fast for x86
+ *
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <asm/pgtable.h>
+
+static inline pte_t gup_get_pte(pte_t *ptep)
+{
+#ifndef CONFIG_X86_PAE
+	return *ptep;
+#else
+	/*
+	 * With get_user_pages_fast, we walk down the pagetables without taking
+	 * any locks.  For this we would like to load the pointers atoimcally,
+	 * but that is not possible (without expensive cmpxchg8b) on PAE.  What
+	 * we do have is the guarantee that a pte will only either go from not
+	 * present to present, or present to not present or both -- it will not
+	 * switch to a completely different present page without a TLB flush in
+	 * between; something that we are blocking by holding interrupts off.
+	 *
+	 * Setting ptes from not present to present goes:
+	 * ptep->pte_high = h;
+	 * smp_wmb();
+	 * ptep->pte_low = l;
+	 *
+	 * And present to not present goes:
+	 * ptep->pte_low = 0;
+	 * smp_wmb();
+	 * ptep->pte_high = 0;
+	 *
+	 * We must ensure here that the load of pte_low sees l iff pte_high
+	 * sees h. We load pte_high *after* loading pte_low, which ensures we
+	 * don't see an older value of pte_high.  *Then* we recheck pte_low,
+	 * which ensures that we haven't picked up a changed pte high. We might
+	 * have got rubbish values from pte_low and pte_high, but we are
+	 * guaranteed that pte_low will not have the present bit set *unless*
+	 * it is 'l'. And get_user_pages_fast only operates on present ptes, so
+	 * we're safe.
+	 *
+	 * gup_get_pte should not be used or copied outside gup.c without being
+	 * very careful -- it does not atomically load the pte or anything that
+	 * is likely to be useful for you.
+	 */
+	pte_t pte;
+
+retry:
+	pte.pte_low = ptep->pte_low;
+	smp_rmb();
+	pte.pte_high = ptep->pte_high;
+	smp_rmb();
+	if (unlikely(pte.pte_low != ptep->pte_low))
+		goto retry;
+
+	return pte;
+#endif
+}
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	pte_t *ptep;
+
+	mask = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+
+	ptep = pte_offset_map(&pmd, addr);
+	do {
+		pte_t pte = gup_get_pte(ptep);
+		struct page *page;
+
+		if ((pte_val(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+			pte_unmap(ptep);
+			return 0;
+		}
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+		get_page(page);
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(ptep - 1);
+
+	return 1;
+}
+
+static inline void get_head_page_multiple(struct page *page, int nr)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_add(nr, &page->_count);
+}
+
+static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	pte_t pte = *(pte_t *)&pmd;
+	struct page *head, *page;
+	int refs;
+
+	mask = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+	/* hugepages are never "special" */
+	VM_BUG_ON(pte_val(pte) & _PAGE_SPECIAL);
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+	page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+	get_head_page_multiple(head, refs);
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = *pmdp;
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd))
+			return 0;
+		if (unlikely(pmd_large(pmd))) {
+			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(&pgd, addr);
+	do {
+		pud_t pud = *pudp;
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long end = start + (nr_pages << PAGE_SHIFT);
+	unsigned long addr = start;
+	unsigned long next;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, nr_pages*PAGE_SIZE)))
+		goto slow_irqon;
+
+	/*
+	 * XXX: batch / limit 'nr', to avoid large irq off latency
+	 * needs some instrumenting to determine the common sizes used by
+	 * important workloads (eg. DB2), and whether limiting the batch size
+	 * will decrease performance.
+	 *
+	 * It seems like we're in the clear for the moment. Direct-IO is
+	 * the main guy that batches up lots of get_user_pages, and even
+	 * they are limited to 64-at-a-time which is not so many.
+	 */
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables and pages from being freed on x86.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86, with the above PAE exception), we can follow the
+	 * address down to the the page and take a ref on it.
+	 */
+	local_irq_disable();
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_enable();
+
+	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+	return nr;
+
+	{
+		int i, ret;
+
+slow:
+		local_irq_enable();
+slow_irqon:
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pgaes += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+
+		return ret;
+	}
+}
Index: linux-2.6/include/asm-x86/uaccess.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess.h
+++ linux-2.6/include/asm-x86/uaccess.h
@@ -3,3 +3,6 @@
 #else
 # include "uaccess_64.h"
 #endif
+
+#define __HAVE_ARCH_GET_USER_PAGES_FAST
+

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 4/5] dio: use get_user_pages_fast
       [not found] <20080529122050.823438000@nick.local0.net>
                   ` (2 preceding siblings ...)
  2008-05-29 12:20 ` [patch 3/5] x86: lockless get_user_pages_fast npiggin
@ 2008-05-29 12:20 ` npiggin
  2008-05-29 12:20 ` [patch 5/5] splice: " npiggin
  4 siblings, 0 replies; 17+ messages in thread
From: npiggin @ 2008-05-29 12:20 UTC (permalink / raw)
  To: akpm; +Cc: shaggy, linux-mm, linux-arch, apw

[-- Attachment #1: dio-get_user_pages_fast.patch --]
[-- Type: text/plain, Size: 2105 bytes --]

Use get_user_pages_fast in the common/generic block and fs direct IO paths.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: apw@shadowen.org

---
 fs/bio.c       |    8 ++------
 fs/direct-io.c |   10 ++--------
 2 files changed, 4 insertions(+), 14 deletions(-)

Index: linux-2.6/fs/bio.c
===================================================================
--- linux-2.6.orig/fs/bio.c
+++ linux-2.6/fs/bio.c
@@ -713,12 +713,8 @@ static struct bio *__bio_map_user_iov(st
 		const int local_nr_pages = end - start;
 		const int page_limit = cur_page + local_nr_pages;
 		
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(current, current->mm, uaddr,
-				     local_nr_pages,
-				     write_to_vm, 0, &pages[cur_page], NULL);
-		up_read(&current->mm->mmap_sem);
-
+		ret = get_user_pages_fast(uaddr, local_nr_pages,
+				write_to_vm, &pages[cur_page]);
 		if (ret < local_nr_pages) {
 			ret = -EFAULT;
 			goto out_unmap;
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c
+++ linux-2.6/fs/direct-io.c
@@ -150,17 +150,11 @@ static int dio_refill_pages(struct dio *
 	int nr_pages;
 
 	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(
-		current,			/* Task for fault acounting */
-		current->mm,			/* whose pages? */
+	ret = get_user_pages_fast(
 		dio->curr_user_address,		/* Where from? */
 		nr_pages,			/* How many pages? */
 		dio->rw == READ,		/* Write to memory? */
-		0,				/* force (?) */
-		&dio->pages[0],
-		NULL);				/* vmas */
-	up_read(&current->mm->mmap_sem);
+		&dio->pages[0]);		/* Put results here */
 
 	if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) {
 		struct page *page = ZERO_PAGE(0);

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [patch 5/5] splice: use get_user_pages_fast
       [not found] <20080529122050.823438000@nick.local0.net>
                   ` (3 preceding siblings ...)
  2008-05-29 12:20 ` [patch 4/5] dio: use get_user_pages_fast npiggin
@ 2008-05-29 12:20 ` npiggin
  4 siblings, 0 replies; 17+ messages in thread
From: npiggin @ 2008-05-29 12:20 UTC (permalink / raw)
  To: akpm; +Cc: shaggy, linux-mm, linux-arch, apw

[-- Attachment #1: splice-get_user_pages_fast.patch --]
[-- Type: text/plain, Size: 3123 bytes --]

Use get_user_pages_fast in splice. This reverts some mmap_sem batching there,
however the biggest problem with mmap_sem tends to be hold times blocking
out other threads rather than cacheline bouncing. Further: on architectures
that implement get_user_pages_fast without locks, mmap_sem can be avoided
completely anyway.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: apw@shadowen.org

---
 fs/splice.c |   41 +++--------------------------------------
 1 file changed, 3 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -1147,36 +1147,6 @@ static long do_splice(struct file *in, l
 }
 
 /*
- * Do a copy-from-user while holding the mmap_semaphore for reading, in a
- * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem
- * for writing) and page faulting on the user memory pointed to by src.
- * This assumes that we will very rarely hit the partial != 0 path, or this
- * will not be a win.
- */
-static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n)
-{
-	int partial;
-
-	if (!access_ok(VERIFY_READ, src, n))
-		return -EFAULT;
-
-	pagefault_disable();
-	partial = __copy_from_user_inatomic(dst, src, n);
-	pagefault_enable();
-
-	/*
-	 * Didn't copy everything, drop the mmap_sem and do a faulting copy
-	 */
-	if (unlikely(partial)) {
-		up_read(&current->mm->mmap_sem);
-		partial = copy_from_user(dst, src, n);
-		down_read(&current->mm->mmap_sem);
-	}
-
-	return partial;
-}
-
-/*
  * Map an iov into an array of pages and offset/length tupples. With the
  * partial_page structure, we can map several non-contiguous ranges into
  * our ones pages[] map instead of splitting that operation into pieces.
@@ -1189,8 +1159,6 @@ static int get_iovec_page_array(const st
 {
 	int buffers = 0, error = 0;
 
-	down_read(&current->mm->mmap_sem);
-
 	while (nr_vecs) {
 		unsigned long off, npages;
 		struct iovec entry;
@@ -1199,7 +1167,7 @@ static int get_iovec_page_array(const st
 		int i;
 
 		error = -EFAULT;
-		if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry)))
+		if (copy_from_user(&entry, iov, sizeof(entry)))
 			break;
 
 		base = entry.iov_base;
@@ -1233,9 +1201,8 @@ static int get_iovec_page_array(const st
 		if (npages > PIPE_BUFFERS - buffers)
 			npages = PIPE_BUFFERS - buffers;
 
-		error = get_user_pages(current, current->mm,
-				       (unsigned long) base, npages, 0, 0,
-				       &pages[buffers], NULL);
+		error = get_user_pages_fast((unsigned long)base, npages,
+					0, &pages[buffers]);
 
 		if (unlikely(error <= 0))
 			break;
@@ -1274,8 +1241,6 @@ static int get_iovec_page_array(const st
 		iov++;
 	}
 
-	up_read(&current->mm->mmap_sem);
-
 	if (buffers)
 		return buffers;
 

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 3/5] x86: lockless get_user_pages_fast
  2008-05-29 12:20 ` [patch 3/5] x86: lockless get_user_pages_fast npiggin
@ 2008-05-29 17:20   ` Dave Kleikamp
  2008-05-30  0:55     ` Nick Piggin
  2008-06-02 10:15     ` Nick Piggin
  0 siblings, 2 replies; 17+ messages in thread
From: Dave Kleikamp @ 2008-05-29 17:20 UTC (permalink / raw)
  To: npiggin; +Cc: akpm, linux-mm, linux-arch, apw

On Thu, 2008-05-29 at 22:20 +1000, npiggin@suse.de wrote:
 
> +int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long end = start + (nr_pages << PAGE_SHIFT);
> +	unsigned long addr = start;
> +	unsigned long next;
> +	pgd_t *pgdp;
> +	int nr = 0;
> +
> +	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> +					start, nr_pages*PAGE_SIZE)))
> +		goto slow_irqon;
> +
> +	/*
> +	 * XXX: batch / limit 'nr', to avoid large irq off latency
> +	 * needs some instrumenting to determine the common sizes used by
> +	 * important workloads (eg. DB2), and whether limiting the batch size
> +	 * will decrease performance.
> +	 *
> +	 * It seems like we're in the clear for the moment. Direct-IO is
> +	 * the main guy that batches up lots of get_user_pages, and even
> +	 * they are limited to 64-at-a-time which is not so many.
> +	 */
> +	/*
> +	 * This doesn't prevent pagetable teardown, but does prevent
> +	 * the pagetables and pages from being freed on x86.
> +	 *
> +	 * So long as we atomically load page table pointers versus teardown
> +	 * (which we do on x86, with the above PAE exception), we can follow the
> +	 * address down to the the page and take a ref on it.
> +	 */
> +	local_irq_disable();
> +	pgdp = pgd_offset(mm, addr);
> +	do {
> +		pgd_t pgd = *pgdp;
> +
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(pgd))
> +			goto slow;
> +		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> +			goto slow;
> +	} while (pgdp++, addr = next, addr != end);
> +	local_irq_enable();
> +
> +	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
> +	return nr;
> +
> +	{
> +		int i, ret;
> +
> +slow:
> +		local_irq_enable();
> +slow_irqon:
> +		/* Try to get the remaining pages with get_user_pages */
> +		start += nr << PAGE_SHIFT;
> +		pgaes += nr;

Typo: s/pgaes/pages/

> +
> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);
> +
> +		/* Have to be a bit careful with return values */
> +		if (nr > 0) {
> +			if (ret < 0)
> +				ret = nr;
> +			else
> +				ret += nr;
> +		}
> +
> +		return ret;
> +	}
> +}

-- 
David Kleikamp
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 3/5] x86: lockless get_user_pages_fast
  2008-05-29 17:20   ` Dave Kleikamp
@ 2008-05-30  0:55     ` Nick Piggin
  2008-06-02 10:15     ` Nick Piggin
  1 sibling, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2008-05-30  0:55 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: akpm, linux-mm, linux-arch, apw

On Thu, May 29, 2008 at 12:20:59PM -0500, Dave Kleikamp wrote:
> On Thu, 2008-05-29 at 22:20 +1000, npiggin@suse.de wrote:
>  
> > +int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	unsigned long end = start + (nr_pages << PAGE_SHIFT);
> > +	unsigned long addr = start;
> > +	unsigned long next;
> > +	pgd_t *pgdp;
> > +	int nr = 0;
> > +
> > +	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> > +					start, nr_pages*PAGE_SIZE)))
> > +		goto slow_irqon;
> > +
> > +	/*
> > +	 * XXX: batch / limit 'nr', to avoid large irq off latency
> > +	 * needs some instrumenting to determine the common sizes used by
> > +	 * important workloads (eg. DB2), and whether limiting the batch size
> > +	 * will decrease performance.
> > +	 *
> > +	 * It seems like we're in the clear for the moment. Direct-IO is
> > +	 * the main guy that batches up lots of get_user_pages, and even
> > +	 * they are limited to 64-at-a-time which is not so many.
> > +	 */
> > +	/*
> > +	 * This doesn't prevent pagetable teardown, but does prevent
> > +	 * the pagetables and pages from being freed on x86.
> > +	 *
> > +	 * So long as we atomically load page table pointers versus teardown
> > +	 * (which we do on x86, with the above PAE exception), we can follow the
> > +	 * address down to the the page and take a ref on it.
> > +	 */
> > +	local_irq_disable();
> > +	pgdp = pgd_offset(mm, addr);
> > +	do {
> > +		pgd_t pgd = *pgdp;
> > +
> > +		next = pgd_addr_end(addr, end);
> > +		if (pgd_none(pgd))
> > +			goto slow;
> > +		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> > +			goto slow;
> > +	} while (pgdp++, addr = next, addr != end);
> > +	local_irq_enable();
> > +
> > +	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
> > +	return nr;
> > +
> > +	{
> > +		int i, ret;
> > +
> > +slow:
> > +		local_irq_enable();
> > +slow_irqon:
> > +		/* Try to get the remaining pages with get_user_pages */
> > +		start += nr << PAGE_SHIFT;
> > +		pgaes += nr;
> 
> Typo: s/pgaes/pages/

Gah, missing quilt refresh. Sorry.

I actually did stick a printk in here and manage to hit this path with
a constructed test case (and with nr ! always = 0 to boot). It seemed
to work fine.

BTW. Andy, I dropped your Reviewed-by: Andy Whitcroft <apw@shadowen.org>
because I did make a couple of these little changes that technically
you hadn't reviewed. I don't know what the exact protocol is regarding
the fluidity of RB/AB... 

---

x86: lockless get_user_pages_fast

Implement get_user_pages_fast without locking in the fastpath on x86.

Do an optimistic lockless pagetable walk, without taking mmap_sem or any page
table locks or even mmap_sem. Page table existence is guaranteed by turning
interrupts off (combined with the fact that we're always looking up the current
mm, means we can do the lockless page table walk within the constraints of the
TLB shootdown design). Basically we can do this lockless pagetable walk in a
similar manner to the way the CPU's pagetable walker does not have to take any
locks to find present ptes.

This patch (combined with the subsequent ones to convert direct IO to use it)
was found to give about 10% performance improvement on a 2 socket 8 core Intel
Xeon system running an OLTP workload on DB2 v9.5

 "To test the effects of the patch, an OLTP workload was run on an IBM
 x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
 2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
 runs with and without the patch resulted in an overall performance
 benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
 __up_read and __down_read routines that is seen during thread contention
 for system resources was reduced from 2.8% down to .05%. Monitoring
 the /proc/vmstat output from the patched run showed that the counter for
 fast_gup contained a very high number while the fast_gup_slow value was
 zero."

(fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a counter
we had for the number of times the slowpath was invoked).

The main reason for the improvement is that DB2 has multiple threads each
issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
contend the mmap_sem cacheline, and can also contend on page table locks.

I would anticipate larger performance gains on larger systems, however I
think DB2 uses an adaptive mix of threads and processes, so it could be
that thread contention remains pretty constant as machine size increases.
In which case, we stuck with "only" a 10% gain.

The downside of using get_user_pages_fast is that if there is not a pte with
the correct permissions for the access, we end up falling back to
get_user_pages and so the get_user_pages_fast is a bit of extra work. However
this should not be the common case in most performance critical code.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: shaggy@austin.ibm.com
Cc: linux-mm@kvack.org
Cc: linux-arch@vger.kernel.org
Cc: apw@shadowen.org

---
 arch/x86/mm/Makefile      |    2 
 arch/x86/mm/gup.c         |  254 ++++++++++++++++++++++++++++++++++++++++++++++
 include/asm-x86/uaccess.h |    3 
 3 files changed, 258 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/mm/Makefile
===================================================================
--- linux-2.6.orig/arch/x86/mm/Makefile
+++ linux-2.6/arch/x86/mm/Makefile
@@ -1,5 +1,5 @@
 obj-y	:=  init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o
+	    pat.o pgtable.o gup.o
 
 obj-$(CONFIG_X86_32)		+= pgtable_32.o
 
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- /dev/null
+++ linux-2.6/arch/x86/mm/gup.c
@@ -0,0 +1,254 @@
+/*
+ * Lockless get_user_pages_fast for x86
+ *
+ * Copyright (C) 2008 Nick Piggin
+ * Copyright (C) 2008 Novell Inc.
+ */
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <asm/pgtable.h>
+
+static inline pte_t gup_get_pte(pte_t *ptep)
+{
+#ifndef CONFIG_X86_PAE
+	return *ptep;
+#else
+	/*
+	 * With get_user_pages_fast, we walk down the pagetables without taking
+	 * any locks.  For this we would like to load the pointers atoimcally,
+	 * but that is not possible (without expensive cmpxchg8b) on PAE.  What
+	 * we do have is the guarantee that a pte will only either go from not
+	 * present to present, or present to not present or both -- it will not
+	 * switch to a completely different present page without a TLB flush in
+	 * between; something that we are blocking by holding interrupts off.
+	 *
+	 * Setting ptes from not present to present goes:
+	 * ptep->pte_high = h;
+	 * smp_wmb();
+	 * ptep->pte_low = l;
+	 *
+	 * And present to not present goes:
+	 * ptep->pte_low = 0;
+	 * smp_wmb();
+	 * ptep->pte_high = 0;
+	 *
+	 * We must ensure here that the load of pte_low sees l iff pte_high
+	 * sees h. We load pte_high *after* loading pte_low, which ensures we
+	 * don't see an older value of pte_high.  *Then* we recheck pte_low,
+	 * which ensures that we haven't picked up a changed pte high. We might
+	 * have got rubbish values from pte_low and pte_high, but we are
+	 * guaranteed that pte_low will not have the present bit set *unless*
+	 * it is 'l'. And get_user_pages_fast only operates on present ptes, so
+	 * we're safe.
+	 *
+	 * gup_get_pte should not be used or copied outside gup.c without being
+	 * very careful -- it does not atomically load the pte or anything that
+	 * is likely to be useful for you.
+	 */
+	pte_t pte;
+
+retry:
+	pte.pte_low = ptep->pte_low;
+	smp_rmb();
+	pte.pte_high = ptep->pte_high;
+	smp_rmb();
+	if (unlikely(pte.pte_low != ptep->pte_low))
+		goto retry;
+
+	return pte;
+#endif
+}
+
+/*
+ * The performance critical leaf functions are made noinline otherwise gcc
+ * inlines everything into a single function which results in too much
+ * register pressure.
+ */
+static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	pte_t *ptep;
+
+	mask = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+
+	ptep = pte_offset_map(&pmd, addr);
+	do {
+		pte_t pte = gup_get_pte(ptep);
+		struct page *page;
+
+		if ((pte_val(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+			pte_unmap(ptep);
+			return 0;
+		}
+		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+		page = pte_page(pte);
+		get_page(page);
+		pages[*nr] = page;
+		(*nr)++;
+
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	pte_unmap(ptep - 1);
+
+	return 1;
+}
+
+static inline void get_head_page_multiple(struct page *page, int nr)
+{
+	VM_BUG_ON(page != compound_head(page));
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_add(nr, &page->_count);
+}
+
+static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	pte_t pte = *(pte_t *)&pmd;
+	struct page *head, *page;
+	int refs;
+
+	mask = _PAGE_PRESENT|_PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+	/* hugepages are never "special" */
+	VM_BUG_ON(pte_val(pte) & _PAGE_SPECIAL);
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+	page = head + ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+	get_head_page_multiple(head, refs);
+
+	return 1;
+}
+
+static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
+		int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pmd_t *pmdp;
+
+	pmdp = pmd_offset(&pud, addr);
+	do {
+		pmd_t pmd = *pmdp;
+
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(pmd))
+			return 0;
+		if (unlikely(pmd_large(pmd))) {
+			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
+				return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+				return 0;
+		}
+	} while (pmdp++, addr = next, addr != end);
+
+	return 1;
+}
+
+static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long next;
+	pud_t *pudp;
+
+	pudp = pud_offset(&pgd, addr);
+	do {
+		pud_t pud = *pudp;
+
+		next = pud_addr_end(addr, end);
+		if (pud_none(pud))
+			return 0;
+		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+			return 0;
+	} while (pudp++, addr = next, addr != end);
+
+	return 1;
+}
+
+int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long end = start + (nr_pages << PAGE_SHIFT);
+	unsigned long addr = start;
+	unsigned long next;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
+					start, nr_pages*PAGE_SIZE)))
+		goto slow_irqon;
+
+	/*
+	 * XXX: batch / limit 'nr', to avoid large irq off latency
+	 * needs some instrumenting to determine the common sizes used by
+	 * important workloads (eg. DB2), and whether limiting the batch size
+	 * will decrease performance.
+	 *
+	 * It seems like we're in the clear for the moment. Direct-IO is
+	 * the main guy that batches up lots of get_user_pages, and even
+	 * they are limited to 64-at-a-time which is not so many.
+	 */
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables and pages from being freed on x86.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86, with the above PAE exception), we can follow the
+	 * address down to the the page and take a ref on it.
+	 */
+	local_irq_disable();
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+	local_irq_enable();
+
+	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
+	return nr;
+
+	{
+		int i, ret;
+
+slow:
+		local_irq_enable();
+slow_irqon:
+		/* Try to get the remaining pages with get_user_pages */
+		start += nr << PAGE_SHIFT;
+		pages += nr;
+
+		down_read(&mm->mmap_sem);
+		ret = get_user_pages(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+		up_read(&mm->mmap_sem);
+
+		/* Have to be a bit careful with return values */
+		if (nr > 0) {
+			if (ret < 0)
+				ret = nr;
+			else
+				ret += nr;
+		}
+
+		return ret;
+	}
+}
Index: linux-2.6/include/asm-x86/uaccess.h
===================================================================
--- linux-2.6.orig/include/asm-x86/uaccess.h
+++ linux-2.6/include/asm-x86/uaccess.h
@@ -3,3 +3,6 @@
 #else
 # include "uaccess_64.h"
 #endif
+
+#define __HAVE_ARCH_GET_USER_PAGES_FAST
+

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 3/5] x86: lockless get_user_pages_fast
  2008-05-29 17:20   ` Dave Kleikamp
  2008-05-30  0:55     ` Nick Piggin
@ 2008-06-02 10:15     ` Nick Piggin
  2008-06-02 11:28       ` Stephen Rothwell
  1 sibling, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2008-06-02 10:15 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: akpm, linux-mm, linux-arch, apw

BTW. I do plan to ask Linus to merge this as soon as 2.6.27 opens.
Hope nobody objects (or if they do please speak up before then)


On Thu, May 29, 2008 at 12:20:59PM -0500, Dave Kleikamp wrote:
> On Thu, 2008-05-29 at 22:20 +1000, npiggin@suse.de wrote:
>  
> > +int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 3/5] x86: lockless get_user_pages_fast
  2008-06-02 10:15     ` Nick Piggin
@ 2008-06-02 11:28       ` Stephen Rothwell
  2008-06-03  2:34         ` Nick Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen Rothwell @ 2008-06-02 11:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Kleikamp, akpm, linux-mm, linux-arch, apw

[-- Attachment #1: Type: text/plain, Size: 560 bytes --]

Hi Nick,

On Mon, 2 Jun 2008 12:15:30 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> BTW. I do plan to ask Linus to merge this as soon as 2.6.27 opens.
> Hope nobody objects (or if they do please speak up before then)

Any chance of getting this into linux-next then to see if it
conflicts with/kills anything else?

If this is posted/reviewed/tested enough to be "finished" then put it in
a tree (or quilt series) and submit it.

Thanks.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 1/5] x86: implement pte_special
  2008-05-29 12:20 ` [patch 1/5] x86: implement pte_special npiggin
@ 2008-06-02 23:58   ` Andrew Morton
  2008-06-03  2:04     ` Nick Piggin
  2008-06-04 17:14     ` Andy Whitcroft
  2008-06-06 21:35   ` Peter Zijlstra
  1 sibling, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2008-06-02 23:58 UTC (permalink / raw)
  To: npiggin; +Cc: shaggy, linux-mm, linux-arch, apw

On Thu, 29 May 2008 22:20:51 +1000
npiggin@suse.de wrote:

> Implement the pte_special bit for x86. This is required to support lockless
> get_user_pages, because we need to know whether or not we can refcount a
> particular page given only its pte (and no vma).

Spits this reject:

***************
*** 39,44 ****
  #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
  #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
  #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
  
  #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
  #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
--- 40,47 ----
  #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
  #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
  #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
+ #define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
+ #define __HAVE_ARCH_PTE_SPECIAL
  
  #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
  #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)

Which I fixed thusly:

#define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
#define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
#define __HAVE_ARCH_PTE_SPECIAL


OK?


(Also please check the bunch of checkpatch fixes, a warning fix and a
compile fix).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 1/5] x86: implement pte_special
  2008-06-02 23:58   ` Andrew Morton
@ 2008-06-03  2:04     ` Nick Piggin
  2008-06-04 17:14     ` Andy Whitcroft
  1 sibling, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2008-06-03  2:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: shaggy, linux-mm, linux-arch, apw

On Mon, Jun 02, 2008 at 04:58:47PM -0700, Andrew Morton wrote:
> On Thu, 29 May 2008 22:20:51 +1000
> npiggin@suse.de wrote:
> 
> > Implement the pte_special bit for x86. This is required to support lockless
> > get_user_pages, because we need to know whether or not we can refcount a
> > particular page given only its pte (and no vma).
> 
> Spits this reject:
> 
> ***************
> *** 39,44 ****
>   #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
>   #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
>   #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
>   
>   #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
>   #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
> --- 40,47 ----
>   #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
>   #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
>   #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
> + #define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
> + #define __HAVE_ARCH_PTE_SPECIAL
>   
>   #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
>   #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
> 
> Which I fixed thusly:
> 
> #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
> #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
> #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
> #define __HAVE_ARCH_PTE_SPECIAL
> 
> 
> OK?

I think so, thanks.


> (Also please check the bunch of checkpatch fixes, a warning fix and a
> compile fix).

Ah, I forgot to rerun checkpatch after renaming it from fast_gup.
Missed the compile bug though... perhaps I was getting the definition
pulled in some other way... hmm, will investigate, but it looks
good.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 3/5] x86: lockless get_user_pages_fast
  2008-06-02 11:28       ` Stephen Rothwell
@ 2008-06-03  2:34         ` Nick Piggin
  2008-06-03  4:46           ` Stephen Rothwell
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2008-06-03  2:34 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Dave Kleikamp, akpm, linux-mm, linux-arch, apw

On Mon, Jun 02, 2008 at 09:28:33PM +1000, Stephen Rothwell wrote:
> Hi Nick,
> 
> On Mon, 2 Jun 2008 12:15:30 +0200 Nick Piggin <npiggin@suse.de> wrote:
> >
> > BTW. I do plan to ask Linus to merge this as soon as 2.6.27 opens.
> > Hope nobody objects (or if they do please speak up before then)
> 
> Any chance of getting this into linux-next then to see if it
> conflicts with/kills anything else?
> 
> If this is posted/reviewed/tested enough to be "finished" then put it in
> a tree (or quilt series) and submit it.

Hi Stephen,

Thanks for the offer... I was hoping for Andrew to pick it up (which
he now has).

I'm not sure how best to do mm/ related stuff, but I suspect we have
gone as smoothly as we are in large part due to Andrew's reviewing
and martialling mm patches so well.

Not saying that wouldn't happen if the patches went to linux-next,
but I'm quite happy with how -mm works for mm development, so I will
prefer to submit to -mm unless Andrew asks otherwise.

For other developments I'll keep linux-next in mind. I guess it will
be useful for me eg in the case where I change an arch defined prototype
that requires a big sweep of the tree.

Thanks,
Nick

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 3/5] x86: lockless get_user_pages_fast
  2008-06-03  2:34         ` Nick Piggin
@ 2008-06-03  4:46           ` Stephen Rothwell
  0 siblings, 0 replies; 17+ messages in thread
From: Stephen Rothwell @ 2008-06-03  4:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Dave Kleikamp, akpm, linux-mm, linux-arch, apw

[-- Attachment #1: Type: text/plain, Size: 819 bytes --]

Hi Nick,

On Tue, 3 Jun 2008 04:34:20 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> Thanks for the offer... I was hoping for Andrew to pick it up (which
> he now has).
> 
> I'm not sure how best to do mm/ related stuff, but I suspect we have
> gone as smoothly as we are in large part due to Andrew's reviewing
> and martialling mm patches so well.

Yeah, that is the correct and best way to go.

> For other developments I'll keep linux-next in mind. I guess it will
> be useful for me eg in the case where I change an arch defined prototype
> that requires a big sweep of the tree.

Yep, linux-next is idea for that because you find out all the places you
step on other people's toes :-)

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 1/5] x86: implement pte_special
  2008-06-02 23:58   ` Andrew Morton
  2008-06-03  2:04     ` Nick Piggin
@ 2008-06-04 17:14     ` Andy Whitcroft
  2008-06-05  2:01       ` Nick Piggin
  1 sibling, 1 reply; 17+ messages in thread
From: Andy Whitcroft @ 2008-06-04 17:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: npiggin, shaggy, linux-mm, linux-arch

On Mon, Jun 02, 2008 at 04:58:47PM -0700, Andrew Morton wrote:
> On Thu, 29 May 2008 22:20:51 +1000
> npiggin@suse.de wrote:
> 
> > Implement the pte_special bit for x86. This is required to support lockless
> > get_user_pages, because we need to know whether or not we can refcount a
> > particular page given only its pte (and no vma).
> 
> Spits this reject:
> 
> ***************
> *** 39,44 ****
>   #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
>   #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
>   #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
>   
>   #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
>   #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
> --- 40,47 ----
>   #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
>   #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
>   #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
> + #define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
> + #define __HAVE_ARCH_PTE_SPECIAL
>   
>   #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
>   #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
> 
> Which I fixed thusly:
> 
> #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
> #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
> #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
> #define __HAVE_ARCH_PTE_SPECIAL
> 
> 
> OK?
> 
> 
> (Also please check the bunch of checkpatch fixes, a warning fix and a
> compile fix).

That looks a sane merge to me.  I had a quick look over the various
fixes and they all look fine to me.

-apw

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 1/5] x86: implement pte_special
  2008-06-04 17:14     ` Andy Whitcroft
@ 2008-06-05  2:01       ` Nick Piggin
  0 siblings, 0 replies; 17+ messages in thread
From: Nick Piggin @ 2008-06-05  2:01 UTC (permalink / raw)
  To: Andy Whitcroft; +Cc: Andrew Morton, shaggy, linux-mm, linux-arch

On Wed, Jun 04, 2008 at 06:14:31PM +0100, Andy Whitcroft wrote:
> On Mon, Jun 02, 2008 at 04:58:47PM -0700, Andrew Morton wrote:
> > On Thu, 29 May 2008 22:20:51 +1000
> > npiggin@suse.de wrote:
> > 
> > > Implement the pte_special bit for x86. This is required to support lockless
> > > get_user_pages, because we need to know whether or not we can refcount a
> > > particular page given only its pte (and no vma).
> > 
> > Spits this reject:
> > 
> > ***************
> > *** 39,44 ****
> >   #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
> >   #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
> >   #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
> >   
> >   #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> >   #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
> > --- 40,47 ----
> >   #define _PAGE_UNUSED3	(_AC(1, L)<<_PAGE_BIT_UNUSED3)
> >   #define _PAGE_PAT	(_AC(1, L)<<_PAGE_BIT_PAT)
> >   #define _PAGE_PAT_LARGE (_AC(1, L)<<_PAGE_BIT_PAT_LARGE)
> > + #define _PAGE_SPECIAL	(_AC(1, L)<<_PAGE_BIT_SPECIAL)
> > + #define __HAVE_ARCH_PTE_SPECIAL
> >   
> >   #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> >   #define _PAGE_NX	(_AC(1, ULL) << _PAGE_BIT_NX)
> > 
> > Which I fixed thusly:
> > 
> > #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
> > #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
> > #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
> > #define __HAVE_ARCH_PTE_SPECIAL
> > 
> > 
> > OK?
> > 
> > 
> > (Also please check the bunch of checkpatch fixes, a warning fix and a
> > compile fix).
> 
> That looks a sane merge to me.  I had a quick look over the various
> fixes and they all look fine to me.

That means we can put your reviewed-by: back? ;)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 1/5] x86: implement pte_special
  2008-05-29 12:20 ` [patch 1/5] x86: implement pte_special npiggin
  2008-06-02 23:58   ` Andrew Morton
@ 2008-06-06 21:35   ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2008-06-06 21:35 UTC (permalink / raw)
  To: npiggin; +Cc: akpm, shaggy, linux-mm, linux-arch, apw

On Thu, 2008-05-29 at 22:20 +1000, npiggin@suse.de wrote:
> plain text document attachment (x86-implement-pte_special.patch)
> Implement the pte_special bit for x86. This is required to support lockless
> get_user_pages, because we need to know whether or not we can refcount a
> particular page given only its pte (and no vma).
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> Cc: shaggy@austin.ibm.com
> Cc: linux-mm@kvack.org
> Cc: linux-arch@vger.kernel.org
> Cc: apw@shadowen.org

Full series:

Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch 2/5] mm: introduce get_user_pages_fast
  2008-05-29 12:20 ` [patch 2/5] mm: introduce get_user_pages_fast npiggin
@ 2008-06-09 10:29   ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2008-06-09 10:29 UTC (permalink / raw)
  To: npiggin; +Cc: shaggy, linux-mm, linux-arch, apw

On Thu, 29 May 2008 22:20:52 +1000 npiggin@suse.de wrote:

> Introduce a new get_user_pages_fast mm API, which is basically a get_user_pages
> with a less general API (but still tends to be suited to the common case):
> 
> - task and mm are always current and current->mm
> - force is always 0
> - pages is always non-NULL
> - don't pass back vmas
> 
> This restricted API can be implemented in a much more scalable way on
> many architectures when the ptes are present, by walking the page tables
> locklessly (no mmap_sem or page table locks). When the ptes are not
> populated, get_user_pages_fast() could be slower.
> 
> This is implemented locklessly on x86, and used in some key direct IO call
> sites, in later patches, which provides nearly 10% performance improvement
> on a threaded database workload.
> 
> Lots of other code could use this too, depending on use cases (eg. grep
> drivers/). And it might inspire some new and clever ways to use it.
> 
> ...
>
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -12,6 +12,7 @@
>  #include <linux/prio_tree.h>
>  #include <linux/debug_locks.h>
>  #include <linux/mm_types.h>
> +#include <linux/uaccess.h> /* for __HAVE_ARCH_GET_USER_PAGES_FAST */
>  

That breaks ia64:

In file included from include/linux/mm.h:15,
                 from include/asm/uaccess.h:39,
                 from include/linux/poll.h:13,
                 from include/linux/rtc.h:113,
                 from include/linux/efi.h:19,
                 from include/asm/sal.h:40,
                 from include/asm-ia64/mca.h:20,
                 from arch/ia64/kernel/asm-offsets.c:17:
include/linux/uaccess.h: In function `__copy_from_user_inatomic_nocache':
include/linux/uaccess.h:46: error: implicit declaration of function `__copy_from_user_inatomic'
include/linux/uaccess.h: In function `__copy_from_user_nocache':
include/linux/uaccess.h:52: error: implicit declaration of function `__copy_from_user'


It shouldn't have been a __HAVE_ARCH_whatever anyway - it should have
been a CONFIG_whatever.

I'll fix it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2008-06-09 10:29 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20080529122050.823438000@nick.local0.net>
2008-05-29 12:20 ` [patch 1/5] x86: implement pte_special npiggin
2008-06-02 23:58   ` Andrew Morton
2008-06-03  2:04     ` Nick Piggin
2008-06-04 17:14     ` Andy Whitcroft
2008-06-05  2:01       ` Nick Piggin
2008-06-06 21:35   ` Peter Zijlstra
2008-05-29 12:20 ` [patch 2/5] mm: introduce get_user_pages_fast npiggin
2008-06-09 10:29   ` Andrew Morton
2008-05-29 12:20 ` [patch 3/5] x86: lockless get_user_pages_fast npiggin
2008-05-29 17:20   ` Dave Kleikamp
2008-05-30  0:55     ` Nick Piggin
2008-06-02 10:15     ` Nick Piggin
2008-06-02 11:28       ` Stephen Rothwell
2008-06-03  2:34         ` Nick Piggin
2008-06-03  4:46           ` Stephen Rothwell
2008-05-29 12:20 ` [patch 4/5] dio: use get_user_pages_fast npiggin
2008-05-29 12:20 ` [patch 5/5] splice: " npiggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).