* page fault fastpath: Increasing SMP scalability by introducing pte locks?
@ 2004-08-15 13:50 Christoph Lameter
2004-08-15 20:09 ` David S. Miller
2004-08-15 22:38 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 17+ messages in thread
From: Christoph Lameter @ 2004-08-15 13:50 UTC (permalink / raw)
To: linux-ia64; +Cc: linux-kernel
Well this is more an idea than a real patch yet. The page_table_lock
becomes a bottleneck if more than 4 CPUs are rapidly allocating and using
memory. "pft" is a program that measures the performance of page faults on
SMP system. It allocates memory simultaneously in multiple threads thereby
causing lots of page faults for anonymous pages.
Results for a standard 2.6.8.1 kernel. Allocating 2G of RAM in an 8
processor SMP system:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
2 3 1 0.094s 4.500s 4.059s 85561.646 85568.398
2 3 2 0.092s 6.390s 3.043s 60649.650 114521.474
2 3 4 0.081s 6.500s 1.093s 59740.813 203552.963
2 3 8 0.101s 12.001s 2.035s 32487.736 167082.560
Scalablity problems set in over 4 CPUs.
With pte locks and the fastpath:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
2 3 1 0.071s 4.535s 4.061s 85362.102 85288.646
2 3 2 0.073s 4.793s 2.013s 80789.137 184196.199
2 3 4 0.087s 5.119s 1.057s 75516.326 249716.547
2 3 8 0.096s 7.089s 1.019s 54715.728 328540.988
The performance in SMP situation is significantly enhanced with this
patch.
Note that the patch does not address various race conditions that may
result from using a pte lock only in handle_mm_fault. Some rules
need to be develop how to coordinate pte locks and the page_table_lock in
order to avoid these.
pte locks are realized by finding a spare bit in the ptes (TLB structures
on IA64 and i386) and settings the bit atomically via bitops for locking.
The fastpath does not allocate the page_table_lock but instead immediately
locks the pte. Thus the logic to release and later reacquire the
page_table_lock is avoided. Multiple page faults can run concurrently
using pte locks avoiding the page_table_lock. Essentially pte locks would
allow a finer granularity of locking.
I would like to get some feedback if people feel that this is the right
way to solve the issue. Most of this is based on work of Ray
Bryant and others at SGI.
Attached are:
1. pte lock patch for i386 and ia64
2. page_fault fastpath
3. page fault test program
4. test script
=========== PTE LOCK PATCH
Index: linux-2.6.8-rc4/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-generic/pgtable.h 2004-08-09 19:22:39.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-generic/pgtable.h 2004-08-13 10:05:36.000000000 -0700
@@ -126,4 +126,11 @@
#define pgd_offset_gate(mm, addr) pgd_offset(mm, addr)
#endif
+#ifndef __HAVE_ARCH_PTE_LOCK
+/* need to fall back to the mm spinlock if PTE locks are not supported */
+#define ptep_lock(ptep) !spin_trylock(&mm->page_table_lock)
+#define ptep_unlock(ptep) spin_unlock(&mm->page_table_lock)
+#define pte_locked(pte) spin_is_locked(&mm->page_table_lock)
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */
Index: linux-2.6.8-rc4/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-ia64/pgtable.h 2004-08-09 19:22:39.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-ia64/pgtable.h 2004-08-13 10:19:15.000000000 -0700
@@ -30,6 +30,8 @@
#define _PAGE_P_BIT 0
#define _PAGE_A_BIT 5
#define _PAGE_D_BIT 6
+#define _PAGE_IG_BITS 53
+#define _PAGE_LOCK_BIT (_PAGE_IG_BITS+3) /* bit 56. Aligned to 8 bits */
#define _PAGE_P (1 << _PAGE_P_BIT) /* page present bit */
#define _PAGE_MA_WB (0x0 << 2) /* write back memory attribute */
@@ -58,6 +60,7 @@
#define _PAGE_PPN_MASK (((__IA64_UL(1) << IA64_MAX_PHYS_BITS) - 1) & ~0xfffUL)
#define _PAGE_ED (__IA64_UL(1) << 52) /* exception deferral */
#define _PAGE_PROTNONE (__IA64_UL(1) << 63)
+#define _PAGE_LOCK (__IA64_UL(1) << _PAGE_LOCK_BIT)
/* Valid only for a PTE with the present bit cleared: */
#define _PAGE_FILE (1 << 1) /* see swap & file pte remarks below */
@@ -282,6 +285,13 @@
#define pte_mkclean(pte) (__pte(pte_val(pte) & ~_PAGE_D))
#define pte_mkdirty(pte) (__pte(pte_val(pte) | _PAGE_D))
+/*
+ * Lock functions for pte's
+*/
+#define ptep_lock(ptep) test_and_set_bit(_PAGE_LOCK_BIT,ptep)
+#define ptep_unlock(ptep) { clear_bit(_PAGE_LOCK_BIT,ptep);smp_mb__after_clear_bit(); }
+#define pte_locked(pte) ((pte_val(pte) & _PAGE_LOCK)!=0)
+
/*
* Macro to a page protection value as "uncacheable". Note that "protection" is really a
* misnomer here as the protection value contains the memory attribute bits, dirty bits,
@@ -558,6 +568,7 @@
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
#define __HAVE_ARCH_PGD_OFFSET_GATE
+#define __HAVE_ARCH_PTE_LOCK
#include <asm-generic/pgtable.h>
#endif /* _ASM_IA64_PGTABLE_H */
Index: linux-2.6.8-rc4/include/asm-i386/pgtable.h
===================================================================
--- linux-2.6.8-rc4.orig/include/asm-i386/pgtable.h 2004-08-09 19:23:35.000000000 -0700
+++ linux-2.6.8-rc4/include/asm-i386/pgtable.h 2004-08-13 10:04:19.000000000 -0700
@@ -101,7 +101,7 @@
#define _PAGE_BIT_DIRTY 6
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
-#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
+#define _PAGE_BIT_LOCK 9 /* available for programmer */
#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_UNUSED3 11
#define _PAGE_BIT_NX 63
@@ -115,7 +115,7 @@
#define _PAGE_DIRTY 0x040
#define _PAGE_PSE 0x080 /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry PPro+ */
-#define _PAGE_UNUSED1 0x200 /* available for programmer */
+#define _PAGE_LOCK 0x200 /* available for programmer */
#define _PAGE_UNUSED2 0x400
#define _PAGE_UNUSED3 0x800
@@ -260,6 +260,10 @@
static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
+#define ptep_lock(ptep) test_and_set_bit(_PAGE_BIT_LOCK,&ptep->pte_low)
+#define ptep_unlock(ptep) clear_bit(_PAGE_BIT_LOCK,&ptep->pte_low)
+#define pte_locked(pte) ((ptep->pte_low & _PAGE_LOCK) !=0)
+
/*
* Macro to mark a page protection value as "uncacheable". On processors which do not support
* it, this is a no-op.
@@ -419,6 +423,7 @@
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTEP_MKDIRTY
#define __HAVE_ARCH_PTE_SAME
+#define __HAVE_ARCH_PTE_LOCK
#include <asm-generic/pgtable.h>
#endif /* _I386_PGTABLE_H */
======= PAGEFAULT FASTPATH
Index: linux-2.6.8-rc4/mm/memory.c
===================================================================
--- linux-2.6.8-rc4.orig/mm/memory.c 2004-08-09 19:23:02.000000000 -0700
+++ linux-2.6.8-rc4/mm/memory.c 2004-08-13 10:19:21.000000000 -0700
@@ -1680,6 +1680,10 @@
{
pgd_t *pgd;
pmd_t *pmd;
+#ifdef __HAVE_ARCH_PTE_LOCK
+ pte_t *pte;
+ pte_t entry;
+#endif
__set_current_state(TASK_RUNNING);
pgd = pgd_offset(mm, address);
@@ -1688,7 +1692,64 @@
if (is_vm_hugetlb_page(vma))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */
+#ifdef __HAVE_ARCH_PTE_LOCK
+ /*
+ * Fast path for anonymous pages, not found faults bypassing
+ * the necessity to acquire the page_table_lock
+ */
+
+ if ((vma->vm_ops && vma->vm_ops->nopage) || pgd_none(*pgd)) goto use_page_table_lock;
+ pmd = pmd_offset(pgd,address);
+ if (pmd_none(*pmd)) goto use_page_table_lock;
+ pte = pte_offset_kernel(pmd,address);
+ if (pte_locked(*pte)) return VM_FAULT_MINOR;
+ if (!pte_none(*pte)) goto use_page_table_lock;
+
+ /*
+ * Page not present, so kswapd and PTE updates will not touch the pte
+ * so we are able to just use a pte lock.
+ */
+
+ if (ptep_lock(pte)) return VM_FAULT_MINOR;
+ /*
+ * PTE already locked so this code is already running on another processor. Wait
+ * until that processor does our work and then return. If something went
+ * wrong in the handling of the other processor then we will get another page fault
+ * that may then handle the error condition
+ */
+
+ /* Read-only mapping of ZERO_PAGE. */
+ entry = pte_wrprotect(mk_pte(ZERO_PAGE(address), vma->vm_page_prot));
+
+ if (write_access) {
+ struct page *page;
+
+ if (unlikely(anon_vma_prepare(vma))) goto no_mem;
+
+ page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ if (!page) goto no_mem;
+ clear_user_highpage(page, address);
+
+ mm->rss++;
+ entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,vma->vm_page_prot)),vma);
+ lru_cache_add_active(page);
+ mark_page_accessed(page);
+ page_add_anon_rmap(page, vma, address);
+ }
+ /* Setting the pte clears the pte lock so there is no need for unlocking */
+ set_pte(pte, entry);
+ pte_unmap(pte);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache(vma, address, entry);
+ return VM_FAULT_MINOR; /* Minor fault */
+no_mem:
+ ptep_unlock(pte);
+ return VM_FAULT_OOM;
+
+use_page_table_lock:
+#endif
/*
* We need the page table lock to synchronize with kswapd
* and the SMP-safe atomic PTE updates.
======= PFT.C test program
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <errno.h>
#include <sys/resource.h>
extern int optind, opterr;
extern char *optarg;
long bytes=16384;
long sleepsec=0;
long verbose=0;
long forkcnt=1;
long repeatcount=1;
long do_bzero=0;
long mypid;
int title=0;
volatile int go, state[128];
struct timespec wall;
struct rusage ruse;
long faults;
long pages;
long gbyte;
double faults_per_sec;
double faults_per_sec_per_cpu;
#define perrorx(s) (perror(s), exit(1))
#define NBPP 16384
void* test(void*);
void launch(void);
main (int argc, char *argv[])
{
int i, j, c, stat, er=0;
static char optstr[] = "b:f:g:r:s:vzHt";
opterr=1;
while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'g':
bytes = atol(optarg)*1024*1024*1024;
break;
case 'b':
bytes = atol(optarg);
break;
case 'f':
forkcnt = atol(optarg);
break;
case 'r':
repeatcount = atol(optarg);
break;
case 's':
sleepsec = atol(optarg);
break;
case 'v':
verbose++;
break;
case 'z':
do_bzero++;
break;
case 'H':
er++;
break;
case 't' :
title++;
break;
case '?':
er = 1;
break;
}
if (er) {
printf("usage: %s %s\n", argv[0], optstr);
exit(1);
}
pages = bytes*repeatcount/getpagesize();
gbyte = bytes/(1024*1024*1024);
bytes = bytes/forkcnt;
if (verbose) printf("Calculated pages=%ld pagesize=%ld.\n",pages,getpagesize());
mypid = getpid();
setpgid(0, mypid);
for (i=0; i<repeatcount; i++) {
if (fork() == 0)
launch();
while (wait(&stat) > 0);
}
getrusage(RUSAGE_CHILDREN,&ruse);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&wall);
if (verbose) printf("Calculated faults=%ld. Real minor faults=%ld, major faults=%ld\n",pages,ruse.ru_minflt+ruse.ru_majflt);
faults_per_sec=(double) pages / ((double) wall.tv_sec + (double) wall.tv_nsec / 1000000000.0);
faults_per_sec_per_cpu=(double) pages / (
(double) (ruse.ru_utime.tv_sec + ruse.ru_stime.tv_sec) + ((double) (ruse.ru_utime.tv_usec + ruse.ru_stime.tv_usec) / 1000000.0));
if (title) printf(" Gb Rep Threads User System Wall flt/cpu/s fault/wsec\n");
printf("%3ld %3ld %4ld %4ld.%03lds%7ld.%03lds%4ld.%03lds%10.3f %10.3f\n",
gbyte,repeatcount,forkcnt,
ruse.ru_utime.tv_sec,ruse.ru_utime.tv_usec/1000,
ruse.ru_stime.tv_sec,ruse.ru_stime.tv_usec/1000,
wall.tv_sec,wall.tv_nsec/10000000,
faults_per_sec_per_cpu,faults_per_sec);
exit(0);
}
char *
do_shm(long shmlen) {
char *p;
int shmid;
printf ("Try to allocate TOTAL shm segment of %ld bytes\n", shmlen);
if ((shmid = shmget(IPC_PRIVATE, shmlen, SHM_R|SHM_W)) == -1)
perrorx("shmget faiiled");
p=(char*)shmat(shmid, (void*)0, SHM_R|SHM_W);
printf(" created, adr: 0x%lx\n", (long)p);
printf(" attached\n");
bzero(p, shmlen);
printf(" zeroed\n");
// if (shmctl(shmid,IPC_RMID,0) == -1)
// perrorx("shmctl failed");
// printf(" deleted\n");
return p;
}
void
launch()
{
pthread_t ptid[128];
int i, j;
for (j=0; j<forkcnt; j++)
if (pthread_create(&ptid[j], NULL, test, (void*) (long)j) < 0)
perrorx("pthread create");
if(0) for (j=0; j<forkcnt; j++)
while(state[j] == 0);
go = 1;
if(0) for (j=0; j<forkcnt; j++)
while(state[j] == 1);
for (j=0; j<forkcnt; j++)
pthread_join(ptid[j], NULL);
exit(0);
}
void*
test(void *arg)
{
char *p, *pe;
long id;
id = (long) arg;
state[id] = 1;
while(!go);
p = malloc(bytes);
// p = do_shm(bytes);
if (p == 0) {
printf("malloc of %Ld bytes failed.\n",bytes);
exit;
} else
if (verbose) printf("malloc of %Ld bytes succeeded\n",bytes);
if (do_bzero)
bzero(p, bytes);
else {
for(pe=p+bytes; p<pe; p+=16384)
*p = 'r';
}
sleep(sleepsec);
state[id] = 2;
pthread_exit(0);
}
===== Test script
./pft -t -g2 -r3 -f1
./pft -g2 -r3 -f2
./pft -g2 -r3 -f4
./pft -g2 -r3 -f8
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-15 13:50 Christoph Lameter
@ 2004-08-15 20:09 ` David S. Miller
2004-08-15 22:58 ` Christoph Lameter
2004-08-15 22:38 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 17+ messages in thread
From: David S. Miller @ 2004-08-15 20:09 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, linux-kernel
Is the read lock in the VMA semaphore enough to let you do
the pgd/pmd walking without the page_table_lock?
I think it is, but just checking.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-15 13:50 Christoph Lameter
2004-08-15 20:09 ` David S. Miller
@ 2004-08-15 22:38 ` Benjamin Herrenschmidt
2004-08-16 17:28 ` Christoph Lameter
1 sibling, 1 reply; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2004-08-15 22:38 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, Linux Kernel list, Anton Blanchard
On Sun, 2004-08-15 at 23:50, Christoph Lameter wrote:
> Well this is more an idea than a real patch yet. The page_table_lock
> becomes a bottleneck if more than 4 CPUs are rapidly allocating and using
> memory. "pft" is a program that measures the performance of page faults on
> SMP system. It allocates memory simultaneously in multiple threads thereby
> causing lots of page faults for anonymous pages.
Just a note: on ppc64, we already have a PTE lock bit, we use it to
guard against concurrent hash table insertion, it could be extended
to the whole page fault path provided we can guarantee we will never
fault in the hash table on that PTE while it is held. This shouldn't
be a problem as long as only user pages are locked that way (which
should be the case with do_page_fault) provided update_mmu_cache()
is updated to not take this lock, but assume it already held.
Ben.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-15 20:09 ` David S. Miller
@ 2004-08-15 22:58 ` Christoph Lameter
2004-08-15 23:58 ` David S. Miller
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2004-08-15 22:58 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-ia64, linux-kernel
On Sun, 15 Aug 2004, David S. Miller wrote:
>
> Is the read lock in the VMA semaphore enough to let you do
> the pgd/pmd walking without the page_table_lock?
> I think it is, but just checking.
That would be great.... May I change the page_table lock to
be a read write spinlock instead?
I would then convert all spin_locks to write_locks and
then use read locks to switch to a "pte locking mode". The read
lock would allow simultanous threads operating on the page table
that will only modify individual pte's via pte locks. Write locks
still exclude the readers and thus the whole scheme should allow
a gradual transition.
Maybe such a locking policy could do some good.
However, performance is only increased somewhat. Scalability
is still bad with more than 32 CPUs despite my hack. More
extensive work is needed <sigh>:
Regular kernel 512 CPU's 16G allocation per thread:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
16 3 1 0.748s 67.200s 67.098s 46295.921 46270.533
16 3 2 0.899s 100.189s 52.021s 31118.426 60242.544
16 3 4 1.517s 103.467s 31.021s 29963.479 100777.788
16 3 8 1.268s 166.023s 26.035s 18803.807 119350.434
16 3 16 6.296s 453.445s 33.082s 6842.371 92987.774
16 3 32 22.434s 1341.205s 48.026s 2306.860 65174.913
16 3 64 54.189s 4633.748s 81.089s 671.026 38411.466
16 3 128 244.333s 17584.111s 152.026s 176.444 20659.132
16 3 256 222.936s 8167.241s 73.018s 374.930 42983.366
16 3 512 207.464s 4259.264s 39.044s 704.258 79741.366
Modified kernel:
Gb Rep Threads User System Wall flt/cpu/s fault/wsec
16 3 1 0.884s 64.241s 65.014s 48302.177 48287.787
16 3 2 0.931s 99.156s 51.058s 31429.640 60979.126
16 3 4 1.028s 88.451s 26.096s 35155.837 116669.999
16 3 8 1.957s 61.395s 12.099s 49654.307 242078.305
16 3 16 5.701s 81.382s 9.039s 36122.904 334774.381
16 3 32 15.207s 163.893s 9.094s 17564.021 316284.690
16 3 64 76.056s 440.771s 13.037s 6086.601 235120.800
16 3 128 203.843s 1535.909s 19.084s 1808.145 158495.679
16 3 256 274.815s 755.764s 12.058s 3052.387 250010.942
16 3 512 205.505s 381.106s 7.060s 5362.531 413531.352
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
[not found] ` <2tCiw-8pK-1@gated-at.bofh.it>
@ 2004-08-15 23:53 ` Andi Kleen
2004-08-15 23:55 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2004-08-15 23:53 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, linux-kernel
Christoph Lameter <clameter@sgi.com> writes:
> On Sun, 15 Aug 2004, David S. Miller wrote:
>
>>
>> Is the read lock in the VMA semaphore enough to let you do
>> the pgd/pmd walking without the page_table_lock?
>> I think it is, but just checking.
>
> That would be great.... May I change the page_table lock to
> be a read write spinlock instead?
That's probably not a good idea. r/w locks are extremly slow on
some architectures. Including ia64.
Just profile cat /proc/net/tcp on a machine with a lot of memory
and you'll notice.
-Andi
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-15 23:53 ` page fault fastpath: Increasing SMP scalability by introducing pte locks? Andi Kleen
@ 2004-08-15 23:55 ` Christoph Lameter
2004-08-16 0:12 ` Andi Kleen
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2004-08-15 23:55 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, linux-ia64, linux-kernel
On Mon, 16 Aug 2004, Andi Kleen wrote:
> Christoph Lameter <clameter@sgi.com> writes:
>
> > On Sun, 15 Aug 2004, David S. Miller wrote:
> >
> >>
> >> Is the read lock in the VMA semaphore enough to let you do
> >> the pgd/pmd walking without the page_table_lock?
> >> I think it is, but just checking.
> >
> > That would be great.... May I change the page_table lock to
> > be a read write spinlock instead?
>
> That's probably not a good idea. r/w locks are extremly slow on
> some architectures. Including ia64.
I was thinking about a read write spinlock not an readwrite
semaphore. Look at include/asm-ia64/spinlock.h.
The implementations are almost the same. Are you sure
about this?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-15 22:58 ` Christoph Lameter
@ 2004-08-15 23:58 ` David S. Miller
2004-08-16 0:11 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: David S. Miller @ 2004-08-15 23:58 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, linux-kernel
On Sun, 15 Aug 2004 15:58:27 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Sun, 15 Aug 2004, David S. Miller wrote:
>
> >
> > Is the read lock in the VMA semaphore enough to let you do
> > the pgd/pmd walking without the page_table_lock?
> > I think it is, but just checking.
>
> That would be great.... May I change the page_table lock to
> be a read write spinlock instead?
No, I means "is the read long _ON_ the VMA semaphore".
The VMA semaphore is a read/write semaphore, and we grab
it for reading in the code path you're modifying.
Please don't change page_table_lock to a rwlock, it's
only needed for write accesses.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-15 23:58 ` David S. Miller
@ 2004-08-16 0:11 ` Christoph Lameter
2004-08-16 1:56 ` David S. Miller
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2004-08-16 0:11 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-ia64, linux-kernel
On Sun, 15 Aug 2004, David S. Miller wrote:
> > On Sun, 15 Aug 2004, David S. Miller wrote:
> > > Is the read lock in the VMA semaphore enough to let you do
> > > the pgd/pmd walking without the page_table_lock?
> > > I think it is, but just checking.
> >
> > That would be great.... May I change the page_table lock to
> > be a read write spinlock instead?
>
> No, I means "is the read long _ON_ the VMA semaphore".
> The VMA semaphore is a read/write semaphore, and we grab
> it for reading in the code path you're modifying.
>
> Please don't change page_table_lock to a rwlock, it's
> only needed for write accesses.
pgd/pmd walking should be possible always even without the vma semaphore
since the CPU can potentially walk the chain at anytime.
The modification of the pte is not without issue since there are other
code paths that may modify pte's and rely on the page_table_lock to
exclude others from modifying ptes. One known problem is the swap
code which sets the page to the pte_not_present condition to insure that
nothing else touches the page while it is figuring out where to put it. A
page fault during that time (skipping the checking of the
page_table_lock) will cause the fastpath to be taken which will then
assign new memory to it.
We need to have some kind of system how finer granularity locks could be
realized.
One possibility is to abuse the rw spinlock to not only allow exclusive
access to the page tables(as done right now with the spinlock) but also
allow shared access with pte locking after a read lock.
Is there any other way to realize this?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-15 23:55 ` Christoph Lameter
@ 2004-08-16 0:12 ` Andi Kleen
0 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2004-08-16 0:12 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Christoph Lameter, linux-ia64, linux-kernel
On Sun, Aug 15, 2004 at 04:55:57PM -0700, Christoph Lameter wrote:
> On Mon, 16 Aug 2004, Andi Kleen wrote:
>
> > Christoph Lameter <clameter@sgi.com> writes:
> >
> > > On Sun, 15 Aug 2004, David S. Miller wrote:
> > >
> > >>
> > >> Is the read lock in the VMA semaphore enough to let you do
> > >> the pgd/pmd walking without the page_table_lock?
> > >> I think it is, but just checking.
> > >
> > > That would be great.... May I change the page_table lock to
> > > be a read write spinlock instead?
> >
> > That's probably not a good idea. r/w locks are extremly slow on
> > some architectures. Including ia64.
>
> I was thinking about a read write spinlock not an readwrite
> semaphore. Look at include/asm-ia64/spinlock.h.
I was also talking about rw spinlocks.
> The implementations are almost the same. Are you sure
> about this?
Yes. Try the cat /proc/net/tcp test. It will take >100k read locks
for the TCP listen hash table, and on bigger ppc64 and ia64 machines this
can take nearly a second of system time.
-Andi
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-16 0:11 ` Christoph Lameter
@ 2004-08-16 1:56 ` David S. Miller
2004-08-16 3:29 ` Christoph Lameter
0 siblings, 1 reply; 17+ messages in thread
From: David S. Miller @ 2004-08-16 1:56 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, linux-kernel
On Sun, 15 Aug 2004 17:11:53 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> pgd/pmd walking should be possible always even without the vma semaphore
> since the CPU can potentially walk the chain at anytime.
munmap() can destroy pmd and pte tables. somehow we have
to protect against that, and currently that is having the
VMA semaphore held for reading, see free_pgtables().
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-16 1:56 ` David S. Miller
@ 2004-08-16 3:29 ` Christoph Lameter
2004-08-16 7:00 ` Ray Bryant
2004-08-16 14:39 ` William Lee Irwin III
0 siblings, 2 replies; 17+ messages in thread
From: Christoph Lameter @ 2004-08-16 3:29 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-ia64, linux-kernel
On Sun, 15 Aug 2004, David S. Miller wrote:
> On Sun, 15 Aug 2004 17:11:53 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > pgd/pmd walking should be possible always even without the vma semaphore
> > since the CPU can potentially walk the chain at anytime.
>
> munmap() can destroy pmd and pte tables. somehow we have
> to protect against that, and currently that is having the
> VMA semaphore held for reading, see free_pgtables().
It looks to me like the code takes care to provide the correct
sequencing so that the integrity of pgd,pmd and pte links is
guaranteed from the viewpoint of the MMU in the CPUs. munmap is there to
protect one kernel thread messing with the addresses of these entities
that might be stored in another threads register.
Therefore it is safe to walk the chain only holding the semaphore read
lock?
If the mmap lock already guarantees the integrity of the pgd,pmd,pte
system, then pte locking would be okay as long as integrity of the
pgd,pmd and pte's is always guaranteed. Then also adding a lock bit would
work.
So then there are two ways of modifying the pgd,pmd and pte's.
A) Processor obtains vma semaphore write lock and does large scale
modifications to pgd,pmd,pte.
B) Processor obtains vma semaphore read lock but is still free to do
modifications on individual pte's while holding that vma lock. There is no
need to acquire the page_table_lock. These changes must be atomic.
The role of the page_table_lock is restricted *only* to the "struct
page" stuff? It says in the comments regarding handle_mm_fault that the
lock is taken for synchronization with kswapd in regards to the pte
entries. Seems that this use of the page_table_lock is wrong. A or B
should have been used.
We could simply remove the page_table_lock from handle_mm_fault and
provide the synchronization with kswapd with pte locks right? Both
processes are essentially doing modifications on pte's while holding the
vma read lock and I would be changing the way of synchronization between
these two processes.
F.e. something along these lines removing the page_table_lock from
handle_mm_fault and friends. Surprisingly this will also avoid many
rereads of the pte's since the pte's are really locked. This is just for
illustrative purpose and unfinished...
Index: linux-2.6.8.1/mm/memory.c
===================================================================
--- linux-2.6.8.1.orig/mm/memory.c 2004-08-15 06:03:04.000000000 -0700
+++ linux-2.6.8.1/mm/memory.c 2004-08-15 20:26:29.000000000 -0700
@@ -1035,8 +1035,7 @@
* change only once the write actually happens. This avoids a few races,
* and potentially makes it more efficient.
*
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore.
*/
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1051,10 +1050,10 @@
* at least the kernel stops what it's doing before it corrupts
* data, but for the moment just pretend this is OOM.
*/
+ ptep_unlock(page_table);
pte_unmap(page_table);
printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
address);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_OOM;
}
old_page = pfn_to_page(pfn);
@@ -1069,7 +1068,7 @@
ptep_set_access_flags(vma, address, page_table, entry, 1);
update_mmu_cache(vma, address, entry);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
+ /* pte lock unlocked by ptep_set_access */
return VM_FAULT_MINOR;
}
}
@@ -1080,7 +1079,7 @@
*/
if (!PageReserved(old_page))
page_cache_get(old_page);
- spin_unlock(&mm->page_table_lock);
+ ptep_unlock(page_table);
if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
@@ -1090,26 +1089,21 @@
copy_cow_page(old_page,new_page,address);
/*
- * Re-check the pte - we dropped the lock
+ * There is no need to recheck. The pte was locked
*/
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, address);
- if (likely(pte_same(*page_table, pte))) {
- if (PageReserved(old_page))
- ++mm->rss;
- else
- page_remove_rmap(old_page);
- break_cow(vma, new_page, address, page_table);
- lru_cache_add_active(new_page);
- page_add_anon_rmap(new_page, vma, address);
+ if (PageReserved(old_page))
+ ++mm->rss;
+ else
+ page_remove_rmap(old_page);
+ break_cow(vma, new_page, address, page_table);
+ lru_cache_add_active(new_page);
+ page_add_anon_rmap(new_page, vma, address);
- /* Free the old page.. */
- new_page = old_page;
- }
+ /* Free the old page.. */
+ new_page = old_page;
pte_unmap(page_table);
page_cache_release(new_page);
page_cache_release(old_page);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
no_new_page:
@@ -1314,8 +1308,8 @@
}
/*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore and a pte lock n entry and
+ * should release the pte lock on exit..
*/
static int do_swap_page(struct mm_struct * mm,
struct vm_area_struct * vma, unsigned long address,
@@ -1327,27 +1321,10 @@
int ret = VM_FAULT_MINOR;
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
- if (!page) {
- /*
- * Back out if somebody else faulted in this pte while
- * we released the page table lock.
- */
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, address);
- if (likely(pte_same(*page_table, orig_pte)))
- ret = VM_FAULT_OOM;
- else
- ret = VM_FAULT_MINOR;
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
- goto out;
- }
-
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
inc_page_state(pgmajfault);
@@ -1356,21 +1333,6 @@
mark_page_accessed(page);
lock_page(page);
- /*
- * Back out if somebody else faulted in this pte while we
- * released the page table lock.
- */
- spin_lock(&mm->page_table_lock);
- page_table = pte_offset_map(pmd, address);
- if (unlikely(!pte_same(*page_table, orig_pte))) {
- pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
- unlock_page(page);
- page_cache_release(page);
- ret = VM_FAULT_MINOR;
- goto out;
- }
-
/* The page isn't present yet, go ahead with the fault. */
swap_free(entry);
@@ -1398,8 +1360,8 @@
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, address, pte);
+ ptep_unlock(page_table);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
out:
return ret;
}
@@ -1424,7 +1386,6 @@
if (write_access) {
/* Allocate our own private page. */
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
if (unlikely(anon_vma_prepare(vma)))
goto no_mem;
@@ -1433,13 +1394,12 @@
goto no_mem;
clear_user_highpage(page, addr);
- spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, addr);
if (!pte_none(*page_table)) {
+ ptep_unlock(page_table);
pte_unmap(page_table);
page_cache_release(page);
- spin_unlock(&mm->page_table_lock);
goto out;
}
mm->rss++;
@@ -1456,7 +1416,6 @@
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, entry);
- spin_unlock(&mm->page_table_lock);
out:
return VM_FAULT_MINOR;
no_mem:
@@ -1472,8 +1431,8 @@
* As this is called only for pages that do not currently exist, we
* do not need to flush old virtual caches or the TLB.
*
- * This is called with the MM semaphore held and the page table
- * spinlock held. Exit with the spinlock released.
+ * This is called with the MM semaphore held and pte lock
+ * held. Exit with the pte lock released.
*/
static int
do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -1489,9 +1448,9 @@
if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
pmd, write_access, address);
+ ptep_unlock(page_table);
pte_unmap(page_table);
- spin_unlock(&mm->page_table_lock);
-
+
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
sequence = atomic_read(&mapping->truncate_count);
@@ -1523,7 +1482,7 @@
anon = 1;
}
- spin_lock(&mm->page_table_lock);
+ while (ptep_lock(page_table)) ;
/*
* For a file-backed vma, someone could have truncated or otherwise
* invalidated this page. If unmap_mapping_range got called,
@@ -1532,7 +1491,7 @@
if (mapping &&
(unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
sequence = atomic_read(&mapping->truncate_count);
- spin_unlock(&mm->page_table_lock);
+ ptep_unlock(page_table);
page_cache_release(new_page);
goto retry;
}
@@ -1565,15 +1524,15 @@
pte_unmap(page_table);
} else {
/* One of our sibling threads was faster, back out. */
+ ptep_unlock(page_table);
pte_unmap(page_table);
page_cache_release(new_page);
- spin_unlock(&mm->page_table_lock);
goto out;
}
/* no need to invalidate: a not-present page shouldn't be cached */
update_mmu_cache(vma, address, entry);
- spin_unlock(&mm->page_table_lock);
+ ptep_unlock(page_table);
out:
return ret;
oom:
@@ -1606,8 +1565,8 @@
pgoff = pte_to_pgoff(*pte);
+ ptep_unlock(pte);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
if (err == -ENOMEM)
@@ -1644,13 +1603,11 @@
{
pte_t entry;
- entry = *pte;
+ entry = *pte; /* get the unlocked value so that we do not write the lock bit back */
+
+ if (ptep_lock(pte)) return VM_FAULT_MINOR;
+
if (!pte_present(entry)) {
- /*
- * If it truly wasn't present, we know that kswapd
- * and the PTE updates will not touch it later. So
- * drop the lock.
- */
if (pte_none(entry))
return do_no_page(mm, vma, address, write_access, pte, pmd);
if (pte_file(entry))
@@ -1668,7 +1625,6 @@
ptep_set_access_flags(vma, address, pte, entry, write_access);
update_mmu_cache(vma, address, entry);
pte_unmap(pte);
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;
}
@@ -1688,12 +1644,6 @@
if (is_vm_hugetlb_page(vma))
return VM_FAULT_SIGBUS; /* mapping truncation does this. */
-
- /*
- * We need the page table lock to synchronize with kswapd
- * and the SMP-safe atomic PTE updates.
- */
- spin_lock(&mm->page_table_lock);
pmd = pmd_alloc(mm, pgd, address);
if (pmd) {
@@ -1701,7 +1651,6 @@
if (pte)
return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
}
- spin_unlock(&mm->page_table_lock);
return VM_FAULT_OOM;
}
Index: linux-2.6.8.1/mm/rmap.c
===================================================================
--- linux-2.6.8.1.orig/mm/rmap.c 2004-08-14 03:56:22.000000000 -0700
+++ linux-2.6.8.1/mm/rmap.c 2004-08-15 19:59:32.000000000 -0700
@@ -494,8 +494,14 @@
/* Nuke the page table entry. */
flush_cache_page(vma, address);
+#ifdef __HAVE_ARCH_PTE_LOCK
+ /* If we would simply zero the pte then handle_mm_fault might
+ * race against this code and reinstate an anonymous mapping
+ */
+ pteval = ptep_clear_and_lock_flush(vma, address, pte);
+#else
pteval = ptep_clear_flush(vma, address, pte);
-
+#endif
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
@@ -508,9 +514,13 @@
*/
BUG_ON(!PageSwapCache(page));
swap_duplicate(entry);
+ /* This is going to clear the lock that may have been set on the pte */
set_pte(pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
}
+#ifdef __HAVE_ARCH_PTE_LOCK
+ else ptep_unlock(pte);
+#endif
mm->rss--;
BUG_ON(!page->mapcount);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-16 3:29 ` Christoph Lameter
@ 2004-08-16 7:00 ` Ray Bryant
2004-08-16 15:18 ` Christoph Lameter
2004-08-16 14:39 ` William Lee Irwin III
1 sibling, 1 reply; 17+ messages in thread
From: Ray Bryant @ 2004-08-16 7:00 UTC (permalink / raw)
To: Christoph Lameter; +Cc: David S. Miller, linux-ia64, linux-kernel
Christoph,
Something else to worry about here is mm->rss. Previously, this was updated
only with the page_table_lock held, so concurrent increments were not a
problem. rss may need to converted be an atomic_t if you use pte_locks.
It may be that an approximate value for rss is good enough, but I'm not sure
how to bound the error that could be introduced by a couple of hundred
processers handling page faults in parallel and updating rss without locking
it or making it an atomic_t.
Christoph Lameter wrote:
> On Sun, 15 Aug 2004, David S. Miller wrote:
>
>
>>On Sun, 15 Aug 2004 17:11:53 -0700 (PDT)
>>Christoph Lameter <clameter@sgi.com> wrote:
>>
>>
>>>pgd/pmd walking should be possible always even without the vma semaphore
>>>since the CPU can potentially walk the chain at anytime.
>>
>>munmap() can destroy pmd and pte tables. somehow we have
>>to protect against that, and currently that is having the
>>VMA semaphore held for reading, see free_pgtables().
>
>
> It looks to me like the code takes care to provide the correct
> sequencing so that the integrity of pgd,pmd and pte links is
> guaranteed from the viewpoint of the MMU in the CPUs. munmap is there to
> protect one kernel thread messing with the addresses of these entities
> that might be stored in another threads register.
>
> Therefore it is safe to walk the chain only holding the semaphore read
> lock?
>
> If the mmap lock already guarantees the integrity of the pgd,pmd,pte
> system, then pte locking would be okay as long as integrity of the
> pgd,pmd and pte's is always guaranteed. Then also adding a lock bit would
> work.
>
> So then there are two ways of modifying the pgd,pmd and pte's.
>
> A) Processor obtains vma semaphore write lock and does large scale
> modifications to pgd,pmd,pte.
>
> B) Processor obtains vma semaphore read lock but is still free to do
> modifications on individual pte's while holding that vma lock. There is no
> need to acquire the page_table_lock. These changes must be atomic.
>
> The role of the page_table_lock is restricted *only* to the "struct
> page" stuff? It says in the comments regarding handle_mm_fault that the
> lock is taken for synchronization with kswapd in regards to the pte
> entries. Seems that this use of the page_table_lock is wrong. A or B
> should have been used.
>
> We could simply remove the page_table_lock from handle_mm_fault and
> provide the synchronization with kswapd with pte locks right? Both
> processes are essentially doing modifications on pte's while holding the
> vma read lock and I would be changing the way of synchronization between
> these two processes.
>
> F.e. something along these lines removing the page_table_lock from
> handle_mm_fault and friends. Surprisingly this will also avoid many
> rereads of the pte's since the pte's are really locked. This is just for
> illustrative purpose and unfinished...
>
> Index: linux-2.6.8.1/mm/memory.c
> ===================================================================
> --- linux-2.6.8.1.orig/mm/memory.c 2004-08-15 06:03:04.000000000 -0700
> +++ linux-2.6.8.1/mm/memory.c 2004-08-15 20:26:29.000000000 -0700
> @@ -1035,8 +1035,7 @@
> * change only once the write actually happens. This avoids a few races,
> * and potentially makes it more efficient.
> *
> - * We hold the mm semaphore and the page_table_lock on entry and exit
> - * with the page_table_lock released.
> + * We hold the mm semaphore.
> */
> static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
> unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
> @@ -1051,10 +1050,10 @@
> * at least the kernel stops what it's doing before it corrupts
> * data, but for the moment just pretend this is OOM.
> */
> + ptep_unlock(page_table);
> pte_unmap(page_table);
> printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
> address);
> - spin_unlock(&mm->page_table_lock);
> return VM_FAULT_OOM;
> }
> old_page = pfn_to_page(pfn);
> @@ -1069,7 +1068,7 @@
> ptep_set_access_flags(vma, address, page_table, entry, 1);
> update_mmu_cache(vma, address, entry);
> pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
> + /* pte lock unlocked by ptep_set_access */
> return VM_FAULT_MINOR;
> }
> }
> @@ -1080,7 +1079,7 @@
> */
> if (!PageReserved(old_page))
> page_cache_get(old_page);
> - spin_unlock(&mm->page_table_lock);
> + ptep_unlock(page_table);
>
> if (unlikely(anon_vma_prepare(vma)))
> goto no_new_page;
> @@ -1090,26 +1089,21 @@
> copy_cow_page(old_page,new_page,address);
>
> /*
> - * Re-check the pte - we dropped the lock
> + * There is no need to recheck. The pte was locked
> */
> - spin_lock(&mm->page_table_lock);
> - page_table = pte_offset_map(pmd, address);
> - if (likely(pte_same(*page_table, pte))) {
> - if (PageReserved(old_page))
> - ++mm->rss;
> - else
> - page_remove_rmap(old_page);
> - break_cow(vma, new_page, address, page_table);
> - lru_cache_add_active(new_page);
> - page_add_anon_rmap(new_page, vma, address);
> + if (PageReserved(old_page))
> + ++mm->rss;
> + else
> + page_remove_rmap(old_page);
> + break_cow(vma, new_page, address, page_table);
> + lru_cache_add_active(new_page);
> + page_add_anon_rmap(new_page, vma, address);
>
> - /* Free the old page.. */
> - new_page = old_page;
> - }
> + /* Free the old page.. */
> + new_page = old_page;
> pte_unmap(page_table);
> page_cache_release(new_page);
> page_cache_release(old_page);
> - spin_unlock(&mm->page_table_lock);
> return VM_FAULT_MINOR;
>
> no_new_page:
> @@ -1314,8 +1308,8 @@
> }
>
> /*
> - * We hold the mm semaphore and the page_table_lock on entry and
> - * should release the pagetable lock on exit..
> + * We hold the mm semaphore and a pte lock n entry and
> + * should release the pte lock on exit..
> */
> static int do_swap_page(struct mm_struct * mm,
> struct vm_area_struct * vma, unsigned long address,
> @@ -1327,27 +1321,10 @@
> int ret = VM_FAULT_MINOR;
>
> pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
> page = lookup_swap_cache(entry);
> if (!page) {
> swapin_readahead(entry, address, vma);
> page = read_swap_cache_async(entry, vma, address);
> - if (!page) {
> - /*
> - * Back out if somebody else faulted in this pte while
> - * we released the page table lock.
> - */
> - spin_lock(&mm->page_table_lock);
> - page_table = pte_offset_map(pmd, address);
> - if (likely(pte_same(*page_table, orig_pte)))
> - ret = VM_FAULT_OOM;
> - else
> - ret = VM_FAULT_MINOR;
> - pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
> - goto out;
> - }
> -
> /* Had to read the page from swap area: Major fault */
> ret = VM_FAULT_MAJOR;
> inc_page_state(pgmajfault);
> @@ -1356,21 +1333,6 @@
> mark_page_accessed(page);
> lock_page(page);
>
> - /*
> - * Back out if somebody else faulted in this pte while we
> - * released the page table lock.
> - */
> - spin_lock(&mm->page_table_lock);
> - page_table = pte_offset_map(pmd, address);
> - if (unlikely(!pte_same(*page_table, orig_pte))) {
> - pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
> - unlock_page(page);
> - page_cache_release(page);
> - ret = VM_FAULT_MINOR;
> - goto out;
> - }
> -
> /* The page isn't present yet, go ahead with the fault. */
>
> swap_free(entry);
> @@ -1398,8 +1360,8 @@
>
> /* No need to invalidate - it was non-present before */
> update_mmu_cache(vma, address, pte);
> + ptep_unlock(page_table);
> pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
> out:
> return ret;
> }
> @@ -1424,7 +1386,6 @@
> if (write_access) {
> /* Allocate our own private page. */
> pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
>
> if (unlikely(anon_vma_prepare(vma)))
> goto no_mem;
> @@ -1433,13 +1394,12 @@
> goto no_mem;
> clear_user_highpage(page, addr);
>
> - spin_lock(&mm->page_table_lock);
> page_table = pte_offset_map(pmd, addr);
>
> if (!pte_none(*page_table)) {
> + ptep_unlock(page_table);
> pte_unmap(page_table);
> page_cache_release(page);
> - spin_unlock(&mm->page_table_lock);
> goto out;
> }
> mm->rss++;
> @@ -1456,7 +1416,6 @@
>
> /* No need to invalidate - it was non-present before */
> update_mmu_cache(vma, addr, entry);
> - spin_unlock(&mm->page_table_lock);
> out:
> return VM_FAULT_MINOR;
> no_mem:
> @@ -1472,8 +1431,8 @@
> * As this is called only for pages that do not currently exist, we
> * do not need to flush old virtual caches or the TLB.
> *
> - * This is called with the MM semaphore held and the page table
> - * spinlock held. Exit with the spinlock released.
> + * This is called with the MM semaphore held and pte lock
> + * held. Exit with the pte lock released.
> */
> static int
> do_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -1489,9 +1448,9 @@
> if (!vma->vm_ops || !vma->vm_ops->nopage)
> return do_anonymous_page(mm, vma, page_table,
> pmd, write_access, address);
> + ptep_unlock(page_table);
> pte_unmap(page_table);
> - spin_unlock(&mm->page_table_lock);
> -
> +
> if (vma->vm_file) {
> mapping = vma->vm_file->f_mapping;
> sequence = atomic_read(&mapping->truncate_count);
> @@ -1523,7 +1482,7 @@
> anon = 1;
> }
>
> - spin_lock(&mm->page_table_lock);
> + while (ptep_lock(page_table)) ;
> /*
> * For a file-backed vma, someone could have truncated or otherwise
> * invalidated this page. If unmap_mapping_range got called,
> @@ -1532,7 +1491,7 @@
> if (mapping &&
> (unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
> sequence = atomic_read(&mapping->truncate_count);
> - spin_unlock(&mm->page_table_lock);
> + ptep_unlock(page_table);
> page_cache_release(new_page);
> goto retry;
> }
> @@ -1565,15 +1524,15 @@
> pte_unmap(page_table);
> } else {
> /* One of our sibling threads was faster, back out. */
> + ptep_unlock(page_table);
> pte_unmap(page_table);
> page_cache_release(new_page);
> - spin_unlock(&mm->page_table_lock);
> goto out;
> }
>
> /* no need to invalidate: a not-present page shouldn't be cached */
> update_mmu_cache(vma, address, entry);
> - spin_unlock(&mm->page_table_lock);
> + ptep_unlock(page_table);
> out:
> return ret;
> oom:
> @@ -1606,8 +1565,8 @@
>
> pgoff = pte_to_pgoff(*pte);
>
> + ptep_unlock(pte);
> pte_unmap(pte);
> - spin_unlock(&mm->page_table_lock);
>
> err = vma->vm_ops->populate(vma, address & PAGE_MASK, PAGE_SIZE, vma->vm_page_prot, pgoff, 0);
> if (err == -ENOMEM)
> @@ -1644,13 +1603,11 @@
> {
> pte_t entry;
>
> - entry = *pte;
> + entry = *pte; /* get the unlocked value so that we do not write the lock bit back */
> +
> + if (ptep_lock(pte)) return VM_FAULT_MINOR;
> +
> if (!pte_present(entry)) {
> - /*
> - * If it truly wasn't present, we know that kswapd
> - * and the PTE updates will not touch it later. So
> - * drop the lock.
> - */
> if (pte_none(entry))
> return do_no_page(mm, vma, address, write_access, pte, pmd);
> if (pte_file(entry))
> @@ -1668,7 +1625,6 @@
> ptep_set_access_flags(vma, address, pte, entry, write_access);
> update_mmu_cache(vma, address, entry);
> pte_unmap(pte);
> - spin_unlock(&mm->page_table_lock);
> return VM_FAULT_MINOR;
> }
>
> @@ -1688,12 +1644,6 @@
>
> if (is_vm_hugetlb_page(vma))
> return VM_FAULT_SIGBUS; /* mapping truncation does this. */
> -
> - /*
> - * We need the page table lock to synchronize with kswapd
> - * and the SMP-safe atomic PTE updates.
> - */
> - spin_lock(&mm->page_table_lock);
> pmd = pmd_alloc(mm, pgd, address);
>
> if (pmd) {
> @@ -1701,7 +1651,6 @@
> if (pte)
> return handle_pte_fault(mm, vma, address, write_access, pte, pmd);
> }
> - spin_unlock(&mm->page_table_lock);
> return VM_FAULT_OOM;
> }
>
> Index: linux-2.6.8.1/mm/rmap.c
> ===================================================================
> --- linux-2.6.8.1.orig/mm/rmap.c 2004-08-14 03:56:22.000000000 -0700
> +++ linux-2.6.8.1/mm/rmap.c 2004-08-15 19:59:32.000000000 -0700
> @@ -494,8 +494,14 @@
>
> /* Nuke the page table entry. */
> flush_cache_page(vma, address);
> +#ifdef __HAVE_ARCH_PTE_LOCK
> + /* If we would simply zero the pte then handle_mm_fault might
> + * race against this code and reinstate an anonymous mapping
> + */
> + pteval = ptep_clear_and_lock_flush(vma, address, pte);
> +#else
> pteval = ptep_clear_flush(vma, address, pte);
> -
> +#endif
> /* Move the dirty bit to the physical page now the pte is gone. */
> if (pte_dirty(pteval))
> set_page_dirty(page);
> @@ -508,9 +514,13 @@
> */
> BUG_ON(!PageSwapCache(page));
> swap_duplicate(entry);
> + /* This is going to clear the lock that may have been set on the pte */
> set_pte(pte, swp_entry_to_pte(entry));
> BUG_ON(pte_file(*pte));
> }
> +#ifdef __HAVE_ARCH_PTE_LOCK
> + else ptep_unlock(pte);
> +#endif
>
> mm->rss--;
> BUG_ON(!page->mapcount);
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-16 3:29 ` Christoph Lameter
2004-08-16 7:00 ` Ray Bryant
@ 2004-08-16 14:39 ` William Lee Irwin III
1 sibling, 0 replies; 17+ messages in thread
From: William Lee Irwin III @ 2004-08-16 14:39 UTC (permalink / raw)
To: Christoph Lameter; +Cc: David S. Miller, linux-ia64, linux-kernel
On Sun, 15 Aug 2004, David S. Miller wrote:
>> munmap() can destroy pmd and pte tables. somehow we have
>> to protect against that, and currently that is having the
>> VMA semaphore held for reading, see free_pgtables().
On Sun, Aug 15, 2004 at 08:29:11PM -0700, Christoph Lameter wrote:
> It looks to me like the code takes care to provide the correct
> sequencing so that the integrity of pgd,pmd and pte links is
> guaranteed from the viewpoint of the MMU in the CPUs. munmap is there to
> protect one kernel thread messing with the addresses of these entities
> that might be stored in another threads register.
> Therefore it is safe to walk the chain only holding the semaphore read
> lock?
Detached pagetables are assumed to be freeable after a TLB flush IPI.
Previously holding ->page_table_lock would prevent the shootdowns of
links to the pagetable page from executing concurrently with
modifications to the pagetable page. Disabling interrupts or otherwise
inhibiting the progress of the IPI'ing cpu is needed to prevent
dereferencing freed pagetables and incorrect accounting based on
contents of about-to-be-freed pagetables. Reference counting pagetable
pages may help here, where the final put would be responsible for
unaccounting the various things in the pagetable page.
-- wli
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-16 7:00 ` Ray Bryant
@ 2004-08-16 15:18 ` Christoph Lameter
2004-08-16 16:18 ` William Lee Irwin III
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2004-08-16 15:18 UTC (permalink / raw)
To: Ray Bryant; +Cc: David S. Miller, linux-ia64, linux-kernel
On Mon, 16 Aug 2004, Ray Bryant wrote:
> Something else to worry about here is mm->rss. Previously, this was updated
> only with the page_table_lock held, so concurrent increments were not a
> problem. rss may need to converted be an atomic_t if you use pte_locks.
> It may be that an approximate value for rss is good enough, but I'm not sure
> how to bound the error that could be introduced by a couple of hundred
> processers handling page faults in parallel and updating rss without locking
> it or making it an atomic_t.
Correct. There are a number of issues that may have to be addressed but
first we need to agree on a general idea how to proceed.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-16 15:18 ` Christoph Lameter
@ 2004-08-16 16:18 ` William Lee Irwin III
0 siblings, 0 replies; 17+ messages in thread
From: William Lee Irwin III @ 2004-08-16 16:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Ray Bryant, David S. Miller, linux-ia64, linux-kernel
On Mon, 16 Aug 2004, Ray Bryant wrote:
>> Something else to worry about here is mm->rss. Previously, this was updated
>> only with the page_table_lock held, so concurrent increments were not a
>> problem. rss may need to converted be an atomic_t if you use pte_locks.
>> It may be that an approximate value for rss is good enough, but I'm not sure
>> how to bound the error that could be introduced by a couple of hundred
>> processers handling page faults in parallel and updating rss without locking
>> it or making it an atomic_t.
On Mon, Aug 16, 2004 at 08:18:11AM -0700, Christoph Lameter wrote:
> Correct. There are a number of issues that may have to be addressed but
> first we need to agree on a general idea how to proceed.
I'd favor a per-cpu counter so the cacheline doesn't bounce.
-- wli
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-15 22:38 ` Benjamin Herrenschmidt
@ 2004-08-16 17:28 ` Christoph Lameter
2004-08-17 8:01 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2004-08-16 17:28 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: linux-ia64, Linux Kernel list, Anton Blanchard
On Mon, 16 Aug 2004, Benjamin Herrenschmidt wrote:
> On Sun, 2004-08-15 at 23:50, Christoph Lameter wrote:
> > Well this is more an idea than a real patch yet. The page_table_lock
> > becomes a bottleneck if more than 4 CPUs are rapidly allocating and using
> > memory. "pft" is a program that measures the performance of page faults on
> > SMP system. It allocates memory simultaneously in multiple threads thereby
> > causing lots of page faults for anonymous pages.
>
> Just a note: on ppc64, we already have a PTE lock bit, we use it to
> guard against concurrent hash table insertion, it could be extended
> to the whole page fault path provided we can guarantee we will never
> fault in the hash table on that PTE while it is held. This shouldn't
> be a problem as long as only user pages are locked that way (which
> should be the case with do_page_fault) provided update_mmu_cache()
> is updated to not take this lock, but assume it already held.
Is this the _PAGE_BUSY bit? The pte update routines on PPC64 seem to spin
on that bit when it is set waiting for the hash value update to complete.
Looks very specific to the PPC64 architecture.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: page fault fastpath: Increasing SMP scalability by introducing pte locks?
2004-08-16 17:28 ` Christoph Lameter
@ 2004-08-17 8:01 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 17+ messages in thread
From: Benjamin Herrenschmidt @ 2004-08-17 8:01 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-ia64, Linux Kernel list, Anton Blanchard
> Is this the _PAGE_BUSY bit? The pte update routines on PPC64 seem to spin
> on that bit when it is set waiting for the hash value update to complete.
> Looks very specific to the PPC64 architecture.
Yes, it is, I was thinking it's use could be extended tho
Ben.
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2004-08-17 8:12 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <2ttIr-2e4-17@gated-at.bofh.it>
[not found] ` <2tzE4-6sw-25@gated-at.bofh.it>
[not found] ` <2tCiw-8pK-1@gated-at.bofh.it>
2004-08-15 23:53 ` page fault fastpath: Increasing SMP scalability by introducing pte locks? Andi Kleen
2004-08-15 23:55 ` Christoph Lameter
2004-08-16 0:12 ` Andi Kleen
2004-08-15 13:50 Christoph Lameter
2004-08-15 20:09 ` David S. Miller
2004-08-15 22:58 ` Christoph Lameter
2004-08-15 23:58 ` David S. Miller
2004-08-16 0:11 ` Christoph Lameter
2004-08-16 1:56 ` David S. Miller
2004-08-16 3:29 ` Christoph Lameter
2004-08-16 7:00 ` Ray Bryant
2004-08-16 15:18 ` Christoph Lameter
2004-08-16 16:18 ` William Lee Irwin III
2004-08-16 14:39 ` William Lee Irwin III
2004-08-15 22:38 ` Benjamin Herrenschmidt
2004-08-16 17:28 ` Christoph Lameter
2004-08-17 8:01 ` Benjamin Herrenschmidt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox