public inbox for linux-ia64@vger.kernel.org
* Protect PGD...PTE walking in ivt.S
From: Zoltan Menyhart @ 2006-04-21 14:12 UTC
  To: linux-ia64

The "rx = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]" chain has to be
protected against "free_pgtables()".

"free_pgtables()" is protected by "mmap_sem" taken for write, therefore
taking this semaphore for read in the low level assembly routines will
provide us with enough protection:
- it can be taken for read almost all the time
- it scales well
- the performance is not degraded (see below)

I prepared a patch for the VHPT miss handler only. Later
I can add the rest. Here is how it goes:

0. Only the user region faults are protected.
   (You should not try to remove an in-use 0xa000... mapping.)

1. "activate_mm()" remembers the physical address of the "mmap_sem" of
   the task selected to run in "ar.k2" (IA64_KR_MM_SEM).

2. On entry, "vhpt_miss" takes "mmap_sem" for read:

	while (unlikely(__down_read_trylock(&current->mm->mmap_sem) == 0));

3. If there is no valid PTE => go to "page_fault_sem" (see later).

4. On one hand, as "free_pgtables()" cannot run in the meantime,
   there is no need to re-read pgd[i] -> pud[j] -> pmd[k].
   On the other hand, the swapper is not excluded => we keep re-checking
   pte[l] and purging the freshly inserted TLB entry if it has been changed.

5. Releasing "mmap_sem" is a bit tricky:

	__up_read(&current->mm->mmap_sem);

   On the fast path, we can "rfi".

6. If someone is waiting for "mmap_sem", then we have to switch the
   translation on, to establish the "C" environment and to call
   "rwsem_wake()".

7. "page_fault_sem": if we have "mmap_sem", then it is more efficient
   not to release it, because "ia64_do_page_fault()" would take it.
   Therefore a new argument, called "mmap_sem_taken_for_read", is added.
   (This flag is DONT-CARE for the faults from the region 5.)
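
Putting points 0 ... 6 together, the fast path reads like this in C-like
pseudocode (a sketch only: "itc_insert()" and "ptc_l()" stand in for the
itc.i / itc.d and ptc.l instructions, "pte_p" for the PTE address computed
by the walk):

	/* Point 0: user region (0 ... 4) faults only. */
	while (unlikely(__down_read_trylock(&mm->mmap_sem) == 0))
		;				/* point 2: spin, never sleep */
	pte = *pte_p;				/* pgd -> pud -> pmd is stable now */
	if (!pte_present(pte))
		goto page_fault_sem;		/* point 3: keep "mmap_sem" for read */
	itc_insert(pte);			/* insert the TLB entry */
	if (*pte_p != pte)			/* point 4: the swapper is not excluded */
		ptc_l(ifa);			/* purge the freshly inserted entry */
	__up_read(&mm->mmap_sem);		/* point 5; point 6 wakes up the waiters */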

Non-regression test:

I wrote a "wicked" test that provokes VHPT misses, while all the
loads / stores hit the caches L1D / L2.

In order to maximize the number of the TLB entries needed for the
virtually mapped PTE pages, only one valid PTE / page will be used
(and a single byte per data page).

8 times more TLB entries will be solicited than what is really available:
512 data pages + 512 PTE pages => systematic VHPT misses.
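
(In the test program posted as a follow-up, N_TLB_ENTRIES = 128 and N = 8:
128 / 2 * 8 = 512 PTE-page / user-page pairs, i.e. 1024 translations
competing for the 128 DTLB entries.)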

The innermost loop is as follows:

1:  st1 [r15]=r16        // DTLB miss, then L2 hit
    adds r16=1,r16
    ld4 r24=[r18] ;;     // L1D cache hit
    add r15=r15,r24
    br.cloop.sptk.few 1b
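
(In C terms this is the body of "test()" in the program posted as a
follow-up: "*p = i; p += offset;". The st1 touches a single byte per data
page; the ld4 re-reads "offset" from the L1D cache.)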

The original linux-2.6.16.9 with the missing "srlz.d" added into the
VHPT miss handler:

ITC ticks: 615,846,302, # user accesses: 5,120,000, ITC ticks / access: 120
ITC ticks: 593,729,267, # user accesses: 5,120,000, ITC ticks / access: 115
ITC ticks: 558,319,090, # user accesses: 5,120,000, ITC ticks / access: 109
ITC ticks: 596,911,467, # user accesses: 5,120,000, ITC ticks / access: 116
ITC ticks: 558,342,475, # user accesses: 5,120,000, ITC ticks / access: 109
ITC ticks: 602,162,489, # user accesses: 5,120,000, ITC ticks / access: 117
ITC ticks: 558,207,935, # user accesses: 5,120,000, ITC ticks / access: 109
ITC ticks: 612,523,245, # user accesses: 5,120,000, ITC ticks / access: 119
ITC ticks: 615,927,942, # user accesses: 5,120,000, ITC ticks / access: 120
ITC ticks: 595,530,914, # user accesses: 5,120,000, ITC ticks / access: 116

I.e. it takes 109 ... 120 ITC ticks to execute the innermost loop above +
the fast path of the VHPT miss handler on a Tiger-like machine with CPUs
like:

processor  : 2
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 2
revision   : 1
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 1700.000554
itc MHz    : 1700.554956
BogoMIPS   : 2547.71
siblings   : 1

With my patch:

ITC ticks: 768,590,910, # user accesses: 5,120,000, ITC ticks / access: 150
ITC ticks: 727,652,674, # user accesses: 5,120,000, ITC ticks / access: 142
ITC ticks: 743,272,209, # user accesses: 5,120,000, ITC ticks / access: 145
ITC ticks: 747,918,306, # user accesses: 5,120,000, ITC ticks / access: 146
ITC ticks: 743,176,995, # user accesses: 5,120,000, ITC ticks / access: 145
ITC ticks: 728,062,632, # user accesses: 5,120,000, ITC ticks / access: 142
ITC ticks: 745,091,227, # user accesses: 5,120,000, ITC ticks / access: 145
ITC ticks: 733,728,213, # user accesses: 5,120,000, ITC ticks / access: 143
ITC ticks: 747,965,619, # user accesses: 5,120,000, ITC ticks / access: 146
ITC ticks: 728,251,497, # user accesses: 5,120,000, ITC ticks / access: 142

I have 142 ... 150 ITC ticks.

Notes:
- a simple memory access on this test machine costs 340 ITC ticks
   (and usually we do some memory accesses :-))
- not all the loads / stores provoke TLB misses when executing an
   average program
- not all the TLB misses provoke VHPT misses

A real-life application will not suffer any performance degradation.

Thanks,

Zoltan


--- linux-2.6.16.9/arch/ia64/mm/fault.c	2006-04-19 08:10:14.000000000 +0200
+++ new/arch/ia64/mm/fault.c	2006-04-21 11:21:47.000000000 +0200
@@ -51,8 +51,16 @@ mapped_kernel_page_is_present (unsigned 
 	return pte_present(pte);
 }
 
+/*
+ * The low level handlers take "current->mm->mmap_sem" for read if user region
+ * (0 ... 4) fault has happened, see "mmap_sem_taken_for_read".
+ * This flag is DONT-CARE for the faults from the region 5.
+ * "current->mm->mmap_sem" is never taken by the low level handlers for the
+ * faults from the region 5.
+ */
 void __kprobes
-ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *regs)
+ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *regs,
+							int mmap_sem_taken_for_read)
 {
 	int signal = SIGSEGV, code = SEGV_MAPERR;
 	struct vm_area_struct *vma, *prev_vma;
@@ -60,10 +68,16 @@ ia64_do_page_fault (unsigned long addres
 	struct siginfo si;
 	unsigned long mask;
 
+	if (mmap_sem_taken_for_read && (REGION_NUMBER(address) < 5)){
+		BUG_ON(__pa(&current->mm->mmap_sem) != ia64_get_kr(IA64_KR_MM_SEM));
+	BUG_ON((current->mm->mmap_sem.count & RWSEM_ACTIVE_MASK) == 0);
+	}
 	/*
 	 * If we're in an interrupt or have no user context, we must not take the fault..
 	 */
-	if (in_atomic() || !mm)
+	if (in_atomic())
+		goto no_context_sem;
+	if (!mm)
 		goto no_context;
 
 #ifdef CONFIG_VIRTUAL_MEM_MAP
@@ -82,10 +96,14 @@ ia64_do_page_fault (unsigned long addres
 	 * This is to handle the kprobes on user space access instructions
 	 */
 	if (notify_die(DIE_PAGE_FAULT, "page fault", regs, code, TRAP_BRKPT,
-					SIGSEGV) == NOTIFY_STOP)
+					SIGSEGV) == NOTIFY_STOP){
+		if (mmap_sem_taken_for_read && (REGION_NUMBER(address) < 5))
+			up_read(&mm->mmap_sem);
 		return;
+	}
 
-	down_read(&mm->mmap_sem);
+	if (!mmap_sem_taken_for_read || (REGION_NUMBER(address) == 5))
+		down_read(&mm->mmap_sem);
 
 	vma = find_vma_prev(mm, address, &prev_vma);
 	if (!vma)
@@ -197,6 +215,9 @@ ia64_do_page_fault (unsigned long addres
 		return;
 	}
 
+  no_context_sem:
+  	if (mmap_sem_taken_for_read && (REGION_NUMBER(address) < 5))
+		up_read(&mm->mmap_sem);
   no_context:
 	if ((isr & IA64_ISR_SP)
 	    || ((isr & IA64_ISR_NA) && (isr & IA64_ISR_CODE_MASK) == IA64_ISR_CODE_LFETCH))
@@ -251,3 +272,12 @@ ia64_do_page_fault (unsigned long addres
 		do_exit(SIGKILL);
 	goto no_context;
 }
+
+
+void my_rwsem_wake(struct rw_semaphore *sem)
+{
+	printk("\nmy_rwsem_wake(%p) %p %p\n", sem, current, &current->mm);
+	BUG_ON(&current->mm->mmap_sem != sem);
+	rwsem_wake(sem);
+}
+
--- linux-2.6.16.9/arch/ia64/kernel/ivt.S	2006-04-19 08:10:14.000000000 +0200
+++ new/arch/ia64/kernel/ivt.S	2006-04-20 17:52:58.000000000 +0200
@@ -80,6 +80,10 @@
 
 	.align 32768	// align on 32KB boundary
 	.global ia64_ivt
+
+#define	pUsr		p17			// User regions
+#define	pR5		p16			// Region 5 fault
+
 ia64_ivt:
 /////////////////////////////////////////////////////////////////////////////////////////
 // 0x0000 Entry 0 (size 64 bundles) VHPT Translation (8,20,47)
@@ -112,10 +116,13 @@ ENTRY(vhpt_miss)
 	rsm psr.dt				// use physical addressing for data
 	mov r31=pr				// save the predicate registers
 	mov r19=IA64_KR(PT_BASE)		// get page table base address
-	shl r21=r16,3				// shift bit 60 into sign bit
+#ifdef CONFIG_SMP
+	mov r30 = IA64_KR(MM_SEM)			// &current->mm->mmap_sem
+#endif
+	shl r20=r16,3				// shift bit 60 into sign bit
 	shr.u r17=r16,61			// get the region number into r17
 	;;
-	shr.u r22=r21,3
+	shr.u r22=r20,3
 #ifdef CONFIG_HUGETLB_PAGE
 	extr.u r26=r25,2,6
 	;;
@@ -126,20 +133,47 @@ ENTRY(vhpt_miss)
 (p8)	shr r22=r22,r27
 #endif
 	;;
-	cmp.eq p6,p7=5,r17			// is IFA pointing into to region 5?
+	cmp.eq pR5,pUsr=5,r17			// is IFA pointing into to region 5?
 	shr.u r18=r22,PGDIR_SHIFT		// get bottom portion of pgd index bit
 	;;
-(p7)	dep r17=r17,r19,(PAGE_SHIFT-3),3	// put region number bits in place
+(pUsr)	dep r17=r17,r19,(PAGE_SHIFT-3),3	// put region number bits in place
 
 	srlz.d
-	LOAD_PHYSICAL(p6, r19, swapper_pg_dir)	// region 5 is rooted at swapper_pg_dir
+	LOAD_PHYSICAL(pR5, r19, swapper_pg_dir)	// region 5 is rooted at swapper_pg_dir
 
-	.pred.rel "mutex", p6, p7
-(p6)	shr.u r21=r21,PGDIR_SHIFT+PAGE_SHIFT
-(p7)	shr.u r21=r21,PGDIR_SHIFT+PAGE_SHIFT-3
+#ifdef CONFIG_SMP
+1:	/*
+	 * If (pUsr: user region fault){
+	 *	__s64 r27, r28, *r30 = &current->mm->mmap_sem;
+	 *
+	 *	//
+	 *	// while (unlikely(__down_read_trylock(&current->mm->mmap_sem) == 0));
+	 *	//
+	 * 	do {
+	 *		while (unlikely((r27 = ia64_ld8_bias_nta(&r30->count)) < 0));
+	 *		r28 = r27 + 1;
+	 *	} while (unlikely(ia64_cmpxchg8_acq_nta(&r30->count, r28, r27) != r27));
+	 * }
+	 */
+(pUsr)	ld8.bias.nta r27 = [r30]
 	;;
-(p6)	dep r17=r18,r19,3,(PAGE_SHIFT-3)	// r17=pgd_offset for region 5
-(p7)	dep r17=r18,r17,3,(PAGE_SHIFT-6)	// r17=pgd_offset for region[0-4]
+(pUsr)	cmp.lt.unc p8, p0 = r27, r0
+(p8)	br.cond.spnt.few 1b
+(pUsr)	mov.m ar.ccv = r27
+(pUsr)	adds r28 = 1, r27
+	;;
+(pUsr)	cmpxchg8.acq.nta r28 = [r30], r28, ar.ccv
+	;;
+(pUsr)	cmp.eq.unc p0, p8 = r28, r27
+(p8)	br.cond.spnt.few 1b
+#endif	// CONFIG_SMP
+
+	.pred.rel "mutex", pR5, pUsr
+(pR5)	shr.u r21=r20,PGDIR_SHIFT+PAGE_SHIFT
+(pUsr)	shr.u r21=r20,PGDIR_SHIFT+PAGE_SHIFT-3
+	;;
+(pR5)	dep r17=r18,r19,3,(PAGE_SHIFT-3)	// r17=pgd_offset for region 5
+(pUsr)	dep r17=r18,r17,3,(PAGE_SHIFT-6)	// r17=pgd_offset for region[0-4]
 	cmp.eq p7,p6=0,r21			// unused address bits all zeroes?
 #ifdef CONFIG_PGTABLE_4
 	shr.u r28=r22,PUD_SHIFT			// shift pud index into position
@@ -179,7 +213,7 @@ ENTRY(vhpt_miss)
 	;;
 (p10)	itc.i r18				// insert the instruction TLB entry
 (p11)	itc.d r18				// insert the data TLB entry
-(p6)	br.cond.spnt.many page_fault		// handle bad address/page not present (page fault)
+(p6)	br.cond.spnt.many page_fault_sem	// handle bad address/page not present (page fault)
 	mov cr.ifa=r22
 
 #ifdef CONFIG_HUGETLB_PAGE
@@ -197,11 +231,12 @@ ENTRY(vhpt_miss)
 	;;
 #ifdef CONFIG_SMP
 	/*
-	 * Tell the assemblers dependency-violation checker that the above "itc" instructions
-	 * cannot possibly affect the following loads:
+	 * Make sure the itc.* above is visible to remote purges (e.g. ptc.ga)
+	 * before we re-read the PTE.
+	 * As we have just inserted a new translation via itc.i / itc.d, there
+	 * is no need for srlz.i: the rfi below will do the serialization.
 	 */
-	dv_serialize_data
-
+(p7)	srlz.d
 	/*
 	 * Re-check pagetable entry.  If they changed, we may have received a ptc.g
 	 * between reading the pagetable and the "itc".  If so, flush the entry we
@@ -214,25 +249,45 @@ ENTRY(vhpt_miss)
 	 * r29 = *pud
 	 * r20 = *pmd
 	 * r18 = *pte
+	 *
+	 * There is no need to re-read *pgd, *pud, *pmd because of having "mmap_sem".
+	 * The PTE page translation is guaranteed to be correct.
+	 * The swapper does not take this semaphore for write => do re-check *pte.
 	 */
 	ld8 r25=[r21]				// read *pte again
-	ld8 r26=[r17]				// read *pmd again
-#ifdef CONFIG_PGTABLE_4
-	ld8 r19=[r28]				// read *pud again
-#endif
-	cmp.ne p6,p7=r0,r0
-	;;
-	cmp.ne.or.andcm p6,p7=r26,r20		// did *pmd change
-#ifdef CONFIG_PGTABLE_4
-	cmp.ne.or.andcm p6,p7=r19,r29		// did *pud change
-#endif
 	mov r27=PAGE_SHIFT<<2
 	;;
-(p6)	ptc.l r22,r27				// purge PTE page translation
-(p7)	cmp.ne.or.andcm p6,p7=r25,r18		// did *pte change
+	cmp.ne p6,p0=r25,r18			// did *pte change?
 	;;
 (p6)	ptc.l r16,r27				// purge translation
-#endif
+	/*
+	 * Tell the assemblers dependency-violation checker that the above "ptc" instruction
+	 * cannot possibly affect the following loads/stores.
+	 */
+	dv_serialize_data
+
+1:	/*
+	 * If (pUsr: user region fault){
+	 *	__s64 r27, r28, *r30 = &current->mm->mmap_sem;
+	 *
+	 *	//
+	 *	// __up_read(&current->mm->mmap_sem);
+	 *	//
+	 * 	r28 = ia64_fetchadd8_rel_nta(&r30->count, -1);
+	 *	r27 = r28 - 1;
+	 * 	if (unlikely(r28 < 0))
+	 *		if (unlikely((r27 & RWSEM_ACTIVE_MASK) == 0))
+	 *			rwsem_wake(sem);
+	 * }
+	 */
+(pUsr)	fetchadd8.rel.nta r28 = [r30], -1
+	;;
+(pUsr)	adds r27 = -1, r28
+(pUsr)	cmp.lt.unc p8, p0 = r28, r0
+(p8)	br.cond.spnt.few call_sem_wake_up
+	;;
+back_to_vhpt_miss:				// No need to call "rwsem_wake()"
+#endif	// CONFIG_SMP
 
 	mov pr=r31,-1				// restore predicate registers
 	rfi
@@ -282,6 +337,60 @@ ENTRY(itlb_miss)
 	rfi
 END(itlb_miss)
 
+
+#ifdef CONFIG_SMP
+	/*
+	 * Arriving from "vhpt_miss":
+	 *
+	 * ...
+	 * If (pUsr: user mode fault){
+	 *	__s64 r27, r28, *r30 = &current->mm->mmap_sem;
+	 *
+	 *	//
+	 *	// __up_read(&current->mm->mmap_sem);
+	 *	//
+	 * 	r28 = ia64_fetchadd8_rel_nta(&r30->count, -1);
+	 *	r27 = r28 - 1;
+	 * 	if (unlikely(r28 < 0))
+	 *
+	 *		//
+	 *		// *** We are here: ***
+	 *		//
+	 *		if (unlikely((r27 & RWSEM_ACTIVE_MASK) == 0))
+	 *			rwsem_wake(sem);
+	 * }
+	 * ...
+	 */
+call_sem_wake_up:
+	cmp4.eq p0, p8 = 0, r27
+(p8)	br.cond.sptk.few back_to_vhpt_miss
+	;;
+	/*
+	 * Call "my_rwsem_wake(IA64_KR(MM_SEM))". Predicates are in r31, psr.dt is off.
+	 */
+	ssm psr.dt
+	;;
+	srlz.d
+	;;
+	SAVE_MIN_WITH_COVER
+	alloc r15 = ar.pfs, 0, 0, 1, 0
+	mov out0 = IA64_KR(MM_SEM)		// &current->mm->mmap_sem
+	adds r3=8, r2				// set up second base pointer
+	;;
+	ssm psr.ic | PSR_DEFAULT_BITS
+	;;
+	srlz.i					// guarantee that interruption collection is on
+	;;
+(p15)	ssm psr.i				// restore psr.i
+	movl r14 = ia64_leave_kernel
+	;;
+	SAVE_REST
+	mov rp = r14
+	;;
+	br.call.sptk.many b6 = my_rwsem_wake	// ignore return address
+#endif	// CONFIG_SMP
+
+
 	.org ia64_ivt+0x0800
 /////////////////////////////////////////////////////////////////////////////////////////
 // 0x0800 Entry 2 (size 64 bundles) DTLB (9,48)
@@ -326,6 +434,21 @@ dtlb_fault:
 	rfi
 END(dtlb_miss)
 
+
+	//-----------------------------------------------------------------------------------
+	// call do_page_fault (predicates are in r31, psr.dt may be off, r16 is faulting address)
+	// We DON'T have "mmap_sem" for read
+ENTRY(page_fault)
+	ssm psr.dt
+	;;
+	srlz.i
+	;;
+	SAVE_MIN_WITH_COVER
+	mov r15 = r0				// We DON'T have "mmap_sem" for read
+	br.sptk.few page_fault_common
+END(page_fault)
+
+
 	.org ia64_ivt+0x0c00
 /////////////////////////////////////////////////////////////////////////////////////////
 // 0x0c00 Entry 3 (size 64 bundles) Alt ITLB (19)
@@ -499,32 +622,39 @@ ENTRY(ikey_miss)
 	FAULT(6)
 END(ikey_miss)
 
+
 	//-----------------------------------------------------------------------------------
 	// call do_page_fault (predicates are in r31, psr.dt may be off, r16 is faulting address)
-ENTRY(page_fault)
+	// If user region fault has happened => we keep "mmap_sem" for read
+ENTRY(page_fault_sem)
 	ssm psr.dt
 	;;
 	srlz.i
 	;;
 	SAVE_MIN_WITH_COVER
-	alloc r15=ar.pfs,0,0,3,0
-	mov out0=cr.ifa
-	mov out1=cr.isr
-	adds r3=8,r2				// set up second base pointer
+	add r15 = 1, r0				// If user region fault has happened =>
+						//	We have "mmap_sem" for read
+page_fault_common:
+	alloc r14 = ar.pfs, 0, 0, 4, 0
+	mov out0 = cr.ifa
+	mov out1 = cr.isr
+	adds r3 = 8, r2				// set up second base pointer
 	;;
 	ssm psr.ic | PSR_DEFAULT_BITS
 	;;
 	srlz.i					// guarantee that interruption collectin is on
 	;;
 (p15)	ssm psr.i				// restore psr.i
-	movl r14=ia64_leave_kernel
+	movl r14 = ia64_leave_kernel
 	;;
 	SAVE_REST
-	mov rp=r14
+	mov rp = r14
 	;;
-	adds out2=16,r12			// out2 = pointer to pt_regs
+	adds out2 = 16, r12			// out2 = pointer to pt_regs
+	mov out3 = r15				// Is "mmap_sem" taken for read?
 	br.call.sptk.many b6=ia64_do_page_fault	// ignore return address
-END(page_fault)
+END(page_fault_sem)
+
 
 	.org ia64_ivt+0x1c00
 /////////////////////////////////////////////////////////////////////////////////////////
--- linux-2.6.16.9/include/asm-ia64/kregs.h	2006-04-20 10:36:46.000000000 +0200
+++ new/include/asm-ia64/kregs.h	2006-04-20 14:53:14.000000000 +0200
@@ -14,6 +14,7 @@
  */
 #define IA64_KR_IO_BASE		0	/* ar.k0: legacy I/O base address */
 #define IA64_KR_TSSD		1	/* ar.k1: IVE uses this as the TSSD */
+#define	IA64_KR_MM_SEM		2	/* ar.k2: ->mmap_sem address (physical) */
 #define IA64_KR_PER_CPU_DATA	3	/* ar.k3: physical per-CPU base */
 #define IA64_KR_CURRENT_STACK	4	/* ar.k4: what's mapped in IA64_TR_CURRENT_STACK */
 #define IA64_KR_FPU_OWNER	5	/* ar.k5: fpu-owner (UP only, at the moment) */
--- linux-2.6.16.9/include/asm-ia64/mmu_context.h	2006-04-20 10:36:46.000000000 +0200
+++ new/include/asm-ia64/mmu_context.h	2006-04-20 14:59:18.000000000 +0200
@@ -192,6 +192,7 @@ activate_mm (struct mm_struct *prev, str
 	 * handlers cannot touch user-space.
 	 */
 	ia64_set_kr(IA64_KR_PT_BASE, __pa(next->pgd));
+	ia64_set_kr(IA64_KR_MM_SEM, __pa(&next->mmap_sem));
 	activate_context(next);
 }
 


* Re: Protect PGD...PTE walking in ivt.S
From: Zoltan Menyhart @ 2006-04-21 14:21 UTC
  To: linux-ia64

The test program:

/*
* This test aims to provoke VHPT misses.
*
* In order to maximize the number of the TLB entries needed for the virtually mapped
* PTE pages, only one valid PTE / page will be used (and a single byte per data page).
* The TLB pressure will be multiplied by "N" in order not to leave any
* practical chance for a TLB hit.
*
* Remember to allow memory to be over-committed: "sysctl -w vm.overcommit_memory=1".
*
* ...there is no control over the physical addresses of the user pages... there will
* be some variation in the results from run to run due to address aliasing...
*
* ...the interrupts are enabled...
*/


#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>		/* exit() */
#include <unistd.h>		/* getpagesize() */
#include <sys/mman.h>


#define	N			8	// TLB pressure multiplier
#define	N_RUNS			10
#define	N_EXTERNAL_LOOPS	10
#define	N_INTERNAL_LOOPS	1000

#define	CACHE_BYTES		128
#define	N_TLB_ENTRIES		128

// Not counting the kernel's DTRs...
#define	N_PTE_PAGE_U_PAGE_PAIRS	(N_TLB_ENTRIES / 2 * N)
#define	PTE_PER_CACHE_BYTES	(CACHE_BYTES / sizeof(unsigned long))

unsigned int pagesize;

#define	PTES_PER_PAGE		(pagesize / sizeof(void *))
#define	VA_RANGE_PER_PTE_PAGE	(PTES_PER_PAGE * pagesize)

/*
* Cache line "coloring" in order to reduce cache alias problems.
* (Yet as there is no control over the physical addresses of the user pages...)
* The i-th element touched by the innermost loop will be at the offset in the page:
* (i * 2 + 1) * CACHE_BYTES
* The PTE mapping this test data will be at the offset in its page:
* i * 2 * CACHE_BYTES
*/
#define	OFFSET			(VA_RANGE_PER_PTE_PAGE + \
				2 * PTE_PER_CACHE_BYTES * pagesize + \
				2 * CACHE_BYTES)
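/*
* For illustration, assuming a 16 Kbyte page size (the program reads the
* real value via getpagesize()): PTES_PER_PAGE = 2048,
* VA_RANGE_PER_PTE_PAGE = 32 Mbytes, PTE_PER_CACHE_BYTES = 16, hence
* OFFSET = 32 Mbytes + 512 Kbytes + 256 bytes.
*/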
unsigned int offset;

#define	TEST_VA_RANGE		((unsigned long) VA_RANGE_PER_PTE_PAGE * \
						N_PTE_PAGE_U_PAGE_PAIRS + \
				/* Cache line "coloring": */ \
				2 * N_PTE_PAGE_U_PAGE_PAIRS * \
						PTE_PER_CACHE_BYTES * pagesize + \
				2 * N_PTE_PAGE_U_PAGE_PAIRS * CACHE_BYTES)

#define	MEM_PROT		(PROT_READ | PROT_WRITE)
#define	MEM_TYPE		(MAP_PRIVATE | MAP_ANONYMOUS)

#define GET_ITC()						\
({								\
	unsigned long ia64_intri_res;				\
								\
	asm volatile ("mov %0=ar.itc" : "=r"(ia64_intri_res));	\
	ia64_intri_res;						\
})


inline unsigned long test(char *p0, unsigned int n)
{
	int		i, j;
	char		*p;
	unsigned long	itc0 = GET_ITC();

	for (j = 0; j < n; j++)
		for (i = 0, p = p0 + CACHE_BYTES;	// Cache line "coloring"
				i < N_PTE_PAGE_U_PAGE_PAIRS; i++, p += offset)
			*p = i;
	return GET_ITC() - itc0;
}


void main_test(void)
{
	void		*p;
	int		loop;
	unsigned long	tmp, sum = 0;

	p = mmap(NULL, TEST_VA_RANGE, MEM_PROT, MEM_TYPE, -1, 0);
	if (p == MAP_FAILED){
		perror("mmap");
		fprintf(stderr, "Have you allowed to over-commit memory? "
				"Try: \"sysctl -w vm.overcommit_memory=1\"\n");
		exit(1);
	}
	(void) test(p, 1);				// Warm up...
	for (loop = 0; loop < N_EXTERNAL_LOOPS; loop++)
		sum += test(p, N_INTERNAL_LOOPS);
	(void) munmap(p, TEST_VA_RANGE);
	printf("ITC ticks: %ld,%03ld,%03ld,",
				sum / 1000000, sum / 1000 % 1000, sum % 1000);
	tmp = (unsigned long) N_EXTERNAL_LOOPS * N_INTERNAL_LOOPS *
							N_PTE_PAGE_U_PAGE_PAIRS;
	printf(" # user accesses: %ld,%03ld,%03ld, ITC ticks / access: %ld\n",
			tmp / 1000000, tmp / 1000 % 1000, tmp % 1000, sum / tmp);
}


int main(int argc, char *argv[])
{
	int		run;
	unsigned long	tmp;

	pagesize = getpagesize();
	offset = OFFSET;			// Depends on "getpagesize()"
	printf("Page size =\t\t\t%12d\n", pagesize);
	printf("PTEs / page =\t\t\t%12d\n", PTES_PER_PAGE);
	printf("VA range / PTE page =\t\t%12d (%d Mbytes)\n",
				VA_RANGE_PER_PTE_PAGE, VA_RANGE_PER_PTE_PAGE >> 20);
	printf("# PTE-page / user-page pairs =\t%12d\n", N_PTE_PAGE_U_PAGE_PAIRS);
	printf("Test VA range =\t\t\t%12ld (%ld Gbytes)\n",
				TEST_VA_RANGE, TEST_VA_RANGE >> 30);
	tmp = N_PTE_PAGE_U_PAGE_PAIRS * pagesize;
	printf("Resident user memory =\t\t%12ld (%ld Mbytes)\n", tmp, tmp >> 20);
	printf("# PTE + user cache lines =\t%12d (%d Kbytes)\n",
				N_PTE_PAGE_U_PAGE_PAIRS * 2,
				N_PTE_PAGE_U_PAGE_PAIRS * 2 * CACHE_BYTES >> 10);
	printf("\n");
	for (run = 0; run < N_RUNS; run++)
		main_test();
	return 0;
}




* Re: Protect PGD...PTE walking in ivt.S
From: Christoph Lameter @ 2006-04-21 17:13 UTC
  To: linux-ia64

On Fri, 21 Apr 2006, Zoltan Menyhart wrote:

> - it scales well

mmap_sem is the major performance bottleneck for page faults at this time 
because it creates a bouncing cacheline. Figure out a way to avoid that to 
increase performance.

> 4. On one hand, as "free_pgtables()" cannot run in the mean time,
>   there is no need to re-read pgd[i] -> pud[j] -> pmd[k].
>   On the other hand, the swapper is not excluded => we keep re-checking
>   pte[l] and purging the freshly inserted TLB entry if it has been changed.

The walking of pgd/pud/pmd can happen anytime through the mmu on many
architectures. It is therefore not protectable by any lock. 

free_pgtables() is run when all processes in a mm_struct have terminated 
or when all vmas from an area have been removed.

There can be no one using the page tables.


* Re: Protect PGD...PTE walking in ivt.S
From: Zoltan Menyhart @ 2006-04-21 18:08 UTC
  To: linux-ia64

Christoph Lameter wrote:
> On Fri, 21 Apr 2006, Zoltan Menyhart wrote:
> 
>>- it scales well
> 
> mmap_sem is the major performance bottleneck for page faults at this time 
> because it creates a bouncing cacheline. Figure out a way to avoid that to 
> increase performance.

Have you got some numbers?
Can I have a copy of your test program, please?

>>4. On one hand, as "free_pgtables()" cannot run in the mean time,
>>  there is no need to re-read pgd[i] -> pud[j] -> pmd[k].
>>  On the other hand, the swapper is not excluded => we keep re-checking
>>  pte[l] and purging the freshly inserted TLB entry if it has been changed.
> 
> The walking of pgd/pud/pmd can happen anytime through the mmu on many
> architectures. It is therefore not protectable by any lock. 

I do not know much about the other architectures.

As TLB loading is not the same mechanism on all architectures,
architecture-dependent solutions can be applied.

If PGD...PTE walking does not need protection / cannot be protected on
other architectures, then it is not a strong argument about the ia64's
behavior.

We were talking about this issue in the thread "accessed/dirty bit
handler tuning". I thought we had agreed that this PGD...PTE walking
is unsafe.
Locking _is_ a logically correct solution.
Any other suggestion?

> free_pgtables() is run when all processes in a mm_struct have terminated 
> or when all vmas from an area have been removed.
> 
> There can be no one using the page tables.

Please refer to "unmap_region()".

Thanks,

Zoltan




* Re: Protect PGD...PTE walking in ivt.S
From: Christoph Lameter @ 2006-04-21 18:54 UTC
  To: linux-ia64

On Fri, 21 Apr 2006, Zoltan Menyhart wrote:

> Christoph Lameter wrote:
> > On Fri, 21 Apr 2006, Zoltan Menyhart wrote:
> > 
> > > - it scales well
> > 
> > mmap_sem is the major performance bottleneck for page faults at this time
> > because it creates a bouncing cacheline. Figure out a way to avoid that to
> > increase performance.
> 
> Have you got some numbers?
> Can I have a copy of your test program, please?

Test program is available from 
ftp.kernel.org/pub/linux/kernel/people/christoph/ptl/pft.c. See the 
discussion of my page fault scalability patches on lkml in 2004/2005.

> > The walking of pgd/pud/pmd can happen anytime through the mmu on many
> > architectures. It is therefore not protectable by any lock. 
> 
> I do not know much about the other architectures.

They are relevant since they determine core linux functionality.

> If PGD...PTE walking does not need protection / cannot be protected on
> other architectures, then it is not a strong argument about the ia64's
> behavior.

Of course. This means that the core kernel was designed in such a way to 
cope with this issue.
 
> We were talking about this issue in the thread "accessed/dirty bit
> handler tuning". I thought we had agreed that this PGD...PTE walking
> is unsafe.

I did?

> Locking _is_ a logically correct solution.
> Any other suggestion?

Drop the non issue.

> > free_pgtables() is run when all processes in a mm_struct have terminated or
> > when all vmas from an area have been removed.
> > 
> > There can be no one using the page tables.
> 
> Please refer to "unmap_region()".

Only run after the vmas have been removed. Any fault will lead to an 
access error.

