From: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
To: linux-ia64@vger.kernel.org
Subject: Protect PGD...PTE walking in ivt.S
Date: Fri, 21 Apr 2006 14:12:43 +0000 [thread overview]
Message-ID: <4448E85B.5080906@bull.net> (raw)
The "rx = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]" chain has to be
protected against "free_pgtables()".
"free_pgtables()" is protected by "mmap_sem" taken for write, therefore
taking this semaphore for read in the low level assembly routines will
provide us with enough protection:
- it can be taken for read almost all the time
- it scales well
- the performance is not degraded (see below)
I prepared a patch for the VHPT miss handler only. Later
I can add the rest. Here is how it goes:
0. Only the user region faults are protected.
(You should not try to remove an in-use 0xa000... mapping.)
1. "activate_mm()" remembers the physical address of the "mmap_sem" of
the task selected to run in "ar.k2" (IA64_KR_MM_SEM).
2. On entry, "vhpt_miss" takes "mmap_sem" for read:
while (unlikely(__down_read_trylock(&current->mm->mmap_sem) == 0));
3. If there is no valid PTE => go to "page_fault_sem" (see later).
4. On one hand, as "free_pgtables()" cannot run in the meantime,
there is no need to re-read pgd[i] -> pud[j] -> pmd[k].
On the other hand, the swapper is not excluded => we keep re-checking
pte[l] and purging the freshly inserted TLB entry if it has been changed.
5. Releasing "mmap_sem" is a bit tricky:
__up_read(&current->mm->mmap_sem);
On the fast path, we can "rfi".
6. If someone is waiting for "mmap_sem", then we have to switch the
translation on, to establish the "C" environment and to call
"rwsem_wake()".
7. "page_fault_sem": if we have "mmap_sem", then it is more efficient
not to release it, because "ia64_do_page_fault()" would take it.
Therefore a new argument, called "mmap_sem_taken_for_read", is added.
(This flag is DONT-CARE for faults from region 5.)
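The trylock / release fast paths of steps 2, 5 and 6 can be sketched in plain C. This is a single-threaded stand-in, not the kernel's rwsem: "sim_rwsem", "sim_down_read_trylock" and "sim_up_read" are hypothetical names, and the real handler uses "cmpxchg8.acq" / "fetchadd8.rel" instead of plain loads and stores:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reader count convention of the 2.6-era rwsem: count >= 0 means no
 * writer; each reader adds 1; a writer or a waiter drives count negative.
 */
#define SIM_ACTIVE_MASK 0xffffffffLL	/* low word: active reader/writer count */

struct sim_rwsem { int64_t count; };

/* Step 2: the handler spins on this until the read lock is obtained. */
static int sim_down_read_trylock(struct sim_rwsem *sem)
{
	int64_t old = sem->count;
	if (old < 0)			/* writer active or waiting: fail */
		return 0;
	sem->count = old + 1;		/* stands in for cmpxchg8.acq */
	return 1;
}

/* Steps 5/6: release; returns 1 when rwsem_wake() would have to be called. */
static int sim_up_read(struct sim_rwsem *sem)
{
	int64_t old = sem->count;	/* stands in for fetchadd8.rel */
	sem->count = old - 1;
	return old < 0 && ((old - 1) & SIM_ACTIVE_MASK) == 0;
}
```

With count = 0 the trylock succeeds immediately; with a writer waiting (count biased negative) it fails, and the last reader to drop out is the one that has to wake the waiter.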
Non-regression test:
I wrote a "wicked" test that provokes VHPT misses while all the
loads / stores hit the L1D / L2 caches.
In order to maximize the number of TLB entries needed for the
virtually mapped PTE pages, only one valid PTE per page is used
(and a single byte per data page).
8 times more TLB entries will be solicited than what is really available:
512 data pages + 512 PTE pages => systematic VHPT misses.
The innermost loop is as follows:
1: st1 [r15]=r16 // DTLB miss, then L2 hit
adds r16=1,r16
ld4 r24=[r18] ;; // L1D cache hit
add r15=r15,r24
br.cloop.sptk.few 1b
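For reference, a C rendering of that inner loop ("bench_loop", "buf", "stride" and "n" are stand-in names; the real test times the loop with the ITC and uses a stride of one page so every store touches a new PTE):

```c
/*
 * Each iteration stores one byte (the access that takes the DTLB/VHPT
 * miss), then loads the stride from a cache-hot location and advances
 * the store pointer, mirroring:
 *   st1 [r15]=r16 ; adds r16=1,r16 ; ld4 r24=[r18] ; add r15=r15,r24
 */
static void bench_loop(unsigned char *buf, long n, const int *stride)
{
	unsigned char v = 0;		/* r16: byte value to store */
	for (long i = 0; i < n; i++) {	/* br.cloop with ar.lc = n - 1 */
		*buf = v++;		/* st1 [r15]=r16 */
		buf += *stride;		/* ld4 r24=[r18]; add r15=r15,r24 */
	}
}
```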
The original linux-2.6.16.9 with the missing "srlz.d" added into the
VHPT miss handler:
ITC ticks: 615,846,302, # user accesses: 5,120,000, ITC ticks / access: 120
ITC ticks: 593,729,267, # user accesses: 5,120,000, ITC ticks / access: 115
ITC ticks: 558,319,090, # user accesses: 5,120,000, ITC ticks / access: 109
ITC ticks: 596,911,467, # user accesses: 5,120,000, ITC ticks / access: 116
ITC ticks: 558,342,475, # user accesses: 5,120,000, ITC ticks / access: 109
ITC ticks: 602,162,489, # user accesses: 5,120,000, ITC ticks / access: 117
ITC ticks: 558,207,935, # user accesses: 5,120,000, ITC ticks / access: 109
ITC ticks: 612,523,245, # user accesses: 5,120,000, ITC ticks / access: 119
ITC ticks: 615,927,942, # user accesses: 5,120,000, ITC ticks / access: 120
ITC ticks: 595,530,914, # user accesses: 5,120,000, ITC ticks / access: 116
I.e. it takes 109 ... 120 ITC ticks to execute the innermost loop above +
the fast path of the VHPT miss handler on a Tiger-like machine with CPUs
like:
processor : 2
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 2
revision : 1
archrev : 0
features : branchlong
cpu number : 0
cpu regs : 4
cpu MHz : 1700.000554
itc MHz : 1700.554956
BogoMIPS : 2547.71
siblings : 1
With my patch:
ITC ticks: 768,590,910, # user accesses: 5,120,000, ITC ticks / access: 150
ITC ticks: 727,652,674, # user accesses: 5,120,000, ITC ticks / access: 142
ITC ticks: 743,272,209, # user accesses: 5,120,000, ITC ticks / access: 145
ITC ticks: 747,918,306, # user accesses: 5,120,000, ITC ticks / access: 146
ITC ticks: 743,176,995, # user accesses: 5,120,000, ITC ticks / access: 145
ITC ticks: 728,062,632, # user accesses: 5,120,000, ITC ticks / access: 142
ITC ticks: 745,091,227, # user accesses: 5,120,000, ITC ticks / access: 145
ITC ticks: 733,728,213, # user accesses: 5,120,000, ITC ticks / access: 143
ITC ticks: 747,965,619, # user accesses: 5,120,000, ITC ticks / access: 146
ITC ticks: 728,251,497, # user accesses: 5,120,000, ITC ticks / access: 142
I.e. 142 ... 150 ITC ticks per access.
Notes:
- a simple memory access on this test machine costs 340 ITC ticks
(and usually we do some memory accesses :-))
- not all the loads / stores provoke TLB misses when executing an
average program
- not all the TLB misses provoke VHPT misses
A real-life application will not suffer any performance degradation.
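As a sanity check on that conclusion, the arithmetic can be written out (midpoint values assumed from the runs above; the function names are mine, for illustration only):

```c
/*
 * Worst-case estimate: ~30 extra ticks per VHPT miss. Even if *every*
 * access took a VHPT miss, that is under 9% of the cost of one real
 * memory access (340 ITC ticks on this machine).
 */
static int vhpt_miss_overhead_ticks(void)
{
	int orig = 115;		/* midpoint of 109 ... 120 */
	int patched = 145;	/* midpoint of 142 ... 150 */
	return patched - orig;
}

static double worst_case_fraction(void)
{
	return (double)vhpt_miss_overhead_ticks() / 340;
}
```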
Thanks,
Zoltan
--- linux-2.6.16.9/arch/ia64/mm/fault.c 2006-04-19 08:10:14.000000000 +0200
+++ new/arch/ia64/mm/fault.c 2006-04-21 11:21:47.000000000 +0200
@@ -51,8 +51,16 @@ mapped_kernel_page_is_present (unsigned
return pte_present(pte);
}
+/*
+ * The low level handlers take "current->mm->mmap_sem" for read if a user
+ * region (0 ... 4) fault has happened, see "mmap_sem_taken_for_read".
+ * This flag is DONT-CARE for faults from region 5:
+ * "current->mm->mmap_sem" is never taken by the low level handlers for
+ * faults from region 5.
+ */
void __kprobes
-ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *regs)
+ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *regs,
+ int mmap_sem_taken_for_read)
{
int signal = SIGSEGV, code = SEGV_MAPERR;
struct vm_area_struct *vma, *prev_vma;
@@ -60,10 +68,16 @@ ia64_do_page_fault (unsigned long addres
struct siginfo si;
unsigned long mask;
+ if (mmap_sem_taken_for_read && (REGION_NUMBER(address) < 5)){
+ BUG_ON(__pa(&current->mm->mmap_sem) != ia64_get_kr(IA64_KR_MM_SEM));
+ BUG_ON((current->mm->mmap_sem.count & RWSEM_ACTIVE_MASK) == 0);
+ }
/*
* If we're in an interrupt or have no user context, we must not take the fault..
*/
- if (in_atomic() || !mm)
+ if (in_atomic())
+ goto no_context_sem;
+ if (!mm)
goto no_context;
#ifdef CONFIG_VIRTUAL_MEM_MAP
@@ -82,10 +96,14 @@ ia64_do_page_fault (unsigned long addres
* This is to handle the kprobes on user space access instructions
*/
if (notify_die(DIE_PAGE_FAULT, "page fault", regs, code, TRAP_BRKPT,
- SIGSEGV) == NOTIFY_STOP)
+ SIGSEGV) == NOTIFY_STOP){
+ if (mmap_sem_taken_for_read && (REGION_NUMBER(address) < 5))
+ up_read(&mm->mmap_sem);
return;
+ }
- down_read(&mm->mmap_sem);
+ if (!mmap_sem_taken_for_read || (REGION_NUMBER(address) == 5))
+ down_read(&mm->mmap_sem);
vma = find_vma_prev(mm, address, &prev_vma);
if (!vma)
@@ -197,6 +215,9 @@ ia64_do_page_fault (unsigned long addres
return;
}
+ no_context_sem:
+ if (mmap_sem_taken_for_read && (REGION_NUMBER(address) < 5))
+ up_read(&mm->mmap_sem);
no_context:
if ((isr & IA64_ISR_SP)
|| ((isr & IA64_ISR_NA) && (isr & IA64_ISR_CODE_MASK) == IA64_ISR_CODE_LFETCH))
@@ -251,3 +272,12 @@ ia64_do_page_fault (unsigned long addres
do_exit(SIGKILL);
goto no_context;
}
+
+
+void my_rwsem_wake(struct rw_semaphore *sem)
+{
+ printk("\nmy_rwsem_wake(%p) %p %p\n", sem, current, &current->mm);
+ BUG_ON(&current->mm->mmap_sem != sem);
+ rwsem_wake(sem);
+}
+
--- linux-2.6.16.9/arch/ia64/kernel/ivt.S 2006-04-19 08:10:14.000000000 +0200
+++ new/arch/ia64/kernel/ivt.S 2006-04-20 17:52:58.000000000 +0200
@@ -80,6 +80,10 @@
.align 32768 // align on 32KB boundary
.global ia64_ivt
+
+#define pUsr p17 // User regions
+#define pR5 p16 // Region 5 fault
+
ia64_ivt:
/////////////////////////////////////////////////////////////////////////////////////////
// 0x0000 Entry 0 (size 64 bundles) VHPT Translation (8,20,47)
@@ -112,10 +116,13 @@ ENTRY(vhpt_miss)
rsm psr.dt // use physical addressing for data
mov r31=pr // save the predicate registers
mov r19=IA64_KR(PT_BASE) // get page table base address
- shl r21=r16,3 // shift bit 60 into sign bit
+#ifdef CONFIG_SMP
+ mov r30 = IA64_KR(MM_SEM) // &current->mm->mmap_sem
+#endif
+ shl r20=r16,3 // shift bit 60 into sign bit
shr.u r17=r16,61 // get the region number into r17
;;
- shr.u r22=r21,3
+ shr.u r22=r20,3
#ifdef CONFIG_HUGETLB_PAGE
extr.u r26=r25,2,6
;;
@@ -126,20 +133,47 @@ ENTRY(vhpt_miss)
(p8) shr r22=r22,r27
#endif
;;
- cmp.eq p6,p7=5,r17 // is IFA pointing into to region 5?
+ cmp.eq pR5,pUsr=5,r17 // is IFA pointing into region 5?
shr.u r18=r22,PGDIR_SHIFT // get bottom portion of pgd index bit
;;
-(p7) dep r17=r17,r19,(PAGE_SHIFT-3),3 // put region number bits in place
+(pUsr) dep r17=r17,r19,(PAGE_SHIFT-3),3 // put region number bits in place
srlz.d
- LOAD_PHYSICAL(p6, r19, swapper_pg_dir) // region 5 is rooted at swapper_pg_dir
+ LOAD_PHYSICAL(pR5, r19, swapper_pg_dir) // region 5 is rooted at swapper_pg_dir
- .pred.rel "mutex", p6, p7
-(p6) shr.u r21=r21,PGDIR_SHIFT+PAGE_SHIFT
-(p7) shr.u r21=r21,PGDIR_SHIFT+PAGE_SHIFT-3
+#ifdef CONFIG_SMP
+1: /*
+ * If (pUsr: user region fault){
+ * __s64 r27, r28, *r30 = &current->mm->mmap_sem;
+ *
+ * //
+ * // while (unlikely(__down_read_trylock(&current->mm->mmap_sem) == 0));
+ * //
+ * do {
+ * while (unlikely((r27 = ia64_ld8_bias_nta(&r30->count)) < 0));
+ * r28 = r27 + 1;
+ * } while (unlikely(ia64_cmpxchg8_acq_nta(&r30->count, r28, r27) != r27));
+ * }
+ */
+(pUsr) ld8.bias.nta r27 = [r30]
;;
-(p6) dep r17=r18,r19,3,(PAGE_SHIFT-3) // r17=pgd_offset for region 5
-(p7) dep r17=r18,r17,3,(PAGE_SHIFT-6) // r17=pgd_offset for region[0-4]
+(pUsr) cmp.lt.unc p8, p0 = r27, r0
+(p8) br.cond.spnt.few 1b
+(pUsr) mov.m ar.ccv = r27
+(pUsr) adds r28 = 1, r27
+ ;;
+(pUsr) cmpxchg8.acq.nta r28 = [r30], r28, ar.ccv
+ ;;
+(pUsr) cmp.eq.unc p0, p8 = r28, r27
+(p8) br.cond.spnt.few 1b
+#endif // CONFIG_SMP
+
+ .pred.rel "mutex", pR5, pUsr
+(pR5) shr.u r21=r20,PGDIR_SHIFT+PAGE_SHIFT
+(pUsr) shr.u r21=r20,PGDIR_SHIFT+PAGE_SHIFT-3
+ ;;
+(pR5) dep r17=r18,r19,3,(PAGE_SHIFT-3) // r17=pgd_offset for region 5
+(pUsr) dep r17=r18,r17,3,(PAGE_SHIFT-6) // r17=pgd_offset for region[0-4]
cmp.eq p7,p6=0,r21 // unused address bits all zeroes?
#ifdef CONFIG_PGTABLE_4
shr.u r28=r22,PUD_SHIFT // shift pud index into position
@@ -179,7 +213,7 @@ ENTRY(vhpt_miss)
;;
(p10) itc.i r18 // insert the instruction TLB entry
(p11) itc.d r18 // insert the data TLB entry
-(p6) br.cond.spnt.many page_fault // handle bad address/page not present (page fault)
+(p6) br.cond.spnt.many page_fault_sem // handle bad address/page not present (page fault)
mov cr.ifa=r22
#ifdef CONFIG_HUGETLB_PAGE
@@ -197,11 +231,12 @@ ENTRY(vhpt_miss)
;;
#ifdef CONFIG_SMP
/*
- * Tell the assemblers dependency-violation checker that the above "itc" instructions
- * cannot possibly affect the following loads:
+ * Make sure the "itc.*" above is visible to generated purges (like ptc.ga)
+ * before we re-read the PTE.
+ * Having itc.i-ed / itc.d-ed a new translation, there is no need for
+ * srlz.i: the rfi below will do the serialization.
*/
- dv_serialize_data
-
+(p7) srlz.d
/*
* Re-check pagetable entry. If they changed, we may have received a ptc.g
* between reading the pagetable and the "itc". If so, flush the entry we
@@ -214,25 +249,45 @@ ENTRY(vhpt_miss)
* r29 = *pud
* r20 = *pmd
* r18 = *pte
+ *
+ * There is no need to re-read *pgd, *pud, *pmd because of having "mmap_sem".
+ * The PTE page translation is guaranteed to be correct.
+ * The swapper does not take this semaphore for write => do re-check *pte.
*/
ld8 r25=[r21] // read *pte again
- ld8 r26=[r17] // read *pmd again
-#ifdef CONFIG_PGTABLE_4
- ld8 r19=[r28] // read *pud again
-#endif
- cmp.ne p6,p7=r0,r0
- ;;
- cmp.ne.or.andcm p6,p7=r26,r20 // did *pmd change
-#ifdef CONFIG_PGTABLE_4
- cmp.ne.or.andcm p6,p7=r19,r29 // did *pud change
-#endif
mov r27=PAGE_SHIFT<<2
;;
-(p6) ptc.l r22,r27 // purge PTE page translation
-(p7) cmp.ne.or.andcm p6,p7=r25,r18 // did *pte change
+ cmp.ne p6,p0=r25,r18 // did *pte change?
;;
(p6) ptc.l r16,r27 // purge translation
-#endif
+ /*
+ * Tell the assemblers dependency-violation checker that the above "ptc" instruction
+ * cannot possibly affect the following loads/stores.
+ */
+ dv_serialize_data
+
+1: /*
+ * If (pUsr: user region fault){
+ * __s64 r27, r28, *r30 = &current->mm->mmap_sem;
+ *
+ * //
+ * // __up_read(&current->mm->mmap_sem);
+ * //
+ * r28 = ia64_fetchadd8_rel_nta(&r30->count, -1);
+ * r27 = r28 - 1;
+ * if (unlikely(r28 < 0))
+ * if (unlikely((r27 & RWSEM_ACTIVE_MASK) == 0))
+ * rwsem_wake(sem);
+ * }
+ */
+(pUsr) fetchadd8.rel.nta r28 = [r30], -1
+ ;;
+(pUsr) adds r27 = -1, r28
+(pUsr) cmp.lt.unc p8, p0 = r28, r0
+(p8) br.cond.spnt.few call_sem_wake_up
+ ;;
+back_to_vhpt_miss: // No need to call "rwsem_wake()"
+#endif // CONFIG_SMP
mov pr=r31,-1 // restore predicate registers
rfi
@@ -282,6 +337,60 @@ ENTRY(itlb_miss)
rfi
END(itlb_miss)
+
+#ifdef CONFIG_SMP
+ /*
+ * Arriving from "vhpt_miss":
+ *
+ * ...
+ * If (pUsr: user region fault){
+ * __s64 r27, r28, *r30 = &current->mm->mmap_sem;
+ *
+ * //
+ * // __up_read(&current->mm->mmap_sem);
+ * //
+ * r28 = ia64_fetchadd8_rel_nta(&r30->count, -1);
+ * r27 = r28 - 1;
+ * if (unlikely(r28 < 0))
+ *
+ * //
+ * // *** We are here: ***
+ * //
+ * if (unlikely((r27 & RWSEM_ACTIVE_MASK) == 0))
+ * rwsem_wake(sem);
+ * }
+ * ...
+ */
+call_sem_wake_up:
+ cmp4.eq p0, p8 = 0, r27
+(p8) br.cond.sptk.few back_to_vhpt_miss
+ ;;
+ /*
+ * Call "my_rwsem_wake(IA64_KR(MM_SEM))". Predicates are in r31, psr.dt is off.
+ */
+ ssm psr.dt
+ ;;
+ srlz.d
+ ;;
+ SAVE_MIN_WITH_COVER
+ alloc r15 = ar.pfs, 0, 0, 1, 0
+ mov out0 = IA64_KR(MM_SEM) // &current->mm->mmap_sem
+ adds r3=8, r2 // set up second base pointer
+ ;;
+ ssm psr.ic | PSR_DEFAULT_BITS
+ ;;
+ srlz.i // guarantee that interruption collection is on
+ ;;
+(p15) ssm psr.i // restore psr.i
+ movl r14 = ia64_leave_kernel
+ ;;
+ SAVE_REST
+ mov rp = r14
+ ;;
+ br.call.sptk.many b6 = my_rwsem_wake // ignore return address
+#endif // CONFIG_SMP
+
+
.org ia64_ivt+0x0800
/////////////////////////////////////////////////////////////////////////////////////////
// 0x0800 Entry 2 (size 64 bundles) DTLB (9,48)
@@ -326,6 +434,21 @@ dtlb_fault:
rfi
END(dtlb_miss)
+
+ //-----------------------------------------------------------------------------------
+ // call do_page_fault (predicates are in r31, psr.dt may be off, r16 is faulting address)
+ // We DON'T have "mmap_sem" for read
+ENTRY(page_fault)
+ ssm psr.dt
+ ;;
+ srlz.i
+ ;;
+ SAVE_MIN_WITH_COVER
+ mov r15 = r0 // We DON'T have "mmap_sem" for read
+ br.sptk.few page_fault_common
+END(page_fault)
+
+
.org ia64_ivt+0x0c00
/////////////////////////////////////////////////////////////////////////////////////////
// 0x0c00 Entry 3 (size 64 bundles) Alt ITLB (19)
@@ -499,32 +622,39 @@ ENTRY(ikey_miss)
FAULT(6)
END(ikey_miss)
+
//-----------------------------------------------------------------------------------
// call do_page_fault (predicates are in r31, psr.dt may be off, r16 is faulting address)
-ENTRY(page_fault)
+ // If user region fault has happened => we keep "mmap_sem" for read
+ENTRY(page_fault_sem)
ssm psr.dt
;;
srlz.i
;;
SAVE_MIN_WITH_COVER
- alloc r15=ar.pfs,0,0,3,0
- mov out0=cr.ifa
- mov out1=cr.isr
- adds r3=8,r2 // set up second base pointer
+ add r15 = 1, r0 // If user region fault has happened =>
+ // We have "mmap_sem" for read
+page_fault_common:
+ alloc r14 = ar.pfs, 0, 0, 4, 0
+ mov out0 = cr.ifa
+ mov out1 = cr.isr
+ adds r3 = 8, r2 // set up second base pointer
;;
ssm psr.ic | PSR_DEFAULT_BITS
;;
srlz.i // guarantee that interruption collectin is on
;;
(p15) ssm psr.i // restore psr.i
- movl r14=ia64_leave_kernel
+ movl r14 = ia64_leave_kernel
;;
SAVE_REST
- mov rp=r14
+ mov rp = r14
;;
- adds out2=16,r12 // out2 = pointer to pt_regs
+ adds out2 = 16, r12 // out2 = pointer to pt_regs
+ mov out3 = r15 // Is "mmap_sem" taken for read?
br.call.sptk.many b6=ia64_do_page_fault // ignore return address
-END(page_fault)
+END(page_fault_sem)
+
.org ia64_ivt+0x1c00
/////////////////////////////////////////////////////////////////////////////////////////
--- linux-2.6.16.9/include/asm-ia64/kregs.h 2006-04-20 10:36:46.000000000 +0200
+++ new/include/asm-ia64/kregs.h 2006-04-20 14:53:14.000000000 +0200
@@ -14,6 +14,7 @@
*/
#define IA64_KR_IO_BASE 0 /* ar.k0: legacy I/O base address */
#define IA64_KR_TSSD 1 /* ar.k1: IVE uses this as the TSSD */
+#define IA64_KR_MM_SEM 2 /* ar.k2: ->mmap_sem address (physical) */
#define IA64_KR_PER_CPU_DATA 3 /* ar.k3: physical per-CPU base */
#define IA64_KR_CURRENT_STACK 4 /* ar.k4: what's mapped in IA64_TR_CURRENT_STACK */
#define IA64_KR_FPU_OWNER 5 /* ar.k5: fpu-owner (UP only, at the moment) */
--- linux-2.6.16.9/include/asm-ia64/mmu_context.h 2006-04-20 10:36:46.000000000 +0200
+++ new/include/asm-ia64/mmu_context.h 2006-04-20 14:59:18.000000000 +0200
@@ -192,6 +192,7 @@ activate_mm (struct mm_struct *prev, str
* handlers cannot touch user-space.
*/
ia64_set_kr(IA64_KR_PT_BASE, __pa(next->pgd));
+ ia64_set_kr(IA64_KR_MM_SEM, __pa(&next->mmap_sem));
activate_context(next);
}
Thread overview: 5+ messages
2006-04-21 14:12 Zoltan Menyhart [this message]
2006-04-21 14:21 ` Protect PGD...PTE walking in ivt.S Zoltan Menyhart
2006-04-21 17:13 ` Christoph Lameter
2006-04-21 18:08 ` Zoltan Menyhart
2006-04-21 18:54 ` Christoph Lameter