linux-mm.kvack.org archive mirror
* RFC: RCU protected page table walking
@ 2006-05-03 15:31 Zoltan Menyhart
  2006-05-03 16:46 ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Zoltan Menyhart @ 2006-05-03 15:31 UTC (permalink / raw)
  To: linux-mm; +Cc: Zoltan.Menyhart

On a TLB miss, we have got a

    TLB = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]

chain to walk.
Some architectures do it in HW (microcode), some others in SW.
This page table walking is not atomic, not even on an x86.
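
In C terms, a SW walker performs a chain of dependent loads with nothing
held across them. A schematic sketch only ("dir_page_of()" is a made-up
helper standing for the architecture's way of extracting the next level's
physical page address from an entry; the index macros are the usual ones):

	/*
	 * Schematic SW page table walk: four dependent loads.
	 * Between any two of them, another CPU may free and re-use
	 * the page the next load is about to touch.
	 */
	pud_t *pud_base = dir_page_of(pgd_base[pgd_index(addr)]);
	pmd_t *pmd_base = dir_page_of(pud_base[pud_index(addr)]);
	pte_t *pte_base = dir_page_of(pmd_base[pmd_index(addr)]);
	pte_t  pte      = pte_base[pte_index(addr)];	/* TLB <- pte */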

Let's consider the following scenario:


CPU #1:                      CPU #2:                 CPU #3

Starts walking
Got the ph. addr. of page Y
in internal reg. X
                             free_pgtables():
                             sets free page Y
                                                     Allocates page Y
Accesses page Y via reg. X


As CPU #1 still holds the same ph. address, it fetches an item
from a page that is no longer its page.

Even if this security window is small, it does exist.
We cannot base our security just on "quick" page table walking.
(How quick would it have to be?)
The probability of hitting this bug grows on a NUMA machine with many CPUs.

As the HW page table walker cannot take any lock, and the low level assembly
routines do not, only some "careful programming" in the PUD / PMD / PTE page
removal code can help.

I propose an RCU based protection mechanism.
Some minor modifications will be necessary in the architecture dependent parts.
I can only give an example for IA64.
I did not spend much effort optimizing the actual freeing routine,
"do_free_pte_mpd_pud_pages()", because (among other reasons, see the
comments) I mainly wanted to present my RCU based concept.

Thanks,

Zoltan Menyhart

Signed-Off-By: Zoltan.Menyhart@bull.net


--- linux-2.6.16.9-save/mm/memory.c	2006-04-21 09:59:16.000000000 +0200
+++ linux-2.6.16.9/mm/memory.c	2006-05-03 16:29:46.000000000 +0200
@@ -264,9 +264,129 @@ void free_pgd_range(struct mmu_gather **
 		flush_tlb_pgtables((*tlb)->mm, start, end);
 }
 
+
+/*
+ * RCU protected page table walking.
+ * ---------------------------------
+ *
+ * Page table walking needs protection, otherwise:
+ * - A first CPU, part-way through the walk, has got the physical address
+ *   of the next page in the chain in an internal register.
+ * - In the meantime, a second CPU, executing "free_pgtables()", frees the
+ *   page the first CPU is about to touch.
+ * - Someone re-uses the same page for something else.
+ * - The first CPU fetches an item from a page of someone else.
+ *
+ * PTE, PMD and PUD page usage perfectly fits into the RCU approach:
+ * - Page table walking is the read side
+ * - Allocating and un-mapping these pages is the update side
+ * - Really freeing these pages is the reclaim side
+ * PTE, PMD and PUD pages will be "put aside" by the un-mapping code until all
+ * pre-existing read-side critical sections on all CPUs have completed.
+ *
+ * Page table walking is (logically) carried out as follows:
+ *
+ *	rcu_read_lock_bh();
+ *	pud_p = rcu_dereference( pgd_p[i] );
+ *	pmd_p = rcu_dereference( pud_p[j] );
+ *	pte_p = rcu_dereference( pmd_p[k] );
+ *	...
+ *	pte = *pte_p;
+ *	...
+ *	rcu_read_unlock_bh();
+ *
+ * PTE, PMD and PUD page allocation, initialization and hooking them to their
+ * respective parent pages are carried out under some lock protection.
+ * Releasing this lock provides the required memory fencing semantics =>
+ * there is no need for explicit "rcu_assign_pointer()" usage.
+ *
+ * The "p??_free()" subroutines, invoked indirectly by "free_pgtables()", do
+ * not actually free the PTE, PMD and PUD pages.
+ * Instead, they "put them aside" in order to give the read sides a grace
+ * period. "free_pgtables()" kicks off an RCU activated service to reclaim
+ * the PTE, PMD and PUD pages later.
+ *
+ * Notes:
+ * - The life span of the PTE, PMD and PUD pages is rather long:
+ *   they are freed when the usage of the memory area ceases, provided no other
+ *   map (using the same PTE, PMD and PUD pages) is valid.
+ * - The number of PTE, PMD and PUD pages is much smaller than that of
+ *   the leaf pages.
+ * Therefore freeing them is not really performance critical.
+ */
+
+#if defined(CONFIG_SMP)
+
+/*
+ * A singly linked ring of the PTE, PMD and PUD pages, which are going to be
+ * reclaimed, is anchored by the "pages" field.
+ * This ring is formed by use of "->lru.next".
+ * Note that "next" points at the next page structure, not at the list head.
+ */
+struct rcu_free_pte_mpd_pud_pages {
+	struct rcu_head	rcu;
+	struct page	*pages;
+};
+
+#endif
+
+/*
+ * The actual freeing service for the PTE, PMD and PUD pages.
+ *
+ * This is not a performance critical routine:
+ * - these pages are freed much less frequently than the leaf pages are
+ * - the number of these pages is much smaller than that of the leaf pages
+ */
+void do_free_pte_mpd_pud_pages(struct page * const first_page)
+{
+	struct page *p = first_page;
+	struct page *next;
+
+	do {
+		next = (struct page *)(p->lru.next);
+		free_page((unsigned long) page_address(p));
+//		pgtable_quicklist_free(page_address(p));
+	} while ((p = next) != first_page);
+}
+
+#if defined(CONFIG_SMP)
+
+/*
+ * This is the RCU reclaim end of freeing PTE, PMD and PUD pages on SMP systems.
+ */
+void rcu_free_pte_mpd_pud_pages(struct rcu_head *rcup)
+{
+	const struct rcu_free_pte_mpd_pud_pages * const rp =
+		container_of(rcup, struct rcu_free_pte_mpd_pud_pages, rcu);
+
+	do_free_pte_mpd_pud_pages(rp->pages);
+	kfree(rcup);
+}
+
+#endif	// #if defined(CONFIG_SMP)
+
+/*
+ * Here comes the comment explaining what "free_pgtables()" does,
+ * why, how, etc. :-)
+ *
+ * This is the RCU update end of freeing the PTE, PMD and PUD pages:
+ * The "p??_free()" subroutines do not actually free these pages; instead,
+ * they add them onto the ring anchored by "current->pages_rcu_free".
+ * For a multi-threaded process on SMP systems, the PTE, PMD and PUD pages
+ * will be set free via a "call_rcu_bh()"-activated service.
+ * The currently active reader ends are guaranteed to find their good old
+ * PTE, PMD and PUD pages at their actual physical addresses.
+ * (For kernel processes this protection is not available - you should not
+ * unmap an in-use kernel memory zone.)
+ */
 void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
 		unsigned long floor, unsigned long ceiling)
 {
+	struct mm_struct * const mm = vma->vm_mm;
+#if defined(CONFIG_SMP)
+	struct rcu_free_pte_mpd_pud_pages *rp;
+#endif
+
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long addr = vma->vm_start;
@@ -297,6 +417,62 @@ void free_pgtables(struct mmu_gather **t
 		}
 		vma = next;
 	}
+#if defined(CONFIG_SMP)
+	/*
+	 * For a multi-threaded process on SMP systems, the PTE, PMD and PUD
+	 * pages will be set free via a "call_rcu_bh()"-activated service.
+	 * If this is the last thread => no need for this protection.
+	 * (For kernel processes, with "mm == &init_mm", this protection is not
+	 * available - you should not unmap an in-use kernel memory zone.)
+	 * Note that reading "mm_users" below is racy.
+	 * If the other threads exit in the meantime, then we merely call
+	 * the RCU service in vain.
+	 */
+	if (atomic_read(&mm->mm_users) <= 1 || unlikely(mm == &init_mm)){
+		if (current->pages_rcu_free != NULL){
+			do_free_pte_mpd_pud_pages(current->pages_rcu_free);
+			current->pages_rcu_free = NULL;
+		}
+		/* May happen to a multi-threaded process only: */
+		if (unlikely(mm->pages_rcu_free != NULL)){
+			do_free_pte_mpd_pud_pages(mm->pages_rcu_free);
+			mm->pages_rcu_free = NULL;
+		}
+		return;
+	}
+	/*
+	 * We get here for the multi-threaded processes only, with
+	 * (most likely) more than one active thread.
+	 */
+	if (likely(mm->pages_rcu_free == NULL)){
+		if (current->pages_rcu_free == NULL)
+			return;
+		mm->pages_rcu_free = current->pages_rcu_free;
+		current->pages_rcu_free = NULL;
+	} else if (current->pages_rcu_free != NULL){
+		/*
+		 * Merge the two rings. "->lru.prev" is just temporary storage.
+		 */
+		mm->pages_rcu_free->lru.prev = current->pages_rcu_free->lru.next;
+		current->pages_rcu_free->lru.next = mm->pages_rcu_free->lru.next;
+		mm->pages_rcu_free->lru.next = mm->pages_rcu_free->lru.prev;
+		current->pages_rcu_free = NULL;
+	}
+	/*
+	 * If there is no more memory, then try to free these pages later.
+	 * At the very latest, "exit_mmap()" will be able to do it.
+	 */
+	if (unlikely((rp = kmalloc(sizeof(*rp), GFP_KERNEL)) == NULL))
+		return;
+	rp->pages = mm->pages_rcu_free;
+	mm->pages_rcu_free = NULL;
+	call_rcu_bh(&rp->rcu, rcu_free_pte_mpd_pud_pages);
+#else	// #if defined(CONFIG_SMP)
+	if (current->pages_rcu_free != NULL){
+		do_free_pte_mpd_pud_pages(current->pages_rcu_free);
+		current->pages_rcu_free = NULL;
+	}
+#endif	// #if defined(CONFIG_SMP)
 }
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
--- linux-2.6.16.9-save/arch/ia64/kernel/ivt.S	2006-04-21 09:58:55.000000000 +0200
+++ linux-2.6.16.9/arch/ia64/kernel/ivt.S	2006-05-03 15:16:10.000000000 +0200
@@ -37,6 +37,28 @@
  *
  * Table is based upon EAS2.6 (Oct 1999)
  */
+/*
+ * RCU protected page table walking.
+ * ---------------------------------
+ *
+ * (For further details see "mm/memory.c".)
+ *
+ * Page table walking is (logically) carried out as follows:
+ *
+ *	rcu_read_lock_bh();
+ *	pud_p = rcu_dereference( pgd_p[i] );
+ *	pmd_p = rcu_dereference( pud_p[j] );
+ *	pte_p = rcu_dereference( pmd_p[k] );
+ *	...
+ *	pte = *pte_p;
+ *	...
+ *	rcu_read_unlock_bh();
+ *
+ * Notes:
+ *	- the RCU read lock semantics are provided by disabling interrupts
+ *	- "rcu_dereference()" includes "smp_read_barrier_depends()" that is a
+ *	  no-op for ia64
+ */
 
 #include <linux/config.h>
 
--- linux-2.6.16.9-save/include/asm-ia64/pgalloc.h	2006-04-21 09:59:12.000000000 +0200
+++ linux-2.6.16.9/include/asm-ia64/pgalloc.h	2006-05-03 13:39:57.000000000 +0200
@@ -76,6 +76,23 @@ static inline void pgtable_quicklist_fre
 	preempt_enable();
 }
 
+/*
+ * The PTE, PMD and PUD pages are not actually freed here.
+ * Instead, they are added onto the singly linked ring anchored by
+ * "current->pages_rcu_free". This ring is formed by use of "->lru.next".
+ * Note that "next" points at the next page structure, not at the list head.
+ */
+static inline void p___free(struct page * const p)
+{
+	if (current->pages_rcu_free == NULL){
+		current->pages_rcu_free = p;
+		p->lru.next = (struct list_head *) p;
+	} else {
+		p->lru.next = current->pages_rcu_free->lru.next;
+		current->pages_rcu_free->lru.next = (struct list_head *) p;
+	}
+}
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	return pgtable_quicklist_alloc();
@@ -100,7 +117,7 @@ static inline pud_t *pud_alloc_one(struc
 
 static inline void pud_free(pud_t * pud)
 {
-	pgtable_quicklist_free(pud);
+	p___free(virt_to_page(pud));
 }
 #define __pud_free_tlb(tlb, pud)	pud_free(pud)
 #endif /* CONFIG_PGTABLE_4 */
@@ -118,7 +135,7 @@ static inline pmd_t *pmd_alloc_one(struc
 
 static inline void pmd_free(pmd_t * pmd)
 {
-	pgtable_quicklist_free(pmd);
+	p___free(virt_to_page(pmd));
 }
 
 #define __pmd_free_tlb(tlb, pmd)	pmd_free(pmd)
@@ -149,9 +166,14 @@ static inline pte_t *pte_alloc_one_kerne
 
 static inline void pte_free(struct page *pte)
 {
-	pgtable_quicklist_free(page_address(pte));
+	p___free(pte);
 }
 
+/*
+ * The only known usage of this function is in case of a failure in
+ * "__pte_alloc_kernel()", therefore it is not included in the "careful"
+ * page freeing mechanism.
+ */
 static inline void pte_free_kernel(pte_t * pte)
 {
 	pgtable_quicklist_free(pte);
--- linux-2.6.16.9-save/include/linux/sched.h	2006-04-21 09:59:15.000000000 +0200
+++ linux-2.6.16.9/include/linux/sched.h	2006-05-02 18:31:54.000000000 +0200
@@ -313,7 +313,10 @@ struct mm_struct {
 						 * together off init_mm.mmlist, and are protected
 						 * by mmlist_lock
 						 */
-
+	struct page *pages_rcu_free;		/* free_pgtables() collects PTE, PMD and
+						 * PUD pages - protected by mmap_sem
+						 * taken for write
+						 */
 	/* Special counters, in some configurations protected by the
 	 * page_table_lock, in other configurations by being atomic.
 	 */
@@ -871,6 +874,8 @@ struct task_struct {
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
+	struct page *pages_rcu_free;	/* free_pgtables() collects */
+					/* PTE, PMD and PUD pages */
 };
 
 static inline pid_t process_group(struct task_struct *tsk)


* Re: RFC: RCU protected page table walking
  2006-05-03 15:31 RFC: RCU protected page table walking Zoltan Menyhart
@ 2006-05-03 16:46 ` Andi Kleen
  2006-05-03 18:00   ` Hugh Dickins
  2006-05-04  9:19   ` Zoltan Menyhart
  0 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2006-05-03 16:46 UTC (permalink / raw)
  To: Zoltan Menyhart; +Cc: linux-mm, Zoltan.Menyhart

> This page table walking is not atomic, not even on an x86.
> 
> Let's consider the following scenario:
> 
> 
> CPU #1:                      CPU #2:                 CPU #3
> 
> Starts walking
> Got the ph. addr. of page Y
> in internal reg. X
>                              free_pgtables():
>                              sets free page Y

The page is not freed until all CPUs who had the mm mapped are flushed.
See mmu_gather in asm-generic/tlb.h
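
Roughly, in 2.6.16 terms (trimmed):

	tlb = tlb_gather_mmu(mm, 0);
	...
	tlb_remove_page(tlb, page):		/* only queues the page */
	    tlb->pages[tlb->nr++] = page;
	...
	tlb_finish_mmu(tlb, start, end):
	    tlb_flush_mmu(tlb, start, end):
	        tlb_flush(tlb);			/* flush the other CPUs first... */
	        free_pages_and_swap_cache(tlb->pages, tlb->nr);	/* ...free after */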


>                                                      Allocates page Y
> Accesses page Y via reg. X
> 
> 
> As CPU #1 still holds the same ph. address, it fetches an item
> from a page that is no longer its page.
> 
> Even if this security window is small, it does exist.

It doesn't at least on architectures that use the generic tlbflush.h

-Andi


* Re: RFC: RCU protected page table walking
  2006-05-03 16:46 ` Andi Kleen
@ 2006-05-03 18:00   ` Hugh Dickins
  2006-05-03 23:54     ` Christoph Lameter
  2006-05-04  9:26     ` Zoltan Menyhart
  2006-05-04  9:19   ` Zoltan Menyhart
  1 sibling, 2 replies; 14+ messages in thread
From: Hugh Dickins @ 2006-05-03 18:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Zoltan Menyhart, Christoph Lameter, linux-mm, Zoltan.Menyhart

On Wed, 3 May 2006, Andi Kleen wrote:
> 
> The page is not freed until all CPUs who had the mm mapped are flushed.
> See mmu_gather in asm-generic/tlb.h
> 
> > Even if this security window is small, it does exist.
> 
> It doesn't at least on architectures that use the generic tlbflush.h

Those architectures (including i386 and x86_64) which #define their
__pte_free_tlb etc. to tlb_remove_page are safe as is.  But Zoltan's
ia64 #defines it to pte_free, which looks like it may free_page before
the TLB flush.  But it is surprising if it has actually been unsafe
there on ia64 - perhaps Christoph can explain how it is safe?
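
If I'm reading the 2.6.16 headers right, the contrast is (trimmed):

	/* include/asm-i386/pgalloc.h (x86_64 likewise): batched via mmu_gather */
	#define __pte_free_tlb(tlb, pte)	tlb_remove_page((tlb), (pte))

	/* include/asm-ia64/pgalloc.h: immediate free, bypassing the batching */
	#define __pte_free_tlb(tlb, pte)	pte_free(pte)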

Hugh


* Re: RFC: RCU protected page table walking
  2006-05-03 18:00   ` Hugh Dickins
@ 2006-05-03 23:54     ` Christoph Lameter
  2006-05-04  2:51       ` Chen, Kenneth W
  2006-05-04  9:26     ` Zoltan Menyhart
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2006-05-03 23:54 UTC (permalink / raw)
  To: kenneth.w.chen
  Cc: Hugh Dickins, Andi Kleen, Zoltan Menyhart, linux-mm,
	Zoltan.Menyhart, linux-i64

On Wed, 3 May 2006, Hugh Dickins wrote:

> Those architectures (including i386 and x86_64) which #define their
> __pte_free_tlb etc. to tlb_remove_page are safe as is.  But Zoltan's
> ia64 #defines it to pte_free, which looks like it may free_page before
> the TLB flush.  But it is surprising if it has actually been unsafe

Sorry but I am in .au right now with spotty high latency connectivity. 
But the people on linux-ia64 should know. Ken?

Why was linux-ia64 not cced??


* RE: RFC: RCU protected page table walking
  2006-05-03 23:54     ` Christoph Lameter
@ 2006-05-04  2:51       ` Chen, Kenneth W
  2006-05-04  4:28         ` Hugh Dickins
  0 siblings, 1 reply; 14+ messages in thread
From: Chen, Kenneth W @ 2006-05-04  2:51 UTC (permalink / raw)
  To: 'Christoph Lameter'
  Cc: Hugh Dickins, Andi Kleen, Zoltan Menyhart, linux-mm,
	Zoltan.Menyhart, linux-i64

Christoph Lameter wrote on Wednesday, May 03, 2006 4:54 PM
> On Wed, 3 May 2006, Hugh Dickins wrote:
> > Those architectures (including i386 and x86_64) which #define their
> > __pte_free_tlb etc. to tlb_remove_page are safe as is.  But Zoltan's
> > ia64 #defines it to pte_free, which looks like it may free_page before
> > the TLB flush.  But it is surprising if it has actually been unsafe
> 
> Sorry but I am in .au right now with spotty high latency connectivity. 
> But the people on linux-ia64 should know. Ken?

A while back ia64 reinstated the per-cpu pgtable quicklist, which bypasses
tlb_gather/tlb_finish_mmu for page table pages. It should be safe AFAICT
because the TLB for the user address and the VHPT are already flushed by
the time pte_free_tlb() is called.
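
For reference, the quicklist free is roughly this (trimmed from
include/asm-ia64/pgalloc.h): the page goes straight onto a per-cpu list
and can be handed out again at once, without passing through mmu_gather.

	static inline void pgtable_quicklist_free(void *pgtable_entry)
	{
		preempt_disable();
		/* link the page into the per-cpu free list via its first word */
		*(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
		pgtable_quicklist = (unsigned long *)pgtable_entry;
		++pgtable_quicklist_size;
		preempt_enable();
	}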


* RE: RFC: RCU protected page table walking
  2006-05-04  2:51       ` Chen, Kenneth W
@ 2006-05-04  4:28         ` Hugh Dickins
  0 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2006-05-04  4:28 UTC (permalink / raw)
  To: Chen, Kenneth W
  Cc: 'Christoph Lameter', Andi Kleen, Zoltan Menyhart,
	linux-mm, Zoltan.Menyhart, linux-i64

On Wed, 3 May 2006, Chen, Kenneth W wrote:
> > On Wed, 3 May 2006, Hugh Dickins wrote:
> > > Those architectures (including i386 and x86_64) which #define their
> > > __pte_free_tlb etc. to tlb_remove_page are safe as is.  But Zoltan's
> > > ia64 #defines it to pte_free, which looks like it may free_page before
> > > the TLB flush.  But it is surprising if it has actually been unsafe
> 
> A while back ia64 reinstated per-cpu pgtable quicklist,
> which bypasses tlb_gather/tlb_finish_mmu for page table pages.

Right you are, it was using tlb_remove_page until 2.6.12.  Forgive me,
but that makes me a little more suspicious of whether it is now safe.

> It should be safe AFAICT because TLB for user address and
> vhpt are already flushed by the time pte_free_tlb() is called.

I'm ia64-challenged, so VHPT is no more than a name to me; but I can
easily believe that on ia64, once the pte has been cleared and the
user address flushed from the TLB, then the page tables can be freed
without waiting on further flushing.

However, are you sure that the TLB for user address has already been
flushed at that point?  There is not necessarily any tlb_finish_mmu
call in between the last tlb_remove_page of unmap_vmas and the first
pte_free_tlb of free_pgtables.
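
For instance, unmap_region() in 2.6.16 mm/mmap.c goes roughly:

	tlb = tlb_gather_mmu(mm, 0);
	unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL);
	free_pgtables(&tlb, vma, floor, ceiling);	/* ia64: pte_free() runs here */
	tlb_finish_mmu(tlb, start, end);		/* the batched flush only here */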

Hugh



* Re: RFC: RCU protected page table walking
  2006-05-03 16:46 ` Andi Kleen
  2006-05-03 18:00   ` Hugh Dickins
@ 2006-05-04  9:19   ` Zoltan Menyhart
  1 sibling, 0 replies; 14+ messages in thread
From: Zoltan Menyhart @ 2006-05-04  9:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, Zoltan.Menyhart

Andi Kleen wrote:
> This page table walking is not atomic, not even on an x86.
> 
>>Let's consider the following scenario:
>>
>>
>>CPU #1:                      CPU #2:                 CPU #3
>>
>>Starts walking
>>Got the ph. addr. of page Y
>>in internal reg. X
>>                             free_pgtables():
>>                             sets free page Y
> 
> 
> The page is not freed until all CPUs who had the mm mapped are flushed.
> See mmu_gather in asm-generic/tlb.h

Page table walking is done in physical mode, i.e. a PGD access is not
sensitive to a TLB purge.

Here is the (simplified) IA64 implementation:

        free_pgtables(&tlb,...):
            free_pgd_range(tlb,...):
                free_pud_range(*tlb,...):
                    free_pmd_range(tlb,...):
                        free_pte_range(tlb,...):
                            pmd_clear(pmd);
                            pte_free_tlb(tlb, page):
                                __pte_free_tlb(tlb, ptep):
/* --> */                           pte_free(pte);
                        pud_clear(pud);
                        pmd_free_tlb(tlb, pmd):
/* --> */                   pmd_free(pmd);
                    pgd_clear(pgd);
                    pud_free_tlb(tlb, pud):
                        __pud_free_tlb(tlb, pudp):
/* --> */                   pud_free(pud);
                flush_tlb_pgtables((*tlb)->mm,...);

Or if you like, from asm-generic/tlb.h:

	tlb_remove_page(tlb, page):
	    if (tlb_fast_mode(tlb)) {
	        free_page_and_swap_cache(page);
	        return;
	    }
	    tlb->pages[tlb->nr++] = page;
	    if (tlb->nr >= FREE_PTE_NR)
	        tlb_flush_mmu(tlb, 0, 0):

	            free_pages_and_swap_cache(tlb->pages, tlb->nr);

As you can see, we do not care about possible concurrent page table walkers.

>>As CPU #1 still holds the same ph. address, it fetches an item
>>from a page that is no longer its page.
>>
>>Even if this security window is small, it does exist.
> 
> 
> It doesn't at least on architectures that use the generic tlbflush.h

As I showed above, the generic code is unaware of the other CPUs' activity.

The problem is:
there is no rule about when a directory page may be released.

What I propose is a way to make sure that the page table walkers will be
able to finish their walks safely: we release a directory page only when
no walker can still reference it.
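
In miniature, this is the classic RCU deferred-free pattern (hypothetical
helper names - my patch does a batched variant of it):

	struct dir_page {
		struct rcu_head	rcu;
		struct page	*page;
	};

	static void reclaim_dir_page(struct rcu_head *head)
	{
		struct dir_page *dp = container_of(head, struct dir_page, rcu);

		__free_page(dp->page);	/* all pre-existing walks have finished */
		kfree(dp);
	}

	static void retire_pte_page(pmd_t *pmd, struct dir_page *dp)
	{
		pmd_clear(pmd);		/* no new walker can reach the page */
		call_rcu(&dp->rcu, reclaim_dir_page);
	}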

Thanks,

Zoltan


* Re: RFC: RCU protected page table walking
  2006-05-03 18:00   ` Hugh Dickins
  2006-05-03 23:54     ` Christoph Lameter
@ 2006-05-04  9:26     ` Zoltan Menyhart
  2006-05-04  9:31       ` Andi Kleen
  1 sibling, 1 reply; 14+ messages in thread
From: Zoltan Menyhart @ 2006-05-04  9:26 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andi Kleen, Christoph Lameter, linux-mm, Zoltan.Menyhart

Hugh Dickins wrote:
> On Wed, 3 May 2006, Andi Kleen wrote:
> 
>>The page is not freed until all CPUs who had the mm mapped are flushed.
>>See mmu_gather in asm-generic/tlb.h
>>
>>
>>>Even if this security window is small, it does exist.
>>
>>It doesn't at least on architectures that use the generic tlbflush.h
> 
> 
> Those architectures (including i386 and x86_64) which #define their
> __pte_free_tlb etc. to tlb_remove_page are safe as is.

I cannot agree with you. Here is the generic sequence:

    tlb_remove_page(tlb, page):
        if (tlb_fast_mode(tlb)) {
            free_page_and_swap_cache(page);
            return;
        }
        tlb->pages[tlb->nr++] = page;
        if (tlb->nr >= FREE_PTE_NR)
            tlb_flush_mmu(tlb, 0, 0):

                free_pages_and_swap_cache(tlb->pages, tlb->nr); 

We set free the PTE, PMD and PUD pages either immediately or when
tlb->pages[] is full.

What can make sure that there is no active page table walker on
another CPU?

Thanks,

Zoltan


* Re: RFC: RCU protected page table walking
  2006-05-04  9:26     ` Zoltan Menyhart
@ 2006-05-04  9:31       ` Andi Kleen
  2006-05-04 11:32         ` Zoltan Menyhart
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2006-05-04  9:31 UTC (permalink / raw)
  To: Zoltan Menyhart
  Cc: Hugh Dickins, Christoph Lameter, linux-mm, Zoltan.Menyhart

On Thursday 04 May 2006 11:26, Zoltan Menyhart wrote:
> Hugh Dickins wrote:
> > On Wed, 3 May 2006, Andi Kleen wrote:
> > 
> >>The page is not freed until all CPUs who had the mm mapped are flushed.
> >>See mmu_gather in asm-generic/tlb.h
> >>
> >>
> >>>Even if this security window is small, it does exist.
> >>
> >>It doesn't at least on architectures that use the generic tlbflush.h
> > 
> > 
> > Those architectures (including i386 and x86_64) which #define their
> > __pte_free_tlb etc. to tlb_remove_page are safe as is.
> 
> I cannot agree with you. Here is the generic sequence:
> 
>     tlb_remove_page(tlb, page):
>         if (tlb_fast_mode(tlb)) {
>             free_page_and_swap_cache(page);
>             return;
>         }
>         tlb->pages[tlb->nr++] = page;
>         if (tlb->nr >= FREE_PTE_NR)
>             tlb_flush_mmu(tlb, 0, 0):
> 
>                 free_pages_and_swap_cache(tlb->pages, tlb->nr); 
> 
> We set free the PTE, PMD and PUD pages either immediately or when
> tlb->pages[] is full.
> 
> What can make sure that there is no active page table walker on
> another CPU?

We don't free the pages until the other CPUs have been flushed synchronously. 
After the flush the other CPUs don't walk pages anymore. The whole thing is
batched because the synchronous flush can be pretty expensive.

-Andi


* Re: RFC: RCU protected page table walking
  2006-05-04  9:31       ` Andi Kleen
@ 2006-05-04 11:32         ` Zoltan Menyhart
  2006-05-04 12:00           ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Zoltan Menyhart @ 2006-05-04 11:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Hugh Dickins, Christoph Lameter, linux-mm, Zoltan.Menyhart

Andi Kleen wrote:

> We don't free the pages until the other CPUs have been flushed synchronously.

Do you mean the TLB entries mapping the leaf pages?
If yes, then I agree with you about them.
Yet I speak about the directory pages. Let's take an example:

Assume:
- A process with 2 threads, bound to their respective CPUs
- One of them mapped a file and
  this mapping requires a new PMD and a new PTE page
- They read in some data pages
- Time goes by without ever touching any of these pages again
- The swapper removes the data pages (data flush, TLB purge)
- (on IA64: due to the TLB pressure, the TLB entry mapping the PTE page
  gets killed)

There is no valid TLB entry concerning this mapped zone any more => the TLB
purges around "free_pgtables()" can be considered no-ops.
(In addition, walking the page tables in physical mode is insensitive to any
TLB purges.)

CPU #1 faults on attempting to touch this mapped zone.
CPU #1 starts to walk the page tables in physical mode.
Assume it has got the address of the PMD page, it is about to fetch "pmd[j]".

CPU #2 executes "free_pgtables()" in the meantime: it frees the PTE and
the PMD pages (without knowing that CPU #1 has already got a PMD pointer).

Someone else allocates these two pages and fills them in with some data.

CPU #1 now fetches "pmd[j]" from a page of someone else. Without noticing
anything, CPU #1 uses the illegal value to continue to access the PTE page.

> After the flush the other CPUs don't walk pages anymore.

Can you explain please why they do not?
There is a possibility that walking has already been started, but it has
not been completed yet, when "free_pgtables()" runs.

> The whole thing is
> batched because the synchronous flush can be pretty expensive.

Walking the page tables in physical mode is insensitive to any TLB purges,
therefore these purges do not make sure that there is no other CPU just
in the middle of page table walking.

I do a similar batching of the pages to be set free.
The RCU mechanism makes sure that these pages will not be freed before
the already started page table walkers finish their job.

Thanks,

Zoltan


* Re: RFC: RCU protected page table walking
  2006-05-04 11:32         ` Zoltan Menyhart
@ 2006-05-04 12:00           ` Andi Kleen
  2006-05-04 13:13             ` Robin Holt
  2006-05-04 13:54             ` Zoltan Menyhart
  0 siblings, 2 replies; 14+ messages in thread
From: Andi Kleen @ 2006-05-04 12:00 UTC (permalink / raw)
  To: Zoltan Menyhart
  Cc: Hugh Dickins, Christoph Lameter, linux-mm, Zoltan.Menyhart

On Thursday 04 May 2006 13:32, Zoltan Menyhart wrote:
> Andi Kleen wrote:
> 
> > We don't free the pages until the other CPUs have been flushed synchronously.
> 
> Do you mean the TLB entries mapping the leaf pages?
> If yes, then I agree with you about them.
> Yet I speak about the directory pages. Let's take an example:

x86 uses this for the directory pages too (well, for PMD/PUD - the PGD never
goes away until final exit). Actually x86-64 didn't do this fully at some
point and it resulted in a nasty-to-track-down bug. But it was fixed then.
I really went over all of this with a very fine comb back then and I'm
pretty sure it's correct now :)

> > After the flush the other CPUs don't walk pages anymore.
> 
> Can you explain please why they do not?

Because the PGD/PMD/PUD has been rewritten and they won't be able
to find the old pages anymore. They also don't have it in their
TLBs because that has been flushed.

The problem I had on x86-64 was that the AMD CPUs visibly cached
PMDs/PGDs internally.

> There is a possibility that walking has already been started, but it has
> not been completed yet, when "free_pgtables()" runs.
>

Yes, that is why we delay the freeing of the pages to prevent anything
going wrong.

> > The whole thing is
> > batched because the synchronous flush can be pretty expensive.
> 
> Walking the page tables in physical mode 

What do you mean with "physical mode"?

> is insensitive to any TLB purges, 
> therefore these purges do not make sure that there is no other CPU just
> in the middle of page table walking.

A TLB Flush stops all MMU activity - or rather waits for it to finish.

-Andi


* Re: RFC: RCU protected page table walking
  2006-05-04 12:00           ` Andi Kleen
@ 2006-05-04 13:13             ` Robin Holt
  2006-05-04 13:54             ` Zoltan Menyhart
  1 sibling, 0 replies; 14+ messages in thread
From: Robin Holt @ 2006-05-04 13:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Zoltan Menyhart, Hugh Dickins, Christoph Lameter, linux-mm,
	Zoltan.Menyhart, Chen, Kenneth W

On Thu, May 04, 2006 at 02:00:34PM +0200, Andi Kleen wrote:
> On Thursday 04 May 2006 13:32, Zoltan Menyhart wrote:
> > Walking the page tables in physical mode 
> 
> What do you mean with "physical mode"?

ia64 has a software page table walker in ivt.S.  It does its work
using physical addresses.

Robin


* Re: RFC: RCU protected page table walking
  2006-05-04 12:00           ` Andi Kleen
  2006-05-04 13:13             ` Robin Holt
@ 2006-05-04 13:54             ` Zoltan Menyhart
  2006-05-04 15:27               ` Hugh Dickins
  1 sibling, 1 reply; 14+ messages in thread
From: Zoltan Menyhart @ 2006-05-04 13:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Hugh Dickins, Christoph Lameter, linux-mm, Zoltan.Menyhart

Andi Kleen wrote:

>>>We don't free the pages until the other CPUs have been flushed synchronously.
>>
>>Do you mean the TLB entries mapping the leaf pages?
>>If yes, then I agree with you about them.
>>Yet I speak about the directory pages. Let's take an example:
> 
> x86 uses this for the directory pages too (well for PMD/PUD - PGD never
> goes away until final exit).

The i386 branch:

tlb_remove_page():
    // assuming !tlb_fast_mode(tlb)
    tlb_flush_mmu():
        tlb_flush():
            flush_tlb_mm():
                __flush_tlb();
    free_pages_and_swap_cache();

__flush_tlb():
	"movl %%cr3, %0;
	"movl %0, %%cr3;  # flush TLB

Do I understand correctly that it purges the local TLBs only?

> Actually x86-64 didn't do this fully at some point and it resulted in
> a nasty-to-track-down bug. But it was fixed then. I really went over
> all of this with a very fine comb back then and I'm pretty sure it's
> correct now :)

Can you please indicate how the page table walking of the other
CPUs is "aborted"?

>>>After the flush the other CPUs don't walk pages anymore.

Can you please point me where it is documented that the HW walkers
abort on a TLB flush / purge?

Yet I did verify that it is not (always) the case for the RISCs.

E.g. arch/ia64/kernel/ivt.S:

ENTRY(vhpt_miss)
...
	// r17 = pmd_offset(pud, addr)
// -->
(p7)    ld8 r20=[r17]	// get *pmd (may be 0)

Assume we have reached the point indicated by "// -->":
we have got a valid address for the next level.
Assume "free_pgtables()" sets free these PMD / PTE pages.
The eventual TLB flushes do not do anything to the "ld8"
going to be executed.

Can you explain please why you think that walking the

	rx = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]

chain is safe in this condition, too?

Another example in arch/ppc/kernel/head_44x.S:

	/* Data TLB Error Interrupt */
	START_EXCEPTION(DataTLBError)
...
	// r11 -> PGD or PTE page, r12 = index * sizeof(void *)
// -->
	lwzx    r11, r12, r11           /* Get pgd/pmd entry */

>>Can you explain please why they do not?
> 
> Because the PGD/PMD/PUD has been rewritten and they won't be able
> to find the old pages anymore.

As in the two examples above, the walkers have already picked up
references to the next levels, and these references were valid
at that moment.

> They also don't have it in their
> TLBs because that has been flushed.

Are you sure this is true for the RISCs, too?
Even if an architecture does not play with TLB-s before really
finding a valid PTE?

>>There is a possibility that walking has already been started, but it has
>>not been completed yet, when "free_pgtables()" runs.
> 
> Yes, that is why we delay the freeing of the pages to prevent anything
> going wrong.

Can you explain please why the already-started walks, which do not
care for the TLB flushes, can be safe?
 
> What do you mean with "physical mode"?

Not using any TLB entry (or any HW supported address translation stuff)
to translate the data addresses before they go out of the CPU.

>>is insensitive to any TLB purges, 
>>therefore these purges do not make sure that there is no other CPU just
>>in the middle of page table walking.

> A TLB Flush stops all MMU activity - or rather waits for it to finish.

This is what I am trying to say: not on all architectures.

Thanks,

Zoltan


* Re: RFC: RCU protected page table walking
  2006-05-04 13:54             ` Zoltan Menyhart
@ 2006-05-04 15:27               ` Hugh Dickins
  0 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2006-05-04 15:27 UTC (permalink / raw)
  To: Zoltan Menyhart
  Cc: Andi Kleen, Chen, Kenneth W, Christoph Lameter, linux-ia64,
	linux-mm, Zoltan.Menyhart

On Thu, 4 May 2006, Zoltan Menyhart wrote:
> Andi Kleen wrote:
> 
> > > >We don't free the pages until the other CPUs have been flushed
> > > >synchronously.
> > >
> > >Do you mean the TLB entries mapping the leaf pages?
> > >If yes, then I agree with you about them.
> > >Yet I speak about the directory pages. Let's take an example:
> > 
> > x86 uses this for the directory pages too (well for PMD/PUD - PGD never
> > goes away until final exit).
> 
> The i386 branch:
> 
> tlb_remove_page():
>    // assuming !tlb_fast_mode(tlb)
>    tlb_flush_mmu():
>        tlb_flush():
>            flush_tlb_mm():
>                __flush_tlb();
>    free_pages_and_swap_cache();
> 
> __flush_tlb():
> 	"movl %%cr3, %0;
> 	"movl %0, %%cr3;  # flush TLB
> 
> Do I understand correctly that it purges the local TLBs only?

__flush_tlb() purges the local TLBs only; but when you found the i386
or x86_64 flush_tlb_mm() calling __flush_tlb() above, you were looking
at the #ifndef CONFIG_SMP block of include/asm/tlbflush.h.  Go over to
arch/{i386,x86_64}/kernel/smp.c to see what CONFIG_SMP flush_tlb_mm does.
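
Roughly, for i386 (trimmed from the 2.6.16 tree, so do check the real thing):

	void flush_tlb_mm(struct mm_struct *mm)
	{
		cpumask_t cpu_mask;

		preempt_disable();
		cpu_mask = mm->cpu_vm_mask;
		cpu_clear(smp_processor_id(), cpu_mask);
		if (current->active_mm == mm)
			local_flush_tlb();
		if (!cpus_empty(cpu_mask))
			/* IPI the other CPUs and wait for them to acknowledge */
			flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
		preempt_enable();
	}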

> > Actually x86-64 didn't do this fully at some point and it resulted in
> > a nasty-to-track-down bug. But it was fixed then. I really went over
> > all of this with a very fine comb back then and I'm pretty sure it's
> > correct now :)
> 
> Can you please indicate how the page table walking of the other
> CPUs is "aborted"?

I cannot answer for other architectures: you need to ask the specialist
list of each architecture for its answer (or hope that a specialist in
each is already reading this thread on linux-mm).  What's certain is
that the issue is _supposed_ to be already covered safely on all arches,
hence the care which has gone into include/asm-generic/tlb.h etc.  But
you may be right that some architectures get it wrong, I cannot tell.

I've CC'ed Ken Chen and linux-ia64 (as Christoph intended to), since
that's your first concern; but I'm reluctant to CC lots of different
architecture lists together myself.

Hugh

> > > >After the flush the other CPUs don't walk pages anymore.
> 
> Can you please point me where it is documented that the HW walkers
> abort on a TLB flush / purge?
> 
> Yet I did verify that it is not (always) the case for the RISCs.
> 
> E.g. arch/ia64/kernel/ivt.S:
> 
> ENTRY(vhpt_miss)
> ...
> 	// r17 = pmd_offset(pud, addr)
> // -->
> (p7)    ld8 r20=[r17]	// get *pmd (may be 0)
> 
> Assume we have reached the point indicated by "// -->":
> we have got a valid address for the next level.
> Assume "free_pgtables()" sets free these PMD / PTE pages.
> The eventual TLB flushes do not do anything to the "ld8"
> going to be executed.
> 
> Can you explain please why you think that walking the
> 
> 	rx = ... -> pgd[i] -> pud[j] -> pmd[k] -> pte[l]
> 
> chain is safe in this condition, too?
> 
> Another example in arch/ppc/kernel/head_44x.S:
> 
> 	/* Data TLB Error Interrupt */
> 	START_EXCEPTION(DataTLBError)
> ...
> 	// r11 -> PGD or PTE page, r12 = index * sizeof(void *)
> // -->
> 	lwzx    r11, r12, r11           /* Get pgd/pmd entry */
> 
> > >Can you explain please why they do not?
> > 
> > Because the PGD/PMD/PUD has been rewritten and they won't be able
> > to find the old pages anymore.
> 
> As in the two examples above, the walkers have already picked up
> references to the next levels, and these references were valid
> at that moment.
> 
> > They also don't have it in their
> > TLBs because that has been flushed.
> 
> Are you sure this is true for the RISCs, too?
> Even if an architecture does not play with TLB-s before really
> finding a valid PTE?
> 
> > >There is a possibility that walking has already been started, but it has
> > >not been completed yet, when "free_pgtables()" runs.
> > 
> > Yes, that is why we delay the freeing of the pages to prevent anything
> > going wrong.
> 
> Can you explain please why the already-started walks, which do not
> care for the TLB flushes, can be safe?
> 
> > What do you mean with "physical mode"?
> 
> Not using any TLB entry (or any HW supported address translation stuff)
> to translate the data addresses before they go out of the CPU.
> 
> > >is insensitive to any TLB purges, therefore these purges do not make sure
> > >that there is no other CPU just
> > >in the middle of page table walking.
> 
> > A TLB Flush stops all MMU activity - or rather waits for it to finish.
> 
> This is what I am trying to say: not on all architectures.
> 
> Thanks,
> 
> Zoltan


Thread overview: 14+ messages
2006-05-03 15:31 RFC: RCU protected page table walking Zoltan Menyhart
2006-05-03 16:46 ` Andi Kleen
2006-05-03 18:00   ` Hugh Dickins
2006-05-03 23:54     ` Christoph Lameter
2006-05-04  2:51       ` Chen, Kenneth W
2006-05-04  4:28         ` Hugh Dickins
2006-05-04  9:26     ` Zoltan Menyhart
2006-05-04  9:31       ` Andi Kleen
2006-05-04 11:32         ` Zoltan Menyhart
2006-05-04 12:00           ` Andi Kleen
2006-05-04 13:13             ` Robin Holt
2006-05-04 13:54             ` Zoltan Menyhart
2006-05-04 15:27               ` Hugh Dickins
2006-05-04  9:19   ` Zoltan Menyhart
