* Q: backport of the free_pgtables tlb fixes to 2.4
@ 2002-05-23 5:14 Andrea Arcangeli
2002-05-23 6:01 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-05-23 5:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
Hello Linus,
I'm dealing with the backport of the free_pgtables() tlb flush fixes
to 2.4 (I got interrupted in the middle of it by the inode highmem
imbalance). I've seen a patch and an explanation floating around, but
as far as I can tell it's wrong, in the sense that it seems to be
completely a noop.
As far as I can see from reading the code, the only kernel bug was in
the free_pgtables path and nothing else, so in theory commenting out the
call to free_pgtables (which is not strictly required except as security
against a malicious program) should make the segfault go away too (that
could even be a valid workaround for 2.4). Also, it's not related to the
P4 cpu; it seems to be a generic bug for all archs.
In both 2.4 and 2.5 we were already releasing the pages correctly in
zap_page_range, by clearing the pte entry, then invalidating the tlb and
finally freeing the page. I can't see problems in the normal do_munmap
path unless free_pgtables triggers too.
Only in free_pgtables is there an exception: if during an munmap a whole
pgd slot is unmapped, we go to a deeper level and throw away both the pmd
and the pte too (under the now-empty pgd slot), to reclaim space and to
avoid exploits that fill the whole pagetable with unswappable metadata
without any real pages associated with it.
The problem is that while freeing the pte and pmd we don't do the
"clear; invalidate tlb; free ram" ordered, SMP-thread-safe shootdown
sequence; instead we do the racy "clear; free; invalidate tlb"
SMP-thread-unsafe ordering, and that's _THE_ bug. So the fix is simply to
extend tlb_gather_mmu/tlb_finish_mmu/tlb_remove_page to the
free_pgtables path too (while dropping the pte/pmd), and in turn to
clear_page_tables, even if only the "free_pgtables" caller really needs
it. exit_mmap obviously cannot need the enforced "clear; invalidate tlb;
free ram" ordering, because by the time exit_mmap is called there
cannot be any racy reader of the address space running in parallel
[mm_users == 0] (and higher performance for exit_mmap is taken care of
by the fast mode, or rather it was supposed to be before the fast mode
broke in the latest 2.5, see below).
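The ordered shootdown described above can be sketched as a small
userspace model (tlb_start/tlb_remove/tlb_finish and the struct below
are illustrative stand-ins for the real
tlb_gather_mmu/tlb_remove_page/tlb_finish_mmu machinery, not kernel
code):

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 64

/* Userspace model of the ordered shootdown: ptes are cleared first,
 * the pages they pointed at are only queued, and nothing is really
 * freed until after the (simulated) TLB invalidate. */
struct tlb_batch {
	void *pages[BATCH_MAX];
	int nr;
	int flushed;		/* has the TLB been invalidated yet? */
	int bad_order;		/* set if a page was freed pre-flush */
};

static void tlb_start(struct tlb_batch *b)
{
	b->nr = 0;
	b->flushed = 0;
	b->bad_order = 0;
}

/* Caller has already cleared the pte; we only defer the actual free. */
static void tlb_remove(struct tlb_batch *b, void *page)
{
	if (b->nr < BATCH_MAX)
		b->pages[b->nr++] = page;
}

/* Invalidate the TLB, and only then release the queued pages. */
static void tlb_finish(struct tlb_batch *b)
{
	b->flushed = 1;			/* stands in for flush_tlb_mm() */
	while (b->nr > 0) {
		if (!b->flushed)
			b->bad_order = 1;
		b->pages[--b->nr] = NULL;	/* simulated free */
	}
}
```

The point of the model is only the ordering: the racy variant would
release the pages before setting `flushed`, letting another thread walk
stale TLB entries into freed memory.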
I don't see why you changed the "fast mode" to 1 cpu (even without an
#ifdef SMP): if mm_users == 1 we are guaranteed that no parallel
reading of the address space we're working on can happen (we're also
guaranteed that mm_users cannot increase or decrease under us), no
matter the number of cpus. I think the "fast mode" should return to the
more efficient one of 2.4, not become "slow mode for every true SMP".
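A minimal model of the fast/slow distinction, assuming only that
mm_users == 1 means no parallel reader can exist (teardown_pte and the
event names below are hypothetical, for illustration only):

```c
#include <assert.h>

enum ev { EV_CLEAR, EV_FREE, EV_FLUSH };

struct trace { enum ev ev[8]; int n; };

static void log_ev(struct trace *t, enum ev e)
{
	t->ev[t->n++] = e;
}

/* Tear down one pte.  Fast mode (mm_users == 1) frees immediately,
 * since no parallel reader of the address space can exist; slow mode
 * defers the free until after the TLB flush. */
static void teardown_pte(struct trace *t, int mm_users)
{
	t->n = 0;
	log_ev(t, EV_CLEAR);		/* pte_clear() */
	if (mm_users == 1) {
		log_ev(t, EV_FREE);	/* fast: free right away */
		log_ev(t, EV_FLUSH);
	} else {
		log_ev(t, EV_FLUSH);	/* slow: shootdown first */
		log_ev(t, EV_FREE);
	}
}

/* Returns 1 if every free happened after a flush (the SMP-safe order). */
static int flush_before_free(const struct trace *t)
{
	int flushed = 0;
	for (int i = 0; i < t->n; i++) {
		if (t->ev[i] == EV_FLUSH)
			flushed = 1;
		if (t->ev[i] == EV_FREE && !flushed)
			return 0;
	}
	return 1;
}
```

The fast path deliberately violates the flush-before-free ordering;
the claim in the mail is precisely that with mm_users == 1 that
violation is harmless.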
About the patch floating around, I also saw random changes in leave_mm
that I cannot explain (that's in 2.5 too; I think that part of 2.5 is
superfluous as well, in fact):
 /*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
+ *
+ * We need to reload %cr3 since the page tables may be going
+ * away from under us..
  */
 static void inline leave_mm (unsigned long cpu)
 {
 	if (cpu_tlbstate[cpu].state == TLBSTATE_OK)
 		BUG();
 	clear_bit(cpu, &cpu_tlbstate[cpu].active_mm->cpu_vm_mask);
+	load_cr3(swapper_pg_dir);
 }

 /*
Explanation of why the above is superfluous: every time a task enters
the lazy state we bump the mm_users, so the active_mm is pinned and
nothing can throw away the kernel side of the pgd; hence we couldn't
care less about loading swapper_pg_dir into %cr3 for tasks in the lazy
state. It's simply not required, so I think the patch above is
superfluous. The pinned mm (if it's the last reference) is released by
the scheduler after our %cr3 has changed in switch_mm. The comment is
wrong: our page tables cannot go away under us, because a lazy task is
only allowed to use the kernel side of the address space and nothing
can throw that away.
A change like this (not from 2.5) is obviously superfluous as well:
 #define tlb_remove_page(ctxp, pte, addr) do {\
 		/* Handle the common case fast, first. */\
 		if ((ctxp)->nr == ~0UL) {\
-			__free_pte(*(pte));\
-			pte_clear((pte));\
+			pte_t __pte = *(pte);\
+			pte_clear(pte);\
+			__free_pte(__pte);\
nr == ~0UL is the single-threaded case, so again there cannot be any
parallel reader of the address space, and so it's perfectly fine to
free the page, clear the pte and finally flush the tlb. We don't care
about the ordering if we're single threaded; any ordering is ok, since
nobody can access the address space while we're working on it.
The change below is superfluous as well, once we back out the above
change to leave_mm.
@@ -51,9 +51,9 @@
 			BUG();
 		if(!test_and_set_bit(cpu, &next->cpu_vm_mask)) {
 			/* We were in lazy tlb mode and leave_mm disabled
-			 * tlb flush IPI delivery. We must flush our tlb.
+			 * tlb flush IPI delivery. We must reload %cr3.
 			 */
-			local_flush_tlb();
+			load_cr3(next->pgd);
 		}
This actually gets nearer to the real problem...
 static inline void flush_tlb_pgtables(struct mm_struct *mm,
 				      unsigned long start, unsigned long end)
 {
-	/* i386 does not keep any page table caches in TLB */
+	flush_tlb_mm(mm);
 }
[..]
 void clear_page_tables(struct mm_struct *mm, unsigned long first, int nr)
 {
 	pgd_t * page_dir = mm->pgd;
+	unsigned long last = first + nr;

 	spin_lock(&mm->page_table_lock);
 	page_dir += first;
@@ -153,6 +154,8 @@
 	} while (--nr);
 	spin_unlock(&mm->page_table_lock);

+	flush_tlb_pgtables(mm, first * PGDIR_SIZE, last * PGDIR_SIZE);
+
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
[..]
... but it's again a noop and cannot fix anything: any tlb flush done
there is still way too late, since the pte has already gone into the
freelist by that time. At best it may narrow the window for the race.
The only effective fix that nobody has backported to 2.4 yet, AFAIK, is
to avoid the race between free_pgtables and a parallel thread reading
the address space, and it consists in backporting the pte_free_tlb(tlb,
pte) in the clear_page_tables path to 2.4, nothing else and nothing
more. The zap_page_range path and the fast mode of 2.4 look perfectly
correct; only free_pgtables must start using the "slow mode that
enforces ordering" when mm_count is > 1 (either that, or commenting out
free_pgtables would be a valid workaround for the SMP race).
Could you confirm/comment or explain what the problem really is? Many
thanks!
Andrea
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 5:14 Q: backport of the free_pgtables tlb fixes to 2.4 Andrea Arcangeli @ 2002-05-23 6:01 ` Linus Torvalds 2002-05-23 19:57 ` Andrea Arcangeli 0 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2002-05-23 6:01 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel On Thu, 23 May 2002, Andrea Arcangeli wrote: > > I'm dealing with the backport of the free_pgtables() tlb flushes fixes > to 2.4 (got interrupted with it in the middle of the inode highmem > imbalance). I seen a patch and an explanation floating around but it's > wrong as far I can tell in the sense it seems completly a noop. Don't take the 2.5.x tree - those changes are bigger than necessary. They may be _cleaner_ than the quick fix, but I don't think it's worth it for 2.4.x. The suggested fix for 2.4.x is basically: - enable pmd quicklists - do a TLB flush _before_ doing the check_pgt_cache() that might free up the quicklist, and make sure that nothing else runs on that CPU that might get something from the quicklist (interrupts don't do it, so the only thing to make sure is that we don't get preempted). That should do it. Intel made a patch that they've apparently tested that does this, I think. > Also it's not related to the P4 cpu, it's just a generic bug for all > archs it seems. Well, the hardware has to do a hw tlb walk, so that cuts down the affected archs to not very many. And the hw has to do a fair amount of speculation, or support SMP, so that cuts it down even further. But yes, in theory the bug is generic. > In both 2.4 and 2.5 we were just doing the release of the pages > correctly in zap_page_range, by clearing the pte entry, then > invalidating the tlb and finally freeing the page, I can't see problems > in the normal do_munmap path unless free_pgtables triggers too. 
Right, it's free_pgtables that matters, and we need to do the same "deferred free" (deferred until the TLB invalidate) as for the actual pages themselves. "pmd_quicklist" can act as such a deferral method. In 2.5.x, I wanted to make the deferral explicit, which is why the 2.5.x approach re-did the thing to make the whole "TLB gather" cover the page table freeing too. > So the fix is simply to > extend the tlb_gather_mm/tlb_finish_mmu/tlb_remove_page to the > free_pgtables path too (while dropping pte/pmd), so in turn to > clear_page_tables, even if only the "free_pgtables" caller really needs > it. This is indeed exactly what 2.5.x does, but see above about it. > I don't see why you changed the "fast mode" to 1 cpu (even without an > #ifdef SMP), if the mm_users is == 1 we are just guaranteed that no > parallel reading of the address space that we're working on can happen Yes, but there is ANOTHER race, and this is actually the much more likely one, and the one that actually happens: CPU1 CPU2 munmap .. speculation starts .. .. TLB looks up pgd entry .. clear pgd entry free pmd alloc page - get old pmd scribble on page .. TLB looks up pmd entry .. .. tlb fill ends ... invalidate_tlb CPU2 can be doing something completely _unrelated_ and not have the same MM at all. The bug happens entirely within the TLB contents of CPU1, simply because the CPU1 speculation started a TLB fill, which looked up the PGD entry _before_ it was cleared, but because of a cache miss got delayed at the point where it was going to look up the PTE entry. In the meantime, CPU1 free'd the page, and another CPU scribbled stuff on it. So we not have an invalid entry in the TLB on cpu1 - and while we will invalidate the TLB, if the entry had the global bit set that bad random entry would NOT get invalidated. This, btw, is why in practice the thing only shows up on a P4. Only a P4 has deep enough speculation queues that this bug actually happens. 
Intel apparently had this trace for 2000 (yes, two THOUSAND) cycles that showed this asynchronous TLB lookup. > About the patch floating around I also seen random changes in leave_mm > that I cannot explain (that's in 2.5 too, I think that part of 2.5 > is superflous too infact) Here the issue is: CPU1 CPU2 Lazy TLB of same MM as on CPU1 munmap() .. start speculative TLB fetch ... .. free_pgtable .. invalidate -> crosscall tlb_IPI We're lazy, nothing to do free page alloc page scribble on it .. speculative TLB lookup ends .. get a bogus TLB entry with G bit According to intel, the _only_ thing that serializes the TLB fills is to do a TLB invalidate (either invlpg or mov->cr3). So not even the cross-call itself necessarily does anything to the background TLB fetch. So even the lazy case needs to do that, at which point it is just as well to just move to another stabler page table (it's actually faster than doing the cr3<->cr3 dance). NOTE! This was not seen in any traces, but Intel was worried. > explanation of why the above is superflous: every time a task enters the > lazy state we bump the mm_users so the active_mm is pinned somehow and > nothing can throw away the kernel side of the pgd The page tables are freed, the same race can occur. > A change like this (not from 2.5) as well is obviously superflous: That's the intel patch - they just prefer that order, but they admitted it doesn't matter. > This below change as well is superflous after we backout the above > change to leave_mm. Don't back it out. > This actually gets more near to the real problem... > > static inline void flush_tlb_pgtables(struct mm_struct *mm, > unsigned long start, unsigned long end) > { > - /* i386 does not keep any page table caches in TLB */ > + flush_tlb_mm(mm); > } The above, _together_ with moving it to before the check_pgt_cache() (and removing some other TLB flushes that are now superfluous). In short, the Intel patch is good. 
Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
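The 2.4 fix Linus describes above — park freed pte pages on a
quicklist, flush the TLB, and only then let check_pgt_cache() hand them
back to the allocator — can be modelled in userspace like this (the
function names mirror the 2.4 ones, but the bodies are an illustrative
sketch, not the kernel implementation):

```c
#include <assert.h>
#include <stddef.h>

#define QUICK_MAX 16

struct cpu_state {
	void *quicklist[QUICK_MAX];
	int quick_nr;
	int tlb_flushed;
	int unsafe_free;	/* a pte page hit the allocator pre-flush */
};

/* pte_free(): defer by parking the page on the quicklist. */
static void pte_free(struct cpu_state *c, void *pte_page)
{
	if (c->quick_nr < QUICK_MAX)
		c->quicklist[c->quick_nr++] = pte_page;
}

static void flush_tlb_mm(struct cpu_state *c)
{
	c->tlb_flushed = 1;	/* simulated invalidate */
}

/* check_pgt_cache(): the only path back to the page allocator; safe
 * only because the caller flushed the TLB first. */
static void check_pgt_cache(struct cpu_state *c)
{
	while (c->quick_nr > 0) {
		if (!c->tlb_flushed)
			c->unsafe_free = 1;
		c->quicklist[--c->quick_nr] = NULL;	/* real free */
	}
}
```

This is the same deferral idea as the 2.5 "TLB gather", just
implemented through the existing quicklist instead of an explicit
batch structure.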
* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 6:01 ` Linus Torvalds @ 2002-05-23 19:57 ` Andrea Arcangeli 2002-05-23 20:05 ` Linus Torvalds 0 siblings, 1 reply; 13+ messages in thread From: Andrea Arcangeli @ 2002-05-23 19:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Wed, May 22, 2002 at 11:01:28PM -0700, Linus Torvalds wrote: > > > On Thu, 23 May 2002, Andrea Arcangeli wrote: > > > > I'm dealing with the backport of the free_pgtables() tlb flushes fixes > > to 2.4 (got interrupted with it in the middle of the inode highmem > > imbalance). I seen a patch and an explanation floating around but it's > > wrong as far I can tell in the sense it seems completly a noop. > > Don't take the 2.5.x tree - those changes are bigger than necessary. They > may be _cleaner_ than the quick fix, but I don't think it's worth it for > 2.4.x. > > The suggested fix for 2.4.x is basically: > - enable pmd quicklists actually the pte_quicklist only, the 4 pmd are always associated with the pgd even with pae. > - do a TLB flush _before_ doing the check_pgt_cache() that might free up > the quicklist, and make sure that nothing else runs on that CPU that I see how the patch is supposed to work, such subtle dependency on doing the check_pgt_cache after the tlb invalidate wasn't commented in the patch, it deserves a fat comment or it's really hard to guess we depend on it. 
the main reason being that pte_free semantics are really "free like if I put back in the freelist", not "don't touch it until check_pgt_cache", because remember the pte_free clobbers the contents of the pte with "random" data, this random data happens to have the property on x86 that all the last 2 bits are always zero, so the last two bits aren't random, and with this conicidence plus the other coincidence that the invalid bit of the pte is the less significant bit in the pte entry, it can actually work on x86, but it can very well not work on other archs if the invalid bit isn't the last one or if they stores something else and not a virtual addess in the pte entry chain. In short the "pte_free" must clobber the pte in a way that lefts it invalid, so that we're guaranteed to generate a page fault, this requirement also deserves a big fat comment. I couldn't guess the dependency of the fix on the lowlevel details of the quicklist releasing, on the invalid bits of the pte and finally on the bits guaranteed to be zero, not sure if that was just obvious to all the arch maintainers. > might get something from the quicklist (interrupts don't do it, so the > only thing to make sure is that we don't get preempted). > > That should do it. Intel made a patch that they've apparently tested that > does this, I think. > > > Also it's not related to the P4 cpu, it's just a generic bug for all > > archs it seems. > > Well, the hardware has to do a hw tlb walk, so that cuts down the affected > archs to not very many. And the hw has to do a fair amount of speculation, > or support SMP, so that cuts it down even further. > > But yes, in theory the bug is generic. > > > In both 2.4 and 2.5 we were just doing the release of the pages > > correctly in zap_page_range, by clearing the pte entry, then > > invalidating the tlb and finally freeing the page, I can't see problems > > in the normal do_munmap path unless free_pgtables triggers too. 
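The bit-level coincidence described above can be checked directly: the
quicklist links freed pte pages through a page-aligned virtual address,
and page alignment forces bit 0 to zero, which on x86 is exactly the
pte Present bit, so a speculative walk that reads the link word sees a
not-present entry (the constants below are the usual x86 values; the
helper functions are illustrative, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096UL
#define PTE_PRESENT	0x1UL	/* bit 0 of an x86 pte */

/* A hardware walker reading this word as a pte sees it not-present. */
static int looks_not_present(uintptr_t pte_word)
{
	return (pte_word & PTE_PRESENT) == 0;
}

/* Page-align an address the way the quicklist link word is aligned. */
static uintptr_t page_align(uintptr_t addr)
{
	return addr & ~(PAGE_SIZE - 1);
}
```

On an arch where the invalid bit isn't bit 0, or where the link word
isn't a virtual address, this coincidence breaks down — which is
exactly Andrea's portability worry.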
> > Right, it's free_pgtables that matters, and we need to do the same > "deferred free" (deferred until the TLB invalidate) as for the actual > pages themselves. "pmd_quicklist" can act as such a deferral method. > > In 2.5.x, I wanted to make the deferral explicit, which is why the 2.5.x > approach re-did the thing to make the whole "TLB gather" cover the page > table freeing too. > > > So the fix is simply to > > extend the tlb_gather_mm/tlb_finish_mmu/tlb_remove_page to the > > free_pgtables path too (while dropping pte/pmd), so in turn to > > clear_page_tables, even if only the "free_pgtables" caller really needs > > it. > > This is indeed exactly what 2.5.x does, but see above about it. > > > I don't see why you changed the "fast mode" to 1 cpu (even without an > > #ifdef SMP), if the mm_users is == 1 we are just guaranteed that no > > parallel reading of the address space that we're working on can happen > > Yes, but there is ANOTHER race, and this is actually the much more likely > one, and the one that actually happens: > > CPU1 CPU2 > > munmap > .. speculation starts .. the question is: can you explain how the speculative tlb fill can start? see below. > .. TLB looks up pgd entry .. > clear pgd entry > free pmd > > alloc page - get old pmd > scribble on page > > .. TLB looks up pmd entry .. > .. tlb fill ends ... > invalidate_tlb ^^^^^^^^^^^^^^ I assume the userspace access could be imagined right after the invalidate_tlb in the above example, and that's the one supposed to trigger the speculative tlb fill but how can it pass the invalidate_tlb? see below. > CPU2 can be doing something completely _unrelated_ and not have the same > MM at all. The bug happens entirely within the TLB contents of CPU1, > simply because the CPU1 speculation started a TLB fill, which looked up > the PGD entry _before_ it was cleared, but because of a cache miss got > delayed at the point where it was going to look up the PTE entry. 
In the > meantime, CPU1 free'd the page, and another CPU scribbled stuff on it. > > So we not have an invalid entry in the TLB on cpu1 - and while we will > invalidate the TLB, if the entry had the global bit set that bad random > entry would NOT get invalidated. > > This, btw, is why in practice the thing only shows up on a P4. Only a P4 > has deep enough speculation queues that this bug actually happens. Intel > apparently had this trace for 2000 (yes, two THOUSAND) cycles that showed > this asynchronous TLB lookup. In all cases either the 2.4 fix is wrong, or 2.5 is still overkill while freeing the pages (not the pagetables), while freeing the pages the fastmode can still be mm_users == 1 (there's no risk of caching a global tlb entry with the pages, they're just data, not metadata, and the data will be invalidated from the cpu during the tlb flush). I also assume that an irq will force a restart of the TLB fill, if it doesn't then the same race in freeing the pagetables can happen with only one cpu too (again assuming invalidate_tlb isn't enough that I don't think it's the case). > > > About the patch floating around I also seen random changes in leave_mm > > that I cannot explain (that's in 2.5 too, I think that part of 2.5 > > is superflous too infact) > > Here the issue is: > > CPU1 CPU2 > > Lazy TLB of same MM as on CPU1 > munmap() .. start speculative TLB fetch ... > .. free_pgtable > .. invalidate -> crosscall tlb_IPI > We're lazy, nothing to do > free page > alloc page > scribble on it > .. speculative TLB lookup ends .. > get a bogus TLB entry with G bit > > > According to intel, the _only_ thing that serializes the TLB fills is to ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > do a TLB invalidate (either invlpg or mov->cr3). 
So not even the ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^ If that's true then the previous race (the number 2, where I wrote "see below") cannot happen and we can return to do the fastmode mm_users == 1 check also for the pte/pmd freeing (not only for the page freeing). They are confirming the TLB fill is serialized by the overwrite of mov->cr3 and the invalidate_tlb _before_ the cpu-local userspace access underlined in the previous case2 will be a strong barrier for any tlb fill, so the tlb fill cannot be speculated across it and we can fastmode with mm_users == 1. > cross-call itself necessarily does anything to the background TLB fetch. If smp_num_cpus == 1 or #ifndef CONFIG_SMP can use the fastmode for bug number 2, then it means the irq handler is a cpu-local strong barrier as strong as the invalidate_tlb (otherwise in case 2 again the race could trigger on an UP machine if the irq handler aloocates the page [skb or whatever] and scribbles over it). OTOH as said case 2 cannot trigger anyways even on SMP because there's the invalidate_tlb that forbids the speculative tlb fill to pass, so if the irq is a barrier for the tlb fill or not is not obvious from case2. But regardless of case2, I think for UP single threaded transparency all the IRQs should be strong barriers to any speculative activity, so I think the IPI irq should as well be a strong barrier that forbids the speculative TLB to pass. So I'm not convinced the above is necessary. The only required thing for this last case3 is that the order is pte_clear, invalidate_tlb and finally free_page and that's guaranteed by the fastmode because mm_users > 1 (there's an active lazy mm of the same mm in cpu2), and the invalidate_tlb will make sure any tlb fill is restarted before the page can be freed and in turn before the pte can be scribbled by cpu 1 (or any other cpu in the system). 
otherwise it means the irq isn't a barrier for the speculative tlb fill as it should for UP transparency of the speculative actions (all speculative actions should become visible only with smp effects). > > So even the lazy case needs to do that, at which point it is just as > well > to just move to another stabler page table (it's actually faster than > doing the cr3<->cr3 dance). > > NOTE! This was not seen in any traces, but Intel was worried. > > > explanation of why the above is superflous: every time a task enters the > > lazy state we bump the mm_users so the active_mm is pinned somehow and > > nothing can throw away the kernel side of the pgd > > The page tables are freed, the same race can occur. > > > A change like this (not from 2.5) as well is obviously superflous: > > That's the intel patch - they just prefer that order, but they admitted it > doesn't matter. ok. > > > This below change as well is superflous after we backout the above > > change to leave_mm. > > Don't back it out. > > > This actually gets more near to the real problem... > > > > static inline void flush_tlb_pgtables(struct mm_struct *mm, > > unsigned long start, unsigned long end) > > { > > - /* i386 does not keep any page table caches in TLB */ > > + flush_tlb_mm(mm); > > } > > The above, _together_ with moving it to before the check_pgt_cache() (and > removing some other TLB flushes that are now superfluous). > > In short, the Intel patch is good. thanks for the whole explanation, this just makes many things clear, the approch in the patch floating around definitely can fix free_pgtables (case1) [I'd say a bit by luck because the pte still are overwritten in pte_free], and it incidentally fixes case2 as well (plus it is more efficient than 2.4 by lefting the fastmode for pages mm_users == 1). 
But I'm not really convinced that the tlb fill can pass the IPI irq in case 3 (so I'm not convinced leave_mm is needed), and I think the 2.4 patch incidentally takes care of case 2 too, but again I don't see what's the problem of case2 in doing the fastmode for pte too (not only for pages, where for pages it is certainly safe, or better at worse we can read out of the ram address spce, potentially allocating an alias cacheline on the gart but that's ok as far as it's a read-only cacheline, and speculative read activity shouldn't really allocate writeback buffered cachelines that would cause lost coherency to two aliased phys addresses) Comments? Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 19:57 ` Andrea Arcangeli @ 2002-05-23 20:05 ` Linus Torvalds 2002-05-23 20:41 ` Andrea Arcangeli 0 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2002-05-23 20:05 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel On Thu, 23 May 2002, Andrea Arcangeli wrote: > > munmap > > .. speculation starts .. > > the question is: can you explain how the speculative tlb fill can start? Any indirect branch can be (and will be) predicted using the BTB. The speculation starts before the BTB contents have actually been verified, resulting in iTLB speculation. Since the BTB can (and does) contain mostly user addresses (from previous execution in user land), it's apparently quite common to speculatively fetch user TLB entries even when you're in kernel mode. (This is also, btw, probably anothre reason why you only see this bug in practice on a P4: much bigger BTB) > see below. > > > .. TLB looks up pgd entry .. > > clear pgd entry > > free pmd > > > > alloc page - get old pmd > > scribble on page > > > > .. TLB looks up pmd entry .. > > .. tlb fill ends ... > > invalidate_tlb > ^^^^^^^^^^^^^^ > > I assume the userspace access could be imagined right after the > invalidate_tlb in the above example, and that's the one supposed to > trigger the speculative tlb fill but how can it pass the invalidate_tlb? > see below. It doesn't pass the invalidate_tlb. By the time the invalidate_tlb happens, the TLB fill has already finished, and has already picked up garbage. READ my explanation. The garbage can (and does) contain the Global bit, so even though we then flush the TLB, the garbage remains. > In all cases either the 2.4 fix is wrong, No. Understand the patch, _then_ complain. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 20:05 ` Linus Torvalds @ 2002-05-23 20:41 ` Andrea Arcangeli 2002-05-23 19:53 ` Martin Dalecki 2002-05-23 22:04 ` Linus Torvalds 0 siblings, 2 replies; 13+ messages in thread From: Andrea Arcangeli @ 2002-05-23 20:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Thu, May 23, 2002 at 01:05:53PM -0700, Linus Torvalds wrote: > > On Thu, 23 May 2002, Andrea Arcangeli wrote: > > > munmap > > > .. speculation starts .. > > > > the question is: can you explain how the speculative tlb fill can start? > > Any indirect branch can be (and will be) predicted using the BTB. The > speculation starts before the BTB contents have actually been verified, > resulting in iTLB speculation. > > Since the BTB can (and does) contain mostly user addresses (from previous > execution in user land), it's apparently quite common to speculatively > fetch user TLB entries even when you're in kernel mode. > > (This is also, btw, probably anothre reason why you only see this bug in > practice on a P4: much bigger BTB) > > > see below. > > > > > .. TLB looks up pgd entry .. > > > clear pgd entry > > > free pmd > > > > > > alloc page - get old pmd > > > scribble on page > > > > > > .. TLB looks up pmd entry .. > > > .. tlb fill ends ... > > > invalidate_tlb > > ^^^^^^^^^^^^^^ > > > > I assume the userspace access could be imagined right after the > > invalidate_tlb in the above example, and that's the one supposed to > > trigger the speculative tlb fill but how can it pass the invalidate_tlb? > > see below. > > It doesn't pass the invalidate_tlb. > > By the time the invalidate_tlb happens, the TLB fill has already finished, > and has already picked up garbage. If the userspace tlb lookup is started during munmap the tlb can contain garabge before invalidate_tlb. What I don't understand is how the BTB can invoke random userspace tlb fills when we are running do_munmap, there's no point at all in doing that. 
If the cpu see a read of an user address after invalidate_tlb, the tlb must not be started because it's before an invalidate_tlb. And if it's true not even irq are barriers for the tlb fills invoked by this p4-BTB thing, so if leave_mm is really necessary, then 2.5 is as well wrong in UP, because the pagetable can be scribbled by irqs in a UP machine, and so the fastmode must go away even in 1 cpu systems. > > READ my explanation. The garbage can (and does) contain the Global bit, so > even though we then flush the TLB, the garbage remains. > > > In all cases either the 2.4 fix is wrong, > > No. Understand the patch, _then_ complain. Above i'm not saying in absolute that the 2.4 fix is wrong, I'm saying either 2.4 fix is wrong, or 2.5 must be overkill in using the tlb shootdown for tasks with mm_users == 1, while releasing the _pages_, not the _pagetables_. the 2.4 patch infact only enforces the ordering with the pagetables, never with the pagetables if the mm_users == 1. I reaffirm all the questions in my previous email except the "how the tlb fill is stared in case2", if there's this BTB thing in the p4 that is filling randomly tlb entries for user addresses any time during any kernel code, that will perfectly explain how case2 triggers (but I'm pretty sure that's a p4-only peculiarity and I don't think case2 can happen in any other cpu out there because the tlb flush should forbid the cpu to go ahead, or in the worst case a smb_mb() before invalidate_tlb should be enough to forbid the cpu to see userspace addresses before the pte is clear, and after the pmd entry is clear it doesn't matter where's the pagetables). 
I see contraddictions in the code: 1) between case2 UP-fastmode and case3 leave_mm (if leave_mm is needed then fastmode is buggy in 2.5) 2) between fastmode == 1cpu used also for the pages in 2.5 and not used for the pages in 2.4 so either 2.4 is buggy or 2.5 is overkill So something is definitely still either overkill on one side or wrong on the other side for both the above things (plus the fact I'm taking as an assumption that this BTB thing can start tlb fills anytime regardless if the cpu is allowed to speculate on userspace addresses or not, but I'm ok to assume it as a p4 peculiarity). If all the code floating around would be coherent then things would make more sense. At the current state of things I cannot tell what is right and I am sure something is still wrong. So I don't feel my questions are superflous, and this is not a matter of understanding the code, I think I'm not missing anything in the code, I'm missing something on the hardware details of the p4 cpus instead and I doubt that's documented in the specs and anyways it's faster to ask to learn those lowlevel details. The one definitely "software" thing is the free_pgtables bug (case1), and that's clear and fixed by both 2.5 and the 2.4 patch via the quicklists, we're left with the contraddiction with and the fastmode differences between case2 and case3 in both 2.4 patch and 2.5, and that's hardware side, not software side. Infact I'm still not excluding the possibility that what is been found in the traces is case 1, and if case2 and case3 can really trigger. Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 20:41 ` Andrea Arcangeli @ 2002-05-23 19:53 ` Martin Dalecki 2002-05-23 21:15 ` Andrea Arcangeli 2002-05-23 22:04 ` Linus Torvalds 1 sibling, 1 reply; 13+ messages in thread From: Martin Dalecki @ 2002-05-23 19:53 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel Uz.ytkownik Andrea Arcangeli napisa?: > What I don't understand is how the BTB can invoke random userspace tlb > fills when we are running do_munmap, there's no point at all in doing > that. If the cpu see a read of an user address after invalidate_tlb, > the tlb must not be started because it's before an invalidate_tlb. > > And if it's true not even irq are barriers for the tlb fills invoked by > this p4-BTB thing, so if leave_mm is really necessary, then 2.5 is as > well wrong in UP, because the pagetable can be scribbled by irqs in a UP > machine, and so the fastmode must go away even in 1 cpu systems. I for one would be really really surprised if the execution of an interrupt isn't treating the BTB specially. If one reads about CPU validation "exception handling" aka irq handling is something that is paramount there. Hard to beleve they would implement software IRQ commands by not just toggling the IRQ input line of the chip themself. This safes testing. But it may be as well indeed just "accidental" that system call gates are implemented on recent ia32 systems by an op code which belongs to the IRQ handling family... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 19:53     ` Martin Dalecki
@ 2002-05-23 21:15       ` Andrea Arcangeli
  2002-05-23 20:40         ` Martin Dalecki
  0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-05-23 21:15 UTC (permalink / raw)
To: Martin Dalecki; +Cc: Linus Torvalds, linux-kernel

On Thu, May 23, 2002 at 09:53:44PM +0200, Martin Dalecki wrote:
> Andrea Arcangeli wrote:
>
> >What I don't understand is how the BTB can invoke random userspace tlb
> >fills when we are running do_munmap, there's no point at all in doing
> >that. If the cpu sees a read of an user address after invalidate_tlb,
> >the tlb fill must not be started because it's before an invalidate_tlb.
> >
> >And if it's true that not even irqs are barriers for the tlb fills
> >invoked by this p4-BTB thing, so if leave_mm is really necessary, then
> >2.5 is as well wrong in UP, because the pagetable can be scribbled by
> >irqs in a UP machine, and so the fastmode must go away even in 1 cpu
> >systems.
>
> I for one would be really really surprised if the execution of an
> interrupt isn't treating the BTB specially. If one reads

me too of course. If an irq isn't making the speculative actions of the
BTB UP-transparent, then 2.5 is still buggy in allowing the fast mode in
UP machines (because the irq can allocate the pagetable and scribble
over it, so then the tlb will be filled with global garbage). To make
things more clear, this is what will happen right now in 2.5 if the irq
isn't serializing the BTB speculative tlb fills:

	CPU1

	munmap
	.. speculation starts ..
	.. TLB reads pmd entry, so it now knows the phys address of the pte ..

	clear pmd entry
	free pte
	(doesn't matter if we clear the pmd entry or if we free the pte first)

	irq fired, BTB speculative actions aren't stopped, they run
	speculatively in parallel to the irq
		alloc page - get old pte
		scribble on pte

	.. TLB reads the contents of the pte at the now-invalid phys address ..
	.. tlb fill ends and we filled the tlb with random pte contents marked global ..

If instead the irq is serializing the BTB actions as expected (the
invariant is that a UP machine will never see any speculative action
internally; speculation is a problem only with SMP on shared memory or
while talking with hardware devices outside the local cpu), then it
means the above cannot happen, so 2.5 isn't buggy in allowing the
fastmode with 1 cpu systems, but then it also means 2.5 is overkill in
the leave_mm hack and so we can drop it.

Andrea

^ permalink raw reply	[flat|nested] 13+ messages in thread
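[Editor's note: the "clear entry; invalidate tlb; only then free" ordering
that the thread keeps returning to can be illustrated with a minimal
userspace sketch. All names below (`tlb_gather`, `tlb_remove_page`,
`tlb_finish_mmu`) are borrowed for flavor and are NOT the real kernel API;
the TLB flush is only simulated by a flag.]

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace sketch of the safe shootdown ordering:
 * 1) clear the page-table entry, 2) invalidate the tlb, 3) free the
 * page.  The buggy ordering freed the page before the flush, letting a
 * concurrent speculative tlb fill read freed (possibly reallocated and
 * scribbled-on) memory. */

#define GATHER_MAX 8

struct tlb_gather {
    int nr;                    /* pages queued, waiting for the flush */
    void *pages[GATHER_MAX];
    int flushed;               /* stands in for invalidate_tlb() */
};

static void tlb_gather_init(struct tlb_gather *tlb)
{
    tlb->nr = 0;
    tlb->flushed = 0;
}

/* Clear the (simulated) pte first, then queue the page: the page must
 * NOT be freed before the flush -- that deferral is the whole fix. */
static void tlb_remove_page(struct tlb_gather *tlb, void **pte)
{
    void *page = *pte;
    *pte = NULL;                    /* step 1: clear the entry */
    tlb->pages[tlb->nr++] = page;   /* defer the free */
}

/* Step 2: invalidate the tlb; only then, step 3: release the pages. */
static int tlb_finish_mmu(struct tlb_gather *tlb)
{
    int freed = 0;
    tlb->flushed = 1;               /* simulated tlb flush */
    while (tlb->nr > 0) {
        free(tlb->pages[--tlb->nr]);
        freed++;
    }
    return freed;
}
```

Note the invariant the sketch enforces: between `tlb_remove_page` and
`tlb_finish_mmu` the entry is already cleared but the memory is still
valid, so a late speculative walk can at worst read a stale-but-sane
page table, never garbage.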
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 21:15       ` Andrea Arcangeli
@ 2002-05-23 20:40         ` Martin Dalecki
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Dalecki @ 2002-05-23 20:40 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, enginer

Andrea Arcangeli wrote:
> On Thu, May 23, 2002 at 09:53:44PM +0200, Martin Dalecki wrote:
>
>>I for one would be really really surprised if the execution of an
>>interrupt isn't treating the BTB specially. If one reads
>
> me too of course. If an irq isn't making the speculative actions of the
> BTB UP-transparent, then 2.5 is still buggy in allowing the fast mode in
> UP machines (because the irq can allocate the pagetable and scribble
> over it, so then the tlb will be filled with global garbage). To make
> things more clear, this is what will happen right now in 2.5 if the irq
> isn't serializing the BTB speculative tlb fills:
>
> 	CPU1
>
> 	munmap
> 	.. speculation starts ..
> 	.. TLB reads pmd entry, so it now knows the phys address of the pte ..
>
> 	clear pmd entry
> 	free pte
> 	(doesn't matter if we clear the pmd entry or if we free the pte first)
>
> 	irq fired, BTB speculative actions aren't stopped, they run
> 	speculatively in parallel to the irq
> 		alloc page - get old pte
> 		scribble on pte
>
> 	.. TLB reads the contents of the pte at the now-invalid phys address ..
> 	.. tlb fill ends and we filled the tlb with random pte contents marked global ..
>
> If instead the irq is serializing the BTB actions as expected (the
> invariant is that a UP machine will never see any speculative action
> internally; speculation is a problem only with SMP on shared memory or
> while talking with hardware devices outside the local cpu), then it
> means the above cannot happen, so 2.5 isn't buggy in allowing the
> fastmode with 1 cpu systems, but then it also means 2.5 is overkill in
> the leave_mm hack and so we can drop it.

Wait a moment please. The explanation above is very nice, but
unfortunately I have some speculation to add to the game. Let's take the
whole "hyper threading" stuff into account. The HT variant of the P4 was
released just a few weeks or months after the normal one. Let's take the
following into account:

1. CPU validation takes years these days,

2. it is the most expensive part, in terms of time and perhaps money, of
   the cpu design game,

3. HT only takes several percent (around 5) of the silicon die to
   implement, which is likely comparatively cheap in regard to point 2,

4. HT validation does something between doubling and quadrupling this
   whole effort.

Then it very well may be that the fscking P4 contains the hyper
threading silicon even in the UP-marketed version. It's likely just an
"early stepping" and they disabled HT there by making some Zener diode
kaputt. So it could very well be that the guys there just didn't do full
checks on the behaviour in this "corner" UP case, or didn't notice that
something changed. Or didn't care after looking around at OS source
code. And the P4 has to be dealt with precisely the same way the
hyper-threaded variant behaves...

The longer I think about it, the more I tend toward the above
hypothesis... But unfortunately I can't give you definitive answers of
course. Well, the level of "tend toward it" is in the range of: "If I
had to bet my life on it I certainly wouldn't" - and I consider myself
quite courageous. Multiply this by the number of Linux users, interrupts
and the depth of the P4 pipelines, and well, it turns out that well...
2.5 is most likely broken on P4.

Boy, I would love to truly know about this! Intel - do you listen to
this small humble prayer?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 20:41   ` Andrea Arcangeli
  2002-05-23 19:53     ` Martin Dalecki
@ 2002-05-23 22:04     ` Linus Torvalds
  2002-05-23 23:22       ` Andrea Arcangeli
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2002-05-23 22:04 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel

On Thu, 23 May 2002, Andrea Arcangeli wrote:
>
> If the userspace tlb lookup is started during munmap the tlb can contain
> garbage before invalidate_tlb.

No. If we wait until after the TLB fill to actually free the page table
pages, there is _no_ way the TLB can contain garbage, because the page
directories will never have had garbage in them while any TLB lookup
could be active.

Which is the whole _point_ of the patches.

> What I don't understand is how the BTB can invoke random userspace tlb
> fills when we are running do_munmap, there's no point at all in doing
> that. If the cpu sees a read of an user address after invalidate_tlb,
> the tlb fill must not be started because it's before an invalidate_tlb.

Take a course in CPU design if you want to understand why a CPU
front-end might speculatively start accessing something before the
back-end has actually told it what the "something" actually is.

But don't argue with the patch.

> And if it's true that not even irqs are barriers for the tlb fills
> invoked by this p4-BTB thing

It has nothing to do with the BTB - the BTB is just a source of
speculative addresses to start looking at.

But yes, Intel tells me that the only thing that is guaranteed to
serialize a TLB lookup is a TLB invalidate. NOTHING else.

> so if leave_mm is really necessary, then 2.5 is as
> well wrong in UP, because the pagetable can be scribbled by irqs in a UP
> machine, and so the fastmode must go away even in 1 cpu systems.

Yes. Except I will make the 2.5.x code use the pmd quicklists instead
(both fast and slow mode), since that actually ends up being "nicer"
from a cross-architecture standpoint (right now the i386 careful mode
depends on the fact that page directories are regular pages - which is
not true on other CPUs).

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 22:04     ` Linus Torvalds
@ 2002-05-23 23:22       ` Andrea Arcangeli
  2002-05-23 23:51         ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-05-23 23:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Thu, May 23, 2002 at 03:04:36PM -0700, Linus Torvalds wrote:
>
> On Thu, 23 May 2002, Andrea Arcangeli wrote:
> >
> > If the userspace tlb lookup is started during munmap the tlb can contain
> > garbage before invalidate_tlb.
>
> No.

Above I just repeated what you said as confirmation of your:

	By the time the invalidate_tlb happens, the TLB fill has already
	finished, and has already picked up garbage.

Not sure why you say "no" about it.

> If we wait until after the TLB fill to actually free the page table
> pages, there is _no_ way the TLB can contain garbage, because the page
> directories will never have had garbage in them while any TLB lookup
> could be active.

Agreed, doing the safe ordering always is clearly safe, just overkill
for example while freeing the pages (not the pagetables); in fact the
2.4 patch floating around allows the fastmode with mm_users == 1 while
freeing the pages. Probably it's not a significant optimization, but
maybe mostly for documentation reasons we could resurrect the
mm_users == 1 fastmode while freeing pages (while freeing pagetables the
fastmode must go away completely anyway, that was the bug).

> > What I don't understand is how the BTB can invoke random userspace tlb
> > fills when we are running do_munmap, there's no point at all in doing
> > that. If the cpu sees a read of an user address after invalidate_tlb,
> > the tlb fill must not be started because it's before an invalidate_tlb.
>
> Take a course in CPU design if you want to understand why a CPU
> front-end might speculatively start accessing something before the
> back-end has actually told it what the "something" actually is.
>
> But don't argue with the patch.

If I'm arguing, it's just because until now there was a lack of
coherency in the explanations and in the code, so I was simply stuck and
I couldn't extract something that made complete sense to me; there was
still some contradiction. Now that you admitted 2.5 is buggy in one
place I can finally drop a collision and things make complete sense
again. Without your help I could never have understood the changes (like
leave_mm) that were going on in 2.5 and their 2.4 equivalent, thanks.

> > And if it's true that not even irqs are barriers for the tlb fills
> > invoked by this p4-BTB thing
>
> It has nothing to do with the BTB - the BTB is just a source of
> speculative addresses to start looking at.
>
> But yes, Intel tells me that the only thing that is guaranteed to
> serialize a TLB lookup is a TLB invalidate. NOTHING else.

If the only thing that serializes the speculative tlb fill is the tlb
invalidate, then the irq won't serialize it. Probably due to the lack of
any course in out of order CPU design, I'm not aware of any other cpu
(not only x86) out there that exposes the internal out of order
speculative actions of the CPU to the local CPU stream. I recall you
even said once that out of order cpus are hard to make exactly because
it's hard to take care of the serialization across irqs to make all out
of order actions transparent to the local cpu stream. I'd be curious to
know how many other cpus have this new "feature".

BTW, about the quicklists, I think they were nicer to have than just the
per-cpu page allocator in front of the global page allocator (the
allocator affinity is at a lower layer, much easier to be polluted with
non-per-cpu stuff than the pte quicklists). So I'm not really against
resurrecting them, however I suggest building the pte chain always with
a list on the page struct and never overwriting the contents of the pte
like happens in most current 2.4 quicklists (I partly fixed that in
pte-highmem). This way it is more general. Then, once it's documented
that the shrinkage of the quicklist must be done after the tlb flush,
we're fine, like in 2.4 as you said.

Last but not least, the tlb shootdown for the page freeing in
zap_page_range currently is also very inefficient, and it should be
rewritten not to use such a tiny and inefficient array with a few
hundred pte pointers, but to just queue all the pages pending to be
freed in a list_head allocated in the tlb_gather_t structure (so such
per-cpu structure will also shrink a lot), so we don't need to trigger
spurious tlb flushes every time such a tiny array overflows (basically
it will work like the pte_quicklist, but it will be shrunk completely
after the tlb flush; it's not a cache, it will simply be a dispose_list,
and it should make the tlb shootdown code more readable too).

Andrea

^ permalink raw reply	[flat|nested] 13+ messages in thread
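[Editor's note: the list-based gather proposed in the last paragraph above
can be sketched in a few lines of userspace C. The intrusive-list helpers
mimic the kernel's `list_head`; everything else (`tlb_queue_page`, the
`freed` flag standing in for returning pages to the allocator) is a
hypothetical illustration, not the eventual kernel implementation.]

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a shootdown queue built on an intrusive list instead of a
 * small fixed array: the list never overflows, so no spurious tlb flush
 * is ever forced, and everything is disposed of after one flush. */

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next; n->prev = h;
    h->next->prev = n; h->next = n;
}

struct page {
    struct list_head list;     /* chains the page while its free is pending */
    int freed;                 /* simulates returning it to the allocator */
};

struct tlb_gather {
    struct list_head pending;  /* unbounded: no overflow, no spurious flush */
};

static void tlb_gather_init(struct tlb_gather *tlb)
{
    INIT_LIST_HEAD(&tlb->pending);
}

static void tlb_queue_page(struct tlb_gather *tlb, struct page *p)
{
    list_add(&p->list, &tlb->pending);
}

/* After the (simulated) tlb flush, dispose of the whole list at once --
 * a dispose_list, not a cache: it is drained completely every time. */
static int tlb_finish_mmu(struct tlb_gather *tlb)
{
    int n = 0;
    struct list_head *cur = tlb->pending.next;
    while (cur != &tlb->pending) {
        struct page *p = (struct page *)((char *)cur - offsetof(struct page, list));
        cur = cur->next;
        p->freed = 1;          /* stand-in for __free_pages(p) */
        n++;
    }
    INIT_LIST_HEAD(&tlb->pending);
    return n;
}
```

The design point is the one Andrea makes: the array version must flush
and drain every few hundred entries, while the list version batches an
arbitrarily large unmap behind a single flush.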
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 23:22       ` Andrea Arcangeli
@ 2002-05-23 23:51         ` Linus Torvalds
  2002-05-24  0:27           ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2002-05-23 23:51 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel

On Fri, 24 May 2002, Andrea Arcangeli wrote:
> On Thu, May 23, 2002 at 03:04:36PM -0700, Linus Torvalds wrote:
> >
> > On Thu, 23 May 2002, Andrea Arcangeli wrote:
> > >
> > > If the userspace tlb lookup is started during munmap the tlb can contain
> > > garbage before invalidate_tlb.
> >
> > No.
>
> Above I just repeated what you said as confirmation of your:
>
> 	By the time the invalidate_tlb happens, the TLB fill has already
> 	finished, and has already picked up garbage.
>
> Not sure why you say "no" about it.

With the intel patches for 2.4.x, or with the 2.5.x tree, the problem is
fixed.

I don't see why you argue against the patches - they've been tested, and
they clearly fix a kernel bug that I've explained, and that Intel itself
has even posted explanations about.

> Agreed, doing the safe ordering always is clearly safe, just overkill
> for example while freeing the pages (not the pagetables); in fact the
> 2.4 patch floating around allows the fastmode with mm_users == 1 while
> freeing the pages.

Yes, you can free the pages themselves if mm_users is 1 with the
fastmode. That's one of the reasons I'm going to make the 2.5.x tree use
the pmd_quicklist, so that the page table tree itself can more cleanly
be handled differently than the pages.

> Probably it's not a significant optimization, but
> maybe mostly for documentation reasons we could resurrect the
> mm_users == 1 fastmode while freeing pages (while freeing pagetables the
> fastmode must go away completely anyway, that was the bug).

The other thing I was considering was to actually make fastmode go away;
it doesn't really seem to buy anything. If anything it might be possible
that the "slow" case is faster than the fast case for some loads, simply
because it has nicer icache "batching" behaviour.

> > But yes, Intel tells me that the only thing that is guaranteed to
> > serialize a TLB lookup is a TLB invalidate. NOTHING else.
>
> If the only thing that serializes the speculative tlb fill is the tlb
> invalidate, then the irq won't serialize it.

Right.

HOWEVER, there's obviously another issue - the TLB lookup will complete
at some point on its own, and the interrupt may well be slow enough that
it effectively serializes the TLB lookup just by virtue of taking many
cycles.

In fact I don't think Intel was ever able to reproduce the lazy-TLB
problem; their engineers only speculated that in theory it could be an
issue.

I personally think it's very unlikely that the "load_cr3()" makes any
real difference, but we also have another issue: because we clear the
"vm_cpu_bitmap" thing for that CPU, we'll only ever get _one_ cross-call
for such a lazy TLB. It is quite possible for one unmap() to cause
_multiple_ invalidates because it has a lot of page tables to go over,
and so the second page table clearing might not trigger an interrupt at
all. You also lose the TLB flushes for future unmaps in case the CPU
stays in lazy-TLB mode.

So that's another reason for loading %cr3 with a "known good" page table
value: because we inhibit future TLB flushes, we really should also make
sure that we're not continuing to use this "known-unstable" page table
tree.

> Probably due to the lack of any course in out of order CPU design, I'm
> not aware of any other cpu (not only x86) out there that exposes the
> internal out of order speculative actions of the CPU

I agree. I'm actually surprised myself at just how aggressive the P4 is,
but on the other hand I think it's very interesting behaviour.

> BTW, about the quicklists, I think they were nicer to have than just the
> per-cpu page allocator in front of the global page allocator (the
> allocator affinity is at a lower layer, much easier to be polluted with
> non-per-cpu stuff than the pte quicklists). So I'm not really against
> resurrecting them, however I suggest building the pte chain always with
> a list on the page struct and never overwriting the contents of the pte
> like happens in most current 2.4 quicklists (I partly fixed that in
> pte-highmem).

I'd suggest (ab-)using something like page->mapping for the quicklist
following. You can use the page itself (the pointer will always be even,
so using the page will not re-trigger the bug on x86 because the
"garbage" written to the page table never has the Present bit set), but
it's nicer from a cache standpoint to use the "struct page" area.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
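[Editor's note: the page->mapping quicklist Linus suggests amounts to a
singly linked freelist threaded through a pointer field of `struct page`.
A minimal userspace sketch follows; the `struct page` here is a toy, the
function names are hypothetical, and the real list would be per-cpu.]

```c
#include <assert.h>
#include <stddef.h>

/* Freelist threaded through a pointer field of struct page, so a freed
 * page-table page never has anything written into its own contents and
 * can never be mistaken for a present pte by a late speculative walk. */

struct page {
    void *mapping;             /* doubles as the freelist link here */
};

static struct page *pte_quicklist;  /* would be per-cpu in the kernel */
static int pgtable_cache_size;

static void pte_free_fast(struct page *p)
{
    p->mapping = pte_quicklist;     /* push on the freelist head */
    pte_quicklist = p;
    pgtable_cache_size++;
}

static struct page *pte_alloc_fast(void)
{
    struct page *p = pte_quicklist;
    if (p) {
        pte_quicklist = p->mapping; /* pop the most recently freed page */
        p->mapping = NULL;
        pgtable_cache_size--;
    }
    return p;   /* NULL means: fall back to the real page allocator */
}
```

The cache point Linus makes is that the link lives in the `struct page`
array (already hot for the allocator) rather than in the freed page
itself, which would have to be mapped and touched just to follow the
chain.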
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 23:51         ` Linus Torvalds
@ 2002-05-24  0:27           ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2002-05-24  0:27 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Thu, May 23, 2002 at 04:51:17PM -0700, Linus Torvalds wrote:
> On Fri, 24 May 2002, Andrea Arcangeli wrote:
> > Agreed, doing the safe ordering always is clearly safe, just overkill
> > for example while freeing the pages (not the pagetables); in fact the
> > 2.4 patch floating around allows the fastmode with mm_users == 1 while
> > freeing the pages.
>
> Yes, you can free the pages themselves if mm_users is 1 with the
> fastmode. That's one of the reasons I'm going to make the 2.5.x tree use
> the pmd_quicklist, so that the page table tree itself can more cleanly
> be handled differently than the pages.

agreed.

> > Probably it's not a significant optimization, but
> > maybe mostly for documentation reasons we could resurrect the
> > mm_users == 1 fastmode while freeing pages (while freeing pagetables
> > the fastmode must go away completely anyway, that was the bug).
>
> The other thing I was considering was to actually make fastmode go away;
> it doesn't really seem to buy anything. If anything it might be possible
> that the "slow" case is faster than the fast case for some loads, simply
> because it has nicer icache "batching" behaviour.

most probably worthwhile to drop it.

> > But yes, Intel tells me that the only thing that is guaranteed to
> > serialize a TLB lookup is a TLB invalidate. NOTHING else.
>
> > If the only thing that serializes the speculative tlb fill is the tlb
> > invalidate, then the irq won't serialize it.
>
> Right.
>
> HOWEVER, there's obviously another issue - the TLB lookup will complete
> at some point on its own, and the interrupt may well be slow enough that
> it effectively serializes the TLB lookup just by virtue of taking many
> cycles.
>
> In fact I don't think Intel was ever able to reproduce the lazy-TLB
> problem; their engineers only speculated that in theory it could be an
> issue.
>
> I personally think it's very unlikely that the "load_cr3()" makes any
> real difference, but we also have another issue: because we clear the
> "vm_cpu_bitmap" thing for that CPU, we'll only ever get _one_ cross-call
> for such a lazy TLB. It is quite possible for one unmap() to cause
> _multiple_ invalidates because it has a lot of page tables to go over,
> and so the second page table clearing might not trigger an interrupt at
> all. You also lose the TLB flushes for future unmaps in case the CPU
> stays in lazy-TLB mode.

ok, I see that now that I know the P4 fills random user tlb entries
during any kernel code completely unrelated to the userspace virtual
addresses it is resolving speculatively. The reason it just didn't look
worthwhile to do that is that the cpu never knows if it will ever access
those user addresses again; the higher the context switch rate, the
lower the probability that any speculative tlb fill on user addresses
will pay off. Of course I'm assuming they're x86 cpus without
ASN/tlb-tagging, which isn't possible to implement in hardware without
OS cooperation. With all other x86 cpus leave_mm has never been a
problem, because a CPU could never start filling a user tlb entry while
it was in lazy mode: before even trying to access userspace addresses
the kernel is supposed to do a tlb flush during switch_mm, and so a tlb
fill in userspace could never run before the next tlb flush.

> So that's another reason for loading %cr3 with a "known good" page table
> value: because we inhibit future TLB flushes, we really should also make
> sure that we're not continuing to use this "known-unstable" page table
> tree.
>
> > Probably due to the lack of any course in out of order CPU design, I'm
> > not aware of any other cpu (not only x86) out there that exposes the
> > internal out of order speculative actions of the CPU
>
> I agree. I'm actually surprised myself at just how aggressive the P4 is,
> but on the other hand I think it's very interesting behaviour.
>
> > BTW, about the quicklists, I think they were nicer to have than just the
> > per-cpu page allocator in front of the global page allocator (the
> > allocator affinity is at a lower layer, much easier to be polluted with
> > non-per-cpu stuff than the pte quicklists). So I'm not really against
> > resurrecting them, however I suggest building the pte chain always with
> > a list on the page struct and never overwriting the contents of the pte
> > like happens in most current 2.4 quicklists (I partly fixed that in
> > pte-highmem).
>
> I'd suggest (ab-)using something like page->mapping for the quicklist
> following. You can use the page itself (the pointer will always be even,
> so using the page will not re-trigger the bug on x86 because the
> "garbage" written to the page table never has the Present bit set), but
> it's nicer from a cache standpoint to use the "struct page" area.

Agreed. In 2.4 pte-highmem (where I don't want to kmap_atomic inside
every pte_free_fast, but I want to just work with pages so I avoid the
kmapping overhead) I simply used page->list, which is the most
appropriate list to chain with in a wrapper in front of __free_pages:

	static inline void pte_free_fast(struct page * page)
	{
		list_add(&page->list, &pte_quicklist);
		pgtable_cache_size++;
	}

Andrea

^ permalink raw reply	[flat|nested] 13+ messages in thread
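[Editor's note: for symmetry with Andrea's `pte_free_fast` above, here is
a hypothetical allocation-side counterpart in the same style, sketched as
self-contained userspace C. The `list_head` helpers mirror the kernel's;
`pte_alloc_fast` returning NULL to signal "fall back to `__alloc_pages`"
is an assumption of this sketch, not quoted code.]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal list_head machinery, mirroring the kernel's include/linux/list.h. */
struct list_head { struct list_head *next, *prev; };

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next; n->prev = h;
    h->next->prev = n; h->next = n;
}

static void list_del(struct list_head *e)
{
    e->next->prev = e->prev;
    e->prev->next = e->next;
}

struct page { struct list_head list; };

/* Statically initialized empty list (next/prev point at the head). */
static struct list_head pte_quicklist = { &pte_quicklist, &pte_quicklist };
static int pgtable_cache_size;

/* Andrea's free side: chain the page via page->list, no kmap needed. */
static void pte_free_fast(struct page *page)
{
    list_add(&page->list, &pte_quicklist);
    pgtable_cache_size++;
}

/* Hypothetical alloc side: pop the most recently freed page, or return
 * NULL so the caller falls back to the real page allocator. */
static struct page *pte_alloc_fast(void)
{
    struct list_head *e;
    if (pte_quicklist.next == &pte_quicklist)
        return NULL;                     /* cache empty */
    e = pte_quicklist.next;
    list_del(e);
    pgtable_cache_size--;
    return (struct page *)((char *)e - offsetof(struct page, list));
}
```

As in the thread's conclusion, the shrinking of such a quicklist back to
the page allocator must happen only after the tlb flush, or the original
free_pgtables race simply reappears one level up.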
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
@ 2002-05-31 13:19 Roland Fehrenbacher
0 siblings, 0 replies; 13+ messages in thread
From: Roland Fehrenbacher @ 2002-05-31 13:19 UTC (permalink / raw)
To: linux-kernel
Cc: Pallipadi, Venkatesh, Nakajima, Jun, Seth, Rohit, Luck, Tony,
Mallick, Asit K, Hugh Dickins, pk
Hi all,
we actually also ran into this bug on a dual xeon (2GHz
Prestonia) with Hyperthreading enabled. Without Hyperthreading, it was hard to
reproduce the problem. The following script (requires bash >= 2.0.5) provokes
the problem in a very short time. With the patch from Intel, the problem is
gone (patched into 2.4.18), no side effects discovered so far.
Thanks to Hugh Dickins <hugh@veritas.com> for pointing out the patch to us.
Cheers,
Roland
Here is the script:
-----------------------
#!/bin/sh
# Small script to provoke a SIGSEGV on SMP machines with the kernel problem
# requires bash >= 2.0.5. The script fails e.g. on dual xeon systems without
# the Intel patch (Message title: "Illegal instruction failures fixes for 2.4.18"
# in the kernel mailing list, 22.5.2002).
# Script simply executes a couple of mktemp loops in the background, and tries
# to read the generated file back.
#
# Author: Roland Fehrenbacher, rf@q-leap.com
base_out=/tmp/test-sigsev
maxprocs=5
myname=`basename $0`
pids=""
files=""
trap 'killall $myname; rm -f $files ${base_out}??????;' 0 1 2 3 15
for (( num=1; num <= $maxprocs; num++ )); do
    while true; do
        file=`mktemp ${base_out}XXXXXX` || { echo mktemp failed; break; }
        cat $file || { echo open $file failed; break; }
        rm -f $file
    done > ${base_out}-${num}.out 2>&1 &
    pids="$pids $!"
    files="$files ${base_out}-${num}.out"
done
printf "PIDS running = $pids\n--\n"
printf "No further output should appear if no bug is present. Run script for\n"
printf "a couple of hours to be sure everything is ok. Ctrl-C to stop.\n--\n"
i=1
while true; do
    echo $i: ok >> ${base_out}-${num}.out
    for pid in $pids; do
        ps -p $pid > /dev/null 2>&1 || \
            { echo "count = $i: Pid $pid died" >> ${base_out}-${num}.out; \
              pids=`echo $pids | sed -e s/$pid//g`; }
    done
    sleep 10
    ((i++))
done &
files="$files ${base_out}-${num}.out"
sleep 1
tail -f $files | grep "count ="
^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2002-05-31 13:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-05-23  5:14 Q: backport of the free_pgtables tlb fixes to 2.4 Andrea Arcangeli
2002-05-23  6:01 ` Linus Torvalds
2002-05-23 19:57   ` Andrea Arcangeli
2002-05-23 20:05     ` Linus Torvalds
2002-05-23 20:41       ` Andrea Arcangeli
2002-05-23 19:53         ` Martin Dalecki
2002-05-23 21:15           ` Andrea Arcangeli
2002-05-23 20:40             ` Martin Dalecki
2002-05-23 22:04         ` Linus Torvalds
2002-05-23 23:22           ` Andrea Arcangeli
2002-05-23 23:51             ` Linus Torvalds
2002-05-24  0:27               ` Andrea Arcangeli
-- strict thread matches above, loose matches on Subject: below --
2002-05-31 13:19 Roland Fehrenbacher