* Q: backport of the free_pgtables tlb fixes to 2.4
@ 2002-05-23 5:14 Andrea Arcangeli
2002-05-23 6:01 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-05-23 5:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
Hello Linus,
I'm dealing with the backport of the free_pgtables() tlb flush fixes
to 2.4 (I got interrupted in the middle of it by the inode highmem
imbalance). I've seen a patch and an explanation floating around, but
as far as I can tell it's wrong, in the sense that it seems to be
completely a noop.
As far as I can see from reading the code, the only kernel bug was in
the free_pgtables path and nothing else, so in theory commenting out the
call to free_pgtables (which is not strictly required except as security
against a malicious program) should make the segfault go away too (that
could even be a valid workaround for 2.4). Also, it's not related to the
P4 cpu; it seems to be a generic bug for all archs.
In both 2.4 and 2.5 we were already releasing the pages correctly in
zap_page_range, by clearing the pte entry, then invalidating the tlb and
finally freeing the page. I can't see problems in the normal do_munmap
path unless free_pgtables triggers too.
Only in free_pgtables is there an exception: if during an munmap a whole
pgd slot is unmapped, we go to a deeper level and throw away both the pmd
and the pte too (under the now-empty pgd slot), to reclaim space and to
avoid exploits that fill the whole pagetable with unswappable metadata
without any real pages associated with it.
The problem is that while freeing the pte and pmd we don't do the
"clear; invalidate tlb; free ram" ordered, SMP-thread-safe shootdown
sequence; instead we do the racy "clear; free; invalidate tlb"
SMP-thread-unsafe ordering, and that's _THE_ bug. So the fix is simply to
extend tlb_gather_mmu/tlb_finish_mmu/tlb_remove_page to the
free_pgtables path too (while dropping the pte/pmd), and in turn to
clear_page_tables, even if only the "free_pgtables" caller really needs
it. exit_mmap obviously cannot need the enforced "clear; invalidate tlb;
free ram" ordering, because by the time exit_mmap is called there
cannot be any racy reader of the address space running in parallel
[mm_users == 0] (and higher performance for exit_mmap is taken care of
by the fast mode, or rather it was supposed to be before the fast mode
broke in the latest 2.5, see below).
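The ordered shootdown described above can be sketched as a small
userspace model (tlb_start/tlb_remove/tlb_finish and the struct below
are illustrative stand-ins for the real
tlb_gather_mmu/tlb_remove_page/tlb_finish_mmu machinery, not kernel
code):

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 64

/* Userspace model of the ordered shootdown: ptes are cleared first,
 * the pages they pointed at are only queued, and nothing is really
 * freed until after the (simulated) TLB invalidate. */
struct tlb_batch {
	void *pages[BATCH_MAX];
	int nr;
	int flushed;		/* has the TLB been invalidated yet? */
	int bad_order;		/* set if a page was freed pre-flush */
};

static void tlb_start(struct tlb_batch *b)
{
	b->nr = 0;
	b->flushed = 0;
	b->bad_order = 0;
}

/* Caller has already cleared the pte; we only defer the actual free. */
static void tlb_remove(struct tlb_batch *b, void *page)
{
	if (b->nr < BATCH_MAX)
		b->pages[b->nr++] = page;
}

/* Invalidate the TLB, and only then release the queued pages. */
static void tlb_finish(struct tlb_batch *b)
{
	b->flushed = 1;			/* stands in for flush_tlb_mm() */
	while (b->nr > 0) {
		if (!b->flushed)
			b->bad_order = 1;
		b->pages[--b->nr] = NULL;	/* simulated free */
	}
}
```

The point of the model is only the ordering: the racy variant would
release the pages before setting `flushed`, letting another thread walk
stale TLB entries into freed memory.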
I don't see why you changed the "fast mode" to 1 cpu (even without an
#ifdef SMP): if mm_users == 1 we are guaranteed that no parallel
reading of the address space we're working on can happen (we're also
guaranteed that mm_users cannot increase or decrease under us), no
matter the number of cpus. I think the "fast mode" should return to the
more efficient one of 2.4, not become "slow mode for every true SMP".
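A minimal model of the fast/slow distinction, assuming only that
mm_users == 1 means no parallel reader can exist (teardown_pte and the
event names below are hypothetical, for illustration only):

```c
#include <assert.h>

enum ev { EV_CLEAR, EV_FREE, EV_FLUSH };

struct trace { enum ev ev[8]; int n; };

static void log_ev(struct trace *t, enum ev e)
{
	t->ev[t->n++] = e;
}

/* Tear down one pte.  Fast mode (mm_users == 1) frees immediately,
 * since no parallel reader of the address space can exist; slow mode
 * defers the free until after the TLB flush. */
static void teardown_pte(struct trace *t, int mm_users)
{
	t->n = 0;
	log_ev(t, EV_CLEAR);		/* pte_clear() */
	if (mm_users == 1) {
		log_ev(t, EV_FREE);	/* fast: free right away */
		log_ev(t, EV_FLUSH);
	} else {
		log_ev(t, EV_FLUSH);	/* slow: shootdown first */
		log_ev(t, EV_FREE);
	}
}

/* Returns 1 if every free happened after a flush (the SMP-safe order). */
static int flush_before_free(const struct trace *t)
{
	int flushed = 0;
	for (int i = 0; i < t->n; i++) {
		if (t->ev[i] == EV_FLUSH)
			flushed = 1;
		if (t->ev[i] == EV_FREE && !flushed)
			return 0;
	}
	return 1;
}
```

The fast path deliberately violates the flush-before-free ordering;
the claim in the mail is precisely that with mm_users == 1 that
violation is harmless.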
About the patch floating around, I also saw random changes in leave_mm
that I cannot explain (that's in 2.5 too; I think that part of 2.5 is
superfluous as well, in fact):
 /*
  * We cannot call mmdrop() because we are in interrupt context,
  * instead update mm->cpu_vm_mask.
+ *
+ * We need to reload %cr3 since the page tables may be going
+ * away from under us..
  */
 static void inline leave_mm (unsigned long cpu)
 {
 	if (cpu_tlbstate[cpu].state == TLBSTATE_OK)
 		BUG();
 	clear_bit(cpu, &cpu_tlbstate[cpu].active_mm->cpu_vm_mask);
+	load_cr3(swapper_pg_dir);
 }

 /*
Explanation of why the above is superfluous: every time a task enters
the lazy state we bump the mm_users, so the active_mm is pinned and
nothing can throw away the kernel side of the pgd; hence we couldn't
care less about loading swapper_pg_dir into %cr3 for tasks in the lazy
state. It's simply not required, so I think the patch above is
superfluous. The pinned mm (if it's the last reference) is released by
the scheduler after our %cr3 has changed in switch_mm. The comment is
wrong: our page tables cannot go away under us, because a lazy task is
only allowed to use the kernel side of the address space and nothing
can throw that away.
A change like this (not from 2.5) is obviously superfluous as well:
 #define tlb_remove_page(ctxp, pte, addr) do {\
 		/* Handle the common case fast, first. */\
 		if ((ctxp)->nr == ~0UL) {\
-			__free_pte(*(pte));\
-			pte_clear((pte));\
+			pte_t __pte = *(pte);\
+			pte_clear(pte);\
+			__free_pte(__pte);\
nr == ~0UL is the single-threaded case, so again there cannot be any
parallel reader of the address space, and so it's perfectly fine to
free the page, clear the pte and finally flush the tlb. We don't care
about the ordering if we're single threaded; any ordering is ok, since
nobody can access the address space while we're working on it.
The change below is superfluous as well, once we back out the above
change to leave_mm.
@@ -51,9 +51,9 @@
 			BUG();
 		if(!test_and_set_bit(cpu, &next->cpu_vm_mask)) {
 			/* We were in lazy tlb mode and leave_mm disabled
-			 * tlb flush IPI delivery. We must flush our tlb.
+			 * tlb flush IPI delivery. We must reload %cr3.
 			 */
-			local_flush_tlb();
+			load_cr3(next->pgd);
 		}
This actually gets nearer to the real problem...
 static inline void flush_tlb_pgtables(struct mm_struct *mm,
 				      unsigned long start, unsigned long end)
 {
-	/* i386 does not keep any page table caches in TLB */
+	flush_tlb_mm(mm);
 }
[..]
 void clear_page_tables(struct mm_struct *mm, unsigned long first, int nr)
 {
 	pgd_t * page_dir = mm->pgd;
+	unsigned long last = first + nr;

 	spin_lock(&mm->page_table_lock);
 	page_dir += first;
@@ -153,6 +154,8 @@
 	} while (--nr);
 	spin_unlock(&mm->page_table_lock);

+	flush_tlb_pgtables(mm, first * PGDIR_SIZE, last * PGDIR_SIZE);
+
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
[..]
... but it's again a noop and cannot fix anything: any tlb flush done
there is still way too late, since the pte has already gone into the
freelist by that time. At best it may narrow the window for the race.
The only effective fix that nobody has backported to 2.4 yet, AFAIK, is
to avoid the race between free_pgtables and a parallel thread reading
the address space, and it consists in backporting the pte_free_tlb(tlb,
pte) in the clear_page_tables path to 2.4, nothing else and nothing
more. The zap_page_range path and the fast mode of 2.4 look perfectly
correct; only free_pgtables must start using the "slow mode that
enforces ordering" when mm_count is > 1 (either that, or commenting out
free_pgtables would be a valid workaround for the SMP race).
Could you confirm/comment or explain what the problem really is? Many
thanks!
Andrea
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 5:14 Q: backport of the free_pgtables tlb fixes to 2.4 Andrea Arcangeli @ 2002-05-23 6:01 ` Linus Torvalds 2002-05-23 19:57 ` Andrea Arcangeli 0 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2002-05-23 6:01 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel On Thu, 23 May 2002, Andrea Arcangeli wrote: > > I'm dealing with the backport of the free_pgtables() tlb flushes fixes > to 2.4 (got interrupted with it in the middle of the inode highmem > imbalance). I seen a patch and an explanation floating around but it's > wrong as far I can tell in the sense it seems completly a noop. Don't take the 2.5.x tree - those changes are bigger than necessary. They may be _cleaner_ than the quick fix, but I don't think it's worth it for 2.4.x. The suggested fix for 2.4.x is basically: - enable pmd quicklists - do a TLB flush _before_ doing the check_pgt_cache() that might free up the quicklist, and make sure that nothing else runs on that CPU that might get something from the quicklist (interrupts don't do it, so the only thing to make sure is that we don't get preempted). That should do it. Intel made a patch that they've apparently tested that does this, I think. > Also it's not related to the P4 cpu, it's just a generic bug for all > archs it seems. Well, the hardware has to do a hw tlb walk, so that cuts down the affected archs to not very many. And the hw has to do a fair amount of speculation, or support SMP, so that cuts it down even further. But yes, in theory the bug is generic. > In both 2.4 and 2.5 we were just doing the release of the pages > correctly in zap_page_range, by clearing the pte entry, then > invalidating the tlb and finally freeing the page, I can't see problems > in the normal do_munmap path unless free_pgtables triggers too. 
Right, it's free_pgtables that matters, and we need to do the same "deferred free" (deferred until the TLB invalidate) as for the actual pages themselves. "pmd_quicklist" can act as such a deferral method. In 2.5.x, I wanted to make the deferral explicit, which is why the 2.5.x approach re-did the thing to make the whole "TLB gather" cover the page table freeing too. > So the fix is simply to > extend the tlb_gather_mm/tlb_finish_mmu/tlb_remove_page to the > free_pgtables path too (while dropping pte/pmd), so in turn to > clear_page_tables, even if only the "free_pgtables" caller really needs > it. This is indeed exactly what 2.5.x does, but see above about it. > I don't see why you changed the "fast mode" to 1 cpu (even without an > #ifdef SMP), if the mm_users is == 1 we are just guaranteed that no > parallel reading of the address space that we're working on can happen Yes, but there is ANOTHER race, and this is actually the much more likely one, and the one that actually happens: CPU1 CPU2 munmap .. speculation starts .. .. TLB looks up pgd entry .. clear pgd entry free pmd alloc page - get old pmd scribble on page .. TLB looks up pmd entry .. .. tlb fill ends ... invalidate_tlb CPU2 can be doing something completely _unrelated_ and not have the same MM at all. The bug happens entirely within the TLB contents of CPU1, simply because the CPU1 speculation started a TLB fill, which looked up the PGD entry _before_ it was cleared, but because of a cache miss got delayed at the point where it was going to look up the PTE entry. In the meantime, CPU1 free'd the page, and another CPU scribbled stuff on it. So we not have an invalid entry in the TLB on cpu1 - and while we will invalidate the TLB, if the entry had the global bit set that bad random entry would NOT get invalidated. This, btw, is why in practice the thing only shows up on a P4. Only a P4 has deep enough speculation queues that this bug actually happens. 
Intel apparently had this trace for 2000 (yes, two THOUSAND) cycles that showed this asynchronous TLB lookup. > About the patch floating around I also seen random changes in leave_mm > that I cannot explain (that's in 2.5 too, I think that part of 2.5 > is superflous too infact) Here the issue is: CPU1 CPU2 Lazy TLB of same MM as on CPU1 munmap() .. start speculative TLB fetch ... .. free_pgtable .. invalidate -> crosscall tlb_IPI We're lazy, nothing to do free page alloc page scribble on it .. speculative TLB lookup ends .. get a bogus TLB entry with G bit According to intel, the _only_ thing that serializes the TLB fills is to do a TLB invalidate (either invlpg or mov->cr3). So not even the cross-call itself necessarily does anything to the background TLB fetch. So even the lazy case needs to do that, at which point it is just as well to just move to another stabler page table (it's actually faster than doing the cr3<->cr3 dance). NOTE! This was not seen in any traces, but Intel was worried. > explanation of why the above is superflous: every time a task enters the > lazy state we bump the mm_users so the active_mm is pinned somehow and > nothing can throw away the kernel side of the pgd The page tables are freed, the same race can occur. > A change like this (not from 2.5) as well is obviously superflous: That's the intel patch - they just prefer that order, but they admitted it doesn't matter. > This below change as well is superflous after we backout the above > change to leave_mm. Don't back it out. > This actually gets more near to the real problem... > > static inline void flush_tlb_pgtables(struct mm_struct *mm, > unsigned long start, unsigned long end) > { > - /* i386 does not keep any page table caches in TLB */ > + flush_tlb_mm(mm); > } The above, _together_ with moving it to before the check_pgt_cache() (and removing some other TLB flushes that are now superfluous). In short, the Intel patch is good. 
Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
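The 2.4 fix Linus describes above — park freed pte pages on a
quicklist, flush the TLB, and only then let check_pgt_cache() hand them
back to the allocator — can be modelled in userspace like this (the
function names mirror the 2.4 ones, but the bodies are an illustrative
sketch, not the kernel implementation):

```c
#include <assert.h>
#include <stddef.h>

#define QUICK_MAX 16

struct cpu_state {
	void *quicklist[QUICK_MAX];
	int quick_nr;
	int tlb_flushed;
	int unsafe_free;	/* a pte page hit the allocator pre-flush */
};

/* pte_free(): defer by parking the page on the quicklist. */
static void pte_free(struct cpu_state *c, void *pte_page)
{
	if (c->quick_nr < QUICK_MAX)
		c->quicklist[c->quick_nr++] = pte_page;
}

static void flush_tlb_mm(struct cpu_state *c)
{
	c->tlb_flushed = 1;	/* simulated invalidate */
}

/* check_pgt_cache(): the only path back to the page allocator; safe
 * only because the caller flushed the TLB first. */
static void check_pgt_cache(struct cpu_state *c)
{
	while (c->quick_nr > 0) {
		if (!c->tlb_flushed)
			c->unsafe_free = 1;
		c->quicklist[--c->quick_nr] = NULL;	/* real free */
	}
}
```

This is the same deferral idea as the 2.5 "TLB gather", just
implemented through the existing quicklist instead of an explicit
batch structure.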
* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 6:01 ` Linus Torvalds @ 2002-05-23 19:57 ` Andrea Arcangeli 2002-05-23 20:05 ` Linus Torvalds 0 siblings, 1 reply; 13+ messages in thread From: Andrea Arcangeli @ 2002-05-23 19:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Wed, May 22, 2002 at 11:01:28PM -0700, Linus Torvalds wrote: > > > On Thu, 23 May 2002, Andrea Arcangeli wrote: > > > > I'm dealing with the backport of the free_pgtables() tlb flushes fixes > > to 2.4 (got interrupted with it in the middle of the inode highmem > > imbalance). I seen a patch and an explanation floating around but it's > > wrong as far I can tell in the sense it seems completly a noop. > > Don't take the 2.5.x tree - those changes are bigger than necessary. They > may be _cleaner_ than the quick fix, but I don't think it's worth it for > 2.4.x. > > The suggested fix for 2.4.x is basically: > - enable pmd quicklists actually the pte_quicklist only, the 4 pmd are always associated with the pgd even with pae. > - do a TLB flush _before_ doing the check_pgt_cache() that might free up > the quicklist, and make sure that nothing else runs on that CPU that I see how the patch is supposed to work, such subtle dependency on doing the check_pgt_cache after the tlb invalidate wasn't commented in the patch, it deserves a fat comment or it's really hard to guess we depend on it. 
the main reason being that pte_free semantics are really "free like if I put back in the freelist", not "don't touch it until check_pgt_cache", because remember the pte_free clobbers the contents of the pte with "random" data, this random data happens to have the property on x86 that all the last 2 bits are always zero, so the last two bits aren't random, and with this conicidence plus the other coincidence that the invalid bit of the pte is the less significant bit in the pte entry, it can actually work on x86, but it can very well not work on other archs if the invalid bit isn't the last one or if they stores something else and not a virtual addess in the pte entry chain. In short the "pte_free" must clobber the pte in a way that lefts it invalid, so that we're guaranteed to generate a page fault, this requirement also deserves a big fat comment. I couldn't guess the dependency of the fix on the lowlevel details of the quicklist releasing, on the invalid bits of the pte and finally on the bits guaranteed to be zero, not sure if that was just obvious to all the arch maintainers. > might get something from the quicklist (interrupts don't do it, so the > only thing to make sure is that we don't get preempted). > > That should do it. Intel made a patch that they've apparently tested that > does this, I think. > > > Also it's not related to the P4 cpu, it's just a generic bug for all > > archs it seems. > > Well, the hardware has to do a hw tlb walk, so that cuts down the affected > archs to not very many. And the hw has to do a fair amount of speculation, > or support SMP, so that cuts it down even further. > > But yes, in theory the bug is generic. > > > In both 2.4 and 2.5 we were just doing the release of the pages > > correctly in zap_page_range, by clearing the pte entry, then > > invalidating the tlb and finally freeing the page, I can't see problems > > in the normal do_munmap path unless free_pgtables triggers too. 
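The bit-level coincidence described above can be checked directly: the
quicklist links freed pte pages through a page-aligned virtual address,
and page alignment forces bit 0 to zero, which on x86 is exactly the
pte Present bit, so a speculative walk that reads the link word sees a
not-present entry (the constants below are the usual x86 values; the
helper functions are illustrative, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096UL
#define PTE_PRESENT	0x1UL	/* bit 0 of an x86 pte */

/* A hardware walker reading this word as a pte sees it not-present. */
static int looks_not_present(uintptr_t pte_word)
{
	return (pte_word & PTE_PRESENT) == 0;
}

/* Page-align an address the way the quicklist link word is aligned. */
static uintptr_t page_align(uintptr_t addr)
{
	return addr & ~(PAGE_SIZE - 1);
}
```

On an arch where the invalid bit isn't bit 0, or where the link word
isn't a virtual address, this coincidence breaks down — which is
exactly Andrea's portability worry.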
> > Right, it's free_pgtables that matters, and we need to do the same > "deferred free" (deferred until the TLB invalidate) as for the actual > pages themselves. "pmd_quicklist" can act as such a deferral method. > > In 2.5.x, I wanted to make the deferral explicit, which is why the 2.5.x > approach re-did the thing to make the whole "TLB gather" cover the page > table freeing too. > > > So the fix is simply to > > extend the tlb_gather_mm/tlb_finish_mmu/tlb_remove_page to the > > free_pgtables path too (while dropping pte/pmd), so in turn to > > clear_page_tables, even if only the "free_pgtables" caller really needs > > it. > > This is indeed exactly what 2.5.x does, but see above about it. > > > I don't see why you changed the "fast mode" to 1 cpu (even without an > > #ifdef SMP), if the mm_users is == 1 we are just guaranteed that no > > parallel reading of the address space that we're working on can happen > > Yes, but there is ANOTHER race, and this is actually the much more likely > one, and the one that actually happens: > > CPU1 CPU2 > > munmap > .. speculation starts .. the question is: can you explain how the speculative tlb fill can start? see below. > .. TLB looks up pgd entry .. > clear pgd entry > free pmd > > alloc page - get old pmd > scribble on page > > .. TLB looks up pmd entry .. > .. tlb fill ends ... > invalidate_tlb ^^^^^^^^^^^^^^ I assume the userspace access could be imagined right after the invalidate_tlb in the above example, and that's the one supposed to trigger the speculative tlb fill but how can it pass the invalidate_tlb? see below. > CPU2 can be doing something completely _unrelated_ and not have the same > MM at all. The bug happens entirely within the TLB contents of CPU1, > simply because the CPU1 speculation started a TLB fill, which looked up > the PGD entry _before_ it was cleared, but because of a cache miss got > delayed at the point where it was going to look up the PTE entry. 
In the > meantime, CPU1 free'd the page, and another CPU scribbled stuff on it. > > So we not have an invalid entry in the TLB on cpu1 - and while we will > invalidate the TLB, if the entry had the global bit set that bad random > entry would NOT get invalidated. > > This, btw, is why in practice the thing only shows up on a P4. Only a P4 > has deep enough speculation queues that this bug actually happens. Intel > apparently had this trace for 2000 (yes, two THOUSAND) cycles that showed > this asynchronous TLB lookup. In all cases either the 2.4 fix is wrong, or 2.5 is still overkill while freeing the pages (not the pagetables), while freeing the pages the fastmode can still be mm_users == 1 (there's no risk of caching a global tlb entry with the pages, they're just data, not metadata, and the data will be invalidated from the cpu during the tlb flush). I also assume that an irq will force a restart of the TLB fill, if it doesn't then the same race in freeing the pagetables can happen with only one cpu too (again assuming invalidate_tlb isn't enough that I don't think it's the case). > > > About the patch floating around I also seen random changes in leave_mm > > that I cannot explain (that's in 2.5 too, I think that part of 2.5 > > is superflous too infact) > > Here the issue is: > > CPU1 CPU2 > > Lazy TLB of same MM as on CPU1 > munmap() .. start speculative TLB fetch ... > .. free_pgtable > .. invalidate -> crosscall tlb_IPI > We're lazy, nothing to do > free page > alloc page > scribble on it > .. speculative TLB lookup ends .. > get a bogus TLB entry with G bit > > > According to intel, the _only_ thing that serializes the TLB fills is to ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > do a TLB invalidate (either invlpg or mov->cr3). 
So not even the ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^ If that's true then the previous race (the number 2, where I wrote "see below") cannot happen and we can return to do the fastmode mm_users == 1 check also for the pte/pmd freeing (not only for the page freeing). They are confirming the TLB fill is serialized by the overwrite of mov->cr3 and the invalidate_tlb _before_ the cpu-local userspace access underlined in the previous case2 will be a strong barrier for any tlb fill, so the tlb fill cannot be speculated across it and we can fastmode with mm_users == 1. > cross-call itself necessarily does anything to the background TLB fetch. If smp_num_cpus == 1 or #ifndef CONFIG_SMP can use the fastmode for bug number 2, then it means the irq handler is a cpu-local strong barrier as strong as the invalidate_tlb (otherwise in case 2 again the race could trigger on an UP machine if the irq handler aloocates the page [skb or whatever] and scribbles over it). OTOH as said case 2 cannot trigger anyways even on SMP because there's the invalidate_tlb that forbids the speculative tlb fill to pass, so if the irq is a barrier for the tlb fill or not is not obvious from case2. But regardless of case2, I think for UP single threaded transparency all the IRQs should be strong barriers to any speculative activity, so I think the IPI irq should as well be a strong barrier that forbids the speculative TLB to pass. So I'm not convinced the above is necessary. The only required thing for this last case3 is that the order is pte_clear, invalidate_tlb and finally free_page and that's guaranteed by the fastmode because mm_users > 1 (there's an active lazy mm of the same mm in cpu2), and the invalidate_tlb will make sure any tlb fill is restarted before the page can be freed and in turn before the pte can be scribbled by cpu 1 (or any other cpu in the system). 
otherwise it means the irq isn't a barrier for the speculative tlb fill as it should for UP transparency of the speculative actions (all speculative actions should become visible only with smp effects). > > So even the lazy case needs to do that, at which point it is just as > well > to just move to another stabler page table (it's actually faster than > doing the cr3<->cr3 dance). > > NOTE! This was not seen in any traces, but Intel was worried. > > > explanation of why the above is superflous: every time a task enters the > > lazy state we bump the mm_users so the active_mm is pinned somehow and > > nothing can throw away the kernel side of the pgd > > The page tables are freed, the same race can occur. > > > A change like this (not from 2.5) as well is obviously superflous: > > That's the intel patch - they just prefer that order, but they admitted it > doesn't matter. ok. > > > This below change as well is superflous after we backout the above > > change to leave_mm. > > Don't back it out. > > > This actually gets more near to the real problem... > > > > static inline void flush_tlb_pgtables(struct mm_struct *mm, > > unsigned long start, unsigned long end) > > { > > - /* i386 does not keep any page table caches in TLB */ > > + flush_tlb_mm(mm); > > } > > The above, _together_ with moving it to before the check_pgt_cache() (and > removing some other TLB flushes that are now superfluous). > > In short, the Intel patch is good. thanks for the whole explanation, this just makes many things clear, the approch in the patch floating around definitely can fix free_pgtables (case1) [I'd say a bit by luck because the pte still are overwritten in pte_free], and it incidentally fixes case2 as well (plus it is more efficient than 2.4 by lefting the fastmode for pages mm_users == 1). 
But I'm not really convinced that the tlb fill can pass the IPI irq in case 3 (so I'm not convinced leave_mm is needed), and I think the 2.4 patch incidentally takes care of case 2 too, but again I don't see what's the problem of case2 in doing the fastmode for pte too (not only for pages, where for pages it is certainly safe, or better at worse we can read out of the ram address spce, potentially allocating an alias cacheline on the gart but that's ok as far as it's a read-only cacheline, and speculative read activity shouldn't really allocate writeback buffered cachelines that would cause lost coherency to two aliased phys addresses) Comments? Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 19:57 ` Andrea Arcangeli @ 2002-05-23 20:05 ` Linus Torvalds 2002-05-23 20:41 ` Andrea Arcangeli 0 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2002-05-23 20:05 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: linux-kernel On Thu, 23 May 2002, Andrea Arcangeli wrote: > > munmap > > .. speculation starts .. > > the question is: can you explain how the speculative tlb fill can start? Any indirect branch can be (and will be) predicted using the BTB. The speculation starts before the BTB contents have actually been verified, resulting in iTLB speculation. Since the BTB can (and does) contain mostly user addresses (from previous execution in user land), it's apparently quite common to speculatively fetch user TLB entries even when you're in kernel mode. (This is also, btw, probably anothre reason why you only see this bug in practice on a P4: much bigger BTB) > see below. > > > .. TLB looks up pgd entry .. > > clear pgd entry > > free pmd > > > > alloc page - get old pmd > > scribble on page > > > > .. TLB looks up pmd entry .. > > .. tlb fill ends ... > > invalidate_tlb > ^^^^^^^^^^^^^^ > > I assume the userspace access could be imagined right after the > invalidate_tlb in the above example, and that's the one supposed to > trigger the speculative tlb fill but how can it pass the invalidate_tlb? > see below. It doesn't pass the invalidate_tlb. By the time the invalidate_tlb happens, the TLB fill has already finished, and has already picked up garbage. READ my explanation. The garbage can (and does) contain the Global bit, so even though we then flush the TLB, the garbage remains. > In all cases either the 2.4 fix is wrong, No. Understand the patch, _then_ complain. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 20:05 ` Linus Torvalds @ 2002-05-23 20:41 ` Andrea Arcangeli 2002-05-23 19:53 ` Martin Dalecki 2002-05-23 22:04 ` Linus Torvalds 0 siblings, 2 replies; 13+ messages in thread From: Andrea Arcangeli @ 2002-05-23 20:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Thu, May 23, 2002 at 01:05:53PM -0700, Linus Torvalds wrote: > > On Thu, 23 May 2002, Andrea Arcangeli wrote: > > > munmap > > > .. speculation starts .. > > > > the question is: can you explain how the speculative tlb fill can start? > > Any indirect branch can be (and will be) predicted using the BTB. The > speculation starts before the BTB contents have actually been verified, > resulting in iTLB speculation. > > Since the BTB can (and does) contain mostly user addresses (from previous > execution in user land), it's apparently quite common to speculatively > fetch user TLB entries even when you're in kernel mode. > > (This is also, btw, probably anothre reason why you only see this bug in > practice on a P4: much bigger BTB) > > > see below. > > > > > .. TLB looks up pgd entry .. > > > clear pgd entry > > > free pmd > > > > > > alloc page - get old pmd > > > scribble on page > > > > > > .. TLB looks up pmd entry .. > > > .. tlb fill ends ... > > > invalidate_tlb > > ^^^^^^^^^^^^^^ > > > > I assume the userspace access could be imagined right after the > > invalidate_tlb in the above example, and that's the one supposed to > > trigger the speculative tlb fill but how can it pass the invalidate_tlb? > > see below. > > It doesn't pass the invalidate_tlb. > > By the time the invalidate_tlb happens, the TLB fill has already finished, > and has already picked up garbage. If the userspace tlb lookup is started during munmap the tlb can contain garabge before invalidate_tlb. What I don't understand is how the BTB can invoke random userspace tlb fills when we are running do_munmap, there's no point at all in doing that. 
If the cpu see a read of an user address after invalidate_tlb, the tlb must not be started because it's before an invalidate_tlb. And if it's true not even irq are barriers for the tlb fills invoked by this p4-BTB thing, so if leave_mm is really necessary, then 2.5 is as well wrong in UP, because the pagetable can be scribbled by irqs in a UP machine, and so the fastmode must go away even in 1 cpu systems. > > READ my explanation. The garbage can (and does) contain the Global bit, so > even though we then flush the TLB, the garbage remains. > > > In all cases either the 2.4 fix is wrong, > > No. Understand the patch, _then_ complain. Above i'm not saying in absolute that the 2.4 fix is wrong, I'm saying either 2.4 fix is wrong, or 2.5 must be overkill in using the tlb shootdown for tasks with mm_users == 1, while releasing the _pages_, not the _pagetables_. the 2.4 patch infact only enforces the ordering with the pagetables, never with the pagetables if the mm_users == 1. I reaffirm all the questions in my previous email except the "how the tlb fill is stared in case2", if there's this BTB thing in the p4 that is filling randomly tlb entries for user addresses any time during any kernel code, that will perfectly explain how case2 triggers (but I'm pretty sure that's a p4-only peculiarity and I don't think case2 can happen in any other cpu out there because the tlb flush should forbid the cpu to go ahead, or in the worst case a smb_mb() before invalidate_tlb should be enough to forbid the cpu to see userspace addresses before the pte is clear, and after the pmd entry is clear it doesn't matter where's the pagetables). 
I see contraddictions in the code: 1) between case2 UP-fastmode and case3 leave_mm (if leave_mm is needed then fastmode is buggy in 2.5) 2) between fastmode == 1cpu used also for the pages in 2.5 and not used for the pages in 2.4 so either 2.4 is buggy or 2.5 is overkill So something is definitely still either overkill on one side or wrong on the other side for both the above things (plus the fact I'm taking as an assumption that this BTB thing can start tlb fills anytime regardless if the cpu is allowed to speculate on userspace addresses or not, but I'm ok to assume it as a p4 peculiarity). If all the code floating around would be coherent then things would make more sense. At the current state of things I cannot tell what is right and I am sure something is still wrong. So I don't feel my questions are superflous, and this is not a matter of understanding the code, I think I'm not missing anything in the code, I'm missing something on the hardware details of the p4 cpus instead and I doubt that's documented in the specs and anyways it's faster to ask to learn those lowlevel details. The one definitely "software" thing is the free_pgtables bug (case1), and that's clear and fixed by both 2.5 and the 2.4 patch via the quicklists, we're left with the contraddiction with and the fastmode differences between case2 and case3 in both 2.4 patch and 2.5, and that's hardware side, not software side. Infact I'm still not excluding the possibility that what is been found in the traces is case 1, and if case2 and case3 can really trigger. Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4 2002-05-23 20:41 ` Andrea Arcangeli @ 2002-05-23 19:53 ` Martin Dalecki 2002-05-23 21:15 ` Andrea Arcangeli 2002-05-23 22:04 ` Linus Torvalds 1 sibling, 1 reply; 13+ messages in thread From: Martin Dalecki @ 2002-05-23 19:53 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel Uz.ytkownik Andrea Arcangeli napisa?: > What I don't understand is how the BTB can invoke random userspace tlb > fills when we are running do_munmap, there's no point at all in doing > that. If the cpu see a read of an user address after invalidate_tlb, > the tlb must not be started because it's before an invalidate_tlb. > > And if it's true not even irq are barriers for the tlb fills invoked by > this p4-BTB thing, so if leave_mm is really necessary, then 2.5 is as > well wrong in UP, because the pagetable can be scribbled by irqs in a UP > machine, and so the fastmode must go away even in 1 cpu systems. I for one would be really really surprised if the execution of an interrupt isn't treating the BTB specially. If one reads about CPU validation "exception handling" aka irq handling is something that is paramount there. Hard to beleve they would implement software IRQ commands by not just toggling the IRQ input line of the chip themself. This safes testing. But it may be as well indeed just "accidental" that system call gates are implemented on recent ia32 systems by an op code which belongs to the IRQ handling family... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 19:53     ` Martin Dalecki
@ 2002-05-23 21:15       ` Andrea Arcangeli
  2002-05-23 20:40         ` Martin Dalecki
  0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-05-23 21:15 UTC (permalink / raw)
To: Martin Dalecki; +Cc: Linus Torvalds, linux-kernel

On Thu, May 23, 2002 at 09:53:44PM +0200, Martin Dalecki wrote:
> Andrea Arcangeli wrote:
>
> >What I don't understand is how the BTB can invoke random userspace tlb
> >fills when we are running do_munmap, there's no point at all in doing
> >that. If the cpu sees a read of an user address after invalidate_tlb,
> >the tlb fill must not be started because it's before an invalidate_tlb.
> >
> >And if it's true that not even irqs are barriers for the tlb fills
> >invoked by this p4-BTB thing, so if leave_mm is really necessary, then
> >2.5 is as well wrong in UP, because the pagetable can be scribbled by
> >irqs in a UP machine, and so the fastmode must go away even in 1 cpu
> >systems.
>
> I for one would be really really surprised if the execution of an
> interrupt isn't treating the BTB specially. If one reads

me too of course. If an irq isn't making the speculative actions of the
BTB UP-transparent, then 2.5 is still buggy in allowing the fast mode in
UP machines (because the irq can allocate the pagetable and scribble
over it, so then the tlb will be filled with global garbage). To make
things more clear, this is what will happen right now in 2.5 if the irq
isn't serializing the BTB speculative tlb fills:

	CPU1

	munmap
	.. speculation starts ..
	.. TLB reads pmd entry, so it now knows the phys address of the pte ..

	clear pmd entry
	free pte
	(doesn't matter if we clear the pmd entry or if we free the pte first)

	irq fired, BTB speculative actions aren't stopped, they run
	speculatively in parallel to the irq
		alloc page - get old pte
		scribble on pte

	.. TLB reads the contents of the pte at the now-invalid phys address ..
	.. tlb fill ends and we filled the tlb with random pte contents marked global ..

If instead the irq is serializing the BTB actions as expected (the
invariant is that a UP machine will never see any speculative action
internally; speculation is a problem only with SMP on shared memory or
while talking with hardware devices outside the local cpu), then it
means the above cannot happen, so 2.5 isn't buggy in allowing the
fastmode with 1 cpu systems, but then it also means 2.5 is overkill in
the leave_mm hack and so we can drop it.

Andrea

^ permalink raw reply	[flat|nested] 13+ messages in thread
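[Editor's note: the "clear entry; invalidate tlb; only then free" ordering
that the thread keeps returning to can be illustrated with a minimal
userspace sketch. All names below (`tlb_gather`, `tlb_remove_page`,
`tlb_finish_mmu`) are borrowed for flavor and are NOT the real kernel API;
the TLB flush is only simulated by a flag.]

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace sketch of the safe shootdown ordering:
 * 1) clear the page-table entry, 2) invalidate the tlb, 3) free the
 * page.  The buggy ordering freed the page before the flush, letting a
 * concurrent speculative tlb fill read freed (possibly reallocated and
 * scribbled-on) memory. */

#define GATHER_MAX 8

struct tlb_gather {
    int nr;                    /* pages queued, waiting for the flush */
    void *pages[GATHER_MAX];
    int flushed;               /* stands in for invalidate_tlb() */
};

static void tlb_gather_init(struct tlb_gather *tlb)
{
    tlb->nr = 0;
    tlb->flushed = 0;
}

/* Clear the (simulated) pte first, then queue the page: the page must
 * NOT be freed before the flush -- that deferral is the whole fix. */
static void tlb_remove_page(struct tlb_gather *tlb, void **pte)
{
    void *page = *pte;
    *pte = NULL;                    /* step 1: clear the entry */
    tlb->pages[tlb->nr++] = page;   /* defer the free */
}

/* Step 2: invalidate the tlb; only then, step 3: release the pages. */
static int tlb_finish_mmu(struct tlb_gather *tlb)
{
    int freed = 0;
    tlb->flushed = 1;               /* simulated tlb flush */
    while (tlb->nr > 0) {
        free(tlb->pages[--tlb->nr]);
        freed++;
    }
    return freed;
}
```

Note the invariant the sketch enforces: between `tlb_remove_page` and
`tlb_finish_mmu` the entry is already cleared but the memory is still
valid, so a late speculative walk can at worst read a stale-but-sane
page table, never garbage.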
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 21:15       ` Andrea Arcangeli
@ 2002-05-23 20:40         ` Martin Dalecki
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Dalecki @ 2002-05-23 20:40 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, linux-kernel, enginer

Andrea Arcangeli wrote:
> On Thu, May 23, 2002 at 09:53:44PM +0200, Martin Dalecki wrote:
>
>>I for one would be really really surprised if the execution of an
>>interrupt isn't treating the BTB specially. If one reads
>
> me too of course. If an irq isn't making the speculative actions of the
> BTB UP-transparent, then 2.5 is still buggy in allowing the fast mode in
> UP machines (because the irq can allocate the pagetable and scribble
> over it, so then the tlb will be filled with global garbage). To make
> things more clear, this is what will happen right now in 2.5 if the irq
> isn't serializing the BTB speculative tlb fills:
>
> 	CPU1
>
> 	munmap
> 	.. speculation starts ..
> 	.. TLB reads pmd entry, so it now knows the phys address of the pte ..
>
> 	clear pmd entry
> 	free pte
> 	(doesn't matter if we clear the pmd entry or if we free the pte first)
>
> 	irq fired, BTB speculative actions aren't stopped, they run
> 	speculatively in parallel to the irq
> 		alloc page - get old pte
> 		scribble on pte
>
> 	.. TLB reads the contents of the pte at the now-invalid phys address ..
> 	.. tlb fill ends and we filled the tlb with random pte contents marked global ..
>
> If instead the irq is serializing the BTB actions as expected (the
> invariant is that a UP machine will never see any speculative action
> internally; speculation is a problem only with SMP on shared memory or
> while talking with hardware devices outside the local cpu), then it
> means the above cannot happen, so 2.5 isn't buggy in allowing the
> fastmode with 1 cpu systems, but then it also means 2.5 is overkill in
> the leave_mm hack and so we can drop it.

Wait a moment please. The explanation above is very nice, but
unfortunately I have some speculation to add to the game. Let's take the
whole "hyper threading" stuff into account. The HT variant of the P4 was
released just a few weeks or months after the normal one. Let's take the
following into account:

1. CPU validation takes years these days,

2. it is the most expensive part, in terms of time and perhaps money, of
   the cpu design game,

3. HT only takes several percent (around 5) of the silicon die to
   implement, which is likely comparatively cheap in regard to point 2,

4. HT validation does something between doubling and quadrupling this
   whole effort.

Then it very well may be that the fscking P4 contains the hyper
threading silicon even in the UP-marketed version. It's likely just an
"early stepping" and they disabled HT there by making some Zener diode
kaputt. So it could very well be that the guys there just didn't do full
checks on the behaviour in this "corner" UP case, or didn't notice that
something changed. Or didn't care after looking around at OS source
code. And the P4 has to be dealt with precisely the same way the
hyper-threaded variant behaves...

The longer I think about it, the more I tend toward the above
hypothesis... But unfortunately I can't give you definitive answers of
course. Well, the level of "tend toward it" is in the range of: "If I
had to bet my life on it I certainly wouldn't" - and I consider myself
quite courageous. Multiply this by the number of Linux users, interrupts
and the depth of the P4 pipelines, and well, it turns out that well...
2.5 is most likely broken on P4.

Boy, I would love to truly know about this! Intel - do you listen to
this small humble prayer?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 20:41   ` Andrea Arcangeli
  2002-05-23 19:53     ` Martin Dalecki
@ 2002-05-23 22:04     ` Linus Torvalds
  2002-05-23 23:22       ` Andrea Arcangeli
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2002-05-23 22:04 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel

On Thu, 23 May 2002, Andrea Arcangeli wrote:
>
> If the userspace tlb lookup is started during munmap the tlb can contain
> garbage before invalidate_tlb.

No. If we wait until after the TLB fill to actually free the page table
pages, there is _no_ way the TLB can contain garbage, because the page
directories will never have had garbage in them while any TLB lookup
could be active.

Which is the whole _point_ of the patches.

> What I don't understand is how the BTB can invoke random userspace tlb
> fills when we are running do_munmap, there's no point at all in doing
> that. If the cpu sees a read of an user address after invalidate_tlb,
> the tlb fill must not be started because it's before an invalidate_tlb.

Take a course in CPU design if you want to understand why a CPU
front-end might speculatively start accessing something before the
back-end has actually told it what the "something" actually is.

But don't argue with the patch.

> And if it's true that not even irqs are barriers for the tlb fills
> invoked by this p4-BTB thing

It has nothing to do with the BTB - the BTB is just a source of
speculative addresses to start looking at.

But yes, Intel tells me that the only thing that is guaranteed to
serialize a TLB lookup is a TLB invalidate. NOTHING else.

> so if leave_mm is really necessary, then 2.5 is as
> well wrong in UP, because the pagetable can be scribbled by irqs in a UP
> machine, and so the fastmode must go away even in 1 cpu systems.

Yes. Except I will make the 2.5.x code use the pmd quicklists instead
(both fast and slow mode), since that actually ends up being "nicer"
from a cross-architecture standpoint (right now the i386 careful mode
depends on the fact that page directories are regular pages - which is
not true on other CPUs).

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 22:04     ` Linus Torvalds
@ 2002-05-23 23:22       ` Andrea Arcangeli
  2002-05-23 23:51         ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-05-23 23:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Thu, May 23, 2002 at 03:04:36PM -0700, Linus Torvalds wrote:
>
> On Thu, 23 May 2002, Andrea Arcangeli wrote:
> >
> > If the userspace tlb lookup is started during munmap the tlb can contain
> > garbage before invalidate_tlb.
>
> No.

Above I just repeated what you said as confirmation of your:

	By the time the invalidate_tlb happens, the TLB fill has already
	finished, and has already picked up garbage.

Not sure why you say "no" about it.

> If we wait until after the TLB fill to actually free the page table
> pages, there is _no_ way the TLB can contain garbage, because the page
> directories will never have had garbage in them while any TLB lookup
> could be active.

Agreed, doing the safe ordering always is clearly safe, just overkill
for example while freeing the pages (not the pagetables); in fact the
2.4 patch floating around allows the fastmode with mm_users == 1 while
freeing the pages. Probably it's not a significant optimization, but
maybe mostly for documentation reasons we could resurrect the
mm_users == 1 fastmode while freeing pages (while freeing pagetables the
fastmode must go away completely anyway, that was the bug).

> > What I don't understand is how the BTB can invoke random userspace tlb
> > fills when we are running do_munmap, there's no point at all in doing
> > that. If the cpu sees a read of an user address after invalidate_tlb,
> > the tlb fill must not be started because it's before an invalidate_tlb.
>
> Take a course in CPU design if you want to understand why a CPU
> front-end might speculatively start accessing something before the
> back-end has actually told it what the "something" actually is.
>
> But don't argue with the patch.

If I'm arguing, it's just because until now there was a lack of
coherency in the explanations and in the code, so I was simply stuck and
I couldn't extract something that made complete sense to me; there was
still some contradiction. Now that you admitted 2.5 is buggy in one
place I can finally drop a collision and things make complete sense
again. Without your help I could never have understood the changes (like
leave_mm) that were going on in 2.5 and their 2.4 equivalent, thanks.

> > And if it's true that not even irqs are barriers for the tlb fills
> > invoked by this p4-BTB thing
>
> It has nothing to do with the BTB - the BTB is just a source of
> speculative addresses to start looking at.
>
> But yes, Intel tells me that the only thing that is guaranteed to
> serialize a TLB lookup is a TLB invalidate. NOTHING else.

If the only thing that serializes the speculative tlb fill is the tlb
invalidate, then the irq won't serialize it. Probably due to the lack of
any course in out of order CPU design, I'm not aware of any other cpu
(not only x86) out there that exposes the internal out of order
speculative actions of the CPU to the local CPU stream. I recall you
even said once that out of order cpus are hard to make exactly because
it's hard to take care of the serialization across irqs to make all out
of order actions transparent to the local cpu stream. I'd be curious to
know how many other cpus have this new "feature".

BTW, about the quicklists, I think they were nicer to have than just the
per-cpu page allocator in front of the global page allocator (the
allocator affinity is at a lower layer, much easier to be polluted with
non-per-cpu stuff than the pte quicklists). So I'm not really against
resurrecting them, however I suggest building the pte chain always with
a list on the page struct and never overwriting the contents of the pte
like happens in most current 2.4 quicklists (I partly fixed that in
pte-highmem). This way it is more general. Then, once it's documented
that the shrinkage of the quicklist must be done after the tlb flush,
we're fine, like in 2.4 as you said.

Last but not least, the tlb shootdown for the page freeing in
zap_page_range currently is also very inefficient, and it should be
rewritten not to use such a tiny and inefficient array with a few
hundred pte pointers, but to just queue all the pages pending to be
freed in a list_head allocated in the tlb_gather_t structure (so such
per-cpu structure will also shrink a lot), so we don't need to trigger
spurious tlb flushes every time such a tiny array overflows (basically
it will work like the pte_quicklist, but it will be shrunk completely
after the tlb flush; it's not a cache, it will simply be a dispose_list,
and it should make the tlb shootdown code more readable too).

Andrea

^ permalink raw reply	[flat|nested] 13+ messages in thread
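[Editor's note: the list-based gather proposed in the last paragraph above
can be sketched in a few lines of userspace C. The intrusive-list helpers
mimic the kernel's `list_head`; everything else (`tlb_queue_page`, the
`freed` flag standing in for returning pages to the allocator) is a
hypothetical illustration, not the eventual kernel implementation.]

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a shootdown queue built on an intrusive list instead of a
 * small fixed array: the list never overflows, so no spurious tlb flush
 * is ever forced, and everything is disposed of after one flush. */

struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next; n->prev = h;
    h->next->prev = n; h->next = n;
}

struct page {
    struct list_head list;     /* chains the page while its free is pending */
    int freed;                 /* simulates returning it to the allocator */
};

struct tlb_gather {
    struct list_head pending;  /* unbounded: no overflow, no spurious flush */
};

static void tlb_gather_init(struct tlb_gather *tlb)
{
    INIT_LIST_HEAD(&tlb->pending);
}

static void tlb_queue_page(struct tlb_gather *tlb, struct page *p)
{
    list_add(&p->list, &tlb->pending);
}

/* After the (simulated) tlb flush, dispose of the whole list at once --
 * a dispose_list, not a cache: it is drained completely every time. */
static int tlb_finish_mmu(struct tlb_gather *tlb)
{
    int n = 0;
    struct list_head *cur = tlb->pending.next;
    while (cur != &tlb->pending) {
        struct page *p = (struct page *)((char *)cur - offsetof(struct page, list));
        cur = cur->next;
        p->freed = 1;          /* stand-in for __free_pages(p) */
        n++;
    }
    INIT_LIST_HEAD(&tlb->pending);
    return n;
}
```

The design point is the one Andrea makes: the array version must flush
and drain every few hundred entries, while the list version batches an
arbitrarily large unmap behind a single flush.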
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 23:22       ` Andrea Arcangeli
@ 2002-05-23 23:51         ` Linus Torvalds
  2002-05-24  0:27           ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2002-05-23 23:51 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: linux-kernel

On Fri, 24 May 2002, Andrea Arcangeli wrote:
> On Thu, May 23, 2002 at 03:04:36PM -0700, Linus Torvalds wrote:
> >
> > On Thu, 23 May 2002, Andrea Arcangeli wrote:
> > >
> > > If the userspace tlb lookup is started during munmap the tlb can contain
> > > garbage before invalidate_tlb.
> >
> > No.
>
> Above I just repeated what you said as confirmation of your:
>
> 	By the time the invalidate_tlb happens, the TLB fill has already
> 	finished, and has already picked up garbage.
>
> Not sure why you say "no" about it.

With the intel patches for 2.4.x, or with the 2.5.x tree, the problem is
fixed.

I don't see why you argue against the patches - they've been tested, and
they clearly fix a kernel bug that I've explained, and that Intel itself
has even posted explanations about.

> Agreed, doing the safe ordering always is clearly safe, just overkill
> for example while freeing the pages (not the pagetables); in fact the
> 2.4 patch floating around allows the fastmode with mm_users == 1 while
> freeing the pages.

Yes, you can free the pages themselves if mm_users is 1 with the
fastmode. That's one of the reasons I'm going to make the 2.5.x tree use
the pmd_quicklist, so that the page table tree itself can more cleanly
be handled differently than the pages.

> Probably it's not a significant optimization, but
> maybe mostly for documentation reasons we could resurrect the
> mm_users == 1 fastmode while freeing pages (while freeing pagetables the
> fastmode must go away completely anyway, that was the bug).

The other thing I was considering was to actually make fastmode go away;
it doesn't really seem to buy anything. If anything it might be possible
that the "slow" case is faster than the fast case for some loads, simply
because it has nicer icache "batching" behaviour.

> > But yes, Intel tells me that the only thing that is guaranteed to
> > serialize a TLB lookup is a TLB invalidate. NOTHING else.
>
> If the only thing that serializes the speculative tlb fill is the tlb
> invalidate, then the irq won't serialize it.

Right.

HOWEVER, there's obviously another issue - the TLB lookup will complete
at some point on its own, and the interrupt may well be slow enough that
it effectively serializes the TLB lookup just by virtue of taking many
cycles.

In fact I don't think Intel was ever able to reproduce the lazy-TLB
problem; their engineers only speculated that in theory it could be an
issue.

I personally think it's very unlikely that the "load_cr3()" makes any
real difference, but we also have another issue: because we clear the
"vm_cpu_bitmap" thing for that CPU, we'll only ever get _one_ cross-call
for such a lazy TLB. It is quite possible for one unmap() to cause
_multiple_ invalidates because it has a lot of page tables to go over,
and so the second page table clearing might not trigger an interrupt at
all. You also lose the TLB flushes for future unmaps in case the CPU
stays in lazy-TLB mode.

So that's another reason for loading %cr3 with a "known good" page table
value: because we inhibit future TLB flushes, we really should also make
sure that we're not continuing to use this "known-unstable" page table
tree.

> Probably due to the lack of any course in out of order CPU design, I'm
> not aware of any other cpu (not only x86) out there that exposes the
> internal out of order speculative actions of the CPU

I agree. I'm actually surprised myself at just how aggressive the P4 is,
but on the other hand I think it's very interesting behaviour.

> BTW, about the quicklists, I think they were nicer to have than just the
> per-cpu page allocator in front of the global page allocator (the
> allocator affinity is at a lower layer, much easier to be polluted with
> non-per-cpu stuff than the pte quicklists). So I'm not really against
> resurrecting them, however I suggest building the pte chain always with
> a list on the page struct and never overwriting the contents of the pte
> like happens in most current 2.4 quicklists (I partly fixed that in
> pte-highmem).

I'd suggest (ab-)using something like page->mapping for the quicklist
following. You can use the page itself (the pointer will always be even,
so using the page will not re-trigger the bug on x86 because the
"garbage" written to the page table never has the Present bit set), but
it's nicer from a cache standpoint to use the "struct page" area.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
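[Editor's note: the page->mapping quicklist Linus suggests amounts to a
singly linked freelist threaded through a pointer field of `struct page`.
A minimal userspace sketch follows; the `struct page` here is a toy, the
function names are hypothetical, and the real list would be per-cpu.]

```c
#include <assert.h>
#include <stddef.h>

/* Freelist threaded through a pointer field of struct page, so a freed
 * page-table page never has anything written into its own contents and
 * can never be mistaken for a present pte by a late speculative walk. */

struct page {
    void *mapping;             /* doubles as the freelist link here */
};

static struct page *pte_quicklist;  /* would be per-cpu in the kernel */
static int pgtable_cache_size;

static void pte_free_fast(struct page *p)
{
    p->mapping = pte_quicklist;     /* push on the freelist head */
    pte_quicklist = p;
    pgtable_cache_size++;
}

static struct page *pte_alloc_fast(void)
{
    struct page *p = pte_quicklist;
    if (p) {
        pte_quicklist = p->mapping; /* pop the most recently freed page */
        p->mapping = NULL;
        pgtable_cache_size--;
    }
    return p;   /* NULL means: fall back to the real page allocator */
}
```

The cache point Linus makes is that the link lives in the `struct page`
array (already hot for the allocator) rather than in the freed page
itself, which would have to be mapped and touched just to follow the
chain.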
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
  2002-05-23 23:51         ` Linus Torvalds
@ 2002-05-24  0:27           ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2002-05-24  0:27 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Thu, May 23, 2002 at 04:51:17PM -0700, Linus Torvalds wrote:
> On Fri, 24 May 2002, Andrea Arcangeli wrote:
> > Agreed, doing the safe ordering always is clearly safe, just overkill
> > for example while freeing the pages (not the pagetables); in fact the
> > 2.4 patch floating around allows the fastmode with mm_users == 1 while
> > freeing the pages.
>
> Yes, you can free the pages themselves if mm_users is 1 with the
> fastmode. That's one of the reasons I'm going to make the 2.5.x tree use
> the pmd_quicklist, so that the page table tree itself can more cleanly
> be handled differently than the pages.

agreed.

> > Probably it's not a significant optimization, but
> > maybe mostly for documentation reasons we could resurrect the
> > mm_users == 1 fastmode while freeing pages (while freeing pagetables
> > the fastmode must go away completely anyway, that was the bug).
>
> The other thing I was considering was to actually make fastmode go away;
> it doesn't really seem to buy anything. If anything it might be possible
> that the "slow" case is faster than the fast case for some loads, simply
> because it has nicer icache "batching" behaviour.

most probably worthwhile to drop it.

> > But yes, Intel tells me that the only thing that is guaranteed to
> > serialize a TLB lookup is a TLB invalidate. NOTHING else.
>
> > If the only thing that serializes the speculative tlb fill is the tlb
> > invalidate, then the irq won't serialize it.
>
> Right.
>
> HOWEVER, there's obviously another issue - the TLB lookup will complete
> at some point on its own, and the interrupt may well be slow enough that
> it effectively serializes the TLB lookup just by virtue of taking many
> cycles.
>
> In fact I don't think Intel was ever able to reproduce the lazy-TLB
> problem; their engineers only speculated that in theory it could be an
> issue.
>
> I personally think it's very unlikely that the "load_cr3()" makes any
> real difference, but we also have another issue: because we clear the
> "vm_cpu_bitmap" thing for that CPU, we'll only ever get _one_ cross-call
> for such a lazy TLB. It is quite possible for one unmap() to cause
> _multiple_ invalidates because it has a lot of page tables to go over,
> and so the second page table clearing might not trigger an interrupt at
> all. You also lose the TLB flushes for future unmaps in case the CPU
> stays in lazy-TLB mode.

ok, I see that now that I know the P4 fills random user tlb entries
during any kernel code completely unrelated to the userspace virtual
addresses it is resolving speculatively. The reason it just didn't look
worthwhile to do that is that the cpu never knows if it will ever access
those user addresses again; the higher the context switch rate, the
lower the probability that any speculative tlb fill on user addresses
will pay off. Of course I'm assuming they're x86 cpus without
ASN/tlb-tagging, which isn't possible to implement in hardware without
OS cooperation. With all other x86 cpus leave_mm has never been a
problem, because a CPU could never start filling a user tlb entry while
it was in lazy mode: before even trying to access userspace addresses
the kernel is supposed to do a tlb flush during switch_mm, and so a tlb
fill in userspace could never run before the next tlb flush.

> So that's another reason for loading %cr3 with a "known good" page table
> value: because we inhibit future TLB flushes, we really should also make
> sure that we're not continuing to use this "known-unstable" page table
> tree.
>
> > Probably due to the lack of any course in out of order CPU design, I'm
> > not aware of any other cpu (not only x86) out there that exposes the
> > internal out of order speculative actions of the CPU
>
> I agree. I'm actually surprised myself at just how aggressive the P4 is,
> but on the other hand I think it's very interesting behaviour.
>
> > BTW, about the quicklists, I think they were nicer to have than just the
> > per-cpu page allocator in front of the global page allocator (the
> > allocator affinity is at a lower layer, much easier to be polluted with
> > non-per-cpu stuff than the pte quicklists). So I'm not really against
> > resurrecting them, however I suggest building the pte chain always with
> > a list on the page struct and never overwriting the contents of the pte
> > like happens in most current 2.4 quicklists (I partly fixed that in
> > pte-highmem).
>
> I'd suggest (ab-)using something like page->mapping for the quicklist
> following. You can use the page itself (the pointer will always be even,
> so using the page will not re-trigger the bug on x86 because the
> "garbage" written to the page table never has the Present bit set), but
> it's nicer from a cache standpoint to use the "struct page" area.

Agreed. In 2.4 pte-highmem (where I don't want to kmap_atomic inside
every pte_free_fast, but I want to just work with pages so I avoid the
kmapping overhead) I simply used page->list, which is the most
appropriate list to chain with in a wrapper in front of __free_pages:

	static inline void pte_free_fast(struct page * page)
	{
		list_add(&page->list, &pte_quicklist);
		pgtable_cache_size++;
	}

Andrea

^ permalink raw reply	[flat|nested] 13+ messages in thread
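[Editor's note: for symmetry with Andrea's `pte_free_fast` above, here is
a hypothetical allocation-side counterpart in the same style, sketched as
self-contained userspace C. The `list_head` helpers mirror the kernel's;
`pte_alloc_fast` returning NULL to signal "fall back to `__alloc_pages`"
is an assumption of this sketch, not quoted code.]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal list_head machinery, mirroring the kernel's include/linux/list.h. */
struct list_head { struct list_head *next, *prev; };

static void list_add(struct list_head *n, struct list_head *h)
{
    n->next = h->next; n->prev = h;
    h->next->prev = n; h->next = n;
}

static void list_del(struct list_head *e)
{
    e->next->prev = e->prev;
    e->prev->next = e->next;
}

struct page { struct list_head list; };

/* Statically initialized empty list (next/prev point at the head). */
static struct list_head pte_quicklist = { &pte_quicklist, &pte_quicklist };
static int pgtable_cache_size;

/* Andrea's free side: chain the page via page->list, no kmap needed. */
static void pte_free_fast(struct page *page)
{
    list_add(&page->list, &pte_quicklist);
    pgtable_cache_size++;
}

/* Hypothetical alloc side: pop the most recently freed page, or return
 * NULL so the caller falls back to the real page allocator. */
static struct page *pte_alloc_fast(void)
{
    struct list_head *e;
    if (pte_quicklist.next == &pte_quicklist)
        return NULL;                     /* cache empty */
    e = pte_quicklist.next;
    list_del(e);
    pgtable_cache_size--;
    return (struct page *)((char *)e - offsetof(struct page, list));
}
```

As in the thread's conclusion, the shrinking of such a quicklist back to
the page allocator must happen only after the tlb flush, or the original
free_pgtables race simply reappears one level up.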
* Re: Q: backport of the free_pgtables tlb fixes to 2.4
@ 2002-05-31 13:19 Roland Fehrenbacher
0 siblings, 0 replies; 13+ messages in thread
From: Roland Fehrenbacher @ 2002-05-31 13:19 UTC (permalink / raw)
To: linux-kernel
Cc: Pallipadi, Venkatesh, Nakajima, Jun, Seth, Rohit, Luck, Tony,
Mallick, Asit K, Hugh Dickins, pk
Hi all,
we actually also ran into this bug on a dual xeon (2GHz
Prestonia) with Hyperthreading enabled. Without Hyperthreading, it was hard to
reproduce the problem. The following script (requires bash >= 2.0.5) provokes
the problem in a very short time. With the patch from Intel, the problem is
gone (patched into 2.4.18), no side effects discovered so far.
Thanks to Hugh Dickins <hugh@veritas.com> for pointing out the patch to us.
Cheers,
Roland
Here is the script:
-----------------------
#!/bin/sh
# Small script to provoke a SIGSEGV on SMP machines with the kernel problem
# requires bash >= 2.0.5. The script fails e.g. on dual xeon systems without
# the Intel patch (Message title: "Illegal instruction failures fixes for 2.4.18"
# in the kernel mailing list, 22.5.2002).
# Script simply executes a couple of mktemp loops in the background, and tries
# to read the generated file back.
#
# Author: Roland Fehrenbacher, rf@q-leap.com
base_out=/tmp/test-sigsev
maxprocs=5
myname=`basename $0`
pids=""
files=""
trap 'killall $myname; rm -f $files ${base_out}??????;' 0 1 2 3 15
for (( num=1; num <= $maxprocs; num++ )); do
    while true; do
        file=`mktemp ${base_out}XXXXXX` || { echo mktemp failed; break; }
        cat $file || { echo open $file failed; break; }
        rm -f $file
    done > ${base_out}-${num}.out 2>&1 &
    pids="$pids $!"
    files="$files ${base_out}-${num}.out"
done
printf "PIDS running = $pids\n--\n"
printf "No further output should appear if no bug is present. Run script for\n"
printf "a couple of hours to be sure everything is ok. Ctrl-C to stop.\n--\n"
i=1
while true; do
    echo $i: ok >> ${base_out}-${num}.out
    for pid in $pids; do
        ps -p $pid > /dev/null 2>&1 || \
            { echo "count = $i: Pid $pid died" >> ${base_out}-${num}.out; \
              pids=`echo $pids | sed -e s/$pid//g`; }
    done
    sleep 10
    ((i++))
done &
files="$files ${base_out}-${num}.out"
sleep 1
tail -f $files | grep "count ="
^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2002-05-31 13:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-05-23  5:14 Q: backport of the free_pgtables tlb fixes to 2.4 Andrea Arcangeli
2002-05-23  6:01 ` Linus Torvalds
2002-05-23 19:57   ` Andrea Arcangeli
2002-05-23 20:05     ` Linus Torvalds
2002-05-23 20:41       ` Andrea Arcangeli
2002-05-23 19:53         ` Martin Dalecki
2002-05-23 21:15           ` Andrea Arcangeli
2002-05-23 20:40             ` Martin Dalecki
2002-05-23 22:04         ` Linus Torvalds
2002-05-23 23:22           ` Andrea Arcangeli
2002-05-23 23:51             ` Linus Torvalds
2002-05-24  0:27               ` Andrea Arcangeli
-- strict thread matches above, loose matches on Subject: below --
2002-05-31 13:19 Roland Fehrenbacher