* [patch, 2.6.10-rc2] fix __flush_tlb*() preemption bug on CONFIG_PREEMPT
@ 2004-11-18 12:46 Ingo Molnar
  2004-11-18 15:47 ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Ingo Molnar @ 2004-11-18 12:46 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: linux-kernel


the PREEMPT_RT feature triggered another generic kernel PREEMPT bug on
x86 and x64: there exists a single-instruction race window in
__flush_tlb(). If the kernel preempts exactly there in a lazy-TLB task,
and certain other rare scheduling and MM conditions hold as well (a
particular constellation of threads and lazy-TLB kernel threads
occurred), and the lazy-TLB task then inherits another user ->active_mm
and switches to a task whose ->mm is that newly inherited ->active_mm,
then the wrong cr3 is loaded and inherited by that next, non-lazy-TLB
task. Only then does the scenario manifest itself, typically as an
infinite-pagefault lockup occurring long after the fact, upon the next
userspace access.
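
For illustration, one interleaving that hits this window might look
roughly like the following (task and mm names are made up, and the
context-switch details are simplified):

/*
 * Hypothetical interleaving on a CONFIG_PREEMPT kernel:
 *
 *   kernel thread K, lazy-TLB, ->active_mm == A, calls __flush_tlb():
 *       movl %%cr3, %0        <- tmpreg := A's page directory
 *       === preempted here ===
 *   other tasks run; when K is scheduled back in it inherits
 *   ->active_mm == B, and cr3 already points at B's page directory
 *   K resumes __flush_tlb():
 *       movl %0, %%cr3        <- reloads A's pgd while ->active_mm is B
 *   K later switches to a task whose ->mm == B: the mm switch sees the
 *   same mm on both sides and can skip the cr3 reload, so that task now
 *   runs on A's page tables and pagefaults endlessly on its next
 *   userspace access.
 */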

I did a quick audit of how often __flush_tlb()s are done in preemptible
sections in the upstream kernel and the number is surprisingly high
(ioremap done from driver init, or copy_mm), so I think the attached
patch is the safest solution - disable preemption while flushing the
TLB. [We could perhaps add a non-preempt variant for e.g. use in the
scheduler, but that should be a separate patch.]
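
For illustration, such a non-preempt variant might look roughly like
this (the name __flush_tlb_nopreempt() is made up here, and callers
would have to guarantee that preemption is already disabled):

#define __flush_tlb_nopreempt()						\
	do {								\
		unsigned int tmpreg;					\
									\
		/* caller must have preemption disabled already */	\
		__asm__ __volatile__(					\
			"movl %%cr3, %0;              \n"		\
			"movl %0, %%cr3;  # flush TLB \n"		\
			: "=r" (tmpreg)					\
			:: "memory");					\
	} while (0)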

note that reproducing this bug was only possible under PREEMPT_RT (there
it can be triggered in 30 seconds, with the right reproducer) - it needs
a really unlikely scenario which PREEMPT_RT's high concurrency does
offer but which is apparently much harder to reproduce in the vanilla
kernel. The patch fixes x86 and x64. Other architectures are most likely
safe, but they need review as well.

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/include/asm-i386/tlbflush.h.orig
+++ linux/include/asm-i386/tlbflush.h
@@ -5,15 +5,32 @@
 #include <linux/mm.h>
 #include <asm/processor.h>
 
+/*
+ * TLB-flush needs to be nonpreemptible on PREEMPT due to the
+ * following complex race scenario:
+ *
+ * if the current task is lazy-TLB and does a TLB flush and
+ * gets preempted after the movl %%cr3, %0 but before the
+ * movl %0, %%cr3 then its ->active_mm might change and it will
+ * install the wrong cr3 when it switches back. This is not a
+ * problem for the lazy-TLB task itself, but if the next task it
+ * switches to has an ->mm that is also the lazy-TLB task's
+ * new ->active_mm, then the scheduler will assume that cr3 is
+ * the new one, while we overwrote it with the old one. The result
+ * is the wrong cr3 in the new (non-lazy-TLB) task, which typically
+ * causes an infinite pagefault upon the next userspace access.
+ */
 #define __flush_tlb()							\
 	do {								\
 		unsigned int tmpreg;					\
 									\
+		preempt_disable();					\
 		__asm__ __volatile__(					\
 			"movl %%cr3, %0;              \n"		\
 			"movl %0, %%cr3;  # flush TLB \n"		\
 			: "=r" (tmpreg)					\
 			:: "memory");					\
+		preempt_enable();					\
 	} while (0)
 
 /*
@@ -24,6 +41,7 @@
 	do {								\
 		unsigned int tmpreg;					\
 									\
+		preempt_disable();					\
 		__asm__ __volatile__(					\
 			"movl %1, %%cr4;  # turn off PGE     \n"	\
 			"movl %%cr3, %0;                     \n"	\
@@ -33,6 +51,7 @@
 			: "r" (mmu_cr4_features & ~X86_CR4_PGE),	\
 			  "r" (mmu_cr4_features)			\
 			: "memory");					\
+		preempt_enable();					\
 	} while (0)
 
 extern unsigned long pgkern_mask;
@@ -85,6 +104,13 @@ extern unsigned long pgkern_mask;
 
 static inline void flush_tlb_mm(struct mm_struct *mm)
 {
+	/*
+	 * This is safe on PREEMPT because if we preempt
+	 * right after the check but before the __flush_tlb(),
+	 * and if ->active_mm changes, then we might miss a
+	 * TLB flush, but that TLB flush happened already when
+	 * ->active_mm was changed:
+	 */
 	if (mm == current->active_mm)
 		__flush_tlb();
 }
--- linux/include/asm-x86_64/tlbflush.h.orig
+++ linux/include/asm-x86_64/tlbflush.h
@@ -9,11 +9,13 @@
 	do {								\
 		unsigned long tmpreg;					\
 									\
+		preempt_disable();					\
 		__asm__ __volatile__(					\
 			"movq %%cr3, %0;  # flush TLB \n"		\
 			"movq %0, %%cr3;              \n"		\
 			: "=r" (tmpreg)					\
 			:: "memory");					\
+		preempt_enable();					\
 	} while (0)
 
 /*
@@ -24,6 +26,7 @@
 	do {								\
 		unsigned long tmpreg;					\
 									\
+		preempt_disable();					\
 		__asm__ __volatile__(					\
 			"movq %1, %%cr4;  # turn off PGE     \n"	\
 			"movq %%cr3, %0;  # flush TLB        \n"	\
@@ -33,6 +36,7 @@
 			: "r" (mmu_cr4_features & ~X86_CR4_PGE),	\
 			  "r" (mmu_cr4_features)			\
 			: "memory");					\
+		preempt_enable();					\
 	} while (0)
 
 extern unsigned long pgkern_mask;

* Re: [patch, 2.6.10-rc2] fix __flush_tlb*() preemption bug on CONFIG_PREEMPT
  2004-11-18 12:46 [patch, 2.6.10-rc2] fix __flush_tlb*() preemption bug on CONFIG_PREEMPT Ingo Molnar
@ 2004-11-18 15:47 ` Linus Torvalds
  2004-11-18 19:46   ` Ingo Molnar
  0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2004-11-18 15:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel



On Thu, 18 Nov 2004, Ingo Molnar wrote:
> 
> note that reproducing this bug was only possible under PREEMPT_RT (there
> it can be triggered in 30 seconds, with the right reproducer) - it needs
> a really unlikely scenario which PREEMPT_RT's high concurrency does
> offer but which is apparently much harder to reproduce in the vanilla
> kernel. The patch fixes x86 and x64. Other architectures are most likely
> safe, but they need review as well.

Ok, that's a pretty race.

However, I'm wondering whether this is the proper approach. After all, a
lazy-TLB process should never have any reason to flush its TLB, since "its
TLB" just ain't there, and it ends up flushing somebody else's.

So I assume that this happens only with kswapd or similar? It really might 
be interesting to make the "we were a lazy tlb, and we're flushing 
somebody else" case do a stack dump, because I _suspect_ that this really 
is a special thing, and maybe the right thing to do is to make it special 
in _that_ path.
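
As a rough sketch, the kind of check meant here could look something
like this (the exact condition and call site are illustrative only):

	/*
	 * Debug check: a task without an ->mm of its own (lazy-TLB)
	 * that flushes the TLB from a preemptible section is really
	 * flushing somebody else's TLB - dump a stack trace so the
	 * call site can be found and fixed.
	 */
	if (unlikely(!current->mm && !preempt_count())) {
		printk(KERN_WARNING "TLB flush from lazy-TLB task %s\n",
		       current->comm);
		dump_stack();
	}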

Very impressive debugging, btw. That must have been painful.

		Linus

* Re: [patch, 2.6.10-rc2] fix __flush_tlb*() preemption bug on CONFIG_PREEMPT
  2004-11-18 19:46   ` Ingo Molnar
@ 2004-11-18 19:05     ` Linus Torvalds
  2004-11-19  2:40       ` Nick Piggin
  0 siblings, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2004-11-18 19:05 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel



On Thu, 18 Nov 2004, Ingo Molnar wrote:
> 
> yeah, if we have such a debugging check then it should be pretty safe to
> find the places one by one and fix them. The patch below (against -rc2)
> adds the debugging check and fixes ioremap, on x86. I've done a quick
> test boot and it doesn't trigger any other warnings.

What about moving the "preempt_disable/enable" into the special flush
cases (i.e. "global_flush_tlb()" and "flush_tlb_all()")? I don't think it
makes sense in the code that uses them.
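
A rough sketch of that arrangement (UP case shown, and illustrative
only - the real definitions are macros on UP and cross-CPU functions on
SMP), with the raw primitive left preemption-unaware and the global
entry point taking care of it:

static inline void flush_tlb_all(void)
{
	preempt_disable();
	__flush_tlb_all();	/* raw flush, no preempt handling of its own */
	preempt_enable();
}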

Then we'd just make it a rule that TLB flushing _has_ to be done either
from a valid VM context (that follows the flush) or non-preemptible.  
Otherwise it's kind of undefined what happens to the current context, not
just from an implementation angle, but from a _conceptual_ standpoint.

In contrast, the "global" flushes obviously make sense from any context
(there is no "conceptual" confusion there and the confusion is purely an
implementation bug), which is why I think they should do the preemption
disable themselves.

I really want core kernel code to make sense from a conceptual angle. Not 
just "it works", but "it works because it makes sense".

> (maybe we want to change that check to WARN_ON(!preempt_count()) -
> while non-lazy-TLB tasks are fine right now, I think it's quite fragile
> to allow a TLB flush to be preempted, and it's a ticking time bomb.)

Hmm.. I don't see the problem. Maybe we should have other rules (like "you 
have to hold the page table lock") that would make that true, but I'd 
prefer the warning to have some "raison d'etre", if you see what I mean?

Another way of saying the same thing: if we cannot put a comment saying
"this is why we check _this_ condition", then we shouldn't check that
condition.

But yes, it may well make perfect sense to say "we have to hold the page 
table spinlock in order to flush the tlb". Is that actually true right 
now?

		Linus

* Re: [patch, 2.6.10-rc2] fix __flush_tlb*() preemption bug on CONFIG_PREEMPT
  2004-11-18 15:47 ` Linus Torvalds
@ 2004-11-18 19:46   ` Ingo Molnar
  2004-11-18 19:05     ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Ingo Molnar @ 2004-11-18 19:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel


* Linus Torvalds <torvalds@osdl.org> wrote:

> On Thu, 18 Nov 2004, Ingo Molnar wrote:
> > 
> > note that reproducing this bug was only possible under PREEMPT_RT (there
> > it can be triggered in 30 seconds, with the right reproducer) - it needs
> > a really unlikely scenario which PREEMPT_RT's high concurrency does
> > offer but which is apparently much harder to reproduce in the vanilla
> > kernel. The patch fixes x86 and x64. Other architectures are most likely
> > safe, but they need review as well.
> 
> Ok, that's a pretty race.
> 
> However, I'm wondering whether this is the proper approach. After all,
> a lazy-TLB process should never have any reason to flush its TLB,
> since "its TLB" just ain't there, and it ends up flushing somebody
> else's.

e.g. it can happen with the init thread doing ioremap()s during driver
setup - there it's the kernel VM that gets modified and flushed. I
suspect driver-related ioremap() could in theory happen in just about
any thread context, and in fact during modprobe it will definitely
execute in a lazy-TLB context.

Another place where I've seen TLB flushes done with preemption enabled
is the flush_tlb_mm() in dup_mmap() - but there we should never be
lazy-TLB. This case should be kept in mind nevertheless: if we ever
allow tasks to 'switch' the MM of another task asynchronously (e.g.
there are UML patches that do that), then this would be unsafe code too.

The third category is activate_mm() users (exec and aio), but they are
in a critical section due to task_lock(). (That is not a nonpreemptible
critical section on PREEMPT_RT, though - this probably contributed to
the bug being more likely to trigger there.)

> So I assume that this happens only with kswapd or similar? It really
> might be interesting to make the "we were a lazy tlb, and we're
> flushing somebody else" case do a stack dump, because I _suspect_ that
> this really is a special thing, and maybe the right thing to do is to
> make it special in _that_ path.

yeah, if we have such a debugging check then it should be pretty safe to
find the places one by one and fix them. The patch below (against -rc2)
adds the debugging check and fixes ioremap, on x86. I've done a quick
test boot and it doesn't trigger any other warnings.

(maybe we want to change that check to WARN_ON(!preempt_count()) -
while non-lazy-TLB tasks are fine right now, I think it's quite fragile
to allow a TLB flush to be preempted, and it's a ticking time bomb.)

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/arch/i386/mm/ioremap.c.orig
+++ linux/arch/i386/mm/ioremap.c
@@ -93,7 +93,11 @@ static int remap_area_pages(unsigned lon
 		dir++;
 	} while (address && (address < end));
 	spin_unlock(&init_mm.page_table_lock);
+
+	preempt_disable();
 	flush_tlb_all();
+	preempt_enable();
+
 	return error;
 }
 
@@ -215,7 +219,9 @@ void __iomem *ioremap_nocache (unsigned 
 			iounmap(p); 
 			p = NULL;
 		}
+		preempt_disable();
 		global_flush_tlb();
+		preempt_enable();
 	}
 
 	return p;					
@@ -236,7 +242,9 @@ void iounmap(volatile void __iomem *addr
 		change_page_attr(virt_to_page(__va(p->phys_addr)),
 				 p->size >> PAGE_SHIFT,
 				 PAGE_KERNEL); 				 
+		preempt_disable();
 		global_flush_tlb();
+		preempt_enable();
 	} 
 	kfree(p); 
 }
--- linux/include/asm-i386/tlbflush.h.orig
+++ linux/include/asm-i386/tlbflush.h
@@ -5,10 +5,18 @@
 #include <linux/mm.h>
 #include <asm/processor.h>
 
+/*
+ * flush the TLB.
+ *
+ * NOTE: this must not be done from a lazy-TLB task (kernel thread,
+ * exit path) while in a preemptible section.
+ */
 #define __flush_tlb()							\
 	do {								\
 		unsigned int tmpreg;					\
 									\
+		WARN_ON(!preempt_count() && !current->mm);		\
+									\
 		__asm__ __volatile__(					\
 			"movl %%cr3, %0;              \n"		\
 			"movl %0, %%cr3;  # flush TLB \n"		\
@@ -24,6 +32,8 @@
 	do {								\
 		unsigned int tmpreg;					\
 									\
+		WARN_ON(!preempt_count() && !current->mm);		\
+									\
 		__asm__ __volatile__(					\
 			"movl %1, %%cr4;  # turn off PGE     \n"	\
 			"movl %%cr3, %0;                     \n"	\

* Re: [patch, 2.6.10-rc2] fix __flush_tlb*() preemption bug on CONFIG_PREEMPT
  2004-11-18 19:05     ` Linus Torvalds
@ 2004-11-19  2:40       ` Nick Piggin
  0 siblings, 0 replies; 5+ messages in thread
From: Nick Piggin @ 2004-11-19  2:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Andrew Morton, linux-kernel

Linus Torvalds wrote:

> But yes, it may well make perfect sense to say "we have to hold the page 
> table spinlock in order to flush the tlb". Is that actually true right 
> now?
> 

Can we try not to ratify a rule like that? :)

We're somewhat closeish to being able to entirely remove the ptl, so
it might just get awkward if people think they can rely on that rule.

Of course, _if_ holding the ptl is the nicest way to do things in the
mainline kernel then yeah OK I can't argue with that... but if at all
possible... pretty please?

Thanks,
Nick
