All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch, 2.6.10-rc2] fix __flush_tlb*() preemption bug on CONFIG_PREEMPT
@ 2004-11-18 12:46 Ingo Molnar
  2004-11-18 15:47 ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Ingo Molnar @ 2004-11-18 12:46 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: linux-kernel


the PREEMPT_RT feature triggered another generic kernel PREEMPT bug on
x86 and x64: there exists a single-instruction race window in
__flush_tlb(), if the kernel preempts exactly there in a lazy-TLB thread
and certain other, rare scheduling and MM properties are true as well (a
certain constellation of threads and lazy-TLB kernel threads occured),
and the lazy-TLB task then gets another user TLB to inherit, and
switches to a task from which it inherited that new TLB, then the wrong
cr3 is loaded and inherited by this next, non-lazy-TLB next task; then
(and only then) this scenario would typically manifest itself in the
form of an infinite pagefault lockup occuring much after the fact, upon
the next userspace access.

i did a quick audit of how often __flush_tlb()s are done in preemptible
sections in the upstream kernel and the number is surprisingly high
(ioremap done from driver init, or copy_mm), so i think the attached
patch is the safest solution - disable preemption while flushing the
TLB. [We could perhaps add a non-preempt variant for e.g. use in the
scheduler, but that should be a separate patch.]

note that reproducing this bug was only possible under PREEMPT_RT (there
it can be triggered in 30 seconds, with the right reproducer) - it needs
a really unlikely scenario which PREEMPT_RT's high concurrency does
offer but which is apparently much harder to reproduce in the vanilla
kernel. The patch fixes x86 and x64. Other architectures are most likely
safe, but they need review as well.

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/include/asm-i386/tlbflush.h.orig
+++ linux/include/asm-i386/tlbflush.h
@@ -5,15 +5,32 @@
 #include <linux/mm.h>
 #include <asm/processor.h>
 
+/*
+ * TLB-flush needs to be nonpreemptible on PREEMPT due to the
+ * following complex race scenario:
+ *
+ * if the current task is lazy-TLB and does a TLB flush and
+ * gets preempted after the movl %%r3, %0 but before the
+ * movl %0, %%cr3 then its ->active_mm might change and it will
+ * install the wrong cr3 when it switches back. This is not a
+ * problem for the lazy-TLB task itself, but if the next task it
+ * switches to has an ->mm that is also the lazy-TLB task's
+ * new ->active_mm, then the scheduler will assume that cr3 is
+ * the new one, while we overwrote it with the old one. The result
+ * is the wrong cr3 in the new (non-lazy-TLB) task, which typically
+ * causes an infinite pagefault upon the next userspace access.
+ */
 #define __flush_tlb()							\
 	do {								\
 		unsigned int tmpreg;					\
 									\
+		preempt_disable();					\
 		__asm__ __volatile__(					\
 			"movl %%cr3, %0;              \n"		\
 			"movl %0, %%cr3;  # flush TLB \n"		\
 			: "=r" (tmpreg)					\
 			:: "memory");					\
+		preempt_enable();					\
 	} while (0)
 
 /*
@@ -24,6 +41,7 @@
 	do {								\
 		unsigned int tmpreg;					\
 									\
+		preempt_disable();					\
 		__asm__ __volatile__(					\
 			"movl %1, %%cr4;  # turn off PGE     \n"	\
 			"movl %%cr3, %0;                     \n"	\
@@ -33,6 +51,7 @@
 			: "r" (mmu_cr4_features & ~X86_CR4_PGE),	\
 			  "r" (mmu_cr4_features)			\
 			: "memory");					\
+		preempt_enable();					\
 	} while (0)
 
 extern unsigned long pgkern_mask;
@@ -85,6 +104,13 @@ extern unsigned long pgkern_mask;
 
 static inline void flush_tlb_mm(struct mm_struct *mm)
 {
+	/*
+	 * This is safe on PREEMPT because if we preempt
+	 * right after the check but before the __flush_tlb(),
+	 * and if ->active_mm changes, then we might miss a
+	 * TLB flush, but that TLB flush happened already when
+	 * ->active_mm was changed:
+	 */
 	if (mm == current->active_mm)
 		__flush_tlb();
 }
--- linux/include/asm-x86_64/tlbflush.h.orig
+++ linux/include/asm-x86_64/tlbflush.h
@@ -9,11 +9,13 @@
 	do {								\
 		unsigned long tmpreg;					\
 									\
+		preempt_disable();					\
 		__asm__ __volatile__(					\
 			"movq %%cr3, %0;  # flush TLB \n"		\
 			"movq %0, %%cr3;              \n"		\
 			: "=r" (tmpreg)					\
 			:: "memory");					\
+		preempt_enable();					\
 	} while (0)
 
 /*
@@ -24,6 +26,7 @@
 	do {								\
 		unsigned long tmpreg;					\
 									\
+		preempt_disable();					\
 		__asm__ __volatile__(					\
 			"movq %1, %%cr4;  # turn off PGE     \n"	\
 			"movq %%cr3, %0;  # flush TLB        \n"	\
@@ -33,6 +36,7 @@
 			: "r" (mmu_cr4_features & ~X86_CR4_PGE),	\
 			  "r" (mmu_cr4_features)			\
 			: "memory");					\
+		preempt_enable();					\
 	} while (0)
 
 extern unsigned long pgkern_mask;

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2004-11-19  2:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-11-18 12:46 [patch, 2.6.10-rc2] fix __flush_tlb*() preemption bug on CONFIG_PREEMPT Ingo Molnar
2004-11-18 15:47 ` Linus Torvalds
2004-11-18 19:46   ` Ingo Molnar
2004-11-18 19:05     ` Linus Torvalds
2004-11-19  2:40       ` Nick Piggin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.