public inbox for stable@vger.kernel.org
* [PATCH] x86/traps: Don't force in_interrupt() to return true in IST handlers
@ 2016-05-24  3:57 Andy Lutomirski
  2016-05-24  8:59 ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2016-05-24  3:57 UTC (permalink / raw)
  To: x86
  Cc: Borislav Petkov, Andi Kleen, lkml, Peter Zijlstra, Oleg Nesterov,
	Andy Lutomirski, stable

Forcing in_interrupt() to return true if we're not in a bona fide
interrupt confuses the softirq code.  This fixes warnings like:

NOHZ: local_softirq_pending 282

that can happen when running things like selftests/x86.

Cc: stable@vger.kernel.org
Fixes: 959274753857 ("x86, traps: Track entry into and exit from IST context")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

This is intended for x86/urgent.

 arch/x86/kernel/traps.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d1590486204a..25a905674acd 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -96,6 +96,20 @@ static inline void cond_local_irq_disable(struct pt_regs *regs)
 		local_irq_disable();
 }
 
+/*
+ * We want to cause in_atomic() to return true while in an IST handler
+ * so that attempts to schedule will warn.
+ *
+ * We cannot use HARDIRQ_OFFSET or otherwise cause in_interrupt() to
+ * return true: the softirq code assumes that in_interrupt() only
+ * returns true if we will soon execute softirqs, and we can't do that
+ * if an IST entry interrupts kernel code with interrupts disabled.
+ *
+ * Using 3 * PREEMPT_OFFSET instead of just PREEMPT_OFFSET is pure
+ * paranoia.
+ */
+#define IST_OFFSET (3 * PREEMPT_OFFSET)
+
 void ist_enter(struct pt_regs *regs)
 {
 	if (user_mode(regs)) {
@@ -116,7 +130,7 @@ void ist_enter(struct pt_regs *regs)
 	 * on x86_64 and entered from user mode, in which case we're
 	 * still atomic unless ist_begin_non_atomic is called.
 	 */
-	preempt_count_add(HARDIRQ_OFFSET);
+	preempt_count_add(IST_OFFSET);
 
 	/* This code is a bit fragile.  Test it. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "ist_enter didn't work");
@@ -124,7 +138,7 @@ void ist_enter(struct pt_regs *regs)
 
 void ist_exit(struct pt_regs *regs)
 {
-	preempt_count_sub(HARDIRQ_OFFSET);
+	preempt_count_sub(IST_OFFSET);
 
 	if (!user_mode(regs))
 		rcu_nmi_exit();
@@ -155,7 +169,7 @@ void ist_begin_non_atomic(struct pt_regs *regs)
 	BUG_ON((unsigned long)(current_top_of_stack() -
 			       current_stack_pointer()) >= THREAD_SIZE);
 
-	preempt_count_sub(HARDIRQ_OFFSET);
+	preempt_count_sub(IST_OFFSET);
 }
 
 /**
@@ -165,7 +179,7 @@ void ist_begin_non_atomic(struct pt_regs *regs)
  */
 void ist_end_non_atomic(void)
 {
-	preempt_count_add(HARDIRQ_OFFSET);
+	preempt_count_add(IST_OFFSET);
 }
 
 static nokprobe_inline int
-- 
2.5.5


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] x86/traps: Don't force in_interrupt() to return true in IST handlers
  2016-05-24  3:57 [PATCH] x86/traps: Don't force in_interrupt() to return true in IST handlers Andy Lutomirski
@ 2016-05-24  8:59 ` Peter Zijlstra
  2016-05-24  9:36   ` Borislav Petkov
  2016-05-24 15:43   ` Andy Lutomirski
  0 siblings, 2 replies; 7+ messages in thread
From: Peter Zijlstra @ 2016-05-24  8:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Borislav Petkov, Andi Kleen, lkml, Oleg Nesterov, stable

On Mon, May 23, 2016 at 08:57:05PM -0700, Andy Lutomirski wrote:
> Forcing in_interrupt() to return true if we're not in a bona fide
> interrupt confuses the softirq code.  This fixes warnings like:
> 
> NOHZ: local_softirq_pending 282
> 
> that can happen when running things like selftests/x86.
> 
> Cc: stable@vger.kernel.org
> Fixes: 959274753857 ("x86, traps: Track entry into and exit from IST context")
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

> +/*
> + * We want to cause in_atomic() to return true while in an IST handler
> + * so that attempts to schedule will warn.
> + *
> + * We cannot use HARDIRQ_OFFSET or otherwise cause in_interrupt() to
> + * return true: the softirq code assumes that in_interrupt() only
> + * returns true if we will soon execute softirqs, and we can't do that
> + * if an IST entry interrupts kernel code with interrupts disabled.
> + *
> + * Using 3 * PREEMPT_OFFSET instead of just PREEMPT_OFFSET is pure
> + * paranoia.
> + */
> +#define IST_OFFSET (3 * PREEMPT_OFFSET)

So this has implications for code like
kernel/events/internal.h:get_recursion_context() and
kernel/trace/trace.c:get_trace_buf().

Which use a sequence of: in_nmi(), in_irq(), in_softirq() to pick 1 out
of 4 possible contexts.

I would really like the Changelog to reflect on this. The current code
will have ISTs land in in_irq(), with this change, not so much.

Now ISTs as a whole invalidate the whole 'we have only 4 contexts' and
the mapping back to those 4 is going to be somewhat arbitrary I suspect,
but changes like this should be very much aware of these things. And
make an explicit choice.


* Re: [PATCH] x86/traps: Don't force in_interrupt() to return true in IST handlers
  2016-05-24  8:59 ` Peter Zijlstra
@ 2016-05-24  9:36   ` Borislav Petkov
  2016-05-24  9:51     ` Peter Zijlstra
  2016-05-24 15:43   ` Andy Lutomirski
  1 sibling, 1 reply; 7+ messages in thread
From: Borislav Petkov @ 2016-05-24  9:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, x86, Andi Kleen, lkml, Oleg Nesterov, stable

On Tue, May 24, 2016 at 10:59:45AM +0200, Peter Zijlstra wrote:
> Which use a sequence of: in_nmi(), in_irq(), in_softirq() to pick 1 out
> of 4 possible contexts.

So should we make it cleaner and explicit and define a 5th context with
priority higher than NMI?

There's some room between those two:

 *             NMI_MASK:        0x00100000
 * PREEMPT_NEED_RESCHED:        0x80000000

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [PATCH] x86/traps: Don't force in_interrupt() to return true in IST handlers
  2016-05-24  9:36   ` Borislav Petkov
@ 2016-05-24  9:51     ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2016-05-24  9:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, x86, Andi Kleen, linux-kernel, Oleg Nesterov,
	stable, Steven Rostedt


@Andy, it's linux-kernel@vger, not lkml@vger :-)

On Tue, May 24, 2016 at 11:36:29AM +0200, Borislav Petkov wrote:
> On Tue, May 24, 2016 at 10:59:45AM +0200, Peter Zijlstra wrote:
> > Which use a sequence of: in_nmi(), in_irq(), in_softirq() to pick 1 out
> > of 4 possible contexts.
> 
> So should we make it cleaner and explicit and define a 5th context of
> priorities higher than NMI?
> 
> There's some room between those two:
> 
>  *             NMI_MASK:        0x00100000
>  * PREEMPT_NEED_RESCHED:        0x80000000
> 

A lot of pain; we'd have to go grow a whole bunch of things to 5.

Also, I don't think 5 is enough to model all the IST nesting. I'm also
not sure we really care too much; IST stuff is relatively rare. It just
means we can lose IST-based trace events and the like, because it's
treated as recursion.

So I think keeping it at 4 is fine, but we do want to make a
semi-conscious choice on how we map back to those 4.


* Re: [PATCH] x86/traps: Don't force in_interrupt() to return true in IST handlers
  2016-05-24  8:59 ` Peter Zijlstra
  2016-05-24  9:36   ` Borislav Petkov
@ 2016-05-24 15:43   ` Andy Lutomirski
  2016-05-24 17:09     ` Peter Zijlstra
  1 sibling, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2016-05-24 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Andi Kleen, lkml,
	Oleg Nesterov, stable

On Tue, May 24, 2016 at 1:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, May 23, 2016 at 08:57:05PM -0700, Andy Lutomirski wrote:
>> Forcing in_interrupt() to return true if we're not in a bona fide
>> interrupt confuses the softirq code.  This fixes warnings like:
>>
>> NOHZ: local_softirq_pending 282
>>
>> that can happen when running things like selftests/x86.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 959274753857 ("x86, traps: Track entry into and exit from IST context")
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
>> +/*
>> + * We want to cause in_atomic() to return true while in an IST handler
>> + * so that attempts to schedule will warn.
>> + *
>> + * We cannot use HARDIRQ_OFFSET or otherwise cause in_interrupt() to
>> + * return true: the softirq code assumes that in_interrupt() only
>> + * returns true if we will soon execute softirqs, and we can't do that
>> + * if an IST entry interrupts kernel code with interrupts disabled.
>> + *
>> + * Using 3 * PREEMPT_OFFSET instead of just PREEMPT_OFFSET is pure
>> + * paranoia.
>> + */
>> +#define IST_OFFSET (3 * PREEMPT_OFFSET)
>
> So this has implications for code like
> kernel/events/internal.h:get_recursion_context() and
> kernel/trace/trace.c:get_trace_buf().
>
> Which use a sequence of: in_nmi(), in_irq(), in_softirq() to pick 1 out
> of 4 possible contexts.
>
> I would really like the Changelog to reflect on this. The current code
> will have ISTs land in in_irq(), with this change, not so much.

I can change the changelog.

>
> Now ISTs as a whole invalidate the whole 'we have only 4 contexts' and
> the mapping back to those 4 is going to be somewhat arbitrary I suspect,
> but changes like this should be very much aware of these things. And
> make an explicit choice.

I'm not so comfortable with trying to make any particular guarantees
about what all the in_xyz() things will return for different entry
types and how they nest.  For example, debug can nest inside itself
quite easily (at one point I even had a user program to force it to
happen) -- this can trigger when a watchpoint nests inside a
single-step trap, and it can also happen when a watchpoint handler is
interrupted by an NMI that then recursively triggers the watchpoint.
The latter could easily result in nested NMIs that are directly
visible to the trace or event code.

On x86, there's also MCE, which is NMI-ish, and NMI and MCE can freely
nest inside each other.  (Blech.)

Would it make more sense to adjust the trace code to have a percpu
nesting count and to match up get_trace_buf with put_trace_buf to
decrement the count?  The event code looks like the same thing could
happen.

Also, on further reflection, I'm going to get rid of the 3* hack.

--Andy


* Re: [PATCH] x86/traps: Don't force in_interrupt() to return true in IST handlers
  2016-05-24 15:43   ` Andy Lutomirski
@ 2016-05-24 17:09     ` Peter Zijlstra
  2016-05-24 18:54       ` Andy Lutomirski
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2016-05-24 17:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Andi Kleen, lkml,
	Oleg Nesterov, stable

On Tue, May 24, 2016 at 08:43:57AM -0700, Andy Lutomirski wrote:
> > So this has implications for code like
> > kernel/events/internal.h:get_recursion_context() and
> > kernel/trace/trace.c:get_trace_buf().
> >
> > Which use a sequence of: in_nmi(), in_irq(), in_softirq() to pick 1 out
> > of 4 possible contexts.
> >
> > I would really like the Changelog to reflect on this. The current code
> > will have ISTs land in in_irq(), with this change, not so much.
> 
> I can change the changelog.

This is basically all I'm asking.

> > Now ISTs as a whole invalidate the whole 'we have only 4 contexts' and
> > the mapping back to those 4 is going to be somewhat arbitrary I suspect,
> > but changes like this should be very much aware of these things. And
> > make an explicit choice.
> 
> I'm not so comfortable with trying to make any particular guarantees
> about what all the in_xyz() things will return for different entry
> types and how they nest.  For example, debug can nest inside itself
> quite easily (at one point I even had a user program to force it to
> happen) -- this can trigger when a watchpoint nests inside a
> single-step trap, and it can also happen when a watchpoint handler is
> interrupted by an NMI that then recursively triggers the watchpoint.
> The latter could easily result in nested NMIs that are directly
> visible to the trace or event code.
> 
> On x86, there's also MCE, which is NMI-ish, and NMI and MCE can freely
> nest inside each other.  (Blech.)

Right; all the nesting possibilities are endless and insane. And I don't
think it makes sense to try and enumerate them all and, worse, expose
this to generic code.

I think having the 4: task, softirq, hardirq, nmi is fine. We just need
to be a little careful with how we map to them, that we don't wreck some
common situation and suddenly start losing trace data.

> Would it make more sense to adjust the trace code to have a percpu
> nesting count and to match up get_trace_buf with put_trace_buf to
> decrement the count?  The event code looks like the same thing could
> happen.

We have, but it's coupled with static storage, such as to avoid having to
do it on stack or *horror* dynamically allocate.

So in order to protect these resources we have the per-context nesting
count.

> Also, on further reflection, I'm going to get rid of the 3* hack.

Makes sense.


* Re: [PATCH] x86/traps: Don't force in_interrupt() to return true in IST handlers
  2016-05-24 17:09     ` Peter Zijlstra
@ 2016-05-24 18:54       ` Andy Lutomirski
  0 siblings, 0 replies; 7+ messages in thread
From: Andy Lutomirski @ 2016-05-24 18:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, X86 ML, Borislav Petkov, Andi Kleen,
	Oleg Nesterov, stable, linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2592 bytes --]

On Tue, May 24, 2016 at 10:09 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 24, 2016 at 08:43:57AM -0700, Andy Lutomirski wrote:
>> > So this has implications for code like
>> > kernel/events/internal.h:get_recursion_context() and
>> > kernel/trace/trace.c:get_trace_buf().
>> >
>> > Which use a sequence of: in_nmi(), in_irq(), in_softirq() to pick 1 out
>> > of 4 possible contexts.
>> >
>> > I would really like the Changelog to reflect on this. The current code
>> > will have ISTs land in in_irq(), with this change, not so much.
>>
>> I can change the changelog.
>
> This is basically all I'm asking.
>
>> > Now ISTs as a whole invalidate the whole 'we have only 4 contexts' and
>> > the mapping back to those 4 is going to be somewhat arbitrary I suspect,
>> > but changes like this should be very much aware of these things. And
>> > make an explicit choice.
>>
>> I'm not so comfortable with trying to make any particular guarantees
>> about what all the in_xyz() things will return for different entry
>> types and how they nest.  For example, debug can nest inside itself
>> quite easily (at one point I even had a user program to force it to
>> happen) -- this can trigger when a watchpoint nests inside a
>> single-step trap, and it can also happen when a watchpoint handler is
>> interrupted by an NMI that then recursively triggers the watchpoint.
>> The latter could easily result in nested NMIs that are directly
>> visible to the trace or event code.
>>
>> On x86, there's also MCE, which is NMI-ish, and NMI and MCE can freely
>> nest inside each other.  (Blech.)
>
> Right; all the nesting possibilities are endless and insane. And I don't
> think it makes sense to try and enumerate them all and, worse, expose
> this to generic code.
>
> I think having the 4: task, softirq, hardirq, nmi is fine. We just need
> to be a little careful with how we map to them, that we don't wreck some
> common situation and suddenly start losing trace data.
>
>> Would it make more sense to adjust the trace code to have a percpu
>> nesting count and to match up get_trace_buf with put_trace_buf to
>> decrement the count?  The event code looks like the same thing could
>> happen.
>
> We have, but it's coupled with static storage, such as to avoid having to
> do it on stack or *horror* dynamically allocate.
>
> So in order to protect these resources we have the per-context nesting
> count.


I gave it a shot and it looks straightforward and is a net deletion of
code.  It's attached and compile-tested only.  Is there a reason you
don't like this approach?

--Andy

[-- Attachment #2: trace.diff --]
[-- Type: text/plain, Size: 3887 bytes --]

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8a4b49c99f22..acd366e4ab1a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1986,83 +1986,41 @@ static void __trace_userstack(struct trace_array *tr, unsigned long flags)
 
 /* created for use with alloc_percpu */
 struct trace_buffer_struct {
-	char buffer[TRACE_BUF_SIZE];
+	int nesting;
+	char buffer[4][TRACE_BUF_SIZE];
 };
 
 static struct trace_buffer_struct *trace_percpu_buffer;
-static struct trace_buffer_struct *trace_percpu_sirq_buffer;
-static struct trace_buffer_struct *trace_percpu_irq_buffer;
-static struct trace_buffer_struct *trace_percpu_nmi_buffer;
 
 /*
- * The buffer used is dependent on the context. There is a per cpu
- * buffer for normal context, softirq contex, hard irq context and
- * for NMI context. Thise allows for lockless recording.
- *
- * Note, if the buffers failed to be allocated, then this returns NULL
+ * This allows for lockless recording.  If we're nested too deeply, then
+ * this returns NULL.
  */
 static char *get_trace_buf(void)
 {
-	struct trace_buffer_struct *percpu_buffer;
-
-	/*
-	 * If we have allocated per cpu buffers, then we do not
-	 * need to do any locking.
-	 */
-	if (in_nmi())
-		percpu_buffer = trace_percpu_nmi_buffer;
-	else if (in_irq())
-		percpu_buffer = trace_percpu_irq_buffer;
-	else if (in_softirq())
-		percpu_buffer = trace_percpu_sirq_buffer;
-	else
-		percpu_buffer = trace_percpu_buffer;
+	struct trace_buffer_struct *buffer = this_cpu_ptr(trace_percpu_buffer);
 
-	if (!percpu_buffer)
+	if (!buffer || buffer->nesting >= 4)
 		return NULL;
 
-	return this_cpu_ptr(&percpu_buffer->buffer[0]);
+	return &buffer->buffer[buffer->nesting++][0];
+}
+
+static void put_trace_buf(void)
+{
+	this_cpu_dec(trace_percpu_buffer->nesting);
 }
 
 static int alloc_percpu_trace_buffer(void)
 {
 	struct trace_buffer_struct *buffers;
-	struct trace_buffer_struct *sirq_buffers;
-	struct trace_buffer_struct *irq_buffers;
-	struct trace_buffer_struct *nmi_buffers;
 
 	buffers = alloc_percpu(struct trace_buffer_struct);
-	if (!buffers)
-		goto err_warn;
-
-	sirq_buffers = alloc_percpu(struct trace_buffer_struct);
-	if (!sirq_buffers)
-		goto err_sirq;
-
-	irq_buffers = alloc_percpu(struct trace_buffer_struct);
-	if (!irq_buffers)
-		goto err_irq;
-
-	nmi_buffers = alloc_percpu(struct trace_buffer_struct);
-	if (!nmi_buffers)
-		goto err_nmi;
+	if (WARN(!buffers, "Could not allocate percpu trace_printk buffer"))
+		return -ENOMEM;
 
 	trace_percpu_buffer = buffers;
-	trace_percpu_sirq_buffer = sirq_buffers;
-	trace_percpu_irq_buffer = irq_buffers;
-	trace_percpu_nmi_buffer = nmi_buffers;
-
 	return 0;
-
- err_nmi:
-	free_percpu(irq_buffers);
- err_irq:
-	free_percpu(sirq_buffers);
- err_sirq:
-	free_percpu(buffers);
- err_warn:
-	WARN(1, "Could not allocate percpu trace_printk buffer");
-	return -ENOMEM;
 }
 
 static int buffers_allocated;
@@ -2153,7 +2111,7 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
 	tbuffer = get_trace_buf();
 	if (!tbuffer) {
 		len = 0;
-		goto out;
+		goto out_nobuffer;
 	}
 
 	len = vbin_printf((u32 *)tbuffer, TRACE_BUF_SIZE/sizeof(int), fmt, args);
@@ -2179,6 +2137,9 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
 	}
 
 out:
+	put_trace_buf();
+
+out_nobuffer:
 	preempt_enable_notrace();
 	unpause_graph_tracing();
 
@@ -2210,7 +2171,7 @@ __trace_array_vprintk(struct ring_buffer *buffer,
 	tbuffer = get_trace_buf();
 	if (!tbuffer) {
 		len = 0;
-		goto out;
+		goto out_nobuffer;
 	}
 
 	len = vscnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args);
@@ -2229,7 +2190,11 @@ __trace_array_vprintk(struct ring_buffer *buffer,
 		__buffer_unlock_commit(buffer, event);
 		ftrace_trace_stack(&global_trace, buffer, flags, 6, pc, NULL);
 	}
- out:
+
+out:
+	put_trace_buf();
+
+out_nobuffer:
 	preempt_enable_notrace();
 	unpause_graph_tracing();
 

