* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
[not found] ` <1234866453.4744.58.camel@laptop>
@ 2009-02-17 11:26 ` Nick Piggin
2009-02-17 11:48 ` Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 49+ messages in thread
From: Nick Piggin @ 2009-02-17 11:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, Jens Axboe, Suresh Siddha, Linus Torvalds,
Paul E. McKenney, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
cc linux-arch
On Tue, Feb 17, 2009 at 11:27:33AM +0100, Peter Zijlstra wrote:
> > But hmm, why even bother with all this complexity? Why not just
> > remove the outer loop completely? Do the lock and the list_replace_init
> > unconditionally. It would turn tricky lockless code into simple locked
> > code... we've already taken an interrupt anyway, so chances are pretty
> > high that we have work here to do, right?
>
> Well, that's a practical suggestion, and I agree.
How's this?
--
Simplify the barriers in generic remote function call interrupt code.
Firstly, just unconditionally take the lock and check the list in the
generic_call_function_single_interrupt IPI handler. As we've just taken
an IPI here, the chances are fairly high that there will be work on the
list for us, so do the locking unconditionally. This removes the tricky
lockless list_empty check and dubious barriers. The change looks bigger
than it is because it is just removing an outer loop.
Secondly, clarify architecture-specific IPI locking rules. Generic code
has no tools to impose any sane ordering on IPIs if they go outside
normal cache coherency, ergo the arch code must make them appear to
obey cache coherency as a "memory operation" to initiate an IPI, and
a "memory operation" to receive one. This way at least they can be
reasoned about in generic code, and smp_mb() used to provide ordering.
The combination of these two changes means that explicit barriers can
be taken out of queue handling for the single case -- shared data is
explicitly locked, and IPI ordering must conform to that, so no
barriers are needed. An extra barrier is needed in the many handler,
so as to ensure we load the list element after the IPI is received.
Does any architecture actually need barriers? For the initiator I
could see it, but for the handler I would be surprised. The other
thing we could do for simplicity is just to require a full barrier
before generating an IPI and after receiving one. We can't just do
that in generic code without auditing architectures. There have been
subtle hangs here on some archs in the past.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
kernel/smp.c | 83 +++++++++++++++++++++++++++++++----------------------------
1 file changed, 44 insertions(+), 39 deletions(-)
Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -74,9 +74,16 @@ static void generic_exec_single(int cpu,
spin_unlock_irqrestore(&dst->lock, flags);
/*
- * Make the list addition visible before sending the ipi.
+ * The list addition should be visible before the IPI handler locks
+ * the list to pull the entry off it, because of the normal cache
+ * coherency rules implied by spinlocks.
+ *
+ * If IPIs can go out of order with respect to the cache coherency
+ * protocol on an architecture, sufficient synchronisation should be
+ * added to arch code to make it appear to obey cache coherency WRT
+ * locking and barrier primitives. Generic code isn't really equipped
+ * to do the right thing...
*/
- smp_mb();
if (ipi)
arch_send_call_function_single_ipi(cpu);
@@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt
int cpu = get_cpu();
/*
+ * Ensure entry is visible on call_function_queue after we have
+ * entered the IPI. See comment in smp_call_function_many.
+ * If we don't have this, then we may miss an entry on the list
+ * and never get another IPI to process it.
+ */
+ smp_mb();
+
+ /*
* It's ok to use list_for_each_rcu() here even though we may delete
* 'pos', since list_del_rcu() doesn't clear ->next
*/
@@ -154,49 +169,37 @@ void generic_smp_call_function_single_in
{
struct call_single_queue *q = &__get_cpu_var(call_single_queue);
LIST_HEAD(list);
+ unsigned int data_flags;
- /*
- * Need to see other stores to list head for checking whether
- * list is empty without holding q->lock
- */
- smp_read_barrier_depends();
- while (!list_empty(&q->list)) {
- unsigned int data_flags;
-
- spin_lock(&q->lock);
- list_replace_init(&q->list, &list);
- spin_unlock(&q->lock);
-
- while (!list_empty(&list)) {
- struct call_single_data *data;
-
- data = list_entry(list.next, struct call_single_data,
- list);
- list_del(&data->list);
+ spin_lock(&q->lock);
+ list_replace_init(&q->list, &list);
+ spin_unlock(&q->lock);
- /*
- * 'data' can be invalid after this call if
- * flags == 0 (when called through
- * generic_exec_single(), so save them away before
- * making the call.
- */
- data_flags = data->flags;
+ while (!list_empty(&list)) {
+ struct call_single_data *data;
- data->func(data->info);
+ data = list_entry(list.next, struct call_single_data,
+ list);
+ list_del(&data->list);
- if (data_flags & CSD_FLAG_WAIT) {
- smp_wmb();
- data->flags &= ~CSD_FLAG_WAIT;
- } else if (data_flags & CSD_FLAG_LOCK) {
- smp_wmb();
- data->flags &= ~CSD_FLAG_LOCK;
- } else if (data_flags & CSD_FLAG_ALLOC)
- kfree(data);
- }
/*
- * See comment on outer loop
+ * 'data' can be invalid after this call if
+ * flags == 0 (when called through
+ * generic_exec_single()), so save them away before
+ * making the call.
*/
- smp_read_barrier_depends();
+ data_flags = data->flags;
+
+ data->func(data->info);
+
+ if (data_flags & CSD_FLAG_WAIT) {
+ smp_wmb();
+ data->flags &= ~CSD_FLAG_WAIT;
+ } else if (data_flags & CSD_FLAG_LOCK) {
+ smp_wmb();
+ data->flags &= ~CSD_FLAG_LOCK;
+ } else if (data_flags & CSD_FLAG_ALLOC)
+ kfree(data);
}
}
@@ -375,6 +378,8 @@ void smp_call_function_many(const struct
/*
* Make the list addition visible before sending the ipi.
+ * (IPIs must obey or appear to obey normal Linux cache coherency
+ * rules -- see comment in generic_exec_single).
*/
smp_mb();
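For reference, the ordering contract described above, sketched from that
era's kernel/smp.c (paraphrased and abridged, so the names here are
approximate rather than authoritative):
	/* initiator: smp_call_function_many() */
	list_add_tail_rcu(&data->csd.list, &call_function_queue);
	smp_mb();	/* list addition visible before the IPI */
	arch_send_call_function_ipi(mask);
	/* handler: generic_smp_call_function_interrupt() */
	smp_mb();	/* entry visible after taking the IPI */
	list_for_each_entry_rcu(data, &call_function_queue, csd.list) {
		/* ... process and list_del_rcu() the entry ... */
	}
Both barriers rely on the IPI itself obeying, or being made to appear
to obey, normal cache coherency, as the comments above spell out.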
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 11:26 ` Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many()) Nick Piggin
@ 2009-02-17 11:48 ` Peter Zijlstra
2009-02-17 15:51 ` Paul E. McKenney
` (2 subsequent siblings)
3 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2009-02-17 11:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Oleg Nesterov, Jens Axboe, Suresh Siddha, Linus Torvalds,
Paul E. McKenney, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On Tue, 2009-02-17 at 12:26 +0100, Nick Piggin wrote:
> Simplify the barriers in generic remote function call interrupt code.
>
> Firstly, just unconditionally take the lock and check the list in the
> generic_call_function_single_interrupt IPI handler. As we've just taken
> an IPI here, the chances are fairly high that there will be work on the
> list for us, so do the locking unconditionally. This removes the tricky
> lockless list_empty check and dubious barriers. The change looks bigger
> than it is because it is just removing an outer loop.
>
> Secondly, clarify architecture-specific IPI locking rules. Generic code
> has no tools to impose any sane ordering on IPIs if they go outside
> normal cache coherency, ergo the arch code must make them appear to
> obey cache coherency as a "memory operation" to initiate an IPI, and
> a "memory operation" to receive one. This way at least they can be
> reasoned about in generic code, and smp_mb() used to provide ordering.
>
> The combination of these two changes means that explicit barriers can
> be taken out of queue handling for the single case -- shared data is
> explicitly locked, and IPI ordering must conform to that, so no
> barriers are needed. An extra barrier is needed in the many handler,
> so as to ensure we load the list element after the IPI is received.
>
> Does any architecture actually need barriers? For the initiator I
> could see it, but for the handler I would be surprised. The other
> thing we could do for simplicity is just to require a full barrier
> before generating an IPI and after receiving one. We can't just do
> that in generic code without auditing architectures. There have been
> subtle hangs here on some archs in the past.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
Looks sane, barring any funny arch details,
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> kernel/smp.c | 83 +++++++++++++++++++++++++++++++----------------------------
> 1 file changed, 44 insertions(+), 39 deletions(-)
>
> Index: linux-2.6/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/kernel/smp.c
> +++ linux-2.6/kernel/smp.c
> @@ -74,9 +74,16 @@ static void generic_exec_single(int cpu,
> spin_unlock_irqrestore(&dst->lock, flags);
>
> /*
> - * Make the list addition visible before sending the ipi.
> + * The list addition should be visible before the IPI handler locks
> + * the list to pull the entry off it, because of the normal cache
> + * coherency rules implied by spinlocks.
> + *
> + * If IPIs can go out of order with respect to the cache coherency
> + * protocol on an architecture, sufficient synchronisation should be
> + * added to arch code to make it appear to obey cache coherency WRT
> + * locking and barrier primitives. Generic code isn't really equipped
> + * to do the right thing...
> */
> - smp_mb();
>
> if (ipi)
> arch_send_call_function_single_ipi(cpu);
> @@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt
> int cpu = get_cpu();
>
> /*
> + * Ensure entry is visible on call_function_queue after we have
> + * entered the IPI. See comment in smp_call_function_many.
> + * If we don't have this, then we may miss an entry on the list
> + * and never get another IPI to process it.
> + */
> + smp_mb();
> +
> + /*
> * It's ok to use list_for_each_rcu() here even though we may delete
> * 'pos', since list_del_rcu() doesn't clear ->next
> */
> @@ -154,49 +169,37 @@ void generic_smp_call_function_single_in
> {
> struct call_single_queue *q = &__get_cpu_var(call_single_queue);
> LIST_HEAD(list);
> + unsigned int data_flags;
>
> - /*
> - * Need to see other stores to list head for checking whether
> - * list is empty without holding q->lock
> - */
> - smp_read_barrier_depends();
> - while (!list_empty(&q->list)) {
> - unsigned int data_flags;
> -
> - spin_lock(&q->lock);
> - list_replace_init(&q->list, &list);
> - spin_unlock(&q->lock);
> -
> - while (!list_empty(&list)) {
> - struct call_single_data *data;
> -
> - data = list_entry(list.next, struct call_single_data,
> - list);
> - list_del(&data->list);
> + spin_lock(&q->lock);
> + list_replace_init(&q->list, &list);
> + spin_unlock(&q->lock);
>
> - /*
> - * 'data' can be invalid after this call if
> - * flags == 0 (when called through
> - * generic_exec_single(), so save them away before
> - * making the call.
> - */
> - data_flags = data->flags;
> + while (!list_empty(&list)) {
> + struct call_single_data *data;
>
> - data->func(data->info);
> + data = list_entry(list.next, struct call_single_data,
> + list);
> + list_del(&data->list);
>
> - if (data_flags & CSD_FLAG_WAIT) {
> - smp_wmb();
> - data->flags &= ~CSD_FLAG_WAIT;
> - } else if (data_flags & CSD_FLAG_LOCK) {
> - smp_wmb();
> - data->flags &= ~CSD_FLAG_LOCK;
> - } else if (data_flags & CSD_FLAG_ALLOC)
> - kfree(data);
> - }
> /*
> - * See comment on outer loop
> + * 'data' can be invalid after this call if
> + * flags == 0 (when called through
> + * generic_exec_single()), so save them away before
> + * making the call.
> */
> - smp_read_barrier_depends();
> + data_flags = data->flags;
> +
> + data->func(data->info);
> +
> + if (data_flags & CSD_FLAG_WAIT) {
> + smp_wmb();
> + data->flags &= ~CSD_FLAG_WAIT;
> + } else if (data_flags & CSD_FLAG_LOCK) {
> + smp_wmb();
> + data->flags &= ~CSD_FLAG_LOCK;
> + } else if (data_flags & CSD_FLAG_ALLOC)
> + kfree(data);
> }
> }
>
> @@ -375,6 +378,8 @@ void smp_call_function_many(const struct
>
> /*
> * Make the list addition visible before sending the ipi.
> + * (IPIs must obey or appear to obey normal Linux cache coherency
> + * rules -- see comment in generic_exec_single).
> */
> smp_mb();
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 11:26 ` Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many()) Nick Piggin
2009-02-17 11:48 ` Peter Zijlstra
@ 2009-02-17 15:51 ` Paul E. McKenney
2009-02-18 2:15 ` Suresh Siddha
2009-02-17 19:28 ` Q: " Oleg Nesterov
2009-02-18 2:21 ` Suresh Siddha
3 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2009-02-17 15:51 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Oleg Nesterov, Jens Axboe, Suresh Siddha,
Linus Torvalds, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On Tue, Feb 17, 2009 at 12:26:57PM +0100, Nick Piggin wrote:
> cc linux-arch
>
> On Tue, Feb 17, 2009 at 11:27:33AM +0100, Peter Zijlstra wrote:
> > > But hmm, why even bother with all this complexity? Why not just
> > > remove the outer loop completely? Do the lock and the list_replace_init
> > > unconditionally. It would turn tricky lockless code into simple locked
> > > code... we've already taken an interrupt anyway, so chances are pretty
> > > high that we have work here to do, right?
> >
> > Well, that's a practical suggestion, and I agree.
>
> How's this?
>
> --
> Simplify the barriers in generic remote function call interrupt code.
>
> Firstly, just unconditionally take the lock and check the list in the
> generic_call_function_single_interrupt IPI handler. As we've just taken
> an IPI here, the chances are fairly high that there will be work on the
> list for us, so do the locking unconditionally. This removes the tricky
> lockless list_empty check and dubious barriers. The change looks bigger
> than it is because it is just removing an outer loop.
>
> Secondly, clarify architecture-specific IPI locking rules. Generic code
> has no tools to impose any sane ordering on IPIs if they go outside
> normal cache coherency, ergo the arch code must make them appear to
> obey cache coherency as a "memory operation" to initiate an IPI, and
> a "memory operation" to receive one. This way at least they can be
> reasoned about in generic code, and smp_mb() used to provide ordering.
>
> The combination of these two changes means that explicit barriers can
> be taken out of queue handling for the single case -- shared data is
> explicitly locked, and IPI ordering must conform to that, so no
> barriers are needed. An extra barrier is needed in the many handler,
> so as to ensure we load the list element after the IPI is received.
>
> Does any architecture actually need barriers? For the initiator I
> could see it, but for the handler I would be surprised. The other
> thing we could do for simplicity is just to require a full barrier
> before generating an IPI and after receiving one. We can't just do
> that in generic code without auditing architectures. There have been
> subtle hangs here on some archs in the past.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
>
> ---
> kernel/smp.c | 83 +++++++++++++++++++++++++++++++----------------------------
> 1 file changed, 44 insertions(+), 39 deletions(-)
>
> Index: linux-2.6/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/kernel/smp.c
> +++ linux-2.6/kernel/smp.c
> @@ -74,9 +74,16 @@ static void generic_exec_single(int cpu,
> spin_unlock_irqrestore(&dst->lock, flags);
>
> /*
> - * Make the list addition visible before sending the ipi.
> + * The list addition should be visible before the IPI handler locks
> + * the list to pull the entry off it, because of the normal cache
> + * coherency rules implied by spinlocks.
> + *
> + * If IPIs can go out of order with respect to the cache coherency
> + * protocol on an architecture, sufficient synchronisation should be
> + * added to arch code to make it appear to obey cache coherency WRT
> + * locking and barrier primitives. Generic code isn't really equipped
> + * to do the right thing...
> */
> - smp_mb();
>
> if (ipi)
> arch_send_call_function_single_ipi(cpu);
> @@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt
> int cpu = get_cpu();
>
> /*
> + * Ensure entry is visible on call_function_queue after we have
> + * entered the IPI. See comment in smp_call_function_many.
> + * If we don't have this, then we may miss an entry on the list
> + * and never get another IPI to process it.
> + */
> + smp_mb();
> +
> + /*
> * It's ok to use list_for_each_rcu() here even though we may delete
> * 'pos', since list_del_rcu() doesn't clear ->next
> */
> @@ -154,49 +169,37 @@ void generic_smp_call_function_single_in
> {
> struct call_single_queue *q = &__get_cpu_var(call_single_queue);
> LIST_HEAD(list);
> + unsigned int data_flags;
>
> - /*
> - * Need to see other stores to list head for checking whether
> - * list is empty without holding q->lock
> - */
> - smp_read_barrier_depends();
> - while (!list_empty(&q->list)) {
> - unsigned int data_flags;
> -
> - spin_lock(&q->lock);
> - list_replace_init(&q->list, &list);
> - spin_unlock(&q->lock);
> -
> - while (!list_empty(&list)) {
> - struct call_single_data *data;
> -
> - data = list_entry(list.next, struct call_single_data,
> - list);
> - list_del(&data->list);
> + spin_lock(&q->lock);
> + list_replace_init(&q->list, &list);
> + spin_unlock(&q->lock);
OK, I'll bite...
How do we avoid deadlock in the case where a pair of CPUs send to each
other concurrently?
Thanx, Paul
>
> - /*
> - * 'data' can be invalid after this call if
> - * flags == 0 (when called through
> - * generic_exec_single(), so save them away before
> - * making the call.
> - */
> - data_flags = data->flags;
> + while (!list_empty(&list)) {
> + struct call_single_data *data;
>
> - data->func(data->info);
> + data = list_entry(list.next, struct call_single_data,
> + list);
> + list_del(&data->list);
>
> - if (data_flags & CSD_FLAG_WAIT) {
> - smp_wmb();
> - data->flags &= ~CSD_FLAG_WAIT;
> - } else if (data_flags & CSD_FLAG_LOCK) {
> - smp_wmb();
> - data->flags &= ~CSD_FLAG_LOCK;
> - } else if (data_flags & CSD_FLAG_ALLOC)
> - kfree(data);
> - }
> /*
> - * See comment on outer loop
> + * 'data' can be invalid after this call if
> + * flags == 0 (when called through
> + * generic_exec_single()), so save them away before
> + * making the call.
> */
> - smp_read_barrier_depends();
> + data_flags = data->flags;
> +
> + data->func(data->info);
> +
> + if (data_flags & CSD_FLAG_WAIT) {
> + smp_wmb();
> + data->flags &= ~CSD_FLAG_WAIT;
> + } else if (data_flags & CSD_FLAG_LOCK) {
> + smp_wmb();
> + data->flags &= ~CSD_FLAG_LOCK;
> + } else if (data_flags & CSD_FLAG_ALLOC)
> + kfree(data);
> }
> }
>
> @@ -375,6 +378,8 @@ void smp_call_function_many(const struct
>
> /*
> * Make the list addition visible before sending the ipi.
> + * (IPIs must obey or appear to obey normal Linux cache coherency
> + * rules -- see comment in generic_exec_single).
> */
> smp_mb();
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 11:26 ` Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many()) Nick Piggin
2009-02-17 11:48 ` Peter Zijlstra
2009-02-17 15:51 ` Paul E. McKenney
@ 2009-02-17 19:28 ` Oleg Nesterov
2009-02-17 21:32 ` Paul E. McKenney
2009-02-18 2:21 ` Suresh Siddha
3 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2009-02-17 19:28 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Jens Axboe, Suresh Siddha, Linus Torvalds,
Paul E. McKenney, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On 02/17, Nick Piggin wrote:
>
> How's this?
To me, this patch makes the code much cleaner and more understandable.
And imho it is very good that it removes the smp_read_barrier_depends()s,
which (I think) were just wrong.
But I still have the question,
> Does any architecture actually need barriers? For the initiator I
> could see it, but for the handler I would be surprised. The other
> thing we could do for simplicity is just to require a full barrier
> before generating an IPI and after receiving one. We can't just do
> that in generic code without auditing architectures. There have been
> subtle hangs here on some archs in the past.
OK, so we add the barrier here:
> @@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt
> int cpu = get_cpu();
>
> /*
> + * Ensure entry is visible on call_function_queue after we have
> + * entered the IPI. See comment in smp_call_function_many.
> + * If we don't have this, then we may miss an entry on the list
> + * and never get another IPI to process it.
> + */
> + smp_mb();
But any arch which needs this barrier should also call mb() in, say,
the smp_reschedule_interrupt() path. Otherwise we can miss
TIF_NEED_RESCHED after returning from the handler.
So the question is: is there any arch which surely needs this barrier?
IOW,
	int COND;
	void smp_xxx_interrupt(regs)
	{
		BUG_ON(!COND);
	}
	COND = 1;
	mb();
	smp_send_xxx(cpu);
can we really hit the BUG_ON() above on some arch?
(but in any case I agree, it is better to be safe and add the barrier
like this patch does).
Oleg.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 19:28 ` Q: " Oleg Nesterov
@ 2009-02-17 21:32 ` Paul E. McKenney
2009-02-17 21:45 ` Oleg Nesterov
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2009-02-17 21:32 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Nick Piggin, Peter Zijlstra, Jens Axboe, Suresh Siddha,
Linus Torvalds, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On Tue, Feb 17, 2009 at 08:28:10PM +0100, Oleg Nesterov wrote:
> On 02/17, Nick Piggin wrote:
> >
> > How's this?
>
> To me, this patch makes the code much more clean/understandable.
>
> And imho it is very good it removes smp_read_barrier_depends()s
> which (I think) were just wrong.
>
>
> But I still have the question,
>
> > Does any architecture actually need barriers? For the initiator I
> > could see it, but for the handler I would be surprised. The other
> > thing we could do for simplicity is just to require a full barrier
> > before generating an IPI and after receiving one. We can't just do
> > that in generic code without auditing architectures. There have been
> > subtle hangs here on some archs in the past.
>
> OK, so we add the barrier here:
>
> > @@ -104,6 +111,14 @@ void generic_smp_call_function_interrupt
> > int cpu = get_cpu();
> >
> > /*
> > + * Ensure entry is visible on call_function_queue after we have
> > + * entered the IPI. See comment in smp_call_function_many.
> > + * If we don't have this, then we may miss an entry on the list
> > + * and never get another IPI to process it.
> > + */
> > + smp_mb();
>
> But, any arch which needs this barrier should also call mb() in, say,
> smp_reschedule_interrupt() path. Otherwise we can miss TIF_NEED_RESCHED
> after return from the handler.
>
> So the question is: is there any arch which surely needs this barrier?
>
> IOW,
> 	int COND;
>
> 	void smp_xxx_interrupt(regs)
> 	{
> 		BUG_ON(!COND);
> 	}
>
> 	COND = 1;
> 	mb();
> 	smp_send_xxx(cpu);
>
> can we really hit the BUG_ON() above on some arch?
If all of the above is executed by the same task, tripping the BUG_ON()
means either a compiler or CPU bug.
Thanx, Paul
> (but in any case I agree, it is better to be safe and add the barrier
> like this patch does).
>
> Oleg.
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 21:32 ` Paul E. McKenney
@ 2009-02-17 21:45 ` Oleg Nesterov
2009-02-17 22:39 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2009-02-17 21:45 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Nick Piggin, Peter Zijlstra, Jens Axboe, Suresh Siddha,
Linus Torvalds, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On 02/17, Paul E. McKenney wrote:
>
> On Tue, Feb 17, 2009 at 08:28:10PM +0100, Oleg Nesterov wrote:
> >
> > So the question is: is there any arch which surely needs this barrier?
> >
> > IOW,
> > 	int COND;
> >
> > 	void smp_xxx_interrupt(regs)
> > 	{
> > 		BUG_ON(!COND);
> > 	}
> >
> > 	COND = 1;
> > 	mb();
> > 	smp_send_xxx(cpu);
> >
> > can we really hit the BUG_ON() above on some arch?
>
> If all of the above is executed by the same task, tripping the BUG_ON()
> means either a compiler or CPU bug.
I think you misunderstood...
smp_send_xxx() sends the ipi to another CPU, and smp_xxx_interrupt() is
the handler.
Oleg.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 21:45 ` Oleg Nesterov
@ 2009-02-17 22:39 ` Paul E. McKenney
2009-02-18 13:52 ` Nick Piggin
0 siblings, 1 reply; 49+ messages in thread
From: Paul E. McKenney @ 2009-02-17 22:39 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Nick Piggin, Peter Zijlstra, Jens Axboe, Suresh Siddha,
Linus Torvalds, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On Tue, Feb 17, 2009 at 10:45:18PM +0100, Oleg Nesterov wrote:
> On 02/17, Paul E. McKenney wrote:
> >
> > On Tue, Feb 17, 2009 at 08:28:10PM +0100, Oleg Nesterov wrote:
> > >
> > > So the question is: is there any arch which surely needs this barrier?
> > >
> > > IOW,
> > > 	int COND;
> > >
> > > 	void smp_xxx_interrupt(regs)
> > > 	{
> > > 		BUG_ON(!COND);
> > > 	}
> > >
> > > 	COND = 1;
> > > 	mb();
> > > 	smp_send_xxx(cpu);
> > >
> > > can we really hit the BUG_ON() above on some arch?
> >
> > If all of the above is executed by the same task, tripping the BUG_ON()
> > means either a compiler or CPU bug.
>
> I think you misunderstood...
>
> smp_send_xxx() sends the ipi to another CPU, and smp_xxx_interrupt() is
> the handler.
You are right, I did miss that completely. :-/
I have seen hardware in which the IPI could beat the cache invalidation
from the sending CPU to the interrupted CPU, and in which one or both of
the CPUs would have to execute special cache-flush/invalidation/whatever
instructions for the interrupted CPU to have a consistent view of the
data (in your example, "COND").
But we had a little chat with the hardware designers, and in subsequent
hardware, the IPI interacted with the cache-coherence protocol so as to
prevent the above bug from firing. However, this was x86-based hardware,
which orders writes. Weakly ordered systems would likely need a memory
barrier somewhere, whether as shown above or buried in the smp_send_xxx()
primitive.
Thanx, Paul
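Restating Oleg's example for the weakly ordered case described above
(smp_send_xxx() and smp_xxx_interrupt() remain placeholder names, not
real primitives): with a sender-side barrier, and an IPI that obeys --
or is made to appear to obey -- cache coherency, the BUG_ON() cannot
fire:
	int COND;
	/* CPU 0 (sender) */
	COND = 1;
	smp_mb();	/* push COND into the coherency domain first */
	smp_send_xxx(1);
	/* CPU 1, in smp_xxx_interrupt() */
	BUG_ON(!COND);	/* now guaranteed to see COND == 1 */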
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 15:51 ` Paul E. McKenney
@ 2009-02-18 2:15 ` Suresh Siddha
2009-02-18 2:40 ` Paul E. McKenney
0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2009-02-18 2:15 UTC (permalink / raw)
To: paulmck@linux.vnet.ibm.com
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Tue, 2009-02-17 at 07:51 -0800, Paul E. McKenney wrote:
> On Tue, Feb 17, 2009 at 12:26:57PM +0100, Nick Piggin wrote:
> > + spin_lock(&q->lock);
> > + list_replace_init(&q->list, &list);
> > + spin_unlock(&q->lock);
>
> OK, I'll bite...
>
> How do we avoid deadlock in the case where a pair of CPUs send to each
> other concurrently?
Sender takes the lock with interrupts disabled. That should prevent any
deadlock, right?
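A sketch of why, from the generic_exec_single() context quoted above
(the list_add_tail() line is reconstructed here for illustration, so
take the details as approximate):
	/* sender: generic_exec_single(), local interrupts disabled */
	spin_lock_irqsave(&dst->lock, flags);
	list_add_tail(&data->list, &dst->list);
	spin_unlock_irqrestore(&dst->lock, flags);
	/*
	 * No lock is held across the IPI send, and the handler only
	 * takes its own q->lock, so two CPUs signalling each other
	 * never wait on a lock while holding one the other needs.
	 */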
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 11:26 ` Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many()) Nick Piggin
` (2 preceding siblings ...)
2009-02-17 19:28 ` Q: " Oleg Nesterov
@ 2009-02-18 2:21 ` Suresh Siddha
2009-02-18 13:59 ` Nick Piggin
3 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2009-02-18 2:21 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Oleg Nesterov, Jens Axboe, Linus Torvalds,
Paul E. McKenney, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Tue, 2009-02-17 at 03:26 -0800, Nick Piggin wrote:
> --
> Simplify the barriers in generic remote function call interrupt code.
>
> Firstly, just unconditionally take the lock and check the list in the
> generic_call_function_single_interrupt IPI handler. As we've just taken
> an IPI here, the chances are fairly high that there will be work on the
> list for us, so do the locking unconditionally. This removes the tricky
> lockless list_empty check and dubious barriers. The change looks bigger
> than it is because it is just removing an outer loop.
>
> Secondly, clarify architecture-specific IPI locking rules. Generic code
> has no tools to impose any sane ordering on IPIs if they go outside
> normal cache coherency, ergo the arch code must make them appear to
> obey cache coherency as a "memory operation" to initiate an IPI, and
> a "memory operation" to receive one. This way at least they can be
> reasoned about in generic code, and smp_mb() used to provide ordering.
>
> The combination of these two changes means that explicit barriers can
> be taken out of queue handling for the single case -- shared data is
> explicitly locked, and IPI ordering must conform to that, so no
> barriers are needed. An extra barrier is needed in the many handler,
> so as to ensure we load the list element after the IPI is received.
>
> Does any architecture actually need barriers? For the initiator I
> could see it, but for the handler I would be surprised. The other
> thing we could do for simplicity is just to require a full barrier
> before generating an IPI and after receiving one. We can't just do
> that in generic code without auditing architectures. There have been
> subtle hangs here on some archs in the past.
x2apic register reads/writes don't have serializing semantics, as
opposed to uncached xapic accesses, which are inherently serializing.
With this patch, we need to fix the corresponding x2apic IPI operations.
I will take a look at it.
thanks,
suresh
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 2:15 ` Suresh Siddha
@ 2009-02-18 2:40 ` Paul E. McKenney
0 siblings, 0 replies; 49+ messages in thread
From: Paul E. McKenney @ 2009-02-18 2:40 UTC (permalink / raw)
To: Suresh Siddha
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Tue, Feb 17, 2009 at 06:15:31PM -0800, Suresh Siddha wrote:
> On Tue, 2009-02-17 at 07:51 -0800, Paul E. McKenney wrote:
> > On Tue, Feb 17, 2009 at 12:26:57PM +0100, Nick Piggin wrote:
> > > + spin_lock(&q->lock);
> > > + list_replace_init(&q->list, &list);
> > > + spin_unlock(&q->lock);
> >
> > OK, I'll bite...
> >
> > How do we avoid deadlock in the case where a pair of CPUs send to each
> > other concurrently?
>
> Sender takes the lock with interrupts-disabled. That should prevent any
> deadlock, right?
You are of course correct! Apologies for my confusion!!!
Thanx, Paul
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-17 22:39 ` Paul E. McKenney
@ 2009-02-18 13:52 ` Nick Piggin
2009-02-18 16:09 ` Linus Torvalds
0 siblings, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2009-02-18 13:52 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Oleg Nesterov, Peter Zijlstra, Jens Axboe, Suresh Siddha,
Linus Torvalds, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On Tue, Feb 17, 2009 at 02:39:10PM -0800, Paul E. McKenney wrote:
> On Tue, Feb 17, 2009 at 10:45:18PM +0100, Oleg Nesterov wrote:
> > > If all of the above is executed by the same task, tripping the BUG_ON()
> > > means either a compiler or CPU bug.
> >
> > I think you misunderstood...
> >
> > smp_send_xxx() sends the ipi to another CPU, and smp_xxx_interrupt() is
> > the handler.
>
> You are right, I did miss that completely. :-/
>
> I have seen hardware in which the IPI could beat the cache invalidation
> from the sending CPU to the interrupted CPU, and in which one or both of
> the CPUs would have to execute special cache-flush/invalidation/whatever
> instructions for the interrupted CPU to have a consistent view of the
> data (in your example, "COND").
>
> But we had a little chat with the hardware designers, and in subsequent
> hardware, the IPI interacted with the cache-coherence protocol so as to
> prevent the above bug from firing. However, this was x86-based hardware,
> which orders writes. Weakly ordered systems would likely need a memory
> barrier somewhere, whether as shown above or buried in the smp_send_xxx()
> primitive.
I agree with you both that we *should* make arch interrupt code
do the ordering, but given the subtle lockups on some architectures
in this new code, I didn't want to make it significantly weaker...
Though perhaps it appears that I have, if I have removed an smp_mb
that x86 was relying on to emit an mfence to serialise the apic.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 2:21 ` Suresh Siddha
@ 2009-02-18 13:59 ` Nick Piggin
2009-02-18 16:19 ` Linus Torvalds
2009-02-18 18:43 ` Suresh Siddha
0 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2009-02-18 13:59 UTC (permalink / raw)
To: Suresh Siddha
Cc: Peter Zijlstra, Oleg Nesterov, Jens Axboe, Linus Torvalds,
Paul E. McKenney, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Tue, Feb 17, 2009 at 06:21:42PM -0800, Suresh Siddha wrote:
> On Tue, 2009-02-17 at 03:26 -0800, Nick Piggin wrote:
> > --
> > Simplify the barriers in generic remote function call interrupt code.
> >
> > Firstly, just unconditionally take the lock and check the list in the
> > generic_call_function_single_interrupt IPI handler. As we've just taken
> > an IPI here, the chances are fairly high that there will be work on the
> > list for us, so do the locking unconditionally. This removes the tricky
> > lockless list_empty check and dubious barriers. The change looks bigger
> > than it is because it is just removing an outer loop.
> >
> > Secondly, clarify architecture-specific IPI locking rules. Generic code
> > has no tools to impose any sane ordering on IPIs if they go outside
> > normal cache coherency, ergo the arch code must make them appear to
> > obey cache coherency as a "memory operation" to initiate an IPI, and
> > a "memory operation" to receive one. This way at least they can be
> > reasoned about in generic code, and smp_mb() used to provide ordering.
> >
> > The combination of these two changes means that explicit barriers can
> > be taken out of queue handling for the single case -- shared data is
> > explicitly locked, and IPI ordering must conform to that, so no
> > barriers are needed. An extra barrier is needed in the many handler,
> > so as to ensure we load the list element after the IPI is received.
> >
> > Does any architecture actually need barriers? For the initiator I
> > could see it, but for the handler I would be surprised. The other
> > thing we could do for simplicity is just to require a full barrier
> > before generating an IPI and after receiving one. We can't just do
> > that in generic code without auditing architectures. There have been
> > subtle hangs here on some archs in the past.
>
> x2apic register reads/writes don't have serializing semantics, as
> opposed to uncached xapic accesses, which are inherently serializing.
>
> With this patch, we need to fix the corresponding x2apic IPI operations.
> I will take a look at it.
You're saying the problem is in generic_exec_single because I've
removed the smp_mb that inadvertently also serialises memory with
the x2apic on x86?
Indeed that could cause problems on some architectures, which I
had hoped to avoid. So the patch is probably better off first
adding the smp_mb() to the arch_send_call_function_xxx arch code,
unless it is immediately obvious or confirmed by the arch
maintainer that such a barrier is not required.
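Something like the following, say (a sketch only: the low-level send
primitive is hypothetical, and the right barrier for each architecture
is for its maintainer to decide):
	/* arch code: keep the ordering the old generic smp_mb() gave us */
	void arch_send_call_function_single_ipi(int cpu)
	{
		smp_mb();	/* queue stores visible before the IPI */
		__send_ipi(cpu, CALL_FUNCTION_SINGLE_VECTOR);	/* hypothetical */
	}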
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 13:52 ` Nick Piggin
@ 2009-02-18 16:09 ` Linus Torvalds
2009-02-18 16:21 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-02-18 16:09 UTC (permalink / raw)
To: Nick Piggin
Cc: Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Suresh Siddha, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On Wed, 18 Feb 2009, Nick Piggin wrote:
>
> I agree with you both that we *should* make arch interrupt code
> do the ordering, but given the subtle lockups on some architectures
> in this new code, I didn't want to make it significantly weaker...
>
> Though perhaps it appears that I have, if I have removed an smp_mb
> that x86 was relying on to emit an mfence to serialise the apic.
The thing is, if the architecture doesn't order IPI wrt cache coherency,
then the "smp_mb()" doesn't really do so _either_.
It might hide some architecture-specific implementation issue, of course,
so random amounts of "smp_mb()"s sprinkled around might well make some
architecture "work", but it's in no way guaranteed. A smp_mb() does not
guarantee that some separate IPI network is ordered - that may well take
some random machine-specific IO cycle.
That said, at least on x86, taking an interrupt should be a serializing
event, so there should be no reason for anything on the receiving side.
The _sending_ side might need to make sure that there is serialization
when generating the IPI (so that the IPI cannot happen while the writes
are still in some per-CPU write buffer and haven't become part of the
cache coherency domain).
And at least on x86 it's actually pretty hard to generate out-of-order
accesses to begin with (_regardless_ of any issues external to the CPU).
You have to work at it, and use a WC memory area, and I'm pretty sure we
use UC for the apic accesses.
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 13:59 ` Nick Piggin
@ 2009-02-18 16:19 ` Linus Torvalds
2009-02-18 16:23 ` Ingo Molnar
2009-02-18 18:43 ` Suresh Siddha
1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-02-18 16:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Suresh Siddha, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Paul E. McKenney, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Wed, 18 Feb 2009, Nick Piggin wrote:
> >
> > x2apic register reads/writes don't have serializing semantics, as
> > opposed to uncached xapic accesses, which are inherently serializing.
> >
> > With this patch, we need to fix the corresponding x2apic IPI operations.
> > I will take a look at it.
>
> You're saying the problem is in generic_exec_single because I've
> removed the smp_mb that inadvertently also serialises memory with
> the x2apic on x86?
I think Suresh is wrong on this.
The x2apic is using "wrmsr" to write events, and that's a serializing
instruction.
I really don't know of any way to get unordered information out of a x86
core, except for playing games with WC memory, and WC memory would not be
appropriate for something like an interrupt controller.
Of course, it's possible that Intel made the x2apic MSR's magic, and that
they don't serialize, but that's very much against some very explicit
Intel documentation. wrmsr is one of the (few) instructions that is
mentioned all over the documentation as being serializing.
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:09 ` Linus Torvalds
@ 2009-02-18 16:21 ` Ingo Molnar
2009-02-18 16:21 ` Ingo Molnar
` (2 more replies)
2009-02-19 0:12 ` Nick Piggin
2009-02-19 6:47 ` Benjamin Herrenschmidt
2 siblings, 3 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 16:21 UTC (permalink / raw)
To: Linus Torvalds, Pallipadi, Venkatesh, Yinghai Lu
Cc: Nick Piggin, Paul E. McKenney, Oleg Nesterov, Peter Zijlstra,
Jens Axboe, Suresh Siddha, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 18 Feb 2009, Nick Piggin wrote:
> >
> > I agree with you both that we *should* make arch interrupt code
> > do the ordering, but given the subtle lockups on some architectures
> > in this new code, I didn't want to make it significantly weaker...
> >
> > Though perhaps it appears that I have, if I have removed an smp_mb
> > that x86 was relying on to emit an mfence to serialise the apic.
>
> The thing is, if the architecture doesn't order IPI wrt cache coherency,
> then the "smp_mb()" doesn't really do so _either_.
>
> It might hide some architecture-specific implementation issue, of course,
> so random amounts of "smp_mb()"s sprinkled around might well make some
> architecture "work", but it's in no way guaranteed. A smp_mb() does not
> guarantee that some separate IPI network is ordered - that may well take
> some random machine-specific IO cycle.
>
> That said, at least on x86, taking an interrupt should be a serializing
> event, so there should be no reason for anything on the receiving side.
> The _sending_ side might need to make sure that there is serialization
> when generating the IPI (so that the IPI cannot happen while the writes
> are still in some per-CPU write buffer and haven't become part of the
> cache coherency domain).
>
> And at least on x86 it's actually pretty hard to generate out-of-order
> accesses to begin with (_regardless_ of any issues external to the CPU).
> You have to work at it, and use a WC memory area, and I'm pretty sure we
> use UC for the apic accesses.
yeah, we do. I do remember one x2apic-related memory ordering
bug though, which happened in the past 6 months or so; lemme find
the commit.
This one is it:
d6f0f39: x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
attached below.
The reason for that is that x2apic changes the access sequence
from mmio (which old lapic used to be, and which was mapped UC),
to an MSR sequence:
static inline void native_x2apic_icr_write(u32 low, u32 id)
{
	wrmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), ((__u64) id) << 32 | low);
}
But ... WRMSR should already be serializing - it is documented
as a serializing instruction.
I've cc:-ed Suresh & other APIC experts - exactly what type of
hang did that patch fix? Do certain CPUs perhaps cut
serialization corners, to speed up x2apic accesses?
Ingo
------------------->
From d6f0f39b7d05e62b347c4352d070e4afb3ade4b5 Mon Sep 17 00:00:00 2001
From: Suresh Siddha <suresh.b.siddha@intel.com>
Date: Tue, 4 Nov 2008 13:53:04 -0800
Subject: [PATCH] x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
Impact: fix rare x2apic hang
On x86, x2apic mode accesses for sending IPIs don't have serializing
semantics. If the IPI receiver refers (in lock-free fashion) to some
memory set up by the sender, the need for smp_mb() before sending the
IPI becomes critical in x2apic mode.
Add the smp_mb() in native_flush_tlb_others() before sending the IPI.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/tlb_32.c | 6 ++++++
arch/x86/kernel/tlb_64.c | 5 +++++
2 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
index e00534b..f4049f3 100644
--- a/arch/x86/kernel/tlb_32.c
+++ b/arch/x86/kernel/tlb_32.c
@@ -154,6 +154,12 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
flush_mm = mm;
flush_va = va;
cpus_or(flush_cpumask, cpumask, flush_cpumask);
+
+ /*
+ * Make the above memory operations globally visible before
+ * sending the IPI.
+ */
+ smp_mb();
/*
* We have to send the IPI only to
* CPUs affected.
diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
index dcbf7a1..8f919ca 100644
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -183,6 +183,11 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask);
/*
+ * Make the above memory operations globally visible before
+ * sending the IPI.
+ */
+ smp_mb();
+ /*
* We have to send the IPI only to
* CPUs affected.
*/
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:21 ` Ingo Molnar
@ 2009-02-18 16:21 ` Ingo Molnar
2009-02-18 16:33 ` Linus Torvalds
2009-02-18 16:37 ` Gleb Natapov
2 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 16:21 UTC (permalink / raw)
To: Linus Torvalds, Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu
Cc: Nick Piggin, Paul E. McKenney, Oleg Nesterov, Peter Zijlstra,
Jens Axboe, Rusty Russell, Steven Rostedt, linux-kernel,
linux-arch
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 18 Feb 2009, Nick Piggin wrote:
> >
> > I agree with you both that we *should* make arch interrupt code
> > do the ordering, but given the subtle lockups on some architectures
> > in this new code, I didn't want to make it significantly weaker...
> >
> > Though perhaps it appears that I have, if I have removed an smp_mb
> > that x86 was relying on to emit an mfence to serialise the apic.
>
> The thing is, if the architecture doesn't order IPI wrt cache coherency,
> then the "smp_mb()" doesn't really do so _either_.
>
> It might hide some architecture-specific implementation issue, of course,
> so random amounts of "smp_mb()"s sprinkled around might well make some
> architecture "work", but it's in no way guaranteed. A smp_mb() does not
> guarantee that some separate IPI network is ordered - that may well take
> some random machine-specific IO cycle.
>
> That said, at least on x86, taking an interrupt should be a serializing
> event, so there should be no reason for anything on the receiving side.
> The _sending_ side might need to make sure that there is serialization
> when generating the IPI (so that the IPI cannot happen while the writes
> are still in some per-CPU write buffer and haven't become part of the
> cache coherency domain).
>
> And at least on x86 it's actually pretty hard to generate out-of-order
> accesses to begin with (_regardless_ of any issues external to the CPU).
> You have to work at it, and use a WC memory area, and I'm pretty sure we
> use UC for the apic accesses.
yeah, we do. I do remember one x2apic-related memory ordering
bug though, which happened in the past 6 months or so; lemme find
the commit.
This one is it:
d6f0f39: x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
attached below.
The reason for that is that x2apic changes the access sequence
from mmio (which old lapic used to be, and which was mapped UC),
to an MSR sequence:
static inline void native_x2apic_icr_write(u32 low, u32 id)
{
	wrmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), ((__u64) id) << 32 | low);
}
But ... WRMSR should already be serializing - it is documented
as a serializing instruction.
I've cc:-ed Suresh & other APIC experts - exactly what type of
hang did that patch fix? Do certain CPUs perhaps cut
serialization corners, to speed up x2apic accesses?
Ingo
------------------->
From d6f0f39b7d05e62b347c4352d070e4afb3ade4b5 Mon Sep 17 00:00:00 2001
From: Suresh Siddha <suresh.b.siddha@intel.com>
Date: Tue, 4 Nov 2008 13:53:04 -0800
Subject: [PATCH] x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
Impact: fix rare x2apic hang
On x86, x2apic mode accesses for sending IPIs don't have serializing
semantics. If the IPI receiver refers (in lock-free fashion) to some
memory set up by the sender, the need for smp_mb() before sending the
IPI becomes critical in x2apic mode.
Add the smp_mb() in native_flush_tlb_others() before sending the IPI.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/tlb_32.c | 6 ++++++
arch/x86/kernel/tlb_64.c | 5 +++++
2 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
index e00534b..f4049f3 100644
--- a/arch/x86/kernel/tlb_32.c
+++ b/arch/x86/kernel/tlb_32.c
@@ -154,6 +154,12 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
flush_mm = mm;
flush_va = va;
cpus_or(flush_cpumask, cpumask, flush_cpumask);
+
+ /*
+ * Make the above memory operations globally visible before
+ * sending the IPI.
+ */
+ smp_mb();
/*
* We have to send the IPI only to
* CPUs affected.
diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
index dcbf7a1..8f919ca 100644
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -183,6 +183,11 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask);
/*
+ * Make the above memory operations globally visible before
+ * sending the IPI.
+ */
+ smp_mb();
+ /*
* We have to send the IPI only to
* CPUs affected.
*/
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:19 ` Linus Torvalds
@ 2009-02-18 16:23 ` Ingo Molnar
0 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 16:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Suresh Siddha, Peter Zijlstra, Oleg Nesterov,
Jens Axboe, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 18 Feb 2009, Nick Piggin wrote:
> > >
> > > x2apic register reads/writes don't have serializing semantics, as
> > > opposed to uncached xapic accesses, which are inherently serializing.
> > >
> > > With this patch, we need to fix the corresponding x2apic IPI operations.
> > > I will take a look at it.
> >
> > You're saying the problem is in generic_exec_single because I've
> > removed the smp_mb that inadvertently also serialises memory with
> > the x2apic on x86?
>
> I think Suresh is wrong on this.
>
> The x2apic is using "wrmsr" to write events, and that's a
> serializing instruction.
>
> I really don't know of any way to get unordered information
> out of a x86 core, except for playing games with WC memory,
> and WC memory would not be appropriate for something like an
> interrupt controller.
>
> Of course, it's possible that Intel made the x2apic MSR's
> magic, and that they don't serialize, but that's very much
> against some very explicit Intel documentation. wrmsr is one
> of the (few) instructions that is mentioned all ove the
> documentation as being serializing.
heh, I just went through all those codepaths to figure out the
SMP ordering semantics. I didn't find anything but the MSR write,
so maybe the MSR writes did get weakened on certain CPUs.
Serializing is a serious performance penalty - and it would not
be totally out of the question to optimize x2APIC MSR accesses. If
that's the case, it's not quite nice to leave it undocumented, though.
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:21 ` Ingo Molnar
2009-02-18 16:21 ` Ingo Molnar
@ 2009-02-18 16:33 ` Linus Torvalds
2009-02-18 16:58 ` Ingo Molnar
2009-02-18 16:37 ` Gleb Natapov
2 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-02-18 16:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
On Wed, 18 Feb 2009, Ingo Molnar wrote:
>
> But ... WRMSR should already be serializing - it is documented
> as a serializing instruction.
Hmm. I was thinking about this some more, and I think I've come up with an
explanation.
"wrmsr" probably serializes _after_ doing the write. After all, it's
historically used for changing internal CPU state, so you want to do the
write, and then wait until the effects of the write are "stable" in the
core.
That would explain how x2apic can use both a serializing instruction
(wrmsr) and still effectively cause the IPI to happen out of sequence: the
IPI can reach the destination CPU before the source CPU has flushed its
store buffers, because the IPI is actually sent before serializing the
core.
But I would very strongly put this in the "x2apic code bug" column. If
this is a true issue (and your TLB patch does imply it is), then we should
just make sure that the x2apic IPI calls always do a 'sfence' before they
happen - regardless of whether they are for TLB flushes or for generic
kernel cross-calls, or for anything else.
Linus
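One possible shape for that fix (a sketch only: the fenced helper name
is invented here, while wrmsrl() and the MSR constants are as in the
native_x2apic_icr_write() snippet Ingo quoted earlier):
	static inline void x2apic_icr_write_fenced(u32 low, u32 id)
	{
		/*
		 * The x2apic MSR write is not serializing, so force prior
		 * stores into the coherency domain before it fires the IPI.
		 */
		asm volatile("sfence" ::: "memory");
		wrmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), ((__u64) id) << 32 | low);
	}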
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:21 ` Ingo Molnar
2009-02-18 16:21 ` Ingo Molnar
2009-02-18 16:33 ` Linus Torvalds
@ 2009-02-18 16:37 ` Gleb Natapov
2 siblings, 0 replies; 49+ messages in thread
From: Gleb Natapov @ 2009-02-18 16:37 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu,
Nick Piggin, Paul E. McKenney, Oleg Nesterov, Peter Zijlstra,
Jens Axboe, Rusty Russell, Steven Rostedt, linux-kernel,
linux-arch
On Wed, Feb 18, 2009 at 05:21:16PM +0100, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Wed, 18 Feb 2009, Nick Piggin wrote:
> > >
> > > I agree with you both that we *should* make arch interrupt code
> > > do the ordering, but given the subtle lockups on some architectures
> > > in this new code, I didn't want to make it significantly weaker...
> > >
> > > Though perhaps it appears that I have, if I have removed an smp_mb
> > > that x86 was relying on to emit an mfence to serialise the apic.
> >
> > The thing is, if the architecture doesn't order IPI wrt cache coherency,
> > then the "smp_mb()" doesn't really do so _either_.
> >
> > It might hide some architecture-specific implementation issue, of course,
> > so random amounts of "smp_mb()"s sprinkled around might well make some
> > architecture "work", but it's in no way guaranteed. A smp_mb() does not
> > guarantee that some separate IPI network is ordered - that may well take
> > some random machine-specific IO cycle.
> >
> > That said, at least on x86, taking an interrupt should be a serializing
> > event, so there should be no reason for anything on the receiving side.
> > The _sending_ side might need to make sure that there is serialization
> > when generating the IPI (so that the IPI cannot happen while the writes
> > are still in some per-CPU write buffer and haven't become part of the
> > cache coherency domain).
> >
> > And at least on x86 it's actually pretty hard to generate out-of-order
> > accesses to begin with (_regardless_ of any issues external to the CPU).
> > You have to work at it, and use a WC memory area, and I'm pretty sure we
> > use UC for the apic accesses.
>
> yeah, we do. I do remember one x2apic related memory ordering
> bug though which happened in the past 6 months or so, lemme find
> the commit.
>
> This one is it:
>
> d6f0f39: x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
>
> attached below.
>
> The reason for that is that x2apic changes the access sequence
> from mmio (which old lapic used to be, and which was mapped UC),
> to an MSR sequence:
>
> static inline void native_x2apic_icr_write(u32 low, u32 id)
> {
> wrmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), ((__u64) id) << 32 | low);
> }
>
> But ... WRMSR should already be serializing - it is documented
> as a serializing instruction.
>
FWIW that is what the Intel docs say:
To allow for efficient access to the APIC registers in x2APIC mode,
the serializing semantics of WRMSR are relaxed when writing to the APIC
registers. Thus, system software should not use “WRMSR to APIC registers
in x2APIC mode” as a serializing instruction. Read and write accesses
to the APIC registers will occur in program order.
> I've cc:-ed Suresh & other APIC experts - exactly what type of
> hang did that patch fix? Do certain CPUs perhaps cut
> serialization corners, to speed up x2apic accesses?
>
> Ingo
>
> ------------------->
> >From d6f0f39b7d05e62b347c4352d070e4afb3ade4b5 Mon Sep 17 00:00:00 2001
> From: Suresh Siddha <suresh.b.siddha@intel.com>
> Date: Tue, 4 Nov 2008 13:53:04 -0800
> Subject: [PATCH] x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
>
> Impact: fix rare x2apic hang
>
> On x86, x2apic mode accesses for sending IPI's don't have serializing
> semantics. If the IPI receiver refers (in lock-free fashion) to some
> memory setup by the sender, the need for smp_mb() before sending the
> IPI becomes critical in x2apic mode.
>
> Add the smp_mb() in native_flush_tlb_others() before sending the IPI.
>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
> arch/x86/kernel/tlb_32.c | 6 ++++++
> arch/x86/kernel/tlb_64.c | 5 +++++
> 2 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
> index e00534b..f4049f3 100644
> --- a/arch/x86/kernel/tlb_32.c
> +++ b/arch/x86/kernel/tlb_32.c
> @@ -154,6 +154,12 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
> flush_mm = mm;
> flush_va = va;
> cpus_or(flush_cpumask, cpumask, flush_cpumask);
> +
> + /*
> + * Make the above memory operations globally visible before
> + * sending the IPI.
> + */
> + smp_mb();
> /*
> * We have to send the IPI only to
> * CPUs affected.
> diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> index dcbf7a1..8f919ca 100644
> --- a/arch/x86/kernel/tlb_64.c
> +++ b/arch/x86/kernel/tlb_64.c
> @@ -183,6 +183,11 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
> cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask);
>
> /*
> + * Make the above memory operations globally visible before
> + * sending the IPI.
> + */
> + smp_mb();
> + /*
> * We have to send the IPI only to
> * CPUs affected.
> */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Gleb.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:33 ` Linus Torvalds
@ 2009-02-18 16:58 ` Ingo Molnar
2009-02-18 17:05 ` Ingo Molnar
0 siblings, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 16:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 18 Feb 2009, Ingo Molnar wrote:
> >
> > But ... WRMSR should already be serializing - it is documented
> > as a serializing instruction.
>
> Hmm. I was thinking about this some more, and I think I've
> come up with an explanation.
>
> "wrmsr" probably serializes _after_ doing the write. After
> all, it's historically used for changing internal CPU state,
> so you want to do the write, and then wait until the effects
> of the write are "stable" in the core.
>
> That would explain how x2apic can use both a serializing
> instruction (wrmsr) and still effectively cause the IPI to
> happen out of sequence: the IPI can reach the destination CPU
> before the source CPU has flushed its store buffers, because
> the IPI is actually sent before serializing the core.
>
> But I would very strongly put this in the "x2apic code bug"
> column. If this is a true issue (and your TLB patch does imply
> it is), then we should just make sure that the x2apic IPI
> calls always do a 'sfence' before they happen - regardless of
> whether they are for TLB flushes or for generic kernel
> cross-calls, or for anything else.
Yeah, that makes perfect sense. IPIs are an out of band
signalling mechanism that do not listen to the normal cache
coherency rules.
Moving the smp_mb() to the x2apic specific code will also speed
up the normal mmio-mapped IPI sequence a bit. It should be an
smp_wmb() i suspect - which turns it into an sfence.
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:58 ` Ingo Molnar
@ 2009-02-18 17:05 ` Ingo Molnar
2009-02-18 17:10 ` Ingo Molnar
2009-02-18 17:14 ` Linus Torvalds
0 siblings, 2 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 17:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
an off-list comment pointed out this piece of information as
well:
http://www.sandpile.org/ia32/coherent.htm
A WRMSR to one of the x2APIC MSRs (0000_0800h...0000_0BFFh) is
not guaranteed to be serializing.
So i suspect we should just enclose it in smp_mb() pairs to make
sure it's a full barrier in both directions?
Although since it's an explicitly async mechanism with no
ordering expectations whatsoever, just doing an smp_mb() before
it should be enough, to make sure the IPI can never arrive at
another CPU/core faster than the event-horizon/visibility of
preceding operations.
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 17:05 ` Ingo Molnar
@ 2009-02-18 17:10 ` Ingo Molnar
2009-02-18 17:17 ` Linus Torvalds
2009-02-18 17:14 ` Linus Torvalds
1 sibling, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 17:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
ok, it's documented:
Intel® 64 and IA-32 Architectures
Software Developer’s Manual
Volume 3A:
System Programming Guide, Part 1
9.5.3 MSR Access in x2APIC Mode
To allow for efficient access to the APIC registers in x2APIC
mode, the serializing semantics of WRMSR are relaxed when
writing to the APIC registers. Thus, system software should not
use “WRMSR to APIC registers in x2APIC mode” as a serializing
instruction. Read and write accesses to the APIC registers will
occur in program order. A WRMSR to an APIC register may
complete before all preceding stores are globally visible;
software can prevent this by inserting a serializing
instruction or MFENCE before the WRMSR.
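In code, that prescription is a one-line fence in front of the ICR write - a minimal sketch (the fenced wrapper name is illustrative, not an existing kernel function; the MSR write is the native_x2apic_icr_write() body quoted earlier):

static inline void x2apic_icr_write_fenced(u32 low, u32 id)
{
	/* SDM 9.5.3: serializing instruction or MFENCE before the WRMSR */
	asm volatile("mfence" ::: "memory");
	wrmsrl(APIC_BASE_MSR + (APIC_ICR >> 4), ((__u64) id) << 32 | low);
}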
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 17:05 ` Ingo Molnar
2009-02-18 17:10 ` Ingo Molnar
@ 2009-02-18 17:14 ` Linus Torvalds
2009-02-18 17:47 ` Ingo Molnar
2009-02-18 18:33 ` Suresh Siddha
1 sibling, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-02-18 17:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
On Wed, 18 Feb 2009, Ingo Molnar wrote:
>
> an off-list comment pointed out this piece of information as
> well:
>
> http://www.sandpile.org/ia32/coherent.htm
>
> A WRMSR to one of the x2APIC MSRs (0000_0800h...0000_0BFFh) is
> not guaranteed to be serializing.
>
> So i suspect we should just enclose it in smp_mb() pairs to make
> sure it's a full barrier in both directions?
Why would we care about "both directions"?
I think putting an sfence _before_ the wrmsr (and not even all of them -
just put it in front of the "send IPI" sequence) should be fine. Any other
ordering sounds like just unnecessary overhead to me.
We do want this to be low-overhead, even if we probably don't care _that_
much.
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 17:10 ` Ingo Molnar
@ 2009-02-18 17:17 ` Linus Torvalds
2009-02-18 17:23 ` Ingo Molnar
0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-02-18 17:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
On Wed, 18 Feb 2009, Ingo Molnar wrote:
>
> ok, it's documented:
>
> Intel® 64 and IA-32 Architectures
> Software Developer’s Manual
> Volume 3A:
> System Programming Guide, Part 1
>
> 9.5.3 MSR Access in x2APIC Mode
Well, mine says
9.5.3 Error Handling
so I guess I need to download a newer version. After all, mine is
"ancient" in being a year old (the February-2008 version) and not covering
x2APIC at all.
Linus
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 17:17 ` Linus Torvalds
@ 2009-02-18 17:23 ` Ingo Molnar
0 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 17:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 18 Feb 2009, Ingo Molnar wrote:
> >
> > ok, it's documented:
> >
> > Intel® 64 and IA-32 Architectures
> > Software Developer’s Manual
> > Volume 3A:
> > System Programming Guide, Part 1
> >
> > 9.5.3 MSR Access in x2APIC Mode
>
> Well, mine says
>
> 9.5.3 Error Handling
>
> so I guess I need to download a newer version. After all, mine
> is "ancient" in being a year old (the February-2008 version)
> and not covering x2APIC at all.
Mine is 253668.pdf, that's the freshest available:
http://download.intel.com/design/processor/manuals/253668.pdf
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 17:14 ` Linus Torvalds
@ 2009-02-18 17:47 ` Ingo Molnar
2009-02-18 18:33 ` Suresh Siddha
1 sibling, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 17:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Suresh Siddha, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Wed, 18 Feb 2009, Ingo Molnar wrote:
> >
> > an off-list comment pointed out this piece of information as
> > well:
> >
> > http://www.sandpile.org/ia32/coherent.htm
> >
> > A WRMSR to one of the x2APIC MSRs (0000_0800h...0000_0BFFh) is
> > not guaranteed to be serializing.
> >
> > So i suspect we should just enclose it in smp_mb() pairs to make
> > sure it's a full barrier in both directions?
>
> Why would we care about "both directions"?
>
> I think putting an sfence _before_ the wrmsr (and not even all
> of them - just put it in front of the "send IPI" sequence)
> should be fine. Any other ordering sounds like just
> unnecessary overhead to me.
>
> We do want this to be low-overhead, even if we probably don't
> care _that_ much.
yeah, you are right, making sure prior stores become visible
should be the only worry here.
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 17:14 ` Linus Torvalds
2009-02-18 17:47 ` Ingo Molnar
@ 2009-02-18 18:33 ` Suresh Siddha
1 sibling, 0 replies; 49+ messages in thread
From: Suresh Siddha @ 2009-02-18 18:33 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Pallipadi, Venkatesh, Yinghai Lu, Nick Piggin,
Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Rusty Russell, Steven Rostedt, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org, asit.k.mallick
On Wed, 2009-02-18 at 09:14 -0800, Linus Torvalds wrote:
>
> On Wed, 18 Feb 2009, Ingo Molnar wrote:
> >
> > an off-list comment pointed out this piece of information as
> > well:
> >
> > http://www.sandpile.org/ia32/coherent.htm
> >
> > A WRMSR to one of the x2APIC MSRs (0000_0800h...0000_0BFFh) is
> > not guaranteed to be serializing.
> >
> > So i suspect we should just enclose it in smp_mb() pairs to make
> > sure it's a full barrier in both directions?
>
> Why would we care about "both directions"?
on x86 we don't need it in both directions.
>
> I think putting an sfence _before_ the wrmsr (and not even all of them -
> just put it in front of the "send IPI" sequence) should be fine. Any other
> ordering sounds like just unnecessary overhead to me.
For x2apic ipi's, we should use a serializing instruction or a "mfence"
instruction. "sfence" will not help in this scenario.
thanks,
suresh
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 13:59 ` Nick Piggin
2009-02-18 16:19 ` Linus Torvalds
@ 2009-02-18 18:43 ` Suresh Siddha
2009-02-18 19:17 ` Ingo Molnar
1 sibling, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2009-02-18 18:43 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Oleg Nesterov, Jens Axboe, Linus Torvalds,
Paul E. McKenney, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Wed, 2009-02-18 at 05:59 -0800, Nick Piggin wrote:
> You're saying the problem is in generic_exec_single because I've
> removed the smp_mb that inadvertently also serialises memory with
> the x2apic on x86?
yes.
>
> Indeed that could cause problems on some architectures which I
> had hoped to avoid. So the patch is probably better off to first
> add the smp_mb() to arch_send_call_function_xxx arch code, unless
> it is immediately obvious or confirmed by arch maintainer that
> such barrier is not required.
For x2apic specific operations we should add the smp_mb() sequence. But
we need to make sure that we don't end up doing it twice (once in
generic code and another in arch code) for all the ipi paths.
thanks,
suresh
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 18:43 ` Suresh Siddha
@ 2009-02-18 19:17 ` Ingo Molnar
2009-02-18 23:55 ` Suresh Siddha
0 siblings, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2009-02-18 19:17 UTC (permalink / raw)
To: Suresh Siddha
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
* Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> On Wed, 2009-02-18 at 05:59 -0800, Nick Piggin wrote:
> > You're saying the problem is in generic_exec_single because I've
> > removed the smp_mb that inadvertently also serialises memory with
> > the x2apic on x86?
>
> yes.
>
> >
> > Indeed that could cause problems on some architectures which I
> > had hoped to avoid. So the patch is probably better off to first
> > add the smp_mb() to arch_send_call_function_xxx arch code, unless
> > it is immediately obvious or confirmed by arch maintainer that
> > such barrier is not required.
>
> For x2apic specific operations we should add the smp_mb() sequence. But
> we need to make sure that we don't end up doing it twice (once in
> generic code and another in arch code) for all the ipi paths.
right now we do have an smp_mb() due to your fix in November.
So what should happen is to move that smp_mb() from the x86
generic IPI path to the x86 x2apic IPI path. (and turn it into
an smp_wmb() - that should be enough - we dont care about future
reads being done sooner than this point.)
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 19:17 ` Ingo Molnar
@ 2009-02-18 23:55 ` Suresh Siddha
2009-02-19 12:20 ` Ingo Molnar
0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2009-02-18 23:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Wed, 2009-02-18 at 11:17 -0800, Ingo Molnar wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>
> > > Indeed that could cause problems on some architectures which I
> > > had hoped to avoid. So the patch is probably better off to first
> > > add the smp_mb() to arch_send_call_function_xxx arch code, unless
> > > it is immediately obvious or confirmed by arch maintainer that
> > > such barrier is not required.
> >
> > For x2apic specific operations we should add the smp_mb() sequence. But
> > we need to make sure that we don't end up doing it twice (once in
> > generic code and another in arch code) for all the ipi paths.
>
> right now we do have an smp_mb() due to your fix in November.
>
> So what should happen is to move that smp_mb() from the x86
> generic IPI path to the x86 x2apic IPI path. (and turn it into
> an smp_wmb() - that should be enough - we dont care about future
> reads being done sooner than this point.)
Ingo, smp_wmb() won't help. x2apic register writes can still go ahead of
the sfence. According to the SDM, we need a serializing instruction or
mfence. Our internal experiments also proved this.
Appended is the x86 portion of the patch:
---
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: x86: move smp_mb() in x86 flush tlb path to x2apic specific IPI paths
Uncached MMIO accesses for xapic are inherently serializing and hence
we don't need explicit barriers for xapic IPI paths.
x2apic MSR writes/reads don't have serializing semantics and hence need
a serializing instruction or mfence, to make all the previous memory
stores globally visible before the x2apic msr write for IPI.
And hence move smp_mb() in x86 flush tlb path to x2apic specific paths.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
diff --git a/arch/x86/kernel/genx2apic_cluster.c b/arch/x86/kernel/genx2apic_cluster.c
index 7c87156..b237248 100644
--- a/arch/x86/kernel/genx2apic_cluster.c
+++ b/arch/x86/kernel/genx2apic_cluster.c
@@ -60,6 +60,13 @@ static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
__x2apic_send_IPI_dest(
@@ -76,6 +83,13 @@ static void
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
if (query_cpu == this_cpu)
@@ -93,6 +107,13 @@ static void x2apic_send_IPI_allbutself(int vector)
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_online_cpu(query_cpu) {
if (query_cpu == this_cpu)
diff --git a/arch/x86/kernel/genx2apic_phys.c b/arch/x86/kernel/genx2apic_phys.c
index 5cbae8a..f48f282 100644
--- a/arch/x86/kernel/genx2apic_phys.c
+++ b/arch/x86/kernel/genx2apic_phys.c
@@ -58,6 +58,13 @@ static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
__x2apic_send_IPI_dest(per_cpu(x86_cpu_to_apicid, query_cpu),
@@ -73,6 +80,13 @@ static void
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
if (query_cpu != this_cpu)
@@ -89,6 +103,13 @@ static void x2apic_send_IPI_allbutself(int vector)
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_online_cpu(query_cpu) {
if (query_cpu == this_cpu)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 14c5af4..de14557 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -188,11 +188,6 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
cpumask, cpumask_of(smp_processor_id()));
/*
- * Make the above memory operations globally visible before
- * sending the IPI.
- */
- smp_mb();
- /*
* We have to send the IPI only to
* CPUs affected.
*/
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:09 ` Linus Torvalds
2009-02-18 16:21 ` Ingo Molnar
@ 2009-02-19 0:12 ` Nick Piggin
2009-02-19 6:47 ` Benjamin Herrenschmidt
2 siblings, 0 replies; 49+ messages in thread
From: Nick Piggin @ 2009-02-19 0:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Oleg Nesterov, Peter Zijlstra, Jens Axboe,
Suresh Siddha, Ingo Molnar, Rusty Russell, Steven Rostedt,
linux-kernel, linux-arch
On Wed, Feb 18, 2009 at 08:09:21AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 18 Feb 2009, Nick Piggin wrote:
> >
> > I agree with you both that we *should* make arch interrupt code
> > do the ordering, but given the subtle lockups on some architectures
> > in this new code, I didn't want to make it significantly weaker...
> >
> > Though perhaps it appears that I have, if I have removed an smp_mb
> > that x86 was relying on to emit an mfence to serialise the apic.
>
> The thing is, if the architecture doesn't order IPI wrt cache coherency,
> then the "smp_mb()" doesn't really do so _either_.
Oh yes, agreed there, which is why I'm saying it is just a hack
and should be removed at some point.
> It might hide some architecture-specific implementation issue, of course,
> so random amounts of "smp_mb()"s sprinkled around might well make some
> architecture "work", but it's in no way guaranteed. A smp_mb() does not
> guarantee that some separate IPI network is ordered - that may well take
> some random machine-specific IO cycle.
Yes, but I didn't want to pull out that smp_mb() at least until
arch maintainers can ack it. Just because there might indeed be
some random issue hidden by it.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 16:09 ` Linus Torvalds
2009-02-18 16:21 ` Ingo Molnar
2009-02-19 0:12 ` Nick Piggin
@ 2009-02-19 6:47 ` Benjamin Herrenschmidt
2009-02-19 13:11 ` Nick Piggin
2 siblings, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-19 6:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Paul E. McKenney, Oleg Nesterov, Peter Zijlstra,
Jens Axboe, Suresh Siddha, Ingo Molnar, Rusty Russell,
Steven Rostedt, linux-kernel, linux-arch
> It might hide some architecture-specific implementation issue, of course,
> so random amounts of "smp_mb()"s sprinkled around might well make some
> architecture "work", but it's in no way guaranteed. A smp_mb() does not
> guarantee that some separate IPI network is ordered - that may well take
> some random machine-specific IO cycle.
>
> That said, at least on x86, taking an interrupt should be a serializing
> event, so there should be no reason for anything on the receiving side.
> The _sending_ side might need to make sure that there is serialization
> when generating the IPI (so that the IPI cannot happen while the writes
> are still in some per-CPU write buffer and haven't become part of the
> cache coherency domain).
>
> And at least on x86 it's actually pretty hard to generate out-of-order
> accesses to begin with (_regardless_ of any issues external to the CPU).
> You have to work at it, and use a WC memory area, and I'm pretty sure we
> use UC for the apic accesses.
On powerpc, I suspect an smp_mb() on the sender would be useful... it
mostly depends how the IPI is generated but in most case it's going to
be an MMIO write, ie non-cached write which isn't ordered vs. any
previous cached store except using a full sync (which is what smp_mb()
does).
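As a rough sketch of the sender side (the hook name is the one the generic code calls; the body is illustrative, not the real powerpc implementation):

void arch_send_call_function_single_ipi(int cpu)
{
	/*
	 * Full sync: make prior cached stores visible before the
	 * non-cached MMIO store that raises the IPI on @cpu.
	 */
	smp_mb();
	/* ... MMIO write to the interrupt controller targeting @cpu ... */
}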
Cheers,
Ben.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-18 23:55 ` Suresh Siddha
@ 2009-02-19 12:20 ` Ingo Molnar
2009-02-19 12:29 ` Nick Piggin
2009-02-19 22:00 ` Suresh Siddha
0 siblings, 2 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-19 12:20 UTC (permalink / raw)
To: Suresh Siddha
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
* Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> On Wed, 2009-02-18 at 11:17 -0800, Ingo Molnar wrote:
> > * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> >
> > > > Indeed that could cause problems on some architectures which I
> > > > had hoped to avoid. So the patch is probably better off to first
> > > > add the smp_mb() to arch_send_call_function_xxx arch code, unless
> > > > it is immediately obvious or confirmed by arch maintainer that
> > > > such barrier is not required.
> > >
> > > For x2apic specific operations we should add the smp_mb() sequence. But
> > > we need to make sure that we don't end up doing it twice (once in
> > > generic code and another in arch code) for all the ipi paths.
> >
> > right now we do have an smp_mb() due to your fix in November.
> >
> > So what should happen is to move that smp_mb() from the x86
> > generic IPI path to the x86 x2apic IPI path. (and turn it into
> > an smp_wmb() - that should be enough - we dont care about future
> > reads being done sooner than this point.)
>
> Ingo, smp_wmb() won't help. x2apic register writes can still
> go ahead of the sfence. According to the SDM, we need a
> serializing instruction or mfence. Our internal experiments
> also proved this.
ah, yes - i got confused about how an x2apic write can pass a
_store_ fence.
The reason is that an MSR write is a register->register move
(not a memory write), so it is not part of the generic memory
ordering machinery. So a serializing instruction it has to be.
> Appended is the x86 portion of the patch: ---
>
> From: Suresh Siddha <suresh.b.siddha@intel.com>
> Subject: x86: move smp_mb() in x86 flush tlb path to x2apic specific IPI paths
Could you please refresh this patch to latest tip:master? The
APIC drivers moved to arch/x86/kernel/apic/.
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-19 12:20 ` Ingo Molnar
@ 2009-02-19 12:29 ` Nick Piggin
2009-02-19 12:45 ` Ingo Molnar
2009-02-19 22:00 ` Suresh Siddha
1 sibling, 1 reply; 49+ messages in thread
From: Nick Piggin @ 2009-02-19 12:29 UTC (permalink / raw)
To: Ingo Molnar
Cc: Suresh Siddha, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Thu, Feb 19, 2009 at 01:20:31PM +0100, Ingo Molnar wrote:
>
> * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>
> > On Wed, 2009-02-18 at 11:17 -0800, Ingo Molnar wrote:
> > > * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> > > So what should happen is to move that smp_mb() from the x86
> > > generic IPI path to the x86 x2apic IPI path. (and turn it into
> > > an smp_wmb() - that should be enough - we dont care about future
> > > reads being done sooner than this point.)
> >
> > Ingo, smp_wmb() won't help. x2apic register writes can still
> > go ahead of the sfence. According to the SDM, we need a
> > serializing instruction or mfence. Our internal experiments
> > also proved this.
>
> ah, yes - i got confused about how an x2apic write can pass a
> _store_ fence.
And about how smp_wmb() doesn't emit a store fence ;)
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-19 12:29 ` Nick Piggin
@ 2009-02-19 12:45 ` Ingo Molnar
0 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-19 12:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Suresh Siddha, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
* Nick Piggin <npiggin@suse.de> wrote:
> On Thu, Feb 19, 2009 at 01:20:31PM +0100, Ingo Molnar wrote:
> >
> > * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> >
> > > On Wed, 2009-02-18 at 11:17 -0800, Ingo Molnar wrote:
> > > > * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> > > > So what should happen is to move that smp_mb() from the x86
> > > > generic IPI path to the x86 x2apic IPI path. (and turn it into
> > > > an smp_wmb() - that should be enough - we dont care about future
> > > > reads being done sooner than this point.)
> > >
> > > Ingo, smp_wmb() won't help. x2apic register writes can still
> > > go ahead of the sfence. According to the SDM, we need a
> > > serializing instruction or mfence. Our internal experiments
> > > also proved this.
> >
> > ah, yes - i got confused about how an x2apic write can pass a
> > _store_ fence.
>
> And about how smp_wmb() doesn't emit a store fence ;)
yeah ;-) Only wmb() emits a SFENCE all the time. Writes are
normally ordered so we map smp_wmb() to barrier().
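i.e. roughly this mapping (a sketch of the idea, not the exact kernel #defines):

#define wmb()		asm volatile("sfence" ::: "memory")	/* always an SFENCE */
#define smp_wmb()	barrier()	/* x86 orders normal stores; a compiler barrier suffices */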
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-19 6:47 ` Benjamin Herrenschmidt
@ 2009-02-19 13:11 ` Nick Piggin
2009-02-19 13:11 ` Nick Piggin
2009-02-19 15:06 ` Ingo Molnar
0 siblings, 2 replies; 49+ messages in thread
From: Nick Piggin @ 2009-02-19 13:11 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Linus Torvalds, Paul E. McKenney, Oleg Nesterov, Peter Zijlstra,
Jens Axboe, Suresh Siddha, Ingo Molnar, Rusty Russell,
Steven Rostedt, linux-kernel, linux-arch
On Thu, Feb 19, 2009 at 05:47:20PM +1100, Benjamin Herrenschmidt wrote:
>
> > It might hide some architecture-specific implementation issue, of course,
> > so random amounts of "smp_mb()"s sprinkled around might well make some
> > architecture "work", but it's in no way guaranteed. A smp_mb() does not
> > guarantee that some separate IPI network is ordered - that may well take
> > some random machine-specific IO cycle.
> >
> > That said, at least on x86, taking an interrupt should be a serializing
> > event, so there should be no reason for anything on the receiving side.
> > The _sending_ side might need to make sure that there is serialization
> > when generating the IPI (so that the IPI cannot happen while the writes
> > are still in some per-CPU write buffer and haven't become part of the
> > cache coherency domain).
> >
> > And at least on x86 it's actually pretty hard to generate out-of-order
> > accesses to begin with (_regardless_ of any issues external to the CPU).
> > You have to work at it, and use a WC memory area, and I'm pretty sure we
> > use UC for the apic accesses.
>
> On powerpc, I suspect an smp_mb() on the sender would be useful... it
> mostly depends how the IPI is generated but in most case it's going to
> be an MMIO write, ie non-cached write which isn't ordered vs. any
> previous cached store except using a full sync (which is what smp_mb()
> does).
So your arch_send_call_function_single_ipi etc need to ensure this,
right? Generic code obviously has no idea about how to do it.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-19 13:11 ` Nick Piggin
2009-02-19 13:11 ` Nick Piggin
@ 2009-02-19 15:06 ` Ingo Molnar
2009-02-19 21:49 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2009-02-19 15:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Benjamin Herrenschmidt, Linus Torvalds, Paul E. McKenney,
Oleg Nesterov, Peter Zijlstra, Jens Axboe, Suresh Siddha,
Rusty Russell, Steven Rostedt, linux-kernel, linux-arch
* Nick Piggin <npiggin@suse.de> wrote:
> On Thu, Feb 19, 2009 at 05:47:20PM +1100, Benjamin Herrenschmidt wrote:
> >
> > > It might hide some architecture-specific implementation issue, of course,
> > > so random amounts of "smp_mb()"s sprinkled around might well make some
> > > architecture "work", but it's in no way guaranteed. A smp_mb() does not
> > > guarantee that some separate IPI network is ordered - that may well take
> > > some random machine-specific IO cycle.
> > >
> > > That said, at least on x86, taking an interrupt should be a serializing
> > > event, so there should be no reason for anything on the receiving side.
> > > The _sending_ side might need to make sure that there is serialization
> > > when generating the IPI (so that the IPI cannot happen while the writes
> > > are still in some per-CPU write buffer and haven't become part of the
> > > cache coherency domain).
> > >
> > > And at least on x86 it's actually pretty hard to generate out-of-order
> > > accesses to begin with (_regardless_ of any issues external to the CPU).
> > > You have to work at it, and use a WC memory area, and I'm pretty sure we
> > > use UC for the apic accesses.
> >
> > On powerpc, I suspect an smp_mb() on the sender would be
> > useful... it mostly depends how the IPI is generated but in
> > most case it's going to be an MMIO write, ie non-cached
> > write which isn't ordered vs. any previous cached store
> > except using a full sync (which is what smp_mb() does).
>
> So your arch_send_call_function_single_ipi etc need to ensure
> this, right? Generic code obviously has no idea about how to
> do it.
The thing is, the most widespread way to send IPIs (x86
non-x2apic local APIC) does not need any barriers in the generic
code or elsewhere, because the local APIC is mapped uncached so
it's implicitly ordered.
So the right solution is to add barriers to those IPI
implementations that need it. This means that the generic code
should not have a barrier for IPI sending.
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-19 15:06 ` Ingo Molnar
@ 2009-02-19 21:49 ` Benjamin Herrenschmidt
0 siblings, 0 replies; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2009-02-19 21:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Linus Torvalds, Paul E. McKenney, Oleg Nesterov,
Peter Zijlstra, Jens Axboe, Suresh Siddha, Rusty Russell,
Steven Rostedt, linux-kernel, linux-arch
On Thu, 2009-02-19 at 16:06 +0100, Ingo Molnar wrote:
> > So your arch_send_call_function_single_ipi etc need to ensure
> > this, right? Generic code obviously has no idea about how to
> > do it.
>
> The thing is, the most widespread way to send IPIs (x86
> non-x2apic local APIC) does not need any barriers in the generic
> code or elsewhere, because the local APIC is mapped uncached so
> it's implicitly ordered.
>
> So the right solution is to add barriers to those IPI
> implementations that need it. This means that the generic code
> should not have a barrier for IPI sending.
I agree. In fact, our current code should be fine in any case because
our writel() which will be used to generate the IPI has a sync in it
anyway for other reasons.
Cheers,
Ben.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-19 12:20 ` Ingo Molnar
2009-02-19 12:29 ` Nick Piggin
@ 2009-02-19 22:00 ` Suresh Siddha
2009-02-20 10:56 ` Ingo Molnar
1 sibling, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2009-02-19 22:00 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
On Thu, 2009-02-19 at 04:20 -0800, Ingo Molnar wrote:
> Could you please refresh this patch to latest tip:master? The
> APIC drivers moved to arch/x86/kernel/apic/.
Appended the refreshed patch. Thanks.
---
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: x86: move smp_mb() in flush tlb path to x2apic specific paths
Uncached MMIO accesses for xapic are inherently serializing and hence
we don't need explicit barriers for xapic IPI paths.
x2apic MSR writes/reads don't have serializing semantics and hence need
a serializing instruction or mfence, to make all the previous memory
stores globally visible before the x2apic msr write for IPI.
And hence move smp_mb() in x86 flush tlb path to x2apic specific paths.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index 3f7df23..c54fffe 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -63,6 +63,13 @@ static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
__x2apic_send_IPI_dest(
@@ -79,6 +86,13 @@ static void
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
if (query_cpu == this_cpu)
@@ -96,6 +110,13 @@ static void x2apic_send_IPI_allbutself(int vector)
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_online_cpu(query_cpu) {
if (query_cpu == this_cpu)
diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c
index d2d52eb..84cc2f3 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -58,6 +58,13 @@ static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
__x2apic_send_IPI_dest(per_cpu(x86_cpu_to_apicid, query_cpu),
@@ -73,6 +80,13 @@ static void
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
if (query_cpu != this_cpu)
@@ -89,6 +103,13 @@ static void x2apic_send_IPI_allbutself(int vector)
unsigned long query_cpu;
unsigned long flags;
+ /*
+ * Make previous memory operations globally visible before
+ * sending the IPI. We need a serializing instruction or mfence
+ * for this.
+ */
+ smp_mb();
+
local_irq_save(flags);
for_each_online_cpu(query_cpu) {
if (query_cpu == this_cpu)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a654d59..821e970 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -187,11 +187,6 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
cpumask, cpumask_of(smp_processor_id()));
/*
- * Make the above memory operations globally visible before
- * sending the IPI.
- */
- smp_mb();
- /*
* We have to send the IPI only to
* CPUs affected.
*/
^ permalink raw reply related [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-19 22:00 ` Suresh Siddha
@ 2009-02-20 10:56 ` Ingo Molnar
2009-02-20 18:56 ` Suresh Siddha
0 siblings, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2009-02-20 10:56 UTC (permalink / raw)
To: Suresh Siddha
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
* Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> On Thu, 2009-02-19 at 04:20 -0800, Ingo Molnar wrote:
> > Could you please refresh this patch to latest tip:master? The
> > APIC drivers moved to arch/x86/kernel/apic/.
>
> Appended the refreshed patch. Thanks.
thanks. Two details i noticed:
Firstly:
> +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> +++ b/arch/x86/kernel/apic/x2apic_phys.c
how about x2apic_uv.c? It uses uv_write_global_mmr64() in its
IPI sending method, which uses:
static inline void uv_write_global_mmr64(int pnode, unsigned long offset,
unsigned long val)
{
*uv_global_mmr64_address(pnode, offset) = val;
}
which uses ->mmr_base, which is mapped via:
init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE);
so it should be fine because uncached - but at minimum we should
put a comment into x2apic_uv.c that the generic IPI code relies
on the lowlevel code serializing - i.e. relies on the UC PAT
attribute.
Secondly, you added smp_mb(), which will translate to an MFENCE.
But in theory it should be enough to have a wmb() here. [Note,
not an smp_wmb() that i suggested before.] That will translate
to an SFENCE - which will serialize writes but still allows
reads/prefetches to pass.
So the question is, is an SFENCE there enough to serialize the
WRMSR with previous memory-writes? It's not specified in the
x2apic docs as far as i could see.
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-20 10:56 ` Ingo Molnar
@ 2009-02-20 18:56 ` Suresh Siddha
2009-02-20 18:56 ` Suresh Siddha
` (3 more replies)
0 siblings, 4 replies; 49+ messages in thread
From: Suresh Siddha @ 2009-02-20 18:56 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steiner
On Fri, 2009-02-20 at 02:56 -0800, Ingo Molnar wrote:
> * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>
> > On Thu, 2009-02-19 at 04:20 -0800, Ingo Molnar wrote:
> > > Could you please refresh this patch to latest tip:master? The
> > > APIC drivers moved to arch/x86/kernel/apic/.
> >
> > Appended the refreshed patch. Thanks.
>
> thanks. Two details i noticed:
>
> Firstly:
>
> > +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> > +++ b/arch/x86/kernel/apic/x2apic_phys.c
>
> how about x2apic_uv.c? It uses uv_write_global_mmr64() in its
> IPI sending method, which uses:
>
> static inline void uv_write_global_mmr64(int pnode, unsigned long offset,
> unsigned long val)
> {
> *uv_global_mmr64_address(pnode, offset) = val;
> }
>
> which uses ->mmr_base, which is mapped via:
>
> init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE);
>
> so it should be fine because uncached - but at minimum we should
> put a comment into x2apic_uv.c that the generic IPI code relies
> on the lowlevel code serializing - i.e. relies on the UC PAT
> attribute.
Oops. I forgot to mention that and copy Jack to confirm our
understanding. I can send a followup patch adding the comments for UV.
Jack, are you ok?
>
> Secondly, you added smp_mb(), which will translate to an MFENCE.
>
> But in theory it should be enough to have a wmb() here. [Note,
> not an smp_wmb() that i suggested before.] That will translate
> to an SFENCE - which will serialize writes but still allows
> reads/prefetches to pass.
>
> So the question is, is an SFENCE there enough to serialize the
> WRMSR with previous memory-writes? It's not specified in the
> x2apic docs as far as i could see.
No. sfence is not enough (wrmsr to x2apic regs was still passing ahead).
We have done a small experiment to demonstrate the issue and adding
mfence fixes the issue but not sfence. We need a serializing instruction
or mfence. I will try to get the SDM updated.
thanks,
suresh
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-20 18:56 ` Suresh Siddha
2009-02-20 18:56 ` Suresh Siddha
@ 2009-02-20 19:40 ` Ingo Molnar
2009-02-20 23:28 ` Jack Steiner
2009-02-25 3:32 ` Nick Piggin
3 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-20 19:40 UTC (permalink / raw)
To: Suresh Siddha
Cc: Nick Piggin, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steiner
* Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> > So the question is, is an SFENCE there enough to serialize
> > the WRMSR with previous memory-writes? It's not specified in
> > the x2apic docs as far as i could see.
>
> No. sfence is not enough (wrmsr to x2apic regs was still
> passing ahead). We have done a small experiment to demonstrate
> the issue and adding mfence fixes the issue but not sfence. We
> need a serializing instruction or mfence. I will try to get
> the SDM updated.
Fair enough!
Ingo
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-20 18:56 ` Suresh Siddha
2009-02-20 18:56 ` Suresh Siddha
2009-02-20 19:40 ` Ingo Molnar
@ 2009-02-20 23:28 ` Jack Steiner
2009-02-25 3:32 ` Nick Piggin
3 siblings, 0 replies; 49+ messages in thread
From: Jack Steiner @ 2009-02-20 23:28 UTC (permalink / raw)
To: Suresh Siddha
Cc: Ingo Molnar, Nick Piggin, Peter Zijlstra, Oleg Nesterov,
Jens Axboe, Linus Torvalds, Paul E. McKenney, Rusty Russell,
Steven Rostedt, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org
On Fri, Feb 20, 2009 at 10:56:54AM -0800, Suresh Siddha wrote:
> On Fri, 2009-02-20 at 02:56 -0800, Ingo Molnar wrote:
> > * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> >
> > > On Thu, 2009-02-19 at 04:20 -0800, Ingo Molnar wrote:
> > > > Could you please refresh this patch to latest tip:master? The
> > > > APIC drivers moved to arch/x86/kernel/apic/.
> > >
> > > Appended the refreshed patch. Thanks.
> >
> > thanks. Two details i noticed:
> >
> > Firstly:
> >
> > > +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> > > +++ b/arch/x86/kernel/apic/x2apic_phys.c
> >
> > how about x2apic_uv.c? It uses uv_write_global_mmr64() in its
> > IPI sending method, which uses:
> >
> > static inline void uv_write_global_mmr64(int pnode, unsigned long offset,
> > unsigned long val)
> > {
> > *uv_global_mmr64_address(pnode, offset) = val;
> > }
> >
> > which uses ->mmr_base, which is mapped via:
> >
> > init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE);
> >
> > so it should be fine because uncached - but at minimum we should
> > put a comment into x2apic_uv.c that the generic IPI code relies
> > on the lowlevel code serializing - i.e. relies on the UC PAT
> > attribute.
>
> Oops. I forgot to mention that and copy Jack to confirm our
> understanding. I can send a followup patch adding the comments for UV.
> Jack, are you ok?
Yes. Looks ok to me.
>
> >
> > Secondly, you added smp_mb(), which will translate to an MFENCE.
> >
> > But in theory it should be enough to have a wmb() here. [Note,
> > not an smp_wmb() that i suggested before.] That will translate
> > to an SFENCE - which will serialize writes but still allows
> > reads/prefetches to pass.
> >
> > So the question is, is an SFENCE there enough to serialize the
> > WRMSR with previous memory-writes? It's not specified in the
> > x2apic docs as far as i could see.
>
> No. sfence is not enough (wrmsr to x2apic regs was still passing ahead).
> We have done a small experiment to demonstrate the issue and adding
> mfence fixes the issue but not sfence. We need a serializing instruction
> or mfence. I will try to get the SDM updated.
>
> thanks,
> suresh
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-20 18:56 ` Suresh Siddha
` (2 preceding siblings ...)
2009-02-20 23:28 ` Jack Steiner
@ 2009-02-25 3:32 ` Nick Piggin
2009-02-25 12:47 ` Ingo Molnar
` (2 more replies)
3 siblings, 3 replies; 49+ messages in thread
From: Nick Piggin @ 2009-02-25 3:32 UTC (permalink / raw)
To: Suresh Siddha
Cc: Ingo Molnar, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steiner
On Fri, Feb 20, 2009 at 10:56:54AM -0800, Suresh B wrote:
> On Fri, 2009-02-20 at 02:56 -0800, Ingo Molnar wrote:
> > * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> >
> > > On Thu, 2009-02-19 at 04:20 -0800, Ingo Molnar wrote:
> > > > Could you please refresh this patch to latest tip:master? The
> > > > APIC drivers moved to arch/x86/kernel/apic/.
> > >
> > > Appended the refreshed patch. Thanks.
> >
> > thanks. Two details i noticed:
> >
> > Firstly:
> >
> > > +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> > > +++ b/arch/x86/kernel/apic/x2apic_phys.c
> >
> > how about x2apic_uv.c? It uses uv_write_global_mmr64() in its
> > IPI sending method, which uses:
> >
> > static inline void uv_write_global_mmr64(int pnode, unsigned long offset,
> > unsigned long val)
> > {
> > *uv_global_mmr64_address(pnode, offset) = val;
> > }
> >
> > which uses ->mmr_base, which is mapped via:
> >
> > init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE);
> >
> > so it should be fine because uncached - but at minimum we should
> > put a comment into x2apic_uv.c that the generic IPI code relies
> > on the lowlevel code serializing - i.e. relies on the UC PAT
> > attribute.
>
> Oops. I forgot to mention that and copy Jack to confirm our
> understanding. I can send a followup patch adding the comments for UV.
> Jack, are you ok?
>
> >
> > Secondly, you added smp_mb(), which will translate to an MFENCE.
> >
> > But in theory it should be enough to have a wmb() here. [Note,
> > not an smp_wmb() that i suggested before.] That will translate
> > to an SFENCE - which will serialize writes but still allows
> > reads/prefetches to pass.
> >
> > So the question is, is an SFENCE there enough to serialize the
> > WRMSR with previous memory-writes? It's not specified in the
> > x2apic docs as far as i could see.
>
> No. sfence is not enough (wrmsr to x2apic regs was still passing ahead).
> We have done a small experiment to demonstrate the issue and adding
> mfence fixes the issue but not sfence. We need a serializing instruction
> or mfence. I will try to get the SDM updated.
Just if I may add something -- I would probably slightly prefer if
you explicitly used an sfence or other serializing instruction rather
than smp_mb(). Maybe call it wrmsr_fence() or something. Apart from
being self documenting, and less confusing (wrmsr is not part of
normal ordering), I assume you technically also need it on UP
systems?
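Something like this, say (just a sketch -- mfence rather than sfence,
per your experiment, and the name is only a suggestion):

static inline void wrmsr_fence(void)
{
	/*
	 * wrmsr to the x2apic registers is not ordered with normal
	 * stores, so force all previous memory operations to be
	 * globally visible first.
	 */
	asm volatile("mfence":::"memory");
}

That would also cover UP builds, where smp_mb() compiles down to a
plain barrier().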
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-25 3:32 ` Nick Piggin
@ 2009-02-25 12:47 ` Ingo Molnar
2009-02-25 18:25 ` Luck, Tony
2009-03-17 18:16 ` Suresh Siddha
2 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2009-02-25 12:47 UTC (permalink / raw)
To: Nick Piggin
Cc: Suresh Siddha, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, steiner@sgi.com
* Nick Piggin <npiggin@suse.de> wrote:
> On Fri, Feb 20, 2009 at 10:56:54AM -0800, Suresh Siddha wrote:
> > On Fri, 2009-02-20 at 02:56 -0800, Ingo Molnar wrote:
> > > * Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> > >
> > > > On Thu, 2009-02-19 at 04:20 -0800, Ingo Molnar wrote:
> > > > > Could you please refresh this patch to latest tip:master? The
> > > > > APIC drivers moved to arch/x86/kernel/apic/.
> > > >
> > > > Appended the refreshed patch. Thanks.
> > >
> > > thanks. Two details i noticed:
> > >
> > > Firstly:
> > >
> > > > +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> > > > +++ b/arch/x86/kernel/apic/x2apic_phys.c
> > >
> > > how about x2apic_uv.c? It uses uv_write_global_mmr64() in its
> > > IPI sending method, which uses:
> > >
> > > static inline void uv_write_global_mmr64(int pnode, unsigned long offset,
> > > unsigned long val)
> > > {
> > > *uv_global_mmr64_address(pnode, offset) = val;
> > > }
> > >
> > > which uses ->mmr_base, which is mapped via:
> > >
> > > init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE);
> > >
> > > so it should be fine because uncached - but at minimum we should
> > > put a comment into x2apic_uv.c that the generic IPI code relies
> > > on the lowlevel code serializing - i.e. relies on the UC PAT
> > > attribute.
> >
> > Oops. I forgot to mention that and copy Jack to confirm our
> > understanding. I can send a followup patch adding the comments for UV.
> > Jack, are you ok?
> >
> > >
> > > Secondly, you added smp_mb(), which will translate to an MFENCE.
> > >
> > > But in theory it should be enough to have a wmb() here. [Note,
> > > not an smp_wmb() that i suggested before.] That will translate
> > > to an SFENCE - which will serialize writes but still allows
> > > reads/prefetches to pass.
> > >
> > > So the question is, is an SFENCE there enough to serialize the
> > > WRMSR with previous memory-writes? It's not specified in the
> > > x2apic docs as far as i could see.
> >
> > No. sfence is not enough (wrmsr to x2apic regs was still passing ahead).
> > We have done a small experiment to demonstrate the issue and adding
> > mfence fixes the issue but not sfence. We need a serializing instruction
> > or mfence. I will try to get the SDM updated.
>
> Just if I may add something -- I would probably slightly
> prefer if you explicitly used an sfence or other serializing
> instruction rather than smp_mb(). Maybe call it wrmsr_fence()
> or something. Apart from being self documenting, and less
> confusing (wrmsr is not part of normal ordering), I assume you
> technically also need it on UP systems?
Hm, UP systems shouldn't worry too much about sending IPIs to
other CPUs, right? We optimize out same-CPU IPIs already so that
bit should not matter.
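(To illustrate -- a simplified sketch of the existing single-call path
in kernel/smp.c; the local CPU runs the function directly and never
reaches the IPI send:)

	this_cpu = get_cpu();		/* also disables preemption */
	if (cpu == this_cpu) {
		/* same-CPU case: run it directly, no IPI and no fence */
		local_irq_save(flags);
		func(info);
		local_irq_restore(flags);
	} else {
		/* queue the call_single_data and send the IPI */
		generic_exec_single(cpu, data);
	}
	put_cpu();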
Agreed on the self-documentation aspect.
Ingo
* RE: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-25 3:32 ` Nick Piggin
2009-02-25 12:47 ` Ingo Molnar
@ 2009-02-25 18:25 ` Luck, Tony
2009-03-17 18:16 ` Suresh Siddha
2 siblings, 0 replies; 49+ messages in thread
From: Luck, Tony @ 2009-02-25 18:25 UTC (permalink / raw)
To: Nick Piggin, Siddha, Suresh B
Cc: Ingo Molnar, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
steiner@sgi.com
> Just if I may add something -- I would probably slightly prefer if
> you explicitly used an sfence or other serializing instruction rather
> than smp_mb(). Maybe call it wrmsr_fence() or something. Apart from
> being self documenting, and less confusing (wrmsr is not part of
> normal ordering), I assume you technically also need it on UP
> systems?
Maybe I lost track of this thread ... but isn't this code for the
"send ipi" path? On a UP system do we use IPI to interrupt ourself?
Even if we did, presumably we can't get inconsistencies if there is
only one cpu.
-Tony
* Re: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many())
2009-02-25 3:32 ` Nick Piggin
2009-02-25 12:47 ` Ingo Molnar
2009-02-25 18:25 ` Luck, Tony
@ 2009-03-17 18:16 ` Suresh Siddha
2 siblings, 0 replies; 49+ messages in thread
From: Suresh Siddha @ 2009-03-17 18:16 UTC (permalink / raw)
To: Nick Piggin
Cc: Ingo Molnar, Peter Zijlstra, Oleg Nesterov, Jens Axboe,
Linus Torvalds, Paul E. McKenney, Rusty Russell, Steven Rostedt,
linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
steiner@sgi.com
On Tue, 2009-02-24 at 19:32 -0800, Nick Piggin wrote:
> Just if I may add something -- I would probably slightly prefer if
> you explicitly used an sfence or other serializing instruction rather
> than smp_mb(). Maybe call it wrmsr_fence() or something.
I was planning to send a cleanup patch doing this, but it looks like
Ingo never took the original patch. Here it is, appended below. Ingo,
please consider.
thanks,
suresh
---
From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: x86: Add x2apic_wrmsr_fence() to x2apic flush tlb paths
Uncached MMIO accesses for xapic are inherently serializing, and hence
we don't need explicit barriers for xapic IPI paths.

x2apic MSR writes/reads don't have serializing semantics, and hence need
a serializing instruction or mfence to make all the previous memory
stores globally visible before the x2apic MSR write for the IPI.

So instead of the smp_mb() in the generic flush tlb path, add
x2apic_wrmsr_fence() to the x2apic specific IPI paths.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 394d177..cfc0871 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -108,6 +108,16 @@ extern void native_apic_icr_write(u32 low, u32 id);
extern u64 native_apic_icr_read(void);
#ifdef CONFIG_X86_X2APIC
+/*
+ * Make previous memory operations globally visible before
+ * sending the IPI through x2apic wrmsr. We need a serializing instruction or
+ * mfence for this.
+ */
+static inline void x2apic_wrmsr_fence(void)
+{
+ asm volatile("mfence":::"memory");
+}
+
static inline void native_apic_msr_write(u32 reg, u32 v)
{
if (reg == APIC_DFR || reg == APIC_ID || reg == APIC_LDR ||
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index 8fb87b6..4a903e2 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -57,6 +57,8 @@ static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
unsigned long query_cpu;
unsigned long flags;
+ x2apic_wrmsr_fence();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
__x2apic_send_IPI_dest(
@@ -73,6 +75,8 @@ static void
unsigned long query_cpu;
unsigned long flags;
+ x2apic_wrmsr_fence();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
if (query_cpu == this_cpu)
@@ -90,6 +94,8 @@ static void x2apic_send_IPI_allbutself(int vector)
unsigned long query_cpu;
unsigned long flags;
+ x2apic_wrmsr_fence();
+
local_irq_save(flags);
for_each_online_cpu(query_cpu) {
if (query_cpu == this_cpu)
diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c
index 23625b9..a284359 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -58,6 +58,8 @@ static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
unsigned long query_cpu;
unsigned long flags;
+ x2apic_wrmsr_fence();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
__x2apic_send_IPI_dest(per_cpu(x86_cpu_to_apicid, query_cpu),
@@ -73,6 +75,8 @@ static void
unsigned long query_cpu;
unsigned long flags;
+ x2apic_wrmsr_fence();
+
local_irq_save(flags);
for_each_cpu(query_cpu, mask) {
if (query_cpu != this_cpu)
@@ -89,6 +93,8 @@ static void x2apic_send_IPI_allbutself(int vector)
unsigned long query_cpu;
unsigned long flags;
+ x2apic_wrmsr_fence();
+
local_irq_save(flags);
for_each_online_cpu(query_cpu) {
if (query_cpu == this_cpu)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a654d59..821e970 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -187,11 +187,6 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
cpumask, cpumask_of(smp_processor_id()));
/*
- * Make the above memory operations globally visible before
- * sending the IPI.
- */
- smp_mb();
- /*
* We have to send the IPI only to
* CPUs affected.
*/
end of thread (newest message: 2009-03-17 18:17 UTC)
Thread overview: 49+ messages
[not found] <20090216204902.GA6924@redhat.com>
[not found] ` <1234818201.30178.386.camel@laptop>
[not found] ` <20090216213205.GA9098@redhat.com>
[not found] ` <1234820704.30178.396.camel@laptop>
[not found] ` <20090216220214.GA10093@redhat.com>
[not found] ` <1234823097.30178.406.camel@laptop>
[not found] ` <20090216231946.GA12009@redhat.com>
[not found] ` <1234862974.4744.31.camel@laptop>
[not found] ` <20090217101130.GA8660@wotan.suse.de>
[not found] ` <1234866453.4744.58.camel@laptop>
2009-02-17 11:26 ` Q: smp.c && barriers (Was: [PATCH 1/4] generic-smp: remove single ipi fallback for smp_call_function_many()) Nick Piggin
2009-02-17 11:48 ` Peter Zijlstra
2009-02-17 15:51 ` Paul E. McKenney
2009-02-18 2:15 ` Suresh Siddha
2009-02-18 2:40 ` Paul E. McKenney
2009-02-17 19:28 ` Q: " Oleg Nesterov
2009-02-17 21:32 ` Paul E. McKenney
2009-02-17 21:45 ` Oleg Nesterov
2009-02-17 22:39 ` Paul E. McKenney
2009-02-18 13:52 ` Nick Piggin
2009-02-18 16:09 ` Linus Torvalds
2009-02-18 16:21 ` Ingo Molnar
2009-02-18 16:21 ` Ingo Molnar
2009-02-18 16:33 ` Linus Torvalds
2009-02-18 16:58 ` Ingo Molnar
2009-02-18 17:05 ` Ingo Molnar
2009-02-18 17:10 ` Ingo Molnar
2009-02-18 17:17 ` Linus Torvalds
2009-02-18 17:23 ` Ingo Molnar
2009-02-18 17:14 ` Linus Torvalds
2009-02-18 17:47 ` Ingo Molnar
2009-02-18 18:33 ` Suresh Siddha
2009-02-18 16:37 ` Gleb Natapov
2009-02-19 0:12 ` Nick Piggin
2009-02-19 6:47 ` Benjamin Herrenschmidt
2009-02-19 13:11 ` Nick Piggin
2009-02-19 13:11 ` Nick Piggin
2009-02-19 15:06 ` Ingo Molnar
2009-02-19 21:49 ` Benjamin Herrenschmidt
2009-02-18 2:21 ` Suresh Siddha
2009-02-18 13:59 ` Nick Piggin
2009-02-18 16:19 ` Linus Torvalds
2009-02-18 16:23 ` Ingo Molnar
2009-02-18 18:43 ` Suresh Siddha
2009-02-18 19:17 ` Ingo Molnar
2009-02-18 23:55 ` Suresh Siddha
2009-02-19 12:20 ` Ingo Molnar
2009-02-19 12:29 ` Nick Piggin
2009-02-19 12:45 ` Ingo Molnar
2009-02-19 22:00 ` Suresh Siddha
2009-02-20 10:56 ` Ingo Molnar
2009-02-20 18:56 ` Suresh Siddha
2009-02-20 18:56 ` Suresh Siddha
2009-02-20 19:40 ` Ingo Molnar
2009-02-20 23:28 ` Jack Steiner
2009-02-25 3:32 ` Nick Piggin
2009-02-25 12:47 ` Ingo Molnar
2009-02-25 18:25 ` Luck, Tony
2009-03-17 18:16 ` Suresh Siddha