linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] synchronize_irq needs a barrier
@ 2007-10-18  1:25 Benjamin Herrenschmidt
  2007-10-18  1:45 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-18  1:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, akpm, Linux Kernel list

synchronize_irq needs at the very least a compiler barrier and a
read barrier on SMP, but there are enough cases around where a
write barrier is also needed and it's not a hot path so I prefer
using a full smp_mb() here.

It will degrade to a compiler barrier on !SMP.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

Index: linux-work/kernel/irq/manage.c
===================================================================
--- linux-work.orig/kernel/irq/manage.c	2007-10-18 11:22:16.000000000 +1000
+++ linux-work/kernel/irq/manage.c	2007-10-18 11:22:20.000000000 +1000
@@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq)
 	if (irq >= NR_IRQS)
 		return;
 
+	smp_mb();
 	while (desc->status & IRQ_INPROGRESS)
 		cpu_relax();
 }

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18  1:25 [PATCH] synchronize_irq needs a barrier Benjamin Herrenschmidt
@ 2007-10-18  1:45 ` Andrew Morton
  2007-10-18  1:55   ` Benjamin Herrenschmidt
  2007-10-18  2:12 ` Linus Torvalds
  2007-10-20  2:02 ` Maxim Levitsky
  2 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2007-10-18  1:45 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, torvalds, linux-kernel

On Thu, 18 Oct 2007 11:25:42 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> synchronize_irq needs at the very least a compiler barrier and a
> read barrier on SMP,

Why?

> but there are enough cases around where a
> write barrier is also needed and it's not a hot path so I prefer
> using a full smp_mb() here.
> 
> It will degrade to a compiler barrier on !SMP.
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
> 
> Index: linux-work/kernel/irq/manage.c
> ===================================================================
> --- linux-work.orig/kernel/irq/manage.c	2007-10-18 11:22:16.000000000 +1000
> +++ linux-work/kernel/irq/manage.c	2007-10-18 11:22:20.000000000 +1000
> @@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq)
>  	if (irq >= NR_IRQS)
>  		return;
>  
> +	smp_mb();
>  	while (desc->status & IRQ_INPROGRESS)
>  		cpu_relax();
>  }

Anyone reading this code is going to ask "wtf is that for".  It needs a
comment telling them.


mb() is the new lock_kernel().  Sigh.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18  1:45 ` Andrew Morton
@ 2007-10-18  1:55   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-18  1:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linuxppc-dev, torvalds, linux-kernel


> > Index: linux-work/kernel/irq/manage.c
> > ===================================================================
> > --- linux-work.orig/kernel/irq/manage.c	2007-10-18 11:22:16.000000000 +1000
> > +++ linux-work/kernel/irq/manage.c	2007-10-18 11:22:20.000000000 +1000
> > @@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq)
> >  	if (irq >= NR_IRQS)
> >  		return;
> >  
> > +	smp_mb();
> >  	while (desc->status & IRQ_INPROGRESS)
> >  		cpu_relax();
> >  }
> 
> Anyone reading this code is going to ask "wtf is that for".  It needs a
> comment telling them.
> 
> 
> mb() is the new lock_kernel().  Sigh.

Ugh ?

That sounds fairly obvious to me :-) we are reading a value, that is
totally unordered, nothing to do about lock kernel or whatever, if we
want the above statement to make any sense in any kind of usage
scenario, it needs to be ordered vs. what happens before.

For example, take a construct like:

	device->my_hw_is_off = 1;
	synchronize_irq();
	turn_off_hardware();

That basically makes sure the irq either sees device->my_hw_is_off
being set to 1, or if an irq handler is already in progress and hasn't
seen it, we wait for it to complete.

(You can replace "hw_is_off" with anything that we want to set and make
sure the IRQ handler sees it before proceeding. It could be clearing a
pointer to something and make sure the irq sees it before freeing the
data, etc...).

I think pretty much any use of synchronize_irq() I can imagine needs
such kind of ordering... or it simply doesn't synchronize anything :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18  1:25 [PATCH] synchronize_irq needs a barrier Benjamin Herrenschmidt
  2007-10-18  1:45 ` Andrew Morton
@ 2007-10-18  2:12 ` Linus Torvalds
  2007-10-18  2:40   ` Benjamin Herrenschmidt
  2007-10-20  2:02 ` Maxim Levitsky
  2 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2007-10-18  2:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev list, akpm, Linux Kernel list



On Thu, 18 Oct 2007, Benjamin Herrenschmidt wrote:
>  
> +	smp_mb();
>  	while (desc->status & IRQ_INPROGRESS)
>  		cpu_relax();

So, what exactly does it protect against? At a minimum, this needs a 
comment in the changelog, and probably preferably in the source code too.

The thing is, synchronize_irq() can only protect against interrupts that 
are *already*running* on another CPU, and the caller must have made sure 
that no new interrupts are coming in (or at least that whatever new 
interrupts that come in will not pick up a certain piece of data). 

So I can imagine that the smb_mb() is there so that the caller - who has 
cleared some list or done something like that - will have any preceding 
writes that it did be serialized with actually checking the old state of 
"desc->status".

Fair enough - I can see that a smp_mb() is useful. But I think it needs 
documenting as such, and preferably with an example of how this actually 
happened in the first place (do you have one?)

The synchronize_irq() users that are really fundamental (ie the irq 
datastructures themselves) all should use the irq descriptor spinlock, and 
should *not* be needing any of this, since they would have serialized with 
whoever actually set the IRQ_INPROGRESS bit in the first place.

So in *that* sense, I think the memory barrier is useless, and I can't 
make up my mind whether it's good or bad. Which is why I'd really like to 
have an example scenario spelled out where it makes a difference..

Ok?

		Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18  2:12 ` Linus Torvalds
@ 2007-10-18  2:40   ` Benjamin Herrenschmidt
  2007-10-18  2:57     ` Benjamin Herrenschmidt
  2007-10-18 14:35     ` [PATCH] synchronize_irq needs a barrier Herbert Xu
  0 siblings, 2 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-18  2:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, akpm, Linux Kernel list


On Wed, 2007-10-17 at 19:12 -0700, Linus Torvalds wrote:
> 
> So, what exactly does it protect against? At a minimum, this needs a 
> comment in the changelog, and probably preferably in the source code too.

I replied to Andrew, but I agree, it's worth a comment, I'll add one.

> The thing is, synchronize_irq() can only protect against interrupts that 
> are *already*running* on another CPU, and the caller must have made sure 
> that no new interrupts are coming in (or at least that whatever new 
> interrupts that come in will not pick up a certain piece of data). 
> 
> So I can imagine that the smb_mb() is there so that the caller - who has 
> cleared some list or done something like that - will have any preceding 
> writes that it did be serialized with actually checking the old state of 
> "desc->status".

Yes.

> Fair enough - I can see that a smp_mb() is useful. But I think it needs 
> documenting as such, and preferably with an example of how this actually 
> happened in the first place (do you have one?)

One could argue that it's the caller that should do the mb() but that
would really be asking for trouble. Since most usage scenario require
it, and it's not a hot path, I prefer having it here.

Note that some kind of read barrier or compiler barrier should be needed
regardless, or we are just not sync'ing with anything at all (we may
have loaded the value ages ago and thus operate on a totally stale
value). I prefer a full barrier to also ensure all previous stores are
pushed out.

> The synchronize_irq() users that are really fundamental (ie the irq 
> datastructures themselves) all should use the irq descriptor spinlock, and 
> should *not* be needing any of this, since they would have serialized with 
> whoever actually set the IRQ_INPROGRESS bit in the first place.

That isn't always the case. You can have very efficient lock-less fast
path and use synchronize_irq as a kind of RCU-like way to have the slow
path do the right thing.

> So in *that* sense, I think the memory barrier is useless, and I can't 
> make up my mind whether it's good or bad. Which is why I'd really like to 
> have an example scenario spelled out where it makes a difference..

Well, typically, I can clear a pointer and want to make sure my IRQ
handler isn't using it anymore before freeing the memory. Or set a "HW
is off" flag. Having spinlocks all over isn't always the best approach
on things that are really hot path, it's fair enough to use lockless
approaches like that by moving the burden to the slow path that does
synchronize_irq.

In general, I tend to think that for this function to make any sense
(that is, to synchronize anything at all), it needs a barrier or you are
just making a decision based on a totally random value of desc->status
since it can have been re-ordered, speculatively loaded, pre-fetched,
whatever'ed... :-).

Want a respin with a comment ?

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18  2:40   ` Benjamin Herrenschmidt
@ 2007-10-18  2:57     ` Benjamin Herrenschmidt
  2007-10-18 14:56       ` Herbert Xu
  2007-10-18 14:35     ` [PATCH] synchronize_irq needs a barrier Herbert Xu
  1 sibling, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-18  2:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, akpm, Linux Kernel list


> 
> In general, I tend to think that for this function to make any sense
> (that is, to synchronize anything at all), it needs a barrier or you are
> just making a decision based on a totally random value of desc->status
> since it can have been re-ordered, speculatively loaded, pre-fetched,
> whatever'ed... :-).

Take a real life example:

drivers/message/fusion/mptbase.c

	/* Disable interrupts! */
	CHIPREG_WRITE32(&ioc->chip->IntMask, 0xFFFFFFFF);

	ioc->active = 0;
	synchronize_irq(pdev->irq);

And we aren't in a spinlock here. 

That's just a random example grepped.... I think I see a few more. Then,
some drivers like tg3 actually do an smp_mb() before calling
synchronize_irq(). But then, some don't.

I think trying to have all drivers be correct here is asking for
trouble, we'd rather have synchronize_irq() be uber-safe. It's not like
it was used in hot path anyway.

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18  2:40   ` Benjamin Herrenschmidt
  2007-10-18  2:57     ` Benjamin Herrenschmidt
@ 2007-10-18 14:35     ` Herbert Xu
  2007-10-18 21:35       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 48+ messages in thread
From: Herbert Xu @ 2007-10-18 14:35 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, akpm, torvalds, linux-kernel

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> 
> Note that some kind of read barrier or compiler barrier should be needed
> regardless, or we are just not sync'ing with anything at all (we may
> have loaded the value ages ago and thus operate on a totally stale
> value). I prefer a full barrier to also ensure all previous stores are
> pushed out.

We already have a compiler barrier there in the form of cpu_relax.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18  2:57     ` Benjamin Herrenschmidt
@ 2007-10-18 14:56       ` Herbert Xu
  2007-10-18 22:05         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 48+ messages in thread
From: Herbert Xu @ 2007-10-18 14:56 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, akpm, torvalds, linux-kernel

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> Take a real life example:
> 
> drivers/message/fusion/mptbase.c
> 
>        /* Disable interrupts! */
>        CHIPREG_WRITE32(&ioc->chip->IntMask, 0xFFFFFFFF);
> 
>        ioc->active = 0;
>        synchronize_irq(pdev->irq);
> 
> And we aren't in a spinlock here. 
> 
> That's just a random example grepped.... I think I see a few more. Then,
> some drivers like tg3 actually do an smp_mb() before calling
> synchronize_irq(). But then, some don't.

I really don't see what the point of the barrier would be here.
Barriers are generally useless unless you have a counter-part
on the other side.

The counterpart here is presumably the interrupt handler, which
should be terminated by the IO write above regardless of the
memory barrier.

Of course I might have missed your point.  If so please give
a description like this for the race that you see:

CPU1			CPU2
disable IRQ
			whatever the race is
synchronize_irq

> I think trying to have all drivers be correct here is asking for
> trouble, we'd rather have synchronize_irq() be uber-safe. It's not like
> it was used in hot path anyway.

While in general I'd agree with you about give latitude to
drivers, memory barriers I think is something that we can't
afford to.

The reason is that memory barries tend to come in pairs, e.g.,

CPU1			CPU2
write A
wmb
write B
			read B
			rmb
			read A

Taking away either barrier would render the other useless.

So whenever we add only one barrier for the benefit of driver
writers who don't bother to think about barriers we may not
be helping them at all.

In any case, such writers should use easier tools like spin
locks rather than trying to go lockless.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18 14:35     ` [PATCH] synchronize_irq needs a barrier Herbert Xu
@ 2007-10-18 21:35       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-18 21:35 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linuxppc-dev, akpm, torvalds, linux-kernel


On Thu, 2007-10-18 at 22:35 +0800, Herbert Xu wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > 
> > Note that some kind of read barrier or compiler barrier should be needed
> > regardless, or we are just not sync'ing with anything at all (we may
> > have loaded the value ages ago and thus operate on a totally stale
> > value). I prefer a full barrier to also ensure all previous stores are
> > pushed out.
> 
> We already have a compiler barrier there in the form of cpu_relax.

Isn't it too late ? The barrier should be before the test_bit, to
prevent it from moving up.

Ben.
 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18 14:56       ` Herbert Xu
@ 2007-10-18 22:05         ` Benjamin Herrenschmidt
  2007-10-18 22:52           ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-18 22:05 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linuxppc-dev, akpm, torvalds, linux-kernel


On Thu, 2007-10-18 at 22:56 +0800, Herbert Xu wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> >
> > Take a real life example:
> > 
> > drivers/message/fusion/mptbase.c
> > 
> >        /* Disable interrupts! */
> >        CHIPREG_WRITE32(&ioc->chip->IntMask, 0xFFFFFFFF);
> > 
> >        ioc->active = 0;
> >        synchronize_irq(pdev->irq);
> > 
> > And we aren't in a spinlock here. 
> > 
> > That's just a random example grepped.... I think I see a few more. Then,
> > some drivers like tg3 actually do an smp_mb() before calling
> > synchronize_irq(). But then, some don't.
> 
> I really don't see what the point of the barrier would be here.
> Barriers are generally useless unless you have a counter-part
> on the other side.

The barrier would guarantee that ioc->active (and in fact the write to
the chip too above) are globally visible before the read of the irq
status. There are two different issues not to mixup here. Any previous
store vs. a read (general case) and MMIO store of IRQ mask which has
issues of it's own.

> The counterpart here is presumably the interrupt handler, which
> should be terminated by the IO write above regardless of the
> memory barrier.
>
> Of course I might have missed your point.  If so please give
> a description like this for the race that you see:
> 
> CPU1			CPU2
> disable IRQ
> 			whatever the race is
> synchronize_irq

Let's ignore for a moment the generic problem of loads crossing previous
stores and focus on the pure issue of using MMIO writes to mask
interrupts.

Writing to a chip to disable an IRQ is generally never going to provide
any synchronisation guarantee. The MMIO write itself is asynchronous and
not ordered vs. a subsequent read from main storage (ie memory). So
unless you do an MMIO read to flush it, you basically haven't yet
disabled your IRQ don't know when it will be. That's one thing.

Another one is that even if you do an MMIO read to flush, your IRQ may
already have been latched by your PIC, or may be an MSI already issued
and still travelling along busses, and thus might well occur in a few
instructions. In that later case, note that ignoring it is perfectly
fine since the IRQ line will eventually go down (for a level IRQ that
is, for an edge IRQ, ignoring is always fine). That's what we cause
short interrupts, they can be common, though linux has historical issues
with them (unrelated to this discussion). But your interrupt handler
_will_ be called, and thus should be aware that your intend is to ignore
it.

So for both of the reasons above, the MMIO write doesn't mean you IRQ
won't happen and in fact, synchronize_irq() here is totally useless
since it won't prevent the IRQ from actually still happening just after
the test_bit of IRQ_INPROGRESS.

Now, the way to actually do that properly is to _also_ have a flag to
indicate your handler you don't want to deal with that interrupt. That
could be something along the lines of:

	writel(irq mask);
	wmb();
	chip->masked = 1;
 	smp_mb();
	synchronize_irq();

Now, the IRQ handler can just do

	if (chip->masked == 1)
		return <whatever>;

Another way is to use spinlocks of course, but we are talking about high
perf stuff that tries to avoid spinlocks on every IRQs, which is fair
enough, thus putting the synchronization burden on the slow path.

In the above example, you need that smp_mb() to make sure that the
effect of chip->masked = 1 is globally visible before the read in
synchronize_irq() is performed. Without that, that read could cross the
previous write, in fact, it may even travel all the way to before the
MMIO in some cases, and thus cause synchronize_irq() to operate on a
completely stale version of irq_desc->status, thus possibly bailing out
early while the IRQ is still happening and chip->masked not yet visible
to the other CPUs. 

In general, synchronize_irq() alone is always bogus, because that read
can travel up. That's why I believe it would simply make things simpler
and avoid problems to put the smp_mb(); inside synchronize_irq() (and same
goes with napi_synchronize() for the exact same reasons).

> While in general I'd agree with you about give latitude to
> drivers, memory barriers I think is something that we can't
> afford to.


> The reason is that memory barries tend to come in pairs, e.g.,
> 
> CPU1			CPU2
> write A
> wmb
> write B
> 			read B
> 			rmb
> 			read A
> 
> Taking away either barrier would render the other useless.
> 
> So whenever we add only one barrier for the benefit of driver
> writers who don't bother to think about barriers we may not
> be helping them at all.

In the case of synchronize_irq and my above example, yes, indeed, there
should be a pending barrier between setting IRQ_INPROGRESS and the
reading of chip->masked by the driver. This is a very good point because
I incorrectly assumed that the spin_unlock inside handle_*_irq before
calling the handler would do it, but it will not indeed. The spin_unlock
is only a write barrier, not a read barrier, it won't prevent a
following read from travelling back into the spinlock, potentially
before the setting of IRQ_INPROGRESS.

Thus, indeed, either an smp_rmb() should be used in the IRQ handler
before testing thus chip->masked, or we should stick one in
handle_IRQ_event for safety.

I'm pretty sure quite a few drivers are broken in that regard :-)

So maybe it's fair enough to say that barriers in those case should be
left as a responsibility to the callers. However, I still think that
synchronize_irq() without a barrier will generally not make any sense,
even if you should also generally have some kind of matching barrier
within whatever you are synchronizing with.

Ben.
 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18 22:05         ` Benjamin Herrenschmidt
@ 2007-10-18 22:52           ` Linus Torvalds
  2007-10-18 23:17             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2007-10-18 22:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, akpm, Herbert Xu, linux-kernel



On Fri, 19 Oct 2007, Benjamin Herrenschmidt wrote:
> 
> The barrier would guarantee that ioc->active (and in fact the write to
> the chip too above) are globally visible

No, it doesn't really guarantee that.

The thing is, there is no such thing as "globally visible".

There is a "ordering of visibility wrt CPU's", but it's not global, it's 
quite potentially per-CPU. So a barrier on one CPU doesn't guarantee 
anything at all without a barrier on the *other* CPU.

That said, the interrupt handling itself contains various barriers on the 
CPU's that receive interrupts, thanks to the spinlocking. But I do agree 
with Herbert that adding a "smb_mb()" is certainly in no way "obviously 
correct", because it doesn't talk about what the other side does wrt 
barriers and that word in memory.

		Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18 22:52           ` Linus Torvalds
@ 2007-10-18 23:17             ` Benjamin Herrenschmidt
  2007-10-18 23:39               ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-18 23:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev, akpm, Herbert Xu, linux-kernel


On Thu, 2007-10-18 at 15:52 -0700, Linus Torvalds wrote:
> 
> On Fri, 19 Oct 2007, Benjamin Herrenschmidt wrote:
> > 
> > The barrier would guarantee that ioc->active (and in fact the write to
> > the chip too above) are globally visible
> 
> No, it doesn't really guarantee that.
> 
> The thing is, there is no such thing as "globally visible".
> 
> There is a "ordering of visibility wrt CPU's", but it's not global, it's 
> quite potentially per-CPU. So a barrier on one CPU doesn't guarantee 
> anything at all without a barrier on the *other* CPU.
> 
> That said, the interrupt handling itself contains various barriers on the 
> CPU's that receive interrupts, thanks to the spinlocking. But I do agree 
> with Herbert that adding a "smb_mb()" is certainly in no way "obviously 
> correct", because it doesn't talk about what the other side does wrt 
> barriers and that word in memory.

I agree and you can see that in fact, we don't have enough barrier on
the other side since spin_unlock doesn't prevent subsequent loads from
crossing a previous store...

I wonder if that's worth trying to address, adding a barrier in
handle_IRQ_event for example, or we can continue ignoring the barrier
and let some drivers do their own fixes in fancy ways.

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18 23:17             ` Benjamin Herrenschmidt
@ 2007-10-18 23:39               ` Linus Torvalds
  2007-10-18 23:52                 ` Benjamin Herrenschmidt
  2007-10-19  2:32                 ` Herbert Xu
  0 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2007-10-18 23:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Herbert Xu, Linux Kernel Mailing List, linuxppc-dev,
	Thomas Gleixner, akpm, Ingo Molnar



On Fri, 19 Oct 2007, Benjamin Herrenschmidt wrote:
> 
> I agree and you can see that in fact, we don't have enough barrier on
> the other side since spin_unlock doesn't prevent subsequent loads from
> crossing a previous store...
> 
> I wonder if that's worth trying to address, adding a barrier in
> handle_IRQ_event for example, or we can continue ignoring the barrier
> and let some drivers do their own fixes in fancy ways.

So how about something like this (untested! not necessarily very well 
thought through either!)

Basic notion: the only thing that serializes the IRQ_INPROGRESS flag is 
the descriptor lock. And we don't have to (or even want to!) hold it while 
waiting for the thing, but we want to *have*held*it* in between whatever 
we're synchronizing with.

The internal irq handler functions already held the lock when they did 
whatever they need to serialize - and they are possibly performance 
critical too - so they use the "internal" function that doesn't get the 
lock unnecessarily again.

Hmm? 

		Linus

---
 kernel/irq/manage.c |   22 ++++++++++++++++++----
 1 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 80eab7a..f3e9575 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -14,6 +14,18 @@
 
 #include "internals.h"
 
+/*
+ * Internally wait for IRQ_INPROGRESS to go away on other CPU's,
+ * after having serialized with the descriptor lock.
+ */
+static inline void do_synchronize_irq(struct irq_desc *desc)
+{
+#ifdef CONFIG_SMP
+	while (desc->status & IRQ_INPROGRESS)
+		cpu_relax();
+#endif
+}
+
 #ifdef CONFIG_SMP
 
 /**
@@ -28,13 +40,15 @@
  */
 void synchronize_irq(unsigned int irq)
 {
+	unsigned long flags;
 	struct irq_desc *desc = irq_desc + irq;
 
 	if (irq >= NR_IRQS)
 		return;
 
-	while (desc->status & IRQ_INPROGRESS)
-		cpu_relax();
+	spin_lock_irqsave(&desc->lock, flags);
+	spin_unlock_irqrestore(&desc->lock, flags);
+	do_synchronize_irq(desc);
 }
 EXPORT_SYMBOL(synchronize_irq);
 
@@ -129,7 +143,7 @@ void disable_irq(unsigned int irq)
 
 	disable_irq_nosync(irq);
 	if (desc->action)
-		synchronize_irq(irq);
+		do_synchronize_irq(desc);
 }
 EXPORT_SYMBOL(disable_irq);
 
@@ -443,7 +457,7 @@ void free_irq(unsigned int irq, void *dev_id)
 			unregister_handler_proc(irq, action);
 
 			/* Make sure it's not being used on another CPU */
-			synchronize_irq(irq);
+			do_synchronize_irq(desc);
 #ifdef CONFIG_DEBUG_SHIRQ
 			/*
 			 * It's a shared IRQ -- the driver ought to be

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18 23:39               ` Linus Torvalds
@ 2007-10-18 23:52                 ` Benjamin Herrenschmidt
  2007-10-19  2:32                 ` Herbert Xu
  1 sibling, 0 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-18 23:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Herbert Xu, Linux Kernel Mailing List, linuxppc-dev,
	Thomas Gleixner, akpm, Ingo Molnar


> So how about something like this (untested! not necessarily very well 
> thought through either!)
> 
> Basic notion: the only thing that serializes the IRQ_INPROGRESS flag is 
> the descriptor lock. And we don't have to (or even want to!) hold it while 
> waiting for the thing, but we want to *have*held*it* in between whatever 
> we're synchronizing with.
> 
> The internal irq handler functions already held the lock when they did 
> whatever they need to serialize - and they are possibly performance 
> critical too - so they use the "internal" function that doesn't get the 
> lock unnecessarily again.

That may do the trick as the read can't cross the spin_lock (it can
cross spin_unlock but not lock). Advantage over adding a barrier to
handle_IRQ_event() is that it keeps the overhead to the slow path
(synchronize_irq).

Note that I didn't actually experience a problem here. I just came upon
that by accident while thinking about a similar issue I have with
napi_synchronize().

Looks good to me on a first glance (unfortunately a bit ugly but heh).

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18 23:39               ` Linus Torvalds
  2007-10-18 23:52                 ` Benjamin Herrenschmidt
@ 2007-10-19  2:32                 ` Herbert Xu
  2007-10-19  2:52                   ` Nick Piggin
  2007-10-19  2:55                   ` Linus Torvalds
  1 sibling, 2 replies; 48+ messages in thread
From: Herbert Xu @ 2007-10-19  2:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Ingo Molnar

On Thu, Oct 18, 2007 at 04:39:59PM -0700, Linus Torvalds wrote:
>
> Basic notion: the only thing that serializes the IRQ_INPROGRESS flag is 
> the descriptor lock. And we don't have to (or even want to!) hold it while 
> waiting for the thing, but we want to *have*held*it* in between whatever 
> we're synchronizing with.

Thinking about this again and checking code I have come to the
conclusion that

1) There is a race (in some drivers) even with the full mb.
2) Linus's patch fixes it.
3) It's difficult to fix it elegantly in the driver.

In other words I think this patch is great :)

First of all let's agree on some basic assumptions:

* A pair of spin lock/unlock subsumes the effect of a full mb.
* A spin lock in general only equates to (SS/SL/LL).
* A spin unlock in general only equates to (SS/LS).

In particular, a load after a spin unlock may pass before the
spin unlock.

Here is the race (with tg3 as the only example that I know of).
The driver attempts to quiesce interrupts such that after the
call to synchronize_irq it is intended that no further IRQ
handler calls for that device will do any work besides acking
the IRQ.

Here is how it does it:

CPU0				CPU1
				spin lock
					load irq_sync
irq_sync = 1
smp_mb
synchronize_irq()
	while (IRQ_INPROGRESS)
		wait
	return
				set IRQ_INPROGRESS
				spin unlock
				tg3_msi
					ack IRQ
					if (irq_sync)
						return
					do work

The problem here is that load of irq_sync in the handler has
passed above the setting of IRQ_INPROGRESS.

Linus's patch fixes it because this becomes:

CPU0				CPU1
				spin lock
					load irq_sync
irq_sync = 1
smp_mb
synchronize_irq
				set IRQ_INPROGRESS
				spin unlock
	spin lock
	spin unlock
				tg3_msi
					ack IRQ
					if (irq_sync)
						return
					do work
	while (IRQ_INPROGRESS)
		wait
				spin lock
				clear IRQ_INPROGRESS
				spin unlock
	return

Even though we still do the work on the right we will now notice
the INPROGRESS flag on the left and wait.

It's hard to fix this in the drivers because they'd either have
to access the desc lock or add a full mb to the fast path on the
right.

Once this goes in we can also remove the smp_mb from tg3.c.  BTW,
a lot of drivers (including the fusion example Ben quoted) call
synchronize_irq before free_irq.  This is unnecessary because
the latter already calls it anyway.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  2:32                 ` Herbert Xu
@ 2007-10-19  2:52                   ` Nick Piggin
  2007-10-19  3:28                     ` Herbert Xu
  2007-10-19  2:55                   ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Nick Piggin @ 2007-10-19  2:52 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Linus Torvalds, Ingo Molnar

On Friday 19 October 2007 12:32, Herbert Xu wrote:

> First of all let's agree on some basic assumptions:
>
> * A pair of spin lock/unlock subsumes the effect of a full mb.

Not unless you mean a pair of spin lock/unlock as in
2 spin lock/unlock pairs (4 operations).

*X = 10;
spin_lock(&lock);
/* *Y speculatively loaded here */
/* store to *X leaves CPU store queue here */
spin_unlock(&lock);
y = *Y;

> * A spin lock in general only equates to (SS/SL/LL).
> * A spin unlock in general only equates to (SS/LS).

I don't use the sparc barriers, so they don't come naturally to
me ;)

I think both loads and stores can pass into the critical section
by having the spin_lock pass earlier ops, or by having spin_unlock
be passed by later ones.


> In particular, a load after a spin unlock may pass before the
> spin unlock.
>
> Here is the race (with tg3 as the only example that I know of).
> The driver attempts to quiesce interrupts such that after the
> call to synchronize_irq it is intended that no further IRQ
> handler calls for that device will do any work besides acking
> the IRQ.
>
> Here is how it does it:
>
> CPU0				CPU1
> 				spin lock
> 					load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq()
> 	while (IRQ_INPROGRESS)
> 		wait
> 	return
> 				set IRQ_INPROGRESS
> 				spin unlock
> 				tg3_msi
> 					ack IRQ
> 					if (irq_sync)
> 						return
> 					do work
>
> The problem here is that load of irq_sync in the handler has
> passed above the setting of IRQ_INPROGRESS.
>
> Linus's patch fixes it because this becomes:
>
> CPU0				CPU1
> 				spin lock
> 					load irq_sync
> irq_sync = 1
> smp_mb
> synchronize_irq
> 				set IRQ_INPROGRESS
> 				spin unlock
> 	spin lock
> 	spin unlock
> 				tg3_msi
> 					ack IRQ
> 					if (irq_sync)
> 						return
> 					do work
> 	while (IRQ_INPROGRESS)
> 		wait
> 				spin lock
> 				clear IRQ_INPROGRESS
> 				spin unlock
> 	return
>
> Even though we still do the work on the right we will now notice
> the INPROGRESS flag on the left and wait.
>
> It's hard to fix this in the drivers because they'd either have
> to access the desc lock or add a full mb to the fast path on the
> right.
>
> Once this goes in we can also remove the smp_mb from tg3.c.  BTW,
> a lot of drivers (including the fusion example Ben quoted) call
> synchronize_irq before free_irq.  This is unnecessary because
> the latter already calls it anyway.
>
> Cheers,

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  2:32                 ` Herbert Xu
  2007-10-19  2:52                   ` Nick Piggin
@ 2007-10-19  2:55                   ` Linus Torvalds
  2007-10-19  3:26                     ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2007-10-19  2:55 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Ingo Molnar



On Fri, 19 Oct 2007, Herbert Xu wrote:
>
> In other words I think this patch is great :)

Hey, I appreciate it, but I do think that the "spinlock only to 
immediately unlock it again" is pretty horrid.

I'm convinced that there should be some way to do this without actually 
taking the lock. I *think* it should work with something like

	for (;;) {
		smp_rmb();
		if (!spin_is_locked(&desc->lock)) {
			smp_rmb();
			if (!(desc->status & IRQ_INPROGRESS)
				break;
		}
		cpu_relax();
	}

instead. Which basically requires that we see the descriptor lock being 
not held, before we see the IRQ_INPROGRESS bit being clear. Put another 
way: it loops until it sees the lock having been released, and the 
IRQ_INPROGRESS bit being clear after that.

The above requires no serializing instructions on x86, which is a good 
goal (now that smp_rmb() is just a compiler barrier). And it doesn't 
actually have to bounce any cachelines.

And it doesn't have that ugly "get lock only to release it", which just 
makes me go "Eww!".

But it's a bit subtler. It basically depends on the fact that 
spin_unlock() obviously has to make sure that there is a release barrier 
in the unlock, so any writes done (to the IRQ_INPROGRESS bit) within the 
locked region *must* be visible before the spinlock itself has been 
released.

So somebody should:
 - use another pair of eyes and brains to back me up on this.
 - write up some coherent changelog entry, using the emails that have 
   passed back and forth.
 - actually turn the above into a tested patch with a comment.

And I'm pushing for that "somebody" being somebody else than me ;)

			Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  2:55                   ` Linus Torvalds
@ 2007-10-19  3:26                     ` Linus Torvalds
  2007-10-19  4:11                       ` Benjamin Herrenschmidt
  2007-10-19  4:20                       ` Herbert Xu
  0 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2007-10-19  3:26 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Ingo Molnar



On Thu, 18 Oct 2007, Linus Torvalds wrote:
>
> I *think* it should work with something like
> 
> 	for (;;) {
> 		smp_rmb();
> 		if (!spin_is_locked(&desc->lock)) {
> 			smp_rmb();
> 			if (!(desc->status & IRQ_INPROGRESS)
> 				break;
> 		}
> 		cpu_relax();
> 	}

I'm starting to doubt this. 

One of the issues is that we still need the smp_mb() in front of the loop 
(because we want to serialize the loop with any writes in the caller).

The other issue is that I don't think it's enough that we saw the 
descriptor lock unlocked, and then the IRQ_INPROGRESS bit clear. It might 
have been unlocked *while* the IRQ was in progress, but the interrupt 
handler is now in its last throes, and re-takes the spinlock and clears 
the IRQ_INPROGRESS thing. But we're not actually happy until we've seen 
the IRQ_INPROGRESS bit clear and the spinlock has been released *again*.

So those two tests should actually be the other way around: we want to see 
the IRQ_INPROGRESS bit clear first.

It's all just too damn subtle and clever. Something like this should not 
need to be that subtle. 

Maybe the rigth thing to do is to not rely on *any* ordering what-so-ever, 
and just make the rule be: "if you look at the IRQ_INPROGRESS bit, you'd 
better hold the descriptor spinlock", and not have any subtle ordering 
issues at all.

But that makes us have a loop with getting/releasing the lock all the 
time, and then we get back to horrid issues with cacheline bouncing and 
unfairness of cache accesses across cores (ie look at the issues we had 
with the runqueue starvation in wait_task_inactive()).

Those were fixed by starting out with the non-locked and totally unsafe 
versions, but then having one last "check with lock held, and repeat only 
if that says things went south". 

See commit fa490cfd15d7ce0900097cc4e60cfd7a76381138 and ponder. Maybe we 
should take the same approach here, and do something like

	repeat:
		/* Optimistic, no-locking loop */
		while (desc->status & IRQ_INPROGRESS)
			cpu_relax();

		/* Ok, that indicated we're done: double-check carefully */
		spin_lock_irqsave(&desc->lock, flags);
		status = desc->status;
		spin_unlock_irqrestore(&desc->lock, flags);

		/* Oops, that failed? */
		if (status & IRQ_INPROGRESS)
			goto repeat;

Hmm?

			Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  2:52                   ` Nick Piggin
@ 2007-10-19  3:28                     ` Herbert Xu
  2007-10-19  4:49                       ` Nick Piggin
  0 siblings, 1 reply; 48+ messages in thread
From: Herbert Xu @ 2007-10-19  3:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: herbert, linux-kernel, linuxppc-dev, tglx, akpm, torvalds, mingo

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>> First of all let's agree on some basic assumptions:
>>
>> * A pair of spin lock/unlock subsumes the effect of a full mb.
> 
> Not unless you mean a pair of spin lock/unlock as in
> 2 spin lock/unlock pairs (4 operations).
> 
> *X = 10;
> spin_lock(&lock);
> /* *Y speculatively loaded here */
> /* store to *X leaves CPU store queue here */
> spin_unlock(&lock);
> y = *Y;

Good point.

Although in this case we're still safe because in the worst
cases:

CPU0				CPU1
irq_sync = 1
synchronize_irq
	spin lock
	load IRQ_INPROGRESS
irq_sync sync is visible
	spin unlock
				spin lock
					load irq_sync
	while (IRQ_INPROGRESS)
		wait
	return
				set IRQ_INPROGRESS
				spin unlock
				tg3_msi
					ack IRQ
					if (irq_sync)
						return
				spin lock
				clear IRQ_INPROGRESS
				spin unlock

------------------------------------------------------------

CPU0				CPU1
				spin lock
					load irq_sync
irq_sync = 1
synchronize_irq
				set IRQ_INPROGRESS
				spin unlock
	spin lock
	load IRQ_INPROGRESS
irq_sync sync is visible
	spin unlock
	while (IRQ_INPROGRESS)
		wait
				tg3_msi
					ack IRQ
					if (irq_sync)
						return
					do work
				spin lock
				clear IRQ_INPROGRESS
				spin unlock
	return

So because we're using the same lock on both sides, it does
do the right thing even without the memory barrier.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  3:26                     ` Linus Torvalds
@ 2007-10-19  4:11                       ` Benjamin Herrenschmidt
  2007-10-19  4:26                         ` Benjamin Herrenschmidt
  2007-10-19  4:20                       ` Herbert Xu
  1 sibling, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-19  4:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Herbert Xu, Linux Kernel Mailing List, linuxppc-dev,
	Thomas Gleixner, akpm, Ingo Molnar


> 
> 	repeat:
> 		/* Optimistic, no-locking loop */
> 		while (desc->status & IRQ_INPROGRESS)
> 			cpu_relax();
> 
> 		/* Ok, that indicated we're done: double-check carefully */
> 		spin_lock_irqsave(&desc->lock, flags);
> 		status = desc->status;
> 		spin_unlock_irqrestore(&desc->lock, flags);
> 
> 		/* Oops, that failed? */
> 		if (status & IRQ_INPROGRESS)
> 			goto repeat;
> 
> Hmm?

Paulus and I convinced ourselves that this would work. If we call our
variable that gets set before synchronize_irq and read in the IRQ
handler "foo", we get to:

	- writing foo can travel down at most to just before the unlock in the
code above

	- reading foo can travel up out of the IRQ handler at most just after
the lock in the code that sets IRQ_INPROGRESS.

The whole lock/set IRQ_INPROGRESS/unlock path can then only happen
before the locked section above, in which case we see and wait nicely
and all is good, or after, in which case the store to foo will be
visible to the IRQ handler as it will be ordered with the unlock in the
code above.

Pfiew !

So Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Thanks !

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  3:26                     ` Linus Torvalds
  2007-10-19  4:11                       ` Benjamin Herrenschmidt
@ 2007-10-19  4:20                       ` Herbert Xu
  2007-10-19  4:29                         ` Benjamin Herrenschmidt
                                           ` (3 more replies)
  1 sibling, 4 replies; 48+ messages in thread
From: Herbert Xu @ 2007-10-19  4:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Ingo Molnar

On Thu, Oct 18, 2007 at 08:26:45PM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 18 Oct 2007, Linus Torvalds wrote:
> >
> > I *think* it should work with something like
> > 
> > 	for (;;) {
> > 		smp_rmb();
> > 		if (!spin_is_locked(&desc->lock)) {
> > 			smp_rmb();
> > 			if (!(desc->status & IRQ_INPROGRESS)
> > 				break;
> > 		}
> > 		cpu_relax();
> > 	}
> 
> I'm starting to doubt this. 
> 
> One of the issues is that we still need the smp_mb() in front of the loop 
> (because we want to serialize the loop with any writes in the caller).

Right.  I think if we accept the definition of a spin lock/unlock
that Nick outlined earlier, then we can see that on the IRQ path
there simply isn't a memory barrier between the changing of the
status field and the execution of the action:

	spin_lock
	set IRQ_INPROGRESS
	spin_unlock

	action

	spin_lock
	clear IRQ_INPROGRESS
	spin_unlock

What may happen is that action can either float upwards to give

	spin_lock
	action
	set IRQ_INPROGRESS
	spin_unlock

	spin_lock
	clear IRQ_INPROGRESS
	spin_unlock

or it can float downwards to give

	spin_lock
	set IRQ_INPROGRESS
	spin_unlock

	spin_lock
	clear IRQ_INPROGRESS
	action
	spin_unlock

As a result we can add as many barriers as we want on the slow
(synchronize_irq) path and it just won't do any good (not unless
we add some barriers on the IRQ path == fast path).

What we do have on the right though is the fact in some ways
action behaves as if it's inside the spin lock even though it's
not.  In particular, action cannot float up past the first spin
lock nor can it float down past the last spin unlock.

> See commit fa490cfd15d7ce0900097cc4e60cfd7a76381138 and ponder. Maybe we 
> should take the same approach here, and do something like
> 
> 	repeat:
> 		/* Optimistic, no-locking loop */
> 		while (desc->status & IRQ_INPROGRESS)
> 			cpu_relax();
> 
> 		/* Ok, that indicated we're done: double-check carefully */
> 		spin_lock_irqsave(&desc->lock, flags);
> 		status = desc->status;
> 		spin_unlock_irqrestore(&desc->lock, flags);
> 
> 		/* Oops, that failed? */
> 		if (status & IRQ_INPROGRESS)
> 			goto repeat;

That's why I think this patch is in fact the only one that
solves all the races in this thread.  The case that it solves
which the lock/unlock patch does not is the one where action
flows downwards past the clearing of IRQ_INPROGRESS.  I missed
this case earlier.

In fact this bug exists elsewhere too.  For example, the network
stack does this in net/sched/sch_generic.c:

        /* Wait for outstanding qdisc_run calls. */
	while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
		yield();

This has the same problem as the current synchronize_irq code.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  4:11                       ` Benjamin Herrenschmidt
@ 2007-10-19  4:26                         ` Benjamin Herrenschmidt
  2007-10-19  5:53                           ` Herbert Xu
  0 siblings, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-19  4:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Herbert Xu, Linux Kernel Mailing List, linuxppc-dev,
	Thomas Gleixner, akpm, Ingo Molnar


> The whole lock/set IRQ_INPROGRESS/unlock path can then only happen
> before the locked section above, in which case we see and wait nicely
> and all is good, or after, in which case the store to foo will be
> visible to the IRQ handler as it will be ordered with the unlock in the
> code above.

Note that napi_synchronize needs a slightly different treatement.

Here, the situation boils down to:

one CPU does:

	foo = 1;
	while(test_bit(bar))
		barrier();

and the other:

	if (!test_and_set_bit(bar)) {
		read & use foo
		smp_mb__before_clear_bit();
		clear_bit(bar);
	}

The good thing here is that read & use foo is part of the critical
section (I hate hand-made locks ...) defined by bar which makes
things somewhat easier than the synchronize_irq() case.

I think a simple smp_mb(); here after foo = 1; is enough, which means
basically just having an smp_mp(); inside napi_synchronize(), before
the test_bit(). Or do I miss something ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  4:20                       ` Herbert Xu
@ 2007-10-19  4:29                         ` Benjamin Herrenschmidt
  2007-10-19  4:35                         ` Benjamin Herrenschmidt
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-19  4:29 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Linus Torvalds, Ingo Molnar


On Fri, 2007-10-19 at 12:20 +0800, Herbert Xu wrote:
> 
> That's why I think this patch is in fact the only one that
> solves all the races in this thread.  The case that it solves
> which the lock/unlock patch does not is the one where action
> flows downwards past the clearing of IRQ_INPROGRESS.  I missed
> this case earlier.
> 
> In fact this bug exists elsewhere too.  For example, the network
> stack does this in net/sched/sch_generic.c:
> 
>         /* Wait for outstanding qdisc_run calls. */
>         while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
>                 yield();
> 
> This has the same problem as the current synchronize_irq code.

The network stack always made be nervous with it's bitops use as locks.
Any loop of test_bit() alone as a synchronisation method, without locks
or barriers is, imho, inherently buggy.

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  4:20                       ` Herbert Xu
  2007-10-19  4:29                         ` Benjamin Herrenschmidt
@ 2007-10-19  4:35                         ` Benjamin Herrenschmidt
  2007-10-19  4:48                         ` Herbert Xu
  2007-10-19  5:36                         ` [NET]: Fix possible dev_deactivate race condition Herbert Xu
  3 siblings, 0 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-19  4:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Linus Torvalds, Ingo Molnar


> What may happen is that action can either float upwards to give
> 
> 	spin_lock
> 	action
> 	set IRQ_INPROGRESS
> 	spin_unlock
> 
> 	spin_lock
> 	clear IRQ_INPROGRESS
> 	spin_unlock
> 
> or it can float downwards to give
> 
> 	spin_lock
> 	set IRQ_INPROGRESS
> 	spin_unlock
> 
> 	spin_lock
> 	clear IRQ_INPROGRESS
> 	action
> 	spin_unlock
> 

Well, we are generally safer here. That is, unless action is a store,
it will not pass spin_lock, at least not on powerpc afaik.

In fact, if action, as it is in our case, is something like

if (foo)
	return;

We cant execute the store inside spin_lock() without having loaded
foo, there is an implicit dependency here.

But anyway, Linus patch fixes that too if it was a problem. Now if
we grep for while (test_bit and while (!test_bit I'm sure we'll find
other similar surprises.

That's also one of the reasons why I _love_ nick patches that add a
proper lock/unlock API on bits, because at least those who are doing
those hand-made pseudo-locks with bitops to save space will be
getting a proper lock/unlock API, no more excuse.

The network stack is doing more fancy things so it's harder (though I
wonder sometimes if it's really worth the risks taken for not using
spinlocks... maybe).

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  4:20                       ` Herbert Xu
  2007-10-19  4:29                         ` Benjamin Herrenschmidt
  2007-10-19  4:35                         ` Benjamin Herrenschmidt
@ 2007-10-19  4:48                         ` Herbert Xu
  2007-10-19  4:58                           ` Benjamin Herrenschmidt
  2007-10-19  5:36                         ` [NET]: Fix possible dev_deactivate race condition Herbert Xu
  3 siblings, 1 reply; 48+ messages in thread
From: Herbert Xu @ 2007-10-19  4:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Ingo Molnar

On Fri, Oct 19, 2007 at 12:20:25PM +0800, Herbert Xu wrote:
>
> That's why I think this patch is in fact the only one that
> solves all the races in this thread.  The case that it solves
> which the lock/unlock patch does not is the one where action
> flows downwards past the clearing of IRQ_INPROGRESS.  I missed
> this case earlier.

OK, here is the patch again with a changelog:

[IRQ]: Fix synchronize_irq races with IRQ handler

As it is some callers of synchronize_irq rely on memory barriers
to provide synchronisation against the IRQ handlers.  For example,
the tg3 driver does

	tp->irq_sync = 1;
	smp_mb();
	synchronize_irq();

and then in the IRQ handler:

	if (!tp->irq_sync)
		netif_rx_schedule(dev, &tp->napi);

Unfortunately memory barriers only work well when they come in
pairs.  Because we don't actually have memory barriers on the
IRQ path, the memory barrier before the synchronize_irq() doesn't
actually protect us.

In particular, synchronize_irq() may return followed by the
result of netif_rx_schedule being made visible.

This patch (mostly written by Linus) fixes this by using spin
locks instead of memory barries on the synchronize_irq() path.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 80eab7a..1f31422 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -29,12 +29,28 @@
 void synchronize_irq(unsigned int irq)
 {
 	struct irq_desc *desc = irq_desc + irq;
+	unsigned int status;
 
 	if (irq >= NR_IRQS)
 		return;
 
-	while (desc->status & IRQ_INPROGRESS)
-		cpu_relax();
+	do {
+		unsigned long flags;
+
+		/*
+		 * Wait until we're out of the critical section.  This might
+		 * give the wrong answer due to the lack of memory barriers.
+		 */
+		while (desc->status & IRQ_INPROGRESS)
+			cpu_relax();
+
+		/* Ok, that indicated we're done: double-check carefully. */
+		spin_lock_irqsave(&desc->lock, flags);
+		status = desc->status;
+		spin_unlock_irqrestore(&desc->lock, flags);
+
+		/* Oops, that failed? */
+	} while (status & IRQ_INPROGRESS);
 }
 EXPORT_SYMBOL(synchronize_irq);
 

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  3:28                     ` Herbert Xu
@ 2007-10-19  4:49                       ` Nick Piggin
  0 siblings, 0 replies; 48+ messages in thread
From: Nick Piggin @ 2007-10-19  4:49 UTC (permalink / raw)
  To: Herbert Xu; +Cc: linux-kernel, linuxppc-dev, tglx, akpm, torvalds, mingo

On Friday 19 October 2007 13:28, Herbert Xu wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >> First of all let's agree on some basic assumptions:
> >>
> >> * A pair of spin lock/unlock subsumes the effect of a full mb.
> >
> > Not unless you mean a pair of spin lock/unlock as in
> > 2 spin lock/unlock pairs (4 operations).
> >
> > *X = 10;
> > spin_lock(&lock);
> > /* *Y speculatively loaded here */
> > /* store to *X leaves CPU store queue here */
> > spin_unlock(&lock);
> > y = *Y;
>
> Good point.
>
> Although in this case we're still safe because in the worst
> cases:
>
> CPU0				CPU1
> irq_sync = 1
> synchronize_irq
> 	spin lock
> 	load IRQ_INPROGRESS
> irq_sync sync is visible
> 	spin unlock
> 				spin lock
> 					load irq_sync
> 	while (IRQ_INPROGRESS)
> 		wait
> 	return
> 				set IRQ_INPROGRESS
> 				spin unlock
> 				tg3_msi
> 					ack IRQ
> 					if (irq_sync)
> 						return
> 				spin lock
> 				clear IRQ_INPROGRESS
> 				spin unlock
>
> ------------------------------------------------------------
>
> CPU0				CPU1
> 				spin lock
> 					load irq_sync
> irq_sync = 1
> synchronize_irq
> 				set IRQ_INPROGRESS
> 				spin unlock
> 	spin lock
> 	load IRQ_INPROGRESS
> irq_sync sync is visible
> 	spin unlock
> 	while (IRQ_INPROGRESS)
> 		wait
> 				tg3_msi
> 					ack IRQ
> 					if (irq_sync)
> 						return
> 					do work
> 				spin lock
> 				clear IRQ_INPROGRESS
> 				spin unlock
> 	return
>
> So because we're using the same lock on both sides, it does
> do the right thing even without the memory barrier.


Yeah, if you've got the lock on both sides there, then I
agree it will be correct.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  4:48                         ` Herbert Xu
@ 2007-10-19  4:58                           ` Benjamin Herrenschmidt
  2007-10-21 21:10                             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-19  4:58 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Linus Torvalds, Ingo Molnar


On Fri, 2007-10-19 at 12:48 +0800, Herbert Xu wrote:

> [IRQ]: Fix synchronize_irq races with IRQ handler
> 
> As it is some callers of synchronize_irq rely on memory barriers
> to provide synchronisation against the IRQ handlers.  For example,
> the tg3 driver does
> 
> 	tp->irq_sync = 1;
> 	smp_mb();
> 	synchronize_irq();
> 
> and then in the IRQ handler:
> 
> 	if (!tp->irq_sync)
> 		netif_rx_schedule(dev, &tp->napi);
> 
> Unfortunately memory barriers only work well when they come in
> pairs.  Because we don't actually have memory barriers on the
> IRQ path, the memory barrier before the synchronize_irq() doesn't
> actually protect us.
> 
> In particular, synchronize_irq() may return followed by the
> result of netif_rx_schedule being made visible.
> 
> This patch (mostly written by Linus) fixes this by using spin
> locks instead of memory barries on the synchronize_irq() path.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Good for me.

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [NET]: Fix possible dev_deactivate race condition
  2007-10-19  4:20                       ` Herbert Xu
                                           ` (2 preceding siblings ...)
  2007-10-19  4:48                         ` Herbert Xu
@ 2007-10-19  5:36                         ` Herbert Xu
  2007-10-19  5:38                           ` David Miller
  2007-10-19  7:35                           ` Peter Zijlstra
  3 siblings, 2 replies; 48+ messages in thread
From: Herbert Xu @ 2007-10-19  5:36 UTC (permalink / raw)
  To: David S. Miller
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, netdev,
	akpm, Linus Torvalds, Ingo Molnar

On Fri, Oct 19, 2007 at 12:20:25PM +0800, Herbert Xu wrote:
>
> In fact this bug exists elsewhere too.  For example, the network
> stack does this in net/sched/sch_generic.c:
> 
>         /* Wait for outstanding qdisc_run calls. */
> 	while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
> 		yield();
> 
> This has the same problem as the current synchronize_irq code.

OK here is the fix for that case.

[NET]: Fix possible dev_deactivate race condition

The function dev_deactivate is supposed to only return when
all outstanding transmissions have completed.  Unfortunately
it is possible for store operations in the driver's transmit
function to only become visible after dev_deactivate returns.

This patch fixes this by taking the queue lock after we see
the end of the queue run.  This ensures that all effects of
any previous transmit calls are visible.

If however we detect that there is another queue run occuring,
then we'll warn about it because this should never happen as
we have pointed dev->qdisc to noop_qdisc within the same queue
lock earlier in the functino.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e01d576..b3b7420 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -556,6 +556,7 @@ void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
 	struct sk_buff *skb;
+	int running;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -571,12 +572,31 @@ void dev_deactivate(struct net_device *dev)
 
 	dev_watchdog_down(dev);
 
-	/* Wait for outstanding dev_queue_xmit calls. */
+	/* Wait for outstanding qdisc-less dev_queue_xmit calls. */
 	synchronize_rcu();
 
 	/* Wait for outstanding qdisc_run calls. */
-	while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		yield();
+	do {
+		while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
+			yield();
+
+		/*
+		 * Double-check inside queue lock to ensure that all effects
+		 * of the queue run are visible when we return.
+		 */
+		spin_lock_bh(&dev->queue_lock);
+		running = test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
+		spin_unlock_bh(&dev->queue_lock);
+
+		/*
+		 * The running flag should never be set at this point because
+		 * we've already set dev->qdisc to noop_qdisc *inside* the same
+		 * pair of spin locks.  That is, if any qdisc_run starts after
+		 * our initial test it should see the noop_qdisc and then
+		 * clear the RUNNING bit before dropping the queue lock.  So
+		 * if it is set here then we've found a bug.
+		 */
+	} while (WARN_ON_ONCE(running));
 }
 
 void dev_init_scheduler(struct net_device *dev)

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [NET]: Fix possible dev_deactivate race condition
  2007-10-19  5:36                         ` [NET]: Fix possible dev_deactivate race condition Herbert Xu
@ 2007-10-19  5:38                           ` David Miller
  2007-10-19  7:35                           ` Peter Zijlstra
  1 sibling, 0 replies; 48+ messages in thread
From: David Miller @ 2007-10-19  5:38 UTC (permalink / raw)
  To: herbert; +Cc: linux-kernel, linuxppc-dev, tglx, netdev, akpm, torvalds, mingo

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 19 Oct 2007 13:36:24 +0800

> [NET]: Fix possible dev_deactivate race condition
> 
> The function dev_deactivate is supposed to only return when
> all outstanding transmissions have completed.  Unfortunately
> it is possible for store operations in the driver's transmit
> function to only become visible after dev_deactivate returns.
> 
> This patch fixes this by taking the queue lock after we see
> the end of the queue run.  This ensures that all effects of
> any previous transmit calls are visible.
> 
> If however we detect that there is another queue run occuring,
> then we'll warn about it because this should never happen as
> we have pointed dev->qdisc to noop_qdisc within the same queue
> lock earlier in the functino.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks Herbert!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  4:26                         ` Benjamin Herrenschmidt
@ 2007-10-19  5:53                           ` Herbert Xu
  0 siblings, 0 replies; 48+ messages in thread
From: Herbert Xu @ 2007-10-19  5:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Linus Torvalds, Ingo Molnar

On Fri, Oct 19, 2007 at 02:26:54PM +1000, Benjamin Herrenschmidt wrote:
>
> I think a simple smp_mb(); here after foo = 1; is enough, which means
> basically just having an smp_mp(); inside napi_synchronize(), before
> the test_bit(). Or do I miss something ?

Yes I think you're right.  In this case we do have barriers
everywhere else so this should work.

Although if you want napi_synchronize to have the property that
when it returns all NAPI processing effects are visible then
you'd need another smp_mb() after the loop.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [NET]: Fix possible dev_deactivate race condition
  2007-10-19  5:36                         ` [NET]: Fix possible dev_deactivate race condition Herbert Xu
  2007-10-19  5:38                           ` David Miller
@ 2007-10-19  7:35                           ` Peter Zijlstra
  2007-10-19  9:29                             ` Herbert Xu
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2007-10-19  7:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, David S. Miller, linuxppc-dev,
	Thomas Gleixner, netdev, akpm, Linus Torvalds, Ingo Molnar

On Fri, 2007-10-19 at 13:36 +0800, Herbert Xu wrote:
> On Fri, Oct 19, 2007 at 12:20:25PM +0800, Herbert Xu wrote:
> >
> > In fact this bug exists elsewhere too.  For example, the network
> > stack does this in net/sched/sch_generic.c:
> > 
> >         /* Wait for outstanding qdisc_run calls. */
> >       while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
> >               yield();
> > 
> > This has the same problem as the current synchronize_irq code.
> 

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index e01d576..b3b7420 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -556,6 +556,7 @@ void dev_deactivate(struct net_device *dev)
>  {
>         struct Qdisc *qdisc;
>         struct sk_buff *skb;
> +       int running;
>  
>         spin_lock_bh(&dev->queue_lock);
>         qdisc = dev->qdisc;
> @@ -571,12 +572,31 @@ void dev_deactivate(struct net_device *dev)
>  
>         dev_watchdog_down(dev);
>  
> -       /* Wait for outstanding dev_queue_xmit calls. */
> +       /* Wait for outstanding qdisc-less dev_queue_xmit calls. */
>         synchronize_rcu();
>  
>         /* Wait for outstanding qdisc_run calls. */
> -       while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
> -               yield();
> +       do {
> +               while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
> +                       yield();
> +

Ouch!, is there really no sane locking alternative? Hashed waitqueues
like for the page lock come to mind.

> +               /*
> +                * Double-check inside queue lock to ensure that all effects
> +                * of the queue run are visible when we return.
> +                */
> +               spin_lock_bh(&dev->queue_lock);
> +               running = test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
> +               spin_unlock_bh(&dev->queue_lock);
> +
> +               /*
> +                * The running flag should never be set at this point because
> +                * we've already set dev->qdisc to noop_qdisc *inside* the same
> +                * pair of spin locks.  That is, if any qdisc_run starts after
> +                * our initial test it should see the noop_qdisc and then
> +                * clear the RUNNING bit before dropping the queue lock.  So
> +                * if it is set here then we've found a bug.
> +                */
> +       } while (WARN_ON_ONCE(running));
>  }
>  
>  void dev_init_scheduler(struct net_device *dev) 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [NET]: Fix possible dev_deactivate race condition
  2007-10-19  7:35                           ` Peter Zijlstra
@ 2007-10-19  9:29                             ` Herbert Xu
  0 siblings, 0 replies; 48+ messages in thread
From: Herbert Xu @ 2007-10-19  9:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux Kernel Mailing List, David S. Miller, linuxppc-dev,
	Thomas Gleixner, netdev, akpm, Linus Torvalds, Ingo Molnar

On Fri, Oct 19, 2007 at 09:35:19AM +0200, Peter Zijlstra wrote:
>
> >         /* Wait for outstanding qdisc_run calls. */
> > -       while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
> > -               yield();
> > +       do {
> > +               while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
> > +                       yield();
> > +
> 
> Ouch!, is there really no sane locking alternative? Hashed waitqueues
> like for the page lock come to mind.

Well if we ever moved the transmission to full process context
then we'll gladly accept your patch :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-18  1:25 [PATCH] synchronize_irq needs a barrier Benjamin Herrenschmidt
  2007-10-18  1:45 ` Andrew Morton
  2007-10-18  2:12 ` Linus Torvalds
@ 2007-10-20  2:02 ` Maxim Levitsky
  2007-10-20  2:25   ` Linus Torvalds
                     ` (2 more replies)
  2 siblings, 3 replies; 48+ messages in thread
From: Maxim Levitsky @ 2007-10-20  2:02 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list

On Thursday 18 October 2007 03:25:42 Benjamin Herrenschmidt wrote:
> synchronize_irq needs at the very least a compiler barrier and a
> read barrier on SMP, but there are enough cases around where a
> write barrier is also needed and it's not a hot path so I prefer
> using a full smp_mb() here.
> 
> It will degrade to a compiler barrier on !SMP.
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---
> 
> Index: linux-work/kernel/irq/manage.c
> ===================================================================
> --- linux-work.orig/kernel/irq/manage.c	2007-10-18 11:22:16.000000000 +1000
> +++ linux-work/kernel/irq/manage.c	2007-10-18 11:22:20.000000000 +1000
> @@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq)
>  	if (irq >= NR_IRQS)
>  		return;
>  
> +	smp_mb();
>  	while (desc->status & IRQ_INPROGRESS)
>  		cpu_relax();
>  }
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

Hi,

I have read this thread and I concluded few things:

1) It is impossible to know that the card won't send more interrupts:
Even if I do a read from the device, the IRQ can be pending in the bus/APIC
It is even possible (and likely) that the IRQ line will be shared, thus the 
handler can be called by non-relevant device.

2) the synchronize_irq(); in .suspend is useless:
an IRQ can happen immediately after this synchronize_irq();
and interrupt even the .suspend()
(At least theoretically)


Thus I now understand that .suspend() should do:

	saa_writel(SAA7134_IRQ1, 0);
	saa_writel(SAA7134_IRQ2, 0);
	saa_writel(SAA7134_MAIN_CTRL, 0);

	dev->insuspend = 1;
	smp_wmb();

	/* at that point the _request to disable card's IRQs was issued, we don't know 
	   that there will be no irqs anymore.
	   the smp_mb(); guaranties that the IRQ handler will bail out in that case. */
	
	/* .......*/

	pci_save_state(pci_dev);
	pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
	return 0;

and the interrupt handler:

	smp_rmb();
	if (dev->insuspend)
		goto out;

Am I right?
	Best regards,
		Maxim Levitsky

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  2:02 ` Maxim Levitsky
@ 2007-10-20  2:25   ` Linus Torvalds
  2007-10-20  3:10     ` Maxim Levitsky
  2007-10-20  4:04     ` Benjamin Herrenschmidt
  2007-10-20  3:37   ` Herbert Xu
  2007-10-20  3:56   ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2007-10-20  2:25 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: akpm, Linux Kernel list, linuxppc-dev list



On Sat, 20 Oct 2007, Maxim Levitsky wrote:
> 
> and the interrupt handler:
> 
> 	smp_rmb();
> 	if (dev->insuspend)
> 		goto out;

Something like that can work, yes. However, you need to make sure that:

 - even when you ignore the interrupt (because the driver doesn't care, 
   it's suspending), you need to make sure the hardware gets shut up by 
   reading (or writing) the proper interrupt status register.

   Otherwise, with a level interrupt, the interrupt will continue to be 
   held active ("screaming") and the CPU will get interrupted over and 
   over again, until the irq subsystem will just turn the irq off 
   entirely.

 - when you resume, make sure that you get the engine going again, with 
   the understanding that some interrupts may have gotten ignored.

Also, in general, these kinds of things shouldn't always even be 
neicessary. If you use the suspend_late()/resume_early() hooks, those will 
be called with interrupts off, and while they can be harder to debug (and 
may be worth avoiding for non-critical drivers), they do allow for simpler 
models partly exactly because they don't need to worry about interrupts 
etc.

		Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  2:25   ` Linus Torvalds
@ 2007-10-20  3:10     ` Maxim Levitsky
  2007-10-20  4:06       ` Benjamin Herrenschmidt
  2007-10-20  4:04     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2007-10-20  3:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: akpm, Linux Kernel list, linuxppc-dev list

On Saturday 20 October 2007 04:25:34 Linus Torvalds wrote:
> 
> On Sat, 20 Oct 2007, Maxim Levitsky wrote:
> > 
> > and the interrupt handler:
> > 
> > 	smp_rmb();
> > 	if (dev->insuspend)
> > 		goto out;
> 
> Something like that can work, yes. However, you need to make sure that:
> 
>  - even when you ignore the interrupt (because the driver doesn't care, 
>    it's suspending), you need to make sure the hardware gets shut up by 
>    reading (or writing) the proper interrupt status register.
I agree, but while device is powered off, its registers can't be accessed
Thus, if I ack the IRQ every time the handler is called, I will access the 
powered off device (this is probably won't hurt a lot, but a bit incorrectly)

> 
>    Otherwise, with a level interrupt, the interrupt will continue to be 
>    held active ("screaming") and the CPU will get interrupted over and 
>    over again, until the irq subsystem will just turn the irq off 
>    entirely.
> 
>  - when you resume, make sure that you get the engine going again, with 
>    the understanding that some interrupts may have gotten ignored.
Here it isn't a problem, this is a video capture card, and I suspend it by just stopping dma
on all active buffers even if in the middle of capture, and I send those buffers to card again
to fill them from the beginning during the resume.
> 
> Also, in general, these kinds of things shouldn't always even be 
> neicessary. If you use the suspend_late()/resume_early() hooks, those will 
> be called with interrupts off, and while they can be harder to debug (and 
> may be worth avoiding for non-critical drivers), they do allow for simpler 
> models partly exactly because they don't need to worry about interrupts 
> etc.
Exactly, I am aware of suspend_late , but I don't want to abuse it.
> 
> 		Linus
> 


Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  2:02 ` Maxim Levitsky
  2007-10-20  2:25   ` Linus Torvalds
@ 2007-10-20  3:37   ` Herbert Xu
  2007-10-20  3:56   ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 48+ messages in thread
From: Herbert Xu @ 2007-10-20  3:37 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: akpm, Linus Torvalds, Linux Kernel list, linuxppc-dev list

On Sat, Oct 20, 2007 at 02:02:42AM +0000, Maxim Levitsky wrote:
>
> Thus I now understand that .suspend() should do:
> 
> 	saa_writel(SAA7134_IRQ1, 0);
> 	saa_writel(SAA7134_IRQ2, 0);
> 	saa_writel(SAA7134_MAIN_CTRL, 0);
> 
> 	dev->insuspend = 1;
> 	smp_wmb();

If we patch synchronize_irq() as discussed here then you can
use it in place of smp_wmb() and remove the smp_rmb from the
interrupt handler (the latter is the path that matters).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  2:02 ` Maxim Levitsky
  2007-10-20  2:25   ` Linus Torvalds
  2007-10-20  3:37   ` Herbert Xu
@ 2007-10-20  3:56   ` Benjamin Herrenschmidt
  2007-10-20  4:24     ` Maxim Levitsky
  2 siblings, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-20  3:56 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list


> I have read this thread and I concluded few things:
> 
> 1) It is impossible to know that the card won't send more interrupts:
> Even if I do a read from the device, the IRQ can be pending in the bus/APIC
> It is even possible (and likely) that the IRQ line will be shared, thus the 
> handler can be called by non-relevant device.
> 
> 2) the synchronize_irq(); in .suspend is useless:
> an IRQ can happen immediately after this synchronize_irq();
> and interrupt even the .suspend()
> (At least theoretically)

It's not totally useless not. It guarantees that by the time your
are out of your suspend(), a simultaneous instance of the IRQ handler
is either finished, or it (or any subsequent occurence) have seen
your insuspend flag set to 1.

> Thus I now understand that .suspend() should do:
> 
> 	saa_writel(SAA7134_IRQ1, 0);
> 	saa_writel(SAA7134_IRQ2, 0);
> 	saa_writel(SAA7134_MAIN_CTRL, 0);
> 
> 	dev->insuspend = 1;
> 	smp_wmb();
> 
> 	/* at that point the _request to disable card's IRQs was issued, we don't know 
> 	   that there will be no irqs anymore.
> 	   the smp_mb(); guaranties that the IRQ handler will bail out in that case. */
> 	
> 	/* .......*/
> 
> 	pci_save_state(pci_dev);
> 	pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
> 	return 0;

The above doesn't handle the case where the IRQ handle just passed the
if (insuspend) test. You may end up calling pci_set_power_state() while
in the middle of some further HW accesses in the handler.

You still need synchronize_irq() for that reason. Or use a spinlock.

> and the interrupt handler:
> 
> 	smp_rmb();
> 	if (dev->insuspend)
> 		goto out;
> 
> Am I right?

Not quite :-)

Also not that the work we are doing with synchronize_irq, if it gets
merged, renders it unnecessary for you to add those two memory barriers,
synchronize_irq will pretty much do the ordering for you.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  2:25   ` Linus Torvalds
  2007-10-20  3:10     ` Maxim Levitsky
@ 2007-10-20  4:04     ` Benjamin Herrenschmidt
  2007-10-20  4:09       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-20  4:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, akpm, Maxim Levitsky, Linux Kernel list


On Fri, 2007-10-19 at 19:25 -0700, Linus Torvalds wrote:
> 
> On Sat, 20 Oct 2007, Maxim Levitsky wrote:
> > 
> > and the interrupt handler:
> > 
> > 	smp_rmb();
> > 	if (dev->insuspend)
> > 		goto out;
> 
> Something like that can work, yes. However, you need to make sure that:
> 
>  - even when you ignore the interrupt (because the driver doesn't care, 
>    it's suspending), you need to make sure the hardware gets shut up by 
>    reading (or writing) the proper interrupt status register.
>
>    Otherwise, with a level interrupt, the interrupt will continue to be 
>    held active ("screaming") and the CPU will get interrupted over and 
>    over again, until the irq subsystem will just turn the irq off 
>    entirely.

His suspend routine wrote to the IRQ mask (or equivalent) register in
his code example, thus the HW should shut up eventually, thus that isn't
strictly necessary, the IRQ in that case is just a "short
interrupt" (noticed by the PIC and delivered but possibly not asserted
anymore at this stage or about to go down).

We have many scenario generating such short interrupts (pretty much any
time we use interrupt disable/enable on the chip, for example it's
common with NAPI).

This is one of the reasons why we used to have a "timebomb" causing an
IRQ to erroneously get disabled by the IRQ core assuming the IRQ was
stuck, just because we accumulated too many short interrupts on that
line. A recent patch by Alan fixed that to use some more sensible
algorithm checking the time since the last spurrious interrupt, though I
suspect he's being too optimistic on his timeout value and some
scenario, such as NAPI with just the wrong packet rate, can still
trigger the bug. I need  to find a piece of HW slow enough in
de-asserting the IRQ to try to make a repro-case one of these days.

>  - when you resume, make sure that you get the engine going again, with 
>    the understanding that some interrupts may have gotten ignored.

And you forgot that he still needs synchronize_irq(), or his IRQ handler
might well be in the middle of doing something on another CPU, past the
point where it tested "insuspend", which will do no good when a
simultaneous pci_set_power_state() puts the chip into D3. On some
machines, this will machine check. Either that or he needs a spinlock in
his IRQ handler that he also takes around the code that sets insuspend. 

> Also, in general, these kinds of things shouldn't always even be 
> neicessary. If you use the suspend_late()/resume_early() hooks, those will 
> be called with interrupts off, and while they can be harder to debug (and 
> may be worth avoiding for non-critical drivers), they do allow for simpler 
> models partly exactly because they don't need to worry about interrupts 
> etc.

Yes, this is a easier way to deal with devices that are on a directly
mapped bus (path to them isn't gone by the time you hit suspend_late)
and don't need to do fancy things involing waiting for a queue to drain
using interrupts etc... (at one stage of HW complexity, having to
re-implement polled versions of everything is just not worth the
benefit). In general, suspend_late should cover the need of most PCI
devices I suppose.

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  3:10     ` Maxim Levitsky
@ 2007-10-20  4:06       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-20  4:06 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list


> >  - even when you ignore the interrupt (because the driver doesn't care, 
> >    it's suspending), you need to make sure the hardware gets shut up by 
> >    reading (or writing) the proper interrupt status register.
> I agree, but while device is powered off, its registers can't be accessed
> Thus, if I ack the IRQ every time the handler is called, I will access the 
> powered off device (this is probably won't hurt a lot, but a bit incorrectly)

It will actually crash your machine on some platforms. So no, best is to
-not- ack. The masking is enough, the IRQ will go down eventually.

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  4:04     ` Benjamin Herrenschmidt
@ 2007-10-20  4:09       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-20  4:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linuxppc-dev list, akpm, Maxim Levitsky, Linux Kernel list


> >  - even when you ignore the interrupt (because the driver doesn't care, 
> >    it's suspending), you need to make sure the hardware gets shut up by 
> >    reading (or writing) the proper interrupt status register.
> >
> >    Otherwise, with a level interrupt, the interrupt will continue to be 
> >    held active ("screaming") and the CPU will get interrupted over and 
> >    over again, until the irq subsystem will just turn the irq off 
> >    entirely.
> 
> His suspend routine wrote to the IRQ mask (or equivalent) register in
> his code example, thus the HW should shut up eventually, thus that isn't
> strictly necessary, the IRQ in that case is just a "short
> interrupt" (noticed by the PIC and delivered but possibly not asserted
> anymore at this stage or about to go down).

In fact, he -must not- ack it. Because is the HW is really down (in D3),
got knows what accessing the ACK register will do. I can give you
ideas... from nothing on most x86 desktops to machine checks on most
powerpc machines, though I could imagine some cards bad enough to lock
your bus up if you try to access a register while they are in D3 (I've
seen some of those critters and it seems not all bridges timeout on
cards that just keep sending retries).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  3:56   ` Benjamin Herrenschmidt
@ 2007-10-20  4:24     ` Maxim Levitsky
  2007-10-20  5:04       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2007-10-20  4:24 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list

On Saturday 20 October 2007 05:56:01 Benjamin Herrenschmidt wrote:
> 
> > I have read this thread and I concluded few things:
> > 
> > 1) It is impossible to know that the card won't send more interrupts:
> > Even if I do a read from the device, the IRQ can be pending in the bus/APIC
> > It is even possible (and likely) that the IRQ line will be shared, thus the 
> > handler can be called by non-relevant device.
> > 
> > 2) the synchronize_irq(); in .suspend is useless:
> > an IRQ can happen immediately after this synchronize_irq();
> > and interrupt even the .suspend()
> > (At least theoretically)
> 
> It's not totally useless not. It guarantees that by the time your
> are out of your suspend(), a simultaneous instance of the IRQ handler
> is either finished, or it (or any subsequent occurence) have seen
> your insuspend flag set to 1.
> 
> > Thus I now understand that .suspend() should do:
> > 
> > 	saa_writel(SAA7134_IRQ1, 0);
> > 	saa_writel(SAA7134_IRQ2, 0);
> > 	saa_writel(SAA7134_MAIN_CTRL, 0);
> > 
> > 	dev->insuspend = 1;
> > 	smp_wmb();
> > 
> > 	/* at that point the _request to disable card's IRQs was issued, we don't know 
> > 	   that there will be no irqs anymore.
> > 	   the smp_mb(); guaranties that the IRQ handler will bail out in that case. */
> > 	
> > 	/* .......*/
> > 
> > 	pci_save_state(pci_dev);
> > 	pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
> > 	return 0;
> 
> The above doesn't handle the case where the IRQ handle just passed the
> if (insuspend) test. You may end up calling pci_set_power_state() while
> in the middle of some further HW accesses in the handler.
> 
> You still need synchronize_irq() for that reason. Or use a spinlock.
> 
> > and the interrupt handler:
> > 
> > 	smp_rmb();
> > 	if (dev->insuspend)
> > 		goto out;
> > 
> > Am I right?
> 
> Not quite :-)
> 
> Also not that the work we are doing with synchronize_irq, if it gets
> merged, renders it unnecessary for you to add those two memory barriers,
> synchronize_irq will pretty much do the ordering for you.
> 
> Cheers,
> Ben.
> 
> 


Fine, I Agree, and this is why I used it in first place, I just forgot.
To wait for already running IRQ handler, that tested the dev->insuspend.
(And I assumed that interrupts are disabled on the cpu that runs .suspend, but I know now
that this isn't true.)

Besides that can I ask few more .suspend/resume questions:

1) some drivers use pci_disable_device(), and pci_enable_device().
should I use it too?

2) I accidentally did this:
	pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
	pci_save_state(pci_dev);

I somehow thought that this is correct, that I should save the pci config state
after the power-down, but now I know that it isn't correct.
I now need to send a patch for dmfe.c network driver that has the same commands written by me.
(but it works perfectly anyway)
Is it possible to access pci configuration space in D3?

And lastly speaking of network drivers, one issue came to my mind:
most network drivars has a packet queue and in case of dmfe it is located in main memory,
and card does dma from it.


in .suspend I ignore that some packets may be in that queue, and I want
to ask, whenever there are better ways to do that.


this is my dmfe .suspend routine.

	/* Disable upper layer interface */
	netif_device_detach(dev);

	/* Disable Tx/Rx */
	db->cr6_data &= ~(CR6_RXSC | CR6_TXSC);
	update_cr6(db->cr6_data, dev->base_addr);

	/* Disable Interrupt */
	outl(0, dev->base_addr + DCR7);
	outl(inl (dev->base_addr + DCR5), dev->base_addr + DCR5);

	/* Fre RX buffers */
	dmfe_free_rxbuffer(db);

	/* Enable WOL */
	pci_read_config_dword(pci_dev, 0x40, &tmp);
	tmp &= ~(DMFE_WOL_LINKCHANGE|DMFE_WOL_MAGICPACKET);

	if (db->wol_mode & WAKE_PHY)
		tmp |= DMFE_WOL_LINKCHANGE;
	if (db->wol_mode & WAKE_MAGIC)
		tmp |= DMFE_WOL_MAGICPACKET;

	pci_write_config_dword(pci_dev, 0x40, tmp);

	pci_enable_wake(pci_dev, PCI_D3hot, 1);
	pci_enable_wake(pci_dev, PCI_D3cold, 1);

	/* Power down device*/
	pci_set_power_state(pci_dev, pci_choose_state (pci_dev,state));
	pci_save_state(pci_dev);


I guess, everybody makes mistakes... :-)

Other network drivers has a bit more complicated .suspend/.resume routines, 
but I didn't see a driver waiting for output queue to finish

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  4:24     ` Maxim Levitsky
@ 2007-10-20  5:04       ` Benjamin Herrenschmidt
  2007-10-20  5:36         ` Maxim Levitsky
  0 siblings, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-20  5:04 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list


> 1) some drivers use pci_disable_device(), and pci_enable_device().
> should I use it too?

I generally don't do the former, and I would expect the late to be done
by pci_restore_state() for you. pci_disable_device(), last I looked,
only cleared the bus master bit though, which might be a good idea to
do.

People in ACPI/x86 land, are there more good reasons to do one or the
other ?

That reminds me that I volunteered to write a documentation on how
drivers should do all that stuff at KS and didn't get to actually doing
it yet. shame ... I'll try to start something asap.

> 2) I accidentally did this:
> 	pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
> 	pci_save_state(pci_dev);
> 
> I somehow thought that this is correct, that I should save the pci config state
> after the power-down, but now I know that it isn't correct.

Right, you need to do the save_state before the power down. You need to
avoid pretty much any access to the device after the save state other
than the pending set_power_state on resume.

> I now need to send a patch for dmfe.c network driver that has the same commands written by me.
> (but it works perfectly anyway)

On x86 desktop... might have surprises on others.

> Is it possible to access pci configuration space in D3?

It's only really safe to access the PM register itself, though I suppose
you should be able to walk the capability chain to do that. But I
wouldnt recommend doing anything else.

> And lastly speaking of network drivers, one issue came to my mind:
> most network drivars has a packet queue and in case of dmfe it is located in main memory,
> and card does dma from it.

Note that the network stack nowadays does a fair bit of cleaning up for
you before your suspend routine is called....
> 
> in .suspend I ignore that some packets may be in that queue, and I want
> to ask, whenever there are better ways to do that.
> 
> 
> this is my dmfe .suspend routine.
> 
> 	/* Disable upper layer interface */
> 	netif_device_detach(dev);

The above -might- not be needed any more, I have to check. I used to do
it too on my drivers.

> 	/* Disable Tx/Rx */
> 	db->cr6_data &= ~(CR6_RXSC | CR6_TXSC);
> 	update_cr6(db->cr6_data, dev->base_addr);
> 
> 	/* Disable Interrupt */
> 	outl(0, dev->base_addr + DCR7);
> 	outl(inl (dev->base_addr + DCR5), dev->base_addr + DCR5);
> 
> 	/* Fre RX buffers */
> 	dmfe_free_rxbuffer(db);
> 
> 	/* Enable WOL */
> 	pci_read_config_dword(pci_dev, 0x40, &tmp);
> 	tmp &= ~(DMFE_WOL_LINKCHANGE|DMFE_WOL_MAGICPACKET);
> 
> 	if (db->wol_mode & WAKE_PHY)
> 		tmp |= DMFE_WOL_LINKCHANGE;
> 	if (db->wol_mode & WAKE_MAGIC)
> 		tmp |= DMFE_WOL_MAGICPACKET;
> 
> 	pci_write_config_dword(pci_dev, 0x40, tmp);
> 
> 	pci_enable_wake(pci_dev, PCI_D3hot, 1);
> 	pci_enable_wake(pci_dev, PCI_D3cold, 1);
> 
> 	/* Power down device*/
> 	pci_set_power_state(pci_dev, pci_choose_state (pci_dev,state));
> 	pci_save_state(pci_dev);
> 

Looks allright on a quick glance appart from the bits we already
discussed.

> I guess, everybody makes mistakes... :-)
> 
> Other network drivers has a bit more complicated .suspend/.resume routines, 
> but I didn't see a driver waiting for output queue to finish

I think the network stack does that nowadays but we'll have to double
check, that's based on what DaveM told me at KS.

Ben. 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  5:04       ` Benjamin Herrenschmidt
@ 2007-10-20  5:36         ` Maxim Levitsky
  2007-10-20  5:46           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2007-10-20  5:36 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list

On Saturday 20 October 2007 07:04:35 Benjamin Herrenschmidt wrote:
> 
> > 1) some drivers use pci_disable_device(), and pci_enable_device().
> > should I use it too?
> 
> I generally don't do the former, and I would expect the late to be done
> by pci_restore_state() for you. pci_disable_device(), last I looked,
> only cleared the bus master bit though, which might be a good idea to
> do.
> 
> People in ACPI/x86 land, are there more good reasons to do one or the
> other ?
> 
> That reminds me that I volunteered to write a documentation on how
> drivers should do all that stuff at KS and didn't get to actually doing
> it yet. shame ... I'll try to start something asap.
> 
> > 2) I accidentally did this:
> > 	pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
> > 	pci_save_state(pci_dev);
> > 
> > I somehow thought that this is correct, that I should save the pci config state
> > after the power-down, but now I know that it isn't correct.
> 
> Right, you need to do the save_state before the power down. You need to
> avoid pretty much any access to the device after the save state other
> than the pending set_power_state on resume.
> 
> > I now need to send a patch for dmfe.c network driver that has the same commands written by me.
> > (but it works perfectly anyway)
> 
> On x86 desktop... might have surprises on others.
> 
> > Is it possible to access pci configuration space in D3?
> 
> It's only really safe to access the PM register itself, though I suppose
> you should be able to walk the capability chain to do that. But I
> wouldnt recommend doing anything else.
> 
> > And lastly speaking of network drivers, one issue came to my mind:
> > most network drivars has a packet queue and in case of dmfe it is located in main memory,
> > and card does dma from it.
> 
> Note that the network stack nowadays does a fair bit of cleaning up for
> you before your suspend routine is called....
> > 
> > in .suspend I ignore that some packets may be in that queue, and I want
> > to ask, whenever there are better ways to do that.
> > 
> > 
> > this is my dmfe .suspend routine.
> > 
> > 	/* Disable upper layer interface */
> > 	netif_device_detach(dev);
> 

> 
> Looks allright on a quick glance appart from the bits we already
> discussed.


> 
> > I guess, everybody makes mistakes... :-)
> > 
> > Other network drivers has a bit more complicated .suspend/.resume routines, 
> > but I didn't see a driver waiting for output queue to finish
> 
> I think the network stack does that nowadays but we'll have to double
> check, that's based on what DaveM told me at KS.
> 
> Ben. 
> 
> 

Hi,

Thanks a lot.
I fix the order of calls in dmfe.c
and in saa7134-core.c.

I probably need to add this synchronize_irq() logic in dmfe.c too, but I probably do it later,
I think I am overestimating this race, since most drivers don't do dev->insuspend checks in IRQ handler.
Maybe even just use free_irq() after all....


Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  5:36         ` Maxim Levitsky
@ 2007-10-20  5:46           ` Benjamin Herrenschmidt
  2007-10-20  6:06             ` Maxim Levitsky
  0 siblings, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-20  5:46 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list



> I probably need to add this synchronize_irq() logic in dmfe.c too, but I probably do it later,
> I think I am overestimating this race, since most drivers don't do dev->insuspend checks in IRQ handler.
> Maybe even just use free_irq() after all....

Most drivers are probably underestimating the race :-)

free_irq() would work provided that you did the masking on chip before
(and unmask only after request_irq on the way back in). But it's a bit
like using a 10 tons truck to crush an ant... 

Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  5:46           ` Benjamin Herrenschmidt
@ 2007-10-20  6:06             ` Maxim Levitsky
  2007-10-20  6:13               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 48+ messages in thread
From: Maxim Levitsky @ 2007-10-20  6:06 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list

On Saturday 20 October 2007 07:46:24 Benjamin Herrenschmidt wrote:
> 
> > I probably need to add this synchronize_irq() logic in dmfe.c too, but I probably do it later,
> > I think I am overestimating this race, since most drivers don't do dev->insuspend checks in IRQ handler.
> > Maybe even just use free_irq() after all....
> 
> Most drivers are probably underestimating the race :-)
> 
> free_irq() would work provided that you did the masking on chip before
> (and unmask only after request_irq on the way back in). But it's a bit
> like using a 10 tons truck to crush an ant... 
Agreed.

So, I will add synchronize_irq() to both saa7134, and dmfe, the two drivers that their .suspend/.resume 
routines were written by me.

I already added a synchronize_irq()  plus few more fixes to the driver , but those patches are still in v4l tree.

I now has this:

	saa_writel(SAA7134_IRQ1, 0);
	saa_writel(SAA7134_IRQ2, 0);
	saa_writel(SAA7134_MAIN_CTRL, 0);

	synchronize_irq(pci_dev->irq);
	dev->insuspend = 1;

and I will probably need (with the synchronize_irq patch applied)


	/* Disable interrupts, DMA, and rest of the chip*/
	saa_writel(SAA7134_IRQ1, 0);
	saa_writel(SAA7134_IRQ2, 0);
	saa_writel(SAA7134_MAIN_CTRL, 0);
	
	dev->insuspend = 1;
	synchronize_irq(pci_dev->irq);

	/* ACK pending interrupts just in case*/
	saa_writel(SAA7134_IRQ_REPORT,saa_readl(SAA7134_IRQ_REPORT));

	......
This should be bullet-proof.


> 
> Ben.
> 
> 
> 
Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-20  6:06             ` Maxim Levitsky
@ 2007-10-20  6:13               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-20  6:13 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: linuxppc-dev list, akpm, Linus Torvalds, Linux Kernel list


On Sat, 2007-10-20 at 08:06 +0200, Maxim Levitsky wrote:

> 	/* Disable interrupts, DMA, and rest of the chip*/
> 	saa_writel(SAA7134_IRQ1, 0);
> 	saa_writel(SAA7134_IRQ2, 0);
> 	saa_writel(SAA7134_MAIN_CTRL, 0);
	
> 	dev->insuspend = 1;
> 	synchronize_irq(pci_dev->irq);
> 
> 	/* ACK pending interrupts just in case*/
> 	saa_writel(SAA7134_IRQ_REPORT,saa_readl(SAA7134_IRQ_REPORT));
> 
> 	......
> This should be bullet-proof.

Hopefully :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] synchronize_irq needs a barrier
  2007-10-19  4:58                           ` Benjamin Herrenschmidt
@ 2007-10-21 21:10                             ` Benjamin Herrenschmidt
  2007-10-23  3:26                               ` [IRQ]: Fix synchronize_irq races with IRQ handler Herbert Xu
  0 siblings, 1 reply; 48+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-21 21:10 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Linus Torvalds, Ingo Molnar


 .../...

> This patch (mostly written by Linus) fixes this by using spin
> > locks instead of memory barries on the synchronize_irq() path.
> > 
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> Good for me.
> 
> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Hrm... not on yet. Herbert, care to resend, looks like it fell down the
wrong hole in Linus mailbox :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [IRQ]: Fix synchronize_irq races with IRQ handler
  2007-10-21 21:10                             ` Benjamin Herrenschmidt
@ 2007-10-23  3:26                               ` Herbert Xu
  0 siblings, 0 replies; 48+ messages in thread
From: Herbert Xu @ 2007-10-23  3:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linux Kernel Mailing List, linuxppc-dev, Thomas Gleixner, akpm,
	Linus Torvalds, Ingo Molnar

On Mon, Oct 22, 2007 at 07:10:05AM +1000, Benjamin Herrenschmidt wrote:
>
> Hrm... not on yet. Herbert, care to resend, looks like it fell down the
> wrong hole in Linus mailbox :-)

Thanks for the reminder Ben.  Here it is again:

[IRQ]: Fix synchronize_irq races with IRQ handler

As it is some callers of synchronize_irq rely on memory barriers
to provide synchronisation against the IRQ handlers.  For example,
the tg3 driver does

	tp->irq_sync = 1;
	smp_mb();
	synchronize_irq();

and then in the IRQ handler:

	if (!tp->irq_sync)
		netif_rx_schedule(dev, &tp->napi);

Unfortunately memory barriers only work well when they come in
pairs.  Because we don't actually have memory barriers on the
IRQ path, the memory barrier before the synchronize_irq() doesn't
actually protect us.

In particular, synchronize_irq() may return followed by the
result of netif_rx_schedule being made visible.

This patch (mostly written by Linus) fixes this by using spin
locks instead of memory barries on the synchronize_irq() path.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 80eab7a..1f31422 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -29,12 +29,28 @@
 void synchronize_irq(unsigned int irq)
 {
 	struct irq_desc *desc = irq_desc + irq;
+	unsigned int status;
 
 	if (irq >= NR_IRQS)
 		return;
 
-	while (desc->status & IRQ_INPROGRESS)
-		cpu_relax();
+	do {
+		unsigned long flags;
+
+		/*
+		 * Wait until we're out of the critical section.  This might
+		 * give the wrong answer due to the lack of memory barriers.
+		 */
+		while (desc->status & IRQ_INPROGRESS)
+			cpu_relax();
+
+		/* Ok, that indicated we're done: double-check carefully. */
+		spin_lock_irqsave(&desc->lock, flags);
+		status = desc->status;
+		spin_unlock_irqrestore(&desc->lock, flags);
+
+		/* Oops, that failed? */
+	} while (status & IRQ_INPROGRESS);
 }
 EXPORT_SYMBOL(synchronize_irq);
 

^ permalink raw reply related	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2007-10-23  3:27 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-18  1:25 [PATCH] synchronize_irq needs a barrier Benjamin Herrenschmidt
2007-10-18  1:45 ` Andrew Morton
2007-10-18  1:55   ` Benjamin Herrenschmidt
2007-10-18  2:12 ` Linus Torvalds
2007-10-18  2:40   ` Benjamin Herrenschmidt
2007-10-18  2:57     ` Benjamin Herrenschmidt
2007-10-18 14:56       ` Herbert Xu
2007-10-18 22:05         ` Benjamin Herrenschmidt
2007-10-18 22:52           ` Linus Torvalds
2007-10-18 23:17             ` Benjamin Herrenschmidt
2007-10-18 23:39               ` Linus Torvalds
2007-10-18 23:52                 ` Benjamin Herrenschmidt
2007-10-19  2:32                 ` Herbert Xu
2007-10-19  2:52                   ` Nick Piggin
2007-10-19  3:28                     ` Herbert Xu
2007-10-19  4:49                       ` Nick Piggin
2007-10-19  2:55                   ` Linus Torvalds
2007-10-19  3:26                     ` Linus Torvalds
2007-10-19  4:11                       ` Benjamin Herrenschmidt
2007-10-19  4:26                         ` Benjamin Herrenschmidt
2007-10-19  5:53                           ` Herbert Xu
2007-10-19  4:20                       ` Herbert Xu
2007-10-19  4:29                         ` Benjamin Herrenschmidt
2007-10-19  4:35                         ` Benjamin Herrenschmidt
2007-10-19  4:48                         ` Herbert Xu
2007-10-19  4:58                           ` Benjamin Herrenschmidt
2007-10-21 21:10                             ` Benjamin Herrenschmidt
2007-10-23  3:26                               ` [IRQ]: Fix synchronize_irq races with IRQ handler Herbert Xu
2007-10-19  5:36                         ` [NET]: Fix possible dev_deactivate race condition Herbert Xu
2007-10-19  5:38                           ` David Miller
2007-10-19  7:35                           ` Peter Zijlstra
2007-10-19  9:29                             ` Herbert Xu
2007-10-18 14:35     ` [PATCH] synchronize_irq needs a barrier Herbert Xu
2007-10-18 21:35       ` Benjamin Herrenschmidt
2007-10-20  2:02 ` Maxim Levitsky
2007-10-20  2:25   ` Linus Torvalds
2007-10-20  3:10     ` Maxim Levitsky
2007-10-20  4:06       ` Benjamin Herrenschmidt
2007-10-20  4:04     ` Benjamin Herrenschmidt
2007-10-20  4:09       ` Benjamin Herrenschmidt
2007-10-20  3:37   ` Herbert Xu
2007-10-20  3:56   ` Benjamin Herrenschmidt
2007-10-20  4:24     ` Maxim Levitsky
2007-10-20  5:04       ` Benjamin Herrenschmidt
2007-10-20  5:36         ` Maxim Levitsky
2007-10-20  5:46           ` Benjamin Herrenschmidt
2007-10-20  6:06             ` Maxim Levitsky
2007-10-20  6:13               ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).