* wmb vs mmiowb @ 2007-08-22 4:57 Nick Piggin 2007-08-22 18:07 ` Linus Torvalds 2007-08-23 7:25 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-22 4:57 UTC (permalink / raw) To: Jesse Barnes, Benjamin Herrenschmidt, Linus Torvalds Cc: linuxppc-dev, linux-ia64 Hi, I'm ignorant when it comes to IO access, so I hope this isn't rubbish (if it is, I would appreciate being corrected). It took me more than a glance to see what the difference is supposed to be between wmb() and mmiowb(). I think especially because mmiowb isn't really like a write barrier. wmb is supposed to order all writes coming out of a single CPU, so that's pretty simple. The problem is that writes coming from different CPUs can be seen by the device in a different order from which they were written if coming from different CPUs, even if the order of writes is guaranteed (eg. by a spinlock) and issued in the right order WRT the locking (ie. using wmb()). And this can happen because the writes can get posted away and reordered by the IO fabric (I think). mmiowb ensures the writes are seen by the device in the correct order. It doesn't seem like this primary function of mmiowb has anything to do with a write barrier that we are used to (it may have a seconary semantic of a wmb as well, but let's ignore that for now). A write barrier will never provide you with those semantics (writes from 2 CPUs seen in the same order by a 3rd party). If anything, I think it is closer to being a read barrier issued on behalf of the target device. But even that I think is not much better, because the target is not participating in the synchronisation that the CPUs are, so the "read barrier request" could still arrive at the device out of order WRT the other CPU's writes. It really seems like it is some completely different concept from a barrier. And it shows, on the platform where it really matters (sn2), where the thing actually spins. I don't know exactly how it should be categorised. On one hand, it is kind of like a critical section, and would work beautifully if we could just hide it inside spin_lock_io/spin_unlock_io. On the other hand, it seems like it is often used separately from locks, where it looks decidedly less like a critical section or release barrier. How can such uses be correct if they are worried about multi-CPU ordering but don't have anything to synchronize the CPUs? Or are they cleverly enforcing CPU ordering some other way? (in which case, maybe an acquire/release API really would make sense?). I don't really have a big point, except that I would like to know whether I'm on the right track, and wish the thing could have a better name/api. Thanks, Nick ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-22 4:57 wmb vs mmiowb Nick Piggin @ 2007-08-22 18:07 ` Linus Torvalds 2007-08-22 19:02 ` Jesse Barnes ` (2 more replies) 2007-08-23 7:25 ` Benjamin Herrenschmidt 1 sibling, 3 replies; 26+ messages in thread From: Linus Torvalds @ 2007-08-22 18:07 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Wed, 22 Aug 2007, Nick Piggin wrote: > > It took me more than a glance to see what the difference is supposed to be > between wmb() and mmiowb(). I think especially because mmiowb isn't really > like a write barrier. Well, it is, but it isn't. Not on its own - but together with a "normal" barrier it is. > wmb is supposed to order all writes coming out of a single CPU, so that's > pretty simple. No. wmb orders all *normal* writes coming out of a single CPU. It may not do anything at all for "uncached" IO writes that aren't part of the cache coherency, and that are handled using totally different queues (both inside and outside of the CPU)! Now, on x86, the CPU actually tends to order IO writes *more* than it orders any other writes (they are mostly entirely synchronous, unless the area has been marked as write merging), but at least on PPC, it's the other way around: without the cache as a serialization entry, you end up having a totally separate queueu to serialize, and a regular-memory write barrier does nothing at all to the IO queue. So think of the IO write queue as something totally asynchronous that has zero connection to the normal write ordering - and then think of mmiowb() as a way to *insert* a synchronization point. In particular, the normal synchronization primitives (spinlocks, mutexes etc) are guaranteed to synchronize only normal memory accesses. So if you do MMIO inside a spinlock, since the MMIO writes are totally asyncronous wrt the normal memory accesses, the MMIO write can escape outside the spinlock unless you have somethign that serializes the MMIO accesses with the normal memory accesses. So normally you'd see "mmiowb()" always *paired* with a normal memory barrier! The "mmiowb()" ends up synchronizing the MMIO writes with the normal memory accesses, and then the normal memory barrier acts as a barrier for subsequent writes. Of course, the normal memory barrier would usually be a "spin_unlock()" or something like that, not a "wmb()". In fact, I don't think the powerpc implementation (as an example of this) will actually synchronize with anything *but* a spin_unlock(). > It really seems like it is some completely different concept from a > barrier. And it shows, on the platform where it really matters (sn2), where > the thing actually spins. I agree that it probably isn't a "write barrier" per se. Think of it as a "tie two subsystems together" thing. (And it doesn't just matter on sn2. It also matters on powerpc64, although I think they just set a flag and do the *real* sync in the spin_unlock() path). Side note: the thing that makes "mmiowb()" even more exciting is that it's not just the CPU, it's the fabric outside the CPU that matters too. That's why the sn2 needs this - but the powerpc example shows a case where the ordering requirement actually comes from the CPU itself. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
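For readers trying to picture the pairing Linus describes, the usual driver-side pattern looks something like the sketch below. The device structure, lock and register offset are invented for illustration; only writel(), mmiowb() and the spinlock calls are the real kernel API.

#include <linux/spinlock.h>
#include <linux/io.h>

/* Hypothetical device state; the names are illustrative only. */
struct foo_dev {
	void __iomem	*regs;
	spinlock_t	lock;
};

static void foo_kick(struct foo_dev *dev, u32 cmd)
{
	unsigned long flags;

	spin_lock_irqsave(&dev->lock, flags);

	writel(cmd, dev->regs + 0x10);	/* MMIO write, posted */

	/*
	 * Make sure the MMIO write above cannot be passed by another
	 * CPU's MMIO writes before the cacheable store that releases
	 * the lock becomes visible.
	 */
	mmiowb();

	spin_unlock_irqrestore(&dev->lock, flags);
}

Without the mmiowb(), a second CPU could take the lock and have its own writel() reach the device before the first CPU's write, even though the locking itself was done correctly.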
* Re: wmb vs mmiowb 2007-08-22 18:07 ` Linus Torvalds @ 2007-08-22 19:02 ` Jesse Barnes 2007-08-23 2:20 ` Nick Piggin 2007-08-23 1:59 ` Nick Piggin 2007-08-23 7:27 ` Benjamin Herrenschmidt 2 siblings, 1 reply; 26+ messages in thread From: Jesse Barnes @ 2007-08-22 19:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, linux-ia64, linuxppc-dev On Wednesday, August 22, 2007 11:07 am Linus Torvalds wrote: > > It really seems like it is some completely different concept from a > > barrier. And it shows, on the platform where it really matters > > (sn2), where the thing actually spins. > > I agree that it probably isn't a "write barrier" per se. Think of it > as a "tie two subsystems together" thing. Right, maybe it's not the best name, but as long as you separate your memory access types, you can think of it as a real write barrier, just for mmio accesses (well uncached access really). > (And it doesn't just matter on sn2. It also matters on powerpc64, > although I think they just set a flag and do the *real* sync in the > spin_unlock() path). Yeah, they keep threatening to use this instead, but I'm not sure how easy it would be. Also they may have more devices/drivers to worry about than sn2, so maybe changing over would mean too much driver debugging (well auditing really since it's not that hard to know where to put them). Irix actually had an io_unlock() routine that did this implicitly, but iirc that was shot down for Linux... Jesse ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-22 19:02 ` Jesse Barnes @ 2007-08-23 2:20 ` Nick Piggin 2007-08-23 2:57 ` Linus Torvalds 2007-08-23 17:02 ` Jesse Barnes 0 siblings, 2 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-23 2:20 UTC (permalink / raw) To: Jesse Barnes; +Cc: linux-ia64, Linus Torvalds, linuxppc-dev On Wed, Aug 22, 2007 at 12:02:11PM -0700, Jesse Barnes wrote: > On Wednesday, August 22, 2007 11:07 am Linus Torvalds wrote: > > > It really seems like it is some completely different concept from a > > > barrier. And it shows, on the platform where it really matters > > > (sn2), where the thing actually spins. > > > > I agree that it probably isn't a "write barrier" per se. Think of it > > as a "tie two subsystems together" thing. > > Right, maybe it's not the best name, but as long as you separate your > memory access types, you can think of it as a real write barrier, just > for mmio accesses (well uncached access really).

If we have the following situation (all vars start at 0)

    CPU0                    CPU1                    CPU2
    spin_lock(&lock);                               ~
    A = 1;                                          ~
    wmb();                                          ~
    B = 2;                                          ~
    spin_unlock(&lock);                             X = B;
                            spin_lock(&lock);       rmb();
                            A = 10;                 Y = A;
                            wmb();                  ~
                            B = 11;                 ~
                            spin_unlock(&lock);     ~

(I use the '~' just to show CPU2 is not specifically temporally related to CPU0 or CPU1). Then CPU2 could have X==11 and Y==1, according to the Linux abstract memory consistency model, couldn't it? I think so, and I think this is what your mmiowb is trying to protect. In the above situation, CPU2 would just use the spinlock -- I don't think we have a simple primitive that CPU0 and 1 can call to prevent this reordering at CPU2. An IO device obviously can't use a spinlock :). > > (And it doesn't just matter on sn2. It also matters on powerpc64, > > although I think they just set a flag and do the *real* sync in the > > spin_unlock() path). > > Yeah, they keep threatening to use this instead, but I'm not sure how > easy it would be. Also they may have more devices/drivers to worry > about than sn2, so maybe changing over would mean too much driver > debugging (well auditing really since it's not that hard to know where > to put them). Irix actually had an io_unlock() routine that did this > implicitly, but iirc that was shot down for Linux... Why was it shot down? Seems like a pretty good idea to me ;) I'm clueless when it comes to drivers, but I see a lot of mmiowb() that are not paired with spin_unlock. How are these obvious? (ie. what is the pattern?) It looks like some might be lockless FIFOs (or maybe I'm just not aware of where the locks are). Can you just quickly illustrate the problem being solved? Thanks, Nick ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 2:20 ` Nick Piggin @ 2007-08-23 2:57 ` Linus Torvalds 2007-08-23 3:54 ` Nick Piggin 2007-08-23 4:20 ` Nick Piggin 2007-08-23 17:02 ` Jesse Barnes 1 sibling, 2 replies; 26+ messages in thread From: Linus Torvalds @ 2007-08-23 2:57 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Thu, 23 Aug 2007, Nick Piggin wrote: > > > Irix actually had an io_unlock() routine that did this > > implicitly, but iirc that was shot down for Linux... > > Why was it shot down? Seems like a pretty good idea to me ;) It's horrible. We'd need it for *every* single spinlock type. We have lots of them. So the choice is between: - sane: mmiowb() followed by any of the existing "spin_unlock()" variants (plain, _irq(), _bh(), _irqrestore()) - insane: multiply our current set of unlock primitives by two, by making "io" versions for them all: spin_unlock_io[_irq|_irqrestore|_bh]() but there's actually an EVEN WORSE problem with the stupid Irix approach, namely that it requires that the unlocker be aware of the exact details of what happens inside the lock. If the locking is done at an outer layer, that's not at all obvious! In other words, Irix (once again) made a horrible and idiotic choice. Big surprise. Irix was probably the flakiest and worst of all the commercial proprietary unixes. No taste. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 2:57 ` Linus Torvalds @ 2007-08-23 3:54 ` Nick Piggin 2007-08-23 16:14 ` Linus Torvalds 2007-08-23 4:20 ` Nick Piggin 1 sibling, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-23 3:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Wed, Aug 22, 2007 at 07:57:56PM -0700, Linus Torvalds wrote: > > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > > Irix actually had an io_unlock() routine that did this > > > implicitly, but iirc that was shot down for Linux... > > > > Why was it shot down? Seems like a pretty good idea to me ;) > > It's horrible. We'd need it for *every* single spinlock type. We have lots > of them. > > So the choice is between: > > - sane: > > mmiowb() > > followed by any of the existing "spin_unlock()" variants (plain, > _irq(), _bh(), _irqrestore()) > > - insane: multiply our current set of unlock primitives by two, by making > "io" versions for them all: > > spin_unlock_io[_irq|_irqrestore|_bh]() > > but there's actually an EVEN WORSE problem with the stupid Irix approach, > namely that it requires that the unlocker be aware of the exact details of > what happens inside the lock. If the locking is done at an outer layer, > that's not at all obvious! OK, but we'd have some kind of functions that are called not to serialise the CPUs, but to serialise the IO. It would be up to the calling code to already provide CPU synchronisation. serialize_io(); / unserialize_io(); / a nicer name If we could pass in some kind of relevant resoure (eg. the IO memory or device or something), then we might even be able to put debug checks there to ensure two CPUs are never inside the same critical IO section at once. > In other words, Irix (once again) made a horrible and idiotic choice. We could make a better one. I don't think mmiowb is really insane, but I'd worry it being confused with a regular type of barrier and that CPU synchronisation needs to be provided for it to work or make sense. > Big surprise. Irix was probably the flakiest and worst of all the > commercial proprietary unixes. No taste. Is it? I've never used it ;) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 3:54 ` Nick Piggin @ 2007-08-23 16:14 ` Linus Torvalds 0 siblings, 0 replies; 26+ messages in thread From: Linus Torvalds @ 2007-08-23 16:14 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Thu, 23 Aug 2007, Nick Piggin wrote: > > OK, but we'd have some kind of functions that are called not to > serialise the CPUs, but to serialise the IO. It would be up to > the calling code to already provide CPU synchronisation. > > serialize_io(); / unserialize_io(); / a nicer name We could call it "mmiowb()", for example? Radical idea, I know. > If we could pass in some kind of relevant resoure (eg. the IO > memory or device or something), then we might even be able to > put debug checks there to ensure two CPUs are never inside the > same critical IO section at once. We could certainly give it the spinlock as an argument. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
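Nothing like this exists in the tree; purely as a sketch of what passing the spinlock could buy, a debug-checking wrapper might look like the following (the name and the check are invented for illustration):

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/io.h>

/*
 * Hypothetical debug variant: complain if the caller issues the IO
 * ordering op without holding the lock that is supposed to serialize
 * access to the device.  Note spin_is_locked() only says the lock is
 * held by *someone*, which is weaker than "held by this CPU", so this
 * is a sanity check rather than a proof of correct usage.
 */
static inline void mmiowb_locked(spinlock_t *lock)
{
#ifdef CONFIG_DEBUG_SPINLOCK
	WARN_ON(!spin_is_locked(lock));
#endif
	mmiowb();
}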
* Re: wmb vs mmiowb 2007-08-23 2:57 ` Linus Torvalds 2007-08-23 3:54 ` Nick Piggin @ 2007-08-23 4:20 ` Nick Piggin 2007-08-23 16:16 ` Linus Torvalds 1 sibling, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-23 4:20 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Wed, Aug 22, 2007 at 07:57:56PM -0700, Linus Torvalds wrote: > > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > > Irix actually had an io_unlock() routine that did this > > > implicitly, but iirc that was shot down for Linux... > > > > Why was it shot down? Seems like a pretty good idea to me ;) > > It's horrible. We'd need it for *every* single spinlock type. We have lots > of them. > > So the choice is between: > > - sane: > > mmiowb() > > followed by any of the existing "spin_unlock()" variants (plain, > _irq(), _bh(), _irqrestore()) > > - insane: multiply our current set of unlock primitives by two, by making > "io" versions for them all: > > spin_unlock_io[_irq|_irqrestore|_bh]() > > but there's actually an EVEN WORSE problem with the stupid Irix approach, > namely that it requires that the unlocker be aware of the exact details of > what happens inside the lock. If the locking is done at an outer layer, > that's not at all obvious! Also, FWIW, there are some advantages of deferring the mmiowb thingy until the point of unlock. The disadvantage is that the caller may not know if the inner layer performed ios that require the mmiowb, but the advantage of waiting until unlock is that the wait is deferred for as long as possible, and will hopefully be a shorter one when performed later. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 4:20 ` Nick Piggin @ 2007-08-23 16:16 ` Linus Torvalds 2007-08-23 16:27 ` Benjamin Herrenschmidt 2007-08-24 2:59 ` Nick Piggin 0 siblings, 2 replies; 26+ messages in thread From: Linus Torvalds @ 2007-08-23 16:16 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Thu, 23 Aug 2007, Nick Piggin wrote: > > Also, FWIW, there are some advantages of deferring the mmiowb thingy > until the point of unlock. And that is exactly what ppc64 does. But you're missing a big point: for 99.9% of all hardware, mmiowb() is a total no-op. So when you talk about "advantages", you're not talking about any *real* advantage, are you? Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 16:16 ` Linus Torvalds @ 2007-08-23 16:27 ` Benjamin Herrenschmidt 2007-08-24 3:09 ` Nick Piggin 2007-08-24 2:59 ` Nick Piggin 1 sibling, 1 reply; 26+ messages in thread From: Benjamin Herrenschmidt @ 2007-08-23 16:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, linux-ia64, Jesse Barnes, linuxppc-dev On Thu, 2007-08-23 at 09:16 -0700, Linus Torvalds wrote: > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > Also, FWIW, there are some advantages of deferring the mmiowb thingy > > until the point of unlock. > > And that is exactly what ppc64 does. > > But you're missing a big point: for 99.9% of all hardware, mmiowb() is a > total no-op. So when you talk about "advantages", you're not talking about > any *real* advantage, are you? I wonder whether it might be worth removing mmiowb and having all archs that matter do like ppc64 though... It's just yet another confusing barrier that most driver writers get wrong.. Cheers, Ben. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 16:27 ` Benjamin Herrenschmidt @ 2007-08-24 3:09 ` Nick Piggin 2007-08-28 20:56 ` Brent Casavant 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-24 3:09 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Jesse Barnes, linux-ia64, Linus Torvalds, linuxppc-dev On Thu, Aug 23, 2007 at 06:27:42PM +0200, Benjamin Herrenschmidt wrote: > On Thu, 2007-08-23 at 09:16 -0700, Linus Torvalds wrote: > > > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > > > Also, FWIW, there are some advantages of deferring the mmiowb thingy > > > until the point of unlock. > > > > And that is exactly what ppc64 does. > > > > But you're missing a big point: for 99.9% of all hardware, mmiowb() is a > > total no-op. So when you talk about "advantages", you're not talking about > > any *real* advantage, are you? > > I wonder whether it might be worth removing mmiowb and having all archs > that matter do like ppc64 though... It's just yet another confusing > barrier that most driver writers get wrong.. Only sn2 and powerpc really matter, actually (for different reasons). All smp architectures other than powerpc appear to have barrier instructions that order all memory operations, so IOs never leak out of locking primitives. This is why powerpc wants a wmb (not mmiowb) before spin_unlock to order IOs (pity about other locking primitives). And all platforms other than sn2 don't appear to reorder IOs after they leave the CPU, so only sn2 needs to do the mmiowb thing before spin_unlock. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-24 3:09 ` Nick Piggin @ 2007-08-28 20:56 ` Brent Casavant 2007-08-29 0:59 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Brent Casavant @ 2007-08-28 20:56 UTC (permalink / raw) To: Nick Piggin; +Cc: linuxppc-dev, linux-ia64 On Fri, 24 Aug 2007, Nick Piggin wrote: > And all platforms other than sn2 don't appear to reorder IOs after > they leave the CPU, so only sn2 needs to do the mmiowb thing before > spin_unlock. I'm sure all of the following is already known to most readers, but I thought the paragraph above might potentially cause confusion as to the nature of the problem mmiowb() is solving on SN2. So for the record... SN2 does not reorder IOs issued from a single CPU (that would be insane). Neither does it reorder IOs once they've reached the IO fabric (equally insane). From an individual CPU's perspective, all IOs that it issues to a device will arrive at that device in program order. (In this entire message, all IOs are assumed to be memory-mapped.) The problem mmiowb() helps solve on SN2 is the ordering of IOs issued from multiple CPUs to a single device. That ordering is undefined, as IO transactions are not ordered across CPUs. That is, if CPU A issues an IO at time T, and CPU B at time T+1, CPU B's IO may arrive at the IO fabric before CPU A's IO, particularly if CPU B happens to be closer than CPU B to the target IO bridge on the NUMA network. The simplistic method to solve this is a lock around the section issuing IOs, thereby ensuring serialization of access to the IO device. However, as SN2 does not enforce an ordering between normal memory transactions and memory-mapped IO transactions, you cannot be sure that an IO transaction will arrive at the IO fabric "on the correct side" of the unlock memory transaction using this scheme. Enter mmiowb(). mmiowb() causes SN2 to drain the pending IOs from the current CPU's node. Once the IOs are drained the CPU can safely unlock a normal memory based lock without fear of the unlock's memory write passing any outstanding IOs from that CPU. Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 26+ messages in thread
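In driver terms, the failure Brent describes can be pictured like this, assuming a hypothetical register at offset 0x10 and a lock that correctly serializes the two CPUs:

#include <linux/spinlock.h>
#include <linux/io.h>

/*
 * Both CPUs run this; CPU A wins the lock first, CPU B second.
 *
 *   CPU A: writel(val) can still be queued in A's hub when A's
 *          spin_unlock() store becomes visible to CPU B.
 *   CPU B: takes the lock and issues its own writel(); if B sits
 *          closer to the target bridge, B's value can reach the
 *          device before A's does.
 *
 * The device then sees the writes in the reverse of the locking
 * order.  An mmiowb() before the unlock closes the window.
 */
static void foo_write_reg(void __iomem *regs, spinlock_t *lock, u32 val)
{
	spin_lock(lock);
	writel(val, regs + 0x10);	/* may linger in this node's hub */
	/* without mmiowb() here, the unlock below can "pass" the writel */
	spin_unlock(lock);
}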
* Re: wmb vs mmiowb 2007-08-28 20:56 ` Brent Casavant @ 2007-08-29 0:59 ` Nick Piggin 2007-08-29 18:53 ` Brent Casavant 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-29 0:59 UTC (permalink / raw) To: Brent Casavant; +Cc: linuxppc-dev, linux-ia64 On Tue, Aug 28, 2007 at 03:56:28PM -0500, Brent Casavant wrote: > On Fri, 24 Aug 2007, Nick Piggin wrote: > > > And all platforms other than sn2 don't appear to reorder IOs after > > they leave the CPU, so only sn2 needs to do the mmiowb thing before > > spin_unlock. > > I'm sure all of the following is already known to most readers, but > I thought the paragraph above might potentially cause confusion as > to the nature of the problem mmiowb() is solving on SN2. So for > the record... > > SN2 does not reorder IOs issued from a single CPU (that would be > insane). Neither does it reorder IOs once they've reached the IO > fabric (equally insane). From an individual CPU's perspective, all > IOs that it issues to a device will arrive at that device in program > order. This is why I think mmiowb() is not like a Linux memory barrier. And I presume that the device would see IOs and regular stores from a CPU in program order, given the correct wmb()s? (but maybe I'm wrong... more below). > (In this entire message, all IOs are assumed to be memory-mapped.) > > The problem mmiowb() helps solve on SN2 is the ordering of IOs issued > from multiple CPUs to a single device. That ordering is undefined, as > IO transactions are not ordered across CPUs. That is, if CPU A issues > an IO at time T, and CPU B at time T+1, CPU B's IO may arrive at the > IO fabric before CPU A's IO, particularly if CPU B happens to be closer > than CPU B to the target IO bridge on the NUMA network. > > The simplistic method to solve this is a lock around the section > issuing IOs, thereby ensuring serialization of access to the IO > device. However, as SN2 does not enforce an ordering between normal > memory transactions and memory-mapped IO transactions, you cannot > be sure that an IO transaction will arrive at the IO fabric "on the > correct side" of the unlock memory transaction using this scheme. Hmm. So what if you had the following code executed by a single CPU: writel(data, ioaddr); wmb(); *mem = 10; Will the device see the io write before the store to mem? > Enter mmiowb(). > > mmiowb() causes SN2 to drain the pending IOs from the current CPU's > node. Once the IOs are drained the CPU can safely unlock a normal > memory based lock without fear of the unlock's memory write passing > any outstanding IOs from that CPU. mmiowb needs to have the disclaimer that it's probably wrong if called outside a lock, and it's probably wrong if called between two io writes (need a regular wmb() in that case). I think some drivers are getting this wrong. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-29 0:59 ` Nick Piggin @ 2007-08-29 18:53 ` Brent Casavant 2007-08-30 3:36 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Brent Casavant @ 2007-08-29 18:53 UTC (permalink / raw) To: Nick Piggin; +Cc: linuxppc-dev, linux-ia64 On Wed, 29 Aug 2007, Nick Piggin wrote: > On Tue, Aug 28, 2007 at 03:56:28PM -0500, Brent Casavant wrote: > > The simplistic method to solve this is a lock around the section > > issuing IOs, thereby ensuring serialization of access to the IO > > device. However, as SN2 does not enforce an ordering between normal > > memory transactions and memory-mapped IO transactions, you cannot > > be sure that an IO transaction will arrive at the IO fabric "on the > > correct side" of the unlock memory transaction using this scheme. > > Hmm. So what if you had the following code executed by a single CPU: > > writel(data, ioaddr); > wmb(); > *mem = 10; > > Will the device see the io write before the store to mem? Not necessarily. There is no guaranteed ordering between the IO write arriving at the device and the order of the normal memory reference, regardless of the intervening wmb(), at least on SN2. I believe the missing component in the mental model is the effect of the platform chipset. Perhaps this will help. Uncached writes (i.e. IO writes) are posted to the SN2 SHub ASIC and placed in their own queue which the SHub chip then routes to the appropriate target. This uncached write queue is independent of the NUMA cache-coherency maintained by the SHub ASIC for system memory; the relative order in which the uncached writes and the system memory traffic appear at their respective targets is undefined with respect to eachother. wmb() does not address this situation as it only guarantees that the writes issued from the CPU have been posted to the chipset, not that the chipset itself has posted the write to the final destination. mmiowb() guarantees that all outstanding IO writes have been issued to the IO fabric before proceeding. I like to think of it this way (probably not 100% accurate, but it helps me wrap my brain around this particular point): wmb(): Ensures preceding writes have issued from the CPU. mmiowb(): Ensures preceding IO writes have issued from the system chipset. mmiowb() on SN2 polls a register in SHub that reports the length of the outstanding uncached write queue. When the queue has emptied, it is known that all subsequent normal memory writes will therefore arrive at their destination after all preceding IO writes have arrived at the IO fabric. Thus, typical mmiowb() usage, for SN2's purpose, is to ensure that all IO traffic from a CPU has made it out to the IO fabric before issuing the normal memory transactions which release a RAM-based lock. The lock in this case is the one used to serialize access to a particular IO device. > > mmiowb() causes SN2 to drain the pending IOs from the current CPU's > > node. Once the IOs are drained the CPU can safely unlock a normal > > memory based lock without fear of the unlock's memory write passing > > any outstanding IOs from that CPU. > > mmiowb needs to have the disclaimer that it's probably wrong if called > outside a lock, and it's probably wrong if called between two io writes > (need a regular wmb() in that case). I think some drivers are getting > this wrong. There are situations where mmiowb() can be pressed into service to some other end, but those are rather rare. 
The only instance I am personally familiar with is synchronizing a free-running counter on a PCI device as closely as possible to the execution of a particular line of driver code. A write of the new counter value to the device and subsequent mmiowb() synchronizes that execution point as closely as practical to the IO write arriving at the device. Not perfect, but good enough for my purposes. (This was a hack, by the way, pressing a bit of hardware into a purpose for which it wasn't really designed, ideally the hardware would have had a better mechanism to accomplish this goal.) But in the normal case, I believe you are 100% correct -- wmb() would ensure that the memory-mapped IO writes arrive at the chipset in a particular order, and thus should arrive at the IO hardware in a particular order. mmiowb() would not necessarily accomplish this goal, and is incorrectly used wherever that is the intention. At least for SN2. Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 26+ messages in thread
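The counter trick Brent describes might look roughly like the sketch below; the register offset and function name are invented, since the actual hardware isn't identified here:

#include <linux/io.h>

/*
 * Hypothetical: resynchronize a free-running counter on the card to
 * "now" as seen at this point in the driver.  The mmiowb() waits until
 * the posted write has drained from the node's IO queue, so the store
 * arrives at the device as close as practical to this line of code.
 */
static void foo_sync_counter(void __iomem *regs, u32 now)
{
	writel(now, regs + 0x40);	/* invented counter register */
	mmiowb();
}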
* Re: wmb vs mmiowb 2007-08-29 18:53 ` Brent Casavant @ 2007-08-30 3:36 ` Nick Piggin 2007-08-30 19:42 ` Brent Casavant 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-30 3:36 UTC (permalink / raw) To: Brent Casavant; +Cc: linuxppc-dev, linux-ia64 On Wed, Aug 29, 2007 at 01:53:53PM -0500, Brent Casavant wrote: > On Wed, 29 Aug 2007, Nick Piggin wrote: > > > On Tue, Aug 28, 2007 at 03:56:28PM -0500, Brent Casavant wrote: > > > > The simplistic method to solve this is a lock around the section > > > issuing IOs, thereby ensuring serialization of access to the IO > > > device. However, as SN2 does not enforce an ordering between normal > > > memory transactions and memory-mapped IO transactions, you cannot > > > be sure that an IO transaction will arrive at the IO fabric "on the > > > correct side" of the unlock memory transaction using this scheme. > > > > Hmm. So what if you had the following code executed by a single CPU: > > > > writel(data, ioaddr); > > wmb(); > > *mem = 10; > > > > Will the device see the io write before the store to mem? > > Not necessarily. There is no guaranteed ordering between the IO write > arriving at the device and the order of the normal memory reference, > regardless of the intervening wmb(), at least on SN2. I believe the > missing component in the mental model is the effect of the platform > chipset. > > Perhaps this will help. Uncached writes (i.e. IO writes) are posted > to the SN2 SHub ASIC and placed in their own queue which the SHub chip > then routes to the appropriate target. This uncached write queue is > independent of the NUMA cache-coherency maintained by the SHub ASIC > for system memory; the relative order in which the uncached writes > and the system memory traffic appear at their respective targets is > undefined with respect to eachother. > > wmb() does not address this situation as it only guarantees that > the writes issued from the CPU have been posted to the chipset, > not that the chipset itself has posted the write to the final > destination. mmiowb() guarantees that all outstanding IO writes > have been issued to the IO fabric before proceeding. > > I like to think of it this way (probably not 100% accurate, but it > helps me wrap my brain around this particular point): > > wmb(): Ensures preceding writes have issued from the CPU. > mmiowb(): Ensures preceding IO writes have issued from the > system chipset. > > mmiowb() on SN2 polls a register in SHub that reports the length > of the outstanding uncached write queue. When the queue has emptied, > it is known that all subsequent normal memory writes will therefore > arrive at their destination after all preceding IO writes have arrived > at the IO fabric. > > Thus, typical mmiowb() usage, for SN2's purpose, is to ensure that > all IO traffic from a CPU has made it out to the IO fabric before > issuing the normal memory transactions which release a RAM-based > lock. The lock in this case is the one used to serialize access > to a particular IO device. OK, thanks for that. I think I have a rough idea of how they both work... I was just thinking (hoping) that, although the writel may not reach the device before the store reaches memory, it would _appear_ that way from the POV of the device (ie. if the device were to DMA from mem). But that's probably wishful thinking because the memory might be on some completely different part of the system. I don't know whether this is exactly a correct implementation of Linux's barrier semantics. 
On one hand, wmb _is_ ordering the stores as they come out of the CPU; on the other, it isn't ordering normal stores with respect to writel from the POV of the device (which is seems to be what is expected by the docs and device driver writers). One argument says that the IO device or chipset is a seperate agent and thus isn't subject to ordering... which is sort of valid, but it is definitely not an agent equal to a CPU, because it can't actively participate in the synchronisation protocol. And on the other side, it just doesn't seem so useful just to know that stores coming out of the CPU are ordered if they can be reordered by an intermediate. Why even have wmb() at all, if it doesn't actually order stores to IO and RAM? powerpc's wmb() could just as well be an 'eieio' if it were to follow your model; that instruction orders IO, but not WRT cacheable stores. So you could argue that the chipset is an extention of the CPU's IO/memory subsystem and should follow the ordering specified by the CPU. I like this idea because it could make things simpler and more regular for the Linux barrier model. I guess it is too expensive for you to have mmiowb() in every wmb(), because _most_ of the time, all that's needed is ordering between IOs. So why not have io_mb(), io_rmb(), io_wmb(), which order IOs but ignore system memory. Then the non-prefixed primitives order everything (to the point that wmb() is like mmiowb on sn2). > > > mmiowb() causes SN2 to drain the pending IOs from the current CPU's > > > node. Once the IOs are drained the CPU can safely unlock a normal > > > memory based lock without fear of the unlock's memory write passing > > > any outstanding IOs from that CPU. > > > > mmiowb needs to have the disclaimer that it's probably wrong if called > > outside a lock, and it's probably wrong if called between two io writes > > (need a regular wmb() in that case). I think some drivers are getting > > this wrong. > > There are situations where mmiowb() can be pressed into service to > some other end, but those are rather rare. The only instance I am > personally familiar with is synchronizing a free-running counter on > a PCI device as closely as possible to the execution of a particular > line of driver code. A write of the new counter value to the device > and subsequent mmiowb() synchronizes that execution point as closely > as practical to the IO write arriving at the device. Not perfect, but > good enough for my purposes. (This was a hack, by the way, pressing > a bit of hardware into a purpose for which it wasn't really designed, > ideally the hardware would have had a better mechanism to accomplish > this goal.) I guess that would be fine. You probably have a slightly better understanding of the issues than the average device driver writer so you could ignore the warnings ;) > But in the normal case, I believe you are 100% correct -- wmb() would > ensure that the memory-mapped IO writes arrive at the chipset in a > particular order, and thus should arrive at the IO hardware in a particular > order. mmiowb() would not necessarily accomplish this goal, and is > incorrectly used wherever that is the intention. At least for SN2. Now I guess it's strictly also needed if you want to ensure cacheable stores and IO stores are visible to the device in the correct order too. I think we'd normally hope wmb() does that for us too (hence all my rambling above). ^ permalink raw reply [flat|nested] 26+ messages in thread
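None of these primitives exist; as a sketch only, the split Nick is proposing might look like this, with io_wmb() ordering MMIO against MMIO and the plain wmb() ordering everything (which on sn2 would then have to include the mmiowb()-style drain):

#include <linux/io.h>

/*
 * Hypothetical API sketch, not in the kernel:
 *
 *   io_wmb():  order prior MMIO writes against later MMIO writes only.
 *   wmb():     order *all* prior stores (cacheable and MMIO) against
 *              all later stores.
 */
#define io_wmb()	wmb()	/* conservative stand-in: order everything */

static void foo_example(void __iomem *regs, u32 *flag)
{
	writel(1, regs + 0x0);
	io_wmb();		/* MMIO vs later MMIO */
	writel(2, regs + 0x4);

	wmb();			/* MMIO and RAM stores vs everything later */
	*flag = 1;		/* e.g. the cacheable store that publishes completion */
}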
* Re: wmb vs mmiowb 2007-08-30 3:36 ` Nick Piggin @ 2007-08-30 19:42 ` Brent Casavant 2007-09-03 20:48 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Brent Casavant @ 2007-08-30 19:42 UTC (permalink / raw) To: Nick Piggin; +Cc: linuxppc-dev, linux-ia64 On Thu, 30 Aug 2007, Nick Piggin wrote: > OK, thanks for that. I think I have a rough idea of how they both > work... I was just thinking (hoping) that, although the writel may > not reach the device before the store reaches memory, it would > _appear_ that way from the POV of the device (ie. if the device > were to DMA from mem). But that's probably wishful thinking because > the memory might be on some completely different part of the system. Exactly. Since uncacheable writes cannot by definition take part in a cache-coherency mechanism, they really become their own seperate hierarchy of transactions. > I don't know whether this is exactly a correct implementation of > Linux's barrier semantics. On one hand, wmb _is_ ordering the stores > as they come out of the CPU; on the other, it isn't ordering normal > stores with respect to writel from the POV of the device (which is > seems to be what is expected by the docs and device driver writers). Or, as I think of it, it's not ordering cacheable stores with respect to uncacheable stores from the perspective of other CPUs in the system. That's what's really at the heart of the concern for SN2. > And on the other side, it just doesn't seem so useful just to know > that stores coming out of the CPU are ordered if they can be reordered > by an intermediate. Well, it helps when certain classes of stores need to be ordered with respect to eachother. On SN2, wmb() still ensures that cacheable stores are issued in a particular order, and thus seen by other CPUs in a particular order. That is still important, even when IO devices are not in the mix. > Why even have wmb() at all, if it doesn't actually > order stores to IO and RAM? It orders the class of stores which target RAM. It doesn't order the two seperate classes of stores (RAM and IO) with respect to eachother. mmiowb() when used in conjunction with a lock which serializes access to an IO device ensures that the order of stores to the IO device from different CPUs is well-defined. That's what we're really after here. > powerpc's wmb() could just as well be an > 'eieio' if it were to follow your model; that instruction orders IO, > but not WRT cacheable stores. That would seem to follow the intent of mmiowb() on SN2. I know next to nothing about PowerPC, so I'm not qualified to comment on that. > So you could argue that the chipset is an extention of the CPU's IO/memory > subsystem and should follow the ordering specified by the CPU. I like this > idea because it could make things simpler and more regular for the Linux > barrier model. Sorry, I didn't design the hardware. ;) I believe the problem, for a NUMA system, is that in order to implement what you describe, you would need the chipset to cause all effectively dirty cachelines in the CPU (including those that will become dirty due to previous stores which the CPU hasn't committed from its pipeline yet) to be written back to RAM before the the uncacheable store was allowed to issue from the chipset to the IO fabric. This would occur for every IO store, not just the final store in a related sequence. That would obviously have a significant negative impact on performance. 
> I guess it is too expensive for you to have mmiowb() in every wmb(), > because _most_ of the time, all that's needed is ordering between IOs.

I think it's the other way around. Most of the time all you need is ordering between RAM stores, so mmiowb() would kill performance if it was called every time wmb() was invoked.

> So why not have io_mb(), io_rmb(), io_wmb(), which order IOs but ignore > system memory. Then the non-prefixed primitives order everything (to the > point that wmb() is like mmiowb on sn2).

I'm not sure I follow. Here's the bad sequence we're working with:

    CPU A           CPU B           Lock owner      IO device sees
    -----           -----           ----------      --------------
    ...             ...             unowned
    lock()          ...             CPU A
    writel(val_a)   lock()          ...
    unlock()                        CPU B
    ...             write(val_b)    ...
    ...             unlock()        unowned
    ...             ...             ...             val_b
    ...             ...             ...             val_a

The cacheable store to RAM from CPU A to perform the unlock was not ordered with respect to the uncacheable writel() to the IO device. CPU B, which has a different uncacheable store path to the IO device in the NUMA system, saw the effect of the RAM store before CPU A's uncacheable store arrived at the IO device. CPU B then owned the lock, performed its own uncacheable store to the IO device, and released the lock. The two uncacheable stores are taking different routes to the device, and end up arriving in the wrong order.

mmiowb() solves this by causing the following:

    CPU A           CPU B           Lock owner      IO device sees
    -----           -----           ----------      --------------
    ...             ...             Unowned
    lock()          ...             CPU A
    writel(val_a)   lock()          ...
    mmiowb()        ...                             val_a
    unlock()                        CPU B
    ...             write(val_b)    ...
    ...             mmiowb()        ...             val_b
    ...             unlock()        unowned

The mmiowb() caused the IO device to see the uncacheable store from CPU A before CPU B saw the cacheable store from CPU A. Now all is well with the world.

I might be exhausting your patience, but this is the key. mmiowb() causes the IO fabric to see the effects of an uncacheable store before other CPUs see the effects of a subsequent cacheable store. That's what's really at the heart of the matter.

> Now I guess it's strictly also needed if you want to ensure cacheable > stores and IO stores are visible to the device in the correct order > too. I think we'd normally hope wmb() does that for us too (hence all > my rambling above).

There's really three perspectives to consider, not just the CPU and IO device:

 1. CPU A performing locking and issuing IO stores.
 2. The IO device receiving stores.
 3. CPU B performing locking and issuing IO stores.

The lock ensures that the IO device sees stores from a single CPU at a time. wmb() ensures that CPU A and CPU B see the effect of cacheable stores in the same order as each other. mmiowb() ensures that the IO device has seen all the uncacheable stores from CPU A before CPU B sees the cacheable stores from CPU A.

Wow. I like that last paragraph. I think I'll send now...

Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-30 19:42 ` Brent Casavant @ 2007-09-03 20:48 ` Nick Piggin 0 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-09-03 20:48 UTC (permalink / raw) To: Brent Casavant; +Cc: linuxppc-dev, Linus Torvalds, linux-ia64, Anton Blanchard On Thu, Aug 30, 2007 at 02:42:41PM -0500, Brent Casavant wrote: > On Thu, 30 Aug 2007, Nick Piggin wrote: > > > I don't know whether this is exactly a correct implementation of > > Linux's barrier semantics. On one hand, wmb _is_ ordering the stores > > as they come out of the CPU; on the other, it isn't ordering normal > > stores with respect to writel from the POV of the device (which is > > seems to be what is expected by the docs and device driver writers). > > Or, as I think of it, it's not ordering cacheable stores with respect > to uncacheable stores from the perspective of other CPUs in the system. > That's what's really at the heart of the concern for SN2. AFAIKS, the issue is simply that it is not ordering cacheable stores with respect to uncacheable stores from a _single_ CPU. I'll elaborate further down. > > And on the other side, it just doesn't seem so useful just to know > > that stores coming out of the CPU are ordered if they can be reordered > > by an intermediate. > > Well, it helps when certain classes of stores need to be ordered with > respect to eachother. On SN2, wmb() still ensures that cacheable stores > are issued in a particular order, and thus seen by other CPUs in a > particular order. That is still important, even when IO devices are not > in the mix. Well, we have smp_wmb() for that. > > Why even have wmb() at all, if it doesn't actually > > order stores to IO and RAM? > > It orders the class of stores which target RAM. It doesn't order the > two seperate classes of stores (RAM and IO) with respect to eachother. wmb() *really* is supposed to order all stores. As far as I gather, devices often need it for something like this: *dma_buffer = blah; wmb(); writel(START_DMA, iomem); One problem for sn2 seems to be that wmb is called like 500 times in drivers/ and would be really heavy to turn it into mmiowb. On the other hand, I really don't like how it's just gone and said "oh the normal Linux semantics are too hard, so make wmb() mean something slightly different, and add a totally new mmiowb() concept". Device driver writers already get barriers totally wrong. mmiowb is being completely misused already (and probably wmb too). > mmiowb() when used in conjunction with a lock which serializes access > to an IO device ensures that the order of stores to the IO device from > different CPUs is well-defined. That's what we're really after here. But if we're pragmatic, we could say that stores which are sitting in the CPU's chipset where they can potentially be reordered, can still _conceptually_ be considered to be in some kind of store queue of the CPU. This would mean that wmb() does have to order these WRT cacheable stores coming from a single CPU. And once you do that, sn2 will _also_ do the right thing with multiple CPUs. > > I guess it is too expensive for you to have mmiowb() in every wmb(), > > because _most_ of the time, all that's needed is ordering between IOs. > > I think it's the other way around. Most of the time all you need is > ordering between RAM stores, so mmiowb() would kill performance if it > was called every time wmb() was invoked. No, we have smp_wmb() for that. > > So why not have io_mb(), io_rmb(), io_wmb(), which order IOs but ignore > > system memory. 
Then the non-prefixed primitives order everything (to the > > point that wmb() is like mmiowb on sn2). > > I'm not sure I follow. Here's the bad sequence we're working with: > > CPU A CPU B Lock owner IO device sees > ----- ----- ---------- -------------- > ... ... unowned > lock() ... CPU A > writel(val_a) lock() ... > unlock() CPU B > ... write(val_b) ... > ... unlock() unowned > ... ... ... val_b > ... ... ... val_a > > > The cacheable store to RAM from CPU A to perform the unlock was > not ordered with respect to the uncacheable writel() to the IO device. > CPU B, which has a different uncacheable store path to the IO device > in the NUMA system, saw the effect of the RAM store before CPU A's > uncacheable store arrived at the IO device. CPU B then owned the > lock, performed its own uncacheable store to the IO device, and > released the lock. The two uncacheable stores are taking different > routes to the device, and end up arriving in the wrong order. > > mmiowb() solves this by causing the following: > > CPU A CPU B Lock owner IO device sees > ----- ----- ---------- -------------- > ... ... Unowned > lock() ... CPU A > writel(val_a) lock() ... > mmiowb() ... val_a > unlock() CPU B > ... write(val_b) ... > ... mmiowb() ... val_b > ... unlock() unowned > > The mmiowb() caused the IO device to see the uncacheable store from > CPU A before CPU B saw the cacheable store from CPU A. Now all is > well with the world. > > I might be exhausting your patience, but this is the key. mmiowb() > causes the IO fabric to see the effects of an uncacheable store > before other CPUs see the effects of a subsequent cacheable store. > That's what's really at the heart of the matter. Yes, I like this, and this is what wmb() should do :) That's what Linux expects it to do. > > Now I guess it's strictly also needed if you want to ensure cacheable > > stores and IO stores are visible to the device in the correct order > > too. I think we'd normally hope wmb() does that for us too (hence all > > my rambling above). > > There's really three perspectives to consider, not just the CPU and IO > device: > > 1. CPU A performing locking and issuing IO stores. > 2. The IO device receiving stores. > 3. CPU B performing locking and issuing IO stores. > > The lock ensures that the IO device sees stores from a single CPU > at a time. wmb() ensures that CPU A and CPU B see the effect > of cacheable stores in the same order as eachother. mmiowb() > ensures that the IO device has seen all the uncacheable stores from > CPU A before CPU B sees the cacheable stores from CPU A. > > Wow. I like that last paragraph. I think I'll send now... OK, now we _could_ consider the path to the IO device to be a 3rd party that can reorder the IOs, but I'm coming to think that such a concept need not be added if we instead consider that the reordering portion is still part of the originating CPU and thus subject to a wmb(). I was talking with Linus about this today, and he might have had an opinion. He didn't like my io_wmb() idea, but instead thinks that _every_ IO operation should be ordered WRT one another (eg. get rid of the fancy __relaxed ones). That's fine, and once you do that, you can get rid of lots of wmb(), and wmb() remains just for the places where you want to order cacheable and uncacheable stores. And now that wmb() is called much less often, you can define it to actually match the expected Linux model. 
I'm really not just trying to cause trouble here ;) The ordering details of IO and IO/memory seem to be a mess -- it is defined differently for different architectures, barriers are doing different things, *writel* etc. functions have different ordering rules depending on the arch, etc. ^ permalink raw reply [flat|nested] 26+ messages in thread
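Putting the two orderings discussed in this thread into a single function, under the existing API and with an invented device layout: wmb() orders the descriptor stores in RAM against the doorbell writel(), and mmiowb() orders the doorbell against the lock release.

#include <linux/spinlock.h>
#include <linux/io.h>

struct foo_desc {		/* descriptor in coherent RAM (invented) */
	u32	addr;
	u32	len;
};

static void foo_start_dma(struct foo_desc *desc, void __iomem *regs,
			  spinlock_t *lock, u32 addr, u32 len)
{
	spin_lock(lock);

	/* 1. Fill in the descriptor in coherent RAM. */
	desc->addr = addr;
	desc->len  = len;

	/*
	 * 2. wmb(): the descriptor must be visible to the device before
	 *    the doorbell write tells it to start fetching.
	 */
	wmb();
	writel(1, regs + 0x8);		/* "start DMA" doorbell (invented) */

	/*
	 * 3. mmiowb(): the doorbell must reach the fabric before another
	 *    CPU can take the lock and ring its own doorbell.
	 */
	mmiowb();
	spin_unlock(lock);
}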
* Re: wmb vs mmiowb 2007-08-23 16:16 ` Linus Torvalds 2007-08-23 16:27 ` Benjamin Herrenschmidt @ 2007-08-24 2:59 ` Nick Piggin 1 sibling, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-24 2:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Thu, Aug 23, 2007 at 09:16:42AM -0700, Linus Torvalds wrote: > > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > Also, FWIW, there are some advantages of deferring the mmiowb thingy > > until the point of unlock. > > And that is exactly what ppc64 does. > > But you're missing a big point: for 99.9% of all hardware, mmiowb() is a > total no-op. So when you talk about "advantages", you're not talking about > any *real* advantage, are you? You're in a feisty mood today ;) I guess on the 0.1% of hradware where it is not a noop, there might be a real advantage... but that was just handwaving anyway. My real point was that I'd like things to be more easily understandable. I think we are agreed at this point that mmiowb without some form of CPU synchronisation is a bug, and it is also not of the same type of barrier that we normally think about in the kernel (it could be like a MPI style rendezvous barrier between the CPU and the IO fabric). Anyway, point is that device drivers seem to have enough on their plate already. Look at bcm43xx, for example. Most of this guy's mmiowb()s are completely wrong and should be wmb(). mmiowb() is only a wmb() on ppc because as I said, ppc's spin_unlock does not order IOs like most other architectures. On alpha, for example, spin_unlock does order IOs, so mmiowb is a noop, and this is broken (non-sn2 ia64 should also be a noop here, because their unlock orders IOs, but it seems that mmiowb semantics are so non obvious that they either got it wrong themselves, or assumed device drivers surely would). ^ permalink raw reply [flat|nested] 26+ messages in thread
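The misuse Nick describes looks roughly like the first function below; the second shows the usage the thread agrees is intended. Device registers and names are illustrative.

#include <linux/spinlock.h>
#include <linux/io.h>

/*
 * Suspect: mmiowb() used as if it ordered one MMIO write against the
 * next from the same CPU.  Per this thread, that is wmb()'s job, and on
 * platforms where spin_unlock() already orders IO, mmiowb() is a no-op,
 * so it provides nothing here at all.
 */
static void foo_wrong(void __iomem *regs)
{
	writel(0x1, regs + 0x0);
	mmiowb();			/* not what mmiowb() is for */
	writel(0x2, regs + 0x4);
}

/*
 * Intended: mmiowb() just before dropping the lock that serializes
 * access to the device.
 */
static void foo_right(void __iomem *regs, spinlock_t *lock)
{
	spin_lock(lock);
	writel(0x1, regs + 0x0);
	writel(0x2, regs + 0x4);
	mmiowb();
	spin_unlock(lock);
}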
* Re: wmb vs mmiowb 2007-08-23 2:20 ` Nick Piggin 2007-08-23 2:57 ` Linus Torvalds @ 2007-08-23 17:02 ` Jesse Barnes 1 sibling, 0 replies; 26+ messages in thread From: Jesse Barnes @ 2007-08-23 17:02 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Linus Torvalds, linuxppc-dev > > Yeah, they keep threatening to use this instead, but I'm not sure > > how easy it would be. Also they may have more devices/drivers to > > worry about than sn2, so maybe changing over would mean too much > > driver debugging (well auditing really since it's not that hard to > > know where to put them). Irix actually had an io_unlock() routine > > that did this implicitly, but iirc that was shot down for Linux... > > Why was it shot down? Seems like a pretty good idea to me ;) Well, like Linus said, it had some significant downsides (though I think Irix had fewer lock types, so the multiplicative effect wasn't so bad there). > I'm clueless when it comes to drivers, but I see a lot of mmiowb() > that are not paired with spin_unlock. How are these obvious? (ie. > what is the pattern?) It looks like some might be lockless FIFOs (or > maybe I'm just not aware of where the locks are). Can you just > quickly illustrate the problem being solved? Wow, it certainly has proliferated since it was added to the tree. :) I didn't audit all the uses, but it seems like many of them get it right, i.e. mmiowb() before spin_unlock() where PIO has been done. I'd have to look carefully to see whether lockless usages are correct, it's likely they're not. Jesse ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-22 18:07 ` Linus Torvalds 2007-08-22 19:02 ` Jesse Barnes @ 2007-08-23 1:59 ` Nick Piggin 2007-08-23 7:27 ` Benjamin Herrenschmidt 2 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-23 1:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Wed, Aug 22, 2007 at 11:07:32AM -0700, Linus Torvalds wrote: > > > On Wed, 22 Aug 2007, Nick Piggin wrote: > > > > It took me more than a glance to see what the difference is supposed to be > > between wmb() and mmiowb(). I think especially because mmiowb isn't really > > like a write barrier. > > Well, it is, but it isn't. Not on its own - but together with a "normal" > barrier it is. But it is stronger (or different) to write barrier semantics, because it enforces the order in which a 3rd party (the IO device) sees writes from multiple CPUs. The rest of our barrier concept is based purely on the POV of the single entity executing the barrier. Now it's needed because the IO device is not participating in the same synchronisation logic that the CPUs are, which is why I say it is more like a synchronisation primitive than a barrier primitive. > > wmb is supposed to order all writes coming out of a single CPU, so that's > > pretty simple. > > No. wmb orders all *normal* writes coming out of a single CPU. I'm pretty sure wmb() should order *all* writes, and smp_wmb() is what you're thinking of for ordering regular writes to cacheable memory. > It may not do anything at all for "uncached" IO writes that aren't part of > the cache coherency, and that are handled using totally different queues > (both inside and outside of the CPU)! > > Now, on x86, the CPU actually tends to order IO writes *more* than it > orders any other writes (they are mostly entirely synchronous, unless the > area has been marked as write merging), but at least on PPC, it's the > other way around: without the cache as a serialization entry, you end up > having a totally separate queueu to serialize, and a regular-memory write > barrier does nothing at all to the IO queue. Well PPC AFAIKS doesn't need the special synchronisation semantics of this mmiowb primitive -- the reason it is not a noop is because the API seems to also imply a wmb() (which is fine, and you'd normally want that eg. uncacheable stores must be ordered with the spin_unlock store). It is just implemented with the PPC sync instruction, which just orders all stores coming out of _this_ CPU. Their IO fabric must prevent IOs from being reordered between CPUs if they're executed in a known order (which is what Altix does not prevent). > So think of the IO write queue as something totally asynchronous that has > zero connection to the normal write ordering - and then think of mmiowb() > as a way to *insert* a synchronization point. If wmb (the non _smp one) orders all stores including IO stores, then it should be sufficient to prevent IO writes from leaking out of a critical section. The problem is that the "reader" (the IO device) itself is not coherent. So _synchronisation_ point is right; it is not really a barrier. Basically it says all IO writes issued by this CPU at this point will be seen before any other IO writes issued by any other CPUs subsequently. make_mmio_coherent()? queue_mmio_writes()? (I'd still prefer some kind of acquire/release API that shows why CPU/CPU order matters too, and how it is taken care of). > > It really seems like it is some completely different concept from a > > barrier. 
And it shows, on the platform where it really matters (sn2), where > > the thing actually spins. > > I agree that it probably isn't a "write barrier" per se. Think of it as a > "tie two subsystems together" thing. Yes, in a way it is more like that. Which does fit with my suggestions for a name. > (And it doesn't just matter on sn2. It also matters on powerpc64, although > I think they just set a flag and do the *real* sync in the spin_unlock() > path). > > Side note: the thing that makes "mmiowb()" even more exciting is that it's > not just the CPU, it's the fabric outside the CPU that matters too. That's > why the sn2 needs this - but the powerpc example shows a case where the > ordering requirement actually comes from the CPU itself. Well I think sn2 is the *only* reason it matters. When the ordering requirement is coming from the CPU itself, that *is* just a traditional write barrier (one which orders normal and io writes). The funny things powerpc are doing in spin_unlock/etc. are a different issue. Basically they are just helping along device drivers who get this wrong and assume spinlocks order IOs; our lack of an acquire/release API for IOs... they're just trying to get through this sorry state of affairs without going insane ;) Powerpc is special here because their ordering instructions distinguish between normal and IO, whereas most others don't (including ia64, alpha, etc), so _most_ others do get their IOs ordered by critical sections. This is a different issue to the mmiowb one (but still shows that our APIs could be improved). Why don't we get a nice easy spin_lock_io/spin_unlock_io, which takes care of all the mmiowb and iowrite vs spin unlock problems? (individual IOs within the lock would still need to be ordered as appropriate). Then we could also have a serialize_io()/unserialize_io() that takes care of the same things but can be used when we have something other than a spinlock for ordering CPUs (serialize_io may be a noop, but it is good to ensure people are thinking about how they're excluding other CPUs here -- if other CPUs are not excluded, then any code calling mmiowb is buggy, right?). ^ permalink raw reply [flat|nested] 26+ messages in thread
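The proposed API could be sketched as below. None of these names exist in the kernel; the sketch just layers them on the primitives that do exist, so that the mmiowb()-before-unlock pairing cannot be forgotten.

    #include <linux/spinlock.h>
    #include <linux/io.h>

    /* Sketch only -- not an existing kernel API. */
    #define spin_lock_io(lock)      spin_lock(lock)

    #define spin_unlock_io(lock)            \
            do {                            \
                    mmiowb();               \
                    spin_unlock(lock);      \
            } while (0)

    /*
     * For code that orders CPUs by something other than a spinlock.  The
     * "acquire" side could well be a no-op; it mainly documents that some
     * CPU-CPU exclusion is in place, without which mmiowb() cannot help.
     */
    #define serialize_io()          do { } while (0)
    #define unserialize_io()        mmiowb()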
* Re: wmb vs mmiowb 2007-08-22 18:07 ` Linus Torvalds 2007-08-22 19:02 ` Jesse Barnes 2007-08-23 1:59 ` Nick Piggin @ 2007-08-23 7:27 ` Benjamin Herrenschmidt 2007-08-23 16:56 ` Jesse Barnes 2 siblings, 1 reply; 26+ messages in thread From: Benjamin Herrenschmidt @ 2007-08-23 7:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, linux-ia64, Jesse Barnes, linuxppc-dev > Of course, the normal memory barrier would usually be a "spin_unlock()" or > something like that, not a "wmb()". In fact, I don't think the powerpc > implementation (as an example of this) will actually synchronize with > anything *but* a spin_unlock(). We are even more sneaky in the sense that we set a per-cpu flag on any MMIO write and do the sync automatically in spin_unlock() :-) Ben. ^ permalink raw reply [flat|nested] 26+ messages in thread
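The sneaky scheme described above amounts to something like the sketch below. The names (io_sync_pending, my_writel, my_spin_unlock) are invented for illustration; the literal powerpc code keeps its flag in the paca (so it is per hardware thread and unaffected by preemption) and issues a real "sync".

    #include <linux/types.h>
    #include <linux/percpu.h>
    #include <linux/spinlock.h>
    #include <linux/io.h>
    #include <asm/system.h>

    static DEFINE_PER_CPU(int, io_sync_pending);

    static inline void my_writel(u32 val, void __iomem *addr)
    {
            __raw_writel(val, addr);                /* the MMIO store itself */
            __get_cpu_var(io_sync_pending) = 1;     /* remember we did MMIO  */
    }

    static inline void my_spin_unlock(spinlock_t *lock)
    {
            if (__get_cpu_var(io_sync_pending)) {
                    __get_cpu_var(io_sync_pending) = 0;
                    mb();                           /* stands in for the "sync" */
            }
            spin_unlock(lock);
    }

This way a driver that only takes and releases the lock still gets its MMIO writes ordered against the unlock, whether or not it remembered mmiowb().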
* Re: wmb vs mmiowb 2007-08-23 7:27 ` Benjamin Herrenschmidt @ 2007-08-23 16:56 ` Jesse Barnes 2007-08-24 3:12 ` Nick Piggin 2007-08-28 21:21 ` Brent Casavant 0 siblings, 2 replies; 26+ messages in thread From: Jesse Barnes @ 2007-08-23 16:56 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Nick Piggin, linux-ia64, Linus Torvalds, linuxppc-dev On Thursday, August 23, 2007 12:27 am Benjamin Herrenschmidt wrote: > > Of course, the normal memory barrier would usually be a > > "spin_unlock()" or something like that, not a "wmb()". In fact, I > > don't think the powerpc implementation (as an example of this) will > > actually synchronize with anything *but* a spin_unlock(). > > We are even more sneaky in the sense that we set a per-cpu flag on > any MMIO write and do the sync automatically in spin_unlock() :-) Yeah, that's a reasonable thing to do, and in fact I think there's code to do something similar when a task is switched out (this keeps user level drivers from having to do mmiowb() type things). FWIW, I think I had an earlier version of the patch that used the name pioflush() or something similar, the only confusing thing about that name is that the primitive doesn't actually force I/Os down to the device level, just to the closest bridge. It'll be interesting to see if upcoming x86 designs share this problem (e.g. large HT or CSI topologies). Jesse ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 16:56 ` Jesse Barnes @ 2007-08-24 3:12 ` Nick Piggin 2007-08-28 21:21 ` Brent Casavant 1 sibling, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-24 3:12 UTC (permalink / raw) To: Jesse Barnes; +Cc: linux-ia64, Linus Torvalds, linuxppc-dev On Thu, Aug 23, 2007 at 09:56:16AM -0700, Jesse Barnes wrote: > On Thursday, August 23, 2007 12:27 am Benjamin Herrenschmidt wrote: > > > Of course, the normal memory barrier would usually be a > > > "spin_unlock()" or something like that, not a "wmb()". In fact, I > > > don't think the powerpc implementation (as an example of this) will > > > actually synchronize with anything *but* a spin_unlock(). > > > > We are even more sneaky in the sense that we set a per-cpu flag on > > any MMIO write and do the sync automatically in spin_unlock() :-) > > Yeah, that's a reasonable thing to do, and in fact I think there's code > to do something similar when a task is switched out (this keeps user > level drivers from having to do mmiowb() type things). It might be worth doing that and removing mmiowb completely. Or, if that's too expensive, I'd like to see an API that is more explicitly for keeping IOs inside critical sections. > FWIW, I think I had an earlier version of the patch that used the name > pioflush() or something similar, the only confusing thing about that > name is that the primitive doesn't actually force I/Os down to the > device level, just to the closest bridge. Yeah, that's what I found when trying to think of a name ;) It is like an intermediate-level flush for the platform code, but from the POV of the driver writer, it is nothing of the sort ;) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 16:56 ` Jesse Barnes 2007-08-24 3:12 ` Nick Piggin @ 2007-08-28 21:21 ` Brent Casavant 2007-08-28 23:01 ` Peter Chubb 1 sibling, 1 reply; 26+ messages in thread From: Brent Casavant @ 2007-08-28 21:21 UTC (permalink / raw) To: Jesse Barnes; +Cc: Nick Piggin, linux-ia64, Linus Torvalds, linuxppc-dev On Thu, 23 Aug 2007, Jesse Barnes wrote: > On Thursday, August 23, 2007 12:27 am Benjamin Herrenschmidt wrote: > > > Of course, the normal memory barrier would usually be a > > > "spin_unlock()" or something like that, not a "wmb()". In fact, I > > > don't think the powerpc implementation (as an example of this) will > > > actually synchronize with anything *but* a spin_unlock(). > > > > We are even more sneaky in the sense that we set a per-cpu flag on > > any MMIO write and do the sync automatically in spin_unlock() :-) > > Yeah, that's a reasonable thing to do, and in fact I think there's code > to do something similar when a task is switched out (this keeps user > level drivers from having to do mmiowb() type things). Yes there is, git commit e08e6c521355cd33e647b2f739885bc3050eead6. On SN2 any user process performing memory-mapped IO directly to a device needs something like mmiowb() to be performed at the node of the CPU it last ran on when the task context switches onto a new CPU. The current code performs this action for all inter-CPU context switches, but we had discussed the possibility of targeting the action only when the user process has actually mapped a device for IO. I believe it was decided that this level of complexity wasn't warranted unless this simple solution was found to cause a problem. That reminds me. Are the people who are working on the user-level driver effort including a capability similar to mmiowb()? If we had that capability we could eventually do away with the change mentioned above. But that would come after all user-level drivers were coded to include the mmiowb()-like calls, and existing drivers which provide mmap() capability directly to hardware go away. Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 26+ messages in thread
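The shape of that context-switch hook, as an illustration only: the helper name and the flag argument below are invented, and this is not the code from the commit cited above. Whether the flag is tracked per task or simply assumed true for every inter-CPU switch is exactly the trade-off discussed in the message.

    #include <linux/io.h>

    static inline void sketch_mmio_switch_out(int task_did_user_mmio)
    {
            if (task_did_user_mmio)
                    mmiowb();       /* drain this CPU's posted PIO writes before
                                     * the task can resume on a different CPU */
    }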
* Re: wmb vs mmiowb 2007-08-28 21:21 ` Brent Casavant @ 2007-08-28 23:01 ` Peter Chubb 0 siblings, 0 replies; 26+ messages in thread From: Peter Chubb @ 2007-08-28 23:01 UTC (permalink / raw) To: Brent Casavant Cc: Nick Piggin, linux-ia64, linuxppc-dev, Jesse Barnes, Linus Torvalds >>>>> "Brent" == Brent Casavant <bcasavan@sgi.com> writes: Brent> That reminds me. Are the people who are working on the Brent> user-level driver effort including a capability similar to Brent> mmiowb()? If we had that capability we could eventually do Brent> away with the change mentioned above. But that would come Brent> after all user-level drivers were coded to include the Brent> mmiowb()-like calls, and existing drivers which provide mmap() Brent> capability directly to hardware go away. Not at present, because the platforms *I'm* mostly targeting at the moment don't need it. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au http://www.ertos.nicta.com.au ERTOS within National ICT Australia ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-22 4:57 wmb vs mmiowb Nick Piggin 2007-08-22 18:07 ` Linus Torvalds @ 2007-08-23 7:25 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 26+ messages in thread From: Benjamin Herrenschmidt @ 2007-08-23 7:25 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, linux-ia64, Jesse Barnes, linuxppc-dev On Wed, 2007-08-22 at 06:57 +0200, Nick Piggin wrote: > It doesn't seem like this primary function of mmiowb has anything to do > with a write barrier that we are used to (it may have a seconary semantic > of a wmb as well, but let's ignore that for now). A write barrier will > never provide you with those semantics (writes from 2 CPUs seen in the > same order by a 3rd party). If anything, I think it is closer to being > a read barrier issued on behalf of the target device. But even that I > think is not much better, because the target is not participating in the > synchronisation that the CPUs are, so the "read barrier request" could > still arrive at the device out of order WRT the other CPU's writes. > > It really seems like it is some completely different concept from a > barrier. And it shows, on the platform where it really matters (sn2), where > the thing actually spins. The way mmiowb was actually defined to me by the ia64 folks who came up with it is essentially to order an MMIO write with a spin_unlock. Ben. ^ permalink raw reply [flat|nested] 26+ messages in thread