* wmb vs mmiowb @ 2007-08-22 4:57 Nick Piggin 2007-08-22 18:07 ` Linus Torvalds 2007-08-23 7:25 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-22 4:57 UTC (permalink / raw) To: Jesse Barnes, Benjamin Herrenschmidt, Linus Torvalds Cc: linuxppc-dev, linux-ia64 Hi, I'm ignorant when it comes to IO access, so I hope this isn't rubbish (if it is, I would appreciate being corrected). It took me more than a glance to see what the difference is supposed to be between wmb() and mmiowb(). I think especially because mmiowb isn't really like a write barrier. wmb is supposed to order all writes coming out of a single CPU, so that's pretty simple. The problem is that writes coming from different CPUs can be seen by the device in a different order from which they were written if coming from different CPUs, even if the order of writes is guaranteed (eg. by a spinlock) and issued in the right order WRT the locking (ie. using wmb()). And this can happen because the writes can get posted away and reordered by the IO fabric (I think). mmiowb ensures the writes are seen by the device in the correct order. It doesn't seem like this primary function of mmiowb has anything to do with a write barrier that we are used to (it may have a seconary semantic of a wmb as well, but let's ignore that for now). A write barrier will never provide you with those semantics (writes from 2 CPUs seen in the same order by a 3rd party). If anything, I think it is closer to being a read barrier issued on behalf of the target device. But even that I think is not much better, because the target is not participating in the synchronisation that the CPUs are, so the "read barrier request" could still arrive at the device out of order WRT the other CPU's writes. It really seems like it is some completely different concept from a barrier. And it shows, on the platform where it really matters (sn2), where the thing actually spins. I don't know exactly how it should be categorised. On one hand, it is kind of like a critical section, and would work beautifully if we could just hide it inside spin_lock_io/spin_unlock_io. On the other hand, it seems like it is often used separately from locks, where it looks decidedly less like a critical section or release barrier. How can such uses be correct if they are worried about multi-CPU ordering but don't have anything to synchronize the CPUs? Or are they cleverly enforcing CPU ordering some other way? (in which case, maybe an acquire/release API really would make sense?). I don't really have a big point, except that I would like to know whether I'm on the right track, and wish the thing could have a better name/api. Thanks, Nick ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-22 4:57 wmb vs mmiowb Nick Piggin @ 2007-08-22 18:07 ` Linus Torvalds 2007-08-22 19:02 ` Jesse Barnes ` (2 more replies) 2007-08-23 7:25 ` Benjamin Herrenschmidt 1 sibling, 3 replies; 26+ messages in thread From: Linus Torvalds @ 2007-08-22 18:07 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Wed, 22 Aug 2007, Nick Piggin wrote: > > It took me more than a glance to see what the difference is supposed to be > between wmb() and mmiowb(). I think especially because mmiowb isn't really > like a write barrier. Well, it is, but it isn't. Not on its own - but together with a "normal" barrier it is. > wmb is supposed to order all writes coming out of a single CPU, so that's > pretty simple. No. wmb orders all *normal* writes coming out of a single CPU. It may not do anything at all for "uncached" IO writes that aren't part of the cache coherency, and that are handled using totally different queues (both inside and outside of the CPU)! Now, on x86, the CPU actually tends to order IO writes *more* than it orders any other writes (they are mostly entirely synchronous, unless the area has been marked as write merging), but at least on PPC, it's the other way around: without the cache as a serialization entry, you end up having a totally separate queueu to serialize, and a regular-memory write barrier does nothing at all to the IO queue. So think of the IO write queue as something totally asynchronous that has zero connection to the normal write ordering - and then think of mmiowb() as a way to *insert* a synchronization point. In particular, the normal synchronization primitives (spinlocks, mutexes etc) are guaranteed to synchronize only normal memory accesses. So if you do MMIO inside a spinlock, since the MMIO writes are totally asyncronous wrt the normal memory accesses, the MMIO write can escape outside the spinlock unless you have somethign that serializes the MMIO accesses with the normal memory accesses. So normally you'd see "mmiowb()" always *paired* with a normal memory barrier! The "mmiowb()" ends up synchronizing the MMIO writes with the normal memory accesses, and then the normal memory barrier acts as a barrier for subsequent writes. Of course, the normal memory barrier would usually be a "spin_unlock()" or something like that, not a "wmb()". In fact, I don't think the powerpc implementation (as an example of this) will actually synchronize with anything *but* a spin_unlock(). > It really seems like it is some completely different concept from a > barrier. And it shows, on the platform where it really matters (sn2), where > the thing actually spins. I agree that it probably isn't a "write barrier" per se. Think of it as a "tie two subsystems together" thing. (And it doesn't just matter on sn2. It also matters on powerpc64, although I think they just set a flag and do the *real* sync in the spin_unlock() path). Side note: the thing that makes "mmiowb()" even more exciting is that it's not just the CPU, it's the fabric outside the CPU that matters too. That's why the sn2 needs this - but the powerpc example shows a case where the ordering requirement actually comes from the CPU itself. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
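For readers trying to picture the pairing Linus describes, the usual driver-side pattern looks something like the sketch below. The device structure, lock and register offset are invented for illustration; only writel(), mmiowb() and the spinlock calls are the real kernel API.

#include <linux/spinlock.h>
#include <linux/io.h>

/* Hypothetical device state; the names are illustrative only. */
struct foo_dev {
	void __iomem	*regs;
	spinlock_t	lock;
};

static void foo_kick(struct foo_dev *dev, u32 cmd)
{
	unsigned long flags;

	spin_lock_irqsave(&dev->lock, flags);

	writel(cmd, dev->regs + 0x10);	/* MMIO write, posted */

	/*
	 * Make sure the MMIO write above cannot be passed by another
	 * CPU's MMIO writes before the cacheable store that releases
	 * the lock becomes visible.
	 */
	mmiowb();

	spin_unlock_irqrestore(&dev->lock, flags);
}

Without the mmiowb(), a second CPU could take the lock and have its own writel() reach the device before the first CPU's write, even though the locking itself was done correctly.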
* Re: wmb vs mmiowb 2007-08-22 18:07 ` Linus Torvalds @ 2007-08-22 19:02 ` Jesse Barnes 2007-08-23 2:20 ` Nick Piggin 2007-08-23 1:59 ` Nick Piggin 2007-08-23 7:27 ` Benjamin Herrenschmidt 2 siblings, 1 reply; 26+ messages in thread From: Jesse Barnes @ 2007-08-22 19:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, linux-ia64, linuxppc-dev On Wednesday, August 22, 2007 11:07 am Linus Torvalds wrote: > > It really seems like it is some completely different concept from a > > barrier. And it shows, on the platform where it really matters > > (sn2), where the thing actually spins. > > I agree that it probably isn't a "write barrier" per se. Think of it > as a "tie two subsystems together" thing. Right, maybe it's not the best name, but as long as you separate your memory access types, you can think of it as a real write barrier, just for mmio accesses (well uncached access really). > (And it doesn't just matter on sn2. It also matters on powerpc64, > although I think they just set a flag and do the *real* sync in the > spin_unlock() path). Yeah, they keep threatening to use this instead, but I'm not sure how easy it would be. Also they may have more devices/drivers to worry about than sn2, so maybe changing over would mean too much driver debugging (well auditing really since it's not that hard to know where to put them). Irix actually had an io_unlock() routine that did this implicitly, but iirc that was shot down for Linux... Jesse ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-22 19:02 ` Jesse Barnes @ 2007-08-23 2:20 ` Nick Piggin 2007-08-23 2:57 ` Linus Torvalds 2007-08-23 17:02 ` Jesse Barnes 0 siblings, 2 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-23 2:20 UTC (permalink / raw) To: Jesse Barnes; +Cc: linux-ia64, Linus Torvalds, linuxppc-dev On Wed, Aug 22, 2007 at 12:02:11PM -0700, Jesse Barnes wrote: > On Wednesday, August 22, 2007 11:07 am Linus Torvalds wrote: > > > It really seems like it is some completely different concept from a > > > barrier. And it shows, on the platform where it really matters > > > (sn2), where the thing actually spins. > > > > I agree that it probably isn't a "write barrier" per se. Think of it > > as a "tie two subsystems together" thing. > > Right, maybe it's not the best name, but as long as you separate your > memory access types, you can think of it as a real write barrier, just > for mmio accesses (well uncached access really).

If we have the following situation (all vars start at 0)

    CPU0                    CPU1                    CPU2
    spin_lock(&lock);                               ~
    A = 1;                                          ~
    wmb();                                          ~
    B = 2;                                          ~
    spin_unlock(&lock);                             X = B;
                            spin_lock(&lock);       rmb();
                            A = 10;                 Y = A;
                            wmb();                  ~
                            B = 11;                 ~
                            spin_unlock(&lock);     ~

(I use the '~' just to show CPU2 is not specifically temporally related to CPU0 or CPU1). Then CPU2 could have X==11 and Y==1, according to the Linux abstract memory consistency model, couldn't it? I think so, and I think this is what your mmiowb is trying to protect. In the above situation, CPU2 would just use the spinlock -- I don't think we have a simple primitive that CPU0 and 1 can call to prevent this reordering at CPU2. An IO device obviously can't use a spinlock :). > > (And it doesn't just matter on sn2. It also matters on powerpc64, > > although I think they just set a flag and do the *real* sync in the > > spin_unlock() path). > > Yeah, they keep threatening to use this instead, but I'm not sure how > easy it would be. Also they may have more devices/drivers to worry > about than sn2, so maybe changing over would mean too much driver > debugging (well auditing really since it's not that hard to know where > to put them). Irix actually had an io_unlock() routine that did this > implicitly, but iirc that was shot down for Linux... Why was it shot down? Seems like a pretty good idea to me ;) I'm clueless when it comes to drivers, but I see a lot of mmiowb() that are not paired with spin_unlock. How are these obvious? (ie. what is the pattern?) It looks like some might be lockless FIFOs (or maybe I'm just not aware of where the locks are). Can you just quickly illustrate the problem being solved? Thanks, Nick ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 2:20 ` Nick Piggin @ 2007-08-23 2:57 ` Linus Torvalds 2007-08-23 3:54 ` Nick Piggin 2007-08-23 4:20 ` Nick Piggin 2007-08-23 17:02 ` Jesse Barnes 1 sibling, 2 replies; 26+ messages in thread From: Linus Torvalds @ 2007-08-23 2:57 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Thu, 23 Aug 2007, Nick Piggin wrote: > > > Irix actually had an io_unlock() routine that did this > > implicitly, but iirc that was shot down for Linux... > > Why was it shot down? Seems like a pretty good idea to me ;) It's horrible. We'd need it for *every* single spinlock type. We have lots of them. So the choice is between: - sane: mmiowb() followed by any of the existing "spin_unlock()" variants (plain, _irq(), _bh(), _irqrestore()) - insane: multiply our current set of unlock primitives by two, by making "io" versions for them all: spin_unlock_io[_irq|_irqrestore|_bh]() but there's actually an EVEN WORSE problem with the stupid Irix approach, namely that it requires that the unlocker be aware of the exact details of what happens inside the lock. If the locking is done at an outer layer, that's not at all obvious! In other words, Irix (once again) made a horrible and idiotic choice. Big surprise. Irix was probably the flakiest and worst of all the commercial proprietary unixes. No taste. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 2:57 ` Linus Torvalds @ 2007-08-23 3:54 ` Nick Piggin 2007-08-23 16:14 ` Linus Torvalds 2007-08-23 4:20 ` Nick Piggin 1 sibling, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-23 3:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Wed, Aug 22, 2007 at 07:57:56PM -0700, Linus Torvalds wrote: > > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > > Irix actually had an io_unlock() routine that did this > > > implicitly, but iirc that was shot down for Linux... > > > > Why was it shot down? Seems like a pretty good idea to me ;) > > It's horrible. We'd need it for *every* single spinlock type. We have lots > of them. > > So the choice is between: > > - sane: > > mmiowb() > > followed by any of the existing "spin_unlock()" variants (plain, > _irq(), _bh(), _irqrestore()) > > - insane: multiply our current set of unlock primitives by two, by making > "io" versions for them all: > > spin_unlock_io[_irq|_irqrestore|_bh]() > > but there's actually an EVEN WORSE problem with the stupid Irix approach, > namely that it requires that the unlocker be aware of the exact details of > what happens inside the lock. If the locking is done at an outer layer, > that's not at all obvious! OK, but we'd have some kind of functions that are called not to serialise the CPUs, but to serialise the IO. It would be up to the calling code to already provide CPU synchronisation. serialize_io(); / unserialize_io(); / a nicer name If we could pass in some kind of relevant resoure (eg. the IO memory or device or something), then we might even be able to put debug checks there to ensure two CPUs are never inside the same critical IO section at once. > In other words, Irix (once again) made a horrible and idiotic choice. We could make a better one. I don't think mmiowb is really insane, but I'd worry it being confused with a regular type of barrier and that CPU synchronisation needs to be provided for it to work or make sense. > Big surprise. Irix was probably the flakiest and worst of all the > commercial proprietary unixes. No taste. Is it? I've never used it ;) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 3:54 ` Nick Piggin @ 2007-08-23 16:14 ` Linus Torvalds 0 siblings, 0 replies; 26+ messages in thread From: Linus Torvalds @ 2007-08-23 16:14 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Thu, 23 Aug 2007, Nick Piggin wrote: > > OK, but we'd have some kind of functions that are called not to > serialise the CPUs, but to serialise the IO. It would be up to > the calling code to already provide CPU synchronisation. > > serialize_io(); / unserialize_io(); / a nicer name We could call it "mmiowb()", for example? Radical idea, I know. > If we could pass in some kind of relevant resoure (eg. the IO > memory or device or something), then we might even be able to > put debug checks there to ensure two CPUs are never inside the > same critical IO section at once. We could certainly give it the spinlock as an argument. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
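Nothing like this exists in the tree; purely as a sketch of what passing the spinlock could buy, a debug-checking wrapper might look like the following (the name and the check are invented for illustration):

#include <linux/kernel.h>
#include <linux/spinlock.h>
#include <linux/io.h>

/*
 * Hypothetical debug variant: complain if the caller issues the IO
 * ordering op without holding the lock that is supposed to serialize
 * access to the device.  Note spin_is_locked() only says the lock is
 * held by *someone*, which is weaker than "held by this CPU", so this
 * is a sanity check rather than a proof of correct usage.
 */
static inline void mmiowb_locked(spinlock_t *lock)
{
#ifdef CONFIG_DEBUG_SPINLOCK
	WARN_ON(!spin_is_locked(lock));
#endif
	mmiowb();
}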
* Re: wmb vs mmiowb 2007-08-23 2:57 ` Linus Torvalds 2007-08-23 3:54 ` Nick Piggin @ 2007-08-23 4:20 ` Nick Piggin 2007-08-23 16:16 ` Linus Torvalds 1 sibling, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-23 4:20 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Wed, Aug 22, 2007 at 07:57:56PM -0700, Linus Torvalds wrote: > > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > > Irix actually had an io_unlock() routine that did this > > > implicitly, but iirc that was shot down for Linux... > > > > Why was it shot down? Seems like a pretty good idea to me ;) > > It's horrible. We'd need it for *every* single spinlock type. We have lots > of them. > > So the choice is between: > > - sane: > > mmiowb() > > followed by any of the existing "spin_unlock()" variants (plain, > _irq(), _bh(), _irqrestore()) > > - insane: multiply our current set of unlock primitives by two, by making > "io" versions for them all: > > spin_unlock_io[_irq|_irqrestore|_bh]() > > but there's actually an EVEN WORSE problem with the stupid Irix approach, > namely that it requires that the unlocker be aware of the exact details of > what happens inside the lock. If the locking is done at an outer layer, > that's not at all obvious! Also, FWIW, there are some advantages of deferring the mmiowb thingy until the point of unlock. The disadvantage is that the caller may not know if the inner layer performed ios that require the mmiowb, but the advantage of waiting until unlock is that the wait is deferred for as long as possible, and will hopefully be a shorter one when performed later. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 4:20 ` Nick Piggin @ 2007-08-23 16:16 ` Linus Torvalds 2007-08-23 16:27 ` Benjamin Herrenschmidt 2007-08-24 2:59 ` Nick Piggin 0 siblings, 2 replies; 26+ messages in thread From: Linus Torvalds @ 2007-08-23 16:16 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Thu, 23 Aug 2007, Nick Piggin wrote: > > Also, FWIW, there are some advantages of deferring the mmiowb thingy > until the point of unlock. And that is exactly what ppc64 does. But you're missing a big point: for 99.9% of all hardware, mmiowb() is a total no-op. So when you talk about "advantages", you're not talking about any *real* advantage, are you? Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 16:16 ` Linus Torvalds @ 2007-08-23 16:27 ` Benjamin Herrenschmidt 2007-08-24 3:09 ` Nick Piggin 2007-08-24 2:59 ` Nick Piggin 1 sibling, 1 reply; 26+ messages in thread From: Benjamin Herrenschmidt @ 2007-08-23 16:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, linux-ia64, Jesse Barnes, linuxppc-dev On Thu, 2007-08-23 at 09:16 -0700, Linus Torvalds wrote: > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > Also, FWIW, there are some advantages of deferring the mmiowb thingy > > until the point of unlock. > > And that is exactly what ppc64 does. > > But you're missing a big point: for 99.9% of all hardware, mmiowb() is a > total no-op. So when you talk about "advantages", you're not talking about > any *real* advantage, are you? I wonder whether it might be worth removing mmiowb and having all archs that matter do like ppc64 though... It's just yet another confusing barrier that most driver writers get wrong.. Cheers, Ben. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 16:27 ` Benjamin Herrenschmidt @ 2007-08-24 3:09 ` Nick Piggin 2007-08-28 20:56 ` Brent Casavant 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-24 3:09 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Jesse Barnes, linux-ia64, Linus Torvalds, linuxppc-dev On Thu, Aug 23, 2007 at 06:27:42PM +0200, Benjamin Herrenschmidt wrote: > On Thu, 2007-08-23 at 09:16 -0700, Linus Torvalds wrote: > > > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > > > Also, FWIW, there are some advantages of deferring the mmiowb thingy > > > until the point of unlock. > > > > And that is exactly what ppc64 does. > > > > But you're missing a big point: for 99.9% of all hardware, mmiowb() is a > > total no-op. So when you talk about "advantages", you're not talking about > > any *real* advantage, are you? > > I wonder whether it might be worth removing mmiowb and having all archs > that matter do like ppc64 though... It's just yet another confusing > barrier that most driver writers get wrong.. Only sn2 and powerpc really matter, actually (for different reasons). All smp architectures other than powerpc appear to have barrier instructions that order all memory operations, so IOs never leak out of locking primitives. This is why powerpc wants a wmb (not mmiowb) before spin_unlock to order IOs (pity about other locking primitives). And all platforms other than sn2 don't appear to reorder IOs after they leave the CPU, so only sn2 needs to do the mmiowb thing before spin_unlock. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-24 3:09 ` Nick Piggin @ 2007-08-28 20:56 ` Brent Casavant 2007-08-29 0:59 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Brent Casavant @ 2007-08-28 20:56 UTC (permalink / raw) To: Nick Piggin; +Cc: linuxppc-dev, linux-ia64 On Fri, 24 Aug 2007, Nick Piggin wrote: > And all platforms other than sn2 don't appear to reorder IOs after > they leave the CPU, so only sn2 needs to do the mmiowb thing before > spin_unlock. I'm sure all of the following is already known to most readers, but I thought the paragraph above might potentially cause confusion as to the nature of the problem mmiowb() is solving on SN2. So for the record... SN2 does not reorder IOs issued from a single CPU (that would be insane). Neither does it reorder IOs once they've reached the IO fabric (equally insane). From an individual CPU's perspective, all IOs that it issues to a device will arrive at that device in program order. (In this entire message, all IOs are assumed to be memory-mapped.) The problem mmiowb() helps solve on SN2 is the ordering of IOs issued from multiple CPUs to a single device. That ordering is undefined, as IO transactions are not ordered across CPUs. That is, if CPU A issues an IO at time T, and CPU B at time T+1, CPU B's IO may arrive at the IO fabric before CPU A's IO, particularly if CPU B happens to be closer than CPU B to the target IO bridge on the NUMA network. The simplistic method to solve this is a lock around the section issuing IOs, thereby ensuring serialization of access to the IO device. However, as SN2 does not enforce an ordering between normal memory transactions and memory-mapped IO transactions, you cannot be sure that an IO transaction will arrive at the IO fabric "on the correct side" of the unlock memory transaction using this scheme. Enter mmiowb(). mmiowb() causes SN2 to drain the pending IOs from the current CPU's node. Once the IOs are drained the CPU can safely unlock a normal memory based lock without fear of the unlock's memory write passing any outstanding IOs from that CPU. Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 26+ messages in thread
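In driver terms, the failure Brent describes can be pictured like this, assuming a hypothetical register at offset 0x10 and a lock that correctly serializes the two CPUs:

#include <linux/spinlock.h>
#include <linux/io.h>

/*
 * Both CPUs run this; CPU A wins the lock first, CPU B second.
 *
 *   CPU A: writel(val) can still be queued in A's hub when A's
 *          spin_unlock() store becomes visible to CPU B.
 *   CPU B: takes the lock and issues its own writel(); if B sits
 *          closer to the target bridge, B's value can reach the
 *          device before A's does.
 *
 * The device then sees the writes in the reverse of the locking
 * order.  An mmiowb() before the unlock closes the window.
 */
static void foo_write_reg(void __iomem *regs, spinlock_t *lock, u32 val)
{
	spin_lock(lock);
	writel(val, regs + 0x10);	/* may linger in this node's hub */
	/* without mmiowb() here, the unlock below can "pass" the writel */
	spin_unlock(lock);
}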
* Re: wmb vs mmiowb 2007-08-28 20:56 ` Brent Casavant @ 2007-08-29 0:59 ` Nick Piggin 2007-08-29 18:53 ` Brent Casavant 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-29 0:59 UTC (permalink / raw) To: Brent Casavant; +Cc: linuxppc-dev, linux-ia64 On Tue, Aug 28, 2007 at 03:56:28PM -0500, Brent Casavant wrote: > On Fri, 24 Aug 2007, Nick Piggin wrote: > > > And all platforms other than sn2 don't appear to reorder IOs after > > they leave the CPU, so only sn2 needs to do the mmiowb thing before > > spin_unlock. > > I'm sure all of the following is already known to most readers, but > I thought the paragraph above might potentially cause confusion as > to the nature of the problem mmiowb() is solving on SN2. So for > the record... > > SN2 does not reorder IOs issued from a single CPU (that would be > insane). Neither does it reorder IOs once they've reached the IO > fabric (equally insane). From an individual CPU's perspective, all > IOs that it issues to a device will arrive at that device in program > order. This is why I think mmiowb() is not like a Linux memory barrier. And I presume that the device would see IOs and regular stores from a CPU in program order, given the correct wmb()s? (but maybe I'm wrong... more below). > (In this entire message, all IOs are assumed to be memory-mapped.) > > The problem mmiowb() helps solve on SN2 is the ordering of IOs issued > from multiple CPUs to a single device. That ordering is undefined, as > IO transactions are not ordered across CPUs. That is, if CPU A issues > an IO at time T, and CPU B at time T+1, CPU B's IO may arrive at the > IO fabric before CPU A's IO, particularly if CPU B happens to be closer > than CPU B to the target IO bridge on the NUMA network. > > The simplistic method to solve this is a lock around the section > issuing IOs, thereby ensuring serialization of access to the IO > device. However, as SN2 does not enforce an ordering between normal > memory transactions and memory-mapped IO transactions, you cannot > be sure that an IO transaction will arrive at the IO fabric "on the > correct side" of the unlock memory transaction using this scheme. Hmm. So what if you had the following code executed by a single CPU: writel(data, ioaddr); wmb(); *mem = 10; Will the device see the io write before the store to mem? > Enter mmiowb(). > > mmiowb() causes SN2 to drain the pending IOs from the current CPU's > node. Once the IOs are drained the CPU can safely unlock a normal > memory based lock without fear of the unlock's memory write passing > any outstanding IOs from that CPU. mmiowb needs to have the disclaimer that it's probably wrong if called outside a lock, and it's probably wrong if called between two io writes (need a regular wmb() in that case). I think some drivers are getting this wrong. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-29 0:59 ` Nick Piggin @ 2007-08-29 18:53 ` Brent Casavant 2007-08-30 3:36 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Brent Casavant @ 2007-08-29 18:53 UTC (permalink / raw) To: Nick Piggin; +Cc: linuxppc-dev, linux-ia64 On Wed, 29 Aug 2007, Nick Piggin wrote: > On Tue, Aug 28, 2007 at 03:56:28PM -0500, Brent Casavant wrote: > > The simplistic method to solve this is a lock around the section > > issuing IOs, thereby ensuring serialization of access to the IO > > device. However, as SN2 does not enforce an ordering between normal > > memory transactions and memory-mapped IO transactions, you cannot > > be sure that an IO transaction will arrive at the IO fabric "on the > > correct side" of the unlock memory transaction using this scheme. > > Hmm. So what if you had the following code executed by a single CPU: > > writel(data, ioaddr); > wmb(); > *mem = 10; > > Will the device see the io write before the store to mem? Not necessarily. There is no guaranteed ordering between the IO write arriving at the device and the order of the normal memory reference, regardless of the intervening wmb(), at least on SN2. I believe the missing component in the mental model is the effect of the platform chipset. Perhaps this will help. Uncached writes (i.e. IO writes) are posted to the SN2 SHub ASIC and placed in their own queue which the SHub chip then routes to the appropriate target. This uncached write queue is independent of the NUMA cache-coherency maintained by the SHub ASIC for system memory; the relative order in which the uncached writes and the system memory traffic appear at their respective targets is undefined with respect to eachother. wmb() does not address this situation as it only guarantees that the writes issued from the CPU have been posted to the chipset, not that the chipset itself has posted the write to the final destination. mmiowb() guarantees that all outstanding IO writes have been issued to the IO fabric before proceeding. I like to think of it this way (probably not 100% accurate, but it helps me wrap my brain around this particular point): wmb(): Ensures preceding writes have issued from the CPU. mmiowb(): Ensures preceding IO writes have issued from the system chipset. mmiowb() on SN2 polls a register in SHub that reports the length of the outstanding uncached write queue. When the queue has emptied, it is known that all subsequent normal memory writes will therefore arrive at their destination after all preceding IO writes have arrived at the IO fabric. Thus, typical mmiowb() usage, for SN2's purpose, is to ensure that all IO traffic from a CPU has made it out to the IO fabric before issuing the normal memory transactions which release a RAM-based lock. The lock in this case is the one used to serialize access to a particular IO device. > > mmiowb() causes SN2 to drain the pending IOs from the current CPU's > > node. Once the IOs are drained the CPU can safely unlock a normal > > memory based lock without fear of the unlock's memory write passing > > any outstanding IOs from that CPU. > > mmiowb needs to have the disclaimer that it's probably wrong if called > outside a lock, and it's probably wrong if called between two io writes > (need a regular wmb() in that case). I think some drivers are getting > this wrong. There are situations where mmiowb() can be pressed into service to some other end, but those are rather rare. 
The only instance I am personally familiar with is synchronizing a free-running counter on a PCI device as closely as possible to the execution of a particular line of driver code. A write of the new counter value to the device and subsequent mmiowb() synchronizes that execution point as closely as practical to the IO write arriving at the device. Not perfect, but good enough for my purposes. (This was a hack, by the way, pressing a bit of hardware into a purpose for which it wasn't really designed, ideally the hardware would have had a better mechanism to accomplish this goal.) But in the normal case, I believe you are 100% correct -- wmb() would ensure that the memory-mapped IO writes arrive at the chipset in a particular order, and thus should arrive at the IO hardware in a particular order. mmiowb() would not necessarily accomplish this goal, and is incorrectly used wherever that is the intention. At least for SN2. Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 26+ messages in thread
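The counter trick Brent describes might look roughly like the sketch below; the register offset and function name are invented, since the actual hardware isn't identified here:

#include <linux/io.h>

/*
 * Hypothetical: resynchronize a free-running counter on the card to
 * "now" as seen at this point in the driver.  The mmiowb() waits until
 * the posted write has drained from the node's IO queue, so the store
 * arrives at the device as close as practical to this line of code.
 */
static void foo_sync_counter(void __iomem *regs, u32 now)
{
	writel(now, regs + 0x40);	/* invented counter register */
	mmiowb();
}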
* Re: wmb vs mmiowb 2007-08-29 18:53 ` Brent Casavant @ 2007-08-30 3:36 ` Nick Piggin 2007-08-30 19:42 ` Brent Casavant 0 siblings, 1 reply; 26+ messages in thread From: Nick Piggin @ 2007-08-30 3:36 UTC (permalink / raw) To: Brent Casavant; +Cc: linuxppc-dev, linux-ia64 On Wed, Aug 29, 2007 at 01:53:53PM -0500, Brent Casavant wrote: > On Wed, 29 Aug 2007, Nick Piggin wrote: > > > On Tue, Aug 28, 2007 at 03:56:28PM -0500, Brent Casavant wrote: > > > > The simplistic method to solve this is a lock around the section > > > issuing IOs, thereby ensuring serialization of access to the IO > > > device. However, as SN2 does not enforce an ordering between normal > > > memory transactions and memory-mapped IO transactions, you cannot > > > be sure that an IO transaction will arrive at the IO fabric "on the > > > correct side" of the unlock memory transaction using this scheme. > > > > Hmm. So what if you had the following code executed by a single CPU: > > > > writel(data, ioaddr); > > wmb(); > > *mem = 10; > > > > Will the device see the io write before the store to mem? > > Not necessarily. There is no guaranteed ordering between the IO write > arriving at the device and the order of the normal memory reference, > regardless of the intervening wmb(), at least on SN2. I believe the > missing component in the mental model is the effect of the platform > chipset. > > Perhaps this will help. Uncached writes (i.e. IO writes) are posted > to the SN2 SHub ASIC and placed in their own queue which the SHub chip > then routes to the appropriate target. This uncached write queue is > independent of the NUMA cache-coherency maintained by the SHub ASIC > for system memory; the relative order in which the uncached writes > and the system memory traffic appear at their respective targets is > undefined with respect to eachother. > > wmb() does not address this situation as it only guarantees that > the writes issued from the CPU have been posted to the chipset, > not that the chipset itself has posted the write to the final > destination. mmiowb() guarantees that all outstanding IO writes > have been issued to the IO fabric before proceeding. > > I like to think of it this way (probably not 100% accurate, but it > helps me wrap my brain around this particular point): > > wmb(): Ensures preceding writes have issued from the CPU. > mmiowb(): Ensures preceding IO writes have issued from the > system chipset. > > mmiowb() on SN2 polls a register in SHub that reports the length > of the outstanding uncached write queue. When the queue has emptied, > it is known that all subsequent normal memory writes will therefore > arrive at their destination after all preceding IO writes have arrived > at the IO fabric. > > Thus, typical mmiowb() usage, for SN2's purpose, is to ensure that > all IO traffic from a CPU has made it out to the IO fabric before > issuing the normal memory transactions which release a RAM-based > lock. The lock in this case is the one used to serialize access > to a particular IO device. OK, thanks for that. I think I have a rough idea of how they both work... I was just thinking (hoping) that, although the writel may not reach the device before the store reaches memory, it would _appear_ that way from the POV of the device (ie. if the device were to DMA from mem). But that's probably wishful thinking because the memory might be on some completely different part of the system. I don't know whether this is exactly a correct implementation of Linux's barrier semantics. 
On one hand, wmb _is_ ordering the stores as they come out of the CPU; on the other, it isn't ordering normal stores with respect to writel from the POV of the device (which is seems to be what is expected by the docs and device driver writers). One argument says that the IO device or chipset is a seperate agent and thus isn't subject to ordering... which is sort of valid, but it is definitely not an agent equal to a CPU, because it can't actively participate in the synchronisation protocol. And on the other side, it just doesn't seem so useful just to know that stores coming out of the CPU are ordered if they can be reordered by an intermediate. Why even have wmb() at all, if it doesn't actually order stores to IO and RAM? powerpc's wmb() could just as well be an 'eieio' if it were to follow your model; that instruction orders IO, but not WRT cacheable stores. So you could argue that the chipset is an extention of the CPU's IO/memory subsystem and should follow the ordering specified by the CPU. I like this idea because it could make things simpler and more regular for the Linux barrier model. I guess it is too expensive for you to have mmiowb() in every wmb(), because _most_ of the time, all that's needed is ordering between IOs. So why not have io_mb(), io_rmb(), io_wmb(), which order IOs but ignore system memory. Then the non-prefixed primitives order everything (to the point that wmb() is like mmiowb on sn2). > > > mmiowb() causes SN2 to drain the pending IOs from the current CPU's > > > node. Once the IOs are drained the CPU can safely unlock a normal > > > memory based lock without fear of the unlock's memory write passing > > > any outstanding IOs from that CPU. > > > > mmiowb needs to have the disclaimer that it's probably wrong if called > > outside a lock, and it's probably wrong if called between two io writes > > (need a regular wmb() in that case). I think some drivers are getting > > this wrong. > > There are situations where mmiowb() can be pressed into service to > some other end, but those are rather rare. The only instance I am > personally familiar with is synchronizing a free-running counter on > a PCI device as closely as possible to the execution of a particular > line of driver code. A write of the new counter value to the device > and subsequent mmiowb() synchronizes that execution point as closely > as practical to the IO write arriving at the device. Not perfect, but > good enough for my purposes. (This was a hack, by the way, pressing > a bit of hardware into a purpose for which it wasn't really designed, > ideally the hardware would have had a better mechanism to accomplish > this goal.) I guess that would be fine. You probably have a slightly better understanding of the issues than the average device driver writer so you could ignore the warnings ;) > But in the normal case, I believe you are 100% correct -- wmb() would > ensure that the memory-mapped IO writes arrive at the chipset in a > particular order, and thus should arrive at the IO hardware in a particular > order. mmiowb() would not necessarily accomplish this goal, and is > incorrectly used wherever that is the intention. At least for SN2. Now I guess it's strictly also needed if you want to ensure cacheable stores and IO stores are visible to the device in the correct order too. I think we'd normally hope wmb() does that for us too (hence all my rambling above). ^ permalink raw reply [flat|nested] 26+ messages in thread
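None of these primitives exist; as a sketch only, the split Nick is proposing might look like this, with io_wmb() ordering MMIO against MMIO and the plain wmb() ordering everything (which on sn2 would then have to include the mmiowb()-style drain):

#include <linux/io.h>

/*
 * Hypothetical API sketch, not in the kernel:
 *
 *   io_wmb():  order prior MMIO writes against later MMIO writes only.
 *   wmb():     order *all* prior stores (cacheable and MMIO) against
 *              all later stores.
 */
#define io_wmb()	wmb()	/* conservative stand-in: order everything */

static void foo_example(void __iomem *regs, u32 *flag)
{
	writel(1, regs + 0x0);
	io_wmb();		/* MMIO vs later MMIO */
	writel(2, regs + 0x4);

	wmb();			/* MMIO and RAM stores vs everything later */
	*flag = 1;		/* e.g. the cacheable store that publishes completion */
}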
* Re: wmb vs mmiowb 2007-08-30 3:36 ` Nick Piggin @ 2007-08-30 19:42 ` Brent Casavant 2007-09-03 20:48 ` Nick Piggin 0 siblings, 1 reply; 26+ messages in thread From: Brent Casavant @ 2007-08-30 19:42 UTC (permalink / raw) To: Nick Piggin; +Cc: linuxppc-dev, linux-ia64 On Thu, 30 Aug 2007, Nick Piggin wrote: > OK, thanks for that. I think I have a rough idea of how they both > work... I was just thinking (hoping) that, although the writel may > not reach the device before the store reaches memory, it would > _appear_ that way from the POV of the device (ie. if the device > were to DMA from mem). But that's probably wishful thinking because > the memory might be on some completely different part of the system. Exactly. Since uncacheable writes cannot by definition take part in a cache-coherency mechanism, they really become their own seperate hierarchy of transactions. > I don't know whether this is exactly a correct implementation of > Linux's barrier semantics. On one hand, wmb _is_ ordering the stores > as they come out of the CPU; on the other, it isn't ordering normal > stores with respect to writel from the POV of the device (which is > seems to be what is expected by the docs and device driver writers). Or, as I think of it, it's not ordering cacheable stores with respect to uncacheable stores from the perspective of other CPUs in the system. That's what's really at the heart of the concern for SN2. > And on the other side, it just doesn't seem so useful just to know > that stores coming out of the CPU are ordered if they can be reordered > by an intermediate. Well, it helps when certain classes of stores need to be ordered with respect to eachother. On SN2, wmb() still ensures that cacheable stores are issued in a particular order, and thus seen by other CPUs in a particular order. That is still important, even when IO devices are not in the mix. > Why even have wmb() at all, if it doesn't actually > order stores to IO and RAM? It orders the class of stores which target RAM. It doesn't order the two seperate classes of stores (RAM and IO) with respect to eachother. mmiowb() when used in conjunction with a lock which serializes access to an IO device ensures that the order of stores to the IO device from different CPUs is well-defined. That's what we're really after here. > powerpc's wmb() could just as well be an > 'eieio' if it were to follow your model; that instruction orders IO, > but not WRT cacheable stores. That would seem to follow the intent of mmiowb() on SN2. I know next to nothing about PowerPC, so I'm not qualified to comment on that. > So you could argue that the chipset is an extention of the CPU's IO/memory > subsystem and should follow the ordering specified by the CPU. I like this > idea because it could make things simpler and more regular for the Linux > barrier model. Sorry, I didn't design the hardware. ;) I believe the problem, for a NUMA system, is that in order to implement what you describe, you would need the chipset to cause all effectively dirty cachelines in the CPU (including those that will become dirty due to previous stores which the CPU hasn't committed from its pipeline yet) to be written back to RAM before the the uncacheable store was allowed to issue from the chipset to the IO fabric. This would occur for every IO store, not just the final store in a related sequence. That would obviously have a significant negative impact on performance. 
> I guess it is too expensive for you to have mmiowb() in every wmb(), > because _most_ of the time, all that's needed is ordering between IOs.

I think it's the other way around. Most of the time all you need is ordering between RAM stores, so mmiowb() would kill performance if it was called every time wmb() was invoked.

> So why not have io_mb(), io_rmb(), io_wmb(), which order IOs but ignore > system memory. Then the non-prefixed primitives order everything (to the > point that wmb() is like mmiowb on sn2).

I'm not sure I follow. Here's the bad sequence we're working with:

    CPU A           CPU B           Lock owner      IO device sees
    -----           -----           ----------      --------------
    ...             ...             unowned
    lock()          ...             CPU A
    writel(val_a)   lock()          ...
    unlock()                        CPU B
    ...             write(val_b)    ...
    ...             unlock()        unowned
    ...             ...             ...             val_b
    ...             ...             ...             val_a

The cacheable store to RAM from CPU A to perform the unlock was not ordered with respect to the uncacheable writel() to the IO device. CPU B, which has a different uncacheable store path to the IO device in the NUMA system, saw the effect of the RAM store before CPU A's uncacheable store arrived at the IO device. CPU B then owned the lock, performed its own uncacheable store to the IO device, and released the lock. The two uncacheable stores are taking different routes to the device, and end up arriving in the wrong order.

mmiowb() solves this by causing the following:

    CPU A           CPU B           Lock owner      IO device sees
    -----           -----           ----------      --------------
    ...             ...             Unowned
    lock()          ...             CPU A
    writel(val_a)   lock()          ...
    mmiowb()        ...                             val_a
    unlock()                        CPU B
    ...             write(val_b)    ...
    ...             mmiowb()        ...             val_b
    ...             unlock()        unowned

The mmiowb() caused the IO device to see the uncacheable store from CPU A before CPU B saw the cacheable store from CPU A. Now all is well with the world.

I might be exhausting your patience, but this is the key. mmiowb() causes the IO fabric to see the effects of an uncacheable store before other CPUs see the effects of a subsequent cacheable store. That's what's really at the heart of the matter.

> Now I guess it's strictly also needed if you want to ensure cacheable > stores and IO stores are visible to the device in the correct order > too. I think we'd normally hope wmb() does that for us too (hence all > my rambling above).

There's really three perspectives to consider, not just the CPU and IO device:

 1. CPU A performing locking and issuing IO stores.
 2. The IO device receiving stores.
 3. CPU B performing locking and issuing IO stores.

The lock ensures that the IO device sees stores from a single CPU at a time. wmb() ensures that CPU A and CPU B see the effect of cacheable stores in the same order as each other. mmiowb() ensures that the IO device has seen all the uncacheable stores from CPU A before CPU B sees the cacheable stores from CPU A.

Wow. I like that last paragraph. I think I'll send now...

Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-30 19:42 ` Brent Casavant @ 2007-09-03 20:48 ` Nick Piggin 0 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-09-03 20:48 UTC (permalink / raw) To: Brent Casavant; +Cc: linuxppc-dev, Linus Torvalds, linux-ia64, Anton Blanchard On Thu, Aug 30, 2007 at 02:42:41PM -0500, Brent Casavant wrote: > On Thu, 30 Aug 2007, Nick Piggin wrote: > > > I don't know whether this is exactly a correct implementation of > > Linux's barrier semantics. On one hand, wmb _is_ ordering the stores > > as they come out of the CPU; on the other, it isn't ordering normal > > stores with respect to writel from the POV of the device (which is > > seems to be what is expected by the docs and device driver writers). > > Or, as I think of it, it's not ordering cacheable stores with respect > to uncacheable stores from the perspective of other CPUs in the system. > That's what's really at the heart of the concern for SN2. AFAIKS, the issue is simply that it is not ordering cacheable stores with respect to uncacheable stores from a _single_ CPU. I'll elaborate further down. > > And on the other side, it just doesn't seem so useful just to know > > that stores coming out of the CPU are ordered if they can be reordered > > by an intermediate. > > Well, it helps when certain classes of stores need to be ordered with > respect to eachother. On SN2, wmb() still ensures that cacheable stores > are issued in a particular order, and thus seen by other CPUs in a > particular order. That is still important, even when IO devices are not > in the mix. Well, we have smp_wmb() for that. > > Why even have wmb() at all, if it doesn't actually > > order stores to IO and RAM? > > It orders the class of stores which target RAM. It doesn't order the > two seperate classes of stores (RAM and IO) with respect to eachother. wmb() *really* is supposed to order all stores. As far as I gather, devices often need it for something like this: *dma_buffer = blah; wmb(); writel(START_DMA, iomem); One problem for sn2 seems to be that wmb is called like 500 times in drivers/ and would be really heavy to turn it into mmiowb. On the other hand, I really don't like how it's just gone and said "oh the normal Linux semantics are too hard, so make wmb() mean something slightly different, and add a totally new mmiowb() concept". Device driver writers already get barriers totally wrong. mmiowb is being completely misused already (and probably wmb too). > mmiowb() when used in conjunction with a lock which serializes access > to an IO device ensures that the order of stores to the IO device from > different CPUs is well-defined. That's what we're really after here. But if we're pragmatic, we could say that stores which are sitting in the CPU's chipset where they can potentially be reordered, can still _conceptually_ be considered to be in some kind of store queue of the CPU. This would mean that wmb() does have to order these WRT cacheable stores coming from a single CPU. And once you do that, sn2 will _also_ do the right thing with multiple CPUs. > > I guess it is too expensive for you to have mmiowb() in every wmb(), > > because _most_ of the time, all that's needed is ordering between IOs. > > I think it's the other way around. Most of the time all you need is > ordering between RAM stores, so mmiowb() would kill performance if it > was called every time wmb() was invoked. No, we have smp_wmb() for that. > > So why not have io_mb(), io_rmb(), io_wmb(), which order IOs but ignore > > system memory. 
Then the non-prefixed primitives order everything (to the > > point that wmb() is like mmiowb on sn2). > > I'm not sure I follow. Here's the bad sequence we're working with: > > CPU A CPU B Lock owner IO device sees > ----- ----- ---------- -------------- > ... ... unowned > lock() ... CPU A > writel(val_a) lock() ... > unlock() CPU B > ... write(val_b) ... > ... unlock() unowned > ... ... ... val_b > ... ... ... val_a > > > The cacheable store to RAM from CPU A to perform the unlock was > not ordered with respect to the uncacheable writel() to the IO device. > CPU B, which has a different uncacheable store path to the IO device > in the NUMA system, saw the effect of the RAM store before CPU A's > uncacheable store arrived at the IO device. CPU B then owned the > lock, performed its own uncacheable store to the IO device, and > released the lock. The two uncacheable stores are taking different > routes to the device, and end up arriving in the wrong order. > > mmiowb() solves this by causing the following: > > CPU A CPU B Lock owner IO device sees > ----- ----- ---------- -------------- > ... ... Unowned > lock() ... CPU A > writel(val_a) lock() ... > mmiowb() ... val_a > unlock() CPU B > ... write(val_b) ... > ... mmiowb() ... val_b > ... unlock() unowned > > The mmiowb() caused the IO device to see the uncacheable store from > CPU A before CPU B saw the cacheable store from CPU A. Now all is > well with the world. > > I might be exhausting your patience, but this is the key. mmiowb() > causes the IO fabric to see the effects of an uncacheable store > before other CPUs see the effects of a subsequent cacheable store. > That's what's really at the heart of the matter. Yes, I like this, and this is what wmb() should do :) That's what Linux expects it to do. > > Now I guess it's strictly also needed if you want to ensure cacheable > > stores and IO stores are visible to the device in the correct order > > too. I think we'd normally hope wmb() does that for us too (hence all > > my rambling above). > > There's really three perspectives to consider, not just the CPU and IO > device: > > 1. CPU A performing locking and issuing IO stores. > 2. The IO device receiving stores. > 3. CPU B performing locking and issuing IO stores. > > The lock ensures that the IO device sees stores from a single CPU > at a time. wmb() ensures that CPU A and CPU B see the effect > of cacheable stores in the same order as eachother. mmiowb() > ensures that the IO device has seen all the uncacheable stores from > CPU A before CPU B sees the cacheable stores from CPU A. > > Wow. I like that last paragraph. I think I'll send now... OK, now we _could_ consider the path to the IO device to be a 3rd party that can reorder the IOs, but I'm coming to think that such a concept need not be added if we instead consider that the reordering portion is still part of the originating CPU and thus subject to a wmb(). I was talking with Linus about this today, and he might have had an opinion. He didn't like my io_wmb() idea, but instead thinks that _every_ IO operation should be ordered WRT one another (eg. get rid of the fancy __relaxed ones). That's fine, and once you do that, you can get rid of lots of wmb(), and wmb() remains just for the places where you want to order cacheable and uncacheable stores. And now that wmb() is called much less often, you can define it to actually match the expected Linux model. 
I'm really not just trying to cause trouble here ;) The ordering details of IO and IO/memory seem to be a mess -- it is defined differently for different architectures, barriers are doing different things, *writel* etc. functions have different ordering rules depending on the arch, etc. ^ permalink raw reply [flat|nested] 26+ messages in thread
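Putting the two orderings discussed in this thread into a single function, under the existing API and with an invented device layout: wmb() orders the descriptor stores in RAM against the doorbell writel(), and mmiowb() orders the doorbell against the lock release.

#include <linux/spinlock.h>
#include <linux/io.h>

struct foo_desc {		/* descriptor in coherent RAM (invented) */
	u32	addr;
	u32	len;
};

static void foo_start_dma(struct foo_desc *desc, void __iomem *regs,
			  spinlock_t *lock, u32 addr, u32 len)
{
	spin_lock(lock);

	/* 1. Fill in the descriptor in coherent RAM. */
	desc->addr = addr;
	desc->len  = len;

	/*
	 * 2. wmb(): the descriptor must be visible to the device before
	 *    the doorbell write tells it to start fetching.
	 */
	wmb();
	writel(1, regs + 0x8);		/* "start DMA" doorbell (invented) */

	/*
	 * 3. mmiowb(): the doorbell must reach the fabric before another
	 *    CPU can take the lock and ring its own doorbell.
	 */
	mmiowb();
	spin_unlock(lock);
}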
* Re: wmb vs mmiowb 2007-08-23 16:16 ` Linus Torvalds 2007-08-23 16:27 ` Benjamin Herrenschmidt @ 2007-08-24 2:59 ` Nick Piggin 1 sibling, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-24 2:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Thu, Aug 23, 2007 at 09:16:42AM -0700, Linus Torvalds wrote: > > > On Thu, 23 Aug 2007, Nick Piggin wrote: > > > > Also, FWIW, there are some advantages of deferring the mmiowb thingy > > until the point of unlock. > > And that is exactly what ppc64 does. > > But you're missing a big point: for 99.9% of all hardware, mmiowb() is a > total no-op. So when you talk about "advantages", you're not talking about > any *real* advantage, are you? You're in a feisty mood today ;) I guess on the 0.1% of hradware where it is not a noop, there might be a real advantage... but that was just handwaving anyway. My real point was that I'd like things to be more easily understandable. I think we are agreed at this point that mmiowb without some form of CPU synchronisation is a bug, and it is also not of the same type of barrier that we normally think about in the kernel (it could be like a MPI style rendezvous barrier between the CPU and the IO fabric). Anyway, point is that device drivers seem to have enough on their plate already. Look at bcm43xx, for example. Most of this guy's mmiowb()s are completely wrong and should be wmb(). mmiowb() is only a wmb() on ppc because as I said, ppc's spin_unlock does not order IOs like most other architectures. On alpha, for example, spin_unlock does order IOs, so mmiowb is a noop, and this is broken (non-sn2 ia64 should also be a noop here, because their unlock orders IOs, but it seems that mmiowb semantics are so non obvious that they either got it wrong themselves, or assumed device drivers surely would). ^ permalink raw reply [flat|nested] 26+ messages in thread
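The misuse Nick describes looks roughly like the first function below; the second shows the usage the thread agrees is intended. Device registers and names are illustrative.

#include <linux/spinlock.h>
#include <linux/io.h>

/*
 * Suspect: mmiowb() used as if it ordered one MMIO write against the
 * next from the same CPU.  Per this thread, that is wmb()'s job, and on
 * platforms where spin_unlock() already orders IO, mmiowb() is a no-op,
 * so it provides nothing here at all.
 */
static void foo_wrong(void __iomem *regs)
{
	writel(0x1, regs + 0x0);
	mmiowb();			/* not what mmiowb() is for */
	writel(0x2, regs + 0x4);
}

/*
 * Intended: mmiowb() just before dropping the lock that serializes
 * access to the device.
 */
static void foo_right(void __iomem *regs, spinlock_t *lock)
{
	spin_lock(lock);
	writel(0x1, regs + 0x0);
	writel(0x2, regs + 0x4);
	mmiowb();
	spin_unlock(lock);
}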
* Re: wmb vs mmiowb 2007-08-23 2:20 ` Nick Piggin 2007-08-23 2:57 ` Linus Torvalds @ 2007-08-23 17:02 ` Jesse Barnes 1 sibling, 0 replies; 26+ messages in thread From: Jesse Barnes @ 2007-08-23 17:02 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-ia64, Linus Torvalds, linuxppc-dev > > Yeah, they keep threatening to use this instead, but I'm not sure > > how easy it would be. Also they may have more devices/drivers to > > worry about than sn2, so maybe changing over would mean too much > > driver debugging (well auditing really since it's not that hard to > > know where to put them). Irix actually had an io_unlock() routine > > that did this implicitly, but iirc that was shot down for Linux... > > Why was it shot down? Seems like a pretty good idea to me ;) Well, like Linus said, it had some significant downsides (though I think Irix had fewer lock types, so the multiplicative effect wasn't so bad there). > I'm clueless when it comes to drivers, but I see a lot of mmiowb() > that are not paired with spin_unlock. How are these obvious? (ie. > what is the pattern?) It looks like some might be lockless FIFOs (or > maybe I'm just not aware of where the locks are). Can you just > quickly illustrate the problem being solved? Wow, it certainly has proliferated since it was added to the tree. :) I didn't audit all the uses, but it seems like many of them get it right, i.e. mmiowb() before spin_unlock() where PIO has been done. I'd have to look carefully to see whether lockless usages are correct, it's likely they're not. Jesse ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-22 18:07 ` Linus Torvalds 2007-08-22 19:02 ` Jesse Barnes @ 2007-08-23 1:59 ` Nick Piggin 2007-08-23 7:27 ` Benjamin Herrenschmidt 2 siblings, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-23 1:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-ia64, Jesse Barnes, linuxppc-dev On Wed, Aug 22, 2007 at 11:07:32AM -0700, Linus Torvalds wrote: > > > On Wed, 22 Aug 2007, Nick Piggin wrote: > > > > It took me more than a glance to see what the difference is supposed to be > > between wmb() and mmiowb(). I think especially because mmiowb isn't really > > like a write barrier. > > Well, it is, but it isn't. Not on its own - but together with a "normal" > barrier it is. But it is stronger (or different) to write barrier semantics, because it enforces the order in which a 3rd party (the IO device) sees writes from multiple CPUs. The rest of our barrier concept is based purely on the POV of the single entity executing the barrier. Now it's needed because the IO device is not participating in the same synchronisation logic that the CPUs are, which is why I say it is more like a synchronisation primitive than a barrier primitive. > > wmb is supposed to order all writes coming out of a single CPU, so that's > > pretty simple. > > No. wmb orders all *normal* writes coming out of a single CPU. I'm pretty sure wmb() should order *all* writes, and smp_wmb() is what you're thinking of for ordering regular writes to cacheable memory. > It may not do anything at all for "uncached" IO writes that aren't part of > the cache coherency, and that are handled using totally different queues > (both inside and outside of the CPU)! > > Now, on x86, the CPU actually tends to order IO writes *more* than it > orders any other writes (they are mostly entirely synchronous, unless the > area has been marked as write merging), but at least on PPC, it's the > other way around: without the cache as a serialization entry, you end up > having a totally separate queueu to serialize, and a regular-memory write > barrier does nothing at all to the IO queue. Well PPC AFAIKS doesn't need the special synchronisation semantics of this mmiowb primitive -- the reason it is not a noop is because the API seems to also imply a wmb() (which is fine, and you'd normally want that eg. uncacheable stores must be ordered with the spin_unlock store). It is just implemented with the PPC sync instruction, which just orders all stores coming out of _this_ CPU. Their IO fabric must prevent IOs from being reordered between CPUs if they're executed in a known order (which is what Altix does not prevent). > So think of the IO write queue as something totally asynchronous that has > zero connection to the normal write ordering - and then think of mmiowb() > as a way to *insert* a synchronization point. If wmb (the non _smp one) orders all stores including IO stores, then it should be sufficient to prevent IO writes from leaking out of a critical section. The problem is that the "reader" (the IO device) itself is not coherent. So _synchronisation_ point is right; it is not really a barrier. Basically it says all IO writes issued by this CPU at this point will be seen before any other IO writes issued by any other CPUs subsequently. make_mmio_coherent()? queue_mmio_writes()? (I'd still prefer some kind of acquire/release API that shows why CPU/CPU order matters too, and how it is taken care of). > > It really seems like it is some completely different concept from a > > barrier. 
And it shows, on the platform where it really matters (sn2), where > > the thing actually spins. > > I agree that it probably isn't a "write barrier" per se. Think of it as a > "tie two subsystems together" thing. Yes, in a way it is more like that. Which does fit with my suggestions for a name. > (And it doesn't just matter on sn2. It also matters on powerpc64, although > I think they just set a flag and do the *real* sync in the spin_unlock() > path). > > Side note: the thing that makes "mmiowb()" even more exciting is that it's > not just the CPU, it's the fabric outside the CPU that matters too. That's > why the sn2 needs this - but the powerpc example shows a case where the > ordering requirement actually comes from the CPU itself. Well I think sn2 is the *only* reason it matters. When the ordering requirement is coming from the CPU itself, that *is* just a traditional write barrier (one which orders normal and io writes). The funny things powerpc are doing in spin_unlock/etc. are a different issue. Basically they are just helping along device drivers who get this wrong and assume spinlocks order IOs; our lack of an acquire/release API for IOs... they're just trying to get through this sorry state of affairs without going insane ;) Powerpc is special here because their ordering instructions distinguish between normal and IO, whereas most others don't (including ia64, alpha, etc), so _most_ others do get their IOs ordered by critical sections. This is a different issue to the mmiowb one (but still shows that our APIs could be improved). Why don't we get a nice easy spin_lock_io/spin_unlock_io, which takes care of all the mmiowb and iowrite vs spin unlock problems? (individual IOs within the lock would still need to be ordered as appropriate). Then we could also have a serialize_io()/unserialize_io() that takes care of the same things but can be used when we have something other than a spinlock for ordering CPUs (serialize_io may be a noop, but it is good to ensure people are thinking about how they're excluding other CPUs here -- if other CPUs are not excluded, then any code calling mmiowb is buggy, right?). ^ permalink raw reply [flat|nested] 26+ messages in thread
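The proposed API could be sketched as below. None of these names exist in the kernel; the sketch just layers them on the primitives that do exist, so that the mmiowb()-before-unlock pairing cannot be forgotten.

    #include <linux/spinlock.h>
    #include <linux/io.h>

    /* Sketch only -- not an existing kernel API. */
    #define spin_lock_io(lock)      spin_lock(lock)

    #define spin_unlock_io(lock)            \
            do {                            \
                    mmiowb();               \
                    spin_unlock(lock);      \
            } while (0)

    /*
     * For code that orders CPUs by something other than a spinlock.  The
     * "acquire" side could well be a no-op; it mainly documents that some
     * CPU-CPU exclusion is in place, without which mmiowb() cannot help.
     */
    #define serialize_io()          do { } while (0)
    #define unserialize_io()        mmiowb()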
* Re: wmb vs mmiowb 2007-08-22 18:07 ` Linus Torvalds 2007-08-22 19:02 ` Jesse Barnes 2007-08-23 1:59 ` Nick Piggin @ 2007-08-23 7:27 ` Benjamin Herrenschmidt 2007-08-23 16:56 ` Jesse Barnes 2 siblings, 1 reply; 26+ messages in thread From: Benjamin Herrenschmidt @ 2007-08-23 7:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, linux-ia64, Jesse Barnes, linuxppc-dev > Of course, the normal memory barrier would usually be a "spin_unlock()" or > something like that, not a "wmb()". In fact, I don't think the powerpc > implementation (as an example of this) will actually synchronize with > anything *but* a spin_unlock(). We are even more sneaky in the sense that we set a per-cpu flag on any MMIO write and do the sync automatically in spin_unlock() :-) Ben. ^ permalink raw reply [flat|nested] 26+ messages in thread
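The sneaky scheme described above amounts to something like the sketch below. The names (io_sync_pending, my_writel, my_spin_unlock) are invented for illustration; the literal powerpc code keeps its flag in the paca (so it is per hardware thread and unaffected by preemption) and issues a real "sync".

    #include <linux/types.h>
    #include <linux/percpu.h>
    #include <linux/spinlock.h>
    #include <linux/io.h>
    #include <asm/system.h>

    static DEFINE_PER_CPU(int, io_sync_pending);

    static inline void my_writel(u32 val, void __iomem *addr)
    {
            __raw_writel(val, addr);                /* the MMIO store itself */
            __get_cpu_var(io_sync_pending) = 1;     /* remember we did MMIO  */
    }

    static inline void my_spin_unlock(spinlock_t *lock)
    {
            if (__get_cpu_var(io_sync_pending)) {
                    __get_cpu_var(io_sync_pending) = 0;
                    mb();                           /* stands in for the "sync" */
            }
            spin_unlock(lock);
    }

This way a driver that only takes and releases the lock still gets its MMIO writes ordered against the unlock, whether or not it remembered mmiowb().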
* Re: wmb vs mmiowb 2007-08-23 7:27 ` Benjamin Herrenschmidt @ 2007-08-23 16:56 ` Jesse Barnes 2007-08-24 3:12 ` Nick Piggin 2007-08-28 21:21 ` Brent Casavant 0 siblings, 2 replies; 26+ messages in thread From: Jesse Barnes @ 2007-08-23 16:56 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Nick Piggin, linux-ia64, Linus Torvalds, linuxppc-dev On Thursday, August 23, 2007 12:27 am Benjamin Herrenschmidt wrote: > > Of course, the normal memory barrier would usually be a > > "spin_unlock()" or something like that, not a "wmb()". In fact, I > > don't think the powerpc implementation (as an example of this) will > > actually synchronize with anything *but* a spin_unlock(). > > We are even more sneaky in the sense that we set a per-cpu flag on > any MMIO write and do the sync automatically in spin_unlock() :-) Yeah, that's a reasonable thing to do, and in fact I think there's code to do something similar when a task is switched out (this keeps user level drivers from having to do mmiowb() type things). FWIW, I think I had an earlier version of the patch that used the name pioflush() or something similar, the only confusing thing about that name is that the primitive doesn't actually force I/Os down to the device level, just to the closest bridge. It'll be interesting to see if upcoming x86 designs share this problem (e.g. large HT or CSI topologies). Jesse ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 16:56 ` Jesse Barnes @ 2007-08-24 3:12 ` Nick Piggin 2007-08-28 21:21 ` Brent Casavant 1 sibling, 0 replies; 26+ messages in thread From: Nick Piggin @ 2007-08-24 3:12 UTC (permalink / raw) To: Jesse Barnes; +Cc: linux-ia64, Linus Torvalds, linuxppc-dev On Thu, Aug 23, 2007 at 09:56:16AM -0700, Jesse Barnes wrote: > On Thursday, August 23, 2007 12:27 am Benjamin Herrenschmidt wrote: > > > Of course, the normal memory barrier would usually be a > > > "spin_unlock()" or something like that, not a "wmb()". In fact, I > > > don't think the powerpc implementation (as an example of this) will > > > actually synchronize with anything *but* a spin_unlock(). > > > > We are even more sneaky in the sense that we set a per-cpu flag on > > any MMIO write and do the sync automatically in spin_unlock() :-) > > Yeah, that's a reasonable thing to do, and in fact I think there's code > to do something similar when a task is switched out (this keeps user > level drivers from having to do mmiowb() type things). It might be worth doing that and removing mmiowb completely. Or, if that's too expensive, I'd like to see an API that is more explicitly for keeping IOs inside critical sections. > FWIW, I think I had an earlier version of the patch that used the name > pioflush() or something similar, the only confusing thing about that > name is that the primitive doesn't actually force I/Os down to the > device level, just to the closest bridge. Yeah, that's what I found when trying to think of a name ;) It is like an intermediate-level flush for the platform code, but from the POV of the driver writer, it is nothing of the sort ;) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-23 16:56 ` Jesse Barnes 2007-08-24 3:12 ` Nick Piggin @ 2007-08-28 21:21 ` Brent Casavant 2007-08-28 23:01 ` Peter Chubb 1 sibling, 1 reply; 26+ messages in thread From: Brent Casavant @ 2007-08-28 21:21 UTC (permalink / raw) To: Jesse Barnes; +Cc: Nick Piggin, linux-ia64, Linus Torvalds, linuxppc-dev On Thu, 23 Aug 2007, Jesse Barnes wrote: > On Thursday, August 23, 2007 12:27 am Benjamin Herrenschmidt wrote: > > > Of course, the normal memory barrier would usually be a > > > "spin_unlock()" or something like that, not a "wmb()". In fact, I > > > don't think the powerpc implementation (as an example of this) will > > > actually synchronize with anything *but* a spin_unlock(). > > > > We are even more sneaky in the sense that we set a per-cpu flag on > > any MMIO write and do the sync automatically in spin_unlock() :-) > > Yeah, that's a reasonable thing to do, and in fact I think there's code > to do something similar when a task is switched out (this keeps user > level drivers from having to do mmiowb() type things). Yes there is, git commit e08e6c521355cd33e647b2f739885bc3050eead6. On SN2 any user process performing memory-mapped IO directly to a device needs something like mmiowb() to be performed at the node of the CPU it last ran on when the task context switches onto a new CPU. The current code performs this action for all inter-CPU context switches, but we had discussed the possibility of targeting the action only when the user process has actually mapped a device for IO. I believe it was decided that this level of complexity wasn't warranted unless this simple solution was found to cause a problem. That reminds me. Are the people who are working on the user-level driver effort including a capability similar to mmiowb()? If we had that capability we could eventually do away with the change mentioned above. But that would come after all user-level drivers were coded to include the mmiowb()-like calls, and existing drivers which provide mmap() capability directly to hardware go away. Brent -- Brent Casavant All music is folk music. I ain't bcasavan@sgi.com never heard a horse sing a song. Silicon Graphics, Inc. -- Louis Armstrong ^ permalink raw reply [flat|nested] 26+ messages in thread
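The shape of that context-switch hook, as an illustration only: the helper name and the flag argument below are invented, and this is not the code from the commit cited above. Whether the flag is tracked per task or simply assumed true for every inter-CPU switch is exactly the trade-off discussed in the message.

    #include <linux/io.h>

    static inline void sketch_mmio_switch_out(int task_did_user_mmio)
    {
            if (task_did_user_mmio)
                    mmiowb();       /* drain this CPU's posted PIO writes before
                                     * the task can resume on a different CPU */
    }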
* Re: wmb vs mmiowb 2007-08-28 21:21 ` Brent Casavant @ 2007-08-28 23:01 ` Peter Chubb 0 siblings, 0 replies; 26+ messages in thread From: Peter Chubb @ 2007-08-28 23:01 UTC (permalink / raw) To: Brent Casavant Cc: Nick Piggin, linux-ia64, linuxppc-dev, Jesse Barnes, Linus Torvalds >>>>> "Brent" == Brent Casavant <bcasavan@sgi.com> writes: Brent> That reminds me. Are the people who are working on the Brent> user-level driver effort including a capability similar to Brent> mmiowb()? If we had that capability we could eventually do Brent> away with the change mentioned above. But that would come Brent> after all user-level drivers were coded to include the Brent> mmiowb()-like calls, and existing drivers which provide mmap() Brent> capability directly to hardware go away. Not at present, because the platforms *I'm* mostly targeting at the moment don't need it. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au http://www.ertos.nicta.com.au ERTOS within National ICT Australia ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: wmb vs mmiowb 2007-08-22 4:57 wmb vs mmiowb Nick Piggin 2007-08-22 18:07 ` Linus Torvalds @ 2007-08-23 7:25 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 26+ messages in thread From: Benjamin Herrenschmidt @ 2007-08-23 7:25 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, linux-ia64, Jesse Barnes, linuxppc-dev On Wed, 2007-08-22 at 06:57 +0200, Nick Piggin wrote: > It doesn't seem like this primary function of mmiowb has anything to do > with a write barrier that we are used to (it may have a seconary semantic > of a wmb as well, but let's ignore that for now). A write barrier will > never provide you with those semantics (writes from 2 CPUs seen in the > same order by a 3rd party). If anything, I think it is closer to being > a read barrier issued on behalf of the target device. But even that I > think is not much better, because the target is not participating in the > synchronisation that the CPUs are, so the "read barrier request" could > still arrive at the device out of order WRT the other CPU's writes. > > It really seems like it is some completely different concept from a > barrier. And it shows, on the platform where it really matters (sn2), where > the thing actually spins. The way mmiowb was actually defined to me by the ia64 folks who came up with it is essentially to order an MMIO write with a spin_unlock. Ben. ^ permalink raw reply [flat|nested] 26+ messages in thread